Global ETD Search

1	A Database For Exploratory Analysis of Human Sleep Misra, Shivin Satyawon 26 March 2008 (has links) This thesis focuses on the design, development, and exploratory analysis of a human sleep data repository. We have successfully collected comprehensive data for 1,046 sleep disorder patients and created a Terabyte-scale database system to handle it. The data for each patient was collected from the patient's medical records, and from the patient's allnight sleep study (for a total of about 0.6 Gigabytes per patient). Data collected from the patient's medical record contain more than 70 attributes, including demographic data, smoking, drinking, and exercise habits, depression and daytime sleepiness questionnaires, and overall medical history. Data collected from the patient's all-night sleep study consist of 50-55 time-series signals recorded during a period of 6-8 hours at the hospital's sleep clinic. These signals include among others an electroencephalogram, electromyogram, electrooculogram, electrocardiogram, and signals tracking blood oxygen level, body position, limb movements, snoring and blood pressure. 350 additional attributes summarize sleep related events taking place during the night long study, including sleep stages, arousals, and respiratory disturbances. Particular attention during the development of our database system was paid to a database design that effectively handles the data size and complexity, that describes the structure of sleep data in clinically meaningful terms, and that will facilitates the discovery of patterns in sleep data using machine learning algorithms. We have interfaced our database with Weka, a well known data mining system. To the best of our knowledge, our database is one of the world's largest and most comprehensive in the domain of human sleep disorders. postgres computer science database sleep terabyte data mining weka
2	Parallel Inverted Indices for Large-Scale, Dynamic Digital Libraries Sornil, Ohm 09 February 2001 (has links) The dramatic increase in the amount of content available in digital forms gives rise to large-scale digital libraries, targeted to support millions of users and terabytes of data. Retrieving information from a system of this scale in an efficient manner is a challenging task due to the size of the collection as well as the index. This research deals with the design and implementation of an inverted index that supports searching for information in a large-scale digital library, implemented atop a massively parallel storage system. Inverted index partitioning is studied in a simulation environment, aiming at a terabyte of text. As a result, a high performance partitioning scheme is proposed. It combines the best qualities of the term and document partitioning approaches in a new Hybrid Partitioning Scheme. Simulation experiments show that this organization provides good performance over a wide range of conditions. Further, the issues of creation and incremental updates of the index are considered. A disk-based inversion algorithm and an extensible inverted index architecture are described, and experimental results with actual collections are presented. Finally, distributed algorithms to create a parallel inverted index partitioned according to the hybrid scheme are proposed, and performance is measured on a portion of the equipment that normally makes up the 100 node Virginia Tech PetaPlex™ system. NOTE: (02/2007) An updated copy of this ETD was added after there were patron reports of problems with the file. / Ph. D. Simulation incremental update information retrieval parallel inverted index hybrid partitioning Performance digital library terabyte text collection
3	Data mining in large sets of complex data / Mineração de dados em grande conjuntos de dados complexos Cordeiro, Robson Leonardo Ferreira 29 August 2011 (has links) Due to the increasing amount and complexity of the data stored in the enterprises\' databases, the task of knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs that come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element in each tuple. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated to handling large amounts of these complex data increase considerably, making most of the existing techniques impractical. Therefore, especial data mining techniques for this kind of data need to be developed. This Ph.D. work focuses on the development of new data mining techniques for large sets of complex data, especially for the task of clustering, tightly associated to other data mining tasks that are performed together. Specifically, this Doctoral dissertation presents three novel, fast and scalable data mining algorithms well-suited to analyze large sets of complex data: the method Halite for correlation clustering; the method BoW for clustering Terabyte-scale datasets; and the method QMAS for labeling and summarization. Our algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases. The real data used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! and also on the graph with all users and their connections from the Twitter social network. Such results indicate that our algorithms allow the development of real time applications that, potentially, could not be developed without this Ph.D. work, like a software to aid on the fly the diagnosis process in a worldwide Healthcare Information System, or a system to look for deforestation within the Amazon Rainforest in real time / O crescimento em quantidade e complexidade dos dados armazenados nas organizações torna a extração de conhecimento utilizando técnicas de mineração uma tarefa ao mesmo tempo fundamental para aproveitar bem esses dados na tomada de decisões estratégicas e de alto custo computacional. O custo vem da necessidade de se explorar uma grande quantidade de casos de estudo, em diferentes combinações, para se obter o conhecimento desejado. Tradicionalmente, os dados a explorar são representados como atributos numéricos ou categóricos em uma tabela, que descreve em cada tupla um caso de teste do conjunto sob análise. Embora as mesmas tarefas desenvolvidas para dados tradicionais sejam também necessárias para dados mais complexos, como imagens, grafos, áudio e textos longos, a complexidade das análises e o custo computacional envolvidos aumentam significativamente, inviabilizando a maioria das técnicas de análise atuais quando aplicadas a grandes quantidades desses dados complexos. Assim, técnicas de mineração especiais devem ser desenvolvidas. Este Trabalho de Doutorado visa a criação de novas técnicas de mineração para grandes bases de dados complexos. Especificamente, foram desenvolvidas duas novas técnicas de agrupamento e uma nova técnica de rotulação e sumarização que são rápidas, escaláveis e bem adequadas à análise de grandes bases de dados complexos. As técnicas propostas foram avaliadas para a análise de bases de dados reais, em escala de Terabytes de dados, contendo até bilhões de objetos complexos, e elas sempre apresentaram resultados de alta qualidade, sendo em quase todos os casos pelo menos uma ordem de magnitude mais rápidas do que os trabalhos relacionados mais eficientes. Os dados reais utilizados vêm das seguintes aplicações: diagnóstico automático de câncer de mama, análise de imagens de satélites, e mineração de grafos aplicada a um grande grafo da web coletado pelo Yahoo! e também a um grafo com todos os usuários da rede social Twitter e suas conexões. Tais resultados indicam que nossos algoritmos permitem a criação de aplicações em tempo real que, potencialmente, não poderiam ser desenvolvidas sem a existência deste Trabalho de Doutorado, como por exemplo, um sistema em escala global para o auxílio ao diagnóstico médico em tempo real, ou um sistema para a busca por áreas de desmatamento na Floresta Amazônica em tempo real Agrupamento de correlação Correlation clustering Dados de média à alta dimensionalidade Labeling and summarization MapReduce MapReduce Moderante-to-high dimensionality data Rotulação e sumarização Terabyte-scale data mining
4	Data mining in large sets of complex data / Mineração de dados em grande conjuntos de dados complexos Robson Leonardo Ferreira Cordeiro 29 August 2011 (has links) Due to the increasing amount and complexity of the data stored in the enterprises\' databases, the task of knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs that come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element in each tuple. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated to handling large amounts of these complex data increase considerably, making most of the existing techniques impractical. Therefore, especial data mining techniques for this kind of data need to be developed. This Ph.D. work focuses on the development of new data mining techniques for large sets of complex data, especially for the task of clustering, tightly associated to other data mining tasks that are performed together. Specifically, this Doctoral dissertation presents three novel, fast and scalable data mining algorithms well-suited to analyze large sets of complex data: the method Halite for correlation clustering; the method BoW for clustering Terabyte-scale datasets; and the method QMAS for labeling and summarization. Our algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases. The real data used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! and also on the graph with all users and their connections from the Twitter social network. Such results indicate that our algorithms allow the development of real time applications that, potentially, could not be developed without this Ph.D. work, like a software to aid on the fly the diagnosis process in a worldwide Healthcare Information System, or a system to look for deforestation within the Amazon Rainforest in real time / O crescimento em quantidade e complexidade dos dados armazenados nas organizações torna a extração de conhecimento utilizando técnicas de mineração uma tarefa ao mesmo tempo fundamental para aproveitar bem esses dados na tomada de decisões estratégicas e de alto custo computacional. O custo vem da necessidade de se explorar uma grande quantidade de casos de estudo, em diferentes combinações, para se obter o conhecimento desejado. Tradicionalmente, os dados a explorar são representados como atributos numéricos ou categóricos em uma tabela, que descreve em cada tupla um caso de teste do conjunto sob análise. Embora as mesmas tarefas desenvolvidas para dados tradicionais sejam também necessárias para dados mais complexos, como imagens, grafos, áudio e textos longos, a complexidade das análises e o custo computacional envolvidos aumentam significativamente, inviabilizando a maioria das técnicas de análise atuais quando aplicadas a grandes quantidades desses dados complexos. Assim, técnicas de mineração especiais devem ser desenvolvidas. Este Trabalho de Doutorado visa a criação de novas técnicas de mineração para grandes bases de dados complexos. Especificamente, foram desenvolvidas duas novas técnicas de agrupamento e uma nova técnica de rotulação e sumarização que são rápidas, escaláveis e bem adequadas à análise de grandes bases de dados complexos. As técnicas propostas foram avaliadas para a análise de bases de dados reais, em escala de Terabytes de dados, contendo até bilhões de objetos complexos, e elas sempre apresentaram resultados de alta qualidade, sendo em quase todos os casos pelo menos uma ordem de magnitude mais rápidas do que os trabalhos relacionados mais eficientes. Os dados reais utilizados vêm das seguintes aplicações: diagnóstico automático de câncer de mama, análise de imagens de satélites, e mineração de grafos aplicada a um grande grafo da web coletado pelo Yahoo! e também a um grafo com todos os usuários da rede social Twitter e suas conexões. Tais resultados indicam que nossos algoritmos permitem a criação de aplicações em tempo real que, potencialmente, não poderiam ser desenvolvidas sem a existência deste Trabalho de Doutorado, como por exemplo, um sistema em escala global para o auxílio ao diagnóstico médico em tempo real, ou um sistema para a busca por áreas de desmatamento na Floresta Amazônica em tempo real Agrupamento de correlação Dados de média à alta dimensionalidade MapReduce Rotulação e sumarização Correlation clustering Labeling and summarization MapReduce Moderante-to-high dimensionality data Terabyte-scale data mining

1

Page generated in 0.0361 seconds