21 |
Computer-Enhanced Knowledge Discovery in Environmental Science. Fukuda, Kyoko. January 2009.
Encouraging the use of computer algorithms in environmental science, by developing new algorithms and introducing little-known ones, is a significant contribution: it provides knowledge discovery tools that extract new aspects of results and yield insights beyond those available from general statistical methods. Analysis conducted with appropriately chosen methods, judged on quality of performance and results, computation time, flexibility, and applicability to data of various natures, supports decision making in the policy development and management process for environmental studies.

This thesis has three fundamental aims and motivations. Firstly, to develop a flexibly applicable attribute selection method, Tree Node Selection (TNS), and a decision tree assessment tool, Tree Node Selection for assessing decision tree structure (TNS-A), both of which use decision trees pre-generated by the widely used C4.5 decision tree algorithm as their information source to identify important attributes in data. TNS supports cost-effective and efficient data collection and policy making by selecting fewer, but important, attributes, and TNS-A provides a tool for assessing decision tree structure to extract information on the relationship between attributes and decisions. Secondly, to introduce new, theoretical, or little-known computer algorithms, such as the K-Maximum Subarray Algorithm (K-MSA) and Ant-Miner, adjusting and maximizing their applicability and practicality for environmental science problems to bring new insights. Additionally, Singular Spectrum Analysis (SSA), an advanced statistical and mathematical method, is demonstrated as a data pre-processing step that helps improve C4.5 results on noisy measurements. Thirdly, to promote, encourage, and motivate environmental scientists to use the ideas and methods developed in this thesis.
The methods were tested on benchmark data and on a variety of real environmental science problems: sea container contamination, the Weed Risk Assessment model and weed spatial analysis for New Zealand Biosecurity, air pollution, climate and health, and defoliation imagery. The outcome of this thesis is to introduce the concepts and techniques of data mining, the process of knowledge discovery from databases, to environmental science researchers in New Zealand and overseas, and, through collaboration on future research and future policy and management, to help maintain and sustain a healthy environment.
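For K = 1, the K-Maximum Subarray Algorithm (K-MSA) mentioned above reduces to the classic maximum subarray problem, solvable with Kadane's O(n) scan. The sketch below shows that base case only; the thesis's generalization to the K best subarrays is not reproduced, and the example data are illustrative.

```python
def max_subarray(xs):
    """Return (best_sum, start, end) for the maximum-sum contiguous
    subarray of xs (end inclusive). Classic O(n) Kadane scan."""
    best_sum, best_start, best_end = xs[0], 0, 0
    cur_sum, cur_start = xs[0], 0
    for i in range(1, len(xs)):
        if cur_sum < 0:            # a negative prefix never helps: restart here
            cur_sum, cur_start = xs[i], i
        else:
            cur_sum += xs[i]
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i
    return best_sum, best_start, best_end

# Illustrative use: in a series of daily pollutant anomalies, the maximum
# subarray marks the most intense sustained episode.
anomalies = [-2, 1, 3, -1, 4, -6, 2]
print(max_subarray(anomalies))  # (7, 1, 4)
```

In an environmental setting, the K best subarrays would surface the K strongest episodes rather than only the single strongest one.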
|
22 |
Patient-Centered and Experience-Aware Mining for Effective Information Discovery in Health Forums. January 2016.
abstract: Online health forums provide a convenient channel for patients, caregivers, and medical professionals to share their experience, support and encourage each other, and form health communities. The fast-growing content in health forums provides a large repository in which people can seek valuable information. A forum user can issue a keyword query to search health forums for specific questions, e.g., what treatments are effective for a disease symptom? A medical researcher can discover medical knowledge in a timely and large-scale fashion by automatically aggregating the latest evidence emerging in health forums.
This dissertation studies how to effectively discover information in health forums. Several challenges have been identified. First, existing work relies on syntactic information units, such as a sentence, a post, or a thread, to bind different pieces of information in a forum. However, most information discovery tasks should be based on a semantic information unit: the patient. For instance, given a keyword query that involves the relationship between a treatment and side effects, the matched keywords are expected to refer to the same patient. In this work, patient-centered mining is proposed to mine patient semantic information units. In a patient information unit, health information such as diseases, symptoms, treatments, and effects is connected through the corresponding patient.
Second, the information published in health forums has varying degrees of quality. Some information includes patient-reported personal health experience, while other content may be hearsay. In this work, a context-aware experience extraction framework is proposed to mine patient-reported personal health experience, which can be used for evidence-based knowledge discovery or for finding patients with similar experience.
Finally, the proposed patient-centered and experience-aware mining framework is used to build a patient health information database for effectively discovering adverse drug reactions (ADRs) from health forums. ADRs have become a serious health problem and even a leading cause of death in the United States. Health forums provide valuable evidence at large scale and in a timely fashion through the active participation of patients, caregivers, and doctors. Empirical evaluation shows the effectiveness of the proposed approach. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2016
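The patient-centered matching idea above can be sketched very roughly: group posts into per-patient units (here, crudely, by author name) and require all query keywords to co-occur within one unit rather than within one post. All names and data below are hypothetical; the dissertation's actual unit construction is far richer.

```python
from collections import defaultdict

def build_patient_units(posts):
    """Group forum posts by author, a crude proxy for patient
    semantic information units (hypothetical simplification)."""
    units = defaultdict(list)
    for author, text in posts:
        units[author].append(text.lower())
    return units

def patient_centered_match(units, keywords):
    """Return authors whose combined posts contain every query keyword,
    so matched keywords refer to the same patient, not the same post."""
    keywords = [k.lower() for k in keywords]
    return sorted(
        author for author, texts in units.items()
        if all(any(k in t for t in texts) for k in keywords)
    )

posts = [
    ("ann", "Started metformin last month."),
    ("ann", "Since then I have had nausea."),
    ("bob", "My doctor mentioned metformin."),
]
units = build_patient_units(posts)
print(patient_centered_match(units, ["metformin", "nausea"]))  # ['ann']
```

A post-level search would miss "ann" here, since no single post mentions both the treatment and the effect; the patient unit recovers the link.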
|
23 |
Identifying the factors that affect the severity of vehicular crashes by driver age. Tollefson, John Dietrich. 01 December 2016.
Vehicular crashes are the leading cause of death for young adult drivers; however, very little life course research focuses on drivers in their 20s. Moreover, most analyses of crash data are limited to simple correlation and regression analysis. This thesis proposes a data-driven approach that uses machine-learning techniques to enhance the quality of analysis.
We examine over 10 years of data from the Iowa Department of Transportation, transforming it into a format suitable for analysis. The ages of the drivers present in each crash are then discretized into groups. In doing this, we hope to better discover the relationship between driver age and the factors present in a given crash.
We use machine learning algorithms to determine the important attributes for each age group, with the goal of improving the predictive power of individual methods. The thesis follows a knowledge discovery workflow: preprocessing and transforming the data into a usable state, then mining it to discover results and produce knowledge.
We hope to use this knowledge to improve predictions for the different driver age groups, using around 60 variables for most data sets and 10 variables for some. We also suggest future directions in which this data could be analyzed.
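The age discretization step described above might be sketched as follows. The cut points and record fields here are illustrative only, not the groupings or schema actually used in the thesis.

```python
def age_group(age):
    """Discretize driver age into coarse bands (illustrative cut
    points, not the thesis's actual groupings)."""
    if age < 20:
        return "teen"
    if age < 30:
        return "young_adult"
    if age < 65:
        return "adult"
    return "senior"

def crashes_by_group(crashes):
    """Partition crash records by driver age group so that attribute
    importance can later be assessed separately per group."""
    groups = {}
    for crash in crashes:
        groups.setdefault(age_group(crash["driver_age"]), []).append(crash)
    return groups

crashes = [
    {"driver_age": 17, "severity": 3},
    {"driver_age": 24, "severity": 2},
    {"driver_age": 71, "severity": 4},
]
parts = crashes_by_group(crashes)
print(sorted(parts))  # ['senior', 'teen', 'young_adult']
```

Each partition would then be fed to a separate model, letting attribute importances differ by age group instead of being averaged across all drivers.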
|
24 |
Exploring Massive Volunteered Geographic Information for Geographic Knowledge Discovery. Tao, Jia. January 2010.
Conventionally, geographic data produced and disseminated by national mapping agencies are used for studying various urban issues. These data are not always available or accessible, and they are also criticized for being expensive. This is changing with the rise of Volunteered Geographic Information (VGI). VGI, a form of user-generated content, is geographic data collected and disseminated by individuals on a voluntary basis. A huge amount of geographic data has already been collected thanks to a growing number of contributors and volunteers, and, more importantly, it is free and accessible to anyone. VGI comes in many forms, such as Wikimapia, Flickr, GeoNames and OpenStreetMap (OSM). OSM is a mapping project built by volunteers through wiki-like collaboration, which aims to create a free, editable map of the entire world. This thesis adopts OSM as its main data source to uncover hidden patterns in urban systems.

We investigated fundamental issues such as the city rank-size law and the measurement of urban sprawl, which are conventionally studied using census or satellite imagery data. We define the concept of natural cities in order to assess city size distribution. Natural cities are generated in a bottom-up manner via the agglomeration of individual street nodes. This clustering process depends on one parameter, the clustering resolution; different resolutions derive different levels of natural cities. In this respect, natural cities show little bias compared with city boundaries imposed by a census bureau or extracted from satellite imagery. Based on this investigation, we made two findings about rank-size distributions. The first is that natural cities in the US strictly follow Zipf's law regardless of the clustering resolution, in contrast to other studies that investigate only a few of the largest cities.
The second is that Zipf's law is not universal at the state level: it does not hold for the natural cities within individual states. The thesis then turns to detecting urban sprawl based on natural cities. Urban sprawl devours large amounts of open space each year and subsequently leads to many environmental problems. To curb sprawl with proper policies, a major problem is how to measure it objectively. In this thesis, a new approach to measuring urban sprawl is proposed based on street nodes, building on the fact that street nodes are significantly correlated with population in cities. Specifically, street nodes are reported to have a linear relationship with city sizes, with a correlation coefficient of up to 0.97. This linear regression line, called the sprawl ruler, can partition all cities into sprawling, compact and normal cities. The study verifies this approach with US census data and US natural cities, and then applies it to three European countries, France, Germany and the UK, categorizing all their natural cities into the three classes. This categorization provides new insight into sprawl detection and sets a uniform standard for comparing sprawl levels across an entire country.
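The rank-size test behind the Zipf's law findings above is commonly done as a log-log regression of size against rank, where a slope close to -1 indicates Zipf's law. A minimal sketch, using synthetic data rather than the thesis's natural-city sizes:

```python
import math

def zipf_exponent(sizes):
    """Fit log(size) = a + b*log(rank) by ordinary least squares and
    return the slope b; Zipf's law corresponds to b near -1."""
    sizes = sorted(sizes, reverse=True)           # rank 1 = largest city
    xs = [math.log(r) for r in range(1, len(sizes) + 1)]
    ys = [math.log(s) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Synthetic city sizes generated exactly by Zipf's law (size proportional
# to 1/rank), so the fitted exponent should be -1.
cities = [1_000_000 / r for r in range(1, 101)]
print(round(zipf_exponent(cities), 3))  # -1.0
```

Applied per state, a fitted exponent far from -1 would echo the thesis's second finding that Zipf's law breaks down below the national level.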
|
25 |
Ontology-based Feature Construction on Non-structured Data. Ni, Weizeng. 10 September 2015.
No description available.
|
26 |
Learning from a Genetic Algorithm with Inductive Logic Programming. Gandhi, Sachin. 17 October 2005.
No description available.
|
27 |
Empirické porovnání systémů dobývání znalostí z databází / Empirical Comparison of Knowledge Discovery in Databases Systems. Dopitová, Kateřina. January 2010.
This diploma thesis presents an empirical comparison of knowledge discovery in databases (KDD) systems. Basic terms and methods of the KDD domain are defined, and the criteria used for system comparison are established. The tested software products are also briefly described. For each system, the results of processing a real task are reported. Within the framework of the thesis, the individual systems are compared according to the established criteria, including a comparison of the competitiveness of commercial and non-commercial KDD systems.
|
28 |
CSI in the Web 2.0 Age: Data Collection, Selection, and Investigation for Knowledge Discovery. Fu, Tianjun. January 2011.
The growing popularity of various Web 2.0 media has created massive amounts of user-generated content such as online reviews, blog articles, shared videos, forum threads, and wiki pages. Such content provides insight into web users' preferences and opinions, online communities, knowledge generation, etc., and presents opportunities for many knowledge discovery problems. However, several challenges need to be addressed: the data collection procedure has to deal with the unique characteristics and structures of the various Web 2.0 media; advanced data selection methods are required to identify the data relevant to a specific knowledge discovery problem; and interactions between Web 2.0 users, which are often embedded in user-generated content, need effective methods to identify, model, and analyze them. In this dissertation, I address these challenges through three types of knowledge discovery tasks: (data) collection, selection, and investigation. Organized in this "CSI" framework, five studies explore and propose solutions to these tasks for particular Web 2.0 media.

In Chapter 2, I study focused and hidden web crawlers and propose a novel crawling system for Dark Web forums that addresses several issues unique to hidden web data collection. In Chapter 3, I explore the use of both topical and sentiment information in web crawling. This information is also used to label nodes in web graphs, which a graph-based tunneling mechanism employs to improve collection recall. Chapter 4 extends the work of Chapter 3 by exploring other graph comparison techniques for tunneling in focused crawlers; a subtree-based tunneling method that scales to large graphs is proposed and evaluated. Chapter 5 examines the usefulness of user-generated content in online video classification.
Three types of text features are extracted from the collected user-generated content and used by several feature-based classification techniques to demonstrate the effectiveness of the proposed text-based video classification framework. Chapter 6 presents an algorithm that identifies forum user interactions and shows how they can be used for knowledge discovery. The algorithm utilizes a bevy of system and linguistic features and adopts several similarity-based methods to account for interactional idiosyncrasies.
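The similarity-based interaction identification of Chapter 6 might be sketched, in a greatly simplified form, as linking each post to the earlier post it most resembles. Only token-set Jaccard similarity is used here; the dissertation's system features (quotes, reply markers) and linguistic features are not reproduced, and the threshold is an arbitrary illustrative choice.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two posts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def link_interactions(thread, threshold=0.2):
    """Link each post (by index) to the earlier post it most resembles,
    a crude stand-in for similarity-based interaction identification."""
    links = []
    for i, post in enumerate(thread[1:], start=1):
        best_score, best_j = max(
            (jaccard(post, earlier), j) for j, earlier in enumerate(thread[:i])
        )
        if best_score >= threshold:
            links.append((i, best_j))
    return links

thread = [
    "does anyone know a fix for the login error",
    "try clearing the cache first",
    "clearing the cache fixed the login error thanks",
]
print(link_interactions(thread))  # [(2, 1)]
```

The resulting (replier, replied-to) pairs form an interaction graph, the structure the dissertation mines for knowledge discovery.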
|
29 |
"Desenvolvimento de um Framework para Análise Visual de Informações Suportando Data Mining" / "Development of a Framework for Visual Analysis of Information with Data Mining Support". Rodrigues Junior, Jose Fernando. 22 July 2003.
This Masters dissertation brings together contributions from numerous works in the fields of Databases, Knowledge Discovery in Databases, Data Mining, and computer-based Information Visualization, which together structure its research theme: Information Visualization. The relevant theory is reviewed and related in order to support the theoretical and practical conclusions reported in the work. Grounded in this theory, the work makes several contributions to the science of Information Visualization, presented both as formal proposals developed throughout the text and as practical results in the form of software for the visual exploration of information. The ideas presented are based on the visual display of basic statistical, frequency (Frequency Plot), and relevance (Relevance Plot) numeric analyses. Contributions to the FastMapDB tool of the Grupo de Bases de Dados e Imagens at ICMC-USP are also reported, together with the results of its use. In addition, the Framework foreseen in the original project for building visual analysis tools is presented, along with its architecture, characteristics and usage. Finally, the visualization pipeline that emerges from joining the visualization Framework with the FastMapDB tool is described. The work closes with a brief analysis of the science of Information Visualization based on the studied literature, outlining a state-of-the-art scenario for the discipline along with suggestions for future work.
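A Frequency Plot, as described above, maps attribute-value frequencies to visual marks. The numeric step behind such a plot might be sketched as below; the framework's actual rendering and its Relevance Plot are not reproduced, and the data and field names are illustrative.

```python
from collections import Counter

def column_frequencies(rows, column):
    """Normalized frequency of each value in one attribute column:
    the numeric core a Frequency Plot would map to color or size."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

rows = [
    {"species": "pine", "status": "healthy"},
    {"species": "pine", "status": "defoliated"},
    {"species": "oak",  "status": "healthy"},
    {"species": "pine", "status": "healthy"},
]
print(column_frequencies(rows, "species"))  # {'pine': 0.75, 'oak': 0.25}
```

A visual front end would then translate each normalized frequency into a mark attribute (for example, a color ramp), which is the kind of mapping the framework automates.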
|
30 |
O processo de extração de conhecimento de base de dados apoiado por agentes de software / The process of knowledge discovery in databases supported by software agents. Oliveira, Robson Butaca Taborelli de. 01 December 2000.
Scientific and commercial applications generate ever larger amounts of data that can hardly be analyzed without appropriate analysis tools and techniques. Moreover, many of these applications are Internet-oriented and have distributed data, which makes tasks such as data collection even harder. The field of Knowledge Discovery in Databases (KDD) concerns the techniques and tools used to automatically discover knowledge embedded in data. In a computer network environment, some steps of the KDD process, such as data collection and processing, are more complicated to carry out, so new technologies can be employed to aid the knowledge discovery process. Software agents are computer programs with properties such as autonomy, reactivity and mobility that can be used for this purpose. In this context, the goal of this work is to propose a multi-agent system, called Minador, to support the execution and management of the Knowledge Discovery in Databases process.
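The thesis does not publish Minador's design in this abstract, so the sketch below is only a loose illustration of the general idea of agents supporting KDD stages under a coordinator. All class names, stage names and behaviors are hypothetical, and real agent properties such as mobility are not modeled.

```python
class Agent:
    """Minimal stand-in for a KDD support agent: it reacts to a task
    and returns a result (no real autonomy or mobility is modeled)."""
    def __init__(self, name, handler):
        self.name = name
        self.handler = handler

    def handle(self, data):
        return self.handler(data)

class Coordinator:
    """Routes the output of each KDD stage to the next agent,
    illustrating managed execution of the discovery process."""
    def __init__(self, agents):
        self.agents = agents

    def run(self, data):
        for agent in self.agents:
            data = agent.handle(data)
        return data

# Hypothetical three-stage pipeline: collect -> preprocess -> mine.
pipeline = Coordinator([
    Agent("collector", lambda urls: [u + "/data" for u in urls]),
    Agent("preprocessor", lambda rows: [r.upper() for r in rows]),
    Agent("miner", lambda rows: {"patterns": len(rows)}),
])
print(pipeline.run(["siteA", "siteB"]))  # {'patterns': 2}
```

In a distributed setting, each agent could run on the host where its data lives, which is the motivation the abstract gives for using agents at all.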
|