191

Machine Learning for Air Flow Characterization : An application of Theory-Guided Data Science for Air Flow characterization in an Industrial Foundry / Maskininlärning för Luftflödeskarakterisering : En applikation för en Teorivägledd Datavetenskapsmodell för Luftflödeskarakterisering i en Industrimiljö

Lundström, Robin January 2019 (has links)
In industrial environments, operators are exposed to polluted air which, after prolonged exposure, can cause irreversible and lethal diseases such as chronic obstructive pulmonary disease, silicosis, and lung cancer. Current air monitoring techniques are carried out sparsely, either on a single day annually or at a few measurement positions for a few days. In this thesis a theory-guided data science (TGDS) model is presented. This hybrid model combines a steady-state Computational Fluid Dynamics (CFD) model with a machine learning model; both were developed in Matlab. The CFD model serves as a baseline for the airflow, whereas the machine learning model captures dynamic features in the foundry. Measurements had previously been made at a foundry where five stationary sensors and one mobile robot were used for data acquisition. An Echo State Network (ESN) was used as a supervised learning technique for airflow predictions at each robot measurement position, and Gaussian Processes (GP) were used as a regression technique to form an Echo State Map (ESM), a combination first applied to airflow estimation in steel plants in 2016. The stationary sensor data were used as input to the echo state network, and the difference between the CFD predictions and the robot measurements was used as the teacher signal, forming a dynamic correction map that was added to the steady-state CFD. The proposed model thus combines the high spatio-temporal resolution of the echo state map with the physical consistency of the CFD. Initial applications of this hybrid model show that the best qualities of the two approaches can work in symbiosis to give enhanced characterization. The proposed model could play an important role in future characterization of airflow in industrial premises, and more research on this and similar topics is encouraged to properly understand its potential.
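To make the correction scheme above concrete, here is a minimal sketch of an echo state network trained on the CFD/measurement residual. The thesis implemented its model in Matlab; this numpy port, the reservoir sizes, and the synthetic series are illustrative assumptions, not the original code.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_reservoir = 5, 200       # five stationary sensors feed the network
spectral_radius, leak = 0.9, 0.3     # typical ESN hyperparameters (assumed)

W_in = rng.uniform(-0.5, 0.5, (n_reservoir, n_inputs))
W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # enforce echo state property

def run_reservoir(U):
    """Collect leaky-integrator reservoir states for an input sequence U (T x n_inputs)."""
    x, states = np.zeros(n_reservoir), []
    for u in U:
        x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.asarray(states)

def train_readout(U, teacher, ridge=1e-6):
    """Ridge-regress reservoir states onto the teacher signal
    (here: robot measurement minus steady-state CFD prediction)."""
    X = run_reservoir(U)
    return np.linalg.solve(X.T @ X + ridge * np.eye(n_reservoir), X.T @ teacher)

# Synthetic stand-ins for the sensor time series and the CFD/robot residual:
U_train = rng.normal(size=(1000, n_inputs))
teacher = rng.normal(size=1000)

W_out = train_readout(U_train, teacher)
correction = run_reservoir(U_train) @ W_out   # dynamic correction added to the CFD field
```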
192

大數據分析時代壽險業之因應對策 / The Life Insurance Industry's Big Data Strategy

廖晨旭, Liao, Chen Hsu Unknown Date (has links)
Since the Industrial Revolution, the shifting relationship between people and technology has driven social and economic development, with general-purpose technologies (GPTs) playing a leading role. Technology continues to advance at an exponential pace, and the emergence of big data follows a traceable path; viewed from the evolution of both data and analytics, it can to some extent be called inevitable. Big data analytics is not a fashionable buzzword but a major trend shaping the present and the future. Despite many opposing voices and arguments, it has already become part of national security strategy and a lifeline on which enterprises depend for survival. What distinguishes big data from the past is that we now have many more data sources: data may come from outside the organization (open data, third-party data) or from more refined collection mechanisms, such as incentive schemes that lead customers to volunteer their data, or randomized experiments that yield new information beyond historical records. As for the variety of data formats and the speed of data acquisition and feedback, emerging MapReduce technology, NoSQL databases, and stream-processing techniques allow these tasks to be completed effectively in real time or near real time. The heart of big data analytics remains predictive analytics, and to let the data speak we must understand both the characteristics and the shortcomings of big data. The hard and soft technologies supporting big data are advancing rapidly, further expanding its potential applications across industries; firms that invest in big data have been reported to earn revenues more than 12% higher than those that do not. As most industries join this arms race and achieve initial results, the traditional life insurance industry, which has lagged behind other sectors in responding to big data and other technological change, should observe and apply big data analytics along the life insurance value chain: break through the existing business model, choose the best adoption strategy, find an ideal data scientist to serve as Chief Data Officer (CDO), charge that person with organizing an analytics team and formulating a big data growth strategy, build an appropriate hardware and software architecture, and complete a first pilot project that achieves small-scale success, thereby strengthening senior management's confidence in and willingness to invest in big data analytics so that project after project can proceed. The end state is a data-driven decision culture: a life insurer capable of facing the future rather than being eliminated in this wave of technological change.
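As a hedged illustration of the MapReduce pattern the abstract credits with enabling timely processing, here is a toy word-count sketch in Python; the records are invented, and a real deployment would of course run on a framework such as Hadoop.

```python
from collections import Counter
from functools import reduce

# Invented policy-event records standing in for large-scale insurance data.
records = ["policy lapse claim", "claim fraud", "policy renewal claim"]

# Map phase: each record is turned into partial (term, count) pairs;
# in a cluster, this step runs in parallel across nodes.
mapped = [Counter(r.split()) for r in records]

# Reduce phase: partial counts are merged into a global aggregate.
totals = reduce(lambda a, b: a + b, mapped, Counter())

print(totals.most_common(3))  # [('claim', 3), ('policy', 2), ('lapse', 1)]
```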
193

Triple Non-negative Matrix Factorization Technique for Sentiment Analysis and Topic Modeling

Waggoner, Alexander A 01 January 2017 (has links)
Topic modeling refers to the process of algorithmically sorting documents into categories based on some common relationship between them; this common relationship is considered the "topic" of the documents. Sentiment analysis refers to the process of algorithmically sorting a document into a positive or negative category depending on whether it expresses a positive or negative opinion on its respective topic. In this paper, I consider the open problem of classifying a document into both a topic category and a sentiment category. This has a direct application to the retail industry, where companies may want to scour the web to find documents (blogs, Amazon reviews, etc.) which both speak about their product and give an opinion on it (positive, negative, or neutral). My solution to this problem uses a Non-negative Matrix Factorization (NMF) technique to determine the topic classifications of a document set, and further factors the matrix to discover the sentiment behind each category of product.
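A minimal sketch of the topic-factorization step described above, using scikit-learn's NMF on a toy review corpus; the corpus is invented, and the further factorization for sentiment proposed in the paper is not reproduced here.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy product-review corpus (invented).
docs = [
    "great camera battery lasts all day",
    "terrible camera battery died quickly",
    "the blender is quiet and powerful",
    "blender broke after one week awful",
]

tfidf = TfidfVectorizer(stop_words="english")
V = tfidf.fit_transform(docs)        # document-term matrix (documents x terms)

nmf = NMF(n_components=2, init="nndsvd")
W = nmf.fit_transform(V)             # documents x topics: topic weights per document
H = nmf.components_                  # topics x terms: term weights per topic

terms = tfidf.get_feature_names_out()
for k, row in enumerate(H):
    top = [terms[i] for i in row.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")       # dominant terms characterizing each topic
```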
194

L’évolution des systèmes et architectures d’information sous l’influence des données massives : les lacs de données / The evolution of information systems and architectures under the influence of big data: data lakes

Madera, Cedrine 22 November 2018 (has links)
The valorization of an organization's data assets is at the heart of its digital transformation. Under the influence of big data, the information system must adapt and evolve. This evolution involves a transformation of decision-support systems, but also the appearance of a new component of the information system: the data lake. Far from replacing the decision-support systems that make up the information system, data lakes complement its architecture. First, we focus on the factors that influence the evolution of information systems, such as new software and middleware, new infrastructure technologies, and the usage of decision-support systems themselves. We study the impact of big data, notably the appearance of new technologies such as Apache Hadoop, as well as the current limits of decision-support systems. These limits force a change on the information system, which must adapt, giving birth to the new component: the data lake. Second, we study this new component in detail, formalize our definition of it, and give our point of view on its position within the information system and relative to decision-support systems. In addition, we highlight a factor influencing the architecture of data lakes, data gravity, drawing an analogy with the law of gravity and focusing on the factors that may influence the relationship between data and processing. Through a use case, we show that taking data gravity into account can influence the design of a data lake. We complete this work by adapting the software product line approach to bootstrap a method for formalizing and modeling data lakes. This method allows us: to establish a minimum list of components needed to operate a data lake without letting it turn into a data swamp; to evaluate the maturity of an existing data lake; to quickly diagnose the missing components of an existing data lake that has become a data swamp; and to conceptualize the creation of data lakes in a software-agnostic way.
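To make the gravity analogy concrete, one hedged reading borrows Newton's form directly; the symbols below are illustrative assumptions rather than the thesis's own notation.

```latex
% Data gravity by analogy with Newton's law of gravitation: the larger a
% dataset's "mass", the more strongly it pulls applications and processing
% toward itself, which in turn shapes where a data lake should live.
F_{\text{gravity}} = G \,\frac{m_{\text{data}} \cdot m_{\text{app}}}{d^{2}}
% m_data : mass of the dataset (volume, value)
% m_app  : mass of the application or workload that consumes it
% d      : network "distance" between them (latency, bandwidth)
% G      : a scaling constant
```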
195

Bias Reduction in Machine Learning Classifiers for Spatiotemporal Analysis of Coral Reefs using Remote Sensing Images

Gapper, Justin J. 06 May 2019 (has links)
This dissertation is an evaluation of the generalization characteristics of machine learning classifiers as applied to the detection of coral reefs using remote sensing images. Three scientific studies were conducted as part of this research: 1) Evaluation of Spatial Generalization Characteristics of a Robust Classifier as Applied to Coral Reef Habitats in Remote Islands of the Pacific Ocean; 2) Coral Reef Change Detection in Remote Pacific Islands using Support Vector Machine Classifiers; 3) A Generalized Machine Learning Classifier for Spatiotemporal Analysis of Coral Reefs in the Red Sea. The aim of this dissertation is to propose and evaluate a methodology for developing a robust machine learning classifier that can be deployed to accurately detect coral reefs at scale. The hypothesis is that Landsat data can be used to train a classifier to detect coral reefs in remote sensing imagery, and that this classifier can be trained to generalize across multiple sites. A further objective is to identify how well different classifiers perform under these generalized conditions, and how distinctive the spectral signature of coral is as environmental conditions vary across observation sites. A methodology for validating the generalization performance of a classifier on unseen locations, Controlled Parameter Cross-Validation, is proposed and implemented. Analysis is performed using satellite imagery from nine locations with known coral reefs (six Pacific Ocean sites and three Red Sea sites). Ground truth observations for four of the Pacific Ocean sites and two of the Red Sea sites were used to validate the proposed methodology. Within the Pacific Ocean sites, the consolidated classifier (trained on data from all sites) yielded an accuracy of 75.5% (0.778 AUC); within the Red Sea sites, it yielded an accuracy of 71.0% (0.7754 AUC). Finally, long-term change detection analysis was conducted for each of the sites evaluated. In total, over 16,700 km² was analyzed for benthic cover type and cover change. Within the Pacific Ocean sites, decreases in coral cover ranged from 25.3% (Kingman Reef) to 42.7% (Kiritimati Island); within the Red Sea sites, decreases ranged from 3.4% (Umluj) to 13.6% (Al Wajh).
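A hedged sketch of validating spatial generalization by holding out entire sites, in the spirit of the Controlled Parameter Cross-Validation named above; this approximates the idea with scikit-learn's leave-one-group-out splitter, and the features, labels, and site assignments are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))        # stand-in for Landsat band reflectances
y = rng.integers(0, 2, 600)          # coral / non-coral ground truth (synthetic)
site = np.repeat(np.arange(6), 100)  # six observation sites, 100 pixels each

clf = make_pipeline(StandardScaler(), SVC(probability=True))

aucs = []
for train, test in LeaveOneGroupOut().split(X, y, groups=site):
    clf.fit(X[train], y[train])           # train on five sites
    p = clf.predict_proba(X[test])[:, 1]  # score the held-out site
    aucs.append(roc_auc_score(y[test], p))

print(f"mean held-out-site AUC: {np.mean(aucs):.3f}")
```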
196

Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease

Duan, Haoyang 15 May 2014 (has links)
From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs) from the Ontario Heart Genomics Study (OHGS). First, the thesis explains the k-Nearest Neighbour (k-NN) and Random Forest learning algorithms, and includes a complete proof that k-NN is universally consistent in finite dimensional normed vector spaces. Second, the thesis introduces two dimensionality reduction techniques: Random Projections and a new method termed Mass Transportation Distance (MTD) Feature Selection. Then, this thesis compares the performance of Random Projections with k-NN against MTD Feature Selection and Random Forest for predicting artery disease. Results demonstrate that MTD Feature Selection with Random Forest is superior to Random Projections and k-NN. Random Forest is able to obtain an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS dataset, when 3335 SNPs are selected by MTD Feature Selection for classification. This area is considerably better than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.
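A sketch of the select-then-classify pipeline the abstract describes. The Mass Transportation Distance selector is the thesis's own method and is not reproduced here; a univariate chi-square filter stands in for it, and the genotype matrix is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 10_000))  # SNP genotypes coded 0/1/2 (synthetic)
y = rng.integers(0, 2, 500)                 # case/control status (synthetic)

# Selection happens inside the pipeline, so each CV fold selects features
# on its own training split and avoids leaking test information.
pipe = Pipeline([
    ("select", SelectKBest(chi2, k=200)),   # stand-in for MTD Feature Selection
    ("forest", RandomForestClassifier(n_estimators=300, random_state=0)),
])

aucs = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"mean cross-validated AUC: {aucs.mean():.3f}")
```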
197

Um modelo para avaliação de relevância científica baseado em métricas de análise de redes sociais / A model for assessing scientific relevance based on social network analysis metrics

Wanderley, Ayslânya Jeronimo 30 March 2015 (has links)
The task of assessing the scientific relevance of a researcher is not always trivial. Generally, this process is based on indices that consider a researcher's output and its impact in their area of research. However, the literature indicates that such indicators taken in isolation are insufficient, since they ignore the patterns of relationships in which researchers are embedded. In addition, many studies have shown that collaborative relationships have a strong impact on a researcher's relevance. In this context, modeling and analyzing these relationships can help build new indicators that complement the current evaluation process. This work therefore aimed to specify a statistical model for assessing the scientific relevance of a researcher, defined as holding a productivity grant from the National Council for Scientific and Technological Development (Conselho Nacional de Desenvolvimento Científico e Tecnológico, CNPq), based on metrics applied to the researcher's scientific collaboration networks. To this end, Social Network Analysis (SNA) metrics were applied to the collaboration networks of 1592 professors affiliated with graduate programs in Computer Science, which then served as the basis for building a logistic regression model using stratified 10-fold cross-validation. The proposed model produced very encouraging results and showed that the SNA metrics with the most influence on assessing a researcher's relevance are Betweenness Centrality, Weighted Degree, PageRank, and Local Clustering Coefficient, with the first two having a positive influence and the last two a negative one. This indicates that researchers who play an intermediary role within the network and maintain strong relationships with their collaborators are more likely to be awarded productivity grants, while researchers whose networks are more cohesive and who often collaborate with researchers who are already leaders in their field are less likely to hold one.
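A minimal sketch of the evaluation protocol described above: a logistic regression over the four SNA metrics, scored with stratified 10-fold cross-validation. The feature values here are synthetic; only the metric set and sample size follow the abstract.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 1592                          # researchers in the study
X = rng.normal(size=(n, 4))       # betweenness, weighted degree, PageRank,
                                  # local clustering coefficient (synthetic)
y = rng.integers(0, 2, n)         # holds a CNPq productivity grant or not

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())

# The sign of each fitted coefficient gives the direction of influence,
# which is how the abstract reads off positive vs. negative effects.
model.fit(X, y)
for name, coef in zip(["betweenness", "weighted_degree", "pagerank",
                       "clustering"], model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```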
198

Um modelo para a detecção das mudanças de posicionamento dos deputados federais / A model for detecting position changes of Brazilian federal deputies

Baptista, Vítor Márcio Paiva de Sousa 27 August 2015 (has links)
In Brazil, there are tools for monitoring the behaviour of legislators in roll-call votes, such as O Estado de São Paulo's Basômetro and Radar Parlamentar. These tools are used for analysis by both journalists and political scientists. Although they are excellent analysis tools, their usefulness for monitoring is limited because they require manual follow-up, which becomes burdensome given the volume of data: in the Chamber of Deputies alone, 513 legislators participate on average in more than 400 roll-call votes per legislature. It is possible to reduce the amount of data by analyzing parties as a whole, but in doing so we lose the ability to detect movements by individuals or intra-party groups such as caucuses. To mitigate this problem, I developed a statistical model that detects when a legislator changes position, joining or leaving the governing coalition, through ideal point estimates produced with W-NOMINATE. It can be used on its own or integrated into tools such as Basômetro, providing a filter that helps researchers find the deputies whose behaviour changed most significantly. The universe of study comprises legislators of the Chamber of Deputies from the 50th through the 54th legislatures, from the start of Fernando Henrique Cardoso's first term in 1995 to the beginning of Dilma Rousseff's second term in 2015.
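A hedged sketch of the detection step: given a legislator's per-period ideal points (the thesis estimates these with W-NOMINATE, an R package; the series below is synthetic), flag the periods where the estimate crosses the boundary between the governing coalition and the opposition.

```python
import numpy as np

def position_changes(ideal_points, boundary=0.0):
    """Return the periods at which a legislator's ideal point crosses the
    boundary separating the governing coalition from the opposition."""
    sides = np.sign(np.asarray(ideal_points) - boundary)
    return [t for t in range(1, len(sides)) if sides[t] != sides[t - 1]]

# A synthetic legislator drifting from the opposition (-) into the coalition (+):
series = [-0.8, -0.6, -0.2, 0.1, 0.4]
print(position_changes(series))   # [3]: a change of position at period 3
```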
199

Examining the structures and practices for knowledge production within Galaxy Zoo : an online citizen science initiative

Bantawa, Bipana January 2014 (has links)
This study examines the ways in which public participation in the production of scientific knowledge influences the practices and expertise of the scientists in Galaxy Zoo, an online Big Data citizen science initiative. The need for citizen science in the field of astronomy arose in response to the challenges of rapid advances in data-gathering technologies, which demanded pattern recognition capabilities beyond existing computer algorithms. To address these challenges, Galaxy Zoo scientists recruited volunteers through their website, a strategy which proved remarkably reliable and efficient. In doing so, they opened up the boundaries of scientific processes to the public. This shift has led to important outcomes in terms of the scientific discovery of new astronomical objects; the creation and refining of scientific practices; and the development of new forms of expertise among key actors as they continue to pursue their scientific goals. This thesis attempts to answer the over-arching research question: how is citizen science shaping the practices and expertise of Galaxy Zoo scientists? The emergence of new practices and the development of expertise in managing citizen science projects were observed by following the work of the Galaxy Zoo scientists, in particular the Principal Investigator (PI) and the project's Technical Lead (TL), from February 2010 to April 2013. A broadly ethnographic approach was taken, which allowed the study to be sensitive to the uncertainty and unprecedented events that characterised the development of Galaxy Zoo as a pioneering project in the field of data-intensive citizen science. Unstructured interviewing was the major source of data on the work of the PI and TL, while the communication between these participants, the broader Science Team, and their inter-institutional collaborators was captured through analyses of the team mailing list, the official blog, and social media posts. The process of data analysis was informed by an initial conceptualisation of Galaxy Zoo as a knowledge production system, and the concept of the knowledge object (Knorr-Cetina, 1999), as an unfolding epistemic entity, became a primary analytical tool. Since the direction and future of Galaxy Zoo involved addressing new challenges, the study demanded periodic recursive analysis of the conceptual framework and the knowledge objects of both Galaxy Zoo and the present examination of its development. The key findings were as follows. The involvement of public volunteers shaped the practices of the Science Team as they pursued robust scientific outcomes. Changes included: negotiating collaborations; designing the classification tasks for the volunteers; re-examining data reduction methods and data release policies; disseminating results; creating new epistemic communities; and science communication. In addition, new kinds of expertise involved in running Galaxy Zoo were identified, with the relational and adaptive aspects of expertise seen as important. It is therefore proposed that expertise in running citizen science projects should be recognised as a domain expertise in its own right. In Galaxy Zoo, the development of this expertise could be attributed to a combined understanding of the design principles of doing good science, innovation in methods, and the creation of a dialogic space for scientists and volunteers.
The empirical and theoretical implications of this study therefore lie in (i) identifying emergent practices in citizen science while prioritising scientific knowledge production and (ii) re-examining expertise for science in the emerging context of data-intensive science.
