191

Triple Non-negative Matrix Factorization Technique for Sentiment Analysis and Topic Modeling

Waggoner, Alexander A 01 January 2017 (has links)
Topic modeling refers to the process of algorithmically sorting documents into categories based on some common relationship between the documents. This common relationship is considered the "topic" of the documents. Sentiment analysis refers to the process of algorithmically sorting a document into a positive or negative category depending on whether the document expresses a positive or negative opinion on its topic. In this paper, I consider the open problem of classifying documents into both a topic category and a sentiment category. This has a direct application to the retail industry, where companies may want to scour the web for documents (blogs, Amazon reviews, etc.) that both speak about their product and give an opinion on it (positive, negative, or neutral). My solution uses a Non-negative Matrix Factorization (NMF) technique to determine the topic classifications of a document set, and further factors the matrix to discover the sentiment expressed about each product category.
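The topic-discovery step can be illustrated with plain NMF, which the thesis extends with its triple factorization. The sketch below is a minimal example using scikit-learn on a toy corpus; the documents, the number of topics, and the vectorizer settings are illustrative assumptions, and the sentiment factorization is not reproduced.

```python
# A minimal sketch of NMF-based topic discovery; the thesis's triple-factorization
# variant is not reproduced. Corpus and topic count are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "great camera, excellent battery life",
    "battery died quickly, terrible camera",
    "the novel's plot was gripping",
    "a dull plot and flat characters",
]

# Term-document weighting, then factor X ~= W @ H:
# W (docs x topics) gives topic assignments, H (topics x terms) gives topic words.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)   # document-topic weights
H = nmf.components_        # topic-term weights

terms = tfidf.get_feature_names_out()
for k, row in enumerate(H):
    top = [terms[i] for i in row.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
```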
192

L’évolution des systèmes et architectures d’information sous l’influence des données massives : les lacs de données / The information architecture evolution under the big data influence : the data lakes

Madera, Cedrine 22 November 2018 (has links)
Realizing the value of an organization's data assets is at the heart of its digital transformation. Under the influence of big data, the information system must adapt and evolve. This evolution involves a transformation of decision-support systems, but also the appearance of a new component of the information system: the data lake. Far from replacing the decision-support systems that make up the information system, data lakes complement the information system's architecture. First, we focus on the factors that influence the evolution of information systems, such as new software and middleware and new infrastructure technologies, but also the usage of decision-support systems themselves. We study the impact of big data, notably the appearance of new technologies such as Apache Hadoop, as well as the current limits of decision-support systems. The limits encountered by current decision-support systems force the information system to adapt, giving birth to a new component: the data lake. Second, we study this new component in detail, formalize our definition of it, and give our point of view on its position within the information system and with regard to decision-support systems. In addition, we highlight a factor that influences the architecture of data lakes: data gravity. Drawing an analogy with the law of gravity and focusing on the factors that may influence the relationship between data and processing, we show through a use case that taking data gravity into account can influence the design of a data lake. We complete this work by adapting the software product line approach to bootstrap a method for formalizing and modeling data lakes. This method allows us:
- to establish a minimum list of components to put in place for a data lake to operate without turning into a data swamp,
- to evaluate the maturity of an existing data lake,
- to quickly diagnose the missing components of an existing data lake that has become a data swamp,
- to conceptualize the creation of data lakes in a software-agnostic way.
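The gravity analogy lends itself to a back-of-the-envelope illustration. The sketch below is not the thesis's model: scoring attraction as mass product over squared distance, the gigabyte "masses", and the latency-based "distance" are all illustrative assumptions, meant only to show why large datasets tend to pull processing toward them.

```python
# A toy illustration of the data-gravity analogy, not the thesis's model:
# by analogy with Newton's law, the "attraction" between a dataset and a
# processing component is scored as (data mass x workload mass) / distance^2.
# The masses and the latency-based distance are illustrative assumptions.

def gravity(data_mass_gb: float, workload_mass: float, latency_ms: float) -> float:
    """Higher scores suggest moving the processing to the data, not vice versa."""
    return (data_mass_gb * workload_mass) / (latency_ms ** 2)

# Example: a 500 TB dataset inside the lake vs. a 2 GB extract on a laptop.
print(gravity(data_mass_gb=500_000, workload_mass=10, latency_ms=1))   # in-lake
print(gravity(data_mass_gb=2, workload_mass=10, latency_ms=50))        # remote
```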
193

Bias Reduction in Machine Learning Classifiers for Spatiotemporal Analysis of Coral Reefs using Remote Sensing Images

Gapper, Justin J. 06 May 2019 (has links)
This dissertation is an evaluation of the generalization characteristics of machine learning classifiers as applied to the detection of coral reefs using remote sensing images. Three scientific studies were conducted as part of this research: 1) Evaluation of Spatial Generalization Characteristics of a Robust Classifier as Applied to Coral Reef Habitats in Remote Islands of the Pacific Ocean; 2) Coral Reef Change Detection in Remote Pacific Islands using Support Vector Machine Classifiers; 3) A Generalized Machine Learning Classifier for Spatiotemporal Analysis of Coral Reefs in the Red Sea. The aim of this dissertation is to propose and evaluate a methodology for developing a robust machine learning classifier that can be deployed to accurately detect coral reefs at scale. The hypothesis is that Landsat data can be used to train a classifier to detect coral reefs in remote sensing imagery and that this classifier can be trained to generalize across multiple sites. A further objective is to identify how well different classifiers perform under these generalized conditions and how distinctive the spectral signature of coral remains as environmental conditions vary across observation sites. A methodology for validating the generalization performance of a classifier on unseen locations, Controlled Parameter Cross-Validation, is proposed and implemented. Analysis is performed using satellite imagery from nine locations with known coral reefs (six Pacific Ocean sites and three Red Sea sites). Ground truth observations for four of the Pacific Ocean sites and two of the Red Sea sites were used to validate the proposed methodology. Within the Pacific Ocean sites, the consolidated classifier (trained on data from all sites) yielded an accuracy of 75.5% (0.778 AUC); within the Red Sea sites, it yielded an accuracy of 71.0% (0.7754 AUC). Finally, long-term change detection analysis was conducted for each of the sites evaluated. In total, over 16,700 km² was analyzed for benthic cover type and cover change. Within the Pacific Ocean sites, decreases in coral cover ranged from 25.3% (Kingman Reef) to 42.7% (Kiritimati Island); within the Red Sea sites, they ranged from 3.4% (Umluj) to 13.6% (Al Wajh).
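The cross-site validation idea can be sketched with a leave-one-site-out split, where each fold holds out an entire location. This is a generic approximation, not the dissertation's Controlled Parameter Cross-Validation, and the pixel features, labels, and site assignments below are synthetic stand-ins.

```python
# Leave-one-site-out validation of an SVM classifier with scikit-learn.
# Features, labels, and site groupings are synthetic; this sketches the general
# idea of testing generalization to unseen sites, not the exact thesis method.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))          # e.g., per-pixel Landsat band reflectances
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 600) > 0).astype(int)  # coral / not
sites = rng.integers(0, 6, size=600)   # which of six sites each pixel comes from

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
for train, test in LeaveOneGroupOut().split(X, y, groups=sites):
    model.fit(X[train], y[train])
    proba = model.predict_proba(X[test])[:, 1]
    print(f"held-out site {sites[test][0]}: AUC = {roc_auc_score(y[test], proba):.3f}")
```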
194

Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease

Duan, Haoyang 15 May 2014 (has links)
From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs) from the Ontario Heart Genomics Study (OHGS). First, the thesis explains the k-Nearest Neighbour (k-NN) and Random Forest learning algorithms, and includes a complete proof that k-NN is universally consistent in finite-dimensional normed vector spaces. Second, the thesis introduces two dimensionality reduction techniques: Random Projections and a new method termed Mass Transportation Distance (MTD) Feature Selection. Then, this thesis compares the performance of Random Projections with k-NN against MTD Feature Selection with Random Forest for predicting coronary artery disease. Results demonstrate that MTD Feature Selection with Random Forest is superior to Random Projections with k-NN. Random Forest obtains an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS dataset when 3335 SNPs are selected by MTD Feature Selection for classification. This AUC is considerably better than the previous best of 0.608, obtained by Davies et al. in 2010 on the same dataset.
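The pipeline shape described here — select a SNP subset, then fit and score a classifier by ROC AUC — can be sketched as follows. The MTD selector is the thesis's novel contribution and is not reproduced; a mutual-information filter stands in for it, and the SNP matrix and labels are synthetic.

```python
# Feature selection followed by Random Forest, scored by ROC AUC. The selector
# below is a generic stand-in for MTD Feature Selection; data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(800, 500)).astype(float)  # SNP genotypes coded 0/1/2
y = (X[:, :10].sum(axis=1) + rng.normal(0, 2, 800) > 10).astype(int)  # synthetic phenotype

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Stand-in for MTD Feature Selection: keep the 50 most informative SNPs.
selector = SelectKBest(mutual_info_classif, k=50).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(selector.transform(X_tr), y_tr)

proba = rf.predict_proba(selector.transform(X_te))[:, 1]
print(f"test AUC = {roc_auc_score(y_te, proba):.3f}")
```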
195

Um modelo para avaliação de relevância científica baseado em métricas de análise de redes sociais / A model for assessing scientific relevance based on social network analysis metrics

Wanderley, Ayslânya Jeronimo 30 March 2015 (has links)
The task of assessing the scientific relevance of a researcher is not always trivial. Generally, this process is based on indices that consider the researcher's output and its impact in his or her field. However, the literature indicates that such indicators taken in isolation are insufficient, since they ignore the patterns of relationships in which researchers are embedded. In addition, many studies have shown that collaborative relationships have a strong impact on a researcher's relevance. In this context, modeling and analyzing these relationships can help build new indicators that complement the current evaluation process. This work therefore aimed to specify a statistical model for assessing the scientific relevance of a researcher, defined as holding a productivity grant from the National Council for Scientific and Technological Development (Conselho Nacional de Desenvolvimento Científico e Tecnológico, CNPq), based on metrics applied to the researcher's scientific collaboration networks. To this end, Social Network Analysis (SNA) metrics were applied to the collaboration networks of 1592 faculty members affiliated with graduate programs in Computer Science, which then served as the basis for building a logistic regression model evaluated with stratified 10-fold cross-validation. The proposed model produced very encouraging results and showed that the SNA metrics with the greatest influence on a researcher's assessed relevance are betweenness centrality, weighted degree, PageRank, and local clustering coefficient, with the first two having a positive influence and the last two a negative one. This indicates that researchers who play an intermediary role within the network and tend to maintain strong relationships with their collaborators are more likely to be awarded productivity grants, while researchers who have a more cohesive network and tend to collaborate with researchers who are already leaders in their field are less likely to hold one.
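The modeling step can be sketched end to end: compute the four SNA metrics on a collaboration network, then fit a logistic regression under stratified 10-fold cross-validation. The graph and the grant labels below are synthetic stand-ins, not the study's data.

```python
# SNA metrics on a collaboration network feeding a logistic regression,
# evaluated with stratified 10-fold cross-validation. Graph and labels are synthetic.
import numpy as np
import networkx as nx
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

G = nx.barabasi_albert_graph(300, 3, seed=0)    # stand-in collaboration network

betweenness = nx.betweenness_centrality(G)
degree = dict(G.degree())                        # unweighted stand-in for weighted degree
pagerank = nx.pagerank(G)
clustering = nx.clustering(G)                    # local clustering coefficient

X = np.array([[betweenness[n], degree[n], pagerank[n], clustering[n]] for n in G])
y = (X[:, 0] > np.median(X[:, 0])).astype(int)   # synthetic "holds a grant" label

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```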
196

Um modelo para a detecção das mudanças de posicionamento dos deputados federais / A model for detecting position changes among federal deputies

Baptista, Vítor Márcio Paiva de Sousa 27 August 2015 (has links)
In Brazil, there are tools for monitoring the behaviour of legislators in roll-call votes, such as O Estado de São Paulo's Basômetro and Radar Parlamentar. These tools are used for analysis by both journalists and political scientists. Although they are great analysis tools, their usefulness for monitoring is limited because they require manual follow-up, which becomes laborious given the volume of data: in the Chamber of Deputies alone, 513 legislators participate in more than 400 roll-call votes per legislature on average. The amount of data can be reduced by analyzing parties as a whole, but at the cost of losing the ability to detect moves by individuals or by intra-party groups such as factions. To mitigate this problem, I developed a statistical model that detects when a legislator changes position, joining or leaving the governmental coalition, through ideal-point estimates produced with W-NOMINATE. It can be used on its own or integrated into tools such as the Basômetro, providing a filter that helps researchers find the deputies whose behaviour changed most significantly. The universe of study comprises legislators of the Chamber of Deputies from the 50th through the 54th legislatures, from the first term of Fernando Henrique Cardoso in 1995 to the beginning of the second term of Dilma Rousseff in 2015.
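A change detector of this kind can be sketched once ideal points are in hand. In the thesis the ideal points come from W-NOMINATE (an R package); the sketch below assumes they are already estimated, and the rule — flagging a crossing of a coalition cutoff between consecutive legislatures — is an illustrative simplification rather than the thesis's model.

```python
# Flagging position changes from per-legislature ideal-point series. The ideal
# points are assumed inputs (estimated elsewhere, e.g. with W-NOMINATE); the
# cutoff-crossing rule is an illustrative simplification.
from typing import Dict, List

def flag_changes(ideal_points: Dict[str, List[float]], cutoff: float = 0.0) -> Dict[str, List[int]]:
    """Return, per legislator, the indices where the series crosses the coalition cutoff."""
    changes = {}
    for name, series in ideal_points.items():
        sides = [x >= cutoff for x in series]
        changes[name] = [i for i in range(1, len(sides)) if sides[i] != sides[i - 1]]
    return changes

# Hypothetical first-dimension ideal points across five legislatures:
points = {
    "Deputy A": [-0.8, -0.7, -0.6, 0.3, 0.5],   # joins the coalition at index 3
    "Deputy B": [0.4, 0.5, 0.4, 0.5, 0.6],      # stable
}
print(flag_changes(points))   # {'Deputy A': [3], 'Deputy B': []}
```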
197

Examining the structures and practices for knowledge production within Galaxy Zoo: an online citizen science initiative

Bantawa, Bipana January 2014 (has links)
This study examines the ways in which public participation in the production of scientific knowledge influences the practices and expertise of the scientists in Galaxy Zoo, an online Big Data citizen science initiative. The need for citizen science in astronomy arose in response to the challenges of rapid advances in data-gathering technologies, which demanded pattern recognition capabilities that were too advanced for existing computer algorithms. To address these challenges, Galaxy Zoo scientists recruited volunteers through their website, a strategy which proved to be remarkably reliable and efficient. In doing so, they opened up the boundaries of scientific processes to the public. This shift has led to important outcomes in terms of the scientific discovery of new astronomical objects; the creation and refining of scientific practices; and the development of new forms of expertise among key actors as they continue to pursue their scientific goals. This thesis attempts to answer the overarching research question: how is citizen science shaping the practices and expertise of Galaxy Zoo scientists? The emergence of new practices and the development of expertise in managing citizen science projects were observed by following the work of the Galaxy Zoo scientists, in particular the Principal Investigator (PI) and the project's Technical Lead (TL), from February 2010 to April 2013. A broadly ethnographic approach was taken, which allowed the study to be sensitive to the uncertainty and unprecedented events that characterised the development of Galaxy Zoo as a pioneering project in the field of data-intensive citizen science. Unstructured interviewing was the major source of data on the work of the PI and TL, while the communication between these participants, the broader Science Team, and their inter-institutional collaborators was captured through analyses of the team mailing list, the official blog, and their social media posts. The process of data analysis was informed by an initial conceptualisation of Galaxy Zoo as a knowledge production system, and the concept of the knowledge object (Knorr-Cetina, 1999), as an unfolding epistemic entity, became a primary analytical tool. Since the direction and future of Galaxy Zoo involved addressing new challenges, the study demanded periodic recursive analysis of the conceptual framework and the knowledge objects of both Galaxy Zoo and the present examination of its development. The key findings were as follows. The involvement of public volunteers shaped the practices of the Science Team as they pursued robust scientific outcomes. Changes included: negotiating collaborations; designing the classification tasks for the volunteers; re-examining data reduction methods and data release policies; disseminating results; creating new epistemic communities; and science communication. In addition, new kinds of expertise involved in running Galaxy Zoo were identified, with the relational and adaptive aspects of expertise seen as important. It was therefore proposed that expertise in running citizen science projects should be recognised as a domain expertise in its own right. In Galaxy Zoo, the development of this expertise could be attributed to a combined understanding of the design principles of doing good science, innovation in methods, and the creation of a dialogic space for scientists and volunteers. The empirical and theoretical implications of this study therefore lie in (i) identifying emergent practices in citizen science while prioritising scientific knowledge production and (ii) a re-examination of expertise for science in the emerging context of data-intensive science.
199

Designing Surveys on Youth Immigration Reform: Lessons from the 2016 CCES Anomaly

Calkins, Saige 18 December 2020 (has links)
Even with the clear advantages of internet-based survey research, there is still some uncertainty about which survey methods are best suited to an online platform. Most of the survey methods literature, whether focusing on online, telephone, or in-person formats, observes little to no difference in results across survey modes. Despite this, there is little research on the interaction between survey formatting, in terms of design and framing, and public opinion on social issues, specifically child immigration policies, a recent topic of popular debate. This paper examines an anomalous result in the 2016 Cooperative Congressional Election Study (CCES): on a public opinion immigration question concerning a DACA-related policy, support was evenly split, even though the policy is typically highly favored. To decipher this unprecedented result, an experimental survey was conducted via Qualtrics, comparing various survey formats (single-style, forced choice, Likert scale) and inclusion of policy details against the original CCES "select all that apply" matrix style. Comparing the experimental polls, the "select all that apply" matrix again produced anomalous results, while the other methods produced breakdowns similar to typical DACA-related polling data. These findings have important implications for future survey designs and for those examining public opinion on child immigration policies.
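One simple way to test whether format drives the split described above is a chi-square test of independence across formats. The counts below are hypothetical, not the thesis's data.

```python
# Chi-square test of independence: do response distributions differ by format?
# The counts are hypothetical illustrations, not the thesis's results.
from scipy.stats import chi2_contingency

# Rows: survey format; columns: (support, oppose) counts.
observed = [
    [152, 148],   # "select all that apply" matrix: near-even split
    [231, 69],    # forced choice
    [224, 76],    # Likert scale, collapsed to support/oppose
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```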
200

Identification of alkaline fens using convolutional neural networks and multispectral satellite imagery

Jernberg, John January 2021 (has links)
The alkaline fen is a particularly valuable type of wetland with unique characteristics. Due to anthropogenic risk factors and the sensitive nature of the fens, protection is highly prioritized, with identification and mapping of current locations being important parts of this process. To accomplish this cost-effectively over large areas, remote sensing methods using satellite images can be very effective. Following the rapid development in computer vision, deep learning using convolutional neural networks (CNNs) is the current state of the art for satellite image classification. Accordingly, this study evaluates the combination of different CNN architectures and multispectral Sentinel-2 satellite images for identification of alkaline fens using semantic segmentation. The implemented models are variations of the proven U-net network design. In addition, a Random Forest classifier was trained for baseline comparison. The best result was produced by a spatial-attention U-net with an IoU score of 0.31 for the alkaline fen class and a mean IoU score of 0.61. These findings suggest that identification of alkaline fens is possible with the current method even with a small dataset, although an optimal solution may require further research. The results also further establish deep learning as the superior choice over traditional machine learning algorithms for satellite image classification.
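The U-net design named above can be sketched compactly. The following PyTorch model is a minimal illustration: the depth, channel widths, 10-band Sentinel-2 input, and two output classes (fen / background) are assumptions for the example, and the spatial-attention variant that performed best is not reproduced.

```python
# A minimal U-net-style segmentation model: two-level encoder, bottleneck,
# and decoder with skip connections. Shapes and widths are illustrative.
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self, in_ch=10, n_classes=2):
        super().__init__()
        self.enc1, self.enc2 = block(in_ch, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = block(128, 64)          # 64 skip + 64 upsampled channels
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = block(64, 32)           # 32 skip + 32 upsampled channels
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                # per-pixel class logits

# One forward pass on a fake 10-band 64x64 tile:
model = MiniUNet()
logits = model(torch.randn(1, 10, 64, 64))
print(logits.shape)   # torch.Size([1, 2, 64, 64])
```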
