  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
91

Bayesian Text Analytics for Document Collections

Walker, Daniel David 15 November 2012 (has links) (PDF)
Modern document collections are too large to annotate and curate manually. As increasingly large amounts of data become available, historians, librarians and other scholars increasingly need to rely on automated systems to efficiently and accurately analyze the contents of their collections and to find new and interesting patterns therein. Modern techniques in Bayesian text analytics are becoming widespread and have the potential to revolutionize the way that research is conducted. Much work has been done in the document modeling community towards this end, though most of it is focused on modern, relatively clean text data. We present research for improved modeling of document collections that may contain textual noise or that may include real-valued metadata associated with the documents. This class of documents includes many historical document collections. Indeed, our specific motivation for this work is to help improve the modeling of historical documents, which are often noisy and/or have historical context represented by metadata. Many historical documents are digitized by means of Optical Character Recognition (OCR) from document images of old and degraded original documents. Historical documents also often include associated metadata, such as timestamps, which can be incorporated in an analysis of their topical content. Many techniques, such as topic models, have been developed to automatically discover patterns of meaning in large collections of text. While these methods are useful, they can break down in the presence of OCR errors. We show the extent to which this performance breakdown occurs. The specific types of analyses covered in this dissertation are document clustering, feature selection, unsupervised and supervised topic modeling for documents with and without OCR errors, and a new supervised topic model that uses Bayesian nonparametrics to improve the modeling of document metadata.
We present results in each of these areas, with an emphasis on studying the effects of noise on the performance of the algorithms and on modeling the metadata associated with the documents. In this research we effectively: improve the state of the art in both document clustering and topic modeling; introduce a useful synthetic dataset for historical document researchers; and present analyses that empirically show how existing algorithms break down in the presence of OCR errors.
92

Topic classification of Monetary Policy Minutes from the Swedish Central Bank / Ämnesklassificering av Riksbankens penningpolitiska mötesprotokoll

Cedervall, Andreas, Jansson, Daniel January 2018 (has links)
Over the last few years, the use of machine learning has grown sharply. Many previously manual tasks are becoming automated, and it stands to reason that this development will continue at a rapid pace. This paper builds on prior work in topic classification and attempts to provide a baseline for analysing the Swedish Central Bank minutes and gathering information from them, using both Latent Dirichlet Allocation and a simple neural network. Topic classification is performed on Monetary Policy Minutes from 2004 to 2018 to find how the distributions of topics change over time. The results are compared to empirical evidence that would confirm the trends. Finally, a business perspective on the work is analysed to reveal what the benefits of implementing this type of technique could be. The results of the two methods differ: the neural network shows larger changes in topic distributions than Latent Dirichlet Allocation. The neural network also yielded more trends that correlated with other observations, such as the start of bond purchasing by the Swedish Central Bank. Thus, our results indicate that a neural network would perform better than Latent Dirichlet Allocation when analyzing Swedish Monetary Policy Minutes. / In recent years, artificial intelligence and machine learning have received a great deal of attention and grown enormously. Previously manual tasks are now being automated, and there is every indication that this development will continue at a high pace. This work builds on earlier work in topic modeling (topic classification) and applies it to a previously unexplored area, the minutes of the Riksbank. Latent Dirichlet Allocation and a neural network are used to examine whether the distribution of discussion points (topics) changes over time. Finally, a theoretical discussion of the potential business value of implementing a similar method is presented.
The results of the two models show large differences over time. While Latent Dirichlet Allocation finds no major trends in the discussion points, the neural network shows larger changes over time. The latter also agree well with other observations, such as the start of bond purchases. The results therefore indicate that a neural network is a more suitable method for analyzing the Riksbank's monetary policy minutes.
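For readers unfamiliar with Latent Dirichlet Allocation, one of the two methods compared in the abstract above, the following is a minimal sketch of collapsed Gibbs sampling in plain Python. It is not the thesis authors' implementation: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy corpus are all assumptions made for illustration. The function returns, for each document, an estimated distribution over topics, which is the quantity one would track over time.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. `docs` is a list of token lists.

    Returns theta, a list with one topic distribution per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                 # count of topic k in doc d
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                               # total tokens in topic k

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token (unnormalized).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    return [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


# Hypothetical toy corpus; real minutes would be tokenized and cleaned first.
toy_docs = [
    ["rate", "inflation", "rate", "inflation"],
    ["bond", "purchase", "bond", "purchase"],
    ["rate", "bond"],
]
for row in lda_gibbs(toy_docs, num_topics=2, iters=50):
    print([round(p, 3) for p in row])
```

Grouping the returned rows by meeting date and averaging them would give a simple view of how topic proportions move over time; a production analysis would more likely rely on an established topic modeling library.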
93

COP TOPICS: TOPIC MODELING-ASSISTED DISCOVERIES OF POLICE-RELATED THEMES IN AFRICAN-AMERICAN JOURNALISTIC TEXTS

Lemire Garlic, Nicole January 2017 (has links)
Mainstream newspaper content has long been mined by communication scholars and researchers for insights into public opinion and perceptions. In recent years, scholars have been examining African-American authored periodicals to obtain similar insights. Hearkening back to the 1950s and 1960s civil rights movement in the United States, the highly publicized killings of African-American men by police officers during the past several years have highlighted longstanding strained police-community relations. As part of their role as both a reflection of, and an advocate for, the African-American community, African-American journalistic texts contain a wealth of data about African-American public opinion about, and perceptions of, police. In years past, media content analysts would manually sift through newspapers to divine interesting police-related themes and variables worthy of study. But, with the exponential growth of digitized texts, communication scholars are experimenting with computerized text analysis tools like topic modeling software to aid them in their content analyses. This thesis considers to what degree topic modeling software can be used at the exploratory stage of designing a content analysis study to aid in uncovering themes and variables worthy of further investigation. Appendix A contains results of the manual exploratory content analysis. The list of topics generated by the topic modeling software may be found in Appendix B. / Media Studies & Production / Accompanied by one .pdf file: NLG Thesis Appendices Final.pdf
94

Essays on Utilizing Data Analytics and Dynamic Modeling to Inform Complex Science and Innovation Policies

Baghaei Lakeh, Arash 27 April 2018 (has links)
In many ways, science represents a complex system which involves technical, social, and economic aspects. An analysis of such a system requires employing and combining different methodological perspectives and incorporation of different sources of data. In this dissertation, we use a variety of methods to analyze large sets of data in order to examine the effects of various domestic and institutional factors on scientific activities. First, we evaluate how the contributions of behavioral and social sciences to studies of health have evolved over time. We use data analytics to conduct a textual analysis of more than 200,000 publications on the topic of HIV/AIDS. We find that the focus of the scientific community within the context of the same problem varies as the societal context of the problem changes. Specifically, we uncover that the focus on the behavioral and social aspects of HIV/AIDS has increased over time and varies in different countries. Further, we show that this variation is related to the mortality level that the disease causes in each country. Second, we investigate how different sources of funding affect the science enterprise differently. We use data analytics to analyze more than 60,000 papers published on the subject of specific diseases globally and highlight the role of philanthropic money in these domains. We find that philanthropies tend to have a more practical approach in health studies as compared with public funders. We further show that they are also concerned with the economic, policy related, social, and behavioral aspects of the diseases. We uncover that philanthropies tend to mix and combine approaches and contents supported both by public and private sources of funding for science. We further show that in doing so, philanthropies tend to be closer to the position held by the public sector in the context of health studies. 
Finally, we find that studies funded by philanthropies tend to receive more citations, and hence have higher impact, in comparison to those funded by the public sector. Third, we study the effect of different schemes of funding distribution on the career of scientists. In this study, we develop a system dynamics model for analyzing a scientist's career under different funding and competition contexts. We investigate the characteristics of optimal strategies and also the equilibrium points for the cases of scientists competing for financial resources. We show that a policy to fund the best can lead scientists to spend more time on writing proposals, in order to secure funding, rather than writing papers. We find that when everyone receives funding (or has the same chance of receiving funding) the overall optimal payoff of the scientists reaches its highest level and, at this optimum, scientists spend all their time on writing papers rather than writing proposals. Our analysis suggests that more egalitarian distributions of funding result in higher overall research output by scientists. We also find that luck plays an important role in the success of scientists. We show that following the optimal strategies does not guarantee success. Due to the stochastic nature of funding decisions, some will eventually fail. The failure is not due to scientists' faulty decisions, but rather simply due to their lack of luck. / Ph. D. / Science helps us understand the world and enables us to improve how we interact with our environment. But science itself has also been the subject of inquiry by philosophers, sociologists, economists, historians, and scientists. The goal in the investigations of science has been to better understand how scientific advances occur, how to foster innovation, and how to improve the institutions that push science forward. This dissertation contributes to this area of research by asking and responding to several questions about the science enterprise.
First, we study how communities of scientists in different parts of the world look at seemingly the same problem differently. We use a computational method to read through a large set of publications on the topic of HIV/AIDS (which includes more than 200,000 papers) and uncover the topics of these papers. We find that in the context of HIV/AIDS, contributions of behavioral and social scientists have increased over time. Moreover, we show that the share of these contributions in a country's total research output differs significantly from country to country. We further find that there is a significant relationship between a country's death rate from HIV/AIDS and the share of behavioral and social studies in the overall research profile of that country on the topic of HIV/AIDS. Second, we investigate how different sources of research funding affect scientific activities differently. Specifically, we focus on the role of philanthropic money in science and its effect on the content and impact of research studies. In our analysis, we rely on computational techniques that distinguish between different themes of research in the studies of a few diseases, and also on different statistical methods. We find that philanthropies tend to have a more practical approach to health studies as compared with public sources of funding. Meanwhile, we find that they are also concerned with the economic, policy-related, social, and behavioral aspects of the diseases. Moreover, we show that philanthropies tend to mix and combine approaches and contents supported both by public and private sources of funding for science. We find that, in doing so, philanthropies tend to be closer to the position held by the public sector in the context of health studies. Finally, we show that studies funded by philanthropies tend to receive more citations. This finding suggests that these studies have a higher impact in comparison to those funded by the public sector.
Third, we study how different mechanisms for distributing research funding among scientists can affect their career and success. Many scientists must spend time on writing both papers and research grant proposals. In this work, we aim to understand how a scientist should allocate her time between these two activities to maximize her career-long number of papers. We develop a small mathematical model to capture the mechanisms related to the research career of a scientist in an academic setting. Then, for different schemes of funding distribution, we find the scientist’s time allocation that maximizes the number of papers she publishes over her career. We find that when funding is being allocated to the best scientists and best grant proposals, scientists’ best strategy is to spend more time on writing research grant proposals rather than papers. This decreases the total number of papers published by the scientists over their career. We also find that luck is important in determining the career success of scientists. Due to errors in the evaluation of proposal quality, a scientist may fail in her career regardless of whether she has followed the best strategy that she could.
95

The Salience of Issues in Parliamentary Debates : Its Development and Relation to the Support of the Sweden Democrats

Alexander, Ödlund Lindholm January 2020 (has links)
The aim of this study was to analyze the salience of issue dimensions in debates in the Swedish parliament by the established parties during the rise of the Sweden Democrats Party (SD). Structural topic modeling was used to construct a measurement of the salience of issues, examining the full body of speeches in the Swedish parliament between September 2006 and December 2019. Trend analysis revealed a realignment from a focus on socio-economic to socio-cultural issues in Swedish politics. Cross-correlation analyses had conflicting results, indicating a weak positive relationship between the salience of issues and the support of SD – but low predictive ability; they also showed that changes in the support of SD did lead (precede) changes in the salience of issues in the parliament. The ramifications of socio-cultural issues being the most salient are that so-called radical right-wing populist parties (RRPs), or neo-nationalist parties, have a greater opportunity to gain support. It can make voters more inclined to base their voting decision on socio-cultural issues, which favors parties that fight for and are trustworthy on those issues – giving them more valence in the eyes of the voters.
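The statement that one series "leads" another comes from comparing correlations at different time lags. The sketch below illustrates the idea in plain Python, as an assumption about the general technique rather than the procedure used in the study; the function names `pearson` and `lagged_correlation` and the two toy series are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def lagged_correlation(x, y, lag):
    """Correlate x[t] with y[t + lag].

    A strong correlation at a positive lag suggests that x leads
    (precedes) y by that many time steps.
    """
    if lag >= 0:
        xs, ys = x[: len(x) - lag], y[lag:]
    else:
        xs, ys = x[-lag:], y[: len(y) + lag]
    return pearson(xs, ys)


# Made-up series: `salience` repeats `support` two steps later.
support = [1, 3, 2, 5, 4, 6, 8, 7]
salience = [0, 0, 1, 3, 2, 5, 4, 6]
for lag in range(-2, 3):
    print(lag, round(lagged_correlation(support, salience, lag), 3))
```

Real poll and salience series are usually trended, so an analysis of this kind would typically difference or otherwise detrend both series before comparing lags.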
96

Topic Analysis of Tweets on the European Refugee Crisis Using Non-negative Matrix Factorization

Shen, Chong 01 January 2016 (has links)
The ongoing European Refugee Crisis has been one of the most popular trending topics on Twitter for the past 8 months. This paper applies topic modeling to large batches of tweets to discover the hidden patterns within these social media discussions. In particular, we perform topic analysis by solving Non-negative Matrix Factorization (NMF) as an Inexact Alternating Least Squares problem. We accelerate the computation using techniques including tweet sampling and augmented NMF, compare NMF results with different ranks, and visualize the outputs through topic representation and frequency plots. We observe that supportive sentiments maintained a strong presence while negative sentiments such as safety concerns have emerged over time.
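To make the factorization concrete, here is a minimal NMF sketch in plain Python. Note the substitution: it uses the classic multiplicative update rules for the Frobenius objective, not the Inexact Alternating Least Squares solver described in the abstract, because the multiplicative form is short enough to show in full. The names `nmf`, `matmul`, `transpose`, and `frobenius_error`, and the toy document-term matrix, are assumptions for illustration.

```python
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_b] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative V (docs x terms) as W (docs x rank) times H (rank x terms)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return W, H


def frobenius_error(V, W, H):
    """Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - w) ** 2 for rv, rw in zip(V, WH) for v, w in zip(rv, rw)) ** 0.5


# Hypothetical term counts: rows are tweets, columns are terms.
V = [
    [2, 1, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 1, 3],
    [0, 0, 2, 6],
]
W, H = nmf(V, rank=2)
print("reconstruction error:", frobenius_error(V, W, H))
```

Each row of `H` weights the terms of one topic and each row of `W` gives a tweet's loading on the topics. Comparing the reconstruction error across several values of `rank` is one simple way to compare factorizations of different ranks.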
97

Triple Non-negative Matrix Factorization Technique for Sentiment Analysis and Topic Modeling

Waggoner, Alexander A 01 January 2017 (has links)
Topic modeling refers to the process of algorithmically sorting documents into categories based on some common relationship between the documents. This common relationship between the documents is considered the “topic” of the documents. Sentiment analysis refers to the process of algorithmically sorting a document into a positive or negative category depending on whether the document expresses a positive or negative opinion on its respective topic. In this paper, I consider the open problem of classifying a document into both a topic category and a sentiment category. This has a direct application to the retail industry, where companies may want to scour the web to find documents (blogs, Amazon reviews, etc.) which both speak about their product and give an opinion on it (positive, negative or neutral). My solution to this problem uses a Non-negative Matrix Factorization (NMF) technique to determine the topic classifications of a document set, and further factors the matrix to discover the sentiment behind this category of product.
98

Inferência das áreas de atuação de pesquisadores / Inference of the area of expertise of researchers

Fonseca, Felipe Penhorate Carvalho da 30 January 2018 (has links)
Nowadays, a wide range of academic data is available on the web. With this information, it is possible to perform tasks such as discovering specialists in a given area, identifying potential productivity-grant holders, and suggesting collaborators, among others. However, the success of these tasks depends on the quality of the data used, since incorrect or incomplete data tend to impair the performance of the applied algorithms. Several academic data repositories do not contain or do not require explicit information about researchers' areas of expertise. In Lattes curricula this information exists, but it is entered manually by the researcher without any kind of validation (and is potentially outdated, incomplete, or even incorrect). The present work used machine learning techniques to infer researchers' areas of expertise from the data registered in the Lattes platform. The titles of scientific publications were used as the data source and were enriched with semantically related information from other databases; various representations were adopted for the text of the titles, along with other academic information such as supervisions and research projects. The objective was to evaluate whether data enrichment improves the performance of the classification algorithms tested, and to analyze the contribution of factors such as social network metrics, the language of the titles, and the hierarchical structure of the areas of expertise to the performance of the algorithms.
The proposed technique can be applied to different academic data (it is not restricted to data in the Lattes platform), but data from this platform were used to test and validate the proposed solution. As a result, the technique used to enrich the text was found not to improve the accuracy of the inference. However, social network metrics and numerical representations improved inference accuracy compared with state-of-the-art techniques, as did the use of the hierarchical class structure, which returned the best results among those obtained.
100

Modélisation thématique probabiliste des services web / Probabilistic topic modeling of web services

Aznag, Mustapha 03 July 2015 (has links)
Work on web services management generally draws on techniques from information retrieval, data mining, and linguistic analysis. Alongside these, probabilistic topic models, originally developed for extracting topics from and modeling document collections, have emerged. This thesis sits at the intersection of topic modeling and web services management. Its principal objective is to study and propose probabilistic algorithms that model the thematic structure of web services. First, we consider an unsupervised approach to tasks such as web service clustering and discovery. We then combine topic modeling with formal concept analysis to propose a novel method for hierarchical clustering of web services. This method enables a new interactive discovery approach based on specialization and generalization operators applied to the retrieved results. Finally, we propose a semi-supervised method for automatic web service annotation (automatic tagging).
We implemented our proposals in an online web services search engine called WS-Portal, which incorporates our research to facilitate web service discovery. WS-Portal contains 7063 providers, 115 sub-classes of category and 22236 web services crawled from the Internet. In WS-Portal, several techniques, namely web services clustering, tag recommendation, and service rating and monitoring, are employed to improve the effectiveness of web services discovery. We also integrate parameters such as the availability and reputation of web services, and more generally quality of service, to improve their ranking and therefore the relevance of the search results.
