111 |
Labeling Clinical Reports with Active Learning and Topic Modeling / Uppmärkning av kliniska rapporter med active learning och topic modeller
Lindblad, Simon. January 2018
Supervised machine learning models require a high-quality labeled data set in order to perform well. Text data is often available in abundance, but it is usually not labeled. Labeling text data is a time-consuming process, especially when multiple labels can be assigned to a single document. The purpose of this thesis was to make the labeling of clinical reports as effective and effortless as possible by evaluating different multi-label active learning strategies. The goal of the strategies was to reduce the number of labeled documents a model needs and to increase the quality of those documents. With the strategies, an accuracy of 89% was achieved with 2500 reports, compared to 85% with random sampling. In addition, 85% accuracy could be reached after labeling 975 reports, compared to 1700 reports with random sampling.
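The abstract does not say which active learning strategies were evaluated. As an illustration only, the sketch below shows one common multi-label strategy, uncertainty sampling on the least confident label. The function names, the report identifiers, and the probabilities are all hypothetical; in practice the probabilities would come from a classifier trained on the reports labeled so far.

```python
def min_margin_uncertainty(label_probs):
    """Distance of the least confident label from 0.5 (smaller = more uncertain)."""
    return min(abs(p - 0.5) for p in label_probs)


def select_for_labeling(pool_probs, batch_size):
    """Return the ids of the `batch_size` most uncertain unlabeled documents."""
    ranked = sorted(pool_probs, key=lambda doc_id: min_margin_uncertainty(pool_probs[doc_id]))
    return ranked[:batch_size]


# Hypothetical per-label probabilities for four unlabeled reports (three labels each).
pool = {
    "report_a": [0.97, 0.02, 0.91],
    "report_b": [0.55, 0.10, 0.80],
    "report_c": [0.95, 0.48, 0.05],
    "report_d": [0.70, 0.30, 0.99],
}

# The reports with a label probability closest to 0.5 are sent to the annotator first.
chosen = select_for_labeling(pool, batch_size=2)
print(chosen)
```

Random sampling, the baseline mentioned in the abstract, would instead pick reports without looking at the model's predictions.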
|
112 |
Explorer et apprendre à partir de collections de textes multilingues à l'aide des modèles probabilistes latents et des réseaux profonds / Mining and learning from multilingual text collections using topic models and word embeddings
Balikas, Georgios. 20 October 2017
Text is one of the most pervasive and persistent sources of information. Content analysis of text, in its broad sense, refers to methods for studying and retrieving information from documents. Nowadays, with ever increasing amounts of text becoming available online in several languages and different styles, content analysis of text is of tremendous importance as it enables a variety of applications. To this end, unsupervised representation learning methods such as topic models and word embeddings constitute prominent tools. The goal of this dissertation is to study and address challenging problems in this area, focusing both on the design of novel text mining algorithms and tools and on how these tools can be applied to text collections written in one or several languages.
In the first part of the thesis we focus on topic models, and more precisely on how to incorporate prior information about text structure into such models. Topic models are built on the bag-of-words premise, and therefore words are exchangeable. While this assumption benefits the calculation of the conditional probabilities, it results in a loss of information. To overcome this limitation we propose two mechanisms that extend topic models by integrating knowledge of text structure into them. We assume that the documents are partitioned into thematically coherent text segments. The first mechanism assigns the same topic to the words of a segment. The second capitalizes on the properties of copulas, a tool mainly used in economics and risk management to model the joint probability distributions of random variables while having access only to their marginals.
The second part of the thesis explores bilingual topic models for comparable corpora with explicit document alignments. Typically, a document collection for such models takes the form of comparable document pairs. The documents of a pair are written in different languages and are thematically similar. Unless they are translations, the documents of a pair are similar only to some extent. Meanwhile, representative topic models assume that the documents have identical topic distributions, which is a strong and limiting assumption. To overcome it we propose novel bilingual topic models that incorporate the notion of cross-lingual similarity of the paired documents into their generative and inference processes. Calculating this cross-lingual document similarity is a task in itself, which we propose to address using cross-lingual word embeddings.
The last part of the thesis concerns the use of word embeddings and neural networks for three text mining applications. First, we discuss polylingual document classification, where we argue that translations of a document can be used to enrich its representation. Using an auto-encoder to obtain these robust document representations, we demonstrate improvements in the task of multi-class document classification. Second, we explore multi-task sentiment classification of tweets, arguing that jointly training classification systems on correlated tasks can improve the obtained performance. To this end we show how to achieve state-of-the-art performance on a sentiment classification task using recurrent neural networks. The third application we explore is cross-lingual information retrieval. Given a document written in one language, the task consists in retrieving the most similar documents from a pool of documents written in another language. In this line of research, we show that by adapting the transportation problem to the task of estimating document distances one can achieve important improvements.
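The abstract does not give the thesis's formulation of the transportation problem. The sketch below is therefore not that method but a well-known relaxation of it: every word of the query sends all of its mass to the nearest word of the candidate document, which gives a lower bound on the full transport cost. The shared English/French embedding table, the document names, and the uniform word weights are invented for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_transport_distance(doc_a, doc_b, emb):
    """Lower bound on the transport cost from doc_a to doc_b.

    Each word of doc_a carries weight 1/len(doc_a) and moves entirely
    to its nearest word in doc_b (no capacity constraint on doc_b).
    """
    weight = 1.0 / len(doc_a)
    return sum(weight * min(euclidean(emb[w], emb[x]) for x in doc_b) for w in doc_a)


# Hypothetical cross-lingual embeddings: translations lie close together.
emb = {
    "cat": (1.0, 0.0), "chat": (1.0, 0.1),
    "dog": (0.9, 0.2), "chien": (0.9, 0.3),
    "bank": (0.0, 1.0), "banque": (0.1, 1.0),
}

query = ["cat", "dog"]                       # document in the source language
pool = {"fr_pets": ["chat", "chien"],        # candidate documents in the target language
        "fr_finance": ["banque"]}

# Retrieve the candidate with the smallest estimated distance to the query.
best = min(pool, key=lambda name: relaxed_transport_distance(query, pool[name], emb))
print(best)
```

Solving the full transportation problem would add the constraint that each target word can only receive as much mass as it holds, which requires a linear-programming or dedicated optimal-transport solver.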
|
113 |
Exploring NMF and LDA Topic Models of Swedish News Articles
Svensson, Karin; Blad, Johan. January 2020
The ability to automatically analyze and segment news articles by their content is a growing research field. This thesis explores topic modeling, an unsupervised machine learning method, applied to Swedish news articles in order to generate topics that describe and segment the articles. Specifically, the algorithms non-negative matrix factorization (NMF) and latent Dirichlet allocation (LDA) are implemented and evaluated. Their usefulness in the news media industry is assessed by their ability to serve as a uniform categorization framework for news articles. This thesis fills a research gap by studying the application of topic modeling to Swedish news articles and contributes by showing that this can yield meaningful results. It is shown that Swedish text data requires extensive data preparation for successful topic models and that nouns alone, especially common nouns, are the most suitable words to use. Furthermore, the results show that both NMF and LDA are valuable as content analysis tools and categorization frameworks, but they have different characteristics and are hence optimal for different use cases. Lastly, the conclusion is that topic models have issues, since they can generate unreliable topics that could be misleading for news consumers, but that they can nonetheless be powerful methods for organizations to analyze and segment articles efficiently on a large scale internally. The thesis project is a collaboration with one of Sweden’s largest media groups, and its results have led to a topic modeling implementation for large-scale content analysis to gain insight into readers’ interests.
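To make the NMF side of the comparison concrete, here is a minimal pure-Python sketch of non-negative matrix factorization with the classic multiplicative update rules. It is not the thesis's implementation: the tiny document-term matrix, the number of topics, the iteration count, and the random initialization are assumptions chosen for illustration, and a real pipeline would normally use a tested library implementation instead.

```python
import random


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Toy document-term counts: two documents use terms 0-1, two use terms 2-3.
v = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]

w, h = nmf(v, n_topics=2)
error = frobenius_error(v, w, h)
print(round(error, 3))
```

Each row of `h` can be read as a topic (a weighting over terms) and each row of `w` as a document's mix of topics. LDA produces a comparable pair of matrices but treats them as probability distributions, which is one source of the differing characteristics the abstract mentions.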
|
114 |
Sdílená ekonomika v kontextu postmateriálních hodnot: případ segmentu ubytování v Praze / Sharing Economy in the Context of Postmaterial Values: The Case of Accommodation Segment in Prague
Svobodová, Tereza. January 2020
This master's thesis examines the success of the sharing economy in the accommodation segment in Prague. The thesis is based on theories that conceptualize the sharing economy as the result of social and value change, not only technological change. Using online review data, the user experience of shared accommodation via Airbnb is compared with that of traditional accommodation via Booking. The analysis focuses on the needs users had satisfied and the values they had fulfilled. Text mining techniques (topic modelling and sentiment analysis) were employed to process the data. The major result is that in Prague the sharing economy models of accommodation meet the growing need in society to fulfil post-material values in the market much better than the traditional models of accommodation (hotels, hostels, boarding houses). In their experiences, Airbnb users reflect social and emotional values more often, even though most sharing economy accommodations in Prague do not involve any physical sharing with the host. The thesis thus brings a unique perspective on the Airbnb phenomenon in the Czech context and contributes to the discussion of why the market share of the sharing economy in the accommodation segment in Prague has been growing while traditional models have stagnated.
|
115 |
Neural probabilistic topic modeling of short and messy text / Neuronprobabilistisk ämnesmodellering av kort och stökig text
Harrysson, Mattias. January 2016
Exploring massive amounts of user-generated data with topics offers a new way to find useful information. The topics are assumed to be “hidden” and must be “uncovered” by statistical methods such as topic modeling. However, user-generated data is typically short and messy, e.g. informal chat conversations, heavy use of slang words, and “noise” such as URLs or other forms of pseudo-text. This type of data is difficult to process for most natural language processing methods, including topic modeling. This thesis attempts, in a comparative study, to find the approach that objectively gives the better topics from short and messy text. The compared approaches are latent Dirichlet allocation (LDA), Re-organized LDA (RO-LDA), Gaussian Mixture Model (GMM) with distributed representation of words, and a new approach based on previous work, named Neural Probabilistic Topic Modeling (NPTM). It could only be concluded that NPTM has a tendency to achieve better topics on short and messy text than LDA and RO-LDA. GMM, on the other hand, could not produce any meaningful results at all. The results are less conclusive because NPTM suffers from long running times, which prevented enough samples from being obtained for a statistical test.
|
116 |
The impact of sentiment and misinformation cycling through the social media platform, Twitter, during the initial phase of the COVID-19 vaccine rollout
Burwell, Emily Grace. 01 June 2022
No description available.
|
117 |
Exploring Hybrid Topic Based Sentiment Analysis as Author Identification Method on Swedish Documents
Jakob, Bremer. January 2021
The Swedish national bank has had shifting policies on publicity and confidentiality regarding the publishing of texts within the bank. For some time, texts written by commissioners within the bank were published anonymously. The bank later revoked the confidentiality policy, publishing all documents publicly again. This led to interest in whether commissioners' attitudes toward the topics they discuss shift when they write anonymously versus publicly. In response to a request motivated by this interest, analyses are under way using language technology, in which topics are extracted from the anonymous and the public documents respectively. The aim is to find topics related to individual commissioners in order to identify, as accurately as possible, which of the anonymous documents was written by whom. To discover unique relations between the commissioners and the generated topics, this thesis proposes hybrid topic based sentiment analysis as an author identification method, so that the sentiments of topics can be used as identifying features of commissioners. The results showed promise for the proposed approach. However, substantial further research is needed, including comparisons with other established author identification methods, to confirm its efficacy, especially on documents whose topics are closely similar.
|
118 |
Sentiment Analysis of MOOC learner reviews: What motivates learners to complete a course?
Knöös, Johanna; Rääf, Siri Amanda. January 2021
In the last decade, development of Information and Communication Technology (ICT) that supports online learning has increased the demand for e-learning and Massive Open Online Courses (MOOCs). Despite their increased popularity, MOOCs are struggling with high dropout rates, and only a small percentage of learners complete the courses they enrolled in. The purpose of this thesis is to gain knowledge about MOOC learner behaviour. The aim of the study is to identify the motivations of learners and how these differ between learners who completed a course and those who dropped out. Research on MOOC learners has mostly been carried out using a quantitative approach. While quantitative methodologies are effective in handling the large amount of data produced by MOOCs, qualitative methods can give deeper insights into online learners’ motivations. Therefore, this thesis employs an explanatory sequential mixed methods design, in which sentiment analysis and topic modeling of learner reviews from the platform Coursera are further explained by qualitative interviews with MOOC learners. In the study, 28,000 reviews scraped from five courses within the field of data science were analyzed, and ten interviews were held with learners who had completed a MOOC, dropped out of one, or both. In the quantitative analysis, nine course factors were found that learners wrote about: content, delivery, assessment, learning experience, tools, video material, teaching style, instructor skills and course provider. In addition, eighteen themes emerged from the interviews: self-discipline, just for fun, certificates, personal development, knowledge, career, time, equipment, practical exercise, interaction, instructor, reality, structure, external material, cost, community, degree of difficulty and other. In the discussion, the empirical findings are reflected upon using the theoretical framework of the research and the literature review. The result does not reveal any differences in motivations between learners who completed a course and those who dropped out; however, it does identify factors that caused learners to drop out and the topics that most negative learner reviews were about. This research contributes to the body of knowledge on MOOC learner retention and motivations. The topic is relevant for research in education informatics and for continued improvements in the delivery of MOOCs.
|
119 |
Text Mining for Social Harm and Criminal Justice Applications
Pandey, Ritika. 08 1900
Indiana University-Purdue University Indianapolis (IUPUI) / Increasing rates of social harm events and a plethora of text data call for text mining techniques, not only to better understand the causes of these events but also to develop optimal prevention strategies. In this work, we study three social harm issues: crime topic models, transitions into drug addiction, and homicide investigation chronologies. Topic modeling for the categorization and analysis of crime report text allows for more nuanced categories of crime than the official UCR categorizations. This study has important implications for hotspot policing. We investigate the extent to which topic models that improve coherence lead to higher levels of crime concentration. We further explore transitions into drug addiction using Reddit data. We propose a prediction model to classify users’ transitions from a casual drug discussion forum to a recovery drug discussion forum and to estimate the likelihood of such transitions. Through this study we offer insights into modern drug culture and provide tools with potential applications in combating opioid crises. Lastly, we present a knowledge graph based framework for homicide investigation chronologies that may aid investigators in analyzing homicide case data and also allow for post hoc analysis of key features that determine whether a homicide is ultimately solved. For this purpose we perform named entity recognition to identify witnesses, detectives and suspects in the chronology, use keyword expansion to identify various evidence types, and finally link these entities and evidence to construct a homicide investigation knowledge graph. We compare performance across several choices of methodology for these sub-tasks and analyze the association between network statistics of the knowledge graph and homicide solvability.
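The final linking step can be pictured with a small sketch. It assumes the named entity recognition and keyword expansion have already produced a list of typed entities for each chronology entry, and it links entities that appear in the same entry. That co-occurrence rule, the entity labels, and the toy entries are assumptions made for illustration, not the framework described in the thesis.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(entries):
    """Undirected graph linking every pair of entities that share a chronology entry."""
    graph = defaultdict(set)
    for entities in entries:
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical output of the extraction steps, one list per chronology entry.
entries = [
    ["detective:D1", "witness:W1", "evidence:phone"],
    ["detective:D1", "suspect:S1"],
    ["witness:W1", "suspect:S1", "evidence:video"],
]

graph = build_graph(entries)

# Node degree is one simple network statistic that could be compared across cases.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
print(sorted(degree.items()))
```

Other statistics, such as the number of evidence nodes or the graph's density, could be computed from the same structure and related to whether a case was solved.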
|
120 |
How much do you care about education? Exploring fluctuations of public interest in education issues among top national priorities in the U.S.
Nehoran, Dana. 01 January 2020
It is well known that a strong education system produces citizens who are more engaged in civil and social duties, with obvious benefits to society and the individuals. Policymakers who have the power to help improve the education system frequently rely on the news or the polls to better understand the issues involved, but these tools are often unable to answer customized questions on the public view with a large enough coverage.
Monitoring the American public interest in education over the years is not new. In fact, a number of national polling agencies have tracked education as part of their larger polls asking people to name the most burning issues facing the US. While these polls provide a fair indication of the changes in importance of education in the eyes of the public, they do not identify the factors which have historically been associated with the major fluctuations of such importance. Most importantly, these traditional national polls do not track public concern about specific subtopics within education.
This mixed methods study includes the creation of a software instrument with the objective of exploring the salience of education as a national priority over time and analyzing the possible factors associated with these fluctuations of interest. In addition to discovering the most prominent latent subtopics affecting education (such as academic achievement, sexual assault and freedom of speech), this study also seeks national-level issues that may have recently been associated with the largest declines.
The only source of data utilized is the text of tens of thousands of published news articles. Terms extracted from the text using natural language processing serve as the basis for automated qualitative analysis. As topics emerge from the data, the frequencies of the terms are utilized to associate the articles with the most relevant ones.
The analysis shows that public interest in education has declined the most during election times. It is also found that the areas that contributed the most during the largest surges of public interest in education from 2015 to 2020 were school budget, academic achievement gaps and mental health.
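The abstract says that term frequencies are used to associate articles with the most relevant topics but gives no further detail. A minimal sketch of one way this could work, assuming each topic is represented by a set of characteristic terms, is shown below. The topic names echo subtopics mentioned in the abstract, while the term lists, the sample sentence, and the scoring rule are invented for illustration.

```python
from collections import Counter


def assign_topic(article_text, topic_terms):
    """Score each topic by how often its terms occur and return the best topic."""
    counts = Counter(article_text.lower().split())
    scores = {topic: sum(counts[term] for term in terms)
              for topic, terms in topic_terms.items()}
    return max(scores, key=scores.get), scores


# Hypothetical term lists for two education subtopics.
topic_terms = {
    "school budget": {"budget", "funding", "tax"},
    "mental health": {"anxiety", "counseling", "stress"},
}

article = ("district leaders debate budget cuts as school funding falls "
           "and counseling staff shrink")

topic, scores = assign_topic(article, topic_terms)
print(topic, scores)
```

Counting how many articles fall under each topic per month would then give a time series of the kind needed to track surges and declines in public interest.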
|