1. Topic modeling using latent Dirichlet allocation on disaster tweets (Patel, Virashree Hrushikesh, January 1900)
Master of Science / Department of Computer Science / Cornelia Caragea / Doina Caragea / Social media has changed the way people communicate information. It has been noted that social media platforms like Twitter are increasingly being used by people and authorities in the wake of natural disasters. The year 2017 was a historic year for the USA in terms of natural calamities and associated costs. According to NOAA (National Oceanic and Atmospheric Administration), during 2017 the USA experienced 16 separate billion-dollar disaster events, including three tropical cyclones, eight severe storms, two inland floods, a crop freeze, drought, and wildfire. During natural disasters, due to the collapse of infrastructure and telecommunication, it is often hard to reach out to people in need or to determine what areas are affected. In such situations, Twitter can be a lifesaving tool for local government and search and rescue agencies. Using the Twitter streaming API service, disaster-related tweets can be collected and analyzed in real time. Although tweets received from Twitter can be sparse, noisy, and ambiguous, some may contain useful information with respect to situational awareness. For example, some tweets express emotions such as grief, anguish, or calls for help; other tweets provide information specific to a region, place, or person; while others simply help spread information from news or environmental agencies. To extract information useful for disaster response teams from tweets, disaster tweets need to be cleaned and classified into various categories. Topic modeling can help identify topics from the collection of such disaster tweets. Subsequently, a topic (or a set of topics) will be associated with a tweet. Thus, in this report, we use Latent Dirichlet Allocation (LDA) to accomplish topic modeling for a disaster tweets dataset.
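As a rough illustration of the kind of pipeline this abstract describes, the sketch below fits an LDA model to a handful of already-cleaned tweet strings with scikit-learn and prints the top words per topic. The tweet texts, the number of topics, and the preprocessing choices are placeholders, not the report's actual setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder tweets standing in for a cleaned disaster-tweet dataset.
tweets = [
    "flood waters rising near the river bridge need rescue",
    "donate blankets and food to the shelter downtown",
    "power lines down after the storm stay away from the area",
    "praying for everyone affected by the hurricane tonight",
]

# Bag-of-words representation; stop-word removal as a simple cleaning step.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

# Fit LDA with a small, arbitrary number of topics for illustration.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-tweet topic mixtures

# Show the top words for each inferred topic.
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [vocab[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top)}")
```

Each row of `doc_topics` is the topic mixture that would let a tweet be associated with one or more topics, as described above.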
2. Comparing Latent Dirichlet Allocation and Latent Semantic Analysis as Classifiers (Anaya, Leticia H., December 1900)
In the Information Age, a proliferation of unstructured electronic text documents exists. Processing these documents by humans is a daunting task, as humans have limited cognitive abilities for processing large volumes of documents that can often be extremely lengthy. To address this problem, text data computer algorithms are being developed. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are two text data computer algorithms that have received much attention individually in the text data literature for topic extraction studies, but not for document classification or for comparison studies. Since classification is considered an important human function and has been studied in the areas of cognitive science and information science, in this dissertation a research study was performed to compare LDA, LSA, and humans as document classifiers. The research questions posed in this study are: R1: How accurate are LDA and LSA in classifying documents in a corpus of textual data over a known set of topics? R2: How accurate are humans in performing the same classification task? R3: How does LDA classification performance compare to LSA classification performance? To address these questions, a classification study involving human subjects was designed in which humans were asked to generate and classify documents (customer comments) at two levels of abstraction for a quality assurance setting. Two computer algorithms, LSA and LDA, were then used to perform classification on these documents. The results indicate that humans outperformed both computer algorithms, with an accuracy rate of 94% at the higher level of abstraction and 76% at the lower level of abstraction. At the higher level of abstraction, the accuracy rates were 84% for both LSA and LDA; at the lower level, the accuracy rates were 67% for LSA and 64% for LDA. The findings of this research have strong implications for the improvement of information systems that process unstructured text. Document classifiers have potential applications in many fields (e.g., fraud detection, information retrieval, national security, and customer management). Development and refinement of algorithms that classify text is a fruitful area of ongoing research, and this dissertation contributes to this area.
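A minimal sketch of one common way to turn LSA and LDA into document classifiers is shown below: reduce a bag-of-words matrix with truncated SVD (LSA) or LDA and feed the reduced features to the same linear classifier. The public 20-newsgroups data, the feature counts, and the classifier choice are assumptions for illustration only; the dissertation's own classification procedure and customer-comment data are not reproduced here.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Public dataset used purely as a stand-in for the study's customer comments.
data = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])

def make_classifier(reducer):
    """Bag-of-words -> dimensionality reduction -> linear classifier."""
    return make_pipeline(
        CountVectorizer(stop_words="english", max_features=5000),
        reducer,
        LogisticRegression(max_iter=1000),
    )

lsa_clf = make_classifier(TruncatedSVD(n_components=20))                               # LSA features
lda_clf = make_classifier(LatentDirichletAllocation(n_components=20, random_state=0))  # LDA features

for name, clf in [("LSA", lsa_clf), ("LDA", lda_clf)]:
    scores = cross_val_score(clf, data.data, data.target, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```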
3. Tag recommendation using Latent Dirichlet Allocation (Choubey, Rahul, January 1900)
Master of Science / Department of Computing and Information Sciences / Doina Caragea / The vast amount of data present on the internet calls for ways to label and organize this data according to specific categories, in order to facilitate search and browsing activities. This can be easily accomplished by making use of folksonomies and user-provided tags. However, it can be difficult for users to provide meaningful tags. Tag recommendation systems can guide users towards informative tags for online resources such as websites, pictures, etc. The aim of this thesis is to build a system for recommending tags to URLs available through a bookmark sharing service called BibSonomy. We assume that the URLs for which we recommend tags do not have any prior tags assigned to them.

Two approaches are proposed to address the tagging problem, both of them based on Latent Dirichlet Allocation (LDA) Blei et al. [2003]. LDA is a generative and probabilistic topic model which aims to infer the hidden topical structure in a collection of documents. According to LDA, documents can be seen as mixtures of topics, while topics can be seen as mixtures of words (in our case, tags). The first approach that we propose, called the topic words based approach, recommends the top words in the top topics representing a resource as tags for that particular resource. The second approach, called the topic distance based approach, uses the tags of the most similar training resources (identified using the KL divergence Kullback and Leibler [1951]) to recommend tags for a test untagged resource.

The dataset used in this work was made available through the ECML/PKDD Discovery Challenge 2009. We construct the documents that are provided as input to LDA in two ways, thus producing two different datasets. In the first dataset, we use only the description and the tags (when available) corresponding to a URL. In the second dataset, we crawl the URL content and use it to construct the document. Experimental results show that the LDA approach is not very effective at recommending tags for new untagged resources. However, using the resource content gives better results than using the description only. Furthermore, the topic distance based approach is better than the topic words based approach when only the descriptions are used to construct documents, while the topic words based approach works better when the contents are used to construct documents.
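The sketch below illustrates the flavour of the topic distance based approach: given the LDA topic distribution of an untagged resource, find the closest tagged training resources by KL divergence and pool their tags. The toy distributions, tag lists, and the way neighbours' tags are ranked are assumptions; the thesis's actual ranking may differ.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete topic distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def recommend_tags(test_theta, train_thetas, train_tags, k=3, n_tags=5):
    """Recommend tags from the k training resources closest in topic space.

    test_theta   : topic distribution of the untagged resource (from LDA inference)
    train_thetas : topic distributions of training resources
    train_tags   : tag lists aligned with train_thetas
    """
    dists = [kl_divergence(test_theta, theta) for theta in train_thetas]
    nearest = np.argsort(dists)[:k]
    # Pool the neighbours' tags and rank them by how often they occur.
    counts = {}
    for i in nearest:
        for tag in train_tags[i]:
            counts[tag] = counts.get(tag, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:n_tags]

# Toy example with 3 topics and hypothetical tagged training resources.
train_thetas = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
train_tags = [["python", "programming"], ["cooking", "recipes"], ["travel", "europe"]]
print(recommend_tags([0.7, 0.2, 0.1], train_thetas, train_tags, k=1))
```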
4. Towards generic relation extraction (Hachey, Benjamin, January 2009)
A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational database that can be more effectively used for querying and automated reasoning. However, adapting conventional relation extraction systems to new domains or tasks requires significant effort from annotators and developers. Furthermore, previous adaptation approaches based on bootstrapping start from example instances of the target relations, thus requiring that the correct relation type schema be known in advance. Generic relation extraction (GRE) addresses the adaptation problem by applying generic techniques that achieve comparable accuracy when transferred, without modification of model parameters, across domains and tasks. Previous work on GRE has relied extensively on various lexical and shallow syntactic indicators. I present new state-of-the-art models for GRE that incorporate governor-dependency information. I also introduce a dimensionality reduction step into the GRE relation characterisation sub-task, which serves to capture latent semantic information and leads to significant improvements over an unreduced model. Comparison of dimensionality reduction techniques suggests that latent Dirichlet allocation (LDA), a probabilistic generative approach, successfully incorporates a larger and more interdependent feature set than a model based on singular value decomposition (SVD) and performs as well as or better than SVD in all experimental settings. Finally, I introduce multi-document summarisation as an extrinsic test bed for GRE and present results which demonstrate that the relative performance of GRE models is consistent across tasks and that the GRE-based representation leads to significant improvements over a standard baseline from the literature. Taken together, the experimental results 1) show that GRE can be improved using dependency parsing and dimensionality reduction, 2) demonstrate the utility of GRE for the content selection step of extractive summarisation, and 3) validate the GRE claim of modification-free adaptation for the first time with respect to both domain and task. This thesis also introduces datasets derived from publicly available corpora for the purpose of rigorous intrinsic evaluation in the news and biomedical domains.
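For a rough sense of what such a dimensionality reduction step might look like, the sketch below reduces a small bag-of-features matrix for relation mentions with SVD and with LDA and clusters the reduced vectors. The feature strings, component counts, and the use of k-means are placeholders for illustration; they are not the thesis's experimental setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.cluster import KMeans

# Each string stands in for the feature set extracted for one entity-pair mention
# (words between the entities, dependency-path labels, etc.); purely illustrative.
mentions = [
    "works_for employee of dep:nsubj dep:prep_of",
    "ceo of head of dep:appos",
    "encodes gene protein dep:dobj",
    "expressed by protein gene dep:nsubjpass",
]

X = CountVectorizer(token_pattern=r"\S+").fit_transform(mentions)

for name, reducer in [("SVD", TruncatedSVD(n_components=2)),
                      ("LDA", LatentDirichletAllocation(n_components=2, random_state=0))]:
    Z = reducer.fit_transform(X)                       # reduced representation of each mention
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
    print(name, labels)                                # cluster ids as crude relation types
```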
5. Functionality Classification Filter for Websites (Järvstråt, Lotta, January 2013)
The objective of this thesis is to evaluate different models and methods for website classification. The websites are classified based on their functionality, in this case specifically whether they are forums, news sites, or blogs. The analysis aims at solving a search engine problem, which means that it is interesting to know from which categories in an information search the results come. The data consists of two datasets, extracted from the web in January and April 2013. Together these datasets consist of approximately 40,000 observations, with each observation being the extracted text from a website. Approximately 7,000 new word variables were subsequently created from this text, as were variables based on Latent Dirichlet Allocation. One variable (the number of links) was created using the HTML code for the website. These datasets are used both in multinomial logistic regression with Lasso regularization and to create a Naive Bayes classifier. The best classifier for the data studied was achieved when using multinomial logistic regression with the Lasso applied to all variables to reduce the number of variables. The accuracy of this model is 99.70%. When the time dependency of the models is considered, using the first dataset to build the model and the second dataset for testing, the accuracy, however, is only 90.74%. This indicates that the data is time dependent and that website topics change over time.
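A minimal sketch of the modelling idea, multinomial logistic regression with an L1 (Lasso) penalty over word features, is shown below using scikit-learn. The example texts, labels, and solver settings are assumptions for illustration; the thesis's full feature set (word indicators, LDA variables, link counts) is not reproduced.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder documents: extracted text from websites labelled by functionality.
texts = [
    "reply thread posted by user forum board topic",
    "breaking news reporter published article today",
    "my weekend thoughts blog post comments welcome",
    "join the discussion quote reply moderator",
]
labels = ["forum", "news", "blog", "forum"]

# The L1 (Lasso) penalty drives many word weights to exactly zero,
# which is how the variable reduction described above can be obtained.
clf = make_pipeline(
    CountVectorizer(),
    LogisticRegression(penalty="l1", solver="saga", max_iter=5000),
)
clf.fit(texts, labels)
print(clf.predict(["live coverage of the election results article"]))
```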
6. Methodology for the analysis of large volumes of information applied to medical research in Chile (Clavijo García, David Mauricio, January 2017)
Magíster en Ingeniería de Negocios con Tecnología de Información / Medical knowledge has accumulated in scientific research articles over time; consequently, there is growing interest in developing text mining methodologies to extract, structure, and analyze the knowledge obtained from large volumes of information in the shortest possible time. This work presents a methodology that achieves this objective using the LDA (Latent Dirichlet Allocation) model. The methodology consists of three steps: first, identify relevant topics in medical scientific research articles from the Revista Médica de Chile (2012-2015); second, identify and interpret the relationships among the resulting topics using visualization methods (LDAvis); third, evaluate characteristics of the research itself, in this case directed funding, using the two previous steps. The results show that this methodology is effective not only for the analysis of medical scientific research articles but can also be used in other fields of science. Additionally, the method makes it possible to analyze and interpret the state of medical research at the national level, using the Revista Médica de Chile as a reference.
Within this context, it is important to consider the planning, management, and production processes of scientific research within hospitals, which have been standard-bearers of knowledge generation since they function as university campuses of tradition and innovation. For this reason, an analysis of the healthcare sector environment, its structure, and the possibility of applying the methodology proposed in this work is carried out, based on the strategic planning and business model of the Hospital Exequiel González Cortés.
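A minimal sketch of the first two steps (fit LDA, then inspect it with LDAvis) is given below using gensim and pyLDAvis. The tokenised abstracts and topic count are placeholders, and the pyLDAvis submodule name is an assumption: it is `pyLDAvis.gensim_models` in recent releases but `pyLDAvis.gensim` in older ones.

```python
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # "pyLDAvis.gensim" in older versions

# Placeholder tokenised abstracts standing in for the journal articles.
docs = [
    ["hipertension", "tratamiento", "pacientes", "estudio"],
    ["cancer", "gastrico", "cirugia", "sobrevida"],
    ["hipertension", "riesgo", "cardiovascular", "pacientes"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Step 1: fit LDA to recover topics from the article collection.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Step 2: build the interactive LDAvis view and save it to an HTML file.
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")
```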
7. Content Management and Hashtag Recommendation in a P2P Social Networking Application (Nelaturu, Keerthi, January 2015)
In this thesis the focus is on developing an online social network application with a peer-to-peer infrastructure, motivated by the BestPeer++ architecture and the BATON overlay structure. BestPeer++ is a data processing platform which enables data sharing between enterprise systems. BATON is an open-source project which implements a peer-to-peer overlay with a balanced tree topology.

We designed and developed the components for users to manage their accounts, maintain friend relationships, and publish their content with privacy control, as well as newsfeed and notification requests, in this social networking application.

We also developed a hashtag recommendation system for this social networking application. A user may invoke a recommendation procedure while writing content. After being invoked, the recommendation procedure returns a list of candidate hashtags, and the user may select one hashtag from the list and embed it into the content. The proposed approach uses the Latent Dirichlet Allocation (LDA) topic model to derive the latent or hidden topics of different content. The LDA topic model is a well-developed data mining algorithm and is generally effective in analyzing text documents of different lengths. The topic model is further used to identify the candidate hashtags that are associated with the texts in the published content through their association with the derived hidden topics.

We considered different methods for the recommendation procedure to select candidate hashtags from different content. Some methods consider the hashtags contained in the content of the whole social network or of the user themselves; these are content-based recommendation techniques, which match the user's own profile with the profiles of items. Other methods consider the hashtags contained in the content of friends or of similar users; these are collaborative filtering based recommendation techniques, which consider the profiles of other users in the system. At the end of the recommendation procedure, the candidate hashtags are ordered by their probabilities of appearance in the content and returned to the user.

We also conducted experiments to evaluate the effectiveness of the hashtag recommendation approach. These experiments were fed with tweets published on Twitter. The hit-rate of recommendation is measured in these experiments; hit-rate is the percentage of the selected or relevant hashtags contained in the candidate hashtags. Our experimental results show that a hit-rate above 50% is observed when each recommendation method is used independently. Moreover, when both similar-user and user preferences are considered at the same time, the hit-rate improves to 87% and 92% for top-5 and top-10 candidate recommendations, respectively.
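One simplified reading of the LDA-based ordering step is sketched below: score each candidate hashtag by marginalising P(hashtag | topic) over the new content's topic mixture and return the top-k. The matrices, hashtags, and scoring formula are toy assumptions; the thesis's actual scoring and its collaborative-filtering variants are not reproduced.

```python
import numpy as np

# Toy LDA-style quantities for 3 topics and 4 candidate hashtags.
doc_topics = np.array([0.6, 0.3, 0.1])        # P(topic | new post)
hashtag_given_topic = np.array([               # P(hashtag | topic), rows = topics
    [0.50, 0.30, 0.15, 0.05],
    [0.10, 0.20, 0.40, 0.30],
    [0.25, 0.25, 0.25, 0.25],
])
hashtags = ["#flood", "#rescue", "#news", "#weather"]

# Probability of each hashtag appearing in the new content, marginalised over topics.
scores = doc_topics @ hashtag_given_topic

# Return the top-k candidates, mirroring the top-5 / top-10 lists evaluated above.
top_k = 2
ranked = [hashtags[i] for i in np.argsort(scores)[::-1][:top_k]]
print(ranked)
```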
8. Reconstructing Identities in Fake News: Comparing two Fake News Websites (Ely, Nicole, January 2020)
Since the 2016 US presidential campaign of Donald Trump, the term "fake news" has permeated mainstream discourse. The proliferation of disinformation and false narratives on social media platforms has caused concern in security circles in both the United States and the European Union. Combining latent Dirichlet allocation, a machine learning method for text mining, with themes on topical analysis, ideology, and social identity drawn from Critical Discourse theory, this thesis examines the elaborate fake news environments of two well-known English-language websites: InfoWars and Sputnik News. Through the exploration of the ideologies and social representations at play in the larger thematic structure of these websites, a picture of two very different platforms emerges: one, a white-dominant, somewhat isolationist counterculture mindset that promotes a racist and bigoted view of the world; the other, a more subtle world order-making perspective intent on reaching people in the realm of the mundane. Keywords: fake news, Sputnik, InfoWars, topical analysis, latent Dirichlet allocation
9. Meta-uncertainty and resilience with applications in intelligence analysis (Schenk, Jason Robert, 07 January 2008)
No description available.
10. Evaluating Hierarchical LDA Topic Models for Article Categorization (Lindgren, Jennifer, January 2020)
With the vast amount of information available on the Internet today, helping users find relevant content has become a prioritized task in many software products that recommend news articles. One such product is Opera for Android, which has a news feed containing articles the user may be interested in. In order to easily determine which articles to recommend, they can be categorized by the topics they contain. One approach to categorizing articles is using Machine Learning and Natural Language Processing (NLP). A commonly used model is Latent Dirichlet Allocation (LDA), which finds latent topics within large datasets of, for example, text articles. An extension of LDA is hierarchical Latent Dirichlet Allocation (hLDA), a hierarchical variant of LDA. In hLDA, the latent topics found among a set of articles are structured hierarchically in a tree. Each node represents a topic, and the levels represent different levels of abstraction in the topics. A further extension of hLDA is constrained hLDA, where a set of predefined, constrained topics are added to the tree. The constrained topics are extracted from the dataset by grouping highly correlated words. The idea of constrained hLDA is to improve the topic structure derived by an hLDA model by making the process semi-supervised. The aim of this thesis is to create an hLDA and a constrained hLDA model from a dataset of articles provided by Opera. The models are then evaluated using the novel metric word frequency similarity, which is a measure of the similarity between the words representing the parent and child topics in a hierarchical topic model. The results show that word frequency similarity can be used to evaluate whether the topics in a parent-child topic pair are too similar, so that the child does not specify a subtopic of the parent. It can also be used to evaluate whether the topics are too dissimilar, so that the topics seem unrelated and perhaps should not be connected in the hierarchy. The results also show that the two topic models created had comparable word frequency similarity scores; neither model seemed to significantly outperform the other with regard to the metric.
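The exact definition of word frequency similarity is given in the thesis itself; as one plausible reading only, the sketch below compares a parent and a child topic by the cosine similarity of their word-probability vectors over a shared vocabulary, with placeholder numbers.

```python
import numpy as np

def topic_similarity(parent_probs, child_probs):
    """Cosine similarity between two topics' word-probability vectors.

    Both dicts map words to probabilities; words missing from one topic count as 0.
    This is an assumed simplification, not the thesis's exact metric.
    """
    vocab = sorted(set(parent_probs) | set(child_probs))
    p = np.array([parent_probs.get(w, 0.0) for w in vocab])
    c = np.array([child_probs.get(w, 0.0) for w in vocab])
    return float(p @ c / (np.linalg.norm(p) * np.linalg.norm(c)))

# Toy parent/child topics from a hypothetical hLDA tree over news articles.
parent = {"sport": 0.30, "game": 0.25, "team": 0.20, "season": 0.15, "win": 0.10}
child = {"football": 0.35, "team": 0.25, "goal": 0.20, "season": 0.10, "league": 0.10}

print(f"parent-child similarity: {topic_similarity(parent, child):.2f}")
# Very high scores would suggest the child adds little beyond the parent;
# very low scores would suggest the topics are unrelated.
```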