1

Expressive Forms of Topic Modeling to Support Digital Humanities

Gad, Samah Hossam Aldin 15 October 2014 (has links)
Unstructured textual data is rapidly growing, and practitioners from diverse disciplines are experiencing a need to structure this massive amount of data. Topic modeling is one of the most widely used techniques for analyzing and understanding the latent structure of large text collections. Probabilistic graphical models are the main building block behind topic modeling; they are used to express assumptions about the latent structure of complex data. This dissertation addresses four problems related to drawing structure from high-dimensional data and improving the text mining process.

Studying the ebb and flow of ideas during critical events, e.g., an epidemic, is very important to understanding the reporting or coverage around the event and the impact of the event on society. This can be accomplished by capturing the dynamic evolution of topics underlying a text corpus. We approach this problem by identifying segment boundaries that mark significant shifts of topic coverage: we embed a temporal segmentation algorithm around a topic modeling algorithm to capture such shifts. A key advantage of our approach is that it integrates with existing topic modeling algorithms in a transparent manner; thus, more sophisticated algorithms can be readily plugged in as research in topic modeling evolves. We apply this algorithm to data from the iNeighbors system, analyzing six neighborhoods (three economically advantaged and three economically disadvantaged) to evaluate differences in conversations for statistical significance. Our findings suggest that social technologies may afford opportunities for democratic engagement in contexts that are otherwise less likely to support deliberation and participatory democracy. We also examine the progression in newspaper coverage of the 1918 influenza epidemic by applying our algorithm to the Washington Times archives. The algorithm is successful in identifying important qualitative features of news coverage of the pandemic.

Presenting the results of data mining algorithms and models in a visually convincing way is crucial to analyzing them and drawing conclusions. We develop ThemeDelta, a visual analytics system for extracting and visualizing temporal trends, clustering, and reorganization in time-indexed textual datasets. ThemeDelta is supported by a dynamic temporal segmentation algorithm that integrates with topic modeling algorithms to identify change points where significant shifts in topics occur. This algorithm detects not only the clustering and associations of keywords in a time period, but also their convergence into topics (groups of keywords) that may later diverge into new groups. The visual representation of ThemeDelta uses sinuous, variable-width lines to show this evolution on a timeline, using color for categories and line width for keyword strength. We demonstrate how interaction with ThemeDelta helps capture the rise and fall of topics by analyzing archives of historical newspapers, U.S. presidential campaign speeches, and social messages collected through iNeighbors. ThemeDelta is evaluated in a qualitative expert user study involving three researchers from rhetoric and history, using the historical newspapers corpus.

Time and location are key parameters in any event; neglecting them while discovering topics from a collection of documents results in missing valuable information. We propose a dynamic spatial topic model (DSTM), a true spatio-temporal model that enables disaggregating a corpus's coverage into location-based reporting and understanding how such coverage varies over time. DSTM naturally generalizes traditional spatial and temporal topic models, so that many existing formalisms can be viewed as special cases of DSTM. We demonstrate a successful application of DSTM to multiple newspapers from the Chronicling America repository, showing how our approach helps uncover key differences in the coverage of the flu as it spread through the nation, and providing possible explanations for such differences.

Major events that can change the flow of people's lives are important to predict, especially when we have powerful models and sufficient data available at our fingertips. The problem of embedding the DSTM in a predictive setting is the last part of this dissertation. To predict events and their locations across time, we present a predictive dynamic spatial topic model that can predict future topics and their locations from unseen documents. We show the applicability of our approach by applying it to streaming tweets from Latin America; the prediction approach is successful in identifying major events and their locations. / Ph. D.
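As a rough illustration of the segmentation-around-topic-modeling idea, the sketch below fits an off-the-shelf LDA model on adjacent time windows and flags a segment boundary where the averaged topic-word profile shifts sharply. This is a minimal sketch under stated assumptions, not the dissertation's algorithm: the window structure, topic count, and total-variation threshold are all illustrative.

```python
# Minimal sketch: wrap a segmentation step around an off-the-shelf topic model.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_word_matrix(docs, vectorizer, n_topics=5, seed=0):
    """Fit LDA on one time window; return row-normalized topic-word rows."""
    X = vectorizer.transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    lda.fit(X)
    comp = lda.components_
    return comp / comp.sum(axis=1, keepdims=True)

def segment_boundaries(windows, threshold=0.4):
    """Return indices of windows where topic coverage shifts sharply.

    `windows` is a list of lists of documents, one list per time slice.
    The shift score is the total-variation distance between the average
    topic-word distributions of adjacent windows (a simple stand-in for
    more principled change measures).
    """
    vectorizer = CountVectorizer(stop_words="english")
    vectorizer.fit([d for w in windows for d in w])
    profiles = [topic_word_matrix(w, vectorizer).mean(axis=0) for w in windows]
    boundaries = []
    for t in range(1, len(profiles)):
        shift = 0.5 * np.abs(profiles[t] - profiles[t - 1]).sum()
        if shift > threshold:
            boundaries.append(t)
    return boundaries
```

Because the topic model is only touched through `topic_word_matrix`, a more sophisticated model could be swapped in without changing the boundary logic, which mirrors the transparency the abstract emphasizes.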
2

Segmenting, Summarizing and Predicting Data Sequences

Chen, Liangzhe 19 June 2018 (has links)
Temporal data is ubiquitous nowadays and can easily be found in many applications. Consider the extensively studied social media website Twitter: all of its information can be associated with time stamps, and thus forms different types of data sequences, such as a sequence of feature values of users who retweet a message, a sequence of tweets from a certain user, or a sequence of evolving friendship networks. Mining these data sequences is an important task that reveals patterns in the sequences, and a challenging one, as it usually requires different techniques for different sequences. The problem becomes even more complicated when the sequences are correlated. In this dissertation, we study the following two types of data sequences, and we show how to carefully exploit within-sequence and across-sequence correlations to develop more effective and scalable algorithms.

1. Multi-dimensional value sequences: We study sequences of multi-dimensional values, where each value is associated with a time stamp. Such value sequences arise in many domains, such as epidemiology (medical records) and social media (keyword trends). Our goals are: for individual sequences, to find a segmentation of the sequence that captures where the pattern changes; for multiple correlated sequences, to use the correlations between sequences to further improve our segmentation; and to automatically find explanations of the segmentation results.

2. Social media post sequences: Driven by applications from popular social media websites such as Twitter and Weibo, we study the modeling of social media post sequences. Our goal is to understand how the posts (like tweets) are generated and how we can gain understanding of the users behind these posts. For individual social media post sequences, we study a prediction problem to find the users' latent state changes over the sequence. For dependent post sequences, we analyze the social influence among users and how it affects users in generating posts and links.

Our models and algorithms lead to useful discoveries, and they solve real problems in epidemiology, social media, and critical infrastructure systems. Further, most of the algorithms and frameworks we propose can be extended to solve sequence mining problems in other domains as well. / Ph. D.
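For the first sequence type, a classic baseline makes the segmentation goal concrete. The sketch below is a minimal dynamic-programming segmentation of a single multi-dimensional value sequence, assuming a squared-error segment cost and a fixed per-segment penalty; both are illustrative stand-ins for the dissertation's models, not its actual method.

```python
# Minimal sketch: optimal segmentation of a (T, d) value sequence by DP.
import numpy as np

def segment(X, penalty=2.0):
    """Return the start indices of an optimal segmentation of X.

    The cost of a segment is its within-segment sum of squared deviations
    from the segment mean; `penalty` discourages spurious boundaries.
    """
    T = len(X)
    def cost(i, j):                       # cost of segment X[i:j]
        seg = X[i:j]
        return ((seg - seg.mean(axis=0)) ** 2).sum()
    best = np.full(T + 1, np.inf)         # best[j]: min cost of X[:j]
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)     # backpointer to segment start
    for j in range(1, T + 1):
        for i in range(j):
            c = best[i] + cost(i, j) + penalty
            if c < best[j]:
                best[j], back[j] = c, i
    starts, j = [], T                     # recover boundaries
    while j > 0:
        starts.append(back[j])
        j = back[j]
    return sorted(starts)

# Two 3-dimensional regimes glued together; expect a boundary near index 20.
X = np.vstack([np.random.default_rng(0).normal(0, 0.1, (20, 3)),
               np.random.default_rng(1).normal(1, 0.1, (20, 3))])
print(segment(X))
```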
3

Topic modeling using latent Dirichlet allocation on disaster tweets

Patel, Virashree Hrushikesh January 1900 (has links)
Master of Science / Department of Computer Science / Cornelia Caragea / Doina Caragea / Social media has changed the way people communicate information. It has been noted that social media platforms like Twitter are increasingly being used by people and authorities in the wake of natural disasters. The year 2017 was a historic year for the USA in terms of natural calamities and associated costs. According to NOAA (National Oceanic and Atmospheric Administration), during 2017 the USA experienced 16 separate billion-dollar disaster events, including three tropical cyclones, eight severe storms, two inland floods, a crop freeze, drought, and wildfire. During natural disasters, due to the collapse of infrastructure and telecommunication, it is often hard to reach out to people in need or to determine which areas are affected. In such situations, Twitter can be a lifesaving tool for local government and search and rescue agencies. Using the Twitter streaming API service, disaster-related tweets can be collected and analyzed in real time. Although tweets received from Twitter can be sparse, noisy, and ambiguous, some may contain useful information with respect to situational awareness. For example, some tweets express emotions such as grief and anguish or call for help, other tweets provide information specific to a region, place, or person, while others simply help spread information from news or environmental agencies. To extract information useful for disaster response teams from tweets, disaster tweets need to be cleaned and classified into various categories. Topic modeling can help identify topics from a collection of such disaster tweets; subsequently, a topic (or a set of topics) will be associated with each tweet. Thus, in this report, we use Latent Dirichlet Allocation (LDA) to accomplish topic modeling for a disaster tweets dataset.
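A minimal version of the pipeline the report describes, collecting tweets, cleaning them, and fitting LDA, might look like the sketch below, assuming the gensim library; the example tweets, preprocessing choices, and topic count are illustrative assumptions rather than the thesis setup.

```python
# Minimal sketch: LDA topic modeling over a handful of toy disaster tweets.
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string

tweets = [
    "Flood waters rising near downtown, roads closed, need rescue boats",
    "Power outage across the county after the storm, crews dispatched",
    "Donate supplies at the shelter, volunteers needed for cleanup",
]

# Tokenize and clean each tweet (lowercasing, stopword/punctuation removal,
# stemming) with gensim's default preprocessing filters.
texts = [preprocess_string(t) for t in tweets]

dictionary = corpora.Dictionary(texts)           # token -> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# Each tweet can now be tagged with its dominant topic.
for t, bow in zip(tweets, corpus):
    topic, prob = max(lda.get_document_topics(bow), key=lambda x: x[1])
    print(f"topic {topic} ({prob:.2f}): {t[:40]}")
```

In practice the tweets would come from the Twitter streaming API and number in the thousands, and `num_topics` would be tuned, for example via topic coherence.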
4

Topics, Events, Stories in Social Media

Hua, Ting 05 February 2018 (has links)
The rise of big data, especially social media data (e.g., Twitter, Facebook, YouTube), offers new opportunities for understanding human behavior, and novel computing methods for mining patterns in social media data are therefore desired. Through such approaches, it has become possible to aggregate publicly available data to capture the triggers underlying events, detect ongoing trends, and forecast future happenings. This thesis focuses on developing methods for social media analysis. Specifically, five directions are proposed here: 1) semi-supervised detection of targeted-domain events, 2) topical interaction study among multiple datasets, 3) discriminative learning to identify common and distinctive topics, 4) epidemic modeling for flu forecasting that combines simulation with signals from social media data, and 5) storyline generation for massive unorganized documents. / Ph. D.
5

Probabilistic Explicit Topic Modeling

Hansen, Joshua Aaron 21 April 2013 (has links) (PDF)
Latent Dirichlet Allocation (LDA) is widely used for automatic discovery of latent topics in document corpora. However, output from analysis using an LDA topic model suffers from a lack of identifiability between topics, not only across corpora but across runs of the algorithm. The output is also isolated from enriching information from knowledge sources such as Wikipedia, and is difficult for humans to interpret due to a lack of meaningful topic labels. This thesis introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). LDA-STWD directly substitutes precomputed counts for LDA topic-word counts, leveraging existing Gibbs sampler inference. EDA defines an entirely new explicit topic model and derives the inference method from first principles. Both methods approximate topic-word distributions a priori using word distributions from Wikipedia articles, with each article corresponding to one topic and the article title used as the topic label. By this means, LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess the effectiveness of LDA-STWD and EDA on three tasks: document classification, topic label generation, and document label generation, quantifying label quality through user studies. We show that a competing non-probabilistic explicit topic model handily beats both LDA-STWD and EDA as a dimensionality reduction technique in a document classification task. Surprisingly, we find that topic labels from another approach using LDA with post hoc topic labeling (called LDA+Lau) are, on one corpus, preferred over topic labels prespecified from Wikipedia. Finally, we show that LDA-STWD improves substantially upon the performance of the state of the art in document labeling.
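The core intuition behind both methods, scoring documents against word distributions precomputed from Wikipedia articles whose titles serve as topic labels, can be sketched as follows. The toy article counts and the smoothing constant are illustrative assumptions; this is a stand-in for the idea, not the thesis's Gibbs-sampling inference.

```python
# Minimal sketch: explicit topics as smoothed word distributions from
# (toy) Wikipedia article counts; the article title is the topic label.
import math
from collections import Counter

topics = {
    "Influenza": Counter({"flu": 40, "virus": 30, "epidemic": 20, "fever": 10}),
    "Baseball":  Counter({"game": 35, "pitcher": 25, "inning": 25, "team": 15}),
}

def score(doc_tokens, counts, alpha=0.01):
    """Multinomial log-likelihood of a document under one topic's
    smoothed word distribution (Counter returns 0 for unseen words)."""
    total = sum(counts.values())
    vocab = len(counts)
    return sum(math.log((counts[w] + alpha) / (total + alpha * vocab))
               for w in doc_tokens)

doc = ["flu", "epidemic", "virus", "outbreak"]
label = max(topics, key=lambda t: score(doc, topics[t]))
print(label)  # -> "Influenza"
```

The label comes for free from the article title, which is exactly what removes the post hoc labeling step that LDA output otherwise requires.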
6

Topic Regression

Mimno, David 01 February 2012 (has links)
Text documents are generally accompanied by non-textual information, such as authors, dates, publication sources, and, increasingly, automatically recognized named entities. Work in text analysis has often involved predicting these non-text values based on text data for tasks such as document classification and author identification. This thesis considers the opposite problem: predicting the textual content of documents based on non-text data. In this work I study several regression-based methods for estimating the influence of specific metadata elements in determining the content of text documents. Such topic regression methods allow users of document collections to test hypotheses about the underlying environments that produced those documents.
7

Proximity and Innovation: Analyzing the path through topic modeling and business model design

Devigili, Matteo 13 April 2021 (has links)
This thesis aims to deepen our understanding of the relationship between the different forms of proximity that emerge between economic actors and their consequent influence on innovative capacity. Over the years, this topic has generated a great deal of attention in conference proceedings and scientific publications. The first step in making sense of this body of knowledge was to identify a suitable methodology. In so doing, recent advances from the machine learning community, particularly natural language processing researchers, offered interesting insights. In particular, "topic modeling" was identified as a suitable methodology for bringing out latent semantic structures. The first chapter therefore studies how this methodology has been implemented in the social sciences and, in particular, in management; the contribution offered is a rationalization of the achievable goals and their relationship with evaluation practices. Having clarified how to use this algorithm, the second chapter studies the relationship between proximity and innovation. Using an unsupervised machine learning technique, the research attempts to identify thematic management cores in a multifocal literature such as that on proximity. Together with a qualitative analysis, the study attempts to bring out the theoretical and empirical contributions offered to the management community. Once the theoretical and empirical expectations have been clarified, the third chapter introduces a strategic theme, namely the business model. This section proposes that the business model mediates the central relationship between proximity and innovation. After a theoretical introduction, the conceptual model is studied with an exploratory approach. Without any presumption of generalizability or completeness, a novel analytical key is offered to open further debate in the proximity community.
8

Similarity Reasoning over Semantic Context-Graphs

Boteanu, Adrian 26 August 2015 (has links)
"Similarity is a central cognitive mechanism for humans which enables a broad range of perceptual and abstraction processes, including recognizing and categorizing objects, drawing parallelism, and predicting outcomes. It has been studied computationally through models designed to replicate human judgment. The work presented in this dissertation leverages general purpose semantic networks to derive similarity measures in a problem-independent manner. We model both general and relational similarity using connectivity between concepts within semantic networks. Our first contribution is to model general similarity using concept connectivity, which we use to partition vocabularies into topics without the need of document corpora. We apply this model to derive topics from unstructured dialog, specifically enabling an early literacy primer application to support parents in having better conversations with their young children, as they are using the primer together. Second, we model relational similarity in proportional analogies. To do so, we derive relational parallelism by searching in semantic networks for similar path pairs that connect either side of this analogy statement. We then derive human readable explanations from the resulting similar path pair. We show that our model can answer broad-vocabulary analogy questions designed for human test takers with high confidence. The third contribution is to enable symbolic plan repair in robot planning through object substitution. When a failure occurs due to unforeseen changes in the environment, such as missing objects, we enable the planning domain to be extended with a number of alternative objects such that the plan can be repaired and execution to continue. To evaluate this type of similarity, we use both general and relational similarity. We demonstrate that the task context is essential in establishing which objects are interchangeable."
9

Topic modeling in marketing: recent advances and research opportunities

Reisenbichler, Martin, Reutterer, Thomas 04 1900 (has links) (PDF)
Using a probabilistic approach for exploring latent patterns in high-dimensional co-occurrence data, topic models offer researchers a flexible and open framework for soft-clustering large data sets. In recent years, there has been growing interest among marketing scholars and practitioners in adopting topic models in various marketing application domains. However, to date, there is no comprehensive overview of this rapidly evolving field. By analyzing a set of 61 published papers along with conceptual contributions, we systematically review this highly heterogeneous area of research. In doing so, we characterize extant contributions employing topic models in marketing along the dimensions of data structures and retrieval of input data, implementation and extensions of basic topic models, and model performance evaluation. Our findings confirm that considerable progress has been made in various marketing sub-areas. However, there is still scope for promising future research, in particular with respect to integrating multiple, dynamic data sources, including time-varying covariates, and combining exploratory topic models with powerful predictive marketing models.
10

The Annotation Cost of Context Switching: How Topic Models and Active Learning [May Not] Work Together

Okuda, Nozomu 01 August 2017 (has links)
The labeling of language resources is a time-consuming task, whether aided by machine learning or not. Much of the prior work in this area has focused on accelerating human annotation in the context of machine learning, yielding a variety of active learning approaches. Most of these attempt to lead an annotator to label the items that are most likely to improve the quality of an automated, machine learning-based model. Such active learning approaches seek to understand the effect of item selection on the machine learning model, but give significantly less emphasis to its effect on the human annotator. In this work, we consider a sentiment labeling task where existing, traditional active learning seems to have little or no value. We focus instead on the human annotator by ordering the items for better annotator efficiency.
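One hedged reading of "ordering the items for better annotator efficiency" is to group items by topic so an annotator labels similar items consecutively, reducing context switching. The sketch below assumes documents already carry a dominant-topic id (e.g., from a topic model); the toy data and the grouping heuristic are illustrative assumptions, not the thesis's method.

```python
# Minimal sketch: present unlabeled items topic by topic instead of in
# model-driven (context-switching) order.
from itertools import groupby

docs = [("great service", 2), ("flight delayed", 0), ("loved it", 2),
        ("lost luggage", 0), ("food was cold", 1)]

# Sort by topic id, then hand items to the annotator one topic at a time.
ordered = sorted(docs, key=lambda d: d[1])
for topic, group in groupby(ordered, key=lambda d: d[1]):
    print(topic, [text for text, _ in group])
```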
