Global ETD Search

1	New topic detection in microblogs and topic model evaluation using topical alignment Rajani, Nazneen Fatema Naushad 16 September 2014 (has links) This thesis deals with topic model evaluation and new topic detection in microblogs. Microblogs are short and thus may not carry any contextual clues. Hence it becomes challenging to apply traditional natural language processing algorithms on such data. Graphical models have been traditionally used for topic discovery and text clustering on sets of text-based documents. Their unsupervised nature allows topic models to be trained easily on datasets meant for specific domains. However the advantage of not requiring annotated data comes with a drawback with respect to evaluation difficulties. The problem aggravates when the data comprises microblogs which are unstructured and noisy. We demonstrate the application of three types of such models to microblogs - the Latent Dirichlet Allocation, the Author-Topic and the Author-Recipient-Topic model. We extensively evaluate these models under different settings, and our results show that the Author-Recipient-Topic model extracts the most coherent topics. We also addressed the problem of topic modeling on short text by using clustering techniques. This technique helps in boosting the performance of our models. Topical alignment is used for large scale assessment of topical relevance by comparing topics to manually generated domain specific concepts. In this thesis we use this idea to evaluate topic models by measuring misalignments between topics. Our study on comparing topic models reveals interesting traits about Twitter messages, users and their interactions and establishes that joint modeling on author-recipient pairs and on the content of tweet leads to qualitatively better topic discovery. This thesis gives a new direction to the well known problem of topic discovery in microblogs. Trend prediction or topic discovery for microblogs is an extensive research area. We propose the idea of using topical alignment to detect new topics by comparing topics from the current week to those of the previous week. We measure correspondence between a set of topics from the current week and a set of topics from the previous week to quantify five types of misalignments: \textit{junk, fused, missing} and \textit{repeated}. Our analysis compares three types of topic models under different settings and demonstrates how our framework can detect new topics from topical misalignments. In particular so-called \textit{junk} topics are more likely to be new topics and the \textit{missing} topics are likely to have died or die out. To get more insights into the nature of microblogs we apply topical alignment to hashtags. Comparing topics to hashtags enables us to make interesting inferences about Twitter messages and their content. Our study revealed that although a very small proportion of Twitter messages explicitly contain hashtags, the proportion of tweets that discuss topics related to hashtags is much higher. / text Topic models Topical alignment
2	Finding Expert Users in Community Question Answering Services Using Topic Models Riahi, Fatemeh 29 February 2012 (has links) Community Question Answering (CQA) websites provide a rapidly growing source of information in many areas. In most CQA implementations there is little effort in directing new questions to the right group of experts. This means that experts are not provided with questions matching their expertise. In this thesis, we propose a framework for automatically routing a newly posted question to the best suited expert. The purpose of this framework is to decrease the waiting time for a personal response. We also investigate the suitability of two statistical topic models for solving this issue and compare these methods against more traditional Information Retrieval approaches. We show that for a dataset constructed from the Stackoverflow website, these topic models outperform other methods in retrieving a set of best experts. We also show that the Segmented Topic Model gives consistently better performance compared to the Latent Dirichlet Allocation Model. Topic models, Expert Recommender
3	Using Topic Models to Support Software Maintenance Grant, Scott 30 April 2012 (has links) Latent topic models are statistical structures in which a "latent topic" describes some relationship between parts of the data. Co-maintenance is defined as an observable property of software systems under source control in which source code fragments are modified together in some time frame. When topic models are applied to software systems, latent topics emerge from code fragments. However, it is not yet known what these latent topics mean. In this research, we analyse software maintenance history, and show that latent topics often correspond to code fragments that are maintained together. Moreover, we show that latent topic models can identify such co-maintenance relationships even with no supervision. We can use this correlation both to categorize and understand maintenance history, and to predict future co-maintenance in practice. The relationship between co-maintenance and topics is directly analysed within changelists, with respect to both local pairwise code fragment similarity and global system-wide fragment similarity. This analysis is used to evaluate topic models used with a domain-specific programming language for web service similarity detection, and to estimate appropriate topic counts for modelling source code. / Thesis (Ph.D, Computing) -- Queen's University, 2012-04-30 18:16:04.05 latent topic models software engineering
4	Infinite-word topic models for digital media Waters, Austin Severn 02 July 2014 (has links) Digital media collections hold an unprecedented source of knowledge and data about the world. Yet, even at current scales, the data exceeds by many orders of magnitude the amount a single user could browse through in an entire lifetime. Making use of such data requires computational tools that can index, search over, and organize media documents in ways that are meaningful to human users, based on the meaning of their content. This dissertation develops an automated approach to analyzing digital media content based on topic models. Its primary contribution, the Infinite-Word Topic Model (IWTM), helps extend topic modeling to digital media domains by removing model assumptions that do not make sense for them -- in particular, the assumption that documents are composed of discrete, mutually-exclusive words from a fixed-size vocabulary. While conventional topic models like Latent Dirichlet Allocation (LDA) require that media documents be converted into bags of words, IWTM incorporates clustering into its probabilistic model and treats the vocabulary size as a random quantity to be inferred based on the data. Among its other benefits, IWTM achieves better performance than LDA while automating the selection of the vocabulary size. This dissertation contributes fast, scalable variational inference methods for IWTM that allow the model to be applied to large datasets. Furthermore, it introduces a new method, Incremental Variational Inference (IVI), for training IWTM and other Bayesian non-parametric models efficiently on growing datasets. IVI allows such models to grow in complexity as the dataset grows, as their priors state that they should. Finally, building on IVI, an active learning method for topic models is developed that intelligently samples new data, resulting in models that train faster, achieve higher performance, and use smaller amounts of labeled data. / text Machine learning Topic models Variational inference Bayesian nonparametrics
5	Influence modeling in behavioral data Li, Liangda 21 September 2015 (has links) Understanding influence in behavioral data has become increasingly important in analyzing the cause and effect of human behaviors under various scenarios. Influence modeling enables us to learn not only how human behaviors drive the diffusion of memes spread in different kinds of networks, but also the chain reactions evolve in the sequential behaviors of people. In this thesis, I propose to investigate into appropriate probabilistic models for efficiently and effectively modeling influence, and the applications and extensions of the proposed models to analyze behavioral data in computational sustainability and information search. One fundamental problem in influence modeling is the learning of the degree of influence between individuals, which we called social infectivity. In the first part of this work, we study how to efficient and effective learn social infectivity in diffusion phenomenon in social networks and other applications. We replace the pairwise infectivity in the multidimensional Hawkes processes with linear combinations of those time-varying features, and optimize the associated coefficients with lasso regularization on coefficients. In the second part of this work, we investigate the modeling of influence between marked events in the application of energy consumption, which tracks the diffusion of mixed daily routines of household members. Specifically, we leverage temporal and energy consumption information recorded by smart meters in households for influence modeling, through a novel probabilistic model that combines marked point processes with topic models. The learned influence is supposed to reveal the sequential appliance usage pattern of household members, and thereby helps address the problem of energy disaggregation. In the third part of this work, we investigate a complex influence modeling scenario which requires simultaneous learning of both infectivity and influence existence. Specifically, we study the modeling of influence in search behaviors, where the influence tracks the diffusion of mixed search intents of search engine users in information search. We leverage temporal and textual information in query logs for influence modeling, through a novel probabilistic model that combines point processes with topic models. The learned influence is supposed to link queries that serve for the same formation need, and thereby helps address the problem of search task identification. The modeling of influence with the Markov property also help us to understand the chain reaction in the interaction of search engine users with query auto-completion (QAC) engine within each query session. The fourth part of this work studies how a user's present interaction with a QAC engine influences his/her interaction in the next step. We propose a novel probabilistic model based on Markov processes, which leverage such influence in the prediction of users' click choices of suggested queries of QAC engines, and accordingly improve the suggestions to better satisfy users' search intents. In the fifth part of this work, we study the mutual influence between users' behaviors on query auto-completion (QAC) logs and normal click logs across different query sessions. We propose a probabilistic model to explore the correlation between user' behavior patterns on QAC and click logs, and expect to capture the mutual influence between users' behaviors in QAC and click sessions. Influence modeling Behavioral data Topic models Point process Information search
6	Content Management and Hashtag Recommendation in a P2P Social Networking Application Nelaturu, Keerthi January 2015 (has links) In this thesis focus is on developing an online social network application with a Peer-to-Peer infrastructure motivated by BestPeer++ architecture and BATON overlay structure. BestPeer++ is a data processing platform which enables data sharing between enterprise systems. BATON is an open-sourced project which implements a peer-to-peer with a topology of a balanced tree. We designed and developed the components for users to manage their accounts, maintain friend relationships, and publish their contents with privacy control and newsfeed, notification requests in this social network- ing application. We also developed a Hashtag Recommendation system for this social net- working application. A user may invoke a recommendation procedure while writing a content. After being invoked, the recommendation pro- cedure returns a list of candidate hashtags, and the user may select one hashtag from the list and embed it into the content. The proposed ap- proach uses Latent Dirichlet Allocation (LDA) topic model to derive the latent or hidden topics of different content. LDA topic model is a well developed data mining algorithm and generally effective in analyzing text documents with different lengths. The topic model is further used to identify the candidate hashtags that are associated with the texts in the published content through their association with the derived hidden top- ics. We considered different methods of recommendation approach for the pro- cedure to select candidate hashtags from different content. Some methods consider the hashtags contained in the contents of the whole social net- work or of the user self. These are content-based recommendation tech- niques which matching user’s own profile with the profiles of items.. Some methods consider the hashtags contained in contents of the friends or of the similar users. These are collaborative filtering based recommendation techniques which considers the profiles of other users in the system. At the end of the recommendation procedure, the candidate hashtags are or- dered by their probabilities of appearance in the content and returned to the user. We also conducted experiments to evaluate the effectiveness of the hashtag recommendation approach. These experiments were fed with the tweets published in Twitter. The hit-rate of recommendation is measured in these experiments. Hit-rate is the percentage of the selected or relevant hashtags contained in candidate hashtags. Our experiment results show that the hit-rate above 50% is observed when we use a method of recommendation approach independently. Also, for the case that both similar user and user preferences are considered at the same time, the hit-rate improved to 87% and 92% for top-5 and top-10 candidate recommendations respectively. Bestpeer Hashtag recommendation Topic models Latent dirichlet allocation
7	High performance latent dirichlet allocation for text mining Liu, Zelong January 2013 (has links) Latent Dirichlet Allocation (LDA), a total probability generative model, is a three-tier Bayesian model. LDA computes the latent topic structure of the data and obtains the significant information of documents. However, traditional LDA has several limitations in practical applications. LDA cannot be directly used in classification because it is a non-supervised learning model. It needs to be embedded into appropriate classification algorithms. LDA is a generative model as it normally generates the latent topics in the categories where the target documents do not belong to, producing the deviation in computation and reducing the classification accuracy. The number of topics in LDA influences the learning process of model parameters greatly. Noise samples in the training data also affect the final text classification result. And, the quality of LDA based classifiers depends on the quality of the training samples to a great extent. Although parallel LDA algorithms are proposed to deal with huge amounts of data, balancing computing loads in a computer cluster poses another challenge. This thesis presents a text classification method which combines the LDA model and Support Vector Machine (SVM) classification algorithm for an improved accuracy in classification when reducing the dimension of datasets. Based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the algorithm automatically optimizes the number of topics to be selected which reduces the number of iterations in computation. Furthermore, this thesis presents a noise data reduction scheme to process noise data. When the noise ratio is large in the training data set, the noise reduction scheme can always produce a high level of accuracy in classification. Finally, the thesis parallelizes LDA using the MapReduce model which is the de facto computing standard in supporting data intensive applications. A genetic algorithm based load balancing algorithm is designed to balance the workloads among computers in a heterogeneous MapReduce cluster where the computers have a variety of computing resources in terms of CPU speed, memory space and hard disk space. 006.3
8	Latent Dirichlet Allocation in R Ponweiser, Martin 05 1900 (has links) (PDF) Topic models are a new research field within the computer sciences information retrieval and text mining. They are generative probabilistic models of text corpora inferred by machine learning and they can be used for retrieval and text mining tasks. The most prominent topic model is latent Dirichlet allocation (LDA), which was introduced in 2003 by Blei et al. and has since then sparked off the development of other topic models for domain-specific purposes. This thesis focuses on LDA's practical application. Its main goal is the replication of the data analyses from the 2004 LDA paper ``Finding scientific topics'' by Thomas Griffiths and Mark Steyvers within the framework of the R statistical programming language and the R~package topicmodels by Bettina Grün and Kurt Hornik. The complete process, including extraction of a text corpus from the PNAS journal's website, data preprocessing, transformation into a document-term matrix, model selection, model estimation, as well as presentation of the results, is fully documented and commented. The outcome closely matches the analyses of the original paper, therefore the research by Griffiths/Steyvers can be reproduced. Furthermore, this thesis proves the suitability of the R environment for text mining with LDA. (author's abstract) / Series: Theses / Institute for Statistics and Mathematics
9	Generalized Probabilistic Topic and Syntax Models for Natural Language Processing Darling, William Michael 14 September 2012 (has links) This thesis proposes a generalized probabilistic approach to modelling document collections along the combined axes of both semantics and syntax. Probabilistic topic (or semantic) models view documents as random mixtures of unobserved latent topics which are themselves represented as probabilistic distributions over words. They have grown immensely in popularity since the introduction of the original topic model, Latent Dirichlet Allocation (LDA), in 2004, and have seen successes in computational linguistics, bioinformatics, political science, and many other fields. Furthermore, the modular nature of topic models allows them to be extended and adapted to specific tasks with relative ease. Despite the recorded successes, however, there remains a gap in combining axes of information from different sources and in developing models that are as useful as possible for specific applications, particularly in Natural Language Processing (NLP). The main contributions of this thesis are two-fold. First, we present generalized probabilistic models (both parametric and nonparametric) that are semantically and syntactically coherent and contain many simpler probabilistic models as special cases. Our models are consistent along both axes of word information in that an LDA-like component sorts words that are semantically related into distinct topics and a Hidden Markov Model (HMM)-like component determines the syntactic parts-of-speech of words so that we can group words that are both semantically and syntactically affiliated in an unsupervised manner, leading to such groups as verbs about health care and nouns about sports. Second, we apply our generalized probabilistic models to two NLP tasks. Specifically, we present new approaches to automatic text summarization and unsupervised part-of-speech (POS) tagging using our models and report results commensurate with the state-of-the-art in these two sub-fields. Our successes demonstrate the general applicability of our modelling techniques to important areas in computational linguistics and NLP. NLP topic models machine learning bayesian methods text summarization part-of-speech tagging syntax modelling
10	Tumor Gene Expression Purification Using Infinite Mixture Topic Models Deshwar, Amit Gulab 11 July 2013 (has links) There is significant interest in using gene expression measurements to aid in the personalization of medical treatment. The presence of significant normal tissue contamination in tumor samples makes it difficult to use tumor expression measurements to predict clinical variables and treatment response. I present a probabilistic method, TMMpure, to infer the expression profile of the cancerous tissue using a modified topic model that contains a hierarchical Dirichlet process prior on the cancer profiles. I demonstrate that TMMpure is able to infer the expression profile of cancerous tissue and improves the power of predictive models for clinical variables using expression profiles. Bayesian methods Gene expression purficiation Bayesian Nonparametric Topic models 0984 0800 0544

Search results