1 |
Bayesian nonparametric models for name disambiguation and supervised learning
Dai, Andrew Mingbo, January 2013
This thesis presents new Bayesian nonparametric models, and approaches for their development, for the problems of name disambiguation and supervised learning. Bayesian nonparametric methods form an increasingly popular approach for solving problems that demand a high degree of model flexibility. However, this field is relatively new, and there are many areas that need further investigation. Previous work on Bayesian nonparametrics has fully explored neither the problems of entity disambiguation and supervised learning nor the advantages of nested hierarchical models. Entity disambiguation is a widely encountered problem where different references need to be linked to a real underlying entity. This problem is often unsupervised as there is no previously known information about the entities. Furthermore, effective use of Bayesian nonparametrics offers a new approach to tackling supervised problems, which are frequently encountered. The main original contribution of this thesis is a set of new structured Dirichlet process mixture models for name disambiguation and supervised learning that can also have a wide range of applications. These models use techniques from Bayesian statistics, including hierarchical and nested Dirichlet processes, generalised linear models, Markov chain Monte Carlo methods and optimisation techniques such as BFGS. The new models have tangible advantages over existing methods in the field, as shown by experiments on real-world datasets including citation databases and classification and regression datasets. I develop the unsupervised author-topic space model for author disambiguation, which, unlike traditional author disambiguation approaches, uses free text to perform disambiguation. The model incorporates a name variant model that is based on a nonparametric Dirichlet language model. The model both handles novel, unseen name variants and can model the unknown authors of the text of the documents. 
Through this, the model can disambiguate authors with no prior knowledge of the number of true authors in the dataset. In addition, it can do this when the authors have identical names. I use a model for nesting Dirichlet processes named the hybrid NDP-HDP. This model allows Dirichlet processes to be clustered together and adds an additional level of structure to the hierarchical Dirichlet process. I also develop a new hierarchical extension to the hybrid NDP-HDP. I develop this model into the grouped author-topic model for the entity disambiguation task. The grouped author-topic model uses clusters to model the co-occurrence of entities in documents, which can be interpreted as research groups. Since this model does not require entities to be linked to specific words in a document, it overcomes the problems of some existing author-topic models. The model incorporates a new method for modelling name variants, so that domain-specific name variant models can be used. Lastly, I develop extensions to supervised latent Dirichlet allocation, a type of supervised topic model. The keyword-supervised LDA model predicts document responses more accurately by modelling the effect of individual words and their contexts directly. The supervised HDP model has more model flexibility by using Bayesian nonparametrics for supervised learning. These models are evaluated on a number of classification and regression problems, and the results show that they outperform existing supervised topic modelling approaches. The models can also be extended to use similar information to the previous models, incorporating additional information such as entities and document titles to improve prediction.
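The claim that the number of true authors need not be known in advance follows from the Dirichlet process prior. As a rough illustration only (this is not the thesis's author-topic space model), the sketch below samples cluster assignments from a Chinese restaurant process, the sequential view of a Dirichlet process. The function name `crp_assignments` and the concentration value `alpha=1.0` are assumptions made for this example.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    Hypothetical helper for illustration; alpha > 0 is the concentration
    parameter (larger values tend to open more clusters).
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = number of items already assigned to cluster k
    labels = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[k] += 1        # join an existing cluster
        labels.append(k)
    return labels, counts

labels, counts = crp_assignments(n_items=50, alpha=1.0)
print("clusters used:", len(counts))
print("cluster sizes:", counts)
```

The number of clusters is an outcome of sampling rather than an input, which is the property the abstract relies on when it disambiguates authors without a preset author count.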
|
2 |
Viewpoint and Topic Modeling of Current Events
Zhang, Kerry, January 2016
There are multiple sides to every story, and while statistical topic models have been highly successful at topically summarizing the stories in corpora of text documents, they do not explicitly address the issue of learning the different sides, the viewpoints, expressed in the documents. In this bachelor's thesis, we show how these viewpoints can be learned in a completely unsupervised manner and represented in a human-interpretable form. We take the novel approach of applying CorrLDA2 for this purpose; the model learns topic-viewpoint relations that can be used to form groups of topics, where each group represents a viewpoint. A corpus of documents about the Israeli-Palestinian conflict is then used to demonstrate how a Palestinian and an Israeli viewpoint can be learned. By leveraging the magnitudes and signs of the feature weights of a linear SVM, we introduce a principled method to evaluate associations between topics and viewpoints. With this, we demonstrate, both quantitatively and qualitatively, that the learned topic groups are contextually coherent and form consistently correct topic-viewpoint associations.
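The abstract does not spell out how the SVM weights are combined, so the following is only one plausible reading, offered as a sketch: each topic is scored by the probability-weighted sum of the linear SVM weights of its words, the sign of the score picks the viewpoint, and the magnitude indicates how strong the association is. The function names, the placeholder words `w1` to `w4`, and all numeric values are hypothetical and are not taken from the thesis.

```python
def topic_viewpoint_score(topic_word_probs, svm_weights):
    """Probability-weighted sum of SVM feature weights over a topic's words.

    Assumption for this sketch: positive weights point to one viewpoint,
    negative weights to the other; words without a weight contribute 0.
    """
    return sum(p * svm_weights.get(word, 0.0)
               for word, p in topic_word_probs.items())

def viewpoint_of(score, positive="viewpoint_A", negative="viewpoint_B"):
    """Map the sign of a score to a viewpoint label."""
    if score > 0:
        return positive
    if score < 0:
        return negative
    return "undecided"

# Made-up linear SVM weights for four placeholder words.
svm_weights = {"w1": 1.2, "w2": 0.4, "w3": -0.9, "w4": -1.1}

# Made-up topics, each given as word -> probability within the topic.
topics = {
    "topic_0": {"w1": 0.5, "w2": 0.3, "w3": 0.2},
    "topic_1": {"w3": 0.4, "w4": 0.4, "w2": 0.2},
}

scores = {name: topic_viewpoint_score(words, svm_weights)
          for name, words in topics.items()}
for name, score in scores.items():
    print(name, round(score, 3), viewpoint_of(score))
```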
|
3 |
STUDY ON PARALLELIZING PARTICLE FILTERS WITH APPLICATIONS TO TOPIC MODELS
Ding, Erli, 01 June 2016
No description available.
|
4 |
Probabilistic topic models for sentiment analysis on the Web
Chenghua, Lin, January 2011
Sentiment analysis aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text, and has received a rapid growth of interest in natural language processing in recent years. Probabilistic topic models, on the other hand, are capable of discovering hidden thematic structure in large archives of documents, and have been an active research area in the field of information retrieval. The work in this thesis focuses on developing topic models for automatic sentiment analysis of web data, by combining the ideas from both research domains. One noticeable issue of most previous work in sentiment analysis is that the trained classifier is domain dependent, and the labelled corpora required for training could be difficult to acquire in real world applications. Another issue is that the dependencies between sentiment/subjectivity and topics are not taken into consideration. The main contribution of this thesis is therefore the introduction of three probabilistic topic models, which address the above concerns by modelling sentiment/subjectivity and topic simultaneously. The first model is called the joint sentiment-topic (JST) model based on latent Dirichlet allocation (LDA), which detects sentiment and topic simultaneously from text. Unlike supervised approaches to sentiment classification which often fail to produce satisfactory performance when applied to new domains, the weakly-supervised nature of JST makes it highly portable to other domains, where the only supervision information required is a domain-independent sentiment lexicon. Apart from document-level sentiment classification results, JST can also extract sentiment-bearing topics automatically, which is a distinct feature compared to the existing sentiment analysis approaches. The second model is a dynamic version of JST called the dynamic joint sentiment-topic (dJST) model. 
dJST respects the ordering of documents, and allows the analysis of topic and sentiment evolution of document archives that are collected over a long time span. By accounting for the historical dependencies of documents from the past epochs in the generative process, dJST gives a richer posterior topical structure than JST, and can better respond to the permutations of topic prominence. We also derive online inference procedures based on a stochastic EM algorithm for efficiently updating the model parameters. The third model is called the subjectivity detection LDA (subjLDA) model for sentence-level subjectivity detection. Two sets of latent variables were introduced in subjLDA. One is the subjectivity label for each sentence; another is the sentiment label for each word token. By viewing the subjectivity detection problem as weakly-supervised generative model learning, subjLDA significantly outperforms the baseline and is comparable to the supervised approach which relies on much larger amounts of data for training. These models have been evaluated on real world datasets, demonstrating that joint sentiment topic modelling is indeed an important and useful research area with much to offer in the way of good results.
|
5 |
A identidade da União Europeia e a segurança internacional: análise de discurso da região euromediterrânea / European unions identity and international security: discourse analysis from the Euromediterranean
Nicolau, Guilherme Giuliano, 26 November 2015
The dissertation maps the formation of the international identity of the European Union through its international security architecture, one of whose nodes is the securitization of immigration, using non-traditional methodological tools to confirm our thesis. The first part of the work is a theoretical framework: we discuss the linguistic turn in international relations to understand the intersubjectivity between researcher and object, and so we choose reflexivity as our methodological approach; then we discuss the European schools of international security of the post-Cold War period, such as the Copenhagen School, the Wales Critical School and the Paris School, presenting concepts and objects studied by specialists that are valuable to us for understanding our study and for placing our research within its epistemic community; finally, we discuss and incorporate concepts and approaches from Discourse Theory (the work of Ernesto Laclau, Chantal Mouffe and the Essex School) to build a chronological and geodiscursive construction of the Euro-Mediterranean region. In the second part, we reconstruct, historically and institutionally, the European international security architecture from the post-war period to today vis-à-vis its migration policies, noting their correlations and focusing on a detailed analysis of the main official security documents. The final quantitative part (our original contribution) seeks to confirm the causality of the security-immigration link in the European architecture; for this, we use computational linguistics for semi-automated semantic analysis, specifically topic models; we analyze around 20,000 official security documents of the European Union to identify statistics, agents, institutions, agendas and discourses that confirm our thesis.
|
7 |
STUDYING SOFTWARE QUALITY USING TOPIC MODELS
Chen, TSE-HSUN, 14 January 2013
Software is an integral part of our everyday lives, and hence the quality of software is very important. However, improving and maintaining high software quality is a difficult task, and a significant amount of resources is spent on fixing software defects. Previous studies have studied software quality using various measurable aspects of software, such as code size and code change history. Nevertheless, these metrics do not consider all possible factors that are related to defects. For instance, while lines of code may be a good general measure for defects, a large file responsible for simple I/O tasks is likely to have fewer defects than a small file responsible for complicated compiler implementation details. In this thesis, we address this issue by considering the conceptual concerns (or features). We use a statistical topic modelling approach to approximate the conceptual concerns as topics. We then use topics to study software quality along two dimensions: code quality and code testedness. We perform our studies using three versions of four large real-world software systems: Mylyn, Eclipse, Firefox, and NetBeans.
Our proposed topic metrics help improve the defect explanatory power (i.e., fitness of the regression model) of traditional static and historical metrics by 4–314%. We compare one of our metrics, which measures the cohesion of files, with other topic-based cohesion and coupling metrics in the literature and find that our metric gives the greatest improvement in explaining defects over traditional software quality metrics (i.e., lines of code) by 8–55%.
We then study how we can use topics to help improve the testing processes. By training on previous releases of the subject systems, we can predict topics that are not well tested and are defect-prone in future releases, with a precision and recall of 0.77 and 0.75, respectively. We can map these topics back to files and help allocate code inspection and testing resources. We show that our approach outperforms traditional prediction-based resource allocation approaches in terms of saving testing and code inspection efforts.
The results of our studies show that topics can be used to study software quality and support traditional quality assurance approaches. / Thesis (Master, Computing) -- Queen's University, 2013-01-08 10:10:37.878
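For readers unfamiliar with the precision and recall figures quoted in this abstract, the sketch below shows how the two measures are computed from a set of predicted items and a set of actual items. The topic identifiers are made up for the example and do not come from the thesis; the function name `precision_recall` is likewise only illustrative.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of item identifiers.

    precision = correct predictions / all predictions
    recall    = correct predictions / all actual positives
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical example: topics flagged as under-tested and defect-prone
# versus the topics that actually turned out to be so in the next release.
predicted_topics = {"t1", "t2", "t3", "t4"}
actual_topics = {"t1", "t2", "t3", "t5"}

precision, recall = precision_recall(predicted_topics, actual_topics)
print(precision, recall)
```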
|
8 |
topicmodels: An R Package for Fitting Topic Models
Hornik, Kurt, Grün, Bettina, January 2011
Topic models allow the probabilistic modeling of term frequency occurrences in documents.
The fitted model can be used to estimate the similarity between documents as
well as between a set of specified keywords using an additional layer of latent variables
which are referred to as topics. The R package topicmodels provides basic infrastructure
for fitting topic models based on data structures from the text mining package tm. The
package includes interfaces to two algorithms for fitting topic models: the variational
expectation-maximization algorithm provided by David M. Blei and co-authors and an
algorithm using Gibbs sampling by Xuan-Hieu Phan and co-authors.
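
To make the Gibbs-sampling route concrete, here is a minimal collapsed Gibbs sampler for LDA written in plain Python. It is a teaching sketch under stated assumptions, not the topicmodels package's interface and not the implementation by Phan and co-authors: the function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus of integer word ids are all invented for this example.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on docs given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]               # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                              # n(k)
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional p(z = t | rest), up to a constant factor.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the new assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Toy corpus: word ids 0-2 and 3-5 tend to appear in different documents.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 0], [4, 5, 3, 3]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=6)
print(doc_topic)
```

The returned count tables can be normalised (after adding `alpha` and `beta`) to obtain per-document topic proportions and per-topic word distributions.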
|
9 |
An Approach to eBook Topics Trend Discovery Based on LDA and Usage Log
Hung, Chung-yang, 13 February 2012
With the growth of the digital content industry, publishers have started to provide online services for ebook search, reading and downloading. Users can access online resources from anywhere, at any time, with a laptop or mobile device. More and more libraries now purchase ebooks as an important part of the library collection. To access the online resources, users can link directly to the publisher's ebook portal or go through the OPAC system. Compared to the library circulation process, ebooks are more convenient for patrons and improve the utilization of library online resources.
There are various kinds of ebooks available on the market, so libraries have to focus their investment on the most valuable online resources. Usage statistics reports play an important role in providing valuable information to libraries. These reports are usually generated according to the COUNTER standard. Although such a report shows when and where users access specific ebooks, it fails to show the general topics and how they change.
In this study, we introduce a post-processing method that weights the LDA topic model using the usage statistics report in order to emphasize changes in topics, and we compare it to the classification method and the subject heading method used in the bibliographic record, namely LCC and LCSH respectively. The results show that the weighted topic model significantly affects the ranking of topics, and that the topic model is independent of the classification method and the subject heading method in the bibliographic record.
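The abstract does not give the weighting formula, so the sketch below shows only one simple way such a post-processing step could work: each ebook's topic proportions are weighted by its usage count before averaging, so heavily used titles dominate the topic ranking. The function name `usage_weighted_topics`, the topic proportions and the usage counts are all assumptions made for illustration.

```python
def usage_weighted_topics(doc_topic_probs, usage_counts):
    """Average per-document topic proportions, weighted by usage counts."""
    total = sum(usage_counts)
    if total <= 0:
        raise ValueError("usage_counts must contain at least one positive count")
    n_topics = len(doc_topic_probs[0])
    return [
        sum(u * theta[k] for theta, u in zip(doc_topic_probs, usage_counts)) / total
        for k in range(n_topics)
    ]

# Hypothetical topic proportions for three ebooks (rows) over two topics.
doc_topic_probs = [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]
# Hypothetical access counts taken from a usage report for the same ebooks.
usage_counts = [10, 100, 0]

weighted = usage_weighted_topics(doc_topic_probs, usage_counts)
unweighted = usage_weighted_topics(doc_topic_probs, [1, 1, 1])
print("weighted:  ", [round(x, 4) for x in weighted])
print("unweighted:", [round(x, 4) for x in unweighted])
```

In this toy case the two topics are nearly tied without weighting, while the usage-weighted version ranks the second topic clearly higher because the most accessed ebook is mostly about it.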
|
10 |
Image Dating, a Case Study to Evaluate the Inter-Battery Topic Model
Pertoft, John, January 2016
The Inter-Battery Topic Model (IBTM) is an extension of the well-known Latent Dirichlet Allocation (LDA) topic model. It gives a factorized representation of multimodal data (in this case, two views), which better separates variation in the observed data that is present in both views from variation that is present in only one of the views. This thesis is an evaluation and application study of this model, with the aim of showing how it can be used in the very difficult classification task of dating grayscale face portraits from a dataset collected from American high school yearbooks. This task has very high intra-class variation and low inter-class variation, which calls for techniques to extract the necessary information. An online-trained model is also implemented and evaluated, as well as a simplification of the model better suited to this specific data. The results show improved performance over LDA, indicating that the factorizing property of IBTM has a positive effect on performance for this type of classification task.
|