11 |
Sparsification for Topic Modeling and Applications to Information Retrieval. Muoh, Chibuike. 30 November 2009.
No description available.
|
12 |
Multi Domain Semantic Information Retrieval Based on Topic Model. Lee, Sanghoon. 07 May 2016.
Over the last decades, there have been remarkable shifts in the area of Information Retrieval (IR) as huge amounts of information accumulate on the Web. This explosion of information increases the need for new tools that retrieve meaningful knowledge from various complex information sources. Techniques for searching and extracting important information from numerous database sources have therefore become a key challenge for current IR systems.
Topic modeling is one of the most recent techniques for discovering hidden thematic structures in large data collections without human supervision. Several topic models have been proposed across various fields of study and have been utilized extensively in many applications. Latent Dirichlet Allocation (LDA) is the best-known topic model; it generates topics from large corpora of resources such as text, images, and audio, and has been widely used in information retrieval and data mining as an efficient way of identifying latent topics in document collections. However, LDA has a drawback: topic cohesion within a concept is attenuated when estimating infrequently occurring words. Moreover, LDA does not consider the meaning of words, but rather infers hidden topics through a purely statistical approach. These limitations can reduce the quality of topic words or loosen the relations between topics.
To solve these problems, we propose a domain-specific topic model that combines domain concepts with LDA. Two domain-specific algorithms are suggested for addressing the difficulties associated with LDA. The main strength of our proposed model is that it narrows semantic concepts from broad domain knowledge to a specific one, which solves the unknown-domain problem. The model is extensively tested on several applications (query expansion, classification, and summarization) to demonstrate its effectiveness. Experimental results show that the proposed model significantly increases the performance of these applications.
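The LDA baseline that this record's proposed model extends is commonly fit with collapsed Gibbs sampling. As a point of reference, a minimal pure-Python sketch of that baseline (the corpus, hyperparameters, and topic count below are illustrative, not taken from the thesis):

```python
import random

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for standard LDA.

    docs: list of token lists; K: number of topics.
    Returns per-token topic assignments, doc-topic counts,
    topic-word counts, and the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}

    # Count tables: document-topic, topic-word, and per-topic totals.
    ndk = [[0] * K for _ in docs]
    nkw = [[0] * V for _ in range(K)]
    nk = [0] * K
    z = []  # one topic assignment per token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][widx[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = z[d][i], widx[w]
                # Remove the token's current assignment ...
                ndk[d][k] -= 1; nkw[k][v] -= 1; nk[k] -= 1
                # ... and resample it from the collapsed conditional,
                # proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta).
                weights = [(ndk[d][j] + alpha) * (nkw[j][v] + beta)
                           / (nk[j] + V * beta) for j in range(K)]
                r = rng.random() * sum(weights)
                for j, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = j
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][v] += 1; nk[k] += 1
    return z, ndk, nkw, vocab
```

The sketch keeps only the count tables that the collapsed posterior needs; the per-topic word distributions the abstract refers to can be read off `nkw` after smoothing with `beta`.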
|
13 |
Disentangling Discourse: Networks, Entropy, and Social Movements. Gallagher, Ryan. 01 January 2017.
Our daily online conversations with friends, family, colleagues, and strangers weave an intricate network of interactions. From these networked discussions emerge themes and topics that transcend the scope of any individual conversation. In turn, these themes direct the discourse of the network and continue to ebb and flow as the interactions between individuals shape the topics themselves. This rich loop between interpersonal conversations and overarching topics is a wonderful example of a complex system: the themes of a discussion are more than just the sum of its parts.
Some of the most socially relevant topics emerging from these online conversations are those pertaining to racial justice issues. Since the shooting of Black teenager Michael Brown by White police officer Darren Wilson in Ferguson, Missouri, the protest hashtag #BlackLivesMatter has amplified critiques of extrajudicial shootings of Black Americans. In response to #BlackLivesMatter, other online users have adopted #AllLivesMatter, a counter-protest hashtag whose content argues that equal attention should be given to all lives regardless of race. Together these contentious hashtags each shape clashing narratives that echo previous civil rights battles and illustrate ongoing racial tension between police officers and Black Americans.
These narratives have taken place on a massive scale with millions of online posts and articles debating the sentiments of "black lives matter" and "all lives matter." Since no one person could possibly read everything written in this debate, comprehensively understanding these conversations and their underlying networks requires us to leverage tools from data science, machine learning, and natural language processing. In Chapter 2, we utilize methodology from network science to measure to what extent #BlackLivesMatter and #AllLivesMatter are "slacktivist" movements, and the effect this has on the diversity of topics discussed within these hashtags. In Chapter 3, we precisely quantify the ways in which the discourse of #BlackLivesMatter and #AllLivesMatter diverge through the application of information-theoretic techniques, validating our results at the topic level from Chapter 2. These entropy-based approaches provide the foundation for powerful automated analysis of textual data, and we explore more generally how they can be used to construct a human-in-the-loop topic model in Chapter 4. Our work demonstrates that there is rich potential for weaving together social science domain knowledge with computational tools in the study of language, networks, and social movements.
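The entropy-based comparison described for Chapter 3 builds on standard information-theoretic quantities. A self-contained sketch of Shannon entropy and the Jensen-Shannon divergence between two word distributions (the toy token lists in the usage stand in for hashtag corpora and are invented for illustration):

```python
import math
from collections import Counter

def word_dist(tokens):
    """Normalize token counts into a unigram probability distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def entropy(p):
    """Shannon entropy in bits of a distribution given as {word: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: entropy of the 50/50 mixture
    minus the mean entropy of the two distributions. Bounded in [0, 1] bits;
    0 means identical discourse, 1 means fully disjoint vocabularies."""
    support = set(p) | set(q)
    m = {w: 0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0) for w in support}
    return entropy(m) - 0.5 * entropy(p) - 0.5 * entropy(q)
```

Comparing `word_dist` outputs for two corpora with `jsd` gives a single divergence score, and the per-word terms of the mixture entropy can be ranked to see which words drive the divergence.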
|
14 |
An empirical case study on Stack Overflow to explore developers’ security challenges. Rahman, Muhammad Sajidur. January 1900.
Master of Science / Department of Computing and Information Sciences / Eugene Vasserman / The unprecedented growth of ubiquitous computing infrastructure has brought new challenges for security, privacy, and trust. The new problems range from mobile apps with incomprehensible permission (trust) models to the OpenSSL Heartbleed vulnerability, which disrupted the security of a large fraction of the world's web servers. Since almost all software bugs and flaws boil down to programming errors or misaligned requirements, we need to retrace the Software Development Life Cycle (SDLC) and supply chain to check that security and privacy considerations and implementation plans are properly in place.
Historically, security teams and developers have held divergent points of view on security. Security is often treated as a "consideration" or "toll gate" within the project plan rather than being built in from the early stages of project planning, development, and production. We argue that security can effectively be made everyone's business in the SDLC through a broader exploration of users and their socio-cultural contexts: gaining insight into their mental models of security and privacy and their patterns of technology use, examining why and how security practices are or are not followed, and then transferring those observations into new tool building and protocol/interaction design.
The overall goal of our study is to understand the common challenges and misconceptions regarding security-related issues among developers. To investigate this, we conduct a mixed-method analysis of data obtained from Stack Overflow (SO), one of the most popular online Q&A sites where the software developer community communicates, collaborates, and shares information. We adopt techniques from the mining-software-repositories research paradigm and employ topic modeling to analyze security-related topics in the SO dataset. To our knowledge, this work is one of the earliest systematic attempts to use SO data mining to understand the roots of the challenges, misconceptions, and deterrent factors, if any, that developers face when implementing security features during software development. We argue that a proper understanding of these issues is a necessary first step towards a "build security in" culture in the SDLC.
|
15 |
A Proposal for an Online-Learnable Topic Model Accounting for the Temporal Evolution of User Interests and Topics on Twitter. FURUHASHI, TAKESHI; YOSHIKAWA, TOMOHIRO; SASAKI, KENTARO. 09 1900.
No description available.
|
16 |
Probabilistic Topic Models for Human Emotion Analysis. January 2015.
abstract: While discrete emotions like joy, anger, and disgust are quite popular, continuous emotion dimensions like arousal and valence are gaining popularity within the research community due to an increase in the availability of datasets annotated with these emotions. Unlike discrete emotions, continuous emotions allow modeling of subtle and complex affect dimensions, but they are difficult to predict.

Dimension reduction techniques form the core of emotion recognition systems and help create a new feature space that is more helpful in predicting emotions. But these techniques do not necessarily guarantee better predictive capability, as most of them are unsupervised, especially in regression learning. Supervised dimension reduction techniques have not been explored much in the emotion recognition literature, and in this work a solution is provided through probabilistic topic models. Topic models provide a strong probabilistic framework in which to embed new learning paradigms and modalities. In this thesis, the graphical structure of Latent Dirichlet Allocation has been explored, and new models tuned to emotion recognition and change detection have been built.

This work shows that the double mixture structure of topic models helps 1) to visualize feature patterns, and 2) to project features onto a topic simplex that is more predictive of human emotions than popular techniques like PCA and Kernel PCA. Traditionally, topic models have been used on quantized features, but here a continuous topic model, the Dirichlet Gaussian Mixture model (DGMM), is proposed. Evaluation of DGMM has shown that, when modeling videos, the performance of LDA models can be replicated even without quantizing the features. Topic models had not previously been explored in a supervised context for video analysis, so a Regularized supervised topic model (RSLDA) that models video and audio features is introduced. The RSLDA learning algorithm performs dimension reduction and regularized linear regression simultaneously, and it has outperformed supervised dimension reduction techniques like SPCA and correlation-based feature selection algorithms. In a first of its kind, two new topic models, the Adaptive temporal topic model (ATTM) and SLDA for change detection (SLDACD), have been developed for predicting concept drift in time series data. These models do not assume independence of consecutive frames and outperform traditional topic models in detecting local and global changes, respectively. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2015
|
17 |
A Quality Criteria Based Evaluation of Topic Models. Sathi, Veer Reddy; Ramanujapura, Jai Simha. January 2016.
Context. Software testing is the process in which a particular software product or system is executed in order to find bugs or issues that may otherwise degrade its performance. Software testing is usually done based on pre-defined test cases. A test case can be defined as a set of terms or conditions used by software testers to determine whether a particular system under test operates as it is supposed to. However, in numerous situations there can be so many test cases that executing every one of them is practically impossible, given the many constraints involved. This forces testers to prioritize the functions to be tested, and this is where the ability of topic models can be exploited. Topic models are unsupervised machine learning algorithms that can explore large corpora of data and classify them by identifying the hidden thematic structure in those corpora. Using topic models for test case prioritization can save a lot of time and resources. Objectives. In our study, we provide an overview of the research that has been done in relation to topic models. We want to uncover the various quality criteria, evaluation methods, and metrics that can be used to evaluate topic models. Furthermore, we compare the performance of two topic models optimized for different quality criteria on a particular interpretability task, and thereby determine which topic model produces the best results for that task. Methods. A systematic mapping study was performed to gain an overview of previous research on the evaluation of topic models, focusing on identifying the quality criteria, evaluation methods, and metrics that have been used. The results of the mapping study were then used to identify the most-used quality criteria, and the evaluation methods related to those criteria were used to generate two optimized topic models.
An experiment was conducted in which the topics generated from those two topic models were shown to a group of 20 subjects. The task was designed to evaluate the interpretability of the generated topics, and the performance of the two topic models was compared using Precision, Recall, and F-measure. Results. Based on the results of the mapping study, Latent Dirichlet Allocation (LDA) was found to be the most widely used topic model. Two LDA topic models were created, one optimized for the quality criterion Generalizability (TG) and one for Interpretability (TI), using the Perplexity and Point-wise Mutual Information (PMI) measures respectively. For the selected metrics, TI showed better performance than TG in Precision and F-measure, while the two were comparable in Recall. The total run time of TI was also significantly higher than that of TG: 46 hours and 35 minutes for TI versus 3 hours and 30 minutes for TG. Conclusions. Looking at the F-measure, it can be concluded that the interpretability topic model (TI) performs better than the generalizability topic model (TG). However, while TI performed better in precision, recall was comparable, and the computational cost of creating TI is significantly higher than that of TG. Hence, we conclude that the choice of topic model optimization should be based on the aim of the task the model is used for. If the task requires high interpretability of the model and precision is important, such as for prioritizing test cases based on content, then TI would be the right choice, provided time is not a limiting factor.
However, if the task aims at generating topics that provide a basic understanding of the concepts (i.e., interpretability is not a high priority), then TG is the more suitable choice, which also makes it better suited to time-critical tasks.
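The two optimization measures named above have compact definitions. A toy sketch of held-out perplexity and pairwise-PMI topic coherence (the corpus, probabilities, and topic words are invented for illustration; this is not the study's evaluation code):

```python
import math
from collections import Counter
from itertools import combinations

def perplexity(heldout_tokens, word_probs):
    """Perplexity = exp of the average negative log-likelihood of held-out
    tokens under the model's word probabilities; lower is better
    (the Generalizability criterion)."""
    nll = -sum(math.log(word_probs[w]) for w in heldout_tokens)
    return math.exp(nll / len(heldout_tokens))

def pmi_coherence(topic_words, docs):
    """Average pairwise PMI of a topic's top words, estimated from
    document co-occurrence; higher suggests more interpretable topics
    (the Interpretability criterion)."""
    n = len(docs)
    df = Counter()    # document frequency per word
    codf = Counter()  # co-document frequency per word pair
    for doc in docs:
        present = set(doc) & set(topic_words)
        for w in present:
            df[w] += 1
        for a, b in combinations(sorted(present), 2):
            codf[(a, b)] += 1
    scores = []
    for a, b in combinations(sorted(topic_words), 2):
        if codf[(a, b)] == 0:
            continue  # pair never co-occurs; skipped in this simple variant
        pmi = math.log((codf[(a, b)] / n) / ((df[a] / n) * (df[b] / n)))
        scores.append(pmi)
    return sum(scores) / len(scores) if scores else 0.0
```

Optimizing for low perplexity and for high PMI pulls a model in different directions, which is exactly the TG-versus-TI trade-off the study reports.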
|
18 |
Targeted Topic Modeling for Levantine Arabic. Zahra, Shorouq. January 2020.
Topic models for focused analysis aim to capture topics within the limiting scope of a targeted aspect (which can be thought of as an inner topic within a certain domain). To serve their analytic purposes, topics are expected to be semantically coherent and closely aligned with human intuition. This in itself poses a major challenge for the more common topic modeling algorithms, which, in a broader sense, perform a full analysis covering all aspects and themes within a collection of texts. The paper attempts to construct a viable focused-analysis topic model which learns topics from Twitter data written in a closely related group of non-standardized varieties of Arabic widely spoken in the Levant region (i.e., Levantine Arabic). Results are compared to a baseline model as well as to another targeted topic model designed precisely for focused analysis. Judged overall, the model adequately captures topics containing terms within the scope of the targeted aspect. Nevertheless, it fails to produce human-friendly, semantically coherent topics: several topics contained a number of intruding terms, while others contained terms which, although still relevant to the targeted aspect, appeared to be thrown together at random.
|
19 |
Tracking Online Trend Locations using a Geo-Aware Topic Model. Schreiber, Jonah. January 2016.
In automatically categorizing massive corpora of text, various topic models have been applied with good success. Much work has been done on applying machine learning and NLP methods on Internet media, such as Twitter, to survey online discussion. However, less focus has been placed on studying how geographical locations discussed in online fora evolve over time, and even less on associating such location trends with topics. Can online discussions be geographically tracked over time? This thesis attempts to answer this question by evaluating a geo-aware Streaming Latent Dirichlet Allocation (SLDA) implementation which can recognize location terms in text. We show how the model can predict time-dependent locations of the 2016 American primaries by automatic discovery of election topics in various Twitter corpora, and deduce locations over time.
|
20 |
Stochastic EM for generic topic modeling using probabilistic programming. Saberi Nasseri, Robin. January 2021.
Probabilistic topic models are a versatile class of models for discovering latent themes in document collections through unsupervised learning. Conventional inferential methods lack the scaling capabilities necessary for extensions to large-scale applications. In recent years, Stochastic Expectation Maximization has proven scalable for the simplest topic model, Latent Dirichlet Allocation, but analytical maximization is unfortunately not possible for many more complex topic models. With the rise of probabilistic programming languages, the ability to infer flexibly specified probabilistic models using sophisticated numerical optimization procedures has become widely available. However, these frameworks have mainly been developed for optimizing continuous parameters, often prohibiting direct optimization of discrete parameters. This thesis explores the potential of using probabilistic programming for generic topic modeling via Stochastic Expectation Maximization, with numerical maximization of discrete parameters reparameterized to unconstrained space. In simulated experiments, the method achieves results of quality comparable to other methods for Latent Dirichlet Allocation. It is further applied to infer a Dirichlet-multinomial Regression model with metadata covariates on a real dataset, where it produces interpretable topics.
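The reparameterization of discrete (simplex-constrained) parameters to unconstrained space that the abstract mentions is typically done with a softmax map. A minimal sketch, assuming a plain gradient-ascent M-step for a single multinomial parameter vector (illustrative only, not the thesis's actual implementation):

```python
import math

def softmax(theta):
    """Map unconstrained reals theta to a point on the probability simplex,
    shifting by the max for numerical stability."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]

def m_step(counts, steps=1000, lr=1.0):
    """Numerically maximize the multinomial log-likelihood
    sum_v n_v * log(phi_v) over phi = softmax(theta), standing in for
    the analytic M-step when no closed form exists."""
    V = len(counts)
    n = sum(counts)
    theta = [0.0] * V  # start at the uniform distribution
    for _ in range(steps):
        phi = softmax(theta)
        # Gradient of the log-likelihood w.r.t. theta is n_v - n * phi_v;
        # dividing by n rescales it to (empirical freq - phi).
        theta = [theta[v] + lr * (counts[v] - n * phi[v]) / n
                 for v in range(V)]
    return softmax(theta)
```

For a plain multinomial the analytic optimum is simply `counts[v] / n`, so the optimizer should recover it to numerical tolerance; the point of the reparameterization is that the same unconstrained gradient machinery keeps working for models where no such closed form exists.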
|