Global ETD Search

11	Application of Topic Models for Test Case Selection : A comparison of similarity-based selection techniques / Tillämpning av ämnesmodeller för testfallsselektion Askling, Kim January 2019 Regression testing is as important for the quality assurance of a system as it is time consuming. Several techniques exist for lowering the execution times of test suites and providing faster feedback to developers; examples include those based on transition models or string distances. These techniques are called test case selection (TCS) techniques, and they focus on selecting subsets of the test suite deemed relevant to the modifications made to the system under test. This thesis project evaluated the use of a topic model, latent Dirichlet allocation, as a means to create a diverse selection of test cases for coverage of certain test characteristics. The model was tested on authentic data sets from two different companies, and the results were compared against prior work in which TCS was performed using similarity-based techniques. The model was also tuned and evaluated, using an algorithm based on differential evolution, to increase its stability in terms of inferred topics and topic diversity. The results indicate that using the model for test case selection was not as efficient as the other similarity-based selection techniques studied in work prior to this thesis. In fact, the selection generated using the model performs similarly, in terms of coverage, to a randomly selected subset of the test suite. Tuning the model does not improve these results; the tuned model performs worse than the other methods in most cases. However, the tuning process makes the model more stable in terms of inferred latent topics and topic diversity. 
The performance of the model is believed to depend strongly on the characteristics of the underlying data used to train it, with emphasis on word frequencies and the overall sizes of the training documents, implying that these would affect the words’ relevance scoring for the better. test automation test case selection machine learning latent dirichlet allocation differential evolution testautomation testfallsselektion maskininlärning latent dirichlet allocation differentiell evolution Computer Systems Datorsystem
12	LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification Mungre, Surbhi January 1900 Master of Science / Department of Computing and Information Sciences / Doina Caragea / Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms is largely dependent on the availability of labeled data and the approach used to represent DNA sequences as feature vectors. For many organisms, labeled DNA data is scarce, while unlabeled data is easily available. However, for a small number of well-studied model organisms, large amounts of labeled data are available. This calls for domain adaptation approaches, which can transfer knowledge from a source domain, for which labeled data is available, to a target domain, for which large amounts of unlabeled data are available. Intuitively, one approach to domain adaptation can be obtained by extracting and representing the features that the source domain and the target domain sequences share. Latent Dirichlet Allocation (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches, LDA Words (LDAW) and LDA Distribution (LDAD), for DNA sequences. LDA is a probabilistic model, which is generative in nature, and is used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a "document" and k-mers obtained from a sequence are "document words". We use LDA to model our sequence collection. 
Given the LDA model, each document can be represented as a distribution over topics (where a topic can be seen as a distribution over k-mers). In the LDAW method, we use the top k-mers in each topic as our features (i.e., k-mers with the highest probability), while in the LDAD method, we use the topic distribution to represent a document as a feature vector. We study LDA-based dimensionality reduction approaches for both supervised DNA sequence classification and domain adaptation. We apply the proposed approaches to the splice site prediction problem, which is an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of the LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distance between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used to perform dimensionality reduction and domain adaptation for DNA sequence classification problems. Domain Adaptation Splice Site Prediction Latent Dirichlet Allocation DNA Sequence Classification Dimensionality Reduction Computer Science (0984)
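The "sequence as document, k-mers as words" representation described in this abstract can be sketched as follows. This is a minimal illustration only: the function names `kmers` and `kmer_counts`, the choice of overlapping k-mers, and the example sequences are assumptions, not details taken from the thesis. The resulting bag-of-k-mers counts are the kind of input a topic model such as LDA would consume.

```python
from collections import Counter


def kmers(sequence, k):
    """Return all overlapping k-mers of a DNA sequence, in order of appearance."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequences, k):
    """Build one bag-of-k-mers (a Counter) per sequence.

    Each sequence plays the role of a "document" and each k-mer the role of a "word".
    """
    return [Counter(kmers(s, k)) for s in sequences]


# Hypothetical toy sequences, used only for illustration.
docs = kmer_counts(["ACGTAC", "GTACGT"], k=3)
```

Fitting LDA on such counts and reading off either the top k-mers per topic (LDAW) or the per-document topic distribution (LDAD) would be the next step, which is left out here because the abstract does not specify the implementation used.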
13	Geographic Relevance for Travel Search: The 2014-2015 Harvey Mudd College Clinic Project for Expedia, Inc. Long, Hannah 01 January 2015 The purpose of this Clinic project is to help Expedia, Inc. expand the search capabilities it offers to its users. In particular, the goal is to help the company respond to unconstrained search queries by generating a method to associate hotels and regions around the world with the higher-level attributes that describe them, such as “family-friendly” or “culturally-rich.” Our team utilized machine-learning algorithms to extract metadata from textual data about hotels and cities. We focused on two machine-learning models: decision trees and Latent Dirichlet Allocation (LDA). The first appeared to be a promising approach, but would require more resources to replicate on the scale Expedia needs. On the other hand, we were able to generate useful results using LDA. We created a website to visualize these results. Machine Learning Unconstrained Search Decision Trees Latent Dirichlet Allocation Unsupervised Learning Computer Sciences
14	Probabilistic topic models for sentiment analysis on the Web Chenghua, Lin January 2011 Sentiment analysis aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text, and has attracted rapidly growing interest in natural language processing in recent years. Probabilistic topic models, on the other hand, are capable of discovering hidden thematic structure in large archives of documents, and have been an active research area in the field of information retrieval. The work in this thesis focuses on developing topic models for automatic sentiment analysis of web data by combining ideas from both research domains. One noticeable issue with most previous work in sentiment analysis is that the trained classifier is domain dependent, and the labelled corpora required for training can be difficult to acquire in real-world applications. Another issue is that the dependencies between sentiment/subjectivity and topics are not taken into consideration. The main contribution of this thesis is therefore the introduction of three probabilistic topic models, which address the above concerns by modelling sentiment/subjectivity and topic simultaneously. The first model, called the joint sentiment-topic (JST) model, is based on latent Dirichlet allocation (LDA) and detects sentiment and topic simultaneously from text. Unlike supervised approaches to sentiment classification, which often fail to produce satisfactory performance when applied to new domains, the weakly-supervised nature of JST makes it highly portable to other domains, where the only supervision information required is a domain-independent sentiment lexicon. Apart from document-level sentiment classification results, JST can also extract sentiment-bearing topics automatically, which is a distinct feature compared to existing sentiment analysis approaches. 
The second model is a dynamic version of JST called the dynamic joint sentiment-topic (dJST) model. dJST respects the ordering of documents, and allows the analysis of topic and sentiment evolution of document archives that are collected over a long time span. By accounting for the historical dependencies of documents from the past epochs in the generative process, dJST gives a richer posterior topical structure than JST, and can better respond to the permutations of topic prominence. We also derive online inference procedures based on a stochastic EM algorithm for efficiently updating the model parameters. The third model is called the subjectivity detection LDA (subjLDA) model for sentence-level subjectivity detection. Two sets of latent variables were introduced in subjLDA. One is the subjectivity label for each sentence; another is the sentiment label for each word token. By viewing the subjectivity detection problem as weakly-supervised generative model learning, subjLDA significantly outperforms the baseline and is comparable to the supervised approach which relies on much larger amounts of data for training. These models have been evaluated on real world datasets, demonstrating that joint sentiment topic modelling is indeed an important and useful research area with much to offer in the way of good results. 004.01
15	Personalized Document Recommendation by Latent Dirichlet Allocation Chen, Li-Zen 13 August 2012 With the rapid growth of the Internet, people around the world can easily distribute, browse, and share as much information as possible online. The enormous amount of information, however, causes an information overload problem that is beyond users' limited information processing ability. Recommender systems have therefore arisen to help users look for useful information when they cannot describe their requirements precisely. The filtering techniques in recommender systems can be divided into content-based filtering (CBF) and collaborative filtering (CF). Although CF has been shown to be superior to CBF in the literature, personalized document recommendation relies more on CBF simply because documents are textual by nature. Nevertheless, the document recommendation task provides a good opportunity to integrate both techniques into a hybrid one and enhance overall recommendation performance. The objective of this research is thus to propose a hybrid filtering approach for personalized document recommendation. In particular, latent Dirichlet allocation, which uncovers latent semantic structure in documents, is incorporated to help either obtain robust document similarity in CF or explore user profiles in CBF. Two experiments are conducted accordingly. The results show that our proposed approach outperforms its counterparts in recommendation performance, which supports the feasibility of the approach in real applications. recommender systems collaborative filtering hidden topic analysis latent Dirichlet allocation content-based filtering
16	Latent Dirichlet Allocation in R Ponweiser, Martin 05 1900 Topic models are a new research field within the computer science areas of information retrieval and text mining. They are generative probabilistic models of text corpora inferred by machine learning, and they can be used for retrieval and text mining tasks. The most prominent topic model is latent Dirichlet allocation (LDA), which was introduced in 2003 by Blei et al. and has since sparked the development of other topic models for domain-specific purposes. This thesis focuses on LDA's practical application. Its main goal is the replication of the data analyses from the 2004 LDA paper "Finding scientific topics" by Thomas Griffiths and Mark Steyvers within the framework of the R statistical programming language and the R package topicmodels by Bettina Grün and Kurt Hornik. The complete process, including extraction of a text corpus from the PNAS journal's website, data preprocessing, transformation into a document-term matrix, model selection, model estimation, and presentation of the results, is fully documented and commented. The outcome closely matches the analyses of the original paper; therefore the research by Griffiths/Steyvers can be reproduced. Furthermore, this thesis proves the suitability of the R environment for text mining with LDA. (author's abstract) / Series: Theses / Institute for Statistics and Mathematics
17	An evaluation of latent Dirichlet allocation in the context of plant-pollinator networks Callaghan, Liam 08 January 2013 There may be several mechanisms that drive observed interactions between plants and pollinators in an ecosystem, many of which may involve trait matching or trait complementarity. Hence a model of insect species activity on plant species should be represented as a mixture of these linkage rules. Unfortunately, ecologists do not always know how many, or even which, traits are the main contributors to the observed interactions. This thesis proposes the Latent Dirichlet Allocation (LDA) model from artificial intelligence for modelling the observed interactions in an ecosystem as a finite mixture of (latent) interaction groups, in which plant and pollinator pairs that share common linkage rules are placed in the same interaction group. Several model selection criteria are explored for estimating how many interaction groups best describe the observed interactions. This thesis also introduces a new model selection score called "penalized perplexity". The performance of the model selection criteria, and of LDA in general, is evaluated through a comprehensive simulation study that considers networks of various sizes along with varying levels of nesting and numbers of interaction groups. Results of the simulation study suggest that LDA works well on networks with mild-to-no nesting, but loses accuracy with increased nestedness. Further, the penalized perplexity tended to outperform the other model selection criteria in identifying the correct number of interaction groups used to simulate the data. Finally, LDA was demonstrated on a real network, the results of which provided insights into the functional roles of pollinator species in the study region.
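For readers unfamiliar with the quantity underlying the model selection score mentioned in this abstract, the following sketch computes ordinary (unpenalized) perplexity from per-observation log-probabilities. The abstract does not define the penalty term of "penalized perplexity", so it is deliberately not reproduced here; the function name `perplexity` and the toy input are assumptions made for illustration.

```python
import math


def perplexity(log_probs):
    """Standard perplexity: exp of the negative mean log-likelihood per observation.

    `log_probs` holds the natural-log probability the model assigns to each
    held-out observation (for example, each observed interaction or word token).
    Lower values indicate a better fit.
    """
    if not log_probs:
        raise ValueError("need at least one observation")
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that assigns probability 1/4 to every observation has perplexity 4.
uniform_example = perplexity([math.log(0.25)] * 8)
```

A model selection procedure would typically evaluate such a score for several candidate numbers of groups and prefer the candidate with the lowest value, with any penalty term added according to the thesis's own definition.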
18	A Quality Criteria Based Evaluation of Topic Models Sathi, Veer Reddy, Ramanujapura, Jai Simha January 2016 Context. Software testing is the process in which a particular software product or system is executed in order to find bugs or issues that may otherwise degrade its performance. Software testing is usually done based on pre-defined test cases. A test case can be defined as a set of terms or conditions that software testers use to determine whether a particular system under test operates as it is supposed to. However, in numerous situations there can be so many test cases that executing every one of them is practically impossible, as there may be many constraints. This causes the testers to prioritize the functions that are to be tested. This is where the ability of topic models can be exploited. Topic models are unsupervised machine learning algorithms that can explore large corpora of data and classify them by identifying the hidden thematic structure in those corpora. Using topic models for test case prioritization can save a lot of time and resources. Objectives. In our study, we provide an overview of the amount of research that has been done in relation to topic models. We want to uncover various quality criteria, evaluation methods, and metrics that can be used to evaluate topic models. Furthermore, we would also like to compare the performance of two topic models that are optimized for different quality criteria on a particular interpretability task, and thereby determine the topic model that produces the best results for that task. Methods. A systematic mapping study was performed to gain an overview of the previous research that has been done on the evaluation of topic models. The mapping study focused on identifying quality criteria, evaluation methods, and metrics that have been used to evaluate topic models. 
The results of the mapping study were then used to identify the most used quality criteria. The evaluation methods related to those criteria were then used to generate two optimized topic models. An experiment was conducted in which the topics generated from those two topic models were provided to a group of 20 subjects. The task was designed to evaluate the interpretability of the generated topics. The performance of the two topic models was then compared using Precision, Recall, and F-measure. Results. Based on the results obtained from the mapping study, Latent Dirichlet Allocation (LDA) was found to be the most widely used topic model. Two LDA topic models were created, one optimized for the quality criterion Generalizability (TG) and one for Interpretability (TI), using the Perplexity and Point-wise Mutual Information (PMI) measures respectively. For the selected metrics, TI showed better performance than TG in Precision and F-measure. However, the performance of TI and TG was comparable in the case of Recall. The total run time of TI was also found to be significantly higher than that of TG: the run time of TI was 46 hours and 35 minutes, whereas for TG it was 3 hours and 30 minutes. Conclusions. Looking at the F-measure, it can be concluded that the interpretability topic model (TI) performs better than the generalizability topic model (TG). However, while TI performed better in precision, recall was comparable. Furthermore, the computational cost to create TI is significantly higher than for TG. Hence, we conclude that the selection of the topic model optimization should be based on the aim of the task the model is used for. 
If the task requires high interpretability of the model, and precision is important, such as for the prioritization of test cases based on content, then TI would be the right choice, provided time is not a limiting factor. However, if the task aims at generating topics that provide a basic understanding of the concepts (i.e., interpretability is not a high priority), then TG is the most suitable choice; thus making it more suitable for time critical tasks. Topic models Topic interpretability Test cases Latent Dirichlet Allocation Topic model optimization Software Engineering Programvaruteknik
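The three measures used in this abstract to compare the two topic models can be computed as in the sketch below. It assumes a set-based formulation (items a subject selected versus items that are actually relevant) and the balanced F-measure (F1); the abstract does not state which F-measure weighting or scoring protocol was used, and the function name and example sets are hypothetical.

```python
def precision_recall_f1(selected, relevant):
    """Return (precision, recall, F1) for a set of selected items.

    precision = |selected ∩ relevant| / |selected|
    recall    = |selected ∩ relevant| / |relevant|
    F1        = harmonic mean of precision and recall
    Empty denominators are treated as a score of 0.0.
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 items selected, 3 relevant, 2 in common.
p, r, f = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

In this toy case precision is 1/2, recall is 2/3, and F1 is 4/7, which shows how a model can score differently on precision while recall stays comparable.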
19	Email Mining Classifier : The empirical study on combining the topic modelling with Random Forest classification Halmann, Marju January 2017 Filtering out and replying automatically to emails is of interest to many, but is hard due to the complexity of the language and to dependencies on background information that is not present in the email itself. This paper investigates whether Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier can be used for the more general email classification task and how it compares to other existing email classifiers. The comparison is based on a literature study and on empirical experimentation using two real-life datasets. Firstly, a literature study is performed to gain insight into the accuracy of other available email classifiers. Secondly, the proposed model’s accuracy is explored through experimentation. The literature study shows that the accuracy of more general email classifiers differs greatly across user sets. The proposed model's accuracy is within the reported accuracy range, though in the lower part. This indicates that the proposed model performs poorly compared to other classifiers. On average, the classifier's performance improves by 15 percentage points with additional information. This indicates that Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier is promising; however, future studies are needed to explore the model and ways to further increase the accuracy. Email mining Latent Dirichlet Allocation Random Forest classification Computer Sciences Datavetenskap (datalogi)
20	Diseño, desarrollo y evaluación de un algoritmo para detectar sub-comunidades traslapadas usando análisis de redes sociales y minería de datos [Design, development and evaluation of an algorithm to detect overlapping sub-communities using social network analysis and data mining] Muñoz Cancino, Ricardo Luis January 2013 Master in Operations Management / Industrial Civil Engineer / Virtual social networking sites have grown enormously over the last decade. Their main purpose is to facilitate the creation of links between people who, for example, share interests, activities, knowledge, or real-life connections. The interaction among users generates a community within the social network. There are several types of communities, among which communities of interest and communities of practice are distinguished. A community of interest is a group of people interested in sharing and discussing a particular topic of interest. In a community of practice, by contrast, people share a concern or passion for something they do and learn how to do it better. If the interactions take place over the internet, they are called virtual communities (VCoP/VCoI). It is common for members to share with only some users, thus forming subcommunities, and a member may belong to more than one. Identifying these substructures is necessary, because that is where the interactions that create and develop the community's knowledge arise. Many algorithms have been designed to detect subcommunities. However, most of them detect disjoint subcommunities and, moreover, do not consider the content generated by the members of the community. The main objective of this work is to design, develop and evaluate an algorithm to detect overlapping subcommunities using social network analysis (SNA) and text mining. To that end, the SNA-KDD methodology proposed by Ríos et al. [79], which combines Knowledge Discovery in Databases (KDD) and SNA, is used. It was applied to two virtual communities, Plexilandia (VCoP) and The Dark Web Portal (VCoI). 
In the KDD stage, the users' posts were preprocessed and Latent Dirichlet Allocation (LDA) was then applied, which allows each post to be described in terms of topics. In the SNA stage, filtered networks were built from the information obtained in the previous stage. Two algorithms developed in this thesis, SLTA and TPA, were then used to find overlapping subcommunities. The results show that SLTA achieves, on average, 5% better performance than the best existing algorithm when applied to a VCoP. In addition, the quality of the detected sub-community structure was found to increase, on average, by 64% when the semantic filter is increased. As for TPA, this algorithm achieves, on average, a modularity of 0.33, whereas the best existing algorithm achieves 0.043, when applied to a VCoI. Furthermore, the joint application of our algorithms appears to offer a way to determine the type of community being analyzed. However, this must be verified by analyzing more virtual communities. Social networks Computational algorithms Data mining Virtual community Latent dirichlet allocation
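The modularity figures quoted in this abstract can be put in context with a sketch of the standard Newman-Girvan modularity for an undirected, unweighted graph. Note the assumptions: this version handles only a disjoint partition (each node in exactly one community), whereas the thesis targets overlapping subcommunities and may use a different modularity variant; the function name, the edge-list format, and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity Q for a disjoint partition.

    Q = sum over communities c of  L_c / m - (d_c / (2 m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    Assumes an undirected graph without self-loops or duplicate edges.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)  # L_c per community
    degree = defaultdict(int)    # d_c per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q = modularity(edges, two_groups)
```

Placing every node in a single community gives a modularity of 0, so higher values indicate a partition with more internal edges than expected by chance.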
