81

Analýza textových používateľských hodnotení vybranej skupiny produktov / Analysis of Textual User Reviews of a Selected Group of Products

Valovič, Roman January 2019 (has links)
This work focuses on the design of a system that identifies frequently discussed product features in product reviews, summarizes them, and displays them to the user together with their sentiment. The work deals with natural language processing, with a specific focus on the Czech language. The reader is introduced to text preprocessing methods and their impact on the quality of the analysis results. The most frequently discussed product features are identified by cluster analysis using the K-Means algorithm, under the assumption that sufficiently internally homogeneous clusters will represent the individual product features. A new area explored in this work is the representation of documents using the word embeddings technique and the potential of using its vector space as input for machine learning algorithms.
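
A minimal sketch of the clustering step the abstract describes, assuming gensim and scikit-learn; the toy reviews, the candidate feature terms, and k are illustrative placeholders, not the thesis's data or settings:

```python
# Train word embeddings on tokenized reviews, then group candidate
# feature terms with K-Means so that each cluster approximates one
# product feature. Corpus and terms are toy placeholders.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

reviews = [
    ["battery", "lasts", "long"],
    ["display", "is", "sharp"],
    ["battery", "drains", "fast"],
    ["screen", "brightness", "is", "low"],
]

model = Word2Vec(reviews, vector_size=50, min_count=1, seed=1)
terms = ["battery", "display", "screen", "brightness"]
vectors = [model.wv[t] for t in terms]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(vectors)
for term, label in zip(terms, kmeans.labels_):
    print(term, "-> cluster", label)
```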
82

EXPLORATORY SEARCH USING VECTOR MODEL AND LINKED DATA

Daeun Yim (9143660) 30 July 2020 (has links)
The way people acquire knowledge has largely shifted from print to web resources. Meanwhile, search has become the main medium for accessing information. Amongst various search behaviors, exploratory search represents a learning process that involves complex cognitive activities and knowledge acquisition. Research on exploratory search studies how to make search systems help people seek information and develop intellectual skills. This research focuses on information retrieval and aims to build an exploratory search system with higher clustering performance and more diversified search results. In this study, a new language model that integrates a state-of-the-art vector language model (i.e., BERT) with human knowledge is built to better understand and organize search results. The clustering performance of the new model (i.e., RDF+BERT) was similar to the original model, but a slight improvement was observed on conversational texts compared to the pre-trained language model and an exploratory search baseline. With the addition of an enrichment phase that expands search results to related documents, the novel system can also display more diverse search results.
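
A sketch of the clustering backbone only, under stated assumptions: search-result snippets are embedded with a BERT-style sentence encoder (here the sentence-transformers library and the all-MiniLM-L6-v2 model, both assumptions) and then clustered; the thesis's RDF/linked-data enrichment step is not reproduced:

```python
# Embed search-result snippets with a BERT-style encoder and cluster
# them. Snippets are invented examples.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

snippets = [
    "BERT is a transformer-based language model.",
    "Transformers use self-attention over tokens.",
    "K-Means partitions points into k clusters.",
    "Clustering groups similar items together.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(snippets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(list(zip(labels, snippets)))
```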
83

Linguistics of Russian Media During the 2016 US Election: A Corpus-Based Study

Terry, Devon K. 30 July 2021 (has links)
The purpose of this study is to perform a linguistic analysis of Russian mass media coverage of the 2016 US presidential election. It is a corpus-based study, using a corpus as a foundational source of quantitative and qualitative data. The study takes a collection of keywords from the corpus and analyzes their contexts as they pertain to Hillary Clinton and Donald Trump. It uses corpus-linguistic research tools such as sentence tokenization, Key Words in Context (KWIC), sentiment analysis, word embedding visualization, word-vector math, word frequency lists, and collocate analysis as part of the quantitative analysis. The results of the sentiment analysis and word vector analysis show a moderate bias in the corpus favoring Donald Trump. Additionally, a more in-depth qualitative analysis of sentences containing keywords is performed. A framework based on Appraisal Theory is used to examine sample sentences and show how the corpus appraises the candidates. The qualitative analysis shows that many sentences are full of judgment towards Hillary Clinton and positive appraisal of Donald Trump, and that the corpus expands positive dialog about Donald Trump while contracting dialog and expanding negativity about Hillary Clinton. The predicted Russian geopolitical agenda seeks to demean American politics, positively influence Russians' perceptions of Vladimir Putin, and support Donald Trump insofar as his policies align with Russia's goals.
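
A minimal Key Words in Context (KWIC) extractor of the kind such corpus tools provide; the sentence and window size are invented for illustration:

```python
# Show each keyword occurrence with a fixed window of surrounding tokens.
def kwic(tokens, keyword, window=3):
    """Yield (left context, keyword, right context) for each hit."""
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield left, tok, right

tokens = "the corpus covers Clinton and Trump during the election".split()
for left, kw, right in kwic(tokens, "clinton"):
    print(f"{left} [{kw}] {right}")
```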
84

Word Vector Representations using Shallow Neural Networks

Adewumi, Oluwatosin January 2021 (has links)
This work highlights some important factors for consideration when developing word vector representations and data-driven conversational systems. Neural network methods for creating word embeddings have gained more prominence than their older, count-based counterparts. However, there are still challenges, such as prolonged training time and the need for more data, especially with deep neural networks. Shallow neural networks have the advantage of lower complexity; however, they also face challenges, such as sub-optimal combinations of hyper-parameters that produce sub-optimal models. This work therefore investigates the following research questions: "How important are hyper-parameters to word embeddings' performance?" and "What factors are important for developing ethical and robust conversational systems?" In answering these questions, various experiments were conducted using different datasets in different studies. The first study empirically investigates various hyper-parameter combinations for creating word vectors and their impact on a few natural language processing (NLP) downstream tasks: named entity recognition (NER) and sentiment analysis (SA). The study shows that optimal performance of embeddings for downstream NLP tasks depends on the task at hand. It also shows that certain combinations give strong performance across the tasks chosen for the study. Furthermore, it shows that reasonably smaller corpora are sufficient, or in some cases even produce better models, and take less time to train and load. This is important, especially now that environmental considerations play a prominent role in ethical research. Subsequent studies build on the findings of the first and explore hyper-parameter combinations for Swedish and English embeddings for the downstream NER task. The second study presents a new Swedish analogy test set for the evaluation of Swedish embeddings. Furthermore, it shows that character n-grams are useful for Swedish, a morphologically rich language. The third study shows that broad coverage of topics in a corpus appears to be important for producing better embeddings, and that noise may be helpful in certain instances, though it is generally harmful. Hence, a relatively smaller corpus can show better performance than a larger one, as demonstrated with the smaller Swedish Wikipedia corpus against the Swedish Gigaword corpus. In the final study (answering the second question), the argument is made from the point of view of the philosophy of science that the near-elimination of unwanted bias in training data, and the use of fora like peer review, conferences, and journals to provide the necessary avenues for criticism and feedback, are instrumental for the development of ethical and robust conversational systems.
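
A sketch of the kind of hyper-parameter sweep the first study performs, assuming gensim; the grid values and toy corpus are assumptions, and the downstream NER/SA evaluation is stubbed out:

```python
# Train gensim Word2Vec models over a small hyper-parameter grid and
# compare them; in the study each model would feed a downstream
# NER/sentiment evaluation.
from itertools import product
from gensim.models import Word2Vec

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]] * 100

grid = {
    "sg": [0, 1],          # CBOW vs. skip-gram
    "window": [2, 5],      # context window size
    "negative": [5, 15],   # number of negative samples
}

for sg, window, negative in product(*grid.values()):
    model = Word2Vec(corpus, vector_size=50, sg=sg, window=window,
                     negative=negative, min_count=1, epochs=5, seed=1)
    # Vocabulary size printed as a stand-in for a real downstream score.
    print(f"sg={sg} window={window} negative={negative} "
          f"vocab={len(model.wv)}")
```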
85

A comparative study of the grammatical gender systems of languages by means of analysing word embeddings

Veeman, Hartger January 2020 (has links)
The creation of word embeddings is one of the key breakthroughs in natural language processing. Word embeddings allow words to be represented semantically, opening the way to many new deep learning methods. Understanding what information word embeddings contain helps explain their behaviour in natural language processing tasks, and also allows for the quantitative study of linguistic features such as grammatical gender. This thesis explores how grammatical gender is encoded in word embeddings by analysing the performance of a neural network classifier on the classification of nouns by gender. This analysis is done in three experiments: an analysis of contextualized embeddings, an analysis of embeddings learned from modified corpora, and an analysis of aligned embeddings in many languages. The contextualized word embedding model ELMo has multiple output layers with a gradually increasing presence of semantic information in the embedding. This differing presence of semantic information was used to test the classifier's reliance on semantic information. Swedish, German, Spanish, and Russian embeddings were classified at all layers of a three-layered ELMo model. The word representation layer without any contextualization produced the best accuracy, indicating that the noise introduced by contextualization was more impactful than any potential extra semantic information. Swedish embeddings were also learned from a corpus stripped of articles and from a stemmed corpus. Both sets of embeddings showed a drop of about 6% in accuracy compared with embeddings from an unmodified corpus, indicating that agreement plays a large role in the classification. Finally, aligned multilingual embeddings were used to measure the accuracy of a grammatical gender classifier in 24 languages. The classifier models were applied to data from other languages to determine the similarity of the encoding of grammatical gender in these embeddings. Correcting the results with a random-guessing baseline shows that transferred models can be highly accurate in certain language combinations and in some cases almost approach the accuracy of the model on its source data. A comparison between transfer accuracy and phylogenetic distance showed that model transferability follows a pattern that resembles phylogenetic distance.
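
A minimal sketch of the classification set-up, assuming scikit-learn: a small neural classifier predicts a noun's grammatical gender from its embedding vector. The random vectors and binary labels stand in for real ELMo or aligned embeddings and gender annotations, so accuracy here hovers near the random-guessing baseline mentioned above:

```python
# Train a small neural classifier on (embedding, gender) pairs and
# report held-out accuracy. Data is a random placeholder.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 100))      # placeholder noun embeddings
y = rng.integers(0, 2, size=600)     # placeholder gender labels (2 classes)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                    random_state=0).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```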
86

Explorations in Word Embeddings: graph-based word embedding learning and cross-lingual contextual word embedding learning / Explorations de plongements lexicaux : apprentissage de plongements à base de graphes et apprentissage de plongements contextuels multilingues

Zhang, Zheng 18 October 2019 (has links)
Word embeddings are a standard component of modern natural language processing architectures. Every time there is a breakthrough in word embedding learning, the vast majority of natural language processing tasks, such as POS-tagging, named entity recognition (NER), question answering, and natural language inference, can benefit from it. This work addresses the question of how to improve the quality of monolingual word embeddings learned by prediction-based models, and how to map contextual word embeddings generated by pretrained language representation models like ELMo or BERT across different languages. For monolingual word embedding learning, I take into account global, corpus-level information and generate a different noise distribution for negative sampling in word2vec. For this purpose I pre-compute word co-occurrence statistics with corpus2graph, an open-source NLP-application-oriented Python package that I developed: it efficiently generates a word co-occurrence network from a large corpus and applies network algorithms, such as random walks, to it. For cross-lingual contextual word embedding mapping, I link contextual word embeddings to word sense embeddings. The improved anchor generation algorithm that I propose also expands the scope of word embedding mapping algorithms from context-independent to contextual word embeddings.
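
A sketch of the co-occurrence-graph idea under stated assumptions (the actual corpus2graph package computes corpus-level statistics far more efficiently): build a word co-occurrence network with networkx, then derive a global noise distribution for negative sampling from weighted node degrees, as one alternative to word2vec's default unigram distribution:

```python
# Build a word co-occurrence graph from a toy corpus, then turn weighted
# node degrees into a probability distribution usable as a negative-
# sampling noise distribution.
from collections import Counter
from itertools import combinations
import networkx as nx

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

edges = Counter()
for sentence in corpus:
    for w1, w2 in combinations(sentence, 2):  # window = whole sentence
        if w1 != w2:
            edges[tuple(sorted((w1, w2)))] += 1

graph = nx.Graph()
for (w1, w2), weight in edges.items():
    graph.add_edge(w1, w2, weight=weight)

total = sum(d for _, d in graph.degree(weight="weight"))
noise = {w: d / total for w, d in graph.degree(weight="weight")}
print(noise)
```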
87

Biomedical concept association and clustering using word embeddings

Shah, Setu 12 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Biomedical data exists in the form of journal articles, research studies, electronic health records, care guidelines, etc. While text mining and natural language processing tools have been widely employed across various domains, they are just taking off in the healthcare space. A primary hurdle that makes it difficult to build artificial intelligence models that use biomedical data is the limited amount of labelled data available. Since most models rely on supervised or semi-supervised methods, generating large amounts of pre-processed labelled data for training purposes becomes extremely costly. Even for datasets that are labelled, the lack of normalization of biomedical concepts further affects the quality of results and limits applications to a restricted dataset. This affects the reproducibility of results and techniques across datasets, making it difficult to deploy research solutions that improve healthcare services. The research presented in this thesis focuses on reducing the need to create labels for biomedical text mining by using unsupervised recurrent neural networks. The proposed method utilizes word embeddings to generate vector representations of biomedical concepts based on semantics and context. Experiments with unsupervised clustering of these biomedical concepts show that similar concepts are clustered together. While this clustering captures different synonyms of the same concept, it also captures the similarities between various diseases and their symptoms. To test the performance of the concept vectors on corpora of documents, a document vector generation method that utilizes these concept vectors is also proposed. The document vectors thus generated are used as input to clustering algorithms, and the results show that, across multiple corpora, the proposed methods of concept and document vector generation outperform the baselines and provide more meaningful clustering. The applications of this document clustering are far-reaching, especially in search and retrieval, providing clinicians, researchers, and patients with more holistic and comprehensive results than the exact terms they search for would return. Finally, a framework is presented for extracting clinical information from preventive care guidelines that can be mapped to electronic health records. The extracted information can be integrated with the clinical decision support system of an electronic health record. A visualization tool for better understanding and observing patient trajectories is also explored. Both methods have the potential to improve the preventive care services provided to patients.
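
A minimal sketch of the document-vector step described above: average the vectors of the concepts a document mentions, then cluster the documents. The random concept vectors and tiny documents are placeholders for embeddings learned from biomedical text:

```python
# Build document vectors by averaging concept vectors, then cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
concept_vectors = {c: rng.normal(size=50)
                   for c in ["diabetes", "insulin", "asthma", "inhaler"]}

documents = [
    ["diabetes", "insulin"],
    ["insulin", "diabetes", "diabetes"],
    ["asthma", "inhaler"],
    ["inhaler", "asthma", "asthma"],
]

doc_vectors = np.array([
    np.mean([concept_vectors[c] for c in doc], axis=0) for doc in documents
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_vectors)
print(labels)  # documents about the same condition should share a cluster
```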
88

Determining Event Outcomes from Social Media

Murugan, Srikala 05 1900 (has links)
An event is something that happens at a time and location. Events include major life events such as graduating college or getting married, and also simple day-to-day activities such as commuting to work or eating lunch. Most work on event extraction detects events and the entities involved in events. For example, cooking events will usually involve a cook, some utensils and appliances, and a final product. In this work, we target the task of determining whether events result in their expected outcomes. Specifically, we target cooking and baking events, and characterize event outcomes into two categories. First, we distinguish whether something edible resulted from the event. Second, if something edible resulted, we distinguish between perfect, partial and alternative outcomes. The main contributions of this thesis are a corpus of 4,000 tweets annotated with event outcome information and experimental results showing that the task can be automated. The corpus includes tweets that have only text as well as tweets that have text and an image.
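
A sketch of one plausible text-only baseline for the first outcome distinction (whether something edible resulted), assuming scikit-learn; the tweets and labels are invented examples, not the annotated corpus, and the thesis's actual models may differ:

```python
# TF-IDF features plus logistic regression over tweet text, predicting
# whether something edible resulted from the cooking/baking event.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "Fresh sourdough out of the oven, crust is perfect!",
    "Burned the cookies to charcoal again...",
    "Cake collapsed but still tastes great",
    "Smoke alarm went off, dinner is in the trash",
]
edible = [1, 0, 1, 0]  # 1 = something edible resulted

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(tweets, edible)
print(clf.predict(["My pizza came out golden and delicious"]))
```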
89

Natural Language Processing, Statistical Inference, and American Foreign Policy

Lauretig, Adam M. 06 November 2019 (has links)
No description available.
90

Facilitating Corpus Annotation by Improving Annotation Aggregation

Felt, Paul L 01 December 2015 (has links) (PDF)
Annotated text corpora facilitate the linguistic investigation of language as well as the automation of natural language processing (NLP) tasks. NLP tasks include problems such as spam email detection, grammatical analysis, and identifying mentions of people, places, and events in text. However, constructing high-quality annotated corpora can be expensive. Cost can be reduced by employing low-cost internet workers in a practice known as crowdsourcing, but the resulting annotations are often inaccurate, decreasing the usefulness of a corpus. This inaccuracy is typically mitigated by collecting multiple redundant judgments and aggregating them (e.g., via majority vote) to produce high-quality consensus answers. We improve the quality of consensus labels inferred from imperfect annotations in a number of ways. We show that transfer learning can be used to derive benefit from outdated annotations that would typically be discarded. We show that, contrary to popular preference, annotation aggregation models that take a generative data modeling approach tend to outperform those that take a conditional approach. We leverage this insight to develop csLDA, a novel annotation aggregation model that improves on the state of the art for a variety of annotation tasks. When data does not permit generative data modeling, we identify a conditional data modeling approach based on vector-space text representations that achieves state-of-the-art results on several unusual semantic annotation tasks. Finally, we identify a family of models capable of aggregating annotation data containing heterogeneous annotation types such as label frequencies and labeled features. We present a multiannotator active learning algorithm for this model family that jointly selects an annotator, data items, and annotation type.
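
A minimal sketch of the majority-vote baseline the abstract mentions: aggregate redundant crowd judgments per item by taking the most frequent label. The items and labels are invented examples:

```python
# Majority-vote aggregation of redundant crowd annotations.
from collections import Counter

annotations = {
    "item1": ["spam", "spam", "ham"],
    "item2": ["ham", "ham", "spam"],
    "item3": ["spam", "ham", "spam"],
}

consensus = {item: Counter(labels).most_common(1)[0][0]
             for item, labels in annotations.items()}
print(consensus)  # {'item1': 'spam', 'item2': 'ham', 'item3': 'spam'}
```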
