11

Software Issue Time Estimation With Natural Language Processing and Machine Learning / Tidsuppskattning för mjukvaruärenden med språkteknologi och maskininlärning

Hyberg, Martin January 2021
Time estimation for software issues is crucial to project planning. Developers and experts have for many decades tried to estimate time requirements for issues as accurately as possible. The methods used today are often time-consuming and complex. This thesis investigates whether the time estimation process can be done with natural language processing and machine learning. Most software issues have a free-text description of what is wrong or what needs to be added. Three different word embeddings were used to represent this free-text description: bag-of-words with tf-idf weighting, word2vec and fastText. The word embeddings were then fed into two types of machine learning approaches, classification and regression. The classification was binary and can be formulated as: will the issue take more than three hours? The goal of the regression problem was to predict an actual value for the time the issue would take to complete. The classification models' performance was measured with the F1-score, and the regression models with the R2-score. The best F1-score for classification was 0.748, achieved with the word2vec embedding and an SVM classifier. The best score for the regression analysis was achieved with the bag-of-words embedding, which reached an R2-score of 0.380. Further evaluation of the results and a comparison with actual estimates made by the company show that humans perform only slightly better than the models on the binary classification defined above. The F1-score of the employees was 0.792, a difference of just 0.044 from the best F1-score achieved by the models. This thesis concludes that the models are not good enough to use in a professional setting. An F1-score of 0.748 could be useful in other settings, but the classification question in this problem is too broad to be used for a real project. The regression results are also too low to be of any practical use.
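A rough sketch of the binary classification setup described in this abstract (bag-of-words with tf-idf weighting fed to an SVM that answers "will the issue take more than three hours?"). The issue descriptions and hour labels below are fabricated placeholders, and the snippet illustrates the general approach rather than the thesis's actual code or data.

```python
# Toy illustration: tf-idf features + SVM for "will the issue take more than three hours?"
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

descriptions = ["fix login timeout", "add export button", "refactor payment module", "update docs"]
hours = [5.0, 2.0, 12.0, 1.0]                      # actual time spent (hypothetical)
labels = [1 if h > 3 else 0 for h in hours]        # binary target: more than three hours?

X_train, X_test, y_train, y_test = train_test_split(
    descriptions, labels, test_size=0.5, random_state=0, stratify=labels)

vectorizer = TfidfVectorizer()                     # bag-of-words with tf-idf weighting
clf = SVC(kernel="linear")
clf.fit(vectorizer.fit_transform(X_train), y_train)

pred = clf.predict(vectorizer.transform(X_test))
print("F1:", f1_score(y_test, pred))
```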
12

Investigating Performance of Different Models at Short Text Topic Modelling / En jämförelse av textrepresentationsmodellers prestanda tillämpade för ämnesinnehåll i korta texter

Akinepally, Pratima Rao January 2020
The key objective of this project was to quantitatively and qualitatively assess the performance of a sentence embedding model, the Universal Sentence Encoder (USE), and a word embedding model, word2vec, at the task of topic modelling. The data used for the project consisted of podcast descriptions available at Spotify and the topics associated with them. The embedding models were used to generate description vectors and topic vectors, which were then used to assign topics to descriptions from a test set. The results from this study led to the conclusion that embedding models are well suited to this task, and that overall USE outperforms the word2vec models.
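A minimal sketch of the topic-assignment step described above: embed a description and every candidate topic, then pick the topic whose vector is closest in cosine similarity. Averaged word2vec vectors stand in here for the sentence embeddings; the corpus, topics and parameters are illustrative assumptions, not the thesis's data or configuration.

```python
# Assign a topic to a description by cosine similarity between embedding vectors.
import numpy as np
from gensim.models import Word2Vec

corpus = [["true", "crime", "podcast", "about", "cold", "cases"],
          ["weekly", "football", "analysis", "and", "match", "previews"],
          ["guided", "meditation", "for", "better", "sleep"]]
topics = ["crime", "football", "meditation"]

model = Word2Vec(corpus, vector_size=50, min_count=1, seed=1)   # gensim 4.x API

def embed(tokens):
    vecs = [model.wv[t] for t in tokens if t in model.wv]       # average word vectors
    return np.mean(vecs, axis=0)

def assign_topic(description_tokens):
    d = embed(description_tokens)
    sims = [np.dot(d, model.wv[t]) / (np.linalg.norm(d) * np.linalg.norm(model.wv[t]))
            for t in topics]
    return topics[int(np.argmax(sims))]

print(assign_topic(["a", "podcast", "about", "unsolved", "crime", "cases"]))
```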
13

Decentralized Large-Scale Natural Language Processing Using Gossip Learning / Decentraliserad Storskalig Naturlig Språkbehandling med Hjälp av Skvallerinlärning

Alkathiri, Abdul Aziz January 2020
The field of Natural Language Processing in machine learning has seen rising popularity and use in recent years. The nature of Natural Language Processing, which deals with natural human language and computers, has led to the research and development of many algorithms that produce word embeddings. One of the most widely used of these algorithms is Word2Vec. With the abundance of data generated by users and organizations and the complexity of machine learning and deep learning models, performing training on a single machine becomes unfeasible. The advancement of distributed machine learning offers a solution to this problem. Unfortunately, for reasons of data privacy and regulation, in some real-life scenarios the data must not leave its local machine. This limitation has led to the development of techniques and protocols that are massively parallel and data-private. The most popular of these protocols is federated learning, but due to its centralized nature it still poses some security and robustness risks. Consequently, this led to the development of massively parallel, data-private, decentralized approaches, such as gossip learning. In the gossip learning protocol, each node in the network periodically chooses a random peer for information exchange, which eliminates the need for a central node. This research intends to test the viability of gossip learning for large-scale, real-world applications. In particular, it focuses on the implementation and evaluation of a Natural Language Processing application using gossip learning. The results show that the application of Word2Vec in a gossip learning framework is viable and yields results comparable to its non-distributed, centralized counterpart in various scenarios, with an average loss in quality of 6.904%.
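A toy sketch of one gossip-learning round in the spirit of the protocol described above: each node picks a random peer and the pair merges (here, simply averages) their model parameters, so no central coordinator is needed. Plain parameter vectors stand in for the Word2Vec models, and the node count, dimensions and averaging merge are illustrative assumptions.

```python
# One gossip round: every node exchanges and merges its model with a random peer.
import random
import numpy as np

random.seed(0)
np.random.seed(0)
n_nodes, dim = 5, 4
models = [np.random.rand(dim) for _ in range(n_nodes)]   # one local model per node

def gossip_round(models):
    for i in range(len(models)):
        j = random.choice([k for k in range(len(models)) if k != i])   # random peer
        merged = (models[i] + models[j]) / 2.0                          # exchange + merge
        models[i], models[j] = merged, merged.copy()

for _ in range(20):
    gossip_round(models)

print("spread after gossip:", np.ptp(np.stack(models), axis=0))   # values converge
```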
14

Finfördelad Sentimentanalys : Utvärdering av neurala nätverksmodeller och förbehandlingsmetoder med Word2Vec / Fine-grained Sentiment Analysis : Evaluation of Neural Network Models and Preprocessing Methods with Word2Vec

Phanuwat, Phutiwat January 2024
Sentiment analysis is a technique aimed at automatically identifying the emotional tone in text. Typically, text is classified as positive, neutral, or negative. The downside of this classification is that nuances are lost when text is categorized into only three categories. An advancement of this classification is to include two additional categories: very positive and very negative. The challenge with this five-class classification is that achieving high performance becomes more difficult due to the increased number of categories, which has led to the need to explore different methods to solve the problem. The purpose of the study is therefore to evaluate various classifiers, such as MLP, CNN, and Bi-GRU in combination with word2vec, to classify sentiment in text into five categories. The study also aims to explore which preprocessing method yields higher performance for word2vec. The models were developed using the SST dataset, a well-known dataset in fine-grained sentiment analysis. To determine which preprocessing method yields higher performance for word2vec, the dataset was preprocessed in four different ways: simple preprocessing (EF, from the Swedish enkel förbehandling), simple preprocessing with stop-word removal (EF+Without Stopwords), with lemmatization (EF+Lemmatization), and with both (EF+Without Stopwords/Lemmatization). Dropout was used to help the models generalize better, and training was regulated with an early stopping technique. To evaluate which classifier yields higher performance, the best-performing preprocessing method was used and the optimal hyperparameters were explored. The metrics used in the study to evaluate performance are accuracy and F1-score. The results of the study showed that the EF method performed best compared to the other preprocessing methods explored, and the model with the highest accuracy and F1-score was Bi-GRU.
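A sketch of a Bi-GRU five-class sentiment classifier with dropout and early stopping, as described in this abstract. The vocabulary size, dimensions and training data below are hypothetical placeholders, and the embedding layer could in practice be initialized from word2vec vectors; this is an illustration of the architecture, not the thesis's actual model.

```python
# Bi-GRU classifier with dropout and early stopping on dummy data.
import numpy as np
import tensorflow as tf

vocab_size, seq_len, embed_dim, n_classes = 5000, 40, 100, 5
X = np.random.randint(0, vocab_size, size=(200, seq_len))    # token-id sequences (dummy)
y = np.random.randint(0, n_classes, size=(200,))             # labels 0..4 (dummy)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),        # could be seeded with word2vec
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                               restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=20, callbacks=[early_stop], verbose=0)
```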
15

An empirical study of semantic similarity in WordNet and Word2Vec

Handler, Abram 18 December 2014 (has links)
This thesis performs an empirical analysis of Word2Vec by comparing its output to WordNet, a well-known, human-curated lexical database. It finds that Word2Vec tends to uncover more of certain types of semantic relations than others, returning more hypernyms, synonyms and hyponyms than meronyms or holonyms. It also shows the probability that neighbors separated by a given cosine distance in Word2Vec are semantically related in WordNet. This result both adds to our understanding of the still poorly understood Word2Vec and helps to benchmark new semantic tools built from word vectors.
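An illustrative sketch of the comparison described above: take a word's nearest neighbors in a word-vector space and check whether each neighbor stands in a WordNet relation (synonym, hypernym or hyponym) to the word. Pretrained GloVe-style vectors loaded through gensim's downloader stand in for the Word2Vec vectors used in the thesis, and the snippet assumes the NLTK WordNet corpus has been downloaded.

```python
# Check whether word-vector neighbours are semantically related in WordNet.
import gensim.downloader as api
from nltk.corpus import wordnet as wn    # requires nltk.download("wordnet") beforehand

vectors = api.load("glove-wiki-gigaword-50")   # small pretrained vectors, fetched on first use

def wordnet_related(word, other):
    for syn in wn.synsets(word):
        lemmas = {l.name() for l in syn.lemmas()}
        hyper = {l.name() for h in syn.hypernyms() for l in h.lemmas()}
        hypo = {l.name() for h in syn.hyponyms() for l in h.lemmas()}
        if other in lemmas | hyper | hypo:
            return True
    return False

word = "dog"
for neighbour, cosine in vectors.most_similar(word, topn=10):
    print(f"{neighbour:12s} cos={cosine:.2f} wordnet_related={wordnet_related(word, neighbour)}")
```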
16

Word embeddings and Patient records : The identification of MRI risk patients

Kindberg, Erik January 2019
Identification of risks ahead of MRI examinations is identified as a cumbersome and time-consuming process at the Linköping University Hospital radiology clinic. The hospital staff often have to search through large amounts of unstructured patient data to find information about implants. Word embeddings have been identified as a possible tool to speed up this process. The purpose of this thesis is to evaluate this method, which is done by training a Word2Vec model on patient record data and analyzing the close neighbours of key search words by calculating cosine similarity. The 50 closest neighbours of each search word are categorized and annotated as relevant or not to the task of identifying risk patients ahead of MRI examinations. 10 search words were explored, leading to a total of 500 terms being annotated. In total, 14 different categories were observed in the result, and 8 of these were considered relevant. Out of the 500 terms, 340 (68%) were considered relevant. In addition, 48 implant models could be observed, which are particularly interesting because if a patient has an implant, hospital staff need to determine its exact model and the MRI conditions of that model. Overall these findings point towards a positive answer to the aim of the thesis, although further developments are needed.
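A minimal sketch of the neighbour-extraction step described above: train Word2Vec on record text and list the closest neighbours of a search word by cosine similarity. The toy sentences below are fabricated stand-ins, since real patient data obviously cannot be shown, and the thesis uses the 50 nearest neighbours rather than 5.

```python
# Nearest neighbours of a search word in a Word2Vec model trained on record text.
from gensim.models import Word2Vec

sentences = [
    ["patient", "has", "pacemaker", "implant", "model", "evia"],
    ["mri", "examination", "postponed", "due", "to", "pacemaker"],
    ["no", "implants", "reported", "before", "mri"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)   # gensim 4.x

search_word = "pacemaker"
for term, cosine_sim in model.wv.most_similar(search_word, topn=5):   # thesis uses topn=50
    print(f"{term:12s} {cosine_sim:.3f}")
```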
17

Semantic Text Matching Using Convolutional Neural Networks

Wang, Run Fen January 2018
Semantic text matching is a fundamental task for many applications in Natural Language Processing (NLP). Traditional methods using term frequency-inverse document frequency (TF-IDF) to match exact words in documents have one strong drawback: TF-IDF is unable to capture semantic relations between closely related words, which leads to disappointing matching results. Neural networks have recently been used for various applications in NLP and have achieved state-of-the-art performance on many tasks. Recurrent Neural Networks (RNN) have been tested on text classification and text matching but did not yield any remarkable results, since RNNs work more effectively on short texts than on long documents. In this paper, Convolutional Neural Networks (CNN) are applied to match texts in a semantic aspect. Word embedding representations of two texts are used as inputs to the CNN, which extracts the semantic features between the two texts and outputs a score expressing how certain the CNN model is that they match. The results show that, after some tuning of the parameters, the CNN model could produce accuracy, precision, recall and F1-scores all over 80%. This is a great improvement over the previous TF-IDF results, and further improvements could be made by using dynamic word vectors, better pre-processing of the data, generating larger and more feature-rich data sets, and further tuning of the parameters.
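A sketch of a CNN text-matching architecture of the kind described above: two texts, given as sequences of token ids, pass through a shared convolutional encoder, and a dense layer scores how likely they are to match. The dimensions and layer sizes are illustrative assumptions rather than the thesis's actual configuration.

```python
# Two-input CNN matcher: shared Conv1D encoder, concatenated features, sigmoid match score.
import tensorflow as tf

vocab_size, seq_len, embed_dim = 10000, 50, 100

def encoder():
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        tf.keras.layers.Conv1D(128, kernel_size=3, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
    ])

shared = encoder()                                    # same weights encode both texts
text_a = tf.keras.Input(shape=(seq_len,), dtype="int32")
text_b = tf.keras.Input(shape=(seq_len,), dtype="int32")
merged = tf.keras.layers.concatenate([shared(text_a), shared(text_b)])
score = tf.keras.layers.Dense(1, activation="sigmoid")(merged)   # match probability

model = tf.keras.Model(inputs=[text_a, text_b], outputs=score)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```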
18

Word and Relation Embedding for Sentence Representation

January 2017
In recent years, several methods have been proposed to encode sentences into fixed-length continuous vectors, called sentence representations or sentence embeddings. With the recent advancements in various deep learning methods applied in Natural Language Processing (NLP), these representations play a crucial role in tasks such as named entity recognition, question answering and sentence classification. Traditionally, sentence vector representations are learnt from their constituent word representations, also known as word embeddings. Various methods to learn distributed representations (embeddings) of words have been proposed using the notion of Distributional Semantics, i.e. “the meaning of a word is characterized by the company it keeps”. However, the principle of compositionality states that the meaning of a sentence is a function of the meanings of its words and of the way they are syntactically combined. In various recent methods for sentence representation, syntactic information such as the dependencies or relations between words has been largely ignored. In this work, I have explored the effectiveness of sentence representations that are composed of the representations of both the constituent words and the relations between the words in a sentence. The word and relation embeddings are learned based on their context. These general-purpose embeddings can also be used as off-the-shelf semantic and syntactic features for various NLP tasks. Similarity evaluation tasks were performed on two datasets, showing the usefulness of the learned word embeddings. Experiments were conducted on three different sentence classification tasks, showing that our sentence representations outperform the original word-based sentence representations when used with state-of-the-art neural network architectures. / Dissertation/Thesis / Masters Thesis Computer Science 2017
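An illustrative sketch of the composition idea described above: a sentence vector built from both word embeddings and embeddings of the dependency relations between words. The embedding tables and the dependency triples below are toy placeholders; the thesis learns both kinds of embeddings from context rather than sampling them randomly.

```python
# Compose a sentence vector from averaged word embeddings plus averaged relation embeddings.
import numpy as np

rng = np.random.default_rng(0)
word_emb = {w: rng.normal(size=50) for w in ["the", "dog", "chased", "cat"]}
rel_emb = {r: rng.normal(size=50) for r in ["det", "nsubj", "dobj"]}

# (dependent, relation, head) triples for "the dog chased the cat"
triples = [("the", "det", "dog"), ("dog", "nsubj", "chased"),
           ("the", "det", "cat"), ("cat", "dobj", "chased")]

word_part = np.mean([word_emb[w] for w in ["the", "dog", "chased", "the", "cat"]], axis=0)
rel_part = np.mean([rel_emb[r] for (_, r, _) in triples], axis=0)

sentence_vec = np.concatenate([word_part, rel_part])   # word + relation composition
print(sentence_vec.shape)                              # (100,)
```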
19

A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy

Ma, Long 08 August 2017
Text classification, the task of assigning metadata to documents, requires significant time and effort when performed by humans. Moreover, with online-generated content growing explosively, manual annotation of large-scale, unstructured data becomes a challenge. Many state-of-the-art text mining methods have been applied to the classification process, many of them based on keyword extraction. However, when these keywords are used as features in a classification task, the feature dimension is commonly huge, and selecting keywords from tons of documents as features is itself a challenge. Especially when traditional machine learning algorithms are used on large data sets, the computation cost is high. In addition, almost 80% of real data is unstructured and unlabeled, and advanced supervised feature selection methods cannot be used directly to select entities from massive amounts of data. Usually, when extracting features from unlabeled data for classification tasks, statistical strategies are utilized to discover key features. We instead propose a novel method to extract important features effectively before feeding them into the classification assignment. Another challenge in text classification is the multi-label problem: the assignment of multiple non-exclusive labels to documents, which makes text classification more complicated than single-label classification. Considering the above issues, we develop a framework for extracting features and reducing data dimensionality, and for solving the multi-label problem on labeled and unlabeled data sets. To reduce data dimensionality, we provide 1) a hybrid feature selection method that extracts meaningful features according to the importance of each feature; 2) a Word2Vec-based document representation with a lower feature dimension for document categorization on big data sets; and 3) an unsupervised approach to extract features from real online-generated data for text classification and prediction. To solve the multi-label classification task, we design a new Multi-Instance Multi-Label (MIML) algorithm in the proposed framework.
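A sketch of the kind of pipeline this abstract describes: documents represented by averaged Word2Vec vectors (giving a low, fixed feature dimension) and a one-vs-rest classifier handling non-exclusive, multi-label targets. The documents, labels and classifier choice below are toy assumptions, not the framework's actual components (in particular, the MIML algorithm and feature selection steps are omitted).

```python
# Multi-label classification on averaged Word2Vec document vectors.
import numpy as np
from gensim.models import Word2Vec
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

docs = [["stock", "market", "rally"], ["election", "results", "announced"],
        ["market", "reacts", "to", "election"], ["new", "phone", "released"]]
labels = [{"finance"}, {"politics"}, {"finance", "politics"}, {"tech"}]

w2v = Word2Vec(docs, vector_size=50, min_count=1, seed=1)            # gensim 4.x API
X = np.array([np.mean([w2v.wv[t] for t in d], axis=0) for d in docs])

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                                         # one column per label

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(mlb.inverse_transform(clf.predict(X[:1])))                      # predicted label set
```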
20

Source code search for automatic bug localization

Shayan Ali A Akbar 14 December 2020
This dissertation advances the state-of-the-art in information retrieval (IR) based automatic bug localization for large software systems. We present techniques from three generations of IR based bug localization and compare their performances on our large and diverse bug localization dataset --- the Bugzbook dataset. The three generations span over fifteen years of research in mining software repositories for bug localization and include: (1) the generation of simple bag-of-words (BoW) based techniques, (2) the generation in which software-centric information such as bug and code change histories as well as structured information embedded in bug reports and code files are exploited to improve retrieval, and (3) the third and most recent generation in which order and semantic relationships between terms are modeled to improve the performance of bug localization systems. The dissertation also presents a novel technique called SCOR (Source Code Retrieval with Semantics and Order) which combines Markov Random Fields (MRF) based term-term ordering dependencies with semantic word vectors obtained from neural network based word embedding algorithms, such as word2vec, to better localize bugs in code files. The results presented in this dissertation show that while term-term ordering and semantic relationships significantly improve the performance when they are modeled separately in retrieval systems, the best precisions in retrieval are obtained when they are modeled together in a single retrieval system. We also show that the semantic representations of software terms learned by training the word embedding algorithm on a corpus of software repositories can be used to perform search in new software code repositories not present in the training corpus of the word embedding algorithm.
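An illustrative sketch of the broad idea behind combining exact term matching with semantic word-vector similarity when ranking source files against a bug report, in the spirit of SCOR. This is not the dissertation's actual model: the MRF term-ordering component is omitted, and the file contents, bug report and weighting are arbitrary placeholders.

```python
# Rank files by a weighted mix of exact-term overlap and word2vec semantic similarity.
import numpy as np
from gensim.models import Word2Vec

files = {"auth.py": ["login", "session", "token", "expired"],
         "ui.py": ["button", "render", "layout"]}
bug_report = ["user", "login", "fails", "after", "token", "expires"]

w2v = Word2Vec(list(files.values()) + [bug_report], vector_size=50, min_count=1, seed=1)

def avg_vec(tokens):
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

def score(query, doc_tokens, alpha=0.5):
    exact = len(set(query) & set(doc_tokens)) / len(set(query))           # exact-term overlap
    q, d = avg_vec(query), avg_vec(doc_tokens)
    semantic = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
    return alpha * exact + (1 - alpha) * semantic                          # weighted combination

ranked = sorted(files, key=lambda f: score(bug_report, files[f]), reverse=True)
print(ranked)
```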
