Global ETD Search

1	Intelligent Prediction of Stock Market Using Text and Data Mining Techniques Raahemi, Mohammad 04 September 2020 (has links) The stock market undergoes many fluctuations on a daily basis. These changes can be challenging to anticipate. Understanding such volatility are beneficial to investors as it empowers them to make inform decisions to avoid losses and invest when opportunities are predicted to earn funds. The objective of this research is to use text mining and data mining techniques to discover the relationship between news articles and stock prices fluctuations. There are a variety of sources for news articles, including Bloomberg, Google Finance, Yahoo Finance, Factiva, Thompson Routers, and Twitter. In our research, we use Factive and Intrinio news databases. These databases provide daily analytical articles about the general stock market, as well as daily changes in stock prices. The focus of this research is on understanding the news articles which influence stock prices. We believe that different types of stocks in the market behave differently, and news articles could provide indications on different stock price movements. The goal of this research is to create a framework that uses text mining and data mining algorithms to correlate different types of news articles with stock fluctuations to predict whether to “Buy”, “Sell”, or “Hold” a specific stock. We train Doc2Vec models on 1GB of financial news from Factiva to convert news articles into vectors of 100 dimensions. After preprocessing the data, including labeling and balancing the data, we build five predictive models, namely Neural Networks, SVM, Decision Tree, KNN, and Random Forest to predict stock movements (Buy, Sell, or Hold). We evaluate the performances of the predictive models in terms of accuracy and area under the ROC. We conclude that SVM provides the best performance among the five models to predict the stock movement. Stock Prediction Text Mining Data Mining Machine Learning Word Embedding
2	Effectivisation of keywords extraction process : A supervised binary classification approach of scraped words from company websites Andersson, Josef, Fremling, Max January 2023 (has links) In today’s digital era, establishing an online presence and maintaining a well-structured website is vitalfor companies to remain competitive in their respective markets. A crucial aspect of online success liesin strategically selecting the right words to optimize customer engagement and search engine visibility.However, this process is often time-consuming, involving extensive analysis of a company’s website aswell as its competitors’. This thesis focuses on developing an efficient binary classification approachto identify key words and phrases extracted from multiple company websites. The data set used forthis solution consists of approximately 92,000 scraped samples, primarily comprising non-key samples.Various features were extracted, and a word embedding model was employed to assess each sample’srelevance to its specific industry and topic. The logistic regression, decision tree and random forestalgorithms were all explored and implemented as different solutions to the classification problem. Theresults indicated that the logistic regression model excelled in retaining keywords but was less effectivein eliminating non-keywords. Conversely, the tree-based methods demonstrated superior classificationof keywords, albeit at the cost of misclassifying a few keywords. Overall, the random forest approachoutperformed the others, achieving a result of 76 percent in recall and 20 percent in precision whenpredicting key samples. In summary, this thesis presents a solution for classifying words and phrasesfrom company websites into key and non-key categories, and the developed methodology could offervaluable insights for companies seeking to enhance their website optimization strategies. Machine learning keyword classification unbalanced data word embedding Mathematics Matematik
3	French AXA Insurance Word Embeddings : Effects of Fine-tuning BERT and Camembert on AXA France’s data Zouari, Hend January 2020 (has links) We explore in this study the different Natural Language Processing state-of-the art technologies that allow transforming textual data into numerical representation. We go through the theory of the existing traditional methods as well as the most recent ones. This thesis focuses on the recent advances in Natural Language processing being developed upon the Transfer model. One of the most relevant innovations was the release of a deep bidirectional encoder called BERT that broke several state of the art results. BERT utilises Transfer Learning to improve modelling language dependencies in text. BERT is used for several different languages, other specialized model were released like the french BERT: Camembert. This thesis compares the language models of these different pre-trained models and compares their capability to insure a domain adaptation. Using the multilingual and the french pre-trained version of BERT and a dataset from AXA France’s emails, clients’ messages, legal documents, insurance documents containing over 60 million words. We fine-tuned the language models in order to adapt them on the Axa insurance’s french context to create a French AXAInsurance BERT model. We evaluate the performance of this model on the capability of the language model of predicting a masked token based on the context. BERT proves to perform better : modelling better the french AXA’s insurance text without finetuning than Camembert. However, with this small amount of data, Camembert is more capable of adaptation to this specific domain of insurance. / I denna studie undersöker vi de senaste teknologierna för Natural Language Processing, som gör det möjligt att omvandla textdata till numerisk representation. Vi går igenom teorin om befintliga traditionella metoder såväl som de senaste. Denna avhandling fokuserar på de senaste framstegen inom bearbetning av naturliga språk som utvecklats med hjälp av överföringsmodellen. En av de mest relevanta innovationerna var lanseringen av en djup dubbelriktad kodare som heter BERT som bröt flera toppmoderna resultat. BERT använder Transfer Learning för att förbättra modelleringsspråkberoenden i text. BERT används för flera olika språk, andra specialmodeller släpptes som den franska BERT: Camembert. Denna avhandling jämför språkmodellerna för dessa olika förutbildade modeller och jämför deras förmåga att säkerställa en domänanpassning. Med den flerspråkiga och franska förutbildade versionen av BERT och en dataset från AXA Frankrikes epostmeddelanden, kundmeddelanden, juridiska dokument, försäkringsdokument som innehåller över 60 miljoner ord. Vi finjusterade språkmodellerna för att anpassa dem till Axas försäkrings franska sammanhang för att skapa en fransk AXAInsurance BERT-modell. Vi utvärderar prestandan för denna modell på förmågan hos språkmodellen att förutsäga en maskerad token baserat på sammanhanget. BERTpresterar bättre: modellerar bättre den franska AXA-försäkringstexten utan finjustering än Camembert. Men med denna lilla mängd data är Camembert mer kapabel att anpassa sig till denna specifika försäkringsdomän. NLP Language model Word embedding BERT camemBERT NLP Language model Word embedding BERT camemBERT Computer and Information Sciences Data- och informationsvetenskap
4	Using Machine Learning to Learn from Bug Reports : Towards Improved Testing Efficiency Ingvarsson, Sanne January 2019 (has links) The evolution of a software system originates from its changes, whether it comes from changed user needs or adaption to its current environment. These changes are as encouraged as they are inevitable, although every change to a software system comes with a risk of introducing an error or a bug. This thesis aimed to investigate the possibilities of using the description of bug reports as a decision basis for detecting the provenance of a bug by using machine learning. K-means and agglomerative clustering have been applied to free text documents by using Natural Language Processing to initially divide the investigated software system into sub parts. Topic labelling is further on performed on the found clusters to find suitable names and get an overall understanding for the clusters.Finally, it was investigated if it was possible to find which cluster that were more likely to cause a bug from certain clusters and should be tested more thoroughly. By evaluating a subset of known causes, it was found that possible direct connections could be found in 50% of the cases, while this number increased to 58% if the cause were attached to clusters. Machine Learning Testing Bug Natural Language Processing Clustering Word Embedding Other Engineering and Technologies Annan teknik
5	Semantic Text Matching Using Convolutional Neural Networks Wang, Run Fen January 2018 (has links) Semantic text matching is a fundamental task for many applications in NaturalLanguage Processing (NLP). Traditional methods using term frequencyinversedocument frequency (TF-IDF) to match exact words in documentshave one strong drawback which is TF-IDF is unable to capture semanticrelations between closely-related words which will lead to a disappointingmatching result. Neural networks have recently been used for various applicationsin NLP, and achieved state-of-the-art performances on many tasks.Recurrent Neural Networks (RNN) have been tested on text classificationand text matching, but it did not gain any remarkable results, which is dueto RNNs working more effectively on texts with a short length, but longdocuments. In this paper, Convolutional Neural Networks (CNN) will beapplied to match texts in a semantic aspect. It uses word embedding representationsof two texts as inputs to the CNN construction to extract thesemantic features between the two texts and give a score as the output ofhow certain the CNN model is that they match. The results show that aftersome tuning of the parameters the CNN model could produce accuracy,prediction, recall and F1-scores all over 80%. This is a great improvementover the previous TF-IDF results and further improvements could be madeby using dynamic word vectors, better pre-processing of the data, generatelarger and more feature rich data sets and further tuning of the parameters. Text matching CNN TF-IDF Word embedding Word2vec NLP
6	History-related Knowledge Extraction from Temporal Text Collections / テキストコレクションからの歴史関連知識の抽出 Duan, Yijun 23 March 2020 (has links) 京都大学 / 0048 / 新制・課程博士 / 博士(情報学) / 甲第22574号 / 情博第711号 / 新制\|\|情\|\|122(附属図書館) / 京都大学大学院情報学研究科社会情報学専攻 / (主査)教授吉川正俊, 教授鹿島久嗣, 教授田島敬史, 特定准教授 JATOWT Adam Wladyslaw / 学位規則第4条第1項該当 / Doctor of Informatics / Kyoto University / DGAM timeline summarization dynamic word embedding news mining information extraction computational history 007
7	Multi-Layer Web Services Discovery using Word Embedding and Clustering Techniques Obidallah, Waeal 25 February 2021 (has links) Web services discovery is the process of finding the right Web services that best match the end-users’ functional and non-functional requirements. Artificial intelligence, natural language processing, data mining, and text mining techniques have been applied by researchers in Web services discovery to facilitate the process of matchmaking. This thesis contributes to the area of Web services discovery and recommendation, adopting the Design Science Research Methodology to guide the development of useful knowledge, including design theory and artifacts. The lack of a comprehensive review of Web services discovery and recommendation in the literature motivated us to conduct a systematic literature review. Our main purpose in conducting the systematic literature review was to identify and systematically compare current clustering and association rules techniques for Web services discovery and recommendation by providing answers to various research questions, investigating the prior knowledge, and identifying gaps in the related literature. We then propose a conceptual model and a typology of Web services discovery systems. The conceptual model provides a high-level representation of Web services discovery systems, including their various elements, tasks, and relationships. The proposed typology of Web services discovery systems is composed of five groups of characteristics: storage and location characteristics, formalization characteristics, matchmaking characteristics, automation characteristics, and selection characteristics. We reference the typology to compare Web services discovery methods and architectures from the extant literature by linking them to the five proposed characteristics. We employ the proposed conceptual model with its specified characteristics to design and develop the multi-layer data mining architecture for Web services discovery using word embedding and clustering techniques. The proposed architecture consists of five layers: Web services description and data preprocessing; word embedding and representation; syntactic similarity; semantic similarity; and clustering. In the first layer, we identify the steps to parse and preprocess the Web services documents. Bag of Words with Term Frequency–Inverse Document Frequency and three word-embedding models are employed for Web services representation in the second layer. Then in the third layer, four distance measures, including Cosine, Euclidean, Minkowski, and Word Mover, are studied to find the similarities between Web services documents. In layer four, WordNet and Normalized Google Distance are employed to represent and find the similarity between Web services documents. Finally, in the fifth layer, three clustering algorithms, including affinity propagation, K-means, and hierarchical agglomerative clustering, are investigated to cluster Web services based on the observed documents’ similarities. We demonstrate how each component of the five layers is employed in the process of Web services clustering using random-ly selected Web services documents. We conduct experimental analysis to cluster Web services using a collected dataset of Web services documents and evaluating their clustering performances. Using a ground truth for evaluation purposes, we observe that clusters built based on the word embedding models performed better compared to those built using the Bag of Words with Term Frequency–Inverse Document Frequency model. Among the three word embedding models, the pre-trained Word2Vec’s skip-gram model reported higher performance in clustering Web services. Among the three semantic similarity measures, path-based WordNet similarity reported higher clustering performance. By considering the different words representations models and syntactic and semantic similarity measures, the affinity propagation clustering technique performed better in discovering similarities among Web services. Web Services Clustering Web services discovery systematic literature review word embedding NLP
8	Automatic Retail Product Identification System for Cashierless Stores Zhong, Shiting January 2021 (has links) The introduction of artificial intelligence techniques in the retail market is making a revolution in shopping experience. It allows shoppers to walk into a store, grab what they want and simply walk out without scanning barcodes or having to stand in long queues. That is what we call cashierless stores. In this project, it aims to provide an efficient solution to the automatic retail product identification. This solution presents an artifact one can use to build an end-to-end smart system for cashierless stores. Henceforth, a solution based on text classification is proposed to recognize and identify the products. For that, deep learning techniques are used such as RNN and LSTM to build the classifier. The performance of this classifier is evaluated using various metrics and it shows its efficiency with an accuracy exceeding 86%. Text Classification Retail Product Identification Word Embedding Neural Network Word2Vec GloVe Engineering and Technology Teknik och teknologier
9	Classification of User Stories using aNLP and Deep Learning Based Approach Kandikari, Bhavesh January 2023 (has links) No description available. Natural Language Processing Word Embedding Requirement Refining User Story Deep Learning Computer Sciences Datavetenskap (datalogi)
10	Topic discovery and document similarity via pre-trained word embeddings Chen, Simin January 2018 (has links) Throughout the history, humans continue to generate an ever-growing volume of documents about a wide range of topics. We now rely on computer programs to automatically process these vast collections of documents in various applications. Many applications require a quantitative measure of the document similarity. Traditional methods first learn a vector representation for each document using a large corpus, and then compute the distance between two document vectors as the document similarity.In contrast to this corpus-based approach, we propose a straightforward model that directly discovers the topics of a document by clustering its words, without the need of a corpus. We define a vector representation called normalized bag-of-topic-embeddings (nBTE) to encapsulate these discovered topics and compute the soft cosine similarity between two nBTE vectors as the document similarity. In addition, we propose a logistic word importance function that assigns words different importance weights based on their relative discriminating power.Our model is efficient in terms of the average time complexity. The nBTE representation is also interpretable as it allows for topic discovery of the document. On three labeled public data sets, our model achieved comparable k-nearest neighbor classification accuracy with five stateof-art baseline models. Furthermore, from these three data sets, we derived four multi-topic data sets where each label refers to a set of topics. Our model consistently outperforms the state-of-art baseline models by a large margin on these four challenging multi-topic data sets. These works together provide answers to the research question of this thesis:Can we construct an interpretable document represen-tation by clustering the words in a document, and effectively and efficiently estimate the document similarity? / Under hela historien fortsätter människor att skapa en växande mängd dokument om ett brett spektrum av publikationer. Vi förlitar oss nu på dataprogram för att automatiskt bearbeta dessa stora samlingar av dokument i olika applikationer. Många applikationer kräver en kvantitativmått av dokumentets likhet. Traditionella metoder först lära en vektorrepresentation för varje dokument med hjälp av en stor corpus och beräkna sedan avståndet mellan two document vektorer som dokumentets likhet.Till skillnad från detta corpusbaserade tillvägagångssätt, föreslår vi en rak modell som direkt upptäcker ämnena i ett dokument genom att klustra sina ord , utan behov av en corpus. Vi definierar en vektorrepresentation som kallas normalized bag-of-topic-embeddings (nBTE) för att inkapsla de upptäckta ämnena och beräkna den mjuka cosinuslikheten mellan två nBTE-vektorer som dokumentets likhet. Dessutom föreslår vi en logistisk ordbetydelsefunktion som tilldelar ord olika viktvikter baserat på relativ diskriminerande kraft.Vår modell är effektiv när det gäller den genomsnittliga tidskomplexiteten. nBTE-representationen är också tolkbar som möjliggör ämnesidentifiering av dokumentet. På tremärkta offentliga dataset uppnådde vår modell jämförbar närmaste grannklassningsnoggrannhet med fem toppmoderna modeller. Vidare härledde vi från de tre dataseten fyra multi-ämnesdatasatser där varje etikett hänvisar till en uppsättning ämnen. Vår modell överensstämmer överens med de högteknologiska baslinjemodellerna med en stor marginal av fyra utmanande multi-ämnesdatasatser. Dessa arbetsstöd ger svar på forskningsproblemet av tisthesis:Kan vi konstruera en tolkbar dokumentrepresentation genom att klustra orden i ett dokument och effektivt och effektivt uppskatta dokumentets likhet? Computer and Information Sciences Data- och informationsvetenskap

Search results