Global ETD Search

1	Intelligent Prediction of Stock Market Using Text and Data Mining Techniques Raahemi, Mohammad 04 September 2020 The stock market undergoes many fluctuations on a daily basis, and these changes can be challenging to anticipate. Understanding such volatility is beneficial to investors, as it empowers them to make informed decisions to avoid losses and to invest when profitable opportunities are predicted. The objective of this research is to use text mining and data mining techniques to discover the relationship between news articles and stock price fluctuations. There are a variety of sources for news articles, including Bloomberg, Google Finance, Yahoo Finance, Factiva, Thomson Reuters, and Twitter. In our research, we use the Factiva and Intrinio news databases. These databases provide daily analytical articles about the general stock market, as well as daily changes in stock prices. The focus of this research is on understanding the news articles that influence stock prices. We believe that different types of stocks in the market behave differently, and that news articles could provide indications of different stock price movements. The goal of this research is to create a framework that uses text mining and data mining algorithms to correlate different types of news articles with stock fluctuations in order to predict whether to “Buy”, “Sell”, or “Hold” a specific stock. We train Doc2Vec models on 1 GB of financial news from Factiva to convert news articles into 100-dimensional vectors. After preprocessing the data, including labeling and balancing, we build five predictive models, namely Neural Networks, SVM, Decision Tree, KNN, and Random Forest, to predict stock movements (Buy, Sell, or Hold). We evaluate the performance of the predictive models in terms of accuracy and area under the ROC curve. We conclude that SVM provides the best performance among the five models in predicting stock movement.
Keywords: Stock Prediction; Text Mining; Data Mining; Machine Learning; Word Embedding
2	Effectivisation of keywords extraction process : A supervised binary classification approach of scraped words from company websites Andersson, Josef, Fremling, Max January 2023 In today’s digital era, establishing an online presence and maintaining a well-structured website is vital for companies to remain competitive in their respective markets. A crucial aspect of online success lies in strategically selecting the right words to optimize customer engagement and search engine visibility. However, this process is often time-consuming, involving extensive analysis of a company’s website as well as its competitors’. This thesis focuses on developing an efficient binary classification approach to identify key words and phrases extracted from multiple company websites. The data set used for this solution consists of approximately 92,000 scraped samples, primarily comprising non-key samples. Various features were extracted, and a word embedding model was employed to assess each sample’s relevance to its specific industry and topic. The logistic regression, decision tree and random forest algorithms were all explored and implemented as different solutions to the classification problem. The results indicated that the logistic regression model excelled in retaining keywords but was less effective in eliminating non-keywords. Conversely, the tree-based methods demonstrated superior classification of keywords, albeit at the cost of misclassifying a few keywords. Overall, the random forest approach outperformed the others, achieving a result of 76 percent in recall and 20 percent in precision when predicting key samples. In summary, this thesis presents a solution for classifying words and phrases from company websites into key and non-key categories, and the developed methodology could offer valuable insights for companies seeking to enhance their website optimization strategies.
Keywords: Machine learning; keyword classification; unbalanced data; word embedding; Mathematics
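To make the reported figures concrete, the following minimal sketch shows how recall and precision are computed for the positive ("key") class. The confusion-matrix counts are hypothetical, chosen only so that the ratios match the 76 percent recall and 20 percent precision quoted in the abstract; they are not taken from the thesis.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive ("key") class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts (an assumption, not data from the thesis):
# 100 true key samples, 76 of them retrieved, plus 304 false alarms.
tp, fn, fp = 76, 24, 304
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With a data set dominated by non-key samples, high recall combined with low precision means most real keywords are kept, but they are still outnumbered by false positives among the predicted keywords.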
3	French AXA Insurance Word Embeddings : Effects of Fine-tuning BERT and Camembert on AXA France’s data Zouari, Hend January 2020 In this study we explore the state-of-the-art Natural Language Processing technologies that transform textual data into numerical representations. We go through the theory of the existing traditional methods as well as the most recent ones. This thesis focuses on recent advances in Natural Language Processing built upon the Transfer model. One of the most relevant innovations was the release of a deep bidirectional encoder called BERT, which surpassed several state-of-the-art results. BERT utilises Transfer Learning to improve the modelling of language dependencies in text. BERT is used for several different languages, and other specialized models have been released, such as the French BERT: Camembert. This thesis compares the language models of these different pre-trained models and compares their capability to ensure domain adaptation. Using the multilingual and the French pre-trained versions of BERT and a dataset of over 60 million words drawn from AXA France’s emails, clients’ messages, legal documents, and insurance documents, we fine-tuned the language models to adapt them to AXA’s French insurance context and create a French AXAInsurance BERT model. We evaluate the performance of this model on the capability of the language model to predict a masked token based on the context. Without fine-tuning, BERT performs better than Camembert at modelling AXA’s French insurance text. However, with this small amount of data, Camembert is more capable of adapting to this specific insurance domain.
Keywords: NLP; Language model; Word embedding; BERT; camemBERT; Computer and Information Sciences
4	Using Machine Learning to Learn from Bug Reports : Towards Improved Testing Efficiency Ingvarsson, Sanne January 2019 The evolution of a software system originates from its changes, whether they come from changed user needs or from adaptation to its current environment. These changes are as encouraged as they are inevitable, although every change to a software system comes with a risk of introducing an error or a bug. This thesis aimed to investigate the possibilities of using the descriptions in bug reports as a decision basis for detecting the provenance of a bug by using machine learning. K-means and agglomerative clustering have been applied to free-text documents by using Natural Language Processing to initially divide the investigated software system into sub-parts. Topic labelling is then performed on the resulting clusters to find suitable names and to get an overall understanding of the clusters. Finally, it was investigated whether it was possible to identify which clusters were more likely to cause a bug in certain other clusters and should therefore be tested more thoroughly. By evaluating a subset of known causes, it was found that possible direct connections could be found in 50% of the cases, while this number increased to 58% if the causes were attached to clusters.
Keywords: Machine Learning; Testing; Bug; Natural Language Processing; Clustering; Word Embedding; Other Engineering and Technologies
5	Semantic Text Matching Using Convolutional Neural Networks Wang, Run Fen January 2018 Semantic text matching is a fundamental task for many applications in Natural Language Processing (NLP). Traditional methods that use term frequency–inverse document frequency (TF-IDF) to match exact words in documents have one strong drawback: TF-IDF is unable to capture semantic relations between closely related words, which leads to disappointing matching results. Neural networks have recently been used for various applications in NLP and have achieved state-of-the-art performance on many tasks. Recurrent Neural Networks (RNN) have been tested on text classification and text matching but did not achieve any remarkable results, because RNNs work more effectively on short texts than on long documents. In this paper, Convolutional Neural Networks (CNN) are applied to match texts semantically. The model uses word embedding representations of two texts as inputs to the CNN to extract the semantic features between the two texts, and outputs a score indicating how certain the CNN model is that they match. The results show that, after some tuning of the parameters, the CNN model could produce accuracy, precision, recall and F1-scores all over 80%. This is a great improvement over the previous TF-IDF results, and further improvements could be made by using dynamic word vectors, pre-processing the data better, generating larger and more feature-rich data sets, and further tuning the parameters.
Keywords: Text matching; CNN; TF-IDF; Word embedding; Word2vec; NLP
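The TF-IDF drawback described in this abstract can be seen in a small sketch. The code below is an illustrative pure-Python TF-IDF with a smoothed IDF, not the implementation used in the thesis; the three example documents are invented for the demonstration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) using a smoothed IDF."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented example documents (assumption, for illustration only).
docs = [
    "cheap automobile insurance",  # query
    "affordable car coverage",     # same meaning, no shared words
    "cheap automobile repair",     # different meaning, shared words
]
vecs = tfidf_vectors(docs)
sim_paraphrase = cosine(vecs[0], vecs[1])
sim_overlap = cosine(vecs[0], vecs[2])
print(sim_paraphrase, sim_overlap)
```

Because the paraphrase shares no surface words with the query, its similarity is exactly zero, while the lexically overlapping but semantically different document scores higher. Embedding-based inputs are intended to close this gap.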
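6	History-related Knowledge Extraction from Temporal Text Collections Duan, Yijun 23 March 2020 Kyoto University / 0048 / New-system doctorate by coursework / Doctor of Informatics / Kō No. 22574 / Jōhaku No. 711 / 新制||情||122 (Main Library) / Department of Social Informatics, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Masatoshi Yoshikawa, Professor Hisashi Kashima, Professor Keishi Tajima, Program-Specific Associate Professor Adam Wladyslaw Jatowt / Conferred under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DGAM
Keywords: timeline summarization; dynamic word embedding; news mining; information extraction; computational history; 007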
7	Will Svenska Akademiens Ordlista Improve Swedish Word Embeddings? Ahlberg, Ellen January 2022 Unsupervised word embedding methods are frequently used in natural language processing applications. However, unsupervised methods overlook known lexical relations that can be valuable for capturing accurate semantic word relations. This thesis explores whether Swedish word embeddings can benefit from previously known linguistic information. Four knowledge graphs extracted from Svenska Akademiens ordlista (SAOL) are incorporated during the training process using the Probabilistic Word Embeddings with Laplacian Priors (PELP) model. The four implemented PELP models are compared with baseline results to evaluate the use of side information. The results suggest that various lexical relations in SAOL are useful for generating more accurate Swedish word embeddings.
Keywords: word embedding; natural language processing; NLP; Probability Theory and Statistics
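The abstract does not give the form of the prior, so the following is only a sketch of how a graph Laplacian prior is commonly written, under our own assumed notation: $W$ is the matrix whose rows $w_i$ are word vectors, $A$ is the symmetric adjacency matrix of a knowledge graph (for example, one extracted from SAOL), $D$ is its degree matrix, and $\lambda$ is a strength parameter.

```latex
% Assumed notation: L = D - A is the graph Laplacian of the lexical graph.
\[
  p(W) \;\propto\; \exp\!\left(-\frac{\lambda}{2}\,\operatorname{tr}\!\left(W^{\top} L\, W\right)\right),
  \qquad
  \operatorname{tr}\!\left(W^{\top} L\, W\right)
  \;=\; \frac{1}{2}\sum_{i,j} A_{ij}\,\lVert w_i - w_j \rVert^{2}.
\]
```

Under this form, words linked in the graph are pulled toward similar vectors, which is how known lexical relations can act as side information during training.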
8	Multi-Layer Web Services Discovery using Word Embedding and Clustering Techniques Obidallah, Waeal 25 February 2021 Web services discovery is the process of finding the right Web services that best match the end-users’ functional and non-functional requirements. Artificial intelligence, natural language processing, data mining, and text mining techniques have been applied by researchers in Web services discovery to facilitate the process of matchmaking. This thesis contributes to the area of Web services discovery and recommendation, adopting the Design Science Research Methodology to guide the development of useful knowledge, including design theory and artifacts. The lack of a comprehensive review of Web services discovery and recommendation in the literature motivated us to conduct a systematic literature review. Our main purpose in conducting the systematic literature review was to identify and systematically compare current clustering and association rules techniques for Web services discovery and recommendation by providing answers to various research questions, investigating the prior knowledge, and identifying gaps in the related literature. We then propose a conceptual model and a typology of Web services discovery systems. The conceptual model provides a high-level representation of Web services discovery systems, including their various elements, tasks, and relationships. The proposed typology of Web services discovery systems is composed of five groups of characteristics: storage and location characteristics, formalization characteristics, matchmaking characteristics, automation characteristics, and selection characteristics. We reference the typology to compare Web services discovery methods and architectures from the extant literature by linking them to the five proposed characteristics. 
We employ the proposed conceptual model with its specified characteristics to design and develop the multi-layer data mining architecture for Web services discovery using word embedding and clustering techniques. The proposed architecture consists of five layers: Web services description and data preprocessing; word embedding and representation; syntactic similarity; semantic similarity; and clustering. In the first layer, we identify the steps to parse and preprocess the Web services documents. Bag of Words with Term Frequency–Inverse Document Frequency and three word-embedding models are employed for Web services representation in the second layer. Then in the third layer, four distance measures, including Cosine, Euclidean, Minkowski, and Word Mover, are studied to find the similarities between Web services documents. In layer four, WordNet and Normalized Google Distance are employed to represent and find the similarity between Web services documents. Finally, in the fifth layer, three clustering algorithms, including affinity propagation, K-means, and hierarchical agglomerative clustering, are investigated to cluster Web services based on the observed documents’ similarities. We demonstrate how each component of the five layers is employed in the process of Web services clustering using randomly selected Web services documents. We conduct an experimental analysis to cluster Web services using a collected dataset of Web services documents and evaluate their clustering performance. Using a ground truth for evaluation purposes, we observe that clusters built on the word embedding models performed better than those built using the Bag of Words with Term Frequency–Inverse Document Frequency model. Among the three word embedding models, the pre-trained Word2Vec skip-gram model reported higher performance in clustering Web services. Among the three semantic similarity measures, path-based WordNet similarity reported higher clustering performance. 
Across the different word representation models and the syntactic and semantic similarity measures, the affinity propagation clustering technique performed better in discovering similarities among Web services.
Keywords: Web Services; Clustering; Web services discovery; systematic literature review; word embedding; NLP
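Three of the distance measures named for the third layer (Cosine, Euclidean, Minkowski) can be sketched in a few lines of plain Python. This is an illustration only, not the thesis implementation: the toy vectors and their names are invented, and Word Mover's Distance is omitted because it requires an optimal-transport solver.

```python
import math


def minkowski(u, v, p):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)


def euclidean(u, v):
    """Euclidean distance, the Minkowski distance with p=2."""
    return minkowski(u, v, 2)


def cosine_distance(u, v):
    """1 - cosine similarity; ignores vector length, compares direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norms


# Hypothetical document vectors for three Web service descriptions.
weather_a = [3.0, 4.0, 0.0]
weather_b = [6.0, 8.0, 0.0]  # same direction as weather_a, twice as long
payment = [0.0, 0.0, 5.0]    # orthogonal to both weather vectors

print(euclidean(weather_a, weather_b))        # length difference matters
print(cosine_distance(weather_a, weather_b))  # direction is identical
print(cosine_distance(weather_a, payment))    # orthogonal vectors
```

The two weather vectors differ under Euclidean distance but are identical under cosine distance, which is one reason cosine is a frequent choice when document vectors vary in magnitude.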
9	Investigating Gender Bias in Word Embeddings for Chinese Jiao, Meichun January 2021 Gender bias, a sociological issue, has attracted the attention of scholars working on natural language processing (NLP) in recent years. It has been confirmed that some NLP techniques, such as word embedding, can capture gender bias in natural language. Here, we investigate gender bias in Chinese word embeddings. Gender bias tests originally designed for English are adapted and applied to Chinese word embeddings trained with three different embedding models. After verifying the efficiency of the adapted tests, changes in gender bias across several time periods are tracked and analysed. Our results validate the feasibility of bias test adaptation and confirm that word embeddings trained by a model with character-level information capture more gender bias in general. Moreover, we build a possible framework for diachronic research on gender bias.
Keywords: word embedding; gender bias; Chinese
10	Automatic Retail Product Identification System for Cashierless Stores Zhong, Shiting January 2021 The introduction of artificial intelligence techniques in the retail market is revolutionizing the shopping experience. It allows shoppers to walk into a store, grab what they want and simply walk out without scanning barcodes or standing in long queues. Such stores are called cashierless stores. This project aims to provide an efficient solution for automatic retail product identification. The solution presents an artifact that can be used to build an end-to-end smart system for cashierless stores. To that end, a solution based on text classification is proposed to recognize and identify the products. Deep learning techniques such as RNN and LSTM are used to build the classifier. The performance of this classifier is evaluated using various metrics, and it achieves an accuracy exceeding 86%.
Keywords: Text Classification; Retail Product Identification; Word Embedding; Neural Network; Word2Vec; GloVe; Engineering and Technology
