Spelling suggestions: "subject:"word2vec"" "subject:"word2vect""
21 |
Získavanie a analýza dát pre oblasť crowdfundinguKoštial, Martin January 2019 (has links)
The thesis deals with data acquisition from crowdfunding and their analysis. The theoretical part is focused on the description of available technologies and algorithms for data analysis. In the practical part the data collection is realized. Data mining and text mining algorithms are applied in this section for data.
|
22 |
Analýza textových používateľských hodnotení vybranej skupiny produktovValovič, Roman January 2019 (has links)
This work focuses on the design of a system that identifies frequently discussed product features in product reviews, summarizes them, and displays them to the user in terms of sentiment. The work deals with the issue of natural language processing, with a specific focus on Czech languague. The reader will be introduced the methods of preprocessing the text and their impact on the quality of the analysis results. The identification of the mainly discussed products features is carried out by cluster analysis using the K-Means algorithm, where we assume that sufficiently internally homogeneous clusters will represent the individual features of the products. A new area that will be explored in this work is the representation of documents using the Word embeddings technique, and its potential of using vector space as input for machine learning algorithms.
|
23 |
Automatisk synonymgenerering med Word2Vec for query expansion inom e-handelKojic, Kemal, Petersson, Emil January 2018 (has links)
I detta arbete undersöks hur väl automatisk synonymgenerering genom maskininlärnings-metoden Word2Vec, som tränats över en datamängd från Google News på hundra miljarder ord, lämpar sig för query expansion inom ehandel. Detta görs genom användning av produkt- och eventdata från ett välkänt modebolag där synonymer genereras utifrån söksträngar som loggats i eventdata genom olika metoder som i sin tur bildar synonymböcker som används i framtida sökningar med hjälp av query expansion. För att kunna besvara studiens forskningsfrågor utförs först en kvantitativ analys. Denna analys utförs på data som matchade köp, produktträffar, no-hits och söktid. Information om denna data genereras utifrån en söksimulator som simulerar loggade händelser från användarsessioner i ett ehandelssystem. Därefter filtreras de genererade synonymböckerna genom att ta bort synonymer som är kopplade till de söksträngar som producerat ett sämre resultat i simuleringen med synonymer, än utan. För att validera vårt resultat från den kvantitativa analysen utförs även en kvalitativ analys på skillnaden i sökresultatet som de olika metoderna tar fram, där vi undersöker vad det är för produkter som tas fram med hjälp av synonymerna, för att undersöka dess relevans. Våra tester uppvisar att ett lägre tröskelvärde leder till fler produkträffar och minskar antalet no-hits. Antalet produktträffar ökades med mellan 4\%-10\%, no-hits reducerades med mellan 11\%-22\%. I de fall där söksträngen har tilldelats bra synonymer påverkas relevansen av produkterna positivt då fler relevanta produkter dyker upp i sökresultatet. I de fall där söksträngen har tilldelats mindre bra synonymer påverkas relevansen av produkterna negativt då vissa irrelevanta produkter dyker upp i sökresultatet som användaren antagligen inte vill se i sitt sökresultat. I alla fall där de automatiskt genererade synonymerna används så befinner sig majoriteten av alla köpta produkter i den första halvan av sökresultatet, däremot minskar antalet köpta produkter på den första platsen i sökresultatet i alla fallen. / In this thesis, we examine automatic synonym generation through the use of the machine learning algorithm Word2Vec that has been trained using a Google News data set containing a hundred million words to find out if it is suitable for query expansions in e-commerce. This is examined through the use of product- and event data from a well-known fashion company where synonyms are generated from search-queries that have been logged in the event data through different methods, resulting in thesaurus' that are used in future searches with the use of query expansions. In order to answer the thesis' research question, a quantitative analysis is performed. This analysis is performed on data such as matched payments, product matches, no-hits and search time. Information about this data is generated through a search simulator that simulates logged events from user sessions in a e-commerce system. The generated thesaurus' are later filtered through the removal of synonyms that are connected to search queries whose results have produced worse results than the results without synonyms. In order to validate our results from the quantitative analysis a qualitative analysis is also performed on the difference of the search result that the different methods produce. In this qualitative analysis we research what type of products that the added synonyms produce in order to understand the relevance of the search query. Our tests show that the lower the threshold is, the higher the number of product hits and the lower the number of no-hits. Our tests shows that the number of product hits was increased by between 4\%-10\%, the number of no-hits was reduced by 11\%-22\%. In all of the tests using automatically generated synonyms, the results show that the majority of the purchased products are presented in the first half of the search result, however, in all of the tests using automatically generated synonyms the number of purchases in the first position of the search result was reduced.
|
24 |
Automatic Retail Product Identification System for Cashierless StoresZhong, Shiting January 2021 (has links)
The introduction of artificial intelligence techniques in the retail market is making a revolution in shopping experience. It allows shoppers to walk into a store, grab what they want and simply walk out without scanning barcodes or having to stand in long queues. That is what we call cashierless stores. In this project, it aims to provide an efficient solution to the automatic retail product identification. This solution presents an artifact one can use to build an end-to-end smart system for cashierless stores. Henceforth, a solution based on text classification is proposed to recognize and identify the products. For that, deep learning techniques are used such as RNN and LSTM to build the classifier. The performance of this classifier is evaluated using various metrics and it shows its efficiency with an accuracy exceeding 86%.
|
25 |
W2R: an ensemble Anomaly detection model inspired by language models for web application firewalls securityWang, Zelong, AnilKumar, Athira January 2023 (has links)
Nowadays, web application attacks have increased tremendously due to the large number of users and applications. Thus, industries are paying more attention to using Web application Firewalls and improving their security which acts as a shield between the app and the internet by filtering and monitoring the HTTP traffic. Most works focus on either traditional feature extraction or deep methods that require no feature extraction method. We noticed that a combination of an unsupervised language model and a classic dimension reduction method is less explored for this problem. Inspired by this gap, we propose a new unsupervised anomaly detection model with better results than the existing state-of-the-art model for anomaly detection in WAF security. This paper focuses on this structure to explore WAF security: 1) feature extraction from HTTP traffic packets by using NLP (natural language processing) methods such as word2vec and Bert, and 2) Dimension reduction by PCA and Autoencoder, 3) Using different types of anomaly detection techniques including OCSVM, isolation forest, LOF and combination of these algorithms to explore how these methods affect results. We used the datasets CSIC 2010 and ECML/PKDD 2007 in this paper, and the model has better results.
|
26 |
SAX meets Word2vec : A new paradigm in the time series forecastingJanerdal, Erik, Dimovski, David January 2023 (has links)
The purpose of this thesis was to investigate whether some successful ideas in NLP, such as word2vec, are applicable to time series prob- lems or not. More specifically, we are interested to assess a combina- tion of previously proven methods such as SAX and Word2vec. Based on a rolling window approach, we applied SAX to create words for each window. These words formed a corpus on which we performed Word2vec, which served as inputs in a time series forecasting setting. We found that for forecasting horizons of longer length, our proposed method showed an improvement over statistical models under certain conditions. The findings suggest that bringing tools from the natural language processing domain into the time series domain may be an ef- fective idea. Further research is necessary to broaden the knowledge of these types of methods by testing alternative options for the cre- ation of words. Hopefully, this work will motivate other researchers to investigate this type of solution further.
|
27 |
Clustering of Distributed Word Representations and its Applicability for Enterprise SearchKorger, Christina 04 October 2016 (has links) (PDF)
Machine learning of distributed word representations with neural embeddings is a state-of-the-art approach to modelling semantic relationships hidden in natural language. The thesis “Clustering of Distributed Word Representations and its Applicability for Enterprise Search” covers different aspects of how such a model can be applied to knowledge management in enterprises. A review of distributed word representations and related language modelling techniques, combined with an overview of applicable clustering algorithms, constitutes the basis for practical studies. The latter have two goals: firstly, they examine the quality of German embedding models trained with gensim and a selected choice of parameter configurations. Secondly, clusterings conducted on the resulting word representations are evaluated against the objective of retrieving immediate semantic relations for a given term. The application of the final results to company-wide knowledge management is subsequently outlined by the example of the platform intergator and conceptual extensions."
|
28 |
Clustering of Distributed Word Representations and its Applicability for Enterprise SearchKorger, Christina 18 August 2016 (has links)
Machine learning of distributed word representations with neural embeddings is a state-of-the-art approach to modelling semantic relationships hidden in natural language. The thesis “Clustering of Distributed Word Representations and its Applicability for Enterprise Search” covers different aspects of how such a model can be applied to knowledge management in enterprises. A review of distributed word representations and related language modelling techniques, combined with an overview of applicable clustering algorithms, constitutes the basis for practical studies. The latter have two goals: firstly, they examine the quality of German embedding models trained with gensim and a selected choice of parameter configurations. Secondly, clusterings conducted on the resulting word representations are evaluated against the objective of retrieving immediate semantic relations for a given term. The application of the final results to company-wide knowledge management is subsequently outlined by the example of the platform intergator and conceptual extensions.":1 Introduction
1.1 Motivation
1.2 Thesis Structure
2 Related Work
3 Distributed Word Representations
3.1 History
3.2 Parallels to Biological Neurons
3.3 Feedforward and Recurrent Neural Networks
3.4 Learning Representations via Backpropagation and Stochastic Gradient Descent
3.5 Word2Vec
3.5.1 Neural Network Architectures and Update Frequency
3.5.2 Hierarchical Softmax
3.5.3 Negative Sampling
3.5.4 Parallelisation
3.5.5 Exploration of Linguistic Regularities
4 Clustering Techniques
4.1 Categorisation
4.2 The Curse of Dimensionality
5 Training and Evaluation of Neural Embedding Models
5.1 Technical Setup
5.2 Model Training
5.2.1 Corpus
5.2.2 Data Segmentation and Ordering
5.2.3 Stopword Removal
5.2.4 Morphological Reduction
5.2.5 Extraction of Multi-Word Concepts
5.2.6 Parameter Selection
5.3 Evaluation Datasets
5.3.1 Measurement Quality Concerns
5.3.2 Semantic Similarities
5.3.3 Regularities Expressed by Analogies
5.3.4 Construction of a Representative Test Set for Evaluation of Paradigmatic Relations
5.3.5 Metrics
5.4 Discussion
6 Evaluation of Semantic Clustering on Word Embeddings
6.1 Qualitative Evaluation
6.2 Discussion
6.3 Summary
7 Conceptual Integration with an Enterprise Search Platform
7.1 The intergator Search Platform
7.2 Deployment Concepts of Distributed Word Representations
7.2.1 Improved Document Retrieval
7.2.2 Improved Query Suggestions
7.2.3 Additional Support in Explorative Search
8 Conclusion
8.1 Summary
8.2 Further Work
Bibliography
List of Figures
List of Tables
Appendix
|
29 |
Evaluating and comparing different key phrase-based web scraping methods for training domain-specific fasttext models / Utvärdering och jämförelse av olika nyckelfrasbaserade webbskrapningsmetoder för att träna domänspecifika fasttextmodellerBook, Love January 2023 (has links)
The demand for automation of simple tasks is constantly increasing. While some tasks are easy to automate because the logic is fixed and the process is streamlined, other tasks are harder because the performance of the task is heavily reliant on the judgment of a human expert. Matching a consultant to an offer from a client is one such task, in which case the expert is either a manager to the consultants or someone within HR at the company. One way to approach this task is to model the specific domain of interest using natural language processing. If we can capture the relationships between relevant skills and phrases within the specific domain, we could potentially use the resulting embeddings in a consultant to offer matching scheme. In this paper, we propose a key phrase-based web scraping approach to collect the data we need for a domain-specific corpus. To retrieve the key phrases needed as prompts for web scraping, we propose using the transformer-based library KeyBERT on limited domain-specific in house data belonging to the consultant firm B3 Indes, in order to retrieve the most important phrases in their respective contexts. Facebook's Word2vec based language model fasttext is then used on the processed corpus to create the fixed word embeddings. We also investigate numerous different approaches for selecting the right key phrases for web scraping in a human similarity comparison scheme, as well as comparisons to a larger pretrained general domain fasttext model. We show that utilizing key phrases for a domain-specific fasttext model could be beneficial compared to using a larger pretrained model. The results are not consistently conclusive under the current analytical framework. The results also indicate that KeyBERT is beneficial when selecting the key phrases compared to the randomized sampling of relevant phrases; however, the results are not conclusive. / Efterfrågan för automatisering av enkla uppgifter efterfrågas alltmer. Medan vissa uppgifter är lätta att automatisera eftersom logiken är fast och processen är tydlig, är andra svårare eftersom utförandet av uppgiften starkt beror på en människas expertis. Att matcha en konsult till ett erbjudande från en klient är en sådan uppgift, där experten är antingen en chef för konsulterna eller någon inom HR på företaget. En metod för att hantera denna uppgift är att modellera det specifika området av intresse med hjälp av maskininlärningsbaserad språkteknologi. Om vi kan fånga relationerna mellan relevanta färdigheter och fraser inom det specifika området, skulle vi potentiellt kunna använda de resulterande inbäddningarna i ett matchningsprocess mellan konsulter och uppdrag. I denna rapport föreslås en nyckelordsbaserad webbskrapnings-metod för att samla in data som behövs för ett domänspecifikt korpus. För att hämta de nyckelord som behövs som input för webbskrapning, föreslår vi att använda transformator-baserade biblioteket KeyBERT på begränsad domänspecifik data från konsultbolaget B3 Indes, detta för att hämta de viktigaste fraserna i deras respektive sammanhang. Sedan används Facebooks Word2vec baserade språkmodell fasttext på det bearbetade korpuset för att skapa statiska inbäddningar. Vi undersöker också olika metoder för att välja rätt nyckelord för webbskrapning i en likhets-jämnförelse mot mänskliga experter, samt jämförelser med en större förtränad fasttext-modell som inte är domänspecifik. Vi visar att användning av nyckelord för webbskrapning för träning av en domänspecifik fasttext-modell skulle kunna vara fördelaktigt jämnfört med en förtränad modell, men resutaten är inte konsekvent signifikanta enligt det begränsade analytiska ramverket. Resultaten indikerar också att KeyBERT är fördelaktigt vid valet av nyckelord jämfört med slumpmässigt urval av relevanta fraser, men dessa resultat är inte heller helt entydiga.
|
30 |
Evaluation of Approaches for Representation and Sentiment of Customer Reviews / Utvärdering av tillvägagångssätt för representation och uppfattning om kundrecensionerGiorgis, Stavros January 2021 (has links)
Classification of sentiment on customer reviews is a real-world application for many companies that offer text analytics and opinion extraction on customer reviews on different domains such as consumer electronics, hotels, restaurants, and car rental agencies. Natural Language Processing’s latest progress has seen the development of many new state-of-the-art approaches for representing the meaning of sentences, phrases, and words in the text using vector space models, so-called embeddings. In this thesis, we evaluated the most current and most popular text representation techniques against traditional methods as a baseline. The evaluation dataset consists of customer reviews from different domains with different lengths used by a text analysis company. Through a train dataset exploration, we evaluated which datasets were the most suitable for this specific task. Furthermore, we explored different techniques that could be used to alter a language model’s decisions without retraining it. Finally, all the methods were evaluated against their time performance and the resource requirements to present an overall experimental assessment that could potentially help the company decide which is the most appropriate technique to replace its system in a production environment. / Klassificeringen av attityd och känsloläge i kundrecensioner är en tillämpning med praktiskt värde för flera företag i marknadsanalysbranschen. Aktuell forskning i språkteknologi har etablerat vektorrum som standardrepresentation för ord, fraser och yttranden, så kallade embeddings. Denna uppsats utvärderar den senaste tidens mest framgångsrika textrepresentationsmodeller jämfört med mer traditionella vektorrum. Utvärdering görs genom att jämföra automatiska analyser med mänskliga bedömningar för kundrecensioner av varierande längd från olika domäner tillhandahållna av ett textanalysföretag. Inom ramen för studien har olika testmängder jämförts och olika sätt att modifera en språkmodells klassficering utan om träning. Alla modeller har också jämförts med avseende på resurs- och tidsåtgång för träning för att hjälpa uppdragsgivaren fatta beslut om vilken teknik som utgör den mest ändamålsenliga utvecklingsvägen för dess driftsatta system.
|
Page generated in 0.1666 seconds