11 |
Object Classification using Language Models
From, Gustav, January 2022
In today's digital world, more and more emails and messages must be sent, processed, and handled. Categorizing and classifying these texts can take a very long time and costs companies both time and money. If the classification could be done automatically by a computer based on the content of the text, it would be a major gain for Easit AB and its customers. To facilitate text classification, Easit needs a two-part solution consisting of a language model and a classifier. The language model converts raw text into a vector that represents the text, and the classifier determines which predefined labels fit that vector. The goal is not to create the best solution but to build a general understanding of different language and classifier models and of how to design a system that is both fast and accurate. BERT was the primary language model during evaluation, but doc2Vec and One-Hot encoding were also tested. The classifiers were boundary condition models or dense neural networks, all trained without knowledge of which language model had produced the text vectors. The validation accuracy on the IMDB comment dataset with BERT ranged from 75% to 94%, depending mostly on the language model rather than the classifier. Neural networks are the best fit as classifiers, mainly because of their scalability to multiple labels. The work resulted in a recommendation to Easit for an alternative-based system solution.
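The two-part design described in this abstract (an encoder that turns text into a vector, followed by a classifier that sees only vectors) can be illustrated with a minimal sketch. This is not the thesis's implementation: a binary bag-of-words encoding stands in for the language model, a nearest-centroid rule stands in for the classifier, and all names and the toy data are hypothetical.

```python
from collections import defaultdict


def build_vocabulary(texts):
    """Map every distinct lowercase token to a vector index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Encoder stage (stand-in for a language model): binary bag-of-words vector."""
    vector = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[token]] = 1.0
    return vector


class NearestCentroidClassifier:
    """Classifier stage: sees only vectors and labels, never the raw text."""

    def fit(self, vectors, labels):
        grouped = defaultdict(list)
        for vector, label in zip(vectors, labels):
            grouped[label].append(vector)
        # One centroid per label: the column-wise mean of its training vectors.
        self.centroids = {
            label: [sum(column) / len(rows) for column in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, vector):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(vector, centroid))

        return min(self.centroids, key=lambda lbl: squared_distance(self.centroids[lbl]))


# Hypothetical toy data, not taken from the thesis.
train_texts = [
    "great movie loved it",
    "wonderful acting great story",
    "terrible movie hated it",
    "boring story awful acting",
]
train_labels = ["positive", "positive", "negative", "negative"]

vocab = build_vocabulary(train_texts)
classifier = NearestCentroidClassifier().fit(
    [encode(text, vocab) for text in train_texts], train_labels
)
prediction = classifier.predict(encode("loved the wonderful story", vocab))
```

Because the classifier only ever receives vectors, the encoder could in principle be swapped for a different one without changing the classifier code, which is the separation the abstract describes.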
|
12 |
Cluster Analysis with Meaning : Detecting Texts that Convey the Same Message / Klusteranalys med mening : Detektering av texter som uttrycker samma sak
Öhrström, Fredrik, January 2018
Textual duplicates can be hard to detect because they differ in wording but carry similar semantic meaning. At Etteplan, a technical documentation company, many writers accidentally rewrite existing instructions that explain procedures. These "duplicates" clutter the database and amount to duplicated work, and the condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst and to estimate how many duplicates there are. The corpus is small but written in a controlled natural language called Simplified Technical English. The method charts the problem using document embeddings from doc2vec, clustering with HDBSCAN*, and validation with the Density-Based Clustering Validation index (DBCV). A survey was sent out to determine a threshold at which documents stop being duplicates, and this value was then used to calculate a theoretical duplicate count.
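The idea of applying a threshold to document embeddings to obtain a duplicate count can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the threshold was applied, so the sketch assumes pairwise cosine similarity; the vectors are hand-written stand-ins for doc2vec embeddings, the clustering step is omitted, and the threshold of 0.9 is hypothetical rather than the value from the survey.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def count_duplicate_pairs(embeddings, threshold):
    """Count document pairs whose similarity is at or above the threshold."""
    return sum(
        1
        for a, b in combinations(embeddings, 2)
        if cosine_similarity(a, b) >= threshold
    )


# Hypothetical two-dimensional embeddings; real doc2vec vectors are much larger.
embeddings = [
    [1.0, 0.0],  # document A
    [0.9, 0.1],  # document B, nearly the same direction as A
    [0.0, 1.0],  # document C, unrelated to A
]
duplicate_pairs = count_duplicate_pairs(embeddings, threshold=0.9)
```

With these toy vectors only the pair A and B exceeds the threshold. Comparing all pairs grows quadratically with corpus size, which is acceptable for a small corpus like the one described.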
|
13 |
Video Recommendation Based on Object Detection
Nyberg, Selma, January 2018
In this thesis, several machine learning domains are combined to build a video recommender system based on object detection. The work combines two extensively studied research fields, recommender systems and computer vision, both of which are also growing rapidly and are popular on commercial markets. To investigate the performance of the approach, three content-based recommender systems were implemented at Spotify, based on the following video features: object detections, titles and descriptions, and user preferences. These systems were then evaluated and compared against each other and against their hybridized result. Two algorithms were implemented, the prediction algorithm and the top-N algorithm, of which the former is the more reliable source for evaluating the system's performance. The evaluation shows that the overall performance scores for predicting users' liked and disliked videos range from about 40% to 70% for the prediction algorithm and from about 15% to 70% for the top-N algorithm. The approach based on object detection performs worse than the other approaches. Hence, there seems to be a low correlation between user preferences and video content in terms of object detection data, so this data is not well suited to describing video content for use in the recommender system. However, the results cannot be generalized to other systems until the approach has been evaluated in other environments and on other data sets. Moreover, there is plenty of room for refinement and improvement of the system, and there are many interesting research areas for future work.
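A content-based top-N recommendation of the general kind mentioned in this abstract can be sketched as below. The abstract does not describe how the thesis scores or ranks videos, so this is an assumption-laden illustration: each video is represented by a hypothetical vector of object-detection counts, the user profile is a hypothetical weight vector over the same object classes, and videos are ranked by a plain dot product.

```python
import heapq


def score(user_profile, video_features):
    """Dot product between a user's preference weights and a video's features."""
    return sum(w * f for w, f in zip(user_profile, video_features))


def top_n(user_profile, catalog, n):
    """Return the ids of the n highest-scoring videos, best first."""
    scored = ((score(user_profile, features), video_id)
              for video_id, features in catalog.items())
    return [video_id for _, video_id in heapq.nlargest(n, scored)]


# Hypothetical object classes, in order: dog, car, guitar.
catalog = {
    "v1": [1.0, 0.0, 0.0],
    "v2": [0.0, 1.0, 0.0],
    "v3": [1.0, 0.0, 1.0],
}
user_profile = [1.0, 0.0, 0.5]  # hypothetical: likes dogs, somewhat likes guitars

recommended = top_n(user_profile, catalog, n=2)
```

Here `v3` scores 1.5, `v1` scores 1.0, and `v2` scores 0.0, so the two recommended videos are `v3` followed by `v1`.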
|