Global ETD Search

Return to search

Object Classification using Language Models

In today’s modern digital world more and more emails and messengers must be sent, processed and handled. The categorizing and classification of these text pieces can take an incredibly long time and will cost the company a lot of time and money. If the classification could be done automatically by a computer dependent on the content of the text/message it would result in a major yield for the Easit AB and its customers. In order to facilitate the task of text-classification Easit needs a solution that is made out of one language model and one classifier model. The language model will convert raw text to a vector that is representative of the text and the classifier will construe what predefined labels fit for the vector. The end goal is not to create the best solution. It is simply to create a general understanding about different language and classifier models and how to build a system that will be both fast and accurate. BERT were the primary language model during evaluation but doc2Vec and One-Hot encoding was also tested. The classifier consisted out of boundary condition models or dense neural networks that were all trained without knowledge about what language model that the text vectors came from. The validation accuracy which was presented for the IMDB-comment dataset with BERT resulted between 75% to 94%, mostly dependent on the language model and not on the classifier. The knowledge from the work resulted in a recommendation to Easit for an alternativebased system solution. / I dagens moderna digitala värld är det allt mer majl-ärenden och meddelanden som ska skickas och processeras. Kategorisering och klassificering av dessa kan ta otroligt lång tid och kostar företag tid samt pengar. Om klassifieringen kunde ske automatiskt beroende på text-innehållet skulle det innebära en stor vinst för Easit AB och deras kunder. För att underlätta arbetet med text-klassifiering behöver Easit en tvådelad lösning som består utav en språkmodell och en klassifierare. Språkmodellen som omvandlar text till en vektor som representerar texten och klassifieraren tolkar vilka fördefinerade ettiketter/märken som passar för vektorn. Målet är inte att skapa den bästa lösningen utan det är att skapa en generell kunskap för hur man kan utforma ett system som kan klassifiera texten på ett träffsäkert och effektivt sätt. Vid utvärdering av olika språkmodeller användes framförallt BERT-modeller men även doc2Vec och One-Hot testas också. Klassifieraren bestod utav gränsvillkors-modeller eller dense neurala nätverk som tränades helt utan vetskap om vilken språkmodell som skickat text-vektorerna. Träffsäkerheten som uppvisades vid validering för IMDB-kommentars datasetet med BERT blev mellan 75% till 94%, primärt beroende på språkmodellen. De neuralt nätverk passar bäst som klassifierare mest på grund av deras skalbarhet med flera ettiketter. Kunskapen från arbetet resulterade i en rekommendation till Easit om en alternativbaserad systemlösning.

http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-477857

Identifer	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-477857
Date	January 2022
Creators	From, Gustav
Publisher	Uppsala universitet, Signaler och system
Source Sets	DiVA Archive at Upsalla University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	UPTEC F, 1401-5757 ; 22013

Page generated in 0.0018 seconds

Object Classification using Language Models

Description

Links & Downloads

Tags

Additional Fields