Global ETD Search

231	Evaluating Cold-Start in Recommendation Systems Using a Hybrid Model Based on Factorization Machines and SBERT Embeddings / Evaluering av kallstartsproblemet hos rekommendationssystem med en NLP-baserad hybridmodell baserad på faktoriseringsmaskiner och SBERT inbäddningar Chowdhury, Sabrina January 2022 The item cold-start problem, which describes the difficulty recommendation systems have in recommending new items to users, remains a great challenge for recommendation systems that rely on past user-item interaction data. A popular technique in current research on the cold-start problem is the use of hybrid models that combine two or more recommendation strategies, each of which may contribute its individual advantages. This thesis investigates a hybrid model that combines Sentence BERT embeddings with a recommendation model based on Factorization Machines (FM). The research question is stated as: How does a hybrid recommendation system based on Factorization Machines with frozen Sentence BERT embeddings perform in terms of solving the cold-start problem? Three experiments were conducted to answer the research question. These involved finding an optimal pre-trained Sentence BERT model, investigating the difference in performance between an FM-model and a hybrid FM-model, and investigating the difference in the ranking of an item depending on whether or not the hybrid FM-model has been trained on the item. The results show that the best pre-trained Sentence BERT model for producing meaningful embeddings is the paraphrase-MiniLM-L3-v2 model, that a hybrid FM-model and a standard FM-model perform almost equally in terms of precision and recall at 50, and that there is a weak correlation between item frequency and how the hybrid FM-model ranks an item when trained and not trained on the item.
The answer to the research question is that a recommendation model based on Factorization Machines with frozen Sentence BERT embeddings displays low precision at 50 and recall at 50 with the given parameters, compared with the values expected in an optimal recommendation scenario. The hybrid FM-model shows cold-start potential because its results are similar to those of the standard FM-model, but these values are so low that further investigation with other parameters is needed for a clearer conclusion. Natural Language Processing Hybrid Recommender Systems Cold-Start Problem Computer and Information Sciences
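Entry 231 reports precision at 50 and recall at 50. As a minimal sketch of how such top-k metrics are commonly computed, assuming a single user's ranked recommendation list and a set of held-out relevant items, the following may help; the function name `precision_recall_at_k` and the convention of dividing precision by k even when fewer than k items are returned are illustrative assumptions, not details taken from the thesis.

```python
from typing import Iterable, Sequence, Tuple


def precision_recall_at_k(
    ranked_items: Sequence[str],
    relevant_items: Iterable[str],
    k: int = 50,
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for one user's ranked list.

    Hypothetical helper for illustration only.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant_items)
    # Count how many of the top-k recommended items are relevant.
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    # Assumed convention: divide by k even if fewer than k items were ranked.
    precision = hits / k
    # Recall is undefined without relevant items; report 0.0 in that case.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    ranked = ["item_3", "item_7", "item_1", "item_9"]
    held_out = {"item_3", "item_9", "item_42"}
    print(precision_recall_at_k(ranked, held_out, k=2))
```

In an evaluation like the one described, these per-user values would typically be averaged over all test users.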
232	Att arbeta aktivt med språkutveckling i förskolan för att främja barn med annat modersmål. : En kvalitativ intervjustudie om hur förskollärare beskriver arbetet med språkutveckling / To work actively with language development in preschool to support children with a different first language. : A qualitative interview study on how preschool teachers describe the work with language development. Albinsson, Felicia, Mattsson, Rebecca January 2023 The subject of the study is language development, with a focus on children with a different first language. The aim is to examine how preschool teachers describe their work to promote language development in children who have Swedish as a second language. The research questions are: how do the preschool teachers describe their work with children's language development, and which methods, according to the preschool teachers, can be used to promote Swedish as a second language? The theoretical starting point of the study is the sociocultural perspective. To address the aim and the research questions, six preschool teachers were interviewed using qualitative methods. The results show that the preschool teachers work actively to support children with a different first language, and a range of approaches and methods emerged. The main approaches were TAKK and picture support (bildstöd). Polyglutt is another tool mentioned by several preschool teachers; it is used to support both the first language and the second language. The results also show that the preschool teachers encourage guardians to speak the first language with their children in order to advance their language development, since the first language is an important asset in preschool activities. The first language is a contributing factor in learning a second language, and no preschool teacher regards that learning as negative. The importance of guardians showing interest and curiosity in this work emerges clearly both in the literature and in the interviews. Preschool Swedish as a second language language development methods first language Pedagogical Work
233	Swedish Language End-to-End Automatic Speech Recognition for Media Monitoring using Deep Learning Nyblom, Hector January 2022 In order to extract relevant information from speech recordings, the general approach is to first convert the audio into transcribed text. The text can then be analysed using well-researched methods. NewsMachine AB provides customers with an overview of how they are represented in media by analysing articles in text form. Their plans to scale up their monitoring of publicly available speech recordings were the basis for the thesis. In this thesis I compare three end-to-end Automatic Speech Recognition (ASR) models in order to find the model that currently works best for transcribing Swedish-language radio recordings, considering accuracy and inference speed (computational complexity). The results show that the QuartzNet architecture is the fastest, but pre-trained wav2vec models provided by KBLab on Swedish speech have by far the best accuracy. The KBLab model was used for further fine-tuning on subsets with varying amounts of training data from radio recordings. The results show that further fine-tuning the KBLab models on low-resource Swedish speech domains achieves impressive accuracy. With just 5 hours of training data, the result is 11.5% Word Error Rate and 3.8% Character Error Rate. A final model was fine-tuned on all 35 hours of the radio domain dataset, resulting in a model achieving 10.4% Word Error Rate and 3.5% Character Error Rate. The thesis presents a complete pipeline able to convert audio of any length into a transcription. Segmentation is performed as a pre-processing step, splitting the audio based on silence. The silence represents where one sentence stops and a new one begins. The audio segments are passed to the final fine-tuned ASR model, and the outputs are concatenated into the complete punctuated transcript.
This implementation allowed for punctuation and also for timestamping when sentences occur in the audio. The results show that the complete pipeline performs well on high-quality audio recordings, but more work is needed to achieve optimal performance on noisy and disruptive audio. Automatic Speech Recognition Deep Learning Machine Learning Natural Language Processing Media Monitoring
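Entry 233 reports its results as Word Error Rate (WER) and Character Error Rate (CER). A minimal sketch of how these metrics are usually defined is given below: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, counted in words or in characters. The function names are illustrative assumptions, and the thesis may have applied text normalisation (casing, punctuation) that is not shown here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over whitespace-separated words / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over characters / reference character count."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "det är en bra dag"
    hyp = "det är bra dag"
    print(word_error_rate(ref, hyp), char_error_rate(ref, hyp))
```

Because the denominator is the reference length, both rates can exceed 100% when the hypothesis contains many insertions.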
234	Punctuation Restoration as Post-processing Step for Swedish Language Automatic Speech Recognition Gupta, Ishika January 2023 This thesis focuses on the Swedish language, where punctuation restoration, especially as a post-processing step for the output of Automatic Speech Recognition (ASR) applications, needs further research. I have collaborated with NewsMachine AB, a company that provides large-scale media monitoring services for its clients, for which it employs ASR technology to convert spoken content into text. This thesis follows an approach initially designed for high-resource languages such as English. The method is based on KB-BERT, a pre-trained Swedish neural network language model developed by the National Library of Sweden. The project uses KB-BERT with a Bidirectional Long Short-Term Memory (BiLSTM) layer on top for the task of punctuation restoration. The model is fine-tuned using the TED Talk 2020 dataset in Swedish, which is acquired from OPUS (an open-source parallel corpus). The punctuation marks comma, period, question mark, and colon are considered for this project. A comparative analysis is conducted between two KB-BERT models: bert-base-swedish-cased and albert-base-swedish-cased-alpha. The fine-tuned Swedish BERT-BiLSTM model, trained on 5 classes, achieved an overall F1-score of 81.6%, surpassing the performance of the ALBERT-BiLSTM model, which was also trained on 5 classes and obtained an overall F1-score of 66.6%. Additionally, the BERT-BiLSTM model, trained on 4 classes (excluding colon), outperformed prestoBERT, an existing model designed for the same task in Swedish, with an overall F1-score of 82.8%. In contrast, prestoBERT achieved an overall F1-score of 78.9%. As a further evaluation of the model’s performance on ASR transcribed text, noise was injected based on four probabilities (0.05, 0.1, 0.15, 0.2) into a copy of the test data in the form of three word-level errors (deletion, substitution, and insertion).
The performance of the BERT-BiLSTM model substantially decreased for all the errors as the probability of noise injected increased. In contrast, the model still performed comparatively better when dealing with deletion errors as compared to substitution and insertion errors. Lastly, the data resources received from NewsMachine AB were used to perform a qualitative assessment of how the model performs in punctuating real transcribed data as compared to human judgment. Transformer BERT KB-BERT NLP punctuation restoration deep learning neural networks
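Entry 234 describes injecting word-level noise (deletion, substitution, insertion) into test data at probabilities 0.05, 0.1, 0.15 and 0.2. The sketch below shows one plausible way to simulate such errors; it is an assumption for illustration, not the procedure used in the thesis. In particular, the function `inject_noise`, the choice to draw replacement words uniformly from a supplied vocabulary, and the choice to insert the extra word after the affected position are all hypothetical.

```python
import random
from typing import List, Sequence


def inject_noise(
    words: Sequence[str],
    error_type: str,
    probability: float,
    vocabulary: Sequence[str],
    rng: random.Random,
) -> List[str]:
    """Apply one kind of word-level error to each word with the given probability.

    Hypothetical helper: error_type is "deletion", "substitution" or "insertion".
    """
    if error_type not in {"deletion", "substitution", "insertion"}:
        raise ValueError(f"unknown error type: {error_type}")
    noisy: List[str] = []
    for word in words:
        if rng.random() >= probability:
            noisy.append(word)  # word left untouched
        elif error_type == "deletion":
            continue  # drop the word entirely
        elif error_type == "substitution":
            noisy.append(rng.choice(vocabulary))  # replace with a random word
        else:
            noisy.extend([word, rng.choice(vocabulary)])  # add a spurious word
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sentence = "det här är en mening utan skiljetecken".split()
    vocab = ["och", "att", "som", "inte"]
    for p in (0.05, 0.1, 0.15, 0.2):
        print(p, inject_noise(sentence, "substitution", p, vocab, rng))
```

Keeping each error type in a separate pass, as here, makes it possible to compare how a model degrades under deletions versus substitutions versus insertions at the same noise level.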
235	Extending a Text Classifier to Multiple Languages / Utöka en textklassificeringsmodell till flera språk Byström, Albin January 2021 This thesis explores the possibility of extending monolingual and bilingual text classifiers to multiple languages. Two different language models are explored: language-aligned word embeddings and a transformer model. The goal was to take a classifier based on Swedish and English samples and extend it to Danish, German, and Finnish samples. The results show that extending a text classifier by aligning word embeddings or by fine-tuning a multilingual transformer model is possible, but with varying accuracy depending on the language. Natural language processing Multilingual Transformer Word embeddings Text classification Computer and Information Sciences
236	Readability: Man and Machine : Using readability metrics to predict results from unsupervised sentiment analysis / Läsbarhet: Människa och maskin : Användning av läsbarhetsmått för att förutsäga resultaten från oövervakad sentimentanalys Larsson, Martin, Ljungberg, Samuel January 2021 Readability metrics assess the ease with which human beings read and understand written texts. With the advent of machine learning techniques that allow computers to also analyse text, this provides an interesting opportunity to investigate whether readability metrics can be used to inform on the ease with which machines understand texts. To that end, the specific machine analysed in this paper uses word embeddings to conduct unsupervised sentiment analysis. This specification minimises the need for labelling and human intervention, thus relying heavily on the machine instead of the human. Across two different datasets, sentiment predictions are made using Google’s Word2Vec word embedding algorithm, and are evaluated to produce a dichotomous output variable per sentiment. This variable, representing whether a prediction is correct or not, is then used as the dependent variable in a logistic regression with 17 readability metrics as independent variables. The resulting model has high explanatory power and the effects of readability metrics on the results from the sentiment analysis are mostly statistically significant. However, metrics affect sentiment classification in the two datasets differently, indicating that the metrics are expressions of linguistic behaviour unique to the datasets. The implication of the findings is that readability metrics could be used directly in sentiment classification models to improve modelling accuracy. Moreover, the results also indicate that machines are able to pick up on information that human beings do not pick up on, for instance that certain words are associated with more positive or negative sentiments. Natural language processing Unsupervised learning Sentiment analysis Word embeddings Readability Computer Sciences
237	Exploring Language Descriptions through Vector Space Models Aleksandrova, Anastasiia January 2024 The abundance of natural languages and the complexities involved in describing their structures pose significant challenges for modern linguists, not only in documentation but also in the systematic organization of knowledge. Computational linguistics tools hold promise in comprehending the “big picture”, provided existing grammars are digitized and made available for analysis using state-of-the-art language models. Extensive efforts have been made by an international team of linguists to compile such a knowledge base, resulting in the DReaM corpus – a comprehensive dataset comprising tens of thousands of digital books containing multilingual language descriptions. However, there remains a lack of tools that facilitate understanding of concise language structures and uncovering overlooked topics and dialects. This thesis represents a small step towards elucidating the broader picture by utilizing a subset of the DReaM corpus as a vector space capable of capturing genetic ties among described languages. To achieve this, we explore several encoding algorithms in conjunction with various segmentation strategies and vector summarization approaches for generating both monolingual and cross-lingual feature representations of selected grammars in English and Russian. Our newly proposed sentence-facets TF-IDF model shows promise in unsupervised generation of monolingual representations, conveying sufficient signal to differentiate historical linguistic relations among 484 languages from 26 language families based on their descriptions. However, the construction of a cross-lingual vector space necessitates further exploration of advanced technologies. DReaM corpus multi-faceted book embeddings linguistic genealogy Russian English
238	Low-resource Semantic Role Labeling Through Improved Transfer Learning Lindbäck, Hannes January 2024 For several more complex tasks, such as semantic role labeling (SRL), large annotated datasets are necessary. For smaller and lower-resource languages, these are not readily available. As a way to overcome this data bottleneck, this thesis investigates the possibility of using transfer learning from a high-resource language to a low-resource language and then performing zero-shot SRL on the low-resource language. We additionally investigate whether the transfer learning can be improved by freezing the parameters of a layer in the pre-trained model, so that the model instead focuses on learning the parameters of the layers necessary for the task. By training models in English and then evaluating on Spanish, Catalan, German and Chinese CoNLL-2009 data, we find that transfer learning for zero-shot SRL can be an effective technique, and in certain cases outperforms models trained on low amounts of data. We also find that the results improve when freezing the parameters of the lower layers of the model, the layers focused on surface tasks, as this allowed the model to improve the layers necessary for SRL. NLP Natural Language Processing AI Machine Learning Semantic Role Labeling Transfer Learning
239	The Interplay of Text Complexity and Cohesion : Exploring and Analyzing Differences Across Levels of Readability in Easy-to-Read Text Brissman, Wilgot January 2024 When assessing the readability of a text it is helpful to consider all its interacting elements. This includes its syntactic complexity, but other aspects, such as cohesion, are no less important. The thesis explores how these are reflected in each other and in the readability of books in a dataset provided by the publisher Nypon och Vilja, which consists of easy-to-read books divided into six levels of readability. To provide additional nuance, the interrelated concepts of epistemic stance and narrativity are introduced for the purpose of deepening the analysis of the statistical findings. They also prove useful in further discussion surrounding complexity and cohesion as they relate to reading skill and knowledge asymmetries. Principal component analysis (PCA) is employed to uncover these statistical relationships on a broader scale, while more specific in-depth analyses are performed for certain metrics. While the findings have some support in the literature, re-affirming the importance of narrativity for contextualizing cohesion, the clear link between higher complexity and less narrative text was not expected. Furthermore, the PCA indicates a more nuanced picture of referential cohesion and the use of its constituent metrics, depending on both narrativity and complexity. cohesion complexity coh-metrix PCA narrativity epistemic stance readability easy-to-read
240	Identifying Sensitive Data using Named Entity Recognition with Large Language Models : A comparison of transformer models fine-tuned for Named Entity Recognition Ström Boman, Alfred January 2024 The development of artificial intelligence and large language models has increased rapidly in recent years, bringing both opportunities and risks.
With the broader use of AI-related products such as human-like chatbots, there has been increased interest in controlling the data that is shared with them. In some scenarios there is data, such as personal or proprietary information, which should not be shared. This project has therefore revolved around utilizing and comparing different Named Entity Recognition systems to prevent such data from being shared. Three different approaches to implementing Named Entity Recognition systems were compared before the most appropriate one was selected for the actual implementation. Three pre-trained transformer models, GPT-SW3, TinyLlama and Mistral, were then used for the implementation and fine-tuned on two different datasets. The implementation phase included applying data augmentation techniques, data processing and model quantization before fine-tuning the models for Named Entity Recognition. A set of metrics including precision, recall and F1-score was then used to measure the performance of the trained models. The three models were compared and evaluated against each other based on the results obtained from the measurements and the training. The models showed varying results and performance, and both overfitting and underfitting occurred. Finally, the TinyLlama model was concluded to be the best model based on the obtained results and other considered aspects. Named Entity Recognition Natural Language Processing Machine Learning Fine-tuning Software Engineering
