361

Textbrytning av mäklartexter och slutpris : Med BERT, OLS och Elman regressionsnätverk / Text mining of broker texts and sold price : Using BERT, OLS and Elman regression network

Fjellström, Emil, Challita, Johan January 2021 (has links)
Estimating the final price of a home sale is a complex task, in which the broker texts describing the home are a vital part of the sale. This study explores whether broker texts can be used to generate more accurate estimates with machine learning models. Two machine learning models, selected through a literature study, were implemented and evaluated against Booli's existing OLS model: OLS-BERT and an Elman regression network. OLS-BERT showed a general improvement over Booli's OLS model; in particular, the F-statistic was 99.8 percent lower. The p-value in the T-statistic for "vista" (the view) dropped from 37.7 percent with Booli's OLS model to 1 percent with OLS-BERT. The Elman regression network lowered the MAPE of Booli's OLS model from 58.5 to 6.62 percent. The models were evaluated with eight different measures, of which the most important for this study are MAPE, MAE, the F-statistic, and the T-statistic. By mining attributes from broker texts, the models can explain the significance of the input and produce somewhat more accurate estimates of the final sale price. The results show that this is a promising method that merits further exploration.
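
As an illustration of the modelling and metrics described above, the sketch below fits an OLS price model that includes one hypothetical attribute mined from broker text and reports per-coefficient p-values (the T-statistics the thesis discusses) together with MAPE and MAE. All data, the `vista_score` feature, and the coefficients are invented for illustration; this is not the thesis code or Booli's model.

```python
# Hedged sketch: OLS with a text-derived feature, scored with MAPE/MAE.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
living_area = rng.uniform(30, 150, n)   # structured listing attribute
vista_score = rng.uniform(0, 1, n)      # hypothetical attribute mined from broker text
price = 40_000 * living_area + 300_000 * vista_score + rng.normal(0, 50_000, n)

X = sm.add_constant(np.column_stack([living_area, vista_score]))
model = sm.OLS(price, X).fit()
print(model.pvalues)  # per-coefficient p-values, as reported for "vista"

pred = model.predict(X)
print(f"MAPE: {np.mean(np.abs((price - pred) / price)) * 100:.2f}%")
print(f"MAE: {np.mean(np.abs(price - pred)):.0f}")
```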
362

A Method for the Assisted Translation of QA Datasets Using Multilingual Sentence Embeddings / En metod för att assistera översättning av fråga-svarskorpusar med hjälp av språkagnostiska meningsvektorer

Vakili, Thomas January 2020 (has links)
This thesis presents a method which reduces the amount of labour required to translate the English question answering dataset SQuAD into Swedish. The purpose of the study is to help shrink the gap between natural language processing research in English and research in lesser-resourced languages by providing a method for creating datasets in these languages that are counterparts to those used in English. This would allow results from English studies to be evaluated in more languages. The method put forward by this thesis uses multilingual sentence embeddings to search for and rank answers to English SQuAD questions in the Swedish Wikipedia articles associated with each question. The search results are then used to pair SQuAD questions with the sentences that contain their answers. We also estimate to what extent SQuAD questions have answers in the Swedish edition of Wikipedia, concluding that this proportion of questions is small but still useful in size. Further, the evaluation of the method shows that it clearly reduces the labour required for translating SQuAD into Swedish, while affecting the number of datapoints retained in the resulting translation to a degree that is acceptable for many use cases. Manual labour is still required for translating the SQuAD questions and for locating the answers within the Swedish sentences that contain them. Researching ways to automate these steps would further increase the utility of the approach, but is outside the scope of this thesis.
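
A minimal sketch of the core retrieval step is shown below: rank candidate Swedish sentences by their embedding similarity to an English SQuAD question. The checkpoint name is an assumption for illustration; the thesis does not prescribe this particular model, and the example sentences are invented.

```python
# Hedged sketch: cross-lingual answer-sentence ranking with multilingual embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed checkpoint

question = "When was the university founded?"
swedish_sentences = [
    "Universitetet grundades 1477.",
    "Staden ligger vid en flod.",
    "Biblioteket renoverades nyligen.",
]

q_emb = model.encode(question, convert_to_tensor=True)
s_emb = model.encode(swedish_sentences, convert_to_tensor=True)
scores = util.cos_sim(q_emb, s_emb)[0]

# The highest-scoring sentence is the candidate answer-bearing sentence.
for score, sent in sorted(zip(scores.tolist(), swedish_sentences), reverse=True):
    print(f"{score:.3f}  {sent}")
```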
363

Evaluating Hierarchical LDA Topic Models for Article Categorization

Lindgren, Jennifer January 2020 (has links)
With the vast amount of information available on the Internet today, helping users find relevant content has become a prioritized task in many software products that recommend news articles. One such product is Opera for Android, which has a news feed containing articles the user may be interested in. To more easily determine which articles to recommend, they can be categorized by the topics they contain. One approach to categorizing articles uses machine learning and Natural Language Processing (NLP). A commonly used model is Latent Dirichlet Allocation (LDA), which finds latent topics within large datasets of, for example, text articles. Hierarchical Latent Dirichlet Allocation (hLDA) extends LDA by structuring the latent topics found among a set of articles hierarchically in a tree. Each node represents a topic, and the levels represent different degrees of abstraction. A further extension of hLDA is constrained hLDA, where a set of predefined, constrained topics is added to the tree. The constrained topics are extracted from the dataset by grouping highly correlated words. The idea of constrained hLDA is to improve the topic structure derived by an hLDA model by making the process semi-supervised. The aim of this thesis is to create an hLDA and a constrained hLDA model from a dataset of articles provided by Opera. The models are then evaluated using the novel metric word frequency similarity, a measure of the similarity between the words representing parent and child topics in a hierarchical topic model. The results show that word frequency similarity can be used to evaluate whether the topics in a parent-child pair are too similar, so that the child does not specify a subtopic of the parent. It can also be used to evaluate whether the topics are too dissimilar, so that they seem unrelated and perhaps should not be connected in the hierarchy. The results also show that the two topic models had comparable word frequency similarity scores; neither model significantly outperformed the other with regard to the metric.
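
The thesis defines the exact word frequency similarity metric; as a rough illustration of the idea, the sketch below assumes a cosine similarity between the word-frequency vectors of a parent topic and a child topic, so that very high scores flag a child that merely repeats its parent and very low scores flag an unrelated child.

```python
# Hedged sketch of a parent-child word-frequency similarity measure
# (assumed formulation, not the thesis's exact definition).
from collections import Counter
import math

def word_freq_similarity(parent_words, child_words):
    p, c = Counter(parent_words), Counter(child_words)
    dot = sum(p[w] * c[w] for w in set(p) | set(c))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0

parent = ["sport", "game", "team", "player", "season"]
child = ["football", "team", "goal", "player", "league"]
print(word_freq_similarity(parent, child))  # mid-range: related but more specific
```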
364

Text Curation for Clustering of Free-text Survey Responses / Textbehandling för klustring av fritextresponser i enkäter

Gefvert, Anton January 2023 (has links)
When issuing surveys, offering free-text answer fields is only feasible when the number of respondents is small, as the work of summarizing the answers becomes unmanageable with a large number of responses. Using NLP techniques to cluster and summarize these answers would allow a greater range of survey creators to incorporate free-text answers in their surveys without making their workload too large. Academic work in this domain is sparse, especially for smaller languages such as Swedish. The Swedish company iMatrics is regularly hired to do this kind of summarizing, specifically for workplace-related surveys. Their clustering method has been semi-automatic, requiring both manual preprocessing and postprocessing. This thesis explores whether more advanced, unsupervised NLP text representation methods, namely SentenceBERT and Sent2Vec, can improve on these results and reduce the manual work the task requires. Specifically, three questions are addressed. Firstly, do the methods show good results? Secondly, can they remove the time-consuming postprocessing step of merging a large number of clusters into a smaller number? Lastly, can unsupervised learning metrics be shown to correlate with the real-world usability of a model, indicating that these metrics can be used to optimize the model for new data? To answer these questions, several models are trained, employed, and compared using both internal and external metrics: Sent2Vec, SentenceBERT, and traditional baseline models. A manual evaluation procedure is performed to assess the real-world usability of the clusterings, both to see how well the models perform and to see whether this result correlates with the internal clustering metrics. The results indicate that improving the text representation step alone is not sufficient to fully automate the task. Some of the models show promise in the human evaluation, but given the unsupervised nature of the problem and the large variance between models, it is difficult to predict performance on new data. Thus, the models can serve as an improvement to the workflow, but the need for manual work remains.
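
A minimal sketch of the embedding-and-clustering pipeline discussed above is given below. The Swedish SentenceBERT checkpoint, the toy answers, and the cluster count are assumptions for illustration, and the silhouette score stands in for the internal metrics the thesis compares against human judgments.

```python
# Hedged sketch: embed free-text survey answers, cluster, score internally.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

answers = [
    "Mer flexibla arbetstider vore bra.",
    "Flexibilitet kring arbetstiden uppskattas.",
    "Jag önskar tydligare kommunikation från ledningen.",
    "Ledningen behöver kommunicera oftare.",
]

model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")  # assumed checkpoint
emb = model.encode(answers)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print(labels)                          # cluster assignment per answer
print(silhouette_score(emb, labels))   # one internal metric of the kind studied
```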
365

Zero-Shot Cross-Lingual Domain Adaptation for Neural Machine Translation : Exploring The Interplay Between Language And Domain Transferability

Shahnazaryan, Lia January 2024 (has links)
Within the field of neural machine translation (NMT), transfer learning and domain adaptation techniques have emerged as central solutions to overcome the data scarcity challenges faced by low-resource languages and specialized domains. This thesis explores the potential of zero-shot cross-lingual domain adaptation, which integrates principles of transfer learning across languages and domain adaptation. By fine-tuning a multilingual pre-trained NMT model on domain-specific data from one language pair, the aim is to capture domain-specific knowledge and transfer it to target languages within the same domain, enabling effective zero-shot cross-lingual domain transfer. This study conducts a series of comprehensive experiments across both specialized and mixed domains to explore the feasibility and influencing factors of zero-shot cross-lingual domain adaptation. The results indicate that fine-tuned models generally outperform the pre-trained baseline in specialized domains and most target languages. However, the extent of improvement depends on the linguistic complexity of the domain, as well as the transferability potential driven by the linguistic similarity between the pivot and target languages. Additionally, the study examines zero-shot cross-lingual cross-domain transfer, where models fine-tuned on mixed domains are evaluated on specialized domains. The results reveal that while cross-domain transfer is feasible, its effectiveness depends on the characteristics of the pivot and target domains, with domains exhibiting more consistent language being more responsive to cross-domain transfer. By examining the interplay between language-specific and domain-specific factors, the research explores the dynamics influencing zero-shot cross-lingual domain adaptation, highlighting the significant role played by both linguistic relatedness and domain characteristics in determining the transferability potential.
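
As a conceptual sketch of the zero-shot transfer setting, the snippet below loads a multilingual NMT model and requests an in-domain translation into a target language that was not part of the domain fine-tuning. The model name and language pair are assumptions, and in the adapted setting the weights would come from fine-tuning on a single pivot language pair.

```python
# Hedged sketch: querying a multilingual NMT model zero-shot for a new
# target language (base checkpoint loaded here for brevity; the thesis
# setting would load domain-fine-tuned weights instead).
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tok = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tok.src_lang = "en"
enc = tok("The patient was given an intravenous dose.", return_tensors="pt")
out = model.generate(**enc, forced_bos_token_id=tok.get_lang_id("sv"))
print(tok.batch_decode(out, skip_special_tokens=True))
```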
366

Implementering av Retrieval-Augmented Generation för automatiserad analys av hållbarhetsrapportering : Utnyttjande av språkmodeller som stöd för att bedöma företags rapportering av verksamhetens påverkan på biologisk mångfald / Implementation of Retrieval-Augmented Generation to automate analysis of sustainability reports : Utilizing language models as support to evaluate companies' reports of their activities' effects on biodiversity

Wilmi, Wiljam, Roslund, Niklas January 2024 (has links)
The importance of sustainability reporting can be seen in the attention the subject receives from companies, media, and authorities, and in the increasing regulation through new directives and legislation. Manually analyzing companies' sustainability reports is a time-consuming process; an automated analysis would save both time and money when extracting important insights about large companies' impact on their environment and surroundings. This study explores the possibility of automating an existing manual method for analyzing sustainability reports. The prototype developed applies modern language models and machine learning methods to realize this vision. For the evaluated language models, the study's implementation achieves up to 96% precision for the majority class and up to 55% precision for the minority class when processing the data, compared with reference results from the manual evaluation method. The conclusion is that an automated version of the existing manual analysis method can be constructed, and further improved given the rapid development of technology and language models, if additional resources are allocated. The results are encouraging for the potential of a more sophisticated method to be developed in future work.
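
A minimal sketch of the retrieval step in such a RAG pipeline is shown below: embed report passages, retrieve those most relevant to a biodiversity question, and assemble a prompt for a language model. Every component choice (embedding model, passages, prompt wording) is an assumption for illustration, and `ask_llm` is a hypothetical placeholder for whichever LLM client is used.

```python
# Hedged RAG sketch: retrieve relevant report passages, then build a prompt.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model

passages = [
    "Bolaget har kartlagt påverkan på biologisk mångfald vid sina anläggningar.",
    "Omsättningen ökade med tio procent under året.",
    "Inga åtgärder för att skydda pollinatörer redovisas.",
]
question = "Hur rapporterar företaget sin påverkan på biologisk mångfald?"

q = embedder.encode(question, convert_to_tensor=True)
p = embedder.encode(passages, convert_to_tensor=True)
top = util.cos_sim(q, p)[0].argsort(descending=True)[:2]

context = "\n".join(passages[i] for i in top.tolist())
prompt = f"Svara utifrån utdragen nedan.\n\n{context}\n\nFråga: {question}"
# print(ask_llm(prompt))  # hypothetical call to the chosen language model
print(prompt)
```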
367

Natural Language Based AI Tools in Interaction Design Research : Using ChatGPT for Qualitative User Research Insight Analysis

Saare, Karmen January 2024 (has links)
This thesis investigates the use of artificial intelligence, specifically the Large Language Model (LLM) application ChatGPT, in the context of qualitative user research, with the goal of enhancing the analysis of user research interviews. Through an empirical study in which ChatGPT was used in a typical user research insight analysis, the limitations and opportunities of the AI tool are examined. The study highlights the most significant insights from the empirical investigation, serving as examples to raise awareness of the implications of using ChatGPT for user interview analysis. The study concludes that ChatGPT has the potential to enhance the interpretation of primarily individual interviews by generating well-articulated summaries, provided their accuracy can be verified. Additionally, ChatGPT may be particularly useful in low-risk design projects where the consequences of potential misinterpretations are minimal. Finally, the study points out the importance of clearly articulated written instructions to ChatGPT for achieving the best results.
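
As an example of the kind of clearly articulated written instruction the study found important, the sketch below sends an interview transcript to a chat model with an explicit, verifiable summarization brief. The model name, client usage, and prompt wording are assumptions, not the study's actual setup.

```python
# Hedged sketch: an explicit, verifiable instruction for interview analysis.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

instructions = (
    "You are assisting with qualitative user research. Summarize the "
    "interview transcript into 3-5 key insights. Support each insight "
    "with a verbatim participant quote so accuracy can be verified, and "
    "do not infer opinions the participant did not state."
)
transcript = "Interviewer: How do you usually find news? Participant: ..."

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    messages=[{"role": "system", "content": instructions},
              {"role": "user", "content": transcript}],
)
print(resp.choices[0].message.content)
```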
368

The Struggle Against Misinformation: Evaluating the Performance of Basic vs. Complex Machine Learning Models on Manipulated Data

Valladares Parker, Diego Gabriel January 2024 (has links)
This study investigates the application of machine learning (ML) techniques to detecting fake news, addressing the rapid spread of misinformation across social media platforms. Given the time-consuming nature of manual fact-checking, this research compares the robustness of basic machine learning models, such as a Multinomial Naive Bayes classifier, with complex models like DistilBERT in identifying fake news. Utilizing the LIAR, ISOT, and GM datasets, the study evaluates these models on standard classification metrics in both single-domain and cross-domain scenarios, especially when processing linguistically manipulated data. Results indicate that while complex models like DistilBERT perform better in single-domain classification, the baseline models show competitive performance cross-domain and on the manipulated dataset. However, both model types struggle with the manipulated dataset, highlighting a critical area for improvement in fake news detection algorithms and methods. In conclusion, the findings suggest that while both basic and complex models have their strengths in certain settings, significant advancements are needed to improve robustness against linguistic manipulation and ensure reliable detection of fake news across varied contexts before automated classification can be considered for public availability.
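
The sketch below shows the kind of basic baseline the study compares against DistilBERT: a TF-IDF Multinomial Naive Bayes pipeline. The toy texts and labels are invented; the actual experiments use the LIAR, ISOT, and GM datasets.

```python
# Hedged sketch of a basic fake-news baseline: TF-IDF + Multinomial Naive Bayes.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "the moon landing was staged in a studio",
    "parliament passed the budget bill today",
    "miracle cure hidden by doctors revealed",
    "central bank holds interest rates steady",
]
train_labels = [1, 0, 1, 0]  # 1 = fake, 0 = real

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(train_texts, train_labels)
print(clf.predict(["doctors reveal hidden miracle cure"]))  # -> [1]
```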
369

Clustering and Anomaly detection using Medical Enterprise system Logs (CAMEL) / Klustring av och anomalidetektering på systemloggar

Ahlinder, Henrik, Kylesten, Tiger January 2023 (has links)
Research on automated anomaly detection in complex systems using log files has been on an upswing since the introduction of new deep-learning natural language processing methods. However, manually identifying and labelling anomalous logs is time-consuming, error-prone, and labor-intensive. This thesis instead uses an existing state-of-the-art method that learns from positive-unlabeled (PU) data as a baseline and evaluates three extensions to it. The first extension provides insight into how the choice of word embeddings affects the downstream task. The second extension applies a re-labelling strategy to reduce problems arising from pseudo-labelling. The final extension removes the need for pseudo-labelling by applying a state-of-the-art loss function from the field of PU learning. The findings show that FastText and GloVe embeddings are both viable options, with FastText providing faster training times but mixed results in terms of performance. Several of the methods studied in this thesis suffer from sporadically poor performance on one of the datasets studied. Finally, using modified risk functions from the field of PU learning yields new state-of-the-art performance on the datasets considered in this thesis.
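
For the final extension, a widely used risk function from PU learning is the non-negative PU (nnPU) estimator of Kiryo et al. (2017); the sketch below is a generic formulation of that estimator, not the thesis's implementation, and the class prior is an assumed input.

```python
# Hedged sketch of the nnPU risk: prior * R_p(+) + max(0, R_u(-) - prior * R_p(-)).
import torch
import torch.nn.functional as F

def nnpu_risk(scores_pos, scores_unl, prior, loss=F.softplus):
    # loss(-s): cost of classifying as positive-labeled; loss(s): cost of classifying as negative-labeled
    r_pos = loss(-scores_pos).mean()      # positives predicted positive
    r_pos_neg = loss(scores_pos).mean()   # positives predicted negative
    r_unl_neg = loss(scores_unl).mean()   # unlabeled predicted negative
    neg_risk = r_unl_neg - prior * r_pos_neg
    return prior * r_pos + torch.clamp(neg_risk, min=0.0)  # clamp keeps the risk non-negative

pos, unl = torch.randn(8), torch.randn(64)
print(nnpu_risk(pos, unl, prior=0.05))  # assumed anomaly prior
```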
370

Direct Preference Optimization for Improved Technical Writing Assistance : A Study of How Language Models Can Support the Writing of Technical Documentation at Saab / En studie i hur språkmodeller kan stödja skrivandet av teknisk dokumentation på Saab

Bengtsson, Hannes, Habbe, Patrik January 2024 (has links)
This thesis explores the potential of Large Language Models (LLMs) to assist in the technical documentation process at Saab. With the increasing complexity of and regulatory demands on such documentation, the objective is to investigate advanced natural language processing techniques as a means of streamlining the creation of technical documentation. Although many standards exist, this thesis focuses on ASD-STE100, Simplified Technical English (STE), a controlled language for technical documentation. STE's primary aim is to ensure that technical documents are understandable to individuals regardless of their native language or English proficiency. The study focuses on the implementation of Direct Preference Optimization (DPO) and Supervised Instruction Fine-Tuning (SIFT) to refine the capabilities of LLMs in producing clear and concise output that complies with STE. Through a series of experiments, we investigate the effectiveness of LLMs in interpreting and simplifying technical language, with particular emphasis on adherence to the STE standard. The study uses a dataset of target data paired with synthetic source data generated by an LLM. We apply various training strategies, including zero-shot evaluation, supervised instruction fine-tuning, and direct preference optimization. We evaluate the models' output using established quantitative metrics for text simplification, substituting company-internal software for human evaluators when assessing adherence to company standards and STE. Our findings suggest that while LLMs can contribute significantly to the technical writing process, the choice of training method and the quality of the data play crucial roles in model performance. The study shows how LLMs can improve productivity and reduce manual work, discusses the remaining problems, and suggests ways to improve the automation of technical documentation in the future.
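
The core of DPO is a preference loss over paired outputs; the sketch below is a generic formulation of that loss (Rafailov et al., 2023) operating on summed token log-probabilities, not Saab's implementation, and the numbers are placeholders.

```python
# Hedged sketch of the DPO objective: push the policy to prefer the chosen
# response over the rejected one, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Each argument: summed token log-probs of a response, shape (batch,)
    logits = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

pol_c, pol_r = torch.tensor([-12.3]), torch.tensor([-15.1])  # policy log-probs
ref_c, ref_r = torch.tensor([-13.0]), torch.tensor([-14.2])  # reference log-probs
print(dpo_loss(pol_c, pol_r, ref_c, ref_r))
```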
