
Text Classification using the Teacher-Student Chatroom Corpus

Österberg, Marcus January 2023 (has links)
Advancements in artificial intelligence, especially in the field of natural language processing, have opened new possibilities for educational chatbots. One of these is a chatbot that can simulate a conversation between a teacher and a student for continuous learner support. In an up-scaled learning environment, teachers have less time to interact with each student individually, and a resource for practising interactions with students could help alleviate this issue. In this thesis, we present a machine-learning model combined with a heuristic approach used in the creation of a chatbot. The machine-learning model learns language understanding from pre-built language representations that are fine-tuned on teacher-student conversations. The heuristic compares candidate responses and picks the highest-scoring one for response retrieval. A data quality analysis is also performed on the teacher-student conversation dataset. For text classification, the bert-base-cased language model performed best, with a weighted F1-score of 0.70. The dataset used for the machine-learning model showed consistency and completeness issues regarding labelling. The Technology Acceptance Model was used to evaluate the solution; this evaluation shows a high perceived ease of use but a low perceived usefulness. The thesis contributes TUM (topic understanding model), an educational chatbot, and an evaluation of the Teacher-Student Chatroom Corpus for use in text classification.
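As a rough illustration of two quantitative pieces mentioned in this abstract, the sketch below shows a response-retrieval heuristic that picks the highest-scoring canned response and the weighted F1 metric behind the reported 0.70. The topic labels, canned responses, and the use of difflib for scoring are illustrative assumptions, not the thesis's actual implementation.

```python
from difflib import SequenceMatcher
from sklearn.metrics import f1_score

# Hypothetical canned teacher responses grouped by predicted topic.
responses = {
    "grammar": ["Remember that the past tense of 'go' is 'went'.",
                "Try rephrasing the sentence in the past tense."],
    "vocabulary": ["'Purchase' is a more formal synonym of 'buy'."],
}

def retrieve_response(student_message, predicted_topic):
    """Heuristic retrieval: score every candidate response for the predicted
    topic against the student's message and return the highest-scoring one."""
    candidates = responses[predicted_topic]
    scores = [SequenceMatcher(None, student_message, c).ratio() for c in candidates]
    return max(zip(scores, candidates))[1]

# Weighted F1, the metric used to report the 0.70 result (toy labels shown here).
y_true = ["grammar", "grammar", "vocabulary", "vocabulary"]
y_pred = ["grammar", "vocabulary", "vocabulary", "vocabulary"]
print(f1_score(y_true, y_pred, average="weighted"))
```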

Predicting the Unpredictable – Using Language Models to Assess Literary Quality

Wu, Yaru January 2023 (has links)
People read for various purposes: learning specific skills, acquiring foreign languages, or simply enjoying the reading experience. That enjoyment may come from many aspects, such as the aesthetics of language, the beauty of rhyme, and the entertainment of being surprised by what happens next; the last of these is typical of fictional narratives and is the main topic of this project. In other words, "good" fiction may be better at entertaining readers by baffling and eluding their expectations, whereas "normal" narratives may contain more clichés and ready-made sentences that are easy to predict. This project therefore examines whether "good" fiction is less predictable than "normal" fiction, the two being operationalised as canonized and non-canonized literature. Predictability can be reflected statistically by the probability of the next word being correctly predicted given the previous content, measured here as perplexity. Thanks to recent advances in deep learning, language models based on neural networks with billions of parameters can be trained on terabytes of text to improve their ability to predict unseen text. Generative pre-trained modelling and text generation are therefore combined to estimate the perplexities of canonized and non-canonized literature. Because the terabytes of text on which advanced models have been trained may already contain the books under study, two series of models are designed to yield unbiased perplexity results: self-trained models and GPT-2 (generative pre-trained Transformer 2) models. Comparing these two groups of results sets up the final hierarchy of five models used in the further experiments. During perplexity estimation, the perplexity variance is also computed; it denotes how predictability varies across fixed-length sequences within each work and is used to examine the homogeneity of the two groups of literature. The results from the five models indicate differences in both perplexity values and perplexity variances between canonized and non-canonized literature. The canonized literature shows higher perplexity values and variances in both median and mean, meaning that it is less predictable and less homogeneous than the non-canonized literature. Perplexity values and variances obviously cannot define literary quality directly, but they offer signals that perplexity can be an insightful metric for literary quality analysis using natural language processing techniques.
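As a minimal sketch of the perplexity and perplexity-variance measures described above, the code below scores fixed-length token chunks of a text with the off-the-shelf GPT-2 model from Hugging Face. The chunk length, the model choice, the input file name, and the use of the mean token loss are assumptions rather than the thesis's exact setup.

```python
import statistics
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def chunk_perplexities(text, chunk_len=128):
    """Perplexity of each fixed-length token chunk of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    ppls = []
    for start in range(0, len(ids) - 1, chunk_len):
        chunk = ids[start:start + chunk_len].unsqueeze(0)
        if chunk.size(1) < 2:
            continue
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss  # mean token cross-entropy
        ppls.append(torch.exp(loss).item())
    return ppls

ppls = chunk_perplexities(open("novel.txt").read())  # hypothetical input file
print("mean perplexity:", statistics.mean(ppls))
print("perplexity variance:", statistics.pvariance(ppls))
```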

Controllable sentence simplification in Swedish : Automatic simplification of sentences using control prefixes and mined Swedish paraphrases

Monsen, Julius January 2023 (has links)
The ability to read and comprehend text is essential in everyday life. Some people, including individuals with dyslexia and cognitive disabilities, may experience difficulties with this. It is therefore important to make textual information accessible to diverse target audiences. Automatic Text Simplification (ATS) techniques aim to reduce the linguistic complexity of texts to facilitate readability and comprehension. However, existing ATS systems often lack customization to specific user needs, and simplification data for languages other than English is limited. This thesis addressed ATS in a Swedish context, building upon novel methods that provide more control over the simplification generation process and enable user customization. A dataset of Swedish paraphrases was mined from a large amount of text data. ATS models were then trained on this dataset utilizing prefix-tuning with control prefixes. Two sets of text attributes, and their effects on performance, were explored for controlling the generation: the first had been used in previous research, and the second was extracted in a data-driven way from existing text complexity measures. The trained ATS models for Swedish and additional models for English were evaluated and compared using the SARI and BLEU metrics. The results for the English models were consistent with, although slightly lower than, results from previous research using controllable generation mechanisms. The Swedish models provided significant improvements over the baseline (a fine-tuned BART model) and over previous Swedish ATS results. These results highlight the efficiency of pairing paraphrase data with controllable generation mechanisms for simplification. Furthermore, the different sets of attributes produced very similar results, suggesting that both sets manage to capture aspects of simplification. The process of mining paraphrases, the selection of control attributes, and other methodological implications are discussed, leading to suggestions for future research.
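The thesis learns continuous control prefixes via prefix-tuning; the sketch below shows the simpler, related idea of prepending discrete control tokens for target attributes to the input of a seq2seq model. The model name, attribute names, and bucket values are placeholders, and an off-the-shelf BART checkpoint would need fine-tuning on the mined paraphrases before its output is meaningful.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-base"  # placeholder; the thesis fine-tunes its own Swedish model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def simplify(sentence, char_ratio=0.8, word_rank=0.75):
    # Discrete control tokens encoding target attribute buckets (hypothetical names).
    control = f"<CHAR_RATIO_{char_ratio}> <WORD_RANK_{word_rank}> "
    inputs = tokenizer(control + sentence, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=60)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(simplify("The committee deliberated extensively before reaching a verdict."))
```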

Text Mining Methods for Biomedical Data Analysis

Jabeen, Rakhshanda January 2021 (has links)
Topic modelling of biological data has become a prevalent research area in recent years. However, analysing countless research papers and gathering consensus about biomedicine is a near-impossible task for any researcher because of the complexity and quantity of the published material. This thesis focuses on two objectives that can help researchers in this domain, based on data related to five major DNA repair pathways. The first objective is to propose an unsupervised approach for examining hidden structures and analysing research trends in temporal biomedical text data. The second objective is to find DNA repair markers involved in immune defence and to retrieve potential protein-protein interactions (PPIs), genetic interactions (GIs), and disease-gene associations reported in the literature. We used latent Dirichlet allocation (LDA) to discover hidden themes and semantically coherent topics in the text, clustered the documents based on the LDA topic models to analyse research trends, and applied the Mann-Kendall test to assess the trends of the topics. A hybrid of text mining methods, a classical co-occurrence statistics approach, and association rule mining was used to discover potential PPIs, GIs, and disease-gene associations in the text. The results for PPIs and GIs were then evaluated against an external biological database of PPIs.
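A minimal sketch of the LDA step described above, using scikit-learn on a toy corpus; the number of topics, vectorizer settings, and example documents are assumptions (the thesis works on literature about DNA repair pathways, and its trend analysis additionally applies the Mann-Kendall test over time).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "BRCA1 mediates homologous recombination repair of double strand breaks",
    "Nucleotide excision repair removes UV induced DNA lesions",
    "Mismatch repair deficiency drives microsatellite instability in tumours",
    "Homologous recombination factors interact with replication fork proteins",
    "Base excision repair handles oxidative DNA damage",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic proportions

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:6]]
    print(f"topic {k}: {', '.join(top_terms)}")
```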

Assessing BERT-Style Models' Abilities to Learn the Number of a Subject

Januleviciute, Laura January 2022 (has links)
There is increasing interest in using deep neural networks in various downstream natural language processing tasks. Such models are commonly used as black boxes, meaning that their decision-making is difficult to interpret. In order to build trust in models, it is crucial to analyse the inner workings that lead to their predictions. The need to interpret natural language processing models has prompted research on linguistically informed interpretability. This field revolves around choosing specific linguistic phenomena and inspecting models' ability to capture them without being explicitly trained to do so. This thesis project contributes to the field by assessing the ability of BERT-style models to learn subject number in Lithuanian and English. The experiments revolve around designing diagnostic classifiers that are used to determine whether the models are capable of learning this particular linguistic phenomenon. The results show that BERT-style models are capable of implicitly learning the number of a subject in both Lithuanian and English. However, this appears to be harder in Lithuanian, as the diagnostic classifiers show lower accuracy. The study also observes that the accuracy of logistic regression diagnostic classifiers fluctuates considerably, and that fully connected neural network classifiers outperform logistic regression classifiers.
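A bare-bones sketch of the diagnostic-classifier idea: freeze a BERT-style encoder, take a sentence representation, and train a small classifier to predict subject number. The English toy sentences, the use of the [CLS] vector, and the logistic regression probe are illustrative assumptions; the thesis also probes Lithuanian and uses fully connected neural classifiers.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

sentences = [
    "The dog chases the ball.", "The cat sleeps on the sofa.",   # singular subjects
    "The dogs chase the ball.", "The cats sleep on the sofa.",   # plural subjects
]
labels = [0, 0, 1, 1]  # 0 = singular, 1 = plural

def embed(sentence):
    """Return the [CLS] vector from the frozen encoder."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**enc)
    return out.last_hidden_state[0, 0].numpy()

features = [embed(s) for s in sentences]
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print(probe.predict([embed("The birds fly south.")]))  # expect plural (1)
```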

Automatic generation of definitions : Exploring if GPT is useful for defining words

Eriksson, Fanny January 2023 (has links)
When reading a text, it is common to get stuck on unfamiliar words that are difficult to understand in the local context. In these cases, we use dictionaries or similar online resources to find the general meaning of the word. However, maintaining a handwritten dictionary is highly resource-demanding since language is constantly evolving, so using generative language models to produce definitions could be a more efficient option. To explore this possibility, this thesis conducts an online survey to examine whether GPT could be useful for defining words. It also investigates how well the Swedish language model GPT-SW3 (3.5 b) defines words compared to the model text-davinci-003, and how prompts should be formatted when defining words with these models. The results indicate that text-davinci-003 generates high-quality definitions, and according to Student's t-test its definitions received significantly higher ratings from participants than definitions taken from Svensk ordbok (SO). Furthermore, GPT-SW3 (3.5 b) received the lowest ratings, indicating that more investment is needed to keep up with the large models developed by OpenAI. Regarding prompt formatting, the most appropriate prompt format for defining words is highly dependent on the model: text-davinci-003 performed well with zero-shot prompts, while GPT-SW3 (3.5 b) required a few-shot setting. Considering both the high quality of the definitions generated by text-davinci-003 and the practical advantages of generating definitions automatically, GPT could be a useful tool for defining words.
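To illustrate the prompt-formatting comparison, the sketch below builds a zero-shot and a few-shot definition prompt. The prompt wording and example definitions are invented here, and the `generate` stub stands in for whichever completion API is used (text-davinci-003 or GPT-SW3).

```python
def zero_shot_prompt(word: str) -> str:
    # Plain instruction with no demonstrations.
    return f"Define the word '{word}' in one sentence.\nDefinition:"

def few_shot_prompt(word: str) -> str:
    # A couple of hand-written demonstrations before the target word.
    examples = (
        "Word: bicycle\nDefinition: A vehicle with two wheels that is propelled by pedalling.\n\n"
        "Word: library\nDefinition: A place where collections of books are kept for reading or borrowing.\n\n"
    )
    return examples + f"Word: {word}\nDefinition:"

def generate(prompt: str) -> str:
    """Placeholder for a call to the chosen language model's completion API."""
    raise NotImplementedError

for build in (zero_shot_prompt, few_shot_prompt):
    print(build("photosynthesis"))
    print("-" * 40)
```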

Synthetic data generation for domain adaptation of a retriever-reader Question Answering system for the Telecom domain : Comparing dense embeddings with BM25 for Open Domain Question Answering

Döringer Kana, Filip January 2023 (has links)
Having computer systems capable of answering questions has been a goal within natural language processing research for many years. Machine learning systems have recently become increasingly proficient at this task, with large language models obtaining state-of-the-art performance. Retriever-reader architectures have become a powerful approach for building systems that enable users to enter questions and get factual answers from a corpus of documents. This architecture uses a retriever component that fetches the most relevant documents and a reader that in turn extracts the answer from those documents. These systems commonly use transformer-based models for both components, fine-tuned on a general domain of documents such as Wikipedia. However, the performance of such systems on new domains with different vocabularies can be lacking. Furthermore, new domains of, for instance, company-specific documents often lack annotated data, which makes training new models cumbersome. This thesis investigated how a retriever-reader architecture can be adapted to a corpus of Telecom documents by generating question-answer data with a large generative language model, GPT-3.5. It also compared a dense BERT-based retriever with a BM25-based retriever on this domain. Findings suggest that generating training data can be an effective approach for fine-tuning a dense retriever, increasing the Top-K retrieval accuracy by 20 points for k = 10 compared to a dense retriever fine-tuned on Wikipedia. Additionally, the sparse retriever outperforms the best dense retriever, although there is reason to believe that the structure of the test dataset could influence this. Finally, the results also indicate that the performance of the reader is not improved by the generated data, although future work is needed to draw firmer conclusions.
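A small sketch of the sparse-retrieval baseline and the Top-K retrieval accuracy metric mentioned above, using the rank_bm25 package on a toy corpus; the package choice, whitespace tokenization, and example documents are assumptions rather than the thesis's setup.

```python
from rank_bm25 import BM25Okapi

# Toy "telecom" corpus; each question below is answered by exactly one document.
documents = [
    "The base station handles handover between neighbouring cells.",
    "A SIM card stores the subscriber identity used for authentication.",
    "Beamforming focuses the radio signal towards a specific user.",
]
questions = [
    ("What stores the subscriber identity?", 1),
    ("What focuses the radio signal towards a user?", 2),
]

bm25 = BM25Okapi([doc.lower().split() for doc in documents])

def top_k_accuracy(k=2):
    """Fraction of questions whose gold document is ranked in the top k."""
    hits = 0
    for question, gold_idx in questions:
        scores = bm25.get_scores(question.lower().split())
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += gold_idx in top_k
    return hits / len(questions)

print(top_k_accuracy(k=1))
```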

Incremental Re-tokenization in BPE-trained SentencePiece Models

Hellsten, Simon January 2024 (has links)
This bachelor's thesis in Computer Science explores the efficiency of an incremental re-tokenization algorithm in the context of BPE-trained SentencePiece models used in natural language processing. The thesis begins by underscoring the critical role of tokenization in NLP, particularly highlighting the complexities introduced by modifications in tokenized text. It then presents an incremental re-tokenization algorithm, detailing its development and evaluating its performance against a full text re-tokenization. Experimental results demonstrate that this incremental approach is more time-efficient than full re-tokenization, especially evident in large text datasets. This efficiency is attributed to the algorithm's localized re-tokenization strategy, which limits processing to text areas around modifications. The research concludes by suggesting that incremental re-tokenization could significantly enhance the responsiveness and resource efficiency of text-based applications, such as chatbots and virtual assistants. Future work may focus on predictive models to anticipate the impact of text changes on token stability and optimizing the algorithm for different text contexts.
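The abstract describes the algorithm only at a high level; the sketch below shows one coarse way to localize re-tokenization by caching tokens per line and re-encoding only edited lines with a SentencePiece model. The model path is hypothetical, and the thesis's actual algorithm works on a token window around the modification rather than whole lines.

```python
import sentencepiece as spm

# Hypothetical path to a BPE-trained SentencePiece model.
sp = spm.SentencePieceProcessor(model_file="bpe.model")

class IncrementalTokenizer:
    """Cache tokens per line; an edit triggers re-tokenization of that line only."""

    def __init__(self, text: str):
        self.lines = text.split("\n")
        self.tokens = [sp.encode(line, out_type=str) for line in self.lines]

    def edit_line(self, index: int, new_line: str):
        self.lines[index] = new_line
        self.tokens[index] = sp.encode(new_line, out_type=str)  # untouched lines keep cached tokens

    def all_tokens(self):
        return [tok for line_tokens in self.tokens for tok in line_tokens]

doc = IncrementalTokenizer("Hello world\nThis line will be edited\nLast line")
doc.edit_line(1, "This line was edited")
print(doc.all_tokens())
```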

Few-shot Question Generation with Prompt-based Learning

Wu, Yongchao January 2022 (has links)
Question generation (QG), which automatically generates good-quality questions from a piece of text, can lower the cost of manually composing questions. Recently, question generation has attracted increasing interest for its ability to supply large numbers of questions for developing conversation systems and educational applications, as well as for corpus development in natural language processing (NLP) research tasks such as question answering and reading comprehension. Previous neural QG approaches have achieved remarkable performance, but they require a large amount of data to train neural models properly, which limits the application of question generation in low-resource scenarios, e.g. with a few hundred training examples. This thesis aims to address the low-resource scenario by investigating a recently emerged paradigm of NLP modelling, prompt-based learning. Prompt-based learning, which makes predictions based on the knowledge of a pre-trained language model and some simple textual task descriptions, has shown great effectiveness in various NLP tasks in few-shot and zero-shot settings, in which few or no examples are needed to train a model. In this project, we introduce a prompt-based question generation approach that constructs question generation task instructions understandable by a pre-trained sequence-to-sequence language model. Our experimental results show that our approach outperforms previous state-of-the-art question generation models by a wide margin: 36.8%, 204.8%, 455.9%, 1083.3%, and 57.9% for the metrics BLEU-1, BLEU-2, BLEU-3, BLEU-4, and ROUGE-L respectively in the few-shot learning settings. We also conducted a quality analysis of the generated questions and found that our approach can generate questions with correct grammar and relevant topical information when trained with as few as 1,000 training examples.
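A minimal sketch of prompt-based question generation with an off-the-shelf instruction-following seq2seq model; the checkpoint (flan-t5-base), the prompt wording, and the example passage are placeholders, not the prompts or model actually used in the thesis.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"  # placeholder instruction-tuned seq2seq checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

passage = ("The Amazon rainforest produces roughly 20 percent of the oxygen "
           "generated by photosynthesis on land.")
answer = "roughly 20 percent"

# Textual task instruction wrapping the passage and the target answer.
prompt = (
    "Generate a question whose answer is given below.\n"
    f"Passage: {passage}\n"
    f"Answer: {answer}\n"
    "Question:"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```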
