Global ETD Search

31	Leveraging Linguistic Insights for Uncertainty Calibration of ChatGPT and Evaluating Crowdsourced Annotations Venkata Divya Sree Pulipati (18469230) 09 July 2024 (has links) The quality of crowdsourced annotations has always been a challenge due to variability in annotators' backgrounds, task complexity, the subjective nature of many labeling tasks, and various other reasons. Hence, it is crucial to evaluate these annotations to ensure their reliability. Traditionally, human experts evaluate the quality of crowdsourced annotations, but this approach has its own challenges. Hence, this paper proposes to leverage large language models like ChatGPT-4 to evaluate an existing crowdsourced dataset, MAVEN, and explore their potential as an alternative solution. However, due to the stochastic nature of LLMs, it is important to discern when to trust and when to question LLM responses. To address this, we introduce a novel approach that applies Rubin's framework for identifying and using linguistic cues within LLM responses as indicators of LLMs' certainty levels. Our findings reveal that ChatGPT-4 successfully identified 63% of the incorrect labels, highlighting the potential for improving data label quality through human-AI collaboration on these identified inaccuracies. This study underscores the promising role of LLMs in evaluating crowdsourced data annotations, offering a way to enhance the accuracy and fairness of crowdsourced annotations while saving time and costs. Natural language processing Data quality Computational linguistics Crowdsourcing Certainty calibration LLM Evaluating annotation quality
32	Preventing Health Data from Leaking in a Machine Learning System : Implementing code analysis with LLM and model privacy evaluation testing / Förhindra att Hälsodata Läcker ut i ett Maskininlärnings System : Implementering av kod analys med stor språk-modell och modell integritets testning Janryd, Balder, Johansson, Tim January 2024 (has links) Sensitive data leaking from a system can have severe negative consequences, such as discrimination, social stigma, and economic harm from fraud, for those whose data has been leaked. It is therefore of utmost importance that sensitive data is not leaked from a system. This thesis investigated methods to prevent sensitive patient data from leaking in a machine learning system. Various methods were investigated and evaluated based on previous research; the methods used in this thesis are a large language model (LLM) for code analysis and a membership inference attack on models to test their privacy level. The LLM code analysis results show that the Llama 3 model (an LLM) had an accuracy of 90% in identifying malicious code that attempts to steal sensitive patient data. The model analysis can evaluate and determine membership inference of sensitive patient data used for training machine learning models, which is essential for determining the data leakage risk that a machine learning model can pose in a machine learning system. Further studies on improving the determinism and formatting of the LLM's responses are needed to ensure the robustness of a security system that uses LLMs before it can be deployed in a production environment. Further studies of the model analysis can apply a wider variety of evaluations, such as a larger set of machine learning model types and a wider range of attack tests, which can be implemented in machine learning systems. 
/ Sensitive data leaking from a system can have enormous negative consequences, such as discrimination, social stigma, and negative economic consequences for those whose data has been leaked. It is therefore of utmost importance that sensitive data does not leak from a system. This thesis investigated various methods to prevent sensitive patient data from leaking out of a machine learning system. Various methods have been investigated and evaluated based on previous research; the methods used in this thesis are a large language model (LLM) for code analysis and a membership inference attack on machine learning (ML) models to test the models' privacy level. The code analysis results from the LLM show that the Llama 3 model had an accuracy of 90% in identifying malicious code that attempts to steal sensitive patient data. The model analysis can evaluate and determine membership of sensitive patient data used for training machine learning models, which is crucial for determining the data leakage that a machine learning model may expose. Further studies on increasing the determinism and formatting of the LLM's responses must be conducted to ensure the robustness of the security system that uses LLMs before it can be deployed in a production environment. Further studies of the model analysis can apply a broader range of evaluations, such as a larger set of machine learning model types and a wider range of attack test types for machine learning models, which can be implemented in machine learning systems. Sensitive Data Machine Learning (ML) Large Language Model (LLM) Code Analysis Llama 3 Data Privacy Membership Inference Attack (MIA) Computer Sciences
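The membership inference test mentioned in entry 32 can be illustrated with a minimal sketch. The abstract does not say which attack variant the thesis used, so this assumes the simplest confidence-threshold attack: a record is guessed to be a training member when the model's confidence on its true label is at or above a threshold. The function names, the threshold, and any confidence values passed in are hypothetical and not taken from the thesis.

```python
def infer_membership(confidences: list[float], threshold: float = 0.9) -> list[bool]:
    """Guess 'training member' for every record whose confidence meets the threshold.

    Assumption: overfitted models tend to be more confident on records they
    were trained on. This is an illustrative attack, not the thesis's method.
    """
    return [c >= threshold for c in confidences]


def attack_accuracy(
    member_conf: list[float],
    nonmember_conf: list[float],
    threshold: float = 0.9,
) -> float:
    """Fraction of records whose membership the attack guesses correctly."""
    member_guesses = infer_membership(member_conf, threshold)
    nonmember_guesses = infer_membership(nonmember_conf, threshold)
    # A member is guessed correctly when flagged True, a non-member when flagged False.
    correct = sum(member_guesses) + sum(not g for g in nonmember_guesses)
    return correct / (len(member_conf) + len(nonmember_conf))
```

On a balanced set of members and non-members, an attack accuracy close to 0.5 suggests the attacker does no better than guessing, while values well above 0.5 indicate that the model leaks information about its training data.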
33	Stora språkmodeller för bedömning av applikationsrecensioner : Implementering och undersökning av stora språkmodeller för att sammanfatta, extrahera och analysera nyckelinformation från användarrecensioner / Large Language Models for application review data : Implementation survey of Large Language Models (LLM) to summarize, extract, and analyze key information from user reviews von Reybekiel, Algot, Wennström, Emil January 2024 (has links) Manually reviewing user reviews to extract relevant information can be a time-consuming process. This report investigated whether large language models can be used to summarize, extract, and analyze key information from reviews, and how such an application can be built. It turned out that different models performed differently depending on the metrics and the weighting between recall and precision. Furthermore, fine-tuning language models such as Llama 3 improved performance in classifying useful reviews and, according to some metrics, led to higher performance than larger language models such as Chat-Bison. For reviews translated into English, Llama 3:8b:Instruct, Chat-Bison, and the fine-tuned version of Llama 3:8b had F4 macro scores of 0.89, 0.90, and 0.91, respectively. A further result is that the larger models Chat-Bison, Text-Bison, and Gemini performed better at generating summary texts than the smaller models tested when several reviews were input at a time. In general, the language models also performed better when reviews were first translated into English before processing than when they were left in their original language, with the majority of reviews written in Swedish. Another lesson from the pre-processing of reviews is that the number of calls to these language models can be minimized by filtering on word length and rating. 
Beyond language models, the results showed that using vector databases and embeddings can give a broader overview of useful reviews through vector databases' built-in ability to find semantic similarities and gather similar reviews into clusters. / Manually reviewing user reviews to extract relevant information can be a time-consuming process. This report investigates whether large language models can be used to summarize, extract, and analyze key information from reviews, and how such an application can be constructed. It was discovered that different models exhibit varying degrees of performance depending on the metrics and the weighting between recall and precision. Furthermore, fine-tuning of language models such as Llama 3 was found to improve performance in classifying useful reviews and, according to some metrics, led to higher performance than larger language models like Chat-Bison. Specifically, for reviews translated into English, Llama 3:8b:Instruct, Chat-Bison, and the fine-tuned Llama 3:8b had F4 macro scores of 0.89, 0.90, and 0.91, respectively. A further finding is that the larger models, Chat-Bison, Text-Bison, and Gemini, performed better than the smaller models that were tested when multiple reviews were input at a time for summary text generation. In general, language models performed better if reviews were first translated into English before processing than when reviews were left in the original language, where most reviews were written in Swedish. Another insight from the pre-processing phase is that the number of API calls to these language models can be minimized by filtering based on word length and rating. In addition to findings related to language models, the results also demonstrated that the use of vector databases and embeddings can provide a greater overview of reviews by leveraging the databases' built-in ability to identify semantic similarities and cluster similar reviews together. 
LLM NLP large language model natural language processing analyze comparison generative AI summarize extract analyze user reviews Langchain fine-tuning classification Computer Sciences
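The F4 macro score reported in entry 33 can be read as follows, assuming it denotes the standard F-beta measure with beta = 4, averaged without weighting over the classes. With beta = 4, recall weighs far more heavily than precision, which matches the abstract's remark about the weighting between recall and precision. The sketch below is illustrative only; the function names are hypothetical and no figures from the thesis are reproduced.

```python
def f_beta(precision: float, recall: float, beta: float = 4.0) -> float:
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R).

    Larger beta values give recall more influence on the result.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # conventional value when both terms are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_f_beta(per_class: list[tuple[float, float]], beta: float = 4.0) -> float:
    """Macro average: the unweighted mean of per-class F-beta scores.

    per_class holds one (precision, recall) pair per class.
    """
    return sum(f_beta(p, r, beta) for p, r in per_class) / len(per_class)
```

Under this definition a classifier with low precision but perfect recall scores much higher than one with perfect precision and low recall, so a model tuned not to miss useful reviews is rewarded even if it lets some irrelevant ones through.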
34	Generative AI Assistant for Public Transport Using Scheduled and Real-Time Data / Generativ AI-assistent för kollektivtrafik som använder planerad och realtidsdata Karlstrand, Jakob, Nielsen, Axel January 2024 (has links) This thesis presents the design and implementation of a generative Artificial Intelligence (AI)-based decision-support interface applied to the domain of public transport, leveraging both offline and logged data from past records and real-time updates. The AI assistant system was developed using pre-trained Large Language Models (LLMs) together with Retrieval Augmented Generation (RAG) and the Function Calling Application Programming Interface (API), provided by OpenAI, to automate the process of adding knowledge to the LLM. Challenges such as formatting and restructuring of data, data retrieval methodologies, accuracy, and latency were considered. The result is an AI assistant that can hold a conversation with users and answer questions regarding departures, arrivals, specific vehicle trips, and other questions relevant within the domain of the dataset. The AI assistant system has also been developed to provide client-side actions that integrate with the user interface, enabling interactive elements such as clickable links to trigger relevant actions based on the content provided. Different LLMs, including GPT-3.5 and GPT-4 with different temperatures, were compared and evaluated with a pre-defined set of questions, each paired with a ground truth. By adopting a conversational approach, the project aims to streamline information extraction from extensive datasets, offering a more flexible and feedback-oriented alternative to manual search and filtering processes. This way, traffic managers can adapt and operate more efficiently. The traffic managers will also remain informed about small disturbances and can act faster and more efficiently. The project was conducted at Gaia Systems AB, Norrköping, Sweden. 
The project primarily aims to enhance the workflow of traffic managers utilizing Gaia’s existing software for public transport management within Östgötatrafiken. / This thesis presents the design and implementation of a generative Artificial Intelligence (AI)-based decision-support interface applied to the field of public transport, using both offline and logged data from past events and real-time updates. The AI assistant system was developed using Large Language Models (LLMs) together with Retrieval Augmented Generation (RAG) and the Function Calling API, provided by OpenAI, to automate the process of adding knowledge to an LLM. Challenges such as formatting and restructuring of data, data retrieval methods, accuracy, and latency were considered. The result is an AI assistant that can hold a conversation with users and answer questions about departures, arrivals, specific vehicle trips, and other questions relevant to the domain of the dataset. The AI assistant system has also been developed to provide Client Actions that integrate with the user interface, enabling interactive elements such as clickable links that trigger relevant actions based on the content provided. Different LLMs, including GPT-3.5 and GPT-4 with different temperatures, were compared and evaluated using a predefined set of questions, each paired with a ground truth. By using a conversational approach, the project aims to streamline information extraction from extensive datasets and offers a more flexible and feedback-oriented alternative to manual search and filtering processes. In this way, traffic managers can adapt and work more efficiently. The traffic managers will also be kept informed about minor disturbances and can act faster and more efficiently. The project was carried out at Gaia Systems AB, Norrköping, Sweden. 
The project primarily aims to improve the workflow of traffic managers who use Gaia's existing software for public transport management within Östgötatrafiken. Generative AI LLM RAG Retrieval Augmented Generation Copilot AI AI Assistant Public transport Computer Sciences
35	A Method for Automated Assessment of Large Language Model Chatbots : Exploring LLM-as-a-Judge in Educational Question-Answering Tasks Duan, Yuyao, Lundborg, Vilgot January 2024 (has links) This study introduces an automated evaluation method for large language model (LLM) based chatbots in educational settings, utilizing LLM-as-a-Judge to assess their performance. Our results demonstrate the efficacy of this approach in evaluating the accuracy of three LLM-based chatbots (Llama 3 70B, ChatGPT 4, Gemini Advanced) across two subjects: history and biology. The analysis reveals promising performance across different subjects. On a scale from 1 to 5 describing the correctness of the judge itself, the LLM judge’s average correctness scores when evaluating each chatbot on history-related questions are 3.92 (Llama 3 70B), 4.20 (ChatGPT 4), and 4.51 (Gemini Advanced); for biology-related questions, the average scores are 4.04 (Llama 3 70B), 4.28 (ChatGPT 4), and 4.09 (Gemini Advanced). This underscores the potential of the LLM-as-a-Judge strategy for evaluating the correctness of responses from other LLMs. LLM-as-a-Judge Chatbot Large language model LLM chatbot GPT-4o Llama 3 70B ChatGPT 4 Gemini Advanced Automatic evaluation Automatic evaluation parameters Education History Biology Computer Systems
36	Prompt engineering and its usability to improve modern psychology chatbots / Prompt engineering och dess användbarhet för att förbättra psykologichatbottar Nordgren, Isak, E. Svensson, Gustaf January 2023 (has links) As advancements in chatbots and Large Language Models (LLMs) such as GPT-3.5 and GPT-4 continue, their applications in diverse fields, including psychology, expand. This study investigates the effectiveness of LLMs optimized through prompt engineering, aiming to enhance their performance in psychological applications. To this end, two distinct versions of a GPT-3.5-based chatbot were developed: a version similar to the base model, and a version equipped with a more extensive system prompt detailing expected behavior. A panel of professional psychologists evaluated these models based on a predetermined set of questions, providing insight into their potential future use as psychological tools. Our results indicate that an overly prescriptive system prompt can unintentionally limit the versatility of the chatbot, making a careful balance in instruction specificity essential. Furthermore, while our study suggests that current LLMs such as GPT-3.5 are not capable of fully replacing human psychologists, they can provide valuable assistance in tasks such as basic question answering, consolation and validation, and triage. These findings provide a foundation for future research into the effective integration of LLMs in psychology and contribute valuable insights into the promising field of AI-assisted psychological services. / As advances in chatbots and large language models (LLMs) such as GPT-3.5 and GPT-4 continue, their potential applications in various fields, including psychology, expand. This study examines the effectiveness of LLMs optimized through prompt engineering, with the goal of improving their performance in psychological applications. 
To this end, two distinct versions of a chatbot based on GPT-3.5 were developed: a version resembling the base model, and a version equipped with a more extensive system prompt detailing expected behavior. A panel of professional psychologists evaluated these models based on a predetermined set of questions, giving insight into their potential future use as psychological tools. Our results suggest that an overly prescriptive system prompt can unintentionally limit the chatbot's versatility, which calls for a careful balance in the specificity of the prompt. Furthermore, our study suggests that current LLMs such as GPT-3.5 cannot fully replace human psychologists, but that they can provide valuable help in tasks such as basic question answering, consolation and validation, and triage. These results provide a foundation for future research on the effective integration of LLMs in psychology and contribute valuable insights to the promising field of AI-assisted psychological services. Large Language Models LLM GPT GPT-3.5 GPT-4 chatbots psychology prompt engineering Computer and Information Sciences
37	Användning och acceptans av AI-verktyg inom utbildningssektorn : Upplevelser hos lärare och forskare att använda Microsoft 365 Copilot i sin yrkesroll / Use and acceptance of AI-tools in the education sector : Experiences of teachers and researchers using Microsoft 365 Copilot in their professional role Moyo, Hannah, Nordén, Linnea January 2024 (has links) With the development of AI, an emerging paradigm shift is taking place within organizations as employees use AI-tools to optimize their work performance. The use of AI-tools can also benefit academic roles in the education sector, such as teachers and researchers. However, it is unclear what support these AI-tools can provide for these professional roles. Because their work is characterized by a high level of quality and consideration of ethical aspects, high demands are placed on the AI-tool's capabilities. This study aims to provide a better understanding of the acceptance of the AI-tool Microsoft 365 Copilot within the education sector from the perspectives of teachers and researchers. To support the examination of the acceptance of the AI-tool, the study took the Technology Acceptance Model (TAM) as its starting point. Through semi-structured interviews and unstructured observations, insight was gained into teachers' and researchers' experiences with the AI-tool and the opportunities or limitations they identified in using it within their professional role. Our conclusion shows that the AI-tool is not perceived to maintain a level equivalent to that of the users themselves or of similar AI-tools. Furthermore, there is also a need for support and training for teachers and researchers in using AI-tools, regarding both the AI-tool's functionality and guidelines on information security. / Through the development of AI, a new paradigm shift is beginning within organizations as employees use AI-tools to optimize their work performance. 
The use of AI-tools can also bring benefits to academic roles in the education sector, such as teachers and researchers. However, there is uncertainty about the support these AI-tools can offer to these professional roles. Given the high level of quality required in these professional roles, as well as the need to consider ethical aspects, there are significant demands on the capabilities of the AI-tool. This study aims to provide a deeper understanding of the acceptance of the AI-tool Microsoft 365 Copilot within the education sector from the perspectives of teachers and researchers. To examine the acceptance of the AI-tool, the study is based on the Technology Acceptance Model (TAM). Through semi-structured interviews and unstructured observations, insights were gained into teachers' and researchers' experiences with the AI-tool and what opportunities or limitations they identified in using it within their professional role. Our conclusion indicates that the AI-tool was not perceived to maintain a level equal to the users themselves or similar AI-tools. Furthermore, there is a need for support and education for teachers and researchers in using AI-tools, regarding both the functionality of the AI-tool and guidelines for information security. AI LLM Researcher Teacher Microsoft 365 Copilot Academy Information Systems, Social aspects
38	Towards Manipulator Task-Oriented Programming: Automating Behavior-Tree Configuration Yue Cao (18985100) 08 July 2024 (has links) Task-oriented programming is a way of programming manipulators in terms of high-level tasks instead of explicit motions. It has been a long-standing vision in robotics since its early days. Despite its potential, several challenges have hindered its full realization. This thesis identifies three major challenges, particularly in task specification and the planning-to-execution transition: 1) The absence of natural language integration in system input. 2) The dilemma of continuously developing non-uniform and domain-specific primitive-task libraries. 3) The requirement for much human intervention.
To overcome these difficulties, this thesis introduces a novel approach that integrates natural language inputs, eliminates the need for fixed primitive-task libraries, and minimizes human intervention. It adopts the behavior tree, a modular and user-friendly form, as the task representation and advances its usage in task specification and the planning-to-execution transition. The thesis is structured into two parts: Task Specification and Planning-to-Execution Transition.
Task specification explores the use of large language models to generate a behavior tree from an end-user's input. A Phase-Step prompt is designed to enable automatic behavior-tree generation from an end-user's abstract task descriptions in natural language. With the powerful generalizability of large language models, it breaks the dilemma that comes with fixed primitive-task libraries in task generation. A full-process case study demonstrated the proposed approach. An ablation study was conducted to evaluate the effectiveness of the Phase-Step prompts. Task specification also proposes behavior-tree embeddings to facilitate the retrieval-augmented generation of behavior trees. The integration of behavior-tree embeddings not only eliminates the need for manual prompt configuration but also provides a way to incorporate external domain knowledge into the generation process. Three types of evaluations were performed to assess the performance of the behavior-tree embedding method.
Planning-to-execution transition explores how to transition primitive tasks from task specification into manipulator executions. Two types of primitive tasks are considered separately: point-to-point movement tasks and object-interaction tasks. For point-to-point movement tasks, a behavior-tree reward is proposed to enable reinforcement learning over low-level movement while following the high-level running order of the behavior tree. End-users only need to specify rewards on the primitive tasks over the behavior tree, and the rest of the process is handled automatically. A 2D space movement simulation was provided to justify the approach. For object-interaction tasks, the planning-to-execution transition uses a large-language-model-based generation approach. This approach takes natural-language-described primitive tasks as input and directly produces task-frame-formalism set-points. Combined with hybrid position/force control systems, a transition process from primitive tasks directly into joint-level execution can be realized. Evaluations over a set of 30 primitive tasks were conducted.
Overall, this thesis proposes an approach that advances the behavior tree towards automated task specification and planning-to-execution transitions. It opens up new possibilities for building better task-oriented manipulator programming systems. Intelligent robotics Task-Oriented Programming Behavior Trees Manipulators Large Language Model (LLM) Machine Learning Robot Planning and Control
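For readers unfamiliar with the behavior tree used as the task representation in entry 38, a minimal sketch follows. It shows only the two classic composite nodes, Sequence and Fallback, and omits the RUNNING status that real behavior-tree implementations normally include. All class names, the helper function, and the task names used with it are illustrative assumptions and are not taken from the thesis.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class Action:
    """Leaf node wrapping one primitive task (hypothetical stand-in)."""
    name: str
    run: Callable[[], bool]

    def tick(self) -> Status:
        return Status.SUCCESS if self.run() else Status.FAILURE


@dataclass
class Sequence:
    """Ticks children in order and fails as soon as one child fails."""
    children: list

    def tick(self) -> Status:
        for child in self.children:
            if child.tick() is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS


@dataclass
class Fallback:
    """Ticks children in order and succeeds as soon as one child succeeds."""
    children: list

    def tick(self) -> Status:
        for child in self.children:
            if child.tick() is Status.SUCCESS:
                return Status.SUCCESS
        return Status.FAILURE


def make_action(name: str, succeeds: bool, log: list) -> Action:
    """Build a dummy action that records its name and returns a fixed result."""
    def run() -> bool:
        log.append(name)
        return succeeds
    return Action(name, run)
```

A tree such as `Sequence([move, Fallback([grasp, regrasp]), place])` expresses the high-level running order: the manipulator moves, tries to grasp with a fallback if the first attempt fails, and only then places the object.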
39	Large language models as an interface to interact with API tools in natural language Tesfagiorgis, Yohannes Gebreyohannes, Monteiro Silva, Bruno Miguel January 2023 (has links) In this research project, we aim to explore the use of Large Language Models (LLMs) as an interface to interact with API tools in natural language. Bubeck et al. [1] shed some light on how LLMs could be used to interact with API tools. Since then, new versions of LLMs have been launched, and the question of how reliable an LLM can be in this task remains unanswered. The main goal of our thesis is to investigate the designs of the available system prompts for LLMs, identify the best-performing prompts, and evaluate the reliability of different LLMs when using the best-identified prompts. We will employ a multiple-stage controlled experiment: a literature review in which we identify the available system prompts used in the scientific community and open-source projects; then, using F1-score as a metric, we will analyse the precision and recall of the system prompts, aiming to select the best-performing system prompts for interacting with API tools; and, in a later stage, we compare a selection of LLMs with the best-performing prompts identified earlier. From these experiments, we find that AI-generated system prompts perform better than the prompts currently used in open-source projects and the literature with GPT-4, that zero-shot prompts perform better in this specific task with GPT-4, and that a good system prompt for one model does not generalize well to other models. Large language model (LLM) Natural Language Processing (NLP) GPT-4 Llama-2 Palm Application Programming Interface (API). Engineering and Technology Computer Sciences
40	ChatGPT’s Performance on the Brief Electricity and Magnetism Assessment Melin, Jakob, Elias, Önerud January 2024 (has links) In this study, we tested the performance of ChatGPT-4 on the concept inventory Brief Electricity and Magnetism Assessment (BEMA) to understand its potential as an educational tool in physics, especially in tasks requiring visual interpretation. Our results indicate that ChatGPT-4 performs similarly to undergraduate students in introductory electromagnetism courses, with an average score close to that of the students. However, ChatGPT-4 displayed significant differences compared to students, particularly in tasks involving complex visual elements such as electrical circuits and magnetic field diagrams. While ChatGPT-4 was proficient in proposing correct physical reasoning, it struggled with accurately interpreting visual information. These findings suggest that while ChatGPT-4 can be a useful supplementary tool for students, it should not be relied upon as a primary tutor for subjects heavily dependent on visual interpretation. Instead, it could be more effective as a peer, where its outputs are critically evaluated by students. Further research should focus on improving ChatGPT’s visual processing capabilities and exploring its role in diverse educational contexts. Physical Sciences Didactics

Search results