31. Investigating the impact of Generative AI on newcomers' understanding of Software Projects
Larsen, Knud Ronau; Edvall, Magnus. January 2024.
Context: In both commercial and open-source software development, newcomers often join the development process in advanced stages of the software development lifecycle. They frequently face barriers that impede their ability to make early contributions, often caused by a lack of understanding. To address this, we have developed an LLM-based tool called SPAC-B that facilitates project-specific question answering to aid newcomers' understanding of software projects. Objective: Investigate the tool's ability to assist newcomers in understanding software projects by measuring its accuracy and conducting an experiment. Method: A case study was conducted to investigate the accuracy of the tool, measured as relevance, completeness, and correctness. Furthermore, an experiment was performed among software developers to test the tool's ability to help newcomers formulate better plans for open-source issues. Results: SPAC-B achieved an accuracy of 4.60 in relevance, 4.30 in completeness, and 4.28 in correctness on a scale from 1 to 5. It improved the combined mean score of the participants' plans from 1.90 to 2.70, and 8 of the 10 participants found the tool helpful. Conclusions: SPAC-B has demonstrated high accuracy and helpfulness, but further research is needed to confirm whether these results generalize to a larger population and other contexts of use.
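The abstract does not describe SPAC-B's internals, but project-specific question answering of this kind is commonly built as retrieval-augmented prompting over the repository's files. The sketch below shows that general pattern only; the chunking scheme, lexical scoring, and model name are illustrative assumptions, not SPAC-B's actual design.

```python
# Minimal retrieval-augmented QA over a code repository (illustrative sketch,
# not SPAC-B's implementation). Assumes the `openai` package and an API key.
from pathlib import Path
from openai import OpenAI

def load_chunks(repo_root: str, max_chars: int = 2000) -> list[str]:
    """Split every source/text file in the repo into fixed-size chunks."""
    chunks = []
    for path in Path(repo_root).rglob("*"):
        if path.is_file() and path.suffix in {".py", ".md", ".java", ".ts"}:
            text = path.read_text(errors="ignore")
            chunks += [f"[{path}]\n{text[i:i + max_chars]}"
                       for i in range(0, len(text), max_chars)]
    return chunks

def top_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Crude lexical retrieval: rank chunks by word overlap with the question."""
    q = set(question.lower().split())
    return sorted(chunks, key=lambda c: -len(q & set(c.lower().split())))[:k]

def answer(question: str, repo_root: str) -> str:
    context = "\n---\n".join(top_chunks(question, load_chunks(repo_root)))
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",  # model choice is an assumption
        messages=[
            {"role": "system", "content": "Answer using only the project context given."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

# e.g. answer("Where is request authentication handled?", "./my-project")
```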
32. Leveraging Linguistic Insights for Uncertainty Calibration of ChatGPT and Evaluating Crowdsourced Annotations
Venkata Divya Sree Pulipati. 09 July 2024.
The quality of crowdsourced annotations has always been a challenge due to variability in annotators' backgrounds, task complexity, the subjective nature of many labeling tasks, and various other reasons. Hence, it is crucial to evaluate these annotations to ensure their reliability. Traditionally, human experts evaluate the quality of crowdsourced annotations, but this approach has its own challenges. This paper therefore proposes to leverage large language models like ChatGPT-4 to evaluate the existing crowdsourced MAVEN dataset and explores their potential as an alternative solution. However, due to the stochastic nature of LLMs, it is important to discern when to trust and when to question LLM responses. To address this, we introduce a novel approach that applies Rubin's framework for identifying and using linguistic cues within LLM responses as indicators of the model's certainty level. Our findings reveal that ChatGPT-4 successfully identified 63% of the incorrect labels, highlighting the potential for improving data label quality through human-AI collaboration on these identified inaccuracies. This study underscores the promising role of LLMs in evaluating crowdsourced data annotations, offering a way to enhance the accuracy and fairness of crowdsourced annotations while saving time and cost.
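To make the cue-based idea concrete, here is a toy version of scanning an LLM response for hedging versus certainty markers before deciding whether to trust its verdict. The word lists are simplified stand-ins assumed for illustration, not Rubin's actual cue inventory.

```python
# Toy certainty scoring from linguistic cues (word lists are illustrative,
# not Rubin's exact categories).
import re

HEDGES = {"might", "may", "possibly", "perhaps", "likely", "seems", "appears",
          "could", "probably", "unsure", "unclear"}
BOOSTERS = {"definitely", "certainly", "clearly", "undoubtedly", "always",
            "obviously", "must"}

def certainty_label(response: str) -> str:
    """Label an LLM response as 'high', 'low', or 'neutral' certainty."""
    tokens = re.findall(r"[a-z']+", response.lower())
    hedges = sum(t in HEDGES for t in tokens)
    boosters = sum(t in BOOSTERS for t in tokens)
    if hedges > boosters:
        return "low"      # question this verdict, route to a human
    if boosters > hedges:
        return "high"     # more reasonable to trust
    return "neutral"

print(certainty_label("This event is possibly mislabeled; it seems ambiguous."))  # low
print(certainty_label("This label is clearly wrong."))                            # high
```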
33. Analysis of Security Findings and Reduction of False Positives through Large Language Models
Wagner, Jonas. 18 October 2024.
This thesis investigates the integration of State-of-the-Art (SOTA) Large Language Models (LLMs) into the process of reassessing security findings generated by Static Application Security Testing (SAST) tools. The primary objective is to determine whether LLMs are able to detect false positives (FPs) while maintaining a high true positive (TP) rate, thereby enhancing the efficiency and effectiveness of security assessments.

Four consecutive experiments were conducted, each addressing specific research questions. The initial experiment, using a dataset of security findings extracted from the OWASP Benchmark, identified the optimal combination of context items provided by the SAST tool SpotBugs, which, when used with GPT-3.5 Turbo, reduced FPs while minimizing the loss of TPs. The second experiment, conducted on the same dataset, demonstrated that advanced prompting techniques, particularly few-shot Chain-of-Thought (CoT) prompting combined with Self-Consistency (SC), further improved the reassessment process. The third experiment compared proprietary and open-source LLMs on an OWASP Benchmark dataset about one-fourth the size of the previously used dataset. GPT-4o achieved the highest performance, detecting 80 out of 128 FPs without missing any TPs, resulting in a perfect TPR of 100% and a decrease in FPR of 41.27 percentage points. Meanwhile, Llama 3.1 70B detected 112 out of the 128 FPs but missed 10 TPs, resulting in a TPR of 94.94% and a reduction in FPR of 56.62 percentage points. To validate these findings in a real-world context, the approach was applied to a dataset generated from the open-source project Mnestix using multiple SAST tools. GPT-4o again emerged as the top performer, detecting 26 out of 68 FPs while missing only one TP, resulting in a TPR decrease of 2.22 percentage points alongside an FPR decrease of 37.57 percentage points.

Table of Contents:
List of Figures
List of Tables
List of Source Codes
List of Abbreviations
1. Motivation
2. Background
3. Related Work
4. Concept
5. Preparing a Security Findings Dataset
6. Implementing a Workflow
7. Identifying Context Items
8. Comparing Prompting Techniques
9. Comparing Large Language Models
10. Evaluating Developed Approach
11. Discussion
12. Conclusion
A. Appendix: Figures
A.1. Repository Directory Tree
A.2. Precision-Recall Curve of Compared Large Language Models
A.3. Performance Metrics Self-Consistency on Mnestix Dataset
B. Appendix: Tables
B.1. Design Science Research Concept
C. Appendix: Code
C.1. Pydantic Base Config Documentation
C.2. Pydantic LLM Client Config Documentation
C.3. LLM BaseClient Class
C.4. Test Cases Removed From Dataset
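As a rough illustration of the thesis's best-performing prompting setup, the sketch below combines Chain-of-Thought prompting with Self-Consistency: several reasoning paths are sampled at non-zero temperature and the TP/FP verdicts are majority-voted. The prompt wording, verdict format, sample count, and model name are assumptions, not the thesis's exact configuration.

```python
# Self-consistent CoT reassessment of a SAST finding (illustrative sketch).
from collections import Counter
from openai import OpenAI

client = OpenAI()

def reassess_finding(finding: str, context: str, n_samples: int = 5) -> str:
    """Return 'TP' or 'FP' by majority vote over sampled CoT answers."""
    prompt = (
        "You review static-analysis findings.\n"
        f"Finding:\n{finding}\n\nCode context:\n{context}\n\n"
        "Think step by step about whether the flagged flow is actually "
        "exploitable, then end with exactly one line: VERDICT: TP or VERDICT: FP."
    )
    votes = []
    for _ in range(n_samples):  # temperature > 0 so the reasoning paths differ
        resp = client.chat.completions.create(
            model="gpt-4o", temperature=0.8,
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content or ""
        if "VERDICT: FP" in text:
            votes.append("FP")
        elif "VERDICT: TP" in text:
            votes.append("TP")
    # default to TP (keep the finding) when no clear majority emerges,
    # since dropping a real vulnerability is the costlier error
    return Counter(votes).most_common(1)[0][0] if votes else "TP"
```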
34. Preventing Health Data from Leaking in a Machine Learning System: Implementing code analysis with LLM and model privacy evaluation testing
Janryd, Balder; Johansson, Tim. January 2024.
Sensitive data leaking from a system can have tremendous negative consequences, such as discrimination, social stigma, and fraudulent economic consequences for those whose data has been leaked. It is therefore of utmost importance that sensitive data is not leaked from a system. This thesis investigated different methods to prevent sensitive patient data from leaking in a machine learning system. Various methods were investigated and evaluated based on previous research; the methods used in this thesis are a large language model (LLM) for code analysis and a membership inference attack on models to test their privacy level. The code analysis results show that the Llama 3 model had an accuracy of 90% in identifying malicious code that attempts to steal sensitive patient data. The model analysis can evaluate and determine membership inference of sensitive patient data used for training machine learning models, which is essential for assessing the data leakage a model can pose in a machine learning system. Further work on increasing the determinism and improving the formatting of the LLM's responses is needed to ensure the robustness of a security system that utilizes LLMs before it can be deployed in a production environment. Further studies of the model analysis could apply a wider variety of evaluations, such as more machine learning model types and a broader range of attack tests, which can be implemented into machine learning systems.
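The thesis's exact attack implementation is not given in the abstract; a common baseline for membership inference is the loss-threshold attack sketched below on synthetic data (no real patient data), where training-set members are guessed from their lower per-example loss.

```python
# Loss-threshold membership inference attack (baseline sketch on synthetic
# data; the thesis's actual attack and models are not reproduced here).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))               # synthetic stand-in records
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def per_example_loss(model, X, y):
    """Negative log-likelihood of the true label for each example."""
    p = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(p, 1e-12, None))

# Members (training data) tend to have lower loss than non-members.
member_loss = per_example_loss(model, X_train, y_train)
nonmember_loss = per_example_loss(model, X_test, y_test)
threshold = np.median(np.concatenate([member_loss, nonmember_loss]))

# Attack: guess "member" whenever the loss falls below the threshold.
tpr = (member_loss < threshold).mean()
fpr = (nonmember_loss < threshold).mean()
print(f"attack advantage: {tpr - fpr:.3f}")  # near 0 indicates good privacy
```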
35. Large Language Models for application review data: Implementation survey of Large Language Models (LLM) to summarize, extract, and analyze key information from user reviews
von Reybekiel, Algot; Wennström, Emil. January 2024.
Manually reviewing user reviews to extract relevant information can be a time-consuming process. This report investigates whether large language models can be used to summarize, extract, and analyze key information from reviews, and how such an application can be constructed. Different models were found to exhibit varying degrees of performance depending on the metrics and the weighting between recall and precision. Furthermore, fine-tuning of language models such as Llama 3 was found to improve performance in classifying useful reviews and, by some metrics, led to higher performance than larger language models like Chat-Bison. Specifically, for reviews translated into English, Llama 3:8b:Instruct, Chat-Bison, and the fine-tuned Llama 3:8b had F4 macro scores of 0.89, 0.90, and 0.91, respectively. A further finding is that the larger models Chat-Bison, Text-Bison, and Gemini performed better than the smaller models tested when multiple reviews were input at a time for summary text generation. In general, the language models also performed better if reviews were first translated into English before processing rather than processed in their original language, which for the majority of reviews was Swedish. Another insight from the pre-processing phase is that the number of API calls to these language models can be minimized by filtering reviews on word length and rating. Beyond the language models themselves, the results showed that vector databases and embeddings can provide a greater overview of useful reviews by leveraging the databases' built-in ability to identify semantic similarities and cluster similar reviews together.
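The F4 macro score reported here is the macro-averaged F-beta measure with beta = 4, which treats recall as four times as important as precision (F_beta = (1 + beta^2)·P·R / (beta^2·P + R)). A minimal way to compute it is shown below; the toy labels are illustrative, not the report's data.

```python
# F4 macro score: F-beta with beta=4, averaged over classes.
# With beta=4, recall dominates the score, matching the report's weighting.
from sklearn.metrics import fbeta_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # toy labels: 1 = useful review
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # toy classifier output

print(fbeta_score(y_true, y_pred, beta=4, average="macro"))
```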
36. A Method for Automated Assessment of Large Language Model Chatbots: Exploring LLM-as-a-Judge in Educational Question-Answering Tasks
Duan, Yuyao; Lundborg, Vilgot. January 2024.
This study introduces an automated evaluation method for large language model (LLM) based chatbots in educational settings, utilizing LLM-as-a-Judge to assess their performance. Our results demonstrate the efficacy of this approach in evaluating the accuracy of three LLM-based chatbots (Llama 3 70B, ChatGPT 4, Gemini Advanced) across two subjects: history and biology. The analysis reveals promising performance across both subjects. On a scale from 1 to 5 for correctness, the LLM judge's average scores when evaluating each chatbot on history-related questions are 3.92 (Llama 3 70B), 4.20 (ChatGPT 4), and 4.51 (Gemini Advanced); for biology-related questions, the average scores are 4.04 (Llama 3 70B), 4.28 (ChatGPT 4), and 4.09 (Gemini Advanced). This underscores the potential of the LLM-as-a-Judge strategy for evaluating the correctness of responses from other LLMs.
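A minimal version of the judging step might look like the sketch below: the judge model is asked to grade a chatbot answer against a reference on the same 1-to-5 scale. The rubric wording and model choice are assumptions; the study's actual prompt is not given in the abstract.

```python
# LLM-as-a-Judge: score a chatbot answer against a reference on a 1-5 scale
# (illustrative sketch; assumes the `openai` package and an API key).
import re
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",  # the judge model; an assumption here
        temperature=0,   # keep the judge as deterministic as possible
        messages=[{"role": "user", "content": (
            "Rate the ANSWER for factual correctness against the REFERENCE "
            "on a scale of 1 (wrong) to 5 (fully correct). "
            "Reply with the number only.\n"
            f"QUESTION: {question}\nREFERENCE: {reference}\nANSWER: {answer}"
        )}],
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content or "")
    if not match:
        raise ValueError("judge returned no score")
    return int(match.group())

# Averaging judge(q, ref, a) over a question set yields per-chatbot means
# like the 3.92-4.51 range reported above.
```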
37. Generative AI Assistant for Public Transport Using Scheduled and Real-Time Data
Karlstrand, Jakob; Nielsen, Axel. January 2024.
This thesis presents the design and implementation of a generative Artificial Intelligence (AI)-based decision-support interface for the domain of public transport, leveraging offline and logged data from both past records and real-time updates. The AI assistant was developed using pre-trained Large Language Models (LLMs) together with Retrieval Augmented Generation (RAG) and the Function Calling Application Programming Interface (API) provided by OpenAI to automate the process of adding knowledge to the LLM. Challenges such as formatting and restructuring of data, data-retrieval methodologies, accuracy, and latency were considered. The result is an AI assistant that can hold a conversation with users and answer questions regarding departures, arrivals, specific vehicle trips, and other questions relevant to the domain of the dataset. The system also provides client-side actions that integrate with the user interface, enabling interactive elements such as clickable links that trigger relevant actions based on the content provided. Different LLMs, including GPT-3.5 and GPT-4 at different temperatures, were compared and evaluated against a pre-defined set of questions paired with respective ground truths. By adopting a conversational approach, the project aims to streamline information extraction from extensive datasets, offering a more flexible and feedback-oriented alternative to manual search and filtering processes. In this way, traffic managers can adapt and operate more efficiently; they also remain informed about small disturbances and can act on them faster. The project was conducted at Gaia Systems AB, Norrköping, Sweden, and primarily aims to enhance the workflow of traffic managers using Gaia's existing software for public transport management within Östgötatrafiken.
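A compressed sketch of the Function Calling pattern described here follows OpenAI's chat-completions tools API: the model is given a tool schema, decides when to call it, and the application executes the call and feeds the result back. The `get_departures` function, its fields, and the model name are hypothetical stand-ins for the thesis's real data backend.

```python
# Function Calling for a transit assistant (illustrative; `get_departures`
# and its fields are hypothetical, not Gaia's actual backend).
import json
from openai import OpenAI

client = OpenAI()

def get_departures(stop: str) -> list[dict]:
    # Stand-in for a lookup against scheduled + real-time data.
    return [{"line": "13", "departure": "14:32", "delay_min": 2}]

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_departures",
        "description": "List upcoming departures from a public-transport stop.",
        "parameters": {
            "type": "object",
            "properties": {"stop": {"type": "string"}},
            "required": ["stop"],
        },
    },
}]

def ask(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    msg = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=TOOLS
    ).choices[0].message
    if msg.tool_calls:  # the model chose to call our function
        call = msg.tool_calls[0]
        result = get_departures(**json.loads(call.function.arguments))
        messages += [msg, {"role": "tool", "tool_call_id": call.id,
                           "content": json.dumps(result)}]
        msg = client.chat.completions.create(
            model="gpt-4o", messages=messages
        ).choices[0].message
    return msg.content

# e.g. ask("When does the next bus leave from Norrköping resecentrum?")
```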
38. Prompt engineering and its usability to improve modern psychology chatbots
Nordgren, Isak; E. Svensson, Gustaf. January 2023.
As advancements in chatbots and Large Language Models (LLMs) such as GPT-3.5 and GPT-4 continue, their applications in diverse fields, including psychology, expand. This study investigates the effectiveness of LLMs optimized through prompt engineering, aiming to enhance their performance in psychological applications. To this end, two distinct versions of a GPT-3.5-based chatbot were developed: a version similar to the base model, and a version equipped with a more extensive system prompt detailing expected behavior. A panel of professional psychologists evaluated these models on a predetermined set of questions, providing insight into their potential future use as psychological tools. Our results indicate that an overly prescriptive system prompt can unintentionally limit the versatility of the chatbot, making a careful balance in instruction specificity essential. Furthermore, while our study suggests that current LLMs such as GPT-3.5 cannot fully replace human psychologists, they can provide valuable assistance in tasks such as basic question answering, consolation and validation, and triage. These findings provide a foundation for future research into the effective integration of LLMs in psychology and contribute valuable insights to the promising field of AI-assisted psychological services.
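The study's two chatbot variants differed only in their system prompt; a stripped-down sketch of that comparison is shown below. Both prompt texts are illustrative placeholders, not the study's actual prompts.

```python
# Comparing a near-base system prompt with a heavily specified one
# (prompt texts are illustrative, not the study's actual prompts).
from openai import OpenAI

client = OpenAI()

MINIMAL_PROMPT = "You are a supportive mental-health assistant."

EXTENSIVE_PROMPT = (
    "You are a supportive mental-health assistant. Always begin by validating "
    "the user's feelings. Never give diagnoses. Keep answers under 80 words. "
    "Always end by suggesting professional help."
    # Over-prescriptive rules like these are the kind the study found can
    # unintentionally limit the chatbot's versatility.
)

def reply(system_prompt: str, user_message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # matches the GPT-3.5 base used in the study
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_message}],
    )
    return resp.choices[0].message.content

question = "I feel anxious before exams. What can I do?"
print(reply(MINIMAL_PROMPT, question))
print(reply(EXTENSIVE_PROMPT, question))
```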
39. Towards Manipulator Task-Oriented Programming: Automating Behavior-Tree Configuration
Yue Cao. 08 July 2024.
Task-oriented programming is a way of programming manipulators in terms of high-level tasks instead of explicit motions. It has been a long-standing vision in robotics since the field's early days. Despite its potential, several challenges have hindered its full realization. This thesis identifies three major challenges, particularly in task specification and the planning-to-execution transition: 1) the absence of natural-language integration in system input; 2) the dilemma of continuously developing non-uniform, domain-specific primitive-task libraries; and 3) the requirement for extensive human intervention.

To overcome these difficulties, this thesis introduces a novel approach that integrates natural-language inputs, eliminates the dependence on fixed primitive-task libraries, and minimizes human intervention. It adopts the behavior tree, a modular and user-friendly form, as the task representation and advances its usage in task specification and the planning-to-execution transition. The thesis is structured into two parts: Task Specification and Planning-to-Execution Transition.

Task Specification explores the use of large language models to generate a behavior tree from an end-user's input. A Phase-Step prompt is designed to enable automatic behavior-tree generation from the end-user's abstract task descriptions in natural language. Thanks to the powerful generalizability of large language models, it avoids the dilemma posed by fixed primitive-task libraries in task generation. A full-process case study demonstrates the proposed approach, and an ablation study evaluates the effectiveness of the Phase-Step prompts. Task Specification also proposes behavior-tree embeddings to facilitate retrieval-augmented generation of behavior trees. The embeddings not only eliminate the need for manual prompt configuration but also provide a way to incorporate external domain knowledge into the generation process; three types of evaluations assess the performance of the behavior-tree embedding method.

Planning-to-Execution Transition explores how to turn primitive tasks from task specification into manipulator executions. Two types of primitive tasks are considered separately: point-to-point movement tasks and object-interaction tasks. For point-to-point movement tasks, a behavior-tree reward is proposed to enable reinforcement learning over low-level movement while following the high-level running order of the behavior tree; end-users only need to specify rewards on the primitive tasks over the behavior tree, and the rest of the process is handled automatically. A 2D-space movement simulation justifies the approach. For object-interaction tasks, the transition uses a large-language-model-based generation approach that takes natural-language-described primitive tasks as input and directly produces task-frame-formalism set-points. Combined with hybrid position/force control systems, a transition from primitive tasks directly into joint-level execution can be realized; evaluations over a set of 30 primitive tasks were conducted.

Overall, this thesis advances behavior trees toward automated task specification and planning-to-execution transitions, opening new possibilities for building better task-oriented manipulator programming systems.
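To make the behavior-tree representation concrete, here is a minimal tick-based interpreter for the sequence, fallback, and action node types such trees are usually built from. The node vocabulary and the example tree are illustrative assumptions, not the thesis's generated output.

```python
# Minimal behavior-tree interpreter (illustrative; the example tree is an
# assumption, not a tree generated by the thesis's Phase-Step prompt).
from typing import Callable

SUCCESS, FAILURE = "success", "failure"

class Action:
    """Leaf node wrapping a primitive task; succeeds when its function does."""
    def __init__(self, name: str, fn: Callable[[], bool]):
        self.name, self.fn = name, fn
    def tick(self) -> str:
        return SUCCESS if self.fn() else FAILURE

class Sequence:
    """Succeeds only if every child succeeds, in order."""
    def __init__(self, *children):
        self.children = children
    def tick(self) -> str:
        for child in self.children:
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS

class Fallback:
    """Tries children in order until one succeeds."""
    def __init__(self, *children):
        self.children = children
    def tick(self) -> str:
        for child in self.children:
            if child.tick() == SUCCESS:
                return SUCCESS
        return FAILURE

# A tree an LLM might emit for "place the cup on the shelf":
tree = Sequence(
    Fallback(Action("cup_in_gripper", lambda: False),
             Action("pick_cup", lambda: True)),
    Action("move_to_shelf", lambda: True),
    Action("place_cup", lambda: True),
)
print(tree.tick())  # success
```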
40. Use and acceptance of AI tools in the education sector: Experiences of teachers and researchers using Microsoft 365 Copilot in their professional role
Moyo, Hannah; Nordén, Linnea. January 2024.
Through the development of AI, a paradigm shift is beginning within organizations as employees use AI tools to optimize their work performance. AI tools can also benefit academic roles in the education sector, such as teachers and researchers. However, it is unclear what support these AI tools can offer such professional roles. Since these roles demand high quality and attention to ethical aspects, significant demands are placed on the AI tool's capabilities. This study aims to provide a deeper understanding of the acceptance of the AI tool Microsoft 365 Copilot within the education sector from the perspectives of teachers and researchers. To examine acceptance, the study is based on the Technology Acceptance Model (TAM). Through semi-structured interviews and unstructured observations, insights were gained into teachers' and researchers' experiences with the AI tool and the opportunities and limitations they identified in using it within their professional roles. Our conclusion is that the AI tool was not perceived to maintain a level on par with the users themselves or with similar AI tools. Furthermore, teachers and researchers need support and training in using AI tools, both regarding the tool's functionality and regarding guidelines for information security.