
Characterizing, classifying and transforming language model distributions

Kniele, Annika January 2023 (has links)
Large Language Models (LLMs) have become ever larger in recent years, typically demonstrating improved performance as the number of parameters increases. This thesis investigates how the probability distributions output by language models differ depending on the size of the model. For this purpose, three features for capturing the differences between the distributions are defined, namely the difference in entropy, the difference in probability mass in different slices of the distribution, and the difference in the number of tokens covering the top-p probability mass. The distributions are then put into different distribution classes based on how they differ from the distributions of the differently-sized model. Finally, the distributions are transformed to be more similar to the distributions of the other model. The results suggest that classifying distributions before transforming them, and adapting the transformations based on which class a distribution is in, improves the transformation results. It is also shown that letting a classifier choose the class label for each distribution yields better results than using random labels. Furthermore, the findings indicate that transforming the distributions using entropy and the number of tokens in the top-p probability mass makes the distributions more similar to the targets, while transforming them based on the probability mass of individual slices of the distributions makes the distributions more dissimilar.
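To make the three features concrete, here is a minimal sketch of how they might be computed for a pair of next-token distributions. The slicing scheme, the top-p threshold, and the toy softmax distributions are illustrative assumptions, not the thesis's exact definitions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability distribution (in nats)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def top_p_count(p, top_p=0.9):
    """Number of tokens needed to cover the top-p probability mass."""
    sorted_p = np.sort(p)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_p), top_p) + 1)

def slice_masses(p, n_slices=4):
    """Probability mass in equal-width rank slices of the sorted distribution."""
    sorted_p = np.sort(p)[::-1]
    return [float(s.sum()) for s in np.array_split(sorted_p, n_slices)]

# Toy next-token distributions standing in for a small and a large model
# (softmax over random logits; sharper logits give a more peaked distribution).
rng = np.random.default_rng(0)
p_small = np.exp(rng.normal(size=50_000)); p_small /= p_small.sum()
p_large = np.exp(2 * rng.normal(size=50_000)); p_large /= p_large.sum()

print("entropy difference:", entropy(p_small) - entropy(p_large))
print("top-p count difference:", top_p_count(p_small) - top_p_count(p_large))
print("slice mass differences:",
      [a - b for a, b in zip(slice_masses(p_small), slice_masses(p_large))])
```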

Self-Reflection on Chain-of-Thought Reasoning in Large Language Models / Självreflektion över Chain-of-Thought-resonerande i stora språkmodeller

Praas, Robert January 2023 (has links)
A strong capability of large language models is Chain-of-Thought reasoning. Prompting a model to ‘think step-by-step’ has led to great performance improvements on problems such as planning and question answering, and the extended output provides some evidence about the rationale behind an answer or decision. In search of better, more robust, and more interpretable language model behavior, this work investigates self-reflection in large language models. Here, self-reflection consists of feedback from large language models on medical question-answering, and the question is whether that feedback can be used to accurately distinguish between correct and incorrect answers. GPT-3.5-Turbo and GPT-4 provide zero-shot feedback scores for Chain-of-Thought reasoning on the MedQA (medical question-answering) dataset. The question-answering is evaluated on traits such as being structured, relevant, and consistent. We test whether the feedback scores differ between questions that were answered correctly and those that were answered incorrectly by Chain-of-Thought reasoning. The potential differences in feedback scores are tested statistically with the Mann-Whitney U test. Graphical visualization and logistic regressions are used to preliminarily determine whether the feedback scores are indicative of whether the Chain-of-Thought reasoning leads to the right answer. The results indicate that, across the reasoning objectives, the feedback models assign higher feedback scores to questions that were answered correctly than to those that were answered incorrectly. Graphical visualization shows potential for reviewing questions with low feedback scores, although logistic regressions that aimed to predict whether questions were answered correctly mostly defaulted to the majority class. Nonetheless, there seems to be a possibility for more robust output from self-reflecting language systems.
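As an illustration of the statistical comparison described above, here is a minimal sketch that applies the Mann-Whitney U test to feedback scores grouped by answer correctness; the scores and the 1-10 scale are invented for the example.

```python
from scipy.stats import mannwhitneyu

# Hypothetical zero-shot feedback scores (1-10) assigned by a feedback model
# to Chain-of-Thought answers, grouped by whether the answer was correct.
scores_correct = [8, 9, 7, 8, 9, 6, 8, 7, 9, 8]
scores_incorrect = [6, 7, 5, 8, 6, 7, 5, 6, 7, 6]

# Two-sided test: do the two groups come from different score distributions?
stat, p_value = mannwhitneyu(scores_correct, scores_incorrect,
                             alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
```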

On Semantic Cognition, Inductive Generalization, and Language Models

Kanishka Misra (9708551) 05 September 2023 (has links)
<p dir="ltr">Our ability to understand language and perform reasoning crucially relies on a robust system of semantic cognition (G. L. Murphy, 2002; Rogers & McClelland, 2004; Rips et al., 2012; Lake & Murphy, 2021): processes that allow us to learn, update, and produce inferences about everyday concepts (e.g., cat, chair), properties (e.g., has fur, can be sat on), categories (e.g., mammals, furniture), and relations (e.g., is-a, taller-than). Meanwhile, recent progress in the field of natural language processing (NLP) has led to the development of language models (LMs): sophisticated neural networks that are trained to predict words in context (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), and as a result build representations that encode the knowledge present in the statistics of their training environment. These models have achieved impressive levels of performance on a range of tasks that require sophisticated semantic knowledge (e.g. question answering and natural language inference), often even reaching human parity. To what extent do LMs capture the nuances of human conceptual knowledge and reasoning? Centering around this broad question, this dissertation uses core ideas in human semantic cognition as guiding principles and lays down the groundwork to establish effective evaluation and improvement of conceptual understanding in LMs. In particular, I build on prior work that focuses on characterizing what semantic knowledge is made available in the behavior and representations of LMs, and extend it by additionally proposing tests that focus on functional consequences of acquiring basic semantic knowledge.<br><br>I primarily focus on inductive generalization (Hayes & Heit, 2018)—the unique ability of humans to rely on acquired conceptual knowledge to project or generalize novel information—as a context within which we can analyze LMs’ encoding of conceptual knowledge. I do this, since the literature surrounding inductive generalization contains a variety of empirical regularities that map to specific conceptual abstractions and shed light on how humans store, organize and use conceptual knowledge. Before explicitly analyzing LMs for these empirical regularities, I test them on two other contexts, which also feature the role of inductive generalization. First I test the extent to which LMs demonstrate typicality effects—a robust finding in human categorization literature where certain members of a category are considered to be more central to the category than are others. Specifically, I test the behavior 19 different LMs on two contexts where typicality effects modulate human behavior: 1) verification of sentences expressing taxonomic category membership, and 2) projecting novel properties from individual category members to the entire category. In both tests, LMs achieved positive but modest correlations with human typicality ratings, suggesting that they can to a non-trivial extent capture subtle differences between category members. Next, I propose a new benchmark to test the robustness of LMs in attributing properties to everyday concepts, and in making inductive leaps to endow properties to novel concepts. On testing 31 different LMs for these capacities, I find that while they can correctly attribute properties to everyday concepts and even predict the properties of novel concepts in simple settings, they struggle to do so robustly. 
Combined with the analyses of typicality effects, these results suggest that the ability of LMs to demonstrate impressive conceptual knowledge and reasoning behavior can be explained by their sensitivity to shallow predictive cues. When these cues are carefully controlled for, LMs show critical failures in demonstrating robust conceptual understanding. Finally, I develop a framework for characterizing the extent to which the distributed representations learned by LMs can encode the principles and abstractions that characterize inductive behavior in humans. This framework operationalizes inductive generalization as the behavior of an LM after its representations have been partially exposed (via gradient-based learning) to novel conceptual information. To simulate this behavior, the framework uses LMs that are endowed with human-elicited property knowledge, by training them to evaluate the truth of sentences attributing properties to concepts. I apply this framework to test four different LMs on 13 different inductive phenomena documented for humans (Osherson et al., 1990; Heit & Rubinstein, 1994). Results from these analyses suggest that building representations from word distributions can successfully encode many abstract principles that guide inductive behavior in the models—principles such as sensitivity to conceptual similarity, hierarchical organization of categories, and reasoning about category coverage and sample size. At the same time, the tested models also systematically failed to demonstrate certain phenomena, showcasing an inability to perform pragmatic reasoning, a preference for shallow statistical cues, and a lack of context sensitivity with respect to high-level intuitive theories.
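As a rough sketch of the sentence-verification setup described above, the following scores taxonomic sentences with an off-the-shelf causal LM (GPT-2, as a stand-in for the 19 models tested) and correlates the scores with typicality ratings; the category members and ratings here are invented for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.stats import spearmanr

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_logprob(sentence):
    """Total log-probability the LM assigns to a sentence."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over predicted tokens.
    return -out.loss.item() * (ids.shape[1] - 1)

# Taxonomic sentence verification for category members of varying typicality
# (ratings are invented stand-ins for human norms).
members = ["robin", "sparrow", "chicken", "penguin"]
human_typicality = [6.9, 6.5, 4.8, 3.3]
lm_scores = [sentence_logprob(f"A {m} is a bird.") for m in members]

rho, p = spearmanr(lm_scores, human_typicality)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```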

An Empirical Study on Using Codex for Automated Program Repair

Zhao, Pengyu January 2023 (has links)
This thesis explores the potential of Codex, a pre-trained Large Language Model (LLM), for Automated Program Repair (APR) by assessing its performance on the Defects4J benchmark, which includes real-world Java bugs. The study aims to provide a comprehensive understanding of Codex’s capabilities and limitations in generating syntactically and semantically equivalent patches for defects, as well as to evaluate its ability to handle defects of differing importance and complexity. Additionally, we compare the performance of Codex with that of other LLMs in the APR domain. To achieve these objectives, we employ a systematic methodology that includes prompt engineering, Codex parameter adjustment, code extraction, patch verification, and Abstract Syntax Tree (AST) comparison. We successfully verified 528 bugs in Defects4J, the highest number among comparable studies, and achieved 53.98% plausible and 26.52% correct patches. Furthermore, we introduce the elle-elle-aime framework, which extends the RepairThemAll framework for Codex-based APR and is adaptable to evaluating other LLMs, such as ChatGPT and GPT-4. The findings of this empirical study provide valuable insights into the factors that impact Codex’s performance on APR, helping to create new prompt strategies and techniques that improve research productivity.
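A minimal sketch of the generate-extract-verify loop such a pipeline implies is shown below. Codex has since been retired, so a chat model stands in; the prompt, model choice, and buggy method are assumptions for illustration, and the real pipeline would add compilation, Defects4J test execution (plausibility), and AST comparison against the developer fix (correctness).

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BUGGY_METHOD = """public static int mid(int a, int b) {
    return a + b / 2;  // bug: missing parentheses around (a + b)
}"""

def generate_patches(buggy_code, n=5, temperature=0.8):
    """Sample candidate fixes from the model and extract the code blocks."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in; the thesis used the retired Codex
        messages=[{"role": "user",
                   "content": "Fix the bug in this Java method. Reply with "
                              "only the fixed method in a code block.\n\n"
                              f"```java\n{buggy_code}\n```"}],
        n=n, temperature=temperature,
    )
    patches = []
    for choice in resp.choices:
        m = re.search(r"```(?:java)?\n(.*?)```", choice.message.content, re.S)
        if m:
            patches.append(m.group(1).strip())
    return patches

for i, patch in enumerate(generate_patches(BUGGY_METHOD)):
    # A real APR pipeline would now compile the patch, run the Defects4J
    # test suite, and compare ASTs against the developer-written fix.
    print(f"--- candidate {i} ---\n{patch}\n")
```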

ChatGPT’s Performance on the Brief Electricity and Magnetism Assessment

Melin, Jakob, Önerud, Elias January 2024 (has links)
In this study, we tested the performance of ChatGPT-4 on the concept inventory Brief Electricity and Magnetism Assessment (BEMA) to understand its potential as an educational tool in physics, especially in tasks requiring visual interpretation. Our results indicate that ChatGPT-4 performs similarly to undergraduate students in introductory electromagnetism courses, with an average score close to that of the students. However, ChatGPT-4 displayed significant differences compared to students, particularly in tasks involving complex visual elements such as electrical circuits and magnetic field diagrams. While ChatGPT-4 was proficient in proposing correct physical reasoning, it struggled with accurately interpreting visual information. These findings suggest that while ChatGPT-4 can be a useful supplementary tool for students, it should not be relied upon as a primary tutor for subjects heavily dependent on visual interpretation. Instead, it could be more effective as a peer, where its outputs are critically evaluated by students. Further research should focus on improving ChatGPT’s visual processing capabilities and exploring its role in diverse educational contexts.
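The abstract does not give the study's exact protocol; as a hedged sketch, here is how a BEMA-style item with a diagram might be posed to a vision-capable model. The item text, file name, and model choice are all assumptions.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Hypothetical concept-inventory item: a circuit diagram plus a
# multiple-choice question that depends on reading the diagram.
with open("circuit_diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # stand-in for the GPT-4 variant used in the study
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "The diagram shows two identical bulbs A and B in series "
                     "with a battery. If bulb B is short-circuited, bulb A's "
                     "brightness (a) increases (b) decreases (c) is unchanged. "
                     "Answer with the letter and a one-sentence justification."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```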

Advancing Policy Insights: Opinion Data Analysis and Discourse Structuring Using LLMs

Bhatia, Aaditya 01 January 2024 (has links) (PDF)
The growing volume of opinion data presents a significant challenge for policymakers striving to distill public sentiment into actionable decisions. This study aims to explore the capability of large language models (LLMs) to synthesize public opinion data into coherent policy recommendations. We specifically leverage Mistral 7B and Mixtral 8x7B models for text generation and have developed an architecture to process vast amounts of unstructured information, integrate diverse viewpoints, and extract actionable insights aligned with public opinion. Using a retrospective data analysis of the Polis platform debates published by the Computational Democracy Project, this study examines multiple datasets that span local and national issues with 1600 statements posted and voted upon by over 3400 participants. Through content moderation, topic modeling, semantic structure extraction, insight generation, and argument mapping, we dissect and interpret the comments, leveraging voting data and LLMs for both quantitative and qualitative insights. A key contribution of this thesis is demonstrating how LLM reasoning techniques can enhance content moderation. Our content moderation approach shows performance improvements using comment deconstruction in multi-class classification, underscoring the trade-offs between moderation strategies and emphasizing a balance between precision and cautious moderation. Using comment clustering, we establish a hierarchy of semantically linked topics, facilitating an understanding of thematic structures and the generation of actionable insights. The generated argument maps visually represent the relationships between topics and insights, and highlight popular opinions. Future work will leverage advanced semantic extraction and reasoning techniques to enhance insight generation further. We also plan to generalize our techniques to other major discussion platforms, including Kialo. Our work contributes to the understanding of using LLMs for policymaking and offers a novel approach to structuring complex debates and translating public opinion into actionable policy insights.
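As an illustration of the comment-clustering step, the following sketch embeds Polis-style statements and cuts an agglomerative dendrogram into topic groups; the statements, embedding model, and distance threshold are invented for the example and are not the thesis's exact pipeline.

```python
from sentence_transformers import SentenceTransformer
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical Polis-style statements posted and voted on by participants.
comments = [
    "The city should add protected bike lanes on main streets.",
    "Cycling infrastructure makes streets safer for everyone.",
    "Public transit should run more frequently at night.",
    "Late-night bus service would help shift workers get home.",
    "Property taxes are already too high to fund new projects.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(comments)

# Agglomerative clustering yields a hierarchy of semantically linked topics;
# cutting the dendrogram at a distance threshold gives flat topic groups.
tree = linkage(embeddings, method="ward")
labels = fcluster(tree, t=1.2, criterion="distance")

for label, comment in sorted(zip(labels, comments)):
    print(label, comment)
# Each cluster could then be summarized by an LLM (e.g., Mixtral 8x7B)
# into a named topic and a candidate policy insight.
```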

I cast control chain of thought : A prompt introduction of roleplay to AI

Carlander, Deborah January 2024 (has links)
Teaching AI to play Tabletop Roleplaying Games (TRPGs) is a difficult challenge due to their negotiable rules and open-ended nature. This is further exacerbated by the act of roleplaying, where players do not play or act as themselves, but as a character in a fantasy setting. Previous studies attempting to teach LLMs to play TRPGs do not explicitly discuss role-play in their work, highlighting the absence of a definition in current research. This thesis endeavours to introduce role-playing to AI by developing a prompting method called Control Chain of Thoughts, aimed at teaching it the Dungeons and Dragons alignment system. The prompting method is evaluated through an ablation study in which GPT-3.5 is tasked with guessing the alignment of characters based on extracts from D&D gaming sessions. The results indicate a small improvement in GPT’s predictions. Further work is needed to evaluate whether its alignments help LLMs understand roleplaying.
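The Control Chain of Thoughts method itself is not specified in the abstract; the sketch below only illustrates the underlying ablation task, prompting GPT-3.5 to infer a character's alignment from a session extract with generic step-by-step scaffolding. The extract and prompt wording are invented.

```python
from openai import OpenAI

client = OpenAI()

ALIGNMENTS = ["lawful good", "neutral good", "chaotic good",
              "lawful neutral", "true neutral", "chaotic neutral",
              "lawful evil", "neutral evil", "chaotic evil"]

# Invented session extract; a real evaluation would use D&D transcript data.
extract = ("The rogue pockets the merchant's coin purse mid-conversation, "
           "then donates half of it to the orphanage on the way out of town.")

prompt = (
    "You are analysing a Dungeons & Dragons session.\n"
    f"Extract: {extract}\n\n"
    "Think step by step: first describe the character's attitude to rules "
    "(lawful/neutral/chaotic), then their moral stance (good/neutral/evil), "
    f"then answer with exactly one of: {', '.join(ALIGNMENTS)}."
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic guesses make the ablation easier to score
)
print(resp.choices[0].message.content)
```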

Leveraging Generative AI in Enterprise Settings : A Case Study-Based Framework / Generativ AI i företagsmiljöer : ett fallstudiebaserat ramverk

Ageling, Lisette Elisabet, Nilsson, Elliot January 2024 (has links)
The emergence of Generative AI (GenAI) foundation models presents transformative potential across industries, promising not only to increase productivity but also to pioneer new ways of working and introduce novel business models. Despite this, GenAI adoption levels have lagged behind early projections, and many firms report difficulties in finding appropriate applications. One such firm is Scandic Hotels, a Swedish hospitality company seeking to identify use cases for GenAI within the Scandic Data Platform (SDP), the firm’s analytics unit. The goals of this study were twofold: first, to identify GenAI use cases for the SDP based on its organizational needs, and second, to create a framework to guide organizations in harnessing the technology’s potential purposefully within their specific organizational contexts. A conceptual framework was developed from a synthesis of existing AI use case frameworks and the incorporation of GenAI characteristics to guide the investigation of the SDP. A qualitative case study approach was employed, achieving the first research goal through two primary activities: first, assessing the organizational context through nine interviews and a questionnaire, and subsequently identifying, in collaborative workshops, concrete use cases designed to address organizational challenges grounded in the domain mapping. The investigation into the organizational context culminated in the formulation of a complex problem space with eleven logically interconnected domain problems stemming from two root causes: the high technological complexity of the data platform and a lack of organizational ownership of data. These problems lead the SDP to be occasionally overwhelmed with support requests, resulting in a range of time-consuming downstream issues that lock the team into reactive rather than proactive work. The use case identification process yielded eleven concrete use cases leveraging a range of GenAI technologies, including retrieval-augmented generation, fine-tuning, and prompt chaining. An evaluation based on the perceived business value of these use cases found that those directly addressing root problems or contributing to strategic imperatives received the highest value scores from members of the SDP. Our findings reinforce the problem-driven use case identification approach suggested by previous AI use case literature and nuance it by underscoring the importance of basing use cases on a structured hierarchical problem space, which allows use cases to be designed to address root problems and break negative feedback loops for maximal business value. By iterating the literature-informed conceptual framework with these practical insights, a novel framework for GenAI use case formulation was developed, centered on matching root domain problems with GenAI-specific capabilities. This framework provides an overview of key components for identifying use cases based on an organization’s unique context, offering a starting point for managers wishing to engage in GenAI adoption and addressing the literature gap in GenAI-specific use case exploration frameworks.

Large Language Models for Unit Test Generation in React Native TypeScript Components

Borgström, Erik, Bergvall, Robin January 2024 (has links)
Advancements within Large Language Models (LLMs) have opened a world of opportunities within the software development domain. This thesis, through a controlled experiment, investigates how LLMs can be utilized within software testing, more specifically unit testing. The controlled experiment was performed using a Python script interfacing with the gpt-3.5-turbo model to automatically generate unit tests for React Native components written in TypeScript. The pipeline performs the calls to the OpenAI Application Programming Interface (API) iteratively. To evaluate code coverage, the unit tests were executed with Jest. Additionally, failing tests, both compilable and non-compilable, were executed manually, and the different kinds of errors and their frequencies were documented. The experiment shows that LLMs can be used to generate comprehensive and accurate unit tests, with high potential for future improvement. While the number of generated tests that compiled was low, their quality was often good, with failures caused by easily correctable syntax errors, faulty imports, or missing dependencies. The errors found were in large part due to project configuration, while others would probably be less frequent with more extensive prompt engineering or a newer model. The experiment also shows that the temperature affected the outcome and that the types of errors differed between compiling and non-compiling tests. A lower temperature parameter to the OpenAI API generally achieved better results, whilst a higher temperature showed greater coverage for compiling but failing tests. This thesis also shows that opportunities for future improvement are widely available: through better project configuration, newer models, and better prompting, better results are to be expected. With further development, the script could be turned into a working product, making software testing faster and more efficient, saving both time and money while simultaneously improving test case quality.
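A minimal sketch of one iteration of such a generation pipeline is shown below; the component path, prompt wording, and extraction regex are assumptions for illustration, not the thesis's actual script.

```python
import re
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Hypothetical component under test.
component_source = Path("components/GreetingButton.tsx").read_text()

prompt = (
    "Write a Jest unit test with @testing-library/react-native for the "
    "following React Native TypeScript component. Reply with only the test "
    "file in a code block.\n\n"
    f"```tsx\n{component_source}\n```"
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,  # lower temperatures generally gave better results
)

match = re.search(r"```(?:tsx|typescript)?\n(.*?)```",
                  resp.choices[0].message.content, re.S)
if match:
    Path("__tests__").mkdir(exist_ok=True)
    Path("__tests__/GreetingButton.test.tsx").write_text(match.group(1))
    # Next steps in the pipeline: run `jest --coverage`, record compile and
    # pass/fail outcomes, and re-prompt on easily correctable syntax errors.
```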

Comparative Analysis of ChatGPT-4 and Gemini Advanced in Erroneous Code Detection and Correction

Sun, Erik Wen Han, Grace, Yasine January 2024 (has links)
This thesis investigates the capabilities of two advanced Large Language Models (LLMs), OpenAI’s ChatGPT-4 and Google’s Gemini Advanced, in the domain of software engineering. While LLMs are widely utilized across various applications, including text summarization and synthesis, their potential for detecting and correcting programming errors has not been thoroughly explored. This study aims to fill this gap by conducting a comprehensive literature search and an experimental comparison of ChatGPT-4 and Gemini Advanced using the QuixBugs and LeetCode benchmark datasets, with a specific focus on the Python and Java programming languages. The research evaluates the models’ abilities to detect and correct bugs using metrics such as Accuracy, Recall, Precision, and F1-score. Experimental results show that ChatGPT-4 consistently outperforms Gemini Advanced in both the detection and correction of bugs. These findings provide valuable insights that could guide further research in the field of LLMs.
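As a small illustration of the evaluation metrics named above, the following computes Accuracy, Precision, Recall, and F1 for hypothetical bug-detection labels on QuixBugs-style programs; the label vectors are invented for the example.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical detection results: 1 = program contains a bug, 0 = correct.
ground_truth = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
model_says   = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

print("accuracy: ", accuracy_score(ground_truth, model_says))
print("precision:", precision_score(ground_truth, model_says))
print("recall:   ", recall_score(ground_truth, model_says))
print("f1:       ", f1_score(ground_truth, model_says))
```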
