Characterizing, classifying and transforming language model distributions

Kniele, Annika January 2023
Large Language Models (LLMs) have become ever larger in recent years, typically demonstrating improved performance as the number of parameters increases. This thesis investigates how the probability distributions output by language models differ depending on the size of the model. For this purpose, three features for capturing the differences between the distributions are defined, namely the difference in entropy, the difference in probability mass in different slices of the distribution, and the difference in the number of tokens covering the top-p probability mass. The distributions are then put into different distribution classes based on how they differ from the distributions of the differently-sized model. Finally, the distributions are transformed to be more similar to the distributions of the other model. The results suggest that classifying distributions before transforming them, and adapting the transformations based on which class a distribution is in, improves the transformation results. It is also shown that letting a classifier choose the class label for each distribution yields better results than using random labels. Furthermore, the findings indicate that transforming the distributions using entropy and the number of tokens in the top-p probability mass makes the distributions more similar to the targets, while transforming them based on the probability mass of individual slices of the distributions makes the distributions more dissimilar.
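The three features named in the abstract can be illustrated with a small sketch. It assumes each distribution is a plain list of next-token probabilities and that slices are defined by rank boundaries. The function names, the `edges` parameter, and the toy numbers are illustrative choices, not taken from the thesis.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def top_p_count(p, top_p=0.9):
    """Smallest number of highest-probability tokens whose mass reaches top_p."""
    total = 0.0
    for n, x in enumerate(sorted(p, reverse=True), start=1):
        total += x
        if total >= top_p - 1e-12:  # small tolerance for float rounding
            return n
    return len(p)

def slice_masses(p, edges=(1, 10, 100)):
    """Probability mass per rank slice of the sorted distribution.

    With edges (1, 10, 100) the slices are ranks [0, 1), [1, 10),
    [10, 100) and [100, end). The boundaries are an assumption.
    """
    ranked = sorted(p, reverse=True)
    bounds = [0, *edges, len(ranked)]
    return [sum(ranked[a:b]) for a, b in zip(bounds, bounds[1:])]

def features(p_small, p_large, top_p=0.9, edges=(1, 10, 100)):
    """Differences (large minus small) for the three features."""
    small_slices = slice_masses(p_small, edges)
    large_slices = slice_masses(p_large, edges)
    return {
        "entropy_diff": entropy(p_large) - entropy(p_small),
        "top_p_count_diff": top_p_count(p_large, top_p) - top_p_count(p_small, top_p),
        "slice_mass_diff": [l - s for s, l in zip(small_slices, large_slices)],
    }

# Toy four-token vocabulary: the larger model is more peaked.
p_small = [0.4, 0.3, 0.2, 0.1]
p_large = [0.85, 0.05, 0.05, 0.05]
print(features(p_small, p_large, top_p=0.8, edges=(1, 2)))
```

A classifier over distribution classes could take such a feature dictionary as input. How the thesis defines its classes and transformations is not stated in the abstract.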

Self-Reflection on Chain-of-Thought Reasoning in Large Language Models / Självreflektion över Chain-of-Thought-resonerande i stora språkmodeller

Praas, Robert January 2023
A strong capability of large language models is Chain-of-Thought reasoning. Prompting a model to ‘think step-by-step’ has led to great performance improvements in solving problems such as planning and question answering, and the extended output provides some evidence about the rationale behind an answer or decision. In search of better, more robust, and more interpretable language model behavior, this work investigates self-reflection in large language models. Here, self-reflection consists of large language models giving feedback on medical question answering, and we examine whether that feedback can be used to accurately distinguish between correct and incorrect answers. GPT-3.5-Turbo and GPT-4 provide zero-shot feedback scores for Chain-of-Thought reasoning on the MedQA (medical question-answering) dataset. The question answering is evaluated on traits such as being structured, relevant, and consistent. We test whether the feedback scores differ between questions that were answered correctly and questions that were answered incorrectly by Chain-of-Thought reasoning. The potential differences in feedback scores are statistically tested with the Mann-Whitney U test. Graphical visualization and logistic regressions are used to preliminarily determine whether the feedback scores are indicative of whether the Chain-of-Thought reasoning leads to the right answer. The results indicate that, among the reasoning objectives, the feedback models assign higher feedback scores to questions that were answered correctly than to those that were answered incorrectly. Graphical visualization shows potential for reviewing questions with low feedback scores, although logistic regressions that aimed to predict whether or not questions were answered correctly mostly defaulted to the majority class. Nonetheless, there seems to be a possibility for more robust output from self-reflecting language systems. / A strong capability of large language models is Chain-of-Thought reasoning. Prompting a model to think step by step has led to large performance improvements in solving problems such as planning and question answering, and the extended output provides some evidence about the logic behind an answer or decision. In the search for better, more robust, and interpretable behavior in language models, this work investigates self-reflection in large language models. The research question is: to what extent can feedback from large language models, such as GPT-3.5-Turbo and GPT-4, accurately distinguish between correct and incorrect answers in medical question-answering tasks through the use of Chain-of-Thought reasoning? Here, GPT-3.5-Turbo and GPT-4 give zero-shot feedback scores to Chain-of-Thought reasoning on the MedQA (medical question-answering) dataset. The question answering should be structured, relevant, and consistent. The feedback scores are compared between two groups of questions, based on whether they were initially answered correctly or incorrectly. The difference in feedback scores is tested statistically with the Mann-Whitney U test. Graphical visualization and logistic regressions are used to preliminarily determine whether the feedback scores are indicative of whether Chain-of-Thought reasoning leads to the right answer. The results indicate that, among the reasoning objectives, the feedback models assign more positive feedback scores to questions that were answered correctly than to those that were answered incorrectly. Graphical visualization shows potential for reviewing questions with low feedback scores, although logistic regressions that aimed to predict whether the questions were answered correctly mostly defaulted to the majority class. Nevertheless, there appears to be potential for more robust output from self-reflecting language systems.
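The group comparison in this abstract rests on the Mann-Whitney U test. As a rough sketch of what that statistic measures, the function below computes U from midranks for two lists of feedback scores. The scores are invented for illustration, and no p-value is derived. In practice one would more likely call `scipy.stats.mannwhitneyu`.

```python
def mann_whitney_u(xs, ys):
    """Return (U_x, U_y) for two samples, using midranks for ties.

    U_x counts the (x, y) pairs in which x outranks y, with ties
    counted as one half.
    """
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_x, n_y = len(xs), len(ys)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    return u_x, n_x * n_y - u_x

# Hypothetical feedback scores (1-5) for correctly and incorrectly answered questions.
scores_correct = [5, 4, 5, 3, 4]
scores_incorrect = [3, 2, 4, 2]
print(mann_whitney_u(scores_correct, scores_incorrect))
```

A U value close to the product of the two sample sizes means one group almost always outranks the other. A value near half that product means the groups are largely indistinguishable by rank.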

On Semantic Cognition, Inductive Generalization, and Language Models

Kanishka Misra (9708551) 05 September 2023
Our ability to understand language and perform reasoning crucially relies on a robust system of semantic cognition (G. L. Murphy, 2002; Rogers & McClelland, 2004; Rips et al., 2012; Lake & Murphy, 2021): processes that allow us to learn, update, and produce inferences about everyday concepts (e.g., cat, chair), properties (e.g., has fur, can be sat on), categories (e.g., mammals, furniture), and relations (e.g., is-a, taller-than). Meanwhile, recent progress in the field of natural language processing (NLP) has led to the development of language models (LMs): sophisticated neural networks that are trained to predict words in context (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), and as a result build representations that encode the knowledge present in the statistics of their training environment. These models have achieved impressive levels of performance on a range of tasks that require sophisticated semantic knowledge (e.g., question answering and natural language inference), often even reaching human parity. To what extent do LMs capture the nuances of human conceptual knowledge and reasoning? Centering around this broad question, this dissertation uses core ideas in human semantic cognition as guiding principles and lays the groundwork for effective evaluation and improvement of conceptual understanding in LMs. In particular, I build on prior work that focuses on characterizing what semantic knowledge is made available in the behavior and representations of LMs, and extend it by additionally proposing tests that focus on functional consequences of acquiring basic semantic knowledge.
I primarily focus on inductive generalization (Hayes & Heit, 2018), the unique ability of humans to rely on acquired conceptual knowledge to project or generalize novel information, as a context within which we can analyze LMs’ encoding of conceptual knowledge. I do this because the literature surrounding inductive generalization contains a variety of empirical regularities that map to specific conceptual abstractions and shed light on how humans store, organize, and use conceptual knowledge. Before explicitly analyzing LMs for these empirical regularities, I test them in two other contexts that also feature the role of inductive generalization. First, I test the extent to which LMs demonstrate typicality effects, a robust finding in the human categorization literature in which certain members of a category are considered more central to the category than others. Specifically, I test the behavior of 19 different LMs in two contexts where typicality effects modulate human behavior: 1) verification of sentences expressing taxonomic category membership, and 2) projection of novel properties from individual category members to the entire category. In both tests, LMs achieved positive but modest correlations with human typicality ratings, suggesting that they can, to a non-trivial extent, capture subtle differences between category members. Next, I propose a new benchmark to test the robustness of LMs in attributing properties to everyday concepts, and in making inductive leaps to endow novel concepts with properties. On testing 31 different LMs for these capacities, I find that while they can correctly attribute properties to everyday concepts and even predict the properties of novel concepts in simple settings, they struggle to do so robustly. Combined with the analyses of typicality effects, these results suggest that the ability of LMs to demonstrate impressive conceptual knowledge and reasoning behavior can be explained by their sensitivities to shallow predictive cues. When these cues are carefully controlled for, LMs show critical failures in demonstrating robust conceptual understanding.
Finally, I develop a framework that allows us to characterize the extent to which the distributed representations learned by LMs can encode principles and abstractions that characterize the inductive behavior of humans. This framework operationalizes inductive generalization as the behavior of an LM after its representations have been partially exposed (via gradient-based learning) to novel conceptual information. To simulate this behavior, the framework uses LMs that are endowed with human-elicited property knowledge, by training them to evaluate the truth of sentences attributing properties to concepts. I apply this framework to test four different LMs on 13 different inductive phenomena documented for humans (Osherson et al., 1990; Heit & Rubinstein, 1994). Results from these analyses suggest that building representations from word distributions can successfully allow the encoding of many abstract principles that can guide inductive behavior in the models, including sensitivity to conceptual similarity, hierarchical organization of categories, reasoning about category coverage, and sample size. At the same time, the tested models also systematically failed at demonstrating certain phenomena, showcasing their inability to demonstrate pragmatic reasoning, their preference for shallow statistical cues, and their lack of context sensitivity with respect to high-level intuitive theories.

An Empirical Study on Using Codex for Automated Program Repair

Zhao, Pengyu January 2023
This thesis explores the potential of Codex, a pre-trained Large Language Model (LLM), for Automated Program Repair (APR) by assessing its performance on the Defects4J benchmark, which includes real-world Java bugs. The study aims to provide a comprehensive understanding of Codex’s capabilities and limitations in generating syntactically and semantically equivalent patches for defects, and to evaluate its ability to handle defects with different levels of importance and complexity. Additionally, we aim to compare the performance of Codex with other LLMs in the APR domain. To achieve these objectives, we employ a systematic methodology that includes prompt engineering, Codex parameter adjustment, code extraction, patch verification, and Abstract Syntax Tree (AST) comparison. We successfully verified 528 bugs in Defects4J, the highest number compared with other studies, and achieved 53.98% plausible and 26.52% correct patches. Furthermore, we introduce the elle-elle-aime framework, which extends the RepairThemAll framework for Codex-based APR and is adaptable for evaluating other LLMs, such as ChatGPT and GPT-4. The findings of this empirical study provide valuable insights into the factors that impact Codex’s performance on APR, helping to create new prompt strategies and techniques that improve research productivity. / This thesis explores the potential of Codex, a pre-trained LLM, for APR by evaluating its performance on the Defects4J benchmark, which includes real-world Java bugs. The study aims to provide a comprehensive understanding of Codex’s capabilities and limitations in generating syntactically and semantically equivalent patches for defects, and to evaluate its ability to handle defects with different levels of importance and complexity. In addition, our goal is to compare the performance of Codex with other LLMs in the APR domain. To achieve these goals, we use a systematic methodology that includes prompt engineering, adjustment of Codex parameters, code extraction, patch verification, and AST comparison. We successfully verified 528 bugs in Defects4J, the highest number compared with other studies, and achieved 53.98% plausible and 26.52% correct patches. Furthermore, we introduce the elle-elle-aime framework, which extends RepairThemAll for Codex-based APR and can be adapted to evaluate other LLMs, such as ChatGPT and GPT-4. The results of this empirical study provide valuable insights into the factors that affect Codex’s performance on APR and help create new prompt strategies and techniques that improve research productivity.
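The AST comparison step can be pictured as follows. This is only a sketch of the general idea: it uses Python's standard `ast` module as a stand-in, whereas the thesis works with Java patches, and the `same_ast` helper and the `clamp` snippets are invented for illustration.

```python
import ast

def same_ast(patch_a: str, patch_b: str) -> bool:
    """True when both snippets parse to the same abstract syntax tree.

    ast.dump omits line and column attributes by default, so differences
    in whitespace, blank lines and comments do not affect the result.
    """
    return ast.dump(ast.parse(patch_a)) == ast.dump(ast.parse(patch_b))

reference = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"
# Same logic as the reference, but formatted differently and with a comment.
candidate = "def clamp(x, lo, hi):  # generated patch\n\n    return max(lo,min(x,hi))\n"
# Compiles, but swaps min and max, so the tree differs.
different = "def clamp(x, lo, hi):\n    return min(lo, max(x, hi))\n"

print(same_ast(reference, candidate))
print(same_ast(reference, different))
```

Tree equality of this kind only detects syntactic matches with a reference patch. A patch can be semantically correct while having a different tree, which is one reason plausible and correct patches are counted separately.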

ChatGPT’s Performance on the Brief Electricity and Magnetism Assessment

Melin, Jakob, Elias, Önerud January 2024
In this study, we tested the performance of ChatGPT-4 on the concept inventory Brief Electricity and Magnetism Assessment (BEMA) to understand its potential as an educational tool in physics, especially in tasks requiring visual interpretation. Our results indicate that ChatGPT-4 performs similarly to undergraduate students in introductory electromagnetism courses, with an average score close to that of the students. However, ChatGPT-4 displayed significant differences compared to students, particularly in tasks involving complex visual elements such as electrical circuits and magnetic field diagrams. While ChatGPT-4 was proficient in proposing correct physical reasoning, it struggled with accurately interpreting visual information. These findings suggest that while ChatGPT-4 can be a useful supplementary tool for students, it should not be relied upon as a primary tutor for subjects heavily dependent on visual interpretation. Instead, it could be more effective as a peer, where its outputs are critically evaluated by students. Further research should focus on improving ChatGPT’s visual processing capabilities and exploring its role in diverse educational contexts.

Advancing Policy Insights: Opinion Data Analysis and Discourse Structuring Using LLMs

Bhatia, Aaditya 01 January 2024
The growing volume of opinion data presents a significant challenge for policymakers striving to distill public sentiment into actionable decisions. This study aims to explore the capability of large language models (LLMs) to synthesize public opinion data into coherent policy recommendations. We specifically leverage Mistral 7B and Mixtral 8x7B models for text generation and have developed an architecture to process vast amounts of unstructured information, integrate diverse viewpoints, and extract actionable insights aligned with public opinion. Using a retrospective data analysis of the Polis platform debates published by the Computational Democracy Project, this study examines multiple datasets that span local and national issues with 1600 statements posted and voted upon by over 3400 participants. Through content moderation, topic modeling, semantic structure extraction, insight generation, and argument mapping, we dissect and interpret the comments, leveraging voting data and LLMs for both quantitative and qualitative insights. A key contribution of this thesis is demonstrating how LLM reasoning techniques can enhance content moderation. Our content moderation approach shows performance improvements using comment deconstruction in multi-class classification, underscoring the trade-offs between moderation strategies and emphasizing a balance between precision and cautious moderation. Using comment clustering, we establish a hierarchy of semantically linked topics, facilitating an understanding of thematic structures and the generation of actionable insights. The generated argument maps visually represent the relationships between topics and insights, and highlight popular opinions. Future work will leverage advanced semantic extraction and reasoning techniques to enhance insight generation further. We also plan to generalize our techniques to other major discussion platforms, including Kialo. 
Our work contributes to the understanding of using LLMs for policymaking and offers a novel approach to structuring complex debates and translating public opinion into actionable policy insights.

I cast control chain of thought : A prompt introduction of roleplay to AI

Carlander, Deborah January 2024
Teaching AI to play Tabletop Roleplaying Games (TRPGs) is a difficult challenge due to their negotiable rules and open-ended nature. This is further exacerbated when considering the act of roleplaying, where players do not play or act as themselves, but as a character in a fantasy setting. Previous studies attempting to teach LLMs to play TRPGs do not explicitly discuss role-play in their work, highlighting the absence of a definition in current research. This thesis endeavours to introduce role-playing to AI by developing a prompting method called Control Chain of Thoughts, aimed at teaching it the Dungeons and Dragons alignment system. The prompting method is evaluated through an ablation study in which GPT-3.5 is tasked with guessing the alignment of characters based on extracts from D&D gaming sessions. The results indicate a small improvement in GPT’s predictions. Further work is needed to evaluate whether alignments help LLMs understand roleplaying.

Evaluating ChatGPT's Effectiveness in Web Accessibility for the Visually Impaired / En utvärdering av ChatGPTs effektivitet inom tillängligt innehåll på webben för synskadade

Holmlund, Miranda January 2024
Web accessibility is essential for making the internet available to everyone, including individuals with disabilities. This study explores ChatGPT-4's potential in improving web accessibility for visually impaired users by evaluating its effectiveness in interpreting and conveying web content with accessibility issues. The methodology involved creating websites with intentional accessibility barriers, crafting prompts to simulate real-time issues, and using ChatGPT-4 to provide solutions. Data was gathered from both visually impaired participants and participants without disabilities, who rated ChatGPT-4's responses on relevance, conciseness, clarity, and usability using a 1-5 Likert scale. Results showed that ChatGPT-4 had 64.42% effectiveness in assisting with web accessibility, particularly in summarizing and clarifying content. However, issues such as hallucinations and false information were noted. This study underscores the promise of ChatGPT-4 in enhancing web accessibility and emphasizes the need for further refinement to ensure accuracy and reliability in real-world applications. / Accessible web content is a necessary part of creating an internet that is usable by everyone, including people with a disability. This study explores the potential of ChatGPT-4 as a tool for improving web accessibility for visually impaired users by evaluating the tool's effectiveness in interpreting and conveying web content that has accessibility problems. The methodology involved creating web pages that intentionally contained accessibility barriers, creating prompts to simulate real-time problems, and using ChatGPT-4 as a solution. Data collection covered individuals both with and without a visual impairment, who rated ChatGPT-4's responses on the criteria of relevance, conciseness, clarity, and usability on a 1-5 Likert scale. The results showed that ChatGPT-4 was 64.42% effective in assisting with web accessibility, and particularly effective at summarizing and explaining content. However, the tool exhibited problems such as hallucinations and false information. This study demonstrates ChatGPT-4's potential for improving web accessibility and underscores that further development is needed to guarantee correctness and reliability in real-world applications.

Code Generation from Large API Specifications with Open Large Language Models : Increasing Relevance of Code Output in Initial Autonomic Code Generation from Large API Specifications with Open Large Language Models

Lyster Golawski, Esbjörn, Taylor, James January 2024
Background. In software systems defined by extensive API specifications, autonomic code generation can streamline the coding process by replacing repetitive, manual tasks such as creating REST API endpoints. Using large language models (LLMs) to generate source code comprehensively on the first try requires refined prompting strategies to ensure that the output is relevant, a challenge that grows as API specifications become larger. Objectives. This study aims to develop and validate a prompting orchestration solution for LLMs that generates more relevant, non-duplicated code than a single comprehensive prompt, without refactoring previous code. Additionally, the study evaluates the practical value of the generated code for developers at Ericsson who are familiar with the target application that uses the same API specification. Methods. Employing a prototyping approach, we develop a solution that produces more relevant, non-duplicated code than a single prompt, using locally hosted LLMs for the target API at Ericsson. We perform a controlled experiment in which we run the developed solution and a single prompt and collect the outputs. Using the results, we conduct interviews with Ericsson developers about the value of the AI-generated code. Results. The study identified a prompting orchestration method that generated 427 relevant lines of code (LOC) on average in the best-case scenario, compared to 66 LOC with a single comprehensive prompt. Additionally, 66% of the developers interviewed preferred using the AI-generated code as a starting point over starting from scratch when developing applications for Ericsson, and 66% preferred starting from the AI-generated code over code generated from the same API specification via Swagger CodeGen. Conclusions. With the right prompting orchestration method, it is possible to increase the extent to which locally hosted LLMs generate relevant code from large API specifications, compared to a single comprehensive prompt and without refactoring the generated code. The generated code is currently valuable as a good starting point for further software development.
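The abstract does not describe how the orchestration works, so the following is only a guess at the general shape of such a method: instead of one comprehensive prompt, the specification is walked endpoint by endpoint, and each prompt lists what has already been generated to discourage duplication. The `orchestrate` function, the prompt wording, the simplified `spec` layout, and the `generate` callable (a stand-in for a locally hosted LLM) are all hypothetical.

```python
from typing import Callable

def orchestrate(spec: dict, generate: Callable[[str], str]) -> str:
    """Generate code one endpoint at a time instead of with a single prompt.

    `spec` is assumed to look like {"paths": {path: {method: details}}}.
    `generate` is a hypothetical function that sends a prompt to an LLM
    and returns the generated code as text.
    """
    chunks: list[str] = []
    done: list[str] = []
    for path, methods in spec["paths"].items():
        for method in methods:
            endpoint = f"{method.upper()} {path}"
            prompt = (
                f"Implement a handler for {endpoint}.\n"
                f"Already implemented, do not repeat: {', '.join(done) or 'nothing'}."
            )
            chunks.append(generate(prompt))
            done.append(endpoint)
    return "\n\n".join(chunks)

# Stub in place of a real model so the sketch runs on its own.
def echo_model(prompt: str) -> str:
    return "# " + prompt.splitlines()[0]

spec = {"paths": {"/users": {"get": {}, "post": {}}, "/users/{id}": {"get": {}}}}
print(orchestrate(spec, echo_model))
```

Keeping each prompt small is one plausible way to stay within a model's context window as the specification grows. Whether the thesis uses this decomposition is not stated in the abstract.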

Large Language Models for Unit Test Generation in React Native TypeScript Components

Borgström, Erik, Bergvall, Robin January 2024
Advancements within Large Language Models (LLMs) have opened a world of opportunities within the software development domain. Through a controlled experiment, this thesis investigates how LLMs can be used in software testing, specifically unit testing. The controlled experiment was performed using a Python script that interfaces with the gpt-3.5-turbo model to automatically generate unit tests for React Native components written in TypeScript. The pipeline calls the OpenAI Application Programming Interface (API) iteratively. To evaluate the tests and obtain the code coverage metric, the unit tests were executed with Jest. Additionally, failing tests, both compilable and non-compilable, were executed manually, and the different kinds of errors and their frequencies were documented. The experiment shows that LLMs can be used to generate comprehensive and accurate unit tests, with high potential for future improvements. While the number of generated tests that compiled was low, their quality was often good; they typically failed because of easily correctable syntax errors, faulty imports, or missing dependencies. The errors found were in large part due to project configurations, while others would probably be less frequent with more extensive prompt engineering or a newer model. The experiment also shows that the temperature affected the outcome and that the types of errors differed between compiling and non-compiling tests. A lower temperature parameter in the OpenAI API generally achieved better results, whilst a higher temperature showed greater coverage for compiled failing tests. This thesis also shows that future opportunities and improvements are widely available. With better project optimization, newer models, and better prompting, a better result is to be expected.
The script could with further development be turned into a working product, making software testing faster and more efficient, saving both time and money while simultaneously improving the test case quality.
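A pipeline of the kind described, generating tests per component at several temperature settings and then executing them, might be organized roughly as below. This is a sketch under assumptions: `generate` stands in for the call to the model and `run_jest` for a wrapper that executes the test file and reports whether it compiled and what coverage it reached. Neither reflects the thesis's actual script, and the numbers in the stubs are made up.

```python
from collections import defaultdict
from statistics import mean

def run_experiment(components, temperatures, generate, run_jest):
    """Generate and execute one test file per (component, temperature) pair."""
    results = []
    for name, source in components.items():
        for temperature in temperatures:
            test_code = generate(source, temperature)  # hypothetical LLM call
            outcome = run_jest(name, test_code)        # hypothetical Jest wrapper
            results.append({"component": name, "temperature": temperature, **outcome})
    return results

def mean_coverage_by_temperature(results):
    """Average coverage of the tests that compiled, grouped by temperature."""
    grouped = defaultdict(list)
    for row in results:
        if row["compiled"]:
            grouped[row["temperature"]].append(row["coverage"])
    return {t: mean(values) for t, values in grouped.items()}

# Stubs so the sketch runs without network access or a Node.js toolchain.
def stub_generate(source, temperature):
    return f"// tests for {len(source)} characters of source at temperature {temperature}"

def stub_run_jest(name, test_code):
    return {"compiled": True, "coverage": 40.0 if "Button" in name else 60.0}

components = {"Button.tsx": "export const Button = () => null;",
              "Card.tsx": "export const Card = () => null;"}
results = run_experiment(components, [0.2, 1.0], stub_generate, stub_run_jest)
print(mean_coverage_by_temperature(results))
```

Passing the model call and the test runner in as functions keeps the experiment loop testable on its own, which is a design choice of this sketch rather than something the abstract reports.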

