11. Augmenting Large Language Models with Humor Theory To Understand Puns. Ryan Rony Dsilva (18429846), 25 April 2024
This research explores the application of large language models (LLMs) to the comprehension of puns. Leveraging the expansive capabilities of LLMs, this study delves into the domain of pun classification by examining it through the prism of two humor theories: the Computational Model of Humor and the Benign Violation theory, which is an extension of the N+V Theory. The computational model posits that for a phrase to qualify as a pun, it must possess both ambiguity and distinctiveness: it must contain a word that can be interpreted in two plausible ways, with each interpretation supported by at least one unique word. The Benign Violation theory, on the other hand, posits that puns work by breaching one linguistic rule while conforming to another, thereby creating a "benign violation." Using these frameworks, this research scrutinizes a curated collection of English-language puns. Our aim is to assess the validity and effectiveness of these theoretical frameworks in accurately classifying puns. We undertake controlled experiments on the dataset, selectively removing a condition specific to one theory and then evaluating the altered puns against the criteria of the other theory to see how well it classifies the modified inputs. This approach allows us to uncover deeper insights into the processes that facilitate the recognition of puns and to explore the practical implications of applying humor theories. The findings of our experiments, detailed in the subsequent sections, shed light on how altering specific conditions impacts the ability of LLMs to accurately classify puns according to each theory, and show that the components of a theory do not all influence the result to the same extent, thereby contributing to our understanding of humor mechanics through the eyes of LLMs.
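To make the computational model's two conditions concrete, the sketch below checks them lexically with WordNet rather than with an LLM; the sentence splitting, the gloss-overlap rule, and the example pun are illustrative assumptions, not the procedure used in the thesis.

```python
# pip install nltk; then run once: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def computational_model_check(sentence: str, pun_word: str) -> dict:
    """Rough lexical proxy for the two pun conditions: ambiguity and distinctiveness."""
    context = {w.lower().strip(".,!?") for w in sentence.split()} - {pun_word.lower()}
    senses = wn.synsets(pun_word)
    ambiguity = len(senses) >= 2          # the word admits at least two interpretations
    # Distinctiveness proxy: each of (at least) two senses is supported by some context
    # word appearing in that sense's gloss or example sentences.
    supported = []
    for sense in senses:
        evidence = set(sense.definition().lower().split())
        for example in sense.examples():
            evidence |= set(example.lower().split())
        supported.append(context & evidence)
    distinctiveness = sum(1 for s in supported if s) >= 2
    return {"ambiguity": ambiguity,
            "distinctiveness": distinctiveness,
            "pun_candidate": ambiguity and distinctiveness}

print(computational_model_check("I used to be a banker but I lost interest", "interest"))
```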
12. Large Language Models for Unsupervised Keyphrase Extraction and Biomedical Data Analytics. Haoran Ding (18825838), 03 September 2024
Natural Language Processing (NLP), a vital branch of artificial intelligence, is designed to equip computers with the ability to comprehend and manipulate human language, facilitating the extraction and utilization of textual data. NLP plays a crucial role in harnessing the vast quantities of textual data generated daily, enabling meaningful information extraction. Among the various techniques, keyphrase extraction stands out due to its ability to distill concise information from extensive texts, making it invaluable for summarizing and navigating content efficiently. The process of keyphrase extraction usually begins by generating candidates and then ranking them to identify the most relevant phrases. Keyphrase extraction can be categorized into supervised and unsupervised approaches. Supervised methods typically achieve higher accuracy as they are trained on labeled data, which allows them to effectively capture and utilize patterns recognized during training. However, the dependency on extensive, well-annotated datasets limits their applicability in scenarios where such data is scarce or costly to obtain. On the other hand, unsupervised methods, while free from the constraints of labeled data, face challenges in capturing deep semantic relationships within text, which can impact their effectiveness. Despite these challenges, unsupervised keyphrase extraction holds significant promise due to its scalability and lower barriers to entry, as it does not require labeled datasets. This approach is increasingly favored for its potential to aid in building extensive knowledge bases from unstructured data, which can be particularly useful in domains where acquiring labeled data is impractical. As a result, unsupervised keyphrase extraction is not only a valuable tool for information retrieval but also a pivotal technology for the ongoing expansion of knowledge-driven applications in NLP.
In this dissertation, we introduce three innovative unsupervised keyphrase extraction methods: AttentionRank, AGRank, and LLMRank. Additionally, we present a method for constructing knowledge graphs from unsupervised keyphrase extraction, leveraging the self-attention mechanism. The first study discusses the AttentionRank model, which utilizes a pre-trained language model to derive underlying importance rankings of candidate phrases through self-attention. This model employs a cross-attention mechanism to assess the semantic relevance between each candidate phrase and the document, enhancing the phrase ranking process. AGRank, detailed in the second study, is a sophisticated graph-based framework that merges deep learning techniques with graph theory. It constructs a candidate phrase graph using mutual attentions from a pre-trained language model. Both global document information and local phrase details are incorporated as enhanced nodes within the graph, and a graph algorithm is applied to rank the candidate phrases. The third study, LLMRank, leverages the strengths of large language models (LLMs) and graph algorithms. It employs LLMs to generate keyphrase candidates and then integrates global information through the text's graphical structures. This process reranks the candidates, significantly improving keyphrase extraction performance. The fourth study explores how self-attention mechanisms can be used to extract keyphrases from medical literature and generate query-related phrase graphs, improving text retrieval visualization. The mutual attentions of medical entities, extracted using a pre-trained model, form the basis of the knowledge graph. This, coupled with a specialized retrieval algorithm, allows for the visualization of long-range connections between medical entities while simultaneously displaying the supporting literature. In summary, our exploration of unsupervised keyphrase extraction and biomedical data analysis introduces novel methods and insights in NLP, particularly in information extraction. These contributions are crucial for the efficient processing of large text datasets and suggest avenues for future research and applications.
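As an illustration of the attention-based ranking idea shared by AttentionRank and AGRank, the following sketch scores candidate phrases by the self-attention their tokens receive in a pre-trained BERT model; the model choice, candidate list, and scoring rule are simplifying assumptions, not the published implementations.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

def rank_by_attention(document: str, candidates: list[str]) -> list[tuple[str, float]]:
    enc = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    # Average over layers and heads, then take the attention each token *receives*.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]   # (seq_len, seq_len)
    received = attn.mean(dim=0)                              # (seq_len,)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    ranked = []
    for phrase in candidates:
        pieces = set(tokenizer.tokenize(phrase))
        hits = [received[i].item() for i, t in enumerate(tokens) if t in pieces]
        ranked.append((phrase, sum(hits) / len(hits) if hits else 0.0))
    return sorted(ranked, key=lambda x: x[1], reverse=True)

doc = "Keyphrase extraction distills concise information from extensive texts."
print(rank_by_attention(doc, ["keyphrase extraction", "concise information", "texts"]))
```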
13. Measurement and Development for Automated Secure Coding Solutions. Frantz, Miles Eugene, 09 September 2024
With the rise of development efforts, there has also been a rise in source code vulnerabilities.
Advanced security tools have been created to identify these vulnerabilities throughout the development lifecycle and afterward, before the vulnerabilities are exposed.
One such popular method is Static Code Analysis (SCA), which scans developers' source code to identify potential vulnerabilities. My Ph.D. work aims to help reduce exposed vulnerabilities by YIELDing, ENHANCing, and EVALUATing (EYE) SCA tools that identify vulnerabilities while the developer writes the code. We first look into evaluating tools that support developers with their source code by determining how accurate they are at identifying vulnerability information. Large Language Models (LLMs) have been on the rise recently with the introduction of Chat Generative Pre-trained Transformer (ChatGPT) 3.5, ChatGPT 4.1, Google Gemini, and many more.
Using a common framework, we created a zero-shot prompt instructing the LLM to identify whether there is a vulnerability in the provided source code and which Common Weakness Enumeration (CWE) value represents it. With our Python cryptographic benchmark PyCryptoBench, we sent vulnerable samples to four different LLMs and two different versions of the ChatGPT Application Programming Interface (API). The samples allow us to measure how reliable each LLM is at identifying and describing vulnerabilities. The ChatGPT APIs include multiple reproducibility fields that allowed us to measure how reproducible the responses are. Next, we yield a new SCA tool to apply what we learned to a current gap in increasingly complex source code. Cryptolation, our state-of-the-art (SOA) Python SCA tool, uses constant-propagation-supported variable inference to gain insight into the data-flow state throughout the program's execution. Python source code has ever-increasing complexity and a lack of SCA tools compared to Java. We compare Cryptolation with the other SOA SCA tools Bandit, Semgrep, and Dlint. To verify the Precision of our tool, we created the benchmark PyCryptoBench, which contains 1,836 test cases and encompasses five different language features. Next, we crawled over 1,000 cryptography-related Python projects on GitHub and scanned each with each tool. Finally, we reviewed all PyCryptoBench results and sampled over 10,000 cryptography-related Python projects. The results reveal that Cryptolation achieves 100% Precision on the benchmark and the second-highest Precision on the cryptography-related projects. Finally, we look at enhancing SCA tools. The SOA tools already compete to have the highest Precision, Recall, and Accuracy. However, we examine several developer surveys to determine their reasons for not adopting such tools; these generally come down to better aesthetics, usability, customization, and a low effort cost to use the tools consistently.
To achieve this, we enhance the SOA Java SCA tool CryptoGuard with the following: integrated build tools, modern terminal Command Line Interface (CLI) usage, customizable and vendor-specific output formats, and no-install demos. / Doctor of Philosophy / With the rise of more development efforts and source code, there has also been a rise in source code vulnerabilities. To match this, more advanced security tools have been created to identify these vulnerabilities before they are exposed. SCA tools are a popular method for identifying vulnerable source code since they do not execute any code and can scan the code while the developer is writing it. Despite their popularity, there is still much room for improvement.
My Ph.D. work aims to help reduce exposed vulnerabilities by yielding, enhancing, and evaluating (EYE) SCA tools that identify vulnerabilities while the developer writes the code. First, we look into evaluating tools that support and refine SCA by examining the Accuracy and secureness of generative LLMs. LLMs have been on the rise recently with the introduction of ChatGPT 3.5 and, more recently, ChatGPT 4.1. ChatGPT is a conversation-based program in which you ask the program a question and it answers. It can explain small source code snippets to developers, provide source code examples, or even fix source code. While the developers of the LLMs have restricted certain aspects of the models, one of their main selling points is their source code assistance. With over 1,000 zero-shot prompts, we measure how accurate and reliable LLMs are at identifying the existence and details of vulnerabilities within source code. Next, we yield a new SCA tool to apply what we learned to a current gap in increasingly complex source code. This tool is Cryptolation, a Python SCA tool that uses variable inference to try to determine variable values without execution. Python source code has ever-increasing complexity and a lack of tools compared to Java. We compare Cryptolation with four other SOA tools. To verify the Precision of our tool, we create the benchmark PyCryptoBench, which contains over 1,000 test cases encompassing five different language features.
Next, we crawled over 1,000 cryptography-related Python projects on GitHub and scanned each with each tool. Finally, we reviewed all PyCryptoBench results and samples of the 10,000 cryptography-related Python projects. The results reveal that Cryptolation achieves 100% Precision on the benchmark and the second-highest Precision on the cryptography-related projects. Next, we look at enhancing SCA tools. The SOA tools already compete to have the highest Precision, Recall, and Accuracy. However, we investigated current developer surveys to see what reasons developers give for not adopting such tools; these generally come down to better aesthetics, usability, customization, and a low effort cost to use the tools consistently. To achieve this, we enhance the SOA Java SCA tool CryptoGuard to address these needs.
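The zero-shot setup described above can be pictured with the sketch below; the prompt wording, the JSON response format, and the model name are illustrative assumptions rather than the dissertation's actual framework, though the `seed` parameter is one of the reproducibility fields the OpenAI chat API exposes.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_HEADER = (
    "You are a security analyst. For the Python code below, respond with JSON of the form\n"
    '{"vulnerable": true or false, "cwe": "CWE-###" or null, "reason": "..."}\n\nCode:\n'
)

def classify_vulnerability(code: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,   # keep responses as deterministic as the API allows
        seed=0,          # reproducibility field used to compare repeated runs
        messages=[{"role": "user", "content": PROMPT_HEADER + code}],
    )
    return response.choices[0].message.content

sample = "from Crypto.Cipher import DES\ncipher = DES.new(b'8bytekey', DES.MODE_ECB)"
print(classify_vulnerability(sample))
```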
14. The Student Becomes The Teacher: Training High-Performance Language Models More Sample-Efficiently From Small Models Via Superstilling. Gundry, Chaz Allen, 14 August 2023
Recent advances including the Transformer architecture have revolutionized the Natural Language Processing community by providing immense performance improvements across many tasks, including the development of Large Language Models (LLMs). LLMs show enormous promise as few-shot learners, common-sense knowledge repositories, conversational agents, writing assistants, and coding tools, and are gaining widespread traction in commercial industry. However, LLMs are expensive and time-consuming to train, requiring many passes over terabytes of data for the largest models. In this paper, we present Superstilling, a method for reducing the sample complexity of language model training by distilling the knowledge from a previously-trained model (the teacher) into a new, larger model (the student). This method does not require conformity between the architectures of the two models, and can be applied even when the weights and training data of the teacher model are not available, for example in federated learning scenarios. We apply Superstilling to train models of various sizes and show this method can decrease sample complexity by more than 10% on models with over 160M parameters. We also show that in certain scenarios, Superstilling can be used to speed up training despite the need to run the teacher and student models simultaneously.
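The abstract does not spell out Superstilling's training objective; as a point of reference, the sketch below shows a standard logit-distillation loss in which a student is trained to match a teacher's temperature-softened next-token distribution alongside the usual cross-entropy term. The temperature and mixing weight are illustrative defaults, and since Superstilling is stated to work even without access to teacher weights, its actual formulation will differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend a soft-target KL term (teacher guidance) with hard-target cross-entropy."""
    vocab = student_logits.size(-1)
    student_flat = student_logits.view(-1, vocab)
    teacher_flat = teacher_logits.view(-1, vocab)
    # Soft targets: match the teacher's temperature-scaled distribution.
    soft = F.kl_div(
        F.log_softmax(student_flat / temperature, dim=-1),
        F.softmax(teacher_flat / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary next-token cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_flat, labels.view(-1))
    return alpha * soft + (1.0 - alpha) * hard
```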
15. Automatisering av CPV-klassificering: En studie om Large Language Models i kombination med word embeddings kan lösa CPV-kategorisering av offentliga upphandlingar. Andersson, Niklas; Andersson Sjöberg, Hanna, January 2024
This study explores the use of Large Language Models and word embeddings to automate the categorisation of CPV codes in Swedish public procurements. Earlier studies have not achieved reliable categorisation, but this experiment tests a new method combining the LLM models Mistral and Llama3 with FastText word embeddings. The results show that although the proposed solution can correctly identify some CPV main groups, its overall performance is low, with 12% of procurements classified entirely correctly and 35% classified partially correctly, meaning at least one CPV main group was correctly identified. Improvements are needed in both correctness and accuracy. The study contributes to the research field by demonstrating the challenges of, and potential solutions for, automated categorisation of public procurements. It also proposes future research involving larger and more advanced models to address the identified challenges.
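As one way to picture the combination of an LLM with FastText embeddings, the sketch below maps an LLM-generated summary of a tender to the nearest CPV division by cosine similarity; the vector file, the tiny CPV table, and the matching rule are illustrative assumptions, not the study's pipeline.

```python
import numpy as np
import fasttext

ft = fasttext.load_model("cc.sv.300.bin")   # assumed pre-trained Swedish FastText vectors

# A few CPV divisions with approximate Swedish descriptions, for illustration only.
CPV = {
    "45000000": "Anläggningsarbete",
    "72000000": "IT-tjänster: konsultverksamhet, programvaruutveckling, Internet och stöd",
    "85000000": "Hälso- och sjukvård samt socialvård",
}
CPV_VECS = {code: ft.get_sentence_vector(desc) for code, desc in CPV.items()}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def nearest_cpv(llm_summary: str) -> str:
    """Return the CPV division whose description is closest to the LLM's tender summary."""
    v = ft.get_sentence_vector(llm_summary)
    return max(CPV_VECS, key=lambda code: cosine(v, CPV_VECS[code]))

print(nearest_cpv("Upphandling av drift och support av kommunens IT-system"))
```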
16. Towards On-Premise Hosted Language Models for Generating Documentation in Programming Projects. Hedlund, Ludvig, January 2024
Documentation for programming projects can vary in both quality and availability. Availability can vary even more in a closed working environment, since fewer developers will read the documentation. Documenting programming projects can be demanding on worker hours and unappreciated among developers. It is a common conception that developers would rather invest time in developing a project than in documenting it, so making the documentation process more effective would benefit developers. To move towards a more automated process of writing documentation, this work generated documentation for repositories that attempts to summarize their use cases and functionality. Two different implementations were created to generate documentation using an on-premise hosted large language model (LLM) as a tool. First, the embedded solution processes all available code in a project and creates the documentation from multiple summarizations of files and folders. Second, the RAG solution attempts to use only the most important parts of the code and lets the LLM create the documentation from a smaller subset of the codebase. The results show that generating documentation is possible but unreliable, and the output must be checked by a person with knowledge of the codebase. The embedded solution seems to be more reliable and to produce better results, but is more costly than the RAG solution.
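A minimal sketch of the retrieval step behind a RAG-style documentation generator is shown below: code files are embedded, the chunks most relevant to a documentation question are retrieved, and a prompt is assembled for an on-premise LLM. The embedding model, per-file chunking, and prompt wording are assumptions, not the thesis's implementation.

```python
from pathlib import Path
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_documentation_prompt(repo_root: str, question: str, top_k: int = 5) -> str:
    files = sorted(Path(repo_root).rglob("*.py"))
    chunks = [p.read_text(errors="ignore")[:2000] for p in files]  # one crude chunk per file
    corpus = embedder.encode(chunks, convert_to_tensor=True)
    query = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query, corpus, top_k=top_k)[0]
    context = "\n\n".join(
        f"# File: {files[h['corpus_id']]}\n{chunks[h['corpus_id']]}" for h in hits
    )
    return (
        "Using only the code excerpts below, write a README section describing the "
        "project's use cases and main functionality.\n\n"
        f"{context}\n\nFocus: {question}"
    )

# The returned prompt would then be sent to the locally hosted LLM of choice.
```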
17. Enhancing Software Maintenance with Large Language Models: A comprehensive study. Younes, Youssef; Nassrallah, Tareq, January 2024
This study investigates the potential of Large Language Models (LLMs) to automate and enhance software maintenance tasks, focusing on bug detection and code refactoring. Traditional software maintenance, which includes debugging and code optimization, is time-consuming and prone to human error. With advancements in artificial intelligence, LLMs like ChatGPT and Copilot offer promising capabilities for automating these tasks. Through a series of quasi-experiments, we evaluate the effectiveness of ChatGPT 3.5, ChatGPT 4 (Grimoire GPT), and GitHub Copilot. Each model was tested on various code snippets to measure its ability to identify and correct bugs and to refactor code while maintaining the code's original functionality. The results indicate that ChatGPT 4 (Grimoire GPT) outperforms the other models, demonstrating superior accuracy and effectiveness, with success percentages of 87.5% and 75% in bug detection and code refactoring, respectively. This research highlights the potential of advanced LLMs to significantly reduce the time and cost associated with software maintenance, though human oversight is still necessary to ensure code integrity. The findings contribute to the understanding of LLM capabilities in real-world software engineering tasks and pave the way for more intelligent and efficient software maintenance practices.
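One way to operationalize "refactoring while maintaining original functionality" is to run a snippet's existing unit tests against the model's output, as in the hypothetical check below; the file names and the pytest-based acceptance criterion are assumptions, not the study's actual protocol.

```python
import subprocess
import tempfile
from pathlib import Path

def preserves_behaviour(refactored_code: str, test_code: str) -> bool:
    """Accept an LLM refactoring only if the original unit tests still pass."""
    with tempfile.TemporaryDirectory() as tmp:
        # test_code is assumed to import from module_under_test.
        Path(tmp, "module_under_test.py").write_text(refactored_code)
        Path(tmp, "test_module.py").write_text(test_code)
        result = subprocess.run(
            ["python", "-m", "pytest", "-q", "test_module.py"],
            cwd=tmp, capture_output=True, text=True,
        )
        return result.returncode == 0
```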
18. Citation Evaluation Using Large Language Models (LLMs): Can LLMs evaluate citations in scholarly documents? An experimental study on ChatGPT. Zeeb, Ahmad; Olsson, Philip, January 2024
This study investigates the capacity of Large Language Models (LLMs), specifically ChatGPT 3.5 and 4, to evaluate citations in scholarly papers. Given the importance of accurate citations in academic writing, the goal was to determine how well these models can assist in verifying citations. A series of experiments were conducted using a dataset of our own creation. This dataset includes the three main citation categories: Direct Quotation, Paraphrasing, and Summarising, along with subcategories such as minimal and long source text. In the preliminary experiment, ChatGPT 3.5 demonstrated perfect accuracy, while ChatGPT 4 showed a tendency towards false positives. Further experiments with an extended dataset revealed that ChatGPT 4 excels in correctly identifying valid citations, particularly with longer and more complex texts, but is also more prone to wrong predictions. ChatGPT 3.5, on the other hand, provided a more balanced performance across different text lengths, with both models achieving an accuracy rate of 90.7%. The reliability experiments indicated that ChatGPT 4 is more consistent in its responses compared to ChatGPT 3.5, although it also had a higher rate of consistent wrong predictions. This study highlights the potential of LLMs to assist scholars in citation verification, suggesting a hybrid approach where ChatGPT 4 is used for initial scans and ChatGPT 3.5 for final verification, paving the way for automating this process. Additionally, this study contributes a dataset that can be further expanded and tested on, offering a valuable resource for future research in this domain.
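A citation-verification prompt in the spirit of these experiments might look like the sketch below; the wording, label set, and model name are assumptions, not the prompts used in the study.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TEMPLATE = (
    "You will verify a citation.\n\n"
    "Source passage:\n{source}\n\n"
    "Citing sentence:\n{citation}\n\n"
    "Is the citing sentence faithful to the source (as a direct quotation, paraphrase, "
    "or summary)? Answer VALID or INVALID, then name the citation type."
)

def verify_citation(source: str, citation: str, model: str = "gpt-4") -> str:
    reply = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": TEMPLATE.format(source=source, citation=citation)}],
    )
    return reply.choices[0].message.content
```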
19. Improving Context Awareness of Transformer Networks using Retrieval-Augmented Generation. Do, Anh; Tran, Saga, January 2024
The Thermo-Calc software is a key tool in the research process for many material engineers. However, integrating multiple modules in Thermo-Calc requires the user to write code in a Python-based language, which can be challenging for novice programmers. This project aims to enable the generation of such code from user prompts by using existing generative AI models. In particular, we use a retrieval-augmented generation architecture applied to LLaMA and Mistral models. We use Code LLaMA-Instruct models with 7, 13, and 34 billion parameters, and a Mistral-Instruct model with 7 billion parameters. The Code LLaMA models are based on LLaMA 2. We also use a LLaMA 3-Instruct model with 8 billion parameters. All these models are instruction-tuned, which suggests that they have the capability to interpret natural language and identify appropriate options for a command-line program such as Python. In our testing, the LLaMA 3-Instruct model performed best, achieving 53% on the industry benchmark HumanEval and 49% on our internal adequacy assessment at pass@1, which is the expected probability of getting a correct solution when generating a response. This indicates that the model gets approximately every other answer correct. Due to GPU memory limitations, we had to apply quantisation to process the 13 and 34 billion parameter models. Our results revealed a mismatch between model size and optimal levels of quantisation, indicating that reduced precision adversely affects the performance of these models. Our findings suggest that a properly customised large language model can greatly reduce the coding effort of novice programmers, thereby improving productivity in material research.
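For reference, pass@1 as described here reduces to the fraction of correct generations; the function below is the standard unbiased pass@k estimator (Chen et al., 2021), included only to make the metric concrete.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, given c correct out of n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this is simply c / n: 53 correct generations out of 100 gives pass@1 = 0.53,
# i.e. roughly every other answer is correct.
print(pass_at_k(100, 53, 1))   # 0.53
```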
20. Large language models and various programming languages: A comparative study on bug detection and correction. Gustafsson, Elias; Flystam, Iris, January 2024
This bachelor’s thesis investigates the efficacy of cutting-edge Large Language Models (LLMs) — GPT-4, Code Llama Instruct (7B parameters), and Gemini 1.0 — in detecting and correcting bugs in Java and Python code. Through a controlled experiment using standardized prompts and the QuixBugs dataset, each model's performance was analyzed and compared. The study highlights significant differences in the ability of these LLMs to correctly identify and fix programming bugs, showcasing a comparative advantage in handling Python over Java. Results suggest that while all these models are capable of identifying bugs, their effectiveness varies significantly between models. The insights gained from this research aim to aid software developers and AI researchers in selecting appropriate LLMs for integration into development workflows, enhancing the efficiency of bug management processes.
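A sketch of the kind of harness such an experiment needs is shown below, assuming a local checkout of the public QuixBugs repository and a placeholder `ask_llm_to_fix` function standing in for the standardized prompt sent to each model; a real evaluation would run the benchmark's test cases rather than the crude textual comparison used here.

```python
import difflib
from pathlib import Path

QUIXBUGS = Path("QuixBugs")   # assumed local clone of the QuixBugs benchmark

def ask_llm_to_fix(buggy_code: str) -> str:
    raise NotImplementedError("send the standardized bug-fixing prompt to the chosen LLM")

def score_fix(program: str) -> float:
    """Compare the model's repaired Python program against the benchmark's reference fix."""
    buggy = (QUIXBUGS / "python_programs" / f"{program}.py").read_text()
    reference = (QUIXBUGS / "correct_python_programs" / f"{program}.py").read_text()
    candidate = ask_llm_to_fix(buggy)
    return difflib.SequenceMatcher(None, candidate, reference).ratio()

# Example: score_fix("gcd") would evaluate the model's repair of the buggy gcd program.
```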