Global ETD Search

1	Analyzing Large Language Models For Classifying Sexual Harassment Stories With Out-of-Vocabulary Word Substitution Seung Yeon Paik (18419409) 25 April 2024 (has links) <p dir="ltr">Sexual harassment is regarded as a serious issue in society, with a particularly negative impact on young children and adolescents. Online sexual harassment has recently gained prominence as a growing share of communication takes place online. Online sexual harassment can happen anywhere in the world because of the global nature of the internet, which transcends geographical barriers and allows people to communicate electronically. It can occur in a wide variety of environments, such as work email or chat apps in the workplace, social media, online communities, and games (Chawki & El Shazly, 2013).<br>However, people, especially non-native English speakers, may vary in their understanding or interpretation of text-based sexual harassment because of cultural differences and language barriers (Welsh, Carr, MacQuarrie, & Huntley, 2006). To bridge this gap, previous studies have proposed large language models to detect and classify online sexual harassment, prompting a need to explore how language models comprehend the nuanced aspects of sexual harassment data. Prior to exploring the role of language models, it is critical to recognize the current gaps in knowledge that these models could potentially address in order to comprehend and interpret the complex nature of sexual harassment.</p><p><br></p><p dir="ltr">Large Language Models (LLMs) have attracted significant attention recently due to their exceptional performance on a broad spectrum of tasks. However, these models are very sensitive to input data (Fujita et al., 2022; Wei, Wang, et al., 2022). 
Thus, the purpose of this study is to examine how various LLMs interpret data that falls under the domain of sexual harassment and how they comprehend it after Out-of-Vocabulary words are replaced.</p><p dir="ltr"><br>This research examines the impact of Out-of-Vocabulary words on the performance of LLMs in classifying sexual harassment behaviors in text. The study compares the story classification abilities of cutting-edge LLMs before and after the replacement of Out-of-Vocabulary words. Through this investigation, the study provides insights into the flexibility and contextual awareness of LLMs when managing delicate narratives in the context of sexual harassment stories, and raises awareness of sensitive social issues.</p> Crime and social justice Natural language processing Sexual harassment Large Language Models (LLMs) out-of-vocabulary (OOV) words
2	Augmenting Large Language Models with Humor Theory To Understand Puns Ryan Rony Dsilva (18429846) 25 April 2024 (has links) <p dir="ltr">This research explores the application of large language models (LLMs) to comprehension of puns. Leveraging the expansive capabilities of LLMs, this study delves into the domain of pun classification by examining it through the prism of two humor theories: the Computational Model of Humor and the Benign Violation theory, which is an extension of the N+V Theory. The computational model posits that for a phrase to qualify as a pun, it must possess both ambiguity and distinctiveness, characterized by a word that can be interpreted in two plausible ways, each interpretation being supported by at least one unique word. On the other hand, the Benign Violation theory posits that puns work by breaching one linguistic rule while conforming to another, thereby creating a "benign violation." By leveraging the capabilities of large language models (LLMs), this research endeavors to scrutinize a curated collection of English language puns. Our aim is to assess the validity and effectiveness of the use of these theoretical frameworks in accurately classifying puns. We undertake controlled experiments on the dataset, selectively removing a condition specific to one theory and then evaluating the puns based on the criteria of the other theory to see how well it classifies the altered inputs. This approach allows us to uncover deeper insights into the processes that facilitate the recognition of puns and to explore the practical implications of applying humor theories. 
The findings of our experiments, detailed in the subsequent sections, shed light on how altering specific conditions affects the ability of LLMs to accurately classify puns under each theory. The components of each theory do not influence the result to the same extent, which contributes to our understanding of humor mechanics through the eyes of LLMs.</p> Natural language processing Deep learning Computational linguistics Large Language Models (LLMs) puns wordplay humor
3	Large Language Models for Unsupervised Keyphrase Extraction and Biomedical Data Analytics Haoran Ding (18825838) 03 September 2024 (has links) <p dir="ltr">Natural Language Processing (NLP), a vital branch of artificial intelligence, is designed to equip computers with the ability to comprehend and manipulate human language, facilitating the extraction and utilization of textual data. NLP plays a crucial role in harnessing the vast quantities of textual data generated daily, facilitating meaningful information extraction. Among the various techniques, keyphrase extraction stands out due to its ability to distill concise information from extensive texts, making it invaluable for summarizing and navigating content efficiently. The process of keyphrase extraction usually begins by generating candidates first and then ranking them to identify the most relevant phrases. Keyphrase extraction can be categorized into supervised and unsupervised approaches. Supervised methods typically achieve higher accuracy as they are trained on labeled data, which allows them to effectively capture and utilize patterns recognized during training. However, the dependency on extensive, well-annotated datasets limits their applicability in scenarios where such data is scarce or costly to obtain. On the other hand, unsupervised methods, while free from the constraints of labeled data, face challenges in capturing deep semantic relationships within text, which can impact their effectiveness. Despite these challenges, unsupervised keyphrase extraction holds significant promise due to its scalability and lower barriers to entry, as it does not require labeled datasets. This approach is increasingly favored for its potential to aid in building extensive knowledge bases from unstructured data, which can be particularly useful in domains where acquiring labeled data is impractical. 
As a result, unsupervised keyphrase extraction is not only a valuable tool for information retrieval but also a pivotal technology for the ongoing expansion of knowledge-driven applications in NLP.</p><p dir="ltr">In this dissertation, we introduce three innovative unsupervised keyphrase extraction methods: AttentionRank, AGRank, and LLMRank. Additionally, we present a method for constructing knowledge graphs from unsupervised keyphrase extraction, leveraging the self-attention mechanism. The first study discusses the AttentionRank model, which utilizes a pre-trained language model to derive underlying importance rankings of candidate phrases through self-attention. This model employs a cross-attention mechanism to assess the semantic relevance between each candidate phrase and the document, enhancing the phrase ranking process. AGRank, detailed in the second study, is a sophisticated graph-based framework that merges deep learning techniques with graph theory. It constructs a candidate phrase graph using mutual attentions from a pre-trained language model. Both global document information and local phrase details are incorporated as enhanced nodes within the graph, and a graph algorithm is applied to rank the candidate phrases. The third study, LLMRank, leverages the strengths of large language models (LLMs) and graph algorithms. It employs LLMs to generate keyphrase candidates and then integrates global information through the text's graphical structures. This process reranks the candidates, significantly improving keyphrase extraction performance. The fourth study explores how self-attention mechanisms can be used to extract keyphrases from medical literature and generate query-related phrase graphs, improving text retrieval visualization. The mutual attentions of medical entities, extracted using a pre-trained model, form the basis of the knowledge graph. 
This, coupled with a specialized retrieval algorithm, allows for the visualization of long-range connections between medical entities while simultaneously displaying the supporting literature. In summary, our exploration of unsupervised keyphrase extraction and biomedical data analysis introduces novel methods and insights in NLP, particularly in information extraction. These contributions are crucial for the efficient processing of large text datasets and suggest avenues for future research and applications.</p> Natural language processing Natural Language Processing Unsupervised Keyphrase Extraction Large Language Models (LLMs) Knowledge Graph
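The abstract above describes unsupervised keyphrase extraction as a two-stage process: generate candidate phrases, then rank them. The following is a minimal sketch of that general pipeline only. It does not reproduce AttentionRank, AGRank, or LLMRank; the stopword list, the `candidates` and `rank` helpers, and the frequency-times-length score are all assumptions chosen to keep the example self-contained.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "for", "on", "from", "with", "no"}

def candidates(text, max_len=3):
    """Stage 1: collect phrases of up to max_len consecutive non-stopword tokens."""
    found = []
    # Split on punctuation so that phrases never cross clause boundaries.
    for chunk in re.split(r"[.,;:!?]", text.lower()):
        run = []
        # The trailing None is a sentinel that flushes the last run of tokens.
        for tok in re.findall(r"[a-z]+", chunk) + [None]:
            if tok is not None and tok not in STOPWORDS:
                run.append(tok)
                continue
            for i in range(len(run)):
                for j in range(i + 1, min(i + max_len, len(run)) + 1):
                    found.append(" ".join(run[i:j]))
            run = []
    return found

def rank(text, top_k=3):
    """Stage 2: score each candidate by frequency times length in words."""
    counts = Counter(candidates(text))
    scored = {phrase: n * len(phrase.split()) for phrase, n in counts.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scored, key=lambda ph: (-scored[ph], ph))[:top_k]

doc = ("Keyphrase extraction ranks candidate phrases. "
       "Unsupervised keyphrase extraction needs no labeled data.")
top = rank(doc, top_k=1)
```

The methods named in the abstract replace the simple score in `rank` with signals from pre-trained language models (self-attention, cross-attention, or graph algorithms); the two-stage shape is the only part this sketch shares with them.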
4	A Framework to Identify Online Communities for Social Media Analysis Nikhil Mehta (9750842) 16 October 2024 (has links) <p dir="ltr">Easy access, variety of content, and fast widespread interactions are some of the reasons that have made social media increasingly popular in our society. This has led many people to use social media every day for a variety of reasons, such as interacting with friends or consuming news content. Thus, understanding content on social media is more important than ever.</p><p dir="ltr">An increased understanding of social media can lead to improvements on a large number of important tasks. In this work, we particularly focus on fake news detection and political bias detection. Fake news, text published by news sources with an intent to spread misinformation and sway beliefs, is ever prevalent in today's society. Detecting it is an important and challenging step toward preventing large-scale misinformation and maintaining a healthy society. In a similar way, detecting the political bias of news content can provide insights about the different perspectives on social media.</p><p dir="ltr">In this work, we view the problem of understanding social media as reasoning over the relationships between sources, the articles they publish, and the engaging users. We start by analyzing these relationships in a graph-based framework, and then use Large Language Models to do the same. We hypothesize that the key to understanding social media is understanding these relationships, such as identifying which users have similar perspectives, or which articles are likely to be shared by similar users.</p><p dir="ltr">Throughout this thesis, we propose several frameworks to better capture the relationships on social media. We initially tackle this problem using supervised learning systems, improving them to achieve strong performance. However, we find that automatically modeling the complexities of the social media landscape is challenging. 
At the same time, having humans analyze and interact with all news content to find relationships is not scalable. Thus, we then propose to enhance our supervised approaches by treating the social media understanding problem interactively, where humans can interact to help an automated system learn a better social media representation.</p><p dir="ltr">On real-world events, our experiments show performance improvements in detecting the factuality and political bias of news sources, both when trained with and without minimal human interactions. We particularly focus on one of the most challenging setups of this task, where test data is unseen and focuses on new topics when compared with the training data. This realistic setting shows the real-world impact of our work in improving social media understanding.</p> Natural language processing social media analysis Large Language Models (LLMs) Natural Language Processing Model
5	Capturing Style Through Large Language Models - An Authorship Perspective Anuj Dubey (18398505) 10 December 2024 (has links) <p dir="ltr">This research investigates the use of Large Language Model (LLM) embeddings to capture the unique stylistic features of authors in Authorship Attribution (AA) tasks. Specifically, the focus of this research is on evaluating whether LLM-generated embeddings can effectively capture stylistic nuances that distinguish different authors, ultimately assessing their utility in tasks such as authorship attribution and clustering. The dataset comprises news articles from The Guardian authored by multiple writers, and embeddings were generated using OpenAI's text-embedding-ada-002 model. These embeddings were subsequently passed through a Siamese network with the objective of determining whether pairs of texts were authored by the same individual. The resulting model was used to generate style embeddings for unseen articles, which were then evaluated through classification and cluster analysis to assess their effectiveness in identifying individual authors across varying text samples. The classification task tested the model's accuracy in distinguishing authors, while the clustering analysis examined whether style embeddings primarily captured authorial identity or reflected domain-specific topics.</p><p dir="ltr">Our findings demonstrate that the proposed architecture achieves high accuracy for authors not previously encountered, outperforming traditional stylometric features and highlighting the effectiveness of LLM-based style embeddings. Additionally, our experiments reveal that authorship attribution accuracy decreases as the number of authors increases, yet improves with longer text lengths. </p><p dir="ltr"><br></p> Natural language processing Deep learning LLM Large Language Models LLMs Natural language processing (NLP) authorship attribution
6	AUTOMATED EVALUATION OF NEUROLOGICAL DISORDERS THROUGH ELECTRONIC HEALTH RECORD ANALYSIS Md Rakibul Islam Prince (18771646) 03 September 2024 (has links) <p dir="ltr">Neurological disorders present a considerable challenge due to their variety and diagnostic complexity, especially for older adults. Early prediction of the onset and ongoing assessment of the severity of these disease conditions can allow timely interventions. Currently, most of the assessment tools are time-consuming, costly, and not suitable for use in primary care. To reduce this burden, the present thesis introduces passive digital markers (PDMs) for different disease conditions that can effectively automate the severity assessment and risk prediction from different modalities of electronic health records (EHR). The focus of the first phase of the present study is on developing passive digital markers for the functional assessment of patients suffering from Bipolar disorder and Schizophrenia. The second phase of the study explores different architectures for passive digital markers that can predict patients at risk for dementia. The functional severity PDM uses only a single EHR modality, namely medical notes, to assess the severity of the functioning of schizophrenia, bipolar type I, or mixed bipolar patients. In this case, the input is a single medical note from the electronic medical record of the patient. This note is submitted to a hierarchical BERT model which classifies at-risk patients. A hierarchical attention mechanism is adopted because medical notes can exceed the maximum number of tokens allowed by most language models, including BERT. The functional severity PDM follows three steps. First, a sentence-level embedding is produced for each sentence in the note using a token-level attention mechanism. Second, an embedding for the entire note is constructed using a sentence-level attention mechanism. 
Third, the final embedding is classified using a feed-forward neural network which estimates the impairment level of the patient. When used prior to the onset of the disease, this PDM is able to differentiate between severe and moderate functioning levels with an AUC of 76%. Disease-specific severity assessment PDMs are only applicable after the onset of the disease and have AUCs of nearly 85% for schizophrenia and bipolar patients. The dementia risk prediction PDM considers multiple EHR modalities including socio-demographic data, diagnosis codes and medical notes. Moreover, the observation period and prediction horizon are varied for a better understanding of the practical limitations of the model. This PDM is able to identify patients at risk of dementia with AUCs ranging from 70% to 92% as the observation period approaches the index date. The present study introduces methodologies for the automation of important clinical outcomes such as the assessment of the general functioning of psychiatric patients and the prediction of risk for dementia using only routine care data.</p> Natural language processing Deep learning Neural networks Semi- and unsupervised learning language model integration Large Language Models (LLMs) machine learning and AI Dementia -- Prevention Schizophrenia Patients Schizophrenia bipolar disorder patients Psychiatric patient BERT models Llama-2
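The three steps of the functional severity PDM described above (token-level attention per sentence, sentence-level attention over the note, then a feed-forward classifier) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the thesis's hierarchical BERT model: the `attention_pool` and `note_score` helpers, the dot-product attention against a single query vector, and the one-layer logistic output are all hypothetical simplifications.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_pool(vectors, query):
    """Weighted average of vectors; weights are a softmax of dot products with query."""
    weights = softmax([dot(v, query) for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def note_score(note, token_query, sentence_query, out_weights, out_bias):
    """note is a list of sentences; each sentence is a list of token vectors."""
    # Step 1: token-level attention gives one embedding per sentence.
    sentence_embs = [attention_pool(tokens, token_query) for tokens in note]
    # Step 2: sentence-level attention gives one embedding for the whole note.
    note_emb = attention_pool(sentence_embs, sentence_query)
    # Step 3: a single logistic unit stands in for the feed-forward classifier.
    logit = dot(note_emb, out_weights) + out_bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical 2-dimensional token vectors for a note with two sentences.
note = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]],
]
score = note_score(note, token_query=[1.0, 0.0], sentence_query=[0.0, 1.0],
                   out_weights=[0.3, -0.2], out_bias=0.1)
```

Because each sentence is pooled separately, no single attention step sees more tokens than one sentence contains, which is the property that lets a hierarchical design handle notes longer than a model's token limit.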
7	Exploring artificial intelligence bias : a comparative study of societal bias patterns in leading AI-powered chatbots. Udała, Katarzyna Agnieszka January 2023 (has links) The development of artificial intelligence (AI) has revolutionised the way we interact with technology and each other, both in society and in professional careers. Although they come with great potential for productivity and automation, AI systems have been found to exhibit biases that reflect and perpetuate existing societal inequalities. With the recent rise of artificial intelligence tools exploiting large language model (LLM) technology, such as ChatGPT, Bing Chat and Bard AI, this research project aims to investigate the extent of AI bias in these tools and explore its ethical implications. By reviewing and analysing responses to carefully crafted prompts generated by three different AI chatbot tools, the author aims to determine whether the content generated by these tools exhibits patterns of bias related to various social identities, and to compare the extent to which such bias is present across all three tools. This study will contribute to the growing body of literature on AI ethics and inform efforts to develop more equitable and inclusive AI systems. By exploring the ethical dimensions of AI bias in selected LLMs, this research will shed light on the broader societal implications of AI and the role of technology in shaping our future. artificial intelligence generative AI large language models (LLMs) ChatGPT chatbot algorithmic bias ethical AI Gender Studies Genusstudier
8	Characterizing, classifying and transforming language model distributions Kniele, Annika January 2023 (has links) Large Language Models (LLMs) have become ever larger in recent years, typically demonstrating improved performance as the number of parameters increases. This thesis investigates how the probability distributions output by language models differ depending on the size of the model. For this purpose, three features for capturing the differences between the distributions are defined, namely the difference in entropy, the difference in probability mass in different slices of the distribution, and the difference in the number of tokens covering the top-p probability mass. The distributions are then put into different distribution classes based on how they differ from the distributions of the differently-sized model. Finally, the distributions are transformed to be more similar to the distributions of the other model. The results suggest that classifying distributions before transforming them, and adapting the transformations based on which class a distribution is in, improves the transformation results. It is also shown that letting a classifier choose the class label for each distribution yields better results than using random labels. Furthermore, the findings indicate that transforming the distributions using entropy and the number of tokens in the top-p probability mass makes the distributions more similar to the targets, while transforming them based on the probability mass of individual slices of the distributions makes the distributions more dissimilar. Large Language Models (LLMs) GPT BERT NLP deep learning machine learning computational linguistics language technology
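The three features named in the abstract above (difference in entropy, difference in probability mass within a slice of the distribution, and difference in the number of tokens covering the top-p mass) can be computed as in the sketch below. It assumes each distribution is a plain list of next-token probabilities; the function names, the example distributions, and the choice of the single top-ranked token as the "slice" are illustrative assumptions, not the thesis's definitions.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_p_token_count(probs, p=0.9):
    """Smallest number of highest-probability tokens whose total mass reaches p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p - 1e-12:  # small tolerance for floating-point rounding
            return count
    return len(probs)

def slice_mass(probs, start, stop):
    """Probability mass in ranks [start, stop) after sorting by descending probability."""
    return sum(sorted(probs, reverse=True)[start:stop])

def distribution_features(small, large, p=0.9):
    """Feature differences between two models' distributions for the same context."""
    return {
        "entropy_diff": entropy(small) - entropy(large),
        "top_p_count_diff": top_p_token_count(small, p) - top_p_token_count(large, p),
        "head_mass_diff": slice_mass(small, 0, 1) - slice_mass(large, 0, 1),
    }

# Hypothetical next-token distributions from a smaller and a larger model.
small = [0.4, 0.3, 0.2, 0.1]
large = [0.7, 0.2, 0.05, 0.05]
features = distribution_features(small, large)
```

In this made-up example the smaller model's distribution is flatter, so its entropy is higher and it needs one more token than the larger model to cover 90% of the probability mass.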
9	Enhancing Industrial Process Interaction Using Deep Learning, Semantic Layers, and Augmented Reality Izquierdo Doménech, Juan Jesús 24 June 2024 (has links) Thesis by compendium of publications / [EN] Augmented Reality (AR), with its ability to overlay synthetic content on a real image, offers great value in various fields, and industry is one of the fields that can benefit most from it. As a key technology in the evolution towards Industry 4.0 and 5.0, AR not only complements but also enhances human interaction with industrial processes. In this context, AR becomes an essential tool that does not replace the human factor but enriches it, expanding its capabilities and facilitating more effective collaboration between humans and technology. This integration of AR in industrial environments not only improves the efficiency and precision of tasks but also opens new possibilities for expanding human potential. There are numerous ways in which humans interact with technology, with AR being one of the most innovative paradigms in how users access information; however, it is crucial to recognize that AR, by itself, has limitations in terms of interpreting the content it visualizes. 
Although today we can access different libraries that use algorithms for image, object, or even environment detection, a fundamental question arises: To what extent can AR understand the context of what it sees? This question becomes especially relevant in industrial environments. Can AR discern if a machine functions correctly, or is its role limited to presenting superimposed digital indicators? The answer to these questions underscores both the potential and the limits of AR, driving the search for innovations that allow for greater contextual understanding and adaptability to specific situations within the industry. At the core of this thesis lies the objective of not only endowing AR with "semantic intelligence" capable of interpreting and adapting to context, but also of expanding and enriching the ways users interact with this technology. This approach mainly aims to improve the accessibility and efficiency of AR applications in industrial environments, which are by nature restricted and complex. The intention is to go beyond the traditional limits of AR, providing more intuitive and adaptive tools for operators in these environments. The research unfolds through three articles, where a progressive multimodal architecture has been developed and evaluated. This architecture integrates various user-technology interaction modalities, such as voice control, direct manipulation, and visual feedback in AR. In addition, advanced technologies based on Machine Learning (ML) and Deep Learning (DL) models are incorporated to extract and process semantic information from the environment. Each article builds upon the previous one, demonstrating an evolution in AR's ability to interact more intelligently and contextually with its environment, and highlighting the practical application and benefits of these innovations in the industry. / Izquierdo Doménech, JJ. (2024). 
Enhancing Industrial Process Interaction Using Deep Learning, Semantic Layers, and Augmented Reality [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/205523 / Compendio Convolutional Neural Networks (CNN) Augmented Reality (AR) Large Language Models (LLMs) Multimodal interaction Deep Learning Industry Semantics Transformers LENGUAJES Y SISTEMAS INFORMATICOS
10	CHEMICAL SPACE INVADERS: ENHANCING EXPLORATION OF MODULARLY CONSTRUCTED CHEMICAL SPACES USING CONTEXT AWARE AI AGENTS Matthew Muhoberac (19820007) 10 October 2024 (has links) <p dir="ltr">Chemical science can be imagined as a universe of information in which individual galaxies, solar systems, stars, and planets are compounds, reactions, biomolecules, etc. which need to be discovered, researched, and documented. The problem with this is that the universe of chemical science is potentially vaster than the one in which we live, and we are exploring it in a relatively inefficient manner. There is a scene in one of my favorite television shows, Futurama, which paints a picture of traditional chemical exploration. Taking place in the 30<sup>th</sup> century, the main character Fry loses his robot friend Bender in outer space and resorts to using a giant telescope in the Himalayan mountains to randomly search through points in space to try to find him. After days of searching nonstop, he gives up, noting that it is an impossible task because space is so vast and he is searching so inefficiently. While human exploration of chemistry may not be as inefficient, many steps are driven by trial and error and educated guesswork, which ultimately introduce major inefficiencies into scientific discovery. While we don’t live in the 30<sup>th</sup> century yet, we do have access to 21<sup>st</sup> century technology which can assist in exploring chemistry in a more directed manner. This mainly involves using machine learning, search algorithms, and generative powered exploratory AI as a force multiplier that assists human chemists in chemical exploration. 
To shamelessly draw on another space-based sci-fi reference, this would be akin to deploying hundreds or thousands of automated space probes to search unexplored planets, much as the Empire found the Rebellion on Hoth in The Empire Strikes Back.</p><p dir="ltr">The journey to integrate AI with chemical exploration starts with the important concept of standardization and how to apply it to chemically relevant data. To easily organize, store, and access relevant aspects of small molecules, macromolecules, chemical reactions, biological assays, etc., it is imperative that data be represented in a standard format that accurately portrays the necessary chemical information. This becomes especially relevant as humans aggregate more and more chemical data. In this thesis, we tackle a subset of standardization in Chapter 2, involving benchmarking sets for comparative evaluation of docking software. One major reason standardization is so important is that it promotes ease of access to relevant data, regardless of whether that access is attempted by human or computational means. While improving data access for humans is beneficial, for computational access it is a game changer when mining training data for machine learning (ML) applications. Having standardized data readily available for computational access allows software to rapidly access and preprocess relevant data, which boosts efficiency in ML model training. In Chapter 4 of this thesis, the central database of the CIPHER close-loop system is standardized and integrated with a REST API, allowing for rapid data acquisition via a structured URL call. 
Having database standardization and a mechanism for easy data mining makes a database “ML ready” and promotes its use in ML applications.</p><p dir="ltr">Building upon data standardization and the training of ML models for chemical applications, the next step of this journey revolves around a concept known as a “chemical space” and how chemists can approximate and sample chemical spaces in a directed manner. In the context of this thesis, a chemical space can be visualized in the following manner. Start by envisioning any chemical relationship between some inputs and outputs as an unknown mathematical function. For example, if one is measuring the assay response of a specific drug at a certain concentration, the input would be the concentration, and the output would be the assay response. The bounds of this space are then set by determining the range of input values, and this forms a chemical space that corresponds to the chemical problem. Chemists sample these spaces every day when they go into the lab, run experiments, and analyze their data. While the example described above is relatively simple in scope, even a very complex relationship can be approximated with techniques such as ML. An example of this approximation is shown in Chapter 3 of this thesis, where a normalizing flow architecture is used to bias a vector space representation of molecules with chemical properties, creating a space that correlates compound and property and can be sampled to provide compounds with specific values of the trained chemical properties. Training individual models is important, but to truly emulate certain chemical processes, multiple models may need to be combined with physical instrumentation to efficiently sample and validate a chemical space. 
Chapter 4 of this thesis expands upon this concept by integrating a variety of ML modules with high-throughput (HT) bioassay instrumentation to create a “close loop” system designed around discovering, synthesizing, and validating non-addictive analgesics.</p><p dir="ltr">The final step of this journey is to integrate these systems that sample chemical spaces with AI, allowing for automated exploration of these spaces in a directed manner. There are several AI frameworks that can be used separately or combined to accomplish this task, but the framework that is the focus of this thesis is AI agents. AI agents are entities that use some form of AI as a logical processing center that drives their exploration through a problem space. This can be a simple algorithm, some type of heuristic model, or an advanced form of generative AI such as an LLM. Additionally, these agents generally have access to certain tools that serve as a medium for interaction with physical or computational environments, such as controlling a robotic arm or searching a database. Finally, these agents generally have a notion of past actions and observations, commonly referred to as memory, which allows them to recall important information as they explore. Chapter 5 of this thesis details a custom agentic framework tailored towards complex scientific applications. This framework builds agents from source documentation around a specific user-defined scope, provides them with access to literature and documentation in the form of embeddings, has custom memory for highly targeted retention, and allows agents to communicate with one another to promote collaborative problem solving. Chapter 6 of this thesis showcases an application of a simpler agentic framework to an automated lipidomic workflow that performs comparative analysis on 5xFAD vs. WT mouse brain tissue. 
The group of AI agents involved in this system generates mass spectrometry worklists, filters data into categories for analysis, performs comparative analysis, and allows the user to dynamically create plots that can be used to answer specific statistical questions. In addition to performing all these operational and statistical analysis functions, the system includes an agent that uses document embeddings built from curated technical manuals and protocols to answer user questions via a chatbot-style interface. Overall, the system showcases how AI can effectively be applied to relevant chemical problems to enhance speed, bolster accuracy, and improve usability.</p> Computational chemistry Human-computer interaction Artificial Intelligence AI Agents Large Language Models (LLMs) Drug Discovery Mass Spectrometry
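
The abstract above describes a chemical space as an unknown function over a bounded input range, and an AI agent as a decision-making core with tools and memory. The following Python sketch is one minimal way to picture how those pieces fit together. It is not the framework from the thesis: `toy_assay`, its Hill-curve parameters, the `Agent` class, and the bisection heuristic are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


def toy_assay(concentration: float) -> float:
    """Hypothetical stand-in for the unknown relationship (a Hill-type curve)."""
    ec50, hill = 2.0, 1.5  # assumed values, for illustration only
    return concentration**hill / (ec50**hill + concentration**hill)


@dataclass
class Agent:
    # Tools: the agent's medium for interacting with its environment.
    tools: Dict[str, Callable[[float], float]]
    # Bounds: the range of input values that defines the chemical space.
    bounds: Tuple[float, float]
    # Memory: past (input, output) observations.
    memory: List[Tuple[float, float]] = field(default_factory=list)

    def decide(self) -> float:
        """Heuristic decision core: probe both bounds first, then bisect
        the interval in which the observed response changes the most."""
        low, high = self.bounds
        seen = sorted(self.memory)
        if not seen:
            return low
        if len(seen) == 1:
            return high
        gaps = [
            (abs(y2 - y1), (x1 + x2) / 2)
            for (x1, y1), (x2, y2) in zip(seen, seen[1:])
        ]
        return max(gaps)[1]

    def step(self) -> Tuple[float, float]:
        """Choose an input, run the assay tool, and remember the result."""
        x = self.decide()
        y = self.tools["assay"](x)
        self.memory.append((x, y))
        return x, y


# Explore concentrations between 0 and 10 (arbitrary units) in 8 steps.
agent = Agent(tools={"assay": toy_assay}, bounds=(0.0, 10.0))
for _ in range(8):
    agent.step()
samples = sorted(agent.memory)
```

In this toy setup the agent concentrates its samples where the response changes fastest, which is a simple form of directed rather than random sampling. A real system would replace the heuristic with a learned model or an LLM and replace the assay function with an instrument or database call.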

Search results