Global ETD Search

141	Active Learning for Extractive Question Answering Marti Roman, Salvador January 2022 Data labelling for question answering tasks (QA) is a costly procedure that requires oracles to read lengthy excerpts of texts and reason to extract an answer for a given question from within the text. QA is a task in natural language processing (NLP), where a majority of recent advancements have come from leveraging the vast corpora of unlabelled and unstructured text available online. This work aims to extend this trend in the efficient use of unlabelled text data to the problem of selecting which subset of samples to label in order to maximize performance. This practice of selective labelling is called active learning (AL). Recent developments in AL for NLP have introduced the use of self-supervised learning on large corpora of text in the labelling process of samples for classification problems. This work adapts this research to the task of question answering and performs an initial exploration of expected performance. The methods covered in this work use uncertainty estimates obtained from neural networks to guide an incremental labelling process. These estimates are obtained from transformer-based models, previously trained in a self-supervised manner, by calculating the entropy of the confidence scores or with an approximation of Bayesian uncertainty obtained through Monte Carlo dropout. These methods are evaluated on two different benchmarking QA datasets: SQuAD v1 and TriviaQA. Several factors are observed to influence the behaviour of these uncertainty-based acquisition functions, including the choice of language model used, the presence of unanswerable questions and the acquisition size used in the incremental process. The study finds no evidence that averaging or selecting maximal uncertainty values between the classification of an answer’s starting and ending positions affects sample acquisition quality. 
However, language model choice, the presence of unanswerable questions and acquisition size are all identified as key factors affecting consistency between runs and degree of success. Machine Learning Deep Learning Active Learning Natural Language Processing NLP Question Answering Transformers Uncertainty Language Models Probability Theory and Statistics
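Illustration for entry 141. The abstract describes scoring unlabelled samples by the entropy of the model's confidence scores for the answer's start and end positions, combining the two by averaging or taking the maximum, and labelling the most uncertain samples in batches of a given acquisition size. The following is a minimal sketch of that idea under stated assumptions, not the thesis's implementation: the function names, the dictionary layout of the pool and the example probabilities are all hypothetical, and the probabilities are assumed to come from some QA model that is not shown.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def span_uncertainty(start_probs, end_probs, combine="mean"):
    """Combine start- and end-position entropies into one score.

    combine="mean" averages the two entropies; combine="max" keeps the larger.
    """
    h_start, h_end = entropy(start_probs), entropy(end_probs)
    if combine == "mean":
        return (h_start + h_end) / 2.0
    if combine == "max":
        return max(h_start, h_end)
    raise ValueError("combine must be 'mean' or 'max'")


def select_for_labelling(pool, acquisition_size, combine="mean"):
    """Return the ids of the most uncertain samples in the unlabelled pool.

    pool maps a sample id to a (start_probs, end_probs) pair (hypothetical layout).
    """
    ranked = sorted(
        pool,
        key=lambda sample_id: span_uncertainty(*pool[sample_id], combine=combine),
        reverse=True,
    )
    return ranked[:acquisition_size]


# Hypothetical pool of three questions over a four-token context.
pool = {
    "q1": ([0.97, 0.01, 0.01, 0.01], [0.97, 0.01, 0.01, 0.01]),  # confident
    "q2": ([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]),  # maximally unsure
    "q3": ([0.70, 0.10, 0.10, 0.10], [0.25, 0.25, 0.25, 0.25]),  # unsure about the end
}
print(select_for_labelling(pool, acquisition_size=2))  # ['q2', 'q3']
```

In an incremental loop, the selected samples would be labelled, added to the training set, and the model retrained before the next round of scoring.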
142	Neural Sequence Modeling for Domain-Specific Language Processing: A Systematic Approach Zhu, Ming 14 August 2023 In recent years, deep learning based sequence modeling (neural sequence modeling) techniques have made substantial progress in many tasks, including information retrieval, question answering, information extraction and machine translation. Benefiting from the highly scalable attention-based Transformer architecture and enormous open access online data, large-scale pre-trained language models have shown great modeling and generalization capacity for sequential data. However, not all domains benefit equally from the rapid development of neural sequence modeling. Domains like healthcare and software engineering have vast amounts of sequential data containing rich knowledge, yet remain under-explored due to a number of challenges: 1) the distribution of the sequences in specific domains is different from the general domain; 2) the effective comprehension of domain-specific data usually relies on domain knowledge; and 3) the labelled data is usually scarce and expensive to get in domain-specific settings. In this thesis, we focus on the research problem of applying neural sequence modeling methods to address both common and domain-specific challenges from the healthcare and software engineering domains. We systematically investigate neural-based machine learning approaches to address the above challenges in three research directions: 1) learning with long sequences, 2) learning from domain knowledge and 3) learning under limited supervision. Our work can also potentially benefit more domains with large amounts of sequential data. / Doctor of Philosophy / In the last few years, computer programs that learn and understand human languages (an area called machine learning for natural language processing) have significantly improved. 
These advances are visible in various areas such as retrieving information, answering questions, extracting key details from texts, and translating between languages. A key to these successes has been the use of a type of neural network structure known as a "Transformer", which can process and learn from lots of information found online. However, these successes are not uniform across all areas. Two fields, healthcare and software engineering, still present unique challenges despite having a wealth of information. Some of these challenges include the different types of information in these fields, the need for specific expertise to understand this information, and the shortage of labeled data, which is crucial for training machine learning models. In this thesis, we focus on the use of machine learning for natural language processing methods to solve these challenges in the healthcare and software engineering fields. Our research investigates learning with long documents, learning from domain-specific expertise, and learning when there's a shortage of labeled data. The insights and techniques from our work could potentially be applied to other fields that also have a lot of sequential data. Machine Learning for Code Machine Learning for Healthcare Information Retrieval Question Answering Entity Linking Program Translation Code Refinement Sequence-to-Sequence Models
143	[en] A NOVEL SOLUTION TO EMPOWER NATURAL LANGUAGE INTERFACES TO DATABASES (NLIDB) TO HANDLE AGGREGATIONS / [pt] UMA NOVA SOLUÇÃO PARA CAPACITAR INTERFACES DE LINGUAGEM NATURAL PARA BANCOS DE DADOS (NLIDB) PARA LIDAR COM AGREGAÇÕES ALEXANDRE FERREIRA NOVELLO 19 July 2021 Question Answering (QA) is a field of study dedicated to building systems that automatically answer questions asked in natural language. The translation of a question asked in natural language into a structured query (SQL or SPARQL) in a database is also known as Natural Language Interface to Database (NLIDB). NLIDB systems usually do not deal with aggregations, which can have the following elements: aggregation functions (such as count, sum, average, minimum and maximum), a grouping clause (GROUP BY) and a HAVING clause. However, they deliver good results for normal queries. This dissertation addresses the creation of a generic module, to be used in NLIDB systems, that allows such systems to perform queries with aggregations, provided that the query results the NLIDB returns are, or can be transformed into, a result set in tabular form. The work covers aggregations with specificities such as ambiguities, timescale differences, aggregations over multiple attributes, the use of superlative adjectives, basic recognition of units of measure, aggregations over attributes with compound names and subqueries with aggregation functions nested up to two levels. DATABASES DATABASE DESIGN NATURAL LANGUAGE PROCESSING - NLP QUESTION ANSWERING - QA AGGREGATION SQL
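Illustration for entry 143. The abstract lists the elements an aggregation can have (aggregation functions, a GROUP BY clause and a HAVING clause) and requires that the NLIDB's result be tabular. A minimal sketch of applying those three elements to a tabular result is given below, using an in-memory SQLite database from Python's standard library. It is not the dissertation's module: the table name, column names, function name and sample rows are hypothetical.

```python
import sqlite3


def aggregate_tabular_result(rows, min_total):
    """Aggregate (category, amount) rows with COUNT/SUM/AVG, GROUP BY and HAVING.

    rows is a list of (category, amount) tuples standing in for the tabular
    result an NLIDB would return (hypothetical schema).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE result (category TEXT, amount REAL)")
        conn.executemany("INSERT INTO result VALUES (?, ?)", rows)
        query = """
            SELECT category,
                   COUNT(*)    AS n,
                   SUM(amount) AS total,
                   AVG(amount) AS mean
            FROM result
            GROUP BY category          -- grouping clause
            HAVING SUM(amount) > ?     -- filter applied to each group
            ORDER BY total DESC
        """
        return conn.execute(query, (min_total,)).fetchall()
    finally:
        conn.close()


rows = [("A", 10.0), ("A", 30.0), ("B", 5.0), ("C", 50.0)]
print(aggregate_tabular_result(rows, min_total=20.0))
# [('C', 1, 50.0, 50.0), ('A', 2, 40.0, 20.0)]
```

Group B is dropped by the HAVING clause because its total (5.0) does not exceed the threshold.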
144	Visual question answering with modules and language modeling Pahuja, Vardaan 04 1900 No description available. Visual Question Answering Visual Reasoning Modular Networks Neural Structure Optimization Language Modeling
145	Identification of Online Users' Social Status via Mining User-Generated Data Zhao, Tao 05 September 2019 No description available. 510 Socioeconomic Status Social Community Question Answering Mobile Phone Data Social Media Content Factor Graph Hypergraph Coupled Attribute Representation Bidirectional Long Short-Term Memory User-Generated Data Social Status Topical Opinion Leader Computer Science (PPN619939052)
146	A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning Franco Salvador, Marc 03 July 2017 Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human languages. One of its most challenging aspects involves enabling computers to derive meaning from human natural language. To do so, several meaning or context representations have been proposed with competitive performance. However, these representations still have room for improvement when working in a cross-domain or cross-language scenario. In this thesis we study the use of knowledge graphs as a cross-domain and cross-language representation of text and its meaning. A knowledge graph is a graph that expands and relates the original concepts belonging to a set of words. We obtain its characteristics using a wide-coverage multilingual semantic network as the knowledge base. This provides coverage of hundreds of languages and millions of general and specific concepts. As the starting point of our research we employ knowledge graph-based features - along with other traditional ones and meta-learning - for the NLP task of single- and cross-domain polarity classification. The analysis and conclusions of that work provide evidence that knowledge graphs capture meaning in a domain-independent way. The next part of our research takes advantage of the multilingual semantic network and focuses on cross-language Information Retrieval (IR) tasks. First, we propose a fully knowledge graph-based model of similarity analysis for cross-language plagiarism detection. Next, we improve that model to cover out-of-vocabulary words and verbal tenses and apply it to cross-language document retrieval, categorisation, and plagiarism detection. Finally, we study the use of knowledge graphs for the NLP tasks of community question answering, native language identification, and language variety identification. The contributions of this thesis demonstrate the potential of knowledge graphs as a cross-domain and cross-language representation of text and its meaning for NLP and IR tasks. These contributions have been published in several international conferences and journals. / Franco Salvador, M. (2017). A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/84285 / THESIS cross-language plagiarism detection document retrieval document categorization cross-domain polarity classification natural language processing information retrieval question answering native language identification language variety identification multilingual semantic network knowledge graphs LENGUAJES Y SISTEMAS INFORMATICOS
147	Zero-shot, One Kill: BERT for Neural Information Retrieval Efes, Stergios January 2021 [Background]: The advent of Bidirectional Encoder Representations from Transformers (BERT) language models (Devlin et al., 2018) and of MS Marco, a large-scale human-annotated dataset for machine reading comprehension (Bajaj et al., 2016) that was made publicly available, led the field of information retrieval (IR) to experience a revolution (Lin et al., 2020). The BERT-based retrieval model of Nogueira and Cho (2019) became, at the time their paper was published, the top entry in the MS Marco passage-reranking leaderboard, surpassing the previous state of the art by 27% in MRR@10. However, training such neural IR models for domains other than MS Marco is still hard, because neural approaches often require a vast amount of training data to perform effectively, and such data is not always available. To address the shortage of labelled data, a new line of research emerged: training neural models with weak supervision. In weak supervision, labels for an unlabelled dataset are generated automatically using an existing model, and a machine learning model is then trained on the artificial “weak” data. In the case of weak supervision for IR, the training dataset comes in the form of (query, passage) tuples. Dehghani et al. (2017) used the AOL query logs (Pass et al., 2006), a set of millions of real web queries, and BM25 to retrieve the relevant passages for each user query. A drawback of this approach is that it is hard to obtain query logs for every domain. [Objective]: This thesis proposes an intuitive approach for addressing the shortage of data in domains with limited or no data at all through transfer learning in the context of IR. We leverage Wikipedia’s structure to create a Wikipedia-based generic IR training dataset for zero-shot neural models. 
[Method]: We create the “pseudo-queries” by concatenating the title of each Wikipedia article with each of its section titles, and we consider the associated section’s passage to be the relevant passage for the pseudo-query. All of our experiments are evaluated on a standard collection: MS Marco, which is a large-scale web collection. For our zero-shot experiments, our proposed model, called “Wiki”, is a BERT model trained on the artificial Wikipedia-based dataset, and the baseline is a default BERT model without any additional training. In our second line of experiments, we explore the benefits gained by pre-fine-tuning on the Wikipedia-based IR dataset and further fine-tuning on in-domain data. Our proposed model, “Wiki+Ma”, is a BERT model pre-fine-tuned on the Wikipedia-based dataset and further fine-tuned on MS Marco, while the baseline is a BERT model fine-tuned only on MS Marco. [Results]: Results from our first experiments show that our BERT model trained on the Wikipedia-based IR dataset, called “Wiki”, achieves 0.197 in MRR@10, which is about 10 points higher than a BERT model with default weights; in addition, results on the development set indicate that the “Wiki” model performs better than a BERT model trained on in-domain data when that data comprises between 10k and 50k instances. Results from our second line of experiments show that pre-fine-tuning on the Wikipedia-based IR dataset benefits later fine-tuning steps on in-domain data in terms of stability. [Conclusion]: Our findings suggest that transfer learning for IR tasks by leveraging the generic knowledge incorporated in Wikipedia is possible, though more experimentation is needed to understand its limitations in comparison with traditional approaches such as BM25. 
neural information retrieval passage ranking weak supervision question answering passage reranking BERT transfer-learning in IR zero-shot IR passage-retrieval BERT for passage-retrieval MS Marco information retrieval neural IR
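Illustration for entry 147. The abstract describes building pseudo-queries by concatenating a Wikipedia article title with each section title and treating the section's passage as the relevant passage, and it reports results in MRR@10. Both steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's code: the article dictionary layout, the function names, the single-space separator and the example data are hypothetical.

```python
def build_pseudo_queries(articles):
    """Turn articles into (pseudo_query, relevant_passage) training pairs.

    Each article is assumed to look like
    {"title": str, "sections": [(section_title, passage_text), ...]}.
    """
    pairs = []
    for article in articles:
        for section_title, passage in article["sections"]:
            pseudo_query = f'{article["title"]} {section_title}'
            pairs.append((pseudo_query, passage))
    return pairs


def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    rankings maps a query id to a ranked list of passage ids; relevant maps a
    query id to the set of relevant passage ids. A query with no relevant
    passage in its top k contributes 0.
    """
    total = 0.0
    for query_id, ranked_passages in rankings.items():
        for rank, passage_id in enumerate(ranked_passages[:k], start=1):
            if passage_id in relevant[query_id]:
                total += 1.0 / rank
                break
    return total / len(rankings)


articles = [
    {"title": "Information retrieval",
     "sections": [("History", "passage A"), ("Evaluation", "passage B")]},
]
print(build_pseudo_queries(articles))
# [('Information retrieval History', 'passage A'),
#  ('Information retrieval Evaluation', 'passage B')]

rankings = {"q1": ["p3", "p1"], "q2": ["p9", "p8"]}
relevant = {"q1": {"p1"}, "q2": {"p0"}}
print(mrr_at_k(rankings, relevant))  # 0.25
```

In the second example, q1 finds its relevant passage at rank 2 (reciprocal rank 0.5) and q2 finds none, so the mean is 0.25.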
148	Neural Methods Towards Concept Discovery from Text via Knowledge Transfer Das, Manirupa January 2019 No description available. Computer Engineering Computer Science Information Science Library Science Linguistics
149	Question-answering chatbot for Northvolt IT Support Hjelm, Daniel January 2023 Northvolt is a Swedish battery manufacturing company that specializes in the production of sustainable lithium-ion batteries for electric vehicles and energy storage systems. Established in 2016, the company has experienced significant growth in recent years. This growth has presented a major challenge for the IT Support team, as they face a substantial volume of IT-related inquiries. To address this challenge and allow the IT Support team to concentrate on more complex support tasks, a question-answering chatbot has been implemented as part of this thesis project. The chatbot has been developed using the Microsoft Bot Framework and leverages Microsoft cloud services, specifically Azure Cognitive Services, to provide intelligent and cognitive capabilities for answering employee questions directly within Microsoft Teams. The chatbot has undergone testing by a diverse group of employees from various teams within the organization and was evaluated based on three key metrics: effectiveness (including accuracy, precision, and intent recognition rate), efficiency (including response time and scalability), and satisfaction. The test results indicate that the accuracy, precision, and intent recognition rate fall below the required thresholds for production readiness. However, these metrics can be improved by expanding the knowledge base of the bot. The chatbot demonstrates impressive efficiency in terms of response time and scalability, and its user-friendly nature contributes to a positive user experience. Users express high levels of satisfaction with their interactions with the bot, and the majority would recommend it to their colleagues, recognizing it as a valuable service solution that will benefit all employees at Northvolt in the future. 
Moving forward, the primary focus should be on expanding the knowledge base and effectively communicating the bot’s purpose and scope to enhance effectiveness and satisfaction. Additionally, integrating the bot with advanced AI features, such as OpenAI’s language models available within Microsoft’s ecosystem, would elevate the bot to the next level. Artificial intelligence Chatbot Natural language processing Natural language understanding Machine learning Deep learning Transformer Question answering Conversational agents Conversational AI Computer Sciences
150	Result Diversification on Spatial, Multidimensional, Opinion, and Bibliographic Data Kucuktunc, Onur 01 October 2013 No description available. Computer Science diversity relevance graph mining result diversification indexes nearest neighbor search spatial databases collaborative question answering prediction sentiment analysis literature search graph random walks paper recommendation web service
