131

Deep neural semantic parsing: translating from natural language into SPARQL / Análise semântica neural profunda: traduzindo de linguagem natural para SPARQL

Luz, Fabiano Ferreira 07 February 2019
Semantic parsing is the process of mapping a natural-language sentence into a machine-readable, formal representation of its meaning. The LSTM encoder-decoder is a neural architecture with the ability to map a source language into a target one. We are interested in the problem of mapping natural language into SPARQL queries, and we seek to contribute strategies that do not rely on handcrafted rules, high-quality lexicons, manually built templates or other handmade complex structures. In this context, we present two contributions to the semantic parsing problem, both building on the LSTM encoder-decoder. While natural language has well-defined vector-representation methods that draw on very large volumes of text, formal languages such as SPARQL suffer from a lack of suitable methods for vector representation. Our first contribution improves the vector representation of SPARQL: we obtain an alignment matrix between the two vocabularies, natural-language terms and SPARQL terms, which allows us to refine a vector representation of SPARQL items. With this refinement we obtained better results in the subsequent training of the semantic parsing model. Our second contribution is a neural architecture, which we call Encoder CFG-Decoder, whose output conforms to a given context-free grammar. Unlike the traditional LSTM encoder-decoder, our model provides a grammatical guarantee for the mapping process, which is particularly important in practical settings where grammatical errors can cause critical failures in a compiler or interpreter. Results confirm that every output generated by our model obeys the given CFG, and we observe an improvement in translation accuracy compared with other results from the literature.
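The grammatical guarantee can be illustrated independently of the thesis code: at each decoding step, only the continuations licensed by the CFG are scored, so every finished output is derivable from the grammar by construction. A minimal Python sketch, with a hypothetical SPARQL-like toy grammar and random scores standing in for the neural decoder (this is not the thesis' Encoder CFG-Decoder, just the masking idea behind it):

```python
import numpy as np

# Hypothetical toy CFG for a SPARQL-like fragment (illustration only).
GRAMMAR = {
    "QUERY":  [["SELECT", "VAR", "WHERE", "TRIPLE"]],
    "TRIPLE": [["VAR", "PRED", "VAR"], ["VAR", "PRED", "LITERAL"]],
}
TERMINALS = {"SELECT", "WHERE", "VAR", "PRED", "LITERAL"}

def decode(score_fn, start="QUERY", max_steps=50):
    """Greedy CFG-constrained decoding.

    score_fn(prefix, choices) -> one model score per candidate production.
    Only grammar-licensed productions are ever scored, so any finished
    output is guaranteed to be derivable from the CFG.
    """
    stack, output = [start], []
    for _ in range(max_steps):
        if not stack:
            break
        symbol = stack.pop()
        if symbol in TERMINALS:
            output.append(symbol)        # terminals are emitted directly
            continue
        choices = GRAMMAR[symbol]        # the mask: valid productions only
        scores = score_fn(output, choices)
        best = choices[int(np.argmax(scores))]
        stack.extend(reversed(best))     # keep leftmost-derivation order
    return output

# Random scores stand in for the trained decoder.
rng = np.random.default_rng(0)
print(decode(lambda prefix, choices: rng.random(len(choices))))
```

Because invalid productions are masked out before scoring rather than filtered afterwards, grammaticality holds for every output, mirroring the guarantee claimed above.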
132

Espaces de Müntz, plongements de Carleson, et opérateurs de Cesàro / Müntz spaces, Carleson embeddings and Cesàro operators

Gaillard, Loïc 07 December 2017
For a sequence ⋀ = (λn) satisfying the Müntz condition Σn 1/λn < +∞ and for p ∈ [1,+∞), we define the Müntz space Mp⋀ as the closed subspace of Lp([0, 1]) spanned by the monomials yn : t ↦ tλn. The space M∞⋀ is defined in the same way as a subspace of C([0, 1]). When the sequence (λn + 1/p)n is lacunary with a large ratio, we prove that the sequence of normalized Müntz monomials (gn) in Lp is (1 + ε)-isometric to the canonical basis of lp. In the case p = +∞, the monomials (yn) form a sequence which is (1 + ε)-isometric to the summing basis of c. These results are asymptotic refinements of a well-known theorem for lacunary sequences. On the other hand, for p ∈ [1, +∞), we investigate the Carleson measures for Müntz spaces, which are defined as the Borel measures μ on [0, 1) such that the embedding operator Jμ,p : Mp⋀ ⊂ Lp(μ) is bounded. When ⋀ is lacunary, we prove that if the (gn) are uniformly bounded in Lp(μ), then for any q > p, the measure μ is a Carleson measure for Mq⋀. These questions are closely related to the behaviour of μ in the neighborhood of 1. We also find geometric conditions on the behaviour of μ near the point 1 that ensure the compactness of Jμ,p, or its membership in some thinner operator ideals. More precisely, we estimate the approximation numbers of Jμ,p in the lacunary case, and we even obtain equivalents for particular lacunary sequences ⋀. At last, we show that the essential norm of the Cesàro-mean operator Γp : Lp → Lp coincides with its norm, which is p'. This result is also valid for the discrete Cesàro operator. We introduce Müntz subspaces of the Cesàro function spaces Cesp, for p ∈ [1, +∞]. We show that the essential norm of the multiplication operator TΨ is ∥Ψ∥∞ in the Cesàro spaces and |Ψ(1)| in the Müntz-Cesàro spaces.
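For reference, the objects named in this abstract can be transcribed compactly (a restatement of the definitions above, not of the thesis' proofs):

```latex
% Müntz condition and Müntz space, as defined in the abstract
\[
  \sum_{n} \frac{1}{\lambda_n} < +\infty,
  \qquad
  M^{p}_{\Lambda} \;=\; \overline{\operatorname{span}}^{\,L^{p}([0,1])}
    \{\, y_n \colon t \mapsto t^{\lambda_n} \,\},
  \quad p \in [1,+\infty).
\]
% A Borel measure \mu on [0,1) is a Carleson measure for M^{p}_{\Lambda}
% when the embedding J_{\mu,p}\colon M^{p}_{\Lambda} \hookrightarrow L^{p}(\mu)
% is bounded.
```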
133

An analysis of hierarchical text classification using word embeddings

Stein, Roger Alan 28 March 2018
Funded by CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior). / Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvement on automatic document classification tasks. However, the effectiveness of such techniques had not yet been assessed for hierarchical text classification (HTC). This study investigates the application of those models and algorithms to this specific problem by means of experimentation and analysis. Classification models were trained with prominent machine learning algorithm implementations—fastText, XGBoost, and Keras' CNN—and notable word-embedding generation methods—GloVe, word2vec, and fastText—on publicly available data, and evaluated with measures specifically appropriate for the hierarchical context. FastText achieved an LCAF1 of 0.871 on a single-labeled version of the RCV1 dataset. The analysis of the results indicates that using word embeddings is a very promising approach for HTC.
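The hierarchical measures mentioned above reward taxonomically close predictions. A hedged Python sketch of ancestor-augmented hierarchical F1, a simpler relative of the LCAF1 reported here (the two-level hierarchy and labels are invented for illustration):

```python
# Each label is expanded with all of its ancestors before computing
# set-based F1, so a near-miss in the taxonomy still earns partial credit.
PARENT = {"sports": "news", "soccer": "sports", "politics": "news"}

def with_ancestors(labels):
    out = set()
    for label in labels:
        while label is not None:
            out.add(label)
            label = PARENT.get(label)
    return out

def hierarchical_f1(true_labels, predicted_labels):
    t, p = with_ancestors(true_labels), with_ancestors(predicted_labels)
    precision = len(t & p) / len(p)
    recall = len(t & p) / len(t)
    return 2 * precision * recall / (precision + recall)

# "politics" is wrong but shares the ancestor "news" with "soccer",
# so the score is 0.4 instead of the flat-F1 score of 0.0.
print(hierarchical_f1({"soccer"}, {"politics"}))
```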
134

Biomedical Concept Association and Clustering Using Word Embeddings

Setu Shah (5931128) 12 February 2019
Biomedical data exists in the form of journal articles, research studies, electronic health records, care guidelines, etc. While text mining and natural language processing tools have been widely employed across various domains, they are just taking off in the healthcare space.

A primary hurdle in building artificial intelligence models that use biomedical data is the limited amount of labelled data available. Since most models rely on supervised or semi-supervised methods, generating large amounts of pre-processed labelled data for training becomes extremely costly. Even for datasets that are labelled, the lack of normalization of biomedical concepts further affects the quality of results and limits each application to a restricted dataset. This hampers the reproducibility of results and techniques across datasets, making it difficult to deploy research solutions that improve healthcare services.

The research presented in this thesis focuses on reducing the need to create labels for biomedical text mining by using unsupervised recurrent neural networks. The proposed method utilizes word embeddings to generate vector representations of biomedical concepts based on semantics and context. Experiments with unsupervised clustering of these biomedical concepts show that similar concepts are clustered together. While this clustering captures different synonyms of the same concept, it also captures the similarities between various diseases and their symptoms.

To test the performance of the concept vectors on corpora of documents, a document vector generation method that utilizes these concept vectors is also proposed. The document vectors thus generated are used as input to clustering algorithms, and the results show that, across multiple corpora, the proposed methods of concept and document vector generation outperform the baselines and provide more meaningful clustering. The applications of this document clustering are broad, especially in search and retrieval, providing clinicians, researchers and patients with more holistic and comprehensive results than relying exclusively on the terms they search for.

Finally, a framework is presented for extracting clinical information from preventive care guidelines that can be mapped to electronic health records. The extracted information can be integrated with the clinical decision support system of an electronic health record. A visualization tool to better understand and observe patient trajectories is also explored. Both methods have the potential to improve the preventive care services provided to patients.
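The concept-clustering experiment can be approximated with off-the-shelf tools: embed each concept, normalise, and cluster the vectors. A sketch under stated assumptions — the vectors below are random stand-ins for embeddings that the thesis trains on biomedical text:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical concept vectors; a real run would load embeddings trained
# on biomedical corpora (journal articles, health records, guidelines).
concepts = ["diabetes", "hyperglycemia", "insulin",
            "fracture", "cast", "x-ray"]
rng = np.random.default_rng(42)
vectors = rng.normal(size=(len(concepts), 50))

# Normalise so that KMeans' Euclidean distance tracks cosine similarity.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for cluster in sorted(set(labels)):
    print(cluster, [c for c, l in zip(concepts, labels) if l == cluster])
```

With trained embeddings, synonyms and related disease-symptom pairs would land in the same cluster, which is the behaviour the abstract reports.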
135

Planejamentos combinatórios construindo sistemas triplos de Steiner / Combinatorial designs: constructing Steiner triple systems

Barbosa, Enio Perez Rodrigues 26 August 2011
Intuitively, the basic idea of design theory is a way of selecting subsets, also called blocks, of a finite set so that some specified properties are satisfied. The most general case is that of balanced designs: a PBD is an ordered pair (S, B), where S is a finite set of symbols and B is a collection of subsets of S called blocks, such that each pair of distinct elements of S occurs together in exactly one block of B. A Steiner triple system is a particular case of a PBD in which every block has size exactly 3, the blocks being called triples. The main focus is on techniques for constructing such systems. Through the notion of resolvability, we discuss when a Steiner triple system is resolvable and when it is not. This theory has several applications, e.g., embeddings and even problems related to computational complexity.
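The defining PBD property — every pair of distinct points occurs together in exactly one block — is easy to verify by machine. A short Python check using the classical STS(7) (the Fano plane), developed cyclically from the base block {0, 1, 3} mod 7:

```python
from itertools import combinations

# STS(7): develop the base block {0, 1, 3} cyclically modulo 7.
triples = [tuple(sorted((i, (i + 1) % 7, (i + 3) % 7))) for i in range(7)]

# Steiner triple system = PBD whose blocks all have size 3:
# every pair of distinct points must occur in exactly one triple.
counts = {pair: 0 for pair in combinations(range(7), 2)}
for triple in triples:
    for pair in combinations(triple, 2):
        counts[pair] += 1

assert all(c == 1 for c in counts.values())
print(f"{len(triples)} triples cover all {len(counts)} pairs exactly once")
```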
136

Klasické operátory harmonické analýzy v Orliczových prostorech / Classical operators of harmonic analysis in Orlicz spaces

Musil, Vít January 2018
We deal with classical operators of harmonic analysis in Orlicz spaces, such as the Hardy-Littlewood maximal operator, Hardy-type integral operators, the maximal operator of fractional order, the Riesz potential, the Laplace transform, and also with Sobolev-type embeddings on open subsets of Rn or with respect to Frostman measures and, in particular, trace embeddings on the boundary. For each operator (in the case of embeddings we consider the identity operator) we investigate the question of its boundedness from one Orlicz space into another. Particular attention is paid to the sharpness of the results. We further study the question of the existence of optimal Orlicz domain and target spaces and their description. The work consists of the author's published and unpublished results, compiled together with material appearing in the literature.
137

Embedding Theorems for Mixed Norm Spaces and Applications

Algervik, Robert January 2008
This thesis is devoted to the study of mixed norm spaces that arise in connection with embeddings of Sobolev and Besov type spaces. The work in this direction originates in a paper due to Gagliardo (1958), and was continued by Fournier (1988) and by Kolyada (2005). We consider fully anisotropic mixed norm spaces. Our main theorem states an embedding of these spaces into Lorentz spaces. Applying this result, we obtain sharp embedding theorems for anisotropic fractional Sobolev spaces and anisotropic Sobolev-Besov spaces. The methods used are based on non-increasing rearrangements and on estimates of sections of functions and sections of sets. We also study limiting relations between embeddings of spaces of different type. More exactly, mixed norm estimates enable us to get embedding constants with sharp asymptotic behaviour. This gives an extension of the results obtained for isotropic Besov spaces $B_p^\alpha$ by Bourgain, Brezis, and Mironescu, and for Besov spaces $B^{\alpha_1,\dots,\alpha_n}_p$ by Kolyada. We also study some basic properties (in particular the approximation properties) of special weak type spaces that play an important role in the construction of mixed norm spaces and in the description of Sobolev type embeddings.
139

Data-driven language understanding for spoken dialogue systems

Mrkšić, Nikola January 2018
Spoken dialogue systems provide a natural conversational interface to computer applications. In recent years, substantial improvements in the performance of speech recognition engines have helped shift the research focus to the next component of the dialogue system pipeline: the one in charge of language understanding. The role of this module is to translate user inputs into accurate representations of the user goal, in a form that the system can use to interact with the underlying application. The challenges include the modelling of linguistic variation, speech recognition errors and the effects of dialogue context. Recently, the focus of language understanding research has moved to making use of word embeddings induced from large textual corpora using unsupervised methods. The work presented in this thesis demonstrates how these methods can be adapted to overcome the limitations of language understanding pipelines currently used in spoken dialogue systems. The thesis starts with a discussion of the pros and cons of language understanding models used in modern dialogue systems. Most models in use today are based on the delexicalisation paradigm, where exact string matching supplemented by a list of domain-specific rephrasings is used to recognise users' intents and update the system's internal belief state. This is followed by an attempt to use pretrained word vector collections to automatically induce domain-specific semantic lexicons, which are typically hand-crafted to handle lexical variation and account for a plethora of system failure modes. The results highlight the deficiencies of distributional word vectors, which must be overcome to make them useful for downstream language understanding models. The thesis next shifts focus to overcoming the language understanding models' dependency on semantic lexicons. To achieve this, the proposed Neural Belief Tracking (NBT) model forsakes the standard one-hot n-gram representations used in natural language processing in favour of distributed representations of user utterances, dialogue context and domain ontologies. The NBT model makes use of external lexical knowledge embedded in semantically specialised word vectors, obviating the need for domain-specific semantic lexicons. Subsequent work focuses on semantic specialisation, presenting an efficient method for injecting external lexical knowledge into word vector spaces. The proposed Attract-Repel algorithm boosts the semantic content of existing word vectors while simultaneously inducing high-quality cross-lingual word vector spaces. Finally, NBT models powered by specialised cross-lingual word vectors are used to train multilingual belief tracking models. These models operate across many languages at once, providing an efficient method for bootstrapping language understanding models for lower-resource languages with limited training data.
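The specialisation step can be conveyed with a heavily reduced sketch: pull synonym vectors together and push antonym vectors apart on the unit sphere. This is only a toy rendition of the Attract-Repel idea — the published algorithm additionally uses mini-batches with negative sampling and a regularisation term that keeps vectors close to their original positions:

```python
import numpy as np

rng = np.random.default_rng(1)
words = ["cheap", "inexpensive", "expensive"]
vecs = {w: rng.normal(size=8) for w in words}
for w in vecs:                          # work on the unit sphere
    vecs[w] /= np.linalg.norm(vecs[w])

synonyms = [("cheap", "inexpensive")]   # attract constraints
antonyms = [("cheap", "expensive")]     # repel constraints
margin, lr = 0.6, 0.1

for _ in range(200):
    for a, b in synonyms:               # pull together: raise cosine
        vecs[a] += lr * vecs[b]
        vecs[b] += lr * vecs[a]
    for a, b in antonyms:               # push apart, down to the margin
        if vecs[a] @ vecs[b] > -margin:
            vecs[a] -= lr * vecs[b]
            vecs[b] -= lr * vecs[a]
    for w in vecs:                      # re-normalise after each pass
        vecs[w] /= np.linalg.norm(vecs[w])

print("synonym cosine:", round(float(vecs["cheap"] @ vecs["inexpensive"]), 2))
print("antonym cosine:", round(float(vecs["cheap"] @ vecs["expensive"]), 2))
```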
140

Réduire la probabilité de disparité des termes en exploitant leurs relations sémantiques / Reducing Term Mismatch Probability by Exploiting Semantic Term Relations

Almasri, Mohannad 27 June 2017
Even though modern retrieval systems typically use a multitude of features to rank documents, the backbone of search ranking is still the standard retrieval models. This thesis addresses a limitation of the standard retrieval models: the term mismatch problem. Term mismatch is a long-standing problem in information retrieval. However, it was not well understood how often term mismatch happens in retrieval, how important it is for retrieval, or how it affects retrieval performance. This thesis answers these questions. The research is enabled by a formal definition of term mismatch: in this thesis, term mismatch is defined as the probability that a term does not appear in a document given that the document is relevant. We propose several approaches for reducing term mismatch probability by modifying documents or queries, followed by a quantitative analysis that shows how much the proposed approaches reduce term mismatch probability while maintaining system performance. An essential component for achieving this reduction is the knowledge resource that defines terms and their relationships. First, we propose a document modification approach driven by the user query. The main idea is to deal with mismatched query terms: while prior research on document enrichment modifies documents statically, we modify a document only in case of mismatch. The modified document is then used in a standard retrieval model, yielding a mismatch-aware retrieval model. Second, we propose a semantic query expansion approach based on a collaborative knowledge resource. We focus on the structure of the collaborative resource to obtain interesting expansion terms that help reduce term mismatch probability and, as a result, improve the effectiveness of search. Third, we propose a query expansion approach based on neural language models. Neural language models learn term vector representations, called distributed neural embeddings, which capture relationships between terms and have obtained impressive results compared with state-of-the-art approaches on term similarity tasks. In information retrieval, however, distributed neural embeddings have only recently begun to be exploited; we propose to use them as a knowledge resource in a query expansion scenario. Fourth, we apply the term mismatch probability definition to each of the above contributions. We show how standard retrieval corpora with queries and relevance judgments can be used to estimate the term mismatch probability. We estimate it using the original documents and queries, showing how clearly the mismatch problem appears in search systems for different types of indexing terms, and we then quantify how much our contributions reduce the estimated mismatch probability and improve system recall. As a result, we present how the modified document and query representations contribute to building a mismatch-aware retrieval model that mitigates the term mismatch problem, both theoretically and practically. This dissertation shows the effectiveness of our proposals for improving retrieval performance. Our experiments are conducted on corpora from two different domains, the medical domain and the cultural heritage domain. Moreover, we use two different types of indexing terms for representing documents and queries, words and concepts, and we exploit several types of relationships between indexing terms: hierarchical relationships, relationships based on a collaborative resource structure, and relationships defined on distributed neural embeddings. Promising research directions are identified where term mismatch research may have a significant impact on improving search scenarios.
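The third contribution's expansion step can be sketched directly: for each query term, add its nearest neighbours in the embedding space to the query. The vocabulary and vectors below are hypothetical stand-ins for trained distributed neural embeddings:

```python
import numpy as np

# Hypothetical stand-in for a trained embedding matrix (one row per term).
vocab = ["heart", "cardiac", "attack", "infarction", "painting"]
rng = np.random.default_rng(7)
E = rng.normal(size=(len(vocab), 32))
E /= np.linalg.norm(E, axis=1, keepdims=True)     # unit rows: dot = cosine
index = {w: i for i, w in enumerate(vocab)}

def expand(query_terms, k=2):
    """Add the k nearest embedding-space neighbours of each query term."""
    expanded = list(query_terms)
    for term in query_terms:
        sims = E @ E[index[term]]                 # cosine similarity to all
        ranked = [vocab[j] for j in np.argsort(-sims) if vocab[j] != term]
        expanded.extend(ranked[:k])
    return expanded

print(expand(["heart"]))
```

Expansion terms semantically related to the query reduce the chance that a relevant document is missed merely because it uses different vocabulary, which is exactly the mismatch reduction the abstract quantifies.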
