21

Τεχνικές αυτόματης μορφοσυντακτικής ανάλυσης της νέας ελληνικής γλώσσας / Techniques for automatic morphosyntactic analysis of Modern Greek

Σγάρμπας, Κυριάκος 22 September 2009 (has links)
- / -
22

Transition-based combinatory categorial grammar parsing for English and Hindi

Ambati, Bharat Ram January 2016 (has links)
Given a natural language sentence, parsing is the task of assigning it a grammatical structure according to the rules of a particular grammar formalism. Different grammar formalisms, such as Dependency Grammar, Phrase Structure Grammar, Combinatory Categorial Grammar and Tree Adjoining Grammar, have been explored in the literature for parsing. For example, given a sentence like “John ate an apple”, parsers based on the widely used dependency grammars find grammatical relations, such as that ‘John’ is the subject and ‘apple’ is the object of the action ‘ate’. In this thesis we focus mainly on Combinatory Categorial Grammar (CCG) and present an incremental algorithm for parsing CCG for two diverse languages: English and Hindi. English is a fixed word order, SVO (Subject-Verb-Object) and morphologically simple language, whereas Hindi, though predominantly an SOV (Subject-Object-Verb) language, is a free word order and morphologically rich language. Developing an incremental parser for Hindi is particularly challenging since the predicate needed to resolve dependencies comes at the end. As previously available shift-reduce CCG parsers use English CCGbank derivations, which are mostly right-branching and non-incremental, we design our algorithm based on the dependencies resolved rather than the derivation. Our novel algorithm builds a dependency graph in parallel to the CCG derivation, which is used for revealing the unbuilt structure without backtracking. Though we use dependencies for meaning representation and CCG for parsing, our revealing technique can be applied to other meaning representations, such as lambda expressions, and to non-CCG parsing, such as phrase structure parsing.
Any statistical parser requires three major modules: data, a parsing algorithm and a learning algorithm. This thesis is broadly divided into three parts, each dealing with one of these modules.
In Part I, we design a novel algorithm for converting a dependency treebank to a CCGbank. Using this algorithm, we create a Hindi CCGbank with a coverage of 96%. We also carry out a cross-formalism experiment in which we show that CCG supertags can improve widely used dependency parsers. We experiment with two popular dependency parsers (Malt and MST) for two diverse languages: English and Hindi. For both languages, CCG categories improve the overall accuracy of both parsers by around 0.3-0.5% in all experiments. For both parsers, we see larger improvements specifically on dependencies at which they are known to be weak: long-distance dependencies for Malt, and verbal arguments for MST. The result is particularly interesting in the case of the fast greedy parser (Malt), since improving its accuracy without significantly compromising speed is relevant for large-scale applications such as parsing the web.
In Part II, we present a novel algorithm for incremental transition-based CCG parsing for English and Hindi. Incremental parsers have potential advantages for applications like language modeling for machine translation and speech recognition. We introduce two new actions in the shift-reduce paradigm for revealing the required information during parsing. We also analyze the impact of a beam and look-ahead for parsing; in general, using a beam and/or look-ahead gives better results than not using them. We also show that the incremental CCG parser is more useful than a non-incremental version for predicting relative sentence complexity: given a pair of sentences from Wikipedia and Simple Wikipedia, we build a classifier which predicts whether one sentence is simpler or more complex than the other, and we show that features from a CCG parser in general, and the incremental CCG parser in particular, are more useful than a chart-based phrase structure parser both in terms of speed and accuracy.
In Part III, we develop the first neural network based training algorithm for parsing CCG. We also study the impact of neural network based tagging models, and of greedy versus beam-search parsing, by using a structured neural network model. In greedy settings, neural network models give significantly better results than the perceptron models and are also over three times faster. Using a narrow beam, the structured neural network model gives consistently better results than the basic neural network model. For English, the structured neural network gives performance similar to the structured perceptron parser, but for Hindi the structured perceptron remains the better model.
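As a rough illustration of the transition-based paradigm this thesis builds on (not the thesis algorithm itself, which adds CCG combinators and the new revealing actions), the sketch below shows the basic shift-reduce stack/buffer mechanics on the example sentence from the abstract; the action sequence is a hypothetical gold sequence, and the arc labels are plain dependencies rather than CCG categories.

```python
# Minimal shift-reduce skeleton: a stack, a buffer, and three transitions.
def parse(words, actions):
    """Apply a given action sequence; return dependency arcs (head, dependent)."""
    stack, buffer, arcs = [], list(words), []
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":        # second-top becomes dependent of top
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC":       # top becomes dependent of second-top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs

words = ["John", "ate", "an", "apple"]
# hypothetical gold action sequence for this sentence
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "SHIFT", "LEFT-ARC", "RIGHT-ARC"]
print(parse(words, actions))
# [('ate', 'John'), ('apple', 'an'), ('ate', 'apple')]
```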
23

Implication textuelle et réécriture / Textual Entailment and rewriting

Bedaride, Paul 18 October 2010 (has links)
Cette thèse propose plusieurs contributions sur le thème de la détection d'implications textuelles (DIT). La DIT est la capacité humaine, étant donné deux textes, à pouvoir dire si le sens du second texte peut être déduit à partir de celui du premier. Une des contributions apportées au domaine est un système de DIT hybride prenant les analyses d'un analyseur syntaxique stochastique existant afin de les étiqueter avec des rôles sémantiques, puis transformant les structures obtenues en formules logiques grâce à des règles de réécriture pour tester finalement l'implication à l'aide d'outils de preuve. L'autre contribution de cette thèse est la génération de suites de tests finement annotés avec une distribution uniforme des phénomènes couplée avec une nouvelle méthode d'évaluation des systèmes utilisant les techniques de fouille d'erreurs développées par la communauté de l'analyse syntaxique permettant une meilleure identification des limites des systèmes. Pour cela nous créons un ensemble de formules sémantiques puis nous générons les réalisations syntaxiques annotées correspondantes à l'aide d'un système de génération existant. Nous testons ensuite s'il y a implication ou non entre chaque couple de réalisations syntaxiques possible. Enfin nous sélectionnons un sous-ensemble de cet ensemble de problèmes d'une taille donnée et satisfaisant un certain nombre de contraintes à l'aide d'un algorithme que nous avons développé. / This thesis presents several contributions on the theme of recognising textual entailment (RTE). RTE is the human capacity, given two texts, to determine whether the meaning of the second text can be deduced from the meaning of the first. One of the contributions made to the field is a hybrid RTE system that takes the analyses of an existing stochastic parser, labels them with semantic roles, turns the resulting structures into logical formulas using rewrite rules, and finally tests the entailment with proof tools. Another contribution of this thesis is the generation of finely annotated test suites with a uniform distribution of phenomena, coupled with a new methodology for evaluating systems that uses the error-mining techniques developed by the parsing community and allows a better identification of system limitations. For this, we create a set of semantic formulas and then generate the corresponding annotated syntactic realisations using an existing generation system. We then test whether or not there is an entailment between each pair of possible syntactic realisations. Finally, using an algorithm we developed, we select a subset of this set of problems of a given size that satisfies a number of constraints.
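A toy sketch of the general idea of testing entailment over normalised representations: the predicates, rewrite rules and subset test below are illustrative assumptions standing in for the semantic-role labelling, rewrite rules and proof tools of the actual system.

```python
# Sentences reduced to sets of (predicate, arg0, arg1) facts; rewrite rules
# normalise lexical variants; entailment is approximated by set inclusion,
# a crude stand-in for calling a theorem prover.
RULES = {"purchase": "buy", "acquire": "buy"}   # hypothetical normalisation rules

def normalise(facts):
    """Map surface predicates to a canonical form so paraphrases match."""
    return {(RULES.get(pred, pred), a0, a1) for pred, a0, a1 in facts}

def entails(text, hypothesis):
    """Text entails hypothesis if every normalised fact of the hypothesis
    is already asserted by the text."""
    return normalise(hypothesis) <= normalise(text)

text = {("purchase", "John", "car"), ("own", "John", "car")}
hypothesis = {("buy", "John", "car")}
print(entails(text, hypothesis))   # True under these toy rules
```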
24

The effect of context on the activation and processing of word meaning over time

Frassinelli, Diego January 2015 (has links)
The aim of this thesis is to study the effect that linguistic context exerts on the activation and processing of word meaning over time. Previous studies have demonstrated that a biasing context makes it possible to predict upcoming words. The context causes the pre-activation of expected words and facilitates their processing when they are encountered. The interaction of context and word meaning can be described in terms of feature overlap: as the context unfolds, the semantic features of the processed words are activated and words that match those features are pre-activated and thus processed more quickly when encountered. The aim of the experiments in this thesis is to test a key prediction of this account, viz., that the facilitation effect is additive and occurs together with the unfolding context. Our first contribution is to analyse the effect of an increasing amount of biasing context on the pre-activation of the meaning of a critical word. In a self-paced reading study, we investigate the amount of biasing information required to boost word processing: at least two biasing words are required to significantly reduce the time to read the critical word. In a complementary visual world experiment we study the effect of context as it unfolds over time. We identify a ceiling effect after the first biasing word: when the expected word has been pre-activated, an increasing amount of context does not produce any additional significant facilitation effect. Our second contribution is to model the activation effect observed in the previous experiments using a bag-of-words distributional semantic model. The similarity scores generated by the model significantly correlate with the association scores produced by humans. When we use point-wise multiplication to combine contextual word vectors, the model provides a computational implementation of feature overlap theory, successfully predicting reading times. Our third contribution is to analyse the effect of context on semantically similar words. In another visual world experiment, we show that words that are semantically similar generate similar eye-movements towards a related object depicted on the screen. A coherent context pre-activates the critical word and therefore increases the expectations towards it. This experiment also tested the cognitive validity of a distributional model of semantics by using this model to generate the critical words for the experimental materials used.
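The feature-overlap account lends itself to a small worked example: combining context-word vectors by point-wise multiplication keeps only their shared features, which then overlap strongly with the expected word. The four "semantic features" and the vector values below are invented for illustration and are not taken from the thesis materials.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical 4-dimensional feature space: [kitchen, cutting, music, outdoors]
vectors = {
    "chef":   np.array([0.9, 0.6, 0.0, 0.1]),
    "slice":  np.array([0.7, 0.9, 0.0, 0.0]),
    "knife":  np.array([0.8, 0.9, 0.0, 0.1]),
    "guitar": np.array([0.0, 0.1, 0.9, 0.2]),
}

# the unfolding context "chef ... slice" pre-activates kitchen/cutting features
context = vectors["chef"] * vectors["slice"]

for target in ("knife", "guitar"):
    print(target, round(cosine(context, vectors[target]), 3))
# the expected word ("knife") overlaps far more with the combined context
```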
25

Exploitation du contexte sémantique pour améliorer la reconnaissance des noms propres dans les documents audio diachroniques / Exploiting Semantic and Topic Context to Improve Recognition of Proper Names in Diachronic Audio Documents

Sheikh, Imran 24 November 2016 (has links)
La nature diachronique des bulletins d'information provoque de fortes variations du contenu linguistique et du vocabulaire dans ces documents. Dans le cadre de la reconnaissance automatique de la parole, cela conduit au problème de mots hors vocabulaire (Out-Of-Vocabulary, OOV). La plupart des mots OOV sont des noms propres. Les noms propres sont très importants pour l'indexation automatique de contenus audio-vidéo. De plus, leur bonne identification est importante pour des transcriptions automatiques fiables. Le but de cette thèse est de proposer des méthodes pour récupérer les noms propres manquants dans un système de reconnaissance. Nous proposons de modéliser le contexte sémantique et d'utiliser des informations thématiques contenues dans les documents audio à transcrire. Des modèles probabilistes de thème et des projections dans un espace continu obtenues à l'aide de réseaux de neurones sont explorés pour la tâche de récupération des noms propres pertinents. Une évaluation approfondie de ces représentations contextuelles a été réalisée. Pour modéliser le contexte de nouveaux mots plus efficacement, nous proposons des réseaux de neurones qui maximisent la récupération des noms propres pertinents. En s'appuyant sur ce modèle, nous proposons un nouveau modèle (Neural Bag-of-Weighted-Words, NBOW2) qui permet d'estimer un degré d'importance pour chacun des mots du document et a la capacité de capturer des mots spécifiques à ce document. Des expériences de reconnaissance automatique de bulletins d'information télévisés montrent l'efficacité du modèle proposé. L'évaluation de NBOW2 sur d'autres tâches telles que la classification de textes montre de bonnes performances. / The diachronic nature of broadcast news causes frequent variations in the linguistic content and vocabulary, leading to the problem of Out-Of-Vocabulary (OOV) words in automatic speech recognition. Most of the OOV words are found to be proper names, and proper names are important for automatic indexing of audio-video content as well as for obtaining reliable automatic transcriptions. The goal of this thesis is to model the semantic and topical context of new proper names in order to retrieve those which are relevant to the spoken content in the audio document. Training context models is a challenging problem in this task because many new names occur with very little data, and the context model must be robust to errors in the automatic transcription. Probabilistic topic models and word embeddings from neural network models are explored for the task of retrieving relevant proper names, and a thorough evaluation of these contextual representations is performed. It is argued that these representations, which are learned in an unsupervised manner, are not the best for the given retrieval task. We therefore propose neural network context models trained with an objective that maximises retrieval performance. The proposed Neural Bag-of-Weighted-Words (NBOW2) model learns to assign a degree of importance to input words and has the ability to capture task-specific key words. Experiments on automatic speech recognition of French broadcast news videos demonstrate the effectiveness of the proposed models, and evaluation of the NBOW2 model on standard text classification tasks shows that it learns interesting information and gives the best classification accuracies among the BOW models.
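A minimal sketch of an NBOW2-style model as described above: each word receives an embedding and a learned importance score, and the document representation is the importance-weighted average of its word embeddings. The layer sizes and the exact weighting scheme (a softmax over per-word scores) are assumptions for illustration, not the thesis implementation.

```python
import torch
import torch.nn as nn

class NBOW2(nn.Module):
    def __init__(self, vocab_size, dim, n_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)   # one vector per word
        self.imp = nn.Embedding(vocab_size, 1)     # one importance score per word
        self.out = nn.Linear(dim, n_classes)

    def forward(self, word_ids):                   # word_ids: (batch, doc_len)
        vecs = self.emb(word_ids)                  # (batch, doc_len, dim)
        weights = torch.softmax(self.imp(word_ids).squeeze(-1), dim=1)
        doc = (weights.unsqueeze(-1) * vecs).sum(dim=1)   # weighted average
        return self.out(doc)

model = NBOW2(vocab_size=5000, dim=64, n_classes=2)
dummy_batch = torch.randint(0, 5000, (3, 20))      # 3 documents of 20 word ids
print(model(dummy_batch).shape)                    # torch.Size([3, 2])
```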
26

Υπολογιστική συντακτική επεξεργασία κειμένων της νέας ελληνικής γλώσσας βασισμένη σε εμπειρικές μεθόδους / Computational syntactic processing of Modern Greek texts based on empirical methods

Μίχος, Στέφανος 22 September 2009 (has links)
- / -
27

Construction automatique de hiérarchies sémantiques à partir du Trésor de la Langue Française informatisé (TLFi) : application à l'indexation et la recherche d'images / Automatic construction of semantic hierarchies from the Trésor de la langue française informatisé (TLFi): application to image indexing and retrieval

Gheorghita, Inga 17 February 2014 (has links)
L’objectif principal de cette thèse est de montrer que les informations lexicales issues d’un dictionnaire de langue, tel le Trésor de la langue française informatisé (TLFi), peuvent améliorer les processus d’indexation et de recherche d’images. Le problème d’utilisation d’une telle ressource est qu’elle n’est pas suffisamment formalisée pour être exploitée d’emblée dans un tel domaine d’application. Pour résoudre ce problème, nous proposons, dans un premier temps, une approche de construction automatique de hiérarchies sémantiques à partir du TLFi. Après avoir défini une caractéristique quantitative (mesurable) et comparable des noms apparaissant dans les définitions lexicographiques, à travers une formule de pondération permettant de sélectionner le nom de poids maximal comme un bon candidat hyperonyme pour un lexème donné du TLFi, nous proposons un algorithme de construction automatique de hiérarchies sémantiques pour les lexèmes des vocables du TLFi. Une fois notre approche validée à travers des évaluations manuelles, nous montrons, dans un second temps, que les hiérarchies sémantiques obtenues à partir du TLFi peuvent être utilisées pour l’enrichissement d’un thésaurus construit manuellement ainsi que pour l’indexation automatique d’images à partir de leurs descriptions textuelles associées. Nous prouvons aussi que l’exploitation d’une telle ressource dans le domaine de recherche d’images améliore la précision de la recherche en structurant les résultats selon les domaines auxquels les concepts de la requête de recherche peuvent faire référence. La mise en place d’un prototype nous a permis ainsi d’évaluer et de valider les approches proposées. / The main purpose of this thesis is to show that the lexical information provided by a language dictionary such as the Trésor de la langue française informatisé (TLFi) can improve the image indexing and retrieval process. The problem with using such a resource is that it is not sufficiently formalised to be exploited directly in such an application domain. To solve this problem, we first propose an approach for the automatic construction of semantic hierarchies from the TLFi. After defining a quantitative (measurable) and comparable characteristic of the nouns appearing in dictionary definitions, through a weighting formula that selects the noun with the maximum weight as a good hypernym candidate for a given TLFi lexeme, we propose an algorithm for the automatic construction of semantic hierarchies for the lexemes of TLFi vocables. Once our approach is validated through manual evaluations, we then demonstrate that the semantic hierarchies obtained from the TLFi can be used to enrich a manually built thesaurus as well as to index images automatically from their associated textual descriptions. We also show that exploiting such a resource in the image retrieval domain improves the precision of the search by structuring the results according to the domains to which the concepts of the search query may refer. The implementation of a prototype allowed us to evaluate and validate the proposed approaches.
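A toy sketch of the hypernym-selection step: each noun in a definition receives a weight, the highest-weighted noun becomes the hypernym candidate, and candidates are chained into a hierarchy. The positional weighting and the mini-dictionary below are illustrative assumptions; the thesis defines its own weighting formula over TLFi definitions.

```python
def hypernym_candidate(definition_nouns):
    """definition_nouns: nouns of a dictionary definition, in order of appearance."""
    weights = {noun: 1.0 / (rank + 1)          # assumed positional weighting
               for rank, noun in enumerate(definition_nouns)}
    return max(weights, key=weights.get)       # keep the highest-weighted noun

# hypothetical mini-dictionary: headword -> nouns found in its definition
definitions = {
    "spaniel": ["dog", "ears", "coat"],
    "dog":     ["mammal", "family", "canidae"],
    "mammal":  ["animal", "milk", "young"],
}

# chain the candidates into a small hierarchy
word, chain = "spaniel", ["spaniel"]
while word in definitions:
    word = hypernym_candidate(definitions[word])
    chain.append(word)
print(" -> ".join(chain))   # spaniel -> dog -> mammal -> animal
```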
28

Refinements in hierarchical phrase-based translation systems

Pino, Juan Miguel January 2015 (has links)
The relatively recently proposed hierarchical phrase-based translation model for statistical machine translation (SMT) has achieved state-of-the-art performance in numerous recent translation evaluations. Hierarchical phrase-based systems comprise a pipeline of modules with complex interactions. In this thesis, we propose refinements to the hierarchical phrase-based model as well as improvements and analyses in various modules for hierarchical phrase-based systems. We took the opportunity of increasing amounts of available training data for machine translation as well as existing frameworks for distributed computing in order to build better infrastructure for extraction, estimation and retrieval of hierarchical phrase-based grammars. We design and implement grammar extraction as a series of Hadoop MapReduce jobs. We store the resulting grammar using the HFile format, which offers competitive trade-offs in terms of efficiency and simplicity. We demonstrate improvements over two alternative solutions used in machine translation. The modular nature of the SMT pipeline, while allowing individual improvements, has the disadvantage that errors committed by one module are propagated to the next. This thesis alleviates this issue between the word alignment module and the grammar extraction and estimation module by considering richer statistics from word alignment models in extraction. We use alignment link and alignment phrase pair posterior probabilities for grammar extraction and estimation and demonstrate translation improvements in Chinese to English translation. This thesis also proposes refinements in grammar and language modelling both in the context of domain adaptation and in the context of the interaction between first-pass decoding and lattice rescoring. We analyse alternative strategies for grammar and language model cross-domain adaptation. We also study interactions between first-pass and second-pass language model in terms of size and n-gram order. Finally, we analyse two smoothing methods for large 5-gram language model rescoring. The last two chapters are devoted to the application of phrase-based grammars to the string regeneration task, which we consider as a means to study the fluency of machine translation output. We design and implement a monolingual phrase-based decoder for string regeneration and achieve state-of-the-art performance on this task. By applying our decoder to the output of a hierarchical phrase-based translation system, we are able to recover the same level of translation quality as the translation system.
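A much-simplified sketch of grammar extraction cast as map/reduce, in the spirit of the Hadoop jobs described above: a mapper emits one count per extracted pair and a reducer sums the counts into relative-frequency estimates. Real hierarchical rule extraction and HFile storage are far richer than this word-level stand-in, and the toy aligned data below is invented.

```python
from collections import Counter, defaultdict

def mapper(src_sentence, tgt_sentence, alignment):
    """Emit ((source_word, target_word), 1) for every alignment link."""
    src, tgt = src_sentence.split(), tgt_sentence.split()
    for i, j in alignment:
        yield (src[i], tgt[j]), 1

def reducer(pairs):
    """Sum counts and estimate p(target | source) by relative frequency."""
    counts = Counter()
    for pair, c in pairs:
        counts[pair] += c
    totals = defaultdict(int)
    for (s, _), c in counts.items():
        totals[s] += c
    return {pair: c / totals[pair[0]] for pair, c in counts.items()}

# hypothetical word-aligned training data
data = [
    ("le chat dort", "the cat sleeps", [(0, 0), (1, 1), (2, 2)]),
    ("le chien dort", "the dog sleeps", [(0, 0), (1, 1), (2, 2)]),
]
emitted = [kv for src, tgt, links in data for kv in mapper(src, tgt, links)]
for pair, prob in sorted(reducer(emitted).items()):
    print(pair, prob)
```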
29

Indexation aléatoire et similarité inter-phrases appliquées au résumé automatique / Random indexing and inter-sentence similarity applied to automatic summarization

Vu, Hai Hieu 29 January 2016 (has links)
Face à la masse grandissante des données textuelles présentes sur le Web, le résumé automatique d'une collection de documents traitant d'un sujet particulier est devenu un champ de recherche important du Traitement Automatique des Langues. Les expérimentations décrites dans cette thèse s'inscrivent dans cette perspective. L'évaluation de la similarité sémantique entre phrases est l'élément central des travaux réalisés. Notre approche repose sur la similarité distributionnelle et une vectorisation des termes qui utilise l'encyclopédie Wikipédia comme corpus de référence. Sur la base de cette représentation, nous avons proposé, évalué et comparé plusieurs mesures de similarité textuelle ; les données de tests utilisées sont celles du défi SemEval 2014 pour la langue anglaise et des ressources que nous avons construites pour la langue française. Les bonnes performances des mesures proposées nous ont amenés à les utiliser dans une tâche de résumé multi-documents, qui met en oeuvre un algorithme de type PageRank. Le système a été évalué sur les données de DUC 2007 pour l'anglais et le corpus RPM2 pour le français. Les résultats obtenus par cette approche simple, robuste et basée sur une ressource aisément disponible dans de nombreuses langues, se sont avérés très encourageants. / With the growing mass of textual data on the Web, automatic summarization of topic-oriented collections of documents has become an important research field of Natural Language Processing. The experiments described in this thesis were carried out in this context. Evaluating the semantic similarity between sentences is central to our work, and we based our approach on distributional similarity and a vector representation of terms, with Wikipedia as the reference corpus. We proposed several similarity measures, which were evaluated and compared on different data sets: the SemEval 2014 challenge corpus for English and datasets we built ourselves for French. The good performance of our measures led us to use them in a multi-document summarization task that implements a PageRank-type algorithm. The system was evaluated on the DUC 2007 datasets for English and the RPM2 corpus for French. This simple approach, based on a resource readily available in many languages, proved efficient and robust, and the encouraging results open up real prospects for improvement.
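A compact sketch of the PageRank-based summarization pipeline described above, with plain bag-of-words sentence vectors standing in for the Wikipedia-based term representations used in the thesis; the example sentences are invented.

```python
import numpy as np

def tokens(sentence):
    return [w.strip(".,").lower() for w in sentence.split()]

def sentence_vectors(sentences):
    """Bag-of-words count vectors over the vocabulary of the collection."""
    vocab = sorted({w for s in sentences for w in tokens(s)})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for w in tokens(s):
            vecs[i, index[w]] += 1
    return vecs

def pagerank(sim, damping=0.85, iters=50):
    """Power iteration over the row-normalised sentence-similarity graph."""
    sim = sim.copy()
    np.fill_diagonal(sim, 0)                       # no self-links
    norm = sim.sum(axis=1, keepdims=True)
    norm[norm == 0] = 1
    transition = sim / norm
    n = len(sim)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * transition.T @ scores
    return scores

sentences = [
    "The summit produced a new climate agreement",
    "Delegates at the summit debated the climate agreement for days",
    "A local bakery won a pastry award",
]
vecs = sentence_vectors(sentences)
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
scores = pagerank(unit @ unit.T)                   # cosine-similarity graph
print(sentences[int(np.argmax(scores))])           # most central sentence kept first
```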
30

Analyse des sentiments et des émotions de commentaires complexes en langue française. / Sentiment and emotion analysis of complex reviews

Pecore, Stefania 28 January 2019 (has links)
Les définitions des mots « sentiment », « opinion » et « émotion » sont toujours très vagues, comme l'atteste aussi le dictionnaire qui semble expliquer un mot en utilisant les deux autres. Tout le monde est affecté par les opinions : les entreprises pour vendre les produits, les gens pour les acheter et, plus généralement, pour prendre des décisions, les chercheurs en intelligence artificielle pour comprendre la nature de l'être humain. Aujourd'hui on dispose d'une quantité d'information jamais vue avant, mais qui reste peu accessible. Les mégadonnées (en anglais « big data ») ne sont pas organisées, surtout pour certaines langues – d'où la difficulté à les exploiter. La recherche française souffre d'un manque de ressources « prêt-à-porter » pour conduire des tests. Cette thèse a pour objectif d'explorer la nature des sentiments et des émotions, dans le cadre du Traitement Automatique du Langage et des Corpus. Les contributions de cette thèse sont multiples : création de nouvelles ressources pour l'analyse du sentiment et de l'émotion, emploi et comparaison de plusieurs techniques d'apprentissage automatique, et, plus important, l'étude du problème sous différents points de vue : classification des commentaires en ligne en polarité (positive et négative), Aspect-Based Sentiment Analysis des caractéristiques du produit recensé. Enfin, une étude psycholinguistique, soutenue par des approches lexicales et d'apprentissage automatique, sur le rapport entre celui qui juge et l'objet jugé. / "Sentiment", "opinion" and "emotion" are very vaguely defined words; not even the dictionary is of much help, since it defines each of the three by using the other two. And yet, the civilised world is heavily affected by opinions: companies need them to understand how to sell their products; people use them to buy the most fitting product and, more generally, to weigh their decisions; researchers exploit them in Artificial Intelligence studies to understand the nature of the human being. Today we can count on a huge amount of available information, though it is hard to exploit. In fact, so-called "big data" are not always structured, especially for certain languages, and French research suffers from a lack of readily available resources for experiments. In the context of Natural Language Processing, this thesis aims to explore the nature of sentiment and emotion. Our contributions to the NLP research community include the creation of new resources for sentiment and emotion analysis, and tests and comparisons of several machine learning methods to study the problem from different points of view: classification of online reviews by sentiment polarity, and classification of product characteristics using Aspect-Based Sentiment Analysis. Finally, we present a psycholinguistic study, supported by machine learning and lexical approaches, on the relation between who judges (the reviewer) and the object being judged (the product).
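A minimal sketch of the simplest setting discussed above, polarity classification of reviews, using a generic TF-IDF plus logistic-regression pipeline; the tiny set of French reviews and labels is invented, and the thesis itself compares several learning methods and aspect-based labels rather than this particular pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical labelled reviews (far too few for a real model; illustration only)
reviews = [
    "Produit excellent, je recommande vivement.",
    "Très déçu, la qualité est médiocre.",
    "Livraison rapide et article conforme, parfait.",
    "Ne fonctionne plus après deux jours, à éviter.",
]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features of the review text fed to a logistic-regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

print(model.predict(["Article de très bonne qualité, parfait."]))
```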
