Global ETD Search

361	Approche multi-niveaux pour l'analyse des données textuelles non-standardisées : corpus de textes en moyen français / Multi-level approach for the analysis of non-standardized textual data : corpus of texts in middle french Aouini, Mourad 19 March 2018 (has links) Cette thèse présente une approche d'analyse des textes non-standardisé qui consiste à modéliser une chaine de traitement permettant l’annotation automatique de textes à savoir l’annotation grammaticale en utilisant une méthode d’étiquetage morphosyntaxique et l’annotation sémantique en mettant en œuvre un système de reconnaissance des entités nommées. Dans ce contexte, nous présentons un système d'analyse du Moyen Français qui est une langue en pleine évolution dont l’orthographe, le système flexionnel et la syntaxe ne sont pas stables. Les textes en Moyen Français se singularisent principalement par l’absence d’orthographe normalisée et par la variabilité tant géographique que chronologique des lexiques médiévaux.L’objectif est de mettre en évidence un système dédié à la construction de ressources linguistiques, notamment la construction des dictionnaires électroniques, se basant sur des règles de morphologie. Ensuite, nous présenterons les instructions que nous avons établies pour construire un étiqueteur morphosyntaxique qui vise à produire automatiquement des analyses contextuelles à l’aide de grammaires de désambiguïsation. Finalement, nous retracerons le chemin qui nous a conduits à mettre en place des grammaires locales permettant de retrouver les entités nommées. De ce fait, nous avons été amenés à constituer un corpus MEDITEXT regroupant des textes en Moyen Français apparus entre le fin du XIIIème et XVème siècle. / This thesis presents a non-standardized text analysis approach which consists a chain process modeling allowing the automatic annotation of texts: grammar annotation using a morphosyntactic tagging method and semantic annotation by putting in operates a system of named-entity recognition. In this context, we present a system analysis of the Middle French which is a language in the course of evolution including: spelling, the flexional system and the syntax are not stable. The texts in Middle French are mainly distinguished by the absence of normalized orthography and the geographical and chronological variability of medieval lexicons.The main objective is to highlight a system dedicated to the construction of linguistic resources, in particular the construction of electronic dictionaries, based on rules of morphology. Then, we will present the instructions that we have carried out to construct a morphosyntactic tagging which aims at automatically producing contextual analyzes using the disambiguation grammars. Finally, we will retrace the path that led us to set up local grammars to find the named entities. Hence, we were asked to create a MEDITEXT corpus of texts in Middle French between the end of the thirteenth and fifteenth centuries. Approche multi-Niveaux Données textuelles non-Standardisées Moyen Français Étiquetage morphosyntaxique Reconnaissance des entités nommées Tal MEDITEXT Multi-Level approach Standardized textual data Middle French Morphosyntactic tagging Named-Entity recognition Nlp 402
362	Extractive Multi-document Summarization of News Articles Grant, Harald January 2019 (has links) Publicly available data grows exponentially through web services and technological advancements. To comprehend large data-streams multi-document summarization (MDS) can be used. In this research, the area of multi-document summarization is investigated. Multiple systems for extractive multi-document summarization are implemented using modern techniques, in the form of the pre-trained BERT language model for word embeddings and sentence classification. This is combined with well proven techniques, in the form of the TextRank ranking algorithm, the Waterfall architecture and anti-redundancy filtering. The systems are evaluated on the DUC-2002, 2006 and 2007 datasets using the ROUGE metric. Where the results show that the BM25 sentence representation implemented in the TextRank model using the Waterfall architecture and an anti-redundancy technique outperforms the other implementations, providing competitive results with other state-of-the-art systems. A cohesive model is derived from the leading system and tried in a user study using a real-world application. The user study is conducted using a real-time news detection application with users from the news-domain. The study shows a clear favour for cohesive summaries in the case of extractive multi-document summarization. Where the cohesive summary is preferred in the majority of cases. NLP extractive summarization multi-document neural embeddings information extraction text-to-text generation textrank BERT attention Transformer transfer learning fine-tuning ROUGE
363	Cluster Analysis with Meaning : Detecting Texts that Convey the Same Message / Klusteranalys med mening : Detektering av texter som uttrycker samma sak Öhrström, Fredrik January 2018 (has links) Textual duplicates can be hard to detect as they differ in words but have similar semantic meaning. At Etteplan, a technical documentation company, they have many writers that accidentally re-write existing instructions explaining procedures. These "duplicates" clutter the database. This is not desired because it is duplicate work. The condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst, and also how to calculate how many duplicates there are. The corpus is small, but written in a controlled natural language called Simplified Technical English. The method uses document embeddings from doc2vec and clustering by use of HDBSCAN* and validation using Density-Based Clustering Validation index (DBCV), to chart the problems. A survey was sent out to try to determine a threshold value of when documents stop being duplicates, and then using this value, a theoretical duplicate count was calculated. nlp text mining clustering semantic meaning text clustering semantic duplicates simplified technical english duplicate detection dbcv doc2vec etteplan Computer Sciences Datavetenskap (datalogi)
364	Surface Realization Using a Featurized Syntactic Statistical Language Model Packer, Thomas L. 13 March 2006 (has links) An important challenge in natural language surface realization is the generation of grammatical sentences from incomplete sentence plans. Realization can be broken into a two-stage process consisting of an over-generating rule-based module followed by a ranker that outputs the most probable candidate sentence based on a statistical language model. Thus far, an n-gram language model has been evaluated in this context. More sophisticated syntactic knowledge is expected to improve such a ranker. In this thesis, a new language model based on featurized functional dependency syntax was developed and evaluated. Generation accuracies and cross-entropy for the new language model did not beat the comparison bigram language model. natural language generation natural language processing NLP NLG Bayesian networks decision trees context specific independence realization statistical language model standard pipeline architecture n-gram (bigram) language model syntax features statistical model machine learning Computer Sciences
365	Élaboration d'ontologies médicales pour une approche multi-agents d'aide à la décision clinique / A multi-agent framework for the development of medical ontologies in clinical decision making Shen, Ying 20 March 2015 (has links) La combinaison du traitement sémantique des connaissances (Semantic Processing of Knowledge) et de la modélisation des étapes de raisonnement (Modeling Steps of Reasoning), utilisés dans le domaine clinique, offrent des possibilités intéressantes, nécessaires aussi, pour l’élaboration des ontologies médicales, utiles à l'exercice de cette profession. Dans ce cadre, l'interrogation de banques de données médicales multiples, comme MEDLINE, PubMed… constitue un outil précieux mais insuffisant car elle ne permet pas d'acquérir des connaissances facilement utilisables lors d’une démarche clinique. En effet, l'abondance de citations inappropriées constitue du bruit et requiert un tri fastidieux, incompatible avec une pratique efficace de la médecine.Dans un processus itératif, l'objectif est de construire, de façon aussi automatisée possible, des bases de connaissances médicales réutilisables, fondées sur des ontologies et, dans cette thèse, nous développons une série d'outils d'acquisition de connaissances qui combinent des opérateurs d'analyse linguistique et de modélisation de la clinique, fondés sur une typologie des connaissances mises en œuvre, et sur une implémentation des différents modes de raisonnement employés. La connaissance ne se résume pas à des informations issues de bases de données ; elle s’organise grâce à des opérateurs cognitifs de raisonnement qui permettent de la rendre opérationnelle dans le contexte intéressant le praticien.Un système multi-agents d’aide à la décision clinique (SMAAD) permettra la coopération et l'intégration des différents modules entrant dans l'élaboration d'une ontologie médicale et les sources de données sont les banques médicales, comme MEDLINE, et des citations extraites par PubMed ; les concepts et le vocabulaire proviennent de l'Unified Medical Language System (UMLS).Concernant le champ des bases de connaissances produites, la recherche concerne l'ensemble de la démarche clinique : le diagnostic, le pronostic, le traitement, le suivi thérapeutique de différentes pathologies, dans un domaine médical donné.Différentes approches et travaux sont recensés, dans l’état de question, et divers paradigmes sont explorés : 1) l'Evidence Base Medicine (une médecine fondée sur des indices). Un indice peut se définir comme un signe lié à son mode de mise en œuvre ; 2) Le raisonnement à partir de cas (RàPC) se fonde sur l'analogie de situations cliniques déjà rencontrées ; 3) Différentes approches sémantiques permettent d'implémenter les ontologies.Sur l’ensemble, nous avons travaillé les aspects logiques liés aux opérateurs cognitifs de raisonnement utilisés et nous avons organisé la coopération et l'intégration des connaissances exploitées durant les différentes étapes du processus clinique (diagnostic, pronostic, traitement, suivi thérapeutique). Cette intégration s’appuie sur un SMAAD : système multi-agent d'aide à la décision. / The combination of semantic processing of knowledge and modelling steps of reasoning employed in the clinical field offers exciting and necessary opportunities to develop ontologies relevant to the practice of medicine. In this context, multiple medical databases such as MEDLINE, PubMed are valuable tools but not sufficient because they cannot acquire the usable knowledge easily in a clinical approach. Indeed, abundance of inappropriate quotations constitutes the noise and requires a tedious sort incompatible with the practice of medicine.In an iterative process, the objective is to build an approach as automated as possible, the reusable medical knowledge bases is founded on an ontology of the concerned fields. In this thesis, the author will develop a series of tools for knowledge acquisition combining the linguistic analysis operators and clinical modelling based on the implemented knowledge typology and an implementation of different forms of employed reasoning. Knowledge is not limited to the information from data, but also and especially on the cognitive operators of reasoning for making them operational in the context relevant to the practitioner.A multi-agent system enables the integration and cooperation of the various modules used in the development of a medical ontology.The data sources are from medical databases such as MEDLINE, the citations retrieved by PubMed, and the concepts and vocabulary from the Unified Medical Language System (UMLS).Regarding the scope of produced knowledge bases, the research concerns the entire clinical process: diagnosis, prognosis, treatment, and therapeutic monitoring of various diseases in a given medical field.It is essential to identify the different approaches and the works already done.Different paradigms will be explored: 1) Evidence Based Medicine. An index can be defined as a sign related to its mode of implementation; 2) Case-based reasoning, which based on the analogy of clinical situations already encountered; 3) The different semantic approaches which are used to implement ontologies.On the whole, we worked on logical aspects related to cognitive operators of used reasoning, and we organized the cooperation and integration of exploited knowledge during the various stages of the clinical process (diagnosis, prognosis, treatment, therapeutic monitoring). This integration is based on a SMAAD: multi-agent system for decision support. Traitement automatique des langues (TAL) Raisonnement à partir de cas (RàPC) Ontologiques médicales UMLS Télémédecine Natural Language Processing (NLP) Case-based reasoning (CBR) Medical Ontology UMLS Telemedicine 410
366	Multi-Agent User-Centric Specialization and Collaboration for Information Retrieval Mooman, Abdelniser January 2012 (has links) The amount of information on the World Wide Web (WWW) is rapidly growing in pace and topic diversity. This has made it increasingly difficult, and often frustrating, for information seekers to retrieve the content they are looking for as information retrieval systems (e.g., search engines) are unable to decipher the relevance of the retrieved information as it pertains to the information they are searching for. This issue can be decomposed into two aspects: 1) variability of information relevance as it pertains to an information seeker. In other words, different information seekers may enter the same search text, or keywords, but expect completely different results. It is therefore, imperative that information retrieval systems possess an ability to incorporate a model of the information seeker in order to estimate the relevance and context of use of information before presenting results. Of course, in this context, by a model we mean the capture of trends in the information seeker's search behaviour. This is what many researchers refer to as the personalized search. 2) Information diversity. Information available on the World Wide Web today spans multitudes of inherently overlapping topics, and it is difficult for any information retrieval system to decide effectively on the relevance of the information retrieved in response to an information seeker's query. For example, the information seeker who wishes to use WWW to learn about a cure for a certain illness would receive a more relevant answer if the search engine was optimized into such domains of topics. This is what is being referred to in the WWW nomenclature as a 'specialized search'. This thesis maintains that the information seeker's search is not intended to be completely random and therefore tends to portray itself as consistent patterns of behaviour. Nonetheless, this behaviour, despite being consistent, can be quite complex to capture. To accomplish this goal the thesis proposes a Multi-Agent Personalized Information Retrieval with Specialization Ontology (MAPIRSO). MAPIRSO offers a complete learning framework that is able to model the end user's search behaviour and interests and to organize information into categorized domains so as to ensure maximum relevance of its responses as they pertain to the end user queries. Specialization and personalization are accomplished using a group of collaborative agents. Each agent employs a Reinforcement Learning (RL) strategy to capture end user's behaviour and interests. Reinforcement learning allows the agents to evolve their knowledge of the end user behaviour and interests as they function to serve him or her. Furthermore, REL allows each agent to adapt to changes in an end user's behaviour and interests. Specialization is the process by which new information domains are created based on existing information topics, allowing new kinds of content to be built exclusively for information seekers. One of the key characteristics of specialization domains is the seeker centric - which allows intelligent agents to create new information based on the information seekers' feedback and their behaviours. Specialized domains are created by intelligent agents that collect information from a specific domain topic. The task of these specialized agents is to map the user's query to a repository of specific domains in order to present users with relevant information. As a result, mapping users' queries to only relevant information is one of the fundamental challenges in Artificial Intelligent (AI) and machine learning research. Our approach employs intelligent cooperative agents that specialize in building personalized ontology information domains that pertain to each information seeker's specific needs. Specializing and categorizing information into unique domains is one of the challenge areas that have been addressed and various proposed solutions were evaluated and adopted to address growing information. However, categorizing information into unique domains does not satisfy each individualized information seeker. Information seekers might search for similar topics, but each would have different interests. For example, medical information of a specific medical domain has different importance to both the doctor and patients. The thesis presents a novel solution that will resolve the growing and diverse information by building seeker centric specialized information domains that are personalized through the information seekers' feedback and behaviours. To address this challenge, the research examines the fundamental components that constitute the specialized agent: an intelligent machine learning system, user input queries, an intelligent agent, and information resources constructed through specialized domains. Experimental work is reported to demonstrate the efficiency of the proposed solution in addressing the overlapping information growth. The experimental work utilizes extensive user-centric specialized domain topics. This work employs personalized and collaborative multi learning agents and ontology techniques thereby enriching the queries and domains of the user. Therefore, experiments and results have shown that building specialized ontology domains, pertinent to the information seekers' needs, are more precise and efficient compared to other information retrieval applications and existing search engines. Electrical and Computer Engineering
367	Multi-Agent User-Centric Specialization and Collaboration for Information Retrieval Mooman, Abdelniser January 2012 (has links) The amount of information on the World Wide Web (WWW) is rapidly growing in pace and topic diversity. This has made it increasingly difficult, and often frustrating, for information seekers to retrieve the content they are looking for as information retrieval systems (e.g., search engines) are unable to decipher the relevance of the retrieved information as it pertains to the information they are searching for. This issue can be decomposed into two aspects: 1) variability of information relevance as it pertains to an information seeker. In other words, different information seekers may enter the same search text, or keywords, but expect completely different results. It is therefore, imperative that information retrieval systems possess an ability to incorporate a model of the information seeker in order to estimate the relevance and context of use of information before presenting results. Of course, in this context, by a model we mean the capture of trends in the information seeker's search behaviour. This is what many researchers refer to as the personalized search. 2) Information diversity. Information available on the World Wide Web today spans multitudes of inherently overlapping topics, and it is difficult for any information retrieval system to decide effectively on the relevance of the information retrieved in response to an information seeker's query. For example, the information seeker who wishes to use WWW to learn about a cure for a certain illness would receive a more relevant answer if the search engine was optimized into such domains of topics. This is what is being referred to in the WWW nomenclature as a 'specialized search'. This thesis maintains that the information seeker's search is not intended to be completely random and therefore tends to portray itself as consistent patterns of behaviour. Nonetheless, this behaviour, despite being consistent, can be quite complex to capture. To accomplish this goal the thesis proposes a Multi-Agent Personalized Information Retrieval with Specialization Ontology (MAPIRSO). MAPIRSO offers a complete learning framework that is able to model the end user's search behaviour and interests and to organize information into categorized domains so as to ensure maximum relevance of its responses as they pertain to the end user queries. Specialization and personalization are accomplished using a group of collaborative agents. Each agent employs a Reinforcement Learning (RL) strategy to capture end user's behaviour and interests. Reinforcement learning allows the agents to evolve their knowledge of the end user behaviour and interests as they function to serve him or her. Furthermore, REL allows each agent to adapt to changes in an end user's behaviour and interests. Specialization is the process by which new information domains are created based on existing information topics, allowing new kinds of content to be built exclusively for information seekers. One of the key characteristics of specialization domains is the seeker centric - which allows intelligent agents to create new information based on the information seekers' feedback and their behaviours. Specialized domains are created by intelligent agents that collect information from a specific domain topic. The task of these specialized agents is to map the user's query to a repository of specific domains in order to present users with relevant information. As a result, mapping users' queries to only relevant information is one of the fundamental challenges in Artificial Intelligent (AI) and machine learning research. Our approach employs intelligent cooperative agents that specialize in building personalized ontology information domains that pertain to each information seeker's specific needs. Specializing and categorizing information into unique domains is one of the challenge areas that have been addressed and various proposed solutions were evaluated and adopted to address growing information. However, categorizing information into unique domains does not satisfy each individualized information seeker. Information seekers might search for similar topics, but each would have different interests. For example, medical information of a specific medical domain has different importance to both the doctor and patients. The thesis presents a novel solution that will resolve the growing and diverse information by building seeker centric specialized information domains that are personalized through the information seekers' feedback and behaviours. To address this challenge, the research examines the fundamental components that constitute the specialized agent: an intelligent machine learning system, user input queries, an intelligent agent, and information resources constructed through specialized domains. Experimental work is reported to demonstrate the efficiency of the proposed solution in addressing the overlapping information growth. The experimental work utilizes extensive user-centric specialized domain topics. This work employs personalized and collaborative multi learning agents and ontology techniques thereby enriching the queries and domains of the user. Therefore, experiments and results have shown that building specialized ontology domains, pertinent to the information seekers' needs, are more precise and efficient compared to other information retrieval applications and existing search engines. Electrical and Computer Engineering
368	Relation Classification using Semantically-Enhanced Syntactic Dependency Paths : Combining Semantic and Syntactic Dependencies for Relation Classification using Long Short-Term Memory Networks Capshaw, Riley January 2018 (has links) Many approaches to solving tasks in the field of Natural Language Processing (NLP) use syntactic dependency trees (SDTs) as a feature to represent the latent nonlinear structure within sentences. Recently, work in parsing sentences to graph-based structures which encode semantic relationships between words—called semantic dependency graphs (SDGs)—has gained interest. This thesis seeks to explore the use of SDGs in place of and alongside SDTs within a relation classification system based on long short-term memory (LSTM) neural networks. Two methods for handling the information in these graphs are presented and compared between two SDG formalisms. Three new relation extraction system architectures have been created based on these methods and are compared to a recent state-of-the-art LSTM-based system, showing comparable results when semantic dependencies are used to enhance syntactic dependencies, but with significantly fewer training parameters. Natural language processing NLP computational linguistics syntactic dependency trees semantic dependency graphs relation classification relation extraction artificial intelligence machine learning deep learning neural networks long short-term memory LSTM
369	Atribuição automática de autoria de obras da literatura brasileira / Atribuição automática de autoria de obras da literatura brasileira Nobre Neto, Francisco Dantas 19 January 2010 (has links) Made available in DSpace on 2015-05-14T12:36:48Z (GMT). No. of bitstreams: 1 arquivototal.pdf: 1280792 bytes, checksum: d335d67b212e054f48f0e8bca0798fe5 (MD5) Previous issue date: 2010-01-19 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Authorship attribution consists in categorizing an unknown document among some classes of authors previously selected. Knowledge about authorship of a text can be useful when it is required to detect plagiarism in any literary document or to properly give the credits to the author of a book. The most intuitive form of human analysis of a text is by selecting some characteristics that it has. The study of selecting attributes in any written document, such as average word length and vocabulary richness, is known as stylometry. For human analysis of an unknown text, the authorship discovery can take months, also becoming tiring activity. Some computational tools have the functionality of extracting such characteristics from the text, leaving the subjective analysis to the researcher. However, there are computational methods that, in addition to extract attributes, make the authorship attribution, based in the characteristics gathered in the text. Techniques such as neural network, decision tree and classification methods have been applied to this context and presented results that make them relevant to this question. This work presents a data compression method, Prediction by Partial Matching (PPM), as a solution of the authorship attribution problem of Brazilian literary works. The writers and works selected to compose the authors database were, mainly, by their representative in national literature. Besides, the availability of the books has also been considered. The PPM performs the authorship identification without any subjective interference in the text analysis. This method, also, does not make use of attributes presents in the text, differently of others methods. The correct classification rate obtained with PPM, in this work, was approximately 93%, while related works exposes a correct rate between 72% and 89%. In this work, was done, also, authorship attribution with SVM approach. For that, were selected attributes in the text divided in two groups, one word based and other in function-words frequency, obtaining a correct rate of 36,6% and 88,4%, respectively. / Atribuição de autoria consiste em categorizar um documento desconhecido dentre algumas classes de autores previamente selecionadas. Saber a autoria de um texto pode ser útil quando é necessário detectar plágio em alguma obra literária ou dar os devidos créditos ao autor de um livro. A forma mais intuitiva ao ser humano para se analisar um texto é selecionando algumas características que ele possui. O estudo de selecionar atributos em um documento escrito, como tamanho médio das palavras e riqueza vocabular, é conhecido como estilometria. Para análise humana de um texto desconhecido, descobrir a autoria pode demandar meses, além de se tornar uma tarefa cansativa. Algumas ferramentas computacionais têm a funcionalidade de extrair tais características do texto, deixando a análise subjetiva para o pesquisador. No entanto, existem métodos computacionais que, além de extrair atributos, atribuem a autoria baseado nas características colhidas ao longo do texto. Técnicas como redes neurais, árvores de decisão e métodos de classificação já foram aplicados neste contexto e apresentaram resultados que os tornam relevantes para tal questão. Este trabalho apresenta um método de compressão de dados, o Prediction by Partial Matching (PPM), para solução do problema de atribuição de autoria de obras da literatura brasileira. Os escritores e obras selecionados para compor o banco de autores se deram, principalmente, pela representatividade que possuem na literatura nacional. Além disso, a disponibilidade dos livros em formato eletrônico também foi considerada. O PPM realiza a identificação de autoria sem ter qualquer interferência subjetiva na análise do texto. Este método, também, não faz uso de atributos presentes ao longo do texto, diferentemente de outros métodos. A taxa de classificação correta alcançada com o PPM, neste trabalho, foi de aproximadamente 93%, enquanto que trabalhos relacionados mostram uma taxa de acerto entre 72% e 89%. Neste trabalho, também foi realizado atribuição de autoria com a abordagem SVM. Para isso, foram selecionados atributos no texto dividido em dois tipos, sendo um baseado em palavras e o outro na contagem de palavrasfunção, obtendo uma taxa de acerto de 36,6% e 88,4%, respectivamente. Atribuição de autoria Prediction by Partial Matching (PPM) Processamento de Linguagem Natural (PLN) literatura brasileira Estilometria Authorship Attribution Prediction by Partial Matching (PPM) Natural Language Processing (NLP) Brazilian literature stylometry
370	Elaboration de ressources électroniques pour les noms composés de type N (E+DET=G) N=G du grec moderne / The N (E + DET=G) N=G compound nouns in Modern Greek Kyriakopoulou, Anthoula 25 March 2011 (has links) L'objectif de cette recherche est la construction manuelle de ressources lexicales pour les noms composés grecs qui sont définis par la structure morphosyntaxique : Nom (E+Déterminant au génitif) Nom au génitif, notés N (E+DET:G) N:G (e.g. ζώνη ασφαλείας/ceinture de sécurité). Les ressources élaborées peuvent être utilisées pour leur reconnaissance lexicale automatique dans les textes écrits et dans d'autres applications du TAL. Notre travail s'inscrit dans la perspective de l'élaboration du lexique-grammaire général du grec moderne en vue de l'analyse automatique des textes écrits. Le cadre théorique et méthodologique de cette étude est celui du lexique-grammaire (M. Gross 1975, 1977), qui s'appuie sur la grammaire transformationnelle harisienne.Notre travail s'organise en cinq parties. Dans la première partie, nous délimitons l'objet de notre travail tout en essayant de définir la notion fondamentale qui régit notre étude, à savoir celle de figement. Dans la deuxième partie, nous présentons la méthodologie utilisée pour le recensement de nos données lexicales et nous étudions les phénomènes de variation observés au sein des noms composés de type N (E+DET:G) N:G. La troisième partie est consacrée à la présentation des différentes sous-catégories des N (E+DET:G) N:G identifiées lors de l'étape du recensement et à l'étude de leur structure lexicale interne. La quatrième partie porte sur l'étude syntaxico-sémantique des N (E+DET:G) N:G. Enfin, dans la cinquième partie, nous présentons les différentes méthodes de représentation formalisée que nous proposons pour nos données lexicales en vue de leur reconnaissance lexicale automatique dans les textes écrits. Des échantillons représentatifs des ressources élaborées sont présentés en Annexe / The object of this research is the manual construction of lexical resources for the Greek compound nouns defined by the following morphosyntactic structure : Noun (E+Determiner in genitive) Noun in genitive, (N (E+DET:G) N:G) (e.g. ζώνη ασφαλείας/safety belt). The elaborated resources may be used for their automatic recognition in written texts and other NLP applications. Our study is part of the general lexicon-grammar for Modern Greek in view of automatic processing of written texts. Our theoretical and methodological framework is that of lexicon-grammar (M. Gross 1975, 1977), based on the Transformational Grammar principles defined by Z. S. Harris. Our study is organised into five parts. In the first part, we give an overview of the core notion governing our research : the notions of (fixed) multiword expression (MWE). In the second part, we present the methodology used to collect our lexical data and we study the variation phenomena observed within the framework of the N (E+DET:G) N:G. The third part is dedicated to the presentation of the different N (E+DET:G) N:G categories identified in the listing phase qnd to the study of their lexical composition. The fourth concerns the syntactical and semantic study of the N (E+DET:G) N:G. Finally, the fifth part deals with the formal representation methods we propose for our lexical data in view of their lexical recognition in Greek written texts. Representative samples of the elaborated resources are illustrated in Appendix TAL Appoche linguistique Nom composé Unité lexicale multi-mots Syntagme nominal semi-figé NLP Linguistic approach Compound noun Multiword lexical unit Semi-fixed noun group Electronic morphological dictionary

Search results