Global ETD Search

1	Les récits de voyage des XIVe et XVe siècles lemmatisés : apports lexicographiques au Dictionnaire du moyen français / Lemmatised travel stories of the 14th and 15th centuries : lexicographic contributions to the Dictionnaire du moyen français Herbert, Capucine 26 February 2016 (has links) À partir d’un corpus lemmatisé de récits de voyage des XIVe et XVe siècles directement écrits en français, nous proposons de réfléchir dans ce travail à une nouvelle méthode d’apports lexicographiques au DMF2012 (Dictionnaire du Moyen Français 2012). Dès ses débuts, le DMF a été conçu en mettant en œuvre une « lexicographie évolutive » (Robert Martin) : il s’est constitué par étapes de travail successives, chacune d’elles donnant lieu à une nouvelle version du dictionnaire, consultable en ligne. Jusqu’à la version de 2009, le DMF était composé essentiellement de lexiques que l’on pouvait consulter groupés ou séparément. Une période de synthèse de ces différents lexiques s’est engagée dans le cadre de la version du DMF2010, ouvrant la voie à une réflexion sur une nouvelle méthode de travail. Était-il encore possible de proposer beaucoup de nouvelles entrées à un dictionnaire en comptant déjà 62 371 ? Comment compléter le plus efficacement possible les articles déjà existants ? Cette thèse expose une méthode d’enrichissement du dictionnaire, de la constitution du corpus à la conception d’un travail lexicographique au format inédit, adapté au DMF2012. Une réflexion est également menée sur l’apport singulier des récits de voyage à la connaissance de la langue de la fin du Moyen Âge. Pour mener à bien cette étude, nous avons utilisé le lemmatiseur LGeRM (Lemmes Graphies et Règles Morphologiques) et son développement, « l’outil glossaire », tous deux conçus par Gilles Souvay, ingénieur-informaticien à l’ATILF (Analyse et Traitement Informatique de la Langue Française). Nous nous sommes interrogée sur leur usage afin d’adopter une démarche de recherche méthodique et efficace. / Based on a lemmatised corpus of travel stories from the 14th and 15th centuries and written in French, this study intends to look at new methods of lexicographic contributions to the DMF2012 (Dictionnaire du moyen français 2012). When it was created, the DMF was conceived with an “evolving lexicography” (Robert Martin) : it was constituted step by step, each one leading to a new version of the dictionary, available on line. Until the 2009 version, the DMF was mostly made of lexicons that could be looked up in groups or separately. A compilation of different lexicons started in the new version of the DMF2010, leading to a reflection on a new method to enrich the dictionary. Was it possible to propose many new terms to a dictionary that already had 62 371 entries ? How could the existing articles be efficiently completed ? This thesis aims at exploring a method to enrich the dictionary, from the compilation of a corpus to the creation of a lexicographic work with a new structure, adapted to the DMF2012. After that work, it is possible to determine the particular contribution of travel stories to the knowledge of the language used at the end of the Middle Ages. To carry out this study, we have used the LGeRM lemmatiser (Lemmes Graphies et Règles Morphologiques) and its expension “outil glossaire”, both developed by Gilles Souvay, a computer engineer at the ATILF (Analyse et Traitement Informatique de la Langue Française). We also had to think about a way to use these tools, leading to a methodical and efficient approach. Moyen français Lemmatisation Dictionnaire électronique Lexique Moyen Âge 447.02
2	Modèles et outils pour des bases lexicales "métier" multilingues et contributives de grande taille, utilisables tant en traduction automatique et automatisée que pour des services dictionnairiques variés / Methods and tools for large multilingual and contributive lexical databases, usable as well in machine (aided) translation as for various dictonary services Zhang, Ying 28 June 2016 (has links) Notre recherche se situe en lexicographie computationnelle, et concerne non seulement le support informatique aux ressources lexicales utiles pour la TA (traduction automatique) et la THAM (traduction humaine aidée par la machine), mais aussi l'architecture linguistique des bases lexicales supportant ces ressources, dans un contexte opérationnel (thèse CIFRE avec L&M).Nous commençons par une étude de l'évolution des idées, depuis l'informatisation des dictionnaires classiques jusqu'aux plates-formes de construction de vraies "bases lexicales" comme JIBIKI-1 [Mangeot, M. et al., 2003 ; Sérasset, G., 2004] et JIBIKI-2 [Zhang, Y. et al., 2014]. Le point de départ a été le système PIVAX-1 [Nguyen, H.-T. et al., 2007 ; Nguyen, H. T. & Boitet, C., 2009] de bases lexicales pour systèmes de TA hétérogènes à pivot lexical supportant plusieurs volumes par "espace lexical" naturel ou artificiel (UNL). En prenant en compte le contexte industriel, nous avons centré notre recherche sur certains problèmes, informatiques et lexicographiques.Pour passer à l'échelle, et pour profiter des nouvelles fonctionnalités permises par JIBIKI-2, dont les "liens riches", nous avons transformé PIVAX-1 en PIVAX-2, et réactivé le projet GBDLEX-UW++ commencé lors du projet ANR TRAOUIERO, en réimportant toutes les données (multilingues) supportées par PIVAX-1, et en les rendant disponibles sur un serveur ouvert.Partant d'un besoin de L&M concernant les acronymes, nous avons étendu la "macrostructure" de PIVAX en y intégrant des volumes de "prolexèmes", comme dans PROLEXBASE [Tran, M. & Maurel, D., 2006]. Nous montrons aussi comment l'étendre pour répondre à de nouveaux besoins, comme ceux du projet INNOVALANGUES. Enfin, nous avons créé un "intergiciel de lemmatisation", LEXTOH, qui permet d'appeler plusieurs analyseurs morphologiques ou lemmatiseurs, puis de fusionner et filtrer leurs résultats. Combiné à un nouvel outil de création de dictionnaires, CREATDICO, LEXTOH permet de construire à la volée un "mini-dictionnaire" correspondant à une phrase ou à un paragraphe d'un texte en cours de "post-édition" en ligne sous IMAG/SECTRA, ce qui réalise la fonctionnalité d'aide lexicale proactive prévue dans [Huynh, C.-P., 2010]. On pourra aussi l'utiliser pour créer des corpus parallèles "factorisés" pour construire des systèmes de TA en MOSES. / Our research is in computational lexicography, and concerns not only the computer support to lexical resources useful for MT (machine translation) and MAHT (Machine Aided Human Translation), but also the linguistic architecture of lexical databases supporting these resources in an operational context (CIFRE thesis with L&M).We begin with a study of the evolution of ideas in this area, since the computerization of classical dictionaries to platforms for building up true "lexical databases" such as JIBIKI-1 [Mangeot, M. et al., 2003 ; Sérasset, G., 2004] and JIBIKI-2 [Zhang, Y. et al., 2014]. The starting point was the PIVAX-1 system [Nguyen, H.-T. et al., 2007 ; Nguyen, H. T. & Boitet, C., 2009] designed for lexical bases for heterogeneous MT systems with a lexical pivot, able to support multiple volumes in each "lexical space", be it natural or artificial (as UNL). Considering the industrial context, we focused our research on some issues, in informatics and lexicography.To scale up, and to add some new features enabled by JIBIKI-2, such as the "rich links", we have transformed PIVAX-1 into PIVAX-2, and reactivated the GBDLEX-UW++ project that started during the ANR TRAOUIERO project, by re-importing all (multilingual) data supported by PIVAX-1, and making them available on an open server.Hence a need for L&M for acronyms, we expanded the "macrostructure" of PIVAX incorporating volumes of "prolexemes" as in PROLEXBASE [Tran, M. & Maurel, D., 2006]. We also show how to extend it to meet new needs such as those of the INNOVALANGUES project. Finally, we have created a "lemmatisation middleware", LEXTOH, which allows calling several morphological analyzers or lemmatizers and then to merge and filter their results. Combined with a new dictionary creation tool, CREATDICO, LEXTOH allows to build on the fly a "mini-dictionary" corresponding to a sentence or a paragraph of a text being "post-edited" online under IMAG/SECTRA, which performs the lexical proactive support functionality foreseen in [Huynh, C.-P., 2010]. It could also be used to create parallel corpora with the aim to build MOSES-based "factored MT systems". Base lexicale Jibiki-Pivax Cognition située Gestion des acronymes Services dictionnairiques Intergiciel de lemmatisation Multilingual lexical databases Jibiki-Pivax Situated cognition Acronyms management Dictonary services Lemmatisation middleware 004
3	Efficient development of human language technology resources for resource-scarce languages / Martin Johannes Puttkammer Puttkammer, Martin Johannes January 2014 (has links) The development of linguistic data, especially annotated corpora, is imperative for the human language technology enablement of any language. The annotation process is, however, often time-consuming and expensive. As such, various projects make use of several strategies to expedite the development of human language technology resources. For resource-scarce languages – those with limited resources, finances and expertise – the efficiency of these strategies has not been conclusively established. This study investigates the efficiency of some of these strategies in the development of resources for resource-scarce languages, in order to provide recommendations for future projects facing decisions regarding which strategies they should implement. For all experiments, Afrikaans is used as an example of a resource-scarce language. Two tasks, viz. lemmatisation of text data and orthographic transcription of audio data, are evaluated in terms of quality and in terms of the time required to perform the task. The main focus of the study is on the skill level of the annotators, software environments which aim to improve the quality and time needed to perform annotations, and whether it is beneficial to annotate more data, or to increase the quality of the data. We outline and conduct systematic experiments on each of the three focus areas in order to determine the efficiency of each. First, we investigated the influence of a respondent’s skill level on data annotation by using untrained, sourced respondents for annotation of linguistic data for Afrikaans. We compared data annotated by experts, novices and laymen. From the results it was evident that the experts outperformed the non-experts on both tasks, and that the differences in performance were statistically significant. Next, we investigated the effect of software environments on data annotation to determine the benefits of using tailor-made software as opposed to general-purpose or domain-specific software. The comparison showed that, for these two specific projects, it was beneficial in terms of time and quality to use tailor-made software rather than domain-specific or general-purpose software. However, in the context of linguistic annotation of data for resource-scarce languages, the additional time needed to develop tailor-made software is not justified by the savings in annotation time. Finally, we compared systems trained with data of varying levels of quality and quantity, to determine the impact of quality versus quantity on the performance of systems. When comparing systems trained with gold standard data to systems trained with more data containing a low level of errors, the systems trained with the erroneous data were statistically significantly better. Thus, we conclude that it is more beneficial to focus on the quantity rather than on the quality of training data. Based on the results and analyses of the experiments, we offer some recommendations regarding which of the methods should be implemented in practice. For a project aiming to develop gold standard data, the highest quality annotations can be obtained by using experts to double-blind annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). For a project that aims to develop a core technology, experts or trained novices should be used to single-annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). / PhD (Linguistics and Literary Theory), North-West University, Potchefstroom Campus, 2014 Afrikaans Automatic speech recognition Lemmatisation Resource-scarce languages Human language technology Resource development
4	Efficient development of human language technology resources for resource-scarce languages / Martin Johannes Puttkammer Puttkammer, Martin Johannes January 2014 (has links) The development of linguistic data, especially annotated corpora, is imperative for the human language technology enablement of any language. The annotation process is, however, often time-consuming and expensive. As such, various projects make use of several strategies to expedite the development of human language technology resources. For resource-scarce languages – those with limited resources, finances and expertise – the efficiency of these strategies has not been conclusively established. This study investigates the efficiency of some of these strategies in the development of resources for resource-scarce languages, in order to provide recommendations for future projects facing decisions regarding which strategies they should implement. For all experiments, Afrikaans is used as an example of a resource-scarce language. Two tasks, viz. lemmatisation of text data and orthographic transcription of audio data, are evaluated in terms of quality and in terms of the time required to perform the task. The main focus of the study is on the skill level of the annotators, software environments which aim to improve the quality and time needed to perform annotations, and whether it is beneficial to annotate more data, or to increase the quality of the data. We outline and conduct systematic experiments on each of the three focus areas in order to determine the efficiency of each. First, we investigated the influence of a respondent’s skill level on data annotation by using untrained, sourced respondents for annotation of linguistic data for Afrikaans. We compared data annotated by experts, novices and laymen. From the results it was evident that the experts outperformed the non-experts on both tasks, and that the differences in performance were statistically significant. Next, we investigated the effect of software environments on data annotation to determine the benefits of using tailor-made software as opposed to general-purpose or domain-specific software. The comparison showed that, for these two specific projects, it was beneficial in terms of time and quality to use tailor-made software rather than domain-specific or general-purpose software. However, in the context of linguistic annotation of data for resource-scarce languages, the additional time needed to develop tailor-made software is not justified by the savings in annotation time. Finally, we compared systems trained with data of varying levels of quality and quantity, to determine the impact of quality versus quantity on the performance of systems. When comparing systems trained with gold standard data to systems trained with more data containing a low level of errors, the systems trained with the erroneous data were statistically significantly better. Thus, we conclude that it is more beneficial to focus on the quantity rather than on the quality of training data. Based on the results and analyses of the experiments, we offer some recommendations regarding which of the methods should be implemented in practice. For a project aiming to develop gold standard data, the highest quality annotations can be obtained by using experts to double-blind annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). For a project that aims to develop a core technology, experts or trained novices should be used to single-annotate data in tailor-made software (if provided for in the budget or if the development time can be justified by the savings in annotation time). / PhD (Linguistics and Literary Theory), North-West University, Potchefstroom Campus, 2014 Afrikaans Automatic speech recognition Lemmatisation Resource-scarce languages Human language technology Resource development
5	Recherche d'information translinguistique sur les documents en arabe Kadri, Youssef January 2008 (has links) Thèse numérisée par la Division de la gestion de documents et des archives de l'Université de Montréal. Recherche d'information arabe Lemmatisation Traduction de requête Combinaison de ressources de traduction Facteurs de confiance
6	Lemmatisation of derivative nouns in Xitsonga-English bilingual dictionaries Chavalala, Bulu James January 2005 (has links) Thesis (M. A. (African Languages)) --University of Limpopo, 2005 / Refer to the document Lemmatisation Derivative nouns Xitsonga-English bilingual dictionaries
7	Recherche d'information translinguistique sur les documents en arabe Kadri, Youssef January 2008 (has links) Thèse numérisée par la Division de la gestion de documents et des archives de l'Université de Montréal Recherche d'information arabe Lemmatisation Traduction de requête Combinaison de ressources de traduction Facteurs de confiance
8	Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald Groenewald, Hendrik Johannes January 2006 (has links) A lemmatiser is an important component of various human language technology applicalions for any language. At present, a rule-based le~nmatiserf or Afrikaans already exists, but this lermrlatiser produces disappoinringly low accuracy figures. The performimce of the current lemmatiser serves as motivation for developing another lemmatiser based on an alternative approach than language-specific rules. The alternalive method of lemmatiser corlstruction investigated in this study is memory-based learning. Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Liu "Le~?rnru-idc~)~rifisv~ir'e Arfdr(i~ku~u-n s" 'hmmatiser for Afrikaans'. In order to construct Liu, thc following research objectives are sel: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of I h , iii) to uutomalically determine the algorithm and parameters settings that deliver the best performancc in Lcrms of linguistic accuracy, execution time and memory usage. In order to achieve the first objective, we investigate the processes of inflecrion and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between inflection and derivation. We proceed to define the inflectional calegories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy as well as memory usagc and execution lime increase as the amount of training dala is increased and that Ihe various feature options bave a significant effect on the performance of Lia. The algorithmic parameters and data representation that deliver the best results are determincd by the use of I'Senrck, a programme that implements Wrapped Progre~sive Sampling in order determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms. Aulornaric Lcmlnalisa~ionf or Afrikaans - - Evaluation indicates that an accuracy figure of 92,896 is obtained when training Lia with the best performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lenlmatiser construction. / Thesis (M.Ing. (Computer and Electronical Engineering))--North-West University, Potchefstroom Campus, 2007. Lemmatisation Machine learning Memory-based learning Human language technology Natural language processing Computer engineering TIMBL Afrikaans Morphology
9	Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald Groenewald, Hendrik Johannes January 2006 (has links) A lemmatiser is an important component of various human language technology applicalions for any language. At present, a rule-based le~nmatiserf or Afrikaans already exists, but this lermrlatiser produces disappoinringly low accuracy figures. The performimce of the current lemmatiser serves as motivation for developing another lemmatiser based on an alternative approach than language-specific rules. The alternalive method of lemmatiser corlstruction investigated in this study is memory-based learning. Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Liu "Le~?rnru-idc~)~rifisv~ir'e Arfdr(i~ku~u-n s" 'hmmatiser for Afrikaans'. In order to construct Liu, thc following research objectives are sel: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of I h , iii) to uutomalically determine the algorithm and parameters settings that deliver the best performancc in Lcrms of linguistic accuracy, execution time and memory usage. In order to achieve the first objective, we investigate the processes of inflecrion and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between inflection and derivation. We proceed to define the inflectional calegories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy as well as memory usagc and execution lime increase as the amount of training dala is increased and that Ihe various feature options bave a significant effect on the performance of Lia. The algorithmic parameters and data representation that deliver the best results are determincd by the use of I'Senrck, a programme that implements Wrapped Progre~sive Sampling in order determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms. Aulornaric Lcmlnalisa~ionf or Afrikaans - - Evaluation indicates that an accuracy figure of 92,896 is obtained when training Lia with the best performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lenlmatiser construction. / Thesis (M.Ing. (Computer and Electronical Engineering))--North-West University, Potchefstroom Campus, 2007. Lemmatisation Machine learning Memory-based learning Human language technology Natural language processing Computer engineering TIMBL Afrikaans Morphology
10	Outomatiese Setswana lemma-identifisering / Jeanetta Hendrina Brits Brits, Jeanetta Hendrina January 2006 (has links) Within the context of natural language processing, a lemmatiser is one of the most important core technology modules that has to be developed for a particular language. A lemmatiser reduces words in a corpus to the corresponding lemmas of the words in the lexicon. A lemma is defined as the meaningful base form from which other more complex forms (i.e. variants) are derived. Before a lemmatiser can be developed for a specific language, the concept "lemma" as it applies to that specific language should first be defined clearly. This study concludes that, in Setswana, only stems (and not roots) can act independently as words; therefore, only stems should be accepted as lemmas in the context of automatic lemmatisation for Setswana. Five of the seven parts of speech in Setswana could be viewed as closed classes, which means that these classes are not extended by means of regular morphological processes. The two other parts of speech (nouns and verbs) require the implementation of alternation rules to determine the lemma. Such alternation rules were formalised in this study, for the purpose of development of a Setswana lemmatiser. The existing Setswana grammars were used as basis for these rules. Therewith the precision of the formalisation of these existing grammars to lemmatise Setswana words could be determined. The software developed by Van Noord (2002), FSA 6, is one of the best-known applications available for the development of finite state automata and transducers. Regular expressions based on the formalised morphological rules were used in FSA 6 to create finite state transducers. The code subsequently generated by FSA 6 was implemented in the lemmatiser. The metric that applies to the evaluation of the lemmatiser is precision. On a test corpus of 1 000 words, the lemmatiser obtained 70,92%. In another evaluation on 500 complex nouns and 500 complex verbs separately, the lemmatiser obtained 70,96% and 70,52% respectively. Expressed in numbers the precision on 500 complex and simplex nouns was 78,45% and on complex and simplex verbs 79,59%. The quantitative achievement only gives an indication of the relative precision of the grammars. Nevertheless, it did offer analysed data with which the grammars were evaluated qualitatively. The study concludes with an overview of how these results might be improved in the future. / Thesis (M.A. (African Languages))--North-West University, Potchefstroom Campus, 2006. Computational linguistics Setswana grammar Setswana morphology Lemmatisation Stemming Lemma Natural language processing Regular expression Finite state automata Finite state transducer FSA 6

Search results