Global ETD Search

291	A construção de um glossário bilíngue de futebol com o apoio da Linguística de Corpus. / Building a bilingual glossary on Football with the aid of Corpus Linguistics Paulo Augusto Almeida Seemann 26 March 2012 Ao tentar traduzir um texto específico sobre o tema futebol da língua espanhola para o português brasileiro ou vice-versa, o tradutor se depara com uma infinidade de termos típicos dessa área de especialidade que não constam em muitos dos atuais dicionários e glossários, ou constam de forma limitada, sem abranger muitas das situações reais de uso. Neste trabalho, construímos um glossário bilíngue e bidirecional que contempla os termos futebolísticos mais frequentes no par linguístico espanhol-português, usados rotineiramente na comunicação escrita. Partimos da suposição que a Linguística de Corpus forneceria os meios necessários para tal empreitada. A Linguística de Corpus permite estudar uma língua ou variedade linguística por computador, por meio de evidências empíricas encontradas em um corpus, entendido como um conjunto de dados linguísticos textuais em formato eletrônico e coletado de forma criteriosa. Esta dissertação está dividida em cinco partes. Como introdução, falamos de alguns aspectos históricos das línguas portuguesa e espanhola, da influência do futebol em nossa sociedade, de problemas encontrados em dicionários e glossários, e do potencial das notícias futebolísticas da Internet como referência para a construção do glossário que propomos. Na segunda parte, comentamos a Linguística de Corpus como abordagem e método de investigação, os tipos de corpora e a composição de nosso corpus de estudo, a questão da equivalência na tradução e a forma como selecionamos os termos e seus equivalentes tradutórios, tendo como base a comparação de notícias futebolísticas do Brasil, da Espanha e da Argentina, além da extração e observação de palavras-chave, com a ajuda de ferramentas eletrônicas específicas. 
Na terceira parte, discutimos as questões terminológicas que envolvem este estudo, especialmente as decisões tomadas para a macro e microestrutura de nosso glossário. Na quarta parte, demonstramos como o glossário pode ser apresentado ao consulente e oferecemos uma amostra de verbetes. Na quinta e última parte, fazemos as considerações finais, em que concluímos que a Linguística de Corpus, como abordagem e metodologia, confirmou-se eficiente para a construção do glossário bilíngue, pois a exploração de corpora especializados permitiu identificar os principais termos futebolísticos e seus equivalentes tradutórios usados na comunicação escrita do jornalismo brasileiro, espanhol e argentino, resultando em uma obra de referência bilíngue específica do futebol com quase quatro mil verbetes; todos com exemplos reais de uso / When trying to translate a specific text on football from Spanish into Brazilian Portuguese or vice versa, the translator is faced with a myriad of football-specific terms which are not found in most dictionaries or glossaries, or which are found in a limited way, leaving out many real use situations. In the course of this study, a bilingual and bi-directional glossary was built with the most commonly used football terms in written communication in the Spanish-Portuguese language pair. My initial assumption was that Corpus Linguistics would provide the necessary means for such a task. Corpus Linguistics enables one to study a language or a language variety using a computer, retrieving empirical evidence found in a corpus, which is defined as a set of texts, compiled according to predefined criteria, in electronic format. This dissertation is divided into five parts. 
In the introduction, some historical aspects of Portuguese and Spanish are discussed, as well as the influence of football in our society, the problems found in dictionaries and glossaries, and the potential of football news retrieved from the Internet as a basis for building the glossary proposed. In the second part, I argue that Corpus Linguistics is an approach and a method of research, and present the different types of corpora. Then, the question of equivalence in translation is briefly addressed, the content of our corpus of study is explained, as well as the steps adopted to identify the terms and their translation equivalents, through the comparison of football news from Brazil, Spain and Argentina, and by means of the extraction and observation of keywords, with the aid of specific electronic tools. In the third part, I discuss the terminology issues implicated in this study, especially with reference to the decisions taken for the macro- and microstructure of the glossary. In the fourth part, I propose a form of presenting the glossary to the user and provide a sample of entries. In the fifth and last part, I make the final considerations, in which I conclude that Corpus Linguistics, as an approach and a methodology, proved to be effective for the construction of the targeted bilingual glossary, since exploring the specialized corpora made it possible to properly identify the main football terms used in written communication in Brazilian, Spanish and Argentine journalism and their translation equivalents. The result is a bilingual work of reference in the field of football, which contains nearly four thousand entries, all of them with authentic examples of usage. Futebol Glossário bilíngue Linguística de Corpus Terminologia Tradução Bilingual Glossary Corpus Linguistics Football Terminology Translation
292	Corps, perception, déplacements : de l'expérience kinesthésique à la cognition linguistique : étude du schème du chemin en grammaire et sémantique anglaises et statut de ce schème en linguistique cognitive / Bodily perception and motion : from kinesthetic experience to linguistic cognition : a study on the PATH-schema in English grammar and semantics : status of this schema in cognitive linguistics Barnabé, Aurélie 09 November 2012 La linguistique cognitive considère les structures langagières comme le reflet de structures conceptuelles sous-jacentes. Les schèmes-images font partie de ces structures. Ils sont construits et abstraits à partir de l’expérience incorporée et socialement située du monde, ce qui leur confère à la fois une assise culturelle et sensori-motrice. Le présent travail confirme qu’il est possible, sur les bases théoriques édifiées par Lakoff et Johnson (1987), d’en identifier les réalisations lexicales et syntaxiques, en observant les usages langagiers. La thèse que nous soumettons aborde plus spécifiquement le schème-image du CHEMIN (PATH-schema). Pour mener cette analyse, nous ancrons nos recherches dans deux corpus. Le premier fait état de tous les types de chemins répertoriés en linguistique cognitive, à partir d’une centaine d’unités verbales, incluses dans 500 occurrences. Le second s’intéresse à quatre items verbaux, come, go, rise, et fall, répartis sur un millier d’exemples. Notre objectif consiste à discerner les charges morphosyntaxiques et les variantes sémantiques du schème du chemin. Ce faisant, nous inscrivons la corporéité – ou du moins son réinvestissement symbolique - au cœur de notre étude. Par « corporéité », nous entendons la conceptualisation et la figuration du rapport incarné du sujet au monde, les traces que laissent ces représentations dans l’organisation du lexique et des constructions. Nos corpus présentent une quantité importante de verbes, qui révèlent des états de fait abstraits. 
Ces emplois nous conduiront à explorer le statut du schème du chemin, tant dans sa réalisation morphosyntaxique que dans son contenu sémantique, lorsque ce schème sous-tend les extensions sémantiques des verbes étudiés. Nos questionnements sur la polysémie des verbes, sur leur définition d’un point de vue prototypique et sur leur grammaticalisation éventuelle, contribueront à révéler la réalité cognitive du schème analysé. Enfin, la quantité importante d’emplois « abstraits » des verbes, nous amènera à questionner la corporéité, telle que la linguistique cognitive la définit. / Cognitive linguistics regards linguistic structures as reflecting underlying conceptual structures. Image schemas belong to these structures. Schemas are shaped on the basis of bodily and socially anchored experience, which gives them both a cultural and a sensorimotor basis. The present study demonstrates that syntactic and lexical characteristics of image schemas can be identified, on the basis of Lakoff and Johnson’s theories (1987), by examining language usage. This study specifically focuses on the PATH-schema, which is investigated through two corpus-based analyses. The first sample of occurrences, made up of 500 examples, is a corpus-illustrated analysis, which exemplifies all the types of paths that have been elaborated in cognitive linguistics. The second sample of occurrences is a corpus-driven analysis, made up of 1000 examples, covering the usages of four verbs: come, go, rise, and fall. We aim to detect the syntactic and semantic patterns of the PATH-schema. This goal leads us to examine the notion of « embodiment », namely the conceptualization and the evidence of the embodied link of the individual to the environment, left in lexical constructions. Our data display several verbs involved in abstract descriptions. 
These usages will lead us to explore the status of the PATH-schema, and focus on its syntactic and semantic specificities, particularly when this schema underlies semantic extensions of come, go, rise, and fall. Issues concerning the verbs’ polysemy, their prototypical definition, and their potential grammaticalization, will contribute to revealing the cognitive reality of the PATH-schema. Finally, the quantity of verbs’ « abstract » usages, will lead us to investigate the notion of « embodiment », as cognitive linguistics defines it. Schème-image du chemin Conceptualisation Corpus Constructions Corporéité PATH schema Conceptualization Corpus Constructions Embodiment
293	The Female Protagonists in Thackeray’s Vanity Fair : A Corpus Linguistic Study of Keywords, Collocations, and Characterisation Åhman Billing, Tina January 2016 This essay uses corpus linguistic methods to study aspects of the novel Vanity Fair by W M Thackeray. The aim is to study the way Thackeray chose to describe his two female protagonists, Rebecca Sharp and Amelia Sedley. This is accomplished by a closer study of keywords in Vanity Fair, created by using a reference corpus consisting of thirteen novels by Victorian authors. These keywords are used to define semantic fields related to the novel. Keywords from the semantic field closest to the protagonists are studied in context. In addition, adjectives that collocate with the names of the protagonists are analyzed to compare the characterization of each woman. The study indicates that Thackeray has used fewer adjectives to describe Amelia than Rebecca, but that he has used these more frequently, which may cause readers to form a stronger mental picture of Amelia’s character sooner than they do for Rebecca’s. Corpus Linguistics Corpus Stylistics Characterisation Thackeray Vanity Fair Specific Languages Studier av enskilda språk
294	Sûreté, sécurité, insécurité. D'une description lexicologique à une etude du discours de presse : la campagne electorale 2001-2002 dans le quotidien Le Monde / Sûreté (safety), sécurité (security), insécurité (insecurity). From lexicological description to a press discourse study : the French presidential campaign (2001-2002) in Le Monde Née, Émilie 30 November 2009 Nous sommes partie de l’intensification d’emploi du mot insécurité observable pendant la campagne électorale 2001-2002 pour réfléchir sur le rôle des médias en lien avec l’agir politique. Afin d’éclairer l’usage qu’a fait le Monde du mot insécurité, notre étude a d’abord fait le détour d’un travail en langue sur insécurité, sûreté, sécurité comme mots construits et comme noms abstraits. Une première partie en décrit l’origine et la structure morpho-sémantique, le but étant de mettre en évidence des structures abstraites fondamentales dans lesquelles entrent ces mots. Une deuxième partie analyse l’évolution des usages des trois termes, et en particulier leur fonctionnement dans le discours politique, à partir de la base Frantext (du Moyen Âge au XXe siècle). Cette partie insiste sur l’ambivalence de mots comme sécurité-insécurité, qui dénotent un sentiment subjectif ou une réalité objective. La troisième partie travaille sur le Monde en prenant pour entrée l’unité insécurité. Trois outils d’analyse sont privilégiés, l’intensification d’emploi d’insécurité abordée avec les outils de la statistique textuelle, l’étude du consensus qui semble se construire autour du mot jusqu’au premier tour de la campagne, enfin le trajet argumentatif du mot. / Based on the increased proliferation of the word insécurité [insecurity] in press articles during the French presidential campaign in 2001-2002, we question the role of the media in the development of political themes. 
To explain the use of the word insécurité by the French newspaper Le Monde, we first carry out a linguistic study of the words insécurité (insecurity), sûreté (safety) and sécurité (security) as both construct nouns and abstract nouns. As a first step, we describe the etymology and morpho-semantic structure of these lexical items, in order to highlight the systematic paradigm in which these items are used. As a second step, we review the changes in the way these words are used from the Middle Ages to the twentieth century, with a particular focus on their application in political discourse, using the Frantext database. This part of the analysis centres around the semantic ambivalence of such items as sécurité and insécurité, which can refer as much to a subjective emotion as they can to a concrete objective reality. As a third and final step, we conduct an analysis on the lexical entry insécurité in Le Monde. Three analysis tools are mainly employed : the increased use of the word insécurité, examined with the tools of textual statistics, the study of the consensus around the word that seems to last from the beginning of the campaign until the first ballot, and finally the deployment of the term in persuasive techniques and its “argumentative route”. Discours Lexicologie Sémantique Médias Grands corpus Discourse Lexicology Semantics Media Large text corpus
295	Analyse textométrique des corpus parallèles francais-coréens / Textometric analysis of French-Korean parallel corpora Cho, Joon-Hyung 25 February 2010 Les équivalences traductionnelles extraites à partir d’un corpus parallèle deviendraient une ressource précieuse permettant d’étudier différents contextes traductionnels envisagés entre les deux langues distinctes. L’utilisation des textes traductionnels constitue aujourd’hui un thème essentiel en traductologie et en études contrastives des langues. Les méthodes textométriques opèrent une série de calculs statistiques portant sur les unités textuelles dans un corpus parallèle segmenté en occurrences. Elles fournissent les indices quantitatifs permettant de mettre en évidence le lien traductionnel de ces unités. En examinant des formes bilingues issues des corpus parallèles français-coréens, nous avons vérifié l’utilité de cette méthodologie appliquée aux textes traductionnels en français-coréen. Elles ont effectivement donné un résultat positif, d’une part, et un résultat négatif, d’autre part, tout au long de nos travaux. Pourtant, grâce à ces méthodes, nous avons pu étudier divers liens traductionnels entre unités textuelles du français et du coréen. La plupart de méthodes automatisées consacrées au corpus parallèle en langues hétérogènes n’ont pas produit de résultat acceptable. À ce titre, la textométrie, qui vise à l’observation quantitative des éléments lexicaux d’un corpus, serait très intéressante lorsqu’il s’agit notamment d’un corpus parallèle en langues sans parenté. / The translational equivalences extracted from a parallel corpus can become a valuable resource for studying the various translational contexts between two distinct languages. The use of translated texts is now a central topic in translation studies and in contrastive language studies. Textometric methods perform a set of statistical calculations on the textual units of a parallel corpus segmented into tokens. 
They provide quantitative evidence that brings out the translational relation between these units. By examining bilingual word forms in French-Korean parallel corpora, we verified the usefulness of this methodology when applied to French-Korean translated texts. The methods did produce both positive and negative results throughout our work. Yet they allowed us to study various translational relations between textual units of French and Korean. Most automated methods devoted to parallel corpora of heterogeneous languages have not produced acceptable results. For this reason, textometry, which aims at the quantitative observation of the lexical elements of a corpus, would be of great interest, particularly for a parallel corpus of unrelated languages. Corpus parallèle Textométrie Traductologie Français Coréen Parallel corpus Textometry Translation study French Korean
296	Frazémy ve dvojjazyčném slovníku / Phrasemes in a Bilingual Dictionary Ježková, Jaroslava January 2016 This thesis deals with the treatment of phrasemes in dictionaries, specifically the treatment of somatisms. The thesis consists of a theoretical and a practical part. The theoretical part covers phraseology in general, phrasemes (occasionally called phraseologisms) and their application in Czech and German linguistics. The first section also considers somatisms as a class of phrasemes with respect to their character as language units, as well as the dependence of the meaning of phrasemes on the context in which they appear. Furthermore, the main characteristics that distinguish phrasemes from other language phenomena are listed and described. The theoretical part concludes with a discussion of corpus linguistics and its application based on corpus and co-occurrence analysis. Building on the first part, the practical part deals with the treatment of somatisms in a bilingual dictionary, particularly from a lexicographic point of view, and proposes specific solutions. The appendix presents the search results processed as database entries, which may be considered part of a bilingual dictionary. Keywords: phraseme, bilingual dictionary, corpus lexicography, corpus analysis, somatism
297	Traitements linguistiques pour la reconnaissance automatique de la parole appliquée à la langue arabe : de l'arabe standard vers l'arabe dialectal Boujelbane Jarraya, Rahma 05 December 2015 Les différents dialectes de la langue arabe (DA) présentent de grandes variations phonologiques, morphologiques, lexicales et syntaxiques par rapport à la langue Arabe Standard Moderne (MSA). Jusqu’à récemment, ces dialectes n’étaient présents que sous leurs formes orales et la plupart des ressources existantes pour la langue arabe se limite à l’Arabe Standard (MSA), conduisant à une abondance d’outils pour le traitement automatique de cette variété. Étant donné les différences significatives entre le MSA et les DA, les performances de ces outils s’écroulent lors du traitement des DA. Cette situation conduit à une augmentation notable de l’ambiguïté dans les approches computationnelles des DA. Les travaux décrits dans cette thèse s’inscrivent dans ce cadre à travers la modélisation de l’oral parlé dans les médias tunisiens. Cette source de données contient une quantité importante d’Alternance Codique (AC) entre la langue normative MSA et le dialecte parlé en Tunisie (DT). La présence de ce dernier d’une manière désordonnée dans le discours pose une sérieuse problématique pour le Traitement Automatique de Langue et fait de cet oral une langue peu dotée. Toutefois, les ressources nécessaires pour modéliser cet oral sont quasiment inexistantes. Ainsi, l’objectif de cette thèse consiste à pallier ce manque afin de construire un modèle de langage dédié à un système de reconnaissance automatique pour l’oral parlé dans les médias tunisiens. Pour ce fait, nous décrivons dans cette thèse une méthodologie de création de ressources et nous l’évaluons par rapport à une tâche de modélisation de langage. Les résultats obtenus sont encourageants. 
/ The different dialects of the Arabic language (DA) show large phonological, morphological, lexical and syntactic variation when compared to the standard written Arabic language called MSA (Modern Standard Arabic). Until recently, these dialects existed only in oral form, and most of the existing resources for the Arabic language are limited to Standard Arabic (MSA), leading to an abundance of tools for the automatic processing of this variety. Given the significant differences between MSA and DA, the performance of these tools collapses when processing DA. This situation leads to a significant increase in ambiguity in computational approaches to DA. This thesis falls within this framework by modeling the speech used in the Tunisian media. This data source contains a significant amount of Code Switching (CS) between the normative language MSA and the Dialect spoken in Tunisia (DT). The presence of the latter in a disorderly manner in the discourse poses a serious problem for NLP (Natural Language Processing) and makes this spoken variety an under-resourced language. However, the resources required to model this spoken variety are almost nonexistent. Thus, the objective of this thesis is to fill this gap in order to build a language model dedicated to an automatic recognition system for the speech used in the Tunisian media. To this end, we describe in this thesis a resource generation methodology and we evaluate it on a language modeling task. The results obtained are encouraging. Corpus oral Dialecte tunisien Modèle de langue Ressources Oral corpus Tunisian dialect Language model Resources 004
298	La terminologie wolof dans une perspective de traduction et de combinatoire lexicale restreinte / Wolof terminology in translation and lexical combinatorics perspectives Diagne, Abibatou 29 January 2018 La présente étude s’intéresse à la terminologie médicale wolof, envisagée dans le cadre de la sémantique lexicale. Nous y abordons des unités terminologiques de type collocation. Ce choix a un lien direct avec notre cadre théorique d’analyse, la Théorie Sens-Texte (TST), qui, à l’heure actuelle, propose l’un des meilleurs outils de description de la collocation avec les Fonctions Lexicales (FL). Les collocations constituent des indices de spécialisation en plus d’avoir un comportement lexicosyntaxique singulier. Nous les analysons, sur la base d’un corpus scientifique compilé, afin d’avoir une perception holistique de la cooccurrence lexicale. Les langues, en Afrique, sont souvent peu dotées du point de vue terminologique. Il s’agit dans cette recherche de s’appuyer sur le modèle d’analyse du lexique qui prend en compte trois paramètres clés : le sens, la forme et la combinatoire, afin de faire une description du lexique wolof qui, à terme, permet d’établir des principes de terminologisation. La portée traductive du travail réside dans l’approche interlinguistique que nous adoptons pour élaborer notre liste de termes. Le versant opératoire de l’étude est la constitution d’un début de corpus médical trilingue (anglais-français-wolof). Le caractère multilingue n’est pas une fin en soi, mais au vu de la richesse terminologique de l’anglais et du français, il nous a paru opportun de partir des travaux en ces langues. La perspective traductive de la terminologie a permis de relever différents procédés de création et de restitutions de termes médicaux (anglais et français) en wolof. Elle n’a par ailleurs pas manqué de poser en filigrane une problématique d’ordre socioterminologique pour laquelle nous donnons des éléments de réponses. 
/ This study focuses on Wolof medical terminology under the lexical semantics context. We talk about terminology units, particularly collocations. This choice has a direct link with our theoretical framework of analysis, the Meaning-Text Theory (MTT), which currently offers one of the best tools for describing collocation through Lexical Functions. Collocations constitute indices of specialization and have a singular lexico-syntactic functioning. We analyze them, on the basis of compiled scientific corpus, in order to have a holistic perception of lexical co-occurrence. Languages in Africa are often poorly endowed from a terminology view. This research is based on the lexical analysis model which takes into account three key parameters : meaning, form and combinatorics, to make a description of the Wolof lexicon which, in the long run, gives principles of terminology. The translational scope of the work lies in the interlinguistic approach we adopt to develop our list of terms. The operative side of the study is the constitution of a beginning of trilingual medical corpus (English-French-Wolof). The multilingual character is not an end in itself, but considering the richness of English and French terminology studies, it seemed appropriate to start work in these languages. The translation perspective of the terminology has revealed different processes to create and restitute medical terms (English and French) into Wolof. Moreover, it focuses on a socioterminological problem for which we give some answers. Terminologie Wolof Combinatoire lexicale Corpus Sémantique lexicale Traduction Lexical combinatorics Corpus Terminology Wolof 401.4
299	Reconnaissance des procédés de traduction sous-phrastiques : des ressources aux validations / Recognition of sub-sentential translation techniques : from resources to validation Zhai, Yuming 19 December 2019 Les procédés de traduction constituent un sujet important pour les traductologues et les linguistes. Face à un certain mot ou segment difficile à traduire, les traducteurs humains doivent appliquer les solutions particulières au lieu de la traduction littérale, telles que l'équivalence idiomatique, la généralisation, la particularisation, la modulation syntaxique ou sémantique, etc. En revanche, ce sujet a reçu peu d'attention dans le domaine du Traitement Automatique des Langues (TAL). Notre problématique de recherche se décline en deux questions : est-il possible de reconnaître automatiquement les procédés de traduction ? Certaines tâches en TAL peuvent-elles bénéficier de la reconnaissance des procédés de traduction ? Notre hypothèse de travail est qu'il est possible de reconnaître automatiquement les différents procédés de traduction (par exemple littéral versus non littéral). Pour vérifier notre hypothèse, nous avons annoté un corpus parallèle anglais-français en procédés de traduction, tout en établissant un guide d'annotation. Notre typologie de procédés est proposée en nous appuyant sur des typologies précédentes, et est adaptée à notre corpus. L'accord inter-annotateur (0,67) est significatif mais dépasse peu le seuil d'un accord fort (0,61), ce qui reflète la difficulté de la tâche d'annotation. En nous fondant sur des exemples annotés, nous avons ensuite travaillé sur la classification automatique des procédés de traduction. Même si le jeu de données est limité, les résultats expérimentaux valident notre hypothèse de travail concernant la possibilité de reconnaître les différents procédés de traduction. 
Nous avons aussi montré que l'ajout des traits sensibles au contexte est pertinent pour améliorer la classification automatique. En vue de tester la généricité de notre typologie de procédés de traduction et du guide d'annotation, nos études sur l'annotation manuelle ont été étendues au couple de langues anglais-chinois. Ce couple de langues partage beaucoup moins de points communs par rapport au couple anglais-français au niveau linguistique et culturel. Le guide d'annotation a été adapté et enrichi. La typologie de procédés de traduction reste identique à celle utilisée pour le couple anglais-français, ce qui justifie d'étudier le transfert des expériences menées pour le couple anglais-français au couple anglais-chinois. Dans le but de valider l'intérêt de ces études, nous avons conçu un outil d'aide à la compréhension écrite pour les apprenants de français langue étrangère. Une expérience sur la compréhension écrite avec des étudiants chinois confirme notre hypothèse de travail et permet de modéliser l'outil. D'autres perspectives de recherche incluent l'aide à la construction de ressource de paraphrases, l'évaluation de l'alignement automatique de mots et l'évaluation de la qualité de la traduction automatique. / Translation techniques constitute an important subject in translation studies and in linguistics. When confronted with a certain word or segment that is difficult to translate, human translators must apply particular solutions instead of literal translation, such as idiomatic equivalence, generalization, particularization, syntactic or semantic modulation, etc. However, this subject has received little attention in the field of Natural Language Processing (NLP). Our research problem is twofold: is it possible to automatically recognize translation techniques? Can some NLP tasks benefit from the recognition of translation techniques? Our working hypothesis is that it is possible to automatically recognize the different translation techniques (e.g. 
literal versus non-literal). To verify our hypothesis, we annotated a parallel English-French corpus with translation techniques, while establishing an annotation guide. Our typology of techniques is proposed based on previous typologies, and is adapted to our corpus. The inter-annotator agreement (0.67) is significant but only slightly exceeds the threshold of a strong agreement (0.61), reflecting the difficulty of the annotation task. Based on annotated examples, we then worked on the automatic classification of translation techniques. Even if the dataset is limited, the experimental results validate our working hypothesis regarding the possibility of recognizing the different translation techniques. We have also shown that adding context-sensitive features is relevant to improve the automatic classification. In order to test the genericity of our typology of translation techniques and the annotation guide, our studies of manual annotation have been extended to the English-Chinese language pair. This pair shares far fewer linguistic and cultural similarities than the English-French pair. The annotation guide has been adapted and enriched. The typology of translation techniques remains the same as that used for the English-French pair, which justifies studying the transfer of the experiments conducted for the English-French pair to the English-Chinese pair. With the aim of validating the benefits of these studies, we have designed a tool to help learners of French as a foreign language in reading comprehension. An experiment on reading comprehension with Chinese students confirms our working hypothesis and allows us to model the tool. Other research perspectives include helping to build paraphrase resources, evaluating automatic word alignment and evaluating the quality of machine translation. Création de corpus Reconnaissance automatique Corpus creation Automatic recognition
300	Contribution de la linguistique de corpus à la constitution de langues contrôlées pour la rédaction technique : l'exemple des exigences de projets spatiaux / A methodology for creating controlled natural languages for technical writing based on corpus analysis : a case study on requirements written for space projects Warnier, Maxime 10 September 2018 L'objectif de notre travail, qui émane d'une demande de la sous-direction Assurance Qualité du CNES (Centre National d'Études Spatiales), est d'augmenter la clarté des spécifications techniques rédigées par les ingénieurs préalablement à la réalisation de systèmes spatiaux. L'importance des spécifications (et en particulier des exigences qui les composent) pour la réussite des projets de grande envergure est en effet désormais largement reconnue, de même que les principaux problèmes liés à l'utilisation de la langue naturelle (ambiguïtés, flou, incomplétude) sont bien identifiés. Dès lors, de nombreuses solutions, plus ou moins formalisées, ont été proposées et développées pour limiter les risques d'interprétation erronée – dont les conséquences potentielles peuvent se révéler extrêmement coûteuses – lors de la rédaction des exigences. Nous voudrions définir une langue contrôlée pour la rédaction des exigences en français au CNES. L’originalité de notre démarche consiste à systématiquement vérifier nos hypothèses sur un corpus d’exigences (constitué à partir d’authentiques spécifications de projets spatiaux) à l’aide de techniques et d’outils de traitement automatique du langage existants, dans l’optique de proposer un ensemble cohérent de règles (nouvelles ou inspirées de règles plus anciennes) qui puissent ainsi être vérifiées semi-automatiquement lors de l’étape de spécification et qui soient conformes aux pratiques de rédaction des ingénieurs du CNES. 
Pour cela, nous nous appuyons notamment sur l’hypothèse de l’existence d’un genre textuel, que nous tentons de prouver par une analyse quantitative, ainsi que sur les notions de normalisation et normaison. Notre méthodologie combine les approches corpus-based et corpus-driven en tenant compte à la fois des règles imposées par deux autres langues contrôlées (dont l’adéquation avec des données réelles est discutée au travers d’une analyse plus qualitative) et des résultats offerts par des outils de text mining. / The aim of this work is to improve the clarity and precision of the technical specifications written in French by the engineers at CNES (Centre National d’Études Spatiales / National Centre for Space Studies) prior to the realization of space systems. The importance of specifications (and particularly of the requirements that are part of them) for the success of large-scale projects is indeed widely acknowledged; similarly, the main risks associated with the use of natural language (ambiguity, vagueness, incompleteness) are relatively well identified. In this context, we would like to propose a solution that would be used by the engineers at CNES (who are currently not asked to follow specific writing rules): in that respect, we believe that this solution should be both effective (i.e. it should significantly limit the above-mentioned risks) and not too disruptive (which would make it counterproductive). A Controlled Natural Language (CNL) – i.e. a set of linguistic rules constraining the lexicon, the syntax and the semantics – seems to be an interesting option, provided that it remains close enough to natural language. 
Unfortunately, the CNLs for technical writing that we have examined are not always relevant from a linguistic point of view. Our methodology for developing a CNL for requirements writing in French at CNES relies on the hypothesis of the existence of a textual genre; besides, we make use of existing Natural Language Processing tools and methods to validate the relevance of the rules on a corpus of genuine requirements written for former projects. Exigences Spécifications Langue contrôlée Genre textuel Corpus Requirements Specifications Controlled language Textual genre Corpus
