51 |
Construction automatique d'outils et de ressources linguistiques à partir de corpus parallèles / Automatic creation of linguistic tools and resources from parallel corpora
Zennaki, Othman 11 March 2019 (has links)
Cette thèse porte sur la construction automatique d’outils et de ressources pour l’analyse linguistique de textes des langues peu dotées. Nous proposons une approche utilisant des réseaux de neurones récurrents (RNN - Recurrent Neural Networks) et n'ayant besoin que d'un corpus parallèle ou multi-parallèle entre une langue source bien dotée et une ou plusieurs langues cibles moins bien ou peu dotées. Ce corpus parallèle ou multi-parallèle est utilisé pour la construction d'une représentation multilingue des mots des langues source et cible. Nous avons utilisé cette représentation multilingue pour l’apprentissage de nos modèles neuronaux et nous avons exploré deux architectures neuronales : les RNN simples et les RNN bidirectionnels. Nous avons aussi proposé plusieurs variantes des RNN pour la prise en compte d'informations linguistiques de bas niveau (informations morpho-syntaxiques) durant le processus de construction d'annotateurs linguistiques de niveau supérieur (SuperSenses et dépendances syntaxiques). Nous avons démontré la généricité de notre approche sur plusieurs langues ainsi que sur plusieurs tâches d'annotation linguistique. Nous avons construit trois types d'annotateurs linguistiques multilingues : annotateurs morpho-syntaxiques, annotateurs en SuperSenses et annotateurs en dépendances syntaxiques, avec des performances très satisfaisantes. Notre approche a les avantages suivants : (a) elle n'utilise aucune information d'alignement des mots, (b) aucune connaissance concernant les langues cibles traitées n'est requise au préalable (notre seule supposition est que les langues source et cible n'ont pas une grande divergence syntaxique), ce qui rend notre approche applicable pour le traitement d'un très grand éventail de langues peu dotées, (c) elle permet la construction d'annotateurs multilingues authentiques (un annotateur pour N langues).
/ This thesis focuses on the automatic construction of linguistic tools and resources for analyzing texts in low-resource languages. We propose an approach that uses Recurrent Neural Networks (RNN) and requires only a parallel or multi-parallel corpus between a well-resourced language and one or more low-resource languages. This parallel or multi-parallel corpus is used to construct a multilingual representation of the words of the source and target languages. We used this multilingual representation to train our neural models, and we investigated both unidirectional and bidirectional RNN models. We also proposed a method for including external information (for instance, low-level information from Part-Of-Speech tags) in the RNN in order to train higher-level taggers (for instance, SuperSenses taggers and syntactic dependency parsers). We demonstrated the validity and genericity of our approach on several languages and conducted experiments on various NLP tasks: Part-Of-Speech tagging, SuperSenses tagging and dependency parsing. The results obtained are very satisfactory. Our approach has the following characteristics and advantages: (a) it does not use word alignment information, (b) it does not assume any knowledge about the target languages (the one requirement is that the source and target languages are not too syntactically divergent), which makes it applicable to a wide range of low-resource languages, (c) it provides authentic multilingual taggers (one tagger for N languages).
|
52 |
A língua inglesa e a atividade secretarial no ambiente corporativo: uma proposta de ensino de inglês com corpora
Lourenço, José Roberto 07 August 2014 (has links)
Previous issue date: 2014-08-07 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / This work aimed to identify the linguistic needs arising in the professional activities of a specific target group, namely bilingual secretaries who use English as a work tool. The most important and frequent activities in the professionals' daily work routine were first identified through interviews with ten secretaries. The fifty most important and frequent activities listed were used to compose a questionnaire, which was sent online to the union (Sindicato das Secretárias do Estado de São Paulo) for a large-scale survey. Based on the 195 valid answers, we identified the main language-use needs in the area, with the purpose of creating an alternative model for English teaching in the form of templates. The work found theoretical support in Corpus Linguistics and the Task-based Approach. To analyze the answers to the questionnaire, we used the statistical analysis program SPSS and performed the chi-square test to determine the level of significance of the answers. The corpora used in the research were the written Business English Corpus (BEC), the oral American National Corpus (ANC), a written corpus of blogs (CBC/Corporati) and a reference corpus, the British National Corpus (BNC-W). For the lexico-grammatical study of the selected texts, we used the software WordSmith 6.0, which made it possible to select lexical combinations used to develop tasks with corpora. The results showed that the bilingual secretary today is responsible for basic activities such as taking telephone messages and arranging schedules. However, she also takes on the role of social supporter of expatriates and visitors to the country. Frequent use of written and oral registers, through e-mail and telephone respectively, was observed.
The study presented here may have made an original contribution to the existing body of corpus-based research in that it provides some new and useful information and data related to the secretarial activities and the use of English in the work environment. The study also presents and discusses possible limitations and further research, as well as suggestions for pedagogical applications of the findings / Este trabalho teve como objetivo efetuar um levantamento das necessidades linguísticas para atividades profissionais de secretariado bilíngues português/inglês. As atividades mais importantes e frequentes na rotina de trabalho foram identificadas previamente por meio de entrevistas realizadas com dez profissionais. As cinquenta atividades mais destacadas compuseram um questionário que foi enviado ao Sindicato das Secretárias do Estado de São Paulo, para coleta de dados em larga escala. Com base nas 195 respostas válidas obtidas, identificamos as principais necessidades da área no tocante à linguagem, no intuito de criar um modelo alternativo de ensino de inglês para aprendizes na forma de módulos (templates). O trabalho adotou como suporte teórico a Linguística de Corpus e o modelo de ensino baseado em tarefas. Para análise das respostas ao questionário, utilizamos o programa SPSS de análise estatística e aplicamos o teste de qui quadrado, para determinar o nível de significância das respostas. Os corpora empregados na pesquisa foram o Business English Corpus (BEC), escrito, o American National Corpus Charlotte (ANC), falado, e o corpus de blogs CBC/Corporati, escrito, além de um corpus de referência, o British National Corpus World (BNC-W). Para o levantamento de palavras-chave e padrões lexicogramaticais dos textos selecionados, utilizamos a ferramenta eletrônica WordSmith Tools 6.0, que permitiu a seleção das combinações lexicais utilizadas posteriormente na composição de tarefas para o ensino do idioma. 
Os resultados indicaram que as secretárias continuam desempenhando funções básicas, como anotar recados telefônicos e organizar agenda executiva, porém acumulam a importante função de suporte e acompanhamento da vida social e profissional de expatriados e visitantes. Foram observados o uso intensivo de dois registros, a linguagem escrita via e-mail e a linguagem telefônica. A pesquisa pretende ter contribuído para a Linguística de Corpus e para a área de ensino de língua, por ser o primeiro estudo dessas áreas dedicado à atividade secretarial e ao uso profissional do idioma. Novas incursões e estudos futuros são propostos e discutidos na conclusão, bem como sugestões de aplicações pedagógicas dos resultados e planos de trabalho adicionais
|
53 |
Vai ficar tudo bem/Todo va a estar bien: o emprego do verbo estar no Português Brasileiro e no Espanhol Argentino / Vai ficar tudo bem/Todo va a estar bien: el empleo del verbo estar Portugués Brasileño y en Español Argentino
Moço, Talita Vieira 14 May 2019 (has links)
Esta tese tem como objetivo descrever as estruturas sintático-semânticas fundamentais do verbo estar no Português Brasileiro e na variedade argentina do Espanhol. A hipótese central deste estudo é de que, ao lado de muitas semelhanças entre as duas línguas nesse aspecto, reconhecida por diversos autores principalmente em relação aos traços que o diferenciam do copulativo ser, há também alguns usos próprios de cada língua, que constituem especificidades importantes no modo de focalizar determinados eventos, no presente, no passado e no futuro, especificidades essas importantes do ponto de vista pragmático-discursivo. A coleta de dados, feita principalmente em corpora eletrônicos, nos permitiu analisar os diversos fatores que podem influenciar no emprego de estar ou de outros itens lexicais em cada língua (ser, ficar/quedar[se]). O reconhecimento dessa multiplicidade de fatores lexicais, semânticos, sintáticos e discursivos que se relacionam sem estarem subordinados uns aos outros e afetam os valores das diferentes construções formadas por estar aproxima este trabalho das descrições feitas por Castilho (2014). Após uma ampla revisão bibliográfica sobre o tema nas duas línguas, passa-se a destacar os casos considerados mais relevantes e faz-se uma comparação entre os dados encontrados nos corpora, comparação essa agora organizada em categorias descritivas, a fim de sistematizar os pontos de encontro e distanciamento entre o PB e o E, nas variedades escolhidas. 
Os resultados obtidos na análise dos corpora confirmam que algumas construções formadas pelo verbo estar no E têm uso mais restrito no PB, já que nesta língua o verbo apresenta mais restrições, tanto em relação aos termos com os quais se combina, como sujeito e/ou adjunto, quanto em relação aos tempos verbais nos quais aparece comumente conjugado, sendo empregadas com mais frequência, em alguns contextos, outras formas verbais, como ser e ficar, para expressar determinados valores semânticos que no E podem ser representados por estar. / This thesis aims to describe the fundamental syntactic-semantic structures of the verb estar in Brazilian Portuguese and in the Argentine variety of Spanish. The central hypothesis of this study is that, alongside the many similarities between the two languages in this respect, which several authors recognize mainly with regard to the features that distinguish estar from the copula ser, there are also some uses specific to each language. These uses constitute important specificities, from a pragmatic-discursive point of view, in the way certain events are brought into focus in the present, past and future. Data collection, carried out mainly in electronic corpora, allowed us to analyze the various factors that can influence the use of estar or of other lexical items in each language (ser, ficar/quedar[se]). Recognizing this multiplicity of lexical, semantic, syntactic and discursive factors, which interact without being subordinate to one another and which affect the values of the different constructions formed with estar, brings this work close to the descriptions made by Castilho (2014).
After a broad literature review on the topic in both languages, the cases considered most relevant are highlighted and the data found in the corpora are compared, organized into descriptive categories, in order to systematize the points of convergence and divergence between Brazilian Portuguese and Spanish in the selected varieties. The results obtained from the corpus analysis confirm that some constructions formed with estar in Spanish are used less, or are not possible in certain contexts, in Brazilian Portuguese, since in this language the verb is subject to more restrictions, both regarding the terms with which it combines, as subject and/or adjunct, and regarding the verb tenses in which it is conjugated. As a result, in some contexts other verb forms, such as ser and ficar, are used more frequently to express certain semantic values that Spanish can express with estar.
|
54 |
Alignement de phrases parallèles dans des corpus bruités
Lamraoui, Fethi 07 1900 (has links)
La traduction statistique requiert des corpus parallèles en grande quantité. L’obtention de tels corpus passe par l’alignement automatique au niveau des phrases. L’alignement des corpus parallèles a reçu beaucoup d’attention dans les années quatre-vingt et cette étape est considérée comme résolue par la communauté. Nous montrons dans notre mémoire que ce n’est pas le cas et proposons un nouvel aligneur que nous comparons à des algorithmes à l’état de l’art.
Notre aligneur est simple, rapide et permet d’aligner une très grande quantité de données. Il produit des résultats souvent meilleurs que ceux produits par les aligneurs les plus élaborés. Nous analysons la robustesse de notre aligneur en fonction du genre des textes à aligner et du bruit qu’ils contiennent. Pour cela, nos expériences se décomposent en deux grandes parties. Dans la première partie, nous travaillons sur le corpus BAF où nous mesurons la qualité d’alignement produit en fonction du bruit qui atteint les 60%.
Dans la deuxième partie, nous travaillons sur le corpus Europarl où nous revisitons la procédure d’alignement avec laquelle le corpus Europarl a été préparé et montrons que de meilleures performances au niveau des systèmes de traduction statistique peuvent être obtenues en utilisant notre aligneur. / Current statistical machine translation systems require parallel corpora in large quantities, and typically obtain such corpora through automatic sentence-level alignment of a text and its translation. The alignment of parallel corpora received a lot of attention in the eighties and is largely considered a solved problem by the community. We show that this is not the case and propose an alignment technique that we compare with state-of-the-art aligners.
Our technique is simple, fast and can handle large amounts of data. It often produces better results than state-of-the-art aligners. We analyze the robustness of our alignment technique across different text genres and noise levels. For this, our experiments are divided into two main parts. In the first part, we measure the alignment quality on the BAF corpus with up to 60% noise. In the second part, we use the Europarl corpus and revisit the alignment procedure with which it was prepared; we show that better SMT performance can be obtained using our alignment technique.
|
55 |
Étude des procédés d’explicitation dans les traductions anglais-français de textes environnementaux
Kalinichenko, Tetiana M. 06 1900 (has links)
Le présent mémoire vise à faire l’étude des procédés d’explicitation dans les traductions anglais-français de textes spécialisés de l’environnement. Plus précisément, notre but est d’identifier l'éventail de ces procédés d'explicitation, de faire leur analyse, de les classifier et de proposer quelques pistes quant aux causes possibles de l’explicitation dans la traduction.
Nous présentons d’abord quelques travaux antérieurs qui ont porté sur l’explicitation dans des corpus de langue générale et dans des corpus spécialisés. Notre recherche a ceci de particulier qu’elle porte sur l’explicitation dans un corpus spécialisé, plus particulièrement dans des textes du domaine de l’environnement. L’explicitation est peu étudiée dans les textes spécialisés et, à notre connaissance, aucune étude n’a porté sur l’explicitation dans des textes environnementaux.
Pour notre recherche, nous avons élaboré d’abord un corpus de textes anglais-français portant sur l’environnement. Notre corpus a ensuite été aligné au moyen de l’aligneur LogiTerm Pro. Cet aligneur nous permet de créer un corpus aligné qui est utile pour observer les manifestations d’explicitation. Les stratégies d'explicitation identifiées et classées par Pápai (2004) ont servi de base à notre propre classement.
Nous avons découvert que les procédés d’explicitation se produisent à cinq niveaux : des relations logiques et visuelles, lexical et grammatical, syntaxiques I et II, textuel et extralinguistique. Le nombre total de procédés d’explicitation que nous avons identifiés est de 13. Le plus grand nombre de cas d’explicitations (445) se situe au niveau lexical et grammatical. Parmi les cas d’explicitations au niveau lexical et grammatical, le remplissage d’ellipses sémantiques présente le nombre le plus élevé de cas (186) dans notre corpus spécialisé. L’explicitation au niveau syntaxique I s’observe dans 173 cas; l’explicitation au niveau des relations logiques et visuelles s’observe dans 101 cas; l’explicitation au niveau syntaxique II a été relevée dans 50 cas. Enfin, l’explicitation se produit au niveau textuel et extralinguistique dans 37 cas. Après avoir observé notre corpus et d’après les résultats obtenus, nous avons pu constater que le nombre et la variété d’explicitations étaient élevés dans les traductions anglais-français de textes spécialisés environnementaux. / This work aims to study the explicitation strategies in English-French translations of specialized texts in the field of the environment. More specifically, our goal is to identify the range of these explicitation strategies and to analyze and classify them. We also offer some explanations of possible causes of explicitation in translation.
First, we present some previous work on explicitation in general language corpora and in specialized corpora. A particularity of our own research is that it focuses on explicitation in a specialized corpus, more specifically in texts in the field of the environment. Explicitation has seldom been studied in specialized texts and, to our knowledge, no study has focused on explicitation in environmental texts.
For our research, we compiled a corpus of English-French environmental texts. Our corpus was then aligned using the aligner LogiTerm Pro. This aligner allows us to create aligned corpora that are useful for observing instances of explicitation. The explicitation strategies identified and classified by Pápai (2004) served as the basis for our own analysis.
We found that the explicitation strategies occur at five levels: logical and visual relations, lexical and grammatical, syntactic I and II, textual and extra-linguistic. The total number of explicitation strategies that we identified is 13. The largest number of explicitation instances (445) occurs at the lexical and grammatical level. Among explicitation instances at the lexical and grammatical level, the filling of semantic ellipses presents the highest number of instances (186) in our specialized corpus. Explicitation at syntactic level I was observed in 173 instances; explicitation at the level of logical and visual relations in 101 instances; explicitation at syntactic level II in 50 instances. Finally, explicitation occurred at the textual and extra-linguistic level in 37 instances. After observing our corpus and according to the results obtained, we found that the number and variety of instances of explicitation are high in English-French translations of specialized texts in the field of the environment. / S.O.
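As a reading aid, the per-level counts reported in this abstract can be turned into shares. The sketch below is only illustrative: it assumes that the five level counts are non-overlapping and together cover all instances, which the abstract does not state explicitly, and the variable names are hypothetical.

```python
# Instance counts per explicitation level, as reported in the abstract.
# Assumption: the five levels are disjoint and exhaustive, so their sum
# can be read as the total number of explicitation instances.
counts = {
    "lexical and grammatical": 445,
    "syntactic I": 173,
    "logical and visual relations": 101,
    "syntactic II": 50,
    "textual and extra-linguistic": 37,
}

total = sum(counts.values())

# Share of each level, as a percentage rounded to one decimal place.
shares = {level: round(100 * n / total, 1) for level, n in counts.items()}

for level, n in counts.items():
    print(f"{level}: {n} instances ({shares[level]}% of {total})")
```

Under that assumption, the lexical and grammatical level alone accounts for more than half of all instances, which is consistent with the abstract singling it out as the largest group.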
|
56 |
Étude sur l'équivalence de termes extraits automatiquement d'un corpus parallèle : contribution à l'extraction terminologique bilingue
Le Serrec, Annaïch January 2008 (has links)
Thesis digitized by the Division de la gestion de documents et des archives of the Université de Montréal
|
57 |
Analyse comparative de l'équivalence terminologique en corpus parallèle et en corpus comparable : application au domaine du changement climatique
Le Serrec, Annaïch 04 1900 (has links)
Les travaux entrepris dans le cadre de la présente thèse portent sur l’analyse de l’équivalence terminologique en corpus parallèle et en corpus comparable. Plus spécifiquement, nous nous intéressons aux corpus de textes spécialisés appartenant au domaine du changement climatique. Une des originalités de cette étude réside dans l’analyse des équivalents de termes simples. Les bases théoriques sur lesquelles nous nous appuyons sont la terminologie textuelle (Bourigault et Slodzian 1999) et l’approche lexico-sémantique (L’Homme 2005).
Cette étude poursuit deux objectifs. Le premier est d’effectuer une analyse comparative de l’équivalence dans les deux types de corpus afin de vérifier si l’équivalence terminologique observable dans les corpus parallèles se distingue de celle que l’on trouve dans les corpus comparables. Le deuxième consiste à comparer dans le détail les équivalents associés à un même terme anglais, afin de les décrire et de les répertorier pour en dégager une typologie.
L’analyse détaillée des équivalents français de 343 termes anglais est menée à bien grâce à l’exploitation d’outils informatiques (extracteur de termes, aligneur de textes, etc.) et à la mise en place d’une méthodologie rigoureuse divisée en trois parties. La première partie qui est commune aux deux objectifs de la recherche concerne l’élaboration des corpus, la validation des termes anglais et le repérage des équivalents français dans les deux corpus. La deuxième partie décrit les critères sur lesquels nous nous appuyons pour comparer les équivalents des deux types de corpus. La troisième partie met en place la typologie des équivalents associés à un même terme anglais.
Les résultats pour le premier objectif montrent que sur les 343 termes anglais analysés, les termes présentant des équivalents critiquables dans les deux corpus sont relativement peu élevés (12), tandis que le nombre de termes présentant des similitudes d’équivalence entre les corpus est très élevé (272 équivalents identiques et 55 équivalents non critiquables). L’analyse comparative décrite dans ce chapitre confirme notre hypothèse selon laquelle la terminologie employée dans les corpus parallèles ne se démarque pas de celle des corpus comparables.
Les résultats pour le deuxième objectif montrent que de nombreux termes anglais sont rendus par plusieurs équivalents (70 % des termes analysés). Il est aussi constaté que ce ne sont pas les synonymes qui forment le groupe le plus important des équivalents, mais les quasi-synonymes. En outre, les équivalents appartenant à une autre partie du discours constituent une part importante des équivalents. Ainsi, la typologie élaborée dans cette thèse présente des mécanismes de l’équivalence terminologique peu décrits aussi systématiquement dans les travaux antérieurs. / The research undertaken for this thesis concerns the analysis of terminological equivalence in a parallel corpus and a comparable corpus. More specifically, we focus on specialized texts related to the domain of climate change. A unique aspect of this study is its analysis of the equivalents of single-word terms. The theoretical frameworks on which we rely are textual terminology (terminologie textuelle; Bourigault and Slodzian 1999) and the lexico-semantic approach (L’Homme 2005).
This study has two objectives. The first is to perform a comparative analysis of terminological equivalents in the two types of corpora in order to verify whether the equivalents found in the parallel corpus differ from those observed in the comparable corpus. The second is to compare in detail the equivalents associated with the same English term, in order to describe them and define a typology.
A detailed analysis of the French equivalents of 343 English terms is carried out with the help of computer tools (term extractor, text aligner, etc.) and a rigorous methodology divided into three parts. The first part, common to both objectives of the research, concerns the compilation of the corpora, the validation of the English terms and the identification of the French equivalents in the two corpora. The second part describes the criteria on which we rely to compare the equivalents of the two types of corpora. The third part sets up the typology of equivalents associated with the same English term.
The results for the first objective show that, of the 343 English terms analyzed, few (12) have questionable equivalents in both corpora, while the number of terms with similar equivalents in the two corpora is very high (272 identical equivalents and 55 unobjectionable equivalents). The analysis described in this chapter confirms our hypothesis that the terminology used in parallel corpora does not differ from that used in comparable corpora.
The results for the second objective show that many English terms are rendered by several equivalents (70% of analyzed terms). It is also noted that near-synonyms, not synonyms, form the largest group of equivalents. In addition, equivalents belonging to another part of speech make up a substantial share of the equivalents analyzed. Thus, the typology developed in this thesis presents mechanisms of terminological equivalence that have rarely been described so systematically in previous work.
|
58 |
Unidades fraseológicas especializadas : colocações e colocações estendidas em contratos sociais e estatutos sociais traduzidos no modo juramentado e não-juramentado
Orenha, Adriane. January 2009 (has links)
Orientador: Diva Cardoso de Camargo / Banca: Francis Henrik Aubert / Banca: Ieda Maria Alves / Banca: Claudia Maria Xatara / Banca: Eli Nazareth Bechara / Resumo: Esta pesquisa visa realizar um estudo a respeito dos termos, colocações e colocações especializadas estendidas presentes em contratos sociais e estatutos sociais que representam os corpora de pesquisa. Nesta pesquisa, também observaremos as semelhanças e diferenças nos corpora de traduções jurídicas e juramentadas, no que concerne ao uso desses termos e padrões lexicais, assim como apontaremos aqueles que são mais frequentemente empregados em documentos do tipo contrato social e estatuto social. A investigação baseia-se na abordagem interdisciplinar dos Estudos da Tradução Baseados em Corpus, da Linguística de Corpus, da Fraseologia, de modo mais específico das colocações, das colocações especializadas e das unidades fraseológicas especializadas. A Terminologia, por meio de seus pressupostos teóricos, também traz sua contribuição para a pesquisa, assim como os trabalhos sobre a tradução juramentada. Uma das motivações que delineia este estudo reside no fato de a tradução juramentada ser considerada de grande relevância nas relações comerciais, sociais e jurídicas entre as nações. Para realizar este estudo, compilamos um corpus de estudo (CE1) constituído por contratos sociais e estatutos sociais traduzidos no modo juramentado, nas direções tradutórias inglês português e português inglês, extraídos de Livros de Registro de Traduções, pertencentes a tradutores juramentados credenciados pela Junta Comercial de dois Estados brasileiros; e um corpus de estudo (CE2) formado por documentos de mesma natureza traduzidos sem o processo de juramentação, nas mesmas direções tradutórias. Além destes corpora, construímos dois corpora comparáveis, formados pelos referidos documentos originalmente escritos em português e em inglês. 
Os resultados desta pesquisa mostraram várias semelhanças, no tocante aos termos empregados em documentos traduzidos... (Resumo completo, clicar acesso eletrônico abaixo) / Abstract: This investigation aims at carrying out a study of the terms, collocations and extended specialized collocations present in articles of incorporation/articles of organization/articles of association and bylaws, which make up our research corpora. We also observe similarities and differences between sworn and legal translation corpora with regard to the use of such terms and lexical patterns, and point out those most frequently used in the documents in focus. This research derives its theoretical and methodological basis from Corpus-Based Translation Studies, Corpus Linguistics and Phraseology, more specifically from work on collocations, specialized collocations and specialized phraseological units (SPUs). Terminology, from its theoretical standpoint, also contributes to this study, as do studies on sworn translation. One of the motivations for this study is the fact that sworn translation is considered highly relevant to commercial, social and legal relations among nations. To conduct this research, we compiled a study corpus (CE1) composed of articles of incorporation/articles of organization/articles of association and bylaws submitted to the process of sworn translation in the English-Portuguese and Portuguese-English directions, excerpted from the Books of Sworn Translation Records made available by five Brazilian sworn translators duly accredited by the Board of Trade of two Brazilian states, and a study corpus (CE2) made up of documents of the same nature not submitted to the process of sworn translation, in the same translation directions. Besides these corpora, we also built two comparable corpora formed by the same types of documents originally written in Portuguese and in English.
The results obtained in this research showed some similarities which refer to the terms used in documents submitted to the process of sworn translation... (Complete abstract click electronic access below) / Doutor
|
59 |
UM ESTUDO DOS FATORES DE ATRIBUIÇÃO EM TEXTOS ACADÊMICOS DE LETRAS E PSICOLOGIA À LUZ DA TEORIA HOLÍSTICA DA ATIVIDADE E DA LINGUÍSTICA DE CORPUS / A STUDY OF THE ATTRIBUTION FACTORS IN LANGUAGES AND PSYCHOLOGY ACADEMIC TEXTS ENLIGHTENED BY THE HOLISTIC THEORY OF ACTIVITY AND THE CORPUS LINGUISTICS
Kader, Carla Callegaro Corrêa 26 February 2014 (has links)
Conselho Nacional de Desenvolvimento Científico e Tecnológico / This study investigates how the social role of the Language teacher/professor
is constituted, contrasting it with an emancipated profession such as Psychology and observing the
internal asymmetry of allopoietic systems and the endogeny of autopoietic systems. To this end, the
methodological framework of Corpus Linguistics was used to analyze and compare corpora composed of
undergraduate final papers, lato sensu monographs, and stricto sensu dissertations and theses in the
Language and Psychology areas. First, the study asked how the macro- and microanalyses
generated by the programs TreeTagger, WordSmith Tools 6.0 and the Semantic Mapper relate
to the multidimensional analysis. Second, it examined which axes (axiological, deontic and/or
epistemic alethic) linguistically characterize the professional profile of language educators
and psychologists. Third, it examined which conceptions recur in the corpora, based on
cluster analysis of the concordance lines. These research questions were answered in light of the Holistic
Theory of Activity. In this study, the theory is applied to questions of professional development, in terms of
the opposition between autopoietic and allopoietic processes (RICHTER, 2011). Next, the
framing of each profession was examined on the basis of a statistical and qualitative
survey of the corpora. The analysis focused on the attribution factor that characterizes
social roles and professional modeling. Following the literature review, the study turns to
the methodological aspects that support this analysis. The texts that
compose the corpora were collected on the internet from the databases of
Brazilian university libraries, and also from public and private university libraries in
Santa Maria. Once the corpora were assembled, the texts were converted to txt format (plain text
without formatting). The texts from the Language area were separated into two groups, market professionals and
academic professionals, while the Psychology corpus was not subdivided. The corpora
were tagged, and the word classes with the highest results were extracted. WordSmith Tools 6.0
was then used to obtain quantitative results for
the prominent words in the WordList. These words were then semantically mapped and
separated into subcategories. With the mapping results in hand, the
multidimensional analysis began. The results of this analysis highlight the gnoseological and
praxeological competences and the modal verbs must and can. The concordance analysis was
carried out with the words belonging to these categories. The final results point to a differentiation
between emancipated and non-emancipated professions. The former guide their training and
professional practice by the Federal Council of Ethics; the latter, by the guidance found
in academic texts by professionals with postgraduate specializations and in the guidelines of official
documents. More specifically, the Psychology corpus points to linguistic evidence
of autopoiesis, characteristic of regulated professions, while professionals in the Language
area are subdivided between allopoietic linguistic traces (market professionals) and almost autopoietic ones
(academic professionals). / Este trabalho tem por objetivo investigar a constituição do papel social do professor de Letras
contrapondo-o ao de uma profissão emancipada como a Psicologia, observando a assimetria interna
dos sistemas alopoiéticos e a endogenia dos sistemas autopoiéticos. Para tanto, utilizaram-se os
pressupostos metodológicos da Linguística de Corpus para analisar e comparar corpora, compostos
por textos veiculados em trabalhos finais de graduação (TCCs), monografias lato sensu, dissertações
e teses stricto sensu das áreas de Letras e Psicologia. Inicialmente, questionou-se como as análises
macro e microscópicas, geradas pelos programas TreeTagger, WordSmith Tools 6.0 e Mapeador
Semântico, relacionam-se à análise multidimensional. Em um segundo momento, averiguou-se que
eixos (axiológico, deôntico e/ou epistêmico alético) caracterizam linguisticamente o perfil profissional
dos educadores linguísticos e dos psicólogos. Na terceira fase, buscou-se verificar que concepções
são recorrentes nos corpora a partir da análise dos clusters das linhas de concordância. Essas
perguntas de pesquisa foram respondidas à luz da Teoria Holística da Atividade. Neste estudo,
enfoca-se essa teoria nas questões sobre o desenvolvimento profissional na oposição entre
processos autopoiéticos e alopoiéticos (RICHTER, 2011). Na sequência dos estudos, observou-se o
enquadramento de cada profissão com base no levantamento estatístico e qualitativo dos corpora. O
enfoque da análise foi no fator de atribuição que caracteriza os papéis sociais e a modelagem
profissional. Após o desenvolvimento da revisão da literatura, passou-se para os aspectos
metodológicos que sustentam esta análise. A coleta dos textos que compõem os corpora foi realizada
via internet, por meio da visitação aos bancos de dados de bibliotecas de universidades brasileiras.
Também foram coletados textos nas bibliotecas de universidades públicas e privadas de Santa Maria.
Após a formação dos corpora, os textos foram convertidos para o formato txt (texto sem formatação).
Os textos da área de Letras foram separados em dois grupos, profissionais de mercado e
profissionais da academia, enquanto o corpus de Psicologia não sofreu subdivisões. Os corpora
foram etiquetados e extraíram-se as classes de palavras com resultados mais elevados. Passou-se,
assim, para a utilização do software WordSmith Tools 6.0 com a obtenção de resultados quantitativos
para as lexias em destaque na WordList. Posteriormente, passou-se para o mapeamento semântico
dessas lexias e separação em subcategorias. De posse dos resultados do mapeamento, iniciou-se a
análise multidimensional. Os resultados dessa análise apontam o destaque para as competências
gnoseológica e praxeológica e para os verbos modais deve e pode . A análise dos concords foi
efetuada com as lexias que pertencem a essas categorias. Os resultados finais apontam para uma
diferenciação entre profissões emancipadas e não emancipadas. As primeiras orientam sua formação
e atuação profissional pelo Conselho Federal de Ética e, as segundas, nas orientações encontradas
nos textos acadêmicos de profissionais com especializações e nas orientações dos documentos
oficiais. Mais especificamente, o corpus com os textos da área da Psicologia aponta para indícios
linguísticos voltados para uma autopoiesis, característico das profissões regulamentadas e os
profissionais da área de Letras subdividem-se em indícios linguísticos alopoiéticos (profissionais de
mercado) e quase autopoiéticos (profissionais da academia).
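
The word-list and concordance steps described in this abstract can be illustrated with a minimal Python sketch. It is not WordSmith Tools or its algorithm: the function names (`tokenize`, `word_list`, `concordance`), the simple regular-expression tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase the text and keep runs of word characters.
    return re.findall(r"\w+", text.lower())


def word_list(text):
    # Frequency list of tokens, most frequent first (a rough analogue of a WordList).
    return Counter(tokenize(text)).most_common()


def concordance(text, keyword, width=3):
    # Keyword-in-context lines: up to `width` tokens on each side of every hit.
    tokens = tokenize(text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, used only to exercise the functions.
sample = "The teacher must plan. The psychologist can decide, and the teacher can adapt."
frequencies = word_list(sample)
can_lines = concordance(sample, "can", width=2)
```

Here `frequencies` holds the token counts for the sample, and `can_lines` holds one (left context, keyword, right context) tuple per occurrence of the modal verb "can", the kind of line a concordance analysis would then group into clusters.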
|
60 |
Text and Speech Alignment Methods for Speech Translation Corpora Creation: Augmenting English LibriVox Recordings with Italian Textual Translations
Della Corte, Giuseppe January 2020
The recent rise of end-to-end speech translation models requires a new generation of parallel corpora, composed of a large number of source-language speech utterances aligned with their target-language textual translations. We present a pipeline and a set of methods to collect hundreds of hours of English audio-book recordings and align them with their Italian textual translations, using exclusively public domain resources gathered semi-automatically from the web. The pipeline consists of three main stages: text collection, bilingual text alignment, and forced alignment. For the text collection task, we show how to automatically find e-book titles in a target language by using machine translation, web information retrieval, and named entity recognition and translation techniques. For the bilingual text alignment task, we investigated three methods: the Gale–Church algorithm in conjunction with a small hand-crafted bilingual dictionary, the Gale–Church algorithm in conjunction with a larger bilingual dictionary automatically inferred through statistical machine translation, and bilingual text alignment by computing the vector similarity of multilingual embeddings of concatenations of consecutive sentences. Our findings seem to indicate that the consecutive-sentence-embedding similarity approach improves the alignment of difficult sentences by indirectly performing sentence re-segmentation. For the forced alignment task, we give a theoretical overview of the preferred method depending on the properties of the text to be aligned with the audio, suggesting and using a TTS-DTW (text-to-speech and dynamic time warping) based approach in our pipeline. The result of our experiments is a publicly available multi-modal corpus composed of about 130 hours of English speech aligned with its Italian textual translation and split into 60561 triplets of English audio, English transcript, and Italian textual translation.
We also post-processed the corpus to extract 40 MFCC features from the audio segments and released them as a dataset.
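
The dynamic time warping (DTW) step named in this abstract can be illustrated with a minimal sketch. It is a generic textbook DTW over two sequences of numbers, not the thesis's TTS-DTW pipeline: the function name `dtw`, the absolute-difference cost, and the toy inputs are assumptions. A real forced-alignment setup would presumably compare frames of acoustic features (for example MFCC vectors) from synthesized and recorded audio rather than scalars.

```python
import math


def dtw(a, b):
    """Return (total cost, warping path) aligning sequences a and b.

    Generic DTW with absolute difference as the local cost (an assumption).
    The path is a list of (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end to recover one optimal warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # advance in a only
            (cost[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Toy sequences: b repeats the middle element of a, so DTW can align them at zero cost.
total, path = dtw([1, 2, 3], [1, 2, 2, 3])
```

In this toy case the path maps the single `2` in the first sequence to both `2`s in the second, which is the stretching behavior that lets DTW match speech segments spoken at different speeds.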
|