231
[en] METHODOLOGIES FOR CHARACTERIZING AND DETECTING EMOTIONAL DESCRIPTION IN THE PORTUGUESE LANGUAGE / [pt] METODOLOGIAS PARA CARACTERIZAÇÃO E DETECÇÃO DA DESCRIÇÃO DE EMOÇÃO NA LÍNGUA PORTUGUESA
BARBARA CRISTINA MARQUES P RAMOS, 29 May 2023 (has links)
[pt] O interesse desta tese recai sobre compreender como os falantes de língua
portuguesa a utilizam para materializar a menção de emoção através de um
trabalho, sobretudo, linguístico. O objetivo geral da pesquisa é criar recursos para
aprimorar a anotação do campo semântico das emoções na língua portuguesa a
partir do projeto AC/DC, projeto que reúne e disponibiliza publicamente corpora
anotados e recursos para pesquisas na língua portuguesa, e do Emocionário,
projeto de anotação semântica e léxico de emoções. Inicialmente, a pesquisa dá
um panorama dos estudos de emoção; se alinha às perspectivas que refutam a
universalidade das emoções e abordagens que postulam emoções básicas; e
contrapõe seu interesse por menção de emoção à já consolidada área de Análise de
Sentimento, contrastando cinco léxicos de sentimento e/ou polaridades em língua
portuguesa e o Emocionário. A partir de uma ampla varredura nos corpora do
AC/DC, três principais caminhos foram percorridos para investigar palavras de
emoção: (i) uma análise dos vinte e quatro grupos de emoção que já existiam no
léxico do Emocionário a fim de delinear características e desafios no estudo de
emoção na língua portuguesa; (ii) a revisão completa de um terço dos grupos do
léxico do Emocionário; e (iii) buscas pelo padrão léxico-sintático sentimento de
N e por expressões anotadas pelo projeto Esqueleto usadas para descrever
emoção. A análise dos corpora à luz dos lemas previamente pertencentes aos
grupos do léxico do Emocionário evidenciou, dentre outras características, a
relevância de expressões lexicalizadas para a análise da descrição de emoção, dos
tipos de argumentos de verbos e afixos que podem causar variação de sentido, e
de variações de tempo e modo verbal que acarretam mudança de significado.
Dentre os desafios estão palavras e expressões polissêmicas e a dificuldade na
detecção de diferentes sentidos em palavras que compartilham da mesma classe
gramatical, tendo como base somente informações morfossintáticas. Esta análise
possibilitou a estruturação e documentação de uma metodologia de revisão que
pode vir a ser aplicada nos demais grupos futuramente. As principais
contribuições desta tese são decorrentes das análises e explorações em corpora: a
limpeza de lemas com sentidos não-emocionais dos grupos do léxico do
Emocionário; a criação dos grupos de emoção Ausência e Outra,
enriquecendo o léxico; a detecção de mais de novecentos lemas e expressões
provenientes das buscas pelo padrão sentimento de N e das conexões
estabelecidas entre os campos semânticos de emoção e do corpo humano; além de
descobertas de campos lexicais pouco mencionados na literatura sobre emoção,
como coletividade, estranhamento, espiritualidade, parentesco e atos
automotivados, que auxiliaram na investigação de como os falantes do português
cristalizam emoções na língua. / [en] The interest of this thesis lies in understanding how Portuguese speakers use
their language to materialize the mention of emotion, approached above all through a linguistic lens. The
general objective of the research is to create resources to improve the annotation
of the semantic field of emotions in the Portuguese language based on the AC/DC
project, which gathers and makes publicly available annotated corpora and tools
for linguistic research on the Portuguese language, and on Emocionário, which is both a
semantic annotation project and lexicon of emotions. Initially, the research gives
an overview of emotion studies; aligns itself with perspectives that refute the
universality of emotions and approaches that postulate basic emotions; and
contrasts the interest in emotion description with the already consolidated area of
Sentiment Analysis, comparing five sentiment and/or polarity lexicons in
Portuguese to Emocionário. From a broad sweep of the AC/DC corpora, three
main paths were taken towards investigating emotion words: (i) an analysis of the
twenty-four emotion groups previously composing the Emocionário lexicon in
order to delineate characteristics and challenges in the study of emotion
description in the Portuguese language; (ii) a thorough revision of one-third of the
Emocionário lexicon groups; and (iii) searches for the lexical-syntactic pattern
sentimento de N and for expressions annotated by the Esqueleto project used to
describe emotion. The corpora analysis in the light of the lemmas previously
belonging to the Emocionário lexicon groups showed, amongst other
characteristics, the relevance of lexicalized expressions for the analysis of the
emotion description, the types of arguments of verbs and affixes that can cause
variation in meaning, and variations in tense and verbal mood that lead to a
change in meaning. Amongst the challenges are polysemous words and
expressions and the difficulty in detecting different meanings in words that share
the same grammatical class, based only on morphosyntactic information. This
analysis enabled the structuring and documentation of a revision methodology that
may be applied to the remaining groups in the future. The main contributions of this thesis
derive from the analyses and explorations of the corpora: the exclusion of lemmas
with non-emotional meanings from the Emocionário lexicon groups; the creation
of emotion groups Ausência and Outra, enriching the lexicon; the detection of
more than nine hundred lemmas and expressions from the searches for the
sentimento de N pattern and the connections established between the semantic
fields of emotion and the human body; in addition to discoveries of lexical fields
rarely mentioned in the literature on emotion, such as coletividade,
estranhamento, espiritualidade, parentesco e atos automotivados, which
helped in the investigation of how Portuguese speakers crystallize emotions in
language.
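A minimal sketch, assuming a simple list of (token, POS) pairs rather than the AC/DC project's actual query interface, of how the sentimento de N pattern search described above could be approximated; the tag names and the toy sentence are illustrative assumptions:

```python
from collections import Counter

def sentimento_de_n(tagged_tokens):
    """Collect nouns N occurring in the pattern 'sentimento de N'.

    tagged_tokens: list of (token, pos) pairs, e.g. [("sentimento", "N"),
    ("de", "PRP"), ("culpa", "N"), ...]; the tag names are assumptions.
    """
    hits = Counter()
    for i in range(len(tagged_tokens) - 2):
        (w1, _), (w2, _), (w3, p3) = tagged_tokens[i:i + 3]
        if w1.lower() == "sentimento" and w2.lower() == "de" and p3.startswith("N"):
            hits[w3.lower()] += 1
    return hits

# Toy usage with invented data:
sample = [("um", "DET"), ("sentimento", "N"), ("de", "PRP"), ("culpa", "N"),
          ("e", "CONJ"), ("um", "DET"), ("sentimento", "N"), ("de", "PRP"),
          ("saudade", "N")]
print(sentimento_de_n(sample).most_common())
```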
232
Il frammento nominale nell’italiano digitato colloquiale. Proposta di classificazione sintattica, prospettive di analisi e applicazioni sul campo [The nominal fragment in colloquial typed Italian: a proposal for syntactic classification, perspectives of analysis and applications in the field]
Comandini, Gloria, 10 December 2021 (has links)
This study focuses on the analysis of a phenomenon that is very common in Italian and has been well attested for over a century in several other languages, both ancient and modern: constructions lacking a finite verb in their main syntactic nucleus, which have evidently not undergone ellipsis and which cannot always be defined as sentences. Following the analyses of this phenomenon by Mortara Garavelli (1971) on literary writing and by Cresti (1998) on colloquial speech, this research investigates the nature of verbless constructions in a new variety of Italian, namely the informal, dialogic writing produced on the web, here termed colloquial typed Italian (italiano digitato colloquiale, IDC). The study therefore adopts a corpus-based approach, searching for verbless constructions in a collection of authentically produced IDC texts, namely the corpus COSMIANU (Corpus Of Social Media Italian Annotated with Nominal Utterances) (Comandini et al., 2018). The phenomenon was identified on the basis of Ferrari's (2011; 2014) definition of the nominal utterance, while adopting two syntactic perspectives never before applied to Italian: the sententialist theory of Merchant (2004; 2006; 2010) and the non-sententialist theory of Barton & Progovac (2005), both applied in English to elliptical structures defined as fragments without an explicit antecedent. The verbless structures under study were accordingly defined as nominal fragments, with the twofold aim of framing a phenomenon that, in the new language variety studied, takes forms different from those found in literary writing and colloquial speech, and of symbolically uniting two research traditions on verbless constructions that have never met: the Italo-French one, going back to Meillet (1906), and the Anglo-American one, going back to Sweet (1900). The non-sententialist analysis of nominal fragments in colloquial typed Italian identified eleven classes of nominal fragments, some of which can be considered sentences, since they contain either a predicative relation between two constituents or a Tense Phrase. On the sententialist side, the existence of a new category of nominal fragments was hypothesized, in which a pro element and a form of the verb essere ('to be') have been elided. Thanks to the contribution of both the sententialist and the non-sententialist theories, it was possible to observe that one of the many diagnostic traits of IDC is precisely the presence of nominal fragments embodying its main characteristics, namely: a) its extremely dialogic nature, which explains the high frequency of greeting and thanking formulas (e.g.: CIAO A TUTTE LE FANS; grazie 1000000000000) and interjections (e.g.: bleah!); b) its strong adherence to the communicative context, with nominal fragments whose initial node is an NP, a DP or an AP referring directly either to an element previously made salient in the context (e.g.: Bellissimoooooooooooo !!!!!!!!!!!!) or to an immediately following element whose nature is specified (e.g.: una domanda... perché é all'inverso?). Finally, the study tested how the identification and syntactic analysis of nominal fragments can help to better understand and recognize hate speech.
An analysis of the hate-bearing nominal fragments in the corpus of racist tweets POP-HS-IT (Comandini & Patti, 2019) showed that hateful IDC presents the same classes of nominal fragments identified in COSMIANU, but in different proportions, with a particular prominence of nominal fragments whose initial node is a FocP (e.g.: FUORI QUESTE MERDE UMANE DALL'ITALIA). Moreover, a notable presence was found of FocP-class nominal fragments (e.g.: pezzi di merda loro e tutto l’islam) corresponding to the exclamative sentences studied by Munaro (2006) (e.g.: Noioso, il tuo amico!), in which the left-focalized element (pezzi di merda) is always an intrinsic, non-temporary characteristic of the subject (loro e tutto l’islam). This type of exclamative, focalized nominal fragment conveys some of the most universal characteristics of hate speech, namely the expression of a generalized, non-debatable hatred towards a category of people seen as a monolithic group. Identifying the nominal fragments most characteristic of hate speech could help automatic tools annotate hateful texts more accurately.
233
El aporte del rehablado off-line a la transcripción asistida de corpus orales
Rufino Morales, Marimar, 04 1900 (has links)
Cette recherche aborde un des grands défis liés à l'étude empirique des phénomènes linguistiques : l'optimisation des ressources matérielles et humaines pour la transcription. Pour ce faire, elle met en relief l’intérêt de la redite off-line, une méthode de transcription vocale à l’aide d’un logiciel de reconnaissance automatique de la parole inspirée du sous-titrage vocal pour les émissions de télé. La tâche de transcrire la parole spontanée est ardue et complexe; on doit rendre compte de tous les constituants de la communication : linguistiques, extralinguistiques et paralinguistiques, et ce, en dépit des difficultés que posent la parole spontanée, les autocorrections, les hésitations, les répétitions, les variations, les phénomènes de contact.
Afin d’évaluer le travail nécessaire pour générer un produit de qualité ont été transcrites par redite une sélection d’interviews du Corpus oral de la langue espagnole à Montréal (COLEM), qui reflète toutes les variétés d'espagnol parlées à Montréal (donc en contact avec le français et l'anglais). La qualité des transcriptions a été évaluée en fonction de leur exactitude, étant donné que plus elles sont exactes, moins le temps de correction est long. Afin d'obtenir des pourcentages d’exactitude plus fidèles à la réalité –même s’ils sont inférieurs à ceux d'autres recherches– ont été pris en compte non seulement les mots incorrectement ajoutés, supprimés ou substitués, mais aussi liées aux signes de ponctuation, aux étiquettes descriptives et aux marques typographiques propres aux conventions de transcription du COLEM. Le temps nécessaire à la production et à la correction des transcriptions a aussi été considéré. Les résultats obtenus ont été comparés à des transcriptions manuelles (dactylographiées) et à des transcriptions automatiques.
La saisie manuelle offre la flexibilité nécessaire pour obtenir le niveau d’exactitude requis pour la transcription, mais ce n'est ni la méthode la plus rapide ni la plus rigoureuse. Quant aux transcriptions automatiques, aucune ne remplit de façon satisfaisante les conditions requises pour gagner du temps ou réduire les efforts de révision. On a aussi remarqué que les performances de la reconnaissance automatique de la parole fluctuaient au gré des locuteurs et locutrices et des caractéristiques des enregistrements, causant des écarts considérables dans le temps de correction des transcriptions. Ce sont les transcriptions redites, effectuées en temps réel, qui donnent les résultats les plus stables; et celles qui ont été effectuées avec un logiciel installé sur l'ordinateur sont supérieures aux autres.
Puisqu’elle permet de minimiser la variabilité des signaux acoustiques, de fournir les indicateurs pour la représentation de la construction dialogique et de favoriser la reconnaissance automatique du vocabulaire issu de la variation de l'espagnol ainsi que d'autres langues, la méthode de redite ne demande en moyenne que 9,2 minutes par minute d'enregistrement du COLEM, incluant la redite en temps réel et deux révisions effectuées par deux personnes différentes à partir de l’audio.
En complément, les erreurs qui peuvent se manifester dans les transcriptions obtenues à l’aide de la technologie intelligente ont été catégorisées, selon qu’il s’agisse de non-respect de l'orthographe ou de la protection des données, d’imprécisions dans la segmentation des unités linguistiques, dans la représentation écrite des mécanismes d'interruption de la séquence de parole, dans la construction dialogique ou dans le lexique. / This research addresses one of the major challenges associated with the empirical study of linguistic phenomena: the optimization of material and human transcription resources. To do so, it highlights the value of off-line respeaking, a method of voice-assisted transcription using automatic speech recognition (ASR) software modelled after voice subtitling for television broadcasts. The task of transcribing spontaneous speech is an arduous and complex one; we must account for all the components of communication: linguistic, extralinguistic and paralinguistic, notwithstanding the difficulties posed by spontaneous speech, self-corrections, hesitations, repetitions, variations and contact phenomena.
To evaluate the work required to generate a quality product, a selection of interviews from the Spoken Corpus of the Spanish Language in Montreal (COLEM), which reflects all the varieties of Spanish spoken in Montreal (i.e., in contact with French and English), were transcribed through respeaking. The quality of the transcriptions was evaluated for accuracy, since the more accurate they were, the less time was needed for correction. To obtain accuracy percentages that are closer to reality –albeit lower than those obtained in other research– we considered not only words incorrectly added, deleted, or substituted, but also issues related to punctuation marks, descriptive labels, and typographical markers specific to COLEM transcription conventions. We also considered the time required to produce and correct the transcriptions. The results obtained were compared with manual (typed) and automatic transcriptions.
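A minimal sketch of the word-level accuracy measure discussed above, assuming tokens (including punctuation marks and descriptive labels) are already separated by spaces; the example strings are invented and do not come from COLEM:

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions between token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)]

def accuracy(reference, hypothesis):
    """Accuracy = 1 - errors / reference length, counting punctuation and labels as tokens."""
    ref = reference.split()   # assumption: punctuation and labels already space-separated
    hyp = hypothesis.split()
    return 1.0 - edit_distance(ref, hyp) / max(len(ref), 1)

# Invented reference transcription vs. raw ASR output:
print(accuracy("pues sí , este ... <risas> no sé", "pues sí este no sé"))
```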
Manual input offers the flexibility needed to achieve the level of accuracy required for transcription, but it is neither the fastest nor the most rigorous method. As for automatic transcriptions, none fully meets the conditions required to save time or reduce editing effort. It has also been noted that the performance of automatic speech recognition fluctuates according to the speakers and the characteristics of the recordings, causing considerable variations in the time needed to correct transcriptions. The most stable results were obtained with respoken transcriptions made in real time, and those made with software installed on the computer were better than others.
Since it minimizes the variability of acoustic signals, provides indicators for the representation of dialogical construction, and promotes automatic recognition of vocabulary derived from variations in Spanish as well as other languages, respeaking requires an average of only 9.2 minutes for each minute of COLEM recording, including real-time respeaking and two revisions made from the audio by two different individuals.
In addition, the ASR errors have been categorized, depending on whether they concern misspelling or non-compliance with data protection, inaccuracies in the segmentation of linguistic units, in the written representation of speech interruption mechanisms, in dialogical construction or in the lexicon. / Esta investigación se centra en uno de los grandes retos que acompañan al estudio empírico de los fenómenos lingüísticos: la optimización de recursos materiales y humanos para transcribir. Para ello, propone el rehablado off-line, un método de transcripción vocal asistido por una herramienta de reconocimiento automático del habla (RAH) inspirado del subtitulado vocal para programas audiovisuales. La transcripción del habla espontánea es un trabajo intenso y difícil, que requiere plasmar todos los niveles de la comunicación lingüística, extralingüística y paralingüística, con sus dificultades exacerbadas por los retos propios del habla espontánea, como la autocorrección, la vacilación, la repetición, la variación o los fenómenos de contacto.
Para medir el esfuerzo que conlleva lograr un producto de calidad, primero se rehablaron una serie de grabaciones del Corpus oral de la lengua española en Montreal (COLEM), que refleja todas las variedades del español en contacto con el francés y el inglés. La calidad de las transcripciones se midió en relación con la exactitud: a mayor exactitud, menor tiempo necesario para la corrección. Se contabilizaron las palabras eliminadas, insertadas y sustituidas incorrectamente; pero también computaron los signos de puntuación, las etiquetas descriptivas y demás marcas tipográficas de las convenciones de transcripción del COLEM; los resultados serían inferiores a los de otros trabajos, pero también más realistas. Asimismo, se consideró el tiempo necesario para producir y corregir las transcripciones. Los resultados se compararon con transcripciones mecanografiadas (manuales) y automáticas.
La mecanografía brinda flexibilidad para producir el nivel de detalle de transcripción requerido, pero no es el método más rápido, ni el más exacto. Ninguna de las transcripciones automáticas reúne las condiciones satisfactorias para ganar tiempo ni disminuir esfuerzo. Además, el rendimiento de la tecnología de RAH es muy diferente para determinados hablantes y grabaciones, haciendo fluctuar excesivamente el tiempo de corrección entre una entrevista y otra. Todas las transcripciones rehabladas se hacen en tiempo real y brindan resultados más estables. Las realizadas con un programa instalado en la computadora, que puede editarse, son superiores a las demás.
Gracias a las acciones para minimizar la variación en las señales acústicas, suministrar claves de representación de la mecánica conversacional y complementar el reconocimiento automático del léxico en cualquier variedad del español, y en otras lenguas, las transcripciones de las entrevistas del COLEM se rehablaron y se revisaron dos veces con el audio por dos personas en un promedio de 9,2 minutos por minuto de grabación.
Adicionalmente, se han categorizado los errores que pueden aparecer en las transcripciones realizadas con la tecnología de RAH según sean infracciones a la ortografía o a la protección de datos, errores de segmentación de las unidades del habla, de representación gráfica de los recursos de interrupción de la cadena hablada, del andamiaje conversacional o de cualquier elemento léxico.
234
基於語料庫的幽默文本翻譯研究: 以錢鍾書的漢語小說"圍城"的英譯為個案研究 / Corpus-based study on translating humorous texts: a case study on the English translation of the Chinese novel "Fortress Besieged" by Ch'ien Chung-shu
January 2012 (has links)
幽默是人類社會的一種普遍現象,是人們言語交際的一部分,在日常生活中無處不在。研究者們對幽默的研究涵蓋了心理學、哲學、社會學、人類學、神學、文化學等不同的領域。幽默被認為是最難研究的課題之一,翻譯幽默更是難上加難。我們至今對言語幽默的文本語言特徵、對言語幽默的譯文的文本語言特徵、對涉及言語幽默的翻譯過程知之不多。在言語幽默的翻譯中,有時候原文的幽默信息在譯文中完全保留,有時候則不能,原因何在,有關這方面的研究至今不多見;以漢語幽默文本的英譯為語料建立單向平行語料庫,探討漢語言語幽默英譯的一般及特殊規律的研究,尚無人做過,這正是本研究的目的所在。 / 本研究以錢鍾書的小說《圍城》及其英譯本為語料,運用言語幽默概論的理論框架,建立原文和譯文對照的雙語單向平行語料庫,采用語料庫檢索的方法,對語料進行描寫與分析,得出了“譯文要傳遞原文幽默信息需要保留或轉換原文的表層及深層參數特徵,特別是深層參數的腳本對立、表層參數的修辭手段及語言中的本源概念的研究結論。 / 主要研究結果有三: / 一是展示了言語幽默翻譯的一般規律,即“譯文需要保留原文中的腳本對立; / 二是展現了漢語言語幽默翻譯的特殊規律,即“要保留原文中的腳本對立,就需要轉換漢語所特有的修辭手段和漢語本源概念; / 三是顯示了漢語言語幽默的文本語言特徵,即,漢語言語幽默具有表層和深層的參數特徵。表層參數的核心是修辭手段,深層參數的核心是腳本對立;表層參數具有“相似性之奇特統一、“語言要素之巧妙轉移和“不和諧邏輯間之和諧三大特徵;深層參數具有“真實的與非真實的語境對立、“正常的與非正常的語境對立、“合理與不合理的語境對立三大特徵,反映了現實與經驗、話語現實與語言經驗、話語邏輯與正常邏輯的矛盾衝突。 / Humour is regarded as a universal human phenomenon, and funny situations, funny stories, even funny thoughts occur everyday to virtually everybody (Raskin 1985: 1). Humour represents a multidisciplinary and fertile research field. So is Translation Studies. Both draw from linguistics, psychology and sociology, among other disciplines, for their descriptions and their theoretical models and constructs (Zabalbeascoa 2005: 185). What is surprising is that the link between translation and humour has not received sufficient attention from scholars in either field (ibid). / This research attempts to explore how Chinese humorous texts are transferred into English and what factors affect humour transference in the target text, with a focus on the universality of the translation process, by tapping on fresh methodologies of modern Translation Studies such as bilingual corpora and the General Theory of Verbal Humor. The bilingual corpus for the research is from Wei Cheng (Fortress Besieged) (1947/1991/2003), a humorous fiction by the Chinese writer Ch’ien Chung-shu (1910-1998) and its English translation by Jeanne Kelly and Nathan K. Mao (1979/2003). Adopting the General Theory of Verbal Humour (GTVH) (Attardo & Raskin 1991; Attardo 1994, 2001) as the theoretical framework, the research makes a quantitative and qualitative analysis on the samplings from the corpus and obtains some encouraging findings as follows: / (a) Transference of humour in the source text to the target text needs transference of the Script Opposition embedded in the deep parameters of the source text, which shows the universality of translating Chinese verbal humour into English; / (b) More specifically, transference of the Script Opposition embedded in the deep parameters requires transference of specific Chinese rhetoric devices and socioculturally-bound Chinese alien sources in the target text; / (c) Chinese verbal humour has three features in the surface parameters and three features in the deep parameters, which shows the textual characteristics of Chinese verbal humour. The surface parameters show the features of similarity among quite different things, transference among quite different linguistic elements and congruity among incongruous logical elements. The deep parameters show the following three features: opposition between real and unreal situations; opposition between normal and abnormal situations, opposition between possible, plausible and fully or partially impossible or much less plausible situations. 
/ The research draws a conclusion that verbal humour may travel from one culture to another via translation if the target text successfully transfers the Script Opposition embedded in the deep parameters of the source text, the Rhetorical Device in the surface parameters and the alien concepts hidden in the Language Parameter. / Ge Lingling. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012.
235
Sequenze ricorrenti in un corpus di comunicazioni mediate dal computer di apprendenti di inglese / RECURRENT SEQUENCES IN A LEARNER CORPUS OF COMPUTER-MEDIATED COMMUNICATION
PAVESI, CATERINA, 12 March 2013 (has links)
La tesi si colloca nell'ambito di studi sulla fraseologia nell'inglese prodotto da apprendenti. Presenta uno studio empirico delle sequenze di parole più ricorrenti in un corpus di inglese prodotto da apprendenti di livello avanzato durante chat asincrone in contesto universitario italiano. Secondo la letteratura d'area, sia nella lingua scritta che in quella parlata, le sequenze di parole degli apprendenti rivelano una scarsa attenzione alla variazione del registro a seconda del mezzo di comunicazione usato. Al fine di verificare la presenza di questa caratteristica in un tipo di comunicazione che si trova in posizione intermedia tra i due poli del continuum esistente tra parlato e scritto, la presente ricerca ha analizzato quantitativamente e qualitativamente le sequenze di parole più frequenti nel corpus di comunicazioni mediate dal computer (CMC) raccolto nell'ambito della presente ricerca. Successivamente, le sequenze più frequenti sono state confrontate con quelle estratte da due corpora di interlingua inglese prodotta da apprendenti italofoni, uno di testi scritti (ICLE, Granger et al. 2002) e uno interviste orali (LINDSEI, Gilquin et al. 2010 ). Il confronto ha rivelato che le sequenze più ripetute dagli apprendenti hanno caratteristiche distintive nei vari media e supporta solo in parte i precedenti studi in materia. Ciò è probabilmente dovuto sia alle caratteristiche di informalità e immediatezza della comunicazione mediata dal computer, che ai vantaggi motivazionali e al diverso tipo di elaborazione linguistica connaturato alla CMC. Per l'apprendente la CMC non presenta la stessa pressione comunicativa del parlato e, allo stesso tempo, egli ha la possibilità di monitorare la propria produzione in quanto distanziata da sé dal mezzo elettronico. / The present dissertation contributes to studies of phraseology in learner English. It is an analysis of recurrent sequences of words in a corpus of learner Computer-mediated Communication. English, collected by means of asynchronous chats in an Italian university context. Previous research has argued that the use of recurrent word sequences plays a major role in learner English fluency both in writing and in speech, and is one of the factors behind learner English register failures. Using a corpus-driven approach, the study analyses the most frequent word sequences extracted from the specially compiled Learner Chat Corpus (LCC). To determine the level of adaptation of learner English to different registers, data regarding 3-word sequences from LCC is compared with the Italian subcomponents of a well-known corpus of learner writing (ICLE, Granger et al. 2002) and a corpus of learner speech (LINDSEI, Gilquin et al. 2010 ). The cross-corpus comparisons provide evidence that learners employ combinations which make their English suitable to the mode they are using for communication. Quantitative and qualitative findings from the present research support only in part previous studies of learner English in terms of recurrent sequences. This is probably due both to the informality and spoken-like quality of CMC, and to its motivational advantages and processing differences connected to the fact that learners can monitor their output while communicating because learner language production is distanced by the electronic means.
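A minimal sketch of how recurrent 3-word sequences of the kind compared across LCC, ICLE and LINDSEI might be counted in a plain-text sample; the tokenisation, threshold and toy sentence are simplifying assumptions, not the procedure actually used in the dissertation:

```python
import re
from collections import Counter

def top_trigrams(text, n=20, min_freq=2):
    """Count recurrent 3-word sequences in a lower-cased, naively tokenised text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return [(" ".join(t), f) for t, f in trigrams.most_common(n) if f >= min_freq]

# Invented learner-like chat sample:
sample = ("i think that the most important thing is that i think that "
          "we have to consider that i think that it depends on the context")
print(top_trigrams(sample))
```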
236
Exploring the use of parallel corpora in the compilation of specialised bilingual dictionaries of technical terms: a case study of English and isiXhosa
Shoba, Feziwe Martha, 07 1900 (has links)
Text in English / Abstracts in English, isiXhosa and Afrikaans / The Constitution of the Republic of South Africa, Act 108 of 1996, mandates the state to take practical and positive measures to elevate the status and the use of indigenous languages. The implementation of this pronouncement resulted in a growing demand for specialised translations in fields like technology, science, commerce, law and finance. The lack of terminology and resources such as specialised bilingual dictionaries in indigenous languages, particularly isiXhosa remains a growing concern that hinders the translation and the intellectualisation of isiXhosa. A growing number of African scholars affirm the importance of specialised dictionaries in the African languages as tools for language and terminology development so that African languages can be used in the areas of science and technology.
In the light of the background above, this study explored how parallel corpora can be interrogated using a bilingual concordancer, ParaConc, to extract bilingual terminology that can be used to create specialised bilingual dictionaries. A corpus-based approach was selected due to its speed, efficiency and accuracy in extracting bilingual terms in their immediate contexts. To enhance the research outcomes, Descriptive Translation Studies (DTS) and Corpus-based Translation Studies (CTS) were used in a complementary manner. Because the study is interdisciplinary, the function theories of lexicography, which emphasise the functions and needs of users, were also applied.
The analysis and extraction of bilingual terminology for dictionary making was successful through the use of the following ParaConc features, namely frequencies, hot word lists, hot words, search facility and concordances (Key Word in Context), among others. The findings revealed that English-isiXhosa Parallel Corpus is a repository of translation equivalents and other information categories that can make specialised dictionaries more user-friendly and multifunctional. The frequency lists were revealed as an effective method of selecting headwords for inclusion in a dictionary. The results also unraveled the complex functions of bilingual concordances where information on collocations and multiword units, sense distinction and usage examples could be easily identifiable proving that this approach is more efficient than the traditional method. The study contributes to the knowledge on corpus-based lexicography, standardisation of finance terminology resource development and making of user-friendly dictionaries that are tailor-made for different needs of users. / Umgaqo-siseko weli loMzantsi Afrika ukhululele uRhulumente ukuba athabathe amanyathelo abonakalayo ekuphuhliseni nasekuphuculeni iilwimi zesiNtu. Esi sindululo sibangele ukwanda kokuguqulelwa kwamaxwebhu angezobuchwepheshe, inzululwazi, umthetho, ezemali noqoqosho angesiNgesi eguqulelwa kwiilwimi ebezifudula zingasiwe-so ezinjengesiXhosa. Ukunqongophala kwesigama kunye nezichazi-magama kube yingxaki enkulu ekuguquleleni ngakumbi izichazi-magama ezilwimi-mbini eziqulethe isigama esikhethekileyo. Iingcali ezininzi ziyangqinelana ukuba olu hlobo lwezi zichazi-magama luyimfuneko kuba ludlala iindima enkulu ekuphuhlisweni kweelwimi zesiNtu, ekuyileni isigama, nasekusetyenzisweni kwazo kumabakala obunzululwazi nobuchwepheshe.
Olu phando ke luvavanya ukusetyenziswa kwekhophasi equlethe amaxwebhu esiNgesi neenguqulelo zawo zesiXhosa njengovimba wokudimbaza isigama sezemali esinokunceda ekuqulunqweni kwesichazi-magama esilwimi-mbini. Isizathu esibangele ukukhetha le ndlela yophando esebenzisa ikhompyutha kukuba iyakhawuleza, ulwazi oluthathwe kwikhophasi luchanekile, yaye isigama kwikhophasi singqamana ngqo nomxholo wamaxwebhu nto leyo eyenza kube lula ukufumana iintsingiselo nemizekelo ephilayo. Ukutyebisa olu phando indlela yekhophasi iye yaxhaswa zezinye iindlela zophando ezityunjiweyo: ufundo lwenguguqulelo oluchazayo (DTS) kunye neendlela zokuguqulela ezijoliswe kumsebenzi nakuhlobo lwabasebenzisi zinguqulelo ezo. Kanti ke ziqwalaselwe neenkqubo zophando lobhalo-zichazi-magama eziinjongo zokuqulunqa izichazi-magama ezesebenzisekayo neziluncedo kuninzi lwabasebenzisi zichazi-magama ngakumbi kwisizwe esisebenzisa iilwimi ezininzi.
Ukuhlalutya nokudimbaza isigama kwikhophasi kolu phando kusetyenziswe isixhobo sekhompyutha esilungiselelwe ikhophasi enelwiimi ezimbini nangaphezulu ebizwa ngokuba yiParaConc.
Iziphumo zolu phando zibonise mhlophe ukuba ikhophasi eneenguqulelo nguvimba weendidi ngendidi zamagama nolwazi olunokuphucula izichazi-magama zeli xesha. Kaloku abaguquleli basebenzise amaqhinga ngamaqhinga ukunika iinguqulelo bekhokelwa yimigomo nemithetho yoguqulelo enxuse abasebenzisi bamaxwebhu aguqulelweyo. Ubuchule beParaConc bokukwazi ukuhlela amagama ngokwendlela afumaneka ngayo kunye neenkcukacha zamanani budandalazise indlela eyiyo
yokukhetha imichazwa enokungena kwisichazi-magama. Iziphumo zikwabonakalise iintlaninge yolwazi olufumaneka kwiKWIC, lwazi olo olungelula ukulufumana xa usebenzisa undlela-ndala wokwakha isichazi-magama.
Esi sifundo esihlanganyele uGuqulelo olusekelwe kwiKhophasi noQulunqo-zichazi-magama zobuchwepheshe luya kuba negalelo elingathethekiyo kwindlela yokwakha izichazi-magama kwilwiimi zeSintu ngokubanzi nancakasana kwisiXhosa, nto leyo eya kothula umthwalo kubaqulunqi-zichazi-magama. Ukwakha nokuqulunqa izichazi-magama ezilwimi-mbini zezemali kuya kwandisa imithombo yesigama esinqongopheleyo kananjalo sivelise izichazi-magama eziluncedo kwisininzi sabantu. / Die Grondwet van die Republiek van Suid-Afrika, Wet 108 van 1996, gee aan die staat die mandaat om praktiese en positiewe maatreëls te tref om die status en gebruik van inheemse tale te verhoog. Die implementering van hierdie uitspraak het gelei tot ’n toenemende vraag na gespesialiseerde vertalings in domeine soos tegnologie, wetenskap, handel, regte en finansies. Die gebrek aan terminologie en hulpbronne soos gespesialiseerde woordeboeke in inheemse tale, veral Xhosa, wek toenemende kommer wat die vertaling en die intellektualisering van Xhosa belemmer. ’n Toenemende aantal vakkundiges in Afrika beklemtoon die belangrikheid van gespesialiseerde woordeboeke in die Afrikatale as instrumente vir taal- en terminologie-ontwikkeling sodat Afrikatale gebruik kan word in die areas van wetenskap en tegnologie.
In die lig van die voorafgaande agtergrond het hierdie studie ondersoek ingestel na hoe parallelle korpora deursoek kan word deur ’n tweetalige konkordanser (ParaConc) te gebruik om tweetalige terminologie te ontgin wat gebruik kan word in die onwikkeling van tweetalige gespesialiseerde woordeboeke. ’n Korpusgebaseerde benadering is gekies vir die spoed, doeltreffendheid en akkuraatheid waarmee dit tweetalige terme uit hulle onmiddellike kontekste kan onttrek. Beskrywende Vertaalstudies (DTS) en Korpusgebaseerde Vertaalstudies (CTS) is op ’n aanvullende wyse gebruik om die navorsingsuitkomste te verbeter. Aangesien die studie interdissiplinêr is, is die funksieteorieë van leksikografie wat die funksie en behoeftes van gebruikers beklemtoon, ook toegepas.
Die analise en ontginning van tweetalige terminologie om woordeboeke te ontwikkel was suksesvol deur, onder andere, gebruik te maak van die volgende ParaConc-eienskappe, naamlik, frekwensies, hotword-lyste, hot words, die soekfunksie en konkordansies (Sleutelwoord-in-Konteks). Die bevindings toon dat ’n Engels-Xhosa Parallelle Korpus ’n bron van vertaalekwivalente en ander inligtingskategorieë is wat gespesialiseerde woordeboeke meer gebruikersvriendelik en multifunksioneel kan maak. Die frekwensielyste is geïdentifiseer as ’n doeltreffende metode om hoofwoorde te selekteer wat opgeneem kan word in ’n woordeboek. Die bevindings het ook die komplekse funksies van tweetalige konkordansers ontknoop waar inligting oor kollokasies en veelvuldigewoord-eenhede, betekenisonderskeiding en gebruiksvoorbeelde maklik identifiseer kon word wat aandui dat hierdie metode
doeltreffender is as die tradisionele metode. Die studie dra by tot die kennisveld van korpusgebaseerde leksikografie, standaardisering van finansiële terminologie, hulpbronontwikkeling en die ontwikkeling van gebruikersvriendelike woordeboeke wat doelgemaak is vir verskillende behoeftes van gebruikers. / Linguistics and Modern Languages / D. Litt. et Phil. (Linguistics (Translation Studies))
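A loose illustration of the bilingual keyword-in-context (KWIC) retrieval from a sentence-aligned English-isiXhosa corpus described in the English abstract above; ParaConc is an interactive tool, so the data structure and function below are simplifying assumptions, not its API:

```python
def kwic_bilingual(aligned_pairs, keyword, width=30):
    """Show each English hit in context together with its aligned isiXhosa sentence.

    aligned_pairs: list of (english_sentence, isixhosa_sentence) tuples.
    """
    rows = []
    for en, xh in aligned_pairs:
        start = en.lower().find(keyword.lower())
        if start == -1:
            continue
        left = en[max(0, start - width):start]
        right = en[start + len(keyword):start + len(keyword) + width]
        rows.append((left.rjust(width), keyword, right.ljust(width), xh))
    return rows

# Invented toy alignment; placeholders stand in for the aligned isiXhosa sentences.
corpus = [
    ("The budget deficit increased sharply this year.", "<aligned isiXhosa sentence 1>"),
    ("Interest rates remained stable.", "<aligned isiXhosa sentence 2>"),
]
for left, kw, right, xh in kwic_bilingual(corpus, "budget"):
    print(f"{left} [{kw}] {right}  ||  {xh}")
```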
237
Producción de un corpus oral y modelado prosódico para la síntesis del habla expresiva
Iriondo Sanz, Ignasi, 18 June 2008 (has links)
Aquesta tesi aborda diferents aspectes relacionats amb la síntesi de la parla expressiva. Es parteix de l'experiència prèvia en sistemes de conversió de text a parla del Grup en Processament Multimodal (GPMM) d'Enginyeria i Arquitectura La Salle, amb l'objectiu de millorar la capacitat expressiva d'aquest tipus de sistemes. La parla expressiva transmet informació paralingüística com, per exemple, l'emoció del parlant, el seu estat d'ànim, una determinada intenció o aspectes relacionats amb l'entorn o amb el seu interlocutor. Els dos objectius principals de la present tesi consisteixen, d'una banda, en el desenvolupament d'un corpus oral expressiu i, d'una altra, en la proposta d'un sistema de modelatge i predicció de la prosòdia per a la seva utilització en l'àmbit de la síntesi expressiva del parla.En primer lloc, es requereix un corpus oral adequat per a la generació d'alguns dels mòduls que componen un sistema de síntesi del parla expressiva. La falta de disponibilitat d'un recurs d'aquest tipus va motivar el desenvolupament d'un nou corpus. A partir de l'estudi dels procediments d'obtenció de parla emocionada o expressiva i de l'experiència prèvia del grup, es planteja el disseny, l'enregistrament, l'etiquetatge i la validació del nou corpus. El principal objectiu consisteix a aconseguir una elevada qualitat del senyal i una cobertura fonètica suficient (segmental i prosòdica), sense renunciar a l'autenticitat des del punt de vista de l'expressivitat oral. El corpus desenvolupat té una durada de més de cinc hores i conté cinc estils expressius: neutre, alegre, sensual, agressiu i trist. En tractar-se de parla expressiva obtinguda mitjançant la lectura de textos semànticament relacionats amb els estils definits, s'ha requerit un procés de validació que garanteixi que les locucions que formen el corpus incorporin el contingut expressiu desitjat. L'avaluació exhaustiva de tots els enunciats del corpus seria excessivament costosa en un corpus de gran grandària. D'altra banda, no existeix suficient coneixement científic per a emular completament la percepció subjectiva mitjançant tècniques automàtiques que permetin una validació exhaustiva i fiable dels corpus orals. En el present treball s'ha proposat un mètode que suposa un avanç cap a una solució pràctica i eficient d'aquest problema, mitjançant la combinació d'una avaluació subjectiva amb tècniques d'identificació automàtica de l'emoció en el parla. El mètode proposat s'utilitza per a portar a terme una revisió automàtica de l'expressivitat del corpus desenvolupat. Finalment, una prova subjectiva ha permès validar el correcte funcionament d'aquest procés automàtic. En segon lloc i, sobre la base dels coneixements actuals, de l'experiència adquirida i dels reptes que es desitjaven abordar, s'ha desenvolupat un sistema d'estimació de la prosòdia basat en corpus. Tal sistema es caracteritza per modelar de forma conjunta les funcions lingüística i paralingüística de la prosòdia a partir de l'extracció automàtica d'atributs prosòdics del text, que constitueixen l'entrada d'un sistema d'aprenentatge automàtic que prediu els trets prosòdics modelats prèviament. El sistema de modelatge prosòdic presentat en aquest treball es fonamenta en el raonament basat en casos, que es tracta d'una tècnica d'aprenentatge automàtic per analogia. Per a l'ajustament d'alguns paràmetres del sistema desenvolupat i per a la seva avaluació s'han utilitzat mesures objectives de l'error i de la correlació calculades en les locucions del conjunt de prova. 
Atès que les mesures objectives sempre es refereixen a casos concrets, no aporten informació sobre el grau d'acceptació que tindrà la parla sintetitzada en els oïdors. Per tant, s'han portat a terme una sèrie de proves de percepció en les quals un conjunt d'avaluadors ha puntuat un grup d'estímuls en cada estil. Finalment, s'han analitzat els resultats per a cada estil i s'han comparat amb les mesures objectives obtingudes, el que ha permès extreure algunes conclusions sobre la rellevància dels trets prosòdics en la parla expressiva, així com constatar que els resultats generats pel mòdul prosòdic han tingut una bona acceptació, encara que s'han produït diferències segons l'estil. / Esta tesis aborda diferentes aspectos relacionados con la síntesis del habla expresiva. Se parte de la experiencia previa en sistemas de conversión de texto en habla del Grup en Processament Multimodal (GPMM) de Enginyeria i Arquitectura La Salle, con el objetivo de mejorar la capacidad expresiva de este tipo de sistemas. El habla expresiva transmite información paralingüística como, por ejemplo, la emoción del hablante, su estado de ánimo, una determinada intención o aspectos relacionados con el entorno o con su interlocutor. Los dos objetivos principales de la presente tesis consisten, por una parte, en el desarrollo de un corpus oral expresivo y, por otra, en la propuesta de un sistema de modelado y predicción de la prosodia para su utilización en el ámbito de la síntesis expresiva del habla. En primer lugar, se requiere un corpus oral adecuado para la generación de algunos de los módulos que componen un sistema de síntesis del habla expresiva. La falta de disponibilidad de un recurso de este tipo motivó el desarrollo de un nuevo corpus. A partir del estudio de los procedimientos de obtención de habla emocionada o expresiva y de la experiencia previa del grupo, se plantea el diseño, la grabación, el etiquetado y la validación del nuevo corpus. El principal objetivo consiste en conseguir una elevada calidad de la señal y una cobertura fonética suficiente (segmental y prosódica), sin renunciar a la autenticidad desde el punto de vista de la expresividad oral. El corpus desarrollado tiene una duración de más de cinco horas y contiene cinco estilos expresivos: neutro, alegre, sensual, agresivo y triste. Al tratarse de habla expresiva obtenida mediante la lectura de textos semánticamente relacionados con los estilos definidos, se ha requerido un proceso de validación que garantice que las locuciones que forman el corpus incorporen el contenido expresivo deseado. La evaluación exhaustiva de todos los enunciados del corpus sería excesivamente costosa en un corpus de gran tamaño. Por otro lado, no existe suficiente conocimiento científico para emular completamente la percepción subjetiva mediante técnicas automáticas que permitan una validación exhaustiva y fiable de los corpus orales. En el presente trabajo se ha propuesto un método que supone un avance hacia una solución práctica y eficiente de este problema, mediante la combinación de una evaluación subjetiva con técnicas de identificación automática de la emoción en el habla. El método propuesto se utiliza para llevar a cabo una revisión automática de la expresividad del corpus desarrollado. 
Finalmente, una prueba subjetiva con oyentes ha permitido validar el correcto funcionamiento de este proceso automático.En segundo lugar y, sobre la base de los conocimientos actuales, a la experiencia adquirida y a los retos que se deseaban abordar, se ha desarrollado un sistema de estimación de la prosodia basado en corpus. Tal sistema se caracteriza por modelar de forma conjunta las funciones lingüística y paralingüística de la prosodia a partir de la extracción automática de atributos prosódicos del texto, que constituyen la entrada de un sistema de aprendizaje automático que predice los rasgos prosódicos modelados previamente. El sistema de modelado prosódico presentado en este trabajo se fundamenta en el razonamiento basado en casos que se trata de una técnica de aprendizaje automático por analogía. Para el ajuste de algunos parámetros del sistema desarrollado y para su evaluación se han utilizado medidas objetivas del error y de la correlación calculadas en las locuciones del conjunto de prueba. Dado que las medidas objetivas siempre se refieren a casos concretos, no aportan información sobre el grado de aceptación que tendrá el habla sintetizada en los oyentes. Por lo tanto, se han llevado a cabo una serie de pruebas de percepción en las que un conjunto de oyentes ha puntuado un grupo de estímulos en cada estilo. Finalmente, se han analizado los resultados para cada estilo y se han comparado con las medidas objetivas obtenidas, lo que ha permitido extraer algunas conclusiones sobre la relevancia de los rasgos prosódicos en el habla expresiva, así como constatar que los resultados generados por el módulo prosódico han tenido una buena aceptación, aunque se han producido diferencias según el estilo. / This thesis deals with different aspects related to expressive speech synthesis (ESS). Based on the previous experience in text-to-speech (TTS) systems of the Grup en Processament Multimodal (GPMM) of Enginyeria i Arquitectura La Salle, its main aim is to improve the expressive capabilities of such systems. The expressive speech transmits paralinguistic information as, for example, the emotion of the speaker, his/her mood, a certain intention or aspects related to the environment or to his/her conversational partner. The present thesis tackles two main objectives: on the one hand, the development of an expressive speech corpus and, on the other, the modelling and the prediction of prosody from text for their use in the ESS framework. First, an ESS system requires a speech corpus suitable for the development and the performance of some of its modules. The unavailability of a resource of this kind motivated the development of a new corpus. Based on the study of the strategies to obtain expressive speech and the previous experience of the group, the different tasks have been defined: design, recording, segmentation, tagging and validation. The main objective is to achieve a high quality speech signal and sufficient phonetic coverage (segmental and prosodic), preserving the authenticity from the point of view of the oral expressiveness. The recorded corpus has 4638 sentences and it is 5 h 12 min long; it contains five expressive styles: neutral, happy, sensual, aggressive and sad. Expressive speech has been obtained by means of the reading of texts semantically related to the defined styles. Therefore, a validation process has been required in order to guarantee that recorded utterances incorporate the desired expressive content. 
A comprehensive assessment of the whole corpus would be too costly. Moreover, there is insufficient scientific knowledge to completely emulate the subjective perception through automated techniques that yield a reliable validation of speech corpora. In this thesis, we propose an approach that supposes a step towards a practical solution to this problem, by combining subjective evaluation with techniques for the automatic identification of emotion in speech. The proposed method is used to perform an automatic review of the expressiveness of the corpus developed. Finally, a subjective test has allowed listeners to validate this automatic process.Second, based on our current experience and the proposed challenges, a corpus-based system for prosody estimation has been developed. This system is characterized by modelling both the linguistic and the paralinguistic functions of prosody. A set of prosodic attributes is automatically extracted from text. This information is the input to an automatic learning system that predicts the prosodic features modelled previously by a supervised training. The root mean squared error and the correlation coefficient have been used in both the adjustment of some system parameters and the objective evaluation. However, these measures are referred to specific utterances delivered by the speaker in the recording session, and then they do not provide information about the degree of acceptance of synthesized speech in listeners. Therefore, we have conducted different perception tests in which a group of listeners has scored a set of stimuli in each expressive style. Finally, the results for each style have been analyzed and compared with the objective measures, which has allowed to draw some conclusions about the relevance of prosodic features in expressive speech, as well as to verify that the results generated by the prosodic module have had a good acceptance, although with differences as a function of the style.
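A minimal sketch of the two objective measures mentioned in the English abstract above, root mean squared error and the correlation coefficient, computed between predicted and observed prosodic values; the F0 figures are invented examples, not data from the thesis:

```python
import math

def rmse(pred, ref):
    """Root mean squared error between predicted and reference values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def pearson(pred, ref):
    """Pearson correlation coefficient between predicted and reference values."""
    n = len(ref)
    mp, mr = sum(pred) / n, sum(ref) / n
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, ref))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sr = math.sqrt(sum((r - mr) ** 2 for r in ref))
    return cov / (sp * sr)

# Invented mean F0 values (Hz) per syllable: predicted vs. observed in test utterances.
predicted = [182.0, 190.5, 175.2, 168.9, 201.3]
observed = [180.0, 195.0, 170.0, 172.5, 198.0]
print(rmse(predicted, observed), pearson(predicted, observed))
```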
238
Metalinguistic information extraction from specialized texts to enrich computational lexicons
Rodríguez Penagos, Carlos, 03 February 2005 (has links)
Este trabajo presenta un estudio empírico del uso y función del metalenguaje en el conocimiento científico experto y los lenguajes de especialidad en lengua inglesa, con especial atención al establecimiento, modificación y negociación de la terminología común del grupo de especialistas de cada área. Mediante enunciados discursivos llamados Operaciones Metalingüísticas Explícitas se formaliza y analiza el carácter dinámico de las estructuras conceptuales científicas y los sublenguajes que las vehiculan.Por otro lado, se presenta la implementación de un sistema automático de extracción de información metalingüística en textos de especialidad. El sistema MOP (Metalinguistic Operation Processor) extrae enunciados metalingüísticos y definiciones de documentos especializados, utilizando tanto autómatas de estados finitos como algoritmos de aprendizaje automático. El sistema crear bases semi-estructuradas de información terminológica llamadas Metalinguistic Information Databases (MID), de utilidad para la lexicografía especializada, el procesamiento del lenguaje natural y el estudio empírico de la evolución del conocimiento científico, entre otras aplicaciones. / This work presents an empirical study of the use and function of metalanguage in expert scientific knowledge and special-domain languages, with special focus on how each field's terminology is established, modified and negotiated within the group of experts. Through discourse statements called Explicit metalinguistic Operations the dynamic nature of conceptual structures and the sublanguages that embody them are formalized and analyzed.On the other hand, it presents a system implementation for the automatic extraction of metalinguistic information from specialized texts. The Metalinguistic Operation Processor (MOP) system extracts metalinguistic statements and definitions from special-domain documents, using finite-state machinery and machine-learning algorithms. The system creates semi-structured databases called Metalinguistic Information Databases (MID), useful for specialized lexicography, Natural Language Processing, and the empirical study of scientific knowledge, among other applications.
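A rough sketch of the pattern-based side of such extraction; the MOP system itself combines finite-state machinery with machine-learning components, so the regular expressions, marker verbs and toy sentences below are illustrative assumptions only:

```python
import re

# A few definition-like lexico-syntactic patterns, in the spirit of metalinguistic
# markers such as "is defined as", "is called" and "refers to".
PATTERNS = [
    re.compile(r"(?P<term>[A-Z][\w-]*(?:\s+[\w-]+){0,3})\s+is defined as\s+(?P<definition>[^.]+)\."),
    re.compile(r"(?P<definition>[^.]+?)\s+is called\s+(?P<term>[\w-]+(?:\s+[\w-]+){0,3})\."),
    re.compile(r"(?P<term>[\w-]+(?:\s+[\w-]+){0,3})\s+refers to\s+(?P<definition>[^.]+)\."),
]

def extract_metalinguistic(text):
    """Return (term, definition) candidates found by the surface patterns."""
    hits = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            hits.append((m.group("term").strip(), m.group("definition").strip()))
    return hits

# Invented specialised-text sample:
sample = ("Apoptosis is defined as a programmed form of cell death. "
          "This process of vessel formation is called angiogenesis.")
print(extract_metalinguistic(sample))
```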
239
Valorisation des analogies lexicales entre l'anglais et les langues romanes : étude prospective pour un dispositif plurilingue d'apprentissage du FLE dans le domaine de la santé / Emphasising lexical analogies between English and Romance languages : prospective study towards a plurilingual learning device of French for healthcare
Gilles, Fabrice, 29 September 2017 (has links)
Cette étude lexicologique prospective s'inscrit dans la didactique des L3. L’objectif est d’élaborer un interlexique anglais-espagnol-français-italien-portugais composé des adjectifs, noms et verbes anglais fréquents dans les écrits scientifiques de la santé, et de leurs équivalents de traduction analogues en espagnol, français, italien et portugais. Deux mots sont analogues s’ils ont le même sens et une forme similaire.Les rapports entre les concepts d'analogie, de similarité et d'identité sont examinés, les types d'analogies intralinguistiques et interlinguistiques illustrés et les principales analogies et dissemblances entre l’anglais, le français et les langues romanes exposées. L'existence de celles-ci est justifiée par les origines indoeuropéennes et surtout d'intenses contacts de langues. Après avoir rappelé l’importance de l’analogie dans l’apprentissage, nous montrons le lien entre notre recherche et deux types d’approches didactiques des langues : l'intercompréhension, qui développe la compréhension de langues voisines, et les approches sur corpus qui permettent de mieux connaitre et faire connaitre la phraséologie scientifique.Les 2000 lemmes anglais les plus fréquents ont été extraits du corpus scientifique anglais de ScienText, leurs 2208 acceptions fréquentes délimitées sur la base du profil combinatoire et triées en deux catégories sémantiques : lexique de spécialité et lexique scientifique transdisciplinaire. Les lemmes anglais ont été traduits dans les quatre langues romanes, et la similarité mesurée en fonction de la sous-chaine maximale commune (SMC).L’interlexique contient 47 % des acceptions fréquentes. Par couples de langues, l’analogie est encore plus élevée : anglais – français, 66 %, anglais-italien, 65 %, anglais-espagnol, 63 %, anglais-portugais, 58 %. Ce lexique analogue pourrait donc servir comme base de transfert dans des activités de FLE L3 pour des professionnels de la santé, et l’anglais L2 semble être une passerelle possible vers les langues romanes. Des activités plurilingues sont construites sur des concordances extraites des corpus multilingues alignés EMEA et Europarl. Un questionnement métalinguistique en anglais sensibilise à des traits (morpho)syntaxiques du français ; les analogies des deux langues sont systématiquement mises en relief, et dans les cas d'opacité, celles des autres langues romanes avec l’anglais. / This prospective lexicological investigation belongs to the field of L3 French didactics. The purpose is to elaborate a French-Italian-Portuguese-Spanish interlexicon out of the frequent adjectives, nouns and verbs of the healthcare scientific writings, and their analogue translation equivalents in French, Italian, Portuguese and Spanish. Two words are analogue if they have the same meaning and a similar form.Related concepts of analogy, similarity and identity are discussed, types of intralinguistic and cross-linguistic analogies reviewed, and the main analogies and differences between English, French and Romance languages detailed. Their many analogies are justified by Indo-European origins and mostly by intense language contacts. 
Once the importance of analogy in learning procedures has been highlighted, we show how this research and two types of didactic approaches connect together: intercomprehension, which develops comprehension skills in neighbor languages, and corpus approaches which enable to get a closer insight into scientific phraseology.The 2000 most frequent English lemmas were extracted from the ScienText English scientific corpus, their 2208 frequent acceptions explored from their combinatory profile and sorted out in two semantic categories: healthcare subject-specific vocabulary and science specific trans-disciplinary vocabulary. The English lemmas were translated into the four Romance languages, and similarity measurements were carried out with the longest common substring method.The interlexicon contains 47% of the frequent acceptions. Analogy is even higher by language pairs: English – French, 66%, English – Italian, 65%, English - Spanish, 63%, English – Portuguese, 58%. Consequently, this analogue vocabulary could form a transfer basis in learning activities of L3 French for health care providers, and L2 English seems to be a possible bridge language toward Romance languages. Plurilingual activities are built on concordances extracted from multilingual aligned corpora (EMEA, Europarl). Metalinguistic questions in English point out (morpho)syntactic features of French; the analogies between both languages are systematically enhanced, and in case of lexical opacity, those between English and the other Romance languages.
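A minimal sketch of the longest-common-substring similarity measurement mentioned above; normalising by the length of the longer word is an assumption, since the exact formula is not given in the abstract:

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                best = max(best, table[i][j])
    return best

def similarity(a, b):
    """Crude analogy score between an English lemma and a Romance translation equivalent."""
    return longest_common_substring(a.lower(), b.lower()) / max(len(a), len(b))

# Invented English-French pairs to show contrasting scores:
for en, fr in [("analysis", "analyse"), ("infection", "infection"), ("disease", "maladie")]:
    print(en, fr, round(similarity(en, fr), 2))
```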
240
Chybovost v písemném projevu romských žáků 9. ročníků základních škol praktických na základě elektronické databanky ROMi / Error Analysis of Czech Written Expression of the Romani Pupils in the 9th Grade of the Secondary Practical Schools Based on the Corpora ROMi
Bedřichová, Zuzanna, January 2015 (has links)
English Summary - Error Analysis of Czech Written Expression of the Romani Pupils in the 9th Grade of the Secondary Practical Schools Based on the Corpora ROMi. Zuzanna Bedřichová, ÚČJTK FFUK, Prague 2014. The study focuses on error making in the written expression of Romani pupils in the 9th grade of secondary practical schools (schools for children with special educational needs). It analyses 130 written school works by these pupils, available through the ROMi database (a database of written and spoken accounts in Czech produced by children and youth of Romani origin). The author offers an innovative concept and an elaborate scheme of error analysis, together with a qualitative and quantitative analysis of the pupils' written accounts. Beyond this analysis, the study outlines the current situation regarding the education of Romani children in the Czech language, the Romani ethnolect of Czech, and spoken language as a source of stigmatisation. Furthermore, it provides details about the ROMi database, the 130 original written accounts in full length, and practical proposals for compensating error making in practice.