Global ETD Search

101	Slovesný vid ve španělštině a češtině / Verbal Aspect in Spanish and Czech Bělehrádková, Kateřina January 2020 (has links) The aim of the thesis is to compare the verbal aspect in Spanish and Czech. The thesis has two parts: theoretical and practical. The theoretical part focuses on defining the category of verbal aspect in both languages and on comparison of this category in these two aspectual systems. Furthermore, we pay attention to the relation of verbal aspect to the category aspecto and to the lexical aspect (Aktionsart). The practical part consists of a corpus case study carried out using the newest version of the Czech parallel corpus InterCorp. In the case study, we analyse aspectual differences between original texts in Czech and their Spanish translations. Key words: verbal aspect - verb - Spanish - Czech - parallel corpora
102	La détection automatique multilingue d’énoncés biaisés dans Wikipédia Aleksandrova, Desislava 11 1900 (has links) Nous proposons une méthode multilingue pour l'extraction de phrases biaisées de Wikipédia, et l'utilisons pour créer des corpus en bulgare, en français et en anglais. En parcourant l'historique des révisions des articles, nous cherchons ceux qui, à un moment donné, avaient été considérés en violation de la politique de neutralité de Wikipédia (et corrigés par la suite). Pour chacun de ces articles, nous récupérons la révision signalée comme biaisée et la révision qui semble avoir corrigé le biais. Ensuite, nous extrayons les phrases qui ont été supprimées ou réécrites dans cette révision. Cette approche permet d'obtenir suffisamment de données même dans le cas de Wikipédias relativement petites, comme celle en bulgare, où de 62 000 articles nous avons extrait 5 000 phrases biaisées. Nous évaluons notre méthode en annotant manuellement 520 phrases pour le bulgare et le français, et 744 pour l'anglais. Nous évaluons le niveau de bruit, ses sources et analysons les formes d’expression de biais. Enfin, nous utilisons les données pour entrainer et évaluer la performance d’algorithmes de classification bien connus afin d’estimer la qualité et le potentiel des corpus. / We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia’s neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5 thousand biased sentences. 
We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Biais Neutralité Classification Multilingue Corpus Wikipédia Bias Neutrality Multilingual Corpora
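103	Integrated Parallel Data Extraction from Comparable Corpora for Statistical Machine Translation / 統計的機械翻訳におけるコンパラブルコーパスからの対訳データの統合的抽出 Chu, Chenhui 23 March 2015 (has links) Kyoto University / 0048 / Doctorate by coursework (new system) / Doctor of Informatics / Degree No. Kō 19107 / Jōhaku No. 553 / 新制\|\|情\|\|98 (Main Library) / 32058 / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Prof. Sadao Kurohashi, Prof. Toru Ishida, Prof. Tatsuya Kawahara / Conferred under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM Statistical Machine Translation Comparable Corpora Bilingual Lexicon Extraction Parallel Sentence Extraction Parallel Fragment Extraction 007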
104	The Role of Paratextual Elements in the Reception of Translation of Arabic Novels into English Alblooshi, Fatima Khalifa 08 April 2021 (has links) No description available. Middle Eastern Literature Language
105	Improving the Effectiveness of Machine-Assisted Annotation Felt, Paul L. 10 May 2012 (has links) (PDF) Annotated textual corpora are an essential language resource, facilitating manual search and discovery as well as supporting supervised Natural Language Processing (NLP) techniques designed to accomplish a variety of useful tasks. However, manual annotation of large textual corpora can be cost-prohibitive, especially for rare and under-resourced languages. For this reason, developers of annotated corpora often attempt to reduce annotation cost by offering annotators various forms of machine assistance intended to increase annotator speed and accuracy. This thesis contributes to the field of annotated corpus development by providing tools and methodologies for empirically evaluating the effectiveness of machine assistance techniques. This allows developers of annotated corpora to improve annotator efficiency by choosing to employ only machine assistance techniques that make a measurable, positive difference. We validate our tools and methodologies using a concrete example. First we present CCASH, a platform for machine-assisted online linguistic annotation capable of recording detailed annotator performance statistics. We employ CCASH to collect data detailing the performance of annotators engaged in Syriac morphological analysis in the presence of two machine assistance techniques: pre-annotation and correction propagation. We conduct a preliminary analysis of the data using the traditional approach of comparing mean data values. We then demonstrate a Bayesian analysis of the data that yields deeper insights into our data. Pre-annotation is shown to increase annotator accuracy when pre-annotations are at least 60% accurate, and annotator speed when pre-annotations are at least 80% accurate. Correction propagation's effect on accuracy is minor. 
The Bayesian analysis indicates that correction propagation has a positive effect on annotator speed after accounting for the effects of the particular visual mechanism we employed to implement it. Syriac Bayesian methods Annotated Corpora Machine-Assisted Annotation Machine Assistance Computer Sciences
106	The Racialized Rhetoric of Elite Education: Standardized Exam Admissions in New York City's Specialized High Schools Moftah, Linda January 2024 (has links) This dissertation explores the underrepresentation of racially minoritized students in elite academic spaces by considering the New York City Specialized High Schools and the controversies surrounding their exam-based enrollment process. It does so by examining the arguments marshalled both for and against the use of the SHSAT entrance exam in order to better understand the role of public discourse in maintaining educational systems implicated in the reproduction of racial inequities. Using a large-scale dataset of social media posts, this study employs a multi-tiered, mixed-methods approach to critical discourse analysis to investigate the linguistic and rhetorical contours of these debates, including the extent to which they rely on racially charged narratives around intelligence, ability, and merit. This methodological strategy incorporates topic modeling, corpus linguistics, and lexical analysis to navigate the breadth and complexity of the SHSAT discourse, while also providing insight into the way that emerging forms of participatory engagement, like social media, are transforming the landscape of equity-oriented education reform. This study finds that race is a central preoccupation of stakeholders on either side of the SHSAT debate, and that patterns of racialized discourse align closely to the articulation of specific policy objectives. The findings of this analysis also suggest that while social media has the potential to expand discursive access and foster dialogue, it can also amplify existing ideological divides and further polarize policy debates. 
The dualities of social media thus highlight the need for policymakers, researchers, and media consumers alike to critically evaluate the ways that these digital platforms are used to engage with policy and public discourse, as well as the equity implications of their evolving relationship. Education and state School management and organization Racism in education SHSAT (Educational test) Corpora (Linguistics) Social media in education
107	Unidades fraseológicas especializadas: colocações e colocações estendidas em contratos sociais e estatutos sociais traduzidos no modo juramentado e não-juramentado Orenha, Adriane [UNESP] 26 May 2009 (has links) (PDF) Esta pesquisa visa realizar um estudo a respeito dos termos, colocações e colocações especializadas estendidas presentes em contratos sociais e estatutos sociais que representam os corpora de pesquisa. Nesta pesquisa, também observaremos as semelhanças e diferenças nos corpora de traduções jurídicas e juramentadas, no que concerne ao uso desses termos e padrões lexicais, assim como apontaremos aqueles que são mais frequentemente empregados em documentos do tipo contrato social e estatuto social. A investigação baseia-se na abordagem interdisciplinar dos Estudos da Tradução Baseados em Corpus, da Linguística de Corpus, da Fraseologia, de modo mais específico das colocações, das colocações especializadas e das unidades fraseológicas especializadas. A Terminologia, por meio de seus pressupostos teóricos, também traz sua contribuição para a pesquisa, assim como os trabalhos sobre a tradução juramentada. Uma das motivações que delineia este estudo reside no fato de a tradução juramentada ser considerada de grande relevância nas relações comerciais, sociais e jurídicas entre as nações. 
Para realizar este estudo, compilamos um corpus de estudo (CE1) constituído por contratos sociais e estatutos sociais traduzidos no modo juramentado, nas direções tradutórias inglês português e português inglês, extraídos de Livros de Registro de Traduções, pertencentes a tradutores juramentados credenciados pela Junta Comercial de dois Estados brasileiros; e um corpus de estudo (CE2) formado por documentos de mesma natureza traduzidos sem o processo de juramentação, nas mesmas direções tradutórias. Além destes corpora, construímos dois corpora comparáveis, formados pelos referidos documentos originalmente escritos em português e em inglês. Os resultados desta pesquisa mostraram várias semelhanças, no tocante aos termos empregados em documentos traduzidos... / This investigation aims at carrying out a study on terms, collocations and extended specialized collocations present in articles of incorporation/articles of organization/articles of association and bylaws that represent our research corpora. We will also observe similarities and differences in sworn and legal translation corpora, which concerns the use of such terms and lexical patterns, as well as point out the ones which are more frequently used in the focused documents. This research derives its theoretical and methodological sources from Corpus-Based Translation Studies, Corpus Linguistics, Phraseology, more specifically from collocations, specialized collocations and specialized phraseological units (SPUs). Terminology, from its theoretical standpoint, also offers its contribution to this study, as well as essays on sworn translation. One of the aspects that motivates this study is the fact that sworn translation is considered to be of great relevance to commercial, social and legal relations among nations. 
To conduct this research, we compiled a study corpus (CE1) composed of articles of incorporation/articles of organization/articles of association and bylaws submitted to the process of sworn translation in the English Portuguese and Portuguese English directions, excerpted from the Books of Sworn Translation Records, made available by five Brazilian sworn translators, duly sworn by the Board of Trade of two Brazilian States; a study corpus (CE2) made up of documents of the same nature not submitted to the process of sworn translation, in the same translation directions. Besides these corpora, we also built two comparable corpora formed by the referred documents originally written in Portuguese and in English. The results obtained in this research showed some similarities which refer to the terms used in documents submitted to the process of sworn translation... Tradução e interpretação Fraseologia Tradução juramentada - Terminologia Lingüística de corpus Colocações Colocações estendidas Unidades fraseológicas especializadas Terminologia Specialized Phraseological Units Specialized collocations Extended specialized collocations Sworn translation Legal translation Parallel Corpora Comparable Corpora
108	Newsminer: um sistema de data warehouse baseado em texto de notícias / Newsminer: a data warehouse system based on news websites Nogueira, Rodrigo Ramos 12 May 2017 (has links) Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / Data and text mining applications managing Web data have been the subject of recent research. In every case, data mining tasks need to work on clean, consistent, and integrated data for obtaining the best results. Thus, Data Warehouse environments are a valuable source of clean, integrated data for data mining applications. Data Warehouse technology has evolved to retrieve and process data from the Web. In particular, news websites are rich sources that can compose a linguistic corpus. By inserting corpus into a Data Warehousing environment, applications can take advantage of the flexibility that a multidimensional model and OLAP operations provide. Among the benefits are the navigation through the data, the selection of the part of the data considered relevant, data analysis at different levels of abstraction, and aggregation, disaggregation, rotation and filtering over any set of data. 
This paper presents Newsminer, a data warehouse environment, which provides a consistent and clean set of texts in the form of a multidimensional corpus for consumption by external applications and users. The proposal includes an architecture that integrates the gathering of news in real time, a semantic enrichment module as part of the ETL stage, which adds semantic properties to the data such as news category and POS-tagging annotation and the access to data cubes for consumption by applications and users. Two experiments were performed. The first experiment selects the best news classifier for the semantic enrichment module. The statistical analysis of the results indicated that the Perceptron classifier achieved the best results of F-measure, with a good result of computational time. The second experiment collected data to evaluate real-time news preprocessing. For the data set collected, the results indicated that it is possible to achieve online processing time. / As aplicações de mineração de dados e textos oriundos da Internet têm sido alvo de recentes pesquisas. E, em todos os casos, as tarefas de mineração de dados necessitam trabalhar sobre dados limpos, consistentes e integrados para obter os melhores resultados. Sendo assim, ambientes de Data Warehouse são uma valiosa fonte de dados limpos e integrados para as aplicações de mineração. A tecnologia de Data Warehouse tem evoluído no sentido de recuperar e tratar dados provenientes da Web. Em particular, os sites de notícias são fontes ricas em textos, que podem compor um corpus linguístico. Inserindo o corpus em um ambiente de Data Warehouse, as aplicações poderão tirar proveito da flexibilidade que um modelo multidimensional e as operações OLAP fornecem. Dentre as vantagens estão a navegação pelos dados, a seleção da parte dos dados considerados relevantes, a análise dos dados em diferentes níveis de abstração, e a agregação, desagregação, rotação e filtragem sobre qualquer conjunto de dados. 
Este trabalho apresenta o ambiente de Data Warehouse Newsminer, que fornece um conjunto de textos consistente e limpo, na forma de um corpus multidimensional para consumo por aplicações externas e usuários. A proposta inclui uma arquitetura que integra a coleta textos de notícias em tempo próximo do tempo real, um módulo de enriquecimento semântico como parte da etapa de ETL, que acrescenta propriedades semânticas aos dados coletados tais como a categoria da notícia e a anotação POS-tagging, e a disponibilização de cubos de dados para consumo por aplicações e usuários. Foram executados dois experimentos. O primeiro experimento é relacionado à escolha do melhor classificador de categorias das notícias do módulo de enriquecimento semântico. A análise estatística dos resultados indicou que o classificador Perceptron atingiu os melhores resultados de F-medida, com resultado bom de tempo de processamento. O segundo experimento coletou dados para avaliar o pré-processamento de notícias em tempo real. Para o conjunto de dados coletados, os resultados indicaram que é possível atingir tempo de processamento online. / OB800972 Mineração de dados (Computação) Sites da Web Corpora multidimensional Enriquecimento semântico Categorização de notícias OLAP Multidimensional corpora Data mining Web sites Data Warehouse News websites Semantic enrichment News categorization
109	Création et exploitation d'un corpus trilingue du tourisme (italien/français/anglais) en vue de la réalisation d'une base de données lexicale informatisée / Creation and exploitation of a trilingual tourism corpus (Italian, French, English) for the realisation of a lexical electronic stored database Piccato, Mariangela 23 July 2012 (has links) Au cours des dernières années, le secteur touristique a été caractérisé par toute une série de changements fondamentaux. L’un de ces changements, certainement le plus important, a été le fait d’être considéré aujourd’hui comme l’activité productive capable de faire tourner l’économie d’un pays entier. Notre mémoire de recherche se situe à l’intersection de la terminologie thématique, de la linguistique de corpus et du traitement automatique des langues. Dans le premier chapitre du travail que nous allons présenter, nous chercherons à introduire aux domaines d’études théoriques sur lesquels notre recherche s’appuie. Premièrement, on traitera de la linguistique de corpus et on examinera les différentes catégories de corpus existantes. On mettra l’accent sur deux notions fondamentales dans la conception de l’outil corpus en général et dans la création de notre corpus en particulier : représentativité et contexte. Au sein du discours touristique, la représentativité, d’un côté, se relie au caractère spécial de notre micro-langue ; le contexte, de l’autre, révèle la pluralité des sous-domaines qui composent ce technolecte à mi-chemin entre la langue générale et la langue spécialisée. Dans le deuxième chapitre, nous présenterons le corpus thématique trilingue (CTT) que nous avons créé préalablement à la rédaction de la thèse proprement dite. Avant tout, on fournira les indications théoriques et pragmatiques nécessaires pour réaliser un corpus trilingue en langue de spécialité : la collecte des textes, l’homogénéisation des échantillons textuels repérés et l’annotation. 
Au cours de ce chapitre, nous présenterons Alinea, l’instrument qu’on a utilisé pour l’alignement de textes recueillis et pour la consultation simultanée des traductions trilingues. Dans le troisième et dernier chapitre, on passera à l’interrogation du corpus créé. Sur la base d’un terme pris comme exemple, le terme ville, on lancera la recherche dans le CTT. Ensuite, on analysera les collocations les plus usitées contenant le mot ville. En guise de conclusion de notre mémoire, nous présenterons une annexe consacrée à notre glossaire trilingue comme résultat de notre exploration de la chaîne terminologique qu’on aura analysée précédemment. Pour conclure, l’objectif général de notre étude sera d’explorer la chaîne de gestion terminologique à travers la création d’un glossaire trilingue dans le domaine du tourisme. Notre orientation méthodologique de caractère sémasiologique impliquera ainsi au moins quatre objectifs spécifiques : • créer un corpus trilingue du tourisme (CTT), capable d’attester des usages en contexte des termes. • extraire des termes en utilisant des techniques diverses, telle que l’étude fréquentielle des éléments du corpus. • vérifier les données obtenues et les compléter à l’aide de ressources externes. • répertorier et décrire l’ensemble des termes sous forme d’un glossaire trilingue à sujet touristique (GTT). / Our study concerns the language of tourism from a lexicographical perspective. Drawing on the web, we built an ad hoc corpus. This corpus is composed of about 10,000 texts in three languages (French, Italian and English), aligned using “Alinea”. Starting from terminological extraction, we analysed some collocations with the aim of creating a trilingual and tri-directional glossary. We chose this subject because of the increasing importance of the tourism economy worldwide. Our fields of study are thematic terminology, corpus linguistics and natural language processing. The first chapter presents the field of study of our research. 
First of all, we introduced corpus linguistics, presenting the different categories of corpus and focusing on two main notions: representativeness and context. We then explained the link between Language for Special Purposes and tourism discourse as a Specialized Discourse. In the second chapter, we presented the trilingual thematic corpus we created during our research. We described the main steps in creating a corpus: collection of texts, cleaning and annotation. In this chapter, we gave particular attention to the presentation of “Alinea”. Finally, the third chapter is a study of frequent collocations with the term “town” (ville). The annexes present the glossary as well as the methodological principles we followed in drafting it. Lexicographie bilingue Linguistique de corpus Langues de spécialité Terminologie Alinea Discours touristique Corpus alignés Corpus multilingues Glossaire tri-directionnel du tourisme Collocations Bilingual lexicography Corpus linguistics LSP Terminology Alinea Tourism discourse Collocations Aligned corpora Multilingual corpora Tri-directional tourism glossary
110	A Corpus-based Comparison of Albanian and Italian Student Writing in L1 and English as L2: Hedges and Boosters as Modalization by Degree Dheskali, Vincenzo 28 February 2020 (has links) Within the system of modality, modalization builds an area of uncertainty. It is an intermediate point between positive polarity (it is) and negative polarity (it is not), which has various degrees of indeterminacy (Halliday and Matthiessen 2014: 176). This indeterminacy includes probability and is expressed through items that Holmes (1990) and Hyland (1998) termed hedges and boosters. However, hedges and boosters can also function within intensity, where they convey a certain level of degree. Through them, writers achieve approval by finding the right balance between the reinforcement of statements with the assurance of reliable knowledge and the tentativeness to convey doubt and adequate social interrelations (Hyland 1998b: 349). The aim of this comparative study is to investigate the usage of hedges and boosters in Italian and Albanian student academic writings in their L1 and L2. Author-related and proposition-related hedges (e.g. suppose, approximately) and boosters (e.g. show, completely) as well as interrelated aspects such as their positioning, orientation, manifestation, and prosody of modalization will be analysed. My paper will interweave Prince et al.’s (1980) categorization of hedges, Quirk et al.’s (1985) model of boosters, Lafuente Millán’s (2008) categorization of approximative meanings and related concepts of the Systemic Functional Grammar (henceforth SFG) (Halliday and Matthiessen 2014) to create an innovative combination. I have compiled two corpora of Italian student writings (around 3 million words each) respectively in Italian and English and two corpora of writings by Albanian students in Albanian (around 2.2 million words) and in English (around 600,000 words). 
All corpora include a similar number of words and genres for each disciplinary domain as well as a balance of male and female writers. Disciplinary domains pertain to both soft and hard sciences (Social Sciences, Languages and Literature, Medicine, Chemistry, Physics, Mathematics and Informatics). As Toska (2015) stated, very little research has been conducted on academic writing in Albania. Thus, it is essential to initiate research in this field. Results of the quantitative analysis show that hedges were favored by Albanians and boosters were favored by Italians. The neutral position (in-between the clause complex, next to the verb, temporal or finite operator) of hedges and boosters was the most frequently encountered position in my corpora, followed by medial (in-between the clause complex, not next to verb, temporal or finite operator) and thematic position (at the beginning of the clause). Lastly, the same hedge (probably) and booster (significantly) appeared as author-related (shield) and proposition-related (approximator). This overlap between author-related and proposition-related categories demonstrates the importance of context in ranking these items and suggests relevant modifications to the original categorization by Prince et al. (1980). From my findings, I conclude that Italians show more commitment than Albanians, who appear more tentative in their writings. / Da Sprache (nach Halliday/Matthiesen 2014: 25) ein Mittel des persönlichen Ausdrucks und der fachlichen und gesellschaftlichen Verständigung ist, beinhalten unsere Texte unsere persönlichen und pädagogischen Erfahrungen, unseres alltäglichen Umfeldes und unserer Kultur. Die vorliegende Dissertationsschrift ist ein Beitrag aus der Perspektive der systemisch-funktionalen Grammatik und des wissenschaftlichen Schreibens. Sie befasst sich mit pragmatischen, semantischen und syntaktischen Aspekten von Hedges und Booster. 
Dafür wurden vier Korpora albanischer und italienischer wissenschaftlicher Arbeiten von Studenten in deren Muttersprache sowie ihrer Zweitsprache Englisch erstellt und untersucht. Die fünf Forschungsfragen beziehen sich größtenteils auf die qualitativen und quantitativen Unterschiede in der Verwendung von Hedges, Booster und deren semantischen und pragmatischen Unterkategorien in den albanischen und italienischen Korpora. Zudem wurde untersucht, welche Rolle Geschlecht und Genre bei der Verwendung von Hedges und Booster spielen. info:eu-repo/classification/ddc/420 ddc:420

Search results