Global ETD Search

161	Portable language technology: a resource-light approach to morpho-syntactic tagging / Feldman, Anna. January 2006 Thesis (Ph. D.)--Ohio State University, 2006. / Title from first page of PDF file. Includes bibliographical references (p. 258-273).
162	Does it have to be trees? : Data-driven dependency parsing with incomplete and noisy training data Spreyer, Kathrin January 2011 We present a novel approach to training data-driven dependency parsers on incomplete annotations. Our parsers are simple modifications of two well-known dependency parsers, the transition-based Malt parser and the graph-based MST parser. While previous work on parsing with incomplete data has typically couched the task in frameworks of unsupervised or semi-supervised machine learning, we essentially treat it as a supervised problem. In particular, we propose what we call agnostic parsers which hide all fragmentation in the training data from their supervised components. We present experimental results with training data that was obtained by means of annotation projection. Annotation projection is a resource-lean technique which allows us to transfer annotations from one language to another within a parallel corpus. However, the output tends to be noisy and incomplete due to cross-lingual non-parallelism and error-prone word alignments. This makes the projected annotations a suitable test bed for our fragment parsers. Our results show that (i) dependency parsers trained on large amounts of projected annotations achieve higher accuracy than the direct projections, and that (ii) our agnostic fragment parsers perform roughly on a par with the original parsers which are trained only on strictly filtered, complete trees. Finally, (iii) when our fragment parsers are trained on artificially fragmented but otherwise gold standard dependencies, the performance loss is moderate even with up to 50% of all edges removed. / Wir präsentieren eine neuartige Herangehensweise an das Trainieren von daten-gesteuerten Dependenzparsern auf unvollständigen Annotationen. Unsere Parser sind einfache Varianten von zwei bekannten Dependenzparsern, nämlich des transitions-basierten Malt-Parsers sowie des graph-basierten MST-Parsers. 
Während frühere Arbeiten zum Parsing mit unvollständigen Daten die Aufgabe meist in Frameworks für unüberwachtes oder schwach überwachtes maschinelles Lernen gebettet haben, behandeln wir sie im Wesentlichen mit überwachten Lernverfahren. Insbesondere schlagen wir "agnostische" Parser vor, die jegliche Fragmentierung der Trainingsdaten vor ihren daten-gesteuerten Lernkomponenten verbergen. Wir stellen Versuchsergebnisse mit Trainingsdaten vor, die mithilfe von Annotationsprojektion gewonnen wurden. Annotationsprojektion ist ein Verfahren, das es uns erlaubt, innerhalb eines Parallelkorpus Annotationen von einer Sprache auf eine andere zu übertragen. Bedingt durch begrenzten crosslingualen Parallelismus und fehleranfällige Wortalinierung ist die Ausgabe des Projektionsschrittes jedoch üblicherweise verrauscht und unvollständig. Gerade dies macht projizierte Annotationen zu einer angemessenen Testumgebung für unsere fragment-fähigen Parser. Unsere Ergebnisse belegen, dass (i) Dependenzparser, die auf großen Mengen von projizierten Annotationen trainiert wurden, größere Genauigkeit erzielen als die zugrundeliegenden direkten Projektionen, und dass (ii) die Genauigkeit unserer agnostischen, fragment-fähigen Parser der Genauigkeit der Originalparser (trainiert auf streng gefilterten, komplett projizierten Bäumen) annähernd gleichgestellt ist. Schließlich zeigen wir mit künstlich fragmentierten Gold-Standard-Daten, dass (iii) der Verlust an Genauigkeit selbst dann bescheiden bleibt, wenn bis zu 50% aller Kanten in den Trainingsdaten fehlen. Dependenzparsing partielle Annotationen schwach überwachte Lernverfahren Annotationsprojektion Parallelkorpora dependency parsing partial annotations weakly supervised learning techniques annotation projection parallel corpora Language, Linguistics
163	Marqueurs corrélatifs en français et en suédois : Étude sémantico-fonctionnelle de d’une part… d’autre part, d’un côté… de l’autre et de non seulement… mais en contraste / Correlative markers in French and Swedish : Semantic and functional study of d'une part... d'autre part, d'un côté... de l'autre and non seulement... mais in contrast Svensson, Maria January 2010 This thesis deals with the correlative markers d’une part… d’autre part, d’un côté… de l’autre and non seulement… mais in French and their Swedish counterparts dels… dels, å ena sidan… å andra sidan and inte bara… utan. These markers are composed of two separate parts that generally occur together, and they announce a series of at least two textual units to be considered together. The analyses of the use of these three French and three Swedish markers are based on two corpora of non-academic humanities texts. The first, principal corpus is composed only of original French and Swedish texts. The second, complementary corpus is composed of source texts in the two languages and their translations into the other language. Combining these two corpora makes the study comparative as well as contrastive. Through application of the Geneva model of discourse analysis and Rhetorical Structure Theory, a semantic and functional approach to correlative markers and their text-structural role is adopted. The study shows similarities as well as differences between the six markers, both within each language and between the languages. D’une part… d’autre part and dels… dels principally mark a conjunctive relation, whereas d’un côté… de l’autre and å ena sidan… å andra sidan are more often used in a contrastive relation, even though they can all be used for both kinds of relations. Non seulement… mais and inte bara… utan mark a conjunctive relation, but can also indicate that the second argument is stronger than the first one. 
By the use of these two markers, the language users also present the first one as given and the second one as new information. In general, the French correlative markers appear to have a more argumentative function, whereas the text-structural function is demonstrated to be the most important in Swedish. discourse markers text organisation French Swedish contrastive analysis parallel corpora Geneva model of discourse organization Rhetorical Structure Theory French language Franska språket
164	Élaboration d'un corpus étalon pour l'évaluation d'extracteurs de termes Bernier-Colborne, Gabriel 05 1900 Ce travail porte sur la construction d’un corpus étalon pour l’évaluation automatisée des extracteurs de termes. Ces programmes informatiques, conçus pour extraire automatiquement les termes contenus dans un corpus, sont utilisés dans différentes applications, telles que la terminographie, la traduction, la recherche d’information, l’indexation, etc. Ainsi, leur évaluation doit être faite en fonction d’une application précise. Une façon d’évaluer les extracteurs consiste à annoter toutes les occurrences des termes dans un corpus, ce qui nécessite un protocole de repérage et de découpage des unités terminologiques. À notre connaissance, il n’existe pas de corpus annoté bien documenté pour l’évaluation des extracteurs. Ce travail vise à construire un tel corpus et à décrire les problèmes qui doivent être abordés pour y parvenir. Le corpus étalon que nous proposons est un corpus entièrement annoté, construit en fonction d’une application précise, à savoir la compilation d’un dictionnaire spécialisé de la mécanique automobile. Ce corpus rend compte de la variété des réalisations des termes en contexte. Les termes sont sélectionnés en fonction de critères précis liés à l’application, ainsi qu’à certaines propriétés formelles, linguistiques et conceptuelles des termes et des variantes terminologiques. Pour évaluer un extracteur au moyen de ce corpus, il suffit d’extraire toutes les unités terminologiques du corpus et de comparer, au moyen de métriques, cette liste à la sortie de l’extracteur. On peut aussi créer une liste de référence sur mesure en extrayant des sous-ensembles de termes en fonction de différents critères. Ce travail permet une évaluation automatique des extracteurs qui tient compte du rôle de l’application. 
Cette évaluation étant reproductible, elle peut servir non seulement à mesurer la qualité d’un extracteur, mais à comparer différents extracteurs et à améliorer les techniques d’extraction. / We describe a methodology for constructing a gold standard for the automatic evaluation of term extractors. These programs, designed to automatically extract specialized terms from a corpus, are used in various settings, including terminology work, translation, information retrieval, indexing, etc. Thus, the evaluation of term extractors must be carried out in accordance with a specific application. One way of evaluating term extractors is to construct a corpus in which all term occurrences have been annotated. This involves establishing a protocol for term selection and term boundary identification. To our knowledge, no well-documented annotated corpus is available for the evaluation of term extractors. This contribution aims to build such a corpus and describe what issues must be dealt with in the process. The gold standard we propose is a fully annotated corpus, constructed in accordance with a specific terminological setting, namely the compilation of a specialized dictionary of automotive mechanics. This annotated corpus accounts for the wide variety of realizations of terms in context. Terms are selected in accordance with specific criteria pertaining to the terminological setting as well as formal, linguistic and conceptual properties of terms and term variations. To evaluate a term extractor, a list of all the terminological units in the corpus is extracted and compared to the output of the term extractor, using a set of metrics to assess its performance. Subsets of terminological units may also be extracted, providing a level of customization. This allows an automatic and application-driven evaluation of term extractors. 
Due to its reusability, it can serve not only to assess the performance of a particular extractor, but also to compare different extractors and fine-tune extraction techniques. Terminologie Terminology Terminologie computationnelle Computational terminology Extraction de termes Term acquisition Évaluation Evaluation Annotation de corpus Annotated corpora Variation terminologique Term variation
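The comparison step described in entry 164 (matching a term extractor's output against the reference list drawn from the annotated corpus) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's own tooling: the abstract does not name its metrics, so precision, recall and F1 are assumed here, matching is assumed to be exact on normalized strings, and the function name `evaluate_extractor` and the sample terms are hypothetical.

```python
def evaluate_extractor(reference_terms, extracted_terms):
    """Compare an extractor's output with a gold-standard term list.

    Assumption: a term counts as correct only if it matches a reference
    term exactly after lowercasing and trimming whitespace.
    """
    gold = {t.strip().lower() for t in reference_terms}
    found = {t.strip().lower() for t in extracted_terms}
    true_positives = gold & found

    # Guard against empty lists so the ratios never divide by zero.
    precision = len(true_positives) / len(found) if found else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference list and extractor output (automotive mechanics).
reference = ["brake pad", "camshaft", "fuel injector", "spark plug"]
output = ["camshaft", "spark plug", "engine", "Brake Pad", "road"]

scores = evaluate_extractor(reference, output)
print(scores)
```

A customized reference list, as the abstract mentions, would amount to passing a filtered subset of the reference terms as the first argument.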
165	Effective automatic speech recognition data collection for under-resourced languages / de Vries N.J. De Vries, Nicolaas Johannes January 2011 Building transcribed speech corpora for under-resourced languages plays a pivotal role in developing automatic speech recognition (ASR) technologies for such languages, so a key step in developing these technologies is the effective collection of ASR data, consisting of transcribed audio and associated metadata. The problem is that no suitable tool currently exists for effectively collecting ASR data for such languages. The specific context and requirements for effectively collecting ASR data for under-resourced languages render all currently known solutions unsuitable for such a task. Such requirements include portability, Internet independence and an open-source code base. This work documents the development of such a tool, called Woefzela, from the determination of the requirements necessary for effective data collection in this context, to the verification and validation of its functionality. The study demonstrates the effectiveness of using smartphones without any Internet connectivity for ASR data collection for under-resourced languages. It introduces a semi-real-time quality control philosophy which increases the amount of usable ASR data collected from speakers. Woefzela was developed for the Android Operating System, and is freely available for use on Android smartphones, with its source code also being made available. More than 790 hours of ASR data for the eleven official languages of South Africa have been successfully collected with Woefzela. As part of this study a benchmark for the performance of a new National Centre for Human Language Technology (NCHLT) English corpus was established. / Thesis (M.Ing. (Electrical Engineering))--North-West University, Potchefstroom Campus, 2012. 
Under-resourced languages New languages Speech resources ASR corpora Automatic speech recognition Developing world Speech data collection Spoken language resources Android NCHLT
167	Extracting Clinical Findings from Swedish Health Record Text Skeppstedt, Maria January 2014 Information contained in the free text of health records is useful for the immediate care of patients as well as for medical knowledge creation. Advances in clinical language processing have made it possible to automatically extract this information, but most research has, until recently, been conducted on clinical text written in English. In this thesis, however, information extraction from Swedish clinical corpora is explored, particularly focusing on the extraction of clinical findings. Unlike most previous studies, Clinical Finding was divided into the two more granular sub-categories Finding (symptom/result of a medical examination) and Disorder (condition with an underlying pathological process). For detecting clinical findings mentioned in Swedish health record text, a machine learning model, trained on a corpus of manually annotated text, achieved results in line with the obtained inter-annotator agreement figures. The machine learning approach clearly outperformed an approach based on vocabulary mapping, showing that Swedish medical vocabularies are not extensive enough for the purpose of high-quality information extraction from clinical text. A rule and cue vocabulary-based approach was, however, successful for negation and uncertainty classification of detected clinical findings. Methods for facilitating expansion of medical vocabulary resources are particularly important for Swedish and other languages with less extensive vocabulary resources. The possibility of using distributional semantics, in the form of Random indexing, for semi-automatic vocabulary expansion of medical vocabularies was, therefore, evaluated. Distributional semantics does not require that terms or abbreviations are explicitly defined in the text, and it is, thereby, a method suitable for clinical corpora. 
Random indexing was shown useful for extending vocabularies with medical terms, as well as for extracting medical synonyms and abbreviation dictionaries. Named entity recognition Corpora development Clinical text processing Distributional semantics Random indexing Vocabulary expansion Assertion classification Clinical text mining Electronic health records Swedish
168	Etude et modélisation d'un dialogue homme-machine récréatif ou ludique / System design for the management of entertaining man-computer dialogue Tabutiaux, Benoit 27 June 2014 Les développements issus de la recherche en gestion du dialogue homme-machine portent essentiellement sur le dialogue utilitaire et délaissent le dialogue à caractère ludique ou récréatif. Une description détaillée du contexte et la reconnaissance des buts ne suffisent pas à appréhender les enjeux de ce type de dialogue. La thèse vise à démontrer qu'un système s'appuyant sur une reconnaissance robuste et fine des actes de dialogue et des intentions associée à une prise en compte de l'altérité, de l'éthique et des émotions peut faire émerger une personnalité dialogique à même d'interagir de façon crédible avec un humain et de reproduire certaines performances dans le contexte d'un dialogue de séduction. L'objectif ne consiste pas à faire en sorte que le système puisse être confondu avec un humain comme cela est le cas dans les tests d'intelligence mais plutôt faire en sorte que le système puisse être dirigé par l'intérêt de la conversation. A cette fin, les recherches portent sur la définition de la relation d'altérité JE-TU appliquée au dialogue par l'intermédiaire de la théorie des jeux notamment à travers les concepts de définition de soi, d'éthique de la discussion et de modèles d'émotions. Plusieurs pistes sont explorées dans le but de réunir un corpus d'étude, notamment des prototypes de jeux collaboratifs. Au final, le modèle de personnage est développé sur la base d'un corpus de scripts de cinéma. Ce modèle repose sur une nouvelle taxonomie de phénomènes de dialogue incluant des actes perlocutoires et une approche différente de la connaissance permettant d'inclure l'éthique et le lien d'altérité en son sein. Les stratégies qui régissent le dialogue sont alors décrites de manière beaucoup plus précise. 
Une scène extraite d'un film servant de cadre applicatif aux expérimentations permet de valider l'architecture du système de dialogue. / Most developments arising from research in man-computer dialogue management focus on functional dialogue and set aside entertainment dialogue. A detailed description of the context and the recognition of goals are not sufficient to fully grasp the stakes of this type of dialogue. The thesis aims to demonstrate that a system based on fine-grained and robust recognition of dialogue acts and intentions, combined with consideration of otherness, ethics and emotions, could lead to the emergence of a dialogic personality. Such a system would be able to interact with a human being in a credible way and reproduce certain performances in the context of a seduction dialogue. Unlike the common approach developed for intelligence tests, the purpose is not to make the system indistinguishable from a human, but to manage the dialogue based on the notion of interest. To achieve this goal, the research deals with the relation of otherness applied to dialogue within game theory, especially through the concepts of self-definition, discourse ethics and emotion modeling. Several avenues are explored for collecting a corpus, including collaborative game prototypes. In the end, the character model is developed using a corpus of movie scripts, on the basis of a new taxonomy of dialogue phenomena that includes perlocutionary acts and a new approach to knowledge that incorporates ethics and the relation of otherness. This leads to a much more precise description of dialogue strategies. A scene extracted from a movie serves as the experimental framework for validating the architecture of the dialogue system. Dialogue homme-machine Phénomènes de dialogue Collecte de corpus Gestion du dialogue Dialogue récréatif Stratégie de dialogue Man-Computer Dialogue Dialogue phenomena Corpora collection Dialogue management Dialogue for entertainment Dialogue Strategy 004
169	Een multifactoriële studie over metaforiek in de financieel-economische pers Nicaise, Laurent 28 March 2012 Which factors determine the presence and choice of metaphors in the financial and economic press? Presentation of an explanatory model. Over the last 20 years, publications in cognitive semantics dealing with the relationship between metaphor and ideology in the financial press have multiplied. Thanks in particular to Boers (1997, 1999, 2000), Koller (2002) and Charteris-Black (2000), most of the rhetorical mechanisms that accompany metaphors are relatively well understood. However, the effect of ideology on metaphorical choices has so far not been demonstrated, let alone measured. The aim of this study is to develop an explanatory model of the factors influencing the presence and choice of metaphors in the financial press, in order to provide a reliable methodological and statistical instrument for critical discourse analysis. Such an analysis could also prove useful in translation and in the learning of specialized language in the economic domain. The theoretical framework is a modernized version of Conceptual Metaphor Theory. The approach is cognitive and onomasiological. The starting point is a set of elementary concepts of the financial world, selected on the basis of the results of a randomized sample of 10,000 words from 2 Belgian daily newspapers. The concepts are then grouped, on the basis of pragmatic and statistical criteria, into a set that reflects the composition of the world of finance and the stock market. For each realization of these concepts, it is decided whether or not it is a metaphor by applying the identification method proposed by the Pragglejaz Group (2007). 
If it is a metaphor, an attempt is then made to identify the source domain. The bilingual corpus covers a 12-month period in 2005 and comprises 450,000 words, spread across 6 Belgian publications: De Standaard, De Morgen, Trends Cash, La Libre Belgique, Le Soir and L’Investisseur. / Doctorate in Philosophy and Letters, Language and Literature track / info:eu-repo/semantics/nonPublished Sciences humaines Langues et littératures Journalism, Commercial Metaphor Linguistic analysis (Linguistics) Corpora (Linguistics) Presse économique Métaphore Analyse linguistique (Linguistique) Corpus (Linguistique) linguistique de corpus presse financière métaphore
170	Academic vocabulary and lexical bundles in the writing of undergraduate psychology students Cooper, Patricia Anne 06 1900 This thesis investigates the relationship which both academic vocabulary and lexical bundles have to academic performance at university. While academic vocabulary is defined in terms of the Academic Word List (Coxhead, 2000), lexical bundles are identified as groups of four words that commonly co-occur, such as "on the other hand" and "as a result of". A corpus of student essay writing in a single discipline, psychology, was developed over the course of a three-year undergraduate degree. To provide a benchmark against which to compare the student academic writing, a corpus of published articles in the same discipline was developed. The VocabProfile program (Cobb, 2002) was used to establish the density of academic vocabulary in the student essays. Similarly, the density of lexical bundle use was analysed by means of WordSmith Tools (Scott, 2012). The densities were then correlated against students’ academic performance as measured by their essay results. Comparisons were also made between the use of academic vocabulary and lexical bundles by first- and additional-language speakers, and by first- and third-year students. A keyness analysis enabled comparisons of academic vocabulary and bundle usage by high and low achievers. An additional aspect of this study was the comparison of densities of academic vocabulary and lexical bundles found in the IELTS writing test and in student essays, and the correlation of IELTS reading and writing test scores to students’ academic performance. The students’ vocabulary knowledge was also tested by the application of receptive and productive vocabulary tests, and the results compared to their academic performance. 
Results indicate that the 10 000-word level is a stronger predictor of academic performance than either the 5000-word level or academic vocabulary, and that there is a significant relationship between the density of lexical bundle use by students and their academic performance. Both vocabulary measures are therefore arguably better predictors of academic performance than the IELTS test scores. / Linguistics and Modern Languages / D. Litt. et Phil. (Linguistics) Corpus linguistics Academic vocabulary Lexical bundles IELTS Student academic writing Longitudinal study First language Additional language Academic performance Undergraduate students Psychology 428.20711 Corpora (Linguistics) Academic writing Discourse analysis
