  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1081

Automatic phoneme recognition of South African English

Engelbrecht, Herman Arnold 03 1900 (has links)
Thesis (MEng)--University of Stellenbosch, 2004. / ENGLISH ABSTRACT: Automatic speech recognition applications have been developed for many languages in other countries, but not much research has been conducted on developing Human Language Technology (HLT) for S.A. languages. Research has been performed on informally gathered speech data, but until now a speech corpus that could be used to develop HLT for S.A. languages did not exist. With the development of the African Speech Technology Speech Corpora, it has now become possible to develop commercial applications of HLT. The two main objectives of this work are the accurate modelling of phonemes, suitable for the purposes of LVCSR, and the evaluation of the untried S.A. English speech corpus. Three different aspects of phoneme modelling were investigated by performing isolated phoneme recognition on the NTIMIT speech corpus. The three aspects were signal processing, statistical modelling of HMM state distributions and context-dependent phoneme modelling. Research has shown that the use of phonetic context when modelling phonemes forms an integral part of most modern LVCSR systems. To facilitate the context-dependent phoneme modelling, a method of constructing robust and accurate models using decision tree-based state clustering techniques is described. The strength of this method is its ability to construct accurate models of contexts that did not occur in the training data. The method incorporates linguistic knowledge about the phonetic context, in conjunction with the training data, to decide which phoneme contexts are similar and should share model parameters. As LVCSR typically consists of continuous recognition of spoken words, the context-dependent and context-independent phoneme models that were created for the isolated recognition experiments are evaluated by performing continuous phoneme recognition. The phoneme recognition experiments are performed, without the aid of a grammar or language model, on the S.A. English corpus. As the S.A. English corpus is newly created, no previous research exists to which the continuous recognition results can be compared. Therefore, it was necessary to create comparable baseline results by performing continuous phoneme recognition on the NTIMIT corpus. It was found that acceptable recognition accuracy was obtained on both the NTIMIT and S.A. English corpora. Furthermore, the results on S.A. English were 2–6% better than the results on NTIMIT, indicating that the S.A. English corpus is of high enough quality to be used for the development of HLT. / AFRIKAANSE OPSOMMING: Automatiese spraak-herkenning is al ontwikkel vir ander tale in ander lande, maar daar is nog nie baie navorsing gedoen om menslike taal-tegnologie (HLT) te ontwikkel vir Suid-Afrikaanse tale nie. Daar is al navorsing gedoen op spraak wat informeel versamel is, maar tot nou toe was daar nie 'n spraak-databasis wat vir die ontwikkeling van HLT vir S.A. tale gebruik kon word nie. Met die ontwikkeling van die African Speech Technology Speech Corpora het dit moontlik geword om HLT te ontwikkel wat geskik is vir kommersiële doeleindes. Die twee hoofdoele van hierdie tesis is die akkurate modellering van foneme, geskik vir groot-woordeskat kontinue spraak-herkenning (LVCSR), asook die evaluasie van die S.A. Engels spraak-databasis. Drie aspekte van foneem-modellering word ondersoek deur geïsoleerde foneem-herkenning te doen op die NTIMIT spraak-databasis.
Die drie aspekte wat ondersoek word, is seinprosessering, statistiese modellering van die HMM-toestandsdistribusies, en konteksafhanklike foneem-modellering. Navorsing het getoon dat die gebruik van fonetiese konteks 'n integrale deel vorm van meeste moderne LVCSR-stelsels. Dit is dus nodig om robuuste en akkurate konteks-afhanklike modelle te kan bou. Hiervoor word 'n besluitnemingsboom-gebaseerde trosvormingstegniek beskryf. Die tegniek is ook in staat om akkurate modelle te bou van kontekste wat nie voorgekom het in die afrigdata nie. Om te besluit watter fonetiese kontekste soortgelyk is en dus modelparameters moet deel, maak die tegniek gebruik van die afrigdata en inkorporeer dit taalkundige kennis oor die fonetiese kontekste. Omdat LVCSR tipies uit die kontinue herkenning van woorde bestaan, word die konteksafhanklike en konteks-onafhanklike modelle, wat gebou is vir die geïsoleerde foneem-herkenningseksperimente, geëvalueer d.m.v. kontinue foneem-herkenning. Die kontinue foneemherkenningseksperimente word gedoen op die S.A. Engels databasis, sonder die hulp van 'n taalmodel of grammatika. Omdat die S.A. Engels databasis nuut is, is daar nog geen ander navorsing waarteen die resultate vergelyk kan word nie. Dit is dus nodig om kontinue foneem-herkenningsresultate op die NTIMIT databasis te genereer, waarteen die S.A. Engels resultate vergelyk kan word. Die resultate dui op aanvaarbare foneemherkenning op beide die NTIMIT en S.A. Engels databasisse. Die resultate op S.A. Engels is selfs 2–6% beter as die resultate op NTIMIT, wat daarop dui dat die S.A. Engels spraak-databasis geskik is vir die ontwikkeling van HLT.
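A minimal Python sketch of the decision-tree-based state clustering described in the abstract above, assuming one-dimensional single-Gaussian sufficient statistics per triphone context and a hand-picked set of linguistic questions; the contexts, questions and thresholds are illustrative, not the thesis's actual setup:

```python
import math
from collections import namedtuple

# Sufficient statistics of the frames assigned to one triphone context:
# n = frame count, s = sum of frame values, ss = sum of squared values (1-D toy case).
Ctx = namedtuple("Ctx", "left phone right n s ss")

def log_likelihood(stats):
    """Log-likelihood of the pooled frames under a single Gaussian."""
    n = sum(c.n for c in stats)
    s = sum(c.s for c in stats)
    ss = sum(c.ss for c in stats)
    mean = s / n
    var = max(ss / n - mean * mean, 1e-6)            # variance floor
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def best_split(stats, questions):
    """Pick the linguistic question with the largest log-likelihood gain."""
    parent = log_likelihood(stats)
    best = None
    for name, members in questions:                  # e.g. ("L-nasal?", {"m", "n"})
        yes = [c for c in stats if c.left in members]
        no = [c for c in stats if c.left not in members]
        if not yes or not no:
            continue
        gain = log_likelihood(yes) + log_likelihood(no) - parent
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    return best

def cluster(stats, questions, min_gain=5.0):
    """Greedy top-down clustering: keep splitting while the gain exceeds a threshold."""
    split = best_split(stats, questions)
    if split is None or split[0] < min_gain:
        return [stats]                               # one tied (shared) state
    _, _, yes, no = split
    return cluster(yes, questions, min_gain) + cluster(no, questions, min_gain)

# Toy contexts of the centre phone "ah" with different left contexts.
contexts = [Ctx("m", "ah", "t", 40, 12.0, 9.0), Ctx("n", "ah", "t", 35, 11.0, 8.5),
            Ctx("s", "ah", "t", 50, -20.0, 14.0), Ctx("f", "ah", "t", 45, -18.0, 12.0)]
questions = [("L-nasal?", {"m", "n"}), ("L-fricative?", {"s", "f"})]
print(len(cluster(contexts, questions)), "tied states")
```

Because the splits are defined by linguistic questions rather than by the observed contexts themselves, an unseen triphone can still be routed down the tree to a tied state, which is the property the abstract highlights.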
1082

Unsupervised clustering of audio data for acoustic modelling in automatic speech recognition systems

Goussard, George Willem 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2011. / ENGLISH ABSTRACT: This thesis presents a system that is designed to replace the manual process of generating a pronunciation dictionary for use in automatic speech recognition. The proposed system has several stages. The first stage segments the audio into what will be known as the subword units, using a frequency domain method. In the second stage, dynamic time warping is used to determine the similarity between the segments of each possible pair of these acoustic segments. These similarities are used to cluster similar acoustic segments into acoustic clusters. The final stage derives a pronunciation dictionary from the orthography of the training data and corresponding sequence of acoustic clusters. This process begins with an initial mapping between words and their sequence of clusters, established by Viterbi alignment with the orthographic transcription. The dictionary is refined iteratively by pruning redundant mappings, hidden Markov model estimation and Viterbi re-alignment in each iteration. This approach is evaluated experimentally by applying it to two subsets of the TIMIT corpus. It is found that, when test words are repeated often in the training material, the approach leads to a system whose accuracy is almost as good as one trained using the phonetic transcriptions. When test words are not repeated often in the training set, the proposed approach leads to better results than those achieved using the phonetic transcriptions, although the recognition is poor overall in this case. / AFRIKAANSE OPSOMMING: Die doelwit van die tesis is om ’n stelsel te beskryf wat ontwerp is om die handgedrewe proses in die samestelling van ’n woordeboek, vir die gebruik in outomatiese spraakherkenningsstelsels, te vervang. Die voorgestelde stelsel bestaan uit ’n aantal stappe. Die eerste stap is die segmentering van die oudio in sogenaamde sub-woord eenhede deur gebruik te maak van ’n frekwensie gebied tegniek. Met die tweede stap word die dinamiese tydverplasingsalgoritme ingespan om die ooreenkoms tussen die segmente van elkeen van die moontlike pare van die akoestiese segmente bepaal. Die ooreenkomste word dan gebruik om die akoestiese segmente te groepeer in akoestiese groepe. Die laaste stap stel die woordeboek saam deur gebruik te maak van die ortografiese transkripsie van afrigtingsdata en die ooreenstemmende reeks akoestiese groepe. Die finale stap begin met ’n aanvanklike afbeelding vanaf woorde tot hul reeks groep identifiseerders, bewerkstellig deur Viterbi belyning en die ortografiese transkripsie. Die woordeboek word iteratief verfyn deur oortollige afbeeldings te snoei, verskuilde Markov modelle af te rig en deur Viterbi belyning te gebruik in elke iterasie. Die benadering is getoets deur dit eksperimenteel te evalueer op twee subversamelings data vanuit die TIMIT korpus. Daar is bevind dat, wanneer woorde herhaal word in die afrigtingsdata, die stelsel se benadering die akkuraatheid ewenaar van ’n stelsel wat met die fonetiese transkripsie afgerig is. As die woorde nie herhaal word in die afrigtingsdata nie, is die akkuraatheid van die stelsel se benadering beter as wanneer die stelsel afgerig word met die fonetiese transkripsie, alhoewel die akkuraatheid in die algemeen swak is.
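A minimal sketch of the dynamic time warping step used in the second stage described above, assuming each acoustic segment is already available as a sequence of feature vectors; the frame values are illustrative:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])      # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],          # insertion
                                 cost[i, j - 1],          # deletion
                                 cost[i - 1, j - 1])      # match
    return cost[n, m] / (n + m)                           # length-normalised

# Toy "segments": short sequences of 2-D feature vectors.
seg1 = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
seg2 = np.array([[0.1, 0.25], [0.45, 0.55]])
seg3 = np.array([[2.0, 2.0], [2.1, 2.2]])

print(dtw_distance(seg1, seg2))   # small: similar segments, likely the same cluster
print(dtw_distance(seg1, seg3))   # large: dissimilar segments
```

Pairwise distances of this kind can then feed a standard clustering step to group similar segments into the acoustic clusters from which the pronunciation dictionary is derived.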
1083

Formação de gentílicos a partir de topônimos : proposta de geração automática / Formation of gentilics from toponyms : a proposal for automatic generation

Antunes, Roger Alfredo de Marci Rodrigues 17 February 2017 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / It is a common habit to use an adjective derived from a city name to indicate where people come from, yet the rules for forming such adjectives are rarely discussed in the literature. The main objective of this work is to describe gentilic adjectives (demonyms), which originate from place names (toponyms). By using specific morphological combination rules and proposing a formal representation of their regularities, we lay the basis for a computational system that can automatically generate gentilics from their toponyms. The system proposed here is founded on the methodological principles of Dias-da-Silva (1996) - with respect to the three-phase methodology of Natural Language Processing (NLP) - and on the theoretical assumptions of Borba (1998), Biderman (2001), Dick (2007), Jurafsky (2009) and Sandmann (1992, 1997). The corpus consists of the names (toponyms) of 5,570 municipalities and their respective gentilics, extracted as a list from the database of the Instituto Brasileiro de Geografia e Estatística (IBGE). It was observed that, from a small set of recurrent units such as suffixes and the final segments of lexical items, it is possible to extract patterns which can subsequently be used to formulate combination rules for automatic word processing. Throughout this work, the issue of computational representation stands out and illustrates the complexity of natural languages: although they can in principle be processed automatically, their inherent features may deviate from the formulated rules and make processing more intricate. Nonetheless, the results show that it is possible to automate the generation of gentilics for 52% of the municipal toponyms. In conclusion, the inherent opacity of Portuguese does not allow direct processing of all of the language's toponyms. / Utilizam-se diariamente nomes de cidades e adjetivos que indicam as pessoas que nasceram ou vivem nessas cidades, mas raramente se reflete sobre as regras de formação dessas palavras. O presente trabalho tem como objetivo descrever os adjetivos pátrios, ou gentílicos, que advêm dos nomes dos lugares - topônimos -, por meio de regras de combinação morfológicas específicas e propor a representação formal das suas regularidades com intuito de servir de base para um sistema computacional capaz de gerar automaticamente os gentílicos a partir dos seus topônimos. Tomou-se como orientação os princípios metodológicos de Dias-da-Silva (1996) - no que concerne à metodologia trifásica do PLN -, e os pressupostos teóricos nos trabalhos de Borba (1998), Biderman (2001), Dick (2007), Jurafsky (2009) e Sandmann (1992, 1997).
O corpus da pesquisa consiste na lista dos topônimos de 5.570 municípios e seus respectivos gentílicos, extraídos do banco de dados do Instituto Brasileiro de Geografia e Estatística (IBGE). Com esta pesquisa, foi possível observar que somente a partir das menores unidades recorrentes, como os sufixos e as extremidades finais das unidades léxicas, podem-se extrair padrões para a formulação de regras de combinação para um processamento automático. Além disso, a problemática da representação computacional evidencia a complexidade das línguas naturais, que embora sejam passíveis de processamento automático, são opacas e, desta maneira, sempre haverá questões inerentes a elas que dificultam essa tarefa. Ainda assim, os resultados mostraram que é possível automatizar a geração de gentílicos a partir de topônimos em 52% do total, o que já é um número razoável, considerando a opacidade inerente à língua natural mencionada.
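A minimal sketch of the kind of suffix-based combination rules the abstract describes, assuming a tiny, purely illustrative rule set; the thesis's actual rules, and the many irregular gentilics of Portuguese, are not reproduced here:

```python
import unicodedata

# Illustrative suffix rules: (normalised toponym ending -> gentilic ending).
# Real Brazilian gentilics are often irregular, so rules like these cover only part of the list.
RULES = [
    ("landia", "landense"),   # Uberlândia     -> uberlandense
    ("polis",  "politano"),   # Florianópolis  -> florianopolitano
    ("a",      "ense"),       # Sorocaba       -> sorocabense
    ("o",      "ense"),       # fallback for -o endings, frequently wrong
]

def strip_accents(text):
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")

def gentilic(toponym):
    """Generate a candidate gentilic for a toponym using the first matching rule."""
    base = strip_accents(toponym.lower()).replace(" ", "")   # crude normalisation
    for ending, replacement in RULES:
        if base.endswith(ending):
            return base[: len(base) - len(ending)] + replacement
    return base + "ense"                                     # default: append -ense

for city in ["Sorocaba", "Uberlândia", "Florianópolis"]:
    print(city, "->", gentilic(city))
```

The 52% automation figure reported above reflects exactly this tension: regular endings can be handled by rules of this kind, while opaque or idiosyncratic gentilics still require a lookup list.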
1084

Incremental generative models for syntactic and semantic natural language processing

Buys, Jan Moolman January 2017 (has links)
This thesis investigates the role of linguistically-motivated generative models of syntax and semantic structure in natural language processing (NLP). Syntactic well-formedness is crucial in language generation, but most statistical models do not account for the hierarchical structure of sentences. Many applications exhibiting natural language understanding rely on structured semantic representations to enable querying, inference and reasoning. Yet most semantic parsers produce domain-specific or inadequately expressive representations. We propose a series of generative transition-based models for dependency syntax which can be applied as both parsers and language models while being amenable to supervised or unsupervised learning. Two models are based on Markov assumptions commonly made in NLP: The first is a Bayesian model with hierarchical smoothing, the second is parameterised by feed-forward neural networks. The Bayesian model enables careful analysis of the structure of the conditioning contexts required for generative parsers, but the neural network is more accurate. As a language model the syntactic neural model outperforms both the Bayesian model and n-gram neural networks, pointing to the complementary nature of distributed and structured representations for syntactic prediction. We propose approximate inference methods based on particle filtering. The third model is parameterised by recurrent neural networks (RNNs), dropping the Markov assumptions. Exact inference with dynamic programming is made tractable here by simplifying the structure of the conditioning contexts. We then shift the focus to semantics and propose models for parsing sentences to labelled semantic graphs. We introduce a transition-based parser which incrementally predicts graph nodes (predicates) and edges (arguments). This approach is contrasted against predicting top-down graph traversals. RNNs and pointer networks are key components in approaching graph parsing as an incremental prediction problem. The RNN architecture is augmented to condition the model explicitly on the transition system configuration. We develop a robust parser for Minimal Recursion Semantics, a linguistically-expressive framework for compositional semantics which has previously been parsed only with grammar-based approaches. Our parser is much faster than the grammar-based model, while the same approach improves the accuracy of neural Abstract Meaning Representation parsing.
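A minimal sketch of a generative, arc-standard-style transition system in the spirit of the models described above: the shift action generates the next word, so the same scoring loop can be used for parsing and for language modelling. The probability tables are placeholders, not the thesis's Bayesian or neural parameterisations:

```python
import math

# Toy conditional distributions; a real model conditions these on the parser
# configuration via hierarchical Bayesian smoothing or a neural network.
P_ACTION = {"shift": 0.5, "left_arc": 0.25, "right_arc": 0.25}
P_WORD = {"the": 0.3, "dog": 0.2, "barks": 0.2, "cat": 0.3}   # P(word | shift, context)

def joint_log_prob(actions):
    """Log P(sentence, tree) for a generative transition sequence.

    Each item is ("shift", word) or ("left_arc", None) / ("right_arc", None).
    """
    stack, generated, logp = [], [], 0.0
    for action, word in actions:
        logp += math.log(P_ACTION[action])
        if action == "shift":                       # generation step: emit a word
            logp += math.log(P_WORD.get(word, 1e-6))
            stack.append(word)
            generated.append(word)
        elif action == "left_arc":                  # stack[-2] becomes a dependent of stack[-1]
            stack.pop(-2)
        elif action == "right_arc":                 # stack[-1] becomes a dependent of stack[-2]
            stack.pop()
    return logp, generated

# One derivation of "the dog barks": shift the, shift dog, left_arc (the <- dog),
# shift barks, left_arc (dog <- barks).
derivation = [("shift", "the"), ("shift", "dog"), ("left_arc", None),
              ("shift", "barks"), ("left_arc", None)]
logp, words = joint_log_prob(derivation)
print(words, logp)   # summing over all derivations of a sentence gives its LM probability
```

The sum over derivations is what makes exact language-model inference expensive, which is why the abstract mentions particle filtering and dynamic programming over simplified conditioning contexts.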
1085

Intelligence Socio-Affective pour un Robot : primitives langagières pour une interaction évolutive d'un robot de l’habitat intelligent / Intelligence from Socio-Affects of Robot : Dialog Primitives for a Scalable Interaction with a Smart Home Robot

Sasa, Yuko 26 January 2018 (has links)
Le Traitement Automatique de la Parole (TAP) s’intéresse de plus en plus et progresse techniquement en matière d’étendue de vocabulaire, de gestion de complexité morphosyntaxique, de style et d’esthétique de la parole humaine. L’Affective Computing tend également à intégrer une dimension « émotionnelle » dans un objectif commun au TAP visant à désambiguïser le langage naturel et augmenter la naturalité de l’interaction personne-machine. Dans le cadre de la robotique sociale, cette interaction est modélisée dans des systèmes d’interaction, de dialogue, qui tendent à engendrer une dimension d’attachement dont les effets doivent être éthiquement et collectivement contrôlés. Or la dynamique du langage humain situé met à mal l’efficacité des systèmes automatiques. L’hypothèse de cette thèse propose dans la dynamique des interactions, il existerait une « glu socio-affective » qui ferait entrer en phases synchroniques deux individus dotés chacun d’un rôle social impliqué dans une situation/contexte d’interaction. Cette thèse s'intéresse à des dynamiques interactionnelles impliquant spécifiquement des processus altruistes, orthogonale à la dimension de dominance. Cette glu permettrait ainsi de véhiculer les événements langagiers entre les interlocuteurs, en modifiant constamment leur relation et leur rôle, qui eux même viennent à modifier cette glu, afin d’assurer la continuité de la communication. La seconde hypothèse propose que la glu socio-affective se construise à partir d’une « prosodie socio-affective pure » que l’on peut retrouver dans certaines formes de micro-expressions vocales. L’effet de ces événements langagiers serait alors graduel en fonction du degré de contrôle d’intentionnalité communicative qui s’observerait successivement par des primitives langagières : 1) des bruits de bouche (non phonétiques, non phonologiques), 2) des sons prélexicaux, 3) des interjections/onomatopées, 4) des imitations à contenu lexical contrôlé. Une méthodologie living-lab est ainsi développée au sein de la plateforme Domus, sur des boucles agiles et itératives co-construites avec les partenaires industriels et sociétaux. Un Magicien d’Oz – EmOz – est utilisé afin de contrôler les primitives vocales comme unique support langagier d’un robot majordome d’un habitat intelligent interagissant avec des personnes âgées en isolement relationnel. Un large corpus, EmOz Elderly Expressions –EEE– est ainsi recueilli. Cet isolement relationnel permet méthodologiquement d’appréhender les dimensions de la glu socio-affective, en introduisant une situation contrastive dégradée de la glu. Les effets des primitives permettraient alors d’observer les comportements de l’humain à travers des indices multimodaux. Les enjeux sociétaux abordés par la gérontechnologie montrent que l’isolement est un facteur de fragilisation où la qualité de la communication délite le maillage relationnel des personnes âgées alors que ces liens sont bénéfiques à sa santé et son bien-être. L’émergence de la robotique d’assistance en est une illustration. Le système automatisé qui découlera des données et des analyses de cette étude permettrait alors d’entraîner les personnes à solliciter pleinement leurs mécanismes de construction relationnelle, afin de redonner l’envie de communiquer avec leur entourage humain. Les analyses du corpus EEE recueilli montrent une évolution de la relation à travers différents indices interactionnels, temporellement organisés. 
Ces paramètres visent à être intégrés dans une perspective de système de dialogue incrémental – SASI. Les prémisses de ce système sont proposées dans un prototype de reconnaissance de la parole dont la robustesse ne dépendra pas de l’exactitude du contenu langagier reconnu, mais sur la reconnaissance du degré de glu, soit de l’état relationnel entre les locuteurs. Ainsi, les erreurs de reconnaissance tendraient à être compensées par l’intelligence socio-affective adaptative de ce système dont pourrait être doté le robot. / Natural Language Processing (NLP) has improved technically in terms of vocabulary coverage, morphosyntactic scope, and the style and aesthetics of human speech. Affective Computing likewise tends to integrate an "emotional" dimension, sharing with NLP the goal of disambiguating natural language and increasing the naturalness of human-machine interaction. Within social robotics, this interaction is modelled in dialogue systems that tend to create a dimension of attachment whose effects must be ethically and collectively controlled. However, the dynamics of situated natural language undermine the efficiency of automated systems that try to respond with useful and suitable feedback. The hypothesis of this thesis is that in every interaction there exists a "socio-affective glue", set up between two individuals, each with a social role that depends on the communication context. This glue is the consequence of dynamics generated by a process whose mechanisms rely on an altruistic dimension, independent of the dominance dimension studied in emotion research. It would allow language events to be exchanged between interlocutors by constantly modifying their relation and their roles, which in turn modify the glue, thereby ensuring the continuity of communication. The second hypothesis proposes that the glue is built from forms of "pure socio-affective prosody" that enable this relational construction; these cues are assumed to be carried by audible and visible micro-expressions. The effect of these language events would be gradual, following the degree of control over communicative intentionality. This gradation runs through a series of language primitives: 1) mouth noises (neither phonetic nor phonological sounds), 2) pre-lexicalised sounds, 3) interjections and onomatopoeias, 4) controlled command-based imitations carrying the same socio-affective prosody assumed to create and modify the glue. Within the Domus platform we developed a living-lab methodology that runs on agile, iterative loops co-constructed with industrial and societal partners. A Wizard of Oz system – EmOz – is used to control these vocal primitives as the only language tools of a smart-home butler robot interacting with relationally isolated elderly people. This relational isolation methodologically exposes the dimensions of the socio-affective glue by providing a contrastive situation in which the glue is degraded, so the effects of the primitives can be observed through multimodal language cues. The societal issues addressed by gerontechnology show that isolation amplifies frailty, as the emergence of assistive robotics attests: a vicious circle driven by the communicational characteristics of the elderly makes it difficult for them to maintain their relational network, even though these bonds are beneficial to their health and well-being.
If the proposed primitives have a real effect on the glue, the automated system derived from this study could train people to fully engage the mechanisms underlying their relational construction, and so possibly renew their desire to communicate with the people around them. The results from the collected EEE corpus show that the relation evolves through various, temporally organised interactional cues. These parameters are intended to feed a future incremental dialogue system – SASI. The first step towards this system is a speech recognition prototype whose robustness rests not on the accuracy of the recognised language content but on the ability to identify the degree of glue, i.e. the relational state between the interlocutors. Recognition errors would thus be compensated by the adaptive socio-affective intelligence with which the robot could be equipped, preventing the system from being rejected by the user.
1086

L'analyse de la complexité du discours et du texte pour apprendre et collaborer / Analysing discourse and text complexity for learning and collaborating

Dascalu, Mihai 04 June 2013 (has links)
L’apprentissage collaboratif assisté par ordinateur et les technologies d’e-learning devenant de plus en plus populaires et intégrés dans des contextes éducatifs, le besoin se fait sentir de disposer d’outils d’évaluation automatique et d’aide aux enseignants ou tuteurs pour les deux activités, fortement couplées, de compréhension de textes et collaboration entre pairs. Bien qu’une analyse de surface de ces activités est aisément réalisable, une compréhension plus profonde et complète du discours en jeu est nécessaire, complétée par une analyse de l’information méta-cognitive disponible par diverses sources, comme par exemples les auto-explications des apprenants. Dans ce contexte, nous utilisons un modèle dialogique issu des travaux de Bakhtine pour analyser les conversations collaboratives, et une approche théorique visant à unifier les activités de compréhension et de collaboration dans un même cadre, utilisant la construction de graphes de cohésion. Plus spécifiquement, nous nous sommes centrés sur la dimension individuelle de l’apprentissage, analysée à partir de l’identification de stratégies de lecture et sur la mise au jour d’un modèle de la complexité textuelle intégrant des facteurs de surface, lexicaux, morphologiques, syntaxiques et sémantiques. En complément, la dimension collaborative de l’apprentissage est centrée sur l’évaluation de l’implication des participants, ainsi que sur l’évaluation de leur collaboration par deux modèles computationnels: un modèle polyphonique, défini comme l’inter-animation de voix selon de multiples perspectives, un modèle spécifique de construction sociale de connaissances, fondé sur le graphe de cohésion et un mécanisme d’évaluation des tours de parole. Notre approche met en œuvre des techniques avancées de traitement automatique de la langue et a pour but de formaliser une évaluation qualitative du processus d’apprentissage. Ainsi, deux perspectives fortement interreliées sont prises en considération : d’une part, la compréhension, centrée sur la construction de connaissances et les auto-explications à partir desquelles les stratégies de lecture sont identifiées ; d’autre part la collaboration, qui peut être définie comme l’implication sociale, la génération d’idées ou de voix en interanimation dans un contexte donné. Des validations cognitives de nos différents systèmes d’évaluation automatique ont été réalisées, et nous avons conçu des scénarios d’utilisation de ReaderBench, notre système le plus avancé, dans différents contextes d’enseignement. L’un des buts principaux de notre modèle est de favoriser la compréhension vue en tant que « médiatrice de l’apprentissage », en procurant des rétroactions automatiques aux apprenants et enseignants ou tuteurs. Leur avantage est triple: leur flexibilité, leur extensibilité et, cependant, leur spécificité, car ils couvrent de multiples étapes de l’activité d’apprentissage, de la lecture de matériel d’apprentissage à l’écriture de synthèses de cours en passant par la discussion collaborative de contenus de cours et la verbalisation métacognitive de jugements de compréhension, afin d’obtenir une perspective complète du niveau de compréhension et de générer des rétroactions appropriées sur le processus d’apprentissage collaboratif. 
/ With the advent and increasing popularity of Computer Supported Collaborative Learning (CSCL) and e-learning technologies, the need of automatic assessment and of teacher/tutor support for the two tightly intertwined activities of comprehension of reading materials and of collaboration among peers has grown significantly. Whereas a shallow or surface analysis is easily achievable, a deeper understanding of the discourse is required, extended by meta-cognitive information available from multiple sources as self-explanations. In this context, we use a polyphonic model of discourse derived from Bakhtin’s work as a paradigm for analyzing CSCL conversations, as well as cohesion graph building designed for creating an underlying discourse structure. This enables us to address both general texts and conversations and to incorporate comprehension and collaboration specific activities in a unique framework. As specificity of the analysis, in terms of individual learning we have focused on the identification of reading strategies and on providing a multi-dimensional textual complexity model integrating surface, word specific, morphology, syntax and semantic factors. Complementarily, the collaborative learning dimension is centered on the evaluation of participants’ involvement, as well as on collaboration assessment through the use of two computational models: a polyphonic model, defined in terms of voice inter-animation, and a specific social knowledge-building model, derived from the specially designed cohesion graph corroborated with a proposed utterance scoring mechanism. Our approach integrates advanced Natural Language Processing techniques and is focused on providing a qualitative estimation of the learning process. Therefore, two tightly coupled perspectives are taken into consideration: comprehension on one hand is centered on knowledge-building, self-explanations from which multiple reading strategies can be identified, whereas collaboration, on the other, can be seen as social involvement, ideas or voices generation, intertwining and inter-animation in a given context. Various cognitive validations for all our automated evaluation systems have been conducted and scenarios including the use of ReaderBench, our most advanced system, in different educational contexts have been built. One of the most important goals of our model is to enhance understanding as a “mediator of learning” by providing automated feedback to both learners and teachers or tutors. The main benefits are its flexibility, extensibility and nevertheless specificity for covering multiple stages, starting from reading classroom materials, to discussing on specific topics in a collaborative manner, and finishing the feedback loop by verbalizing metacognitive thoughts in order to obtain a clear perspective over one’s comprehension level and appropriate feedback about the collaborative learning processes.
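A minimal sketch of the cohesion-graph construction underlying the framework above, assuming sentence-to-sentence similarity is approximated by tf-idf cosine; ReaderBench itself combines several semantic models, which are not reproduced here, and the sentences and threshold are illustrative:

```python
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cohesion_graph(sentences, threshold=0.15):
    """Build a cohesion graph: nodes are sentences, weighted edges link cohesive pairs."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    edges = []
    for i, j in combinations(range(len(sentences)), 2):
        if sim[i, j] >= threshold:
            edges.append((i, j, float(sim[i, j])))
    return edges

sentences = [
    "The learner reads the text and produces a self-explanation.",
    "Self-explanations reveal the reading strategies of the learner.",
    "The weather in Grenoble was sunny that day.",
]
for i, j, w in cohesion_graph(sentences):
    print(f"sentence {i} -- sentence {j}: cohesion {w:.2f}")
```

The same graph can then support both sides of the framework: cohesion between a self-explanation and the source text for comprehension analysis, and cohesion between utterances for scoring contributions in a collaborative chat.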
1087

Data-driven language understanding for spoken dialogue systems

Mrkšić, Nikola January 2018 (has links)
Spoken dialogue systems provide a natural conversational interface to computer applications. In recent years, the substantial improvements in the performance of speech recognition engines have helped shift the research focus to the next component of the dialogue system pipeline: the one in charge of language understanding. The role of this module is to translate user inputs into accurate representations of the user goal in the form that can be used by the system to interact with the underlying application. The challenges include the modelling of linguistic variation, speech recognition errors and the effects of dialogue context. Recently, the focus of language understanding research has moved to making use of word embeddings induced from large textual corpora using unsupervised methods. The work presented in this thesis demonstrates how these methods can be adapted to overcome the limitations of language understanding pipelines currently used in spoken dialogue systems. The thesis starts with a discussion of the pros and cons of language understanding models used in modern dialogue systems. Most models in use today are based on the delexicalisation paradigm, where exact string matching supplemented by a list of domain-specific rephrasings is used to recognise users' intents and update the system's internal belief state. This is followed by an attempt to use pretrained word vector collections to automatically induce domain-specific semantic lexicons, which are typically hand-crafted to handle lexical variation and account for a plethora of system failure modes. The results highlight the deficiencies of distributional word vectors which must be overcome to make them useful for downstream language understanding models. The thesis next shifts focus to overcoming the language understanding models' dependency on semantic lexicons. To achieve that, the proposed Neural Belief Tracking (NBT) model forsakes the use of standard one-hot n-gram representations used in Natural Language Processing in favour of distributed representations of user utterances, dialogue context and domain ontologies. The NBT model makes use of external lexical knowledge embedded in semantically specialised word vectors, obviating the need for domain-specific semantic lexicons. Subsequent work focuses on semantic specialisation, presenting an efficient method for injecting external lexical knowledge into word vector spaces. The proposed Attract-Repel algorithm boosts the semantic content of existing word vectors while simultaneously inducing high-quality cross-lingual word vector spaces. Finally, NBT models powered by specialised cross-lingual word vectors are used to train multilingual belief tracking models. These models operate across many languages at once, providing an efficient method for bootstrapping language understanding models for lower-resource languages with limited training data.
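A simplified numpy sketch of the semantic specialisation idea behind Attract-Repel: synonym pairs are pulled together, antonym pairs are pushed apart, and a regularisation term keeps each vector close to its original distributional estimate. The real algorithm uses margin-based costs with negative sampling over mini-batches, which is omitted here; vectors, pairs and hyperparameters are illustrative:

```python
import numpy as np

def specialise(vectors, synonyms, antonyms, lr=0.05, reg=0.1, epochs=50):
    """Nudge word vectors using lexical constraints (simplified Attract-Repel-style update)."""
    original = {w: v.copy() for w, v in vectors.items()}
    for _ in range(epochs):
        for a, b in synonyms:                       # attract: move the pair closer
            diff = vectors[a] - vectors[b]
            vectors[a] -= lr * diff
            vectors[b] += lr * diff
        for a, b in antonyms:                       # repel: push the pair apart
            diff = vectors[a] - vectors[b]
            step = diff / (np.linalg.norm(diff) + 1e-8)
            vectors[a] += lr * step
            vectors[b] -= lr * step
        for w in vectors:                           # stay close to the original vector
            vectors[w] -= reg * lr * (vectors[w] - original[w])
    return vectors

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
vecs = {w: rng.normal(size=10) for w in ["cheap", "inexpensive", "expensive"]}
vecs = specialise(vecs, synonyms=[("cheap", "inexpensive")],
                  antonyms=[("cheap", "expensive")])
print(cos(vecs["cheap"], vecs["inexpensive"]))   # should move towards 1
print(cos(vecs["cheap"], vecs["expensive"]))     # should decrease
```

Vectors specialised in this way are what the NBT model consumes in place of a hand-crafted semantic lexicon, and applying the same constraints across languages is what yields the cross-lingual spaces used for multilingual belief tracking.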
1088

Syntactic Similarity Measures in Annotated Corpora for Language Learning : application to Korean Grammar / Mesures de similarité syntaxique dans des corpus annotés pour la didactique des langues : application à la grammaire du coréen

Wang, Ilaine 17 October 2017 (has links)
L'exploration de corpus à travers des requêtes fait aujourd'hui partie de la routine de nombreux chercheurs adoptant une approche empirique de la langue, mais aussi de non-spécialistes qui utilisent des moteurs de recherche ou des concordanciers dans le cadre de l'apprentissage d'une langue. Si les requêtes ainsi basées sur des mots-clés sont communes, les non-spécialistes semblent encore peu enclins à explorer des constructions syntaxiques. En effet, les requêtes syntaxiques requièrent souvent des connaissances spécifiques comme la maîtrise des expressions régulières, le langage de requête de l'outil utilisé, ou même simplement le jeu d'étiquettes morpho-syntaxiques du corpus étudié.Pour permettre aux apprenants de langue de se concentrer sur l'analyse des données langagières plutôt que sur la formulation de requêtes, nous proposons une méthodologie incluant un analyseur syntaxique et utilisant des mesures de similarité classiques pour comparer des séquences d'étiquettes syntaxiques ainsi obtenues de manière automatique. / Using queries to explore corpora is today part of the routine of not only researchers of various fields with an empirical approach to discourse, but also of non-specialists who use search engines or concordancers for language learning purposes. If keyword-based queries are quite common, non-specialists still seem to be less likely to explore syntactic constructions. Indeed, syntax-based queries usually require the use of regular expressions with grammatical words combined with morphosyntactic tags, which imply that users master both the query language of the tool and the tagset of the annotated corpus. However, non-specialists like language learners might want to focus on the output rather than spend time and efforts on mastering a query language.To address this shortcoming, we propose a methodology including a syntactic parser and using common similarity measures to compare sequences of morphosyntactic tags automatically provided.
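A minimal sketch of comparing sequences of morphosyntactic tags with a classical similarity measure, here a length-normalised Levenshtein similarity; the tagset and the example sequences are invented for illustration, not taken from the thesis:

```python
def levenshtein(a, b):
    """Edit distance between two sequences of POS tags."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def tag_similarity(seq1, seq2):
    """Similarity in [0, 1]: 1 means identical tag sequences."""
    if not seq1 and not seq2:
        return 1.0
    return 1.0 - levenshtein(seq1, seq2) / max(len(seq1), len(seq2))

# Tag sequences as a parser might output them for two example sentences.
s1 = ["NOUN", "CASE", "NOUN", "CASE", "VERB", "END"]
s2 = ["NOUN", "CASE", "ADV", "NOUN", "CASE", "VERB", "END"]
print(tag_similarity(s1, s2))   # high: the two constructions are nearly identical
```

Because the tag sequences are produced automatically by the parser, a learner can retrieve sentences built on a similar construction without ever writing a regular expression over the tagset.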
1089

Identification et analyse linguistique du lexique scientifique transdisciplinaire. Approche outillée sur un corpus d'articles de recherche en SHS / The French Cross-disciplinary Scientific Lexicon, Identification and Linguistic Analysis. A corpus-driven approach of Research Articles in Humanities and Social Sciences

Hatier, Sylvain 07 December 2016 (has links)
Cette thèse s’intéresse au lexique scientifique transdisciplinaire (LST), lexique inscrit dans le genre de l’article de recherche en sciences humaines et sociales. Le LST est fréquemment mobilisé dans les écrits scientifiques et constitue ainsi un objet d’importance pour l’étude de ce genre. Ce lexique trouve également des applications concrètes tant en indexation terminologique que pour l’aide à la rédaction/compréhension de textes scientifiques. Ces différents objectifs nous amènent à adopter une approche outillée pour identifier et caractériser les unités lexicales du LST, lexique complexe à circonscrire, situé entre lexique de la langue générale et terminologie. En nous basant sur les propriétés de spécificité et de transdisciplinarité ainsi que sur l’étude des propriétés lexico-syntaxiques de ses éléments, nous élaborons une ressource du LST intégrant informations lexicales, syntaxiques et sémantiques. L’analyse de la combinatoire à l’aide d’un corpus arboré autorise ainsi une caractérisation du LST ancrée sur l’usage dans le genre de l’article de recherche. Selon cette même approche, nous identifions les acceptions nominales transdisciplinaires et proposons une classification sémantique fondée sur la combinatoire en corpus pour intégrer à notre ressource lexicale une typologie nominale sur deux niveaux. Nous montrons enfin que cette structuration du LST nous permet d’aborder la dimension phraséologique et rhétorique du LST en faisant émerger du corpus des constructions récurrentes définies par leurs propriétés syntactico-sémantiques. / In this dissertation we study the French cross-disciplinary scientific lexicon (CSL), a lexicon which falls within the genre of the scientific article in the humanities and social sciences. As the CSL is commonly used in scientific texts, it is a natural gateway for exploring this genre. This lexicon also has practical applications in the fields of automatic term identification and foreign-language teaching in academic settings. To this end, we apply a corpus-driven approach in order to extract and structure the CSL lexical units, which are complex to circumscribe. The method relies on the cross-disciplinarity and specificity criteria and on the lexico-syntactic properties of the CSL lexical units. As a result, we designed a lexical resource which includes lexical, syntactic and semantic information. By analysing the combinatorial properties extracted from a parsed corpus of scientific articles, we characterise the CSL on the basis of its genre-specific use. We follow the same approach to identify cross-disciplinary meanings for the CSL nouns and to design a two-level nominal semantic classification based on corpus combinatorics. This typology finally allows us to explore the rhetorical and phraseological properties of the CSL by identifying frequent syntactico-semantic patterns.
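A minimal sketch of applying the two selection criteria mentioned above: a specificity score (here a simple log-ratio of relative frequencies against a reference corpus) and a cross-disciplinarity threshold requiring the candidate to appear in several disciplines. The counts, corpus sizes and thresholds are illustrative, not those used in the thesis:

```python
import math

def csl_candidates(freq_by_discipline, ref_freq, sci_size, ref_size,
                   min_disciplines=3, min_log_ratio=1.0):
    """Keep lemmas that are both specific to scientific writing and cross-disciplinary."""
    selected = {}
    for lemma, per_disc in freq_by_discipline.items():
        sci_freq = sum(per_disc.values())
        dispersion = sum(1 for f in per_disc.values() if f > 0)
        # Relative-frequency log-ratio, with add-one smoothing for the reference corpus.
        log_ratio = math.log((sci_freq / sci_size) /
                             ((ref_freq.get(lemma, 0) + 1) / ref_size))
        if dispersion >= min_disciplines and log_ratio >= min_log_ratio:
            selected[lemma] = (dispersion, round(log_ratio, 2))
    return selected

# Toy counts: occurrences of each lemma in four SHS disciplines and in a reference corpus.
freq = {
    "hypothèse": {"ling": 120, "socio": 90, "psycho": 150, "eco": 80},
    "joueur":    {"ling": 0,   "socio": 40, "psycho": 5,   "eco": 0},
}
ref = {"hypothèse": 200, "joueur": 2000}
print(csl_candidates(freq, ref, sci_size=1_000_000, ref_size=50_000_000))
```

Candidates that pass both filters would then be examined for their lexico-syntactic behaviour in the parsed corpus before entering the final resource.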
1090

Automatic Identification of Duplicates in Literature in Multiple Languages

Klasson Svensson, Emil January 2018 (has links)
As the number of books available online grows, these collections are becoming larger and increasingly multilingual. Many of these corpora contain duplicates in the form of various editions or translations of books. The task of finding these duplicates is usually done manually, but the growing collection sizes make this time-consuming and demanding. This thesis set out to find a method in the field of Text Mining and Natural Language Processing that can automate the process of identifying these duplicates in a corpus, mainly consisting of fiction in multiple languages, provided by Storytel. The problem was approached using three different methods to compute distance measures between books. The first approach compared the titles of the books using the Levenshtein distance. The second approach extracted entities from each book using Named Entity Recognition, represented them using tf-idf, and computed distances with cosine dissimilarity. The third approach used a Polylingual Topic Model to estimate each book's distribution of topics and compared the distributions using the Jensen-Shannon distance. In order to estimate the parameters of the Polylingual Topic Model, 8,000 books were translated from Swedish to English using Apache Joshua, a statistical machine translation system. For each method, every pair of books written by the same author was tested using a hypothesis test where the null hypothesis was that the two books are not editions or translations of one another. Since there is no known distribution to assume as the null distribution for each book, a null distribution was estimated using distance measures to books not written by the author. The methods were evaluated on two sets of manually labelled data created by the author of the thesis: one randomly sampled using one-stage cluster sampling, and one consisting of books from authors that the corpus provider, prior to the thesis, considered more difficult to label using automated techniques. Of the three methods, title matching performed best in terms of accuracy and precision on the sampled data. The entity-matching approach had the lowest accuracy and precision, but an almost constant recall of around 50%. It was concluded that there seems to be a set of duplicates that are clearly distinguished from the estimated null distributions; with a higher significance level, better precision and accuracy could have been achieved with similar recall for this method. For topic matching the results were worse than for title matching, and on inspection the estimated model was not able to produce quality topics, due to multiple factors; further research is needed for the topic-matching approach. None of the three methods was deemed a complete solution for automating the detection of book duplicates.
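A minimal sketch of the third distance measure described above, the Jensen-Shannon distance between per-book topic distributions; the topic vectors are illustrative, and the per-author null distributions used for the hypothesis tests are not reproduced here:

```python
import numpy as np

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS divergence)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                               # 0 * log(0) is treated as 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))

# Topic distributions estimated by a (hypothetical) polylingual topic model for three books
# by the same author: an original, its translation, and an unrelated title.
original    = [0.55, 0.25, 0.15, 0.05]
translation = [0.50, 0.30, 0.15, 0.05]
other_book  = [0.05, 0.10, 0.25, 0.60]

print(jensen_shannon_distance(original, translation))   # near 0: likely duplicates
print(jensen_shannon_distance(original, other_book))    # clearly different books
```

In the evaluation above, a pair is flagged as a duplicate when its distance falls far below the null distribution estimated from the author's distances to other authors' books, which is where the choice of significance level trades recall against precision.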
