991

Alternative Approaches to Correction of Malapropisms in AIML Based Conversational Agents

Brock, Walter A. 26 November 2014 (has links)
The use of Conversational Agents (CAs) utilizing Artificial Intelligence Markup Language (AIML) has been studied in a number of disciplines. Previous research has shown a great deal of promise. It has also documented significant limitations in the abilities of these CAs. Many of these limitations are related specifically to the method employed by AIML to resolve ambiguities in the meaning and context of words. While methods exist to detect and correct common errors in spelling and grammar of sentences and queries submitted by a user, one class of input error that is particularly difficult to detect and correct is the malapropism. In this research a malapropism is defined as a "verbal blunder in which one word is replaced by another similar in sound but different in meaning" ("malapropism," 2013). This research explored the use of alternative methods of correcting malapropisms in sentences input to AIML CAs using measures of Semantic Distance and tri-gram probabilities. Results of these alternative methods were compared against AIML CAs using only the Symbolic Reductions built into AIML. This research found that the use of the two methodologies studied here did indeed lead to a small but measurable improvement in the performance of the CA in terms of the appropriateness of its responses as classified by human judges. However, it was also noted that in a large number of cases the CA simply ignored the existence of a malapropism altogether in formulating its responses. In most of these cases, the interpretation of and response to the user's input was of such a general nature that one might question the overall efficacy of the AIML engine. The answer to this question is a matter for further study.
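To make the tri-gram signal concrete, the following sketch flags a low-probability word in context and rescores nearby look-alike candidates. The toy corpus, the threshold and the difflib-based "similar in sound" generator are illustrative assumptions, not the thesis implementation, which also uses WordNet-style semantic distance and a phonetic notion of similarity.

```python
# Illustrative sketch only: a toy malapropism corrector combining tri-gram
# probabilities with a crude stand-in for "similar sounding" candidates.
from collections import Counter
import difflib

corpus = [
    "please book a table for two at the restaurant",
    "the dessert was delicious after dinner",
    "we crossed the desert in a jeep",
]

tokens = [w for line in corpus for w in line.split()]
vocab = sorted(set(tokens))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigrams = Counter(zip(tokens, tokens[1:]))

def trigram_prob(w1, w2, w3):
    """P(w3 | w1, w2) with add-one smoothing over the toy vocabulary."""
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + len(vocab))

def correct(sentence, threshold=0.06):   # threshold tuned for the toy data
    words = sentence.lower().split()
    out = list(words)
    for i in range(2, len(words)):
        w1, w2, w3 = out[i - 2], out[i - 1], words[i]
        if trigram_prob(w1, w2, w3) >= threshold:
            continue
        # String similarity approximates "similar in sound"; a real system
        # would use a phonetic key such as Soundex or Metaphone.
        candidates = difflib.get_close_matches(w3, vocab, n=5, cutoff=0.7)
        scored = [(trigram_prob(w1, w2, c), c) for c in candidates]
        if scored and max(scored)[0] > trigram_prob(w1, w2, w3):
            out[i] = max(scored)[1]
    return " ".join(out)

print(correct("we crossed the dessert in a jeep"))  # -> "... the desert in a jeep"
```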
992

Unsupervised Knowledge-based Word Sense Disambiguation: Exploration & Evaluation of Semantic Subgraphs

Manion, Steve Lawrence January 2014 (has links)
Hypothetically, if you were told "Apple uses the apple as its logo", you would immediately detect two different senses of the word apple, these being the company and the fruit respectively. Making this distinction is the formidable challenge of Word Sense Disambiguation (WSD), which is a subtask of many Natural Language Processing (NLP) applications. This thesis is a multi-branched investigation into WSD that explores and evaluates unsupervised knowledge-based methods exploiting semantic subgraphs. The research covered by this thesis can be broken down into: 1. Mining data from the encyclopedic resource Wikipedia, to visually prove the existence of context embedded in semantic subgraphs 2. Achieving disambiguation in order to merge concepts that originate from heterogeneous semantic graphs 3. Participation in international evaluations of WSD across a range of languages 4. Treating WSD as a classification task that can be optimised through the iterative construction of semantic subgraphs The contributions of each chapter vary, but can be summarised by what has been produced, learnt, and raised throughout the thesis. Furthermore, an API and several resources have been developed as a by-product of this research, all of which can be accessed by visiting the author’s home page at http://www.stevemanion.com. This should enable researchers to replicate the results achieved in this thesis and build on them if they wish.
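As a rough illustration of knowledge-based WSD over a semantic subgraph, the sketch below ranks candidate senses by how well connected they are to the surrounding context terms. The tiny hand-built graph and the degree-centrality measure are stand-ins chosen for brevity, not the Wikipedia-derived resources or the specific algorithms evaluated in the thesis.

```python
# Minimal sketch: disambiguate a term by connectivity inside a semantic subgraph.
import networkx as nx

G = nx.Graph()
# Hypothetical edges of the kind one might mine from an encyclopedic resource.
G.add_edges_from([
    ("apple (company)", "logo"), ("apple (company)", "technology"),
    ("apple (company)", "iPhone"), ("apple (fruit)", "tree"),
    ("apple (fruit)", "fruit"), ("apple (fruit)", "logo"),
    ("logo", "brand"), ("technology", "iPhone"), ("fruit", "tree"),
])

def disambiguate(candidate_senses, context_terms):
    # Induce the subgraph containing only the senses and the context terms.
    nodes = [n for n in candidate_senses + context_terms if n in G]
    sub = G.subgraph(nodes)
    centrality = nx.degree_centrality(sub)
    # The best-connected sense within the subgraph wins.
    return max(candidate_senses, key=lambda s: centrality.get(s, 0.0))

senses = ["apple (company)", "apple (fruit)"]
print(disambiguate(senses, ["logo", "technology", "iPhone"]))  # company sense
print(disambiguate(senses, ["tree", "fruit"]))                 # fruit sense
```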
993

以範例為基礎之英漢TIMSS試題輔助翻譯 / Using Example-based Translation Techniques for Computer Assisted Translation of TIMSS Test Items

張智傑, Chang, Chih Chieh Unknown Date (has links)
本論文應用以範例為基礎的機器翻譯技術，應用英漢雙語對應的結構輔助英漢單句語料的翻譯。翻譯範例是運用一種特殊的結構，此結構包含來源句的剖析樹、目標句的字串、以及目標句和來源句詞彙對應關係。將翻譯範例建立資料庫，以提供來源句作詞序交換的依據，接著透過字典翻譯，以及利用統計式中英詞彙對列和語言模型來選詞，最後填補缺少的量詞，產生建議的翻譯。我們是以2003年國際數學與科學教育成就趨勢調查測驗試題為主要翻譯的對象，以期提升翻譯的一致性和效率。以NIST 和BLEU 的評比方式，來評估和比較Google Translate 和Yahoo!線上翻譯系統及本系統所達成的翻譯品質。我們的系統經過詞序調動以及填補量詞後，翻譯品質比我們前一代系統要佳，但整體效果沒有比Google Translate 和Yahoo!線上翻譯的品質要佳。 / This paper presents an example-based machine translation approach based on bilingual structured string tree correspondence (BSSTC). A BSSTC structure consists of a parse tree of the source sentence, the string of the target sentence, and the word correspondences between the source tree and the target string. / We designed an English-to-Chinese computer-assisted translation system for Trends in International Mathematics and Science Study (TIMSS) test items, which reorders the source sentence on the basis of the BSSTC structures, translates words with a dictionary, selects among candidate translations using statistical English-Chinese word alignment and a language model, and generates the missing measure words. / We evaluated our system with BLEU and NIST scores and compared it against Google Translate and Yahoo! Translate. With word reordering and measure-word insertion, the current system achieves higher-quality default translations than the previous implementation of our research group, but the overall quality still lags behind that achieved by Google and Yahoo!.
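The sketch below illustrates two of the steps named in the abstract in miniature: choosing among dictionary translation candidates with a target-language n-gram model, and inserting a missing measure word. The dictionary, bigram counts and measure-word table are invented for the example, and the BSSTC-based reordering step is omitted; this is not the thesis system or its TIMSS data.

```python
# Toy dictionary translation + n-gram candidate selection + measure-word insertion.
from collections import Counter
from itertools import product

dictionary = {          # English word -> candidate Chinese translations (invented)
    "three": ["三"],
    "apples": ["蘋果"],
    "weigh": ["重", "稱重"],
    "ninety": ["九十"],
    "grams": ["公克"],
}
measure_words = {"蘋果": "顆"}   # noun -> measure word used after a numeral

target_corpus = [["三", "顆", "蘋果", "重", "九十", "公克"]]
bigrams = Counter(b for sent in target_corpus for b in zip(sent, sent[1:]))

def score(seq):
    """Tiny bigram score; higher means the target sequence looks more fluent."""
    return sum(bigrams[b] for b in zip(seq, seq[1:]))

def translate(english_tokens):
    candidates = [dictionary.get(w, [w]) for w in english_tokens]
    best = max(product(*candidates), key=score)      # pick the most fluent combination
    out, numerals = [], {"一", "二", "三", "四", "五"}
    for i, tok in enumerate(best):
        out.append(tok)
        # Fill in a measure word between a numeral and a noun that needs one.
        if tok in numerals and i + 1 < len(best) and best[i + 1] in measure_words:
            out.append(measure_words[best[i + 1]])
    return "".join(out)

print(translate(["three", "apples", "weigh", "ninety", "grams"]))  # 三顆蘋果重九十公克
```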
994

Automatic Text Ontological Representation and Classification via Fundamental to Specific Conceptual Elements (TOR-FUSE)

Razavi, Amir Hossein 16 July 2012 (has links)
In this dissertation, we introduce a novel text representation method mainly used for text classification purposes. The presented representation method is initially based on a variety of closeness relationships between pairs of words in text passages within the entire corpus. This representation is then used as the basis for our multi-level lightweight ontological representation method (TOR-FUSE), in which documents are represented based on their contexts and the goal of the learning task. The method is unlike traditional representation methods, in which all the documents are represented solely based on their constituent words and are totally isolated from the goal that they are represented for. We believe choosing the correct granularity of representation features is an important aspect of text classification. Interpreting data in a more general dimensional space, with fewer dimensions, can convey more discriminative knowledge and decrease the level of learning perplexity. The multi-level model allows data interpretation in a more conceptual space, rather than one containing only scattered words occurring in texts. It aims to extract the knowledge tailored for the classification task by automatically creating a lightweight ontological hierarchy of representations. In the last step, we train a tailored ensemble learner over a stack of representations at different conceptual granularities. The final result is a mapping and a weighting of the targeted concept of the original learning task over a stack of representations and the granular conceptual elements of its different levels (a hierarchical mapping instead of a linear mapping over a vector). Finally, the entire algorithm is applied to a variety of general text classification tasks, and its performance is evaluated in comparison with well-known algorithms.
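A rough sketch of the general idea follows: documents are encoded both with raw words and with coarser "concept" features obtained by clustering word co-occurrence profiles, and a small ensemble combines one learner per representation level. The toy corpus, the k-means clustering and the probability-averaging ensemble (built with scikit-learn) are stand-ins chosen for brevity, not the TOR-FUSE components themselves.

```python
# Sketch: two representation granularities (words, clustered "concepts") + a small ensemble.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

docs = [
    "the striker scored a late goal", "the keeper saved the penalty",
    "the court ruled on the appeal", "the judge delayed the trial",
]
labels = np.array([0, 0, 1, 1])                 # 0 = sport, 1 = law

vec = CountVectorizer()
X_words = vec.fit_transform(docs).toarray()     # level 1: word counts

# Level 2: cluster words by their document co-occurrence profile into broader
# "concepts", then represent each document by concept counts.
n_concepts = 2
concepts = KMeans(n_clusters=n_concepts, n_init=10, random_state=0).fit_predict(X_words.T)

def to_concepts(X):
    Xc = np.zeros((X.shape[0], n_concepts))
    for w, c in enumerate(concepts):
        Xc[:, c] += X[:, w]
    return Xc

levels_train = [X_words, to_concepts(X_words)]
models = [LogisticRegression(max_iter=1000).fit(X, labels) for X in levels_train]

test = vec.transform(["the referee awarded a goal"]).toarray()
levels_test = [test, to_concepts(test)]
# The ensemble simply averages the per-level class probabilities.
probs = np.mean([m.predict_proba(x)[0] for m, x in zip(models, levels_test)], axis=0)
print("predicted class:", int(np.argmax(probs)))
```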
995

Outomatiese Setswana lemma-identifisering (Automatic Setswana lemma identification) / Jeanetta Hendrina Brits

Brits, Jeanetta Hendrina January 2006 (has links)
Within the context of natural language processing, a lemmatiser is one of the most important core technology modules that has to be developed for a particular language. A lemmatiser reduces words in a corpus to the corresponding lemmas of the words in the lexicon. A lemma is defined as the meaningful base form from which other more complex forms (i.e. variants) are derived. Before a lemmatiser can be developed for a specific language, the concept "lemma" as it applies to that specific language should first be defined clearly. This study concludes that, in Setswana, only stems (and not roots) can act independently as words; therefore, only stems should be accepted as lemmas in the context of automatic lemmatisation for Setswana. Five of the seven parts of speech in Setswana can be viewed as closed classes, which means that these classes are not extended by means of regular morphological processes. The two other parts of speech (nouns and verbs) require the implementation of alternation rules to determine the lemma. Such alternation rules were formalised in this study for the purpose of developing a Setswana lemmatiser, with the existing Setswana grammars used as the basis for these rules. This made it possible to determine how precisely the formalisation of these existing grammars lemmatises Setswana words. The software developed by Van Noord (2002), FSA 6, is one of the best-known applications available for the development of finite state automata and transducers. Regular expressions based on the formalised morphological rules were used in FSA 6 to create finite state transducers, and the code subsequently generated by FSA 6 was implemented in the lemmatiser. The metric used to evaluate the lemmatiser is precision. On a test corpus of 1 000 words, the lemmatiser obtained 70.92%. In a further evaluation on 500 complex nouns and 500 complex verbs separately, the lemmatiser obtained 70.96% and 70.52% respectively, while the precision on 500 complex and simplex nouns was 78.45% and on complex and simplex verbs 79.59%. These quantitative results only give an indication of the relative precision of the grammars; nevertheless, they provided analysed data with which the grammars could be evaluated qualitatively. The study concludes with an overview of how these results might be improved in the future. / Thesis (M.A. (African Languages))--North-West University, Potchefstroom Campus, 2006.
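The lemmatiser described above compiles formalised alternation rules into finite-state transducers with FSA 6. The sketch below shows the same general shape, ordered rewrite rules applied to a word form, but the rules themselves are invented placeholders rather than the Setswana grammar formalised in the study.

```python
# Rule-based lemmatisation sketch: ordered (pattern -> stem) rewrite rules.
import re

# Each rule: (pattern matching an inflected form, replacement yielding the stem).
# These rules are illustrative placeholders, not a validated Setswana grammar.
RULES = [
    (re.compile(r"^di(.+)$"), r"\1"),      # hypothetical: strip a noun class prefix
    (re.compile(r"^(.+)ile$"), r"\1a"),    # hypothetical: verb ending -ile -> -a
    (re.compile(r"^(.+)wa$"), r"\1a"),     # hypothetical: passive -wa -> -a
]

def lemmatise(word):
    for pattern, replacement in RULES:
        if pattern.match(word):
            return pattern.sub(replacement, word)
    return word            # no rule applies: treat the form as already a lemma

for w in ["dikgomo", "rekile", "bonwa"]:
    print(w, "->", lemmatise(w))
```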
996

Microbial phenomics information extractor (MicroPIE): a natural language processing tool for the automated acquisition of prokaryotic phenotypic characters from text sources

Mao, Jin, Moore, Lisa R., Blank, Carrine E., Wu, Elvis Hsin-Hui, Ackerman, Marcia, Ranade, Sonali, Cui, Hong 13 December 2016 (has links)
Background: The large-scale analysis of phenomic data (i.e., full phenotypic traits of an organism, such as shape, metabolic substrates, and growth conditions) in microbial bioinformatics has been hampered by the lack of tools to rapidly and accurately extract phenotypic data from existing legacy text in the field of microbiology. To quickly obtain knowledge on the distribution and evolution of microbial traits, an information extraction system needed to be developed to extract phenotypic characters from large numbers of taxonomic descriptions so they can be used as input to existing phylogenetic analysis software packages. Results: We report the development and evaluation of Microbial Phenomics Information Extractor (MicroPIE, version 0.1.0). MicroPIE is a natural language processing application that uses a robust supervised classification algorithm (Support Vector Machine) to identify characters from sentences in prokaryotic taxonomic descriptions, followed by a combination of algorithms applying linguistic rules with groups of known terms to extract characters as well as character states. The input to MicroPIE is a set of taxonomic descriptions (clean text). The output is a taxon-by-character matrix, with taxa in the rows and a set of 42 pre-defined characters (e.g., optimum growth temperature) in the columns. The performance of MicroPIE was evaluated against a gold standard matrix and another student-made matrix. Results show that, compared to the gold standard, MicroPIE extracted 21 characters (50%) with a Relaxed F1 score > 0.80 and 16 characters (38%) with Relaxed F1 scores ranging between 0.50 and 0.80. Inclusion of a character prediction component (SVM) improved the overall performance of MicroPIE, notably the precision. Evaluated against the same gold standard, MicroPIE performed significantly better than the undergraduate students. Conclusion: MicroPIE is a promising new tool for the rapid and efficient extraction of phenotypic character information from prokaryotic taxonomic descriptions. However, further development, including incorporation of ontologies, will be necessary to improve the performance of the extraction for some character types.
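This is not MicroPIE itself, but a compressed illustration of its two stages as described above: an SVM that routes sentences of a taxonomic description to a character, followed by term- and rule-based extraction of the character state. The training sentences, character labels and the extraction pattern below are made up for the example.

```python
# Sketch: SVM sentence-to-character classification + rule-based state extraction.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_sentences = [
    "Optimal growth occurs at 37 degrees C.",
    "Growth is observed between 20 and 45 degrees C.",
    "Cells are rod-shaped and motile.",
    "Cells are coccoid, occurring singly or in pairs.",
]
train_labels = ["optimum growth temperature", "optimum growth temperature",
                "cell shape", "cell shape"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_sentences, train_labels)

def extract(sentence):
    character = clf.predict([sentence])[0]
    if character == "optimum growth temperature":
        match = re.search(r"(\d+(?:\.\d+)?)\s*degrees\s*C", sentence)
        state = match.group(1) if match else None
    else:
        state = "rod" if "rod" in sentence else ("coccoid" if "coccoid" in sentence else None)
    return character, state

print(extract("Optimum growth is at 30 degrees C."))
print(extract("Cells are rod-shaped, 2 um long."))
```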
997

Apprentissage automatique et compréhension dans le cadre d’un dialogue homme-machine téléphonique à initiative mixte / Corpus-based spoken language understanding for mixed initiative spoken dialog systems

Servan, Christophe 10 December 2008 (has links)
Les systèmes de dialogues oraux Homme-Machine sont des interfaces entre un utilisateur et des services. Ces services sont présents sous plusieurs formes : services bancaires, systèmes de réservations (de billets de train, d’avion), etc. Les systèmes de dialogues intègrent de nombreux modules notamment ceux de reconnaissance de la parole, de compréhension, de gestion du dialogue et de synthèse de la parole. Le module qui concerne la problématique de cette thèse est celui de compréhension de la parole. Le processus de compréhension de la parole est généralement séparé du processus de transcription. Il s’agit, d’abord, de trouver la meilleure hypothèse de reconnaissance puis d’appliquer un processus de compréhension. L’approche proposée dans cette thèse est de conserver l’espace de recherche probabiliste tout au long du processus de compréhension en l’enrichissant à chaque étape. Cette approche a été appliquée lors de la campagne d’évaluation MEDIA. Nous montrons l’intérêt de notre approche par rapport à l’approche classique. En utilisant différentes sorties du module de RAP sous forme de graphe de mots, nous montrons que les performances du décodage conceptuel se dégradent linéairement en fonction du taux d’erreurs sur les mots (WER). Cependant nous montrons qu’une approche intégrée, cherchant conjointement la meilleure séquence de mots et de concepts, donne de meilleurs résultats qu’une approche séquentielle. Dans le souci de valider notre approche, nous menons des expériences sur le corpus MEDIA dans les mêmes conditions d’évaluation que lors de la campagne MEDIA. Il s’agit de produire des interprétations sémantiques à partir des transcriptions sans erreur. Les résultats montrent que les performances atteintes par notre modèle sont au niveau des performances des systèmes ayant participé à la campagne d’évaluation. L’étude détaillée des résultats obtenus lors de la campagne MEDIA nous permet de montrer la corrélation entre, d’une part, le taux d’erreur d’interprétation et, d’autre part, le taux d’erreur mots de la reconnaissance de la parole, la taille du corpus d’apprentissage, ainsi que l’ajout de connaissance a priori aux modèles de compréhension. Une analyse d’erreurs montre l’intérêt de modifier les probabilités des treillis de mots avec des triggers, un modèle cache ou d’utiliser des règles arbitraires obligeant le passage dans une partie du graphe et s’appliquant sur la présence d’éléments déclencheurs (mots ou concepts) en fonction de l’historique. On présente les méthodes à base de d’apprentissage automatique comme nécessairement plus gourmandes en terme de corpus d’apprentissage. En modifiant la taille du corpus d’apprentissage, on peut mesurer le nombre minimal ainsi que le nombre optimal de dialogues nécessaires à l’apprentissage des modèles de langages conceptuels du système de compréhension. Des travaux de recherche menés dans cette thèse visent à déterminer quel est la quantité de corpus nécessaire à l’apprentissage des modèles de langages conceptuels à partir de laquelle les scores d’évaluation sémantiques stagnent. Une corrélation est établie entre la taille de corpus nécessaire pour l’apprentissage et la taille de corpus afin de valider le guide d’annotations. En effet, il semble, dans notre cas de l’évaluation MEDIA, qu’il ait fallu sensiblement le même nombre d’exemple pour, d’une part, valider l’annotation sémantique et, d’autre part, obtenir un modèle stochastique « de qualité » appris sur corpus. 
De plus, en ajoutant des données a priori à nos modèles stochastiques, nous réduisons de manière significative la taille du corpus d’apprentissage nécessaire pour atteindre les mêmes scores du système entièrement stochastique (près de deux fois moins de corpus à score égal). Cela nous permet de confirmer que l’ajout de règles élémentaires et intuitives (chiffres, nombres, codes postaux, dates) donne des résultats très encourageants. Ce constat a mené à la réalisation d’un système hybride mêlant des modèles à base de corpus et des modèles à base de connaissance. Dans un second temps, nous nous appliquons à adapter notre système de compréhension à une application de dialogue simple : un système de routage d’appel. La problématique de cette tâche est le manque de données d’apprentissage spécifiques au domaine. Nous la résolvons en partie en utilisant divers corpus déjà à notre disposition. Lors de ce processus, nous conservons les données génériques acquises lors de la campagne MEDIA et nous y intégrons les données spécifiques au domaine. Nous montrons l’intérêt d’intégrer une tâche de classification d’appel dans un processus de compréhension de la parole spontanée. Malheureusement, nous disposons de très peu de données d’apprentissage relatives au domaine de la tâche. En utilisant notre approche intégrée de décodage conceptuel, conjointement à un processus de filtrage, nous proposons une approche sous forme de sac de mots et de concepts. Cette approche exploitée par un classifieur permet d’obtenir des taux de classification d’appels encourageants sur le corpus de test, alors que le WER est assez élevé. L’application des méthodes développées lors de la campagne MEDIA nous permet d’améliorer la robustesse du processus de routage d’appels. / Spoken dialogue systems are interfaces between users and services. Simple examples of services for which these dialogue systems can be used include banking and booking (hotels, trains, flights). Dialogue systems are composed of a number of modules. The main modules include Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), Dialogue Management and Speech Generation. In this thesis, we concentrate on the Spoken Language Understanding component of dialogue systems. In the past, it has been usual to separate the Spoken Language Understanding process from that of Automatic Speech Recognition. First, the Automatic Speech Recognition process finds the best word hypothesis. Given this hypothesis, we then find the best semantic interpretation. This thesis presents a method for the robust extraction of basic conceptual constituents (or concepts) from an audio message. The conceptual decoding model proposed follows a stochastic paradigm and is directly integrated into the Automatic Speech Recognition process. This approach allows us to keep the probabilistic search space on sequences of words produced by the Automatic Speech Recognition module, and to project it onto a probabilistic search space of sequences of concepts. The experiments carried out on the French spoken dialogue corpus MEDIA, available through ELDA, show that the performance reached by our new approach is better than that of the traditional sequential approach. As a starting point for evaluation, the effect that deterioration of word error rate (WER) has on SLU systems is examined through the use of different ASR outputs.
The SLU performance appears to decrease linearly as a function of ASR word error rate. We show, however, that the proposed integrated method of searching for both words and concepts gives better results than a traditional sequential approach. In order to validate our approach, we conduct experiments on the MEDIA corpus in the same assessment conditions used during the MEDIA campaign. The goal is to produce semantic interpretations from error-free transcripts. The results show that the performance achieved by our model is as good as that of the systems involved in the evaluation campaign. Studies made on the MEDIA corpus show that the concept error rate is related to the word error rate, the size of the training corpus and the a priori knowledge added to the conceptual language models. Error analyses show the interest of modifying the probabilities of the word lattice with triggers or a cache model, or of using arbitrary rules forcing the path through a portion of the graph when trigger elements (words or concepts) are present, depending on the history. Methods based on machine learning are generally quite demanding in terms of the amount of training data required. By changing the size of the training corpus, the minimum and the optimal number of dialogues needed for training conceptual language models can be measured. Research conducted in this thesis aims to determine the size of corpus necessary for training conceptual language models beyond which the semantic evaluation scores stagnate. A correlation is established between the corpus size necessary for learning and the corpus size necessary to validate the manual annotations. In the case of the MEDIA evaluation campaign, it took roughly the same number of examples, first to validate the semantic annotations and, secondly, to obtain a "quality" corpus-trained stochastic model. The addition of a priori knowledge to our stochastic models reduces significantly the size of the training corpus needed to achieve the same scores as a fully stochastic system (nearly half the size for the same score). This allows us to confirm that the addition of basic intuitive rules (numbers, zip codes, dates) gives very encouraging results. It led us to create a hybrid system combining corpus-based and knowledge-based models. The second part of the thesis examines the application of the understanding module to another simple dialogue system task, a call-routing system. A problem with this specific task is the lack of data available for training the required language models. We attempt to resolve this issue by supplementing the in-domain data with various other generic corpora already available, and with data from the MEDIA campaign. We show the benefits of integrating a call classification task in an SLU process. Unfortunately, we have very little training data in the field under consideration. By using our integrated approach to decode concepts, along with a filtering process, we propose a bag-of-words-and-concepts approach. This approach, used by a classifier, achieved encouraging call classification rates on the test corpus, while the WER was relatively high. The methods developed are shown to improve the robustness of the call-routing process.
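The sketch below is a strongly simplified picture of the kind of search the thesis keeps integrated: concept sequences are scored jointly with the words that carry them, rather than tagging a single fixed transcription afterwards. The hand-set emission and transition probabilities and the plain Viterbi decoder are illustrative assumptions, not the MEDIA models or the lattice-based decoding of the thesis.

```python
# Toy joint word/concept decoding with a Viterbi search over concept sequences.
import math

CONCEPTS = ["null", "city", "date"]
# P(word | concept) and P(concept | previous concept), invented for the demo.
EMIT = {
    "null": {"a": 0.3, "hotel": 0.3, "in": 0.3, "on": 0.1},
    "city": {"paris": 0.6, "lyon": 0.4},
    "date": {"friday": 0.5, "tomorrow": 0.5},
}
TRANS = {
    "<s>":  {"null": 0.6, "city": 0.2, "date": 0.2},
    "null": {"null": 0.5, "city": 0.3, "date": 0.2},
    "city": {"null": 0.6, "city": 0.1, "date": 0.3},
    "date": {"null": 0.7, "city": 0.2, "date": 0.1},
}

def viterbi(words):
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")
    # Initialise with the first word.
    best = {c: (logp(TRANS["<s>"][c]) + logp(EMIT[c].get(words[0], 1e-6)), [c])
            for c in CONCEPTS}
    for w in words[1:]:
        new = {}
        for c in CONCEPTS:
            score, path = max(
                (prev_score + logp(TRANS[p][c]) + logp(EMIT[c].get(w, 1e-6)), path)
                for p, (prev_score, path) in best.items())
            new[c] = (score, path + [c])
        best = new
    return max(best.values())[1]

words = "a hotel in paris on friday".split()
print(list(zip(words, viterbi(words))))
```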
998

Modeling Alcohol Consumption Using Blog Data

Koh, Kok Chuan 05 1900 (has links)
How do the content and writing style of people who drink alcoholic beverages stand out from those of non-drinkers? How much information can we learn about a person's alcohol consumption behavior by reading text that they have authored? This thesis attempts to extend the methods deployed in authorship attribution and authorship profiling research into the domain of automatically identifying the human action of drinking alcoholic beverages. I examine how a psycholinguistics dictionary (the Linguistic Inquiry and Word Count lexicon, developed by James Pennebaker), together with Kenneth Burke's concept of words as symbols of human action and James Wertsch's concept of mediated action, provides a framework for analyzing meaningful data patterns from the content of blogs written by consumers of alcoholic beverages. The contributions of this thesis to the research field are twofold. First, I show that it is possible to automatically identify blog posts that have content related to the consumption of alcoholic beverages. And second, I provide a framework and tools to model human behavior through text analysis of blog data.
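The general shape of this approach, psycholinguistic word-category frequencies used as features for a classifier, is sketched below. Because LIWC is a licensed lexicon, the tiny category word lists, example posts and labels are invented stand-ins; only the overall pipeline is meant to mirror the method.

```python
# Sketch: LIWC-style category frequencies as features for classifying blog posts.
import numpy as np
from sklearn.linear_model import LogisticRegression

CATEGORIES = {                                  # hypothetical mini-lexicon
    "social":  {"friends", "party", "we", "together"},
    "leisure": {"bar", "weekend", "beer", "wine", "music"},
    "negemo":  {"tired", "hangover", "regret", "sad"},
}

def features(text):
    words = text.lower().split()
    total = max(len(words), 1)
    # Fraction of tokens falling in each category.
    return [sum(w in lexicon for w in words) / total
            for lexicon in CATEGORIES.values()]

posts = [
    "great weekend with friends beer and music at the party",
    "quiet weekend reading and a long walk in the park",
    "wine with dinner then we sang together all night",
    "finished my report early and slept well",
]
drinker = np.array([1, 0, 1, 0])                # invented labels for the demo

X = np.array([features(p) for p in posts])
model = LogisticRegression().fit(X, drinker)
print(model.predict([features("hangover this morning after the bar")]))
```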
999

Une approche d'ingénierie ontologique pour l'acquisition et l'exploitation des connaissances à partir de documents textuels : vers des objets de connaissances et d'apprentissage / An ontological engineering approach to knowledge acquisition and exploitation from textual documents: towards knowledge and learning objects

Zouaq, Amal January 2007 (has links)
Thesis digitized by the Division de la gestion de documents et des archives of the Université de Montréal.
1000

JSreal : un réalisateur de texte pour la programmation web / JSreal: a text realiser for web programming

Daoust, Nicolas 09 1900 (has links)
Site web associé au mémoire: http://daou.st/JSreal / La génération automatique de texte en langage naturel est une branche de l’intelligence artificielle qui étudie le développement de systèmes produisant des textes pour différentes applications, par exemple la description textuelle de jeux de données massifs ou l’automatisation de rédactions textuelles routinières. Un projet de génération de texte comporte plusieurs grandes étapes : la détermination du contenu à exprimer, son organisation en structures comme des paragraphes et des phrases et la production de chaînes de caractères pour un lecteur humain ; c’est la réalisation, à laquelle ce mémoire s’attaque. Le web est une plateforme en constante croissance dont le contenu, de plus en plus dynamique, se prête souvent bien à l’automatisation par un réalisateur. Toutefois, les réalisateurs existants ne sont pas conçus en fonction du web et leur utilisation requiert beaucoup de connaissances, compliquant leur emploi. Le présent mémoire de maîtrise présente JSreal, un réalisateur conçu spécifiquement pour le web et facile d’apprentissage et d’utilisation. JSreal permet de construire une variété d’expressions et de phrases en français, qui respectent les règles de grammaire et de syntaxe, d’y ajouter des balises HTML et de les intégrer facilement aux pages web. / Natural language generation, a part of artificial intelligence, studies the development of systems that produce text for different applications, for example the textual description of massive datasets or the automation of routine text writing. Text generation projects consist of multiple steps: determining the content to be expressed, organising it in logical structures such as sentences and paragraphs, and producing human-readable character strings, a step usually called realisation, which this thesis takes on. The web is constantly growing and its contents, getting progressively more dynamic, are well-suited to automation by a realiser. However, existing realisers are not designed with the web in mind and their operation requires much knowledge, complicating their use. This master’s thesis presents JSreal, a realiser designed specifically for the web and easy to learn and use. JSreal allows its user to build a variety of French expressions and sentences that respect the rules of grammar and syntax, to add HTML tags to them and to easily integrate them into web pages.
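JSreal itself is a JavaScript library, and its actual API is not reproduced here. The toy sketch below only illustrates what "realisation" means in the abstract, turning a small syntactic specification into a well-formed, optionally HTML-tagged string; the lexicon entries and agreement rules are invented for the example.

```python
# Toy surface realiser: determiner agreement, pluralisation, optional HTML tagging.
LEXICON = {
    "chat":   {"gender": "m", "plural": "chats"},
    "souris": {"gender": "f", "plural": "souris"},
}
DETERMINERS = {("m", "sg"): "le", ("f", "sg"): "la", ("m", "pl"): "les", ("f", "pl"): "les"}

def realise_np(noun, number="sg", tag=None):
    entry = LEXICON[noun]
    det = DETERMINERS[(entry["gender"], number)]          # agreement in gender/number
    form = entry["plural"] if number == "pl" else noun
    phrase = f"{det} {form}"
    return f"<{tag}>{phrase}</{tag}>" if tag else phrase  # optional HTML wrapping

print(realise_np("chat"))                           # le chat
print(realise_np("souris", number="pl"))            # les souris
print(realise_np("chat", number="pl", tag="em"))    # <em>les chats</em>
```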
