Global ETD Search

1	Universal Constraint Language Piják, Peter January 2011 Title: Universal Constraint Language Author: Peter Piják Department / Institute: Department of Software Engineering Supervisor of the master thesis: Mgr. Martin Nečaský, Ph.D. Abstract: Today's software applications are typically composed of multiple application components. When modeling software, different integrity constraint languages are used for particular parts of the model (e.g. OCL for UML class diagrams, Schematron for XML, or SQL triggers for relational databases). Constraint expressions then need to be converted to expressions over different meta-models, which is a non-trivial task. This thesis introduces a new common language, the Universal Constraint Language (UCL), for expressing integrity constraints over various data meta-models. The language is formally defined and a parser for it is implemented. We also present semi-automatic translation between constraints over various meta-models, and the derivation of constraints in specific constraint languages from constraints written in the introduced language. Keywords: constraint language, model-driven architecture, universal formalism
2	Unsupervised Clustering and Automatic Language Model Generation for ASR Podder, Sushil January 2004 The goal of an automatic speech recognition system is to enable a computer to understand human speech and act accordingly. Language modeling plays an important role in realizing this goal: it works as a knowledge source by mimicking the human comprehension mechanism in understanding language. Among many approaches, statistical language modeling is widely used in automatic speech recognition systems. However, generating a reliable and robust statistical model is a very difficult task, especially for a large-vocabulary system, where the performance of the language model degrades as the vocabulary size increases. The performance of the speech recognition system therefore also degrades, due to the increased complexity and mutual confusion among the candidate words in the language model. Solving these problems requires reducing the language model size and minimizing mutual confusion between words. In our work, we employ clustering techniques, using a self-organizing map, to build topical language models. Moreover, to capture the inherent semantics of sentences, the lexical dictionary WordNet is used in the clustering process. This thesis focuses on various aspects of clustering, language model generation, and extraction of task-dependent acoustic parameters, and on their implementation within the framework of the CMU Sphinx3 speech engine decoder. The preliminary results presented in this thesis show the effectiveness of the topical language models. Systems Design ASR language model generation WordNet
4	Study of Pretraining Bias and Frequencies Taware, Rutuja Murlidhar 10 July 2023 The use of language models in an in-context learning setting has been adopted for a wide range of tasks. Recent works have showcased the impact of pretraining data on the in-context performance of language models. In this work, we experiment with numbers having high and low frequencies in the pretraining data to understand the impact of term frequencies on the model's performance. We also experiment with random and adversarial demonstrations to understand the pretraining bias present in the model. Through these experiments, we showcase the importance of the pretraining frequencies of the numbers present in the demonstrations and explain how highly frequent terms can be used in the demonstrations to achieve better task performance. Moreover, we show the impact of pretraining bias on the model's performance and explain how the model overcomes this bias with more demonstrations. / Master of Science / Recent works focus on understanding and improving the arithmetic capabilities of state-of-the-art (SOTA) systems in the domain of Natural Language Processing (NLP). This work focuses on designing and performing novel experiments to analyze the impact of training data on the performance of such systems. Through these experiments, this work showcases interesting properties of the SOTA systems, which will promote future research to understand them better and help in creating better downstream applications. In-Context Learning Pretraining Frequency Bias Language Model
5	Chinese input method based on reduced phonetic transcription Hsu, Feng-Ho 22 May 2012 In this paper, we investigate a highly efficient input method for Chinese. In the traditional Mandarin phonetic input method, users have to enter the complete sequence of Mandarin phonetic symbols for each character. The proposed input method instead transforms a sequence of first Mandarin phonetic symbols into a character sequence, so users only have to enter the first phonetic symbol of each character, following the input rule that spaces are inserted between words. The system outputs candidate character sequence hypotheses. A bigram model is used to describe the relation between words, and dynamic programming is used for decoding. We evaluate the feasibility of the new input method and also evaluate the Stanford Segmenter. In the experiments, we first evaluate the Stanford Segmenter on Simplified Chinese and Traditional Chinese. The precision and recall on Simplified Chinese are 84.52% and 85.20%, which is better than the 68.43% and 63.43% obtained on Traditional Chinese. We then evaluate system performance with language models trained separately on the WIKI corpus and the ASBC corpus. The sentence and word accuracy for the ASBC corpus are 39.8% and 70.3%, and the word and character accuracy for the WIKI corpus are 20.3% and 53.3%. Finally, we evaluate the effect of the number of candidate hypotheses. The results show that sentence accuracy with 10 hypotheses is close to that with 20 hypotheses. smoothing language model Chinese input method dynamic programming
6	Neuronové jazykové modely zohledňující morfologii pro strojový překlad / Neural Language Models with Morphology for Machine Translation Musil, Tomáš January 2017 Language models play an important role in many natural language processing tasks. In this thesis, we focus on language models built on artificial neural networks. We examine the possibilities of using morphological annotations in these models. We propose a neural network architecture for a language model that explicitly makes use of morphological annotation of the input sentence: instead of word forms it processes lemmata and morphological tags. Both the baseline and the proposed method are evaluated on their own by perplexity, and also in the context of machine translation by means of automatic translation quality evaluation. While in isolation the proposed model significantly outperforms the baseline, there is no apparent gain in machine translation.
7	French AXA Insurance Word Embeddings : Effects of Fine-tuning BERT and Camembert on AXA France’s data Zouari, Hend January 2020 In this study we explore the state-of-the-art Natural Language Processing technologies that transform textual data into numerical representations. We go through the theory of the traditional methods as well as the most recent ones. This thesis focuses on the recent advances in Natural Language Processing being developed upon the Transfer model. One of the most relevant innovations was the release of a deep bidirectional encoder called BERT, which broke several state-of-the-art results. BERT utilises Transfer Learning to improve the modelling of language dependencies in text. BERT is used for several different languages, and other specialized models have been released, such as the French BERT, CamemBERT. This thesis compares the language models of these pre-trained models and their capability to ensure domain adaptation. Using the multilingual and the French pre-trained versions of BERT and a dataset of over 60 million words drawn from AXA France’s emails, clients’ messages, legal documents, and insurance documents, we fine-tuned the language models to adapt them to AXA’s French insurance context and create a French AXAInsurance BERT model. We evaluate the performance of this model on the capability of the language model to predict a masked token based on the context. Without fine-tuning, BERT models AXA’s French insurance text better than CamemBERT. However, with this small amount of data, CamemBERT is more capable of adapting to this specific insurance domain. NLP Language model Word embedding BERT camemBERT Computer and Information Sciences
8	Resource-dependent acoustic and language modeling for spoken keyword search Chen, I-Fan 27 May 2016 In this dissertation, three research directions were explored to alleviate two major issues, i.e., the use of incorrect models and training/test condition mismatches, in the modeling frameworks of modern spoken keyword search (KWS) systems. Each of the three research directions, which include (i) data-efficient training processes, (ii) system optimization objectives, and (iii) data augmentation, utilizes different types and amounts of training resources in different ways to ameliorate the two issues of acoustic and language modeling in modern KWS systems. More specifically, resource-dependent keyword modeling, keyword-boosted sMBR (state-level minimum Bayes risk) training, and multilingual acoustic modeling are proposed and investigated for acoustic modeling in this research. For language modeling, keyword-aware language modeling, discriminative keyword-aware language modeling, and web text augmented language modeling are presented and discussed. The dissertation provides a comprehensive collection of solutions and strategies for the acoustic and language modeling problems in KWS. It also offers insights into the realization of good-performance KWS systems. Experimental results show that the data-efficient training process and data augmentation are the two directions providing the most prominent performance improvement for KWS systems, while modifying system optimization objectives provides a smaller yet consistent performance enhancement in KWS systems with different configurations. The effects of the proposed acoustic and language modeling approaches in the three directions are also shown to be additive and can be combined to further improve overall KWS system performance. Spoken keyword search Keyword spotting Acoustic model Language model Speech recognition
9	Automatic Transcript Generator for Podcast Files Holst, Andy January 2010 The Internet has become a popular medium in the modern world, yet people with speech and hearing disabilities, as well as search engines, cannot access the speech content of podcast files. To partially solve this problem, Sphinx decoders such as Sphinx-3 and Sphinx-4 can be used to implement an Auto Transcript Generator application, either by coupling an existing large acoustic model, language model, and dictionary, or by training your own large acoustic model and language model and creating your own dictionary, to support a continuous, speaker-independent speech recognition system. speech recognition auto transcript generator implementation podcast acoustic model language model dictionary Computer science
10	Large Vocabulary Continuous Speech Recognition For Turkish Using HTK Comez, Murat Ali 01 January 2003 This study aims to build a new language model that can be used in a Turkish large vocabulary continuous speech recognition system. Turkish is a very productive language in terms of word forms because of its agglutinative nature. For languages such as Turkish, the required vocabulary size is far from acceptable: from a single simple stem, thousands of new word forms can be generated using inflectional or derivational suffixes. In this thesis, words are parsed into their stems and endings, where an ending comprises the suffixes attached to the associated root. A search network based on bigrams is then constructed. Bigrams are obtained either using stems and endings, or using only stems. The proposed language model is based on bigrams obtained using only stems. All work is done in the HTK (Hidden Markov Model Toolkit) environment, except parsing and network transformation. Besides offering a new language model for Turkish, this study includes a comprehensive examination of the concepts behind state-of-the-art speech recognition systems. To acquire a good command of these concepts and processes, isolated word, connected word, and continuous speech recognition tasks are performed. The experimental results for these tasks are also given.
