1 |
Universal Constraint Language - Piják, Peter, January 2011
Title: Universal Constraint Language Author: Peter Piják Department / Institute: Department of Software Engineering Supervisor of the master thesis: Mgr. Martin Nečaský, Ph.D. Abstract: Today's software applications are typically composed of multiple application components. When modeling software, various integrity constraint languages are used for particular parts of the model (e.g. OCL for UML class diagrams, Schematron for XML, or SQL triggers for relational databases). Constraint expressions then need to be converted to expressions over different meta-models, which is a non-trivial task. In this thesis, a new common language, the Universal Constraint Language (UCL), for expressing integrity constraints over various data meta-models is introduced. It is formally defined and a parser for it is implemented. We also present semi-automatic translation of constraints between various meta-models, and derivation of constraints in specific constraint languages from the introduced language. Keywords: constraint language, model-driven architecture, universal formalism
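A minimal sketch of the kind of cross-meta-model derivation the abstract describes, under loud assumptions: the names below (Constraint, to_ocl, to_sql_check) and the tiny constraint representation are illustrative only, not the thesis's actual UCL syntax.

```python
# Hypothetical sketch: one abstract constraint emitted into two target
# constraint languages (OCL and a SQL CHECK). Not the real UCL grammar.
from dataclasses import dataclass

@dataclass
class Constraint:
    entity: str      # modeled class / table
    attribute: str   # constrained attribute / column
    op: str          # comparison operator, e.g. ">="
    value: int       # literal bound

    def to_ocl(self) -> str:
        # OCL invariant over a UML class
        return (f"context {self.entity} inv: "
                f"self.{self.attribute} {self.op} {self.value}")

    def to_sql_check(self) -> str:
        # Equivalent relational CHECK constraint
        return (f"ALTER TABLE {self.entity} ADD CONSTRAINT chk_{self.attribute} "
                f"CHECK ({self.attribute} {self.op} {self.value});")

c = Constraint("Person", "age", ">=", 18)
print(c.to_ocl())        # context Person inv: self.age >= 18
print(c.to_sql_check())  # ALTER TABLE Person ADD CONSTRAINT chk_age CHECK (age >= 18);
```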
|
2 |
Unsupervised Clustering and Automatic Language Model Generation for ASR - Podder, Sushil, January 2004
The goal of an automatic speech recognition system is to enable the computer to understand human speech and act accordingly. In order to realize this goal, language modeling plays an important role: it works as a knowledge source, mimicking the human comprehension mechanism in understanding the language. Among many other approaches, statistical language modeling is widely used in automatic speech recognition systems. However, generating a reliable and robust statistical model is a very difficult task, especially for a large vocabulary system, where the performance of the language model degrades as the vocabulary size increases. Hence, the performance of the speech recognition system also degrades, due to the increased complexity and the mutual confusion among candidate words in the language model. Solving these problems requires reducing the language model size as well as minimizing the mutual confusion between words. In our work, we have employed clustering techniques, using a self-organizing map, to build topical language models. Moreover, in order to capture the inherent semantics of sentences, a lexical dictionary, WordNet, has been used in the clustering process. This thesis focuses on various aspects of clustering, language model generation, extraction of task-dependent acoustic parameters, and their implementation under the framework of the CMU Sphinx3 speech engine decoder. The preliminary results presented in this thesis show the effectiveness of the topical language models.
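A minimal sketch of the clustering step, assuming documents are already represented as term vectors; the thesis pipeline additionally folds in WordNet senses and trains a per-cluster language model for Sphinx3, which is omitted here.

```python
# Hand-rolled self-organizing map grouping documents into topical clusters.
# Grid size, learning schedule, and the random vectors are illustrative.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((200, 50))            # 200 documents, 50-dim term vectors

grid_w, grid_h, dim = 4, 4, docs.shape[1]
weights = rng.random((grid_w * grid_h, dim))
coords = np.array([(i, j) for i in range(grid_w) for j in range(grid_h)])

for t in range(1000):
    lr = 0.5 * (1 - t / 1000)            # decaying learning rate
    radius = 2.0 * (1 - t / 1000) + 0.5  # decaying neighborhood radius
    x = docs[rng.integers(len(docs))]
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
    dist = ((coords - coords[bmu]) ** 2).sum(axis=1)
    h = np.exp(-dist / (2 * radius ** 2))              # neighborhood kernel
    weights += lr * h[:, None] * (x - weights)

clusters = [int(np.argmin(((weights - d) ** 2).sum(axis=1))) for d in docs]
# Each cluster's documents would then train one topical n-gram model.
```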
|
3 |
Study of Pretraining Bias and Frequencies - Taware, Rutuja Murlidhar, 10 July 2023
Language models used in an in-context learning setting have been adopted for a wide range of tasks. Recent works have showcased the impact of pretraining data on the in-context performance of language models. In this work, we experiment with numbers having high and low frequencies in the pretraining data to understand the impact of term frequencies on the model's performance. We also experiment with random and adversarial demonstrations to understand the pretraining bias present in the model. Through these experiments, we showcase the importance of the pretraining frequencies of the numbers present in the demonstrations and explain how highly frequent terms can be used in the demonstrations to achieve better task performance. Moreover, we also show the impact of pretraining bias on the model's performance and explain how the model overcomes this bias with more demonstrations. / Master of Science / Recent works focus on understanding and improving the arithmetic capabilities of state-of-the-art (SOTA) systems in the domain of Natural Language Processing (NLP). This work focuses on designing and performing novel experiments to analyze the impact of training data on the performance of such systems. Through these experiments, this work showcases interesting properties of SOTA systems, which will promote future research to understand them better as well as help in creating better downstream applications.
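A hedged sketch of how such a probe might be set up: few-shot arithmetic demonstrations whose operands are either frequent or rare in a stand-in corpus. The toy corpus and the query_model() call are hypothetical placeholders, not the thesis's actual setup.

```python
# Build two demonstration sets: one from operands frequent in the
# (stand-in) pretraining corpus, one from rare operands, then compare
# model accuracy on the same held-out query.
from collections import Counter

corpus_tokens = ["17", "17", "17", "42", "903", "17", "42"]  # stand-in corpus
freq = Counter(corpus_tokens)

def make_prompt(operands, query=(7, 8)):
    shots = "\n".join(f"{a} + {b} = {a + b}" for a, b in operands)
    return f"{shots}\n{query[0]} + {query[1]} ="

frequent = [(17, 42), (17, 17)]      # high pretraining frequency
rare = [(903, 761), (881, 903)]      # low pretraining frequency
print(make_prompt(frequent))
print(make_prompt(rare))
# acc_frequent = query_model(make_prompt(frequent))  # hypothetical API call
```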
|
4 |
Chinese input method based on reduced phonetic transcription - Hsu, Feng-Ho, 22 May 2012
In this paper, we investigate a highly efficient Chinese input method. In the traditional Mandarin phonetic input method, users have to input the complete Mandarin phonetic symbols. The proposed method instead transforms a sequence of initial Mandarin phonetic symbols into a character sequence: users input only the first Mandarin phonetic symbol of each character, inserting spaces between words, and the system outputs candidate character sequence hypotheses. A bigram model describes the relations between words, and dynamic programming is used for decoding. We evaluate the feasibility of the new input method and also evaluate the Stanford segmenter. In the experiments, we first evaluate how the Stanford segmenter performs on Simplified and Traditional Chinese: precision and recall on Simplified Chinese are 84.52% and 85.20%, better than the 68.43% and 63.43% obtained on Traditional Chinese. We then evaluate system performance with language models trained separately on the WIKI corpus and the ASBC corpus. Sentence and word accuracy on the ASBC corpus are 39.8% and 70.3%, while word and character accuracy on the WIKI corpus are 20.3% and 53.3%. Finally, we examine the number of candidate hypotheses: with 10 hypotheses and with 20 hypotheses, the sentence accuracies are close.
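A toy sketch of the described decoding, assuming an invented lexicon and invented bigram probabilities: each initial phonetic symbol expands to candidate words, and dynamic programming with beam pruning keeps the N best hypotheses.

```python
# Beam decoding over candidate words with a bigram model. The lexicon
# (initial symbol -> words) and probabilities are made up for the example.
import math

candidates = {"b": ["ba", "bei"], "h": ["hao", "hen"]}
bigram = {("<s>", "ba"): 0.4, ("<s>", "bei"): 0.6,
          ("ba", "hao"): 0.7, ("ba", "hen"): 0.3,
          ("bei", "hao"): 0.2, ("bei", "hen"): 0.8}

def decode(initials, n_best=20):
    beams = [(0.0, ["<s>"])]             # (log probability, word sequence)
    for sym in initials.split():
        new_beams = []
        for logp, seq in beams:
            for w in candidates[sym]:
                p = bigram.get((seq[-1], w), 1e-6)   # floor unseen bigrams
                new_beams.append((logp + math.log(p), seq + [w]))
        new_beams.sort(reverse=True)
        beams = new_beams[:n_best]       # keep the N best hypotheses
    return [(" ".join(seq[1:]), lp) for lp, seq in beams]

print(decode("b h"))   # ranked candidate word sequences for input "b h"
```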
|
5 |
Neuronové jazykové modely zohledňující morfologii pro strojový překlad / Neural Language Models with Morphology for Machine Translation - Musil, Tomáš, January 2017
Language models play an important role in many natural language processing tasks. In this thesis, we focus on language models built on artificial neural networks. We examine the possibilities of using morphological annotations in these models. We propose a neural network architecture for a language model that explicitly makes use of the morphological annotation of the input sentence: instead of word forms, it processes lemmata and morphological tags. Both the baseline and the proposed method are evaluated on their own by perplexity, and also in the context of machine translation by means of automatic translation quality evaluation. While in isolation the proposed model significantly outperforms the baseline, there is no apparent gain in machine translation.
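A minimal PyTorch sketch of the stated idea, lemma and tag embeddings concatenated before the recurrent layer, with illustrative vocabulary sizes and dimensions; the thesis's actual architecture and training setup may differ.

```python
# The model reads a lemma index and a morphological-tag index per token
# instead of a word form, and predicts the next lemma and tag separately.
import torch
import torch.nn as nn

class MorphLM(nn.Module):
    def __init__(self, n_lemmata=10000, n_tags=1500, d_lem=128, d_tag=32, d_hid=256):
        super().__init__()
        self.lem_emb = nn.Embedding(n_lemmata, d_lem)
        self.tag_emb = nn.Embedding(n_tags, d_tag)
        self.rnn = nn.LSTM(d_lem + d_tag, d_hid, batch_first=True)
        self.out_lem = nn.Linear(d_hid, n_lemmata)   # next-lemma logits
        self.out_tag = nn.Linear(d_hid, n_tags)      # next-tag logits

    def forward(self, lemmata, tags):
        x = torch.cat([self.lem_emb(lemmata), self.tag_emb(tags)], dim=-1)
        h, _ = self.rnn(x)
        return self.out_lem(h), self.out_tag(h)

model = MorphLM()
lem = torch.randint(0, 10000, (2, 7))   # batch of 2 sentences, 7 tokens each
tag = torch.randint(0, 1500, (2, 7))
logits_lem, logits_tag = model(lem, tag)
```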
|
6 |
Sind Sprachmodelle in der Lage die Arbeit von Software-Testern zu übernehmen?: automatisierte JUnit Testgenerierung durch Large Language Models / Are Language Models Able to Take Over the Work of Software Testers?: Automated JUnit Test Generation with Large Language Models - Schäfer, Nils, 20 September 2024
This bachelor's thesis examines the quality of language models in the context of generating unit tests for Java applications. The goal of the thesis is to analyze to what extent JUnit tests can be generated automatically using language models, and to infer from this the quality with which they can take over and replace the work of software testers. To this end, an automated test-creation system is designed and implemented as a Python command-line tool that generates test cases via requests to the language model (a sketch of this loop follows the table of contents below). To measure its quality, the generated tests are adopted without manual intervention. As the basis for the evaluation, a run is carried out in which tests are generated for three Java Maven projects of differing complexity. The subsequent analysis uses a fixed evaluation procedure that assesses test code coverage and success rate and compares them with manual tests. The results show that language models are able to generate JUnit tests with satisfactory test coverage, but exhibit an insufficient success rate compared to manual tests. It becomes clear that, due to quality deficiencies in the generated test code, they cannot fully replace the work of software testers. However, they offer a way to take over test-creation processes that end with a subsequent manual review, thereby reducing testers' workload.
Contents:
List of Figures
List of Tables
List of Source Code Listings
List of Abbreviations
1 Introduction
1.1 Problem Statement
1.2 Objectives
2 Fundamentals
2.1 Software Development Lifecycle
2.2 Large Language Models
2.2.1 Definition and Introduction
2.2.2 Generative Pre-trained Transformer
2.3 Prompt Engineering
2.3.1 Prompt Elements
2.3.2 Prompt Techniques
2.4 Unit Testing
2.4.1 Fundamentals
2.4.2 Java with JUnit 5
2.5 SonarQube
3 Design
3.1 Prerequisites
3.2 Requirements Analysis
3.3 Choice of the Large Language Model
3.4 Prompt Design
3.5 Program Flowchart
4 Implementation
4.1 Features
4.1.1 User Query
4.1.2 Discovery of Java Files in the Project
4.1.3 Prompt Construction
4.1.4 API Request for Test Generation
4.1.5 Test Verification with Repair Rounds
4.1.6 Logging
4.2 Integration of SonarQube, Plugins, and Dependencies
4.3 Test Run
5 Execution and Analysis
5.1 Execution
5.2 Evaluation of the Tests
5.2.1 Line Coverage
5.2.2 Branch Coverage
5.2.3 Overall Coverage
5.2.4 Success Rate
5.3 Test Code Analysis
5.4 Comparison with Manual Test Results
5.5 Interpretation of the Results
6 Conclusion
6.1 Conclusions
6.2 Outlook
Bibliography
A Appendix: Source Code
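A hypothetical sketch of the generate-run-repair loop described in the abstract (and outlined in sections 4.1.4 and 4.1.5): prompt a language model for a JUnit test class, write it into the Maven project, run the tests, and retry with the build output for a fixed number of repair rounds. generate_tests() stands in for the actual API request; none of this is the thesis's real implementation.

```python
# Generate, run, and repair LLM-produced JUnit tests for one Java file.
import pathlib
import subprocess

def generate_tests(source_code: str, feedback: str = "") -> str:
    # Placeholder for the language-model API request that returns the
    # text of a JUnit test class for the given source (and error feedback).
    raise NotImplementedError

def create_tests(java_file: str, project_dir: str, repair_rounds: int = 3) -> bool:
    source = pathlib.Path(java_file).read_text(encoding="utf-8")
    test_path = pathlib.Path(project_dir, "src/test/java/GeneratedTest.java")
    feedback = ""
    for _ in range(repair_rounds + 1):
        test_path.write_text(generate_tests(source, feedback), encoding="utf-8")
        result = subprocess.run(["mvn", "test"], cwd=project_dir,
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True                    # tests compile and pass
        feedback = result.stdout[-4000:]   # feed build errors back to the model
    return False                           # gave up after the repair rounds
```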
|
7 |
French AXA Insurance Word Embeddings: Effects of Fine-tuning BERT and CamemBERT on AXA France's data - Zouari, Hend, January 2020
In this study, we explore the different state-of-the-art Natural Language Processing technologies that allow transforming textual data into numerical representations. We review the theory of existing traditional methods as well as the most recent ones. This thesis focuses on recent advances in Natural Language Processing built upon the Transformer model. One of the most relevant innovations was the release of a deep bidirectional encoder called BERT, which broke several state-of-the-art results. BERT utilises transfer learning to improve the modelling of language dependencies in text. BERT is used for several different languages, and other specialized models have been released, such as the French BERT: CamemBERT. This thesis compares the language models of these different pre-trained models and their capability to ensure domain adaptation. Using the multilingual and the French pre-trained versions of BERT and a dataset of AXA France's emails, client messages, legal documents, and insurance documents containing over 60 million words, we fine-tuned the language models to adapt them to the AXA France insurance context, creating a French AXA Insurance BERT model. We evaluate the performance of this model on the language model's capability to predict a masked token based on the context. Without fine-tuning, BERT proves to perform better, modelling the French AXA insurance text better than CamemBERT. However, with this small amount of data, CamemBERT is more capable of adapting to this specific insurance domain.
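A sketch of the masked-token comparison using the public Hugging Face checkpoints; the fine-tuned AXA models and data are internal, so this only illustrates the evaluation mechanics on an invented insurance-flavored sentence.

```python
# Compare multilingual BERT and CamemBERT on a French fill-mask query.
from transformers import pipeline

for name in ["bert-base-multilingual-cased", "camembert-base"]:
    fill = pipeline("fill-mask", model=name)
    mask = fill.tokenizer.mask_token          # "[MASK]" vs "<mask>"
    preds = fill(f"Le contrat d'assurance couvre les {mask} du client.")
    print(name, [(p["token_str"], round(p["score"], 3)) for p in preds[:3]])
```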
|
8 |
Resource-dependent acoustic and language modeling for spoken keyword search - Chen, I-Fan, 27 May 2016
In this dissertation, three research directions were explored to alleviate two major issues, i.e., the use of incorrect models and training/test condition mismatches, in the modeling frameworks of modern spoken keyword search (KWS) systems. Each of the three research directions, which include (i) data-efficient training processes, (ii) system optimization objectives, and (iii) data augmentation, utilizes different types and amounts of training resources in different ways to ameliorate the two issues of acoustic and language modeling in modern KWS systems. To be more specific, resource-dependent keyword modeling, keyword-boosted sMBR (state-level minimum Bayes risk) training, and multilingual acoustic modeling are proposed and investigated for acoustic modeling in this research. For language modeling, keyword-aware language modeling, discriminative keyword-aware language modeling, and web text augmented language modeling are presented and discussed. The dissertation provides a comprehensive collection of solutions and strategies to the acoustic and language modeling problems in KWS. It also offers insights into the realization of good-performance KWS systems. Experimental results show that the data-efficient training process and data augmentation are the two directions providing the most prominent performance improvement for KWS systems, while modifying system optimization objectives provides smaller yet consistent performance enhancements across KWS systems with different configurations. The effects of the proposed acoustic and language modeling approaches in the three directions are also shown to be additive and can be combined to further improve the overall KWS system performance.
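An illustrative sketch of two of the language-modeling strategies mentioned, web-text augmentation by linear interpolation and keyword-aware probability boosting, with invented counts, weights, and boost factor; real systems tune these on held-out data and operate on full n-gram models rather than unigrams.

```python
# Interpolate an in-domain unigram model with web-text counts, then
# boost the probability mass of the search keywords and renormalize.
from collections import Counter

in_domain = Counter({"open": 40, "the": 120, "door": 25})
web_text = Counter({"open": 900, "the": 9000, "door": 150, "source": 600})
keywords = {"door"}

def unigram_prob(model):
    total = sum(model.values())
    return {w: c / total for w, c in model.items()}

p_in, p_web = unigram_prob(in_domain), unigram_prob(web_text)
lam, boost = 0.7, 3.0
vocab = set(p_in) | set(p_web)
p = {w: lam * p_in.get(w, 0) + (1 - lam) * p_web.get(w, 0) for w in vocab}
p = {w: (boost if w in keywords else 1.0) * v for w, v in p.items()}
z = sum(p.values())
p = {w: v / z for w, v in p.items()}    # renormalize after the keyword boost
print(sorted(p.items(), key=lambda kv: -kv[1]))
```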
|
9 |
Automatic Transcript Generator for Podcast Files - Holst, Andy, January 2010
<p>In the modern world, Internet has become a popular place, people with speech hearing disabilities and search engines can't take part of speech content in podcast les. In order to solve the problem partially, the Sphinx decoders such as Sphinx-3, Sphinx-4 can be used to implement a Auto Transcript Generator application, by coupling already existing large acoustic model, language model and a existing dictionary, or by training your own large acoustic model, language model and creating your own dictionary to support continuous speaker independent speech recognition system.</p>
|