51 |
Fake News Detection : Using a Large Language Model for Accessible Solutions. Jurgell, Fredrik; Borgman, Theodor. January 2024.
In an attempt to create a fake news detection tool using a large language model (LLM), the emphasis is on validating the effectiveness of this approach and then making the tooling readily available. The tool is built on the current gpt-4-turbo-preview model and its assistant capabilities, combined with simple prompts tailored to different objectives. While tools that detect fake news and simplify the process are not new, insight into how and why they work is not commonly available, most likely because the existing services are monetized. By building an open-source platform that others can expand upon, the project gives insight into the prompts used and provides a baseline for experimentation, further development, and inspiration. Articles that are not willfully written as fake but merely omit key data are, unsurprisingly, very hard to detect. However, common tabloid-style news, which is often shared to provoke an emotional response, shows more promising detection results.
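To make the prompt-driven setup concrete, the sketch below screens a single article with a chat-completion call to the model named in the abstract. The prompt wording, label scheme, and helper function are illustrative assumptions, not the prompts or assistant configuration used in the thesis.

```python
# Minimal sketch of prompt-based fake-news screening with an LLM.
# The prompt text, labels, and function names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a fact-checking assistant. Assess whether the article below shows "
    "signs of misinformation: missing sources, emotional framing, unverifiable claims. "
    "Answer with a label (RELIABLE / SUSPICIOUS / LIKELY FAKE) and a one-sentence reason."
)

def screen_article(article_text: str, model: str = "gpt-4-turbo-preview") -> str:
    """Return the model's label and rationale for a single article."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": article_text},
        ],
        temperature=0,  # keep the screening output as repeatable as possible
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(screen_article("Scientists BAFFLED as miracle fruit cures all disease overnight!"))
```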
|
52 |
Automating Software Development Processes Through Multi-Agent Systems : A Study in LLM-based Software Engineering / Automatisering av Mjukvaruutvecklingsprocesser genom användning av Multi-Agent System : En studie inom LLM-baserad mjukvaruutveckling. Peltomaa Åström, Samuel; Winoy, Simon. January 2024.
In the ever-evolving landscape of software development, the demand for more efficient, scalable, and automated processes is paramount. The advancement of generative AI has unveiled new avenues for addressing this demand. This thesis explores one such avenue through the use of multi-agent systems combined with Large Language Models (LLMs) to automate tasks within the development lifecycle. The thesis presents a structure for designing and developing an LLM-based multi-agent application, encompassing agent design principles, strategies for facilitating multi-agent collaboration, and insights into the selection of an appropriate agent framework. Furthermore, the thesis demonstrates the developed application's problem-solving capabilities with quantitative benchmarking results and showcases practical implementations through examples of real-world applications. The study demonstrates the potential of LLM-based multi-agent systems to enhance software development efficiency, offering companies a promising and powerful tool for streamlining software engineering workflows.
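As a rough, framework-agnostic illustration of multi-agent collaboration over a development task, the sketch below hands a task from a planner agent to a coder and a reviewer. The roles, the fixed hand-off order, and the `complete` callback are assumptions for illustration; they do not reproduce the agent framework selected in the thesis.

```python
# Minimal, framework-agnostic sketch of LLM agents collaborating on a task.
# `complete` is a placeholder for whatever LLM backend is used.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    role_prompt: str                  # persistent instructions defining the agent's role
    complete: Callable[[str], str]    # LLM backend: prompt in, text out

    def act(self, task: str, context: str = "") -> str:
        prompt = f"{self.role_prompt}\n\nTask:\n{task}\n\nContext so far:\n{context}"
        return self.complete(prompt)

def run_pipeline(task: str, complete: Callable[[str], str]) -> str:
    planner = Agent("planner", "You break a software task into concrete implementation steps.", complete)
    coder = Agent("coder", "You write code that follows the given plan exactly.", complete)
    reviewer = Agent("reviewer", "You review the code against the plan and list defects.", complete)

    plan = planner.act(task)                                   # step 1: plan
    code = coder.act(task, context=plan)                       # step 2: implement
    review = reviewer.act(task, context=f"PLAN:\n{plan}\n\nCODE:\n{code}")  # step 3: review
    return f"{code}\n\n# Review notes:\n{review}"
```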
|
53 |
Arabic text recognition of printed manuscripts : efficient recognition of off-line printed Arabic text using Hidden Markov Models, Bigram Statistical Language Model, and post-processing. Al-Muhtaseb, Husni Abdulghani. January 2010.
Arabic text recognition has not been researched as thoroughly as that of other natural languages, yet the need for automatic Arabic text recognition is clear. In addition to traditional applications like postal address reading, check verification in banks, and office automation, there is large interest in searching scanned documents available on the internet and in searching handwritten manuscripts. Other possible applications are building digital libraries, recognizing text on digitized maps, recognizing vehicle license plates, serving as a first phase in text readers for visually impaired people, and understanding filled forms. This research work aims to contribute to current research in the field of optical character recognition (OCR) of printed Arabic text by developing novel techniques and schemes to advance the performance of state-of-the-art Arabic OCR systems. Statistical and analytical analysis of Arabic text was carried out to estimate the probabilities of occurrence of Arabic characters for use with Hidden Markov Models (HMMs) and other techniques. Since there is no publicly available dataset of printed Arabic text for recognition purposes, it was decided to create one. In addition, a minimal Arabic script is proposed; it contains all basic shapes of Arabic letters and provides an efficient representation of Arabic text in terms of effort and time. Based on the success of HMMs for speech and text recognition, their use for the automatic recognition of Arabic text was investigated. The HMM technique adapts to noise and font variations and does not require word or character segmentation of Arabic line images. In the feature extraction phase, experiments were conducted with a number of different features to investigate their suitability for HMMs, and a novel set of features, which resulted in high recognition rates for different fonts, was selected. The developed techniques do not need word or character segmentation before the classification phase, as segmentation is a byproduct of recognition. This is the most advantageous aspect of using HMMs for Arabic text, since segmentation tends to produce errors which are usually propagated to the classification phase. Eight different Arabic fonts were used in the classification phase, and recognition rates ranged from 98% to 99.9% depending on the font. As far as we know, these are new results in their context. Moreover, the proposed technique could be used for other languages: a proof-of-concept experiment on English characters achieved a recognition rate of 98.9% using the same HMM setup, and the same techniques applied to Bangla characters achieved a recognition rate above 95%. The recognition of printed Arabic text with multiple fonts was also conducted using the same technique; fonts were categorized into different groups, and new high recognition results were achieved. To enhance the recognition rate further, a post-processing module was developed to correct the OCR output through character-level and word-level post-processing. The use of this module increased the recognition rate by more than 1%.
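The word-level post-processing step can be pictured as choosing, for each OCR output word, the lexicon candidate that best fits a bigram language model. The sketch below assumes add-alpha smoothing, edit-distance-1 candidates, and a greedy left-to-right pass; these are simplifications, not necessarily the exact scheme used in the thesis.

```python
# Sketch of word-level OCR post-processing with a bigram language model.
import math
from collections import defaultdict

def train_bigram(corpus_sentences, alpha=0.1):
    """Return add-alpha smoothed log P(w2 | w1) plus the lexicon."""
    unigram, bigram = defaultdict(int), defaultdict(int)
    for sent in corpus_sentences:
        words = ["<s>"] + sent.split()
        for w1, w2 in zip(words, words[1:]):
            unigram[w1] += 1
            bigram[(w1, w2)] += 1
    vocab = {w for s in corpus_sentences for w in s.split()}
    V = len(vocab) + 1
    def logp(w1, w2):
        return math.log((bigram[(w1, w2)] + alpha) / (unigram[w1] + alpha * V))
    return logp, vocab

def candidates(word, vocab):
    """Lexicon words within edit distance 1 of the OCR output word."""
    letters = set("".join(vocab))
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    subs = {a + c + b[1:] for a, b in splits if b for c in letters}
    dels = {a + b[1:] for a, b in splits if b}
    ins = {a + c + b for a, b in splits for c in letters}
    cands = ({word} | subs | dels | ins) & vocab
    return cands or {word}   # fall back to the raw OCR word if nothing matches

def correct(ocr_words, logp, vocab):
    """Greedy left-to-right correction maximising the bigram score."""
    prev, out = "<s>", []
    for w in ocr_words:
        best = max(candidates(w, vocab), key=lambda c: logp(prev, c))
        out.append(best)
        prev = best
    return out
```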
|
54 |
Stream-based statistical machine translation. Levenberg, Abby D. January 2011.
We investigate a new approach to SMT system training within the streaming model of computation. We develop and test incrementally retrainable models which, given an incoming stream of new data, can efficiently incorporate the stream data online. A naive approach would use an unbounded amount of space; instead, our online SMT system can incorporate information from unbounded incoming streams while maintaining constant space and time. Crucially, we are able to match (or even exceed) the translation performance of comparable systems which are batch retrained and use unbounded space. Our approach is particularly suited to situations where there are arbitrarily large amounts of new training material that we wish to incorporate efficiently and in small space. The novel contributions of this thesis are: 1. An online, randomised language model that can model unbounded input streams in constant space and time. 2. An incrementally retrainable translation model for both phrase-based and grammar-based systems; the model presented is efficient enough to incorporate novel parallel text at the single-sentence level. 3. Strategies for updating our stream-based language model and translation model which demonstrate how such components can be successfully used in a streaming translation setting, both within a single streaming environment and in the novel situation of having to translate multiple streams. 4. A demonstration that recent data from the stream is beneficial to translation performance. Our stream-based SMT system is efficient for tackling massive volumes of new training data and offers up new ways of thinking about translating web data and dealing with other natural language streams.
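One standard way to count n-grams over an unbounded stream in fixed memory is a count-min sketch, shown below for a bigram model. This illustrates the constant-space idea; it is not necessarily the randomised structure used in the thesis.

```python
# Constant-space, online bigram counting with a count-min sketch.
import hashlib

class CountMinSketch:
    def __init__(self, width=2**20, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, key: str):
        for d in range(self.depth):
            h = hashlib.blake2b(f"{d}:{key}".encode(), digest_size=8).digest()
            yield d, int.from_bytes(h, "big") % self.width

    def add(self, key: str, count: int = 1):
        for d, idx in self._hashes(key):
            self.table[d][idx] += count

    def query(self, key: str) -> int:
        # Over-estimates are possible; the minimum across rows bounds the error.
        return min(self.table[d][idx] for d, idx in self._hashes(key))

class StreamBigramLM:
    """Maximum-likelihood bigram estimates backed by fixed-size sketches."""
    def __init__(self):
        self.uni, self.bi = CountMinSketch(), CountMinSketch()

    def observe(self, sentence: str):
        words = ["<s>"] + sentence.split()
        for w1, w2 in zip(words, words[1:]):
            self.uni.add(w1)
            self.bi.add(f"{w1} {w2}")

    def prob(self, w1: str, w2: str) -> float:
        denom = self.uni.query(w1)
        return self.bi.query(f"{w1} {w2}") / denom if denom else 0.0
```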
|
55 |
Amélioration a posteriori de la traduction automatique par métaheuristique. Lavoie-Courchesne, Sébastien. 03 1900.
Statistical machine translation is a field in great demand, and one where machines are still far from producing human-quality results. The main method used is a linear, segment-by-segment translation of a sentence, which prevents modification of parts of the sentence that have already been translated. The research in this thesis is based on the approach of Langlais, Patry and Gotti (2007), which tries to correct a completed translation by modifying segments according to a function to be optimized. As a first step, the exploration of new features, such as a reverse language model and a collocation model, adds a new dimension to the function being optimized. As a second step, the use of different metaheuristics, such as greedy and randomized greedy algorithms, allows a deeper exploration of the search space and a greater improvement of the objective function.
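The greedy and randomized-greedy search over segment replacements can be sketched as below. The `propose` and `score` callbacks are placeholders: in the thesis the objective combines features such as the reverse language model and the collocation model, which are abstracted here.

```python
# Sketch of greedy and randomised-greedy post-editing of a completed translation.
import random
from typing import Callable, List, Tuple

Edit = Tuple[int, str]  # (segment index, replacement text for that segment)

def greedy(segments: List[str],
           propose: Callable[[List[str]], List[Edit]],
           score: Callable[[List[str]], float],
           max_passes: int = 100) -> List[str]:
    """Keep applying improving segment replacements until none improves the score."""
    current, best = list(segments), score(segments)
    for _ in range(max_passes):
        improved = False
        for idx, repl in propose(current):
            cand = current[:idx] + [repl] + current[idx + 1:]
            s = score(cand)
            if s > best:
                current, best, improved = cand, s, True
        if not improved:
            break
    return current

def randomized_greedy(segments: List[str],
                      propose: Callable[[List[str]], List[Edit]],
                      score: Callable[[List[str]], float],
                      restarts: int = 10, seed: int = 0) -> List[str]:
    """Greedy descent from several randomly perturbed starting points."""
    rng = random.Random(seed)
    best_hyp, best_score = list(segments), score(segments)
    for _ in range(restarts):
        start = list(segments)
        edits = propose(start)
        for idx, repl in rng.sample(edits, k=min(2, len(edits))):
            start[idx] = repl  # random perturbation before the greedy descent
        hyp = greedy(start, propose, score)
        if score(hyp) > best_score:
            best_hyp, best_score = hyp, score(hyp)
    return best_hyp
```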
|
56 |
Probabilistic modelling of morphologically rich languages. Botha, Jan Abraham. January 2014.
This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex language well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes.
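A loose illustration of how shared sub-word structure links related words is to build a word's representation from morpheme vectors stored in a shared table, as sketched below. The additive composition, segmentation, and dimensionality are simplifying assumptions and do not reproduce the thesis's models.

```python
# Composing word representations from shared morpheme vectors (illustrative only).
import numpy as np

DIM = 64
rng = np.random.default_rng(0)
embeddings = {}   # shared table for surface forms and morphemes

def vec(unit: str) -> np.ndarray:
    """Look up (or lazily initialise) the embedding of a word or morpheme."""
    if unit not in embeddings:
        embeddings[unit] = rng.normal(scale=0.1, size=DIM)
    return embeddings[unit]

def word_vector(word: str, morphemes: list) -> np.ndarray:
    """Word representation = surface-form vector + sum of its morpheme vectors."""
    return vec(word) + sum((vec(m) for m in morphemes), start=np.zeros(DIM))

# "imperfection" and "perfectly" share the morpheme vector for "perfect",
# so what is learned about one form can transfer to the other.
v1 = word_vector("imperfection", ["im", "perfect", "ion"])
v2 = word_vector("perfectly", ["perfect", "ly"])
```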
|
57 |
Continuous space models with neural networks in natural language processing. Le, Hai Son. 20 December 2012.
The purpose of language models is, in general, to capture and model regularities of language, thereby capturing morphological, syntactic and distributional properties of word sequences in a given language. They play an important role in many successful applications of natural language processing, such as automatic speech recognition, machine translation and information extraction. The most successful approaches to date are based on the n-gram assumption and the adjustment of statistics from the training data by applying smoothing and back-off techniques, notably the Kneser-Ney technique, introduced twenty years ago. In this way, language models predict a word based on its n-1 previous words. In spite of their prevalence, conventional n-gram based language models still suffer from several limitations that could intuitively be overcome by consulting human expert knowledge. One critical limitation is that, ignoring all linguistic properties, they treat each word as a discrete symbol with no relation to the others. Another is that, even with a huge amount of data, data sparsity always has an important impact, so the optimal value of n in the n-gram assumption is often 4 or 5, which is insufficient in practice. Models of this kind are constructed from counts of n-grams in the training data, so their pertinence is conditioned only on the characteristics of the training text (its quantity, how well it represents the content in terms of theme and date). Recently, one of the most successful attempts to learn word similarities directly is the use of distributed word representations in language modeling, where words that are distributionally similar, i.e. that share semantic and syntactic properties, are expected to be represented as neighbors in a continuous space. These representations and the associated objective function (the likelihood of the training data) are jointly learned using a multi-layer neural network architecture; in this way, word similarities are learned automatically. This approach has shown significant and consistent improvements when applied to automatic speech recognition and statistical machine translation tasks. A major difficulty with the continuous-space neural network approach remains the computational burden, which does not scale well to the massive corpora that are nowadays available. For this reason, the first contribution of this dissertation is the definition of a neural architecture based on a tree representation of the output vocabulary, namely the Structured OUtput Layer (SOUL) model, which makes such models well suited for large-scale frameworks. The SOUL model combines the neural network approach with the class-based approach and achieves significant improvements on both state-of-the-art large-scale automatic speech recognition and statistical machine translation tasks. The second contribution is to provide several insightful analyses of these models' performance, their pros and cons, and the word space representations they induce. Finally, the third contribution is the successful adoption of the continuous-space neural network into a machine translation framework, where new translation models are proposed and reported to achieve significant improvements over state-of-the-art baseline systems.
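The tree-structured output idea can be written, in generic notation, as a product of small class-level softmaxes along the path from the root of the class tree to the word, so that no normalisation over the full vocabulary is needed:

```latex
% Generic form of a tree-structured (class-factored) output layer, of which a
% SOUL-style model is an instance; the notation is generic, not taken from the thesis.
P(w_t \mid h) \;=\; P\bigl(c_1(w_t) \mid h\bigr)\;
\prod_{d=2}^{D} P\bigl(c_d(w_t) \mid h,\, c_{1:d-1}(w_t)\bigr)
```

Here h is the word history and c_1(w_t), ..., c_D(w_t) are the classes on the path to w_t; each factor is a softmax over a small set of classes rather than over the whole vocabulary. The exact parameterisation of each factor in the SOUL model is not reproduced here.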
|
58 |
Turkish Large Vocabulary Continuous Speech Recognition By Using Limited Audio Corpus. Susman, Derya. 01 March 2012.
Speech recognition in the Turkish language is a challenging problem from several perspectives, and most of the challenges are related to the morphological structure of the language. Since Turkish is an agglutinative language, it is possible to generate many words from a single stem by using suffixes. This characteristic of the language increases the number of out-of-vocabulary (OOV) words, which degrade the performance of a speech recognizer dramatically. Turkish also allows words to be ordered in a relatively free manner, which makes it difficult to build robust language models. In this thesis, existing models and approaches that address the problem of Turkish LVCSR (Large Vocabulary Continuous Speech Recognition) are explored. Different recognition units (words, morphs, stems and endings) are used in generating the n-gram language models, and 3-gram and 4-gram language models are generated for each recognition unit. Since speech recognition relies on machine learning, the performance of the recognizer depends on the sufficiency of the audio data used in acoustic model training; however, it is difficult to obtain rich audio corpora for the Turkish language. In this thesis, existing approaches are used to address Turkish LVCSR with a limited audio corpus, and several data selection approaches are proposed in order to improve the robustness of the acoustic model.
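To illustrate how sub-word recognition units change language-model construction, the sketch below counts n-grams over morph sequences instead of words. The toy splitter stands in for a proper statistical morph segmenter and is only an assumption for illustration.

```python
# n-gram counting over morph units instead of whole words (illustrative only).
from collections import Counter
from typing import Iterable, List

def segment(word: str) -> List[str]:
    """Toy morph splitter: first 4 chars as a 'stem', the rest as one 'ending'."""
    return [word] if len(word) <= 4 else [word[:4], "+" + word[4:]]

def ngram_counts(sentences: Iterable[str], n: int = 3) -> Counter:
    counts = Counter()
    for sent in sentences:
        units = ["<s>"] * (n - 1)
        for word in sent.split():
            units.extend(segment(word))
        units.append("</s>")
        for i in range(len(units) - n + 1):
            counts[tuple(units[i:i + n])] += 1
    return counts

# Morph units shrink the vocabulary and reduce OOV rates, at the cost of
# longer unit sequences per sentence.
counts = ngram_counts(["evlerimizden geliyorum"], n=3)
```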
|
59 |
Accès à de l'information de type patrimoine culturel / Information Access in Cultural Heritage. Tan, Kian Lam. 30 April 2014.
With the explosive growth of digitization in cultural heritage, many cultural heritage institutions have been converting physical objects into digital or descriptive representations. However, the conversion has given rise to several issues: 1) the documents are descriptive in nature, 2) the documents are ambiguous and brief, 3) a dedicated vocabulary is used in the documents, and 4) there is variation in the terms used within a document. In addition, the use of inaccurate keywords results in a short-query problem. Most of the time, these issues are caused by aggregated faults in annotating the documents, while the short-query problem is caused by naive users who have little prior knowledge or experience in the cultural heritage domain. The main aim of this research is to model information access systems that partially overcome the issues arising from the documentation process and from the background of users of digital cultural heritage. Three types of information access tool are therefore introduced and established, namely an information retrieval system, a context search, and a mobile game on cultural heritage, which allow the user to access, learn about, and explore information on cultural heritage. The main idea for the information retrieval system and the context search is to incorporate the link relationships between terms into the language model by extending Dirichlet smoothing, in order to address the problems arising from both the documentation process and the background of the users. In addition, a preference model based on the theory of charging a capacitor is introduced to quantify the time-based cognitive context, and is integrated into the extended Dirichlet smoothing. A mobile game is also introduced that combines elements of monopoly and treasure-hunt games to mitigate the problems arising from the background of the users, especially their casual behavior. The first two approaches were tested on the Cultural Heritage in CLEF (CHiC) collection, which consists of short queries and documents; the results show that the approach is effective and yields better retrieval accuracy. Finally, a survey was carried out to investigate the third approach, and the results suggest that the game is able to help participants explore and learn information on cultural heritage. In addition, participants felt that an information-seeking tool integrated with the game can provide more information in a more convenient manner while playing the game and visiting the heritage sites within it. In conclusion, the results show that the proposed solutions are able to address the problems arising from the documentation process and the background of users of digital cultural heritage.
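For reference, the two standard ingredients named in the abstract take the following forms; the thesis's extension of the smoothing with term-link relationships and its exact preference-model parameterisation are not reproduced here:

```latex
% Standard Dirichlet-smoothed query-likelihood estimate (the basis that the
% thesis extends with term-link information):
P(w \mid d) \;=\; \frac{\mathit{tf}(w, d) + \mu\, P(w \mid C)}{|d| + \mu}

% Charging curve of a capacitor, the analogy used by the preference model to
% let a contextual preference grow and saturate over time t (R, C and the
% mapping to user context are model parameters assumed here):
V(t) \;=\; V_{\max}\left(1 - e^{-t/RC}\right)
```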
|
60 |
Modèles de langage ad hoc pour la reconnaissance automatique de la parole / Ad-hoc language models for automatic speech recognition. Oger, Stanislas. 30 November 2011.
The three pillars of an automatic speech recognition system are the lexicon, the language model and the acoustic model. The lexicon provides all the words that can be transcribed, associated with their pronunciation. The acoustic model provides an indication of how the phone units are pronounced, and the language model brings the knowledge of how words are linked. In modern automatic speech recognition systems, the acoustic and language models are statistical, and their estimation requires large volumes of data that have been selected, standardized and annotated. At present, the Web is by far the largest textual corpus available for the English and French languages. The data it holds can potentially be used to build the vocabulary and to estimate and adapt the language model. The work presented here proposes new approaches to take advantage of this resource in the context of language modeling. The document is organized into two parts. The first deals with the use of Web data to dynamically update the lexicon of the automatic speech recognition system. The proposed approach consists of increasing the lexicon dynamically and locally, only when unknown words appear in the speech. New words are extracted from the Web through the formulation of queries submitted to Web search engines, and their phonetization is obtained by an automatic grapheme-to-phoneme transcriber. The second part of the document presents a new way of handling the information contained on the Web by relying on possibility theory. A Web-based possibilistic language model is proposed; it provides an estimation of the possibility of a word sequence from knowledge of the existence of its sub-sequences on the Web. A probabilistic Web-based language model is also proposed, which relies on Web document counts to estimate n-gram probabilities. Several approaches for combining these models with classical models estimated on corpora are proposed. The results show that combining probabilistic and possibilistic models gives better results than classical probabilistic models alone. In addition, the models estimated from Web data perform better than those estimated on a corpus.
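A minimal sketch of the two Web-based estimates, assuming a hypothetical hit-count lookup `web_count` (no particular search engine API is implied), is given below; the 0/1 existence grade is a crude stand-in for the possibilistic model actually proposed.

```python
# Estimating n-gram scores from Web document counts (illustrative only).
from typing import Callable, List

def probabilistic_score(words: List[str], web_count: Callable[[str], int], n: int = 3) -> float:
    """Chain of relative-frequency estimates count(ngram) / count(history)."""
    score = 1.0
    for i in range(n - 1, len(words)):
        hist = " ".join(words[i - n + 1:i])
        ngram = " ".join(words[i - n + 1:i + 1])
        denom = web_count(hist)
        score *= (web_count(ngram) / denom) if denom else 0.0
    return score

def possibilistic_score(words: List[str], web_count: Callable[[str], int], n: int = 3) -> float:
    """Possibility of the sequence bounded by its least-attested n-gram."""
    grades = []
    for i in range(n - 1, len(words)):
        ngram = " ".join(words[i - n + 1:i + 1])
        grades.append(1.0 if web_count(ngram) > 0 else 0.0)  # crude 0/1 possibility
    return min(grades) if grades else 1.0
```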
|