11

A Study of Recurrent and Convolutional Neural Networks in the Native Language Identification Task

Werfelmann, Robert 24 May 2018 (has links)
Native Language Identification (NLI) is the task of predicting the native language of an author from text written in a second language. The idea is to find writing habits that transfer from an author's native language to their second language. Many approaches to this task have been studied, from simple word-frequency analysis to analyzing grammatical and spelling mistakes in search of patterns and traits common to authors who share a native language. This can be a very complex task, depending on the native language and the author's proficiency in the second language. The most common approach, which has seen very good results, is based on n-gram features of words and characters. In this thesis, we attempt to extract lexical, grammatical, and semantic features from the sentences of non-native English essays using neural networks. The training and testing data were obtained from a large corpus of publicly available essays written by authors from several countries around the world. The neural network models consisted of Long Short-Term Memory and Convolutional networks using the sentences of each document as input. Additional statistical features were generated from the text to complement the predictions of the neural networks and were then used, together with those predictions, as feature inputs to a Support Vector Machine, which made the final prediction. Results show that the Long Short-Term Memory network can improve on a naive bag-of-words approach while using a much smaller feature set. With more fine-tuning of the neural network hyperparameters, these results will likely improve significantly.
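To make the described pipeline concrete, here is a minimal sketch of the hybrid architecture, assuming toy token ids and randomly generated stand-ins for the essays and statistical features; all names and dimensions are illustrative, not the thesis configuration.

```python
# Hypothetical sketch: sentence-level LSTM encodings concatenated with
# hand-crafted statistical features, fed to an SVM for the final decision.
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

class SentenceEncoder(nn.Module):
    """Encodes a batch of token-id sequences into fixed-size vectors."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]                        # final hidden state per sequence

encoder = SentenceEncoder()
essays = torch.randint(0, 5000, (32, 40))    # 32 toy essays, 40 tokens each
stat_feats = np.random.rand(32, 10)          # e.g. error rates, type/token ratio
labels = np.random.randint(0, 11, 32)        # 11 native-language classes

with torch.no_grad():
    neural_feats = encoder(essays).numpy()

# The SVM makes the final prediction from both feature sources.
svm = SVC(kernel="linear").fit(np.hstack([neural_feats, stat_feats]), labels)
```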
12

Detekce změny jazyka při hovoru / Code Switching Detection in Speech

Povolný, Filip January 2015 (has links)
This master's thesis deals with code-switching detection in speech. The first part of the thesis describes the state-of-the-art methods of language diarization. The proposed method is based on an acoustic approach to language identification using a combination of GMMs, i-vectors, and LDA. A new Mandarin-English code-switching database was created for these experiments. Using this system, an accuracy of 89.3% is achieved on this database.
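A minimal sketch of the back-end decision step might look as follows, assuming segment-level i-vectors have already been extracted (random vectors stand in for them here): LDA separates the two languages, and a change in predicted label marks a candidate code-switch point.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Toy 100-dimensional "i-vectors" for labelled training segments.
train_ivecs = np.vstack([rng.normal(0.0, 1, (200, 100)),   # Mandarin segments
                         rng.normal(0.5, 1, (200, 100))])  # English segments
train_langs = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(train_ivecs, train_langs)

# Classify consecutive segments of a test conversation and flag switches.
test_ivecs = rng.normal(0.25, 1, (50, 100))
labels = lda.predict(test_ivecs)
switch_points = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
```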
13

Automatic language identification of short texts

Avenberg, Anna January 2020 (has links)
The world is growing more connected through the use of online communication, exposing software and humans to all the world's languages. While devices are able to understand and share raw data between themselves and with humans, the information itself is not expressed in a monolithic format. This causes issues both in human-to-computer interaction and in human-to-human communication. Automatic language identification (LID) is a field within artificial intelligence and natural language processing that strives to solve part of these issues by identifying languages from text, sign language, and speech. One of the challenges is to identify the short pieces of text found online, such as messages, comments, and posts on social media, because of the small amount of information they carry. The goal of this thesis has been to build a machine learning model that can identify the language of such short pieces of text. A long short-term memory (LSTM) model was built and benchmarked against Facebook's fastText model. The LSTM model reached an accuracy of around 95%, while the fastText model used as comparison reached 97%. The LSTM model struggled more with texts shorter than 50 characters than with longer texts, and its classification performance was also relatively poor where languages were similar, as with Croatian and Serbian. Both models reached accuracies above 94%, which can be considered high, depending on how it is evaluated. There are, however, many possible improvements and directions for future work: looking further into texts shorter than 50 characters, evaluating the model's softmax output vector values, and handling similar languages.
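The fastText comparison system can be reproduced along these lines, assuming the pretrained language-identification model lid.176.ftz has been downloaded from the fastText website; the example sentence is ours.

```python
import fasttext

model = fasttext.load_model("lid.176.ftz")

# Top-3 language predictions for one short text.
labels, probs = model.predict("ett kort meddelande på svenska", k=3)
for label, prob in zip(labels, probs):
    # Labels come back as '__label__sv' etc., with a confidence per label.
    print(label.replace("__label__", ""), round(float(prob), 3))
```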
14

An End-to-End Native Language Identification Model without the Need for Manual Annotation / En modersmålsidentifiering modell utan behov av manuell annotering

Buzaitė, Viktorija January 2022 (has links)
Native language identification (NLI) is a classification task that identifies the mother tongue of a language learner from spoken or written material. The task gained popularity when it was featured in the 2017 BEA-12 workshop, and since then many applications have been found for NLI, ranging from language learning to authorship identification and forensic science. While a considerable amount of research has already been done in this area, we introduce a novel approach that incorporates syntactic information into the implementation of a BERT-based NLI model. In addition, we train separate models to test whether erroneous input sequences perform better than corrected sequences. To answer these questions we carry out both a quantitative and a qualitative analysis. We also test our idea of implementing a BERT-based GEC model to supply more training data to our NLI model without the need for manual annotation. Our results suggest that our models do not outperform the SVM baseline, but we attribute this to the lack of training data in our dataset: transformer-based architectures like BERT need large amounts of data to be fine-tuned successfully, whereas simple linear models like SVMs perform well on small amounts of data. We also find that erroneous structures in the data prove useful when combined with syntactic information, but that neither boosts the performance of the NLI model on its own. Furthermore, our GEC system performs well enough to produce additional data for our NLI models, whose scores increase after this data is added in our second experiment. We believe that our proposed architecture is potentially suitable for the NLI task if extended as we suggest in the conclusion section.
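A fine-tuning setup of the kind described could be sketched as follows, assuming a small labelled corpus; the checkpoint name, label count, and example sentences are illustrative, not the thesis configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=11)    # e.g. 11 native-language classes

# Toy learner sentences with hypothetical native-language ids.
texts = ["She suggested me to read this book.", "I am agree with the author."]
labels = torch.tensor([3, 7])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # returns loss and per-class logits
outputs.loss.backward()                  # an optimiser step would follow here
```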
15

Robust Techniques Of Language Modeling For Spoken Language Identification

Basavaraja, S V January 2007 (has links)
Language Identification (LID) is the task of automatically identifying the language of a speech signal uttered by an unknown speaker. An N-language LID task is to classify an input speech utterance, spoken by an unknown speaker and of unknown text, as belonging to one of the N languages L1, L2, . . ., LN. We present a new approach to spoken language modeling for language identification using the Lempel-Ziv-Welch (LZW) algorithm, with which we try to overcome the limitations of n-gram stochastic models by automatically identifying the valid set of variable-length patterns from the training data. However, since many patterns in one language's pattern table are also shared by other languages' pattern tables, confusability remains in the LID task. To overcome this, three pruning techniques are proposed to make the pattern tables more language-specific. For LID with limited training data, we present another language modeling technique, which compensates for language-specific patterns missing from the language-specific LZW pattern table. We develop two new discriminative measures for LID based on the LZW algorithm, viz., (i) the Compression Ratio Score (LZW-CRS) and (ii) the Weighted Discriminant Score (LZW-WDS). It is shown that for a 6-language LID task on the OGI-TS database, the new model (LZW-WDS) significantly outperforms the conventional bigram approach. With regard to the front end of the LID system, we develop a modified technique for modeling Acoustic Sub-Word Units (ASWU) and explore its effectiveness. The segmentation of the speech signal is done using an acoustic criterion (ML-segmentation). However, we believe that consistency and discriminability among speech units is the key issue for the success of ASWU-based speech processing. We develop a new procedure for clustering and modeling the segments using sub-word GMMs. Because of the flexibility in choosing labels for the sub-word units, we perform an iterative re-clustering and modeling of the segments, and demonstrate the convergence of the iterations using a consistency measure over the labeling of the acoustic segments. We show that the new ASWU-based front end combined with the new LZW-based back end outperforms the earlier reported PSWR-based LID system.
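The compression-ratio idea (LZW-CRS) can be illustrated with a small sketch, assuming character strings stand in for the decoded symbol streams used in the thesis: a test string that compresses well under a language's LZW pattern table is likely to belong to that language. The pruning and weighting steps are omitted.

```python
def lzw_table(text):
    """Build the set of variable-length patterns seen while LZW-coding `text`."""
    table, current = set(text), ""
    for ch in text:
        if current + ch in table:
            current += ch
        else:
            table.add(current + ch)
            current = ch
    return table

def compression_ratio(text, table):
    """Input symbols per emitted code under a fixed table (higher = better fit)."""
    codes, current = 0, ""
    for ch in text:
        if current + ch in table:
            current += ch
        else:
            codes += 1
            current = ch
    codes += bool(current)
    return len(text) / max(codes, 1)

tables = {lang: lzw_table(corpus) for lang, corpus in
          {"en": "the cat sat on the mat " * 50,
           "de": "die katze sass auf der matte " * 50}.items()}
test = "the dog sat on the rug"
print(max(tables, key=lambda lang: compression_ratio(test, tables[lang])))
```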
16

Language identification using Gaussian mixture models

Nkadimeng, Calvin 03 1900 (has links)
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2010. / ENGLISH ABSTRACT: The importance of language identification for African languages is increasing dramatically owing to the development of telecommunication infrastructure and the resulting growth in volumes of data and speech traffic in public networks. By automatically processing raw speech data, the vital assistance given to people in distress can be sped up by referring their calls to a person knowledgeable in that language. To this effect a speech corpus was developed and various algorithms were implemented and tested on raw telephone speech data. These algorithms entailed data preparation, signal processing, and statistical analysis aimed at discriminating between languages. Gaussian Mixture Models (GMMs) were chosen as the statistical model for this research because of their ability to represent an entire language with a single stochastic model that does not require phonetic transcription. Language identification for African languages using GMMs is feasible, although a few challenges remain, such as proper classification and an accurate study of the relationships between the languages. Other methods that make use of phonetically transcribed data need to be explored and tested with the new corpus for the research to be more rigorous.
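A minimal sketch of the GMM back end, assuming acoustic feature vectors (e.g. MFCCs) have already been extracted; random arrays stand in for them and the language codes are illustrative. One mixture model is trained per language, and a test utterance is assigned to the language whose model yields the highest average log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
train = {"zul": rng.normal(0.0, 1, (1000, 13)),   # 13-dim frames per language
         "xho": rng.normal(0.3, 1, (1000, 13))}

# One GMM per language — a single stochastic model, no transcription needed.
models = {lang: GaussianMixture(n_components=8).fit(frames)
          for lang, frames in train.items()}

utterance = rng.normal(0.1, 1, (300, 13))         # frames of one test utterance
scores = {lang: gmm.score(utterance) for lang, gmm in models.items()}
print(max(scores, key=scores.get))                # predicted language
```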
17

Text-based language identification for the South African languages

Botha, Gerrit Reinier 04 September 2008 (has links)
We investigate the factors that determine the performance of text-based language identification, with a particular focus on the 11 official languages of South Africa. Our study uses n-gram statistics as features for classification. In particular, we compare support vector machine, Naïve Bayesian, and difference-in-frequency classifiers on different amounts of input text and various values of n, for different amounts of training data. For a fixed value of n the support vector machines generally outperform the other classifiers, but the simpler classifiers are able to handle larger values of n. The additional computational complexity of training the support vector machine classifier may not be justified in light of the importance of using a large value of n, except possibly for small input windows when limited training data is available. We find that it is more difficult to discriminate between languages within a language family than between languages across families. For this reason accuracy on short input strings is low, but for input strings of 100 characters or more there is only slight confusion within families and accuracies as high as 99.4% are achieved. For the smallest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when the languages in different families are grouped together, this corresponds to a usable 95.1% accuracy. The relationship between the amount of training data and the accuracy achieved is found to depend on the window size: for the largest window (300 characters), about 400 000 characters are sufficient to achieve close-to-optimal accuracy, whereas improvements in accuracy are found even beyond 1.6 million characters of training data. Finally, we show that the confusions between the different languages in our set can be used to derive informative graphical representations of the relationships between the languages. / Dissertation (MEng)--University of Pretoria, 2008. / Electrical, Electronic and Computer Engineering / unrestricted
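The core classifier setup can be sketched briefly, assuming short labelled training strings; the toy sentences below only illustrate the character-n-gram feature pipeline, not the corpus used in the dissertation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative training strings with language labels.
texts = ["die kat sit op die mat", "the cat sits on the mat",
         "ikati ihlezi phezu komata", "die hond slaap op die vloer"]
langs = ["afr", "eng", "zul", "afr"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),  # n-grams up to n=4
    LinearSVC())
clf.fit(texts, langs)
print(clf.predict(["the dog sleeps on the floor"]))        # -> ['eng']
```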
18

Separierung mit FindLinks gecrawlter Texte nach Sprachen / Separating Texts Crawled with FindLinks by Language

Pollmächer, Johannes 13 February 2018 (has links)
This thesis presents a program for the language identification of web documents. The method uses word-frequency lists as training data with which to classify documents by language, making the tool a supervised-learning system. The web documents to be classified were downloaded with 'FindLinks', a tool developed by the Automatic Language Processing group; the program thus belongs to the post-processing of existing raw data. / This BSc thesis presents a program for the automatic language identification of web documents, called LangSepa. The procedure uses training data based on word-frequency tables for over 350 natural languages, so the tool can be subsumed under supervised-learning systems. The documents for the classification task were crawled by FindLinks, an information-retrieval system developed by the Natural Language Processing group at the University of Leipzig. The presented program is therefore employed in the post-processing of existing raw data.
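The frequency-list approach can be illustrated with a small sketch, assuming per-language word-frequency tables are available (tiny toy tables here): a document is assigned to the language whose list best covers its words.

```python
# Toy relative-frequency tables; LangSepa uses tables for 350+ languages.
freq_lists = {
    "de": {"der": 0.030, "die": 0.030, "und": 0.028, "ist": 0.010},
    "en": {"the": 0.050, "and": 0.027, "of": 0.030, "is": 0.010},
}

def score(document, table):
    """Average the known relative frequencies over the document's words."""
    words = document.lower().split()
    return sum(table.get(w, 0.0) for w in words) / max(len(words), 1)

doc = "die Katze und der Hund"
print(max(freq_lists, key=lambda lang: score(doc, freq_lists[lang])))  # -> de
```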
19

Comparing Different Transformer Models’ Performance for Identifying Toxic Language Online

Sundelin, Carl January 2023 (has links)
There is a growing use of the internet and, alongside it, an increase in the use of toxic language that can be harmful to those it targets. The usefulness of artificial intelligence has exploded in recent years with the development of natural language processing, especially with the use of transformers. One of the first was BERT, which has spawned many variants, including some that aim to be more lightweight than the original. The goal of this project was to train three different kinds of transformer models, RoBERTa, ALBERT, and DistilBERT, and find out which one was best at identifying toxic language online. The models were trained on a handful of existing datasets with data labelled as abusive, hateful, harassing, and other kinds of toxic language. These datasets were combined into a single dataset that was used to train and test all of the models. When tested against data collected in the datasets, there was very little difference in the overall performance of the models. The biggest difference was how long it took to train them, with ALBERT taking approximately 2 hours, RoBERTa around 1 hour, and DistilBERT just over half an hour. To understand how well the models worked in a real-world scenario, they were evaluated by labelling text as toxic or non-toxic on three different subreddits. Here, a larger difference in performance showed up: DistilBERT labelled significantly fewer instances as toxic than the other models. A sample of the classified data was manually annotated, and it showed that the RoBERTa and DistilBERT models still performed similarly to each other. A second evaluation was done on the Reddit data with a threshold of 80% certainty required for a classification to be considered toxic. This led to an average of 28% of instances being classified as toxic by RoBERTa, whereas ALBERT and DistilBERT classified an average of 14% and 11% as toxic, respectively. When the results from the RoBERTa and DistilBERT models were manually annotated, a significant improvement could be seen in the performance of the models. This led to the conclusion that DistilBERT was the most suitable of the lightweight models tested in this work for training and classifying toxic language.
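The thresholded evaluation step can be sketched as follows; since the thesis checkpoints are not public, a generic publicly available classifier stands in for the fine-tuned toxicity model purely to show the 80%-certainty filter.

```python
from transformers import pipeline

# Stand-in checkpoint; the thesis would load its own fine-tuned model here.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

comments = ["You are a wonderful person.", "I hope something bad happens to you."]
for comment in comments:
    result = clf(comment)[0]            # {'label': ..., 'score': ...}
    # Only accept a prediction when the model is at least 80% certain.
    if result["score"] >= 0.80:
        print(comment, "->", result["label"])
```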
20

Effective Techniques for Indonesian Text Retrieval

Asian, Jelita January 2007 (has links)
The Web is a vast repository of data, and information on almost any subject can be found with the aid of search engines. Although the Web is international, the majority of research on finding information focuses on languages such as English and Chinese. In this thesis, we investigate information retrieval techniques for Indonesian. Although Indonesia is the fourth most populous country in the world, little attention has been given to searching Indonesian documents. Stemming is the process of reducing morphological variants of a word to a common stem form. Previous research has shown that stemming is language-dependent. Although several stemming algorithms have been proposed for Indonesian, there is no consensus on which gives better performance. We empirically explore these algorithms, showing that even the best algorithm still has scope for improvement. We propose novel extensions to this algorithm and develop a new Indonesian stemmer, and show that these can improve stemming correctness by up to three percentage points; our approach makes less than one error in thirty-eight words. We propose a range of techniques to enhance the performance of Indonesian information retrieval, including stopping, sub-word tokenisation, identification of proper nouns, and modifications to existing similarity functions. Our experiments show that many of these techniques can increase retrieval performance, with the highest increase achieved when we use grams of size five to tokenise words. We also present an effective method for identifying the language of a document; this allows various information retrieval techniques to be applied selectively depending on the language of the target documents. We also address the problem of automatically creating parallel corpora (collections of documents that are direct translations of each other), which are essential for cross-lingual information retrieval tasks. Well-curated parallel corpora are rare and for many languages, such as Indonesian, do not exist at all. We describe algorithms that we have developed to automatically identify parallel documents for Indonesian and English. Unlike most current approaches, which consider only the context and structure of the documents, our approach is based on the document content itself. Our algorithms make no prior assumptions about the documents and are based on the Needleman-Wunsch algorithm for global alignment of protein sequences. Our approach works well in identifying Indonesian-English parallel documents, especially when no translation is performed: it can increase the separation value, a measure that discriminates good matches of parallel documents from bad ones, by approximately ten percentage points. We also investigate the applicability of our identification algorithms to other languages that use the Latin alphabet. Our experiments show that, with minor modifications, our alignment methods are effective for English-French, English-German, and French-German corpora, especially when the documents are not translated; our technique can increase the separation value for the European corpus by up to twenty-eight percentage points. Together, these results provide a substantial advance in understanding techniques that can be applied for effective Indonesian text retrieval.
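The alignment core the thesis builds on can be shown directly; this is a standard Needleman-Wunsch global alignment over token sequences, with illustrative match/mismatch/gap scores rather than the tuned values from the thesis.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-1):
    """Return the optimal global alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(rows):                 # gaps along the first sequence
        score[i][0] = i * gap
    for j in range(cols):                 # gaps along the second sequence
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    return score[-1][-1]

# Documents sharing many tokens (names, numbers) align with a higher score,
# which feeds the separation between good and bad parallel-document matches.
print(needleman_wunsch("jakarta 2007 report".split(),
                       "jakarta report 2007".split()))
```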
