1 |
Automatic language identification of short texts
Avenberg, Anna January 2020
The world is growing more connected through the use of online communication, exposing software and humans to all the world's languages. While devices are able to understand and share the raw data between themselves and with humans, the information itself is not expressed in a monolithic format. This causes issues both in human-to-computer interaction and in human-to-human communication. Automatic language identification (LID) is a field within artificial intelligence and natural language processing that strives to solve part of these issues by identifying languages from text, sign language and speech. One of the challenges is to identify the language of the short pieces of text found online, such as messages, comments and posts on social media, because of the small amount of information they carry. The goal of this thesis has been to build a machine learning model that can identify the language of these short pieces of text. A long short-term memory (LSTM) machine learning model was built and benchmarked against Facebook's fastText model. The results show that the LSTM model reached an accuracy of around 95%, while the fastText model used for comparison reached an accuracy of 97%. The LSTM model struggled more when identifying texts shorter than 50 characters than with longer texts. The classification performance of the LSTM model was also relatively poor in cases where languages were similar, such as Croatian and Serbian. Both the LSTM model and the fastText model reached accuracies above 94%, which can be considered high, depending on how it is evaluated. There are, however, many possible improvements and directions for future work: looking further into texts shorter than 50 characters, evaluating the model's softmax output vector values, and handling similar languages.
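The abstract names evaluation of the softmax output vector as future work. As a minimal sketch of what that could look like, the following turns raw class scores into probabilities and withholds a prediction when the top probability is low, which is the situation one would expect for very short texts or close language pairs. The logits, the label set and the 0.6 threshold are illustrative assumptions, not values from the thesis.

```python
import math

def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_threshold(logits, labels, threshold=0.6):
    """Return (label, probability); label is None when confidence is below threshold.

    The threshold value is a hypothetical choice for illustration only.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = labels[best] if probs[best] >= threshold else None
    return label, probs[best]

# Hypothetical logits for three languages.
labels = ["hr", "sr", "en"]
confident = predict_with_threshold([0.2, 0.1, 4.0], labels)   # one class dominates
ambiguous = predict_with_threshold([2.0, 1.9, -1.0], labels)  # two classes nearly tied
```

In the second call the two top scores are almost equal, so no label is returned; such cases could be logged or routed to a fallback instead of being counted as confident predictions.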
|
2 |
Language identification using Gaussian mixture models
Nkadimeng, Calvin 03 1900
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2010. / ENGLISH ABSTRACT: The importance of language identification for African languages is increasing
dramatically due to the development of telecommunication infrastructure
and the resulting increase in volumes of data and speech traffic in public
networks. By automatically processing the raw speech data, the vital assistance
given to people in distress can be sped up by referring their calls to a person
knowledgeable in their language.
To this end, a speech corpus was developed and various algorithms were implemented
and tested on raw telephone speech data. These algorithms entailed
data preparation, signal processing, and statistical analysis aimed at discriminating
between languages. Gaussian Mixture Models (GMMs) were chosen as the statistical
model for this research due to their ability to represent an entire
language with a single stochastic model that does not require phonetic transcription.
Language identification for African languages using GMMs is feasible, although
a few challenges remain to be overcome, such as proper classification and accurate
study of the relationships between languages. Other methods
that make use of phonetically transcribed data need to be explored and
tested with the new corpus for the research to be more rigorous.
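The decision step of a GMM-based language identifier can be sketched as follows: each language is represented by one mixture model, and an utterance is assigned to the language whose model gives its feature frames the highest total log-likelihood. This is only an illustrative sketch. The model parameters, the two-dimensional feature frames and the language names are made up, diagonal covariances are assumed, and model training (normally done with expectation-maximisation on acoustic features) is omitted.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of all frames under one GMM.

    gmm is a list of (weight, mean, var) components; log-sum-exp keeps
    the per-frame mixture sum numerically stable.
    """
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(comps)
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total

def identify(frames, models):
    """Pick the language whose GMM scores the frames highest."""
    return max(models, key=lambda lang: gmm_log_likelihood(frames, models[lang]))

# Hypothetical, hand-picked parameters; real models would be trained on speech features.
models = {
    "lang_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "lang_b": [(0.7, [5.0, 5.0], [1.0, 1.0]), (0.3, [8.0, 8.0], [2.0, 2.0])],
}
frames = [[4.8, 5.1], [5.2, 4.9], [7.5, 8.2]]
best_language = identify(frames, models)
```

Because each language is a single stochastic model over acoustic frames, no phonetic transcription is needed at either training or test time, which is the property the abstract highlights.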
|
3 |
Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches
Knudson, Ryan Charles 05 1900
Automatic language identification has been applied to short texts such as queries in information retrieval, but it has not yet been applied to metadata records. Applying this technology to metadata records, particularly their title elements, would enable creators of metadata records to obtain a value for the language element, which is often left blank due to a lack of linguistic expertise. It would also enable the addition of the language value to existing metadata records that currently lack a language value. Titles lend themselves to the problem of language identification mainly due to their shortness, a factor which increases the difficulty of accurately identifying a language. This study implemented four proven approaches to language identification as well as one open-source approach on a collection of multilingual titles of books and movies. Of the five approaches considered, a reduced N-gram frequency profile and distance measure approach outperformed all others, accurately identifying over 83% of all titles in the collection. Future plans are to offer this technology to curators of digital collections for use.
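An N-gram frequency profile with a distance measure can be sketched as below. The sketch uses the classic rank-order ("out-of-place") distance between ranked character n-gram lists, which is one common choice; the abstract does not say which distance measure or profile reduction the study used, so the function names, the profile size and the tiny training texts here are all illustrative assumptions.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Ranked list of the most frequent character n-grams (lengths 1..n_max)."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    rank = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - rank[g]) if g in rank else penalty
        for r, g in enumerate(doc_profile)
    )

def identify(title, lang_profiles):
    """Return the language whose profile is closest to the title's profile."""
    doc = ngram_profile(title)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy training texts; a real system would build profiles from large corpora.
training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat in the house",
    "de": "der schnelle braune fuchs springt von dem haus und die katze ist im garten",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}
guess = identify("the lazy dog in the house", profiles)
```

Lowering `top_k` shrinks the language profiles, which is one plausible reading of a "reduced" profile; with titles of only a few words, the document profile is very short, which illustrates why the task is harder than for full texts.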
|