  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Language Identification on Short Textual Data

Cui, Yexin January 2020 (has links)
Language identification is the task of automatically detecting the language(s) in which a given text or document is written, and it is also the first step of many downstream natural language processing tasks. The task has been well studied over past decades; however, most work has focused on long texts rather than short ones, which have proved more challenging due to their lack of syntactic and semantic information. In this work, we present approaches to this problem based on deep learning techniques, traditional methods, and their combination. The proposed ensemble model, composed of a learning-based method and a dictionary-based method, achieves 89.6% accuracy on our newly generated gold test set, surpassing the Google Translate API by 3.7% and the industry-leading tool Langid.py by 26.1%. / Thesis / Master of Applied Science (MASc)
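The abstract describes the ensemble's two components but not its implementation. Below is a minimal, purely illustrative sketch of how a learning-based vote (character trigram profiles) and a dictionary-based vote (stopword lookup) might be averaged into one decision; the training strings, stopword lists, and weight are all assumptions, not the thesis's actual models:

```python
from collections import Counter

# Toy training data; a real system would train on large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "fr": "le renard brun saute par dessus le chien et le chat paresseux",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}
STOPWORDS = {
    "en": {"the", "and", "over"},
    "fr": {"le", "et", "par"},
    "de": {"der", "und", "die", "den"},
}

def trigrams(text):
    """Character trigram counts, padded so word edges are captured."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

PROFILES = {lang: trigrams(s) for lang, s in SAMPLES.items()}

def ngram_score(text, lang):
    """Cosine similarity between trigram profiles (learning-based vote)."""
    a, b = trigrams(text), PROFILES[lang]
    dot = sum(c * b[g] for g, c in a.items())
    norm = (sum(c * c for c in a.values()) * sum(c * c for c in b.values())) ** 0.5
    return dot / norm if norm else 0.0

def dict_score(text, lang):
    """Fraction of tokens found in a per-language stopword list (dictionary vote)."""
    tokens = text.lower().split()
    return sum(t in STOPWORDS[lang] for t in tokens) / max(len(tokens), 1)

def identify(text, w=0.5):
    # Ensemble decision: weighted average of the two component scores.
    return max(PROFILES, key=lambda l: w * ngram_score(text, l)
                                       + (1 - w) * dict_score(text, l))
```

Averaging the two votes lets the dictionary signal compensate when a snippet is too short to yield reliable trigram statistics.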
2

A Design Of Multi-Language Identification System

Kuo, Ding-Yee 11 July 2000 (has links)
A Microsoft Windows program is designed to implement a multi-language identification system based on formant estimation and a vector-quantization classifier combined with n-gram and HMM models. Linear predictive coding (LPC) is used as an effective method for extracting the speakers' formant features, and a new distance measure for the VQ classifier is also proposed.
3

The textcat Package for n-Gram Based Text Categorization in R

Feinerer, Ingo, Buchta, Christian, Geiger, Wilhelm, Rauch, Johannes, Mair, Patrick, Hornik, Kurt 02 1900 (has links) (PDF)
Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods. (authors' abstract)
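The Cavnar and Trenkle (1994) scheme that textcat implements ranks character n-grams by frequency and scores a document against each category profile with an "out-of-place" rank distance. The package itself is in R; here is a minimal Python sketch of the scheme, with toy training strings standing in for real corpora:

```python
from collections import Counter

# Toy training strings stand in for real per-language corpora.
TRAIN = {
    "english": "this is a sample of english text used to build a category profile",
    "german": "dies ist ein beispiel fuer deutschen text zum aufbau eines profils",
}

def ngram_profile(text, max_n=3, top_k=300):
    """Ranked list of the most frequent character n-grams (1..max_n)."""
    text = f" {text.lower()} "
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in counts.most_common(top_k)]

CATEGORIES = {lang: ngram_profile(t) for lang, t in TRAIN.items()}

def out_of_place(doc_profile, cat_profile):
    """Cavnar-Trenkle distance: sum of rank displacements, with a
    maximum penalty for n-grams absent from the category profile."""
    ranks = {g: r for r, g in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(abs(r - ranks[g]) if g in ranks else max_penalty
               for r, g in enumerate(doc_profile))

def categorize(text):
    profile = ngram_profile(text)
    return min(CATEGORIES, key=lambda lang: out_of_place(profile, CATEGORIES[lang]))
```

With realistic training corpora this is essentially the full Cavnar-Trenkle method; the paper's reduced variant additionally prunes redundant n-grams from the profiles before comparison.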
4

The development of an automatic pronunciation assistant

Sefara, Tshephisho Joseph January 2019 (has links)
Thesis (M. Sc. (Computer Science)) -- University of Limpopo, 2019 / The pronunciation of words and phrases in any language involves careful manipulation of linguistic features. Factors such as age, motivation, accent, phonetics, stress and intonation sometimes cause inappropriate or incorrect pronunciation of words from non-native languages. Pronouncing words under different phonological rules tends to change the meaning of those words. This study presents the development of an automatic pronunciation assistant system for under-resourced languages of Limpopo Province, namely, Sepedi, Xitsonga, Tshivenda and isiNdebele. The aim of the proposed system is to help non-native speakers learn appropriate and correct pronunciation of words and phrases in these under-resourced languages. The system is composed of a language identification module on the front-end and a speech synthesis module on the back-end. A support vector machine was compared to a baseline multinomial naive Bayes classifier to build the language identification module. The language identification phase performs supervised multiclass text classification to predict a person's first language from input text, before the speech synthesis phase addresses pronunciation using the identified language. The back-end speech synthesis phase is composed of four baseline text-to-speech synthesis systems in the selected target languages, developed using the hidden Markov model method. Subjective listening tests were conducted to evaluate the quality of the synthesised speech using a mean opinion score test. The mean opinion score test obtained good results for all target languages on naturalness, pronunciation, pleasantness, understandability, intelligibility, overall system quality and user acceptance.
The developed system has been deployed on a live production web server for performance evaluation and stability testing using live data.
5

A Feature Design of Multi-Language Identification System

Lin, Jun-Ching 17 July 2003 (has links)
A multi-language identification system of 10 languages: Mandarin, Japanese, Korean, Tamil, Vietnamese, English, French, German, Spanish and Farsi, is built in this thesis. The system utilizes cepstrum coefficients, delta cepstrum coefficients and linear predictive coding coefficients to extract the language features, and incorporates Gaussian mixture model and N-gram model to make the language classification. The feasibility of the system is demonstrated in this thesis.
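The abstract names a Gaussian mixture model back-end over cepstral features without further detail. The sketch below simplifies each language model to a single diagonal Gaussian, which keeps the maximum-likelihood scoring logic of a GMM back-end visible; the random feature matrices merely stand in for real cepstral, delta-cepstral and LPC coefficient vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-frame feature vectors (assumption: real features
# would be cepstra, delta cepstra and LPC coefficients per frame).
train = {
    "mandarin": rng.normal(loc=0.0, scale=1.0, size=(500, 12)),
    "english":  rng.normal(loc=2.0, scale=1.0, size=(500, 12)),
}

# Fit one diagonal Gaussian per language; a real system fits a Gaussian
# *mixture* per language, but the scoring logic below is the same.
models = {lang: (x.mean(axis=0), x.var(axis=0)) for lang, x in train.items()}

def log_likelihood(frames, mean, var):
    """Sum of per-frame diagonal-Gaussian log densities."""
    return (-0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)).sum()

def identify(frames):
    # Score the utterance's frames against each language model.
    return max(models, key=lambda lang: log_likelihood(frames, *models[lang]))
```

An n-gram model over recognized phone sequences would then be combined with these acoustic scores, as the abstract describes.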
6

Computational Approaches to Style and the Lexicon

Brooke, Julian 20 March 2014 (has links)
The role of the lexicon has been ignored or minimized in most work on computational stylistics. This research is an effort to fill that gap, demonstrating the key role that the lexicon plays in stylistic variation. In doing so, I bring together a number of diverse perspectives, including aesthetic, functional, and sociological aspects of style. The first major contribution of the thesis is the creation of aesthetic stylistic lexical resources from large mixed-register corpora, adapting statistical techniques from approaches to topic and sentiment analysis. A key novelty of the work is that I consider multiple correlated styles in a single model. Next, I consider a variety of tasks that are relevant to style, in particular tasks relevant to genre and demographic variables, showing that the use of lexical resources compares well to more traditional approaches, in some cases offering information that is simply not available to a system based on surface features. Finally, I focus in on a single stylistic task, Native Language Identification (NLI), offering a novel method for deriving lexical information from native language texts, and using a cross-corpus supervised approach to show definitively that lexical features are key to high performance on this task.
8

Language identification with language and feature dependency

Yin, Bo, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW January 2009 (has links)
The purpose of Language Identification (LID) is to identify a specific language from a spoken utterance, automatically. Language-specific characteristics are always associated with different languages. Most existing LID approaches utilise a statistical modelling process with common acoustic/phonotactic features to model specific languages while avoiding any language-specific knowledge. Great successes have been achieved in this area over past decades. However, there is still a huge gap between these language-independent methods and the actual language-specific patterns. It is extremely useful to address these specific acoustic or semantic construction patterns, without spending huge labour on annotation which requires language-specific knowledge. Inspired by this goal, this research focuses on the language-feature dependency. Several practical methods have been proposed. Various features and modelling techniques have been studied in this research. Some of them carry out additional language-specific information without manual labelling, such as a novel duration modelling method based on articulatory features, and a novel Frequency-Modulation (FM) based feature. The performance of each individual feature is studied for each of the language-pair combinations. The similarity between languages and the contribution in identifying a language by using a particular feature are defined for the first time, in a quantitative style. These distance measures and language-dependent contributions become the foundations of the later-presented frameworks: language-dependent weighting and hierarchical language identification. The latter particularly provides remarkable flexibility and enhancement when identifying a relatively large number of languages and accents, due to the fact that the most discriminative feature or feature-combination is used when separating each of the languages.
The proposed systems are evaluated on various corpora and task contexts, including the NIST language recognition evaluation tasks, and performance is improved to varying degrees. The key techniques developed for this work have also been applied to a problem other than LID: speech-based cognitive load monitoring.
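The language-dependent weighting framework can be illustrated with toy numbers: each feature's classifier score for a language is weighted by that feature's language-specific contribution before fusion. All scores and weights below are invented for illustration; real values would come from the thesis's per-feature back-ends and quantitative contribution measures:

```python
# Hypothetical per-feature classifier scores for one utterance.
scores = {
    "mfcc":     {"en": 0.6, "fr": 0.5, "de": 0.4},
    "fm":       {"en": 0.3, "fr": 0.7, "de": 0.5},
    "duration": {"en": 0.5, "fr": 0.4, "de": 0.8},
}

# Language-dependent weights: how much each feature contributes to
# identifying each language (illustrative values only).
weights = {
    "en": {"mfcc": 0.6, "fm": 0.2, "duration": 0.2},
    "fr": {"mfcc": 0.3, "fm": 0.5, "duration": 0.2},
    "de": {"mfcc": 0.2, "fm": 0.2, "duration": 0.6},
}

def fused(lang):
    """Language-dependent weighted fusion of per-feature scores."""
    return sum(weights[lang][f] * scores[f][lang] for f in scores)

best = max(weights, key=fused)
```

A hierarchical variant would apply the same idea at each node of a language tree, picking the most discriminative feature or feature combination for each split.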
10

Automatic Language Identification for Metadata Records: Measuring the Effectiveness of Various Approaches

Knudson, Ryan Charles 05 1900 (has links)
Automatic language identification has been applied to short texts such as queries in information retrieval, but it has not yet been applied to metadata records. Applying this technology to metadata records, particularly their title elements, would enable creators of metadata records to obtain a value for the language element, which is often left blank due to a lack of linguistic expertise. It would also enable the addition of a language value to existing metadata records that currently lack one. Titles are a particularly hard case for language identification mainly because of their shortness, a factor which increases the difficulty of accurate identification. This study implemented four proven approaches to language identification as well as one open-source approach on a collection of multilingual titles of books and movies. Of the five approaches considered, a reduced n-gram frequency profile and distance measure approach outperformed all others, accurately identifying over 83% of all titles in the collection. Future plans are to offer this technology to curators of digital collections.
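The winning reduced n-gram frequency profile approach prunes profiles before the distance comparison. The study's exact reduction rule is not given here; one plausible criterion (an assumption, not the thesis's stated rule) drops an n-gram whenever some longer n-gram containing it occurs equally often, since the shorter gram then carries no extra information:

```python
from collections import Counter

def ngram_counts(text, max_n=3):
    """Raw character n-gram counts for n = 1..max_n."""
    text = f" {text.lower()} "
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts

def reduce_profile(counts):
    """Drop an n-gram when an (n+1)-gram containing it has the same
    count: the shorter gram is then fully redundant."""
    reduced = {}
    for g, c in counts.items():
        redundant = any(g in h and c == ch
                        for h, ch in counts.items() if len(h) == len(g) + 1)
        if not redundant:
            reduced[g] = c
    return reduced
```

The surviving profile is then ranked and compared with a rank-displacement distance, as in the full Cavnar and Trenkle (1994) method; shrinking the profiles makes that comparison cheaper without discarding distinguishing information.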
