1 |
Pronunciation Modeling in Spelling Correction for Writers of English as a Foreign Language. Boyd, Adriane. 24 June 2008.
No description available.
|
2 |
Finite-state canonicalization techniques for historical German. Jurish, Bryan. January 2011.
This work addresses issues in the automatic preprocessing of historical German input text for use by conventional natural language processing techniques. Conventional techniques cannot adequately handle historical input text, owing on the one hand to their reliance on a fixed application-specific lexicon keyed by contemporary orthographic surface form, and on the other to the lack of consistent orthographic conventions in historical text. Historical spelling variation is treated here as an error-correction problem or "canonicalization" task: an attempt to automatically assign each (historical) input word a unique extant canonical cognate, thus allowing direct application-specific processing (tagging, parsing, etc.) of the returned canonical forms without need for any additional application-specific modifications. In the course of the work, various methods for automatic canonicalization are investigated and empirically evaluated, including conflation by phonetic identity, conflation by lemma instantiation heuristics, canonicalization by weighted finite-state rewrite cascade, and token-wise disambiguation by a dynamic Hidden Markov Model. / This work addresses issues in the automatic preprocessing of historical German text for further processing by conventional computational-linguistic techniques. Without such preprocessing, conventional techniques cannot handle historical text satisfactorily because of its high degree of graphemic variation. Variation in historical spelling is treated here as an error-correction problem or "canonicalization task": an attempt to assign each (historical) input word a unique extant equivalent, so that conventional techniques can operate directly on the returned canonical forms without further modification. Various methods for automatic canonicalization are investigated within this work, among them conflation by phonetic identity, conflation by lemma-instantiation heuristics, canonicalization by a cascade of weighted finite-state transducers, and disambiguation of conflation candidates by a dynamic Hidden Markov Model.
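To make the conflation-by-phonetic-identity idea concrete, here is a minimal Python sketch: historical and contemporary spellings are conflated when they map to the same phonetic key. The key function below is a toy stand-in with a few illustrative rewrite rules, not the rule set actually used in this work.

```python
# A minimal sketch of conflation by phonetic identity: words sharing a
# (crude) phonetic key are treated as variants of the same form.
import re

def phonetic_key(word: str) -> str:
    """Collapse spelling variation to a crude phonetic equivalence key."""
    w = word.lower()
    w = w.replace("th", "t")           # "thun" and "tun" share a key
    w = w.replace("ey", "ei")          # "seyn" and "sein" share a key
    w = w.replace("uo", "u")           # "muoter" and "muter" share a key
    w = re.sub(r"^v(?=[nm])", "u", w)  # "vnd" and "und" share a key
    return w

def canonicalize(word: str, lexicon: set) -> str:
    """Return an extant lexicon entry sharing the word's phonetic key."""
    key = phonetic_key(word)
    matches = [entry for entry in lexicon if phonetic_key(entry) == key]
    return matches[0] if matches else word  # ambiguity is ignored here

lexicon = {"tun", "sein", "und"}
for historical in ["thun", "seyn", "vnd"]:
    print(historical, "->", canonicalize(historical, lexicon))
```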
|
3 |
Conditional random fields for noisy text normalisation. Coetsee, Dirko. 2014.
Thesis (MScEng) -- Stellenbosch University, 2014. / ENGLISH ABSTRACT: The increasing popularity of microblogging services such as Twitter means that more and more unstructured data is available for analysis. The informal language usage in these media presents a problem for traditional text mining and natural language processing tools. We develop a pre-processor to normalise this noisy text so that useful information can be extracted with standard tools. A system consisting of a tokeniser, out-of-vocabulary token identifier, correct candidate generator, and N-gram language model is proposed. We compare the performance of generative and discriminative probabilistic models for these different modules. The effect of normalising the training and testing data on the performance of a tweet sentiment classifier is investigated. A linear-chain conditional random field, which is a discriminative model, is found to work better than its generative counterpart for the tokenisation module, achieving a 0.76% character error rate compared to 1.41% for the finite state automaton. For the candidate generation module, however, the generative weighted finite state transducer works better, getting the correct clean version of a word right 36% of the time on the first guess, while the discriminatively trained hidden alignment conditional random field only achieves 6%. The use of a normaliser as a pre-processing step does not significantly affect the performance of the sentiment classifier. / AFRIKAANSE OPSOMMING: Microblogging services such as Twitter are becoming ever more popular, and the amount of unstructured data available for analysis is therefore growing like never before. The informal language usage in these media, however, makes it difficult to apply traditional techniques and existing data-processing tools. A system that normalises this noisy text is developed so that existing packages can be used to process the text further. The system consists of a module that segments the text into word units, a module that identifies words that need to be corrected, a module that then proposes candidate corrections, and a module that applies a language model to find the most probable clean text. The performance of discriminative and generative models for some of these modules is compared, and the influence such a normaliser has on the accuracy of a sentiment classifier is investigated. We find that a linear-chain conditional random field (a discriminative model) works better than its generative counterpart for text segmentation. The conditional random field model achieves a character error rate of 0.76%, while the finite-state machine model achieves 1.41%. The finite-state machine model, in turn, works better at generating candidate words than the hidden alignment model we implemented: the finite-state machine gets the correct version of a word on the first guess 36% of the time, while the discriminative model manages this only 6% of the time. Finally, we find that prior normalisation of Twitter messages has no significant effect on the accuracy of a sentiment classifier.
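The pipeline described above can be made concrete with a small sketch: tokenise, flag out-of-vocabulary tokens, generate candidate corrections, and score them with a language model. The toy vocabulary, the edit-distance-1 candidate generator, and the unigram scoring below stand in for the thesis's trained CRF and WFST components.

```python
# A minimal sketch of the normalisation pipeline shape, assuming a toy
# frequency table in place of a trained language model.
import re

VOCAB = {"see": 100, "you": 150, "tomorrow": 40, "the": 300}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def tokenise(text):
    return re.findall(r"[a-z']+", text.lower())

def edits1(word):
    """All strings one insertion, deletion, or substitution away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    subs = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + subs + inserts)

def normalise_token(tok):
    if tok in VOCAB:                       # in-vocabulary: keep as-is
        return tok
    candidates = edits1(tok) & VOCAB.keys()
    if not candidates:
        return tok                         # no candidate: pass through
    return max(candidates, key=VOCAB.get)  # unigram "language model"

print([normalise_token(t) for t in tokenise("see yu tomorow")])
# -> ['see', 'you', 'tomorrow']
```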
|
4 |
Weighting Edit Distance to Improve Spelling Correction in Music Entity Search / Viktat ändringsavstånd för förbättrad stavningskorrigering vid sökning i en musikdatabas. Samuelsson, Axel. January 2017.
This master's thesis project investigated whether the established Damerau-Levenshtein edit distance between two strings could be made more useful for detecting and correcting misspellings in a search query. The idea was to use the knowledge that many users type their queries using the QWERTY keyboard layout, and to weight the edit distance so that misspellings caused by confusing nearby keys are cheaper to correct. Two different weighting approaches were tested: one with a linear spread from 2/9 to 2 depending on the keyboard distance, and one that preferred neighbors over non-neighbors (at either half the cost or no cost at all). They were tested against an unweighted baseline as well as inverted versions of themselves (nearer keys more expensive to replace) on a dataset of 1,162,145 searches. No significant improvement in the retrieval of search results was observed compared to the baseline. However, each of the weightings performed better than its corresponding inversion at the p < 0.05 significance level. This means that while the weighted edit distance did not outperform the baseline, the data still clearly points toward a correlation between the physical position of keys on the keyboard and which spelling mistakes are made. / This degree project investigated whether the established Damerau-Levenshtein distance measure can be adapted to better find and correct spelling errors in search queries. The idea was to use the fact that many users type their search queries on a keyboard with a QWERTY layout, and to weight the edit distance so that it becomes cheaper to correct spelling errors caused by mixing up two keys that are close to each other. Two different weightings were tested: one had its weights spread linearly between 2/9 and 2, and the other preferred neighbors over non-neighbors (either half the cost or none at all). They were tested against an unweighted reference distance, and against their own inverses (so that closer keys became more expensive to substitute), on a dataset of 1,162,145 searches. No significant improvement was measured against the reference. However, each of the weightings performed better than its inverted counterpart at confidence level p < 0.05. This means that although the weighted edit distances did not perform better than the reference, the data still points clearly toward a correlation between the physical placement of the keys on the keyboard and which spelling mistakes are made.
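The core weighting idea lends itself to a short sketch: the substitution cost in the edit distance scales with the physical distance between keys on a QWERTY grid. The scaling below approximates the linear 2/9-to-2 spread described in the abstract; transpositions from the full Damerau variant are omitted for brevity, so this is an illustration of the weighting, not the thesis's exact implementation.

```python
# A minimal sketch of keyboard-weighted edit distance: substituting 'n'
# for its neighbour 'm' is cheap, substituting distant 'q' is expensive.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS)
           for c, ch in enumerate(row)}

def sub_cost(a, b, lo=2 / 9, hi=2.0):
    """Map physical key distance linearly into [lo, hi]."""
    if a == b:
        return 0.0
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    dist = ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5
    max_dist = (2 ** 2 + 9 ** 2) ** 0.5   # far corners of the layout
    return lo + (hi - lo) * min(dist, max_dist) / max_dist

def weighted_edit_distance(s, t):
    """Levenshtein distance with keyboard-weighted substitutions."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,     # deletion
                          d[i][j - 1] + 1,     # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]

print(weighted_edit_distance("music", "nusic"))  # neighbours: small cost
print(weighted_edit_distance("music", "qusic"))  # far apart: larger cost
```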
|
5 |
Erron: A Phrase-Based Machine Translation Approach to Customized Spelling Correction. Hovermale, DJ. 19 December 2011.
No description available.
|
6 |
Swedish Natural Language Processing with Long Short-term Memory Neural Networks: A Machine Learning-powered Grammar and Spell-checker for the Swedish Language. Gudmundsson, Johan; Menkes, Francis. January 2018.
Natural Language Processing (NLP) is a field studying computer processing of human language. Recently, neural network language models, a machine learning technique, have been used to great effect in this field. However, research remains focused on the English language, with few implementations in other languages of the world. This work focuses on how NLP techniques can be used for the task of grammar and spelling correction in the Swedish language, in order to investigate how language models can be applied to non-English languages. We use a controlled experiment to find the hyperparameters most suitable for grammar and spelling correction on the Göteborgs-Posten corpus, using a Long Short-term Memory Recurrent Neural Network. We present promising results for Swedish-specific grammar correction tasks using this kind of neural network; specifically, our network achieves high accuracy on these tasks, though the accuracy achieved for language-independent typos remains low.
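For readers unfamiliar with the architecture, the sketch below shows the general shape of a character-level LSTM language model in PyTorch. The hyperparameters and training setup are placeholders, not the configuration the authors tuned on the Göteborgs-Posten corpus.

```python
# A minimal character-level LSTM language model skeleton in PyTorch;
# vocabulary size, dimensions, and data here are illustrative only.
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                  # x: (batch, seq_len) char ids
        out, _ = self.lstm(self.embed(x))
        return self.proj(out)              # logits for the next character

model = CharLSTM(vocab_size=100)
batch = torch.randint(0, 100, (8, 32))     # 8 sequences of 32 characters
logits = model(batch)                      # shape: (8, 32, 100)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 100),       # predict each next character
    batch[:, 1:].reshape(-1))
print(logits.shape, loss.item())
```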
|
7 |
Short message service normalization for communication with a health information system. Adesina, Ademola Olusola. January 2011.
Philosophiae Doctor - PhD / Short Message Service (SMS) is one of the most popular services for communication between mobile phone users. In recent times it has also been proposed as a means for information access. However, there are several challenges to be overcome in order to process an SMS, especially when it is used as a query in an information retrieval system. SMS users often deliberately use compacted and grammatically incorrect writing that makes the message difficult to process with conventional information retrieval systems. To overcome this, a pre-processing step known as normalization is required. In this thesis an investigation of SMS normalization algorithms is carried out. To this end, studies have been conducted into the design of algorithms for translating and normalizing SMS text. Character-based, unsupervised and rule-based techniques are presented.
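A minimal sketch of the rule-based strand of such normalization follows, assuming a small illustrative shorthand table rather than the thesis's actual rules: known texting abbreviations are expanded and exaggerated letter repetitions are squashed.

```python
# A toy rule-based SMS normalizer: lookup table for shorthand plus a
# rule collapsing letter repetitions. The table is illustrative only.
import re

SHORTHAND = {"u": "you", "r": "are", "gr8": "great",
             "pls": "please", "2moro": "tomorrow", "dr": "doctor"}

def normalize_sms(message: str) -> str:
    out = []
    for tok in message.lower().split():
        tok = re.sub(r"(.)\1{2,}", r"\1\1", tok)  # "heeelp" -> "heelp"
        out.append(SHORTHAND.get(tok, tok))        # expand known shorthand
    return " ".join(out)

print(normalize_sms("pls can u see dr 2moro"))
# -> "please can you see doctor tomorrow"
```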
An investigation was also undertaken into the design and development of a system for information access via SMS. A specific system was designed, as a case study, to access information in a Frequently Asked Questions (FAQ) database in healthcare. The study also addresses securing SMS communication, especially for healthcare information systems; the proposed technique is to encipher the messages using the Secure Shell (SSH) protocol.
|
8 |
Using phonetic knowledge in tools and resources for Natural Language Processing and Pronunciation Evaluation / Utilizando conhecimento fonético em ferramentas e recursos de Processamento de Língua Natural e Treino de Pronúncia. Almeida, Gustavo Augusto de Mendonça. 21 March 2016.
This thesis presents tools and resources for the development of applications in Natural Language Processing and Pronunciation Training. There are four main contributions. First, a hybrid grapheme-to-phoneme converter for Brazilian Portuguese, named Aeiouadô, which makes use of both manual transcription rules and Classification and Regression Trees (CART) to infer the phone transcription. Second, a spelling correction system based on machine learning, which uses the transcriptions produced by Aeiouadô and is capable of handling phonologically-motivated errors, as well as contextual errors. Third, a method for the extraction of phonetically-rich sentences, which is based on greedy algorithms. Fourth, a prototype system for automatic pronunciation assessment, especially designed for Brazilian-accented English. / This dissertation presents resources aimed at the development of speech recognition and pronunciation assessment applications. Four contributions are discussed here. First, a hybrid grapheme-to-phoneme converter for Brazilian Portuguese, called Aeiouadô, which uses phonetic transcription rules and Classification and Regression Trees (CART) to infer the phones of speech. Second, an automatic correction tool based on machine learning, which takes into account typing errors of phonetic origin, is capable of handling contextual errors, and employs the transcriptions generated by Aeiouadô. Third, a method for extracting phonetically-rich sentences, intended for the creation of speech corpora and based on greedy algorithms. Fourth, a prototype of a non-native speech recognition and correction system, aimed at English spoken by Brazilian learners.
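The greedy extraction of phonetically-rich sentences can be sketched briefly: at each step, select the sentence whose transcription covers the most phone types not yet seen. The toy transcriptions below stand in for Aeiouadô's grapheme-to-phoneme output.

```python
# A minimal greedy set-cover sketch for phonetically-rich sentence
# selection; the phone sets here are toy stand-ins for real G2P output.
def greedy_select(sentences, transcribe, target_phones):
    """Greedily pick sentences until all target phone types are covered."""
    covered, chosen = set(), []
    while not target_phones <= covered and sentences:
        best = max(sentences, key=lambda s: len(transcribe(s) - covered))
        if not transcribe(best) - covered:
            break                        # no remaining sentence adds phones
        chosen.append(best)
        covered |= transcribe(best)
        sentences = [s for s in sentences if s != best]
    return chosen

TOY_PHONES = {"sol": {"s", "o", "l"}, "lua": {"l", "u", "a"},
              "sal": {"s", "a", "l"}}
print(greedy_select(list(TOY_PHONES), TOY_PHONES.get,
                    target_phones={"s", "o", "l", "u", "a"}))
# -> ['sol', 'lua']
```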
|