171
Ambiguous synonyms: Implementing an unsupervised WSD system for division of synonym clusters containing multiple senses. Wallin, Moa. January 2019.
When clustering synonyms, complications arise when a word has multiple senses, since the synonyms of its different senses are erroneously clustered together. The task of automatically distinguishing word senses in cases of ambiguity, known as word sense disambiguation (WSD), has been extensively researched over the years. This thesis studies the possibility of applying an unsupervised machine-learning-based WSD system for analysing existing synonym clusters (N = 149) and dividing them correctly when two or more senses are present. Based on sense embeddings induced from a large corpus, cosine similarities are calculated between the sense embeddings of the words in a cluster, making it possible to suggest divisions in cases where different words are closer to different senses of a proposed ambiguous word. The system output is then evaluated by four participants, all experts in the area. The results show that, according to the participants, the system manages to correctly divide the clusters in no more than 31% of the cases. Moreover, some differences exist between the participants' ratings, although none of the participants predominantly agree with the system's divisions. Evidently, further research and improvements are needed, and directions for future work are suggested.
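As a rough illustration of the division step, the following is a minimal sketch of assigning each synonym in a cluster to the nearest sense of an ambiguous word by cosine similarity. The vectors, words and sense labels are invented toy data, not the sense embeddings or clusters used in the thesis.

```python
# Toy sketch: propose a division of a synonym cluster by assigning each synonym
# to the closest sense embedding of the ambiguous word (cosine similarity).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def split_cluster(synonyms, word_vecs, sense_vecs):
    """Group synonyms by the sense they are closest to; more than one non-empty
    group suggests that the cluster should be divided."""
    groups = {sense: [] for sense in sense_vecs}
    for word in synonyms:
        best = max(sense_vecs, key=lambda s: cosine(word_vecs[word], sense_vecs[s]))
        groups[best].append(word)
    return {sense: words for sense, words in groups.items() if words}

# Invented 3-dimensional vectors for two senses of "bank" and three synonyms.
word_vecs = {
    "credit_union": np.array([0.8, 0.2, 0.1]),
    "lender":       np.array([0.9, 0.1, 0.0]),
    "riverside":    np.array([0.1, 0.9, 0.2]),
}
sense_vecs = {
    "bank#finance": np.array([1.0, 0.0, 0.0]),
    "bank#river":   np.array([0.0, 1.0, 0.1]),
}
print(split_cluster(list(word_vecs), word_vecs, sense_vecs))
```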
172
The effect of noise in the training of convolutional neural networks for text summarisation. Meechan-Maddon, Ailsa. January 2019.
In this thesis, we work towards bridging the gap between two distinct areas: noisy text handling and text summarisation. The overall goal is to examine the effects of noise in the training of convolutional neural networks for text summarisation, with a view to understanding how to effectively create a noise-robust text-summarisation system. We look specifically at abstractive text summarisation of noisy data in the context of summarising error-containing documents from automatic speech recognition (ASR) output. We experiment with adding varying levels of noise (errors) to the four-million-article Gigaword corpus and training an encoder-decoder CNN on it, with the aim of producing a noise-robust text summarisation system. A total of six text summarisation models are trained, each with a different level of noise. We find that the models trained with a high level of noise are indeed able to aptly summarise noisy data into clean summaries, despite a tendency for all models to overfit to the level of noise on which they were trained. Directions are given for future steps towards an even more noise-robust and flexible text summarisation system.
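A minimal sketch of the kind of noise injection described, assuming simple word-level substitution, deletion and duplication errors as a stand-in for ASR errors; the corruption operations, rates and vocabulary are invented and not the exact procedure used in the thesis.

```python
# Toy sketch: corrupt training text with word-level errors at a controllable rate.
import random

def corrupt(sentence, noise_level, vocab, rng=random.Random(0)):
    """Randomly substitute, drop or duplicate words with probability noise_level."""
    noisy = []
    for word in sentence.split():
        r = rng.random()
        if r < noise_level / 3:
            noisy.append(rng.choice(vocab))      # substitution error
        elif r < 2 * noise_level / 3:
            continue                             # deletion error
        elif r < noise_level:
            noisy.extend([word, word])           # insertion (duplication) error
        else:
            noisy.append(word)
    return " ".join(noisy)

vocab = ["the", "report", "market", "minister", "said"]
clean = "the finance minister said the market would recover"
for level in (0.0, 0.1, 0.3):
    print(level, "->", corrupt(clean, level, vocab))
```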
173
A comparative study of word embedding methods for early risk prediction on the Internet. Fano, Elena. January 2019.
We built a system to participate in the eRisk 2019 T1 Shared Task. The aim of the task was to evaluate systems for early risk prediction on the internet, in particular to identify users suffering from eating disorders as accurately and quickly as possible given their history of Reddit posts in chronological order. In the controlled setting of this task, we also evaluated the performance of three different word representation methods: random indexing, GloVe, and ELMo. We discuss our system's performance, also in the light of the scores obtained by other teams in the shared task. Our results show that our two-step learning approach was quite successful, and we obtained good scores on the early risk prediction metric ERDE across the board. Contrary to our expectations, we did not observe a clear-cut advantage of contextualized ELMo vectors over the commonly used and much more lightweight GloVe vectors. Our best model in terms of F1 score turned out to be a model with GloVe vectors as input to the text classifier and a multi-layer perceptron as user classifier. The best ERDE scores were obtained by the model with ELMo vectors and a multi-layer perceptron. The model with random indexing vectors struck a good balance between precision and recall in the early processing stages but was eventually surpassed by the models with GloVe and ELMo vectors. We put forward some possible explanations for the observed results, as well as proposing some improvements to our system.
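For reference, a hedged sketch of the ERDE measure mentioned above, following the commonly cited formulation by Losada and Crestani (2016); the cost constants and details may differ from the exact evaluation script of the 2019 task.

```python
# Sketch of ERDE (early risk detection error) for a single user decision.
import math

def erde(decision, truth, k, o, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """Error of a decision taken after k posts, with deadline parameter o."""
    if decision and not truth:
        return c_fp                                   # false positive: flat cost
    if not decision and truth:
        return c_fn                                   # false negative: full cost
    if decision and truth:
        latency_cost = 1.0 - 1.0 / (1.0 + math.exp(k - o))
        return latency_cost * c_tp                    # true positive: grows with delay
    return 0.0                                        # true negative: no cost

# A correct positive decision is penalised more the later it is taken.
for k in (1, 5, 50):
    print(f"ERDE_5 after {k} posts: {erde(True, True, k, o=5):.3f}")
```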
174
Multi-Label Text Classification with Transfer Learning for Policy Documents: The Case of the Sustainable Development Goals. Rodríguez Medina, Samuel. January 2019.
We created and analyzed a text classification dataset from freely available web documents on the United Nations' Sustainable Development Goals. We then used it to train and compare different multi-label text classifiers, with the aim of exploring methods that facilitate the search for information in this type of document. We explored the effectiveness of deep learning and transfer learning in text classification by fine-tuning different pre-trained language representations: Word2Vec, GloVe, ELMo, ULMFiT and BERT. We also compared these approaches against a baseline of more traditional algorithms without transfer learning, namely multinomial Naive Bayes, logistic regression, k-nearest neighbors and Support Vector Machines. We then analyzed the results of our experiments quantitatively and qualitatively. The best results in terms of micro-averaged F1 score and AUROC are obtained by BERT. Interestingly, however, the second-best classifier in terms of micro-averaged F1 score is the Support Vector Machine, closely followed by the logistic regression classifier, both of which have the advantage of being less computationally expensive than BERT. The results also show a close relation between our dataset size and the effectiveness of the classifiers.
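As an illustration of the traditional baseline side of this comparison, here is a minimal sketch of a one-vs-rest linear SVM for multi-label classification evaluated with micro-averaged F1, assuming scikit-learn; the toy documents, SDG-style labels and TF-IDF features are invented and do not reproduce the thesis setup.

```python
# Toy multi-label baseline: TF-IDF features, one-vs-rest linear SVM, micro-F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

docs = [
    "universal access to clean water and sanitation",
    "end poverty and hunger through social protection",
    "affordable clean energy and climate mitigation",
    "quality education and gender equality in schools",
]
labels = [["sdg6"], ["sdg1", "sdg2"], ["sdg7", "sdg13"], ["sdg4", "sdg5"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                 # binary indicator matrix of labels
X = TfidfVectorizer().fit_transform(docs)

clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
pred = clf.predict(X)                         # evaluated on training data only as a demo
print("micro-F1:", f1_score(Y, pred, average="micro"))
```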
175
Named-entity recognition in Czech historical texts: Using a CNN-BiLSTM neural network model. Hubková, Helena. January 2019.
The thesis presents named-entity recognition in Czech historical newspapers from the Modern Access to Historical Sources project. Our goal was to create a specific corpus and annotation manual for the project and to evaluate neural network methods for named-entity recognition within the task. We created the corpus from scanned Czech historical newspapers. The scanned pages were converted to digital text using optical character recognition (OCR). The data were preprocessed by removing some OCR errors. We also defined specific named-entity types for our task and created an annotation manual with examples for the project. Based on that, we annotated the final corpus. To find the most suitable neural network model for our task, we experimented with different architectures, namely long short-term memory (LSTM), bidirectional LSTM and CNN-BiLSTM models. Moreover, we experimented with randomly initialized word embeddings that were trained during the training process, as well as pretrained word embeddings for contemporary Czech published as open source by fastText. We achieved the best result, an F1 score of 0.444, using the CNN-BiLSTM model and the pretrained fastText embeddings. We found that we do not need to normalize the spelling of our historical texts towards contemporary language when using the neural network model. We also provide a qualitative analysis of the observed linguistic phenomena. We found that some word forms and word pairs that were infrequent in our training data were mis-tagged or not tagged at all. Based on that, we can say that larger data sets could improve the results.
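A hedged sketch of a CNN-BiLSTM tagger of the general kind named in the title, assuming Keras; the layer sizes, tag set and vocabulary are invented, and the character-level components of the real model are omitted for brevity.

```python
# Toy CNN-BiLSTM sequence tagger: word embeddings -> 1D convolution -> BiLSTM
# -> per-token softmax over entity tags.
from tensorflow.keras import layers, models

vocab_size, n_tags, max_len, emb_dim = 20000, 9, 64, 300   # illustrative sizes

inputs = layers.Input(shape=(max_len,), dtype="int32")
# The embedding could instead be initialised with pretrained fastText vectors.
x = layers.Embedding(vocab_size, emb_dim)(inputs)
x = layers.Conv1D(128, kernel_size=3, padding="same", activation="relu")(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
outputs = layers.TimeDistributed(layers.Dense(n_tags, activation="softmax"))(x)

model = models.Model(inputs, outputs)   # one tag distribution per token
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```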
176
Compound Processing for Phrase-Based Statistical Machine Translation. Stymne, Sara. January 2009.
In this thesis I explore how compound processing can be used to improve phrase-based statistical machine translation (PBSMT) between English and German/Swedish. Both German and Swedish generally use closed compounds, which are written as one word without spaces or other indicators of word boundaries. Compounding is both common and productive, which makes it problematic for PBSMT, mainly due to sparse data problems.

The adopted strategy for compound processing is to split compounds into their component parts before training and translation. For translation into Swedish and German the parts are merged after translation. I investigate the effect of different splitting algorithms for translation between English and German, and of different merging algorithms for German. I also apply these methods to a different language pair, English-Swedish. Overall the studies show that compound processing is useful, especially for translation from English into German or Swedish. But there are improvements for translation into English as well, such as a reduction of unknown words.

I show that for translation between English and German different splitting algorithms work best for different translation directions. I also design and evaluate a novel merging algorithm based on part-of-speech matching, which outperforms previous methods for compound merging, showing the need for information that is carried through the translation process, rather than only external knowledge sources such as word lists. Most of the methods for compound processing were originally developed for German. I show that these methods can be applied to Swedish as well, with similar results.
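As an illustration of what a corpus-frequency-based splitting algorithm can look like, here is a minimal sketch in the style of Koehn and Knight (2003): choose the segmentation into known parts with the highest geometric mean of part frequencies. The frequency table is a toy example, and this is not the specific splitting algorithm evaluated in the thesis.

```python
# Toy compound splitting: prefer the segmentation whose parts are frequent words.
def splits(word, min_part=3):
    """All two-part segmentations with parts of at least min_part characters."""
    return [(word[:i], word[i:]) for i in range(min_part, len(word) - min_part + 1)]

def best_split(word, freq):
    """Pick the candidate with the highest geometric mean of part frequencies."""
    candidates = [(word,)] + [s for s in splits(word) if all(p in freq for p in s)]
    def geo_mean(parts):
        score = 1.0
        for p in parts:
            score *= freq.get(p, 1)
        return score ** (1.0 / len(parts))
    return max(candidates, key=geo_mean)

freq = {"aktien": 500, "gesellschaft": 800, "aktiengesellschaft": 20}
print(best_split("aktiengesellschaft", freq))   # -> ('aktien', 'gesellschaft')
```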
177
Automatic speaker verification on site and by telephone: methods, applications and assessment. Melin, Håkan. January 2006.
Speaker verification is the biometric task of authenticating a claimed identity by analyzing a spoken sample of the claimant's voice. The present thesis deals with various topics related to automatic speaker verification (ASV) in the context of its commercial applications, characterized by co-operative users, user-friendly interfaces, and requirements for small amounts of enrollment and test data. A text-dependent system based on hidden Markov models (HMM) was developed and used to conduct experiments, including a comparison between visual and aural strategies for prompting claimants for randomized digit strings. It was found that aural prompts lead to more errors in spoken responses and that visually prompted utterances performed marginally better in ASV, given that enrollment data were visually prompted. High-resolution flooring techniques were proposed for variance estimation in the HMMs, but results showed no improvement over the standard method of using target-independent variances copied from a background model. These experiments were performed on Gandalf, a Swedish speaker verification telephone corpus with 86 client speakers. A complete on-site application (PER), a physical access control system securing a gate in a reverberant stairway, was implemented based on a combination of the HMM system and a system based on Gaussian mixture models. Users were authenticated by saying their proper name and a visually prompted, random sequence of digits, after having enrolled by speaking ten utterances of the same type. An evaluation was conducted with 54 of the 56 clients who succeeded in enrolling. Semi-dedicated impostor attempts were also collected. An equal error rate (EER) of 2.4% was found for this system, based on a single attempt per session and after retraining the system on PER-specific development data. On parallel telephone data collected using a telephone version of PER, an EER of 3.5% was found with landline and around 5% with mobile telephones; impostor attempts in this case were same-handset attempts. Results also indicate that the distributions of false reject and false accept rates over target speakers are well described by beta distributions. A state-of-the-art commercial system was also tested on PER data, with performance similar to the baseline research system.
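For reference, a minimal sketch of how an equal error rate can be estimated from genuine and impostor scores, the operating point reported above; the score values are invented and real evaluations use far more trials.

```python
# Toy EER estimation: find the threshold where false accept and false reject
# rates are (approximately) equal, and report the rate at that point.
import numpy as np

def eer(genuine_scores, impostor_scores):
    """Return (approximate EER, threshold) where FAR and FRR cross."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best = (1.0, None)
    for t in thresholds:
        frr = np.mean(genuine_scores < t)    # true speakers rejected
        far = np.mean(impostor_scores >= t)  # impostors accepted
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, ((far + frr) / 2, t))
    return best[1]

genuine  = np.array([2.1, 1.8, 2.5, 0.9, 1.4, 2.2, 1.1, 2.8])
impostor = np.array([0.3, 1.0, 0.7, 1.2, 0.2, 0.5, 0.8, 0.4])
rate, threshold = eer(genuine, impostor)
print(f"EER ~ {rate:.1%} at threshold {threshold:.2f}")
```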
178
Automatisk kvalitetskontroll av terminologi i översättningar / Automatic quality checking of terminology in translations. Edholm, Lars. January 2007.
Quality in translations depends on the correct use of specialized terms, which can make the translation easier to understand as well as reduce the required time and costs for the translation (Lommel, 2007). Consistent use of terminology is important and should be taken into account during quality checks of, for example, translated documentation (Esselink, 2000). Today, several commercial programs have functions for automatic quality checking of terminology. The aim of this study is to evaluate such functions, since no earlier major study of this has been found.

To get some insight into quality checking in practice, two qualitative interviews were initially carried out with individuals involved in this at a translation agency. The results were compared to current theories in the subject field and revealed a general agreement with, for example, the recommendations of Bass (2006).

The evaluations started with an examination of the recall of a genuine terminology database compared to subjectively marked terms in a test corpus based on an authentic translation memory. The examination, however, revealed a relatively low recall. To increase the recall, the terminology database was modified, for example by extending it with longer terms from the test corpus.

After that, the terminology-checking function of four different commercial programs was run on the test corpus using the modified terminology database. Finally, the test corpus was also modified by planting a number of errors in it, to produce a more idealized evaluation. The results from the programs, in the form of alarms for potential errors, were categorized and judged as true or false alarms. This constitutes a base for measures of the precision of the checks, and in the last evaluation also of their recall.

The evaluations showed that for terminology in translations from English to Swedish, it was advantageous to match terms from the terminology database as parts of words in the source and target segments of the translation. In that way, terms with different inflected forms could be matched without support for language-specific morphology. A cause of many problems in the matching process was the form of the entries in the terminology database, which were more suited for being read by human translators than by a machine.

Recommendations regarding the introduction of tools for automatic checking of terminology were formulated, based on the results from the interviews and evaluations. Due to factors of uncertainty in the automatic checking, a manual review of its results is motivated. By running the check on a sample that has already been manually checked in other respects, a reasonable number of results to review manually can be obtained. The quality of the terminology database is crucial for its recall on translations, and in the long run also for the value of using it for automatic checking.
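A minimal sketch of the kind of check evaluated in the study: for each bilingual term entry, flag segment pairs where the source term matches but the expected target term does not, using partial (substring) matching to catch inflected forms. The term base and segments are invented examples, not taken from the study.

```python
# Toy terminology check over bilingual segments.
def partial_match(term, text):
    """Case-insensitive substring match, which also catches inflected forms."""
    return term.lower() in text.lower()

def check_terminology(termbase, segments):
    """Yield alarms (segment index, source term, expected target term)."""
    for i, (src_seg, tgt_seg) in enumerate(segments):
        for src_term, tgt_term in termbase:
            if partial_match(src_term, src_seg) and not partial_match(tgt_term, tgt_seg):
                yield i, src_term, tgt_term

termbase = [("printer", "skrivare"), ("toner cartridge", "tonerkassett")]
segments = [
    ("Replace the toner cartridge in the printer.",
     "Byt tonerkassetten i skrivaren."),
    ("Turn off the printer before cleaning.",
     "Stäng av enheten före rengöring."),   # "printer" rendered inconsistently
]
for alarm in check_terminology(termbase, segments):
    print("potential error:", alarm)
```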
179
Articulation Rate and Surprisal in Swedish Child-Directed Speech. Sjons, Johan. January 2022.
Child-directed speech (CDS) differs from adult-directed speech (ADS) in several respects whose possible facilitating effects for language acquisition are still being studied. One such difference concerns articulation rate, the number of linguistic units per unit of time excluding pauses, which has been shown to be generally lower than in ADS. However, while it is well established that ADS exhibits an inverse relation between articulation rate and information-theoretic surprisal (the amount of information encoded in a linguistic unit), this measure has been conspicuously absent from the study of articulation rate in CDS. Another issue is whether the lower articulation rate in CDS is stable across utterances or an effect of local variation, such as final lengthening. The aim of this work is to arrive at a more comprehensive model of articulation rate in CDS by including surprisal and final lengthening. In particular, one-word utterances were studied, also in relation to word-length effects (the phenomenon that longer words generally have a higher articulation rate). To this end, a methodology for large-scale automatic phoneme alignment was developed and applied to two longitudinal corpora of Swedish CDS. It was investigated i) how articulation rate in CDS varied with respect to child age, ii) whether there was a negative relation between articulation rate and surprisal in CDS, and iii) to what extent articulation rate was lower in CDS than in ADS. The results showed i) a weak positive effect of child age on articulation rate, ii) a negative relation between articulation rate and surprisal, and iii) a lower articulation rate in CDS, although the difference could almost exclusively be attributed to one-word utterances and final lengthening. In other words, adults seem to adapt how fast they speak to their children's age, speaking faster to children is correlated with a reduced amount of information, and the difference in articulation rate between CDS and ADS is most prominent in isolated words and final lengthening. More generally, the results suggest that CDS is well suited for word segmentation, since the lower articulation rate in one-word utterances provides an additional cue.
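A hedged sketch of the two central measures, assuming articulation rate is computed as phonemes per second of speech excluding pauses and surprisal as the negative log probability of a word given its context, here from a toy bigram model with add-one smoothing; the timings, corpus and smoothing choice are invented for illustration.

```python
# Toy computation of articulation rate and bigram surprisal.
import math
from collections import Counter

def articulation_rate(segments):
    """segments: list of (phoneme_count, duration_seconds, is_pause)."""
    phones = sum(n for n, _, pause in segments if not pause)
    time = sum(d for _, d, pause in segments if not pause)
    return phones / time

def surprisal(prev, word, bigrams, unigrams, vocab_size):
    """-log2 P(word | prev) with add-one smoothing."""
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(p)

# Invented utterance segments (phoneme count, duration in s, is_pause).
utterance = [(5, 0.42, False), (3, 0.25, False), (0, 0.60, True), (6, 0.55, False)]
print("articulation rate:", round(articulation_rate(utterance), 2), "phonemes/s")

corpus = "titta en boll titta en bil titta där".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
print("surprisal of 'boll' after 'en':",
      round(surprisal("en", "boll", bigrams, unigrams, len(unigrams)), 2), "bits")
```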
180
Applying Coreference Resolution for Usage in Dialog Systems. Rolih, Gabi. January 2018.
Using references in language is a major part of communication, and understanding them is not a challenge for humans. Recent years have seen increased use of dialog systems that interact with humans in natural language to assist them in various tasks, but even the most sophisticated systems still struggle with understanding references. In this thesis, we adapt a coreference resolution system for use in dialog systems and try to understand what is needed for an efficient understanding of references in this setting. We annotate a portion of logs from a customer service system and perform an analysis of the most common coreferring expressions appearing in this type of data. The analysis shows that most coreferring expressions are nominal and pronominal, and that they usually appear within two sentences of each other. We implement Stanford's Multi-Pass Sieve with some adaptations and dialog-specific changes and integrate it into a dialog system framework. The preprocessing pipeline makes use of existing NLP tools, while some new ones are added, such as a chunker, a head-finding algorithm and a NER-like system. To analyze both user input and system output, we deploy two separate coreference resolution systems that interact with each other. The system and its separate parts are evaluated with the five most common evaluation metrics. The system does not achieve state-of-the-art numbers, but given its domain-specific nature this is expected. Some parts of the system have no effect on performance, while the dialog-specific changes contribute to it greatly. An error analysis is conducted and reveals some problems with the implementation, but, more importantly, it shows how the system could be further improved by using other types of knowledge and dialog-specific features.
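As a rough illustration of the multi-pass sieve idea, the sketch below applies deterministic sieves in order of decreasing precision, each allowed to merge a mention's cluster with that of an antecedent; the two sieves and the toy mentions are simplified stand-ins for the real passes and dialog-specific adaptations.

```python
# Toy multi-pass sieve: high-precision exact-match pass first, then a naive pronoun pass.
PRONOUNS = {"it", "he", "she", "they", "this", "that"}

def exact_match_sieve(mention, antecedents):
    for ante in reversed(antecedents):                 # prefer the closest antecedent
        if mention["text"].lower() == ante["text"].lower():
            return ante
    return None

def pronoun_sieve(mention, antecedents):
    if mention["text"].lower() in PRONOUNS and antecedents:
        return antecedents[-1]                         # naive: link to the closest preceding mention
    return None

def resolve(mentions, sieves):
    clusters = {i: {i} for i in range(len(mentions))}  # start with singleton clusters
    for sieve in sieves:                               # passes ordered by precision
        for i, mention in enumerate(mentions):
            ante = sieve(mention, mentions[:i])
            if ante is not None:
                j = mentions.index(ante)
                merged = clusters[i] | clusters[j]
                for k in merged:
                    clusters[k] = merged
    return {frozenset(c) for c in clusters.values()}

mentions = [{"text": "my order"}, {"text": "it"}, {"text": "My order"}]
for cluster in resolve(mentions, [exact_match_sieve, pronoun_sieve]):
    print(sorted(cluster))
```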