1 |
Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer. Nojoumian, Peyman, 12 August 2011 (has links)
Due to the lack of short vowels or diacritics in Persian orthography, many Natural Language Processing applications for this language, including information retrieval, machine translation, text-to-speech, and automatic speech recognition systems, must first disambiguate the input before any further processing. In machine translation, for example, the whole text must first be correctly diacritized so that the correct words, parts of speech, and meanings are matched and retrieved from the lexicon. This is primarily because of Persian’s ambiguous orthography. In fact, the core engine of any Persian language processor should include a diacritizer and a lexical disambiguator. This dissertation describes the design and implementation of an automatic diacritizer for Persian based on the state-of-the-art Finite State Transducer technology developed at Xerox by Beesley & Karttunen (2003). The results of morphological analysis and generation on a test corpus are presented, including the insertion of diacritics.
This study also examines issues raised by the phonological and semantic ambiguities that result from the absence of short vowels in the Persian writing system. It proposes a hybrid model (rule-based and inductive), inspired by psycholinguistic experiments on the human mental lexicon, for disambiguating heterophonic homographs in Persian using frequency and collocation information. A syntactic parser could be developed on the basis of the proposed model to detect Ezafe (the linking short vowel /e/ within a noun phrase) or to disambiguate homographs, but its implementation is left for future work.
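The frequency-and-collocation idea for disambiguating heterophonic homographs can be sketched as a toy lookup. Everything below (the transliterated surface forms, the counts, and the lexicon shape) is invented for illustration; the dissertation's actual model is FST-based and far richer.

```python
# Toy sketch of frequency/collocation homograph disambiguation.
# Surface forms, readings, and counts are hypothetical examples,
# not data from the dissertation.

# Each ambiguous surface form maps to candidate readings, each given as
# (reading, overall_frequency, {collocate: cooccurrence_count}).
LEXICON = {
    "mrd": [  # unvocalized 'mrd': /mard/ 'man' vs. /mord/ 'died' (illustrative)
        ("mard", 900, {"zan": 40, "bozorg": 25}),
        ("mord", 300, {"u": 60, "diruz": 35}),
    ],
}

def disambiguate(surface, context_words):
    """Pick the reading whose collocates best match the context;
    fall back to the most frequent reading. This reduces the
    hybrid (rule-based and inductive) idea to a two-step heuristic."""
    candidates = LEXICON.get(surface)
    if not candidates:
        return surface  # unambiguous or unknown: leave the form as-is

    def score(cand):
        reading, freq, colloc = cand
        colloc_score = sum(colloc.get(w, 0) for w in context_words)
        return (colloc_score, freq)  # collocation first, frequency as tie-breaker

    return max(candidates, key=score)[0]
```

With a context that collocates with the 'died' reading, the collocation score overrides the raw frequency preference; with no informative context, the more frequent reading wins.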
|
5 |
Disambiguating Italian homographic heterophones with SoundChoice and testing ChatGPT as a data-generating tool. Nanni, Matilde, January 2023 (has links)
Text-to-Speech systems are challenged by homographs, words that have more than one possible pronunciation. Rule-based approaches are often still the industry's preferred solution to this problem, but there have been multiple attempts to solve the 'homograph issue' with statistical, neural, and hybrid techniques, mostly for English. Ploujnikov and Ravanelli (2022) proposed SoundChoice, a neural grapheme-to-phoneme framework available in RNN and transformer versions that can be fine-tuned for homograph disambiguation thanks to a weighted homograph loss. This thesis trains and tests the framework on Italian rather than English, to see how it performs on a different language. Because the available data containing homographs was insufficient for this task, the thesis also experiments with ChatGPT as a data-generating tool. SoundChoice was further evaluated out of domain by testing it on corpus data. The results showed that the RNN model reached 71% accuracy from a baseline of 59%; the transformer model performed better, going from 57% to 74%. Further analysis is needed to draw firmer conclusions about the origin of this gap, and the models should also be trained on corpus data and tested on ChatGPT data to assess whether ChatGPT-generated data is indeed a suitable replacement for corpus data.
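The weighted homograph loss mentioned above can be illustrated with a small hand-rolled sketch: a negative log-likelihood in which positions flagged as homographs receive extra weight, so errors there cost more during fine-tuning. The weighting scheme and the value of `w` are assumptions for illustration, not SoundChoice's actual implementation.

```python
import math

def weighted_nll(log_probs, targets, homograph_mask, w=2.0):
    """Weighted mean negative log-likelihood over a sequence.

    log_probs:      per-position lists of log-probabilities over classes
    targets:        gold class index at each position
    homograph_mask: True at positions belonging to a homograph
    w:              extra weight for homograph positions (hypothetical value)
    """
    total, weight_sum = 0.0, 0.0
    for lp, t, is_homograph in zip(log_probs, targets, homograph_mask):
        weight = w if is_homograph else 1.0
        total += -weight * lp[t]   # standard NLL term, scaled by the weight
        weight_sum += weight
    return total / weight_sum
```

Up-weighting the homograph positions raises the loss whenever the model is less confident exactly there, which is the intended training pressure.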
|
6 |
Homograph Disambiguation and Diacritization for Arabic Text-to-Speech Using Neural Networks / Homografdisambiguering och diakritisering för arabiska text-till-talsystem med hjälp av neurala nätverk. Lameris, Harm, January 2021 (has links)
Pre-processing Arabic text for Text-to-Speech (TTS) systems poses major challenges, as Arabic omits short vowels in writing. This omission produces a large number of homographs, so Arabic text must be diacritized to disambiguate them and match each word with its intended pronunciation. Arabic diacritization has generally been achieved with rule-based, statistical, or hybrid methods that combine the two. Recently, diacritization methods based on deep learning have shown promise in reducing error rates, but these methods are not yet commonly used in TTS engines. To examine neural diacritization methods for use in TTS engines, we normalized and pre-processed a version of the Tashkeela corpus, a large diacritized corpus consisting largely of Classical Arabic texts, for TTS purposes. We then trained and tested three state-of-the-art Recurrent-Neural-Network-based models on this data set. Additionally, we tested these models on the Wiki News corpus, a test set of Modern Standard Arabic (MSA) news articles that more closely resembles most TTS queries. The models were evaluated by comparing the Diacritic Error Rate (DER) and Word Error Rate (WER) achieved on each data set to one another and to the DER and WER reported in the original papers. Moreover, the per-diacritic accuracy was examined, and a manual evaluation was performed. For the Tashkeela corpus, all models achieved a lower DER and WER than reported in the original papers, largely as a result of using more training data in addition to the TTS pre-processing steps performed on the data. For the Wiki News corpus, the error rates were higher, largely due to the domain gap between the data sets. For both data sets, the models overfit on common patterns and the most common diacritic; on the Wiki News corpus they also struggled with named entities and loanwords.
Purely neural models generally outperformed the model that combined deep learning with rule-based and statistical corrections. These findings highlight the usability of deep learning methods for Arabic diacritization in TTS engines as well as the need for diacritized corpora that are more representative of Modern Standard Arabic.
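As a rough sketch of the metrics named above, DER can be taken as the fraction of characters whose predicted diacritic differs from the reference, and WER as the fraction of words containing at least one such character. The `(base_char, diacritic)` pair representation below is a simplifying assumption; real Arabic diacritization scoring must additionally handle alignment, stacked marks such as shadda, and conventions about case endings.

```python
def der_wer(ref_words, hyp_words):
    """Diacritic Error Rate and Word Error Rate for pre-aligned,
    equal-length word sequences. Each word is a list of
    (base_char, diacritic) pairs -- a simplification of real
    Arabic text, assumed here for illustration."""
    char_errs = chars = word_errs = 0
    for ref, hyp in zip(ref_words, hyp_words):
        word_wrong = False
        for (_, ref_diac), (_, hyp_diac) in zip(ref, hyp):
            chars += 1
            if ref_diac != hyp_diac:   # only the diacritic is compared
                char_errs += 1
                word_wrong = True
        word_errs += word_wrong        # one error per word, however many chars
    return char_errs / chars, word_errs / len(ref_words)
```

One wrong diacritic in a three-character word thus moves DER by one character but WER by one whole word, which is why WER is always at least as large as DER on a per-item basis.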
|
7 |
The Same-Spelling Hapax of the Commedia of Dante. Soules, Terrill Shepard, 27 April 2010 (has links)
In the Commedia of Dante, a poem 14,233 lines in length, some 7,500 words occur only once. These are the hapax. Fewer than 2% of these constitute a minute but distinct subset—the hapax for which there are one or more words in the poem whose spelling is identical but whose meaning is different. These are what I call same-spelling hapax. I identify four categories: part-of-speech, homograph, locus, and name. Analysis of the same-spelling hapax illuminates a poetic strategy continuously in use throughout the poem. This is to use the one-word overlap of Rhyme and line number. Not only is it highly probable that a same-spelling hapax will be a rhyme-word, but it is also probable that it will occupy a rhyme-word’s most significant position—the one place—the single word—where the two intertwined formal entities that shape each canto coincide. Every three lines, their tension-resolving this-word-only union intensifies the reader’s attention and understanding alike.
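The hapax counts the analysis starts from are straightforward to compute. A minimal sketch, glossing over the Commedia's actual tokenization, elision, and editorial variants:

```python
from collections import Counter

def hapax(tokens):
    """Return the words that occur exactly once in the token list
    (the hapax legomena). Lowercasing is a simplifying assumption;
    a real treatment of the Commedia would need careful tokenization."""
    counts = Counter(token.lower() for token in tokens)
    return {token for token, count in counts.items() if count == 1}
```

Finding the same-spelling subset would then amount to checking, for each hapax, whether another token with identical spelling but a different part of speech or meaning occurs elsewhere in the poem, which requires annotation beyond raw counts.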
|
8 |
The Same-Spelling Hapax of the Commedia of Dante. Soules, Terrill S., 10 August 2010 (has links)
In the Commedia of Dante, a poem 14,233 lines in length, some 7,500 words occur only once. These are the hapax. Fewer than 2% of these constitute a minute but distinct subset: the hapax for which there are one or more words in the poem whose spelling is identical but whose meaning is different. These are what I call same-spelling hapax. I identify four categories: part-of-speech, homograph, locus, and name. Examination of the same-spelling hapax illuminates a poetic strategy continuously in use throughout the poem: the one-word coincidence of Rhyme's rhyme number and the terzina's line number. Not only is it highly probable that a same-spelling hapax will be a rhyme-word, but it is also probable that it will occupy a rhyme-word's most significant position, the one place, the single word, where the two intertwined formal entities that shape each canto coincide. Every three lines, their tension-resolving this-word-only union intensifies the reader's attention and understanding alike.
|
9 |
Revisiting the subordinate bias effect of lexical ambiguity resolution: evidence from eye movements in reading Chinese. 盧怡璇 (Lu, I Hsuan), Unknown Date (has links)
Research in psycholinguistics over the last two decades has focused on the interaction between linguistic context and meaning dominance during lexical ambiguity resolution. Many studies have demonstrated a subordinate bias effect (SBE) when the preceding context biases toward the subordinate (infrequent) meaning of an unbalanced homograph: gaze durations on the homograph are longer than on an unambiguous control of matched word-form frequency. According to the reordered access model, the SBE is due to competition between the dominant and subordinate meanings. By contrast, the selective access model assumes that only the context-relevant meaning is activated, so the SBE results from access to a low-frequency meaning.
Two eye-tracking experiments, one on sentence reading and one on sentence listening, were conducted. Experiment 1 examined the SBE of Chinese homographs to differentiate the two accounts. We used low-frequency homographs along with matched low- and high-frequency unambiguous words. The results showed the SBE emerging in fixation durations on the target region and the post-target region (the two words following the target) when the unambiguous controls were matched to the word-form frequency of the ambiguous words.
Experiment 2 used the visual world paradigm to explore the temporal dynamics of dominant-meaning activation underlying the SBE in an instructional eyetracking-during-listening task. Fixation probabilities on four disyllabic printed words were analyzed over the period after a target word was uttered in a spoken sentence. The results supported the reordered access model. The subordinate meaning was activated by contextual information at about 500 ms after the onset of the acoustic homograph, when context made its favored meaning available. Soon after the offset of the homograph, the dominant meaning became active. Both meanings of the homograph were active during the window from 901 ms to 1300 ms, which approximately corresponds to the acoustic onset of the post-target word. In sum, our studies demonstrate that the dominant meaning is activated even when contextual information biases toward the subordinate meaning of a homograph. The subordinate bias effect is thus the result of competition between two meanings, conforming to the reordered access model.
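The fixation-probability analysis described above can be sketched schematically: for each time window after target-word onset, compute the proportion of trials in which each printed word was fixated at least once. The data shapes and numbers below are invented for illustration and do not reproduce the study's actual analysis pipeline.

```python
def fixation_proportions(trials, window, words):
    """Proportion of trials fixating each printed word within a time
    window (in ms) after target-word onset.

    trials: list of trials, each a list of (time_ms, fixated_word) samples
    window: (start_ms, end_ms), inclusive
    words:  the candidate printed words on screen
    All structures are hypothetical simplifications of eye-tracking data."""
    counts = {w: 0 for w in words}
    for trial in trials:
        fixated = {w for t, w in trial if window[0] <= t <= window[1]}
        for w in words:
            counts[w] += w in fixated   # at-least-once fixation per trial
    n = len(trials)
    return {w: counts[w] / n for w in words}
```

Comparing these proportions across successive windows (e.g. around 500 ms versus 901 to 1300 ms) is what lets one say when each meaning's associate begins to attract looks.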
|