Global ETD Search

21	A Design of Multi-session Text-independent Digital Camcorder Audio-Video Database for Speaker Recognition Chen, Chun-chi 05 September 2008 (has links) In this thesis, an audio-video database for speaker recognition is constructed using a digital camcorder. Motion pictures of fifteen hundred speakers are recorded in three different sessions in the database. For each speaker, 20 still images per session are also derived from the video data. It is hoped that this database can provide an appropriate training and testing mechanism for person identification using both voice and face features. Speaker Recognition Automatic Speech Recognition System Text-to-Speech System Biometrics
22	A knowledge-based grapheme-to-phoneme conversion for Swedish Thorstensson, Niklas January 2002 (has links) <p>A text-to-speech system is a complex system consisting of several different modules such as grapheme-to-phoneme conversion, articulatory and prosodic modelling, voice modelling etc.</p><p>This dissertation is aimed at the creation of the initial part of a text-to-speech system, i.e. the grapheme-to-phoneme conversion, designed for Swedish. The problem area at hand is the conversion of orthographic text into a phonetic representation that can be used as a basis for a future complete text-to speech system.</p><p>The central issue of the dissertation is the grapheme-to-phoneme conversion and the elaboration of rules and algorithms required to achieve this task. The dissertation aims to prove that it is possible to make such a conversion by a rule-based algorithm with reasonable performance. Another goal is to find a way to represent phonotactic rules in a form suitable for parsing. It also aims to find and analyze problematic structures in written text compared to phonetic realization.</p><p>This work proposes a knowledge-based grapheme-to-phoneme conversion system for Swedish. The system suggested here is implemented, tested, evaluated and compared to other existing systems. The results achieved are promising, and show that the system is fast, with a high degree of accuracy.</p> Grapheme-to-Phoneme algorithm Text-to-speech Swedish Computer and systems science Data- och systemvetenskap
23	The Impact of Text Reader Training and Teacher Strategies on a Six-week Reading Program White, D. Heather 20 November 2013 (has links) This study investigates the effects of intensive remediation in reading and assistive technology skills combined with the use of a computer based text to speech reader in a six-week intensive reading program for junior-age students with reading disabilities. The study reports on the strategies used by the teachers, week-by-week student progress, and the results of a criterion-referenced reading assessment. Other themes include student attitudes towards the technology and barriers to implementation. Findings indicate that a computer based text to speech reader provides significant compensatory support, resulting in improved fluency and comprehension scores. Students using technology were able to access paper and on-line text at a higher level. A model which builds on the work of Dyck and Pemberton (2002) and Edyburn (2004b, 2007) is proposed which provides a theoretical framework to assist schools in decisions about remediation or compensation for struggling readers in primary, junior, intermediate, and senior grades. assistive technology reading compensation remediation text to speech computer disabilities teacher strategies 0529 0710 0535
24	The Impact of Text Reader Training and Teacher Strategies on a Six-week Reading Program White, D. Heather 20 November 2013 (has links) This study investigates the effects of intensive remediation in reading and assistive technology skills combined with the use of a computer based text to speech reader in a six-week intensive reading program for junior-age students with reading disabilities. The study reports on the strategies used by the teachers, week-by-week student progress, and the results of a criterion-referenced reading assessment. Other themes include student attitudes towards the technology and barriers to implementation. Findings indicate that a computer based text to speech reader provides significant compensatory support, resulting in improved fluency and comprehension scores. Students using technology were able to access paper and on-line text at a higher level. A model which builds on the work of Dyck and Pemberton (2002) and Edyburn (2004b, 2007) is proposed which provides a theoretical framework to assist schools in decisions about remediation or compensation for struggling readers in primary, junior, intermediate, and senior grades. assistive technology reading compensation remediation text to speech computer disabilities teacher strategies 0529 0710 0535
25	Advanced natural language processing for improved prosody in text-to-speech synthesis / G. I. Schlünz Schlünz, Georg Isaac January 2014 (has links) Text-to-speech synthesis enables the speech-impeded user of an augmentative and alternative communication system to partake in any conversation on any topic, because it can produce dynamic content. Current synthetic voices do not sound very natural, however, lacking in the areas of emphasis and emotion. These qualities are furthermore important to convey meaning and intent beyond that which can be achieved by the vocabulary of words only. Put differently, speech synthesis requires a more comprehensive analysis of its text input beyond the word level to infer the meaning and intent that elicit emphasis and emotion. The synthesised speech then needs to imitate the effects that these textual factors have on the acoustics of human speech. This research addresses these challenges by commencing with a literature study on the state of the art in the fields of natural language processing, text-to-speech synthesis and speech prosody. It is noted that the higher linguistic levels of discourse, information structure and affect are necessary for the text analysis to shape the prosody appropriately for more natural synthesised speech. Discourse and information structure account for meaning, intent and emphasis, and affect formalises the modelling of emotion. The OCC model is shown to be a suitable point of departure for a new model of affect that can leverage the higher linguistic levels. The audiobook is presented as a text and speech resource for the modelling of discourse, information structure and affect because its narrative structure is prosodically richer than the random constitution of a traditional text-to-speech corpus. A set of audiobooks are selected and phonetically aligned for subsequent investigation. The new model of discourse, information structure and affect, called e-motif, is developed to take advantage of the audiobook text. It is a subjective model that does not specify any particular belief system in order to appraise its emotions, but defines only anonymous affect states. Its cognitive and social features rely heavily on the coreference resolution of the text, but this process is found not to be accurate enough to produce usable features values. The research concludes with an experimental investigation of the influence of the e-motif features on human speech and synthesised speech. The aligned audiobook speech is inspected for prosodic correlates of the cognitive and social features, revealing that some activity occurs in the into national domain. However, when the aligned audiobook speech is used in the training of a synthetic voice, the e-motif effects are overshadowed by those of structural features that come standard in the voice building framework. / PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014 Natural language processing Text-to-speech synthesis Prosody Discourse Information structure Affect OCC model E-motif
26	The effects of part–of–speech tagging on text–to–speech synthesis for resource–scarce languages / G.I. Schlünz Schlünz, Georg Isaac January 2010 (has links) In the world of human language technology, resource–scarce languages (RSLs) suffer from the problem of little available electronic data and linguistic expertise. The Lwazi project in South Africa is a large–scale endeavour to collect and apply such resources for all eleven of the official South African languages. One of the deliverables of the project is more natural text–to–speech (TTS) voices. Naturalness is primarily determined by prosody and it is shown that many aspects of prosodic modelling is, in turn, dependent on part–of–speech (POS) information. Solving the POS problem is, therefore, a prudent first step towards meeting the goal of natural TTS voices. In a resource–scarce environment, obtaining and applying the POS information are not trivial. Firstly, an automatic tagger is required to tag the text to be synthesised with POS categories, but state–of–the–art POS taggers are data–driven and thus require large amounts of labelled training data. Secondly, the subsequent processes in TTS that are used to apply the POS information towards prosodic modelling are resource–intensive themselves: some require non–trivial linguistic knowledge; others require labelled data as well. The first problem asks the question of which available POS tagging algorithm will be the most accurate on little training data. This research sets out to answer the question by reviewing the most popular supervised data–driven algorithms. Since literature to date consists mostly of isolated papers discussing one algorithm, the aim of the review is to consolidate the research into a single point of reference. A subsequent experimental investigation compares the tagging algorithms on small training data sets of English and Afrikaans, and it is shown that the hidden Markov model (HMM) tagger outperforms the rest when using both a comprehensive and a reduced POS tagset. Regarding the second problem, the question arises whether it is perhaps possible to circumvent the traditional approaches to prosodic modelling by learning the latter directly from the speech data using POS information. In other words, does the addition of POS features to the HTS context labels improve the naturalness of a TTS voice? Towards answering this question, HTS voices are trained from English and Afrikaans prosodically rich speech. The voices are compared with and without POS features incorporated into the HTS context labels, analytically and perceptually. For the analytical experiments, measures of prosody to quantify the comparisons are explored. It is then also noted whether the results of the perceptual experiments correlate with their analytical counterparts. It is found that, when a minimal feature set is used for the HTS context labels, the addition of POS tags does improve the naturalness of the voice. However, the same effect can be accomplished by including segmental counting and positional information instead of the POS tags. / Thesis (M.Sc. Engineering Sciences (Electrical and Electronic Engineering))--North-West University, Potchefstroom Campus, 2011. part-of-speech tagging text-to-speech synthesis resource-scarce language naturalness prosody HTS context labels
27	The effects of part–of–speech tagging on text–to–speech synthesis for resource–scarce languages / G.I. Schlünz Schlünz, Georg Isaac January 2010 (has links) In the world of human language technology, resource–scarce languages (RSLs) suffer from the problem of little available electronic data and linguistic expertise. The Lwazi project in South Africa is a large–scale endeavour to collect and apply such resources for all eleven of the official South African languages. One of the deliverables of the project is more natural text–to–speech (TTS) voices. Naturalness is primarily determined by prosody and it is shown that many aspects of prosodic modelling is, in turn, dependent on part–of–speech (POS) information. Solving the POS problem is, therefore, a prudent first step towards meeting the goal of natural TTS voices. In a resource–scarce environment, obtaining and applying the POS information are not trivial. Firstly, an automatic tagger is required to tag the text to be synthesised with POS categories, but state–of–the–art POS taggers are data–driven and thus require large amounts of labelled training data. Secondly, the subsequent processes in TTS that are used to apply the POS information towards prosodic modelling are resource–intensive themselves: some require non–trivial linguistic knowledge; others require labelled data as well. The first problem asks the question of which available POS tagging algorithm will be the most accurate on little training data. This research sets out to answer the question by reviewing the most popular supervised data–driven algorithms. Since literature to date consists mostly of isolated papers discussing one algorithm, the aim of the review is to consolidate the research into a single point of reference. A subsequent experimental investigation compares the tagging algorithms on small training data sets of English and Afrikaans, and it is shown that the hidden Markov model (HMM) tagger outperforms the rest when using both a comprehensive and a reduced POS tagset. Regarding the second problem, the question arises whether it is perhaps possible to circumvent the traditional approaches to prosodic modelling by learning the latter directly from the speech data using POS information. In other words, does the addition of POS features to the HTS context labels improve the naturalness of a TTS voice? Towards answering this question, HTS voices are trained from English and Afrikaans prosodically rich speech. The voices are compared with and without POS features incorporated into the HTS context labels, analytically and perceptually. For the analytical experiments, measures of prosody to quantify the comparisons are explored. It is then also noted whether the results of the perceptual experiments correlate with their analytical counterparts. It is found that, when a minimal feature set is used for the HTS context labels, the addition of POS tags does improve the naturalness of the voice. However, the same effect can be accomplished by including segmental counting and positional information instead of the POS tags. / Thesis (M.Sc. Engineering Sciences (Electrical and Electronic Engineering))--North-West University, Potchefstroom Campus, 2011. part-of-speech tagging text-to-speech synthesis resource-scarce language naturalness prosody HTS context labels
28	Statistical parametric speech synthesis using conversational data and phenomena Dall, Rasmus January 2017 (has links) Statistical parametric text-to-speech synthesis currently relies on predefined and highly controlled prompts read in a “neutral” voice. This thesis presents work on utilising recordings of free conversation for the purpose of filled pause synthesis and as an inspiration for improved general modelling of speech for text-to-speech synthesis purposes. A corpus of both standard prompts and free conversation is presented and the potential usefulness of conversational speech as the basis for text-to-speech voices is validated. Additionally, through psycholinguistic experimentation it is shown that filled pauses can have potential subconscious benefits to the listener but that current text-to-speech voices cannot replicate these effects. A method for pronunciation variant forced alignment is presented in order to obtain a more accurate automatic speech segmentation something which is particularly bad for spontaneously produced speech. This pronunciation variant alignment is utilised not only to create a more accurate underlying acoustic model, but also as the driving force behind creating more natural pronunciation prediction at synthesis time. While this improves both the standard and spontaneous voices the naturalness of spontaneous speech based voices still lags behind the quality of voices based on standard read prompts. Thus, the synthesis of filled pauses is investigated in relation to specific phonetic modelling of filled pauses and through techniques for the mixing of standard prompts with spontaneous utterances in order to retain the higher quality of standard speech based voices while still utilising the spontaneous speech for filled pause modelling. A method for predicting where to insert filled pauses in the speech stream is also developed and presented, relying on an analysis of human filled pause usage and a mix of language modelling methods. The method achieves an insertion accuracy in close agreement with human usage. The various approaches are evaluated and their improvements documented throughout the thesis, however, at the end the resulting filled pause quality is assessed through a repetition of the psycholinguistic experiments and an evaluation of the compilation of all developed methods.
29	A knowledge-based grapheme-to-phoneme conversion for Swedish Thorstensson, Niklas January 2002 (has links) A text-to-speech system is a complex system consisting of several different modules such as grapheme-to-phoneme conversion, articulatory and prosodic modelling, voice modelling etc. This dissertation is aimed at the creation of the initial part of a text-to-speech system, i.e. the grapheme-to-phoneme conversion, designed for Swedish. The problem area at hand is the conversion of orthographic text into a phonetic representation that can be used as a basis for a future complete text-to speech system. The central issue of the dissertation is the grapheme-to-phoneme conversion and the elaboration of rules and algorithms required to achieve this task. The dissertation aims to prove that it is possible to make such a conversion by a rule-based algorithm with reasonable performance. Another goal is to find a way to represent phonotactic rules in a form suitable for parsing. It also aims to find and analyze problematic structures in written text compared to phonetic realization. This work proposes a knowledge-based grapheme-to-phoneme conversion system for Swedish. The system suggested here is implemented, tested, evaluated and compared to other existing systems. The results achieved are promising, and show that the system is fast, with a high degree of accuracy. Grapheme-to-Phoneme algorithm Text-to-speech Swedish Information Systems
30	Comprehension, Processing Time, and Modality Preferences When People with Aphasia and Neurotypical Healthy Adults Read Books: A Pilot Study Pruitt, McKenzie Ellen 22 April 2022 (has links) No description available. Speech Therapy comprehension processing time preferences book-reading text-to-speech technology

Search results