1 |
Speech enhancement and vowel perception. Peters, C. J. January 1987 (has links)
No description available.
|
2 |
The use of learnable phonetic representations in connectionist text-to-speech system. Cohen, Andrew Dight January 1997 (has links)
No description available.
|
3 |
The role of the dorsal cochlear nucleus in the perception of voicing contrasts in initial English stop consonants : a computational modelling study. Pont, Michael Joseph January 1989 (has links)
No description available.
|
4 |
Intelligibility of synthetic speech in noise and reverberation. Isaac, Karl Bruce January 2015 (has links)
Synthetic speech is a valuable means of output, in a range of application contexts, for people with visual, cognitive, or other impairments, or for situations where other means are not practicable. Noise and reverberation occur in many of these application contexts and are known to have devastating effects on the intelligibility of natural speech, yet very little was known about the effects on synthetic speech based on unit selection or hidden Markov models. In this thesis, we put forward an approach for assessing the intelligibility of synthetic and natural speech in noise, reverberation, or a combination of the two. The approach uses an experimental methodology consisting of Amazon Mechanical Turk, Matrix sentences, and noises that approximate real-world conditions, evaluated with generalized linear mixed models. The experimental methodologies were assessed against their traditional counterparts and were found to provide a number of additional benefits whilst maintaining equivalent measures of relative performance. Subsequent experiments were carried out to establish the efficacy of the approach in measuring intelligibility in noise and then reverberation. Finally, the approach was applied to natural speech and the two synthetic speech systems in combinations of noise and reverberation. We examine and report on the intelligibility of current synthesis systems in real-life noise and reverberation, using techniques that bridge the gap between the audiology and speech synthesis communities. In the process, we establish Amazon Mechanical Turk and Matrix sentences as valuable tools in the assessment of synthetic speech intelligibility.
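As a rough illustration of how Matrix-sentence scores can be reduced to an intelligibility measure (a minimal sketch, not the thesis's actual generalized linear mixed model; the data and parameter values below are invented for the example), a logistic psychometric function can be fitted to the proportion of keywords reported correctly at each signal-to-noise ratio:

```python
# Sketch: fit a logistic psychometric function to keyword-correct
# proportions from a Matrix-style test at several SNRs. Illustrative
# data only; not the thesis's analysis.
import numpy as np
from scipy.optimize import curve_fit

def psychometric(snr_db, midpoint, slope):
    """Proportion of keywords correct as a logistic function of SNR (dB)."""
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint)))

# Hypothetical pooled scores: proportion of keywords correct per SNR.
snr = np.array([-15.0, -10.0, -5.0, 0.0, 5.0])
p_correct = np.array([0.08, 0.31, 0.62, 0.88, 0.97])

(midpoint, slope), _ = curve_fit(psychometric, snr, p_correct, p0=[-5.0, 0.5])
print(f"SRT (50% point): {midpoint:.1f} dB SNR, slope: {slope:.2f} per dB")
```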
|
5 |
Investigating Speech Perception in Evolutionary Perspective: Comparisons of Chimpanzee (Pan troglodytes) and Human Capabilities. Heimbauer, Lisa A 01 August 2012 (has links)
There has been much discussion regarding whether the capability to perceive speech is uniquely human. The “Speech is Special” (SiS) view proposes that humans possess a specialized cognitive module for speech perception (Mann & Liberman, 1983). In contrast, the “Auditory Hypothesis” (Kuhl, 1988) suggests spoken-language evolution took advantage of existing auditory-system capabilities. In support of the Auditory Hypothesis, there is evidence that Panzee, a language-trained chimpanzee (Pan troglodytes), perceives speech in synthetic “sine-wave” and “noise-vocoded” forms (Heimbauer, Beran, & Owren, 2011). Human comprehension of these altered forms of speech has been cited as evidence for specialized cognitive capabilities (Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005).
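Noise-vocoded speech of the kind referred to here is conventionally produced by dividing speech into frequency bands, extracting each band's amplitude envelope, and using the envelopes to modulate band-limited noise. The sketch below assumes that standard construction; the band count, band edges, and filter settings are illustrative, not the stimuli used in the cited studies:

```python
# Minimal noise-vocoder sketch (assumed standard technique).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_bands=4, f_lo=100.0, f_hi=5000.0):
    """Noise-vocode signal x (f_hi must stay below fs/2)."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)  # log-spaced band edges
    noise = np.random.randn(len(x))
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))           # amplitude envelope of the band
        carrier = sosfiltfilt(sos, noise)     # band-limited noise carrier
        out += env * carrier
    return out / (np.max(np.abs(out)) + 1e-12)  # normalize
```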
In light of Panzee’s demonstrated abilities, three experiments extended these investigations of the cognitive processes underlying her speech perception. The first experiment investigated the acoustic cues that Panzee and humans use when identifying sine-wave and noise-vocoded speech. The second experiment examined Panzee’s ability to perceive “time-reversed” speech, in which individual segments of the waveform are reversed in time. Humans are able to perceive such speech if these segments do not much exceed average phoneme length. Finally, the third experiment tested Panzee’s ability to generalize across both familiar and novel talkers, a perceptually challenging task that humans seem to perform effortlessly.
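Locally time-reversed speech, as described above, can be sketched by cutting the waveform into fixed-length windows and reversing each window in place; the 50 ms default below is an illustrative value, not the study's parameter:

```python
# Sketch: locally time-reversed speech. Intelligibility typically
# degrades once windows much exceed average phoneme duration.
import numpy as np

def locally_time_reverse(x, fs, window_ms=50.0):
    n = max(1, int(fs * window_ms / 1000.0))
    segments = [x[i:i + n][::-1] for i in range(0, len(x), n)]
    return np.concatenate(segments)
```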
Panzee’s performance was similar to that of humans in all experiments. In Experiment 1, results demonstrated that Panzee likely attends to the same “spectro-temporal” cues in sine-wave and noise-vocoded speech that humans are sensitive to. In Experiment 2, Panzee showed an intelligibility pattern as a function of reversal-window length similar to that found in human listeners. In Experiment 3, Panzee readily recognized words not only from a variety of familiar adult males and females, but also from unfamiliar adults and children of both sexes. Overall, the results suggest that general auditory processing, combined with sufficient exposure to meaningful spoken language, can account for speech-perception evidence previously proposed to require specialized, uniquely human mechanisms. These findings in turn suggest that speech-perception capabilities were already present in latent form in the common evolutionary ancestors of modern chimpanzees and humans.
|
6 |
Orator verbis electris : taldatorn en pedagogisk länk till läs- och skrivfärdighet: utprövning och utvärdering av taldatorbaserade träningsprogram för elever med läs- och skrivsvårigheter / Orator verbis electris : speech computer a pedagogical link to literacy: development and evaluation of speech-computer based training programs for children with reading and writing problems. Dahl, Irené January 1997 (has links)
This study presents results from a project named the OVE project (Orator Verbis Electris, i.e., electric speech machine). The aim of this thesis is to describe and evaluate a set of computer programs based on synthetic speech. The programs are designed for training phonological awareness and are intended to be used as a remedial tool for children with reading and writing problems. During the evaluation and data-collecting period the project was economically and technically connected to the Department of Speech, Music and Hearing at the Royal Institute of Technology in Stockholm. My associate during this period, the late Karoly Galyas, was in charge of the technical part of the project, while I was responsible for the content and design of the programs as well as for the evaluation and assessment studies. The thesis has two overall aims. The first is to describe, analyze and discuss the development of a specific work process based on computer-based supervision with synthetic speech feedback, intended for children with reading and writing problems. The second is to describe, analyze and discuss the effect of the speech feedback and the specific work process on children with reading and writing problems. The evaluation and data collection were carried out in three phases and consist of three training studies and two experimental studies: one text-reproduction study and one spelling-test study. The results showed very convincingly that the speech feedback had a positive effect on the pupils' self-esteem. Activity and motivation were highly improved, and a remarkable improvement in the pupils' ability to concentrate and persevere was observed. The results from the spelling-test study showed high improvement rates when speech feedback in ongoing writing was used, compared to handwriting and ordinary computer writing without speech feedback. The potential of synthetic speech feedback as a means of improving reading and writing skills is discussed.
|
7 |
Speech generating devices and autism : a comparison of digitized and synthetic speech output. Ramdoss, Sathiyaprakash Thoppae 24 September 2013
Children with autism often experience substantial impairments in the domain of language and communication. The speech generating device (SGD) is one of the most widely used augmentative communication systems with this population. The most prevalent speech output systems currently in use with SGDs are digitized and synthetic speech outputs. Each speech output system has advantages and disadvantages, and large individual differences in preference and performance have been suggested for both modalities. There is currently no published research that compares digitized and synthetic speech outputs.
The primary goal of this study is to examine the effects of SGD training using digitized vs. synthetic speech outputs on the acquisition of requesting skills of four non-verbal children diagnosed with autism. The study addressed the following research questions. First, are there differences in acquisition rates for requests taught using digitized vs. synthetic speech outputs? Second, do children show a preference for one speech output over the other? Finally, are there any differences in perceived social validity of digitized vs. synthetic speech outputs?
The primary findings of this study were: (1) differences in performance were found between two of the participants within each speech output; (2) two of the participants appeared to prefer one speech output over the other, and one participant could not indicate his preference due to positioning bias; (3) social validity measures indicated favorable ratings for SGD training but no clear indications of the acceptability and usability of the speech outputs across different settings. The overall results suggest that speech output can play a significant role and is one of the important components contributing to the success of the intervention. Additionally, the overall outcome suggests that non-verbal children with autism can successfully learn to use SGDs at their own pace with the support of proper prompting strategies and instructional procedures.
|
8 |
Brain Mapping of the Mismatch Negativity Response to Vowel Variances of Natural and Synthetic Phonemes. Smith, Lyndsy Marie 26 November 2013 (has links) (PDF)
The mismatch negativity (MMN) is a specific event-related potential (ERP) component used frequently in the observation of auditory processing. The MMN is elicited by a deviant stimulus presented randomly among repeating stimuli. The current study used the MMN response to examine the temporal (timing) and linguistic processing of natural and synthetic vowel stimuli. It was hypothesized that a significant MMN response would be elicited by both natural and synthetic vowel stimuli. Brain mapping of the MMN response was hypothesized to yield temporal-resolution information detailing the sequential processing differences between natural and synthetic vowel stimuli, and the location of dipoles within the cortex was hypothesized to provide information on differences in cortical localization of processing for natural and synthetic stimuli. Vowel stimuli were presented to twenty participants (10 females and 10 males between the ages of 18 and 26 years) in a three-forced-choice response paradigm. Behavioral responses, reaction times, and ERPs were recorded for each participant. Results demonstrated differences in the behavioral and electrophysiological responses to natural and synthesized vowels presented to young, normal-hearing adults. Significant MMN responses were evoked by both natural and synthetic vowel stimuli. Reaction times were greater for the synthetic vowel phonemes than for the natural vowel phonemes. Electrophysiological differences were seen primarily in the processing of the synthetic /u/ stimuli. Scalp distribution of cognitive processing was essentially the same for naturally produced phonemes. Processing of synthetic phonemes also had similar scalp distributions; however, the synthetic /u/ phoneme required more complex processing than the synthetic /æ/ phoneme. The most significant processing localizations were in the superior temporal gyrus, which is known for its role in linguistic processing. Continued processing in the frontal lobe was observed, suggesting continual evaluation of natural and synthetic phonemes throughout processing.
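As an illustrative sketch of how an MMN is typically derived from an oddball paradigm (not the study's actual pipeline; the electrode choice, latency window, and epoch layout below are assumptions), the deviant-minus-standard difference wave and its peak negativity might be computed as follows:

```python
# Sketch: deviant-minus-standard difference wave for one electrode.
import numpy as np

def mmn_difference_wave(standard_epochs, deviant_epochs):
    """Each input: (n_trials, n_samples) epochs time-locked to stimulus onset."""
    return deviant_epochs.mean(axis=0) - standard_epochs.mean(axis=0)

def mmn_peak(diff_wave, fs, t0_ms=100.0, t1_ms=250.0):
    """Most negative value in the typical 100-250 ms MMN latency window."""
    i0, i1 = int(fs * t0_ms / 1000.0), int(fs * t1_ms / 1000.0)
    return float(diff_wave[i0:i1].min())
```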
|
9 |
The Impact of Degraded Speech and Stimulus Familiarity in a Dichotic Listening Task. Sinatra, Anne M. 01 January 2012
It has been previously established that when engaged in a difficult, attention-intensive task that involves repeating information while blocking out other information (the dichotic listening task), participants are often able to report hearing their own names in an unattended audio channel (Moray, 1959). This phenomenon, called the cocktail party effect, is a result of words that are important to oneself having a lower threshold, so that less attention is necessary to process them (Treisman, 1960). The current studies examined the ability of a person engaged in an attention-demanding task to hear and recall low-threshold words from a fictional story. These low-threshold words included a traditional alert word, "fire", and fictional character names from a popular franchise, Harry Potter. Further, the role of stimulus degradation was examined by including synthetic and accented speech in the task to determine how it would impact attention and performance.

In Study 1 participants repeated passages from a novel that was largely unfamiliar to them, The Secret Garden, while blocking out a passage from a much more familiar source, Harry Potter and the Deathly Hallows. Each unattended Harry Potter passage was edited to include four names from the series and the word "fire" twice. The type of speech present in the attended and unattended ears (Natural or Synthetic) was varied to examine the impact that processing degraded speech would have on performance. The speech that the participant shadowed did not affect unattended recall; however, it did affect shadowing accuracy. The speech type present in the unattended ear did affect the ability to recall low-threshold Harry Potter information: when the unattended speech was synthetic, significantly less Harry Potter information was recalled. Interestingly, while Harry Potter information was recalled by participants with both high and low Harry Potter experience, the traditional low-threshold word, "fire", was not noticed by participants.

Study 2 was designed to determine whether synthetic speech impeded the reporting of low-threshold Harry Potter names because it was degraded or simply because it differed from natural speech. In Study 2 the attended (shadowed) speech was held constant as American Natural speech, and the unattended ear was manipulated. An accent different from the native accent of the participants was included as a mild form of degradation. There were four experimental stimuli, each containing one of the following in the unattended ear: American Natural, British Natural, American Synthetic, or British Synthetic speech. Overall, more unattended information was reported when the unattended channel was Natural than when it was Synthetic. This implies that synthetic speech takes more working-memory processing power than even accented natural speech. Further, experience with the Harry Potter franchise played a role in the ability to report unattended Harry Potter information. Those who had high levels of Harry Potter experience, particularly with audiobooks, were able to process and report Harry Potter information from the unattended stimulus when it was British Natural, whereas those with low Harry Potter experience were not able to report unattended Harry Potter information from this slightly degraded stimulus.
Therefore, it is believed that the previous audiobook experience of those in the high Harry Potter experience group acted as training and resulted in less working memory being necessary to encode the unattended Harry Potter information. A pilot study was designed to examine the impact of story familiarity in the attended and unattended channels of a dichotic listening task. In the pilot study, participants shadowed a Harry Potter passage (familiar) in one condition with a passage from The Secret Garden (unfamiliar) playing in the unattended ear; a second condition had participants shadowing The Secret Garden (unfamiliar) with a passage from Harry Potter (familiar) present in the unattended ear. There was no significant difference in the number of unattended names recalled. Those with low Harry Potter experience reported significantly less attended information when they shadowed Harry Potter than when they shadowed The Secret Garden. Further, there appeared to be a trend such that those with high Harry Potter experience reported more attended information when they shadowed Harry Potter than The Secret Garden. This implies that experience with a franchise and its characters may make it easier to recall information about a passage, while lack of experience provides no assistance. Overall, the results of the studies indicate that we treat fictional characters in a way similar to how we treat ourselves: names and information about fictional characters were able to break through into attention during a task that required a great deal of attention. Experience with the characters also helped working memory to process the information in degraded circumstances. These results have important implications for training, the design of alerts, and the use of popular media in the classroom.
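A dichotic-listening stimulus of the kind used in these studies pairs an attended passage in one ear with an unattended passage in the other. The sketch below shows one plausible way to assemble such a stereo stimulus (the file name, channel assignment, and scaling are assumptions, not the studies' materials):

```python
# Sketch: write a dichotic stereo WAV (left = attended, right = unattended).
import numpy as np
from scipy.io import wavfile

def make_dichotic(attended, unattended, fs, path="dichotic.wav"):
    n = min(len(attended), len(unattended))          # truncate to common length
    stereo = np.stack([attended[:n], unattended[:n]], axis=1)
    stereo = stereo / (np.max(np.abs(stereo)) + 1e-12)
    wavfile.write(path, fs, (stereo * 32767).astype(np.int16))
```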
|
10 |
Timbre Perception of Time-Varying Signals. Arthi, S January 2014 (links) (PDF)
Every auditory event provides an information-rich signal to the brain. The signal carries the perceptual attributes of pitch, loudness and timbre, as well as conceptual attributes such as location, emotion and meaning. In the present work we examine the timbre perception of time-varying signals in particular. While the timbre of a stationary signal is by itself perceptually complex, a time-varying timbre introduces an evolving pattern, adding to its multi-dimensionality.
To characterize timbre, we conduct psycho-acoustic perception tests with normal-hearing human subjects. We focus on time-varying synthetic speech signals (the approach can be extended to music) because listeners are perceptually consistent with speech, and because we can parametrically control the timbre and pitch glides using linear time-varying models. To quantify the timbre change in time-varying signals, we define the JND (just noticeable difference) of timbre using diphthongs synthesized with a time-varying formant-frequency model. The diphthong JND is defined as a two-dimensional contour on the plane of percentage change of the formant frequencies of the terminal vowels. Thus, we reduce the perceptual probing to a lower-dimensional space, i.e., 2-D, even for a diphthong, which is multi-parametric. We also study the impact of pitch glide on the timbre JND of the diphthong, and observe that the timbre JND is influenced by the occurrence of a pitch glide.
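A much-simplified sketch of diphthong synthesis with a time-varying formant model is shown below (an assumed two-formant source-filter scheme with linear formant glides between terminal-vowel targets; the thesis's synthesizer and parameter values may differ):

```python
# Sketch: diphthong from an impulse-train source through time-varying
# two-pole resonators whose centre frequencies glide linearly.
import numpy as np

def resonator(x, freqs, bw, fs):
    """Two-pole resonator with a per-sample centre frequency track."""
    y = np.zeros_like(x)
    r = np.exp(-np.pi * bw / fs)
    y1 = y2 = 0.0
    for n in range(len(x)):
        theta = 2.0 * np.pi * freqs[n] / fs
        b0 = (1.0 - r) * np.sqrt(1.0 - 2.0 * r * np.cos(2.0 * theta) + r * r)
        y[n] = b0 * x[n] + 2.0 * r * np.cos(theta) * y1 - r * r * y2
        y2, y1 = y1, y[n]
    return y

def diphthong(f_start, f_end, f0=120.0, dur=0.4, fs=16000):
    """f_start/f_end: (F1, F2) of the terminal vowels."""
    n = int(dur * fs)
    source = np.zeros(n)
    source[:: int(fs / f0)] = 1.0            # impulse-train glottal source
    out = np.zeros(n)
    for fa, fb in zip(f_start, f_end):
        track = np.linspace(fa, fb, n)       # linear formant glide
        out += resonator(source, track, bw=80.0, fs=fs)
    return out / (np.max(np.abs(out)) + 1e-12)

# e.g. an /ai/-like glide: F1 700 -> 300 Hz, F2 1200 -> 2300 Hz (assumed targets)
y = diphthong((700.0, 1200.0), (300.0, 2300.0))
```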
Focusing on the magnitude of perceptual timbre change, we design a MUSHRA-like listening test using the vowel continuum in formant-frequency space. We provide explicit anchors for reference, 0% and 100%, thus quantifying the perceptual timbre change on a 1-D scale. We also propose an objective measure of timbre change and observe good correlation between the objective measure and subjective human ratings of percentage timbre change.
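The thesis's objective measure is not specified here; as one plausible stand-in, the percentage timbre change of a test vowel can be expressed as its projected position along the line joining the 0% and 100% anchor vowels in (F1, F2) space (the anchor values in the example are illustrative):

```python
# Sketch of an objective timbre-change measure: projection onto the
# anchor-to-anchor vowel continuum in formant space. An assumption,
# not the thesis's proposed metric.
import numpy as np

def percent_timbre_change(test, anchor0, anchor100):
    """Project (F1, F2) of `test` onto the anchor0 -> anchor100 continuum."""
    a, b, t = (np.asarray(v, float) for v in (anchor0, anchor100, test))
    d = b - a
    return 100.0 * float(np.dot(t - a, d) / np.dot(d, d))

# e.g. anchors /a/ = (700, 1200) Hz and /i/ = (300, 2300) Hz
print(percent_timbre_change((500, 1750), (700, 1200), (300, 2300)))  # ~50.0
```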
Using the above experimental methodology, we studied the influence of pitch shift on timbre perception and observed that perceptual timbre change increases with change in pitch. We used vowels and diphthongs with five different types of pitch glides: (i) constant pitch, (ii) 3-semitone linearly up, (iii) 3-semitone linearly down, (iv) V-like pitch glide, and (v) hat-like pitch glide. The present study shows that timbre change can be measured on a 1-D scale if the perturbation is along one dimension. We observe that for bright vowels (/a/ and /i/), a linearly decreasing pitch glide (dull pitch glide) causes more timbre change than a linearly increasing pitch glide (bright pitch glide); for the dull vowel (/u/), it is vice versa. To summarize, incongruent pitch glides cause more perceptual timbre change than congruent pitch glides. (A congruent pitch glide means a bright pitch glide in a bright vowel or a dull pitch glide in a dull vowel; an incongruent pitch glide means a bright pitch glide in a dull vowel or a dull pitch glide in a bright vowel.) Experiments with quadratic pitch glides show that the decay portion of the pitch glide affects timbre perception more than the attack portion in short-duration signals with little or no sustained part.
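The five pitch-glide types described above can be sketched as F0 contours (the base F0, the 3-semitone excursion, and the sample count below are illustrative assumptions):

```python
# Sketch: the five pitch-glide contours as F0 tracks in Hz.
import numpy as np

def pitch_contours(f0=150.0, st=3.0, n=400):
    half = n // 2
    ramp = np.linspace(0.0, st, n)                       # 0 -> +st semitones
    down_up = np.concatenate([np.linspace(0.0, -st, half),
                              np.linspace(-st, 0.0, n - half)])
    shapes = {
        "constant":    np.zeros(n),
        "linear_up":   ramp,
        "linear_down": -ramp,
        "V_like":      down_up,      # falls, then rises back
        "hat_like":    -down_up,     # rises, then falls back
    }
    # Convert semitone excursions to Hz relative to the base F0.
    return {name: f0 * 2.0 ** (semi / 12.0) for name, semi in shapes.items()}
```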
In the case of time-varying timbre, bright diphthongs show patterns similar to bright vowels: for the bright diphthong (/ai/), perceived timbre change is greatest with a decreasing (dull) pitch glide. We also observed that listeners perceive more timbre change with constant pitch than with pitch glides congruent with the timbre or with quadratic pitch glides.
The main conclusion of this study is that pitch and timbre do interact, and that incongruent pitch glides cause more timbre change than congruent pitch glides. In the case of quadratic pitch glides, listeners' perception of vowels is influenced more by the decay than by the attack of the pitch glide in short-duration signals. In the case of time-varying timbre as well, incongruent pitch glides cause the most timbre change, followed by constant pitch. For congruent pitch glides and quadratic pitch glides in time-varying timbre, listeners perceive less timbre change than otherwise.
|