121

Efficient Multispeaker Speech Synthesis and Voice Cloning

McHargue, James 26 May 2023 (has links)
No description available.
122

Improving user trust towards conversational chatbot interfaces with voice output

Burri, Ramón January 2018 (has links)
This thesis investigates the impact of the voice modality on user trust in conversational chatbot interfaces. The assumption is that trust can be increased by adding voice output to a chatbot and by using a higher-quality text-to-speech synthesis. The thesis first introduces chatbots and the concept of conversational interfaces, then defines trust in an online context. Based on this, a model for trust and the perceived factors of credibility, ease of use, and risk is defined. An online experiment is conducted in which participants run through conversational scenarios with a chatbot while the voice output is varied, followed by a survey collecting data on the perception of the trust factors for one scenario with no voice and two scenarios with different speech-synthesis qualities. To analyse the ordinal survey data, the Wilcoxon signed-rank test, a nonparametric statistical test, is used to compare trust across the voice output types. Results show that adding the voice output modality to a conversational chatbot interface increases user trust towards the service. However, the assumption that synthesis quality has an effect on trust could not be confirmed because the results are not statistically significant. On this basis, the limitations of the methods used are discussed and suggestions for further research are proposed.
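
A minimal sketch of this analysis step, using hypothetical paired trust ratings (the actual survey items and scores are not given in the abstract); scipy.stats.wilcoxon implements the nonparametric paired test named above.

```python
from scipy.stats import wilcoxon

# Hypothetical 7-point trust ratings from the same participants under
# two conditions: no voice output vs. voice output added.
trust_no_voice = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3]
trust_with_voice = [5, 4, 4, 6, 4, 5, 4, 4, 5, 4]

# Paired, two-sided Wilcoxon signed-rank test on the ordinal ratings;
# a small p-value indicates a systematic shift in trust between conditions.
stat, p_value = wilcoxon(trust_no_voice, trust_with_voice)
print(f"W = {stat}, p = {p_value:.4f}")
```
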
123

Artificial Neural Networks in Swedish Speech Synthesis / Artificiella neurala nätverk i svensk talsyntes

Näslund, Per January 2018 (has links)
Text-to-speech (TTS) systems have entered our daily lives in the form of smart assistants and many other applications. Contemporary research applies machine learning and artificial neural networks (ANNs) to synthesize speech, and it has been shown that these systems outperform the older concatenative and parametric methods. In this paper, ANN-based methods for speech synthesis are explored and one of the methods is implemented for the Swedish language. The implemented method, dubbed "Tacotron", is a first step towards end-to-end ANN-based TTS and puts many different ANN techniques to work. The resulting system is compared to a parametric TTS through a strength-of-preference test carried out with 20 Swedish-speaking subjects. A statistically significant preference for the ANN-based TTS is found. Test subjects indicate that the ANN-based TTS performs better than the parametric TTS when it comes to audio quality and naturalness but sometimes lacks in intelligibility.
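
As a hedged illustration only (the thesis's exact statistical procedure is not stated in the abstract), a simple sign test on hypothetical preference counts shows how a preference among 20 subjects can be checked against chance; scipy.stats.binomtest requires SciPy 1.7 or later.

```python
from scipy.stats import binomtest

# Hypothetical outcome of a two-alternative preference test: how many of
# the 20 subjects preferred the ANN-based (Tacotron-style) voice.
n_subjects = 20
prefer_ann = 16

# One-sided binomial (sign) test against the chance rate of 0.5.
result = binomtest(prefer_ann, n_subjects, p=0.5, alternative="greater")
print(f"p = {result.pvalue:.4f}")  # p < 0.05 -> preference beyond chance
```
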
124

Hybrid Concatenated-Formant Expressive Speech Synthesizer For Kinesensic Voices

Chandra, Nishant 05 May 2007 (has links)
Traditional and commercial speech synthesizers are incapable of synthesizing speech with proper emotion or prosody. Conveying prosody in artificially synthesized speech is difficult because of the extreme variability in human speech. An arbitrary natural language sentence can have different meanings, depending upon the speaker, speaking style, context, and many other factors. Most concatenated speech synthesizers use phonemes, which are phonetic units defined by the International Phonetic Alphabet (IPA). The 50 phonemes in English are standardized and unique units of sound, but not of expression. An earlier work proposed an analogy between speech and music: "speech is music, music is speech." The speech data obtained from master practitioners, who are trained in kinesensic voice, is marked on a five-level intonation scale similar to the musical scale. From this speech data, 1324 unique expressive units, called expressemes®, are identified. The expressemes consist of melody and rhythm, which, in digital signal processing, are analogous to the pitch, duration, and energy of the signal. The expressemes have less acoustic and phonetic variability than phonemes, so they better convey the prosody. The goal is to develop a speech synthesizer which exploits the prosodic content of expressemes in order to synthesize expressive speech with a small speech database. Creating a reasonably small database that captures multiple expressions is a challenge because a complete set of speech segments may not be available to create an emotion. Methods are suggested whereby acoustic mathematical modeling is used to create missing prosodic speech segments from the base prosody unit. A new concatenated-formant hybrid speech synthesizer architecture is developed for this purpose. A pitch-synchronous, time-varying, frequency-warped wavelet transform based prosody manipulation algorithm is developed for transformation between prosodies. A time-varying frequency-warping transform is developed to smoothly concatenate the temporal and spectral parameters of adjacent expressemes to create intelligible speech. Additionally, issues specific to expressive speech synthesis using expressemes are resolved, for example Ergodic Hidden Markov Model based expresseme segmentation, model creation for F0 and segment duration, and target and join cost calculation. The performance of the hybrid synthesizer is measured against a commercially available synthesizer using objective and perceptual evaluations. Subjects consistently rated the hybrid synthesizer better in five different perceptual tests: 70% of listeners rated the hybrid synthesis as more expressive, and 72% preferred it over the commercial synthesizer. The hybrid synthesizer also received a comparable mean opinion score.
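
To make the expresseme definition concrete: since the abstract equates melody and rhythm with the pitch, duration, and energy of the signal, the sketch below extracts those three descriptors for a hypothetical expresseme segment. The use of librosa and the file name are illustrative assumptions, not details from the thesis.

```python
import librosa
import numpy as np

# Hypothetical expresseme segment; sr=None keeps the native sample rate.
y, sr = librosa.load("expresseme.wav", sr=None)

# Pitch contour (melody): fundamental frequency via probabilistic YIN.
f0, _, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Duration (rhythm) and frame-level energy.
duration = len(y) / sr
energy = librosa.feature.rms(y=y)[0]

print(f"duration: {duration:.3f} s")
print(f"median F0: {np.nanmedian(f0):.1f} Hz")  # NaN frames are unvoiced
print(f"mean RMS energy: {energy.mean():.4f}")
```
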
125

Individual Differences in Speech and Non-Speech Perception of Frequency and Duration

Makashay, Matthew Joel 02 April 2003 (has links)
No description available.
126

Investigating Speaker Features From Very Short Speech Records

Berg, Brian LaRoy 11 September 2001 (has links)
A procedure is presented that is capable of extracting various speaker features, and is of particular value for analyzing records containing single words and shorter segments of speech. By taking advantage of the fast convergence properties of adaptive filtering, the approach is capable of modeling the nonstationarities due to both vocal tract and vocal cord dynamics. Specifically, the procedure extracts the vocal tract estimate from within the closed glottis interval and uses it to obtain a time-domain glottal signal. This procedure is quite simple, requires minimal manual intervention (in cases of inadequate pitch detection), and is unique in that it derives both the vocal tract and glottal signal estimates directly from the time-varying filter coefficients rather than from the prediction error signal. Using this procedure, several glottal signals are derived from human and synthesized speech and analyzed to demonstrate the glottal waveform modeling performance and the kind of glottal characteristics obtained. / Ph. D.
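
As a rough illustration of the adaptive-filtering idea, the sketch below runs a normalized LMS linear predictor over a signal and records the time-varying coefficient trajectory; in the thesis's procedure, the coefficients within the closed-glottis interval would serve as the vocal tract estimate. Filter order, step size, and the stand-in signal are assumptions.

```python
import numpy as np

def nlms_predictor(x, order=12, mu=0.5, eps=1e-8):
    """Track linear-prediction coefficients of x with normalized LMS."""
    w = np.zeros(order)                 # current predictor coefficients
    coeffs = np.zeros((len(x), order))  # coefficient trajectory over time
    err = np.zeros(len(x))
    for n in range(order, len(x)):
        u = x[n - order:n][::-1]        # most recent samples first
        y_hat = w @ u                   # one-step prediction
        e = x[n] - y_hat
        w = w + mu * e * u / (u @ u + eps)  # normalized LMS update
        coeffs[n] = w
        err[n] = e
    return coeffs, err

# Hypothetical usage on a stand-in signal; with real speech, the rows of
# coeffs inside the closed-glottis interval approximate the vocal tract.
x = np.random.randn(16000)
coeffs, err = nlms_predictor(x)
```
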
127

Effects of voice coding and speech rate on a synthetic speech display in a telephone information system

Herlong, David W. January 1988 (has links)
Despite the lack of formal guidelines, synthetic speech displays are used in a growing variety of applications. Telephone information systems permitting human-computer interaction from remote locations are an especially popular implementation of computer-generated speech. Currently, human factors research is needed to specify design characteristics providing usable telephone information systems as defined by task performance and user ratings. Previous research used nonintegrated tasks such as transcription of phonetic syllables, words, or sentences to assess task performance or user preference differences. This study used a computer-driven telephone information system as a real-time, human-computer interface to simulate applications where synthetic speech is used to access data. Subjects used a telephone keypad to navigate through an automated, department store database to locate and transcribe specific information messages. Because speech provides a sequential and transient information display, users may have difficulty navigating through auditory databases. One issue investigated in this study was whether use of alternating male and female voices to code different levels in the database hierarchy would improve user search performance. Other issues investigated were basic intelligibility of these male and female voices as influenced by different levels of speech rate. All factors were assessed as functions of search or transcription task performance and user preference. Analysis of transcription accuracy, search efficiency and time, and subjective ratings revealed an overall significant effect of speech rate on all groups of measures but no significant effects for voice type or coding scheme. Results were used to recommend design guidelines for developing speech displays for telephone information systems. / Master of Science
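
The voice-coding manipulation under test can be stated in a few lines. The toy sketch below alternates two voices by menu depth, which is the navigation cue the study evaluates; the menu content and voice names are invented for illustration.

```python
# Alternate synthetic voices by depth in the menu hierarchy so listeners
# can hear where they are in the auditory database.
def voice_for_level(depth: int) -> str:
    return "male" if depth % 2 == 0 else "female"

menu_path = ["departments", "electronics", "televisions"]  # hypothetical
for depth, node in enumerate(menu_path):
    print(f"{node}: spoken in {voice_for_level(depth)} voice")
```
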
128

The use of the auditory lexical decision task as a method for assessing the relative quality of synthetic speech

Jenkins, Reni L. 04 May 2010 (has links)
This study evaluates a method for determining the quality of synthetic speech systems. The method uses an auditory lexical decision task to assess the quality of synthetic speech generators relative to each other and to natural speech, based on reaction time differences and error rates. Seven voices were evaluated: four synthesizers provided six voices (DECtalk 1.8 Perfect Paul, DECtalk 1.8 Beautiful Betty, DECtalk 2.0 Perfect Paul, DECtalk 2.0 Beautiful Betty, Votrax Personal Speech, Votrax Type'n'Talk) and natural speech provided the seventh voice. Both reaction times and error rates were higher for the low-quality synthetic speech systems. The results document that the DECtalk can currently be considered a high-quality synthesizer and that the Personal Speech and the Type'n'Talk can be considered low-quality synthesizers. The results obtained using this method can be explained by the Activation-Verification model (Paap, McDonald, Schvaneveldt, and Noel, 1986). Within the framework of this model, the results of this study suggest that the verification phase is the bottleneck in processing words produced by synthetic speech generators. This interpretation suggests that emphasizing the differences between phonemes, to make them more uniquely identifiable, rather than concentrating on making them more "natural", might lead to improved results with synthesized speech. / Master of Science
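
The two dependent measures of the method, reaction time and error rate, are easy to compute per voice. The sketch below does so on hypothetical trial records; the numbers are illustrative, not the study's data.

```python
from collections import defaultdict

# (voice, reaction_time_ms, correct) tuples from a lexical decision task.
trials = [
    ("natural", 540, True), ("natural", 580, True), ("natural", 600, False),
    ("DECtalk", 610, True), ("DECtalk", 650, True), ("DECtalk", 700, False),
    ("Votrax", 820, True), ("Votrax", 900, False), ("Votrax", 870, False),
]

by_voice = defaultdict(list)
for voice, rt, correct in trials:
    by_voice[voice].append((rt, correct))

# Mean RT on correct trials and overall error rate per voice; slower,
# more error-prone responses indicate a lower-quality voice.
for voice, rows in by_voice.items():
    correct_rts = [rt for rt, ok in rows if ok]
    error_rate = sum(1 for _, ok in rows if not ok) / len(rows)
    mean_rt = sum(correct_rts) / len(correct_rts)
    print(f"{voice}: mean RT {mean_rt:.0f} ms, error rate {error_rate:.0%}")
```
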
129

Join cost for unit selection speech synthesis

Vepa, Jithendra January 2004 (has links)
Undoubtedly, state-of-the-art unit selection-based concatenative speech systems produce very high quality synthetic speech. This is due to a large speech database containing many instances of each speech unit, with a varied and natural distribution of prosodic and spectral characteristics. The join cost, which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from this large speech database. The ideal join cost is one that measures perceived discontinuity based on easily measurable spectral properties of the units being joined, in order to ensure smooth and natural sounding synthetic speech. In the first part of my research, I investigated various spectrally based distance measures for use in computing the join cost by designing a perceptual listening experiment. A variation on the usual perceptual test paradigm is proposed in this thesis by deliberately including a wide range of join qualities in polysyllabic words. The test stimuli are obtained using a state-of-the-art unit-selection text-to-speech system: rVoice from Rhetorical Systems Ltd. Three spectral features, Mel-frequency cepstral coefficients (MFCC), line spectral frequencies (LSF) and multiple centroid analysis (MCA) parameters, and various statistical distances (Euclidean, Kullback-Leibler, Mahalanobis) are used to obtain distance measures. Based on the correlations between perceptual scores and these spectral distances, I propose new spectral distance measures which correlate well with human perception of concatenation discontinuities. The second part of my research concentrates on combining the join cost computation and the smoothing operation, which is required to disguise joins, by learning an underlying representation from the acoustic signal. To accomplish this task, I chose linear dynamic models (LDM), sometimes known as Kalman filters. Three different initialisation schemes are used prior to Expectation-Maximisation (EM) in LDM training. Once the models are trained, the join cost is computed from the error between model predictions and actual observations, and analytical measures are derived from the shape of this error plot. These measures and initialisation schemes are compared by computing correlations against the perceptual data. The LDMs are also able to smooth the observations, which are then used to synthesise speech. To evaluate the LDM smoothing operation, another listening test is performed in which it is compared with standard methods (simple linear interpolation). In the third part of my research, I compared the best three join cost functions, chosen from the first and second parts, subjectively using a listening test. In this test, I also evaluated different smoothing methods: no smoothing, linear smoothing, and smoothing achieved using LDMs.
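
One join-cost variant the thesis examines, a Mahalanobis distance over MFCC features at the join boundary, can be sketched as follows. The librosa feature extraction, the single-boundary-frame comparison, and the covariance estimate are assumptions for illustration; with an identity inverse covariance the measure reduces to the Euclidean distance, another of the distances compared.

```python
import numpy as np
import librosa

def join_cost_mahalanobis(unit_a, unit_b, sr, VI, n_mfcc=13):
    """Distance between the last MFCC frame of unit_a and the first of unit_b."""
    mfcc_a = librosa.feature.mfcc(y=unit_a, sr=sr, n_mfcc=n_mfcc)
    mfcc_b = librosa.feature.mfcc(y=unit_b, sr=sr, n_mfcc=n_mfcc)
    d = mfcc_a[:, -1] - mfcc_b[:, 0]   # difference at the join boundary
    return float(np.sqrt(d @ VI @ d))  # Mahalanobis with inverse covariance VI

# Hypothetical usage: in practice VI would be estimated from MFCCs over
# the whole unit database; identity makes this plain Euclidean distance.
sr = 16000
unit_a = np.random.randn(4000)  # stand-ins for two candidate units
unit_b = np.random.randn(4000)
VI = np.eye(13)
cost = join_cost_mahalanobis(unit_a, unit_b, sr, VI)
```
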
130

Software and Hardware Interface of a VOTRAX Terminal for the Fairchild F24 Computer

Wu, Chun Hsiang 05 1900 (has links)
VOTRAX is a commercially available voice synthesizer for use with a digital computer. This thesis describes the design and implementation of a VOTRAX terminal for use with the Fairchild F24 computer. Chapters of the thesis consider audio response technology, some characteristics of phonetic English speech, and the hardware configuration, and describe the PHONO computer program that was developed. The last chapter discusses the advantages of the VOTRAX voice synthesizer and proposes a future version of the system with a time-sharing host computer.
