1.
Full covariance modelling for speech recognition. Bell, Peter, January 2010.
HMM-based systems for Automatic Speech Recognition typically model the acoustic features using mixtures of multivariate Gaussians. In this thesis, we consider the problem of learning a suitable covariance matrix for each Gaussian. A variety of schemes have been proposed for controlling the number of covariance parameters per Gaussian, and studies have shown that in general, the greater the number of parameters used in the models, the better the recognition performance. We therefore investigate systems with full covariance Gaussians. However, in this case, the obvious choice of parameters – given by the sample covariance matrix – leads to matrices that are poorly-conditioned, and do not generalise well to unseen test data. The problem is particularly acute when the amount of training data is limited. We propose two solutions to this problem: firstly, we impose the requirement that each matrix should take the form of a Gaussian graphical model, and introduce a method for learning the parameters and the model structure simultaneously. Secondly, we explain how an alternative estimator, the shrinkage estimator, is preferable to the standard maximum likelihood estimator, and derive formulae for the optimal shrinkage intensity within the context of a Gaussian mixture model. We show how this relates to the use of a diagonal covariance smoothing prior. We compare the effectiveness of these techniques to standard methods on a phone recognition task where the quantity of training data is artificially constrained. We then investigate the performance of the shrinkage estimator on a large-vocabulary conversational telephone speech recognition task. Discriminative training techniques can be used to compensate for the invalidity of the model correctness assumption underpinning maximum likelihood estimation. On the large-vocabulary task, we use discriminative training of the full covariance models and diagonal priors to yield improved recognition performance.
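The shrinkage idea described in this abstract can be sketched numerically. The following is a minimal illustration, assuming a convex combination of the sample covariance with its own diagonal (the shrinkage target); the intensity is chosen by hand here, not by the optimal formula derived in the thesis:

```python
# Hedged sketch of diagonal-target covariance shrinkage: blend the sample
# covariance with its own diagonal so the estimate stays well-conditioned
# when training data are scarce. The intensity 0.3 is an arbitrary choice,
# not the thesis's derived optimal shrinkage intensity.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 12                               # dimension close to sample count
samples = rng.standard_normal((n, d))

sample_cov = np.cov(samples, rowvar=False)  # nearly singular in this regime
target = np.diag(np.diag(sample_cov))       # diagonal shrinkage target
lam = 0.3                                   # shrinkage intensity
shrunk = lam * target + (1 - lam) * sample_cov

# shrinkage improves conditioning of the estimate
print(np.linalg.cond(shrunk) < np.linalg.cond(sample_cov))
```

This also shows why the technique relates to a diagonal smoothing prior: the estimate is pulled toward a diagonal matrix, exactly what a prior centred on diagonal covariances would do.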
2.
Modelling speech dynamics with trajectory-HMMs. Zhang, Le, January 2009.
The conditional independence assumption imposed by hidden Markov models (HMMs) makes it difficult to model temporal correlation patterns in human speech. Traditionally, this limitation is circumvented by appending the first- and second-order regression coefficients to the observation feature vectors. Although this leads to improved performance in recognition tasks, we argue that a straightforward use of dynamic features in HMMs will result in an inferior model, due to the incorrect handling of dynamic constraints. In this thesis I will show that an HMM can be transformed into a Trajectory-HMM capable of generating smoothed output mean trajectories, by performing a per-utterance normalisation. The resulting model can be trained by either maximising model log-likelihood or minimising mean generation errors on the training data. To combat the exponential growth of paths in searching, the idea of delayed path merging is proposed, and a new time-synchronous decoding algorithm built on the concept of token-passing is designed for use in the recognition task. The Trajectory-HMM brings a new way of sharing knowledge between speech recognition and synthesis components, by tackling both problems in a coherent statistical framework. I evaluated the Trajectory-HMM on two different speech tasks using the speaker-dependent MOCHA-TIMIT database: first as a generative model to recover articulatory features from the speech signal, where the Trajectory-HMM was used in a way complementary to conventional HMM modelling techniques, within a joint acoustic-articulatory framework. Experiments indicate that the jointly trained acoustic-articulatory models are more accurate (having a lower root mean square error) than the separately trained ones, and that Trajectory-HMM training results in greater accuracy compared with conventional Baum-Welch parameter updating.
In addition, the root mean square (RMS) training objective proves to be consistently better than the maximum likelihood objective. However, experiments on the phone recognition task show that the MLE-trained Trajectory-HMM, while retaining the attractive property of being a proper generative model, tends to favour over-smoothed trajectories among competing hypotheses, and does not perform better than a conventional HMM. We use this to build an argument that models giving a better fit on training data may suffer a reduction in discrimination by being too faithful to the training data. Finally, experiments using triphone models show that increasing modelling detail is an effective way to improve modelling performance with little added complexity in training.
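The first- and second-order regression coefficients mentioned in this abstract (the standard "delta" and "delta-delta" dynamic features) can be sketched as follows. This is a minimal illustration using the common regression formula with a window of 2 frames, an assumption rather than the thesis's exact configuration:

```python
# Hedged sketch of dynamic ("delta") features via the common regression
# formula delta_t = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2),
# window size 2; edge frames are handled by repeating first/last frames.
import numpy as np

def delta(features, window=2):
    """First-order regression coefficients of a (T, D) feature matrix."""
    T = len(features)
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    out = np.zeros((T, features.shape[1]))
    for k in range(1, window + 1):
        out += k * (padded[window + k : window + k + T]
                    - padded[window - k : window - k + T])
    return out / (2 * sum(k * k for k in range(1, window + 1)))

# usage: augment static MFCC-like features with dynamics
statics = np.random.default_rng(0).standard_normal((100, 13))
d1 = delta(statics)                        # first-order dynamics
d2 = delta(d1)                             # second-order ("delta-delta")
augmented = np.hstack([statics, d1, d2])   # (100, 39) observation vectors
```

Appending `d1` and `d2` to the statics is exactly the "straightforward use of dynamic features" the thesis argues mishandles dynamic constraints, since the HMM then treats the augmented vectors as conditionally independent.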
3.
Investigation into automatic speech recognition of different dialects of Northern Sotho. Mapeka, Madimetja Asaph, January 2005.
Thesis (MSc. (Computer Science)), University of Limpopo, 2005. Refer to the document. Telkom (SA), HP (SA) and National Research Fund.
4.
Speech segmentation into phonetic groups for speech recognition and synthesis (Τεμαχιοποίηση ομιλίας σε φωνητικές ομάδες για αναγνώριση και σύνθεση ομιλίας). Μπουρνά, Βασιλική, 21 January 2009.
The continuously rising development of applications such as text-to-speech (TTS) and automatic speech recognition (ASR) systems makes it imperative to investigate characteristics of speech that are not limited to syntactic or lexical rules but are signalled by different processes, such as prosody. The prosodic features of speech are those which, beyond the lexical content of utterances, point out other important elements concerning focus and accent, implying in that way a secondary subjacent channel of communication. Moreover, they are connected to a great extent with the expression of emotion in speech. Thus, it is important to investigate these features in neutral speech as well as in speech produced under emotional conditions.
In this thesis, a database of emotional speech was segmented at the phoneme level, and the prosodic events occurring at the syllable level were given intonational annotation, in order to extract the parameters that allow us to study prosodic features in the presence of an emotional state, compared with neutral speech.
The extracted data were then processed and the prosodic features studied, by comparing the characteristics observed across the different emotional conditions and by building phoneme duration prediction models; the conclusions drawn from these processes regarding the prosodic aspect of emotional speech are presented.
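A phoneme duration prediction model of the kind this abstract describes can be sketched very simply. The following is a minimal linear-regression illustration on synthetic data; the features (stress, phrase position, intrinsic phone duration) and all numbers are hypothetical, not those used in the thesis:

```python
# Hedged sketch of a linear phoneme-duration model: predict duration (ms)
# from simple contextual features via least squares. All data are synthetic
# and the feature set is a hypothetical stand-in for the thesis's predictors.
import numpy as np

rng = np.random.default_rng(1)
n = 200
# hypothetical features: [is_stressed, is_phrase_final, intrinsic_duration]
X = np.column_stack([
    rng.integers(0, 2, n).astype(float),   # lexical stress flag
    rng.integers(0, 2, n).astype(float),   # phrase-final lengthening flag
    rng.uniform(40, 120, n),               # intrinsic duration of phone class
])
true_w = np.array([25.0, 40.0, 1.0])       # ms contributed per feature
y = X @ true_w + rng.normal(0, 5, n)       # observed durations with noise

X1 = np.column_stack([X, np.ones(n)])      # add intercept column
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
pred = X1 @ w
rmse = np.sqrt(np.mean((pred - y) ** 2))   # close to the 5 ms noise floor
```

Comparing fitted coefficients across emotional conditions (one model per condition) is one way such models support the kind of cross-emotion comparison the abstract mentions.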
5.
Comprehension, Processing Time, and Modality Preferences When People with Aphasia and Neurotypical Healthy Adults Read Books: A Pilot Study. Pruitt, McKenzie Ellen, 22 April 2022.
No description available.
6.
Effect of digital highlighting on reading comprehension given text-to-speech technology for people with aphasia. deVille, Camille Rae, 08 April 2020.
No description available.
7.
Effects of patient preference selections of text-to-speech technology features on reading comprehension and review time for people with aphasia. Crittenden, Allison Marie, 21 April 2021.
No description available.
8.
Artificial Neural Networks in Swedish Speech Synthesis (Artificiella neurala nätverk i svensk talsyntes). Näslund, Per, January 2018.
Text-to-speech (TTS) systems have entered our daily lives in the form of smart assistants and many other applications. Contemporary research applies machine learning and artificial neural networks (ANNs) to synthesize speech. It has been shown that these systems outperform the older concatenative and parametric methods. In this paper, ANN-based methods for speech synthesis are explored and one of the methods is implemented for the Swedish language. The implemented method is dubbed "Tacotron" and is a first step towards end-to-end ANN-based TTS which puts many different ANN techniques to work. The resulting system is compared to a parametric TTS through a strength-of-preference test carried out with 20 Swedish-speaking subjects. A statistically significant preference for the ANN-based TTS is found. Test subjects indicate that the ANN-based TTS performs better than the parametric TTS in audio quality and naturalness, but sometimes lacks in intelligibility.
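One plausible way to check significance in a paired preference test like the one this abstract describes is an exact binomial sign test. The sketch below uses that test with illustrative counts; the actual test and data used in the thesis may differ:

```python
# Hedged sketch: two-sided exact binomial sign test for a paired
# strength-of-preference listening test. The counts 16 vs 4 are
# illustrative, not the thesis's actual results.
from math import comb

def sign_test_p(prefer_a, prefer_b):
    """Two-sided exact p-value under H0: no preference (p = 0.5).
    Ties are assumed to have been dropped beforehand."""
    n = prefer_a + prefer_b
    k = max(prefer_a, prefer_b)
    # probability of a split at least this extreme, doubled for two sides
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = sign_test_p(prefer_a=16, prefer_b=4)   # hypothetical: 16 of 20 prefer A
print(f"p = {p:.4f}")                      # below 0.05 -> significant
```

With 16 of 20 subjects preferring one system, the test rejects the no-preference hypothesis at the usual 5% level.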
9.
The ability to see through a voice clone (Förmågan att genomskåda en röstklon: Faktorer som påverkar genomskådning av AI-genererade röstkloner). Dalman, Gabriella; Hedin, Jonathan, January 2020.
As machine learning has advanced in recent years, the creation of deep fakes (fake media, most often video or images, created using this technology) has become easier. Voice cloning is a subject in speech technology that can be said to be the equivalent of deep fakes for voices. Earlier studies have proposed new techniques for using neural networks to create believable clones of human voices, but few studies have been made concerning the perceptual factors behind the human ability to discern the authenticity of voice clones. We therefore made a study with one male and one female voice clone, where participants familiar with the speakers' voices determined the authenticity of a series of clips among which voice clones were included. The clips were limited to different frequency ranges in order to analyse whether there was a correlation between frequency range and the participants' abilities. The results of the study show that the frequency range did not make a statistically significant difference and that the determining factors instead were prosody and artefacts in the sound clips. However, there was a significant difference between the success of detecting the male and the female voice clone, where the participants more frequently detected the male one.
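Restricting the frequency range of stimuli, as this abstract describes, can be sketched with a simple FFT brick-wall filter. The cutoffs and sample rate below are illustrative assumptions, not the study's actual settings:

```python
# Hedged sketch: band-limiting an audio clip by zeroing spectral content
# outside a chosen range, to mimic restricting the frequency range of
# listening-test stimuli. Cutoffs and sample rate are illustrative only.
import numpy as np

def band_limit(signal, sample_rate, low_hz, high_hz):
    """Zero out spectral content outside [low_hz, high_hz]."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(signal))

sr = 16000
t = np.arange(sr) / sr                       # one second of audio
tone = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 6000 * t)
narrow = band_limit(tone, sr, 300, 3400)     # telephone-like band
# the 6 kHz component is removed; the 440 Hz component survives
```

A brick-wall filter introduces its own audible artefacts, so a practical stimulus pipeline would more likely use a smooth filter; the sketch only shows the band-limiting idea.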
10.
Deconstructing Disability, Assistive Technology: Secondary Orality, The Path To Universal Access. Tripathi, Tara Prakash, 01 January 2012.
When Thomas Edison applied for a patent for his phonograph, he listed talking books for the blind as one of the benefits of his invention. Edison was correct in his claim about talking books, or audio books. Audio books have immensely helped the blind to achieve their academic and professional goals. Blind and visually impaired people have also been using audio books for pleasure reading. But several studies have demonstrated the benefits of audio books for people who are not defined as disabled. Many nondisabled people listen to audio books and take advantage of speech-based technology, such as text-to-speech programs, in their daily activities. Speech-based technology, however, has remained on the margins of academic environments, where the hegemony of the sense of vision is palpable. Dominance of the sense of sight can be seen in school curricula, classrooms, libraries, academic conferences, books and journals, and virtually everywhere else. This dissertation analyzes the reasons behind such apathy towards technology based on speech. Jacques Derrida's concept of 'metaphysics of presence' helps us understand the arbitrary privileging of one side of a binary at the expense of the other. I demonstrate in this dissertation that both the 'disabled' and the technology used by them are on the less privileged side of the binary formation they are part of. I use Derrida's method of 'deconstruction' to deconstruct the binaries of 'assistive' and 'mainstream technology' on one hand, and that of the 'disabled' and 'nondisabled' on the other. Donna Haraway and Katherine Hayles present an alternative reading of the body to conceive of a post-gendered posthuman identity; I borrow from their work on cyborgism and posthumanism to conceive of a technology-driven post-disabled world. Cyberspace is a good and tested example of an identity without body and a space without disability.
The opposition between mainstream and speech-based assistive technology can be deconstructed with the example of what Walter Ong calls 'secondary orality.' Both disabled and nondisabled people use speech-based technology in their daily activities. Sighted people are increasingly listening to audio books and podcasts. Secondary orality is also manifest in their GPS devices. Thus, secondary orality is a common element in assistive and mainstream technologies, hitherto segregated by designers. Just as Derrida uses the concept of 'incest' to deconstruct the binary opposition between Nature and Culture, I employ 'secondary orality' as a deconstructing tool in the context of mainstream and assistive technology. Mainstream electronic devices, such as smartphones, mp3 players, and computers, can now be controlled with speech, and they can also read the screen aloud. With the Siri assistant, the new application on the iPhone that allows the device to be controlled with speech, we seem to be very close to "the age of talking computers" that William Crossman foretells. As a result of such progress in speech technology, I argue, we no longer need the concept of speech-based assistive technology.