501 |
Cognition in Hearing Aid Users : Memory for Everyday Speech / Kognition hos hörapparatsanvändare : Att minnas talade vardagsmeningar
Ng, Hoi Ning Elaine. January 2013 (has links)
The thesis investigated the importance of cognition for speech understanding in experienced and new hearing aid users. The aims were 1) to develop a cognitive test (Sentence-final Word Identification and Recall, or SWIR test) to measure the effects of a noise reduction algorithm on processing of highly intelligible speech (everyday sentences); 2) to investigate, using the SWIR test, whether hearing aid signal processing would affect memory for heard speech in experienced hearing aid users; 3) to test whether the effects of signal processing on the ability to recall speech would interact with background noise and individual differences in working memory capacity; 4) to explore the potential clinical application of the SWIR test; and 5) to examine the relationship between cognition and speech recognition in noise in new users over the first six months of hearing aid use. Results showed that, for experienced users, noise reduction freed up cognitive resources and alleviated the negative impact of noise on memory when speech stimuli were presented in a background of speech babble spoken in the listener’s native language. The possible underlying mechanisms are that noise reduction facilitates auditory stream segregation between target and irrelevant speech and reduces the attention captured by the linguistic information in irrelevant speech. The effects of noise reduction and SWIR performance were modulated by individual differences in working memory capacity. SWIR performance was related to the self-reported outcome of hearing aid use. For new users, working memory capacity played a more important role in speech recognition in noise before acclimatization to hearing aid amplification than after six months. This thesis demonstrates for the first time that hearing aid signal processing can significantly improve the ability of individuals with hearing impairment to recall highly intelligible speech stimuli presented in babble noise. 
It also adds to the literature showing the key role of working memory capacity in listening with hearing aids, especially for new users. By virtue of its relation to subjective measures of hearing aid outcome, the SWIR test can potentially be used as a tool in assessing hearing aid outcome. / The overall aim of the thesis was to study the role of cognition in speech understanding among experienced and new hearing aid users. The aims were 1) to develop a cognitive test (Sentence-final Word Identification and Recall, or SWIR test) to measure the effect of a noise reduction algorithm on the processing of intelligible speech (everyday sentences); 2) to use the SWIR test to examine whether hearing aid signal processing affected recall of perceived speech in experienced hearing aid users; 3) to evaluate whether the effect of signal processing on the ability to remember speech is influenced by disturbing background noise and by individual differences in working memory capacity; 4) to examine the potential clinical application of the SWIR test; and 5) to examine the relationship between cognition and speech perception in noise in new hearing aid users during their first six months with hearing aids. The results showed that, for experienced hearing aid users, noise reduction alleviated the negative effect of noise on memory when sentences were presented in irrelevant speech spoken in the participants' native language. The likely underlying mechanisms are that noise reduction facilitates segregation of the auditory streams of target and irrelevant speech, and reduces the degree of attention captured by the linguistic information in the irrelevant speech. The effects of noise reduction and SWIR performance depended on individual differences in working memory capacity. SWIR performance was also related to the self-reported outcome of hearing aid use.
For new users, working memory capacity initially plays a more important role in speech perception in background noise, before acclimatization to hearing aid amplification has taken place, than after six months. This thesis shows for the first time that hearing aid signal processing can significantly improve the ability of individuals with hearing impairment to remember intelligible speech presented in background noise. The thesis contributes to the literature with a discussion of how working memory capacity matters for speech perception with hearing aids, particularly for new users. Given its relationship with self-reported outcomes, the SWIR test can be used as a tool in the assessment of hearing aid outcome.
|
502 |
Lietuvių kalbos priebalsių spektro analizė / Lithuanian language consonants spectrum analysis
Šimkus, Ramūnas; Stumbras, Tomas. 03 September 2010 (has links)
In the second half of the 20th century, research into speaker recognition and speech synthesis intensified markedly. Since the 1940s, work has been under way on systems capable of recognizing spoken language, and accurate discrimination of speech signals is especially important in this field. In the 1970s a number of feature extraction methods were developed; the more important of these are the mel-scale cepstrum, perceptual linear prediction, the delta cepstrum and others [3]. Modern computing equipment simplifies the signal discrimination task considerably, but it nevertheless remains very difficult.
A speech synthesizer is a computer system that can render any text in a human voice: the system can generate a human voice automatically. One of the most promising application areas for voice technologies is applications for people with disabilities (the blind and visually impaired, and people who cannot walk or have limited mobility); for such people, voice technologies are often an essential, or even the only, means of integration into society. There are many other application areas for such systems:
• call centres that handle telephone calls automatically, recognizing and understanding what the caller says;
• automatic enquiry systems for transport timetables;
• voice control of vehicle subsystems;
• continuous speech recognition systems for working with text editors;
To analyse and discriminate speech signals... [see the full text] / In the 20th century, speech recognition and synthesis became an important field of research, and the last 50 years have seen a great deal of work on speech recognition. Systems for speech recognition and synthesis now exist for the major European languages, such as French, English and the Germanic languages. One of the most important benefits is for people with disabilities: voice technologies can make their lives more comfortable, help integrate them into everyday life, and provide new interfaces through which they can use personal computers.
Lithuanian needs research of its own because of the language's unique features. The object of this research is the spectrum of Lithuanian consonants. The principal methods for speech signal analysis are linear prediction, the Fourier transform and cepstral analysis; the main method used here is linear prediction, which is used to find formants. Several algorithms exist for linear prediction, and we used the Burg algorithm. Recordings of words were annotated and analyzed with the PRAAT software, which was also used to obtain formant movements; the resulting data were processed in MATLAB 6.5. All consonants were divided into groups: voiced and unvoiced, semivowels, plosives and fricatives. The influence of the vowel following each consonant was also analyzed. The data obtained are useful for improving quality in speech recognition and synthesis.
The paper includes:
1. Speech generation analysis.
2. Spectrum analysis methods.
3. Experiment methodology... [to full text]
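The Burg linear-prediction analysis used above to find formants can be sketched as follows. This is a generic illustration of the method on a synthetic resonator, not the thesis's actual PRAAT/MATLAB pipeline; the signal parameters are invented for the demo.

```python
import numpy as np

def burg(x, order):
    """Estimate prediction-error filter coefficients a = [1, a1, ..., ap]
    with Burg's method (minimises forward + backward prediction error)."""
    f = np.asarray(x, float)[1:].copy()   # forward prediction error
    b = np.asarray(x, float)[:-1].copy()  # backward prediction error
    a = np.array([1.0])
    for _ in range(order):
        # reflection coefficient that minimises the combined error energy
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        # Levinson-style update of the error-filter polynomial
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
        # update both error sequences and shrink the analysis window
        f, b = (f + k * b)[1:], (b + k * f)[:-1]
    return a

def formants(a, fs):
    """Formant frequencies from the angles of the complex poles of 1/A(z)."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]     # keep one of each conjugate pair
    return np.sort(np.angle(roots) * fs / (2 * np.pi))

# Demo: an AR(2) resonator with a pole pair at 800 Hz (fs = 8 kHz),
# driven by white noise, stands in for a vowel-like resonance.
fs, r, theta = 8000, 0.98, 2 * np.pi * 800 / 8000
rng = np.random.default_rng(0)
e = rng.standard_normal(4096)
x = np.zeros_like(e)
for n in range(2, len(e)):
    x[n] = 2 * r * np.cos(theta) * x[n - 1] - r ** 2 * x[n - 2] + e[n]
est = formants(burg(x, 2), fs)  # estimated formant should lie near 800 Hz
```

In practice the analysis is run over short frames of speech with a model order high enough to capture several formants.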
|
503 |
Balso atpažinimo programų lietuvinimo galimybių tyrimas / A survey of the possibilities for the Lithuanization of speech recognition programs
Bivainis, Robertas. 30 September 2013 (has links)
This thesis analyses and investigates how the HTK speech recognition toolkit works and what steps must be taken to successfully recognize spoken Lithuanian words. It also reviews which speech technology concepts are needed to create a speech recognition program. Speech signal recognition models and hidden Markov models are central to speech recognition, so the analysis reviews their operating principles and algorithms. / This thesis focuses on how the speech recognition toolkit HTK operates and what steps have to be taken in order to recognize spoken Lithuanian words. It also emphasizes the speech recognition technology concepts needed to create a speech recognition program.
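Decoding in HMM-based recognizers of this kind rests on finding the most likely hidden state sequence for an observation sequence. A minimal sketch of the Viterbi algorithm follows; the two-state model and its probabilities are invented toy values for illustration, not HTK's actual implementation.

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most likely state sequence for a discrete-output HMM (log domain).
    log_pi: (S,) initial log-probs; log_A: (S, S) transitions; log_B: (S, V) emissions."""
    S, T = len(log_pi), len(obs)
    delta = log_pi + log_B[:, obs[0]]     # best log-prob of paths ending in each state
    psi = np.zeros((T, S), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A   # scores[prev, cur]
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(S)] + log_B[:, obs[t]]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):         # trace backpointers from the end
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 2-symbol model (numbers invented for the demo).
pi = np.log([0.6, 0.4])
A = np.log([[0.7, 0.3], [0.4, 0.6]])
B = np.log([[0.9, 0.1], [0.2, 0.8]])
best = viterbi([0, 0, 1, 0], pi, A, B)
```

Real recognizers apply the same recursion over phone-level HMMs with Gaussian-mixture or neural emission scores rather than a discrete table.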
|
504 |
The investigation into an algorithm based on wavelet basis functions for the spatial and frequency decomposition of arbitrary signals.
Goldstein, Hilton. January 1994 (has links)
The research was directed toward the viability of an O(n) algorithm which could decompose an arbitrary signal (sound, vibration etc.) into its time-frequency space. The well-known Fourier Transform uses sine and cosine functions (having infinite support on t) as orthonormal basis functions to decompose a signal f(t) in the time domain into F(w) in the frequency domain, where the Fourier coefficients F(w) are the contributions of each frequency in the original signal. Due to the non-local support of these basis functions, a signal containing a sharp localised transient does not have localised coefficients, but rather coefficients that decay slowly. Another problem is that the coefficients F(w) do not convey any time information. The windowed Fourier Transform, or short-time Fourier Transform, does attempt to resolve the latter, but has had limited success.
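The windowed transform just mentioned can be sketched in a few lines: slide a tapered window along the signal and take the DFT of each frame, recovering time localisation at the cost of frequency resolution. The test signal and window parameters below are invented for illustration.

```python
import numpy as np

def stft(x, win_len=256, hop=128):
    """Short-time Fourier transform: Hann-windowed, hop-spaced DFT frames."""
    win = np.hanning(win_len)
    frames = [np.fft.rfft(x[s:s + win_len] * win)
              for s in range(0, len(x) - win_len + 1, hop)]
    return np.array(frames)            # shape: (num_frames, win_len // 2 + 1)

fs = 8000
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 1000 * t)       # a 1 kHz tone, one second long
S = stft(x)
# With fs/win_len = 31.25 Hz per bin, the tone should peak at bin 1000/31.25 = 32
```

Each row of `S` describes the spectrum around one instant, which is exactly the time information the plain Fourier coefficients F(w) lack.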
Wavelets are basis functions, usually mutually orthonormal, having finite support in t and
are therefore spatially local. Using non-orthogonal wavelets, the Dominant Scale
Transform (DST) designed by the author, decomposes a signal into its approximate time-frequency
space. The associated Dominant Scale Algorithm (DSA) has O(n) complexity
and is integer-based. These two characteristics make the DSA extremely efficient. The
thesis also investigates the problem of converting a music signal into its equivalent music
score. The old problem of speech recognition is also examined. The results obtained from
the DST are shown to be consistent with those of other authors who have utilised other
methods. The resulting DST coefficients are shown to render the DST particularly useful in
speech segmentation (silence regions, voiced speech regions, and frication). Moreover, the
Spectrogram Dominant Scale Transform (SDST), formulated from the DST, was shown to
approximate the Fourier coefficients over fixed time intervals within vowel regions of
human speech. / Thesis (Ph.D.)-University of Natal, Durban, 1994.
|
505 |
Blind source separation of the audio signals in a real world
Choi, Hyung Keun. 08 1900 (has links)
No description available.
|
506 |
Does Speaker Age Affect Speech Perception in Noise in Older Adults?
Harris, Penny. January 2013 (has links)
Purpose: To investigate the effects of speaker age, speaker gender, semantic context,
signal-to-noise ratio (SNR) and a listener’s hearing status on speech recognition and listening effort in older adults. We examined the hypothesis that older adults would recognize less speech and exert greater listening effort when listening to the speech of younger versus older adult speakers.
Method: Speech stimuli were recorded from 12 adult speakers classified as “younger” (three males and three females aged 18-31 years) and “older” (three males and three females aged 69-89) respectively. A computer-based subjective rating was conducted to confirm that the speakers were representative of younger and older speakers. Listeners included 20 older adults (aged 65 years and above), who were divided into two age-matched groups with and without hearing loss. All listening and speaking participants in the study were native speakers of New Zealand English. A dual-task paradigm was used to measure speech recognition and listening effort; the primary task involved recognition of target words in sentences containing either high or low contextual cues, while the secondary task required listeners to memorise the target words for later recall, following a set number of sentences. Listening tasks were performed in a variety of listening conditions (quiet, +5 dB SNR, and 0 dB SNR).
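Noise conditions like the +5 dB and 0 dB SNR levels above are typically created by scaling the masker relative to the target before mixing. A generic sketch of that procedure follows (the sinusoid and noise stand in for speech and babble; this is not the study's actual stimulus-preparation code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db,
    then add it to the speech."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)  # stand-in for speech
babble = rng.standard_normal(8000)                          # stand-in for babble
mix = mix_at_snr(speech, babble, 5.0)                       # the +5 dB condition
```

The same function with `snr_db=0.0` produces the harder 0 dB condition, where target and masker have equal power.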
Results: There were no overall differences in speech recognition scores or word recall scores for the 20 older listeners, when listening to the speech of the younger versus older speakers. However, differential effects of speaker group were observed in the two semantic context conditions (high versus low context). Older male speakers were the easiest to understand when semantic context was low; however, for sentences with high semantic context, the older male group were the most difficult to understand. Word recall scores were also significantly higher in the most challenging listening condition (low semantic context, 0 dB SNR), when the speaker was an older male.
Conclusion: Differential effects of speaker group were observed in the two semantic context conditions (high versus low context) suggesting that different speech cues were used by listeners, as the level of context varied. The findings provide further evidence that, in challenging listening conditions, older listeners are able to use a wide range of cues, such as prosodic features and semantic context to compensate for a degraded signal. The availability of these cues depends on characteristics of the speaker, such as rate of speech and prosody, as well as characteristics of the listener and the listening environment.
|
507 |
EVALUATION OF INTELLIGIBILITY AND SPEAKER SIMILARITY OF VOICE TRANSFORMATION
Raghunathan, Anusha. 01 January 2011 (links)
Voice transformation refers to a class of techniques that modify the voice characteristics either to conceal the identity or to mimic the voice characteristics of another speaker. Its applications include automatic dialogue replacement and voice generation for people with voice disorders. The diversity in applications makes evaluation of voice transformation a challenging task. The objective of this research is to propose a framework to evaluate intentional voice transformation techniques. Our proposed framework is based on two fundamental qualities: intelligibility and speaker similarity. Intelligibility refers to the clarity of the speech content after voice transformation, and speaker similarity measures how well the modified output disguises the source speaker. We measure intelligibility with word error rates and speaker similarity with the likelihood of identifying the correct speaker. The novelty of our approach is that we consider whether similarly transformed training data are available to the recognizer. We have demonstrated that this factor plays a significant role in intelligibility and speaker similarity for both human testers and automated recognizers. We thoroughly test two classes of voice transformation techniques, pitch distortion and voice conversion, using our proposed framework. We apply our results to patients with voice hypertension using video self-modeling, and preliminary results are presented.
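The word error rate used here as the intelligibility measure is the standard edit-distance metric over words; a minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate: minimum (substitutions + insertions + deletions)
    needed to turn hyp into ref, divided by the reference word count."""
    r, h = ref.split(), hyp.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported as a rate rather than an accuracy.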
|
508 |
Array-based Spectro-temporal Masking For Automatic Speech Recognition
Moghimi, Amir Reza. 01 May 2014 (has links)
Over the years, a variety of array processing techniques have been applied to the problem of enhancing degraded speech to improve automatic speech recognition. In this context, linear beamforming has long been the approach of choice, for reasons including good performance, robustness and analytical simplicity. While various non-linear techniques - typically based to some extent on the study of auditory scene analysis - have also been of interest, they tend to lag behind their linear counterparts in terms of simplicity, scalability and flexibility. Nonlinear techniques are also more difficult to analyze and lack the systematic descriptions available in the study of linear beamformers. This work focuses on a class of nonlinear processing, known as time-frequency (T-F) masking - a.k.a. spectro-temporal masking - whose variants comprise a significant portion of the existing techniques. T-F masking is based on accepting or rejecting individual time-frequency cells based on some estimate of local signal quality. Analyses are developed that attempt to mirror the beam patterns used to describe linear processing, leading to a view of T-F masking as "nonlinear beamforming". Two distinct formulations of these "nonlinear beam patterns" are developed, based on different metrics of the algorithm's behavior; these formulations are modeled in a variety of scenarios to demonstrate the flexibility of the idea. While these patterns are not quite as simple or all-encompassing as traditional beam patterns in microphone-array processing, they do accurately represent the behavior of masking algorithms in analogous and intuitive ways. In addition to analyzing this class of nonlinear masking algorithm, we also attempt to improve its performance in a variety of ways. Improvements are proposed to the baseline two-channel version of masking by addressing both the mask estimation and the signal reconstruction stages, the latter more successfully than the former.
Furthermore, while these approaches have been shown to outperform linear beamforming in two-sensor arrays, extensions to larger arrays have been few and unsuccessful. We find that combining beamforming and masking is a viable method of bringing the benefits of masking to larger arrays. As a result, a hybrid beamforming-masking approach, called "post-masking", is developed that improves upon the performance of MMSE beamforming (and can be used with any beamforming technique), with the potential for even greater improvement in the future.
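The accept/reject decision at the heart of T-F masking can be sketched as a binary mask over spectrogram cells. The toy magnitudes below are invented for illustration, and the oracle ("ideal") form shown, where target and masker spectra are known, sidesteps the mask-estimation problem that real systems, including this work, must solve:

```python
import numpy as np

def binary_mask(target_spec, masker_spec, lc_db=0.0):
    """Binary T-F mask: keep a cell when its local SNR exceeds the local
    criterion lc_db; real systems must *estimate* the local SNR instead."""
    eps = 1e-12  # guard against log(0) in silent cells
    local_snr = 10 * np.log10((np.abs(target_spec) ** 2 + eps)
                              / (np.abs(masker_spec) ** 2 + eps))
    return (local_snr > lc_db).astype(float)

target = np.array([[10.0, 0.1], [0.1, 10.0]])  # toy magnitude spectrogram
masker = np.ones((2, 2))                        # toy masker spectrogram
mask = binary_mask(target, masker)
enhanced = mask * (target + masker)  # accepted cells pass; rejected cells are zeroed
```

The hybrid "post-masking" idea described above applies a decision of this kind to the output of a beamformer rather than directly to the raw sensor signals.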
|
509 |
Lecture transcription systems in resource-scarce environments / Pieter Theunis de Villiers
De Villiers, Pieter Theunis. January 2014 (has links)
Classroom note taking is a fundamental task performed by learners on a daily basis.
These notes provide learners with valuable offline study material, especially in the case of more difficult subjects. The use of class notes has been found to not only provide students with a better learning experience, but also leads to an overall higher academic performance. In a previous study, an increase of 10.5% in student grades was observed after these students had been provided with multimedia class notes. This is not surprising, as other studies have found that the rate of successful transfer of information to humans increases when provided with both visual and audio information.
Note taking might seem like an easy task; however, students with hearing impairments, visual impairments, physical impairments, learning disabilities or even non-native listeners find this task very difficult to impossible. It has also been reported that even non-disabled students find note taking time-consuming and that it requires a great deal of mental effort while also trying to pay full attention to the lecturer. This is illustrated by a study where it was found that college students were only able to record ~40% of the data presented by the lecturer. It is thus reasonable to expect an automatic way of generating class notes to be beneficial to all learners.
Lecture transcription (LT) systems are used in educational environments to assist learners by providing them with real-time in-class transcriptions or recordings and transcriptions for offline use. Such systems have already been successfully implemented in the developed world where all required resources were easily obtained. These systems are typically trained on hundreds to thousands of hours of speech while their language models are trained on millions or even hundreds of millions of words. These amounts of data are generally not available in the developing world. In this dissertation, a number of approaches toward the development of LT systems in resource-scarce environments are investigated.
We focus on different approaches to obtaining sufficient amounts of well transcribed
data for building acoustic models, using corpora with few transcriptions and of variable quality. One approach investigates the use of alignment using a dynamic programming phone string alignment procedure to harvest as much usable data as possible from approximately transcribed speech data. We find that target-language acoustic models are optimal for this purpose, but encouraging results are also found when using models from another language for alignment.
Another approach entails using unsupervised training methods where an initial low accuracy recognizer is used to transcribe a set of untranscribed data. Using this poorly transcribed data, correctly recognized portions are extracted based on a word confidence threshold. The initial system is retrained along with the newly recognized data in order to increase its overall accuracy. The initial acoustic models are trained using as little as 11 minutes of transcribed speech. After several iterations of unsupervised training, a noticeable increase in accuracy was observed (47.79% WER to 33.44% WER). Similar results were however found (35.97% WER) after using a large speaker-independent corpus to train the initial system. Usable LMs were also created using as few as 17955 words from transcribed lectures; however, this resulted in large out-of-vocabulary rates. This problem was solved by means of LM interpolation. LM interpolation was found to be very beneficial in cases where subject-specific data (such as lecture slides and books) was available.
We also introduce our NWU LT system, which was developed for use in learning environments and was designed using a client/server based architecture. Based on the results found in this study we are confident that usable models for use in LT systems can be developed in resource-scarce environments. / MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014
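The LM interpolation that proved beneficial above can be sketched as a linear mixture of probability tables, with a small in-domain model (built from lecture slides or books) backed off to a broad-coverage model. The unigram tables and the weight are invented toy values:

```python
def interpolate(p_general, p_domain, lam=0.7):
    """Linearly interpolate two language-model probability tables.
    lam weights the (small) in-domain model; probability mass missing
    from one table is supplied by the other."""
    vocab = set(p_general) | set(p_domain)
    return {w: lam * p_domain.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
            for w in vocab}

p_general = {"the": 0.5, "cat": 0.5}     # broad-coverage LM (toy)
p_domain = {"the": 0.2, "lecture": 0.8}  # LM from subject-specific text (toy)
p_mix = interpolate(p_general, p_domain, lam=0.5)
```

Because the mixture covers the union of both vocabularies, words seen only in the general corpus (here "cat") no longer fall out of vocabulary, which is exactly the effect reported above.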
|
510 |
Improving Grapheme-based speech recognition through P2G transliteration / W.D. Basson
Basson, Willem Diederick. January 2014 (links)
Grapheme-based speech recognition systems are faster to develop, but typically do not
reach the same level of performance as phoneme-based systems. Using Afrikaans speech
recognition as a case study, we first analyse the reasons for the discrepancy in performance, before introducing a technique for improving the performance of standard grapheme-based systems. It is found that by handling a relatively small number of irregular words through phoneme-to-grapheme (P2G) transliteration – transforming the original orthography of irregular words to an ‘idealised’ orthography – grapheme-based accuracy can be improved. An analysis of speech recognition accuracy based on word categories shows that P2G transliteration succeeds in improving certain word categories in which grapheme-based systems typically perform poorly, and that the problematic categories can be identified prior to system development. An evaluation is offered of when category-based P2G transliteration is beneficial and methods to implement the technique in practice are discussed. Comparative results are obtained for a second language (Vietnamese) in order to determine whether the technique can be generalised. / MSc (Computer Science) North-West University, Vaal Triangle Campus, 2014
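Mechanically, category-based P2G transliteration of the kind described amounts to rewriting a small set of irregular words to an idealised orthography before grapheme-based modeling. The sketch below is hypothetical: the word list and its respellings are invented examples, whereas the thesis derives them from phoneme-to-grapheme conversion of actual Afrikaans pronunciations.

```python
# Hypothetical irregular-word table: loanwords whose spelling does not
# track pronunciation are mapped to an invented 'idealised' respelling.
IRREGULAR = {
    "chauffeur": "sjofeur",   # invented respelling for illustration
    "genre": "sjanre",        # invented respelling for illustration
}

def idealise(word):
    """Rewrite an irregular word to its idealised orthography; regular
    words, whose spelling already tracks pronunciation, pass through."""
    return IRREGULAR.get(word.lower(), word)

def idealise_text(text):
    """Apply the rewrite to every word of a training transcription."""
    return " ".join(idealise(w) for w in text.split())
```

A grapheme-based dictionary built from the rewritten transcriptions then assigns the irregular words grapheme sequences that better match their phonemes, which is where the reported accuracy gains come from.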
|