581
Developing Oral Reading Fluency Among Hispanic High School English-language Learners: An Intervention Using Speech Recognition Software. Ruffu, Russell, 08 1900.
This study investigated oral reading fluency development among Hispanic high school English-language learners. Participants included 11 males and 9 females from first-year, second-year, and third-year English language arts classes. The pre-post experimental study, conducted during a four-week ESL summer program, included a treatment and a control group. The treatment group received a combination of components, including modified repeated reading with self-voice listening and oral dictation output from a speech recognition program. Each day, students performed a series of tasks: dictating part of the previous day's passage; listening to and silently reading a new passage; dictating and correcting individual sentences from the new passage in the speech recognition environment; dictating the new passage as a whole without making corrections; and finally, listening to their own voice from their recorded dictation. This sequence was repeated in subsequent sessions. The intervention was thus a technology-enhanced variation of repeated reading with a pronunciation dictation segment. Research questions focused on improvements in oral reading accuracy and rate, facility with the application, student perceptions of the technology for reading, and the reliability of the speech recognition program. The treatment group improved oral reading accuracy by 50%, retained and transferred pronunciation of 55% of new vocabulary, and increased oral reading rate by 16 words correct per minute. Students used the intervention independently after three sessions. This independence may have contributed to students' self-efficacy: they perceived improvements in their pronunciation and reading in general, and reported an increased liking of school. Students initially had a very positive perception of using the technology for reading, but this perception decreased over the four weeks from 2.7 to 2.4 on a 3-point scale. The speech recognition program was reliable 94% of the time. The combination of the summer school program and the stacking of intervention components supported students' gains in oral reading fluency, suggesting that further study of applications of the intervention is warranted. Accelerating oral reading skills and vocabulary acquisition for ELLs contributes to closing the reading gap between ELLs and native English speakers. Fluent oral reading is strongly correlated with reading comprehension, and reading comprehension is essential for ELLs to be successful in school. Literacy support tools such as this intervention can help accelerate English acquisition beyond the rate attained through traditional practices.
582
Strojový překlad mluvené řeči přes fonetickou reprezentaci zdrojové řeči / Spoken Language Translation via Phoneme Representation of the Source Language. Polák, Peter, January 2020.
We refactor the traditional two-step approach to spoken language translation, automatic speech recognition followed by machine translation. Instead of conventional graphemes, we use phonemes as the intermediate speech representation. Starting with the acoustic model, we revise cross-lingual transfer and propose a coarse-to-fine method that provides a further speed-up and performance boost. We then review the translation model, experimenting with source and target encoding and boosting robustness through fine-tuning and transfer across ASR and SLT. We empirically document that this conventional setup with an alternative representation not only performs well on standard test sets but also provides robust transcripts and translations on challenging (e.g., non-native) test sets. Notably, our ASR system outperforms commercial ASR systems.
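As a rough illustration of the idea in this abstract, the sketch below shows how a grapheme-to-phoneme front end can produce the phoneme-level source representation on which a translation model would then be trained. It is not the thesis's code; the use of the open-source phonemizer package and the example sentence are assumptions.

    # A minimal sketch, not the thesis's pipeline: phonemes as the intermediate
    # representation between recognition and translation. The "phonemizer"
    # package stands in for the ASR system's phoneme output.
    from phonemizer import phonemize

    source_text = "speech translation via phonemes"      # placeholder sentence
    source_phonemes = phonemize(source_text, language="en-us",
                                backend="espeak", strip=True)
    print(source_phonemes)   # prints a phoneme string for the sentence
    # The translation model is then trained on (phoneme sequence, target text)
    # pairs instead of (grapheme sequence, target text) pairs.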
583
User-centered Visualizations of Transcription Uncertainty in AI-generated Subtitles of News Broadcast. Karlsson, Fredrik, January 2020.
AI-generated subtitles have recently started to automate the process of subtitling with automatic speech recognition. However, viewers may not perceive that the transcription is based on probabilities and may contain errors. For news that is broadcast live, this may be controversial and cause misinterpretation. A user-centered design approach was followed, investigating three possible solutions for visualizing transcription uncertainty in real-time presentation. Based on the user needs, one proposed solution was used in a qualitative comparison with AI-generated subtitles without visualizations. The results suggest that visualizing uncertainty supports users' interpretation of AI-generated subtitles and helps them identify possible errors, although it does not improve transcription intelligibility. The results also suggest that unnoticed transcription errors during a news broadcast are perceived as critical and decrease trust in the news. Uncertainty visualizations may increase trust and reduce the risk of misinterpreting important information.
584
Cross-lingual and Multilingual Automatic Speech Recognition for Scandinavian Languages. Černiavski, Rafal, January 2022.
Research into Automatic Speech Recognition (ASR), the task of transforming speech into text, remains highly relevant due to its countless applications in industry and academia. State-of-the-art ASR models are able to produce nearly perfect, sometimes called human-like, transcriptions; however, accurate ASR models are most often available only for high-resource languages. Furthermore, the vast majority of ASR models are monolingual, that is, able to handle only one language at a time. In this thesis, we extensively evaluate the quality of existing monolingual ASR models for Swedish, Danish, and Norwegian. In addition, we search for parallels between monolingual ASR models and the comprehension of foreign languages by native speakers of these languages. Lastly, we extend the Swedish monolingual model to handle all three languages. The research conducted in this thesis project is divided into two main parts, covering monolingual and multilingual models. In the former, we analyse and compare the performance of monolingual ASR models for Scandinavian languages in monolingual and cross-lingual settings. We compare these results against the levels of mutual intelligibility of Scandinavian languages among native speakers of Swedish, Danish, and Norwegian to see whether the monolingual models favour the same languages as native speakers do. We also examine the performance of the monolingual models on the regional dialects of all three languages and perform a qualitative analysis of the most common errors. As for multilingual models, we expand the most accurate monolingual ASR model to handle all three languages. To do so, we explore the most suitable settings via trial models. In addition, we propose an extension to the well-established Wav2Vec 2.0-CTC architecture by incorporating a language classification component. The extension enables the use of language models, thus boosting the overall performance of the multilingual models. The results reported in this thesis suggest that, in a cross-lingual setting, monolingual ASR models for Scandinavian languages perform better on the languages that are easier for native speakers to comprehend. Furthermore, the addition of a statistical language model boosts the performance of ASR models in monolingual, cross-lingual, and multilingual settings. ASR models appear to favour certain regional dialects, though the gap narrows in a multilingual setting. Contrary to our expectations, our multilingual model performs comparably to the monolingual Swedish ASR models and outperforms the Danish and Norwegian models. The multilingual architecture proposed in this thesis project is fairly simple yet effective. With greater computational resources at hand, the extensions outlined in the conclusions might improve the models further.
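The cross-lingual evaluation described above can be pictured with the short sketch below, which runs a monolingual Wav2Vec 2.0-CTC model on audio in a neighbouring language with greedy CTC decoding. The checkpoint name and audio file are illustrative assumptions, and the thesis's language-classification extension and language-model rescoring are not shown.

    # Sketch: evaluating a monolingual (here, Swedish) Wav2Vec 2.0-CTC model on
    # cross-lingual input. Checkpoint and file names are placeholders.
    import torch
    import torchaudio
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    model_id = "KBLab/wav2vec2-large-voxrex-swedish"   # assumed Swedish model
    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

    speech, sr = torchaudio.load("danish_sample.wav")  # mono test audio assumed
    speech = torchaudio.functional.resample(speech, sr, 16_000).squeeze(0)

    inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)            # greedy CTC decoding
    print(processor.batch_decode(pred_ids)[0])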
585
Evaluating Usability of Text and Speech as Input Methods for Natural Language Interfaces Using Gamification / Utvärdering av användbarhet för text och tal som inmatningsmetoder för naturligt språkgränssnitt genom spelifiering. von Gegerfelt, Angelina; Klingestedt, Kashmir, January 2016.
Today an increasing number of systems make use of Natural Language Interfaces (NLIs), which make them easy and efficient to use. The purpose of this research was to gain an increased understanding of the usability of different input methods for NLIs. This was done by implementing two versions of a text-based game with an NLI, where one version used speech as the input method and the other used text. Tests were then performed with users who all played both versions of the game and then evaluated them individually using the System Usability Scale. Text was found to be the better input method in all aspects. However, speech scored highly when the users felt confident in their English proficiency, acknowledging the possibility of using speech as an input method for NLIs. / Today an increasing number of systems use natural language interfaces, which makes them simple and efficient to use. The purpose of this research was to gain an increased understanding of the usability of different input methods for natural language interfaces. This was done by creating two versions of a text-based game with a natural language interface, where one version used speech as the input method and the other used text. Tests were then conducted with users who each played through both versions of the game and then evaluated them individually using the System Usability Scale, a system for measuring the degree of usability. It was found that text worked better as the input method in all respects. Speech, however, received a high score when the users felt confident in their English proficiency, which speaks for the possibility of using speech as an input method for natural language interfaces.
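Since both versions of the game were rated with the System Usability Scale, a brief sketch of how a SUS questionnaire is commonly scored may help; the ten responses below are invented for illustration and are not data from the study.

    # System Usability Scale scoring: odd-numbered items contribute (response - 1),
    # even-numbered items contribute (5 - response), and the sum is multiplied by
    # 2.5 to give a 0-100 score. Example responses are made up.
    def sus_score(responses):                        # ten Likert responses, 1-5
        odd = sum(r - 1 for r in responses[0::2])    # items 1, 3, 5, 7, 9
        even = sum(5 - r for r in responses[1::2])   # items 2, 4, 6, 8, 10
        return (odd + even) * 2.5

    print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))   # 85.0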
586
Performance benchmarks of lip-sync scripting in Maya using speech recognition: Gender bias and speech recognition. Björkholm, Adrian, January 2022.
Background: Automated lip sync is used in animation to produce facial animations with minimal intervention from an animator. A lip-syncing script for Maya has been written in Python using the Vosk API to transcribe voice lines from audio files into instructions in Maya, automating the pipeline for speech animations. Previous studies have noted that some voice transcription and voice recognition APIs have exhibited a gender bias, reading female voices less accurately than male voices. Does gender affect this lip-syncing script's performance in creating animations?
Objectives: Benchmark the performance of a lip-syncing script that uses voice transcription by comparing male and female voices as input and looking for a gender bias in the voice transcription API. If there is a gender bias, how much does it affect the produced animations?
Methods: The script's perceived performance was evaluated by conducting a user study through a questionnaire, in which participants rated different animation attributes to build a picture of any perceived gender bias in the script. The transcribed voice lines were also analysed for an objective view of a possible gender bias.
Results: The transcriptions were almost perfect for both male and female voice lines, with just one transcription error, on a single word, in one of the male-voiced lines. The male- and female-voiced lines received very similar grades in the questionnaire data. On average, the male voice lines received slightly higher ratings on most criteria, but the score difference was minimal.
Conclusions: No gender bias was found in the lip-syncing script. The accuracy experiment showed very similar accuracy rates for the male and female voice lines; the female-voiced lines were transcribed slightly more accurately, the difference being a single word. The male voice lines received slightly higher perceived scores in the questionnaire, which is attributed to factors other than a possible gender bias.
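For readers unfamiliar with the transcription step, the sketch below shows how word-level timestamps can be obtained from an audio file with the open-source Vosk API; the file and model paths are placeholders, and this is not the thesis's actual Maya script.

    # Minimal Vosk transcription sketch with word timings, the kind of output a
    # lip-sync script could turn into Maya keyframes. Paths are placeholders.
    import json
    import wave
    from vosk import Model, KaldiRecognizer

    wf = wave.open("voice_line.wav", "rb")          # 16 kHz mono PCM assumed
    rec = KaldiRecognizer(Model("vosk-model-en"), wf.getframerate())
    rec.SetWords(True)                              # request word-level timestamps

    results = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            results.append(json.loads(rec.Result()))
    results.append(json.loads(rec.FinalResult()))

    for res in results:
        for w in res.get("result", []):
            print(w["word"], w["start"], w["end"])  # timings for mouth shapes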
587
Automatic Speech Recognition for low-resource languages using Wav2Vec2: Modern Standard Arabic (MSA) as an example of a low-resource language. Zouhair, Taha, January 2021.
The need for fully automatic translation at DigitalTolk, a Stockholm-based company providing translation services, led to exploring Automatic Speech Recognition as a first step for Modern Standard Arabic (MSA). Facebook AI recently released a second version of its Wav2Vec models, dubbed Wav2Vec 2.0, which uses deep neural networks and provides several English pretrained models along with a multilingual model trained on 53 different languages, referred to as the Cross-Lingual Speech Representation (XLSR-53). The small English pretrained model and the XLSR-53 model are tested on Arabic data from Mozilla Common Voice, and the results stemming from them are discussed. In this research, the small model did not yield any usable results and may have needed more unlabelled data to train, whereas the large model proved successful in predicting the Arabic audio recordings, achieving a Word Error Rate of 24.40%, an unprecedented result. The small model turned out to be unsuitable for training, especially on languages other than English and where unlabelled data is scarce. The large model, on the other hand, gave very promising results despite the small amount of data and should be the model of choice for any future training on low-resource languages such as Arabic.
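A condensed sketch of how a CTC head is typically attached to the XLSR-53 checkpoint for fine-tuning on a new language is shown below, following the common Hugging Face recipe; the vocabulary file and hyperparameters are assumptions, and this is not the thesis's exact training code.

    # Sketch: preparing XLSR-53 for CTC fine-tuning on Arabic. "vocab.json" is
    # assumed to map Arabic characters from the Common Voice transcripts to ids.
    from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                              Wav2Vec2Processor, Wav2Vec2ForCTC)

    tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                     pad_token="[PAD]", word_delimiter_token="|")
    feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1,
                                                 sampling_rate=16_000,
                                                 padding_value=0.0,
                                                 do_normalize=True,
                                                 return_attention_mask=True)
    processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                                  tokenizer=tokenizer)

    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",      # multilingual pretrained model
        ctc_loss_reduction="mean",
        pad_token_id=processor.tokenizer.pad_token_id,
        vocab_size=len(processor.tokenizer),    # new, randomly initialised head
    )
    model.freeze_feature_encoder()   # keep the convolutional encoder frozen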
588
Word-Recognition Performance in Interrupted Noise by Young Listeners With Normal Hearing and Older Listeners With Hearing Loss. Wilson, Richard H.; McArdle, Rachel; Betancourt, Mavie B.; Herring, Kaileen; Lipton, Teresa; Chisolm, Theresa H., 01 February 2010.
Background: The most common complaint of adults with hearing loss is understanding speech in noise. One class of masker that may be particularly useful in the assessment of speech-in-noise abilities is interrupted noise. Interrupted noise usually is a continuous noise that has been multiplied by a square wave, which produces alternating intervals of noise and silence. Wilson and Carhart found that spondaic word thresholds for listeners with normal hearing were 28 dB lower in an interrupted noise than in a continuous noise, whereas listeners with hearing loss experienced only an 11 dB difference.
Purpose: The purpose of this series of experiments was to determine if a speech-in-interrupted-noise paradigm differentiates better (1) between listeners with normal hearing and listeners with hearing loss and (2) among listeners with hearing loss than do traditional speech-in-continuous-noise tasks.
Research Design: Four descriptive/quasi-experimental studies were conducted.
Study Sample: Sixty young adults with normal hearing and 144 older adults with pure-tone hearing losses participated.
Data Collection and Analysis: A 4.3 sec sample of speech-spectrum noise was constructed digitally to form the 0 interruptions per second (ips; continuous) noise and the 5, 10, and 20 ips noises with 50% duty cycles. The noise samples were mixed digitally with the Northwestern University Auditory Test No. 6 words at selected signal-to-noise ratios and recorded on CD. The materials were presented through an earphone, and the responses were recorded and analyzed at the word level. Similar techniques were used for the stimuli in the remaining experiments.
Results: In Experiment 1, using 0 ips as the reference condition, the listeners with normal hearing achieved 34.0, 30.2, and 28.4 dB of escape from masking for 5, 10, and 20 ips, respectively. In contrast, the listeners with hearing loss achieved only 2.1 to 2.4 dB of escape from masking. Experiment 2 studied the 0 and 5 ips conditions in 72 older listeners with hearing loss, who were on average 13 yr younger and more varied in their hearing loss than the listeners in Experiment 1. The mean escape from masking in Experiment 2 was 7 dB, which is 20-25 dB less than the escape achieved by listeners with normal hearing. Experiment 3 examined the effects that duty cycle (0-100% in 10% steps) had on recognition performance in the 5 and 10 ips conditions. For the 12 young listeners with normal hearing, (1) the 50% correct point increased almost linearly between the 0 and 60% duty cycles (slope = 4.2 dB per 10% increase in duty cycle), (2) the slope of the function was steeper between 60 and 80% duty cycles, and (3) about the same masking was achieved for the 80-100% duty cycles. The data from the listeners with hearing loss were inconclusive. Experiment 4 varied the interburst ratios (0, -6, -12, -24, -48, and -∞ dB) of 5 ips noise and evaluated recognition performance by 24 young adults. The 50% points were described by a linear regression (R² = 0.98) with a slope of 0.55 dB/dB.
Conclusion: The current data indicate that interrupted noise provides a better differentiation both between listeners with normal hearing and listeners with hearing loss and among listeners with hearing loss than is provided by continuous noise.
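The masker construction described under Data Collection and Analysis can be pictured with the short sketch below, which gates a noise sample with a square wave at a given interruption rate and duty cycle; plain white noise and the chosen sample rate stand in for the study's calibrated speech-spectrum noise.

    # Sketch of interrupted-noise generation: noise multiplied by a square wave
    # giving "ips" interruptions per second at the requested duty cycle.
    import numpy as np

    fs = 44_100                                # sample rate in Hz (assumed)
    t = np.arange(int(fs * 4.3)) / fs          # 4.3 s noise sample, as above
    noise = np.random.randn(t.size)            # stand-in for speech-spectrum noise

    def interrupt(noise, t, ips, duty=0.5):
        """Gate the noise on and off ips times per second (0 ips = continuous)."""
        if ips == 0:
            return noise
        gate = ((t * ips) % 1.0) < duty        # square wave with the given duty
        return noise * gate

    noise_5ips = interrupt(noise, t, ips=5)    # 100 ms noise / 100 ms silence
    # Mixing with the NU-6 words at a chosen signal-to-noise ratio would follow.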
589
Normative Data for the Words-in-Noise Test for 6- to 12-Year-Old Children. Wilson, Richard H.; Farmer, Nicole M.; Gandhi, Avni; Shelburne, Emily; Weaver, Jamie, 01 October 2010.
Purpose: To establish normative data for children on the Words-in-Noise Test (WIN; R. H. Wilson, 2003; R. H. Wilson & R. McArdle, 2007).
Method: Forty-two children in each of 7 age groups, ranging in age from 6 to 12 years (n = 294), and 24 young adults (age range: 18-27 years) with normal hearing for pure tones participated. All listeners were screened at 15 dB HL (American National Standards Institute, 2004) at the octave intervals between 500 and 4000 Hz. Randomizations of WIN Lists 1, 2, and 1 or WIN Lists 2, 1, and 2 were presented with the noise fixed at 70 dB SPL, followed by presentation at 90 dB SPL of the 70 Northwestern University Auditory Test No. 6 (T. W. Tillman & R. Carhart, 1966) words used in the WIN. Finally, the Peabody Picture Vocabulary Test-Revised (L. M. Dunn & L. M. Dunn, 1981) was administered. Testing was conducted in a quiet room.
Results: There were 3 main findings: (a) the biggest change in recognition performance occurred between the ages of 6 and 7 years; (b) from 9 to 12 years, recognition performance was stable; and (c) performance by young adults (18-27 years) was slightly better (1-2 dB) than performance by the older children.
Conclusion: The WIN can be used with children as young as 6 years of age; however, age-specific ranges of normal recognition performance must be used.
590
Deep networks for sign language video caption. Zhou, Mingjie, 12 August 2020.
In the hearing-loss community, sign language is a primary tool for communication, yet there is a communication gap between people with hearing loss and people with normal hearing. Sign language is different from spoken language: it has its own vocabulary and grammar. Recent works concentrate on sign language video captioning, which consists of sign language recognition and sign language translation. Continuous sign language recognition, which can bridge the communication gap, is a challenging task because of the weakly supervised ordered annotations, where no frame-level labels are provided. To overcome this problem, connectionist temporal classification (CTC) is the most widely used method. However, CTC learning can perform poorly if the extracted features are not good. For better feature extraction, this thesis presents novel self-attention-based fully-inception (SAFI) networks for vision-based end-to-end continuous sign language recognition. Considering that sign words differ in length, we introduce the fully inception network with different receptive fields to extract dynamic clip-level features. To further boost performance, the fully inception network with an auxiliary classifier is trained with an aggregation cross entropy (ACE) loss. The encoder of a self-attention network, serving as the global sequential feature extractor, is then used to model the clip-level features with CTC. The proposed model is optimized by jointly training with ACE on clip-level feature learning and CTC on global sequential feature learning in an end-to-end fashion. The best baseline method achieves 35.6% WER on the validation set and 34.5% WER on the test set; it employs a better decoding algorithm for generating pseudo labels to perform EM-like optimization and fine-tune the CNN module. In contrast, our approach focuses on better feature extraction for end-to-end learning. To alleviate overfitting on the limited dataset, we employ temporal elastic deformation to triple the real-world dataset RWTH-PHOENIX-Weather 2014. Experimental results on the real-world dataset RWTH-PHOENIX-Weather 2014 demonstrate the effectiveness of our approach, which achieves 31.7% WER on the validation set and 31.2% WER on the test set. Even though sign language recognition can, to some extent, help bridge the communication gap, its output is still organized according to sign language grammar, which differs from spoken language. Unlike sign language recognition, which recognizes sign gestures, sign language translation (SLT) converts sign language into text in a target spoken language, which people with normal hearing commonly use in their daily lives. To achieve this goal, this thesis provides an effective sign language translation approach that achieves state-of-the-art performance on the largest real-life German sign language translation database, RWTH-PHOENIX-Weather 2014T. In addition, a direct end-to-end sign language translation approach gives promising results (a gain from 9.94 to 13.75 BLEU on the validation set and from 9.58 to 14.07 BLEU on the test set) without intermediate recognition annotations. These comparative and promising experimental results show the feasibility of direct end-to-end SLT.
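To make the weakly supervised objective above concrete, the sketch below applies a CTC loss to stand-in clip-level features, where only the ordered gloss sequence, with no frame-level alignment, supervises training; the tensor shapes and vocabulary size are illustrative, and the thesis's SAFI network and ACE loss are not shown.

    # Minimal CTC training sketch for continuous sign language recognition:
    # ordered gloss labels supervise unaligned clip-level features.
    import torch
    import torch.nn as nn

    T, N, C = 120, 2, 1200 + 1        # clips, batch size, gloss vocab (+ blank)
    feats = torch.randn(T, N, C, requires_grad=True)   # stand-in network output
    log_probs = feats.log_softmax(dim=-1)

    targets = torch.randint(1, C, (N, 20), dtype=torch.long)   # gloss sequences
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 20, dtype=torch.long)

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()   # in the thesis, combined with an ACE loss on the clips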