Neural Enhancement Strategies for Robust Speech Processing

Nawar, Mohamed Nabih Ali Mohamed, 10 March 2023
In real-world scenarios, speech signals are often contaminated by environmental noise and reverberation, which degrade speech quality and intelligibility. In recent years, deep learning algorithms have marked milestones in speech-based research fields such as speech recognition and spoken language understanding. Speech enhancement, a crucial topic in this area, aims to restore clean speech signals from noisy ones. Over the past decades, many conventional statistical speech enhancement algorithms have been proposed, but their performance is limited in non-stationary noise conditions. The rise of deep learning-based approaches has led to revolutionary advances in enhancement performance: speech enhancement is formulated as a supervised learning problem, which addresses the open challenges left by conventional approaches. Deep learning enhancement approaches are generally categorized into frequency-domain and time-domain methods. In particular, we experiment with the Wave-U-Net model, a strong time-domain approach to speech enhancement.

First, we attempt to improve the performance of back-end speech classification tasks in noisy conditions. Specifically, we propose a pipeline that integrates the Wave-U-Net (later modified into the Dilated Encoder Wave-U-Net) as a pre-processing stage for noise elimination, followed by a temporal convolutional network (TCN) for intent classification; the two models are trained independently of each other. Experimental results show that the modified Wave-U-Net not only improves speech quality and intelligibility, measured by the PESQ and STOI metrics, but also improves back-end classification accuracy.

We then observed that this disjoint training approach often introduces signal distortion in the output of the speech enhancement module, which can deteriorate back-end performance. Motivated by this, we introduce a set of fully time-domain joint training pipelines that combine the Wave-U-Net model with the TCN intent classifier, differing in the interconnections between front-end and back-end. All architectures are trained with a loss function that combines an MSE front-end loss with a cross-entropy classification loss, and we find that the joint architecture that balances the two components' contributions equally yields the best classification accuracy.
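For concreteness, here is a minimal sketch of what such a combined objective could look like in PyTorch. All module and variable names are illustrative (not the thesis code), and alpha = 0.5 stands in for the equally balanced setting the abstract reports as working best:

```python
import torch
import torch.nn.functional as F

def joint_loss(enhanced, clean, logits, labels, alpha=0.5):
    """Joint front-end/back-end objective: time-domain MSE for the
    enhancement module plus cross-entropy for the intent classifier.
    alpha = 0.5 weights the two contributions equally."""
    mse = F.mse_loss(enhanced, clean)        # front-end (Wave-U-Net) loss
    ce = F.cross_entropy(logits, labels)     # back-end (TCN classifier) loss
    return alpha * mse + (1.0 - alpha) * ce

# Toy shapes standing in for real model outputs; in joint training the
# classifier gradient also flows back through the enhancement front-end.
enhanced = torch.randn(8, 16000, requires_grad=True)  # 1 s waveforms at 16 kHz
clean = torch.randn(8, 16000)
logits = torch.randn(8, 31, requires_grad=True)       # e.g. 31 intent classes
labels = torch.randint(0, 31, (8,))
joint_loss(enhanced, clean, logits, labels).backward()
```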
Finally, the release of large-scale pre-trained feature extraction models has considerably simplified the development of speech classification and recognition algorithms. However, environmental noise and reverberation still degrade performance, making robustness in noisy conditions mandatory for real-world applications. One way to mitigate the noise effect is to integrate a speech enhancement front-end that removes artifacts from the desired speech signals. Unlike state-of-the-art enhancement approaches, which operate either on speech spectrograms or directly on time-domain signals, we study how enhancement can be applied directly to speech embeddings extracted with the Wav2Vec and WavLM models. We investigate a variety of training approaches, considering different flavors of joint and disjoint training of the speech enhancement front-end and the classification/recognition back-end. We perform exhaustive experiments on the Fluent Speech Commands and Google Speech Commands datasets contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, as well as on LibriSpeech contaminated with noises from the MUSAN dataset, covering intent classification, keyword spotting, and speech recognition, respectively. Results show that enhancing the speech embedding is a viable and computationally effective approach, and they provide insights into the most promising training strategies.
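As an illustration of enhancement in the embedding domain, a minimal sketch follows, assuming a residual MLP that maps noisy Wav2Vec/WavLM frame embeddings toward embeddings of the clean signal. The architecture and the 768-dimensional frame size (the base-model size) are assumptions, not the thesis design:

```python
import torch
import torch.nn as nn

class EmbeddingEnhancer(nn.Module):
    """Denoises pre-trained speech embeddings frame by frame: the input
    is kept and a correction term is predicted (residual connection)."""
    def __init__(self, dim=768, hidden=1024):   # 768 = Wav2Vec/WavLM base size
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):                        # x: (batch, frames, dim)
        return x + self.net(x)

# Trained disjointly with MSE against clean-signal embeddings, or jointly
# with the back-end loss, mirroring the training flavors compared above.
noisy_emb, clean_emb = torch.randn(4, 100, 768), torch.randn(4, 100, 768)
loss = nn.functional.mse_loss(EmbeddingEnhancer()(noisy_emb), clean_emb)
```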

Increasing speaker invariance in unsupervised speech learning by partitioning probabilistic models using linear siamese networks

Fahlström Myrman, Arvid, January 2017
Unsupervised learning of speech is concerned with automatically finding patterns in speech, such as words or speech sounds, without supervision in the form of orthographic transcriptions or a priori knowledge of the language. A fundamental problem, however, is that unsupervised speech learning methods tend to discover highly speaker-specific and context-dependent representations. We propose a method for improving the quality of posteriorgrams generated by an unsupervised model through partitioning of the latent classes the model discovers. We do this by training a sparse siamese model to find a linear transformation of the input posteriorgrams, extracted from the unsupervised model, into lower-dimensional posteriorgrams. The siamese model makes use of same-category and different-category speech fragment pairs obtained through unsupervised term discovery. After training, the model is converted into an exact partitioning of the posteriorgrams. We evaluate the model on the minimal-pair ABX task in the context of the Zero Resource Speech Challenge and demonstrate that our method significantly reduces the dimensionality of standard Gaussian mixture model posteriorgrams while also making them more speaker-invariant. This suggests that the model may be viable as a general post-processing step for improving probabilistic acoustic features obtained by unsupervised learning.
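A minimal sketch of how such a linear partitioning could be set up, assuming a softmax-parameterized assignment matrix and a dot-product siamese objective. Names, dimensions, and the exact loss are illustrative; the thesis's sparsity constraint is not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearPartitioner(nn.Module):
    """Soft linear map from K input classes to M < K output classes.
    Rows of the assignment matrix lie on the simplex, so a valid
    posteriorgram stays a valid posteriorgram; after training, each
    input class is assigned outright to one output class."""
    def __init__(self, k_in=1024, m_out=64):
        super().__init__()
        # Small random init breaks the symmetry between output classes.
        self.logits = nn.Parameter(0.01 * torch.randn(k_in, m_out))

    def forward(self, post):                     # post: (frames, k_in)
        assign = F.softmax(self.logits, dim=1)   # each row sums to 1
        return post @ assign                     # (frames, m_out)

    def hard_partition(self):
        """Exact partition: every input class maps to one output class."""
        return self.logits.argmax(dim=1)         # (k_in,)

def siamese_loss(q_a, q_b, same_word):
    """Pull together transformed posteriorgrams of aligned frame pairs
    from the same discovered word type; push apart different types."""
    sim = (q_a * q_b).sum(dim=-1)                # per-frame dot product
    return (1 - sim).mean() if same_word else sim.mean()
```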
