Global ETD Search

21	Identification and Classification of TTS Intelligibility Errors Using ASR : A Method for Automatic Evaluation of Speech Intelligibility / Identifiering och klassifiering av fel relaterade till begriplighet inom talsyntes. : Ett förslag på en metod för automatisk utvärdering av begriplighet av tal. Henriksson, Erik January 2023 In recent years, applications using synthesized speech have become more numerous and publicly available. As the area grows, so does the need for delivering high-quality, intelligible speech, and consequently the need for effective methods of assessing the intelligibility of synthesized speech. The common method of evaluating speech using human listeners has the disadvantages of being costly and time-consuming. Because of this, alternative methods of evaluating speech automatically, using automatic speech recognition (ASR) models, have been introduced. This thesis presents an evaluation system that analyses the intelligibility of synthesized speech using automatic speech recognition, and attempts to identify and categorize the intelligibility errors present in the speech. The system is evaluated in two experiments. The first uses publicly available sentences and corresponding synthesized speech, and the second uses publicly available models to synthesize speech for evaluation. Additionally, a survey is conducted in which human transcriptions are used instead of automatic speech recognition, and the resulting intelligibility evaluations are compared with those based on automatic speech recognition transcriptions. Results show that this system can be used to evaluate the intelligibility of a model, as well as identify and classify intelligibility errors. It is shown that a combination of automatic speech recognition models can lead to more robust and reliable evaluations, and that reference human recordings can be used to further increase confidence.
The evaluation scores correlate well with human evaluations, and certain automatic speech recognition models correlate more strongly with human evaluations than others. This research shows that automatic speech recognition can be used to produce a reliable and detailed analysis of text-to-speech (TTS) intelligibility, which has the potential to make TTS improvements more efficient and to allow better TTS models to be delivered at a faster rate. / In recent years, the number of applications that use synthetic speech has grown, and they have become more accessible to the public. As the field grows, so does the need to deliver speech of high quality and clarity, and with it the need for effective methods of assessing the intelligibility of synthetic speech. The usual method of evaluating speech with human listeners has the drawbacks of being costly and time-consuming. For that reason, alternative methods that evaluate speech automatically using automatic speech recognition models have been introduced. This thesis presents an evaluation system that analyses the intelligibility of synthetic speech using automatic speech recognition and attempts to identify and categorize the intelligibility errors present in the speech. The system is then evaluated in two experiments. The first experiment uses publicly available sentences and corresponding audio files of synthetic speech, and the second uses publicly available models to synthesize speech for evaluation. In addition, a survey is conducted in which human transcriptions are used instead of automatic speech recognition. The resulting intelligibility assessments are then compared with assessments based on transcriptions produced by automatic speech recognition.
The results show that the evaluation performed by this system can be used to assess the intelligibility of a speech synthesis model and to identify and categorize intelligibility errors. It is shown that a combination of automatic speech recognition models can lead to more robust and reliable evaluations, and that reference recordings of human speech can be used to further increase reliability. The evaluation results show a good correlation with human evaluations, while certain automatic speech recognition models turn out to have a stronger correlation with human evaluations. This research shows that automatic speech recognition can be used to produce a reliable and detailed analysis of the intelligibility of speech synthesis, which has the potential to make improvements in speech synthesis more efficient and to enable the delivery of better speech synthesis models at a faster rate. Keywords: Automatic Speech Recognition; Natural Language Processing; Speech Technology; Speech Quality Assessment; Text-To-Speech; Speech Recognition; Language Technology; Speech Synthesis; Computer and Information Sciences
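
The abstract of entry 21 does not say which metric its system computes, so the following is only a minimal sketch of one common way to score an ASR transcript against the text that was synthesized: word error rate (WER), assumed here for illustration. The function name `word_error_rate` and the example sentences are hypothetical and are not taken from the thesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: WER is used as the intelligibility score; the thesis
    abstract does not name its metric.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the transcript
                curr[j - 1] + 1,     # extra word inserted by the recognizer
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    text = "the cat sat on the mat"        # sentence given to the TTS model
    transcript = "the cat sat on a mat"    # hypothetical ASR output
    print(word_error_rate(text, transcript))
```

A low WER on synthesized speech, especially alongside a low WER on a human reference recording of the same sentence, would suggest that the remaining errors come from the synthesis rather than from the recognizer.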
22	Supervised Speech Separation Using Deep Neural Networks Wang, Yuxuan 21 May 2015 No description available. Keywords: Computer Science Engineering; Speech separation; time-frequency masking; computational auditory scene analysis; acoustic features; deep neural networks; training targets; generalization; speech intelligibility; speech quality
23	Automatic Speech Quality Assessment in Unified Communication : A Case Study / Automatisk utvärdering av samtalskvalitet inom integrerad kommunikation : en fallstudie Larsson Alm, Kevin January 2019 Speech as a medium for communication has always been important for its ability to convey our ideas, personality and emotions. It is therefore not surprising that Quality of Experience (QoE) is central to any business relying on voice communication. Using Unified Communication (UC) systems, users can communicate with each other in several ways using many different devices, making QoE an important aspect of such systems. This thesis studies automatic methods for assessing the speech quality of voice calls in Briteback’s UC application, including a comparison of the researched methods. Three methods are studied, each using a Gaussian Mixture Model (GMM) as a regressor, paired with extraction of Human Factor Cepstral Coefficients (HFCC), Gammatone Frequency Cepstral Coefficients (GFCC) and Modified Mel Frequency Cepstrum Coefficients (MMFCC) features respectively. The method based on HFCC feature extraction generally performs better than the other two, but all three methods perform worse than the results reported in the literature. This most likely stems from implementation errors, which shows the gap between theory and practice in the literature, together with the lack of reference implementations. Further work with practical aspects in mind, such as reference implementations or verification tools, could make the field more popular and increase its use in the real world. Keywords: speech; voice communication; qoe; quality of experience; unified communication; uc; speech quality assessment; speech quality; voice calls; gaussian mixture model; gmm; gaussian mixture regression; gmr; mel frequency cepstrum coefficients; mfcc; human feature cepstrum coefficients; hfcc; gfcc; Software Engineering
24	Gerenciamento adaptativo da qualidade da fala entre terminais VoIP Carvalho, Leandro Silva Galvão de 07 October 2011 Made available in DSpace on 2015-04-20. Voice calls based on Voice over Internet Protocol (VoIP) technology are liable to several impairments at both the application and network layers, such as codec compression, end-to-end delay, and packet loss. For years, this problem has been challenging researchers and practitioners, who have been designing and improving QoS control mechanisms for VoIP applications. Such mechanisms aim to make optimum use of network and terminal resources so as to minimize the effects of network impairments on voice quality. Among the several proposed QoS control mechanisms for VoIP, some seek to adapt the voice flow or other VoIP-related parameters in accordance with significant changes in the network, end users' preferences, or service providers' requirements. VoIP systems are particularly likely to require a dynamic adaptation solution for dealing with the complex trade-off between speech quality and impairments, because of the decentralized control nature of IP networks and the stochastic nature of data packet delivery. Although the existing adaptive solutions for QoS control of VoIP show some performance improvement and exhibit some sort of feedback, they do not focus explicitly on the control loop. This document shows the current progress of our thesis, which addresses the adjustment of internal parameters of VoIP terminals (at the application layer) that affect the voice flow, with the aim of improving speech quality in response to changes in network conditions. It is not in the scope of the thesis to propose adaptive solutions that focus exclusively on signaling, billing, or security issues, or that operate at the network layer.
Therefore, this thesis addresses the problem of how to adjust encoding parameters in response to variations in delay and packet loss, in order to optimize speech quality. The objective is to optimize user-perceptible attributes of speech, from the perspective of self-adaptive software systems. The emphasis is not on developing new audio codecs, but on building a control loop in the core of sender and receiver terminals to adapt voice flow settings according to network conditions. The main contributions of this thesis are the following: determination of the user's perception during codec switching; parametrization of codec precedence to support codec switching decisions; explicit design of a monitoring-analysis-planning-execution control loop as the core of the adaptation process; and efficiency analysis of feedback message exchange. / Voice calls based on VoIP (Voice over Internet Protocol) technology are susceptible to various impairments, arising from both the application layer and the network layer, such as codec compression, end-to-end delay, and packet loss. For years, this problem has challenged researchers and practitioners, who have designed and improved QoS control mechanisms for VoIP applications. Such mechanisms aim to optimize the use of network and VoIP terminal resources so as to minimize the harmful effects of the underlying network on voice quality. Among the various proposed QoS control mechanisms for VoIP, some seek to adapt the voice flow or other VoIP parameters according to significant changes in the network, user preferences, or the requirements of VoIP service providers. VoIP systems in particular require dynamic adaptation solutions to deal with the complex trade-off between voice quality and impairment factors, because of the decentralized and stochastic nature of IP networks in delivering voice packets.
Although the existing adaptive solutions for QoS control in VoIP show some performance improvement and exhibit some kind of feedback, they do not focus explicitly on the control loop. This document shows the current progress of our thesis, which addresses the adjustment of internal parameters of VoIP terminals (application layer) that affect the voice flow, with the aim of improving speech quality in response to changes in network conditions. It is not within the scope of the thesis to address adaptive solutions that focus exclusively on signaling, billing, or security issues, or that operate at the network layer. Therefore, this thesis addresses the problem of designing and evaluating adaptive strategies that exploit the trade-offs between speech quality and the following impairment factors: codec compression, end-to-end delay, and packet loss. The aim is to optimize user-perceptible attributes of speech, from the perspective of self-adaptive software systems. The emphasis is not on developing new audio codecs, but on developing a control loop as the central entity of a VoIP terminal, one that can adapt the voice flow settings according to network conditions. The main contributions of this thesis are the following: determination of the user's perception during codec switching; parametrization of codec precedence to support codec switching decisions; a focus on the control loop based on the activities of monitoring, analysis, planning, and execution as the core of the adaptation process; and efficiency analysis of the exchange of feedback messages. Keywords: Voice over Internet Protocol (VoIP); Speech quality adaptation; Quality of Service (QoS) control; Feedback loop
25	LaMOSNet: Latent Mean-Opinion-Score Network for Non-intrusive Speech Quality Assessment : Deep Neural Network for MOS Prediction / LaMOSNet: Latent Mean-Opinion-Score Network för icke-intrusiv ljudkvalitetsbedömning : Djupt neuralt nätverk för MOS prediktion Cumlin, Fredrik January 2022 Objective non-intrusive speech quality assessment, which aims to emulate and correlate with human judgement, has received more attention over the years. It is a difficult problem for three reasons: data scarcity, noisy human judgement, and a potentially uneven distribution of bias in mean opinion scores (MOS). In this paper, we introduce the Latent Mean-Opinion-Score Network (LaMOSNet), which leverages individual judges' scores to increase the data size, and new ideas to deal with both noisy and biased labels. We introduce a methodology called Optimistic Judge Estimation as a way to reduce bias in MOS in a clear way. We also implement stochastic gradient noise and mean teacher, ideas from noisy image classification, to further deal with noisy labels and an uneven bias distribution of labels. We achieve competitive results on VCC2018 when modeling MOS, and state-of-the-art results when modeling only listener-dependent scores. / Objective reference-free audio quality assessment intended to mimic and correlate with human judgement has received more attention over the years. It is a difficult problem for three reasons: lack of data, variance in human judgement, and a potentially uneven distribution of bias in mean opinion scores (MOS). In this paper, we introduce the Latent Mean-Opinion-Score Network (LaMOSNet), which makes use of individual judges' scores to increase the data size, and new ideas for handling both varying and biased labels. I introduce a methodology called Optimistic Judge Estimation, a way to reduce bias in MOS in a clear way.
I also implement stochastic gradient noise and mean teacher, ideas from noisy image classification, to further handle unreliable labels. I obtain comparable results on VCC2018 when modeling MOS, and state-of-the-art results when modeling only the judges' labels. Keywords: Speech naturalness assessment; Speech quality assessment; Mean opinion score; Voice conversion challenge; Semi-supervised learning; Noisy labels; Computer Engineering
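
Entry 25 mentions the mean teacher technique without describing it. In its usual formulation the teacher's weights are an exponential moving average (EMA) of the student's weights, updated after each training step. The sketch below shows only that update on plain lists of floats; the function name `ema_update`, the default `decay` value, and the toy numbers are assumptions for illustration, not details from the thesis.

```python
def ema_update(teacher, student, decay=0.99):
    """Return updated teacher weights: decay * teacher + (1 - decay) * student.

    Assumption: weights are flat lists of floats; a real model would apply
    the same rule to every parameter tensor.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    teacher = [1.0, 0.0]
    student = [0.0, 1.0]
    # With decay=0.5 the teacher moves halfway toward the student.
    print(ema_update(teacher, student, decay=0.5))
```

Because the teacher changes slowly, its predictions are smoother than the student's, which is why the technique is often used when labels are noisy.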
26	DeePMOS: Deep Posterior Mean-Opinion-Score for Speech Quality Assessment : DNN-based MOS Prediction Using a Posterior / DeePMOS: Deep Posterior Mean-Opinion-Score för talkvalitetsbedömning : DNN-baserad MOS-prediktion med hjälp av en posterior Liang, Xinyu January 2024 This project focuses on deep neural network (DNN)-based non-intrusive speech quality assessment, specifically addressing the challenge of predicting mean-opinion-score (MOS) with interpretable posterior distributions. The conventional approach of providing a single point estimate for MOS lacks interpretability and does not capture the uncertainty inherent in subjective assessments. This thesis introduces DeePMOS, a novel framework capable of producing MOS predictions in the form of posterior distributions, offering a more nuanced and understandable representation of speech quality. DeePMOS adopts a CNN-BLSTM architecture with multiple prediction heads to model Gaussian and Beta posterior distributions. For robust training, we use a combination of maximum-likelihood learning, stochastic gradient noise, and a student-teacher learning setup to handle limited and noisy training data. Results showcase DeePMOS's competitive performance, particularly with DeePMOS-B achieving state-of-the-art utterance-level performance. The significance lies in providing accurate predictions along with a measure of confidence, enhancing transparency and reliability. This opens avenues for application in domains such as telecommunications and audio-processing systems. Future work could explore additional posterior distributions, evaluate the model on high-quality datasets, and consider incorporating listener-dependent scores. / This project focuses on non-intrusive assessment of speech quality using deep neural networks (DNN), in particular on the challenge of predicting mean-opinion-score (MOS) with interpretable posterior distributions.
The conventional approach of giving a single point estimate for MOS lacks interpretability and does not capture the uncertainty inherent in subjective assessments. This thesis introduces DeePMOS, a new framework capable of producing MOS predictions in the form of posterior distributions, which gives a more nuanced and understandable representation of speech quality. DeePMOS adopts a CNN-BLSTM architecture with multiple prediction heads to model Gaussian and Beta posterior distributions. For robust training, we use a combination of maximum-likelihood learning, stochastic gradient noise, and a student-teacher learning setup to handle limited and noisy training data. The results show DeePMOS's competitive performance, particularly DeePMOS-B, which achieves state-of-the-art performance at the utterance level. The significance lies in providing accurate predictions together with a measure of confidence, which increases transparency and reliability. This opens up possibilities for applications in areas such as telecommunications and audio-processing systems. Future work could explore additional posterior distributions, evaluate the model on high-quality datasets, and consider including listener-dependent scores. Keywords: Speech Quality Assessment; Deep Neural Network; Maximum-Likelihood; Bayesian Estimation; Other Electrical Engineering and Electronics; Electrical Engineering and Electronics
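
Entry 26 describes maximum-likelihood learning of a Gaussian posterior over MOS. One standard way to express this, assumed here because the abstract does not give the loss, is to have the model output a mean and a variance and to minimize the Gaussian negative log-likelihood of the observed score. The function name `gaussian_nll` and the example values are hypothetical.

```python
import math


def gaussian_nll(mean: float, variance: float, target: float) -> float:
    """Negative log-likelihood of `target` under N(mean, variance).

    Assumption: the predicted posterior is a univariate Gaussian whose mean
    and variance come from two prediction heads.
    """
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    return 0.5 * (math.log(2.0 * math.pi * variance)
                  + (target - mean) ** 2 / variance)


if __name__ == "__main__":
    observed_mos = 3.5  # hypothetical listener rating on a 1-5 scale
    # A confident but wrong prediction is penalized more heavily than an
    # uncertain one with the same mean.
    print(gaussian_nll(mean=2.5, variance=0.1, target=observed_mos))
    print(gaussian_nll(mean=2.5, variance=1.0, target=observed_mos))
```

The variance term is what gives the prediction its measure of confidence: the loss rewards a small variance only when the mean is close to the observed score.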
