Global ETD Search

1	Implementation and Evaluation of P.880 Methodology
Imam, Hasani Syed Hassan January 2009
Continuous Evaluation of Time Varying Speech Quality (CETVSQ) is a method for the subjective assessment of transmitted speech quality in long speech sequences whose quality fluctuates over time. The method is built on two subjective tasks: the first is to assess the speech quality while listening, and the second is to assess the overall speech quality after listening to the speech sequence. The development of CETVSQ was motivated by the fact that speech quality degradations are often not constant but vary over time. In modern IP telephony and wireless networks, speech quality varies because of specific impairments such as packet loss, echo, and handover in networks. Many standard methods already exist for the subjective assessment of short speech sequences, but methods such as ITU-T Rec. P.800 are well suited only to time-constant speech quality. The aim of this thesis work was to implement the CETVSQ methodology so that long speech sequences can be assessed. An analog hardware slider is used for the continuous assessment of speech quality as well as for the overall quality judgments. Instantaneous and overall quality judgments are saved to an Excel file, and the stored results are analyzed with different statistical measures. In the evaluation part of the thesis, the subjects’ scores are analyzed with statistical methods to identify several factors that originate in the CETVSQ methodology. A subjective test had already been conducted according to the P.800 ACR method, in which the long speech sequences were divided into short 8-second sequences and then assessed. In this study, the long speech sequences are assessed using the CETVSQ methodology, and the results are compared with the P.800 ACR results. The comparison shows that when long speech sequences are divided into short segments and evaluated using P.800 ACR, the results differ from those obtained with the CETVSQ methodology. The study thus demonstrates the need for the CETVSQ methodology.
ITU-T Rec. P.880; methodology; speech quality assessment; subjective speech quality assessment
2	Identification and Classification of TTS Intelligibility Errors Using ASR : A Method for Automatic Evaluation of Speech Intelligibility / Identifiering och klassifiering av fel relaterade till begriplighet inom talsyntes. : Ett förslag på en metod för automatisk utvärdering av begriplighet av tal.
Henriksson, Erik January 2023
In recent years, applications using synthesized speech have become more numerous and publicly available. As the area grows, so does the need for delivering high-quality, intelligible speech, and with it the need for effective methods of assessing the intelligibility of synthesized speech. The common method of evaluating speech with human listeners has the disadvantages of being costly and time-consuming. Because of this, alternative methods that evaluate speech automatically, using automatic speech recognition (ASR) models, have been introduced. This thesis presents an evaluation system that analyses the intelligibility of synthesized speech using automatic speech recognition and attempts to identify and categorize the intelligibility errors present in the speech. The system is evaluated in two experiments. The first uses publicly available sentences and corresponding synthesized speech, and the second uses publicly available models to synthesize speech for evaluation. Additionally, a survey is conducted in which human transcriptions are used instead of automatic speech recognition, and the resulting intelligibility evaluations are compared with those based on automatic speech recognition transcriptions. Results show that this system can be used to evaluate the intelligibility of a model, as well as to identify and classify intelligibility errors. A combination of automatic speech recognition models is shown to lead to more robust and reliable evaluations, and reference human recordings can be used to further increase confidence. The evaluation scores show a good correlation with human evaluations, with certain automatic speech recognition models correlating more strongly than others. This research shows that automatic speech recognition can be used to produce a reliable and detailed analysis of text-to-speech (TTS) intelligibility, which has the potential to make text-to-speech improvements more efficient and to allow better text-to-speech models to be delivered at a faster rate.
/ In recent years, the number of applications that use synthetic speech has grown, and they have become more accessible to the public. As the field grows, so does the need to deliver speech of high quality and clarity, and with it the need for effective methods of assessing the intelligibility of synthetic speech. The usual method of evaluating speech with human listeners has the drawbacks of being costly and time-consuming. For that reason, alternative methods that evaluate speech automatically using automatic speech recognition models have been introduced. This thesis presents an evaluation system that analyses the intelligibility of synthetic speech using automatic speech recognition and attempts to identify and categorize the intelligibility errors found in the speech. The system is then evaluated in two experiments. The first experiment uses publicly available sentences and corresponding audio files of synthetic speech, and the second uses publicly available models to synthesize speech for evaluation. In addition, a survey is carried out in which human transcriptions are used instead of automatic speech recognition. The resulting intelligibility assessments are then compared with assessments based on transcriptions produced by automatic speech recognition. The results show that the evaluation performed by this system can be used to assess the intelligibility of a speech synthesis model and to identify and categorize intelligibility errors. A combination of automatic speech recognition models is shown to lead to more robust and reliable evaluations, and reference recordings of human speech can be used to further increase reliability. The evaluation results show a good correlation with human evaluations, while certain automatic speech recognition models turn out to have a stronger correlation with human evaluations. This research shows that automatic speech recognition can be used to produce a reliable and detailed analysis of the intelligibility of speech synthesis, which has the potential to make improvements in speech synthesis more efficient and to enable better speech synthesis models to be delivered at a faster pace.
Automatic Speech Recognition; Natural Language Processing; Speech Technology; Speech Quality Assessment; Text-To-Speech; Taligenkänning; Språkteknologi; Talkvalitetsbedömning; Talsyntes; Computer and Information Sciences; Data- och informationsvetenskap
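The abstract above does not say how intelligibility errors are detected, so the following is only a generic sketch of one common approach: align the input text with an ASR transcript at the word level and label each mismatch. The function names `align_words` and `word_error_rate` and the three error labels (substitution, deletion, insertion) are assumptions made for illustration and are not taken from the thesis.

```python
def align_words(reference, hypothesis):
    """List word-level errors from a minimum-edit alignment.

    `reference` is the text given to the synthesizer and `hypothesis` is a
    transcript of the synthesized audio. Both names are hypothetical.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dist[i][j] holds the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the end of both sequences to recover the error list.
    errors = []
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + mismatch:
            if mismatch:
                errors.append(("substitution", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            errors.append(("deletion", ref[i - 1], None))
            i -= 1
        else:
            errors.append(("insertion", None, hyp[j - 1]))
            j -= 1
    errors.reverse()
    return errors


def word_error_rate(reference, hypothesis):
    """Number of word errors divided by the number of reference words."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("reference must contain at least one word")
    return len(align_words(reference, hypothesis)) / n_words
```

Running the same alignment against transcripts from several ASR models, or against a transcript of a human reference recording, would be one way to separate errors caused by the synthesizer from errors caused by the recognizer, which is the kind of cross-check the abstract describes.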
3	LaMOSNet: Latent Mean-Opinion-Score Network for Non-intrusive Speech Quality Assessment : Deep Neural Network for MOS Prediction / LaMOSNet: Latent Mean-Opinion-Score Network för icke-intrusiv ljudkvalitetsbedömning : Djupt neuralt nätverk för MOS prediktion
Cumlin, Fredrik January 2022
Objective non-intrusive speech quality assessment that aims to emulate and correlate with human judgement has received more attention over the years. It is a difficult problem for three reasons: data scarcity, noisy human judgement, and a potentially uneven distribution of bias in mean opinion scores (MOS). In this paper, we introduce the Latent Mean-Opinion-Score Network (LaMOSNet), which leverages individual judges’ scores to increase the data size, together with new ideas for dealing with both noisy and biased labels. We introduce a methodology called Optimistic Judge Estimation as a clear way to reduce bias in MOS. We also implement stochastic gradient noise and mean teacher, ideas from noisy image classification, to further deal with noisy labels and their uneven bias distribution. We achieve competitive results on VCC2018 when modeling MOS, and state-of-the-art results when modeling only listener-dependent scores.
/ Objective reference-free audio quality assessment intended to mimic and correlate with human judgement has received more attention over the years. It is a difficult problem for three reasons: lack of data, variance in human judgement, and a potentially uneven distribution of bias in the mean opinion score (MOS). In this paper we introduce the Latent Mean-Opinion-Score Network (LaMOSNet), which makes use of individual judges’ scores to increase the data size, along with new ideas for handling both varying and biased labels. I introduce a methodology called Optimistic Judge Estimation, a clear way to reduce the bias in MOS. I also implement stochastic gradient noise and mean teacher, ideas from noisy image classification, to further handle unreliable labels. I obtain comparable results on VCC2018 when modeling MOS, and state-of-the-art results when modeling only the judges’ labels.
Speech naturalness assessment; Speech quality assessment; Mean opinion score; Voice conversion challenge; Semi-supervised learning; Noisy labels; Naturligt tal bedömning; Ljudkvalitetsbedömning; Medel bedömningsvärdet; Talkonverteringsutmaningen; Semi-övervakad inlärning; Varierande märkningar; Computer Engineering; Datorteknik
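The abstract names mean teacher as one of the techniques used against noisy labels. In the general mean-teacher scheme, the teacher's weights are an exponential moving average of the student's weights. The sketch below shows only that update rule on plain dictionaries of floats; the function name `ema_update`, the `decay` value, and the weight layout are illustrative assumptions and do not describe LaMOSNet's actual training code.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as a moving average toward the student.

    `teacher` and `student` map parameter names to floats. This layout is a
    simplification for illustration; real models hold tensors.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


# Hypothetical usage: after each student optimization step, pull the
# teacher slightly toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, so it averages over many student steps and is less affected by any single noisy label.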
4	DeePMOS: Deep Posterior Mean-Opinion-Score for Speech Quality Assessment : DNN-based MOS Prediction Using a Posterior / DeePMOS: Deep Posterior Mean-Opinion-Score för talkvalitetsbedömning : DNN-baserad MOS-prediktion med hjälp av en posterior
Liang, Xinyu January 2024
This project focuses on deep neural network (DNN)-based non-intrusive speech quality assessment, specifically addressing the challenge of predicting the mean-opinion-score (MOS) with interpretable posterior distributions. The conventional approach of providing a single point estimate for MOS lacks interpretability and does not capture the uncertainty inherent in subjective assessments. This thesis introduces DeePMOS, a novel framework capable of producing MOS predictions in the form of posterior distributions, offering a more nuanced and understandable representation of speech quality. DeePMOS adopts a CNN-BLSTM architecture with multiple prediction heads to model Gaussian and Beta posterior distributions. For robust training, we use a combination of maximum-likelihood learning, stochastic gradient noise, and a student-teacher learning setup to handle limited and noisy training data. Results showcase DeePMOS's competitive performance, particularly with DeePMOS-B achieving state-of-the-art utterance-level performance. The significance lies in providing accurate predictions along with a measure of confidence, enhancing transparency and reliability. This opens avenues for application in domains such as telecommunications and audio-processing systems. Future work could explore additional posterior distributions, evaluate the model on high-quality datasets, and consider incorporating listener-dependent scores.
/ This project focuses on non-intrusive assessment of speech quality using deep neural networks (DNNs), in particular on the challenge of predicting the mean-opinion-score (MOS) with interpretable posterior distributions. The conventional method of giving a single point estimate for MOS lacks interpretability and does not capture the uncertainty inherent in subjective assessments. This thesis introduces DeePMOS, a new framework capable of producing MOS predictions in the form of posterior distributions, which gives a more nuanced and understandable representation of speech quality. DeePMOS adopts a CNN-BLSTM architecture with multiple prediction heads to model Gaussian and Beta posterior distributions. For robust training we use a combination of maximum-likelihood learning, stochastic gradient noise, and a student-teacher learning setup to handle limited and noisy training data. The results show DeePMOS's competitive performance, in particular DeePMOS-B, which achieves state-of-the-art performance at the utterance level. The significance lies in providing accurate predictions together with a measure of confidence, which increases transparency and reliability. This opens up possibilities for applications in areas such as telecommunications and audio-processing systems. Future work could explore additional posterior distributions, evaluate the model on high-quality datasets, and consider including listener-dependent scores.
Speech Quality Assessment; Deep Neural Network; Maximum-Likelihood; Bayesian Estimation; Bedömning av ljudkvalitet; Djup neural nätverk; Maximum-likelihood; Bayesiansk uppskattning; Annan elektroteknik och elektronik; Elektroteknik och elektronik
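The abstract mentions maximum-likelihood learning with a Gaussian prediction head. As a minimal sketch of what that objective generally looks like, the code below computes the Gaussian negative log-likelihood of observed quality scores given a predicted mean and variance. Whether DeePMOS is trained on individual listener scores or on MOS values is not stated in the abstract, so the `scores` input and the function names here are assumptions for illustration only.

```python
import math


def gaussian_nll(score, mean, variance):
    """Negative log-likelihood of one score under N(mean, variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return 0.5 * (math.log(2.0 * math.pi * variance)
                  + (score - mean) ** 2 / variance)


def mean_gaussian_nll(scores, mean, variance):
    """Average negative log-likelihood over several observed scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(gaussian_nll(s, mean, variance) for s in scores) / len(scores)


# Hypothetical ratings of one utterance on a 1-5 scale.
scores = [3.0, 4.0, 4.0, 5.0]
loss_near = mean_gaussian_nll(scores, mean=4.0, variance=0.5)
loss_far = mean_gaussian_nll(scores, mean=2.0, variance=0.5)
```

Minimizing this loss pushes the predicted mean toward the observed scores, while the predicted variance acts as the measure of confidence: a large variance lowers the penalty for a wrong mean but is itself penalized through the logarithm term.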
5	Automatic Speech Quality Assessment in Unified Communication : A Case Study / Automatisk utvärdering av samtalskvalitet inom integrerad kommunikation : en fallstudie
Larsson Alm, Kevin January 2019
Speech as a medium for communication has always been important for its ability to convey our ideas, personality and emotions. It is therefore not surprising that Quality of Experience (QoE) becomes central to any business relying on voice communication. With Unified Communication (UC) systems, users can communicate with each other in several ways using many different devices, which makes QoE an important aspect of such systems. In this thesis, automatic methods for assessing the speech quality of voice calls in Briteback’s UC application are studied, including a comparison of the researched methods. Three methods are studied, all using a Gaussian Mixture Model (GMM) as a regressor, paired with extraction of Human Factor Cepstral Coefficients (HFCC), Gammatone Frequency Cepstral Coefficients (GFCC) and Modified Mel Frequency Cepstrum Coefficients (MMFCC) features respectively. The method based on HFCC feature extraction generally performs better than the two other methods, but all methods perform poorly compared to results reported in the literature. This most likely stems from implementation errors, showing the difference between theory and practice in the literature, together with the lack of reference implementations. Further work with practical aspects in mind, such as reference implementations or verification tools, can make the field more popular and increase its use in the real world.
speech; voice communication; qoe; quality of experience; unified communication; uc; speech quality assessment; speech quality; voice calls; gaussian mixture model; gmm; gaussian mixture regression; gmr; mel frequency cepstrum coefficients; mfcc; human feature cepstrum coefficients; hfcc; gfcc; Software Engineering; Programvaruteknik
