1

Implementation and Evaluation of P.880 Methodology

Imam, Hasani Syed Hassan, January 2009
Continuous Evaluation of Time Varying Speech Quality (CETVSQ) is a method for the subjective assessment of transmitted speech quality in long speech sequences whose quality fluctuates over time. The method is based on two subjective tasks: first, rating the speech quality continuously while listening, and second, rating the overall speech quality after listening to the sequence. The development of CETVSQ was motivated by the fact that speech quality degradations are often not constant but vary over time. In modern IP telephony and wireless networks, speech quality varies due to impairments such as packet loss, echo, and handovers between networks. Several standard methods already exist for the subjective assessment of short speech sequences, but methods such as ITU-T Rec. P.800 are well suited only to time-constant speech quality. In this thesis, the CETVSQ methodology was implemented so that long speech sequences can be assessed. An analog hardware slider is used for the continuous quality ratings as well as for the overall quality judgments, and both the instantaneous and overall judgments are saved to an Excel file. The stored results are analyzed with different statistical measures. In the evaluation part of the thesis, the subjects' scores are analyzed statistically to identify several factors inherent in the CETVSQ methodology. A subjective test had already been conducted according to the P.800 ACR method, in which the long speech sequences were divided into 8-second segments and rated with P.800 ACR. In this study, the same long speech sequences are assessed with the CETVSQ methodology, and the P.800 ACR and CETVSQ results are compared. It is shown that when long speech sequences are divided into short segments and evaluated with P.800 ACR, the P.800 ACR results differ from those obtained with the CETVSQ methodology, which demonstrates the need for the CETVSQ methodology.
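To make the final comparison step concrete, the following is a minimal sketch, not taken from the thesis, of how time-averaged CETVSQ slider ratings could be compared with per-segment P.800 ACR scores. The file names, CSV layout, and slider export format are assumptions; only the 8-second segment length comes from the abstract.

```python
# Illustrative sketch only -- not code from the thesis.  It compares
# time-averaged CETVSQ slider ratings against per-segment P.800 ACR MOS.
# File names, the CSV layout and the slider export format are assumptions;
# only the 8-second segment length comes from the abstract.
import csv
import numpy as np

SEGMENT_SECONDS = 8.0  # length used when the long sequences were split for ACR

def load_slider_ratings(path):
    """Load (time_in_seconds, rating) pairs from a two-column CSV file."""
    times, ratings = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            times.append(float(row[0]))
            ratings.append(float(row[1]))
    return np.array(times), np.array(ratings)

def segment_means(times, ratings, segment_s=SEGMENT_SECONDS):
    """Average the continuous slider ratings inside each segment."""
    n_segments = int(np.ceil(times.max() / segment_s))
    means = []
    for i in range(n_segments):
        mask = (times >= i * segment_s) & (times < (i + 1) * segment_s)
        means.append(ratings[mask].mean() if mask.any() else np.nan)
    return np.array(means)

times, ratings = load_slider_ratings("cetvsq_slider.csv")   # hypothetical export
acr_mos = np.loadtxt("p800_acr_segment_mos.csv")            # hypothetical ACR scores
cetvsq_mos = segment_means(times, ratings)

n = min(len(acr_mos), len(cetvsq_mos))
r = np.corrcoef(acr_mos[:n], cetvsq_mos[:n])[0, 1]
print(f"Correlation between segment ACR MOS and time-averaged CETVSQ: {r:.3f}")
```

In a setup like this, a weak correlation between the two columns would reflect exactly the kind of discrepancy between segment-wise ACR ratings and continuous ratings that the thesis reports.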
2

Identification and Classification of TTS Intelligibility Errors Using ASR : A Method for Automatic Evaluation of Speech Intelligibility / Identifiering och klassifiering av fel relaterade till begriplighet inom talsyntes. : Ett förslag på en metod för automatisk utvärdering av begriplighet av tal.

Henriksson, Erik, January 2023
In recent years, applications using synthesized speech have become more numerous and publicly available. As the area grows, so does the need for delivering high-quality, intelligible speech, and consequently the need for effective methods of assessing the intelligibility of synthesized speech. The common approach of evaluating speech with human listeners has the disadvantages of being costly and time-consuming. Because of this, alternative methods of evaluating speech automatically, using automatic speech recognition (ASR) models, have been introduced. This thesis presents an evaluation system that analyses the intelligibility of synthesized speech using automatic speech recognition and attempts to identify and categorize the intelligibility errors present in the speech. The system is evaluated in two experiments: the first uses publicly available sentences and corresponding synthesized speech, and the second uses publicly available models to synthesize speech for evaluation. Additionally, a survey is conducted in which human transcriptions are used instead of automatic speech recognition, and the resulting intelligibility evaluations are compared with those based on automatic speech recognition transcripts. Results show that the system can be used to evaluate the intelligibility of a model as well as to identify and classify intelligibility errors. A combination of automatic speech recognition models can lead to more robust and reliable evaluations, and reference human recordings can be used to further increase confidence. The evaluation scores correlate well with human evaluations, and certain automatic speech recognition models show a stronger correlation with human evaluations than others. This research shows that automatic speech recognition can be used to produce a reliable and detailed analysis of text-to-speech intelligibility, which has the potential to make text-to-speech (TTS) improvements more efficient and to allow better text-to-speech models to be delivered at a faster rate.
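The core of such an ASR-based evaluation is comparing the reference text with the ASR transcript of the synthesized audio. Below is a minimal sketch, not the thesis implementation, that aligns the two word sequences with an edit-distance table and labels each mismatch as a substitution, deletion, or insertion; these generic WER-style categories stand in for whatever error taxonomy the thesis defines.

```python
# Illustrative sketch only -- not the thesis implementation.  It aligns a
# reference sentence with an ASR transcript of synthesized speech and labels
# word-level mismatches; the three error categories are generic WER-style
# classes, not necessarily the taxonomy used in the thesis.
def align_words(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    # Backtrack through the table to recover the error labels.
    errors, i, j = [], len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            errors.append(("substitution", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            errors.append(("deletion", ref[i - 1], None))
            i -= 1
        else:
            errors.append(("insertion", None, hyp[j - 1]))
            j -= 1
    return list(reversed(errors))

# Hypothetical example: reference text vs. an ASR transcript of the TTS audio.
print(align_words("the quick brown fox", "the quick down fox box"))
```

On the hypothetical example this prints one substitution ("brown" heard as "down") and one inserted word ("box"), which is the kind of per-word detail an intelligibility report can aggregate per model.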
3

LaMOSNet: Latent Mean-Opinion-Score Network for Non-intrusive Speech Quality Assessment : Deep Neural Network for MOS Prediction / LaMOSNet: Latent Mean-Opinion-Score Network för icke-intrusiv ljudkvalitetsbedömning : Djupt neuralt nätverk för MOS prediktion

Cumlin, Fredrik, January 2022
Objective non-intrusive speech quality assessment, which aims to emulate and correlate with human judgement, has received increasing attention over the years. It is a difficult problem for three reasons: data scarcity, noisy human judgement, and a potentially uneven distribution of bias in mean opinion scores (MOS). In this thesis, we introduce the Latent Mean-Opinion-Score Network (LaMOSNet), which leverages individual judges' scores to increase the effective data size, together with new ideas for dealing with both noisy and biased labels. We introduce a methodology called Optimistic Judge Estimation as a transparent way to reduce bias in MOS. We also implement stochastic gradient noise and mean teacher, ideas from noisy image classification, to further handle noisy labels and an uneven bias distribution. We achieve competitive results on VCC2018 when modeling MOS, and state-of-the-art results when modeling only listener-dependent scores.
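Of the ideas named above, the mean teacher is the easiest to illustrate in isolation. The following is a minimal PyTorch sketch, not the LaMOSNet code, of a teacher model that tracks an exponential moving average of a student trained on per-listener scores; the network size, the EMA decay of 0.999, and the random placeholder data are assumptions.

```python
# Illustrative sketch only -- a generic mean-teacher weight update of the kind
# the abstract refers to, not the LaMOSNet code.  Model size, EMA decay and
# the toy per-listener data are assumptions.
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 1))
teacher = copy.deepcopy(student)          # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)               # teacher is never trained directly

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
ema_decay = 0.999

def ema_update(teacher, student, decay):
    """Exponential moving average: teacher <- decay*teacher + (1-decay)*student."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

# Toy training step on random "per-listener" MOS labels (placeholders).
features = torch.randn(16, 40)                       # e.g. utterance-level features
listener_scores = torch.randint(1, 6, (16, 1)).float()

pred = student(features)
loss = nn.functional.mse_loss(pred, listener_scores)
optimizer.zero_grad()
loss.backward()
optimizer.step()
ema_update(teacher, student, ema_decay)   # teacher tracks a smoothed student
```

The appeal of the EMA teacher is that it averages out the noise of individual gradient steps, which is what makes it attractive when the labels themselves are noisy listener ratings.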
4

Automatic Speech Quality Assessment in Unified Communication : A Case Study / Automatisk utvärdering av samtalskvalitet inom integrerad kommunikation : en fallstudie

Larsson Alm, Kevin, January 2019
Speech as a medium for communication has always been important in its ability to convey our ideas, personality, and emotions. It is therefore no surprise that Quality of Experience (QoE) is central to any business relying on voice communication. With Unified Communication (UC) systems, users can communicate with each other in several ways using many different devices, making QoE an important aspect of such systems. In this thesis, automatic methods for assessing the speech quality of voice calls in Briteback's UC application are studied and compared. Three methods are examined, all using a Gaussian Mixture Model (GMM) as a regressor, paired with the extraction of Human Factor Cepstral Coefficients (HFCC), Gammatone Frequency Cepstral Coefficients (GFCC), and Modified Mel Frequency Cepstrum Coefficients (MMFCC) features, respectively. The method based on HFCC features generally performs better than the other two, but all three show comparatively low performance relative to the literature. This most likely stems from implementation errors, illustrating the gap between theory and practice in the literature, together with the lack of reference implementations. Further work with practical aspects in mind, such as reference implementations or verification tools, can make the field more popular and increase its use in the real world.
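As a rough illustration of the "cepstral features plus GMM regressor" pattern described above, the sketch below fits a joint Gaussian mixture over feature-plus-MOS vectors and predicts MOS as the conditional mean. This is one common way to use a GMM as a regressor, not necessarily the formulation used in the thesis, and the random 13-dimensional features and 4 mixture components are placeholders for real HFCC, GFCC, or MMFCC features.

```python
# Illustrative sketch only -- not the thesis implementation.  It shows the
# general pattern of "cepstral features + GMM used as a regressor" by fitting
# a joint GMM over [feature_vector, MOS] and predicting MOS as the conditional
# mean.  The random data, 13-dimensional features and 4 components are
# placeholders; the thesis uses HFCC/GFCC/MMFCC features, not random vectors.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_utts, n_dims = 200, 13                    # e.g. utterance-averaged cepstra
X = rng.normal(size=(n_utts, n_dims))       # placeholder feature vectors
y = 3.0 + X[:, 0] + 0.1 * rng.normal(size=n_utts)   # placeholder MOS labels

joint = np.hstack([X, y[:, None]])          # train the GMM on [features, MOS]
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(joint)

def predict_mos(x):
    """Conditional mean of MOS given a feature vector under the joint GMM."""
    d = x.shape[0]
    cond_means, marg_likes = [], []
    for k in range(gmm.n_components):
        mu, cov = gmm.means_[k], gmm.covariances_[k]
        mu_x, mu_y = mu[:d], mu[d]
        s_xx, s_xy = cov[:d, :d], cov[:d, d]
        # Per-component regression line and responsibility for this x.
        cond_means.append(mu_y + s_xy @ np.linalg.solve(s_xx, x - mu_x))
        marg_likes.append(gmm.weights_[k] *
                          multivariate_normal.pdf(x, mean=mu_x, cov=s_xx))
    w = np.array(marg_likes)
    w = w / w.sum()
    return float(np.dot(w, cond_means))

print(predict_mos(X[0]))                    # predicted MOS for the first utterance
```

In practice the feature vectors would be utterance-level statistics of the cepstral coefficients, and the number of mixture components would be tuned on held-out calls.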
