• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 344
  • 40
  • 24
  • 14
  • 10
  • 10
  • 9
  • 9
  • 9
  • 9
  • 9
  • 9
  • 8
  • 4
  • 3
  • Tagged with
  • 508
  • 508
  • 508
  • 181
  • 125
  • 103
  • 90
  • 50
  • 49
  • 44
  • 42
  • 42
  • 42
  • 40
  • 39
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
411

Investigating Prompt Difficulty in an Automatically Scored Speaking Performance Assessment

Cox, Troy L. 14 March 2013 (has links) (PDF)
Speaking assessments for second language learners have traditionally been expensive to administer because of the cost of rating the speech samples. To reduce the cost, many researchers are investigating the potential of using automatic speech recognition (ASR) as a means to score examinee responses to open-ended prompts. This study examined the potential of using ASR timing fluency features to predict speech ratings and the effect of prompt difficulty in that process. A speaking test with ten prompts representing five different intended difficulty levels was administered to 201 subjects. The speech samples obtained were then (a) rated by human raters holistically, (b) rated by human raters analytically at the item level, and (c) scored automatically using PRAAT to calculate ten different ASR timing fluency features. The ratings and scores of the speech samples were analyzed with Rasch measurement to evaluate the functionality of the scales and the separation reliability of the examinees, raters, and items. There were three ASR timed fluency features that best predicted human speaking ratings: speech rate, mean syllables per run, and number of silent pauses. However, only 31% of the score variance was predicted by these features. The significance in this finding is that those fluency features alone likely provide insufficient information to predict human rated speaking ability accurately. Furthermore, neither the item difficulties calculated by the ASR nor those rated analytically by the human raters aligned with the intended item difficulty levels. The misalignment of the human raters with the intended difficulties led to a further analysis that found that it was problematic for raters to use a holistic scale at the item level. However, modifying the holistic scale to a scale that examined if the response to the prompt was at-level resulted in a significant correlation (r = .98, p < .01) between the item difficulties calculated analytically by the human raters and the intended difficulties. This result supports the hypothesis that item prompts are important when it comes to obtaining quality speech samples. As test developers seek to use ASR to score speaking assessments, caution is warranted to ensure that score differences are due to examinee ability and not the prompt composition of the test.
412

Speech Recognition Error Prediction Approaches with Applications to Spoken Language Understanding

Serai, Prashant January 2021 (has links)
No description available.
413

IR-Depth Face Detection and Lip Localization Using Kinect V2

Fong, Katherine Kayan 01 June 2015 (has links) (PDF)
Face recognition and lip localization are two main building blocks in the development of audio visual automatic speech recognition systems (AV-ASR). In many earlier works, face recognition and lip localization were conducted in uniform lighting conditions with simple backgrounds. However, such conditions are seldom the case in real world applications. In this paper, we present an approach to face recognition and lip localization that is invariant to lighting conditions. This is done by employing infrared and depth images captured by the Kinect V2 device. First we present the use of infrared images for face detection. Second, we use the face’s inherent depth information to reduce the search area for the lips by developing a nose point detection. Third, we further reduce the search area by using a depth segmentation algorithm to separate the face from its background. Finally, with the reduced search range, we present a method for lip localization based on depth gradients. Experimental results demonstrated an accuracy of 100% for face detection, and 96% for lip localization.
414

Utveckling av applikation för röststyrning vid inventering

Hall, Melvin January 2023 (has links)
The manufacturing industry has an important role in Sweden's economy and has been producing high quality goods that are exported all over the world for a long time. By using modern technologies such as advanced warehouse systems and digital tools, companies can increase productivity and reduce costs. An example of such modern technology is Automatic Speech Recognition (ASR). Most of the previous research conducted in the field of ASR has focused on analyzing and addressing various kind of problems related to the performance of an ASR-system. Furthermore, there are also a number of studies regarding how ASR has been used in the manufacturing industry, and more specifically, to facilitate order picking. In this work, the use of speech recognition was investigated as a possible method to facilitate and streamline the inventory process. To investigate this, a prototype for a web application has been developed. The application enables a user, through speech recognition, to speech the article number together with the available quantity in the warehouse. Subsequently, the user receives a confirmation both visually and through sound of which the application automatically registers it in the Monitor ERP software. Data has been collected by observing user tests and conducting interviews with indi-viduals who all have some connection to the warehouse at different manufacturing companies. The results indicated that the inventory process could become more ef-fective by using the application. However, some deficiencies were identified during the user tests, which means that the prototype needs further development and increased robustness to be used as a tool during inventory management. / Tillverkningsindustrin är viktig del av Sveriges ekonomi och har under en lång tid producerat högkvalitativa varor som exporteras över hela världen. Genom att använda moderna teknologier som avancerade lagersystem och digitala verktyg, kan företagen öka produktiviteten och minska kostnaderna. Ett exempel på en sådan modern teknologi är Automatic Speech Recognition (ASR). Inom området ASR har en betydande del av den tidigare forskningen ägnats åt att analysera och adressera olika problem relaterade till prestandan hos ett ASR-system. Vidare så finns det även ett antal arbeten kring hur ASR använts inom tillverknings-industrin, och mer specifikt, till att effektivisera orderplockningsprocessen inom in-dustrin. I det här arbetet undersöktes huruvida röststyrning kan användas som ett verktyg för att underlätta samt effektivisera inventeringsprocessen. Detta genomfördes tillsam-mans med företaget Monitor ERP, där en prototyp till en webbapplikation har utvecklats. Prototypen ska med hjälp av röststyrning möjliggöra för en användare att säga artikelnummer samt vilket antal som finns i lagret. Därefter ska användaren få en bekräftelse både visuellt och genom ljud, varav applikationen automatisk registrerar in det till Monitors system. Data har samlats in genom att observera användartester samt utföra intervjuer med personer som alla har någon koppling till lagret på olika tillverkningsföretag. Resultatet visade på att inventeringsprocessen kan bli effektivare genom att utföra inventering med hjälp av applikationen. Däremot upptäcktes en del brister under användartesterna vilket betyder att prototypen behöver utvecklas och bli mer robust för att kunna användas som verktyg under inventering.
415

Automatic Voice Trading Surveillance : Achieving Speech and Named Entity Recognition in Voice Trade Calls Using Language Model Interpolation and Named Entity Abstraction

Sundberg, Martin, Ohlsson, Mikael January 2023 (has links)
This master thesis explores the effectiveness of interpolating a larger generic speech recognition model with smaller domain-specific models to enable transcription of domain-specific conversations. The study uses a corpus within the financial domain collected from the web and processed by abstracting named entities such as financial instruments, numbers, as well as names of people and companies. By substituting each named entity with a tag representing the entity type in the domain-specific corpus, each named entity can be replaced during the hypothesis search by words added to the systems pronunciation dictionary. Thus making instruments and other domain-specific terms a matter of extension by configuration.  A proof-of-concept automatic speech recognition system with the ability to transcribe and extract named entities within the constantly changing domain of voice trading was created. The system achieved a 25.08 Word Error Rate and 0.9091 F1-score using stochastic and neural net based language models. The best configuration proved to be a combination of both stochastic and neural net based domain-specific models interpolated with a generic model. This shows that even though the models were trained using the same corpus, different models learned different aspects of the material. The study was deemed successful by the authors as the Word Error Rate was improved by model interpolation and all but one named entities were found in the test recordings by all configurations. By adjusting the amount of influence the domain-specific models had against the generic model, the results improved the transcription accuracy at the cost of named entity recognition, and vice versa. Ultimately, the choice of configuration depends on the business case and the importance of named entity recognition versus accurate transcriptions.
416

Swedish Language End-to-End Automatic Speech Recognition for Media Monitoring using Deep Learning

Nyblom, Hector January 2022 (has links)
In order to extract relevant information from speech recordings, the general approach is to first convert the audio into transcribed text. The text can then be analysed using well researched methods. NewsMachine AB provides customers with an overview of how they are represented in media by analysing articles in text form. Their plans to scale up their monitoring of publicly available speech recordings was the basis for the thesis. In this thesis I compare three end-to-end Automatic Speech Recognition (ASR) models. I do so in order to find the model that currently works best for transcribing Swedish language radio recordings, considering accuracy and inference speed (computational complexity). The results show that the QuartzNet architecture is the fastest, but pre-trained wav2vec models provided by KBLab on Swedish speech have by far the best accuracy. The KBLab model was used for further fine-tuning on subsets with varying amount of training data from radio recordings. The results show that further fine-tuning the KBLab models on low-resource Swedish speech domains achieves impressive accuracy. With just 5 hours of training data, the result is 11.5% Word Error Rate and 3.8% Character Error Rate. A final model was fine-tuned on all 35 hours of the radio domain dataset, resulting in model achieving 10.4% Word Error Rate and 3.5% Character Error Rate. The thesis presents a complete pipeline able to convert any length of audio into a transcription. Segmentation of audio is performed as a pre-processing step, segmenting the audio based on silence. The silence represents when a sentence stops and a new begins. The audio segments are passed to the final fine-tuned ASR model, and are concatenated for the complete punctuated transcript. This implementation allowed for punctuation, and also timestamping, when sentences occur in the audio. The results show that the complete pipeline performs well on high quality audio recordings. But when introduced to noisy and disruptive audio, there is work needed to achieve optimal performance.
417

Domain Adaptation with N-gram Language Models for Swedish Automatic Speech Recognition : Using text data augmentation to create domain-specific n-gram models for a Swedish open-source wav2vec 2.0 model / Domänanpassning Med N-gram Språkmodeller för Svensk Taligenkänning : Datautökning av text för att skapa domänspecifika n-gram språkmodeller för en öppen svensk wav2vec 2.0 modell

Enzell, Viktor January 2022 (has links)
Automatic Speech Recognition (ASR) enables a wide variety of practical applications. However, many applications have their own domain-specific words, creating a gap between training and test data when used in practice. Domain adaptation can be achieved through model fine-tuning, but it requires domain-specific speech data paired with transcripts, which is labor intensive to produce. Fortunately, the dependence on audio data can be mitigated to a certain extent by incorporating text-based language models during decoding. This thesis explores approaches for creating domain-specific 4-gram models for a Swedish open-source wav2vec 2.0 model. The three main approaches extend a social media corpus with domain-specific data to estimate the models. The first approach utilizes a relatively small set of in-domain text data, and the second approach utilizes machine transcripts from another ASR system. Finally, the third approach utilizes Named Entity Recognition (NER) to find words of the same entity type in a corpus to replace with in-domain words. The 4-gram models are evaluated by the error rate (ERR) of recognizing in-domain words in a custom dataset. Additionally, the models are evaluated by the Word Error Rate (WER) on the Common Voice test set to ensure good overall performance. Compared to not having a language model, the base model improves the WER on Common Voice by 2.55 percentage points and the in-domain ERR by 6.11 percentage points. Next, adding in-domain text to the base model results in a 2.61 WER improvement and a 10.38 ERR improvement over not having a language model. Finally, adding in-domain machine transcripts and using the NER approach results in the same 10.38 ERR improvement as adding in-domain text but slightly less significant WER improvements of 2.56 and 2.47, respectively. These results contribute to the exploration of state-of-the-art Swedish ASR and have the potential to enable the adoption of open-source ASR models for more use cases. / Automatisk taligenkänning (ASR) möjliggör en mängd olika praktiska tillämpningar. Men många tillämpningsområden har sin egen uppsättning domänspecifika ord vilket kan skapa problem när en taligenkänningsmodell används på data som skiljer sig från träningsdatan. Taligenkänningsmodeller kan anpassas till nya domäner genom fortsatt träning med taldata, men det kräver tillgång till domänspecifik taldata med tillhörande transkript, vilket är arbetskrävande att producera. Lyckligtvis kan beroendet av ljuddata mildras till viss del genom användande av textbaserade språkmodeller tillsammans med taligenkänningsmodellerna. Detta examensarbete utforskar tillvägagångssätt för att skapa domänspecifika 4-gram-språkmodeller för en svensk wav2vec 2.0-modell som tränats av Kungliga Biblioteket. Utöver en basmodell så används tre huvudsakliga tillvägagångssätt för att utöka en korpus med domänspecifik data att träna modellerna från. Det första tillvägagångssättet använder en relativt liten mängd domänspecifik textdata, och det andra tillvägagångssättet använder transkript från ett annat ASR-system (maskintranskript). Slutligen använder det tredje tillvägagångssättet Named Entity Recognition (NER) för att hitta ord av samma entitetstyp i en korpus som sedan ersätts med domänspecifika ord. Språkmodellerna utvärderas med ett nytt domänspecifikt evalueringsdataset samt på testdelen av Common Voice datasetet. Jämfört med att inte ha en språkmodell förbättrar basmodellen Word Error Rate (WER) på Common Voice med 2,55 procentenheter och Error Rate (ERR) inom domänen med 6,11 procentenheter. Att lägga till domänspecifik text till basmodellens korpus resulterar i en 2,61 WER-förbättringochen10,38 ERR-förbättring jämfört med att inte ha en språkmodell. Slutligen, att lägga till domänspecifika maskintranskript och att använda NER-metoden resulterar i samma 10.38 ERR-förbättringar som att lägga till domänspecifik text men något mindre WER-förbättringar på 2.56 respektive 2.47 procentenheter. Den här studien bidrar till svensk ASR och kan möjliggöra användandet av öppna taligenkänningsmodeller för fler användningsområden.
418

Multichannel audio processing for speaker localization, separation and enhancement

Martí Guerola, Amparo 29 October 2013 (has links)
This thesis is related to the field of acoustic signal processing and its applications to emerging communication environments. Acoustic signal processing is a very wide research area covering the design of signal processing algorithms involving one or several acoustic signals to perform a given task, such as locating the sound source that originated the acquired signals, improving their signal to noise ratio, separating signals of interest from a set of interfering sources or recognizing the type of source and the content of the message. Among the above tasks, Sound Source localization (SSL) and Automatic Speech Recognition (ASR) have been specially addressed in this thesis. In fact, the localization of sound sources in a room has received a lot of attention in the last decades. Most real-word microphone array applications require the localization of one or more active sound sources in adverse environments (low signal-to-noise ratio and high reverberation). Some of these applications are teleconferencing systems, video-gaming, autonomous robots, remote surveillance, hands-free speech acquisition, etc. Indeed, performing robust sound source localization under high noise and reverberation is a very challenging task. One of the most well-known algorithms for source localization in noisy and reverberant environments is the Steered Response Power - Phase Transform (SRP-PHAT) algorithm, which constitutes the baseline framework for the contributions proposed in this thesis. Another challenge in the design of SSL algorithms is to achieve real-time performance and high localization accuracy with a reasonable number of microphones and limited computational resources. Although the SRP-PHAT algorithm has been shown to be an effective localization algorithm for real-world environments, its practical implementation is usually based on a costly fine grid-search procedure, making the computational cost of the method a real issue. In this context, several modifications and optimizations have been proposed to improve its performance and applicability. An effective strategy that extends the conventional SRP-PHAT functional is presented in this thesis. This approach performs a full exploration of the sampled space rather than computing the SRP at discrete spatial positions, increasing its robustness and allowing for a coarser spatial grid that reduces the computational cost required in a practical implementation with a small hardware cost (reduced number of microphones). This strategy allows to implement real-time applications based on location information, such as automatic camera steering or the detection of speech/non-speech fragments in advanced videoconferencing systems. As stated before, besides the contributions related to SSL, this thesis is also related to the field of ASR. This technology allows a computer or electronic device to identify the words spoken by a person so that the message can be stored or processed in a useful way. ASR is used on a day-to-day basis in a number of applications and services such as natural human-machine interfaces, dictation systems, electronic translators and automatic information desks. However, there are still some challenges to be solved. A major problem in ASR is to recognize people speaking in a room by using distant microphones. In distant-speech recognition, the microphone does not only receive the direct path signal, but also delayed replicas as a result of multi-path propagation. Moreover, there are multiple situations in teleconferencing meetings when multiple speakers talk simultaneously. In this context, when multiple speaker signals are present, Sound Source Separation (SSS) methods can be successfully employed to improve ASR performance in multi-source scenarios. This is the motivation behind the training method for multiple talk situations proposed in this thesis. This training, which is based on a robust transformed model constructed from separated speech in diverse acoustic environments, makes use of a SSS method as a speech enhancement stage that suppresses the unwanted interferences. The combination of source separation and this specific training has been explored and evaluated under different acoustical conditions, leading to improvements of up to a 35% in ASR performance. / Martí Guerola, A. (2013). Multichannel audio processing for speaker localization, separation and enhancement [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/33101
419

Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing

Granell Romero, Emilio 01 September 2017 (has links)
Natural Language Processing (NLP) is an interdisciplinary research field of Computer Science, Linguistics, and Pattern Recognition that studies, among others, the use of human natural languages in Human-Computer Interaction (HCI). Most of NLP research tasks can be applied for solving real-world problems. This is the case of natural language recognition and natural language translation, that can be used for building automatic systems for document transcription and document translation. Regarding digitalised handwritten text documents, transcription is used to obtain an easy digital access to the contents, since simple image digitalisation only provides, in most cases, search by image and not by linguistic contents (keywords, expressions, syntactic or semantic categories). Transcription is even more important in historical manuscripts, since most of these documents are unique and the preservation of their contents is crucial for cultural and historical reasons. The transcription of historical manuscripts is usually done by paleographers, who are experts on ancient script and vocabulary. Recently, Handwritten Text Recognition (HTR) has become a common tool for assisting paleographers in their task, by providing a draft transcription that they may amend with more or less sophisticated methods. This draft transcription is useful when it presents an error rate low enough to make the amending process more comfortable than a complete transcription from scratch. Thus, obtaining a draft transcription with an acceptable low error rate is crucial to have this NLP technology incorporated into the transcription process. The work described in this thesis is focused on the improvement of the draft transcription offered by an HTR system, with the aim of reducing the effort made by paleographers for obtaining the actual transcription on digitalised historical manuscripts. This problem is faced from three different, but complementary, scenarios: · Multimodality: The use of HTR systems allow paleographers to speed up the manual transcription process, since they are able to correct on a draft transcription. Another alternative is to obtain the draft transcription by dictating the contents to an Automatic Speech Recognition (ASR) system. When both sources (image and speech) are available, a multimodal combination is possible and an iterative process can be used in order to refine the final hypothesis. · Interactivity: The use of assistive technologies in the transcription process allows one to reduce the time and human effort required for obtaining the actual transcription, given that the assistive system and the palaeographer cooperate to generate a perfect transcription. Multimodal feedback can be used to provide the assistive system with additional sources of information by using signals that represent the whole same sequence of words to transcribe (e.g. a text image, and the speech of the dictation of the contents of this text image), or that represent just a word or character to correct (e.g. an on-line handwritten word). · Crowdsourcing: Open distributed collaboration emerges as a powerful tool for massive transcription at a relatively low cost, since the paleographer supervision effort may be dramatically reduced. Multimodal combination allows one to use the speech dictation of handwritten text lines in a multimodal crowdsourcing platform, where collaborators may provide their speech by using their own mobile device instead of using desktop or laptop computers, which makes it possible to recruit more collaborators. / El Procesamiento del Lenguaje Natural (PLN) es un campo de investigación interdisciplinar de las Ciencias de la Computación, Lingüística y Reconocimiento de Patrones que estudia, entre otros, el uso del lenguaje natural humano en la interacción Hombre-Máquina. La mayoría de las tareas de investigación del PLN se pueden aplicar para resolver problemas del mundo real. Este es el caso del reconocimiento y la traducción del lenguaje natural, que se pueden utilizar para construir sistemas automáticos para la transcripción y traducción de documentos. En cuanto a los documentos manuscritos digitalizados, la transcripción se utiliza para facilitar el acceso digital a los contenidos, ya que la simple digitalización de imágenes sólo proporciona, en la mayoría de los casos, la búsqueda por imagen y no por contenidos lingüísticos. La transcripción es aún más importante en el caso de los manuscritos históricos, ya que la mayoría de estos documentos son únicos y la preservación de su contenido es crucial por razones culturales e históricas. La transcripción de manuscritos históricos suele ser realizada por paleógrafos, que son personas expertas en escritura y vocabulario antiguos. Recientemente, los sistemas de Reconocimiento de Escritura (RES) se han convertido en una herramienta común para ayudar a los paleógrafos en su tarea, la cual proporciona un borrador de la transcripción que los paleógrafos pueden corregir con métodos más o menos sofisticados. Este borrador de transcripción es útil cuando presenta una tasa de error suficientemente reducida para que el proceso de corrección sea más cómodo que una completa transcripción desde cero. Por lo tanto, la obtención de un borrador de transcripción con una baja tasa de error es crucial para que esta tecnología de PLN sea incorporada en el proceso de transcripción. El trabajo descrito en esta tesis se centra en la mejora del borrador de transcripción ofrecido por un sistema RES, con el objetivo de reducir el esfuerzo realizado por los paleógrafos para obtener la transcripción de manuscritos históricos digitalizados. Este problema se enfrenta a partir de tres escenarios diferentes, pero complementarios: · Multimodalidad: El uso de sistemas RES permite a los paleógrafos acelerar el proceso de transcripción manual, ya que son capaces de corregir en un borrador de la transcripción. Otra alternativa es obtener el borrador de la transcripción dictando el contenido a un sistema de Reconocimiento Automático de Habla. Cuando ambas fuentes están disponibles, una combinación multimodal de las mismas es posible y se puede realizar un proceso iterativo para refinar la hipótesis final. · Interactividad: El uso de tecnologías asistenciales en el proceso de transcripción permite reducir el tiempo y el esfuerzo humano requeridos para obtener la transcripción correcta, gracias a la cooperación entre el sistema asistencial y el paleógrafo para obtener la transcripción perfecta. La realimentación multimodal se puede utilizar en el sistema asistencial para proporcionar otras fuentes de información adicionales con señales que representen la misma secuencia de palabras a transcribir (por ejemplo, una imagen de texto, o la señal de habla del dictado del contenido de dicha imagen de texto), o señales que representen sólo una palabra o carácter a corregir (por ejemplo, una palabra manuscrita mediante una pantalla táctil). · Crowdsourcing: La colaboración distribuida y abierta surge como una poderosa herramienta para la transcripción masiva a un costo relativamente bajo, ya que el esfuerzo de supervisión de los paleógrafos puede ser drásticamente reducido. La combinación multimodal permite utilizar el dictado del contenido de líneas de texto manuscrito en una plataforma de crowdsourcing multimodal, donde los colaboradores pueden proporcionar las muestras de habla utilizando su propio dispositivo móvil en lugar de usar ordenadores, / El Processament del Llenguatge Natural (PLN) és un camp de recerca interdisciplinar de les Ciències de la Computació, la Lingüística i el Reconeixement de Patrons que estudia, entre d'altres, l'ús del llenguatge natural humà en la interacció Home-Màquina. La majoria de les tasques de recerca del PLN es poden aplicar per resoldre problemes del món real. Aquest és el cas del reconeixement i la traducció del llenguatge natural, que es poden utilitzar per construir sistemes automàtics per a la transcripció i traducció de documents. Quant als documents manuscrits digitalitzats, la transcripció s'utilitza per facilitar l'accés digital als continguts, ja que la simple digitalització d'imatges només proporciona, en la majoria dels casos, la cerca per imatge i no per continguts lingüístics (paraules clau, expressions, categories sintàctiques o semàntiques). La transcripció és encara més important en el cas dels manuscrits històrics, ja que la majoria d'aquests documents són únics i la preservació del seu contingut és crucial per raons culturals i històriques. La transcripció de manuscrits històrics sol ser realitzada per paleògrafs, els quals són persones expertes en escriptura i vocabulari antics. Recentment, els sistemes de Reconeixement d'Escriptura (RES) s'han convertit en una eina comuna per ajudar els paleògrafs en la seua tasca, la qual proporciona un esborrany de la transcripció que els paleògrafs poden esmenar amb mètodes més o menys sofisticats. Aquest esborrany de transcripció és útil quan presenta una taxa d'error prou reduïda perquè el procés de correcció siga més còmode que una completa transcripció des de zero. Per tant, l'obtenció d'un esborrany de transcripció amb un baixa taxa d'error és crucial perquè aquesta tecnologia del PLN siga incorporada en el procés de transcripció. El treball descrit en aquesta tesi se centra en la millora de l'esborrany de la transcripció ofert per un sistema RES, amb l'objectiu de reduir l'esforç realitzat pels paleògrafs per obtenir la transcripció de manuscrits històrics digitalitzats. Aquest problema s'enfronta a partir de tres escenaris diferents, però complementaris: · Multimodalitat: L'ús de sistemes RES permet als paleògrafs accelerar el procés de transcripció manual, ja que són capaços de corregir un esborrany de la transcripció. Una altra alternativa és obtenir l'esborrany de la transcripció dictant el contingut a un sistema de Reconeixement Automàtic de la Parla. Quan les dues fonts (imatge i parla) estan disponibles, una combinació multimodal és possible i es pot realitzar un procés iteratiu per refinar la hipòtesi final. · Interactivitat: L'ús de tecnologies assistencials en el procés de transcripció permet reduir el temps i l'esforç humà requerits per obtenir la transcripció real, gràcies a la cooperació entre el sistema assistencial i el paleògraf per obtenir la transcripció perfecta. La realimentació multimodal es pot utilitzar en el sistema assistencial per proporcionar fonts d'informació addicionals amb senyals que representen la mateixa seqüencia de paraules a transcriure (per exemple, una imatge de text, o el senyal de parla del dictat del contingut d'aquesta imatge de text), o senyals que representen només una paraula o caràcter a corregir (per exemple, una paraula manuscrita mitjançant una pantalla tàctil). · Crowdsourcing: La col·laboració distribuïda i oberta sorgeix com una poderosa eina per a la transcripció massiva a un cost relativament baix, ja que l'esforç de supervisió dels paleògrafs pot ser reduït dràsticament. La combinació multimodal permet utilitzar el dictat del contingut de línies de text manuscrit en una plataforma de crowdsourcing multimodal, on els col·laboradors poden proporcionar les mostres de parla utilitzant el seu propi dispositiu mòbil en lloc d'utilitzar ordinadors d'escriptori o portàtils, la qual cosa permet ampliar el nombr / Granell Romero, E. (2017). Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86137
420

Measuring, refining and calibrating speaker and language information extracted from speech

Brummer, Niko 12 1900 (has links)
Thesis (PhD (Electrical and Electronic Engineering))--University of Stellenbosch, 2010. / ENGLISH ABSTRACT: We propose a new methodology, based on proper scoring rules, for the evaluation of the goodness of pattern recognizers with probabilistic outputs. The recognizers of interest take an input, known to belong to one of a discrete set of classes, and output a calibrated likelihood for each class. This is a generalization of the traditional use of proper scoring rules to evaluate the goodness of probability distributions. A recognizer with outputs in well-calibrated probability distribution form can be applied to make cost-effective Bayes decisions over a range of applications, having di fferent cost functions. A recognizer with likelihood output can additionally be employed for a wide range of prior distributions for the to-be-recognized classes. We use automatic speaker recognition and automatic spoken language recognition as prototypes of this type of pattern recognizer. The traditional evaluation methods in these fields, as represented by the series of NIST Speaker and Language Recognition Evaluations, evaluate hard decisions made by the recognizers. This makes these recognizers cost-and-prior-dependent. The proposed methodology generalizes that of the NIST evaluations, allowing for the evaluation of recognizers which are intended to be usefully applied over a wide range of applications, having variable priors and costs. The proposal includes a family of evaluation criteria, where each member of the family is formed by a proper scoring rule. We emphasize two members of this family: (i) A non-strict scoring rule, directly representing error-rate at a given prior. (ii) The strict logarithmic scoring rule which represents information content, or which equivalently represents summarized error-rate, or expected cost, over a wide range of applications. We further show how to form a family of secondary evaluation criteria, which by contrasting with the primary criteria, form an analysis of the goodness of calibration of the recognizers likelihoods. Finally, we show how to use the logarithmic scoring rule as an objective function for the discriminative training of fusion and calibration of speaker and language recognizers. / AFRIKAANSE OPSOMMING: Ons wys hoe om die onsekerheid in die uittree van outomatiese sprekerherkenning- en taalherkenningstelsels voor te stel, te meet, te kalibreer en te optimeer. Dit maak die bestaande tegnologie akkurater, doeltre ender en meer algemeen toepasbaar.

Page generated in 0.1112 seconds