41

Rodilý mluvčí jako učitel angličtiny / A native speaker as an EFL teacher

Ledvinka, Miroslav January 2013 (has links)
Introduction: The principal aim of this thesis is to determine the role and significance of native speaker teachers of English in the teaching process, as well as to define the expectations of their students and employers. The status of native speaker teachers in the Czech Republic is contrasted with the position of non-native speaker teachers. The core of this study lies in the analytical part, which attempts to delimit the characteristics of the integration of native English-speaker teachers into the Czech education system. Theoretical part: The theoretical chapter presents a concise summary of the theoretical terms and concepts, both historical and contemporary, related to the topic of native English-speaker teachers. Apart from the traditional survey of topics discussed in various authoritative publications and journals, the theoretical overview also includes a schematic outline of the historical development of the status of native speaker teachers with respect to the social, political, and economic factors which played a major role in shaping native speakers' position in the education process and in society as a whole. In addition, the theoretical chapter traces the contribution of...
42

Diarization, Localization and Indexing of Meeting Archives

Vajaria, Himanshu 21 February 2008 (has links)
This dissertation documents research on the topics of localization, diarization and indexing in meeting archives. It surveys existing work in these areas, identifies opportunities for improvement and proposes novel solutions for each of these problems. The framework resulting from this dissertation enables various kinds of queries, such as identifying the participants of a meeting, finding all meetings for a particular participant, locating a particular individual in the video and finding all instances of speech from a particular individual. Also, since the proposed solutions are computationally efficient, require no training and use little domain knowledge, they can be easily ported to other domains of multimedia analysis. Speaker diarization involves determining the number of distinct speakers in an audio recording and identifying when each of them spoke. We propose novel solutions for the segmentation and clustering sub-tasks, based on graph spectral clustering. The resulting system yields a diarization error rate of around 20%, a relative improvement of 16% over the currently popular diarization technique based on hierarchical clustering. The most significant contribution of this work lies in performing speaker localization using only a single camera and a single microphone by exploiting long-term audio-visual co-occurrence. Our novel computational model allows identifying regions in the image belonging to the speaker even when the speaker's face is non-frontal and even when the speaker is only partially visible. This approach results in a hit ratio of 73.8%, compared to 52.6% for a mutual-information (MI) based approach, which illustrates its suitability in the meeting domain. The third problem addresses indexing meeting archives to enable retrieving all segments from the archive during which a particular individual speaks, in a query-by-example framework. By performing audio-visual association and clustering, a target cluster is generated per individual, containing multiple multimodal samples for that individual to which a query sample is matched. The use of multiple samples results in a retrieval precision of 92.6% at 90% recall, compared to a precision of 71% at the same recall achieved by a unimodal, single-sample system.
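
The clustering idea in this abstract can be illustrated with a toy version. Below is a minimal sketch of graph spectral clustering over per-segment feature vectors, assuming a Gaussian-kernel affinity and a known speaker count; it is not the dissertation's actual pipeline, and all parameter choices are illustrative.

```python
# Minimal sketch of spectral clustering for diarization (illustrative only).
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_diarize(segment_feats, n_speakers, sigma=1.0):
    """Cluster fixed-length segment feature vectors into speakers.

    segment_feats: (n_segments, dim) array, one row per speech segment.
    n_speakers:    assumed known here; real systems must estimate it.
    """
    # Gaussian-kernel affinity between every pair of segments.
    d2 = ((segment_feats[:, None, :] - segment_feats[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)

    # Normalized graph Laplacian: L = I - D^(-1/2) A D^(-1/2).
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-10))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

    # Embed segments with the eigenvectors of the smallest eigenvalues,
    # then group them with k-means; each cluster is one speaker.
    _, vecs = eigh(L)
    embedding = vecs[:, :n_speakers]
    return KMeans(n_clusters=n_speakers, n_init=10).fit_predict(embedding)
```
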
43

Text-Independent Speaker Recognition Using Source Based Features

Wildermoth, Brett Richard, n/a January 2001 (has links)
The speech signal is primarily meant to carry the linguistic message, but it also contains speaker-specific information. It is generated by acoustically exciting the cavities of the mouth and nose, and can be used to recognize (identify/verify) a person. This thesis deals with the speaker identification task; i.e., finding the identity of a person from a group of persons already enrolled during the training phase, using his/her speech. Listeners use many audible cues in identifying speakers, ranging from high-level cues such as the semantics and linguistics of the speech to low-level cues relating to the speaker's vocal tract and voice source characteristics. Generally, the vocal tract characteristics are modeled in modern-day speaker identification systems by cepstral coefficients. Although these coefficients are good at representing vocal tract information, they can be supplemented by pitch and voicing information. Pitch provides very important and useful information for identifying speakers, but it is rarely used in current speaker recognition systems because it cannot be reliably extracted and is not always present in the speech signal. In this thesis, an attempt is made to utilize this pitch and voicing information for speaker identification. The thesis illustrates, through the use of a text-independent speaker identification system, the reasonable performance of the cepstral coefficients, achieving an identification error of 6%. Using pitch as a feature in a straightforward manner results in identification errors in the range of 86% to 94%, which is not very helpful. There are two main reasons why the direct use of pitch as a feature does not work for speaker recognition. First, speech is not always periodic; only about half of the frames are voiced, so pitch cannot be estimated for the unvoiced half, and the problem is how to account for pitch information for the unvoiced frames during the recognition phase. Second, pitch estimation methods are not very reliable: they classify some frames as unvoiced when they are actually voiced, and they make pitch estimation errors such as doubling or halving the pitch value, depending on the method. In order to use pitch information for speaker recognition, we have to overcome these problems. We need a method that does not use the pitch value directly as a feature and that works reliably for voiced as well as unvoiced frames. We propose a method that uses the autocorrelation function of the given frame to derive pitch-related features, which we call maximum autocorrelation value (MACV) features. These features can be extracted for voiced as well as unvoiced frames and do not suffer from pitch-doubling or pitch-halving estimation errors. Using these MACV features along with the cepstral features, the speaker identification performance is improved by 45%.
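
As a rough illustration of the MACV idea described above: take the normalized autocorrelation of a frame and keep the peak value in each of several sub-ranges of the plausible pitch-lag region. The sketch below is an assumption-laden reading of the abstract, not the thesis' exact definition; the sampling rate, band count, and pitch range are illustrative.

```python
# Sketch of MACV-style features (parameters are illustrative guesses).
import numpy as np

def macv_features(frame, fs=8000, n_bands=5, min_f0=50, max_f0=400):
    """Peak normalized autocorrelation in each of n_bands sub-ranges of
    the pitch-lag region. Works for voiced and unvoiced frames alike.
    frame: 1-D samples; must be longer than fs/min_f0 (e.g. 30 ms at 8 kHz).
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-10)                    # normalize so ac[0] == 1

    lo, hi = int(fs / max_f0), int(fs / min_f0)  # pitch-lag search range
    edges = np.linspace(lo, hi, n_bands + 1).astype(int)
    # One feature per band: the maximum autocorrelation value in that band.
    return np.array([ac[a:b].max() for a, b in zip(edges[:-1], edges[1:])])
```
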
44

Confidence Measures for Speech/Speaker Recognition and Applications on Turkish LVCSR

Mengusoglu, Erhan 24 May 2004 (has links)
Confidence measures for the results of speech/speaker recognition make such systems more useful in real-time applications. A confidence measure provides a test statistic for accepting or rejecting the recognition hypothesis of a speech/speaker recognition system. Speech/speaker recognition systems are usually based on statistical modeling techniques, and in this thesis we define confidence measures for the statistical modeling techniques used in such systems. For speech recognition, we tested the available confidence measures and a newly defined confidence measure based on acoustic prior information under two error-causing conditions: out-of-vocabulary words and the presence of additive noise. We show that the newly defined confidence measure performs better in both tests. A review of speech recognition and speaker recognition techniques and some related statistical methods is given throughout the thesis. We also define a new interpretation technique for confidence measures, based on the Fisher transformation of likelihood ratios obtained in speaker verification. The transformation provides a linearly interpretable confidence level which can be used directly in real-time applications such as dialog management. We also tested the confidence measures for speaker verification systems and evaluated their efficiency for the adaptation of speaker models, showing that using confidence measures to select adaptation data improves the accuracy of the speaker model adaptation process. Another contribution of this thesis is the preparation of a phonetically rich continuous speech database for the Turkish language. The database is used to develop an HMM/MLP hybrid speech recognition system for Turkish. Experiments on the test sets of the database show that the speech recognition system achieves good accuracy for long speech sequences, while performance is lower for short words, as is the case for current speech recognition systems for other languages. A new language modeling technique for Turkish, applicable to other agglutinative languages, is also introduced; performance evaluations show that it outperforms the classical n-gram language modeling technique.
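
The Fisher transformation mentioned above maps a bounded score onto an unbounded, approximately linear scale. The sketch below applies the transform to a bounded similarity score; the thesis applies it to likelihood ratios, and its exact mapping from likelihood ratio to a bounded score is not reproduced here, so treat this as a generic illustration.

```python
# Generic sketch: Fisher transformation of a bounded verification score.
import numpy as np

def fisher_confidence(score):
    """score: bounded similarity in (-1, 1), e.g. a cosine-style score."""
    score = np.clip(score, -0.999999, 0.999999)    # keep the log finite
    return 0.5 * np.log((1 + score) / (1 - score))

# Scores 0.2 vs 0.4 and 0.7 vs 0.9 differ by the same raw margin, but the
# transformed scale stretches the high-confidence end.
print(fisher_confidence(np.array([0.2, 0.4, 0.7, 0.9])))
```
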
45

Detection and handling of overlapping speech for speaker diarization

Zelenák, Martin 31 January 2012 (has links)
For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings, compared to other domains, exhibit increased complexity due to the spontaneity of speech, reverberation effects, and the presence of overlapping speech, i.e., situations when two or more speakers are speaking simultaneously. In meeting data, a substantial portion of the errors of conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can lead to corrupt single-speaker models and thus to a worse segmentation. This thesis concerns the detection of overlapping speech segments and its application to the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or a multi-layer perceptron. In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps: first, a ranking according to the mRMR criterion is obtained, and then a standard hill-climbing wrapper approach is applied to determine the optimal number of features. The novel spatial and prosodic parameters are used in combination with spectral-based features suggested previously in the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site. In speaker diarization, a second speaker label is assigned to segments with detected speaker overlap, and such segments are also excluded from model training. The proposed overlap labeling technique is integrated into the Viterbi decoding step of the diarization algorithm. During system development it was discovered that it is favorable to optimize overlap exclusion and labeling independently with respect to the overlap detection system. We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show DER improvement on the RT '09 meeting recordings as well. Adding beamforming and a TDOA feature stream to the baseline diarization system, aimed at improving the clustering process, results in slightly higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the overlap exclusion behavior reveals large contrasts in improvement between individual meeting recordings, as well as between various settings of the overlap detection operating point. However, high performance variability across recordings is also typical of the baseline diarization system without any overlap handling.
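
One concrete example of a spatial cross-correlation cue for a distant microphone pair is GCC-PHAT: a low or unstable correlation peak is one plausible indicator that more than one speaker is active. The thesis' three specific parameters are not reproduced here; this is a generic sketch.

```python
# Generic GCC-PHAT sketch for one distant microphone pair.
import numpy as np

def gcc_phat(x, y, n_fft=2048):
    """Phase-transform cross-correlation between two microphone channels.
    Returns (peak height, lag in samples); tracking the peak height over
    time is one possible overlap cue."""
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(y, n_fft)
    cross = X * np.conj(Y)
    # PHAT weighting: keep only phase information.
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-10), n_fft)
    cc = np.fft.fftshift(cc)
    peak = int(np.argmax(np.abs(cc)))
    return float(np.abs(cc[peak])), peak - n_fft // 2
```
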
46

Robustní rozpoznávání mluvčího pomocí neuronových sítí / Robust Speaker Verification with Deep Neural Networks

Profant, Ján January 2019 (has links)
The objective of this work is to study state-of-the-art speaker verification systems based on deep neural networks, known as x-vectors, under various conditions, such as wideband and narrowband data, and to develop a system that is robust to unseen languages, specific noise, or speech codecs. The system takes a variable-length audio recording and maps it into a fixed-length embedding which is then used to represent the speaker. We compared our systems to BUT's submission to the Speakers in the Wild Speaker Recognition Challenge (SITW) from 2016, which used the previously popular statistical models known as i-vectors. We observed that, comparing single best systems, the recently published x-vectors achieved a more than 4.38 times lower equal error rate on the SITW core-core condition than BUT's SITW submission. Moreover, we find that diarization substantially reduces the error rate when there are multiple speakers on the SITW core-multi condition, but we could not observe the same trend on the NIST SRE 2018 VAST data.
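
For orientation, the core of an x-vector extractor is a stack of frame-level layers followed by a statistics-pooling step that converts a variable-length feature sequence into one fixed-length embedding. The PyTorch sketch below is a much smaller stand-in for the published architecture; all layer sizes are illustrative.

```python
# Tiny illustrative x-vector-style extractor (not the published topology).
import torch
import torch.nn as nn

class TinyXVector(nn.Module):
    def __init__(self, feat_dim=24, embed_dim=128):
        super().__init__()
        # Frame-level TDNN layers, i.e. 1-D convolutions over time.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 256, embed_dim)

    def forward(self, feats):              # feats: (batch, feat_dim, time)
        h = self.frame_layers(feats)
        # Statistics pooling: mean and std over time -> fixed-length vector.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.embedding(stats)       # the speaker embedding

# Two recordings are then typically compared with a cosine or PLDA score
# between their embeddings.
```
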
47

Rozpoznávání mluvčího ve Skype hovorech / Speaker Recognition in Skype Calls

Kaňok, Tomáš January 2011 (has links)
This diploma thesis is concerned with machine identification and verification of speakers, its theory and applications. It evaluates an existing implementation of the subject by the Speech@FIT group and examines plugins for the Skype program. A working plugin that enables speaker identification is then proposed, implemented, and evaluated. The thesis closes with suggestions for future development.
48

Detecting Speakers in Video Footage

Williams, Michael 01 April 2018 (has links) (PDF)
Facial recognition is a powerful tool for identifying people visually. Yet when the end goal is more specific than merely identifying the person in a picture, problems can arise. Speaker identification is one such task, expecting more predictive power from a facial recognition system than it can provide on its own. Speaker identification is the task of identifying who is speaking in a video, not simply who is present in it. This extra requirement introduces numerous false positives into the facial recognition system, largely due to one main scenario: the person speaking is not on camera. This paper investigates a solution to this problem by incorporating information from a new system which indicates whether or not the person on camera is speaking. This information can then be combined with an existing facial recognition system to boost its predictive capabilities. We propose a speaker detection system to visually detect when someone in a given video is speaking. The system relies strictly on visual information and does not use audio. By relying strictly on visual information to detect when someone is speaking, the system can be synced with an existing facial recognition system to extend its predictive power. We use a two-stream convolutional neural network to accomplish the speaker detection. The neural network is trained and tested using data extracted from Digital Democracy's large database of transcribed political hearings [4]. We show that the system is capable of accurately detecting when someone on camera is speaking, with an accuracy of 87% on a dataset of legislators. Furthermore, we demonstrate how this information can benefit a facial recognition system whose end goal is identifying the speaker. The system increased the precision of an existing facial recognition system by up to 5%, at the cost of a large drop in recall.
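
For orientation, a two-stream network of the kind described pairs a spatial stream (a face crop) with a temporal stream (stacked optical-flow frames capturing mouth motion) and fuses them for a speaking/not-speaking decision. The sketch below is illustrative, not the thesis' exact configuration.

```python
# Illustrative two-stream CNN for visual speaker detection.
import torch
import torch.nn as nn

def small_cnn(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class TwoStreamSpeakerDetector(nn.Module):
    def __init__(self, flow_frames=10):
        super().__init__()
        self.spatial = small_cnn(3)                 # RGB face crop
        self.temporal = small_cnn(2 * flow_frames)  # x/y flow per frame
        self.head = nn.Linear(64 + 64, 2)           # speaking / not speaking

    def forward(self, rgb, flow):
        # Late fusion: concatenate both streams' features, then classify.
        fused = torch.cat([self.spatial(rgb), self.temporal(flow)], dim=1)
        return self.head(fused)
```
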
49

How do voiceprints age?

Nachesa, Maya Konstantinovna January 2023 (has links)
Voiceprints, like fingerprints, are a biometric. Where fingerprints record a person's unique pattern on their finger, voiceprints record what a person's voice "sounds like", abstracting away from what the person said. They have been used in speaker recognition, both verification and identification; in other words, to ask "is this speaker who they say they are?" or "who is this speaker?", respectively. However, people age, and so do their voices. Do voiceprints age, too? That is, can a person's voice change enough that after a while the original voiceprint can no longer be used to identify them? In this thesis, I use Swedish audio recordings of debate speeches from Riksdagen (the Swedish parliament) to test this idea. The answer influences how well we can search the database for previously unmarked speeches. I find that speaker verification performance decreases as the age gap between voiceprints increases, and that it decreases more strongly after roughly five years. Grouping the speakers into five-year age bands, I found that speaker verification performs best for speakers whose initial voiceprint was recorded between 29 and 33 years of age. Longer input speech also provides higher-quality voiceprints, with performance improvements stagnating once voiceprints exceed 30 seconds. Finally, voiceprints for men age more strongly than those for women after roughly five years. I also investigated how emotions are encoded in voiceprints, since this could potentially impede speaker recognition. I found that it is possible to train a classifier to recognise emotions from voiceprints, and that this classifier does better when recognising emotions from known speakers. That is, emotions are encoded more characteristically per speaker than per emotion itself, and as such are unlikely to interfere with speaker recognition.
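
The kind of measurement described above can be sketched as follows: score same-speaker and different-speaker voiceprint pairs with cosine similarity, then compute the equal error rate (EER) separately for each age gap. The helpers below assume voiceprints are plain vectors; the data layout is hypothetical.

```python
# Sketch of per-age-gap verification scoring (hypothetical data layout).
import numpy as np

def cosine(a, b):
    """Cosine similarity between two voiceprint vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def eer(target_scores, impostor_scores):
    """Equal error rate: sweep thresholds and return the rate where false
    rejections and false acceptances (approximately) coincide."""
    ths = np.sort(np.concatenate([target_scores, impostor_scores]))
    fr = np.array([(target_scores < t).mean() for t in ths])
    fa = np.array([(impostor_scores >= t).mean() for t in ths])
    i = int(np.argmin(np.abs(fr - fa)))
    return (fr[i] + fa[i]) / 2

# Computing eer() over pair sets bucketed by the years between recordings
# would reproduce the aging curve described in the abstract.
```
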
50

The Reality of This and That

Kelly-Lopez, Catherine Ann 09 June 2005 (has links)
No description available.
