1 |
Speech segmentation and speaker diarisation for transcription and translation
Sinclair, Mark January 2016
This dissertation outlines work related to Speech Segmentation (segmenting an audio recording into regions of speech and non-speech) and Speaker Diarization (further segmenting those regions into those pertaining to homogeneous speakers). Knowing not only what was said but also who said it and when has many useful applications. As well as providing a richer level of transcription for speech, we will show how such knowledge can improve Automatic Speech Recognition (ASR) system performance and can also benefit downstream Natural Language Processing (NLP) tasks such as machine translation and punctuation restoration. While segmentation and diarization may appear to be relatively simple tasks to describe, in practice we find that they are very challenging and are, in general, ill-defined problems. Therefore, we first provide a formalisation of each of the problems as the sub-division of speech within acoustic space and time. Here, we see that the task can become very difficult when we want to partition this domain into our target classes of speakers, whilst avoiding other classes that reside in the same space, such as phonemes. We present a theoretical framework for describing and discussing the tasks as well as introducing existing state-of-the-art methods and research. Current Speaker Diarization systems are notoriously sensitive to hyper-parameters and lack robustness across datasets. Therefore, we present a method which uses a series of oracle experiments to expose the limitations of current systems and to identify the system components to which these limitations can be attributed. We also demonstrate how Diarization Error Rate (DER), the dominant error metric in the literature, is not a comprehensive or reliable indicator of overall performance or of error propagation to subsequent downstream tasks. These results inform our subsequent research. We find that, as a precursor to Speaker Diarization, the task of Speech Segmentation is a crucial first step in the system chain.
Current methods typically do not account for the inherent structure of spoken discourse. As such, we explored a novel method which exploits an utterance-duration prior in order to better model the segment distribution of speech. We show how this method improves not only segmentation, but also the performance of subsequent speech recognition, machine translation and speaker diarization systems. Typical ASR transcriptions do not include punctuation and the task of enriching transcriptions with this information is known as ‘punctuation restoration’. The benefit is not only improved readability but also better compatibility with NLP systems that expect sentence-like units, such as conventional machine translation. We show how segmentation and diarization are related tasks that are able to contribute acoustic information that complements existing linguistically based punctuation approaches. There is a growing demand for speech technology applications in the broadcast media domain. This domain presents many new challenges including diverse noise and recording conditions. We show that the capacity of existing GMM-HMM-based speech segmentation systems is limited for such scenarios and present a Deep Neural Network (DNN) based method which offers more robust speech segmentation, resulting in improved speech recognition performance for a television broadcast dataset. Ultimately, we are able to show that speech segmentation is an inherently ill-defined problem for which the solution is highly dependent on the downstream task that it is intended for.
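To make the Diarization Error Rate mentioned in this abstract concrete, the following is a minimal frame-level sketch in Python. It is not the scoring procedure used in the dissertation: it assumes one label per frame, no overlapping speakers, and hypothesis speaker labels that have already been mapped onto the reference labels. The function name and the toy label sequences are hypothetical.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER sketch.

    reference, hypothesis: equal-length lists of per-frame labels, where
    None marks non-speech and any other value is a speaker label.
    Assumes hypothesis labels are already mapped onto reference labels
    and that no frame contains overlapping speakers.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1        # reference speech labelled as non-speech
            elif hyp != ref:
                confusion += 1     # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1       # non-speech labelled as speech
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


# Toy example (hypothetical labels): 5 reference speech frames, one miss,
# one false alarm and one speaker confusion.
reference = ["A", "A", "A", None, "B", "B", None, None]
hypothesis = ["A", "A", None, "A", "B", "A", None, None]
print(diarization_error_rate(reference, hypothesis))
```

Because the three error types are summed into a single ratio, two systems with the same score can fail in quite different ways, which is consistent with the abstract's point that DER alone does not describe how errors propagate downstream.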
|
2 |
Speech detection, enhancement and compression for voice communications
Cho, Yong Duk January 2001
Speech signal processing for voice communications can be characterised in terms of silence compression, noise reduction, and speech compression. The limited channel bandwidth of voice communication systems requires efficient compression of speech and silence signals while retaining the voice quality. Silence compression by means of both voice activity detection (VAD) and comfort noise generation can deliver transparent speech quality while substantially lowering the transmission bit-rate, since pause regions between talk spurts do not include any voice information. Thus, this thesis proposes smoothed likelihood ratio-based VAD, designed on the basis of a behavioural analysis and improvement of a statistical model-based voice activity detector. Input speech may be noisy, which can make voice communication fatiguing and less intelligible. This problem can be alleviated by noise reduction as a preprocessor for speech coding. Noise characteristics in speech enhancement are typically adapted during the pause regions classified by a voice activity detector. However, VAD errors can lead to over- or under-estimation of the noise statistics. Thus, this thesis proposes mixed decision-based noise adaptation based on an integration of soft and hard decision-based methods, defined with the speech presence uncertainty and VAD result, respectively. In low bit-rate speech coding, the sinusoidal model has been widely applied because it exploits the phase redundancy of speech signals well. Its performance, however, can be severely degraded by mis-estimation of the pitch. Thus, this thesis proposes a robust pitch estimation technique based on the autocorrelation of spectral amplitudes. Another important parameter in sinusoidal speech coding is the spectral magnitude of the LP-residual signal.
It is, however, not easy to directly quantise the magnitudes because the dimensions of the spectral vectors are variable from frame to frame depending on the pitch. To alleviate this problem, this thesis proposes mel-scale-based dimension conversion, which converts the spectral vectors to a fixed dimension based on mel-scale warping. A predictive coding scheme is also employed in order to exploit the inter-frame redundancy between the spectral vectors. Experimental results show that each proposed technique is suitable for enhancing speech quality for voice communications. Furthermore, an improved speech coder incorporating the proposed techniques is developed. The vocoder gives speech quality comparable to TIA/EIA IS-127 for noisy speech whilst operating at lower than half the bit-rate of the reference coder. Key words: voice activity detection, speech enhancement, pitch, spectral magnitude quantisation, low bit-rate coding.
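The smoothed likelihood ratio-based VAD named in this abstract can be illustrated with a rough Python sketch. This is one plausible reading, not the thesis's algorithm: it uses the per-bin likelihood ratio of a Gaussian statistical model, a crude maximum-likelihood a priori SNR estimate, and first-order recursive smoothing of the log ratio across frames. The function names, the smoothing factor `alpha` and the `threshold` are assumptions chosen for illustration.

```python
import math


def frame_log_likelihood_ratio(power_spectrum, noise_power):
    """Mean per-bin log likelihood ratio for one frame (Gaussian model)."""
    total = 0.0
    for x_pow, n_pow in zip(power_spectrum, noise_power):
        gamma = x_pow / n_pow          # a posteriori SNR
        xi = max(gamma - 1.0, 0.0)     # crude a priori SNR estimate (assumption)
        total += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
    return total / len(power_spectrum)


def smoothed_lr_vad(frames, noise_power, alpha=0.7, threshold=0.5):
    """Return one boolean speech decision per frame.

    The log likelihood ratio is smoothed recursively across frames before
    thresholding; alpha and threshold are illustrative values only.
    """
    decisions = []
    smoothed = 0.0
    for frame in frames:
        llr = frame_log_likelihood_ratio(frame, noise_power)
        smoothed = alpha * smoothed + (1.0 - alpha) * llr
        decisions.append(smoothed > threshold)
    return decisions


# Toy input: a flat noise floor with two louder "speech" frames in the middle.
noise_power = [1.0, 1.0, 1.0, 1.0]
noise_frame = [1.0, 1.0, 1.0, 1.0]
speech_frame = [10.0, 10.0, 10.0, 10.0]
frames = [noise_frame] * 2 + [speech_frame] * 2 + [noise_frame] * 6
print(smoothed_lr_vad(frames, noise_power))
```

In this toy run the smoothing keeps the decision at "speech" for a few frames after the loud frames end, which shows how smoothing the ratio can reduce clipping at the tail of a talk spurt at the cost of a slower return to silence.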
|
3 |
Vision and language understanding with localized evidence
Xu, Huijuan 16 February 2019
Enabling machines to solve computer vision tasks with natural language components can greatly improve human interaction with computers. In this thesis, we address vision and language tasks with deep learning methods that explicitly localize relevant visual evidence. Spatial evidence localization in images enhances the interpretability of the model, while temporal localization in video is necessary to remove irrelevant content. We apply our methods to various vision and language tasks, including visual question answering, temporal activity detection, dense video captioning and cross-modal retrieval.
First, we tackle the problem of image question answering, which requires the model to predict answers to questions posed about images. We design a memory network with a question-guided spatial attention mechanism which assigns higher weights to regions that are more relevant to the question. The visual evidence used to derive the answer can be shown by visualizing the attention weights in images. We then address the problem of localizing temporal evidence in videos. For most language/vision tasks, only part of the video is relevant to the linguistic component, so we need to detect these relevant events in videos. We propose an end-to-end model for temporal activity detection, which can detect arbitrary-length activities by coordinate regression with respect to anchors and contains a proposal stage to filter out background segments, saving computation time. We further extend activity category detection to event captioning, which can express richer semantic meaning than a class label. This gives rise to the problem of dense video captioning, which involves two sub-problems: localizing distinct events in long video and generating captions for the localized events. We propose an end-to-end hierarchical captioning model with vision and language context modeling in which the captioning training affects the activity localization. Lastly, the task of text-to-clip video retrieval requires one to localize the specified query instead of detecting and captioning all events. We propose a model based on the early fusion of words and visual features, outperforming standard approaches which embed the whole sentence before performing late feature fusion. Furthermore, we use queries to regulate the proposal network to generate query-related proposals.
In conclusion, our proposed visual localization mechanism applies across a variety of vision and language tasks and achieves state-of-the-art results. Together with the inference module, our work can contribute to solving other tasks such as video question answering in future research.
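The "coordinate regression with respect to anchors" used for temporal activity detection can be sketched as follows. This is a minimal Python illustration under an assumed parameterisation (centre shift scaled by anchor length, length scaled exponentially), which is a common convention for anchor-based detectors and not necessarily the one used in the thesis. The function names and the example segments are hypothetical.

```python
import math


def decode_segment(anchor_start, anchor_end, center_offset, length_offset):
    """Turn regression outputs into a (start, end) segment.

    Assumed parameterisation: the centre moves by center_offset anchor
    lengths and the length is scaled by exp(length_offset).
    """
    anchor_length = anchor_end - anchor_start
    anchor_center = anchor_start + 0.5 * anchor_length
    center = anchor_center + center_offset * anchor_length
    length = anchor_length * math.exp(length_offset)
    return center - 0.5 * length, center + 0.5 * length


def temporal_iou(segment_a, segment_b):
    """Intersection over union of two (start, end) segments."""
    overlap = min(segment_a[1], segment_b[1]) - max(segment_a[0], segment_b[0])
    intersection = max(0.0, overlap)
    union = (segment_a[1] - segment_a[0]) + (segment_b[1] - segment_b[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Zero offsets reproduce the anchor; non-zero offsets shift and stretch it.
print(decode_segment(10.0, 20.0, 0.0, 0.0))
print(decode_segment(10.0, 20.0, 0.5, 0.0))
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

Temporal IoU is included because it is the usual way to match a decoded segment against a ground-truth activity, both when assigning training targets to anchors and when filtering background proposals.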
|
4 |
Unsupervised Activity Discovery and Characterization for Sensor-Rich Environments
Hamid, Muhammad Raffay 28 November 2005
This thesis presents an unsupervised method for discovering and analyzing the different kinds of activities in an active environment. Drawing from natural language processing, a novel representation of activities as bags of event n-grams is introduced, in which the global structural information of activities is analyzed using their local event statistics. It is demonstrated how maximal cliques in an undirected edge-weighted graph of activities can be used, in an unsupervised manner, to discover the different activity-classes. Building on work done in computer networks and bio-informatics, it is shown how to characterize these discovered activity-classes from a holistic as well as a by-parts viewpoint. A definition of anomalous activities is formulated, along with a way to detect them based on the difference of an activity instance from each of the discovered activity-classes. Finally, an information-theoretic method to explain the detected anomalies in a human-interpretable form is presented. Results over extensive data-sets, collected from multiple active environments, are presented to show the competence and generalizability of the proposed framework.
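The "bags of event n-grams" representation can be illustrated with a short Python sketch. It only shows how an event sequence is turned into n-gram counts; the clique-based class discovery and anomaly detection described in the abstract are not covered. The function name and the event names are hypothetical.

```python
from collections import Counter


def event_ngram_bag(events, n):
    """Count every run of n consecutive events in an activity."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


# Hypothetical activity: an ordered list of discrete sensor events.
activity = ["enter", "pick_up", "deliver", "enter", "pick_up"]
bag = event_ngram_bag(activity, 2)
for ngram, count in sorted(bag.items()):
    print(ngram, count)
```

Each activity becomes an unordered histogram of short event runs, so two activities can be compared through their local statistics even when their overall lengths differ.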
|
5 |
Data-Driven Rescaling of Energy Features for Noisy Speech Recognition
Luan, Miau 18 July 2012
In this paper, we investigate the rescaling of energy features for noise-robust speech recognition. The performance of a speech recognition system degrades very quickly under the influence of environmental noise, and many studies have pointed out that the impact of a noisy environment on speech recognition is enormous. As a result, robustness techniques have long been an important research issue. We therefore propose data-driven energy features rescaling (DEFR) to adjust the features. The method is divided into three parts: voice activity detection (VAD), a piecewise log rescaling function and a parameter searching algorithm. Its purpose is to reduce the difference between noisy and clean speech features. We apply this method to Mel-frequency cepstral coefficients (MFCC) and Teager energy cepstral coefficients (TECC), and we compare the proposed method with mean subtraction (MS) and mean and variance normalization (MVN). We use the Aurora 2.0 and Aurora 3.0 databases to evaluate the performance. The experimental results show that the proposed method can effectively improve the recognition accuracy.
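The two baselines named in this abstract, mean subtraction (MS) and mean and variance normalization (MVN), are standard and can be sketched in a few lines of Python. The sketch normalises a single feature dimension over one utterance; the DEFR rescaling function itself is not reproduced because the abstract does not define it. The function names and the sample values are illustrative assumptions.

```python
import math


def mean_subtraction(features):
    """Subtract the utterance-level mean from one feature dimension."""
    mean = sum(features) / len(features)
    return [x - mean for x in features]


def mean_variance_normalization(features):
    """Scale one feature dimension to zero mean and unit variance."""
    centered = mean_subtraction(features)
    variance = sum(x * x for x in centered) / len(centered)
    std = math.sqrt(variance)
    if std == 0.0:
        return centered            # constant feature: nothing to scale
    return [x / std for x in centered]


# Hypothetical log-energy values for four frames of one utterance.
log_energy = [2.0, 4.0, 4.0, 6.0]
print(mean_subtraction(log_energy))
print(mean_variance_normalization(log_energy))
```

Both operations remove utterance-level offsets that noise tends to add to energy features, which is why they serve as natural points of comparison for a rescaling method.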
|
6 |
Αυτόματος εντοπισμός ομιλίας / Automatic speech detection
Θεοδώρου, Θεόδωρος 22 January 2009
The goal of this project is to implement the voice activity detection algorithm based on the ETSI standard.
The project is organised into four chapters, covering an introduction to the speech signal, the ETSI standard, the experimental procedure and the conclusions.
The first chapter covers the basic characteristics of speech, an analysis of formants and pitch, the concept of the Mel scale, and the theoretical rationale of automatic, adaptive voice activity detection algorithms.
The second chapter describes speech processing with a front-end algorithm based on Mel and cepstral feature extraction, noise reduction based on the Wiener filter, signal processing, and the classification of voiced and unvoiced speech.
The third and fourth chapters present the results of the experimental evaluation of the system and the conclusions drawn from comparing it with other voice activity detection algorithms.
|
7 |
Balso signalo aptikimo ir triukšmo pašalinimo algoritmo tyrimas, naudojant aukštesnės eilės statistiką / Voice Activity Detection and Noise Reduction Algorithm Analysis using Higher-Order Statistics
Makrickaitė, Raimonda 29 May 2006
This work presents a robust algorithm for voice activity detection (VAD) and noise reduction that combines properties of higher-order statistics (HOS) with an efficient algorithm to estimate the instantaneous Signal-to-Noise Ratio (SNR) of a speech signal in a background of acoustic noise. The flat spectral feature of the Linear Prediction Coding (LPC) residual results in distinct characteristics for the cumulants in terms of phase, periodicity and harmonic content and yields closed-form expressions for the skewness and kurtosis. The HOS of speech are immune to Gaussian noise, which makes them particularly useful in algorithms designed for low-SNR environments. The proposed algorithm uses HOS and smooth power estimate metrics together with second-order measures, such as SNR and LPC prediction error, to identify speech and noise frames. A voicing condition for speech frames is derived based on the relation between the skewness and kurtosis of voiced speech and an estimate of the smooth noise power. The algorithm is presented and its performance is compared to that of a HOS-only VAD algorithm. The results show that the proposed algorithm has better overall performance, with noticeable improvement in Gaussian-like noises, such as street and garage noise, and across high to low SNR, especially for the probability of correctly detecting speech. The proposed algorithm is replicated on the DSK C6713.
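The skewness and kurtosis on which this detector relies can be computed from sample central moments, as in the Python sketch below. It shows only the two statistics, using plain sample estimates over a list of values; the LPC residual computation, the closed-form expressions and the decision logic of the work are not reproduced. The function names and the test signals are hypothetical.

```python
def central_moment(samples, order):
    """Sample central moment of the given order."""
    mean = sum(samples) / len(samples)
    return sum((x - mean) ** order for x in samples) / len(samples)


def skewness(samples):
    """Third standardised moment; zero for a symmetric distribution."""
    m2 = central_moment(samples, 2)
    if m2 == 0.0:
        raise ValueError("skewness is undefined for a constant signal")
    return central_moment(samples, 3) / m2 ** 1.5


def excess_kurtosis(samples):
    """Fourth standardised moment minus 3; zero for a Gaussian."""
    m2 = central_moment(samples, 2)
    if m2 == 0.0:
        raise ValueError("kurtosis is undefined for a constant signal")
    return central_moment(samples, 4) / m2 ** 2 - 3.0


# Hypothetical frames: one symmetric, one with a single large peak.
symmetric_frame = [-1.0, 1.0, -1.0, 1.0]
peaky_frame = [0.0, 0.0, 0.0, 4.0]
print(skewness(symmetric_frame), excess_kurtosis(symmetric_frame))
print(skewness(peaky_frame), excess_kurtosis(peaky_frame))
```

Since a Gaussian distribution has zero skewness and zero excess kurtosis, frames dominated by Gaussian noise yield values near zero for both statistics, whereas a peaky, asymmetric residual does not.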
|
8 |
Aukštesnių eilių statistika grįsto balso detektavimo algoritmo sudarymas ir tyrimas / Design and analysis of voice activity detector based on higher order statistics
Duchovskis, Donatas 29 May 2006
This report covers a robust voice activity detection (VAD) algorithm presented in [1]. The algorithm uses higher order statistics (HOS) metrics of the speech signal in the linear prediction coding (LPC) residual domain to classify the frames of a signal as noise or speech. The chapters of this report present the voice activity detection problem and an analysis of environment issues for VAD, an in-depth analysis of HOS-based and standard algorithms, and a real-time HOS-based voice activity detector model. New improvements to the proposed algorithm (instantaneous SNR estimation, decision smoothing, adaptive thresholds, artificial neural network) are introduced, and performance results of the improved algorithm compared to standard VAD algorithms are presented.
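The abstract lists decision smoothing among the improvements without describing it. One common form of decision smoothing in VAD is a hangover scheme, sketched below in Python purely as an illustration; whether the report uses this scheme is an assumption. The function name and the `hangover_frames` parameter are hypothetical.

```python
def apply_hangover(decisions, hangover_frames):
    """Hold the speech decision for a few frames after speech ends.

    decisions: per-frame booleans from a raw detector.
    hangover_frames: number of frames to keep reporting speech after the
    last raw speech frame (illustrative parameter).
    """
    smoothed = []
    remaining = 0
    for is_speech in decisions:
        if is_speech:
            remaining = hangover_frames   # restart the hold counter
            smoothed.append(True)
        elif remaining > 0:
            remaining -= 1                # still inside the hangover period
            smoothed.append(True)
        else:
            smoothed.append(False)
    return smoothed


raw = [False, True, False, True, False, False, False]
print(apply_hangover(raw, 1))
```

A hangover of this kind bridges short gaps inside a talk spurt and protects weak word endings, at the cost of labelling a few trailing noise frames as speech.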
|
9 |
Recording and automatic detection of African elephant (Loxodonta africana) infrasonic rumbles
Venter, Petrus Jacobus 01 October 2008
The value of studying elephant vocalizations lies in the abundant information that can be retrieved from them. Recordings of elephant rumbles can be used by researchers to determine the size and composition of the herd, as well as the sexual state and emotional condition of an elephant. It is a difficult task for researchers to obtain large volumes of continuous recordings of elephant vocalizations. Recordings are normally analysed manually to identify the location of rumbles via the tedious and time-consuming methods of sped-up listening and the visual evaluation of spectrograms. The application of speech processing to elephant vocalizations is a largely unexploited resource. The aim of this study was to contribute to the current body of knowledge and resources of elephant research by developing a tool for recording high volumes of continuous acoustic data in harsh natural conditions, as well as examining the possibilities of applying human speech processing techniques to elephant rumbles to achieve automatic detection of these rumbles in recordings. The recording tool was designed and implemented as an elephant recording collar that has an onboard data storage capacity of 128 gigabytes, enough memory to record sound data continuously for a period of nine months. Data is stored in the wave file format and the device has the ability to navigate and control the FAT32 file system so that the files can be read and downloaded to a personal computer. The collar also has the ability to stamp sound files with the time and date, ambient temperature and GPS coordinates. Several different options for microphone placement and protection have been tested experimentally to find an acceptable solution. A relevant voice activity detection algorithm was chosen as a base for the automatic detection of infrasonic elephant rumbles.
The chosen algorithm is based on a robust pitch determination algorithm that has been experimentally verified to function correctly under a signal-to-noise ratio as low as -8 dB when more than four harmonic structures exist in a sound. The algorithm was modified to be used for elephant rumbles and was tested with previously recorded elephant vocalization data. The results obtained suggest that the algorithm can accurately detect elephant rumbles from recordings. The number of false alarms and undetected calls increases when recordings are contaminated with unwanted noise that contains harmonic structures or when the harmonic nature of a rumble is lost. Data obtained from the recording collar is less prone to being contaminated than far-field recordings and the automatic detection algorithm should provide an accurate tool for detecting any rumbles that appear in the recordings. / Dissertation (MEng)--University of Pretoria, 2008. / Electrical, Electronic and Computer Engineering / unrestricted
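The storage figures quoted in this abstract (128 gigabytes for nine months of continuous recording) imply an average data rate, which the Python sketch below works out. The abstract does not state the sample format, so the calculation assumes decimal gigabytes, 30-day months, uncompressed 16-bit mono audio and no file-system or metadata overhead. All function names are hypothetical.

```python
def average_byte_rate(capacity_bytes, duration_days):
    """Average bytes per second needed to fill the storage in the given time."""
    return capacity_bytes / (duration_days * 24 * 60 * 60)


def implied_sample_rate(byte_rate, bytes_per_sample=2, channels=1):
    """Sample rate that the byte rate would support for uncompressed audio."""
    return byte_rate / (bytes_per_sample * channels)


capacity = 128 * 10**9    # assumes decimal gigabytes
days = 9 * 30             # assumes 30-day months
rate = average_byte_rate(capacity, days)
print(round(rate), "bytes per second")
print(round(implied_sample_rate(rate)), "samples per second at 16-bit mono")
```

Under these assumptions the budget is about 5.5 kB per second, or a sample rate of roughly 2.7 kHz, which is modest by speech standards but comfortably above what is needed to capture infrasonic rumbles and their lower harmonics.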
|
10 |
Optimalizovaná detekce řečové aktivity v prostředí s proměnnými vlastnostmi / Optimized Voice Activity Detection under Varying Environments
Míča, Ivan January 2014
This thesis deals with the issue of algorithmic voice activity detection. The impact of adverse conditions on the reliability of detection is analysed, and the main historical and up-to-date approaches to this issue are discussed. Simulations on both synthetic and application-specific labeled speech databases are used to support the theoretical analysis of important VAD methods. Based on the theoretical analysis together with the performance results, an optimization is proposed that is capable of overcoming some limitations of the current methods when dealing with variable working conditions.
|