1 |
Speech detection, enhancement and compression for voice communications
Cho, Yong Duk January 2001
Speech signal processing for voice communications can be characterised in terms of silence compression, noise reduction, and speech compression. The limited channel bandwidth of voice communication systems requires efficient compression of speech and silence signals while retaining voice quality. Silence compression by means of both voice activity detection (VAD) and comfort noise generation can deliver transparent speech quality while substantially lowering the transmission bit-rate, since pause regions between talk spurts do not include any voice information. Thus, this thesis proposes a smoothed likelihood ratio-based VAD, designed on the basis of a behavioural analysis and improvement of a statistical model-based voice activity detector. Input speech can be contaminated by noise, which makes voice communication fatiguing and less intelligible. This problem can be alleviated by noise reduction as a preprocessor for speech coding. Noise characteristics in speech enhancement are typically adapted during the pause regions classified by a voice activity detector. However, VAD errors can lead to over- or under-estimation of the noise statistics. Thus, this thesis proposes mixed decision-based noise adaptation based on an integration of soft and hard decision-based methods, defined with the speech presence uncertainty and the VAD result, respectively. In low bit-rate speech coding, the sinusoidal model has been widely applied because it exploits the phase redundancy of speech signals well. Its performance, however, can be severely degraded by mis-estimation of the pitch. Thus, this thesis proposes a robust pitch estimation technique based on the autocorrelation of spectral amplitudes. Another important parameter in sinusoidal speech coding is the spectral magnitude of the LP-residual signal. It is, however, not easy to quantise the magnitudes directly, because the dimension of the spectral vectors varies from frame to frame depending on the pitch. To alleviate this problem, this thesis proposes mel-scale-based dimension conversion, which converts the spectral vectors to a fixed dimension based on mel-scale warping. A predictive coding scheme is also employed in order to exploit the inter-frame redundancy between the spectral vectors. Experimental results show that each proposed technique is suitable for enhancing speech quality for voice communications. Furthermore, an improved speech coder incorporating the proposed techniques is developed. The vocoder gives speech quality comparable to TIA/EIA IS-127 for noisy speech whilst operating at lower than half the bit-rate of the reference coder. Key words: voice activity detection, speech enhancement, pitch, spectral magnitude quantisation, low bit-rate coding.
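The idea of a smoothed likelihood ratio VAD can be illustrated with a minimal sketch. It assumes the usual complex Gaussian statistical model for the per-bin likelihood ratio, a maximum-likelihood a priori SNR estimate, and first-order recursive smoothing in the log domain. The function names, the smoothing factor `kappa`, and the threshold are illustrative assumptions; the thesis's exact formulation may differ.

```python
import math

def log_likelihood_ratio(power, noise_power):
    """Per-bin log likelihood ratio (speech present vs. absent), Gaussian model."""
    gamma = power / noise_power      # a posteriori SNR
    xi = max(gamma - 1.0, 0.0)       # ML a priori SNR estimate (assumption)
    return gamma * xi / (1.0 + xi) - math.log(1.0 + xi)

def smoothed_lr_vad(frames, noise_power, kappa=0.9, threshold=0.5):
    """Return one speech/non-speech decision per frame.

    frames: list of frames, each a list of spectral power values (one per bin).
    noise_power: list of noise power estimates, one per bin.
    kappa and threshold are hypothetical tuning values.
    """
    smoothed = [0.0] * len(noise_power)
    decisions = []
    for frame in frames:
        for k, (p, n) in enumerate(zip(frame, noise_power)):
            # Recursive smoothing of the log likelihood ratio across frames.
            smoothed[k] = kappa * smoothed[k] + (1.0 - kappa) * log_likelihood_ratio(p, n)
        # Frame decision: mean smoothed log ratio against a fixed threshold.
        decisions.append(sum(smoothed) / len(smoothed) > threshold)
    return decisions
```

Because the smoothed ratio decays gradually after a talk spurt ends, the decision stays on "speech" for a few frames, which acts like a built-in hangover.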
|
2 |
Data-Driven Rescaling of Energy Features for Noisy Speech Recognition
Luan, Miau 18 July 2012
In this paper, we investigate the rescaling of energy features for noise-robust speech recognition. The performance of a speech recognition system degrades quickly under the influence of environmental noise, so robustness techniques have long been an important research issue, and many studies have pointed out that the impact of a noisy environment on speech recognition is enormous. We therefore propose data-driven energy feature rescaling (DEFR) to adjust the features. The method is divided into three parts: voice activity detection (VAD), a piecewise log rescaling function, and a parameter searching algorithm. Its purpose is to reduce the difference between noisy and clean speech features. We apply this method to Mel-frequency cepstral coefficients (MFCC) and Teager energy cepstral coefficients (TECC), and compare it with mean subtraction (MS) and mean and variance normalization (MVN). We use the Aurora 2.0 and Aurora 3.0 databases to evaluate performance. The experimental results show that the proposed method can effectively improve recognition accuracy.
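The two baselines named above, MS and MVN, are standard per-utterance normalizations and can be sketched as follows. The function names are illustrative, and the sketch handles a single feature dimension over time. It does not attempt to reproduce the DEFR rescaling function, whose details the abstract does not give.

```python
import math

def mean_subtraction(features):
    """MS: remove the utterance-level mean from one feature trajectory."""
    mean = sum(features) / len(features)
    return [x - mean for x in features]

def mean_variance_normalization(features):
    """MVN: zero mean and unit variance over the utterance."""
    centered = mean_subtraction(features)
    variance = sum(x * x for x in centered) / len(centered)
    std = math.sqrt(variance) or 1.0   # guard against a constant trajectory
    return [x / std for x in centered]
```

In practice each cepstral or energy dimension would be normalized independently over the frames of an utterance.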
|
3 |
Αυτόματος εντοπισμός ομιλίας / Automatic Speech Detection
Θεοδώρου, Θεόδωρος 22 January 2009
The goal of this project is to implement the voice activity detection algorithm based on the ETSI standard. The work is organised into four chapters: an introduction to the speech signal, the ETSI standard, the experimental procedure, and the conclusions. The first chapter covers the basic characteristics of speech, an analysis of formants and pitch, the concept of the Mel scale, and the theoretical rationale of the automatic and adaptive algorithm. The second chapter covers speech processing with a front-end algorithm based on Mel and cepstral parameter extraction, noise reduction based on the Wiener filter, signal processing, and the classification of voiced and unvoiced speech. The third and fourth chapters present the results of the experimental application of the system and the conclusions drawn from comparing it with other voice activity detection algorithms.
|
4 |
Balso signalo aptikimo ir triukšmo pašalinimo algoritmo tyrimas, naudojant aukštesnės eilės statistiką / Voice Activity Detection and Noise Reduction Algorithm Analysis Using Higher-Order Statistics
Makrickaitė, Raimonda 29 May 2006
This work presents a robust algorithm for voice activity detection (VAD) and a noise reduction mechanism using the combined properties of higher-order statistics (HOS), together with an efficient algorithm to estimate the instantaneous signal-to-noise ratio (SNR) of a speech signal in a background of acoustic noise. The flat spectral feature of the Linear Prediction Coding (LPC) residual results in distinct characteristics for the cumulants in terms of phase, periodicity and harmonic content, and yields closed-form expressions for the skewness and kurtosis. The HOS of speech are immune to Gaussian noise, which makes them particularly useful in algorithms designed for low-SNR environments. The proposed algorithm uses HOS and smooth power estimate metrics with second-order measures, such as SNR and LPC prediction error, to identify speech and noise frames. A voicing condition for speech frames is derived from the relation between the skewness and kurtosis of voiced speech and the estimate of smooth noise power. The algorithm is presented and its performance is compared to that of a HOS-only VAD algorithm. The results show that the proposed algorithm has better overall performance, with noticeable improvement in Gaussian-like noises, such as street and garage noise, and across high to low SNR, especially in the probability of correctly detecting speech. The proposed algorithm is replicated on the DSK C6713.
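The two higher-order statistics named above can be computed per frame from sample central moments. The sketch below uses the plain sample estimators, not the closed-form expressions the thesis derives for the LPC residual, and the function names are illustrative. For an ideal Gaussian distribution both quantities are zero, which is why they help separate speech from Gaussian-like noise.

```python
def central_moment(samples, order):
    """Sample central moment of the given order."""
    mean = sum(samples) / len(samples)
    return sum((x - mean) ** order for x in samples) / len(samples)

def skewness(samples):
    """Third standardized moment; 0.0 for a constant frame."""
    m2 = central_moment(samples, 2)
    return central_moment(samples, 3) / m2 ** 1.5 if m2 > 0 else 0.0

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3; 0.0 for a constant frame."""
    m2 = central_moment(samples, 2)
    return central_moment(samples, 4) / m2 ** 2 - 3.0 if m2 > 0 else 0.0
```

A frame-level detector would compare these values, computed on the LPC residual of each frame, against thresholds tied to the noise power estimate.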
|
5 |
Aukštesnių eilių statistika grįsto balso detektavimo algoritmo sudarymas ir tyrimas / Design and Analysis of a Voice Activity Detector Based on Higher-Order Statistics
Duchovskis, Donatas 29 May 2006
This report covers the robust voice activity detection (VAD) algorithm presented in [1]. The algorithm uses higher-order statistics (HOS) metrics of the speech signal in the linear prediction coding (LPC) residual domain to classify the noise and speech frames of a signal. The chapters of this report present the voice activity detection problem and an analysis of environmental issues for VAD, an in-depth analysis of HOS-based and standard algorithms, and a real-time HOS-based voice activity detector model. New improvements to the proposed algorithm (instantaneous SNR estimation, decision smoothing, adaptive thresholds, an artificial neural network) are introduced, and the performance of the improved algorithm is compared with that of standard VAD algorithms.
|
6 |
Recording and automatic detection of African elephant (Loxodonta africana) infrasonic rumbles
Venter, Petrus Jacobus 01 October 2008
The value of studying elephant vocalizations lies in the abundant information that can be retrieved from them. Recordings of elephant rumbles can be used by researchers to determine the size and composition of the herd, as well as the sexual state and emotional condition of an elephant. It is difficult for researchers to obtain large volumes of continuous recordings of elephant vocalizations. Recordings are normally analysed manually to locate rumbles, using the tedious and time-consuming methods of sped-up listening and visual evaluation of spectrograms. The application of speech processing to elephant vocalizations is a largely unexploited resource. The aim of this study was to contribute to the current body of knowledge and resources of elephant research by developing a tool for recording high volumes of continuous acoustic data in harsh natural conditions, and by examining the possibility of applying human speech processing techniques to elephant rumbles to detect them automatically in recordings. The recording tool was designed and implemented as an elephant recording collar with an onboard data storage capacity of 128 gigabytes, enough memory to record sound data continuously for a period of nine months. Data is stored in the wave file format, and the device can navigate and control the FAT32 file system so that the files can be read and downloaded to a personal computer. The collar can also stamp sound files with the time and date, ambient temperature and GPS coordinates. Several options for microphone placement and protection were tested experimentally to find an acceptable solution. A relevant voice activity detection algorithm was chosen as the basis for the automatic detection of infrasonic elephant rumbles. The chosen algorithm is based on a robust pitch determination algorithm that has been experimentally verified to function correctly at a signal-to-noise ratio as low as -8 dB when more than four harmonic structures exist in a sound. The algorithm was modified for use with elephant rumbles and was tested with previously recorded elephant vocalization data. The results suggest that the algorithm can accurately detect elephant rumbles in recordings. The number of false alarms and undetected calls increases when recordings are contaminated with unwanted noise that contains harmonic structures, or when the harmonic nature of a rumble is lost. Data obtained from the recording collar is less prone to contamination than far-field recordings, and the automatic detection algorithm should provide an accurate tool for detecting any rumbles that appear in the recordings. / Dissertation (MEng)--University of Pretoria, 2008. / Electrical, Electronic and Computer Engineering / unrestricted
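The storage figures quoted above imply an average data rate, which a short calculation makes concrete. The sketch assumes decimal gigabytes, 30-day months, and uncompressed 16-bit mono audio; none of these are stated in the abstract, and the function names are illustrative.

```python
def sustained_byte_rate(capacity_bytes, days):
    """Average bytes per second that fill the given capacity in the given time."""
    return capacity_bytes / (days * 24 * 60 * 60)

def max_sample_rate(byte_rate, bytes_per_sample=2, channels=1):
    """Highest sample rate (Hz) sustainable at that byte rate for PCM audio."""
    return byte_rate / (bytes_per_sample * channels)

# Assumed inputs: 128 GB = 128e9 bytes, nine months = 270 days.
rate = sustained_byte_rate(128e9, 270)
print(f"{rate:.0f} bytes/s, {max_sample_rate(rate):.0f} Hz at 16-bit mono")
```

Under these assumptions the budget is a few kilobytes per second, which is consistent with recording low-frequency content such as infrasonic rumbles instead of full-bandwidth audio.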
|
7 |
Optimalizovaná detekce řečové aktivity v prostředí s proměnnými vlastnostmi / Optimized Voice Activity Detection under Varying Environments
Míča, Ivan January 2014
This thesis deals with algorithmic voice activity detection. The impact of adverse conditions on the reliability of detection is analysed, and the main historical and up-to-date approaches to the problem are discussed. Simulations on both synthetic and application-specific labelled speech databases are used to support the theoretical analysis of important VAD methods. Based on the theoretical analysis together with the performance results, an optimization is proposed that is capable of overcoming some limitations of current methods when dealing with variable working conditions.
|
8 |
Improvement of Sound Source Localization for a Binaural Robot of Spherical Head with Pinnae / 耳介付球状頭部を持つ両耳聴ロボットのための音源定位の高性能化
Kim, Ui-Hyun 24 September 2013
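Kyoto University / 0048 / New-system course doctorate / Doctor of Informatics / Degree No. Kō 17928 / Informatics Doctorate No. 510 / 新制||情||90 (University Library) / 30748 / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Prof. Hiroshi Okuno, Prof. Tatsuya Kawahara, Prof. Akihiro Yamamoto / Pursuant to Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM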
|
9 |
Speech Detection Using Gammatone Features And One-class Support Vector Machine
Cooper, Douglas 01 January 2013
A network gateway is a mechanism which provides protocol translation and/or validation of network traffic using the metadata contained in network packets. For media applications such as Voice-over-IP, the portion of the packets containing speech data cannot be verified and can provide a means of maliciously transporting code or sensitive data undetected. One solution to this problem is Voice Activity Detection (VAD). Many VADs rely on time-domain features and simple thresholds for efficient speech detection; however, this reveals little about the signal being passed. More sophisticated methods employ machine learning algorithms, but train on specific noises intended for a target environment. Validating speech under a variety of unknown conditions must be possible, as must differentiating between speech and non-speech data embedded within the packets. A real-time speech detection method is proposed that relies only on a clean speech model for detection. Through the use of Gammatone filter bank processing, the cepstrum and several frequency-domain features are used to train a One-Class Support Vector Machine, which provides a clean-speech model irrespective of environmental noise. A Wiener filter is used to improve operation in harsh noise environments. Greater than 90% detection accuracy is achieved for clean speech, with approximately 70% accuracy for SNR as low as 5 dB.
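A Gammatone filter bank is commonly laid out with centre frequencies spaced evenly on the ERB-rate scale. The sketch below shows that spacing using the Glasberg and Moore ERB formulas; the function names, the frequency range, and the number of filters are illustrative assumptions and are not taken from the thesis.

```python
import math

def hz_to_erb_rate(freq_hz):
    """Map frequency in Hz to the ERB-rate scale (Glasberg and Moore)."""
    return 21.4 * math.log10(4.37 * freq_hz / 1000.0 + 1.0)

def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) * 1000.0 / 4.37

def erb_bandwidth(freq_hz):
    """Equivalent rectangular bandwidth in Hz at the given centre frequency."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)

def gammatone_center_frequencies(low_hz, high_hz, num_filters):
    """Centre frequencies equally spaced in ERB-rate; needs num_filters >= 2."""
    low, high = hz_to_erb_rate(low_hz), hz_to_erb_rate(high_hz)
    step = (high - low) / (num_filters - 1)
    return [erb_rate_to_hz(low + i * step) for i in range(num_filters)]
```

The resulting frequencies are dense at the low end and sparse at the high end, mirroring auditory frequency resolution; each filter's bandwidth would then be set from `erb_bandwidth` at its centre frequency.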
|
10 |
Identificação de atividade de voz baseada em vídeo / Video-Based Voice Activity Identification
Scott, Dario 30 March 2010
Hewlett-Packard Brasil Ltda / There are currently several works taking a wide variety of approaches to digital image processing for voice activity detection (VAD). Their applications span different areas, such as voice commands in vehicles and videoconferencing. The motivation of this work is to build an algorithm that contributes to improving the image processing techniques applied to detecting voice activity in video. The problem has already been addressed by a great diversity of approaches. However, the focus of this work lies in finding alternatives to improve the extraction of a skin and non-skin color model and, from there, to derive a classifier that identifies speech activity more accurately. Existing algorithms for face detection and lip classification were used and improved. Through the creation of patches under the eyes, a model was created to determine the individual characteristics of skin color using the mean and standard deviation of the pixels of the patches and the mouth area. The results are presented based on two approaches.
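A per-person skin color model built from the mean and standard deviation of patch pixels can be sketched as below. The per-channel RGB statistics, the decision rule, the tolerance `k`, and the function names are all illustrative assumptions; the abstract does not specify the color space or the classifier actually used.

```python
import math

def fit_color_model(pixels):
    """Per-channel mean and standard deviation of (r, g, b) patch pixels."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    stds = [math.sqrt(sum((p[c] - means[c]) ** 2 for p in pixels) / n)
            for c in range(3)]
    return means, stds

def is_skin(pixel, model, k=2.5):
    """True if every channel lies within k standard deviations of the mean.

    k is a hypothetical tolerance; a floor of 1.0 on the deviation avoids
    rejecting everything when the patch is nearly uniform.
    """
    means, stds = model
    return all(abs(pixel[c] - means[c]) <= k * max(stds[c], 1.0)
               for c in range(3))
```

Pixels in the mouth region that fall outside the skin model are candidates for lip or open-mouth pixels, and their changing count over frames is one possible cue for speech activity.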
|