  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Multichannel audio processing for speaker localization, separation and enhancement

Martí Guerola, Amparo 29 October 2013
This thesis concerns acoustic signal processing and its applications to emerging communication environments. Acoustic signal processing is a very wide research area covering the design of signal processing algorithms that use one or several acoustic signals to perform a given task, such as locating the sound source that originated the acquired signals, improving their signal-to-noise ratio, separating signals of interest from a set of interfering sources, or recognizing the type of source and the content of the message. Among these tasks, Sound Source Localization (SSL) and Automatic Speech Recognition (ASR) are specifically addressed in this thesis. The localization of sound sources in a room has received a lot of attention in recent decades. Most real-world microphone array applications require the localization of one or more active sound sources in adverse environments (low signal-to-noise ratio and high reverberation). Such applications include teleconferencing systems, video gaming, autonomous robots, remote surveillance and hands-free speech acquisition. Performing robust sound source localization under high noise and reverberation is a very challenging task. One of the best-known algorithms for source localization in noisy and reverberant environments is the Steered Response Power - Phase Transform (SRP-PHAT) algorithm, which constitutes the baseline framework for the contributions proposed in this thesis. Another challenge in the design of SSL algorithms is to achieve real-time performance and high localization accuracy with a reasonable number of microphones and limited computational resources. Although the SRP-PHAT algorithm has been shown to be an effective localization algorithm for real-world environments, its practical implementation is usually based on a costly fine grid-search procedure, making the computational cost of the method a real issue.
In this context, several modifications and optimizations have been proposed to improve its performance and applicability. This thesis presents an effective strategy that extends the conventional SRP-PHAT functional. The approach performs a full exploration of the sampled space rather than computing the SRP at discrete spatial positions, which increases robustness and allows a coarser spatial grid, reducing the computational cost of a practical implementation at a small hardware cost (a reduced number of microphones). This strategy makes it possible to implement real-time applications based on location information, such as automatic camera steering or the detection of speech/non-speech fragments in advanced videoconferencing systems. Besides the contributions related to SSL, this thesis also concerns the field of ASR. This technology allows a computer or electronic device to identify the words spoken by a person so that the message can be stored or processed in a useful way. ASR is used on a day-to-day basis in a number of applications and services such as natural human-machine interfaces, dictation systems, electronic translators and automatic information desks. However, some challenges remain. A major problem in ASR is recognizing people speaking in a room using distant microphones. In distant-speech recognition, the microphone receives not only the direct-path signal but also delayed replicas resulting from multi-path propagation. Moreover, in teleconferencing meetings there are many situations in which multiple speakers talk simultaneously. When multiple speaker signals are present, Sound Source Separation (SSS) methods can be successfully employed to improve ASR performance. This is the motivation behind the training method for multiple-talker situations proposed in this thesis.
This training, which is based on a robust transformed model constructed from separated speech in diverse acoustic environments, uses an SSS method as a speech enhancement stage that suppresses the unwanted interference. The combination of source separation and this specific training has been explored and evaluated under different acoustical conditions, leading to improvements of up to 35% in ASR performance. / Martí Guerola, A. (2013). Multichannel audio processing for speaker localization, separation and enhancement [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/33101
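The grid search that the abstract describes as costly can be illustrated with a minimal sketch of the conventional SRP-PHAT functional. It is not the thesis's extended functional; the function names, the toy geometry, the sampling rate and the noise source are all assumptions made for illustration. For every candidate grid point, the PHAT-weighted cross-correlation of each microphone pair is read at the time difference of arrival that point would produce, and the values are summed.

```python
import numpy as np

def gcc_phat(x1, x2, n_fft):
    """PHAT-weighted cross-correlation of x1 and x2, indexed by circular lag."""
    cross = np.fft.rfft(x1, n_fft) * np.conj(np.fft.rfft(x2, n_fft))
    cross /= np.abs(cross) + 1e-12  # phase transform: discard magnitude, keep phase
    return np.fft.irfft(cross, n_fft)

def srp_phat(signals, mic_pos, grid, fs, c=343.0):
    """Conventional SRP-PHAT: one steered-power value per candidate grid point.

    signals: (M, N) microphone signals, mic_pos: (M, D), grid: (G, D).
    """
    n_mics, n_samples = signals.shape
    n_fft = 2 * n_samples  # zero-pad so that lags do not wrap around
    power = np.zeros(len(grid))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cc = gcc_phat(signals[i], signals[j], n_fft)
            # Expected lag (in samples) of mic i relative to mic j for each point.
            d_i = np.linalg.norm(grid - mic_pos[i], axis=1)
            d_j = np.linalg.norm(grid - mic_pos[j], axis=1)
            lags = np.rint((d_i - d_j) / c * fs).astype(int)
            power += cc[lags % n_fft]
    return power

# Toy demo: every value below is an illustrative assumption.
fs, c = 34300.0, 343.0  # chosen so that one sample corresponds to 1 cm of travel
mics = np.array([[0.3, 0.4], [-0.5, 1.2], [0.8, -0.6], [-1.2, -0.9]])
source = np.array([0.0, 0.0])
rng = np.random.default_rng(0)
noise = rng.standard_normal(4096 + 200)
delays = np.rint(np.linalg.norm(mics - source, axis=1) / c * fs).astype(int)
# Each microphone records the same noise burst, delayed by its propagation time.
signals = np.stack([noise[200 - d: 200 - d + 4096] for d in delays])
axis = np.linspace(-0.4, 0.4, 5)
grid = np.array([[x, y] for x in axis for y in axis])
power = srp_phat(signals, mics, grid, fs, c)
best = grid[np.argmax(power)]
print("estimated source position:", best)
```

The cost grows with the number of grid points times the number of microphone pairs, which is why a coarser grid matters. Because this sketch samples the correlation at a single lag per point, it can miss a source that falls between grid points, which is the weakness a coarse grid exposes in the conventional method.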
12

PERFORMANCE ANALYSIS OF SRCP IMAGE BASED SOUND SOURCE DETECTION ALGORITHMS

Nalavolu, Praveen Reddy 01 January 2010
Steered Response Power based algorithms are widely used for finding sound source locations with microphone array systems. SRCP-PHAT is one such algorithm that performs robustly under noisy and reverberant conditions. The algorithm creates a likelihood function over the field of view. This thesis applies image processing methods to SRCP-PHAT images, exploiting the difference in power levels and pixel patterns to discriminate between sound source and background pixels. Hough Transform based ellipse detection is used to identify the sound source locations by finding the centers of elliptical edge pixel regions typical of source patterns. Monte Carlo simulations of an eight-microphone perimeter array with single and multiple sound sources are used to simulate the test environment, and the area under the receiver operating characteristic (ROCA) curve is used to analyze the algorithm's performance. Performance was compared to a simpler algorithm involving Canny edge detection and image averaging, and to an algorithm based simply on the magnitude of local maxima in the SRCP image. Analysis shows that the Canny edge detection based method performed better in the presence of coherent noise sources.
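As a rough illustration of the performance measure named above, the area under the ROC curve can be computed directly from detection scores. The sketch below uses the rank-based (Mann-Whitney) formulation; the function name and the example scores are hypothetical and are not taken from the thesis.

```python
import numpy as np

def roc_auc(scores_source, scores_background):
    """Area under the ROC curve from two sets of detector scores.

    Equals the probability that a randomly chosen source pixel scores higher
    than a randomly chosen background pixel (ties count as one half).
    """
    pos = np.asarray(scores_source, dtype=float)[:, None]
    neg = np.asarray(scores_background, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)

# Hypothetical scores: 8 of the 9 source/background pairs are ranked correctly.
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
print(auc)
```

A value of 1.0 means the detector separates source pixels from background pixels perfectly, and 0.5 means it does no better than chance.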
13

Acoustic Beamforming : Design and Development of Steered Response Power With Phase Transformation (SRP-PHAT).

Dey, Ajoy Kumar, Saha, Susmita January 2011
Acoustic sound source localization using signal processing is required to estimate the direction from which a particular acoustic source signal arrives, and it is also important for finding a solution for hands-free communication. Video conferencing and hands-free communication are applications that require acoustic sound source localization. These applications need a robust algorithm that can reliably localize and position the acoustic sound sources. The Steered Response Power Phase Transform (SRP-PHAT) is an important and robust algorithm for localizing acoustic sound sources. However, the algorithm has a high computational complexity, which makes it unsuitable for real-time applications. This thesis describes the implementation of the SRP-PHAT algorithm as a function of source type, reverberation level and ambient noise. Its main objective is to present different approaches to SRP-PHAT in order to verify the algorithm in terms of acoustic environment, microphone array configuration, acoustic source position, and levels of reverberation and noise.
14

Best PAL : Ball Exercise Sound Tracking PAL / Best PAL : Ljudlokaliserande smart bollplank för individuell fotbollsträning

Hellberg, Joakim, Sundkvist, Axel January 2018
The PAL (Practise and Learn) Original is a ball board consisting of three wooden boards placed in a triangle, developed to train football players' passing ability and first touch. The former Swedish international footballer Jessica Landström observed that these ball boards could, if improved, help footballers develop even more skills while practising alone. Landström's idea was to put lamps on top of the ball boards that light up when a certain ball board expects to receive a pass. This would force the player to look up instead of looking at the ball and hence improve their vision. We concluded that verbal communication is also important in football, so our objective became to develop the simple PAL Original into a ball board that rotates one face towards a sound source. We wanted to achieve this without modifying the PAL Original's construction. To carry out the idea we needed to estimate the angle between a sound source and a face of the ball board, rotate the ball board with an electric motor, communicate wirelessly between units and detect a ball hit when the ball board receives a pass. The final prototype consists of two systems, one performing the sound source localization and rotation and the other performing the ball hit detection and wireless communication. The first system uses the time difference of arrival (TDOA) of incoming sound at three sound sensors to calculate an angle, which in turn is communicated to a DC motor that performs the rotation. The other system combines an LED that lights up when a pass is expected, an accelerometer to detect a pass, and radio transceivers for wireless communication. When at least three of these devices are used, a randomizing algorithm decides which one should light its LED next when the first one detects a pass.
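The abstract does not give the formula used to turn a TDOA into an angle, so the following is only a sketch of the standard far-field relation for a single pair of sensors, sin(θ) = c·τ/d, where τ is the time difference, d the sensor spacing and c the speed of sound. The function name and the 0.2 m spacing are assumptions.

```python
import math

def tdoa_to_angle(tau, mic_distance, c=343.0):
    """Far-field angle of arrival in degrees, measured from the broadside direction.

    tau: time difference of arrival between the two sensors, in seconds.
    mic_distance: spacing between the two sensors, in metres (assumed value).
    """
    ratio = c * tau / mic_distance
    ratio = max(-1.0, min(1.0, ratio))  # guard against noise pushing |ratio| above 1
    return math.degrees(math.asin(ratio))

# Hypothetical spacing of 0.2 m; a path difference of 0.1 m gives sin(theta) = 0.5.
angle = tdoa_to_angle(0.1 / 343.0, 0.2)
print(round(angle, 3))
```

A single pair cannot tell front from back, which is one plausible reason for using three sensors: combining the angles from two or more pairs resolves the ambiguity over the full circle.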
16

PERFORMANCE IMPROVEMENT OF MULTICHANNEL AUDIO BY GRAPHICS PROCESSING UNITS

Belloch Rodríguez, José Antonio 06 October 2014
Multichannel acoustic signal processing has undergone major development in recent years due to the increased complexity of current audio processing applications. People want to collaborate through communication with the feeling of being together and sharing the same environment, which is what Immersive Audio Schemes aim to provide. Several acoustic effects are involved: 3D spatial sound, room compensation, crosstalk cancellation and sound source localization, among others. However, high computing capacity is required to achieve any of these effects in a real large-scale system, which represents a considerable limitation for real-time applications. The increase in computational capacity has historically been linked to the number of transistors in a chip. Nowadays, however, improvements in computational capacity come mainly from increasing the number of processing units, i.e. expanding parallelism in computing. This is the case for Graphics Processing Units (GPUs), which now have thousands of computing cores. GPUs were traditionally tied to graphics or image applications, but new releases of the GPU programming environments, CUDA and OpenCL, have allowed applications in fields beyond graphics to be computationally accelerated. This thesis aims to demonstrate that GPUs are fully valid tools for carrying out audio applications that require high computational resources. To this end, different applications in the field of audio processing are studied and implemented on GPUs. This manuscript also analyzes and solves possible limitations in each GPU-based implementation, from both the acoustic and the computational points of view. In this document, we have addressed the following problems: Most audio applications are based on massive filtering. Thus, the first implementation undertaken is a fundamental operation in audio processing: the convolution.
It was first developed as a computational kernel and afterwards used for an application that combines multiple convolutions concurrently: generalized crosstalk cancellation and equalization. The proposed implementation can successfully manage two different and common situations: buffers that are much larger than the filters and buffers that are much smaller than the filters. Two spatial audio applications that use the GPU as a co-processor have been developed from the massive multichannel filtering. The first application deals with binaural audio. Its main feature is that it is able to synthesize sound sources in spatial positions that are not included in the HRTF database and to generate smooth movements of sound sources. Both features were designed after different tests (objective and subjective). The performance, in terms of the number of sound sources that could be rendered in real time, was assessed on GPUs with different architectures. A similar performance is measured in a Wave Field Synthesis system (the second spatial audio application) that is composed of 96 loudspeakers. The proposed GPU-based implementation is able to reduce the room effects during the sound source rendering. A well-known approach for sound source localization in noisy and reverberant environments is also addressed on a multi-GPU system: the Steered Response Power with Phase Transform (SRP-PHAT) algorithm. Since localization accuracy can be improved by using high-resolution spatial grids and a high number of microphones, accurate acoustic localization systems require high computational power. The solutions implemented in this thesis are evaluated from both the localization and the computational performance points of view, taking into account different acoustic environments, and always from a real-time implementation perspective.
Finally, this manuscript also addresses massive multichannel filtering when the filters have an Infinite Impulse Response (IIR). Two cases are analyzed: 1) IIR filters composed of multiple second-order sections, and 2) IIR filters that have an allpass response. Both cases are used to develop and accelerate two different applications: 1) executing multiple equalizations in a WFS system, and 2) reducing the dynamic range of an audio signal. / Belloch Rodríguez, JA. (2014). PERFORMANCE IMPROVEMENT OF MULTICHANNEL AUDIO BY GRAPHICS PROCESSING UNITS [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/40651 / Premios Extraordinarios de tesis doctorales
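The block-wise convolution that the abstract identifies as the fundamental filtering operation can be sketched on the CPU with the overlap-add method. This is a NumPy illustration of the general technique, not the thesis's GPU kernel; the function name, block size and test signal are assumptions.

```python
import numpy as np

def overlap_add(signal, fir, block_size):
    """Filter `signal` with `fir` one buffer at a time using FFT overlap-add."""
    signal = np.asarray(signal, dtype=float)
    fir = np.asarray(fir, dtype=float)
    # FFT length must hold one full linear convolution of a block with the filter.
    n_fft = 1
    while n_fft < block_size + len(fir) - 1:
        n_fft *= 2
    fir_spectrum = np.fft.rfft(fir, n_fft)  # computed once, reused for every block
    out = np.zeros(len(signal) + len(fir) - 1)
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size]
        y = np.fft.irfft(np.fft.rfft(block, n_fft) * fir_spectrum, n_fft)
        n_valid = len(block) + len(fir) - 1
        # The filter tail of each block overlaps the next block and is summed in.
        out[start:start + n_valid] += y[:n_valid]
    return out

# Illustrative data: a random signal, a short filter, and buffers of 64 samples.
rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
h = rng.standard_normal(33)
y_blocks = overlap_add(x, h, block_size=64)
y_direct = np.convolve(x, h)
print(np.allclose(y_blocks, y_direct))
```

Each block is processed independently of the others until the final summation, which is the property that makes this kind of filtering a natural fit for many parallel processing units. The case of buffers much smaller than the filter would typically also call for partitioning the filter, which this sketch does not do.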
17

Audiovisual voice activity detection and localization of simultaneous speech sources / Detecção de atividade de voz e localização de fontes sonoras simultâneas utilizando informações audiovisuais

Minotto, Vicente Peruffo January 2013
Given the tendency to create interfaces between humans and machines that increasingly allow simple ways of interaction, it is only natural that research effort is put into techniques that seek to simulate the most conventional means of communication humans use: speech. In the human auditory system, voice is automatically processed by the brain in an effortless and effective way, commonly aided by visual cues such as mouth movement and the location of the speakers. This processing includes two important components that speech-based communication requires: Voice Activity Detection (VAD) and Sound Source Localization (SSL). Consequently, VAD and SSL also serve as mandatory preprocessing tools for high-end Human Computer Interface (HCI) applications, such as automatic speech recognition and speaker identification. However, VAD and SSL are still challenging problems when dealing with realistic acoustic scenarios, particularly in the presence of noise, reverberation and multiple simultaneous speakers. In this work we propose approaches for tackling these problems using audiovisual information, for both the single-source and the competing-sources scenarios, exploiting distinct ways of fusing the audio and video modalities. Our work also employs a microphone array for the audio processing, which allows the spatial information of the acoustic signals to be exploited through the state-of-the-art Steered Response Power (SRP) method. As an additional result, a very fast GPU version of SRP was developed, so that real-time processing is achieved. Our experiments show an average accuracy of 95% when performing VAD for up to three simultaneous speakers and an average error of 10 cm when locating such speakers.
20

Sound source localization with data and model uncertainties using the EM and Evidential EM algorithms / Estimation de sources acoustiques avec prise en compte de l'incertitude de propagation

Wang, Xun 09 December 2014
This work addresses the problem of localizing multiple sound sources from deterministic and random signals measured by a microphone array. The problem is solved in a statistical framework by maximum likelihood estimation. The pressure measured by a microphone is interpreted as a mixture of latent signals emitted by the sources, so that the source locations and strengths can be estimated with an expectation-maximization (EM) algorithm. Two kinds of uncertainty are also considered: the microphone locations and the wave number are assumed to be imprecisely known. These uncertainties are transposed to the data within the belief functions framework, and the source locations and strengths are then estimated with the Evidential EM (E2M) algorithm, a variant of EM for uncertain data.
The first part of the work considers the deterministic signal model without uncertainty. The EM algorithm is used to estimate the source locations and strengths, and the update equations for the model parameters are provided. Experimental results are compared with beamforming and statistically optimized near-field acoustic holography (SONAH), and demonstrate the advantage of the EM algorithm.
The second part addresses model uncertainty and shows how the uncertainties on the microphone locations and the wave number can be quantified at the data level. The likelihood function is extended to uncertain data, and the E2M algorithm is then used to estimate the sound sources. Experiments on simulated and real data show that EM and E2M give similar results when the data are certain, but that E2M is more robust in the presence of uncertainty on the model parameters.
The third part considers random signals, whose amplitude is modeled as a Gaussian random variable. In the model without uncertainty, the EM algorithm is used to estimate the sound sources. In the uncertain model, the uncertainties on the microphone locations and the wave number are transposed to the data as in the second part, and the source locations and the variances of the random amplitudes are estimated with the E2M algorithm. The results again show the advantage of a statistical model for estimating the sources, and the value of accounting for uncertainty on the model parameters.
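The abstract does not give the signal model or the update equations. As a hedged illustration of what "a mixture of latent signals" can look like in the deterministic case, the block below writes out one standard EM formulation for superimposed signals. The free-field Green's function, the noise model, the weights $\beta_k$ and all symbols are assumptions chosen for this sketch, and need not match the thesis.

```latex
% Assumed model: M microphones at r_m, K sources at s_k with complex
% amplitudes A_k, wave number \kappa, circular Gaussian noise.
\begin{align}
  p_m &= \sum_{k=1}^{K} A_k\, g(r_m, s_k) + \varepsilon_m,
  \qquad
  g(r, s) = \frac{e^{-j\kappa \lVert r - s \rVert}}{4\pi \lVert r - s \rVert},
  \qquad
  \varepsilon_m \sim \mathcal{CN}(0, \sigma^2) \\
% Latent per-source contributions that sum to the measurement.
  x_{mk} &= A_k\, g(r_m, s_k) + \varepsilon_{mk},
  \qquad
  \varepsilon_{mk} \sim \mathcal{CN}(0, \beta_k \sigma^2),
  \qquad
  \sum_{k=1}^{K} \beta_k = 1,
  \qquad
  p_m = \sum_{k=1}^{K} x_{mk} \\
% E-step: expected latent signals given the current estimates.
  \hat{x}_{mk} &= \hat{A}_k\, g(r_m, \hat{s}_k)
  + \beta_k \Bigl[ p_m - \sum_{j=1}^{K} \hat{A}_j\, g(r_m, \hat{s}_j) \Bigr] \\
% M-step: one single-source least-squares fit per source.
  (\hat{s}_k, \hat{A}_k) &= \arg\min_{s,\, A}
  \sum_{m=1}^{M} \bigl| \hat{x}_{mk} - A\, g(r_m, s) \bigr|^2,
  \qquad
  \hat{A}_k(s) = \frac{\sum_{m} g(r_m, s)^{*}\, \hat{x}_{mk}}
                      {\sum_{m} \lvert g(r_m, s) \rvert^{2}}
\end{align}
```

Under these assumptions, the M-step splits the joint search over all $K$ sources into $K$ separate single-source searches, each with a closed-form amplitude for any candidate position $s$.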
