Global ETD Search

21	Learning representations for robust audio-visual scene analysis / Apprentissage de représentations pour l'analyse robuste de scènes audiovisuelles
Parekh, Sanjeel 18 March 2019
The goal of this thesis is to design algorithms that enable robust detection of objects and events in videos through joint audio-visual analysis. This is motivated by humans’ remarkable ability to meaningfully integrate auditory and visual characteristics for perception in noisy scenarios. To this end, we identify two kinds of natural associations between the modalities in recordings made using a single microphone and a single camera, namely motion-audio correlation and appearance-audio co-occurrence.
For the former, we use audio source separation as the primary application and propose two novel methods within the popular non-negative matrix factorization (NMF) framework. The central idea is to utilize the temporal correlation between audio and motion for objects and actions where the sound-producing motion is visible. The first proposed method focuses on soft coupling between audio and motion representations capturing temporal variations, while the second is based on cross-modal regression. We segregate several challenging audio mixtures of string instruments into their constituent sources using these approaches.
To identify and extract many commonly encountered objects, we leverage appearance-audio co-occurrence in large datasets. This complementary association mechanism is particularly useful for objects where motion-based correlations are not visible or available.
The problem is dealt with in a weakly supervised setting wherein we design a representation learning framework for robust audio-visual (AV) event classification, visual object localization, audio event detection and source separation.
We extensively test the proposed ideas on publicly available datasets. The experiments demonstrate several intuitive multimodal phenomena that humans utilize on a regular basis for robust scene understanding.
Statistical learning; Audio signal processing; Computer vision; Latent variable analysis; Source separation
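For readers unfamiliar with the NMF framework named in this abstract, the following is a minimal sketch of plain NMF using the standard multiplicative updates for the Euclidean cost. It is not the motion-coupled or cross-modal regression method proposed in the thesis; the function names, the toy matrix and the parameter values are illustrative assumptions. In audio source separation, V would typically be a non-negative magnitude spectrogram.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
    return W, H

# Toy "spectrogram": rows are frequency bins, columns are time frames.
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0]]
W, H = nmf(V, rank=2)
print(frob_error(V, W, H))
```

Each column of W can be read as a spectral template and each row of H as its activation over time; coupling those activations to motion features is the kind of extension the abstract describes, and is not attempted here.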
22	GrooveSpired - aplikace pro trénování hry na bicí / GrooveSpired - Application for Drums Training
Štrba, Tomáš January 2017
The main goal of this work is to design and implement a mobile application for drum training. The application must be capable of displaying drum notation for grooves from various music styles, playing audio examples of those grooves, and analyzing and evaluating drummers' skills. The main audio processing method is the discrete wavelet transform. Results show a true value rate above 96%.
23	Sluchátka s adaptivním potlačením šumu / Adaptive Noise Cancellation Headphone
Panenka, Vojtěch January 2020
The thesis analyzes the technology used in the design of headphones with integrated active ambient noise cancellation and examines the possibilities of using adaptive filters to simplify development and achieve more effective attenuation.
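As an illustration of the adaptive filters mentioned in this abstract, here is a minimal sketch of a least-mean-squares (LMS) noise canceller. It is a generic textbook scheme, not the design from the thesis; the two-tap acoustic path, the step size `mu` and the number of taps are assumed values chosen for the example, and the secondary-path modelling that practical headphones need is left out.

```python
import random

def lms_cancel(primary, reference, taps=4, mu=0.05):
    """Return (residual signal, final weights) of an LMS noise canceller."""
    w = [0.0] * taps
    buf = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for d, x in zip(primary, reference):
        buf = [x] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))        # noise estimate
        e = d - y                                          # what is left after cancellation
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]   # LMS weight update
        residual.append(e)
    return residual, w

rng = random.Random(1)
# Noise picked up by the outer (reference) microphone.
reference = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
# Hypothetical acoustic path from the outer microphone to the ear.
primary = [0.8 * reference[n] - 0.3 * (reference[n - 1] if n else 0.0)
           for n in range(3000)]
residual, weights = lms_cancel(primary, reference)
print(weights)
```

In this noise-only setup the weights should approach the assumed path coefficients, and the residual is the signal the listener would hear after cancellation.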
24	Text and Speech Alignment Methods for Speech Translation Corpora Creation: Augmenting English LibriVox Recordings with Italian Textual Translations
Della Corte, Giuseppe January 2020
The recent rise of end-to-end speech translation models requires a new generation of parallel corpora composed of a large number of source-language speech utterances aligned with their target-language textual translations. We present a pipeline and a set of methods to collect hundreds of hours of English audiobook recordings and align them with their Italian textual translations, using exclusively public domain resources gathered semi-automatically from the web. The pipeline consists of three main stages: text collection, bilingual text alignment, and forced alignment. For the text collection task, we show how to automatically find e-book titles in a target language by using machine translation, web information retrieval, and named entity recognition and translation techniques. For the bilingual text alignment task, we investigated three methods: the Gale–Church algorithm in conjunction with a small hand-crafted bilingual dictionary, the Gale–Church algorithm in conjunction with a larger bilingual dictionary automatically inferred through statistical machine translation, and bilingual text alignment by computing the vector similarity of multilingual embeddings of concatenations of consecutive sentences. Our findings seem to indicate that the consecutive-sentence-embedding similarity approach improves the alignment of difficult sentences by indirectly performing sentence re-segmentation. For the forced alignment task, we give a theoretical overview of the preferred method depending on the properties of the text to be aligned with the audio, and we suggest and use a TTS-DTW (text-to-speech and dynamic time warping) based approach in our pipeline.
The result of our experiments is a publicly available multimodal corpus composed of about 130 hours of English speech aligned with its Italian textual translation and split into 60,561 triplets of English audio, English transcript, and Italian textual translation. We also post-processed the corpus to extract 40 MFCC features from the audio segments and released them as a dataset.
speech translation; parallel corpora; bilingual sentence alignment; sentence embeddings; cosine similarity; forced alignment; text collection; corpora creation; audio signal processing
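The dynamic time warping half of the TTS-DTW approach named in this abstract can be sketched as follows. This is the classic DTW recurrence on two short numeric sequences, offered as an assumption-laden illustration rather than the thesis's implementation: in a real pipeline the inputs would presumably be frame-level feature vectors (for example MFCCs) of the synthesized and the recorded speech, and the local cost would be a vector distance.

```python
def dtw(a, b):
    """Return (total cost, warping path) aligning sequence a to sequence b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = cheapest cost of aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the end to recover which elements were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(D[i - 1][j - 1], i - 1, j - 1),
                 (D[i - 1][j], i - 1, j),
                 (D[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return D[n][m], path

distance, path = dtw([1, 2, 3], [1, 2, 2, 3])
print(distance, path)
```

The path pairs each index of the first sequence with one or more indices of the second, which is what lets word or sentence boundaries known in the synthesized audio be transferred to the recording.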
25	Loudspeaker-Room Correction of Conference Rooms / Högtalar- och rumskorrigering av konferensrum
Edmark, Marcus January 2023
This thesis studies how signal processing can improve the overall quality of sound played back through a loudspeaker in a room. The subject has gained attention in recent years, with more and more consumer and professional products including such correction. The objective was to find techniques that offer perceptually good audio quality over most of the room while being robust and stable. The solution was to design a correction system that fulfils these requirements and takes advantage of today’s computing technology. In presenting the problem and its solution, the thesis introduces loudspeaker system design and reproduction, room acoustics, psychoacoustics (how humans perceive sound), signal extraction (pre-processing) and filter design, as well as design considerations for all of these components. Ways in which the system can be developed further are also discussed. The thesis is mainly based on the theory explained in Immersive Audio Signal Processing by S. Bharitkar and C. Kyriakakis [1]. The experimental results show that a well-performing room correction system can be realized using a microphone with a known response and a computer. In most cases the improvement in both audible and measurable audio quality is considerable, with only a few cases where no improvement was made. Using multiple measurement positions (positions of the microphone) led to a further improvement. It was also shown that two well-positioned microphones perform nearly as well as covering the whole room, although a combination of measurements over the whole listening area was the best-performing approach.
/ This degree thesis studies how signal processing can improve the listening experience in a room when sound is played through a loudspeaker. The subject has become more relevant as more advanced and affordable sound systems reach the market. The goal of the project was to find techniques that improve the listening experience robustly and over a larger area of the room. The solution was to design a correction system that meets these requirements and makes use of the large computational resources that today's computers offer. The problem and its solution are explained together with an introduction to each subject that affects sound reproduction and to what can be done to counteract the unwanted side effects. This includes areas such as loudspeaker system design, room acoustics, signal processing and filter design, as well as examples and a discussion of further developments of the project. The project was largely based on the book Immersive Audio Signal Processing by S. Bharitkar and C. Kyriakakis [1], which describes how to create an immersive listening experience through room correction. The final results showed that a loudspeaker and room correction system that meets the stated requirements with very good sound quality can be built in a few steps. Even the simpler systems, which use only a single measurement point, can correct playback in a whole room with good results. Further investigation of combining several measurement points showed that just two well-placed points can perform on par with measuring over the whole listening area. However, a combination of measurements over the listening area is shown to always perform best.
Immersive audio signal processing; room acoustics; room correction; sound systems; conference rooms; Electrical engineering and electronics
26	Compressed Domain Processing of MPEG Audio
Anantharaman, B 03 1900
MPEG audio compression techniques significantly reduce the storage and transmission requirements for high-quality digital audio. However, compression complicates the processing of audio in many applications. If a compressed audio signal is to be processed, a direct method would be to decode the compressed signal, process the decoded signal and re-encode it. This is computationally expensive due to the complexity of the MPEG filter bank. This thesis deals with the processing of MPEG compressed audio. The main contributions of this thesis are: (a) extracting wavelet coefficients in the MPEG compressed domain; (b) wavelet-based pitch extraction in the MPEG compressed domain; (c) time scale modification of MPEG audio; (d) watermarking of MPEG audio.
The research contributions start with a technique for calculating several levels of wavelet coefficients from the output of the MPEG analysis filter bank. The technique exploits the Toeplitz structure that arises when the MPEG and wavelet filter banks are represented in matrix form. The computational complexity of extracting several levels of wavelet coefficients after decoding the compressed signal is compared with that of extracting them directly from the output of the MPEG analysis filter bank. The proposed technique is found to be computationally efficient for extracting higher levels of wavelet coefficients.
Extracting pitch in the compressed domain becomes essential when large multimedia databases need to be indexed. For example, one may be interested in listening to a particular speaker, or to the male or female audio segments in a multimedia document. For this application, pitch information is one of the most basic and important features required. The pitch period is the time interval between two successive glottal closures.
Glottal closures are accompanied by sharp transients in the speech signal, which in turn give rise to local maxima in the wavelet coefficients. The pitch period can be calculated by finding the time interval between two successive maxima in the wavelet coefficients. It is shown that the computational complexity of extracting pitch in the compressed domain is less than 7% of that of uncompressed-domain processing. An algorithm for extracting pitch in the compressed domain is proposed. The results of this algorithm for synthetic signals and for words uttered by male and female speakers are reported.
In a number of important applications, one needs to modify an audio signal to render it more useful than the original. Typical applications include changing the time evolution of an audio signal (increasing or decreasing a speaker's rate of articulation) or adapting a given audio sequence to a given video sequence. In this thesis, time scale modifications are obtained in the subband domain such that when the modified subband signals are given to the MPEG synthesis filter bank, the desired time scale modification of the decoded signal is achieved. This is done by making use of sinusoidal modeling [I]. Here, each subband signal is modeled in terms of parameters such as amplitudes, phases and frequencies, and is subsequently synthesized from these parameters with Ls = k La, where Ls is the length of the synthesis window, k is the time scale factor and La is the length of the analysis window. As the PCM version of the time-scaled signal is not available, psychoacoustic-model-based bit allocation cannot be used. Hence a new bit allocation is done using a subband coding algorithm. This method has been satisfactorily tested for time scale expansion and compression of speech and music signals.
The recent growth of multimedia systems has increased the need for protecting digital media. Digital watermarking has been proposed as a method for protecting digital documents.
The watermark needs to be added to the signal in such a way that it does not cause audible distortions. However, the idea behind lossy MPEG encoders is to remove, or make insignificant, those portions of the signal that do not affect human hearing. This renders the watermark insignificant, and hence proving ownership of the signal becomes difficult when an audio signal is compressed. The existing compressed-domain methods merely change the bits or the scale factors according to a key. Though simple, these methods are not robust to attacks. Furthermore, these methods require the original signal to be available in the verification process. In this thesis we propose a watermarking method based on the spread spectrum technique which does not require the original signal during the verification process. It is also shown to be more robust than the existing methods. In our method the watermark is spread across many subband samples. Here two factors need to be considered: (a) the watermark is to be embedded only in those subbands where the added noise will be inaudible; (b) the watermark should be added to those subbands which have sufficient bit allocation, so that the watermark does not become insignificant due to lack of bit allocation. Embedding the watermark in the lower subbands would cause distortion, and embedding it in the higher subbands would prove futile, as the bit allocation in these subbands is practically zero. Considering all these factors, one can introduce noise into samples across many frames corresponding to subbands 4 to 8. In the verification process, it is sufficient to have the key/code and the possibly attacked signal. This method has been satisfactorily tested for robustness to scale factor changes, LSB changes, and MPEG decoding and re-encoding.
Electrical Communications; MPEG Audio Coding; Digital Technique; Audio Signal Processing; Least Significant Bit (LSB); Audio Signals Compression; Wavelet Coefficients; Time Scale Modifications; Sinusoidal Model; Compressed Domain; Wavelet Based Pitch Extraction; Audio Watermarking
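The spread spectrum idea described in this abstract, where verification needs only the key and the possibly attacked signal, can be illustrated with a minimal sketch. It is not the thesis's MPEG-domain method: a flat list of samples stands in for the subband samples of subbands 4 to 8, and the function names, the embedding strength `alpha` and the detection threshold are assumptions made for the example.

```python
import random

def pn_sequence(key, length):
    """Pseudo-noise sequence of +1/-1 values reproducible from the key alone."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def embed(samples, key, alpha=0.1):
    """Add a low-level key-dependent noise pattern to every sample."""
    pn = pn_sequence(key, len(samples))
    return [s + alpha * p for s, p in zip(samples, pn)]

def detect(samples, key, alpha=0.1):
    """Correlate with the regenerated pattern; the original signal is not needed."""
    pn = pn_sequence(key, len(samples))
    corr = sum(s * p for s, p in zip(samples, pn)) / len(samples)
    return corr > alpha / 2

host_rng = random.Random(7)
host = [host_rng.uniform(-1.0, 1.0) for _ in range(8192)]  # stand-in for subband samples
marked = embed(host, key=1234)
print(detect(marked, key=1234), detect(host, key=1234))
```

Because the host signal is nearly uncorrelated with the pseudo-noise pattern, the correlation is close to `alpha` when the watermark is present and close to zero otherwise; spreading over more samples widens that margin, which is consistent with the abstract's choice to spread the watermark across many frames.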
27	Independent Component Analysis Enhancements for Source Separation in Immersive Audio Environments
Zhao, Yue 01 January 2013
In immersive audio environments with distributed microphones, Independent Component Analysis (ICA) can be applied to uncover signals from a mixture of other signals and noise, such as in a cocktail party recording. ICA algorithms have been developed for instantaneous source mixtures and for convolutive source mixtures. While ICA for instantaneous mixtures works when no delays exist between the signals in each mixture, distributed microphone recordings typically result in various delays of the signals over the recorded channels. The convolutive ICA algorithm should account for delays; however, it requires many parameters to be set and often has stability issues. This thesis introduces the Channel Aligned FastICA (CAICA), which requires knowledge of the source distance to each microphone but does not require knowledge of the noise sources. Furthermore, the CAICA is combined with Time Frequency Masking (TFM), yielding even better extraction of the signal of interest (SOI), even in low-SNR environments. Simulations were conducted for ranking experiments that tested the performance of three algorithms: Weighted Beamforming (WB), CAICA, and CAICA with TFM. The Closest Microphone (CM) recording is used as a reference for all three. Statistical analyses of the results demonstrated superior performance for the CAICA with TFM. The algorithms were applied to experimental recordings to support the conclusions of the simulations. These techniques can be deployed on mobile platforms, used in surveillance for capturing human speech, and potentially adapted to biomedical fields.
Blind Source Separation; Independent Component Analysis; Audio Signal Processing; Convolutional Source Separation; Information Theory; Biomedical devices and instrumentation; Digital Communications and Networking; Signal Processing; Systems and Communications
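The abstract states that CAICA needs the source distance to each microphone but does not describe the alignment step itself. The sketch below therefore only illustrates generic delay compensation from known distances, which is one plausible way such knowledge could be used before an instantaneous ICA; the function names, the speed of sound and the sample rate are assumptions, and nothing here is taken from the thesis.

```python
SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature value

def delays_in_samples(distances_m, fs):
    """Extra propagation delay of each channel relative to the closest microphone."""
    nearest = min(distances_m)
    return [round((d - nearest) / SPEED_OF_SOUND * fs) for d in distances_m]

def align_channels(channels, distances_m, fs):
    """Drop the leading samples of later-arriving channels so the source lines up."""
    delays = delays_in_samples(distances_m, fs)
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [ch[d:d + length] for ch, d in zip(channels, delays)]

source = [float(n) for n in range(10)]
near_mic = source + [0.0, 0.0]   # source arrives immediately
far_mic = [0.0, 0.0] + source    # same source, two samples later
aligned = align_channels([near_mic, far_mic], [1.0, 1.02], fs=34300)
print(aligned[0] == aligned[1])
```

After alignment the source of interest appears without inter-channel delay, which is the condition under which the abstract says instantaneous-mixture ICA works.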
28	Kalman filtering for computer music applications
Benning, Manjinder 27 August 2007
This thesis discusses the use of Kalman filtering for noise reduction in a 3-D gesture-based computer music controller known as the Radio Drum and for real-time tempo tracking of rhythmic and melodic musical performances. The Radio Drum noise reduction Kalman filter is designed based on previous research in the field of target tracking for radar applications and on prior knowledge of a drummer’s expected gestures throughout a performance. In this case we seek to improve the position estimates of a drum stick in order to enhance the expressivity and control of the instrument by the performer. Our approach to tempo tracking is novel in that a multimodal approach combining gesture sensors and audio in a late fusion stage leads to higher accuracy in the tempo estimates.
Kalman Filtering; Computer Music; Tempo Tracking; Radio Drum; Noise Reduction; Adaptive Filtering; Particle Filtering; Wearable Sensors; Audio Signal Processing
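To make the target-tracking idea in this abstract concrete, here is a minimal sketch of a one-dimensional constant-velocity Kalman filter that smooths noisy position readings. It is a textbook model and not the Radio Drum filter from the thesis: the single axis, the diagonal process noise `q`, the measurement noise `r` and the simulated stick motion are all assumptions for illustration.

```python
import random

def kalman_cv(measurements, dt=1.0, q=1e-4, r=1.0):
    """Filter noisy positions with state [position, velocity]; return position estimates."""
    p, v = 0.0, 0.0
    P00, P01, P10, P11 = 100.0, 0.0, 0.0, 100.0  # large initial uncertainty
    estimates = []
    for z in measurements:
        # Predict: x <- F x, P <- F P F^T + Q with F = [[1, dt], [0, 1]], Q = diag(q, q)
        p = p + dt * v
        P00, P01, P10, P11 = (
            P00 + dt * (P01 + P10) + dt * dt * P11 + q,
            P01 + dt * P11,
            P10 + dt * P11,
            P11 + q,
        )
        # Update with a position-only measurement, H = [1, 0]
        S = P00 + r
        K0, K1 = P00 / S, P10 / S
        y = z - p
        p, v = p + K0 * y, v + K1 * y
        P00, P01, P10, P11 = (
            (1 - K0) * P00,
            (1 - K0) * P01,
            P10 - K1 * P00,
            P11 - K1 * P01,
        )
        estimates.append(p)
    return estimates

rng = random.Random(3)
truth = [0.5 * k for k in range(200)]               # simulated steady motion
noisy = [t + rng.gauss(0.0, 1.0) for t in truth]    # sensor readings
smooth = kalman_cv(noisy)
```

A three-axis controller could run one such filter per axis or use a joint state; which of these the thesis does is not stated in the abstract.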
29	Apprentissage automatique de caractéristiques audio : application à la génération de listes de lecture thématiques / Machine learning algorithms applied to audio features analysis : application in the automatic generation of thematic musical playlists
Bayle, Yann 19 June 2018
This doctoral dissertation presents, discusses and proposes tools for automatic information retrieval in big musical databases. The main application is the supervised classification of musical themes to generate thematic playlists. The first chapter introduces the different contexts and concepts around big musical databases and their consumption. The second chapter focuses on the description of existing music databases as part of academic experiments in audio analysis. This chapter notably introduces issues concerning the variety and unequal proportions of the themes contained in a database, which remain complex to take into account in supervised classification. The third chapter explains the importance of extracting and developing relevant audio features in order to better describe the content of music tracks in these databases. This chapter explains several psychoacoustic phenomena and uses sound signal processing techniques to compute audio features. New methods of aggregating local audio features are proposed to improve song classification. The fourth chapter describes the use of the extracted audio features to sort the songs by theme and thus to enable music recommendations and the automatic generation of homogeneous thematic playlists. This part involves the use of machine learning algorithms to perform music classification tasks. The contributions of this dissertation are summarized in the fifth chapter, which also proposes research perspectives in machine learning and the extraction of multi-scale audio features.
Big data mining; Machine and deep learning; Digital audio signal processing; Music information retrieval; Psychoacoustics; Supervised classification
30	MDCT Domain Enhancements For Audio Processing
Suresh, K 08 1900
The modified discrete cosine transform (MDCT), derived from DCT IV, has emerged as the most suitable choice for transform-domain audio coding applications due to its time domain alias cancellation property and de-correlation capability. In the present research work, we focus on MDCT domain analysis of audio signals for compression and other applications. We have derived algorithms for linear filtering in the DCT IV and DST IV domains for symmetric and non-symmetric filter impulse responses. These results are also extended to the MDCT and MDST domains, which have the special property of time domain alias cancellation. We also derive filtering algorithms for the DCT II and DCT III domains. Comparison with other methods in the literature shows that the new algorithm is efficient in terms of multiply-accumulate (MAC) operations. These results are useful for MDCT domain audio processing such as reverb synthesis, without having to reconstruct the time domain signal and then perform the necessary filtering operations.
In audio coding, the psychoacoustic model plays a crucial role and is used to estimate the masking thresholds for adaptive bit allocation. Transparent-quality audio coding is possible if the quantization noise is kept below the masking threshold for each frame. In the existing methods, the masking threshold for MDCT domain adaptive quantization is calculated separately, using the DFT of the signal frame. We have extended the spectral-integration-based psychoacoustic model proposed for sinusoidal modeling of audio signals to the MDCT domain. This has been possible because of a detailed analysis of the relation between the DFT and the MDCT; we interpret the MDCT coefficients as co-sinusoids and then apply the sinusoidal masking model. The validity of the masking threshold so derived is verified through listening tests as well as objective measures.
Parametric coding techniques are used for low bit rate encoding of multi-channel audio such as 5.1 format surround audio. In these techniques, the surround channels are synthesized at the receiver using the analysis parameters of the parametric model. We develop algorithms for MDCT domain analysis and synthesis of reverberation. Integrating these ideas, a parametric audio coder is developed in the MDCT domain. For the parameter estimation, we use a novel analysis-by-synthesis scheme in the MDCT domain, which results in better modeling of the spatial audio. The resulting parametric stereo coder is able to synthesize acceptable-quality stereo audio from the mono audio channel and side information of approximately 11 kbps. Further, an experimental audio coder is developed in the MDCT domain incorporating the new psychoacoustic model and the parametric model.
Sound Recordings; Audio Signal - Data Processing; Audio Processing; Audio Signal Processing; MDCT Domain; Modified Discrete Cosine Transform; Discrete Cosine Transform (DCT); Discrete Sine Transform (DST); Discrete Fourier Transform (DFT); Communication Engineering
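The time domain alias cancellation property that this abstract relies on can be demonstrated with a short sketch: a direct (slow) MDCT and inverse MDCT with a sine window, followed by overlap-add. This illustrates the transform only, not the thesis's filtering, psychoacoustic or parametric coding algorithms; the function names, the frame size and the test signal are assumptions, and a practical coder would use a fast algorithm instead of the direct sums.

```python
import math

def sine_window(N):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(frame, window):
    """Map a windowed frame of 2N samples to N coefficients."""
    N = len(frame) // 2
    xw = [x * w for x, w in zip(frame, window)]
    return [sum(xw[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs, window):
    """Map N coefficients back to 2N windowed samples that still contain aliasing."""
    N = len(coeffs)
    return [window[n] * (2.0 / N) *
            sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for k in range(N))
            for n in range(2 * N)]

def mdct_roundtrip(signal, N):
    """Analyse with 50% overlapping frames, then overlap-add the inverse transforms."""
    window = sine_window(N)
    tail = (-len(signal)) % N
    padded = [0.0] * N + list(signal) + [0.0] * (tail + N)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - N, N):
        y = imdct(mdct(padded[start:start + 2 * N], window), window)
        for n, value in enumerate(y):
            out[start + n] += value   # aliasing of neighbouring frames cancels here
    return out[N:N + len(signal)]

signal = [math.sin(0.3 * n) for n in range(50)]
restored = mdct_roundtrip(signal, 8)
print(max(abs(a - b) for a, b in zip(signal, restored)))
```

Each frame alone cannot be inverted, since 2N samples are mapped to N coefficients; the signal is recovered only after adjacent frames are summed, which is why processing coefficients directly in the MDCT domain, as the thesis does, requires care.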

Search results