Global ETD Search

31	Learning representations for robust audio-visual scene analysis / Apprentissage de représentations pour l'analyse robuste de scènes audiovisuelles
Parekh, Sanjeel, 18 March 2019
The goal of this thesis is to design algorithms that enable robust detection of objects and events in videos through joint audio-visual analysis. This is motivated by humans' remarkable ability to meaningfully integrate auditory and visual characteristics for perception in noisy scenarios. To this end, we identify two kinds of natural associations between the modalities in recordings made using a single microphone and camera, namely motion-audio correlation and appearance-audio co-occurrence.
For the former, we use audio source separation as the primary application and propose two novel methods within the popular non-negative matrix factorization (NMF) framework. The central idea is to utilize the temporal correlation between audio and motion for objects/actions where the sound-producing motion is visible. The first proposed method focuses on soft coupling between audio and motion representations capturing temporal variations, while the second is based on cross-modal regression. We segregate several challenging audio mixtures of string instruments into their constituent sources using these approaches.
To identify and extract many commonly encountered objects, we leverage appearance-audio co-occurrence in large datasets. This complementary association mechanism is particularly useful for objects where motion-based correlations are not visible or available. The problem is dealt with in a weakly-supervised setting wherein we design a representation learning framework for robust AV event classification, visual object localization, audio event detection and source separation.
We extensively test the proposed ideas on publicly available datasets. The experiments demonstrate several intuitive multimodal phenomena that humans utilize on a regular basis for robust scene understanding.
Keywords: Statistical learning; Audio signal processing; Computer vision; Latent variable analysis; Source separation
32	GrooveSpired - aplikace pro trénování hry na bicí / GrooveSpired - Application for Drums Training
Štrba, Tomáš, January 2017
The main goal of this work is to design and implement a mobile application for drum training. The application must be capable of displaying drum notation of different grooves from various music styles, playing audio examples of those grooves, and analyzing and evaluating the skills of drummers. The main method for audio processing is the discrete wavelet transform. Results show a true value rate above 96%.
33	Rozpoznávání hudebního žánru za pomoci technik Music Information Retrieval / Music genre recognition using Music information retrieval techniques
Zemánková, Šárka, January 2019
This diploma thesis deals with music genre recognition using the techniques of Music Information Retrieval. It contains a brief description of the principles of this research area and of its subfield, Music Genre Recognition. The following chapter covers the selection of the most suitable parameters for describing music genres. The work then characterizes the machine learning methods used in this field of research. The next chapter describes music datasets created for genre classification studies. A design and evaluation of a system for music genre recognition follows. The last part of the work describes the results of a partial parameter analysis and the dependence of genre classification accuracy on the number of parameters, and discusses the causes of the classification accuracy obtained for the individual genres.
34	Sluchátka s adaptivním potlačením šumu / Adaptive Noise Cancellation Headphone
Panenka, Vojtěch, January 2020
The thesis analyzes the technology used in the design of headphones with integrated active ambient noise cancellation and examines the possibilities of using adaptive filters to simplify development and achieve more effective attenuation.
35	VST Plug-IN pro vodoznačení audio signálů / VST Plug-IN for audio signal watermarking
Henzl, David, January 2008
This thesis deals with digital signal processing methods and their applications, in particular audio signal watermarking as a means of protecting the copyright of audio content. It outlines basic audio watermarking methods and approaches to watermark detection. To illustrate watermarking, the thesis describes the audio watermarking method known as Echo Hiding. This method embeds watermarks into audio content in the time domain, while watermark detection is performed in the cepstral domain using the Fast Fourier Transform and a correlation function. The method is implemented as a VST plug-in and, together with ASIO drivers that minimize signal latency, provides audio signal watermarking in real time. The first part of the thesis introduces VST technology, ASIO drivers and the creation of VST plug-ins. The second part deals with the implementation of watermarking methods in conjunction with VST technology.
36	Text and Speech Alignment Methods for Speech Translation Corpora Creation: Augmenting English LibriVox Recordings with Italian Textual Translations
Della Corte, Giuseppe, January 2020
The recent rise of end-to-end speech translation models requires a new generation of parallel corpora, composed of a large number of source-language speech utterances aligned with their target-language textual translations. We present a pipeline and a set of methods to collect hundreds of hours of English audio-book recordings and align them with their Italian textual translations, using exclusively public domain resources gathered semi-automatically from the web. The pipeline consists of three main stages: text collection, bilingual text alignment, and forced alignment.
For the text collection task, we show how to automatically find e-book titles in a target language by using machine translation, web information retrieval, and named entity recognition and translation techniques. For the bilingual text alignment task, we investigated three methods: the Gale–Church algorithm in conjunction with a small hand-crafted bilingual dictionary, the Gale–Church algorithm in conjunction with a bigger bilingual dictionary automatically inferred through statistical machine translation, and bilingual text alignment by computing the vector similarity of multilingual embeddings of concatenations of consecutive sentences. Our findings seem to indicate that the consecutive-sentence-embedding similarity approach improves the alignment of difficult sentences by indirectly performing sentence re-segmentation. For the forced alignment task, we give a theoretical overview of the preferred method depending on the properties of the text to be aligned with the audio, and we suggest and use a TTS-DTW (text-to-speech and dynamic time warping) based approach in our pipeline.
The result of our experiments is a publicly available multi-modal corpus composed of about 130 hours of English speech aligned with its Italian textual translation and split into 60561 triplets of English audio, English transcript, and Italian textual translation. We also post-processed the corpus to extract 40 MFCC features from the audio segments and released them as a dataset.
Keywords: speech translation; parallel corpora; bilingual sentence alignment; sentence embeddings; cosine similarity; forced alignment; text collection; corpora creation; audio signal processing
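The third alignment method in entry 36 (vector similarity over embeddings of concatenated consecutive sentences) can be illustrated with a minimal sketch. It assumes sentence embeddings are already available as plain lists of floats. The names `cosine` and `best_span` are hypothetical, and summing the vectors of consecutive sentences is only a stand-in for embedding the concatenated text, which is what the abstract describes. This is not the thesis's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_span(src_vec, tgt_vecs, max_span=2):
    """Return (score, (start, end)) for the run of up to max_span consecutive
    target sentences whose combined vector is most similar to src_vec.

    Assumption: the combined vector is the element-wise sum of the sentence
    vectors, used here in place of re-embedding the concatenated sentences.
    """
    best_score, best_range = -1.0, None
    for start in range(len(tgt_vecs)):
        for length in range(1, max_span + 1):
            end = start + length
            if end > len(tgt_vecs):
                break
            span_vec = [sum(col) for col in zip(*tgt_vecs[start:end])]
            score = cosine(src_vec, span_vec)
            if score > best_score:
                best_score, best_range = score, (start, end)
    return best_score, best_range
```

Allowing a source sentence to match a run of two target sentences is what lets this kind of search absorb segmentation differences between the two texts.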
37	Loudspeaker-Room Correction of Conference Rooms / Högtalar- och rumskorrigering av konferensrum
Edmark, Marcus, January 2023
This thesis studies how signal processing can improve the overall sound quality of loudspeaker playback within a room. The subject has gained attention in recent years, with more and more consumer and professional products including such processing. The objective was to find techniques that offer perceptually good audio quality covering most of the room while being robust and stable. The solution was to design a correction system that fulfils these requirements and takes advantage of today's computing technology. The problem and its solution, as presented in this thesis, introduce the reader to loudspeaker system design and reproduction, room acoustics, psychoacoustics (how humans perceive sound), signal extraction (pre-processing) and filter design, as well as design considerations for all of these components. Different ways in which the system can be developed further are also discussed. The thesis is mainly based on the theory explained in Immersive Audio Signal Processing by S. Bharitkar and C. Kyriakakis [1]. The experimental results show that a well-performing room correction system can be realized using a microphone with a known response and a computer. In most cases the improvement in both audible and measurable audio quality is considerable, with only a few cases where no improvement was achieved. Using multiple measurement (microphone) positions led to a further improvement. Two well-positioned microphones performed nearly as well as covering the whole room, although combining measurements over the whole listening area was the best-performing approach.
Summary (translated from Swedish): This thesis studied how signal processing can be used to improve the listening experience in a room when sound is played back through a loudspeaker. The topic has become more relevant as increasingly advanced and affordable sound systems reach the market. The goal of the project was to find techniques that improve the listening experience robustly and over a larger area of the room. The solution was to design a correction system that met these requirements and made use of the large computing resources offered by today's computers. The problem and its solution are explained together with an introduction to each topic that affects sound reproduction and to what can be done to counteract the unwanted side effects. This includes areas such as loudspeaker system design, room acoustics, signal processing and filter design, as well as examples and a discussion of further developments of the project. The project was largely based on the book Immersive Audio Signal Processing by S. Bharitkar and C. Kyriakakis [1], which describes how to create an immersive listening experience through room correction. The final results showed that a loudspeaker and room correction system that meets the stated requirements with very good sound quality can be built in a few steps. Even the simpler systems, which use only a single measurement point, can correct playback in an entire room with good results. Further investigation of combining several measurement points showed that just two well-placed points can perform on par with measuring over the whole listening area. However, a combination of measurements over the listening area always performs best.
Keywords: Immersive audio signal processing; room acoustics; room correction; sound systems; conference rooms; Electrical engineering and electronics
38	2値多重音響特徴ベクトルを用いた類似音楽探索とその高速化 / Similar music search using binary polyphonic acoustic feature vectors and its speed-up
MURASE, Hiroshi (村瀬, 洋); KASHINO, Kunio (柏野, 邦夫); NAGANO, Hidehisa (永野, 秀尚), 01 November 2003
No description available.
Keywords: polyphonic binary feature vector; audio signal search; polyphonic music; music retrieval
39	Compressed Domain Processing of MPEG Audio
Anantharaman, B, 03 1900
MPEG audio compression techniques significantly reduce the storage and transmission requirements for high-quality digital audio. However, compression complicates the processing of audio in many applications. If a compressed audio signal is to be processed, a direct method would be to decode the compressed signal, process the decoded signal and re-encode it. This is computationally expensive due to the complexity of the MPEG filter bank. This thesis deals with the processing of MPEG compressed audio. The main contributions of this thesis are a) extracting wavelet coefficients in the MPEG compressed domain, b) wavelet-based pitch extraction in the MPEG compressed domain, c) time scale modification of MPEG audio, and d) watermarking of MPEG audio.
The research contributions start with a technique for calculating several levels of wavelet coefficients from the output of the MPEG analysis filter bank. The technique exploits the Toeplitz structure that arises when the MPEG and wavelet filter banks are represented in matrix form. The computational complexity of extracting several levels of wavelet coefficients after decoding the compressed signal is compared with that of extracting them directly from the output of the MPEG analysis filter bank. The proposed technique is found to be computationally efficient for extracting higher levels of wavelet coefficients.
Extracting pitch in the compressed domain becomes essential when large multimedia databases need to be indexed. For example, one may be interested in listening to a particular speaker, or to the male or female audio segments in a multimedia document. For this application, pitch information is one of the most basic and important features required. The pitch period is essentially the time interval between two successive glottal closures. Glottal closures are accompanied by sharp transients in the speech signal, which in turn give rise to local maxima in the wavelet coefficients. Pitch can be calculated by finding the time interval between two successive maxima in the wavelet coefficients. It is shown that the computational complexity of extracting pitch in the compressed domain is less than 7% of that of uncompressed-domain processing. An algorithm for extracting pitch in the compressed domain is proposed. The results of this algorithm for synthetic signals and for words uttered by male and female speakers are reported.
In a number of important applications, one needs to modify an audio signal to render it more useful than the original. Typical applications include changing the time evolution of an audio signal (increasing or decreasing the rate of articulation of a speaker) or adapting a given audio sequence to a given video sequence. In this thesis, time scale modifications are obtained in the subband domain such that, when the modified subband signals are given to the MPEG synthesis filter bank, the desired time scale modification of the decoded signal is achieved. This is done by making use of sinusoidal modeling [1]. Each subband signal is modeled in terms of parameters such as amplitude, phase and frequency, and is subsequently synthesised from these parameters with Ls = k La, where Ls is the length of the synthesis window, k is the time scale factor and La is the length of the analysis window. As the PCM version of the time-scaled signal is not available, psychoacoustic-model-based bit allocation cannot be used. Hence a new bit allocation is done using a subband coding algorithm. This method has been satisfactorily tested for time scale expansion and compression of speech and music signals.
The recent growth of multimedia systems has increased the need for protecting digital media. Digital watermarking has been proposed as a method for protecting digital documents. The watermark needs to be added to the signal in such a way that it does not cause audible distortions. However, the idea behind lossy MPEG encoders is to remove or make insignificant those portions of the signal that do not affect human hearing. This renders the watermark insignificant, and hence proving ownership of the signal becomes difficult when an audio signal is compressed. The existing compressed-domain methods merely change the bits or the scale factors according to a key. Though simple, these methods are not robust to attacks. Furthermore, these methods require the original signal to be available in the verification process. In this thesis we propose a watermarking method based on the spread spectrum technique that does not require the original signal during the verification process. It is also shown to be more robust than the existing methods. In our method the watermark is spread across many subband samples. Two factors need to be considered: a) the watermark is to be embedded only in those subbands where the added noise will be inaudible, and b) the watermark should be added to those subbands that have sufficient bit allocation, so that the watermark does not become insignificant due to lack of bit allocation. Embedding the watermark in the lower subbands would cause distortion, and embedding it in the higher subbands would prove futile, as the bit allocation in these subbands is practically zero. Considering all these factors, one can introduce noise to samples across many frames corresponding to subbands 4 to 8. In the verification process, it is sufficient to have the key/code and the possibly attacked signal. This method has been satisfactorily tested for robustness to scale factor changes, LSB changes, and MPEG decoding and re-encoding.
Keywords: Electrical Communications; MPEG Audio Coding; Digital Technique; Audio Signal Processing; Least Significant Bit (LSB); Audio Signals Compression; Wavelet Coefficients; Time Scale Modifications; Sinusoidal Model; Compressed Domain; Wavelet Based Pitch Extraction; Audio Watermarking
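The pitch step described in entry 39 (measuring the time interval between successive maxima of the wavelet coefficients) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's compressed-domain algorithm: the coefficients are taken as an already computed list of floats at a known sample rate, and the function name, the fixed threshold and the use of the median gap are all hypothetical choices.

```python
def pitch_period_from_maxima(coeffs, sample_rate, threshold):
    """Estimate the pitch period in seconds from local maxima of a coefficient sequence.

    A sample counts as a maximum if it exceeds `threshold`, is at least as large
    as its left neighbour and strictly larger than its right neighbour.
    Returns None when fewer than two maxima are found.
    """
    peaks = [
        i
        for i in range(1, len(coeffs) - 1)
        if coeffs[i] > threshold
        and coeffs[i] >= coeffs[i - 1]
        and coeffs[i] > coeffs[i + 1]
    ]
    if len(peaks) < 2:
        return None
    # Spacing between successive maxima, in samples; the median damps outliers.
    gaps = sorted(b - a for a, b in zip(peaks, peaks[1:]))
    median_gap = gaps[len(gaps) // 2]
    return median_gap / sample_rate


# Synthetic stand-in for wavelet coefficients: one sharp transient every 80 samples.
coeffs = [0.0] * 400
for index in range(10, 400, 80):
    coeffs[index] = 1.0

period = pitch_period_from_maxima(coeffs, sample_rate=8000, threshold=0.5)
```

With transients every 80 samples at an assumed 8000 Hz sample rate, the estimated period corresponds to a pitch of about 100 Hz (the reciprocal of the period).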
40	Τεχνολογίες ηλεκτροακουστικών συστημάτων για απευθείας αναπαραγωγή και ασύρματη μετάδοση ψηφιακών ηχητικών σημάτων / Electroacoustic system technologies for direct reproduction and wireless transmission of digital audio signals
Τάτλας, Νικόλαος-Αλέξανδρος, 20 February 2009
The dissertation analyzes issues concerning the wireless transmission and the direct emission of digital audio signals, in order to optimize these techniques. Regarding wireless transmission, the dissertation focuses on the WLAN family of networks with QoS support. To study the system, trials were conducted using a novel platform that converts audio streams so that they can be imported into and exported from a network simulation application, which makes it possible to assess the fidelity of the final audio playback. This analysis led to the development of a novel technique for the synchronization of discrete reproduction channels that can be implemented using typical WLAN hardware, as well as a novel algorithm for concealing audible distortion that might be introduced during transmission.
The dissertation also introduces results obtained with new methods for the study of Unary Digital Transmission Arrays driven by one-bit Sigma-Delta signals, as in the DSD standard. The approach is based on a novel algorithm for mapping a one-bit stream to the acoustic transmission elements and exhibits important advantages in terms of the fidelity and directivity achieved, as well as ease of implementation. The design and operation of a prototype digital Sigma-Delta loudspeaker substantiate the theoretical analysis, and the measurements obtained from the device agree with what was expected from the corresponding simulations. However, various practical limitations lead to lower than expected acoustic fidelity.
Keywords: Audio signal; Synchronization; Wireless transmission; Local area networks; Digital reproduction; Transduction array; Error concealment; Coding
Classification: 006.5
