1 |
Wavelet analysis for onset detection. Tait, Crawford. January 1997 (has links)
No description available.
|
2 |
Real-Time Musical Analysis of Polyphonic Guitar Audio. Hartquist, John E. 01 June 2012 (has links) (PDF)
In this thesis, we analyze the audio signal of a guitar to extract musical data in real time. Specifically, the pitch and octave of notes and chords are displayed over time. Previous work has shown that non-negative matrix factorization is an effective method for classifying the pitches of simultaneous notes. We explore the effect of window size, hop length, and other parameters to maximize the resolution and accuracy of the output. Other groups have required prerecorded note samples to build a library of note templates to search for. We automate this step and compute the library at run-time, tuning it specifically for the input guitar. The program we present generates a musical visualization of the results in addition to suggestions for fingerings of chords in the form of a fretboard display and tablature notation. This program is built as an applet and is accessible from the web browser.
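A minimal sketch of the underlying idea, not the thesis's implementation: the magnitude spectrogram V is factored as V ≈ W·H, where the columns of W are spectral templates for individual guitar notes and H holds their time-varying activations. It assumes librosa and NumPy are available and that isolated note recordings are supplied for the templates (the thesis builds this library automatically at run-time); function names and parameters are illustrative only.

```python
import numpy as np
import librosa

def build_note_templates(note_samples, sr=44100, n_fft=4096):
    """Average the magnitude spectra of isolated note recordings into templates."""
    templates = []
    for audio in note_samples:                      # one mono array per note
        S = np.abs(librosa.stft(audio, n_fft=n_fft))
        templates.append(S.mean(axis=1))            # mean spectrum as the note template
    W = np.array(templates).T                       # shape: (n_bins, n_notes)
    return W / (W.sum(axis=0, keepdims=True) + 1e-12)

def note_activations(audio, W, sr=44100, n_fft=4096, hop=1024, n_iter=50):
    """Estimate activations H with W held fixed (Frobenius multiplicative updates)."""
    V = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop))
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ (W @ H) + 1e-12)    # update only H; W stays fixed
    return H                                        # rows = notes, columns = time frames
```

Thresholding the rows of H per frame then gives the set of active pitches over time, which is the kind of output the visualization described above would display.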
|
3 |
The Social and Pedagogical Advantages of Audio Forensics and Restoration Education. Steinhour, Jacob B. 14 June 2010 (has links)
No description available.
|
4 |
Privacy Protection for Life-log System. Chaudhari, Jayashri S. 01 January 2007 (has links)
Tremendous advances in wearable computing and storage technologies enable us to record not just snapshots of an event but the whole human experience for a long period of time. Such a "life-log" system captures important events as they happen, rather than as an after-thought. Such a system has applications in many areas such as law enforcement, personal archives, police questioning, and medicine. Much of the existing effort focuses on the pattern recognition and information retrieval aspects of the system. On the other hand, the privacy issues raised by such an intrusive system have not received much attention from the research community. The objectives of this research project are two-fold: first, to construct a wearable life-log video system, and second, to provide a solution for protecting the identity of the subjects in the video while keeping the video useful. In this thesis work, we designed a portable wearable life-log system that implements audio distortion and face blocking in real time to protect the privacy of the subjects who are being recorded in life-log video. For audio, our system automatically isolates the subject's speech and distorts it using a pitch-shifting algorithm to conceal the identity. For video, our system uses a real-time face detection, tracking and blocking algorithm to obfuscate the faces of the subjects. Extensive experiments have been conducted on interview videos to demonstrate the ability of our system in protecting the identity of the subject while maintaining the usability of the life-log video.
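As a rough illustration of the audio side of such a system, the sketch below applies an off-the-shelf pitch-shifting routine to conceal a speaker's identity. It assumes librosa and soundfile are installed, works offline on a file rather than in real time, and is not the thesis's own algorithm; the shift amount is a placeholder.

```python
import librosa
import soundfile as sf

def anonymize_speech(in_path, out_path, semitones=-4):
    """Pitch-shift a recording by a fixed number of semitones to mask vocal identity."""
    y, sr = librosa.load(in_path, sr=None, mono=True)   # keep the original sample rate
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)
    sf.write(out_path, y_shifted, sr)

# Example (hypothetical file names): lower the subject's voice by 4 semitones.
# anonymize_speech("interview.wav", "interview_private.wav", semitones=-4)
```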
|
5 |
An investigation into the use of artificial intelligence techniques for the analysis and control of instrumental timbre and timbral combinations. Antoine, Aurélien. January 2018 (has links)
Researchers have investigated harnessing computers as a tool to aid in the composition of music for over 70 years. For the most part, such research has focused on creating algorithms to work with pitches and rhythm, which has resulted in a selection of sophisticated systems. Although the musical possibilities of these systems are vast, they do not directly consider another important characteristic of sound. Timbre can be defined as all the sound attributes, except pitch, loudness and duration, which allow us to distinguish and recognize that two sounds are dissimilar. This feature plays an essential role in combining instruments as it involves mixing instrumental properties to create unique textures conveying specific sonic qualities. In this thesis, we explore techniques for the analysis and control of instrumental timbre and timbral combinations. The thesis begins by investigating the link between musical timbre, auditory perception and psychoacoustics for sounds emerging from instrument mixtures. This led to the choice of verbal descriptors of timbral qualities to represent auditory perception of instrument combination sounds. Therefore, this thesis reports on the development of methods and tools designed to automatically retrieve and identify perceptual qualities of timbre within audio files, using specific musical acoustic features and artificial intelligence algorithms. Different perceptual experiments have been conducted to evaluate the correlation between selected acoustic cues and human perception. Results of these evaluations confirmed the potential and suitability of the presented approaches. Finally, these developments have helped to design a perceptually-orientated generative system harnessing aspects of artificial intelligence to combine sampled instrument notes. The findings of this exploration demonstrate that an artificial intelligence approach can help to harness the perceptual aspect of instrumental timbre and timbral combinations. This investigation suggests that established methods of measuring timbral qualities, based on a diverse selection of sounds, also work for sounds created by combining instrument notes. The development of tools designed to automatically retrieve and identify perceptual qualities of timbre also helped in designing a comparative scale that goes towards standardising metrics for comparing timbral attributes. Finally, this research demonstrates that perceptual characteristics of timbral qualities, using verbal descriptors as a representation, can be implemented in an intelligent computing system designed to combine sampled instrument notes conveying specific perceptual qualities.
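As a hedged illustration of mapping acoustic features to verbal timbre descriptors, and not the system developed in the thesis, the sketch below computes a handful of spectral features with librosa and feeds them to a generic classifier. The feature set, the classifier choice, and the descriptor labels are all assumptions made for the example.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def timbre_features(path):
    """Summarise a recording with a few timbre-related acoustic features."""
    y, sr = librosa.load(path, sr=None, mono=True)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()
    flatness = librosa.feature.spectral_flatness(y=y).mean()
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()
    return np.array([centroid, bandwidth, flatness, rolloff])

# Train on recordings labelled with perceptual descriptors (hypothetical data),
# then predict the descriptor conveyed by an unseen instrument combination.
# X = np.vstack([timbre_features(p) for p in labelled_paths])
# clf = RandomForestClassifier(n_estimators=200).fit(X, labels)   # e.g. "bright"/"dull"
# print(clf.predict(timbre_features("new_mixture.wav").reshape(1, -1)))
```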
|
6 |
Auditory-based processing of communication sounds. Walters, Thomas C. January 2011 (has links)
This thesis examines the possible benefits of adapting a biologically-inspired model of human auditory processing as part of a machine-hearing system. Features were generated by an auditory model, and used as input to machine learning systems to determine the content of the sound. Features were generated using the auditory image model (AIM) and were used for speech recognition and audio search. AIM comprises processing to simulate the human cochlea, and a 'strobed temporal integration' process which generates a stabilised auditory image (SAI) from the input sound. The communication sounds which are produced by humans, other animals, and many musical instruments take the form of a pulse-resonance signal: pulses excite resonances in the body, and the resonance following each pulse contains information both about the type of object producing the sound and its size. In the case of humans, vocal tract length (VTL) determines the size properties of the resonance. In the speech recognition experiments, an auditory filterbank was combined with a Gaussian fitting procedure to produce features which are invariant to changes in speaker VTL. These features were compared against standard mel-frequency cepstral coefficients (MFCCs) in a size-invariant syllable recognition task. The VTL-invariant representation was found to produce better results than MFCCs when the system was trained on syllables from simulated talkers of one range of VTLs and tested on those from simulated talkers with a different range of VTLs. The image stabilisation process of strobed temporal integration was analysed. Based on the properties of the auditory filterbank being used, theoretical constraints were placed on the properties of the dynamic thresholding function used to perform strobe detection. These constraints were used to specify a simple, yet robust, strobe detection algorithm. The syllable recognition system described above was then extended to produce features from profiles of the SAI and tested with the same syllable database as before. For clean speech, the performance of these features was comparable to that of the features generated from the filterbank output. However, when pink noise was added to the stimuli, performance dropped more slowly as a function of signal-to-noise ratio when using the SAI-based AIM features than when using either the filterbank-based features or the MFCCs, demonstrating the noise-robustness properties of the SAI representation. The properties of the auditory filterbank in AIM were also analysed. Three models of the cochlea were considered: the static gammatone filterbank, the dynamic compressive gammachirp (dcGC) and the pole-zero filter cascade (PZFC). The dcGC and gammatone are standard filterbank models, whereas the PZFC is a filter cascade, which more accurately models signal propagation in the cochlea. However, while the architectures of the filterbanks differ, they have all been successfully fitted to psychophysical masking data from humans. The abilities of the filterbanks to measure pitch strength were assessed, using stimuli which evoke a weak pitch percept in humans, in order to ascertain whether there is any benefit in the use of the more computationally efficient PZFC. Finally, a complete sound-effects search system using auditory features was constructed in collaboration with Google research. Features were computed from the SAI by sampling the SAI space with boxes of different scales. Vector quantization (VQ) was used to convert this multi-scale representation to a sparse code.
The 'passive-aggressive model for image retrieval' (PAMIR) was used to learn the relationships between dictionary words and these auditory codewords. These auditory sparse codes were compared against sparse codes generated from MFCCs, and the best performance was found when using the auditory features.
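For readers unfamiliar with the auditory filterbanks discussed above, the sketch below implements a basic static gammatone filterbank with ERB-scaled bandwidths applied by convolution. It is an illustrative stand-in written from the standard gammatone formula, not the AIM code; the channel count, bandwidth constant and impulse-response length are assumptions.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth (Glasberg & Moore) in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, sr, duration=0.05, order=4, b=1.019):
    """Impulse response of one gammatone channel centred on fc (Hz)."""
    t = np.arange(int(duration * sr)) / sr
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def gammatone_filterbank(signal, sr, n_channels=32, fmin=100.0, fmax=8000.0):
    """Return an (n_channels, n_samples) array of band-passed signals."""
    # Centre frequencies spaced uniformly on the ERB-rate scale.
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    inv_erb_rate = lambda e: (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    centres = inv_erb_rate(np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels))
    out = np.empty((n_channels, len(signal)))
    for i, fc in enumerate(centres):
        ir = gammatone_ir(fc, sr)
        out[i] = np.convolve(signal, ir, mode="full")[: len(signal)]
    return out
```

In AIM, the output of such a filterbank (or of the dcGC or PZFC alternatives) would then be half-wave rectified and strobed to form the stabilised auditory image; none of those later stages is shown here.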
|
7 |
Adaptive Sinusoidal Models for Speech with Applications in Speech Modifications and Audio Analysis / Modèles adaptifs sinusoïdaux de parole avec des applications sur la modification de la parole et l'analyse audio. Kafentzis, George. 20 June 2014 (has links)
Sinusoidal Modeling is one of the most widely used parametric methods for speech and audio signal processing. The accurate estimation of sinusoidal parameters (amplitudes, frequencies, and phases) is a critical task for close representation of the analyzed signal. In this thesis, based on recent advances in sinusoidal analysis, we propose high-resolution adaptive sinusoidal models for speech analysis, synthesis, and modification systems. Our goal is to provide systems that represent speech in a highly accurate and compact way.
Inspired by the recently introduced adaptive Quasi-Harmonic Model (aQHM) and adaptive Harmonic Model (aHM), we overview the theory of adaptive Sinusoidal Modeling and we propose a model named the extended adaptive Quasi-Harmonic Model (eaQHM), which is a non-parametric model able to adjust the instantaneous amplitudes and phases of its basis functions to the underlying time-varying characteristics of the speech signal, thus significantly alleviating the so-called local stationarity hypothesis. The eaQHM is shown to outperform aQHM in analysis and resynthesis of voiced speech. Based on the eaQHM, a hybrid analysis/synthesis system of speech is presented (eaQHNM), along with a hybrid version of the aHM (aHNM). Moreover, we present motivation for a full-band representation of speech using the eaQHM, that is, representing all parts of speech as high resolution AM-FM sinusoids. Experiments show that adaptation and quasi-harmonicity are sufficient to provide transparent quality in unvoiced speech resynthesis. The full-band eaQHM analysis and synthesis system is presented next, which outperforms state-of-the-art systems, hybrid or full-band, in speech reconstruction, providing transparent quality confirmed by objective and subjective evaluations. Regarding applications, the eaQHM and the aHM are applied to speech modifications (time and pitch scaling). The resulting modifications are of high quality, and follow very simple rules, compared to other state-of-the-art modification systems. Results show that harmonicity is preferred over quasi-harmonicity in speech modifications due to the embedded simplicity of representation. Moreover, the full-band eaQHM is applied to the problem of modeling audio signals, specifically musical instrument sounds. The eaQHM is evaluated and compared to state-of-the-art systems, and is shown to outperform them in terms of resynthesis quality, successfully representing the attack, transient, and stationary parts of a musical instrument sound. Finally, another application is suggested, namely the analysis and classification of emotional speech. The eaQHM is applied to the analysis of emotional speech, providing its instantaneous parameters as features that can be used in recognition and Vector-Quantization-based classification of the emotional content of speech. Although sinusoidal models are not commonly used in such tasks, results are promising.
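As a much-simplified illustration of frame-wise harmonic sinusoidal modelling, far from the adaptive eaQHM or aHM algorithms themselves, the sketch below fits the amplitudes of the harmonics of a given f0 to one analysis frame by least squares and resynthesises the frame from them. It assumes an f0 estimate is already available; frame length, sample rate and harmonic count are placeholders.

```python
import numpy as np

def fit_harmonics(frame, sr, f0, n_harmonics=20):
    """Least-squares fit of cosine/sine amplitudes for each harmonic of f0."""
    t = np.arange(len(frame)) / sr
    k = np.arange(1, n_harmonics + 1)
    phases = 2 * np.pi * np.outer(t, k * f0)                 # (n_samples, n_harmonics)
    basis = np.hstack([np.cos(phases), np.sin(phases)])      # real harmonic basis
    coeffs, *_ = np.linalg.lstsq(basis, frame, rcond=None)   # amplitudes of each column
    return coeffs, basis

def synth_frame(coeffs, basis):
    """Resynthesise the frame from the fitted harmonic basis."""
    return basis @ coeffs

# Example (hypothetical values): fit and resynthesise one voiced frame.
# coeffs, basis = fit_harmonics(frame, sr=16000, f0=140.0)
# frame_hat = synth_frame(coeffs, basis)
```

Adaptive models such as the eaQHM go further by letting the amplitudes, frequencies and phases of these basis functions vary within the frame and by iterating the fit against the signal's own local characteristics.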
|
8 |
[en] CLASSIFICATION AND SEGMENTATION OF MPEG AUDIO BASED ON SCALE FACTORS / [pt] CLASSIFICAÇÃO E SEGMENTAÇÃO DE ÁUDIO A PARTIR DE FATORES DE ESCALA MPEG. FERNANDO RIMOLA DA CRUZ MANO. 06 May 2008 (links)
[en] With the growth of production and storage of digital media, audio segmentation and classification are becoming increasingly important. This work is based on characteristics of the MPEG standard, considered to be the standard for digital media storage and retrieval, to propose efficient algorithms to perform these tasks. While there are many studies based on video analysis, audio information is still not widely used in an efficient way. The suggested algorithms for both tasks are based only on the scale factors present in Layer 2 MPEG audio. That allows them to read the smallest amount of information possible, significantly diminishing the amount of data manipulated during the analysis and making their performance excellent in terms of processing time. The algorithm proposed for audio classification divides audio into four possible types: silence, speech, music and applause. The segmentation algorithm finds significant changes in the audio signal that represent clues of audio segments and scene changes. Tests were made with a wide range of types of video, and both algorithms show good results.
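A rough sketch of the idea of working from Layer 2 scale factors alone is given below; it is not the thesis's algorithm. It assumes the scale factors have already been parsed from the MPEG bitstream into an array of shape (n_frames, n_subbands), uses only three of the four classes for brevity, and the thresholds are placeholders.

```python
import numpy as np

def classify_frames(scale_factors, silence_thr=2.0, speech_var_thr=10.0):
    """Label each frame 'silence', 'speech' or 'music' from scale-factor statistics."""
    level = scale_factors.mean(axis=1)                       # coarse per-frame level proxy
    variability = np.abs(np.diff(scale_factors, axis=0)).mean(axis=1)
    variability = np.append(variability, variability[-1])    # pad to n_frames
    # Very low level -> silence; highly variable scale factors -> speech; else music.
    return np.where(level < silence_thr, "silence",
           np.where(variability > speech_var_thr, "speech", "music"))

def segment_boundaries(scale_factors, window=50, jump_thr=5.0):
    """Flag frames where the smoothed scale-factor level changes abruptly (segment clues)."""
    level = scale_factors.mean(axis=1)
    smoothed = np.convolve(level, np.ones(window) / window, mode="same")
    return np.where(np.abs(np.diff(smoothed)) > jump_thr)[0]
```

Because only the scale factors are decoded, no full audio reconstruction is needed, which is what keeps the processing cost low as the abstract emphasises.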
|
9 |
Apprentissage de représentations musicales à l'aide d'architectures profondes et multiéchelles. Hamel, Philippe. 05 1900 (has links)
Machine learning (ML) is an important tool in the field of music information retrieval (MIR). Many MIR tasks can be solved by training a classifier over a set of features. For MIR tasks based on music audio, it is possible to extract features from the audio with signal processing techniques. However, some musical aspects are hard to extract with simple heuristics. To obtain richer features, we can use ML to learn a representation from the audio. These learned features can often improve performance for a given MIR task.
In order to learn interesting musical representations, it is important to consider the particular aspects of music audio when building learning models. Given the temporal and spectral structure of music audio, deep and multi-scale representations are particularly well suited to represent music. This thesis focuses on learning representations from music audio. Deep and multi-scale models that improve the state of the art for tasks such as instrument recognition, genre recognition and automatic annotation are presented.
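As a hedged illustration of a deep, multi-scale model for a MIR tagging task, and not one of the architectures proposed in the thesis, the sketch below passes mel-spectrogram frames through stacked layers and concatenates features pooled over several time scales before classification. Layer sizes, pooling windows and the tag count are assumptions; PyTorch is assumed available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTagger(nn.Module):
    def __init__(self, n_mels=128, n_tags=50):
        super().__init__()
        self.frame_net = nn.Sequential(            # deep per-frame feature extractor
            nn.Linear(n_mels, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.pools = [5, 20, 80]                   # pooling windows (in frames) = time scales
        self.classifier = nn.Linear(256 * len(self.pools), n_tags)

    def forward(self, mel):                        # mel: (batch, n_frames, n_mels), n_frames >= 80
        h = self.frame_net(mel)                    # (batch, n_frames, 256)
        scales = []
        for w in self.pools:
            # Average-pool frame features over windows of w frames, then take the
            # maximum over time to summarise each scale with a fixed-size vector.
            pooled = F.avg_pool1d(h.transpose(1, 2), kernel_size=w, stride=w)
            scales.append(pooled.max(dim=2).values)
        return self.classifier(torch.cat(scales, dim=1))   # tag logits
```

The multiple pooling windows are what make the representation multi-scale: short windows capture local texture while long windows capture slower structure, echoing the motivation given in the abstract.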
|