251 |
Apprentissage en ligne de signatures audiovisuelles pour la reconnaissance et le suivi de personnes au sein d'un réseau de capteurs ambiants / Online learning of audiovisual signatures for people recognition and tracking within a network of ambient sensors
Decroix, François-Xavier, 20 December 2017 (has links)
The neOCampus operation, started in 2013 by Paul Sabatier University in Toulouse, aims to create a connected, innovative, intelligent and sustainable campus by exploiting the skills of 11 laboratories and several industrial partners. These multidisciplinary skills are combined to improve the daily comfort of campus users (students, teachers, administrative staff) and to reduce the ecological footprint of the campus.
The intelligence we want to bring to the campus of the future requires providing its buildings with a perception of their internal activity: optimizing energy resources requires a characterization of user activities so that a building can automatically adapt itself to them. Since human activity is open to multiple levels of interpretation, our work focuses on extracting the trajectories of the people present, its most elementary component. Characterizing user activity in terms of movement exploits data extracted from cameras and microphones distributed in a room, which together form a sparse network of heterogeneous sensors. From these data, we seek to extract an audiovisual signature and a rough localization of each person transiting through this sensor network. While protecting individual privacy, the signature must be discriminative, to distinguish one person from another, and compact, to keep processing costs low and enable the building to adapt itself. Given these constraints, the characteristics we model are the timbre of the speaker's voice and his or her clothing appearance in terms of colorimetric distribution. The scientific contributions of this thesis thus lie at the intersection of the speech-processing and computer-vision communities, introducing methods for fusing audio and visual signatures of individuals. To achieve this fusion, new sound-source localization cues and an audiovisual adaptation of a multi-target tracking method were introduced; these represent the main contributions of this work. The thesis is structured in four chapters. The first presents the state of the art on visual person re-identification and speaker recognition. Since the acoustic and visual modalities are uncorrelated, two signatures, one audio and one video, are computed separately using existing methods from the literature; the details of how these signatures are generated are the subject of chapter 2. The fusion of the signatures is then treated as a problem of matching audio and video observations whose corresponding detections are spatially coherent and compatible, for which two novel association strategies are introduced in chapter 3. The spatio-temporal coherence of the bimodal observations is then addressed in chapter 4, in a multi-target tracking context.
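The abstract does not specify how the two signatures are computed; as a hedged illustration of the kind of pipeline involved, the sketch below builds a speaker-timbre signature from mean MFCCs and a clothing-appearance signature from an HSV colour histogram, then scores a cross-modal association with cosine similarity. The feature choices, bin counts and fusion rule are illustrative assumptions, not the thesis's actual method.

```python
# Hypothetical sketch of compact audio/visual signatures; not the thesis's exact method.
import numpy as np
import librosa  # assumed available for MFCC extraction
import cv2      # assumed available for colour histograms

def audio_signature(wav, sr, n_mfcc=20):
    """Speaker-timbre signature: mean MFCC vector over the clip."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.mean(axis=1)

def visual_signature(bgr_image, bins=(8, 8, 4)):
    """Appearance signature: normalized HSV colour histogram of a person crop."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-9)

def similarity(a, b):
    """Cosine similarity, used here as a stand-in association score."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Toy usage with synthetic data (real inputs: a speech clip and a person crop).
a = audio_signature(np.random.randn(16000).astype(np.float32), sr=16000)
v = visual_signature(np.random.randint(0, 256, (128, 64, 3), dtype=np.uint8))
```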
|
252 |
Um ambiente de avaliação da usabilidade de software apoiado por técnicas de processamento de imagens e reconhecimento de fala / An environment to support usability evaluation using image processing and speech recognition
Coleti, Thiago Adriano, 17 December 2013 (has links)
Filming and verbalization are considered fundamental usability test methods for supporting software usability evaluation, because they allow the evaluator to collect real data about a system's interaction capacity and its influence on the user. The tests are usually performed with real users of the software, who can subject the interface to a wide range of situations not anticipated by the evaluator in the lab. Although effective, filming and verbalization are inefficient, since analyzing the collected data and identifying usability problems requires a great deal of work; previous research in the area reports analysis times of two to ten times the duration of the test itself. This work aimed to develop a computational environment that uses keyword pronunciation events and facial reactions to support the collection, analysis and identification of interfaces with potential usability problems quickly and reliably. The environment consists of an application that monitors (in the background) the use of a given program, recording keywords spoken by the participant and facial images at set time intervals.
In addition to these data, screenshots (snapshots) of the system's interfaces were recorded to indicate which interfaces were in use at the moment of a given event. After collection, the data were organized and made available to the evaluator, with highlights for events that could indicate participant dissatisfaction or potential usage problems. It was possible to conclude that verbalization events based on keywords were effective in supporting the analysis and identification of problematic interfaces, since the words were associated with classifiers indicating user satisfaction or dissatisfaction. Verbalization proved more efficient when its data were analyzed together with the facial images, which allowed a more reliable and comprehensive analysis. In this analysis, the evaluator was able to identify which system interfaces were rated negatively by the user and where the user's visual/usage focus was at the moment of the event. For analyses using keywords, with or without the images, the time spent identifying interfaces and potential problems was reduced to less than twice the test time.
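The abstract describes keyword utterances, each tied to a satisfaction or dissatisfaction classifier and paired with timestamped facial images and screenshots; a minimal sketch of that bookkeeping is given below. The keyword lexicon and record layout are assumptions for illustration, and the speech recognizer itself is treated as an external component.

```python
# Hypothetical event log pairing spoken keywords with snapshots; layout is assumed.
from dataclasses import dataclass
from datetime import datetime

# Assumed keyword lexicon mapping utterances to a satisfaction class.
KEYWORD_CLASSES = {"good": "satisfied", "easy": "satisfied",
                   "bad": "dissatisfied", "confusing": "dissatisfied"}

@dataclass
class UsabilityEvent:
    timestamp: datetime
    keyword: str          # word recognized by the (external) speech recognizer
    snapshot_path: str    # screen image captured at the event
    face_image_path: str  # facial image captured at the event

    @property
    def label(self) -> str:
        return KEYWORD_CLASSES.get(self.keyword, "neutral")

def highlights(events):
    """Return only the events an evaluator should review first."""
    return [e for e in events if e.label == "dissatisfied"]

log = [UsabilityEvent(datetime.now(), "confusing", "shot_01.png", "face_01.png")]
print([(e.timestamp, e.keyword, e.label) for e in highlights(log)])
```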
|
253 |
Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis
Fonseca De Sam Bento Ribeiro, Manuel, January 2018 (has links)
Statistical parametric speech synthesis (SPSS) has seen improvements over recent years, especially in terms of intelligibility. Synthetic speech is often clear and understandable, but it can also be bland and monotonous. Proper generation of natural speech prosody is still a largely unsolved problem. This is especially relevant in the context of expressive audiobook speech synthesis, where speech is expected to be fluid and captivating. In general, prosody can be seen as a layer that is superimposed on the segmental (phone) sequence. Listeners can perceive the same melody or rhythm in different utterances, and the same segmental sequence can be uttered with a different prosodic layer to convey a different message. For this reason, prosody is commonly accepted to be inherently suprasegmental. It is governed by longer units within the utterance (e.g. syllables, words, phrases) and beyond the utterance (e.g. discourse). However, common techniques for the modeling of speech prosody - and speech in general - operate mainly on very short intervals, either at the state or frame level, in both hidden Markov model (HMM) and deep neural network (DNN) based speech synthesis. This thesis presents contributions supporting the claim that stronger representations of suprasegmental variation are essential for the natural generation of fundamental frequency in statistical parametric speech synthesis. We conceptualize the problem by dividing it into three sub-problems: (1) representations of acoustic signals, (2) representations of linguistic contexts, and (3) the mapping of one representation to another. The contributions of this thesis provide novel methods and insights relating to these three sub-problems. For sub-problem 1, we propose a multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform, as well as a wavelet-based decomposition strategy that is linguistically and perceptually motivated. For sub-problem 2, we investigate additional linguistic features such as text-derived word embeddings and syllable bag-of-phones, and we propose a novel method for learning word vector representations based on acoustic counts. Finally, for sub-problem 3, insights are given regarding hierarchical models such as parallel and cascaded deep neural networks.
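As a hedged illustration of a fixed-length suprasegmental representation of the kind argued for above, the sketch below encodes a variable-length syllable-level f0 contour with a few DCT coefficients and reconstructs a smooth approximation (the thesis also uses the continuous wavelet transform, omitted here); the coefficient count is an illustrative choice.

```python
# Sketch: fixed-length DCT encoding of a syllable-level f0 contour (assumed setup).
import numpy as np
from scipy.fft import dct, idct

def encode_f0(f0_contour, n_coef=5):
    """Compress a (voiced, interpolated) f0 contour to n_coef DCT coefficients."""
    coefs = dct(np.asarray(f0_contour, dtype=float), norm="ortho")
    return coefs[:n_coef]

def decode_f0(coefs, length):
    """Reconstruct a smooth contour of the original length from the coefficients."""
    full = np.zeros(length)
    full[:len(coefs)] = coefs
    return idct(full, norm="ortho")

contour = 120 + 20 * np.sin(np.linspace(0, np.pi, 30))  # toy rising-falling syllable
approx = decode_f0(encode_f0(contour), len(contour))    # low-order reconstruction
```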
|
255 |
[en] MODIFIED INTERPOLATION OF LSFs / [pt] INTERPOLAÇÃO MODIFICADA DE LSFs
CARLOS ROBERTO DA COSTA FERREIRA, 25 October 2006 (has links)
[en] New telecommunications services have driven the development of improvements in speech coding algorithms, owing to the need to improve coded speech quality while using the lowest possible transmission rate. This thesis analyzes and proposes improvements to a method for adjusting LSF parameters so that they become more accurate, minimizing the losses in the interpolation of coded LSFs. As a result, the perceptual quality of the speech synthesized at the decoder output is increased without any increase in transmission rate. The full mathematical derivation of the method is presented in detail. To evaluate the performance of the proposed improvements, the adjustment process is implemented in a speech coder operating at mean rates below 2 kb/s. The results confirm that a significant reduction in distortion measures can be obtained using the LSF adjustment.
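The thesis's adjustment method is not reproduced in the abstract; the sketch below shows only the baseline operation it improves upon, plain linear interpolation of LSF vectors between adjacent frames, with the subframe weighting chosen as an illustrative assumption.

```python
# Sketch of plain per-subframe LSF interpolation (the baseline the thesis adjusts).
import numpy as np

def interpolate_lsf(lsf_prev, lsf_curr, n_subframes=4):
    """Linearly interpolate between the LSF vectors of two adjacent frames."""
    lsf_prev, lsf_curr = np.asarray(lsf_prev), np.asarray(lsf_curr)
    out = []
    for k in range(n_subframes):
        w = (k + 1) / n_subframes            # weight moves from prev to curr
        out.append((1.0 - w) * lsf_prev + w * lsf_curr)
    return np.stack(out)                     # (n_subframes, order)
```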
|
256 |
Perceptual features for speech recognition
Haque, Serajul, January 2008 (has links)
Automatic speech recognition (ASR) is one of the most important research areas in speech technology, referring to the recognition of speech by a machine or artificial intelligence. However, in spite of focused research in this field over the past several decades, robust speech recognition with high reliability has not been achieved, as performance degrades in the presence of speaker variability, channel mismatch conditions, and noisy environments. The superb ability of the human auditory system has motivated researchers to include features of human perception in the speech recognition process. This dissertation investigates the roles of perceptual features of human hearing in automatic speech recognition in clean and noisy environments. Methods of simplified synaptic adaptation and two-tone suppression by companding are introduced through temporal processing of speech using a zero-crossing algorithm. It is observed that a high-frequency enhancement technique such as synaptic adaptation performs better in stationary Gaussian white noise, whereas a low-frequency enhancement technique such as two-tone suppression performs better in non-Gaussian, non-stationary noise types. The effects of static compression on ASR parametrization are investigated, as observed in psychoacoustic input/output (I/O) perception curves. A frequency-dependent asymmetric compression technique, that is, higher compression in the higher frequency regions than in the lower frequency regions, is proposed; by compressing asymmetrically, degradation of the spectral contrast of the low-frequency formants due to the added compression is avoided. A novel feature extraction method for ASR based on auditory processing in the cochlear nucleus is presented. The processing for synchrony detection, average discharge (mean rate) processing and two-tone suppression is segregated and performed separately at the feature extraction level, according to the differential processing scheme observed in the AVCN, PVCN and DCN, respectively, of the cochlear nucleus. It is further observed that improved ASR performance can be achieved by separating the synchrony detection from the synaptic processing. A time-frequency perceptual spectral subtraction method based on several psychoacoustic properties of human audition is developed and evaluated with an ASR front-end, with an auditory masking threshold determined from these psychoacoustic effects. It is observed that in speech recognition applications, spectral subtraction utilizing psychoacoustics may be used for improved performance in noisy conditions. The performance may be further improved if masking of noise by the tonal components is augmented by spectral subtraction in the masked region.
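The dissertation's masking threshold is derived from several psychoacoustic effects; as a hedged stand-in, the sketch below implements plain magnitude spectral subtraction with a fixed over-subtraction factor and spectral floor in place of the psychoacoustic threshold.

```python
# Sketch of magnitude spectral subtraction with a simple floor; the thesis's
# psychoacoustic masking threshold is replaced here by a fixed over-subtraction
# factor and spectral floor (illustrative assumptions).
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, noise_estimate, fs, alpha=2.0, floor=0.02):
    f, t, X = stft(noisy, fs=fs, nperseg=256)
    _, _, N = stft(noise_estimate, fs=fs, nperseg=256)
    noise_mag = np.abs(N).mean(axis=1, keepdims=True)   # average noise spectrum
    mag = np.abs(X) - alpha * noise_mag                 # over-subtraction
    mag = np.maximum(mag, floor * np.abs(X))            # spectral floor
    _, clean = istft(mag * np.exp(1j * np.angle(X)), fs=fs, nperseg=256)
    return clean

# Toy usage: a tone in white noise, with a noise-only segment as the estimate.
fs = 16000
noise = 0.05 * np.random.randn(fs)
noisy = np.sin(2 * np.pi * 440 * np.arange(fs) / fs) + 0.05 * np.random.randn(fs)
enhanced = spectral_subtract(noisy, noise, fs)
```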
|
257 |
Low-Power Audio Input Enhancement for Portable Devices
Yoo, Heejong, 13 January 2005 (has links)
With the development of VLSI and wireless communication technology, portable devices such as personal digital assistants (PDAs), pocket PCs, and mobile phones have gained a lot of popularity. Many such devices incorporate a speech recognition engine, enabling users to interact with the devices using voice-driven commands and text-to-speech synthesis.
The power consumption of DSP microprocessors has been consistently decreasing by half about every 18 months, following Gene's law. The capacity of signal processing, however, is still significantly constrained by the limited power budget of these portable devices. In addition, analog-to-digital (A/D) converters can also limit the signal processing of portable devices. Many systems require very high-resolution and high-performance A/D converters, which often consume a large fraction of the limited power budget of portable devices.
The proposed research develops a low-power audio signal enhancement system that combines programmable analog signal processing and traditional digital signal processing. By utilizing analog signal processing based on floating-gate transistor technology, the power consumption of the overall system as well as the complexity of the A/D converters can be reduced significantly. The system can be used as a front end of portable devices in which enhancement of audio signal quality plays a critical role in automatic speech recognition. The proposed system performs background audio noise suppression in a continuous-time domain using analog computing elements and acoustic echo cancellation in a discrete-time domain using an FPGA.
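The abstract does not detail the FPGA echo canceller; a normalized LMS (NLMS) adaptive filter of the kind commonly used for acoustic echo cancellation is sketched below as a generic stand-in, with filter length and step size as illustrative assumptions.

```python
# Sketch: NLMS adaptive filter for acoustic echo cancellation (generic, assumed
# parameters; the thesis implements its canceller on an FPGA).
import numpy as np

def nlms_echo_cancel(far_end, mic, taps=128, mu=0.5, eps=1e-6):
    """Subtract the estimated echo of far_end from the microphone signal."""
    w = np.zeros(taps)                     # adaptive echo-path estimate
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]      # most recent reference samples
        e = mic[n] - w @ x                 # error = mic minus estimated echo
        out[n] = e
        w += (mu / (x @ x + eps)) * e * x  # normalized LMS update
    return out

# Toy usage: mic picks up a delayed, attenuated copy of the far-end signal.
far = np.random.randn(8000)
mic = np.convolve(far, [0, 0, 0.4], mode="full")[:8000] + 0.01 * np.random.randn(8000)
residual = nlms_echo_cancel(far, mic)      # echo largely removed after adaptation
```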
|
258 |
Melizmų sintezė dirbtinių neuronų tinklais / Melisma synthesis using artificial neural networks
Leonavičius, Romas, January 2006 (has links)
Modern methods of speech synthesis are not suitable for the restoration of song signals, because the resulting sounds lack vitality and intonation. The aim of the presented work is to synthesize the melismas met in Lithuanian folk songs by applying artificial neural networks. An analytical survey of rather widespread literature is presented: a first classification and comprehensive discussion of melismas are given, the theory of dynamic systems that forms the basis for studying melismas is presented, and the relationship for modeling a melisma with nonlinear dynamic systems is outlined. The most widely used Linear Predictive Coding (LPC) method is investigated along with possibilities for its improvement, and a modification of the original method based on dynamic LPC frame positioning is proposed; on its basis, a new melisma synthesis technique is presented. A flexible generalized melisma model is developed, based on two artificial neural networks (a multilayer perceptron and an Adaline) and on two network training algorithms (Levenberg-Marquardt and least-squares error minimization). Moreover, original mathematical models of the Fortis, Gruppett, Mordent and Trill ornaments are created, fit for synthesizing melismas, and their minimal sizes are proposed. The last chapter concerns the experimental investigation, using over 500 melisma records, and corroborates the application of the new mathematical models to melisma synthesis of one [...].
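The thesis's dynamic frame-positioning modification is not described in the abstract; the sketch below shows only the classical autocorrelation-method LPC analysis of a single frame (Levinson-Durbin recursion) that such a method builds on.

```python
# Sketch: classic autocorrelation-method LPC for one analysis frame; the thesis
# modifies where such frames are positioned, which is not reproduced here.
import numpy as np

def lpc(frame, order=12):
    """Return all-pole coefficients [1, a1, ..., a_order] via Levinson-Durbin."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # r[0..order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / (err + 1e-12)           # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)               # prediction-error update
    return a

frame = np.sin(2 * np.pi * 0.07 * np.arange(240)) * np.hamming(240)
coeffs = lpc(frame)                        # toy windowed frame, order-12 model
```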
|
259 |
Optimizing text-independent speaker recognition using an LSTM neural network
Larsson, Joel, January 2014 (has links)
In this paper a novel speaker recognition system is introduced. With the advances in computer science, automated speaker recognition has become increasingly popular as an aid in crime investigations and authorization processes. Here, a recurrent neural network approach is used to learn to identify ten speakers within a set of 21 audio books. Audio signals are processed via spectral analysis into Mel Frequency Cepstral Coefficients that serve as speaker-specific features, which are input to the neural network. The Long Short-Term Memory algorithm is examined for the first time within this area, with interesting results. Experiments are performed to find the optimal network model for the problem. These show that the network learns to identify the speakers well, text-independently, when the recording situation is the same. However, the system has problems recognizing speakers across different recordings, which is probably due to the noise sensitivity of the speech processing algorithm in use.
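As a hedged illustration of the MFCC-to-LSTM pairing named in the abstract, the sketch below defines a ten-way speaker classifier in PyTorch; the 2014 thesis predates PyTorch, so the framework, layer sizes and hyperparameters are all stand-in assumptions.

```python
# Sketch: MFCC sequence -> LSTM -> 10-way speaker classifier (assumed sizes).
import torch
import torch.nn as nn

class SpeakerLSTM(nn.Module):
    def __init__(self, n_mfcc=13, hidden=64, n_speakers=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_speakers)

    def forward(self, mfcc_seq):            # (batch, frames, n_mfcc)
        _, (h_n, _) = self.lstm(mfcc_seq)   # final hidden state summarizes the clip
        return self.head(h_n[-1])           # (batch, n_speakers) logits

model = SpeakerLSTM()
logits = model(torch.randn(2, 200, 13))     # two toy clips of 200 MFCC frames
```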
|
260 |
An ensemble speaker and speaking environment modeling approach to robust speech recognition
Tsao, Yu, 18 November 2008 (has links)
In this study, an ensemble speaker and speaking environment modeling (ESSEM) approach is proposed to characterize environments in order to enhance the performance robustness of automatic speech recognition (ASR) systems under adverse conditions. The ESSEM process comprises two stages, an offline and an online phase. In the offline phase, we prepare an ensemble speaker and speaking environment space formed by a collection of super-vectors. Each super-vector consists of the entire set of means from all the Gaussian mixture components of a set of hidden Markov models that characterizes a particular environment. In the online phase, with the ensemble environment space prepared in the offline phase, we estimate the super-vector for a new testing environment based on a stochastic matching criterion. A series of techniques is proposed to further improve the original ESSEM approach in both the offline and online phases. For the offline phase, we focus on methods to enhance the construction and coverage of the environment space. We first present environment clustering and environment partitioning algorithms to structure the environment space; we then propose a discriminative training algorithm to enhance discrimination across environment super-vectors and thereby broaden the coverage of the ensemble environment space. For the online phase, we study methods to increase the efficiency and precision of estimating the target super-vector for the testing condition. To enhance efficiency, we incorporate dimensionality reduction techniques to reduce the complexity of the original environment space. To improve precision, we first study different forms of mapping function and propose a weighted N-best information technique; we then propose cohort selection, environment space adaptation and multiple cluster matching algorithms to facilitate the environment characterization. We evaluate the proposed ESSEM framework on the Aurora-2 connected digit recognition task. Experimental results verify that the original ESSEM approach already provides a clear improvement over a baseline system without environment compensation. Moreover, the performance of ESSEM can be further enhanced by using the proposed offline and online algorithms. A significant improvement of 16.08% word error rate reduction is achieved by ESSEM with the optimal offline and online configuration over our best baseline system on the Aurora-2 task.
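The online estimation in ESSEM can be pictured as interpolating among ensemble super-vectors; the toy least-squares combination below is a hedged illustration of that idea, not the thesis's stochastic matching criterion.

```python
# Sketch: estimate a testing-environment super-vector as a linear combination of
# ensemble super-vectors (toy least-squares stand-in for stochastic matching).
import numpy as np

def estimate_supervector(ensemble, target_stats):
    """ensemble: (n_envs, dim) matrix of environment super-vectors.
    target_stats: (dim,) statistics gathered from the new environment."""
    # Solve min_w ||ensemble.T @ w - target_stats||^2 for combination weights w.
    w, *_ = np.linalg.lstsq(ensemble.T, target_stats, rcond=None)
    return ensemble.T @ w                  # interpolated super-vector

envs = np.random.randn(8, 1000)            # 8 toy training environments
target = np.random.randn(1000)             # toy adaptation statistics
sv = estimate_supervector(envs, target)
```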
|