  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
41

Discussion On Effective Restoration Of Oral Speech Using Voice Conversion Techniques Based On Gaussian Mixture Modeling

Alverio, Gustavo 01 January 2007
Today's world offers many ways to communicate information, and speech is one of the most effective. Unfortunately, many people lose the ability to converse, which has a large negative psychological impact. In addition, skills such as lecturing and singing must then be restored by other means. Text-to-speech synthesis has been a popular way of restoring the capability to use oral speech. Text-to-speech synthesizers convert text into speech. Although these systems are useful, they offer only a few default voices, none of which represents that of the user. To achieve total restoration, voice conversion must be introduced. Voice conversion is a method that adjusts a source voice to sound like a target voice. It consists of a training process and a conversion process. Training is conducted by composing a speech corpus to be spoken by both the source and the target voice; the corpus should encompass a variety of speech sounds. Once training is finished, the conversion function is employed to transform the source voice into the target voice. Effectively, voice conversion allows a speaker to sound like any other person. Therefore, voice conversion can be applied to alter the voice output of a text-to-speech system to produce the target voice. The thesis investigates how one approach, voice conversion using Gaussian mixture modeling, can be applied to alter the voice output of a text-to-speech synthesis system. Researchers found that acceptable results can be obtained using these methods. Although voice conversion and text-to-speech synthesis are effective in restoring voice, a sample of the speaker's voice from before the voice loss must be used during the training process. It is therefore vital that voice samples are recorded in advance to combat voice loss.
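For orientation, the conversion function mentioned in this abstract is often written, in the joint-density Gaussian mixture formulation, as a posterior-weighted sum of per-component linear regressions. The block below is a sketch of that common textbook form, not necessarily the exact variant used in the thesis. All symbols are assumptions introduced here: $x$ is a source feature vector, $M$ the number of mixture components, $\alpha_i$ the mixture weights, $\mu_i^x, \mu_i^y$ the source and target means, and $\Sigma_i^{xx}, \Sigma_i^{yx}$ the source covariance and target-source cross-covariance of component $i$.

```latex
F(x) = \sum_{i=1}^{M} p(i \mid x)\,
       \Bigl[ \mu_i^{y} + \Sigma_i^{yx} \bigl(\Sigma_i^{xx}\bigr)^{-1} \bigl(x - \mu_i^{x}\bigr) \Bigr],
\qquad
p(i \mid x) = \frac{\alpha_i \, \mathcal{N}\bigl(x;\, \mu_i^{x}, \Sigma_i^{xx}\bigr)}
                   {\sum_{j=1}^{M} \alpha_j \, \mathcal{N}\bigl(x;\, \mu_j^{x}, \Sigma_j^{xx}\bigr)}
```

In this form, training estimates the mixture parameters from time-aligned source and target frames of the shared corpus, and conversion evaluates $F(x)$ frame by frame.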
42

Etude numérique et modélisation du modèle d'Euler bitempérature : point de vue cinétique. / Numerical approximation and modelling of the bitemperature Euler model : a kinetic viewpoint.

Prigent, Corentin 24 October 2019
In various domains of physics, several phenomena can be modeled by nonconservative hyperbolic systems. In particular, in plasma physics, in the study of the phenomena leading to Inertial Confinement Fusion, the bi-temperature Euler system can be used to model the transport of charged particles in a plasma. The difficulty of the mathematical study of such systems lies in the presence of so-called nonconservative products, which prevent the classical definition of weak solutions via distribution theory. To define these quantities, it is useful to supplement the hyperbolic system with an underlying kinetic model. The objective of this work is the numerical study of such kinetic systems in order to solve the bi-temperature Euler system. This manuscript is split into two parts. The first contains the numerical study of the bi-temperature Euler system. In the first chapter, the system is solved in one space dimension by means of an underlying kinetic model from plasma physics: the Vlasov-BGK-Ampère system. An asymptotic-preserving numerical method is introduced, and it is shown that the scheme obtained in the limit is consistent with a scheme for the bi-temperature Euler system. In the following chapter, the same hyperbolic model is studied in two space dimensions, this time via an underlying model of discrete-BGK type. An entropy inequality is proved for solutions coming from the kinetic model, as well as a discrete entropy dissipation inequality for the scheme. In the second part of the manuscript, we are interested in the development of numerical schemes for rarefied flows of gas mixtures. First, an adaptive kinetic scheme is introduced for inert gas mixtures. By the use of discrete conservation laws, the solution is approximated on a set of discrete velocities that depends on space, time and species. Second, an extension of the method is proposed in order to improve the efficiency of the first method. Finally, the two methods are compared to the classical fixed-grid method on a series of test cases. In the last chapter, a numerical method is proposed for rarefied flows of reacting mixtures. The setting considered is that of slow, reversible bimolecular chemical reactions. The method introduced is an implicit-explicit treatment of the relaxation operator, which is shown to be stable, linear and conservative.
43

Who Spoke What And Where? A Latent Variable Framework For Acoustic Scene Analysis

Sundar, Harshavardhan 26 March 2016
Speech is by far the most natural form of communication between human beings. It is intuitive, expressive and contains information at several cognitive levels. We, as humans, perceive several of these levels, as we can gather information pertaining to the identity of the speaker, the speaker's gender, emotion, location, the language, and so on, in addition to the content of what is being spoken. This makes speech-based human-machine interaction (HMI) both desirable and challenging for the same set of reasons. For HMI to be natural for humans, it is imperative that a machine understands the information present in speech, at least at the level of speaker identity, language, location in space, and a summary of what is being spoken. Although one can draw parallels between human-human interaction and HMI, the two differ in their purpose. We, as humans, interact with a machine mostly to get a task done more efficiently than is possible without the machine. Thus, typically in HMI, controlling the machine in a specific manner is the primary goal. In this context, it can be argued that HMI with a limited vocabulary containing specific commands would suffice for a more efficient use of the machine. In this thesis, we address the problem of "Who spoke what and where" in the context of a machine understanding the information pertaining to the identities of the speakers, their locations in space and the keywords they spoke, thus considering three levels of information: speaker identity (who), location (where) and keywords (what). This can be addressed with the help of multiple sensors such as microphones, video cameras, proximity sensors and motion detectors, combining all these modalities. However, we explore the use of only microphones to address this issue. In practical scenarios, there are often times when multiple people are talking at the same time. 
Thus, the goal of this thesis is to detect all the speakers, their keywords, and their locations in mixture signals containing speech from simultaneous speakers. Addressing this problem of "Who spoke what and where" using only microphone signals forms a part of acoustic scene analysis (ASA) of speech-based acoustic events. We divide the problem of "who spoke what and where" into two sub-problems: "who spoke what?" and "who spoke where?". Each of these problems is cast in a generic latent variable (LV) framework to capture information in speech at different levels. We associate an LV with each of these levels and model the relationship between the levels using conditional dependency. The sub-problem of "who spoke what" is addressed using a single-channel microphone signal, by modeling the mixture signal in terms of LV mass functions of speaker identity, the conditional mass function of the keyword spoken given the speaker identity, and a speaker-specific keyword model. The LV mass functions are estimated in a maximum likelihood (ML) framework using the Expectation-Maximization (EM) algorithm, with Student's-t mixture models (tMM) as speaker-specific keyword models. Motivated by HMI in a home environment, we have created our own database. In mixture signals containing two speakers uttering keywords simultaneously, the proposed framework achieves an accuracy of 82% for detecting both the speakers and their respective keywords. The other sub-problem, "who spoke where?", is addressed in two stages. In the first stage, the enclosure is discretized into sectors. The speakers and the sectors in which they are located are detected with an approach similar to the one employed for "who spoke what", using signals collected from a Uniform Circular Array (UCA). However, in place of speaker-specific keyword models, we use tMM-based speaker models trained on clean speech, along with a simple Delay and Sum Beamformer (DSB). 
In the second stage, the speakers are localized within the active sectors using a novel region-constrained localization technique based on time difference of arrival (TDOA). Since the problem being addressed is a multi-label classification task, we use the average Hamming score (accuracy) as the performance metric. Although the proposed approach yields an accuracy of 100% in an anechoic setting for detecting both the speakers and their corresponding sectors in two-speaker mixture signals, the performance degrades to an accuracy of 67% in a reverberant setting with a 60 dB reverberation time (RT60) of 300 ms. To improve the performance under reverberation, prior knowledge of the location of multiple sources is obtained using a novel technique derived from geometrical insights into TDOA estimation. With this prior knowledge, the accuracy of the proposed approach improves to 91%. It is worth noting that the accuracies are computed for mixture signals containing more than 90% overlap of competing speakers. The proposed LV framework offers a convenient methodology to represent information at broad levels. In this thesis, we have shown its use with three different levels. This can be extended to several such levels to be applicable to a generic analysis of an acoustic scene consisting of broad levels of events. It will turn out that not all levels are dependent on each other, and hence the LV dependencies can be minimized by independence assumptions, which leads to solving several smaller sub-problems, as we have shown above. The LV framework is also attractive for incorporating prior knowledge about the acoustic setting, which is combined with the evidence from the data to derive information about the presence of an acoustic event. The performance of the framework depends on the choice of stochastic models, which model the likelihood function of the data given the presence of acoustic events. 
However, it provides a means to compare and contrast the use of different stochastic models for representing the likelihood function.
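The abstract reports its results as an average Hamming score for a multi-label task. A minimal sketch of one common definition of that metric follows, assuming the per-example score is the size of the intersection of true and predicted label sets divided by the size of their union; the abstract does not spell out the exact formula, so this definition, the function name `hamming_score`, and the example labels are all illustrative assumptions.

```python
from typing import Hashable, Iterable, Sequence


def hamming_score(
    true_labels: Sequence[Iterable[Hashable]],
    pred_labels: Sequence[Iterable[Hashable]],
) -> float:
    """Average over examples of |true & pred| / |true | pred|.

    Assumed definition (multi-label "accuracy"); an example with
    empty true and predicted sets counts as a perfect match.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("true_labels and pred_labels must have equal length")
    if len(true_labels) == 0:
        raise ValueError("at least one example is required")

    total = 0.0
    for true, pred in zip(true_labels, pred_labels):
        t, p = set(true), set(pred)
        union = t | p
        total += 1.0 if not union else len(t & p) / len(union)
    return total / len(true_labels)


# Hypothetical two-speaker mixtures; each label is a (speaker, sector) pair.
truth = [
    {("spk_A", 1), ("spk_B", 4)},
    {("spk_A", 2), ("spk_C", 3)},
]
preds = [
    {("spk_A", 1), ("spk_B", 4)},  # both pairs correct
    {("spk_A", 2), ("spk_C", 5)},  # second speaker placed in the wrong sector
]
score = hamming_score(truth, preds)
```

With this definition, a mixture in which only one of the two speaker-sector pairs is detected correctly contributes 1/3 rather than 1/2, because the wrong prediction also enlarges the union.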
44

Kdy kdo mluví? / Speaker Diarization

Tomášek, Pavel January 2011
This work addresses the task of speaker diarization. The goal is to implement a system that is able to decide "who spoke when". The individual components of the implementation are described. The main parts are feature extraction, voice activity detection, speaker segmentation and clustering, and finally postprocessing. This work also presents the results of the implemented system on test data, including a description of the evaluation. The test data come from the NIST RT Evaluation 2005-2007, and the lowest error rate achieved on this dataset is 18.52% DER. The results are compared with the diarization system implemented by Marijn Huijbregts from the Netherlands, who worked on the same data in 2009 and reached 12.91% DER.
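The diarization error rate (DER) quoted here is conventionally the sum of missed speech, false-alarm speech and speaker-confusion time, divided by the total reference speech time. The sketch below shows only that arithmetic; the function name, parameter names and the durations in the example are hypothetical, and it does not reproduce the scoring details of the NIST evaluation (such as forgiveness collars or overlap handling).

```python
def diarization_error_rate(
    missed: float,
    false_alarm: float,
    confusion: float,
    total_speech: float,
) -> float:
    """Return DER as a fraction of the reference speech time.

    All arguments are durations in the same unit (for example seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    if min(missed, false_alarm, confusion) < 0:
        raise ValueError("error durations must be non-negative")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations in seconds, chosen only to illustrate the formula.
der = diarization_error_rate(
    missed=30.0, false_alarm=20.0, confusion=50.0, total_speech=1000.0
)
der_percent = 100.0 * der
```

Because false alarms are counted in the numerator but not in the denominator, DER can exceed 100% for a system that hypothesizes far more speech than the reference contains.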
