About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

The Effect of Nonlinear Source-Filter Interaction on Aerodynamic Measures in a Synthetic Model of the Vocal Folds and Vocal Tract

May, Nicholas A. 01 June 2022 (has links)
No description available.
2

Characterization of the Voice Source by the DCT for Speaker Information

Abhiram, B January 2014 (has links) (PDF)
Extracting speaker-specific information from speech is of great interest to researchers and developers alike, since speaker recognition technology finds application in a wide range of areas, primary among them being forensics and biometric security systems. Several models and techniques have been employed to extract speaker information from the speech signal. Speech production is generally modeled as an excitation source followed by a filter. Physiologically, the source corresponds to the vocal fold vibrations and the filter corresponds to the spectrum-shaping vocal tract. Vocal tract-based features like the mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstral coefficients have been shown to contain speaker information. However, high-speed videos of the larynx show that the vocal folds of different individuals vibrate differently. Voice source (VS)-based features have also been shown to perform well in speaker recognition tasks, thereby revealing that the VS does contain speaker information. Moreover, a combination of vocal tract and VS-based features has been shown to give improved performance, showing that the latter contain supplementary speaker information. In this study, the focus is on extracting speaker information from the VS. The existing techniques for this are reviewed, and it is observed that features obtained by fitting a time-domain model to the VS perform more poorly than those obtained by simple transformations of the VS. Here, an attempt is made to propose an alternative way of characterizing the VS to extract speaker information, and to study the merits and shortcomings of the proposed speaker-specific features. The VS cannot be measured directly. Thus, to characterize the VS, we first need an estimate of it, and the integrated linear prediction residual (ILPR) extracted from the speech signal is used as the VS estimate in this study.
The voice source linear prediction model, which was proposed in an earlier study to obtain the ILPR, is used in this work. It is hypothesized here that a speaker's voice may be characterized by the relative proportions of the harmonics present in the VS. The pitch-synchronous discrete cosine transform (DCT) is shown to capture these proportions, and the gross shape of the ILPR, in a few coefficients. The ILPR, and hence its DCT coefficients, are visually observed to distinguish between speakers. However, they also show intra-speaker variability, and thus it is hypothesized that the distribution of the DCT coefficients may capture speaker information; this distribution is modeled by a Gaussian mixture model (GMM). The DCT coefficients of the ILPR (termed the DCTILPR) are directly used as a feature vector in speaker identification (SID) tasks. Issues related to the GMM, like the type of covariance matrix, are studied, and it is found that diagonal covariance matrices perform better than full covariance matrices. Thus, mixtures of Gaussians having diagonal covariances are used as speaker models, and by conducting SID experiments on three standard databases, it is found that the proposed DCTILPR features fare comparably with the existing VS-based features. It is also found that the gross shape of the VS contains most of the speaker information, whereas the very fine structure of the VS does not help in distinguishing speakers and instead leads to more confusion between them. The major drawbacks of the DCTILPR are session and handset variability, but these are also present in the existing state-of-the-art speaker-specific VS-based features and the MFCCs, and hence seem to be common problems. There are techniques to compensate for these variabilities, which need to be used when systems using these features are deployed in an actual application.
The DCTILPR is found to improve the SID accuracy of a system trained with MFCC features by 12%, indicating that the DCTILPR captures speaker information missed by the MFCCs. A combination of MFCC and DCTILPR features on a speaker verification task also gives a significant performance improvement in the case of short test utterances. Thus, on the whole, this study proposes an alternative way of extracting speaker information from the VS, and adds to the evidence that the VS contains speaker information.
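The pipeline this abstract describes — inverse filter the frame with linear prediction, integrate the residual to obtain an ILPR-like source estimate, then keep a few DCT coefficients — can be sketched roughly as follows. This is an illustrative reconstruction, not the thesis code: the LP order, the leaky-integrator constant, and the number of retained coefficients are all assumed values.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.signal.windows import hann
from scipy.fft import dct

def inverse_filter_coeffs(frame, order=12):
    """Autocorrelation-method linear prediction: solve the normal
    equations for the inverse filter A(z) = 1 - sum_k a_k z^-k."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

def dct_ilpr_features(frame, order=12, n_coeffs=20):
    """DCT of an integrated LP residual over one analysis frame."""
    residual = lfilter(inverse_filter_coeffs(frame, order), [1.0], frame)
    # leaky integration of the residual approximates the ILPR
    ilpr = lfilter([1.0], [1.0, -0.99], residual)
    return dct(ilpr * hann(len(ilpr)), norm="ortho")[:n_coeffs]
```

In the thesis the DCT is applied pitch-synchronously (one glottal cycle per frame); here a fixed-length frame stands in for a detected cycle.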
3

Zeros of the z-transform (ZZT) representation and chirp group delay processing for the analysis of source and filter characteristics of speech signals

Bozkurt, Baris 27 October 2005 (has links)
This study proposes a new spectral representation called the Zeros of the Z-Transform (ZZT), which is an all-zero representation of the z-transform of the signal. In addition, new chirp group delay processing techniques are developed for the analysis of resonances of a signal. The combination of the ZZT representation with the chirp group delay processing algorithms provides a useful domain in which to study the resonance characteristics of the source and filter components of speech. Using the two representations, effective algorithms are developed for: source-tract decomposition of speech, glottal flow parameter estimation, formant tracking and feature extraction for speech recognition. The ZZT representation is mainly important for theoretical studies: studying the ZZT of a signal is essential to be able to develop effective chirp group delay processing methods. Therefore, the ZZT representation of the source-filter model of speech is first studied to provide a theoretical background. We confirm through the ZZT representation that the anti-causality of the glottal flow signal introduces mixed-phase characteristics into speech signals. The ZZT of windowed speech signals is also studied, since windowing cannot be avoided in practical signal processing algorithms and its effect on the ZZT representation is drastic. We show that separate patterns exist in the ZZT representations of windowed speech signals for the glottal flow and vocal tract contributions. A decomposition method for source-tract separation is developed based on these patterns in the ZZT. We define the chirp group delay as the group delay calculated on a circle other than the unit circle in the z-plane. The need to compute the group delay on such a circle comes from the fact that group delay spectra are often very noisy and cannot easily be processed for formant tracking purposes (the reasons are explained through the ZZT representation).
In this thesis, we propose methods to avoid such problems by modifying the ZZT of a signal and further computing the chirp group delay spectrum. New algorithms based on processing of the chirp group delay spectrum are developed for formant tracking and feature estimation for speech recognition. The proposed algorithms are compared to state-of-the-art techniques. Equivalent or higher efficiency is obtained for all proposed algorithms. The theoretical parts of the thesis further discuss a mixed-phase model for speech and phase processing problems in detail.
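The two core objects here are compact enough to state in code: the ZZT of a frame is simply the set of roots of the polynomial whose coefficients are the frame's samples, and the chirp group delay evaluates the group delay on a circle |z| = ρ by exponentially weighting the frame before the DFT. A minimal sketch (illustrative only; the radius and FFT size are assumed values, and the thesis's careful window handling is omitted):

```python
import numpy as np

def zzt(frame):
    """Zeros of the z-transform of a frame.

    X(z) = sum_n x[n] z^{-n} factors as x[0] z^{-(N-1)} prod_k (z - Z_k),
    so the ZZT is just the root set of the coefficient sequence."""
    frame = np.trim_zeros(frame)  # leading/trailing zeros only shift or kill roots
    return np.roots(frame)

def chirp_group_delay(frame, radius=1.05, n_fft=512):
    """Group delay evaluated on the circle |z| = radius.

    Weighting x[n] by radius^{-n} turns the ordinary DFT into an
    evaluation of X(z) on that circle; the group delay then follows
    from the standard identity tau(w) = Re{ DFT(n*x) / DFT(x) }."""
    n = np.arange(len(frame))
    x = frame * radius ** (-n.astype(float))
    X = np.fft.rfft(x, n_fft)
    dX = np.fft.rfft(x * n, n_fft)
    return np.real(dX / X)
```

Moving the evaluation circle away from the unit circle (radius > 1 here) is exactly what smooths the spikes that zeros close to the unit circle cause in ordinary group delay spectra.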
4

Expressive sampling synthesis. Learning extended source-filter models from instrument sound databases for expressive sample manipulations

Hahn, Henrik 30 September 2015 (has links)
Within this thesis an imitative sound synthesis system will be introduced that is applicable to most quasi-harmonic instruments. The system is based on single-note recordings that represent a quantized version of an instrument's possible timbre space with respect to its pitch and intensity dimensions. A transformation method then allows sound signals to be rendered with continuous values of the expressive control parameters that are perceptually coherent with their acoustic equivalents. A parametric instrument model is presented, based on an extended source-filter model with separate manipulations of a signal's harmonic and residual components. A subjective evaluation procedure will be shown that assesses a variety of transformation results by direct comparison with unmodified recordings, to determine how perceptually close the synthesis results are to their respective acoustic correlates.
5

The Voice Source in Speech Communication - Production and Perception Experiments Involving Inverse Filtering and Synthesis

Gobl, Christer January 2003 (has links)
This thesis explores, through a number of production and perception studies, the nature of the voice source signal and how it varies in spoken communication. Research is also presented that deals with the techniques and methodologies for analysing and synthesising the voice source. The main analytic technique involves interactive inverse filtering for obtaining the source signal, which is then parameterised to permit the quantification of source characteristics. The parameterisation is carried out by means of model matching, using the four-parameter LF model of differentiated glottal flow. The first three analytic studies focus on segmental and suprasegmental determinants of source variation. As part of the prosodic variation of utterances, focal stress shows for the glottal excitation an enhancement between the stressed vowel and the surrounding consonants. At a segmental level, the voice source characteristics of a vowel show potentially major differences as a function of the voiced/voiceless nature of an adjacent stop. Cross-language differences in the extent and directionality of the observed effects suggest different underlying control strategies in terms of the timing of the laryngeal and supralaryngeal gestures, as well as in the laryngeal tension settings. Different classes of voiced consonants also show differences in source characteristics: here the differences are likely to be passive consequences of the aerodynamic conditions that are inherent to the consonants. Two further analytic studies present voice source correlates for six different voice qualities as defined by Laver's classification system. Data from stressed and unstressed contexts clearly show that the transformation from one voice quality to another does not simply involve global changes of the source parameters. As well as providing insights into these aspects of speech production, the analytic studies provide quantitative measures useful in technology applications, particularly in speech synthesis.
The perceptual experiments use the LF source implementation in the KLSYN88 synthesiser to test some of the analytic results and to harness them to explore the paralinguistic dimension of speech communication. A study of the perceptual salience of different parameters associated with breathy voice indicates that the source spectral slope is critically important and that, surprisingly, aspiration noise contributes relatively little. Further perceptual tests using stimuli with different voice qualities explore the mapping between voice quality and its paralinguistic function of expressing emotion, mood and attitude. The results of these studies highlight the crucial role of voice quality in expressing affect, as well as providing pointers to how it combines with f0 for this purpose. The last section of the thesis focuses on the techniques used for the analysis and synthesis of the source. A semi-automatic method for inverse filtering is presented, which is novel in that it optimises the inverse filter by exploiting the knowledge that is typically used by the experimenter when carrying out manual interactive inverse filtering. A further study looks at the properties of the modified LF model in the KLSYN88 synthesiser: it highlights how it differs from the standard LF model and discusses the implications for synthesising the glottal source signal from LF model data. Effective and robust source parameterisation for the analysis of voice quality is the topic of the final paper: the effectiveness of global, amplitude-based source parameters is examined across speech tokens with large differences in f0. Additional amplitude-based parameters are proposed to enable a more detailed characterisation of the glottal pulse. Keywords: voice source dynamics, glottal source parameters, source-filter interaction, voice quality, phonation, perception, affect, emotion, mood, attitude, paralinguistic, inverse filtering, knowledge-based, formant synthesis, LF model, fundamental frequency, f0.
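The four-parameter LF model of differentiated glottal flow on which the parameterisation rests can be sketched in simplified form: an exponentially growing sinusoid up to the main excitation instant te, followed by an exponential return phase. The sketch below fixes the opening branch's scale so that it reaches -Ee at te, and it omits the full LF model's zero-net-flow constraint and implicit parameter equations; all numeric parameter values are illustrative, not taken from the thesis.

```python
import numpy as np

def lf_pulse(n=160, te=0.6, tp=0.45, ta=0.02, Ee=1.0):
    """One cycle of a simplified LF pulse (differentiated glottal flow).

    te, tp, ta are fractions of the period. Opening branch:
    exp(a*t)*sin(pi*t/tp), with a chosen so the branch reaches -Ee
    exactly at t = te; return branch: -Ee*exp(-(t - te)/ta), which is
    continuous with the opening branch at te."""
    t = np.arange(n) / n                       # normalized time in [0, 1)
    wg = np.pi / tp
    # solve exp(a*te)*sin(wg*te) = -Ee for a (sin(wg*te) < 0 since te > tp)
    a = np.log(-Ee / np.sin(wg * te)) / te
    return np.where(t <= te,
                    np.exp(a * t) * np.sin(wg * t),
                    -Ee * np.exp(-(t - te) / ta))
```

The negative peak at te is the main excitation of the vocal tract; shrinking ta sharpens the return phase, which is one of the spectral-slope controls the perceptual studies manipulate.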
6

Nonlinear Interactive Source-filter Model For Voiced Speech

Koc, Turgay 01 October 2012 (has links) (PDF)
The linear source-filter model (LSFM) has been used as a primary model for speech processing since 1960, when G. Fant presented his acoustic theory of speech production. It assumes that the source of voiced speech sounds, the glottal flow, is independent of the filter, the vocal tract. However, acoustic simulations based on physical speech production models show that, especially when the fundamental frequency (F0) of the source harmonics approaches the first formant frequency (F1) of the vocal tract filter, the filter has significant effects on the source due to the nonlinear coupling between them. In this thesis, as an alternative to the linear source-filter model, nonlinear interactive source-filter models (ISFMs) are proposed for voiced speech. The thesis has two parts. In the first part, a framework for the coupling of the source and the filter is presented. Two interactive system models are then proposed, assuming that the glottal flow is a quasi-steady Bernoulli flow and that the acoustics of the vocal tract are linear. In these models, instead of the glottal flow, the glottal area is used as the source of voiced speech. The relation between the glottal flow, the glottal area and the vocal tract is determined by the quasi-steady Bernoulli flow equation. It is theoretically shown that the linear source-filter model is an approximation of the nonlinear models. Estimating an ISFM's parameters from the speech signal alone is a nonlinear blind deconvolution problem. The problem is solved by a robust method developed from the acoustical interpretation of the systems. Experimental results show that the ISFMs reproduce the source-filter coupling effects seen in physical simulations, and that the parameter estimation method always produces stable models that outperform the LSFM. In addition, a framework for incorporating source-filter interaction into the classical source-filter model is presented.
The Rosenberg source model is extended to an interactive source for voiced speech and its performance is evaluated on a large speech database. The results of experiments conducted on the vowels in the database show that the interactive Rosenberg model is consistently better than its noninteractive version. In the second part of the thesis, the LSFM and ISFMs are compared using not only the speech signal but also high-speed endoscopic video (HSV) of the vocal folds in a system identification approach. In this case, HSV and speech are used as reference input-output data for the analysis and comparison of the models. First, a new robust HSV processing algorithm is developed and applied to the HSV images to extract the glottal area. Then, the system parameters are estimated using a modified version of the method proposed in the first part. The experimental results show that the speech signal can contain harmonics of the glottal area's fundamental frequency beyond those present in the glottal area signal itself. The proposed nonlinear interactive source-filter models can generate these harmonic components in speech and produce more realistic speech sounds than the LSFM.
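The quasi-steady Bernoulli relation at the heart of such interactive models can be illustrated with a toy computation: without interaction the flow through a glottal area A is U = A·sqrt(2·Ps/ρ); adding a purely resistive vocal-tract load R makes U implicit, U = A·sqrt(2·(Ps − R·U)/ρ), solvable by fixed-point iteration per sample. This is a deliberately simplified stand-in for the thesis's models — the resistive load and all constants (cgs units) are assumptions for illustration only.

```python
import numpy as np

RHO = 1.2e-3  # air density in g/cm^3 (cgs units)

def glottal_flow_bernoulli(area, ps=8000.0, load=None):
    """Quasi-steady Bernoulli flow through a time-varying glottal area.

    area: glottal area samples in cm^2; ps: subglottal pressure in dyn/cm^2.
    With load=None the source is uncoupled (linear source-filter view).
    With a resistive load R, the transglottal pressure drops by R*U,
    and U = A*sqrt(2*(Ps - R*U)/rho) is solved by fixed-point iteration."""
    u_free = area * np.sqrt(2.0 * ps / RHO)
    if load is None:
        return u_free
    u = np.zeros_like(u_free)
    for i, a in enumerate(area):
        x = u_free[i]
        for _ in range(50):  # contraction mapping for realistic magnitudes
            x = a * np.sqrt(max(2.0 * (ps - load * x) / RHO, 0.0))
        u[i] = x
    return u
```

The coupled flow is always smaller than the uncoupled flow for an open glottis, which is the flattening-and-skewing effect of vocal-tract loading that the linear model ignores.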
8

Model-based synthesis of singing

Zeng, Xiaofeng January 2023 (has links)
The legacy KTH Music and Singing Synthesis Equipment (MUSSE) system, developed decades ago, is no longer compatible with contemporary computer systems. Nonetheless, the fundamental synthesis model at its core, known as the source-filter model, continues to be a valuable technology in the research field of voice synthesis. In this thesis, the author re-implemented the legacy system with the traditional source-filter model on the modern platform SuperCollider. This re-implementation brought substantial enhancements in functionality, flexibility and performance. The most noteworthy improvement introduced in the new system is the addition of notch filters, which are able to simulate anti-resonances in the human vocal tract, thereby allowing a broader range of vocal nuances to be reproduced. To demonstrate the significance of notches in vowel synthesis, a subjective auditory experiment was conducted. The results of this experiment clearly show that vowels synthesized with notches sound much more natural and closer to a real human voice. The work presented in this thesis, the new MUSSE program with notch filters, will serve as a foundation to support general acoustics research at TMH in the future.
9

Voice quality analysis applied to expressive speech

Sturmel, Nicolas 02 March 2011 (has links)
Analysis of speech signals is a good way of understanding how the voice is produced, but it is also important as a way of describing new parameters in order to define the perception of voice quality. This study focuses on expressive speech, where voice quality varies a lot and is explicitly linked to the expressivity or intention of the speaker. In order to define those links, one has to be able to estimate a high number of parameters of the speech production model, but also be able to decompose the speech signal into each of the parts that contribute to this model. The work presented in this thesis therefore addresses the segmentation of speech signals, their decomposition, and the estimation of the voice production model parameters. First, multi-scale analysis of speech signals is studied. Using the LoMA method, which traces lines across scales following the amplitude maxima of the time-domain responses of a wavelet filter bank, it is possible to detect a number of features of voiced speech, namely: the glottal closing instants, the energy associated with each glottal cycle and its spectral distribution, and the open quotient of the glottal cycle (by observing the phase delay of the first harmonic). This method is then tested on both synthetic and real speech. Secondly, harmonic plus noise decomposition of speech signals is studied. An existing method (PAPD, Periodic/Aperiodic Decomposition) is adapted to fundamental-frequency variations by dynamically varying the analysis window length; the new method is called PAP-A. It is then tested on a database of synthetic signals, and its sensitivity to the estimation error on the fundamental frequency (F0) is discussed. Results show that PAP-A yields better-quality decompositions than PAPD. Thirdly, the problem of source/filter deconvolution is addressed. Source/filter separation by ZZT (zeros of the Z-transform) is compared to the usual methods based on linear prediction. The ZZT is then used to estimate the parameters of the glottal source model via a simple but robust method that jointly estimates two glottal flow parameters: the open quotient and the asymmetry. The resulting method is tested and combined with the wavelet-based estimation of the open quotient. Finally, the three estimation methods are applied to a large number of files from a database covering different speaking styles. The results of this analysis are discussed in order to characterize the link between style, voice production parameter values and voice quality. In particular, we observe the clear emergence of speaking-style groups.
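The periodic/aperiodic decomposition idea can be illustrated in its simplest form, averaging over pitch cycles: the cycle mean is the periodic template, and whatever remains is the aperiodic part. PAPD and PAP-A are considerably more sophisticated (PAP-A in particular adapts the analysis window length to F0); the fixed-integer-period version below is only a sketch of the underlying principle.

```python
import numpy as np

def periodic_aperiodic(x, period):
    """Split a signal into periodic and aperiodic parts by cycle averaging.

    The periodic part is the mean over all complete cycles of length
    `period`, tiled back to the analyzed length; the aperiodic part is
    the remainder. Assumes a known, constant integer period — a
    pitch-adaptive method would re-estimate it per analysis window."""
    n_cycles = len(x) // period
    cycles = x[:n_cycles * period].reshape(n_cycles, period)
    periodic = np.tile(cycles.mean(axis=0), n_cycles)
    aperiodic = x[:n_cycles * period] - periodic
    return periodic, aperiodic
```

On a perfectly periodic signal the aperiodic part vanishes; on noisy voiced speech it isolates the breath/noise component, which is exactly why window-length errors from a bad F0 estimate degrade the decomposition.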
10

Scene capture using USB and FireWire cameras

Berka, Petr January 2008 (has links)
This diploma thesis deals with interfacing USB and FireWire cameras to a computer through the DirectShow technology. I mostly used the MontiVision development kit, which works with programming environments such as Microsoft Visual C++. The thesis shows how to use direct pixel access, how to grab single frames from a video stream, how to set up and calibrate a camera, and what a practical image-processing application can look like. Beyond the scope of my assignment, I also wrote an introduction to stereo vision. The thesis is intended as a manual for students and includes my personal experience and experiments as well.
