1 |
Estimation of glottal source features from the spectral envelope of the acoustic speech signal
Torres, Juan Félix, 17 May 2010
Speech communication encompasses diverse types of information, including phonetics, affective state, voice quality, and speaker identity. From a speech production standpoint, the acoustic speech signal can be mainly divided into glottal source and vocal tract components, which play distinct roles in rendering the various types of information it contains. Most deployed speech analysis systems, however, do not explicitly represent these two components as distinct entities, as their joint estimation from the acoustic speech signal becomes an ill-defined blind deconvolution problem. Nevertheless, because of the desire to understand glottal behavior and how it relates to perceived voice quality, there has been continued interest in explicitly estimating the glottal component of the speech signal. To this end, several inverse filtering (IF) algorithms have been proposed, but they are unreliable in practice because of the blind formulation of the separation problem. In an effort to develop a method that can bypass the challenging IF process, this thesis proposes a new glottal source information extraction method that relies on supervised machine learning to transform smoothed spectral representations of speech, which are already used in some of the most widely deployed and successful speech analysis applications, into a set of glottal source features. A transformation method based on Gaussian mixture regression (GMR) is presented and compared to current IF methods in terms of feature similarity, reliability, and speaker discrimination capability on a large speech corpus, and potential representations of the spectral envelope of speech are investigated for their ability to represent glottal source variation in a predictable manner. The proposed system was found to produce glottal source features that reasonably matched their IF counterparts in many cases, while being less susceptible to spurious errors. The development of the proposed method entailed a study into the aspects of glottal source information that are already contained within the spectral features commonly used in speech analysis, yielding an objective assessment regarding the expected advantages of explicitly using glottal information extracted from the speech signal via currently available IF methods, versus the alternative of relying on the glottal source information that is implicitly contained in spectral envelope representations.
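As a rough, generic illustration of the kind of spectral-to-glottal mapping this abstract describes (not the thesis's actual system or feature set), the sketch below fits a joint Gaussian mixture model over concatenated spectral-envelope and glottal-feature vectors and performs Gaussian mixture regression by taking the conditional expectation of the glottal part given a spectral input. All function names, dimensions, and parameter values are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_gmr(X, Y, n_components=8, seed=0):
    """Fit a joint GMM over concatenated [spectral | glottal] feature vectors."""
    Z = np.hstack([X, Y])
    return GaussianMixture(n_components=n_components,
                           covariance_type='full', random_state=seed).fit(Z)

def gmr_predict(gmm, X, dx):
    """Conditional expectation E[y | x] under the joint GMM.
    dx = dimensionality of the spectral (input) part of each joint vector."""
    preds = []
    for x in X:
        cond_means, resp = [], []
        for w, mu, S in zip(gmm.weights_, gmm.means_, gmm.covariances_):
            mu_x, mu_y = mu[:dx], mu[dx:]
            Sxx, Syx = S[:dx, :dx], S[dx:, :dx]
            # Responsibility of this component given the spectral input only
            resp.append(w * multivariate_normal.pdf(x, mu_x, Sxx))
            # Component-wise conditional mean of the glottal features
            cond_means.append(mu_y + Syx @ np.linalg.solve(Sxx, x - mu_x))
        resp = np.array(resp) / np.sum(resp)
        preds.append(np.sum(resp[:, None] * np.array(cond_means), axis=0))
    return np.array(preds)

# Hypothetical usage: X = smoothed spectral envelopes, Y = glottal features
# for the same frames; gmm = fit_gmr(X_train, Y_train)
# Y_hat = gmr_predict(gmm, X_test, dx=X_train.shape[1])
```

The conditional expectation weights each mixture component's linear regressor by its responsibility for the input spectrum, which is the standard way a joint GMM is turned into a regression function.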
|
2 |
Why so different? - Aspects of voice characteristics in operatic and musical theatre singing
Björkner, Eva, January 2006
This thesis addresses aspects of voice characteristics in operatic and musical theatre singing. The common aim of the studies was to identify respiratory, phonatory and resonatory characteristics accounting for salient voice timbre differences between singing styles. The velopharyngeal opening (VPO) was analyzed in professional operatic singers, using nasofiberscopy. Differing shapes of VPOs suggested that singers may use a VPO to fine-tune the vocal tract resonance characteristics and hence voice timbre. A listening test revealed no correlation between rated nasal quality and the presence of a VPO. The voice quality referred to as “throaty”, a term sometimes used for characterizing speech and “non-classical” vocalists, was examined with respect to subglottal pressure (Psub) and formant frequencies. Vocal tract shapes were determined by magnetic resonance imaging. The throaty versions of four vowels showed a typical narrowing of the pharynx. Throatiness was characterized by an increased first formant frequency and a lowering of the higher formants. Also, voice source parameter analyses suggested a hyperfunctional voice production. Female musical theatre singers typically use two vocal registers (chest and head). Voice source parameters, including closed quotient, peak-to-peak pulse amplitude, maximum flow declination rate, and normalized amplitude quotient (NAQ), were analyzed at ten equally spaced subglottal pressures representing a wide range of vocal loudness. Chest register showed higher values in all glottal parameters except for NAQ. Operatic baritone voices were analyzed in order to explore the informative power of the amplitude quotient (AQ), and its normalized version NAQ, suggested to reflect glottal adduction. Differences in NAQ were found across fundamental frequency conditions, while AQ was basically unaffected. Voice timbre differs between musical theatre and operatic singers. Measurements of voice source parameters as functions of subglottal pressure, covering a wide range of vocal loudness, showed that both groups varied Psub systematically. The musical theatre singers used somewhat higher pressures, produced higher sound pressure levels, and did not show the opera singers’ characteristic clustering of higher formants. Musical theatre and operatic singers show highly controlled and consistent behaviors, characteristic of each style. A common feature is the precise control of subglottal pressure, while laryngeal and vocal tract conditions differ between singing styles. In addition, opera singers tend to sing with a stronger voice source fundamental than musical theatre singers.
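For readers unfamiliar with the amplitude-based source measures mentioned above, a minimal sketch of their textbook definitions is given below, assuming one period of an inverse-filtered glottal flow signal is already available. This is illustrative only and not the analysis software used in the thesis.

```python
import numpy as np

def source_amplitude_measures(u, fs, f0):
    """Amplitude-based measures from one period of inverse-filtered glottal
    flow u (arbitrary flow units), sampled at fs Hz; f0 is in Hz."""
    du = np.gradient(u) * fs              # flow derivative (flow units per second)
    u_ac = u.max() - u.min()              # peak-to-peak (AC) flow pulse amplitude
    mfdr = abs(du.min())                  # maximum flow declination rate (negative peak of du)
    aq = u_ac / mfdr                      # amplitude quotient, in seconds
    naq = aq * f0                         # normalized amplitude quotient (AQ / T0)
    return {"Uac": u_ac, "MFDR": mfdr, "AQ": aq, "NAQ": naq}
```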
|
3 |
Registros vocais no canto: aspectos perceptivos, acústicos, aerodinâmicos e fisiológicos da voz modal e da voz de falsete [Vocal registers in singing: perceptual, acoustic, aerodynamic and physiological aspects of modal and falsetto voice]
Salomão, Gláucia Laís, 28 November 2008
The purpose of this study was to investigate the relationship between the production and the perception of the modal and falsetto registers in tones sung by male choir singers. A total of 104 tones were analyzed, 52 sung in modal register and 52 sung in falsetto register. The data for the analysis of vocal register production were obtained by means of inverse filtering of the sound radiated from the mouth and by means of electroglottography. The following measures were extracted: Fundamental Frequency (F0), duration of the Closed Phase, Maximum Flow Declination Rate, and Amplitude of the Alternating Current. In addition, the Closed Quotient, the Normalized Amplitude Quotient, and the level difference between the two lowest partials in the glottal source spectrum (H1-H2) were calculated. The results showed clear and systematic differences between the modal and falsetto registers. The values found for the Closed Quotient, the Maximum Flow Declination Rate and the Alternating Current Amplitude were predominantly greater in the modal than in the falsetto register. The H1-H2 and Normalized Amplitude Quotient values, on the other hand, were greater in the falsetto register than in the modal. Correlation analysis of the acoustic parameters showed a clear covariation between the Closed Quotient, the H1-H2 values, the Normalized Amplitude Quotient and the Maximum Flow Declination Rate. The data for the analysis of vocal register perception were obtained from a forced-choice test. The number of votes for modal was compared with the voice source parameter values. Tones with high values of the Closed Quotient and low values of H1-H2 and of the Normalized Amplitude Quotient were typically associated with a higher number of votes for the modal register, and vice versa. The results suggest that the acoustic manifestations of the phonatory dimension related to the vocal registers are gradient, and that their perceived characteristics become more distinct the farther apart they lie along this continuum.
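As an illustration of the H1-H2 measure used in this study (a generic sketch, not the thesis's analysis code), the level difference between the two lowest source partials can be estimated by locating the spectral peaks near f0 and 2*f0 in a windowed segment of the inverse-filtered signal. The window choice and peak-search tolerance below are arbitrary assumptions.

```python
import numpy as np

def h1_h2(source, fs, f0, tol=0.2):
    """Estimate H1 - H2 (dB) from an inverse-filtered voice source segment.
    Harmonic peaks are searched within +/- tol*f0 of f0 and 2*f0."""
    w = np.hanning(len(source))
    spec = np.abs(np.fft.rfft(source * w))
    freqs = np.fft.rfftfreq(len(source), d=1.0 / fs)

    def peak_level_db(target):
        band = (freqs > target * (1 - tol)) & (freqs < target * (1 + tol))
        return 20 * np.log10(spec[band].max() + 1e-12)

    return peak_level_db(f0) - peak_level_db(2 * f0)
```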
|
4 |
The Voice Source in Speech Communication - Production and Perception Experiments Involving Inverse Filtering and Synthesis
Gobl, Christer, January 2003
This thesis explores, through a number of production and perception studies, the nature of the voice source signal and how it varies in spoken communication. Research is also presented that deals with the techniques and methodologies for analysing and synthesising the voice source. The main analytic technique involves interactive inverse filtering for obtaining the source signal, which is then parameterised to permit the quantification of source characteristics. The parameterisation is carried out by means of model matching, using the four-parameter LF model of differentiated glottal flow.

The first three analytic studies focus on segmental and suprasegmental determinants of source variation. As part of the prosodic variation of utterances, focal stress shows for the glottal excitation an enhancement between the stressed vowel and the surrounding consonants. At a segmental level, the voice source characteristics of a vowel show potentially major differences as a function of the voiced/voiceless nature of an adjacent stop. Cross-language differences in the extent and directionality of the observed effects suggest different underlying control strategies in terms of the timing of the laryngeal and supralaryngeal gestures, as well as in the laryngeal tension settings. Different classes of voiced consonants also show differences in source characteristics: here the differences are likely to be passive consequences of the aerodynamic conditions that are inherent to the consonants. Two further analytic studies present voice source correlates for six different voice qualities as defined by Laver's classification system. Data from stressed and unstressed contexts clearly show that the transformation from one voice quality to another does not simply involve global changes of the source parameters. As well as providing insights into these aspects of speech production, the analytic studies provide quantitative measures useful in technology applications, particularly in speech synthesis.

The perceptual experiments use the LF source implementation in the KLSYN88 synthesiser to test some of the analytic results and to harness them to explore the paralinguistic dimension of speech communication. A study of the perceptual salience of different parameters associated with breathy voice indicates that the source spectral slope is critically important and that, surprisingly, aspiration noise contributes relatively little. Further perceptual tests using stimuli with different voice qualities explore the mapping between voice quality and its paralinguistic function of expressing emotion, mood and attitude. The results of these studies highlight the crucial role of voice quality in expressing affect as well as providing pointers to how it combines with f0 for this purpose.

The last section of the thesis focuses on the techniques used for the analysis and synthesis of the source. A semi-automatic method for inverse filtering is presented, which is novel in that it optimises the inverse filter by exploiting the knowledge that is typically used by the experimenter when carrying out manual interactive inverse filtering. A further study looks at the properties of the modified LF model in the KLSYN88 synthesiser: it highlights how it differs from the standard LF model and discusses the implications for synthesising the glottal source signal from LF model data. Effective and robust source parameterisation for the analysis of voice quality is the topic of the final paper: the effectiveness of global, amplitude-based source parameters is examined across speech tokens with large differences in f0. Additional amplitude-based parameters are proposed to enable a more detailed characterisation of the glottal pulse.

Keywords: Voice source dynamics, glottal source parameters, source-filter interaction, voice quality, phonation, perception, affect, emotion, mood, attitude, paralinguistic, inverse filtering, knowledge-based, formant synthesis, LF model, fundamental frequency, f0.
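Since the LF model of differentiated glottal flow is central to the parameterisation described above, a simplified sketch of its waveform is given below. It implements the standard two-segment shape (an exponentially growing sinusoid up to the main excitation, followed by an exponential return phase), but treats the open-phase growth factor alpha as a free input instead of solving the LF area-balance constraint, so the pulse is not guaranteed to integrate exactly to zero over the period. Parameter names and the example values are illustrative only.

```python
import numpy as np

def lf_pulse(fs, T0, te, tp, ta, Ee, alpha):
    """Simplified LF-model pulse (differentiated glottal flow), one period.
    T0: period (s); te, tp, ta: LF timing parameters (s); Ee: excitation
    strength; alpha: open-phase growth factor (a free input here -- the full
    LF model solves alpha from the area-balance constraint instead)."""
    t = np.arange(0.0, T0, 1.0 / fs)
    wg = np.pi / tp                                  # sinusoid frequency of the open phase
    # Solve eps*ta = 1 - exp(-eps*(T0 - te)) by fixed-point iteration
    eps = 1.0 / ta
    for _ in range(50):
        eps = (1.0 - np.exp(-eps * (T0 - te))) / ta
    E0 = -Ee / (np.exp(alpha * te) * np.sin(wg * te))   # enforce E(te) = -Ee
    e = np.where(
        t <= te,
        E0 * np.exp(alpha * t) * np.sin(wg * t),                                   # open phase
        -(Ee / (eps * ta)) * (np.exp(-eps * (t - te)) - np.exp(-eps * (T0 - te)))  # return phase
    )
    return t, e

# Example with illustrative values only:
# t, e = lf_pulse(16000, T0=0.008, te=0.0048, tp=0.004, ta=0.0003, Ee=1.0, alpha=300.0)
```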
|
5 |
Transforming high-effort voices into breathy voices using adaptive pre-emphasis linear prediction
Nordstrom, Karl, 29 April 2008
During musical performance and recording, a variety of techniques and electronic effects are available to transform the singing voice. The particular effect examined in this dissertation is breathiness, where artificial noise is added to a voice to simulate aspiration noise. The typical problem with this effect is that artificial noise does not blend effectively into voices produced with high vocal effort: the existing breathy effect does not reduce the perceived effort, whereas naturally breathy voices exhibit low effort.
A typical approach to synthesizing breathiness is to separate the voice into a filter representing the vocal tract and a source representing the excitation of the vocal folds. Artificial noise is added to the source to simulate aspiration noise. The modified source is then fed through the vocal tract filter to synthesize a new voice. The resulting voice sounds like the original voice plus noise.
Listening experiments demonstrated that constant pre-emphasis linear prediction (LP) yields an estimated vocal tract filter that retains the perception of vocal effort. It was therefore hypothesized that reducing the perception of vocal effort in the estimated vocal tract filter may improve the breathy effect.
This dissertation presents adaptive pre-emphasis LP (APLP) as a technique to more appropriately model the spectral envelope of the voice. The APLP algorithm results in a more consistent vocal tract filter and an estimated voice source that varies more appropriately with changes in vocal effort. This dissertation describes how APLP estimates a spectral emphasis filter that can transform the spectral envelope of the voice, thereby reducing the perception of vocal effort.
A listening experiment was carried out to determine whether APLP is able to transform high-effort voices into breathy voices more effectively than constant pre-emphasis LP. The experiment demonstrates that APLP is able to reduce the perceived effort in the voice. In addition, voices transformed using APLP sound less artificial than the same voices transformed using constant pre-emphasis LP. This indicates that APLP can more effectively transform high-effort voices into breathy voices.
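To make the contrast between constant and adaptive pre-emphasis concrete, the sketch below derives a per-frame first-order pre-emphasis coefficient from the ratio of the first two autocorrelation lags before LP analysis, so that frames with steeper spectral tilt receive stronger pre-emphasis. This is a generic adaptive pre-emphasis sketch under that assumption, not the APLP algorithm as specified in the dissertation; the LP order and helper names are placeholders.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, order):
    """LP coefficients via the autocorrelation method (returns [1, -a1, ..., -ap])."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def adaptive_preemphasis_lpc(frame, order=18):
    """LP analysis with a per-frame (adaptive) first-order pre-emphasis coefficient."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    mu = r[1] / r[0]                         # optimal first-order predictor = adaptive coefficient
    pre = lfilter([1.0, -mu], [1.0], frame)  # adaptive pre-emphasis
    return lpc(pre, order), mu

# Constant pre-emphasis for comparison: lfilter([1.0, -0.97], [1.0], frame)
```

Using r[1]/r[0] makes the pre-emphasis equal to the optimal one-tap predictor for the frame, which is one simple way to let the amount of tilt removal track vocal effort.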
|
6 |
Characterization of the Voice Source by the DCT for Speaker Information
Abhiram, B, January 2014
Extracting speaker-specific information from speech is of great interest to researchers and developers alike, since speaker recognition technology finds application in a wide range of areas, primary among them being forensics and biometric security systems.
Several models and techniques have been employed to extract speaker information from the speech signal. Speech production is generally modeled as an excitation source followed by a filter. Physiologically, the source corresponds to the vocal fold vibrations and the filter corresponds to the spectrum-shaping vocal tract. Vocal tract-based features like the mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstral coefficients have been shown to contain speaker information. However, high-speed videos of the larynx show that the vocal folds of different individuals vibrate differently. Voice source (VS)-based features have also been shown to perform well in speaker recognition tasks, thereby revealing that the VS does contain speaker information. Moreover, a combination of the vocal tract and VS-based features has been shown to give an improved performance, showing that the latter contains supplementary speaker information.
In this study, the focus is on extracting speaker information from the VS. The existing techniques for this are reviewed, and it is observed that the features obtained by fitting a time-domain model to the VS perform more poorly than those obtained by simple transformations of the VS. Here, an attempt is made to propose an alternate way of characterizing the VS to extract speaker information, and to study the merits and shortcomings of the proposed speaker-specific features.
The VS cannot be measured directly. Thus, to characterize the VS, we first need an estimate of the VS, and the integrated linear prediction residual (ILPR) extracted from the speech signal is used as the VS estimate in this study. The voice source linear prediction model, which was proposed in an earlier study to obtain the ILPR, is used in this work.
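A rough sketch of an ILPR-style voice source estimate is shown below, under the common assumption that the LP coefficients are computed on a pre-emphasized copy of the analysis frame and the resulting inverse filter is then applied to the original, non-pre-emphasized signal. The exact voice source linear prediction model used in the thesis may differ, and the frame length, LP order, and pre-emphasis coefficient here are placeholders.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def ilpr(frame, order=18, mu=0.97):
    """Rough ILPR-style voice source estimate for one analysis frame:
    LP coefficients are computed on a pre-emphasized copy of the frame,
    and the original frame is then passed through the inverse filter."""
    w = np.hamming(len(frame))
    pre = lfilter([1.0, -mu], [1.0], frame)                      # pre-emphasized copy for analysis
    r = np.correlate(pre * w, pre * w, mode='full')[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    A = np.concatenate(([1.0], -a))                              # prediction error (inverse) filter
    return lfilter(A, [1.0], frame)                              # inverse-filter the unmodified frame
```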
It is hypothesized here that a speaker’s voice may be characterized by the relative proportions of the harmonics present in the VS. The pitch-synchronous discrete cosine transform (DCT) is shown to capture these proportions, along with the gross shape of the ILPR, in a few coefficients. The ILPR, and hence its DCT coefficients, are visually observed to distinguish between speakers. However, they also exhibit intra-speaker variability, so it is further hypothesized that the distribution of the DCT coefficients captures speaker information; this distribution is modeled by a Gaussian mixture model (GMM).
The DCT coefficients of the ILPR (termed the DCTILPR) are directly used as a feature vector in speaker identification (SID) tasks. Issues related to the GMM, like the type of covariance matrix, are studied, and it is found that diagonal covariance matrices perform better than full covariance matrices. Thus, mixtures of Gaussians having diagonal covariances are used as speaker models, and by conducting SID experiments on three standard databases, it is found that the proposed DCTILPR features fare comparably with the existing VS-based features. It is also found that the gross shape of the VS contains most of the speaker information, whereas the very fine structure of the VS does not help in distinguishing speakers and instead leads to more confusion between them. The major drawbacks of the DCTILPR are session and handset variability, but these are also present in existing state-of-the-art speaker-specific VS-based features and the MFCCs, and hence seem to be common problems. There are techniques to compensate for these variabilities, which need to be used when systems using these features are deployed in an actual application.
The DCTILPR is found to improve the SID accuracy of a system trained with MFCC features by 12%, indicating that the DCTILPR features capture speaker information which is missed by the MFCCs. It is also found that a combination of MFCC and DCTILPR features on a speaker verification task gives significant performance improvement in the case of short test utterances. Thus, on the whole, this study proposes an alternate way of extracting speaker information from the VS, and adds to the evidence for speaker information present in the VS.
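A minimal sketch of the feature-and-model pipeline summarized above, with placeholder parameter values: each pitch-synchronous ILPR cycle (however obtained) is length-normalized, its first few DCT coefficients are kept as the feature vector, each speaker is modeled by a diagonal-covariance GMM, and identification selects the model with the highest average log-likelihood on the test features. Cycle extraction and all constants are assumptions, not the thesis's configuration.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import resample
from sklearn.mixture import GaussianMixture

def dctilpr_features(ilpr_cycles, n_coeffs=20, cycle_len=256):
    """DCT-of-ILPR feature vectors: one vector per pitch-synchronous cycle."""
    feats = []
    for cyc in ilpr_cycles:
        cyc = resample(cyc, cycle_len)            # length-normalize the glottal cycle
        c = dct(cyc, type=2, norm='ortho')        # pitch-synchronous DCT
        feats.append(c[:n_coeffs])                # keep the gross-shape coefficients
    return np.array(feats)

def train_speaker_models(features_per_speaker, n_mix=32, seed=0):
    """One diagonal-covariance GMM per speaker."""
    return {spk: GaussianMixture(n_components=n_mix, covariance_type='diag',
                                 random_state=seed).fit(F)
            for spk, F in features_per_speaker.items()}

def identify(models, test_features):
    """Return the speaker whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(test_features))
```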
|