1 |
The automatic and unconstrained segmentation of speech into subword units
Van Vuuren, Van Zyl 03 1900
Thesis (MEng)--Stellenbosch University, 2014. / ENGLISH ABSTRACT: We develop and evaluate several algorithms that segment a speech signal into subword units without
using phone or orthographic transcripts. These segmentation algorithms rely on a scoring function,
termed the local score, that is applied at the feature level and indicates where the characteristics of the
audio signal change. The predominant approach in the literature to segmentation is to apply a threshold
to the local score, and local maxima (peaks) that are above the threshold result in the hypothesis of a
segment boundary. Scoring mechanisms of a select number of such algorithms are investigated, and it
is found that these local scores frequently exhibit clusters of peaks near phoneme transitions that cause
spurious segment boundaries. As a consequence, very short segments are sometimes postulated by the
algorithms. To counteract this, ad-hoc remedies are proposed in the literature. We propose a dynamic
programming (DP) framework for speech segmentation that employs a probabilistic segment length
model in conjunction with the local scores. DP offers an elegant way to deal with peak clusters by
choosing only the most probable segment length and local score combinations as boundary positions.
It is shown to offer a clear performance improvement over selected methods from the literature serving
as benchmarks.
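The contrast between thresholded peak picking and the DP framework described above can be sketched as follows. This is a minimal illustration and not the thesis's implementation: the function names `threshold_peaks` and `dp_segment`, the additive combination of local score and segment-length log-probability, and the Gaussian-shaped length model used in the example are all assumptions made here for clarity.

```python
import math

def threshold_peaks(local_score, threshold):
    """Baseline: every local maximum above the threshold becomes a boundary."""
    return [t for t in range(1, len(local_score) - 1)
            if local_score[t] > threshold
            and local_score[t] >= local_score[t - 1]
            and local_score[t] > local_score[t + 1]]

def dp_segment(local_score, length_logprob, max_len):
    """Pick boundaries maximising summed local score plus length log-probability.

    local_score[t]    : boundary evidence at frame t (higher = more boundary-like)
    length_logprob(d) : log-probability of a d-frame segment (hypothetical model)
    max_len           : longest segment considered, in frames
    """
    n = len(local_score)
    best = [-math.inf] * (n + 1)  # best[t]: best total with a boundary at frame t
    back = [0] * (n + 1)          # back[t]: previous boundary on that best path
    best[0] = 0.0                 # the utterance start is a fixed boundary
    for t in range(1, n + 1):
        # The utterance end (t == n) is forced and earns no local score.
        gain = local_score[t] if t < n else 0.0
        for d in range(1, min(max_len, t) + 1):
            cand = best[t - d] + length_logprob(d) + gain
            if cand > best[t]:
                best[t], back[t] = cand, t - d
    # Trace back from the utterance end to recover the interior boundaries.
    bounds, t = [], n
    while t > 0:
        t = back[t]
        if t > 0:
            bounds.append(t)
    return bounds[::-1]

# Toy local score: a cluster of two peaks (frames 9 and 11) and one clean peak (frame 20).
scores = [0.0] * 30
scores[9], scores[10], scores[11], scores[20] = 0.9, 0.2, 1.0, 1.0

# Assumed length model: segments of about 10 frames are most probable.
length_model = lambda d: -((d - 10) ** 2) / 18.0

print(threshold_peaks(scores, 0.5))          # [9, 11, 20]  (spurious 2-frame segment)
print(dp_segment(scores, length_model, 30))  # [11, 20]     (cluster resolved to one boundary)
```

In this toy case the threshold rule keeps both peaks of the cluster, whereas the length model makes a two-frame segment so improbable that the DP keeps only the stronger peak.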
Multilayer perceptrons (MLPs) can be trained to generate local scores by using groups of feature
vectors centred around phoneme boundaries and midway between phoneme boundaries in suitable
training data. The MLPs are trained to produce a high output value at a boundary, and a low value
at continuity. It was found that the more accurate local scores generated by the MLP, which rarely
exhibit clusters of peaks, made the additional application of DP less effective than before. However, a
hybrid approach in which DP is used only to resolve smaller, more ambiguous peaks in the local score
was found to offer a substantial improvement on all prior methods.
Finally, restricted Boltzmann machines (RBMs) were applied as feature detectors. This provided a
means of building multi-layer networks that are capable of detecting highly abstract features. It is
found that when local scores are estimated by such deep networks, additional performance gains are
achieved. / AFRIKAANS ABSTRACT (translated): We develop and evaluate several algorithms that segment a speech signal into subword units
without using orthographic or phonetic transcriptions. These algorithms use a function, called the
local score function, that produces a value describing the local change in a speech signal. In the
literature, the main approach to segmentation was found to be based on a threshold, above which every
local maximum (peak) in the local score leads to a boundary between segments.
A select group of segmentation algorithms was investigated, and it was found that local scores tend
to exhibit clusters of peaks near the boundaries between phonemes. As a result, the algorithms
select very short segments. To counteract this, ad-hoc methods are proposed in the literature.
We propose an alternative method based on dynamic programming (DP), which incorporates a statistical
distribution of segment lengths into segmentation. DP offers an elegant way to handle clusters of
peaks, in that only combinations of high local scores and high segment probability, with respect to
segment length, lead to a boundary. DP is shown to give a clear improvement in segmentation accuracy
over a few selected algorithms from the literature.
Multilayer perceptrons (MLPs) can be trained to generate a local score by using groups of feature
vectors centred on and between phoneme boundaries in suitable training data. The MLPs are trained to
produce a large value when a phoneme boundary occurs and a small value otherwise. It was found that
the more accurate local scores generated by the MLPs have fewer clusters of peaks, which makes the
additional application of DP less effective. A hybrid application, in which DP chooses only among
smaller and less distinct peaks in the local score, nevertheless leads to a large improvement over
all previous methods.
As a final step, we used restricted Boltzmann machines (RBMs) to identify patterns in data. RBMs
thereby provide a way to build multilayer networks in which the top layers identify highly complex
patterns in the data. Applying these deeper networks to the generation of a local score led to
further improvements in segmentation accuracy. / National Research Foundation (NRF)
|
2 |
Acoustic cues to speech segmentation in spoken French: native and non-native strategies
Shoemaker, Ellenor Marguerite 23 October 2009
In spoken French, the phonological processes of liaison and resyllabification can
render word and syllable boundaries ambiguous. In the case of liaison, for example, the
/n/ in the masculine indefinite article un [œ̃] is normally latent, but when followed by a
vowel-initial word the /n/ surfaces and is resyllabified as the onset of that word. Thus, the
phrases un air ‘a melody’ and un nerf ‘a nerve’ are produced with identical phonemic
content and syllable boundaries ([œ̃.nɛʁ]). Some research has suggested that speakers of
French give listeners cues to word boundaries by varying the duration of consonants that
surface in liaison environments relative to consonants produced word-initially.
Production studies (e.g. Wauquier-Gravelines 1996; Spinelli et al. 2003) have
demonstrated that liaison consonants (e.g. /n/ in un air) are significantly shorter than the
same consonant in initial position (e.g. /n/ in un nerf). Studies on the perception of
spoken French have suggested that listeners exploit these durational differences in the
segmentation of running speech (e.g. Gaskell et al. 2002; Spinelli et al. 2003), though no
study to date has tested this hypothesis directly.
The current study employs a direct test of the exploitation of duration as a
segmentation cue by manipulating this single acoustic factor while holding all other
factors constant. Thirty-six native speakers of French and 54 adult learners of French as a second language (L2) were tested on both an AX discrimination task and a forced-choice
identification task which employed stimuli in which the durations of pivotal consonants
(e.g. /n/ in [œ̃.nɛʁ]) were instrumentally shortened and lengthened. The results suggest
that duration alone can indeed modulate the lexical interpretation of sequences rendered
ambiguous by liaison in spoken French. Shortened stimuli elicited a significantly larger proportion
of vowel-initial (liaison) responses, while lengthened stimuli elicited a significantly
larger proportion of consonant-initial responses, indicating that both native and
(advanced) non-native speakers are indeed sensitive to this acoustic cue.
These results add to a growing body of work demonstrating that listeners use
extremely fine-grained acoustic detail to modulate lexical access (e.g. Salverda et al.
2003; Shatzman & McQueen 2006). In addition, the current results have manifest
ramifications for study of the upper limits of L2 acquisition and the plasticity of the adult
perceptual system in that they show evidence of nativelike sensitivity to non-contrastive
phonological variation.
|
3 |
Speech Assessment for the Classification of Hypokinetic Dysthria in Parkinson Disease
Butt, Abdul Haleem January 2012
The aim of this thesis is to investigate computerized voice assessment methods for distinguishing normal from dysarthric speech signals. The proposed system combines computerized assessment methods with signal processing and artificial intelligence techniques. Each subject read the sentences used for measuring inter-stress intervals (ISI), and the recordings were analysed to compare normal and impaired voice. A band-pass filter is used to preprocess the speech samples. Speech segmentation is performed using signal energy and spectral centroid to separate voiced and unvoiced regions of the speech signal. Acoustic features are extracted from the LPC model and from the speech segments of each audio signal to find anomalies. The speech features assessed for classification are energy entropy, zero-crossing rate (ZCR), spectral centroid, mean fundamental frequency (Meanf0), jitter (RAP), jitter (PPQ), and shimmer (APQ). Naïve Bayes (NB) is used for speech classification. For speech test-1 and test-2, NB achieves classification accuracies of 72% and 80%, respectively, between healthy and impaired speech samples. For speech test-3, NB achieves 64% correct classification. The results indicate that speech impairment in PD patients can be classified on the basis of the clinical rating scale.
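Three of the frame-level features named above (signal energy, zero-crossing rate and spectral centroid) can be sketched in a few lines. This is an illustrative sketch only: the function names, the naive DFT, and the synthetic 1 kHz test tone are assumptions made here, and the thesis's actual framing, windowing and filtering choices are not given in the abstract.

```python
import cmath
import math

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz, using a naive O(n^2) DFT."""
    n = len(frame)
    num = den = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        mag = abs(coeff)
        num += (k * sample_rate / n) * mag
        den += mag
    return num / den if den > 0 else 0.0

# Synthetic frame: a 1 kHz tone sampled at 8 kHz (64 samples = 8 full periods).
# The phase offset keeps samples away from exact zeros.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * i / sample_rate + math.pi / 8)
         for i in range(64)]

print(round(short_time_energy(frame), 3))               # 0.5
print(round(zero_crossing_rate(frame), 3))              # 0.238
print(round(spectral_centroid(frame, sample_rate), 1))  # 1000.0
```

Voiced speech typically shows higher energy and a lower zero-crossing rate than unvoiced speech, which is why such features are commonly paired for voiced/unvoiced separation.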
|
4 |
The Minimal Word Hypothesis: A Speech Segmentation Strategy
Meador, Diane L. January 1996
Previous investigations have sought to determine how listeners might locate word boundaries in the speech signal for the purpose of lexical access. Cutler (1990) proposes the Metrical Segmentation Strategy (MSS), such that only full vowels in stressed syllables and their preceding syllabic onsets are segmented from the speech stream. I report the results of several experiments which indicate that the listener segments the minimal word, a phonologically motivated prosodic constituent, during processing of the speech signal. These experiments were designed to contrast the MSS with two prosodic alternative hypotheses. The Syllable Hypothesis posits that listeners segment a linguistic syllable in its entirety as it is produced by the speaker. The Minimal Word Hypothesis proposes that a minimal word is segmented according to implicit knowledge the listener has concerning statistically probable characteristics of the lexicon. These competing hypotheses were tested by using a word spotting method similar to that in Cutler and Norris (1988). The subjects' task was to detect real monosyllabic words embedded initially in bisyllabic nonce strings. Both open (CV) and closed (CVC) words were embedded in strings containing a single intervocalic consonant. The prosodic constituency of this consonant was varied by manipulating factors affecting prosodic structure: stress, the sonority of the consonant, and the quality of the vowel in the first syllable. The assumption behind the method is that word detection will be facilitated when embedded word and segmentation boundaries are coincident. Results show that these factors are influential during segmentation. The degree of difficulty in word detection is a function of how well the speech signal corresponds to the minimal word. Findings are consistently counter to both the MSS and Syllable hypotheses. 
The Minimal Word Hypothesis takes advantage of statistical properties of the lexicon, ensuring a strategy which is successful more often than not. The minimal word specifies the smallest possible content word in a language in terms of prosodic structure while simultaneously affiliating the greatest amount of featural information within the structural limits. It therefore guarantees an efficient strategy with as few parses as possible.
|
5 |
Speech segmentation and speaker diarisation for transcription and translation
Sinclair, Mark January 2016
This dissertation outlines work related to Speech Segmentation – segmenting an audio recording into regions of speech and non-speech, and Speaker Diarization – further segmenting those regions into those pertaining to homogeneous speakers. Knowing not only what was said but also who said it and when has many useful applications. As well as providing a richer level of transcription for speech, we will show how such knowledge can improve Automatic Speech Recognition (ASR) system performance and can also benefit downstream Natural Language Processing (NLP) tasks such as machine translation and punctuation restoration. While segmentation and diarization may appear to be relatively simple tasks to describe, in practice we find that they are very challenging and are, in general, ill-defined problems. Therefore, we first provide a formalisation of each of the problems as the sub-division of speech within acoustic space and time. Here, we see that the task can become very difficult when we want to partition this domain into our target classes of speakers, whilst avoiding other classes that reside in the same space, such as phonemes. We present a theoretical framework for describing and discussing the tasks as well as introducing existing state-of-the-art methods and research. Current Speaker Diarization systems are notoriously sensitive to hyper-parameters and lack robustness across datasets. Therefore, we present a method which uses a series of oracle experiments to expose the limitations of current systems and to which system components these limitations can be attributed. We also demonstrate how Diarization Error Rate (DER), the dominant error metric in the literature, is not a comprehensive or reliable indicator of overall performance or of error propagation to subsequent downstream tasks. These results inform our subsequent research. We find that, as a precursor to Speaker Diarization, the task of Speech Segmentation is a crucial first step in the system chain. 
Current methods typically do not account for the inherent structure of spoken discourse. As such, we explored a novel method which exploits an utterance-duration prior in order to better model the segment distribution of speech. We show how this method improves not only segmentation, but also the performance of subsequent speech recognition, machine translation and speaker diarization systems. Typical ASR transcriptions do not include punctuation and the task of enriching transcriptions with this information is known as ‘punctuation restoration’. The benefit is not only improved readability but also better compatibility with NLP systems that expect sentence-like units such as in conventional machine translation. We show how segmentation and diarization are related tasks that are able to contribute acoustic information that complements existing linguistically-based punctuation approaches. There is a growing demand for speech technology applications in the broadcast media domain. This domain presents many new challenges including diverse noise and recording conditions. We show that the capacity of existing GMM-HMM based speech segmentation systems is limited for such scenarios and present a Deep Neural Network (DNN) based method which offers more robust speech segmentation, resulting in improved speech recognition performance for a television broadcast dataset. Ultimately, we are able to show that speech segmentation is an inherently ill-defined problem for which the solution is highly dependent on the downstream task that it is intended for.
|
6 |
Statistical Bootstrapping of Speech Segmentation Cues
Planet, Nicolas O. 01 January 2010
Various infant studies suggest that statistical regularities in the speech stream (e.g. transitional probabilities) are among the first speech segmentation cues available. Statistical learning may serve as a mechanism for learning various language-specific segmentation cues (e.g. stress segmentation by English speakers). To test this possibility we exposed adults to an artificial language in which all words had a novel acoustic cue on the final syllable. Subjects were presented with a continuous stream of synthesized speech in which the words were repeated in random order. Subjects were then given a new set of words to see if they had learned the acoustic cue and generalized it to new stimuli. Finally, subjects were exposed to a competition stream in which the transitional probability and novel acoustic cues conflicted, to see which cue they preferred to use for segmentation. Results on the word-learning test suggest that subjects were able to segment the first exposure stream; however, on the cue transfer test they did not display any evidence of learning the relationship between word boundaries and the novel acoustic cue. Subjects were able to learn statistical words from the competition stream despite extra intervening syllables.
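The transitional-probability cue mentioned above can be made concrete with a short sketch. Everything specific here is an assumption for illustration: the three trisyllabic nonce words, the word order of the stream, the 0.9 threshold, and the function names are hypothetical and are not the study's stimuli or analysis. The transitional probability of a syllable pair is taken as count(a, b) / count(a).

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Map each adjacent pair (a, b) to P(b | a) = count(a, b) / count(a)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])  # the last syllable has no successor
    return {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}

def segment_at_dips(syllables, tp, threshold):
    """Insert a word boundary wherever the transitional probability is low."""
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tp[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Hypothetical artificial language: three nonce words concatenated without pauses.
lexicon = {"A": ["tu", "pi", "ro"], "B": ["go", "la", "bu"], "C": ["bi", "da", "ku"]}
stream = [syl for word in "ABCBACAB" for syl in lexicon[word]]

tp = transitional_probabilities(stream)
print(tp[("tu", "pi")])            # 1.0  (within a word)
print(round(tp[("ro", "go")], 3))  # 0.667 (across a word boundary)
print(segment_at_dips(stream, tp, threshold=0.9))
# ['tupiro', 'golabu', 'bidaku', 'golabu', 'tupiro', 'bidaku', 'tupiro', 'golabu']
```

Within-word transitions are fully predictable in this toy stream, while transitions across word boundaries are not, so cutting at the dips recovers the words.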
|
7 |
Sistema baseado em regras para o refinamento da segmentação automatica de fala / Rule based system for refining the automatic speech segmentation
Selmini, Antonio Marcos 22 August 2008
Advisor: Fabio Violaro / Thesis (doctorate) - Universidade Estadual de Campinas, Faculdade de Engenharia Eletrica e de Computação / Resumo (translated from Portuguese): Demand for reliable automatic speech segmentation has been growing, calling for research to support the development of systems that use speech for human-machine interaction. In this context, this work reports the development and evaluation of a system for automatic speech segmentation that uses the Viterbi algorithm and refines the segmentation boundaries based on the acoustic-phonetic characteristics of the phonetic classes. The (context-dependent) phonetic sub-units are represented with Hidden Markov Models (HMMs). Each boundary estimated by the Viterbi algorithm is refined using acoustic features that depend on the phone classes, since the identity of the phones to the right and left of the boundary under consideration is known. The proposed system was evaluated on two speaker-dependent Brazilian Portuguese databases (one male, one female) and on a speaker-independent database (TIMIT). The evaluation compared the automatic segmentation with the manual segmentation. After the refinement process, a gain of 29% in boundaries with segmentation error below 20 ms was obtained for the male speaker-dependent Brazilian Portuguese speech database. / Abstract: The demand for reliable automatic speech segmentation is increasing and requiring additional research to support the development of systems that use speech for man-machine interface. In this context, this work reports the development and evaluation of a system for automatic speech segmentation using Viterbi's algorithm and a refinement of segmentation boundaries based on acoustic-phonetic features. Phonetic sub-units (context-dependent phones) are modeled with HMM (Hidden Markov Models). Each boundary estimated by Viterbi's algorithm is refined using class-dependent acoustic features, as the identity of the phones on the left and right side of the considered boundary is known. 
The proposed system was evaluated using two speaker-dependent Brazilian Portuguese speech databases (one male and one female speaker), and a speaker-independent English database (TIMIT). The evaluation was carried out by comparing automatic against manual segmentation. After the refinement process, an improvement of 29% in the percentage of segmentation errors below 20 ms was achieved for the male speaker-dependent Brazilian Portuguese speech database. / Doctorate / Telecommunications and Telematics / Doctor of Electrical Engineering
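The evaluation criterion above, the share of boundaries whose error against manual segmentation is below 20 ms, can be sketched as follows. This is a hedged illustration: the function name `fraction_within_tolerance` and the boundary times are invented for the example, and the one-to-one pairing of automatic and manual boundaries is an assumption (plausible when the phone sequence is known, but not stated in the abstract).

```python
def fraction_within_tolerance(auto_ms, manual_ms, tol_ms=20.0):
    """Share of boundaries whose absolute error is strictly below tol_ms.

    Assumes auto_ms[i] and manual_ms[i] mark the same phone boundary.
    """
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must be paired one-to-one")
    hits = sum(1 for a, m in zip(auto_ms, manual_ms) if abs(a - m) < tol_ms)
    return hits / len(manual_ms)

# Invented boundary times in milliseconds, for illustration only.
manual = [100.0, 250.0, 400.0, 620.0]
before_refinement = [112.0, 275.0, 395.0, 640.0]  # errors: 12, 25, 5, 20 ms
after_refinement = [105.0, 262.0, 398.0, 630.0]   # errors: 5, 12, 2, 10 ms

print(fraction_within_tolerance(before_refinement, manual))  # 0.5
print(fraction_within_tolerance(after_refinement, manual))   # 1.0
```

Note that an error of exactly 20 ms does not count as a hit here, following the "below 20 ms" wording.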
|
8 |
Speech Endpoint Detection: An Image Segmentation Approach
Faris, Nesma January 2013
Speech Endpoint Detection, also known as Speech Segmentation, is an unsolved problem in speech processing that affects numerous applications including robust speech recognition. This task is not as trivial as it appears, and most of the existing algorithms degrade at low signal-to-noise ratios (SNRs). Most previous research has focused on the development of robust algorithms, with special attention paid to the derivation and study of noise-robust features and decision rules. This research tackles the endpoint detection problem in a different way, and proposes a novel speech endpoint detection algorithm derived from the Chan-Vese algorithm for image segmentation. The proposed algorithm can fuse multiple features extracted from the speech signal to enhance the detection accuracy. Its performance has been evaluated and compared to two widely used speech detection algorithms under various noise environments with SNR levels ranging from 0 dB to 30 dB. Furthermore, the proposed algorithm has also been applied to different types of American English phonemes. The experiments show that, even under conditions of severe noise contamination, the proposed algorithm is more efficient than the reference algorithms.
|
9 |
Bimodal Automatic Speech Segmentation And Boundary Refinement Techniques
Akdemir, Eren 01 March 2010
Automatic segmentation of speech is essential for building large speech databases to be used in speech processing applications. This study proposes a bimodal automatic speech segmentation system that uses either articulator motion information (AMI) or visual information obtained by a camera in collaboration with auditory information. The presence of the visual modality has been shown to be very beneficial in speech recognition applications, improving the performance and noise robustness of those systems. In this dissertation, a significant increase in the performance of the automatic speech segmentation system is achieved by using a bimodal approach.
Automatic speech segmentation systems have a tradeoff between precision and the resulting number of gross errors. Boundary refinement techniques are used to increase the precision of these systems without decreasing system performance. Two novel boundary refinement techniques are proposed in this thesis: a hidden Markov model (HMM) based fine-tuning system and an inverse filtering based fine-tuning system. The segment boundaries obtained by the bimodal speech segmentation system are improved further by using these techniques.
To fulfill these goals, a complete two-stage automatic speech segmentation system was produced and tested on two different databases. A phonetically rich Turkish audiovisual speech database, containing acoustic data and camera recordings of 1600 Turkish sentences uttered by a male speaker, was built from scratch for use in the experiments. The visual features of the recordings were extracted, and the database was manually phonetically aligned to serve as ground truth for the performance tests of the automatic speech segmentation systems.
|