231 |
The effects of recognition accuracy and vocabulary size of a speech recognition system on task performance and user acceptance
Casali, Sherry P. 22 June 2010
Automatic speech recognition systems have at last advanced to the state that they are now a feasible alternative for human-machine communication in selected applications. As such, research efforts are now beginning to focus on characteristics of the human, the recognition device, and the interface which optimize the system performance, rather than the previous trend of determining factors affecting recognizer performance alone. This study investigated two characteristics of the recognition device, the accuracy level at which it recognizes speech, and the vocabulary size of the recognizer as a percent of task vocabulary size to determine their effects on system performance. In addition, the study considered one characteristic of the user, age. Briefly, subjects performed a data entry task under each of the treatment conditions. Task completion time and the number of errors remaining at the end of each session were recorded. After each session, subjects rated the recognition device used as to its acceptability for the task.
The accuracy level at which the recognizer was performing significantly influenced the task completion time as well as the user's acceptability ratings, but had only a small effect on the number of errors left uncorrected. The available vocabulary size also significantly affected the task completion time; however, its effect on the final error rate and on the acceptability ratings was negligible. The age of the subject was also found to influence both objective and subjective measures. Older subjects in general required longer times to complete the tasks; however, they consistently rated the speech input systems more favorably than the younger subjects. / Master of Science
|
232 |
Improving the quality of speech in noisy environments
Parikh, Devangi Nikunj 06 November 2012
In this thesis, we are interested in processing noisy speech signals that are meant to be heard by humans, and hence we approach the noise-suppression problem from a perceptual perspective. We develop a noise-suppression paradigm that is based on a model of the human auditory system, where we process signals in a way that is natural to the human ear. Under this paradigm, we transform an audio signal into a perceptual domain and process the signal in this perceptual domain. This approach allows us to reduce the background noise and the audible artifacts that are seen in traditional noise-suppression algorithms, while preserving the quality of the processed speech. We develop a single- and a dual-microphone algorithm based on this perceptual paradigm, and conduct subjective tests to show that this approach outperforms traditional noise-suppression techniques. Moreover, we investigate the cause of audible artifacts that are generated as a result of suppressing the noise in noisy signals, and introduce constraints on the noise-suppression gain such that these artifacts are reduced.
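The idea of constraining the noise-suppression gain can be illustrated with a small sketch. This is not the thesis's perceptual-domain algorithm, which the abstract does not specify; it assumes a generic Wiener-style gain per frequency band, a lower bound on the gain, and a limit on how fast the gain may change between frames. The function name, the floor of 0.1 and the step limit of 0.2 are all hypothetical choices.

```python
def constrained_gains(noisy_power, noise_power, prev_gains, floor=0.1, max_step=0.2):
    """Return per-band suppression gains for one frame.

    noisy_power, noise_power: per-band power estimates for the current frame.
    prev_gains: gains used in the previous frame, assumed to lie in [floor, 1].
    floor and max_step are illustrative values, not taken from the thesis.
    """
    gains = []
    for p_noisy, p_noise, g_prev in zip(noisy_power, noise_power, prev_gains):
        # Wiener-style gain from the estimated speech power in this band.
        speech = max(p_noisy - p_noise, 0.0)
        g = speech / p_noisy if p_noisy > 0 else 0.0
        # Constraint 1: never attenuate below the floor.
        g = max(g, floor)
        # Constraint 2: limit the change relative to the previous frame.
        g = min(max(g, g_prev - max_step), g_prev + max_step)
        gains.append(g)
    return gains


# Three bands: mostly speech, pure noise, and speech onset after a quiet frame.
example_gains = constrained_gains([10.0, 1.0, 4.0], [1.0, 1.0, 1.0], [1.0, 1.0, 0.2])
```

Both constraints trade some noise reduction for fewer abrupt gain changes, which is one common way to reduce audible artifacts in suppression schemes of this general kind.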
|
233 |
An Analog Architecture for Auditory Feature Extraction and Recognition
Smith, Paul Devon 22 November 2004
Speech recognition systems have been implemented using a wide range of signal processing techniques, including neuromorphic/biologically inspired and digital signal processing (DSP) techniques. Neuromorphic/biologically inspired techniques, such as silicon cochlea models, are based on fairly simple yet highly parallel computation and/or computational units, whereas DSP is based on block transforms and statistical or error-minimization methods.
Essential to each of these techniques is the first stage of extracting meaningful information from the speech signal, which is known as feature extraction. This can be done using biologically inspired techniques such as silicon cochlea models, or techniques that begin with a model of speech production and then try to separate the vocal tract response from an excitation signal. Even within each of these approaches there are multiple techniques, including cepstrum filtering, which falls under the class of homomorphic signal processing, and techniques using FFT-based predictive approaches. The underlying reality is that multiple techniques have attacked the speech recognition problem, but the problem is still far from solved. The techniques shown to have the best recognition rates use cepstrum coefficients for feature extraction and hidden Markov models to perform the pattern recognition.
The presented research develops an analog system, based on programmable analog array technology, that can perform the initial stages of auditory feature extraction and recognition before passing information to a digital signal processor. The goal is a low-power system that can be fully contained on one or more integrated circuit chips. Results show that it is possible to realize advanced filtering techniques such as cepstrum filtering and vector quantization in analog circuitry. Prior to this work, applications of analog signal processing focused on vision, cochlea models, anti-aliasing filters and other single-component uses. Furthermore, classic designs have relied heavily on op-amps as the core building block. This research also shows a novel design for a hidden Markov model (HMM) decoder utilizing circuits that take advantage of the inherent properties of subthreshold transistors and floating-gate technology to create low-power computational blocks.
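For readers unfamiliar with vector quantization, the operation the abstract says can be realized in analog circuitry, a software sketch may help. It maps each feature vector to the index of the nearest codeword under squared Euclidean distance. The codebook and frames below are made-up two-dimensional values chosen for illustration; the thesis's circuit-level design is not reproduced here.

```python
def quantize(vector, codebook):
    """Return (index, squared_distance) of the codeword closest to `vector`."""
    best_index, best_dist = -1, float("inf")
    for i, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist


# Hypothetical 2-D codebook and feature frames (not from the thesis).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.2, -0.1], [0.9, 1.2], [3.5, 0.4]]

# Each frame becomes one discrete symbol, suitable as input to an HMM decoder.
symbols = [quantize(frame, codebook)[0] for frame in frames]
```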
|
234 |
Evaluation of two tactile speech displays
Clements, Mark Andrew. January 1978
Thesis: Elec. E., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 1978 / Bibliography: leaves 57-59. / by Mark Andrew Clements.
|
235 |
Independent formant and pitch control applied to singing voice
Calitz, Wietsche Roets 12 1900
Thesis (MScIng)--University of Stellenbosch, 2004. / ENGLISH ABSTRACT: A singing voice can be manipulated artificially by means of a digital computer for the
purposes of creating new melodies or to correct existing ones. When the fundamental frequency
of an audio signal that represents a human voice is changed by simple algorithms,
the formants of the voice tend to move to new frequency locations, making it sound unnatural.
The main purpose is to design a technique by which the pitch and formants of a
singing voice can be controlled independently. / AFRIKAANS SUMMARY (translated): Independent formant and pitch control applied to a singing voice: A singing voice can be
manipulated by a digital computer to create new melodies or to improve existing ones.
When the fundamental frequency of a sound signal (representing a human voice) is changed
by a simple algorithm, the original formants shift to new frequency regions. This makes
the result sound unnatural. The main aim is to design a technique that can control the
pitch and the formants of a singing voice separately.
|
236 |
Non-acoustic speaker recognition
Du Toit, Ilze 12 1900
Thesis (MScIng)--University of Stellenbosch, 2004. / ENGLISH ABSTRACT: In this study the phoneme labels derived from a phoneme recogniser are used for phonetic
speaker recognition. The time-dependencies among phonemes are modelled by using
hidden Markov models (HMMs) for the speaker models. Experiments are done using first-order
and second-order HMMs and various smoothing techniques are examined to address
the problem of data scarcity. The use of word labels for lexical speaker recognition is also
investigated. Single word frequencies are counted and the use of various word selections
as feature sets are investigated. During April 2004, the University of Stellenbosch, in collaboration
with Spescom DataVoice, participated in an international speaker verification
competition presented by the National Institute of Standards and Technology (NIST). The
University of Stellenbosch submitted phonetic and lexical (non-acoustic) speaker recognition
systems and a fused system (the primary system) that fuses the acoustic system of
Spescom DataVoice with the non-acoustic systems of the University of Stellenbosch. The
results were evaluated by means of a cost model. Based on the cost model, the primary
system obtained second and third position in the two categories that were submitted. / AFRIKAANS SUMMARY (translated): This project uses phoneme labels that are classified by a phoneme recogniser
and then used for phonetic speaker recognition. The time dependencies between phonemes
are modelled using hidden Markov models (HMMs) as speaker models. Experiments are
carried out with first-order and second-order HMMs, and various smoothing techniques are
examined to address data scarcity. The use of word labels for speaker recognition is also
investigated. Single-word frequencies are counted, and experiments are carried out with
various word selections as features for speaker recognition. During April 2004 the
University of Stellenbosch, in collaboration with Spescom DataVoice, took part in an
international speaker verification competition presented by the National Institute of
Standards and Technology (NIST). The University of Stellenbosch entered a phonetic and a
word-based (non-acoustic) speaker recognition system, as well as a fused system that
serves as the primary system. The fused system is a combination of Spescom DataVoice's
acoustic system and the two non-acoustic systems of the University of Stellenbosch. The
results were evaluated using a cost model. Based on the cost model, the primary system
achieved second and third place in the two categories entered.
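The phonetic approach described above, modelling time dependencies among phoneme labels and smoothing to cope with scarce data, can be sketched in a much simplified form. The sketch below assumes a plain first-order Markov chain (bigram) over phoneme labels rather than the HMMs used in the thesis, add-alpha smoothing as a stand-in for the unspecified smoothing techniques, and a log-likelihood ratio against a background model as the verification score. All function names and the toy label sequences are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sequences, alphabet, alpha=1.0):
    """Estimate add-alpha smoothed transition log-probabilities log P(b | a)."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    size = len(alphabet)
    return {
        (a, b): math.log((pair_counts[(a, b)] + alpha) / (context_counts[a] + alpha * size))
        for a in alphabet
        for b in alphabet
    }


def avg_log_likelihood(seq, model):
    """Average transition log-probability; seq must contain at least two labels."""
    pairs = list(zip(seq, seq[1:]))
    return sum(model[p] for p in pairs) / len(pairs)


def verification_score(seq, speaker_model, background_model):
    """Positive scores favour the claimed speaker over the background population."""
    return avg_log_likelihood(seq, speaker_model) - avg_log_likelihood(seq, background_model)


# Toy phoneme alphabet and label sequences (invented for illustration).
alphabet = ["a", "b", "c"]
speaker_model = train_bigram([list("ababab")], alphabet)
background_model = train_bigram([list("abcabcabc"), list("acbacb")], alphabet)

matching_score = verification_score(list("abab"), speaker_model, background_model)
mismatched_score = verification_score(list("acac"), speaker_model, background_model)
```

Smoothing matters here because any transition unseen in the small speaker training set would otherwise receive zero probability and make the score undefined.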
|
237 |
Wavelet-based speech enhancement: a statistical approach
Harmse, Wynand 12 1900
Thesis (MScIng)--University of Stellenbosch, 2004. / ENGLISH ABSTRACT: Speech enhancement is the process of removing background noise from speech signals. The
equivalent process for images is known as image denoising. While the Fourier transform is
widely used for speech enhancement, image denoising typically uses the wavelet transform.
Research on wavelet-based speech enhancement has only recently emerged, yet it shows
promising results compared to Fourier-based methods. This research is enhanced by the
availability of new wavelet denoising algorithms based on the statistical modelling of
wavelet coefficients, such as the hidden Markov tree.
The aim of this research project is to investigate wavelet-based speech enhancement from
a statistical perspective. Current Fourier-based speech enhancement and its evaluation
process are described, and a framework is created for wavelet-based speech enhancement.
Several wavelet denoising algorithms are investigated, and it is found that the algorithms
based on the statistical properties of speech in the wavelet domain outperform the classical
and more heuristic denoising techniques. The choice of wavelet influences the quality of the
enhanced speech and the effect of this choice is therefore examined. The introduction of a
noise floor parameter also improves the perceptual quality of the wavelet-based enhanced
speech, by masking annoying residual artifacts. The performance of wavelet-based speech
enhancement is similar to that of the more widely used Fourier methods at low noise
levels, with a slight difference in the residual artifact. At high noise levels, however, the
Fourier methods are superior. / AFRIKAANS SUMMARY (translated): Speech enhancement is the process by which background noise is removed from speech signals.
The equivalent process for images is called image denoising. While speech enhancement is
generally done in the Fourier domain, image denoising typically uses the wavelet transform.
Research on wavelet-based speech enhancement has only recently appeared, and it already
shows promising results compared with Fourier-based methods. This research field is aided
by the availability of new wavelet-based denoising techniques that model the wavelet
coefficients statistically, such as the hidden Markov tree.
The aim of this research project is to study wavelet-based speech enhancement from a
statistical point of view. Current Fourier-based speech enhancement methods, as well as
the evaluation process for such algorithms, are discussed, and a framework is created for
wavelet-based speech enhancement. Several wavelet-based algorithms are investigated, and
it is found that the methods that use the statistical properties of speech in the wavelet
domain perform better than the classical and more heuristic methods. The choice of wavelet
influences the quality of the enhanced speech, and the effect of this choice is therefore
examined. The use of a noise floor parameter also improves the quality of the
wavelet-enhanced speech by hiding annoying residual artifacts. The wavelet methods perform
about the same as the classical Fourier methods at low noise levels, with a small
difference in residual artifacts. At high noise levels, however, the Fourier methods still
perform better.
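To make the wavelet denoising idea concrete, here is a minimal sketch of the classical, heuristic kind of method the abstract compares against: a one-level Haar transform followed by soft thresholding of the detail coefficients. The `floor` parameter is only one possible reading of a noise floor (it keeps a fraction of each coefficient instead of zeroing it) and is an assumption, not the thesis's definition. The statistical methods such as the hidden Markov tree are not shown.

```python
import math


def haar_forward(signal):
    """One-level Haar transform of an even-length list: (approx, detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def shrink(coeff, threshold, floor=0.0):
    """Soft-threshold one coefficient, keeping at least floor * |coeff|.

    floor is assumed to lie in [0, 1]; floor=0 gives plain soft thresholding.
    """
    magnitude = max(abs(coeff) - threshold, floor * abs(coeff))
    return math.copysign(magnitude, coeff)


def denoise(signal, threshold, floor=0.0):
    """Shrink the detail coefficients and reconstruct the signal."""
    approx, detail = haar_forward(signal)
    detail = [shrink(d, threshold, floor) for d in detail]
    return haar_inverse(approx, detail)


noisy = [1.0, 3.0, 2.0, 2.0]          # toy samples, not speech data
smoothed = denoise(noisy, threshold=10.0)            # details fully removed
floored = denoise(noisy, threshold=10.0, floor=0.5)  # half of each detail kept
```

With the floor set above zero, some residual noise is deliberately left in place, which in this reading is what masks isolated artifacts that plain thresholding would leave behind.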
|
238 |
Tree-based Gaussian mixture models for speaker verification
Cilliers, Francois Dirk 12 1900
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2005. / The Gaussian mixture model (GMM) performs very effectively in applications
such as speech and speaker recognition. However, evaluation speed is greatly
reduced when the GMM has a large number of mixture components. Various
techniques improve the evaluation speed by reducing the number of required
Gaussian evaluations.
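A rough sketch of how the number of Gaussian evaluations might be reduced: group the mixture components into clusters, evaluate one representative Gaussian per cluster, and then evaluate only the components of the best-scoring cluster. This two-level grouping is an assumption made for illustration; the abstract does not describe how the thesis builds or searches its tree. Diagonal covariances are assumed, and all names and numbers are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances, shortlist=None):
    """Log-likelihood of x; with a shortlist, only those components are evaluated."""
    indices = range(len(weights)) if shortlist is None else shortlist
    terms = [math.log(weights[k]) + log_gaussian(x, means[k], variances[k]) for k in indices]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def pick_shortlist(x, cluster_means, cluster_vars, cluster_members):
    """Score one Gaussian per cluster and return the members of the best cluster."""
    scores = [log_gaussian(x, m, v) for m, v in zip(cluster_means, cluster_vars)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return cluster_members[best]


# Toy 1-D mixture with two well-separated groups of components.
weights = [0.25, 0.25, 0.25, 0.25]
means = [[-5.2], [-4.8], [4.8], [5.2]]
variances = [[1.0], [1.0], [1.0], [1.0]]
cluster_means = [[-5.0], [5.0]]
cluster_vars = [[1.5], [1.5]]
cluster_members = [[0, 1], [2, 3]]

x = [4.5]
shortlist = pick_shortlist(x, cluster_means, cluster_vars, cluster_members)
full = gmm_log_likelihood(x, weights, means, variances)
approx = gmm_log_likelihood(x, weights, means, variances, shortlist)
```

Here the approximation evaluates two cluster Gaussians plus two components instead of all four components; the saving only becomes meaningful for mixtures with many more components, and the approximate score is a lower bound on the full one.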
|
239 |
Speech generation in a spoken dialogue system
Visagie, Albertus Sybrand 12 1900
Thesis (MScIng)--University of Stellenbosch, 2004. / ENGLISH ABSTRACT: Spoken dialogue systems accessed over the telephone network are rapidly becoming more
popular as a means to reduce call-centre costs and improve customer experience. It is
now technologically feasible to delegate repetitive and relatively simple tasks conducted
in most telephone calls to automatic systems. Such a system uses speech recognition to
take input from users. This work focuses on the speech generation component that a
specific prototype system uses to convey audible speech output back to the user.
Many commercial systems contain general text-to-speech synthesisers. Text-to-speech
synthesis is a very active branch of speech processing. It aims to build machines that
read text aloud. In some languages this has been a reality for almost two decades. While
these synthesisers are often very understandable, they almost never sound natural. The
output quality of synthetic speech is considered to be a very important factor in the user’s
perception of the quality and usability of spoken dialogue systems.
The static nature of the spoken dialogue system is exploited to produce a custom
speech synthesis component that provides very high quality output speech for the particular
application. To this end the current state of the art in speech synthesis is surveyed
and summarised. A unit-selection synthesiser is produced that functions in Afrikaans,
English and Xhosa.
The unit-selection synthesiser selects short waveforms from a recorded speech corpus,
and concatenates them to produce the required utterances. Techniques are developed for
designing a compact corpus and processing it to produce a unit-selection database. Speech
modification methods were researched to build a framework for natural-sounding speech
concatenation. This framework also provides pitch and duration modification capabilities
that will enable research in languages such as Afrikaans and Xhosa where text-to-speech
capabilities are relatively immature. / AFRIKAANS SUMMARY (translated): Telephone-based spoken dialogue systems are becoming ever more common, and are an
effective way to lower call-centre costs. It is currently technologically possible to
handle a large number of simple transactions with automatic systems. Such systems use
speech recognition to receive input from the user. This work focuses on the speech
generation component that a specific prototype system uses to play output back to the
user.
Many commercial systems use generic text-to-speech synthesisers. Such text-to-speech
synthesisers are still a very active field in speech research. In general, research aims
to read text and convert it into understandable speech. Such systems have existed for at
least two decades. Although fully understandable, these systems sound unnatural. In
telephone-based spoken dialogue systems, the quality of the synthetic speech is important
for the user's perception of the system's quality and usability.
The dialogue is mostly static in nature, and this property is exploited to synthesise
high-quality speech in a particular application. To achieve this, the current state of
affairs in this field was studied and summarised. A cut-and-paste synthesiser was built
that works in Afrikaans, English and Xhosa.
The synthesiser selects short pieces of speech waveform from a speech corpus and joins
them together to produce the required speech. Automatic techniques were developed to
design a compact corpus that still contains everything the synthesiser will need to
perform its task. Further techniques process the corpus into a usable form for synthesis.
Methods of speech modification were investigated in order to make the joined pieces of
speech sound more natural and to correct their intonation and tempo. This provides
infrastructure for research in languages such as Afrikaans and Xhosa, where text-to-speech
capabilities are still immature.
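Selecting and concatenating units from a corpus is commonly framed as a shortest-path search: each candidate unit has a target cost (how well it matches what is wanted) and each pair of adjacent units has a join cost (how smoothly they concatenate). The sketch below assumes that formulation and solves it with dynamic programming; the abstract does not state which costs or search the thesis uses, so the cost functions and the toy units here are hypothetical.

```python
def select_units(candidates, target_cost, join_cost):
    """Pick one unit per position, minimizing total target + join cost.

    candidates: list (one entry per target position) of lists of units.
    target_cost(position, unit) and join_cost(prev_unit, unit) return floats.
    Returns (chosen_units, total_cost).
    """
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j], backpointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, u), back))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path)), total


# Toy units: (name, pitch in Hz). The join cost is the pitch mismatch at the seam;
# the target cost is zero because every candidate already matches its position.
candidates = [
    [("h1", 100), ("h2", 140)],
    [("e1", 150), ("e2", 105)],
    [("l1", 110), ("l2", 160)],
]
chosen, total = select_units(
    candidates,
    target_cost=lambda position, unit: 0.0,
    join_cost=lambda prev, unit: abs(prev[1] - unit[1]),
)
chosen_names = [name for name, _ in chosen]
```

Minimizing join cost favours units that need little pitch modification at the seams, which fits the abstract's point that modification methods are there to make concatenation sound natural.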
|
240 |
The design of a high-performance, floating-point embedded system for speech recognition and audio research purposes
Duckitt, William 03 1900
Thesis (MScEng (Electrical and Electronic Engineering))--Stellenbosch University, 2008. / This thesis describes the design of a high performance, floating-point, standalone embedded
system that is appropriate for speech and audio processing purposes.
The system successfully employs the Analog Devices TigerSHARC TS201 600MHz floating
point digital signal processor as a CPU, and includes 512MB RAM, a Compact Flash storage card
interface as non-volatile memory, a multi-channel audio input and output system with two
programmable microphone preamplifiers offering up to 65dB gain, a USB interface, an LCD display
and a push-button user interface.
An Altera Cyclone II FPGA is used to interface the CPU with the various peripheral
components. The FIFO buffers within the FPGA allow bulk DMA transfers of audio data for minimal
processor delays. Similar approaches are taken for communication with the USB interface, the
Compact Flash storage card and the LCD display.
A logic analyzer interface allows system debugging via the FPGA. This interface can also in
future be used to interface to additional components. The power distribution required a total of 11
different supplies to be provided with a total consumption of 16.8W. A 6 layer PCB incorporating 4
signal layers, a power plane and ground plane was designed for the final prototype.
All system components were verified to be operating correctly by means of appropriate
testing software, and the computational performance was measured by repeated calculation of a
multi-dimensional Gaussian log-probability and found to be comparable with an Intel 1.8GHz
Core2Duo processor.
The design can therefore be considered a success, and the prototype is ready for
development of suitable speech or audio processing software.
|