31

Music And Speech Analysis Using The 'Bach' Scale Filter-Bank

Ananthakrishnan, G 04 1900 (has links)
The aim of this thesis is to define a perceptual scale for the ‘Time-Frequency’ analysis of music signals. The equal-tempered ‘Bach’ scale is a suitable scale, since it covers most genres of music and the error is equally distributed across each semitone. However, it may be necessary to allow a tolerance of around 50 cents, or half the interval of the Bach scale, so that the interval can accommodate other common intonation schemes. The thesis covers the formulation of the Bach scale filter-bank as a time-varying model and makes a comparative study with other commonly used perceptual scales. Two applications of the Bach scale filter-bank are also proposed, namely automated segmentation of speech signals and transcription of the singing voice for query-by-humming applications.

Even though this filter-bank is motivated by music, it can also be applied to speech. A method for automatically segmenting continuous speech into phonetic units is proposed. The results obtained with the proposed method show around 82% accuracy for the English and 85% accuracy for the Hindi databases, an improvement of around 2-3% over other popular methods in the literature. Interestingly, the Bach scale filters perform better than filters designed for other common perceptual scales, such as the Mel and Bark scales.

‘Musical transcription’ refers to the process of converting a musical rendering or performance into a set of symbols or notations. A query in a ‘query-by-humming’ system can be made in several ways, such as singing with words, singing with arbitrary syllables, or whistling. Two algorithms are suggested to annotate a query, designed to be fairly robust across these various forms of queries. The first is a frequency-selection method, which selects the most likely frequency components at any given time instant. The second finds time-connected contours of high energy in the ‘Time-Frequency’ plane of the input signal. The time-domain algorithm works better in terms of instantaneous pitch estimates, with an error of around 10-15%, while the frequency-domain method results in an error of around 12-20%.

A song rendered by two different people will have quite a few different properties: their absolute pitches, rates of rendering, timbres based on voice quality, and inaccuracies may all differ. The thesis discusses a method to quantify the distance between two different renderings of a musical piece. The distance function has been evaluated by searching for a particular song in a database of 315 songs, sung by both male and female singers and including whistled queries. Around 90% of the time, the correct song is found among the top five choices.

Thus, the Bach scale is proposed as a suitable scale for representing the perception of music. It has been explored in two applications, namely automated segmentation of speech and transcription of the singing voice. Using the transcription obtained, a measure of the distance between renderings of musical pieces has also been suggested.
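Since the Bach scale is defined by equal-tempered semitone spacing, the filter-bank's centre frequencies follow directly from the standard 2^(1/12) ratio. The sketch below illustrates this; the reference frequency, band count, and the ±50-cent band edges are illustrative assumptions drawn from the abstract, not the thesis's exact design.

```python
# A minimal sketch of equal-tempered (semitone-spaced) centre frequencies,
# as the abstract describes; f_ref and n_bands are illustrative choices.
import numpy as np

def bach_scale_centres(f_ref=27.5, n_bands=88):
    """Centre frequencies one semitone apart: f_k = f_ref * 2**(k/12)."""
    k = np.arange(n_bands)
    return f_ref * 2.0 ** (k / 12.0)

centres = bach_scale_centres()
# A 50-cent tolerance (half a semitone) on either side gives each band's edges.
lower = centres * 2.0 ** (-50 / 1200)
upper = centres * 2.0 ** (50 / 1200)
print(centres[:5], lower[:5], upper[:5])
```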
32

The language learning infant: Effects of speech input, vocal output, and feedback

Gustavsson, Lisa January 2009 (has links)
This thesis studies the characteristics of the acoustic signal in speech, especially in speech directed to infants and in infant vocal development, to gain insight into essential aspects of speech processing, speech production and communicative interaction in early language acquisition. Three sets of experimental studies are presented; from a phonetic point of view they investigate the fundamental processes involved in first language acquisition. The first set (study 1.1 and study 1.2) investigated how linguistic structure in the speech signal can be derived and which strategies infants and adults use to process information depending on its presentation. The second set (study 2.1 and study 2.2) studied the acoustic consequences of the anatomical geometry of the infant vocal tract and the development of sensory-motor control for articulatory strategies. The third set (study 3.1 and study 3.2) explored the infant's interaction with the linguistic environment, specifically how vocal imitation and reinforcement may help infants converge towards adult-like speech.

The first set of studies suggests that the structure and quality of simultaneous sensory input affect the establishment of initial linguistic representations. The second set indicates that the anatomy of the infant vocal tract does not constrain the production of adult-like speech sounds and that some degree of articulatory motor control is present from six months of age. The third set suggests that the adult interprets and reinforces vocalizations produced by the infant in a developmentally adjusted fashion that can guide the infant towards the sounds of the ambient language. The results are discussed in terms of essential aspects of early speech processing and speech production that can be accounted for by general-purpose biological mechanisms in the language-learning infant. / To order the book, send an e-mail to exp@ling.su.se
33

Timbre Perception of Time-Varying Signals

Arthi, S January 2014 (has links) (PDF)
Every auditory event provides an information-rich signal to the brain. The signal comprises the perceptual attributes of pitch, loudness and timbre, and also conceptual attributes such as location, emotion and meaning. In the present work we examine the timbre perception of time-varying signals in particular. While the timbre of a stationary signal is by itself perceptually complex, the timbre of a time-varying signal introduces an evolving pattern, adding to its multi-dimensionality. To characterize timbre, we conduct psycho-acoustic perception tests with normal-hearing human subjects. We focus on time-varying synthetic speech signals (the approach can be extended to music) because listeners are perceptually consistent with speech, and because we can parametrically control the timbre and pitch glides using linear time-varying models.

To quantify timbre change in time-varying signals, we define the JND (just-noticeable difference) of timbre using diphthongs synthesized with a time-varying formant-frequency model. The diphthong JND is defined as a two-dimensional contour on the plane of percentage change of the formant frequencies of the terminal vowels. Thus, we simplify the perceptual probing of a diphthong, which is multi-parametric, to a lower-dimensional (2-D) space. We also study the impact of pitch glide on the timbre JND of the diphthong, and observe that the timbre JND is influenced by the occurrence of a pitch glide.

Focusing on the magnitude of perceptual timbre change, we design a MUSHRA-like listening test using the vowel continuum in formant-frequency space. We provide explicit anchors for reference, 0% and 100%, thus quantifying the perceptual timbre change on a 1-D scale. We also propose an objective measure of timbre change and observe a good correlation between the objective measure and subjective human responses of percentage timbre change.

Using this experimental methodology, we studied the influence of pitch shift on timbre perception and observed that the perceptual timbre change increases with change in pitch. We used vowels and diphthongs with five different types of pitch glide: (i) constant pitch, (ii) 3-semitone linearly up, (iii) 3-semitone linearly down, (iv) V-like pitch glide and (v) hat-like pitch glide. The study shows that timbre change can be measured on a 1-D scale if the perturbation is along one dimension. We observe that for bright vowels (/a/ and /i/), a linearly decreasing (dull) pitch glide causes more timbre change than a linearly increasing (bright) pitch glide; for dull vowels (/u/), it is the reverse. To summarize, incongruent pitch glides cause more perceptual timbre change than congruent pitch glides. (A congruent pitch glide is a bright pitch glide in a bright vowel or a dull pitch glide in a dull vowel; an incongruent pitch glide is a bright pitch glide in a dull vowel or a dull pitch glide in a bright vowel.) Experiments with quadratic pitch glides show that, in short-duration signals with little or no sustained part, the decay portion of the pitch glide affects timbre perception more than the attack portion. In the case of time-varying timbre, bright diphthongs show patterns similar to bright vowels; for the bright diphthong /ai/, the perceived timbre change is greatest with a decreasing (dull) pitch glide. We also observed that listeners perceive more timbre change with constant pitch than with pitch glides congruent with the timbre or with quadratic pitch glides.

The main conclusion of this study is that pitch and timbre do interact, and incongruent pitch glides cause more timbre change than congruent pitch glides. In the case of quadratic pitch glides, listeners' perception of vowels is influenced more by the decay than by the attack of the pitch glide in short-duration signals. In the case of time-varying timbre as well, incongruent pitch glides cause the most timbre change, followed by constant pitch. For congruent and quadratic pitch glides in time-varying timbre, listeners perceive less timbre change than otherwise.
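The abstract does not spell out the proposed objective measure; as a loudly hypothetical stand-in, the sketch below expresses a stimulus's position along a vowel continuum as its scalar projection in formant-frequency space between the 0% and 100% anchors, mirroring the 1-D quantification described above. All formant values are illustrative, and the thesis's actual measure may differ.

```python
# Hypothetical 1-D timbre-change measure: project the stimulus's formant
# vector onto the line joining the two anchor vowels (0% and 100%).
import numpy as np

def percent_timbre_change(formants, anchor_0, anchor_100):
    v = np.asarray(anchor_100, float) - np.asarray(anchor_0, float)
    w = np.asarray(formants, float) - np.asarray(anchor_0, float)
    return 100.0 * float(np.dot(w, v) / np.dot(v, v))

# Illustrative F1/F2 values (Hz) on an /a/-to-/i/ continuum.
print(percent_timbre_change([500, 1900], anchor_0=[700, 1200], anchor_100=[300, 2300]))
```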
34

Diagnóza Parkinsonovy choroby z řečového signálu / Parkinson disease diagnosis using speech signal analysis

Karásek, Michal January 2011 (has links)
The thesis deals with the recognition of Parkinson's disease from the speech signal. The first part covers the principles of speech signals and the characteristics of speech produced by patients suffering from Parkinson's disease. It then describes speech signal processing, the basic features used for the diagnosis of Parkinson's disease (e.g. VAI, VSA, FCR, VOT) and the reduction of this feature set. The next part presents a block diagram of the program for the diagnosis of Parkinson's disease. The main objective of this thesis is a comparison of two feature-selection methods (mRMR and SFFS). Two different classification methods were used: the first is k-nearest neighbours (kNN) and the second is the Gaussian mixture model (GMM).
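A hedged sketch of the pipeline the abstract outlines: feature selection followed by kNN and GMM classification. mRMR is not available in scikit-learn, so a mutual-information top-k filter stands in for it here (relevance only, no redundancy term), and the data are synthetic placeholders rather than the thesis's speech features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.mixture import GaussianMixture

X, y = make_classification(n_samples=200, n_features=30, random_state=0)
X = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)  # stand-in for mRMR/SFFS
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# kNN classifier.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("kNN accuracy:", knn.score(X_te, y_te))

# GMM as a classifier: fit one mixture per class, decide by highest likelihood.
gmms = {c: GaussianMixture(n_components=2, random_state=0).fit(X_tr[y_tr == c])
        for c in np.unique(y_tr)}
scores = np.column_stack([gmms[c].score_samples(X_te) for c in sorted(gmms)])
print("GMM accuracy:", np.mean(np.argmax(scores, axis=1) == y_te))
```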
35

Akustická analýza vět složitých na artikulaci u pacientů s Parkinsonovou nemocí / Acoustic analysis of sentences complicated for articulation in patients with Parkinson's disease

Kiska, Tomáš January 2015 (has links)
This work deals with the design of a hypokinetic dysarthria analysis system. Hypokinetic dysarthria is a speech motor dysfunction that is present in approximately 90% of patients with Parkinson's disease. The work describes Parkinson's disease and the changes it causes in the speech signal, followed by the features used for the diagnosis of the disease (FCR, VSA, VAI, etc.). It is mainly focused on parameterization techniques that can be used to diagnose or monitor this disease as well as estimate its progress. A protocol for dysarthric speech acquisition is also described; in combination with acoustic analysis it can be used to estimate the grade of hypokinetic dysarthria in the fields of faciokinesis, phonorespiration and phonetics (correlation with the 3F test). Regarding parameterization, new features based on the RASTA method are proposed, and the analysis is based on parameterizing sentences that are complicated to articulate. The experimental dataset consists of 101 PD patients at different stages of disease progression and 53 healthy controls. The mRMR method was selected for feature selection prior to classification.
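For context, a minimal sketch of the classic RASTA band-pass filter that RASTA-based features build on: the IIR filter is applied along time to each log-spectral (or cepstral) trajectory, suppressing both very slow channel effects and very fast frame-to-frame fluctuations. The coefficients follow the standard RASTA formulation; the input here is a synthetic placeholder, not the thesis's feature set.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectrum):
    """Apply the RASTA filter along the time axis (axis 0: frames)."""
    b = np.array([0.2, 0.1, 0.0, -0.1, -0.2])  # slope (delta) numerator: removes slow trends
    a = np.array([1.0, -0.94])                 # pole smooths fast fluctuations
    return lfilter(b, a, log_spectrum, axis=0)

frames = np.random.randn(100, 20)  # 100 frames x 20 log-spectral bands (placeholder)
print(rasta_filter(frames).shape)
```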
36

Analýza Parkinsonovy nemoci pomocí segmentálních řečových příznaků / Analysis of Parkinson's disease using segmental speech parameters

Mračko, Peter January 2015 (has links)
This project describes the design of a system for diagnosing Parkinson's disease from speech. Parkinson's disease is a neurodegenerative disorder of the central nervous system; one of its symptoms is an impairment of the motor aspects of speech, called hypokinetic dysarthria. The design of the system is based on well-known segmental features such as LPC, PLP, MFCC and LPCC coefficients, but also on less common ones such as CMS, ACW and MSC. These coefficients are calculated from speech recordings of patients affected by Parkinson's disease and of healthy controls; a selection process and subsequent classification are then performed. The best result obtained in this project reached a classification accuracy of 77.19%, a sensitivity of 74.69% and a specificity of 78.95%.
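As a hedged illustration of extracting two of the segmental features the abstract names, the sketch below computes MFCCs with librosa and applies cepstral mean subtraction (CMS); the file path, sampling rate and choice of 13 coefficients are illustrative assumptions, not the thesis's configuration.

```python
import librosa

y, sr = librosa.load("speech_sample.wav", sr=16000)  # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 x n_frames matrix
mfcc_cms = mfcc - mfcc.mean(axis=1, keepdims=True)   # cepstral mean subtraction (CMS)
print(mfcc_cms.shape)
```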
37

Applying Automatic Speech to Text in Academic Settings for the Deaf and Hard of Hearing

Weigel, Carla January 2021 (has links)
This project discusses the importance of accurate note-taking for D/deaf and hard of hearing students who have accommodation requirements, and offers innovative opportunities to improve the student experience in order to encourage more D/deaf and hard of hearing individuals to pursue academia. It also includes a linguistic analysis of the speech signals that correspond to transcription output errors produced by speech-to-text programs, which can be utilized to advance and improve speech recognition systems.

In hopes of encouraging more D/deaf and hard of hearing (DHH) students to pursue academia, speech-to-text has been suggested to address notetaking issues. This research examined several transcripts created by two untrained speech-to-text programs, Ava and Otter, using 11 different speakers in academic contexts. Observations regarding functionality and error analysis are detailed in this thesis. The project has several objectives: 1) to outline how DHH students' experience differs from other note-taking needs; 2) to use linguistic analysis to understand how transcript accuracy converts to real-world use and to investigate why errors occur; and 3) to describe what needs to be addressed before assigning DHH students a captioning service. Results from a focus group showed that current notetaking services are problematic and that automatic captioning may solve some issues, but some errors are detrimental, as it is particularly difficult for DHH students to identify and fix errors within transcripts. Transcripts produced by the programs were difficult to read, as outputs lacked accurate utterance breaks and contained poor punctuation. The captioning of scripted speech was more accurate than that of spontaneous speech for native and most non-native English speakers. An analysis of errors showed that some errors are less severe than others; in response, we offer an alternative way to view errors: as insignificant, obvious, or critical. Errors are caused either by the program's inability to identify various items, such as word breaks, abbreviations, and numbers, or by a blend of speaker factors including assimilation, vowel approximation, epenthesis, phoneme reduction, and overall intelligibility. Both programs worked best with intelligible speech, as measured by human perception. Speech-rate trends were surprising: Otter seemed to prefer fast speech from native English speakers, and Ava preferred, as expected, slow speech, but results differed between scripted and spontaneous speech. Correlations of accuracy and fundamental frequency showed conflicting results. Some reasons for errors could not be determined without knowing more about how the systems were programmed.

Thesis / Master of Science (MSc)

In hopes of encouraging more D/deaf and hard of hearing (DHH) students to pursue academia, automatic captioning has been suggested to address notetaking issues. Captioning programs use speech recognition (SR) technology to caption lectures in real time and produce a transcript afterwards. This research examined several transcripts created by two untrained speech-to-text programs, Ava and Otter, using 11 different speakers. Observations regarding functionality and error analysis are detailed in this thesis. The project has several objectives: 1) to outline how DHH students' experience differs from other note-taking needs; 2) to use linguistic analysis to understand how transcript accuracy converts to real-world use and to investigate why errors occur; and 3) to describe what needs to be addressed before assigning DHH students a captioning service. Results from a focus group showed that current notetaking services are problematic and that automatic captioning may solve some issues, but some types of errors are detrimental, as it is particularly difficult for DHH students to identify and fix errors within transcripts. Transcripts produced by the programs were difficult to read, as outputs contained poor punctuation and lacked breaks between thoughts. Captioning of scripted speech was more accurate than that of spontaneous speech for native and most non-native English speakers, and an analysis of errors showed that some errors are less severe than others. In response, we offer an alternative way to view errors: as insignificant, obvious, or critical. Errors are caused either by the program's inability to identify various items, such as word breaks, abbreviations, and numbers, or by a blend of speaker factors. Both programs worked best with intelligible speech; one seemed to prefer fast speech from native English speakers and the other preferred slow speech, and tests for a preference for male or female voices showed conflicting results. Some reasons for errors could not be determined, as one would have to observe how the systems were programmed.
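Transcript accuracy in such studies is conventionally scored as word error rate (WER); the sketch below is a minimal implementation via word-level edit distance. This is a generic measure, not the thesis's own taxonomy, which additionally grades errors as insignificant, obvious, or critical, a human judgement a simple count does not capture.

```python
# Minimal WER: Levenshtein distance over words, normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)

print(wer("the speech was captioned live", "this speech was caption live"))  # 0.4
```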
38

Identifikace osob pomocí otisku hlasu / Identification of persons via voice imprint

Mekyska, Jiří January 2010 (has links)
This work deals with text-dependent speaker recognition in systems where only a few training samples exist. For the purpose of this recognition, a voice imprint based on different features (e.g. MFCC, PLP, ACW) is proposed. The work first describes how the speech signal is produced and mentions some speech characteristics important for speaker recognition. The next part deals with speech signal analysis, covering preprocessing and feature-extraction methods. The following part describes the process of speaker recognition and the evaluation of the methods used: speaker identification and verification. The last theoretical part deals with classifiers suitable for text-dependent recognition: classifiers based on fractional distances, dynamic time warping, dispersion matching and vector quantization. The work concludes with the design and realization of a system that evaluates all of the described classifiers for voice imprints based on the different features.
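A hedged sketch of one classifier family the abstract lists: dynamic time warping (DTW) as a distance between two feature sequences of unequal length, as needed when comparing a test utterance against a stored voice imprint. The feature matrices here are random placeholders standing in for e.g. MFCC frames.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classic DTW with Euclidean local cost; returns the total alignment cost."""
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

a = np.random.randn(40, 13)  # 40 frames of 13-dim features (placeholder)
b = np.random.randn(55, 13)
print(dtw_distance(a, b))
```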
39

Analýza řečových promluv pro IT diagnostiku neurologických onemocnění / Analysis of Speech Signals for the Purpose of Neurological Disorders IT Diagnosis

Mekyska, Jiří January 2014 (has links)
This work deals with the design of a hypokinetic dysarthria analysis system. Hypokinetic dysarthria is a speech motor dysfunction that is present in approximately 90% of patients with Parkinson's disease. The work is mainly focused on parameterization techniques that can be used to diagnose or monitor this disease as well as estimate its progress. Next, features that significantly correlate with subjective tests are identified; these features can be used to estimate scores on different scales such as the Unified Parkinson's Disease Rating Scale (UPDRS) or the Mini-Mental State Examination (MMSE). A protocol for dysarthric speech acquisition is introduced in this work too; in combination with acoustic analysis it can be used to estimate the grade of hypokinetic dysarthria in the fields of faciokinesis, phonorespiration and phonetics (correlation with the 3F test). Regarding parameterization, features based on the modulation spectrum, inferior colliculus coefficients, the bicepstrum, approximate and sample entropy, empirical mode decomposition and singular points are originally introduced in this work. All the designed techniques are integrated into the system concept in such a way that it can be implemented in a hospital and used for research on Parkinson's disease or its evaluation.
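One of the feature types named above, sample entropy, admits a compact definition: the negative log of the conditional probability that sequences matching for m points also match for m+1 points. Below is a hedged, straightforward O(N²) sketch; the choices of m and r follow common defaults, not necessarily the thesis's settings.

```python
import numpy as np

def sample_entropy(x, m=2, r=None):
    """SampEn = -ln(A/B), A/B = matched template pairs of length m+1 / m."""
    x = np.asarray(x, float)
    r = 0.2 * np.std(x) if r is None else r

    def count_matches(length):
        templates = np.array([x[i:i + length] for i in range(len(x) - length + 1)])
        # Chebyshev distance between every pair of templates.
        dists = np.max(np.abs(templates[:, None] - templates[None, :]), axis=2)
        n = len(templates)
        return (np.sum(dists <= r) - n) / 2  # exclude self-matches, count pairs once

    b, a = count_matches(m), count_matches(m + 1)
    return -np.log(a / b) if a > 0 and b > 0 else np.inf

print(sample_entropy(np.sin(np.linspace(0, 20, 300)) + 0.1 * np.random.randn(300)))
```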
40

Rozpoznání emočního stavu z hrané a spontánní řeči / Emotion Recognition from Acted and Spontaneous Speech

Atassi, Hicham January 2014 (has links)
This dissertation deals with the recognition of the emotional state of speakers from the speech signal. The work is divided into two main parts. The first part describes methods proposed for recognizing the emotional state from acted-speech databases, and presents recognition results obtained using two different databases in different languages. The main contributions of this part are a detailed analysis of a wide range of features extracted from the speech signal, the design of new classification architectures such as "emotion pairing", and a new method for mapping discrete emotional states into a two-dimensional space. The second part deals with the recognition of emotional states from a spontaneous-speech database obtained from recordings of conversations in real call centres. The findings from the analysis and the design of recognition methods for acted speech were used to design a new system for recognizing seven spontaneous emotional states. The core of the proposed approach is a complex classification architecture based on the fusion of different systems. The work further examines the influence of the speaker's emotional state on the accuracy of gender recognition, and proposes a system for the automatic detection of successful calls in call centres based on an analysis of the dialogue parameters between the participants in telephone calls.
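A hedged sketch of classifier fusion, the core of the architecture described above: soft voting over heterogeneous classifiers stands in for the thesis's own fusion scheme, whose exact composition the abstract does not give. The seven-class synthetic data mirrors only the number of emotional states.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_classes=7,
                           n_informative=10, random_state=0)
fusion = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("knn", KNeighborsClassifier()),
                ("svm", SVC(probability=True))],
    voting="soft")  # average the class-probability outputs of the subsystems
fusion.fit(X, y)
print(fusion.score(X, y))
```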
