1. CIAIR In-Car Speech Corpus: Influence of Driving Status

Kawaguchi, Nobuo, Matsubara, Shigeki, Takeda, Kazuya, Itakura, Fumitada 03 1900
No description available.

2. Construction and Evaluation of a Large In-Car Speech Corpus

Takeda, Kazuya, Fujimura, Hiroshi, Itou, Katsunobu, Kawaguchi, Nobuo, Matsubara, Shigeki, Itakura, Fumitada 03 1900
No description available.

3. Construction and Utilization of Bilingual Speech Corpus for Simultaneous Machine Interpretation Research

Tohyama, Hitomi, Matsubara, Shigeki, Kawaguchi, Nobuo, Inagaki, Yasuyoshi 06 September 2005
No description available.

4. Influence of Pause Length on Listeners' Impressions in Simultaneous Interpretation

Matsubara, Shigeki, Tohyama, Hitomi 17 September 2006
No description available.

5. Assessing the impact of manual corrections in the Groningen Meaning Bank

Weck, Benno January 2016
The Groningen Meaning Bank (GMB) project develops a corpus with rich syntactic and semantic annotations. Annotations in the GMB are generated semi-automatically and stem from two sources: (i) initial annotations from a set of standard NLP tools, and (ii) corrections and refinements by human annotators. On the part-of-speech level of annotation, for example, there are currently 18,000 such corrections, so-called Bits of Wisdom (BOWs). To apply this information to boost the NLP processing, we experimented with using the BOWs to retrain the part-of-speech tagger and found that it can be improved to correct up to 70% of identified errors on held-out data. Moreover, an improved tagger helps to raise the performance of the parser. Preferring sentences with a high rate of verified tags in retraining proved to be the most reliable strategy. With a simulated active learning experiment using Query-by-Uncertainty (QBU) and Query-by-Committee (QBC), we showed that selectively sampling sentences for retraining yields better results with less data than random selection. In an additional pilot study we found that a standard maximum-entropy part-of-speech tagger can be augmented so that it uses already known tags to enhance its tagging decisions on an entire sequence without first retraining a new model.
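The selective-sampling idea above can be made concrete with a small sketch. This is illustrative code, not from the thesis: it ranks unlabeled sentences by mean per-token tag entropy (Query-by-Uncertainty), so the most uncertain sentences can be hand-corrected and used for retraining. The `tagger.tag_probs` interface is a hypothetical stand-in for any tagger that exposes per-token tag distributions.

```python
import math

def token_entropy(tag_probs):
    """Shannon entropy of one token's tag distribution (dict: tag -> prob)."""
    return -sum(p * math.log(p) for p in tag_probs.values() if p > 0)

def rank_by_uncertainty(tagger, sentences):
    """Query-by-Uncertainty: order sentences by mean per-token tag entropy,
    most uncertain first, as candidates for correction and retraining.
    `tagger.tag_probs(sentence)` (hypothetical) returns one tag-probability
    dict per token."""
    scored = []
    for sent in sentences:
        dists = tagger.tag_probs(sent)
        mean_entropy = sum(token_entropy(d) for d in dists) / max(len(dists), 1)
        scored.append((mean_entropy, sent))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored]
```

Query-by-Committee would replace the entropy score with a disagreement measure (e.g. vote entropy) over several taggers trained on different subsets of the data.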

6. Prosodic properties of formality in spoken Japanese

Sherr-Ziarko, Ethan January 2017
This thesis investigates the relationship between prosody and formality in spoken Japanese, from the standpoints of both speech production and perception. The previous literature on this topic has often produced inconsistent or contradictory results (e.g. Loveday, 1981; Ofuka et al., 2000; Ito, 2001; Ito, 2002), and this thesis therefore addresses the research question of whether speakers and listeners use prosody in any predictable way when expressing or judging formality in spoken Japanese. Chapter 2 describes a pilot study which aimed to determine which prosodic variables were worth investigating in a larger corpus-based study. Speech of different levels of formality was elicited from subjects indirectly, via the inclusion of indexical linguistic items in carrier sentences. Analysis of the relationship between mean f0 and duration shows a significant correlation with the categories of formal and informal speech, with both variables higher in informal speech. Consequently, in Chapter 3, f0 and articulation rate were analyzed in the corpus-based study. Corpus data for the study was collected via one-on-one conversations recorded at NINJAL in Tachikawa-shi, Japan. The speech data from the corpus was analyzed to test the hypothesis that the prosodic variables of mean f0, articulation rate, and f0 range would all be consistently higher in informal speech. Analysis using mixed-effects models and a functional data analysis shows that all three prosodic variables are significantly higher in informal speech. These results then informed the design of a speech perception study, which tested how upward or downward manipulation of mean f0, articulation rate, and f0 range affects listeners' judgments of de-lexicalized speech as formal or informal. Results show that manipulating all three variables upward or downward leads listeners to judge recordings as more informal or more formal, respectively. However, manipulation of individual variables does not correlate significantly with changes in listeners' judgments. This result led to the theory that categorization tasks in speech perception are probabilistic, with listeners accessing distributions of acoustic cues to the categories in order to make judgments. Chapter 5 describes a probabilistic Bayesian model of formality, formulated on the basis of the theory of category judgment described in Chapter 4, which attempts to predict a recording's level of formality from its prosody alone. Given information on the overall and speaker-specific distributions of the prosodic cues to the different levels of formality, the model discriminates between categories at a rate better than chance (~63% accuracy for formal speech, ~74% for informal speech), performing better than the human listeners in Chapter 4, who could not predict formality from prosodic information alone at a rate above chance. The studies in this thesis show a consistent, significant relationship between prosody and formality in spoken Japanese in both speech production and perception, which can be modeled probabilistically using a Bayesian statistical framework.
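As a rough illustration of the kind of probabilistic category judgment the thesis models, the sketch below implements a two-category naive Bayes classifier over Gaussian-distributed prosodic cues (mean f0, articulation rate, f0 range). All distribution parameters here are invented for the example; they are not the fitted values from the thesis.

```python
import math
from dataclasses import dataclass

@dataclass
class Gaussian:
    mean: float
    std: float
    def logpdf(self, x: float) -> float:
        # Log density of a univariate normal distribution.
        return (-0.5 * math.log(2 * math.pi * self.std ** 2)
                - (x - self.mean) ** 2 / (2 * self.std ** 2))

# Illustrative (made-up) cue distributions per formality category.
CUES = {
    "formal":   {"mean_f0": Gaussian(120, 15), "artic_rate": Gaussian(6.5, 0.8),
                 "f0_range": Gaussian(60, 20)},
    "informal": {"mean_f0": Gaussian(135, 18), "artic_rate": Gaussian(7.2, 0.9),
                 "f0_range": Gaussian(85, 25)},
}

def formality_posterior(obs, prior=0.5):
    """Posterior over {formal, informal} given observed cues, assuming
    conditionally independent Gaussian cues (naive Bayes)."""
    log_post = {cat: math.log(prior) + sum(d[c].logpdf(v) for c, v in obs.items())
                for cat, d in CUES.items()}
    m = max(log_post.values())
    z = sum(math.exp(v - m) for v in log_post.values())
    return {cat: math.exp(v - m) / z for cat, v in log_post.items()}

print(formality_posterior({"mean_f0": 128.0, "artic_rate": 7.0, "f0_range": 75.0}))
```

Speaker-specific distributions, as used in the thesis, would simply replace the global `CUES` table with per-speaker parameters.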

7. Construction of linefeed insertion rules for lecture transcript and their evaluation

Matsubara, Shigeki, Ohno, Tomohiro, Murata, Masaki January 2010
No description available.

8. Towards a robust Arabic speech recognition system based on reservoir computing

Alalshekmubarak, Abdulrahman January 2014
In this thesis we investigate the potential of developing a speech recognition system based on a recently introduced artificial neural network (ANN) technique, namely Reservoir Computing (RC). This technique has, in theory, a higher capability for modelling dynamic behaviour than feed-forward ANNs, owing to the recurrent connections between the nodes in the reservoir layer, which serve as a memory. We conduct this study on the Arabic language (one of the most widely spoken languages in the world and the official language in 26 countries) because there is a serious gap in the literature on speech recognition systems for Arabic, making the potential impact high. The investigation covers a variety of tasks, including the implementation of the first reservoir-based Arabic speech recognition system. In addition, a thorough evaluation of the developed system is conducted, including several comparisons to baseline models and other state-of-the-art models found in the literature. The impact of feature extraction methods is studied in this work, and a new biologically inspired feature extraction technique, namely the Auditory Nerve feature, is applied to the speech recognition domain. Comparing different feature extraction methods requires access to the original recorded sound, which is not possible with the only publicly accessible Arabic corpus. We have therefore developed the largest public Arabic corpus for isolated words, which contains roughly 10,000 samples. Our investigation has led us to develop two novel approaches based on reservoir computing, ESNSVMs (Echo State Networks with Support Vector Machines) and ESNEKMs (Echo State Networks with Extreme Kernel Machines). These aim to improve the performance of the conventional RC approach by proposing different readout architectures. These two approaches have been compared to the conventional RC approach and other state-of-the-art systems. Finally, the developed approaches have been evaluated in the presence of different types and levels of noise to examine their resilience, which is crucial for real-world applications.
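To make the reservoir idea concrete, here is a minimal echo state network sketch in NumPy, under the usual assumptions (fixed random reservoir, spectral-radius rescaling, trainable linear readout). It is not the thesis's implementation; the ESNSVMs/ESNEKMs variants would swap the ridge-regression readout below for an SVM or extreme kernel machine.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_in, n_res, spectral_radius=0.9):
    """Fixed random input and recurrent weights; the recurrent matrix is
    rescaled so its spectral radius is below 1 (the usual echo state
    property heuristic)."""
    w_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    w = rng.uniform(-0.5, 0.5, (n_res, n_res))
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def run_reservoir(w_in, w, inputs):
    """Drive the reservoir with a feature sequence (e.g. MFCC frames) and
    return the final state as a fixed-length utterance representation."""
    x = np.zeros(w.shape[0])
    for u in inputs:
        x = np.tanh(w_in @ u + w @ x)
    return x

def train_readout(states, labels, n_classes, ridge=1e-4):
    """Linear readout trained by ridge regression onto one-hot targets;
    only this layer is trained, the reservoir weights stay fixed."""
    s = np.asarray(states)
    y = np.eye(n_classes)[np.asarray(labels)]
    return np.linalg.solve(s.T @ s + ridge * np.eye(s.shape[1]), s.T @ y)
```

Classifying a new utterance is then `np.argmax(run_reservoir(w_in, w, frames) @ w_out)`.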

9. Automatic speaker verification on site and by telephone: methods, applications and assessment

Melin, Håkan January 2006
Speaker verification is the biometric task of authenticating a claimed identity by analyzing a spoken sample of the claimant's voice. The present thesis deals with various topics related to automatic speaker verification (ASV) in the context of its commercial applications, characterized by co-operative users, user-friendly interfaces, and requirements for small amounts of enrollment and test data. A text-dependent system based on hidden Markov models (HMMs) was developed and used to conduct experiments, including a comparison between visual and aural strategies for prompting claimants for randomized digit strings. It was found that aural prompts lead to more errors in spoken responses, and that visually prompted utterances performed marginally better in ASV, given that enrollment data were visually prompted. High-resolution flooring techniques were proposed for variance estimation in the HMMs, but results showed no improvement over the standard method of using target-independent variances copied from a background model. These experiments were performed on Gandalf, a Swedish speaker verification telephone corpus with 86 client speakers. A complete on-site application (PER), a physical access control system securing a gate in a reverberant stairway, was implemented based on a combination of the HMM system and a system based on Gaussian mixture models. Users were authenticated by saying their proper name and a visually prompted, random sequence of digits, after having enrolled by speaking ten utterances of the same type. An evaluation was conducted with the 54 out of 56 clients who succeeded in enrolling. Semi-dedicated impostor attempts were also collected. An equal error rate (EER) of 2.4% was found for this system, based on a single attempt per session and after retraining the system on PER-specific development data. On parallel telephone data collected using a telephone version of PER, 3.5% EER was found with landline telephones and around 5% with mobile telephones; impostor attempts in this case were same-handset attempts. Results also indicate that the distributions of false reject and false accept rates over target speakers are well described by beta distributions. A state-of-the-art commercial system was also tested on PER data, with performance similar to that of the baseline research system.
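For reference, the equal error rate quoted above is the operating point at which the false reject and false accept rates coincide. A minimal way to estimate it from verification scores is sketched below, on made-up toy scores rather than the Gandalf or PER data.

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Sweep a decision threshold over all observed scores; FRR is the
    fraction of genuine (client) scores rejected, FAR the fraction of
    impostor scores accepted. The EER is where the two rates cross."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    frr = np.array([np.mean(client_scores < t) for t in thresholds])
    far = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2.0

# Toy example: higher scores mean "more client-like".
clients = np.random.default_rng(1).normal(2.0, 1.0, 500)
impostors = np.random.default_rng(2).normal(0.0, 1.0, 500)
print(f"EER ~ {equal_error_rate(clients, impostors):.3f}")
```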

10. Voice Transformation And Development Of Related Speech Analysis Tools For Turkish

Salor, Ozgul 01 January 2005
In this dissertation, new approaches to the design of a voice transformation (VT) system for Turkish are proposed. The objectives of this thesis are two-fold. The first is to develop standard speech corpora and segmentation tools for Turkish speech research. The second is to consider new approaches to VT. A triphone-balanced set of 2462 Turkish sentences is prepared for analysis. An audio corpus of 100 speakers, each uttering 40 sentences out of the 2462-sentence set, is used to train a speech recognition system designed for English. This system is ported to Turkish to obtain a phonetic aligner and a phoneme recognizer. The triphone-balanced sentence set and the phonetic aligner are used to develop a speech corpus for VT. A new voice transformation approach based on the Mixed Excitation Linear Prediction (MELP) speech coding framework is proposed. Multi-stage vector quantization of MELP is used to obtain speaker-specific line spectral frequency (LSF) codebooks for source and target speakers. Histograms mapping the LSF spaces of source and target speakers are used for transformation in the baseline system. The baseline system is improved by a dynamic programming approach to estimate the target LSFs. As a second approach to the VT problem, the LSFs are quantized using the k-means clustering algorithm after dimension reduction with principal component analysis. This approach provides speaker-specific codebooks derived from the speech corpus instead of using MELP's pre-trained LSF codebook. Evaluations show that both dimension reduction and dynamic programming improve the transformation performance.
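As a sketch of the second approach above (speaker-specific codebooks derived from the corpus, rather than MELP's pre-trained codebook), the snippet below builds an LSF codebook with PCA dimension reduction followed by k-means clustering, using scikit-learn. The codebook size and component count are illustrative assumptions, not values from the dissertation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def build_lsf_codebook(lsf_frames, n_components=6, n_codewords=64):
    """Build a speaker-specific LSF codebook: project the LSF vectors
    (shape: n_frames x lsf_order) to a low-dimensional PCA space, cluster
    them with k-means, and map the centroids back to the full LSF space."""
    pca = PCA(n_components=n_components).fit(lsf_frames)
    km = KMeans(n_clusters=n_codewords, n_init=10,
                random_state=0).fit(pca.transform(lsf_frames))
    codebook = pca.inverse_transform(km.cluster_centers_)
    return pca, km, codebook

def quantize_frame(pca, km, codebook, lsf_frame):
    """Replace one frame's LSF vector with its nearest codeword."""
    idx = km.predict(pca.transform(np.asarray(lsf_frame).reshape(1, -1)))[0]
    return codebook[idx]
```

Transformation would then map source-codebook entries to target-codebook entries, e.g. via the histogram mapping or dynamic programming described above.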
