11

A Design of Trilingual Speech Recognition System for Chinese, Italian and Farsi

Jiang, Wei-Sheng 10 September 2012 (has links)
China, Italy and Iran seem quite different in language, history, culture and economy, yet the three countries have interacted throughout history. In the fourth century, the Chinese Northern Wei Dynasty established close relations with the Persian Empire, located in present-day Iran; the Persian language is called Farsi in its native name. Silver bowls unearthed in China in recent years show an appearance and material similar to the Sassanid-Persian silverware of Iran, and archaeologists have found that ancient China and Iran were close international trading partners. In the thirteenth century, Marco Polo, an Italian merchant and adventurer, visited the Chinese Yuan Dynasty and wrote "The Travels of Marco Polo"; the fantastic experiences in China depicted in that journal initiated early Sino-Italian relations. Today, Armani suits and Ferrari sports cars are objects of passion in modern China, a mark of continuing Asian-European cultural exchange. Our objective is therefore to design a trilingual speech recognition system to help learners study Chinese, Italian and Farsi. Linear predictive cepstral coefficients and Mel-frequency cepstral coefficients are used as the two syllable feature models, with a hidden Markov model and phonotactics as the recognition model. For the Chinese system, a database of 2,699 two-syllable words is used as the training corpus. For the Italian and Farsi systems, a database of 10 utterances per mono-syllable is established by applying their pronunciation rules: each mono-syllable is read in 5 rounds, twice per round with tone 1 and tone 4. Correct recognition rates of 87.54%, 87.48% and 90.33% are reached for the 82,000-phrase Chinese, 27,900-phrase Italian and 4,000-phrase Farsi databases respectively.
The computation time for each system is within 1.5 seconds. Furthermore, a trilingual language-speech recognition system for 300 common words, composed of 100 words from each language, is developed; a 98.67% correct language-phrase recognition rate is obtained with a computation time of about 2 seconds.
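The MFCC front end shared by these systems can be sketched directly in NumPy. This is a generic textbook version of the computation; the frame length, hop, filter count and FFT size below are common defaults, not the thesis's actual settings:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filterbank over the rFFT bins."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)  # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)  # falling edge
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_coeffs=13):
    """Frame the signal, take log mel-filterbank energies, then a DCT-II."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    n_fft = 512
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2       # power spectrum
    fb = mel_filterbank(n_filters, n_fft, sr)
    energies = np.log(spec @ fb.T + 1e-10)               # log mel energies
    n = np.arange(n_filters)                             # DCT-II basis
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * n + 1) / (2 * n_filters)))
    return energies @ dct.T                              # (n_frames, n_coeffs)
```

One second of 16 kHz audio yields 98 frames of 13 coefficients with these defaults; in practice LPCC features, deltas and energy terms would be appended as the theses describe.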
12

A Design of Trilingual Speech Recognition System for Chinese, Portuguese and Hindi

Wang, Yu-an 10 September 2012 (has links)
The BRICS countries, Brazil, Russia, India, China and South Africa, have contributed significantly to global economic growth in recent years. China possesses not only the world's largest population but also one of its most splendid histories. Rapid development on all fronts in recent years, including enhanced economic trade with Taiwan, has placed China among the superpowers. Brazil is the largest Portuguese-speaking country in the world, where the world-class manufacturer Foxconn Technology decided to build an Apple iPad/iPhone factory in 2011. India's software, telecommunications and aviation industries have flourished over the last decade, and offshore outsourcing to India is popular thanks to the cost-cutting policies of Western companies. The Chinese-, Portuguese- and Hindi-speaking population exceeds 1.573 billion, accounting for over 22% of the world's population. Our objective is therefore to establish a trilingual speech recognition system to aid verbal communication and cultural understanding across these languages. This thesis investigates design and implementation strategies for a trilingual speech recognition system for Chinese, Portuguese and Hindi. Based on their pronunciation rules, 404 Chinese, 515 Portuguese and 244 Hindi common mono-syllables are selected as the basis for speech training and recognition. Mel-frequency cepstral coefficients and linear predictive cepstral coefficients are used as the two syllable feature models, with a hidden Markov model as the recognition model. On an AMD 2.2 GHz Athlon XP 2800+ personal computer running Ubuntu 9.04, correct phrase recognition rates of 87.69%, 85.14% and 86.74% are reached using phonotactic rules for the 82,000-phrase Chinese, 30,000-phrase Portuguese and 3,900-phrase Hindi databases respectively.
Furthermore, a trilingual language-speech recognition system for 300 common words, composed of 100 words from each language, is developed; a 98% correct language-phrase recognition rate is reached, with an average computation time per system within 2 seconds.
13

A Design of Trilingual Speech Recognition System for Chinese, Arabic and Dutch

Tu, Ming-hui 10 September 2012 (has links)
Chinese and Arabic are both among the six official languages of the United Nations. The Chinese-speaking population, at over 1.2 billion, is the largest in the world. Arabic, the language of the Arab world, has a history of more than 2,800 years; its religion, culture and oil economy have had far-reaching effects around the globe, and the worldwide energy supply relies heavily on petroleum from the Arab world. The Netherlands, whose official language is Dutch, has long been an international trading power and is today an industrial giant. Studying abroad in Europe has become increasingly popular, and many renowned Dutch universities offer opportunities to foreign students. Our objective is therefore to design a trilingual speech recognition system to help learners study Chinese, Arabic and Dutch, as well as appreciate their profound histories and cultures. This thesis investigates design and implementation strategies for a Chinese, Arabic and Dutch speech recognition system. A recorded database of 2,699 two-syllable words is used as the Chinese training corpus. For the Arabic and Dutch systems, 396 and 205 common mono-syllables are selected respectively as the basis for training and recognition. Each mono-syllable is uttered twice, with tone 1 and tone 4, and ten training patterns per syllable are used for system implementation. Mel-frequency cepstral coefficients and linear predictive cepstral coefficients are applied as the two syllable feature models, with a hidden Markov model and phonotactics as the recognition model. Correct recognition rates of 90.17%, 84.65% and 86.69% are reached for the 82,000-phrase Chinese, 31,000-phrase Arabic and 3,600-phrase Dutch databases respectively. Furthermore, a trilingual language-speech recognition system for 300 common words, composed of 100 words from each language, is developed; a 98.67% correct language-phrase recognition rate is obtained, with a computation time of about 2 seconds per system.
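The HMM recognition step these systems rely on ultimately reduces to decoding the most likely state sequence for an utterance. A minimal Viterbi decoder in NumPy, in its standard textbook form rather than the thesis's implementation, looks like this:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely state path for one observation sequence.
    log_A : (S, S) transition log-probs, log_A[i, j] = log P(j | i)
    log_B : (T, S) per-frame emission log-likelihoods
    log_pi: (S,)   initial state log-probs
    Returns the best path (list of state indices) and its log-score."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best score ending in each state
    psi = np.zeros((T, S), dtype=int)    # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # (prev, cur) candidate scores
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):        # backtrack from the best final state
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(np.max(delta))
```

In a syllable recognizer, `log_B` would come from Gaussian mixture emission models evaluated on MFCC/LPCC frames, and one such decode is run per candidate syllable model.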
14

A Design of Trilingual Speech Recognition System for Chinese, Hakka and Swedish

Wu, Chih-Han 10 September 2012 (has links)
According to statistics from the Summer Institute of Linguistics, USA, there are about 7,000 languages in the world; Chinese, Hakka and Swedish all rank among the 100 most widely spoken. Chinese is spoken in Taiwan, Mainland China, Hong Kong and Macau. Hakka is the second most widely spoken dialect in Taiwan, with a speaker population second only to that of Taiwanese. The ancestors of the Hakka were Han people from Honan, China, and Hakka culture has been shaped by large migrations since the fourth century. Taiwan and Sweden are both developed, free and democratic countries with similar standards of living. The ancestors of the Swedes were Germanic peoples of Northern Europe, and Swedish has likewise evolved through massive migrations since the ninth century, sharing an analogous evolutionary route with Chinese and Hakka. Our objective is therefore to establish a trilingual speech recognition system to aid verbal communication across these languages in the global economic arena. This thesis investigates design and implementation strategies for a trilingual speech recognition system for Chinese, Hakka and Swedish. Based on their pronunciation rules, 404 Chinese, 204 Hakka and 369 Swedish common mono-syllables are selected as the basis for speech training and recognition. A database of 2,699 two-syllable words is recorded as the Chinese training corpus. Training strategies of five rounds with four tones and six rounds with two tones are used for Hakka and Swedish respectively. Correct rates of 92.29%, 90.70% and 89.09% are reached for the 82,000-phrase Chinese, 3,900-phrase Hakka and 3,750-phrase Swedish databases respectively. In addition, a trilingual language-speech recognition system for 300 common words, composed of 100 words from each language, is developed; a 98.67% correct language-phrase recognition rate is obtained, with an average computation time per system within 2 seconds.
15

A Design of Trilingual Speech Recognition System for Chinese, Turkish and Tamil

Lin, Wei-Ting 10 September 2012 (has links)
In this thesis, both Turkish and Tamil, a language spoken in southern India and Sri Lanka, are studied in addition to Mandarin Chinese, in the hope that the history, culture and economy behind each language can be appreciated during the learning process. In the ancient Chinese Han and Tang Dynasties, the "Silk Road" served as the international trading corridor connecting China in the east, Turkey in the west and India in the south. In the modern era, Turkey and India are both major cotton-exporting countries, and China, Turkey and India have all shown their potential as newly emerging markets. Therefore, a trilingual speech recognition system is developed and implemented to help learners study Chinese, Turkish and Tamil, and to deepen understanding of their history and culture. In this trilingual system, linear predictive cepstral coefficients and Mel-frequency cepstral coefficients are used as the two syllable feature models, with a hidden Markov model and phonotactics as the recognition model. For the Chinese system, a database of 2,699 two-syllable words is used as the training corpus. For the Turkish and Tamil systems, a database of 10 utterances per mono-syllable is established by applying their pronunciation rules: each mono-syllable is read in 5 rounds, twice per round with tone 1 and tone 4. Correct rates of 88.30%, 84.21% and 88.74% are reached for the 82,000-phrase Chinese, 30,795-phrase Turkish and 3,500-phrase Tamil databases respectively. The computation time for each system is within 1.5 seconds. Furthermore, a trilingual language-speech recognition system for 300 common words, composed of 100 words from each language, is developed; a 98% correct language-phrase recognition rate is reached with a computation time of less than 2 seconds.
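Phonotactics enters these systems as a language-model constraint on which syllable sequences are plausible. One common way to realize such a constraint is a smoothed syllable-bigram score, sketched below with a hypothetical toy corpus; the syllables, smoothing constant and vocabulary size are illustrative assumptions, not the thesis's actual data:

```python
import math

def train_bigrams(phrases):
    """Count syllable-bigram frequencies from a list of phrases,
    where each phrase is a tuple of syllables."""
    counts, totals = {}, {}
    for phrase in phrases:
        syls = ["<s>"] + list(phrase) + ["</s>"]
        for a, b in zip(syls, syls[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
            totals[a] = totals.get(a, 0) + 1
    return counts, totals

def phonotactic_score(phrase, counts, totals, alpha=1.0, vocab=100):
    """Add-one-smoothed bigram log-probability of a candidate phrase;
    higher means more phonotactically plausible."""
    syls = ["<s>"] + list(phrase) + ["</s>"]
    score = 0.0
    for a, b in zip(syls, syls[1:]):
        score += math.log((counts.get((a, b), 0) + alpha) /
                          (totals.get(a, 0) + alpha * vocab))
    return score
```

During recognition, a score like this is combined with the acoustic (HMM) score to rerank candidate syllable sequences before phrase lookup.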
16

A robust audio-based symbol recognition system using machine learning techniques

Wu, Qiming 02 1900 (has links)
Masters of Science / This research investigates the creation of an audio-shape recognition system that is able to interpret a user’s drawn audio shapes—fundamental shapes, digits and/or letters— on a given surface such as a table-top using a generic stylus such as the back of a pen. The system aims to make use of one, two or three Piezo microphones, as required, to capture the sound of the audio gestures, and a combination of the Mel-Frequency Cepstral Coefficients (MFCC) feature descriptor and Support Vector Machines (SVMs) to recognise audio shapes. The novelty of the system is in the use of piezo microphones which are low cost, light-weight and portable, and the main investigation is around determining whether these microphones are able to provide sufficiently rich information to recognise the audio shapes mentioned in such a framework.
17

Classification of Affective Emotion in Musical Themes : How to understand the emotional content of the soundtracks of the movies?

Diaz Banet, Paula January 2021 (has links)
Music is created by composers to arouse different emotions and feelings in the listener, and in the case of soundtracks, to support the storytelling of scenes. The goal of this project is to find the best method for evaluating the emotional content of soundtracks. This emotional content can be measured quantitatively using Russell's model of valence, arousal and dominance, which converts mood labels into numbers. To conduct the analysis, MFCC and VGGish features were extracted from the soundtracks and used as inputs to a CNN and an LSTM model, in order to determine which achieved better predictions. A database of 6,757 soundtracks with their corresponding VAD values was created for this analysis. Ultimately, the results of the experiments will help the start-up Vionlabs better understand the content of movies and therefore make more accurate recommendations about what users want to watch on Video on Demand platforms, according to their emotions or moods.
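The label-to-number conversion in Russell's model can be realized as a lookup table of coordinates plus a nearest-neighbour mapping back to labels. The coordinates below are illustrative assumptions for the sketch, not the actual values used in the project:

```python
import math

# Illustrative valence/arousal coordinates on Russell's circumplex.
# These placements are assumptions for the sketch, not the project's table.
MOOD_VA = {
    "happy":   ( 0.8,  0.5),
    "excited": ( 0.6,  0.9),
    "calm":    ( 0.5, -0.6),
    "sad":     (-0.7, -0.4),
    "angry":   (-0.6,  0.8),
    "tense":   (-0.3,  0.6),
}

def nearest_mood(valence, arousal):
    """Map a predicted (valence, arousal) point back to the closest mood label."""
    return min(MOOD_VA, key=lambda m: math.dist(MOOD_VA[m], (valence, arousal)))
```

A regression model (CNN or LSTM) predicts the continuous VAD point; the reverse lookup is only needed when a human-readable mood label is wanted.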
18

Application of LabVIEW and myRIO to voice controlled home automation

Lindstål, Tim, Marklund, Daniel January 2019 (has links)
The aim of this project is to use NI myRIO and LabVIEW for voice-controlled home automation. The NI myRIO is an embedded device with a Xilinx FPGA, a dual-core ARM Cortex-A9 processor, and analog and digital input/output, programmed with LabVIEW, a graphical programming language. Voice control is implemented in two different systems. The first is based on an Amazon Echo Dot for voice recognition, a commercial smart speaker developed by Amazon Lab126. Echo Dot devices connect via the Internet to Alexa, Amazon's voice-controlled intelligent personal assistant service, which is capable of voice interaction, music playback and controlling smart home devices. In this first system the myRIO is used for wireless control of smart home devices; smart lamps, sensors, speakers and an LCD display were implemented. The second system focuses on the myRIO itself for speech recognition and was built on a myRIO with a microphone connected. Speech recognition was implemented using Mel-frequency cepstral coefficients and dynamic time warping. A few commands could be recognized, including the wake word "Bosse" and four further commands for controlling the colors of a smart lamp. The project was successful, demonstrating that home automation implemented on the NI myRIO with two voice-controlled systems can correctly control home devices such as smart lamps, sensors, speakers and an LCD display.
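The MFCC-plus-DTW matcher described above compares a spoken command's feature sequence against stored templates and picks the template with the smallest warped distance. The core DTW distance in a generic NumPy form (not the project's LabVIEW code) is:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences,
    a: (Ta, d) and b: (Tb, d), e.g. per-frame MFCC vectors. Uses the
    classic cumulative-cost recurrence with Euclidean frame distance."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # insertion, deletion, or match step
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Ta, Tb]
```

Because the alignment is elastic in time, a command spoken slowly still matches its template closely, which is exactly why DTW suits small fixed-vocabulary recognizers like this one.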
19

A Design of Recognition Rate Improving Strategy for Japanese Speech Recognition System

Lin, Cheng-Hung 24 August 2010 (has links)
This thesis investigates recognition rate improvement strategies for a Japanese speech recognition system. Both training data development and a consonant correction scheme are studied. For training data development, a database of 995 two-syllable Japanese words is established by phonetically balanced sieving. Furthermore, feature models for the 188 common Japanese mono-syllables are derived through a mixed-position training scheme to increase the recognition rate. For consonant correction, a sub-syllable model is developed to enhance consonant recognition accuracy and thus further improve the overall correct rate for whole Japanese phrases. Experimental results indicate that the average correct rate of the Japanese phrase recognition system, with 34,000 phrases, can be improved from 86.91% to 92.38%.
20

Optimizing text-independent speaker recognition using an LSTM neural network

Larsson, Joel January 2014 (has links)
In this paper a novel speaker recognition system is introduced. With advances in computer science, automated speaker recognition has become increasingly popular as an aid in crime investigations and authorization processes. Here, a recurrent neural network approach is used to learn to identify ten speakers within a set of 21 audio books. Audio signals are processed via spectral analysis into Mel-frequency cepstral coefficients that serve as speaker-specific features, which are input to the neural network. The Long Short-Term Memory algorithm is examined for the first time within this area, with interesting results. Experiments are conducted to find the optimal network model for the problem. These show that the network learns to identify the speakers well, text-independently, when the recording conditions are the same. However, the system has trouble recognizing speakers across different recordings, probably due to the noise sensitivity of the speech processing algorithm in use.
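The LSTM at the heart of such a system can be summarized by its single-cell forward step applied over a sequence of MFCC frames. The sketch below uses random weights purely to show the data flow and shapes; it is not a trained model and omits the classification layer:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step (forward pass only), gates packed in order i, f, o, g.
    x: (d_in,) input frame, h/c: (d_h,) hidden and cell state,
    W: (4*d_h, d_in), U: (4*d_h, d_h), b: (4*d_h,)."""
    z = W @ x + U @ h + b
    d = len(h)
    i = 1.0 / (1.0 + np.exp(-z[0 * d:1 * d]))   # input gate
    f = 1.0 / (1.0 + np.exp(-z[1 * d:2 * d]))   # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2 * d:3 * d]))   # output gate
    g = np.tanh(z[3 * d:4 * d])                 # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def run_lstm(seq, d_h, seed=0):
    """Run a randomly initialised LSTM over an MFCC sequence (T, d_in)
    and return the final hidden state as a fixed-size utterance summary."""
    rng = np.random.default_rng(seed)
    d_in = seq.shape[1]
    W = rng.normal(0, 0.1, (4 * d_h, d_in))
    U = rng.normal(0, 0.1, (4 * d_h, d_h))
    b = np.zeros(4 * d_h)
    h = np.zeros(d_h)
    c = np.zeros(d_h)
    for x in seq:
        h, c = lstm_step(x, h, c, W, U, b)
    return h
```

In a speaker-identification setup, the final hidden state (or the per-frame outputs) would feed a softmax layer over the ten speakers, with the weights learned by backpropagation through time.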
