About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Development of robust language models for speech recognition of under-resourced language

Sindana, Daniel January 2020
Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2020 / Language modelling (LM) work for under-resourced languages that does not consider most linguistic information inherent in a language produces language models that inadequately represent the language, thereby leading to the under-development of natural language processing tools and systems such as speech recognition systems. This study investigated the influence that the orthography (i.e., writing system) of a language has on the quality and/or robustness of the language models created for the text of that language. The unique conjunctive and disjunctive writing systems of isiNdebele (Ndebele) and Sepedi (Pedi) were studied. The text data from the LWAZI and NCHLT speech corpora were used to develop language models. The LM techniques that were implemented included word-based n-gram LM, LM smoothing, LM linear interpolation, and higher-order n-gram LM. The toolkits used for development were the HTK LM, SRILM, and CMU-Cam SLM toolkits. From the findings of the study – on text preparation, data pooling and sizing, higher-order n-gram models, and interpolation of models – it is concluded that the orthography of the selected languages does have an effect on the quality of the language models created for their text. The following recommendations are made as part of LM development for the languages concerned. 1) Special preparation and normalisation of the text data before LM development, paying attention to within-sentence text markers and annotation tags that may incorrectly form part of sentences, word sequences, and n-gram contexts. 2) Enable interpolation during training. 3) Develop pentagram and hexagram language models for Pedi texts, and trigrams and quadrigrams for Ndebele texts. 4) Investigate efficient smoothing methods for the different languages, especially for different text sizes and different text domains. / National Research Foundation (NRF) Telkom University of Limpopo
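As an illustrative aside (not code from the thesis), the sketch below shows the basic recipe behind the smoothing and linear-interpolation techniques named in this abstract: unigram, bigram and trigram probabilities estimated with add-k smoothing and combined with fixed interpolation weights to score a held-out sentence. The toy sentences, the smoothing constant and the weights are invented for the example; toolkits such as SRILM automate these steps, including tuning the weights, on real corpora.

```python
# Minimal sketch of a linearly interpolated n-gram language model with
# add-k smoothing. All data and constants below are invented toy values.
import math
from collections import Counter

train = [
    "re a leboga kudu".split(),
    "re a leboga".split(),
    "ke a leboga".split(),
]
heldout = "ke a leboga kudu".split()

BOS, EOS = "<s>", "</s>"
K = 0.1                      # add-k smoothing constant (assumed)
LAMBDAS = (0.2, 0.3, 0.5)    # interpolation weights for 1-, 2-, 3-gram (assumed)

def pad(sent):
    return [BOS, BOS] + sent + [EOS]

unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
for sent in train:
    s = pad(sent)
    unigrams.update(s)
    bigrams.update(zip(s, s[1:]))
    trigrams.update(zip(s, s[1:], s[2:]))

vocab_size = len(set(unigrams) | set(heldout) | {EOS})
total_tokens = sum(unigrams.values())

def p_unigram(w):
    return (unigrams[w] + K) / (total_tokens + K * vocab_size)

def p_bigram(w, h):
    return (bigrams[(h, w)] + K) / (unigrams[h] + K * vocab_size)

def p_trigram(w, h1, h2):
    return (trigrams[(h1, h2, w)] + K) / (bigrams[(h1, h2)] + K * vocab_size)

def p_interpolated(w, h1, h2):
    l1, l2, l3 = LAMBDAS
    return l1 * p_unigram(w) + l2 * p_bigram(w, h2) + l3 * p_trigram(w, h1, h2)

# Perplexity of the held-out sentence under the interpolated model.
s = pad(heldout)
log_prob = sum(math.log(p_interpolated(s[i], s[i - 2], s[i - 1]))
               for i in range(2, len(s)))
perplexity = math.exp(-log_prob / (len(s) - 2))
print(f"held-out perplexity: {perplexity:.2f}")
```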
12

Construction et évaluation pour la TA d'un corpus journalistique bilingue : application au français-somali / Building and evaluating for MT a bilingual corpus : application to French-Somali

Ahmed Assowe, Houssein 29 May 2019
Dans le cadre des travaux en cours pour informatiser un grand nombre de langues « peu dotées », en particulier celles de l’espace francophone, nous avons créé un système de traduction automatique français-somali dédié à un sous-langage journalistique, permettant d’obtenir des traductions de qualité, à partir d’un corpus bilingue construit par post-édition des résultats de Google Translate (GT), à destination des populations somalophones et non francophones de la Corne de l’Afrique. Pour cela, nous avons constitué le tout premier corpus parallèle français-somali de qualité, comprenant à ce jour 98 912 mots (environ 400 pages standard) et 10 669 segments. Ce dernier constitue un corpus aligné, et de très bonne qualité, car nous l’avons construit en post-éditant les pré-traductions de GT, qui combine pour cela son système de TA français-anglais et son système de TA anglais-somali. Ce corpus a également fait l’objet d’une évaluation par 9 annotateurs bilingues qui ont donné un score de qualité à chaque segment du corpus, et corrigé éventuellement notre post-édition. À partir de ce corpus en croissance, nous avons construit plusieurs versions successives d’un système de traduction automatique à base de fragments (PBMT), MosesLIG-fr-so, qui s’est révélé meilleur que GT sur ce couple de langues et ce sous-langage, en termes de mesure BLEU et de temps de post-édition. Nous avons également fait une première expérience de traduction automatique neuronale français-somali en utilisant OpenNMT, de façon à améliorer les résultats de la TA sans aboutir à des temps de calcul prohibitifs, tant durant l’entraînement que durant le décodage. D’autre part, nous avons mis en place une iMAG (passerelle interactive d’accès multilingue) qui permet à des internautes somaliens non francophones du continent d’accéder en somali à l’édition en ligne du journal « La Nation de Djibouti ». Les segments (phrases ou titres), prétraduits automatiquement par un système de TA fr-so disponible en ligne, peuvent être post-édités et notés (sur une échelle de 1 à 20) par les lecteurs eux-mêmes, de façon à améliorer le système par apprentissage incrémental, de la même façon que ce qui a été fait pour le système français-chinois (PBMT) créé par [Wang, 2015]. / As part of ongoing work to computerize a large number of "poorly endowed" languages, especially those in the French-speaking world, we have created a French-Somali machine translation system dedicated to a journalistic sub-language, allowing quality translations to be obtained from a bilingual corpus built by post-editing Google Translate (GT) results, for the Somali-speaking and non-French-speaking populations of the Horn of Africa. For this, we have created the very first quality French-Somali parallel corpus, comprising to date 98,912 words (about 400 standard pages) and 10,669 segments. The latter is an aligned corpus of very good quality, because we built it by post-editing pre-translations produced by GT, which combines its French-English and English-Somali MT systems. That corpus was also evaluated by 9 bilingual annotators, who assigned a quality score to each segment of the corpus and corrected our post-editing. Using this growing corpus as training data, we have built several successive versions of a phrase-based machine translation system (PBMT), MosesLIG-fr-so, which has proven to be better than Google Translate on this language pair and this sub-language, in terms of BLEU score and post-editing time. We also used OpenNMT to build and experiment with a first French-Somali neural MT system, in order to improve the MT results without incurring prohibitive computation times, both during training and during decoding. In addition, we have set up an iMAG (interactive multilingual access gateway) that allows non-French-speaking Somali users on the continent to access the online edition of the newspaper "La Nation de Djibouti" in Somali. The segments (sentences or titles), automatically pre-translated by any available fr-so MT system, can be post-edited and rated (on a scale of 1 to 20) by the readers themselves, so as to improve the system by incremental learning, in the same way as has been done for the French-Chinese (PBMT) system created by [Wang, 2015].
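As a hedged illustration of the kind of BLEU comparison mentioned in this abstract (not the evaluation code used in the thesis), the sketch below computes corpus-level BLEU with clipped n-gram precisions and a brevity penalty for two hypothetical system outputs against post-edited references. The token sequences are invented placeholders, not real Somali output; in practice a standard implementation such as sacrebleu would normally be used.

```python
# Sketch of corpus-level BLEU: clipped n-gram precisions (n = 1..4),
# geometric mean, and a brevity penalty. Segments below are invented.
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    clipped = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
            totals[n - 1] += sum(h.values())
            clipped[n - 1] += sum(min(count, r[g]) for g, count in h.items())
    if min(clipped) == 0:          # any empty precision zeroes the geometric mean
        return 0.0
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)

# Invented post-edited references and two hypothetical system outputs.
refs  = [["le", "président", "a", "visité", "la", "région"],
         ["la", "situation", "reste", "calme", "dans", "la", "ville"]]
sys_a = [["le", "président", "a", "visité", "la", "région"],
         ["la", "situation", "reste", "calme", "dans", "la", "ville"]]
sys_b = [["le", "président", "a", "visité", "une", "région"],
         ["la", "situation", "reste", "calme", "dans", "le", "pays"]]

print("system A BLEU:", round(corpus_bleu(sys_a, refs), 3))
print("system B BLEU:", round(corpus_bleu(sys_b, refs), 3))
```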
13

Effective automatic speech recognition data collection for under-resourced languages / de Vries N.J.

De Vries, Nicolaas Johannes January 2011
As building transcribed speech corpora for under-resourced languages plays a pivotal role in developing automatic speech recognition (ASR) technologies for such languages, a key step in developing these technologies is the effective collection of ASR data, consisting of transcribed audio and associated metadata. The problem is that no suitable tool currently exists for effectively collecting ASR data for such languages. The specific context and requirements for effectively collecting ASR data for under-resourced languages render all currently known solutions unsuitable for such a task. Such requirements include portability, Internet independence and an open-source code base. This work documents the development of such a tool, called Woefzela, from the determination of the requirements necessary for effective data collection in this context, to the verification and validation of its functionality. The study demonstrates the effectiveness of using smartphones without any Internet connectivity for ASR data collection for under-resourced languages. It introduces a semi-real-time quality control philosophy which increases the amount of usable ASR data collected from speakers. Woefzela was developed for the Android operating system, and is freely available for use on Android smartphones, with its source code also being made available. A total of more than 790 hours of ASR data for the eleven official languages of South Africa have been successfully collected with Woefzela. As part of this study, a benchmark for the performance of a new National Centre for Human Language Technology (NCHLT) English corpus was established. / Thesis (M.Ing. (Electrical Engineering))--North-West University, Potchefstroom Campus, 2012.
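The "semi-real-time quality control philosophy" mentioned above is not spelled out in the abstract, so the following sketch only illustrates the general idea with one invented heuristic: flag prompted recordings whose measured duration is implausible for the prompt length, so that bad takes can be redone while the speaker is still present. The thresholds, speaking-rate bounds and data are assumptions for illustration, not Woefzela's actual checks.

```python
# Illustrative quality-control heuristic for prompted ASR recordings.
# Thresholds and speaking-rate bounds are invented; they are not the
# checks implemented in Woefzela.
from dataclasses import dataclass

@dataclass
class Recording:
    prompt: str          # text the speaker was asked to read
    duration_s: float    # measured audio length in seconds

MIN_WPS, MAX_WPS = 1.0, 6.0   # plausible words-per-second bounds (assumed)

def problems(rec: Recording) -> list[str]:
    """Return a list of issues found with one recording (empty if usable)."""
    issues = []
    n_words = len(rec.prompt.split())
    if rec.duration_s <= 0.3:
        issues.append("audio too short; likely a misfired recording")
    elif not (MIN_WPS <= n_words / rec.duration_s <= MAX_WPS):
        issues.append("duration implausible for prompt length")
    return issues

batch = [
    Recording("re a leboga kudu", 1.4),
    Recording("re a leboga kudu", 0.2),    # aborted take
    Recording("re a leboga kudu", 25.0),   # microphone left open
]
for i, rec in enumerate(batch):
    print(f"utterance {i}: {'; '.join(problems(rec)) or 'usable'}")
```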
15

Access and use of information and communication technology for teaching and learning amongst schools in under resourced communities in the Western Cape, South Africa

Koranteng, Kesewaa January 2012
Thesis (MTech (Information Technology)) -- Cape Peninsula University of Technology, 2012 / Due to the legacy of apartheid, South Africa is facing developmental discrepancies, with inequalities between the advantaged few in the more urban areas and the disadvantaged majority in the rural areas. With quality education being key not only to the success of an individual but also to a country’s development, efforts have been made to ensure equal access for all. ICT is seen as a key enabler to this end. The study investigated the status of ICT deployment and its integration into curricula in schools. The objective was to understand the factors affecting the efforts to achieve successful integration of ICT into schools in underdeveloped areas, to understand the challenges that exist and, ultimately, to inform solutions. A qualitative study was conducted, using a case study method. A purposive sampling method was used to select population elements: educators and school coordinators of ICT programs in Western Cape schools (Kulani Secondary, Sithembele Matiso Secondary, Macassar Secondary and Marvin Park Primary). To gain an understanding of the status quo, literature was explored and semi-structured interviews were conducted with ICT coordinators and educators within the four sampled schools. Activity theory was used to provide an analytical framework for the study. Through this framework, the aims and objectives of the study were conceptualized and summarized to form a graphical representation of the phenomena under study. In spite of efforts to ensure universal access to ICT, the findings indicate that the status of ICT deployment and its integration into school curricula is far from favourable in underdeveloped schools.
16

Extraction de corpus parallèle pour la traduction automatique depuis et vers une langue peu dotée / Extracting a parallel corpus for machine translation from and to under-resourced languages

Do, Thi Ngoc Diep 20 December 2011
Les systèmes de traduction automatique obtiennent aujourd'hui de bons résultats sur certains couples de langues comme anglais – français, anglais – chinois, anglais – espagnol, etc. Les approches de traduction empiriques, particulièrement l'approche de traduction automatique probabiliste, nous permettent de construire rapidement un système de traduction si des corpus de données adéquats sont disponibles. En effet, la traduction automatique probabiliste est fondée sur l'apprentissage de modèles à partir de grands corpus parallèles bilingues pour les langues source et cible. Toutefois, la recherche sur la traduction automatique pour des paires de langues dites «peu dotées» doit faire face au défi du manque de données. Nous avons ainsi abordé le problème d'acquisition d'un grand corpus de textes bilingues parallèles pour construire le système de traduction automatique probabiliste. L'originalité de notre travail réside dans le fait que nous nous concentrons sur les langues peu dotées, où des corpus de textes bilingues parallèles sont inexistants dans la plupart des cas. Ce manuscrit présente notre méthodologie d'extraction d'un corpus d'apprentissage parallèle à partir d'un corpus comparable, une ressource de données plus riche et diversifiée sur l'Internet. Nous proposons trois méthodes d'extraction. La première méthode suit l'approche de recherche classique qui utilise des caractéristiques générales des documents ainsi que des informations lexicales du document pour extraire à la fois les documents comparables et les phrases parallèles. Cependant, cette méthode requiert des données supplémentaires sur la paire de langues. La deuxième méthode est une méthode entièrement non supervisée qui ne requiert aucune donnée supplémentaire à l'entrée, et peut être appliquée à n'importe quelle paire de langues, même des paires de langues peu dotées. La dernière méthode est une extension de la deuxième méthode qui utilise une troisième langue pour améliorer les processus d'extraction de deux paires de langues. Les méthodes proposées sont validées par des expériences appliquées sur la langue peu dotée vietnamienne et les langues française et anglaise. / Nowadays, machine translation has reached good results when applied to several language pairs such as English – French, English – Chinese, English – Spanish, etc. Empirical translation, particularly statistical machine translation, allows us to quickly build a translation system if adequate data is available, because statistical machine translation is based on models trained from large parallel bilingual corpora in the source and target languages. However, research on machine translation for under-resourced language pairs always faces the lack of training data. Thus, we have addressed the problem of retrieving a large parallel bilingual text corpus to build a statistical machine translation system. The originality of our work lies in the fact that we focus on under-resourced languages for which parallel bilingual corpora do not exist in most cases. This manuscript presents our methodology for extracting a parallel corpus from a comparable corpus, a richer and more diverse data resource available on the Web. We propose three extraction methods. The first method follows the classical approach, using general characteristics of documents as well as lexical information of the document to retrieve both parallel documents and parallel sentence pairs. However, this method requires additional data for the language pair. The second method is a completely unsupervised method that does not require additional data and can be applied to any language pair, even under-resourced language pairs. The last method is an extension of the second method, using a third language to improve the extraction process (triangulation). The proposed methods are validated by a number of experiments applied on the under-resourced Vietnamese language and the English and French languages.
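As an illustrative aside (not the thesis's actual algorithm), the sketch below shows the flavour of the first, lexically informed method described above: candidate sentence pairs drawn from comparable documents are scored with cheap signals such as a length ratio and overlap against a small bilingual lexicon, and only pairs above a threshold are kept. The mini dictionary, weights, threshold and sentences are all invented for the example.

```python
# Toy sketch of parallel-sentence filtering from a comparable corpus using
# two cheap signals: sentence-length ratio and bilingual-dictionary overlap.
# The mini dictionary, thresholds and sentences are invented for illustration.
BILINGUAL_DICT = {        # French -> English (tiny assumed lexicon)
    "le": "the", "président": "president", "a": "has",
    "visité": "visited", "la": "the", "ville": "city",
}

def score_pair(fr_tokens, en_tokens):
    ratio = min(len(fr_tokens), len(en_tokens)) / max(len(fr_tokens), len(en_tokens))
    translated = {BILINGUAL_DICT[t] for t in fr_tokens if t in BILINGUAL_DICT}
    overlap = len(translated & set(en_tokens)) / max(len(en_tokens), 1)
    return 0.5 * ratio + 0.5 * overlap      # equal weighting (assumed)

candidates = [
    ("le président a visité la ville".split(),
     "the president visited the city".split()),
    ("le président a visité la ville".split(),
     "heavy rain is expected this weekend across the region".split()),
]

THRESHOLD = 0.6   # assumed acceptance threshold
for fr, en in candidates:
    s = score_pair(fr, en)
    verdict = "keep" if s >= THRESHOLD else "reject"
    print(f"{verdict}  score={s:.2f}  fr={' '.join(fr)!r}")
```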
17

Exploring Science Identity: The Lived Experiences of Underserved Students in a University Supplemental Science Program

Perrault, Lynette D 20 December 2017
Underserved students attending under-resourced schools experience limited opportunities to engage in advanced science. An exploration into the influence a supplemental science program has on underserved students’ acquisition of science knowledge and skills to increase their pursuit of science was conducted to help explain science identity formation in students. Supplemental science programs have proliferated as a result of underserved students’ limited exposure and resources in science, prompting further investigation into the influence such programs have on underserved students’ interest and motivation in science, attainment of science knowledge and skills, and confidence in science to promote science identities in students. Using a phenomenological qualitative approach, this study examined science identity formation in high school students participating in a university supplemental environmental health science program. The study explored high school students’ perceptions of their lived experiences in supplemental science activities, research, and field experiences, and the influence these experiences have on their science identity development. The university supplemental science program was an eight-week summer program in which students interacted with a diverse group of peers from various high schools through engaging in environmental health science rotations, field experiences, and research with faculty advisors and graduate student mentors. Data collection included existing program evaluation data, including weekly journals and exit interviews, as well as follow-up interviews conducted several months after the program concluded. The study findings from a three-step coding process of the follow-up interview transcripts provided six emerging themes: (1) promoting interest and motivation to pursue new areas of science, (2) mechanisms in the acquisition of science knowledge and skills in scientific practice, (3) confidence in science knowledge and abilities, (4) understanding and applying science in the world, (5) emerging relationships with peers and mentors in science, and (6) aspirations to be a science person in the scientific community. This research study informs other supplemental science programs, has implications for improved science curricula and instruction in K-12 schools, and explains how exposure to science experiences can help students gain identities in science.
18

Automatic Annotation of Speech: Exploring Boundaries within Forced Alignment for Swedish and Norwegian / Automatisk Anteckning av Tal: Utforskning av Gränser inom Forced Alignment för Svenska och Norska

Biczysko, Klaudia January 2022
In Automatic Speech Recognition, there is an extensive need for time-aligned data. Manual speech segmentation has been shown to be more laborious than manual transcription, especially when dealing with tens of hours of speech. Forced alignment is a technique for matching a signal with its orthographic transcription with respect to the duration of linguistic units. Most forced aligners, however, are language-dependent and trained on English data, whereas under-resourced languages lack both the resources to develop the acoustic model required for an aligner and manually aligned data. An alternative to training new models is cross-language forced alignment, in which an aligner trained on one language is used for aligning data in another language.  This thesis aimed to evaluate state-of-the-art forced alignment algorithms available for Swedish and to test whether a Swedish model could be applied for aligning Norwegian. Three approaches to forced alignment were employed: (1) a forced aligner based on Dynamic Time Warping and text-to-speech synthesis (Aeneas), (2) two forced aligners based on Hidden Markov Models, namely the Munich AUtomatic Segmentation System (WebMAUS) and the Montreal Forced Aligner (MFA), and (3) a Connectionist Temporal Classification (CTC) segmentation algorithm with two pre-trained and fine-tuned Wav2Vec2 Swedish models. First, small speech test sets for Norwegian and Swedish, covering different degrees of spontaneity in the speech, were created and manually aligned to produce gold-standard alignments. Second, the aligners’ performance on the Swedish dataset was evaluated against the gold standard. Finally, it was tested whether the Swedish forced aligners could be applied to aligning Norwegian data. The performance of the aligners was assessed by measuring the difference between the boundaries set in the gold standard and those of the comparison alignment. Accuracy was estimated by calculating the proportion of boundary differences falling below particular thresholds proposed in the literature. It was found that the performance of the CTC segmentation algorithm with Wav2Vec2 (VoxRex) was superior to that of the other forced alignment systems. The differences between the alignments of the two Wav2Vec2 models suggest that the training data may have a larger influence on the alignments than the architecture of the algorithm. At lower thresholds, the traditional HMM approach outperformed the deep learning models. Finally, the findings of the thesis demonstrate promising results for cross-language forced alignment using Swedish models to align related languages, such as Norwegian.
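The boundary-accuracy evaluation described in this abstract (the proportion of automatically placed boundaries that fall within a given tolerance of the gold standard) can be written in a few lines. The sketch below uses invented boundary times and a set of commonly reported tolerances; it illustrates the metric only and is not the thesis's evaluation code.

```python
# Sketch of the boundary-accuracy metric: the proportion of automatically
# placed boundaries within a tolerance of the gold-standard boundaries.
# Boundary times and tolerances below are invented for illustration.
def boundary_accuracy(gold, predicted, tolerance_s):
    """Fraction of boundaries whose absolute error is within the tolerance."""
    assert len(gold) == len(predicted), "one predicted boundary per gold boundary"
    hits = sum(abs(g - p) <= tolerance_s for g, p in zip(gold, predicted))
    return hits / len(gold)

# Hypothetical word-boundary times (seconds) for one utterance.
gold_boundaries = [0.00, 0.31, 0.58, 0.94, 1.27]
aligner_output  = [0.02, 0.35, 0.57, 1.01, 1.26]

for tol in (0.01, 0.025, 0.05, 0.10):   # tolerances often reported in the literature
    acc = boundary_accuracy(gold_boundaries, aligner_output, tol)
    print(f"within {int(tol * 1000):>3} ms: {acc:.0%}")
```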
19

Teachers' experiences of curriculum change in two under-resourced primary schools in the Durban area

Pillay, Inbam 11 1900
The purpose of this study was to explore teachers’ experiences of curriculum change in two under-resourced primary schools in the Durban area. By examining the experiences of educators using a qualitative approach, the researcher was able to identify problems that prevent a smooth transition from one curriculum to another. The introduction of the Curriculum Assessment Policy Statements in January 2012 necessitated a plethora of adjustments for teachers at schools. Changes were made to the number of subjects to be taught and the notional time for each subject, and there was a renewed emphasis on textbooks as a vital teaching resource in the classroom. This study was conducted in under-resourced primary schools in the Durban area. Data collection in both these schools shows that despite the lack of essential resources such as textbooks, teachers still manage to implement change and follow policy, whilst at the same time ensuring that their learners benefit from the curriculum. This study also highlights the challenges experienced by teachers in under-resourced schools that need to be confronted for effective curriculum implementation. The researcher makes recommendations to address these challenges as well as suggestions for future research. / Curriculum and Instructional Studies / M. Ed. (Curriculum Studies)
