401 |
Unsupervised Morphological Segmentation and Part-of-Speech Tagging for Low-Resource Scenarios. Eskander, Ramy. January 2021.
With the high cost of manually labeling data and the increasing interest in low-resource languages, for which human annotators might not even be available, unsupervised approaches have become essential for processing a typologically diverse set of languages, whether high-resource or low-resource. In this work, we propose new fully unsupervised approaches for two tasks in morphology: unsupervised morphological segmentation and unsupervised cross-lingual part-of-speech (POS) tagging, two essential subtasks for several downstream NLP applications such as machine translation, speech recognition, information extraction and question answering.
We propose a new unsupervised morphological-segmentation approach that utilizes Adaptor Grammars (AGs), nonparametric Bayesian models that generalize probabilistic context-free grammars (PCFGs), where a PCFG models word structure in the task of morphological segmentation. We implement the approach as a publicly available morphological-segmentation framework, MorphAGram, that enables unsupervised morphological segmentation through the use of several proposed language-independent grammars. In addition, the framework allows for the use of scholar knowledge, when available, in the form of affixes that can be seeded into the grammars. The framework handles the cases where the scholar-seeded knowledge is either generated from language resources, possibly by someone who does not know the language, as weak linguistic priors, or generated by an expert in the underlying language as strong linguistic priors. Another form of linguistic priors is the design of a grammar that models language-dependent specifications. We also propose a fully unsupervised learning setting that approximates the effect of scholar-seeded knowledge through self-training. Moreover, since no single grammar works best across all languages, we propose an approach that picks a nearly optimal configuration (a learning setting and a grammar) for an unseen language, that is, a language that is not part of the development process. Finally, we examine multilingual learning for unsupervised morphological segmentation in low-resource setups.
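For illustration, here is a minimal Python sketch of the PCFG view of segmentation that Adaptor Grammars build on, with seeded affixes acting as toy linguistic priors; the rules, priors and scoring below are illustrative assumptions, not the actual MorphAGram grammars.

from itertools import combinations

def segmentations(word, max_cuts=2):
    # Enumerate all ways to cut `word` into up to max_cuts+1 contiguous morphs,
    # mirroring a Word -> Prefix* Stem Suffix* style grammar.
    for n in range(max_cuts + 1):
        for cuts in combinations(range(1, len(word)), n):
            bounds = (0,) + cuts + (len(word),)
            yield [word[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

# Toy seeded-affix priors standing in for scholar-seeded knowledge.
AFFIX_PRIOR = {"un": 0.9, "re": 0.9, "ing": 0.9, "ed": 0.8, "s": 0.7}

def score(morphs):
    # Product of per-morph probabilities; unseen morphs get a weak default.
    p = 1.0
    for m in morphs:
        p *= AFFIX_PRIOR.get(m, 0.1)
    return p

print(max(segmentations("unhelpful"), key=score))  # ['un', 'helpful'] under these toy priors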
For unsupervised POS tagging, two cross-lingual approaches have been widely adopted: 1) annotation projection, where POS annotations are projected across an aligned parallel text from a source language, for which a POS tagger is available, to the target language prior to training a POS model; and 2) zero-shot model transfer, where a model of a source language is directly applied to texts in the target language. We propose an end-to-end architecture for unsupervised cross-lingual POS tagging via annotation projection in truly low-resource scenarios that do not assume access to parallel corpora that are large in size or that represent a specific domain. We integrate and expand the best practices in alignment and projection and design a rich neural architecture that exploits non-contextualized and transformer-based contextualized word embeddings, affix embeddings and word-cluster embeddings. Additionally, since parallel data might be available between the target language and multiple source languages, as in the case of the Bible, we propose different approaches for learning from multiple sources. Finally, we combine our work on unsupervised morphological segmentation and unsupervised cross-lingual POS tagging by conducting unsupervised stem-based cross-lingual POS tagging via annotation projection, which relies on the stem as the core unit of abstraction for alignment and projection, an approach that benefits low-resource, morphologically complex languages. We also examine morpheme-based alignment and projection, the use of linguistic priors towards better POS models and the use of segmentation information as learning features in the neural architecture.
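As a minimal sketch of the projection step in the first approach, the function below copies POS tags from source tokens to their aligned target tokens by majority vote; the alignment format and tie-breaking are simplified assumptions, not the full pipeline proposed here.

def project_tags(source_tags, alignments, target_len):
    # source_tags: one POS tag per source token.
    # alignments: (source_index, target_index) word-alignment pairs.
    votes = [{} for _ in range(target_len)]
    for s, t in alignments:
        tag = source_tags[s]
        votes[t][tag] = votes[t].get(tag, 0) + 1
    # Majority vote per target token; unaligned tokens stay untagged (None).
    return [max(v, key=v.get) if v else None for v in votes]

src_tags = ["PRON", "VERB", "NOUN"]      # e.g. a tagged source sentence
align = [(0, 0), (1, 1), (2, 2)]         # toy one-to-one alignment
print(project_tags(src_tags, align, 3))  # ['PRON', 'VERB', 'NOUN']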
We conduct comprehensive evaluation and analysis to assess the performance of our approaches to unsupervised morphological segmentation and unsupervised POS tagging and show that they achieve state-of-the-art performance for the two morphology tasks when evaluated on a large set of languages of different typologies: analytic, fusional, agglutinative and synthetic/polysynthetic.
|
402 |
Nízko-dimenzionální faktorizace pro "End-To-End" řečové systémy / Low-Dimensional Matrix Factorization in End-To-End Speech Recognition Systems. Gajdár, Matúš. January 2020.
The project covers automatic speech recognition with neural network training using low-dimensional matrix factorization. We describe time delay neural networks with factorization (TDNN-F) and without it (TDNN), implemented in PyTorch. We compare our PyTorch implementation against the Kaldi toolkit and achieve similar results in experiments with various network architectures. The last chapter describes the impact of low-dimensional matrix factorization on end-to-end speech recognition systems, as well as a modification of the system with TDNN(-F) networks. With specific network settings, we were able to achieve better results with systems using factorization. Additionally, we reduced training complexity by decreasing the number of network parameters through the use of TDNN(-F) networks.
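As a rough illustration of the factorization idea, the PyTorch sketch below replaces one wide affine transform with two thinner ones meeting in a low-dimensional bottleneck; the semi-orthogonal constraint that Kaldi imposes on the first factor is omitted here, and all dimensions are illustrative.

import torch
import torch.nn as nn

class FactorizedTDNNLayer(nn.Module):
    def __init__(self, d_in, d_out, bottleneck, context=3):
        super().__init__()
        # Conv1d over time plays the role of the TDNN's spliced affine layer.
        self.low_rank = nn.Conv1d(d_in, bottleneck, kernel_size=context, bias=False)
        self.expand = nn.Conv1d(bottleneck, d_out, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, d_in, time)
        return self.act(self.expand(self.low_rank(x)))

layer = FactorizedTDNNLayer(d_in=512, d_out=512, bottleneck=64)
x = torch.randn(8, 512, 100)  # 8 utterances, 100 frames of features
print(layer(x).shape)         # torch.Size([8, 512, 98])
# Weights: 512*64*3 + 64*512 vs. 512*512*3 for the unfactorized layer.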
|
403 |
A Swedish wav2vec versus Google speech-to-text. Lagerlöf, Ester. January 2022.
As automatic speech recognition technology becomes more advanced, the range of fields in which it can operate is growing. The best automatic speech recognition technologies today are mainly based on, and made for, the English language. However, the National Library of Sweden recently released open-source wav2vec models built specifically with the Swedish language in mind. To investigate their performance, one of these models is chosen to assess how well it transcribes the Swedish news broadcasts ”kvart-i-fem”-ekot, comparing its results with Google speech-to-text. The results present wav2vec as the prominent model for this type of audio data, achieving an average word error rate 9 percentage points lower than that of Google speech-to-text. Part of this performance can be attributed to the self-supervised method the wav2vec model uses to exploit large amounts of unlabeled data in its training. In spite of this, both models had difficulty transcribing poor-quality audio, such as recordings with disturbing background noise and stationary sounds. Abbreviations and names were also difficult for both to transcribe correctly; Google speech-to-text did, however, perform better than the wav2vec model in this respect.
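For reference, the word error rate used in such comparisons is the word-level Levenshtein distance divided by the reference length; a minimal Python sketch (with a made-up Swedish example) follows.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between first i reference and j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub/del/ins
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("klockan är kvart i fem", "klockan var kvart i fem"))  # 0.2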
|
404 |
AI’s implications for International Entrepreneurship in the digital and pandemic world: From external and internal perspectives. Lampic Aaltonen, Ibb; Fust, Fiona. January 2022.
In the fourth industrial revolution, technological advancement and digital transformation are inevitable, and they impact individuals, organizations, and governments tremendously and extensively. The ongoing COVID-19 pandemic has been a catalyst that accelerates the pace and scale of embracing digitalization, leading to a dramatic shift in the business environment. Artificial Intelligence (AI) attracts increasing attention based on the business opportunities and values that can be created from both internal and external aspects alike. Grounded in the digital context, with AI as an enabler from the external perspective and AI as a core resource from the internal perspective, the research attempts to identify 1) AI's implications for international entrepreneurs' possibilities to explore business opportunities; and 2) AI's significance for international entrepreneurs in enhancing performance and generating value in the international market. The study conducts qualitative research based on six case studies to examine and explore the aforementioned research area. The research supports the theoretical framework that AI as an enabler provides international entrepreneurs with conducive conditions to test and experiment with new business initiatives, which positively impacts innovation and opens a wide new window of business opportunity across borders. In parallel, the research is consistent with theories that treat AI as a valuable resource from the resource-based view, contributing to SMEs’ enhanced performance and paving the way for international entrepreneurs to stay competitive. In addition, the study proposes that combining entrepreneurs' heuristic approaches to strategic decision-making with the assistance of AI in uncertain circumstances is crucial for conducting business in the digital environment. The research highlights the integration of innovation resources from external and internal aspects alike to stimulate and catalyse the growth of international entrepreneurship in the digital industry in established markets. The research emphasizes that the COVID-19 pandemic has caused changes in the digital environment that affect international entrepreneurial activities. The article concludes with the above-mentioned circumstances' implications for international entrepreneurship, proposing a theoretical framework and providing an agenda for future research in the area.
|
405 |
Development of a text-independent automatic speaker recognition system. Mokgonyane, Tumisho Billson. January 2021.
Thesis (M. Sc. (Computer Science)) -- University of Limpopo, 2021 / The task of automatic speaker recognition, wherein a system verifies or identifies speakers from a recording of their voices, has been researched for several decades. However, research in this area has been carried out largely on freely accessible speaker datasets built on well-resourced languages such as English. This study undertakes automatic speaker recognition research focused on a low-resourced language, Sepedi. As one of the 11 official languages in South Africa, Sepedi is spoken by at least 2.8 million people. Pre-recorded voices were acquired from a national speech and language repository, the National Centre for Human Language Technology (NCHLT), from which we selected the Sepedi NCHLT Speech Corpus. The open-source pyAudioAnalysis Python library was used to extract three types of acoustic features of speech, namely time, frequency and cepstral domain features, from the acquired speech data. The effects and compatibility of these acoustic features were investigated. It was observed that combining the three types of acoustic features had a more significant effect than using individual features as far as speaker recognition accuracy is concerned. The study also investigated the performance of machine learning algorithms on low-resourced languages such as Sepedi. Five machine learning (ML) algorithms implemented in Scikit-learn, namely K-nearest neighbours (KNN), support vector machines (SVM), random forest (RF), logistic regression (LR) and multi-layer perceptrons (MLP), were used to train different classifier models. The GridSearchCV algorithm, also implemented in Scikit-learn, was used to determine suitable hyper-parameters for each of the five ML algorithms. The classifier models were evaluated on recognition accuracy, and the results show that the MLP classifier, with a recognition accuracy of 98%, outperforms the KNN, RF, LR and SVM classifiers. A graphical user interface (GUI) was developed, and the best-performing classifier model, MLP, was deployed on the GUI, intended to be used for real-time speaker identification and verification tasks. Participants were recruited to evaluate the GUI's performance, and acceptable results were obtained.
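As a minimal sketch of the classifier-selection step, the snippet below tunes the five scikit-learn models with GridSearchCV; it assumes X already holds the pyAudioAnalysis features and y the speaker labels, and the parameter grids are illustrative, not the ones used in the thesis.

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

candidates = {
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "SVM": (SVC(), {"C": [1, 10], "kernel": ["rbf", "linear"]}),
    "RF":  (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    "LR":  (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "MLP": (MLPClassifier(max_iter=1000), {"hidden_layer_sizes": [(64,), (128, 64)]}),
}

def evaluate(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    for name, (model, grid) in candidates.items():
        # 5-fold cross-validated grid search, then held-out accuracy.
        search = GridSearchCV(model, grid, cv=5).fit(X_tr, y_tr)
        print(name, search.best_params_, round(search.score(X_te, y_te), 3))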
|
406 |
End-to-end Speech Separation with Neural Networks. Luo, Yi. January 2021.
Speech separation has long been an active research topic in the signal processing community, owing to its importance in a wide range of applications such as hearable devices and telecommunication systems. It not only serves as a fundamental problem for all higher-level speech processing tasks such as automatic speech recognition, natural language understanding, and smart personal assistants, but also plays an important role in smart earphones and augmented- and virtual-reality devices.
With the recent progress in deep neural networks, separation performance has been significantly advanced by various new problem definitions and model architectures. The most widely used approach in past years performs separation in the time-frequency domain, where a spectrogram or a time-frequency representation is first calculated from the mixture signal and multiple time-frequency masks are then estimated for the target sources. The masks are applied to the mixture's time-frequency representation to extract the target representations, and operations such as the inverse short-time Fourier transform are then used to convert them back to waveforms. However, such frequency-domain methods may have difficulties in modeling the phase spectrogram, as the conventional time-frequency masks often only consider the magnitude spectrogram. Moreover, the training objectives for frequency-domain methods are typically also defined in the frequency domain, which may not be in line with widely used time-domain evaluation metrics such as signal-to-noise ratio and signal-to-distortion ratio.
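A minimal PyTorch sketch of this conventional pipeline follows: magnitude masks are applied to the mixture's STFT and the mixture phase is reused at resynthesis, which is exactly the phase-modeling limitation noted above. The masks are assumed given; in practice a network estimates them.

import torch

def masked_istft(mixture, masks, n_fft=512, hop=128):
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()
    sources = []
    for mask in masks:  # one magnitude mask per target source
        est = mask * mag * torch.exp(1j * phase)  # mixture phase reused as-is
        sources.append(torch.istft(est, n_fft, hop, window=window,
                                   length=mixture.shape[-1]))
    return sources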
The problem formulation of time-domain, end-to-end speech separation naturally arises to tackle the disadvantages of frequency-domain systems. End-to-end speech separation networks take the mixture waveform as input and directly estimate the waveforms of the target sources. Following the general pipeline of conventional frequency-domain systems, which contains a waveform encoder, a separator, and a waveform decoder, time-domain systems can be designed in a similar way while significantly improving separation performance.
In this dissertation, I focus on multiple aspects of the general problem formulation of end-to-end separation networks, including system designs, model architectures, and training objectives. I start with a single-channel pipeline, referred to as the time-domain audio separation network (TasNet), to validate the advantage of end-to-end separation compared with conventional time-frequency-domain pipelines. I then move to the multi-channel scenario and introduce the filter-and-sum network (FaSNet) for both fixed-geometry and ad-hoc-geometry microphone arrays.
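To make the encoder-separator-decoder layout concrete, here is a heavily simplified, TasNet-style sketch in PyTorch: a learned filterbank encoder, a mask-estimating separator (a placeholder convolution stack here, not the actual TasNet separator), and a learned decoder.

import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    def __init__(self, n_src=2, n_filters=256, win=16):
        super().__init__()
        self.n_src = n_src
        # Learned analysis/synthesis filterbanks replace the STFT/iSTFT.
        self.encoder = nn.Conv1d(1, n_filters, win, stride=win // 2, bias=False)
        self.separator = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.ReLU(),
            nn.Conv1d(n_filters, n_src * n_filters, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(n_filters, 1, win, stride=win // 2, bias=False)

    def forward(self, mix):                   # mix: (batch, samples)
        rep = self.encoder(mix.unsqueeze(1))  # (batch, filters, frames)
        masks = self.separator(rep).chunk(self.n_src, dim=1)
        return [self.decoder(m * rep).squeeze(1) for m in masks]

est = TinyTasNet()(torch.randn(4, 16000))  # 4 mixtures -> 2 sources each
print(len(est), est[0].shape)              # 2 torch.Size([4, 16000])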
Next, I introduce methods for lightweight network architecture design that allow the models to maintain separation performance while using as little as 2.5% of the model size and 17.6% of the model complexity. After that, I look into training objective functions for end-to-end speech separation and describe two training objectives, for separating varying numbers of sources and for improving robustness in reverberant environments, respectively. Finally, I take a step back, revisit several problem formulations in the end-to-end separation pipeline, and raise further questions in this framework to be analyzed and investigated in future work.
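For context, a common time-domain training objective of the kind discussed here is the scale-invariant SNR (SI-SNR); the sketch below is the standard formulation, not the specific objectives proposed in this dissertation.

import torch

def si_snr(estimate, target, eps=1e-8):
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to remove scale differences.
    scale = (estimate * target).sum(-1, keepdim=True) / \
            (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    noise = estimate - s_target
    ratio = s_target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)  # maximize SI-SNR, i.e. minimize its negative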
|
407 |
Spoken Dialogue System for Information Navigation based on Statistical Learning of Semantic and Dialogue Structure / 意味・対話構造の統計的学習に基づく情報案内のための音声対話システム. Yoshino, Koichiro. 24 September 2014.
Kyoto University / 0048 / New-system doctoral program / Doctor of Informatics / Degree No. Kō 18614 / Jōhaku No. 538 / Shinsei||Jō||95 (University Library) / 31514 / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Tatsuya Kawahara, Professor Sadao Kurohashi, Professor Hisashi Kashima / Qualifies under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
|
408 |
Partial and Synchronized Caption to Foster Second Language Listening based on Automatic Speech Recognition Clues / 第二言語のリスニング訓練のための自動音声認識を用いた部分的かつ同期された字幕付与. Maryam, Sadat Mirzaei. 23 March 2017.
Kyoto University / 0048 / New-system doctoral program / Doctor of Informatics / Degree No. Kō 20505 / Jōhaku No. 633 / Shinsei||Jō||110 (University Library) / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Tatsuya Kawahara, Professor Sadao Kurohashi, Professor Masatake Dantsuji / Qualifies under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
|
409 |
Deep Learning Based Array Processing for Speech Separation, Localization, and Recognition. Wang, Zhong-Qiu. 15 September 2020.
No description available.
|
410 |
Fluency Features and Elicited Imitation as Oral Proficiency Measurement. Christensen, Carl V. 07 July 2012.
The objective and automatic grading of oral language tests has been the subject of significant research in recent years. Several obstacles lie in the way of achieving this goal. Recent work has suggested that a testing technique called elicited imitation (EI) can be used to accurately approximate global oral proficiency. This testing methodology, however, does not incorporate some fundamental aspects of language, such as fluency. Other work has suggested another testing technique, simulated speech (SS), as a supplement to EI that can provide automated fluency metrics. In this work, I investigate a combination of fluency features extracted from SS testing and EI test scores to more accurately predict oral language proficiency. I also investigate the role of EI as an oral language test and the optimal method of extracting fluency features from SS sound files. Results demonstrate the ability of EI and SS to more effectively predict hand-scored SS test item scores. I finally discuss the implications of this work for future automated oral testing scenarios.
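As a toy sketch of the combination idea, the snippet below fits a single regression over an EI score and SS fluency features to predict a hand-assigned proficiency score; the feature set, data and model choice are illustrative assumptions, not the thesis's exact setup.

import numpy as np
from sklearn.linear_model import LinearRegression

# One row per test taker: [EI score, speech rate (wpm), pauses/min, mean pause (s)].
X = np.array([[0.92, 150.0, 4.0, 0.30],
              [0.55, 95.0, 11.0, 0.80],
              [0.71, 120.0, 7.0, 0.55]])
y = np.array([88.0, 52.0, 69.0])  # hand-scored proficiency (toy values)

model = LinearRegression().fit(X, y)
print(model.predict([[0.80, 130.0, 6.0, 0.45]]))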
|