1001.
Probabilistic modelling of morphologically rich languages. Botha, Jan Abraham. January 2014.
This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex language well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes.
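The second model described above links morphologically related words through shared morpheme vectors. A minimal sketch of that additive idea follows; the morpheme inventory, the segmentations, and the dimensionality are toy assumptions for illustration, not the thesis's actual model.

```python
import numpy as np

# Toy morpheme vectors; in a real model these would be learned from data.
rng = np.random.default_rng(0)
dim = 4
morpheme_vecs = {m: rng.normal(size=dim) for m in ["un", "break", "able"]}

def word_vector(segmentation):
    """Compose a word vector as the sum of its morpheme vectors."""
    return np.sum([morpheme_vecs[m] for m in segmentation], axis=0)

# "unbreakable" and "breakable" differ exactly by the prefix vector for "un",
# which is what ties the two word forms together in vector space.
v_unbreakable = word_vector(["un", "break", "able"])
v_breakable = word_vector(["break", "able"])
```

Because rare words inherit components from morphemes seen elsewhere, this kind of sharing is one way to mitigate the data sparsity that morphological processes create.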
1002.
Global connectivity, information diffusion, and the role of multilingual users in user-generated content platforms. Hale, Scott A. January 2014.
Internet content and Internet users are becoming more linguistically diverse as more people speaking different languages come online and produce content on user-generated content platforms. Several platforms have emerged as truly global platforms with users speaking many different languages and coming from around the world. It is now possible to study human behavior on these platforms using the digital trace data the platforms make available about the content people are authoring. Network literature suggests that people cluster together by language, but also that there is a small average path length between any two people on most Internet platforms (including two speakers of different languages). If so, multilingual users may play critical roles as bridges or brokers on these platforms by connecting clusters of monolingual users together across languages. The large differences in the content available in different languages online underscore the importance of such roles. This thesis studies the roles of multilingual users and platform design on two large, user-generated content platforms: Wikipedia and Twitter. It finds that language plays a strong role in structuring each platform, that multilingual users do act as linguistic bridges, subject to certain limitations, that the size of a language correlates with the roles its speakers play in cross-language connections, and that there is a correlation between activity and multilingualism. In contrast to the general understanding in linguistics of high levels of multilingualism offline, this thesis finds relatively low levels of multilingualism on Twitter (11%) and Wikipedia (15%). The findings have implications for both platform design and social network theory. The findings suggest design strategies to increase multilingualism online through the identification and promotion of multilingual starter tasks, the discovery of related other-language information, and the promotion of user choice in linguistic filtering.
While weak ties have received much attention in the social networks literature, cross-language ties are often not distinguished from same-language weak ties. This thesis finds that cross-language ties are similar to same-language weak ties in that both connect distant parts of the network, have limited bandwidth, and yet transfer a non-trivial amount of information when considered in aggregate. At the same time, cross-language ties are distinct from same-language weak ties for the purposes of information diffusion. In general, cross-language ties are smaller in number than same-language ties, but each cross-language tie may convey more diverse information given the large differences in the content available in different languages and the relative ease with which a multilingual speaker may access content in multiple languages compared to a monolingual speaker.
1003.
Automatické zpracování českých soudních rozhodnutí / Processing of Czech court decisions. Maslowski, Bohdan. January 2015.
Title: Processing of Czech court decisions. Author: Bohdan Maslowski. Department: Institute of Formal and Applied Linguistics. Supervisor: Mgr. Barbora Vidová Hladká, Ph.D. Abstract: The objective of this thesis is a comparison of various language processing methods for Czech case-law documents. In particular, it addresses the extraction of information about parties (names, roles, addresses, etc.) and document classification by two criteria, subject and result. Machine learning methods are evaluated and compared to a rule-based approach. For the purpose of training and evaluating the classifiers, a corpus of 400 Czech case-law documents has been created and manually annotated. The thesis includes a web application that demonstrates the results of the different approaches and a tool for running and evaluating test scenarios. Keywords: natural language processing, information extraction, legislative domain, machine learning, rule-based systems
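The rule-based side of such a comparison can be illustrated with a single extraction rule. The pattern and the example sentence below are invented for illustration; they are not taken from the annotated corpus of Czech case-law documents.

```python
import re

# Hypothetical rule: capture a party's role and a two-word capitalized name.
PARTY_RULE = re.compile(
    r"(?P<role>plaintiff|defendant)\s+(?P<name>[A-Z][a-z]+\s[A-Z][a-z]+)"
)

text = "The plaintiff Jan Novak filed a complaint against the defendant Petr Svoboda."
parties = [(m.group("role"), m.group("name")) for m in PARTY_RULE.finditer(text)]
```

A real rule-based system would need many such patterns (and language-specific handling for Czech morphology), which is exactly the maintenance cost that motivates the machine learning comparison.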
1004.
Vícejazyčná databáze kolokací / Multilingual Database of Collocations. Helcl, Jindřich. January 2014.
Collocations are groups of words that co-occur more often than they appear separately. They also include phrases that give a new meaning to a group of otherwise unrelated words. This thesis aims to find collocations in large data sets and to create a database that allows their retrieval. Pointwise Mutual Information (PMI), a value based on word frequency, is computed to find the collocations; word pairs with the highest PMI values are considered candidates for good collocations. Selected collocations are stored in a database in a format that allows searching with Apache Lucene. Part of the thesis is a web user interface providing a quick and easy way to search collocations. If the service is fast enough and the collocations are good, translators will be able to use it to find proper equivalents in the target language, and students of a foreign language will be able to use it to extend their vocabulary. The database will be created independently in several languages, including Czech and English.
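The PMI scoring described above can be sketched in a few lines, assuming adjacent word pairs as the co-occurrence unit; the corpus here is a toy illustration.

```python
import math
from collections import Counter

tokens = ["strong", "tea", "strong", "tea", "weak", "coffee"]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def pmi(w1, w2):
    """PMI(w1, w2) = log p(w1, w2) / (p(w1) p(w2)).

    Higher values indicate that the pair co-occurs more often than
    its word frequencies alone would predict, i.e. a better
    collocation candidate."""
    p_joint = bigrams[(w1, w2)] / sum(bigrams.values())
    p1 = unigrams[w1] / len(tokens)
    p2 = unigrams[w2] / len(tokens)
    return math.log(p_joint / (p1 * p2))

# "strong tea" occurs twice and scores higher than the once-seen "tea strong".
```

In practice, raw PMI overweights rare pairs, so production systems usually apply frequency cutoffs before ranking candidates.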
1005.
詞彙向量的理論與評估基於矩陣分解與神經網絡 / Theory and evaluation of word embedding based on matrix factorization and neural network. 張文嘉 (Jhang, Wun Jia). Unknown Date.
隨著機器學習在越來越多任務中有突破性的發展,特別是在自然語言處理問題上,得到越來越多的關注,近年來,詞向量是自然語言處理研究中最令人興奮的部分之一。在這篇論文中,我們討論了兩種主要的詞向量學習方法:一種是傳統的矩陣分解,如奇異值分解;另一種是基於神經網絡的模型(例如具有負採樣的Skip-gram模型(Mikolov等人,2013)),它是一種迭代演算法。我們提出一種挑選初始值的方法,將奇異值分解得到的詞向量當作Skip-gram模型的初始值,結果發現替換較佳的初始值後,在某些自然語言處理任務中得到明顯的提升。 / Recently, word embedding has been one of the most exciting areas of research in natural language processing. In this thesis, we discuss the two major learning approaches for word embeddings. One is traditional matrix factorization, such as singular value decomposition; the other is based on a neural network model (e.g. the Skip-gram model with negative sampling (Mikolov et al., 2013b)), which is an iterative algorithm. It is known that an iterative process is sensitive to its initial starting values. We present an approach that initializes the Skip-gram model with negative sampling using word vectors obtained from singular value decomposition. Furthermore, we show that refined initial starting points improve performance on the analogy task and succeed in capturing fine-grained semantic and syntactic regularities using vector arithmetic.
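The initialization idea can be sketched as follows: factor a co-occurrence-based matrix with SVD and take the truncated factors as word vectors, which can then replace random initial values in an iterative model such as Skip-gram with negative sampling. The small symmetric PPMI matrix below is an illustrative stand-in, not real corpus counts.

```python
import numpy as np

# Toy 3-word PPMI co-occurrence matrix (symmetric, zero diagonal).
ppmi = np.array([[0.0, 1.2, 0.3],
                 [1.2, 0.0, 0.8],
                 [0.3, 0.8, 0.0]])

U, S, Vt = np.linalg.svd(ppmi)
k = 2  # target embedding dimension
# Weight the left singular vectors by sqrt of the singular values:
# one k-dimensional vector per word, usable as Skip-gram initial values.
init_vectors = U[:, :k] * np.sqrt(S[:k])
```

The sqrt weighting is one common convention for splitting the singular values between the two factor matrices; other weightings are possible and the thesis's exact choice may differ.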
1006.
Employee Matching Using Machine Learning Methods. Marakani, Sumeesha. January 2019.
Background: Expertise retrieval is an information retrieval technique that focuses on identifying the most suitable 'expert' for a task from a list of individuals. Objectives: This master thesis is a collaboration with Volvo Cars that attempts to apply this concept and match employees based on information extracted from an internal tool of the company. In this tool, the employees describe themselves in free-flowing text. This text is extracted from the tool and analyzed using Natural Language Processing (NLP) techniques. Methods: Over the course of this project, various NLP techniques are employed and experimented with to study, analyze and understand the unlabelled textual data. We then try to match individuals based on the extracted information using unsupervised machine learning methods (K-means clustering). Results: The results obtained from applying the various NLP techniques are explained along with the algorithms that are implemented, and inferences about the properties of the data and methodologies are discussed. Conclusions: The results of this project show that it is possible to extract patterns among people based on free-text data written about them. The future aim is to incorporate the semantic relationship between words to identify people who are similar and dissimilar based on the data they share about themselves.
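A toy sketch of the pipeline described above: represent free-text self-descriptions as bag-of-words vectors and group them with k-means. The descriptions, vocabulary, and choice of k are illustrative assumptions, not data from the company tool.

```python
import numpy as np

docs = ["python machine learning", "java backend services",
        "machine learning models", "backend java developer"]

# Bag-of-words vectorization over the corpus vocabulary.
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

def kmeans(X, k, iters=20):
    # Seed with the first k points (a simple deterministic choice).
    centers = X[:k].copy()
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(X, k=2)
# The two "machine learning" descriptions end up together, as do the
# two "java backend" ones.
```

A production version would use TF-IDF weighting and a library implementation with smarter initialization (e.g. k-means++), but the grouping logic is the same.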
1007.
A natural language processing solution to probable Alzheimer’s disease detection in conversation transcripts. Comuni, Federica. January 2019.
This study proposes an accuracy comparison of two of the best performing machine learning algorithms in natural language processing, the Bayesian Network and the Long Short-Term Memory (LSTM) Recurrent Neural Network, in detecting Alzheimer’s disease symptoms in conversation transcripts. Because of the current global rise in life expectancy, the number of seniors affected by Alzheimer’s disease worldwide is increasing each year. Early detection is important to ensure that affected seniors take measures to relieve symptoms when possible or prepare plans before further cognitive decline occurs. The literature shows that natural language processing can be a valid tool for early diagnosis of the disease. This study found that mild dementia and possible Alzheimer’s can be detected in conversation transcripts with promising results, and that the LSTM is particularly accurate at this detection, reaching an accuracy of 86.5% on the chosen dataset. The Bayesian Network classified with an accuracy of 72.1%. The study confirms the effectiveness of a natural language processing approach to detecting Alzheimer’s disease.
1008.
[en] DEEP SEMANTIC ROLE LABELING FOR PORTUGUESE / [pt] ANOTAÇÃO PROFUNDA DE PAPÉIS SEMÂNTICOS PARA O PORTUGUÊS. GUILHERME SANT ANNA VARELA. 06 August 2019.
[pt] Vivemos em um mundo complexo, no qual incontáveis fatores aparentemente desconexos – tais como a lei de Moore, que dita um aumento exponencial da capacidade de processamento em um chip de silício, a queda do custo de espaço de armazenamento e a adoção em massa de smartphones – colaboram para a formação de uma sociedade progressivamente interdependente. Todos os dias são criados 2,5 quintilhões de bytes de dados; de fato, 90 por cento dos dados no mundo foram criados nos últimos dois anos. Domar os padrões salientes aos dados, separando informação do caos, torna-se uma necessidade iminente para a tomada de decisão dos indivíduos e para a sobrevivência das organizações. Nesse cenário, a melhor resposta dos pesquisadores de Processamento de Linguagem Natural encontra-se na tarefa de Anotação de Papéis Semânticos. APS é a tarefa que tem o audacioso objetivo de compreender eventos, buscando determinar quem fez o quê e onde, quais foram os beneficiados ou qual o meio utilizado para atingir os fins. APS serve como tarefa intermediária para várias aplicações de alto nível, e.g. information extraction, question answering e agentes conversacionais. Tradicionalmente, resultados satisfatórios eram obtidos apenas com alta dependência de conhecimento específico de domínio. Para o português, através dessa abordagem, o sistema estado da arte da tarefa atinge 79,6 por cento de pontuação F1. Sistemas mais recentes, que dependem de uma série de subtarefas, obtêm 58 por cento de pontuação F1. Nesta dissertação, exploramos um novo paradigma utilizando redes neurais recorrentes, para o português do Brasil, e sem subtarefas intermediárias, obtendo uma pontuação de 66,23 por cento.
/ [en] We live in a complex world in which a myriad of seemingly unrelated factors – such as Moore's law, which states that the processing capacity on a silicon wafer increases exponentially, the fall of storage costs and the mass adoption of smartphones – contribute to the formation of an increasingly interdependent society: 2.5 quintillion bytes of data are generated every day; in fact, ninety percent of the world's data were created in the last few years. Harnessing the emerging patterns within the data, effectively separating information from chaos, is crucial both for individual decision making and for the survival of organizations. In this scenario, the best answer from Natural Language Processing researchers is the task of Semantic Role Labeling. SRL is the task that concerns itself with the audacious goal of event understanding, which means determining who did what to whom, who was the beneficiary, or what were the means to achieve some goal. SRL is also an intermediary task for high-level applications such as information extraction, question answering and chatbots. Traditionally, satisfactory results were obtained only by the introduction of highly specific domain knowledge. For Portuguese, this approach yields an F1 score of 79.6 percent. Recent systems, which rely on a pipeline of sub-tasks, yield an F1 score of 58 percent. In this dissertation, we adopt a new paradigm using recurrent neural networks for Brazilian Portuguese that does not rely on a pipeline; our system obtains a score of 66.23 percent.
1009.
[en] DISTANT SUPERVISION FOR RELATION EXTRACTION USING ONTOLOGY CLASS HIERARCHY-BASED FEATURES / [pt] SUPERVISÃO À DISTÂNCIA EM EXTRAÇÃO DE RELACIONAMENTOS USANDO CARACTERÍSTICAS BASEADAS EM HIERARQUIA DE CLASSES EM ONTOLOGIAS. PEDRO HENRIQUE RIBEIRO DE ASSIS. 18 March 2015.
[pt] Extração de relacionamentos é uma etapa chave para o problema de identificação de uma estrutura em um texto em formato de linguagem natural. Em geral, estruturas são compostas por entidades e relacionamentos entre elas. As propostas de solução com maior sucesso aplicam aprendizado de máquina supervisionado a corpus anotados à mão para a criação de classificadores de alta precisão. Embora alcancem boa robustez, corpus criados à mão não são escaláveis por serem uma alternativa de grande custo. Neste trabalho, aplicamos um paradigma alternativo para a criação de um número considerável de exemplos de instâncias para classificação. Tal método é chamado de supervisão à distância. Em conjunto com essa alternativa, usamos ontologias da Web Semântica para propor e usar novas características para treinar classificadores. Elas são baseadas na estrutura e na semântica descritas por ontologias onde recursos da Web Semântica são definidos. O uso de tais características teve grande impacto na precisão e no recall dos nossos classificadores finais. Neste trabalho, aplicamos nossa teoria em um corpus extraído da Wikipedia. Alcançamos alta precisão e recall para um número considerável de relacionamentos.
/ [en] Relation extraction is a key step in the problem of rendering a structure from text in natural language format. In general, structures are composed of entities and the relationships among them. The most successful approaches to relation extraction apply supervised machine learning to hand-labeled corpora to create highly accurate classifiers. Although good robustness is achieved, hand-labeled corpora are not scalable due to the expensive cost of their creation. In this work we apply an alternative paradigm for creating a considerable number of example instances for classification. This method is called distant supervision. Along with this alternative approach we adopt Semantic Web ontologies to propose and use new features for training classifiers. Those features are based on the structure and semantics described by the ontologies in which Semantic Web resources are defined. The use of such features has a great impact on the precision and recall of our final classifiers. In this work, we apply our theory to a corpus extracted from Wikipedia, and we achieve high precision and recall for a considerable number of relations.
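The distant-supervision labeling step can be sketched as follows: any sentence mentioning both entities of a known relation instance becomes a (noisy) positive training example. The knowledge base entry and the sentences below are toy assumptions, not the actual Wikipedia corpus.

```python
# Toy knowledge base: (entity1, entity2) -> relation name.
kb = {("Rio de Janeiro", "Brazil"): "locatedIn"}

sentences = [
    "Rio de Janeiro is a major city in Brazil.",
    "Rio de Janeiro hosted the 2016 Olympic Games.",
]

def label_sentences(kb, sentences):
    """Pair each KB relation instance with every sentence that mentions
    both of its entities, yielding automatically labeled examples."""
    examples = []
    for (e1, e2), relation in kb.items():
        for s in sentences:
            if e1 in s and e2 in s:
                examples.append((s, e1, e2, relation))
    return examples

examples = label_sentences(kb, sentences)
```

The resulting labels are noisy (a sentence may mention both entities without expressing the relation), which is why feature design, such as the ontology-based features proposed here, matters for the downstream classifier.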
1010.
A Pragmatic Analysis of WISH Imperatives. Ryo Nomura (6630887). 14 May 2019.
A word or a linguistic construction can mean various things depending on the context. The imperative is a representative example of such a construction and can express a variety of illocutionary forces such as COMMAND, REQUEST, ADVICE, and more (Quirk et al., 1985; Huddleston et al., 2002).

However, although there are many studies that comprehensively deal with the imperative or with its individual illocutionary forces (e.g. Lakoff, 1966; Ljung, 1975; Davies, 1986; Wilson & Sperber, 1988; Han, 2000; Takahashi, 2012; Jary & Kissine, 2014), no study shows a possible overall process of how we would interpret an imperative to reach a certain illocutionary force when it is uttered. Without such a shared process, we cannot explain why we can communicate using imperatives without misunderstandings. Thus, this process needs to be investigated.

Another problem regarding imperatives is the treatment of non-directive uses of imperatives such as "Have a good day". The illocutionary force of this imperative would be called GOOD WISH and regarded as a conventional use of imperatives (Davies, 1986). However, it has not been clearly explained why we would choose the imperative construction to express wishes. If wishes of this kind expressed in the form of the imperative are actually a use of the imperative, then there should be some reason and motivation for it.

The main purposes of this study are to provide (1) a schema of how one would typically reach the interpretation of WISH when hearing an imperative and (2) an account of such uses of imperatives as WISH. In this study, examples of imperatives in two non-cognate languages, Japanese and English, are used for the analysis in the hope of substantiating the credibility of the schema and the account. Based on the analyses of the imperative and of individual illocutionary forces presented in the literature, combined with my own analysis, a schema is proposed that illustrates how one would typically reach PRIVATE WISH, whose state of affairs is deemed desirable mainly for the speaker, and GOOD WISH, whose state of affairs is deemed desirable mainly for the addressee. Then, an account of the use of PRIVATE WISH and GOOD WISH is provided. Specifically, the use of imperatives as WISH is an analogous use of prototypical imperatives; people use the imperative construction to express their strong desirability, and to build and maintain a good relationship with others.