501

Metody pro rozdělování slovních složenin / Splitting word compounds

Oberländer, Jonathan January 2017 (has links)
Unlike English, languages such as German, Dutch, the Scandinavian languages or Greek form compounds not as multi-word expressions but by combining the parts of the compound into a new word without any orthographic separation. This poses problems for a variety of tasks, such as Statistical Machine Translation or Information Retrieval. Most previous work on splitting compounds into their parts, or "decompounding", has focused on German. In this work, we create a new, simple, unsupervised system for automatic decompounding for three representative compounding languages: German, Swedish, and Hungarian. A multilingual evaluation corpus in the medical domain is created from the EMEA corpus and annotated with regard to compounding. Finally, several variants of our system are evaluated and compared to previous work.
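The abstract describes the approach only at a high level; as a rough, hypothetical illustration of the frequency-based unsupervised splitting idea that such systems commonly build on (a corpus-frequency heuristic, not a reproduction of the thesis's actual method), a minimal Python sketch might look like the following. The toy frequency list and the German linking elements are assumptions made purely for the example.

```python
# Toy corpus frequencies; a real system would count words in a large monolingual corpus.
FREQ = {"bund": 500, "es": 90, "land": 1200, "bank": 800, "bundes": 300, "landes": 150}
LINKING = ("", "s", "es")  # common German linking elements (illustrative assumption)

def split_compound(word, freq=FREQ, min_len=3):
    """Return the highest-scoring two-part split of `word`, or [word] if no split beats it."""
    best, best_score = [word], freq.get(word, 0)
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if right not in freq:
            continue
        for link in LINKING:
            if link and not left.endswith(link):
                continue
            stem = left[: len(left) - len(link)] if link else left
            if stem not in freq:
                continue
            # Geometric mean of part frequencies, a common frequency-based splitting score.
            score = (freq[stem] * freq[right]) ** 0.5
            if score > best_score:
                best_score, best = score, [stem, right]
    return best

print(split_compound("bundesland"))   # -> ['bund', 'land'] with the toy counts above
```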
503

Semantic analysis for extracting fine-grained opinion aspects

Zhan, Tianjie 01 January 2010 (has links)
No description available.
504

[en] A TOKEN CLASSIFICATION APPROACH TO DEPENDENCY PARSING / [pt] UMA ABORDAGEM POR CLASSIFICAÇÃO TOKEN-A-TOKEN PARA O PARSING DE DEPENDÊNCIA

CARLOS EDUARDO MEGER CRESTANA 13 October 2010 (has links)
[en] One of the most important tasks in Natural Language Processing is syntactic parsing, where the structure of a sentence is inferred according to a given grammar. Syntactic parsing thus tells us how to determine the meaning of a sentence from the meaning of the words in it. Syntactic parsing based on dependency grammars is called dependency parsing. The dependency-based syntactic parsing task consists in identifying a head word for each word in an input sentence. Hence, its output is a rooted tree, where the nodes are the words in the sentence. This simple, yet powerful, structure is used in a great variety of applications, like Question Answering, Machine Translation, Information Extraction and Semantic Role Labeling. State-of-the-art dependency parsing systems use transition-based or graph-based models. This dissertation presents a token classification approach to dependency parsing, by creating a special tagging set that helps to correctly find the head of a token. Using this tagging style, any classification algorithm can be trained to identify the syntactic head of each word in a sentence. In addition, this classification model treats projective and non-projective dependency graphs equally, avoiding pseudo-projective approaches. To evaluate its effectiveness, we apply the Entropy Guided Transformation Learning algorithm to the publicly available corpora from the CoNLL 2006 Shared Task. These computational experiments are performed on three corpora in different languages, namely Danish, Dutch and Portuguese. We use the Unlabelled Attachment Score as the accuracy metric. Our results show that the generated models are above the average CoNLL system performance. Additionally, these findings also indicate that the token classification approach is a promising one.
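As a rough illustration of the general idea of recasting head-finding as per-token classification, the sketch below encodes each token's head as a class label; the capped signed-offset scheme is a simplified stand-in for the special tag set proposed in the dissertation, shown only so that any off-the-shelf classifier could conceivably be trained on such tags.

```python
# A minimal sketch: each token's "class" encodes where its head is (a signed offset, capped).
# This is an illustrative tag scheme, not the dissertation's actual tag set.

def encode_heads(heads, max_offset=5):
    """heads[i] is the index of token i's head (-1 for the root). Returns one tag per token."""
    tags = []
    for i, h in enumerate(heads):
        if h < 0:
            tags.append("ROOT")
        else:
            off = max(-max_offset, min(max_offset, h - i))
            tags.append(f"HEAD{off:+d}")
    return tags

def decode_heads(tags):
    """Invert the encoding: map each tag back to a head index (approximate for capped offsets)."""
    heads = []
    for i, t in enumerate(tags):
        heads.append(-1 if t == "ROOT" else i + int(t[4:]))
    return heads

# "Economic news had little effect": Economic->news, news->had, had->ROOT,
# little->effect, effect->had
gold = [1, 2, -1, 4, 2]
tags = encode_heads(gold)
print(tags)                       # ['HEAD+1', 'HEAD+1', 'ROOT', 'HEAD+1', 'HEAD-2']
assert decode_heads(tags) == gold
```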
505

A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy

Ma, Long 08 August 2017 (has links)
Text classification, the task of assigning metadata to documents, requires significant time and effort when performed by humans. Moreover, with online-generated content growing explosively, manual annotation of large-scale, unstructured data becomes a challenge. Many state-of-the-art text mining methods have been applied to the classification process, often based on keyword extraction. However, when these keywords are used as features in a classification task, the feature dimension is commonly huge, and selecting keywords from large document collections to serve as features is itself a challenge. When traditional machine learning algorithms are applied to such large data sets, the computational cost is high. In addition, almost 80% of real data is unstructured and unlabeled, so advanced supervised feature selection methods cannot be used directly to select entities from massive amounts of data. Statistical strategies are usually employed to discover key features in unlabeled data; here we propose a novel method to extract important features effectively before feeding them into the classification task. Another challenge in text classification is the multi-label problem, the assignment of multiple non-exclusive labels to a document, which makes text classification more complicated than single-label classification. Considering the above issues, we develop a framework for extracting features, reducing data dimensionality, and solving the multi-label problem on labeled and unlabeled data sets. To reduce data dimension, we provide (1) a hybrid feature selection method that extracts meaningful features according to the importance of each feature; (2) a Word2Vec-based document representation with a lower feature dimension for document categorization on big data sets; and (3) an unsupervised approach to extract features from real online-generated data for text classification and prediction. To solve the multi-label classification task, we design a new Multi-Instance Multi-Label (MIML) algorithm within the proposed framework.
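For readers unfamiliar with the multi-label setting the abstract describes, a minimal sketch of a generic multi-label text classifier (a standard one-vs-rest setup over TF-IDF features, assuming scikit-learn is available; not the MIML algorithm or the hybrid feature selection developed in the thesis) might look like this. The documents and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus: each document may carry several non-exclusive labels.
docs = [
    "stock prices fell after the earnings report",
    "the team won the championship game last night",
    "the startup raised funding to build sports analytics software",
    "new tax policy debated in parliament",
]
labels = [{"finance"}, {"sports"}, {"finance", "sports", "tech"}, {"politics"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                       # one binary indicator column per label

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),            # sparse lexical features
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(docs, Y)

pred = clf.predict(["parliament votes on a new sports funding bill"])
print(mlb.inverse_transform(pred))                  # predicted label set(s) for the new document
```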
506

El financiamiento de las campañas bajo la cámara secreta de las donaciones y el comportamiento parlamentario / Campaign financing under the secret chamber of donations and parliamentary behavior

Szederkenyi Vicuña, Francisco José January 2017 (has links)
Master's degree in Applied Economics / In this article I study whether the financing of Chilean members of parliament under the Secret Chamber of Donations (Cámara Secreta de las Donaciones), between 2005 and 2009, may have influenced their behavior when voting on bills in Congress. To do this, I use a Natural Language Processing (NLP) tool to classify the various bills as beneficial or harmful to businesses, based on the similarity between the texts of the bills and texts found online from organizations that are openly pro-business (trade associations, etc.) or anti-business (unions, consumer-protection organizations, etc.). Then, with the bills classified, I analyze whether there is a relationship between greater financing received and more pro-business voting, controlling for various observable characteristics of the legislators. The results indicate that, although there is a positive correlation between financing and pro-business votes, after controlling for legislators' characteristics there is in general no relationship between greater financing under the Secret Chamber of Donations and votes more favorable to businesses. However, a more specific analysis shows that those who receive greater financing in their district from a small number of donors behave in a more pro-business manner, and that those who receive greater financing vote more in favor of businesses on bills related to the Health committee.
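The abstract does not specify the exact NLP tool used; a minimal, hypothetical sketch of the general idea it describes (scoring a bill by the similarity of its text to openly pro-business versus anti-business reference texts, here with TF-IDF vectors and cosine similarity via scikit-learn) could look like the following. All texts are invented placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented placeholder texts; in the study these would be scraped from trade associations,
# unions, consumer organizations, etc., and the bills from the Chilean Congress.
pro_business = ["lower corporate taxes encourage investment and job creation",
                "reducing regulation lets firms compete and grow"]
anti_business = ["stronger consumer protection against abusive corporate practices",
                 "workers deserve collective bargaining and higher minimum wages"]
bill = "this bill reduces regulatory burdens and cuts taxes for small firms"

vec = TfidfVectorizer().fit(pro_business + anti_business + [bill])

def mean_sim(text, refs):
    """Average cosine similarity between one text and a set of reference texts."""
    return cosine_similarity(vec.transform([text]), vec.transform(refs)).mean()

score = mean_sim(bill, pro_business) - mean_sim(bill, anti_business)
print("pro-business" if score > 0 else "anti-business", round(score, 3))
```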
507

Lexical selection for machine translation

Sabtan, Yasser Muhammad Naguib mahmoud January 2011 (has links)
Current research in Natural Language Processing (NLP) tends to exploit corpus resources as a way of overcoming the problem of knowledge acquisition. Statistical analysis of corpora can reveal trends and probabilities of occurrence, which have proved to be helpful in various ways. Machine Translation (MT) is no exception to this trend. Many MT researchers have attempted to extract knowledge from parallel bilingual corpora. The MT problem is generally decomposed into two sub-problems: lexical selection and reordering of the selected words. This research addresses the problem of lexical selection of open-class lexical items in the framework of MT. The work reported in this thesis investigates different methodologies to handle this problem, using a corpus-based approach. The current framework can be applied to any language pair, but we focus on Arabic and English. This is because Arabic words are hugely ambiguous and thus pose a challenge for the current task of lexical selection. We use a challenging Arabic-English parallel corpus, containing many long passages with no punctuation marks to denote sentence boundaries. This points to the robustness of the adopted approach. In our attempt to extract lexical equivalents from the parallel corpus we focus on the co-occurrence relations between words. The current framework adopts a lexicon-free approach towards the selection of lexical equivalents. This has the double advantage of investigating the effectiveness of different techniques without being distracted by the properties of the lexicon and at the same time saving much time and effort, since constructing a lexicon is time-consuming and labour-intensive. Thus, we use as little, if any, hand-coded information as possible. The accuracy score could be improved by adding hand-coded information. The point of the work reported here is to see how well one can do without any such manual intervention. With this goal in mind, we carry out a number of preprocessing steps in our framework. First, we build a lexicon-free Part-of-Speech (POS) tagger for Arabic. This POS tagger uses a combination of rule-based, transformation-based learning (TBL) and probabilistic techniques. Similarly, we use a lexicon-free POS tagger for English. We use the two POS taggers to tag the bi-texts. Second, we develop lexicon-free shallow parsers for Arabic and English. The two parsers are then used to label the parallel corpus with dependency relations (DRs) for some critical constructions. Third, we develop stemmers for Arabic and English, adopting the same knowledge-free approach. These preprocessing steps pave the way for the main system (or proposer) whose task is to extract translational equivalents from the parallel corpus. The framework starts with automatically extracting a bilingual lexicon using unsupervised statistical techniques which exploit the notion of co-occurrence patterns in the parallel corpus. We then choose the target word that has the highest frequency of occurrence from among a number of translational candidates in the extracted lexicon in order to aid the selection of the contextually correct translational equivalent. These experiments are carried out on either raw or POS-tagged texts. Having labelled the bi-texts with DRs, we use them to extract a number of translation seeds to start a number of bootstrapping techniques to improve the proposer. These seeds are used as anchor points to resegment the parallel corpus and start the selection process once again.
The final F-score for the selection process is 0.701. We have also written an algorithm for detecting ambiguous words in a translation lexicon and obtained a precision score of 0.89.
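As a rough illustration of the co-occurrence idea at the core of this lexicon-free approach (here a Dice-coefficient score over sentence-aligned pairs, which is only one of several association measures and not necessarily the one used in the thesis), a minimal sketch with an invented toy corpus:

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned "parallel corpus" (invented, English-French for readability;
# the thesis works with Arabic-English).
pairs = [
    ("the bank approved the loan", "la banque a approuvé le prêt"),
    ("the river bank was flooded", "la rive du fleuve était inondée"),
    ("the bank raised interest rates", "la banque a relevé les taux"),
]

src_freq, trg_freq, joint = Counter(), Counter(), Counter()
for src, trg in pairs:
    s, t = set(src.split()), set(trg.split())
    src_freq.update(s)
    trg_freq.update(t)
    joint.update(product(s, t))   # co-occurrence of source/target words in aligned sentences

def dice(s, t):
    """Dice coefficient between a source word and a target word over aligned sentences."""
    return 2 * joint[(s, t)] / (src_freq[s] + trg_freq[t])

# Candidate target equivalents for a source word, ranked by co-occurrence score.
candidates = sorted((t for t in trg_freq if len(t) > 2),   # crude stand-in for a stop list
                    key=lambda t: dice("bank", t), reverse=True)[:3]
print(candidates)   # 'banque' ranks first with these toy counts
```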
508

Lexical simplification : optimising the pipeline

Shardlow, Matthew January 2015 (has links)
Introduction: This thesis was submitted by Matthew Shardlow to the University of Manchester for the degree of Doctor of Philosophy (PhD) in the year 2015. Lexical simplification is the practice of automatically increasing the readability and understandability of a text by identifying problematic vocabulary and substituting easy-to-understand synonyms. This work describes the research undertaken during the course of a 4-year PhD. We have focused on the pipeline of operations which string together to produce lexical simplifications. We have identified key areas for research and allowed our results to influence the direction of our research. We have suggested new methods and ideas where appropriate. Objectives: We seek to further the field of lexical simplification as an assistive technology. Although the concept of fully-automated error-free lexical simplification is some way off, we seek to bring this dream closer to reality. Technology is ubiquitous in our information-based society. Increasingly, we consume news, correspondence and literature through an electronic device. E-reading gives us the opportunity to intervene when a text is too difficult. Simplification can act as an augmentative communication tool for those who find a text is above their reading level. Texts which would otherwise go unread would become accessible via simplification. Contributions: This PhD has focused on the lexical simplification pipeline. We have identified common sources of errors as well as the detrimental effects of these errors. We have looked at techniques to mitigate the errors at each stage of the pipeline. We have created the CW Corpus, a resource for evaluating the task of identifying complex words. We have also compared machine learning strategies for identifying complex words. We propose a new preprocessing step which yields a significant increase in identification performance. We have also tackled the related fields of word sense disambiguation and substitution generation. We evaluate the current state of the field and make recommendations for best practice in lexical simplification. Finally, we focus our attention on evaluating the effect of lexical simplification on the reading ability of people with aphasia. We find that in our small-scale preliminary study, lexical simplification has a negative effect, causing reading time to increase. We evaluate this result and use it to motivate further work into lexical simplification for people with aphasia.
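A minimal, hypothetical sketch of the pipeline shape discussed above, with complex word identification by frequency thresholding, substitution generation from WordNet synonyms, and frequency-based ranking, assuming NLTK with its WordNet data is installed; the frequency list is a toy stand-in, and this is not the thesis's recommended configuration.

```python
from nltk.corpus import wordnet as wn   # assumes nltk with the 'wordnet' corpus downloaded

# Toy frequency list standing in for counts from a large corpus (e.g. Wikipedia).
FREQ = {"we": 900_000, "will": 800_000, "to": 2_000_000, "an": 1_500_000,
        "car": 500_000, "start": 400_000, "begin": 200_000, "buy": 350_000, "cheap": 150_000}

def is_complex(word, threshold=500):
    """Complex word identification by simple frequency thresholding."""
    return FREQ.get(word, 0) < threshold

def substitutions(word):
    """Substitution generation from WordNet synonyms (no word sense disambiguation here)."""
    cands = {l.name().replace("_", " ") for s in wn.synsets(word) for l in s.lemmas()}
    cands.discard(word)
    return cands

def simplify(word):
    """Substitution ranking: pick the most frequent known synonym, else keep the word."""
    known = [c for c in substitutions(word) if c in FREQ]
    return max(known, key=FREQ.get, default=word)

sentence = "we will commence to purchase an inexpensive car".split()
print(" ".join(simplify(w) if is_complex(w) else w for w in sentence))
# e.g. "we will start to buy an cheap car" -- the naive substitution ignores context,
# one of the pipeline error sources this kind of work analyses.
```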
509

Language Consistency and Exchange: Market Reactions to Change in the Distribution of Field-level Information

Watts, Jameson K.M., Watts, Jameson K.M. January 2015 (has links)
Markets are fluid. Over time, the dominant designs, processes and paradigms that define an industry invariably succumb to productive innovation or changes in fashion (Arthur, 2009; Schumpeter, 1942; Simmel, 1957). Take for example the recent upheaval of the cell phone market following Apple's release of the iPhone. When it was introduced in 2007, one could clearly differentiate Apple's product from all others; however, subsequent imitation of the iPhone produced a market in which nearly all cell phones look (and perform) alike. The iPhone was a harbinger of the new dominant design. These cycles of innovation and fashion are not limited to consumer markets. Business markets (often defined by longer-term inter-firm relationships) are subject to similar transformations. For example, current practices in the biotechnology industry are quite distinct from those accompanying its emergence from university labs in the second half of the 20th century (Powell et al., 2005). Technologies that were once viewed as radical have undergone a process of legitimation and integration into mainstream healthcare delivery systems. Practices that were dominant in the 1980s gave way to newer business models in the 1990s, and feedback from downstream providers changed the way drugs were delivered to patients (Wolff, 2001). During periods of transition, market actors face great difficulty anticipating reactions to their behavior (practices, products, etc.). How they deal with this uncertainty is an interminable source of academic inquiry in the social sciences (see e.g. Alderson, 1965; Simon, 1957; Thompson, 1967) and, in a broad sense, it is the primary concern of the current work as well. However, I am focused specifically on the turmoil caused by transitions in technology, taste and attention over time: the disagreements which occur as market actors collectively shift their practices from one paradigm to the next (Powell and Colyvas, 2008). If innovations are assumed to arise locally and diffuse gradually (see e.g. Bass, 1969; Rogers, 2002), then transient differences in knowledge are a natural outcome. Those closest to, or most interested in, an innovation will have greater knowledge than those furthest away or less involved. Thus, for a period following some shift in technology, taste or attention, market participants will vary in their knowledge and interpretation of the change. In the following chapters, I investigate the ramifications of this sort of knowledge heterogeneity on the exchange behavior and subsequent performance of market participants. It is the central argument of this thesis that this heterogeneity affects exchange by both limiting coordination and increasing quality uncertainty. The details of this argument are fleshed out in Chapters 1, 2 and 3 (summarized below), which build upon each other in a progression from abstract, to descriptive, to specific tests of theory. However, each can also stand by itself as an independent examination of the knowledge-exchange relationship. The final chapter synthesizes my findings and highlights some implications for practitioners and further research. In Chapter 1, I review the history and development of Alderson's (1965) 'law of exchange' in the marketing literature and propose an extension based on insights from information theory. A concept called market entropy is introduced to describe the distribution of knowledge in a field and propositions are offered to explain the exchange behavior expected when this distribution changes.
Chapter 2 investigates knowledge heterogeneity through its relation with written language. Drawing on social-constructionist theories of classification (Goldberg, 2012) and insights from research on the legitimation process (Powell and Colyvas, 2008), I argue for a measure of field-level consensus based on changes in the frequency distribution of descriptive words over time. This measure is operationalized using eleven years of trade journal articles from the biotech industry and is shown to support the propositions offered in Chapter 1. Chapter 3 builds on the arguments and evidence developed in Chapters 1 and 2 to test theory on the structural advantages of a firm's position in a network of strategic alliances. Prior work has documented returns to network centrality based on the premise that central firms have greater and more timely access to information about industry developments (Powell et al., 1996, 1999). However, other research claims that benefits to centrality accrue based on the signal that such a position provides about an actor's underlying quality (Malter, 2014; Podolny, 1993, 2005). I investigate this tension in the literature and offer new insights based on interactions between network position and the measure developed in Chapter 2.
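The chapter summaries stop short of a formula; as one rough way to make the "market entropy" and field-level-consensus intuition concrete, the sketch below computes the Shannon entropy of each year's word-frequency distribution and the divergence between consecutive years, on invented toy text. Both the measure and the data are illustrative assumptions, not the dissertation's operationalization.

```python
import math
from collections import Counter

# Invented toy "trade journal" text per year; the study uses eleven years of biotech articles.
docs_by_year = {
    2003: "gene therapy vectors gene therapy trials",
    2004: "gene therapy vectors monoclonal antibodies trials",
    2005: "monoclonal antibodies biosimilars antibodies pipeline",
}

def word_distribution(text):
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def entropy(dist):
    """Shannon entropy (bits) of a word-frequency distribution; higher = more dispersed language."""
    return -sum(p * math.log2(p) for p in dist.values())

def jensen_shannon(p, q):
    """Divergence between two years' distributions; spikes suggest a shift in field-level language."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0) + q.get(w, 0)) for w in vocab}
    def kl(a, b):
        return sum(pa * math.log2(pa / b[w]) for w, pa in a.items() if pa > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

years = sorted(docs_by_year)
dists = {y: word_distribution(docs_by_year[y]) for y in years}
for y in years:
    print(y, "entropy:", round(entropy(dists[y]), 3))
for a, b in zip(years, years[1:]):
    print(f"{a}->{b} JSD:", round(jensen_shannon(dists[a], dists[b]), 3))
```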
510

Refinements in hierarchical phrase-based translation systems

Pino, Juan Miguel January 2015 (has links)
The relatively recently proposed hierarchical phrase-based translation model for statistical machine translation (SMT) has achieved state-of-the-art performance in numerous recent translation evaluations. Hierarchical phrase-based systems comprise a pipeline of modules with complex interactions. In this thesis, we propose refinements to the hierarchical phrase-based model as well as improvements and analyses in various modules for hierarchical phrase-based systems. We took the opportunity of increasing amounts of available training data for machine translation as well as existing frameworks for distributed computing in order to build better infrastructure for extraction, estimation and retrieval of hierarchical phrase-based grammars. We design and implement grammar extraction as a series of Hadoop MapReduce jobs. We store the resulting grammar using the HFile format, which offers competitive trade-offs in terms of efficiency and simplicity. We demonstrate improvements over two alternative solutions used in machine translation. The modular nature of the SMT pipeline, while allowing individual improvements, has the disadvantage that errors committed by one module are propagated to the next. This thesis alleviates this issue between the word alignment module and the grammar extraction and estimation module by considering richer statistics from word alignment models in extraction. We use alignment link and alignment phrase pair posterior probabilities for grammar extraction and estimation and demonstrate translation improvements in Chinese to English translation. This thesis also proposes refinements in grammar and language modelling both in the context of domain adaptation and in the context of the interaction between first-pass decoding and lattice rescoring. We analyse alternative strategies for grammar and language model cross-domain adaptation. We also study interactions between first-pass and second-pass language model in terms of size and n-gram order. Finally, we analyse two smoothing methods for large 5-gram language model rescoring. The last two chapters are devoted to the application of phrase-based grammars to the string regeneration task, which we consider as a means to study the fluency of machine translation output. We design and implement a monolingual phrase-based decoder for string regeneration and achieve state-of-the-art performance on this task. By applying our decoder to the output of a hierarchical phrase-based translation system, we are able to recover the same level of translation quality as the translation system.
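As a tiny, framework-free illustration of the MapReduce shape of grammar extraction and relative-frequency estimation described above (plain Python standing in for Hadoop, with invented toy rules; not the thesis's actual implementation or its HFile storage layer):

```python
from collections import defaultdict

# Toy extracted rules, one (source_phrase, target_phrase) per input record (invented examples).
extracted = [
    ("maison", "house"), ("maison", "house"), ("maison", "home"),
    ("maison bleue", "blue house"), ("bleue", "blue"),
]

def mapper(record):
    src, trg = record
    yield (src, trg), 1                       # emit <rule, 1>, as a Hadoop mapper would

def reducer(key, values):
    yield key, sum(values)                    # sum counts per rule

# A local stand-in for the shuffle/sort phase between map and reduce.
shuffled = defaultdict(list)
for record in extracted:
    for key, value in mapper(record):
        shuffled[key].append(value)

counts = dict(kv for key, values in shuffled.items() for kv in reducer(key, values))

# Relative-frequency translation probabilities p(target | source), as a second aggregation pass.
totals = defaultdict(int)
for (src, trg), c in counts.items():
    totals[src] += c
probs = {(src, trg): c / totals[src] for (src, trg), c in counts.items()}
print(probs[("maison", "house")])   # 2/3 with the toy counts above
```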
