511 |
Semantic analysis for extracting fine-grained opinion aspects
Zhan, Tianjie 01 January 2010 (has links)
No description available.
|
512 |
[en] A TOKEN CLASSIFICATION APPROACH TO DEPENDENCY PARSING / [pt] UMA ABORDAGEM POR CLASSIFICAÇÃO TOKEN-A-TOKEN PARA O PARSING DE DEPENDÊNCIA
CARLOS EDUARDO MEGER CRESTANA 13 October 2010 (has links)
[pt] One of the most important tasks in Natural Language Processing is syntactic parsing, in which the structure of a sentence is determined according to a given grammar, yielding the meaning of a sentence from the meanings of the words it contains. Dependency-grammar-based parsing consists of identifying, for each word, the other word in the sentence that governs it. The output of a dependency parser is therefore a tree whose nodes are the words of the sentence. This simple yet rich structure is used in a wide variety of applications, among them Question Answering, Machine Translation, Information Extraction, and Semantic Role Labeling. State-of-the-art dependency parsing systems use transition-based or graph-based models. This dissertation presents a token-by-token classification approach to dependency parsing by creating a special set of classes that allows the correct identification of each word's governing word in the sentence. Using this set of classes, any classification algorithm can be trained to correctly identify the governing word of each word in the sentence. Moreover, this set of classes handles projective and non-projective dependency relations equally, avoiding pseudo-projective approaches. To evaluate its effectiveness, we apply the Entropy Guided Transformation Learning algorithm to the corpora made publicly available for the CoNLL 2006 shared task. These experiments are carried out on three corpora in different languages: Danish, Dutch and Portuguese. The Unlabeled Attachment Score metric is used to evaluate performance. Our results show that the generated models score above the average of the CoNLL systems. Furthermore, our results indicate that the token-by-token classification approach is a promising one for the dependency parsing problem.
/ [en] One of the most important tasks in Natural Language Processing is syntactic parsing, where the structure of a sentence is inferred according to a given grammar. Syntactic parsing, thus, tells us how to determine the meaning of the sentence from the meaning of the words in it. Syntactic parsing based on dependency grammars is called dependency parsing. The dependency-based syntactic parsing task consists in identifying a head word for each word in an input sentence. Hence, its output is a rooted tree, where the nodes are the words in the sentence. This simple, yet powerful, structure is used in a great variety of applications, like Question Answering, Machine Translation, Information Extraction and Semantic Role Labeling. State-of-the-art dependency parsing systems use transition-based or graph-based models. This dissertation presents a token classification approach to dependency parsing, by creating a special tagging set that helps to correctly find the head of a token. Using this tagging style, any classification algorithm can be trained to identify the syntactic head of each word in a sentence. In addition, this classification model treats projective and non-projective dependency graphs equally, avoiding pseudo-projective approaches. To evaluate its effectiveness, we apply the Entropy Guided Transformation Learning algorithm to the publicly available corpora from the CoNLL 2006 Shared Task. These computational experiments are performed on three corpora in different languages, namely: Danish, Dutch and Portuguese. We use the Unlabelled Attachment Score as the accuracy metric. Our results show that the generated models are above the average CoNLL system performance. Additionally, these findings also indicate that the token classification approach is a promising one.
|
513 |
A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy
Ma, Long 08 August 2017 (has links)
Text classification, the task of assigning metadata to documents, requires significant time and effort when performed by humans. Moreover, with online-generated content growing explosively, manually annotating large-scale, unstructured data becomes a challenge. Many state-of-the-art text mining methods have been applied to the classification process, and many of them are based on keyword extraction. However, when these keywords are used as features in a classification task, the feature dimension is commonly huge, and selecting keywords from large document collections as classification features is itself a challenge; applying traditional machine learning algorithms to large data sets also incurs a high computational cost. In addition, almost 80% of real data is unstructured and unlabeled, so advanced supervised feature selection methods cannot be used directly to select entities from massive amounts of data. Statistical strategies are usually employed to discover key features when extracting features from unlabeled data for classification tasks; here, we propose a novel method to extract important features effectively before feeding them into the classification task. Another challenge in text classification is the multi-label problem, the assignment of multiple non-exclusive labels to a document, which makes text classification more complicated than single-label classification. Considering the above issues, we develop a framework for extracting features, reducing data dimensionality, and solving the multi-label problem on labeled and unlabeled data sets. To reduce data dimensionality, we provide (1) a hybrid feature selection method that extracts meaningful features according to the importance of each feature; (2) a Word2Vec-based representation of each document with a lower feature dimension for document categorization on big data sets; and (3) an unsupervised approach to extract features from real online-generated data for text classification and prediction. To solve the multi-label classification task, we design a new Multi-Instance Multi-Label (MIML) algorithm within the proposed framework.
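To make the Word2Vec document representation and the multi-label setup concrete, the sketch below averages word vectors per document and trains a one-vs-rest classifier. The corpus, labels and hyperparameters are placeholders; this is not the MIML algorithm or the hybrid feature selection method proposed in the thesis, only the general pattern they build on.

# Minimal sketch: averaged Word2Vec document vectors fed to a
# one-vs-rest multi-label classifier. Corpus, labels and
# hyperparameters are illustrative placeholders.
import numpy as np
from gensim.models import Word2Vec
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

docs = [["cheap", "flights", "to", "rome"],
        ["new", "phone", "battery", "review"],
        ["rome", "hotel", "battery", "charger"]]
labels = [{"travel"}, {"tech"}, {"travel", "tech"}]

# Train a small Word2Vec model and represent each document as the
# mean of its word vectors (a simple dimensionality reduction).
w2v = Word2Vec(docs, vector_size=50, min_count=1, epochs=50, seed=1)
X = np.array([np.mean([w2v.wv[w] for w in d], axis=0) for d in docs])

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One binary classifier per label handles the multi-label setting.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X)
print(mlb.inverse_transform(pred))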
|
514 |
Campaign financing under the secret donations chamber and parliamentary behavior
Szederkenyi Vicuña, Francisco José January 2017 (has links)
Magíster en Economía Aplicada (Master's in Applied Economics) / In this article I study whether the financing of Chilean parliamentarians under the Cámara Secreta de las Donaciones (the secret donations chamber), between 2005 and 2009, may have influenced their behavior when voting on bills in Congress. To do this I use a Natural Language Processing (NLP) tool with which I classify the different bills as beneficial or harmful to businesses, based on the similarity between the texts of the bills and texts found on the internet from openly pro-business organizations (trade associations, etc.) or anti-business organizations (unions, consumer advocacy organizations, etc.). Then, with the bills classified, I analyze whether there is a relationship between receiving more financing and more pro-business voting, controlling for various observable characteristics of the parliamentarians. The results indicate that, although there is a positive correlation between financing and pro-business votes, after controlling for different characteristics of the parliamentarian there is in general no relationship between greater financing under the Cámara Secreta de las Donaciones and more business-friendly voting. However, a more specific analysis shows that those who receive greater financing in their district from a small number of donors behave in a more pro-business way, and that those who receive greater financing vote more in favor of businesses on bills related to the Health committee.
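The sketch below illustrates the general idea of the classification step using TF-IDF cosine similarity against pro-business and anti-business reference texts. The reference texts and the bill are invented placeholders, and the actual NLP tool used in the thesis may differ from this scheme.

# Minimal sketch: label a bill as pro- or anti-business by comparing
# its TF-IDF vector with reference texts from each side. The texts
# below are illustrative placeholders, not the corpora used in the thesis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pro_business_refs = ["lower corporate taxes encourage investment and growth"]
anti_business_refs = ["stronger consumer protection and union rights are needed"]
bill_text = "this bill reduces taxes on corporate investment"

corpus = pro_business_refs + anti_business_refs + [bill_text]
tfidf = TfidfVectorizer().fit_transform(corpus)

n_pro, n_anti = len(pro_business_refs), len(anti_business_refs)
bill_vec = tfidf[n_pro + n_anti]

# Average similarity of the bill to each reference group.
pro_sim = cosine_similarity(bill_vec, tfidf[:n_pro]).mean()
anti_sim = cosine_similarity(bill_vec, tfidf[n_pro:n_pro + n_anti]).mean()

label = "pro-business" if pro_sim > anti_sim else "anti-business"
print(label, round(pro_sim, 3), round(anti_sim, 3))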
|
515 |
Lexical selection for machine translation
Sabtan, Yasser Muhammad Naguib mahmoud January 2011 (has links)
Current research in Natural Language Processing (NLP) tends to exploit corpus resources as a way of overcoming the problem of knowledge acquisition. Statistical analysis of corpora can reveal trends and probabilities of occurrence, which have proved to be helpful in various ways. Machine Translation (MT) is no exception to this trend. Many MT researchers have attempted to extract knowledge from parallel bilingual corpora. The MT problem is generally decomposed into two sub-problems: lexical selection and reordering of the selected words. This research addresses the problem of lexical selection of open-class lexical items in the framework of MT. The work reported in this thesis investigates different methodologies to handle this problem, using a corpus-based approach. The current framework can be applied to any language pair, but we focus on Arabic and English. This is because Arabic words are hugely ambiguous and thus pose a challenge for the current task of lexical selection. We use a challenging Arabic-English parallel corpus, containing many long passages with no punctuation marks to denote sentence boundaries. This points to the robustness of the adopted approach. In our attempt to extract lexical equivalents from the parallel corpus we focus on the co-occurrence relations between words. The current framework adopts a lexicon-free approach towards the selection of lexical equivalents. This has the double advantage of investigating the effectiveness of different techniques without being distracted by the properties of the lexicon and at the same time saving much time and effort, since constructing a lexicon is time-consuming and labour-intensive. Thus, we use as little, if any, hand-coded information as possible. The accuracy score could be improved by adding hand-coded information. The point of the work reported here is to see how well one can do without any such manual intervention. With this goal in mind, we carry out a number of preprocessing steps in our framework. First, we build a lexicon-free Part-of-Speech (POS) tagger for Arabic. This POS tagger uses a combination of rule-based, transformation-based learning (TBL) and probabilistic techniques. Similarly, we use a lexicon-free POS tagger for English. We use the two POS taggers to tag the bi-texts. Second, we develop lexicon-free shallow parsers for Arabic and English. The two parsers are then used to label the parallel corpus with dependency relations (DRs) for some critical constructions. Third, we develop stemmers for Arabic and English, adopting the same knowledge-free approach. These preprocessing steps pave the way for the main system (or proposer) whose task is to extract translational equivalents from the parallel corpus. The framework starts with automatically extracting a bilingual lexicon using unsupervised statistical techniques which exploit the notion of co-occurrence patterns in the parallel corpus. We then choose the target word that has the highest frequency of occurrence from among a number of translational candidates in the extracted lexicon in order to aid the selection of the contextually correct translational equivalent. These experiments are carried out on either raw or POS-tagged texts. Having labelled the bi-texts with DRs, we use them to extract a number of translation seeds to start a number of bootstrapping techniques to improve the proposer. These seeds are used as anchor points to resegment the parallel corpus and start the selection process once again.
The final F-score for the selection process is 0.701. We have also written an algorithm for detecting ambiguous words in a translation lexicon and obtained a precision score of 0.89.
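A minimal sketch of the co-occurrence-based selection idea follows, assuming a toy sentence-aligned corpus (transliterated placeholder tokens, not the Arabic-English corpus used in the thesis) and using the Dice coefficient as the association score; the full pipeline of POS tagging, shallow parsing and bootstrapping is not reproduced here.

# Minimal lexicon-free sketch: score candidate translations by how
# often source and target words co-occur in aligned sentence pairs
# (Dice coefficient), then keep the best-scoring target word.
from collections import Counter
from itertools import product

aligned_pairs = [
    ("kitab jadid", "a new book"),
    ("kitab qadim", "an old book"),
    ("bayt jadid", "the new house"),
]

src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
for src, tgt in aligned_pairs:
    src_words, tgt_words = set(src.split()), set(tgt.split())
    src_count.update(src_words)
    tgt_count.update(tgt_words)
    pair_count.update(product(src_words, tgt_words))

def best_translation(src_word):
    """Return the target word with the highest Dice association score."""
    scores = {
        t: 2 * pair_count[(src_word, t)] / (src_count[src_word] + tgt_count[t])
        for t in tgt_count
    }
    return max(scores, key=scores.get)

print(best_translation("kitab"))   # -> "book"
print(best_translation("jadid"))   # -> "new"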
|
516 |
Lexical simplification: optimising the pipeline
Shardlow, Matthew January 2015 (has links)
Introduction: This thesis was submitted by Matthew Shardlow to the University of Manchester for the degree of Doctor of Philosophy (PhD) in 2015. Lexical simplification is the practice of automatically increasing the readability and understandability of a text by identifying problematic vocabulary and substituting easy-to-understand synonyms. This work describes the research undertaken during the course of a 4-year PhD. We have focused on the pipeline of operations which string together to produce lexical simplifications. We have identified key areas for research and allowed our results to influence the direction of the work, suggesting new methods and ideas where appropriate. Objectives: We seek to further the field of lexical simplification as an assistive technology. Although the concept of fully automated, error-free lexical simplification is some way off, we seek to bring this dream closer to reality. Technology is ubiquitous in our information-based society, and we increasingly consume news, correspondence and literature through an electronic device. E-reading gives us the opportunity to intervene when a text is too difficult. Simplification can act as an augmentative communication tool for those who find a text above their reading level; texts which would otherwise go unread become accessible via simplification. Contributions: This PhD has focused on the lexical simplification pipeline. We have identified common sources of errors as well as their detrimental effects, and we have looked at techniques to mitigate the errors at each stage of the pipeline. We have created the CW Corpus, a resource for evaluating the task of identifying complex words, and we have compared machine learning strategies for identifying complex words. We propose a new preprocessing step which yields a significant increase in identification performance. We have also tackled the related fields of word sense disambiguation and substitution generation. We evaluate the current state of the field and make recommendations for best practice in lexical simplification. Finally, we focus our attention on evaluating the effect of lexical simplification on the reading ability of people with aphasia. We find that in our small-scale preliminary study, lexical simplification has a negative effect, causing reading time to increase. We evaluate this result and use it to motivate further work into lexical simplification for people with aphasia.
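As a concrete illustration of one pipeline stage, the sketch below frames complex word identification as binary classification over two simple features, word length and an assumed per-million frequency. The training words, labels and feature choices are placeholders, not the CW Corpus or the preprocessing step proposed in the thesis.

# Minimal sketch of complex word identification as binary
# classification. Training data and frequencies are illustrative.
from sklearn.linear_model import LogisticRegression

# (word, frequency per million, label) -- 1 means "complex"
train = [
    ("house", 500.0, 0), ("dog", 400.0, 0), ("big", 450.0, 0),
    ("perambulate", 0.2, 1), ("lugubrious", 0.1, 1), ("edifice", 1.5, 1),
]

def features(word, freq):
    # Longer and rarer words tend to be harder to understand.
    return [len(word), freq]

X = [features(w, f) for w, f, _ in train]
y = [label for _, _, label in train]
clf = LogisticRegression().fit(X, y)

for word, freq in [("cat", 600.0), ("sesquipedalian", 0.05)]:
    is_complex = clf.predict([features(word, freq)])[0]
    print(word, "complex" if is_complex else "simple")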
|
517 |
Language Consistency and Exchange: Market Reactions to Change in the Distribution of Field-level Information
Watts, Jameson K.M. January 2015 (has links)
Markets are fluid. Over time, the dominant designs, processes and paradigms that define an industry invariably succumb to productive innovation or changes in fashion (Arthur, 2009; Schumpeter, 1942; Simmel, 1957). Take for example the recent upheaval of the cell phone market following Apple's release of the iPhone. When it was introduced in 2007, one could clearly differentiate Apple's product from all others; however, subsequent imitation of the iPhone produced a market in which nearly all cell phones look (and perform) alike. The iPhone was a harbinger of the new dominant design. These cycles of innovation and fashion are not limited to consumer markets. Business markets (often defined by longer-term inter-firm relationships) are subject to similar transformations. For example, current practices in the biotechnology industry are quite distinct from those accompanying its emergence from university labs in the second half of the 20th century (Powell et al., 2005). Technologies that were once viewed as radical have undergone a process of legitimation and integration into mainstream healthcare delivery systems. Practices that were dominant in the 1980's gave way to newer business models in the 1990's, and feedback from down-stream providers changed the way drugs were delivered to patients (Wolff, 2001). During periods of transition, market actors face great difficulty anticipating reactions to their behavior (practices, products, etc.). How they deal with this uncertainty is a perennial source of academic inquiry in the social sciences (see e.g. Alderson, 1965; Simon, 1957; Thompson, 1967) and, in a broad sense, it is the primary concern of the current work as well. However, I am focused specifically on the turmoil caused by transitions in technology, taste and attention over time--the disagreements which occur as market actors collectively shift their practices from one paradigm to the next (Powell and Colyvas, 2008). If innovations are assumed to arise locally and diffuse gradually (see e.g. Bass, 1969; Rogers, 2002), then transient differences in knowledge are a natural outcome. Those closest to, or most interested in, an innovation will have greater knowledge than those furthest away or less involved. Thus, for a period following some shift in technology, taste or attention, market participants will vary in their knowledge and interpretation of the change. In the following chapters, I investigate the ramifications of this sort of knowledge heterogeneity for the exchange behavior and subsequent performance of market participants. It is the central argument of this thesis that this heterogeneity affects exchange by both limiting coordination and increasing quality uncertainty. The details of this argument are fleshed out in Chapters 1, 2 and 3 (summarized below), which build upon each other in a progression from abstract, to descriptive, to specific tests of theory. However, each chapter can also stand by itself as an independent examination of the knowledge-exchange relationship. The final chapter synthesizes my findings and highlights some implications for practitioners and further research. In Chapter 1, I review the history and development of Alderson's (1965) 'law of exchange' in the marketing literature and propose an extension based on insights from information theory. A concept called market entropy is introduced to describe the distribution of knowledge in a field, and propositions are offered to explain the exchange behavior expected when this distribution changes.
Chapter 2 investigates knowledge heterogeneity through its relation with written language. Drawing on social-constructionist theories of classification (Goldberg, 2012) and insights from research on the legitimation process (Powell and Colyvas, 2008), I argue for a measure of field-level consensus based on changes in the frequency distribution of descriptive words over time. This measure is operationalized using eleven years of trade journal articles from the biotech industry and is shown to support the propositions offered in Chapter 1. Chapter 3 builds on the arguments and evidence developed in Chapters 1 and 2 to test theory on the structural advantages of a firm's position in a network of strategic alliances. Prior work has documented returns to network centrality based on the premise that central firms have greater and more timely access to information about industry developments (Powell et al., 1996, 1999). However, other research claims that benefits to centrality accrue based on the signal that such a position provides about an actor's underlying quality (Malter, 2014; Podolny, 1993, 2005). I investigate this tension in the literature and offer new insights based on interactions between network position and the measure developed in Chapter 2.
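A minimal sketch of the kind of measure discussed here follows, assuming Shannon entropy over the frequency distribution of descriptive words in each period as a proxy for field-level (dis)agreement; the two "periods" of text are invented placeholders, and the thesis's operationalization over eleven years of trade-journal articles may differ.

# Minimal sketch: Shannon entropy of the word-frequency distribution
# in each period; higher entropy suggests less consensus in how the
# field describes itself. The texts are illustrative placeholders.
import math
from collections import Counter

def word_entropy(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

period_1 = "gene therapy gene therapy vector vector vector trial"
period_2 = "gene vector therapy platform genome assay trial pipeline"

print("period 1 entropy:", round(word_entropy(period_1), 3))
print("period 2 entropy:", round(word_entropy(period_2), 3))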
|
518 |
Refinements in hierarchical phrase-based translation systems
Pino, Juan Miguel January 2015 (has links)
The relatively recently proposed hierarchical phrase-based translation model for statistical machine translation (SMT) has achieved state-of-the-art performance in numerous recent translation evaluations. Hierarchical phrase-based systems comprise a pipeline of modules with complex interactions. In this thesis, we propose refinements to the hierarchical phrase-based model as well as improvements and analyses in various modules of hierarchical phrase-based systems. We take advantage of the increasing amounts of training data available for machine translation, as well as existing frameworks for distributed computing, to build better infrastructure for extraction, estimation and retrieval of hierarchical phrase-based grammars. We design and implement grammar extraction as a series of Hadoop MapReduce jobs. We store the resulting grammar using the HFile format, which offers competitive trade-offs in terms of efficiency and simplicity, and we demonstrate improvements over two alternative solutions used in machine translation. The modular nature of the SMT pipeline, while allowing individual improvements, has the disadvantage that errors committed by one module are propagated to the next. This thesis alleviates this issue between the word alignment module and the grammar extraction and estimation module by considering richer statistics from word alignment models in extraction. We use alignment link and alignment phrase pair posterior probabilities for grammar extraction and estimation and demonstrate translation improvements in Chinese-to-English translation. This thesis also proposes refinements in grammar and language modelling, both in the context of domain adaptation and in the context of the interaction between first-pass decoding and lattice rescoring. We analyse alternative strategies for cross-domain adaptation of grammars and language models. We also study interactions between the first-pass and second-pass language models in terms of size and n-gram order. Finally, we analyse two smoothing methods for rescoring with large 5-gram language models. The last two chapters are devoted to the application of phrase-based grammars to the string regeneration task, which we consider a means to study the fluency of machine translation output. We design and implement a monolingual phrase-based decoder for string regeneration and achieve state-of-the-art performance on this task. By applying our decoder to the output of a hierarchical phrase-based translation system, we are able to recover the same level of translation quality as the translation system.
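To give a flavour of the distributed counting behind grammar estimation, the sketch below mimics a map step and a reduce step in plain Python over a toy aligned corpus and derives relative-frequency translation probabilities. It is only an analogy: the system described above runs real Hadoop MapReduce jobs over hierarchical rules and stores the result in HFiles, none of which is reproduced here.

# Toy, in-memory map/reduce sketch of phrase-pair counting and
# relative-frequency estimation. Sentence pairs are placeholders.
from collections import defaultdict

aligned = [
    (("la", "maison"), ("the", "house")),
    (("la", "voiture"), ("the", "car")),
]

# "Map": emit one count per co-occurring unigram phrase pair.
emitted = []
for src, tgt in aligned:
    for s in src:
        for t in tgt:
            emitted.append(((s, t), 1))

# "Reduce": sum counts per key.
pair_counts = defaultdict(int)
src_counts = defaultdict(int)
for (s, t), c in emitted:
    pair_counts[(s, t)] += c
    src_counts[s] += c

# Relative-frequency estimate p(t | s).
for (s, t), c in sorted(pair_counts.items()):
    print(f"p({t} | {s}) = {c / src_counts[s]:.2f}")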
|
519 |
Commonsense Knowledge for 3D Modeling: A Machine Learning Approach
Hassani, Kaveh January 2017 (has links)
Common-sense knowledge is a collection of non-expert and agreed-upon facts and information about the world shared among most people past early childhood based on their experiences. It includes uses of objects, their properties, parts and materials, their locations, and spatial arrangements among them; location and duration of events; arguments, preconditions and effects of actions; urges and emotions of people, etc. In creating 3D worlds, and especially text-to-scene and text-to-animation systems, this knowledge is essential to eliminate tedious and low-level tasks, thus allowing users to focus on their creativity and imagination. We address tasks related to five categories of common-sense knowledge required by such systems: (1) spatial role labeling, to automatically identify and annotate a set of spatial signals within a scene description in natural language; (2) grounding spatial relations, to automatically position an object in a 3D world; (3) inferring spatial relations, to extract symbolic spatial relations between objects in order to answer questions about a 3D world; (4) recommending objects and their relative spatial relations given a recently manipulated object, to auto-complete a scene design; and (5) learning physical attributes (e.g., size, weight, and speed) of objects and their corresponding distributions. We approach these tasks using deep learning and probabilistic graphical models, and we exploit existing datasets and web content to learn the corresponding common-sense knowledge.
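As an illustration of the third task, inferring a symbolic spatial relation between objects, the sketch below uses simple axis-aligned bounding-box rules on invented coordinates. The thesis learns such relations with deep learning and probabilistic graphical models rather than hand-written rules; this only shows the input/output shape of the problem.

# Toy sketch: infer a symbolic spatial relation between two objects
# from axis-aligned bounding boxes. Coordinates are placeholders.
from dataclasses import dataclass

@dataclass
class Box:
    # minimum corner (x, y, z) and sizes (w, h, d); y is "up"
    x: float
    y: float
    z: float
    w: float
    h: float
    d: float

def relation(a: Box, b: Box) -> str:
    """Very rough symbolic relation of object a with respect to b."""
    overlap_xz = (a.x < b.x + b.w and b.x < a.x + a.w and
                  a.z < b.z + b.d and b.z < a.z + a.d)
    if overlap_xz and abs(a.y - (b.y + b.h)) < 0.05:
        return "on top of"
    if a.x + a.w <= b.x:
        return "left of"
    if b.x + b.w <= a.x:
        return "right of"
    return "near"

table = Box(0, 0, 0, 2.0, 1.0, 1.0)
lamp = Box(0.5, 1.0, 0.2, 0.2, 0.5, 0.2)
chair = Box(3.0, 0, 0, 0.6, 1.0, 0.6)
print("lamp is", relation(lamp, table), "the table")    # on top of
print("chair is", relation(chair, table), "the table")  # right of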
|
520 |
Stance Detection and Analysis in Social Media
Sobhani, Parinaz January 2017 (has links)
Computational approaches to opinion mining have mostly focused on polarity detection of product reviews, classifying a given text as positive, negative or neutral. There has been less effort in the direction of socio-political opinion mining, that is, determining favorability towards given targets of interest, particularly for social media data such as news comments and tweets. In this research, we explore the task of automatically determining from text whether the author is in favor of, against, or neutral towards a proposition or target. The target may be a person, an organization, a government policy, a movement, a product, etc. Moreover, we are interested in detecting the reasons behind authors' positions.
This thesis is organized into three main parts: the first part on Twitter stance detection and the interaction of stance and sentiment labels, the second part on detecting stance and the reasons behind it in online news comments, and the third part on multi-target stance classification.
One may express favor (or disfavor) towards a target by using positive or negative language. Here, for the first time, we present a dataset of tweets annotated both for whether the tweeter is in favor of or against pre-chosen targets and for sentiment. These targets may or may not be referred to in the tweets, and they may or may not be the target of opinion in the tweets. We develop a simple stance detection system that outperforms all 19 teams that participated in a recent shared task competition on the same dataset (SemEval-2016 Task #6). Additionally, access to both stance and sentiment annotations allows us to conduct several experiments to tease out their interactions.
Next, we propose a novel framework for joint learning of stance and the reasons behind it. This framework relies on topic modeling. Unlike other machine learning approaches to argument tagging, which often require a large set of labeled data, our approach is minimally supervised. The extracted arguments are subsequently employed for stance classification. Furthermore, we create and make available the first dataset of online news comments manually annotated for stance and arguments. Experiments on this dataset demonstrate the benefits of using topic modeling, particularly Non-Negative Matrix Factorization, for argument detection.
Previous models for stance classification often treat each target independently, ignoring the potential (sometimes very strong) dependencies that could exist among targets. However, in many applications there exist natural dependencies among targets. In this research, we relax such independence assumptions in order to jointly model the stance expressed towards multiple targets. We present a new dataset that we built for this task and make it publicly available. We then show that an attention-based encoder-decoder framework is very effective for this problem, outperforming several alternatives that jointly learn dependent subjectivity through cascading classification or multi-task learning.
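To make the argument-detection idea concrete, the sketch below factorizes a TF-IDF matrix of a few invented comments with Non-Negative Matrix Factorization and prints the top words of each latent topic as candidate arguments. It is not the minimally supervised framework, the annotated news-comment dataset, or the encoder-decoder model described in the thesis, only the basic NMF step that the argument-detection experiments build on.

# Minimal sketch: NMF over TF-IDF to surface argument-like topics.
# The comments are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

comments = [
    "the new pipeline will create jobs and boost the local economy",
    "jobs and economic growth matter more than anything else",
    "the pipeline threatens water quality and wildlife habitats",
    "protecting water and the environment should come first",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(comments)

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)   # comment-topic weights
H = nmf.components_        # topic-word weights

terms = tfidf.get_feature_names_out()
for k, row in enumerate(H):
    top = [terms[i] for i in row.argsort()[::-1][:4]]
    print(f"candidate argument {k}: {', '.join(top)}")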
|