521 |
Textual entailment for modern standard Arabic. Alabbas, Maytham Abualhail Shahed, January 2013.
This thesis explores a range of approaches to the task of recognising textual entailment (RTE), i.e. determining whether one text snippet entails another, for Arabic, where we are faced with an exceptional level of lexical and structural ambiguity. To the best of our knowledge, this is the first attempt to carry out this task for Arabic. Tree edit distance (TED) has been widely used as a component of natural language processing (NLP) systems that attempt to achieve the goal above, with the distance between pairs of dependency trees being taken as a measure of the likelihood that one entails the other. Such a technique relies on having accurate linguistic analyses, and obtaining such analyses for Arabic is notoriously difficult. To overcome these problems we have investigated strategies for improving tagging and parsing based on system combination techniques. These strategies lead to substantially better performance than any of the contributing tools. We also describe a semi-automatic technique for creating a first RTE dataset for Arabic using an extension of the ‘headline-lead paragraph’ technique, because, again to the best of our knowledge, no such datasets are available. We sketch the difficulties inherent in judgments made by volunteer annotators, and describe a regime to ameliorate some of these. The major contribution of this thesis is the introduction of two ways of improving the standard TED: (i) we present a novel approach, extended TED (ETED), which extends the standard TED algorithm for calculating the distance between two trees by allowing operations to apply to subtrees, rather than just to single nodes. This leads to useful improvements over the performance of the standard TED for determining entailment. The key here is that subtrees tend to correspond to single information units. By treating operations on subtrees as less costly than the corresponding set of individual node operations, ETED concentrates on entire information units, which are a more appropriate granularity than individual words for considering entailment relations; and (ii) we use the artificial bee colony (ABC) algorithm to automatically estimate the cost of edit operations for single nodes and subtrees and to determine thresholds, since manually assigning an appropriate cost to each edit operation can become a tricky task. The current findings are encouraging: these extensions substantially improve the F-score and accuracy and achieve a better RTE model when compared with a number of string-based algorithms and the standard TED approach. The relative performance of the standard techniques on our Arabic test set replicates the results reported for these techniques on English test sets. We have also applied ETED with ABC to the English RTE2 test set, where it again outperforms the standard TED.
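As a rough illustration of the TED-based entailment decision described above, the sketch below compares two toy dependency trees and applies a fixed threshold. It assumes the third-party zss package (a Zhang-Shasha tree edit distance implementation); the trees, costs and threshold are invented stand-ins, not the thesis's tuned values (in the thesis, operation costs and thresholds are estimated with the ABC algorithm).

```python
# Hypothetical sketch of a TED-based entailment decision (not the thesis code).
# Assumes the third-party `zss` package (Zhang-Shasha tree edit distance).
from zss import Node, simple_distance

def build_tree(spec):
    """Build a zss tree from a nested (label, [children]) specification."""
    label, children = spec
    node = Node(label)
    for child in children:
        node.addkid(build_tree(child))
    return node

def entails(text_tree, hypothesis_tree, threshold=3.0):
    """Decide entailment by comparing the tree edit distance to a threshold.
    In the thesis, per-operation costs and the threshold are tuned with the
    artificial bee colony (ABC) algorithm; here they are fixed toy values."""
    distance = simple_distance(text_tree, hypothesis_tree)
    return distance <= threshold, distance

# Toy dependency trees (head -> dependents); labels stand in for word forms.
text = build_tree(("signed", [("president", [("the", [])]), ("treaty", [("the", [])])]))
hyp = build_tree(("signed", [("president", []), ("treaty", [])]))

decision, dist = entails(text, hyp)
print(f"distance={dist}, entailment={decision}")
```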
|
522 |
Mining patient journeys from healthcare narratives. Dehghan, Azad, January 2015.
The aim of the thesis is to investigate the feasibility of using text mining methods to reconstruct patient journeys from unstructured clinical narratives. A novel method to extract and represent patient journeys is proposed and evaluated in this thesis. A set of methods was designed, developed and evaluated to this end, comprising health-related concept extraction, temporal information extraction, concept clustering and automated workflow generation. A suite of methods to extract clinical information from healthcare narratives was proposed and evaluated in order to enable chronological ordering of clinical concepts. Specifically, we proposed and evaluated a data-driven method to identify key clinical events (i.e., medical problems, treatments, and tests) using a sequence labelling algorithm, CRF, with a combination of lexical and syntactic features, and a rule-based post-processing method including label correction, boundary adjustment and false-positive filtering. The method was evaluated as part of the 2012 i2b2 challenge and achieved state-of-the-art performance, with strict and lenient micro F1-measures of 83.45% and 91.13% respectively. A method to extract temporal expressions using a hybrid knowledge-driven (dictionary and rules) and data-driven (CRF) approach was also proposed and evaluated. The method demonstrated state-of-the-art performance at the 2012 i2b2 challenge: an F1-measure of 90.48% and accuracy of 70.44% for identification and normalisation respectively. For temporal ordering of events we proposed and evaluated a knowledge-driven method, with an F1-measure of 62.96% (considering the reduced temporal graph) or 70.22% for extraction of temporal links. The method consisted of initial rule-based identification and classification components, which used contextual lexico-syntactic cues for inter-sentence links and string similarity for co-reference links, followed by a temporal closure component to calculate the transitive relations of the extracted links. In a case study of survivors of childhood central nervous system tumours (medulloblastoma), qualitative evaluation showed that we were able to capture specific trends that form part of patient journeys. An overall quantitative evaluation score (average precision and recall) of 94-100% for individual and 97% for aggregated patient journeys was also achieved, indicating that text mining methods can be used to identify, extract and temporally organise the key clinical concepts that make up a patient's journey. We also present an analysis of healthcare narratives, specifically exploring the content of clinical and patient narratives using the methods developed to extract patient journeys. We found that health-related quality-of-life concepts are more common in patient narratives, while clinical concepts (e.g., medical problems, treatments, tests) are more prevalent in clinical narratives. In addition, while both aggregated sets of narratives contain all investigated concepts, clinical narratives contain, proportionally, more health-related quality-of-life concepts than the clinical concepts found in patient narratives. These results demonstrate that automated concept extraction, in particular of health-related quality-of-life concepts, as part of standard clinical practice is feasible. The work presented herein demonstrates that text mining methods can be efficiently used to identify, extract and temporally organise the key clinical concepts that make up a patient's journey in a healthcare system.
Automated reconstruction of patient journeys can potentially be of value to clinical practitioners and researchers, aiding large-scale analyses of implemented care pathways and subsequently helping to monitor, compare, develop and adjust clinical guidelines, both in areas of chronic disease where there is plenty of data and in rare conditions where there may be no established guidelines.
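As a small illustration of the temporal closure step mentioned above, the sketch below adds the transitive relations implied by a set of extracted BEFORE links. The event identifiers are invented, and this is not the thesis implementation.

```python
# Illustrative sketch of temporal closure: given extracted BEFORE links
# between clinical events, add the transitive relations they imply.
from itertools import product

def temporal_closure(before_links):
    """Return the transitive closure of a set of (a, b) BEFORE links."""
    closed = set(before_links)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(tuple(closed), repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed

links = {("admission", "ct_scan"), ("ct_scan", "surgery"), ("surgery", "discharge")}
print(sorted(temporal_closure(links)))
# adds e.g. ("admission", "surgery"), ("admission", "discharge"), ("ct_scan", "discharge")
```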
|
523 |
A principle-based system for natural language analysis and translation. Crocker, Matthew Walter, January 1988.
Traditional views of grammatical theory hold that languages are characterised by sets of constructions. This approach entails the enumeration of all possible constructions for each language being described. Current theories of transformational generative grammar have established an alternative position. Specifically, Chomsky's Government-Binding theory proposes a system of principles which are common to human language. Such a theory is referred to as a "Universal Grammar" (UG). Associated with the principles of grammar are parameters of variation which account for the diversity of human languages. The grammar for a particular language is known as a "Core Grammar", and is characterised by an appropriately parametrised instance of UG. Despite these advances in linguistic theory, construction-based approaches have remained the status quo within the field of natural language processing. This thesis investigates the possibility of developing a principle-based system which reflects the modular nature of the linguistic theory. That is, rather than stipulating the possible constructions of a language, a system is developed which uses the principles of grammar and language-specific parameters to parse language. Specifically, a system is presented which performs syntactic analysis and translation for a subset of English and German. The cross-linguistic nature of the theory is reflected by the system, which can be considered a procedural model of UG.
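To make the principles-and-parameters idea concrete, here is a deliberately tiny, hypothetical illustration (not Crocker's system): a single head-direction parameter switches the linearization of a verb phrase between English-style head-initial order and the head-final order of German subordinate clauses.

```python
# Toy illustration of a "head parameter" in the spirit of principles-and-parameters.
# The parameter values and linearization rule are simplifying assumptions.
HEAD_PARAMETER = {"english": "initial", "german": "final"}

def linearize_vp(verb, complement, language):
    """Order a verb and its complement according to the language's head parameter."""
    if HEAD_PARAMETER[language] == "initial":
        return [verb, complement]
    return [complement, verb]

print(linearize_vp("read", "the book", "english"))  # ['read', 'the book']
print(linearize_vp("liest", "das Buch", "german"))   # ['das Buch', 'liest'] (subordinate-clause order)
```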
|
524 |
Extrakcia nešpecifikovaných znalostí z webu / Extraction of unspecified relations from the web. Ovečka, Marek, January 2013.
The subject of this thesis is non-specific knowledge extraction from the web. In recent years, tools that improve the results of this type of knowledge extraction have been created. The aim of this thesis is to become familiar with these tools, test them, and propose uses for their results. The tools are described and compared, and extraction is carried out using OLLIE. Based on the results of the extractions, two methods of enriching extractions using named entity recognition are proposed: the first modifies the weights of extractions, and the second enriches extractions with named entities. The thesis also proposes an ontology which captures the structure of enriched extractions. In the last part, a practical experiment is carried out in which the proposed methods are demonstrated. Future research in this field would be useful in the areas of extraction and categorization of relational phrases.
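The sketch below illustrates both enrichment ideas on a single Open IE triple, using a hypothetical record layout; the field names, boost value and NER output are invented for the example, not taken from the thesis.

```python
# Hedged sketch of the two enrichment ideas described above:
# (1) re-weight an Open IE extraction when its arguments contain named entities,
# (2) attach the entity annotations to the extraction.
def enrich_extraction(extraction, entities, boost=0.1):
    """extraction: dict with 'arg1', 'rel', 'arg2', 'confidence';
    entities: dict mapping surface strings to entity types (from an NER tool)."""
    enriched = dict(extraction)
    found = {arg: entities[extraction[arg]]
             for arg in ("arg1", "arg2") if extraction[arg] in entities}
    enriched["entities"] = found
    # Method 1: nudge the confidence up for each argument recognised as an entity.
    enriched["confidence"] = min(1.0, extraction["confidence"] + boost * len(found))
    return enriched

triple = {"arg1": "Barack Obama", "rel": "was born in", "arg2": "Hawaii", "confidence": 0.72}
ner = {"Barack Obama": "PERSON", "Hawaii": "LOCATION"}
print(enrich_extraction(triple, ner))
```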
|
525 |
A Feature Structure Approach for Disambiguating Preposition Senses. Baglodi, Venkatesh, 01 January 2009.
Word Sense Disambiguation (WSD) continues to be an open research problem in spite of recent advances in the NLP field, especially in machine learning. WSD for open-class words is well understood. However, WSD for closed-class structural words (such as prepositions) is not so well resolved, and their role in frame semantics seems to be a relatively unknown area. This research uses a new method to disambiguate preposition senses through a combined lookup of the FrameNet and TPP databases. Motivated by recent work by Popescu, Tonelli, & Pianta (2007), it extends the concept to provide deterministic WSD of prepositions using lexical information drawn from the sentences in a local context. While the primary goal of the research is to disambiguate preposition senses, the approach also assigns frames and roles to different sentence elements. The use of prepositions for frame and role assignment seems to be a largely unexplored area which could provide a new dimension to research in lexical semantics.
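A toy sketch of what such a combined lookup could look like follows; the sense inventory, cue table and mapping below are invented placeholders, since the real method consults the FrameNet and TPP resources themselves.

```python
# Hypothetical sketch of a combined-lookup disambiguation step (toy data only).
TPP_SENSES = {
    ("on", "surface"): "on(2a): position above and in contact with",
    ("on", "topic"):   "on(5): concerning a subject",
}
FRAME_CUES = {
    # head noun of the preposition's complement -> coarse semantic class
    "table": "surface", "shelf": "surface",
    "history": "topic", "physics": "topic",
}

def disambiguate(preposition, complement_head):
    """Map the complement's coarse class to a sense for the preposition."""
    cue = FRAME_CUES.get(complement_head)
    return TPP_SENSES.get((preposition, cue), "unknown sense")

print(disambiguate("on", "table"))    # surface reading
print(disambiguate("on", "physics"))  # topic reading
```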
|
526 |
Exploração de métodos de sumarização automática multidocumento com base em conhecimento semântico-discursivo / Exploration of automatic methods for multi-document summarization using discourse models. Paula Christina Figueira Cardoso, 05 September 2014.
Multi-document summarization aims at producing a summary from a set of related texts, to be used by an individual and/or for a particular task. Nowadays, with the exponential growth of available information and people's need to obtain information in a short time, automatic summarization has received wide attention. It is known that a set of related texts contains pieces of redundant, contradictory and complementary information, which constitute the multi-document phenomena. In each source text, the main subject is described in a sequence of subtopics. Furthermore, some sentences in the same text are more relevant than others. In this context, a multi-document summary is expected to consist of the relevant information that represents the whole set of texts. However, the strategies for automatic multi-document summarization adopted until now have used only the relationships between texts and dismissed the analysis of the textual structure of each source text, resulting in summaries that are less representative of the subtopics and less informative than they could be.

In order to properly treat the relevance of information, the multi-document phenomena and the distribution of subtopics, this thesis investigated how to model the summarization process using semantic-discursive knowledge in content selection methods, and the impact of doing so on producing summaries that are more informative and more representative of the source texts. To formalize the semantic-discursive knowledge, the RST (Rhetorical Structure Theory) and CST (Cross-document Structure Theory) theories were adopted. To support the work, a multi-document corpus was annotated with RST and subtopics, constituting a new resource available to other researchers. From the corpus analysis, 10 methods for subtopic segmentation and 13 original methods for automatic summarization were proposed. The assessment of the subtopic segmentation methods showed that there is a strong relationship between the subtopic structure and the rhetorical analysis of a text. As for the assessment of the automatic summarization methods, the results indicate that the use of semantic-discursive knowledge in good content selection strategies positively affects the production of informative summaries.
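As an illustration of content selection informed by rhetorical structure, the sketch below ranks sentences by the depth of their nucleus in an RST tree and fills a word budget. It is a simplified stand-in, not one of the 13 methods proposed in the thesis, and the example sentences and depths are invented.

```python
# Illustrative content-selection sketch: rank sentences by how close to the
# root of the RST tree they lie (depth 0 = nucleus of the root), then greedily
# fill the summary up to a word budget (redundancy handling is omitted here).
def select_content(sentences, word_budget):
    """sentences: list of dicts with 'text' and 'rst_depth'."""
    ranked = sorted(sentences, key=lambda s: s["rst_depth"])
    summary, used = [], 0
    for sent in ranked:
        length = len(sent["text"].split())
        if used + length <= word_budget:
            summary.append(sent["text"])
            used += length
    return " ".join(summary)

docs = [
    {"text": "Heavy rains flooded the city centre on Monday.", "rst_depth": 0},
    {"text": "Forecasters had issued warnings the day before.", "rst_depth": 2},
    {"text": "Several schools were closed as a precaution.", "rst_depth": 1},
]
print(select_content(docs, word_budget=15))
```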
|
527 |
Uma abordagem conexionista para anotação de papéis semânticos / A connectionist approach to semantic role labeling. Erick Rocha Fonseca, 10 April 2013.
Semantic Role Labeling (SRL) is a subfield of Natural Language Processing (NLP) which began to be explored for English in 2002. Its goal is to detect structures of predicate and arguments in written sentences, which correspond to descriptions of events (usually expressed by verbs); their participants, such as agents and patients; and circumstances, such as time, place, etc. Many NLP applications, such as machine translation and information retrieval, have achieved performance gains by applying SRL as a pre-processing step. For Portuguese, advances in SRL research are still at a very early stage. Given that the majority of works found in the literature of this area employ supervised machine learning, a limiting factor has been the absence of labeled data in Portuguese, a problem that only recently was partially solved with the creation of PropBank-Br. This resource follows the annotation model used in PropBank, the main labeled dataset employed in the SRL task for English. Even so, PropBank-Br contains less than one tenth of the data instances present in the original PropBank.

Another point to be observed is that the most common approach to SRL is based on extracting a great amount of linguistic information from the input sentences to be used by automatic classifiers. Such an approach is extremely dependent on other NLP tools, a particularly undesirable feature in the case of Portuguese, which does not have many freely available resources. On the other hand, another successful approach found in the literature forgoes the use of explicit linguistic knowledge and associates words with numeric sequences, whose values are adjusted during the training of an artificial neural network. These sequences are then employed by the network to perform SRL, and can also be useful for other NLP tasks. This work followed the second method described above. Modifications to this method were implemented and allowed a performance gain over its original version when tested on PropBank-Br. The final version of the developed system is ready for use and can support NLP research in Portuguese.
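The sketch below shows the core of such a window-based connectionist tagger: each word in a context window is mapped to a learned numeric vector (embedding), the concatenated window is passed through a hidden layer, and a role is predicted per token. The weights here are random and untrained, and the vocabulary and role set are toy assumptions, so the printed tags are arbitrary; it only illustrates the forward pass.

```python
# Minimal sketch of a window-based neural tagger (forward pass only, no training).
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<pad>": 0, "o": 1, "menino": 2, "comeu": 3, "bolo": 4}
roles = ["O", "A0", "V", "A1"]

EMB_DIM, WINDOW, HIDDEN = 8, 3, 16
embeddings = rng.normal(scale=0.1, size=(len(vocab), EMB_DIM))  # adjusted during training
w_hidden = rng.normal(scale=0.1, size=(WINDOW * EMB_DIM, HIDDEN))
w_out = rng.normal(scale=0.1, size=(HIDDEN, len(roles)))

def tag(sentence):
    ids = [vocab["<pad>"]] + [vocab[w] for w in sentence] + [vocab["<pad>"]]
    tags = []
    for i in range(1, len(ids) - 1):
        # Concatenate the embeddings of the word and its two neighbours.
        window = np.concatenate([embeddings[ids[i - 1]], embeddings[ids[i]], embeddings[ids[i + 1]]])
        hidden = np.tanh(window @ w_hidden)
        scores = hidden @ w_out
        tags.append(roles[int(np.argmax(scores))])
    return tags

sentence = ["o", "menino", "comeu", "bolo"]
print(list(zip(sentence, tag(sentence))))
```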
|
528 |
Anotação automática de papéis semânticos de textos jornalísticos e de opinião sobre árvores sintáticas não revisadas / Automatic semantic role labeling on non-revised syntactic trees of journalistic and opinion texts. Nathan Siegle Hartmann, 25 June 2015.
Background: Semantic Role Labeling (SRL) is a Natural Language Processing (NLP) task that enables the detection of the events described in sentences and the participants of these events (Palmer et al., 2010). SRL answers questions such as Who?, When?, Where?, What? and Why?, among others, which are important for several NLP applications. In order to automatically annotate a text with semantic roles, most current systems use Machine Learning (ML) techniques. However, some semantic roles are predictable and therefore do not need to be classified through ML. Moreover, although SRL is well advanced for English, grammatical and semantic particularities of that language prevent the full reuse of its tools and results in other languages.

Related work: For Brazilian Portuguese, there are three recently concluded studies that perform SRL on journalistic texts. The first one (Alva-Manchego, 2013) obtained 79.6 of F1 for SRL on the PropBank.Br corpus; the second one (Fonseca, 2013), without using a treebank for training, obtained 68.0 of F1 on the same corpus; and the third one (Sequeira et al., 2012) annotated only the Arg0 (prototypical agent) and Arg1 (prototypical patient) roles on the CETEMPúblico corpus, with a performance of 31.3 of F1 for the first role and 19.0 for the second. None of them, however, reached the state of the art for English.

Purpose: The goal of this master's dissertation was to advance the state of the art of SRL in Brazilian Portuguese. The training corpus used is from the journalistic genre, as in previous works, but the SRL annotation is performed on non-revised syntactic trees, i.e., trees generated by an automatic parser (Bick, 2000) without human revision, using a sample of the PLN-Br corpus. To evaluate the resulting SRL classifier on another text genre, a sample of product reviews from the web was used. Until now, product reviews were a genre not explored in SRL research, and few of their characteristics have been formalized.

Results: The first corpus of web product reviews, the Buscapé corpus (Hartmann et al., 2014), was compiled. The difference in performance between a system trained on revised syntactic trees and another trained on non-revised trees, both from the journalistic genre, was 10.48 points of F1. Changing genres between the training and testing steps in SRL is possible, with a performance loss of 3.78 points of F1 (PLN-Br and Buscapé corpora, respectively). A system to insert unexpressed subjects reached 87.8% precision on the PLN-Br corpus and 94.5% precision on the Buscapé corpus. A rule-based system was developed to annotate auxiliary verbs with modifier semantic roles (ArgMs), achieving 96.76% confidence on the PLN-Br corpus.

Conclusions: First, we have shown that the SRL system of Alva-Manchego (2013), which is based on syntactic trees, performs better annotation than the system of Fonseca (2013), which does not depend on syntactic trees. Second, an SRL system trained on non-revised syntactic trees performs better on non-revised trees than a system trained on gold-standard data. Third, making unexpressed subjects explicit in the Buscapé texts improves their SRL performance. Additionally, we show that it is possible to annotate auxiliary verbs with modifier semantic roles using a rule-based system with high confidence. Finally, we have shown that using the verb sense as an ML feature for SRL does not improve the performance of the systems trained on the PLN-Br and Buscapé corpora, since these corpora are small.
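Below is a hedged sketch of the rule-based idea of labelling auxiliary verbs with modifier roles; the auxiliary list, the label choices and the token format are assumptions made for illustration and may not match the thesis's actual rules.

```python
# Hedged sketch: label auxiliary verbs preceding a main verb with modifier roles.
AUXILIARY_RULES = {
    "poder": "AM-MOD",  # modality: "pode comer" -> can eat
    "dever": "AM-MOD",
    "estar": "AM-TMP",  # progressive aspect treated as temporal here (assumption)
    "ter":   "AM-TMP",  # perfect aspect (assumption)
}

def label_auxiliaries(tokens):
    """tokens: list of (word, lemma, pos); label auxiliaries that precede a verb."""
    labels = ["O"] * len(tokens)
    for i, (word, lemma, pos) in enumerate(tokens):
        next_is_verb = i + 1 < len(tokens) and tokens[i + 1][2].startswith("V")
        if pos.startswith("V") and next_is_verb and lemma in AUXILIARY_RULES:
            labels[i] = AUXILIARY_RULES[lemma]
    return labels

sent = [("pode", "poder", "V"), ("comer", "comer", "V"), ("o", "o", "DET"), ("bolo", "bolo", "N")]
print(list(zip([w for w, _, _ in sent], label_auxiliaries(sent))))
```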
|
529 |
Extração de informações de narrativas clínicas / Clinical reports information retrieval. Michel Oleynik, 02 October 2013.
Clinical reports are usually written in natural language due to its descriptive power and the ease of communication among specialists. Processing such data for knowledge discovery and statistical analysis requires information retrieval techniques, already established for newswire texts but still rare in the medical domain. The present work aims at developing an automated classifier of pathology reports that is able to infer the topography and morphology classes of a cancer using codes of the International Classification of Diseases for Oncology (ICD-O). Data provided by the A.C. Camargo Cancer Center, located in São Paulo, was used for training and validation. Techniques of natural language processing (NLP) combined with Bayesian classifiers were explored in search of information retrieval quality, evaluated by the F2-score. Values above 74% for the topography group and 61% for the morphology group are reported, with only a small contribution from the NLP and smoothing techniques. The results corroborate similar studies and show that retraining NLP tools on the medical domain is necessary.
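The pipeline described above can be sketched with scikit-learn: bag-of-words features (here unigrams and bigrams) feeding a multinomial Naive Bayes classifier with Laplace smoothing. The report snippets and ICD-O topography labels below are invented toy examples, not data from the A.C. Camargo Cancer Center.

```python
# Toy sketch of pathology-report classification with Naive Bayes and smoothing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reports = [
    "invasive ductal carcinoma of the left breast",
    "adenocarcinoma of the sigmoid colon",
    "papillary carcinoma of the thyroid gland",
    "lobular carcinoma in situ right breast",
]
topography = ["C50", "C18", "C73", "C50"]  # ICD-O-3 topography codes (toy labels)

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB(alpha=1.0))
model.fit(reports, topography)

print(model.predict(["carcinoma of the breast, upper quadrant"]))  # expected: ['C50']
```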
|
530 |
[en] ENTROPY GUIDED FEATURE GENERATION FOR STRUCTURE LEARNING / [pt] GERAÇÃO DE ATRIBUTOS GUIADA POR ENTROPIA PARA APRENDIZADO DE ESTRUTURAS. 17 December 2014.
Structure learning consists in learning a mapping from inputs to structured outputs by means of a sample of correct input-output pairs. Many important problems fit into this setting. Natural language processing provides several tasks that can be formulated and solved as structure learning problems. Dependency parsing, for instance, involves the prediction of a tree underlying a sentence. Feature generation is an important subtask of structure learning which, usually, is partially solved by a domain expert who builds complex discriminative feature templates by conjoining the available basic features. This is a limited and expensive way to generate features and is recognized as a modeling bottleneck.

In this work, we propose an automatic feature generation method for structure learning problems. This method is entropy guided, since it generates complex features based on the conditional entropy of local output variables given the available input features. We experimentally compare the proposed method with two important alternative feature generation methods, namely manual template generation and polynomial kernel methods. Our experimental findings indicate that the proposed method is more attractive than both alternatives. It is much cheaper than manual templates and computationally faster than kernel methods. Additionally, its generalization performance is simpler to control than with kernel methods.

We evaluate our method on nine datasets involving five natural language processing tasks and four languages. The resulting systems present performances comparable to the state of the art and, particularly on part-of-speech tagging, text chunking, quotation extraction and coreference resolution, achieve the best known performances on different languages such as Arabic, Chinese, English, and Portuguese. Furthermore, our coreference resolution systems achieved first place in the Conference on Computational Natural Language Learning 2012 Shared Task. The competing systems were ranked by the mean score over three languages: Arabic, Chinese and English. Our approach obtained the best performance among all competitors for all three languages.

Our feature generation method naturally extends the general structure learning framework and is not restricted to natural language processing tasks.
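To make the entropy-guided selection concrete, the sketch below computes the conditional entropy of the output label given each candidate basic feature on a toy tagging sample; features with lower conditional entropy would be preferred when building conjoined templates. The data and feature names are invented for illustration.

```python
# Illustrative sketch: rank candidate basic features by H(label | feature).
import math
from collections import Counter, defaultdict

def conditional_entropy(pairs):
    """H(Y | X) for a list of (x, y) observations."""
    by_x = defaultdict(list)
    for x, y in pairs:
        by_x[x].append(y)
    total = len(pairs)
    h = 0.0
    for ys in by_x.values():
        p_x = len(ys) / total
        counts = Counter(ys)
        h_y_given_x = -sum((c / len(ys)) * math.log2(c / len(ys)) for c in counts.values())
        h += p_x * h_y_given_x
    return h

# Toy POS-tagging observations: basic features vs. the output label.
data = [
    {"word": "the", "suffix": "e", "label": "DET"},
    {"word": "dog", "suffix": "g", "label": "NOUN"},
    {"word": "barks", "suffix": "s", "label": "VERB"},
    {"word": "the", "suffix": "e", "label": "DET"},
    {"word": "cats", "suffix": "s", "label": "NOUN"},
    {"word": "sleep", "suffix": "p", "label": "VERB"},
]
for feature in ("word", "suffix"):
    pairs = [(token[feature], token["label"]) for token in data]
    print(feature, round(conditional_entropy(pairs), 3))
# Features with lower H(label | feature) are conjoined first into templates.
```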
|