731

Discovering Hidden Networks Using Topic Modeling

Cooper, Wyatt 01 January 2017 (has links)
This paper explores topic modeling via unsupervised non-negative matrix factorization. The technique is applied to a variety of sources in order to extract salient topics, and from these topics hidden entity networks are discovered and visualized as graphs. Other visualization techniques, such as examining the time series of a topic and examining its top words, are used for evaluation and analysis. Because the project has a large software component, the paper also discusses the design decisions made to keep the resulting program as versatile and extensible as possible.
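As an illustration of the core technique described above (not code from the thesis), the following minimal sketch extracts topics with non-negative matrix factorization using scikit-learn; the toy documents, number of topics, and parameters are assumptions made for this example.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    documents = [
        "the senator met the lobbyist about the energy bill",
        "the striker scored twice in the cup final",
        "the bill on energy subsidies passed the senate",
        "the coach praised the striker after the final",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)     # document-term matrix

    nmf = NMF(n_components=2, random_state=0)   # factor X into W (doc-topic) and H (topic-term)
    W = nmf.fit_transform(X)
    H = nmf.components_

    terms = vectorizer.get_feature_names_out()
    for k, topic in enumerate(H):
        top = topic.argsort()[::-1][:5]         # five strongest terms for this topic
        print(f"topic {k}:", ", ".join(terms[i] for i in top))

From the topic-term matrix H one can then link the salient entities of each topic to build the kind of hidden network the abstract mentions.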
732

Système symbolique de création de résumés de mise à jour / A symbolic system for generating update summaries

Genest, Pierre-Étienne January 2009 (has links)
Thesis digitized by the Division de la gestion de documents et des archives of the Université de Montréal.
733

Applying particle filtering to unsupervised part-of-speech induction

Dubbin, Gregory January 2014 (has links)
Statistical Natural Language Processing (NLP) lies at the intersection of Computational Linguistics and Machine Learning. As linguistic models incorporate more subtle nuances of language and its structure, standard inference techniques can fall behind. One such area is the unsupervised induction of part-of-speech tags, which has the potential to improve both our understanding of the plausibility of theories of first language acquisition and NLP applications such as Speech Recognition and Machine Translation. Sequential Monte Carlo (SMC) approaches, i.e., particle filters, are well suited to approximating such models. This thesis seeks to determine whether one application of SMC methods, particle Gibbs sampling, is capable of performing inference in otherwise intractable NLP applications. Specifically, this research analyses the benefits and drawbacks of relying on particle Gibbs to perform unsupervised part-of-speech induction without the flawed one-tag-per-type assumption of similar approaches. Additionally, this thesis explores the effects of type-based supervision with tag dictionaries extracted from annotated corpora or from Wiktionary. The semi-supervised tag dictionary improves the performance of the local Gibbs PYP-HMM sampler enough to nearly match the performance of the particle Gibbs type-sampler. Finally, this thesis also extends the Pitman-Yor HMM tagger of Blunsom and Cohn (2011) to include an explicit model of the lexicon which encodes the tags from which a word type may be generated. This has the effect of both biasing the model to produce fewer tags per type and modelling the tendency for open-class words to be ambiguous between only a subset of the available tags. Furthermore, I extend the type-based particle Gibbs inference algorithm to simultaneously resample the ambiguity class as well as the tags for all of the tokens of a given word type. The result is a principled probabilistic model of part-of-speech induction that achieves state-of-the-art performance. Overall, the experiments and contributions of this thesis demonstrate the applicability of the particle Gibbs sampler, and particle methods in general, to otherwise intractable problems in NLP.
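As a rough illustration of the sequential Monte Carlo idea behind such models (this is a bootstrap particle filter over a toy HMM tagger, not the PYP-HMM particle Gibbs sampler of the thesis), the sketch below propagates and reweights tag hypotheses word by word; the transition and emission probabilities are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    tags = ["DET", "NOUN", "VERB"]
    trans = np.array([[0.1, 0.8, 0.1],              # P(next tag | current tag) -- invented
                      [0.2, 0.2, 0.6],
                      [0.5, 0.4, 0.1]])
    emit = {"the":  np.array([0.90, 0.05, 0.05]),   # P(word | tag) -- invented
            "dog":  np.array([0.05, 0.90, 0.05]),
            "runs": np.array([0.05, 0.15, 0.80])}
    sentence = ["the", "dog", "runs"]

    n_particles = 500
    particles = rng.integers(0, len(tags), size=n_particles)   # initial tag hypotheses
    weights = emit[sentence[0]][particles]
    weights /= weights.sum()

    for word in sentence[1:]:
        # resample in proportion to the weights, propagate through the transitions, reweight
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = np.array([rng.choice(len(tags), p=trans[t]) for t in particles[idx]])
        weights = emit[word][particles]
        weights /= weights.sum()

    # Approximate posterior over the tag of the final word.
    for k, tag in enumerate(tags):
        print(tag, round(float(weights[particles == k].sum()), 3))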
734

Voisinage lexical pour l'analyse du discours / Lexical neighbours for discourse analysis

Adam, Clémentine 28 September 2012 (has links)
This thesis considers the role of lexical cohesion in various approaches to discourse analysis. Two main hypotheses are studied: first, that distributional analysis, which brings lexical units together on the basis of the syntactic contexts they share, highlights diverse semantic relations that can be exploited to detect lexical cohesion in texts; second, that lexical cues take part in the signalling of discourse organisation and can be used both at a local level (identifying rhetorical relations between elementary discourse units) and at a global level (detecting or characterising higher-level segments that play a rhetorical role and contribute to the coherence and readability of the text, such as thematically unified passages). Regarding the first hypothesis, we show that a distributional resource is relevant for capturing a wide range of relations involved in the lexical cohesion of texts, and we present the projection and filtering methods we implemented to produce usable outputs. Regarding the second hypothesis, we provide a series of case studies showing the benefit of a careful treatment of lexical cohesion for a wide variety of problems related to the study and automatic detection of textual organisation: thematic segmentation of texts, characterisation of enumerative structures, study of the correlation between lexicon and the rhetorical structure of discourse, and finally detection of realisations of a specific discourse relation, the Elaboration relation.
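For readers unfamiliar with distributional analysis, the toy sketch below shows the core idea of comparing lexical units through the syntactic contexts they share; the (word, context) counts are invented and stand in for the full distributional resource used in the thesis.

    from collections import Counter
    from math import sqrt

    # Invented (word, syntactic-context) counts standing in for a real distributional resource.
    contexts = {
        "car":   Counter({"drive_obj": 12, "park_obj": 7, "red_mod": 5}),
        "truck": Counter({"drive_obj": 9,  "park_obj": 6, "load_obj": 4}),
        "idea":  Counter({"have_obj": 10, "good_mod": 8, "reject_obj": 3}),
    }

    def cosine(a, b):
        shared = set(a) & set(b)
        num = sum(a[c] * b[c] for c in shared)
        den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    print(cosine(contexts["car"], contexts["truck"]))   # high: many shared contexts
    print(cosine(contexts["car"], contexts["idea"]))    # 0.0: no shared contexts

Word pairs with high context overlap are the "lexical neighbours" that the thesis exploits as cues to lexical cohesion.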
735

User Modeling in Social Media: Gender and Age Detection

Daneshvar, Saman 21 August 2019 (has links)
Author profiling is a field within Natural Language Processing (NLP) concerned with identifying characteristics and demographic factors of authors, such as gender, age, location, native language, political orientation, and personality, by analyzing the style and content of their writings. There is growing interest in author profiling, with applications in marketing and advertising, opinion mining, personalization, recommendation systems, forensics, security, and defense. In this work, we build several classification models using NLP, deep learning, and classical machine learning techniques that identify the gender and age of a Twitter user based on the textual content of their tweets. Our gender classifier uses a combination of word and character n-grams as features, dimensionality reduction via Latent Semantic Analysis (LSA), and a Support Vector Machine (SVM) with a linear kernel. At the PAN 2018 author profiling shared task, this model achieved the highest performance, with 82.21%, 82.00%, and 80.90% accuracy on the English, Spanish, and Arabic datasets, respectively. Our age classifier was trained on a dataset of 11,160 Twitter users using the same approach, though the age classification experiments are preliminary. Our deep learning gender classifiers are trained and tested on English datasets: a feedforward neural network consisting of a word embedding layer, flattening, and two densely connected layers achieves 79.57% accuracy, and a bidirectional Long Short-Term Memory (LSTM) network achieves 76.85% accuracy on the gender classification task.
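A sketch of the kind of pipeline described above, assuming scikit-learn: word and character n-grams, LSA via truncated SVD, and a linear SVM. The toy data, labels, and hyperparameters are placeholders, not the settings or data used in the thesis.

    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.svm import LinearSVC

    pipeline = Pipeline([
        ("features", FeatureUnion([
            ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
            ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
        ])),
        ("lsa", TruncatedSVD(n_components=2, random_state=0)),   # LSA over the n-gram space
        ("svm", LinearSVC()),
    ])

    # Toy training data: concatenated tweets per user, with placeholder gender labels.
    users = ["just finished my run, feeling great today",
             "new episode tonight, cannot wait",
             "coffee first, emails later",
             "traffic was terrible again this morning"]
    labels = ["female", "male", "female", "male"]

    pipeline.fit(users, labels)
    print(pipeline.predict(["anyone watching the game later?"]))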
736

Analyse et reconnaissance des émotions lors de conversations de centres d'appels / Automatic emotions recognition during call center conversations

Vaudable, Christophe 11 July 2012 (has links)
Automatic emotion recognition in speech is a relatively recent research topic in speech and language processing; it has been studied for about ten years. It now receives considerable attention, not only in academia but also in industry, thanks to improvements in model performance and system reliability. Early work was based on acted, non-spontaneous data, and even today most studies use pre-segmented sequences from a single speaker rather than spontaneous conversations between several speakers. Models built with this methodology generalize poorly to data collected in natural conditions. The work in this thesis is based on a large collection of call-center conversations (about 1,620 hours of dialogue), each involving at least two human speakers (a commercial agent and a client). Our goal is to detect client satisfaction through emotional expression. In the first part, we report the scores obtained on our data by models using only acoustic or only lexical cues, and show that an approach relying on a single type of cue is not sufficient. To overcome this, we study the fusion of acoustic, lexical, and syntactic-semantic cues, and show that the combined models outperform the acoustic-only models in every case, including a fully automatic setting with no manual pre-processing (automatic segmentation of the conversations and transcripts produced by an automatic speech recognition system). In the second part, we observe that, even though the hybrid acoustic/linguistic models bring useful gains, the amount of training data becomes a problem when the methods are tested on new and highly varied data (49 hours drawn from the conversation database). To address this, we propose a method that automatically enriches the training corpus: new data, selected on acoustic and linguistic criteria from 100 hours of recordings, are added to the training set. These additions double the size of the training set and improve on the initial models. Finally, we evaluate our methods on complete conversations rather than on dialogue excerpts, as is done in most studies. We reuse the models from the previous parts (cue fusion, automatic enrichment) and add two further groups of features: (i) structural features, such as the duration of the conversation and the speaking time of each type of speaker, and (ii) dialogic features, such as the topic of the conversation and a new notion we call "affective implication", which models the impact of the current speaker's emotional production on the other participants. When all of this information is combined, our results come close to human performance at deciding whether a conversation is positive or negative.
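The sketch below illustrates the simplest form of cue fusion, early fusion, by concatenating acoustic and lexical feature vectors before classification; the feature values and labels are invented, and the thesis relies on much richer acoustic, lexical, and syntactic-semantic cues extracted from real call-center dialogues.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy per-turn acoustic descriptors: [mean pitch (Hz), energy, speech rate] -- invented values.
    acoustic = np.array([[220.0, 0.70, 4.1],
                         [180.0, 0.40, 3.2],
                         [250.0, 0.90, 5.0],
                         [190.0, 0.35, 3.0]])

    # Toy per-turn lexical features: [positive-word count, negative-word count].
    lexical = np.array([[0, 3],
                        [2, 0],
                        [0, 4],
                        [3, 0]])

    labels = ["negative", "positive", "negative", "positive"]

    fused = np.hstack([acoustic, lexical])      # early fusion: one combined vector per turn
    clf = LogisticRegression().fit(fused, labels)

    new_turn = np.hstack([[[230.0, 0.80, 4.5]], [[1, 2]]])
    print(clf.predict(new_turn))                # predicted polarity for the new speech turn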
737

Répondre à des questions à réponses multiples sur le Web / Answering multiple answer questions from the Web

Falco, Mathieu-Henri 22 May 2014 (has links)
Question-answering systems find and extract a precise answer to a question asked in natural language. Current systems, as well as the evaluation campaigns that assess them, generally assume that a single answer is expected for each question. Our corpus studies show that this is rarely the case, especially when answers are sought on the Web rather than in a fixed collection of documents. We therefore focus on questions that expect multiple correct answers from the Web, and develop Citron, a French question-answering system able to extract multiple distinct answers to open-domain factual questions and to identify and extract the shifting criterion (date, location) that is the source of this answer multiplicity. Our corpus studies also show that the answers to such questions are often located in structures such as tables and lists, which are difficult to analyse automatically without suitable pre-processing. We therefore also developed Kitten, a tool that extracts the textual content of HTML documents and identifies, analyses, and formats these structures. Finally, we carried out two user experiments. The first compared Citron and human subjects on a multiple-answer extraction task: Citron was faster than the humans, and the gap between the quality of its answers and theirs was reasonable. The second assessed user satisfaction with the presentation of multiple answers: users preferred Citron's presentation, which aggregates the answers and adds the shifting criterion when one exists, to the presentation used in evaluation campaigns.
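To illustrate the kind of pre-processing Kitten performs (this is a generic sketch using BeautifulSoup, not the thesis' implementation), the code below recovers the content of answer-bearing tables and lists from a small HTML page.

    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <h2>Nobel Peace Prize</h2>
      <table>
        <tr><th>Year</th><th>Laureate</th></tr>
        <tr><td>2012</td><td>European Union</td></tr>
        <tr><td>2013</td><td>OPCW</td></tr>
      </table>
      <ul><li>2014: Kailash Satyarthi</li><li>2014: Malala Yousafzai</li></ul>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Tables become lists of rows, each row a list of cell strings.
    for table in soup.find_all("table"):
        rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
                for tr in table.find_all("tr")]
        print("table:", rows)

    # List items are extracted as individual candidate answers.
    for item in soup.select("ul li"):
        print("list item:", item.get_text(strip=True))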
738

Aperfeiçoamento de um tradutor automático Português-Inglês: tempos verbais / Development of a Portuguese-to-English machine translation system: tenses

Silva, Lucia Helena Rozario da 03 August 2010 (has links)
This dissertation presents the improvement of a Portuguese-to-English machine translation system. Our main goal is to create structural transfer rules between this language pair and to evaluate the system's performance with the METEOR evaluation metric, using a test corpus built specifically for this research. Starting from the importance of correctly translating the verb tenses of a sentence, this work focused on rules that handle the transfer of verb tenses from Brazilian Portuguese to American English. Because Portuguese verbs are distributed across three conjugation classes, we created one corpus for each class in order to verify that the structural transfer rules apply to all three. After building these corpora, we mapped the Portuguese verb tenses of the indicative, subjunctive, and imperative moods onto their English counterparts and then constructed structural transfer rules between the mapped tenses. Finally, we evaluated the corpora for the three conjugation classes with the METEOR automatic evaluation metric. The evaluation after inserting the rules showed a regression compared with the system at the initial stage of the research. Our analysis indicates that METEOR was not sensitive to the modifications made to the system, even though the rules follow the traditional grammar of Portuguese and are applied to all three conjugation classes. We present in detail the set of transfer rules and the corpora used in this study, which we believe are general enough to be useful for any rule-based Brazilian Portuguese-to-American English machine translation system. A further contribution of this work is a discussion of the scores produced by METEOR and a suggestion that its parameters be adjusted to make it more sensitive to sentence variations such as those introduced by our rules.
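For reference, the sketch below scores a single hypothesis with METEOR through NLTK, illustrating the kind of automatic evaluation discussed above; it assumes a recent NLTK version, which expects pre-tokenized inputs, and the sentences are invented examples rather than corpus data from the dissertation.

    import nltk
    from nltk.translate.meteor_score import meteor_score

    nltk.download("wordnet", quiet=True)        # METEOR uses WordNet for synonym matching
    nltk.download("omw-1.4", quiet=True)

    reference  = "she had been living in Paris for two years".split()
    hypothesis = "she lived in Paris for two years".split()

    print(round(meteor_score([reference], hypothesis), 3))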
739

Extractive document summarization using complex networks / Sumarização extractiva de documentos usando redes complexas

Tohalino, Jorge Andoni Valverde 15 June 2018 (has links)
Given the large amount of textual information available on the Internet, automatic document summarization has gained significant importance: its aim is to develop techniques for finding relevant, concise content in large volumes of information without changing the original meaning. The purpose of this Master's work is to apply network-theory concepts to extractive summarization, both Single-Document Summarization (SDS) and Multi-Document Summarization (MDS). Documents are modeled as networks in which sentences are represented as nodes, and the most relevant sentences are extracted with ranking algorithms. Edges between nodes are established in different ways: a first approach weights an edge by the number of nouns two sentences (network nodes) have in common; another creates an edge from the similarity between two sentences, computed in a vector space model using either Tf-Idf weighting or word embeddings for the sentence representations. For the multi-document task we use multilayer network models, distinguishing edges that link sentences from different documents (inter-layer) from those that connect sentences within the same document (intra-layer); each network layer represents one document of the set to be summarized. Besides the measurements typically used in complex networks, such as node degree, clustering coefficient, and shortest paths, the network characterization is also guided by dynamical measurements of complex networks, including symmetry, accessibility, and absorption time. The generated summaries were evaluated on several corpora in both Portuguese and English, using the ROUGE-1 metric for validation. The results suggest that simpler models, such as the noun-based and Tf-Idf-based networks, performed better than the models based on word embeddings. Excellent results were also achieved with the multilayer representation of documents for MDS. Finally, we conclude that several measurements can be used to improve the characterization of networks for the summarization task.
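A minimal sketch of the single-document variant of this approach: sentences become nodes, Tf-Idf cosine similarity defines weighted edges, and a ranking algorithm (PageRank here, via networkx) selects the top sentences. The similarity threshold and toy sentences are placeholders, and the dynamical network measurements used in the thesis are not shown.

    import networkx as nx
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = [
        "The new dam will supply power to three provinces.",
        "Construction of the dam displaced thousands of residents.",
        "Officials say the power plant will open next year.",
        "Local festivals attracted many tourists this summer.",
    ]

    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)

    graph = nx.Graph()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > 0.05:                # keep only sufficiently similar sentence pairs
                graph.add_edge(i, j, weight=sim[i, j])

    scores = nx.pagerank(graph, weight="weight")
    best = sorted(scores, key=scores.get, reverse=True)[:2]
    print([sentences[i] for i in sorted(best)]) # the two top-ranked sentences form the extract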
740

"Métodos para análise discursiva automática" / Methods for Automatic Discourse Analysis

Pardo, Thiago Alexandre Salgueiro 04 August 2005 (has links)
Research in Linguistics and Computational Linguistics has long shown that a text is more than a simple sequence of juxtaposed sentences. Every text has a highly elaborated underlying structure that relates its entire content and gives it coherence. This structure is called the discourse structure, and it is the object of study of the research area known as Discourse Analysis. Given the usefulness of this kind of knowledge for several Natural Language Processing applications, e.g. automatic text summarization and anaphora resolution, automatic discourse analysis has received a great deal of attention. For Brazilian Portuguese in particular, there are few resources and little research in this area. In this scenario, this doctoral thesis investigates, develops, and implements methods for automatic discourse analysis, adopting Rhetorical Structure Theory, one of the most widely used discourse theories today, as its main framework. Based on the rhetorical annotation and analysis of a corpus of scientific texts from the Computer Science domain, we produced the first rhetorical analyzer for Brazilian Portuguese, called DiZer (DIscourse analyZER), together with a large amount of discourse knowledge. Novel statistical models for recognizing discourse relations are presented, based on content units of increasing complexity: words, concepts, and argument structures. Regarding the latter, a model for the unsupervised learning of verb argument structures is presented and applied to the 1,500 most frequent English verbs, resulting in a repository called ArgBank. DiZer and the proposed models are evaluated and produce satisfactory results.
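As a toy illustration of a word-based statistical model for recognizing discourse relations (loosely in the spirit of the simplest models mentioned above, not DiZer itself), the sketch below trains a naive Bayes classifier on invented segment pairs labelled with their relation.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Each item concatenates two adjacent discourse segments; labels name the relation.
    pairs = [
        "the experiment failed because the sensor overheated",
        "sales dropped since the main supplier closed",
        "the method is fast in other words it runs in linear time",
        "the approach is simple that is it needs no training data",
    ]
    relations = ["cause", "cause", "elaboration", "elaboration"]

    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    model.fit(pairs, relations)

    print(model.predict(["the server crashed because the disk was full"]))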
