Global ETD Search

1	Investigação de métodos de sumarização automática multidocumento baseados em hierarquias conceituais Zacarias, Andressa Caroline Inácio 29 March 2016 (has links) Submitted by Livia Mello (liviacmello@yahoo.com.br) on 2016-09-30T19:20:49Z No. of bitstreams: 1 DissACIZ.pdf: 2734710 bytes, checksum: bf061fead4f2a8becfcbedc457a68b25 (MD5) / Approved for entry into archive by Marina Freitas (marinapf@ufscar.br) on 2016-10-20T16:19:10Z (GMT) No. of bitstreams: 1 DissACIZ.pdf: 2734710 bytes, checksum: bf061fead4f2a8becfcbedc457a68b25 (MD5) / Approved for entry into archive by Marina Freitas (marinapf@ufscar.br) on 2016-10-20T16:19:17Z (GMT) No. of bitstreams: 1 DissACIZ.pdf: 2734710 bytes, checksum: bf061fead4f2a8becfcbedc457a68b25 (MD5) / Made available in DSpace on 2016-10-20T16:19:25Z (GMT). No. of bitstreams: 1 DissACIZ.pdf: 2734710 bytes, checksum: bf061fead4f2a8becfcbedc457a68b25 (MD5) Previous issue date: 2016-03-29 / Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) / The Automatic Multi-Document Summarization (MDS) aims at creating a single summary, coherent and cohesive, from a collection of different sources texts, on the same topic. The creation of these summaries, in general extracts (informative and generic), requires the selection of the most important sentences from the collection. Therefore, one may use superficial linguistic knowledge (or statistic) or deep knowledge. It is important to note that deep methods, although more expensive and less robust, produce more informative extracts and with more linguistic quality. For the Portuguese language, the sole deep methods that use lexical-conceptual knowledge are based on the frequency of the occurrence of the concepts in the collection for the selection of a content. Considering the potential for application of semantic-conceptual knowledge, the proposition is to investigate MDS methods that start with representation of lexical concepts of source texts in a hierarchy for further exploration of certain hierarchical properties able to distinguish the most relevant concepts (in other words, the topics from a collection of texts) from the others. Specifically, 3 out of 50 CSTNews (multi-document corpus of Portuguese reference) collections were selected and the names that have occurred in the source texts of each collection were manually indexed to the concepts of the WordNet from Princenton (WN.Pr), engendering at the end, an hierarchy with the concepts derived from the collection and other concepts inherited from the WN.PR for the construction of the hierarchy. The hierarchy concepts were characterized in 5 graph metrics (of relevancy) potentially relevant to identify the concepts that compose a summary: Centrality, Simple Frequency, Cumulative Frequency, Closeness and Level. Said characterization was analyzed manually and by machine learning algorithms (ML) with the purpose of verifying the most suitable measures to identify the relevant concepts of the collection. As a result, the measure Centrality was disregarded and the other ones were used to propose content selection methods to MDS. Specifically, 2 sentences selection methods were selected which make up the extractive methods: (i) CFSumm whose content selection is exclusively based on the metric Simple Frequency, and (ii) LCHSumm whose selection is based on rules learned by machine learning algorithms from the use of all 4 relevant measures as attributes. These methods were intrinsically evaluated concerning the informativeness, by means of the package of measures called ROUGE, and the evaluation of linguistic quality was based on the criteria from the TAC conference. Therefore, the 6 human abstracts available in each CSTNews collection were used. Furthermore, the summaries generated by the proposed methods were compared to the extracts generated by the GistSumm summarizer, taken as baseline. The two methods got satisfactory results when compared to the GistSumm baseline and the CFSumm method stands out upon the LCHSumm method. / Na Sumarização Automática Multidocumento (SAM), busca-se gerar um único sumário, coerente e coeso, a partir de uma coleção de textos, de diferentes fontes, que tratam de um mesmo assunto. A geração de tais sumários, comumente extratos (informativos e genéricos), requer a seleção das sentenças mais importantes da coleção. Para tanto, pode-se empregar conhecimento linguístico superficial (ou estatística) ou conhecimento profundo. Quanto aos métodos profundos, destaca-se que estes, apesar de mais caros e menos robustos, produzem extratos mais informativos e com mais qualidade linguística. Para o português, os únicos métodos profundos que utilizam conhecimento léxico-conceitual baseiam na frequência de ocorrência dos conceitos na coleção para a seleção de conteúdo. Tendo em vista o potencial de aplicação do conhecimento semântico-conceitual, propôs-se investigar métodos de SAM que partem da representação dos conceitos lexicais dos textos-fonte em uma hierarquia para a posterior exploração de certas propriedades hierárquicas capazes de distinguir os conceitos mais relevantes (ou seja, os tópicos da coleção) dos demais. Especificamente, selecionaram-se 3 das 50 coleções do CSTNews, corpus multidocumento de referência do português, e os nomes que ocorrem nos textos-fonte de cada coleção foram manualmente indexados aos conceitos da WordNet de Princeton (WN.Pr), gerando, ao final, uma hierarquia com os conceitos constitutivos da coleção e demais conceitos herdados da WN.Pr para a construção da hierarquia. Os conceitos da hierarquia foram caracterizados em função de 5 métricas (de relevância) de grafo potencialmente pertinentes para a identificação dos conceitos a comporem um sumário: Centrality, Simple Frequency, Cumulative Frequency, Closeness e Level. Tal caracterização foi analisada de forma manual e por meio de algoritmos de Aprendizado de Máquina (AM) com o objetivo de verificar quais medidas seriam as mais adequadas para identificar os conceitos relevantes da coleção. Como resultado, a medida Centrality foi descartada e as demais utilizadas para propor métodos de seleção de conteúdo para a SAM. Especificamente, propuseram-se 2 métodos de seleção de sentenças, os quais compõem os métodos extrativos: (i) CFSumm, cuja seleção de conteúdo se baseia exclusivamente na métrica Simple Frequency, e (ii) LCHSumm, cuja seleção se baseia em regras aprendidas por algoritmos de AM a partir da utilização em conjunto das 4 medidas relevantes como atributos. Tais métodos foram avaliados intrinsecamente quanto à informatividade, por meio do pacote de medidas ROUGE, e qualidade linguística, com base nos critérios da conferência TAC. Para tanto, utilizaram-se os 6 abstracts humanos disponíveis em cada coleção do CSTNews. Ademais, os sumários gerados pelos métodos propostos foram comparados aos extratos gerados pelo sumarizador GistSumm, tido como baseline. Os dois métodos obtiveram resultados satisfatórios quando comparados ao baseline GistSumm e o método CFSumm se sobressai ao método LCHSumm. / FAPESP 2014/12817-4 Sumarização Automática Multidocumento Métricas de grafo Hierarquia léxico-conceitual Automatic multi-document summarization Graph metrics Lexical-conceptual hierarchy LINGUISTICA, LETRAS E ARTES::LINGUISTICA

Search results

Investigação de métodos de sumarização automática multidocumento baseados em hierarquias conceituais