
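Uma análise comparativa entre as abordagens linguística e estatística para extração automática de termos relevantes de corpora (A comparative analysis of the linguistic and statistical approaches to the automatic extraction of relevant terms from corpora)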

Submitted by PPG Ciência da Computação (ppgcc@pucrs.br) on 2018-07-26T19:48:07Z
Approved for entry into archive by Sheila Dias (sheila.dias@pucrs.br) on 2018-08-01T13:39:36Z (GMT)
Made available in DSpace on 2018-08-01T14:31:21Z (GMT)
No. of bitstreams: 1
CARLOS ALBERTO DOS SANTOS_DIS.pdf: 1271475 bytes, checksum: 856ae87ad633d3c772b413816caa43d1 (MD5)
Previous issue date: 2018-04-27

It is commonly held that linguistic processing of corpora demands high computational effort because of the complexity of its algorithms but that, despite this, its results are better than those of statistical processing, whose computational demand is lower. This dissertation describes a comparative analysis of the linguistic and statistical processes of term extraction. Experiments were carried out on four English-language corpora built from scientific papers, from which terms were extracted using both approaches. The resulting term lists were refined with relevance metrics and a stop list, and then compared with the corpora's reference lists using recall. These reference lists, in turn, were built from the context of the corpora with the help of Internet searches. The results showed that statistical extraction combined with the stop list and relevance metrics can produce better results than linguistic extraction refined with the same metrics. It is concluded that the statistical approach combined with these techniques can be the ideal option for extracting relevant terms, because it requires few computational resources and yields better results than those obtained with linguistic processing.
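The evaluation loop described in the abstract (statistical extraction, stop-list filtering, ranking by a relevance metric, and scoring by recall against a reference list) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the stop list, the use of raw n-gram frequency as the relevance metric, and the function names `extract_terms` and `recall` are hypothetical and are not taken from the dissertation, which does not name its metrics in this record.

```python
import re
from collections import Counter

# Hypothetical stop list; the dissertation's actual list is not given in this record.
STOP_WORDS = {"the", "of", "a", "an", "in", "on", "and", "is", "for", "to"}


def extract_terms(text, max_n=2, top_k=10):
    """Rank candidate n-grams by raw frequency (an assumed stand-in relevance metric)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Stop-list filter: drop candidates that start or end with a stop word.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[" ".join(gram)] += 1
    return [term for term, _ in counts.most_common(top_k)]


def recall(extracted, reference):
    """Fraction of reference terms that also appear in the extracted list."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(reference & set(extracted)) / len(reference)
```

Calling `recall(extract_terms(corpus_text), reference_terms)` gives one score per corpus. Comparing that score between two extractors run over the same corpus and reference list mirrors the kind of comparison the abstract reports, under the assumptions stated above.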

Identifier: oai:union.ndltd.org:IBICT/oai:tede2.pucrs.br:tede/8233
Date: 27 April 2018
Creators: Santos, Carlos Alberto dos
Contributors: Vieira, Renata
Publisher: Pontifícia Universidade Católica do Rio Grande do Sul, Programa de Pós-Graduação em Ciência da Computação, PUCRS, Brasil, Escola Politécnica
Source Sets: IBICT Brazilian ETDs
Language: Portuguese
Detected Language: English
Type: info:eu-repo/semantics/publishedVersion, info:eu-repo/semantics/masterThesis
Format: application/pdf
Source: reponame:Biblioteca Digital de Teses e Dissertações da PUC_RS, instname:Pontifícia Universidade Católica do Rio Grande do Sul, instacron:PUC_RS
Rights: info:eu-repo/semantics/openAccess
Relation: 1974996533081274470, 500, 500, -862078257083325301
