Global ETD Search

Return to search

[en] CLUSTERING TEXT STRUCTURED DATA BASED ON TEXT SIMILARITY / [pt] AGRUPAMENTO DE REGISTROS TEXTUAIS BASEADO EM SIMILARIDADE ENTRE TEXTOS

[pt] O presente trabalho apresenta os resultados que obtivemos com a aplicação de grande número de modelos e algoritmos em um determinado conjunto de experimentos de agrupamento de texto. O objetivo de tais testes é determinar quais são as melhores abordagens para processar as grandes massas de informação geradas pelas crescentes demandas de data quality em diversos setores da economia. O processo de deduplicação foi acelerado pela divisão dos conjuntos de dados em subconjuntos de itens similares. No melhor cenário possível, cada subconjunto tem em si todas as ocorrências duplicadas de cada registro, o que leva o nível de erro na formação de cada grupo a zero. Todavia, foi determinada uma taxa de tolerância intrínseca de 5 porcento após o agrupamento. Os experimentos mostram que o tempo de processamento é significativamente menor e a taxa de acerto é de até 98,92 porcento. A melhor relação entre acurácia e desempenho é obtida pela aplicação do algoritmo K-Means com um modelo baseado em trigramas. / [en] This document reports our findings on a set of text clusterig experiments, where a wide variety of models and algorithms were applied. The objective of these experiments is to investigate which are the most feasible strategies to process large amounts of information in face of the growing demands on data quality in many fields. The process of deduplication was accelerated through the division of the data set into individual subsets of similar items. In the best case scenario, each subset must contain all duplicates of each produced register, mitigating to zero the cluster s errors. It is established, although, a tolerance of 5 percent after the clustering process. The experiments show that the processing time is significantly lower, showing a 98,92 percent precision. The best accuracy/performance relation is achieved with the K-Means Algorithm using a trigram based model.

[pt] APRENDIZADO DE MAQUINA

[en] MACHINE LEARNING

[pt] RECUPERACAO DE INFORMACAO

[en] INFORMATION RETRIEVAL

[pt] MINERACAO DE TEXTOS

[en] TEXTS MINING

[pt] DEDUPLICACAO

Identifer	oai:union.ndltd.org:puc-rio.br/oai:MAXWELL.puc-rio.br:25796
Date	18 February 2016
Creators	IAN MONTEIRO NUNES
Contributors	RUY LUIZ MILIDIU
Publisher	MAXWELL
Source Sets	PUC Rio
Language	Portuguese
Detected Language	Portuguese
Type	TEXTO

Page generated in 0.0024 seconds

[en] CLUSTERING TEXT STRUCTURED DATA BASED ON TEXT SIMILARITY / [pt] AGRUPAMENTO DE REGISTROS TEXTUAIS BASEADO EM SIMILARIDADE ENTRE TEXTOS

Description

Links & Downloads

Tags

Additional Fields