Global ETD Search

1	[en] THE IMPACT OF STRUCTURAL ATTRIBUTES TO IDENTIFY TABLES AND LISTS IN HTML DOCUMENTS / [pt] O IMPACTO DE ATRIBUTOS ESTRUTURAIS NA IDENTIFICAÇÃO DE TABELAS E LISTAS EM DOCUMENTOS HTML
IAM VITA JABOUR 11 April 2011
[en] The segmentation of HTML documents has been essential to information extraction tasks, as shown by several works in this area. This dissertation studies the link between an HTML document and its visual representation, showing how this link supports a structural approach to segment identification. We also investigate how tree edit distance algorithms can find patterns in the DOM tree, which makes it possible to solve two segment identification tasks. The first task is the identification of genuine tables, where we obtained a 90.40% F1 score using the corpus provided by (Wang and Hu, 2002). We show through an experimental study that this result is competitive with the best results in the area. The second task is the identification of product listings on e-commerce sites, where we obtained a 94.95% F1 score using a corpus of 1114 HTML documents built from 8 distinct sites. We conclude that structural similarity algorithms help solve both tasks, and we believe they may also help identify other types of segments.
[en] VISUAL REPRESENTATION [en] INFORMATION EXTRACTION [en] TREE ISOMORPHISM
