A probabilistic description-oriented approach for categorising Web documents

The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. This poses a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation, and (2) a theoretical interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.

Identifier: oai:union.ndltd.org:DUETT/oai:DUETT:duett-04232004-143024
Date: 23 April 2004
Creators: Goevert, Norbert ; Fuhr, Norbert ; Lalmas, Mounia
Contributors: none
Publisher: Gerhard-Mercator-Universitaet Duisburg
Source Sets: Dissertations and other Documents of the Gerhard-Mercator-University Duisburg
Language: German
Detected Language: English
Type: text
Format: application/pdf
Source: http://www.ub.uni-duisburg.de/ETD-db/theses/available/duett-04232004-143024/
Rights: unrestricted. I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant the Universitaet Duisburg the non-exclusive right, under the conditions specified below, to publish and archive my dissertation, state examination thesis or diploma thesis, or my research or project report. I retain the copyright and the right to publish the document and to reuse it in other works.