The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. It poses a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) an enhanced representation of web documents is crucial for effective categorisation, and (2) the theoretical interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
Identifier | oai:union.ndltd.org:DUETT/oai:DUETT:duett-04232004-143024 |
Date | 23 April 2004 |
Creators | Goevert, Norbert ; Fuhr, Norbert ; Lalmas, Mounia |
Contributors | none |
Publisher | Gerhard-Mercator-Universitaet Duisburg |
Source Sets | Dissertations and other Documents of the Gerhard-Mercator-University Duisburg |
Language | German |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | http://www.ub.uni-duisburg.de/ETD-db/theses/available/duett-04232004-143024/ |
Rights | unrestricted, I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant the Universitaet Duisburg the non-exclusive right, under the conditions specified below, to publish and archive my dissertation, state examination thesis or diploma thesis, or my research or project report. I retain the copyright and the right to publish the document and to reuse it in other works. |