The deep web contains a massive number of collections that are mostly invisible to search engines. These collections often contain high-quality, structured information that cannot be crawled using traditional methods. An important problem is selecting which of these collections to search. Automatic collection selection methods try to solve this problem by suggesting the best subset of deep web collections to search based on a query. A few methods for deep Web collection selection have proposed in Collection Retrieval Inference Network system and Glossary of Servers Server system. The drawback in these methods is that they require communication between the search broker and the collections, and need metadata about each collection. This thesis compares three different sampling methods that do not require communication with the broker or metadata about each collection. It also transforms some traditional information retrieval based techniques to this area. In addition, the thesis tests these techniques using INEX collection for total 18 collections (including 12232 XML documents) and total 36 queries. The experiment shows that the performance of sample-based technique is satisfactory in average.
Identifer | oai:union.ndltd.org:ADTP/264987 |
Date | January 2004 |
Creators | King, John Douglas |
Publisher | Queensland University of Technology |
Source Sets | Australiasian Digital Theses Program |
Detected Language | English |
Rights | Copyright John Douglas King |
Page generated in 0.0019 seconds