Global ETD Search

1	Deep Web Collection Selection King, John Douglas January 2004 (has links) The deep web contains a massive number of collections that are mostly invisible to search engines. These collections often contain high-quality, structured information that cannot be crawled using traditional methods. An important problem is selecting which of these collections to search. Automatic collection selection methods try to solve this problem by suggesting the best subset of deep web collections to search based on a query. A few methods for deep Web collection selection have proposed in Collection Retrieval Inference Network system and Glossary of Servers Server system. The drawback in these methods is that they require communication between the search broker and the collections, and need metadata about each collection. This thesis compares three different sampling methods that do not require communication with the broker or metadata about each collection. It also transforms some traditional information retrieval based techniques to this area. In addition, the thesis tests these techniques using INEX collection for total 18 collections (including 12232 XML documents) and total 36 queries. The experiment shows that the performance of sample-based technique is satisfactory in average. information retrieval deep web collection selection singular value decomposition latent semantic analysis sampling query focused probabilistic
2	Search engine content analysis King, John D. January 2008 (has links) Search engines have forever changed the way people access and discover knowledge, allowing information about almost any subject to be quickly and easily retrieved within seconds. As increasingly more material becomes available electronically the influence of search engines on our lives will continue to grow. This presents the problem of how to find what information is contained in each search engine, what bias a search engine may have, and how to select the best search engine for a particular information need. This research introduces a new method, search engine content analysis, in order to solve the above problem. Search engine content analysis is a new development of traditional information retrieval field called collection selection, which deals with general information repositories. Current research in collection selection relies on full access to the collection or estimations of the size of the collections. Also collection descriptions are often represented as term occurrence statistics. An automatic ontology learning method is developed for the search engine content analysis, which trains an ontology with world knowledge of hundreds of different subjects in a multilevel taxonomy. This ontology is then mined to find important classification rules, and these rules are used to perform an extensive analysis of the content of the largest general purpose Internet search engines in use today. Instead of representing collections as a set of terms, which commonly occurs in collection selection, they are represented as a set of subjects, leading to a more robust representation of information and a decrease of synonymy. The ontology based method was compared with ReDDE (Relevant Document Distribution Estimation method for resource selection) using the standard R-value metric, with encouraging results. ReDDE is the current state of the art collection selection method which relies on collection size estimation. The method was also used to analyse the content of the most popular search engines in use today, including Google and Yahoo. In addition several specialist search engines such as Pubmed and the U.S. Department of Agriculture were analysed. In conclusion, this research shows that the ontology based method mitigates the need for collection size estimation.

Search results

Deep Web Collection Selection

Search engine content analysis