Federated information retrieval is a technique for searching multiple text collections simultaneously. Queries are submitted to a subset of collections that are most likely to return relevant answers. The results returned by the selected collections are integrated and merged into a single list. Federated search is preferred over centralized search alternatives in many environments. For example, commercial search engines such as Google cannot index uncrawlable hidden web collections; federated information retrieval systems can search the contents of hidden web collections without crawling. In enterprise environments, where each organization maintains an independent search engine, federated search techniques can provide parallel search over multiple collections. There are three major challenges in federated search. For each query, a subset of collections that are most likely to return relevant documents is selected. This creates the collection selection problem. To be able to select suitable collections, federated information retrieval systems acquire some knowledge about the contents of each collection, creating the collection representation problem. The results returned from the selected collections are merged before the final presentation to the user. This final step is the result merging problem. In this thesis, we propose new approaches for each of these problems. Our suggested methods for collection representation, collection selection, and result merging outperform state-of-the-art techniques in most cases. We also propose novel methods for estimating the number of documents in collections, and for pruning unnecessary information from collection representation sets. Although management of document duplication has been cited as one of the major problems in federated search, prior research in this area often assumes that collections are free of overlap.
We investigate the effectiveness of federated search on overlapped collections, and propose new methods for maximizing the number of distinct relevant documents in the final merged results. In summary, this thesis introduces several new contributions to the field of federated information retrieval, including practical solutions to some historically unsolved problems in federated search, such as document duplication management. We test our techniques on multiple testbeds that simulate both hidden web and enterprise search environments.
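The three problems named in the abstract (collection representation, collection selection, and result merging) can be illustrated with a minimal sketch. This is not the method proposed in the thesis: the function names, the term-count representation, the term-frequency selection score, and the keep-the-highest-score rule for duplicates are all assumptions chosen for brevity. It also assumes that scores from different collections are directly comparable, which real federated systems generally cannot take for granted.

```python
from collections import Counter


def build_representation(sampled_docs):
    """Summarize a collection as term counts over a sample of its documents.

    Assumption: a bag-of-words count is a good enough stand-in for a
    collection representation in this toy example.
    """
    counts = Counter()
    for text in sampled_docs:
        counts.update(text.lower().split())
    return counts


def select_collections(query, representations, k):
    """Return up to k collection names, ranked by query-term frequency.

    Collections whose representation contains no query term are skipped.
    """
    terms = query.lower().split()
    scores = {
        name: sum(rep[term] for term in terms)
        for name, rep in representations.items()
    }
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return [name for name in ranked[:k] if scores[name] > 0]


def merge_results(result_lists):
    """Merge per-collection (doc_id, score) lists into one ranked list.

    A document returned by several collections is kept once, with its
    highest score (hypothetical duplicate-handling rule).
    """
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))
```

In this sketch, a document identifier shared by two result lists stands in for an overlapping document; detecting duplicates that carry different identifiers in different collections is a harder problem that the sketch does not attempt.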
Identifier | oai:union.ndltd.org:ADTP/210311 |
Date | January 2008 |
Creators | Shokouhi, Milad, milads@microsoft.com |
Publisher | RMIT University. Computer Science and Information Technology |
Source Sets | Australasian Digital Theses Program |
Language | English |
Detected Language | English |
Rights | http://www.rmit.edu.au/help/disclaimer, Copyright Milad Shokouhi |