
Improving Web Search Ranking Using the Internet Archive

Current web search engines retrieve relevant results based only on the latest content of the web pages stored in their indices, even though many web resources are updated frequently. We explore possible techniques and data sources for improving web search result ranking using the history of web page content changes. We compare web pages with their previous versions and separately model the text and relevance signals in the newly added, retained, and removed parts. In particular, we examine the Internet Archive, the largest web archiving service to date, for its effectiveness in improving web search performance. We experiment with several retrieval techniques, including language modeling approaches that use refined document and query representations built by comparing current web pages with their previous versions, and learning-to-rank methods that combine relevance features from different versions of web pages. Experimental results on two large-scale retrieval datasets (ClueWeb09 and ClueWeb12) suggest that using web page content change history is a promising way to improve web search performance. However, the actual effectiveness is currently affected by the practical coverage of the Internet Archive and by the proportion of regularly changing resources among the information relevant to search queries. Our work is the first step towards a promising area combining web search and web archiving, and reveals new opportunities for commercial search engines and web archiving services.

Master of Science

Current web search engines show search results based only on the most recent version of the web pages stored in their databases, even though many web resources are updated frequently. We explore possible techniques and data sources for improving web search result ranking using the history of web page content changes. We compare web pages with their previous versions and extract the newly added, retained, and removed parts. In particular, we examine the Internet Archive, currently the largest web archiving service, for its effectiveness in improving web search performance. We experiment with several retrieval techniques, including language modeling approaches that use refined document and query representations built by comparing current web pages with their previous versions, and learning-to-rank methods that combine relevance features from different versions of web pages. Experimental results on two large-scale retrieval datasets (ClueWeb09 and ClueWeb12) suggest that using web page content change history is a promising way to improve web search performance. However, the actual effectiveness is currently affected by the practical coverage of the Internet Archive and by the proportion of ever-changing resources among the information relevant to search queries. Our work is the first step towards a promising area combining web search and web archiving, and reveals new opportunities for commercial search engines and web archiving services.
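The comparison of a current page with a previous version can be illustrated with a minimal sketch. It assumes a simple bag-of-words view of each version; the tokenizer and the function name `split_by_change` are hypothetical and are not taken from the thesis, which may use a different comparison method.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def split_by_change(previous_text, current_text):
    """Split term counts into added, retained, and removed parts.

    Bag-of-words illustration only: term order and position are ignored.
    """
    prev = Counter(tokenize(previous_text))
    curr = Counter(tokenize(current_text))
    added = curr - prev      # terms (or extra occurrences) only in the current version
    retained = prev & curr   # occurrences present in both versions
    removed = prev - curr    # terms (or extra occurrences) only in the previous version
    return added, retained, removed


added, retained, removed = split_by_change(
    "old price list", "new price list today"
)
print(dict(added), dict(retained), dict(removed))
```

Each of the three term bags could then feed a separate language model or a separate set of ranking features, in the spirit of the approaches the abstract describes.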

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/106740
Date: 02 June 2020
Creators: Li, Liyan
Contributors: Computer Science, Jiang, Jiepu, Karpatne, Anuj, Fox, Edward A.
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Detected Language: English
Type: Thesis
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
