  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Improving Web Search Ranking Using the Internet Archive

Li, Liyan, 02 June 2020
Current web search engines retrieve relevant results based only on the latest content of web pages stored in their indices, despite the fact that many web resources update frequently. We explore techniques and data sources for improving web search result ranking using the historical content changes of web pages. We compare web pages with their previous versions and separately model the text and relevance signals in the newly added, retained, and removed parts. In particular, we examine the Internet Archive, the largest web archiving service to date, for its effectiveness in improving web search performance. We experiment with several retrieval techniques, including language modeling approaches that use refined document and query representations built by comparing current web pages to previous versions, and learning-to-rank methods for combining relevance features across different versions of web pages. Experimental results on two large-scale retrieval datasets (ClueWeb09 and ClueWeb12) suggest that using the content change history of web pages to improve web search performance is promising. However, the effectiveness achievable at present is limited by the practical coverage of the Internet Archive and by how much of the relevant information for a query resides in regularly changing resources. Our work is a first step toward a promising area combining web search and web archiving, and it discloses new opportunities for commercial search engines and web archiving services. / Master of Science
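The core preprocessing step the abstract describes, partitioning a page's text into newly added, retained, and removed parts relative to an archived snapshot, can be sketched in a few lines. This is an illustrative sketch using Python's standard-library `difflib`, not the thesis's actual implementation; the function name and token-level granularity are assumptions.

```python
# Illustrative sketch (not the thesis code): align an archived version of a
# page with its current version and partition tokens into removed, retained,
# and added parts, which could then feed separate language models.
import difflib

def diff_versions(archived_tokens, current_tokens):
    """Partition tokens into (removed, retained, added) lists by aligning
    an archived snapshot against the current page text."""
    sm = difflib.SequenceMatcher(a=archived_tokens, b=current_tokens)
    removed, retained, added = [], [], []
    for op, a0, a1, b0, b1 in sm.get_opcodes():
        if op == "equal":
            # Tokens present in both versions.
            retained.extend(current_tokens[b0:b1])
        else:
            # "replace"/"delete" contribute removed tokens from the archive,
            # "replace"/"insert" contribute added tokens from the current page.
            removed.extend(archived_tokens[a0:a1])
            added.extend(current_tokens[b0:b1])
    return removed, retained, added

old = "the quick brown fox".split()
new = "the quick red fox jumps".split()
removed, retained, added = diff_versions(old, new)
```

A retrieval model along the lines the abstract suggests could then weight query-term matches differently depending on which of the three parts they fall in.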
2

Building CTRnet Digital Library Services using Archive-It and LucidWorks Big Data Software

Chitturi, Kiran, 27 March 2014
When a crisis occurs, information flows rapidly through the Web via social media, blogs, and news articles. The shared information captures the reactions, impacts, and responses of the government as well as the public. Later, researchers, scholars, students, and others seek information about earlier events, sometimes for cross-event analysis or comparison. Very few integrated systems try both to collect and permanently archive information about an event and to provide access to that crisis information at the same time. In this thesis, we describe the CTRnet Digital Library and Archive, which aims to permanently archive crisis event information using Archive-It services and then provide access to the archived information using LucidWorks Big Data software. Through LucidWorks Big Data (LWBD), we take advantage of text extraction, clustering, similarity, annotation, and indexing services, and we build digital libraries with the generated metadata, which helps the system's stakeholders locate information about an event. In this study, we collected data for 46 crisis events using Archive-It. We built a CTRnet DL prototype and its services for the "Boston Marathon Bombing" collection using the components of LucidWorks Big Data. Running LucidWorks Big Data on a 30-node Hadoop cluster accelerates sub-workflow processing and provides fault-tolerant execution. The LWBD sub-workflows "ingest" and "extract" processed the textual data in the WARC files. The other sub-workflows, "kmeans", "simdoc", and "annotate", helped respectively in grouping search results, deleting duplicates, and providing metadata for additional facets in the CTRnet DL prototype. / Master of Science
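The duplicate-deletion step attributed to the "simdoc" sub-workflow can be illustrated with a minimal near-duplicate filter. This is a hedged sketch, not LWBD's actual algorithm: the Jaccard-similarity measure, the 0.8 threshold, and the function names are all assumptions chosen for illustration.

```python
# Illustrative sketch (not LWBD's "simdoc" implementation): drop documents
# whose token sets are nearly identical to an already-kept document,
# using Jaccard similarity with an assumed threshold.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two documents."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def dedup(docs, threshold=0.8):
    """Keep each document only if it is not a near-duplicate of one kept earlier."""
    kept = []
    for d in docs:
        if all(jaccard(d, k) < threshold for k in kept):
            kept.append(d)
    return kept

docs = [
    "boston marathon bombing news report",
    "boston marathon bombing news report update",  # near-duplicate, dropped
    "city announces road closures for event",
]
unique = dedup(docs)
```

In a production workflow like the one the thesis describes, pairwise comparison would be too slow at scale; shingling or MinHash-style signatures are the usual way to make this kind of filter tractable over large crawls.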
