Web archiving is necessary to retain the history of the World Wide Web and to study its evolution. It is important for the cultural heritage community, and some organizations are legally obligated to capture and archive Web content. The advent of transactional Web archiving makes the archiving process more efficient, thereby helping organizations archive their Web content.
This study measures and analyzes the performance of transactional Web archiving systems. To conduct a detailed analysis, we construct a meaningful design space defined by the system specifications that determine the performance of these systems. SiteStory, a state-of-the-art transactional Web archiving system, and local archiving, an alternative archiving technique, are used in this research. We experimentally evaluate the performance of these systems using the Greek version of Wikipedia deployed on dedicated hardware on a private network. Our benchmarking results show that the local archiving technique uses a Web server’s resources more efficiently than SiteStory for one data point in our design space. Better performance than SiteStory in that scenario makes our archiving solution favorable for transactional archiving. We also show that SiteStory does not impose any significant performance overhead on the Web server for the rest of the data points in our design space.
Master of Science
Web archiving is the process of preserving the information available on the World Wide Web in archives. This process gives historians and cultural heritage scholars access to data that allows them to understand the evolution of the Internet and its usage. Web archiving is also essential for some organizations that are obligated to keep records of online resource access for their customers. Transactional Web archiving is an archiving technique in which the information available on the Web is archived by capturing a transaction between a user and the Web server processing the user’s request. Transactional Web archiving provides a more complete and accurate history of a Web server than the traditional Web archiving models. However, in some scenarios transactional Web archiving solutions may cause performance problems for the Web server being archived.
In this thesis, we conduct a detailed performance analysis of SiteStory, a state-of-the-art transactional Web archiving solution, in various experimental settings. Furthermore, we propose a novel transactional Web archiving approach and compare its performance with SiteStory. To conduct a realistic study, we analyze real-life traffic on the Greek Wikipedia website and generate similar traffic to perform our benchmarking experiments. Our benchmarking results show that our archiving technique uses a Web server’s resources more efficiently than SiteStory in some scenarios. Better performance than SiteStory in such scenarios makes our archiving solution favorable for transactional archiving. We also show that SiteStory does not impose any significant performance overhead on the Web server in other scenarios.
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/78371 |
Date | 19 July 2017 |
Creators | Maharshi, Shivam |
Contributors | Computer Science, Fox, Edward A., Xie, Zhiwu, Lee, Dongyoon |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf, application/x-zip-compressed |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |