The web has evolved to be the primary carrier of human knowledge during the information age. The ephemeral nature of much web content makes web knowledge preservation vital in preserving human knowledge and memories. Web archives are created to preserve the current web and make it available for future reuse. A growing number of web archive initia- tives are actively engaging in web archiving activities. Web archiving standards like WARC, for formatted storage, have been established to standardize the preservation of web archive data. In addition to its preservation purpose, web archive data is also used as a source for research and for lost information recovery. However, the reuse of web archive data is inherently challenging because of the scale of data size and requirements of big data tools to serve and analyze web archive data efficiently.
In this research, we propose to build web archive infrastructure that can support efficient and scalable web archive reuse with big data formats like Parquet, enabling more efficient quantitative data analysis and browsing services. Upon the Hadoop big data processing platform with components like Apache Spark and HBase, we propose to replace the WARC (web archive) data format with a columnar data format Parquet to facilitate more efficient reuse. Such a columnar data format can provide the same features as WARC for long-term preservation. In addition, the columnar data format introduces the potential for better com- putational efficiency and data reuse flexibility. The experiments show that this proposed design can significantly improve quantitative data analysis tasks for common web archive data usage. This design can also serve web archive data for a web browsing service. Unlike the conventional web hosting design for large data, this design primarily works on top of the raw large data in file systems to provide a hybrid environment around web archive reuse. In addition to the standard web archive data, we also integrate Twitter data into our design as part of web archive resources. Twitter is a prominent source of data for researchers in a vari- ety of fields and an integral element of the web's history. However, Twitter data is typically collected through non-standardized tools for different collections. We aggregate the Twitter data from different sources and integrate it into the suggested design for reuse. We are able to greatly increase the processing performance of workloads around social media data by overcoming the data loading bottleneck with a web-archive-like Parquet data format. / Doctor of Philosophy / The web has evolved to be the primary carrier of human knowledge during the information age. The ephemeral nature of much web content makes web knowledge preservation vital in preserving human knowledge and memories. Web archives are created to preserve the current web and make it available for future reuse. In addition to its preservation purpose, web archive data is also used as a source for research and for lost information discovery. However, the reuse of web archive data is inherently challenging because of the scale of data size and requirements of big data tools to serve and analyze web archive data efficiently.
In this research, we propose to build a web archive big data processing infrastructure that can support efficient and scalable web archive reuse like quantitative data analysis and browsing services. We adopt industry frameworks and tools to establish a platform that can provide high-performance computation for web archive initiatives and users. We propose to convert the standard web archive data file format to a columnar data format for efficient future reuse. Our experiments show that our proposed design can significantly improve quantitative data analysis tasks for common web archive data usage. Our design can also serve an efficient web browsing service without adopting a sophisticated web hosting architecture. In addition to the standard web archive data, we also integrate Twitter data into our design as a unique web archive resource. Twitter is a prominent source of data for researchers in a variety of fields and an integral element of the web's history. We aggregate the Twitter data from different sources and integrate it into the suggested design for reuse. We are able to greatly increase the processing performance of workloads around social media data by overcoming the data loading bottleneck with a web-archive-like Parquet data format.
Identifer | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/113345 |
Date | 20 January 2023 |
Creators | Wang, Xinyue |
Contributors | Computer Science and Applications, Fox, Edward A., Xie, Zhiwu, Lourentzou, Ismini, Eldardiry, Hoda, Chen, Jiangping |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Dissertation |
Format | ETD, application/pdf |
Rights | Creative Commons Attribution-NonCommercial 4.0 International, http://creativecommons.org/licenses/by-nc/4.0/ |
Page generated in 0.0026 seconds