Global ETD Search

1	Large Web Archive Collection Infrastructure and Services
Wang, Xinyue 20 January 2023
The web has evolved to be the primary carrier of human knowledge during the information age. The ephemeral nature of much web content makes web preservation vital to preserving human knowledge and memories. Web archives are created to preserve the current web and make it available for future reuse. A growing number of web archive initiatives are actively engaging in web archiving activities. Web archiving standards like WARC, for formatted storage, have been established to standardize the preservation of web archive data. In addition to its preservation purpose, web archive data is also used as a source for research and for lost information recovery. However, the reuse of web archive data is inherently challenging because of the scale of the data and the need for big data tools to serve and analyze it efficiently. In this research, we propose to build web archive infrastructure that can support efficient and scalable web archive reuse with big data formats like Parquet, enabling more efficient quantitative data analysis and browsing services. Building on the Hadoop big data processing platform, with components like Apache Spark and HBase, we propose to replace the WARC (web archive) data format with the columnar data format Parquet to facilitate more efficient reuse. Such a columnar data format can provide the same features as WARC for long-term preservation. In addition, the columnar data format introduces the potential for better computational efficiency and data reuse flexibility. The experiments show that this proposed design can significantly improve quantitative data analysis tasks for common web archive data usage. This design can also serve web archive data for a web browsing service. 
Unlike the conventional web hosting design for large data, this design primarily works on top of the raw large data in file systems to provide a hybrid environment around web archive reuse. In addition to the standard web archive data, we also integrate Twitter data into our design as part of web archive resources. Twitter is a prominent source of data for researchers in a variety of fields and an integral element of the web's history. However, Twitter data is typically collected through non-standardized tools for different collections. We aggregate the Twitter data from different sources and integrate it into the suggested design for reuse. We are able to greatly increase the processing performance of workloads around social media data by overcoming the data loading bottleneck with a web-archive-like Parquet data format.
/ Doctor of Philosophy /
The web has evolved to be the primary carrier of human knowledge during the information age. The ephemeral nature of much web content makes web preservation vital to preserving human knowledge and memories. Web archives are created to preserve the current web and make it available for future reuse. In addition to its preservation purpose, web archive data is also used as a source for research and for lost information discovery. However, the reuse of web archive data is inherently challenging because of the scale of the data and the need for big data tools to serve and analyze it efficiently. In this research, we propose to build a web archive big data processing infrastructure that can support efficient and scalable web archive reuse, such as quantitative data analysis and browsing services. We adopt industry frameworks and tools to establish a platform that can provide high-performance computation for web archive initiatives and users. We propose to convert the standard web archive data file format to a columnar data format for efficient future reuse. 
Our experiments show that our proposed design can significantly improve quantitative data analysis tasks for common web archive data usage. Our design can also serve an efficient web browsing service without adopting a sophisticated web hosting architecture. In addition to the standard web archive data, we also integrate Twitter data into our design as a unique web archive resource. Twitter is a prominent source of data for researchers in a variety of fields and an integral element of the web's history. We aggregate the Twitter data from different sources and integrate it into the suggested design for reuse. We are able to greatly increase the processing performance of workloads around social media data by overcoming the data loading bottleneck with a web-archive-like Parquet data format.
Web Archive Digital Library Big Data Infrastructure
2	Internet art and agency : the social lives of online artworks
De Wild, Karin January 2019
During the 1990s, artists started to explore the possibilities of the World Wide Web. This thesis investigates online artworks by studying their agency. Why do people interact with them as if they are alive? How do they mobilise people, or make them share visions and ideas? Based on research in largely untapped archives, it presents an in-depth examination of several case studies, exploring the artwork's power to act in a variety of social settings. By studying the life trajectory of the artwork, it also offers insights into how these dynamic entities undergo changes over time and across cultures. Grounded in theoretical literature on the agency of art, this research offers an innovative way of understanding Internet art and contributes to wider conversations about the agency of art and artefacts. Case studies include:
Mouchette (Martine Neddam), 'Mouchette' (1996-present). Web project (www.mouchette.org). Collection of Stedelijk Museum (Amsterdam).
Shu Lea Cheang, 'Brandon' (1998-1999). Web project (brandon.guggenheim.org). Collection of Solomon R. Guggenheim Museum (New York).
Lynn Hershman Leeson, 'Agent Ruby' (1998-2002). Web project (agentruby.sfmoma.org). Collection of SFMOMA (San Francisco).
3	Srovnávací analýza WebArchivu Národní knihovny ČR se zahraničními projekty / Comparative Analysis of WebArchiv of the National Library of the Czech Republic and Foreign Projects
Kupcová, Pavla January 2012
(in English) This diploma thesis compares WebArchiv with selected foreign web archives that are responsible for preserving national cultural heritage. The introduction briefly explains the history of web archives and the typology of harvesting. The following parts deal with the history, legal aspects of archiving, selected types of harvesting, web resources, systems, access to, and evaluation of the Czech (WebArchiv), Australian (Pandora) and British (United Kingdom Web Archive) archives. The text continues with an evaluation of the selected archives that identifies their strengths and weaknesses and suggests possible solutions. The conclusion outlines the problematic aspects of archiving that must be addressed in the future. [Author's abstract]
4	En förbisedd skatt av svenskt kulturarv : Kulturarw³ och dess värde för forskningen / An Overlooked Treasure of Swedish Cultural Heritage : Kulturarw³ and its Value for Scientific Research
Skjöldebrand Lefevre, Caroline January 2023
This master's thesis examines users' ability to use the Swedish national web archive Kulturarw³ for research purposes. The aim was also to identify potential areas of improvement in what users can do when working with Kulturarw³. The research questions are:
1. How does Kulturarw³ operate?
2. What are the main factors that affect Kulturarw³'s structure and function?
3. What capabilities exist for researchers and students to utilize Kulturarw³ for their research? Are there any potential areas of improvement to the web archive's user capabilities?
The author has analyzed the web archive as a whole using institutional theory in organization studies. The analysis is loosely structured after Staffan Furusten’s model of the outside world in institutional theory in organization studies. The purpose of this is to explain why the web archive looks the way it does today. An understanding of the web archive will better illuminate why any potential areas of improvement identified may or may not be possible for KW3 to implement. The author has conducted email interviews, in-person interviews and digital interviews with the staff responsible for working with Kulturarw³ at the Swedish National Library, Kungliga biblioteket. A draft of guidelines concerning Kulturarw³ from Kungliga biblioteket and a video interview at Internetmuseum with one of the founders of the web archive have also been used as source material for this master's thesis. The author concluded that Kulturarw³ is a national web archive with a long history. Its functions and limitations are complex. Kulturarw³'s operation has changed greatly throughout its lifetime because of the surrounding environment. 
Several main factors that affect Kulturarw³ were identified. Swedish laws, international charters and initiatives, collaborations with and relations to other web archives, the use of open-source software, and the impact of digitalization on Kulturarw³ are discussed in detail. Kulturarw³'s long history of archiving the Swedish web makes it a valuable and plentiful source for research. Its collections and functions should be sufficient for anyone to conduct qualitative research. Yet at the current moment, the web archive is too inaccessible to live up to users' expectations. That makes it an unviable option for research purposes. Unfortunately, there is not a lot Kulturarw³ can currently change to make it more accessible. The lack of readily available information also hinders users from using the web archive at maximum efficiency. There are many opportunities for KB to better inform its users of its value and capabilities. Increased collaboration with Swedish research institutions would also benefit both researchers and the web archive in the long run.
Web archive web archives web archiving Kulturarw³ Kulturarw3 KW3 National Library of Sweden KB cultural heritage digital heritage Webbarkiv webbarkivering webben Kulturarw³ Kulturarw3 KW3 Kungliga biblioteket KB kulturarv digitalt kulturarv institutionell organisationsteori Information Studies Biblioteks- och informationsvetenskap
