1. Accession the Web: Preserving Access to Online Cultural Heritage / Tenney, Martha Sarabeth, 15 August 2012
The Web is now recognized as a cultural artifact worthy of preservation and study; however, the rhizomatic, dynamic nature of online production, the accelerating rate of innovation on the live Web, and the sheer quantity of online records all pose challenges to preserving access to online cultural heritage. Moreover, whole-Web archiving efforts such as the Internet Archive frequently miss sites that are not well linked from other sites, including the marginalized and fringe materials that are most important in building a thick cultural history of online life.
This paper argues that archives and other collecting institutions are uniquely positioned to preserve online heritage in the form of cultural subject Web archives. Such institutions have the intellectual capital and the technical capabilities, as well as the cultural responsibility, to create collections that reflect the diversity of online life and that best serve potential future users. To build these collections, archivists and other information professionals will need a new set of skills. This paper proposes theoretical and technical approaches to selection and access for cultural Web collections, with helpful tools and model projects to guide the discussion.

2. Performance Measurement and Analysis of Transactional Web Archiving / Maharshi, Shivam, 19 July 2017
Web archiving is necessary to retain the history of the World Wide Web and to study its evolution, and it matters greatly to the cultural heritage community. Some organizations are legally obligated to capture and archive Web content. Transactional Web archiving makes the archiving process more efficient, helping such organizations archive their Web content.
This study measures and analyzes the performance of transactional Web archiving systems. To conduct a detailed analysis, we construct a meaningful design space defined by the system specifications that determine the performance of these systems. SiteStory, a state-of-the-art transactional Web archiving system, and local archiving, an alternative archiving technique, are used in this research. We experimentally evaluate the performance of these systems using the Greek version of Wikipedia deployed on dedicated hardware on a private network. Our benchmarking results show that the local archiving technique uses a Web server's resources more efficiently than SiteStory for one data point in our design space; its better performance in such scenarios makes local archiving favorable for transactional archiving. We also show that SiteStory does not impose any significant performance overhead on the Web server for the remaining data points in our design space.

Master of Science

Web archiving is the process of preserving the information available on the World Wide Web in archives. It gives historians and cultural heritage scholars access to data that allows them to understand the evolution of the Internet and its usage. Web archiving is also essential for organizations that are obligated to keep records of online resource access for their customers. Transactional Web archiving is a technique in which information on the Web is archived by capturing each transaction between a user and the Web server processing the user's request. It provides a more complete and accurate history of a Web server than traditional Web archiving models; however, in some scenarios transactional archiving solutions may degrade the performance of the Web server being archived.
In this thesis, we conduct a detailed performance analysis of SiteStory, a state-of-the-art transactional Web archiving solution, in various experimental settings. Furthermore, we propose a novel transactional Web archiving approach and compare its performance with SiteStory's. To conduct a realistic study, we analyze real-life traffic on the Greek Wikipedia website and generate similar traffic for our benchmarking experiments. Our results show that our archiving technique uses a Web server's resources more efficiently than SiteStory in some scenarios, making it favorable for transactional archiving there. We also show that SiteStory does not impose any significant performance overhead on the Web server in the other scenarios.
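The core idea of transactional archiving, capturing each request/response pair as the server serves it, can be illustrated with a minimal sketch. SiteStory itself is a separate system with its own architecture; the middleware, application, and record layout below are purely illustrative stand-ins, not SiteStory's actual design:

```python
# Hypothetical sketch of transactional Web archiving: a middleware that
# captures each served transaction, so the archive reflects exactly what
# users saw. This is an illustration of the general idea only, not how
# SiteStory is implemented.
import datetime


class ArchivingMiddleware:
    """Wraps a request handler and records every transaction it serves."""

    def __init__(self, app, archive):
        self.app = app
        self.archive = archive  # list of archived transaction records

    def __call__(self, path):
        body = self.app(path)
        # Store a snapshot keyed by URI and capture time, similar in
        # spirit to a WARC "response" record.
        self.archive.append({
            "uri": path,
            "captured_at": datetime.datetime.now(datetime.timezone.utc),
            "body": body,
        })
        return body


def toy_app(path):
    # Stand-in for a real Web application (e.g. a wiki serving a page).
    return f"<html><body>Content of {path}</body></html>"


archive = []
app = ArchivingMiddleware(toy_app, archive)
app("/wiki/Main_Page")
app("/wiki/Main_Page")  # a second visit yields a second capture
```

Because every response served also triggers an archival write, the overhead this imposes on the Web server is exactly what the benchmarking experiments above set out to measure.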

3. Improving Web Search Ranking Using the Internet Archive / Li, Liyan, 2 June 2020
Current web search engines retrieve relevant results based only on the latest content of web pages stored in their indices, despite the fact that many web resources update frequently. We explore possible techniques and data sources for improving web search result ranking using the historical content changes of web pages. We compare web pages with their previous versions and separately model the texts and relevance signals in the newly added, retained, and removed parts. We particularly examine the Internet Archive, the largest web archiving service to date, for its effectiveness in improving web search performance. We experiment with several retrieval techniques, including language modeling approaches that use refined document and query representations built by comparing current web pages to previous versions, and learning-to-rank methods for combining relevance features across versions of web pages. Experimental results on two large-scale retrieval datasets (ClueWeb09 and ClueWeb12) suggest that using web page content change history to improve web search performance is promising. However, the actual effectiveness at present is limited by the practical coverage of the Internet Archive and by how much of the relevant information for search queries comes from regularly changing resources. Our work is a first step in a promising area combining web search and web archiving, and discloses new opportunities for commercial search engines and web archiving services.

Master of Science

Current web search engines rank search results based only on the most recent version of web pages stored in their database, despite the fact that many web resources update frequently. We explore possible techniques and data sources for improving web search result ranking using the historical content changes of web pages. We compare web pages with their previous versions and identify the newly added, retained, and removed parts. We particularly examine the Internet Archive, the largest web archiving service to date, for its effectiveness in improving web search performance. We experiment with several retrieval techniques, including language modeling approaches that use refined document and query representations built by comparing current web pages to previous versions, and learning-to-rank methods for combining relevance features across versions of web pages. Experimental results on two large-scale retrieval datasets (ClueWeb09 and ClueWeb12) suggest that using web page content change history to improve web search performance is promising. However, the actual effectiveness at present is limited by the practical coverage of the Internet Archive and by how much of the relevant information for search queries comes from frequently changing resources. Our work is a first step in a promising area combining web search and web archiving, and discloses new opportunities for commercial search engines and web archiving services.
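The version-comparison step described above, splitting a page's terms into newly added, retained, and removed parts relative to an archived version, can be sketched in a few lines. This is only the partitioning; the thesis builds language models and relevance features over these parts, which the sketch does not attempt:

```python
# Minimal sketch of comparing a current page against an archived
# previous version: partition the vocabulary into newly added,
# retained, and removed term sets. Tokenization here is deliberately
# naive (lowercased whitespace splitting) for illustration.

def partition_terms(current: str, previous: str):
    cur = set(current.lower().split())
    prev = set(previous.lower().split())
    return {
        "added": cur - prev,       # terms only in the current version
        "retained": cur & prev,    # terms present in both versions
        "removed": prev - cur,     # terms only in the previous version
    }


parts = partition_terms(
    current="web archiving improves search ranking",
    previous="web archiving preserves history",
)
```

Retained terms plausibly signal stable, core content, while added and removed terms capture the change history that the retrieval models exploit.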

4. Performance Evaluation of Web Archiving Through In-Memory Page Cache / Vishwasrao, Saket Dilip, 23 June 2017
This study proposes and evaluates a new method for Web archiving. We leverage the caching infrastructure in Web servers for archiving: Redis serves as the page cache, and its persistence mechanism is exploited for archiving. We experimentally evaluate the performance of our archival technique using the Greek version of Wikipedia deployed on Amazon cloud infrastructure. We show that archiving slightly increases the latency of rendered pages. Although server performance is comparable at larger page cache sizes, the maximum throughput the server can handle decreases significantly at smaller cache sizes because archiving triggers more disk write operations. Since pages are dynamically rendered and Wikipedia's technology stack is used extensively in many Web applications, our results should have broad impact.

Master of Science

This study proposes and evaluates a new method for Web archiving. To reduce the time needed to serve web pages, Web servers store recently rendered pages in memory, a process known as caching. We modify this caching mechanism for archival and experimentally evaluate the impact of our technique on Web servers. We observe that the time to render a Web page increases only slightly as long as the Web server is under moderate load. Through our experiments, we establish limits on the maximum number of requests a Web server can handle without increasing response time. Our experiments use technologies that are widely deployed today, so our results should have broad impact.
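The design can be illustrated without a Redis server by simulating a page cache whose write path also persists each newly rendered page to an archive. In the actual system the cache is Redis and durability comes from Redis's own persistence mechanism; here a dict and a list play those roles, so the class and method names are illustrative only:

```python
# Illustrative stand-in for the Redis-based design: an LRU page cache
# whose miss path both caches the rendered page and persists it to an
# archive. Cache hits serve from memory and trigger no archival write,
# which is why smaller caches (more misses) mean more disk I/O.
from collections import OrderedDict


class ArchivingPageCache:
    def __init__(self, capacity, archive):
        self.capacity = capacity
        self.cache = OrderedDict()   # in-memory LRU page cache
        self.archive = archive       # persistent archive (simulated)

    def render(self, url, render_fn):
        if url in self.cache:               # hit: no archival write
            self.cache.move_to_end(url)
            return self.cache[url]
        page = render_fn(url)               # miss: render the page
        self.cache[url] = page
        self.archive.append((url, page))    # archival write (extra I/O)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return page


archive = []
cache = ArchivingPageCache(capacity=2, archive=archive)
render = lambda url: f"<html>{url}</html>"
cache.render("/A", render)
cache.render("/B", render)
cache.render("/A", render)  # hit: no new archival write
cache.render("/C", render)  # miss: /B evicted, /C archived
```

Shrinking the cache raises the miss rate and hence the number of archival writes, which mirrors the study's finding that maximum throughput drops significantly at smaller cache sizes.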

5. Archivage du Web organisationnel dans une perspective archivistique / Chebbi, Aïda, 12 1900
Le Web représente actuellement un espace privilégié d’expression et d’activité pour plusieurs communautés, où pratiques communicationnelles et pratiques documentaires s’enrichissent mutuellement. Dans sa dimension visible ou invisible, le Web constitue aussi un réservoir documentaire planétaire caractérisé non seulement par l’abondance de l’information qui y circule, mais aussi par sa diversité, sa complexité et son caractère éphémère.
Les projets d’archivage du Web en cours abordent pour beaucoup cette question du point de vue de la préservation des publications en ligne sans la considérer dans une perspective archivistique. Seuls quelques projets d’archivage du Web visent la préservation du Web organisationnel ou gouvernemental. La valeur archivistique du Web, notamment du Web organisationnel, ne semble pas être reconnue malgré un effort soutenu de certaines archives nationales à diffuser des politiques d’archivage du Web organisationnel.
La présente thèse a pour but de développer une meilleure compréhension de la nature des archives Web et de documenter les pratiques actuelles d’archivage du Web organisationnel. Plus précisément, cette recherche vise à répondre aux trois questions suivantes :
(1) Que recommandent en général les politiques d’archivage du Web organisationnel?
(2) Quelles sont les principales caractéristiques des archives Web?
(3) Quelles pratiques d’archivage du Web organisationnel sont mises en place dans des organisations au Québec?
Pour répondre à ces questions, cette recherche exploratoire et descriptive a adopté une approche qualitative basée sur trois modes de collecte des données, à savoir : l’analyse d’un corpus de 55 politiques et documents complémentaires relatifs à l’archivage du Web organisationnel; l’observation de 11 sites Web publics d’organismes au Québec de même que l’observation d’un échantillon de 737 documents produits par ces systèmes Web; et, enfin, des entrevues avec 21 participants impliqués dans la gestion et l’archivage de ces sites Web.
Les résultats de recherche démontrent que les sites Web étudiés sont le produit de la conduite des activités en ligne d’une organisation et documentent, en même temps, les objectifs et les manifestations de sa présence sur le Web. De nouveaux types de documents propres au Web organisationnel ont pu être identifiés. Les documents qui ont migré sur le Web ont acquis un autre contexte d’usage et de nouvelles caractéristiques. Les méthodes de gestion actuelles doivent prendre en considération les propriétés des documents dans un environnement Web.
Alors que certains sites d’étude n’archivent pas leur site Web public, d’autres s’y investissent. Toutefois les choix établis ne correspondent pas toujours aux recommandations proposées dans les politiques d’archivage du Web analysées et ne garantissent pas la pérennité des archives Web ni leur exploitabilité à long terme.
Ce constat nous a amenée à proposer une politique type adaptée aux caractéristiques des archives Web. Ce modèle décrit les composantes essentielles d’une politique pour l’archivage des sites Web ainsi qu’un éventail des mesures que pourrait mettre en place l’organisation en fonction des résultats d’une analyse des risques associés à l’usage de son site Web public dans la conduite de ses affaires.

Several communities have adopted the web as a privileged space for expression and activity, where communication and documentary practices complement and enrich each other. Both in its visible and invisible dimensions, the web is a documentation vault of planetary scope, characterized not only by the sheer amount of information it contains, but by its diversity, complexity and ephemeral nature.
Most current web archiving projects focus on preserving online publications without considering their value from an archival point of view. Only a few web archiving projects target the preservation of organisational or governmental web records. The archival value of the web, especially that of organisational web, does not seem to be justly recognized, despite the continuous effort deployed by certain National Archives bodies to disseminate best practices and policies for organisational web archiving.
This thesis aims to develop a better understanding of the nature of web records and to document current archiving practices intended for organisational web. In particular, this research will look at the following three questions:
(1) What general recommendations can be found in archiving policies for organisational web?
(2) What are the main characteristics of web records?
(3) Which web record-keeping practices have been deployed in Quebec organisations?
To address these questions, this exploratory and descriptive research uses a qualitative approach based on three data collection methods, namely: the analysis of a body of 55 policies and supporting documents related to organisational web archiving; the scrutiny of 11 public websites of various Quebec organisations and of a sample of 737 documents generated by those web systems; and interviews with 21 individuals that are involved in the management and archiving of those websites.
The results of this research show that the observed sites are the product of an organisation’s online activity and that they simultaneously document the objectives and the occurrences of an organisation’s web presence. New types of documents that are specific to organisational web have been identified. Documents that have migrated online have acquired a different context of use and new characteristics. Hence, current document management methods must consider the unique properties of documents in a web environment.
Only a portion of the observed organisations are involved in the process of archiving their public website. Additionally, the chosen archiving strategies are not always consistent with the recommendations found in web archiving policies, and guarantee neither the preservation of web records nor their long-term usability.
Those results led us to design a standard policy model adapted to the particular properties of web archives. This model describes the essential components of a web archiving policy and proposes a range of measures that an organisation could implement based on the results of a risks analysis of their public website’s uses in a business context.

7. A Grounded Theory of Information Quality in Web Archives / Reyes, Brenda, 08 1900
Web archiving is the practice of preserving websites as a historical record. It is a technologically challenging endeavor whose goal is a high-quality archived website: one that looks and behaves exactly like the original. Despite the importance of the notion of quality, comprehensive definitions of Information Quality (IQ) for web archives have yet to be developed; the field has no single, comprehensive theory describing what makes an archived website high or low quality. Furthermore, most research on web archives has been system-centered rather than user-centered, leading to a dearth of information on how humans perceive web archives. This dissertation seeks to remedy this problem by presenting a user-centered grounded theory of IQ for web archives. It answers two research questions: 1) What is the definition of information quality (IQ) for web archives? and 2) How can IQ in a web archive be measured? The theory is grounded in data obtained from users of the Internet Archive's Archive-It system, the largest web archiving subscription service in the United States. Also presented are mathematical definitions for each dimension of IQ, which can be applied to measure the quality of a web archive.
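The dissertation derives its own mathematical definitions for each IQ dimension, which are not reproduced here. As a hedged illustration of the kind of measure involved, one plausible dimension is completeness: the fraction of a page's required embedded resources (images, stylesheets, scripts) that the archive actually captured. The function below is an assumed example, not a formula from the dissertation:

```python
# Illustrative example of a per-dimension IQ measure. "Completeness" is
# expressed here as the fraction of a page's required resources that
# were successfully captured; a score of 1.0 means every resource the
# page needs is present in the archive.

def completeness(required_resources, captured_resources):
    """Fraction of required resources present in the archive (0.0-1.0)."""
    required = set(required_resources)
    if not required:
        return 1.0  # a page needing nothing is trivially complete
    return len(required & set(captured_resources)) / len(required)


score = completeness(
    required_resources=["/style.css", "/logo.png", "/app.js", "/bg.jpg"],
    captured_resources=["/style.css", "/logo.png", "/app.js"],
)
```

A quantitative measure of this shape makes quality comparable across captures, which is what "how can IQ in a web archive be measured?" asks for.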

8. Intelligent Event Focused Crawling / Farag, Mohamed Magdy Gharib, 23 September 2016
There is a need for an integrated event focused crawling system to collect Web data about key events. When an event occurs, many users try to locate the most up-to-date information about it, yet there is little systematic collecting and archiving of information about events. We propose intelligent event focused crawling for automatic event tracking and archiving, as well as effective access. We extend traditional focused (topical) crawling techniques in two directions, modeling and representing both events and webpage source importance.
We developed an event model that can capture key event information (topical, spatial, and temporal). We incorporated that model into the focused crawler algorithm. For the focused crawler to leverage the event model in predicting a webpage's relevance, we developed a function that measures the similarity between two event representations, based on textual content.
Although the textual content provides a rich set of features, we proposed an additional source of evidence that allows the focused crawler to better estimate the importance of a webpage by considering its website. We estimated webpage source importance as the ratio of relevant to non-relevant webpages found while crawling a website, and combined the textual content information and source importance into a single relevance score.
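The combination of the two evidence sources can be sketched under assumed forms: cosine similarity between bag-of-words representations for the textual part, and a normalized relevant-to-total fraction for source importance (the text above describes a ratio of relevant to non-relevant pages; the normalized form is used here so the score stays in [0, 1]). The exact similarity function and the weight `alpha` in the thesis may differ; both are illustrative:

```python
# Sketch of combining textual similarity and source importance into a
# single relevance score. Bag-of-words cosine similarity and the linear
# combination weight are assumptions for illustration, not the thesis's
# exact formulation.
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def source_importance(relevant_pages, non_relevant_pages):
    # Fraction of crawled pages from this website that were relevant.
    total = relevant_pages + non_relevant_pages
    return relevant_pages / total if total else 0.0


def relevance(event_repr, page_text, rel, non_rel, alpha=0.7):
    return (alpha * cosine_similarity(event_repr, page_text)
            + (1 - alpha) * source_importance(rel, non_rel))


score = relevance(
    event_repr="earthquake ecuador april 2016 damage",
    page_text="earthquake damage reported in ecuador",
    rel=8, non_rel=2,
)
```

A page from a site that has already yielded mostly relevant pages gets a boost even when its own text is an imperfect match, which is the intuition behind adding source importance to the crawler's frontier ranking.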
For the focused crawler to work well, it needs a diverse set of high quality seed URLs (URLs of relevant webpages that link to other relevant webpages). Although manual curation of seed URLs guarantees quality, it requires exhaustive manual labor. We proposed an automated approach for curating seed URLs using social media content. We leveraged the richness of social media content about events to extract URLs that can be used as seed URLs for further focused crawling.
We evaluated our system through four series of experiments, using recent events: the Orlando shooting, Ecuador earthquake, Panama Papers, California shooting, Brussels attack, Paris attack, and Oregon shooting. In the first series, our proposed event model representation, used to predict webpage relevance, outperformed the topic-only approach, showing better precision, recall, and F1-score. In the second series, using harvest ratio to measure the ability to collect relevant webpages, our event model-based focused crawler outperformed a state-of-the-art focused crawler (best-first search). The third series evaluated the effectiveness of our proposed webpage source importance for collecting more relevant webpages: the focused crawler with source importance collected roughly the same number of relevant webpages as the crawler without it, but from a smaller set of sources. The fourth series provides guidance to archivists regarding the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection.

Ph.D.

9. Born Digital Legal Deposit Policies and Practices / Zarndt, Frederick; Carner, Dorothy; McCain, Edward, 16 October 2017
In 2014, the authors surveyed born digital content legal deposit policies and practices in 17 different countries and presented the results at the 2015 International News Media Conference hosted by the National Library of Sweden in Stockholm, April 15-16, 2015. Three years later, the authors expanded their team and updated the survey in order to assess progress in creating or improving national policies and in implementing practices for preserving born digital content. The reach of the 2017 survey was broadened to include countries that did not participate in the 2014 survey.
To optimise survey design and allow for comparability of results with previous surveys, the authors briefly review 17 efforts over the last 12 years to understand the state of digital legal deposit and broader digital preservation policies (a deeper analysis will be provided in a future paper), and then set out the logic behind the current survey.

10. Automatizovaná rekonstrukce webových stránek / Automatic Webpage Reconstruction / Serečun, Viliam, January 2018
Many legal institutions require a burden of proof regarding web content. This thesis addresses problems connected with web reconstruction and archiving. Its primary goal is to provide an open source solution that satisfies the requirements of legal institutions. The work presents two main products. The first is a framework that serves as a fundamental building block for developing web scraping and web archiving applications. The second is a web application prototype that demonstrates how the framework can be used. The application's output is a MAFF archive file comprising the reconstructed web page, a screenshot of the page, and a table of meta information. This table records details about the collected data and the server, such as the IP address and port of the host serving the original web page, along with a timestamp.
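Since MAFF is a ZIP-based container holding the saved page plus metadata, the assembly step can be sketched as follows. The internal file names and the JSON metadata layout below are simplifications for illustration; the actual MAFF format and the prototype's metadata table differ in detail:

```python
# Simplified sketch of assembling a MAFF-style archive: a ZIP container
# with the reconstructed page and a metadata record. A real archive
# would also include the screenshot and follow the MAFF layout exactly;
# file names and the metadata schema here are illustrative.
import io
import json
import zipfile
from datetime import datetime, timezone


def build_archive(url, html, server_ip, server_port):
    buf = io.BytesIO()
    meta = {
        "original_url": url,
        "server_ip": server_ip,       # assumed metadata fields, mirroring
        "server_port": server_port,   # the meta information table above
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("page/index.html", html)                  # saved page
        zf.writestr("page/metadata.json", json.dumps(meta))   # meta record
        # A real archive would also add page/screenshot.png here.
    return buf.getvalue()


data = build_archive(
    url="http://example.com/",
    html="<html><body>Example</body></html>",
    server_ip="203.0.113.7",  # illustrative address
    server_port=80,
)
```

Bundling page, screenshot, and capture metadata into one self-contained file is what lets the archive serve as evidence: the record of where and when the content was fetched travels with the content itself.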