Global ETD Search

21	A Scalable P2P RIA Crawling System with Fault Tolerance Ben Hafaiedh, Khaled January 2016 Rich Internet Applications (RIAs) have been widely used on the web over the last decade because they are more responsive and user-friendly than traditional web applications. RIAs use client-side scripting such as JavaScript, which allows asynchronous communication with the server using AJAX (Asynchronous JavaScript and XML). Because RIAs are large and therefore take a long time to crawl, distributed RIA crawling has been introduced to reduce the crawling time. However, current RIA crawling systems are not scalable, i.e. they are limited to a relatively low number of crawlers. Furthermore, they do not tolerate a failure in one of their components. In this research, we address the scalability and resilience problems of crawling RIAs in a distributed environment and explore the design of an efficient RIA crawling system that is scalable and fault-tolerant. Our approach is to partition the search space among several storage devices (distributed databases) over a peer-to-peer (P2P) network, where each database is responsible for storing only a portion of the RIA graph. This removes the single point of failure from the distributed data structure. However, accessing the distributed data required by crawlers makes the crawling task challenging when the number of crawlers becomes high. We show by simulation results and analytical reasoning that our system is scalable and fault-tolerant. Furthermore, simulation results show that the crawling time of the P2P crawling system is significantly shorter than that of both the non-distributed crawling system and the distributed crawling system using a single database. Fault Tolerance; Data Recovery; Rich Internet Applications; Web Crawling; RIA Crawling; Distributed RIA Crawling; P2P Networks; Graph Exploration
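The partitioning idea in entry 21 can be illustrated with a minimal sketch. The abstract does not say how states are assigned to databases, so the hash-based assignment below is an assumption made for illustration, and the names `owner_of` and `partition` are hypothetical rather than taken from the thesis.

```python
import hashlib


def owner_of(state_id: str, num_peers: int) -> int:
    """Return the index of the peer database assumed to store this state.

    Assumption: states are assigned by hashing their identifier. The
    abstract only says each database holds a portion of the RIA graph.
    """
    digest = hashlib.sha256(state_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_peers


def partition(state_ids, num_peers):
    """Group state identifiers by the peer that owns them."""
    shards = {peer: [] for peer in range(num_peers)}
    for state_id in state_ids:
        shards[owner_of(state_id, num_peers)].append(state_id)
    return shards
```

Because the owner depends only on the state identifier, any crawler can compute which database to query without consulting a central index. A plain modulo scheme like this reshuffles most states when the number of peers changes, which a real P2P system would need to avoid.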
22	Extrakce informací z webových stránek / Information Extraction from Web Pages Bukovčák, Jakub January 2019 This master's thesis focuses on current technologies for downloading web pages and extracting structured information from them. It describes the available tools that make this process possible and easier. Another part of the thesis gives an overview of technologies that can be used for creating web pages. It also covers the development of information systems with a web user interface based on the Java Enterprise Edition (Java EE) platform. The main part of the thesis describes the design and implementation of an application used to specify and manage extraction tasks. The last part describes testing of the application on real web pages and evaluates the results achieved.
23	Lokman: A Medical Ontology Based Topical Web Crawler Kayisoglu, Altug 01 September 2005 Use of ontology is an approach to overcoming the “search-on-the-net” problem. An ontology-based web information retrieval system requires a topical web crawler to construct a high-quality document collection. This thesis focuses on implementing a topical web crawler with a medical domain ontology in order to determine the advantages of ontological information in web crawling. The crawler is implemented with the Best-First search algorithm, and its design is optimized for the UMLS ontology. The crawler is tested with the Harvest Rate and Target Recall metrics and compared to a non-ontology-based Best-First crawler. Test results showed that using the ontology in the crawler's URL selection algorithm improved crawler performance by 76%. QA Computer Software 76.75-76.765
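As a rough illustration of the Best-First strategy and the Harvest Rate metric named in entry 23, the sketch below uses a priority queue as the crawl frontier. The `links` dictionary and the `score` function are hypothetical stand-ins for fetched pages and an ontology-based relevance estimate; neither comes from the thesis. Harvest rate is taken here in its usual sense, the fraction of crawled pages that are relevant.

```python
import heapq


def best_first_order(seed_urls, links, score):
    """Return URLs in visit order, always expanding the best-scored one.

    links: dict mapping a URL to the URLs it links to (stand-in for fetching).
    score: function giving a relevance estimate for a URL (hypothetical).
    """
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-score(url), url) for url in set(seed_urls)]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    visited = []
    while frontier:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return visited


def harvest_rate(visited, relevant):
    """Fraction of visited pages that are in the relevant set."""
    if not visited:
        return 0.0
    return sum(1 for url in visited if url in relevant) / len(visited)
```

A non-ontology baseline could be compared by passing a different `score` function to the same loop and computing `harvest_rate` on both visit orders.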
24	ALUMNI TOOL: INFORMATION RECOVERY OF PERSONAL DATA ON THE WEB IN AUTHENTICATED SOCIAL NETWORKS / ALUMNI TOOL: RECUPERAÇÃO DE DADOS PESSOAIS NA WEB EM REDES SOCIAIS AUTENTICADAS LUIS GUSTAVO ALMEIDA 02 August 2018 The use of search bots to collect information for a given context has always been a challenging problem and has grown substantially in recent years. For example, search bots may be used to capture data from professional social networks. In particular, such networks make it possible to study the professional trajectories of the alumni of a given university and to answer questions such as: how long does a former student of PUC-Rio take to reach a management position? However, a common problem in this scenario is the inability to collect information because of authentication systems, which prevent a search bot from accessing certain pages and content. This dissertation presents a solution for capturing data that circumvents the authentication problem and automates the data collection process. The proposed solution collects data from user profiles on a professional social network for storage in a database and later analysis. The dissertation also considers the possibility of adding several other data sources, with emphasis on a data warehouse structure. INFORMATION RETRIEVAL; WEB CRAWLING; DATA RETRIEVAL; BIG DATA; BOTS; SOCIAL MEDIA; SELENIUM; SCRAPING; SEARCH ENGINE; WEB SPIDER
