A large amount of online information resides on the invisible web: web pages generated dynamically from databases and other data sources that are hidden from current crawlers, which retrieve content only from the publicly indexable Web. In particular, such crawlers ignore the tremendous amount of high-quality content "hidden" behind search forms, as well as pages in large searchable electronic databases that require authorization or prior registration. To extract data from the hidden web, a crawler must find search forms and fill them with appropriate values so as to retrieve the most relevant information. To meet the complex challenges that arise when searching the hidden web, namely the extensive analysis required of both the search forms and the retrieved results, we design and implement a distributed web crawler that runs on a network of workstations. We describe the software architecture of this distributed, scalable system and present a number of novel techniques in its design and implementation that extract maximally relevant data from the hidden web while achieving high performance.
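The abstract does not include the system's code, but the form-discovery-and-filling step it describes can be illustrated with a minimal Python sketch. This is not the thesis implementation: the probe keyword, seed URL, and helper names are hypothetical, and it assumes the third-party requests and beautifulsoup4 packages. It shows the basic cycle a hidden-web crawler performs on each page: locate HTML forms, fill free-text inputs with a candidate term, keep default values for other fields, and submit to obtain database-backed result pages.

```python
# Illustrative sketch only (not the thesis system): discover search forms
# on a page, fill them, and submit to reach hidden-web content.
# Assumes the `requests` and `beautifulsoup4` packages are installed.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

PROBE_KEYWORD = "database"  # hypothetical seed term for free-text inputs


def find_search_forms(page_url: str):
    """Yield (action_url, method, filled_fields) for each form on the page."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for form in soup.find_all("form"):
        action = urljoin(page_url, form.get("action", ""))
        method = form.get("method", "get").lower()
        fields = {}
        for inp in form.find_all("input"):
            name = inp.get("name")
            if not name:
                continue
            if inp.get("type", "text").lower() == "text":
                # Fill free-text inputs with the probe keyword.
                fields[name] = PROBE_KEYWORD
            else:
                # Keep defaults for hidden/submit/checkbox fields.
                fields[name] = inp.get("value", "")
        yield action, method, fields


def submit_form(action: str, method: str, fields: dict) -> str:
    """Submit the filled form and return the resulting (hidden-web) page."""
    if method == "post":
        return requests.post(action, data=fields, timeout=10).text
    return requests.get(action, params=fields, timeout=10).text


if __name__ == "__main__":
    seed = "http://example.com/search"  # hypothetical seed URL
    for action, method, fields in find_search_forms(seed):
        result_html = submit_form(action, method, fields)
        print(action, len(result_html))
```

In a distributed deployment such as the one the abstract describes, each workstation would run this probe loop over its own partition of seed URLs, with the resulting pages handed off for the further analysis of forms and retrieved content that the thesis discusses.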
Identifier | oai:union.ndltd.org:GEORGIA/oai:digitalarchive.gsu.edu:cs_theses-1046
Date | 03 August 2007
Creators | Desai, Lovekeshkumar |
Publisher | Digital Archive @ GSU |
Source Sets | Georgia State University |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | Computer Science Theses |