IRLbot: design and performance analysis of a large-scale web crawler

This thesis shares our experience in designing web crawlers that scale to billions
of pages and models their performance. We show that with the quadratically increasing
complexity of verifying URL uniqueness, breadth-first search (BFS) crawl order,
and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope
with the sheer volume of URLs generated in large crawls, highly branching spam,
legitimate multi-million-page blog sites, and infinite loops created by server-side scripts.
We offer a set of techniques for dealing with these issues and test their performance
in an implementation we call IRLbot. In our recent experiment that lasted 41 days,
IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages
(7.6 billion connection requests) and sustained an average download rate of 319 Mb/s
(1,789 pages/s). Unlike our prior experiments with algorithms proposed in related
work, this version of IRLbot did not experience any bottlenecks and successfully
handled content from over 117 million hosts, parsed out 394 billion links, and discovered
a subset of the web graph with 41 billion unique nodes.
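
The throughput figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is not part of the thesis. It assumes the crawl ran continuously for the full 41 days, that "Mb" means 10^6 bits, and that the whole download rate is attributable to valid HTML pages (it presumably also covers other traffic, so the per-page size is a rough upper estimate). All function and constant names are illustrative.

```python
# Figures as quoted in the abstract (rounded as published).
DAYS = 41
PAGES = 6.3e9            # valid HTML pages crawled
RATE_MBPS = 319          # average download rate, megabits per second
PAGES_PER_SEC = 1789     # reported average page rate

SECONDS_PER_DAY = 86_400


def implied_pages_per_second(pages: float, days: float) -> float:
    """Average page rate if the crawl ran continuously for `days` days."""
    return pages / (days * SECONDS_PER_DAY)


def implied_bytes_per_page(rate_mbps: float, pages_per_sec: float) -> float:
    """Average bytes per page, assuming 1 Mb = 10**6 bits (an assumption)."""
    return rate_mbps * 1e6 / 8 / pages_per_sec


if __name__ == "__main__":
    print(f"implied rate: {implied_pages_per_second(PAGES, DAYS):.0f} pages/s")
    print(f"implied size: {implied_bytes_per_page(RATE_MBPS, PAGES_PER_SEC) / 1000:.1f} kB/page")
```

Under these assumptions the implied rate comes out slightly below the reported 1,789 pages/s, which is consistent with the page count and duration being rounded, and the implied average page size is on the order of 22 kB.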

http://hdl.handle.net/1969.1/85914

Measurement

Performance

Identifier	oai:union.ndltd.org:tamu.edu/oai:repository.tamu.edu:1969.1/85914
Date	10 October 2008
Creators	Lee, Hsin-Tsang
Contributors	Loguinov, Dmitri
Publisher	Texas A&M University
Source Sets	Texas A and M University
Language	en_US
Detected Language	English
Type	Book, Thesis, Electronic Thesis, text
Format	electronic, born digital

