Discovering and Tracking Interesting Web Services
Rocco, Daniel J. (Daniel John), 01 December 2004
The World Wide Web has become the standard mechanism for information distribution and scientific collaboration on the Internet. This dissertation research explores a suite of techniques for discovering relevant dynamic sources in a specific domain of interest and for managing Web data effectively. We first explore techniques for discovery and automatic classification of dynamic Web sources. Our approach utilizes a service class model of the dynamic Web that allows the characteristics of interesting services to be specified using a service class description.
To promote effective Web data management, the Page Digest Web document encoding eliminates tag redundancy and places structure, content, tags, and attributes into separate containers, each of which can be referenced in isolation or in conjunction with the other elements of the document. The Page Digest Sentinel system leverages our unique encoding to provide efficient and scalable change monitoring for arbitrary Web documents through document compartmentalization and semantic change request grouping.
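The idea of separating a document into independent containers can be illustrated with a short sketch. This is not the actual Page Digest format, which the abstract does not specify; the class name `ContainerSplitter`, the helper `split_document`, and the container layout are assumptions made purely for illustration. The sketch stores each tag name once and keeps structure, content, and attributes in separate lists.

```python
from html.parser import HTMLParser


class ContainerSplitter(HTMLParser):
    """Illustrative only: splits a document into four separate containers.

    The layout here is hypothetical and is not the Page Digest encoding itself.
    """

    def __init__(self):
        super().__init__()
        self.tags = []        # each distinct tag name stored once
        self.structure = []   # ("open"|"close", tag index) or ("text", content index)
        self.content = []     # text nodes in document order
        self.attributes = []  # (position in structure, name, value)

    def _tag_id(self, name):
        # Reuse the index of a tag name already seen, removing redundancy.
        if name not in self.tags:
            self.tags.append(name)
        return self.tags.index(name)

    def handle_starttag(self, tag, attrs):
        position = len(self.structure)
        self.structure.append(("open", self._tag_id(tag)))
        for name, value in attrs:
            self.attributes.append((position, name, value))

    def handle_endtag(self, tag):
        self.structure.append(("close", self._tag_id(tag)))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text nodes
            self.structure.append(("text", len(self.content)))
            self.content.append(text)


def split_document(html):
    """Parse `html` and return the splitter holding the four containers."""
    splitter = ContainerSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter


if __name__ == "__main__":
    doc = split_document('<ul class="x"><li>a</li><li>b</li></ul>')
    print(doc.tags)
    print(doc.structure)
    print(doc.content)
    print(doc.attributes)
```

With a layout like this, a change monitor could compare only the `content` list of two versions of a page when a request concerns text changes, without touching the structure or attribute containers.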
Finally, we present XPack, an XML document compression system that uses a containerized view of an XML document to provide both good compression and efficient querying over compressed documents. XPack's queryable XML compression format is general-purpose, does not rely on domain knowledge or particular document structural characteristics for compression, and achieves better query performance than standard query processors using text-based XML.
Our research expands the capabilities of existing dynamic Web techniques, providing superior service discovery and classification, efficient change monitoring of Web information, and compartmentalized document handling. DynaBot is the first system to combine a service class view of the Web with a modular crawling architecture to provide automated service discovery and classification. The Page Digest Web document encoding represents Web documents efficiently by separating the individual characteristics of the document. The Page Digest Sentinel change monitoring system uses the Page Digest document encoding for scalable change monitoring through efficient change detection algorithms and intelligent request grouping. Finally, XPack is the first XML compression system that delivers compression rates similar to existing techniques while supporting better query performance than standard query processors using text-based XML.