The nature of semistructured data in web collections is evolving. Even when XML web documents are valid with regard to a schema, the actual structure of such documents exhibits significant variations across collections for several reasons: an XML schema may be very lax (e.g., to accommodate the flexibility needed to represent collections of documents in RSS feeds), a schema may be large and different subsets used for different documents (e.g., this is common in industry standards like UBL), or open content models may allow arbitrary schemas to be mixed (e.g., RSS extensions like those used for podcasting). A schema alone may not provide sufficient information for many data management tasks that require knowledge of the actual structure of the collection.
Web applications (such as processing RSS feeds or web service messages) rely on XPath-based data manipulation tools. Web developers need to use XPath queries effectively on increasingly larger web collections containing hundreds of thousands of XML documents. Even when tasks only need to deal with a single document at a time, developers benefit from understanding the behaviour of XPath expressions across multiple documents (e.g., what will a query return when run over the thousands of hourly feeds collected during the last few months?). Dealing with the (highly variable) structure of such web collections poses additional challenges.
This thesis introduces DescribeX, a powerful framework that is capable of describing arbitrarily complex XML summaries of web collections, providing support for more efficient evaluation of XPath workloads. DescribeX permits the declarative description of document structure using all axes and language constructs in XPath, and generalizes many of the XML indexing and summarization approaches in the literature. DescribeX supports the construction of heterogenous summaries where different document elements sharing a common structure can be declaratively defined and refined by means of path regular expressions on axes, or axis path regular expression (AxPREs). DescribeX can significantly help in the understanding of both the structure of complex, heterogeneous XML collections and the behaviour of XPath queries evaluated on them.
Experimental results demonstrate the scalability of DescribeX summary refinements and stabilizations (the key enablers for tailoring summaries) with multi-gigabyte web collections. A comparative study suggests that using a DescribeX summary created from a given workload can produce query evaluation times orders of magnitude better than using existing summaries. DescribeX’s light-weight approach of combining summaries with a file-at-a-time XPath processor can be a very competitive alternative, in terms of performance, to conventional fully-fledged XML query engines that provide DB-like functionality such as security, transaction processing, and native storage.
Identifer | oai:union.ndltd.org:TORONTO/oai:tspace.library.utoronto.ca:1807/17254 |
Date | 26 February 2009 |
Creators | Rizzolo, Flavio Carlos |
Contributors | Miller, Renee J. |
Source Sets | University of Toronto |
Language | en_ca |
Detected Language | English |
Type | Thesis |
Format | 2404488 bytes, application/pdf |
Page generated in 0.0021 seconds