The development of high-throughput genome sequencing and protein structure determination techniques has provided researchers with a wealth of biological data. However, providing an integrated analysis can be difficult due to the incompatibility of data formats between providers and applications, the strict schema constraints imposed by data providers, and the lack of infrastructure for easily accommodating new semantic information. To address these issues, this thesis first proposes to use Extensible Markup Language (XML) [26] and its supporting query languages as the underlying technology to facilitate seamless, integrated access to heterogeneous biological data and services. XML is used because of its semi-structured nature and its ability to encapsulate both contextual and semantic information. The tree representation of an XML document enables applications to traverse and access data within the document without prior knowledge of its schema. However, in the process of constructing the framework, we identified a number of issues related to the performance of XML technologies, specifically the performance of the XML query processor, the data store and the transformation processor. Hence, this thesis also focuses on finding new solutions to address these issues. For the XML query processor, we propose an efficient structural join algorithm that can be implemented on top of existing relational databases. Experiments show that the proposed method outperforms previous work in both queries and updates. For complicated XML query patterns, a new twig join algorithm called CTwigStack is proposed. In essence, the new approach only produces and merges partial solution nodes that satisfy the entire twig query pattern tree. Experiments show that the proposed algorithm outperforms previous methods in most cases.
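To make the idea of a structural join concrete, the following is a minimal sketch of a generic stack-based ancestor-descendant join over region-encoded nodes. It is not the algorithm proposed in the thesis, and it is not CTwigStack; the function names, the (tag, start, end) labelling and the sample document are assumptions chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

def region_encode(root):
    """Label every element with (tag, start, end) from a depth-first walk.

    Element a is an ancestor of element d exactly when
    a.start < d.start and d.end < a.end.
    """
    labels = []
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        start = counter
        for child in elem:
            visit(child)
        counter += 1
        labels.append((elem.tag, start, counter))

    visit(root)
    labels.sort(key=lambda t: t[1])  # document order
    return labels

def structural_join(ancestors, descendants):
    """Return (ancestor, descendant) pairs; both inputs sorted by start."""
    result = []
    stack = []  # candidate ancestors, each nested inside the one below it
    i = 0
    for d in descendants:
        # Push every candidate ancestor that opens before d.
        while i < len(ancestors) and ancestors[i][1] < d[1]:
            a = ancestors[i]
            while stack and stack[-1][2] < a[1]:
                stack.pop()  # closed before a opens
            stack.append(a)
            i += 1
        # Discard candidates that closed before d opens.
        while stack and stack[-1][2] < d[1]:
            stack.pop()
        # Everything left on the stack encloses d.
        for a in stack:
            result.append((a, d))
    return result

# Hypothetical sample data, used only to exercise the sketch.
doc = ET.fromstring(
    "<db><protein><name>A</name><seq><name>B</name></seq></protein>"
    "<gene><name>C</name></gene></db>"
)
labels = region_encode(doc)
proteins = [n for n in labels if n[0] == "protein"]
names = [n for n in labels if n[0] == "name"]
pairs = structural_join(proteins, names)  # evaluates protein//name
```

Because each input list is scanned once in document order, the join avoids comparing every ancestor with every descendant, which is the general motivation for this family of algorithms.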
For more general cases, a mixed mode twig join is proposed, which combines CTwigStack with existing twig join algorithms. Extensive experimental results show the superior effectiveness of both CTwigStack and the mixed mode twig join. By combining it with existing system information, the mixed mode twig join can serve as a framework for plan selection during XML query optimization. For the XML transformation component, a novel stand-alone, memory-conscious XSLT processor is proposed that requires only a single pass over the input XML dataset. This enables fast transformation of streaming XML data and better handling of complicated XPath selection patterns, including aggregate predicate functions such as the XPath count function. Ultimately, based on the nature of the proposed framework, we believe that solving the performance issues related to the underlying XML components can lead to a more robust framework for integrating heterogeneous biological data sources and services.
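As a rough illustration of what single-pass evaluation of a count-style aggregate can look like, the sketch below streams a document with Python's standard `xml.etree.ElementTree.iterparse` and reports, for each parent element, how many matching descendants it contains. This is not the XSLT processor described in the thesis; the function name, tag names, `id` attribute and sample input are hypothetical, and the sketch assumes that parent elements are not nested inside one another.

```python
import io
import xml.etree.ElementTree as ET

def stream_counts(source, parent_tag, child_tag):
    """Count child_tag descendants of each parent_tag element in one pass."""
    counts = []
    current = None  # running count for the parent element currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == parent_tag:
            current = 0
        elif event == "end":
            if elem.tag == child_tag and current is not None:
                current += 1
            elif elem.tag == parent_tag:
                counts.append((elem.get("id"), current))
                current = None
                # Drop the finished subtree's children and text to limit memory use.
                elem.clear()
    return counts

# Hypothetical sample input, used only to exercise the sketch.
sample = io.BytesIO(
    b'<db><protein id="p1"><domain/><domain/></protein>'
    b'<protein id="p2"/>'
    b'<protein id="p3"><domain/></protein></db>'
)
result = stream_counts(sample, "protein", "domain")
```

The count for each parent is emitted as soon as its closing tag is seen, so the input never needs to be held in memory as a whole or read a second time.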
Identifier | oai:union.ndltd.org:ADTP/258304 |
Date | January 2007 |
Creators | Shui, William Miao, Computer Science & Engineering, Faculty of Engineering, UNSW |
Publisher | Awarded by: University of New South Wales. Computer Science & Engineering |
Source Sets | Australasian Digital Theses Program |
Language | English |
Detected Language | English |
Rights | Copyright Shui William Miao., http://unsworks.unsw.edu.au/copyright |