The Extensible Markup Language (XML) has become an increasingly popular format for representing and exchanging data. Its flexible and extensible syntax makes it suitable for representing structured data, textual information, or a mixture of both. The popularization of XML has led to the development of a new database type. XML databases serve as repositories for large collections of XML documents, and seek to provide the same benefits for XML data as relational databases provide for relational data: indexing, transactional processing, failsafe physical storage, querying of collections, and so on. There are two standardized query languages for XML, XQuery and XPath, both of which are powerful for querying and navigating the structure of XML. However, they offer limited support for full-text search, and cannot be used alone for typical Information Retrieval (IR) applications. To address IR-related issues in XML, a new standard is emerging as an extension to XPath and XQuery: XQuery and XPath Full Text 1.0 (XQFT). XQFT is carefully investigated to determine how well-known IR techniques apply to XML, and the characteristics of full-text search and indexing in existing XML databases are described in a state-of-the-art study. Based on findings from literature and source code review, the design and implementation of XQFT is discussed, first in general terms, then in the context of Oracle Berkeley DB XML (BDB XML). Experimental support for XQFT is enabled in BDB XML, and a few experiments are conducted to evaluate functional aspects of the XQFT implementation. A scheme for full-text indexing in BDB XML is proposed. The full-text index acts as an augmented version of an inverted list, and is implemented on top of an Oracle Berkeley DB database. Tokens are used as keys, with a data tuple for each distinct (document, path) combination in which the token occurs. Lookups in the index are based on keywords, and should allow various queries to be answered without materializing data.
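The proposed index layout (tokens as keys, one data tuple per distinct (document, path) combination) can be illustrated with a minimal in-memory sketch. This is not the BDB XML implementation: a Python dictionary stands in for the Berkeley DB database, the regex tokenizer and the occurrence count stored in each tuple are assumptions made for illustration, and the names `FullTextIndex`, `add_document`, `lookup`, and `documents_containing_all` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: a token is a lowercased run of word characters.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase tokens; None is treated as empty text."""
    return [t.lower() for t in TOKEN_RE.findall(text or "")]


class FullTextIndex:
    """Inverted list keyed by token (illustrative sketch only)."""

    def __init__(self):
        # token -> {(doc_id, path): occurrence count}
        # The dict stands in for a key/value store such as Berkeley DB.
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, xml_text):
        """Parse one XML document and index the text of every element."""
        root = ET.fromstring(xml_text)
        self._walk(doc_id, root, "")

    def _walk(self, doc_id, elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        # Text directly inside this element: its leading text plus the
        # tail text that follows each child element.
        chunks = [elem.text] + [child.tail for child in elem]
        for chunk in chunks:
            for token in tokenize(chunk):
                entry = self.postings[token]
                key = (doc_id, path)
                entry[key] = entry.get(key, 0) + 1
        for child in elem:
            self._walk(doc_id, child, path)

    def lookup(self, keyword):
        """Return {(doc_id, path): count} for a keyword from the index alone."""
        return dict(self.postings.get(keyword.lower(), {}))

    def documents_containing_all(self, *keywords):
        """Return ids of documents that contain every given keyword."""
        doc_sets = [{doc for doc, _ in self.lookup(k)} for k in keywords]
        return set.intersection(*doc_sets) if doc_sets else set()


index = FullTextIndex()
index.add_document(
    "d1", "<book><title>XML databases</title><body>Indexing XML data</body></book>"
)
index.add_document("d2", "<note><body>relational databases</body></note>")

print(index.lookup("databases"))
print(index.documents_containing_all("xml", "databases"))
```

Because each posting already records the document and path, a conjunctive keyword query like the last call can be answered from the index alone, without reparsing or loading the stored documents.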
Investigation shows that XML-based IR with XQFT is not fundamentally different from traditional text-based IR. Full-text queries rely on linguistic tokens, which, in XQFT, are derived from nodes without considering the XML structure. Further, it is discovered that full-text indexing is crucial for query efficiency in large document collections. In summary, common issues with full-text search are present in XML-based IR, and are addressed in the same manner as in text-based IR.
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:ntnu-9837 |
Date | January 2009 |
Creators | Skoglund, Robin |
Publisher | Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |