The eXtensible Markup Language (XML) has become the standard format for data exchange on the Internet, providing interoperability between different business applications. Such wide use results in large volumes of heterogeneous XML data, i.e., XML documents conforming to different schemas. Although schemas are important in many business applications, they are often missing in XML documents. In this thesis, we present a suite of algorithms that are effective in extracting schema information from a large collection of XML documents. We propose using the cost of NFA simulation to compute the Minimum Description Length (MDL) to rank the inferred schemas. We also study using frequencies of the sample inputs to improve the precision of the schema extraction. Furthermore, we propose an evaluation framework to quantify the quality of the extracted schema. Experimental studies are conducted on various data sets to demonstrate the efficiency and efficacy of our approach.
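The idea of ranking candidate schemas by MDL via NFA simulation can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: the schema is modeled as a toy NFA over element names, the model cost is a crude per-transition constant, and the data cost is approximated by the bits needed to resolve nondeterministic choices while simulating each sample document. All state names, costs, and samples below are hypothetical.

```python
import math

def simulate_cost(nfa, start, accept, word):
    """Approximate bits to encode `word` under the NFA, or None if rejected.

    `nfa` maps (state, symbol) -> set of next states. Each step charges
    log2(number of reachable next states), i.e., the cost of telling a
    decoder which nondeterministic branch was taken.
    """
    states = {start}
    bits = 0.0
    for sym in word:
        nxt = set()
        for s in states:
            nxt |= set(nfa.get((s, sym), ()))
        if not nxt:
            return None  # sample not accepted by this candidate schema
        bits += math.log2(len(nxt))
        states = nxt
    return bits if states & accept else None

def mdl_score(nfa, start, accept, samples):
    """MDL = model description cost + cost of data encoded under the model.

    A schema that is too loose pays in data bits (many choices per step);
    one that is too large pays in model bits. Lower score is better.
    """
    model_bits = 8.0 * len(nfa)  # illustrative fixed cost per transition
    data_bits = 0.0
    for w in samples:
        c = simulate_cost(nfa, start, accept, w)
        if c is None:
            return float("inf")  # schema must cover every sample
        data_bits += c
    return model_bits + data_bits
```

For example, on the samples `["a", "ab", "abb"]`, a tight deterministic schema for `a b*` scores lower than a highly nondeterministic "accept anything" schema, because the latter spends bits on every branching decision during simulation.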
Identifier | oai:union.ndltd.org:WKU/oai:digitalcommons.wku.edu:theses-2064
Date | 01 May 2011
Creators | Parthepan, Vijayeandra
Publisher | TopSCHOLAR®
Source Sets | Western Kentucky University Theses
Detected Language | English
Type | text
Format | application/pdf
Source | Masters Theses & Specialist Projects