Global ETD Search


Data extraction from the Web using XML.

This thesis presents a mechanism based on the eXtensible Markup Language (XML) for extracting data from HTML-based Web pages and populating relational databases. The task is performed by a system called the XML-based Web Agent (XWA). Data extraction proceeds in three phases. First, the Web pages are converted to well-formed XML documents to facilitate their processing. Second, the data is extracted from the well-formed XML documents and formatted into valid XML documents. Finally, the valid XML documents are mapped into tables to be stored in a relational database. To extract specific data from the Web, the XWA requires information about the Web pages from which to extract the data, the location of the data within those pages, and how the extracted data should be formatted. This information is stored in Web Site Ontologies, which are built using a language called the Web Ontology Description Language (WONDEL). WONDEL is based on XML and the XML Pointer Language (XPointer). It was defined as part of this work to allow users to specify the data they want and let the XWA work offline to extract it and store it in a database. This saves users the time spent waiting for Web pages to download and lets them benefit from the powerful query mechanisms offered by database management systems.
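The three phases described in the abstract can be illustrated with a minimal Python sketch. This is not the XWA implementation and it does not use WONDEL syntax, neither of which is given in this record. The `ONTOLOGY` dictionary, the `book` table, the sample page, and all function and class names are hypothetical stand-ins, and the limited path expressions of Python's `xml.etree.ElementTree` are used in place of XPointer. Phase 2 here only reshapes the data into a fixed record structure; it does not perform the validation that the thesis's "valid XML documents" implies.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Hypothetical sample page: <br> is unclosed, </body> and </html> are missing.
SAMPLE_HTML = """<html><body><table id="books">
<tr><td>XML Basics</td><td>12.50</td></tr>
<tr><td>Web Agents<br></td><td>30.00</td></tr>
</table>
"""

# Hypothetical stand-in for a Web Site Ontology: where the records are
# located in the page and how each field is named in the output.
ONTOLOGY = {
    "record": ".//table[@id='books']/tr",
    "fields": {"title": "td[1]", "price": "td[2]"},
}


class WellFormedBuilder(HTMLParser):
    """Phase 1: build a well-formed XML tree from loosely written HTML."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignore it
        while self.open_tags:  # close unclosed children first
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:
            self.builder.data(data)

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:  # close whatever the page left open
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def extract(page, ontology):
    """Phase 2: pull the located data into a uniform <records> document."""
    records = ET.Element("records")
    for node in page.findall(ontology["record"]):
        record = ET.SubElement(records, "record")
        for name, path in ontology["fields"].items():
            cell = node.find(path)
            text = cell.text if cell is not None and cell.text else ""
            ET.SubElement(record, name).text = text.strip()
    return records


def store(records, conn):
    """Phase 3: map each <record> element to a row of a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS book (title TEXT, price REAL)")
    rows = [(r.findtext("title"), float(r.findtext("price"))) for r in records]
    conn.executemany("INSERT INTO book VALUES (?, ?)", rows)


def run(html, ontology, conn):
    """Run the three phases, then query the stored data with SQL."""
    parser = WellFormedBuilder()
    parser.feed(html)
    store(extract(parser.result(), ontology), conn)
    return conn.execute("SELECT title, price FROM book ORDER BY price").fetchall()


if __name__ == "__main__":
    print(run(SAMPLE_HTML, ONTOLOGY, sqlite3.connect(":memory:")))
```

Once the rows are in the database, any SQL query can be run against them without fetching the page again, which is the offline benefit the abstract describes.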

Information Science.

Identifier	oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/9260
Date	January 2001
Creators	Ouahid, Hicham.
Contributors	Karmouch, Ahmed
Publisher	University of Ottawa (Canada)
Source Sets	Université d’Ottawa
Detected Language	English
Type	Thesis
Format	132 p.
