
Data extraction from the Web using XML.

This thesis presents a mechanism based on the eXtensible Markup Language (XML) for extracting data from HTML-based Web pages and populating relational databases. The task is performed by a system called the XML-based Web Agent (XWA). Data extraction proceeds in three phases. First, the Web pages are converted to well-formed XML documents to facilitate their processing. Second, the data is extracted from the well-formed XML documents and formatted into valid XML documents. Finally, the valid XML documents are mapped into tables to be stored in a relational database. To extract specific data from the Web, the XWA requires three pieces of information: the Web pages from which to extract the data, the location of the data within those pages, and how the extracted data should be formatted. This information is stored in Web Site Ontologies, which are built using a language called the Web Ontology Description Language (WONDEL). WONDEL is based on XML and the XML Pointer Language. It was defined as part of this work to allow users to specify the data they want and let the XWA work offline to extract it and store it in a database. This spares users the time spent waiting for Web pages to download and lets them benefit from the powerful query mechanisms offered by database management systems.
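The three phases described above can be illustrated with a minimal Python sketch. It is not the XWA implementation and does not use WONDEL syntax, neither of which is shown in this record. The `ONTOLOGY` dictionary, the sample page, and all function and class names are hypothetical stand-ins, and the second phase builds an output XML document without validating it against a DTD or schema.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Hypothetical sample page: unquoted attributes, a bare <br>, no </html>.
SAMPLE_HTML = """<html><body>
<h1>Catalogue<br></h1>
<table>
<tr class=book><td>XML Basics</td><td>19.95</td></tr>
<tr class=book><td>Web Agents</td><td>24.50</td></tr>
</table>
</body>
"""

# Hypothetical stand-in for a Web Site Ontology (NOT actual WONDEL syntax):
# where the records are, where each field is, and how to name the output.
ONTOLOGY = {
    "root": "books",
    "record": "book",
    "record_path": ".//tr[@class='book']",
    "fields": {"title": "td[1]", "price": "td[2]"},
}


class WellFormedBuilder(HTMLParser):
    """Phase 1: turn loosely written HTML into a well-formed element tree."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # void elements are closed immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # ignore a stray closing tag
        while self.open_tags:  # close everything up to the matching tag
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:  # close elements the page left open
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def extract(root, ontology):
    """Phase 2: pull the described fields into a new XML document."""
    out = ET.Element(ontology["root"])
    for node in root.iterfind(ontology["record_path"]):
        rec = ET.SubElement(out, ontology["record"])
        for name, path in ontology["fields"].items():
            found = node.find(path)
            text = "".join(found.itertext()).strip() if found is not None else ""
            ET.SubElement(rec, name).text = text
    return out


def store(doc, conn, ontology):
    """Phase 3: map each record element to a row of a relational table."""
    cols = list(ontology["fields"])
    # Identifiers come from the trusted ontology; values are parameterized.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {doc.tag} "
        f"({', '.join(c + ' TEXT' for c in cols)})"
    )
    rows = [tuple(rec.findtext(c, default="") for c in cols) for rec in doc]
    conn.executemany(
        f"INSERT INTO {doc.tag} VALUES ({', '.join('?' for _ in cols)})", rows
    )


def run_pipeline(html, ontology, conn):
    parser = WellFormedBuilder()
    parser.feed(html)
    doc = extract(parser.result(), ontology)
    store(doc, conn, ontology)
    return doc


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    doc = run_pipeline(SAMPLE_HTML, ONTOLOGY, conn)
    print(ET.tostring(doc, encoding="unicode"))
    print(conn.execute("SELECT title, price FROM books").fetchall())
```

Once the rows are in the database, the extracted data can be queried with ordinary SQL without fetching the page again, which is the benefit the abstract attributes to working offline.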

Identifier: oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/9260
Date: January 2001
Creators: Ouahid, Hicham
Contributors: Karmouch, Ahmed
Publisher: University of Ottawa (Canada)
Source Sets: Université d’Ottawa
Detected Language: English
Type: Thesis
Format: 132 p.
