
Automatically Extract Information from Web Documents

The Internet can be regarded as a reservoir of useful information in textual form: product catalogs, airline schedules, stock market quotations, weather forecasts, and so on. There has been much interest in building systems that gather such information on a user's behalf. But because these information resources are formatted differently, mechanically extracting their content is difficult. Systems using such resources typically rely on hand-coded wrappers, that is, customized procedures for information extraction. Structured data objects are a very important type of information on the Web. Such data objects are often records from underlying databases, displayed in Web pages using fixed templates. Mining data records in Web pages is useful because they typically present their host pages' essential information, such as lists of products and services. Extracting these structured data objects makes it possible to integrate data from multiple Web pages to provide value-added services, e.g., comparative shopping, meta-querying, and search. Web content mining has thus become an area of interest for many researchers because of the phenomenal growth of Web content and the economic benefits associated with it. However, due to the heterogeneity of Web pages, automated discovery of targeted information remains a challenging problem.

Identifier: oai:union.ndltd.org:WKU/oai:digitalcommons.wku.edu:theses-1379
Date: 01 December 2007
Creators: Sharma, Dipesh
Publisher: TopSCHOLAR®
Source Sets: Western Kentucky University Theses
Detected Language: English
Type: text
Format: application/pdf
Source: Masters Theses & Specialist Projects
