Global ETD Search

Automating the Extraction of Domain-Specific Information from the Web: A Case Study for the Genealogical Domain

Current ways of finding genealogical information within the millions of pages on the Web are inadequate. In an effort to help genealogical researchers find desired information more quickly, we have developed GeneTIQS, a Genealogy Target-based Information Query System. GeneTIQS builds on ontology-based methods of data extraction to allow database-style queries on the Web. This thesis makes two main contributions to GeneTIQS. (1) It builds a framework to do generic ontology-based data extraction. (2) It develops a hybrid record separator based on Vector Space Modeling that uses both formatting clues and data clues to split pages into component records. The record separator allows GeneTIQS to extract data from the complex documents common in genealogy. Experiments show that this approach yields 92% recall and 93% precision on documents from the Web.
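The abstract does not spell out how the record separator works, so the following is only a minimal sketch of the general idea of scoring candidate separators with a vector space model, not the thesis's implementation. The field patterns, the expected per-record counts, and the function names are all hypothetical. Candidate separator strings stand in for the formatting clues; the cosine score against the expected field counts stands in for the data clues.

```python
import math
import re

# Hypothetical data clues: keywords that hint at fields in a genealogical record.
FIELDS = ["birth", "death", "marriage"]
FIELD_PATTERNS = {
    "birth": re.compile(r"\bborn\b", re.IGNORECASE),
    "death": re.compile(r"\bdied\b", re.IGNORECASE),
    "marriage": re.compile(r"\bmarried\b", re.IGNORECASE),
}

# Assumption: a typical record mentions each field once.
EXPECTED = [1.0, 1.0, 1.0]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def field_vector(text):
    """Count how often each field's keyword appears in a chunk of text."""
    return [len(FIELD_PATTERNS[f].findall(text)) for f in FIELDS]


def score_separator(page, separator):
    """Average similarity between each chunk and the expected record vector."""
    chunks = [c for c in page.split(separator) if c.strip()]
    if not chunks:
        return 0.0
    return sum(cosine(field_vector(c), EXPECTED) for c in chunks) / len(chunks)


def best_separator(page, candidates):
    """Pick the candidate separator whose chunks look most like whole records."""
    return max(candidates, key=lambda sep: score_separator(page, sep))
```

In this sketch, a separator that cuts a page into chunks each containing one birth, one death, and one marriage mention scores close to 1, while a separator that cuts inside records scores lower because each chunk covers only some of the expected fields.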

Information Extraction

Genealogy

Computer Sciences

Identifier	oai:union.ndltd.org:BGMYU2/oai:scholarsarchive.byu.edu:etd-1213
Date	23 November 2004
Creators	Walker, Troy L.
Publisher	BYU ScholarsArchive
Source Sets	Brigham Young University
Detected Language	English
Type	text
Format	application/pdf
Source	Theses and Dissertations
Rights	http://lib.byu.edu/about/copyright/
