A Semantic Graph Model for Text Representation and Matching in Document Mining

The explosive growth in the number of documents produced daily necessitates the development of effective alternatives to explore, analyze, and discover knowledge from documents. Document mining research has emerged to devise automated means of discovering and analyzing useful information in documents. This work has been mainly concerned with constructing text representation models, developing distance measures to estimate similarities between documents, and applying these in mining processes such as document clustering, document classification, information retrieval, information filtering, and information extraction. Conventional text representation methodologies treat documents as bags of words and ignore the meanings and ideas their authors want to convey. This deficiency causes similarity measures either to miss the contextual similarity of text passages because the passages use different words, or to perceive contextually dissimilar passages as similar because they share many words. This thesis presents a new paradigm for mining documents by exploiting the semantic information of their texts. A formal semantic representation of linguistic inputs is introduced and used to build a semantic representation scheme for documents. The representation scheme is constructed by accumulating the outputs of syntactic and semantic analysis. A new distance measure is developed to determine the similarities between the contents of documents. The measure is based on inexact matching of attributed trees. It involves the computation of all distinct similarity common sub-trees, and can be computed efficiently. It is believed that the proposed representation scheme, along with the proposed similarity measure, will enable more effective document mining processes. The proposed document mining techniques were implemented as vital components of a mining system. A case study of semantic document clustering is presented to demonstrate the working and the efficacy of the framework. Experimental work is reported, and its results are presented and analyzed.
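
The bag-of-words deficiency described in the abstract can be illustrated with a minimal sketch. This is not the thesis's method; it only shows the conventional baseline the thesis argues against. The function names (`bag_of_words`, `cosine_similarity`) and the example sentences are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts, discarding word order (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Same words, opposite meaning: who bit whom is reversed.
s1 = "the dog bit the man"
s2 = "the man bit the dog"
# Roughly the same meaning as s1, but no words in common.
s3 = "a hound attacked someone"

same_words = cosine_similarity(bag_of_words(s1), bag_of_words(s2))
paraphrase = cosine_similarity(bag_of_words(s1), bag_of_words(s3))

print(f"same words, different meaning: {same_words:.2f}")
print(f"different words, similar meaning: {paraphrase:.2f}")
```

Under these assumptions, the two sentences with reversed meaning receive the maximum score because their word counts are identical, while the paraphrase scores zero because it shares no words with the original. A representation that captures syntactic and semantic structure, as the thesis proposes, is intended to avoid both failures.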

http://hdl.handle.net/10012/2860

Electrical & Computer Engineering

Document mining

semantic understanding

text representation

similarity measure

document clustering.

Identifier	oai:union.ndltd.org:WATERLOO/oai:uwspace.uwaterloo.ca:10012/2860
Date	January 2006
Creators	Shaban, Khaled
Publisher	University of Waterloo
Source Sets	University of Waterloo Electronic Theses Repository
Language	English
Detected Language	English
Type	Thesis or Dissertation
Format	application/pdf, 1461362 bytes, application/pdf
Rights	Copyright: 2006, Shaban, Khaled. All rights reserved.