
Concept Mining: A Conceptual Understanding based Approach

Because the volume of information grows rapidly every day, there is
a considerable need to extract and discover valuable knowledge from
data sources such as the World Wide Web. Most common text mining
techniques are based on the statistical analysis of a term, either a
word or a phrase. These techniques treat documents as bags of words
and pay no attention to the meaning of the document content. In
addition, statistical analysis of term frequency captures the
importance of a term within a document only. However, two terms can
have the same frequency in their documents while one contributes
more to the meaning of its sentences than the other. Therefore,
there is a strong need for a model that captures the meaning of
linguistic utterances in a formal structure. The underlying model
should identify the terms that capture the semantics of the text.
Such a model can capture the terms that represent the concepts of a
sentence, which leads to discovering the topic of the document.

A new concept-based model is introduced that analyzes terms at the
sentence, document and corpus levels, rather than at the document
level only as in traditional analysis. The concept-based model can
effectively discriminate between terms that are unimportant with
respect to sentence semantics and terms that hold the concepts
representing the sentence meaning.

The proposed model consists of a concept-based statistical analyzer,
a conceptual ontological graph representation, a concept extractor
and a concept-based similarity measure. A term that contributes to
the sentence semantics is assigned two different weights by the
concept-based statistical analyzer and the conceptual ontological
graph representation. These two weights are combined into a new
weight. The concepts with the maximum combined weights are selected
by the concept extractor. The similarity between documents is
calculated using a new concept-based similarity measure, which takes
full advantage of the concept analysis measures at the sentence,
document, and corpus levels.
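
The weight combination and concept selection steps described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the two weights are combined, so the product used here is a placeholder, and the names combine_weights, extract_concepts and top_k are hypothetical rather than taken from the thesis.

```python
from typing import Dict, List


def combine_weights(stat_w: Dict[str, float],
                    graph_w: Dict[str, float]) -> Dict[str, float]:
    """Combine two per-term weights into a single weight.

    Assumption: the product is used as the combination rule. The abstract
    only states that the two weights are combined into a new weight.
    Terms missing from either input are skipped.
    """
    shared_terms = stat_w.keys() & graph_w.keys()
    return {term: stat_w[term] * graph_w[term] for term in shared_terms}


def extract_concepts(combined: Dict[str, float], top_k: int) -> List[str]:
    """Return the top_k terms with the highest combined weight.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(combined, key=lambda term: (-combined[term], term))
    return ranked[:top_k]


# Toy weights, invented purely for illustration.
stat_weights = {"model": 0.8, "the": 0.1, "concept": 0.6}
graph_weights = {"model": 0.5, "the": 0.2, "concept": 0.9}

combined = combine_weights(stat_weights, graph_weights)
concepts = extract_concepts(combined, top_k=2)
```

With these toy inputs, "concept" and "model" rank above "the", showing how a frequent but semantically weak term can be filtered out once both weights are taken into account.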


Extensive experiments using the proposed concept-based model are
conducted on different datasets in text clustering, categorization
and retrieval. The experiments compare traditional weighting in
detail with the concept-based weighting obtained by the
concept-based model. The experimental results in text clustering,
categorization and retrieval demonstrate a substantial improvement
in quality when using: (1) the concept-based term frequency (tf),
(2) the conceptual term frequency (ctf), (3) the concept-based
statistical analyzer, (4) the conceptual ontological graph, and (5)
the concept-based combined model.


In text clustering, the evaluation of results relies on two quality
measures, the F-measure and the entropy. In text categorization, the
evaluation of results relies on three quality measures, the
micro-averaged F1, the macro-averaged F1 and the error rate. In text
retrieval, the evaluation of results relies on three quality
measures, the precision at 10 documents retrieved P(10), the
preference measure (bpref), and the mean uninterpolated average
precision (MAP). All of these quality measures improve when the
newly developed concept-based model is used for text clustering,
categorization and retrieval.
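
For readers unfamiliar with some of these measures, the following is a minimal sketch of the standard textbook definitions of micro- and macro-averaged F1, P(10) and uninterpolated average precision (MAP is the mean of the latter over queries). The function names and toy inputs are assumptions for illustration; this is not the evaluation code used in the thesis.

```python
from typing import List, Sequence, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: Sequence[Tuple[int, int, int]]) -> Tuple[float, float]:
    """counts holds one (tp, fp, fn) triple per class.

    Macro-F1 averages the per-class F1 scores; micro-F1 pools the
    counts over all classes before computing a single F1.
    """
    macro = sum(f1(*c) for c in counts) / len(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn), macro


def precision_at_k(relevance: List[int], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are relevant (1) vs. not (0)."""
    return sum(relevance[:k]) / k


def average_precision(relevance: List[int], total_relevant: int) -> float:
    """Uninterpolated average precision for one ranked result list."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


# Toy inputs, invented purely for illustration.
micro, macro = micro_macro_f1([(8, 2, 2), (1, 1, 3)])
ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
p10 = precision_at_k(ranking, k=10)
ap = average_precision(ranking, total_relevant=3)
```

In the toy example the micro-averaged F1 is higher than the macro-averaged F1 because the larger, better-classified class dominates the pooled counts, which is why both are usually reported together.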

Identifier oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OWTU.10012/4430
Date January 2009
Creators Shehata, Shady
Source Sets Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
Language English
Detected Language English
Type Thesis or Dissertation
