Text classification using a hidden Markov model

Text categorization (TC) is the task of automatically categorizing textual digital documents into pre-set categories by analyzing their contents. The purpose of this study is to develop an effective TC model to resolve the difficulty of automatic classification. In this study, two primary goals are intended. First, a Hidden Markov Model (HMM) is proposed as a relatively new method for text categorization. HMM has been applied to a wide range of applications in text processing such as text segmentation and event tracking, information retrieval, and information extraction. Few, however, have applied HMM to TC. Second, the Library of Congress Classification (LCC) is adopted as a classification scheme for the HMM-based TC model for categorizing digital documents. LCC has been used only in a handful of experiments for the purpose of automatic classification. In the proposed framework, a general prototype for an HMM-based TC model is designed, and an experimental model based on the prototype is implemented so as to categorize digitalized documents into LCC. A sample of abstracts from the ProQuest Digital Dissertations database is used for the test-base. Dissertation abstracts, which are pre-classified by professional librarians, form an ideal test-base for evaluating the proposed model of automatic TC. For comparative purposes, a Naive Bayesian model, which has been extensively used in TC applications, is also implemented. Our experimental results show that the performance of our model surpasses that of the Naive Bayesian model as measured by comparing the automatic classification of abstracts to the manual classification performed by professionals.

Identifier: oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:QMM.85214
Date: January 2005
Creators: Yi, Kwan, 1963-
Publisher: McGill University
Source Sets: Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
Language: English
Detected Language: English
Type: Electronic Thesis or Dissertation
Format: application/pdf
Coverage: Doctor of Philosophy (Graduate School of Library and Information Studies.)
Rights: All items in eScholarship@McGill are protected by copyright with all rights reserved unless otherwise indicated.
Relation: alephsysno: 002211451, proquestno: AAINR12966, Theses scanned by UMI/ProQuest.
