
Comparing Latent Dirichlet Allocation and Latent Semantic Analysis as Classifiers

In the Information Age, unstructured electronic text documents have proliferated. Processing these documents is a daunting task for humans, whose cognitive capacity for handling large volumes of often lengthy documents is limited. To address this problem, computer algorithms for text data are being developed. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are two such algorithms that have each received much attention in the text data literature for topic extraction, but not for document classification or for comparison studies. Since classification is considered an important human function and has been studied in cognitive science and information science, this dissertation reports a research study that compares LDA, LSA, and humans as document classifiers. The research questions posed in this study are: R1: How accurate are LDA and LSA in classifying documents in a corpus of textual data over a known set of topics? R2: How accurate are humans in performing the same classification task? R3: How does LDA classification performance compare to LSA classification performance? To address these questions, a classification study involving human subjects was designed in which humans were asked to generate and classify documents (customer comments) at two levels of abstraction for a quality assurance setting. The two computer algorithms, LSA and LDA, were then used to classify the same documents. The results indicate that humans outperformed both computer algorithms, with an accuracy rate of 94% at the higher level of abstraction and 76% at the lower level. At the higher level of abstraction, the accuracy rate was 84% for both LSA and LDA; at the lower level, the accuracy rates were 67% for LSA and 64% for LDA. The findings of this research have strong implications for the improvement of information systems that process unstructured text.
Document classifiers have potential applications in many fields (e.g., fraud detection, information retrieval, national security, and customer management). The development and refinement of algorithms that classify text is a fruitful area of ongoing research, and this dissertation contributes to this area.
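
The accuracy rates reported above are fractions of documents assigned to the correct topic, measured at two levels of abstraction. The following is a minimal sketch of that arithmetic, not the method used in the dissertation. It assumes that each lower-level topic belongs to exactly one higher-level topic; the topic names, the `PARENT` mapping, and the toy labels are hypothetical and are not data from the study.

```python
from typing import Dict, List


def accuracy(true_labels: List[str], predicted: List[str]) -> float:
    """Return the fraction of documents whose predicted label is correct."""
    if len(true_labels) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one document")
    hits = sum(t == p for t, p in zip(true_labels, predicted))
    return hits / len(true_labels)


def to_higher_level(labels: List[str], parent: Dict[str, str]) -> List[str]:
    """Map each lower-level topic to its higher-level topic."""
    return [parent[label] for label in labels]


# Hypothetical two-level topic scheme for customer comments (an assumption).
PARENT = {
    "late delivery": "service",
    "rude staff": "service",
    "broken part": "product",
    "wrong size": "product",
}

# Toy labels for four comments: true topics and a classifier's predictions.
true_low = ["late delivery", "rude staff", "broken part", "wrong size"]
pred_low = ["rude staff", "rude staff", "broken part", "wrong size"]

# Lower level: one of four predictions is wrong.
low_acc = accuracy(true_low, pred_low)

# Higher level: the wrong prediction still falls under the right parent topic.
high_acc = accuracy(
    to_higher_level(true_low, PARENT),
    to_higher_level(pred_low, PARENT),
)

print(f"lower-level accuracy:  {low_acc:.2f}")
print(f"higher-level accuracy: {high_acc:.2f}")
```

Under this assumed hierarchy, a confusion between two lower-level topics that share a parent counts as an error at the lower level but as a correct answer at the higher level, which is one way accuracy can be higher at the higher level of abstraction.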

Identifier: oai:union.ndltd.org:unt.edu/info:ark/67531/metadc103284
Date: 12 1900
Creators: Anaya, Leticia H.
Contributors: Evangelopoulos, Nicholas; Kulkarni, Shailesh; Pavur, Robert; Peak, Dan; Boubekri, Nourredine
Publisher: University of North Texas
Source Sets: University of North Texas
Language: English
Detected Language: English
Type: Thesis or Dissertation
Format: Text
Rights: Public, Anaya, Leticia H., Copyright, Copyright is held by the author, unless otherwise noted. All rights Reserved.
