Data is being generated at an astronomical pace. Rapid advances in data storage have made it easy to store and transmit very large amounts of data. However, exploring this abundance of data and finding the interesting items remains a major challenge and a cumbersome process for people in all industrial sectors. A model that ranks data by interest would save much of the time otherwise spent sifting through large volumes of data. In this research we concentrate specifically on ranking the text documents in corpora according to ``interestingness''.
We design a state-of-the-art empirical model to rank documents according to ``interestingness''. The model is cost-efficient, fast, and automated to the extent that it requires minimal human intervention. We identify different categories of documents based on their word-usage patterns, which in turn classify them as interesting, mundane, or anomalous. The approach is novel in that it does not depend on the semantics of the words used in a document but on the repetition of words and the rate at which new words are introduced. The design is generic and can be applied to a document corpus of any size from any domain; it can also rank new documents introduced into the corpus. We formulate a couple of normalization techniques that neutralize the impact of variable document length.
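As a rough illustration of the word-usage signal described above, the following Python sketch tracks how quickly a document introduces new words. The tokenizer, the function names `new_word_curve` and `new_word_rate`, and the use of a simple distinct-words-over-total-words ratio are all assumptions made for illustration; the thesis's actual pre-processing and normalization techniques are not specified in this abstract.

```python
import re


def new_word_curve(text):
    """Return, for each token position, the number of distinct words seen so far.

    Hypothetical helper: tokenization here is a naive lowercase split on
    letters and apostrophes, not the pre-processing used in the thesis.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def new_word_rate(text):
    """Fraction of tokens that introduce a previously unseen word.

    A highly repetitive document scores near 0; a document in which
    almost every word is new scores near 1.
    """
    curve = new_word_curve(text)
    return curve[-1] / len(curve) if curve else 0.0


if __name__ == "__main__":
    sample = "the cat saw the cat"
    print(new_word_curve(sample))
    print(new_word_rate(sample))
```

Because this raw ratio tends to fall as documents get longer, it cannot be compared directly across documents of different lengths, which is the kind of effect the normalization techniques mentioned above are meant to neutralize.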
We use three approaches: dictionary-based data compression, analysis of the rate of new word occurrences, and Singular Value Decomposition (SVD). To test the model we use a variety of corpora: the US diplomatic cables released by WikiLeaks, US Presidents' State of the Union Addresses, the Open American National Corpus, and 20 Newsgroups articles. The pre-processing steps of each technique are fully automated. We compare the results of the three techniques and examine the level of agreement between pairs of techniques using a statistical measure, the Jaccard coefficient. This approach can also be used to detect unusual and anomalous documents within a corpus.
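The Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below shows one way agreement between two ranking techniques could be measured with it. Comparing the top-k documents of each ranking, the function name `top_k_agreement`, and the example scores are assumptions for illustration, not details taken from the thesis.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention chosen here: two empty sets agree perfectly.
        return 1.0
    return len(a & b) / len(a | b)


def top_k_agreement(scores_a, scores_b, k):
    """Jaccard coefficient between the top-k documents of two rankings.

    scores_a and scores_b map document ids to scores, where a higher
    score means "more interesting" (an assumed convention).
    """
    top_a = sorted(scores_a, key=scores_a.get, reverse=True)[:k]
    top_b = sorted(scores_b, key=scores_b.get, reverse=True)[:k]
    return jaccard(top_a, top_b)


if __name__ == "__main__":
    # Made-up scores for three hypothetical documents.
    technique_1 = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
    technique_2 = {"d1": 0.2, "d2": 0.8, "d3": 0.7}
    print(top_k_agreement(technique_1, technique_2, k=2))
```

A value of 1 means the two techniques select exactly the same documents, and 0 means they share none.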
The results also contradict the assumptions made by Simon and Yule in deriving an equation for a general text generation model. / Thesis (Master, Computing) -- Queen's University, 2012-01-31 15:28:04.177
Identifier | oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OKQ.1974/6995 |
Date | 01 February 2012 |
Creators | KONDI CHANDRASEKARAN, PRADEEP KUMAR |
Contributors | Queen's University (Kingston, Ont.). Theses (Queen's University (Kingston, Ont.)) |
Source Sets | Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada |
Language | English |
Detected Language | English |
Type | Thesis |
Rights | This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner. |
Relation | Canadian theses |