This thesis explores the temporal analysis of text using the implicit temporal cues
present in a document. We consider the case in which all explicit temporal expressions, such
as specific dates or years, are removed from the text and a bag-of-words approach is used
to predict a timestamp for the text. A set of gold-standard text documents with timestamps
is used as the training set. We also predict time spans for Wikipedia biographies
based on their text. Our training texts range from 3800 BC to the present day. We partition
this timeline into equal-sized chronons and build a probability histogram for a test
document over this chronon sequence. The document is assigned to the chronon with the
highest probability.
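
The chronon assignment described above can be illustrated with a minimal sketch. It assumes a chronon width of 100 years and takes per-chronon log scores as given; the function names, the width, and the placeholder scores are all hypothetical and are not taken from the thesis.

```python
import math

def make_chronons(start_year, end_year, width):
    """Split [start_year, end_year) into (lo, hi) intervals of `width` years.

    Assumes the range is a multiple of `width`, so all chronons are equal-sized.
    Negative years stand for BC.
    """
    return [(lo, lo + width) for lo in range(start_year, end_year, width)]

def normalize_log_scores(log_scores):
    """Convert unnormalized log scores into a probability histogram."""
    peak = max(log_scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

def predict_chronon(chronons, log_scores):
    """Return the chronon with the highest probability, plus the histogram."""
    hist = normalize_log_scores(log_scores)
    best = max(range(len(hist)), key=hist.__getitem__)
    return chronons[best], hist

# Hypothetical usage: 58 chronons of 100 years covering 3800 BC to 2000 AD.
chronons = make_chronons(-3800, 2000, 100)
log_scores = [-50.0] * len(chronons)  # placeholder scores, not real model output
log_scores[-1] = -10.0                # pretend the last chronon fits best
predicted, hist = predict_chronon(chronons, log_scores)
```

In practice the log scores would come from whichever model is being evaluated; the sketch only covers the partitioning and the final highest-probability decision.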
We use two approaches: 1) a generative language model with Bayesian priors, and 2) a
KL-divergence-based model. To counter sparsity in the documents and chronons, we use
three different smoothing techniques across models. We test our models on three diverse
datasets: 1) Wikipedia Biographies, 2) Gutenberg Short Stories, and 3) the Wikipedia Years
dataset.
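
As a rough illustration of the two scoring approaches, the sketch below builds a unigram bag-of-words model per chronon and scores a test document both ways. The abstract does not name the three smoothing techniques, so add-alpha (Laplace) smoothing is used here purely as a stand-in; the toy training data, the uniform prior, and all function names are assumptions for illustration.

```python
import math
from collections import Counter

def smoothed_model(tokens, vocab, alpha=1.0):
    """Unigram model over `vocab` with add-alpha smoothing (a stand-in technique)."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def doc_distribution(tokens, vocab):
    """Relative word frequencies of a document, restricted to the vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def kl_divergence(p, q):
    """KL(p || q); q is smoothed, so q[w] > 0 for every word in p."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items())

def log_posterior(tokens, model, log_prior):
    """Log prior plus the bag-of-words log likelihood of the document."""
    return log_prior + sum(math.log(model[t]) for t in tokens if t in model)

# Toy training text for two hypothetical chronons.
training = {
    "ancient": "pharaoh temple nile pharaoh".split(),
    "modern": "internet computer software internet".split(),
}
vocab = {w for tokens in training.values() for w in tokens}
models = {c: smoothed_model(tokens, vocab) for c, tokens in training.items()}

doc = "pharaoh nile temple".split()
p_doc = doc_distribution(doc, vocab)
log_prior = math.log(1.0 / len(models))  # uniform prior, an assumption

best_kl = min(models, key=lambda c: kl_divergence(p_doc, models[c]))
best_bayes = max(models, key=lambda c: log_posterior(doc, models[c], log_prior))
```

The KL-based rule picks the chronon whose model diverges least from the document, while the generative rule picks the chronon with the highest posterior. On this toy input both select the "ancient" chronon.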
Our models are trained on a subset of Wikipedia biographies. We concentrate on
two prediction tasks: 1) timestamp prediction for a generic text, or mid-span prediction for
a Wikipedia biography, and 2) life-span prediction for a Wikipedia biography. We achieve
an f-score of 81.1% on the life-span prediction task and a mean error of around 36 years for
mid-span prediction on biographies ranging from 3800 BC to the present day. The best model
gives a mean error of 18 years for publication-date prediction on short stories that are
uniformly distributed over the range 1700 AD to 2010 AD. Our models exploit the temporal
distribution of text to associate a text with a time. Our error analysis reveals interesting
properties of the models and datasets used.
We also combine explicit temporal cues extracted from the document with its
implicit cues to obtain a combined prediction model. We show that a combination of the
date-based predictions and language model divergence predictions is highly effective for this
task: our best model obtains an f-score of 81.1%, and the median error between actual and
predicted life-span midpoints is 6 years. This combination will be one of the emphases of
our future work.
The above analyses demonstrate that there are strong temporal cues within texts
that can be exploited statistically for temporal prediction. Along the way, we also create
benchmark datasets that the research community can use to explore this problem further.
Identifier | oai:union.ndltd.org:UTEXAS/oai:repositories.lib.utexas.edu:2152/23581 |
Date | 18 March 2014 |
Creators | Kumar, Abhimanu |
Source Sets | University of Texas |
Detected Language | English |
Type | Thesis |
Format | application/pdf |