This dissertation describes a research project concerned with establishing the topical structure of long informal documents. In this research, we place special emphasis on literary data, but also work with speech transcripts and several other types of data.
It has long been acknowledged that discourse is more than a sequence of sentences but, for the purposes of many Natural Language Processing tasks, it is often modelled exactly in that way. In this dissertation, we propose a practical approach to modelling discourse structure, with an emphasis on it being computationally feasible and easily applicable. Instead of following one of the many linguistic theories of discourse structure, we attempt to model the structure of a document as a tree of topical segments. Each segment encapsulates a span that concentrates on a particular topic at a certain level of granularity. Each span can be further sub-segmented based on finer fluctuations of topic. The lowest (most refined) level of segmentation is individual paragraphs.
In our model, each topical segment is described by a segment centre -- a sentence or a paragraph that best captures the contents of the segment. In this manner, the segmenter effectively builds an extractive hierarchical outline of the document. In order to achieve these goals, we use the framework of factor graphs and modify a recent clustering algorithm, Affinity Propagation, to perform hierarchical segmentation instead of clustering.
While it is far from being a solved problem, topical text segmentation is not uncharted territory. The methods developed so far, however, perform least well where they are most needed: on documents that lack rigid formal structure, such as speech transcripts, personal correspondence or literature. The model described in this dissertation is geared towards dealing with just such types of documents.
In order to study how people create similar models of literary data, we built two corpora of topical segmentations, one flat and one hierarchical. Each document in these corpora is annotated for topical structure by 3-6 people.
The corpora, the model of hierarchical segmentation and software for segmentation are the main contributions of this work.
Identifer | oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/31612 |
Date | January 2014 |
Creators | Kazantseva, Anna |
Contributors | Szpakowicz, Stanislaw |
Publisher | Université d'Ottawa / University of Ottawa |
Source Sets | Université d’Ottawa |
Language | English |
Detected Language | English |
Type | Thesis |
Page generated in 0.0022 seconds