Global ETD Search

Return to search

Topical Structure in Long Informal Documents

This dissertation describes a research project concerned with establishing the topical structure of long informal documents. In this research, we place special emphasis on literary data, but also work with speech transcripts and several other types of data.

It has long been acknowledged that discourse is more than a sequence of sentences but, for the purposes of many Natural Language Processing tasks, it is often modelled exactly in that way. In this dissertation, we propose a practical approach to modelling discourse structure, with an emphasis on it being computationally feasible and easily applicable. Instead of following one of the many linguistic theories of discourse structure, we attempt to model the structure of a document as a tree of topical segments. Each segment encapsulates a span that concentrates on a particular topic at a certain level of granularity. Each span can be further sub-segmented based on finer fluctuations of topic. The lowest (most refined) level of segmentation is individual paragraphs.

In our model, each topical segment is described by a segment centre -- a sentence or a paragraph that best captures the contents of the segment. In this manner, the segmenter effectively builds an extractive hierarchical outline of the document. In order to achieve these goals, we use the framework of factor graphs and modify a recent clustering algorithm, Affinity Propagation, to perform hierarchical segmentation instead of clustering.

While it is far from being a solved problem, topical text segmentation is not uncharted territory. The methods developed so far, however, perform least well where they are most needed: on documents that lack rigid formal structure, such as speech transcripts, personal correspondence or literature. The model described in this dissertation is geared towards dealing with just such types of documents.

In order to study how people create similar models of literary data, we built two corpora of topical segmentations, one flat and one hierarchical. Each document in these corpora is annotated for topical structure by 3-6 people.

The corpora, the model of hierarchical segmentation and software for segmentation are the main contributions of this work.

Natural Language Processing

topic modelling

topical segmentation

discourse structure

Identifer	oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/31612
Date	January 2014
Creators	Kazantseva, Anna
Contributors	Szpakowicz, Stanislaw
Publisher	Université d'Ottawa / University of Ottawa
Source Sets	Université d’Ottawa
Language	English
Detected Language	English
Type	Thesis

Page generated in 0.0026 seconds

Topical Structure in Long Informal Documents

Description

Links & Downloads

Tags

Additional Fields