Keywords in the mist: Automated keyword extraction for very large documents and back of the book indexing.

This research addresses the problem of automatic keyphrase extraction from large documents and back of the book indexing. The potential benefits of automating this process are far-reaching, from improving information retrieval in digital libraries to saving countless man-hours by helping professional indexers create back of the book indexes. The dissertation introduces a new methodology for evaluating automated systems, which allows for a detailed, comparative analysis of several keyphrase extraction techniques. We introduce and evaluate both supervised and unsupervised techniques, designed to balance the resource requirements of an automated system against the best achievable performance. Additionally, a number of novel features are proposed, including a statistical informativeness measure based on chi statistics; an encyclopedic feature that taps into the vast knowledge base of Wikipedia to establish the likelihood of a phrase referring to an informative concept; and a linguistic feature based on sophisticated semantic analysis of the text using current theories of discourse comprehension. The resulting keyphrase extraction system is shown to outperform the current state of the art in supervised keyphrase extraction by a large margin. Moreover, a fully automated back of the book indexing system built on the keyphrase extraction system is shown to produce back of the book indexes closely resembling those created by human experts.

construction-integration

Keyword extraction

back of the book indexing

Automatic indexing.

Identifier	oai:union.ndltd.org:unt.edu/info:ark/67531/metadc6118
Date	05 1900
Creators	Csomai, Andras
Contributors	Mihalcea, Rada, 1974-, Chen, Jiangping, Tarau, Paul, Pasca, Marius
Publisher	University of North Texas
Source Sets	University of North Texas
Language	English
Detected Language	English
Type	Thesis or Dissertation
Format	Text
Rights	Public, Copyright, Csomai, Andras, Copyright is held by the author, unless otherwise noted. All rights reserved.
