Text Augmentation: Inserting markup into natural language text with PPM Models

This thesis describes a new optimisation and new heuristics for automatically marking up XML documents using PPM models, together with CEM, a Java implementation. CEM is significantly more general than previous systems: it marks up large numbers of hierarchical tags, uses n-gram models for large n, and supports a variety of escape methods. Four corpora are discussed, including the bibliography corpus of 14682 bibliographies laid out in seven standard styles using the BibTeX system and marked up in XML with every field from the original BibTeX. The other corpora are the ROCLING Chinese text segmentation corpus, the Computists' Communique corpus and the Reuters' corpus. A detailed examination is presented of the methods of evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.

http://hdl.handle.net/10289/2600

Part-Of-Speech Tagging

XML

Metadata

Identifier	oai:union.ndltd.org:ADTP/238021
Date	January 2006
Creators	Yeates, Stuart Andrew
Publisher	The University of Waikato
Source Sets	Australasian Digital Theses Program
Language	English
Detected Language	English
Rights	http://www.waikato.ac.nz/library/research_commons/rc_about.shtml#copyright