Automatic Lexicon Generation for Unsupervised Part-of-Speech Tagging Using Only Unannotated Text

With the growing number of textual resources available, the ability to understand them becomes critical. An essential first step in understanding these sources is identifying the parts of speech in each sentence. The goal of this research is to propose, improve, and implement a term categorizer: an algorithm capable of finding terms (words in a corpus) that are used in similar ways. Such a term categorizer can be used to find the terms belonging to a particular part of speech, e.g., nouns, in a corpus and to generate a lexicon from them. The proposed work does not depend on any external sources of information, such as dictionaries, and it shows a significant improvement (~30%) over an existing method of categorization. More importantly, the proposed algorithm can be applied as a component of an unsupervised part-of-speech tagger, making the tagger truly unsupervised in that it requires only unannotated text. The algorithm is discussed in detail, along with its background and its performance. Experimentation shows that the proposed algorithm performs within 3% of the baseline, the Penn Treebank lexicon. / Master of Science

Identifier	oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/10094
Date	02 September 2004
Creators	Pereira, Dennis V.
Contributors	Computer Science, Egyhazy, Csaba J., Belli, Gabriella M., Frakes, William B.
Publisher	Virginia Tech
Source Sets	Virginia Tech Theses and Dissertation
Detected Language	English
Type	Thesis
Format	ETD, application/pdf
Rights	In Copyright, http://rightsstatements.org/vocab/InC/1.0/
Relation	08242004_Dennis_Pereira_ETD.pdf