A Hybrid Approach to Cross-Linguistic Tokenization: Morphology with Statistics

Tokenization, or word boundary detection, is a critical first step for most NLP applications. It often receives little attention in English and other languages that use explicit spaces between written words, but the standard orthographies of many languages lack explicit word-boundary markers. Tokenization systems for such languages are usually engineered on an individual basis, with little re-use. The human ability to decode any written language, however, suggests that a general algorithm exists. This thesis presents simple morphologically-based and statistical methods for identifying word boundaries in multiple languages. Statistical methods tend to over-predict boundaries, while lexical and morphological methods fail when they encounter unknown words. I demonstrate that a generic hybrid approach to tokenization using both morphological and statistical information generalizes well across multiple languages and improves performance over morphological or statistical methods alone, and show that it can be used for efficient tokenization of English, Korean, and Arabic.
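The idea of combining the two kinds of information can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a plain word list (`lexicon`) as a stand-in for morphological analysis, and character-bigram counts as a stand-in for the statistical model. The function names `train_boundary_stats`, `statistical_split`, and `hybrid_tokenize` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def train_boundary_stats(words):
    """Count character bigrams seen inside words and across word boundaries.

    `words` is a list of already-segmented tokens (hypothetical training data).
    """
    inside, across = Counter(), Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            inside[(a, b)] += 1
    for left, right in zip(words, words[1:]):
        across[(left[-1], right[0])] += 1
    return inside, across


def statistical_split(text, inside, across):
    """Split where a bigram was seen more often across a boundary than inside a word."""
    tokens, start = [], 0
    for i in range(1, len(text)):
        pair = (text[i - 1], text[i])
        if across[pair] > inside[pair]:
            tokens.append(text[start:i])
            start = i
    tokens.append(text[start:])
    return tokens


def hybrid_tokenize(text, lexicon, inside, across):
    """Greedy longest match against the lexicon; unknown runs fall back to statistics."""
    max_len = max((len(w) for w in lexicon), default=0)
    tokens, unknown, i = [], "", 0
    while i < len(text):
        match = None
        # Try the longest candidate first, shrinking until a lexicon entry matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match:
            if unknown:
                # Flush the pending unknown run through the statistical splitter.
                tokens.extend(statistical_split(unknown, inside, across))
                unknown = ""
            tokens.append(match)
            i += len(match)
        else:
            unknown += text[i]
            i += 1
    if unknown:
        tokens.extend(statistical_split(unknown, inside, across))
    return tokens


# Toy data, invented for this example only.
inside, across = train_boundary_stats(["the", "cat", "sat", "on", "the", "mat"])
print(hybrid_tokenize("thedogsat", {"the", "sat"}, inside, across))
```

In this sketch the lexicon handles known words, and the bigram counts are consulted only for character runs the lexicon cannot explain. That limits where the statistical component can over-predict while still producing output for unknown words.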

tokenization

lexing

morphological analysis

Linguistics

Identifier	oai:union.ndltd.org:BGMYU2/oai:scholarsarchive.byu.edu:etd-6983
Date	01 June 2016
Creators	Kearsley, Logan R.
Publisher	BYU ScholarsArchive
Source Sets	Brigham Young University
Detected Language	English
Type	text
Format	application/pdf
Source	All Theses and Dissertations
Rights	http://lib.byu.edu/about/copyright/
