We show that a carefully crafted probabilistic morphological analyzer significantly outperforms a reasonable, naive baseline for Syriac. Syriac is an under-resourced Semitic language for which there are no available language tools such as morphological analyzers. Such tools are widely used to contribute to the process of annotating morphologically complex languages. We introduce and connect novel data-driven models for segmentation, dictionary linkage, and morphological tagging in a joint pipeline to create a probabilistic morphological analyzer requiring only labeled data. We explore the performance of this model with varying amounts of training data and find that with about 34,500 tokens, it can outperform the baseline trained on over 99,000 tokens and achieve an accuracy of just over 80%. When trained on all available training data, this joint model achieves 86.47% accuracy — a 29.7% reduction in error rate over the baseline.
Identifer | oai:union.ndltd.org:BGMYU2/oai:scholarsarchive.byu.edu:etd-3199 |
Date | 08 July 2010 |
Creators | McClanahan, Peter J. |
Publisher | BYU ScholarsArchive |
Source Sets | Brigham Young University |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | Theses and Dissertations |
Rights | http://lib.byu.edu/about/copyright/ |
Page generated in 0.0024 seconds