Automatic syntactic analysis is essential for extracting useful information from large-scale learner data for linguistic research and natural language processing (NLP). Currently, researchers use standard POS taggers and parsers developed on native-language data to analyze learner language. Investigation of how such systems perform on learner data is needed to develop strategies for minimizing cross-domain effects. Furthermore, POS taggers and parsers are developed for generic NLP purposes and may not be useful for identifying specific syntactic constructs such as subcategorization frames (SCFs). SCFs have attracted much research attention as they provide unique insight into the interplay between lexical and structural information. An automatic SCF identification system adapted for learner language is needed to facilitate research on L2 SCFs. In this thesis, we first provide a comprehensive evaluation of standard POS taggers and parsers on learner and native English. We show that the common practice of constructing a gold standard by manually correcting the output of a system can introduce bias to the evaluation, and we suggest a method to control for the bias. We also quantitatively evaluate the impact of fine-grained learner errors on POS tagging and parsing, identifying the most influential learner errors. Furthermore, we show that the performance of probabilistic POS taggers and parsers on native English can predict their performance on learner English. Secondly, we develop an SCF identification system for learner English. We train a machine learning model on both native and learner English data. The system can label individual verb occurrences in learner data for a set of 49 distinct SCFs. Our evaluation shows that the system reaches an F1 score of 84%. We then demonstrate that this level of accuracy is adequate for linguistic research.
We design the first multidimensional SCF diversity metrics and investigate how SCF diversity changes with L2 proficiency on a large learner corpus. Our results show that as L2 proficiency develops, learners tend to use more diverse SCF types with greater taxonomic distance; more advanced learners also use different SCF types more evenly and locate the verb tokens of the same SCF type further away from each other. Furthermore, we demonstrate that the proposed SCF diversity metrics contribute a unique perspective to the prediction of L2 proficiency beyond existing syntactic complexity metrics.
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:763805 |
Date | January 2019 |
Creators | Huang, Yan |
Contributors | Korhonen, Anna |
Publisher | University of Cambridge |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | https://www.repository.cam.ac.uk/handle/1810/285998 |