In this thesis we explore robust statistical syntax analysis for French. Our main concern is to explore methods whereby the linguist can inject linguistic knowledge and/or resources into the robust statistical engine in order to improve results for specific phenomena. We first explore the dependency annotation schema for French, concentrating on certain phenomena. Next, we look into the various algorithms capable of producing this annotation, and in particular on the transition-based parsing algorithm used in the rest of this thesis. After exploring supervised machine learning algorithms for NLP classification problems, we present the Talismane toolkit for syntax analysis, built within the framework of this thesis, including four statistical modules - sentence boundary detection, tokenisation, pos-tagging and parsing - as well as the various linguistic resources used for the baseline model, including corpora, lexicons and feature sets. Our first experiments attempt various machine learning configurations in order to identify the best baseline. We then look into improvements made possible by beam search and beam propagation. Finally, we present a series of experiments aimed at correcting errors related to specific linguistic phenomena, using targeted features. One our innovation is the introduction of rules that can impose or prohibit certain decisions locally, thus bypassing the statistical model. We explore the usage of rules for errors that the features are unable to correct. Finally, we look into the enhancement of targeted features by large scale linguistic resources, and in particular a semi-supervised approach using a distributional semantic resource.
Identifer | oai:union.ndltd.org:CCSD/oai:tel.archives-ouvertes.fr:tel-01058143 |
Date | 17 December 2013 |
Creators | Urieli, Assaf |
Publisher | Université Toulouse le Mirail - Toulouse II |
Source Sets | CCSD theses-EN-ligne, France |
Language | English |
Detected Language | English |
Type | PhD thesis |
Page generated in 0.0034 seconds