Using data-driven resources for optimising rule-based syntactic analysis for modern standard Arabic
Elbey, Mohamed, January 2014
This thesis is about optimising a rule-based parser for Modern Standard Arabic (MSA). If ambiguity is a major problem in NLP systems, it is even worse in a language like MSA, because written MSA omits short vowels and for other reasons that will be discussed in Chapter 1. Analysis of the original rule-based parser showed that much of its work was unnecessary: many edges were produced but not used in the final analysis. The first part of this thesis investigates whether integrating a part-of-speech (POS) tagger helps to speed up parsing. This is a well-known technique for Romance and Germanic languages, but its effectiveness has not been widely explored for MSA. The second part of the thesis applies statistics and machine learning techniques and investigates their effect on the parser. This thesis is not about the accuracy of the parser; it is about finding ways to improve its speed. A new approach, which has not previously been explored in statistical parsing, is discussed. This approach collects statistics while parsing and uses them to learn strategies to be applied during the parsing process. The learning process involves all the moves made during parsing: moves that lead to the final analysis (good moves) and moves that lead away from it (bad moves). The idea is to learn not only from positive data but also from negative data. The questions to be asked are:
• Why is this move good, so that we can encourage it?
• Why is this move bad, so that we can discourage it?
In the final part of the thesis, the two techniques (integrating a POS tagger and using the learning approach) are combined, and the effect of the combination on the parser is examined.