This thesis project addresses the machine learning (ML) modelling aspects of the problem of automatically extracting typological linguistic information of natural languages spoken in South Asia from annotated descriptive grammars. Without getting stuck into the theory and methods of Natural Language Processing (NLP), the focus has been to develop and test a machine learning (ML) model dedicated to the information extraction part. Starting with the existing state-of-the-art frameworks to get labelled training data through the structured representation of the descriptive grammars, the problem has been modelled as a supervised ML classification task where the annotated text is provided as input and the objective is to classify the input to one of the pre-learned labels. The approach has been to systematically explore the data to develop understanding of the problem domain and then evaluate a set of four potential ML algorithms using predetermined performance metrics namely: accuracy, recall, precision and f-score. It turned out that the problem splits up into two independent classification tasks: binary classification task and multiclass classification task. The four selected algorithms: Decision Trees, Naïve Bayes, Support VectorMachines, and Logistic Regression belonging to both linear and non-linear families ofML models are independently trained and compared for both classification tasks. Using stratified 10-fold cross validation performance metrics are measured and the candidate algorithms are compared. Logistic Regression provided overall best results with DecisionTree as the close follow up. Finally, the Logistic Regression model was selected for further fine tuning and used in a web demo for typological information extraction tool developed to show the usability of the ML model in the field.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:his-17893 |
Date | January 2019 |
Creators | Aslam, Irfan |
Publisher | Högskolan i Skövde, Institutionen för informationsteknologi |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0021 seconds