Many text extraction and classification systems have been developed for English and other international languages; most of the languages are based on Roman letters. However, Arabic language is one of the difficult languages which have special rules and morphology. Not many systems have been developed for Arabic text categorization. Arabic language is one of the Semitic languages with morphology that is more complicated than English. Due to its complex morphology, there is a need for pre-processing routines to extract the roots of the words then classify them according to the group of acts or meaning. In this thesis, a system has been developed and tested for text classification. The system is based on two stages, the first is to extract the roots from text and the second is to classify the text according to predefined categories. The linguistic root extraction stage is composed of two main phases. The first phase is to handle removal of affixes including prefixes, suffixes and infixes. Prefixes and suffixes are removed depending on the length of the word, while checking its morphological pattern after each deduction to remove infixes. In the second phase, the root extraction algorithm is formulated to handle weak, defined, eliminated-long-vowel and two-letter geminated words, as there is a substantial great amount of irregular Arabic words in texts. Once the roots are extracted, they are checked against a predefined list of 3800 triliteral and 900 quad literal roots. Series of experiments has been conducted to improve and test the performance of the proposed algorithm. The obtained results revealed that the developed algorithm has better accuracy than the existing stemming algorithm. The second stage is the document classification stage. In this stage two non-parametric classifiers are tested, namely Artificial Neural Networks (ANN) and Support Vector Machine (SVM). The system is trained on 6 categories: culture, economy, international, local, religion and sports. The system is trained on 80% of the available data. From each category, the 10 top frequent terms are selected as features. Testing the classification algorithms has been done on the remaining 20% of the documents. The results of ANN and SVM are compared to the standard method used for text classification, the terms frequency-based method. Results show that ANN and SVM have better accuracy (80-90%) compared to the standard method (60-70%). The proposed method proves the ability to categorize the Arabic text documents into the appropriate categories with a high precision rate.
Identifer | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:699281 |
Date | January 2016 |
Creators | Alsaad, Amal |
Contributors | Abbod, M. |
Publisher | Brunel University |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://bura.brunel.ac.uk/handle/2438/13510 |
Page generated in 0.0144 seconds