Return to search

Poly-Lingual Text Categorization

With the rapid emergence and proliferation of Internet and the trend of globalization, a tremendous number of textual documents written in different languages are electronically accessible online. Efficiently and effectively managing these textual documents written different languages is essential to organizations and individuals. Although poly-lingual text categorization (PLTC) can be approached as a set of independent monolingual classifiers, this naïve approach employs only the training documents of the same language to construct to construct a monolingual classifier and fails to utilize the opportunity offered by poly-lingual training documents. Motivated by the significance of and need for such a poly-lingual text categorization technique, we propose a PLTC technique that takes into account all training documents of all languages when constructing a monolingual classifier for a specific language. Using the independent monolingual text categorization (MnTC) technique as our performance benchmark, our empirical evaluation results show that our proposed PLTC technique achieves higher classification accuracy than the benchmark technique does in both English and Chinese corpora. In addition, our empirical results also suggest the robustness of the proposed PLTC technique with respect to the range of training sizes investigated.

Identiferoai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0809106-221247
Date09 August 2006
CreatorsShih, Hui-Hua
ContributorsChristopher C. Yang, Chin-Pin Wei, Wen-Hsiang Lu
PublisherNSYSU
Source SetsNSYSU Electronic Thesis and Dissertation Archive
LanguageEnglish
Detected LanguageEnglish
Typetext
Formatapplication/pdf
Sourcehttp://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0809106-221247
Rightswithheld, Copyright information available at source archive

Page generated in 0.0018 seconds