The textcat Package for n-Gram Based Text Categorization in R

Identifying the language used will typically be the first step in most natural language
processing tasks. Among the wide variety of language identification methods discussed
in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text
categorization based on character n-gram frequencies have been particularly successful.
This paper presents the R extension package textcat for n-gram based text categorization
which implements both the Cavnar and Trenkle approach as well as a reduced n-gram
approach designed to remove redundancies of the original approach. A multi-lingual
corpus obtained from the Wikipedia pages available on a selection of topics is used to
illustrate the functionality of the package and the performance of the provided language
identification methods. (authors' abstract)
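
The character n-gram approach named in the abstract can be illustrated with a minimal sketch. This is not the textcat implementation (which is an R package); it is a Python approximation of the general Cavnar and Trenkle idea: build a ranked profile of the most frequent character n-grams per language, then assign a document to the language whose profile has the smallest rank-order ("out-of-place") distance. The function names, the `n_max=3` and `top=300` settings, and the underscore word padding are illustrative assumptions.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the `top` most frequent character n-grams (n = 1..n_max), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries (an assumption of this sketch)
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category profile get a maximum penalty."""
    max_penalty = len(cat_profile)
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Return the key of the category profile closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical toy training texts; real profiles would be built from much larger corpora.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "de": "der schnelle braune Fuchs springt über den faulen Hund und dann schläft der Hund",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
print(classify("the dog and the fox", profiles))
```

With training texts this small the result for unseen input is not reliable; the sketch only shows the mechanics of profile construction and rank comparison.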

http://epub.wu.ac.at/3985/1/textcat.pdf

Identifier	oai:union.ndltd.org:VIENNA/oai:epub.wu-wien.ac.at:3985
Date	02 1900
Creators	Feinerer, Ingo; Buchta, Christian; Geiger, Wilhelm; Rauch, Johannes; Mair, Patrick; Hornik, Kurt
Publisher	American Statistical Association
Source Sets	Wirtschaftsuniversität Wien
Language	English
Detected Language	English
Type	Article, PeerReviewed
Format	application/pdf
Relation	http://www.jstatsoft.org/v52/i06/paper, http://epub.wu.ac.at/3985/
