A tm Plug-In for Distributed Text Mining in R

R has gained explicit text mining support with the tm package enabling statisticians
to answer many interesting research questions via statistical analysis or modeling of (text)
corpora. However, we typically face two challenges when analyzing large corpora: (1) the
amount of data to be processed in a single machine is usually limited by the available main
memory (i.e., RAM), and (2) the more data to be analyzed the higher the need for efficient
procedures for calculating valuable results. Fortunately, adequate programming models
like MapReduce facilitate parallelization of text mining tasks and allow for processing
data sets beyond what would fit into memory by using a distributed file system possibly
spanning over several machines, e.g., in a cluster of workstations. In this paper we present
a plug-in package to tm called tm.plugin.dc implementing a distributed corpus class which
can take advantage of the Hadoop MapReduce library for large scale text mining tasks.
We show on the basis of an application in culturomics that we can efficiently handle data
sets of significant size. (authors' abstract)
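
The MapReduce idea mentioned in the abstract can be illustrated with a minimal, single-machine sketch of term counting. This is not the tm.plugin.dc implementation and does not use Hadoop; it is written in Python for illustration only, and the function names (map_document, reduce_counts, term_frequencies) and the two-document corpus are hypothetical. It only shows how a per-document map step and an associative reduce step fit together, which is the property that lets such work be spread over several machines.

```python
from collections import Counter
from functools import reduce


def map_document(text):
    """Map step (hypothetical helper): count the terms of one document."""
    # Naive whitespace tokenization; a real pipeline would do more cleaning.
    return Counter(text.lower().split())


def reduce_counts(left, right):
    """Reduce step (hypothetical helper): merge two partial term counts."""
    return left + right


def term_frequencies(documents):
    """Apply the map step to every document, then fold the partial results."""
    partial_counts = map(map_document, documents)
    return reduce(reduce_counts, partial_counts, Counter())


# Illustrative toy corpus, not data from the paper.
corpus = ["text mining in R", "distributed text mining"]
totals = term_frequencies(corpus)
```

Because each document is mapped independently and the merge is associative, the map calls could in principle run on different workers, with the partial counts combined in any grouping.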

Identifier	oai:union.ndltd.org:VIENNA/oai:epub.wu-wien.ac.at:3974
Date	11 1900
Creators	Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt
Publisher	University of California, Los Angeles
Source Sets	Wirtschaftsuniversität Wien
Language	English
Detected Language	English
Type	Article, PeerReviewed
Format	application/pdf, application/x-gzip
Relation	http://www.jstatsoft.org/v51/i05, http://epub.wu.ac.at/3974/
