Distributed Text Mining in R

R has recently gained explicit text mining support with the "tm" package, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) an increase in the amount of data to be analyzed leads to an increasing computational workload. Fortunately, adequate parallel programming models like MapReduce and the corresponding open source implementation called Hadoop allow for processing data sets beyond what would fit into memory. In this paper we present the package "tm.plugin.dc", which offers a seamless integration between "tm" and Hadoop. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.

Series: Research Report Series / Department of Statistics and Mathematics

Identifier	oai:union.ndltd.org:VIENNA/oai:epub.wu-wien.ac.at:3034
Date	16 March 2011
Creators	Theußl, Stefan; Feinerer, Ingo; Hornik, Kurt
Publisher	WU Vienna University of Economics and Business
Source Sets	Wirtschaftsuniversität Wien
Language	English
Detected Language	English
Type	Paper, NonPeerReviewed
Format	application/pdf
Relation	http://statmath.wu.ac.at/, http://epub.wu.ac.at/3034/
