
R/parallel: Parallel Computing for R in non-dedicated environments

Traditionally, parallel computing has been associated with special purpose
applications designed to run in complex computing clusters, specifically set up with
a software stack of dedicated libraries together with advanced administration tools to
manage complex IT infrastructures. These High Performance Computing (HPC)
solutions, although the most efficient in terms of performance and scalability,
impose technical and practical barriers for most ordinary scientists who, with
limited IT knowledge, time, and resources, cannot embrace classical HPC solutions
without considerable effort.
Moreover, two important technology advances are increasing the need for parallel
computing. The first, visible in the bioinformatics field as in other experimental
science disciplines, is that new high-throughput screening devices generate huge
amounts of data in very short time, and the data must be analyzed in equally short
periods to avoid delaying the experimental work. The second involves the design of
new processor chips: to increase raw performance, the current strategy is to add
more processing units per chip, so parallel applications are required to exploit
the new processing capacity. In both cases, users may need to update their current
sequential applications and computing resources to reach the processing capacity
their particular needs require. Since parallel computing is becoming the natural
route to increased performance and is demanded by new computer systems, solutions
adapted to the mainstream should be developed so that adoption is seamless.
In order to enable the adoption of parallel computing, new methods and technologies
are required to remove or mitigate the barriers that currently prevent many users
from evolving their sequential running environments. A scenario that particularly
suffers from these problems, and that is taken as the practical case in this work,
is that of bioinformaticians analyzing molecular data with methods written in the R
language. With large datasets they often have to wait days or weeks for their data
to be processed, or perform the cumbersome task of manually splitting their data,
looking for available computers on which to run the subsets, and collecting back
the previously scattered results. Most of these R applications are based on
parallel loops. A loop is called a parallel loop if there is no data dependency
among its iterations, so any iteration can be processed in any order, or even
simultaneously, making the loop amenable to parallelization. Parallel loops are
found in a large number of scientific applications.
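
The fragment below is a minimal sketch, not taken from the thesis, of what such a parallel loop looks like in R: each iteration of a hypothetical bootstrap computation depends only on its own index, so the iterations can be evaluated in any order. The concurrent variant uses base R's parallel package purely to illustrate the concept; it is not the R/parallel interface developed in this work.

## Minimal sketch of a parallel loop (illustrative bootstrap, not from the thesis).
library(parallel)

obs    <- rnorm(500)   # some observed data
n_iter <- 1000

## Sequential form: a classic parallel loop, since iteration i uses only i.
boot_seq <- numeric(n_iter)
for (i in seq_len(n_iter)) {
  boot_seq[i] <- mean(sample(obs, length(obs), replace = TRUE))
}

## The same loop run concurrently with base R's 'parallel' package
## (for illustration only; R/parallel's own API is not shown here).
cl <- makeCluster(2)
clusterExport(cl, "obs")
boot_par <- parSapply(cl, seq_len(n_iter), function(i) {
  mean(sample(obs, length(obs), replace = TRUE))
})
stopCluster(cl)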
Previous contributions deal with partial aspects of the problems suffered by this
kind of user, such as providing access to additional computing resources or
enabling the codification of parallel problems, but none provides a complete
solution that does not assume advanced users with access to traditional HPC
platforms. Our contribution consists in the design and evaluation of methods that
enable the easy parallelization of applications based on parallel loops written in
R, using non-dedicated environments as the computing platform and targeting users
without experience in parallel computing or system administration. As a proof of
concept, and in order to evaluate the feasibility of our proposal, an extension of
R, called R/parallel, has been developed to test our ideas in real environments
with real bioinformatics problems.
The results show that, even with limited information about the running environment
and a high degree of uncertainty about the quantity and quality of the available
resources, it is possible to provide a software layer that lets users without
previous knowledge or skills adapt their applications with minimal effort and
perform concurrent computations on the available computers. In addition to proving
the feasibility of our proposal, a new self-scheduling scheme suitable for parallel
loops in dynamic environments has been contributed; the results show that it
achieves improved performance compared to previous contributions in best-effort
environments.
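
As an illustration only, the sketch below shows a generic guided self-scheduling rule of the kind commonly applied to parallel loops: idle workers request progressively smaller chunks of the remaining iterations, so faster or less loaded machines naturally take on more work. The function name guided_chunks and the halving factor are assumptions made for this example; this is not the specific scheme contributed by the thesis.

## Generic guided self-scheduling sketch (illustrative; not the thesis's scheme).
## Chunk sizes shrink as iterations are consumed, balancing load across workers
## whose speed and availability are unknown in a non-dedicated environment.
guided_chunks <- function(n_iter, n_workers, min_chunk = 1) {
  remaining <- n_iter
  start     <- 1
  chunks    <- list()
  while (remaining > 0) {
    size <- min(remaining, max(min_chunk, ceiling(remaining / (2 * n_workers))))
    chunks[[length(chunks) + 1]] <- start:(start + size - 1)
    start     <- start + size
    remaining <- remaining - size
  }
  chunks
}

## Example: 100 loop iterations shared among 4 workers of unknown speed;
## each chunk would be handed to whichever worker asks for work next.
str(guided_chunks(100, 4))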
The main conclusion is that, even with limited information about the environment
and the technologies involved, it is possible to provide mechanisms that allow
users who lack specialist knowledge, and who work under time constraints, to
conveniently take advantage of parallel computing technologies, thereby closing the
gap between classical HPC solutions and the mainstream of users of common
applications, in our case those based on parallel loops in R.

Identifier: oai:union.ndltd.org:TDX_UAB/oai:www.tdx.cat:10803/121248
Date: 21 July 2010
Creators: Vera Rodríguez, Gonzalo
Contributors: Suppi Boldrito, Remo; Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius
Publisher: Universitat Autònoma de Barcelona
Source Sets: Universitat Autònoma de Barcelona
Language: English
Detected Language: English
Type: info:eu-repo/semantics/doctoralThesis, info:eu-repo/semantics/publishedVersion
Format: 136 p., application/pdf
Source: TDX (Tesis Doctorals en Xarxa)
Rights: info:eu-repo/semantics/openAccess. NOTICE: Access to the contents of this doctoral thesis and its use must respect the rights of the author. It may be used for personal consultation or study, as well as in research or teaching activities and materials, under the terms established in art. 32 of the Consolidated Text of the Intellectual Property Law (RDL 1/1996). Any other use requires the prior and express authorization of the author. In any case, when using its contents, the full name of the author and the title of the doctoral thesis must be clearly indicated. Reproduction or other forms of for-profit exploitation are not authorized, nor is public communication of the thesis from a site outside the TDX service. Presenting its content in a window or frame external to TDX (framing) is also not authorized. This reservation of rights covers the contents of the thesis as well as its abstracts and indexes.
