Global ETD Search

1	Automatic Program Parallelization Using Traces Bradel, Borys 16 March 2011 (has links) We present a novel automatic parallelization approach that uses traces. Our approach uses a binary representation of a program, allowing for the parallelization of programs even if their full source code is not available. Furthermore, traces can represent both iteration and recursion. We use hardware transactional memory (HTM) to ensure correct execution in the presence of dependences. We describe a parallel trace execution model that allows sequential programs to execute in parallel. In the model, traces are identified by a trace collection system (TCS), the program is transformed to allow the traces to execute on multiple processors, and the traces are executed in parallel. We present a framework with four components that, with a TCS, realizes our execution model. The grouping component groups traces into tasks to reduce overhead and make identification of successor traces easier. The packaging component allows tasks to execute on multiple processors. The dependence component deals with dependences on reduction and induction variables. In addition, transactions are committed in sequential program order on an HTM system to deal with dependences that are not removed. Finally, the scheduler assigns tasks to processors. We create a prototype that parallelizes programs and uses an HTM simulator to deal with dependences. To overcome the limitations of simulation, we also create another prototype that automatically parallelizes programs on a real system. Since HTM is not used, only dependences on induction and reduction variables are handled. We demonstrate the feasibility of our trace-based parallelization approach by performing an experimental evaluation on several recursive and loop-based Java programs. On the HTM system, the average speedup of the computational phase of the benchmarks on four processors is 2.79. On a real system, the average speedup on four processors is 1.83. Therefore, the evaluation indicates that trace-based parallelization can be used to effectively parallelize recursive and loop-based Java programs based on their binary representation. automatic parallelization traces 0544
2	Automatic Program Parallelization Using Traces Bradel, Borys 16 March 2011 (has links) We present a novel automatic parallelization approach that uses traces. Our approach uses a binary representation of a program, allowing for the parallelization of programs even if their full source code is not available. Furthermore, traces can represent both iteration and recursion. We use hardware transactional memory (HTM) to ensure correct execution in the presence of dependences. We describe a parallel trace execution model that allows sequential programs to execute in parallel. In the model, traces are identified by a trace collection system (TCS), the program is transformed to allow the traces to execute on multiple processors, and the traces are executed in parallel. We present a framework with four components that, with a TCS, realizes our execution model. The grouping component groups traces into tasks to reduce overhead and make identification of successor traces easier. The packaging component allows tasks to execute on multiple processors. The dependence component deals with dependences on reduction and induction variables. In addition, transactions are committed in sequential program order on an HTM system to deal with dependences that are not removed. Finally, the scheduler assigns tasks to processors. We create a prototype that parallelizes programs and uses an HTM simulator to deal with dependences. To overcome the limitations of simulation, we also create another prototype that automatically parallelizes programs on a real system. Since HTM is not used, only dependences on induction and reduction variables are handled. We demonstrate the feasibility of our trace-based parallelization approach by performing an experimental evaluation on several recursive and loop-based Java programs. On the HTM system, the average speedup of the computational phase of the benchmarks on four processors is 2.79. On a real system, the average speedup on four processors is 1.83. Therefore, the evaluation indicates that trace-based parallelization can be used to effectively parallelize recursive and loop-based Java programs based on their binary representation. automatic parallelization traces 0544
3	Automated detection of structured coarse-grained parallelism in sequential legacy applications Edler Von Koch, Tobias Joseph Kastulus January 2014 (has links) The efficient execution of sequential legacy applications on modern, parallel computer architectures is one of today’s most pressing problems. Automatic parallelization has been investigated as a potential solution for several decades but its success generally remains restricted to small niches of regular, array-based applications. This thesis investigates two techniques that have the potential to overcome these limitations. Beginning at the lowest level of abstraction, the binary executable, it presents a study of the limits of Dynamic Binary Parallelization (Dbp), a recently proposed technique that takes advantage of an underlying multicore host to transparently parallelize a sequential binary executable. While still in its infancy, Dbp has received broad interest within the research community. This thesis seeks to gain an understanding of the factors contributing to the limits of Dbp and the costs and overheads of its implementation. An extensive evaluation using a parameterizable Dbp system targeting a Cmp with light-weight architectural Tls support is presented. The results show that there is room for a significant reduction of up to 54% in the number of instructions on the critical paths of legacy Spec Cpu2006 benchmarks, but that it is much harder to translate these savings into actual performance improvements, with a realistic hardware-supported implementation achieving a speedup of 1.09 on average. While automatically parallelizing compilers have traditionally focused on data parallelism, additional parallelism exists in a plethora of other shapes such as task farms, divide & conquer, map/reduce and many more. These algorithmic skeletons, i.e. high-level abstractions for commonly used patterns of parallel computation, differ substantially from data parallel loops. Unfortunately, algorithmic skeletons are largely informal programming abstractions and are lacking a formal characterization in terms of established compiler concepts. This thesis develops compiler-friendly characterizations of popular algorithmic skeletons using a novel notion of commutativity based on liveness. A hybrid static/dynamic analysis framework for the context-sensitive detection of skeletons in legacy code that overcomes limitations of static analysis by complementing it with profiling information is described. A proof-of-concept implementation of this framework in the Llvm compiler infrastructure is evaluated against Spec Cpu2006 benchmarks for the detection of a typical skeleton. The results illustrate that skeletons are often context-sensitive in nature. Like the two approaches presented in this thesis, many dynamic parallelization techniques exploit the fact that some statically detected data and control flow dependences do not manifest themselves in every possible program execution (may-dependences) but occur only infrequently, e.g. for some corner cases, or not at all for any legal program input. While the effectiveness of dynamic parallelization techniques critically depends on the absence of such dependences, not much is known about their nature. This thesis presents an empirical analysis and characterization of the variability of both data dependences and control flow across program runs. The cBench benchmark suite is run with 100 randomly chosen input data sets to generate whole-program control and data flow graphs (Cdfgs) for each run, which are then compared to obtain a measure of the variance in the observed control and data flow. The results show that, on average, the cumulative profile information gathered with at least 55, and up to 100, different input data sets is needed to achieve full coverage of the data flow observed across all runs. For control flow, the figure stands at 46 and 100 data sets, respectively. This suggests that profile-guided parallelization needs to be applied with utmost care, as misclassification of sequential loops as parallel was observed even when up to 94 input data sets are used. 005.2
4	Programmer-assisted Automatic Parallelization Huang, Diego 08 December 2011 (has links) Parallel software is now required to exploit the abundance of threads and processors in modern multicore computers. Unfortunately, manual parallelization is too time-consuming and error-prone for all but the most advanced programmers. While automatic parallelization promises threaded software with little programmer effort, current auto-parallelizers are easily thwarted by pointers and other forms of ambiguity in the code. In this dissertation we profile the loops in SPEC CPU2006, categorize the loops in terms of available parallelism, and focus on promising loops that are not parallelized by IBM's XL C/C++ V10 auto-parallelizer. For those loops we propose methods of improved interaction between the programmer and compiler that can facilitate their parallelization. In particular, we (i) suggest methods for the compiler to better identify to the programmer the parallelization-blockers; (ii) suggest methods for the programmer to provide guarantees to the compiler that overcome these parallelization-blockers; and (iii) evaluate the resulting impact on performance. compiler automatic parallelization programmer guarantees 0984 0537
5	Programmer-assisted Automatic Parallelization Huang, Diego 08 December 2011 (has links) Parallel software is now required to exploit the abundance of threads and processors in modern multicore computers. Unfortunately, manual parallelization is too time-consuming and error-prone for all but the most advanced programmers. While automatic parallelization promises threaded software with little programmer effort, current auto-parallelizers are easily thwarted by pointers and other forms of ambiguity in the code. In this dissertation we profile the loops in SPEC CPU2006, categorize the loops in terms of available parallelism, and focus on promising loops that are not parallelized by IBM's XL C/C++ V10 auto-parallelizer. For those loops we propose methods of improved interaction between the programmer and compiler that can facilitate their parallelization. In particular, we (i) suggest methods for the compiler to better identify to the programmer the parallelization-blockers; (ii) suggest methods for the programmer to provide guarantees to the compiler that overcome these parallelization-blockers; and (iii) evaluate the resulting impact on performance. compiler automatic parallelization programmer guarantees 0984 0537
6	Guided automatic binary parallelisation Zhou, Ruoyu January 2018 (has links) For decades, the software industry has amassed a vast repository of pre-compiled libraries and executables which are still valuable and actively in use. However, for a significant fraction of these binaries, most of the source code is absent or is written in old languages, making it practically impossible to recompile them for new generations of hardware. As the number of cores in chip multi-processors (CMPs) continue to scale, the performance of this legacy software becomes increasingly sub-optimal. Rewriting new optimised and parallel software would be a time-consuming and expensive task. Without source code, existing automatic performance enhancing and parallelisation techniques are not applicable for legacy software or parts of new applications linked with legacy libraries. In this dissertation, three tools are presented to address the challenge of optimising legacy binaries. The first, GBR (Guided Binary Recompilation), is a tool that recompiles stripped application binaries without the need for the source code or relocation information. GBR performs static binary analysis to determine how recompilation should be undertaken, and produces a domain-specific hint program. This hint program is loaded and interpreted by the GBR dynamic runtime, which is built on top of the open-source dynamic binary translator, DynamoRIO. In this manner, complicated recompilation of the target binary is carried out to achieve optimised execution on a real system. The problem of limited dataflow and type information is addressed through cooperation between the hint program and JIT optimisation. The utility of GBR is demonstrated by software prefetch and vectorisation optimisations to achieve performance improvements compared to their original native execution. The second tool is called BEEP (Binary Emulator for Estimating Parallelism), an extension to GBR for binary instrumentation. BEEP is used to identify potential thread-level parallelism through static binary analysis and binary instrumentation. BEEP performs preliminary static analysis on binaries and encodes all statically-undecided questions into a hint program. The hint program is interpreted by GBR so that on-demand binary instrumentation codes are inserted to answer the questions from runtime information. BEEP incorporates a few parallel cost models to evaluate identified parallelism under different parallelisation paradigms. The third tool is named GABP (Guided Automatic Binary Parallelisation), an extension to GBR for parallelisation. GABP focuses on loops from sequential application binaries and automatically extracts thread-level parallelism from them on-the-fly, under the direction of the hint program, for efficient parallel execution. It employs a range of runtime schemes, such as thread-level speculation and synchronisation, to handle runtime data dependences. GABP achieves a geometric mean of speedup of 1.91x on binaries from SPEC CPU2006 on a real x86-64 eight-core system compared to native sequential execution. Performance is obtained for SPEC CPU2006 executables compiled from a variety of source languages and by different compilers.
7	Parallel SAT solvers and their application in automatic parallelization / SAT solvers paralelos e suas aplicações em paralelização automática Silveira, Jaime Kirch da January 2014 (has links) Desde a diminuição da tendência de aumento na frequência de processadores, uma nova tendência surgiu para permitir que softwares tirem proveito de harwares mais rápidos: a paralelização. Contudo, diferente de aumentar a frequência de processadores, utilizar parallelização requer um tipo diferente de programação, a programação paralela, que é geralmente mais difícil que a programação sequencial comum. Neste contexto, a paralelização automática apareceu, permitindo que o software tire proveito do paralelismo sem a necessidade de programação paralela. Nós apresentamos aqui duas propostas: SAT-PaDdlinG e RePaSAT. SAT-PaDdlinG é um SAT Solver DPLL paralelo que roda em GPU, o que permite que RePaSAT utilize esse ambiente. RePaSAT é a nossa proposta de uma máquina paralela que utiliza o Problema SAT para paralelizar automaticamente código sequencial. Como uma GPU provê um ambiente barato e massivamente paralelo, SAT-PaDdlinG tem como objetivo prover esse paralelismo massivo a baixo custo para RePaSAT, como para qualquer outra ferramenta ou problema que utilize SAT Solvers. / Since the slowdown in improvement in the frequency of processors, a new tendency has arisen to allow software to take advantage of faster hardware: parallelization. However, different from increasing the frequency of processors, using parallelization requires a different kind of programming, parallel programming, which is usually harder than common sequential programming. In this context, automatic parallelization has arisen, allowing software to take advantage of parallelism without the need of parallel programming. We present here two proposals: SAT-PaDdlinG and RePaSAT. SAT-PaDdlinG is a parallel DPLL SAT Solver on GPU, which allows RePaSAT to use this environment. RePaSAT is our proposal of a parallel machine that uses the SAT Problem to automatically parallelize sequential code. Because GPU provides a cheap, massively parallel environment, SATPaDdlinG aims at providing this massive parallelism and low cost to RePaSAT, as well as to any other tool or problem that uses SAT Solvers. Microeletrônica Processadores Parallel SAT solver Automatic parallelization RePaSAT SAT-PaDdlinG
8	Parallel SAT solvers and their application in automatic parallelization / SAT solvers paralelos e suas aplicações em paralelização automática Silveira, Jaime Kirch da January 2014 (has links) Desde a diminuição da tendência de aumento na frequência de processadores, uma nova tendência surgiu para permitir que softwares tirem proveito de harwares mais rápidos: a paralelização. Contudo, diferente de aumentar a frequência de processadores, utilizar parallelização requer um tipo diferente de programação, a programação paralela, que é geralmente mais difícil que a programação sequencial comum. Neste contexto, a paralelização automática apareceu, permitindo que o software tire proveito do paralelismo sem a necessidade de programação paralela. Nós apresentamos aqui duas propostas: SAT-PaDdlinG e RePaSAT. SAT-PaDdlinG é um SAT Solver DPLL paralelo que roda em GPU, o que permite que RePaSAT utilize esse ambiente. RePaSAT é a nossa proposta de uma máquina paralela que utiliza o Problema SAT para paralelizar automaticamente código sequencial. Como uma GPU provê um ambiente barato e massivamente paralelo, SAT-PaDdlinG tem como objetivo prover esse paralelismo massivo a baixo custo para RePaSAT, como para qualquer outra ferramenta ou problema que utilize SAT Solvers. / Since the slowdown in improvement in the frequency of processors, a new tendency has arisen to allow software to take advantage of faster hardware: parallelization. However, different from increasing the frequency of processors, using parallelization requires a different kind of programming, parallel programming, which is usually harder than common sequential programming. In this context, automatic parallelization has arisen, allowing software to take advantage of parallelism without the need of parallel programming. We present here two proposals: SAT-PaDdlinG and RePaSAT. SAT-PaDdlinG is a parallel DPLL SAT Solver on GPU, which allows RePaSAT to use this environment. RePaSAT is our proposal of a parallel machine that uses the SAT Problem to automatically parallelize sequential code. Because GPU provides a cheap, massively parallel environment, SATPaDdlinG aims at providing this massive parallelism and low cost to RePaSAT, as well as to any other tool or problem that uses SAT Solvers. Microeletrônica Processadores Parallel SAT solver Automatic parallelization RePaSAT SAT-PaDdlinG
9	Parallel SAT solvers and their application in automatic parallelization / SAT solvers paralelos e suas aplicações em paralelização automática Silveira, Jaime Kirch da January 2014 (has links) Desde a diminuição da tendência de aumento na frequência de processadores, uma nova tendência surgiu para permitir que softwares tirem proveito de harwares mais rápidos: a paralelização. Contudo, diferente de aumentar a frequência de processadores, utilizar parallelização requer um tipo diferente de programação, a programação paralela, que é geralmente mais difícil que a programação sequencial comum. Neste contexto, a paralelização automática apareceu, permitindo que o software tire proveito do paralelismo sem a necessidade de programação paralela. Nós apresentamos aqui duas propostas: SAT-PaDdlinG e RePaSAT. SAT-PaDdlinG é um SAT Solver DPLL paralelo que roda em GPU, o que permite que RePaSAT utilize esse ambiente. RePaSAT é a nossa proposta de uma máquina paralela que utiliza o Problema SAT para paralelizar automaticamente código sequencial. Como uma GPU provê um ambiente barato e massivamente paralelo, SAT-PaDdlinG tem como objetivo prover esse paralelismo massivo a baixo custo para RePaSAT, como para qualquer outra ferramenta ou problema que utilize SAT Solvers. / Since the slowdown in improvement in the frequency of processors, a new tendency has arisen to allow software to take advantage of faster hardware: parallelization. However, different from increasing the frequency of processors, using parallelization requires a different kind of programming, parallel programming, which is usually harder than common sequential programming. In this context, automatic parallelization has arisen, allowing software to take advantage of parallelism without the need of parallel programming. We present here two proposals: SAT-PaDdlinG and RePaSAT. SAT-PaDdlinG is a parallel DPLL SAT Solver on GPU, which allows RePaSAT to use this environment. RePaSAT is our proposal of a parallel machine that uses the SAT Problem to automatically parallelize sequential code. Because GPU provides a cheap, massively parallel environment, SATPaDdlinG aims at providing this massive parallelism and low cost to RePaSAT, as well as to any other tool or problem that uses SAT Solvers. Microeletrônica Processadores Parallel SAT solver Automatic parallelization RePaSAT SAT-PaDdlinG
10	Effective Automatic Parallelization and Locality Optimization Using The Polyhedral Model Bondhugula, Uday Kumar 11 September 2008 (has links) No description available. Computer Science Automatic Parallelization Polyhedral Model Compiler Optimizations

Search results