1 |
Automatic Program Parallelization Using TracesBradel, Borys 16 March 2011 (has links)
We present a novel automatic parallelization approach that uses traces. Our approach uses a binary representation of a program, allowing for the parallelization of programs even if their full source code is not available. Furthermore, traces can represent both iteration and recursion. We use hardware transactional memory (HTM) to ensure correct execution in the presence of dependences.
We describe a parallel trace execution model that allows sequential programs to execute in parallel. In the model, traces are identified by a trace collection system (TCS), the program is transformed to allow the traces to execute on multiple processors, and the traces are executed in parallel.
We present a framework with four components that, with a TCS, realizes our execution model. The grouping component groups traces into tasks to reduce overhead and make identification of successor traces easier. The packaging component allows tasks to execute on multiple processors. The dependence component deals with dependences on reduction and induction variables. In addition, transactions are committed in sequential program order on an HTM system to deal with dependences that are not removed. Finally, the scheduler assigns tasks to processors.
We create a prototype that parallelizes programs and uses an HTM simulator to deal with dependences. To overcome the limitations of simulation, we also create another prototype that automatically parallelizes programs on a real system. Since HTM is not used, only dependences on induction and reduction variables are handled.
We demonstrate the feasibility of our trace-based parallelization approach by performing an experimental evaluation on several recursive and loop-based Java programs. On the HTM system, the average speedup of the computational phase of the benchmarks on four processors is 2.79. On a real system, the average speedup on four processors is 1.83. Therefore, the evaluation indicates that trace-based parallelization can be used to effectively parallelize recursive and loop-based Java programs based on their binary representation.
|
2 |
Speculative parallelization of partially parallel loopsDang, Francis Hoai Dinh 15 May 2009 (has links)
Current parallelizing compilers cannot identify a significant fraction of parallelizable
loops because they have complex or statically insufficiently defined access patterns.
In our previous work, we have speculatively executed a loop as a doall, and applied a
fully parallel data dependence test to determine if it had any cross–processor depen-
dences. If the test failed, then the loop was re–executed serially. While this method
exploits doall parallelism well, it can cause slowdowns for loops with even one cross-
processor flow dependence because we have to re-execute sequentially. Moreover, the
existing, partial parallelism of loops is not exploited.
We demonstrate a generalization of the speculative doall parallelization tech-
nique, called the Recursive LRPD test, that can extract and exploit the maximum
available parallelism of any loop and that limits potential slowdowns to the over-
head of the run-time dependence test itself. In this thesis, we have presented the
base algorithm and an analysis of the different heuristics for its practical applica-
tion. To reduce the run-time overhead of the Recursive LRPD test, we have im-
plemented on-demand checkpointing and commit, more efficient data dependence
analysis and shadow structures, and feedback-guided load balancing. We obtained
scalable speedups for loops from Track, Spice, and FMA3D that were not paralleliz-
able by previous speculative parallelization methods.
|
3 |
Parallel Algorithms for Time and Frequency Domain Circuit SimulationDong, Wei 2009 August 1900 (has links)
As a most critical form of pre-silicon verification, transistor-level circuit simulation
is an indispensable step before committing to an expensive manufacturing process.
However, considering the nature of circuit simulation, it can be computationally
expensive, especially for ever-larger transistor circuits with more complex device models.
Therefore, it is becoming increasingly desirable to accelerate circuit simulation.
On the other hand, the emergence of multi-core machines offers a promising solution
to circuit simulation besides the known application of distributed-memory clustered
computing platforms, which provides abundant hardware computing resources. This
research addresses the limitations of traditional serial circuit simulations and proposes
new techniques for both time-domain and frequency-domain parallel circuit
simulations.
For time-domain simulation, this dissertation presents a parallel transient simulation
methodology. This new approach, called WavePipe, exploits coarse-grained
application-level parallelism by simultaneously computing circuit solutions at multiple
adjacent time points in a way resembling hardware pipelining. There are two
embodiments in WavePipe: backward and forward pipelining schemes. While the
former creates independent computing tasks that contribute to a larger future time
step, the latter performs predictive computing along the forward direction. Unlike
existing relaxation methods, WavePipe facilitates parallel circuit simulation without jeopardizing convergence and accuracy. As a coarse-grained parallel approach, it requires
low parallel programming effort, furthermore it creates new avenues to have a
full utilization of increasingly parallel hardware by going beyond conventional finer
grained parallel device model evaluation and matrix solutions.
This dissertation also exploits the recently developed explicit telescopic projective
integration method for efficient parallel transient circuit simulation by addressing the
stability limitation of explicit numerical integration. The new method allows the
effective time step controlled by accuracy requirement instead of stability limitation.
Therefore, it not only leads to noticeable efficiency improvement, but also lends itself
to straightforward parallelization due to its explicit nature.
For frequency-domain simulation, this dissertation presents a parallel harmonic
balance approach, applicable to the steady-state and envelope-following analyses of
both driven and autonomous circuits. The new approach is centered on a naturally-parallelizable
preconditioning technique that speeds up the core computation in harmonic
balance based analysis. The proposed method facilitates parallel computing
via the use of domain knowledge and simplifies parallel programming compared with
fine-grained strategies. As a result, favorable runtime speedups are achieved.
|
4 |
Automatic Program Parallelization Using TracesBradel, Borys 16 March 2011 (has links)
We present a novel automatic parallelization approach that uses traces. Our approach uses a binary representation of a program, allowing for the parallelization of programs even if their full source code is not available. Furthermore, traces can represent both iteration and recursion. We use hardware transactional memory (HTM) to ensure correct execution in the presence of dependences.
We describe a parallel trace execution model that allows sequential programs to execute in parallel. In the model, traces are identified by a trace collection system (TCS), the program is transformed to allow the traces to execute on multiple processors, and the traces are executed in parallel.
We present a framework with four components that, with a TCS, realizes our execution model. The grouping component groups traces into tasks to reduce overhead and make identification of successor traces easier. The packaging component allows tasks to execute on multiple processors. The dependence component deals with dependences on reduction and induction variables. In addition, transactions are committed in sequential program order on an HTM system to deal with dependences that are not removed. Finally, the scheduler assigns tasks to processors.
We create a prototype that parallelizes programs and uses an HTM simulator to deal with dependences. To overcome the limitations of simulation, we also create another prototype that automatically parallelizes programs on a real system. Since HTM is not used, only dependences on induction and reduction variables are handled.
We demonstrate the feasibility of our trace-based parallelization approach by performing an experimental evaluation on several recursive and loop-based Java programs. On the HTM system, the average speedup of the computational phase of the benchmarks on four processors is 2.79. On a real system, the average speedup on four processors is 1.83. Therefore, the evaluation indicates that trace-based parallelization can be used to effectively parallelize recursive and loop-based Java programs based on their binary representation.
|
5 |
3d terrain visualization and CPU parallelization of particle swarm optimizationWieczorek, Calvin L. January 2018 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Particle Swarm Optimization is a bio-inspired optimization technique used to approximately solve the non-deterministic polynomial (NP) problem of asset allocation in 3D space, frequency, antenna azimuth [1], and elevation orientation [1]. This research uses QT Data Visualization to display the PSO solutions, assets, transmitters in 3D space from the work done in [2]. Elevation and Imagery data was extracted from ARCGIS (a geographic information system (GIS) database) to add overlapping elevation and imagery data to that the 3D visualization displays proper topological data. The 3D environment range was improved and is now dynamic; giving the user appropriate coordinates based from the ARCGIS latitude and longitude ranges. The second part of the research improves the performance of the PSOs runtime, using OpenMP with CPU threading to parallelize the evaluation of the PSO by particle. Lastly, this implementation uses CPU multithreading with 4 threads to improve the performance of the PSO by 42% - 51% in comparison to running the PSO without CPU multithreading. The contributions provided allow for the PSO project to be more realistically simulate its use in the Electronic Warfare (EW) space, adding additional CPU multithreading implementation for further performance improvements.
|
6 |
High performance simplex solverHuangfu, Qi January 2013 (has links)
The dual simplex method is frequently the most efficient technique for solving linear programming (LP) problems. This thesis describes an efficient implementation of the sequential dual simplex method and the design and development of two parallel dual simplex solvers. In serial, many advanced techniques for the (dual) simplex method are implemented, including sparse LU factorization, hyper-sparse linear system solution technique, efficient approaches to updating LU factors and sophisticated dual simplex pivoting rules. These techniques, some of which are novel, lead to serial performance which is comparable with the best public domain dual simplex solver, providing a solid foundation for the simplex parallelization. During the implementation of the sequential dual simplex solver, the study of classic LU factor update techniques leads to the development of three novel update variants. One of them is comparable to the most efficient established approach but is much simpler in terms of implementation, and the other two are specially useful for one of the parallel simplex solvers. In addition, the study of the dual simplex pivoting rules identifies and motivates further investigation of how hyper-sparsity maybe promoted. In parallel, two high performance simplex solvers are designed and developed. One approach, based on a less-known dual pivoting rule called suboptimization, exploits parallelism across multiple iterations (PAMI). The other, based on the regular dual pivoting rule, exploits purely single iteration parallelism (SIP). The performance of PAMI is comparable to a world-leading commercial simplex solver. SIP is frequently complementary to PAMI in achieving speedup when PAMI results in slowdown.
|
7 |
Design and Implementation of C Programming Language Extension for Parallel GPU ComputingYang, Yu-Wei 27 July 2010 (has links)
NVIDIA developed a technique of executing general program on GPU, named CUDA (Compute Unified Device Architecture), in 2006. The CUDA programming model allows a group of same instructions to execute on multi-thread simultaneously, which has advantage of parallel programs in reducing the execution time significantly. Although CUDA provides a series of C-like APIs (Application Programming Interface) so that programmers can easy use CUDA language, it still costs certain efforts to be familiar with the development. In this thesis, we propose a tool to automatically translate C programs into corresponding CUDA programs which reduce program development time effectively.
|
8 |
Automated detection of structured coarse-grained parallelism in sequential legacy applicationsEdler Von Koch, Tobias Joseph Kastulus January 2014 (has links)
The efficient execution of sequential legacy applications on modern, parallel computer architectures is one of today’s most pressing problems. Automatic parallelization has been investigated as a potential solution for several decades but its success generally remains restricted to small niches of regular, array-based applications. This thesis investigates two techniques that have the potential to overcome these limitations. Beginning at the lowest level of abstraction, the binary executable, it presents a study of the limits of Dynamic Binary Parallelization (Dbp), a recently proposed technique that takes advantage of an underlying multicore host to transparently parallelize a sequential binary executable. While still in its infancy, Dbp has received broad interest within the research community. This thesis seeks to gain an understanding of the factors contributing to the limits of Dbp and the costs and overheads of its implementation. An extensive evaluation using a parameterizable Dbp system targeting a Cmp with light-weight architectural Tls support is presented. The results show that there is room for a significant reduction of up to 54% in the number of instructions on the critical paths of legacy Spec Cpu2006 benchmarks, but that it is much harder to translate these savings into actual performance improvements, with a realistic hardware-supported implementation achieving a speedup of 1.09 on average. While automatically parallelizing compilers have traditionally focused on data parallelism, additional parallelism exists in a plethora of other shapes such as task farms, divide & conquer, map/reduce and many more. These algorithmic skeletons, i.e. high-level abstractions for commonly used patterns of parallel computation, differ substantially from data parallel loops. Unfortunately, algorithmic skeletons are largely informal programming abstractions and are lacking a formal characterization in terms of established compiler concepts. This thesis develops compiler-friendly characterizations of popular algorithmic skeletons using a novel notion of commutativity based on liveness. A hybrid static/dynamic analysis framework for the context-sensitive detection of skeletons in legacy code that overcomes limitations of static analysis by complementing it with profiling information is described. A proof-of-concept implementation of this framework in the Llvm compiler infrastructure is evaluated against Spec Cpu2006 benchmarks for the detection of a typical skeleton. The results illustrate that skeletons are often context-sensitive in nature. Like the two approaches presented in this thesis, many dynamic parallelization techniques exploit the fact that some statically detected data and control flow dependences do not manifest themselves in every possible program execution (may-dependences) but occur only infrequently, e.g. for some corner cases, or not at all for any legal program input. While the effectiveness of dynamic parallelization techniques critically depends on the absence of such dependences, not much is known about their nature. This thesis presents an empirical analysis and characterization of the variability of both data dependences and control flow across program runs. The cBench benchmark suite is run with 100 randomly chosen input data sets to generate whole-program control and data flow graphs (Cdfgs) for each run, which are then compared to obtain a measure of the variance in the observed control and data flow. The results show that, on average, the cumulative profile information gathered with at least 55, and up to 100, different input data sets is needed to achieve full coverage of the data flow observed across all runs. For control flow, the figure stands at 46 and 100 data sets, respectively. This suggests that profile-guided parallelization needs to be applied with utmost care, as misclassification of sequential loops as parallel was observed even when up to 94 input data sets are used.
|
9 |
Simulation of Modelica Models on the CUDA ArchitectureÖstlund, Per January 2009 (has links)
<p>Simulations are very important for many reasons, and finding ways of accelerating simulations are therefore interesting. In this thesis the feasibility of automatically generating simulation code for a limited set of Modelica models that can be executed on NVIDIAs CUDA architecture is studied. The OpenModelica compiler, an open-source Modelica compiler, was for this purpose extended to generate CUDA code.</p><p>This thesis presents an overview of the CUDA architecture, and looks at the problems that need to be solved to generate efficient simulation code for this architecture. Methods of finding parallelism in models that can be used on the highly parallel CUDA architecture are shown, and methods of efficiently using the available memory spaces on the architecture are also presented.</p><p>This thesis shows that it is possible to generate CUDA simulation code for the set of Modelica models that were chosen. It also shows that for models with a large amount of parallelism it is possible to get significant speedups compared with simulation on a normal processor, and a speedup of 4.6 was reached for one of the models used in the thesis. Several suggestions on how the CUDA architecture can be used even more efficiently for Modelica simulations are also given.</p>
|
10 |
Programmer-assisted Automatic ParallelizationHuang, Diego 08 December 2011 (has links)
Parallel software is now required to exploit the abundance of threads and processors in modern multicore computers. Unfortunately, manual parallelization is too time-consuming and error-prone for all but the most advanced programmers. While automatic parallelization promises threaded software with little programmer effort, current auto-parallelizers are easily thwarted by pointers and other forms of ambiguity in the code. In this dissertation we profile the loops in SPEC CPU2006, categorize the loops in terms of available parallelism, and focus on promising loops that are not parallelized by IBM's XL C/C++ V10 auto-parallelizer. For those loops we propose methods of improved interaction between the programmer and compiler that can facilitate their parallelization. In particular, we (i) suggest methods for the compiler to better identify to the programmer the parallelization-blockers; (ii) suggest methods for the programmer to provide guarantees to the compiler that overcome these parallelization-blockers; and (iii) evaluate the resulting impact on performance.
|
Page generated in 0.2714 seconds