271 |
Run-time loop parallelization with efficient dependency checking on GPU-accelerated platforms
Zhang, Chenggang, 张呈刚 January 2011
General-Purpose computing on Graphics Processing Units (GPGPU) has attracted a lot of attention recently. Exciting results have been reported in using GPUs to accelerate applications in various domains such as scientific simulations, data mining, bio-informatics and computational finance. However, up to now GPUs can only accelerate data-parallel loops with statically analyzable parallelism. Loops with dynamic parallelism (e.g., with array accesses through subscripted subscripts), an important pattern in many general-purpose applications, cannot be parallelized on GPUs using existing technologies.
Run-time loop parallelization using Thread-Level Speculation (TLS) has been proposed in the literature to parallelize loops with statically un-analyzable dependencies. However, most existing TLS systems are designed for multiprocessor or multi-core CPUs. GPUs differ fundamentally from CPUs in both hardware architecture and execution model, so previous TLS designs either fail to work or run inefficiently when ported to GPUs. This thesis presents GPU-TLS, a runtime system designed to support speculative loop parallelization on GPUs. The design of GPU-TLS addresses several key problems encountered when adapting TLS to GPUs: (1) To reduce the possibility of mis-speculation, a deferred-update memory versioning scheme is adopted to avoid mis-speculations caused by inter-iteration WAR and WAW dependencies. A technique named intra-warp value forwarding is proposed to respect some inter-iteration RAW dependencies, which further reduces the likelihood of mis-speculation. (2) An incremental speculative execution scheme is designed to exploit partial parallelism within loops. This avoids excessive re-executions and reduces the mis-speculation penalty. (3) Dependency checking among thousands of speculative GPU threads incurs a large overhead and can easily become the performance bottleneck. To lower the overhead, we design several efficient dependency checking schemes, named PRW+BDC, SW, SR, SRW+EDC, and SRW+LDC. (4) We devise a novel parallel commit scheme to avoid the overhead incurred by the serial commit phase in most existing TLS designs.
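The interplay of deferred updates, dependency checking and incremental re-execution described above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the GPU-TLS implementation: the loop body `a[w[i]] = a[r[i]] + 1`, the function names and the round-based structure are hypothetical, and the "parallel" iterations are simply simulated one after another on the CPU.

```python
def run_sequential(a, r, w):
    """Reference semantics: a loop with subscripted subscripts."""
    a = list(a)
    for i in range(len(r)):
        a[w[i]] = a[r[i]] + 1
    return a


def run_speculative(a, r, w):
    """Simulate speculative execution in rounds (hypothetical sketch).

    Returns the final array and the number of rounds that were needed.
    """
    a = list(a)
    n = len(r)
    start = 0
    rounds = 0
    while start < n:
        rounds += 1
        # Speculative phase: every remaining iteration reads the committed
        # state and buffers its write (deferred update), so WAR and WAW
        # dependencies cannot corrupt shared memory.
        buffered = {i: a[r[i]] + 1 for i in range(start, n)}

        # Dependency check: find the first iteration that read a location
        # written by an earlier iteration of this round (a RAW violation).
        written = set()
        stop = n
        for i in range(start, n):
            if r[i] in written:
                stop = i
                break
            written.add(w[i])

        # Commit phase: apply buffered writes of the safe prefix in
        # iteration order, so the latest writer wins on WAW conflicts.
        for i in range(start, stop):
            a[w[i]] = buffered[i]

        # Incremental re-execution: only iterations from the first
        # violation onward are retried in the next round.
        start = stop
    return a, rounds
```

The first iteration of each round can never violate a dependency, so every round commits at least one iteration and the simulation always terminates with the same result as the sequential loop.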
We have carried out extensive experiments on two platforms with different NVIDIA GPUs, using both a synthetic loop that can simulate loops with different characteristics and several loops from real-life applications. The results show that the proposed intra-warp value forwarding and eager dependency checking techniques improve performance for almost all kinds of loop patterns. We observe that, compared with the other dependency checking schemes, SR and SW achieve better performance in most cases. The results also show that the proposed parallel commit scheme is especially useful for loops with a large write-set size and a small number of inter-iteration WAW dependencies. Overall, GPU-TLS achieves speedups ranging from 5x to 105x for loops with dynamic parallelism. / published_or_final_version / Computer Science / Master / Master of Philosophy
|
272 |
Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures
Han, Guodong, 韩国栋 January 2013
GPU-based heterogeneous architectures (e.g., Tianhe-1A, Nebulae), which combine multi-core CPUs and GPUs, have seen increasing adoption and are becoming the norm in supercomputing because they are cost-effective and power-efficient.
However, programming such heterogeneous architectures still requires significant effort from application developers, who must use sophisticated GPU programming languages such as CUDA and OpenCL. Although some automatic parallelization tools based on static analysis can ease the programming effort, this approach can only parallelize loops that are entirely free of inter-iteration dependencies (i.e., determined DO-ALL loops) because static analysis is imprecise.
To exploit the abundant runtime parallelism and take full advantage of the computing resources in both the CPU and the GPU, we propose a new user-friendly compiler framework and runtime system that helps Java applications harness the full power of a heterogeneous system. It presents a comprehensive system design that unifies the programming style and language for transparent use of both CPUs and GPUs, automatically parallelizes all kinds of loops, and schedules workloads efficiently across CPU and GPU resources while ensuring data coherence during highly threaded execution. By means of simple user annotations, sequential Java source code is analyzed, translated and compiled into a dual executable consisting of CUDA kernels and multiple Java threads, which run on GPU and CPU cores respectively. Annotated loops are automatically split into loop chunks (or tasks) that are scheduled to execute on all available GPU and CPU cores. To guide the runtime task scheduling, we develop a novel dynamic loop profiler that generates the program dependency graph (PDG) and computes the density of dependencies across iterations through a hybrid checking scheme combining intra-warp and inter-warp analyses. By implementing a GPU-tailored thread-level speculation (TLS) model, our system supports speculative execution of loops with moderate dependency densities and privatization of loops that have only false dependencies on the GPU side. Our scheduler also supports task stealing and task sharing algorithms that allow swift load redistribution across the GPU and CPU.
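The idea of profiling a loop's inter-iteration dependencies and choosing an execution strategy from the result can be sketched as follows. This is a hypothetical illustration rather than the profiler described above: the trace format (one set of read addresses and one set of written addresses per iteration), the definition of density as the fraction of iterations with a RAW dependency on an earlier iteration, the strategy labels and the `0.5` threshold are all assumptions made for the example.

```python
def classify_loop(reads, writes, max_density=0.5):
    """Classify a loop from a per-iteration access trace (hypothetical).

    reads[i] / writes[i] are sets of addresses touched by iteration i.
    Returns (strategy, density), where density is the fraction of
    iterations that read a location written by an earlier iteration.
    """
    n = len(reads)
    written = set()      # addresses written by earlier iterations
    read_so_far = set()  # addresses read by earlier iterations
    raw_iters = 0
    has_false_dep = False

    for rd, wr in zip(reads, writes):
        # True (RAW) dependency: this iteration reads an earlier write.
        if rd & written:
            raw_iters += 1
        # False dependencies: WAW (earlier write) or WAR (earlier read).
        if (wr & written) or (wr & read_so_far):
            has_false_dep = True
        written |= wr
        read_so_far |= rd

    density = raw_iters / n if n else 0.0
    if raw_iters == 0:
        # No true dependencies: plain DO-ALL, or privatization if only
        # false dependencies were observed.
        return ("privatize" if has_false_dep else "do-all", density)
    if density <= max_density:
        return ("speculate", density)
    # Assumed fallback for dense true dependencies; the label is a
    # placeholder, not a strategy named in the abstract.
    return ("other", density)
```

For instance, a trace in which every iteration writes the same address but never reads it would be labelled `"privatize"`, since only WAW dependencies appear.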
We have carried out several experiments to evaluate the profiling overhead and used up to 11 real-life applications to evaluate the performance of our system. The results show that the overhead is moderate compared with sequential execution and that almost all of the applications benefit from our system. / published_or_final_version / Computer Science / Master / Master of Philosophy
|
273 |
Using parallel computation to apply the singular value decomposition (SVD) in solving for large Earth gravity fields based on satellite data
Hinga, Mark Brandon 28 August 2008
Not available / text
|
274 |
Compiler directed speculation for embedded clustered EPIC machines
Pillai, Satish 28 August 2008
Not available / text
|
275 |
Parallel machine scheduling with time windows
Rojanasoonthon, Siwate 28 August 2008
Not available / text
|
276 |
Performance enhancing software loop transformations for embedded VLIW/EPIC processors
Akturan, Cagdas, 1973- 14 March 2011
Not available / text
|
277 |
Architectural techniques to accelerate multimedia applications on general-purpose processors
Talla, Deependra, 1975- 06 April 2011
Not available / text
|
278 |
Fast and efficient video coding based on communication and computation scheduling on multiprocessors
Leung, Kwong-Keung., 梁光強. January 2001
published_or_final_version / abstract / toc / Electrical and Electronic Engineering / Doctoral / Doctor of Philosophy
|
279 |
Inheritance and OMT: a CSP approach
Chan, Wing-kwong., 陳榮光. January 1995
published_or_final_version / abstract / toc / Computer Science / Master / Master of Philosophy
|
280 |
Cyclone: a high-performance cluster-based webserver with socket cloning
Sit, Yiu-fai., 薛耀暉. January 2002
published_or_final_version / abstract / toc / Computer Science and Information Systems / Master / Master of Philosophy
|