Global ETD Search

271	Run-time loop parallelization with efficient dependency checking on GPU-accelerated platforms Zhang, Chenggang, 张呈刚 January 2011 General-purpose computing on graphics processing units (GPGPU) has attracted a lot of attention recently. Exciting results have been reported in using GPUs to accelerate applications in various domains such as scientific simulations, data mining, bioinformatics and computational finance. However, up to now GPUs can only accelerate data-parallel loops with statically analyzable parallelism. Loops with dynamic parallelism (e.g., with array accesses through subscripted subscripts), an important pattern in many general-purpose applications, cannot be parallelized on GPUs using existing technologies. Run-time loop parallelization using Thread Level Speculation (TLS) has been proposed in the literature to parallelize loops with statically un-analyzable dependencies. However, most existing TLS systems are designed for multiprocessor/multi-core CPUs. GPUs differ fundamentally from CPUs in both hardware architecture and execution model, so previous TLS designs either do not work or are inefficient when ported to GPUs. This thesis presents GPU-TLS, a runtime system designed to support speculative loop parallelization on GPUs. The design of GPU-TLS addresses several key problems encountered when adapting TLS to GPUs: (1) To reduce the possibility of mis-speculation, a deferred-update memory versioning scheme is adopted, which avoids mis-speculations caused by inter-iteration WAR and WAW dependencies. A technique named intra-warp value forwarding is proposed to respect some inter-iteration RAW dependencies, which further reduces the possibility of mis-speculation. (2) An incremental speculative execution scheme is designed to exploit partial parallelism within loops. This avoids excessive re-executions and reduces the mis-speculation penalty. 
(3) Dependency checking among thousands of speculative GPU threads incurs a large overhead and can easily become the performance bottleneck. To lower the overhead, we design several efficient dependency checking schemes named PRW+BDC, SW, SR, SRW+EDC, and SRW+LDC. (4) We devise a novel parallel commit scheme to avoid the overhead incurred by the serial commit phase in most existing TLS designs. We have carried out extensive experiments on two platforms with different NVIDIA GPUs, using both a synthetic loop that can simulate loops with different characteristics and several loops from real-life applications. Test results show that the proposed intra-warp value forwarding and eager dependency checking techniques can improve performance for almost all kinds of loop patterns. We observe that, compared with the other dependency checking schemes, SR and SW achieve better performance in most cases. It is also shown that the proposed parallel commit scheme is especially useful for loops with a large write set and a small number of inter-iteration WAW dependencies. Overall, GPU-TLS can achieve speedups ranging from 5 to 105 for loops with dynamic parallelism. / published_or_final_version / Computer Science / Master / Master of Philosophy Graphics processing units. Threads (Computer programs)
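The kind of loop that record 271 targets can be illustrated with a small sequential model. The sketch below is not GPU-TLS itself and is not taken from the thesis. It is a plain Python approximation, assuming a loop of the form a[write_idx[i]] = f(a[read_idx[i]]), of three ideas the abstract names: writes are buffered (deferred update), so WAR and WAW dependencies cause no mis-speculation; a run-time check finds the first iteration with an inter-iteration RAW dependency; and only the iterations from that point on are re-executed (incremental speculation). All function and parameter names are hypothetical.

```python
def run_sequential(a, read_idx, write_idx, f):
    """Reference semantics: execute the loop one iteration at a time."""
    a = list(a)
    for r, w in zip(read_idx, write_idx):
        a[w] = f(a[r])
    return a


def run_speculative(a, read_idx, write_idx, f):
    """Model of speculative execution with deferred updates.

    Returns the final array and the number of speculative rounds needed.
    On a real GPU each iteration in a round would be a separate thread;
    here the round is simulated with ordinary Python loops.
    """
    a = list(a)
    n = len(read_idx)
    start, rounds = 0, 0
    while start < n:
        rounds += 1
        # Speculative phase: every pending iteration reads the committed
        # state of `a` and buffers its result instead of writing it.
        buffered = [f(a[read_idx[i]]) for i in range(start, n)]

        # Dependency check: find the first iteration that reads a location
        # written by an earlier iteration of this round (a RAW violation).
        written = set()
        stop = n
        for i in range(start, n):
            if read_idx[i] in written:
                stop = i
                break
            written.add(write_idx[i])

        # Commit phase: apply buffered writes of the safe prefix in
        # iteration order, so the last writer wins on WAW conflicts.
        for i in range(start, stop):
            a[write_idx[i]] = buffered[i - start]

        # Iterations from `stop` onward are re-executed in the next round.
        start = stop
    return a, rounds
```

For example, with a = [1, 2, 3, 4], read_idx = [0, 1, 0, 3], write_idx = [1, 2, 3, 0] and f adding 10, iteration 1 reads the location that iteration 0 writes, so the model needs more than one round but still produces the sequential result. A loop whose read and write locations never overlap finishes in a single round.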
272	Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures Han, Guodong, 韩国栋 January 2013 GPU-based heterogeneous architectures (e.g., Tianhe-1A, Nebulae), which combine multi-core CPUs and GPUs, have seen increasing adoption and are becoming the norm in supercomputing because they are cost-effective and power-efficient. However, programming such heterogeneous architectures still requires significant effort from application developers using sophisticated GPU programming languages such as CUDA and OpenCL. Although some automatic parallelization tools based on static analysis can ease the programming effort, this approach can only parallelize loops that are entirely free of inter-iteration dependencies (i.e., determined DO-ALL loops) because static analysis is imprecise. To exploit the abundant runtime parallelism and take full advantage of the computing resources of both CPU and GPU, in this work we propose a new user-friendly compiler framework and runtime system that helps Java applications harness the full power of a heterogeneous system. It presents a comprehensive system design that unifies the programming style and language for transparent use of both CPUs and GPUs, automatically parallelizes all kinds of loops, and schedules workloads efficiently across CPU and GPU resources while ensuring data coherence during highly threaded execution. By means of simple user annotations, sequential Java source code is analyzed, translated and compiled into a dual executable consisting of CUDA kernels and multiple Java threads, running on GPU and CPU cores respectively. Annotated loops are automatically split into loop chunks (or tasks) that are scheduled to execute on all available GPU/CPU cores. 
To guide the runtime task scheduling, we develop a novel dynamic loop profiler which generates the program dependency graph (PDG) and computes the density of dependencies across iterations through a hybrid checking scheme combining intra-warp and inter-warp analyses. By implementing a GPU-tailored thread-level speculation (TLS) model, our system supports speculative execution of loops with moderate dependency densities and privatization of loops having only false dependencies on the GPU side. Our scheduler also supports task stealing and task sharing algorithms that allow swift load redistribution across GPU and CPU. We have carried out several experiments to evaluate the profiling overhead and used up to 11 real-life applications to evaluate system performance. Test results show that the overhead is moderate compared with sequential execution and that almost all the applications benefit from our system. / published_or_final_version / Computer Science / Master / Master of Philosophy Graphics processing units. Computer architecture.
273	Using parallel computation to apply the singular value decomposition (SVD) in solving for large Earth gravity fields based on satellite data Hinga, Mark Brandon 28 August 2008 Not available / text Gravity--Measurement--Computer programs Parallel algorithms
274	Compiler directed speculation for embedded clustered EPIC machines Pillai, Satish 28 August 2008 Not available / text Compilers (Computer programs) Embedded computer systems
275	Parallel machine scheduling with time windows Rojanasoonthon, Siwate 28 August 2008 Not available / text Scheduling--Mathematical models Heuristic programming
276	Performance enhancing software loop transformations for embedded VLIW/EPIC processors Akturan, Cagdas, 1973- 14 March 2011 Not available / text Computer software Computer architecture
277	Architectural techniques to accelerate multimedia applications on general-purpose processors Talla, Deependra, 1975- 06 April 2011 Not available / text Multimedia systems Computer architecture
278	Fast and efficient video coding based on communication and computation scheduling on multiprocessors Leung, Kwong-Keung., 梁光強. January 2001 published_or_final_version / abstract / toc / Electrical and Electronic Engineering / Doctoral / Doctor of Philosophy Computer algorithms Coding theory.
279	Inheritance and OMT: a CSP approach Chan, Wing-kwong., 陳榮光. January 1995 published_or_final_version / abstract / toc / Computer Science / Master / Master of Philosophy
280	Cyclone: a high-performance cluster-based webserver with socket cloning Sit, Yiu-fai., 薛耀暉. January 2002 published_or_final_version / abstract / toc / Computer Science and Information Systems / Master / Master of Philosophy Web servers. Computer networks.
