1. An enhanced GPU architecture for not-so-regular parallelism with special implications for database search
Narasiman, Veynu Tupil, 27 June 2014
Graphics Processing Units (GPUs) have become a popular platform for executing general purpose (i.e., non-graphics) applications. To run efficiently on a GPU, applications must be parallelized into many threads, each of which performs the same task but operates on different data (i.e., data parallelism). Previous work has shown that some applications experience significant speedup when executed on a GPU instead of a CPU. The applications that benefit most tend to have certain characteristics such as high computational intensity, regular control-flow and memory access patterns, and little to no communication among threads. However, not all parallel applications have these characteristics. Applications with a more balanced compute-to-memory ratio, divergent control flow, irregular memory accesses, and/or frequent communication (i.e., not-so-regular applications) do not take full advantage of the GPU's resources, resulting in performance far short of what could be delivered. The goal of this dissertation is to enhance the GPU architecture to better handle not-so-regular parallelism. This is accomplished in two parts. First, I analyze a diverse set of data parallel applications that suffer from divergent control flow and/or significant stall time due to memory. I propose two microarchitectural enhancements to the GPU, called the Large Warp Microarchitecture and Two-Level Warp Scheduling, to address these problems respectively. When combined, these mechanisms increase performance by 19% on average. Second, I examine one of the most important and fundamental applications in computing: database search. Database search is an excellent example of an application that is rich in parallelism, but rife with not-so-regular characteristics. I propose enhancements to the GPU architecture, including new instructions that improve intra-warp thread communication and decision making, and a row-buffer locality hint bit to better handle the irregular memory access patterns of index-based tree search. These proposals improve performance by 21% for full table scans and 39% for index-based search. The result of this dissertation is an enhanced GPU architecture that better handles not-so-regular parallelism. This increases the scope of applications that run efficiently on the GPU, making it a more viable platform not only for current parallel workloads such as databases, but also for future and emerging parallel applications.
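As a rough illustration (not the dissertation's proposed instructions), today's CUDA warp intrinsics already support the style of intra-warp communication and decision making described above. A minimal sketch of a warp-aggregated full-table scan, assuming a single integer column, a block size that is a multiple of 32, and illustrative names throughout:

```cuda
#include <cstdint>

// Warp-aggregated table scan: each warp evaluates the predicate on 32 rows,
// votes on the matches with a single __ballot_sync, and lets lane 0 reserve
// output slots with one atomicAdd instead of 32 separate ones.
// Assumes blockDim.x is a multiple of 32.
__global__ void tableScan(const int32_t *col, int n, int32_t key,
                          int32_t *outIdx, int *outCount) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;

    bool match = (tid < n) && (col[tid] == key);

    // Every lane learns which lanes of its warp matched, in one instruction.
    unsigned mask = __ballot_sync(0xffffffffu, match);

    int base = 0;
    if (lane == 0 && mask)                       // one atomic per warp
        base = atomicAdd(outCount, __popc(mask));
    base = __shfl_sync(0xffffffffu, base, 0);    // broadcast the slot base

    if (match) {
        // My slot = base + number of matching lanes below mine.
        int offset = __popc(mask & ((1u << lane) - 1));
        outIdx[base + offset] = tid;
    }
}
```

The single ballot replaces 32 independent atomics; cheaper and more general forms of exactly this kind of intra-warp cooperation are what the proposed instructions target.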
2. Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures
January 2017
With their massively multithreaded execution model, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPU). However, using GPUs to accelerate computation does not always yield good performance. This is mainly due to three inefficiencies in modern GPU and system architectures.
First, not all parallel threads have a uniform amount of workload to fully utilize the GPU's computation ability, leading to a sub-optimal performance problem called warp criticality. To mitigate the degree of warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism, called CAWA. CAWA predicts and accelerates critical warp execution by allocating larger execution time slices and additional cache resources to the critical warp. The evaluation shows that with CAWA, GPUs can achieve an average speedup of 1.23x.
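The dissertation's exact criticality metric is not reproduced here; a minimal simulator-side sketch, assuming criticality is a weighted sum of remaining work and accumulated stalls (the weights w1/w2 and struct names are illustrative stand-ins):

```cuda
#include <vector>

// Hypothetical sketch of CAWA-style criticality prediction. Criticality
// grows with the work a warp still has to do and the stall cycles it has
// accumulated so far.
struct WarpState {
    int  id;
    long instsRemaining;  // estimated from instructions retired so far
    long stallCycles;     // cycles spent waiting on memory or hazards
};

int predictCriticalWarp(const std::vector<WarpState> &warps,
                        double w1 = 1.0, double w2 = 0.5) {
    double best = -1.0;
    int critical = -1;
    for (const WarpState &w : warps) {
        double score = w1 * w.instsRemaining + w2 * w.stallCycles;
        if (score > best) { best = score; critical = w.id; }
    }
    // The scheduler would then grant this warp extra issue slots and
    // preferential cache treatment, as CAWA does.
    return critical;
}
```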
Second, the shared cache storage in GPUs is often insufficient to accommodate the demands of the large number of concurrent threads. As a result, cache thrashing is commonly experienced in GPU cache memories, particularly in the L1 data caches. To alleviate the cache contention and thrashing problem, I develop an instruction-aware Control Loop Based Adaptive Bypassing algorithm, called Ctrl-C. Ctrl-C learns the cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation shows that Ctrl-C can effectively improve cache utilization in GPUs and achieve an average speedup of 1.42x for cache-sensitive GPGPU workloads.
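A minimal sketch of the feedback-control idea, assuming a proportional controller that steers a per-instruction bypass probability toward a target hit rate; the set point, gain, and names are stand-ins for whatever Ctrl-C actually learns:

```cuda
#include <algorithm>

// Hypothetical sketch of control-loop-based adaptive bypassing.
struct BypassController {
    double bypassProb    = 0.0;  // fraction of requests sent around the L1
    double targetHitRate = 0.5;  // illustrative set point
    double gain          = 0.1;  // proportional gain

    // Called periodically with the hit rate observed over the last window.
    void update(double observedHitRate) {
        // Hit rate below target suggests thrashing, so bypass more.
        double error = targetHitRate - observedHitRate;
        bypassProb = std::min(1.0, std::max(0.0, bypassProb + gain * error));
    }

    bool shouldBypass(double r) const {  // r: uniform random in [0,1)
        return r < bypassProb;
    }
};
```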
Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system can contend for memory resources at multiple levels, resulting in significant performance degradation. To maximize the system throughput and balance the performance degradation of all co-located applications, I design a scalable performance degradation predictor specifically for heterogeneous systems, called HeteroPDP. HeteroPDP predicts application execution time and schedules OpenCL workloads to run on different devices based on the optimization goal. The evaluation shows that HeteroPDP can improve system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gain an additional 50% speedup compared with always offloading the OpenCL workload to the GPU.
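A minimal sketch of the placement decision, assuming the predictor reduces to a single slowdown factor per device; the names (DeviceEstimate, placeKernel) are illustrative, not HeteroPDP's API:

```cuda
// Hypothetical sketch: predict each device's slowdown under current
// contention and run the kernel where predicted completion time is lower.
struct DeviceEstimate {
    double isolatedTimeMs;     // profiled stand-alone execution time
    double predictedSlowdown;  // model output, >= 1.0 under contention
};

enum class Device { CPU, GPU };

Device placeKernel(const DeviceEstimate &cpu, const DeviceEstimate &gpu) {
    double cpuTime = cpu.isolatedTimeMs * cpu.predictedSlowdown;
    double gpuTime = gpu.isolatedTimeMs * gpu.predictedSlowdown;
    return (cpuTime < gpuTime) ? Device::CPU : Device::GPU;
}
```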
In summary, this dissertation aims to provide insights for future microarchitecture and system architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs.
3. Enhancing GPGPU Performance through Warp Scheduling, Divergence Taming and Runtime Parallelizing Transformations
Anantpur, Jayvant P, January 2017
There has been a tremendous growth in the use of Graphics Processing Units (GPUs) for the acceleration of general purpose applications. The growth is primarily due to the huge computing power offered by GPUs and the emergence of programming languages such as CUDA and OpenCL. A typical GPU consists of several hundred to a few thousand Single Instruction Multiple Data (SIMD) cores, organized as tens of Streaming Multiprocessors (SMs), each having several SIMD cores which operate in a lock-step manner, offering a few TeraFLOPS of performance in a single socket. SMs execute instructions from a group of consecutive threads, called warps. At each cycle, an SM schedules a warp from a group of active warps and can context switch among the active warps to hide various stalls. However, various factors, such as global memory latency, divergence among warps of a thread block (TB), branch divergence among threads of a warp (control divergence), and the number of active warps, can significantly impact the ability of a warp scheduler to hide stalls. This reduces the speedup of applications running on the GPU. Further, applications containing loops with potential cross-iteration dependences do not utilize the available resources (SIMD cores) effectively and hence suffer in terms of performance. In this thesis, we propose several mechanisms which address the above issues and enhance the performance of GPU applications through efficient warp scheduling, taming branch and warp divergence, and runtime parallelization.
First, we propose RLWS, a Reinforcement Learning (RL) based Warp Scheduler which uses unsupervised learning to schedule warps based on the current state of the core and the long-term benefits of scheduling actions. As the design space involving the state variables used by the RL and the RL parameters (such as learning and exploration rates, reward and penalty values, etc.) is large, we use a Genetic Algorithm to identify the useful subset of state variables and RL parameter values. We evaluated the proposed RL based scheduler using the GPGPU-Sim simulator on a large number of applications from the Rodinia, Parboil, CUDA-SDK and GPGPU-Sim benchmark suites. Our RL based implementation achieved an average speedup of 1.06x over the Loose Round Robin (LRR) strategy and 1.07x over the Two-Level (TL) strategy. A salient feature of RLWS is that it is robust, i.e., it performs nearly as well as the best performing warp scheduler consistently across a wide range of applications. Using the insights obtained from RLWS, we designed PRO, a heuristic warp scheduler which, in addition to hiding the long latencies of certain operations, reduces the waiting time of warps at synchronization points. Evaluation of the proposed algorithm using the GPGPU-Sim simulator on a diverse set of applications showed an average speedup of 1.07x over the LRR warp scheduler and 1.08x over the TL warp scheduler.
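A minimal sketch of the tabular Q-learning core such a scheduler could build on; the state encoding, reward, and hyper-parameters below are stand-ins, since RLWS tunes its actual state variables and parameters with a genetic algorithm as described above:

```cuda
#include <algorithm>
#include <cstdint>
#include <random>
#include <unordered_map>

// Hypothetical sketch of an RLWS-like scheduler's learning core.
struct RLScheduler {
    std::unordered_map<uint64_t, double> q;  // Q(state, action), default 0
    double alpha = 0.1, gamma = 0.9, epsilon = 0.05;
    std::mt19937 rng{42};

    static uint64_t key(uint32_t s, uint32_t a) {
        return (uint64_t(s) << 32) | a;
    }

    // Epsilon-greedy choice among nActions currently ready warps (>= 1).
    uint32_t pickWarp(uint32_t state, uint32_t nActions) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        if (u(rng) < epsilon)
            return std::uniform_int_distribution<uint32_t>(0, nActions - 1)(rng);
        uint32_t best = 0;
        for (uint32_t a = 1; a < nActions; ++a)
            if (q[key(state, a)] > q[key(state, best)]) best = a;
        return best;
    }

    // Standard Q-learning update after observing the reward (e.g., issued
    // instructions this cycle) and the next core state.
    void update(uint32_t s, uint32_t a, double reward,
                uint32_t s2, uint32_t nActions) {
        double maxNext = q[key(s2, 0)];
        for (uint32_t a2 = 1; a2 < nActions; ++a2)
            maxNext = std::max(maxNext, q[key(s2, a2)]);
        q[key(s, a)] += alpha * (reward + gamma * maxNext - q[key(s, a)]);
    }
};
```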
In the second part of the thesis, we address problems due to warp and branch divergence. First, many GPU kernels exhibit warp divergence due to various reasons such as different amounts of work, cache misses, and thread divergence. Also, we observed that some kernels contain code which is redundant across TBs, i.e., all TBs execute the code identically and hence compute the same results. To improve the performance of such kernels, we propose a solution based on the concept of virtual TBs and loop-independent code motion (the code-motion half is illustrated in the sketch below), with the necessary code transformations to enable one virtual TB to execute the kernel code for multiple real TBs. We evaluated this technique using the GPGPU-Sim simulator on a diverse set of applications and observed an average improvement of 1.08x over the LRR and 1.04x over the Greedy Then Old (GTO) warp scheduling algorithms. Second, branch divergence causes diverging branches to be serialized so that only one control flow path executes at a time. The existing stack-based hardware mechanism for reconverging threads causes duplicate execution of code for unstructured control flow graphs (CFGs). We propose a simple and elegant transformation to convert an unstructured CFG to a structured CFG. The transformation eliminates duplicate execution of user code while incurring only a linear increase in the number of basic blocks and instructions. We implemented the proposed transformation at the PTX level using the Ocelot compiler infrastructure and demonstrate that the proposed technique is effective in handling the performance problem due to divergence in unstructured CFGs.
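A hand-written before/after pair for the code-motion part of the idea; the thesis performs the equivalent transformation automatically, and the kernel, table, and constants here are purely illustrative:

```cuda
// Before: every thread block fills the same 32-entry table, so the work is
// redundant across TBs and wastes SIMD cycles in every block.
__global__ void kernelBefore(float *out, const float *in, int n) {
    __shared__ float scale[32];
    if (threadIdx.x < 32)
        scale[threadIdx.x] = expf(-0.1f * threadIdx.x);  // TB-invariant
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * scale[i & 31];
}

// After: the TB-invariant table is computed once, up front, and merely
// read by all blocks of the main kernel.
__global__ void buildScale(float *scale) {
    if (blockIdx.x == 0 && threadIdx.x < 32)
        scale[threadIdx.x] = expf(-0.1f * threadIdx.x);
}

__global__ void kernelAfter(float *out, const float *in,
                            const float *scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * scale[i & 31];
}
```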
Our third proposal enables efficient execution of loops with indirect memory accesses that can potentially cause cross-iteration dependences. Such dependences are hard to detect using existing compilation techniques. We present an algorithm to compute, at run time, the cross-iteration dependences in such loops, using both the CPU and the GPU. It effectively uses the compute capabilities of the GPU to collect the memory accesses performed by the iterations. Using the dependence information, the loop iterations are levelized such that each level contains independent iterations which can be executed in parallel. Experimental evaluation on real hardware (NVIDIA GPUs) reveals that the proposed technique can achieve an average speedup of 6.4x on loops with a reasonable number of cross-iteration dependences.
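A minimal host-side sketch of the levelization step, assuming the per-iteration read and write indices have already been collected (the thesis gathers them on the GPU). This conservative version also orders read-after-read pairs, which a real implementation would let run in parallel; all names are illustrative:

```cuda
#include <algorithm>
#include <unordered_map>
#include <vector>

// Levelize a loop such as  for (i = 0; i < n; ++i) a[w[i]] = f(a[r[i]]);
// Iterations touching a common element keep their order; all iterations
// assigned the same level are independent.
std::vector<int> levelize(const std::vector<int> &r, const std::vector<int> &w) {
    int n = static_cast<int>(r.size());
    std::unordered_map<int, int> lastLevel;  // element -> deepest level so far
    std::vector<int> level(n, 0);

    auto seenAt = [&](int elem) {
        auto it = lastLevel.find(elem);
        return it == lastLevel.end() ? -1 : it->second;
    };

    for (int i = 0; i < n; ++i) {
        // Run after the latest earlier iteration touching r[i] or w[i].
        int lv = std::max(seenAt(r[i]), seenAt(w[i])) + 1;
        level[i] = lv;
        int newR = std::max(seenAt(r[i]), lv);
        int newW = std::max(seenAt(w[i]), lv);
        lastLevel[r[i]] = newR;
        lastLevel[w[i]] = newW;
    }
    return level;  // launch level 0 in parallel on the GPU, then level 1, ...
}
```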
4. Improving the performance of GPU-accelerated spatial joins
Hrstic, Dusan Viktor, January 2017
Data collisions have been widely studied in various fields of science and industry. Combining the CPU and GPU for processing spatial joins has been broadly accepted due to the increased speed of computation, which should redirect efforts in GPGPU research from straightforward porting of applications toward establishing principles and strategies that allow efficient mapping of computation to graphics hardware. As threads execute instructions using the hardware resources available, this report analyzes and examines the impact of different thread organizations and their effect on spatial join performance.
New perspectives and solutions to the problem of thread organization and warp scheduling may encourage more developers to program on the GPU side. The aim of this project is to examine the impact of different thread organizations on spatial join processing. The relationship between the items inside the datasets is examined by counting the number of collisions their join produces, in order to understand how different approaches influence performance. Performance benchmarking, analysis, and measurement of different approaches to thread organization are investigated in order to find the most time-efficient solution.
This report presents the results obtained from the different thread techniques used to optimize the computational speed of the spatial join algorithms. Two algorithms run on the GPU: one applying thread techniques and one non-optimized solution. The GPU times are compared with the execution times on the CPU, and the GPU implementations are verified by checking that their collision counters match those of the CPU counterpart.
In the analysis part of this report, the implementations are discussed and compared with each other. The difference between the algorithm applying thread techniques and the non-optimized one is around 80% in favour of the former, which is also around 56 times faster than the spatial joins on the CPU.
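As a rough illustration of the kind of workload being tuned (not the report's actual implementation), a brute-force collision-counting kernel over 2D bounding boxes, where the block size and per-thread loop structure are exactly the thread-organization knobs such a study varies; all names are illustrative:

```cuda
// Count overlapping 2D bounding boxes between sets R and S.
struct Box { float xmin, ymin, xmax, ymax; };

__device__ bool overlaps(const Box &a, const Box &b) {
    return a.xmin <= b.xmax && b.xmin <= a.xmax &&
           a.ymin <= b.ymax && b.ymin <= a.ymax;
}

__global__ void spatialJoinCount(const Box *R, int nR, const Box *S, int nS,
                                 unsigned long long *collisions) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nR) return;
    unsigned long long local = 0;
    for (int j = 0; j < nS; ++j)         // each thread scans all of S
        if (overlaps(R[i], S[j])) ++local;
    atomicAdd(collisions, local);        // one atomic per thread, not per hit
}

// Example launch:
// spatialJoinCount<<<(nR + 255) / 256, 256>>>(dR, nR, dS, nS, dCount);
```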