21 |
Design and Implementation of Scalable High-Performance Network Functions
Hsieh, Cheng-Liang 01 August 2017 (has links)
Service Function Chaining (SFC) enriches network functionality to fulfill the increasing demand for value-added services. By leveraging SDN and NFV for SFC, it becomes possible to meet demand fluctuations and construct a dynamic SFC. However, the integration of SDN with NFV requires packet header modifications, generates excessive network traffic, and induces additional I/O overheads for packet processing. These additional overheads result in lower system performance, scalability, and agility. To improve system performance, a co-optimized solution is proposed to implement network functions and achieve better performance for software-based network functions. To improve system scalability, a many-field packet classification algorithm is proposed to support more complex rulesets. To improve system agility, a network function-enabled switch is proposed to lower the network function content switching time. The experimental results show that the performance of a network function is improved by 8 times by leveraging the GPU as a parallel computation platform. Moreover, the matching speed for steering network traffic with a many-field ruleset is improved by 4 times with the proposed many-field packet classification algorithm. Finally, the proposed SFC implementation using SDN and NFV improves system bandwidth by 5 times compared to the native solution while maintaining the content switching time.
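The abstract does not detail the GPU implementation, but the appeal of a GPU for packet classification is easy to see from a minimal sketch: header matching is independent per packet, so one thread can classify one packet against the ruleset. The rule and header layouts below are hypothetical and the brute-force linear scan is only an illustration, not the many-field algorithm proposed in the dissertation.

```cuda
// Minimal sketch: brute-force many-field packet classification on a GPU,
// one thread per packet. The Rule/Header layouts are hypothetical and the
// linear rule scan is NOT the dissertation's algorithm; it only illustrates
// why header matching parallelizes well across packets.
#include <cstdint>

struct Rule {                          // 5-field rule: range match on addresses/ports
    uint32_t src_lo, src_hi;
    uint32_t dst_lo, dst_hi;
    uint16_t sport_lo, sport_hi;
    uint16_t dport_lo, dport_hi;
    uint8_t  proto;                    // 0xFF = wildcard
    int      action;
};

struct Header {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t  proto;
};

__global__ void classify(const Header* pkts, int num_pkts,
                         const Rule* rules, int num_rules, int* action_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_pkts) return;
    Header h = pkts[i];
    int action = -1;                               // default: no match
    for (int r = 0; r < num_rules; ++r) {          // first-match semantics
        const Rule& q = rules[r];
        bool hit = h.src   >= q.src_lo   && h.src   <= q.src_hi   &&
                   h.dst   >= q.dst_lo   && h.dst   <= q.dst_hi   &&
                   h.sport >= q.sport_lo && h.sport <= q.sport_hi &&
                   h.dport >= q.dport_lo && h.dport <= q.dport_hi &&
                   (q.proto == 0xFF || q.proto == h.proto);
        if (hit) { action = q.action; break; }
    }
    action_out[i] = action;
}
```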
22 |
GPGPU design space exploration using neural networks
Jooya, Ali 28 September 2018 (has links)
General Purpose computing on Graphics Processing Units (GPGPU) gained attention in 2006 with NVIDIA's first Tesla Graphics Processing Unit (GPU), which could perform high-performance computing. Ever since, researchers have been working on software and hardware techniques to improve the efficiency of running general-purpose applications on GPUs. The efficiency can be evaluated using metrics such as energy consumption and throughput, and is defined based on the requirements of the system. I define it as obtaining high throughput while consuming minimum energy.
GPUs are equipped with a large number of processing units, a high memory bandwidth, and different types of on-chip memory and caches. To run efficiently, an application should maximize the utilization of GPU resources. Therefore, a good correspondence between the computing and memory resources of the GPU and those of the application is critical. Since an application's requirements are fixed, the GPU's configuration should be tuned to these requirements. Having models to study and predict the power consumption and throughput of running a GPGPU application on a given GPU configuration can help achieve high efficiency.
The main purpose of this dissertation is to find a GPU configuration that best matches the requirements of a given application. I propose three models that predict a GPU configuration that runs an application with maximum throughput while consuming minimum energy.
The first model is a fast, low-cost and effective approach to optimize resource allocation in future GPUs. The model finds the optimal GPU configuration for different available chip real-estate budgets.
The second model considers the power consumption and throughput of a GPGPU application as functions of the GPU configuration parameters. The proposed model accurately predicts the power consumption and throughput of the modeled GPGPU application. I then propose to accelerate the process of building the model using optimization techniques and quantum annealing. I use the proposed model to explore the GPU configuration space of different applications, and apply a multiobjective optimization technique to find the configurations that offer minimum power consumption and maximum throughput.
Finally, using clustering and classification techniques, I develop models to relate the power consumption and throughput of GPGPU applications to the code attributes. Both models could accurately predict the optimum configuration for any given GPGPU application.
To build these models I have used different machine learning techniques and optimization methods such as the Pareto front and the Knapsack optimization problem. I validated the model-produced results against simulation results and showed that the models make accurate predictions.
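As a rough illustration of the multiobjective step mentioned above, the following host-side sketch extracts the Pareto front from a set of predicted (power, throughput) points: a configuration is kept only if no other configuration has both lower power and higher throughput. The `Config` structure and function name are illustrative, not taken from the dissertation.

```cpp
// Host-side sketch: extract the Pareto front from predicted (power, throughput)
// pairs, keeping configurations that no other configuration dominates
// (lower power AND higher throughput). Names are illustrative.
#include <vector>
#include <algorithm>

struct Config { int id; double power; double throughput; };

std::vector<Config> paretoFront(std::vector<Config> pts)
{
    // Sort by ascending power; break ties by descending throughput.
    std::sort(pts.begin(), pts.end(), [](const Config& a, const Config& b) {
        return a.power < b.power ||
              (a.power == b.power && a.throughput > b.throughput);
    });
    std::vector<Config> front;
    double best = -1.0;                        // best throughput seen so far
    for (const Config& c : pts) {
        if (c.throughput > best) {             // not dominated by any cheaper config
            front.push_back(c);
            best = c.throughput;
        }
    }
    return front;
}
```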
These models could be used by GPGPU programmers to identify the architectural parameters that most affect an application’s power consumption and throughput.
This information could be translated into software optimization opportunities. Also, these models can be implemented as part of a compiler to help it make the best optimization decisions. Moreover, GPU manufacturers could gain insight into which architectural parameters would benefit GPGPU applications the most in terms of power and performance, and hence invest in these. / Graduate
23 |
Automatic digital surface model generation using graphics processing unit
Van der Merwe, Dirk Jacobus 05 June 2012 (has links)
M. Ing. / Digital Surface Models (DSMs) are widely used in the earth sciences for research, visualizations, construction, etc. In order to generate a DSM for a specific area, specialized equipment and personnel are always required, which leads to a costly and time-consuming exercise. Image processing has become a viable technique to generate terrain models, since improvements in hardware provided adequate processing power to complete such a task. DSMs can be generated from stereo imagery, usually obtained from a remote sensing platform. The core component of a DSM generating system is the image matching algorithm. Even though there are a variety of algorithms to date which can generate DSMs, it is a computationally complex calculation and does tend to take some time to complete. In order to achieve faster DSMs, an investigation into an alternative processing platform for the generation of terrain models has been done. The Graphics Processing Unit (GPU) is usually used in the gaming industry to manipulate display data and then render it to a computer screen. The architecture is designed to manipulate large amounts of floating-point data. The scientific community has begun using the GPU processing power available for technical computing, hence the term General Purpose computing on a Graphics Processing Unit (GPGPU). The GPU is investigated as an alternative processing platform for the image matching procedure, since the processing capability of the GPU is much higher than that of the CPU, but only for a conditioned set of input data. A matching algorithm derived from the GC3 algorithm has been implemented on both a CPU platform and a GPU platform in order to investigate the viability of a GPU processing alternative. The algorithm makes use of a Normalized Cross Correlation similarity measurement and the geometry of the image acquisition contained in the sensor model to obtain conjugate point matches in the two source images. The results of the investigation indicated an improvement of up to 70% in the processing time required to generate a DSM. The improvements varied from 70% down to some cases where the GPU took longer to generate the DSM. The accuracy of the automatic DSM generating system could not be clearly determined, since only poor-quality reference data was available. It is, however, shown that the DSMs generated using both the CPU and GPU platforms relate to the reference data and correlate to each other. The discrepancies between the CPU and GPU results are low enough to prove that GPU processing is beneficial, with negligible drawbacks in terms of accuracy. The GPU will definitely provide superior processing capabilities for DSM generation above a CPU implementation if a matching algorithm is specifically designed to cater for the benefits and limitations of the GPU.
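The similarity measure at the heart of the matching step is Normalized Cross Correlation (NCC). The kernel below is a simplified sketch of how NCC scores for a set of candidate matches can be computed in parallel, one thread per candidate; it ignores the sensor-model geometry and epipolar constraints used in the actual system, and the parameter names are illustrative.

```cuda
// Sketch: NCC of a (2*radius+1)^2 window around (px,py) in the left image
// against candidate positions (px+d, py) in the right image, one thread per
// candidate offset d. Simplified: no sensor-model geometry, no subpixel refinement.
__global__ void nccCandidates(const float* left, const float* right,
                              int width, int height, int px, int py,
                              int radius, int num_cands, float* score)
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;     // candidate offset index
    if (d >= num_cands) return;
    int qx = px + d, qy = py;
    float sumA = 0, sumB = 0, sumAB = 0, sumAA = 0, sumBB = 0;
    int n = 0;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            int ax = px + dx, ay = py + dy;            // left-window pixel
            int bx = qx + dx, by = qy + dy;            // right-window pixel
            if (ax < 0 || ay < 0 || bx < 0 || by < 0 ||
                ax >= width || ay >= height || bx >= width || by >= height)
                continue;
            float a = left [ay * width + ax];
            float b = right[by * width + bx];
            sumA += a;  sumB += b;
            sumAB += a * b;  sumAA += a * a;  sumBB += b * b;
            ++n;
        }
    // NCC = cov(A,B) / (std(A) * std(B)), computed from running sums.
    float cov  = sumAB - sumA * sumB / n;
    float varA = sumAA - sumA * sumA / n;
    float varB = sumBB - sumB * sumB / n;
    score[d] = (n > 0 && varA > 0.0f && varB > 0.0f)
             ? cov / sqrtf(varA * varB) : 0.0f;
}
```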
24 |
Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations
Lutz, Thibaut January 2015
Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single-processor era has been replaced with multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors, have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity, and because the flexibility and portability needed to optimize for each target individually are unachievable, heterogeneous systems typically remain vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem. Providing automated, non-intrusive methods in the form of compiler tools and implementing efficient abstractions to automatically tune parameters for a restricted domain are two complementary approaches investigated to better utilize compute resources in heterogeneous systems. First, we explore a fully automated compiler-based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces significant software engineering effort spent to tune an application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library-based approach is designed to provide a high-level abstraction for complex problems in a specific domain, stencil computation. Using domain-specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs with a speedup of up to 1.9x on two GPUs and 3.6x on four.
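For readers unfamiliar with the stencil domain targeted by the second approach, the kernel below shows the kind of computation involved: a 5-point 2D Jacobi step in which every interior grid point is replaced by the average of its four neighbours. It is written in CUDA for brevity (the thesis itself targets OpenCL); tile sizes, halo handling and multi-GPU partitioning are exactly the parameters such an abstraction layer would tune automatically.

```cuda
// Illustration of the stencil domain: a 5-point 2D Jacobi step, one thread per
// interior grid point. The thesis targets OpenCL; CUDA is used here only for
// brevity, and this is not code from the thesis.
__global__ void jacobi5pt(const float* in, float* out, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= nx - 1 || y >= ny - 1) return;   // skip boundary
    int i = y * nx + x;
    out[i] = 0.25f * (in[i - 1] + in[i + 1] + in[i - nx] + in[i + nx]);
}
```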
25 |
Pencil beam dose calculation for proton therapy on graphics processing units
da Silva, Joakim January 2016
Radiotherapy delivered using scanned beams of protons enables greater conformity between the dose distribution and the tumour than conventional radiotherapy using X rays. However, the dose distributions are more sensitive to changes in patient anatomy, and tend to deteriorate in the presence of motion. Online dose calculation during treatment delivery offers a way of monitoring the delivered dose in real time, and could be used as a basis for mitigating the effects of motion. The aim of this work has therefore been to investigate how the computational power offered by graphics processing units can be harnessed to enable fast analytical dose calculation for online monitoring in proton therapy. The first part of the work consisted of a systematic investigation of various approaches to implementing the most computationally expensive step of the pencil beam algorithm to run on graphics processing units. As a result, it was demonstrated how the kernel superposition operation, or convolution with a spatially varying kernel, can be efficiently implemented using a novel scatter-based approach. For the intended application, this outperformed the conventional gather-based approach suggested in the literature, permitting faster pencil beam dose calculation and potential speedups of related algorithms in other fields. In the second part, a parallelised proton therapy dose calculation engine employing the scatter-based kernel superposition implementation was developed. Such a dose calculation engine, running all of the principal steps of the pencil beam algorithm on a graphics processing unit, had not previously been presented in the literature. The accuracy of the calculation in the high- and medium-dose regions matched that of a clinical treatment planning system whilst the calculation was an order of magnitude faster than previously reported. Importantly, the calculation times were short, both compared to the dead time available during treatment delivery and to the typical motion period, making the implementation suitable for online calculation. In the final part, the beam model of the dose calculation engine was extended to account for the low-dose halo caused by particles travelling at large angles with the beam, making the algorithm comparable to those in current clinical use. By reusing the workflow of the initial calculation but employing a lower resolution for the halo calculation, it was demonstrated how the improved beam model could be included without prohibitively prolonging the calculation time. Since the implementation was based on a widely used algorithm, it was further predicted that by careful tuning, the dose calculation engine would be able to reproduce the dose from a general beamline with sufficient accuracy. Based on the presented results, it was concluded that, by using a single graphics processing unit, dose calculation using the pencil beam algorithm could be made sufficiently fast for online dose monitoring, whilst maintaining the accuracy of current clinical systems.
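The central idea of the first part, the scatter-based kernel superposition, can be illustrated with a one-dimensional toy example. In the gather formulation each thread owns an output element and reads every input that contributes to it; in the scatter formulation each thread owns an input element and spreads its contribution to every output it affects, resolving concurrent updates with atomic additions. The code below is only this toy illustration with a hypothetical weight layout, not the dose calculation engine described in the thesis.

```cuda
// 1D illustration of kernel superposition with a spatially varying kernel
// w(source, offset). Contrasts the two data-access patterns discussed above;
// it is not the thesis' dose engine.

// Gather: each thread owns one OUTPUT element and reads all contributing inputs.
__global__ void superposeGather(const float* in, float* out, const float* w,
                                int n, int radius)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        int s = i - k;                                   // contributing source
        if (s >= 0 && s < n)
            acc += in[s] * w[s * (2 * radius + 1) + (k + radius)]; // kernel of the source
    }
    out[i] = acc;
}

// Scatter: each thread owns one INPUT element and spreads it to every output it
// affects; concurrent updates are resolved with atomics.
__global__ void superposeScatter(const float* in, float* out, const float* w,
                                 int n, int radius)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= n) return;
    float v = in[s];
    for (int k = -radius; k <= radius; ++k) {
        int i = s + k;                                   // affected output
        if (i >= 0 && i < n)
            atomicAdd(&out[i], v * w[s * (2 * radius + 1) + (k + radius)]);
    }
}
```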
26 |
Efficient Dynamic Automatic Memory Management And Concurrent Kernel Execution For General-Purpose Programs On Graphics Processing Units
Pai, Sreepathi 11 1900 (has links) (PDF)
Modern supercomputers now use accelerators to achieve their performance with the most widely used accelerator being the Graphics Processing Unit (GPU). However, achieving the performance potential of systems that combine a GPU and CPU is an arduous task which could be made easier with the assistance of the compiler or runtime. In particular, exploiting two features of GPU architectures -- distributed memory and concurrent kernel execution -- is critical to achieve good performance, but in current GPU programming systems, programmers must exploit them manually. This can lead to poor performance. In this thesis, we propose automatic techniques that: i) perform data transfers between the CPU and GPU, ii) allocate resources for concurrent kernels, and iii) schedule concurrent kernels efficiently without programmer intervention.
Most GPU programs access data in GPU memory for performance. Manually inserting data transfers that move data to and from this GPU memory is an error-prone and tedious task. In this work, we develop a software coherence mechanism to fully automate all data transfers between the CPU and GPU without any assistance from the programmer. Our mechanism uses compiler analysis to identify potential stale data accesses and uses a runtime to initiate transfers as necessary. This avoids redundant transfers that are exhibited by all other existing automatic memory management proposals for general-purpose programs. We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers. Tested on eight programs ported from the Rodinia benchmark suite, it achieves (i) a 1.06x speedup over hand-tuned manual memory management, and (ii) a 1.29x speedup over another recently proposed compiler-runtime automatic memory management system. Compared to other existing runtime-only (ADSM) and compiler-only (OpenMPC) proposals, it also transfers 2.2x to 13.3x less data on average.
Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite. Current GPUs therefore allow concurrent execution of kernels to improve utilization. We study concurrent execution of GPU kernels using multiprogrammed workloads on current NVIDIA Fermi GPUs. On two-program workloads from Parboil2 we find concurrent execution is often no better than serialized execution. We identify lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware runtime concurrency policies that offer significantly better performance and concurrency than the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads over the current CUDA concurrency implementation.
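The baseline mechanism for concurrent kernel execution on Fermi-class and later GPUs is to launch kernels into separate CUDA streams; kernels in different streams may overlap if the hardware scheduler finds free resources. The sketch below shows only this launch pattern with two trivial placeholder kernels; it does not include the elastic-kernel transformation or the resource-allocation policies proposed in the thesis.

```cuda
// Minimal sketch of the concurrency mechanism discussed above: kernels launched
// into distinct CUDA streams MAY run concurrently if enough resources are free.
// The placeholder kernels are illustrative only.
#include <cuda_runtime.h>

__global__ void kernelA(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

__global__ void kernelB(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void launchConcurrently(float* dA, float* dB, int n)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    kernelA<<<blocks, threads, 0, s0>>>(dA, n);   // stream 0
    kernelB<<<blocks, threads, 0, s1>>>(dB, n);   // stream 1: eligible to overlap

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```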
Recent NVIDIA GPUs use a FIFO policy in their thread block scheduler (TBS) to schedule thread blocks of concurrent kernels. We show that FIFO leaves performance to chance, resulting in significant loss of performance and fairness. To improve performance and fairness, we propose use of the Shortest Remaining Time First (SRTF) policy instead. Since SRTF requires an estimate of runtime (i.e., execution time), we introduce Structural Runtime Prediction, which uses the grid structure of GPU programs to predict runtimes. Using a novel Staircase model of GPU kernel execution, we show that kernel runtime can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench and find that predictions made after the execution of a single thread block are between 0.48x and 1.08x of actual runtime. We implement the SRTF policy for concurrent kernels using this predictor and evaluate it on two-program workloads from ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. Compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also propose SRTF/Adaptive, which controls resource usage of concurrently executing kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by 2.23x and Fairness by 2.95x compared to FIFO. Overall, our implementation of SRTF achieves STP to within 12.64% of Shortest Job First (SJF, an oracle optimal scheduling policy), bridging 49% of the gap between FIFO and SJF.
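The intuition behind the Staircase model is that thread blocks execute in successive "waves" of roughly as many blocks as fit on the GPU at once, so total kernel runtime is close to the number of waves multiplied by the time per wave, and a measurement of the first wave (or, as in the thesis, a single thread block) yields an early runtime estimate. The host-side sketch below captures only this intuition; it ignores occupancy details and tail effects, and the function and parameter names are illustrative.

```cpp
// Host-side sketch of the staircase intuition: runtime ~= waves * time_per_wave.
// blocksPerWave is the number of blocks resident on the GPU at once (it could be
// queried at runtime, e.g. from the SM count and per-SM occupancy); firstWaveTime
// comes from a short profiling run. Simplified; not the thesis' online predictor.
#include <cmath>

double predictKernelRuntime(int totalBlocks,
                            int blocksPerWave,
                            double firstWaveTime)
{
    int waves = static_cast<int>(
        std::ceil(static_cast<double>(totalBlocks) / blocksPerWave));
    return waves * firstWaveTime;
}
```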
27 |
Comparative Study of CPU and GPGPU Implementations of the Sieves of Eratosthenes, Sundaram and Atkin
Månsson, Jakob January 2021
Background. Prime numbers are integers divisible only by 1 and themselves, and one of the oldest methods of finding them is through a process known as sieving. A prime number sieving algorithm produces every prime number in a span, usually from the number 2 up to a given number n. In this thesis, we will cover the three sieves of Eratosthenes, Sundaram, and Atkin. Objectives. We shall compare their sequential CPU implementations to their parallel GPGPU (General Purpose Graphics Processing Unit) counterparts in terms of performance, accuracy, and suitability. GPGPU is a method in which one utilizes hardware intended for graphics rendering to achieve a high degree of parallelism. Our goal is to establish whether GPGPU sieving can be more effective than the sequential approach, which is currently commonplace. Method. We utilize the C++ and CUDA programming languages to implement the algorithms, and then extract data regarding their execution time and accuracy. Experiments are set up and run at several sieving limits, with the upper bound set by the memory capacity of the available GPU hardware. Furthermore, we study each sieve to identify what characteristics make it fit or unfit for a GPGPU approach. Results. Our results show that the sieve of Eratosthenes is slow and ill-suited for GPGPU computing, that the sieve of Sundaram is both efficient and fit for parallelization, and that the sieve of Atkin is the fastest but suffers from imperfect accuracy. Conclusions. Finally, we address how the smaller concurrent memory capacity available for GPGPU limits the ranges that can be sieved, compared to the CPU. Utilizing the beneficial characteristics of the sieve of Sundaram, we propose a batch-divided implementation that would allow the GPGPU sieve to cover an equal range of numbers as any of the CPU variants.
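To make the sieve of Sundaram concrete: the primes up to 2m+1 are exactly the numbers 2k+1 for every 1 <= k <= m that cannot be written as i + j + 2ij (plus the prime 2). Each value of i can be processed independently, which is what makes the sieve a good fit for a GPU. The program below is a minimal single-batch CUDA sketch of this idea, not the batch-divided implementation proposed in the thesis; the benign write race (several threads may mark the same slot with the same value) is noted in the comments.

```cuda
// Sketch of a parallel sieve of Sundaram. One thread per i marks every k of the
// form i + j + 2*i*j as "crossed"; the remaining k correspond to primes 2*k + 1.
// Concurrent threads may write the same slot, but they all write the same value,
// so no atomics are needed. Single batch only; not the thesis' batch-divided scheme.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void sundaramMark(unsigned char* crossed, long long m)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x + 1;  // i >= 1
    if (i > m) return;
    for (long long j = i; i + j + 2 * i * j <= m; ++j)
        crossed[i + j + 2 * i * j] = 1;
}

int main()
{
    const long long m = 1000000;               // yields primes up to 2*m + 1
    unsigned char* d_crossed;
    cudaMalloc(&d_crossed, m + 1);
    cudaMemset(d_crossed, 0, m + 1);

    int threads = 256;
    int blocks  = (int)((m + threads - 1) / threads);
    sundaramMark<<<blocks, threads>>>(d_crossed, m);

    std::vector<unsigned char> crossed(m + 1);
    cudaMemcpy(crossed.data(), d_crossed, m + 1, cudaMemcpyDeviceToHost);
    cudaFree(d_crossed);

    long long count = 1;                        // the prime 2 is counted separately
    for (long long k = 1; k <= m; ++k)
        if (!crossed[k]) ++count;               // 2*k + 1 is prime
    std::printf("primes up to %lld: %lld\n", 2 * m + 1, count);
    return 0;
}
```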
28 |
Toward Reliable, Secure, and Energy-Efficient Multi-Core System Design
Basu, Prabal 01 August 2019
Computer hardware researchers have perennially focused on improving the performance of computers while keeping energy consumption under a strict budget. While several innovations over the years have led to high-performance and energy-efficient computers, more challenges have also emerged as a fallout. For example, smaller transistor devices in modern multi-core systems are afflicted with several reliability and security concerns, which were inconceivable even a decade ago. Tackling these bottlenecks tends to negatively impact the power and performance of the computers. This dissertation explores novel techniques to gracefully solve some of the pressing challenges of modern computer design. Specifically, the proposed techniques improve the reliability of the on-chip communication fabric under high power supply noise, increase the energy efficiency of low-power graphics processing units, and demonstrate an unprecedented security loophole of the low-power computing paradigm through rigorous hardware-based experiments.
29 |
Coping Uncertainty in Wireless Network Optimization
Li, Shaoran 24 October 2022
Network optimization plays an important role in 5G/next-G networks, which requires knowledge of network parameters (e.g., channel state information). The majority of existing works assume that all network parameters are either given a priori or can be accurately estimated. However, in many practical scenarios, some parameters are uncertain at the time of allocating resources and can only be modeled by random variables. Further, we only have limited knowledge of those uncertain parameters. For instance, channel gains are not exactly known due to channel estimation errors, network delay, limited feedback, and a lack of cooperation (between networks). Therefore, a practical solution to network optimization must address such uncertainty inside wireless networks.
There are three approaches to address such network uncertainty: stochastic programming, worst-case optimization, and chance-constrained programming (CCP). Among the three, CCP has some unique benefits compared to the other two approaches. Stochastic programming explicitly requires full distribution knowledge, which is usually unavailable in practice. In comparison, CCP can work with various settings of available knowledge, such as first- and second-order statistics, symmetric properties, or limited data samples. Therefore, CCP is more flexible in handling different network settings, which is important for addressing problems in 5G/next-G networks. Further, worst-case optimization assumes upper or lower bounds (i.e., worst cases) for the uncertain parameters and is known to be conservative due to its focus on extreme cases. In contrast, CCP allows occasional and controllable violations of some constraints and thus offers much better performance in resource utilization compared to worst-case optimization. The only drawback of CCP is that it may lead to intractability due to its probabilistic formulation and limited knowledge of the underlying random variables.
To date, CCP has not been well utilized in the wireless communication and networking community. The goal of this dissertation is to extend the state-of-the-art of CCP techniques and address a number of challenging network optimization problems. This dissertation is correspondingly organized into two parts. In the first part, we assume the uncertain parameters are only known by their mean and covariance (without distribution knowledge). We assume these statistics are rather stationary (i.e., time-invariant for a sufficiently long time) and thus can be accurately estimated. In this setting, we introduce a novel reformulation technique based on the mean and covariance to derive a solution. In the second part, we assume these statistics are time-varying and thus cannot be accurately estimated. In this setting, we employ limited data samples that are collected in a small time window and use them to derive a solution.
For the first part, we investigate four research problems based on the mean and covariance of the uncertain parameters:
- In the first problem, we study how to maximize spectrum efficiency in underlay coexistence. The interference from all secondary users to each primary user must be kept below a given threshold. However, there is much uncertainty about the channel gains between the primary users and the secondary users due to a lack of cooperation between them. We formulate probabilistic interference constraints using CCP for the primary users. For tractability, we introduce a novel and powerful reformulation technique called Exact Conic Reformulation (ECR). With limited knowledge of mean and covariance, ECR offers an equivalent reformulation of the intractable chance constraints into tractable deterministic constraints without relaxation errors (a standard mean-covariance reformulation of the same flavor is sketched after this list). After reformulation, we apply linearization techniques to the mixed-integer nonlinear problem to reduce the computational complexity. We show that our proposed approach can achieve near-optimal performance and stands as a performance benchmark for the underlay coexistence problem.
- To find a solution for the same underlay coexistence problem that can be used in the real world, we need to find a solution in "real-time". The real-time requirement here refers to finding a solution in 125 us (the minimum time slot for small cells in 5G). Our proposed solution has three steps. First, it employs ECR to reformulate the original CCP into a deterministic optimization problem. Then it decomposes the problem and narrows down the search space into a smaller but promising one. By random sampling inside the promising search space and through local search, our proposed solution can meet the 125 us requirement in 5G while achieving 90% optimality on average.
- We further apply CCP, predicated on the reformulation technique ECR, to two other problems.
* We study the problem of power control in concurrent transmissions. Our objective is to maximize energy efficiency for all transmitter-receiver pairs with capacity requirements. This problem is challenging due to mutual interference among different transmitter-receiver pairs and the uncertain channel gain between any transmitter and receiver. We formulate a CCP and reformulate it into a deterministic problem using ECR. Then we employ Geometric Programming (GP) with a tight approximation to derive a near-optimal solution.
* We study task offloading in Mobile Edge Computing (MEC) where the number of processing cycles of a task is unknown until completion. The goal is to minimize the energy consumption of the users while meeting probabilistic deadlines for the tasks. We formulate the probabilistic deadlines into chance constraints and then use ECR to reformulate them into deterministic constraints. We propose a solution that consists of periodic scheduling and schedule updates to choose the offloaded tasks and task-to-processor assignments at the base station.
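For orientation only (the exact form of ECR is not reproduced in this abstract), the LaTeX snippet below shows the widely used mean-covariance reformulation of a linear chance constraint: it is exact when the uncertainty is Gaussian, and the distributionally robust version over all distributions with the given mean and covariance has the same second-order cone structure with a different scaling factor.

```latex
% x: decision vector; \xi: uncertain parameters with mean \mu and covariance \Sigma.
\begin{align*}
&\text{Chance constraint:} && \Pr\{\xi^{\top}x \le b\} \ge 1-\epsilon,\\
&\text{Gaussian } \xi \text{ (exact, convex for } \epsilon \le 1/2\text{):}
  && \mu^{\top}x + \Phi^{-1}(1-\epsilon)\,\bigl\|\Sigma^{1/2}x\bigr\|_2 \le b,\\
&\text{Worst case over all distributions with } (\mu,\Sigma)\text{:}
  && \mu^{\top}x + \sqrt{\tfrac{1-\epsilon}{\epsilon}}\,\bigl\|\Sigma^{1/2}x\bigr\|_2 \le b.
\end{align*}
```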
In the second part, we investigate two research problems based on limited data samples of the uncertain parameters:
- We study MU-MIMO beamforming based on Channel State Information (CSI). The goal is to derive a beamforming solution (minimizing power consumption at the BS while meeting the probabilistic data rate requirements of the users) by using very limited CSI data samples. For our CCP formulation, we explore the idea of a Wasserstein ambiguity set to quantify the distance between the true (but unknown) distribution and the empirical distribution based on the limited data samples. Our proposed solution, Data-Driven Beamforming (D^2BF), reformulates the CCP into a non-convex deterministic optimization problem based on the properties of the Wasserstein ambiguity set. Then D^2BF employs a novel convex approximation to the non-convex deterministic problem, which can be directly solved by commercial solvers.
- For a solution to the MU-MIMO beamforming problem to be useful in the real world, it must meet the "real-time" requirement. Here, the real-time requirement refers to 1 ms, which is one transmission time interval (TTI) under 5G numerology 0. We present ReDBeam, a Real-time Data-driven Beamforming solution for the MU-MIMO beamforming problem (minimizing power consumption while offering probabilistic data rate guarantees to the users) with limited CSI data samples. ReDBeam is a parallel algorithm and is purposefully designed to take advantage of the vast parallel processing capability offered by a GPU. ReDBeam generates a large number of initial solutions from a promising search space and then refines each solution by a local search. We show that ReDBeam meets the 1 ms real-time requirement on a commercial GPU and is orders of magnitude faster than other state-of-the-art algorithms for the same problem. / Doctor of Philosophy / Network optimization plays an important role in 5G/next-G networks. In a wireless network optimization problem, we typically want to maximize or minimize an objective function under a set of performance or resource constraints. Knowledge of network parameters is typically required in these problems. The majority of existing works assume that all network parameters are either given a priori or can be accurately estimated. However, in many practical scenarios, some parameters are uncertain in nature and cannot be accurately estimated beforehand.
This dissertation addresses uncertainty in wireless network optimizations using chance-constrained programming (CCP). CCP can work with limited knowledge of uncertain parameters such as statistics or data samples, instead of full distribution information. In a CCP formulation, violations of certain target performance or requirement thresholds are expressed as probabilistic constraints and the frequency of such violations is bounded through a risk parameter. By changing this risk level, CCP offers a unique trade-off between the guaranteed threshold violation probabilities and the achieved objective value. The only drawback of CCP is that it may lead to intractability due to its probabilistic formulation and limited knowledge of the underlying random variables.
The goal of this dissertation is to extend the state-of-the-art of CCP techniques to address a number of challenging network optimization problems. This dissertation is organized into two parts. In the first part, the mean and covariance of the uncertain parameters are assumed to be stationary and thus can be accurately estimated. Our main contribution is a novel reformulation technique for CCP called Exact Conic Reformulation (ECR). Based on knowledge of mean and covariance, ECR is able to offer an equivalent reformulation of the intractable chance constraints into tractable deterministic constraints without relaxation errors. We apply CCP, predicated on ECR, to address three problems: (i) scheduling and power control in underlay coexistence; (ii) power control in concurrent transmissions; and (iii) task offloading in Mobile Edge Computing (MEC). For the first problem, we further address the "real-time" requirement and propose a solution that meets this stringent timing constraint.
In the second part, when the uncertain parameters are non-stationary and their statistics cannot be accurately estimated, we propose to employ limited data samples that are collected over a small window and use them to develop a solution. To demonstrate the efficacy of this approach, we investigate the MU-MIMO beamforming problem that minimizes the power consumption of the base station while providing probabilistic guarantees to users' data rates. We further address the timing requirement for such a solution in practice, and present a real-time data-driven beamforming solution for MU-MIMO.
30 |
Generalizing the Utility of Graphics Processing Units in Large-Scale Heterogeneous Computing Systems
Xiao, Shucai 03 July 2013
Today, heterogeneous computing systems are widely used to meet the increasing demand for high-performance computing. These systems commonly use powerful and energy-efficient accelerators to augment general-purpose processors (i.e., CPUs). The graphics processing unit (GPU) is one such accelerator. Originally designed solely for graphics processing, GPUs have evolved into programmable processors that can deliver massive parallel processing power for general-purpose applications.
Using SIMD (Single Instruction, Multiple Data) based components as building units, the current GPU architecture is well suited for data-parallel applications where the execution of each task is independent. With the delivery of programming models such as Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL), programming GPUs has become much easier than before. However, developing and optimizing an application on a GPU is still a challenging task, even for well-trained computing experts. Such programming tasks will be even more challenging in large-scale heterogeneous systems, particularly in the context of utility computing, where GPU resources are used as a service. These challenges are largely due to the limitations in the current programming models: (1) there are no intra- and inter-GPU cooperative mechanisms that are natively supported; (2) current programming models only support the utilization of GPUs installed locally; and (3) to use GPUs on another node, application programs need to explicitly call application programming interface (API) functions for data communication.
To reduce the mapping efforts and to better utilize the GPU resources, we investigate generalizing the utility of GPUs in large-scale heterogeneous systems with GPUs as accelerators. We generalize the utility of GPUs through the transparent virtualization of GPUs, which can enable applications to view all GPUs in the system as if they were installed locally. As a result, all GPUs in the system can be used as local GPUs. Moreover, GPU virtualization is a key capability to support the notion of "GPU as a service." Specifically, we propose the virtual OpenCL (or VOCL) framework for the transparent virtualization of GPUs. To achieve good performance, we optimize and extend the framework in three aspects: (1) optimize VOCL by reducing the data transfer overhead between the local node and remote node; (2) propose GPU synchronization to reduce the overhead of switching back and forth if multiple kernel launches are needed for data communication across different compute units on a GPU; and (3) extend VOCL to support live virtual GPU migration for quick system maintenance and load rebalancing across GPUs.
With the above optimizations and extensions, we thoroughly evaluate VOCL along three dimensions: (1) show the performance improvement for each of our optimization strategies; (2) evaluate the overhead of using remote GPUs via several microbenchmark suites as well as a few real-world applications; and (3) demonstrate the overhead as well as the benefit of live virtual GPU migration. Our experimental results indicate that VOCL can generalize the utility of GPUs in large-scale systems at a reasonable virtualization and migration cost. / Ph. D.