Global ETD Search

11	Profile-guided loop parallelization and co-scheduling on GPU-based heterogeneous many-core architectures Han, Guodong, 韩国栋 January 2013 The GPU-based heterogeneous architectures (e.g., Tianhe-1A, Nebulae), comprising multi-core CPUs and GPUs, have seen increasing adoption and are becoming the norm in supercomputing as they are cost-effective and power-efficient. However, programming such heterogeneous architectures still requires significant effort from application developers using sophisticated GPU programming languages such as CUDA and OpenCL. Although some automatic parallelization tools based on static analysis could ease the programming effort, this approach can only parallelize loops that are entirely free of inter-iteration dependencies (i.e., determined DO-ALL loops) because of the imprecision of static analysis. To exploit the abundant runtime parallelism and take full advantage of the computing resources of both CPU and GPU, in this work we propose a new user-friendly compiler framework and runtime system that helps Java applications harness the full power of a heterogeneous system. It presents an all-round system design that unifies the programming style and language for transparent use of both CPUs and GPUs, automatically parallelizes all kinds of loops, and schedules workloads efficiently across CPU and GPU resources while ensuring data coherence during highly-threaded execution. By means of simple user annotations, sequential Java source code is analyzed, translated and compiled into a dual executable consisting of CUDA kernels and multiple Java threads running on GPU and CPU cores respectively. Annotated loops are automatically split into loop chunks (or tasks) that are scheduled to execute on all available GPU/CPU cores. 
To guide the runtime task scheduling, we develop a novel dynamic loop profiler which generates the program dependency graph (PDG) and computes the density of dependencies across iterations through a hybrid checking scheme combining intra-warp and inter-warp analyses. Implementing a GPU-tailored thread-level speculation (TLS) model, our system supports speculative execution of loops with moderate dependency densities and privatization of loops having only false dependencies on the GPU side. Our scheduler also supports task stealing and task sharing algorithms that allow swift load redistribution across GPU and CPU. We have carried out several experiments to evaluate the profiling overhead and up to 11 real-life applications to evaluate our system performance. Testing results show that the overhead is moderate compared with the sequential execution and prove that almost all the applications could benefit from our system. / published_or_final_version / Computer Science / Master / Master of Philosophy Graphics processing units. Computer architecture.
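The loop-chunking idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's framework: it handles only a dependency-free (DO-ALL) loop, uses plain CPU threads as stand-ins for GPU and CPU workers, and omits profiling and speculation entirely. The names `split_into_chunks` and `run_chunked` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty


def split_into_chunks(n_iterations, chunk_size):
    """Split the iteration space [0, n_iterations) into half-open chunks."""
    return [(start, min(start + chunk_size, n_iterations))
            for start in range(0, n_iterations, chunk_size)]


def run_chunked(body, n_iterations, chunk_size=4, n_workers=3):
    """Run body(i) for every i; idle workers pull the next chunk from a shared queue."""
    tasks = Queue()
    for chunk in split_into_chunks(n_iterations, chunk_size):
        tasks.put(chunk)
    results = [None] * n_iterations

    def worker():
        while True:
            try:
                start, stop = tasks.get_nowait()
            except Empty:
                return  # no chunks left, this worker is done
            for i in range(start, stop):
                results[i] = body(i)  # each iteration writes only its own slot

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(worker) for _ in range(n_workers)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return results


# Hypothetical loop body: square each index.
squares = run_chunked(lambda i: i * i, 10)
```

Because workers take chunks on demand, a faster worker naturally processes more of them, which is a simple form of the load redistribution the abstract describes.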
12	Performance and power modeling of GPU systems with dynamic voltage and frequency scaling Wang, Qiang 13 August 2020 To address the ever-increasing demand for computing capacity, more and more heterogeneous systems have been designed to use both general-purpose and special-purpose processors. Their huge energy consumption raises new environmental concerns and challenges. Besides performance, energy efficiency is another key factor to be considered by system designers and consumers. In particular, contemporary graphics processing units (GPUs) support dynamic voltage and frequency scaling (DVFS) to balance computational performance and energy consumption. However, accurate and straightforward performance and power estimation for a given GPU kernel under different frequency settings, which is essential for determining the best frequency configuration for energy saving, is still lacking for real hardware. In this thesis, we investigate how to improve the energy efficiency of GPU systems by accurately modeling the effects of GPU DVFS on the target GPU kernel. We also propose efficient algorithms to solve the communication contention problem in scheduling multiple distributed deep learning (DDL) jobs on GPU clusters. We introduce our studies as follows. First, we present a benchmark suite, EPPMiner, for evaluating the performance, power, and energy of different heterogeneous systems. EPPMiner consists of 16 benchmark programs that cover a broad range of application domains and vary widely in how intensively they use the processors. We have implemented a prototype of EPPMiner that supports OpenMP, CUDA, and OpenCL, and demonstrated its usage with three showcases. The showcases show that GPUs provide much better energy efficiency than other types of computing systems, and in particular illustrate the effectiveness of GPU DVFS on the energy efficiency of GPU applications. 
Second, we present a fine-grained analytical model to estimate the execution time of GPU kernels with both core and memory frequency scaling. Compared to cycle-level simulators, which are too slow to apply to real hardware, our model needs only one-off micro-benchmarks to extract a set of hardware parameters and kernel performance counters, without any source code analysis. Our experimental results show that the proposed performance model can capture the kernel performance scaling behaviors under different frequency settings and achieve decent accuracy. Third, we design a cross-benchmarking suite, which simulates kernels with a wide range of instruction distributions. The synthetic kernels generated by this suite can be used for model pre-training or as supplementary training samples. We then build machine learning models to predict the execution time and runtime power of a GPU kernel under different voltage and frequency settings. Validated on three modern GPUs with a wide frequency scaling range, using a collection of 24 real application kernels, the model trained only with our cross-benchmarking suite achieves considerably accurate results. Finally, we establish a new DDL job scheduling framework which organizes DDL jobs as Directed Acyclic Graphs (DAGs) and considers communication contention between nodes. We then propose an efficient job placement algorithm, Least-Workload-First- (LWF-), to balance the GPU utilization and consolidate the allocated GPUs for each job. When scheduling the communication tasks, we propose Ada-SRSF to address the communication contention issue. Our simulation results show that LWF- achieves up to 1.59x improvement over the classical first-fit algorithms. More importantly, Ada-SRSF reduces the average job completion time by up to 36.7%, compared to the solutions of either avoiding all the communication contention or accepting all of it.
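Why per-frequency time and power estimates matter can be shown with a small sketch: once both are known for each setting, the energy-optimal setting is the one minimizing energy = power × time. The function name `best_frequency`, the field names, and every number below are invented for illustration; none of them come from the thesis.

```python
def best_frequency(samples):
    """Return the sample whose energy (average power * execution time) is lowest."""
    return min(samples, key=lambda s: s["power_w"] * s["time_s"])


# Made-up predictions for one kernel at three core frequencies.
samples = [
    {"core_mhz": 1000, "mem_mhz": 3500, "time_s": 2.0, "power_w": 150.0},  # 300 J
    {"core_mhz": 1400, "mem_mhz": 3500, "time_s": 1.5, "power_w": 180.0},  # 270 J
    {"core_mhz": 1800, "mem_mhz": 3500, "time_s": 1.3, "power_w": 230.0},  # 299 J
]
choice = best_frequency(samples)
```

In this toy data the middle frequency wins: the highest clock is fastest but draws disproportionately more power, and the lowest clock saves power but runs too long.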
14	A MapReduce Framework for Heterogeneous Computing Architectures Elteir, Marwa Khamis 01 June 2013 Nowadays, an increasing number of computational systems are equipped with heterogeneous compute resources, i.e., resources following different architectures. This applies at the level of a single chip, a single node, and even supercomputers and large-scale clusters. With their impressive price-to-performance ratio and power efficiency compared to traditional multicore processors, graphics processing units (GPUs) have become an integral part of these systems. GPUs deliver high peak performance; however, efficiently exploiting their computational power requires the exploration of a multi-dimensional space of optimization methodologies, which is challenging even for the well-trained expert. The complexity of this multi-dimensional space arises not only from the traditionally well-known but arduous task of architecture-aware GPU optimization at design and compile time, but also from the partitioning and scheduling of the computation across these heterogeneous resources. Even with programming models like the Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL), the developer still needs to manage the data transfer between host and device and vice versa, orchestrate the execution of several kernels, and, more arduously, optimize the kernel code. In this dissertation, we aim to deliver a transparent parallel programming environment for heterogeneous resources by leveraging the power of the MapReduce programming model and the OpenCL programming language. We propose a portable architecture-aware framework that efficiently runs an application across heterogeneous resources, specifically AMD GPUs and NVIDIA GPUs, while hiding complex architectural details from the developer. To further enhance performance portability, we explore approaches for asynchronously and efficiently distributing the computations across heterogeneous resources. 
When applied to benchmarks and representative applications, our proposed framework significantly enhances performance, including up to 58% improvement over traditional approaches to task assignment and up to a 45-fold improvement over state-of-the-art MapReduce implementations. / Ph. D. Atomics Graphics Processing Units Programming Models Heterogeneous Computing MapReduce
15	Pattern recognition systems design on parallel GPU architectures for breast lesions characterisation employing multimodality images Sidiropoulos, Konstantinos January 2014 The aim of this research was to address the computational complexity in designing multimodality Computer-Aided Diagnosis (CAD) systems for characterising breast lesions, by harnessing the general-purpose computational potential of consumer-level Graphics Processing Units (GPUs) through parallel programming methods. The complexity in designing such systems lies in the increased dimensionality of the problem, due to the multiple imaging modalities involved, in the inherent complexity of optimal design methods for securing high precision, and in assessing the performance of the design prior to deployment in a clinical environment, employing unbiased system evaluation methods. For the purposes of this research, a Pattern Recognition (PR) system was designed to provide the highest possible precision by programming in parallel the multiprocessors of NVIDIA's GPU cards, GeForce 8800GT or 580GTX, using the CUDA programming framework and C++. The PR-system was built around the Probabilistic Neural Network classifier, and its performance was evaluated by a re-substitution method, for estimating the system's highest accuracy, and by the external cross-validation method, for assessing the PR-system's unbiased accuracy on new data “unseen” by the system. Data comprised images of patients with histologically verified (benign or malignant) breast lesions, who underwent both ultrasound (US) and digital mammography (DM). Lesions were outlined on the images by an experienced radiologist, and textural features were calculated. Regarding breast lesion classification, the accuracies for discriminating malignant from benign lesions were 85.5% using US features alone, 82.3% employing DM features alone, and 93.5% combining US and DM features. 
Mean accuracy on new “unseen” data for the combined US and DM features was 81%. Those classification accuracies were about 10% higher than accuracies achieved on a single CPU using sequential programming methods, and were obtained 150-fold faster. In addition, benign lesions were found to be smoother, more homogeneous, and to contain larger structures. Additionally, the PR-system design was adapted for tackling other medical problems, as a proof of its generalisation. These included classification of rare brain tumours (achieving 78.6% for overall accuracy (OA) and 73.8% for estimated generalisation accuracy (GA), and accelerating system design 267 times), discrimination of patients with micro-ischemic and multiple sclerosis lesions (90.2% OA and 80% GA with 32-fold design acceleration), classification of normal and pathological knee cartilages (93.2% OA and 89% GA with 257-fold design acceleration), and separation of low from high grade laryngeal cancer cases (93.2% OA and 89% GA, with 130-fold design acceleration). The proposed PR-system improves breast-lesion discrimination accuracy, may be redesigned on site when new verified data are incorporated in its depository, and may serve as a second opinion tool in a clinical environment. 610.28
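For readers unfamiliar with the Probabilistic Neural Network classifier named above, the following is a minimal CPU sketch of the textbook formulation: each class is scored by the average of Gaussian kernels centred on its training patterns, and the highest-scoring class wins. The function name `pnn_classify`, the smoothing value `sigma`, and the two-dimensional toy feature vectors are assumptions for illustration, not data or code from the thesis.

```python
import math


def pnn_classify(x, training, sigma=1.0):
    """Classify x using a Parzen-window (Gaussian kernel) score per class."""
    scores = {}
    for label, patterns in training.items():
        total = 0.0
        for p in patterns:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        scores[label] = total / len(patterns)  # average kernel response
    return max(scores, key=scores.get)


# Invented toy feature vectors; real inputs would be textural features.
training = {
    "benign": [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)],
    "malignant": [(1.0, 0.9), (0.8, 1.0), (0.9, 1.1)],
}
label = pnn_classify((0.1, 0.1), training, sigma=0.5)
```

Each kernel evaluation in the inner loop is independent of the others, which is the property that makes this classifier a natural fit for GPU parallelisation.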
16	Development of GPU-based incompressible SPH and application to sloshing problems in the oil industry Dickenson, Paul January 2014 No description available.
17	Task Performance with List-Mode Data Caucci, Luca January 2012 This dissertation investigates the application of list-mode data to detection, estimation, and image reconstruction problems, with an emphasis on emission tomography in medical imaging. We begin by introducing a theoretical framework for list-mode data and we use it to define two observers that operate on list-mode data. These observers are applied to the problem of detecting a signal (known in shape and location) buried in a random lumpy background. We then consider maximum-likelihood methods for the estimation of numerical parameters from list-mode data, and we characterize the performance of these estimators via the so-called Fisher information matrix. Reconstruction from PET list-mode data is then considered. In a process we call "double maximum-likelihood" reconstruction, we consider a simple PET imaging system and we use maximum-likelihood methods to first estimate a parameter vector for each pair of gamma-ray photons that is detected by the hardware. The collection of these parameter vectors forms a list, which is then fed to another maximum-likelihood algorithm for volumetric reconstruction over a grid of voxels. Efficient parallel implementation of the algorithms discussed above is then presented. In this work, we take advantage of two low-cost, mass-produced computing platforms that have recently appeared on the market, and we provide some details on implementing our algorithms on these devices. We conclude this dissertation work by elaborating on a possible application of list-mode data to X-ray digital mammography. We argue that today's CMOS detectors and computing platforms have become fast enough to make X-ray digital mammography list-mode data acquisition and processing feasible. Estimation Graphics Processing Units (GPUs) List-Mode data Reconstruction Optical Sciences Detection Emission Tomography
18	Applying the finite-difference time-domain to the modelling of large-scale radio channels Rial, Alvaro Valcarce January 2010 Finite-difference models have been used for nearly 40 years to solve electromagnetic problems of heterogeneous nature. Further, these techniques are well known for being computationally expensive, as well as subject to various numerical artifacts. However, little is yet understood about the errors arising in the simulation of wideband sources with the finite-difference time-domain (FDTD) method. Within this context, the focus of this thesis is on two different problems. On the one hand, the speed and accuracy of current FDTD implementations are analysed and increased. On the other hand, the distortion of numerical pulses is characterised and mitigation techniques are proposed. In addition, recent developments in general-purpose computing on graphics processing units (GPGPU) have unveiled new methods for the efficient implementation of FDTD algorithms. Therefore, this thesis proposes specific GPU-based guidelines for the implementation of the standard FDTD. Then, metaheuristics are used for the calibration of an FDTD-based narrowband simulator. Regarding the simulation of wideband sources, this thesis first uses Lagrange multipliers to characterise the extrema of the numerical group velocity. Then, the spread of numerical Gaussian pulses is characterised analytically in terms of the FDTD grid parameters. The usefulness of the proposed solutions to the previously described problems is illustrated in this thesis using coverage and wideband predictions in large-scale scenarios. In particular, the indoor-to-outdoor radio channel in residential areas is studied. Furthermore, coverage and wideband measurements have also been used to validate the predictions. As a result of all the above, this thesis first introduces an efficient and accurate FDTD simulator. Then, it characterises analytically the propagation of numerical pulses. 
Finally, the narrowband and wideband indoor-to-outdoor channels are modeled using the developed techniques. 621.382
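As background for the FDTD method discussed above, here is a minimal one-dimensional sketch of the standard leapfrog update in normalised units. It is a textbook illustration rather than the thesis's simulator: the grid size, step count, Courant number of 0.5, and Gaussian source parameters are arbitrary assumptions, and the function name `fdtd_1d` is hypothetical.

```python
import math


def fdtd_1d(n_cells=200, n_steps=150, source_cell=100, courant=0.5):
    """Propagate a Gaussian pulse on a 1D staggered grid and return the E field."""
    ez = [0.0] * n_cells  # electric field samples
    hy = [0.0] * n_cells  # magnetic field samples, offset by half a cell
    for t in range(n_steps):
        # Update E from the spatial difference of H.
        for k in range(1, n_cells):
            ez[k] += courant * (hy[k - 1] - hy[k])
        # Soft source: add a Gaussian pulse in time at one cell.
        ez[source_cell] += math.exp(-0.5 * ((t - 40) / 12.0) ** 2)
        # Update H from the spatial difference of E.
        for k in range(n_cells - 1):
            hy[k] += courant * (ez[k] - ez[k + 1])
    return ez


field = fdtd_1d()
```

Every cell update within a half-step depends only on values from the previous half-step, so the inner loops map directly onto one GPU thread per cell, which is the kind of parallelism GPU implementations of FDTD exploit.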
19	Efficient image/video restyling and collage on GPU. / CUHK electronic theses & dissertations collection January 2013 Image/video restyling, as an expressive way of producing user-customized appearances, has received much attention in creative media research. In interactive design, it would be powerful to virtually re-render the stylized presentation of objects of interest using computer-aided design tools for retexturing, especially in image space with a single image or video as input. Current retexturing methods mostly handle texture distortion by manipulating inter-pixel distances in image space, and the underlying texture distortion is always destroyed, owing to limitations such as improper distortion caused by manual mesh stretching or unavoidable texture splitting caused by texture synthesis. 
Image/video collage techniques were invented to allow multiple objects and events to be presented in parallel on the display canvas. With the rapid development of digital video capture devices, a related issue is how to quickly review and summarize large visual media datasets to find video material of interest. Investigating long, tedious surveillance videos and quickly grasping the essential information is a laborious task. By using key information and shortened video forms as vehicles for communication, video abstraction and summary are means of enhancing the browsing efficiency and easy understanding of visual media datasets. / In this thesis, we first focus our image/video restyling work on efficient retexturing and stylization. We present an interactive retexturing method that preserves similar texture distortion without knowing the underlying geometry and lighting environment. We utilize SIFT corner features to naturally discover the underlying texture distortion. Gradient depth recovery and wrinkle stress optimization are applied to accomplish the distortion process. We facilitate interactive retexturing via real-time bilateral grids and feature-guided distortion optimization using GPU-CUDA parallelism. Video retexturing is achieved in real time through a keyframe-based texture transfer strategy using accurate TV-L¹ optical flow with patch motion tracking techniques. Further, we work on GPU-based abstract stylization that preserves the fine structure of the original images using gradient optimization. We propose an image structure map to naturally distill the fine structure of the original images. Gradient-based tangent generation and tangent-guided morphology are applied to build the structure map. We facilitate the final stylization via parallel bilateral grids and structure-aware stylizing in real time on GPU-CUDA. 
In the experiments, our proposed methods consistently demonstrate high-quality image/video abstract restyling in real time. / Currently, in video abstraction, video collages are mostly produced as static keyframe-based collage pictures, which contain limited information about dynamic videos and greatly influence the understanding of visual media datasets. We present a dynamic video collage that effectively summarizes condensed dynamic activities in parallel on the canvas for easy browsing. We propose to utilize activity cuboids to reorganize and extract dynamic objects for further collaging, and video stabilization is performed to generate stabilized activity cuboids. Spatial-temporal optimization is carried out to optimize the positions of activity cuboids in the 3D collage space. We facilitate efficient dynamic collage via event similarity and moving relationship optimization on GPU, allowing multi-video inputs. Our video collage approach with kernel reordering CUDA processing enables dynamic summaries for easy browsing of long videos, while saving a large amount of memory space for storing and transmitting them. The experiments and user study have shown the efficiency and usefulness of our dynamic video collage, which can be widely applied in video briefing and summary applications. In the future, we will further extend the interactive retexturing to more complicated general video applications with large motion and occluded scenes while avoiding texture flickering. We will also work on new approaches to make video retexturing more stable, drawing inspiration from the latest video processing techniques. Our future work on video collage includes investigating applications of dynamic collage in the surveillance industry, and working on moving-camera and general videos, which may contain a large amount of camera motion and different types of video shot transitions. / Detailed summary in vernacular field only. 
/ Li, Ping. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2013. / Includes bibliographical references (leaves 109-121). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts also in Chinese. / Abstract / Acknowledgements / Chapter 1 --- Introduction / Chapter 1.1 --- Background / Chapter 1.2 --- Main Contributions / Chapter 1.3 --- Thesis Overview / Chapter 2 --- Efficient Image/video Retexturing / Chapter 2.1 --- Introduction / Chapter 2.2 --- Related Work / Chapter 2.3 --- Image/video Retexturing on GPU / Chapter 2.3.1 --- Wrinkle Stress Optimization / Chapter 2.3.2 --- Efficient Video Retexturing / Chapter 2.3.3 --- Interactive Parallel Retexturing / Chapter 2.4 --- Results and Discussion / Chapter 2.5 --- Chapter Summary / Chapter 3 --- Structure-Aware Image Stylization / Chapter 3.1 --- Introduction / Chapter 3.2 --- Related Work / Chapter 3.3 --- Structure-Aware Stylization / Chapter 3.3.1 --- Approach Overview / Chapter 3.3.2 --- Gradient-Based Tangent Generation / Chapter 3.3.3 --- Tangent-Guided Image Morphology / Chapter 3.3.4 --- Structure-Aware Optimization / Chapter 3.3.5 --- GPU-Accelerated Stylization / Chapter 3.4 --- Results and Discussion / Chapter 3.5 --- Chapter Summary / Chapter 4 --- Dynamic Video Collage / Chapter 4.1 --- Introduction / Chapter 4.2 --- Related Work / Chapter 4.3 --- Dynamic Video Collage on GPU / Chapter 4.3.1 --- Activity Cuboid Generation / Chapter 4.3.2 --- Spatial-Temporal Optimization / Chapter 4.3.3 --- GPU-Accelerated Parallel Collage / Chapter 4.4 --- Results and Discussion / Chapter 4.5 --- Chapter Summary / Chapter 5 --- Conclusion / Chapter 5.1 --- Research Summary / Chapter 5.2 --- Future Work / Chapter A --- Publication List / Bibliography / Graphics processing units--Programming Image processing--Data processing Digital video--Editing
20	GPU-Acceleration of In-Memory Data Analytics Sitaridi, Evangelia January 2016 Hardware advances strongly influence database system design. The flattening speed of CPU cores makes many-core accelerators, such as GPUs, a vital alternative to explore for processing the ever-increasing amounts of data. GPUs have a significantly higher degree of parallelism than multi-core CPUs, but their cores are simpler. As a result, they do not face the power constraints limiting the parallelism of CPUs. Their trade-off, however, is the increased implementation complexity. This thesis adapts and redesigns data analytics operators to better exploit the GPU's special memory and threading model. Due to the increasing memory capacity and also the user's need for fast interaction with the data, we focus on in-memory analytics. Our techniques span different steps of the data processing pipeline: (1) Data preprocessing, (2) Query compilation, and (3) Algorithmic optimization of the operators. Our data preprocessing techniques adapt the data layout for numeric and string columns to maximize the achieved GPU memory bandwidth. Our query compilation techniques compute the optimal execution plan for conjunctive filters. We formulate memory divergence for string matching algorithms and suggest how to eliminate it. Finally, we parallelize decompression algorithms in our compression framework Gompresso to fit more data into the limited GPU memory. Gompresso achieves high speed-ups on GPUs over multi-core CPU state-of-the-art libraries and is suitable for any massively parallel processor. Memory management (Computer science) Electronic data processing Graphics processing units Computer science

Search results