31

A portable relational algebra library for high performance data-intensive query processing

Saeed, Ifrah 09 April 2014 (has links)
A growing number of industries are turning to data warehousing applications such as forecasting and risk assessment to process large volumes of data. These data warehousing applications, which use queries composed of a mix of arithmetic and relational algebra (RA) operators, currently run on systems built around commodity multi-core CPUs. Given the data-intensive nature of these applications, general-purpose graphics processing units (GPUs), with their high throughput and memory bandwidth, are natural candidates to host them. However, since such relational queries exhibit irregular parallelism and data accesses, their efficient implementation on GPUs remains challenging. Thus, although tailored solutions for individual processors using their native programming environments have evolved, these solutions are not portable to other processors. This thesis addresses the problem by providing a portable implementation, in the form of a library, of the RA, mathematical, and related primitives required to implement and accelerate relational queries over large data sets. These primitives can run on any modern multi- and many-core architecture that supports OpenCL, thereby enhancing the performance potential of such architectures for warehousing applications. In essence, this thesis describes the implementation of the primitives and the results of their performance evaluation on a range of platforms, and concludes with insights, the identification of opportunities, and lessons learned. One of the major insights from our analysis is that for complex relational queries, the time taken to transfer data between host CPUs and discrete GPUs can make the performance of discrete and integrated GPUs comparable in spite of the higher computing power and memory bandwidth of discrete GPUs. Therefore, data movement optimization is the key to effectively harnessing the high performance of discrete GPUs; otherwise, cost effectiveness would encourage the use of integrated GPUs. Furthermore, portability also enables full utilization of all GPUs and CPUs in the system at run time by opportunistically using any type of available processor when a kernel is ready for execution.
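As a hedged illustration of the kind of data-parallel primitive such a library ports across architectures (the actual library targets OpenCL; the names and the NumPy formulation below are ours, not the author's), relational selection is typically decomposed into a map, a prefix sum, and a scatter, each of which is a flat data-parallel operation:

import numpy as np

def select(column: np.ndarray, predicate) -> np.ndarray:
    """Data-parallel SELECT: flag, prefix-sum, scatter.

    Each step is a flat data-parallel primitive, which is why this
    pattern ports to any OpenCL-capable device.
    """
    flags = predicate(column).astype(np.int64)       # map: evaluate predicate per tuple
    offsets = np.cumsum(flags) - flags               # exclusive prefix sum -> output slots
    out = np.empty(int(flags.sum()), dtype=column.dtype)
    out[offsets[flags == 1]] = column[flags == 1]    # scatter the qualifying tuples
    return out

values = np.array([5, 17, 3, 42, 8])
print(select(values, lambda v: v > 7))               # [17 42  8]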
32

A smoothed particle hydrodynamic simulation utilizing the parallel processing capabilities of the GPUs

Lundqvist, Viktor January 2009 (has links)
Simulating fluid behavior has proven to be a demanding challenge that requires complex computational models and highly efficient data structures. Smoothed Particle Hydrodynamics (SPH) is a particle-based computational model used to simulate fluid behavior that has been found capable of producing convincing results. However, the SPH algorithm is computationally heavy, which makes it cumbersome to work with. This master's thesis describes how the SPH algorithm can be accelerated by utilizing the GPU's computational resources. It describes a model for distributing the workload on the GPU and presents a suitable data structure. In addition, it proposes a method to represent and handle moving objects in the fluid's surroundings. Finally, the performance gain due to the GPU is evaluated by comparing processing times with an identical implementation running solely on the CPU.
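For readers unfamiliar with the density step that dominates SPH, a minimal sketch follows, assuming the commonly used poly6 smoothing kernel; the constants and the naive all-pairs formulation are illustrative, not taken from the thesis:

import numpy as np

def sph_density(positions: np.ndarray, mass: float, h: float) -> np.ndarray:
    """Naive O(n^2) SPH density estimate with the poly6 smoothing kernel.

    rho_i = sum_j m * W(|r_i - r_j|, h). A real GPU implementation
    replaces the all-pairs sum with a spatial grid so each particle
    only visits neighboring cells.
    """
    coeff = 315.0 / (64.0 * np.pi * h**9)            # poly6 normalization in 3-D
    diff = positions[:, None, :] - positions[None, :, :]
    r2 = np.sum(diff * diff, axis=-1)
    w = np.where(r2 < h * h, coeff * (h * h - r2) ** 3, 0.0)
    return mass * w.sum(axis=1)

pts = np.random.default_rng(0).random((100, 3))
print(sph_density(pts, mass=0.02, h=0.1)[:5])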
33

Models and Techniques for Green High-Performance Computing

Adhinarayanan, Vignesh 01 June 2020 (has links)
High-performance computing (HPC) systems have become power limited. For instance, the U.S. Department of Energy set a power envelope of 20 MW in 2008 for the first exascale supercomputer, now expected to arrive in 2021-22. Toward this end, we seek to improve the greenness of HPC systems by improving their performance per watt at the allocated power budget. In this dissertation, we develop a series of models and techniques to manage power at the micro-, meso-, and macro-levels of the system hierarchy, specifically addressing data movement and heterogeneity. We target the chip interconnect at the micro-level, heterogeneous nodes at the meso-level, and a supercomputing cluster at the macro-level. Overall, our goal is to improve the greenness of HPC systems by intelligently managing power. The first part of this dissertation focuses on measurement and modeling problems for power. First, we study how to infer chip-interconnect power by observing system-wide power consumption. Our proposal is a novel micro-benchmarking methodology based on data-movement distance, by which we can properly isolate the chip interconnect and measure its power. Next, we study how to develop software power meters to monitor a GPU's power consumption at runtime. Our proposal is to adapt performance-counter-based models for use at runtime via a combination of heuristics, statistical techniques, and application-specific knowledge. In the second part of this dissertation, we focus on managing power. First, we propose to reduce chip-interconnect power by proactively managing its dynamic voltage and frequency scaling (DVFS) state. Toward this end, we develop a novel phase predictor that uses approximate pattern matching to forecast future requirements and, in turn, proactively manage power. Second, we study the problem of applying a power cap to a heterogeneous node. Our proposal proactively manages the GPU power using phase prediction and a DVFS power model but reactively manages the CPU. The resulting hybrid approach can take advantage of the differences in the capabilities of the two devices. Third, we study how in-situ techniques can be applied to improve the greenness of HPC clusters. Overall, in our dissertation, we demonstrate that it is possible to infer the power consumption of real hardware components without directly measuring them, using the chip interconnect and GPU as examples. We also demonstrate that it is possible to build models of sufficient accuracy and apply them for intelligently managing power at many levels of the system hierarchy. / Doctor of Philosophy / Past research in green high-performance computing (HPC) mostly focused on managing the power consumed by general-purpose processors, known as central processing units (CPUs), and, to a lesser extent, memory. In this dissertation, we study two increasingly important components: interconnects (predominantly those inside a chip, but not limited to them) and graphics processing units (GPUs). Our contributions in this dissertation include a set of innovative measurement techniques to estimate the power consumed by the target components, statistical and analytical approaches to develop power models and their optimizations, and algorithms to manage power statically and at runtime.
Experimental results show that it is possible to build models of sufficient accuracy and apply them for intelligently managing power on multiple levels of the system hierarchy: chip interconnect at the micro-level, heterogeneous nodes at the meso-level, and a supercomputing cluster at the macro-level.
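As a hedged sketch of the general performance-counter modeling technique (the counter set, weights, and data below are invented for illustration and are not the dissertation's actual model), a linear power model can be fit offline and evaluated cheaply at runtime:

import numpy as np

# Hypothetical training data: rows = samples, columns = normalized
# activity counters (e.g., DRAM accesses, SM occupancy, ALU utilization).
rng = np.random.default_rng(1)
counters = rng.random((200, 3))
true_weights = np.array([40.0, 85.0, 25.0])          # watts per unit activity
measured_power = counters @ true_weights + 30.0 + rng.normal(0, 1.0, 200)

# Fit P ~ w0 + w1*c1 + w2*c2 + w3*c3 with ordinary least squares.
X = np.column_stack([np.ones(len(counters)), counters])
weights, *_ = np.linalg.lstsq(X, measured_power, rcond=None)

def estimate_power(sample: np.ndarray) -> float:
    """Runtime estimate: one dot product per sample, cheap enough to poll."""
    return float(weights[0] + sample @ weights[1:])

print(estimate_power(np.array([0.5, 0.7, 0.2])))     # ~ idle + activity watts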
34

Real-time shadow detection and removal in aerial motion imagery application

Silva, Guilherme Fróes 14 August 2017 (has links)
Previous issue date: 2017-08-14 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
35

Efficient Compilation Of Stream Programs Onto Multi-cores With Accelerators

Udupa, Abhishek 07 1900 (has links)
Over the past two decades, microprocessor manufacturers have typically relied on wider issue widths and deeper pipelines to obtain performance improvements for single-threaded applications. However, in recent years, with power dissipation and wire delays becoming primary design constraints, this approach can no longer be used effectively to yield performance improvements. Thus, processor designers and vendors are universally moving toward multi-core designs. Examples include commodity general-purpose multi-core processors, the Cell BE accelerator from IBM, and the graphics processing units from NVIDIA and ATI. Although these multi- and many-core architectures can provide enormous performance benefits, they are difficult to program for because of the complexity of writing explicitly parallel code. The ubiquity of computationally intensive media processing applications makes it imperative to consider new programming frameworks and languages that can express parallelism in an easy, portable manner. The StreamIt programming language has been proposed to efficiently exploit parallelism at various levels on general-purpose multi-core architectures and stream processors, and to allow media processing and DSP applications to be developed in an easy and portable fashion. The StreamIt model allows programmers to specify a program as a set of filters connected by FIFO communication channels. The graphs thus specified describe task, data, and pipeline parallelism that can potentially be exploited on modern graphics processing units (GPUs), which have emerged as powerful commodity stream processors supporting abundant parallelism in hardware. The first part of this thesis deals with the challenges in mapping StreamIt programs to GPUs and proposes an efficient technique to software-pipeline the execution of stream programs on GPUs. We formulate this problem (both the scheduling and the assignment of filters to processors) as an Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs that facilitates exploiting their high memory bandwidth. The proposed scheduling utilizes both the scalar units of the GPU, to exploit data parallelism, and the multiprocessors, to exploit task and pipeline parallelism. We have evaluated our approach on a platform equipped with an NVIDIA GeForce 8800 GTS 512 GPU; it yields a geometric-mean speedup of 5.02X, with a maximum speedup of 36.83X, across a set of StreamIt benchmarks, measured relative to an optimized single-threaded CPU execution. While software pipelining the execution of stream programs on GPUs is efficient and performs well, it does not utilize the CPU cores for useful computation. Further, it does not support programs with stateful filters, i.e., filters that are not data parallel owing to a dependence between successive firings carried through the implicit state of the filter. The second part of the thesis addresses these issues and describes a novel method to orchestrate the execution of a StreamIt program on the multiple cores of a system and its GPUs in a synergistic manner. The proposed approach identifies, using profiling, the relative benefits of executing a task on the superscalar CPU cores and on the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies of data transfers, the limited DMA bandwidth available, and the buffer layout transformations required by the partitioning, as an integrated Integer Linear Program (ILP) that can then be solved by an ILP solver. Since solving an ILP is NP-hard in the general case and may thus require a large amount of time, we also propose an efficient heuristic algorithm for partitioning the work between the CPU and the GPU; it provides solutions within 9.05% of the optimal ILP solutions on average across the benchmark suite, while requiring 2-3 orders of magnitude less time than the ILP approach. The partitioned tasks are then software pipelined to execute on the multiple CPU cores and the streaming multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between the CPU cores and the GPU by emitting the code for the CPU, the code for the GPU, and the code for the required data transfers. Our experiments on a platform with eight CPU cores (of which four were used) and a GeForce 8800 GTS 512 GPU show a geometric-mean speedup of 6.84X, with a maximum of 51.96X, over single-threaded CPU execution across a set of StreamIt benchmarks.
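The thesis's heuristic is its own; purely as an illustration of the shape such a partitioner can take, the following greedy sketch assigns filters using profiled CPU and GPU times while ignoring the DMA and buffer-layout costs that the real ILP formulation accounts for:

def partition_filters(profile):
    """Greedy work partitioning between CPU and GPU.

    profile: dict filter_name -> (cpu_time, gpu_time) from profiling runs.
    The objective is to minimize the slower side, i.e., the pipeline's
    steady-state bottleneck.
    """
    # Place the filters with the most lopsided affinity first.
    ordered = sorted(profile.items(), key=lambda kv: -abs(kv[1][0] - kv[1][1]))
    cpu, gpu, cpu_load, gpu_load = [], [], 0.0, 0.0
    for name, (t_cpu, t_gpu) in ordered:
        # Assign to whichever side keeps the bottleneck smaller.
        if max(cpu_load + t_cpu, gpu_load) <= max(cpu_load, gpu_load + t_gpu):
            cpu.append(name); cpu_load += t_cpu
        else:
            gpu.append(name); gpu_load += t_gpu
    return cpu, gpu

print(partition_filters({"fft": (9.0, 1.5), "scale": (1.0, 0.8), "sink": (0.5, 2.0)}))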
36

A GPU Stream Computing Approach to Terrain Database Integrity Monitoring

McKeon, Sean Patrick 10 July 2009 (has links)
Synthetic Vision Systems (SVS) provide an aircraft pilot with a virtual 3-D image of surrounding terrain which is generated from a digital elevation model stored in an onboard database. SVS improves the pilot's situational awareness at night and in inclement weather, thus reducing the chance of accidents such as controlled flight into terrain. A terrain database integrity monitor is needed to verify the accuracy of the displayed image due to potential database and navigational system errors. Previous research has used existing aircraft sensors to compare the real terrain position with the predicted position. We propose an improvement to one of these models by leveraging the stream computing capabilities of commercial graphics hardware. "Brook for GPUs," a system for implementing stream computing applications on programmable graphics processors, is used to execute a streaming ray-casting algorithm that correctly simulates the beam characteristics of a radar altimeter during all phases of flight.
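As a hedged sketch of the underlying idea (the ray marching below is a generic formulation, not the streaming algorithm or the Brook code from the thesis), a radar-altimeter return can be simulated by marching a ray through a digital elevation model until it first falls below the terrain:

import numpy as np

def first_terrain_hit(dem, origin, direction, step=1.0, max_range=5000.0):
    """March a ray through a regular-grid DEM and return the slant range
    at which it first falls below the terrain surface, or None.

    dem: 2-D array of elevations; origin: (x, y, z); direction: unit
    vector. Each ray is independent, which is what makes the algorithm
    a good fit for a stream-computing formulation.
    """
    pos = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    for t in np.arange(0.0, max_range, step):
        x, y, z = pos + t * d
        i, j = int(round(y)), int(round(x))
        if not (0 <= i < dem.shape[0] and 0 <= j < dem.shape[1]):
            return None                      # ray left the database extent
        if z <= dem[i, j]:
            return t                         # first intersection: simulated return
    return None

terrain = np.zeros((100, 100)); terrain[40:60, 40:60] = 120.0   # a mesa
print(first_terrain_hit(terrain, origin=(10, 50, 300), direction=(0.25, 0, -0.97)))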
37

Global address spaces for efficient resource provisioning in the data center

Young, Jeffrey Scott 13 January 2014 (has links)
The rise of large data sets, or "Big Data", has coincided with the rise of clusters with large amounts of memory and GPU accelerators that can be used to process rapidly growing data footprints. However, the complexity and performance limitations of sharing memory and accelerators in a cluster limit the options for efficient management and allocation of resources for applications. The global address space (GAS) model, and specifically hardware-supported GAS, is proposed as a means to provide a high-performance resource management platform upon which resource sharing between nodes and resource aggregation across nodes can take place. This thesis builds on the initial concept of GAS with a model matched to "Big Data" computing and its data transfer requirements. The proposed model, Dynamic Partitioned Global Address Spaces (DPGAS), is implemented using a commodity converged interconnect, HyperTransport over Ethernet (HToE), and a software framework, the Oncilla runtime and API. The DPGAS model and its associated hardware and software components are used to investigate two application spaces: resource sharing for time-varying workloads and resource aggregation for GPU-accelerated data warehousing applications. This work demonstrates that hardware-supported GAS can be used to improve the performance and power consumption of memory-intensive applications, and that it can simplify host and accelerator resource management in the data center.
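As a hedged illustration of the partitioned-global-address idea underlying DPGAS (the field widths below are invented for illustration and are not the DPGAS or HToE packet format), a 64-bit global address can be split into a node identifier and a local offset:

NODE_BITS = 16                      # illustrative field widths, not the
OFFSET_BITS = 48                    # actual DPGAS/HToE encoding

def to_global(node_id: int, local_offset: int) -> int:
    """Pack a node id and a local offset into one 64-bit global address,
    the core idea of a partitioned global address space."""
    assert node_id < (1 << NODE_BITS) and local_offset < (1 << OFFSET_BITS)
    return (node_id << OFFSET_BITS) | local_offset

def from_global(gaddr: int) -> tuple[int, int]:
    """Unpack: the interconnect routes to the node, which then resolves
    the offset in its local memory."""
    return gaddr >> OFFSET_BITS, gaddr & ((1 << OFFSET_BITS) - 1)

g = to_global(node_id=7, local_offset=0x1F4000)
print(hex(g), from_global(g))       # routing target and local address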
38

Contributions to parallel stochastic simulation: application of good software engineering practices to the distribution of pseudorandom streams in hybrid Monte Carlo simulations

Passerat-Palmbach, Jonathan 11 October 2013 (has links)
The race for computing power intensifies every day in the simulation community. A few years ago, scientists started to harness the computing power of graphics processing units (GPUs) to parallelize their simulations. As with any parallel architecture, not only must the simulation model implementation be ported to the new parallel platform, but all the supporting tools must be reimplemented as well. In the particular case of stochastic simulations, one of the major elements of the implementation is the source of pseudorandom numbers. Employing pseudorandom numbers in parallel applications is not a straightforward task, and it has to be done with caution so as not to introduce bias into the results of the simulation. This problem, known as pseudorandom stream distribution, has been studied since parallel architectures first became available. While the literature is full of solutions for handling pseudorandom stream distribution on CPU-based parallel platforms, the young GPU programming community cannot yet draw on the same depth of experience. In this thesis, we study how to correctly distribute pseudorandom streams on GPUs. From the existing solutions, we identified a need for good software engineering practice, coupled with sound theoretical choices in the implementation. We propose a set of guidelines to follow when a PRNG has to be ported to the GPU, and put this advice into practice in a software library called ShoveRand. This library is used in a stochastic polymer folding model that we have implemented in C++/CUDA. Pseudorandom stream distribution on many-core architectures is also one of our concerns. It resulted in a contribution named TaskLocalRandom, which targets parallel Java applications using pseudorandom numbers and task frameworks. Finally, we share a reflection on how to choose the right parallel platform for a given application. To this end, we propose to automatically build prototypes of the parallel application running on a wide set of architectures. This approach relies on existing software engineering tools from the Java and Scala communities, most of them generating OpenCL source code from a high-level abstraction layer.
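As a hedged sketch of the keyed-stream idea (ShoveRand itself is C++/CUDA; the NumPy Philox generator below is only a stand-in), counter-based generators make stream distribution straightforward because each logical stream gets its own key:

import numpy as np
from numpy.random import Generator, Philox

def stream_for(replication: int, thread_id: int) -> Generator:
    """One statistically independent stream per (replication, thread).

    Keying a counter-based generator such as Philox with the logical
    stream index avoids the overlap risks of naive seed arithmetic.
    """
    key = (replication << 32) | thread_id        # unique 64-bit stream key
    return Generator(Philox(key=key))

# Two GPU-style threads in the same replication draw from disjoint streams.
r0t0 = stream_for(0, 0).random(3)
r0t1 = stream_for(0, 1).random(3)
print(r0t0, r0t1)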
39

Solving sparse linear and nonlinear systems on GPU clusters

Ziane Khodja, Lilia 07 June 2013 (has links)
Over the past few years, clusters equipped with GPUs have become attractive tools for high-performance computing. In this thesis, we have designed parallel iterative algorithms for solving large sparse linear and nonlinear systems on GPU clusters. First, we focused on solving sparse linear systems using the CG and GMRES iterative methods. The experiments showed that a GPU cluster is more efficient than its pure-CPU counterpart for solving large sparse systems of linear equations. Then, we implemented synchronous and asynchronous algorithms of the Richardson and block-relaxation iterative methods for solving sparse nonlinear systems. We noticed that the best solutions developed for CPUs are not necessarily well suited to GPUs. Indeed, the experiments performed on a GPU cluster showed that the parallel algorithms of the Richardson method are far more efficient than those of the block-relaxation method. In addition, they showed that the computing power of GPUs reduces the ratio of computation time to communication time, which favors the use of asynchronous iterations on GPU clusters. Finally, we are interested in geographically distant clusters for solving large sparse linear systems. In this context, we used a two-stage multisplitting method with a parallel GMRES method adapted to GPU clusters. It uses synchronous iterations to solve the sub-linear systems locally and asynchronous iterations to solve the global sparse linear system.
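For reference, the iteration structure of CG that such work parallelizes looks as follows; this is the textbook method in NumPy, not the authors' cluster implementation, and the test matrix is invented for illustration:

import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Textbook CG for a symmetric positive-definite system Ax = b.

    Every operation is a matrix-vector product, dot product, or axpy,
    which is why the method maps well to GPUs; on a cluster, the
    mat-vec and the two dot products become the communication points.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
print(conjugate_gradient(A, np.array([1.0, 2.0])))   # ~ [0.0909, 0.6364]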
40

Efficient Execution Of AMR Computations On GPU Systems

Raghavan, Hari K 11 1900 (has links) (PDF)
Adaptive Mesh Refinement (AMR) is a method that dynamically varies the spatio-temporal resolution of localized mesh regions in numerical simulations, based on the strength of the solution features. By discretizing localized regions of interest at high resolution into rectangular mesh units called patches, AMR provides low computational cost and a high degree of accuracy. General-purpose graphics processing units (GPGPUs), with their support for fine-grained parallelism, offer an attractive option for obtaining high performance for AMR applications. The data-parallel computations of the finite difference schemes of AMR can be performed efficiently on GPGPUs. This research deals with the challenges of, and develops techniques for, efficient execution of AMR applications with uniform and non-uniform patches on GPUs. In the first part of the thesis, we optimize an AMR model with uniform patches. We have developed strategies for continuous online visualization of time-evolving data for AMR applications executed on GPUs. In-situ visualization plays an important role in analyzing the time-evolving characteristics of the domain structures. Continuous visualization of the output data for various time steps allows better study of the underlying domain and of the model used to simulate it. We reorder the meshes for computation on the GPU based on the user's input about the subdomain to be visualized, which makes the data available for visualization at a faster rate. We then perform the visualization steps and the fix-up operations on the coarse meshes asynchronously on the CPUs while the GPU advances the solution. Through experiments on Tesla S1070 and Fermi C2070 clusters, we found that our strategies result in up to 60% improvement in response time and 16% improvement in the rate of visualization of frames over the existing strategy of performing fix-ups and visualization at the end of the time steps. The second part of the thesis deals with adaptive strategies for efficient execution of block-structured AMR applications with non-uniform patches on GPUs. Most AMR approaches use patches of uniform size over regions of interest. Since this leads to over-refinement, some efforts have focused on forming patches of non-uniform dimensions to improve computational efficiency, since the dimensions of a patch can be tuned to the geometry of a region of interest. While effective hybrid execution strategies exist for applications with uniform patches, our work considers the efficient execution of non-uniform patches with differing workloads. Our techniques include a geometric bin-packing method to load-balance GPU computations and reduce thread idling, adaptive determination of the amount of work to maximize asynchronism between CPU and GPU executions using a knapsack formulation, and the scheduling of communications for multi-GPU executions. We test our strategies on synthetic inputs as well as on traces from real applications. Our experiments on Tesla S1070 and Fermi C2070 clusters, with both single-GPU and multi-GPU executions, show that our strategies result in up to 69% improvement in performance over existing strategies. Our bin-packing-based load balancing gives performance gains of up to 39%, kernel optimizations give an improvement of up to 20%, and our strategies for adaptive asynchronism between CPU and GPU executions give performance improvements of up to 17% over default static asynchronous executions.
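As a hedged illustration of the bin-packing idea, the first-fit-decreasing sketch below balances scalar per-patch work estimates; the thesis's geometric variant additionally matches patch shapes, which this sketch does not attempt:

def first_fit_decreasing(workloads, capacity):
    """First-fit-decreasing bin packing.

    workloads: per-patch work estimates; capacity: target work per GPU
    kernel launch. Packing non-uniform patches into near-equal bins
    reduces thread idling from load imbalance.
    """
    bins = []
    for w in sorted(workloads, reverse=True):
        for b in bins:
            if sum(b) + w <= capacity:
                b.append(w)
                break
        else:
            bins.append([w])                 # open a new bin
    return bins

print(first_fit_decreasing([7, 5, 4, 3, 2, 2, 1], capacity=10))
# [[7, 3], [5, 4, 1], [2, 2]]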
