Global ETD Search

41	Programming High-Performance Clusters with Heterogeneous Computing Devices Aji, Ashwin M. 19 May 2015 Today's high-performance computing (HPC) clusters are seeing an increase in the adoption of accelerators like GPUs, FPGAs and co-processors, leading to heterogeneity in the computation and memory subsystems. To program such systems, application developers typically employ a hybrid programming model of MPI across the compute nodes in the cluster and an accelerator-specific library (e.g., CUDA, OpenCL, OpenMP, OpenACC) across the accelerator devices within each compute node. Such explicit management of disjoint computation and memory resources reduces both productivity and performance. This dissertation focuses on designing, implementing and evaluating a runtime system for HPC clusters with heterogeneous computing devices. This work also explores extending existing programming models to make use of our runtime system for easier code modernization of existing applications. Specifically, we present MPI-ACC, an extension to the popular MPI programming model and runtime system for efficient data movement and automatic task mapping across the CPUs and accelerators within a cluster, and discuss the lessons learned. MPI-ACC's task-mapping runtime subsystem performs fast and automatic device selection for a given task. MPI-ACC's data-movement subsystem includes careful optimizations for end-to-end communication among CPUs and accelerators, which application developers can leverage seamlessly. MPI-ACC provides a familiar, flexible and natural interface for programmers to choose the right computation or communication targets, while its runtime system achieves efficient cluster utilization. / Ph. D. Runtime Systems Programming Models Message Passing Interface (MPI) CUDA OpenCL
42	Accelerating Quantum Monte Carlo via Graphics Processing Units Himberg, Benjamin Evert 01 January 2017 An exact quantum Monte Carlo algorithm for interacting particles in the spatial continuum is extended to exploit the massive parallelism offered by graphics processing units. Its efficacy is tested on the Calogero-Sutherland model describing a system of bosons interacting in one spatial dimension via an inverse square law. Due to the long-range nature of the interactions, this model has proved difficult to simulate via conventional path integral Monte Carlo methods running on conventional processors. Using graphics processing units, speedup factors of up to 640 times are obtained for N = 126 particles. The known results for the ground state energy are confirmed and, for the first time, the effects of thermal fluctuations at finite temperature are explored. Calogero-Sutherland model Exactly solvable Graphics Processing Units One-dimensional models Path Integral Monte Carlo Condensed Matter Physics
43	General Purpose Computing in Gpu - a Watermarking Case Study Hanson, Anthony 08 1900 The purpose of this project is to explore the GPU for general purpose computing. The GPU is a massively parallel computing device that has high throughput, exhibits high arithmetic intensity, and has a large market presence; with the computation power added to it each year through innovations, it is a perfect candidate to complement the CPU in performing computations. The GPU follows the single instruction multiple data (SIMD) model for applying operations on its data. This model makes the GPU very useful for assisting the CPU in performing computations on data that is highly parallel in nature. The compute unified device architecture (CUDA) is a parallel computing and programming platform for NVIDIA GPUs. The main focus of this project is to show the power, speed, and performance of a CUDA-enabled GPU for digital video watermark insertion in the H.264 video compression domain. Digital video watermarking in general is a highly computationally intensive process that is strongly dependent on the video compression format in place. The H.264/MPEG-4 AVC video compression format has high compression efficiency at the expense of high computational complexity, and it leaves little room for an imperceptible watermark to be inserted. Employing a human visual model to limit the distortion and degradation of visual quality introduced by the watermark is a good choice when designing a video watermarking algorithm, though it adds computational complexity to the algorithm. Research is being conducted into how CPU-GPU execution of the digital watermark application, optimized using the NVIDIA Visual Profiler, can boost its speed several times compared to running it on a standalone CPU. H.264 video compression domain digital video watermark CUDA Graphics processing units. CUDA (Computer architecture) Data protection.
44	Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors January 2018 General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of their interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization.
Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPGPUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2018 Computer science Computer engineering Cache memories Chip multiprocessors Computer architecture Graphics Processing Units Memory subsystem Moore's law
45	Modeling the performance of many-core programs on GPUs with advanced features Pei, Mo Mo January 2012 University of Macau / Faculty of Science and Technology / Department of Computer and Information Science Rendering (Computer graphics) Computer graphics. Real-time data processing Image processing -- Digital techniques
46	Flexible architecture methods for graphics processing Dutton, Marcus 29 March 2011 The FPGA GPU architecture proposed in this thesis was motivated by underserved markets for graphics processing that desire flexibility, long-term device availability, scalability, certifiability, and high reliability. These markets of industrial, medical, and avionics applications are often forced to rely on the latest GPUs that were actually designed for gaming PCs or handheld consumer devices. The architecture for the GPU in this thesis was crafted specifically for an FPGA and therefore takes advantage of its capabilities while also avoiding its limitations. Previous work did not specifically exploit the FPGA's structures and instead used FPGA implementations merely as an integration platform prior to proceeding on to a final ASIC design. The target of an FPGA for this architecture is also important because its flexibility and programmability allow the GPU's performance to be scaled or supplemented to fit unique application requirements. This tailoring of the architecture to specific requirements minimizes power consumption and device cost while still satisfying performance, certification, and device availability requirements. To demonstrate the feasibility of the flexible FPGA GPU architectural concepts, the architecture is applied to an avionics application and analyzed to confirm satisfactory results. The architecture is further validated through the development of extensions to support more comprehensive graphics processing applications. In addition, the breadth of this research is illustrated through its applicability to general-purpose computations and, more specifically, scientific visualizations. GPU FPGA Graphics processing units Computer graphics Field programmable gate arrays Application-specific integrated circuits Integrated circuits
47	Creation, study and optimization of real-time photorealistic rendering using programmable graphics processors Σταυρόπουλος, Ασημάκης 22 September 2009 Graphics processing units (GPUs) are powerful parallel processors that are now found in every modern personal computer (PC). GPUs take over and accelerate the drawing of two- and three-dimensional graphics on the computer's display. Their evolution has been so rapid in recent years that they now exceed modern central processing units (CPUs) in complexity, and they can accelerate demanding applications other than graphics, such as artificial intelligence and the simulation of physical interactions between objects (collisions, explosions, fluid motion). The purpose of this thesis is to implement, study and optimize shading algorithms that run on GPUs in real time. The term shading refers to the interaction between light and the material of every object in a virtual three-dimensional environment. The thesis presents the tools (APIs) and programming languages for GPUs, as well as techniques for optimizing the execution of shaders, which is a matter of major importance in real-time simulations. Graphics cards Shading algorithms 006.677 3 Graphics processing units (GPUs) Shading Shaders Real time graphics
48	A portable relational algebra library for high performance data-intensive query processing Saeed, Ifrah 09 April 2014 A growing number of industries are turning to data warehousing applications such as forecasting and risk assessment to process large volumes of data. These data warehousing applications, which utilize queries comprised of a mix of arithmetic and relational algebra (RA) operators, currently run on systems that utilize commodity multi-core CPUs. If we acknowledge the data-intensive nature of these applications, general purpose graphics processing units (GPUs) with high throughput and memory bandwidth seem to be natural candidates to host these applications. However, since such relational queries exhibit irregular parallelism and data accesses, their efficient implementation on GPUs remains challenging. Thus, although tailored solutions for individual processors using their native programming environments have evolved, these solutions are not accessible to other processors. This thesis addresses this problem by providing, in the form of a library, a portable implementation of the RA, mathematical, and related primitives required to implement and accelerate relational queries over large data sets. These primitives can run on any modern multi- and many-core architecture that supports OpenCL, thereby enhancing the performance potential of such architectures for warehousing applications. In essence, this thesis describes the implementation of primitives and the results of their performance evaluation on a range of platforms and concludes with insights, the identification of opportunities, and lessons learned. One of the major insights from our analysis is that for complex relational queries, the time taken to transfer data between host CPUs and discrete GPUs can render the performance of discrete and integrated GPUs comparable in spite of the higher computing power and memory bandwidth of discrete GPUs.
Therefore, data movement optimization is the key to effectively harnessing the high performance of discrete GPUs; otherwise, cost-effectiveness would encourage the use of integrated GPUs. Furthermore, portability also enables the complete utilization of all GPUs and CPUs in the system at run time by opportunistically using any type of available processor when a kernel is ready for execution. Data-intensive query processing RA operators OpenCL GPUs CPUs Graphics processing units Data warehousing Big data Relation algebras
49	Accelerating computational diffusion MRI using Graphics Processing Units Fernandez, Moises Hernandez January 2017 Diffusion magnetic resonance imaging (dMRI) uniquely allows the human brain to be studied non-invasively and in vivo. Advances in dMRI offer new insight into tissue microstructure and connectivity, and the possibility of investigating the mechanisms and pathology of neurological diseases. The great potential of the technique relies on indirect inference, as modelling frameworks are necessary to map dMRI measurements to neuroanatomical features. However, this mapping can be computationally expensive, particularly given the trend of increasing dataset sizes and/or the increased complexity in biophysical modelling. Limitations on computing can restrict data exploration and even methodology development. A step forward is to take advantage of the power offered by recent parallel computing architectures, especially Graphics Processing Units (GPUs). GPUs are massively parallel processors that offer trillions of floating point operations per second, and have made possible the solution of computationally intensive scientific problems that were intractable before. However, they are not inherently suited for all types of problems, and bespoke computational frameworks need to be developed in many cases to take advantage of their full potential. In this thesis, we propose parallel computational frameworks for the analysis of dMRI using GPUs within different contexts. We show that GPU-based designs can offer accelerations of more than two orders of magnitude for a number of scientific computing tasks with different parallelisability requirements, ranging from biophysical modelling for tissue microstructure estimation to white matter tractography for connectome generation. We develop novel and efficient GPU-accelerated solutions, including a framework that automatically generates GPU parallel code from a user-specified biophysical model.
We also present a parallel GPU framework for performing probabilistic tractography and generating whole-brain connectomes. Throughout the thesis, we discuss several strategies for parallelising scientific applications, and we show the great potential of the accelerations obtained, which change the perspective of what is computationally feasible.
50	Adaptive signal processing for multichannel sound using high performance computing Lorente Giner, Jorge 02 December 2015 The field of audio signal processing has undergone major development in recent years. Both the consumer and professional marketplaces continue to show growth in audio applications such as immersive audio schemes that offer an optimal listening experience, intelligent noise reduction in cars, and improvements in audio teleconferencing or hearing aids. The development of these applications shares a common interest in increasing or improving the number of discrete audio channels, the quality of the audio or the sophistication of the algorithms. This often gives rise to problems of high computational cost, even when using common signal processing algorithms, mainly because these algorithms are applied to multiple signals with real-time requirements. The field of High Performance Computing (HPC) based on low-cost hardware elements is the bridge needed between the computing problems and the real multimedia signals and systems that lead to user applications. In this sense, the present thesis goes a step further in the development of these systems by using the computational power of General Purpose Graphics Processing Units (GPGPUs) to exploit the inherent parallelism of signal processing for multichannel audio applications. The increase in the computational capacity of processing devices has historically been linked to the number of transistors in a chip. Nowadays, however, improvements in computational capacity come mainly from increasing the number of processing units and using parallel processing. Graphics Processing Units (GPUs), which now have thousands of computing cores, are a representative example. GPUs were traditionally used for graphics or image processing, but new releases of GPU programming environments such as CUDA have allowed the use of GPUs for general processing applications. Hence, the use of GPUs is being extended to a wide variety of computation-intensive applications, among which audio processing is included. However, the data transfers between the CPU and the GPU and vice versa have called into question the viability of GPUs for audio applications in which real-time interaction between microphones and loudspeakers is required. This is the case for adaptive filtering applications, where an efficient use of parallel computation is not straightforward. For these reasons, up to the beginning of this thesis, very few publications had dealt with the GPU implementation of real-time acoustic applications based on adaptive filtering. Therefore, this thesis aims to demonstrate that GPUs are totally valid tools for carrying out audio applications based on adaptive filtering that require high computational resources. To this end, different adaptive applications in the field of audio processing are studied and implemented using GPUs. This manuscript also analyzes and solves possible limitations in each GPU-based implementation from both the acoustic and the computational points of view. / Lorente Giner, J. (2015). Adaptive signal processing for multichannel sound using high performance computing [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/58427 / THESIS Multichannel Adaptive filtering Adaptive Equalization Active Noise Control Distributed Processing Graphics Processing Units. SIGNAL THEORY AND COMMUNICATIONS
