Global ETD Search

31	Paralelizace evolučních algoritmů pomocí GPU / GPU Parallelization of Evolutionary Algorithms Valkovič, Patrik January 2021 Graphics Processing Units (GPUs) underpin the success of Artificial Neural Networks over the past decade and their broader application in industry. Another promising field of Artificial Intelligence is Evolutionary Algorithms (EAs). Their suitability for parallelization is well known and has been applied successfully in practice. However, these attempts focused on multi-core and multi-machine parallelization rather than on the GPU. This work explores the possibilities of parallelizing Evolutionary Algorithms on the GPU. I propose an implementation in the PyTorch library that allows EAs to be executed on both CPU and GPU. The proposed implementation provides the most common evolutionary operators for Genetic Algorithms, Real-Coded Evolutionary Algorithms, and Particle Swarm Optimization Algorithms. Finally, I show that execution on the GPU is an order of magnitude faster for medium and large problems and populations.
32	Free Wake Potential Flow Vortex Wind Turbine Modeling: Advances in Parallel Processing and Integration of Ground Effects Develder, Nathaniel B 01 January 2014 Potential flow simulations are a good engineering middle ground for modeling complex aerodynamic systems, but they quickly become computationally unwieldy for large domains. As an N-body problem with N-squared interactions to calculate, this free wake vortex model of a wind turbine is well suited to parallel computation. This thesis discusses general trends in wind turbine modeling, a potential flow model of the rotor of the NREL 5MW reference turbine, various forms of parallel computing, current GPU hardware, and the application of ground effects to the model. In the vicinity of 200,000 points, current GPU hardware was found to be nearly 17 times faster than an OpenMP 12 core CPU parallel code, and over 280 times faster than serial MATLAB code. Convergence of the solution is found to be dependent on the direction in which the grid is refined. The "no entry" condition at the ground plane is found to have a measurable but small impact on the model outputs with a periodicity driven by the blade proximity to the ground plane. The effect of the ground panel method was found to converge to that of the "method of images" for increasing ground extent and number of panels. Aerodynamics Wind Turbines Potential Flow Vortex Methods GPU Computing Aerodynamics and Fluid Mechanics Energy Systems
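To make the "N-squared interactions" in the abstract above concrete, here is a minimal sketch of an all-pairs induced-velocity loop. It is not the thesis code: it assumes 2D point vortices rather than the 3D wake model the thesis describes, and the function name `induced_velocities` and the `core` smoothing parameter are hypothetical.

```python
import math

def induced_velocities(points, strengths, core=1e-6):
    """Velocity induced at each 2D point vortex by all others: O(N^2) work."""
    n = len(points)
    velocities = []
    for i in range(n):  # each outer iteration is independent of the others
        xi, yi = points[i]
        u = v = 0.0
        for j in range(n):
            if i == j:
                continue  # a vortex does not induce velocity on itself
            dx = xi - points[j][0]
            dy = yi - points[j][1]
            r2 = dx * dx + dy * dy + core  # small core avoids division by zero
            k = strengths[j] / (2.0 * math.pi * r2)
            u -= k * dy  # 2D Biot-Savart law for a point vortex
            v += k * dx
        velocities.append((u, v))
    return velocities
```

Because no outer iteration depends on another, each evaluation point could be assigned to its own thread, which is why this kind of problem maps naturally onto GPUs or OpenMP.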
33	Scalable and Energy-Efficient SIMT Systems for Deep Learning and Data Center Microservices Mahmoud Khairy A. Abdallah (12894191) 04 July 2022 Moore’s law is dead. The physical and economic principles that enabled an exponential rise in transistors per chip have reached their breaking point. As a result, the High-Performance Computing (HPC) domain and cloud data centers are encountering significant energy, cost, and environmental hurdles that have led them to embrace custom hardware/software solutions. Single Instruction Multiple Thread (SIMT) accelerators, like Graphics Processing Units (GPUs), are compelling solutions to achieve considerable energy efficiency while still preserving programmability in the twilight of Moore’s Law. In the HPC and Deep Learning (DL) domain, the death of single-chip GPU performance scaling will usher in a renaissance in multi-chip Non-Uniform Memory Access (NUMA) scaling. Advances in silicon interposers and other inter-chip signaling technology will enable single-package systems, composed of multiple chiplets that continue to scale even as per-chip transistors do not. Given this evolving, massively parallel NUMA landscape, the placement of data on each chiplet, or discrete GPU card, and the scheduling of the threads that use that data is a critical factor in system performance and power consumption. Aside from the supercomputer space, general-purpose compute units are still the main driver of data center’s total cost of ownership (TCO). CPUs consume 60% of the total data center power budget, half of which comes from the CPU pipeline’s frontend. Coupled with the hardware efficiency crisis is an increased desire for programmer productivity, flexible scalability, and nimble software updates that have led to the rise of software microservices.
Consequently, single servers are now packed with many threads executing the same, relatively small task on different data. In this dissertation, I discuss these new paradigm shifts, addressing the following concerns: (1) how do we overcome the non-uniform memory access overhead for next-generation multi-chiplet GPUs in the era of DL-driven workloads?; (2) how can we improve the energy efficiency of data center’s CPUs in the light of microservices evolution and request similarity?; and (3) how can we study such rapidly evolving systems with accurate and extensible SIMT performance modeling? Distributed systems and algorithms Operating systems Programming languages SIMT Deep Learning Microservices Systems GPU computing Data Center Energy Efficiency
34	Arquitecturas para la computación de altas prestaciones en la nube. Aplicación a procesos de geometría computacional / Architectures for High-Performance Computing in the Cloud: Application to Computational Geometry Processes Sánchez-Ribes, Víctor 03 March 2024 Cloud computing is one of the technologies shaping today's world. Companies must therefore use this technology to remain competitive in a globalized market. Traditional manufacturing sectors (footwear, furniture, toys, among others) are characterized mainly by intensive design and manufacturing work in the production of new seasonal products. This work is carried out with 3D modeling and manufacturing software, commonly known as "CAD/CAM", which is based mainly on the application of modeling primitives and geometric computation. Computation offloading is the method used to move the processing load to the cloud. This technique brings many advantages to design and manufacturing processes: a lower upfront cost for small and medium-sized enterprises that need large computing capacity, a highly flexible infrastructure that provides adjustable computing power, the delivery of "CAD/CAM" computing services to designers around the world, and so on. However, offloading geometric computation to the cloud involves several challenges that must be overcome for the proposal to be viable. The aim of this work is to explore new ways of exploiting specialized devices and improving the capabilities of GPUs by reviewing and comparing the available parallel programming techniques, and to propose the optimal configuration of the cloud architecture and of application development to improve the degree of parallelization of specialized processing devices, serving as a basis for their greater exploitation in the cloud by small and medium-sized enterprises.
Finally, this work presents the experiments used to validate the proposal, at the level of both the communication architecture and GPU programming, and draws conclusions from this experimentation. GPU Computing GPU CUDA Cloud Offloading Cloud Computing High-Performance Processing Offloading Computation Mobile Cloud Computing Distributed Computing Quality of Service
35	Runtime specialization for heterogeneous CPU-GPU platforms Farooqui, Naila 27 May 2016 Heterogeneous parallel architectures like those comprised of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, there remain several open challenges: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput.
Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices. Dynamic instrumentation Dynamic compilation GPU computing Heterogeneous computing Profile-guided optimizations Program analysis Workload characterization Compiler Runtime Multicore CUDA OpenCL SIMD
36	Scalability of fixed-radius searching in meshless methods for heterogeneous architectures Pols, LeRoi Vincent 12 1900 Thesis (MEng)--Stellenbosch University, 2014. / ENGLISH ABSTRACT: In this thesis we set out to design an algorithm for solving the all-pairs fixed-radius nearest neighbours search problem for a massively parallel heterogeneous system. The all-pairs search problem is stated as follows: Given a set of N points in d-dimensional space, find all pairs of points within a horizon distance of one another. This search is required by any nonlocal or meshless numerical modelling method to construct the neighbour list of each mesh point in the problem domain. Therefore, this work is applicable to a wide variety of fields, ranging from molecular dynamics to pattern recognition and geographical information systems. Here we focus on nonlocal solid mechanics methods. The basic method of solving the all-pairs search is to calculate, for each mesh point, the distance to each other mesh point and compare with the horizon value to determine if the points are neighbours. This can be a very computationally intensive procedure, especially if the neighbourhood needs to be updated at every time step to account for changes in material configuration. The problem also becomes more complex if the analysis is done in parallel. Furthermore, GPU computing has become very popular in the last decade. Most of the fastest supercomputers in the world today employ GPU processors as accelerators to CPU processors. It is also believed that the next-generation exascale supercomputers will be heterogeneous. Therefore the focus is on how to develop a neighbour searching algorithm that will take advantage of next-generation hardware. In this thesis we propose a CPU - multi GPU algorithm, which is an extension of the fixed-grid method, for the fixed-radius nearest neighbours search on massively parallel systems.
/ AFRIKAANS SUMMARY (translated): In this thesis we undertook the design of an algorithm for solving the all-pairs fixed-radius nearest neighbours search problem for large-scale parallel heterogeneous systems. The all-pairs search problem is stated as follows: Given a set of N points in d-dimensional space, find all pairs of points that lie within a horizon distance of one another. The search is required by any nonlocal or meshless numerical method to obtain the neighbour list of all mesh points in the problem. This work is therefore applicable to a wide variety of fields, ranging from molecular dynamics to pattern recognition and geographical information systems. Here our focus is on nonlocal solid mechanics methods. The basic method of solving the all-pairs search is to calculate, for each mesh point, the distance to every other mesh point and compare it with the horizon length to determine whether the points are neighbours. This can be a very computationally intensive process, especially if the problem must be updated at every step to account for changes in the material configuration. The problem also becomes much more complex if the analysis is done in parallel. Furthermore, GPUs (graphics processing units) have become very popular in the last decade. Most of the fastest supercomputers in the world today use GPUs as accelerators alongside CPUs (central processing units). It is also believed that the next-generation exascale supercomputers will employ GPUs. The focus is therefore on how to develop a neighbour-list search algorithm that will make use of next-generation hardware. In this thesis we propose a CPU - multi GPU algorithm, which is an extension of the fixed-grid method, for the fixed-radius nearest neighbours search on large-scale parallel systems.
Solid mechanics Neighbour searching Meshfree methods (Numerical analysis) GPU computing Theses -- Civil engineering Dissertations -- Civil engineering Parallel algorithms Heterogeneous computing UCTD
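The "basic method" described in the abstract above, and the fixed-grid idea it is contrasted with, can be sketched on a single CPU as follows. This is an illustrative sketch only: it is not the CPU - multi GPU algorithm proposed in the thesis, and the function names and the choice of cell edge equal to the horizon are assumptions.

```python
from collections import defaultdict
from itertools import product
import math

def brute_force_pairs(points, horizon):
    """Basic method: compare every pair of points, O(N^2) distance checks."""
    pairs = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= horizon:
                pairs.add((i, j))
    return pairs

def fixed_grid_pairs(points, horizon):
    """Fixed-grid method: bin points into cells, compare only adjacent cells."""
    dim = len(points[0])
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        # Cell edge equals the horizon, so neighbours are at most one cell away.
        key = tuple(math.floor(c / horizon) for c in p)
        cells[key].append(idx)
    offsets = list(product((-1, 0, 1), repeat=dim))
    pairs = set()
    for key, members in cells.items():
        for off in offsets:
            neighbour_key = tuple(k + o for k, o in zip(key, off))
            for i in members:
                for j in cells.get(neighbour_key, ()):
                    if i < j and math.dist(points[i], points[j]) <= horizon:
                        pairs.add((i, j))
    return pairs
```

Both functions return the same set of index pairs; the grid version simply skips distance checks between points that cannot be neighbours, which matters when the neighbour list has to be rebuilt at every time step.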
37	Proteins, anatomy and networks of the fruit fly brain Knowles-Barley, Seymour Francis January 2012 Our understanding of the complexity of the brain is limited by the data we can collect and analyze. Because of experimental limitations and a desire for greater detail, most investigations focus on just one aspect of the brain. For example, brain function can be studied at many levels of abstraction including, but not limited to, gene expression, protein interactions, anatomical regions, neuronal connectivity, synaptic plasticity, and the electrical activity of neurons. By focusing on each of these levels, neuroscience has built up a detailed picture of how the brain works, but each level is understood mostly in isolation from the others. It is likely that interaction between all these levels is just as important. Therefore, a key hypothesis is that functional units spanning multiple levels of biological organization exist in the brain. This project attempted to combine neuronal circuitry analysis with functional proteomics and anatomical regions of the brain to explore this hypothesis, and took an evolutionary view of the results obtained. During the process we had to solve a number of technical challenges as the tools to undertake this type of research did not exist. Two informatics challenges for this research were to develop ways to analyze neurobiological data, such as brain protein expression patterns, to extract useful information, and to share and present this data in a way that is fast and easy for anyone to access. This project contributes towards a more holistic understanding of the fruit fly brain in three ways. Firstly, a screen was conducted to record the expression of proteins in the brain of the fruit fly, Drosophila melanogaster. Protein expression patterns in the fruit fly brain were recorded from 535 protein trap lines using confocal microscopy.
A total of 884 3D images were annotated and made available on an easy-to-use website database, BrainTrap, available at fruitfly.inf.ed.ac.uk/braintrap. The website allows 3D images of the protein expression to be viewed interactively in the web browser, and an ontology-based search tool allows users to search for protein expression patterns in specific areas of interest. Different expression patterns mapped to a common template can be viewed simultaneously in multiple colours. This data bridges the gap between anatomical and biomolecular levels of understanding. Secondly, protein trap expression patterns were used to investigate the properties of the fruit fly brain. Thousands of protein-protein interactions have been recorded by methods such as yeast two-hybrid; however, many of these protein pairs do not express in the same regions of the fruit fly brain. Using 535 protein expression patterns it was possible to rule out 149 protein-protein interactions. Also, protein expression patterns registered against a common template brain were used to produce new anatomical breakdowns of the fruit fly brain. Clustering techniques were able to naturally segment brain regions based only on the protein expression data. This is just one example of how, by combining proteomics with anatomy, we were able to learn more about both levels of understanding. Results are analysed further in combination with networks such as genetic homology networks, and connectivity networks. We show how the wealth of biological and neuroscience data now available in public databases can be combined with the BrainTrap data to reveal similarities between areas of the fruit fly and mammalian brain.
The BrainTrap data also informs us on the process of evolution and we show that genes found in fruit fly, yeast and mouse are more likely to be generally expressed throughout the brain, whereas genes found only in fruit fly and mouse, but not yeast, are more likely to have a specific expression pattern in the fruit fly brain. Thus, by combining data from multiple sources we can gain further insight into the complexity of the brain. Neural connectivity data is also analyzed and a new technique for enhanced motifs is developed for the combined analysis of connectivity data with other information such as neuron type data and potentially protein expression data. Thirdly, I investigated techniques for imaging the protein trap lines at higher resolution using electron microscopy (EM) and developed new informatics techniques for the automated analysis of neural connectivity data collected from serial section transmission electron microscopy (ssTEM). Measurement of the connectivity between neurons requires high resolution imaging techniques, such as electron microscopy, and images produced by this method are currently annotated manually to produce very detailed maps of cell morphology and connectivity. This is an extremely time consuming process and the volume of tissue and number of neurons that can be reconstructed is severely limited by the annotation step. I developed a set of computer vision algorithms to improve the alignment between consecutive images, and to perform partial annotation automatically by detecting membrane, synapses and mitochondria present in the images. Accuracy of the automatic annotation was evaluated on a small dataset and 96% of membrane could be identified at the cost of 13% false positives. This research demonstrates that informatics technology can help us to automatically analyze biological images and bring together genetic, anatomical, and connectivity data in a meaningful way. 
This combination of multiple data sources reveals more detail about each individual level of understanding, and gives us a more holistic view of the fruit fly brain. 578.012
38	Parallel Sorting on the Heterogeneous AMD Fusion Accelerated Processing Unit Delorme, Michael Christopher 18 March 2013 We explore efficient parallel radix sort for the AMD Fusion Accelerated Processing Unit (APU). Two challenges arise: efficiently partitioning data between the CPU and GPU, and allocating data in memory regions. Our coarse-grained implementation utilizes both the GPU and CPU by sharing data at the beginning and end of the sort. Our fine-grained implementation utilizes the APU’s integrated memory system to share data throughout the sort. Both these implementations outperform the current state-of-the-art GPU radix sort from NVIDIA. We therefore demonstrate that the CPU can be efficiently used to speed up radix sort on the APU. Our fine-grained implementation slightly outperforms our coarse-grained implementation. This demonstrates the benefit of the APU’s integrated architecture. This performance benefit is hindered by limitations in the APU’s architecture and programming model. We believe that the performance benefits will increase once these limitations are addressed in future generations of the APU. Parallel sorting Radix sort Heterogeneous computing GPU GPGPU AMD Fusion Llano APU Accelerated Processing Unit OpenCL Fusion Sort GPU computing 0984
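As a rough illustration of the coarse-grained idea in the abstract above (data shared only at the beginning and end of the sort), the sketch below splits the keys once, radix-sorts each part independently, and merges the results. It is a sequential stand-in, not the thesis implementation: the function names, the `gpu_fraction` split ratio, and the final merge step are assumptions, and both parts run on the CPU here.

```python
from heapq import merge

def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits."""
    bucket_count = 1 << bits_per_pass
    mask = bucket_count - 1
    for shift in range(0, key_bits, bits_per_pass):
        buckets = [[] for _ in range(bucket_count)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)  # stable scatter by digit
        keys = [k for bucket in buckets for k in bucket]
    return keys

def coarse_grained_sort(keys, gpu_fraction=0.75):
    """Partition once, sort each part independently, combine at the end."""
    split = int(len(keys) * gpu_fraction)
    gpu_part = radix_sort(keys[:split])  # stand-in for the part sorted on the GPU
    cpu_part = radix_sort(keys[split:])  # stand-in for the part sorted on the CPU
    return list(merge(gpu_part, cpu_part))
```

The split ratio is the tuning knob in a scheme like this: it decides how much work each device receives, which corresponds to the data-partitioning challenge the abstract mentions.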
39	A model of dynamic compilation for heterogeneous compute platforms Kerr, Andrew 10 December 2012 Trends in computer engineering place renewed emphasis on increasing parallelism and heterogeneity. The rise of parallelism adds an additional dimension to the challenge of portability, as different processors support different notions of parallelism, whether vector parallelism executing in a few threads on multicore CPUs or large-scale thread hierarchies on GPUs. Thus, software experiences obstacles to portability and efficient execution beyond differences in instruction sets; rather, the underlying execution models of radically different architectures may not be compatible. Dynamic compilation applied to data-parallel heterogeneous architectures presents an abstraction layer decoupling program representations from optimized binaries, thus enabling portability without encumbering performance. This dissertation proposes several techniques that extend dynamic compilation to data-parallel execution models. These contributions include: characterization of data-parallel workloads; machine-independent application metrics; a framework for performance modeling and prediction; execution model translation for vector processors; and region-based compilation and scheduling. We evaluate these claims via the development of a novel dynamic compilation framework, GPU Ocelot, with which we execute real-world workloads from GPU computing. This enables GPU computing workloads to run efficiently on multicore CPUs, GPUs, and a functional simulator. We show data-parallel workloads exhibit performance scaling, take advantage of vector instruction set extensions, and effectively exploit data locality via scheduling which attempts to maximize control locality. Dynamic compilation GPU computing CUDA OpenCL SIMD Vector Multicore Parallel computing Parallel computers Parallel programs (Computer programs) Heterogeneous computing High performance computing
