31 |
Paralelizace evolučních algoritmů pomocí GPU / GPU Parallelization of Evolutionary Algorithms. Valkovič, Patrik. January 2021.
Graphical Processing Units are behind the success of Artificial Neural Networks over the past decade and their broader application in industry. Another promising field of Artificial Intelligence is Evolutionary Algorithms. Their ability to parallelize is well known and has been successfully applied in practice. However, these attempts focused on multi-core and multi-machine parallelization rather than on the GPU. This work explores the possibilities of parallelizing Evolutionary Algorithms on the GPU. I propose an implementation in the PyTorch library that allows EAs to execute on both CPU and GPU. The proposed implementation provides the most common evolutionary operators for Genetic Algorithms, Real-Coded Evolutionary Algorithms, and Particle Swarm Optimization algorithms. Finally, I show that performance is an order of magnitude faster on the GPU for medium and large problems and populations.
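As an illustration of the operator style such a tensor-library implementation suggests, here is a minimal sketch (my own, not the thesis code) of tournament selection and uniform crossover as batched PyTorch operations; the `device` switch is the only difference between CPU and GPU execution.

```python
import torch

def tournament_selection(fitness: torch.Tensor, k: int = 2) -> torch.Tensor:
    # Each of the pop parents is the fittest of k randomly drawn contenders.
    pop = fitness.shape[0]
    contenders = torch.randint(0, pop, (pop, k), device=fitness.device)
    best = fitness[contenders].argmax(dim=1)              # winner's slot in each row
    return contenders.gather(1, best.unsqueeze(1)).squeeze(1)

def uniform_crossover(a: torch.Tensor, b: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    # Gene-wise mix of two parent batches with swap probability p.
    mask = torch.rand_like(a) < p
    return torch.where(mask, a, b)

device = "cuda" if torch.cuda.is_available() else "cpu"
population = torch.rand(1024, 64, device=device)          # 1024 individuals, 64 genes
fitness = -(population ** 2).sum(dim=1)                   # toy objective: sphere function
parents = population[tournament_selection(fitness)]
offspring = uniform_crossover(parents, parents.roll(1, dims=0))
```

Because every operator is a batched tensor expression, the whole population evolves without Python-level loops, which is what makes the GPU speedup possible.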
|
32 |
Free Wake Potential Flow Vortex Wind Turbine Modeling: Advances in Parallel Processing and Integration of Ground Effects. Develder, Nathaniel B. 01 January 2014.
Potential flow simulations are a practical, middle-ground engineering approach to modeling complex aerodynamic systems, but they quickly become computationally unwieldy for large domains. As an N-body problem with N-squared interactions to calculate, this free wake vortex model of a wind turbine is well suited to parallel computation. This thesis discusses general trends in wind turbine modeling, a potential flow model of the rotor of the NREL 5MW reference turbine, various forms of parallel computing, current GPU hardware, and the application of ground effects to the model. At around 200,000 points, current GPU hardware was found to be nearly 17 times faster than a 12-core OpenMP CPU parallel code, and over 280 times faster than serial MATLAB code. Convergence of the solution is found to depend on the direction in which the grid is refined. The "no entry" condition at the ground plane is found to have a measurable but small impact on the model outputs, with a periodicity driven by the blade's proximity to the ground plane. The effect of the ground panel method was found to converge to that of the "method of images" as the ground extent and number of panels increase.
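To make the N-squared structure concrete, here is a small illustrative sketch (not the thesis model, which is a 3D free-wake method): the all-pairs induced-velocity computation for 2D point vortices, written with array broadcasting so each pair maps naturally to a GPU thread.

```python
import numpy as np

def induced_velocity(points, vortices, gammas, core=1e-3):
    # All-pairs (N x M) induced velocity from 2D point vortices, vectorized
    # via broadcasting; the same pattern maps onto one GPU thread per pair.
    dx = points[:, None, 0] - vortices[None, :, 0]
    dy = points[:, None, 1] - vortices[None, :, 1]
    r2 = dx**2 + dy**2 + core**2                  # softened core avoids division by zero
    u = (-gammas[None, :] * dy / (2 * np.pi * r2)).sum(axis=1)
    v = ( gammas[None, :] * dx / (2 * np.pi * r2)).sum(axis=1)
    return u, v

pts = np.random.rand(1000, 2)
u, v = induced_velocity(pts, pts, np.ones(1000))  # self-interaction of 1000 particles
```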
|
33 |
Scalable and Energy-Efficient SIMT Systems for Deep Learning and Data Center Microservices. Mahmoud Khairy A. Abdallah. 04 July 2022.
Moore’s law is dead. The physical and economic principles that enabled an exponential rise in transistors per chip have reached their breaking point. As a result, the High-Performance Computing (HPC) domain and cloud data centers are encountering significant energy, cost, and environmental hurdles that have led them to embrace custom hardware/software solutions. Single Instruction Multiple Thread (SIMT) accelerators, like Graphics Processing Units (GPUs), are compelling solutions for achieving considerable energy efficiency while still preserving programmability in the twilight of Moore’s Law.
In the HPC and Deep Learning (DL) domain, the death of single-chip GPU performance scaling will usher in a renaissance of multi-chip Non-Uniform Memory Access (NUMA) scaling. Advances in silicon interposers and other inter-chip signaling technology will enable single-package systems composed of multiple chiplets that continue to scale even as per-chip transistor counts do not. Given this evolving, massively parallel NUMA landscape, the placement of data on each chiplet, or discrete GPU card, and the scheduling of the threads that use that data are critical factors in system performance and power consumption.
Aside from the supercomputer space, general-purpose compute units are still the main driver of a data center's total cost of ownership (TCO). CPUs consume 60% of the total data center power budget, half of which comes from the CPU pipeline's frontend. Coupled with the hardware efficiency crisis is an increased desire for programmer productivity, flexible scalability, and nimble software updates, which have led to the rise of software microservices. Consequently, single servers are now packed with many threads executing the same, relatively small task on different data.
In this dissertation, I discuss these new paradigm shifts, addressing the following concerns: (1) how do we overcome the non-uniform memory access overhead for next-generation multi-chiplet GPUs in the era of DL-driven workloads? (2) how can we improve the energy efficiency of data center CPUs in light of the evolution of microservices and request similarity? and (3) how can we study such rapidly evolving systems with accurate and extensible SIMT performance modeling?
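To make the chiplet data-placement concern concrete, here is a hypothetical toy sketch (the names and data structures are illustrative, not from the dissertation): schedule each thread block on the chiplet that already homes most of the memory pages it touches, minimizing inter-chiplet NUMA traffic.

```python
from collections import Counter

def place_blocks(block_pages, page_home):
    # block_pages: {block_id: [page_id, ...]} pages each thread block touches.
    # page_home:   {page_id: chiplet_id} where each page currently resides.
    placement = {}
    for block, pages in block_pages.items():
        votes = Counter(page_home[p] for p in pages)
        placement[block] = votes.most_common(1)[0][0]   # chiplet holding most of its data
    return placement

print(place_blocks({0: [1, 1, 2], 1: [3]},
                   {1: "chiplet0", 2: "chiplet1", 3: "chiplet1"}))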
|
34 |
Solving continuous reaction-diffusion models in image-based complex geometries. Stark, Justina. 06 November 2024.
Porous media, including soil, catalysts, rocks, and organic tissue, are ubiquitous in nature, acting as complex environments through which heat, ions, and chemicals travel. Diffusion, often coupled to interfacial reactions, constitutes a fundamental transport process in porous media. It plays an important role in the transport of fertilizer and contaminants in soil, heat conduction in insulators, and natural phenomena such as geological rock transformations and biological signaling and patterning. This thesis aims to enable a deeper understanding of reaction-diffusion processes in porous media by developing a flexible and computationally efficient numerical modeling and simulation workflow.
Numerical modeling is required whenever the problem is too complex for mechanistic insight by quantitative experiments or analytical theory. Reaction-diffusion processes in porous media are such a complex problem, as transport is coupled to the intricate pore geometry. In addition, they involve different scales, from microscale tortuous diffusion pathways and local reactions to macroscale gradients, requiring models that resolve multiple scales.
Multiscale modeling is, however, challenging due to its large memory requirement and computational cost. In addition, realistic porous media geometries, as can be derived from microscopy images or µCTs, are not parametrizable, requiring algorithmic representation.
We address these issues by developing a scalable, multi-GPU accelerated numerical simulation pipeline that enables memory-efficient multiscale modeling of reaction-diffusion processes in realistic, image-based geometries. This pipeline takes volumetric images as input, from which it derives implicit geometry representations using the level-set method. The diffusion domain is discretized in a geometry-adapted, memory-efficient way using distributed sparse block grids. Reaction-diffusion PDEs are solved in the strong form using the finite difference method with scalable multi-GPU acceleration, enabling the simulation in large, highly resolved 3D samples.
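As a rough illustration of the numerical core, here is a minimal dense-grid sketch (assumed, not from the thesis, which uses sparse block grids and multi-GPU kernels) of one explicit finite-difference step of a reaction-diffusion equation restricted to an image-derived pore mask.

```python
import numpy as np

def rd_step(u, mask, D=1.0, k=0.1, dt=0.1, dx=1.0):
    # 6-point finite-difference Laplacian on a regular 3D grid (np.roll wraps
    # around, so real boundary handling is elided in this sketch).
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
           np.roll(u, 1, 1) + np.roll(u, -1, 1) +
           np.roll(u, 1, 2) + np.roll(u, -1, 2) - 6 * u) / dx**2
    u_new = u + dt * (D * lap - k * u)            # diffusion plus first-order decay
    return np.where(mask, u_new, u)               # evolve pore voxels only

mask = np.random.rand(64, 64, 64) > 0.5           # stand-in for an image-derived domain
u = np.where(mask, np.random.rand(64, 64, 64), 0.0)
for _ in range(100):
    u = rd_step(u, mask)
```

A dense array like this stores every voxel, pore or not; the thesis's sparse block grids allocate only the blocks that intersect the pore space, which is where the reported 18-fold memory reduction comes from.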
We demonstrate the versatility of the present pipeline by simulating reaction-diffusion processes in the image-derived 3D geometries of four applications: fertilizer diffusion in soil, heat conduction with surface dissipation in reticulate porous ceramics, fluid-mediated mineral replacement in rocks, and morphogen gradient formation in the extracellular space of a gastrulating zebrafish embryo. The former two are used to benchmark the performance of our pipeline, whereas the latter two address real-world problems from geology and biology, respectively.
The geological problem considers a process called dolomitization, which converts calcite into dolomite. Determining the geophysical characteristics of the Earth's most abundant rocks, dolomitization plays an important role in engineering and geology. Predicting dolomitization is hampered by the extreme scales involved, as mountain-scale dolomite is produced by ion-scale reactions over millions of years. Using the presented pipeline, we derive rock geometries from µCTs and simulate dolomitization as an inhomogeneous reaction-diffusion process with moving reaction fronts and phase-dependent diffusion. The simulation results show that reaction and diffusion alone are not sufficient to explain the reaction-front roughness observed experimentally, implying that other processes, such as advection or porosity fingering, or sub-resolution geometric features, such as microcracks in the rock, play an important role in dolomitization.
The biological problem, which constitutes the main application of this thesis, is the formation of morphogen gradients during embryonic development. This is a particularly complex problem influenced by several factors, such as dynamically changing tissue geometries, localized sources and sinks, and interaction with molecules of the extracellular matrix (e.g., HSPG). The abundance of factors involved and the coupling between them makes it difficult to quantify how they modulate the gradient individually and collectively.
We use the present pipeline to reconstruct realistic extracellular space (ECS) geometries of a zebrafish embryo from a light-sheet microscopy video. In these geometries, we simulate the gradient formation of the morphogen Fgf8a, showing for the first time in realistic embryo geometries that a source-diffusion-degradation mechanism with HSPG binding is sufficient for the spontaneous formation and maintenance of robust long-range morphogen gradients. We further test gradient sensitivity against different source, sink, and HSPG-binding rates and show that the gradient becomes distorted when ECS volume or connectivity in the model changes, demonstrating the importance of considering realistic embryo geometries.
In summary, this thesis shows that modeling highly resolved, realistic 3D geometries is computationally feasible using geometry-adapted sparse grids, achieving an 18-fold reduction in memory requirements for the zebrafish model compared to a dense-grid implementation. Multi-CPU/GPU acceleration enables pore-scale simulation of large systems. The pipeline developed in this thesis is fully open-source and versatile, as demonstrated by its application to different kinds of porous media, and we anticipate its future application to other reaction-diffusion problems in porous media, in particular from biology.
|
35 |
Arquitecturas para la computación de altas prestaciones en la nube. Aplicación a procesos de geometría computacional / Architectures for high-performance computing in the cloud: application to computational geometry processes. Sánchez-Ribes, Víctor. 03 March 2024.
Cloud computing is one of the technologies shaping today's world, and companies must use it to remain competitive in a globalized market. Traditional manufacturing sectors (footwear, furniture, and toys, among others) are characterized mainly by design-intensive and manufacturing-intensive work in the production of new seasonal products. This work is carried out with 3D modeling and manufacturing software, commonly known as CAD/CAM, which is based primarily on the application of modeling primitives and geometric computation. Computation offloading is the method used to move the processing load to the cloud. This technique brings many advantages to design and manufacturing processes: reduced up-front cost for small and medium-sized enterprises that need large computing capacity, a highly flexible infrastructure providing adjustable computing power, delivery of CAD/CAM computing services to designers worldwide, and so on. However, offloading geometric computation to the cloud raises several challenges that must be overcome for the proposal to be viable. The aim of this work is to explore new ways of exploiting specialized devices and improving GPU capabilities by reviewing and comparing the available parallel programming techniques, and to propose the optimal cloud architecture configuration and application designs that improve the degree of parallelization of specialized processing devices, serving as a basis for their wider exploitation in the cloud by small and medium-sized enterprises. Finally, this work presents the experiments used to validate the proposal, both at the level of the communication architecture and of GPU programming, and draws conclusions from this experimentation.
|
36 |
Runtime specialization for heterogeneous CPU-GPU platforms. Farooqui, Naila. 27 May 2016.
Heterogeneous parallel architectures like those comprised of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, several challenges remain open: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match workload characteristics to the underlying resources; (ii) the complex architectures and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance; and (iii) as such platforms become prevalent, there is a need to extend their utility from running known, regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices.
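As one concrete illustration of profile-driven matching of workloads to devices, here is a minimal, hypothetical sketch (not the dissertation's system, which instruments GPU kernels directly): time a small sample of the workload on each available device and dispatch the full run to whichever was faster.

```python
import time
import torch

def pick_device(kernel, sample):
    # Run a small slice of the workload on each available device and keep
    # whichever finished faster; real systems profile far more dimensions.
    timings = {}
    devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
    for dev in devices:
        x = sample.to(dev)
        kernel(x)                                  # warm-up: transfers, lazy init
        if dev == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        kernel(x)
        if dev == "cuda":
            torch.cuda.synchronize()               # GPU kernels launch asynchronously
        timings[dev] = time.perf_counter() - t0
    return min(timings, key=timings.get)

device = pick_device(lambda x: torch.relu(x @ x.T), torch.rand(512, 512))
```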
|
37 |
Scalability of fixed-radius searching in meshless methods for heterogeneous architectures. Pols, LeRoi Vincent. December 2014.
Thesis (MEng)--Stellenbosch University, 2014.
ENGLISH ABSTRACT: In this thesis we set out to design an algorithm for solving the all-pairs fixed-radius nearest neighbours search problem for a massively parallel heterogeneous system. The all-pairs search problem is stated as follows: given a set of N points in d-dimensional space, find all pairs of points within a horizon distance of one another. This search is required by any nonlocal or meshless numerical modelling method to construct the neighbour list of each mesh point in the problem domain. This work is therefore applicable to a wide variety of fields, ranging from molecular dynamics to pattern recognition and geographical information systems. Here we focus on nonlocal solid mechanics methods.
The basic method of solving the all-pairs search is to calculate, for each mesh point, the distance to every other mesh point and compare it with the horizon value to determine whether the points are neighbours. This can be a very computationally intensive procedure, especially if the neighbourhood needs to be updated at every time step to account for changes in material configuration. The problem also becomes more complex if the analysis is done in parallel.
Furthermore, GPU computing has become very popular in the last decade. Most of the fastest supercomputers in the world today employ GPU processors as accelerators to CPU processors, and it is expected that next-generation exascale supercomputers will be heterogeneous. The focus is therefore on how to develop a neighbour searching algorithm that will take advantage of next-generation hardware.
In this thesis we propose a CPU/multi-GPU algorithm, an extension of the fixed-grid method, for the fixed-radius nearest neighbours search on massively parallel systems.
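For context, a serial sketch of the fixed-grid (cell-list) idea that the proposed algorithm extends (illustrative code, not from the thesis): bin points into cells of side equal to the horizon, then test each point only against points in its own and adjacent cells rather than all N points.

```python
import itertools
from collections import defaultdict
import numpy as np

def fixed_radius_pairs(points, horizon):
    # Bin points into cells of side `horizon`; candidate neighbours can then
    # only lie in a point's own cell or one of the adjacent cells.
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple((p // horizon).astype(int))].append(i)
    pairs = []
    for cell, members in cells.items():
        for offset in itertools.product((-1, 0, 1), repeat=points.shape[1]):
            neigh = cells.get(tuple(c + o for c, o in zip(cell, offset)), ())
            for i in members:
                for j in neigh:
                    # i < j avoids emitting each pair twice.
                    if i < j and ((points[i] - points[j]) ** 2).sum() <= horizon**2:
                        pairs.append((i, j))
    return pairs

pts = np.random.rand(2000, 3)
print(len(fixed_radius_pairs(pts, 0.1)))
```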
|
38 |
Proteins, anatomy and networks of the fruit fly brain. Knowles-Barley, Seymour Francis. January 2012.
Our understanding of the complexity of the brain is limited by the data we can collect and analyze. Because of experimental limitations and a desire for greater detail, most investigations focus on just one aspect of the brain. For example, brain function can be studied at many levels of abstraction including, but not limited to, gene expression, protein interactions, anatomical regions, neuronal connectivity, synaptic plasticity, and the electrical activity of neurons. By focusing on each of these levels, neuroscience has built up a detailed picture of how the brain works, but each level is understood mostly in isolation from the others. It is likely that interaction between all these levels is just as important. Therefore, a key hypothesis is that functional units spanning multiple levels of biological organization exist in the brain. This project attempted to combine neuronal circuitry analysis with functional proteomics and anatomical regions of the brain to explore this hypothesis, and took an evolutionary view of the results obtained. During the process we had to solve a number of technical challenges, as the tools to undertake this type of research did not exist. Two informatics challenges for this research were to develop ways to analyze neurobiological data, such as brain protein expression patterns, to extract useful information, and to share and present these data in a way that is fast and easy for anyone to access. This project contributes towards a more holistic understanding of the fruit fly brain in three ways.
Firstly, a screen was conducted to record the expression of proteins in the brain of the fruit fly, Drosophila melanogaster. Protein expression patterns in the fruit fly brain were recorded from 535 protein trap lines using confocal microscopy. A total of 884 3D images were annotated and made available on an easy-to-use website database, BrainTrap, available at fruitfly.inf.ed.ac.uk/braintrap. The website allows 3D images of the protein expression to be viewed interactively in the web browser, and an ontology-based search tool allows users to search for protein expression patterns in specific areas of interest. Different expression patterns mapped to a common template can be viewed simultaneously in multiple colours. These data bridge the gap between the anatomical and biomolecular levels of understanding.
Secondly, protein trap expression patterns were used to investigate the properties of the fruit fly brain. Thousands of protein-protein interactions have been recorded by methods such as yeast two-hybrid; however, many of these protein pairs are not expressed in the same regions of the fruit fly brain. Using 535 protein expression patterns it was possible to rule out 149 protein-protein interactions. Also, protein expression patterns registered against a common template brain were used to produce new anatomical breakdowns of the fruit fly brain: clustering techniques were able to naturally segment brain regions based only on the protein expression data. This is just one example of how, by combining proteomics with anatomy, we were able to learn more about both levels of understanding. Results are analyzed further in combination with networks such as genetic homology networks and connectivity networks. We show how the wealth of biological and neuroscience data now available in public databases can be combined with the BrainTrap data to reveal similarities between areas of the fruit fly and mammalian brain. The BrainTrap data also inform us about the process of evolution: we show that genes found in fruit fly, yeast, and mouse are more likely to be generally expressed throughout the brain, whereas genes found only in fruit fly and mouse, but not yeast, are more likely to have a specific expression pattern in the fruit fly brain. Thus, by combining data from multiple sources we can gain further insight into the complexity of the brain. Neural connectivity data are also analyzed, and a new technique for enhanced motifs is developed for the combined analysis of connectivity data with other information such as neuron type data and, potentially, protein expression data.
Thirdly, I investigated techniques for imaging the protein trap lines at higher resolution using electron microscopy (EM) and developed new informatics techniques for the automated analysis of neural connectivity data collected from serial section transmission electron microscopy (ssTEM). Measurement of the connectivity between neurons requires high-resolution imaging techniques, such as electron microscopy, and images produced by this method are currently annotated manually to produce very detailed maps of cell morphology and connectivity. This is an extremely time-consuming process, and the volume of tissue and number of neurons that can be reconstructed are severely limited by the annotation step. I developed a set of computer vision algorithms to improve the alignment between consecutive images and to perform partial annotation automatically by detecting the membranes, synapses, and mitochondria present in the images. Accuracy of the automatic annotation was evaluated on a small dataset: 96% of membrane could be identified at the cost of 13% false positives.
This research demonstrates that informatics technology can help us automatically analyze biological images and bring together genetic, anatomical, and connectivity data in a meaningful way. This combination of multiple data sources reveals more detail about each individual level of understanding and gives us a more holistic view of the fruit fly brain.
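As an illustration of the interaction-filtering step described above, a hedged sketch (the function and thresholds are hypothetical; real inputs would be registered, thresholded BrainTrap volumes): two proteins whose expression masks never overlap on the common template cannot interact in the imaged brain regions.

```python
import numpy as np

def interaction_possible(expr_a, expr_b, min_voxels=1):
    # expr_a, expr_b: boolean 3D expression masks registered to one template.
    # If the proteins are never co-expressed, a reported physical interaction
    # cannot occur anywhere in the imaged tissue.
    return np.count_nonzero(expr_a & expr_b) >= min_voxels

rng = np.random.default_rng(0)
a = rng.random((128, 128, 64)) > 0.95     # placeholder masks standing in for
b = rng.random((128, 128, 64)) > 0.95     # thresholded BrainTrap volumes
print(interaction_possible(a, b))
```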
|
39 |
Parallel Sorting on the Heterogeneous AMD Fusion Accelerated Processing Unit. Delorme, Michael Christopher. 18 March 2013.
We explore efficient parallel radix sort for the AMD Fusion Accelerated Processing Unit (APU). Two challenges arise: efficiently partitioning data between the CPU and GPU, and allocating data among memory regions. Our coarse-grained implementation utilizes both the GPU and CPU by sharing data at the beginning and end of the sort. Our fine-grained implementation utilizes the APU's integrated memory system to share data throughout the sort. Both implementations outperform the current state-of-the-art GPU radix sort from NVIDIA. We therefore demonstrate that the CPU can be used efficiently to speed up radix sort on the APU.
Our fine-grained implementation slightly outperforms our coarse-grained implementation. This demonstrates the benefit of the APU’s integrated architecture. This performance benefit is hindered by limitations in the APU’s architecture and programming model. We believe that the performance benefits will increase once these limitations are addressed in future generations of the APU.
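For readers unfamiliar with the underlying algorithm, a serial sketch of least-significant-digit radix sort (illustrative only; the thesis implementations parallelize the per-pass histogram and scatter steps across the APU's CPU and GPU):

```python
def radix_sort(keys, bits=8):
    # LSD radix sort over non-negative integers, `bits` bits per pass.
    mask = (1 << bits) - 1
    shift = 0
    while any(k >> shift for k in keys):               # stop once all digits consumed
        buckets = [[] for _ in range(1 << bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)     # scatter by current digit
        keys = [k for b in buckets for k in b]          # stable gather preserves order
        shift += bits
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))   # [2, 24, 45, 66, 75, 90, 170, 802]
```

Because each pass is a histogram followed by a stable scatter, the passes decompose naturally into data-parallel kernels, which is what makes the CPU/GPU split described above possible.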
|
40 |
A model of dynamic compilation for heterogeneous compute platforms. Kerr, Andrew. 10 December 2012.
Trends in computer engineering place renewed emphasis on increasing parallelism and heterogeneity. The rise of parallelism adds an additional dimension to the challenge of portability, as different processors support different notions of parallelism, whether vector parallelism executing in a few threads on multicore CPUs or large-scale thread hierarchies on GPUs. Thus, software experiences obstacles to portability and efficient execution beyond differences in instruction sets; rather, the underlying execution models of radically different architectures may not be compatible. Dynamic compilation applied to data-parallel heterogeneous architectures presents an abstraction layer decoupling program representations from optimized binaries, thus enabling portability without encumbering performance. This dissertation proposes several techniques that extend dynamic compilation to data-parallel execution models. These contributions include:
- characterization of data-parallel workloads
- machine-independent application metrics
- a framework for performance modeling and prediction
- execution model translation for vector processors
- region-based compilation and scheduling
We evaluate these claims via the development of a novel dynamic compilation framework, GPU Ocelot, with which we execute real-world GPU computing workloads. This enables GPU computing workloads to run efficiently on multicore CPUs, GPUs, and a functional simulator. We show that data-parallel workloads exhibit performance scaling, take advantage of vector instruction set extensions, and effectively exploit data locality via scheduling that attempts to maximize control locality.
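A toy illustration of the execution model translation idea (my sketch, not Ocelot's actual PTX translation, which fuses and vectorizes thread loops): a CUDA-style kernel written against a thread hierarchy can be executed on a CPU by iterating the logical thread grid.

```python
def launch(kernel, grid, block, *args):
    # Serialize a GPU-style <grid, block> launch into nested CPU loops; a real
    # translator fuses these loops and maps them onto SIMD lanes and cores.
    for block_id in range(grid):
        for thread_id in range(block):
            kernel(block_id, thread_id, block, *args)

def saxpy(block_id, thread_id, block_dim, a, x, y, out):
    i = block_id * block_dim + thread_id       # global thread index
    if i < len(x):                             # bounds guard, as on the GPU
        out[i] = a * x[i] + y[i]

x, y, out = [1.0] * 10, [2.0] * 10, [0.0] * 10
launch(saxpy, 3, 4, 2.0, x, y, out)            # 12 logical threads cover 10 elements
```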
|