  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
31

Paralelizace evolučních algoritmů pomocí GPU / GPU Parallelization of Evolutionary Algorithms

Valkovič, Patrik January 2021
Graphical Processing Units underpin the success of Artificial Neural Networks over the past decade and their broader application in industry. Another promising field of Artificial Intelligence is Evolutionary Algorithms. Their suitability for parallelization is well known and has been successfully exploited in practice. However, these attempts focused on multi-core and multi-machine parallelization rather than on the GPU. This work explores the possibilities of parallelizing Evolutionary Algorithms on the GPU. I propose an implementation in the PyTorch library that allows EAs to be executed on both CPU and GPU. The proposed implementation provides the most common evolutionary operators for Genetic Algorithms, Real-Coded Evolutionary Algorithms, and Particle Swarm Optimization Algorithms. Finally, I show that performance is an order of magnitude faster on the GPU for medium and large problems and populations.
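To make the operators named in this abstract concrete, the following is a minimal sketch of a generational genetic algorithm on bit strings. It is not the thesis implementation: the thesis uses PyTorch, whereas this sketch uses only the Python standard library, and the function name `evolve` and all of its parameters are assumptions chosen for illustration. The fitness evaluation of each individual is independent of the others, which is the property that makes population-based methods amenable to parallel hardware.

```python
import random


def evolve(fitness, n_bits=20, pop_size=40, generations=60, p_mut=0.02, seed=0):
    """Minimal generational genetic algorithm on bit strings (illustrative only)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        # Evaluation: every individual is scored independently of the others.
        scores = [fitness(ind) for ind in pop]
        elite = pop[max(range(pop_size), key=scores.__getitem__)]
        best_history.append(max(scores))

        def tournament():
            # Binary tournament selection: the fitter of two random individuals wins.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if scores[a] >= scores[b] else pop[b]

        children = [elite[:]]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each bit flips with probability p_mut.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness), best_history


# Usage: maximise the number of ones in the bit string (the OneMax toy problem).
best, history = evolve(sum)
```

Because of elitism, the best fitness recorded in `history` never decreases from one generation to the next.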
32

Free Wake Potential Flow Vortex Wind Turbine Modeling: Advances in Parallel Processing and Integration of Ground Effects

Develder, Nathaniel B 01 January 2014
Potential flow simulations are a useful engineering middle ground for modeling complex aerodynamic systems, but they quickly become computationally unwieldy for large domains. Because it is an N-body problem with N-squared interactions to calculate, this free wake vortex model of a wind turbine is well suited to parallel computation. This thesis discusses general trends in wind turbine modeling, a potential flow model of the rotor of the NREL 5MW reference turbine, various forms of parallel computing, current GPU hardware, and the application of ground effects to the model. In the vicinity of 200,000 points, current GPU hardware was found to be nearly 17 times faster than a 12-core OpenMP parallel CPU code, and over 280 times faster than serial MATLAB code. Convergence of the solution is found to depend on the direction in which the grid is refined. The "no entry" condition at the ground plane is found to have a measurable but small impact on the model outputs, with a periodicity driven by the blade proximity to the ground plane. The effect of the ground panel method was found to converge to that of the "method of images" for increasing ground extent and number of panels.
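The N-squared interaction count mentioned in this abstract can be illustrated with a direct summation over all pairs. The sketch below is a simplified stand-in, not the thesis model: it uses 2D point vortices rather than the 3D vortex elements of a free wake code, and the function name `induced_velocities` and the `core` regularisation parameter are assumptions made for illustration. The velocity at each point is an independent sum over all other points, so the outer loop is the part a parallel code would distribute.

```python
import math


def induced_velocities(points, circulations, core=1e-6):
    """Velocity induced at every point by all other 2D point vortices (direct O(N^2) sum)."""
    n = len(points)
    velocities = []
    for i in range(n):  # each i is independent, so this loop parallelises naturally
        xi, yi = points[i]
        u = v = 0.0
        for j in range(n):
            if i == j:
                continue  # a vortex induces no velocity on itself
            dx, dy = xi - points[j][0], yi - points[j][1]
            # Small core radius (an assumption of this sketch) avoids division by zero.
            r2 = dx * dx + dy * dy + core * core
            k = circulations[j] / (2.0 * math.pi * r2)
            u -= k * dy
            v += k * dx
        velocities.append((u, v))
    return velocities


# Usage: two equal vortices one unit apart rotate about their midpoint.
vel = induced_velocities([(-0.5, 0.0), (0.5, 0.0)], [1.0, 1.0], core=0.0)
```

With N points the inner loop runs N(N-1) times, which is why the cost grows quadratically with the number of wake elements.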
33

Scalable and Energy-Efficient SIMT Systems for Deep Learning and Data Center Microservices

Mahmoud Khairy A. Abdallah (12894191) 04 July 2022
Moore’s law is dead. The physical and economic principles that enabled an exponential rise in transistors per chip have reached their breaking point. As a result, the High-Performance Computing (HPC) domain and cloud data centers are encountering significant energy, cost, and environmental hurdles that have led them to embrace custom hardware/software solutions. Single Instruction Multiple Thread (SIMT) accelerators, like Graphics Processing Units (GPUs), are compelling solutions to achieve considerable energy efficiency while still preserving programmability in the twilight of Moore’s Law.
In the HPC and Deep Learning (DL) domain, the death of single-chip GPU performance scaling will usher in a renaissance in multi-chip Non-Uniform Memory Access (NUMA) scaling. Advances in silicon interposers and other inter-chip signaling technology will enable single-package systems, composed of multiple chiplets that continue to scale even as per-chip transistors do not. Given this evolving, massively parallel NUMA landscape, the placement of data on each chiplet, or discrete GPU card, and the scheduling of the threads that use that data is a critical factor in system performance and power consumption.
Aside from the supercomputer space, general-purpose compute units are still the main driver of a data center’s total cost of ownership (TCO). CPUs consume 60% of the total data center power budget, half of which comes from the CPU pipeline’s frontend. Coupled with the hardware efficiency crisis is an increased desire for programmer productivity, flexible scalability, and nimble software updates that have led to the rise of software microservices. Consequently, single servers are now packed with many threads executing the same, relatively small task on different data.
In this dissertation, I discuss these new paradigm shifts, addressing the following concerns: (1) how do we overcome the non-uniform memory access overhead for next-generation multi-chiplet GPUs in the era of DL-driven workloads?; (2) how can we improve the energy efficiency of data center CPUs in light of the evolution of microservices and request similarity?; and (3) how do we study such rapidly evolving systems with accurate and extensible SIMT performance modeling?
34

Runtime specialization for heterogeneous CPU-GPU platforms

Farooqui, Naila 27 May 2016
Heterogeneous parallel architectures like those comprising CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, there remain several open challenges: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. 
Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices.
35

Scalability of fixed-radius searching in meshless methods for heterogeneous architectures

Pols, LeRoi Vincent 12 1900
Thesis (MEng)--Stellenbosch University, 2014. / ENGLISH ABSTRACT: In this thesis we set out to design an algorithm for solving the all-pairs fixed-radius nearest neighbours search problem for a massively parallel heterogeneous system. The all-pairs search problem is stated as follows: Given a set of N points in d-dimensional space, find all pairs of points within a horizon distance of one another. This search is required by any nonlocal or meshless numerical modelling method to construct the neighbour list of each mesh point in the problem domain. Therefore, this work is applicable to a wide variety of fields, ranging from molecular dynamics to pattern recognition and geographical information systems. Here we focus on nonlocal solid mechanics methods. The basic method of solving the all-pairs search is to calculate, for each mesh point, the distance to each other mesh point and compare with the horizon value to determine if the points are neighbours. This can be a very computationally intensive procedure, especially if the neighbourhood needs to be updated at every time step to account for changes in material configuration. The problem also becomes more complex if the analysis is done in parallel. Furthermore, GPU computing has become very popular in the last decade. Most of the fastest supercomputers in the world today employ GPU processors as accelerators to CPU processors. It is also believed that the next-generation exascale supercomputers will be heterogeneous. Therefore the focus is on how to develop a neighbour searching algorithm that will take advantage of next-generation hardware. In this thesis we propose a CPU - multi GPU algorithm, which is an extension of the fixed-grid method, for the fixed-radius nearest neighbours search on massively parallel systems. 
/ AFRIKAANSE OPSOMMING (Afrikaans summary): In this thesis we undertook the design of an algorithm for solving the all-pairs fixed-radius nearest neighbours search problem for large-scale parallel heterogeneous systems. The all-pairs search problem is stated as follows: Given a set of N points in d-dimensional space, find all pairs of points that are within a horizon distance of one another. This search is required by any nonlocal or meshless numerical method to obtain the neighbour list of all mesh points in the problem. Therefore, this work is applicable to a wide variety of fields, ranging from molecular dynamics to pattern recognition and geographical information systems. Here our focus is on nonlocal solid mechanics methods. The basic method of solving the all-pairs search is to calculate, for each mesh point, the distance to every other mesh point and compare it with the horizon length to determine whether the points are neighbours. This can be a very computationally intensive process, especially if the problem must be updated at every step to account for changes in the material configuration. The problem also becomes much more complex if the analysis is done in parallel. Furthermore, GPUs (graphics processing units) have become very popular in the last decade. Most of the fastest supercomputers in the world today use GPUs as accelerators together with CPUs (central processing units). It is also believed that the next-generation exascale supercomputers will employ GPUs. Therefore the focus is on how to develop a neighbour-list search algorithm that will make use of next-generation hardware. In this thesis we propose a CPU - multi-GPU algorithm, which is an extension of the fixed-grid method, for the fixed-radius nearest neighbours search on large-scale parallel systems.
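The two approaches this abstract describes, the basic all-pairs distance check and the fixed-grid method, can be sketched on a single CPU as follows. This is an illustration under stated assumptions, not the thesis's CPU - multi GPU algorithm: the function names are hypothetical, the grid cell size is assumed equal to the horizon, and no parallelism is shown. With that cell size, any two points within the horizon lie in the same or in adjacent cells, so only neighbouring cells need to be compared.

```python
from collections import defaultdict
from itertools import product


def _dist2(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pairs_brute_force(points, horizon):
    """All index pairs (i, j), i < j, at distance <= horizon; O(N^2) comparisons."""
    h2 = horizon * horizon
    result = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if _dist2(points[i], points[j]) <= h2:
                result.add((i, j))
    return result


def pairs_fixed_grid(points, horizon):
    """Same result, but each point is compared only with points in adjacent grid cells."""
    h2 = horizon * horizon
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        # Bin each point into a cell whose edge length equals the horizon.
        cells[tuple(int(c // horizon) for c in p)].append(idx)
    dim = len(points[0]) if points else 0
    offsets = list(product((-1, 0, 1), repeat=dim))  # the cell itself plus its neighbours
    result = set()
    for cell, members in cells.items():
        for off in offsets:
            neighbour = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in cells.get(neighbour, ()):
                    if i < j and _dist2(points[i], points[j]) <= h2:
                        result.add((i, j))
    return result


# Usage: both functions return the same neighbour pairs.
sample = [(0.0, 0.0), (0.5, 0.0), (2.0, 2.0)]
neighbours = pairs_fixed_grid(sample, 1.0)
```

For roughly uniform point distributions, the grid version reduces the work from all N(N-1)/2 pairs to a number of comparisons proportional to N times the average occupancy of a cell neighbourhood.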
36

Proteins, anatomy and networks of the fruit fly brain

Knowles-Barley, Seymour Francis January 2012
Our understanding of the complexity of the brain is limited by the data we can collect and analyze. Because of experimental limitations and a desire for greater detail, most investigations focus on just one aspect of the brain. For example, brain function can be studied at many levels of abstraction including, but not limited to, gene expression, protein interactions, anatomical regions, neuronal connectivity, synaptic plasticity, and the electrical activity of neurons. By focusing on each of these levels, neuroscience has built up a detailed picture of how the brain works, but each level is understood mostly in isolation from the others. It is likely that interaction between all these levels is just as important. Therefore, a key hypothesis is that functional units spanning multiple levels of biological organization exist in the brain. This project attempted to combine neuronal circuitry analysis with functional proteomics and anatomical regions of the brain to explore this hypothesis, and took an evolutionary view of the results obtained. During the process we had to solve a number of technical challenges as the tools to undertake this type of research did not exist. Two informatics challenges for this research were to develop ways to analyze neurobiological data, such as brain protein expression patterns, to extract useful information, and how to share and present this data in a way that is fast and easy for anyone to access. This project contributes towards a more wholistic understanding of the fruit fly brain in three ways. Firstly, a screen was conducted to record the expression of proteins in the brain of the fruit fly, Drosophila melanogaster. Protein expression patterns in the fruit fly brain were recorded from 535 protein trap lines using confocal microscopy. A total of 884 3D images were annotated and made available on an easy to use website database, BrainTrap, available at fruitfly.inf.ed.ac.uk/braintrap. 
The website allows 3D images of the protein expression to be viewed interactively in the web browser, and an ontology-based search tool allows users to search for protein expression patterns in specific areas of interest. Different expression patterns mapped to a common template can be viewed simultaneously in multiple colours. This data bridges the gap between anatomical and biomolecular levels of understanding. Secondly, protein trap expression patterns were used to investigate the properties of the fruit fly brain. Thousands of protein-protein interactions have been recorded by methods such as yeast two-hybrid; however, many of these protein pairs do not express in the same regions of the fruit fly brain. Using 535 protein expression patterns it was possible to rule out 149 protein-protein interactions. Also, protein expression patterns registered against a common template brain were used to produce new anatomical breakdowns of the fruit fly brain. Clustering techniques were able to naturally segment brain regions based only on the protein expression data. This is just one example of how, by combining proteomics with anatomy, we were able to learn more about both levels of understanding. Results are analysed further in combination with networks such as genetic homology networks, and connectivity networks. We show how the wealth of biological and neuroscience data now available in public databases can be combined with the BrainTrap data to reveal similarities between areas of the fruit fly and mammalian brain. The BrainTrap data also informs us on the process of evolution and we show that genes found in fruit fly, yeast and mouse are more likely to be generally expressed throughout the brain, whereas genes found only in fruit fly and mouse, but not yeast, are more likely to have a specific expression pattern in the fruit fly brain. Thus, by combining data from multiple sources we can gain further insight into the complexity of the brain. 
Neural connectivity data is also analyzed and a new technique for enhanced motifs is developed for the combined analysis of connectivity data with other information such as neuron type data and potentially protein expression data. Thirdly, I investigated techniques for imaging the protein trap lines at higher resolution using electron microscopy (EM) and developed new informatics techniques for the automated analysis of neural connectivity data collected from serial section transmission electron microscopy (ssTEM). Measurement of the connectivity between neurons requires high resolution imaging techniques, such as electron microscopy, and images produced by this method are currently annotated manually to produce very detailed maps of cell morphology and connectivity. This is an extremely time consuming process and the volume of tissue and number of neurons that can be reconstructed is severely limited by the annotation step. I developed a set of computer vision algorithms to improve the alignment between consecutive images, and to perform partial annotation automatically by detecting membrane, synapses and mitochondria present in the images. Accuracy of the automatic annotation was evaluated on a small dataset and 96% of membrane could be identified at the cost of 13% false positives. This research demonstrates that informatics technology can help us to automatically analyze biological images and bring together genetic, anatomical, and connectivity data in a meaningful way. This combination of multiple data sources reveals more detail about each individual level of understanding, and gives us a more wholistic view of the fruit fly brain.
37

Parallel Sorting on the Heterogeneous AMD Fusion Accelerated Processing Unit

Delorme, Michael Christopher 18 March 2013
We explore efficient parallel radix sort for the AMD Fusion Accelerated Processing Unit (APU). Two challenges arise: efficiently partitioning data between the CPU and GPU, and allocating data in memory regions. Our coarse-grained implementation utilizes both the GPU and CPU by sharing data at the beginning and end of the sort. Our fine-grained implementation utilizes the APU’s integrated memory system to share data throughout the sort. Both of these implementations outperform the current state-of-the-art GPU radix sort from NVIDIA. We therefore demonstrate that the CPU can be efficiently used to speed up radix sort on the APU. Our fine-grained implementation slightly outperforms our coarse-grained implementation. This demonstrates the benefit of the APU’s integrated architecture. This performance benefit is hindered by limitations in the APU’s architecture and programming model. We believe that the performance benefits will increase once these limitations are addressed in future generations of the APU.
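For readers unfamiliar with radix sort, the following is a minimal sequential sketch of the least-significant-digit variant. It is not the thesis's coarse-grained or fine-grained APU implementation: the function name `radix_sort` and the `bits_per_pass` parameter are assumptions, and the sketch handles only non-negative integers. Each pass consists of a digit histogram, a prefix sum, and a stable scatter, which are the three steps that parallel radix sorts typically distribute across processing elements.

```python
def radix_sort(keys, bits_per_pass=8):
    """Least-significant-digit radix sort for non-negative integers (illustrative only)."""
    if not keys:
        return []
    radix = 1 << bits_per_pass
    mask = radix - 1
    out = list(keys)
    max_key = max(out)
    shift = 0
    while (max_key >> shift) > 0:
        # Step 1: histogram of the current digit.
        counts = [0] * radix
        for k in out:
            counts[(k >> shift) & mask] += 1
        # Step 2: exclusive prefix sum gives the first output slot for each digit value.
        total = 0
        for d in range(radix):
            counts[d], total = total, total + counts[d]
        # Step 3: stable scatter into the output buffer.
        nxt = [0] * len(out)
        for k in out:
            d = (k >> shift) & mask
            nxt[counts[d]] = k
            counts[d] += 1
        out = nxt
        shift += bits_per_pass
    return out


# Usage: sort a small list of unsigned keys.
data = [170, 45, 75, 90, 802, 24, 2, 66, 70000, 0]
sorted_data = radix_sort(data)
```

Stability of the scatter step is what allows later passes over higher digits to preserve the order established by earlier passes.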
38

A model of dynamic compilation for heterogeneous compute platforms

Kerr, Andrew 10 December 2012
Trends in computer engineering place renewed emphasis on increasing parallelism and heterogeneity. The rise of parallelism adds an additional dimension to the challenge of portability, as different processors support different notions of parallelism, whether vector parallelism executing in a few threads on multicore CPUs or large-scale thread hierarchies on GPUs. Thus, software experiences obstacles to portability and efficient execution beyond differences in instruction sets; rather, the underlying execution models of radically different architectures may not be compatible. Dynamic compilation applied to data-parallel heterogeneous architectures presents an abstraction layer decoupling program representations from optimized binaries, thus enabling portability without encumbering performance. This dissertation proposes several techniques that extend dynamic compilation to data-parallel execution models. These contributions include:
- characterization of data-parallel workloads
- machine-independent application metrics
- framework for performance modeling and prediction
- execution model translation for vector processors
- region-based compilation and scheduling
We evaluate these claims via the development of a novel dynamic compilation framework, GPU Ocelot, with which we execute real-world workloads from GPU computing. This enables GPU computing workloads to run efficiently on multicore CPUs, GPUs, and a functional simulator. We show that data-parallel workloads exhibit performance scaling, take advantage of vector instruction set extensions, and effectively exploit data locality via scheduling that attempts to maximize control locality.
40

A GPU Accelerated Tensor Spectral Method for Subspace Clustering

Pai, Nithish January 2016
In this thesis we consider the problem of clustering data lying in a union of subspaces using spectral methods. Though the data generated may have high dimensionality, in many applications, such as motion segmentation and illumination-invariant face clustering, the data resides in a union of subspaces of small dimension. Furthermore, for a number of classification and inference problems, it is often useful to identify these subspaces and work with data in this lower-dimensional manifold. If the observations in each cluster are distributed around a centroid, spectral clustering applied to an affinity matrix built from distance-based similarity measures between the data points has been used successfully to solve the problem. However, it has been observed that such pairwise distance-based measures between data points are not sufficient to construct a similarity matrix that solves the subspace clustering problem. Hence, a major challenge is to find a similarity measure that can capture information about the subspace in which the data lies. This motivates methods that use an affinity tensor, computed from the similarity between multiple data points. One can then use spectral methods on these tensors to solve the subspace clustering problem. In order to keep the algorithm computationally feasible, one can employ column sampling strategies. However, the computational cost of performing the tensor factorization increases very quickly as the sampling rate increases. Fortunately, advances in GPU computing have made it possible to perform many linear algebra operations several orders of magnitude faster than traditional CPU and multicore computing. In this work, we develop parallel algorithms for subspace clustering in a GPU computing environment. We show that this gives a significant speedup over CPU implementations, which allows us to sample a larger fraction of the tensor and thereby achieve better accuracy. 
We empirically analyze the performance of these algorithms on a number of synthetically generated subspace configurations. Finally, we demonstrate the effectiveness of these algorithms on motion segmentation, handwritten digit clustering and illumination-invariant face clustering, and show that their performance is comparable with state-of-the-art approaches.
