Global ETD Search

71	Runtime specialization for heterogeneous CPU-GPU platforms Farooqui, Naila 27 May 2016 (has links) Heterogeneous parallel architectures like those comprised of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, there remain several open challenges: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. 
Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices. Dynamic instrumentation Dynamic compilation GPU computing Heterogeneous computing Profile-guided optimizations Program analysis Workload characterization Compiler Runtime Multicore CUDA OpenCL SIMD
72	Scalability of fixed-radius searching in meshless methods for heterogeneous architectures Pols, LeRoi Vincent 12 1900 (has links) Thesis (MEng)--Stellenbosch University, 2014. / ENGLISH ABSTRACT: In this thesis we set out to design an algorithm for solving the all-pairs fixed-radius nearest neighbours search problem for a massively parallel heterogeneous system. The all-pairs search problem is stated as follows: Given a set of N points in d-dimensional space, find all pairs of points within a horizon distance of one another. This search is required by any nonlocal or meshless numerical modelling method to construct the neighbour list of each mesh point in the problem domain. Therefore, this work is applicable to a wide variety of fields, ranging from molecular dynamics to pattern recognition and geographical information systems. Here we focus on nonlocal solid mechanics methods. The basic method of solving the all-pairs search is to calculate, for each mesh point, the distance to each other mesh point and compare with the horizon value to determine if the points are neighbours. This can be a very computationally intensive procedure, especially if the neighbourhood needs to be updated at every time step to account for changes in material configuration. The problem also becomes more complex if the analysis is done in parallel. Furthermore, GPU computing has become very popular in the last decade. Most of the fastest supercomputers in the world today employ GPU processors as accelerators to CPU processors. It is also believed that the next-generation exascale supercomputers will be heterogeneous. Therefore the focus is on how to develop a neighbour searching algorithm that will take advantage of next-generation hardware. In this thesis we propose a CPU - multi GPU algorithm, which is an extension of the fixed-grid method, for the fixed-radius nearest neighbours search on massively parallel systems. 
/ AFRIKAANSE OPSOMMING: In hierdie tesis het ons die ontwerp van ’n algoritme vir die oplossing van die alle-pare vaste-radius naaste bure soektog probleem vir groot skaal parallele heterogene stelsels aangepak. Die alle-pare soektog probleem is as volg gestel: Gegewe ’n stel van N punte in d-dimensionele ruimte, vind al die pare van punte wat binne ’n horison afstand van mekaar af is. Die soektog word deur enige nie-lokale of roosterlose numeriese metode benodig om die bure-lys van alle rooster-punte in die probleem te kry. Daarom is hierdie werk van toepassing op ’n wye verskeidenheid van velde, wat wissel van molekulêre dinamika tot patroon herkenning en geografiese inligtingstelsels. Hier is ons fokus op nie-lokale soliede meganika metodes. Die basiese metode vir die oplossing van die alle-pare soektog is om vir elke rooster-punt, die afstand na elke ander rooster-punt te bereken en te vergelyk met die horison lente, om dus so te bepaal of die punte bure is. Dit kan ’n baie berekenings intensiewe proses wees, veral as die probleem by elke stap opgedateer moet word om die veranderinge in die materiaal konfigurasie daar te stel. Die probleem word ook baie meer kompleks as die analise in parallel gedoen word. Verder het GVE’s (Grafiese verwerkings eenhede) baie gewild geword in die afgelope dekade. Die meeste van die vinnigste superrekenaars in die wêreld vandag gebruik GVE’s as versnellers te same met SVE’s (Sentrale verwerkings eenhede). Dit is ook van mening dat die volgende generasie exa-skaal superrekenaars GVE’s sal implementeer. Daarom is die fokus op hoe om ’n bure-lys soektog algoritme te ontwikkel wat gebruik sal maak van die volgende generasie hardeware. In hierdie tesis stel ons ’n SVE - veelvoudige GVE algoritme voor, wat ’n verlenging van die vaste-rooster metode is, vir die vaste-radius naaste bure soektog op groot skaal parallele stelsels. 
Solid mechanics Neighbour searching Meshfree methods (Numerical analysis) GPU computing Theses -- Civil engineering Dissertations -- Civil engineering Parallel algorithms Heterogeneous computing UCTD
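Note on record 72: the abstract states the all-pairs fixed-radius search problem and contrasts the basic all-pairs distance check with the fixed-grid method that the thesis extends. The following is a minimal single-threaded sketch of both ideas, not the thesis's CPU - multi GPU algorithm. The function names, the use of `<=` for "within a horizon distance", and the choice of a cell side equal to the horizon are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def brute_force_pairs(points, horizon):
    """Basic method: compare every point with every other point (O(N^2))."""
    pairs = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= horizon:
                pairs.add((i, j))
    return pairs


def fixed_grid_pairs(points, horizon):
    """Fixed-grid sketch: bin points into cells of side `horizon`, then
    compare each point only with points in its own and adjacent cells."""
    if not points:
        return set()
    dim = len(points[0])

    # Map each cell index (a tuple of integers) to the points it contains.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(floor(c / horizon) for c in p)].append(idx)

    # Two points within `horizon` differ by at most one cell per axis,
    # so checking the 3^d surrounding cells is sufficient.
    offsets = list(product((-1, 0, 1), repeat=dim))
    pairs = set()
    for cell, members in cells.items():
        for off in offsets:
            neighbour = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in cells.get(neighbour, ()):
                    if i < j and dist(points[i], points[j]) <= horizon:
                        pairs.add((i, j))
    return pairs
```

Both functions return the same set of index pairs; the grid version only reduces the number of distance evaluations when points are spread over many cells.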
73	"Índices de carga e desempenho em ambientes paralelos/distribuídos - modelagem e métricas" / Load and Performance Index for Parallel/Distributed System - Modelling and Metrics Branco, Kalinka Regina Lucas Jaquie Castelo 15 December 2004 (has links) Esta tese aborda o problema de obtenção de um índice de carga ou de desempenho adequado para utilização no escalonamento de processos em sistemas computacionais heterogêneos paralelos/distribuídos. Uma ampla revisão bibliográfica com a correspondente análise crítica é apresentada. Essa revisão é a base para a comparação das métricas existentes para a avaliação do grau de heterogeneidade/homogeneidade dos sistemas computacionais. Uma nova métrica é proposta neste trabalho, removendo as restrições identificadas no estudo comparativo realizado. Resultados de aplicações dessa nova métrica são apresentados e discutidos. Esta tese propõe também o conceito de heterogeneidade/homogeneidade temporal que pode ser utilizado para futuros aprimoramentos de políticas de escalonamento empregadas em plataformas computacionais heterogêneas paralelas/distribuídas. Um novo índice de desempenho (Vector for Index of Performance - VIP), generalizando o conceito de índice de carga, é proposto com base em uma métrica Euclidiana. Esse novo índice é aplicado na implementação de uma política de escalonamento e amplamente testado através de modelagem e simulação. Os resultados obtidos são apresentados e analisados estatisticamente. É demonstrado que o novo índice leva a bons resultados de modo geral e é apresentado um mapeamento mostrando as vantagens e desvantagens de sua adoção quando comparado às métricas tradicionais. / This thesis approaches the problem of evaluating an adequate load index or a performance index, for using in process scheduling in heterogeneous parallel/distributed computing systems. A wide literature review with the corresponding critical analysis is presented. 
This review is the basis for the comparison of the existing metrics for evaluating the degree of homogeneity/heterogeneity of computing systems. A new metric is proposed in this work, removing the restrictions identified during the comparative study. Results from the application of the new metric are presented and discussed. This thesis also proposes the concept of temporal heterogeneity/homogeneity, which can be used for future improvements in scheduling policies for parallel/distributed heterogeneous computing platforms. A new performance index (Vector for Index of Performance - VIP), generalizing the concept of load index, is proposed based on a Euclidean metric. This new index is applied to the implementation of a scheduling policy and widely tested through modeling and simulation. The results obtained are presented and statistically analyzed. It is shown that the new index reaches good results in general, and a mapping is presented showing the advantages and disadvantages of its adoption when compared with the traditional metrics. Balanceamento de Carga Computação Heterogênea Distributed System Escalonamento de Processos Heterogeneous Computing Índice de Desempenho Índices de Carga Load Balancing Load Index Performance Index Process Scheduling Sistemas Distribuídos
74	"Índices de carga e desempenho em ambientes paralelos/distribuídos - modelagem e métricas" / Load and Performance Index for Parallel/Distributed System - Modelling and Metrics Kalinka Regina Lucas Jaquie Castelo Branco 15 December 2004 (has links) Esta tese aborda o problema de obtenção de um índice de carga ou de desempenho adequado para utilização no escalonamento de processos em sistemas computacionais heterogêneos paralelos/distribuídos. Uma ampla revisão bibliográfica com a correspondente análise crítica é apresentada. Essa revisão é a base para a comparação das métricas existentes para a avaliação do grau de heterogeneidade/homogeneidade dos sistemas computacionais. Uma nova métrica é proposta neste trabalho, removendo as restrições identificadas no estudo comparativo realizado. Resultados de aplicações dessa nova métrica são apresentados e discutidos. Esta tese propõe também o conceito de heterogeneidade/homogeneidade temporal que pode ser utilizado para futuros aprimoramentos de políticas de escalonamento empregadas em plataformas computacionais heterogêneas paralelas/distribuídas. Um novo índice de desempenho (Vector for Index of Performance - VIP), generalizando o conceito de índice de carga, é proposto com base em uma métrica Euclidiana. Esse novo índice é aplicado na implementação de uma política de escalonamento e amplamente testado através de modelagem e simulação. Os resultados obtidos são apresentados e analisados estatisticamente. É demonstrado que o novo índice leva a bons resultados de modo geral e é apresentado um mapeamento mostrando as vantagens e desvantagens de sua adoção quando comparado às métricas tradicionais. / This thesis approaches the problem of evaluating an adequate load index or a performance index, for using in process scheduling in heterogeneous parallel/distributed computing systems. A wide literature review with the corresponding critical analysis is presented. 
This review is the basis for the comparison of the existing metrics for evaluating the degree of homogeneity/heterogeneity of computing systems. A new metric is proposed in this work, removing the restrictions identified during the comparative study. Results from the application of the new metric are presented and discussed. This thesis also proposes the concept of temporal heterogeneity/homogeneity, which can be used for future improvements in scheduling policies for parallel/distributed heterogeneous computing platforms. A new performance index (Vector for Index of Performance - VIP), generalizing the concept of load index, is proposed based on a Euclidean metric. This new index is applied to the implementation of a scheduling policy and widely tested through modeling and simulation. The results obtained are presented and statistically analyzed. It is shown that the new index reaches good results in general, and a mapping is presented showing the advantages and disadvantages of its adoption when compared with the traditional metrics. Balanceamento de Carga Computação Heterogênea Escalonamento de Processos Índice de Desempenho Índices de Carga Sistemas Distribuídos Distributed System Heterogeneous Computing Load Balancing Load Index Performance Index Process Scheduling
75	Heterogeneous multi-pipeline application specific instruction-set processor design and implementation Radhakrishnan, Swarnalatha, Computer Science & Engineering, Faculty of Engineering, UNSW January 2006 (has links) Embedded systems are becoming ubiquitous, primarily due to the fast evolution of digital electronic devices. The design of modern embedded systems requires systems to exhibit high performance and reliability, yet have short design time and low cost. Application-specific instruction-set processors (ASIPs) are widely used in embedded systems since they are economical to use, flexible, and reusable (thus saving design time). During the last decade, research on ASIPs has been carried out mainly for single-pipeline processors. Improving performance in processors is possible by exploiting the available parallelism in the program. Designing multiple parallel execution paths in the processor naturally incurs additional cost. The methodology presented in this dissertation addresses the problem of improving performance in ASIPs at minimal additional cost. The devised methodology explores the available parallelism of an application to generate a multi-pipeline heterogeneous ASIP. The processor design is application specific. No pre-defined IPs are used in the design. The generated processor contains multiple standalone pipelined data paths, which are not necessarily identical, and are connected by the necessary bypass paths and control signals. Control units are separate for each pipeline (though with the same clock), resulting in a simple and cost-effective design. By using separate instruction and data memories (Harvard architecture) and by allowing memory access by two separate pipes, the complexity of the controller and buses is reduced. The impact of higher memory latencies is nullified by utilizing parallel pipes during memory access. 
Efficient bypass network selection and encoding techniques provide a better implementation. The initial design approach, with only two pipelines and no bypass paths, shows speed improvements of up to 36% and switching activity reductions of up to 11%. The additional area cost is around 16%. An improved design with an application-dependent number of pipelines (more than two) shows an average performance improvement of 77%, with overheads of 49% in area, 51% in leakage power, 17% in switching activity, and 69% in code size. The design was further trimmed with bypass path selection and encoding techniques, which show savings of up to 32% in area and 34% in leakage power, with a 6% performance improvement and a 69% code size reduction, compared to the multi-pipeline design without these techniques. Application specific processors ASIP Heterogeneous ASIPs Multi-pipeline ASIPs Embedded computer systems Heterogeneous computing Application-specific integrated circuits Parallel programming (Computer science) Piping
76	Parallel Sorting on the Heterogeneous AMD Fusion Accelerated Processing Unit Delorme, Michael Christopher 18 March 2013 (has links) We explore efficient parallel radix sort for the AMD Fusion Accelerated Processing Unit (APU). Two challenges arise: efficiently partitioning data between the CPU and GPU, and allocating data in memory regions. Our coarse-grained implementation utilizes both the GPU and CPU by sharing data at the beginning and end of the sort. Our fine-grained implementation utilizes the APU’s integrated memory system to share data throughout the sort. Both of these implementations outperform the current state-of-the-art GPU radix sort from NVIDIA. We therefore demonstrate that the CPU can be efficiently used to speed up radix sort on the APU. Our fine-grained implementation slightly outperforms our coarse-grained implementation. This demonstrates the benefit of the APU’s integrated architecture. This performance benefit is hindered by limitations in the APU’s architecture and programming model. We believe that the performance benefits will increase once these limitations are addressed in future generations of the APU. Parallel sorting Radix sort Heterogeneous computing GPU GPGPU AMD Fusion Llano APU Accelerated Processing Unit OpenCL Fusion Sort GPU computing 0984
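Note on record 76: the abstract describes a coarse-grained scheme in which the CPU and GPU share data only at the beginning and end of a radix sort. A rough sequential sketch of that partitioning idea follows. It is not the thesis's OpenCL implementation: the `gpu_fraction` parameter, the function names, the 32-bit non-negative key assumption, and the final merge step are all assumptions made for illustration, and the two partitions are sorted one after the other here rather than on separate devices.

```python
from heapq import merge


def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits."""
    bucket_count = 1 << bits_per_pass
    mask = bucket_count - 1
    keys = list(keys)
    for shift in range(0, key_bits, bits_per_pass):
        # Stable distribution pass on the current digit.
        buckets = [[] for _ in range(bucket_count)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for bucket in buckets for k in bucket]
    return keys


def coarse_grained_sort(keys, gpu_fraction=0.75):
    """Split the input once, sort each share independently, combine once."""
    split = int(len(keys) * gpu_fraction)
    gpu_part = radix_sort(keys[:split])   # stands in for the GPU's share
    cpu_part = radix_sort(keys[split:])   # stands in for the CPU's share
    # Data is shared again only here, at the end of the sort.
    return list(merge(gpu_part, cpu_part))
```

In this sketch the split ratio plays the role of the data-partitioning decision the abstract names as the first challenge; a fine-grained variant would exchange data between passes instead of only at the two ends.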
77	Design of heterogeneous coherence hierarchies using manager-client pairing Beu, Jesse Garrett 09 April 2013 (has links) Over the past ten years, the architecture community has witnessed the end of single-threaded performance scaling and a subsequent shift in focus toward multicore and manycore processing. While this is an exciting time for architects, with many new opportunities and design spaces to explore, this brings with it some new challenges. One area that is especially impacted is the memory subsystem. Specifically, the design, verification, and evaluation of cache coherence protocols becomes very challenging as cores become more numerous and more diverse. This dissertation examines these issues and presents Manager-Client Pairing as a solution to the challenges facing next-generation coherence protocol design. By defining a standardized coherence communication interface and permissions checking algorithm, Manager-Client Pairing enables coherence hierarchies to be constructed and evaluated quickly without the high design-cost previously associated with hierarchical composition. Further, Manager-Client Pairing also allows for verification composition, even in the presence of protocol heterogeneity. As a result, this rapid development of diverse protocols is ensured to be bug-free, enabling architects to focus on performance optimization, rather than debugging and correctness concerns, while comparing diverse coherence configurations for use in future heterogeneous systems. Formal verification Protocol verification Heterogeneous computing Uncore Microarchitecture Computer architecture Computer storage devices Memory management (Computer science) Memory maps (Computer science) Data processing
78	A model of dynamic compilation for heterogeneous compute platforms Kerr, Andrew 10 December 2012 (has links) Trends in computer engineering place renewed emphasis on increasing parallelism and heterogeneity. The rise of parallelism adds an additional dimension to the challenge of portability, as different processors support different notions of parallelism, whether vector parallelism executing in a few threads on multicore CPUs or large-scale thread hierarchies on GPUs. Thus, software experiences obstacles to portability and efficient execution beyond differences in instruction sets; rather, the underlying execution models of radically different architectures may not be compatible. Dynamic compilation applied to data-parallel heterogeneous architectures presents an abstraction layer decoupling program representations from optimized binaries, thus enabling portability without encumbering performance. This dissertation proposes several techniques that extend dynamic compilation to data-parallel execution models. These contributions include: - characterization of data-parallel workloads - machine-independent application metrics - framework for performance modeling and prediction - execution model translation for vector processors - region-based compilation and scheduling We evaluate these claims via the development of a novel dynamic compilation framework, GPU Ocelot, with which we execute real-world workloads from GPU computing. This enables the execution of GPU computing workloads to run efficiently on multicore CPUs, GPUs, and a functional simulator. We show data-parallel workloads exhibit performance scaling, take advantage of vector instruction set extensions, and effectively exploit data locality via scheduling which attempts to maximize control locality. Dynamic compilation GPU computing Cuda Opencl SIMD Vector Multicore Parallel computing Parallel computers Parallel programs (Computer programs) Heterogeneous computing High performance computing
79	Coordinated system level resource management for heterogeneous many-core platforms Gupta, Vishakha 24 August 2011 (has links) A challenge posed by future computer architectures is the efficient exploitation of their many and sometimes heterogeneous computational cores. This challenge is exacerbated by the multiple facilities for data movement and sharing across cores resident on such platforms. To answer the question of how systems software should treat heterogeneous resources, this dissertation describes an approach that (1) creates a common manageable pool for all the resources present in the platform, and then (2) provides virtual machines (VMs) with multiple `personalities', flexibly mapped to and efficiently run on the heterogeneous underlying hardware. A VM's personality is its execution context on the different types of available processing resources usable by the VM. We provide mechanisms for making such platforms manageable and evaluate coordinated scheduling policies for mapping different VM personalities on heterogeneous hardware. Towards that end, this dissertation contributes technologies that include (1) restructuring hypervisor and system functions to create high performance environments that enable flexibility of execution and data sharing, (2) scheduling and other resource management infrastructure for supporting diverse application needs and heterogeneous platform characteristics, and (3) hypervisor level policies to permit efficient and coordinated resource usage and sharing. Experimental evaluations on multiple heterogeneous platforms, like one comprised of x86-based cores with attached NVIDIA accelerators and others with asymmetric elements on chip, demonstrate the utility of the approach and its ability to efficiently host diverse applications and resource management methods. 
Coordinated scheduling Heterogeneous many-core systems Asymmetric multi-cores Virtualization Kinship model Performance points Virtual computer systems Computing platforms Computer architecture Heterogeneous computing High performance computing
80	Performance and energy efficiency via an adaptive MorphCore architecture Khubaib 09 July 2014 (has links) The level of Thread-Level Parallelism (TLP), Instruction-Level Parallelism (ILP), and Memory-Level Parallelism (MLP) varies across programs and across program phases. Hence, every program requires different underlying core microarchitecture resources for high performance and/or energy efficiency. Current core microarchitectures are inefficient because they are fixed at design time and do not adapt to variable TLP, ILP, or MLP. I show that if a core microarchitecture can adapt to the variation in TLP, ILP, and MLP, significantly higher performance and/or energy efficiency can be achieved. I propose MorphCore, a low-overhead adaptive microarchitecture built from a traditional out-of-order (OOO) core with small changes. MorphCore adapts to TLP by operating in two modes: (a) as a wide-width large-OOO-window core when TLP is low and ILP is high, and (b) as a high-performance low-energy highly-threaded in-order SMT core when TLP is high. MorphCore adapts to ILP and MLP by varying the superscalar width and the OOO window size, operating in four modes: (1) as a wide-width large-OOO-window core, (2) as a wide-width medium-OOO-window core, (3) as a medium-width large-OOO-window core, and (4) as a medium-width medium-OOO-window core. My evaluation with single-thread and multi-thread benchmarks shows that when highest single-thread performance is desired, MorphCore achieves performance similar to a traditional out-of-order core. When energy efficiency is desired on single-thread programs, MorphCore reduces energy by up to 15% (on average 8%) over an out-of-order core. When high multi-thread performance is desired, MorphCore increases performance by 21% and reduces energy consumption by 20% over an out-of-order core. 
Thus, for multi-thread programs, MorphCore's energy efficiency is similar to highly-threaded throughput-optimized small and medium core architectures, and its performance is two-thirds of their potential. / text Computer architecture Microprocessor Thread-level parallelism Instruction-level parallelism Memory-level parallelism Adaptive microprocessor Energy efficiency High performance Power efficiency Chip-multiprocessor Heterogeneous computing
