71

Reconfigurable Technologies for Next Generation Internet and Cluster Computing

Unnikrishnan, Deepak C. 01 September 2013 (has links)
Modern web applications are marked by distinct networking and computing characteristics. As applications evolve, they continue to operate over a large monolithic framework of networking and computing equipment built from general-purpose microprocessors and Application Specific Integrated Circuits (ASICs) that offers few architectural choices. This dissertation presents techniques to diversify the next-generation Internet infrastructure by integrating Field-programmable Gate Arrays (FPGAs), a class of reconfigurable integrated circuits, with general-purpose microprocessor-based techniques. Specifically, our solutions are demonstrated in the context of two applications - network virtualization and distributed cluster computing. Network virtualization enables the physical network infrastructure to be shared among several logical networks to run diverse protocols and differentiated services. The design of a good network virtualization platform is challenging because the physical networking substrate must scale to support several isolated virtual networks with high packet forwarding rates and offer sufficient flexibility to customize networking features. The first major contribution of this dissertation is a novel high performance heterogeneous network virtualization system that integrates FPGAs and general-purpose CPUs. Salient features of this architecture include the ability to scale the number of virtual networks in an FPGA using existing software-based network virtualization techniques, the ability to map virtual networks to a combination of hardware and software resources on demand, and the ability to use off-chip memory resources to scale virtual router features. Partial-reconfiguration has been exploited to dynamically customize virtual networking parameters. An open software framework to describe virtual networking features using a hardware-agnostic language has been developed. 
Evaluation of our system using a NetFPGA card demonstrates one to two orders of magnitude higher throughput than state-of-the-art network virtualization techniques. The demand for greater computing capacity grows as web applications scale. In state-of-the-art systems, an application is scaled by parallelizing the computation on a pool of commodity machines using distributed computing frameworks. Although useful, this technique is inefficient because the sequential nature of execution in general-purpose processors does not suit all workloads equally well. Iterative algorithms form a pervasive class of web and data mining algorithms that execute poorly on general-purpose processors due to the strict synchronization barriers in distributed cluster frameworks. This dissertation presents Maestro, a heterogeneous distributed computing framework that demonstrates how FPGAs can break down such synchronization barriers using asynchronous accumulative updates. These updates allow intermediate results to be accumulated for numerous data points without iteration-based barriers. The benefits of a heterogeneous cluster are illustrated by executing a general class of iterative algorithms on a cluster of commodity CPUs and FPGAs. Computation is dynamically prioritized to accelerate algorithm convergence. We implement three iterative algorithms of this class on a cluster of four FPGAs. A speedup of 7× is achieved over an implementation of asynchronous accumulative updates on a general-purpose CPU, and a 154× speedup versus a standard Hadoop-based CPU-workstation cluster.
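The asynchronous accumulative updates Maestro builds on can be sketched in a few lines. This is a minimal sequential illustration of the general technique (shown here with PageRank-style propagation), not Maestro's FPGA implementation; the damping factor, threshold, and graph encoding are illustrative assumptions.

```python
def accumulative_pagerank(graph, damping=0.85, threshold=1e-6):
    """PageRank via accumulative updates. graph: node -> list of out-neighbours."""
    rank = {v: 0.0 for v in graph}            # accumulated result per node
    delta = {v: 1 - damping for v in graph}   # pending contribution per node
    work = set(graph)                         # nodes with updates to apply
    while work:
        v = work.pop()                        # any order: no iteration barrier
        d, delta[v] = delta[v], 0.0
        rank[v] += d                          # fold contribution into result
        out = graph[v]
        if not out or d < threshold:
            continue
        share = damping * d / len(out)
        for u in out:                         # push contribution downstream
            delta[u] += share
            if delta[u] >= threshold:
                work.add(u)
    return rank
```

Because the updates commute and associate, workers (CPUs or FPGAs) can drain the work set in any order without per-iteration barriers; only the accumulation itself must be applied atomically.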
72

A general-purpose model for heterogeneous computation

Williams, Tiffani L. 01 October 2000 (has links)
No description available.
73

Runtime specialization for heterogeneous CPU-GPU platforms

Farooqui, Naila 27 May 2016 (has links)
Heterogeneous parallel architectures, such as those combining CPUs and GPUs, are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, several challenges remain open: (i) the distinct execution models inherent in the heterogeneous devices on such platforms drive the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput.
Towards this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices.
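The profile-driven resource management described above can be illustrated with a toy sketch: given per-device throughput measured by instrumentation, split a batch of work items so that heterogeneous devices finish at roughly the same time. The device names and rates below are hypothetical; this shows only the proportional-split idea, not the dissertation's runtime.

```python
def partition_work(total_items, throughput):
    """Split work across devices in proportion to measured throughput
    (items/second), so all devices finish at roughly the same time."""
    total_rate = sum(throughput.values())
    shares, assigned = {}, 0
    devices = sorted(throughput)
    for dev in devices[:-1]:
        n = round(total_items * throughput[dev] / total_rate)
        shares[dev] = n
        assigned += n
    shares[devices[-1]] = total_items - assigned  # remainder to last device
    return shares
```

For example, a GPU measured at three times the CPU's throughput receives three quarters of the items; re-profiling at runtime lets the split track input-dependent, irregular behaviour.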
74

Scalability of fixed-radius searching in meshless methods for heterogeneous architectures

Pols, LeRoi Vincent 12 1900 (has links)
Thesis (MEng)--Stellenbosch University, 2014. / ENGLISH ABSTRACT: In this thesis we set out to design an algorithm for solving the all-pairs fixed-radius nearest-neighbours search problem on a massively parallel heterogeneous system. The all-pairs search problem is stated as follows: given a set of N points in d-dimensional space, find all pairs of points within a horizon distance of one another. This search is required by any nonlocal or meshless numerical modelling method to construct the neighbour list of each mesh point in the problem domain. This work is therefore applicable to a wide variety of fields, ranging from molecular dynamics to pattern recognition and geographical information systems; here we focus on nonlocal solid mechanics methods. The basic method of solving the all-pairs search is to calculate, for each mesh point, the distance to every other mesh point and compare it with the horizon value to determine whether the points are neighbours. This can be a very computationally intensive procedure, especially if the neighbourhood needs to be updated at every time step to account for changes in material configuration, and the problem becomes more complex still if the analysis is done in parallel. Furthermore, GPU computing has become very popular in the last decade: most of the fastest supercomputers in the world today employ GPUs as accelerators alongside CPUs, and next-generation exascale supercomputers are widely expected to be heterogeneous. The focus is therefore on how to develop a neighbour-searching algorithm that takes advantage of next-generation hardware. In this thesis we propose a CPU/multi-GPU algorithm, an extension of the fixed-grid method, for the fixed-radius nearest-neighbours search on massively parallel systems.
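The fixed-grid method the thesis extends can be sketched sequentially: bin points into cells whose side length equals the horizon, then test each point only against points in its own and adjacent cells. A minimal 2-D illustration follows (the thesis targets d dimensions and massively parallel CPU/multi-GPU execution):

```python
from collections import defaultdict
from itertools import product
from math import floor, hypot

def fixed_radius_pairs(points, horizon):
    """points: list of (x, y). Returns all index pairs within `horizon`."""
    grid = defaultdict(list)                  # cell -> point indices
    for i, (x, y) in enumerate(points):
        grid[(floor(x / horizon), floor(y / horizon))].append(i)
    pairs = []
    for (cx, cy), members in grid.items():
        for dx, dy in product((-1, 0, 1), repeat=2):   # the 3^2 neighbour cells
            for j in grid.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j and hypot(points[i][0] - points[j][0],
                                       points[i][1] - points[j][1]) <= horizon:
                        pairs.append((i, j))
    return pairs
```

The cost drops from O(N²) distance checks to roughly O(N) for uniform point densities, and the per-cell independence is what makes the method amenable to GPU parallelisation.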
75

"Índices de carga e desempenho em ambientes paralelos/distribuídos - modelagem e métricas" / Load and Performance Index for Parallel/Distributed System - Modelling and Metrics

Branco, Kalinka Regina Lucas Jaquie Castelo 15 December 2004 (has links)
This thesis addresses the problem of obtaining a load index or performance index adequate for process scheduling in heterogeneous parallel/distributed computing systems. A wide literature review with the corresponding critical analysis is presented. This review is the basis for comparing the existing metrics for evaluating the heterogeneity/homogeneity degree of computing systems. A new metric is proposed in this work, removing the restrictions identified during the comparative study. Results from applying the new metric are presented and discussed. This thesis also proposes the concept of temporal heterogeneity/homogeneity, which can be used for future improvements in scheduling policies for parallel/distributed heterogeneous computing platforms. A new performance index (Vector for Index of Performance - VIP), generalizing the concept of load index, is proposed based on a Euclidean metric. This new index is applied in the implementation of a scheduling policy and widely tested through modeling and simulation. The results obtained are presented and statistically analyzed. It is shown that the new index achieves good results in general, and a mapping is presented showing the advantages and disadvantages of its adoption compared with traditional metrics.
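The idea of a vector performance index ranked by a Euclidean metric can be sketched as follows. The metric names, the idle reference point, and the host-selection rule are illustrative assumptions for the sketch, not the VIP definition from the thesis.

```python
from math import sqrt

def vip_distance(metrics, ideal):
    """Euclidean distance between a host's metric vector and a reference."""
    return sqrt(sum((metrics[k] - ideal[k]) ** 2 for k in ideal))

def pick_host(hosts, ideal):
    """hosts: host -> metric dict. Returns the host closest to the ideal."""
    return min(hosts, key=lambda h: vip_distance(hosts[h], ideal))
```

Collapsing several load metrics into one distance lets a scheduler compare heterogeneous hosts with a single number, which is the sense in which a vector index generalizes a scalar load index.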
77

Heterogeneous multi-pipeline application specific instruction-set processor design and implementation

Radhakrishnan, Swarnalatha, Computer Science & Engineering, Faculty of Engineering, UNSW January 2006 (has links)
Embedded systems are becoming ubiquitous, primarily due to the fast evolution of digital electronic devices. The design of modern embedded systems requires high performance and reliability, yet short design time and low cost. Application Specific Instruction-set Processors (ASIPs) are widely used in embedded systems since they are economical, flexible, and reusable (thus saving design time). Over the last decade, research on ASIPs has mainly targeted single-pipeline processors. Performance can be improved by exploiting the parallelism available in the program, but designing multiple parallel execution paths naturally incurs additional cost. The methodology presented in this dissertation addresses the problem of improving performance in ASIPs at minimal additional cost. The devised methodology explores the available parallelism of an application to generate a multi-pipeline heterogeneous ASIP. The processor design is application specific; no pre-defined IPs are used. The generated processor contains multiple standalone pipelined data paths, not necessarily identical, connected by the necessary bypass paths and control signals. Control units are separate for each pipeline (though with the same clock), resulting in a simple and cost-effective design. By using separate instruction and data memories (Harvard architecture) and by allowing memory access from two separate pipes, the complexity of the controller and buses is reduced. The impact of higher memory latencies is hidden by utilizing parallel pipes during memory access. Efficient bypass-network selection and encoding techniques provide a better implementation. The initial design approach, with only two pipelines and no bypass paths, shows speed improvements of up to 36% and switching-activity reductions of up to 11%, at an additional area cost of around 16%. 
An improved design with more than two pipelines, chosen per application, shows an average performance improvement of 77%, with overheads of 49% on area, 51% on leakage power, 17% on switching activity, and 69% on code size. The design was further trimmed with bypass-path selection and encoding techniques, which save up to 32% of area and 34% of leakage power, with a 6% performance improvement and a 69% code-size reduction compared to the multi-pipeline design without these techniques.
78

Parallel Sorting on the Heterogeneous AMD Fusion Accelerated Processing Unit

Delorme, Michael Christopher 18 March 2013 (has links)
We explore efficient parallel radix sort for the AMD Fusion Accelerated Processing Unit (APU). Two challenges arise: efficiently partitioning data between the CPU and GPU, and allocating data among memory regions. Our coarse-grained implementation utilizes both the GPU and CPU by sharing data at the beginning and end of the sort. Our fine-grained implementation utilizes the APU's integrated memory system to share data throughout the sort. Both implementations outperform the state-of-the-art GPU radix sort from NVIDIA, demonstrating that the CPU can be used to speed up radix sort on the APU. Our fine-grained implementation slightly outperforms our coarse-grained implementation, demonstrating the benefit of the APU's integrated architecture. This benefit is limited by constraints in the APU's architecture and programming model; we believe the performance gains will increase once these limitations are addressed in future generations of the APU.
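For reference, a minimal sequential sketch of least-significant-digit radix sort, the algorithm the thesis parallelises across the APU's CPU and GPU cores; the 8-bit digit width and 32-bit keys are assumptions for the sketch.

```python
def radix_sort(keys, bits=8, key_bits=32):
    """Sort non-negative integers with successive `bits`-wide digit passes."""
    buckets = 1 << bits
    mask = buckets - 1
    for shift in range(0, key_bits, bits):
        counts = [0] * buckets
        for k in keys:                        # histogram pass
            counts[(k >> shift) & mask] += 1
        offsets, total = [], 0
        for c in counts:                      # exclusive prefix sum
            offsets.append(total)
            total += c
        out = [0] * len(keys)
        for k in keys:                        # stable scatter pass
            out[offsets[(k >> shift) & mask]] = k
            offsets[(k >> shift) & mask] += 1
        keys = out
    return keys
```

The histogram, prefix-sum, and scatter passes are each data-parallel, which is what makes both the coarse-grained CPU/GPU split and the fine-grained shared-memory variant possible on the APU.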
79

Design of heterogeneous coherence hierarchies using manager-client pairing

Beu, Jesse Garrett 09 April 2013 (has links)
Over the past ten years, the architecture community has witnessed the end of single-threaded performance scaling and a subsequent shift in focus toward multicore and manycore processing. While this is an exciting time for architects, with many new opportunities and design spaces to explore, it brings with it some new challenges. One area that is especially impacted is the memory subsystem: the design, verification, and evaluation of cache coherence protocols becomes very challenging as cores become more numerous and more diverse. This dissertation examines these issues and presents Manager-Client Pairing as a solution to the challenges facing next-generation coherence protocol design. By defining a standardized coherence communication interface and permissions-checking algorithm, Manager-Client Pairing enables coherence hierarchies to be constructed and evaluated quickly, without the high design cost previously associated with hierarchical composition. Manager-Client Pairing also allows for verification composition, even in the presence of protocol heterogeneity. As a result, rapidly developed diverse protocols are ensured to be bug-free, enabling architects to focus on performance optimization rather than debugging and correctness concerns while comparing diverse coherence configurations for use in future heterogeneous systems.
80

A model of dynamic compilation for heterogeneous compute platforms

Kerr, Andrew 10 December 2012 (has links)
Trends in computer engineering place renewed emphasis on increasing parallelism and heterogeneity. The rise of parallelism adds an additional dimension to the challenge of portability, as different processors support different notions of parallelism, whether vector parallelism executing in a few threads on multicore CPUs or large-scale thread hierarchies on GPUs. Thus, software experiences obstacles to portability and efficient execution beyond differences in instruction sets; rather, the underlying execution models of radically different architectures may not be compatible. Dynamic compilation applied to data-parallel heterogeneous architectures presents an abstraction layer decoupling program representations from optimized binaries, thus enabling portability without encumbering performance. This dissertation proposes several techniques that extend dynamic compilation to data-parallel execution models. These contributions include:
- characterization of data-parallel workloads
- machine-independent application metrics
- a framework for performance modeling and prediction
- execution model translation for vector processors
- region-based compilation and scheduling
We evaluate these claims via the development of a novel dynamic compilation framework, GPU Ocelot, with which we execute real-world workloads from GPU computing. This enables GPU computing workloads to run efficiently on multicore CPUs, GPUs, and a functional simulator. We show that data-parallel workloads exhibit performance scaling, take advantage of vector instruction set extensions, and effectively exploit data locality via scheduling that attempts to maximize control locality.
