1 |
A Compiler Framework to Support and Exploit Heterogeneous Overlapping-ISA Multiprocessor PlatformsJelesnianski, Christopher Stanisław 15 December 2015 (has links)
As the demand for ever increasingly powerful machines continues, new architectures are sought to be the next route of breaking past the brick wall that currently stagnates the performance growth of modern multi-core CPUs. Due to physical limitations, scaling single-core performance any further is no longer possible, giving rise to modern multi-cores. However, the brick wall is now limiting the scaling of general-purpose multi-cores. Heterogeneous-core CPUs have the potential to continue scaling by reducing power consumption through exploitation of specialized and simple cores within the same chip.
Heterogeneous-core CPUs join fundamentally different processors each which their own peculiar features, i.e., fast execution time, improved power efficiency, etc; enabling the building of versatile computing systems. To make heterogeneous platforms permeate the computer market, the next hurdle to overcome is the ability to provide a familiar programming model and environment such that developers do not have to focus on platform details. Nevertheless, heterogeneous platforms integrate processors with diverse characteristics and potentially a different Instruction Set Architecture (ISA), which exacerbate the complexity of the software. A brave few have begun to tread down the heterogeneous-ISA path, hoping to prove that this avenue will yield the next generation of super computers. However, many unforeseen obstacles have yet to be discovered.
With this new challenge comes the clear need for efficient, developer-friendly, adaptable system software to support the efforts of making heterogeneous-ISA the golden standard for future high-performance and general-purpose computing. To foster rapid development of this technology, it is imperative to put the proper tools into the hands of developers, such as application and architecture profiling engines, in order to realize the best heterogeneous-ISA platform possible with available technology. In addition, it would be in the best interest to create tools to be as "timeless" as possible to expose fundamental concepts industry could benefit from and adopt in future designs.
We demonstrate the feasibility of a compiler framework and runtime for an existing heterogeneous-ISA operating system (Popcorn Linux) for automatically scheduling compute blocks within an application on a given heterogeneous-ISA high-performance platform (in our case a platform built with Intel Xeon - Xeon Phi). With the introduced Profiler, Partitioner, and Runtime support, we prove to be able to automatically exploit the heterogeneity in an overlapping-ISA platform, being faster than native execution and other parallelism programming models.
Empirically evaluating our compiler framework, we show that application execution on Popcorn Linux can be up to 52% faster than the most performant native execution for Xeon or Xeon Phi. Using our compiler framework relieves the developer from manual scheduling and porting of applications, requiring only a single profiling run per application. / Master of Science
|
2 |
IMPROVING PERFORMANCE AND ENERGY EFFICIENCY FOR THE INTEGRATED CPU-GPU HETEROGENEOUS SYSTEMSWen, Hao 01 January 2018 (has links)
Current heterogeneous CPU-GPU architectures integrate general purpose CPUs and highly thread-level parallelized GPUs (Graphic Processing Units) in the same die. This dissertation focuses on improving the energy efficiency and performance for the heterogeneous CPU-GPU system.
Leakage energy has become an increasingly large fraction of total energy consumption, making it important to reduce leakage energy for improving the overall energy efficiency. Cache occupies a large on-chip area, which are good targets for leakage energy reduction. For the CPU cache, we study how to reduce the cache leakage energy efficiently in a hybrid SPM (Scratch-Pad Memory) and cache architecture. For the GPU cache, the access pattern of GPU cache is different from the CPU, which usually has little locality and high miss rate. In addition, GPU can hide memory latency more effectively due to multi-threading. Because of the above reasons, we find it is possible to place the cache lines of the GPU data caches into the low power mode more aggressively than traditional leakage management for CPU caches, which can reduce more leakage energy without significant performance degradation.
The contention in shared resources between CPU and GPU, such as the last level cache (LLC), interconnection network and DRAM, may degrade both CPU and GPU performance. We propose a simple yet effective method based on probability to control the LLC replacement policy for reducing the CPU’s inter-core conflict misses caused by GPU without significantly impacting GPU performance. In addition, we develop two strategies to combine the probability based method for the LLC and an existing technique called virtual channel partition (VCP) for the interconnection network to further improve the CPU performance.
For a specific graph application of Breadth first search (BFS), which is a basis for graph search and a core building block for many higher-level graph analysis applications, it is a typical example of parallel computation that is inefficient on GPU architectures. In a graph, a small portion of nodes may have a large number of neighbors, which leads to irregular tasks on GPUs. These irregularities limit the parallelism of BFS executing on GPUs. Unlike the previous works focusing on fine-grained task management to address the irregularity, we propose Virtual-BFS (VBFS) to virtually change the graph itself. By adding virtual vertices, the high-degree nodes in the graph are divided into groups that have an equal number of neighbors, which increases the parallelism such that more GPU threads can work concurrently. This approach ensures correctness and can significantly improve both the performance and energy efficiency on GPUs.
|
3 |
Shared resource management for efficient heterogeneous computingLee, Jaekyu 13 January 2014 (has links)
The demand for heterogeneous computing, because of its performance and energy efficiency, has made on-chip heterogeneous chip multi-processors (HCMP) become the mainstream computing platform, as the recent trend shows in a wide spectrum of platforms from smartphone application processors to desktop and low-end server processors. The performance of on-chip GPUs is not yet comparable to that of discrete GPU cards, but vendors have integrated more powerful GPUs and this trend will continue in upcoming processors.
In this architecture, several system resources are shared between CPUs and GPUs. The sharing of system resources enables easier and cheaper data transfer between CPUs and GPUs, but it also causes resource contention problems between cores. The resource sharing problem has existed since the homogeneous (CPU-only) chip-multi processor (CMP) was introduced. However, resource sharing in HCMPs shows different aspects because of the different nature of CPU and GPU cores. In order to solve the resource sharing problem in HCMPs, we consider efficient shared resource management schemes, in particular tackling the problem in shared last-level cache and interconnection network.
In the thesis, we propose four resource sharing mechanisms:
First, we propose an efficient cache sharing mechanism that exploits the different characteristics of CPU and GPU cores to effectively share cache space between them. Second, adaptive virtual channel partitioning for on-chip interconnection network is proposed to isolate inter-application interference. By partitioning virtual channels to CPUs and GPUs, we can prevent the interference problem while guaranteeing quality-of-service (QoS) for both cores. Third, we propose a dynamic frequency controlling mechanism to efficiently share system resources. When both cores are active, the degree of resource contention as well as the system throughput will be affected by the operating frequency of CPUs and GPUs. The proposed mechanism tries to find optimal operating frequencies for both cores, which reduces the resource contention while improving system throughput. Finally, we propose a second cache sharing mechanism that exploits GPU-semantic information. The programming and execution models of GPUs are more strict and easier than those of CPUs. Also, programmers are asked to provide more information to the hardware. By exploiting these characteristics, GPUs can energy-efficiently exercise the cache and simpler, but more efficient cache partitioning can be enabled for HCMPs.
|
4 |
Conception faible consommation d'un système de détection de chute / Low power architecture for fall detection systemNguyen, Thi Khanh Hong 18 November 2015 (has links)
De nos jours, la détection de chute est un défi pour la santé, notamment pour la surveillance des personnes âgées. Le but de cette thèse est de concevoir un système de détection de chute basée sur une surveillance par caméra et d’étudier les aspects algorithmiques et architecturaux. Notre système se compose de quatre modules : la segmentation d’objet, le filtrage, l’extraction de caractéristiques et la reconnaissance qui permettent en plus de la détection de chute d’identifier leur type afin de définir un niveau d’alerte. En premier lieu, différents algorithmes ont été étudiés et comparés comme le Background Subtraction-Neural Network; le Background Subtraction-Template Matching (BGS-TM); le Background Subtraction-Hidden Markov Model ; et le Gaussian Mixture Model. Le BGS/TM présentant le meilleur taux de reconnaissance a alors été retenu. Une nouvelle base de donnée DTU-HBU a été construite et classifiée selon différentes actions : chute, non-chute (assis, couché, rampant, etc.) selon trois angles de caméra (face, côtés et de biais). Le second objectif fut de définir une méthode de conception permettant de sélectionner les architectures présentant la meilleure performance. Un premier travail fut de définir des modèles de la consommation et du temps d’exécution pour différentes cibles (processeur, FPGA). A titre d’exemple, la plateforme ZYNQ a été considérée. Les modèles proposés présentent un taux erreur inférieur à 3,5%. Une méthodologie de conception DSE basée sur deux techniques de parallélisme (Intra-task et inter-task) et couplant le taux de reconnaissance (ACC) a été définie. Les résultats obtenus montrent que l’ACC atteint 98,3% pour une énergie de 29,5 mJ/f. / Nowadays, fall detection is a major challenge in the public health care domain, especially for the elderly living alone and rehabilitants in hospitals. This thesis presents an exploration for a Fall Detection System based on camera under an algorithmic and architectural point of view. Our system includes four modules: Object Segmentation, Filter, Feature Extraction and Recognition and give an urgent alarm for detecting different kinds of fall. Firstly, different algorithms for the Fall Detection System are proposed and compared the efficiency among Background Subtraction-Neural Network, Background Subtraction-Template Matching (BGS/TM), Background Subtraction-Hidden Markov Model, and Gaussian Mixture Model. Therefore, the selected BGS/TM with 91.67% (Recall), 100% (Precision) and 95.65% (Accuracy) will be implemented on ZYNQ platform. Moreover, a DUT-HBU database which is classified with different actions: fall, non-fall in three camera directions is used to evaluate the efficiency of this system. Secondly, the aim is to explore low cost architectures for this system, new power consumption and execution time models for processor core and FPGA are defined according to the different configurations of architecture and applications. The error rates of the proposed models don’t exceed 3.5%. The models are then extended to hardware/software architectures to explore low cost architecture by defining a suitable Design Space Exploration methodology. Two techniques for parallelization which are based on intra-task and inter-task static scheduling are applied with the aim to enhance the accuracy and the power consumption of this system reaches 98.3% with energy per frame of 29.5mJ/f.
|
5 |
Translating C/C++ applications to a task-based representationLi, Lu January 2011 (has links)
GPU-based heterogeneous architectures have been given much attention recently. How to get optimal performance out of those architectures with affordable programming effort remains a complex challenge. The PEPPHER framework is one possible solution. Within the PEPPHER framework, the StarPU run-time system is used to decrease such programming efforts, and at the same time to ensure near optimal performance by efficient scheduling over such architectures. However, adapting a normal C/C++ application to the StarPU runtime system requires additional programming effort. This thesis implements and tests a composition tool for automatic adaptation of normal C/C++ applications withPEPPHER components to StarPU. This composition tool requires XML annotation for applications and several trivial changes to applications, which take limited efforts. Our results obtained by three test cases (vector scale, sorting, andmatrix multiplication) show that automatic adaptation works well on different platforms that StarPU supports. It is also shown that besides StarPU’s dynamic composition, this tool facilitates static composition to improve performance portability of normal C/C++ applications.
|
6 |
An Analysis of an Interrupt-Driven Implementation of the Master-Worker Model with Application-Specific CoprocessorsHickman, Joseph 17 January 2008 (has links)
In this thesis, we present a versatile parallel programming model composed of an individual general-purpose processor aided by several application-specific coprocessors. These computing units operate under a simplification of the master-worker model. The user-defined coprocessors may be either homogeneous or heterogeneous. We analyze system performance with regard to system size and task granularity, and we present experimental results to determine the optimal operating conditions. Finally, we consider the suitability of this approach for scientific simulations — specifically for use in agent-based models of biological systems. / Master of Science
|
7 |
Rethinking Consistency Management in Real-time Collaborative Editing SystemsPreston, Jon Anderson 28 June 2007 (has links)
Networked computer systems offer much to support collaborative editing of shared documents among users. Increasing concurrent access to shared documents by allowing multiple users to contribute to and/or track changes to these shared documents is the goal of real-time collaborative editing systems (RTCES); yet concurrent access is either limited in existing systems that employ exclusive locking or concurrency control algorithms such as operational transformation (OT) may be employed to enable concurrent access. Unfortunately, such OT based schemes are costly with respect to communication and computation. Further, existing systems are often specialized in their functionality and require users to adopt new, unfamiliar software to enable collaboration. This research discusses our work in improving consistency management in RTCES. We have developed a set of deadlock-free multi-granular dynamic locking algorithms and data structures that maximize concurrent access to shared documents while minimizing communication cost. These algorithms provide a high level of service for concurrent access to the shared document and integrate merge-based or OT-based consistency maintenance policies locally among a subset of the users within a subsection of the document – thus reducing the communication costs in maintaining consistency. Additionally, we have developed client-server and P2P implementations of our hierarchical document management algorithms. Simulations results indicate that our approach achieves significant communication and computation cost savings. We have also developed a hierarchical reduction algorithm that can minimize the space required of RTCES, and this algorithm may be pipelined through our document tree. Further, we have developed an architecture that allows for a heterogeneous set of client editing software to connect with a heterogeneous set of server document repositories via Web services. This architecture supports our algorithms and does not require client or server technologies to be modified – thus it is able to accommodate existing, favored editing and repository tools. Finally, we have developed a prototype benchmark system of our architecture that is responsive to users’ actions and minimizes communication costs.
|
8 |
Methodology and tools for energy-aware task mapping on heterogeneous multiprocessor architectures / Méthodes et outils permettant le placement de taches efficaces en énergie sur architectures multicoeurs hétérogènesRoux, Baptiste 23 November 2017 (has links)
Au cours de la dernière décennie, la conception des systèmes embarqués a évolué dans l'optique d'augmenter la puissance de calcul tout en conservant une faible consommation d'énergie. À titre d'exemple, les véhicules autonomes tels que les drones sont un domaine d'application représentatif qui combine de la vision, des communications sans fil avec d'autres noyaux de calculs intensifs, le tout avec un budget énergétique limité. Avec l'avènement des systèmes multicœurs sur puce (MpSoC), la simplification des processeurs a diminué la consommation d'énergie par opération, alors que leur multiplication a amélioré les performances. Cependant, l'apparition du phénomène de ''dark silicon'' a conduit à l'intégration d'accélérateurs matériels spécialisés au sein des systèmes multicœurs. C'est ainsi que sont nées les architectures massivement multicœurs hétérogènes (HMpSoC) combinant des processeurs généralistes (SW) et des accélérateurs matériels (HW). Pour ces architectures hétérogènes, les performances et la consommation d'énergie dépendent d'un large ensemble de paramètres tels que le partitionnement HW/SW, le type d'implémentation HW et le coût de communication entre les organes de calcul HW et SW conduisant ainsi à un immense espace de conception. Dans cette thèse, nous étudions des méthodes permettant la réduction de la complexité de développement et de mise en oeuvre d'applications efficaces en énergie sur HMpSoC. De nombreuses contributions sont proposées pour améliorer les outils d'exploration de l'espace de conception (DSE) avec des objectifs énergétiques. Tout d'abord, une définition formelle de la structure HMpSoC est introduite ainsi qu'une méthode de représentation générique axée sur la hiérarchie mémoire. Ensuite, un outil de modélisation rapide de l'énergie est proposé et validé sur plusieurs applications. Ce modèle énergétique sépare les sources d'énergie en trois catégories (calcul statique, dynamique et communications) et calcule leurs contributions sur la consommation globale de manière indépendante. Basée sur une étude précise des communications, cette approche calcule rapidement la consommation d'énergie pour une répartition donnée d'application sur un HMpSoC. Dans un deuxième temps, nous proposons une méthodologie permettant l'exploration énergétique d'accélérateurs sur HmpSoC. Cette méthode s'appuie sur le modèle de consommation précédent couplé à une formulation de programmation linéaire en nombre entier mixte (MILP). Cela permet de sélectionner efficacement les accélérateurs HW et le partitionnement HW/SW et ainsi d'obtenir une implémentation efficace en énergie pour une application tuilée. Les expériences réalisées ont montré la complexité du processus de validation d'outils/algorithmes de DSE sur une large gamme d'applications et d'architectures. Afin de résoudre ce problème, nous proposons un simulateur d'architectures HMpSoC intégrant un modèle de consommation permettant d'observer l'exécution d'applications. La structure de l'architecture cible est décrite à l'aide d'un fichier de configuration basé sur le modèle de représentation générique précédent. Ce fichier est chargé dynamiquement lors du démarrage du simulateur. De plus, ce simulateur est associé à un générateur d'applications permettant la création d'un large ensemble d'applications représentatives du domaine. Ce générateur se base sur un ensemble de schémas de calcul et de communication élémentaire qu'il combine pour obtenir une application complète. Les applications ainsi obtenues peuvent être enrichies par des informations de placement et automatiquement exécutées sur le simulateur. Cet ensemble d'outils a pour objectif de faciliter la validation de nouveaux algorithmes ciblant le placement efficace en énergie d'application sur une large gamme d'architectures HMpSoC. / During the last decade, the design of embedded systems was pushed to increase computational power while maintaining low energy consumption. As an example, autonomous vehicles such as drones are a representative application domain which combines vision, wireless communications and other computation intensive kernels constrained with a limited energy budget. With the advent of Multiprocessor System-on-Chip (MpSoC) architectures, simplification of processor cores decreased power consumption per operation, while the multiplication of cores brought performance improvement. However, the ''dark silicon'' issue led to the benefit of augmenting programmable processors with specialized hardware accelerators and to the rise of Heterogeneous MpSoC (HMpSoC) combining both software (SW) and hardware (HW) computational resources. For these heterogeneous architectures, performance and energy consumption depend on a large set of parameters such as the HW/SW partitioning, the type of HW implementation or the communication cost between HW and SW cores therefore leading to a huge design space. In this thesis, we study how to reduce the development and implementation complexity of energy-efficient applications on HMpSoC. Multiple contributions are proposed to enhance Design Space Exploration (DSE) tools with energy objectives. First, a formal definition of HMpSoC structure is introduced alongside with a generic representation focused on the memory hierarchy. Then, a fast power modelling tool is proposed and validated on several applications. This power model separates the power sources in three families (static, dynamic computation and dynamic communication) and computes their contributions on global consumption independently. With a fine grain communications study, this approach rapidly computes energy consumption for a given application mapping on a HMpSoC. In a second time, we propose a methodology for energy-driven accelerator exploration on HMpSoC. This method builds upon the previous power model coupled with an Mixed Integer Linear Programming (MILP) formulation and enables to efficiently select HW accelerators and HW/SW partitioning which achieve energy efficient-mapping of a tiled application. The experiments involved in these contributions show the complexity of DSE validation process on a wide range of applications and architectures. To address these issues, we introduce a HMpSoC simulator embedding a power model to monitor application execution. Properties of targeted architectures are described, at run-time with the previous generic representation model. Furthermore, this simulator is coupled with an application generator framework that could build an infinite set of representative applications following predefined computation models. The obtained applications could then be enriched with mapping directive and executed on the simulator. This combination enables to ease the research and validation of new DSE algorithms targeting energy-aware application mapping on a wide range of HMpSoC architectures.
|
9 |
Approche parcimonieuse et calcul haute performance pour la tomographie itérative régularisée. / Computationally Efficient Sparse Prior in Regularized Iterative Tomographic ReconstructionNotargiacomo, Thibault 14 February 2017 (has links)
La tomographie est une technique permettant de reconstruire une carte des propriétés physiques de l'intérieur d'un objet, à partir d'un ensemble de mesures extérieures. Bien que la tomographie soit une technologie mature, la plupart des algorithmes utilisés dans les produits commerciaux sont basés sur des méthodes analytiques telles que la rétroprojection filtrée. L'idée principale de cette thèse est d'exploiter les dernières avancées dans le domaine de l'informatique et des mathématiques appliqués en vue d'étudier, concevoir et implémenter de nouveaux algorithmes dédiés à la reconstruction 3D en géométrie conique. Nos travaux ciblent des scenarii d'intérêt clinique tels que les acquisitions faible dose ou faible nombre de vues provenant de détecteurs plats. Nous avons étudié différents modèles d'opérateurs tomographiques, leurs implémentations sur serveur multi-GPU, et avons proposé l'utilisation d'une transformée en ondelettes complexes 3D pour régulariser le problème inverse. / X-Ray computed tomography (CT) is a technique that aims at providing a measure of a given property of the interior of a physical object, given a set of exterior projection measurement. Although CT is a mature technology, most of the algorithm used for image reconstruction in commercial applications are based on analytical methods such as the filtered back-projection. The main idea of this thesis is to exploit the latest advances in the field of applied mathematics and computer sciences in order to study, design and implement algorithms dedicated to 3D cone beam reconstruction from X-Ray flat panel detectors targeting clinically relevant usecases, including low doses and few view acquisitions.In this work, we studied various strategies to model the tomographic operators, and how they can be implemented on a multi-GPU platform. Then we proposed to use the 3D complex wavelet transform in order to regularize the reconstruction problem.
|
10 |
Adéquation Algorithme Architecture et modèle de programmation pour l'implémentation d'algorithmes de traitement du signal et de l'image sur cluster multi-GPU / Programming model for the implementation of 2D-3D image processing applications on a hybrid CPU-GPU cluster.Boulos, Vincent 18 December 2012 (has links)
Initialement con¸cu pour d´echarger le CPU des tˆaches de rendu graphique, le GPU estdevenu une architecture massivement parall`ele adapt´ee au traitement de donn´ees volumineuses.Alors qu’il occupe une part de march´e importante dans le Calcul Haute Performance, uned´emarche d’Ad´equation Algorithme Architecture est n´eanmoins requise pour impl´ementerefficacement un algorithme sur GPU.La contribution de cette th`ese est double. Dans un premier temps, nous pr´esentons legain significatif apport´e par l’impl´ementation optimis´ee d’un algorithme de granulom´etrie(l’ordre de grandeur passe de l’heure `a la minute pour un volume de 10243 voxels). Un mod`eleanalytique permettant d’´etablir les variations de performance de l’application de granulom´etriesur GPU a ´egalement ´et´e d´efini et pourrait ˆetre ´etendu `a d’autres algorithmes r´eguliers.Dans un second temps, un outil facilitant le d´eploiement d’applications de Traitementdu Signal et de l’Image sur cluster multi-GPU a ´et´e d´evelopp´e. Pour cela, le champ d’actiondu programmeur est r´eduit au d´ecoupage du programme en tˆaches et `a leur mapping sur les´el´ements de calcul (GPP ou GPU). L’am´elioration notable du d´ebit sortant d’une applicationstreaming de calcul de carte de saillence visuelle a d´emontr´e l’efficacit´e de notre outil pourl’impl´ementation d’une solution sur cluster multi-GPU. Afin de permettre un ´equilibrage decharge dynamique, une m´ethode de migration de tˆaches a ´egalement ´et´e incorpor´ee `a l’outil. / Originally designed to relieve the CPU from graphics rendering tasks, the GPU has becomea massively parallel architecture suitable for processing large amounts of data. While it haswon a significant market share in the High Performance Computing domain, an Algorithm-Architecture Matching approach is still necessary to efficiently implement an algorithm onGPU.The contribution of this thesis is twofold. Firstly, we present the significant gain providedby the implementation of a granulometry optimized algorithm (computation time decreasesfrom several hours to less than minute for a volume of 10243 voxels). An analytical modelestablishing the performance variations of the granulometry application is also presented. Webelieve it can be expanded to other regular algorithms.Secondly, the deployment of Signal and Image processing applications on multi-GPUcluster can be a tedious task for the programmer. In order to help him, we developped alibrary that reduces the scope of the programmer’s contribution in the development. Hisremaining tasks are decomposing the application into a Data Flow Graph and giving mappingannotations in order for the tool to automatically dispatch tasks on the processing elements(GPP or GPU). The throughput of a visual sailency streaming application is then improvedthanks to the efficient implementation brought by our tool on a multi-GPU cluster. In orderto permit dynamic load balancing, a task migration method has also been incorporated into it.
|
Page generated in 0.136 seconds