• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 5
  • 2
  • 2
  • Tagged with
  • 12
  • 12
  • 5
  • 4
  • 3
  • 3
  • 3
  • 3
  • 3
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

A Compiler Framework to Support and Exploit Heterogeneous Overlapping-ISA Multiprocessor Platforms

Jelesnianski, Christopher Stanisław 15 December 2015 (has links)
As the demand for ever increasingly powerful machines continues, new architectures are sought to be the next route of breaking past the brick wall that currently stagnates the performance growth of modern multi-core CPUs. Due to physical limitations, scaling single-core performance any further is no longer possible, giving rise to modern multi-cores. However, the brick wall is now limiting the scaling of general-purpose multi-cores. Heterogeneous-core CPUs have the potential to continue scaling by reducing power consumption through exploitation of specialized and simple cores within the same chip. Heterogeneous-core CPUs join fundamentally different processors each which their own peculiar features, i.e., fast execution time, improved power efficiency, etc; enabling the building of versatile computing systems. To make heterogeneous platforms permeate the computer market, the next hurdle to overcome is the ability to provide a familiar programming model and environment such that developers do not have to focus on platform details. Nevertheless, heterogeneous platforms integrate processors with diverse characteristics and potentially a different Instruction Set Architecture (ISA), which exacerbate the complexity of the software. A brave few have begun to tread down the heterogeneous-ISA path, hoping to prove that this avenue will yield the next generation of super computers. However, many unforeseen obstacles have yet to be discovered. With this new challenge comes the clear need for efficient, developer-friendly, adaptable system software to support the efforts of making heterogeneous-ISA the golden standard for future high-performance and general-purpose computing. To foster rapid development of this technology, it is imperative to put the proper tools into the hands of developers, such as application and architecture profiling engines, in order to realize the best heterogeneous-ISA platform possible with available technology. In addition, it would be in the best interest to create tools to be as "timeless" as possible to expose fundamental concepts industry could benefit from and adopt in future designs. We demonstrate the feasibility of a compiler framework and runtime for an existing heterogeneous-ISA operating system (Popcorn Linux) for automatically scheduling compute blocks within an application on a given heterogeneous-ISA high-performance platform (in our case a platform built with Intel Xeon - Xeon Phi). With the introduced Profiler, Partitioner, and Runtime support, we prove to be able to automatically exploit the heterogeneity in an overlapping-ISA platform, being faster than native execution and other parallelism programming models. Empirically evaluating our compiler framework, we show that application execution on Popcorn Linux can be up to 52% faster than the most performant native execution for Xeon or Xeon Phi. Using our compiler framework relieves the developer from manual scheduling and porting of applications, requiring only a single profiling run per application. / Master of Science

Shared resource management for efficient heterogeneous computing

Lee, Jaekyu 13 January 2014 (has links)
The demand for heterogeneous computing, because of its performance and energy efficiency, has made on-chip heterogeneous chip multi-processors (HCMP) become the mainstream computing platform, as the recent trend shows in a wide spectrum of platforms from smartphone application processors to desktop and low-end server processors. The performance of on-chip GPUs is not yet comparable to that of discrete GPU cards, but vendors have integrated more powerful GPUs and this trend will continue in upcoming processors. In this architecture, several system resources are shared between CPUs and GPUs. The sharing of system resources enables easier and cheaper data transfer between CPUs and GPUs, but it also causes resource contention problems between cores. The resource sharing problem has existed since the homogeneous (CPU-only) chip-multi processor (CMP) was introduced. However, resource sharing in HCMPs shows different aspects because of the different nature of CPU and GPU cores. In order to solve the resource sharing problem in HCMPs, we consider efficient shared resource management schemes, in particular tackling the problem in shared last-level cache and interconnection network. In the thesis, we propose four resource sharing mechanisms: First, we propose an efficient cache sharing mechanism that exploits the different characteristics of CPU and GPU cores to effectively share cache space between them. Second, adaptive virtual channel partitioning for on-chip interconnection network is proposed to isolate inter-application interference. By partitioning virtual channels to CPUs and GPUs, we can prevent the interference problem while guaranteeing quality-of-service (QoS) for both cores. Third, we propose a dynamic frequency controlling mechanism to efficiently share system resources. When both cores are active, the degree of resource contention as well as the system throughput will be affected by the operating frequency of CPUs and GPUs. The proposed mechanism tries to find optimal operating frequencies for both cores, which reduces the resource contention while improving system throughput. Finally, we propose a second cache sharing mechanism that exploits GPU-semantic information. The programming and execution models of GPUs are more strict and easier than those of CPUs. Also, programmers are asked to provide more information to the hardware. By exploiting these characteristics, GPUs can energy-efficiently exercise the cache and simpler, but more efficient cache partitioning can be enabled for HCMPs.

Translating C/C++ applications to a task-based representation

Li, Lu January 2011 (has links)
GPU-based heterogeneous architectures have been given much attention recently. How to get optimal performance out of those architectures with affordable programming effort remains a complex challenge. The PEPPHER framework is one possible solution. Within the PEPPHER framework, the StarPU run-time system is used to decrease such programming efforts, and at the same time to ensure near optimal performance by efficient scheduling over such architectures. However, adapting a normal C/C++ application to the StarPU runtime system requires additional programming effort. This thesis implements and tests a composition tool for automatic adaptation of normal C/C++ applications withPEPPHER components to StarPU. This composition tool requires XML annotation for applications and several trivial changes to applications, which take limited efforts. Our results obtained by three test cases (vector scale, sorting, andmatrix multiplication) show that automatic adaptation works well on different platforms that StarPU supports. It is also shown that besides StarPU’s dynamic composition, this tool facilitates static composition to improve performance portability of normal C/C++ applications.

Conception faible consommation d'un système de détection de chute / Low power architecture for fall detection system

Nguyen, Thi Khanh Hong 18 November 2015 (has links)
De nos jours, la détection de chute est un défi pour la santé, notamment pour la surveillance des personnes âgées. Le but de cette thèse est de concevoir un système de détection de chute basée sur une surveillance par caméra et d’étudier les aspects algorithmiques et architecturaux. Notre système se compose de quatre modules : la segmentation d’objet, le filtrage, l’extraction de caractéristiques et la reconnaissance qui permettent en plus de la détection de chute d’identifier leur type afin de définir un niveau d’alerte. En premier lieu, différents algorithmes ont été étudiés et comparés comme le Background Subtraction-Neural Network; le Background Subtraction-Template Matching (BGS-TM); le Background Subtraction-Hidden Markov Model ; et le Gaussian Mixture Model. Le BGS/TM présentant le meilleur taux de reconnaissance a alors été retenu. Une nouvelle base de donnée DTU-HBU a été construite et classifiée selon différentes actions : chute, non-chute (assis, couché, rampant, etc.) selon trois angles de caméra (face, côtés et de biais). Le second objectif fut de définir une méthode de conception permettant de sélectionner les architectures présentant la meilleure performance. Un premier travail fut de définir des modèles de la consommation et du temps d’exécution pour différentes cibles (processeur, FPGA). A titre d’exemple, la plateforme ZYNQ a été considérée. Les modèles proposés présentent un taux erreur inférieur à 3,5%. Une méthodologie de conception DSE basée sur deux techniques de parallélisme (Intra-task et inter-task) et couplant le taux de reconnaissance (ACC) a été définie. Les résultats obtenus montrent que l’ACC atteint 98,3% pour une énergie de 29,5 mJ/f. / Nowadays, fall detection is a major challenge in the public health care domain, especially for the elderly living alone and rehabilitants in hospitals. This thesis presents an exploration for a Fall Detection System based on camera under an algorithmic and architectural point of view. Our system includes four modules: Object Segmentation, Filter, Feature Extraction and Recognition and give an urgent alarm for detecting different kinds of fall. Firstly, different algorithms for the Fall Detection System are proposed and compared the efficiency among Background Subtraction-Neural Network, Background Subtraction-Template Matching (BGS/TM), Background Subtraction-Hidden Markov Model, and Gaussian Mixture Model. Therefore, the selected BGS/TM with 91.67% (Recall), 100% (Precision) and 95.65% (Accuracy) will be implemented on ZYNQ platform. Moreover, a DUT-HBU database which is classified with different actions: fall, non-fall in three camera directions is used to evaluate the efficiency of this system. Secondly, the aim is to explore low cost architectures for this system, new power consumption and execution time models for processor core and FPGA are defined according to the different configurations of architecture and applications. The error rates of the proposed models don’t exceed 3.5%. The models are then extended to hardware/software architectures to explore low cost architecture by defining a suitable Design Space Exploration methodology. Two techniques for parallelization which are based on intra-task and inter-task static scheduling are applied with the aim to enhance the accuracy and the power consumption of this system reaches 98.3% with energy per frame of 29.5mJ/f.


Wen, Hao 01 January 2018 (has links)
Current heterogeneous CPU-GPU architectures integrate general purpose CPUs and highly thread-level parallelized GPUs (Graphic Processing Units) in the same die. This dissertation focuses on improving the energy efficiency and performance for the heterogeneous CPU-GPU system. Leakage energy has become an increasingly large fraction of total energy consumption, making it important to reduce leakage energy for improving the overall energy efficiency. Cache occupies a large on-chip area, which are good targets for leakage energy reduction. For the CPU cache, we study how to reduce the cache leakage energy efficiently in a hybrid SPM (Scratch-Pad Memory) and cache architecture. For the GPU cache, the access pattern of GPU cache is different from the CPU, which usually has little locality and high miss rate. In addition, GPU can hide memory latency more effectively due to multi-threading. Because of the above reasons, we find it is possible to place the cache lines of the GPU data caches into the low power mode more aggressively than traditional leakage management for CPU caches, which can reduce more leakage energy without significant performance degradation. The contention in shared resources between CPU and GPU, such as the last level cache (LLC), interconnection network and DRAM, may degrade both CPU and GPU performance. We propose a simple yet effective method based on probability to control the LLC replacement policy for reducing the CPU’s inter-core conflict misses caused by GPU without significantly impacting GPU performance. In addition, we develop two strategies to combine the probability based method for the LLC and an existing technique called virtual channel partition (VCP) for the interconnection network to further improve the CPU performance. For a specific graph application of Breadth first search (BFS), which is a basis for graph search and a core building block for many higher-level graph analysis applications, it is a typical example of parallel computation that is inefficient on GPU architectures. In a graph, a small portion of nodes may have a large number of neighbors, which leads to irregular tasks on GPUs. These irregularities limit the parallelism of BFS executing on GPUs. Unlike the previous works focusing on fine-grained task management to address the irregularity, we propose Virtual-BFS (VBFS) to virtually change the graph itself. By adding virtual vertices, the high-degree nodes in the graph are divided into groups that have an equal number of neighbors, which increases the parallelism such that more GPU threads can work concurrently. This approach ensures correctness and can significantly improve both the performance and energy efficiency on GPUs.

An Analysis of an Interrupt-Driven Implementation of the Master-Worker Model with Application-Specific Coprocessors

Hickman, Joseph 17 January 2008 (has links)
In this thesis, we present a versatile parallel programming model composed of an individual general-purpose processor aided by several application-specific coprocessors. These computing units operate under a simplification of the master-worker model. The user-defined coprocessors may be either homogeneous or heterogeneous. We analyze system performance with regard to system size and task granularity, and we present experimental results to determine the optimal operating conditions. Finally, we consider the suitability of this approach for scientific simulations — specifically for use in agent-based models of biological systems. / Master of Science

Rethinking Consistency Management in Real-time Collaborative Editing Systems

Preston, Jon Anderson 28 June 2007 (has links)
Networked computer systems offer much to support collaborative editing of shared documents among users. Increasing concurrent access to shared documents by allowing multiple users to contribute to and/or track changes to these shared documents is the goal of real-time collaborative editing systems (RTCES); yet concurrent access is either limited in existing systems that employ exclusive locking or concurrency control algorithms such as operational transformation (OT) may be employed to enable concurrent access. Unfortunately, such OT based schemes are costly with respect to communication and computation. Further, existing systems are often specialized in their functionality and require users to adopt new, unfamiliar software to enable collaboration. This research discusses our work in improving consistency management in RTCES. We have developed a set of deadlock-free multi-granular dynamic locking algorithms and data structures that maximize concurrent access to shared documents while minimizing communication cost. These algorithms provide a high level of service for concurrent access to the shared document and integrate merge-based or OT-based consistency maintenance policies locally among a subset of the users within a subsection of the document – thus reducing the communication costs in maintaining consistency. Additionally, we have developed client-server and P2P implementations of our hierarchical document management algorithms. Simulations results indicate that our approach achieves significant communication and computation cost savings. We have also developed a hierarchical reduction algorithm that can minimize the space required of RTCES, and this algorithm may be pipelined through our document tree. Further, we have developed an architecture that allows for a heterogeneous set of client editing software to connect with a heterogeneous set of server document repositories via Web services. This architecture supports our algorithms and does not require client or server technologies to be modified – thus it is able to accommodate existing, favored editing and repository tools. Finally, we have developed a prototype benchmark system of our architecture that is responsive to users’ actions and minimizes communication costs.

Approche parcimonieuse et calcul haute performance pour la tomographie itérative régularisée. / Computationally Efficient Sparse Prior in Regularized Iterative Tomographic Reconstruction

Notargiacomo, Thibault 14 February 2017 (has links)
La tomographie est une technique permettant de reconstruire une carte des propriétés physiques de l'intérieur d'un objet, à partir d'un ensemble de mesures extérieures. Bien que la tomographie soit une technologie mature, la plupart des algorithmes utilisés dans les produits commerciaux sont basés sur des méthodes analytiques telles que la rétroprojection filtrée. L'idée principale de cette thèse est d'exploiter les dernières avancées dans le domaine de l'informatique et des mathématiques appliqués en vue d'étudier, concevoir et implémenter de nouveaux algorithmes dédiés à la reconstruction 3D en géométrie conique. Nos travaux ciblent des scenarii d'intérêt clinique tels que les acquisitions faible dose ou faible nombre de vues provenant de détecteurs plats. Nous avons étudié différents modèles d'opérateurs tomographiques, leurs implémentations sur serveur multi-GPU, et avons proposé l'utilisation d'une transformée en ondelettes complexes 3D pour régulariser le problème inverse. / X-Ray computed tomography (CT) is a technique that aims at providing a measure of a given property of the interior of a physical object, given a set of exterior projection measurement. Although CT is a mature technology, most of the algorithm used for image reconstruction in commercial applications are based on analytical methods such as the filtered back-projection. The main idea of this thesis is to exploit the latest advances in the field of applied mathematics and computer sciences in order to study, design and implement algorithms dedicated to 3D cone beam reconstruction from X-Ray flat panel detectors targeting clinically relevant usecases, including low doses and few view acquisitions.In this work, we studied various strategies to model the tomographic operators, and how they can be implemented on a multi-GPU platform. Then we proposed to use the 3D complex wavelet transform in order to regularize the reconstruction problem.

Adéquation Algorithme Architecture et modèle de programmation pour l'implémentation d'algorithmes de traitement du signal et de l'image sur cluster multi-GPU / Programming model for the implementation of 2D-3D image processing applications on a hybrid CPU-GPU cluster.

Boulos, Vincent 18 December 2012 (has links)
Initialement con¸cu pour d´echarger le CPU des tˆaches de rendu graphique, le GPU estdevenu une architecture massivement parall`ele adapt´ee au traitement de donn´ees volumineuses.Alors qu’il occupe une part de march´e importante dans le Calcul Haute Performance, uned´emarche d’Ad´equation Algorithme Architecture est n´eanmoins requise pour impl´ementerefficacement un algorithme sur GPU.La contribution de cette th`ese est double. Dans un premier temps, nous pr´esentons legain significatif apport´e par l’impl´ementation optimis´ee d’un algorithme de granulom´etrie(l’ordre de grandeur passe de l’heure `a la minute pour un volume de 10243 voxels). Un mod`eleanalytique permettant d’´etablir les variations de performance de l’application de granulom´etriesur GPU a ´egalement ´et´e d´efini et pourrait ˆetre ´etendu `a d’autres algorithmes r´eguliers.Dans un second temps, un outil facilitant le d´eploiement d’applications de Traitementdu Signal et de l’Image sur cluster multi-GPU a ´et´e d´evelopp´e. Pour cela, le champ d’actiondu programmeur est r´eduit au d´ecoupage du programme en tˆaches et `a leur mapping sur les´el´ements de calcul (GPP ou GPU). L’am´elioration notable du d´ebit sortant d’une applicationstreaming de calcul de carte de saillence visuelle a d´emontr´e l’efficacit´e de notre outil pourl’impl´ementation d’une solution sur cluster multi-GPU. Afin de permettre un ´equilibrage decharge dynamique, une m´ethode de migration de tˆaches a ´egalement ´et´e incorpor´ee `a l’outil. / Originally designed to relieve the CPU from graphics rendering tasks, the GPU has becomea massively parallel architecture suitable for processing large amounts of data. While it haswon a significant market share in the High Performance Computing domain, an Algorithm-Architecture Matching approach is still necessary to efficiently implement an algorithm onGPU.The contribution of this thesis is twofold. Firstly, we present the significant gain providedby the implementation of a granulometry optimized algorithm (computation time decreasesfrom several hours to less than minute for a volume of 10243 voxels). An analytical modelestablishing the performance variations of the granulometry application is also presented. Webelieve it can be expanded to other regular algorithms.Secondly, the deployment of Signal and Image processing applications on multi-GPUcluster can be a tedious task for the programmer. In order to help him, we developped alibrary that reduces the scope of the programmer’s contribution in the development. Hisremaining tasks are decomposing the application into a Data Flow Graph and giving mappingannotations in order for the tool to automatically dispatch tasks on the processing elements(GPP or GPU). The throughput of a visual sailency streaming application is then improvedthanks to the efficient implementation brought by our tool on a multi-GPU cluster. In orderto permit dynamic load balancing, a task migration method has also been incorporated into it.

Placement de tâches dynamique et flexible sur processeur multicoeur asymétrique en fonctionnalités / Dynamic and flexible task mapping on functionally asymmetric multi-core processor

Aminot, Alexandre 01 October 2015 (has links)
Pour répondre aux besoins de plus en plus hétérogènes des applications (puissance et efficacité énergétique), nous nous intéressons dans cette thèse aux architectures émergentes de type multi-cœur asymétrique en fonctionnalités (FAMP). Ces architectures sont caractérisées par une mise en œuvre non-uniforme des extensions matérielles dans les cœurs (ex. unitée de calculs à virgule flottante (FPU)). Les avantages en surface sont apparents, mais qu'en est-il de l'impact au niveau logiciel, énergétique et performance?Pour répondre à ces questions, la thèse explore la nature de l'utilisation des extensions dans des applications de l'état de l'art et compare différentes méthodes existantes. Pour optimiser le placement de tâches et ainsi augmenter l'efficacité, la thèse propose une solution dynamique au niveau ordonnanceur, appelée ordonnanceur relaxé.Les extensions matérielles sont intéressantes car elles permettent des accélérations difficilement atteignables par la parallélisation sur un multi-cœur. Néanmoins, leurs utilisations par les applications sont faibles et leur coût en termes de surface et consommation énergétique sont importants.En se basant sur ces observations, les points suivants ont été développés:Nous présentons une étude approfondie sur l'utilisation de l'extension vectorielle et FPU dans des applications de l'état de l'artNous comparons plusieurs solutions de gestion des extensions à différent niveaux de granularité temporelle d'action pour comprendre les limites de ces solutions et ainsi définir à quel niveau il faut agir. Peu d'études traitent la question de la granularité d'action pour gérer les extensions.Nous proposons une solution pour estimer en ligne la dégradation de performance à exécuter une tâche sur un cœur sans extension. Afin de permettre la mise à l'échelle des multi-cœurs, le système d'exploitation doit avoir de la flexibilité dans le placement de tâches. Placer une tâche sur un cœur sans extension peut avoir d'importantes conséquences en énergie et en performance. Or à ce jour, il n'existe pas de solution pour estimer cette potentielle dégradation.Nous proposons un ordonnanceur relaxé, basé notre modèle d'estimation de dégradation, qui place les tâches sur un ensemble de cœurs hétérogènes de manière efficace. Nous étudions la flexibilité gagnée ainsi que les conséquences aux niveaux performances et énergie.Les solutions existantes proposent des méthodes pour placer les tâches sur un ensemble de cœurs hétérogènes, or, celles-ci n'étudient pas le compromis entre qualité de service et gain en consommation pour les architectures FAMP.Nos expériences sur simulateur ont montré que l'ordonnanceur peut atteindre une flexibilité de placement significative avec une dégradation en performance de moins de 2%. Comparé à un multi-cœur symétrique, notre solution permet un gain énergétique moyen au niveau cœur de 11 %. Ces résultats sont très encourageant et contribuent au développement d'une plateforme complète FAMP. Cette thèse a fait l'objet d'un dépôt de brevet, de trois communications scientifiques internationales (plus une en soumission), et a contribué à deux projets européens. / To meet the increasingly heterogeneous needs of applications (in terms of power and efficiency), this thesis focus on the emerging functionally asymmetric multi-core processor (FAMP) architectures. These architectures are characterized by non-uniform implementation of hardware extensions in the cores (ex. Floating Point Unit (FPU)). The area savings are apparent, but what about the impact in software, energy and performance?To answer these questions, the thesis investigates the nature of the use of extensions in state-of-the-art's applications and compares various existing methods. To optimize the tasks mapping and increase efficiency, the thesis proposes a dynamic solution at scheduler level, called relaxed scheduler.Hardware extensions are valuable because they speed up part of code where the parallelization on multi-core isn't efficient. However, the hardware extensions are under-exploited by applications and their cost in terms of area and power consumption are important.Based on these observations, the following contributions have been proposed:We present a detailed study on the use of vector and FPU extensions in state-of-the-art's applicationsWe compare multiple extension management solutions at different levels of temporal granularity of action, to understand the limitations of these solutions and thus define at which level we must act. Few studies address the issue of the granularity of action to manage extensions.We offer a solution for estimating online performance degradation to run a task on a core without a given extension. To allow the scalability of multi-core, the operating system must have flexibility in the placement of tasks. Placing a task on a core with no extension can have important consequences for energy and performance. But to date, there is no way to estimate this potential degradation.We offer a relaxed scheduler, based on our degradation estimation model, which maps the tasks on a set of heterogeneous cores effectively. We study the flexibility gained and the implications for performance and energy levels. Existing solutions propose methods to map tasks on a heterogeneous set of cores, but they do not study the tradeoff between quality of service and consumption gain for FAMP architectures.Our experiments with simulators have shown that the scheduler can achieve a significantly higher mapping flexibility with a performance degradation of less than 2 %. Compared to a symmetrical multi-core, our solution enables an average energy gain at core level of 11 %. These results are very encouraging and contribute to the development of a comprehensive FAMP platform . This thesis has been the subject of a patent application, three international scientific communications (plus one submission), and contributes to two active european projects.

Page generated in 0.4864 seconds