Global ETD Search

1	Intra- and Inter-chip Communication Support for Asymmetric Multicore Processors with Explicitly Managed Memory Hierarchies Rose, Benjamin Aaron 10 June 2009 The use of asymmetric multi-core processors with on-chip computational accelerators is becoming common in a variety of environments ranging from scientific computing to enterprise applications. The focus of current research has been on making efficient use of individual systems and on porting applications to asymmetric processors. The use of these asymmetric processors, like the Cell processor, in a cluster setting is the inspiration for the Cell Connector framework presented in this thesis. Cell Connector adopts a streaming approach for providing data to compute nodes with high computing potential but limited memory resources. Instead of dividing very large data sets once among computation resources, Cell Connector slices, distributes, and collects work units from a master data set held by a single large-memory machine. Using this methodology, Cell Connector maximizes the use of limited resources and produces results that are up to 63.3% better than standard non-streaming approaches. / Master of Science Cell BE multicore cluster
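The slice, distribute, and collect cycle described in this abstract can be illustrated with a minimal single-process sketch. It is not the Cell Connector implementation: the names `slice_work_units`, `stream_process`, and `sum_of_squares` are hypothetical, the "compute node" is just a function call, and no Cell-specific or cluster communication is modeled. It only shows the idea of feeding bounded-size work units from one master data set instead of partitioning the data once.

```python
from typing import Callable, Iterator, List, Sequence


def slice_work_units(master: Sequence[int], unit_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of the master data, each at most unit_size long."""
    for start in range(0, len(master), unit_size):
        yield master[start:start + unit_size]


def stream_process(
    master: Sequence[int],
    unit_size: int,
    worker: Callable[[Sequence[int]], int],
) -> List[int]:
    """Hand one bounded work unit at a time to the worker and collect the results.

    Assumption: the worker stands in for a memory-limited compute node, so it
    never sees more than unit_size elements at once.
    """
    results = []
    for unit in slice_work_units(master, unit_size):
        results.append(worker(unit))
    return results


def sum_of_squares(unit: Sequence[int]) -> int:
    """Hypothetical per-unit kernel used only for illustration."""
    return sum(x * x for x in unit)


if __name__ == "__main__":
    data = list(range(10))
    partials = stream_process(data, 4, sum_of_squares)
    print(partials, sum(partials))
```

The unit size is the tuning knob here: it bounds how much data any worker holds at a time, independently of the size of the master data set.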
2	Optimized On-chip Software Pipelining On the Cell BE Processor Hultén, Rikard January 2010 The special architecture of the Cell BE processor has made scientists revisit the problem of sorting. This paper implements and tests a variant of merge sort in which a number of 2-to-1 mergers are connected in a pipelined tree. For large trees there are many more such mergers than processors, which means they must be mapped to the processors in some way. Optimized mappings are tested, and the results show that changing the model used when optimizing might be beneficial. It is also shown that the small size of the local stores on the co-processors does not limit performance. Cell BE Merge sort Computer science Datavetenskap
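The tree of 2-to-1 mergers mentioned in this abstract can be sketched with lazy Python generators, where each merger pulls items from its two children on demand. This is an illustrative sketch only: the function names `merge2`, `merge_tree`, and `tree_merge_sort` and the `run_length` parameter are assumptions, and the mapping of mergers to processors studied in the thesis is not modeled.

```python
from typing import Iterator, List, Sequence


def merge2(left: Iterator[int], right: Iterator[int]) -> Iterator[int]:
    """2-to-1 merger: lazily merge two sorted streams into one sorted stream."""
    sentinel = object()  # marks an exhausted input stream
    a = next(left, sentinel)
    b = next(right, sentinel)
    while a is not sentinel and b is not sentinel:
        if a <= b:
            yield a
            a = next(left, sentinel)
        else:
            yield b
            b = next(right, sentinel)
    # Drain whichever input still has items.
    while a is not sentinel:
        yield a
        a = next(left, sentinel)
    while b is not sentinel:
        yield b
        b = next(right, sentinel)


def merge_tree(runs: Sequence[Sequence[int]]) -> Iterator[int]:
    """Connect 2-to-1 mergers into a balanced tree over the sorted runs."""
    if not runs:
        return iter(())
    if len(runs) == 1:
        return iter(runs[0])
    mid = len(runs) // 2
    return merge2(merge_tree(runs[:mid]), merge_tree(runs[mid:]))


def tree_merge_sort(data: Sequence[int], run_length: int = 2) -> List[int]:
    """Sort by cutting data into short sorted runs and merging them through the tree."""
    runs = [sorted(data[i:i + run_length]) for i in range(0, len(data), run_length)]
    return list(merge_tree(runs))


if __name__ == "__main__":
    print(tree_merge_sort([5, 3, 8, 1, 9, 2, 7]))
```

Because every merger yields one item at a time, items flow through the tree in a pipelined fashion and no merger needs to buffer a whole run, which loosely mirrors why small local memories need not be the bottleneck.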
4	Shared Memory Abstractions for Heterogeneous Multicore Processors Schneider, Scott 21 January 2011 We are now seeing diminishing returns from classic single-core processor designs, yet the number of transistors available for a processor is still increasing. Processor architects are therefore experimenting with a variety of multicore processor designs. Heterogeneous multicore processors with Explicitly Managed Memory (EMM) hierarchies are one such experimental design which has the potential for high performance, but at the cost of great programmer effort. EMM processors have cores that are divorced from the normal memory hierarchy, thus the onus is on the programmer to manage locality and parallelism. This dissertation presents the Cellgen source-to-source compiler which moves some of this complexity back into the compiler. Cellgen offers a directive-based programming model with semantics similar to OpenMP for the Cell Broadband Engine, a general-purpose processor with EMM. The compiler implicitly handles locality and parallelism, schedules memory transfers for data parallel regions of code, and provides performance predictions which can be leveraged to make scheduling decisions. We compare this approach to using a software cache, to a different programming model which is task based with explicit data transfers, and to programming the Cell directly using the native SDK. We also present a case study which uses the Cellgen compiler in a comparison across multiple kinds of multicore architectures: heterogeneous, homogeneous and radically data-parallel graphics processors. / Ph. D. Parallel Programming EMM Cell BE Programming Models Parallel Hardware Architecture
5	CellPilot: An extension of the Pilot library for Cell Broadband Engine processors and heterogeneous clusters Girard, Natalie 13 January 2012 The CellPilot library provides a uniform communication programming model, based on Pilot's process/channel approach, for clusters of Cell Broadband Engine processors. Pilot, a thin layer on top of the Message Passing Interface (MPI) library, allows processes to read/write messages on channels defined between pairs of processes on the cluster, but Pilot alone does not help a Cell programmer cope with the considerable complexities of intra-Cell communication. With CellPilot, programmers still design software in terms of processes, but these can now be located on a Cell node's Power Processor Elements (PPEs), on its Synergistic Processing Elements (SPEs), or on non-Cell nodes within a heterogeneous Cell cluster, and communication is accomplished via channels between process pairs. Programs are coded in terms of reading and writing on those channels, whereupon CellPilot transparently applies whichever communication mechanisms are required to transport the message, regardless of its endpoints. This gives the programmer a way to handle inter-process communication while avoiding low-level I/O operations and the use of multiple libraries. CellPilot Pilot Cell BE Cell Broadband Engine MPI heterogeneous cluster high performance computing multicore
6	Scheduling on Asymmetric Architectures Blagojevic, Filip 22 July 2008 We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. To maximize performance on heterogeneous multi-core processors, programs need to expose multiple dimensions of parallelism simultaneously. Unfortunately, programming with multiple dimensions of parallelism is to date an ad hoc process, relying heavily on the intuition and skill of programmers. Formal techniques are needed to optimize multi-dimensional parallel program designs. We investigate user- and kernel-level schedulers that dynamically "rightsize" the dimensions and degrees of parallelism on asymmetric parallel platforms. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. Our runtime environment outperforms the native Linux and MPI scheduling environment by up to a factor of 2.7. We also present a model of multi-dimensional parallel computation for steering the parallelization process on heterogeneous multi-core processors. The model predicts with high accuracy the execution time and scalability of a program using conventional processors and accelerators simultaneously. More specifically, the model reveals optimal degrees of multi-dimensional, task-level and data-level concurrency, to maximize performance across cores.
We evaluate our runtime policies and the performance model we developed on an IBM Cell BladeCenter and on a cluster of PlayStation 3 nodes, using two realistic bioinformatics applications. / Ph. D. process scheduling performance prediction high-performance computing runtime adaptation Multicore processors Cell BE
7	Programmation des architectures hiérarchiques et hétérogènes / Programming hierarchical and heterogeneous machines Hamidouche, Khaled 10 November 2011 Today's high-performance computing architectures are hierarchical and heterogeneous. They are hierarchical because they comprise a memory hierarchy: memory distributed between nodes and memory shared between the cores of a single node. They are heterogeneous because they use specialized processors called accelerators, such as IBM's Cell BE processor and NVIDIA GPUs. The difficulty of mastering these architectures is twofold. On the one hand, there is the problem of programmability: programming should remain simple, as close as possible to conventional sequential programming, and independent of the target architecture. On the other hand, there is the problem of efficiency: performance should be close to what an expert would obtain by writing the code by hand with low-level tools. In this thesis, we propose a development platform to address these problems, built on two tools: BSP++, a generic library using C++ templates, and BSPGen, a framework for the automatic generation of hybrid code spanning several levels of the hierarchy (MPI + OpenMP or MPI + Cell BE). Based on a hierarchical model, the BSP++ library takes hybrid architectures as native targets. Using a small set of primitives and intuitive concepts, BSP++ offers ease of use and a high level of abstraction of the target machine. Using the BSP++ cost model, BSPGen predicts and generates the appropriate hierarchical hybrid code for a given application on a target architecture. BSPGen generates hybrid code from a list of sequential functions and a description of the parallel algorithm.
Our tools have been validated on applications from different fields, ranging from verification and scientific computing to image processing and bioinformatics, on a wide selection of target architectures ranging from simple shared-memory machines to Petascale machines and heterogeneous architectures equipped with Cell BE accelerators. BSP Automatic code generation Parallel computing MPI OpenMP Cell BE
9	High Performance by Exploiting Information Locality through Reverse Computing / Hautes Performances en Exploitant la Localité de l'Information via le Calcul Réversible. Bahi, Mouad 21 December 2011 The three main resources of computation are time, space and energy; reducing them is one of the most important challenges in the pursuit of processor performance. In this thesis, we are interested in a fourth factor: information. Information has a direct impact on these three resources, and we show how it contributes to performance optimization. Landauer showed that, independently of the hardware on which a computation runs, it is the logical erasure of information that dissipates energy. This is a fundamental result of thermodynamics in physics. Under this hypothesis, a computation that consumes no energy is one that destroys no information: the original and intermediate values can always be retrieved at any point of the computation, and the computation is reversible. Information may be carried not only by a datum but also by the process and input data that generate it. When a computation is reversible, information can also be retrieved from already computed data by reverse computation. Hence reversible computing improves information locality. This thesis develops these ideas in two directions. In the first part, starting from a computation given as a DAG (directed acyclic graph), we define "garbage" as the additional memory, measured in registers, needed to make the computation reversible. We propose a reversible register allocator and show empirically that the garbage size is at most half the number of nodes in the graph.
In the second part, we apply this approach to the trade-off between recomputing (direct or reverse) and storage in the context of supercomputers such as recent vector and parallel coprocessors, graphics processing units (GPUs), the IBM Cell processor, etc., where the gap between memory access time and compute time keeps widening. We show that recomputing in general, and reverse recomputing in particular, reduces register requirements and hence memory pressure. This approach also significantly increases instruction-level parallelism (Cell BE) and thread-level parallelism on multicore processors with a shared memory and/or register file (GPU), where the number of running threads depends heavily on the number of registers each thread uses. The instructions added by reverse computation to rematerialize certain variables are thus largely offset by the gain in parallelism. Experiments on the highly memory-demanding Lattice QCD simulation code ported to an Nvidia GPU show a performance gain of up to 11%. Compilation Reversible computing Information locality Performance optimization Register allocation Rematerialization Spill code Instruction-level parallelism Thread-level parallelism GPU Cell BE LQCD
