Global ETD Search

1	Evaluation of Energy-Optimizing Scheduling Algorithms for Streaming Computations on Massively Parallel Multicore Architectures / Evaluering av energioptimerande schemaläggningsalgoritmer för strömmande beräkningar på massivt parallella flerkärniga arkitekturer Janzén, Johan January 2014 (has links) This thesis describes an environment to evaluate and compare static schedulers for real pipelined streaming applications on massively parallel architectures, such as Intel Single chip Cloud Computer (SCC), Adapteva Epiphany, and Tilera TILE-Gx series. The framework allows performance comparison of schedulers in their execution time, or the energy usage of static schedules with energy models and measurements on real platform. This thesis focuses on the implementation of a framework evaluating the energy consumption of such streaming applications on the SCC. The framework can run streaming applications, built as task collections, with static schedules including dynamic frequency scaling. Streams are handled by the framework with FIFO buffers, connected between tasks. We evaluate the framework by considering a pipelined mergesort implementation with different static schedules. The runtime is compared with the runtime of a previously published task based optimized mergesort implementation. The results show how much overhead the framework adds on to the streaming application. As a demonstration of the energy measuring capabilities, we schedule and analyze a Fast Fourier Transform application, and discuss the results. Future work may include quantitative comparative studies of a range of different static schedulers. This has, to our knowledge, not been done previously. Intel SCC DVFS Task based programming Static scheduling Energy efficiency Multicore
2	Scientific Computing on Multicore Architectures Tillenius, Martin January 2014 (has links) Computer simulations are an indispensable tool for scientists to gain new insights about nature. Simulations of natural phenomena are usually large, and limited by the available computer resources. By using the computer resources more efficiently, larger and more detailed simulations can be performed, and more information can be extracted to help advance human knowledge. The topic of this thesis is how to make best use of modern computers for scientific computations. The challenge here is the high level of parallelism that is required to fully utilize the multicore processors in these systems. Starting from the basics, the primitives for synchronizing between threads are investigated. Hardware transactional memory is a new construct for this, which is evaluated for a new use of importance for scientific software: atomic updates of floating point values. The evaluation includes experiments on real hardware and comparisons against standard methods. Higher level programming models for shared memory parallelism are then considered. The state of the art for efficient use of multicore systems is dynamically scheduled task-based systems, where tasks can depend on data. In such systems, the software is divided up into many small tasks that are scheduled asynchronously according to their data dependencies. This enables a high level of parallelism, and avoids global barriers. A new system for managing task dependencies is developed in this thesis, based on data versioning. The system is implemented as a reusable software library, and shown to be as efficient or more efficient than other shared-memory task-based systems in experimental comparisons. The developed runtime system is then extended to distributed memory machines, and used for implementing a parallel version of a software for global climate simulations. By running the optimized and parallelized version on eight servers, an equally sized problem can be solved over 100 times faster than in the original sequential version. The parallel version also allowed significantly larger problems to be solved, previously unreachable due to memory constraints. / UPMARC / eSSENCE multicore scientific computing shared memory parallelism task-based programming parallel programming model task scheduling data versioning
3	Software-defined significance-driven computing Chalios, Charalambos January 2017 (has links) Approximate computing has been an emerging programming and system design paradigm that has been proposed as a way to overcome the power-wall problem that hinders the scaling of the next generation of both high-end and mobile computing systems. Towards this end, a lot of researchers have been studying the effects of approximation to applications and those hardware modifications that allow increased power benefits for reduced reliability. In this work, we focus on runtime system modifications and task-based programming models that enable software-controlled, user-driven approximate computing. We employ a systematic methodology that allows us to evaluate the potential energy and performance benefits of approximate computing using as building blocks unreliable hardware components. We present a set of extensions to OpenMP 4.0 that enable the programmer to define computations suitable for approximation. We introduce task-significance, a novel concept that describes the contribution of a task to the quality of the result. We use significance as a channel of communication from domain specific knowledge about applications towards the runtime-system, where we can optimise approximate execution depending on user constraints. Finally, we show extensions to the Linux kernel that enable it to operate seamlessly on top of unreliable memory and provide a user-space interface for memory allocation from the unreliable portion of the physical memory. Having this framework in place allowed us to identify what we call the refresh-by-access property of applications that use dynamic random-access memory (DRAM). We use this property to implement techniques for task-based applications that minimise the probability of errors when using unreliable memory enabling increased quality and power efficiency when using unreliable DRAM.
4	Un modèle de programmation à grain fin pour la parallélisation de solveurs linéaires creux / A fine grain model programming for parallelization of sparse linear solver Rossignon, Corentin 17 July 2015 (has links) La résolution de grands systèmes linéaires creux est un élément essentiel des simulations numériques.Ces résolutions peuvent représenter jusqu’à 80% du temps de calcul des simulations.Une parallélisation efficace des noyaux d’algèbre linéaire creuse conduira donc à obtenir de meilleures performances. En mémoire distribuée, la parallélisation de ces noyaux se fait le plus souvent en modifiant leschéma numérique. Par contre, en mémoire partagée, un parallélisme plus efficace peut être utilisé. Il est doncimportant d’utiliser deux niveaux de parallélisme, un premier niveau entre les noeuds d’une grappe de serveuret un deuxième niveau à l’intérieur du noeud. Lors de l’utilisation de méthodes itératives en mémoire partagée,les graphes de tâches permettent de décrire naturellement le parallélisme en prenant comme granularité letravail sur une ligne de la matrice. Malheureusement, cette granularité est trop fine et ne permet pas d’obtenirde bonnes performances à cause du surcoût de l’ordonnanceur de tâches.Dans cette thèse, nous étudions le problème de la granularité pour la parallélisation par graphe detâches. Nous proposons d’augmenter la granularité des tâches de calcul en créant des agrégats de tâchesqui deviendront eux-mêmes des tâches. L’ensemble de ces agrégats et des nouvelles dépendances entre lesagrégats forme un graphe de granularité plus grossière. Ce graphe est ensuite utilisé par un ordonnanceur detâches pour obtenir de meilleurs résultats. Nous utilisons comme exemple la factorisation LU incomplète d’unematrice creuse et nous montrons les améliorations apportées par cette méthode. Puis, dans un second temps,nous nous concentrons sur les machines à architecture NUMA. Dans le cas de l’utilisation d’algorithmeslimités par la bande passante mémoire, il est intéressant de réduire les effets NUMA liés à cette architectureen plaçant soi-même les données. Nous montrons comment prendre en compte ces effets dans un intergiciel àbase de tâches pour ainsi améliorer les performances d’un programme parallèle. / Solving large sparse linear system is an essential part of numerical simulations. These resolve can takeup to 80% of the total of the simulation time.An efficient parallelization of sparse linear kernels leads to better performances. In distributed memory,parallelization of these kernels is often done by changing the numerical scheme. Contrariwise, in sharedmemory, a more efficient parallelism can be used. It’s necessary to use two levels of parallelism, a first onebetween nodes of a cluster and a second inside a node.When using iterative methods in shared memory, task-based programming enables the possibility tonaturally describe the parallelism by using as granularity one line of the matrix for one task. Unfortunately,this granularity is too fine and doesn’t allow to obtain good performance.In this thesis, we study the granularity problem of the task-based parallelization. We offer to increasegrain size of computational tasks by creating aggregates of tasks which will become tasks themself. Thenew coarser task graph is composed by the set of these aggregates and the new dependencies betweenaggregates. Then a task scheduler schedules this new graph to obtain better performance. We use as examplethe Incomplete LU factorization of a sparse matrix and we show some improvements made by this method.Then, we focus on NUMA architecture computer. When we use a memory bandwidth limited algorithm onthis architecture, it is interesting to reduce NUMA effects. We show how to take into account these effects ina task-based runtime in order to improve performance of a parallel program. Parallélisme Graphe de tâches Supports d’exécution NUMA Multi-coeurs Algèbre linéaire creuse Parallelism Task-based programming Runtime NUMA Multicore Sparse linear algebra
5	Passage à l'echelle d'un support d'exécution à base de tâches pour l'algèbre linéaire dense / Scalability of a task-based runtime system for dense linear algebra applications Sergent, Marc 08 December 2016 (has links) La complexification des architectures matérielles pousse vers l’utilisation de paradigmes de programmation de haut niveau pour concevoir des applications scientifiques efficaces, portables et qui passent à l’échelle. Parmi ces paradigmes, la programmation par tâches permet d’abstraire la complexité des machines en représentant les applications comme des graphes de tâches orientés acycliques (DAG). En particulier, le modèle de programmation par tâches soumises séquentiellement (STF) permet de découpler la phase de soumission des tâches, séquentielle, de la phase d’exécution parallèle des tâches. Même si ce modèle permet des optimisations supplémentaires sur le graphe de tâches au moment de la soumission, il y a une préoccupation majeure sur la limite que la soumission séquentielle des tâches peut imposer aux performances de l’application lors du passage à l’échelle. Cette thèse se concentre sur l’étude du passage à l’échelle du support d’exécution StarPU (développé à Inria Bordeaux dans l’équipe STORM), qui implémente le modèle STF, dans le but d’optimiser les performances d’un solveur d’algèbre linéaire dense utilisé par le CEA pour faire de grandes simulations 3D. Nous avons collaboré avec l’équipe HiePACS d’Inria Bordeaux sur le logiciel Chameleon, qui est une collection de solveurs d’algèbre linéaire portés sur supports d’exécution à base de tâches, afin de produire un solveur d’algèbre linéaire dense sur StarPU efficace et qui passe à l’échelle jusqu’à 3 000 coeurs de calcul et 288 accélérateurs de type GPU du supercalculateur TERA-100 du CEA-DAM. / The ever-increasing supercomputer architectural complexity emphasizes the need for high-level parallel programming paradigms to design efficient, scalable and portable scientific applications. Among such paradigms, the task-based programming model abstracts away much of the architecture complexity by representing an application as a Directed Acyclic Graph (DAG) of tasks. Among them, the Sequential-Task-Flow (STF) model decouples the task submission step, sequential, from the parallel task execution step. While this model allows for further optimizations on the DAG of tasks at submission time, there is a key concern about the performance hindrance of sequential task submission when scaling. This thesis’ work focuses on studying the scalability of the STF-based StarPU runtime system (developed at Inria Bordeaux in the STORM team) for large scale 3D simulations of the CEA which uses dense linear algebra solvers. To that end, we collaborated with the HiePACS team of Inria Bordeaux on the Chameleon software, which is a collection of linear algebra solvers on top of task-based runtime systems, to produce an efficient and scalable dense linear algebra solver on top of StarPU up to 3,000 cores and 288 GPUs of CEA-DAM’s TERA-100 cluster. Calcul haute performance Supports d’exécution Calcul distribué Programmation par tâches Modèles de programmation parallèle High performance computing Run-time systems Distributed computing Task-based programming Parallel programming models
6	Algorithmes à grain fin et schémas numériques pour des simulations exascales de plasmas turbulents / Fine grain algorithm and numerical schemes for exascale simulation of turbulent plasmas Bouzat, Nicolas 17 December 2018 (has links) Les architectures de calcul haute performance les plus récentes intègrent de plus en plus de nœuds de calcul qui contiennent eux-mêmes plus de cœurs. Les bus mémoires et les réseaux de communication sont soumis à un niveau d'utilisation critique. La programmation parallèle sur ces nouvelles machines nécessite de porter une attention particulière à ces problématiques pour l'écriture de nouveaux algorithmes. Nous analysons dans cette thèse un code de simulation de turbulences de plasma et proposons une refonte de la parallélisation de l'opérateur de gyromoyenne plus adapté en termes de distribution de données et bénéficiant d'un schéma de recouvrement calcul -- communication efficace. Les optimisations permettent un gain vis-à-vis des coûts de communication et de l’empreinte mémoire. Nous étudions également les possibilités d'évolution de ce code à travers la conception d'un prototype utilisant un modèle programmation par tâche et un schéma de communication asynchrone adapté. Cela permet d'atteindre un meilleur équilibrage de charge afin de maximiser le temps de calcul et de minimiser les communications entre processus. Un maillage réduit adaptatif en espace est proposé, diminuant le nombre de points sans pour autant perdre en précision, mais ajoutant de fait une couche supplémentaire de complexité. Ce prototype explore également une distribution de données différente ainsi qu'un maillage en géométrie complexe adapté aux nouvelles configurations des tokamaks. Les performances de différentes optimisations sont étudiées et comparées avec le code préexistant et un cas dimensionnant sur un grand nombre de cœurs est présenté. / Recent high performance computing architectures come with more and more cores on a greater number of computational nodes. Memory buses and communication networks are facing critical levels of use. Programming parallel codes for those architectures requires to put the emphasize on those matters while writing tailored algorithms. In this thesis, a plasma turbulence simulation code is analyzed and its parallelization is overhauled. The gyroaverage operator benefits from a new algorithm that is better suited with regard to its data distribution and that uses a computation -- communication overlapping scheme. Those optimizations lead to an improvement by reducing both execution times and memory footprint. We also study new designs for the code by developing a prototype based on task programming model and an asynchronous communication scheme. It allows us to reach a better load balancing and thus to achieve better execution times by minimizing communication overheads. A new reduced mesh is introduced, shrinking the overall mesh size while keeping the same numerical accuracy but at the expense of more complex operators. This prototype also uses a new data distribution and twists the mesh to adapt to the complex geometries of modern tokamak reactors. Performance of the different optimizations is studied and compared to that of the current code. A case scaling on a large number of cores is given. Parallélisme Schémas numériques Programmation par tâches Recouvrement calcul-communication Ordonnancement Maillage réduit Distributed computing Numerical schemes Task-based programming Computation-communication overlap Tasks and communication ordering Reduced grid 005.1 530.44

1

Page generated in 0.0819 seconds