Global ETD Search

11	Contribution à l’émergence de nouvelles méthodes parallèles et réparties intelligentes utilisant un paradigme de programmation multi-niveaux pour le calcul extrême / Contribution to the emergence of new intelligent parallel and distributed methods using a multi-level programming paradigm for extreme computing Wu, Xinzhe 22 March 2019 (has links) Les méthodes itératives de Krylov sont utilisées sur les plate-formes de Calcul Haute Performance (CHP) pour résoudre les grands systèmes linéaires issus des domaines de la science et de l’ingénierie. Avec l’augmentation du nombre de cœurs et de l’hétérogénéité des superordinateurs, le temps consacré à la communication et synchronisation globales nuit gravement aux leurs performances parallèles. La programmation tend à être distribuée et parallèle. Le développement d’algorithmes devrait prendre en compte les principes: 1) parallélisme avec multi-granularité; 2) mémoire hiérarchique; 3) minimisation de la communication globale; 4) promotion de l’asynchronicité; 5) proposition de stratégies d’ordonnancement et de moteurs de gestion pour gérer les trafics et la tolérance aux pannes. En réponse à ces objectifs, nous présentons un paradigme de programmation multi-niveaux distribués et parallèles pour les méthodes de Krylov sur les plates-formes de CHP. La première partie porte sur la mise en œuvre d’un générateur de matrices avec des valeurs propres prescrites pour la référence des méthodes itératives. Dans la deuxième partie, nous étudions les performances numériques et parallèles de la méthode itérative proposée. Son implémentation avec un moteur de gestion peut gérer la communication, la tolérance aux pannes et la réutilisabilité. Dans la troisième partie, un schéma de réglage automatique est introduit pour la sélection intelligente de ses paramètres lors de l’exécution. Enfin, nous étudions la possibilité d’implémenter ce paradigme dans un environnement d’exécution de flux de travail. / Krylov iterative methods are frequently used on High-Performance Computing (HPC) systems to solve the extremely large sparse linear systems and eigenvalue problems from science and engineering fields. With the increase of both number of computing units and the heterogeneity of supercomputers, time spent in the global communication and synchronization severely damage the parallel performance of iterative methods. Programming on supercomputers tends to become distributed and parallel. Algorithm development should consider the principles: 1) multi-granularity parallelism; 2) hierarchical memory; 3) minimization of global communication; 4) promotion of the asynchronicity; 5) proposition of multi-level scheduling strategies and manager engines to handle huge traffic and improve the fault tolerance. In response to these goals, we present a distributed and parallel multi-level programming paradigm for Krylov methods on HPC platforms. The first part of our work focuses on an implementation of a scalable matrix generator to create test matrices with customized eigenvalue for benchmarking iterative methods on supercomputers. In the second part, we aim to study the numerical and parallel performance of proposed distributed and parallel iterative method. Its implementation with a manager engine and runtime can handle the huge communication traffic, fault tolerance, and reusability. In the third part, an auto-tuning scheme is introduced for the smart selection of its parameters at runtime. Finally, we analyse the possibility to implement the distributed and parallel paradigm by a graph-based workflow runtime environment. Techniques de préconditionnement Communications asynchrones Architecture hétérogène Architecture homogène Informatique exascale 005.275
12	Πειραματική αξιολόγηση μεθοδολογίας βελτιστοποίησης του αλγόριθμου πολλαπλασιασμού πίνακα επί διάνυσμα σε μονοπύρηνες και πολυπύρηνες αρχιτεκτονικές Παπαδήμα, Ελισσάβετ 30 April 2014 (has links) Στην παρούσα διπλωματική εργασία έγινε υλοποίηση και πειραματική αξιολόγηση μιας μεθοδολογίας η οποία έχει αναπτυχθεί στο Εργαστήριο Ολοκληρωμένων Κυκλωμάτων και αφορά τη βελτιστοποίηση του Πολλαπλασιασμού Πίνακα επί Διάνυσμα (ΠΠΔ) σε μονοπύρηνους και πολυπύρηνους επεξεργαστές. Η μεθοδολογία εκμεταλλεύεται το σύνολο των χαρακτηριστικών της αρχιτεκτονικής που χρησιμοποιείται και συγκεκριμένα (α) την ιεραρχία της μνήμης, (β) το μέγεθος της κρυφής μνήμης, (γ) το βαθμό συσχέτισης κάθε επιπέδου της κρυφής μνήμης, (δ) την καθυστέρηση της μνήμης και (ε) το πλήθος των πυρήνων. Είναι η πρώτη φορά που λαμβάνεται υπόψη ο βαθμός συσχέτισης της μνήμης. Σκοπός της μεθοδολογίας είναι η βελτιστοποίηση με βάση όλες τις παραμέτρους μαζί και όχι καθεμία ξεχωριστά. Για να βελτιωθεί η απόδοση προτείνεται διαφορετικός χρονοπρογραμματισμός ανάλογα με το μέγεθος του πίνακα. Για την πειραματική αξιολόγηση χρησιμοποιήθηκαν οι επεξεργαστές γενικού σκοπού Intel Core 2 Duo E6065 και Τ6600, ο Intel i7-3930K και ο ενσωματωμένος επεξεργαστής ειδικού σκοπού Microblaze από το Virtex-5 FPGA (Xilinx). Τα αποτελέσματα συγκρίνονται με την state-of-the-art βιβλιοθήκη ATLAS (Automatically Tuned Linear Algebra Software) και παρουσιάζουν βελτίωση 30%. Από τα πειραματικά αποτελέσματα είναι φανερό ότι η κύρια μνήμη είναι το bottleneck του προβλήματος. Επίσης, η απόδοση βελτιώνεται όταν αλλάζει το layout του πίνακα τόσο σε μονοπύρηνες όσο σε πολυπύρηνες αρχιτεκτονικές. Όσον αφορά τη μέθοδο του tiling τα πειραματικά αποτελέσματα δείχνουν ότι η μείωση των αστοχιών δεν βελτιώνει πάντα την απόδοση γιατί υπάρχει trade-off ανάμεσα στο μέγεθος του tile και στις εντολές διευθυνσιοδότησης. Επίσης, είναι φανερό ότι στις πολυπύρηνες αρχιτεκτονικές δεν υπάρχει γραμμική σχέση της απόδοσης και του πλήθους των πυρήνων που χρησιμοποιούνται. Αυτό οφείλεται στο περιορισμένο εύρος ζώνης της μνήμης. / The subject of this MSc Thesis is the implementation and the experimental evaluation of a methodology that has been developed at the Laboratory of Integrated Circuits and optimizes the Matrix Vector Multiplication (MVM) in single-core and multi-core processors. The methodology fully exploits the characteristics of the architecture. Specifically, it exploits (a) the hierarchy of the memory, (b) the cache size, (c) the cache associativity, (d) the memory latency and (e) the number of the cores. It is the first time that the cache associativity is taken into account. The methodology optimizes all the parameters together as one problem and not separately. A different scheduling is proposed according to the size of the matrix. The general purpose processors Intel Core 2 Duo E6065, Intel Core 2 Duo T6600 and Intel i7-3930K and the embedded processor Virtex-5 Microblaze have been used. The results have been compared with the state-of-the-art library ATLAS (Automatically Tuned Linear Algebra Software) and the performance is improved by 30%. According to the experimental results, it is obvious that the bottleneck is the memory latency. Moreover, the performance is increased when a new way of saving the matrix in the main memory (data array layout) is used in both single-core and multi-core architectures. As far as the tiling is concerned, the experimental results indicate that the decrease of the misses does not always improve the performance because there is a trade-off between the tile size and the addressing instructions. According to the experimental results, as far as multicore architectures are concerned, there is no linear relation between the performance and the number of the cores, because of the limited memory bandwidth. 005.275 Matrix vector multiplication Multicore architectures
13	Παραλληλισμός αλγορίθμων σε κάρτες γραφικών για σχεδιασμό κίνησης Πάσχος, Ανδρέας 16 May 2014 (has links) Στην παρούσα διπλωματική, κύριος στόχος ήταν η παραλληλοποίηση ενός αλγορίθμου σχεδιασμού κίνησης για κάρτες γραφικών. Για το σκοπό αυτό, χρησιμοποιήθηκε ο Probabilistic Road Map (PRM), ένας αλγόριθμος που προσφέρει μεγάλο βαθμό παραλληλισμού και, συνεπώς, προτείνεται για υλοποίηση σε πολυπύρηνους επεξεργαστές. Το πλαίσιο εργασίας που χρησιμοποιήθηκε για τον προγραμματισμό στην κάρτα γραφικών ήταν το OpenCL επειδή προσφέρει ένα αφαιρετικό επίπεδο προγραμματισμού ανεξαρτήτως υλικού και μπορεί να μεταφερθεί σε κάρτες γραφικών από διαφορετικούς κατασκευαστές. Ο αλγόριθμος αποσυντέθηκε στα δομικά του μέρη και καθένα από αυτά μελετήθηκε ξεχωριστά, ώστε να παραλληλοποιηθεί. Κατά τη διαδικασία αυτή, λοιπόν, υλοποιήθηκαν οι εξής αλγόριθμοι: • Ταξινόμηση • Αναζήτηση Γράφου κατά Πλάτος • Κατακερματισμός • Αναζήτηση Κοντινότερων Γειτόνων Οι παραπάνω αλγόριθμοι έχουν γραφτεί με τέτοιο τρόπο ώστε να μπορούν να χρησιμοποιηθούν αυτόνομα, ως ξεχωριστά κομμάτια. / In this thesis work, the main objective was the parallelization of a motion planning algorithm for graphics card units. For this purpose, the Probabilistic Road Map (PRM) was chosen, an algorithm that offers a high degree of parallelism and, consequently, is suggested for implementation in many core processing units. The framework used for GPU programming was OpenCL because it provides an abstraction programming layer independent of hardware and is portable among GPUs. The algorithm was decomposed in its structural components and each one of them was processed indepedently with the purpose of massive parallelization. During this process, the following algorithms were implemented: • Sorting • Breadth First Traversal • Hashing • Nearest Neighbours Search The above algorithms have been written in such a way so that they can be used as separate parts. Σχεδιασμός κίνησης Τεχνητή νοημοσύνη Κάρτες γραφικών Ρομποτική 005.275 Motion planning Artificial intelligence GPU Robotics Opencl
14	Αρχιτεκτονική προσομοίωση σε επεξεργαστικές μονάδες υψηλού βαθμού παραλληλίας Στρίκος, Νικόλαος 11 January 2011 (has links) Η πρόσφατη εξάπλωση που είδε το μοντέλο της παράλληλης επεξεργασίας στους μικροεπεξεργαστές γενικής χρήσης με την εισαγωγή περισσότερων από έναν πυρήνες εντός του ολοκληρωμένου κυκλώματος έφερε νέες απαιτήσεις στις μεθόδους προσομοίωσης που παραδοσιακά χρησιμοποιήθηκαν για την εξερεύνηση νέων αρχιτεκτονικών. Στην εργασία αυτή προτείνεται ένα πλαίσιο και ένα προγραμματιστικό μοντέλο που κάνει χρήση της αρχιτεκτονικής υψηλού βαθμού παραλληλίας CUDA για να επιτύχει επιτάχυνση στην αρχιτεκτονική προσομοίωση πρωτοκόλλων συνοχής κρυφής μνήμης. / The recent adoption of the parallel computing model in general-use microprocessors with the inclusion of more than one cores in the IC has raised new demands for the simulation methodologies that have been traditionally used. In this work, a framework and a programming model are proposed that make use of the highly parallel CUDA platform to accelerate architectural simulation of cache coherency protocols. Μικροεπεξεργαστές Κρυφή μνήμη 005.275 GPU CUDA Cache coherency protocols Parallel simulation
15	Υπολογιστικές εφαρμογές σε περιβάλλον παράλληλης επεξεργασίας Κομηνός, Χαράλαμπος Γαβριήλ 10 March 2014 (has links) Η παρούσα διπλωματική εργασία πραγματοποιήθηκε κατά το διάστημα 2012-2013 στο Εργαστήριο Συστημάτων Υπολογιστών (CSL) του Πανεπιστημίου Πατρών. Στόχος της εργασίας είναι η επίλυση ενός συνόλου προβλημάτων χρονοπρογραμματισμού εξετάσεων (ETP, Carter Dataset), με χρήση πληροφορημένου γενετικού αλγορίθμου. Στην εργασία αυτή θα παρουσιαστούν, τα βασικά μοντέλα λειτουργίας των γενετικών αλγορίθμων, του ETP καθώς και παρουσίαση βασικών εννοιών των παράλληλων συστημάτων. Τέλος παρουσιάζεται ο σειριακός κώδικας που υλοποιήθηκε σε ANSI-C και στην συνέχεια γίνεται σύγκριση με τον παράλληλο κώδικα που υλοποιήθηκε με MPI-C και παρουσιάζονται τα αποτελέσματα της σύγκρισης μεταξύ των δύο. / The Aim of this thesis which was completed during the 2012/2013 academic year at the Computer Systems Laboratory (CSL) at the University of Patras is to solve a set of Examination Timetabling Problems (Carter Dataset,ETP) with the aid of an informed genetic algorithm. I will present the basic model under which the genetic algorithms operate and some information about the ETP and general parallel systems. To conclude we will present our serial ANSI-C code and compare it with the parallel MPI-C code that we build and compare the two results. Γενετικοί αλγόριθμοι 005.275 MPI-C Genetic algorithms Examination timetabling Parallel computing Carter dataset
16	Sections atomiques emboîtées avec échappement de processus légers : sémantiques et compilation / Nested atomic sections with thread escape : semantics and compilation Pinsard, Thomas 15 December 2014 (has links) La mémoire transactionnelle est un mécanisme de plus en plus populaire pour la programmation parallèle et concurrente. Dans la plupart des implantations, l’emboîtement de transactions n’est pas possible ce qui pénalise la modularité. Plutôt que les transactions, qui sont un choix possible d’implantation, nous considérons directement la notion de section atomique. Dans un objectif d’améliorer la modularité et l’expressivité, nous considérons un langage impératif simple étendu avec des instructions de parallélisme avec lancement et attente de processus légers et une instruction de section atomique à portée syntaxique, depuis laquelle des processus légers peuvent s’échapper. Dans ce contexte notre première contribution est la définition précise de l’atomicité et de la bonne synchronisation. Nous prouvons que pour des traces bien formées, la dernière implique la forme forte de la première. Ceci est fait sur des traces d’exécution abstraites dans le sens où nous ne définissons par précisément la syntaxe et la sémantique opérationnelle d’un langage de programmation. Cette première partie de notre travail peut être considérée comme une spécification pour un tel langage. Nous avons utilisé l’assistant de preuve Coq pour modéliser et prouver nos résultats. Notre deuxième contribution est la définition formelle du langage Atomic Fork Join (AFJ). Nous montrons que les traces de sa sémantique opérationnelle vérifient effectivement les conditions de bonne formation définies précédemment. La troisième contribution est la compilation de programmes AFJ en programmes Lock Unlock Fork Join (LUFJ) un langage avec processus léger et verrous mais sans sections atomiques. Nous étudions la correction de la compilation de AFJ vers LUFJ. / Transactions are becoming a popular mechanism for parallel and concurrent programming. In most implementations the nesting of transactions is not supported which hinders modularity. Rather than transactions, which are an implementation choice, we consider directly the notion of atomic section. For the sake of modularity with we consider a simple imperative language with fork/join parallelism and lexically scoped nested atomic sections from which threads can escape. In this context, our first contribution is the precise definition of atomicity, well-synchronisation and the proof that the latter implies the strong form of the former. This is done on execution traces without being specific to a language syntax and operational semantics. This first part of our work could be considered as a specification for the design and implementation of such a parallel language. A formalisation of our results in the Coq proof assistant is also available. Our second contribution is a formal definition of the Atomic Fork Join (AFJ) language and its operational semantics. We show that it indeed satisfies the conditions previously defined. The third contribution of our work is a compilation procedure of AFJ programs to programs another language with threads and locks but without atomic sections, named Lock Unlock Fork Join (LUFJ). We study the correctness of the compilation from AFJ to LUFJ. Sémantique formelle Parallélisme Processus léger Section atomique Verrou Bonne synchronisation Trace de programme Atomicité Assistant de preuve Sémantique opérationnelle Compilation Préservation de la sémantique Formal semantics Parallelism Thread Atomic section Lock Well-synchronisation Program trace Atomicity Proof assistant Operational semantics Compilation Semantic preservation 005.275

Page generated in 0.0274 seconds