Global ETD Search

51	Διαχείριση κοινόχρηστων πόρων σε πολυεπεξεργαστικά συστήματα ενός ολοκληρωμένου Πετούμενος, Παύλος 06 October 2011 (has links) Στην παρούσα διατριβή προτείνονται μέθοδοι διαχείρισης των κοινόχρηστων πόρων σε υπολογιστικά συστήματα όπου πολλαπλοί επεξεργαστές μοιράζονται το ίδιο ολοκληρωμένο (Chip Multiprocessors – CMPs). Ενώ μέχρι πρόσφατα ο σχεδιασμός ενός υπολογιστικού συστήματος στόχευε στην ικανοποίηση των απαιτήσεων μόνο μίας εφαρμογής ανά χρονική περίοδο, τώρα πια απαιτείται και η εξισορρόπηση των απαιτήσεων διαφορετικών εφαρμογών που ανταγωνίζονται για την κατοχή των ίδιων πόρων. Σε πολλές περιπτώσεις, όμως, αυτό δεν αρκεί από μόνο του. Ακόμη και αν επιτευχθεί κάποιος ιδανικός διαμοιρασμός του πόρου, αν δεν βελτιστοποιηθεί ο τρόπος με τον οποίο χρησιμοποιούν οι επεξεργαστές τον κοινόχρηστο πόρο, δεν θα καταφέρει να εξυπηρετήσει ικανοποιητικά το αυξημένο φορτίο. Για να αντιμετωπιστούν τα προβλήματα που πηγάζουν από τον διαμοιρασμό των κοινόχρηστων πόρων, στην παρούσα εργασία προτείνονται τρεις εναλλακτικοί μηχανισμοί διαχείρισης. Η πρώτη μεθοδολογία εισάγει μία νέα θεωρητική μοντελοποίηση του διαμοιρασμού της κρυφής μνήμης, η οποία μπορεί να χρησιμοποιηθεί παράλληλα με την εκτέλεση των προγραμμάτων που διαμοιράζονται την κρυφή μνήμη. Η μεθοδολογία αξιοποιεί στην συνέχεια αυτήν την μοντελοποίηση, για να ελέγξει τον διαμοιρασμό της κρυφής μνήμης και να επιτύχει δικαιοσύνη στο πως κατανέμεται ο χώρος της κρυφής μνήμης μεταξύ των επεξεργαστών. Η δεύτερη μεθοδολογία παρουσιάζει μία νέα τεχνική για την πρόβλεψη της τοπικότητας των προσπελάσεων της κρυφής μνήμης. Καθώς η τοπικότητα είναι η βασική παράμετρος που καθορίζει την χρησιμότητα των δεδομένων της κρυφής μνήμης, χρησιμοποιώντας αυτήν την τεχνική πρόβλεψης μπορούν να οδηγηθούν μηχανισμοί διαχείρισης που βελτιώνουν την αξιοποίηση του χώρου της κρυφής μνήμης. Στα πλαίσια της μεθοδολογίας παρουσιάζουμε έναν τέτοιο μηχανισμό, ο οποίος στοχεύει στην ελαχιστοποίηση των αστοχιών της κρυφής μνήμης μέσω μίας νέας πολιτικής αντικατάστασης. Η τελευταία μεθοδολογία που παρουσιάζεται είναι μία μεθοδολογία για την μείωση της κατανάλωσης ενέργειας της ουράς εντολών, που είναι μία από τις πιο ενεργειακά απαιτητικές δομές του επεξεργαστή. Στα πλαίσια της μεθοδολογίας, δείχνεται ότι το κλειδί για την αποδοτική μείωση της κατανάλωσης ενέργειας της ουράς εντολών βρίσκεται στην αλληλεπίδραση της με το υποσύστημα μνήμης. Με βάση αυτό το συμπέρασμα, παρουσιάζουμε έναν νέο μηχανισμό δυναμικής διαχείρισης του μεγέθους της ουράς εντολών, ο οποίος συνδυάζει επιθετική μείωση της κατανάλωσης ενέργειας του επεξεργαστή με διατήρηση της υψηλής απόδοσής του. / This dissertation proposes methodologies for the management of shared resources in chip multi-processors (CMP). Until recently, the design of a computing system had to satisfy the computational and storage needs of a single program during each time period. Now instead, the designer has to balance the, perhaps conflicting, needs of multiple programs competing for the same resources. But, in many cases, even this is not enough. Even if we could invent a perfect way to manage sharing, without optimizing the way that each processor uses the shared resource, the resource could not deal efficiently with the increased load. In order to handle the negative effects of resource sharing, this dissertation proposes three management mechanisms. The first one introduces a novel theoretical model of the sharing of the shared cache, which can be used at run-time. Furthermore, out methodology uses the model to control sharing and to achieve a sense of justice in the way the cache is shared among the processors. Our second methodology presents a new technique for predicting the locality of cache accesses. Since locality determines, almost entirely, the usefulness of cache data, our technique can be used to drive any management mechanism which strives to improve the efficiency of the cache. As part of our methodology, we present such a mechanism, a new cache replacement policy which tries to minimize cache misses by near-optimal replacement decisions. The last methodology presented in this dissertation, targets the energy consumption of the processor. To that end, our methodology shows that the key to reducing the power consumption of the Issue Queue, without disproportional performance degradation, lies at the interaction of the Issue Queue with the memory subsystem: as long as the management of the Issue Queue doesn’t reduce the utilization of the memory subsystem, the effects of the management on the processor’s performance will be minimal. Based on this conclusion, we introduce a new mechanism for dynamically resizing the Issue Queue, which achieves aggressive downsizing and energy savings with almost no performance degradation. Κρυφές μνήμες Ουρά εντολών Μηχανισμοί πρόβλεψης Κατανάλωση ισχύος 004.35 Computer architecture Chip multiprocessors CPU caches Instruction queue Prediction mechanisms Power-aware management techniques
52	Auto-tuning pour la détection automatique du meilleur format de compression pour matrice creuse / Auto-tuning for the automatic sparse matrix optimal compression format detection Mehrez, Ichrak 27 September 2018 (has links) De nombreuses applications en calcul scientifique traitent des matrices creuses de grande taille ayant des structures régulières ou irrégulières. Pour réduire à la fois la complexité spatiale et la complexité temporelle du traitement, ces matrices nécessitent l'utilisation d'une structure particulière de stockage (ou de compression) des données ainsi que l’utilisation d’architectures cibles parallèles/distribuées. Le choix du format de compression le plus adéquat dépend généralement de plusieurs facteurs dont, en particulier, la structure de la matrice creuse, l'architecture de la plateforme cible et la méthode numérique utilisée. Etant donné la diversité de ces facteurs, un choix optimisé pour un ensemble de données d'entrée peut induire de mauvaises performances pour un autre. D’où l’intérêt d’utiliser un système permettant la sélection automatique du meilleur format de compression (MFC) en prenant en compte ces différents facteurs. C’est dans ce cadre précis que s’inscrit cette thèse. Nous détaillons notre approche en présentant la modélisation d'un système automatique qui, étant donnée une matrice creuse, une méthode numérique, un modèle de programmation parallèle et une architecture, permet de sélectionner automatiquement le MFC. Dans une première étape, nous validons notre modélisation par une étude de cas impliquant (i) Horner, et par la suite le produit matrice-vecteur creux (PMVC), comme méthodes numériques, (ii) CSC, CSR, ELL, et COO comme formats de compression, (iii) le data parallèle comme modèle de programmation et (iv) une plateforme multicœurs comme architecture cible. Cette étude nous permet d’extraire un ensemble de métriques et paramètres qui influent sur la sélection du MFC. Nous démontrons que les métriques extraites de l'analyse du modèle data parallèle ne suffisent pas pour prendre une décision (sélection du MFC). Par conséquent, nous définissons de nouvelles métriques impliquant le nombre d'opérations effectuées par la méthode numérique et le nombre d’accès à la mémoire. Ainsi, nous proposons un processus de décision prenant en compte à la fois l'analyse du modèle data parallèle et l'analyse de l’algorithme. Dans une deuxième étape, et en se basant sur les données que nous avons extrait précédemment, nous utilisons les algorithmes du Machine Learning pour prédire le MFC d’une matrice creuse donnée. Une étude expérimentale ciblant une plateforme parallèle multicœurs et traitant des matrices creuses aléatoires et/ou provenant de problèmes réels permet de valider notre approche et d’évaluer ses performances. Comme travail futur, nous visons la validation de notre approche en utilisant d'autres plateformes parallèles telles que les GPUs. / Several applications in scientific computing deals with large sparse matrices having regular or irregular structures. In order to reduce required memory space and computing time, these matrices require the use of a particular data storage structure as well as the use of parallel/distributed target architectures. The choice of the most appropriate compression format generally depends on several factors, such as matrix structure, numerical method and target architecture. Given the diversity of these factors, an optimized choice for one input data set will likely have poor performances on another. Hence the interest of using a system allowing the automatic selection of the Optimal Compression Format (OCF) by taking into account these different factors. This thesis is written in this context. We detail our approach by presenting a design of an auto-tuner system for OCF selection. Given a sparse matrix, a numerical method, a parallel programming model and an architecture, our system can automatically select the OCF. In a first step, we validate our modeling by a case study that concerns (i) Horner scheme, and then the sparse matrix vector product (SMVP), as numerical methods, (ii) CSC, CSR, ELL, and COO as compression formats; (iii) data parallel as a programming model; and (iv) a multicore platform as target architecture. This study allows us to extract a set of metrics and parameters that affect the OCF selection. We note that data parallel metrics are not sufficient to accurately choose the most suitable format. Therefore, we define new metrics involving the number of operations and the number of indirect data access. Thus, we proposed a new decision process taking into account data parallel model analysis and algorithm analysis.In the second step, we propose to use machine learning algorithm to predict the OCF for a given sparse matrix. An experimental study using a multicore parallel platform and dealing with random and/or real-world random matrices validates our approach and evaluates its performances. As future work, we aim to validate our approach using other parallel platforms such as GPUs. Format de compression Machine learning Matrice creuse Data parallèle Auto-tuning Cluster Compression format Machine learning Sparse matrix Data parallel Auto-tuning Cluster 004.35
53	Παράλληλοι αλγόριθμοι και εφαρμογές σε πολυπύρηνες μονάδες επεξεργασίας γραφικών / Parallel algorithms and applications in manycore graphics processing units Κολώνιας, Βασίλειος 05 February 2015 (has links) Στην παρούσα διατριβή παρουσιάζονται παράλληλοι αλγόριθμοι και εφαρμογές σε πολυπύρηνες μονάδες επεξεργασίας γραφικών. Πιο συγκεκριμένα, εξετάζονται οι μέθοδοι σχεδίασης ενός παράλληλου αλγορίθμου για την επίλυση τόσο απλών και κοινών προβλημάτων, όπως η ταξινόμηση, όσο και υπολογιστικά απαιτητικών προβλημάτων, έτσι ώστε να εκμεταλλευτούμε πλήρως την τεράστια υπολογιστική δύναμη που προσφέρουν οι σύγχρονες μονάδες επεξεργασίας γραφικών. Πρώτο πρόβλημα που εξετάστηκε είναι η ταξινόμηση, η οποία είναι ένα από τα πιο συνηθισμένα προβλήματα στην επιστήμη των υπολογιστών. Υπάρχει σαν εσωτερικό πρόβλημα σε πολλές εφαρμογές, επομένως πετυχαίνοντας πιο γρήγορη ταξινόμηση πετυχαίνουμε πιο καλή απόδοση γενικότερα. Στο Κεφάλαιο 3 περιγράφονται όλα τα βήματα σχεδιασμού για την εκτέλεση ενός αλγορίθμου ταξινόμησης για ακεραίους, της count sort, σε μια μονάδα επεξεργασίας γραφικών. Σημαντική επίδραση στην απόδοση είχε η αποφυγή του συγχρονισμού των νημάτων στο τελευταίο βήμα του αλγορίθμου. Στη συνέχεια παρουσιάζονται εφαρμογές παράλληλων αλγορίθμων σε υπολογιστικά απαιτητικά προβλήματα. Στο Κεφάλαιο 4, εξετάζεται το πρόβλημα χρονοπρογραμματισμού εξετάσεων Πανεπιστημίων, το οποίο είναι ένα πρόβλημα συνδυαστικής βελτιστοποίησης. Για την επίλυσή του χρησιμοποιείται ένας υβριδικός εξελικτικός αλγόριθμος, ο οποίος εκτελείται εξ' ολοκλήρου στην μονάδα επεξεργασίας γραφικών. Η τεράστια υπολογιστική δύναμη της GPU και ο παράλληλος προγραμματισμός δίνουν τη δυνατότητα χρήσης μεγάλων πληθυσμών έτσι ώστε να εξερευνήσουμε καλύτερα τον χώρο λύσεων και να πάρουμε καλύτερα ποιοτικά αποτελέσματα. Στο επόμενο κεφάλαιο γίνεται επίλυση του προβλήματος σχεδιασμού κίνησης για υποθαλάσσια οχήματα με βραχίονα. Εξετάζεται το πρόβλημα τόσο του ολικού σχεδιασμού όσο και του τοπικού. Στην πρώτη περίπτωση είναι σημαντική η καλή λύση και η ακρίβεια και ο παράλληλος αλγόριθμος που χρησιμοποιείται για την αναπαράσταση του περιβάλλοντος εργασίας σε μια Bump-επιφάνεια βοηθάει προς αυτή την κατεύθυνση. Στη δεύτερη περίπτωση, το πρόβλημα είναι πρόβλημα πραγματικού χρόνου και μας ενδιαφέρει η ταχύτητα εύρεσης της επόμενης θέσης του οχήματος. Ο παράλληλος προγραμματισμός και η GPU βοηθούν σημαντικά σε αυτό. Τελευταία εφαρμογή που εξετάστηκε είναι η μελέτη ενός συστήματος ημιφθοριωμένων αλκανίων με την μοριακή προσομοίωση Monte Carlo. Η παραλληλοποίηση ενός μέρους, του πιο χρονοβόρου, του αλγορίθμου έδωσε τη δυνατότητα εξέτασης ενός πολύ μεγαλύτερου συστήματος σε αποδεκτό χρόνο. Σε γενικές γραμμές, γίνεται φανερό ότι ο παράλληλος προγραμματισμός και οι σύγχρονες πολυπύρηνες αρχιτεκτονικές, όπως οι μονάδες επεξεργασίας γραφικών, δίνουν νέες δυνατότητες στην αντιμετώπιση καθημερινών προβλημάτων, προβλημάτων πραγματικού χρόνου και προβλημάτων συνδυαστικής βελτιστοποίησης. / In this thesis, parallel algorithms and applications in manycore graphics processing units are presented. More specifically, we examine methods of designing a parallel algorithm for solving both simple and common problems such as sorting, and computationally demanding problems, so as to fully exploit the enormous computing power of modern graphics processing units (GPUs). First problem considered is sorting, which is one of the most common problems in computer science. It exists as an internal problem in many applications. Therefore, sorting faster, results in better performance in general. Chapter 3 describes all design options for the implementation of a sorting algorithm for integers, count sort, on a graphics processing unit. The elimination of thread synchronization in the last step of the algorithm had a significant effect on the performance. Chapter 4 addresses the examination timetabling problem for Universities, which is a combinatorial optimization problem. A hybrid evolutionary algorithm, which runs entirely on GPU, was used to solve the problem. The tremendous computing power of GPU and parallel programming enable the use of large populations in order to explore better the solution space and get better quality results. In the next chapter, the problem of motion planning for underwater vehicle manipulator systems is examined. In the gross motion planning problem, it is important to achieve a good solution with high accuracy. The parallel algorithm used for the representation of the working environment in a Bump-surface is a step towards this direction. In the local motion planning problem, which is a real-time problem, the time needed to find the next configuration of the vehicle is crucial. Parallel programming and the GPU greatly assist in this online problem. Last application considered is the atomistic Monte Carlo simulation of semifluorinated alkanes. The parallelization of part of the algorithm, the most time-consuming, enabled the study of a much larger system in an acceptable execution time. In general, it becomes obvious that parallel programming and new novel manycore architectures, such as graphics processing units, give new capabilities for solving everyday problems, real time and combinatorial optimization problems. Σχεδιασμός κίνησης Μοριακή προσομοίωση 004.35 Parallel programming Manycore architectures Graphics processing units Count sort Examination timetabling problem Evolutionary algorithms Motion planning Molecular simulations Monte Carlo
54	Algorithmes parallèles pour le suivi de particules / Parallel algorithms for tracking of particles Bonnier, Florent 12 December 2018 (has links) Les méthodes de suivi de particules sont couramment utilisées en mécanique des fluides de par leur propriété unique de reconstruire de longues trajectoires avec une haute résolution spatiale et temporelle. De fait, de nombreuses applications industrielles mettant en jeu des écoulements gaz-particules, comme les turbines aéronautiques utilisent un formalisme Euler-Lagrange. L’augmentation rapide de la puissance de calcul des machines massivement parallèles et l’arrivée des machines atteignant le petaflops ouvrent une nouvelle voie pour des simulations qui étaient prohibitives il y a encore une décennie. La mise en oeuvre d’un code parallèle efficace pour maintenir une bonne performance sur un grand nombre de processeurs devra être étudié. On s’attachera en particuliers à conserver un bon équilibre des charges sur les processeurs. De plus, une attention particulière aux structures de données devra être fait afin de conserver une certaine simplicité et la portabilité et l’adaptabilité du code pour différentes architectures et différents problèmes utilisant une approche Lagrangienne. Ainsi, certains algorithmes sont à repenser pour tenir compte de ces contraintes. La puissance de calcul permettant de résoudre ces problèmes est offerte par des nouvelles architectures distribuées avec un nombre important de coeurs. Cependant, l’exploitation efficace de ces architectures est une tâche très délicate nécessitant une maîtrise des architectures ciblées, des modèles de programmation associés et des applications visées. La complexité de ces nouvelles générations des architectures distribuées est essentiellement due à un très grand nombre de noeuds multi-coeurs. Ces noeuds ou une partie d’entre eux peuvent être hétérogènes et parfois distants. L’approche de la plupart des bibliothèques parallèles (PBLAS, ScalAPACK, P_ARPACK) consiste à mettre en oeuvre la version distribuée de ses opérations de base, ce qui signifie que les sous-programmes de ces bibliothèques ne peuvent pas adapter leurs comportements aux types de données. Ces sous programmes doivent être définis une fois pour l’utilisation dans le cas séquentiel et une autre fois pour le cas parallèle. L’approche par composants permet la modularité et l’extensibilité de certaines bibliothèques numériques (comme par exemple PETSc) tout en offrant la réutilisation de code séquentiel et parallèle. Cette approche récente pour modéliser des bibliothèques numériques séquentielles/parallèles est très prometteuse grâce à ses possibilités de réutilisation et son moindre coût de maintenance. Dans les applications industrielles, le besoin de l’emploi des techniques du génie logiciel pour le calcul scientifique dont la réutilisabilité est un des éléments des plus importants, est de plus en plus mis en évidence. Cependant, ces techniques ne sont pas encore maÃotrisées et les modèles ne sont pas encore bien définis. La recherche de méthodologies afin de concevoir et réaliser des bibliothèques réutilisables est motivée, entre autres, par les besoins du monde industriel dans ce domaine. L’objectif principal de ce projet de thèse est de définir des stratégies de conception d’une bibliothèque numérique parallèle pour le suivi lagrangien en utilisant une approche par composants. Ces stratégies devront permettre la réutilisation du code séquentiel dans les versions parallèles tout en permettant l’optimisation des performances. L’étude devra être basée sur une séparation entre le flux de contrôle et la gestion des flux de données. Elle devra s’étendre aux modèles de parallélisme permettant l’exploitation d’un grand nombre de coeurs en mémoire partagée et distribuée. / The complexity of these new generations of distributed architectures is essencially due to a high number of multi-core nodes. Most of the nodes can be heterogeneous and sometimes remote. Today, nor the high number of nodes, nor the processes that compose the nodes are exploited by most of applications and numerical libraries. The approach of most of parallel libraries (PBLAS, ScalAPACK, P_ARPACK) consists in implementing the distributed version of its base operations, which means that the subroutines of these libraries can not adapt their behaviors to the data types. These subroutines must be defined once for use in the sequential case and again for the parallel case. The object-oriented approach allows the modularity and scalability of some digital libraries (such as PETSc) and the reusability of sequential and parallel code. This modern approach to modelize sequential/parallel libraries is very promising because of its reusability and low maintenance cost. In industrial applications, the need for the use of software engineering techniques for scientific computation, whose reusability is one of the most important elements, is increasingly highlighted. However, these techniques are not yet well defined. The search for methodologies for designing and producing reusable libraries is motivated by the needs of the industries in this field. The main objective of this thesis is to define strategies for designing a parallel library for Lagrangian particle tracking using a component approach. These strategies should allow the reuse of the sequential code in the parallel versions while allowing the optimization of the performances. The study should be based on a separation between the control flow and the data flow management. It should extend to models of parallelism allowing the exploitation of a large number of cores in shared and distributed memory. Suivi de particules Parallélisme à grande échelle Lagrange HPC Architecture par composants Calcul parallèle Lagrangien Particle tracking Large-Scale parallelism Serial/parallel reusability Lagrange HPC Component based software Parallel computation Lagrangian 004.35

Page generated in 0.0188 seconds