Global ETD Search

31	A data dependency recovery system for a heterogeneous multicore processor Kainth, Haresh S. January 2014 Multicore processors often increase the performance of applications. However, with their deeper pipelining, they have proven increasingly difficult to improve. In an attempt to deliver enhanced performance at lower power requirements, semiconductor microprocessor manufacturers have progressively utilised chip-multicore processors. Existing research has relied on a very common technique known as thread-level speculation, which attempts to compute results before the actual result is known. However, thread-level speculation affects operation latency and circuit timing, and it complicates data cache behaviour and code generation in the compiler. We describe a software framework, codenamed Lyuba, that handles low-level data hazards and automatically recovers the application from them, without programmer intervention or speculation, on an asymmetric chip-multicore processor. Determining the correct execution of multiple threads when data hazards occur on conventional symmetric chip-multicore processors is a significant and ongoing challenge. However, there has been very little focus on the use of asymmetric (heterogeneous) processors with applications that have complex data dependencies. The purpose of this thesis is to: (i) define the development of a software framework for an asymmetric (heterogeneous) chip-multicore processor; (ii) present optimal software control of hardware for distributed processing and recovery from violations; (iii) provide performance results for five applications using three datasets. Applications with a small dataset showed an improvement of 17% and those with a larger dataset an improvement of 16%, giving an overall improvement of 11% in performance. 004.35
32	Enhancing the performance of decoupled software pipeline through backward slicing Alwan, Esraa January 2014 The rapidly increasing number of cores available in multicore processors does not necessarily lead to a commensurate increase in performance: programs written in conventional languages, such as C, need careful restructuring, preferably automatic, before the benefits can be observed in improved run-times. Even then, much depends upon the intrinsic capacity of the original program for concurrent execution. Using software techniques to parallelize a sequential application can raise the level of gain from multicore systems. Parallel programming is not an easy job for the user, who has to deal with many issues such as dependencies, synchronization, load balancing, and race conditions. For this reason, automatically parallelizing compilers and techniques for extracting several threads from single-threaded programs, without programmer intervention, are becoming more important and may help to deliver better utilization of modern hardware. One parallelizing technique that has been shown to be effective for applications with irregular control flow and complex memory access patterns is Decoupled Software Pipelining (DSWP). This transformation partitions the loop body into a set of stages, ensuring that critical-path dependencies are kept local to a stage. Each stage becomes a thread, and data is passed between threads using inter-core communication. The success of DSWP depends on being able to extract the relatively fine-grain parallelism that is present in many applications. Another technique that offers potential gains in parallelizing general-purpose applications is slicing. Program slicing transforms large programs into several smaller ones that execute independently, each consisting only of the statements relevant to the computation at certain so-called program points.
This dissertation explores the possibility of performance benefits arising from a secondary transformation of DSWP stages by slicing. To that end, a new combined method called DSWP/Slice is presented. Our observation is that individual DSWP stages can be parallelized by slicing, which improves the performance of the longest-running DSWP stages. In particular, this approach can be applicable in cases where DOALL is not. As a consequence, better load balancing can be achieved between the DSWP stages. Moreover, we introduce an automatic implementation of the combined method using the LLVM (Low Level Virtual Machine) compiler framework. The combination is particularly effective when the whole of a long stage comprises a function body. Extracting more than one slice from a function body can reduce its execution time and also increase the scalability of DSWP. An evaluation of this technique on six programs with a range of dependence patterns shows considerable performance gains on a Core i7 870 machine with 4 cores/8 threads. The results, obtained from the automatic implementation, show that the proposed method can give a speedup of up to a factor of 1.8 compared with the original sequential code. 004.35
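The DSWP idea summarized in this entry (a loop body split into stages, each stage running as a thread, with data passed between stages) can be illustrated with a minimal Python sketch. This is an assumption-laden toy, not the thesis's LLVM-based implementation: the stage functions, the squaring workload, and the names `traverse_stage`, `compute_stage` and `run_pipeline` are all hypothetical, and a `queue.Queue` stands in for inter-core communication.

```python
import queue
import threading

# Hypothetical end-of-stream marker passed from stage 1 to stage 2.
SENTINEL = object()

def traverse_stage(items, out_q):
    """Stage 1: walk the input; the loop-carried dependence (the iteration
    order) stays local to this thread."""
    for item in items:
        out_q.put(item)
    out_q.put(SENTINEL)

def compute_stage(in_q, results):
    """Stage 2: per-item work, fed by the queue from stage 1."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item * item)  # placeholder for the real loop body

def run_pipeline(items):
    """Run the two stages as threads connected by a bounded FIFO queue."""
    q = queue.Queue(maxsize=64)
    results = []
    producer = threading.Thread(target=traverse_stage, args=(items, q))
    consumer = threading.Thread(target=compute_stage, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results

if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4]))
```

The sketch shows only the structure of a decoupled pipeline. In standard CPython builds the global interpreter lock typically prevents a CPU-bound speedup from threads, so it should not be read as a performance demonstration.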
33	Décomposition automatique des programmes parallèles pour l'optimisation et la prédiction de performance. / Automatic decomposition of parallel programs for optimization and performance prediction. Popov, Mihail 07 October 2016 In high performance computing, benchmarks are used to evaluate architectures, compilers and optimizations. Standard benchmarks mostly come from industry and may have very long execution times, so evaluating a new architecture or an optimization is costly. Most benchmarks are composed of independent kernels, and users are usually interested in only a small subset of them. To make local optimizations faster and easier, we should find ways to extract kernels as standalone executables. Benchmarks also contain redundant computational kernels: some calculations bring no new information about the system under study, even though they are measured many times. By detecting similar operations and removing redundant kernels, we can reduce the benchmarking cost without losing information about the system. This thesis proposes a method to automatically decompose applications into small performance kernels called codelets. Each codelet is a standalone executable that can be replayed in isolation in different execution contexts in order to evaluate it. The thesis quantifies how much the decomposition method accelerates the optimization and benchmarking processes. It also quantifies the benefits of local optimizations over global optimizations. Many related works aim to improve the benchmarking process; in particular, we note machine learning approaches and sampling techniques. Decomposing applications into independent pieces is not a new idea. It has been successfully applied to sequential codes, and in this thesis we extend it to parallel programs. Evaluating scalability or new micro-architectures is 25× faster with codelets than with full application executions. Codelets predict the execution time with an accuracy of 94% and find local optimizations that outperform the best global optimization by up to 1.06×. Performance prediction Parallelism Compilation Optimization Checkpoint restart 004.35
34	Nouveaux algorithmes numériques pour l’utilisation efficace des architectures multi-cœurs et hétérogènes / New numerical algorithms for efficient utilization of multicore and heterogeneous architectures Ye, Fan 16 December 2015 This study is driven by the real computational needs of different fields of reactor physics, such as neutronics and thermal hydraulics, where eigenvalue problems and the solution of linear systems are the key challenges and consume substantial computing resources. In this context, our objective is to design and improve parallel computing techniques, including efficient linear algebra kernels and parallel numerical methods. In a shared-memory many-core environment such as the Intel Many Integrated Core (MIC) system, the parallelization of an algorithm is achieved through fine-grained task parallelism and data parallelism. For scheduling the tasks, two main policies, work-sharing and work-stealing, were studied. For the sake of generality and reusability, we use common parallel programming interfaces such as OpenMP, Cilk/Cilk+, and TBB. For vectorizing the tasks, the available tools include Cilk+ array notation, SIMD pragmas, and intrinsic functions. We evaluated these techniques and propose an efficient dense matrix-vector multiplication kernel. To tackle a more complicated situation, we propose a hybrid MPI/OpenMP model for implementing sparse matrix-vector multiplication. We also designed a performance model for characterizing performance issues on MIC and guiding the optimization. For solving linear systems, we derived a scalable parallel solver from the Monte Carlo method. This method exhibits abundant inherent parallelism, which is a good fit for many-core architectures. To address some of the fundamental bottlenecks of this solver, we propose a task-based execution model that completely resolves these problems. Matrix-vector product Performance evaluation Heterogeneous architecture Intel Many Integrated Core (MIC) 004.35
35	Αυτόματος χρονοπρογραμματισμός πληρωμάτων με υψηλού επιπέδου μοντελοποίηση των κανονισμών και παράλληλη/κατανεμημένη επεξεργασία / Automatic crew scheduling with high-level modelling of regulations and parallel/distributed processing Γκουμόπουλος, Χρήστος 09 September 2009 - / - 004.35 Computers Parallel architectures Aircraft Flights
36	Σχεδίαση και ανάπτυξη συστήματος κατανεμημένης διαμοιραζόμενης μνήμης για πολυεπεξεργαστή του ενός ολοκληρωμένου (CMP) / Design and development of a shared distributed memory system for a chip multiprocessor (CMP) Αδαμίδης, Ανδρέας 09 February 2009 The purpose of this master's thesis is the design and development of a distributed shared memory system as part of the SiScape multiprocessor architecture. Because of the particularities of this architecture, its memory system, and specifically the second-level cache that makes its operation possible, had to be designed and developed from scratch in order to meet the architecture's requirements. The design of the second-level cache was described in the VHDL hardware description language. Multiprocessor 004.35 SiScape Chip multiprocessor VHDL Second level cache Synchronization
37	Διάφανη απεικόνιση προγραμματιστικών μοντέλων υψηλού επιπέδου σε ετερόμορφες παράλληλες αρχιτεκτονικές / Transparent mapping of high-level programming models onto heterogeneous parallel architectures Βενέτης, Ιωάννης 16 March 2009 - / - Multiprogramming 004.35 POSIX threads OpenMP Cilk Nano threads NthLib
38	Ανάπτυξη ενσωματωμένων συστημάτων σε πολυπήρηνο ή πολυεπεξεργαστικό περιβάλλον με χρήση Real Time Java / Development of embedded systems in a multicore or multiprocessor environment using Real Time Java Δημητρακόπουλος, Γεώργιος 20 October 2010 The purpose of this thesis is to study how an application can exploit the presence of many processors in a system. The case-study system is the Festo MPS, a distributed embedded real-time system consisting of three subunits. The system has been implemented in Real Time Java, an extension of Java that meets real-time requirements. The application runs on a real-time Java Virtual Machine, which in turn runs on a Linux-type operating system. Each level has several mechanisms for utilizing the available processors. The thesis explores ways to help the programmer write more efficient, smaller and cleaner parallel code, as well as the options that let the programmer specify exactly how the code is executed when this is required. The ultimate goal is to run a simulation of the system in which each subunit runs on its own processor. This is achieved through calls to the operating system via the Java Native Interface. 004.35 Embedded systems Multithreading Real Time Java
39	Παραλληλοποίηση αλγορίθμου Aho-Corasick με τεχνολογία CUDA / Parallelization of the Aho-Corasick algorithm with CUDA technology Δημόπουλος, Παναγιώτης 24 October 2012 This thesis studies the performance of pattern-matching algorithms when they are suitably modified to exploit the hardware architecture of graphics cards. To this end, it first presents the search problem, to make clear why optimizing the performance of existing algorithms is imperative. It then presents the main pattern-matching algorithms in use today and explains why one of them, Aho-Corasick, is selected and subsequently modified to run on an Nvidia graphics card using CUDA technology. Finally, conclusions are drawn about the speed of this parallel implementation compared with the classic version of the algorithm, for different input sizes. Pattern 004.35 Graphic cards Parallel computing CUDA Aho-Corasick
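For readers unfamiliar with the algorithm named in this entry, the following is a minimal sketch of the classic, serial Aho-Corasick matcher in Python. It is not the CUDA version studied in the thesis, and the function names `build_automaton` and `find_all` are hypothetical. It builds a trie of the patterns, adds failure links by breadth-first search, and then scans the text once.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie (goto), failure links (fail) and outputs (out)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> patterns that end at this state
    for pat in patterns:
        if not pat:
            continue  # skip empty patterns
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # inherit from the failure link of their parent.
    pending = deque(goto[0].values())
    while pending:
        state = pending.popleft()
        for ch, nxt in goto[state].items():
            pending.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches

if __name__ == "__main__":
    print(find_all("ushers", ["he", "she", "his", "hers"]))
```

The scan over the text is a single sequential loop, which is the part a GPU version would need to restructure, for example by giving each thread its own portion of the text. How the thesis does this is not stated in the abstract.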
40	Nouveaux algorithmes numériques pour l'utilisation efficace des architectures de calcul multi-coeurs et hétérogènes / New algorithms for Efficient use of multi-cores and heterogeneous architectures Boillod-Cerneux, France 13 October 2014 From the birth of supercomputers to the arrival of petaflop machines, supercomputer architectures and programming paradigms have evolved dramatically. This race for performance now faces the transition to exascale, whose difficulties set it apart from earlier scales and imply drastic changes across all scientific fields related to High Performance Computing (HPC). In this thesis we focus on eigenvalue problems, which are widespread, from page ranking to nuclear simulation, astronomy and oil exploration, and which arise in most industrial simulations. Our approach has two complementary parts. First, we study and characterize the convergence of the Explicitly Restarted Arnoldi Method (ERAM), a prerequisite for smart-tuning techniques, and then improve it by reusing the information the method generates: the computed Ritz values are used to accelerate ERAM convergence onto the desired eigensubspace without additional cost in parallel communication or memory storage, both critical on multicore and heterogeneous machines. Second, we propose two methods for generating very large matrices with a user-imposed spectrum, in order to build a collection of test matrices to be shared with the HPC community. These matrices serve to validate eigensolvers numerically and to measure and improve their parallel performance on petaflop machines and beyond. Auto-tuning techniques Restart strategies 004.35
