Global ETD Search

221	Διαχείριση κοινών πόρων σε πολυπύρηνους επεξεργαστές [Shared resource management in multicore processors]
Αλεξανδρής, Φωκίων, 27 June 2012
Modern computer system design has adopted cache memories in order to hide main memory latency and to bridge the processor-memory performance gap. Cache memories have thus acquired an indisputably primary role in the memory hierarchy of computers. Newer design trends brought the concept of parallelism to the fore. Instruction-level parallelism was explored first, but the resulting performance gains soon reached a ceiling. Over the last decade the focus of designers has shifted again, as a new type of processor has taken center stage: multicore processors, also known as on-chip multiprocessors (CMP). These developments, combined with the ever-increasing complexity of the "behavior" of running applications, pushed design interest toward exploiting a newly established type of parallelism. Memory-level parallelism (MLP) has in recent years been the most powerful means of increasing the performance of computing systems and, together with multicore processors, is expected to dominate developments in the coming years. The aim of this diploma thesis is to develop a statistical-probabilistic model for studying and predicting the phenomena that arise in cache memories holding data from running applications with intense memory-level parallelism. An estimator of the load that MLP phenomena impose on the system is defined. Then, based on the model, a sufficiently large set of applications is examined and an estimate (prediction) of the system's MLP load is derived. If the predictions are judged successful, the MLP load prediction model can become a useful tool for designers working to increase the performance of modern computing systems.
Cache memories; Concept of parallelism; 004.5; Memory latency; On-chip multiprocessors (CMP); Memory level parallelism (MLP)
222	Διαχείριση κοινόχρηστων πόρων σε πολυεπεξεργαστικά συστήματα ενός ολοκληρωμένου [Management of shared resources in chip multiprocessor systems]
Πετούμενος, Παύλος, 06 October 2011
This dissertation proposes methodologies for the management of shared resources in chip multiprocessors (CMPs). Until recently, the design of a computing system had to satisfy the computational and storage needs of a single program during each time period. Now the designer must also balance the possibly conflicting needs of multiple programs competing for the same resources. In many cases, however, even this is not enough. Even with an ideal way of managing sharing, the shared resource cannot handle the increased load efficiently unless the way each processor uses it is also optimized. To handle the negative effects of resource sharing, this dissertation proposes three management mechanisms. The first introduces a novel theoretical model of shared-cache sharing that can be used at run time, while the programs sharing the cache execute. The methodology then uses the model to control sharing and to achieve fairness in how cache space is distributed among the processors. The second methodology presents a new technique for predicting the locality of cache accesses. Since locality determines, almost entirely, the usefulness of cache data, the technique can drive any management mechanism that strives to improve the efficiency of the cache. As part of the methodology, we present such a mechanism: a new cache replacement policy that tries to minimize cache misses through near-optimal replacement decisions. The last methodology targets the energy consumption of the processor, specifically that of the Issue Queue, one of its most energy-demanding structures. It shows that the key to reducing the power consumption of the Issue Queue without disproportionate performance degradation lies in the interaction of the Issue Queue with the memory subsystem: as long as the management of the Issue Queue does not reduce the utilization of the memory subsystem, its effect on the processor's performance is minimal. Based on this conclusion, we introduce a new mechanism for dynamically resizing the Issue Queue, which achieves aggressive downsizing and energy savings with almost no performance degradation.
Cache memories; Instruction queue; Prediction mechanisms; Power consumption; 004.35; Computer architecture; Chip multiprocessors; CPU caches; Power-aware management techniques
223	Efficient Fault Tolerance In Chip Multiprocessors Using Critical Value Forwarding
Subramanyan, Pramod, 06 1900
Relentless CMOS scaling coupled with lower design tolerances is making ICs increasingly susceptible to transient faults, wear-out related permanent faults and process variations. Decreasing CMOS reliability implies that high-availability systems, which were previously restricted to the domain of mainframe computers or specially designed fault-tolerant systems, may become important for the commodity market as well. In this thesis we tackle the problem of enabling efficient, low-cost and configurable fault tolerance using Chip Multiprocessors (CMPs). Our work studies architectural fault detection methods based on redundant execution, specifically focusing on "leader-follower" architectures. In such architectures redundant execution is performed on two cores/threads of a CMP. One thread acts as the leading thread while the other acts as the trailing thread. The leading thread assists the execution of the trailing thread by forwarding the results of its execution. These forwarded results are used as predictions in the trailing thread and help improve its performance. In this thesis, we introduce a new form of execution assistance called critical value forwarding. Critical value forwarding uses heuristics to identify instructions on the critical path of execution and forwards the results of these instructions to the trailing core. The advantage of critical value forwarding is that it provides much of the speedup obtained by forwarding all values at a fraction of the bandwidth cost. We propose two architectures to exploit the idea of critical value forwarding. The first operates the trailing core at lower voltage/frequency levels in order to provide energy-efficient redundant execution. In this context, we also introduce algorithms to dynamically adapt the voltage/frequency level of the trailing core based on program behavior. Our experimental evaluation shows that this proposal consumes only 1.26 times the energy of a non-fault-tolerant baseline and has a mean performance overhead of about 1%. We compare our proposal to two previous energy-efficient fault-tolerant CMP proposals and find that it delivers higher energy efficiency and lower performance degradation than both while providing a similar level of fault coverage. Our second proposal uses critical value forwarding to improve fault-tolerant CMP throughput. This is done by using coarse-grained multithreading to multiplex trailing threads on a single core. Our evaluation shows that this architecture delivers 9 to 13% higher throughput than previous proposals, including one configuration that uses simultaneous multithreading (SMT) to multiplex trailing threads. Since this proposal increases fault-tolerant CMP throughput by executing multiple threads on a single core, it comes at a modest cost in single-threaded performance: a mean slowdown between 11% and 14%.
Fault Tolerant Computing; Microprocessors; Chip Multiprocessors (CMPs); Microarchitecture; Energy-efficient Architecture; Transient Faults; Permanent Faults; Redundant Execution; Fault Tolerance; CMOS; Computer Science
224	Multi-processor logic simulation at the chip level
Roumeliotis, Emmanuel, January 1986
This dissertation presents the design and development of a multi-processor logic simulator. After an introduction to parallel processing, the concept of distributed simulation is described, as well as the possibility of deadlock in a distributed system. It is proven that the proposed system does not deadlock. Next, the modeling techniques are discussed along with the timing mechanisms used for logic simulation. A new approach, namely process-oriented simulation, is studied in depth. It is shown that modeling for this kind of simulation is more efficient than existing simulation methods with regard to modeling ease, computer memory and simulation time. The hardware design of the multi-processor system and the algorithms for synchronization and signal interchange between the processors are presented next. An algorithm for efficiently partitioning the digital network to be simulated among the processors of the system is also described. Apart from the simulation of a single digital network, the simulator can also be used for fault simulation and design verification. Regarding fault simulation, the fault injection and fault detection techniques are presented. The experimental results obtained by running the multi-processor simulator are compared with the theoretical estimates as well as with results obtained by other multi-processor systems. The comparison shows that the proposed simulator exhibits the estimated performance. Finally, the design of a common bus interface is given. This interface will connect the processors of the system directly, without the intervention of the hard disk that was used for the development and testing of the system. / Ph. D.
LD5655.V856 1986.R685; Digital computer simulation
225	Affectation de composantes basée sur des contraintes énergétiques dans une architecture multiprocesseurs en trois dimensions [Component assignment based on energy constraints in a three-dimensional multiprocessor architecture]
Deldicque, Martin, 06 1900
Lithography and Moore's law have led to extraordinary advances in integrated circuit manufacturing. Nowadays, several very complex systems can be embedded on the same chip. The development constraints of these systems are so significant that good planning from the very beginning of the development cycle is essential. Thus, planning energy management at the beginning of the development cycle has become an important phase in the design of these systems. For several years, the idea was to reduce energy consumption by adding a physical mechanism once the circuit had been created, a heat sink for example. The current strategy is to integrate energy constraints in the early stages of circuit design. It is therefore essential to know the energy dissipation before the components are integrated into the architecture of a multiprocessor system, so that each component can work efficiently within the limits of its thermal constraints. When a component is running, it consumes electrical energy, which is converted into heat. The aim of this thesis is to find an efficient assignment of components in a three-dimensional multiprocessor architecture, taking into account the thermal limits of the system.
Operations research; Quadratic assignment; Tabu search; Architecture of multiprocessors; Heat dissipation; Assignment of elements; Bus selection
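The keywords of record 225 name quadratic assignment and tabu search. As a rough illustration only, and not the method of the thesis, the sketch below shows the standard quadratic assignment objective and a very small tabu search over pairwise swaps. The `flow` and `dist` matrices, the function names, and the tenure value are assumptions made for this example; thermal limits are not modeled.

```python
import itertools


def assignment_cost(flow, dist, placement):
    """Quadratic assignment objective: traffic between components i and j,
    weighted by the distance between the slots they are placed in."""
    n = len(placement)
    return sum(
        flow[i][j] * dist[placement[i]][placement[j]]
        for i in range(n)
        for j in range(n)
    )


def tabu_search(flow, dist, iterations=100, tenure=3):
    """Minimal tabu search over pairwise swaps (hypothetical parameters)."""
    n = len(flow)
    current = list(range(n))  # component i starts in slot i
    best, best_cost = current[:], assignment_cost(flow, dist, current)
    tabu_until = {}  # swap (i, j) -> last iteration at which it is forbidden
    for it in range(iterations):
        candidates = []
        for i, j in itertools.combinations(range(n), 2):
            neighbor = current[:]
            neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
            cost = assignment_cost(flow, dist, neighbor)
            # Aspiration rule: a tabu swap is allowed if it beats the best cost.
            if tabu_until.get((i, j), -1) < it or cost < best_cost:
                candidates.append((cost, (i, j), neighbor))
        if not candidates:
            continue  # every swap is tabu; wait for entries to expire
        cost, move, current = min(candidates, key=lambda c: (c[0], c[1]))
        tabu_until[move] = it + tenure
        if cost < best_cost:
            best, best_cost = current[:], cost
    return best, best_cost
```

For three components and three slots, `tabu_search(flow, dist, iterations=20)` returns a placement list and its cost; a realistic formulation would add the thermal constraints as penalties or as feasibility checks on each candidate swap.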
226	Cellular GPU Models to Euclidean Optimization Problems : Applications from Stereo Matching to Structured Adaptive Meshing and Traveling Salesman Problem / Modèles cellulaires GPU appliquès à des problèmes d'optimisation euclidiennes : applications à l'appariement d'images stéréo, à la génération de maillages et au voyageur de commerce
Zhang, Naiyu, 02 December 2013
The work presented in this PhD thesis studies and proposes cellular parallel computation models for different NP-hard optimization problems defined in Euclidean space, and their implementation on the Graphics Processing Unit (GPU) platform. The goal is to handle large problems while obtaining substantial acceleration factors through massive parallelism. The target applications are embedded systems for stereovision and transportation problems in the plane, such as vehicle routing problems. The main characteristic of the cellular model is that it decomposes the plane into an appropriate number of cells, each responsible for a constant part of the input data and each corresponding to a single processing unit. Hence, the number of processing units and the required memory grow linearly with the size of the optimization problem, which lets the model handle very large instances.
The effectiveness of the proposed cellular models has been tested on the GPU parallel platform on four applications. The first application is a stereo-matching problem in color stereovision. The input is a stereo image pair, and the output is a disparity map that represents depths in the 3D scene. The goal is to implement and compare GPU and CPU versions of local dense stereo-matching methods that follow the winner-takes-all approach and operate on CFA (color filter array) image pairs. The second application looks for improvements to the GPU implementation that bring stereo matching to near real-time computation. The third and fourth applications deal with a cellular GPU implementation of the self-organizing map neural network in the plane. The third application concerns structured mesh generation applied to disparity maps, in order to produce compressed representations of 3D surfaces. The fourth application addresses large Euclidean traveling salesman problem (TSP) instances with up to 33708 cities.
In all applications, the GPU implementations provide substantial acceleration over the CPU versions as the problem size increases, for results of similar or higher quality. The GPU version is about 20 times faster than the CPU version on CFA image pairs, with a GPU computation time of about 0.2 s for a small image pair from the Middlebury database. The improved near real-time algorithm takes about 0.017 s for a small image pair, which is among the fastest execution times in the Middlebury benchmark, with moderate result quality. The structured mesh generation is evaluated on the Middlebury data set to gauge the GPU acceleration factor and the quality obtained. On the largest TSP instance, with 33708 cities, the GPU parallel self-organizing map is 30 times faster than the CPU version.
Combinatorial optimization; Multiprocessors; Graphics Processing Units (GPU); Stereovision; Adaptive meshing; 3D reconstruction; Traveling salesman problem
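The abstract of record 226 describes local stereo matching with a winner-takes-all rule: for each pixel, every candidate disparity gets a matching cost and the cheapest one wins. The sketch below is a minimal CPU illustration under assumptions that are not taken from the thesis: rectified grayscale images stored as lists of rows, a sum-of-absolute-differences window cost, and border pixels handled by clamping. The function name and parameters are hypothetical.

```python
def wta_disparity(left, right, max_disp, radius=1):
    """Winner-takes-all disparity map for a rectified grayscale stereo pair.

    A pixel at column x in the left image is compared with column x - d
    in the right image, for every candidate disparity d.
    """
    height, width = len(left), len(left[0])
    disparity = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best_cost, best_d = None, 0
            for d in range(min(max_disp, x) + 1):
                cost = 0
                # Sum of absolute differences over a square window.
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        yy = min(max(y + dy, 0), height - 1)
                        xl = min(max(x + dx, 0), width - 1)
                        xr = min(max(x + dx - d, 0), width - 1)
                        cost += abs(left[yy][xl] - right[yy][xr])
                # Winner takes all: keep the disparity with the lowest cost.
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y][x] = best_d
    return disparity
```

Each output pixel depends only on a small neighborhood of the two input images, which is why this kind of method maps naturally onto one GPU thread per pixel or per cell; the loops above are the sequential equivalent.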
227	Efficient and Scalable Cache Coherence for Many-Core Chip Multiprocessors
Ros Bardisa, Alberto, 24 September 2009
Chip multiprocessors (CMPs) constitute the new trend for increasing the performance of future computers. In the near future, chips with tens of cores will become more popular. Nowadays, directory-based protocols constitute the best alternative for keeping caches coherent in large-scale systems. Nevertheless, directory-based protocols have two important issues that prevent them from achieving better scalability: the directory memory overhead and the long cache miss latencies. This thesis focuses on these key issues. The first proposal is a scalable distributed directory organization that copes with the memory overhead of directory-based protocols. The second proposal presents the direct coherence protocols, which avoid the indirection to the home node found in traditional directory-based protocols and therefore improve application performance. Finally, a novel mapping policy for shared but physically distributed caches is presented. This policy reduces the long access latency while lessening the number of off-chip accesses, leading to improvements in application execution time.
Directory protocols; Scalability; Cache coherence; Chip multiprocessors; NUCA caches; Indirection; Direct coherence; Access latency; Computer architecture; 004
228	Iterative and Adaptive PDE Solvers for Shared Memory Architectures / Iterativa och adaptiva PDE-lösare för parallelldatorer med gemensam minnesorganisation
Löf, Henrik, January 2006
Scientific computing is used frequently in an increasing number of disciplines to accelerate scientific discovery. Many such computing problems involve the numerical solution of partial differential equations (PDE). In this thesis we explore and develop methodology for high-performance implementations of PDE solvers for shared-memory multiprocessor architectures. We consider three realistic PDE settings: solution of the Maxwell equations in 3D using an unstructured grid and the method of conjugate gradients, solution of the Poisson equation in 3D using a geometric multigrid method, and solution of an advection equation in 2D using structured adaptive mesh refinement. We apply software optimization techniques to increase both parallel efficiency and the degree of data locality. In our evaluation we use several different shared-memory architectures ranging from symmetric multiprocessors and distributed shared-memory architectures to chip-multiprocessors. For distributed shared-memory systems we explore methods of data distribution to increase the amount of geographical locality. We evaluate automatic and transparent page migration based on runtime sampling, user-initiated page migration using a directive with an affinity-on-next-touch semantic, and algorithmic optimizations for page-placement policies. Our results show that page migration increases the amount of geographical locality and that the parallel overhead related to page migration can be amortized over the iterations needed to reach convergence. This is especially true for the affinity-on-next-touch methodology, whereby page migration can be initiated at an early stage in the algorithms. We also develop and explore methodology for other forms of data locality and conclude that the effect on performance is significant and that this effect will increase for future shared-memory architectures. Our overall conclusion is that, if the involved locality issues are addressed, the shared-memory programming model provides an efficient and productive environment for solving many important PDE problems.
partial differential equations; iterative methods; finite elements; conjugate gradients; adaptive mesh refinement; multigrid; cc-NUMA; distributed shared memory; OpenMP; page migration; TLB shoot-down; bandwidth minimization; reverse Cuthill-McKee; migrate-on-next-touch; affinity; temporal locality; chip multiprocessors; CMP
229	Athapascan-0 : exploitation de la multiprogrammation légère sur grappes de multiprocesseurs [Athapascan-0: exploiting lightweight multithreading on clusters of multiprocessors]
Carissimi, Alexandre da Silva, January 1999
The continuous price reduction of commodity PC multiprocessors and the availability of fast network interfaces have made clusters of multiprocessors an attractive low-cost way to build distributed-memory parallel systems. Multiprocessor clusters offer two levels of parallelism: fine-grain parallelism inside a single multiprocessor, provided by multithreading, and coarse-grain parallelism among the multiprocessors. A mechanism must be provided to exploit both levels of parallelism simultaneously, which requires communication between threads that belong to different address spaces. This dissertation addresses the problem of integrating threads and communications on clusters of symmetric multiprocessors (SMP), and more precisely the evaluation and tuning of the ATHAPASCAN-0 run-time system on this type of architecture. ATHAPASCAN-0 is a portable run-time system for clusters of multiprocessors, developed as part of the APACHE project (CNRS-INPG-INRIA-UJF), that combines multithreading and message-passing communication. Portability is achieved by a layered organization based on widely used standards such as POSIX threads and MPI. The ATHAPASCAN-0 run-time system extends the static network of communicating heavy-weight processes found in message-passing libraries such as MPI and PVM into a dynamic network of communicating threads. The basic technique is the multithreading of communications and computations. Communication progress relies on polling the network to handle incoming messages and to deliver outgoing communication requests, and performance depends strongly on how these operations are implemented and on keeping them to a minimum. Additionally, multiprocessors introduce specific programming problems, because computation and communication truly run in parallel: the overhead of cache coherency mechanisms, the management of concurrent accesses, and efficient mutex locking that avoids unnecessary context switching. These problems are analyzed and solutions are implemented in the ATHAPASCAN-0 run-time system. The solutions are evaluated on clusters of multiprocessors.
Computer architecture; Multiprogramming; Parallel processing; Multiprocessing; Multithreading; Message passing; Parallel programming environments; Networks of workstations; Symmetric multiprocessors

Search results