Global ETD Search

1	High-level structured programming models for explicit and automatic parallelization on multicore architectures / Modèle de programmation de haut niveau pour la parallélisation expicite et automatique : application aux architectures multicoeurs Khammassi, Nader 05 December 2014 (has links) La prolifération des architectures multi-coeurs est source d’unepression importante pour les developpeurs, qui doivent chercherà paralléliser leurs applications de manière à profiter au mieux deces plateformes. Malheureusement, les modèles de programmationde bas niveau amplifient les difficultés inhérentes à la conceptiond’applications complexes et parallèles. Il existe donc une attentepour des modèles de programmation de plus haut niveau, quipuissent simplifier la vie des programmeurs de manière significative,tout en proposant des abstractions suffisantes pour absorberl’hétérogénéité des architectures matérielles.Contrairement à une multitude de modèles de programmation parallèlequi introduisent de nouveaux langages, annotations ou étendentdes langages existants et requièrent donc des compilateurs spécialisés,nous exploitons ici le potentiel du language C++ standardet traditionnel. En particulier nous avons recours à ses capacitésen terme de meta-programmation, afin de fournir au programmeurune interface de programmation parallèle simple et directe. Cetteinterface autorise le programmeur à exprimer le parallélismede son application au prix d’une altération négligeable du codeséquentiel initial. Un runtime intelligent se charge d’extraire touteinformation relative aux dépendances de données entre tâches,ainsi que celles relatives à l’ordonnancement. Nous montronscomment ce runtime est à même d’exploiter ces informations dansle but de détecter et protéger les données partagées, puis réaliserun ordonnancement prenant en compte les particularités des caches.L’implémentation initiale de notre modèle de programmation est unelibrairie C++ pure appelée XPU. XPU est conÃ˘gue dans le but defaciliter l’explicitation, par le programmeur, du parallélisme applicatif.Une seconde réalisation appelée FATMA doit être considérée commeune extension d’XPU qui permet une détection automatique desdépendances dans une séquence de tâches : il s’agit donc de parallélisationautomatique, sans recours à quelque outil que se soit,excepté un compilateur C++ standard. Afin de démontrer le potentielde notre approche, nous utilisons ces deux outils –XPU et FATMA–pour paralléliser des problèmes populaires, ainsi que des applicationsindustrielles réelles. Nous montrons qu’en dépit de leur abstractionélevée, nos modèles de programmation présentent des performancescomparables à des modèles de programmation de basniveau,et offrent un meilleur compromis productivité-performance. / The continuous proliferation of multicore architectures has placeddevelopers under great pressure to parallelize their applicationsaccordingly with what such platforms can offer. Unfortunately,traditional low-level programming models exacerbate the difficultiesof building large and complex parallel applications. High-level parallelprogramming models are in high-demand as they reduce the burdenson programmers significantly and provide enough abstraction toaccommodate hardware heterogeneity. In this thesis, we proposea flexible parallelization methodology, and we introduce a newtask-based parallel programming model designed to provide highproductivity and expressiveness without sacrificing performance.Our programming model aims to ease expression of both sequentialexecution and several types of parallelism including task, data andpipeline parallelism at different granularity levels to form a structuredhomogeneous programming model.Contrary to many parallel programming models which introducenew languages, compiler annotations or extend existing languagesand thus require specialized compilers, extra-hardware or virtualmachines..., we exploit the potential of the traditional standardC++ language and particularly its meta-programming capabilities toprovide a light-weight and smart parallel programming interface. Thisprogramming interface enable programmer to express parallelismat the cost of a little amount of extra-code while reuse its legacysequential code almost without any alteration. An intelligent run-timesystem is able to extract transparently many information on task-datadependencies and ordering. We show how the run-time system canexploit these valuable information to detect and protect shared dataautomatically and perform cache-aware scheduling.The initial implementation of our programming model is a pure C++library named "XPU" and is designed for explicit parallelism specification.A second implementation named "FATMA" extends XPU andexploits the transparent task dependencies extraction feature to provideautomatic parallelization of a given sequence of tasks withoutneed to any specific tool apart a standard C++ compiler. In order todemonstrate the potential of our approach, we use both of the explicitand automatic parallel programming models to parallelize popularproblems as well as real industrial applications. We show thatdespite its high abstraction, our programming models provide comparableperformances to lower-level programming models and offersa better productivity-performance tradeoff. XPU FATMA CHATS Parallélisation automatique Squelettes algorithmiques Multicore architectures 005.275
2	Πειραματική αξιολόγηση μεθοδολογίας βελτιστοποίησης του αλγόριθμου πολλαπλασιασμού πίνακα επί διάνυσμα σε μονοπύρηνες και πολυπύρηνες αρχιτεκτονικές Παπαδήμα, Ελισσάβετ 30 April 2014 (has links) Στην παρούσα διπλωματική εργασία έγινε υλοποίηση και πειραματική αξιολόγηση μιας μεθοδολογίας η οποία έχει αναπτυχθεί στο Εργαστήριο Ολοκληρωμένων Κυκλωμάτων και αφορά τη βελτιστοποίηση του Πολλαπλασιασμού Πίνακα επί Διάνυσμα (ΠΠΔ) σε μονοπύρηνους και πολυπύρηνους επεξεργαστές. Η μεθοδολογία εκμεταλλεύεται το σύνολο των χαρακτηριστικών της αρχιτεκτονικής που χρησιμοποιείται και συγκεκριμένα (α) την ιεραρχία της μνήμης, (β) το μέγεθος της κρυφής μνήμης, (γ) το βαθμό συσχέτισης κάθε επιπέδου της κρυφής μνήμης, (δ) την καθυστέρηση της μνήμης και (ε) το πλήθος των πυρήνων. Είναι η πρώτη φορά που λαμβάνεται υπόψη ο βαθμός συσχέτισης της μνήμης. Σκοπός της μεθοδολογίας είναι η βελτιστοποίηση με βάση όλες τις παραμέτρους μαζί και όχι καθεμία ξεχωριστά. Για να βελτιωθεί η απόδοση προτείνεται διαφορετικός χρονοπρογραμματισμός ανάλογα με το μέγεθος του πίνακα. Για την πειραματική αξιολόγηση χρησιμοποιήθηκαν οι επεξεργαστές γενικού σκοπού Intel Core 2 Duo E6065 και Τ6600, ο Intel i7-3930K και ο ενσωματωμένος επεξεργαστής ειδικού σκοπού Microblaze από το Virtex-5 FPGA (Xilinx). Τα αποτελέσματα συγκρίνονται με την state-of-the-art βιβλιοθήκη ATLAS (Automatically Tuned Linear Algebra Software) και παρουσιάζουν βελτίωση 30%. Από τα πειραματικά αποτελέσματα είναι φανερό ότι η κύρια μνήμη είναι το bottleneck του προβλήματος. Επίσης, η απόδοση βελτιώνεται όταν αλλάζει το layout του πίνακα τόσο σε μονοπύρηνες όσο σε πολυπύρηνες αρχιτεκτονικές. Όσον αφορά τη μέθοδο του tiling τα πειραματικά αποτελέσματα δείχνουν ότι η μείωση των αστοχιών δεν βελτιώνει πάντα την απόδοση γιατί υπάρχει trade-off ανάμεσα στο μέγεθος του tile και στις εντολές διευθυνσιοδότησης. Επίσης, είναι φανερό ότι στις πολυπύρηνες αρχιτεκτονικές δεν υπάρχει γραμμική σχέση της απόδοσης και του πλήθους των πυρήνων που χρησιμοποιούνται. Αυτό οφείλεται στο περιορισμένο εύρος ζώνης της μνήμης. / The subject of this MSc Thesis is the implementation and the experimental evaluation of a methodology that has been developed at the Laboratory of Integrated Circuits and optimizes the Matrix Vector Multiplication (MVM) in single-core and multi-core processors. The methodology fully exploits the characteristics of the architecture. Specifically, it exploits (a) the hierarchy of the memory, (b) the cache size, (c) the cache associativity, (d) the memory latency and (e) the number of the cores. It is the first time that the cache associativity is taken into account. The methodology optimizes all the parameters together as one problem and not separately. A different scheduling is proposed according to the size of the matrix. The general purpose processors Intel Core 2 Duo E6065, Intel Core 2 Duo T6600 and Intel i7-3930K and the embedded processor Virtex-5 Microblaze have been used. The results have been compared with the state-of-the-art library ATLAS (Automatically Tuned Linear Algebra Software) and the performance is improved by 30%. According to the experimental results, it is obvious that the bottleneck is the memory latency. Moreover, the performance is increased when a new way of saving the matrix in the main memory (data array layout) is used in both single-core and multi-core architectures. As far as the tiling is concerned, the experimental results indicate that the decrease of the misses does not always improve the performance because there is a trade-off between the tile size and the addressing instructions. According to the experimental results, as far as multicore architectures are concerned, there is no linear relation between the performance and the number of the cores, because of the limited memory bandwidth. 005.275 Matrix vector multiplication Multicore architectures
3	On the Interaction of High-Performance Network Protocol Stacks with Multicore Architectures Chunangad Narayanaswamy, Ganesh 20 May 2008 (has links) Multicore architectures have been one of the primary driving forces in the recent rapid growth in high-end computing systems, contributing to its growing scales and capabilities. With significant enhancements in high-speed networking technologies and protocol stacks which support these high-end systems, a growing need to understand the interaction between them closely is realized. Since these two components have been designed mostly independently, there tend to have often serious and surprising interactions that result in heavy asymmetry in the effective capability of the different cores, thereby degrading the performance for various applications. Similarly, depending on the communication pattern of the application and the layout of processes across nodes, these interactions could potentially introduce network scalability issues, which is also an important concern for system designers. In this thesis, we analyze these asymmetric interactions and propose and design a novel systems level management framework called SIMMer (Systems Interaction Mapping Manager) that automatically monitors these interactions and dynamically manages the mapping of processes on processor cores to transparently maximize application performance. Performance analysis of SIMMer shows that it can improve the communication performance of applications by more than twofold and the overall application performance by 18%. We further analyze the impact of contention in network and processor resources and relate it to the communication pattern of the application. Insights learnt from these analyses can lead to efficient runtime configurations for scientific applications on multicore architectures. / Master of Science Multicore Architectures High-Performance Networking Process-to-Core Mapping Network Contention Runtime Adaptation
4	Parallélisation de la ligne de partage des eaux dans le cadre des graphes à arêtes valuées sur architecture multi-cœurs / Parallelization of the watershed transform in weighted graphs on multicore architecture Braham, Yosra 24 November 2018 (has links) Notre travail s'inscrit dans le cadre de la parallélisation d’algorithmes de calcul de la Ligne de Partage des Eaux (LPE) en particulier la LPE d’arêtes qui est une notion de la LPE introduite dans le cadre des Graphes à Arêtes Valuées. Nous avons élaboré un état d'art sur les algorithmes séquentiels de calcul de la LPE afin de motiver le choix de l'algorithme qui fait l'objet de notre étude qui est l'algorithme de calcul de noyau par M-bord. L'objectif majeur de cette thèse est de paralléliser cet algorithme afin de réduire son temps de calcul. En premier lieu, nous avons présenté les travaux qui se sont intéressés à la parallélisation des différentes variantes de la LPE et ce afin de dégager les problématiques que soulèvent cette tâche et les solutions adéquates à notre contexte. Dans un second lieu, nous avons montré que malgré la localité de l'opération de base de cet algorithme qui est l’abaissement de la valeur de certaines arêtes nommées arêtes M-bord, son exécution parallèle se trouve pénaliser par un problème de dépendance de données, en particulier au niveau des arêtes M-bord qui ont un sommet non minimum commun. Dans ce contexte, nous avons proposé trois stratégies de parallélisation de cet algorithme visant à résoudre ce problème de dépendance de données. La première stratégie consiste à diviser le graphe de départ en des bandes appelées partitions, et les traiter en parallèle sur P processeurs. La deuxième stratégie consiste à diviser les arêtes du graphe de départ en alternance en des sous-ensembles d’arêtes indépendantes. La troisième stratégie consiste à examiner les sommets au lieu des arêtes du graphe initial tout en préservant le paradigme d’amincissement sur lequel est basé l’algorithme séquentiel initial. Par conséquent, l’ensemble des sommets non-minima adjacents aux sommets minima sont traités en parallèle. En dernier lieu, nous avons étudié la parallélisation d'une technique de segmentation basée sur l'algorithme de calcul de noyau par M-bord. Cette technique comprend les étapes suivantes : la recherche des minima régionaux, la pondération des sommets et le calcul des sommets minima et enfin calcul du noyau par M-bord. A cet égard, nous avons commencé par faire une étude relative à la dépendance des données des différentes étapes qui la constituent et nous avons proposé des algorithmes parallèles pour chacune d'entre elles. Afin d'évaluer nos contributions, nous avons implémenté les différents algorithmes parallèles proposés dans le cadre de cette thèse sur une architecture multi-cœurs à mémoire partagée. Les résultats obtenus ont montré des gains en termes de temps d’exécution. Ce gain est traduit par des facteurs d’accélération qui augmentent avec le nombre de processeurs et ce quel que soit la taille des images à segmenter / Our work is a contribution of the parallelization of the Watershed Transform in particular the Watershed cuts which are a notion of watershed introduced in the framework of Edge Weighted Graphs. We have developed a state of art on the sequential watershed algorithms in order to motivate the choice of the algorithm that is the subject of our study, which is the M-border Kernel algorithm. The main objective of this thesis is to parallelize this algorithm in order to reduce its running time. First, we presented a review on the works that have treated the parallelization of the different types of Watershed in order to identify the issues raised by this task and the appropriate solutions to our context. In a second place, we have shown that despite the locality of the basic operation of this algorithm which is the lowering of some edges named the M-border edges; its parallel execution raises a data dependency problem, especially at the M-border edges which have a common non-minimum vertex. In this context, we have proposed three strategies of parallelization of this algorithm that solve this problematic: the first strategy consists of dividing the initial graph into bands called partitions processed in parallel by P processors. The second strategy is to divide the edges of the initial graph alternately into subsets of independent edges. The third strategy consists in examining the vertices instead of the edges of the initial graph while preserving the thinning paradigm on which the sequential algorithm is based. Therefore, the set of non-minima vertices adjacent to the minima ones are processed in parallel. Finally, we studied the parallelization of a segmentation technique based on the M-border kernel algorithm. This technique consists of three main steps which are: regional minima detection, vertices valuation and M-border kernel computation. For this purpose, we began by studying the data dependency of the different stages of this technique and we proposed parallel algorithms for each one of them. In order to evaluate our contributions, we implemented the parallel algorithms proposed in this thesis, on a shared memory multi-core architecture. The results obtained showed a notable gain in terms of execution time. This gain is translated by speedup factors that increase with the number of processors whatever is the resolution of the input images Graphe Segmentation d'images en niveaux de gris Algorithmes parallèles Ligne de Partage des Eaux Architectures multi-Cœurs Grayscale image segmentation Parallel algorithms Graph Watershed Multicore architectures
5	HPI Future SOC Lab : proceedings 2011 January 2013 (has links) Together with industrial partners Hasso-Plattner-Institut (HPI) is currently establishing a “HPI Future SOC Lab,” which will provide a complete infrastructure for research on on-demand systems. The lab utilizes the latest, multi/many-core hardware and its practical implementation and testing as well as further development. The necessary components for such a highly ambitious project are provided by renowned companies: Fujitsu and Hewlett Packard provide their latest 4 and 8-way servers with 1-2 TB RAM, SAP will make available its latest Business byDesign (ByD) system in its most complete version. EMC² provides high performance storage systems and VMware offers virtualization solutions. The lab will operate on the basis of real data from large enterprises. The HPI Future SOC Lab, which will be open for use by interested researchers also from other universities, will provide an opportunity to study real-life complex systems and follow new ideas all the way to their practical implementation and testing. This technical report presents results of research projects executed in 2011. Selected projects have presented their results on June 15th and October 26th 2011 at the Future SOC Lab Day events. / In Kooperation mit Partnern aus der Industrie etabliert das Hasso-Plattner-Institut (HPI) ein “HPI Future SOC Lab”, das eine komplette Infrastruktur von hochkomplexen on-demand Systemen auf neuester, am Markt noch nicht verfügbarer, massiv paralleler (multi-/many-core) Hardware mit enormen Hauptspeicherkapazitäten und dafür konzipierte Software bereitstellt. Das HPI Future SOC Lab verfügt über prototypische 4- und 8-way Intel 64-Bit Serversysteme von Fujitsu und Hewlett-Packard mit 32- bzw. 64-Cores und 1 - 2 TB Hauptspeicher. Es kommen weiterhin hochperformante Speichersysteme von EMC². SAP stellt ihre neueste Business by Design (ByD) Software zur Verfügung und auch komplexe reale Unternehmensdaten stehen zur Verfügung, auf die für Forschungszwecke zugegriffen werden kann. Interessierte Wissenschaftler aus universitären und außeruniversitären Forschungsinstitutionen können im HPI Future SOC Lab zukünftige hoch-komplexe IT-Systeme untersuchen, neue Ideen / Datenstrukturen / Algorithmen entwickeln und bis hin zur praktischen Erprobung verfolgen. In diesem Technischen Bericht werden die Ergebnisse der Forschungsprojekte des Jahres 2011 vorgestellt. Ausgewählte Projekte stellten ihre Ergebnisse am 15. Juni 2011 und 26. Oktober 2011 im Rahmen der Future SOC Lab Tag Veranstaltungen vor. Future SOC Lab Forschungsprojekte Multicore Architekturen In-Memory Technologie Future SOC Lab Research Projects Multicore architectures In-Memory technology Data processing Computer science
6	HPI future SOC lab : proceedings 2013 January 2014 (has links) The “HPI Future SOC Lab” is a cooperation of the Hasso-Plattner-Institut (HPI) and industrial partners. Its mission is to enable and promote exchange and interaction between the research community and the industrial partners. The HPI Future SOC Lab provides researchers with free of charge access to a complete infrastructure of state of the art hard- and software. This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies. This technical report presents results of research projects executed in 2013. Selected projects have presented their results on April 10th and September 24th 2013 at the Future SOC Lab Day events. / Das Future SOC Lab am HPI ist eine Kooperation des Hasso-Plattner-Instituts mit verschiedenen Industriepartnern. Seine Aufgabe ist die Ermöglichung und Förderung des Austausches zwischen Forschungsgemeinschaft und Industrie. Am Lab wird interessierten Wissenschaftlern eine Infrastruktur von neuester Hard- und Software kostenfrei für Forschungszwecke zur Verfügung gestellt. Dazu zählen teilweise noch nicht am Markt verfügbare Technologien, die im normalen Hochschulbereich in der Regel nicht zu finanzieren wären, bspw. Server mit bis zu 64 Cores und 2 TB Hauptspeicher. Diese Angebote richten sich insbesondere an Wissenschaftler in den Gebieten Informatik und Wirtschaftsinformatik. Einige der Schwerpunkte sind Cloud Computing, Parallelisierung und In-Memory Technologien. In diesem Technischen Bericht werden die Ergebnisse der Forschungsprojekte des Jahres 2013 vorgestellt. Ausgewählte Projekte stellten ihre Ergebnisse am 10. April 2013 und 24. September 2013 im Rahmen der Future SOC Lab Tag Veranstaltungen vor. future SOC lab Forschungsprojekte Multicore Architekturen In-Memory Technologie Cloud Computing Future SOC Lab research projects multicore architectures in-memory technology cloud computing Data processing Computer science
7	HPI future SOC lab : proceedings 2012 January 2013 (has links) The “HPI Future SOC Lab” is a cooperation of the Hasso-Plattner-Institut (HPI) and industrial partners. Its mission is to enable and promote exchange and interaction between the research community and the industrial partners. The HPI Future SOC Lab provides researchers with free of charge access to a complete infrastructure of state of the art hard- and software. This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies. This technical report presents results of research projects executed in 2012. Selected projects have presented their results on June 18th and November 26th 2012 at the Future SOC Lab Day events. / Das Future SOC Lab am HPI ist eine Kooperation des Hasso-Plattner-Instituts mit verschiedenen Industriepartnern. Seine Aufgabe ist die Ermöglichung und Förderung des Austausches zwischen Forschungsgemeinschaft und Industrie. Am Lab wird interessierten Wissenschaftlern eine Infrastruktur von neuester Hard- und Software kostenfrei für Forschungszwecke zur Verfügung gestellt. Dazu zählen teilweise noch nicht am Markt verfügbare Technologien, die im normalen Hochschulbereich in der Regel nicht zu finanzieren wären, bspw. Server mit bis zu 64 Cores und 2 TB Hauptspeicher. Diese Angebote richten sich insbesondere an Wissenschaftler in den Gebieten Informatik und Wirtschaftsinformatik. Einige der Schwerpunkte sind Cloud Computing, Parallelisierung und In-Memory Technologien. In diesem Technischen Bericht werden die Ergebnisse der Forschungsprojekte des Jahres 2012 vorgestellt. Ausgewählte Projekte stellten ihre Ergebnisse am 18. April 2012 und 14. November 2012 im Rahmen der Future SOC Lab Tag Veranstaltungen vor. Future SOC Lab Forschungsprojekte Multicore Architekturen In-Memory Technologie Cloud Computing Future SOC Lab research projects multicore architectures in-memory technology cloud computing Data processing Computer science
8	Calcul à haute performance et simulations stochastiques : Etude de la reproductibiité numérique sur architectures multicore et manycore / High performance computing and stochastic simulation : Study of numerical reproducibility on multicore and manycore architectures Dao, Van Toan 02 March 2017 (has links) La reproductibilité des expériences numériques sur les systèmes de calcul à haute performance est parfois négligée. De plus, les méthodes numériques employées pour une parallélisation rigoureuse des simulations stochastiques sont souvent méconnues. En effet, les résultats obtenus pour une simulation stochastique utilisant des systèmes de calcul à hautes performances peuvent être différents d’une exécution à l’autre, et ce pour les mêmes paramètres et les même contextes d’exécution du fait de l’impact des nouvelles architectures, des accélérateurs, des compilateurs, des systèmes d’exploitation ou du changement de l’ordre d’exécution en parallèle des opérations en arithmétique flottantes au sein des micro-processeurs. En cas de non répétabilité des expériences numériques, comment mettre au point les applications ? Quel crédit peut-on apporter au logiciel parallèle ainsi développé ? Dans cette thèse, nous faisons une synthèse des causes de non-reproductibilité pour une simulation stochastique parallèle utilisant des systèmes de calcul à haute performance. Contrairement aux travaux habituels du parallélisme, nous ne nous consacrons pas à l’amélioration des performances, mais à l’obtention de résultats numériquement répétables d’une expérience à l’autre. Nous présentons la reproductibilité et ses apports dans la science numérique expérimentale. Nous proposons dans cette thèse quelques contributions, notamment : pour vérifier la reproductibilité et la portabilité des générateurs modernes de nombres pseudo-aléatoires ; pour détecter la corrélation entre flux parallèles issus de générateurs de nombres pseudo-aléatoires ; pour répéter et reproduire les résultats numériques de simulations stochastiques parallèles indépendantes. / The reproducibility of numerical experiments on high performance computing systems is sometimes overlooked. Moreover, the numerical methods used for rigorous parallelization of stochastic simulations are often unknown. Indeed, the results obtained for a stochastic simulation using high performance computing systems can be different from run to run with the same parameters and the same execution contexts due to the impact of new architectures, accelerators, compilers, operating systems or a changing of the order of execution of the floating arithmetic operations within the micro-processors for parallelizing optimizations. In the case of non-repeatability of numerical experiments, how can we seriously develop a scientific application? What credit can be given to the parallel software thus developed? In this thesis, we synthesize the main causes of non-reproducibility for a parallel stochastic simulation using high performance computing systems. Unlike the usual parallelism works, we do not focus on improving performance, but on obtaining numerically repeatable results from one experiment to another. We present the reproducibility and its contributions to the science of experimental and numerical computing. Furthermore, we propose some contributions, in particular: to verify the reproducibility and portability of top modern pseudo-random number generators, to detect the correlation between parallel streams issued from such generators, to repeat and reproduce the numerical results of independent parallel stochastic simulations. Reproductibilité numérique Simulation stochastique parallèle Calcul à haute performance Architectures manycore et multicore Numerical reproducibility Parallel stochastic simulation High performance computing Manycore and multicore architectures
9	Etude de performances sur processeurs multicoeur : environnement d'exécution événementiel efficace et étude comparative de modèles de programmation / Performance studies on multicore processors : efficient event-driven runtime and programming models comparison Geneves, Sylvain 05 April 2013 (has links) Cette thèse traite des performances des serveurs de données en multi-cœur. Plus précisément nous nous intéressons au passage à l'échelle avec le nombre de cœurs. Dans un premier temps, nous étudions le fonctionnement interne d'un support d'exécution événementiel multi-cœur. Nous montrons tout d'abord que le faux-partage ainsi que les mécanismes de communications inter-cœurs dégradent fortement les performances et empêchent le passage à l'échelle des applications. Nous proposons alors plusieurs optimisations pour pallier ces comportements. Dans un second temps, nous comparons les performances en multi-cœur de trois serveurs Web chacun représentatif d'un modèle de programmation. Nous remarquons que les différences de performances observées entre les serveurs varient lorsque le nombre de cœurs augmente. Après une analyse approfondie des performances observées, nous identifions la cause de la limitation du passage à l'échelle des serveurs étudiés. Nous présentons une proposition ainsi qu'un ensemble de pistes pour lever cette limitation. / This thesis studies the performances of data servers on multicores. More precisely, we focus on the scalability with the number of cores. First, we study the internals of an event-driven multicore runtime. We demonstrate that false sharing and inter-core communications hurt performances badly, and prevent applications from scaling. We then propose several optimisations to fix these issues. In a second part, we compare the multicore performances of three Webservers, each reprensentative of a programming model. We observe that the differences between each server's performances vary as the number of cores increases. We are able to pinpoint the cause of the scalability limitation observed. We present one approach and some perspectives to overcome this limit. Architectures multi-coeur Performance des serveurs de données Programmation événementielle Gestion de la mémoire Serveurs Web Analyse de performance Modèles de programmation Multicore architectures Data servers performances Event-driven programming Memory management Webservers Performance analysis Programming models

Search results