Global ETD Search

41	Harmony: an execution model for heterogeneous systems
Diamos, Gregory Frederick, 10 November 2011
The emergence of heterogeneous and many-core architectures presents a unique opportunity to deliver order-of-magnitude performance increases to high-performance applications by matching certain classes of algorithms to specifically tailored architectures. However, their widespread adoption has been limited by a lack of programming models and management frameworks designed to reduce the high degree of software development complexity inherent to heterogeneous architectures. This dissertation introduces Harmony, an execution model for heterogeneous systems that draws heavily on concepts and optimizations used in processor micro-architecture to provide: (1) semantics for simplifying heterogeneity management, (2) dynamic scheduling of compute-intensive kernels to heterogeneous processor resources, and (3) online, monitoring-driven performance optimization for heterogeneous many-core systems. This work focuses on simplifying development and ensuring binary portability and scalability across system configurations and sizes.
Heterogeneous; Many-core; Compiler; Runtime; GPU; Processor; SIMD; Scheduling; Execution model; Modeling; Computing model; Computer architecture; Algorithms; Heterogeneous computing
42	Simulation fonctionnelle native pour des systèmes many-cœurs / Functional native simulation techniques for many-core systems
Sarrazin, Guillaume, 23 May 2016
The number of transistors on a chip keeps increasing in line with Moore's conjecture, which states that the number of transistors per chip doubles every two years. Current systems are so complex that architectural exploration, chip design, and application software development take too much time, even when the software is developed in parallel with the hardware architecture, often because of system integration issues. To reduce this time, the generally accepted solution is to use virtual platforms that reproduce the behavior of the target chip. Simulation speed is a major concern for these platforms, especially for many-core systems, because of the large number of cores to simulate. This thesis focuses on native simulation, whose principle is to compile the source code directly for the host architecture. This allows very fast simulation, at the cost of requiring "equivalent" features on the target and host cores. However, some features specific to the target core may be missing from the host core. Hardware-Assisted Virtualization (HAV) is used as a basis for native simulation, but it reinforces the dependency of the target-core simulation on the capabilities of the host core.
In this context, we propose a way to simulate the target core's specific functional features in HAV-based native simulation. Among those features, the floating-point unit is an important element that is too often neglected in native simulation, so that some computations give different results on the target and host cores. We restrict our study to compiled simulation and propose a methodology that simulates floating-point computations accurately while keeping a good simulation speed. Finally, native simulation has a scalability issue. Time-decoupling problems cause some instructions to be simulated unnecessarily during synchronisation procedures between tasks running on the target cores, which significantly reduces simulation speed as the number of cores grows. We propose solutions that allow native simulation to scale better.
Native simulation; Compiled simulation; System-on-chip; FPU; Many-core; 004
43	Arquitetura de NoC programável baseada em múltiplos clusters de cores para suporte a padrões de comunicação coletiva / Programmable multi-cluster NoC architecture to support collective communication patterns
Freitas, Henrique Cota de, January 2009
The next generations of many-core processors require new approaches to processor architecture design. In this context, on-chip interconnection networks are important for ensuring program performance. Traditional interconnection solutions have physical limits that compromise scalability and performance when processing parallel applications of various kinds. The alternative indicated by the state of the art is the Network-on-Chip (NoC), which consists of routers and other network elements capable of providing scalable, high-performance communication. However, workloads produce different communication patterns, which can influence network performance. Some research addresses the design of dedicated, application-specific NoCs to meet the demands of specific workloads. Although a dedicated NoC offers high performance, parallel workloads generate collective communication patterns that change dynamically.
To increase the flexibility of NoCs, related work uses concepts from reconfigurable computing so that the NoC architecture can adapt to communication patterns. Some of this work focuses on FPGA-based reconfiguration and some on polymorphic ASICs. The goal of this thesis is to propose a NoC architecture that supports multiple clusters of processing cores through programmable routers and reconfigurable topologies. Each router consists of a reconfigurable crossbar switch capable of implementing topologies dynamically through a second reconfiguration level. The routers have network processors that increase the NoC's flexibility and its ability to adapt to the communication pattern, by means of programs that monitor and manage the network. The contribution of this thesis is therefore the Programmable Multi-Cluster NoC (MCNoC) architecture.
Results based on analytical and simulation models, with artificial and natural workloads, show that the proposed architecture achieves high performance and packet throughput, owing to topology adaptation and the reduced influence of the network on communication. FPGA occupation results show that the programmable routers are similar in size to NoCs with traditional architectures managing the same number of cores. The lower utilization of input buffers improves power and energy efficiency. The design and evaluation models therefore indicate that the MCNoC is an alternative architecture for supporting collective communication patterns.
Parallel processing; Microprocessors; Performance: Computers; Microelectronics; Network-on-chip architecture; Communication patterns; Many-core processors
44	Exécution prédictible sur processeurs pluri-coeurs / Predictable execution on many-core processors
Perret, Quentin, 25 April 2017
In this thesis, we study the suitability of the distributed architecture of many-core processors for the design of highly constrained real-time systems, such as those found in avionics.
We first propose a thorough analysis of an existing commercial off-the-shelf (COTS) processor, the KALRAY MPPA®-256, and identify some of its shared resources as bottlenecks that limit both performance and predictability when several applications run concurrently. To limit the impact of these resources on worst-case execution times (WCETs), we formally define an execution model that restricts access to them and temporally isolates co-running applications. We describe in detail how this execution model can be implemented in a hypervisor that gives each application an isolated execution environment and enforces the expected behavior at run time. On this basis, we formalize the notion of a partition as the association of an application with a budget of hardware resources. In our approach, an application placed in a partition is guaranteed to be temporally isolated from applications placed in other partitions. Then, given an application and its resource budget, we propose to use constraint programming to verify automatically whether the resources allocated by the budget are sufficient to meet all of the application's constraints. When a budget is valid, our approach also computes a complete schedule and mapping of the application on the subset of the processor's resources allocated to its partition.
Many-core; Processor; Mapping; Scheduling; Predictability; Avionics; Real time; 621
47	Movement sensor using image correlation on a multicore platform
Lind, Christoffer; Green, Jonas; Ingvarsson, Thomas, January 2012
The purpose of this study was to investigate the possibility of measuring the speed of a vehicle using image correlation. It was identified that a new way of measuring vehicle speed would open up possibilities for high-precision driving applications, since today's solution does not give the True Speed Over Ground. A further aim was to evaluate the performance of the proposed algorithm on a multicore platform. The study was commissioned by Halmstad University.
The investigation of image correlation as a method of measuring vehicle speed was conducted by applying the proposed algorithm to a sequence of images. The result was compared with reference points in the image sequence to confirm the accuracy. The performance of the multicore platform was measured by counting the clock cycles needed to perform one measurement cycle of the algorithm.
Image correlation was found to measure speed with a positional accuracy of close to half a percent. The results also showed that one measurement cycle of the algorithm could be performed in close to half a millisecond and that the achieved parallel utilization of the multicore platform was close to eighty-seven percent.
It was concluded that the algorithm performed well within the limit of acceptance. Regarding performance, the low execution time of a measurement cycle makes it possible to execute the algorithm at a frequency of eighteen hundred hertz. At that frequency, combined with the camera settings proposed in the thesis, the algorithm would be able to measure speeds close to one thousand one hundred kilometers per hour.
The authors recommend that future work focus on investigating the camera parameters in order to optimize both the memory and computational requirements of the application. It is also recommended to look more closely at the algorithm and at the possibility of detecting transversal and angular changes, as this would open up other application areas that require more than just the speed.
Adapteva; Multicore; Parallel; Image correlation; Speed sensor; True speed over ground; Many-core; Phase correlation; Parallelization; Computer Sciences
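To illustrate the technique named in the keywords of entry 47 (phase correlation), here is a minimal sketch; it is not the authors' implementation. It assumes NumPy, two equally sized grayscale frames that differ by a circular integer shift, and hypothetical calibration parameters `metres_per_pixel` and `frame_rate_hz`. None of these names come from the thesis.

```python
import math

import numpy as np


def phase_correlation_shift(frame_a, frame_b):
    """Estimate the integer (dy, dx) shift such that frame_b ~ np.roll(frame_a, (dy, dx))."""
    spec_a = np.fft.fft2(frame_a)
    spec_b = np.fft.fft2(frame_b)
    # Normalised cross-power spectrum: keep only the phase difference.
    cross = spec_b * np.conj(spec_a)
    cross /= np.maximum(np.abs(cross), 1e-12)  # guard against division by zero
    # The inverse transform peaks at the displacement between the frames.
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Indices past the midpoint correspond to negative shifts (wrap-around).
    return tuple(int(p) if p <= n // 2 else int(p) - n
                 for p, n in zip(peak, corr.shape))


def ground_speed_kmh(shift_px, metres_per_pixel, frame_rate_hz):
    """Convert a per-frame pixel shift to km/h (hypothetical calibration values)."""
    dy, dx = shift_px
    metres_per_second = math.hypot(dy, dx) * metres_per_pixel * frame_rate_hz
    return metres_per_second * 3.6


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((64, 64))
    moved = np.roll(frame, (5, -3), axis=(0, 1))
    print(phase_correlation_shift(frame, moved))
```

For reference, a measurement rate of eighteen hundred hertz corresponds to a period of about 0.56 ms per cycle, which matches the abstract's "close to half a millisecond".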
48	A source-to-source compiler for the PRAM language Fork to the REPLICA many-core architecture
Zhou, Cheng, January 2012
This thesis describes the implementation of a source-to-source compiler that translates the Fork language to the REPLICA baseline language. Fork is a high-level programming language designed for the PRAM (Parallel Random Access Machine) model. The baseline language is a low-level parallel programming language for the REPLICA architecture, which implements the PRAM computing model. To support the Fork language on REPLICA, a compiler that translates Fork to baseline is built. The Fork-to-baseline compiler is built to be compatible with the Fork implementation for SB-PRAM. Moreover, the libraries that support Fork's features are built using the baseline language. The evaluation verifies that the features of the Fork language are supported in the implementation. It also shows the scalability of our implementation and that the overhead introduced by Fork-to-baseline translation is small.
Source-to-source compiler; Many-core computing; PRAM model of parallel computing; Fork language; Computer Sciences
49	Enhancing Task Assignment in Many-Core Systems by a Situation Aware Scheduler
Meier, Tobias; Ernst, Michael; Frey, Andreas; Hardt, Wolfram, 17 July 2017
The resource demand on embedded devices is constantly growing. This is caused by the explosion of software-based functions in embedded systems, which are growing far faster than the resources of single-core and multi-core embedded processors. Since one of the limitations is the computing power of the processors, we need to explore ways to use this resource more efficiently. We identified that, during the run time of an embedded device, the resource demand of the software functions changes continually depending on the device situation. To enable an embedded device to take advantage of this dynamic resource demand, the allocation of software functions to the processor must be handled by a scheduler that is able to evaluate the resource demand of the software functions in relation to the device situation. This marks a change in embedded devices from statically defined software systems to dynamic software systems. Beyond that, we can increase efficiency even further by extending the approach from a single device to a distributed or networked system (many-core system). However, existing approaches to dynamic resource allocation focus on individual devices and leave the optimization potential of many-core systems untouched. Our concept extends the existing Hierarchical Asynchronous Multi-Core Scheduler (HAMS) concept for individual devices to many-core systems. This extension introduces a dynamic, situation-aware scheduler for many-core systems that takes the current workload of all devices and the system situation into account. With our approach, the resource efficiency of an embedded many-core system can be increased. The paper explains the architecture and the expected results of our concept.
Many-core scheduling; Real-time scheduling; Dynamic scheduling; Deterministic scheduling; Coordinated scheduling; Scheduling; Embedded systems; Dynamics; ddc:000
50	Dynamic optimization of data-flow task-parallel applications for large-scale NUMA systems / Optimisation dynamique des applications à base de tâches data-flow pour des machines NUMA
Drebes, Andi, 25 June 2015
In the mid-2000s, microprocessor development reached a point at which higher clock rates and more complex micro-architectures became less energy-efficient, such that power consumption and energy density were pushed beyond reasonable limits. As a consequence, the industry shifted to more energy-efficient multi-core designs that integrate multiple processing units (cores) on a single chip. Today's high-performance systems comprise hundreds of cores; the number of cores is expected to grow exponentially, and future systems are expected to integrate thousands of processing units. To provide sufficient memory bandwidth in these systems, main memory is physically distributed over multiple memory controllers with non-uniform memory access (NUMA). Past research has identified programming models based on fine-grained, dependent tasks as a key technique for unleashing the parallel processing power of massively parallel general-purpose computing architectures. However, the execution of task-parallel programs on architectures with non-uniform memory access, and the dynamic optimizations needed to mitigate NUMA effects, have received little attention. In this thesis, we explore the main factors affecting the performance and data locality of task-parallel programs and propose a set of transparent, portable, and fully automatic on-line mechanisms for mapping tasks to cores and data to memory controllers in order to improve data locality and performance.
Placement decisions are based on information about point-to-point data dependences, which is readily available in the run-time systems of modern task-parallel programming frameworks. The experimental evaluation of these techniques is conducted on our implementation in the run-time of the OpenStream language and on a set of high-performance scientific benchmarks. Finally, we designed and implemented Aftermath, a tool for performance analysis and debugging of task-parallel applications and their run-times.
Parallel programming; Runtime; Many-core; NUMA; Scheduling; Memory allocation; Performance analysis; Parallel programs; 004
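The idea in entry 50 of basing placement on point-to-point data dependences can be illustrated with a small sketch. This is a hypothetical heuristic written for illustration only, not the mechanism implemented in OpenStream or described in the thesis: it assumes each incoming dependence of a task is known as a `(node_id, n_bytes)` pair and places the task on the NUMA node that already holds most of its input bytes. The function name `pick_node` and its parameters are assumptions.

```python
from collections import defaultdict


def pick_node(input_deps, default_node=0):
    """Choose the NUMA node holding the most input bytes for a task.

    input_deps: iterable of (node_id, n_bytes) pairs, one per incoming
    data dependence (hypothetical representation).
    """
    bytes_per_node = defaultdict(int)
    for node_id, n_bytes in input_deps:
        bytes_per_node[node_id] += n_bytes
    if not bytes_per_node:
        # A task without inputs has no locality preference.
        return default_node
    # Largest byte count wins; ties go to the lowest node id.
    return min(bytes_per_node, key=lambda n: (-bytes_per_node[n], n))


if __name__ == "__main__":
    # Node 0 holds 6144 input bytes, node 1 holds 8192.
    print(pick_node([(0, 4096), (1, 8192), (0, 2048)]))
```

A real run-time would also have to weigh load balance and the placement of output data, which this sketch deliberately ignores.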
