Spelling suggestions: "subject:"workstealing"" "subject:"stealing""
1 |
Exploring heterogeneous scheduling using the task-centric programming modelPodobas, Artur, Brorsson, Mats, Vlassov, Vladimir January 2012 (has links)
Computer architecture technology is moving towards more heteroge-neous solutions, which will contain a number of processing units with different capabilities that may increase the performance of the system as a whole. How-ever, with increased performance comes increased complexity; complexity that is now barely handled in homogeneous multiprocessing systems. The present study tries to solve a small piece of the heterogeneous puzzle; how can we exploit all system resources in a performance-effective and user-friendly way? Our proposed solution includes a run-time system capable of using a variety of different heterogeneous components while providing the user with the already familiar task-centric programming model interface. Furthermore, when dealing with non-uniform workloads, we show that traditional approaches based on centralized or work-stealing queue algorithms do not work well and propose a scheduling algorithm based on trend analysis to distribute work in a performance-effective way across resources. / <p>QC 20130429</p> / ENCORE
|
2 |
Förbättring av mjukvarubibliotek för parallellberäkningar med programmeringsmodellen Chunks and TasksEl Harbiti, Deeb January 2015 (has links)
Chunks and Tasks is a programming model based on the C ++ programming language. This programming model is used for electronic structure calculations, among other things.The purpose of this project is to improve the CHT-MPI software library for Chunks and tasks, so that calculations of matrix-matrix multiplications are performed more efficiently than they do with the existing software library. The software library is based on the work stealing method, which is a method the software library for Chunks and Tasks uses for the distribution of the calculation work. The considered way to improve the software library is by modifying the work stealing method in a way that makes the distribution of calculation work happen in a more efficient way , which will lead to calculations performed faster than before.Two different modifications of the work stealing method were tested and it led to two new methods, Method 1 and Method 2, which distributed the calculation work differently. Method 1 did not give results that were compatible with the theory, since the calculation time with this method was much longer than the previous method. The results for method 2 were compatible with the theory for the method. Method 2 distributed the calculation work more efficiently than before which decreased the amount of data sent during the calculations, which led to a shorter calculation time than with the previous method. This method made an improvement of the software library for the programming model Chunks and Tasks. / Chunks and Tasks är en programmeringsmodell baserad på programspråket C++. Denna programmeringsmodell används vid bl.a. metoder för lösningar av Schrödingerekvationen för elektronerna i molekyler. Syftet med detta projekt är att förbättra mjukvarubiblioteket för Chunks and Tasks, så att beräkningar av matris-matris-multiplikationer utförs på ett effektivare sätt än vad de gör med det existerande mjukvarubiblioteket. Mjukvarubiblioteket använder sig av work stealing-metoden vid fördelning av beräkningsarbetet. Det är tänkt att mjukvarubiblioteket ska förbättras genom att just modifiera work stealing-metoden på ett sätt som får arbetsfördelningen att ske på ett smidigare sätt, vilket i sin tur ska leda till att beräkningarna utförs under en kortare tid än tidigare. Två olika ändringar av work stealing-metoden testades och man fick två nya metoder, metod 1 och metod 2, som fördelade beräkningsarbetet olika sätt. Det som söktes var en metod som kunde minska mängden data som skickades under beräkningarna av olika matris-matrismultiplikationer, då en minskad data-mängd innebar en förkortning av beräkningstiden. Med metod 1 fick man en försämring, då beräkningstiderna blev mycket längre än tidigare. Med metod 2 erhöll man ett bättre resultat, med denna metod fördelades arbetet på ett effektivare sätt som ledde till att mängden data som skickades minskade, vilket även betydde att beräkningstiderna kortades ner. Med denna metod fick man en förbättring av mjukvarubiblioteket för programmeringsmodellen Chunks and Tasks.
|
3 |
Effective cooperative scheduling of task-parallel applications on multiprogrammed parallel architecturesVaristeas, Georgios January 2015 (has links)
Emerging architecture designs include tens of processing cores on a single chip die; it is believed that the number of cores will reach the hundreds in not so many years from now. However, most common parallel workloads cannot fully utilize such systems. They expose fluctuating parallelism, and do not scale up indefinitely as there is usually a point after which synchronization costs outweigh the gains of parallelism. The combination of these issues suggests that large-scale systems will be either multiprogrammed or have their unneeded resources powered off.Multiprogramming leads to hardware resource contention and as a result application performance degradation, even when there are enough resources, due to negative share effects and increased bus traffic. Most often this degradation is quite unbalanced between co-runners, as some applications dominate the hardware over others. Current Operating Systems blindly provide applications with access to as many resources they ask for. This leads to over-committing the system with too many threads, memory contention and increased bus traffic. Due to the inability of the application to have any insight on system-wide resource demands, most parallel workloads will create as many threads as there are available cores. If every co-running application does the same, the system ends up with threads $N$ times the amount of cores. Threads then need to time-share cores, so the continuous context-switching and cache line evictions generate considerable overhead.This thesis proposes a novel solution across all software layers that achieves throughput optimization and uniform performance degradation of co-running applications. Through a novel fully automated approach (DVS and Palirria), task-parallel applications can accurately quantify their available parallelism online, generating a meaningful metric as parallelism feedback to the Operating System. A second component in the Operating System scheduler (Pond) uses such feedback from all co-runners to effectively partition available resources.The proposed two-level scheduling scheme ultimately achieves having each co-runner degrade its performance by the same factor, relative to how it would execute with unrestricted isolated access to the same hardware. We call this fair scheduling, departing from the traditional notion of equal opportunity which causes uneven degradation, with some experiments showing at least one application degrading its performance 10 times less than its co-runners. / <p>QC 20151016</p>
|
4 |
Equilibrage de charge dynamique sur plates-formes hiérarchiques / dynamic Load-Balancing on hierarchical platformsQuintin, Jean-Noël 08 December 2011 (has links)
La course à l'augmentation de la puissance de calcul qui se déroule depuis de nombreuses années entre les différents producteurs de matériel a depuis quelques années changé de visage: nous assistons en effet désormais à une véritable démocratisation des machines parallèles avec une complexification sans cesse croissante de la structure des processeurs. À terme, il est tout à fait envisageable de voir apparaître pour le grand public des architecture pleinement hétérogènes composées d'un ensemble de cœurs reliés par un réseau sur puce. La parallélisation et l'exécution parallèle d'applications sur les machines à venir soulèvent ainsi de nombreux problèmes. Parmi ceux-ci, nous nous intéressons ici au problème de l'ordonnancement d'un ensemble de tâches sur un ensemble de cœurs, c'est à dire le choix de l'affectation du travail à réaliser sur les ressources disponibles. Parmi les méthodes existantes, on distingue deux types d'algorithmes: en-ligne et hors-ligne. Les algorithmes en-ligne comme le vol de travail présentent l'avantage de fonctionner en l'absence d'informations sur le matériel ou la durée des tâches mais ne permettent généralement pas une gestion efficace des communications. Dans cette thèse, nous nous intéressons à l'ordonnancement de tâches en-ligne sur des plates-formes complexes pour lesquelles le réseau peut, par des problèmes de congestion, limiter les performances. Plus précisément, nous proposons de nouveaux algorithmes d'ordonnancement en-ligne, basés sur le vol de travail, ciblant deux configurations différentes. D'une part, nous considérons des applications pour lesquelles le graphe de dépendance est connu à priori. L'utilisation de cette information nous permet ainsi de limiter les quantités de données transférées et d'obtenir des performances supérieures aux meilleurs algorithmes hors-ligne connus. D'autre part, nous étudions les optimisations possibles lorsque l'algorithme d'ordonnancement connaît la topologie de la plate-forme. Encore une fois, nous montrons qu'il est possible de tirer parti de cette information pour réaliser un gain non-négligeable en performance. Nos travaux permettent ainsi d'étendre le champ d'application des algorithmes d'ordonnancement vers des architectures plus complexes et permettront peut-être une meilleure utilisation des machines de demain. / The race towards more processing power between all different hardware manufacturers has in recent years faced deep changes. We see nowadays a huge development in the use of parallel machines with more and more cores and increasingly complex architectures. It seems now clear that we will witness in a near future the development of cheap Network On Chip computers. Executing parallel applications on such machines raises several problems. Amongst them we take in this work interest in the problem of scheduling a set of tasks on a set of computing resources. Between all existing methods we can generally distinguish on-line or off-line algorithms. On-line algorithms like work-stealing present the advantage to work without informations on hardware or tasks durations but do not generally achieve an efficient control of communications. In this book we take interest in on-line tasks scheduling on complex platforms where networking can impact (through congestion) performance. More precisely, we propose several new scheduling algorithms based on work-stealing targeting two different configurations. In a first study, we consider applications whose dependency graph is known in advance. By taking advantage of this information we manage to limit the amount of data transfered and thus to achieve high performance and even outperform the best known off-line algorithms. Concurrently to that, we also study possible optimisations in the case where knowledge of platform topology is available. We show again that it is possible to use this information to enhance performance. Our work allows therefore to extend the application field of scheduling algorithms towards more complex architectures and we hope will allow a better use of tomorrow's machine.
|
5 |
Escalonamento Work-Stealing de programas Divisão-e-Conquista com MPI-2 / Scheduling Divide-and-Conquer programs by Work-Stealing with MPI-2Pezzi, Guilherme Peretti January 2006 (has links)
Com o objetivo de ser portável e eficiente em arquiteturas HPC atuais, a execução de um programa paralelo deve ser adaptável. Este trabalho mostra como isso pode ser atingido utilizando MPI, através de criação dinâmica de processos, integrada com programação Divisão-e-Conquista e uma estratégia Work-Stealing para balancear os processos MPI, em ambientes heterogêneos e/ou dinâmicos, em tempo de execução. Este trabalho explica como implementar uma aplicação segundo o modelo de Divisão-e-Conquista com MPI, bem como a implementação de uma estratégia Work-Stealing. São apresentados resultados experimentais baseados em uma aplicação sintética, o problema das N-Rainhas (N-Queens). Valida-se tanto a adaptabilidade e a eficiência do código. Os resultados mostram que é possível utilizar um padrão amplamente difundido como o MPI, mesmo em plataformas de HPC não tão homogêneas como um cluster. / In order to be portable and efficient on modern HPC architectures, the execution of a parallel program must be adaptable. This work shows how to achieve this in MPI, by the dynamic creation of processes, coupled with Divide-and-Conquer programming and a Work-Stealing strategy to balance the MPI processes, in a heterogeneous and/or dynamic environment, at runtime. The application of Divide and Conquer with MPI is explained, as well as the implementation of a Work-Stealing strategy. Experimental results are provided, based on a synthetic application, the N-Queens computation. Both the adaptability of the code and its efficiency are validated. The results show that it is possible to use widely spread standards such as MPI, even in parallel HPC platforms that are not as homogeneous as a Cluster.
|
6 |
Escalonamento Work-Stealing de programas Divisão-e-Conquista com MPI-2 / Scheduling Divide-and-Conquer programs by Work-Stealing with MPI-2Pezzi, Guilherme Peretti January 2006 (has links)
Com o objetivo de ser portável e eficiente em arquiteturas HPC atuais, a execução de um programa paralelo deve ser adaptável. Este trabalho mostra como isso pode ser atingido utilizando MPI, através de criação dinâmica de processos, integrada com programação Divisão-e-Conquista e uma estratégia Work-Stealing para balancear os processos MPI, em ambientes heterogêneos e/ou dinâmicos, em tempo de execução. Este trabalho explica como implementar uma aplicação segundo o modelo de Divisão-e-Conquista com MPI, bem como a implementação de uma estratégia Work-Stealing. São apresentados resultados experimentais baseados em uma aplicação sintética, o problema das N-Rainhas (N-Queens). Valida-se tanto a adaptabilidade e a eficiência do código. Os resultados mostram que é possível utilizar um padrão amplamente difundido como o MPI, mesmo em plataformas de HPC não tão homogêneas como um cluster. / In order to be portable and efficient on modern HPC architectures, the execution of a parallel program must be adaptable. This work shows how to achieve this in MPI, by the dynamic creation of processes, coupled with Divide-and-Conquer programming and a Work-Stealing strategy to balance the MPI processes, in a heterogeneous and/or dynamic environment, at runtime. The application of Divide and Conquer with MPI is explained, as well as the implementation of a Work-Stealing strategy. Experimental results are provided, based on a synthetic application, the N-Queens computation. Both the adaptability of the code and its efficiency are validated. The results show that it is possible to use widely spread standards such as MPI, even in parallel HPC platforms that are not as homogeneous as a Cluster.
|
7 |
Escalonamento Work-Stealing de programas Divisão-e-Conquista com MPI-2 / Scheduling Divide-and-Conquer programs by Work-Stealing with MPI-2Pezzi, Guilherme Peretti January 2006 (has links)
Com o objetivo de ser portável e eficiente em arquiteturas HPC atuais, a execução de um programa paralelo deve ser adaptável. Este trabalho mostra como isso pode ser atingido utilizando MPI, através de criação dinâmica de processos, integrada com programação Divisão-e-Conquista e uma estratégia Work-Stealing para balancear os processos MPI, em ambientes heterogêneos e/ou dinâmicos, em tempo de execução. Este trabalho explica como implementar uma aplicação segundo o modelo de Divisão-e-Conquista com MPI, bem como a implementação de uma estratégia Work-Stealing. São apresentados resultados experimentais baseados em uma aplicação sintética, o problema das N-Rainhas (N-Queens). Valida-se tanto a adaptabilidade e a eficiência do código. Os resultados mostram que é possível utilizar um padrão amplamente difundido como o MPI, mesmo em plataformas de HPC não tão homogêneas como um cluster. / In order to be portable and efficient on modern HPC architectures, the execution of a parallel program must be adaptable. This work shows how to achieve this in MPI, by the dynamic creation of processes, coupled with Divide-and-Conquer programming and a Work-Stealing strategy to balance the MPI processes, in a heterogeneous and/or dynamic environment, at runtime. The application of Divide and Conquer with MPI is explained, as well as the implementation of a Work-Stealing strategy. Experimental results are provided, based on a synthetic application, the N-Queens computation. Both the adaptability of the code and its efficiency are validated. The results show that it is possible to use widely spread standards such as MPI, even in parallel HPC platforms that are not as homogeneous as a Cluster.
|
8 |
Parallélisme en programmation par contraintes / Parallelism in constraint programmingRezgui, Mohamed 08 July 2015 (has links)
Nous étudions la parallélisation de la procédure de recherche de solution d’un problème en Programmation Par Contraintes (PPC). Après une étude de l’état de l’art, nous présentons une nouvelle méthode, nommée Embarrassingly Parallel Search (EPS). Cette méthode est basée sur la décomposition d’un problème en un très grand nombre de sous-problèmes disjoints qui sont ensuite résolus en parallèle par des unités de calcul avec très peu, voire aucune communication. Le principe d’EPS est d’arriver statistiquement à un équilibrage des temps de résolution de chaque unité de calcul afin d’obtenir une bonne répartition de la charge de travail. EPS s’appuie sur la propriété suivante : la somme des temps de résolution de chacun des sous-problèmes est comparable au temps de résolution du problème en entier. Cette propriété est vérifiée en PPC, ce qui nous permet de disposer d’une méthode simple et efficace en pratique. Dans nos expérimentations, nous nous intéressons à la recherche de toutes les solutions d’un problème en PPC, à prouver qu’un problème n’a pas de solution et à la recherche d’une solution optimale d’un problème d’optimisation. Les résultats montrent que la décomposition doit générer au moins 30 sous-problèmes par unité de calcul pour obtenir des charges de travail par unité de calcul équivalentes. Nous évaluons notre approche sur différentes architectures (machine multi-coeurs, centre de calcul et cloud computing) et montrons qu’elle obtient un gain pratiquement linéaire en fonction du nombre d’unités de calcul. Une comparaison avec les méthodes actuelles telles que le work stealing ou le portfolio montre qu’EPS obtient de meilleurs résultats. / We study the search procedure parallelization in Constraint Programming (CP). After giving an overview on various existing methods of the state-of-the-art, we present a new method, named Embarrassinqly Parallel Search (EPS). This method is based on the decomposition of a problem into many disjoint subproblems which are then solved in parallel by computing units with little or without communication. The principle of EPS is to have a resolution times balancing for each computing unit in a statistical sense to obtain a goodDépôt de thèse – Données complémentaireswell-balanced workload. We assume that the amount of resolution times of all subproblems is comparable to the resolution time of the entire problem. This property is checked with CP and allows us to have a simple and efficient method in practice. In our experiments, we are interested in enumerating all solutions of a problem, and proving that a problem has no solution and finding an optimal solution of an optimization problem. We observe that the decomposition has to generate at least 30 subproblems per computing unit to get equivalent workloads per computing unit. Then, we evaluate our approach on different architectures (multicore machine, cluster and cloud computing) and we observe a substantially linear speedup. A comparison with current methods such as work stealing or portfolio shows that EPS gets better results.
|
9 |
Porting Cilk to the Barrelfish OSHo Bao Le, Chau January 2013 (has links)
Barrelfish operating system is an experimental instance of multikernel structure which exhibits good features such as hardware heterogeneity, scalability, dynamicity, etc. Barrelfish is in progress and lacks applications. Therefore, there is a need to investigate the efficiency of applications running in Barrelfish and one of candidates is a shared-memory application. To conduct an empirical study, Cilk is chosen inasmuch as its runtime library is designed for shared-memory architectures and it has been known to expose good performance. This thesis focuses on making Cilk run on top of Barrelfish in order to reach two goals: portability which is described to be supported by Barrelfish, and good speed afterwards. The porting involves compiling Cilk runtime source code by replacing its pthread subroutines with set of APIs in Barrelfish and then changing the way Cilk scheduler spawns worker thread on multiple cores. However, the main point of the porting is to make different cores access to the same virtual address space. Luckily, Barrelfish provides a notion of domain which specifies the number of cores in an application so that these cores can share the same memory space. This thesis also has carried out benchmarks on some Cilk programs and found that Cilk does not perform as well as it is expected. In addition measurements on parallel workers shows that Cilk on Barrelfish takes more cycles to perform computation. Although Cilk still maintains work-first principle, it cannot achieve the time bound. The spanning domain cost is proportional to the number of cores, but it will matter if applications take small time to complete.
|
10 |
Escalonamento por roubo de tarefas em sistemas Multi-CPU e Multi-GPUPinto, Vinícius Garcia January 2013 (has links)
Nos últimos anos, uma das alternativas adotadas para aumentar o desempenho de sistemas de processamento de alto desempenho têm sido o uso de arquiteturas híbridas. Essas arquiteturas são constituídas de processadores multicore e coprocessadores especializados, como GPUs. Esses coprocessadores atuam como aceleradores em alguns tipos de operações. Por outro lado, as ferramentas e modelos de programação paralela atuais não são adequados para cenários híbridos, produzindo aplicações pouco portáveis. O paralelismo de tarefas considerado um paradigma de programação genérico e de alto nível pode ser adotado neste cenário. Porém, exige o uso de algoritmos de escalonamento dinâmicos, como o algoritmo de roubo de tarefas. Neste contexto, este trabalho apresenta um middleware (WORMS) que oferece suporte ao paralelismo de tarefas com escalonamento por roubo de tarefas em sistemas híbridos multi-CPU e multi-GPU. Esse middleware permite que as tarefas tenham implementação tanto para execução em CPUs quanto em GPUs, decidindo em tempo de execução qual das implementações será executada de acordo com os recursos de hardware disponíveis. Os resultados obtidos com o WORMS mostram ser possível superar, em algumas aplicações, tanto o desempenho de ferramentas de referência para execução em CPU quanto de ferramentas para execução em GPUs. / In the last years, one of alternatives adopted to increase performance in high performance computing systems have been the use of hybrid architectures. These architectures consist of multicore processors and specialized coprocessors, like GPUs. Coprocessors act as accelerators in some types of operations. On the other hand, current parallel programming models and tools are not suitable for hybrid scenarios, generating less portable applications. Task parallelism, considered a generic and high level programming paradigm, can be used in this scenario. However, it requires the use of dynamic scheduling algorithms, such as work stealing. In this context, this work presents a middleware (WORMS) that supports task parallelism with work stealing scheduling in multi-CPU and multi-GPU systems. This middleware allows task implementations for both CPU and GPU, deciding at runtime which implementation will run according to the available hardware resources. The performance results obtained with WORMS showed that is possible to outperform both CPU and GPU reference tools in some applications.
|
Page generated in 0.0517 seconds