Global ETD Search

1	System abstractions for resource scaling on heterogeneous platforms Gupta, Vishal 13 January 2014 (has links) The increasingly diverse nature of modern applications makes it critical for future systems to have dynamic resource scaling capabilities which enable them to adapt their resource usage to meet user requirements. Such mechanisms should be both fine-grained in nature for resource-efficient operation and also provide a high scaling range to support a variety of applications with diverse needs. To this end, heterogeneous platforms, consisting of components with varying characteristics, have been proposed to provide improved performance/efficiency than homogeneous configurations, by making it possible to execute applications on the most suitable component. However, introduction of such heterogeneous architectural components requires system software to embrace complexity associated with heterogeneity for managing them efficiently. Diversity across vendors and rapidly changing hardware make it difficult to incorporate heterogeneity-aware resource management mechanisms into mainstream systems, affecting the widespread adoption of these platforms. Addressing these issues, this dissertation presents novel abstractions and mechanisms for heterogeneous platforms which decouple heterogeneity from management operations by masking the differences due to heterogeneity from applications. By exporting a homogeneous interface over heterogeneous components, it proposes the scalable 'resource state' abstraction, allowing applications to express their resource requirements which then are dynamically and transparently mapped to heterogeneous resources underneath. The proposed approach is explored for both modern mobile devices where power is a key resource and for cloud computing environments where platform resource usage has monetary implications, resulting in HeteroMates and HeteroVisor solutions. In addition, it also highlights the need for hardware and system software to consider multiple resources together to obtain desirable gains from such scaling mechanisms. The solutions presented in this dissertation open ways for utilizing future heterogeneous platforms to provide on-demand performance, as well as resource-efficient operation, without disrupting the existing software stack. Resource scaling Heterogeneous platforms Cloud elasticity Heterogeneous computing Resource allocation
2	Multi-tasking scheduling for heterogeneous systems Wen, Yuan January 2017 (has links) Heterogeneous platforms play an increasingly important role in modern computer systems. They combine high performance with low power consumption. From mobiles to supercomputers, we see an increasing number of computer systems that are heterogeneous. The most well-known heterogeneous system, CPU+GPU platforms have been widely used in recent years. As they become more mainstream, serving multiple tasks from multiple users is an emerging challenge. A good scheduler can greatly improve performance. However, indiscriminately allocating tasks based on availability leads to poor performance. As modern GPUs have a large number of hardware resources, most tasks cannot efficiently utilize all of them. Concurrent task execution on GPU is a promising solution, however, indiscriminately running tasks in parallel causes a slowdown. This thesis focuses on scheduling OpenCL kernels. A runtime framework is developed to determine where to schedule OpenCL kernels. It predicts the best-fit device by using a machine learning-based classifier, then schedules the kernels accordingly to either CPU or GPU. To improve GPU utilization, a kernel merging approach is proposed. Kernels are merged if their predicted co-execution can provide better performance than sequential execution. A machine learning based classifier is developed to find the best kernel pairs for co-execution on GPU. Finally, a runtime framework is developed to schedule kernels separately on either CPU or GPU, and run kernels in pairs if their co-execution can improve performance. The approaches developed in this thesis significantly improve system performance and outperform all existing techniques. 006.6
3	Um framework para agrupar funções com base no comportamento da comunicação de dados em plataformas multiprocessadas / A framework for clustering functions based on the behavior of data communication on multiprocessed platforms Santos, Rafael Ribeiro dos 12 June 2018 (has links) O aumento da demanda por sistemas computacionais mais eficientes para obter alto desempenho impôs novos desafios à comunidade de pesquisa, que precisou buscar por novas plataformas heterogêneas para grandes aplicações. Para utilizar todo o potencial dessas plataformas, podese agrupar a aplicação em grupos menores de modo que cada grupo seja executado em uma unidade de processamento específica, para reduzir o gargalo de comunicação, de acordo com o comportamento de comunicação durante a execução da aplicação. Com o propósito de oferecer um agrupamento mais eficiente, este projeto propõe a análise de agrupamento de uma aplicação levando em consideração não só o volume total de dados, mas também a distribuição desse volume durante o tempo de execução associado à restrição da banda e da taxa de transmissão. Embora alguns trabalhos considerem o volume total de dados para o agrupamento, não é evidenciado como esse volume é distribuído e como a restrição de banda afeta o agrupamento. Assim, neste projeto foi implementado um framework para sugerir um agrupamento considerando a distribuição do volume de comunicação e restrições de banda. Além disso, foi desenvolvido um módulo de extensão para a ferramenta externa MCProf (Memory and Communication Profiler) com o objetivo de obter a distribuição do volume de comunicação. A validação do framework foi realizada por meios de testes de agrupamentos de aplicações nos quais foram comparados o tempo de comunicação do agrupamento gerado pela execução do framework em relação aos resultado dos agrupamentos considerando os trabalhos da literatura. O uso desta abordagem apresentou um aumento no desempenho que variou de 1,117X a 2,621X para as aplicações usadas nos experimentos. / The increased demand for more efficient computing systems to achieve high performance proposed new challenges to the research community, which needed to search for new heterogeneous platforms for large applications. To utilize the full potential of these platforms, the application can be grouped into small groups that runs on a specific processing unit to reduce the communication bottleneck according to the communication behavior during application execution . With the purpose of offering a more efficient clustering, this project proposes the analysis of clustering of an application taking into account not only the total volume of data, but also the distribution of that volume during the execution time associated to the band and restriction of rate transmission. Although some studies consider the total volume of data for the cluster, it is not clear how this volume is distributed and how the band constraint affects clustering. Thus, in this project was implemented a framework to suggest a cluster considering the distribution of the volume of communication and band restrictions. In addition, an extension module was developed for the external tool MCProf (Memory and Communication Profiler) in order to obtain the distribution of the communication. The validation of the framework was performed by clsutering tests which used applications in which the communication time of the cluster generated by the execution of framework was compared to the results of the clusters considering the literature. The use of this approach showed an increase in performance ranging from 1.117X to 2.621X for the applications used in the experiments. Agrupamento de funções Clustering functions Communication distribution Distribuição de comunicação Heterogeneous platforms Largura de banda variável Plataformas heterogêneas Variable bandwidth
4	Memory-aware algorithms : from multicores to large scale platforms Jacquelin, Mathias 20 July 2011 (has links) (PDF) This thesis focus on memory-aware algorithms tailored for hierarchical memory architectures, found for instance within multicore processors. We first study the matrix product on multicore architectures. We model such a processor, and derive lower bounds on the communication volume. We introduce three ad hoc algorithms, and experimentally assess their performance.We then target a more complex operation: the QR factorization of tall matrices. We revisit existing algorithms to better exploit the parallelism of multicore processors. We thus study the critical paths of many algorithms, prove some of them to be asymptotically optimal, and assess their performance.In the next study, we focus on scheduling streaming applications onto a heterogeneous multicore platform, the QS 22. We introduce a model of the platform and use steady-state scheduling techniques so as to maximize the throughput. We present a mixed integer programming approach that computes an optimal solution, and propose simpler heuristics. We then focus on minimizing the amount of required memory for tree-shaped workflows, and target a classical two-level memory system. I/O represent transfers from a memory to the other. We propose a new exact algorithm, and show that there exist trees where postorder traversals are arbitrarily bad. We then study the problem of minimizing the I/O volume for a given memory, show that it is NP-hard, and provide a set of heuristics.Finally, we compare archival policies for BLUE WATERS. We introduce two archival policies and adapt the well known RAIT strategy. We provide a model of the tape storage platform, and use it to assess the performance of the three policies through simulation. [INFO:INFO_OH] Computer Science/Other Memory hierarchy Scheduling Steady-state Heterogeneous platforms Heuristics Optimization Linear algebra Throughput maximization Memory constraints Multicore
5	Scheduling for Reliability : complexity and Algorithms Dufossé, Fanny 06 September 2011 (has links) (PDF) This thesis deals with the mapping and the scheduling of workflows. In this context, we consider unreliable platforms, with processors subject to failures. In a first part, we consider a particular model of streaming applications : the filtering services. In this context, we aim at minimizing period and latency. We first neglect communication costs. In this model, we study scheduling problems on homogeneous and heterogeneous platforms. Then, the impact of communication costs on scheduling problems of a filtering application is studied. Finally, we consider the scheduling problem of such an application on a chain of processors. The theoretical complexity of any variant of this problem is proved. This filtering property can model the reliability of processors. The results of some computations are successfully computed, and some other ones are lost. We consider the more frequent failure types : transient failures. We aim efficient and reliable schedules. The complexity of many variants of this problem is proved. Two heuristics are proposed and compared using using simulations. Even if transient failures are the most common failures in classical grids, some particular type of platform are more concerned by other type of problems. Desktop grids are especially unstable. In this context, we want to execute iterative applications. All tasks are executed, then a synchronization occurs, and so on. Two variants of this problem are considered : applicationsof independent tasks, and applications where all tasks need to be executed at same speed. In both cases, the problem is first theoretically studied, then heuristics are proposed and compared using simulations. [INFO:INFO_OH] Computer Science/Other Streaming applications Multicriteria optimization Heuristics Linear programs Heterogeneous platforms Complexity results Desktop grids Transient failures Reliability
6	Um framework para agrupar funções com base no comportamento da comunicação de dados em plataformas multiprocessadas / A framework for clustering functions based on the behavior of data communication on multiprocessed platforms Rafael Ribeiro dos Santos 12 June 2018 (has links) O aumento da demanda por sistemas computacionais mais eficientes para obter alto desempenho impôs novos desafios à comunidade de pesquisa, que precisou buscar por novas plataformas heterogêneas para grandes aplicações. Para utilizar todo o potencial dessas plataformas, podese agrupar a aplicação em grupos menores de modo que cada grupo seja executado em uma unidade de processamento específica, para reduzir o gargalo de comunicação, de acordo com o comportamento de comunicação durante a execução da aplicação. Com o propósito de oferecer um agrupamento mais eficiente, este projeto propõe a análise de agrupamento de uma aplicação levando em consideração não só o volume total de dados, mas também a distribuição desse volume durante o tempo de execução associado à restrição da banda e da taxa de transmissão. Embora alguns trabalhos considerem o volume total de dados para o agrupamento, não é evidenciado como esse volume é distribuído e como a restrição de banda afeta o agrupamento. Assim, neste projeto foi implementado um framework para sugerir um agrupamento considerando a distribuição do volume de comunicação e restrições de banda. Além disso, foi desenvolvido um módulo de extensão para a ferramenta externa MCProf (Memory and Communication Profiler) com o objetivo de obter a distribuição do volume de comunicação. A validação do framework foi realizada por meios de testes de agrupamentos de aplicações nos quais foram comparados o tempo de comunicação do agrupamento gerado pela execução do framework em relação aos resultado dos agrupamentos considerando os trabalhos da literatura. O uso desta abordagem apresentou um aumento no desempenho que variou de 1,117X a 2,621X para as aplicações usadas nos experimentos. / The increased demand for more efficient computing systems to achieve high performance proposed new challenges to the research community, which needed to search for new heterogeneous platforms for large applications. To utilize the full potential of these platforms, the application can be grouped into small groups that runs on a specific processing unit to reduce the communication bottleneck according to the communication behavior during application execution . With the purpose of offering a more efficient clustering, this project proposes the analysis of clustering of an application taking into account not only the total volume of data, but also the distribution of that volume during the execution time associated to the band and restriction of rate transmission. Although some studies consider the total volume of data for the cluster, it is not clear how this volume is distributed and how the band constraint affects clustering. Thus, in this project was implemented a framework to suggest a cluster considering the distribution of the volume of communication and band restrictions. In addition, an extension module was developed for the external tool MCProf (Memory and Communication Profiler) in order to obtain the distribution of the communication. The validation of the framework was performed by clsutering tests which used applications in which the communication time of the cluster generated by the execution of framework was compared to the results of the clusters considering the literature. The use of this approach showed an increase in performance ranging from 1.117X to 2.621X for the applications used in the experiments. Agrupamento de funções Distribuição de comunicação Largura de banda variável Plataformas heterogêneas Clustering functions Communication distribution Heterogeneous platforms Variable bandwidth
7	Scheduling of Dense Linear Algebra Kernels on Heterogeneous Resources / Ordonnancement de noyaux d'algèbre linéaire dense sur ressources hétérogènes Kumar, Suraj 12 April 2017 (has links) Du fait des énormes capacités de calculs des accélérateurs tels que les GPUs et les Xeon Phi, l’utilisation de machines multicoques pourvues d’accélérateurs est devenue commune dans le domaine du calcul haute performance (HPC). La complexité induite par ces accélérateurs a suscité le développement de systèmes d’exécution à base de tâches, dans lesquels les dépendances entre les applications sont exprimées sous la forme de graphe de tâches et où les tâches sont ordonnancées dynamiquement sur les ressources de calcul. La difficulté est alors de concevoir des stratégies d’ordonnancement qui font une utilisation efficace des ressources de calculs et le développement de telles stratégies, même pour un unique noeud hybride, est un enjeu essentiel de la performance des systèmes HPC. Nous considérons dans cette thèse l’ordonnancement de noyaux d’algèbre linéaire dense sur des noeuds complètement hétérogènes et constitués de CPUs et de GPUs. Les performances relatives des accélérateurs par rapport aux coeurs classique dépend très fortement du noyau considéré. Par exemple, les accélérateurs sont beaucoup plus efficaces pour les produits de matrices, par exemple, que pour les factorisations. Dans cette thèse, nous analysons les performances de stratégies statiques et dynamiques d’ordonnancement et nous proposons un ensemble de stratégies intermédiaires, en ajoutant des composantes statiques (respectivement dynamiques) à des stratégies d’ordonnancements dynamique (respectivement statiques). Récemment, une stratégie appelée HeteroPrio a été proposée, qui s’appuie sur les affinités entre les tâches et les ressources pour un petit ensemble de tâches différentes s’exécutant sur deux types de ressources. Nous avons étendu cette stratégie d’ordonnancement pour des graphes de tâches généraux pour deux types de ressources puis pour plus de deux types. De manière complémentaire, nous avons également démontré des facteurs d’approximation et des pires cas pour HeteroPrio dans le cas d’un ensemble de tâches indépendantes sur différents types de plates-formes. / Due to massive computation power of accelerators such as GPU, Xeon phi, multicore machines equipped with accelerators are becoming popular in High Performance Computing (HPC). The added complexity led to the development of different task-based runtime systems, which allow computations to be expressed as graphs of tasks and rely on runtime systems to schedule those tasks among all resources of the platform. The real challenge is to design efficient schedulers for such runtimes to make effective utilization of all resources. Developing good schedulers, even for a single hybrid node, and analyzing them can thus have a strong impact on the performance of current HPC systems. We consider the problem of scheduling dense linear algebra applications on fully hybrid platforms made of CPUs and GPUs. The relative performance of CPU and GPU highly depends on the sub-routine. For instance, GPUs are much more efficient to process matrix-matrix multiplications than matrix factorizations. In this thesis, we analyze the performance of static and dynamic scheduling strategies and we propose a set of intermediate strategies, by adding static (resp. dynamic) features into dynamic (resp. static) strategies. A resource centric dynamic scheduler, HeteroPrio, which is based on affinity between tasks and resources, has been proposed recently for a set of small independent tasks on two types of resources. We extend and analyze this scheduler for general task graphs first on two types of resources and then on more than two types of resources. Additionally, we provide approximation ratios and worst case examples of HeteroPrio for a set of independent tasks on different platform sizes. Algèbre linéaire dense Ordonnancement dynamique Plates-formes hétérogènes Systèmes d’ordonnancement dynamiques Dense Linear Algebra Dynamic Schedulers Task-based Scheduling Heterogeneous Platforms Runtime Systems
8	Memory-aware algorithms : from multicores to large scale platforms / Algorithmes orientés mémoire : des processeurs multi-cœurs aux plates-formes à grande échelle Jacquelin, Mathias 20 July 2011 (has links) Cette thèse s’intéresse aux algorithmes adaptés aux architectures mémoire hiérarchiques, rencontrées notamment dans le contexte des processeurs multi-cœurs.Nous étudions d’abord le produit de matrices sur les processeurs multi-cœurs. Nous modélisons le processeur, bornons le volume de communication, présentons trois algorithmes réduisant ce volume de communication et validons leurs performances. Nous étudions ensuite la factorisation QR, dans le contexte des matrices ayant plus de lignes que de colonnes. Nous revisitons les algorithmes existants afin d’exploiter les processeurs multi-cœurs, analysons leurs chemins critiques, montrons que certains sont asymptotiquement optimaux, et analysons leurs performances.Nous étudions ensuite les applications pipelinées sur une plate-forme hétérogène, le QS 22. Nous modélisons celle-ci et appliquons les techniques d’ordonnancement en régime permanent. Nous introduisons un programme linéaire mixte permettant d’obtenir une solution optimale. Nous introduisons en outre un ensemble d’heuristiques.Puis, nous minimisons la mémoire nécessaire à une application modélisée par un arbre, sur une plate-forme à deux niveaux de mémoire. Nous présentons un algorithme optimal et montrons qu’il existe des arbres tels que les parcours postfixes sont arbitrairement mauvais. Nous étudions alors la minimisation du volume d’E/S à mémoire donnée, montrons que ce problème est NP-complet, et présentons des heuristiques. Enfin, nous comparons plusieurs politiques d’archivage pour BLUE WATERS. Nous introduisons deux politiques d’archivage améliorant les performances de la politique RAIT, modélisons la plate-forme de stockage et simulons son fonctionnement. / This thesis focus on memory-aware algorithms tailored for hierarchical memory architectures, found for instance within multicore processors. We first study the matrix product on multicore architectures. We model such a processor, and derive lower bounds on the communication volume. We introduce three ad hoc algorithms, and experimentally assess their performance.We then target a more complex operation: the QR factorization of tall matrices. We revisit existing algorithms to better exploit the parallelism of multicore processors. We thus study the critical paths of many algorithms, prove some of them to be asymptotically optimal, and assess their performance.In the next study, we focus on scheduling streaming applications onto a heterogeneous multicore platform, the QS 22. We introduce a model of the platform and use steady-state scheduling techniques so as to maximize the throughput. We present a mixed integer programming approach that computes an optimal solution, and propose simpler heuristics. We then focus on minimizing the amount of required memory for tree-shaped workflows, and target a classical two-level memory system. I/O represent transfers from a memory to the other. We propose a new exact algorithm, and show that there exist trees where postorder traversals are arbitrarily bad. We then study the problem of minimizing the I/O volume for a given memory, show that it is NP-hard, and provide a set of heuristics.Finally, we compare archival policies for BLUE WATERS. We introduce two archival policies and adapt the well known RAIT strategy. We provide a model of the tape storage platform, and use it to assess the performance of the three policies through simulation. Hiérarchies mémoire Ordonnancement Régime permanent Plates-formes hétérogènes Méthodes heuristiques Optimisation Programmes linéaires Maximisation du débit Contraintes mémoire Multicoeur Memory hierarchy Scheduling Steady-state Heterogeneous platforms Heuristics Optimization Linear algebra Throughput maximization Memory constraints Multicore
9	Scheduling for Reliability : complexity and Algorithms / Ordonnancement pour la Fiabilité : complexité et algorithmes Dufossé, Fanny 06 September 2011 (has links) Les travaux présentés dans cette thèse portent sur le placement et l’ordonnancement d’applications de flots de données. On se place dans le contexte de plates-formes composées de processeurs sujets à des pannes. Dans une première partie, on considère un type particulier d’applications de flots de données: les services filtrants. On étudie l'ordonnancement de telles applications sur des plates-formes homogènes et hétérogènes, d'abord sans tenir compte des coûts de communication, puis en les incluant dans le modèle. On considère enfin l’ordonnancement d’un tel calcul sur une chaîne de processeurs. Le comportement d’un service filtrant est comparable à celui d’un calcul effectué sur un processeur non fiable: certains résultats vont être calculés, et d’autres perdus. On étudie le modèle des pannes transitoires. On veut effectuer un calcul à la fois fiable et efficace. La complexité de différentes variantes de ce problème est démontrée. Deux heuristiques sont décrites, puis comparées expérimentalement. Si les pannes transitoires sont les pannes les plus fréquemment rencontrées sur des grilles de calculs classiques, certains types de plates-formes rencontrent d’autres types de défaillances. Les grilles de volontaires sont particulièrement instables. Sur ce type de plate-forme, on veut exécuter des calculs itératifs. Cette application est constituée soit de tâches indépendantes, soit de tâches couplées, qui doivent être calculées ensemble et au même rythme. Dans chaque cas, le problème est d’abord étudié théoriquement, puis des heuristiques sontproposées, et leur performances sont comparées. / This thesis deals with the mapping and the scheduling of workflows. In this context, we consider unreliable platforms, with processors subject to failures. In a first part, we consider a particular model of streaming applications : the filtering services. In this context, we aim at minimizing period and latency. We first neglect communication costs. In this model, we study scheduling problems on homogeneous and heterogeneous platforms. Then, the impact of communication costs on scheduling problems of a filtering application is studied. Finally, we consider the scheduling problem of such an application on a chain of processors. The theoretical complexity of any variant of this problem is proved. This filtering property can model the reliability of processors. The results of some computations are successfully computed, and some other ones are lost. We consider the more frequent failure types : transient failures. We aim efficient and reliable schedules. The complexity of many variants of this problem is proved. Two heuristics are proposed and compared using using simulations. Even if transient failures are the most common failures in classical grids, some particular type of platform are more concerned by other type of problems. Desktop grids are especially unstable. In this context, we want to execute iterative applications. All tasks are executed, then a synchronization occurs, and so on. Two variants of this problem are considered : applicationsof independent tasks, and applications where all tasks need to be executed at same speed. In both cases, the problem is first theoretically studied, then heuristics are proposed and compared using simulations. Applications de flots de données Optimisation multi-critères Heuristiques Programmes linéaires Plate-formes hétérogènes Complexité Grilles de volontaires Pannes transitoires Streaming applications Multicriteria optimization Heuristics Linear programs Heterogeneous platforms Complexity results Desktop grids Transient failures Reliability
10	Energy-aware scheduling : complexity and algorithms Renaud-Goud, Paul 05 July 2012 (has links) (PDF) In this thesis we have tackled a few scheduling problems under energy constraint, since the energy issue is becoming crucial, for both economical and environmental reasons. In the first chapter, we exhibit tight bounds on the energy metric of a classical algorithm that minimizes the makespan of independent tasks. In the second chapter, we schedule several independent but concurrent pipelined applications and address problems combining multiple criteria, which are period, latency and energy. We perform an exhaustive complexity study and describe the performance of new heuristics. In the third chapter, we study the replica placement problem in a tree network. We try to minimize the energy consumption in a dynamic frame. After a complexity study, we confirm the quality of our heuristics through a complete set of simulations. In the fourth chapter, we come back to streaming applications, but in the form of series-parallel graphs, and try to map them onto a chip multiprocessor. The design of a polynomial algorithm on a simple problem allows us to derive heuristics on the most general problem, whose NP-completeness has been proven. In the fifth chapter, we study energy bounds of different routing policies in chip multiprocessors, compared to the classical XY routing, and develop new routing heuristics. In the last chapter, we compare the performance of different algorithms of the literature that tackle the problem of mapping DAG applications to minimize the energy consumption. [INFO:INFO_OH] Computer Science/Other Energy minimization Power Scheduling Complexity Makespan Heuristics Optimal algorithms Greedy algorithm Independent jobs Parallel processors Mapping Concurrent streaming applications Heterogeneous platforms Resource sharing Energy Latency Period Workflow Replica placement Tree networks Update strategies Dynamic programming algorithms Series-parallel graph Routing Chip multiprocesseur Manhattan Single path Multiple paths Slack reclamation DAG

Search results