Global ETD Search

1	Compile-time and Run-time Optimizations for Enhancing Locality and Parallelism on Multi-core and Many-core Systems Baskaran, Muthu Manikandan 05 November 2009 No description available. Computer Science Compilers. Multi-cores GPUs
2	Um método de otimização da relação desempenho/consumo de energia para arquiteturas multi-cores heterogêneas em FPGA / A method to optimize performance/energy consumption relation for heterogeneous multi-core architectures on FPGA Silva, Bruno de Abreu 07 March 2016 Owing to the growing volume of data being processed and the growing need for high-performance computing, significant changes are taking place in computer architecture design. The field has been moving from the sequential to the parallel paradigm, with hundreds or thousands of processing cores on a single chip. In this context, power management becomes increasingly important, especially in embedded systems, which are usually battery-powered. According to Moore's Law, processor performance doubles every 18 months, whereas battery capacity doubles only every 10 years. This creates a wide gap, which heterogeneous multi-core architectures can help to narrow. A key challenge that remains open for these architectures is integrating embedded code development, scheduling and hardware for power management. The overall goal of this doctoral work is to investigate techniques for optimizing the performance/energy-consumption trade-off in single-ISA heterogeneous multi-core architectures implemented on FPGA. To that end, solutions were sought that deliver the best possible performance at optimal energy consumption. This was done by combining data mining for the analysis of thread-based software with traditional power-management techniques, such as dynamic way-shutdown, and a new heterogeneity-aware scheduling policy. 
The main contributions are the combination of power-management techniques at several levels (hardware, scheduling and compilation) and a scheduling policy integrated with a multi-core architecture that is heterogeneous in its L1 cache size. / Due to the growing need for high-performance computing and the growing volume of data to process, important changes are happening in computer architecture design. Parallel processors with hundreds or thousands of processing cores on a single chip are becoming a common solution, even for embedded systems. Power management becomes increasingly important, especially for mobile systems. A key challenge that remains open for these architectures is integrating application code, runtime scheduling and hardware control for power management. This thesis aims to present a method that integrates these three aspects by investigating techniques for optimizing performance versus power consumption in single-ISA heterogeneous multi-core architectures implemented on FPGA. Our approach combines a data mining technique that analyzes the application source code, traditional power-management techniques, and a heterogeneity-aware scheduling policy. The main contributions are the combination of power-management techniques at the hardware, scheduling and compilation levels, and a new scheduling policy together with a multi-core architecture that is heterogeneous in its L1 cache size, determined offline and online. Energy consumption FPGA Heterogeneous multi-cores Performance
4	Machine learning based mapping of data and streaming parallelism to multi-cores Wang, Zheng January 2011 Multi-core processors are now ubiquitous and are widely seen as the most viable means of delivering performance with increasing transistor densities. However, this potential can only be realised if the application programs are suitably parallel. Applications can either be written in parallel from scratch or converted from existing sequential programs. Regardless of how applications are parallelised, the code must be efficiently mapped onto the underlying platform to fully exploit the hardware’s potential. This thesis addresses the problem of finding the best mappings of data and streaming parallelism, two types of parallelism that exist in broad and important domains such as scientific, signal processing and media applications. Despite significant progress having been made over the past few decades, state-of-the-art mapping approaches still largely rely upon hand-crafted, architecture-specific heuristics. Developing a heuristic by hand, however, often requires months of development time. As multi-core designs become increasingly diverse and complex, manually tuning a heuristic for a wide range of architectures is no longer feasible. What is needed are innovative techniques that can automatically scale with advances in multi-core technologies. In this thesis two distinct areas of computer science, namely parallel compiler design and machine learning, are brought together to develop new compiler-based mapping techniques. Using machine learning, it is possible to automatically build high-quality mapping schemes, which adapt to evolving architectures, with little human involvement. First, two techniques are proposed to find the best mapping of data parallelism. The first technique predicts whether parallel execution of a data parallel candidate is profitable on the underlying architecture. 
On a typical multi-core platform, it achieves almost the same (and sometimes a better) level of performance when compared to the manually parallelised code developed by independent experts. For a profitable candidate, the second technique predicts how many threads should be used to execute the candidate across different program inputs. The second technique achieves, on average, over 96% of the maximum available performance on two different multi-core platforms. Next, a new approach is developed for partitioning stream applications. This approach predicts the ideal partitioning structure for a given stream application. Based on the prediction, a compiler can rapidly search the program space (without executing any code) to generate a good partition. It achieves, on average, a 1.90x speedup over the already tuned partitioning scheme of a state-of-the-art streaming compiler. 005.3
5	Exploiting heterogeneous many cores on sequential code / Exploiter des multi-coeurs hétérogènes dans le cadre de codes séquentiels Narasimha Swamy, Bharath 05 March 2015 Heterogeneous Many Cores (HMC) architectures, which mix many small, simple cores with a few large, complex cores, deliver good performance for sequential applications and save energy for parallel applications. The small cores of an HMC can be used as helper cores to accelerate memory-hungry sequential applications running on the main core. However, the overhead of accessing the small cores limits their use as helper cores. Because of the performance gap between the main core and the small cores, it is not yet known whether small cores are suited to running helper threads that prefetch for a more powerful core. In this thesis, we present a hardware/software architecture called "core-tethering" to efficiently support helper-thread execution on HMC systems. This architecture lets the main core directly launch and control helper-thread execution and efficiently transfer the application context that the helper threads need. On a set of memory-intensive programs, helper threads running on relatively small cores can deliver a significant speedup over hardware prefetching alone. Small cores also offer a good trade-off compared with using a single powerful core to run the helper threads. 
In summary, despite the overhead due to the latency of accessing cache lines that prefetching has loaded into the shared L3 cache, helper-thread prefetching on small cores appears to be a promising way to improve sequential-code performance for memory-intensive applications on HMC systems. / Heterogeneous Many Cores (HMC) architectures that mix many simple/small cores with a few complex/large cores are emerging as a design alternative that can provide both fast sequential performance for single-threaded workloads and power-efficient execution for throughput-oriented parallel workloads. The availability of many small cores in an HMC presents an opportunity to use them as low-power helper cores to accelerate memory-intensive sequential programs mapped to a large core. However, the latency overhead of accessing small cores in a loosely coupled system limits their utility as helper cores. Also, it is not clear whether small cores can execute helper threads sufficiently far in advance to benefit applications running on a larger, much more powerful core. In this thesis, we present a hardware/software framework called core-tethering to support efficient helper threading on heterogeneous many-cores. Core-tethering provides a coprocessor-like interface to the small cores that (a) enables a large core to directly initiate and control helper execution on the helper core and (b) allows efficient transfer of execution context between the cores, thereby reducing the performance overhead of accessing small cores for helper execution. Our evaluation on a set of memory-intensive programs chosen from standard benchmark suites shows that helper threads using moderately sized small cores can significantly accelerate a larger core compared with using a hardware prefetcher alone. We also find that a small core provides a good trade-off against using an equivalent large core to run helper threads in an HMC. 
In summary, despite the latency overheads of accessing prefetched cache lines from the shared L3 cache, helper-thread-based prefetching on small cores looks like a promising way to improve single-thread performance on memory-intensive workloads in HMC architectures. Multi-core microprocessors Moore's Law
6	Energy-efficient Scheduling for Heterogeneous Servers in the Dark Silicon Era January 2015 Driven by stringent power and thermal constraints, heterogeneous multi-core processors, such as the ARM big.LITTLE architecture, are becoming increasingly popular. This thesis addresses the use of low-power heterogeneous multi-cores as microservers, with web search as the motivating application. In particular, I propose a new family of scheduling policies for heterogeneous microservers that assign incoming search queries to available cores so as to optimize for performance metrics such as mean response time and service level agreements, while guaranteeing thermally safe operation. Thorough experimental evaluations on a big.LITTLE platform, a heterogeneous eight-core Samsung Exynos 5422 MPSoC with four big and four little cores, demonstrate that naive performance-oriented scheduling policies quickly result in thermal instability, while the proposed policies not only reduce peak temperature but also achieve a 4.8x reduction in processing time and a 5.6x increase in energy efficiency compared to baseline scheduling policies. / Dissertation/Thesis / Masters Thesis Electrical Engineering 2015 Electrical engineering Heterogeneous Servers Multi-cores Scheduling Algorithms Threshold based algorithms
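The abstract above does not spell out its policies, so the following is only a minimal sketch of what a threshold-based, thermally aware assignment rule might look like. The function name `assign_core`, the 80 °C default threshold and the "queue when hot" behaviour are all assumptions made for illustration, not the thesis's actual policy.

```python
def assign_core(big_free, little_free, temperature_c, threshold_c=80.0):
    """Pick a core type for an incoming query (hypothetical policy).

    big_free / little_free: number of idle big and little cores.
    temperature_c: current chip temperature in degrees Celsius.
    threshold_c: assumed thermal limit above which big cores are avoided.
    Returns "big", "little", or None when the query should wait.
    """
    # Below the thermal threshold, prefer a big core for lower response time.
    if temperature_c < threshold_c and big_free > 0:
        return "big"
    # Otherwise fall back to a little core, which dissipates less power.
    if little_free > 0:
        return "little"
    # No thermally acceptable core is idle: leave the query queued.
    return None
```

Under these assumptions, a hot chip with only big cores idle queues the query instead of risking further heating, which is the kind of trade-off between response time and thermal safety the abstract describes.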
7	Minimising shared resource contention when scheduling real-time applications on multi-core architectures / Minimiser l’impact des communications lors de l’ordonnancement d’application temps-réels sur des architectures multi-cœurs Rouxel, Benjamin 19 December 2018 Multi-core architectures using scratchpad memories are attractive for running embedded real-time applications because they offer high computing capacity. However, real-time systems must satisfy timing constraints, which can be difficult on this kind of architecture, notably because of hardware resources that are physically shared between cores. More precisely, worst-case scenarios for sharing the communication bus between the cores and external memory are too pessimistic. This thesis proposes strategies to reduce this pessimism when scheduling applications on multi-core architectures. First, the precision of worst-case communication costs is improved using the available information about the application and the state of the schedule being built. Next, the hardware's parallel capabilities are exploited to overlap computation and communication. In addition, the opportunities for overlap are increased by fragmenting these communications. / Multi-core architectures using scratchpad memories are very attractive for executing embedded time-critical applications because they offer large computational power. However, ensuring that timing constraints are met on such platforms is challenging because some hardware resources are shared between cores. When targeting the bus connecting cores and external memory, worst-case sharing scenarios are too pessimistic. This thesis proposes strategies to reduce this pessimism. 
These strategies both improve the accuracy of worst-case communication costs and exploit the hardware's parallel capabilities by overlapping computations and communications. Moreover, fragmenting the communications increases the opportunities for overlap. Real-time systems Multi-cores Scheduling Contention
8	Vers une utilisation efficace des processeurs multi-coeurs dans des systèmes embarqués à criticités multiples / Towards an efficient use of multi-core processors in mixed criticality embedded systems Blin, Antoine 30 January 2017 Embedded systems in vehicles comprise a mix of real-time and best-effort applications deployed, for isolation reasons, on separate electronic control units (ECUs). Adding new features to vehicles increases the number of ECUs and therefore cost, power consumption and heat dissipation. The emergence of new low-cost multi-core platforms makes it possible to consider a new, "virtualized" architecture that runs both types of application in parallel on a single ECU. Nevertheless, the memory hierarchy of such ECUs remains shared. A real-time application running on one core can therefore see its memory access times slowed by accesses from best-effort applications running in parallel, causing the real-time task to miss its deadlines. In this thesis, we propose a new approach to managing memory contention. In a first, offline step, we generate an oracle able to estimate the slowdown of a real-time task as a function of the measured memory traffic. In a second, online step, the real-time tasks run in parallel with the best-effort applications. A regulation mechanism monitors memory consumption and uses the previously generated oracle to estimate the slowdown of the real-time tasks. When the estimated slowdown exceeds the limit set by the system designer, the best-effort applications are suspended until the real-time application completes its activation. 
/ Complex embedded systems today commonly involve a mix of real-time and best-effort applications integrated on separate microcontrollers, thus ensuring fault isolation and error containment. However, this solution multiplies hardware costs, power consumption and thermal dissipation. The recent emergence of low-cost multi-core processors raises the possibility of running both kinds of applications on a single machine, with virtualization ensuring isolation. Nevertheless, the memory hierarchy on such processors is shared between all cores. Memory accesses made by a real-time application running on one dedicated core can be slowed down by concurrent memory accesses initiated by best-effort applications running in parallel. Real-time applications can therefore miss their deadlines. In this thesis, we propose a run-time software-regulation approach that aims to maximize parallelism between real-time and best-effort applications running on a single low-cost multi-core ECU. Our approach uses an overhead estimate derived from offline profiling of the real-time application to estimate the slowdown of the real-time application caused by memory interference. When the estimated overhead reaches a predefined threshold, our approach suspends the best-effort applications, allowing the real-time task to continue executing without interference. Suspended best-effort applications are resumed when the real-time application ends its current activation. Multi-cores Memory contention Embedded systems Real-time Regulation 005.4
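The regulation scheme described in this abstract (offline profile, online monitoring, suspension at a threshold, resumption at the end of the activation) can be sketched as follows. This is an illustrative model only: the class name `Regulator`, the step-table form of the profile, the sampling interface and all numeric values are assumptions, not the thesis's implementation.

```python
from dataclasses import dataclass


@dataclass
class Regulator:
    # Assumed offline profile: (memory traffic level, slowdown per sample),
    # sorted by increasing traffic level. Values are hypothetical.
    profile: list
    # Maximum accumulated slowdown tolerated by the system designer.
    threshold: float
    accumulated: float = 0.0
    best_effort_suspended: bool = False

    def estimate(self, traffic):
        """Return the slowdown of the highest profiled level <= traffic."""
        slowdown = 0.0
        for level, value in self.profile:
            if traffic >= level:
                slowdown = value
        return slowdown

    def on_sample(self, traffic):
        """Account for one monitoring period of measured memory traffic."""
        if self.best_effort_suspended:
            return  # no interference while best-effort work is paused
        self.accumulated += self.estimate(traffic)
        if self.accumulated >= self.threshold:
            self.best_effort_suspended = True

    def on_activation_end(self):
        """Real-time activation finished: reset and resume best-effort work."""
        self.accumulated = 0.0
        self.best_effort_suspended = False
```

In this sketch the caller would invoke `on_sample` once per monitoring period with the measured traffic and `on_activation_end` when the real-time task completes its activation; how traffic is actually measured and how applications are suspended is left out.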
9	MNoC : A Network on Chip for Monitors Madduri, Sailaja 01 January 2008 As silicon processes scale, system-on-chips (SoCs) will require numerous hardware monitors that assess physical characteristics that change during the operation of a device. To address the need for high-speed and coordinated transport of monitor data in an SoC, we develop a new interconnection network for monitors, the monitor network on chip (MNoC). Data collected from the monitors via MNoC is collated by a monitor executive processor (MEP) that controls the operation of the SoC in response to monitor data. In this thesis, we developed the architecture of MNoC and the infrastructure to evaluate its performance and overhead for various network parameters. A system-level architectural simulation can then be performed to ensure that the latency and bandwidth provided by MNoC are sufficient to allow the MEP to react in a timely fashion. This typically translates to a system-level benefit that can be assessed using architectural simulation. In this thesis, we demonstrate the use of MNoC in two specific monitoring systems that involve thermal and delay monitors. Results show that MNoC facilitates the use of a thermal-aware dynamic frequency scaling scheme in a multicore processor, resulting in improved performance. It also facilitates power and performance savings in a delay-monitored multicore system by enabling better-than-worst-case voltage and frequency settings for the processor. Networks on chip Multi-cores Thermal management Voltage droop management Real-time monitors Scalable interconnects Electrical engineering
10	Low-cost and efficient architectural support for correctness and performance debugging Venkataramani, Guru Prasadh V. 15 July 2009 With rapid growth in computer hardware technologies and architectures, software programs have become increasingly complex and error-prone. This software complexity has resulted in program crashes and even security threats. Correctness Debugging is making sure that the program does not exhibit any unintended behavior at runtime. A fully correct program without good performance does not lend any commercial success to the software product. Performance Debugging ensures good performance on hardware platforms. A number of prior debugging solutions either suffer from huge performance overheads or incur high implementation costs. We propose low-cost and efficient hardware solutions that target three specific correctness and performance problems, namely, memory debugging, taint propagation and comprehensive cache miss classification. Experiments show that our mechanisms incur low performance overheads and can be designed with minimal changes to existing processor hardware. While architects invest time and resources into designing high-end architectures, we show that it is equally important to incorporate useful debugging features into these processors in order to enhance the ease of use for programmers. Scalability in multi-cores Debugging Computer architecture Computer programs Correctness Debugging in computer science Computer systems Reliability Computer architecture
