Global ETD Search

1	GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip Ramakrishna, Mukund 2012 August 1900 (has links) As modern CMPs scale to ever increasing core counts, Networks-on-Chip (NoCs) are emerging as an interconnection fabric, enabling communication between components. While NoCs are easy to implement and provide high and scalable bandwidth, current routing algorithms, such as dimension-ordered routing, suffer from poor load balance, leading to reduced throughput and high latencies. Improving load balance, hence, is critical in future CMP designs where increased latency leads to wasted power and energy waiting for outstanding requests to resolve. Adaptive routing is a known technique to improve load balance; however, prior adaptive routing techniques either use local, myopic information or misinformed, regionally-aggregated information to form their routing decisions. This thesis proposes a new, light-weight, adaptive routing algorithm for on-chip routers based on global link state and congestion information, Global Congestion Awareness (GCA). GCA leverages unused bits in existing packet header flits to "piggyback" congestion state information around the network and uses a simple, low-complexity route calculation unit, to calculate optimal packet paths to their destination without the myopia of local decisions, nor the aggregation of unrelated status information, found in prior designs. In particular GCA outperforms local adaptive routing by up to 82%, Regional Congestion Awareness (RCA) by up to 51%, and a recent competing adaptive routing algorithm, DAR, by 8% on average on realistic workloads. computer architecture on-chip networks networks-on-chip multicores adaptive routing
2	Policies for Migration of Real-Time tasks in Embedded Multicore Systems Katre, Kedar Maheshwar 01 December 2010 (has links) There has been a lot of work that has been done on timing predictability of real-time tasks on embedded systems. The main assumption in these studies has been that the timing behavior has been based on single processor systems. The scenario has changed entirely when the single core systems have been replaced with the new Multicore systems. The timing predictability is controlled by the migrating tasks, the network topology connecting the cores and the number of cores on the system. In this thesis we come up with a feasibility analysis which depends on the characteristics of the tasks viz. number of cache lines, time of migration, available bandwith, number of tasks etc. We also test this analysis on novel mechanisms of migration which have been proposed recently and present its results. Kedar migration Multicores Real Time Scheduling weighted migration cost
3	Software caching techniques and hardware optimizations for on-chip local memories Vujic, Nikola 05 June 2012 (has links) Despite the fact that the most viable L1 memories in processors are caches, on-chip local memories have been a great topic of consideration lately. Local memories are an interesting design option due to their many benefits: less area occupancy, reduced energy consumption and fast and constant access time. These benefits are especially interesting for the design of modern multicore processors since power and latency are important assets in computer architecture today. Also, local memories do not generate coherency traffic which is important for the scalability of the multicore systems. Unfortunately, local memories have not been well accepted in modern processors yet, mainly due to their poor programmability. Systems with on-chip local memories do not have hardware support for transparent data transfers between local and global memories, and thus ease of programming is one of the main impediments for the broad acceptance of those systems. This thesis addresses software and hardware optimizations regarding the programmability, and the usage of the on-chip local memories in the context of both single-core and multicore systems. Software optimizations are related to the software caching techniques. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this thesis, we start optimizing traditional software cache by proposing a hierarchical, hybrid software-cache architecture. Afterwards, we develop few optimizations in order to speedup our hybrid software cache as much as possible. As the result of the software optimizations we obtain that our hybrid software cache performs from 4 to 10 times faster than traditional software cache on a set of NAS parallel benchmarks. We do not stop with software caching. We cover some other aspects of the architectures with on-chip local memories, such as the quality of the generated code and its correspondence with the quality of the buffer management in local memories, in order to improve performance of these architectures. Therefore, we run our research till we reach the limit in software and start proposing optimizations on the hardware level. Two hardware proposals are presented in this thesis. One is about relaxing alignment constraints imposed in the architectures with on-chip local memories and the other proposal is about accelerating the management of local memories by providing hardware support for the majority of actions performed in our software cache. / Malgrat les memòries cau encara son el component basic pel disseny del subsistema de memòria, les memòries locals han esdevingut una alternativa degut a les seves característiques pel que fa a l’ocupació d’àrea, el seu consum energètic i el seu rendiment amb un temps d’accés ràpid i constant. Aquestes característiques son d’especial interès quan les properes arquitectures multi-nucli estan limitades pel consum de potencia i la latència del subsistema de memòria.Les memòries locals pateixen de limitacions respecte la complexitat en la seva programació, fet que dificulta la seva introducció en arquitectures multi-nucli, tot i els avantatges esmentats anteriorment. Aquesta tesi presenta un seguit de solucions basades en programari i maquinari específicament dissenyat per resoldre aquestes limitacions.Les optimitzacions del programari estan basades amb tècniques d'emmagatzematge de memòria cau suportades per llibreries especifiques. La memòria cau per programari és un sòlid mètode per proporcionar a l'usuari una visió transparent de l'arquitectura, però aquest enfocament pot patir d'un rendiment deficient. En aquesta tesi, es proposa una estructura jeràrquica i híbrida. Posteriorment, desenvolupem optimitzacions per tal d'accelerar l’execució del programari que suporta el disseny de la memòria cau. Com a resultat de les optimitzacions realitzades, obtenim que el nostre disseny híbrid es comporta de 4 a 10 vegades més ràpid que una implementació tradicional de memòria cau sobre un conjunt d’aplicacions de referencia, com son els “NAS parallel benchmarks”.El treball de tesi inclou altres aspectes de les arquitectures amb memòries locals, com ara la qualitat del codi generat i la seva correspondència amb la qualitat de la gestió de memòria intermèdia en les memòries locals, per tal de millorar el rendiment d'aquestes arquitectures. La tesi desenvolupa propostes basades estrictament en el disseny de nou maquinari per tal de millorar el rendiment de les memòries locals quan ja no es possible realitzar mes optimitzacions en el programari. En particular, la tesi presenta dues propostes de maquinari: una relaxa les restriccions imposades per les memòries locals respecte l’alineament de dades, l’altra introdueix maquinari específic per accelerar les operacions mes usuals sobre les memòries locals. computer architecture high performance computing on-chip local memories software cache DMA multicores 004
4	Computación eficiente del alineamiento de secuencias de ADN sobre cluster de multicores Rucci, Enzo 30 July 2013 (has links) Una de las áreas de mayor interés y crecimiento en los últimos años dentro del procesamiento paralelo es la del tratamiento de grandes volúmenes de datos, tales como las secuencias de ADN. El tipo de procesamiento extensivo de comparación para analizar patrones genéticos requiere un esfuerzo importante en el desarrollo de algoritmos paralelos eficientes. El alineamiento de secuencias de ADN representa una de las operaciones más importantes dentro de la bioinformática. En 1981, Smith y Waterman desarrollaron un método para el alineamiento local de secuencias. Sin embargo, en la práctica se emplean diversas heurísticas en su lugar, debido a los requerimientos de procesamiento y de memoria del algoritmo Smith-Waterman. Si bien son más rápidas, las heurísticas no garantizan que el alineamiento óptimo sea encontrado. Es por ello que resulta interesante estudiar cómo aplicar la potencia de cómputo de plataformas paralelas actuales de manera de acelerar el proceso de alinear secuencias sin perder precisión en los resultados. Los niveles insostenibles de generación de calor y consumo de energía que se presentan al escalar al máximo la velocidad de los procesadores mononúcleos motivaron el surgimiento de los procesadores de múltiples núcleos (multicore). Un procesador multicore integra dos o más núcleos computacionales dentro de un único chip y, si bien estos son más simples y menos veloces, al combinarlos permiten mejorar el rendimiento global del procesador y al mismo tiempo hacerlo más eficiente energéticamente. Al incorporar este tipo de procesadores a los clusters convencionales, se da origen a una arquitectura conocida como cluster de multicores, que combina memoria compartida y distribuida, y donde la comunicación entre las diferentes unidades de procesamiento resulta ser heterogénea. En este trabajo se presenta un algoritmo paralelo distribuido para el alineamiento de secuencias de ADN basado en el método Smith-Waterman para ser ejecutado sobre las arquitecturas de cluster actuales. Además, se realiza un análisis de rendimiento del mismo. Por último, se presentan las conclusiones y las posibles líneas de trabajo futuro. alineamiento de secuencias de ADN cluster de multicores método Smith-Waterman programación paralela Parallel programming Distributed programming Algorithms Ciencias Informáticas
5	Adaptive and intelligent memory systems / Système mémoire adaptatif intelligent Sridharan, Aswinkumar 15 December 2016 (has links) Dans cette thèse, nous nous sommes concentrés sur l'interférence aux ressources de la hiérarchie de la mémoire partagée : cache de dernier niveau et accès à la mémoire hors-puce dans le contexte des systèmes multicœurs à grande échelle. À cette fin, le premier travail a porté sur les caches de dernier niveau partagées, où le nombre d'applications partageant le cache pourrait dépasser l'associativité du cache. Pour gérer les caches dans de telles situations, notre solution évalue l'empreinte du cache des applications pour déterminer approximativement à quel point elles pourraient utiliser le cache. L'estimation quantitative de l'utilitaire de cache permet explicitement de faire respecter différentes priorités entre les applications. La seconde partie apporte une prédétection dans la gestion de la mémoire cache. En particulier, nous observons les blocs cache pré-sélectionnés pour présenter un bon comportement de réutilisation dans le contexte de caches plus grands. Notre troisième travail est axé sur l'interférence entre les demandes à la demande et les demandes de prélecture à l'accès partagé à la mémoire morte. Ce travail est basé sur deux observations fondamentales de la fraction des requêtes de prélecture générées et de sa corrélation avec l'utilité de prélecture et l'interférence causée par le prélecteur. Au total, deux observations conduisent à contrôler le flux de requêtes de prélecture entre les mémoires LLC et off-chip. / In this thesis, we have focused on addressing interference at the shared memory-hierarchy resources: last level cache and off-chip memory access in the context of large-scale multicore systems. Towards this end, the first work focused on shared last level caches, where the number of applications sharing the cache could exceed the associativity of the cache. To manage caches in such situations, our solution estimates the cache footprint of applications to approximate how well they could utilize the cache. Quantitative estimate of cache utility explicitly allows enforcing different priorities across applications. The second part brings in prefetch awareness in cache management. In particular, we observe prefetched cache blocks to exhibit good reuse behavior in the context of larger caches. Our third work focuses on addressing interference between on-demand and prefetch requests at the shared off-chip memory access. This work is based on two fundamental observations of the fraction of prefetch requests generated and its correlation with prefetch usefulness and prefetcher-caused interference. Altogether, two observations lead to control the flow of prefetch requests between LLC and off-chip memory. Systèmes multicœurs à grande échelle Cache (informatique) Prélecture Cache Management Large-Scale multicores Prefetcher aggressiveness control Memory system Footprint-Number Prefetch-Fraction
6	Optimizing Power Consumption, Resource Utilization, and Performance for Manycore Architectures using Reinforcement Learning Fettes, Quintin 23 May 2022 (has links) No description available. Computer Science Computer Engineering Artificial Intelligence Multicores Network-on-Chip Reinforcement Learning Deep Learning Deep Reinforcement Learning Microservices Chip Multiprocessor Energy Efficiency
7	Task-based multifrontal QR solver for heterogeneous architectures / Solveur multifrontal QR à base de tâches pour architectures hétérogènes Lopez, Florent 11 December 2015 (has links) Afin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. Dans cette étude, nous explorons la conception de solveurs directes creux à base de tâches, qui représentent une charge de travail extrêmement irrégulière, avec des tâches de granularités et de caractéristiques différentes ainsi qu'une consommation mémoire variable, au-dessus d'un moteur d'exécution. Dans le cadre du solveur qr mumps, nous montrons dans un premier temps la viabilité et l'efficacité de notre approche avec l'implémentation d'une méthode multifrontale pour la factorisation de matrices creuses, en se basant sur le modèle de programmation parallèle appelé "flux de tâches séquentielles" (Sequential Task Flow). Cette approche, nous a ensuite permis de développer des fonctionnalités telles que l'intégration de noyaux dense de factorisation de type "minimisation de cAfin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. Dans cette étude, nous explorons la conception de solveurs directes creux à base de tâches, qui représentent une charge de travail extrêmement irrégulière, avec des tâches de granularités et de caractéristiques différentes ainsi qu'une consommation mémoire variable, au-dessus d'un moteur d'exécution. Dans le cadre du solveur qr mumps, nous montrons dans un premier temps la viabilité et l'efficacité de notre approche avec l'implémentation d'une méthode multifrontale pour la factorisation de matrices creuses, en se basant sur le modèle de programmation parallèle appelé "flux de tâches séquentielles" (Sequential Task Flow). Cette approche, nous a ensuite permis de développer des fonctionnalités telles que l'intégration de noyaux dense de factorisation de type "minimisation de cAfin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. / To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. In this study we investigate the design of task-based sparse direct solvers which constitute extremely irregular workloads, with tasks of different granularities and characteristics with variable memory consumption on top of runtime systems. In the context of the qr mumps solver, we prove the usability and effectiveness of our approach with the implementation of a sparse matrix multifrontal factorization based on a Sequential Task Flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding algorithms in the multifrontal method allowing for better scalability compared to the original approach used in qr mumps. In addition we introduced a memory-aware algorithm to control the memory behaviour of our solver and show, in the context of multicore architectures, an important reduction of the memory footprint for the multifrontal QR factorization with a small impact on performance. Following this approach, we move to heterogeneous architectures where task granularity and scheduling strategies are critical to achieve performance. We present, for the multifrontal method, a hierarchical strategy for data partitioning and a scheduling algorithm capable of handling the heterogeneity of resources. Finally we present a study on the reproducibility of executions and the use of alternative programming models for the implementation of the multifrontal method. All the experimental results presented in this study are evaluated with a detailed performance analysis measuring the impact of several identified effects on the performance and scalability. Thanks to this original analysis, presented in the first part of this study, we are capable of fully understanding the results obtained with our solver. Méthode multifrontale Multicœur Moteurs d'exécutions Architectures hétérogènes Calcul haute performance GPU Sparse direct solvers Multifrontal method Multicores Runtime systems Scheduling Memory-aware algorythms Heterogeneous architectures High-performance computing

1

Page generated in 0.0617 seconds