11

Prediction Models for Multi-dimensional Power-Performance Optimization on Many Cores

Shah, Ankur Savailal 28 May 2008
Power has become a primary concern for HPC systems. Dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) are two software tools (or knobs) for reducing the dynamic power consumption of HPC systems. To date, few works have considered the synergistic integration of DVFS and DCT in performance-constrained systems, and, to the best of our knowledge, no prior research has developed application-aware simultaneous DVFS and DCT controllers in real systems and parallel programming frameworks. We present a multi-dimensional, online performance prediction framework, which we deploy to address the problem of simultaneous runtime optimization of DVFS, DCT, and thread placement on multi-core systems. We present results from an implementation of the prediction framework in a runtime system linked to the Intel OpenMP runtime environment and running on a real dual-processor quad-core system as well as a dual-processor dual-core system. We show that the prediction framework derives near-optimal settings of the three power-aware program adaptation knobs that we consider. Our overall runtime optimization framework achieves significant reductions in energy (12.27% mean) and ED² (29.6% mean), through simultaneous power savings (3.9% mean) and performance improvements (10.3% mean). Our prediction and adaptation framework outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS. Further, our results indicate that prediction-based schemes for runtime adaptation compare favorably and typically improve upon heuristic search-based approaches in both performance and energy savings. / Master of Science
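For reference, the ED² figure cited above is the energy-delay-squared metric (lower is better), which weights performance more heavily than energy so that power savings cannot be bought with arbitrary slowdowns:

```latex
\mathrm{ED}^2 = E \cdot t^2 = P \cdot t^3
```

where E is total energy, P average power, and t execution time. As a rough sanity check, a 3.9% power saving combined with a 10.3% performance improvement compounds through the cubic term to an ED² reduction of roughly 28-31%, in line with the 29.6% mean reported (the match is approximate because the reported figures are means across benchmarks).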
12

Design Process for the Optimization of Embedded Software Architectures onto Multi-core Processors in the Automotive Industry

Wang, Wenhao 10 July 2017
The recent migration from single-core to multi-core platforms in the automotive domain poses great challenges for the legacy embedded-software design flow. First, software designers need new methods to fill the gap between the description of applications and the deployment of tasks. Second, the use of multiple cores must remain compatible with real-time and functional-safety constraints. Finally, developers need tools to assist them in the new steps of the design process. Facing these challenges, we propose a methodology, integrated into the AUTOSAR (AUTomotive Open System ARchitecture) design flow, for optimally partitioning automotive applications onto multi-core systems. The partitioning solutions determine the allocation of applications and the scheduling policy simultaneously. The design space is explored automatically, and candidate solutions are evaluated by cost functions that account for criteria such as inter-core communication overhead, CPU load balancing across cores, and global jitter. For the scheduling part, we present a formalization of periodic dependencies suited to automotive needs, and propose a scheduling algorithm that takes this specificity into account along with real-time and functional constraints, ensuring the applicability of our methodology in an industrial product. We validated our solutions with an engine-control application on a real multi-core hardware platform.
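The abstract names the cost criteria but not the functions themselves; as a rough sketch under those stated criteria, a partitioning search might score each candidate task-to-core mapping with a hypothetical weighted sum like the following (all weights and helper names are illustrative, not the thesis's actual formulation):

```python
def partition_cost(mapping, tasks, comm, n_cores,
                   w_comm=1.0, w_balance=1.0, w_jitter=1.0):
    """Score a candidate task-to-core mapping; lower is better.

    mapping: dict task -> core index
    tasks:   dict task -> worst-case execution time (WCET)
    comm:    dict (task_a, task_b) -> data volume exchanged
    """
    # Inter-core communication: count only edges that cross cores.
    cross = sum(v for (a, b), v in comm.items()
                if mapping[a] != mapping[b])

    # Load balance: spread of per-core utilization around the mean.
    load = [0.0] * n_cores
    for t, c in mapping.items():
        load[c] += tasks[t]
    mean = sum(load) / n_cores
    imbalance = max(load) - mean

    # Jitter proxy: in this simplified model, the most heavily loaded
    # core bounds response-time variability.
    jitter = max(load)

    return w_comm * cross + w_balance * imbalance + w_jitter * jitter
```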
13

Operating System Management of Shared Caches on Multicore Processors

Tam, David Kar Fai 01 September 2010
Our thesis is that operating systems should manage the on-chip shared caches of multicore processors for the purposes of achieving performance gains. Consequently, this dissertation demonstrates how the operating system can profitably manage these shared caches. Two shared-cache management principles are investigated: (1) promoting shared use of the shared cache, demonstrated by an automated online thread clustering technique, and (2) providing cache space isolation, demonstrated by a software-based cache partitioning technique. In support of providing isolation, cache provisioning is also investigated, demonstrated by an automated online technique called RapidMRC. We show how these software-based techniques are feasible on existing multicore systems with the help of their hardware performance monitoring units and their associated hardware performance counters. On a 2-chip IBM POWER5 multicore system, promoting sharing reduced processor pipeline stalls caused by cross-chip cache accesses by up to 70%, resulting in performance improvements of up to 7%. On a larger 8-chip IBM POWER5+ multicore system, the potential for up to 14% performance improvement was measured. Providing isolation improved performance by up to 50%, using an exhaustive offline search method to determine optimal partition size. On the other hand, up to 27% performance improvement was extracted from the corresponding workload using an automated online approximation technique, made possible by RapidMRC.
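RapidMRC's job is to approximate online the miss rate curve (MRC) that an exhaustive offline analysis would produce. The classical offline method for LRU caches is Mattson's stack algorithm over an address trace; a minimal sketch of that reference computation (the trace is assumed to be a sequence of cache-block addresses):

```python
from collections import OrderedDict

def miss_rate_curve(trace, max_size):
    """LRU miss-rate curve from a trace of cache-block addresses,
    via stack (reuse) distances -- Mattson's classic algorithm."""
    stack = OrderedDict()           # LRU stack, most-recent last
    hist = [0] * (max_size + 1)     # hist[d]: accesses at distance d
    misses_beyond = 0               # cold misses + distances > max_size
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            depth = len(keys) - keys.index(addr)   # 1 = most recent
            if depth <= max_size:
                hist[depth] += 1
            else:
                misses_beyond += 1
            del stack[addr]
        else:
            misses_beyond += 1      # first touch: compulsory miss
        stack[addr] = True          # push to top of the stack
    total = sum(hist) + misses_beyond
    curve, hits = [], 0
    for c in range(1, max_size + 1):
        hits += hist[c]             # a size-c cache hits distance <= c
        curve.append((total - hits) / total)
    return curve
```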
15

Improving the Performance and Reducing the Power Consumption of Multicore Processors Using Heterogeneous Networks

Flores Gil, Antonio 24 September 2010
This thesis proposes solutions to alleviate the high cost, in both performance and power consumption, of intra-chip communication over global wires. In particular, we consider heterogeneous networks that can better match the network-on-chip to the needs of the different types of coherence messages. Our first proposal partitions reply messages carrying data into a short critical message, sent over low-latency links, and a long non-critical message, sent over low-power links. The second proposal exploits address compression in the context of a heterogeneous interconnect, allowing most critical messages to be compressed into a few bytes and transmitted over very-low-latency links. Finally, we explore the use of heterogeneous networks in the context of hardware prefetching to alleviate the problems caused by the high latencies of global links.
16

Adaptive Prefetching and Cache Partitioning for Multicore Processors

Selfa Oliver, Vicent 13 November 2018
Accessing main memory represents a major performance bottleneck in current processors, since the different cores compete for the limited off-chip bandwidth, aggravating the so-called memory wall. Several techniques address the core-memory performance gap, the most prominent being prefetching and hierarchical caching. Hierarchical caches exploit the temporal and spatial locality of the accessed data, mitigating the huge main-memory access latencies. To limit the number of accesses to off-chip DRAM, current processors feature large last-level caches (LLCs), shared among all cores to improve utilization of the cache space and reduce cost. This approach significantly improves the performance of most applications compared to using smaller private caches. Cache sharing, however, has an important shortcoming: interference between applications. Prefetching, on the other hand, brings data blocks into the caches before they are requested, hiding main-memory latency. Unfortunately, since prefetching is speculative, inaccurate prefetches may pollute the cache with blocks that will never be used, and prefetches interfere with regular memory requests, both from the core that issued them and from the others. This thesis focuses on reducing inter-application interference, both in the shared cache and in access to main memory. To reduce interference in main-memory access, the proposed approach regulates the aggressiveness of each core's prefetcher, selectively activating or deactivating prefetchers depending on their individual performance and on the main-memory bandwidth requirements of the other cores. With respect to interference in shared caches, this thesis proposes two LLC partitioning techniques that give more cache space to the applications whose progress is most diminished by inter-application interference. The first cache-partitioning proposal requires dedicated hardware not available in commercial processors, so it was evaluated in a simulation framework. The second presents a family of partitioning policies that overcome the limits on the number of partitions and the number of available cache ways by grouping applications into clusters and overlapping cache partitions, so that multiple applications share the same ways; since it was implemented using the cache-partitioning features of modern Intel processors, it was evaluated on a real machine. Experimental results show that the proposed selective prefetching mechanism reduces the number of main-memory requests by 20%, which translates into improvements in fairness, performance, and energy consumption. Compared to a system with no partitioning, both partitioning schemes reduce system unfairness by more than 25% on average, regardless of the number of applications running on the multicore, and this reduction in unfairness does not negatively affect performance. / Selfa Oliver, V. (2018). Adaptive Prefetching and Cache Partitioning for Multicore Processors [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112423
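The "cache-partitioning features of modern Intel processors" mentioned above refers to Intel Cache Allocation Technology (CAT), exposed on Linux through the resctrl filesystem. A minimal sketch of pinning a group of processes to a subset of LLC ways (the group name, way mask, and PIDs are example values, and the schemata line assumes a single L3 cache domain):

```python
import os

RESCTRL = "/sys/fs/resctrl"  # mount -t resctrl resctrl /sys/fs/resctrl

def create_partition(name, l3_mask, pids, cache_id=0):
    """Create a resctrl group restricted to the LLC ways set in
    l3_mask (a contiguous bitmask, e.g. 0x0f = the 4 low ways) and
    move the given PIDs into it."""
    group = os.path.join(RESCTRL, name)
    os.makedirs(group, exist_ok=True)
    # Capacity bitmask for this group, per L3 cache domain.
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(f"L3:{cache_id}={l3_mask:x}\n")
    # resctrl expects one PID per write to the tasks file.
    tasks = os.path.join(group, "tasks")
    for pid in pids:
        with open(tasks, "w") as f:
            f.write(str(pid))

# Example: confine PIDs 1234 and 1235 to 4 LLC ways on cache 0.
# create_partition("clusterA", 0x0f, [1234, 1235])
```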
17

A Dynamic Reconfiguration Framework to Maximize Performance/Power in Asymmetric Multicore Processors

Annamalai, Arunachalam 01 January 2013
Recent trends in technology scaling have shifted the processing paradigm to multicores. Depending on the characteristics of the cores, the multicores can be either symmetric or asymmetric. Prior research has shown that Asymmetric Multicore Processors (AMPs) outperform their symmetric (SMP) counterparts within a given resource and power budget. But, due to the heterogeneity in core-types and time-varying workload behavior, thread-to-core assignment is always a challenge in AMPs. As the computational requirements vary significantly across different applications and with time, there is a need to dynamically allocate appropriate computational resources on demand to suit the applications’ current needs, in order to maximize the performance and minimize the energy consumption. Performance/power of the applications could be further increased by dynamically adapting the voltage and frequency of the cores to better fit the changing characteristics of the workloads. Not only can a core be forced to a low power mode when its activity level is low, but the power saved by doing so could be opportunistically re-budgeted to the other cores to boost the overall system throughput. To this end, we propose a novel solution that seamlessly combines heterogeneity with a Dynamic Reconfiguration Framework (DRF). The proposed dynamic reconfiguration framework is equipped with Dynamic Resource Allocation (DRA) and Voltage/Frequency Adaptation (DVFA) capabilities to adapt the core resources and operating conditions at runtime to the changing demands of the applications. As a proof of concept, we illustrate our proposed approach using a dual-core AMP and demonstrate significant performance/power benefits over various baselines.
18

Dynamic Bandwidth Allocation Algorithms for an RF On-Chip Interconnect

Unlu, Eren 21 June 2016
With the rapidly increasing number of cores on a single chip, scalability problems have arisen due to congestion and latency on conventional interconnects. To address these issues, the WiNoCoD project (Wired RF Network-on-Chip Reconfigurable-on-Demand) was initiated with the support of the French National Research Agency (ANR), and this thesis contributes to that project. An RF controller structure is proposed for WiNoCoD's OFDMA-based wired RF interconnect, and on top of this architecture several effective bandwidth-allocation algorithms, both distributed and centralized, are developed to meet the very specific requirements and constraints of the on-chip environment. An innovative subcarrier-allocation protocol for the bimodal packet lengths of cache-coherence traffic, requiring no additional signaling, is introduced and shown to decrease average latency significantly. Finally, modulation-order selection policies that trade higher energy consumption for lower delay are introduced, seeking the optimal delay-power trade-off.
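The allocation algorithms are not detailed in the abstract; as one hedged illustration of the centralized flavor, an OFDMA arbiter can redistribute subcarriers each arbitration epoch in proportion to the demand each tile reports (the largest-remainder rounding and all names here are illustrative, not the thesis's protocol):

```python
def allocate_subcarriers(demands, n_subcarriers):
    """Split n_subcarriers among tiles proportionally to their queued
    demand, using largest-remainder rounding so every subcarrier is
    assigned and the totals are exact."""
    n = len(demands)
    total = sum(demands)
    if total == 0:                    # all tiles idle: spread evenly
        base, extra = divmod(n_subcarriers, n)
        return [base + (i < extra) for i in range(n)]
    shares = [d * n_subcarriers / total for d in demands]
    alloc = [int(s) for s in shares]  # floor of each fair share
    # Hand leftover subcarriers to the largest fractional remainders.
    order = sorted(range(n), key=lambda i: shares[i] - alloc[i],
                   reverse=True)
    for i in order[:n_subcarriers - sum(alloc)]:
        alloc[i] += 1
    return alloc

# Example: 4 tiles with queued flits [10, 0, 5, 5] sharing 32
# subcarriers -> [16, 0, 8, 8].
```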
19

A Study of the Influence of the Parameters of Parallel Evolutionary Algorithms on Their Performance on Multicore Platforms

Pais, Mônica Sakuray 14 March 2014
Parallel computing is a powerful way to reduce computation time and improve the quality of the solutions of evolutionary algorithms (EAs). At first, parallel evolutionary algorithms (PEAs) ran on expensive and scarce parallel machines; now that multicore processors are ubiquitous, the performance available to parallel programs is a strong motivation to parallelize computationally demanding EAs and exploit the power of multicores. Parallel implementation, however, introduces additional factors that influence performance and thus adds complexity to the evaluation of PEAs. Statistics can help in this task and guarantee significant, correct conclusions from a minimum of tests, provided the right design of experiments is applied. This work presents an experimentation methodology for PEAs that guarantees correct estimation of speedups and applies factorial designs to the analysis of the factors that influence performance. As a case study, a genetic algorithm (AGP-I) was parallelized according to the island model and executed on platforms with different multicore processors; the methodology was applied to determine the influence of migration-related parameters on the performance of AGP-I solving two benchmark functions. / Doctor of Science
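For reference, the speedup and parallel efficiency that the methodology estimates are the usual ratios, with the caveat (central to a sound design of experiments) that runtimes are random variables and must be estimated from repeated runs:

```latex
S_p = \frac{\overline{T}_1}{\overline{T}_p}, \qquad E_p = \frac{S_p}{p}
```

where T̄₁ and T̄ₚ are the mean runtimes measured on one core and on p cores, respectively.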
20

A Transactional Memory Model for Heterogeneous Architectures Based on Software Cache

Goldstein, Felipe Portavales 17 August 2018
Advisor: Rodolfo Jardim de Azevedo / Master's dissertation, Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica / Issue date: 2010 / The adoption of multi-core processors by industry has pushed the development of new techniques to simplify the programming of parallel software. Among the most promising is transactional memory, which executes tasks concurrently in an optimistic way to achieve good performance and is much simpler to use than classic mutual exclusion. This work proposes the first transactional memory model for hybrid architectures, targeting the Cell BE processor. The Cell BE is especially complex because of the difficulties its architecture imposes on the programmer when accessing the global shared memory from one of the SPEs. The proposed model acts as a layer between the program and main memory, allowing transparent access to data, guaranteeing coherence, and performing concurrency control automatically; it combines a software cache with transactional memory to ease access to external memory from the SPEs. The model was implemented and tested with 8 different benchmark applications, showing its feasibility in real use cases, and each part of the proposed architecture was analyzed in detail for its impact on overall system performance. The model achieved performance up to two times better than an implementation using a global mutex. Its advantages lie mainly in ease of use and coherence guarantees, and in avoiding classes of bugs common in mutex implementations, such as deadlocks. This work won the best paper award at SBAC-PAD 2008. / Master of Computer Science
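The dissertation's runtime targets the Cell BE SPEs, but the optimistic-concurrency core it relies on is hardware-independent. A minimal, hypothetical sketch of commit-time validation with per-location version counters (not the thesis's actual API; the global structures are simplifications):

```python
import threading

_versions = {}                    # location -> version counter
_data = {}                        # location -> committed value
_commit_lock = threading.Lock()   # serializes commits only

class Transaction:
    def __init__(self):
        self.read_set = {}        # location -> version observed
        self.write_set = {}       # location -> tentative new value

    def read(self, loc):
        if loc in self.write_set:             # read-your-own-writes
            return self.write_set[loc]
        self.read_set[loc] = _versions.get(loc, 0)
        return _data.get(loc)

    def write(self, loc, value):
        self.write_set[loc] = value           # buffered until commit

    def commit(self):
        """Validate that nothing we read changed; if valid, publish."""
        with _commit_lock:
            for loc, ver in self.read_set.items():
                if _versions.get(loc, 0) != ver:
                    return False              # conflict: caller retries
            for loc, value in self.write_set.items():
                _data[loc] = value
                _versions[loc] = _versions.get(loc, 0) + 1
        return True

def atomically(fn):
    """Run fn(tx) until its commit succeeds (optimistic retry loop)."""
    while True:
        tx = Transaction()
        result = fn(tx)
        if tx.commit():
            return result

# Example: atomically(lambda tx: tx.write("x", (tx.read("x") or 0) + 1))
```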
