Global ETD Search

11	Exploiting Parallelism in GPUs Hechtman, Blake Alan January 2014 (has links) <p>Heterogeneous processors with accelerators provide an opportunity to improve performance within a given power budget.</p><p>Many of these heterogeneous processors contain Graphics Processing Units (GPUs) that can perform graphics and embarrassingly parallel computation orders of magnitude faster than a CPU while using less energy. Beyond these obvious applications for GPUs, a larger variety of applications can benefit from a GPU's large computation and memory bandwidth. However, many of these applications are irregular and, as a result, require synchronization and scheduling that are commonly believed to perform poorly on GPUs. The basic building block of synchronization and scheduling is memory consistency, which is, therefore, the first place to look for improving performance on irregular applications. In this thesis, we approach the programmability of irregular applications on GPUs by thinking across traditional boundaries of the compute stack. We think about architecture, microarchitecture and runtime systems from the programmers perspective. To this end, we study architectural memory consistency on future GPUs with cache coherence. In addition, we design a GPU memory system</p><p>microarchitecture that can support fine-grain and coarse-grain synchronization without sacrificing throughput. Finally, we develop a task runtime that embraces the GPU microarchitecture to perform well</p><p>on fork/join parallelism desired by many programmers. Overall, this thesis contributes non-intuitive solutions to improve the performance and programmability of irregular applications from the programmer's perspective.</p> / Dissertation Computer engineering Computer science Cache Coherence GPU Memory Consistency Task Parallelism
12	Estudo sobre o impacto da hierarquia de memória em MPSoCs baseados em NoC Silva, Gustavo Girão Barreto da January 2009 (has links) Ao longo dos últimos anos, os sistemas embarcados vêm se tornando cada vez mais complexos tanto em termos de hardware quanto de software. Ultimamente têm-se adotado como solução o uso de MPSoCs (sistemas multiprocessados integrados em chip) para uma maior eficiência energética e computacional nestes sistemas. Com o uso de diversos elementos de processamento, redes-em-chip (NoC - networks-on-chip) aparecem como soluções de melhor desempenho do que barramentos. Nestes ambientes cujo desempenho depende da eficiência do modelo de comunicação, a hierarquia de memória se torna um elemento chave. Baseando-se neste cenário, este trabalho realiza uma investigação sobre o impacto da hierarquia de memória em MPSoCs baseados em NoC. Dentro deste escopo foi desenvolvida uma nova organização de memória fisicamente centralizada com diferentes espaços de endereçamentos denominada nDMA. Este trabalho também apresenta uma comparação entre a nova organização e outras três organizações bastante difundidas tais como memória distribuída, memória compartilhada e memória compartilhada distribuída. Estas duas ultimas adotam um modelo de coerência de cache baseado em diretório completamente desenvolvido em hardware. Os modelos de memória foram implementados na plataforma virtual SIMPLE (SIMPLE Multiprocessor Platform Environment). Resultados experimentais mostram uma forte dependência com relação à carga de comunicação gerada pelas aplicações. O modelo de memória distribuída apresenta melhores resultados conforme a carga de comunicação das aplicações é baixa. Por outro lado, o novo modelo de memória fisicamente compartilhado com diferentes espaços de endereçamento apresenta melhores resultados conforme a carga de comunicação das aplicações é alta. Também foram realizados experimentos objetivando analisar o desempenho dos modelos de memória em situações de alta latência de comunicação na rede. Resultados mostram melhores resultados do modelo de memória distribuída quando a carga de comunicação das aplicações é alta e, caso contrário, o modelo nDMA apresenta melhores resultados. Por fim, foram analisados os desempenhos dos modelos de memória durante o processo de migração de tarefas. Neste caso, os modelos de memória compartilhada e compartilhada distribuída apresentaram melhores resultados devido ao fato de que não se faz necessária o envio dos dados da aplicação nestes modelos e também devido ao menor tamanho de código se comparado com os outros modelos. / In the past few the years, embedded systems have become even more complex both on terms of hardware and software. Lately, the use of MPSoCs (Multi-Processor Systems-on-Chip) has been adopted on these systems for a better energetic and computational efficiency. Due to the use of several processing elements, Networks-on-Chip arise as better performance solutions than buses. Considering this scenario, this work performs an investigation on the impact of memory hierarchy in NoC-based MPSoCs. In this context, a new physically centralized and shared memory organization with different address spaces named nDMA was developed. This work also presents a comparison between the new memory organization and three different well-known memory hierarchy models such as distributed memory and shared and distributed shared memories that make use of a fully hardware cache coherence solution. The memory models were implemented in the SIMPLE (SIMPLE Multiprocessor Platform Environment) virtual platform. Experimental results shows a strong dependency on the application communication workload. The distributed memory model presents better results as the application communication workload is low. On the other hand, the new memory model (physically shared with different address spaces) presents better results as the application communication workload is high. There were also experiments aiming at observing the performance of the memory models in situations where the communication latency on the network is high. Results show better results of the distributed memory model when the application communication workload is high, and the nDMA model presents better results otherwise. Finally, the performance of the memory models during a task migration process were evaluated. In this case, the shared memory and distributed shared memory models presented better results due to the fact that in this case the data memory does not need to be transferred from one point to another and also due to the low size of the memory code in these cases if compared to other memory models. Microeletrônica MPSoC NoC Embedded systems Multiprocessor system-on-chip Network-on-chip Cache coherence Task migration
13	Estudo sobre o impacto da hierarquia de memória em MPSoCs baseados em NoC Silva, Gustavo Girão Barreto da January 2009 (has links) Ao longo dos últimos anos, os sistemas embarcados vêm se tornando cada vez mais complexos tanto em termos de hardware quanto de software. Ultimamente têm-se adotado como solução o uso de MPSoCs (sistemas multiprocessados integrados em chip) para uma maior eficiência energética e computacional nestes sistemas. Com o uso de diversos elementos de processamento, redes-em-chip (NoC - networks-on-chip) aparecem como soluções de melhor desempenho do que barramentos. Nestes ambientes cujo desempenho depende da eficiência do modelo de comunicação, a hierarquia de memória se torna um elemento chave. Baseando-se neste cenário, este trabalho realiza uma investigação sobre o impacto da hierarquia de memória em MPSoCs baseados em NoC. Dentro deste escopo foi desenvolvida uma nova organização de memória fisicamente centralizada com diferentes espaços de endereçamentos denominada nDMA. Este trabalho também apresenta uma comparação entre a nova organização e outras três organizações bastante difundidas tais como memória distribuída, memória compartilhada e memória compartilhada distribuída. Estas duas ultimas adotam um modelo de coerência de cache baseado em diretório completamente desenvolvido em hardware. Os modelos de memória foram implementados na plataforma virtual SIMPLE (SIMPLE Multiprocessor Platform Environment). Resultados experimentais mostram uma forte dependência com relação à carga de comunicação gerada pelas aplicações. O modelo de memória distribuída apresenta melhores resultados conforme a carga de comunicação das aplicações é baixa. Por outro lado, o novo modelo de memória fisicamente compartilhado com diferentes espaços de endereçamento apresenta melhores resultados conforme a carga de comunicação das aplicações é alta. Também foram realizados experimentos objetivando analisar o desempenho dos modelos de memória em situações de alta latência de comunicação na rede. Resultados mostram melhores resultados do modelo de memória distribuída quando a carga de comunicação das aplicações é alta e, caso contrário, o modelo nDMA apresenta melhores resultados. Por fim, foram analisados os desempenhos dos modelos de memória durante o processo de migração de tarefas. Neste caso, os modelos de memória compartilhada e compartilhada distribuída apresentaram melhores resultados devido ao fato de que não se faz necessária o envio dos dados da aplicação nestes modelos e também devido ao menor tamanho de código se comparado com os outros modelos. / In the past few the years, embedded systems have become even more complex both on terms of hardware and software. Lately, the use of MPSoCs (Multi-Processor Systems-on-Chip) has been adopted on these systems for a better energetic and computational efficiency. Due to the use of several processing elements, Networks-on-Chip arise as better performance solutions than buses. Considering this scenario, this work performs an investigation on the impact of memory hierarchy in NoC-based MPSoCs. In this context, a new physically centralized and shared memory organization with different address spaces named nDMA was developed. This work also presents a comparison between the new memory organization and three different well-known memory hierarchy models such as distributed memory and shared and distributed shared memories that make use of a fully hardware cache coherence solution. The memory models were implemented in the SIMPLE (SIMPLE Multiprocessor Platform Environment) virtual platform. Experimental results shows a strong dependency on the application communication workload. The distributed memory model presents better results as the application communication workload is low. On the other hand, the new memory model (physically shared with different address spaces) presents better results as the application communication workload is high. There were also experiments aiming at observing the performance of the memory models in situations where the communication latency on the network is high. Results show better results of the distributed memory model when the application communication workload is high, and the nDMA model presents better results otherwise. Finally, the performance of the memory models during a task migration process were evaluated. In this case, the shared memory and distributed shared memory models presented better results due to the fact that in this case the data memory does not need to be transferred from one point to another and also due to the low size of the memory code in these cases if compared to other memory models. Microeletrônica MPSoC NoC Embedded systems Multiprocessor system-on-chip Network-on-chip Cache coherence Task migration
14	Estudo sobre o impacto da hierarquia de memória em MPSoCs baseados em NoC Silva, Gustavo Girão Barreto da January 2009 (has links) Ao longo dos últimos anos, os sistemas embarcados vêm se tornando cada vez mais complexos tanto em termos de hardware quanto de software. Ultimamente têm-se adotado como solução o uso de MPSoCs (sistemas multiprocessados integrados em chip) para uma maior eficiência energética e computacional nestes sistemas. Com o uso de diversos elementos de processamento, redes-em-chip (NoC - networks-on-chip) aparecem como soluções de melhor desempenho do que barramentos. Nestes ambientes cujo desempenho depende da eficiência do modelo de comunicação, a hierarquia de memória se torna um elemento chave. Baseando-se neste cenário, este trabalho realiza uma investigação sobre o impacto da hierarquia de memória em MPSoCs baseados em NoC. Dentro deste escopo foi desenvolvida uma nova organização de memória fisicamente centralizada com diferentes espaços de endereçamentos denominada nDMA. Este trabalho também apresenta uma comparação entre a nova organização e outras três organizações bastante difundidas tais como memória distribuída, memória compartilhada e memória compartilhada distribuída. Estas duas ultimas adotam um modelo de coerência de cache baseado em diretório completamente desenvolvido em hardware. Os modelos de memória foram implementados na plataforma virtual SIMPLE (SIMPLE Multiprocessor Platform Environment). Resultados experimentais mostram uma forte dependência com relação à carga de comunicação gerada pelas aplicações. O modelo de memória distribuída apresenta melhores resultados conforme a carga de comunicação das aplicações é baixa. Por outro lado, o novo modelo de memória fisicamente compartilhado com diferentes espaços de endereçamento apresenta melhores resultados conforme a carga de comunicação das aplicações é alta. Também foram realizados experimentos objetivando analisar o desempenho dos modelos de memória em situações de alta latência de comunicação na rede. Resultados mostram melhores resultados do modelo de memória distribuída quando a carga de comunicação das aplicações é alta e, caso contrário, o modelo nDMA apresenta melhores resultados. Por fim, foram analisados os desempenhos dos modelos de memória durante o processo de migração de tarefas. Neste caso, os modelos de memória compartilhada e compartilhada distribuída apresentaram melhores resultados devido ao fato de que não se faz necessária o envio dos dados da aplicação nestes modelos e também devido ao menor tamanho de código se comparado com os outros modelos. / In the past few the years, embedded systems have become even more complex both on terms of hardware and software. Lately, the use of MPSoCs (Multi-Processor Systems-on-Chip) has been adopted on these systems for a better energetic and computational efficiency. Due to the use of several processing elements, Networks-on-Chip arise as better performance solutions than buses. Considering this scenario, this work performs an investigation on the impact of memory hierarchy in NoC-based MPSoCs. In this context, a new physically centralized and shared memory organization with different address spaces named nDMA was developed. This work also presents a comparison between the new memory organization and three different well-known memory hierarchy models such as distributed memory and shared and distributed shared memories that make use of a fully hardware cache coherence solution. The memory models were implemented in the SIMPLE (SIMPLE Multiprocessor Platform Environment) virtual platform. Experimental results shows a strong dependency on the application communication workload. The distributed memory model presents better results as the application communication workload is low. On the other hand, the new memory model (physically shared with different address spaces) presents better results as the application communication workload is high. There were also experiments aiming at observing the performance of the memory models in situations where the communication latency on the network is high. Results show better results of the distributed memory model when the application communication workload is high, and the nDMA model presents better results otherwise. Finally, the performance of the memory models during a task migration process were evaluated. In this case, the shared memory and distributed shared memory models presented better results due to the fact that in this case the data memory does not need to be transferred from one point to another and also due to the low size of the memory code in these cases if compared to other memory models. Microeletrônica MPSoC NoC Embedded systems Multiprocessor system-on-chip Network-on-chip Cache coherence Task migration
15	Algoritmo de prefetching de dados temporizado para sistemas multiprocessadores baseados em NOC SILVEIRA, Maria Cireno Ribeiro 09 March 2015 (has links) Submitted by Fabio Sobreira Campos da Costa (fabio.sobreira@ufpe.br) on 2016-03-15T13:58:26Z No. of bitstreams: 2 license_rdf: 1232 bytes, checksum: 66e71c371cc565284e70f40736c94386 (MD5) UFPE-MEI 2015-078 - Maria Cireno Ribeiro Silveira.pdf: 4578273 bytes, checksum: 1c434494e0c03cb02156a37ebfd1c7da (MD5) / Made available in DSpace on 2016-03-15T13:58:26Z (GMT). No. of bitstreams: 2 license_rdf: 1232 bytes, checksum: 66e71c371cc565284e70f40736c94386 (MD5) UFPE-MEI 2015-078 - Maria Cireno Ribeiro Silveira.pdf: 4578273 bytes, checksum: 1c434494e0c03cb02156a37ebfd1c7da (MD5) Previous issue date: 2015-03-09 / O prefetching é uma técnica considerada e ciente para mitigar um problema já conhecido em sistemas computacionais: a diferença entre o desempenho do processador e do acesso à memória. O objetivo do prefetching é aproximar o dado do processador retirando-o da memória e carregando na cache local. Uma vez que o dado seja requisitado pelo processador, ele já estará disponível na cache, reduzindo a taxa de perdas e a penalidade do sistema. Para sistemas multiprocessadores baseados em NoCs a e ciência do prefetching é ainda mais crítica em relação ao desempenho, uma vez que o tempo de acesso ao dado varia dependendo da distância entre processador e memória e do tráfego da rede. Este trabalho propõe um algoritmo de prefetching de dados temporizado, que tem como objetivo minimizar a penalidade dos núcleos através uma solução de prefetching baseada em predição de tempo para sistemas multiprocessadores baseados em NoC. O algoritmo utiliza um processo pró-ativo iniciado pelo servidor para realizar requisições de prefetching baseado no histórico de perdas de cache e informações da NoC. Nos experimentos realizados para 16 núcleos, o algoritmo proposto reduziu a penalidade dos processadores em 53,6% em comparação com o prefetching baseado em eventos (faltas na cache), sendo a maior redução de 29% da penalidade. / The prefetching technique is an e ective approach to mitigate a well-known problem in multi-core processors: the gap between computing and data access performance. The goal of prefetching is to approximate data to the CPU by retrieving the data from the memory and loading it in the cache. When the data is requested by the CPU, it is already available in the cache, reducing the miss rate and penalty. In multiprocessor NoC-based systems the prefetching e ciency is even more critical to system performance, since the access time depends of the distance between the requesting processor and the memory and also of the network tra c. This work proposes a temporized data prefetching algorithm that aims to minimize the penalty of the cores through one prefetching solution based on time prediction for multiprocessor NoC-based systems. The algorithm utilizes a proactive process initiated by the server to request prefetching data based on cache miss history and NoC's information. In the experiments for 16 cores, the proposed algorithm has successfully reduced the processors penalty in 53,6% compared to the event-based prefetching and the best case was a penalty reduction of 29%. Multiprocessador Manycore Prefetching NoC - Network-on-Chip Coerência de cache Multiprocessor Manycore Prefetching NoC - Network-on-Chip Cache coherence
16	Performance evaluation of multithreading in a Diameter Credit Control Application Åkesson, Gustav, Rantzow, Pontus January 2010 (has links) Moore's law states that the amount of computational power available at a given cost doubles every 18 months and indeed, for the past 20 years there has been a tremendous development in microprocessors. However, for the last few years, Moore's law has been subject for debate, since to manage heat issues, processor manufacturers have begun favoring multicore processors, which means parallel computation has become necessary to fully utilize the hardware. This also means that software has to be written with multiprocessing in mind to take full advantage of the hardware, and writing parallel software introduces a whole new set of problems. For the last couple of years, the demands on telecommunication systems have increased and to manage the increasing demands, multiprocessor servers have become a necessity. Applications must fully utilize the hardware and such an application is the Diameter Credit Control Application (DCCA). The DCCA uses the Diameter networking protocol and the DCCA's purpose is to provide a framework for real-time charging. This could, for instance, be to grant or deny a user's request of a specific network activity and to account for the eventual use of that network resource. This thesis investigates whether it is possible to develop a Diameter Credit Control Application that achieves linear scaling and the eventual pitfalls that exist when developing a scalable DCCA server. The assumption is based on the observation that the DCCA server's connections have little to nothing in common (i.e. little or no synchronization), and introducing more processors should therefore give linear scaling. To investigate whether a DCCA server's performance scales linearly, a prototype has been developed. Along with the development of the prototype, constant performance analysis was conducted to see what affected performance and server scalability in a multiprocessor DCCA environment. As the results show, quite a few factors besides synchronization and independent connections affected scalability of the DCCA prototype. The results show that the DCCA prototype did not always achieve linear scaling. However, even if it was not linear, certain design decisions gave considerable performance increase when more processors were introduced. Diameter DCCA Multiprocessing Server Performance Cache coherence Dynamic memory Computer Sciences Datavetenskap (datalogi) Software Engineering Programvaruteknik
17	Protocoles scalables de cohérence des caches pour processeurs manycore à espace d'adressage partagé visant la basse consommation. / Scalable cache coherence protocols for energy-efficient shared memory manycore processors Liu, Hao 27 January 2016 (has links) L'architecture TSAR (Tera-Scale ARchitecture) développée conjointement par BULL, le Lip6 et le CEA-LETI est une architecture manycore CC-NUMA extensible jusqu'à 1024 cœurs. Le protocole de cohérence de cache DHCCP dans l'architecture TSAR repose sur le principe du répertoire global distribué en utilisant la stratégie d'écriture simultanée afin de passer à l'échelle, mais cette scalabilité a un coût énergétique important que nous cherchons à réduire. Actuellement, les plus grosses entreprises dans le domaine des semi-conducteurs, comme Intel ou AMD, utilisent les protocoles MESI ou MOESI dans leurs processeurs multicoeurs. Ces types de protocoles utilisent la stratégie d'écriture différée pour réduire la forte consommation énergétique due aux écritures. Mais la complexité d'implémentation et la forte augmentation de ce trafic de cohérence quand le nombre de processeurs augmente limite le passage à l'échelle de ces protocoles au-delà de quelques dizaines de coeurs. Dans cette thèse, nous proposons un nouveau protocole de cohérence de cache utilisant une méthode hybride pour traiter les écritures dans le cache L1 privé : pour les lignes non partagées, le contrôleur de cache L1 utilise la stratégie d'écriture différée, de façon à modifier les lignes localement. Pour les lignes partagées, le contrôleur de cache L1 utilise la stratégie d'écriture immédiate pour éviter l'état de propriété exclusive sur ces lignes partagées. Cette méthode, appelée RWT pour Released Write Through, passe non seulement à l'échelle, mais réduit aussi significativement la consommation énergétique liée aux écritures. Nous avons aussi optimisé la solution actuelle pour gérer la cohérence des TLBs dans l'architecture TSAR, en termes de performance et de consommation énergétique. Enfin, nous introduisons dans cette thèse un nouveau petit cache, appelé micro-cache, entre le coeur et le cache L1, afin de réduire le nombre d'accès au cache d'instructions. / The TSAR architecture (Tera-Scale ARchitecture) developed jointly by Lip6 Bull and CEA-LETI is a CC-NUMA manycore architecture which is scalable up to 1024 cores. The DHCCP cache coherence protocol in the TSAR architecture is a global directory protocol using the write-through policy in the L1 cache for scalability purpose, but this write policy causes a high power consumption which we want to reduce. Currently the biggest semiconductors companies, such as Intel or AMD, use the MESI MOESI protocols in their multi-core processors. These protocols use the write-back policy to reduce the high power consumption due to writes. However, the complexity of implementation and the sharp increase in the coherence traffic when the number of processors increases limits the scalability of these protocols beyond a few dozen cores. In this thesis, we propose a new cache coherence protocol using a hybrid method to process write requests in the L1 private cache : for exclusive lines, the L1 cache controller chooses the write-back policy in order to modify locally the lines as well as eliminate the write traffic for exclusive lines. For shared lines, the L1 cache controller uses the write-through policy to simplify the protocol and in order to guarantee the scalability. We also optimized the current solution for the TLB coherence problem in the TSAR architecture. The new method which is called CC-TLB not only improves the performance, but also reduces the energy consumption. Finally, this thesis introduces a new micro cache between the core and the L1 cache, which allows to reduce the number of accesses to the instruction cache, in order to save energy. Manycore Cohérence des caches Écriture immédiate Écriture différée Tlb Manycore Cache coherence Write-back policy 004
18	Energy-Efficient and High-Performance Nanophotonic Interconnects for Shared Memory Multicores Morris, Randy W., Jr. 26 July 2012 (has links) No description available. Computer Engineering Electrical Engineering NOCS nanophotonics cache coherence PULSE 3D-NoC High Performance Computing
19	Supporting Software Transactional Memory in Distributed Systems: Protocols for Cache-Coherence, Conflict Resolution and Replication Zhang, Bo 05 December 2011 (has links) Lock-based synchronization on multiprocessors is inherently non-scalable, non-composable, and error-prone. These problems are exacerbated in distributed systems due to an additional layer of complexity: multinode concurrency. Transactional memory (TM) is an emerging, alternative synchronization abstraction that promises to alleviate these difficulties. With the TM model, code that accesses shared memory objects are organized as transactions, which speculatively execute, while logging changes. If transactional conflicts are detected, one of the conflicting transaction is aborted and re-executed, while the other is allowed to commit, yielding the illusion of atomicity. TM for multiprocessors has been proposed in software (STM), in hardware (HTM), and in a combination (HyTM). This dissertation focuses on supporting the TM abstraction in distributed systems, i.e., distributed STM (or D-STM). We focus on three problem spaces: cache-coherence (CC), conflict resolution, and replication. We evaluate the performance of D-STM by measuring the competitive ratio of its makespan --- i.e., the ratio of its makespan (the last completion time for a given set of transactions) to the makespan of an optimal off-line clairvoyant scheduler. We show that the performance of D-STM for metric-space networks is O(N^2) for N transactions requesting an object under the Greedy contention manager and an arbitrary CC protocol. To improve the performance, we propose a class of location-aware CC protocols, called LAC protocols. We show that the combination of the Greedy manager and a LAC protocol yields an O(NlogN s) competitive ratio for s shared objects. We then formalize two classes of CC protocols: distributed queuing cache-coherence (DQCC) protocols and distributed priority queuing cache-coherence (DPQCC) protocols, both of which can be implemented using distributed queuing protocols. We show that a DQCC protocol is O(NlogD)-competitive and a DPQCC protocol is O(log D_delta)-competitive for N dynamically generated transactions requesting an object, where D_delta is the normalized diameter of the underlying distributed queuing protocol. Additionally, we propose a novel CC protocol, called Relay, which reduces the total number of aborts to O(N) for N conflicting transactions requesting an object, yielding a significantly improvement over past CC protocols which has O(N^2) total number of aborts. We also analyze Relay's dynamic competitive ratio in terms of the communication cost (for dynamically generated transactions), and show that Relay's dynamic competitive ratio is O(log D_0), where D_0 is the normalized diameter of the underlying network spanning tree. To reduce unnecessary aborts and increase concurrency for D-STM based on globally-consistent contention management policies, we propose the distributed dependency-aware (DDA) conflict resolution model, which adopts different conflict resolution strategies based on transaction types. In the DDA model, read-only transactions never abort by keeping a set of versions for each object. Each transaction only keeps precedence relations based on its local knowledge of precedence relations. We show that the DDA model ensures that 1) read-only transactions never abort, 2) every transaction eventually commits, 3) supports invisible reads, and 4) efficiently garbage collects useless object versions. To establish competitive ratio bounds for contention managers in D-STM, we model the distributed transactional contention management problem as the traveling salesman problem (TSP). We prove that for D-STM, any online, work conserving, deterministic contention manager provides an Omega(max[s,s^2/D]) competitive ratio in a network with normalized diameter D and s shared objects. Compared with the Omega(s) competitive ratio for multiprocessor STM, the performance guarantee for D-STM degrades by a factor proportional to s/D. We present a randomized algorithm, called Randomized, with a competitive ratio O(sClog n log ^{2} n) for s objects shared by n transactions, with a maximum conflicting degree C. To break this lower bound, we present a randomized algorithm Cutting, which needs partial information of transactions and an approximate TSP algorithm A with approximation ratio phi_A. We show that the average case competitive ratio of Cutting is O(s phi_A log^{2}m log^{2}n), which is close to O(s). Single copy (SC) D-STM keeps only one writable copy of each object, and thus cannot tolerate node failures. We propose a quorum-based replication (QR) D-STM model, which provides provable fault-tolerance without incurring high communication overhead, when compared with the SC model. The QR model stores object replicas in a tree quorum system, where two quorums intersect if one of them is a write quorum, and ensures the consistency among replicas at commit-time. The communication cost of an operation in the QR model is proportional to the communication cost from the requesting node to its closest read or write quorum. In the presence of node failures, the QR model exhibits high availability and degrades gracefully when the number of failed nodes increases, with reasonable higher communication cost. We develop a protoytpe implementation of the dissertation's proposed solutions, including DQCC and DPQCC protocols, Relay protocol, and the DDA model, in the HyFlow Java D-STM framework. We experimentally evaluated these solutions with respective competitor solutions on a set of microbenchmarks (e.g., data structures including distributed linked list, binary search tree and red-black tree) and macrobenchmarks (e.g., distributed versions of the applications in the STAMP STM benchmark suite for multiprocessors). Our experimental studies revealed that: 1) based on the same distributed queuing protocol (i.e., Ballistic CC protocol), DPQCC yields better transactional throughput than DQCC, by a factor of 50% - 100%, on a range of transactional workloads; 2) Relay outperforms competitor protocols (including Arrow, Ballistic and Home) by more than 200% when the network size and contention increase, as it efficiently reduces the average aborts per transaction (less than 0.5); 3) the DDA model outperforms existing contention management policies (including Greedy, Karma and Kindergarten managers) by upto 30%-40% in high contention environments; For read/write-balanced workloads, the DDA model outperforms these contention management policies by 30%-60% on average; for read-dominated workloads, the model outperforms by over 200%. / Ph. D. Cache-Coherence Distributed Queuing Contention Management Software Transactional Memory Replication Quorum System
20	HyFlow: A High Performance Distributed Software Transactional Memory Framework Saad Ibrahim, Mohamed Mohamed 14 June 2011 (has links) We present HyFlow - a distributed software transactional memory (D-STM) framework for distributed concurrency control. Lock-based concurrency control suffers from drawbacks including deadlocks, livelocks, and scalability and composability challenges. These problems are exacerbated in distributed systems due to their distributed versions which are more complex to cope with (e.g., distributed deadlocks). STM and D-STM are promising alternatives to lock-based and distributed lock-based concurrency control for centralized and distributed systems, respectively, that overcome these difficulties. HyFlow is a Java framework for DSTM, with pluggable support for directory lookup protocols, transactional synchronization and recovery mechanisms, contention management policies, cache coherence protocols, and network communication protocols. HyFlow exports a simple distributed programming model that excludes locks: using (Java 5) annotations, atomic sections are defiend as transactions, in which reads and writes to shared, local and remote objects appear to take effect instantaneously. No changes are needed to the underlying virtual machine or compiler. We describe HyFlow's architecture and implementation, and report on experimental studies comparing HyFlow against competing models including Java remote method invocation (RMI) with mutual exclusion and read/write locks, distributed shared memory (DSM), and directory-based D-STM. / Master of Science Directory Protocols Cache Coherence Contention Management Control-Flow Dataflow Software Transactional Memory Distributed Systems

Search results