Global ETD Search

21	On Improving Distributed Transactional Memory through Nesting, Partitioning and Ordering Turcu, Alexandru 03 March 2015 (has links) Distributed Transactional Memory (DTM) is an emerging, alternative concurrency control model that aims to overcome the challenges of distributed-lock based synchronization. DTM employs transactions in order to guarantee consistency in a concurrent execution. When two or more transactions conflict, all but one need to be delayed or rolled back. Transactional Memory supports code composability by nesting transactions. Nesting how- ever can be used as a strategy to improve performance. The closed nesting model enables partial rollback by allowing a sub-transaction to abort without aborting its parent, thus reducing the amount of work that needs to be retried. In the open nesting model, sub- transactions can commit to the shared state independently of their parents. This reduces isolation and increases concurrency. Our first main contribution in this dissertation are two extensions to the existing Transac- tional Forwarding Algorithm (TFA). Our extensions are N-TFA and TFA-ON, and support closed nesting and open nesting, respectively. We additionally extend the existing SCORe algorithm with support for open nesting (we call the result SCORe-ON). We implement these algorithms in a Java DTM framework and evaluate them. This represents the first study of transaction nesting in the context of DTM, and contributes the first DTM implementation which supports closed nesting or open nesting. Closed nesting through our N-TFA implementation proved insufficient for any significant throughput improvements. It ran on average 2% faster than flat nesting, while performance for individual tests varied between 42% slowdown and 84% speedup. The workloads that benefit most from closed nesting are characterized by short transactions, with between two and five sub-transactions. Open nesting, as exemplified by our TFA-ON and SCORe-ON implementations, showed promising results. We determined performance improvement to be a trade-off between the overhead of additional commits and the fundamental conflict rate. For write-intensive, high- conflict workloads, open nesting may not be appropriate, and we observed a maximum speedup of 30%. On the other hand, for lower fundamental-conflict workloads, open nesting enabled speedups of up to 167% in our tests. In addition to the two nesting algorithms, we also develop Hyflow2, a high-performance DTM framework for the Java Virtual Machine, written in Scala. It has a clean Scala API and a compatibility Java API. Hyflow2 was on average two times faster than Hyflow on high-contention workloads, and up to 16 times faster in low-contention workloads. Our second main contribution for improving DTM performance is automated data partition- ing. Modern transactional processing systems need to be fast and scalable, but this means many such systems settled for weak consistency models. It is however possible to achieve all of strong consistency, high scalability and high performance, by using fine-grained partitions and light-weight concurrency control that avoids superfluous synchronization and other over- heads such as lock management. Independent transactions are one such mechanism, that rely on good partitions and appropriately defined transactions. On the downside, it is not usually straightforward to determine optimal partitioning schemes, especially when dealing with non-trivial amounts of data. Our work attempts to solve this problem by automating the partitioning process, choosing the correct transactional primitive, and routing transactions appropriately. Our third main contribution is Alvin, a system for managing concurrently running trans- actions on a geographically replicated data-store. Alvin supports general-purpose transactions, and guarantees strong consistency criteria. Through a novel partial order broadcast protocol, Alvin maximizes the parallelism of ordering and local transaction processing, resulting in low client-perceived latency. Alvin can process read-only transactions either lo- cally or globally, according to the desired consistency criterion. Conflicting transactions are ordered across all sites. We built Alvin in the Go programming language. We conducted our evaluation study on Amazon EC2 infrastructure and compared against Paxos- and EPaxos- based state machine replication protocols. Our results reveal that Alvin provides significant speed-up for read-dominated TPC-C workloads: as much as 4.8x when compared to EPaxos on 7 datacenters, and up to 26% in write-intensive workloads. Our fourth and final contribution is M2Paxos, a multi-leader implementation of Generalized Consensus. Single leader-based consensus protocols are known to stop scaling once the leader reaches its saturation point. Ordering commands based on conflicts is appealing due to the potentially higher parallelism, but is imperfect due to the higher quorum sizes required for fast decisions and the need to compare commands and track their dependencies. M2Paxos on the other hand exploits fast decisions (i.e., delivery of a command in two communication delays) by leveraging a classic quorum size, matching a majority of nodes deployed. M2Paxos does not establish command dependencies based on conflicts, but it binds accessed objects to nodes, making sure commands operating on the same object will be ordered by the same node. Our evaluation study of M2Paxos (also built in Go) confirms the effectiveness of this approach, getting up to 7⨉ improvements in performance over state- of-the-art consensus and generalized consensus algorithms. / Ph. D. Distributed Transactional Memory Distributed Systems Nested Transactions Automated Partitioning Consensus
22	Mutex Locking versus Hardware Transactional Memory: An Experimental Evaluation Moore, Sean Ryan 06 October 2015 (has links) It has historically been the case that CPUs have run programs ever faster without significant intervention on the behalf of the programmer. However, this "free lunch" has largely ended due to the end of exponentially increasing core frequency and the current slow increase in instruction-level parallelism but continues to a degree in cache size improvements. But since Moore's law still largely continues "lunch", i.e. program performance, can still be bought at the price of rewriting code for multiple cores, which is enabled by the trend Moore's law describes. Multicore architectures cannot aid performance for problems whose solutions are necessarily sequential in nature and writing efficient and correct concurrent programs is not easy in all cases when using synchronization methods like fine-grained mutex locks. Transactional memory, and its implementation as hardware transactional memory, allow programmers to write concurrent applications without the attendant complexity of programming with mutex locks. This allows programmers to focus on optimizing the application for performance. Given that transactions can run two segments of code in parallel that a mutex lock would force to run sequentially and that transactions can abort, causing a program to do the same work more than once, whether transactions perform better or worse than mutex locks is dependent on the program's execution profile and the coarseness or fineness at which mutex locks are used. In this thesis the GNU C Library's futex implementation of mutex locks and Intel's Restricted Transactional Memory have been compared and the behavior of those transactions has been analyzed. This analysis includes a pathological behavior permitted by the GNU C Library's hardware transactional memory implementation of mutex locks. The tradeoffs between fine-grained and global locking implementations have been discussed, compared, and used in the context of fallback locks for hardware transactions. This thesis provides evidence to the effect that fine-grained locking is not critical for program performance and that in many cases global locking and hardware transactions can provide nearly equivalent performance without the programming difficulties. This work has shown that across the 23 applications examined, with relation to their original locking implementation, a global locking scheme without elision has a 0.96x speedup, Intel's Restricted Transactional Memory (RTM) with the application's original locks as a fallback has a 1.01x speedup and with global lock fallback RTM has a speedup of 0.97x. This work is supported in part by NAVSEA/NEEC under grant 3003279297. Any opinions, findings, and conclusions or recommendations expressed in this thesis are those of the author and do not necessarily reflect the views of NAVSEA. / Master of Science Hardware Transactional Memory Global Lock GNU C Library
23	Avaliação de desempenho do sistema de memória transacional de Clojure como biblioteca de sincronização na linguagem Java / Performance evaluation of Clojure transactional memory system as a synchronization library in Java language Calcina Ccori, Pablo César 14 June 2011 (has links) Neste trabalho apresenta-se uma avaliação do desempenho da implementação de memória transacional da linguagem Clojure, utilizada como biblioteca de sincronização para uso em conjunto com outras aplicações dentro da máquina virtual de Java. É implementada uma camada de interface entre as estruturas de dados de Clojure e o benchmark STMBench7 e são discutidos alguns aspectos que geram sobrecarga no desempenho. / In this work a performance evaluation of Clojure transactional memory implementation is presented, using it as a synchronization library to work together with other applications on Java virtual machine. It is implemented an interface layer between Clojure data structures and STMBench7 benchmark, and issues about overhead in performance are discussed. clojure clojure memória transactional em software software transactional memory stm stm
24	On improving the ease of use of the software transactional memory abstraction / Faciliter l'utilisation des mémoires transactionnelles logicielles Crain, Tyler 06 March 2013 (has links) Les architectures multicœurs changent notre façon d'écrire des programmes. L'écriture de programmes concurrents est bien connue pour être difficile. Traditionnellement, l'utilisation de verrous (locks) permettant au code de s'exécuter en exclusion mutuelle, a été l'abstraction la plus largement utilisée pour l'écriture des programmes concurrents. Malheureusement, il est difficile d'écrire des programmes concurrents efficaces et corrects reposant sur des verrous. En outre, les verrous présentent d'autres problèmes, notamment celui du passage à l'échelle. Le concept de mémoire transactionnelle a été proposé comme une solution à ces difficultés. Les transactions peuvent être considérées comme une abstraction de haut niveau, ou une méthodologie pour l'écriture de programmes concurrents, ce qui permet au programmeur de pouvoir déclarer des sections de code devant être exécutés de façon atomique, sans avoir à se soucier des détails de synchronisation. Malheureusement, bien qu'assurément plus facile à utiliser que les verrous, la mémoire transactionnelle souffre encore de problèmes de performance et de facilité d'utilisation. En fait, de nombreux concepts relatifs à l'utilisation et à la sémantique des transactions n'ont pas encore des normes convenues. Cette thèse propose de nouvelles solutions permettant de faciliter l'utilisation des mémoires transactionellles. La thèse débute par un chapitre qui donne un bref aperçu de la mémoire transactionnelle logicielle (STM) ainsi qu'une discussion sur le problème de la facilité d'utilisation. Les contributions à la recherche sont ensuite divisées en quatre chapitres principaux, chacun proposant une approche différente afin de rendre les STMs plus facile à utiliser. / Multicore architectures are changing the way we write programs. Writing concurrent programs is well known to be difficult task. Traditionally, the use of locks allowing code to execute in mutual exclusion has been the most widely used abstraction to write concurrent programs. Unfortunately, using locks it is difficult to write correct concurrent programs that perform efficiently. Additionally, locks present other problems such as scalability issues. Transactional memory has been proposed as a possible promising solution to these difficulties of writing concurrent programs. Transactions can be viewed as a high level abstraction or methodology for writing concurrent programs, allowing the programmer to be able to declare what sections of his code should be executed atomically, without having to worry about synchronization details. Unfortunately, although arguably easier to use then locks, transactional memory still suffers from performance and ease of use problems. In fact many concepts surrounding the usage and semantics of transactions have no widely agreed upon standards. This thesis specifically focuses on these ease of use problems by discussing how previous research has dealt with them and proposing new solutions putting ease of use first. The thesis starts with a chapter giving a brief overview of software transactional memory (STM) as well as a discussion of the problem of ease of use that is focused on in the later chapters. The research contributions are then divided into four main chapters, each looking at different approaches working towards making transactional memory easier to use. Mémoire transactionnelle logicielle STM Programmation concurrente Structures de données Software transactional memory STM Concurrent programming Data structures
25	Code profiling and optimization in transactional memory systems / Profiling e otimização de código em sistemas de memória transacional Cordeiro, Silvio Ricardo January 2014 (has links) Memória Transacional tem se demonstrado um paradigma promissor na implementação de aplicações concorrentes sob memória compartilhada que busquem evitar um modelo de sincronização baseado em locks. Em vez de sujeitar a execução a um acesso exclusivo com base no valor de um lock que é compartilhado por threads concorrentes, uma aplicação sob Memória Transacional tenta executar seções críticas de modo otimista, desfazendo as modificações no caso de um conflito de acesso à memória. Entretanto, apesar de a abordagem baseada em locks ter adquirido um número significativo de ferramentas automatizadas para a depuração, profiling e otimização automatizados (por ser uma das técnicas de sincronização mais antigas e mais bem pesquisadas), o campo da Memória Transacional ainda é comparativamente recente, e programadores frequentemente precisam adaptar manualmente suas aplicações transacionais ao encontrar problemas de eficiência. Este trabalho propõe um sistema no qual o profiling de código em uma implementação de Memória Transacional simulada é utilizado para caracterizar uma aplicação transacional, formando a base para uma parametrização automatizada do respectivo sistema especulativo para uma execução eficiente do código em questão. Também é proposta uma abordagem de escalonamento de threads guiado por profiling em uma implementação de Memória Transacional baseada em software, usando dados coletados pelo profiler para prever a probabilidade de conflitos e determinar que thread escalonar com base nesta previsão. São apresentados os resultados de experimentos sob ambas as abordagens. / Transactional Memory has shown itself to be a promising paradigm for the implementation of shared-memory concurrent applications that eschew a lock-based model of data synchronization. Rather than conditioning exclusive access on the value of a lock that is shared across concurrent threads, Transactional Memory attempts to execute critical sections optimistically, rolling back the modifications in the event of a data access conflict. However, while the lock-based approach has acquired a significant body of debugging, profiling and automated optimization tools (as one of the oldest and most researched synchronization techniques), the field of Transactional Memory is still comparably recent, and programmers are usually tasked with an unguided manual tuning of their transactional applications when facing efficiency problems. We propose a system in which code profiling in a simulated hardware implementation of Transactional Memory is used to characterize a transactional application, which forms the basis for the automated tuning of the underlying speculative system for the efficient execution of that particular application. We also propose a profile-guided approach to the scheduling of threads in a software-based implementation of Transactional Memory, using collected data to predict the likelihood of conflicts and determine what thread to schedule based on this prediction. We present the results achieved under both designs. Processamento paralelo Processamento : Alto desempenho Transactional memory Profiling Scheduling Shared memory Parallel programming High-performance computing
26	Scalable and Transparent Parallelization of Multiplayer Games Simion, Bogdan 15 February 2010 (has links) In this thesis, we study parallelization of multiplayer games using software Transactional Memory (STM) support. We show that STM provides not only ease of programming, but also better scalability than achievable with state-of-the-art lock-based programming for this realistic high impact application. We evaluate and compare two parallel implementations of a simplified version (named SynQuake) of the popular game Quake. While in STM SynQuake support for maintaining consistency of each potentially complex game action is automatic, conservative locking of surrounding objects within a bounding-box for the duration of the game action is inherently needed in lock-based SynQuake. This leads to higher scalability of STM SynQuake versus lock-based SynQuake due to increased false sharing in the latter. Task assignment to threads has a second-order effect on scalability of STM-SynQuake, impacting the application's true sharing patterns. We show that a locality-aware task assignment provides the best trade-off between load balancing and conflict reduction. multiplayer games parallelization transactional memory lock-based synchronization scalability load balancing performance 0984
27	Overlay Architectures for FPGA-Based Software Packet Processing Martin, Labrecque 16 June 2011 (has links) Packet processing is the enabling technology of networked information systems such as the Internet and is usually performed with fixed-function custom-made ASIC chips. As communication protocols evolve rapidly, there is increasing interest in adapting features of the processing over time and, since software is the preferred way of expressing complex computation, we are interested in finding a platform to execute packet processing software with the best possible throughput. Because FPGAs are widely used in network equipment and they can implement processors, we are motivated to investigate executing software directly on the FPGAs. Off-the-shelf soft processors on FPGA fabric are currently geared towards performing embedded sequential tasks and, in contrast, network processing is most often inherently parallel between packet flows, if not between each individual packet. Our goal is to allow multiple threads of execution in an FPGA to reach a higher aggregate throughput than commercially available shared-memory soft multi-processors via improvements to the underlying soft processor architecture. We study a number of processor pipeline organizations to identify which ones can scale to a larger number of execution threads and find that tuning multithreaded pipelines can provide compact cores with high throughput. We then perform a design space exploration of multicore soft systems, compare single-threaded and multithreaded designs to identify scalability limits and develop processor architectures allowing threads to execute with as little architectural stalls as possible: in particular with instruction replay and static hazard detection mechanisms. To further reduce the wait times, we allow threads to speculatively execute by leveraging transactional memory. Our multithreaded multiprocessor along with our compilation and simulation framework makes the FPGA easy to use for an average programmer who can write an application as a single thread of computation with coarse-grained synchronization around shared data structures. Comparing with multithreaded processors using lock-based synchronization, we measure up to 57\% additional throughput with the use of transactional-memory-based synchronization. Given our applications, gigabit interfaces and 125 MHz system clock rate, our results suggest that soft processors can process packets in software at high throughput and low latency, while capitalizing on the FPGAs already available in network equipment. computer architecture soft processors FPGA packet processing network processor multithreaded transactional memory 0544
28	Relaxing Concurrency Control in Transactional Memory Aydonat, Utku 05 January 2012 (has links) Transactional memory (TM) systems have gained considerable popularity in the last decade driven by the increased demand for tools that ease parallel programming. TM eliminates the need for user-locks that protect accesses to shared data. It offers performance close to that of fine-grain locking with the programming simplicity of coarse-grain locking. Today’s TM systems implement the two-phase-locking (2PL) algorithm which aborts transactions every time a conflict occurs. 2PL is a simple algorithm that provides fast transactional operations. However, it limits concurrency in applications with high contention because it increases the rate of aborts. We propose the use of a more relaxed concurrency control algorithm to provide better concurrency. This algorithm is based on the conflict-serializability (CS) model. Unlike 2PL, it allows some transactions to commit successfully even when they make conflicting accesses. We implement this algorithm both in a software TM system as well as in a simulator of a hardware TM system. Our evaluation using TM benchmarks shows that the algorithm improves the performance of applications with long transactions and high abort rates. Performance is improved by up to 299% in the software TM, and up to 66% in the hardware simulator. We argue that these improvements come with little additional complexity and require no changes to the transactional programming model. This makes our implementation feasible parallel programming multi threaded programs transactional memory synchronization serializability concurrency control 0544 0984
29	Scalable and Transparent Parallelization of Multiplayer Games Simion, Bogdan 15 February 2010 (has links) In this thesis, we study parallelization of multiplayer games using software Transactional Memory (STM) support. We show that STM provides not only ease of programming, but also better scalability than achievable with state-of-the-art lock-based programming for this realistic high impact application. We evaluate and compare two parallel implementations of a simplified version (named SynQuake) of the popular game Quake. While in STM SynQuake support for maintaining consistency of each potentially complex game action is automatic, conservative locking of surrounding objects within a bounding-box for the duration of the game action is inherently needed in lock-based SynQuake. This leads to higher scalability of STM SynQuake versus lock-based SynQuake due to increased false sharing in the latter. Task assignment to threads has a second-order effect on scalability of STM-SynQuake, impacting the application's true sharing patterns. We show that a locality-aware task assignment provides the best trade-off between load balancing and conflict reduction. multiplayer games parallelization transactional memory lock-based synchronization scalability load balancing performance 0984
30	Overlay Architectures for FPGA-Based Software Packet Processing Martin, Labrecque 16 June 2011 (has links) Packet processing is the enabling technology of networked information systems such as the Internet and is usually performed with fixed-function custom-made ASIC chips. As communication protocols evolve rapidly, there is increasing interest in adapting features of the processing over time and, since software is the preferred way of expressing complex computation, we are interested in finding a platform to execute packet processing software with the best possible throughput. Because FPGAs are widely used in network equipment and they can implement processors, we are motivated to investigate executing software directly on the FPGAs. Off-the-shelf soft processors on FPGA fabric are currently geared towards performing embedded sequential tasks and, in contrast, network processing is most often inherently parallel between packet flows, if not between each individual packet. Our goal is to allow multiple threads of execution in an FPGA to reach a higher aggregate throughput than commercially available shared-memory soft multi-processors via improvements to the underlying soft processor architecture. We study a number of processor pipeline organizations to identify which ones can scale to a larger number of execution threads and find that tuning multithreaded pipelines can provide compact cores with high throughput. We then perform a design space exploration of multicore soft systems, compare single-threaded and multithreaded designs to identify scalability limits and develop processor architectures allowing threads to execute with as little architectural stalls as possible: in particular with instruction replay and static hazard detection mechanisms. To further reduce the wait times, we allow threads to speculatively execute by leveraging transactional memory. Our multithreaded multiprocessor along with our compilation and simulation framework makes the FPGA easy to use for an average programmer who can write an application as a single thread of computation with coarse-grained synchronization around shared data structures. Comparing with multithreaded processors using lock-based synchronization, we measure up to 57\% additional throughput with the use of transactional-memory-based synchronization. Given our applications, gigabit interfaces and 125 MHz system clock rate, our results suggest that soft processors can process packets in software at high throughput and low latency, while capitalizing on the FPGAs already available in network equipment. computer architecture soft processors FPGA packet processing network processor multithreaded transactional memory 0544

Search results