Global ETD Search

1	On Optimizing Transactional Memory: Transaction Splitting, Scheduling, Fine-grained Fallback, and NUMA Optimization
Mohamedin, Mohamed Ahmed Mahmoud (01 September 2015)
The industrial shift from single-core processors to multi-core ones introduced many challenges. Among them, a program cannot get a free performance boost just by upgrading to new hardware, because new chips include more processing units but run at the same (or a comparable) clock speed as the previous generation. To exploit the newly available hardware effectively and thus gain performance, a program should maximize parallelism. Unfortunately, parallel programming poses several challenges, especially when synchronization is involved because parallel threads need to access the same shared data. Locks are the standard synchronization mechanism, but gaining performance with locks is difficult for non-expert programmers and for anyone without deep knowledge of the application logic. A new, easier synchronization abstraction is therefore required, and Transactional Memory (TM) is the concrete candidate. TM is a new programming paradigm that simplifies the implementation of synchronization. The programmer just defines atomic parts of the code and the underlying TM system handles the required synchronization optimistically. In the past decade, TM researchers worked extensively to improve TM-based systems. Most of the work has been dedicated to Software TM (or STM) as it does not require special transactional hardware support. Very recently (in the past two years), that hardware support has become commercially available in commodity processors, so a large number of customers can finally take advantage of it. Hardware TM (or HTM) has the potential to deliver the best performance of any TM-based system, but current HTM systems are best-effort, so transactions are not guaranteed to commit. 
In fact, HTM transactions are limited in size and time and are prone to livelock at high contention levels. Another challenge posed by current multi-core hardware platforms is the internal architecture used for interfacing with main memory. Specifically, when the common computer deployment changed from a single processor to multiple multi-core processors, architects also redesigned the hardware subsystem that manages memory access: from one providing Uniform Memory Access (UMA), where the latency needed to fetch a memory location is the same regardless of the core on which the thread executes, to the current one with Non-Uniform Memory Access (NUMA), where that latency differs according to the core used and the memory socket accessed. This switch in technology has implications for the performance of concurrent applications. In fact, the building blocks commonly used for designing concurrent algorithms under the assumptions of UMA (e.g., relying on centralized meta-data) may not provide the same high performance and scalability when deployed on NUMA-based architectures. In this dissertation, we tackle the performance and scalability challenges of multi-core architectures by providing three solutions for increasing performance using HTM (i.e., Part-HTM, Octonauts, and Precise-TM), and one solution for the scalability issues posed by NUMA architectures (i.e., Nemo).
• Part-HTM is the first hybrid transactional memory protocol that solves the problem of transactions aborted due to the resource limitations (space/time) of current best-effort HTM. The basic idea of Part-HTM is to partition those transactions into multiple sub-transactions, which can likely be committed in hardware. Due to the eager nature of HTM, we designed a low-overhead software framework to preserve the transaction's correctness (with and without opacity) and isolation. 
Part-HTM is efficient: our evaluation study confirms that its performance is the best in all tested cases, except for those where HTM cannot be outperformed. However, in such workloads, Part-HTM still performs better than all other software and hybrid competitors.
• Octonauts tackles the livelock problem of HTM at high contention levels. HTM lacks advanced contention management (CM) policies. Octonauts is an HTM-aware scheduler that orchestrates conflicting transactions. It uses a priori knowledge of transactions' working sets to prevent conflicting transactions from being activated simultaneously. Octonauts also accommodates both HTM and STM with minimal overhead by exploiting adaptivity. Based on the transaction's size, duration, and irrevocable calls (e.g., system calls), Octonauts selects the best path among HTM, STM, or global locking. Results show a performance improvement of up to 60% when Octonauts is deployed, compared with pure HTM falling back to global locking.
• Precise-TM is a unique approach to the granularity problem of the software fallback path of best-effort HTM. It provides an efficient and precise technique for HTM-STM communication such that HTM does not suffer interference from concurrent STM transactions. In addition, the added overhead is marginal in terms of space or execution time. Precise-TM uses address-embedded locks (pointer bit-stealing) for precise communication between STM and HTM. Results show that our precise fine-grained locking pays off as it allows more concurrency between hardware and software transactions. Specifically, it gains up to 5x over the default HTM implementation with a single global lock as the fallback path.
• Nemo is a new STM algorithm that ensures high and scalable performance when an application workload with a data locality property is deployed. 
Existing STM algorithms rely on centralized shared meta-data (e.g., a global timestamp) to synchronize concurrent accesses, but in such a workload, this scheme may hamper scalable performance given the high latency that NUMA architectures introduce for updating that centralized meta-data. Nemo overcomes these limitations by allowing only those transactions that actually conflict with each other to perform inter-socket communication. As a result, if two transactions are non-conflicting, they cannot interact with each other through any meta-data. Such a policy does not apply to application threads running in the same socket. In fact, they are allowed to share any meta-data even if they execute non-conflicting operations because, as our evaluation study shows, the local processing happening inside one socket does not interfere with the work done by parallel threads executing on other sockets. Nemo's evaluation study shows improvement over state-of-the-art TM algorithms by as much as 65%.
Ph. D.
Transaction Memory Hardware Transaction Memory (HTM) Best-efforts HTM Transactions Partitioning Transactions Scheduling NUMA NUMA Optimization NUMA-aware STM Fine-grained Fallback
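The address-embedded locks (pointer bit-stealing) mentioned for Precise-TM in the abstract above can be illustrated with a minimal sketch. The abstract does not give the actual encoding, so everything here is an assumption: addresses are taken to be 8-byte aligned, bit 0 is chosen as the lock flag, and the helper names (`is_locked`, `address_of`, `try_lock`, `unlock`) are hypothetical. The sketch is single-threaded; a real implementation would set the bit with an atomic compare-and-swap.

```python
from typing import Tuple

# Assumption: addresses are 8-byte aligned, so their low 3 bits are always zero
# and can carry metadata without losing any address information.
ALIGNMENT = 8
# Assumption: bit 0 is used as the "locked by a software transaction" flag.
LOCK_BIT = 0x1


def is_locked(word: int) -> bool:
    """Return True if the stolen low bit marks this address word as locked."""
    return (word & LOCK_BIT) != 0


def address_of(word: int) -> int:
    """Strip the stolen bit to recover the real (aligned) address."""
    return word & ~LOCK_BIT


def try_lock(word: int) -> Tuple[int, bool]:
    """Return (new_word, acquired). Fails without change if already locked."""
    if is_locked(word):
        return word, False
    return word | LOCK_BIT, True


def unlock(word: int) -> int:
    """Clear the lock flag, leaving the address bits untouched."""
    return word & ~LOCK_BIT
```

Because the flag lives in the same word as the address, checking it costs no extra memory location, which is consistent with the abstract's claim of marginal space overhead; how Precise-TM actually lets hardware transactions observe the flag is not described in the abstract.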
2	On the Fault-tolerance and High Performance of Replicated Transactional Systems
Hirve, Sachin (28 September 2015)
With the technological developments of the last few decades, there has been a notable shift in the way business and consumer transactions are conducted. These transactions are usually triggered over the internet, and transactional systems working in the background ensure that they are processed. The majority of these transactions nowadays fall into the Online Transaction Processing (OLTP) category, where low latency is a preferred characteristic. In addition to low latency, OLTP transaction systems also require high service continuity and dependability. Replication is a common technique that makes services dependable and therefore helps in providing reliability, availability and fault-tolerance. Deferred Update Replication (DUR) and Deferred Execution Replication (DER) are the two well-known transaction execution models for replicated transactional systems. Under DUR, a transaction is executed locally at one node before a global certification is invoked to resolve conflicts against other transactions running on remote nodes. DER, on the other hand, postpones transaction execution until agreement on a common order of transaction requests is reached. Both DUR and DER require a distributed ordering layer, which ensures a total order of transactions even in case of faults. In today's distributed transactional systems, performance is of paramount importance. Any loss in performance, e.g., increased latency due to slow processing of client requests, may entail loss of revenue for businesses. On one hand, the DUR model is a good candidate for transaction processing in those systems when conflicts among transactions are rare, while it can be detrimental for high-conflict workload profiles. 
On the other hand, the DER model is an attractive choice because its behavior is independent of the characteristics of the workload, but trivial realizations of the model ultimately do not offer a good margin for performance improvement. Indeed, transactions are executed sequentially and the total order layer can be a serious bottleneck for latency and scalability. This dissertation proposes novel solutions and system optimizations to enhance the overall performance of replicated transactional systems. The first presented result is HiperTM, a DER-based transaction replication solution that alleviates the costs of the total order layer via speculative execution techniques. HiperTM exploits the time between the broadcast of a client request and the finalization of the order for that request to speculatively execute the request, so as to overlap replica coordination with transaction execution. HiperTM proposes two main components: OS-Paxos, a novel total order layer that delivers requests early and optimistically according to a tentative order, which is then either confirmed or rejected by a final total order; and SCC, a lightweight speculative concurrency control protocol that exploits the optimistic delivery of OS-Paxos to execute transactions in a speculative fashion. SCC still processes write transactions serially in order to minimize code instrumentation overheads, but it is able to parallelize the execution of read-only transactions thanks to its built-in object multiversion scheme. The second contribution of this dissertation is X-DUR, a novel transaction replication system that addresses the high cost of local and remote aborts under high contention on shared objects in DUR-based approaches, which adversely affects performance. 
Exploiting knowledge of the client's transaction locality, X-DUR incorporates the benefits of the state machine approach to scale up the distributed performance of DUR systems. As the third contribution, this dissertation proposes Archie, a DER-based replicated transactional system that improves on HiperTM in two aspects. First, Archie includes a highly optimized total order layer that combines optimistic delivery and batching, thus allowing a large amount of work to be anticipated before the total order is finalized. Second, the concurrency control is able to process transactions speculatively and with a higher degree of parallelism, although the order of the speculative commits still follows the order defined by the optimistic delivery. Both HiperTM and Archie perform well up to a certain number of nodes in the system, beyond which their performance is limited by the single-leader-based total order layer. This motivates the design of Caesar, the fourth contribution of this dissertation, which is a transactional system based on a novel multi-leader partial order protocol. Caesar enforces a partial order on the execution of transactions according to their conflicts, letting non-conflicting transactions proceed in parallel without enforcing any synchronization during execution (e.g., no locks). As the last contribution, this dissertation presents Dexter, a replication framework that exploits the commonly observed phenomenon that not all read-only workloads require up-to-date data. It harnesses the application-specific freshness and content-based constraints of read-only transactions to achieve high scalability. Dexter services read-only requests according to the freshness guarantees specified by the application and routes the read-only workload accordingly in the system to achieve high performance and low latency. 
As a result, the Dexter framework also alleviates the interference between read-only and read-write requests, thereby helping to improve the performance of read-write request execution as well.
Ph. D.
Distributed Transaction Memory Fault-tolerance Active Replication Distributed Systems On-line Transaction Processing
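The DUR model described in the abstract above (execute locally, then certify against concurrent transactions) can be sketched as follows. This is a generic illustration of version-based certification under stated assumptions, not the protocol of X-DUR or any other system named in the abstract: a single `Replica` object stands in for the deterministic certification step that every replica would apply in total order, and the names `Replica`, `Transaction`, and `certify_and_commit` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Transaction:
    # Version of each key at the time it was first read (the read set).
    read_versions: Dict[str, int] = field(default_factory=dict)
    # Buffered updates, applied only if certification succeeds (deferred update).
    writes: Dict[str, int] = field(default_factory=dict)


class Replica:
    """Hypothetical stand-in for one replica's store plus its certifier."""

    def __init__(self) -> None:
        self.values: Dict[str, int] = {}
        self.versions: Dict[str, int] = {}

    def begin(self) -> Transaction:
        return Transaction()

    def read(self, tx: Transaction, key: str) -> int:
        # A transaction sees its own buffered writes first.
        if key in tx.writes:
            return tx.writes[key]
        tx.read_versions.setdefault(key, self.versions.get(key, 0))
        return self.values.get(key, 0)

    def write(self, tx: Transaction, key: str, value: int) -> None:
        tx.writes[key] = value

    def certify_and_commit(self, tx: Transaction) -> bool:
        # Certification: abort if any key read has since been overwritten.
        for key, seen in tx.read_versions.items():
            if self.versions.get(key, 0) != seen:
                return False
        # Commit: apply the deferred updates and bump the versions.
        for key, value in tx.writes.items():
            self.values[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1
        return True
```

Two transactions that both read and update the same key illustrate the abstract's point about high-conflict workloads: the first to be certified commits, and the second aborts and would have to be re-executed.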
3	Extracting Parallelism from Legacy Sequential Code Using Transactional Memory
Saad Ibrahim, Mohamed Mohamed (26 July 2016)
Increasing the number of processors has become the mainstream approach in modern chip design. However, most applications are designed or written for single-core processors, so they do not benefit from the numerous underlying computation resources. Moreover, there exists a large base of legacy software that would require immense effort and cost to rewrite and re-engineer for parallelism. In the past decades, there has been a growing interest in automatic parallelization. The aim is to relieve programmers from the painful and error-prone manual parallelization process, and to cope with the new architectural trend of multi-core and many-core CPUs. Automatic parallelization techniques vary in properties such as: the level of parallelism (e.g., instructions, loops, traces, tasks); the need for custom hardware support; using optimistic execution or relying on conservative decisions; online, offline or both; and the level of source code exposure. Transactional Memory (TM) has emerged as a powerful concurrency control abstraction. TM simplifies parallel programming to the level of coarse-grained locking while achieving fine-grained locking performance. This dissertation exploits TM as an optimistic execution approach for transforming a sequential application into a parallel one. The design and implementation of two frameworks that support automatic parallelization, Lerna and HydraVM, are proposed, along with a number of algorithmic optimizations to make the parallelization effective. HydraVM is a virtual machine that automatically extracts parallelism from legacy sequential code (at the bytecode level) through a set of techniques including code profiling, data dependency analysis, and execution analysis. HydraVM is built by extending the Jikes RVM and modifying its baseline compiler. 
Correctness of the program is preserved by exploiting Software Transactional Memory (STM) to manage concurrent and out-of-order memory accesses. Our experiments show that HydraVM achieves speedups between 2× and 5× on a set of benchmark applications. Lerna is a compiler framework that automatically and transparently detects and extracts parallelism from sequential code through a set of techniques including code profiling, instrumentation, and adaptive execution. Lerna is cross-platform and independent of the programming language. The parallel execution exploits memory transactions to manage concurrent and out-of-order memory accesses. This scheme makes Lerna very effective for sequential applications with data sharing. This thesis introduces the general conditions for embedding any transactional memory algorithm into Lerna. In addition, the ordered versions of four state-of-the-art algorithms have been integrated and evaluated using multiple benchmarks including RSTM micro benchmarks, STAMP and PARSEC. Lerna showed an average speedup of 2.7× (and up to 18×) over the original (sequential) code. While prior research shows that transactions must commit in order to preserve program semantics, enforcing this ordering imposes scalability constraints at large numbers of cores. In this dissertation, we eliminate the need to commit transactions sequentially without affecting program consistency. This is achieved by building a cooperation mechanism in which transactions can forward some changes safely. This approach eliminates some of the false conflicts and increases the concurrency level of the parallel application. This thesis proposes a set of commit-order algorithms that follow the aforementioned approach. Interestingly, using the proposed commit-order algorithms, the peak gain over the sequential non-instrumented execution is 10× in RSTM micro benchmarks and 16.5× in STAMP. 
Another main contribution is to enhance the concurrency and performance of TM in general, and its usage for parallelization in particular, by extending TM primitives. The extended TM primitives extract the embedded low-level application semantics without affecting the TM abstraction. Furthermore, as the proposed extensions capture common code patterns, they can be handled automatically through the compilation process. In this work, that was done by modifying the GCC compiler to support our TM extensions. Results showed speedups of up to 4× on different applications including micro benchmarks and STAMP. Our final contribution is supporting the commit order through Hardware Transactional Memory (HTM). The HTM contention manager cannot be modified because it is implemented inside the hardware. Given this constraint, we exploit HTM to reduce the transactional execution overhead by proposing two novel commit-order algorithms and a hybrid reduced-hardware algorithm. The use of HTM improves performance by up to 20%.
Ph. D.
Transaction Memory Automatic Parallelization Low-Level Virtual Machine Optimistic Concurrency Speculative Execution Legacy Systems Age Commitment Order Low-Level TM Semantics TM Friendly Semantics
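The baseline that the abstract above starts from, speculative execution of loop iterations as transactions that commit in program (age) order, can be sketched as follows. This illustrates only that in-order baseline, not the cooperative commit-order algorithms the dissertation proposes, and it does not reflect how Lerna or HydraVM are implemented. The speculative phase runs sequentially against a shared snapshot to stand in for parallel execution, and the names `run_speculatively` and `parallelize` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Memory = Dict[str, int]
# An iteration body reads via read(key) and buffers writes via write(key, value).
Body = Callable[[Callable[[str], int], Callable[[str, int], None]], None]


def run_speculatively(body: Body, snapshot: Memory) -> Tuple[Set[str], Memory]:
    """Run one iteration against `snapshot`, recording its read and write sets."""
    reads: Set[str] = set()
    writes: Memory = {}

    def read(key: str) -> int:
        if key in writes:          # read-your-own-writes
            return writes[key]
        reads.add(key)
        return snapshot.get(key, 0)

    def write(key: str, value: int) -> None:
        writes[key] = value        # buffered until commit

    body(read, write)
    return reads, writes


def parallelize(bodies: List[Body], memory: Memory) -> int:
    """Commit iterations in program order; return the number of re-executions."""
    snapshot = dict(memory)
    # Stand-in for parallel execution: every iteration sees the same snapshot.
    speculative = [run_speculatively(b, snapshot) for b in bodies]
    dirty: Set[str] = set()        # keys written by already-committed iterations
    retries = 0
    for body, (reads, writes) in zip(bodies, speculative):
        if reads & dirty:
            # Stale read: re-run now that all older iterations have committed.
            reads, writes = run_speculatively(body, memory)
            retries += 1
        memory.update(writes)
        dirty |= set(writes)
    return retries
```

In this sketch, an iteration is re-executed only when it read a key that an older iteration wrote, so the final memory matches sequential execution; the serial commit loop is the ordering constraint that the abstract says limits scalability at large core counts.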
