41 |
Advances Towards Data-Race-Free Cache Coherence Through Data ClassificationDavari, Mahdad January 2017 (has links)
Providing a consistent view of the shared memory based on precise and well-defined semantics—memory consistency model—has been an enabling factor in the widespread acceptance and commercial success of shared-memory architectures. Moreover, cache coherence protocols have been employed by the hardware to remove from the programmers the burden of dealing with the memory inconsistency that emerges in the presence of the private caches. The principle behind all such cache coherence protocols is to guarantee that consistent values are read from the private caches at all times. In its most stringent form, a cache coherence protocol eagerly enforces two invariants before each data modification: i) no other core has a copy of the data in its private caches, and ii) all other cores know where to receive the consistent data should they need the data later. Nevertheless, by partly transferring the responsibility for maintaining those invariants to the programmers, commercial multicores have adopted weaker memory consistency models, namely the Total Store Order (TSO), in order to optimize the performance for more common cases. Moreover, memory models with more relaxed invariants have been proposed based on the observation that more and more software is written in compliance with the Data-Race-Free (DRF) semantics. The semantics of DRF software can be leveraged by the hardware to infer when data in the private caches might be inconsistent. As a result, hardware ignores the inconsistent data and retrieves the consistent data from the shared memory. DRF semantics therefore removes from the hardware the burden of eagerly enforcing the strong consistency invariants before each data modification. Instead, consistency is guaranteed only when needed. This results in manifold optimizations, such as reducing the energy consumption and improving the performance and scalability. The efficiency of detecting and discarding the inconsistent data is an important factor affecting the efficiency of such coherence protocols. For instance, discarding the consistent data does not affect the correctness, but results in performance loss and increased energy consumption. In this thesis we show how data classification can be leveraged as an effective tool to simplify the cache coherence based on the DRF semantics. In particular, we introduce simple but efficient hardware-based private/shared data classification techniques that can be used to efficiently detect the inconsistent data, thus enabling low-overhead and scalable cache coherence solutions based on the DRF semantics.
|
42 |
Code profiling and optimization in transactional memory systems / Profiling e otimização de código em sistemas de memória transacionalCordeiro, Silvio Ricardo January 2014 (has links)
Memória Transacional tem se demonstrado um paradigma promissor na implementação de aplicações concorrentes sob memória compartilhada que busquem evitar um modelo de sincronização baseado em locks. Em vez de sujeitar a execução a um acesso exclusivo com base no valor de um lock que é compartilhado por threads concorrentes, uma aplicação sob Memória Transacional tenta executar seções críticas de modo otimista, desfazendo as modificações no caso de um conflito de acesso à memória. Entretanto, apesar de a abordagem baseada em locks ter adquirido um número significativo de ferramentas automatizadas para a depuração, profiling e otimização automatizados (por ser uma das técnicas de sincronização mais antigas e mais bem pesquisadas), o campo da Memória Transacional ainda é comparativamente recente, e programadores frequentemente precisam adaptar manualmente suas aplicações transacionais ao encontrar problemas de eficiência. Este trabalho propõe um sistema no qual o profiling de código em uma implementação de Memória Transacional simulada é utilizado para caracterizar uma aplicação transacional, formando a base para uma parametrização automatizada do respectivo sistema especulativo para uma execução eficiente do código em questão. Também é proposta uma abordagem de escalonamento de threads guiado por profiling em uma implementação de Memória Transacional baseada em software, usando dados coletados pelo profiler para prever a probabilidade de conflitos e determinar que thread escalonar com base nesta previsão. São apresentados os resultados de experimentos sob ambas as abordagens. / Transactional Memory has shown itself to be a promising paradigm for the implementation of shared-memory concurrent applications that eschew a lock-based model of data synchronization. Rather than conditioning exclusive access on the value of a lock that is shared across concurrent threads, Transactional Memory attempts to execute critical sections optimistically, rolling back the modifications in the event of a data access conflict. However, while the lock-based approach has acquired a significant body of debugging, profiling and automated optimization tools (as one of the oldest and most researched synchronization techniques), the field of Transactional Memory is still comparably recent, and programmers are usually tasked with an unguided manual tuning of their transactional applications when facing efficiency problems. We propose a system in which code profiling in a simulated hardware implementation of Transactional Memory is used to characterize a transactional application, which forms the basis for the automated tuning of the underlying speculative system for the efficient execution of that particular application. We also propose a profile-guided approach to the scheduling of threads in a software-based implementation of Transactional Memory, using collected data to predict the likelihood of conflicts and determine what thread to schedule based on this prediction. We present the results achieved under both designs.
|
43 |
Implementation of Data Parallel Primitives on MIMD Shared Memory SystemsMortensen, Christian January 2019 (has links)
This thesis presents an implementation of a multi-threaded C library for performing data parallel computations on MIMD shared memory systems, with support for user defined operators and one-dimensional sparse arrays. Multi-threaded parallel execution was achieved by the use of the POSIX threads, and the library exposes several functions for performing data parallel computations directly on arrays. The implemented functions were based on a set of primitives that many data parallel programming languages have in common. The individual scalability of the primitives varied greatly, with most of them only gaining a significant speedup when executed on two cores followed by a significant drop-off in speedup as more cores were added. An exception to this was the reduction primitive however, which managed to achieve near optimal speedup in most tests. The library proved unviable for expressing algorithms requiring more then one or two primitives in sequence due to the overhead that each of them cause.
|
44 |
On-the-fly Race Detection for Programs with Recursive Spawn-Sync ParallelismHe, Yuxiong, Wang, Junqing 01 1900 (has links)
Detecting data race is very important for debugging shared-memory parallel programs, because data races result in unintended nondeterministic execution of the program. We propose a dynamic on-the-fly race detection mechanism called Parallel Nondeterminator to check for determinacy races during the parallel execution of a program with recursive spawn-sync parallelism. A modified version of Nested Region Labeling scheme is developed for the concurrency relationship test in the spawn-sync parallel structure. Through the identification of Least Common Ancestor in the spawn tree, the Parallel Nondeterminator only needs to keep two read access records and one write access record for each shared location. The work and critical path in the instrumented codes are analyzed as well as time complexity and space requirements. Let N denote the maximum depth of the recursion in the parallel program. The worst case time increased for each spawn and sync operation is O(N) and the time required to monitor any shared memory location is O(lgN). Moreover, Parallel Nondeterminator is able to execute the race detection code without loss of parallelism of the original program. In summary, the Parallel Non-determinator represents a provably efficient strategy for detecting data races for shared-memory parallel programs. / Singapore-MIT Alliance (SMA)
|
45 |
Architecture Support and Scalability Analysis of Memory Consistency Models in Network-on-Chip based SystemsNaeem, Abdul January 2013 (has links)
The shared memory systems should support parallelization at the computation (multi-core), communication (Network-on-Chip, NoC) and memory architecture levels to exploit the potential performance benefits. These parallel systems supporting shared memory abstraction both in the general purpose and application specific domains are confronting the critical issue of memory consistency. The memory consistency issue arises due to the unconstrained memory operations which leads to the unexpected behavior of shared memory systems. The memory consistency models enforce ordering constraints on the memory operations for the expected behavior of the shared memory systems. The intuitive Sequential Consistency (SC) model enforces strict ordering constraints on the memory operations and does not take advantage of the system optimizations both in the hardware and software. Alternatively, the relaxed memory consistency models relax the ordering constraints on the memory operations and exploit these optimizations to enhance the system performance at the reasonable cost. The purpose of this thesis is twofold. First, the novel architecture supports are provided for the different memory consistency models like: SC, Total Store Ordering (TSO), Partial Store Ordering (PSO), Weak Consistency (WC), Release Consistency (RC) and Protected Release Consistency (PRC) in the NoC-based multi-core (McNoC) systems. The PRC model is proposed as an extension of the RC model which provides additional reordering and relaxation in the memory operations. Second, the scalability analysis of these memory consistency models is performed in the McNoC systems. The architecture supports for these different memory consistency models are provided in the McNoC platforms. Each configurable McNoC platform uses a packet-switched 2-D mesh NoC with deflection routing policy, distributed shared memory (DSM), distributed locks and customized processor interface. The memory consistency models/protocols are implemented in the customized processor interfaces which are developed to integrate the processors with the rest of the system. The realization schemes for the memory consistency models are based on a transaction counter and an an an address ddress ddress ddress ddress ddress ddress stack tacktack-based based based based based based novel approaches.approaches.approaches.approaches. approaches.approaches.approaches.approaches.approaches.approaches. The transaction counter is used in each node of the network to keep track of the outstanding memory operations issued by a processor in the system. The address stack is used in each node of the network to keep track of the addresses of the outstanding memory operations issued by a processor in the system. These hardware structures are used in the processor interface to enforce the required global orders under these different memory consistency models. The realization scheme of the PRC model in addition also uses acquire counter for further classification of the data operations as unprotected and protected operations. The scalability analysis of these different memory consistency models is performed on the basis of different workloads which are developed and mapped on the various sized networks. The scalability study is conducted in the McNoC systems with 1 to 64-cores with various applications using different problem sizes and traffic patterns. The performance metrics like execution time, performance, speedup, overhead and efficiency are evaluated as a function of the network size. The experiments are conducted both with the synthetic and application workloads. The experimental results under different application workloads show that the average execution time under the relaxed memory consistency models decreases relative to the SC model. The specific numbers are highly sensitive to the application and depend on how well it matches to the architectures. This study shows the performance improvement under the relaxed memory consistency models over the SC model that is dependent on the computation-to-communication ratio, traffic patterns, data-to-synchronization ratio and the problem size. The performance improvement of the PRC and RC models over the SC model tends to be higher than 50% as observed in the experiments, when the system is further scaled up. / <p>QC 20130204</p>
|
46 |
Local-spin Abortable Mutual ExclusionLee, Hyonho 10 January 2012 (has links)
Abortable mutual exclusion is a variant of mutual exclusion, in which processes are allowed to abort in the trying protocol.
Scott presented the first local-spin abortable mutual exclusion algorithms.
They are based on a queue and perform O(1) remote memory accesses (RMAs) when no processes abort.
However, they use unbounded space and a process can perform an unbounded number of RMAs, in the worst case, to enter the critical section.
The only other local-spin abortable mutual exclusion algorithm is by Jayanti.
It has bounded space and RMA complexity, but a process always performs \Theta(log N) RMAs to enter the critical section, where N is the number of processes.
In this thesis, three new, bounded space, abortable mutual exclusion algorithms are presented.
We give the first local-spin abortable mutual exclusion algorithm that uses only registers.
It has \Theta(log N) RMA complexity, even if no processes abort.
We also present the first local-spin abortable mutual exclusion algorithm using more general primitives which has bounded space and in which each process performs O(1) RMAs to enter the critical section when no processes abort.
However, this algorithm is local-spin only for a certain type of cache-coherent model.
Finally, we present an abortable mutual exclusion algorithm which is local-spin in any cache-coherent model and in which each process performs O(1) RMAs to enter the critical section when no processes abort.
We develop a new reference counting method to bound the space used in this algorithm.
|
47 |
Local-spin Abortable Mutual ExclusionLee, Hyonho 10 January 2012 (has links)
Abortable mutual exclusion is a variant of mutual exclusion, in which processes are allowed to abort in the trying protocol.
Scott presented the first local-spin abortable mutual exclusion algorithms.
They are based on a queue and perform O(1) remote memory accesses (RMAs) when no processes abort.
However, they use unbounded space and a process can perform an unbounded number of RMAs, in the worst case, to enter the critical section.
The only other local-spin abortable mutual exclusion algorithm is by Jayanti.
It has bounded space and RMA complexity, but a process always performs \Theta(log N) RMAs to enter the critical section, where N is the number of processes.
In this thesis, three new, bounded space, abortable mutual exclusion algorithms are presented.
We give the first local-spin abortable mutual exclusion algorithm that uses only registers.
It has \Theta(log N) RMA complexity, even if no processes abort.
We also present the first local-spin abortable mutual exclusion algorithm using more general primitives which has bounded space and in which each process performs O(1) RMAs to enter the critical section when no processes abort.
However, this algorithm is local-spin only for a certain type of cache-coherent model.
Finally, we present an abortable mutual exclusion algorithm which is local-spin in any cache-coherent model and in which each process performs O(1) RMAs to enter the critical section when no processes abort.
We develop a new reference counting method to bound the space used in this algorithm.
|
48 |
Software Transactional Memory Building BlocksRiegel, Torvald 15 August 2013 (has links) (PDF)
Exploiting thread-level parallelism has become a part of mainstream programming in recent years. Many approaches to parallelization require threads executing in parallel to also synchronize occassionally (i.e., coordinate concurrent accesses to shared state). Transactional Memory (TM) is a programming abstraction that provides the concept of database transactions in the context of programming languages such as C/C++. This allows programmers to only declare which pieces of a program synchronize without requiring them to actually implement synchronization and tune its performance, which in turn makes TM typically easier to use than other abstractions such as locks.
I have investigated and implemented the building blocks that are required for a high-performance, practical, and realistic TM. They host several novel algorithms and optimizations for TM implementations, both for current hardware and future hardware extensions for TM, and are being used in or have influenced commercial TM implementations such as the TM support in GCC.
|
49 |
Efficient openMP over sequentially consistent distributed shared memory systemsCosta Prats, Juan José 20 July 2011 (has links)
Nowadays clusters are one of the most used platforms in High Performance Computing and most programmers use the Message Passing Interface (MPI) library to program their applications in these distributed platforms getting their maximum performance, although it is a complex task. On the other side, OpenMP has been established as the de facto standard to program applications on shared memory platforms because it is easy to use and obtains good performance without too much effort.
So, could it be possible to join both worlds? Could programmers use the easiness of OpenMP in distributed platforms? A lot of researchers think so. And one of the developed ideas is the distributed shared memory (DSM), a software layer on top of a distributed platform giving an abstract shared memory view to the applications. Even though it seems a good solution it also has some inconveniences. The memory coherence between the nodes in the platform is difficult to maintain (complex management, scalability issues, high overhead and others) and the latency of the remote-memory accesses which can be orders of magnitude greater than on a shared bus due to the interconnection network.
Therefore this research improves the performance of OpenMP applications being executed on distributed memory platforms using a DSM with sequential consistency evaluating thoroughly the results from the NAS parallel benchmarks.
The vast majority of designed DSMs use a relaxed consistency model because it avoids some major problems in the area. In contrast, we use a sequential consistency model because we think that showing these potential problems that otherwise are hidden may allow the finding of some solutions and, therefore, apply them to both models.
The main idea behind this work is that both runtimes, the OpenMP and the DSM layer, should cooperate to achieve good performance, otherwise they interfere one each other trashing the final performance of applications.
We develop three different contributions to improve the performance of these applications: (a) a technique to avoid false sharing at runtime, (b) a technique to mimic the MPI behaviour, where produced data is forwarded to their consumers and, finally, (c) a mechanism to avoid the network congestion due to the DSM coherence messages. The NAS Parallel Benchmarks are used to test the contributions.
The results of this work shows that the false-sharing problem is a relative problem depending on each application. Another result is the importance to move the data flow outside of the critical path and to use techniques that forwards data as early as possible, similar to MPI, benefits the final application performance.
Additionally, this data movement is usually concentrated at single points and affects the application performance due to the limited bandwidth of the network. Therefore it is necessary to provide mechanisms that allows the distribution of this data through the computation time using an otherwise idle network.
Finally, results shows that the proposed contributions improve the performance of OpenMP applications on this kind of environments.
|
50 |
CDPthread: A POSIX-Thread Based Distributed Computing EnvironmentTseng, Guo-Fu 28 July 2009 (has links)
Due to the limitation of single machine¡¦s computing power, and the aspect of cost, distributed design is getting more and more popular nowadays. The Distributed Shared Memory (DSM) system is one of the most hot topics in this area. Most people are dedicated on designing a library or even a new language, in order to gain higher performance on DSM systems. As a consequence, the programmers are required to learn a new library or language. Even more, they have to handle synchronizations for the distributed environment. In this paper, we propose a design that is compatible with POSIX-Thread Environment. The distributed nature of the system described herein is totally transparent to the programmers.
|
Page generated in 0.0363 seconds