1

Ensuring performance and correctness for legacy parallel programs

McPherson, Andrew John January 2015 (has links)
Modern computers are based on manycore architectures, with multiple processors on a single silicon chip. In this environment, programmers are required to make use of parallelism to fully exploit the available cores. This can either be within a single chip, normally using shared-memory programming, or at a larger scale on a cluster of chips, normally using message-passing. Legacy programs written using either paradigm face issues when run on modern manycore architectures. In message-passing the problem is performance-related, with clusters based on manycores introducing necessarily tiered topologies that unaware programs may not fully exploit. In shared-memory it is a correctness problem, with modern systems employing more relaxed memory consistency models, on which legacy programs were not designed to operate. Solutions to this correctness problem exist, but introduce a performance problem as they are necessarily conservative. This thesis focuses on addressing these problems, largely through compile-time analysis and transformation. The first technique proposed is a method for statically determining the communication graph of an MPI program. This is then used to optimise process placement in a cluster of CMPs. Using the 64-process versions of the NAS parallel benchmarks, we see an average of 28% (7%) improvement in communication localisation over by-rank scheduling for 8-core (12-core) CMP-based clusters, representing the maximum possible improvement. Secondly, we move into the shared-memory paradigm, identifying and proving necessary conditions for a read to be an acquire. This can be used to improve solutions in several application areas, two of which we then explore. We apply our acquire signatures to the problem of fence placement for legacy well-synchronised programs. We find that, by applying our signatures, we can reduce the number of fences placed by an average of 62%, leading to a speedup of up to 2.64x over an existing practical technique. Finally, we develop a dynamic synchronisation detection tool known as SyncDetect. This proof-of-concept tool leverages our acquire signatures to more accurately detect ad hoc synchronisations in running programs and provides the programmer with a report of their locations in the source code. The tool aims to assist programmers with the notoriously difficult problem of parallel debugging and in manually porting legacy programs to more modern (relaxed) memory consistency models.
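The notion of a read acting as an acquire is easiest to see on a small flag-based example. The sketch below is not taken from the thesis; it is a minimal C11 illustration (assuming a POSIX threads environment) of the kind of ad hoc synchronisation that acquire signatures, fence placement and SyncDetect are concerned with: the consumer's spinning read must behave as an acquire, otherwise a relaxed memory model allows the payload read to be reordered before it.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int payload;                       /* ordinary (non-atomic) shared data   */
atomic_int ready = 0;              /* ad hoc synchronisation flag         */

void *producer(void *arg) {
    (void)arg;
    payload = 42;                                    /* write the data    */
    atomic_store_explicit(&ready, 1,
                          memory_order_release);     /* release the flag  */
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    /* This spinning read is the "acquire": the payload read below must
       not be reordered before it.  On a relaxed memory model this needs
       acquire semantics (or a fence); on SC-like models it is implicit. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                            /* spin              */
    printf("%d\n", payload);                         /* always prints 42  */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```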
2

Memory consistency directed cache coherence protocols for scalable multiprocessors

Elver, Marco Iskender January 2016 (has links)
The memory consistency model, which formally specifies the behavior of the memory system, is used by programmers to reason about parallel programs. From a hardware design perspective, weaker consistency models permit various optimizations in a multiprocessor system: this thesis focuses on designing and optimizing the cache coherence protocol for a given target memory consistency model. Traditional directory coherence protocols are designed to be compatible with the strictest memory consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that provide more relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually, this comes at the cost of scalability in terms of per-core storage due to sharer tracking, which poses a problem with the increasing number of cores in today’s CMPs, most of which are no longer sequentially consistent. The recent convergence towards programming-language-based relaxed memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. As a result, such protocols do not require sharer tracking, which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86) cannot readily make use of lazy coherence protocols without either adapting the protocol to satisfy the stricter consistency model, or changing the architecture’s consistency model to (a variant of) RC, typically at the expense of backward compatibility. The first part of this thesis explores both these options, with a focus on a practical approach satisfying backward compatibility. Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and the existing parallel programs written for these architectures, we first propose TSO-CC, a lazy cache coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers and instead relies on self-invalidation and detection of potential acquires (in the absence of explicit synchronization) using per-cache-line timestamps to efficiently and lazily satisfy the TSO memory consistency model. Our results show that TSO-CC achieves, on average, performance comparable to a MESI directory protocol, while TSO-CC’s storage overhead per cache line scales logarithmically with increasing core count. Next, we propose an approach for the x86-64 architecture, which is a compromise between retaining the original consistency model and using a more storage-efficient lazy coherence protocol. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backward compatibility with legacy codes and older microarchitectures. Second, we propose RC3 (based on TSO-CC), a scalable cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC-optimized programs.
RC3’s storage overhead per cache line scales logarithmically with increasing core count and reduces on-chip coherence storage overheads by 45% compared to TSO-CC. Finally, it is imperative that hardware adheres to the promised memory consistency model. Indeed, consistency-directed coherence protocols cannot be verified against conventional coherence definitions (e.g. SWMR), and few existing verification methodologies apply. Furthermore, as the full consistency model is used as a specification, their interaction with other components (e.g. the pipeline) of a system must not be neglected in the verification process. Therefore, verifying a system with such protocols in the context of interacting components is even more important than before. One common way to do this is via executing tests, where specific threads of instruction sequences are generated and their executions are checked for adherence to the consistency model. It would be extremely beneficial to execute such tests under simulation, i.e. when the functional design implementation of the hardware is being prototyped. Most prior verification methodologies, however, target post-silicon environments, which when used for simulation-based memory consistency verification would be too slow. We propose McVerSi, a test generation framework for fast memory consistency verification of a full-system design implementation under simulation. Our primary contribution is a Genetic Programming (GP) based approach to memory consistency test generation, which relies on a novel crossover function that prioritizes memory operations contributing to non-determinism, thereby increasing the probability of uncovering memory consistency bugs. To guide tests towards exercising as much logic as possible, the simulator’s reported coverage is used as the fitness function. Furthermore, we increase test throughput by making the test workload simulation-aware. We evaluate our proposed framework using the Gem5 cycle-accurate simulator in full-system mode with Ruby (with configurations that use Gem5’s MESI protocol, and our proposed TSO-CC together with an out-of-order pipeline). We discover 2 new bugs in the MESI protocol due to the faulty interaction of the pipeline and the cache coherence protocol, highlighting that even conventional protocols should be verified rigorously in the context of a full system. Crucially, these bugs would not have been discovered through individual verification of the pipeline or the coherence protocol. We study 11 bugs in total. Our GP-based test generation approach finds all bugs consistently, therefore providing much higher guarantees compared to alternative approaches (pseudo-random test generation and litmus tests).
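Test-based consistency verification of the kind McVerSi automates checks executions of small multi-threaded programs against the outcomes the consistency model permits. The following is a textbook store-buffering litmus test written with C11 relaxed atomics, not one of McVerSi's generated tests: under SC the final outcome r0 == 0 && r1 == 0 is forbidden, while TSO allows it because each store can still sit in its core's store buffer when the other thread's load executes.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x = 0, y = 0;   /* shared locations, initially zero */
int r0, r1;                /* per-thread observations          */

void *t0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* r0 == 0 && r1 == 0 indicates a non-SC (store-buffered) execution. */
    printf("r0=%d r1=%d\n", r0, r1);
    return 0;
}
```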
3

Samhita: Virtual Shared Memory for Non-Cache-Coherent Systems

Ramesh, Bharath 05 August 2013 (has links)
Among the key challenges of computing today are the emergence of many-core architectures and the resulting need to effectively exploit explicit parallelism. Indeed, programmers are striving to exploit parallelism across virtually all platforms and application domains. The shared memory programming model effectively addresses the parallelism needs of mainstream computing (e.g., portable devices, laptops, desktops, servers), giving rise to a growing ecosystem of shared memory parallel techniques, tools, and design practices. However, to meet the extreme demands for processing and memory of critical problem domains, including scientific computation and data-intensive computing, computing researchers continue to innovate in the high-end distributed memory architecture space to create cost-effective and scalable solutions. The emerging distributed memory architectures are both highly parallel and increasingly heterogeneous. As a result, they do not present the programmer with a cache-coherent view of shared memory, either across the entire system or even at the level of an individual node. Furthermore, it remains an open research question which programming model is best for the heterogeneous platforms that feature multiple traditional processors along with accelerators or co-processors. Hence, we have two contradicting trends. On the one hand, programming convenience and the presence of shared memory call for a shared memory programming model across the entire heterogeneous system. On the other hand, increasingly parallel and heterogeneous nodes lacking cache-coherent shared memory call for a message passing model. In this dissertation, we present the architecture of Samhita, a distributed shared memory (DSM) system that addresses the challenge of providing shared memory for non-cache-coherent systems. We define regional consistency (RegC), the memory consistency model implemented by Samhita. We present performance results for Samhita on several computational kernels and benchmarks, on both cluster supercomputers and heterogeneous systems. The results demonstrate the promising potential of Samhita and the RegC model, and include the largest scale evaluation by a significant margin for any DSM system reported to date.
4

Mapping parallel graph algorithms to throughput-oriented architectures

McLaughlin, Adam 07 January 2016 (has links)
The stagnant performance of single-core processors, the increasing size of data sets, and the variety of structure in information have made the domain of parallel and high-performance computing especially crucial. Graphics Processing Units (GPUs) have recently become an exciting alternative to traditional CPU architectures for applications in this domain. Although GPUs are designed for rendering graphics, research has found that the GPU architecture is well-suited to algorithms that search and analyze unstructured, graph-based data, offering up to an order of magnitude greater memory bandwidth than their CPU counterparts. This thesis focuses on GPU graph analysis from the perspective that algorithms should be efficient on as many classes of graphs as possible, rather than being specialized to a specific class, such as social networks or road networks. Using betweenness centrality, a popular analytic used to find prominent entities of a network, as a motivating example, we show how parallelism, distributed computing, hybrid and on-line algorithms, and dynamic algorithms can all contribute to substantial improvements in the performance and energy-efficiency of these computations. We further generalize this approach and provide an abstraction that can be applied to a whole class of graph algorithms that require many simultaneous breadth-first searches. Finally, to show that our findings can be applied in real-world scenarios, we apply these techniques to the problem of verifying that a multiprocessor complies with its memory consistency model.
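The "many simultaneous breadth-first searches" abstraction can be made concrete with the shortest-path-counting BFS at the heart of Brandes' betweenness centrality algorithm. The sequential C sketch below (array names are illustrative, and the graph is assumed to be in CSR form) shows the per-source work that GPU formulations parallelise, one search per source, with many sources in flight at once.

```c
#include <stdlib.h>

/* BFS stage of Brandes' algorithm: compute, from one source, the BFS
   distance dist[] and the number of shortest paths sigma[] to every
   vertex of an unweighted graph stored in CSR form (row_ptr, col_idx). */
void bfs_count_paths(int n, const int *row_ptr, const int *col_idx,
                     int src, int *dist, long *sigma)
{
    int *queue = malloc(n * sizeof *queue);
    int head = 0, tail = 0;

    for (int v = 0; v < n; v++) { dist[v] = -1; sigma[v] = 0; }
    dist[src] = 0; sigma[src] = 1;
    queue[tail++] = src;

    while (head < tail) {                    /* one BFS frontier at a time */
        int u = queue[head++];
        for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
            int w = col_idx[e];
            if (dist[w] < 0) {               /* first time w is reached    */
                dist[w] = dist[u] + 1;
                queue[tail++] = w;
            }
            if (dist[w] == dist[u] + 1)      /* w lies on a shortest path  */
                sigma[w] += sigma[u];        /* accumulate path counts     */
        }
    }
    free(queue);
}
```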
5

Architecture Support and Scalability Analysis of Memory Consistency Models in Network-on-Chip based Systems

Naeem, Abdul January 2013 (has links)
Shared memory systems should support parallelization at the computation (multi-core), communication (Network-on-Chip, NoC) and memory architecture levels to exploit the potential performance benefits. These parallel systems, supporting a shared memory abstraction in both the general-purpose and application-specific domains, are confronting the critical issue of memory consistency. The memory consistency issue arises from unconstrained memory operations, which lead to unexpected behavior of shared memory systems. Memory consistency models enforce ordering constraints on the memory operations to ensure the expected behavior of shared memory systems. The intuitive Sequential Consistency (SC) model enforces strict ordering constraints on the memory operations and does not take advantage of system optimizations in either hardware or software. Alternatively, relaxed memory consistency models relax the ordering constraints on the memory operations and exploit these optimizations to enhance system performance at a reasonable cost. The purpose of this thesis is twofold. First, novel architecture support is provided for different memory consistency models, namely SC, Total Store Ordering (TSO), Partial Store Ordering (PSO), Weak Consistency (WC), Release Consistency (RC) and Protected Release Consistency (PRC), in NoC-based multi-core (McNoC) systems. The PRC model is proposed as an extension of the RC model which provides additional reordering and relaxation of the memory operations. Second, a scalability analysis of these memory consistency models is performed in the McNoC systems. The architecture support for these different memory consistency models is provided in the McNoC platforms. Each configurable McNoC platform uses a packet-switched 2-D mesh NoC with a deflection routing policy, distributed shared memory (DSM), distributed locks and a customized processor interface. The memory consistency models/protocols are implemented in the customized processor interfaces, which are developed to integrate the processors with the rest of the system. The realization schemes for the memory consistency models are based on novel approaches using a transaction counter and an address stack. The transaction counter is used in each node of the network to keep track of the outstanding memory operations issued by a processor in the system. The address stack is used in each node of the network to keep track of the addresses of the outstanding memory operations issued by a processor in the system. These hardware structures are used in the processor interface to enforce the required global orders under these different memory consistency models. The realization scheme of the PRC model additionally uses an acquire counter for further classification of the data operations as unprotected and protected operations. The scalability analysis of these different memory consistency models is performed on the basis of different workloads which are developed and mapped onto networks of various sizes. The scalability study is conducted in McNoC systems with 1 to 64 cores, with various applications using different problem sizes and traffic patterns. Performance metrics such as execution time, performance, speedup, overhead and efficiency are evaluated as a function of the network size.
The experiments are conducted with both synthetic and application workloads. The experimental results under different application workloads show that the average execution time under the relaxed memory consistency models decreases relative to the SC model. The specific numbers are highly sensitive to the application and depend on how well it matches the architecture. This study shows that the performance improvement of the relaxed memory consistency models over the SC model depends on the computation-to-communication ratio, traffic patterns, data-to-synchronization ratio and the problem size. The performance improvement of the PRC and RC models over the SC model tends to be higher than 50%, as observed in the experiments, when the system is further scaled up.
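As a rough illustration of the transaction-counter scheme (a software sketch with hypothetical names, not the thesis's hardware implementation): the processor interface counts memory operations that have been issued into the NoC but not yet completed, and stalls at the points where the chosen consistency model requires all earlier operations to have completed, e.g. at every access under SC but only at synchronisation points under WC, RC or PRC.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int outstanding;   /* memory operations issued but not yet acknowledged */
} txn_counter_t;

static inline void issue_op(txn_counter_t *tc)    /* request sent into the NoC */
{
    atomic_fetch_add_explicit(&tc->outstanding, 1, memory_order_relaxed);
}

static inline void complete_op(txn_counter_t *tc) /* reply returned from memory */
{
    atomic_fetch_sub_explicit(&tc->outstanding, 1, memory_order_relaxed);
}

/* Called at an ordering point where the consistency model requires all
   previously issued operations to have completed; stricter models call
   this more often, which is where the performance gap comes from.      */
static inline void wait_for_drain(txn_counter_t *tc)
{
    while (atomic_load_explicit(&tc->outstanding, memory_order_acquire) != 0)
        ;  /* stall the processor interface until the counter drains */
}
```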
6

Cost-effective Designs for Supporting Correct Execution and Scalable Performance in Many-core Processors

Romanescu, Bogdan Florin January 2010 (has links)
Many-core processors offer new levels of on-chip performance by capitalizing on the increasing rate of device integration. Harnessing the full performance potential of these processors requires that hardware designers not only exploit the advantages, but also consider the problems introduced by the new architectures. Such challenges arise from both the processor's increased structural complexity and the reliability issues of the silicon substrate. In this thesis, we address these challenges in a framework that targets correct execution and performance on three coordinates: 1) tolerating permanent faults, 2) facilitating static and dynamic verification through precise specifications, and 3) designing scalable coherence protocols.

First, we propose CCA, a new design paradigm for increasing the processor's lifetime performance in the presence of permanent faults in cores. CCA chips rely on a reconfiguration mechanism that allows cores to replace faulty components with fault-free structures borrowed from neighboring cores. In contrast with existing solutions for handling hard faults that simply shut down cores, CCA aims to maximize the utilization of defect-free resources and increase the availability of on-chip cores. We implement three-core and four-core CCA chips and demonstrate that they offer a cumulative lifetime performance improvement of up to 65% for industry-representative utilization periods. In addition, we show that CCA benefits systems that employ modular redundancy to guarantee correct execution by increasing their availability.

Second, we target the correctness of the address translation system. Current processors often exhibit design bugs in their translation systems, and we believe one cause for these faults is a lack of precise specifications describing the interactions between address translation and the rest of the memory system, especially memory consistency. We address this aspect by introducing a framework for specifying translation-aware consistency models. As part of this framework, we identify the critical role played by address translation in supporting correct memory consistency implementations. Consequently, we propose a set of invariants that characterizes address translation. Based on these invariants, we develop DVAT, a dynamic verification mechanism for address translation. We demonstrate that DVAT is efficient in detecting translation-related faults, including several that mimic design bugs reported in processor errata. By checking the correctness of the address translation system, DVAT supports dynamic verification of translation-aware memory consistency.

Finally, we address the scalability of translation coherence protocols. Current software-based solutions for maintaining translation coherence adversely impact performance and do not scale. We propose UNITD, a hardware coherence protocol that supports scalable performance and architectural decoupling. UNITD integrates translation coherence within the regular cache coherence protocol, such that TLBs participate in the cache coherence protocol similar to instruction or data caches. We evaluate snooping and directory UNITD coherence protocols on processors with up to 16 cores and demonstrate that UNITD reduces the performance penalty of translation coherence to almost zero.
7

Exploiting Parallelism in GPUs

Hechtman, Blake Alan January 2014 (has links)
Heterogeneous processors with accelerators provide an opportunity to improve performance within a given power budget. Many of these heterogeneous processors contain Graphics Processing Units (GPUs) that can perform graphics and embarrassingly parallel computation orders of magnitude faster than a CPU while using less energy. Beyond these obvious applications for GPUs, a larger variety of applications can benefit from a GPU's large computation and memory bandwidth. However, many of these applications are irregular and, as a result, require synchronization and scheduling that are commonly believed to perform poorly on GPUs. The basic building block of synchronization and scheduling is memory consistency, which is, therefore, the first place to look for improving performance on irregular applications. In this thesis, we approach the programmability of irregular applications on GPUs by thinking across traditional boundaries of the compute stack. We think about architecture, microarchitecture and runtime systems from the programmer's perspective. To this end, we study architectural memory consistency on future GPUs with cache coherence. In addition, we design a GPU memory system microarchitecture that can support fine-grain and coarse-grain synchronization without sacrificing throughput. Finally, we develop a task runtime that embraces the GPU microarchitecture to perform well on fork/join parallelism desired by many programmers. Overall, this thesis contributes non-intuitive solutions to improve the performance and programmability of irregular applications from the programmer's perspective.
8

Analysis of memory consistency in MPSoCs using virtual prototyping

Hedde, Damien 12 June 2013 (has links)
Verifying memory consistency (VMC) consists of checking that the execution of a program on a hardware platform complied with a given memory consistency model (MCM). An MCM defines properties of memory accesses, in particular the ordering of accesses issued by different processors. This is a complex problem: achieving linear complexity requires a great deal of information about the program execution, in particular the order of writes to every memory location. Classical VMC techniques, however, require too much memory to scale. In this thesis we propose a VMC method, intended for virtual prototyping, with linear complexity. The method is dynamic, i.e. it runs alongside the program execution; this property is necessary to bound memory usage. Its memory footprint is in fact very limited, since it mainly depends on the size of the caches of the hardware platform. To achieve this scalability, the proposed method uses additional information, compared to classical methods, that virtual prototyping can provide. We implemented the proposed method on top of a generic trace-processing framework, developed during this thesis, for traces produced by the simulation of a virtual prototype. This framework efficiently relates events performed by the different components (processors, caches, memories) of the hardware platform. The relevance of the proposed solution was evaluated experimentally by analysing the behaviour of several programs executed on simulated hardware platforms containing varying numbers of processors.
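As a toy illustration (not the algorithm of the thesis) of why knowing the per-cell write order keeps the checker's state small: once every write to a memory cell has a position in that cell's write order, a dynamic checker only needs to remember, per processor and per cell, the newest write it has already observed; any read returning an older write can be flagged immediately, without storing the whole execution trace.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-location check: observed_write_pos is the position,
   in the cell's write order, of the write whose value a read returned;
   *last_seen is the newest position this processor has observed so far
   for that cell.  A read that goes "back in time" violates per-location
   ordering (coherence), one of the properties a checker must enforce.  */
bool check_read(uint64_t observed_write_pos, uint64_t *last_seen)
{
    if (observed_write_pos < *last_seen)
        return false;                 /* older write observed: report a bug */
    *last_seen = observed_write_pos;  /* advance this processor's view      */
    return true;
}
```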
9

Software lock elision for x86 machine code

Roy, Amitabha January 2011 (has links)
More than a decade after becoming a topic of intense research, there is no transactional memory hardware nor any examples of software transactional memory use outside the research community. Using software transactional memory in large pieces of software needs copious source code annotations and often means that standard compilers and debuggers can no longer be used. At the same time, overheads associated with software transactional memory fail to motivate programmers to expend the needed effort to use software transactional memory. The only way around the overheads in the case of general unmanaged code is the anticipated availability of hardware support. On the other hand, architects are unwilling to devote power and area budgets in mainstream microprocessors to hardware transactional memory, pointing to transactional memory being a 'niche' programming construct. A deadlock has thus ensued that is blocking transactional memory use and experimentation in the mainstream. This dissertation covers the design and construction of a software transactional memory runtime system called SLE_x86 that can potentially break this deadlock by decoupling transactional memory from programs using it. Unlike most other STM designs, the core design principle is transparency rather than performance. SLE_x86 operates at the level of x86 machine code, thereby becoming immediately applicable to binaries for the popular x86 architecture. The only requirement is that the binary synchronise using known locking constructs or calls such as those in Pthreads or OpenMP libraries. SLE_x86 provides speculative lock elision (SLE) entirely in software, executing critical sections in the binary using transactional memory. Optionally, the critical sections can also be executed without using transactions by acquiring the protecting lock. The dissertation makes a careful analysis of the impact on performance due to the demands of the x86 memory consistency model and the need to transparently instrument x86 machine code. It shows that both of these problems can be overcome to reach a reasonable level of performance, where transparent software transactional memory can perform better than a lock. SLE_x86 can ensure that programs are ready for transactional memory in any form, without being explicitly written for it.
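SLE_x86 operates directly on x86 machine code, but the effect it aims for on a Pthreads critical section can be sketched at the source level as below. The tx_begin/tx_commit functions are hypothetical stand-ins for an STM runtime (stubbed here so the sketch compiles); they are not SLE_x86's actual interface.

```c
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
long counter;

/* Hypothetical STM hooks; this always-fail stub keeps the sketch
   compilable and simply forces the fallback (locking) path.            */
static int  tx_begin(pthread_mutex_t *lock) { (void)lock; return 0; }
static void tx_commit(void) { }

/* Original critical section, exactly as it would appear in the binary. */
void increment_locked(void) {
    pthread_mutex_lock(&m);
    counter++;
    pthread_mutex_unlock(&m);
}

/* What lock elision aims for: run the critical section speculatively as
   a transaction while only observing that the lock is free, and fall
   back to really acquiring the lock if speculation cannot succeed.      */
void increment_elided(void) {
    if (tx_begin(&m)) {              /* speculate: lock observed free    */
        counter++;                   /* accesses tracked by the STM      */
        tx_commit();
    } else {                         /* fallback: behave like original   */
        pthread_mutex_lock(&m);
        counter++;
        pthread_mutex_unlock(&m);
    }
}
```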
10

Compiler optimisations and relaxed memory consistency models

Morisset, Robin 05 April 2017 (has links)
Modern multiprocessor architectures and programming languages exhibit weakly consistent memories. Their behaviour is formalised by the memory model of the architecture or programming language; it precisely defines which value can be returned by each shared memory read. This is not always the value written by the latest store to the same variable, because of optimisations in the processors, such as speculative execution of instructions, the complex effects of caches, and optimisations in the compilers. In this thesis we focus on the C11 memory model, defined by the 2011 edition of the C standard. Our contributions follow three axes. First, we studied the theory surrounding the C11 model, formally examining which compiler optimisations it enables. We show that many common compiler optimisations are allowed but, surprisingly, some important ones are forbidden. Secondly, building on these results, we developed a random testing methodology for detecting when widely used compilers such as GCC and Clang perform optimisations that are invalid in the C11 memory model. We found several bugs in GCC, all of which were promptly fixed. We also implemented a novel optimisation pass in LLVM that looks for the special instructions that restrict processor optimisations, called fence instructions, and eliminates the redundant ones. Finally, we developed a user-level scheduler for lightweight threads communicating through first-in first-out, single-producer single-consumer queues. This programming model is known as Kahn process networks, and we show how to implement it efficiently using C11 synchronisation primitives. This demonstrates that, despite its flaws, C11 can be used in practice.
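The single-producer/single-consumer channels mentioned at the end are a good showcase for C11's release/acquire primitives. Below is a textbook Lamport-style ring buffer, not the implementation from the thesis: the producer only writes tail, the consumer only writes head, and each side pairs an acquire load of the other's index with a release store of its own.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 1024                       /* capacity: QSIZE - 1 elements */

typedef struct {
    int buf[QSIZE];
    atomic_size_t head;                  /* written only by the consumer */
    atomic_size_t tail;                  /* written only by the producer */
} spsc_queue;

bool spsc_push(spsc_queue *q, int v)     /* called by the producer only  */
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if ((t + 1) % QSIZE == h) return false;          /* queue full       */
    q->buf[t] = v;
    atomic_store_explicit(&q->tail, (t + 1) % QSIZE,
                          memory_order_release);     /* publish the item */
    return true;
}

bool spsc_pop(spsc_queue *q, int *out)   /* called by the consumer only  */
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t) return false;                        /* queue empty      */
    *out = q->buf[h];
    atomic_store_explicit(&q->head, (h + 1) % QSIZE,
                          memory_order_release);     /* free the slot    */
    return true;
}
```

A queue zero-initialised with spsc_queue q = {0}; starts empty; spsc_push must only ever be called from the producer thread and spsc_pop from the consumer thread, which is what makes the relaxed loads of each side's own index safe.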
