Global ETD Search

21	Supporting Software Transactional Memory in Distributed Systems: Protocols for Cache-Coherence, Conflict Resolution and Replication Zhang, Bo 05 December 2011 (has links) Lock-based synchronization on multiprocessors is inherently non-scalable, non-composable, and error-prone. These problems are exacerbated in distributed systems due to an additional layer of complexity: multinode concurrency. Transactional memory (TM) is an emerging, alternative synchronization abstraction that promises to alleviate these difficulties. With the TM model, code that accesses shared memory objects are organized as transactions, which speculatively execute, while logging changes. If transactional conflicts are detected, one of the conflicting transaction is aborted and re-executed, while the other is allowed to commit, yielding the illusion of atomicity. TM for multiprocessors has been proposed in software (STM), in hardware (HTM), and in a combination (HyTM). This dissertation focuses on supporting the TM abstraction in distributed systems, i.e., distributed STM (or D-STM). We focus on three problem spaces: cache-coherence (CC), conflict resolution, and replication. We evaluate the performance of D-STM by measuring the competitive ratio of its makespan --- i.e., the ratio of its makespan (the last completion time for a given set of transactions) to the makespan of an optimal off-line clairvoyant scheduler. We show that the performance of D-STM for metric-space networks is O(N^2) for N transactions requesting an object under the Greedy contention manager and an arbitrary CC protocol. To improve the performance, we propose a class of location-aware CC protocols, called LAC protocols. We show that the combination of the Greedy manager and a LAC protocol yields an O(NlogN s) competitive ratio for s shared objects. We then formalize two classes of CC protocols: distributed queuing cache-coherence (DQCC) protocols and distributed priority queuing cache-coherence (DPQCC) protocols, both of which can be implemented using distributed queuing protocols. We show that a DQCC protocol is O(NlogD)-competitive and a DPQCC protocol is O(log D_delta)-competitive for N dynamically generated transactions requesting an object, where D_delta is the normalized diameter of the underlying distributed queuing protocol. Additionally, we propose a novel CC protocol, called Relay, which reduces the total number of aborts to O(N) for N conflicting transactions requesting an object, yielding a significantly improvement over past CC protocols which has O(N^2) total number of aborts. We also analyze Relay's dynamic competitive ratio in terms of the communication cost (for dynamically generated transactions), and show that Relay's dynamic competitive ratio is O(log D_0), where D_0 is the normalized diameter of the underlying network spanning tree. To reduce unnecessary aborts and increase concurrency for D-STM based on globally-consistent contention management policies, we propose the distributed dependency-aware (DDA) conflict resolution model, which adopts different conflict resolution strategies based on transaction types. In the DDA model, read-only transactions never abort by keeping a set of versions for each object. Each transaction only keeps precedence relations based on its local knowledge of precedence relations. We show that the DDA model ensures that 1) read-only transactions never abort, 2) every transaction eventually commits, 3) supports invisible reads, and 4) efficiently garbage collects useless object versions. To establish competitive ratio bounds for contention managers in D-STM, we model the distributed transactional contention management problem as the traveling salesman problem (TSP). We prove that for D-STM, any online, work conserving, deterministic contention manager provides an Omega(max[s,s^2/D]) competitive ratio in a network with normalized diameter D and s shared objects. Compared with the Omega(s) competitive ratio for multiprocessor STM, the performance guarantee for D-STM degrades by a factor proportional to s/D. We present a randomized algorithm, called Randomized, with a competitive ratio O(sClog n log ^{2} n) for s objects shared by n transactions, with a maximum conflicting degree C. To break this lower bound, we present a randomized algorithm Cutting, which needs partial information of transactions and an approximate TSP algorithm A with approximation ratio phi_A. We show that the average case competitive ratio of Cutting is O(s phi_A log^{2}m log^{2}n), which is close to O(s). Single copy (SC) D-STM keeps only one writable copy of each object, and thus cannot tolerate node failures. We propose a quorum-based replication (QR) D-STM model, which provides provable fault-tolerance without incurring high communication overhead, when compared with the SC model. The QR model stores object replicas in a tree quorum system, where two quorums intersect if one of them is a write quorum, and ensures the consistency among replicas at commit-time. The communication cost of an operation in the QR model is proportional to the communication cost from the requesting node to its closest read or write quorum. In the presence of node failures, the QR model exhibits high availability and degrades gracefully when the number of failed nodes increases, with reasonable higher communication cost. We develop a protoytpe implementation of the dissertation's proposed solutions, including DQCC and DPQCC protocols, Relay protocol, and the DDA model, in the HyFlow Java D-STM framework. We experimentally evaluated these solutions with respective competitor solutions on a set of microbenchmarks (e.g., data structures including distributed linked list, binary search tree and red-black tree) and macrobenchmarks (e.g., distributed versions of the applications in the STAMP STM benchmark suite for multiprocessors). Our experimental studies revealed that: 1) based on the same distributed queuing protocol (i.e., Ballistic CC protocol), DPQCC yields better transactional throughput than DQCC, by a factor of 50% - 100%, on a range of transactional workloads; 2) Relay outperforms competitor protocols (including Arrow, Ballistic and Home) by more than 200% when the network size and contention increase, as it efficiently reduces the average aborts per transaction (less than 0.5); 3) the DDA model outperforms existing contention management policies (including Greedy, Karma and Kindergarten managers) by upto 30%-40% in high contention environments; For read/write-balanced workloads, the DDA model outperforms these contention management policies by 30%-60% on average; for read-dominated workloads, the model outperforms by over 200%. / Ph. D. Cache-Coherence Distributed Queuing Contention Management Software Transactional Memory Replication Quorum System
22	HyFlow: A High Performance Distributed Software Transactional Memory Framework Saad Ibrahim, Mohamed Mohamed 14 June 2011 (has links) We present HyFlow - a distributed software transactional memory (D-STM) framework for distributed concurrency control. Lock-based concurrency control suffers from drawbacks including deadlocks, livelocks, and scalability and composability challenges. These problems are exacerbated in distributed systems due to their distributed versions which are more complex to cope with (e.g., distributed deadlocks). STM and D-STM are promising alternatives to lock-based and distributed lock-based concurrency control for centralized and distributed systems, respectively, that overcome these difficulties. HyFlow is a Java framework for DSTM, with pluggable support for directory lookup protocols, transactional synchronization and recovery mechanisms, contention management policies, cache coherence protocols, and network communication protocols. HyFlow exports a simple distributed programming model that excludes locks: using (Java 5) annotations, atomic sections are defiend as transactions, in which reads and writes to shared, local and remote objects appear to take effect instantaneously. No changes are needed to the underlying virtual machine or compiler. We describe HyFlow's architecture and implementation, and report on experimental studies comparing HyFlow against competing models including Java remote method invocation (RMI) with mutual exclusion and read/write locks, distributed shared memory (DSM), and directory-based D-STM. / Master of Science Directory Protocols Cache Coherence Contention Management Control-Flow Dataflow Software Transactional Memory Distributed Systems
23	Performance Analysis of Complex Shared Memory Systems Molka, Daniel 22 March 2017 (has links) (PDF) Systems for high performance computing are getting increasingly complex. On the one hand, the number of processors is increasing. On the other hand, the individual processors are getting more and more powerful. In recent years, the latter is to a large extent achieved by increasing the number of cores per processor. Unfortunately, scientific applications often fail to fully utilize the available computational performance. Therefore, performance analysis tools that help to localize and fix performance problems are indispensable. Large scale systems for high performance computing typically consist of multiple compute nodes that are connected via network. Performance analysis tools that analyze performance problems that arise from using multiple nodes are readily available. However, the increasing number of cores per processor that can be observed within the last decade represents a major change in the node architecture. Therefore, this work concentrates on the analysis of the node performance. The goal of this thesis is to improve the understanding of the achieved application performance on existing hardware. It can be observed that the scaling of parallel applications on multi-core processors differs significantly from the scaling on multiple processors. Therefore, the properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed. As a first step, a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able to determine the performance of memory accesses depending on the location and coherence state of the data. They are used to perform an in-depth analysis of the characteristics of memory accesses in contemporary multi-processor systems, which identifies potential bottlenecks. However, in order to localize performance problems, it also has to be determined to which extend the application performance is limited by certain resources. Therefore, a methodology to derive metrics for the utilization of individual components in the memory hierarchy as well as waiting times caused by memory accesses is developed in the second step. The approach is based on hardware performance counters, which record the number of certain hardware events. The developed micro-benchmarks are used to selectively stress individual components, which can be used to identify the events that provide a reasonable assessment for the utilization of the respective component and the amount of time that is spent waiting for memory accesses to complete. Finally, the knowledge gained from this process is used to implement a visualization of memory related performance issues in existing performance analysis tools. The results of the micro-benchmarks reveal that the increasing number of cores per processor and the usage of multiple processors per node leads to complex systems with vastly different performance characteristics of memory accesses depending on the location of the accessed data. Furthermore, it can be observed that the aggregated throughput of shared resources in multi-core processors does not necessarily scale linearly with the number of cores that access them concurrently, which limits the scalability of parallel applications. It is shown that the proposed methodology for the identification of meaningful hardware performance counters yields useful metrics for the localization of memory related performance limitations. Leistungsanalyse Benchmarking Speicherhierarchie Cache-Kohärenz Protokolle performance analysis benchmarking memory hierarchy cache coherence protocols ddc:004 rvk:ST 155 rvk:ST 237
24	Inférence d'invariants pour le model checking de systèmes paramétrés / Invariants inference for model checking of parameterized systems Mebsout, Alain 29 September 2014 (has links) Cette thèse aborde le problème de la vérification automatique de systèmesparamétrés complexes. Cette approche est importante car elle permet de garantircertaines propriétés sans connaître a priori le nombre de composants dusystème. On s'intéresse en particulier à la sûreté de ces systèmes et on traitele côté paramétré du problème avec des méthodes symboliques. Ces travauxs'inscrivent dans le cadre théorique du model checking modulo théories et ontdonné lieu à un nouveau model checker : Cubicle.Une des contributions principale de cette thèse est une nouvelle technique pourinférer des invariants de manière automatique. Le processus de générationd'invariants est intégré à l'algorithme de model checking et permet de vérifieren pratique des systèmes hors de portée des approches symboliquestraditionnelles. Une des applications principales de cet algorithme estl’analyse de sûreté paramétrée de protocoles de cohérence de cache de tailleindustrielle.Enfin, pour répondre au problème de la confiance placée dans le model checker,on présente deux techniques de certification de notre outil Cubicle utilisantla plate-forme Why3. La première consiste à générer des certificats dont lavalidité est évaluée de manière indépendante tandis que la seconde est uneapproche par vérification déductive du cœur de Cubicle. / This thesis tackles the problem of automatically verifying complexparameterized systems. This approach is important because it can guarantee thatsome properties hold without knowing a priori the number of components in thesystem. We focus in particular on the safety of such systems and we handle theparameterized aspect with symbolic methods. This work is set in the theoreticalframework of the model checking modulo theories and resulted in a new modelchecker: Cubicle.One of the main contribution of this thesis is a novel technique forautomatically inferring invariants. The process of invariant generation isintegrated with the model checking algorithm and allows the verification inpractice of systems which are out of reach for traditional symbolicapproaches. One successful application of this algorithm is the safety analysisof industrial size parameterized cache coherence protocols.Finally, to address the problem of trusting the answer given by the modelchecker, we present two techniques for certifying our tool Cubicle based on theframework Why3. The first consists in producing certificates whose validity canbe assessed independently while the second is an approach by deductiveverification of the heart of Cubicle. Model checking Systèmes paramétrés STM Invariants Certification Protocole de cohérence de cache Model checking Parameterized systems STM Invariants Certification Cache coherence protocols
25	An Interconnection Network for a Cache Coherent System on FPGAs Mirian, Vincent 12 January 2011 (has links) Field-Programmable Gate Arrays (FPGAs) systems now comprise many processing elements that are processors running software and hardware engines used to accelerate specific functions. To make the programming of such a system simpler, it is easiest to think of a shared-memory environment, much like in current multi-core processor systems. This thesis introduces a novel, shared-memory, cache-coherent infrastructure for heterogeneous systems implemented on FPGAs that can then form the basis of a shared-memory programming model for heterogeneous systems. With simulation results, it is shown that the cache-coherent infrastructure outperforms the infrastructure of Woods [1] with a speedup of 1.10. The thesis explores the various configurations of the cache interconnection network and the benefit of the cache-to-cache cache line data transfer with its impact on main memory access. Finally, the thesis shows the cache-coherent infrastructure has very little overhead when using its cache coherence implementation. FPGA Cache Coherence Interconnection Network Protocols Multiprocessor muti-thread Pthread (Posix Thread) programming model parrallel programming shared-memory model 0544
26	An Interconnection Network for a Cache Coherent System on FPGAs Mirian, Vincent 12 January 2011 (has links) Field-Programmable Gate Arrays (FPGAs) systems now comprise many processing elements that are processors running software and hardware engines used to accelerate specific functions. To make the programming of such a system simpler, it is easiest to think of a shared-memory environment, much like in current multi-core processor systems. This thesis introduces a novel, shared-memory, cache-coherent infrastructure for heterogeneous systems implemented on FPGAs that can then form the basis of a shared-memory programming model for heterogeneous systems. With simulation results, it is shown that the cache-coherent infrastructure outperforms the infrastructure of Woods [1] with a speedup of 1.10. The thesis explores the various configurations of the cache interconnection network and the benefit of the cache-to-cache cache line data transfer with its impact on main memory access. Finally, the thesis shows the cache-coherent infrastructure has very little overhead when using its cache coherence implementation. FPGA Cache Coherence Interconnection Network Protocols Multiprocessor muti-thread Pthread (Posix Thread) programming model parrallel programming shared-memory model 0544
27	Dynamic detection of the communication pattern in shared memory environments for thread mapping / Detecção dinâmica do padrão de comunicação em ambientes de memória compartilhada para o mapeamento de threads Cruz, Eduardo Henrique Molina da January 2012 (has links) As threads de aplicações paralelas cooperam a fim de cumprir suas tarefas, dessa forma, comunicação é realizada entre elas. A latência de comunicação entre os núcleos em arquiteturas multiprocessadas diferem dependendo da hierarquia de memória e das interconexões. Com o aumento do número de núcleos por chip e número de threads por núcleo, esta diferença entre as latências de comunicação está aumentando. Portanto, é importante mapear as threads de aplicações paralelas levando em conta a comunicação entre elas. Em aplicações paralelas baseadas no paradigma de memória compartilhada, a comunicação é implícita e ocorre através de acessos à variáveis compartilhadas, o que torna difícil a descoberta do padrão de comunicação entre as threads. Mecanismos tradicionais usam simulação para monitorar os acessos à memória realizados pela aplicação, requerendo modificações no código fonte e aumentando drasticamente a sobrecarga. Nesta dissertação de mestrado, são introduzidos dois mecanismos inovadores com uma baixa sobrecarga para se detectar o padrão de comunicação entre threads. O primeiro mecanismo faz uso de informações sobre linhas compartilhadas de caches providas por protocolos de coerência de cache. O segundo mecanismo utiliza a Translation Lookaside Buffer (TLB) para detectar quais páginas de memória cada núcleo está acessando. Ambos os mecanismos dependem totalmente do hardware, o que torna o mapeamento de threads transparente aos programadores e permite que ele seja realizado dinamicamente pelo sistema operacional. Além disto, nenhuma tarefa de alta sobrecarga, como simulação, é requerida. As propostas foram avaliadas com o NAS Parallel Benchmarks (NPB), obtendo representações precisas dos padrões de comunicação. Mapeamentos para as threads foram gerados utilizando os padrões de comunicação descobertos e um algoritmo de mapeamento. O problema do mapeamento é NP-Difícil. Portanto, de forma a se atingir uma complexidade polinomial, o algoritmo empregado é heurístico, baseado no algoritmo de emparelhamento de grafos de Edmonds. Executando as aplicações com o mapeamento resultou em um ganho de desempenho de até 15; 3%. O número de faltas na cache, invalidações em linhas de cache e transações de espionagem foram reduzidos em até 31; 9%, 41% e 65; 4%, respectivamente. / The threads of parallel applications cooperate in order to fulfill their tasks, thereby communication is performed among themselves. The communication latency between the cores in a multiprocessor architecture differs depending on the memory hierarchy and the interconnections. With the increase in the number of cores per chip and the number of threads per core, this difference between the communication latencies is increasing. Therefore, it is important to map the threads of parallel applications taking into account the communication between them. In parallel applications based on the shared memory paradigm, the communication is implicit and occurs through accesses to shared variables, which makes difficult to detect the communication pattern between the threads. Traditional approaches use simulation to monitor the memory accesses performed by the application, requiring modifications to the source code and drastically increasing the overhead. In this master thesis, we introduce two novel light-weight mechanisms to find the communication pattern of threads. The first mechanism makes use of the information about shared cache lines provided by cache coherence protocols. The second mechanism makes use of the Translation Lookaside Buffer (TLB) to detect which memory pages each core is accessing. Both our mechanisms rely entirely on hardware features, which makes the thread mapping transparent to the programmer and allows it to be performed dynamically by the operating system. Moreover, no time consuming task, such as simulation, is required. We evaluated our mechanisms with the NAS Parallel Benchmarks (NPB) and obtained accurate representations of the communication patterns. We generated thread mappings from the detected communication patterns using a mapping algorithm. Mapping is a NP-Hard problem. Therefore, in order to achieve a polynomial complexity, we designed a heuristic method based on the Edmonds graph matching algorithm. Running the applications with these mappings resulted in performance improvements of up to 15.3% compared to the original scheduler of the operating system. The number of cache misses, cache line invalidations and snoop transactions were reduced by up to 31.9%, 41% and 65.4%, respectively. Processamento paralelo Desempenho : Computadores Processamento distribuido Thread mapping Parallel computer architectures Shared memory Communication Cache memory Cache coherence protocols TLB
28	Dynamic detection of the communication pattern in shared memory environments for thread mapping / Detecção dinâmica do padrão de comunicação em ambientes de memória compartilhada para o mapeamento de threads Cruz, Eduardo Henrique Molina da January 2012 (has links) As threads de aplicações paralelas cooperam a fim de cumprir suas tarefas, dessa forma, comunicação é realizada entre elas. A latência de comunicação entre os núcleos em arquiteturas multiprocessadas diferem dependendo da hierarquia de memória e das interconexões. Com o aumento do número de núcleos por chip e número de threads por núcleo, esta diferença entre as latências de comunicação está aumentando. Portanto, é importante mapear as threads de aplicações paralelas levando em conta a comunicação entre elas. Em aplicações paralelas baseadas no paradigma de memória compartilhada, a comunicação é implícita e ocorre através de acessos à variáveis compartilhadas, o que torna difícil a descoberta do padrão de comunicação entre as threads. Mecanismos tradicionais usam simulação para monitorar os acessos à memória realizados pela aplicação, requerendo modificações no código fonte e aumentando drasticamente a sobrecarga. Nesta dissertação de mestrado, são introduzidos dois mecanismos inovadores com uma baixa sobrecarga para se detectar o padrão de comunicação entre threads. O primeiro mecanismo faz uso de informações sobre linhas compartilhadas de caches providas por protocolos de coerência de cache. O segundo mecanismo utiliza a Translation Lookaside Buffer (TLB) para detectar quais páginas de memória cada núcleo está acessando. Ambos os mecanismos dependem totalmente do hardware, o que torna o mapeamento de threads transparente aos programadores e permite que ele seja realizado dinamicamente pelo sistema operacional. Além disto, nenhuma tarefa de alta sobrecarga, como simulação, é requerida. As propostas foram avaliadas com o NAS Parallel Benchmarks (NPB), obtendo representações precisas dos padrões de comunicação. Mapeamentos para as threads foram gerados utilizando os padrões de comunicação descobertos e um algoritmo de mapeamento. O problema do mapeamento é NP-Difícil. Portanto, de forma a se atingir uma complexidade polinomial, o algoritmo empregado é heurístico, baseado no algoritmo de emparelhamento de grafos de Edmonds. Executando as aplicações com o mapeamento resultou em um ganho de desempenho de até 15; 3%. O número de faltas na cache, invalidações em linhas de cache e transações de espionagem foram reduzidos em até 31; 9%, 41% e 65; 4%, respectivamente. / The threads of parallel applications cooperate in order to fulfill their tasks, thereby communication is performed among themselves. The communication latency between the cores in a multiprocessor architecture differs depending on the memory hierarchy and the interconnections. With the increase in the number of cores per chip and the number of threads per core, this difference between the communication latencies is increasing. Therefore, it is important to map the threads of parallel applications taking into account the communication between them. In parallel applications based on the shared memory paradigm, the communication is implicit and occurs through accesses to shared variables, which makes difficult to detect the communication pattern between the threads. Traditional approaches use simulation to monitor the memory accesses performed by the application, requiring modifications to the source code and drastically increasing the overhead. In this master thesis, we introduce two novel light-weight mechanisms to find the communication pattern of threads. The first mechanism makes use of the information about shared cache lines provided by cache coherence protocols. The second mechanism makes use of the Translation Lookaside Buffer (TLB) to detect which memory pages each core is accessing. Both our mechanisms rely entirely on hardware features, which makes the thread mapping transparent to the programmer and allows it to be performed dynamically by the operating system. Moreover, no time consuming task, such as simulation, is required. We evaluated our mechanisms with the NAS Parallel Benchmarks (NPB) and obtained accurate representations of the communication patterns. We generated thread mappings from the detected communication patterns using a mapping algorithm. Mapping is a NP-Hard problem. Therefore, in order to achieve a polynomial complexity, we designed a heuristic method based on the Edmonds graph matching algorithm. Running the applications with these mappings resulted in performance improvements of up to 15.3% compared to the original scheduler of the operating system. The number of cache misses, cache line invalidations and snoop transactions were reduced by up to 31.9%, 41% and 65.4%, respectively. Processamento paralelo Desempenho : Computadores Processamento distribuido Thread mapping Parallel computer architectures Shared memory Communication Cache memory Cache coherence protocols TLB
29	Dynamic detection of the communication pattern in shared memory environments for thread mapping / Detecção dinâmica do padrão de comunicação em ambientes de memória compartilhada para o mapeamento de threads Cruz, Eduardo Henrique Molina da January 2012 (has links) As threads de aplicações paralelas cooperam a fim de cumprir suas tarefas, dessa forma, comunicação é realizada entre elas. A latência de comunicação entre os núcleos em arquiteturas multiprocessadas diferem dependendo da hierarquia de memória e das interconexões. Com o aumento do número de núcleos por chip e número de threads por núcleo, esta diferença entre as latências de comunicação está aumentando. Portanto, é importante mapear as threads de aplicações paralelas levando em conta a comunicação entre elas. Em aplicações paralelas baseadas no paradigma de memória compartilhada, a comunicação é implícita e ocorre através de acessos à variáveis compartilhadas, o que torna difícil a descoberta do padrão de comunicação entre as threads. Mecanismos tradicionais usam simulação para monitorar os acessos à memória realizados pela aplicação, requerendo modificações no código fonte e aumentando drasticamente a sobrecarga. Nesta dissertação de mestrado, são introduzidos dois mecanismos inovadores com uma baixa sobrecarga para se detectar o padrão de comunicação entre threads. O primeiro mecanismo faz uso de informações sobre linhas compartilhadas de caches providas por protocolos de coerência de cache. O segundo mecanismo utiliza a Translation Lookaside Buffer (TLB) para detectar quais páginas de memória cada núcleo está acessando. Ambos os mecanismos dependem totalmente do hardware, o que torna o mapeamento de threads transparente aos programadores e permite que ele seja realizado dinamicamente pelo sistema operacional. Além disto, nenhuma tarefa de alta sobrecarga, como simulação, é requerida. As propostas foram avaliadas com o NAS Parallel Benchmarks (NPB), obtendo representações precisas dos padrões de comunicação. Mapeamentos para as threads foram gerados utilizando os padrões de comunicação descobertos e um algoritmo de mapeamento. O problema do mapeamento é NP-Difícil. Portanto, de forma a se atingir uma complexidade polinomial, o algoritmo empregado é heurístico, baseado no algoritmo de emparelhamento de grafos de Edmonds. Executando as aplicações com o mapeamento resultou em um ganho de desempenho de até 15; 3%. O número de faltas na cache, invalidações em linhas de cache e transações de espionagem foram reduzidos em até 31; 9%, 41% e 65; 4%, respectivamente. / The threads of parallel applications cooperate in order to fulfill their tasks, thereby communication is performed among themselves. The communication latency between the cores in a multiprocessor architecture differs depending on the memory hierarchy and the interconnections. With the increase in the number of cores per chip and the number of threads per core, this difference between the communication latencies is increasing. Therefore, it is important to map the threads of parallel applications taking into account the communication between them. In parallel applications based on the shared memory paradigm, the communication is implicit and occurs through accesses to shared variables, which makes difficult to detect the communication pattern between the threads. Traditional approaches use simulation to monitor the memory accesses performed by the application, requiring modifications to the source code and drastically increasing the overhead. In this master thesis, we introduce two novel light-weight mechanisms to find the communication pattern of threads. The first mechanism makes use of the information about shared cache lines provided by cache coherence protocols. The second mechanism makes use of the Translation Lookaside Buffer (TLB) to detect which memory pages each core is accessing. Both our mechanisms rely entirely on hardware features, which makes the thread mapping transparent to the programmer and allows it to be performed dynamically by the operating system. Moreover, no time consuming task, such as simulation, is required. We evaluated our mechanisms with the NAS Parallel Benchmarks (NPB) and obtained accurate representations of the communication patterns. We generated thread mappings from the detected communication patterns using a mapping algorithm. Mapping is a NP-Hard problem. Therefore, in order to achieve a polynomial complexity, we designed a heuristic method based on the Edmonds graph matching algorithm. Running the applications with these mappings resulted in performance improvements of up to 15.3% compared to the original scheduler of the operating system. The number of cache misses, cache line invalidations and snoop transactions were reduced by up to 31.9%, 41% and 65.4%, respectively. Processamento paralelo Desempenho : Computadores Processamento distribuido Thread mapping Parallel computer architectures Shared memory Communication Cache memory Cache coherence protocols TLB
30	Practical Support for Strong, Serializability-Based Memory Consistency Biswas, Swarnendu 30 December 2016 (has links) No description available. Computer Science Memory models cache coherence region serializability conflict exceptions region conflicts data races atomicity checking dynamic program analysis

Search results