61 |
Code profiling and optimization in transactional memory systems / Profiling e otimização de código em sistemas de memória transacional. Cordeiro, Silvio Ricardo. January 2014.
Transactional Memory has shown itself to be a promising paradigm for the implementation of shared-memory concurrent applications that eschew a lock-based model of data synchronization. Rather than conditioning exclusive access on the value of a lock that is shared across concurrent threads, Transactional Memory attempts to execute critical sections optimistically, rolling back the modifications in the event of a data access conflict. However, while the lock-based approach has acquired a significant body of debugging, profiling and automated optimization tools (as one of the oldest and most researched synchronization techniques), the field of Transactional Memory is still comparatively recent, and programmers are usually tasked with unguided manual tuning of their transactional applications when facing efficiency problems. We propose a system in which code profiling in a simulated hardware implementation of Transactional Memory is used to characterize a transactional application, forming the basis for automated tuning of the underlying speculative system for efficient execution of that particular application. We also propose a profile-guided approach to the scheduling of threads in a software-based implementation of Transactional Memory, using the collected data to predict the likelihood of conflicts and to determine which thread to schedule based on this prediction. We present the results achieved under both designs.
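The optimistic execution model described in this abstract can be made concrete with a small example. The sketch below uses GCC's transactional-memory language extension (compile with gcc -fgnu-tm) rather than the thesis's simulated hardware TM or its software TM implementation; it only illustrates the programming model, and the balance variables are hypothetical.

```c
/* Minimal sketch of optimistic synchronization with GCC's
 * transactional-memory extension (gcc -fgnu-tm example.c).
 * Variable names and amounts are hypothetical. */
#include <stdio.h>

static long balance_a = 100, balance_b = 0;

/* The block executes optimistically; on a memory-access conflict the
 * TM runtime rolls the transaction back and retries, with no lock. */
void transfer(long amount) {
    __transaction_atomic {
        balance_a -= amount;
        balance_b += amount;
    }
}

int main(void) {
    transfer(25);
    printf("a=%ld b=%ld\n", balance_a, balance_b);
    return 0;
}
```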
|
62 |
Automatic task and data mapping in shared memory architectures / Mapeamento automático de processos e dados em arquiteturas de memória compartilhada. Diener, Matthias. January 2015.
Reducing the cost of memory accesses, both in terms of performance and energy consumption, is a major challenge in shared-memory architectures. Modern systems have deep and complex memory hierarchies with multiple cache levels and memory controllers, leading to Non-Uniform Memory Access (NUMA) behavior. In such systems, there are two ways to improve memory affinity. First, by mapping tasks that share data (that is, communicate) to cores with a shared cache, cache usage and communication performance are improved. Second, by mapping memory pages to memory controllers that perform the most accesses to them and are not overloaded, the average cost of accesses is reduced. We call these two techniques task mapping and data mapping, respectively. For optimal results, task and data mapping need to be performed in an integrated way. Previous work in this area performs the two mappings only separately, which limits the gains that can be achieved. Furthermore, most previous mechanisms require expensive operations, such as communication or memory access traces, to perform the mapping; require changes to the hardware or to the parallel application; or use a simple static mapping. Such mechanisms cannot be considered generic solutions to the mapping problem.
In this thesis, we make two contributions to the mapping problem. First, we introduce a set of metrics and a methodology to analyze parallel applications in order to determine their suitability for an improved mapping and to evaluate the possible gains that can be achieved using an optimized mapping. Second, we propose two automatic mechanisms that perform task mapping and combined task/data mapping, respectively, during the execution of a parallel application. These mechanisms work at the operating system level and require no changes to the hardware, the applications themselves, or their runtime libraries. An extensive evaluation with parallel applications from multiple benchmark suites, as well as real scientific applications, shows substantial performance and energy efficiency improvements that are significantly higher than those of simple mechanisms and previous work, while maintaining a low overhead.
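As a minimal illustration of the two mapping dimensions discussed above, the sketch below pins a thread to a core (task mapping) and allocates memory on that core's NUMA node (data mapping), using the Linux scheduling API and libnuma (link with -lnuma). The thesis's mechanisms perform this automatically inside the operating system; the explicit calls here are only a manual illustration of what is being optimized.

```c
/* Manual task and data mapping on Linux, assuming libnuma is
 * installed. Illustrative only; the thesis does this automatically. */
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    /* Task mapping: pin the calling thread to core 0. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    /* Data mapping: allocate pages on core 0's NUMA node,
       so this thread's accesses stay local. */
    int node = numa_node_of_cpu(0);
    size_t len = 1 << 20;
    char *buf = numa_alloc_onnode(len, node);
    if (buf) { buf[0] = 1; numa_free(buf, len); }
    return 0;
}
```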
|
64 |
[en] COMMUNICATION PROCESSOR FOR CYGNUS COMPUTER / [pt] PROCESSADOR DE COMUNICAÇÃO PARA O SISTEMA CYGNUS. Amaral, Jose Franco Machado do. 20 June 2007.
Computer interconnection is becoming increasingly important because it enables data transfer as well as resource sharing among machines. The Communication Processor described in this work allows the Cygnus multiprocessor system to be connected to a local network that uses a common-bus topology, twisted-pair cabling, and a collision-avoiding medium access protocol.
|
66 |
Porting Cilk to the Barrelfish OS. Ho Bao Le, Chau. January 2013.
The Barrelfish operating system is an experimental instance of the multikernel architecture, which exhibits desirable properties such as support for hardware heterogeneity, scalability, and dynamicity. Barrelfish is a work in progress and still lacks applications, so there is a need to investigate how efficiently applications run on it; a natural candidate is a shared-memory application. For this empirical study, Cilk was chosen because its runtime library is designed for shared-memory architectures and is known for good performance. This thesis focuses on making Cilk run on top of Barrelfish, with two goals: the portability that Barrelfish is said to support, and good speed afterwards. The port involves compiling the Cilk runtime source code with its pthread subroutines replaced by the corresponding Barrelfish APIs, and changing the way the Cilk scheduler spawns worker threads on multiple cores. The central issue, however, is giving different cores access to the same virtual address space. Barrelfish provides the notion of a domain, which specifies the number of cores in an application so that those cores can share the same memory space. The thesis also benchmarks several Cilk programs and finds that Cilk does not perform as well as expected; measurements of the parallel workers show that Cilk on Barrelfish takes more cycles to perform its computation. Although Cilk still maintains the work-first principle, it cannot achieve its theoretical time bound. The cost of spanning a domain is proportional to the number of cores, which matters for applications that take little time to complete.
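For context, a typical Cilk program of the kind used in such benchmarks looks like the sketch below, written here in the Cilk Plus/OpenCilk dialect (the Cilk-5 dialect ported in the thesis uses slightly different keywords). The spawned calls are distributed across worker threads by the work-stealing scheduler, which is why all workers must share one virtual address space: the property the Barrelfish domain provides.

```c
/* Minimal Cilk sketch (OpenCilk dialect, e.g. clang -fopencilk).
 * Illustrative only; not code from the thesis's port. */
#include <cilk/cilk.h>
#include <stdio.h>

long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  /* may run on another worker */
    long y = fib(n - 2);
    cilk_sync;                       /* wait for the spawned child */
    return x + y;
}

int main(void) {
    printf("fib(30) = %ld\n", fib(30));
    return 0;
}
```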
|
67 |
A Performance Evaluation of MPI Shared Memory Programming / En utvärdering av MPI shared memory-programmering med inriktning på prestanda. Karlbom, David. January 2016.
This thesis investigates the Message Passing Interface (MPI) support for shared memory programming on modern hardware architectures with multiple Non-Uniform Memory Access (NUMA) domains. We investigate its performance in two case studies: matrix-matrix multiplication and Conway's game of life. We compare MPI shared memory performance, in terms of execution time and memory consumption, with implementations using OpenMP and MPI point-to-point communication, also called "MPI two-sided". We perform strong scaling tests in both cases. In the matrix-matrix multiplication with 32 processes, the MPI two-sided implementation is 21% and 18% faster than the MPI shared and OpenMP implementations, respectively, while MPI shared uses 45% less memory than MPI two-sided. In Conway's game of life with 32 processes, the MPI two-sided implementation is 10% and 82% faster than the MPI shared and OpenMP implementations, respectively. We also observe that not mapping virtual memory to a specific NUMA domain can increase execution time by 64% when using 32 processes. We conclude that MPI shared memory is viable for intranode communication on modern hardware architectures with multiple NUMA domains.
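A minimal sketch of the MPI-3 shared-memory mechanism evaluated above is shown below: ranks on the same node are split into a shared-memory communicator, allocate one shared window, and then access each other's segments with plain loads and stores instead of message passing. This is standard MPI-3 API usage, not code from the thesis.

```c
/* MPI-3 shared-memory window sketch (compile with mpicc, run with mpirun). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm node;   /* ranks that can share memory (same node) */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);

    int rank, size;
    MPI_Comm_rank(node, &rank);
    MPI_Comm_size(node, &size);

    /* Each rank contributes 1024 doubles to a node-wide window. */
    MPI_Win win;
    double *mine;
    MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node, &mine, &win);

    MPI_Win_fence(0, win);
    mine[0] = rank;            /* plain store, no MPI_Put needed */
    MPI_Win_fence(0, win);

    /* Query a direct pointer to the last rank's segment and read it. */
    MPI_Aint len; int disp; double *base;
    MPI_Win_shared_query(win, size - 1, &len, &disp, &base);
    if (rank == 0)
        printf("rank 0 reads %.0f from rank %d's segment\n", base[0], size - 1);

    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```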
|
68 |
Wait-free Solvability of Colorless Tasks in Anonymous Shared-memory Model / 匿名共有メモリモデルにおける非彩色タスクの無待機可解性. Yanagisawa, Nayuta. 26 March 2018.
Kyoto University. Doctor of Science, awarded 26 March 2018. Graduate School of Science, Department of Mathematics and Mathematical Sciences. Examination committee: Prof. Susumu Nishimura, Prof. Masaaki Ue, Prof. Masahito Hasegawa.
|
69 |
SNIC-DSM: SmartNIC based DSM Infrastructure for Heterogeneous-ISA Machines. Ramesh, Hemanth. 14 August 2023.
Heterogeneous computing is increasingly used in today's datacenters to meet the growing computational demands of applications. Heterogeneous hardware typically includes CPUs, GPUs, ASICs, and FPGAs, among others. An important emerging trend is instruction-set-architecture (ISA) heterogeneity: high-end x86 servers with attached SmartNICs and SmartSSDs that incorporate general-purpose CPUs, typically of the RISC ISA family (e.g., ARM, RISC-V). To alleviate resource congestion on server computing nodes, application workloads can be scaled out across server x86 CPUs and SmartNIC ARM CPUs using the distributed shared memory (DSM) abstraction. We present SNIC-DSM, a SmartNIC-based DSM infrastructure for heterogeneous-ISA machines. SNIC-DSM implements a low-latency messaging layer, which enables inter-node communication across multi-ISA CPUs, and a DSM protocol processor that provides memory coherency among these nodes, both implemented in the SmartNIC's FPGA logic. SNIC-DSM is reconfigurable and allows the implementation of different memory consistency protocols. Our experimental studies using compute-intensive benchmarks reveal that SNIC-DSM outperforms the state-of-the-art DSM (Popcorn Linux's software DSM) when server resource congestion is high. / Master of Science / The availability of heterogeneous computing architectures has led to the development of distributed shared memory (DSM) systems, which allow compute-intensive applications to run in a distributed manner on different types of computing devices, such as graphics processors, reconfigurable logic devices, and custom integrated circuits. Adopting such a heterogeneous computing strategy yields better performance and improves power consumption. These DSM systems generally take a software-based approach, which offers great flexibility but suffers from software overheads; hardware-based approaches overcome these limitations but generally sacrifice that flexibility. This thesis presents SNIC-DSM, a reconfigurable implementation of the DSM framework. SNIC-DSM provides a platform for the host and smart networking devices such as SmartNICs to communicate with each other, and enables distributed application execution by providing memory coherency.
Our experimental evaluation using High-Performance Computing benchmarks reveals that SNIC-DSM improves performance when compared with software-based DSM.
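To illustrate the kind of decision a DSM protocol processor makes, the sketch below shows a simplified MSI-style state machine for one shared memory line. This is a generic textbook coherence protocol written as plain C; SNIC-DSM's actual protocol logic resides in the SmartNIC's FPGA and is not reproduced here, so all names in the sketch are hypothetical.

```c
/* Simplified MSI-style coherence state machine for one DSM line.
 * Illustrative only; not SNIC-DSM's actual protocol. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } dsm_event;

line_state dsm_next_state(line_state s, dsm_event e) {
    switch (e) {
    case LOCAL_READ:   return s == INVALID ? SHARED : s;   /* fetch a copy  */
    case LOCAL_WRITE:  return MODIFIED;        /* invalidate remote copies  */
    case REMOTE_READ:  return s == MODIFIED ? SHARED : s;  /* write back    */
    case REMOTE_WRITE: return INVALID;         /* another node takes owner  */
    }
    return s;
}

int main(void) {
    line_state s = INVALID;
    s = dsm_next_state(s, LOCAL_READ);   /* INVALID  -> SHARED   */
    s = dsm_next_state(s, LOCAL_WRITE);  /* SHARED   -> MODIFIED */
    s = dsm_next_state(s, REMOTE_READ);  /* MODIFIED -> SHARED   */
    printf("final state: %d\n", s);
    return 0;
}
```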
|
70 |
Predictability of Optimal Core Distribution Based on Weight and Speedup. Eriksson, Rasmus. January 2022.
Efficient use of hardware resources is a vital part of getting good results in high-performance computing. This thesis explores the predictability of the optimal CPU-core distribution between two tasks running in parallel on a shared-memory machine, with the goal of reaching the shortest possible total runtime. The predictions are based on the weight and speedup of each task, accounting for the CPU-frequency decrease that comes with a growing number of active cores in modern CPUs. The weight of a task is the number of floating-point operations needed to compute it to completion. The Intel oneAPI Math Kernel Library is used to create a set of tasks, each consisting of a single call to a dgemm routine. Two prediction algorithms for the optimal core distribution are presented and used in this thesis. Their predictions are compared to the fastest distribution observed by either running the tasks back-to-back, each using all available cores, or running the tasks simultaneously in two parallel regions. Experimental results suggest that there is merit to this method: the better of the two algorithms predicted the core distribution resulting in the fastest run in 14 out of 15 cases.
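The kind of prediction evaluated above can be illustrated with a toy model. The sketch below assumes a hypothetical linear frequency decrease of 1% per extra active core (the thesis derives its model from measurements, not from this formula) and exhaustively picks the core split that minimizes the longer of the two parallel runtimes.

```c
/* Toy model of weight/speedup-based core-split prediction.
 * The frequency model is an assumption for this sketch only. */
#include <stdio.h>

#define N_CORES 16

/* Hypothetical speedup of a task on c cores: linear scaling damped
   by a clock frequency that drops 1% per extra active core. */
static double speedup(int c) {
    double freq = 1.0 - 0.01 * (c - 1);
    return c * freq;
}

/* Return the number of cores to give task 1 so that the longer of
   the two parallel runtimes (weight / speedup) is minimized. */
int best_split(double weight1, double weight2) {
    int best_c = 1;
    double best_t = 1e300;
    for (int c = 1; c < N_CORES; c++) {
        double t1 = weight1 / speedup(c);
        double t2 = weight2 / speedup(N_CORES - c);
        double t = t1 > t2 ? t1 : t2;   /* tasks run in parallel */
        if (t < best_t) { best_t = t; best_c = c; }
    }
    return best_c;
}

int main(void) {
    /* e.g. task 1 needs three times the floating point operations */
    printf("give task 1 %d of %d cores\n", best_split(3e9, 1e9), N_CORES);
    return 0;
}
```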
|