21

Source code optimizations to reduce multi core and many core performance bottlenecks / Otimizações de código fonte para reduzir gargalos de desempenho em multi core e many core

Serpa, Matheus da Silva, January 2018
Nowadays, there are several different architectures available not only to industry but also to final consumers. Traditional multi-core processors, GPUs, accelerators such as the Xeon Phi, and even energy-efficiency-driven processors such as the ARM family present very different architectural characteristics. This wide range of characteristics presents a challenge for application developers, who must deal with different instruction sets, memory hierarchies, and even different programming paradigms when programming for these architectures. To optimize an application, it is important to have a deep understanding of how it behaves on different architectures. Related work offers a wide variety of solutions: most of it focuses on improving memory performance alone, while other efforts address load balancing, vectorization, and thread and data mapping, but apply them separately, losing optimization opportunities. In this master's thesis, we propose several optimization techniques to improve the performance of a real-world seismic exploration application provided by Petrobras, a multinational corporation in the petroleum industry. Our experiments show that loop interchange is a useful technique for improving the performance of different cache memory levels, improving performance by up to 5.3× and 3.9× on the Intel Broadwell and Intel Knights Landing architectures, respectively. By changing the code to enable vectorization, performance was increased by up to 1.4× and 6.5×. Load balancing improved performance by up to 1.1× on Knights Landing. Thread and data mapping techniques were also evaluated, with a performance improvement of up to 1.6× and 4.4×. We also compared the best version on each architecture and showed that we were able to improve the performance of Broadwell by 22.7× and Knights Landing by 56.7× compared to a naive version, but, in the end, Broadwell was 1.2× faster than Knights Landing.
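A minimal sketch of the loop-interchange technique the abstract credits with the largest cache gains, applied to a generic 2D array sweep (the kernel and sizes are illustrative, not the Petrobras seismic code):

```c
#include <stddef.h>

#define NX 2048
#define NY 2048

/* Column-major traversal of a row-major array: each access to a[i][j]
 * jumps NY elements, so nearly every access misses in cache. */
void sweep_strided(float a[NX][NY], float s)
{
    for (size_t j = 0; j < NY; j++)
        for (size_t i = 0; i < NX; i++)
            a[i][j] *= s;
}

/* After loop interchange: the inner loop walks contiguous memory,
 * so each fetched cache line is fully used before eviction. */
void sweep_interchanged(float a[NX][NY], float s)
{
    for (size_t i = 0; i < NX; i++)
        for (size_t j = 0; j < NY; j++)   /* contiguous, vectorizable */
            a[i][j] *= s;
}
```

The interchanged form also gives the compiler a unit-stride inner loop, which is exactly the shape needed for the vectorization gains the abstract reports separately.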
22

Cost-effective Designs for Supporting Correct Execution and Scalable Performance in Many-core Processors

Romanescu, Bogdan Florin, January 2010
Many-core processors offer new levels of on-chip performance by capitalizing on the increasing rate of device integration. Harnessing the full performance potential of these processors requires that hardware designers not only exploit the advantages, but also consider the problems introduced by the new architectures. Such challenges arise from both the processor's increased structural complexity and the reliability issues of the silicon substrate. In this thesis, we address these challenges in a framework that targets correct execution and performance on three coordinates: 1) tolerating permanent faults, 2) facilitating static and dynamic verification through precise specifications, and 3) designing scalable coherence protocols.

First, we propose CCA, a new design paradigm for increasing the processor's lifetime performance in the presence of permanent faults in cores. CCA chips rely on a reconfiguration mechanism that allows cores to replace faulty components with fault-free structures borrowed from neighboring cores. In contrast with existing solutions for handling hard faults that simply shut down cores, CCA aims to maximize the utilization of defect-free resources and increase the availability of on-chip cores. We implement three-core and four-core CCA chips and demonstrate that they offer a cumulative lifetime performance improvement of up to 65% for industry-representative utilization periods. In addition, we show that CCA benefits systems that employ modular redundancy to guarantee correct execution by increasing their availability.

Second, we target the correctness of the address translation system. Current processors often exhibit design bugs in their translation systems, and we believe one cause for these faults is a lack of precise specifications describing the interactions between address translation and the rest of the memory system, especially memory consistency. We address this aspect by introducing a framework for specifying translation-aware consistency models. As part of this framework, we identify the critical role played by address translation in supporting correct memory consistency implementations. Consequently, we propose a set of invariants that characterizes address translation. Based on these invariants, we develop DVAT, a dynamic verification mechanism for address translation. We demonstrate that DVAT is efficient in detecting translation-related faults, including several that mimic design bugs reported in processor errata. By checking the correctness of the address translation system, DVAT supports dynamic verification of translation-aware memory consistency.

Finally, we address the scalability of translation coherence protocols. Current software-based solutions for maintaining translation coherence adversely impact performance and do not scale. We propose UNITD, a hardware coherence protocol that supports scalable performance and architectural decoupling. UNITD integrates translation coherence within the regular cache coherence protocol, such that TLBs participate in the cache coherence protocol similar to instruction or data caches. We evaluate snooping and directory UNITD coherence protocols on processors with up to 16 cores and demonstrate that UNITD reduces the performance penalty of translation coherence to almost zero.
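As a rough illustration of the UNITD idea summarized above, the sketch below keeps, for every TLB entry, the physical address of the page-table entry it was filled from, so that ordinary cache-coherence invalidations can also invalidate stale translations. All structures and names here are invented for illustration; the real protocol is integrated with the processor's coherence state machines:

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

/* Each TLB entry remembers the physical address of the page-table
 * entry (PTE) it was filled from, so it can be matched against
 * coherence invalidations, the core idea behind UNITD. */
struct tlb_entry {
    bool     valid;
    uint64_t vpn;      /* virtual page number */
    uint64_t ppn;      /* physical page number */
    uint64_t pte_pa;   /* physical address of the source PTE */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Called when the cache coherence protocol observes a store that
 * invalidates the cache line holding addresses [pa, pa+64). Any TLB
 * entry cached from a PTE in that line is dropped, replacing
 * software TLB-shootdown interrupts with ordinary coherence actions. */
void unitd_snoop_invalidate(uint64_t pa)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && (tlb[i].pte_pa & ~63ULL) == (pa & ~63ULL))
            tlb[i].valid = false;
}
```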
23

Concurrent Online Testing for Many Core Systems-on-Chips

Lee, Jason Daniel, December 2010
Shrinking transistor sizes have introduced new challenges and opportunities for system-on-chip (SoC) design and reliability. Smaller transistors are more susceptible to early lifetime failure and electronic wear-out, greatly reducing their reliable lifetimes. However, smaller transistors will also allow SoCs to contain hundreds of processing cores and other infrastructure components with the potential for increased reliability through massive structural redundancy. Concurrent online testing (COLT) can provide sufficient reliability and availability to systems with this redundancy. COLT manages the process of testing a subset of processing cores while the rest of the system remains operational. This can be considered a temporary, graceful degradation of system performance that increases reliability while maintaining availability. In this dissertation, techniques to assist COLT are proposed and analyzed. The techniques described in this dissertation focus on two major aspects of COLT feasibility: recovery time and test delivery costs. To reduce the time between failure and recovery, and thereby increase system availability, an anomaly-based test triggering unit (ATTU) is proposed to initiate COLT when anomalous network behavior is detected. Previous COLT techniques have relied on initiating tests periodically. However, determining the testing period is based on a device's mean time between failures (MTBF), and calculating MTBF is exceedingly difficult and imprecise. To address the test delivery costs associated with COLT, a distributed test vector storage (DTVS) technique is proposed to eliminate the dependency of test delivery costs on core location. Previous COLT techniques have relied on a single location to store test vectors, and it has been demonstrated that centralized storage of tests scales poorly as the number of cores per SoC grows. Assuming that the SoC organizes its processing cores with a regular topology, DTVS uses an interleaving technique to optimally distribute the test vectors across the entire chip. DTVS is analyzed both empirically and analytically, and a testing protocol using DTVS is described. COLT is only feasible if the applications running concurrently are largely unaffected. The effect of COLT on application execution time is also measured in this dissertation, and an application-aware COLT protocol is proposed and analyzed. Application interference is greatly reduced through this technique.
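The DTVS idea of distributing test vectors across a regular topology can be sketched with a simple interleaved placement function. The 2D mesh, its dimensions, and the mapping below are illustrative assumptions, not the dissertation's exact scheme:

```c
#define MESH_W 8            /* cores per row    */
#define MESH_H 8            /* cores per column */
#define NUM_VECTORS 256     /* test vectors to distribute */

/* Round-robin interleaving of test vectors over the mesh: vector v
 * is stored at tile (v mod W, (v / W) mod H), so consecutive vectors
 * land on neighboring tiles instead of in one central store. */
void vector_home(int v, int *x, int *y)
{
    *x = v % MESH_W;
    *y = (v / MESH_W) % MESH_H;
}

/* Manhattan distance a core at (cx, cy) travels to fetch vector v.
 * With interleaving this stays small and roughly uniform for every
 * core, instead of growing with distance to a centralized store. */
int fetch_hops(int cx, int cy, int v)
{
    int x, y;
    vector_home(v, &x, &y);
    return (x > cx ? x - cx : cx - x) + (y > cy ? y - cy : cy - y);
}
```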
24

Designing heterogeneous many-core processors to provide high performance under limited chip power budget

Woo, Dong Hyuk 04 October 2010 (has links)
This thesis describes the efficient design of a future many-core processor that can provide high performance under a limited chip power budget. To achieve such a goal, this thesis first develops an analytical framework within which computer architects can estimate the achievable performance improvement of different many-core architectures given the same power budget. From this study, this thesis found that a future many-core processor needs (1) energy-efficient parallel cores and (2) a high-performance sequential core. Based on these observations, this thesis proposes an energy-efficient broad-purpose acceleration layer that can be snapped on top of a conventional general-purpose processor. In addition to these energy-efficient parallel cores, this thesis also proposes different architectural techniques to further boost the performance of sequential computation while those parallel cores are idle. In particular, this thesis develops low-cost architectural techniques to enhance the memory performance of a host core by utilizing those idle parallel cores. This idea is evaluated in two different system architectures: one with the aforementioned acceleration layer and the other with an emerging integrated CPU and GPU chip.
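The abstract's analytical framework is in the spirit of power-constrained Amdahl's-law models. Below is a minimal sketch of one such model, under the common (assumed) Pollack's-rule scaling perf(w) = sqrt(w) and a power budget measured in small-core units; the equations and parameters are illustrative, not necessarily the thesis's own:

```c
#include <math.h>
#include <stdio.h>

/* One big sequential core of power "weight" w delivers performance
 * sqrt(w) (an assumed Pollack's-rule scaling); the remaining budget
 * runs small parallel cores of performance 1 each.
 * f = parallel fraction of the workload. */
static double speedup(double f, double budget, double w)
{
    double seq_perf  = sqrt(w);
    double par_cores = budget - w;   /* power left for small cores */
    if (par_cores < 1.0) return 0.0;
    return 1.0 / ((1.0 - f) / seq_perf + f / par_cores);
}

int main(void)
{
    /* Sweep the split of a fixed budget between one fast sequential
     * core and many small parallel cores; the optimum lies between
     * the extremes, echoing the finding that both are needed. */
    for (double w = 1.0; w <= 32.0; w *= 2.0)
        printf("big core = %5.1f  ->  speedup = %6.2f\n",
               w, speedup(0.95, 64.0, w));
    return 0;
}
```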
25

Memory-subsystem resource management for the many-core era

Kaseridis, Dimitrios 11 July 2012 (has links)
As semiconductor technology continues to scale lower in the nanometer era, the communication between processor and main memory has been particularly challenged. The well-studied frequency, memory, and power "walls" have redirected architects towards the Chip Multiprocessor (CMP) as an attractive architecture for leveraging technology scaling. In order to achieve high efficiency and throughput, CMPs rely heavily on sharing resources among multiple cores, especially in the case of the memory hierarchy. Unfortunately, such sharing introduces resource contention and interference between the multiple executing threads. The ever-increasing access latency gap between processor and memory, the gradually increasing bandwidth demands on main memory, and the decreasing cache capacity available to each core as more cores are integrated have made the need for efficient memory-subsystem resource management more critical than ever before.

This dissertation focuses on managing the sharing of Last-Level Cache (LLC) capacity and main memory bandwidth, the two resources that most significantly affect system performance and energy consumption. The presented schemes include efficient solutions to all three basic requirements for implementing a resource management scheme: a) profiling mechanisms to capture applications' resource requirements, b) microarchitecture mechanisms to enforce a resource allocation scheme, and c) resource allocation algorithms/policies to manage the available memory resources throughout the whole memory hierarchy of a CMP system. To achieve these targets, the dissertation first describes a set of low-overhead, non-invasive profiling mechanisms that project applications' memory resource requirements and memory sharing behavior. Two memory resource partitioning schemes are then presented. The first, the Bank-aware dynamic partitioning scheme, provides a low-overhead solution for partitioning cache resources of large CMP architectures based on a Dynamic Non-Uniform Cache Architecture (DNUCA) last-level cache design, consistent with current industry trends. The second, the Bandwidth-aware dynamic scheme, presents a system-wide optimization of memory-subsystem resource allocation and job scheduling for large, multi-chip CMP systems, seeking optimizations both within and across CMP chips and aiming at overall system throughput and efficiency improvements. As cache partitioning schemes with isolated partitions impose restrictions on the use of the last-level cache that can severely affect the performance of large CMP designs, this dissertation also presents a Quasi-partitioning scheme that breaks such restrictions while providing most of the benefits of cache partitioning; this solution efficiently scales to a significantly larger number of cores than the previously described schemes based on isolated partitions. Finally, as the memory controller is one of the fundamental components of the memory subsystem, a well-designed memory-subsystem resource manager needs to carefully utilize the memory controller's resources and coordinate its functionality with the operation of main memory and the last-level cache. To improve execution fairness and system throughput, this dissertation presents a criticality-based memory controller request priority scheme that ranks demand read and prefetch operations based on their latency sensitivity, while coordinating its operation with the DRAM page-mode policy and the memory data prefetcher.
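The profiling-plus-allocation structure the abstract describes can be sketched with a greedy marginal-utility allocator over last-level-cache ways; the miss-curve input and the greedy policy below are illustrative stand-ins, not the dissertation's Bank-aware algorithm:

```c
#define CORES 4
#define WAYS  16

/* misses[c][w]: profiled miss count for core c when given w ways
 * (w = 0..WAYS). A real system fills this from shadow-tag or
 * set-sampling profilers; here it is simply a declared input. */
static long misses[CORES][WAYS + 1];

/* Greedy marginal-utility partitioning: repeatedly hand the next
 * cache way to whichever core's miss count drops the most for it. */
void partition(int alloc[CORES])
{
    for (int c = 0; c < CORES; c++)
        alloc[c] = 0;

    for (int w = 0; w < WAYS; w++) {
        int  best = 0;
        long best_gain = -1;
        for (int c = 0; c < CORES; c++) {
            if (alloc[c] == WAYS) continue;
            long gain = misses[c][alloc[c]] - misses[c][alloc[c] + 1];
            if (gain > best_gain) { best_gain = gain; best = c; }
        }
        alloc[best]++;   /* give the way to the biggest beneficiary */
    }
}
```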
26

New abstractions and mechanisms for virtualizing future many-core systems

Kumar, Sanjay 08 July 2008 (has links)
To abstract physical into virtual computing infrastructures is a longstanding goal. Efforts in the computing industry started with early work on virtual machines in IBM's VM370 operating system and architecture, continued with extensive developments in distributed systems in the context of grid computing, and now involve investments by key hardware and software vendors to efficiently virtualize common hardware platforms. Recent efforts in virtualization technology are driven by two facts: (i) technology push -- new hardware support for virtualization in multi- and many-core hardware platforms and in the interconnects and networks used to connect them, and (ii) technology pull -- the need to efficiently manage large-scale data-centers used for utility computing and, extending from there, to also manage more loosely coupled virtual execution environments like those used in cloud computing. Concerning (i), platform virtualization is proving to be an effective way to partition and then efficiently use the ever-increasing number of cores in many-core chips. Further, I/O virtualization enables I/O device sharing with increased device throughput, providing required I/O functionality to the many virtual machines (VMs) sharing a single platform. Concerning (ii), through server consolidation and VM migration, for instance, virtualization increases the flexibility of modern enterprise systems and creates opportunities for improvements in operational efficiency, power consumption, and the ability to meet time-varying application needs.

This thesis contributes (i) new technologies that further increase system flexibility by addressing key problems of existing virtualization infrastructures, and (ii) lightweight, efficient management technologies and techniques that exploit the resulting increased flexibility to improve data-center operations, e.g., power management, across the range from individual many-core platforms to data-center systems. Concerning (i), the thesis contributes, for large many-core systems, insights into how to better structure virtual machine monitors (VMMs) to provide more efficient utilization of cores, by implementing and evaluating the novel Sidecore approach, which permits VMMs to exploit the computational power of parallel cores to improve overall VMM and I/O performance. Further, I/O virtualization still lacks the ability to provide complete transparency between virtual and physical devices, thereby limiting VM mobility and flexibility in accessing devices. In response, this thesis defines and implements the novel Netchannel abstraction, which provides complete location transparency between virtual and physical I/O devices, thereby decoupling device access from device location and enabling live VM migration and device hot-swapping. Concerning (ii), the vManage set of abstractions, mechanisms, and methods developed in this work is shown to substantially improve system manageability by providing a lightweight, system-level architecture for implementing and running the management applications required in data-center and cloud computing environments. vManage simplifies management by making it possible and easier to coordinate the management actions taken by the many management applications and subsystems present in data-center and cloud computing systems.

Experimental evaluations of the Sidecore approach to VMM structure, of Netchannel, and of vManage are conducted on representative platforms and server systems, with consequent improvements in flexibility, I/O performance, and management efficiency, including power management.
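The Sidecore approach mentioned above routes privileged operations to a dedicated core instead of taking a costly VM exit on the requesting core. Below is a minimal sketch of such a request channel, with invented structure and function names (the real mechanism lives inside the hypervisor):

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING 256

/* Single-producer/single-consumer ring: a guest-facing core posts
 * privileged requests; the sidecore polls and services them, so the
 * guest core never takes a VM exit for these operations. */
struct sidecore_ring {
    _Atomic uint32_t head;        /* written by the producer core */
    _Atomic uint32_t tail;        /* written by the sidecore      */
    uint64_t         req[RING];   /* encoded requests             */
};

int post_request(struct sidecore_ring *r, uint64_t encoded)
{
    uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == RING)
        return -1;                /* ring full, caller retries */
    r->req[h % RING] = encoded;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 0;
}

/* Sidecore loop body: drain and service all pending requests. */
void sidecore_poll(struct sidecore_ring *r, void (*service)(uint64_t))
{
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    while (t != atomic_load_explicit(&r->head, memory_order_acquire)) {
        service(r->req[t % RING]);
        t++;
    }
    atomic_store_explicit(&r->tail, t, memory_order_release);
}
```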
28

Porting a Real-Time Operating System to a Multicore Platform

Sjöström Thames, Sixten, January 2012
This thesis is part of the European MANY project, whose goal is to provide developers with tools for developing software for multi- and many-core hardware platforms. This is the first thesis within MANY at Enea. The thesis aims to build a knowledge base about many-core software for the Enea student research group. Beyond building that knowledge base, part of the thesis is to port Enea's operating system OSE to Tilera's many-core processor TILEpro64, and to investigate the memory hierarchy and interconnection network of the Tilera processor. The knowledge-base work was constrained to investigating the shared memory model and operating systems for many-core, by surveying prominent academic research on operating systems for many-core processors. The conclusion was that a shared memory model does not scale, and that operating systems for many-core should be designed with scalability as one of the most important requirements. This thesis implemented the hardware abstraction layer required to execute a single-core version of OSE on the TILEpro architecture. This was done in three steps: the Tilera hardware and the OSE software platform were investigated; an existing OSE target port was chosen as reference architecture; and finally the hardware-dependent parts of the reference software were modified. A foundation has been laid for future development.
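Ports like the one described typically center on re-implementing a small hardware abstraction layer; the interface below is a generic illustration of such a HAL, not Enea's actual OSE API:

```c
#include <stdint.h>

/* A minimal RTOS hardware abstraction layer: the architecture-
 * independent kernel calls only these hooks, so a port to a new
 * target such as the TILEpro64 means re-implementing this table. */
struct hal_ops {
    void     (*init_cpu)(void);                 /* traps, timers, MMU  */
    void     (*context_switch)(void **from_sp,  /* save/restore stacks */
                               void  *to_sp);
    void     (*irq_enable)(unsigned irq);
    void     (*irq_disable)(unsigned irq);
    uint64_t (*ticks)(void);                    /* monotonic time base */
};

/* Each target's board support package provides one instance; the
 * kernel is linked against whichever target is being built. */
extern const struct hal_ops *hal;

static inline void kernel_idle(void)
{
    /* Architecture-independent code stays free of Tilera details. */
    hal->irq_enable(0);
    for (;;) { /* wait for interrupt via a HAL-provided mechanism */ }
}
```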
29

On-chip Pipelined Parallel Mergesort on the Intel Single-Chip Cloud Computer

Avdic, Kenan, January 2014
With the advent of mass-market consumer multicore processors, the consumer off-the-shelf general-purpose processor industry has moved away from increasing clock frequency as the classical approach for achieving higher performance, a shift commonly attributed to the well-known power consumption and heat dissipation problems of high frequencies and voltages. This paradigm shift has prompted research into the relatively new field of "many-core" processors, such as the Intel Single-chip Cloud Computer (SCC). The SCC is a concept vehicle: an experimental homogeneous architecture employing 48 IA-32 cores interconnected by a high-speed communication network. As similar multiprocessor systems, such as the Cell Broadband Engine, demonstrate significantly higher aggregate bandwidth in the interconnect network than in memory, we examine the viability of a pipelined approach to sorting on the Intel SCC. By tailoring an algorithm to the architecture, we investigate whether this is also the case on the SCC and whether a pipelining technique alleviates the classical memory bottleneck or provides other performance benefits. For this purpose, we employ and combine different classic algorithms, most significantly parallel mergesort and samplesort.
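One stage of the pipelined mergesort idea can be sketched as a core that merges two sorted input streams and immediately forwards results downstream, so intermediate data never round-trips through off-chip memory. The channel API below is a hypothetical stand-in for the SCC's on-chip message-passing buffers:

```c
/* Blocking stream operations standing in for the SCC's on-chip
 * message-passing buffers (hypothetical API for this sketch). */
typedef struct channel channel_t;
int  chan_recv(channel_t *c, int *out);   /* returns 0 at end-of-stream */
void chan_send(channel_t *c, int v);
void chan_close(channel_t *c);

/* One pipeline stage: merge two sorted streams into one, element by
 * element, forwarding each result downstream immediately. Chaining
 * log2(p) levels of these stages across cores yields a pipelined
 * mergesort whose traffic stays on the on-chip network. */
void merge_stage(channel_t *in0, channel_t *in1, channel_t *out)
{
    int a, b;
    int have_a = chan_recv(in0, &a);
    int have_b = chan_recv(in1, &b);

    while (have_a && have_b) {
        if (a <= b) { chan_send(out, a); have_a = chan_recv(in0, &a); }
        else        { chan_send(out, b); have_b = chan_recv(in1, &b); }
    }
    while (have_a) { chan_send(out, a); have_a = chan_recv(in0, &a); }
    while (have_b) { chan_send(out, b); have_b = chan_recv(in1, &b); }
    chan_close(out);
}
```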
30

Modeling performance of serial and parallel sections of multi-threaded programs in many-core era / Modélisation de la performance des sections séquentielles et parallèles au sein de programmes multithreadés à l'ère des many-coeurs

Khizakanchery Natarajan, Surya Narayanan 01 June 2015 (has links)
This thesis work was done in the general context of the ERC-funded Defying Amdahl's Law (DAL) project, which aims at exploring the micro-architectural techniques that will enable high performance on future many-core processors. The project envisions that despite future huge investments in developing parallel applications and porting them to parallel architectures, most applications will still exhibit a significant amount of sequential code, and hence we should still focus on improving the performance of the serial sections of an application. In this thesis, the research work primarily focuses on studying the difference between the parallel and serial sections of existing multi-threaded (MT) programs, and on exploring the design space with respect to the processor core requirements of serial and parallel sections in future many-cores, with the area-performance tradeoff as a primary concern.
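As a toy illustration of separating serial from parallel sections, the sketch below times each kind of section of an OpenMP program and reports the serial fraction that, by Amdahl's law, bounds many-core speedup (the workload is a placeholder, not one of the thesis's benchmarks):

```c
#include <omp.h>
#include <stdio.h>

/* Time the serial and parallel sections separately, then report the
 * serial fraction that limits speedup on a many-core machine. */
int main(void)
{
    double sum = 0.0, t0, t_serial, t_parallel;

    t0 = omp_get_wtime();
    for (int i = 0; i < 50000000; i++)          /* serial section */
        sum += (double)i * 1e-9;
    t_serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)   /* parallel section */
    for (int i = 0; i < 50000000; i++)
        sum += (double)i * 1e-9;
    t_parallel = omp_get_wtime() - t0;

    double f_serial = t_serial / (t_serial + t_parallel);
    printf("serial fraction = %.2f (checksum %g)\n", f_serial, sum);
    return 0;
}
```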
