Global ETD Search

1	Optimizing parallel simulation of multi-core system Dong, Zhenjiang 27 May 2016 (has links) Multi-core design for CPU is the recent trend and we believe the trend will continue in near future. Researchers and industry architects utilize simulation to evaluate their designs and gain a certain level of confidence before manufacturing the actual products. Due to the fact that modern multi-core systems are complex, traditional sequential simulation can hit the bottlenecks in terms of execution time. To handle the complexity, Parallel Discrete Event Simulation (PDES) programs are employed. PDES program with well-designed partitioning schemes, synchronization algorithm and other optimizations can take advantage of the parallel hardware and achieve scalability for the simulation of multi-core systems. The objective of this dissertation is to design, develop, test and evaluate a variety of technologies to improve the performance and efficiency of parallel simulation of multi-core systems. The technologies include a general guide for partitioning schemes, an efficient front-end for timing-directed simulation, and a new conservative synchronization algorithm. Parallel simulation Multicore-system
2	Parallel Algorithms for Time and Frequency Domain Circuit Simulation Dong, Wei 2009 August 1900 (has links) As a most critical form of pre-silicon verification, transistor-level circuit simulation is an indispensable step before committing to an expensive manufacturing process. However, considering the nature of circuit simulation, it can be computationally expensive, especially for ever-larger transistor circuits with more complex device models. Therefore, it is becoming increasingly desirable to accelerate circuit simulation. On the other hand, the emergence of multi-core machines offers a promising solution to circuit simulation besides the known application of distributed-memory clustered computing platforms, which provides abundant hardware computing resources. This research addresses the limitations of traditional serial circuit simulations and proposes new techniques for both time-domain and frequency-domain parallel circuit simulations. For time-domain simulation, this dissertation presents a parallel transient simulation methodology. This new approach, called WavePipe, exploits coarse-grained application-level parallelism by simultaneously computing circuit solutions at multiple adjacent time points in a way resembling hardware pipelining. There are two embodiments in WavePipe: backward and forward pipelining schemes. While the former creates independent computing tasks that contribute to a larger future time step, the latter performs predictive computing along the forward direction. Unlike existing relaxation methods, WavePipe facilitates parallel circuit simulation without jeopardizing convergence and accuracy. As a coarse-grained parallel approach, it requires low parallel programming effort, furthermore it creates new avenues to have a full utilization of increasingly parallel hardware by going beyond conventional finer grained parallel device model evaluation and matrix solutions. This dissertation also exploits the recently developed explicit telescopic projective integration method for efficient parallel transient circuit simulation by addressing the stability limitation of explicit numerical integration. The new method allows the effective time step controlled by accuracy requirement instead of stability limitation. Therefore, it not only leads to noticeable efficiency improvement, but also lends itself to straightforward parallelization due to its explicit nature. For frequency-domain simulation, this dissertation presents a parallel harmonic balance approach, applicable to the steady-state and envelope-following analyses of both driven and autonomous circuits. The new approach is centered on a naturally-parallelizable preconditioning technique that speeds up the core computation in harmonic balance based analysis. The proposed method facilitates parallel computing via the use of domain knowledge and simplifies parallel programming compared with fine-grained strategies. As a result, favorable runtime speedups are achieved. circuit simulation parallelization multicore
3	Galois : a system for parallel execution of irregular algorithms Nguyen, Donald Do 04 September 2015 (has links) A programming model which allows users to program with high productivity and which produces high performance executions has been a goal for decades. This dissertation makes progress towards this elusive goal by describing the design and implementation of the Galois system, a parallel programming model for shared-memory, multicore machines. Central to the design is the idea that scheduling of a program can be decoupled from the core computational operator and data structures. However, efficient programs often require application-specific scheduling to achieve best performance. To bridge this gap, an extensible and abstract scheduling policy language is proposed, which allows programmers to focus on selecting high-level scheduling policies while delegating the tedious task of implementing the policy to a scheduler synthesizer and runtime system. Implementations of deterministic and prioritized scheduling also are described. An evaluation of a well-studied benchmark suite reveals that factoring programs into operators, schedulers and data structures can produce significant performance improvements over unfactored approaches. Comparison of the Galois system with existing programming models for graph analytics shows significant performance improvements, often orders of magnitude more, due to (1) better support for the restrictive programming models of existing systems and (2) better support for more sophisticated algorithms and scheduling, which cannot be expressed in other systems. / text Parallel programming Multicore
4	Design And Analysis Of Time-Predicatable Single-Core And Multi-Core Processors Yan, Jun 01 January 2009 (has links) (PDF) Time predictability is one of the most important design considerations for real-time systems. In this dissertation, time predictability of the instruction cache is studied on both single core processors and multi-core processors. It is observed that many features in modern microprocessor architecture such as cache memories and branch prediction are in favor of average-case performance, which can significantly compromise the time predictability and make accurate worst-case performance analysis extremely difficult if not impossible. Therefore, the time predictability of VLIW (Very Long Instruction Word) processors and its compiler support is studied. The impediments to time predictability for VLIW processors are analyzed and compiler-based techniques to address these problems with minimal modifications to the VLIW hardware design are proposed. Specifically, the VLIW compiler is enhanced to support full if conversion, hyperblock scheduling, and intra-block nop insertion to enable efficient WCET (Worst Case Execution Time) analysis for VLIW processors. Our time-predictable processor incorporates the instruction caches which can mitigate the latency of fetching instructions that hit in the cache. For instruction missing from the cache, instruction prefetching is a useful technique to boost the average-case performance. However, it is unclear whether or not instruction prefetching can benefit the worst-case performance as well. Thus, the impact of instruction prefetching on the worst-case performance of instruction caches is studied. Extension of the static cache simulation technique is applied to model and compute the worst-case instruction cache performance with prefetching. It is shown that instruction prefetching can be reasonably bound, however, the time variation of computing is increased by instruction prefetching. As the technology advances, it is projected that multi-core chips will be increasingly adopted by microprocessor industry. For real-time systems to safely harness the potential of multi-core computing, designers must be able to accurately obtain the worst-case execution time (WCET) of applications running on multi-core platforms, which is very challenging due to the possible runtime inter-core interferences in using shared resources such as the shared L2 caches. As the first step toward time-predictable multi-core computing, this dissertation presents novel approaches to bounding the worst-case performance for threads running on multi-core processors with shared L2 instruction caches. CF (Control Flow) based approach. This approach computes the worst-case instruction access interferences between different threads based on the program control flow information of each thread, which can be statically analyzed. Extended ILP (Integer Linear Programming) based approach. This approach uses constraint programming to model the worst-case instruction access interferences between different threads. In the context of timing analysis in many core architecture, static approaches may also face the scalability issue. Thus, it is important and challenging to design time predictable caches in multi-core architecture. We propose an approach to leverage the prioritized shared L2 caches to improve time predictability for real-time threads running on multi-core processors. The prioritized shared L2 caches give higher priority to real-time threads while allowing low-priority threads to use shared L2 cache space that is available. Detailed implementation and experimental results discussion are presented in this dissertation. cache multicore VLIW WCET
5	Intra- and Inter-chip Communication Support for Asymmetric Multicore Processors with Explicitly Managed Memory Hierarchies Rose, Benjamin Aaron 10 June 2009 (has links) The use of asymmetric multi-core processors with on-chip computational accelerators is becoming common in a variety of environments ranging from scientific computing to enterprise applications. The focus of current research has been on making efficient use of individual systems, and porting applications to asymmetric processors. The use of these asymmetric processors, like the Cell processor, in a cluster setting is the inspiration for the Cell Connector framework presented in this thesis. Cell Connector adopts a streaming approach for providing data to compute nodes with high computing potential but limited memory resources. Instead of dividing very large data sets once among computation resources, Cell Connector slices, distributes, and collects work units off of a master data held by a single large memory machine. Using this methodology, Cell Connector is able to maximize the use of limited resources and produces results that are up to 63.3\% better compared to standard non-streaming approaches. / Master of Science Cell BE multicore cluster
6	Paralelização automática de laços para arquiteturas multicore / Automatic loop parallelization for multicore architectures Vieira, Cristianno Martins 11 August 2010 (has links) Orientador: Sandro Rigo / Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação / Made available in DSpace on 2018-08-17T08:17:12Z (GMT). No. of bitstreams: 1 Vieira_CristiannoMartins_M.pdf: 1981128 bytes, checksum: 5af9a00808029ad96cd8d02e569b1cda (MD5) Previous issue date: 2010 / Resumo: Embora muitos programas possuam uma forma regular de paralelismo, que pode ser expressa em termos de laços paralelos, muitos exemplos importantes não a possuem. Loop skewing é uma transformação que remodela o espaço de iteração dos laços para que seja possível expressar o paralelismo implícito através de laços paralelos. Como consequência da complexidade em se modificar o espaço de iteração dos laços, e de possíveis problemas causados por transformações deste tipo - como o possível aumento na taxa de miss em caches -, no geral, elas não são largamente utilizadas. Neste projeto, implementamos a transformação loop skewing sobre o compilador da linguagem C presente no GCC (GNU Compiler Collection), de forma a permitir a assistência pelo programador. Utilizamos a ferramenta Graphite como base para a implementação da otimização, apenas representando-a como uma transformação afim sobre um objeto matemático multidimensional chamado polítopo. Mostramos, através de um estudo detalhado sobre o modelo matemático denominado modelo politópico, que laços com estruturas específicas - perfeitamente aninhados, com limites e acesso á memória descritos por funções afins - poderiam ser representados como polítopos, e que transformações aplicadas a estes seriam espelhadas no código gerado a partir desses polítopos. Dessa forma, qualquer transformação que possa ser estruturada como uma transformação afim sobre um polítopo, poderá ser implementada. Mostramos, ainda, durante a análise de desempenho, que transformações deste tipo são viáveis e, apesar de algumas limitações impostas pela infraestrutura do GCC, aumentam relativamente o desempenho das aplicações compiladas com ela - obtivemos um ganho máximo de aproximadamente 115% para o uso de quatro threads em uma das aplicações executadas. Verificamos o impacto do uso de programas já paralelizados manualmente sobre a plataforma, e obtivemos um ganho máximo de 11% nesses casos, mostrando que ainda aplicações paralelizadas podem conter paralelismo implícito / Abstract: Although many programs present a regular form of parallelism, which can be expressed as parallel loops, many important examples do not. Loop skewing is a transformation that reorganizes the iteration space of loops to make it possible to expose the implicit parallelism through parallel loops. In general, as a consequence of the complexity in modifying the iteration space of loops, and possible problems caused by such changes - such as the possibility of increasing the miss rate in caches -, they are not widely used. In this work, the loop skewing transformation was implemented on GCC's C compiler (GNU Compiler Collection), allowing programmer's assistance. Graphite provides us a basis for implementation of the optimization, just representing it as an a_ne transformation on a multidimensional mathematical object called polytope. We show, through a detailed study about the mathematical model called polytope model, that for a very restricted loop structure - perfectly nested, with limits and memory accesses described by a_ne functions - could be represented as polytopes, and transformations applied to these would be carried by the code generated from these polytope. Thus, any transformation that could be structured as an a_ne transformation on a polytope, could be added. We also show, by means of performance analysis, that this type of transformation is feasible and, despite some limitations imposed by the still under development GCC's infrastructure for auto-parallelization, fairly increases the performance of some applications compiled with it - we achived a maximum of about 115% using four threads with one of the applications. We also veriéd the impact of using manually parallelized programs on this platform, and achieved a maximum gain of 11% in these cases, showing that even parallel applications may have implicit parallelism / Mestrado / Ciência da Computação / Mestre em Ciência da Computação Processadores multicore Arquitetura de computador Politopos Multicore processors Computer architecture Polytopes
7	A Multicore Computing Platform for Benchmarking Dynamic Partial Reconfiguration Based Designs Thorndike, David Andrew 27 August 2012 (has links) No description available. multicore reconfigurable computing DPR
8	Comparative study of parallel programming models for multicore computing Ali, Akhtar January 2013 (has links) Shared memory multi-core processor technology has seen a drastic developmentwith faster and increasing number of processors per chip. This newarchitecture challenges computer programmers to write code that scales overthese many cores to exploit full computational power of these machines.Shared-memory parallel programming paradigms such as OpenMP and IntelThreading Building Blocks (TBB) are two recognized models that offerhigher level of abstraction, shields programmers from low level detailsof thread management and scales computation over all available resources.At the same time, need for high performance power-ecient computing iscompelling developers to exploit GPGPU computing due to GPU's massivecomputational power and comparatively faster multi-core growth. Thistrend leads to systems with heterogeneous architectures containing multicoreCPUs and one or more programmable accelerators such as programmableGPUs. There exist dierent programming models to program these architecturesand code written for one architecture is often not portable to anotherarchitecture. OpenCL is a relatively new industry standard framework, de-ned by Khronos group, which addresses the portability issue. It oers aportable interface to exploit the computational power of a heterogeneous setof processors such as CPUs, GPUs, DSP processors and other accelerators. In this work, we evaluate the eectiveness of OpenCL for programmingmulti-core CPUs in a comparative case study with two CPU specic stableframeworks, OpenMP and Intel TBB, for ve benchmark applicationsnamely matrix multiply, LU decomposition, image convolution, Pi value approximationand image histogram generation. The evaluation includes aperformance comparison of the three frameworks and a study of the relativeeects of applying compiler optimizations on performance numbers.OpenCL performance on two vendor-dependent platforms Intel and AMD,is also evaluated. Then the same OpenCL code is ported to a modern GPUand its code correctness and performance portability is investigated. Finally,usability experience of coding using the three multi-core frameworksis presented. OpenCL OpenMP TBB Multicore frameworks
9	Architectures and limits of GPU-CPU heterogeneous systems Wong, Henry Ting-Hei 11 1900 (has links) As we continue to be able to put an increasing number of transistors on a single chip, the answer to the perpetual question of what the best processor we could build with the transistors is remains uncertain. Past work has shown that heterogeneous multiprocessor systems provide benefits in performance and efficiency. This thesis explores heterogeneous systems composed of a traditional sequential processor (CPU) and highly parallel graphics processors (GPU). This thesis presents a tightly-coupled heterogeneous chip multiprocessor architecture for general-purpose non-graphics computation and a limit study exploring the potential benefits of GPU-like cores for accelerating a set of general-purpose workloads. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with GMA X4500 GPU cores. Pangaea introduces a resource partitioning of the GPU, where 3D graphics-specific hardware is removed to reduce area or add more processing cores, and a 3-instruction extension to the IA32 ISA that supports fast communication between CPU and GPU by building user-level interrupts on top of existing cache coherency mechanisms. By removing graphics-specific hardware on a 65 nm process, the area saved is equivalent to 9 GPU cores, while the power saved is equivalent to 5 cores. Our FPGA prototype shows thread spawn latency improvements from thousands of clock cycles to 26. A set of non-graphics workloads demonstrate speedups of up to 8.8x. This thesis also presents a limit study, where we measure the limit of algorithm parallelism in the context of a heterogeneous system that can be usefully extracted from a set of general-purpose applications. We measure sensitivity to the sequential performance (register read-after-write latency) of the low-cost parallel cores, and latency and bandwidth of the communication channel between the two cores. Using these measurements, we propose system characteristics that maximize area and power efficiencies. As in previous limit studies, we find a high amount of parallelism. We show, however, that the potential speedup on GPU-like systems is low (2.2x - 12.7x) due to poor sequential performance. Communication latency and bandwidth have comparatively small performance effects (<25%). Optimal area efficiency requires a lower-cost parallel processor while optimal power efficiency requires a higher-performance parallel processor than today's GPUs. Multicore processors Computer architecture GPU
10	Architectures and limits of GPU-CPU heterogeneous systems Wong, Henry Ting-Hei 11 1900 (has links) As we continue to be able to put an increasing number of transistors on a single chip, the answer to the perpetual question of what the best processor we could build with the transistors is remains uncertain. Past work has shown that heterogeneous multiprocessor systems provide benefits in performance and efficiency. This thesis explores heterogeneous systems composed of a traditional sequential processor (CPU) and highly parallel graphics processors (GPU). This thesis presents a tightly-coupled heterogeneous chip multiprocessor architecture for general-purpose non-graphics computation and a limit study exploring the potential benefits of GPU-like cores for accelerating a set of general-purpose workloads. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with GMA X4500 GPU cores. Pangaea introduces a resource partitioning of the GPU, where 3D graphics-specific hardware is removed to reduce area or add more processing cores, and a 3-instruction extension to the IA32 ISA that supports fast communication between CPU and GPU by building user-level interrupts on top of existing cache coherency mechanisms. By removing graphics-specific hardware on a 65 nm process, the area saved is equivalent to 9 GPU cores, while the power saved is equivalent to 5 cores. Our FPGA prototype shows thread spawn latency improvements from thousands of clock cycles to 26. A set of non-graphics workloads demonstrate speedups of up to 8.8x. This thesis also presents a limit study, where we measure the limit of algorithm parallelism in the context of a heterogeneous system that can be usefully extracted from a set of general-purpose applications. We measure sensitivity to the sequential performance (register read-after-write latency) of the low-cost parallel cores, and latency and bandwidth of the communication channel between the two cores. Using these measurements, we propose system characteristics that maximize area and power efficiencies. As in previous limit studies, we find a high amount of parallelism. We show, however, that the potential speedup on GPU-like systems is low (2.2x - 12.7x) due to poor sequential performance. Communication latency and bandwidth have comparatively small performance effects (<25%). Optimal area efficiency requires a lower-cost parallel processor while optimal power efficiency requires a higher-performance parallel processor than today's GPUs. Multicore processors Computer architecture GPU

Search results