Global ETD Search

1	Architectures and limits of GPU-CPU heterogeneous systems Wong, Henry Ting-Hei 11 1900 (has links) As we continue to be able to put an increasing number of transistors on a single chip, the answer to the perpetual question of what the best processor we could build with the transistors is remains uncertain. Past work has shown that heterogeneous multiprocessor systems provide benefits in performance and efficiency. This thesis explores heterogeneous systems composed of a traditional sequential processor (CPU) and highly parallel graphics processors (GPU). This thesis presents a tightly-coupled heterogeneous chip multiprocessor architecture for general-purpose non-graphics computation and a limit study exploring the potential benefits of GPU-like cores for accelerating a set of general-purpose workloads. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with GMA X4500 GPU cores. Pangaea introduces a resource partitioning of the GPU, where 3D graphics-specific hardware is removed to reduce area or add more processing cores, and a 3-instruction extension to the IA32 ISA that supports fast communication between CPU and GPU by building user-level interrupts on top of existing cache coherency mechanisms. By removing graphics-specific hardware on a 65 nm process, the area saved is equivalent to 9 GPU cores, while the power saved is equivalent to 5 cores. Our FPGA prototype shows thread spawn latency improvements from thousands of clock cycles to 26. A set of non-graphics workloads demonstrate speedups of up to 8.8x. This thesis also presents a limit study, where we measure the limit of algorithm parallelism in the context of a heterogeneous system that can be usefully extracted from a set of general-purpose applications. We measure sensitivity to the sequential performance (register read-after-write latency) of the low-cost parallel cores, and latency and bandwidth of the communication channel between the two cores. Using these measurements, we propose system characteristics that maximize area and power efficiencies. As in previous limit studies, we find a high amount of parallelism. We show, however, that the potential speedup on GPU-like systems is low (2.2x - 12.7x) due to poor sequential performance. Communication latency and bandwidth have comparatively small performance effects (<25%). Optimal area efficiency requires a lower-cost parallel processor while optimal power efficiency requires a higher-performance parallel processor than today's GPUs. Multicore processors Computer architecture GPU
2	Architectures and limits of GPU-CPU heterogeneous systems Wong, Henry Ting-Hei 11 1900 (has links) As we continue to be able to put an increasing number of transistors on a single chip, the answer to the perpetual question of what the best processor we could build with the transistors is remains uncertain. Past work has shown that heterogeneous multiprocessor systems provide benefits in performance and efficiency. This thesis explores heterogeneous systems composed of a traditional sequential processor (CPU) and highly parallel graphics processors (GPU). This thesis presents a tightly-coupled heterogeneous chip multiprocessor architecture for general-purpose non-graphics computation and a limit study exploring the potential benefits of GPU-like cores for accelerating a set of general-purpose workloads. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with GMA X4500 GPU cores. Pangaea introduces a resource partitioning of the GPU, where 3D graphics-specific hardware is removed to reduce area or add more processing cores, and a 3-instruction extension to the IA32 ISA that supports fast communication between CPU and GPU by building user-level interrupts on top of existing cache coherency mechanisms. By removing graphics-specific hardware on a 65 nm process, the area saved is equivalent to 9 GPU cores, while the power saved is equivalent to 5 cores. Our FPGA prototype shows thread spawn latency improvements from thousands of clock cycles to 26. A set of non-graphics workloads demonstrate speedups of up to 8.8x. This thesis also presents a limit study, where we measure the limit of algorithm parallelism in the context of a heterogeneous system that can be usefully extracted from a set of general-purpose applications. We measure sensitivity to the sequential performance (register read-after-write latency) of the low-cost parallel cores, and latency and bandwidth of the communication channel between the two cores. Using these measurements, we propose system characteristics that maximize area and power efficiencies. As in previous limit studies, we find a high amount of parallelism. We show, however, that the potential speedup on GPU-like systems is low (2.2x - 12.7x) due to poor sequential performance. Communication latency and bandwidth have comparatively small performance effects (<25%). Optimal area efficiency requires a lower-cost parallel processor while optimal power efficiency requires a higher-performance parallel processor than today's GPUs. Multicore processors Computer architecture GPU
3	Architectures and limits of GPU-CPU heterogeneous systems Wong, Henry Ting-Hei 11 1900 (has links) As we continue to be able to put an increasing number of transistors on a single chip, the answer to the perpetual question of what the best processor we could build with the transistors is remains uncertain. Past work has shown that heterogeneous multiprocessor systems provide benefits in performance and efficiency. This thesis explores heterogeneous systems composed of a traditional sequential processor (CPU) and highly parallel graphics processors (GPU). This thesis presents a tightly-coupled heterogeneous chip multiprocessor architecture for general-purpose non-graphics computation and a limit study exploring the potential benefits of GPU-like cores for accelerating a set of general-purpose workloads. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with GMA X4500 GPU cores. Pangaea introduces a resource partitioning of the GPU, where 3D graphics-specific hardware is removed to reduce area or add more processing cores, and a 3-instruction extension to the IA32 ISA that supports fast communication between CPU and GPU by building user-level interrupts on top of existing cache coherency mechanisms. By removing graphics-specific hardware on a 65 nm process, the area saved is equivalent to 9 GPU cores, while the power saved is equivalent to 5 cores. Our FPGA prototype shows thread spawn latency improvements from thousands of clock cycles to 26. A set of non-graphics workloads demonstrate speedups of up to 8.8x. This thesis also presents a limit study, where we measure the limit of algorithm parallelism in the context of a heterogeneous system that can be usefully extracted from a set of general-purpose applications. We measure sensitivity to the sequential performance (register read-after-write latency) of the low-cost parallel cores, and latency and bandwidth of the communication channel between the two cores. Using these measurements, we propose system characteristics that maximize area and power efficiencies. As in previous limit studies, we find a high amount of parallelism. We show, however, that the potential speedup on GPU-like systems is low (2.2x - 12.7x) due to poor sequential performance. Communication latency and bandwidth have comparatively small performance effects (<25%). Optimal area efficiency requires a lower-cost parallel processor while optimal power efficiency requires a higher-performance parallel processor than today's GPUs. / Applied Science, Faculty of / Electrical and Computer Engineering, Department of / Graduate Multicore processors Computer architecture GPU
4	Paralelização automática de laços para arquiteturas multicore / Automatic loop parallelization for multicore architectures Vieira, Cristianno Martins 11 August 2010 (has links) Orientador: Sandro Rigo / Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação / Made available in DSpace on 2018-08-17T08:17:12Z (GMT). No. of bitstreams: 1 Vieira_CristiannoMartins_M.pdf: 1981128 bytes, checksum: 5af9a00808029ad96cd8d02e569b1cda (MD5) Previous issue date: 2010 / Resumo: Embora muitos programas possuam uma forma regular de paralelismo, que pode ser expressa em termos de laços paralelos, muitos exemplos importantes não a possuem. Loop skewing é uma transformação que remodela o espaço de iteração dos laços para que seja possível expressar o paralelismo implícito através de laços paralelos. Como consequência da complexidade em se modificar o espaço de iteração dos laços, e de possíveis problemas causados por transformações deste tipo - como o possível aumento na taxa de miss em caches -, no geral, elas não são largamente utilizadas. Neste projeto, implementamos a transformação loop skewing sobre o compilador da linguagem C presente no GCC (GNU Compiler Collection), de forma a permitir a assistência pelo programador. Utilizamos a ferramenta Graphite como base para a implementação da otimização, apenas representando-a como uma transformação afim sobre um objeto matemático multidimensional chamado polítopo. Mostramos, através de um estudo detalhado sobre o modelo matemático denominado modelo politópico, que laços com estruturas específicas - perfeitamente aninhados, com limites e acesso á memória descritos por funções afins - poderiam ser representados como polítopos, e que transformações aplicadas a estes seriam espelhadas no código gerado a partir desses polítopos. Dessa forma, qualquer transformação que possa ser estruturada como uma transformação afim sobre um polítopo, poderá ser implementada. Mostramos, ainda, durante a análise de desempenho, que transformações deste tipo são viáveis e, apesar de algumas limitações impostas pela infraestrutura do GCC, aumentam relativamente o desempenho das aplicações compiladas com ela - obtivemos um ganho máximo de aproximadamente 115% para o uso de quatro threads em uma das aplicações executadas. Verificamos o impacto do uso de programas já paralelizados manualmente sobre a plataforma, e obtivemos um ganho máximo de 11% nesses casos, mostrando que ainda aplicações paralelizadas podem conter paralelismo implícito / Abstract: Although many programs present a regular form of parallelism, which can be expressed as parallel loops, many important examples do not. Loop skewing is a transformation that reorganizes the iteration space of loops to make it possible to expose the implicit parallelism through parallel loops. In general, as a consequence of the complexity in modifying the iteration space of loops, and possible problems caused by such changes - such as the possibility of increasing the miss rate in caches -, they are not widely used. In this work, the loop skewing transformation was implemented on GCC's C compiler (GNU Compiler Collection), allowing programmer's assistance. Graphite provides us a basis for implementation of the optimization, just representing it as an a_ne transformation on a multidimensional mathematical object called polytope. We show, through a detailed study about the mathematical model called polytope model, that for a very restricted loop structure - perfectly nested, with limits and memory accesses described by a_ne functions - could be represented as polytopes, and transformations applied to these would be carried by the code generated from these polytope. Thus, any transformation that could be structured as an a_ne transformation on a polytope, could be added. We also show, by means of performance analysis, that this type of transformation is feasible and, despite some limitations imposed by the still under development GCC's infrastructure for auto-parallelization, fairly increases the performance of some applications compiled with it - we achived a maximum of about 115% using four threads with one of the applications. We also veriéd the impact of using manually parallelized programs on this platform, and achieved a maximum gain of 11% in these cases, showing that even parallel applications may have implicit parallelism / Mestrado / Ciência da Computação / Mestre em Ciência da Computação Processadores multicore Arquitetura de computador Politopos Multicore processors Computer architecture Polytopes
5	Efficient shared cache management in multicore processors Xie, Yuejian 20 May 2011 (has links) In modern multicore processors, various resources (such as memory bandwidth and caches) are designed to be shared by concurrently running threads. Though it is good to be able to run multiple programs on a single chip at the same time, sometimes the contention of these shared resources can create problems for system performance. Naive hard-partitioning between threads can result in low resource utilization. This research shows that simple and effective approaches to dynamically manage the shared cache can be achieved. The contributions of this work are the following: (1) a technique for dynamic on-line classification of application memory access behaviors to predict the usefulness of cache partitioning, and a simple shared-cache management approach based on the classification; (2) a cache pseudo-partitioning technique that manipulates insertion and promotion policies; (3) a scalable algorithm to quickly decide per-core cache allocations; (4) pseudo-LRU cache partition approximation; (5) a dynamic shared cache compression technique that considers different thread behaviors. Performance Management Multicore processors Shared cache Simultaneous multithreading processors Cache memory
6	Melhoria de desempenho da máquina virtual Java na plataforma Cell B.E. / Java virtual machine performance improvement in Cell B.E. architecture Firmino, Raoni Fassina 16 August 2018 (has links) Orientador: Rodolfo Jardim de Azevedo / Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação / Made available in DSpace on 2018-08-16T21:29:21Z (GMT). No. of bitstreams: 1 Firmino_RaoniFassina_M.pdf: 582747 bytes, checksum: c50225f2dc75c4235a785d90a82d71b2 (MD5) Previous issue date: 2010 / Resumo: Esta dissertação concentra-se no atual momento de transição entre as atuais e as novas arquiteturas de processadores, oferecendo uma alternativa para minimizar o impacto desta mudança. Para tal utiliza-se a plataforma Java, que possibilita que o desenvolvimento de aplicações seja independente da arquitetura em que serão executadas. Considerando a arquitetura Cell B.E. como uma nova plataforma que promete desempenho elevado, este trabalho propõe melhorias na Máquina Virtual Java que propiciem um ganho de desempenho na execução de aplicações Java executadas sobre o processador Cell. O objetivo proposto é atingido por meio da utilização do ambiente disponível na própria plataforma Java, o Java Native Interface (JNI), para a implementação de interfaces entre bibliotecas nativas construídas para a arquitetura Cell - com a intenção de obter o máximo desempenho possível - e as aplicações Java. É proposto um modelo para porte e criação das interfaces para bibliotecas e mostra-se a viabilidade da abordagem proposta através de implementações de bibliotecas selecionadas, consolidando a metodologia utilizada. Duas bibliotecas foram portadas completamente como prova de conceito, uma multiplicação de matrizes grandes e o algoritmo RC5. A multiplicação de matrizes obteve um desempenho e escalablidade comparável ao código original em C e em escala muitas vezes superior ao código JNI para arquitetura x86 a ao código Java executando em arquiteturas x86 e Cell. O RC5 executou apenas aproximadamente 0,3 segundos mais lento que o código C original (perda citada em segundos pois se manteve constante independente do tempo levado para as diferentes configurações de execução) / Abstract: This dissertation focuses on the present moment of transition between the current and new processor architectures, offering an alternative to minimize the impact of this change. For this, we use the Java platform, which enables an architecture-independent application development. Considering the Cell BE architecture as a new platform that promises high performance, this paper proposes improvements in the Java Virtual Machine that provide performance gains in the execution of Java applications running on the Cell processor. The proposed objective is achieved through the use of the environment available on the Java platform itself, the Java Native Interface (JNI), to implement interfaces between native libraries built for the Cell architecture - with the intention of obtaining the maximum possible performance - and the Java applications. It is proposed a model to port and build interfaces to libraries and it shows the viability of the proposed methodology with the implementation of selected libraries, consolidating the used methodology. Two libraries were completely ported as proof of concept, a multiplication of large matrices and a RC5 algorithm implementation. The matrices multiplication achieved scalability and performance in the same basis as the native implementation and incomparable with JNI implementation targering x86 architecture and Java implementation running in x86 and Cell architectures. The RC5 was just 0.3 seconds slower than the original C code (the loss is put in seconds since it was constant, independent of the execution time taken by different configurations of execution) / Mestrado / Computação / Mestre em Ciência da Computação Arquitetura de computador Processadores multicore Java (Computer program language) Computer architecture Multicore processors
7	Scalable and Energy Efficient Execution Methods for Multicore Systems Li, Dong 16 February 2011 (has links) Multicore architectures impose great pressure on resource management. The exploration spaces available for resource management increase explosively, especially for large-scale high end computing systems. The availability of abundant parallelism causes scalability concerns at all levels. Multicore architectures also impose pressure on power management. Growth in the number of cores causes continuous growth in power. In this dissertation, we introduce methods and techniques to enable scalable and energy efficient execution of parallel applications on multicore architectures. We study strategies and methodologies that combine DCT and DVFS for the hybrid MPI/OpenMP programming model. Our algorithms yield substantial energy saving (8.74% on average and up to 13.8%) with either negligible performance loss or performance gain (up to 7.5%). To save additional energy for high-end computing systems, we propose a power-aware MPI task aggregation framework. The framework predicts the performance effect of task aggregation in both computation and communication phases and its impact in terms of execution time and energy of MPI programs. Our framework provides accurate predictions that lead to substantial energy saving through aggregation (64.87% on average and up to 70.03%) with tolerable performance loss (under 5%). As we aggregate multiple MPI tasks within the same node, we have the scalability concern of memory registration for high performance networking. We propose a new memory registration/deregistration strategy to reduce registered memory on multicore architectures with helper threads. We investigate design polices and performance implications of the helper thread approach. Our method efficiently reduces registered memory (23.62% on average and up to 49.39%) and avoids memory registration/deregistration costs for reused communication memory. Our system enables the execution of application input sets that could not run to the completion with the memory registration limitation. / Ph. D. Performance Modeling and Analysis Multicore Processors Power-Aware Computing Concurrency Throttling High-Performance Computing
8	Improving the Efficiency of Parallel Applications on Multithreaded and Multicore Systems Curtis-Maury, Matthew 15 April 2008 (has links) The scalability of parallel applications executing on multithreaded and multicore multiprocessors is often quite limited due to large degrees of contention over shared resources on these systems. In fact, negative scalability frequently occurs such that a non-negligable performance loss is observed through the use of more processors and cores. In this dissertation, we present a prediction model for identifying efficient operating points of concurrency in multithreaded scientific applications in terms of both performance as a primary objective and power secondarily. We also present a runtime system that uses live analysis of hardware event rates through the prediction model to optimize applications dynamically. We discuss a dynamic, phase-aware performance prediction model (DPAPP), which combines statistical learning techniques, including multivariate linear regression and artificial neural networks, with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. We find that the scalability model achieves accuracy approaching 95%, sufficiently accurate to identify improved concurrency levels and thread placements from within real parallel scientific applications. Using DPAPP, we develop a prediction-driven runtime optimization scheme, called ACTOR, which throttles concurrency so that power consumption can be reduced and performance can be set at the knee of the scalability curve of each parallel execution phase in an application. ACTOR successfully identifies and exploits program phases where limited scalability results in a performance loss through the use of more processing elements, providing simultaneous reductions in execution time by 5%-18% and power consumption by 0%-11% across a variety of parallel applications and architectures. Further, we extend DPAPP and ACTOR to include support for runtime adaptation of DVFS, allowing for the synergistic exploitation of concurrency throttling and DVFS from within a single, autonomically-acting library, providing improved energy-efficiency compared to either approach in isolation. / Ph. D. power-aware computing high-performance computing performance prediction multicore processors runtime adaptation concurrency throttling
9	Scheduling on Asymmetric Architectures Blagojevic, Filip 22 July 2008 (has links) We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. To maximize performance on heterogeneous multi-core processors, programs need to expose multiple dimensions of parallelism simultaneously. Unfortunately, programming with multiple dimensions of parallelism is to date an ad hoc process, relying heavily on the intuition and skill of programmers. Formal techniques are needed to optimize multi-dimensional parallel program designs. We investigate user- and kernel-level schedulers that dynamically "rightsize" the dimensions and degrees of parallelism on the asymmetric parallel platforms. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. Our runtime environment outperforms the native Linux and MPI scheduling environment by up to a factor of 2.7. We also present a model of multi-dimensional parallel computation for steering the parallelization process on heterogeneous multi-core processors. The model predicts with high accuracy the execution time and scalability of a program using conventional processors and accelerators simultaneously. More specifically, the model reveals optimal degrees of multi-dimensional, task-level and data-level concurrency, to maximize performance across cores. We evaluate our runtime policies as well as the performance model we developed, on an IBM Cell BladeCenter, as well as on a cluster composed of Playstation3 nodes, using two realistic bioinformatics applications. / Ph. D. process scheduling performance prediction high-performance computing runtime adaptation Multicore processors Cell BE
10	Prediction Models for Multi-dimensional Power-Performance Optimization on Many Cores Shah, Ankur Savailal 28 May 2008 (has links) Power has become a primary concern for HPC systems. Dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) are two software tools (or knobs) for reducing the dynamic power consumption of HPC systems. To date, few works have considered the synergistic integration of DVFS and DCT in performance-constrained systems, and, to the best of our knowledge, no prior research has developed application-aware simultaneous DVFS and DCT controllers in real systems and parallel programming frameworks. We present a multi-dimensional, online performance prediction framework, which we deploy to address the problem of simultaneous runtime optimization of DVFS, DCT, and thread placement on multi-core systems. We present results from an implementation of the prediction framework in a runtime system linked to the Intel OpenMP runtime environment and running on a real dual-processor quad-core system as well as a dual-processor dual-core system. We show that the prediction framework derives near-optimal settings of the three power-aware program adaptation knobs that we consider. Our overall runtime optimization framework achieves significant reductions in energy (12.27% mean) and ED² (29.6% mean), through simultaneous power savings (3.9% mean) and performance improvements (10.3% mean). Our prediction and adaptation framework outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS. Further, our results indicate that prediction-based schemes for runtime adaptation compare favorably and typically improve upon heuristic search-based approaches in both performance and energy savings. / Master of Science concurrency throttling power-aware computing runtime adaptation performance prediction high-performance computing Multicore processors

Search results