Global ETD Search

1	Analysis and parameter prediction of compiler transformation for graphics processors Magni, Alberto January 2016 (has links) In the last decade graphics processors (GPUs) have been extensively used to solve computationally intensive problems. A variety of GPU architectures by different hardware manufacturers have been shipped in a few years. OpenCL has been introduced as the standard cross-vendor programming framework for GPU computing. Writing and optimising OpenCL applications is a challenging task, the programmer has to take care of several low level details. This is even harder when the goal is to improve performance on a wide range of devices: OpenCL does not guarantee performance portability. In this thesis we focus on the analysis and the portability of compiler optimisations. We describe the implementation of a portable compiler transformation: thread-coarsening. The transformation increases the amount of work carried out by a single thread running on the GPU. The goal is to reduce the amount of redundant instructions executed by the parallel application. The first contribution is a technique to analyse performance improvements and degradations given by the compiler transformation, we study the changes of hardware performance counters when applying coarsening. In this way we identify the root causes of execution time variations due to coarsening. As second contribution, we study the relative performance of coarsening over multiple input sizes. We show that the speedups given by coarsening are stable for problem sizes larger than a threshold that we call saturation point. We exploit the existence of the saturation point to speedup iterative compilation. The last contribution of the work is the development of a machine learning technique that automatically selects a coarsening configuration that improves performance. The technique is based on an iterative model built using a neural network. The network is trained once for a GPU model and used for several programs. To prove the flexibility of our techniques, all our experiments have been deployed on multiple GPU models by different vendors. 006.6
2	GPGPU separation of opaque and transparent mesh polygons Tännström, Ulf Nilsson January 2014 (has links) Context: By doing a depth-prepass in a tiled forward renderer, pixels can be prevented from being shaded more than once. More aggressive culling of lights that might contribute to tiles can also be performed. In order to produce artifact-free rendering, only meshes containing fully opaque polygons can be included in the depth-prepass. This limits the benefit of the depth-prepass for scenes containing large, mostly opaque, meshes that has some portions of transparency in them. Objectives: The objective of this thesis was to classify the polygons of a mesh as either opaque or transparent using the GPU. Then to separate the polygons into two different vertex buffers depending on the classification. This allows all opaque polygons in a scene to be used in the depth-prepass, potentially increasing render performance. Methods: An implementation was performed using OpenCL, which was then used to measure the time it took to separate the polygons in meshes of different complexity. The polygon separation times were then compared to the time it took to load the meshes into the game. What effect the polygon separation had on rendering times was also investigated. Results: The results showed that polygon separation times were highly dependent on the number of polygons and the texture resolution. It took roughly 350ms to separate a mesh with 100k polygons and a 2048x2048 texture, while the same mesh with a 1024x1024 texture took a quarter of the time. In the test scene used the rendering times differed only slightly. Conclusions: If the polygon separation should be performed when loading the mesh or when exporting it depends on the game. For games with a lower geometrical and textural detail level it may be feasible to separate the polygons each time the mesh is loaded, but for most game it would be recommended to perform it once when exporting the mesh. / Att använda ett djup-prepass med en tiled forward renderare kan minska tiden det tar att rendera en scen. Dock kan inte meshar som innehåller transparenta delar inkluderas i detta pre-pass utan att introducera renderingsartefakter. Detta begränsar djup-prepassets användbarhet i scenarion med stora meshar som innehåller små delar av transparens. Denna uppsats försöker lösa detta genom att med hjälp av grafikkortet dela upp meshar i två delar. En del som enbart innehåller icke-transparenta polygoner och en del med enbart transparenta polygoner. Mesh models Graphics processors Rendering Computer Sciences Datavetenskap (datalogi) Human Computer Interaction
3	Apports des architectures hybrides à l'imagerie profondeur : étude comparative entre CPU, APU et GPU. / Contributions of hybrid architectures to depth imaging : a CPU, APU and GPU comparative study Said, Issam 21 December 2015 (has links) Les compagnies pétrolières s'appuient sur le HPC pour accélérer les algorithmes d'imagerie profondeur. Les grappes de CPU et les accélérateurs matériels sont largement adoptés par l'industrie. Les processeurs graphiques (GPU), avec une grande puissance de calcul et une large bande passante mémoire, ont suscité un vif intérêt. Cependant le déploiement d'applications telle la Reverse Time Migration (RTM) sur ces architectures présente quelques limitations. Notamment, une capacité mémoire réduite, des communications fréquentes entre le CPU et le GPU présentant un possible goulot d'étranglement à cause du bus PCI, et des consommations d'énergie élevées. AMD a récemment lancé l'Accelerated Processing Unit (APU) : un processeur qui fusionne CPU et GPU sur la même puce via une mémoire unifiée. Dans cette thèse, nous explorons l'efficacité de la technologie APU dans un contexte pétrolier, et nous étudions si elle peut surmonter les limitations des solutions basées sur CPU et sur GPU. L'APU est évalué à l'aide d'une suite OpenCL de tests mémoire, applicatifs et d'efficacité énergétique. La faisabilité de l'utilisation hybride de l'APU est explorée. L'efficacité d'une approche par directives de compilation est également étudiée. En analysant une sélection d'applications sismiques (modélisation et RTM) au niveau du noeud et à grande échelle, une étude comparative entre CPU, APU et GPU est menée. Nous montrons la pertinence du recouvrement des entrées-sorties et des communications MPI par le calcul pour les grappes d'APU et de GPU, que les APU délivrent des performances variant entre celles du CPU et celles du GPU, et que l'APU peut être aussi énergétiquement efficace que le GPU. / In an exploration context, Oil and Gas (O&G) companies rely on HPC to accelerate depth imaging algorithms. Solutions based on CPU clusters and hardware accelerators are widely embraced by the industry. The Graphics Processing Units (GPUs), with a huge compute power and a high memory bandwidth, had attracted significant interest.However, deploying heavy imaging workflows, the Reverse Time Migration (RTM) being the most famous, on such hardware had suffered from few limitations. Namely, the lack of memory capacity, frequent CPU-GPU communications that may be bottlenecked by the PCI transfer rate, and high power consumptions. Recently, AMD has launched theAccelerated Processing Unit (APU): a processor that merges a CPU and a GPU on the same die, with promising features notably a unified CPU-GPU memory. Throughout this thesis, we explore how efficiently may the APU technology be applicable in an O&G context, and study if it can overcome the limitations that characterize the CPU and GPU based solutions. The APU is evaluated with the help of memory, applicative and power efficiency OpenCL benchmarks. The feasibility of the hybrid utilization of the APUs is surveyed. The efficiency of a directive based approach is also investigated. By means of a thorough review of a selection of seismic applications (modeling and RTM) on the node level and on the large scale level, a comparative study between the CPU, the APU and the GPU is conducted. We show the relevance of overlapping I/O and MPI communications with computations for the APU and GPUclusters, that APUs deliver performances that range between those of CPUs and those of GPUs, and that the APU can be as power efficient as the GPU. HPC Calcul GPU Architectures hybrides APU Géophysique RTM APU RTM Graphics processors 004
4	An Efficient Hybrid CMOS/PTL (Pass-Transistor-Logic) Synthesizer and Its Applications to the Design of Arithmetic Units and 3D Graphics Processors Tsai, Ming-Yu 20 October 2009 (has links) The mainstream of current VLSI design and logic synthesis is based on traditional CMOS logic circuits. However, in the past two decades, various new logic circuit design styles based on pass-transistor logic (PTL) have been proposed. Compared with CMOS circuits, these PTL-based circuits are claimed to have better results in area, speed, and power in some particular applications, such as adder and multiplier designs. Since most current automatic logic synthesis tools (such as Synopsys Design Compiler) are based on conventional CMOS standard cell library, the corresponding logic minimization for CMOS logic cannot be directly employed to generate efficient PTL circuits. In this dissertation, we develop two novel PTL synthesizers that can efficiently generate PTL-based circuits. One is based on pure PTL cells; the other mixes CMOS and PTL cells in the standard cell library to achieve better performance in area, speed, and power. Since PTL-based circuits are constructed by only a few basic PTL cells, the layouts in PTL cells can be easily updated to design large SoC systems as the process technology migrates rapidly in current Nano technology era. The proposed PTL logic synthesis flows employ the popular Synopsys Design Compiler (DC) to perform logic translation and minimization based on the standard cell library composed of PTL and CMOS cells, thus, the PTL design flow can be easily embedded in the standard cell-based ASIC design flow. In this dissertation, we also discuss PTL-based designs of some fundamental hardware components. Furthermore, the proposed PTL cell library is used to synthesize large processor systems in applications of computer arithmetic and 3D graphics. 3D Graphics Processors Arithmetic Units Standard Cell Library ASIC Cell-Based Design Flow Logic Synthesizer Pass-Transistor-Logic (PTL) CMOS logic
5	A dynamic scheduling runtime and tuning system for heterogeneous multi and many-core desktop platforms / Um sistema de escalonamento dinâmico e tuning em tempo de execução para plataformas desktop heterogêneas de múltiplos núcleos Binotto, Alécio Pedro Delazari January 2011 (has links) Atualmente, o computador pessoal (PC) moderno poder ser considerado como um cluster heterogênedo de um nodo, o qual processa simultâneamente inúmeras tarefas provenientes das aplicações. O PC pode ser composto por Unidades de Processamento (PUs) assimétricas, como a Unidade Central de Processamento (CPU), composta de múltiplos núcleos, a Unidade de Processamento Gráfico (GPU), composta por inúmeros núcleos e que tem sido um dos principais co-processadores que contribuiram para a computação de alto desempenho em PCs, entre outras. Neste sentido, uma plataforma de execução heterogênea é formada em um PC para efetuar cálculos intensivos em um grande número de dados. Na perspectiva desta tese, a distribuição da carga de trabalho de uma aplicação nas PUs é um fator importante para melhorar o desempenho das aplicações e explorar tal heterogeneidade. Esta questão apresenta desafios uma vez que o custo de execução de uma tarefa de alto nível em uma PU é não-determinístico e pode ser afetado por uma série de parâmetros não conhecidos a priori, como o tamanho do domínio do problema e a precisão da solução, entre outros. Nesse escopo, esta pesquisa de doutorado apresenta um sistema sensível ao contexto e de adaptação em tempo de execução com base em um compromisso entre a redução do tempo de execução das aplicações - devido a um escalonamento dinâmico adequado de tarefas de alto nível - e o custo de computação do próprio escalonamento aplicados em uma plataforma composta de CPU e GPU. Esta abordagem combina um modelo para um primeiro escalonamento baseado em perfis de desempenho adquiridos em préprocessamento com um modelo online, o qual mantém o controle do tempo de execução real de novas tarefas e escalona dinâmicamente e de modo eficaz novas instâncias das tarefas de alto nível em uma plataforma de execução composta de CPU e de GPU. Para isso, é proposto um conjunto de heurísticas para escalonar tarefas em uma CPU e uma GPU e uma estratégia genérica e eficiente de escalonamento que considera várias unidades de processamento. A abordagem proposta é aplicada em um estudo de caso utilizando uma plataforma de execução composta por CPU e GPU para computação de métodos iterativos focados na solução de Sistemas de Equações Lineares que se utilizam de um cálculo de stencil especialmente concebido para explorar as características das GPUs modernas. A solução utiliza o número de incógnitas como o principal parâmetro para a decisão de escalonamento. Ao escalonar tarefas para a CPU e para a GPU, um ganho de 21,77% em desempenho é obtido em comparação com o escalonamento estático de todas as tarefas para a GPU (o qual é utilizado por modelos de programação atuais, como OpenCL e CUDA para Nvidia) com um erro de escalonamento de apenas 0,25% em relação à combinação exaustiva. / A modern personal computer can be now considered as a one-node heterogeneous cluster that simultaneously processes several applications’ tasks. It can be composed by asymmetric Processing Units (PUs), like the multi-core Central Processing Unit (CPU), the many-core Graphics Processing Units (GPUs) - which have become one of the main co-processors that contributed towards high performance computing - and other PUs. This way, a powerful heterogeneous execution platform is built on a desktop for data intensive calculations. In the perspective of this thesis, to improve the performance of applications and explore such heterogeneity, a workload distribution over the PUs plays a key role in such systems. This issue presents challenges since the execution cost of a task at a PU is non-deterministic and can be affected by a number of parameters not known a priori, like the problem size domain and the precision of the solution, among others. Within this scope, this doctoral research introduces a context-aware runtime and performance tuning system based on a compromise between reducing the execution time of the applications - due to appropriate dynamic scheduling of high-level tasks - and the cost of computing such scheduling applied on a platform composed of CPU and GPUs. This approach combines a model for a first scheduling based on an off-line task performance profile benchmark with a runtime model that keeps track of the tasks’ real execution time and efficiently schedules new instances of the high-level tasks dynamically over the CPU/GPU execution platform. For that, it is proposed a set of heuristics to schedule tasks over one CPU and one GPU and a generic and efficient scheduling strategy that considers several processing units. The proposed approach is applied in a case study using a CPU-GPU execution platform for computing iterative solvers for Systems of Linear Equations using a stencil code specially designed to explore the characteristics of modern GPUs. The solution uses the number of unknowns as the main parameter for assignment decision. By scheduling tasks to the CPU and to the GPU, it is achieved a performance gain of 21.77% in comparison to the static assignment of all tasks to the GPU (which is done by current programming models, such as OpenCL and CUDA for Nvidia) with a scheduling error of only 0.25% compared to exhaustive search. Processamento paralelo Microeletrônica Processamento : Imagem Processamento : Alto desempenho High-performance computing Scheduling Dynamic load-balancing Heterogenous systems Graphics processors Solvers for systems of linear equations
6	A dynamic scheduling runtime and tuning system for heterogeneous multi and many-core desktop platforms / Um sistema de escalonamento dinâmico e tuning em tempo de execução para plataformas desktop heterogêneas de múltiplos núcleos Binotto, Alécio Pedro Delazari January 2011 (has links) Atualmente, o computador pessoal (PC) moderno poder ser considerado como um cluster heterogênedo de um nodo, o qual processa simultâneamente inúmeras tarefas provenientes das aplicações. O PC pode ser composto por Unidades de Processamento (PUs) assimétricas, como a Unidade Central de Processamento (CPU), composta de múltiplos núcleos, a Unidade de Processamento Gráfico (GPU), composta por inúmeros núcleos e que tem sido um dos principais co-processadores que contribuiram para a computação de alto desempenho em PCs, entre outras. Neste sentido, uma plataforma de execução heterogênea é formada em um PC para efetuar cálculos intensivos em um grande número de dados. Na perspectiva desta tese, a distribuição da carga de trabalho de uma aplicação nas PUs é um fator importante para melhorar o desempenho das aplicações e explorar tal heterogeneidade. Esta questão apresenta desafios uma vez que o custo de execução de uma tarefa de alto nível em uma PU é não-determinístico e pode ser afetado por uma série de parâmetros não conhecidos a priori, como o tamanho do domínio do problema e a precisão da solução, entre outros. Nesse escopo, esta pesquisa de doutorado apresenta um sistema sensível ao contexto e de adaptação em tempo de execução com base em um compromisso entre a redução do tempo de execução das aplicações - devido a um escalonamento dinâmico adequado de tarefas de alto nível - e o custo de computação do próprio escalonamento aplicados em uma plataforma composta de CPU e GPU. Esta abordagem combina um modelo para um primeiro escalonamento baseado em perfis de desempenho adquiridos em préprocessamento com um modelo online, o qual mantém o controle do tempo de execução real de novas tarefas e escalona dinâmicamente e de modo eficaz novas instâncias das tarefas de alto nível em uma plataforma de execução composta de CPU e de GPU. Para isso, é proposto um conjunto de heurísticas para escalonar tarefas em uma CPU e uma GPU e uma estratégia genérica e eficiente de escalonamento que considera várias unidades de processamento. A abordagem proposta é aplicada em um estudo de caso utilizando uma plataforma de execução composta por CPU e GPU para computação de métodos iterativos focados na solução de Sistemas de Equações Lineares que se utilizam de um cálculo de stencil especialmente concebido para explorar as características das GPUs modernas. A solução utiliza o número de incógnitas como o principal parâmetro para a decisão de escalonamento. Ao escalonar tarefas para a CPU e para a GPU, um ganho de 21,77% em desempenho é obtido em comparação com o escalonamento estático de todas as tarefas para a GPU (o qual é utilizado por modelos de programação atuais, como OpenCL e CUDA para Nvidia) com um erro de escalonamento de apenas 0,25% em relação à combinação exaustiva. / A modern personal computer can be now considered as a one-node heterogeneous cluster that simultaneously processes several applications’ tasks. It can be composed by asymmetric Processing Units (PUs), like the multi-core Central Processing Unit (CPU), the many-core Graphics Processing Units (GPUs) - which have become one of the main co-processors that contributed towards high performance computing - and other PUs. This way, a powerful heterogeneous execution platform is built on a desktop for data intensive calculations. In the perspective of this thesis, to improve the performance of applications and explore such heterogeneity, a workload distribution over the PUs plays a key role in such systems. This issue presents challenges since the execution cost of a task at a PU is non-deterministic and can be affected by a number of parameters not known a priori, like the problem size domain and the precision of the solution, among others. Within this scope, this doctoral research introduces a context-aware runtime and performance tuning system based on a compromise between reducing the execution time of the applications - due to appropriate dynamic scheduling of high-level tasks - and the cost of computing such scheduling applied on a platform composed of CPU and GPUs. This approach combines a model for a first scheduling based on an off-line task performance profile benchmark with a runtime model that keeps track of the tasks’ real execution time and efficiently schedules new instances of the high-level tasks dynamically over the CPU/GPU execution platform. For that, it is proposed a set of heuristics to schedule tasks over one CPU and one GPU and a generic and efficient scheduling strategy that considers several processing units. The proposed approach is applied in a case study using a CPU-GPU execution platform for computing iterative solvers for Systems of Linear Equations using a stencil code specially designed to explore the characteristics of modern GPUs. The solution uses the number of unknowns as the main parameter for assignment decision. By scheduling tasks to the CPU and to the GPU, it is achieved a performance gain of 21.77% in comparison to the static assignment of all tasks to the GPU (which is done by current programming models, such as OpenCL and CUDA for Nvidia) with a scheduling error of only 0.25% compared to exhaustive search. Processamento paralelo Microeletrônica Processamento : Imagem Processamento : Alto desempenho High-performance computing Scheduling Dynamic load-balancing Heterogenous systems Graphics processors Solvers for systems of linear equations
7	A dynamic scheduling runtime and tuning system for heterogeneous multi and many-core desktop platforms / Um sistema de escalonamento dinâmico e tuning em tempo de execução para plataformas desktop heterogêneas de múltiplos núcleos Binotto, Alécio Pedro Delazari January 2011 (has links) Atualmente, o computador pessoal (PC) moderno poder ser considerado como um cluster heterogênedo de um nodo, o qual processa simultâneamente inúmeras tarefas provenientes das aplicações. O PC pode ser composto por Unidades de Processamento (PUs) assimétricas, como a Unidade Central de Processamento (CPU), composta de múltiplos núcleos, a Unidade de Processamento Gráfico (GPU), composta por inúmeros núcleos e que tem sido um dos principais co-processadores que contribuiram para a computação de alto desempenho em PCs, entre outras. Neste sentido, uma plataforma de execução heterogênea é formada em um PC para efetuar cálculos intensivos em um grande número de dados. Na perspectiva desta tese, a distribuição da carga de trabalho de uma aplicação nas PUs é um fator importante para melhorar o desempenho das aplicações e explorar tal heterogeneidade. Esta questão apresenta desafios uma vez que o custo de execução de uma tarefa de alto nível em uma PU é não-determinístico e pode ser afetado por uma série de parâmetros não conhecidos a priori, como o tamanho do domínio do problema e a precisão da solução, entre outros. Nesse escopo, esta pesquisa de doutorado apresenta um sistema sensível ao contexto e de adaptação em tempo de execução com base em um compromisso entre a redução do tempo de execução das aplicações - devido a um escalonamento dinâmico adequado de tarefas de alto nível - e o custo de computação do próprio escalonamento aplicados em uma plataforma composta de CPU e GPU. Esta abordagem combina um modelo para um primeiro escalonamento baseado em perfis de desempenho adquiridos em préprocessamento com um modelo online, o qual mantém o controle do tempo de execução real de novas tarefas e escalona dinâmicamente e de modo eficaz novas instâncias das tarefas de alto nível em uma plataforma de execução composta de CPU e de GPU. Para isso, é proposto um conjunto de heurísticas para escalonar tarefas em uma CPU e uma GPU e uma estratégia genérica e eficiente de escalonamento que considera várias unidades de processamento. A abordagem proposta é aplicada em um estudo de caso utilizando uma plataforma de execução composta por CPU e GPU para computação de métodos iterativos focados na solução de Sistemas de Equações Lineares que se utilizam de um cálculo de stencil especialmente concebido para explorar as características das GPUs modernas. A solução utiliza o número de incógnitas como o principal parâmetro para a decisão de escalonamento. Ao escalonar tarefas para a CPU e para a GPU, um ganho de 21,77% em desempenho é obtido em comparação com o escalonamento estático de todas as tarefas para a GPU (o qual é utilizado por modelos de programação atuais, como OpenCL e CUDA para Nvidia) com um erro de escalonamento de apenas 0,25% em relação à combinação exaustiva. / A modern personal computer can be now considered as a one-node heterogeneous cluster that simultaneously processes several applications’ tasks. It can be composed by asymmetric Processing Units (PUs), like the multi-core Central Processing Unit (CPU), the many-core Graphics Processing Units (GPUs) - which have become one of the main co-processors that contributed towards high performance computing - and other PUs. This way, a powerful heterogeneous execution platform is built on a desktop for data intensive calculations. In the perspective of this thesis, to improve the performance of applications and explore such heterogeneity, a workload distribution over the PUs plays a key role in such systems. This issue presents challenges since the execution cost of a task at a PU is non-deterministic and can be affected by a number of parameters not known a priori, like the problem size domain and the precision of the solution, among others. Within this scope, this doctoral research introduces a context-aware runtime and performance tuning system based on a compromise between reducing the execution time of the applications - due to appropriate dynamic scheduling of high-level tasks - and the cost of computing such scheduling applied on a platform composed of CPU and GPUs. This approach combines a model for a first scheduling based on an off-line task performance profile benchmark with a runtime model that keeps track of the tasks’ real execution time and efficiently schedules new instances of the high-level tasks dynamically over the CPU/GPU execution platform. For that, it is proposed a set of heuristics to schedule tasks over one CPU and one GPU and a generic and efficient scheduling strategy that considers several processing units. The proposed approach is applied in a case study using a CPU-GPU execution platform for computing iterative solvers for Systems of Linear Equations using a stencil code specially designed to explore the characteristics of modern GPUs. The solution uses the number of unknowns as the main parameter for assignment decision. By scheduling tasks to the CPU and to the GPU, it is achieved a performance gain of 21.77% in comparison to the static assignment of all tasks to the GPU (which is done by current programming models, such as OpenCL and CUDA for Nvidia) with a scheduling error of only 0.25% compared to exhaustive search. Processamento paralelo Microeletrônica Processamento : Imagem Processamento : Alto desempenho High-performance computing Scheduling Dynamic load-balancing Heterogenous systems Graphics processors Solvers for systems of linear equations

1

Page generated in 0.0879 seconds