21

Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

Lutz, Thibaut January 2015 (has links)
Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and increasingly complex. The single-processor era has been replaced by multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors, have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity, and because optimizing for each target individually demands an unachievable degree of flexibility and portability, heterogeneous systems typically remain vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem: providing automated, non-intrusive methods in the form of compiler tools, and implementing efficient abstractions that automatically tune parameters for a restricted domain — two complementary approaches to better utilizing compute resources in heterogeneous systems. First, we explore a fully automated, compiler-based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces the significant software engineering effort otherwise spent tuning an application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library-based approach is designed to provide a high-level abstraction for complex problems in a specific domain: stencil computation. Using domain-specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs, with a speedup of up to 1.9x on two GPUs and 3.6x on four.
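The cross-kernel runtime optimization described above centers on deferring and fusing kernels. The toy C++ sketch below, with hypothetical names and lambdas standing in for OpenCL kernels, illustrates only the defer-and-fuse idea under those simplifying assumptions: element-wise stages are recorded instead of executed, then collapsed into a single pass so the intermediate result never round-trips through (global) memory.

```cpp
#include <cstdio>
#include <functional>
#include <vector>

// Toy model of a deferring runtime: element-wise "kernels" over the same
// range are recorded instead of executed, then fused into one pass so the
// intermediate array never has to be materialized.
struct DeferredRuntime {
    using ElemOp = std::function<float(float)>;
    std::vector<ElemOp> pending;            // recorded element-wise stages

    void enqueue(ElemOp op) { pending.push_back(std::move(op)); }

    // "Flush": a single fused loop applies all recorded stages per element.
    void run(const std::vector<float>& in, std::vector<float>& out) {
        out.resize(in.size());
        for (size_t i = 0; i < in.size(); ++i) {
            float v = in[i];
            for (const auto& op : pending) v = op(v);  // fused chain
            out[i] = v;
        }
        pending.clear();
    }
};

int main() {
    DeferredRuntime rt;
    rt.enqueue([](float x) { return x * 2.0f; });   // "kernel" 1: scale
    rt.enqueue([](float x) { return x + 1.0f; });   // "kernel" 2: offset
    std::vector<float> in = {1.f, 2.f, 3.f}, out;
    rt.run(in, out);                                // executes as one fused pass
    for (float v : out) std::printf("%g\n", v);     // 3 5 7
}
```

On a real device the same fusion also saves a kernel launch and an intermediate buffer, which is the kind of saving such a runtime exploits.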
22

Comparative Study of CPU and GPGPU Implementations of the Sieves of Eratosthenes, Sundaram and Atkin

Månsson, Jakob January 2021 (has links)
Background. Prime numbers are integers divisible only by 1 and themselves, and one of the oldest methods of finding them is through a process known as sieving. A prime number sieving algorithm produces every prime number in a span, usually from the number 2 up to a given number n. In this thesis, we cover the three sieves of Eratosthenes, Sundaram, and Atkin. Objectives. We compare their sequential CPU implementations to their parallel GPGPU (General Purpose Graphics Processing Unit) counterparts in terms of performance, accuracy, and suitability. GPGPU is a method in which one utilizes hardware intended for graphics rendering to achieve a high degree of parallelism. Our goal is to establish whether GPGPU sieving can be more effective than the sequential approach, which is currently commonplace. Method. We utilize the C++ and CUDA programming languages to implement the algorithms, and then extract data regarding their execution time and accuracy. Experiments are set up and run at several sieving limits, with the upper bound set by the memory capacity of the available GPU hardware. Furthermore, we study each sieve to identify what characteristics make it fit or unfit for a GPGPU approach. Results. Our results show that the sieve of Eratosthenes is slow and ill-suited for GPGPU computing, that the sieve of Sundaram is both efficient and fit for parallelization, and that the sieve of Atkin is the fastest but suffers from imperfect accuracy. Conclusions. Finally, we address how the smaller concurrent memory capacity available for GPGPU limits the ranges that can be sieved, compared to the CPU. Utilizing the beneficial characteristics of the sieve of Sundaram, we propose a batch-divided implementation that would allow the GPGPU sieve to cover the same range of numbers as any of the CPU variants.
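For reference, the sieve of Sundaram singled out above admits a very compact sequential form. The sketch below is a textbook C++ formulation, not the thesis' implementation; a GPGPU version would map the marking iterations onto CUDA threads.

```cpp
#include <cstdio>
#include <vector>

// Sequential sieve of Sundaram: finds all primes <= n.
// Mark every k = i + j + 2ij (1 <= i <= j); each unmarked k yields the
// odd prime 2k + 1, and 2 is appended separately.
std::vector<int> sundaram(int n) {
    int m = (n - 1) / 2;                       // 2k+1 <= n  =>  k <= (n-1)/2
    std::vector<bool> composite(m + 1, false);
    for (long long i = 1; i <= m; ++i)
        for (long long j = i; i + j + 2 * i * j <= m; ++j)
            composite[i + j + 2 * i * j] = true;
    std::vector<int> primes;
    if (n >= 2) primes.push_back(2);           // the only even prime
    for (int k = 1; k <= m; ++k)
        if (!composite[k]) primes.push_back(2 * k + 1);
    return primes;
}

int main() {
    for (int p : sundaram(50)) std::printf("%d ", p);  // 2 3 5 7 ... 47
    std::printf("\n");
}
```

The inner marking loop has no cross-iteration dependencies, which is precisely the property that makes this sieve fit for parallelization.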
23

Acceleration of Hardware Testing and Validation Algorithms using Graphics Processing Units

Li, Min 16 November 2012 (has links)
With the advances of very large scale integration (VLSI) technology, the feature size has been shrinking steadily, together with an increase in the design complexity of logic circuits. As a result, the effort required to design, test, and debug digital systems has increased tremendously. Although electronic design automation (EDA) algorithms have been studied extensively to accelerate such processes, some computationally intensive applications still take long execution times. This is especially the case for testing and validation. In order to meet time-to-market constraints and to arrive at a bug-free design or product, the work presented in this dissertation studies the acceleration of EDA algorithms on Graphics Processing Units (GPUs). This dissertation concentrates on a subset of EDA algorithms related to testing and validation. In particular, within the area of testing, fault simulation, diagnostic simulation and reliability analysis are explored. We also investigated approaches to parallelize state justification on GPUs, one of the most difficult problems in the validation area. First, we present an efficient parallel fault simulator, FSimGP2, which exploits the high degree of parallelism supported by a state-of-the-art graphics processing unit (GPU) with the NVIDIA Compute Unified Device Architecture (CUDA). A novel three-dimensional parallel fault simulation technique is proposed to achieve extremely high computation efficiency on the GPU. The experimental results demonstrate a speedup of up to 4x compared to another GPU-based fault simulator. Then, another GPU-based simulator is used to tackle an even more computation-intensive task, diagnostic fault simulation. The simulator is based on a two-stage framework which exploits high computation efficiency on the GPU. We introduce a fault-pair based approach to alleviate the limited memory capacity on GPUs. Also, multi-fault-signature and dynamic load balancing techniques are introduced for the best usage of on-board computing resources. With continuous feature-size scaling and the advent of innovative nano-scale devices, the reliability analysis of digital systems is becoming ever more important. However, the computational cost to accurately analyze a large digital system is very high. We propose a high-performance reliability analysis tool on GPUs. To achieve high memory bandwidth on GPUs, two algorithms for simulation scheduling and memory arrangement are proposed. Experimental results demonstrate that the parallel analysis tool is efficient, reliable and scalable. In the area of design validation, we investigate state justification. By employing swarm intelligence and the power of parallelism on GPUs, we are able to efficiently find a trace that helps reach corner cases during the validation of a digital system. In summary, the work presented in this dissertation demonstrates that several applications in the area of digital design testing and validation can be successfully rearchitected to achieve maximal performance on GPUs and obtain significant speedups. The proposed algorithms based on GPU parallelism collectively aim to improve the performance of EDA tools in the computer-aided design (CAD) community on GPUs and other many-core platforms. / Ph. D.
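A staple building block behind parallel fault simulators is pattern parallelism: each bit of a machine word carries one test pattern, so a single bitwise operation evaluates a gate for 64 patterns at once. The sketch below shows this on a hypothetical two-gate circuit with one injected stuck-at-0 fault; it illustrates the underlying trick only, not FSimGP2's actual three-dimensional technique.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // 64 test patterns per word: bit i of each signal is pattern i's value.
    uint64_t a = 0xF0F0F0F0F0F0F0F0ull;
    uint64_t b = 0xCCCCCCCCCCCCCCCCull;

    // Fault-free circuit: c = AND(a, b), out = XOR(c, a).
    uint64_t c_good = a & b;
    uint64_t out_good = c_good ^ a;

    // Faulty copy of the circuit: internal net c stuck-at-0.
    uint64_t c_faulty = 0;
    uint64_t out_faulty = c_faulty ^ a;

    // Every differing output bit is a pattern that detects the fault.
    uint64_t detected = out_good ^ out_faulty;
    std::printf("patterns detecting c stuck-at-0: %d of 64\n",
                __builtin_popcountll(detected));   // 16 with these patterns
}
```

GPU fault simulators scale this word-level parallelism further by running many such words, faults, and circuit partitions across thousands of threads.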
24

GPScheDVS: A New Paradigm of the Autonomous CPU Speed Control for Commodity-OS-based General-Purpose Mobile Computers with a DVS-friendly Task Scheduling

Kim, Sookyoung 25 September 2008 (has links)
This dissertation studies the problem of increasing battery life-time and reducing CPU heat dissipation, without degrading system performance, in commodity-OS-based general-purpose (GP) mobile computers using the dynamic voltage scaling (DVS) function of modern CPUs. The dissertation especially focuses on the impact of task scheduling on the effectiveness of DVS in achieving this goal. The task scheduling mechanism used in most contemporary general-purpose operating systems (GPOS) prioritizes tasks based only on their CPU occupancies, irrespective of their deadlines. In currently available autonomous DVS schemes for GP mobile systems, the impact of this GPOS task scheduling is ignored: a DVS scheme merely predicts and enforces the lowest CPU speed that can meet tasks' deadlines, without meddling with task scheduling. This research, however, shows that it is impossible to take full advantage of DVS in balancing energy/power and performance in the current DVS paradigm, due to the mismatch between the urgency (i.e., having a nearer deadline) and the priority of tasks under GPOS task scheduling. This research also shows that, consequently, a new DVS paradigm is necessary, in which a "DVS-friendly" task scheduling assigns higher priorities to more urgent tasks. The dissertation begins by showing how the mismatch between the urgency and priority of tasks limits the effectiveness of DVS, and why conventional real-time (RT) task scheduling, which is intrinsically DVS-friendly, cannot be used in GP systems. Then, the dissertation describes the requirements for "DVS-friendly GP" task scheduling as follows. Unlike the existing GPOS task scheduling, it should prioritize tasks by their deadlines. But, at the same time, it must be able to do so without a priori knowledge of the deadlines, and it must be able to handle the variety of tasks running in today's GP systems, unlike conventional RT task scheduling. These include sporadic tasks, such as user-interactive tasks, and tasks with dependencies on each other, such as a family of threads and user-interface server/client tasks. Therefore, the first major result of this research is a new DVS paradigm for commodity-OS-based GP mobile systems in which DVS is performed under a DVS-friendly GP task scheduling that meets these requirements. The dissertation then proposes GPSched, a DVS-friendly GP task scheduling mechanism for commodity-Linux-based GP mobile systems, as the second major result. GPSched autonomously prioritizes tasks by their deadlines, using the type of services each task is involved with as an indicator of its deadline. At the same time, GPSched properly handles a family of threads and user-interface server/client tasks by distinguishing and scheduling them as a group, and handles user-interactive tasks by incorporating a feature of current GPOS task scheduling — raising the priority of a task that is idle most of the time — which is desirable for responding quickly to user input events. The final major result is GPScheDVS, the integration of GPSched and a task-based DVS scheme customized for GPSched called GPSDVS. GPScheDVS provides two alternative modes: (1) the system-energy-centric (SE) mode, aiming at a longer battery life-time by reducing system energy consumption, and (2) the CPU-power-centric (CP) mode, focusing on limiting CPU heat dissipation by reducing CPU power consumption.
Experiments conducted under a set of real-life usage scenarios on a laptop show that the best, worst, and average reductions of system energy consumption by the SE-mode GPScheDVS were 24%, -1%, and 17%, respectively, over the no-DVS case, and 11%, -1%, and 5%, respectively, over the state-of-the-art task-based DVS scheme in the current DVS paradigm. The experiments also show that the best, worst, and average reductions of CPU energy consumption by the SE-mode GPScheDVS were 69%, 0%, and 43% over the no-DVS case and 26%, -1%, and 13% over the state-of-the-art task-based DVS scheme. Considering that no power management was performed on non-CPU components in the experiments, these results imply that the system energy savings achievable by GPScheDVS will increase if the non-CPU components' power is properly managed. On the other hand, the best, worst, and average reductions of average CPU power by the CP-mode GPScheDVS were 69%, 49%, and 60% over the no-DVS case and 63%, 0%, and 30% over the existing task-based DVS scheme. Furthermore, oscilloscope measurements show that the best, worst, and average reductions of peak system power by the CP-mode GPScheDVS were 29%, 10%, and 23% over the no-DVS case and 28%, 6%, and 22% over the existing task-based DVS scheme, signifying that GPScheDVS is also effective in restraining peak CPU power. On top of these advantages in energy and power, the experimental results show that GPScheDVS even improves system performance in either mode, thanks to its deadline-based task scheduling. For example, the deadline meet ratio on continuous video by GPScheDVS was at least 91.2%, whereas the ratios for the no-DVS case and the existing task-based DVS scheme dropped to 71.3% and 71.0%, respectively. / Ph. D.
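The arithmetic at the heart of any deadline-aware DVS scheme is simple: given an estimate of a task's remaining work and its deadline, run at the slowest available frequency that still finishes in time. The sketch below shows only that core calculation, under assumed P-states and workload numbers; GPSDVS itself is a task-based scheme layered on GPSched's deadline-ordered scheduling and is considerably more involved.

```cpp
#include <cstdio>
#include <vector>

// Deadline-driven speed selection (illustrative): pick the lowest discrete
// frequency that can retire `cycles` of work before the deadline elapses.
double pickFrequency(double cycles, double deadline_s,
                     const std::vector<double>& freqs_hz /* ascending */) {
    double needed = cycles / deadline_s;          // minimal feasible speed
    for (double f : freqs_hz)
        if (f >= needed) return f;                // slowest sufficient P-state
    return freqs_hz.back();                       // saturate at f_max
}

int main() {
    std::vector<double> pstates = {0.8e9, 1.2e9, 1.6e9, 2.0e9};  // assumed P-states
    // 30M cycles of decoding due within a 33 ms video frame period:
    double f = pickFrequency(30e6, 0.033, pstates);
    std::printf("chosen frequency: %.1f GHz\n", f / 1e9);  // ~0.91 GHz needed -> 1.2 GHz
}
```

This is exactly where deadline-blind GPOS scheduling hurts: without knowing which task is urgent, the scheme must over-provision frequency to protect every task.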
25

Programming High-Performance Clusters with Heterogeneous Computing Devices

Aji, Ashwin M. 19 May 2015 (has links)
Today's high-performance computing (HPC) clusters are seeing an increase in the adoption of accelerators like GPUs, FPGAs and co-processors, leading to heterogeneity in the computation and memory subsystems. To program such systems, application developers typically employ a hybrid programming model: MPI across the compute nodes in the cluster, and an accelerator-specific library (e.g., CUDA, OpenCL, OpenMP, OpenACC) across the accelerator devices within each compute node. Such explicit management of disjoint computation and memory resources leads to reduced productivity and performance. This dissertation focuses on designing, implementing and evaluating a runtime system for HPC clusters with heterogeneous computing devices. This work also explores extending existing programming models to make use of our runtime system for easier code modernization of existing applications. Specifically, we present MPI-ACC, an extension to the popular MPI programming model and runtime system for efficient data movement and automatic task mapping across the CPUs and accelerators within a cluster, and discuss the lessons learned. MPI-ACC's task-mapping runtime subsystem performs fast and automatic device selection for a given task. MPI-ACC's data-movement subsystem includes careful optimizations for end-to-end communication among CPUs and accelerators, which are seamlessly leveraged by application developers. MPI-ACC provides a familiar, flexible and natural interface for programmers to choose the right computation or communication targets, while its runtime system achieves efficient cluster utilization. / Ph. D.
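To make the productivity cost concrete, below is the manual staging pattern that hybrid MPI+CUDA applications typically contain today, and which MPI-ACC is designed to eliminate by letting MPI operate on device buffers directly (its exact API is not reproduced here). The MPI and CUDA calls are real; the helper names are ours.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Hand-written staging: device data must be copied to a host buffer before
// MPI can send it, and mirrored back on the receiving side. A runtime like
// MPI-ACC performs and pipelines such copies internally.
void sendDeviceBuffer(const float* d_buf, int n, int dest) {
    float* h_buf = new float[n];
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, n, MPI_FLOAT, dest, /*tag=*/0, MPI_COMM_WORLD);
    delete[] h_buf;
}

void recvDeviceBuffer(float* d_buf, int n, int src) {
    float* h_buf = new float[n];
    MPI_Recv(h_buf, n, MPI_FLOAT, src, /*tag=*/0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    delete[] h_buf;
}
```

Besides being verbose, this pattern serializes the copy and the send; pipelining them is one of the end-to-end optimizations the abstract alludes to.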
26

Structural, economic and material comparison of various steel grades under fatigue loading

Amobi, Ikechukwu Ugochukwu 28 March 2008 (has links)
As industries upgrade rapidly from lower steel grades to higher ones, it has become necessary to study the effect of this change. This thesis reports on the fatigue life and behaviour, economic implications and material composition of these high-strength steels (HSS) compared to the conventional grades. Three grades are commercially available in South Africa: 300W, 350W and 460W. These different steel grades (conventional and HSS), with the same moment capacities, were subjected to constant dynamic stresses, and the fatigue crack growth under overloading and unloading was monitored and compared across grades. Under the influence of overloading and unloading, the standard grades performed better under repeated loading than the HSS, since HSS have been shown to have poorer ductility, resulting in a lower number of cycles to failure. An 85% increase in material cost was incurred when HSS replaced the conventional lower steel grades, while the reduction in the number of cycles to failure for HSS was over 500%. A space analysis for a multi-storey building with 10 beam floors was conducted for the various steel grades using a software package, and the buckling and linear behaviours of these structures were compared. Although the deflections were not far apart, it was clearly shown that the lower steel grades performed better than the higher grades. An optimization was conducted, using parameters discussed in the text or obtained from experiment and computer modelling, to aid in the selection of a general-purpose steel. Grade 300W was the optimal grade, although this result was based mainly on the cost and fatigue behaviour of the three grades.
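The abstract does not state which crack-growth model the monitoring was compared against; for context, constant-amplitude fatigue crack growth in steels is conventionally described by Paris' law, sketched below, where a is the crack length, N the number of load cycles, ΔK the stress-intensity-factor range, and C, m material constants (m is typically around 2 to 4 for structural steels).

```latex
\frac{da}{dN} = C\,(\Delta K)^m,
\qquad
\Delta K = Y\,\Delta\sigma\,\sqrt{\pi a}
```

Here Y is a dimensionless geometry factor and Δσ the applied stress range.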
27

Energy characterization of high efficiency video coding (HEVC) on a general-purpose processor

Monteiro, Eduarda Rodrigues January 2017 (has links)
The popularization of applications that manipulate high-resolution digital video brings several challenges in developing new and efficient techniques to maintain video compression efficiency. To meet this demand, the HEVC standard was proposed, aiming to double the compression rate compared to its predecessors. However, to achieve this goal, HEVC imposes a high computational cost and, consequently, increased energy consumption. This scenario becomes even more concerning for battery-powered mobile devices, which face computational constraints when processing multimedia applications. Most related work typically concentrates its contributions on reducing and managing the computational effort of the encoding process. However, the literature lacks information regarding the energy consumed by video encoding, and especially the energy impact of the cache memory hierarchy in this context. This thesis presents a methodology for the energy characterization of HEVC video encoding on a general-purpose processor. The main goal of this methodology is to provide quantitative data regarding the energy consumption of HEVC. The methodology is composed of two modules: one focused on HEVC encoding itself, and the other on HEVC's behaviour with respect to the cache memory. One of the main advantages of the second module is that it remains independent of the application and processor architecture. Several analyses were performed to characterize the energy consumption of the HEVC encoder on a general-purpose processor, considering different video sequences, resolutions, and encoder parameters. In addition, an extensive and detailed analysis of different possible cache configurations was carried out to evaluate their energy impact on encoding. The results obtained with the proposed characterization demonstrate that managing the video coding parameters jointly with the cache memory specifications has high potential for reducing the energy consumption of video coding while maintaining good visual quality in the encoded sequences.
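The abstract does not specify the measurement infrastructure used; one plausible way to obtain the per-encoding energy figures such a methodology needs, on a Linux/Intel general-purpose processor, is the RAPL powercap counter. The sysfs path below is real; treating a single before/after read as "the encoder's energy" is the simplifying assumption here.

```cpp
#include <cstdio>
#include <fstream>

// Read the cumulative CPU package energy (microjoules) from the Linux
// powercap interface for Intel RAPL. Returns -1 if the counter is absent.
long long readEnergyMicroJoules() {
    std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
    long long uj = -1;
    f >> uj;
    return uj;
}

int main() {
    long long before = readEnergyMicroJoules();
    // ... run the HEVC encoder on one video sequence here ...
    long long after = readEnergyMicroJoules();
    if (before >= 0 && after >= 0)
        std::printf("encoding energy: %.3f J\n", (after - before) / 1e6);
    // Note: the counter wraps around; production code must handle overflow
    // using the adjacent max_energy_range_uj file.
}
```

Cache-related energy, by contrast, is usually obtained from a simulator or per-event models, which is consistent with the architecture-independent second module described above.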
28

A general purpose artificial intelligence framework for the analysis of complex biological systems

Kalantari, John I. 15 December 2017 (has links)
This thesis encompasses research on artificial intelligence in support of automating scientific discovery in the fields of biology and medicine. At the core of this research is the ongoing development of a general-purpose artificial intelligence framework emulating various facets of human-level intelligence necessary for building cross-domain knowledge that may lead to new insights and discoveries. To learn and build models in a data-driven manner, we develop a general-purpose learning framework called Syntactic Nonparametric Analysis of Complex Systems (SYNACX), which uses tools from Bayesian nonparametric inference to learn the statistical and syntactic properties of biological phenomena from sequence data. We show that the models learned by SYNACX offer performance comparable to that of standard neural network architectures. For complex biological systems or processes consisting of several heterogeneous components with spatio-temporal interdependencies across multiple scales, learning frameworks like SYNACX can become unwieldy due to the resultant combinatorial complexity. We therefore also investigate ways to robustly reduce data dimensionality by introducing a new data abstraction. In particular, we extend traditional string and graph grammars in a new modeling formalism, which we call Simplicial Grammar. This formalism integrates the topological properties of the simplicial complex with the expressive power of stochastic grammars in a computational abstraction with which we can decompose complex system behavior into a finite set of modular grammar rules that parsimoniously describe the spatial/temporal structure and dynamics of patterns inferred from sequence data.
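For readers unfamiliar with the Bayesian nonparametric machinery SYNACX builds on, the sketch below draws cluster assignments from a Chinese restaurant process, the canonical nonparametric prior in which the number of clusters grows with the data rather than being fixed in advance. It illustrates only this seating mechanism, not SYNACX's actual models.

```cpp
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    double alpha = 1.0;                 // concentration: appetite for new clusters
    std::vector<int> tableSizes;        // customers seated at each table
    for (int n = 0; n < 20; ++n) {      // seat 20 customers one by one
        // Join table t with probability ~ size(t); open a new one ~ alpha.
        std::vector<double> w(tableSizes.begin(), tableSizes.end());
        w.push_back(alpha);
        std::discrete_distribution<int> pick(w.begin(), w.end());
        int t = pick(rng);
        if (t == (int)tableSizes.size()) tableSizes.push_back(1);
        else tableSizes[t]++;
    }
    for (size_t i = 0; i < tableSizes.size(); ++i)
        std::printf("table %zu: %d customers\n", i, tableSizes[i]);
}
```

The "rich get richer" dynamics visible in the table sizes are what let such models infer an appropriate number of latent components directly from sequence data.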
29

Putting Queens in Carry Chains

Preußer, Thomas B., Nägel, Bernd, Spallek, Rainer G. 14 November 2012 (has links) (PDF)
This paper describes an FPGA implementation of a solution-counting solver for the N-Queens Puzzle. The proposed algorithmic mapping utilizes the fast carry-chain logic found on modern FPGA architectures in order to achieve a regular and efficient design. Starting from an initial full-chessboard mapping, several optimization strategies are explored. We also describe the infrastructure we have constructed for computing the currently unknown solution count of the 26-Queens Puzzle. Finally, we compare the performance of the concrete FPGA device mappings we used against that of general-purpose CPUs.
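The carry-chain mapping has a well-known software counterpart: the bitboard N-Queens counter, in which three bitmasks track attacked columns and diagonals, and the lowest-set-bit extraction (the very carry-propagation operation FPGA carry chains accelerate) enumerates free squares. A minimal C++ version for context:

```cpp
#include <cstdio>

// Classic bitboard N-Queens solution counter. Each recursion level places
// one queen in the next row; cols/diag1/diag2 mark attacked columns and
// the two diagonal directions, shifted as the search descends a row.
static long long countQueens(unsigned cols, unsigned diag1, unsigned diag2,
                             unsigned all) {
    if (cols == all) return 1;                    // every row filled: 1 solution
    long long count = 0;
    unsigned free_ = all & ~(cols | diag1 | diag2);
    while (free_) {
        unsigned bit = free_ & (0u - free_);      // extract lowest free square
        free_ -= bit;
        count += countQueens(cols | bit, (diag1 | bit) << 1,
                             (diag2 | bit) >> 1, all);
    }
    return count;
}

int main() {
    int n = 8;
    unsigned all = (1u << n) - 1;                 // n columns available
    std::printf("N=%d solutions: %lld\n", n, countQueens(0, 0, 0, all));  // 92
}
```

The FPGA design in the paper effectively unrolls this recursion into hardware, with the carry chain performing the lowest-free-square selection in a single fast pass.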
