111

An Evaluation of TensorFlow as a Programming Framework for HPC Applications

Chien, Wei Der January 2018 (has links)
In recent years, deep learning, a branch of machine learning, has gained increasing popularity due to its extensive applications and performance. At the core of these applications is dense matrix-matrix multiplication. Graphics Processing Units (GPUs) are commonly used in the training process due to their massively parallel computation capabilities. In addition, specialized low-precision accelerators have emerged to specifically address tensor operations. Software frameworks such as TensorFlow have also emerged to increase the expressiveness of neural network model development. In TensorFlow, computation problems are expressed as computation graphs, where nodes denote operations and edges denote data movement between operations. With an increasing number of heterogeneous accelerators that might co-exist on the same cluster system, it has become increasingly difficult for users to program efficient and scalable applications. TensorFlow provides a high level of abstraction, and operations of a computation graph can easily be placed on a device through a high-level API. In this work, the usability of TensorFlow as a programming framework for HPC applications is reviewed. We give an introduction to TensorFlow as a programming framework and paradigm for distributed computation. Two sample applications are implemented in TensorFlow: tiled matrix multiplication and a conjugate gradient solver for large linear systems. We illustrate how such problems can be expressed as computation graphs for distributed computation. We perform scalability tests, comment on performance scaling results, and quantify how TensorFlow can take advantage of HPC systems by micro-benchmarking communication performance. Through this work, we show that TensorFlow is an emerging and promising platform well suited for a particular class of problems that requires very little synchronization.
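To make the graph-and-placement model concrete, here is a minimal sketch in the TensorFlow 1.x graph-mode API (the version contemporary with the thesis). The device strings, matrix sizes, and two-way tiling are illustrative assumptions, not the thesis's actual configuration:

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode API

# Build a computation graph for a two-way tiled matrix multiplication,
# pinning each tile's matmul to a (possibly different) device.
graph = tf.Graph()
with graph.as_default():
    a_top = tf.random_normal([512, 1024])
    a_bot = tf.random_normal([512, 1024])
    b = tf.random_normal([1024, 256])

    with tf.device('/device:GPU:0'):       # placeholder device string
        c_top = tf.matmul(a_top, b)        # one tile of the product
    with tf.device('/device:GPU:1'):       # placeholder device string
        c_bot = tf.matmul(a_bot, b)        # the other tile

    c = tf.concat([c_top, c_bot], axis=0)  # graph edges carry data between ops

with tf.Session(graph=graph) as sess:
    result = sess.run(c)  # the runtime schedules each op on its placed device
```

Because the two tiles have no mutual dependence until the final concat, the runtime is free to execute them on their assigned devices concurrently; this is the kind of low-synchronization structure the abstract identifies as TensorFlow's sweet spot.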
112

A Multi-GPU Compute Solution for Optimized Genomic Selection Analysis

Devore, Trevor 01 June 2014 (has links) (PDF)
Many modern-day bioinformatics algorithms rely heavily on statistical models to analyze their biological data. Some of these statistical models lend themselves nicely to standard high-performance computing optimizations such as parallelism, while others do not. One such algorithm is Markov chain Monte Carlo (MCMC). In this thesis, we present a heterogeneous compute solution for optimizing GenSel, a genetic selection analysis tool. GenSel utilizes an MCMC algorithm to perform Bayesian inference using Gibbs sampling. Optimizing an MCMC algorithm is a difficult problem because it is inherently sequential, containing a loop-carried dependence between Markov chain iterations. The optimization presented in this thesis utilizes GPU computing to exploit the data-level parallelism within each of these iterations. In addition, it allows for the efficient management of memory, the pipelining of CUDA kernels, and the use of multiple GPUs. The optimizations presented show speedups of up to 1.84 times over the original algorithm.
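To illustrate why the chain is sequential while each iteration still contains exploitable data parallelism, here is a hedged sketch of a two-block Gibbs sampler for a generic conjugate Bayesian linear model (NumPy standing in for the GPU; this is not GenSel's marker-effect model, and all priors and names are illustrative assumptions):

```python
import numpy as np

def gibbs_linear_model(X, y, n_iters=2000, tau2=1.0, a0=1.0, b0=1.0, seed=0):
    """Two-block Gibbs sampler for a conjugate Bayesian linear model:
    y = X @ beta + noise, with prior beta ~ N(0, tau2*I) and noise
    variance sigma2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y            # data-parallel linear algebra
    beta, sigma2 = np.zeros(p), 1.0
    samples = []
    for _ in range(n_iters):               # loop-carried dependence: sequential
        # Draw beta | sigma2, y from its full conditional (parallel inside).
        prec = XtX / sigma2 + np.eye(p) / tau2
        cov = np.linalg.inv(prec)
        beta = rng.multivariate_normal(cov @ (Xty / sigma2), cov)
        # Draw sigma2 | beta, y from its inverse-gamma full conditional.
        resid = y - X @ beta               # also data-parallel work
        sigma2 = 1.0 / rng.gamma(a0 + n / 2, 1.0 / (b0 + resid @ resid / 2))
        samples.append((beta.copy(), sigma2))
    return samples
```

The outer loop cannot be parallelized across iterations, but the matrix products and residual updates inside each one are exactly the kind of within-iteration work a GPU can absorb.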
113

Massively parallel GPU computing of continuum robotic dynamics

Orellana, Roberto A 30 April 2011 (has links)
Continuum robots, with the capability of bending and extending at any point along their length, mimic the abilities of an octopus arm or an elephant trunk. These manipulators present a number of exciting possibilities. While calculating a static solution for the system has been proven with certain models to produce satisfactory results [1], this approach ignores the significant effects a dynamics solution captures. However, adding time and studying the physical effects produced on a continuum robot involves calculating the robot's shape at a number of discrete points. Typically, the separation between points is very small, and thus a solution requires large amounts of computational power. We present a method to improve calculation speed for dynamic problems with the use of CUDA, a framework for parallel GPU computing. GPUs are ideally suited to massively parallel computations because of their multi-processor architecture. Our dynamics solution takes advantage of this parallel environment.
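As a hedged illustration of the per-point parallelism described above, here is a minimal sketch in Numba's CUDA dialect (the thesis uses CUDA directly; the explicit-Euler update and all names here are illustrative assumptions, not the author's dynamics model):

```python
import numpy as np
from numba import cuda

@cuda.jit
def integrate_points(pos, vel, acc, dt):
    # One explicit time step: each discretized point along the robot's
    # backbone is updated by its own GPU thread.
    i = cuda.grid(1)
    if i < pos.shape[0]:
        for k in range(3):
            vel[i, k] += acc[i, k] * dt
            pos[i, k] += vel[i, k] * dt

# Host side: allocate n closely spaced points, launch one thread per point.
n = 100_000
pos = cuda.to_device(np.zeros((n, 3), dtype=np.float32))
vel = cuda.to_device(np.zeros((n, 3), dtype=np.float32))
acc = cuda.to_device(np.random.rand(n, 3).astype(np.float32))
threads = 256
blocks = (n + threads - 1) // threads
integrate_points[blocks, threads](pos, vel, acc, np.float32(1e-3))
```

Because the point spacing is small and the point count correspondingly large, each time step offers thousands of independent updates, which is where the GPU's multi-processor architecture pays off.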
114

Performance Enhancements of the Spin-Image Pose Estimation Algorithm

Gerlach, Adam R. 12 April 2010 (has links)
No description available.
115

GPU-Assisted Rendering of Large Tree-Shaped Data Sets

Mangalvedkar, Pallavi Ramachandra 27 November 2007 (has links)
No description available.
116

Runtime Adaptation for Autonomic Heterogeneous Computing

Scogland, Thomas R. 12 December 2014 (has links)
Heterogeneity is increasing across all levels of computing, with the rise of accelerators such as GPUs, FPGAs, and other coprocessors into everything from cell phones to supercomputers. More quietly, it is increasing with the rise of NUMA systems, hierarchical caching, OS noise, and a myriad of other factors. As heterogeneity becomes a fact of life, efficiently managing heterogeneous compute resources is becoming a critical, and ever more complex, task. The focus of this dissertation is to lay the foundation for an autonomic system for heterogeneous computing, employing runtime adaptation to improve performance portability and performance consistency while maintaining or increasing programmability. We investigate heterogeneity arising from a range of factors, grouped into the dimensions of locality and capability. This work has resulted in runtime schedulers capable of automatically detecting and mitigating heterogeneity in physically homogeneous systems through MPI, in adaptive coscheduling for physically heterogeneous accelerator-based systems, and in a synthesis of the two that addresses multiple levels of heterogeneity as a coherent whole. We also discuss our current work toward the next generation of fine-grained scheduling and synchronization across heterogeneous platforms in the design of a highly scalable and portable concurrent queue for many-core systems. Each component addresses aspects of the urgent need for automated management of the extreme and ever-expanding complexity introduced by heterogeneity. / Ph. D.
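A hedged sketch of the adaptive-coscheduling idea: split a batch of work between two devices and nudge the split toward whichever side is currently faster. The function names and the feedback rule are illustrative assumptions, not the dissertation's scheduler, and a real coscheduler would run the two halves concurrently rather than back to back:

```python
import time

def adaptive_split(run_cpu, run_gpu, work_items, steps=10, ratio=0.5):
    """Ratio-based work splitting driven by measured throughput;
    run_cpu and run_gpu are callables that process a list of items."""
    for _ in range(steps):
        cut = int(len(work_items) * ratio)
        gpu_part, cpu_part = work_items[:cut], work_items[cut:]

        t0 = time.perf_counter(); run_gpu(gpu_part)
        t1 = time.perf_counter(); run_cpu(cpu_part)
        t2 = time.perf_counter()

        gpu_rate = len(gpu_part) / max(t1 - t0, 1e-9)  # items per second
        cpu_rate = len(cpu_part) / max(t2 - t1, 1e-9)
        # Shift work toward the faster device, but keep both in play.
        ratio = min(max(gpu_rate / (gpu_rate + cpu_rate), 0.05), 0.95)
    return ratio
```

The appeal of feedback-driven splitting is that it needs no model of why one side is slow (thermal throttling, OS noise, NUMA effects); it simply measures and reacts, which is what lets the same mechanism mask heterogeneity in nominally homogeneous systems too.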
117

Exploring Performance Portability for Accelerators via High-level Parallel Patterns

Hou, Kaixi 27 August 2018 (has links)
Nowadays, parallel accelerators have become prominent and ubiquitous, e.g., multi-core CPUs, many-core GPUs (Graphics Processing Units), and Intel Xeon Phi. The performance gains from them can be as high as many orders of magnitude, attracting extensive interest from many scientific domains. However, the gains are closely followed by two main problems: (1) a complete redesign of existing codes might be required if a new parallel platform is used, leading to a nightmare for developers; (2) parallel codes that execute efficiently on one platform might be either inefficient or even non-executable on another platform, causing portability issues. To handle these problems, in this dissertation, we propose a general approach using parallel patterns, an effective and abstracted layer that eases the generation of efficient parallel codes for given algorithms across architectures. From algorithms to parallel patterns, we exploit domain expertise to analyze the computational and communication patterns in the core computations and represent them in a DSL (Domain-Specific Language) or algorithmic skeletons. This preserves the essential information, such as data dependencies and types, for subsequent parallelization and optimization. From parallel patterns to actual codes, we use a series of automation frameworks and transformations to determine which levels of parallelism can be used, what the optimal instruction sequences are, how the implementation changes to match different architectures, and so on. We evaluate our approaches on several important computational kernels, including sort (and segmented sort), sequence alignment, and stencils, across various parallel platforms (CPUs, GPUs, Intel Xeon Phi). / Ph. D.
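A hedged sketch of the algorithmic-skeleton idea: the pattern fixes "what" is computed, while interchangeable backends decide "how". The backend names are illustrative assumptions; a real system like the one described would emit tuned GPU or Xeon Phi code rather than dispatch in Python:

```python
from concurrent.futures import ProcessPoolExecutor

def map_skeleton(fn, data, backend="serial"):
    """A 'map' pattern: fn is applied elementwise with no cross-element
    dependence, so any backend is free to reorder or parallelize."""
    if backend == "serial":
        return [fn(x) for x in data]
    if backend == "multicore":
        with ProcessPoolExecutor() as pool:
            return list(pool.map(fn, data))
    raise ValueError(f"unknown backend: {backend}")

def square(x):  # must be a top-level function for process pools to pickle it
    return x * x

if __name__ == "__main__":
    print(map_skeleton(square, range(8), backend="multicore"))
```

The point of the abstraction is that the caller's code is identical for every backend; the dependence information captured by the pattern (here, none between elements) is what licenses each backend's parallelization strategy.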
118

CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures

Martinez Arroyo, Gabriel Ernesto 02 September 2011 (has links)
The use of graphics processing units (GPUs) in high-performance parallel computing continues to steadily become more prevalent, often as part of a heterogeneous system. For years, CUDA has been the de facto programming environment for nearly all general-purpose GPU (GPGPU) applications. In spite of this, the framework is available only on NVIDIA GPUs, traditionally requiring reimplementation in other frameworks in order to utilize additional multi- or many-core devices. On the other hand, OpenCL provides an open and vendor-neutral programming environment and run-time system. With implementations available for CPUs, GPUs, and other types of accelerators, OpenCL therefore holds the promise of a "write once, run anywhere" ecosystem for heterogeneous computing. Given the many similarities between CUDA and OpenCL, manually porting a CUDA application to OpenCL is almost straightforward, albeit tedious and error-prone. In response to this issue, we created CU2CL, an automated CUDA-to-OpenCL source-to-source translator with a novel design built on clever reuse of the Clang compiler framework. Currently, the CU2CL translator covers the primary constructs found in the CUDA Runtime API, and we have successfully translated several applications from the CUDA SDK and the Rodinia benchmark suite. CU2CL's translation times are reasonable, allowing many applications to be translated at once. The number of manual changes required after running our translator on CUDA source is minimal, with some applications compiling and working with no changes at all. The performance of applications automatically translated by CU2CL is on par with their manually ported counterparts. / Master of Science
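To convey the flavor of such a translation, here is a hedged toy sketch: a naive textual substitution over a handful of real CUDA-to-OpenCL correspondences. CU2CL itself rewrites source via the Clang AST, not text, and handles far more than this:

```python
import re

# A few genuine CUDA-to-OpenCL correspondences (the signatures differ in
# reality, so pure text substitution is only a first approximation).
API_MAP = [
    (r"\b__global__\b",            "__kernel"),
    (r"\bthreadIdx\.x\b",          "get_local_id(0)"),
    (r"\bblockIdx\.x\b",           "get_group_id(0)"),
    (r"\bblockDim\.x\b",           "get_local_size(0)"),
    (r"\bcudaMalloc\b",            "clCreateBuffer"),
    (r"\bcudaMemcpy\b",            "clEnqueueWriteBuffer"),
    (r"\bcudaFree\b",              "clReleaseMemObject"),
    (r"\bcudaDeviceSynchronize\b", "clFinish"),
]

def toy_translate(cuda_source: str) -> str:
    """Textual substitution only; a real translator must also rewrite
    argument lists, kernel launches (<<<...>>>), and memory-flag semantics."""
    for pattern, repl in API_MAP:
        cuda_source = re.sub(pattern, repl, cuda_source)
    return cuda_source

print(toy_translate("__global__ void scale(float *v) { v[threadIdx.x] *= 2.f; }"))
```

The close one-to-one mapping visible in the table is exactly the similarity the abstract cites as making manual ports "almost straightforward", while the caveats in the comments are why doing it by hand remains tedious and error-prone.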
119

Power Analysis and Prediction for Heterogeneous Computation

Dutta, Bishwajit 12 February 2018 (has links)
Power, performance, and cost dictate the procurement and operation of high-performance computing (HPC) systems, which use graphics processing units (GPUs) for a performance boost. In order to identify systems that are inexpensive to acquire and inexpensive to operate, it is important to do a systematic comparison of such systems with respect to power, performance, and energy characteristics on the end-use applications. Additionally, the chosen systems must often achieve performance objectives without exceeding their respective power budgets, a task that is usually borne by a software-based power management system. Accurately predicting the power consumption of an application at different DVFS levels (or, more generally, different processor configurations) is paramount for the efficient functioning of such a management system. This thesis applies the state of the art in green-computing research to optimize the total cost of acquisition and ownership of heterogeneous computing systems. To achieve this, we take a two-fold approach. First, we explore the issue of greener device selection by characterizing device power and performance. For this, we explore previously untapped opportunities arising from a special type of graphics processor, the low-power integrated GPU, which is commonly available in commodity systems. We compare the greenness (power, energy, and energy-delay product, or EDP) of the integrated GPU against a CPU running at different frequencies for the specific application domain of scientific visualization. Second, we explore the problem of predicting the power consumption of a GPU at different DVFS states via machine-learning techniques. Specifically, we perform statistically rigorous experiments to uncover the strengths and weaknesses of eight different machine-learning techniques (namely, ZeroR, simple linear regression, KNN, bagging, random forest, SMO regression, decision tree, and neural networks) in predicting GPU power consumption at different frequencies. Our study shows that a support-vector-machine-aided regression model (i.e., SMO regression) achieves the highest accuracy, with a mean absolute error (MAE) of 4.5%. We also observe that the random forest method produces the most consistent results, with a reasonable overall MAE of 7.4%. Our results also show that different models operate best in distinct regions of the application space. We therefore develop a novel ensemble technique drawing on the best characteristics of the various algorithms, which reduces the MAE to 3.5% and the maximum error from 20% (for SMO regression) to 11%. / MS
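A hedged sketch of the model-comparison-plus-ensemble workflow (generic scikit-learn estimators stand in for a few of the eight techniques named above; the synthetic features and the naive averaging ensemble are illustrative assumptions, whereas the thesis's ensemble draws on whichever model is best in each region of the application space):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Placeholder data: features might be (frequency, utilization, ...); target = watts.
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = 50 + 100 * X[:, 0] + 10 * rng.standard_normal(500)  # synthetic power signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "svr": SVR(),  # support vector regression, in the spirit of SMO regression
    "forest": RandomForestRegressor(random_state=0),
}
preds = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    preds[name] = model.predict(X_te)
    print(name, mean_absolute_error(y_te, preds[name]))

# Naive ensemble: average the individual predictions.
ensemble = np.mean(list(preds.values()), axis=0)
print("ensemble", mean_absolute_error(y_te, ensemble))
```

Comparing per-model MAE against the ensemble's MAE on held-out data mirrors the thesis's evaluation methodology, even though the real study uses measured GPU power at multiple DVFS states rather than synthetic data.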
120

GPU Based Large Scale Multi-Agent Crowd Simulation and Path Planning

Gusukuma, Luke 13 May 2015 (has links)
Crowd simulation is used for many applications, including (but not limited to) video games, building planning, training simulators, and various virtual-environment applications. In particular, crowd simulation is most useful when real-life practice would not be practical, such as repetitively evacuating a building, testing crowd flow for various building blueprints, or placing law enforcers in actual crowd-suppression circumstances. In our work, we approach the fidelity-versus-scalability problem of crowd simulation from two angles, a programmability angle and a scalability angle, by creating a new methodology that builds on a struct-of-arrays approach and transforms it into an object-oriented struct-of-arrays approach. While the design pattern itself is applied to crowd simulation in our work, crowd simulation exemplifies the variety of applications for which the design pattern can be used. / Master of Science
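A hedged sketch of the struct-of-arrays layout with an object-oriented facade over it (the fields, the steering rule, and the class names are illustrative assumptions, not the thesis's design):

```python
import numpy as np

class AgentSoA:
    """Struct-of-arrays: each attribute is one contiguous array across all
    agents, which vectorized (or GPU) code can stream through at once."""
    def __init__(self, n):
        self.pos = np.zeros((n, 2), dtype=np.float32)
        self.vel = np.zeros((n, 2), dtype=np.float32)
        self.goal = np.random.rand(n, 2).astype(np.float32)

    def step(self, dt):
        # Vectorized over all agents: steer toward the goal, then integrate.
        desired = self.goal - self.pos
        dist = np.maximum(np.linalg.norm(desired, axis=1, keepdims=True), 1e-6)
        self.vel = desired / dist
        self.pos += self.vel * dt

class AgentView:
    """Object-oriented facade: looks like a per-agent object, but reads and
    writes the shared arrays, keeping agent-level code readable."""
    def __init__(self, soa, i):
        self._soa, self._i = soa, i
    @property
    def pos(self):
        return self._soa.pos[self._i]

crowd = AgentSoA(10_000)
crowd.step(0.016)               # one frame's update for all agents at once
print(AgentView(crowd, 0).pos)
```

Keeping each attribute contiguous serves the scalability angle (coalesced memory access for vector or GPU hardware), while the per-agent view serves the programmability angle, which is the trade-off the object-oriented struct-of-arrays pattern is meant to reconcile.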
