61 |
Performance Modeling, Optimization, and Characterization on Heterogeneous Architectures
Panwar, Lokendra Singh, 21 October 2014 (has links)
Today, heterogeneous computing has truly reshaped the way scientists think about and approach high-performance computing (HPC). Hardware accelerators such as general-purpose graphics processing units (GPUs) and the Intel Many Integrated Core (MIC) architecture continue to make inroads in accelerating large-scale scientific applications. These advances, however, introduce new challenges for the scientific community, such as selecting the best processor for an application, devising effective performance optimization strategies, and maintaining performance portability across architectures. In this thesis, we present our techniques and approaches to address some of these significant issues.
First, we present a fully automated approach to project the relative performance of an OpenCL program across different GPUs. Performance projections can be made within a small amount of time, and the projection overhead stays relatively constant with the input data size. As a result, the technique can help runtime tools make dynamic decisions about which GPU would run a given kernel faster. Use cases of this technique include scheduling or migrating GPU workloads over a heterogeneous cluster with different types of GPUs.
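The scheduling use case amounts to picking the device with the lowest projected runtime for a kernel. A minimal, hypothetical sketch (the projection values and device names below are invented for illustration; the projection mechanism itself is not shown) is:

    def pick_gpu(projected_time, kernel):
        """Choose the GPU with the lowest projected runtime for `kernel`.

        projected_time: dict mapping (kernel, gpu) -> projected seconds, as a
        relative-performance projection pass might produce.
        """
        candidates = {gpu: t for (k, gpu), t in projected_time.items() if k == kernel}
        return min(candidates, key=candidates.get)

    projections = {("stencil", "gpu-A"): 0.8, ("stencil", "gpu-B"): 1.3}
    print(pick_gpu(projections, "stencil"))  # -> gpu-A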
We then present our approach to accelerating a seismology modeling application based on the finite difference method (FDM), using MPI and CUDA over a hybrid CPU+GPU cluster. We describe the general computational challenges involved in porting such applications to GPUs and present our strategy for efficient performance optimization and characterization. We also show how performance modeling can be used to reason about and drive the hardware-specific optimizations on the GPU. The performance evaluation of our approach delivers a maximum speedup of 23-fold with a single GPU and 33-fold with dual GPUs per node over the serial version of the application, which in turn results in a many-fold speedup when coupled with the MPI distribution of the computation across the cluster. We also study the efficacy of GPU-integrated MPI, with MPI-ACC as an example implementation, in a seismology modeling application and discuss the lessons learned. / Master of Science
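To make the computational pattern concrete: the kernel at the heart of such FDM codes is a stencil update applied to every grid point at every time step. The following is a minimal, illustrative 2D wave-equation stencil in Python/NumPy, a stand-in for the CUDA kernels discussed above; the grid size, coefficient, and fixed boundaries are arbitrary choices, not the thesis's actual code:

    import numpy as np

    def fdm_wave_step(u_prev, u_curr, c2_dt2_over_h2):
        """One explicit finite-difference time step of the 2D wave equation:
        u_next = 2*u_curr - u_prev + (c*dt/h)^2 * laplacian(u_curr).
        Interior points only; boundaries stay fixed (illustrative choice)."""
        lap = (u_curr[:-2, 1:-1] + u_curr[2:, 1:-1] +
               u_curr[1:-1, :-2] + u_curr[1:-1, 2:] -
               4.0 * u_curr[1:-1, 1:-1])
        u_next = u_curr.copy()
        u_next[1:-1, 1:-1] = (2.0 * u_curr[1:-1, 1:-1] - u_prev[1:-1, 1:-1]
                              + c2_dt2_over_h2 * lap)
        return u_next

    # Example: a small grid with a point source in the middle, stepped 100 times.
    n = 128
    u0, u1 = np.zeros((n, n)), np.zeros((n, n))
    u1[n // 2, n // 2] = 1.0
    for _ in range(100):
        u0, u1 = u1, fdm_wave_step(u0, u1, 0.25)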
|
62 |
FPGA-Based Accelerator Development for Non-Engineers
Uliana, David Christopher, 02 June 2014 (has links)
In today's world of big-data computing, access to massive, complex data sets has reached an unprecedented level, and the task of intelligently processing such data into useful information has become a growing concern to the high-performance computing community.
However, domain experts, who are the brains behind this processing, typically lack the skills required to build FPGA-based hardware accelerators ideal for their applications, as traditional development flows targeting such hardware require digital design expertise.
This work proposes a usable, end-to-end accelerator development methodology that attempts to bridge this gap between domain experts and the vast computational capacity of FPGA-based heterogeneous platforms.
To accomplish this, two development flows were assembled, both targeting the Convey Hybrid-Core HC-1 heterogeneous platform and utilizing existing graphical design environments for design entry.
Furthermore, incremental implementation techniques were applied to one of the flows to accelerate bitstream compilation, improving design productivity.
The efficacy of these flows in extending FPGA-based acceleration to non-engineers in the life sciences was informally tested at two separate instances of an NSF-funded summer workshop, organized and hosted by the Virginia Bioinformatics Institute at Virginia Tech.
In both workshops, groups of four or five non-engineer participants made significant modifications to a bare-bones Smith-Waterman accelerator, extending functionality and improving performance. / Master of Science
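For context, the accelerator the participants modified implements local sequence alignment via the Smith-Waterman recurrence. A minimal software reference for the scoring-matrix fill (the part typically mapped to a systolic array on the FPGA) might look like the sketch below; the match, mismatch, and gap scores are arbitrary illustration values, not the workshop design:

    def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
        """Fill the Smith-Waterman scoring matrix and return the best local score.

        H[i][j] = max(0,
                      H[i-1][j-1] + s(a[i-1], b[j-1]),  # match / mismatch
                      H[i-1][j]   + gap,                 # deletion
                      H[i][j-1]   + gap)                 # insertion
        """
        rows, cols = len(a) + 1, len(b) + 1
        H = [[0] * cols for _ in range(rows)]
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                s = match if a[i - 1] == b[j - 1] else mismatch
                H[i][j] = max(0, H[i - 1][j - 1] + s,
                              H[i - 1][j] + gap, H[i][j - 1] + gap)
                best = max(best, H[i][j])
        return best

    print(smith_waterman_score("GGTTGACTA", "TGTTACGG"))  # small example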
|
63 |
Collecting and representing parallel programs with high performance instrumentation
Railing, Brian Paul, 07 January 2016 (has links)
Computer architecture faces looming challenges: finding program parallelism, process technology limits, and a limited power budget. To navigate these challenges, a deeper understanding of parallel programs is required. I will discuss the task graph representation and how it enables programmers and compiler optimizations to understand and exploit dynamic aspects of a program.
I will present Contech, which is a high performance framework for generating dynamic task graphs from arbitrary parallel programs. The Contech framework supports a variety of languages and parallelization libraries, and has been tested on both x86 and ARM. I will demonstrate how this framework encompasses a diversity of program analyses, particularly by modeling a dynamically reconfigurable, heterogeneous multi-core processor.
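As a rough illustration of the representation (not Contech's actual task-graph format, which the abstract does not detail), a dynamic task graph can be captured as tasks annotated with the work they perform plus edges for the dependences observed at run time:

    from collections import defaultdict

    class TaskGraph:
        """A minimal dynamic task graph: nodes are tasks, edges are observed
        dependences (e.g., created-by, joined-with, or synchronization order)."""

        def __init__(self):
            self.tasks = {}                      # task id -> metadata dict
            self.successors = defaultdict(set)   # task id -> dependent task ids

        def add_task(self, tid, thread, kind, work=0):
            self.tasks[tid] = {"thread": thread, "kind": kind, "work": work}

        def add_dependence(self, src, dst):
            self.successors[src].add(dst)

        def critical_path_work(self, tid):
            """Work along the longest dependence chain starting at `tid`."""
            base = self.tasks[tid]["work"]
            succ = self.successors[tid]
            return base + (max(self.critical_path_work(s) for s in succ) if succ else 0)

    # Example: a main task spawns two workers that are later joined.
    g = TaskGraph()
    g.add_task("t0", thread=0, kind="compute", work=10)
    g.add_task("t1", thread=1, kind="compute", work=25)
    g.add_task("t2", thread=2, kind="compute", work=15)
    g.add_task("t3", thread=0, kind="join", work=2)
    for a, b in [("t0", "t1"), ("t0", "t2"), ("t1", "t3"), ("t2", "t3")]:
        g.add_dependence(a, b)
    print(g.critical_path_work("t0"))  # 10 + 25 + 2 = 37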
|
64 |
Parallel video decoding
Álvarez Mesa, Mauricio, 08 September 2011 (has links)
Digital video is a popular technology used in many different applications. The quality of video, expressed in spatial and temporal resolution, has been increasing continuously in recent years. In order to reduce the bitrate required for its storage and transmission, a new generation of video encoders and decoders (codecs) has been developed. The latest video codec standard, known as H.264/AVC, includes sophisticated compression tools that require more computing resources than any previous video codec. The combination of high-quality video and the advanced compression tools found in H.264/AVC has resulted in a significant increase in the computational requirements of video decoding applications.
The main objective of this thesis is to provide the performance required for real-time operation of high-quality video decoding using programmable architectures. Our solution has been the simultaneous exploitation of multiple levels of parallelism. On the one hand, video decoders have been modified in order to extract as much parallelism as possible. On the other hand, general-purpose architectures have been enhanced to exploit the type of parallelism that is present in video codec applications.
First, we analyzed the scalability of two Single Instruction Multiple Data (SIMD) extensions: a one-dimensional (1D) one and a two-dimensional (2D) matrix one. We showed that scaling the 2D extension yields higher performance at lower complexity than scaling the 1D extension.
We then characterized H.264/AVC decoding for high-definition (HD) applications and identified its main kernels. Due to the lack of a suitable benchmark for HD video decoding, we developed our own, called HD-VideoBench, which includes complete video encoding and decoding applications together with a set of HD video sequences.
Next, we optimized the most important kernels of the H.264/AVC decoder using SIMD instructions. However, the results did not reach the maximum attainable performance due to the negative effect of unaligned data in memory. As a solution, we evaluated the hardware and software support required for performing unaligned accesses. This support produced significant performance improvements in the application.
Separately, we investigated how to extract task-level parallelism. We found that none of the existing mechanisms could scale to massively parallel systems. As an alternative, we developed a new algorithm that was able to find thousands of independent tasks by exploiting macroblock-level parallelism.
We then implemented a parallel version of the H.264 decoder on a distributed shared memory (DSM) machine. This implementation, however, did not reach the maximum possible performance due to the negative impact of synchronization operations and the effect of the entropy decoding kernel.
To remove these bottlenecks, we evaluated picture-level parallelization of the entropy decoding stage combined with macroblock-level parallelization of the remaining kernels. The overhead of synchronization operations was almost completely eliminated by using hardware-accelerated operations.
With all these improvements, real-time decoding of high-definition, high-frame-rate video became possible. The overall result is a scalable solution capable of using the growing number of cores in multicore architectures.
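The macroblock-level parallelism described above follows a 2D wavefront: in H.264, a macroblock can typically be decoded once its left and top-right neighbors are done. A schematic of that dependence pattern (an illustration of the general idea, not this thesis's implementation) is:

    def wavefront_order(mb_rows, mb_cols):
        """Yield waves of macroblock coordinates that can be decoded in parallel.

        Assumes MB(r, c) depends on MB(r, c-1) (left) and MB(r-1, c+1) (top-right),
        the usual simplification of H.264 macroblock dependences. All macroblocks
        within one wave are mutually independent.
        """
        waves = {}
        for r in range(mb_rows):
            for c in range(mb_cols):
                # Earliest wave in which (r, c) becomes ready is 2*r + c.
                waves.setdefault(2 * r + c, []).append((r, c))
        for w in sorted(waves):
            yield waves[w]

    # Example: print each parallel wave for a 4x6 macroblock grid.
    for wave in wavefront_order(4, 6):
        print(wave)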
|
65 |
Meeting Data Sharing Needs of Heterogeneous Distributed Users
Zhan, Zhiyuan, 16 January 2007 (has links)
The fast growth of wireless networking and mobile computing devices has enabled us to access information from anywhere at any time. However, varying user needs and system resource constraints are two major heterogeneity factors that pose a challenge to information sharing systems. For instance, when a new information item is produced, different users may have different requirements for when the new value should become visible. The resources that each device can contribute to such information sharing applications also vary. Therefore, how to enable information sharing across computing platforms with varying resources to meet different user demands is an important problem for distributed systems research.
In this thesis, we address the heterogeneity challenge faced by such systems. We assume that shared information is encapsulated in distributed objects, and we use object replication to increase system scalability and robustness, which introduces the consistency problem. Many consistency models have been proposed in recent years, but they are either too strong and do not scale well, or too weak to meet many users' requirements. We propose a Mixed Consistency (MC) model as a solution. We introduce an access-constraint-based approach to combine strong and weak consistency models. We also propose an MC protocol that combines existing implementations with minimal modifications. It is designed to tolerate crash failures and slow processes/communication links in the system. We also explore how the heterogeneity challenge can be addressed in the transport layer by developing an agile dissemination protocol. We implement our MC protocol on top of a distributed publish-subscribe middleware, Echo. Finally, we measure the performance of our MC implementation. The results of the experiments are consistent with our expectations. Based on the functionality and performance of mixed consistency protocols, we believe that this model is effective in addressing the heterogeneity of user requirements and available resources in distributed systems.
|
66 |
Graph-Based Control of Networked Systems
Ji, Meng, 11 June 2007 (has links)
Networked systems have attracted great interest from the controls community during the last decade. Several issues arising from recent research are addressed in this dissertation. Connectedness is one of the important conditions that enable distributed coordination in a networked system; nonetheless, it has simply been assumed in most implementations, especially in continuous-time applications, until recently. A nonlinear weighting strategy is proposed in this dissertation to solve the connectedness-preserving problem. Both the rendezvous and formation problems are addressed in the context of homogeneous networks. Controllability of heterogeneous networks is another issue that has long been overlooked. This dissertation contributes a graph-theoretic interpretation of controllability. Distributed sensor networks make up another important class of networked systems. A novel estimation strategy is proposed in this dissertation. The observability problem is raised in the context of our proposed distributed estimation strategy, and a graph-theoretic interpretation is derived as well.
The contributions of this dissertation are as follows:
It solves the connectedness-preserving problem for networked systems. Based on that, a formation process is proposed.
For heterogeneous networks, the leader-follower structure is studied and necessary and sufficient conditions are presented for the system to be controllable (see the sketch after this list).
A novel estimation strategy is proposed for distributed sensor networks, which can improve performance. The observability problem is studied for this estimation strategy and a necessary condition is obtained.
This work is among the first to provide graph-theoretic interpretations of the controllability and observability issues.
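As a hedged sketch of the standard leader-follower formulation behind such controllability results (the dissertation's exact statements may differ), partition the graph Laplacian according to follower and leader nodes and treat the leader states as inputs:

    % Partition the graph Laplacian over followers (f) and leaders (l):
    %   L = [ L_f       l_{fl} ]
    %       [ l_{fl}^T   L_l   ]
    % Treating the leader states x_l as inputs, the follower dynamics are
    \dot{x}_f = -L_f \, x_f - l_{fl} \, x_l ,
    % and the network is controllable from the leaders iff the pair
    % (-L_f, -l_{fl}) satisfies the Kalman rank condition; graph-theoretic
    % conditions restate this rank test in terms of the network topology.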
|
67 |
Harmony: an execution model for heterogeneous systems
Diamos, Gregory Frederick, 10 November 2011 (has links)
The emergence of heterogeneous and many-core architectures presents a unique opportunity to deliver order-of-magnitude performance increases to high-performance applications by matching certain classes of algorithms to specifically tailored architectures. However, their widespread adoption has been limited by a lack of programming models and management frameworks designed to reduce the high degree of complexity of software development inherent to heterogeneous architectures. This dissertation introduces Harmony, an execution model for heterogeneous systems that draws heavily on concepts and optimizations used in processor micro-architecture to provide: (1) semantics for simplifying heterogeneity management, (2) dynamic scheduling of compute-intensive kernels to heterogeneous processor resources, and (3) online, monitoring-driven performance optimization for heterogeneous many-core systems. This work focuses on simplifying development and ensuring binary portability and scalability across system configurations and sizes.
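To illustrate the flavor of such an execution model (a loose analogy to out-of-order issue, not Harmony's actual API or semantics), kernels can be issued as soon as their input variables have been produced and dispatched to whichever processor is predicted to run them fastest:

    def schedule_kernels(kernels, produced, predicted_time):
        """Greedy list scheduler: issue a kernel once its inputs exist, and place
        it on the processor with the lowest predicted runtime for that kernel.

        kernels: list of (name, inputs, outputs)
        produced: set of variable names already available
        predicted_time: dict (kernel name, processor) -> seconds
        """
        pending, order = list(kernels), []
        while pending:
            ready = [k for k in pending if set(k[1]) <= produced]
            if not ready:
                raise RuntimeError("cyclic or unsatisfiable dependences")
            for name, inputs, outputs in ready:
                procs = [p for (k, p) in predicted_time if k == name]
                best = min(procs, key=lambda p: predicted_time[(name, p)])
                order.append((name, best))
                produced.update(outputs)
                pending.remove((name, inputs, outputs))
        return order

    # Example: two independent kernels feeding a third.
    kernels = [("fft", ["a"], ["A"]), ("blur", ["b"], ["B"]), ("combine", ["A", "B"], ["c"])]
    times = {("fft", "gpu"): 0.2, ("fft", "cpu"): 1.0,
             ("blur", "gpu"): 0.1, ("blur", "cpu"): 0.3,
             ("combine", "gpu"): 0.5, ("combine", "cpu"): 0.4}
    print(schedule_kernels(kernels, {"a", "b"}, times))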
|
68 |
Auto-tuning Hybrid CPU-GPU Execution of Algorithmic Skeletons in SkePU
Öhberg, Tomas, January 2018 (has links)
The trend in computer architecture has for several years been toward heterogeneous systems consisting of a regular CPU and at least one additional, specialized processing unit, such as a GPU. The different characteristics of the processing units and the need for multiple tools and programming languages make programming such systems a challenging task. Although tools exist for programming each processing unit, utilizing the full potential of a heterogeneous computer still requires specialized implementations involving multiple frameworks and hand-tuning of parameters. To fully exploit the performance of heterogeneous systems for a single computation, hybrid execution is needed, i.e. execution where the workload is distributed between multiple, heterogeneous processing units working simultaneously on the computation. This thesis presents the implementation of a new hybrid execution backend in the algorithmic skeleton framework SkePU. The skeleton framework already gives programmers a user-friendly interface to algorithmic templates, executable on different hardware using OpenMP, CUDA, and OpenCL. With this extension it is now also possible to divide the computational work of the skeletons between multiple processing units, such as between a CPU and a GPU. The results show an improvement in execution time with the hybrid execution implementation for all skeletons in SkePU. It is also shown that the new implementation gives a lower and more predictable execution time than a dynamic scheduling approach based on an earlier implementation of hybrid execution in SkePU.
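The core idea of hybrid execution is a partitioned invocation: an auto-tuned ratio decides how many elements the CPU takes and how many go to the accelerator, with both parts running concurrently. A generic illustration of such a split (plain Python, not the SkePU API; the ratio and worker functions are invented for the example) is:

    from concurrent.futures import ThreadPoolExecutor

    def hybrid_map(data, cpu_fn, gpu_fn, gpu_ratio):
        """Run a data-parallel map with the first part on a (mock) GPU backend
        and the rest on the CPU, overlapping the two via a worker thread.

        gpu_ratio: fraction of elements assigned to the GPU backend, as an
        auto-tuner might choose it from earlier measurements.
        """
        split = int(len(data) * gpu_ratio)
        gpu_part, cpu_part = data[:split], data[split:]
        with ThreadPoolExecutor(max_workers=1) as pool:
            gpu_future = pool.submit(lambda: [gpu_fn(x) for x in gpu_part])
            cpu_result = [cpu_fn(x) for x in cpu_part]
            return gpu_future.result() + cpu_result

    # Example: square a million numbers, 70% on the "GPU", 30% on the CPU.
    out = hybrid_map(list(range(1_000_000)), lambda x: x * x, lambda x: x * x, 0.7)
    print(len(out), out[:3])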
|
69 |
HCLogP: um modelo computacional para clusters heterogêneos (HCLogP: a computational model for heterogeneous clusters)
Soares, Thiago Marques, 09 March 2017 (has links)
The LogP model was proposed in 1993 to measure the effects of communication latency, processor occupancy, and bandwidth in distributed-memory multiprocessors. The idea was to characterize distributed-memory multiprocessors using these key parameters and to study their impact on performance in simulation environments. This work proposes a new model, based on LogP, that describes the impact of these parameters on the performance of regular applications executing on a heterogeneous cluster. The model considers a heterogeneous cluster to be composed of distinct types of processors, accelerators, and networks. The results show that the worst error in the estimates of parallel execution time was about 19.2%, and, in many cases, the estimated execution time was equal to or very close to the measured one. In addition, based on this model, a scheduler was developed: given the characteristics of the application and of the computational environment, it chooses the subset of processors, accelerators, and networks that minimizes the parallel execution time. For applications with different behaviors, the scheduler successfully chose the best configuration.
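For reference, the classic LogP cost model that this family of models extends uses latency L, per-message overhead o, gap g (the reciprocal of per-processor bandwidth), and processor count P; a minimal sketch of its point-to-point costs (the baseline formulation, not HCLogP's heterogeneous extension, which the abstract does not spell out) is:

    % Time for one small message between two processors:
    T_1 = o_{send} + L + o_{recv} = 2o + L
    % Time for k pipelined messages from one sender (assuming g \ge o):
    T_k = 2o + L + (k - 1)\, g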
|
70 |
Advancing the Cyberinfrastructure for Integrated Water Resources Modeling
Buahin, Caleb A., 01 December 2017 (has links)
Like other scientists, hydrologists encode mathematical formulations of hydrologic processes as computer programs so that water resources management problems that would otherwise be intractable by hand can be solved efficiently. These computer models are typically developed to answer specific questions within a specific study domain. For example, one computer model may be developed to solve for magnitudes of water flow and water levels in an aquifer, while another may be developed to solve for magnitudes of water flow through a water distribution network of pipes and reservoirs. Interactions between different processes are often ignored or approximated using overly simplistic assumptions. The increasing complexity of the water resources challenges society faces, including stresses from climate variability and land use change, means that some of these models need to be stitched together so that these challenges are not evaluated myopically from the perspective of a single research discipline or study domain. The research in this dissertation presents an investigation of the various approaches and technologies that can be used to support model integration. It delves into some of the computational challenges associated with model integration and suggests approaches for dealing with them. Finally, it advances new software that provides data structures water resources modelers are accustomed to and allows them to take advantage of advanced computing resources for efficient simulations.
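To make the coupling idea concrete, one common pattern is a time-stepping driver that advances each component model one step and exchanges boundary values between them. The sketch below is deliberately simplified; the component names, placeholder physics, and exchanged quantities are invented for illustration and are not this dissertation's software:

    class ComponentModel:
        """Toy stand-in for a water-resources component model (e.g., a river
        reach or an aquifer), exposing the step/exchange interface a coupler
        needs."""

        def __init__(self, name, state=0.0):
            self.name, self.state, self.inflow = name, state, 0.0

        def set_boundary(self, inflow):
            self.inflow = inflow

        def advance(self, dt):
            # Placeholder physics: relax the state toward the boundary inflow.
            self.state += dt * (self.inflow - 0.1 * self.state)

        def outflow(self):
            return 0.1 * self.state

    def run_coupled(upstream, downstream, dt, n_steps):
        """Advance two models in lockstep, passing upstream outflow downstream."""
        for _ in range(n_steps):
            upstream.advance(dt)
            downstream.set_boundary(upstream.outflow())
            downstream.advance(dt)

    river = ComponentModel("river", state=100.0)
    aquifer = ComponentModel("aquifer", state=10.0)
    run_coupled(river, aquifer, dt=1.0, n_steps=24)
    print(river.state, aquifer.state)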
|