61

Collecting and representing parallel programs with high performance instrumentation

Railing, Brian Paul 07 January 2016 (has links)
Computer architecture faces looming challenges: finding program parallelism, process-technology limits, and a constrained power budget. Navigating these challenges requires a deeper understanding of parallel programs. I will discuss the task graph representation and how it enables programmers and compiler optimizations to understand and exploit dynamic aspects of a program. I will present Contech, a high-performance framework for generating dynamic task graphs from arbitrary parallel programs. The Contech framework supports a variety of languages and parallelization libraries, and has been tested on both x86 and ARM. I will demonstrate how this framework encompasses a diversity of program analyses, particularly by modeling a dynamically reconfigurable, heterogeneous multi-core processor.
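For intuition, here is a hedged, generic sketch of what a dynamic task graph node might record (illustrative field names, not Contech's actual format):

```cpp
#include <cstdint>
#include <vector>

// Hedged, generic sketch of a dynamic task graph node: a task is a unit
// of work between synchronization events, linked to the tasks it must
// wait for. Field names are illustrative assumptions.
struct Task {
    uint64_t id;                        // unique task identifier
    uint32_t contextId;                 // thread of execution that ran it
    std::vector<uint64_t> predecessors; // tasks that must finish first
    std::vector<uint64_t> basicBlocks;  // basic blocks executed by the task
    std::vector<uint64_t> memoryAddrs;  // addresses read or written
};

// A whole execution is then just the set of tasks; analyses such as
// critical-path length or per-context cache simulation walk the graph
// in dependency order.
using TaskGraph = std::vector<Task>;
```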
62

Parallel video decoding

Álvarez Mesa, Mauricio 08 September 2011 (has links)
Digital video is a popular technology used in many different applications. The quality of video, expressed in terms of spatial and temporal resolution, has increased continuously in recent years. To reduce the bitrate required for its storage and transmission, a new generation of video encoders and decoders (codecs) has been developed. The latest video codec standard, known as H.264/AVC, includes sophisticated compression tools that require more computing resources than any previous video codec. The combination of high-quality video and the advanced compression tools found in H.264/AVC has resulted in a significant increase in the computational requirements of video decoding applications. The main objective of this thesis is to provide the performance required for real-time decoding of high-quality video on programmable architectures. Our solution has been the simultaneous exploitation of multiple levels of parallelism: on the one hand, video decoders have been modified to extract as much parallelism as possible; on the other, general-purpose architectures have been enhanced to exploit the type of parallelism present in video codec applications.

First, we analyzed the scalability of two Single Instruction, Multiple Data (SIMD) extensions: a one-dimensional (1D) extension and a two-dimensional (2D) matrix extension, and showed that scaling the 2D extension delivers higher performance at lower complexity than scaling the 1D extension. We then characterized H.264/AVC decoding of high-definition (HD) video and identified its main kernels. Because no suitable benchmark for HD video decoding existed, we developed our own, called HD-VideoBench, which includes complete video encoding and decoding applications together with a set of HD video sequences. Next, we optimized the most important kernels of the H.264/AVC decoder with SIMD instructions; however, the results fell short of the maximum attainable performance because of the negative effect of misaligned data in memory. As a solution, we evaluated the hardware and software support required for unaligned accesses, which yielded significant performance improvements in the application. We also investigated how to extract task-level parallelism and found that none of the existing mechanisms could scale to massively parallel systems. As an alternative, we developed a new algorithm able to find thousands of independent tasks by exploiting macroblock-level parallelism. We then implemented a parallel version of the H.264 decoder on a distributed shared memory (DSM) machine; this implementation did not reach the maximum attainable performance either, because of the negative impact of synchronization operations and of the entropy decoding kernel. To remove these bottlenecks, we evaluated frame-level parallelization of the entropy decoding stage combined with macroblock-level parallelization of the remaining kernels, and the overhead of synchronization operations was almost completely eliminated by using hardware-accelerated operations. Together, these improvements enabled real-time decoding of high-definition, high-frame-rate video. The overall result is a scalable solution capable of using the growing number of processors in multicore architectures.
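For intuition, the macroblock-level parallelism mentioned above follows the well-known 2D-wave pattern in H.264: a macroblock can be decoded once its left neighbor MB(x-1, y) and upper-right neighbor MB(x+1, y-1) are done, so all macroblocks on the same anti-diagonal are independent. A minimal sketch, with frame dimensions assumed for illustration:

```cpp
#include <cstdio>

// 2D-wave schedule for macroblock (MB) decoding: MBs with x + 2y == wave
// have all their dependencies satisfied by earlier waves and can run in
// parallel. Frame size below is an illustrative assumption.
int main() {
    const int mbWidth = 8, mbHeight = 4;            // frame size in MBs
    for (int wave = 0; wave < mbWidth + 2 * (mbHeight - 1); ++wave) {
        std::printf("wavefront %d:", wave);
        for (int y = 0; y < mbHeight; ++y) {
            int x = wave - 2 * y;                   // 2:1 dependency slope
            if (x >= 0 && x < mbWidth)
                std::printf(" MB(%d,%d)", x, y);    // independent MBs
        }
        std::printf("\n");
    }
}
```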
63

Meeting Data Sharing Needs of Heterogeneous Distributed Users

Zhan, Zhiyuan 16 January 2007 (has links)
The fast growth of wireless networking and mobile computing devices has enabled us to access information from anywhere at any time. However, varying user needs and system resource constraints are two major heterogeneity factors that pose a challenge to information sharing systems. For instance, when a new information item is produced, different users may have different requirements for when the new value should become visible, and the resources that each device can contribute to such information sharing applications also vary. Therefore, how to enable information sharing across computing platforms with varying resources, to meet different user demands, is an important problem for distributed systems research. In this thesis, we address the heterogeneity challenge faced by such systems. We assume that shared information is encapsulated in distributed objects, and we use object replication to increase system scalability and robustness, which introduces the consistency problem. Many consistency models have been proposed in recent years, but they are either too strong and do not scale well, or too weak to meet many users' requirements. We propose a Mixed Consistency (MC) model as a solution, introducing an access-constraints-based approach that combines strong and weak consistency models. We also propose an MC protocol that combines existing implementations with minimal modifications and is designed to tolerate crash failures and slow processes or communication links in the system. We further explore how the heterogeneity challenge can be addressed in the transport layer by developing an agile dissemination protocol. We implement our MC protocol on top of a distributed publish-subscribe middleware, Echo, and measure its performance; the results of the experiments are consistent with our expectations. Based on the functionality and performance of mixed consistency protocols, we believe this model is effective in addressing the heterogeneity of user requirements and available resources in distributed systems.
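A hedged sketch of the access-constraint idea behind mixed consistency (illustrative names, not the thesis's protocol): each read or write carries a constraint that routes it through a strong, synchronous path or a weak, lazily propagated one.

```cpp
#include <map>
#include <mutex>
#include <string>

// Illustrative sketch only: a replicated object whose operations are
// tagged with a consistency constraint. Strong operations coordinate
// with peer replicas; weak ones apply locally and sync in the background.
enum class Constraint { Strong, Weak };

class MixedReplica {
public:
    void write(const std::string& key, int value, Constraint c) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            store_[key] = value;                   // apply locally
        }
        if (c == Constraint::Strong)
            propagateSynchronously(key, value);    // block until peers ack
        else
            queueForLazyPropagation(key, value);   // background sync
    }
    int read(const std::string& key, Constraint c) {
        if (c == Constraint::Strong)
            refreshFromPeers(key);                 // pull the latest value
        std::lock_guard<std::mutex> lock(mutex_);
        return store_[key];                        // a weak read may be stale
    }
private:
    void propagateSynchronously(const std::string&, int) {} // placeholders
    void queueForLazyPropagation(const std::string&, int) {}
    void refreshFromPeers(const std::string&) {}
    std::map<std::string, int> store_;
    std::mutex mutex_;
};
```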
64

Graph-Based Control of Networked Systems

Ji, Meng 11 June 2007 (has links)
Networked systems have attracted great interest from the controls community during the last decade, and several issues arising from recent research are addressed in this dissertation. Connectedness is one of the important conditions that enable distributed coordination in a networked system; nonetheless, until recently it has simply been assumed in most implementations, especially in continuous-time applications. A nonlinear weighting strategy is proposed in this dissertation to solve the connectedness-preserving problem, and both the rendezvous and formation problems are addressed in the context of homogeneous networks. Controllability of heterogeneous networks is another issue that has long been overlooked; this dissertation contributes a graph-theoretical interpretation of controllability. Distributed sensor networks make up another important class of networked systems, for which a novel estimation strategy is proposed. The observability problem is raised in the context of the proposed distributed estimation strategy, and a graph-theoretical interpretation is derived as well. The contributions of this dissertation are as follows. It solves the connectedness-preserving problem for networked systems and, based on that, proposes a formation process. For heterogeneous networks, the leader-follower structure is studied, and necessary and sufficient conditions are presented for the system to be controllable. A novel estimation strategy that can improve performance is proposed for distributed sensor networks; the observability problem is studied for this strategy, and a necessary condition is obtained. This work is among the first to provide graph-theoretical interpretations of the controllability and observability issues.
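As a hedged illustration of the connectedness-preserving idea (the dissertation's exact weighting may differ), the nonlinear edge weight can be made to grow without bound as an inter-agent distance approaches the sensing radius, so the control law never stretches an existing communication link to breaking:

```latex
% Weighted consensus with an edge tension that blows up as an edge
% approaches the sensing radius \Delta, so no existing edge ever breaks
% (illustrative weighting, not necessarily the dissertation's form):
\dot{x}_i \;=\; -\sum_{j \in N(i)} w\!\left(\lVert x_i - x_j\rVert\right)(x_i - x_j),
\qquad
w(d) \;=\; \frac{1}{(\Delta - d)^2} \;\to\; \infty \quad \text{as } d \to \Delta^{-}.
```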
65

Harmony: an execution model for heterogeneous systems

Diamos, Gregory Frederick 10 November 2011 (has links)
The emergence of heterogeneous and many-core architectures presents a unique opportunity to deliver order-of-magnitude performance increases to high-performance applications by matching certain classes of algorithms to specifically tailored architectures. However, ubiquitous adoption has been limited by a lack of programming models and management frameworks designed to reduce the high degree of software-development complexity inherent to heterogeneous architectures. This dissertation introduces Harmony, an execution model for heterogeneous systems that draws heavily from concepts and optimizations used in processor micro-architecture to provide: (1) semantics for simplifying heterogeneity management, (2) dynamic scheduling of compute-intensive kernels to heterogeneous processor resources, and (3) online-monitoring-driven performance optimization for heterogeneous many-core systems. This work focuses on simplifying development and ensuring binary portability and scalability across system configurations and sizes.
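A hedged sketch of the dynamic-scheduling idea such an execution model embodies (illustrative names, not Harmony's actual API): dispatch each kernel to the processor whose runtime, as predicted by an online monitoring model, is lowest.

```cpp
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Illustrative sketch: pick the device with the lowest predicted runtime
// for a kernel of a given size. The prediction function stands in for a
// model fitted from online monitoring; assumes a non-empty device list.
struct Processor {
    std::string name;
    std::function<double(std::size_t)> predictRuntime; // fitted per device
};

const Processor& dispatch(const std::vector<Processor>& procs,
                          std::size_t workItems) {
    const Processor* best = &procs.front();
    double bestTime = std::numeric_limits<double>::infinity();
    for (const auto& p : procs) {
        double t = p.predictRuntime(workItems); // e.g. a linear fit
        if (t < bestTime) { bestTime = t; best = &p; }
    }
    return *best;
}
```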
66

Auto-tuning Hybrid CPU-GPU Execution of Algorithmic Skeletons in SkePU

Öhberg, Tomas January 2018 (has links)
For several years the trend in computer architectures has been heterogeneous systems consisting of a regular CPU and at least one additional, specialized processing unit, such as a GPU. The different characteristics of the processing units and the requirement of multiple tools and programming languages make programming such systems a challenging task. Although tools exist for programming each processing unit, utilizing the full potential of a heterogeneous computer still requires specialized implementations involving multiple frameworks and hand-tuning of parameters. To fully exploit the performance of heterogeneous systems for a single computation, hybrid execution is needed, i.e. execution where the workload is distributed between multiple heterogeneous processing units working simultaneously on the computation. This thesis presents the implementation of a new hybrid-execution backend in the algorithmic skeleton framework SkePU. The skeleton framework already gives programmers a user-friendly interface to algorithmic templates, executable on different hardware using OpenMP, CUDA, and OpenCL. With this extension it is now also possible to divide the computational work of the skeletons between multiple processing units, such as between a CPU and a GPU. The results show an improvement in execution time with the hybrid-execution implementation for all skeletons in SkePU. The new implementation is also shown to yield a lower and more predictable execution time than a dynamic scheduling approach based on an earlier implementation of hybrid execution in SkePU.
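A hedged sketch of the hybrid-partitioning idea (not SkePU's actual API): a map-style skeleton splits its input range between CPU and GPU by a ratio derived from measured per-backend throughputs, which is essentially what an auto-tuner calibrates.

```cpp
#include <cstddef>

// Illustrative sketch only. Throughput numbers are hypothetical
// calibration values that an auto-tuning pass would measure.
struct HybridPlan {
    double cpuThroughput = 1.0;  // elements/ms, measured at tuning time
    double gpuThroughput = 4.0;  // hypothetical GPU calibration value

    // Number of elements handed to the CPU; the GPU takes the rest.
    std::size_t cpuShare(std::size_t n) const {
        double ratio = cpuThroughput / (cpuThroughput + gpuThroughput);
        return static_cast<std::size_t>(n * ratio);
    }
};

template <typename F>
void hybridMap(F f, float* data, std::size_t n, const HybridPlan& plan) {
    std::size_t split = plan.cpuShare(n);
    // In a real backend the two halves run concurrently (host threads
    // plus a device queue); shown sequentially here for brevity.
    for (std::size_t i = 0; i < split; ++i) data[i] = f(data[i]); // "CPU" half
    for (std::size_t i = split; i < n; ++i) data[i] = f(data[i]); // "GPU" half
}
```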
67

HCLogP: a computational model for heterogeneous clusters

Soares, Thiago Marques 09 March 2017 (has links)
The LogP model was proposed in 1993 to measure the effects of communication latency, processor occupancy, and bandwidth in distributed-memory multiprocessors. The idea was to characterize distributed-memory multiprocessors using these key parameters and to study their impact on performance. This work proposes a new model, based on LogP, that describes the influence of these parameters on the performance of regular applications executing on a heterogeneous cluster. The model considers a heterogeneous cluster to be composed of distinct types of processors, accelerators, and network controllers. The results show that the worst error in the model's estimates of parallel execution time was 19.2%, and in many cases the estimated execution time was equal or very close to the measured one. In addition, a scheduler was developed on top of this model: based on the characteristics of the application and of the computational environment, it chooses a subset of processors, accelerators, and networks that minimizes the total parallel execution time. The scheduler successfully chose the best configuration for executing applications with different behaviors.
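For context, the original LogP cost model can be sketched as follows (standard parameter names from the LogP literature; that HCLogP adds per-device variants of these for a heterogeneous cluster is an assumption of this sketch):

```latex
% LogP parameters: L = network latency, o = per-message processing
% overhead, g = gap (reciprocal of bandwidth), P = number of processors.
% Time for a single small point-to-point message:
T_{\mathrm{msg}} = o_{\mathrm{send}} + L + o_{\mathrm{recv}},
% and n back-to-back messages from one node are paced by the gap:
T_n \approx o + (n - 1)\,g + L + o.
```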
68

Advancing the Cyberinfrastructure for Integrated Water Resources Modeling

Buahin, Caleb A. 01 December 2017 (has links)
Like other scientists, hydrologists encode mathematical formulations that simulate various hydrologic processes as computer programs, so that water resources management problems that would otherwise be intractable by hand can be solved efficiently. These computer models are typically developed to answer specific questions within a specific study domain. For example, one computer model may be developed to solve for magnitudes of water flow and water levels in an aquifer, while another may be developed to solve for magnitudes of water flow through a water distribution network of pipes and reservoirs. Interactions between different processes are often ignored or approximated using overly simplistic assumptions. The increasing complexity of the water resources challenges society faces, including stresses from variable climate and land-use change, means that some of these models need to be stitched together so that these challenges are not evaluated myopically from the perspective of a single research discipline or study domain. The research in this dissertation investigates the various approaches and technologies that can be used to support model integration, delves into some of the associated computational challenges, and suggests approaches for dealing with them. Finally, it advances new software that provides data structures water resources modelers are accustomed to and allows them to take advantage of advanced computing resources for efficient simulations.
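A hedged, generic sketch of time-stepped model coupling in the spirit of OpenMI-style component interfaces (illustrative API, not the dissertation's actual software): each model advances through time and exchanges boundary values with the components it is linked to.

```cpp
// Illustrative coupling interface: components expose their clock, a step
// function, and exchangeable boundary values (names are assumptions).
class ModelComponent {
public:
    virtual ~ModelComponent() = default;
    virtual double currentTime() const = 0;
    virtual void update() = 0;                // advance one time step
    virtual double outputValue() const = 0;   // e.g. flow at a boundary
    virtual void setInputValue(double v) = 0; // e.g. upstream inflow
};

// Drive two linked components to a common end time, pulling the upstream
// component's output into the downstream one before each downstream step.
void runCoupled(ModelComponent& upstream, ModelComponent& downstream,
                double endTime) {
    while (downstream.currentTime() < endTime) {
        while (upstream.currentTime() <= downstream.currentTime())
            upstream.update();                // keep the source ahead
        downstream.setInputValue(upstream.outputValue());
        downstream.update();
    }
}
```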
69

An Exploration Of Heterogeneous Networks On Chip

Grimm, Allen Gary 01 January 2011 (has links)
As the number of cores on a single chip continues to grow, communication increasingly becomes the bottleneck to performance. The Network on Chip (NoC) is an interconnection paradigm showing promise to allow system size to increase while maintaining acceptable performance. One of the challenges of this paradigm is constructing the network of inter-core connections. Using traditional wires as long-range links is proving insufficient due to the increase in relative delay as miniaturization progresses, whereas novel link types are capable of delivering single-hop long-range communication. We investigate the potential benefits of constructing networks with many link types applied to heterogeneous NoCs and hypothesize that a network with many link types available can achieve higher performance at a given cost than its homogeneous counterpart. To investigate NoCs with heterogeneous links, a multiobjective evolutionary algorithm is given a heterogeneous set of links and optimizes the number and placement of those links in an NoC, using objectives of cost, throughput, and energy as a representative set of an NoC's quality. The types of links used and the topology of those links are explored as a consequence of the properties of the available links and of the preferences set on the objectives. As the platform of experimentation, the Complex Network Evolutionary Algorithm (CNEA) and the associated Complex Network Framework (CNF) are developed. CNEA is a multiobjective evolutionary algorithm built on the ParadisEO framework to facilitate the construction of optimized networks. CNF is designed and used to model and evaluate networks according to: the cost of a given topology; performance in terms of a network's throughput and energy consumption; and graph-theory-based metrics including average distance and degree, length, and link distributions. It is shown that optimizing complex networks for cost, as a function of total link length and average distance, creates a power-law link-length distribution. This offers a way to decrease the average distance of a network for a given cost when compared to random networks or the standard mesh network. We then explore the use of several types of constrained-length links in the same optimization problem and find that, when given access to all link types, we obtain networks with the same or smaller average distance for a given cost than any network produced when given access to only one link type. We then introduce traffic on the networks with an interconnect-based, packet-level, shortest-path-routed traffic model, and find that heterogeneous networks can achieve throughput as good as or better than their homogeneous counterparts using the same amount of link. Finally, these results are confirmed by augmenting a wire-based mesh network with non-traditional link types, which significantly increases the overall performance of that network.
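A hedged sketch of the Pareto-dominance test at the heart of such a multiobjective search (objective directions assumed: cost and energy minimized, throughput maximized):

```cpp
// Illustrative objective vector for one candidate NoC topology.
struct Objectives {
    double cost;        // e.g. total link length; lower is better
    double energy;      // per-packet energy; lower is better
    double throughput;  // accepted traffic; higher is better
};

// Candidate a dominates b if it is no worse in every objective and
// strictly better in at least one; non-dominated candidates form the
// Pareto front the evolutionary algorithm maintains.
bool dominates(const Objectives& a, const Objectives& b) {
    bool noWorse = a.cost <= b.cost && a.energy <= b.energy &&
                   a.throughput >= b.throughput;
    bool strictlyBetter = a.cost < b.cost || a.energy < b.energy ||
                          a.throughput > b.throughput;
    return noWorse && strictlyBetter;
}
```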
70

Reconfigurable Technologies for Next Generation Internet and Cluster Computing

Unnikrishnan, Deepak C. 01 September 2013 (has links)
Modern web applications are marked by distinct networking and computing characteristics. As applications evolve, they continue to operate over a large monolithic framework of networking and computing equipment built from general-purpose microprocessors and Application-Specific Integrated Circuits (ASICs) that offers few architectural choices. This dissertation presents techniques to diversify the next-generation Internet infrastructure by integrating Field-Programmable Gate Arrays (FPGAs), a class of reconfigurable integrated circuits, with general-purpose microprocessor-based techniques. Specifically, our solutions are demonstrated in the context of two applications: network virtualization and distributed cluster computing. Network virtualization enables the physical network infrastructure to be shared among several logical networks that run diverse protocols and differentiated services. The design of a good network virtualization platform is challenging because the physical networking substrate must scale to support several isolated virtual networks with high packet-forwarding rates and offer sufficient flexibility to customize networking features. The first major contribution of this dissertation is a novel, high-performance, heterogeneous network virtualization system that integrates FPGAs and general-purpose CPUs. Salient features of this architecture include the ability to scale the number of virtual networks in an FPGA using existing software-based network virtualization techniques, the ability to map virtual networks to a combination of hardware and software resources on demand, and the ability to use off-chip memory resources to scale virtual router features. Partial reconfiguration has been exploited to dynamically customize virtual networking parameters, and an open software framework has been developed to describe virtual networking features in a hardware-agnostic language. Evaluation of our system using a NetFPGA card demonstrates one to two orders of magnitude improvement in throughput over state-of-the-art network virtualization techniques. The demand for greater computing capacity grows as web applications scale. In state-of-the-art systems, an application is scaled by parallelizing the computation on a pool of commodity hardware machines using distributed computing frameworks. Although this technique is useful, it is inefficient because the sequential nature of execution in general-purpose processors does not suit all workloads equally well. Iterative algorithms form a pervasive class of web and data-mining algorithms that are poorly executed on general-purpose processors due to the strict synchronization barriers in distributed cluster frameworks. This dissertation presents Maestro, a heterogeneous distributed computing framework that demonstrates how FPGAs can break down such synchronization barriers using asynchronous accumulative updates. These updates allow intermediate results to accumulate for numerous data points without iteration-based barriers. The benefits of a heterogeneous cluster are illustrated by executing a general class of iterative algorithms on a cluster of commodity CPUs and FPGAs, with computation dynamically prioritized to accelerate algorithm convergence. We implement three iterative algorithms of this general class on a cluster of four FPGAs, achieving a 7× speedup over an implementation of asynchronous accumulative updates on a general-purpose CPU and a 154× speedup over a standard Hadoop-based CPU-workstation cluster.
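A hedged sketch of the asynchronous accumulative-update pattern for a PageRank-style computation (illustrative code, not Maestro's implementation): each vertex accumulates incoming deltas and propagates its own whenever it is scheduled, with no iteration-wide barrier.

```cpp
#include <algorithm>
#include <vector>

// Barrier-free accumulative updates: vertices may be processed in any
// order (hardware can reorder them freely); the computation converges
// once all pending deltas fall below epsilon.
void accumulativePageRank(const std::vector<std::vector<int>>& outEdges,
                          std::vector<double>& rank, double damping = 0.85,
                          double epsilon = 1e-9) {
    std::size_t n = outEdges.size();
    std::vector<double> delta(n, 1.0 - damping);  // initial rank mass
    rank.assign(n, 0.0);
    bool active = true;
    while (active) {
        active = false;
        for (std::size_t v = 0; v < n; ++v) {
            if (delta[v] < epsilon) continue;     // nothing to propagate
            rank[v] += delta[v];                  // accumulate locally
            double share = damping * delta[v] /
                           std::max<std::size_t>(1, outEdges[v].size());
            delta[v] = 0.0;
            for (int u : outEdges[v]) delta[u] += share; // push deltas
            active = true;                        // more work generated
        }
    }
}
```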
