  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

Software Performance Estimation Techniques in a Co-Design Environment

Subramanian, Sriram 02 September 2003
No description available.
52

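Aplicando verificação de modelos baseada nas teorias do módulo da satisfabilidade para o particionamento de hardware/software em sistemas embarcados / Applying model checking based on satisfiability modulo theories to hardware/software partitioning in embedded systems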

Trindade, Alessandro Bezerra 09 February 2015
In hardware/software co-design for embedded systems, a central problem is deciding which system functions should be implemented in hardware (HW) and which in software (SW). This problem is known as HW/SW partitioning, and over the last ten years a significant research effort has been devoted to it. This work presents two new approaches to solving the HW/SW partitioning problem with SMT-based verification techniques and compares their results against the traditional technique of Integer Linear Programming (ILP) and a modern optimization method, the Genetic Algorithm (GA). The goal is to show experimentally that model checking techniques, using a state-of-the-art model checker based on Satisfiability Modulo Theories (SMT) solvers, can in particular cases be effective at finding the optimal solution of the HW/SW partitioning problem when compared with the traditional techniques.
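To make the partitioning problem in this abstract concrete, the following is a minimal sketch under simplifying assumptions, not the thesis's SMT, ILP, or GA formulation. It assumes each task has a hypothetical hardware cost, hardware time, and software time, ignores communication costs, and finds by exhaustive search the cheapest hardware assignment that meets a time budget. The function name `partition` and all parameters are illustrative.

```python
from itertools import product


def partition(hw_cost, hw_time, sw_time, max_time):
    """Exhaustively search HW/SW assignments (illustrative only).

    Minimizes the summed hardware cost subject to the summed
    execution time staying within max_time. Returns
    (cost, assignment), where assignment[i] is True when task i is
    placed in hardware, or None if no assignment meets the budget.
    """
    best = None
    for assignment in product((False, True), repeat=len(hw_cost)):
        # Only tasks placed in hardware contribute to hardware cost.
        cost = sum(c for c, in_hw in zip(hw_cost, assignment) if in_hw)
        # Each task runs either in hardware or in software.
        time = sum(
            h if in_hw else s
            for h, s, in_hw in zip(hw_time, sw_time, assignment)
        )
        if time <= max_time and (best is None or cost < best[0]):
            best = (cost, assignment)
    return best


if __name__ == "__main__":
    print(partition([5, 3, 4], [2, 1, 3], [10, 6, 8], max_time=15))
```

The search is exponential in the number of tasks, which is why exact formulations of this kind are usually handed to a solver rather than enumerated.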
53

Algoritmy souběžného technického a programového návrhu / Hardware-Software Codesign Algorithms

Vlach, Jan January 2007
This master's thesis deals with the hardware/software co-design of embedded systems. It gives a general description of the whole process and illustrates it with the design, simulation, and implementation of an FIR filter. It also describes the Polis design program and the Ptolemy simulation system. The final part of the project covers the generation of simulation models in VHDL and their subsequent synthesis.
54

A Hardware/Software Stack for Heterogeneous Systems

Lehner, Wolfgang, Castrillon, Jeronimo, Lieber, Matthias, Klüppelholz, Sascha, Völp, Marcus, Asmussen, Nils, Aßmann, Uwe, Baader, Franz, Baier, Christel, Fettweis, Gerhard, Fröhlich, Jochen, Goens, Andrés, Haas, Sebastian, Habich, Dirk, Härtig, Hermann, Hasler, Mattis, Huismann, Immo, Karnagel, Tomas, Karol, Sven, Kumar, Akash, Leuschner, Linda, Ling, Siqi, Märcker, Steffen, Menard, Christian, Mey, Johannes, Nagel, Wolfgang, Nöthen, Benedikt, Peñaloza, Rafael, Raitza, Michael, Stiller, Jörg, Ungethüm, Annett, Voigt, Axel, Wunderlich, Sascha 17 July 2023
Plenty of novel emerging technologies are being proposed and evaluated today, mostly at the device and circuit levels. It is unclear what the impact of different new technologies at the system level will be. What is clear, however, is that new technologies will make their way into systems and will increase the already high complexity of heterogeneous parallel computing platforms, making it ever so difficult to program them. This paper discusses a programming stack for heterogeneous systems that combines and adapts well-understood principles from different areas, including capability-based operating systems, adaptive application runtimes, dataflow programming models, and model checking. We argue why we think that these principles built into the stack and the interfaces among the layers will also be applicable to future systems that integrate heterogeneous technologies. The programming stack is evaluated on a tiled heterogeneous multicore.
55

Research and Design of Neural Processing Architectures Optimized for Embedded Applications

Wu, Binyi 28 May 2024
Deploying neural networks on edge devices and bringing them into our daily lives is attracting more and more attention. However, their high computational cost makes many embedded applications challenging. The primary objective of my doctoral studies is to contribute to resolving this dilemma by optimizing neural networks and designing corresponding efficient neural processing units for edge devices. This work took algorithmic research, specifically the optimization of deep neural networks, as a starting point and then applied its findings to steer the architecture design of Neural Processing Units (NPUs). The optimization of neural network models started with single-precision neural network quantization and progressed to mixed precision. The NPU architecture development followed the algorithmic research findings to achieve hardware/software co-design. Furthermore, a new approach to hardware and software co-development was introduced to speed up the prototyping and performance assessment of NPUs. This approach targets early-stage development. It helps developers focus on the design and optimization of NPUs and significantly shortens the development cycle. In the final project, a machine-learning-based approach was applied to explore and optimize the computational and memory resources of the NPU. The work as a whole covers several different areas, from algorithmic research to hardware design, all of which aim to improve the inference efficiency of neural networks. Specifically, algorithm optimization aims to reduce the memory footprint and computational cost of neural networks. The NPU design, on the other hand, focuses on improving the utilization of hardware resources. The proposed software and hardware co-development approach shortens the design cycle and speeds up design iteration. The order presented above corresponds to the structure of this dissertation. Each chapter corresponds to a topic and covers relevant research, methodology, and experimental results.
Contents:
1 Introduction
2 Convolutional Neural Networks 2.1 Convolutional layer 2.1.1 Padding 2.1.2 Convolution 2.1.3 Batch Normalization 2.1.4 Nonlinearity 2.2 Pooling Layer 2.3 Fully Connected Layer 2.4 Characterization 2.4.1 Composition of Operations and Parameters 2.4.2 Arithmetic Intensity 2.5 Optimization
3 Quantization with Double-Stage Squeeze-and-Threshold 3.1 Overview 3.1.1 Binarization 3.1.2 Multi-bit Quantization 3.2 Quantization of Convolutional Neural Networks 3.2.1 Quantization Scheme 3.2.2 Operator fusion of Conv2D 3.3 Activation Quantization with Squeeze-and-Threshold 3.3.1 Double-Stage Squeeze-and-Threshold 3.3.2 Inference Optimization 3.4 Experiment 3.4.1 Ablation Study of Squeeze-and-Threshold 3.4.2 Comparison with State-of-the-art Methods 3.5 Summary
4 Low-Precision Neural Architecture Search 4.1 Overview 4.2 Differentiable Architecture Search 4.2.1 Gumbel Softmax 4.2.2 Disadvantage and Solution 4.3 Low-Precision Differentiable Architecture Search 4.3.1 Convolution Sharing 4.3.2 Forward-and-Backward Scaling 4.3.3 Power Estimation 4.3.4 Architecture of Supernet 4.4 Experiment 4.4.1 Effectiveness of solutions to the dominance problem 4.4.2 Softmax and Gumbel Softmax 4.4.3 Optimizer and Inverted Learning Rate Scheduler 4.4.4 NAS Method Evaluation 4.4.5 Searched Model Analysis 4.4.6 NAS Cost Analysis 4.4.7 NAS Training Analysis 4.5 Summary
5 Configurable Sparse Neural Processing Unit 5.1 Overview 5.2 NPU Architecture 5.2.1 Buffer 5.2.2 Reshapeable Mixed-Precision MAC Array 5.2.3 Sparsity 5.2.4 Post Process Unit 5.3 Mapping 5.3.1 Mixed-Precision MAC 5.3.2 MAC Array 5.3.3 Support of Other Operation 5.3.4 Configurability 5.4 Experiment 5.4.1 Performance Analysis of Runtime Configuration 5.4.2 Roofline Performance Analysis 5.4.3 Mixed-Precision 5.4.4 Comparison with Cortex-M7 5.5 Summary
6 Agile Development and Rapid Design Space Exploration 6.1 Overview 6.1.1 Agile Development 6.1.2 Design Space Exploration 6.2 Agile Development Infrastructure 6.2.1 Chisel Backend 6.2.2 NPU Software Stack 6.3 Modeling and Exploration 6.3.1 Area Modeling 6.3.2 Performance Modeling 6.3.3 Layered Exploration Framework 6.4 Experiment 6.4.1 Efficiency of Agile Development Infrastructure 6.4.2 Effectiveness of Agile Development Infrastructure 6.4.3 Area Modeling 6.4.4 Performance Modeling 6.4.5 Rapid Exploration and Pareto Front 6.5 Summary
7 Summary and Outlook 7.1 Summary 7.2 Outlook
A Appendix of Double-Stage ST Quantization A.1 Training setting of ResNet-18 in Table 3.3 A.2 Training setting of ReActNet in Table 3.4 A.3 Training setting of ResNet-18 in Table 3.4 A.4 Pseudocode Implementation of Double-Stage ST
B Appendix of Low-Precision Neural Architecture Search B.1 Low-Precision NAS on CIFAR-10 B.2 Low-Precision NAS on Tiny-ImageNet B.3 Low-Precision NAS on ImageNet
Bibliography
56

Embedded electronic systems driven by run-time reconfigurable hardware

Fons Lluís, Francisco 29 May 2012
This doctoral thesis addresses the design of embedded electronic systems based on run-time reconfigurable hardware technology, available through SRAM-based FPGA/SoC devices, with the aim of improving people's quality of life. It investigates the system architecture and the reconfiguration engine that gives the FPGA the capability of dynamic partial reconfiguration. Using hardware/software co-design, a given application is partitioned into processing tasks that are multiplexed in time and space, thereby optimizing its physical implementation (silicon area, processing time, complexity, flexibility, functional density, cost and power consumption) in comparison with alternatives based on static hardware (MCU, DSP, GPU, ASSP, ASIC, etc.). The design flow of this technology is evaluated through the prototyping of several engineering applications (control systems, mathematical coprocessors, complex image processors, etc.), showing a level of maturity high enough for industrial exploitation.
57

Support matériel pour la communication inter-processus dans un système multi-coeur / Hardware support for inter-process communication in multiprocessor system

France Pillois, Maxime 27 September 2018
The high degree of parallelism in MPSoC applications increases the need to optimize synchronization mechanisms, which are essential for consistent data exchange between threads. The delays they introduce affect the overall performance of the system. This thesis analyzes and reduces the delays of synchronization mechanisms for MPSoC architectures. The growing complexity of MPSoCs requires a precise study of these mechanisms in a realistic environment that exposes hardware and software specifics. Since the available measurement tools do not combine the required accuracy with sufficient analysis speed, we designed a custom non-intrusive tool-chain based on an emulation platform. Applied to the GNU implementation of the synchronization barrier in the OpenMP library, the tool-chain revealed two implementation weaknesses. Our proposed hardware and software optimizations significantly reduce the delays introduced by the synchronization barrier. The tool-chain also allowed us to confirm a hypothesis fundamental to optimizing the lock mechanism: although a lock may be used during run time by several cores belonging to different clusters, it is very often reacquired by the last core that released it. Based on this observation, we propose an innovative, fully decentralized solution that dynamically re-homes locks in memory close to the core that was last granted access, reducing access latency and network traffic when a lock is reused by the same cluster.
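The lock-reuse observation in this abstract can be illustrated with a small measurement sketch. This is a hypothetical illustration, not the thesis's tool-chain: it assumes a trace given as an ordered list of `(lock_id, core_id)` acquisition events and reports the fraction of re-acquisitions made by the same core that held the lock last. The trace format and the name `reuse_ratio` are assumptions.

```python
def reuse_ratio(trace):
    """Fraction of lock re-acquisitions made by the previous holder.

    trace is an ordered iterable of (lock_id, core_id) acquisition
    events. The first acquisition of each lock has no previous
    holder and is not counted. Returns 0.0 when no lock is ever
    acquired twice.
    """
    last_holder = {}
    reused = 0
    total = 0
    for lock_id, core_id in trace:
        if lock_id in last_holder:
            total += 1
            if last_holder[lock_id] == core_id:
                reused += 1
        # Remember who holds (and will next release) this lock.
        last_holder[lock_id] = core_id
    return reused / total if total else 0.0


if __name__ == "__main__":
    events = [("a", 0), ("a", 0), ("b", 1), ("a", 2), ("b", 1), ("a", 2)]
    print(reuse_ratio(events))
```

A high ratio on a real trace would support keeping a lock in memory near its last holder, which is the intuition behind the re-homing scheme the abstract describes.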
58

Arquitetura de co-projeto hardware/software para implementação de um codificador de vídeo escalável padrão H.264/SVC / Hardware/software co-design architecture for implementing an H.264/SVC scalable video encoder

Husemann, Ronaldo January 2011
To support heterogeneous networks and distinct devices simultaneously, modern multimedia systems can adopt scalable coding, in which the video stream is composed of multiple layers, each gradually enhancing the display quality according to the capabilities of each receiver. The H.264/SVC specification is currently the state of the art in this area because of its improved coding efficiency, but it has extremely high computational demands. In this context, this work presents a hardware/software co-design architecture that exploits the characteristics of the internal algorithms of the H.264/SVC encoder, seeking the right balance between the two technologies (hardware and software) for a practical scalable encoder implementation able to process up to 16 layers at 1920x1080 pixels. Starting from a model of the H.264/SVC reference code, refined to reduce encoding time, strategies for module partitioning and for integration between hardware and software entities were defined. The methodology took into account data dependencies and the parallelism potential of the algorithms, as well as practical restrictions imposed by communication interfaces and memory accesses. The transform, quantization, deblocking filter and inter-layer prediction modules were implemented in hardware, while system management, entropy coding, rate control and the user interface were kept in software. The complete solution, which integrates hardware modules synthesized on a development board with the refined H.264/SVC reference software, validates the proposal through the significant performance gains recorded and is shown to be suitable for applications that require real-time scalable video coding.
59

Scalable Register File Architecture for CGRA Accelerators

January 2016
Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable of accelerating even non-parallel loops and loops with low trip counts. One challenge in compiling for CGRAs is managing both recurring and nonrecurring variables in the register file (RF) of the CGRA. Although prior works have managed recurring variables via a rotating RF, they access nonrecurring variables through either a global RF or a constant memory. The former does not scale well, and the latter degrades the mapping quality. This work proposes a hardware-software codesign approach that manages all variables in a local nonrotating RF. The hardware provides a modulo-addition-based indexing mechanism to enable correct addressing of recurring variables in a nonrotating RF. The compiler determines the number of registers required for each recurring variable and configures the boundary between the registers used for recurring and nonrecurring variables. The compiler also pre-loads the read-only variables and constants into the local registers in the prologue of the schedule. Synthesis and place-and-route results for the previous and the proposed RF designs show that the proposed solution achieves 17% better cycle time. Experiments mapping several important, performance-critical loops collected from MiBench show that the proposed approach improves performance (through better mapping) by 18% compared to using constant memory. / Dissertation/Thesis / Masters Thesis, Computer Science, 2016
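The modulo-addition indexing mentioned in this abstract can be sketched as follows. The abstract does not give the exact scheme, so this is an assumption-laden illustration: a recurring variable is taken to own a contiguous block of `size` registers starting at `base`, and the instance at logical `offset` in a given `iteration` is found by adding the iteration count modulo the block size. The function name, parameter names, and direction of rotation are all hypothetical.

```python
def physical_register(base, size, offset, iteration):
    """Map a recurring variable's logical slot to a physical register.

    Assumed scheme (illustrative only): the variable owns registers
    base .. base + size - 1, and its live instances rotate through
    that block as the loop iterates, emulating a rotating register
    file on top of a nonrotating one.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return base + (offset + iteration) % size


if __name__ == "__main__":
    # A variable with 3 registers starting at register 4 cycles
    # through registers 4, 5, 6 and then wraps back to 4.
    print([physical_register(4, 3, 0, i) for i in range(5)])
```

Under this assumption, nonrecurring variables would sit in registers outside any such block and be addressed directly, which matches the abstract's description of a compiler-configured boundary between the two groups.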
