21 |
Scalable Register File Architecture for CGRA Accelerators / January 2016
abstract: Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable
of accelerating even non-parallel loops and loops with low trip-counts. One challenge
in compiling for CGRAs is to manage both recurring and nonrecurring variables in
the register file (RF) of the CGRA. Although prior works have managed recurring
variables via a rotating RF, they access nonrecurring variables through either a global RF or a constant memory. The former does not scale well, and the latter degrades the mapping quality. This work proposes a hardware-software co-design approach to manage all variables in a local, non-rotating RF. The hardware provides a modulo-addition-based indexing mechanism that enables correct addressing of recurring variables in a non-rotating RF. The compiler determines the number of registers required for each recurring variable and configures the boundary between the registers used for recurring and nonrecurring variables. The compiler also pre-loads the read-only variables and constants into the local registers in the prologue of the schedule. Synthesis and place-and-route results for the previous and the proposed RF designs show that the proposed solution achieves a 17% better cycle time. Experiments mapping several important, performance-critical loops from MiBench show that the proposed approach improves performance (through better mapping) by 18% compared to using a constant memory. / Dissertation/Thesis / Master's Thesis, Computer Science, 2016
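As a concrete illustration of the modulo-addition indexing described above, the minimal C sketch below shows how a compiler-assigned window of registers can emulate rotation inside a non-rotating RF. The struct fields and function name are illustrative assumptions, not the thesis's actual interface.

```c
#include <stdint.h>

/* Each recurring variable owns a compiler-assigned window of `count`
   registers starting at `base` inside the non-rotating RF. */
typedef struct {
    uint8_t base;   /* first register of the window          */
    uint8_t count;  /* registers reserved for this variable  */
} recurring_var_t;

/* Physical register holding the value of `v` that was defined `dist`
   iterations before the current iteration `iter`: the write slot
   advances by one register per iteration, wrapping modulo `count`. */
static uint8_t rf_index(recurring_var_t v, uint32_t iter, uint32_t dist)
{
    return v.base + (uint8_t)((iter + v.count - (dist % v.count)) % v.count);
}
```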
|
22 |
Arquitetura de co-projeto hardware/software para implementação de um codificador de vídeo escalável padrão H.264/SVC [Hardware/software co-design architecture for the implementation of an H.264/SVC standard scalable video encoder] / Husemann, Ronaldo, January 2011
To operate flexibly across heterogeneous networks and devices, modern multimedia systems can adopt scalable coding, in which the video stream is composed of multiple layers, each one gradually complementing and enhancing the displayed quality in a way that adapts to the capabilities of each receiver. The H.264/SVC specification currently represents the state of the art in this area thanks to its improved coding efficiency, but it imposes extremely high computational demands. In this context, this work presents a hardware/software co-design architecture that exploits the characteristics of the internal algorithms of the H.264/SVC encoder, seeking an adequate balance between the two technologies (hardware and software) for the practical implementation of a scalable encoder supporting up to 16 layers at 1920x1080 pixels. Starting from a model of the H.264/SVC reference code, refined to reduce encoding times, strategies were defined for module partitioning and for integration between software and hardware entities, weighing issues such as data dependencies and the parallelism potential of the algorithms, as well as practical constraints of the communication interfaces and memory accesses. The transform, quantization, deblocking-filter, and inter-layer prediction modules were implemented in hardware, while system management, entropy coding, rate control, and the user interface remained in software. The complete solution, integrating the hardware modules synthesized on a development board with the refined reference software, validates the proposal through the significant performance gains registered, making it a suitable solution for applications that require real-time scalable video coding.
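Of the modules moved to hardware, the 4x4 forward integer transform is the most self-contained. The C model below is a behavioral reference for the standard H.264 core transform Y = C·X·C^T that such a datapath computes; real hardware would use shift-and-add butterflies rather than multipliers.

```c
#include <stdint.h>

/* H.264 4x4 forward integer transform, Y = C * X * C^T, using the
   standard core-transform matrix. Behavioral reference model only. */
static const int C[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 },
};

void forward_transform_4x4(const int16_t x[4][4], int32_t y[4][4])
{
    int32_t t[4][4];
    /* t = C * x */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            t[i][j] = 0;
            for (int k = 0; k < 4; k++)
                t[i][j] += C[i][k] * x[k][j];
        }
    /* y = t * C^T */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            y[i][j] = 0;
            for (int k = 0; k < 4; k++)
                y[i][j] += t[i][k] * C[j][k];
        }
}
```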
|
24 |
Implementation of an FPGA based Emulator for High Speed Power Electronic Systems / Adnan, Muhammad Wasif, January 2014
During the development of control systems for power electronic systems, it is desirable to test the controller in real time by interfacing it with an emulator device. In this context, this work comprises the development of an emulator that accurately models the dynamics of high-speed power electronic systems and provides interfaces compatible with the real hardware. The real-time state calculations, based on discrete models, are performed in custom logic implemented on an FPGA. The realized system can emulate Linear Parameter-Varying (LPV) systems at sampling rates of up to 12 MHz using a low-cost Xilinx FPGA. As a result, power electronic systems with very high switching frequencies can be modeled. In addition, the FPGA incorporates a soft-core processor that allows a designer to easily re-configure the system model through software. The emulator system has been validated for a multiphase DC-DC converter by comparing its results with the real hardware setup.
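The discrete-model update at the heart of such an emulator can be pictured with the Q15 fixed-point sketch below. The state/input sizes and number format are assumptions for illustration; on the FPGA the multiply-accumulates run in parallel custom logic rather than this sequential loop.

```c
#include <stdint.h>

#define NX 2  /* number of states */
#define NU 1  /* number of inputs */

/* One discrete-model update x <- A(p)x + B(p)u in Q15 fixed point.
   A and B would be re-loaded by the soft-core processor whenever the
   varying parameter p changes. */
void lpv_step(const int16_t A[NX][NX], const int16_t B[NX][NU],
              const int16_t u[NU], int16_t x[NX])
{
    int16_t next[NX];
    for (int i = 0; i < NX; i++) {
        int32_t acc = 0;                                 /* Q30 accumulator */
        for (int j = 0; j < NX; j++) acc += (int32_t)A[i][j] * x[j];
        for (int j = 0; j < NU; j++) acc += (int32_t)B[i][j] * u[j];
        next[i] = (int16_t)((acc + (1 << 14)) >> 15);    /* round to Q15 */
    }
    for (int i = 0; i < NX; i++) x[i] = next[i];
}
```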
|
25 |
Neural network computing using on-chip accelerators / Eldridge, Schuyler, 05 November 2016
Neural networks, machine learning, and artificial intelligence in its broadest and most controversial sense have had a tumultuous history, spanning three distinct hype cycles and dating back to the 1960s. Resurgent, enthusiastic interest in machine learning and its applications bolsters the case for machine learning as a fundamental computational kernel. Furthermore, researchers have demonstrated that machine learning can be utilized as an auxiliary component of applications to enhance or enable new types of computation such as approximate computing or automatic parallelization. In our view, machine learning becomes not the underlying application, but a ubiquitous component of applications. This view necessitates a different approach to the deployment of machine learning computation, one that spans not only the hardware design of accelerator architectures but also the user and supervisor software needed to enable the safe, simultaneous use of machine learning accelerator resources.
In this dissertation, we propose a multi-transaction model of neural network computation to meet the needs of future machine learning applications. We demonstrate that this model, encompassing a decoupled backend accelerator for inference and learning together with hardware and software for managing neural network transactions, can be realized with low overhead and integrated with a modern RISC-V microprocessor. Our extensions span user and supervisor software and data structures and, coupled with our hardware, enable multiple transactions from different address spaces to execute simultaneously, yet safely. Together, our system demonstrates the utility of a multi-transaction model in delivering energy-efficiency improvements and higher overall accelerator throughput for machine learning applications.
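One way to picture the supervisor-side bookkeeping a multi-transaction model implies is a small table pairing each in-flight request's address space with a network configuration and a state. The sketch below is a hypothetical illustration; the field names and sizes are assumptions, not the dissertation's actual data structures.

```c
#include <stdint.h>

typedef enum { T_FREE, T_RESERVED, T_RUNNING, T_DONE } t_state;

/* One in-flight neural network transaction: which address space asked,
   which network configuration to run, and where its buffers live. */
typedef struct {
    uint16_t  asid;     /* address space of the requesting process   */
    uint16_t  nnid;     /* neural-network configuration identifier   */
    uintptr_t in, out;  /* user buffers, translated per address space */
    t_state   state;
} transaction_t;

#define MAX_TX 8
static transaction_t table[MAX_TX];

/* Reserve a slot; the accelerator backend drains RUNNING entries, so
   transactions from different address spaces can interleave safely. */
int tx_reserve(uint16_t asid, uint16_t nnid)
{
    for (int i = 0; i < MAX_TX; i++)
        if (table[i].state == T_FREE) {
            table[i] = (transaction_t){ asid, nnid, 0, 0, T_RESERVED };
            return i;
        }
    return -1; /* backend full: caller retries */
}
```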
|
26 |
High Level Power Estimation and Reduction Techniques for Power Aware Hardware Design / Ahuja, Sumit, 14 June 2010
The unabated continuation of Moore's law has allowed the number of transistors per unit area of a silicon die to double every two years or so. At the same time, increasing demand on consumer electronics and computing equipment to run sophisticated applications has led to an unprecedented complexity of hardware designs. These factors have necessitated raising the abstraction level of design entry for hardware systems beyond the Register-Transfer Level (RTL) to the Electronic System Level (ESL). However, the power envelope imposed on designs by packaging and other thermal limitations, and the energy envelope imposed by battery-lifetime considerations, have also created a need for power/energy-efficient design. The confluence of these two technological issues has created an urgent need to solve two problems: (i) How do we enable a power-aware design flow with a design entry point at the Electronic System Level? (ii) How do we enable power-aware High Level Synthesis to automatically synthesize RTL implementations from ESL descriptions?
This dissertation distinguishes itself by addressing the following two issues: (i) Since the power/energy consumption of electronic systems largely depends on implementation details, and high-level models abstract away from such details, power/energy estimation at such levels has not been addressed thoroughly. (ii) Much work has been done on applying various techniques to control-data-flow graphs (CDFGs) to find power/area/latency Pareto points during behavioral synthesis. However, high-level C-based functional models of various compute-intensive components, which could easily be synthesized as co-processors, offer many opportunities to reduce power. Some of these savings opportunities are traditional ones, such as clock-gating and operand isolation. Exploring alternate granularities of these techniques with target applications in mind opens the door to traditional power-reduction opportunities at the high level.
This work therefore concentrates on the aforementioned two areas of inadequacy in hardware design methodologies. Our proposed solutions include utilizing ESL simulation traces and mapping them to lower abstraction levels for power estimation, and deriving statistical power models through regression-based learning for power estimation at early design stages. On the HLS front, we propose techniques that insert power-saving features during the synthesis process by exploring the granularity and scope of clock-gating and sequential clock-gating. Finally, this work shows how to marry the two domains, estimation and reduction: a power model is proposed that predicts the power savings obtainable through clock-gating and, in turn, guides HLS to insert clock-gating selectively. / Ph. D.
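A regression-based power macro-model of the kind described here reduces, in its simplest linear form, to predicting power from activity statistics extracted from simulation traces. The sketch below is a minimal illustration with assumed feature names and placeholder coefficients; a real model would be fit by least squares against gate-level power data.

```c
/* Linear power macro-model: power is predicted as a weighted sum of
   per-cycle activity statistics gathered from ESL simulation traces. */
typedef struct {
    double base;        /* idle/leakage term              */
    double per_toggle;  /* weight for switching activity  */
    double per_mem_op;  /* weight for memory accesses     */
} power_model_t;

double predict_power_mw(const power_model_t *m,
                        double toggle_rate, double mem_ops_per_cycle)
{
    return m->base
         + m->per_toggle * toggle_rate
         + m->per_mem_op * mem_ops_per_cycle;
}
```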
|
27 |
Hardware-Software Co-Design for Sensor Nodes in Wireless Networks / Zhang, Jingyao, 11 June 2013
Simulators are important tools for analyzing and evaluating design options for wireless sensor networks (sensornets) and hence have been studied intensively in the past decades. However, existing simulators only support evaluation of protocols and the software aspects of sensornet design. They cannot accurately capture the significant impact of various hardware designs on sensornet performance. As a result, the performance/energy benefits of customized hardware designs are difficult to evaluate in sensornet research. To fill this technical void, the first section describes the design and implementation of SUNSHINE, a scalable hardware-software emulator for sensornet applications.
SUNSHINE is the first sensornet simulator that effectively supports joint evaluation and design of sensor hardware and software performance in a networked context. SUNSHINE captures the performance of network protocols, software, and hardware up to cycle-level accuracy through its seamless integration of three existing simulators: the network simulator TOSSIM, the instruction-set simulator SimulAVR, and the hardware simulator GEZEL. SUNSHINE solves several sensornet simulation challenges, including data exchange and time synchronization across different simulation domains and simulation accuracy levels. SUNSHINE also provides a hardware specification scheme for simulating flexible and customized hardware designs. Several experiments illustrate SUNSHINE's simulation capability, and evaluation results demonstrate that SUNSHINE is an efficient tool for software-hardware co-design in sensornet research.
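Cross-domain time synchronization of the kind SUNSHINE must perform is commonly handled with a conservative lockstep scheme. The sketch below is a generic illustration of that idea, not SUNSHINE's actual mechanism: no domain may advance past the earliest pending event of any peer, and boundary data is exchanged at each barrier.

```c
#include <stdint.h>

/* One simulation domain (e.g. network, instruction-set, or hardware
   simulator) exposed through a minimal co-simulation interface. */
typedef struct {
    uint64_t now;                     /* local simulated time (cycles) */
    uint64_t (*next_event)(void);     /* earliest local pending event  */
    void (*advance)(uint64_t until);  /* run this domain up to `until` */
} sim_domain_t;

/* Conservative lockstep: every domain may safely run up to the global
   minimum next-event time; cross-domain messages become visible at the
   barrier via `exchange`. */
void cosim_step(sim_domain_t *dom, int n, void (*exchange)(void))
{
    uint64_t horizon = UINT64_MAX;
    for (int i = 0; i < n; i++) {
        uint64_t t = dom[i].next_event();
        if (t < horizon) horizon = t;
    }
    for (int i = 0; i < n; i++) dom[i].advance(horizon);
    exchange();
    for (int i = 0; i < n; i++) dom[i].now = horizon;
}
```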
Even though SUNSHINE can simulate flexible sensor nodes (nodes containing FPGA chips as coprocessors) in wireless networks, it does not estimate the power/energy consumption of sensor nodes, and so far no other simulator evaluates the performance of such flexible nodes in wireless networks. The second section presents PowerSUNSHINE, a power- and energy-estimation tool that fills this void. PowerSUNSHINE is the first scalable power/energy estimation tool for WSNs that provides accurate predictions for both fixed and flexible sensor nodes. We first describe the requirements and challenges of building PowerSUNSHINE, then present power/energy models for both fixed and flexible sensor nodes. Two testbeds, a MicaZ platform and a flexible node consisting of a microcontroller, a radio, and an FPGA-based co-processor, demonstrate the simulation fidelity of PowerSUNSHINE. Several evaluation results based on simulation and the testbeds show that PowerSUNSHINE is a scalable simulation tool that provides accurate estimates of power/energy consumption for both fixed and flexible sensor nodes.
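State-based energy accounting is the usual basis for this kind of estimation: each component charges energy at its current state's power draw for the time spent in that state. The sketch below illustrates only the bookkeeping; the state set and power fields are assumptions, and real numbers would come from measurement or datasheets.

```c
/* Per-component energy accounting: on every state change, accumulate
   (power in previous state) x (time spent in that state). */
typedef enum { S_SLEEP, S_IDLE, S_ACTIVE, S_TX, S_RX, S_NSTATES } comp_state;

typedef struct {
    double power_mw[S_NSTATES];  /* per-state power draw           */
    double energy_mj;            /* accumulated energy so far      */
    comp_state state;            /* current state                  */
    double last_change_ms;       /* when the current state began   */
} component_t;

void set_state(component_t *c, comp_state s, double now_ms)
{
    /* mW x ms = microjoules; divide by 1000 to keep millijoules. */
    c->energy_mj += c->power_mw[c->state] * (now_ms - c->last_change_ms) / 1000.0;
    c->state = s;
    c->last_change_ms = now_ms;
}
```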
Since the main components of a sensor node are a microcontroller and a wireless transceiver (radio), their real-time performance may become a bottleneck when executing computation-intensive tasks in sensor networks. A coprocessor can relieve the microcontroller of some of this burden and hence decrease the probability of dropping packets from the wireless channel. Yet even though adding a coprocessor benefits sensor networks, designing applications for sensor nodes with coprocessors from scratch is challenging, because design details must be considered across multiple domains: software, hardware, and network. To solve this problem, we propose a hardware-software co-design framework for network applications that contain multiprocessor sensor nodes. The framework includes a three-layered architecture for multiprocessor sensor nodes and application interfaces under the framework. The layered architecture makes the design of applications for multiprocessor nodes flexible and efficient, while the application interfaces support the deployment of reliable applications on such nodes. A resource-sharing technique lets the processor, coprocessor, and radio work in coordination over the communication bus, as sketched below. Several testbeds containing multiprocessor sensor nodes were deployed to evaluate the effectiveness of our framework, and network experiments executed in the SUNSHINE emulator demonstrate the benefits of using multiprocessor sensor nodes in many network scenarios. / Ph. D.
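A toy version of that resource-sharing idea, with the bus guarded by a single atomic ownership flag (real designs would use hardware arbitration; all names here are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef enum { OWNER_NONE, OWNER_MCU, OWNER_COPROC, OWNER_RADIO } bus_owner;

static _Atomic bus_owner bus = OWNER_NONE;

/* Claim the shared communication bus if it is currently free. */
bool bus_acquire(bus_owner who)
{
    bus_owner expected = OWNER_NONE;
    return atomic_compare_exchange_strong(&bus, &expected, who);
}

/* Release the bus only if we actually hold it. */
void bus_release(bus_owner who)
{
    bus_owner expected = who;
    atomic_compare_exchange_strong(&bus, &expected, OWNER_NONE);
}
```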
|
28 |
Towards the development of a reliable reconfigurable real-time operating system on FPGAs / Hong, Chuan, January 2013
In the last two decades, Field Programmable Gate Arrays (FPGAs) have rapidly developed from simple “glue-logic” into a powerful platform capable of implementing a System on Chip (SoC). Modern FPGAs achieve not only high performance compared with General Purpose Processors (GPPs), thanks to hardware parallelism and dedication, but also better programming flexibility than Application Specific Integrated Circuits (ASICs). Moreover, the hardware programming flexibility of FPGAs is further harnessed for both performance and manipulability, which makes Dynamic Partial Reconfiguration (DPR) possible. DPR allows a part or parts of a circuit to be reconfigured at run-time without interrupting the rest of the chip’s operation. As a result, hardware resources can be exploited more efficiently, since chip resources can be reused by swapping hardware tasks in and out of the chip in a time-multiplexed fashion. In addition, DPR improves fault tolerance against transient errors and permanent damage; for example, Single Event Upsets (SEUs) can be mitigated by reconfiguring the FPGA to avoid error accumulation. Furthermore, power and heat can be reduced by removing finished or idle tasks from the chip. For all these reasons, DPR has significantly promoted Reconfigurable Computing (RC) and has become a very hot topic. However, since hardware integration is increasing at an exponential rate and applications are becoming more complex with the growth of user demands, high-level application design and low-level hardware implementation are increasingly separated and layered. As a consequence, users can obtain little advantage from DPR without the support of system-level middleware. To bridge the gap between high-level applications and low-level hardware implementation, this thesis presents important contributions towards a Reliable, Reconfigurable and Real-Time Operating System (R3TOS), which lets users exploit DPR from the application level by managing the complex hardware in the background. In R3TOS, hardware tasks behave just like software tasks: they can be created, scheduled, and mapped to different computing resources on the fly. The novel contributions of this work are: 1) an efficient implementation of a task scheduler and allocator; 2) a novel real-time scheduling algorithm (FAEDF) and two efficacious allocation algorithms (EAC and EVC), which schedule tasks in real time and circumvent emerging faults while maintaining more compact empty areas; 3) the design and implementation of a fault-tolerant microprocessor that harnesses existing FPGA resources, such as Error Correction Code (ECC) and configuration primitives; 4) a novel symmetric multiprocessing (SMP)-based architecture that supports a shared-memory programming interface; and 5) two demonstrations of the integrated system: a) a K-Nearest Neighbour classifier, a non-parametric classification algorithm widely used in various fields of data mining, and b) pairwise sequence alignment, namely the Smith-Waterman algorithm, used for identifying similarities between two biological sequences. R3TOS gives considerably higher flexibility to support scalable multi-user, multitasking applications, whereby resources can be dynamically managed with respect to user requirements and hardware availability. Benefiting from this, not only can hardware resources be used more efficiently, but system performance can also be significantly increased.
Results show that scheduling and allocation efficiency are improved by up to 2x, and the overall system performance is further improved by ~2.5x. Future work includes the development of a Network-on-Chip (NoC), which is expected to further increase communication throughput, as well as the standardization and automation of our system design, to be carried out in line with the enablement of other high-level synthesis tools, allowing application developers to benefit from the system more efficiently.
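To make the scheduler/allocator interplay concrete, here is a heavily simplified, illustrative sketch in the spirit of (but much simpler than) FAEDF/EAC/EVC: ready hardware tasks are served in earliest-deadline-first order, and a task that does not yet fit the free area is deferred, not rejected, until a finishing task releases area.

```c
#include <stdbool.h>

typedef struct {
    unsigned deadline;  /* absolute deadline (ticks)        */
    unsigned area;      /* reconfigurable area units needed */
    bool     ready;
} hw_task_t;

static unsigned free_area = 100;  /* 1-D stand-in for the 2-D placer */

/* Earliest-deadline-first pick over ready hardware tasks; a task whose
   area demand cannot be met yet stays ready and is retried when area
   is released. Toy approximation, not the published algorithms. */
int schedule_next(hw_task_t *t, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (t[i].ready && (best < 0 || t[i].deadline < t[best].deadline))
            best = i;
    if (best >= 0 && t[best].area <= free_area) {
        free_area -= t[best].area;   /* place the task on the fabric */
        t[best].ready = false;
        return best;
    }
    return -1;                       /* no fit yet, or nothing ready */
}

/* Called when a hardware task leaves the fabric. */
void task_finished(hw_task_t *t) { free_area += t->area; }
```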
|
29 |
Co-Projeto de hardware/software para correlação de imagens / Hardware/software co-design for image cross-correlation / Dias, Maurício Acconcia, 26 July 2011
This work presents an FPGA-based hardware/software co-design of the normalized image cross-correlation algorithm, with the goal of achieving a significant speedup over the execution time of the all-software implementation. The work compares a broad and meaningful set of 21 different configurations of the Nios II soft-processor implemented on the FPGA, including the addition of new dedicated instructions. The co-design was developed using a modified profiling-based method that adds a software development and optimization cycle. Configurations were compared by execution time to measure the speedup achieved during co-design development, which reached a significant performance gain. The influence of basic and dedicated hardware structures on the algorithm's final execution time was also analyzed. The results suggest that the method is efficient considering the speedup achieved, but the total execution time still remains above what is required for real-time image processing in robotic navigation systems. These real-time limitations are, however, also tied to the performance constraints of the hardware adopted in the project, a low-cost, medium-capacity FPGA.
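The kernel being accelerated is the classic zero-mean normalized cross-correlation; a plain C baseline (illustrative, not the thesis's code) of the score between an image window and a template is:

```c
#include <math.h>
#include <stddef.h>

/* Normalized cross-correlation between an image window and a template,
   both w*h pixels: subtract each patch's mean, then divide the raw
   correlation by the product of the patch norms. Result lies in [-1, 1];
   the caller must ensure neither patch is constant (zero variance). */
double ncc(const unsigned char *win, const unsigned char *tpl,
           size_t w, size_t h)
{
    size_t n = w * h;
    double mw = 0.0, mt = 0.0;
    for (size_t i = 0; i < n; i++) { mw += win[i]; mt += tpl[i]; }
    mw /= (double)n; mt /= (double)n;

    double num = 0.0, dw = 0.0, dt = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = (double)win[i] - mw, b = (double)tpl[i] - mt;
        num += a * b; dw += a * a; dt += b * b;
    }
    return num / sqrt(dw * dt);
}
```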
|