Spelling suggestions: "subject:"lls"" "subject:"alls""
21 |
Creation of a Technology Independent Design FlowUrvantsev, Anton January 2019 (has links)
Modern embedded systems development poses new challenges to a designer due to the global reachability of the contemporary market. One product shipped to different countries or customers should satisfy varying conditions, standards and constraints. Variability of a developed system should be taken into account by a designer. In a case of the embedded heterogeneous systems, this problem becomes challenging. Along with the variability heterogeneity of a system introduces new tasks, which should be addressed during design process. In this work, we propose a technology independent design flow. The proposed solution is supported by state-of-the-art tools and takes into account variability, partitioning, interfacing and dependency resolving processes. This thesis is conducted as a case study. We explored a design process of an industrial project, identified existing challenges and drawbacks in the existing solutions. We propose a new approach to a design flow of heterogeneous embedded systems. Also, a tool, supporting the presented solution, is implemented, which would allow a developer to include this approach into everyday design flow in order to increase a development speed and enable a task automation.
|
22 |
Detection algorithms and FPGA implementations for SC-FDMA uplink receiversHänninen, T. (Tuomo) 29 June 2018 (has links)
Abstract
The demand in mobile broadband communications is increasing dramatically. It is expected that 1000 times more mobile-network capacity will be needed within 10 years. Multiple-input, multiple-output (MIMO) antenna configuration and spatial multiplexing are among the essential techniques for reaching the targets. This creates motivation for study of advanced receivers for combating inter-antenna interference (IAI) and inter-symbol interference (ISI). While various receiver structures have been extensively considered for MIMO receivers, the emphasis has been on those operating in downlink orthogonal frequency-division multiple access (OFDM) systems, wherein ISI is not a problem.
Advanced receiver structures for single-carrier frequency-division multiple access (SC-FDMA) uplink systems were studied and analysed. Various receivers were compared via MATLAB simulations, with the objective being to gain solid understanding of how they perform in different channel environments. An efficient combination of IAI and ISI equalisation for SC-FDMA receivers is proposed. The proposed receiver architecture is shown to be a considerable improvement over the conventional linear minimum mean-square error (LMMSE) receiver. Several MIMO detector algorithms and their performance–complexity characteristics are presented. The K-best algorithm with a list size of 8 is shown to be the best option for practical MIMO detector implementation of this receiver in the 4x4 MIMO 64-level quadrature amplitude modulation (QAM) scenario.
The second objective involved examining the implementation aspects of the 8-best receiver to achieve good understanding of the complexity of various implementation architectures. It emerged that avoiding the sorting operation in the 8-best list sphere detector (LSD) tree-search algorithm implementation is not recommendable in the 4x4 MIMO 64-QAM scenario. Several field-programmable gate array (FPGA) implementations were carried out, with a range of high-level synthesis (HLS) tools. It is shown that HLS tools have improved significantly and are especially favourable for prototyping of large designs. Additionally, the importance of FPGA technology selection is addressed. Smaller silicon technology should be exploited if base-station baseband processing power consumption is to be minimised. The potential performance or complexity-related gain with the latest FPGAs should be taken into account in comparison of the performance–complexity characteristics of the algorithms. Differences of a few tens of per cent in estimated complexity or performance between two algorithms are often below the threshold of what can be gained or lost in the practical implementation process. / Tiivistelmä
Tiheään asuttujen kaupunkien uudet langattomat palvelut tarvitsevat tietoliikenneverkkoja, jotka mahdollistavat suuremman tiedonsiirtonopeuden ja kapasiteetin kuin sen, jonka nykyiset mobiiliverkot voivat tarjota. On arveltu, että mobiiliverkkojen kapasiteetin tarve tuhatkertaistuu seuraavan kymmenen vuoden aikana. Tuhatkertainen kapasiteetti on arvioitu saavutettavan kasvattamalla kolmea eri osa-aluetta kymmenkertaiseksi: taajuusspektrin määrä, spektrin käytön tehokkuus sekä tukiasematiheys. Tämä väitöskirja keskittyy spektrin käytön tehokkuuden kasvattamiseen. Moniantennitoteutus (multiple-input multiple-output, MIMO) on siinä välttämätön. MIMO-tekniikkaa hyödyntävien solukkojärjestelmien tukiasemavastaanottimissa tarvitaan melko monimutkainen kanavakorjain sekä ilmaisin, joiden algoritmien optimointi ja toteutus ymmärretään vielä sangen puutteellisesti.
Väitöskirjatutkimuksen päätavoitteena on tutkia edistyksellisiä vastaanotinrakenteita, joilla saavutetaan LTE-A-standardin tavoitetiedonsiirtonopeus kohtuullisella kompleksisuudella. Työssä keskitytään ns. nousevaan siirtosuuntaan (uplink) eli päätelaitteesta tukiasemaan tapahtuvaan tiedonsiirtoon, jossa käytetään yhden kantoaallon taajuusjakomonikäyttötekniikkaa (single-carrier frequency-division multiple-access, SC-FDMA) ortognaalisen taajuusjakomonikäytön (orthogonal frequency division multiple access, OFDMA) sijaan. Eri vastaanotinrakenteita ja näiden ilmaisinalgoritmeja vertaillaan tietokonesimuloinnein MATLAB-ympäristössä. Väitöskirjassa ehdotetaan kaksiosaista vastaanotinrakennetta, jossa antennien välinen keskinäishäiriö (inter antenna interference, IAI) ja symbolien välinen keskinäisvaikutus (intersymbol interference, ISI) poistetaan kahdessa eri vaiheessa. Tietokoneimulaatiot osoittavat ko. rakenteen parantavan suorituskykyä huomattavasti perinteiseen lineaariseen keskineliövirheen minimoivaan (linear minimum mean square error, LMMSE) vastaanottimeen verrattuna. Nk. K parasta polkua valitsevan MIMO-ilmaisinalgoritmin listan koolla kahdeksan todetaan tarjoavan 4x4 MIMO 64-tasoisen kvadratuuriamplitudimodulaation (quadrature amplitude modulation, QAM) ympäristössä parhaan kompromissin suorituskyvyn ja kompleksisuuden suhteen.
Käytännön toteutettavuuden kannalta keskitytään ohjelmoitavaan digitaalipiiritoteutukseen (field-programmable gate array, FGPA) ja ns. korkean tason synteesi (high-level synthesis, HLS) -työkalujen käyttöön vastaanottimen suunnittelussa. K parasta polkua valitsevan MIMO-ilmaisinalgoritmin arkkitehtuurivertailut osoittavat, että sinänsä vaativaa lajittelualgoritmia ei aina kannata yrittää välttää kirjallisuudessa aikaisemmin ehdotetulla ratkaisulla. Useita eri HLS työkaluja käytetään FPGA toteutuksissa ja todetaan että työkalut ovat kehittyneet huomattavasti viimeisen kahdeksan vuoden aikana. Lisäksi todetaan, että 16 nm viivanleveyden piireillä voidaan saavuttaa noin 15 % suurempi ilmaisunopeus ja 60 % pienempi tehonkulutus verrattuna 28 nm viivanleveyttä käyttäviin piireihin. Erityisesti potentiaali tehonkulutuksen minimoiseksi kannattaa hyödyntää, mikäli signaalinkäsittely näyttelee merkittävää roolia vastaanottimen kokonaistehonkulutuksessa. Kokonaisuutena todetaan, että toteutukseen liittyvät valinnat sekä vaikutus lopputulokseen, tulisi ottaa huomioon jo algoritmien valinnassa. Pieni ero kahden eri algoritmin suorituskyvyn välillä häviää helposti toteutusvaiheen ratkaisujen vaikutusten alle.
|
23 |
Localized Application for Video Capture for a Multimedia Sensor Node with Name-Based Segment StreamingJanuary 2018 (has links)
abstract: The Internet of Things (IoT) has become a more pervasive part of everyday life. IoT networks such as wireless sensor networks, depend greatly on the limiting unnecessary power consumption. As such, providing low-power, adaptable software can greatly improve network design. For streaming live video content, Wireless Video Sensor Network Platform compatible Dynamic Adaptive Streaming over HTTP (WVSNP-DASH) aims to revolutionize wireless segmented video streaming by providing a low-power, adaptable framework to compete with modern DASH players such as Moving Picture Experts Group (MPEG-DASH) and Apple’s Hypertext Transfer Protocol (HTTP) Live Streaming (HLS). Each segment is independently playable, and does not depend on a manifest file, resulting in greatly improved power performance. My work was to show that WVSNP-DASH is capable of further power savings at the level of the wireless sensor node itself if a native capture program is implemented at the camera sensor node. I created a native capture program in the C language that fulfills the name-based segmentation requirements of WVSNP-DASH. I present this program with intent to measure its power consumption on a hardware test-bed in future. To my knowledge, this is the first program to generate WVSNP-DASH playable video segments. The results show that our program could be utilized by WVSNP-DASH, but there are issues with the efficiency, so provided are an additional outline for further improvements. / Dissertation/Thesis / Masters Thesis Computer Engineering 2018
|
24 |
Accelerated Simulation of Modelica Models Using an FPGA-Based ApproachLundkvist, Herman, Yngve, Alexander January 2018 (has links)
This thesis presents Monza, a system for accelerating the simulation of modelsof physical systems described by ordinary differential equations, using a generalpurpose computer with a PCIe FPGA expansion card. The system allows bothautomatic generation of an FPGA implementation from a model described in theModelica programming language, and simulation of said system.Monza accomplishes this by using a customizable hardware architecture forthe FPGA, consisting of a variable number of simple processing elements. A cus-tom compiler, also developed in this thesis, tailors and programs the architectureto run a specific model of a physical system.Testing was done on two test models, a water tank system and a Weibel-lung,with up to several thousand state variables. The resulting system is several timesfaster for smaller models and somewhat slower for larger models compared to aCPU. The conclusion is that the developed hardware architecture and softwaretoolchain is a feasible way of accelerating model execution, but more work isneeded to ensure faster execution at all times.
|
25 |
HTTP Based Adaptive Bitrate Streaming Protocols in Live Surveillance SystemsDzabic, Daniel, Jacob, Mårtensson January 2018 (has links)
This thesis explores possible solutions to replace Adobe Flash Player by using toolsalready built into modern web browsers, and explores the tradeoffs between bitrate, qual-ity, and delay when using an adaptive bitrate for live streamed video. Using an adaptivebitrate for streamed video was found to reduce stalls in playback for the client by adapt-ing to the available bandwidth. A newer codec can further compress the video file sizewhile maintaining the same video quality. This can improve the viewing experience forclients on a restricted or a congested network. The tests conducted in this thesis showthat producing an adaptive bitrate stream and changing codecs is a very CPU intensiveprocess.
|
26 |
Compiler-Based Tools to Aid in Data Transfer Optimization and On-Chip Debug of Heterogeneous Compute SystemsAshcraft, Matthew B. 07 July 2020 (has links)
First, we present techniques to efficiently schedule data transfers through compiler analyses. Compared to transferring data immediately before and after the kernel executes, our scheduling results in orders of magnitude improvements in execution time, number of data transfers, and number of bytes transferred. Second, we demonstrate techniques to provide on-chip debugging for heterogeneous systems through recording execution on the software in addition to debugging circuitry in the hardware, and provide a temporal correlation between the hardware and software traces through synchronization. This allows us to follow debug data between the hardware and software trace buffers. Due to the added cost of synchronizing the trace buffers, we explore synchronization schemes which can reduce the impact synchronization depending on the code structure. We demonstrate the quantitative impact of these techniques on execution time and hardware and software resources, which are under a 2x increase to execution time in most cases. Third, we demonstrate how source-code debugging techniques for on-chip debugging can be applied to OpenCL FPGA kernels in heterogeneous systems. We developed techniques and a tool-flow that allows users to select variables to record, automatically insert recording instructions into the kernel source code, synthesize the changes directly into the hardware design using commercial HLS tools, retrieve the trace data through kernel arguments, and present it to the user. Overall, quantitative measurements showed our techniques resulted in modest increases to execution time and hardware resources.
|
27 |
An Investigation Into Partitioning Algorithms for Automatic Heterogeneous CompilersLeija, Antonio M 01 September 2015 (has links) (PDF)
Automatic Heterogeneous Compilers allows blended hardware-software solutions to be explored without the cost of a full-fledged design team, but limited research exists on current partitioning algorithms responsible for separating hardware and software. The purpose of this thesis is to implement various partitioning algorithms onto the same automatic heterogeneous compiler platform to create an apples to apples comparison for AHC partitioning algorithms. Both estimated outcomes and actual outcomes for the solutions generated are studied and scored. The platform used to implement the algorithms is Cal Poly’s own Twill compiler, created by Doug Gallatin last year. Twill’s original partitioning algorithm is chosen along with two other partitioning algorithms: Tabu Search + Simulated Annealing (TSSA) and Genetic Search (GS). These algorithms are implemented inside Twill and test bench input code from the CHStone HLS Benchmark tests is used as stimulus. Along with the algorithms cost models, one key attribute of interest is queue counts generated, as the more cuts between hardware and software requires queues to pass the data between partition crossings. These high communication costs can end up damaging the heterogeneous solution’s performance. The Genetic, TSSA, and Twill’s original partitioning algorithm are all scored against each other’s cost models as well, combining the fitness and performance cost models with queue counts to evaluate each partitioning algorithm. The solutions generated by TSSA are rated as better by both the cost model for the TSSA algorithm and the cost model for the Genetic algorithm while producing low queue counts.
|
28 |
On the Programmability and Performance of OpenCL Designs for FPGAVerma, Anshuman 09 February 2018 (has links)
Field programmable gate arrays (FPGAs) have been emerging as a promising bedrock to provide opportunities for several types of accelerators that spans across various domains such as finance, web-search, and data center networking, among others. Research interests facilitating the development of accelerators on FPGAs are increasing significantly, in particular, because of their effectiveness with a variety of applications, flexibility, and high performance per watt. However, several key challenges remain that hinder their large-scale deployment. Overcoming these challenges would enable them to match the pervasiveness of graphics processor units (GPUs), their principal competitors in this arena. One of the primary reasons responsible for the slow adaptation by programmers has been the programming model, which uses a low-level hardware description language (HDL).
Using HDLs require a detailed understanding of logic design and significant effort to implement and verify the behavioral models, with the latter growing with its complexity. Recent advancements in high-level language synthesis (HLS) tools have addressed this challenge to a considerable extent by allowing the programmers to write their applications in a high-level language named OpenCL. These applications are then compiled and synthesized to create a bitstream that configures the FPGA. This thesis characterizes the efficacy of HLS compiler optimizations that can be employed to improve the performance of these applications.
The synthesized hardware from OpenCL kernels is fundamentally different from traditional hardware such as CPUs and GPUs, which exploit instruction level parallelism (ILP) thread level parallelism (TLP), or data level parallelism (DLP) for performance gains. FPGAs typically use deep pipelining (i.e., ILP) for performance. A stall in this pipeline may severely undermine the performance of applications. Thus, it is imperative to identify and remove any such bottlenecks. To this end, this thesis presents and discusses a software-centric framework to debug and profile the synthesized designs generated using HLS tools. This thesis proposes basic code patterns, including a timestamp and a scalable framework, which can be plugged easily into OpenCL kernels, to collect and process run-time information dynamically. This scalable framework has a small overhead for area utilization and frequency but provides fine-grained information about the bottlenecks and latencies in design.
Additionally, although HLS tools have improved programmability, this may come at the cost of performance or area utilization. This thesis addresses this design trade-off via a comparative study of a hand-coded design in HDL and an architecturally similar, tool-generated design using an OpenCL compiler in the application area of 3D-stencil (i.e., structured grid) computation. Experiments in this thesis show that the performance of an OpenCL approach can achieve 95% of the peak attainable performance of a microkernel for multiple problem sizes. In comparison to the OpenCL approach, an HDL approach results in approximately 50% less memory usage and only 2% better performance on average. / MS / A hardware chip consists of switches or transistors, and a modern chip can have a few billions of them. Specifying the interconnection among these transistors and their placement on a chip is a complex problem. To simplify this, the chip-design flow uses automated tools and abstraction at the different levels of the flow, such as architecture, design, synthesis, placement, among others. During design, an engineer specifies the behavioral model in a hardware description language (HDL), which is later used by the automated tools for further processing. Using the HDL requires a detailed understanding of logic design and significant effort to implement and verify the behavioral models, with the latter growing with its complexity. Recent advancements in high-level language synthesis tools have addressed this challenge to a considerable extent by allowing the programmers to write their applications in a high-level language. This thesis characterizes the efficacy of such a tool and available optimizations that can be employed to improve the performance of these applications.
Additionally, this thesis presents and discusses a framework to debug and profile the designs generated using high-level synthesis tools, which can be plugged easily into an application, to collect and process run-time information dynamically. This scalable framework has a small overhead but provides fine-grained information about the bottlenecks in the design. Furthermore, the experiments in this work show that a design generated from a high-level synthesis tool has similar performance when compared to a manual design in HDL, at the expense of area utilization.
|
29 |
A design flow to automatically Generate on chip monitors during high-level synthesis of Hardware accelarators / Un flot de conception pour générer automatiquement des moniteurs sur puce pendant la synthèse de haut niveau d'accélérateurs matérielsBen Hammouda, Mohamed 11 December 2014 (has links)
Les systèmes embarqués sont de plus en plus utilisés dans des domaines divers tels que le transport, l’automatisation industrielle, les télécommunications ou la santé pour exécuter des applications critiques et manipuler des données sensibles. Ces systèmes impliquent souvent des intérêts financiers et industriels, mais aussi des vies humaines ce qui impose des contraintes fortes de sûreté. Par conséquent, un élément clé réside dans la capacité de tels systèmes à répondre correctement quand des erreurs se produisent durant l’exécution et ainsi empêcher des comportements induits inacceptables. Les erreurs peuvent être d’origines naturelles telles que des impacts de particules, du bruit interne (problème d’intégrité), etc. ou provenir d’attaques malveillantes. Les architectures de systèmes embarqués comprennent généralement un ou plusieurs processeurs, des mémoires, des contrôleurs d’entrées/sorties ainsi que des accélérateurs matériels utilisés pour améliorer l’efficacité énergétique et les performances. Avec l’évolution des applications, le cycle de conception d’accélérateurs matériels devient de plus en plus complexe. Cette complexité est due en partie aux spécifications des accélérateurs matériels qui reposent traditionnellement sur l’écriture manuelle de fichiers en langage de description matérielle (HDL).Cependant, la synthèse de haut niveau (HLS) qui favorise la génération automatique ou semi-automatique d’accélérateurs matériels à partir de spécifications logicielles, comme du code C, permet de réduire cette complexité.Le travail proposé dans ce manuscrit cible l’intégration d’un support de vérification dans les outils de HLS pour générer des moniteurs sur puce au cours de la synthèse de haut niveau des accélérateurs matériels. Trois contributions distinctes ont été proposées. La première contribution consiste à contrôler les erreurs de comportement temporel des entrées/sorties (impactant la synchronisation avec le reste du système) ainsi que les erreurs du flot de contrôle (sauts illégaux ou problèmes de boucles infinies). La synthèse des moniteurs est automatique sans qu’aucune modification de la spécification utilisée en entrée de la HLS ne soit nécessaire. La deuxième contribution vise la synthèse des propriétés de haut niveau (ANSI-C asserts) qui ont été ajoutées dans la spécification logicielle de l’accélérateur matériel. Des options de synthèse ont été proposées pour arbitrer le compromis entre le surcout matériel, la dégradation de la performance et le niveau de protection. La troisième contribution améliore la détection des corruptions des données qui peuvent modifier les valeurs stockées, et/ou modifier les transferts de données, sans violer les assertions (propriétés) ni provoquer de sauts illégaux. Ces erreurs sont détectées en dupliquant un sous-ensemble des données du programme, limité aux variables les plus critiques. En outre, les propriétés sur l’évolution des variables d’induction des boucles ont été automatiquement extraites de la description algorithmique de l’accélérateur matériel. Il faut noter que l’ensemble des approches proposées dans ce manuscrit, ne s’intéresse qu’à la détection d’erreurs lors de l’exécution. La contreréaction c.à.d. la manière dont le moniteur réagit si une erreur est détectée n’est pas abordée dans ce document. / Embedded systems are increasingly used in various fields like transportation, industrial automation, telecommunication or healthcare to execute critical applications and manipulate sensitive data. These systems often involve financial and industrial interests but also human lives which imposes strong safety constraints.Hence, a key issue lies in the ability of such systems to respond safely when errors occur at runtime and prevent unacceptable behaviors. Errors can be due to natural causes such as particle hits as well as internal noise, integrity problems, but also due to malicious attacks. Embedded system architecture typically includes processor (s), memories, Input / Output interface, bus controller and hardware accelerators that are used to improve both energy efficiency and performance. With the evolution of applications, the design cycle of hardware accelerators becomes more and more complex. This complexity is partly due to the specification of hardware accelerators traditionally based on handwritten Hardware Description Language (HDL) files. However, High-Level Synthesis (HLS) that promotes automatic or semi-automatic generation of hardware accelerators according to software specification, like C code, allows reducing this complexity.The work proposed in this document targets the integration of verification support in HLS tools to generate On-Chip Monitors (OCMs) during the high-level synthesis of hardware accelerators (HWaccs). Three distinct contributions are proposed. The first one consists in checking the Input / Output timing behavior errors (synchronization with the whole system) as well as the control flow errors (illegal jumps or infinite loops). On-Chip Monitors are automatically synthesized and require no modification in their high-level specification. The second contribution targets the synthesis of high-level properties (ANSI-C asserts) that are added into the software specification of HWacc. Synthesis options are proposed to trade-off area overhead, performance impact and protection level. The third contribution improves the detection of data corruptions that can alter the stored values or/and modify the data transfers without causing assertions violations or producing illegal jumps. Those errors are detected by duplicating a subset of program’s data limited to the most critical variables. In addition, the properties over the evolution of loops induction variables are automatically extracted from the algorithmic description of HWacc. It should be noticed that all the proposed approaches, in this document, allow only detecting errors at runtime. The counter reaction i.e. the way how the HWacc reacts if an error is detected is out of scope of this work.
|
30 |
Infraestrutura de compilação para a implementação de aceleradores em FPGARettore, Paulo Henrique Lopes 23 November 2012 (has links)
Made available in DSpace on 2016-06-02T19:06:00Z (GMT). No. of bitstreams: 1
4747.pdf: 5016839 bytes, checksum: ca7594d5895754f4ee9eb215e548c3cc (MD5)
Previous issue date: 2012-11-23 / Financiadora de Estudos e Projetos / In recent years, performance improvements in sequential microprocessors have been limited by physical and technological factors. For this reason, alternative approaches for high performance execution have gained importance. One of them is based in the use of reconfigurable hardware, implemented using FPGAs. However, conventional methods for programming those devices are notoriously complex, usually based on hardware description languages such as VHDL and Verilog. This work presents the development of a compilation framework to support the translation of a loop, described in C language, into its corresponding version for synthesis in reconfigurable hardware. The optimized execution is based on the loop pipelining technique, which requires advanced compiler support. That is achieved by using the Cetus compiler, enhanced by a number of modifications, and thus used as a basis for the semi-automatic generation of custom-hardware accelerators. In order to guide the compiler developments and validate its basic functionalities, two study cases were considered: one based on finite state machines as the method of choice for hardware modelling (EC-1), and another based on the LALP domain specific language. In both cases, the proposed compilation framework have shown to be a facilitator element for the development of high performance custom-hardware. / O aumento no desempenho de processadores sequenciais tem sido limitado severamente por fatores físicos e tecnológicos nos últimos anos. Dessa forma, abordagens alternativas para a execução com alto desempenho ganharam maior importância nos últimos anos. Uma delas baseia-se na utilização de hardware customizado, implementado utilizando-se FPGAs. Entretanto, os métodos convencionais para programação desses dispositivos são notoriamente complexos, normalmente baseados em linguagens como VHDL e Verilog. Este trabalho apresenta o desenvolvimento de um framework de compilação para auxiliar a transformação de um loop, escrito em linguagem C, em sua versão para hardware customizado. A execução otimizada baseia-se na técnica de loop pipelining, a qual exige suporte avançado de compilação. Este é conseguido utilizando o compilador Cetus, que após uma série de modificações, pode ser utilizado como base para a geração semi-automática de aceleradores em hardware customizado. Como forma de guiar o desenvolvimento do compilador e validar suas funcionalidades básicas, dois casos de estudo foram considerados: um baseado na utilização de máquinas de estados finitos como método para a modelagem de hardware (EC-1), e outro baseado na linguagem de domínio específico LALP (EC-2). Em ambos os casos, o framework de compilação proposto mostrou-se útil como elemento facilitador ao desenvolvimento de hardware customizado de alto desempenho.
|
Page generated in 0.0462 seconds