Global ETD Search

1	Accelerating Emerging Neural Workloads Jacob R Stevens (11805797) 20 December 2021 (has links) <div> Due to a combination of algorithmic advances, wide-spread availability of rich data sets, and tremendous growth in compute availability, Deep Neural Networks (DNNs) have seen considerable success in a wide variety of fields, achieving state-of-the art accuracy in a number of perceptual domains, such as text, video and audio processing. Recently, there have been many efforts to extend this success in the perceptual, Euclidean-based domain to non-perceptual tasks, such as task planning or reasoning, as well as to non-Euclidean domains, such as graphs. While several DNN accelerators have been proposed in the past decade, they largely focus on traditional DNN workloads, such as Multi-layer Perceptions (MLPs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). These accelerators are ill-suited to the unique computational needs of the emerging neural networks. In this dissertation, we aim to fix this gap by proposing novel hardware architectures that are specifically tailored to emerging neural workloads.</div><div><br></div><div>First, we consider memory-augmented neural networks (MANNs), a new class of neural networks that exhibits capabilities such as one-shot learning and task planning that are well beyond those of traditional DNNs. MANNs augment a traditional DNN with an external differentiable memory that is used to store dynamic state. This dissertation proposes a novel accelerator that targets the main bottleneck of MANNs: the soft reads and writes to this external memory, each of which requires access to all the memory locations.</div><div><br></div><div>We then focus on Transformer networks, which have become very popular for Natural Language Processing (NLP). A key to the success of these networks is a technique called self-attention, which employs a softmax operation. Softmax is poorly supported in modern, matrix multiply-focused accelerators since it accounts for a very small fraction of traditional DNN workloads. We propose a hardware/software co-design approach to realize softmax efficiently by utilize a suite of approximate computing techniques.</div><div><br></div><div>Next, we address graph neural networks (GNNs). GNNs are achieving state-of-the-art results in a variety of fields such as physics modeling, chemical synthesis, and electronic design automation. These GNNs are a hybrid between graph processing workloads and DNN workloads; they utilize DNN-based feature extractors to form hidden representations for each node in a graph and then combine these representations through some form of a graph traversal. As a result, existing hardware specialized for either graph processing workloads or DNN workloads is insufficient. Instead, we design a novel architecture that balances the needs of these two heterogeneous compute patterns. We also propose a novel feature dimension-blocking dataflow to further increase performance by mitigating the memory bottleneck.</div><div><br></div><div>Finally, we address the growing difficulty in tightly coupling new DNNs and a hardware platform. Given the extremely large DNN-HW design space consisting of DNN selection, hardware operating condition, and DNN-to-HW mapping, it is infeasible to exhaustively search this space by running each sample on a physical hardware device. This has led to the need for highly accurate, machine learning-based performance models which can \emph{predict} the latency/power/energy even faster than direct execution. We first present a taxonomy to characterize the possible approaches to these performance estimators. Based on the insights from this taxonomy, we present a new performance estimator that combines coarse-grained and fine-grained to achieve superior accuracy with a limited number of training samples. Finally, we propose a flexible framework for creating these DNN-HW performance estimators.</div><div><br></div><div>In summary, this dissertation identifies the growing gap between current hardware and new emerging neural networks. We first propose three novel hardware architectures that address this gap for MANNs, Transformers, and GNNs. We then propose a novel hardware-aware DNN estimator and framework to ease addressing this gap for new networks in the future.</div> Computer Engineering Computer architecture Deep Learning hardware software codesign
2	Definition and evaluation of spatio-temporal scheduling strategies for 3D multi-core heterogeneous architectures / Définition et évaluation des stratégies d’ordonnancement spatio-temporel pour les architectures 3D multicore hétérogènes Khuat, Quang Hai 16 March 2015 (has links) Empilant une couche multiprocesseur (MPSoC) et une couche de FPGA pour former un système sur puce reconfigurable en trois dimension (3DRSoC), est une solution prometteuse donnant un niveau de flexibilité élevé en adaptant l'architecture aux applications visées. Pour une application exécutée sur ce système, l'un des principaux défis vient de la gestion de haut niveau des tâches. Cette gestion est effectuée par le service d'ordonnancement du système d'exploitation et elle doit être en mesure de déterminer, lors de l'exécution de l'application, quelle tâche est exécutée logiciellement et/ou matériellement, quand (dimension temporelle) et sur quelles ressources (dimension spatiale, c'est à dire sur quel processeur ou quelle région du FPGA) pour atteindre la haute performance du système. Dans cette thèse, nous proposons des stratégies d'ordonnancement spatio-temporel pour les architectures 3DRSoCs. La première stratégie décide la nécessité de placer une tâche matérielle et une tâche logicielle en face-à-face afin que le coût de la communication entre tâches soit minimisé. La deuxième stratégie vise à minimiser le temps d'exécution globale de l'application. Cette stratégie exploits la présence de processeurs de la couche MPSoC afin d'anticiper, en temps-réel, l'exécution d'une tâche logicielle quand sa version matérielle ne peut pas être allouée sur le FPGA. Ensuite, un outil de simulation graphique a été développé pour vérifier le bon fonctionnement des stratégies développées et aussi nous permettre de produire des résultats. / Stacking a multiprocessor (MPSoC) layer and a FPGA layer to form a 3D Reconfigurable System-on- Chip (3DRSoC) is a promising solution giving a high flexibility level in adapting the architecture to the targeted application. For an application defined as a graph of parallel tasks running on the 3DRSoC system, one of the main challenges comes from the high-level management of tasks. This management is done by the scheduling service of the Operating System and it must be able to determine, on the fly, what task should be run in software and/or hardware, when (temporal dimension) and where (spatial dimension, i.e. on what processor or what area of the FPGA) in order to achieve high performance of the system. In this thesis, we propose online spatio-temporal scheduling strategies for 3DRSoCs. The first strategy decides, during the task scheduling, the need for a SW task and a HW task to communicate in face-to-face so that the communication cost between tasks is minimized. The second strategy aims at minimizing the overall execution time of the application. It exploits the presence of processors in the MPSoC layer in order to anticipate, at run-time, the SW execution of a task when its HW version cannot be allocated to the FPGA. Then, a graphical simulation tool has been developed to verify the proper functioning of the developed strategies and also enable us to produce results. Système d’exploitation Ordonnancement spatio-Temporel Fpga Reconfiguration dynamique et partielle Hardware/software codesign Spatio-Temporal scheduling Operating systems Field programmable gate arrays Hardware/software codesign
3	An Application-Specific Instruction Set for Accelerating Set-Oriented Database Primitives Arnold, Oliver, Haas, Sebastian, Fettweis, Gerhard, Schlegel, Benjamin, Kissinger, Thomas, Lehner, Wolfgang 13 June 2022 (has links) The key task of database systems is to efficiently manage large amounts of data. A high query throughput and a low query latency are essential for the success of a database system. Lately, research focused on exploiting hardware features like superscalar execution units, SIMD, or multiple cores to speed up processing. Apart from these software optimizations for given hardware, even tailor-made processing circuits running on FPGAs are built to run mostly stateless query plans with incredibly high throughput. A similar idea, which was already considered three decades ago, is to build tailor-made hardware like a database processor. Despite their superior performance, such application-specific processors were not considered to be beneficial because general-purpose processors eventually always caught up so that the high development costs did not pay off. In this paper, we show that the development of a database processor is much more feasible nowadays through the availability of customizable processors. We illustrate exemplarily how to create an instruction set extension for set-oriented database rimitives. The resulting application-specific processor provides not only a high performance but it also enables very energy-efficient processing. Our processor requires in various configurations more than 960x less energy than a high-end x86 processor while providing the same performance. info:eu-repo/classification/ddc/004 ddc:004
4	Towards Efficient Resource Allocation for Embedded Systems Hasler, Mattis 06 June 2023 (has links) Das Hauptthema ist die dynamische Ressourcenverwaltung in eingebetteten Systemen, insbesondere die Verwaltung von Rechenzeit und Netzwerkverkehr auf einem MPSoC. Die Idee besteht darin, eine Pipeline für die Verarbeitung von Mobiler Kommunikation auf dem Chip dynamisch zu schedulen, um die Effizienz der Hardwareressourcen zu verbessern, ohne den Ressourcenverbrauch des dynamischen Schedulings dramatisch zu erhöhen. Sowohl Software- als auch Hardwaremodule werden auf Hotspots im Ressourcenverbrauch untersucht und optimiert, um diese zu entfernen. Da Applikationen im Bereich der Signalverarbeitung normalerweise mit Hilfe von SDF-Diagrammen beschrieben werden können, wird deren dynamisches Scheduling optimiert, um den Ressourcenverbrauch gegenüber dem üblicherweise verwendeten statischen Scheduling zu verbessern. Es wird ein hybrider dynamischer Scheduler vorgestellt, der die Vorteile von Processing-Networks und der Planung von Task-Graphen kombiniert. Es ermöglicht dem Scheduler, ein Gleichgewicht zwischen der Parallelisierung der Berechnung und der Zunahme des dynamischen Scheduling-Aufands optimal abzuwägen. Der resultierende dynamisch erstellte Schedule reduziert den Ressourcenverbrauch um etwa 50%, wobei die Laufzeit im Vergleich zu einem statischen Schedule nur um 20% erhöht wird. Zusätzlich wird ein verteilter dynamischer SDF-Scheduler vorgeschlagen, der das Scheduling in verschiedene Teile zerlegt, die dann zu einer Pipeline verbunden werden, um mehrere parallele Prozessoren einzubeziehen. Jeder Scheduling-Teil wird zu einem Cluster mit Load-Balancing erweitert, um die Anzahl der parallel laufenden Scheduling-Jobs weiter zu erhöhen. Auf diese Weise wird dem vorhandene Engpass bei dem dynamischen Scheduling eines zentralisierten Schedulers entgegengewirkt, sodass 7x mehr Prozessoren mit dem Pipelined-Clustered-Dynamic-Scheduler für eine typische Signalverarbeitungsanwendung verwendet werden können. Das neue dynamische Scheduling-System setzt das Vorhandensein von drei verschiedenen Kommunikationsmodi zwischen den Verarbeitungskernen voraus. Bei der Emulation auf Basis des häufig verwendeten RDMA-Protokolls treten Leistungsprobleme auf. Sehr gut kann RDMA für einmalige Punkt-zu-Punkt-Datenübertragungen verwendet werden, wie sie bei der Ausführung von Task-Graphen verwendet werden. Process-Networks verwenden normalerweise Datenströme mit hohem Volumen und hoher Bandbreite. Es wird eine FIFO-basierte Kommunikationslösung vorgestellt, die einen zyklischen Puffer sowohl im Sender als auch im Empfänger implementiert, um diesen Bedarf zu decken. Die Pufferbehandlung und die Datenübertragung zwischen ihnen erfolgen ausschließlich in Hardware, um den Software-Overhead aus der Anwendung zu entfernen. Die Implementierung verbessert die Zugriffsverwaltung mehrerer Nutzer auf flächen-effiziente Single-Port Speichermodule. Es werden 0,8 der theoretisch möglichen Bandbreite, die normalerweise nur mit flächenmäßig teureren Dual-Port-Speichern erreicht wird. Der dritte Kommunikationsmodus definiert eine einfache Message-Passing-Implementierung, die ohne einen Verbindungszustand auskommt. Dieser Modus wird für eine effiziente prozessübergreifende Kommunikation des verteilten Scheduling-Systems und der engen Ansteuerung der restlichen Prozessoren benötigt. Eine Flusskontrolle in Hardware stellt sicher, dass eine große Anzahl von Sendern Nachrichten an denselben Empfänger senden kann. Dabei wird garantiert, dass alle Nachrichten korrekt empfangen werden, ohne dass eine Verbindung hergestellt werden muss und die Nachrichtenlaufzeit gering bleibt. Die Arbeit konzentriert sich auf die Optimierung des Codesigns von Hardware und Software, um die kompromisslose Ressourceneffizienz der dynamischen SDF-Graphen-Planung zu erhöhen. Besonderes Augenmerk wird auf die Abhängigkeiten zwischen den Ebenen eines verteilten Scheduling-Systems gelegt, das auf der Verfügbarkeit spezifischer hardwarebeschleunigter Kommunikationsmethoden beruht.:1 Introduction 1.1 Motivation 1.2 The Multiprocessor System on Chip Architecture 1.3 Concrete MPSoC Architecture 1.4 Representing LTE/5G baseband processing as Static Data Flow 1.5 Compuation Stack 1.6 Performance Hotspots Addressed 1.7 State of the Art 1.8 Overview of the Work 2 Hybrid SDF Execution 2.1 Addressed Performance Hotspot 2.2 State of the Art 2.3 Static Data Flow Graphs 2.4 Runtime Environment 2.5 Overhead of Deloying Tasks to a MPSoC 2.6 Interpretation of SDF Graphs as Task Graphs 2.7 Interpreting SDF Graphs as Process Networks 2.8 Hybrid Interpretation 2.9 Graph Topology Considerations 2.10 Theoretic Impact of Hybrid Interpretation 2.11 Simulating Hybrid Execution 2.12 Pipeline SDF Graph Example 2.13 Random SDF Graphs 2.14 LTE-like SDF Graph 2.15 Key Lernings 3 Distribution of Management 3.1 Addressed Performance Hotspot 3.2 State of the Art 3.3 Revising Deployment Overhead 3.4 Distribution of Overhead 3.5 Impact of Management Distribution to Resource Utilization 3.6 Reconfigurability 3.7 Key Lernings 4 Sliced FIFO Hardware 4.1 Addressed Performance Hotspot 4.2 State of the Art 4.3 System Environment 4.4 Sliced Windowed FIFO buffer 4.5 Single FIFO Evaluation 4.6 Multiple FIFO Evalutaion 4.7 Hardware Implementation 4.8 Key Lernings 5 Message Passing Hardware 5.1 Addressed Performance Hotspot 5.2 State of the Art 5.3 Message Passing Regarded as Queueing 5.4 A Remote Direct Memory Access Based Implementation 5.5 Hardware Implementation Concept 5.6 Evalutation of Performance 5.7 Key Lernings 6 Summary / The main topic is the dynamic resource allocation in embedded systems, especially the allocation of computing time and network traﬃc on an multi processor system on chip (MPSoC). The idea is to dynamically schedule a mobile communication signal processing pipeline on the chip to improve hardware resource eﬃciency while not dramatically improve resource consumption because of dynamic scheduling overhead. Both software and hardware modules are examined for resource consumption hotspots and optimized to remove them. Since signal processing can usually be described with the help of static data ﬂow (SDF) graphs, the dynamic handling of those is optimized to improve resource consumption over the commonly used static scheduling approach. A hybrid dynamic scheduler is presented that combines beneﬁts from both processing networks and task graph scheduling. It allows the scheduler to optimally balance parallelization of computation and addition of dynamic scheduling overhead. The resulting dynamically created schedule reduces resource consumption by about 50%, with a runtime increase of only 20% compared to a static schedule. Additionally, a distributed dynamic SDF scheduler is proposed that splits the scheduling into different parts, which are then connected to a scheduling pipeli ne to incorporate multiple parallel working processors. Each scheduling stage is reworked into a load-balanced cluster to increase the number of parallel scheduling jobs further. This way, the still existing dynamic scheduling bottleneck of a centralized scheduler is widened, allowing handling 7x more processors with the pipelined, clustered dynamic scheduler for a typical signal processing application. The presented dynamic scheduling system assumes the presence of three different communication modes between the processing cores. When emulated on top of the commonly used remote direct memory access (RDMA) protocol, performance issues are encountered. Firstly, RDMA can neatly be used for single-shot point-to-point data transfers, like used in task graph scheduling. Process networks usually make use of high-volume and high-bandwidth data streams. A ﬁrst in ﬁrst out (FIFO) communication solution is presented that implements a cyclic buffer on both sender and receiver to serve this need. The buffer handling and data transfer between them are done purely in hardware to remove software overhead from the application. The implementation improves the multi-user access to area-eﬃcient single port on-chip memory modules. It achieves 0.8 of the theoretically possible bandwidth, usually only achieved with area expensive dual-port memories. The third communication mode deﬁnes a lightweight message passing (MP) implementation that is truly connectionless. It is needed for eﬃcient inter-process communication of the distributed and clustered scheduling system and the worker processing units’ tight coupling. A hardware ﬂow control assures that an arbitrary number of senders can spontaneously start sending messages to the same receiver. Yet, all messages are guaranteed to be correctly received while eliminating the need for connection establishment and keeping a low message delay. The work focuses on the hardware-software codesign optimization to increase the uncompromised resource eﬃciency of dynamic SDF graph scheduling. Special attention is paid to the inter-level dependencies in developing a distributed scheduling system, which relies on the availability of speciﬁc hardwareaccelerated communication methods.:1 Introduction 1.1 Motivation 1.2 The Multiprocessor System on Chip Architecture 1.3 Concrete MPSoC Architecture 1.4 Representing LTE/5G baseband processing as Static Data Flow 1.5 Compuation Stack 1.6 Performance Hotspots Addressed 1.7 State of the Art 1.8 Overview of the Work 2 Hybrid SDF Execution 2.1 Addressed Performance Hotspot 2.2 State of the Art 2.3 Static Data Flow Graphs 2.4 Runtime Environment 2.5 Overhead of Deloying Tasks to a MPSoC 2.6 Interpretation of SDF Graphs as Task Graphs 2.7 Interpreting SDF Graphs as Process Networks 2.8 Hybrid Interpretation 2.9 Graph Topology Considerations 2.10 Theoretic Impact of Hybrid Interpretation 2.11 Simulating Hybrid Execution 2.12 Pipeline SDF Graph Example 2.13 Random SDF Graphs 2.14 LTE-like SDF Graph 2.15 Key Lernings 3 Distribution of Management 3.1 Addressed Performance Hotspot 3.2 State of the Art 3.3 Revising Deployment Overhead 3.4 Distribution of Overhead 3.5 Impact of Management Distribution to Resource Utilization 3.6 Reconfigurability 3.7 Key Lernings 4 Sliced FIFO Hardware 4.1 Addressed Performance Hotspot 4.2 State of the Art 4.3 System Environment 4.4 Sliced Windowed FIFO buffer 4.5 Single FIFO Evaluation 4.6 Multiple FIFO Evalutaion 4.7 Hardware Implementation 4.8 Key Lernings 5 Message Passing Hardware 5.1 Addressed Performance Hotspot 5.2 State of the Art 5.3 Message Passing Regarded as Queueing 5.4 A Remote Direct Memory Access Based Implementation 5.5 Hardware Implementation Concept 5.6 Evalutation of Performance 5.7 Key Lernings 6 Summary info:eu-repo/classification/ddc/621.3 ddc:621.3
5	Compiler Directed Codesign for FPGA-based Embedded Systems Hauff, Martin Anthony, marty@extendabilities.com.au January 2008 (has links) As embedded systems designers increasingly turn to programmable logic technologies in place of off-the-shelf microprocessors, there is a growing interest in the development of optimised custom processing cores that can be designed on a per-application basis. FPGAs blur the traditional distinction between hardware and software and offer the promise of application specific hardware acceleration. But realizing this in a general sense requires a significant departure from traditional embedded systems development flows. Whereas off-the-shelf processors have a fixed architecture, the same cannot be said of purpose-built FPGA-based processors. With this freedom comes the challenge of empirically determining the optimal boundary point between hardware and software. The fluidity of the hardware/software partition also poses an interesting challenge for compiler developers. This thesis presents a tool and methodology that addresses these codesign challenges in a new way. Described as 'compiler-directed codesign', it makes use of a suitably modified compiler to help direct the development of a custom processor core on a per-application basis. By exposing the compiler's internal representation of a compiled target program, visibility into those instructions, and hardware resources, that are most sought after by the compiler can be gained. This information is then used to inform further processor development and to determine the optimal partition between hardware and software. At each design iteration, the machine model is updated to reflect the available hardware resources, the compiler is rebuilt, and the target application is compiled once again. By including the compiler 'in-the-loop' of custom processor design, developers can accurately quantify the impact on performance caused by the addition or removal of specific hardware resources and iteratively converge on an optimal solution. Compiler Directed Codesign has advantages over existing codesign methodologies because it offers both a concrete point from which to begin the partitioning process as well as providing quantifiable and rapid feedback of the merits of different partitioning choices. When applied to an Adaptive PCM Encoder/Decoder case study, the Compiler Directed Codesign technique yielded a custom processor core that was between 36% and 73% smaller, consumed between 11% to 19% less memory, and performed up to 10X faster than comparable general-purpose FPGA-based processor cores. The conclusion of this work is that a suitably modified compiler can serve a valuable role in directing hardware/software partitioning on a per-application basis. Compiler Directed Codesign Hardware / Software Codesign Embedded Systems Codesign FPGA Custom Processor Design
6	Hardware accelerators for embedded fingerprint-based personal recognition systems Fons Lluís, Mariano 29 May 2012 (has links) Abstract The development of automatic biometrics-based personal recognition systems is a reality in the current technological age. Not only those operations demanding stringent security levels but also many daily use consumer applications request the existence of computational platforms in charge of recognizing the identity of one individual based on the analysis of his/her physiological and/or behavioural characteristics. The state of the art points out two main open problems in the implementation of such applications: on the one hand, the needed reliability improvement in terms of recognition accuracy, overall security and real-time performances; and on the other hand, the cost reduction of those physical platforms in charge of the processing. This work aims at finding the proper system architecture able to address those limitations of current personal recognition applications. Embedded system solutions based on hardware-software co-design techniques and programmable (and run-time reconfigurable) logic devices under FPGAs or SOPCs is proven to be an efficient alternative to those existing multiprocessor systems based on HPCs, GPUs or PC platforms in the development of that kind of high-performance applications at low cost / El desenvolupament de sistemes automàtics de reconeixement personal basats en tècniques biomètriques esdevé una realitat en l’era tecnològica actual. No només aquelles operacions que exigeixen un elevat nivell de seguretat sinó també moltes aplicacions quotidianes demanen l’existència de plataformes computacionals encarregades de reconèixer la identitat d’un individu a partir de l’anàlisi de les seves característiques fisiològiques i/o comportamentals. L’estat de l’art de la tècnica identifica dues limitacions importants en la implementació d’aquest tipus d’aplicacions: per una banda, és necessària la millora de la fiabilitat d’aquests sistemes en termes de precisió en el procés de reconeixement personal, seguretat i execució en temps real; i per altra banda, és necessari reduir notablement el cost dels sistemes electrònics encarregats del processat biomètric. Aquest treball té per objectiu la cerca de l’arquitectura adequada a nivell de sistema que permeti fer front a les limitacions de les aplicacions de reconeixement personal actuals. Es demostra que la proposta de sistemes empotrats basats en tècniques de codisseny hardware-software i dispositius lògics programables (i reconfigurables en temps d’execució) sobre FPGAs o SOPCs resulta ser una alternativa eficient en front d’aquells sistemes multiprocessadors existents basats en HPCs, GPUs o plataformes PC per al desenvolupament d’aquests tipus d’aplicacions que requereixen un alt nivell de prestacions a baix cost. / El desarrollo de sistemas automáticos de reconocimiento personal basados en técnicas biométricas se ha convertido en una realidad en la era tecnológica actual. No tan solo aquellas operaciones que requieren un alto nivel de seguridad sino también muchas otras aplicaciones cotidianas exigen la existencia de plataformas computacionales encargadas de verificar la identidad de un individuo a partir del análisis de sus características fisiológicas y/o comportamentales. El estado del arte de la técnica identifica dos limitaciones importantes en la implementación de este tipo de aplicaciones: por un lado, es necesario mejorar la fiabilidad que presentan estos sistemas en términos de precisión en el proceso de reconocimiento personal, seguridad y ejecución en tiempo real; y por otro lado, es necesario reducir notablemente el coste de los sistemas electrónicos encargados de dicho procesado biométrico. Este trabajo tiene por objetivo la búsqueda de aquella arquitectura adecuada a nivel de sistema que permita hacer frente a las limitaciones de los sistemas de reconocimiento personal actuales. Se demuestra que la propuesta basada en sistemas embebidos implementados mediante técnicas de codiseño hardware-software y dispositivos lógicos programables (y reconfigurables en tiempo de ejecución) sobre FPGAs o SOPCs resulta ser una alternativa eficiente frente a aquellos sistemas multiprocesador actuales basados en HPCs, GPUs o plataformas PC en el ámbito del desarrollo de aplicaciones que demandan un alto nivel de prestaciones a bajo coste Biometrics Embedded System hardware-software codesign Programmable Logic System-on-Chip FPGA 00 62 621.3
7	Kostenmodellierung mit SystemC/System-AMS Markert, Erik, Wang, Hailu, Herrmann, Göran, Heinkel, Ulrich 08 June 2007 (has links) (PDF) In diesem Beitrag wird eine Methode zur Beschreibung von Kostenfaktoren und deren Verknüpfung über Hierarchiegrenzen hinweg dargestellt. Sie eignet sich sowohl für rein digitale Systeme mit Softwareanteilen als auch für gemischt analog/digitale Systeme. Damit ist sie im Hardware-Software Codesign und im Analog-Digital Codesign zum Vergleich verschiedener Systemkompositionen anwendbar. Die Implementierung mit C++ ermöglicht neben einer Nutzung mit digitalem SystemC auch den Einsatz mit der analogen SystemC-Erweiterung SystemC-AMS und vereinfacht die Nutzung gegenüber einer vorhandenen VHDL-Implementierung. Als Anwendungsbeispiel fungieren Komponenten eines Systems zur Inertialnavigation. Analog/Digital-Design Hardware/Software- Codesign Inertialnavigation SystemC/System-AMS ddc:004 ddc:500 Mikrosystemtechnik Systementwurf
8	FPGA-based Experiment Platform for Hardware-Software Codesign and Hardware Emulation Nagaonkar, Yajuvendra 01 May 2006 (has links) (PDF) An FPGA-based experiment platform for hardware-software codesign experiments was developed. The proposed platform would be used by an engineer who can be affiliated with academia, research or industry for codesign experiments or hardware emulation. The platform utilizes a combination of a microcontroller and a FPGA device to enable sufficient flexibility in exploring the design space to implement codesign experiments. The FPGA device operation is integrated with that of the microcontroller to provide an overall embedded solution for codesign experimentations. It is anticipated that the platform will be used in academia for educating the students the concepts of computer architecture and microprocessor design. Future work suggested includes development of performance metrics of hardware and software solutions, and in the partitioning stage of the codesign flow. FPGA hardware software codesign computer architecture prototyping education Electrical and Computer Engineering
9	RESOURCE-AWARE OPTIMIZATION TECHNIQUES FOR MACHINE LEARNING INFERENCE ON HETEROGENEOUS EMBEDDED SYSTEMS Spantidi, Ourania 01 May 2023 (has links) (PDF) With the increasing adoption of Deep Neural Networks (DNNs) in modern applications, there has been a proliferation of computationally and power-hungry workloads, which has necessitated the use of embedded systems with more sophisticated, heterogeneous approaches to accommodate these requirements. One of the major solutions to tackle these challenges has been the development of domain-specific accelerators, which are highly optimized for the computationally intensive tasks associated with DNNs. These accelerators are designed to take advantage of the unique properties of DNNs, such as parallelism and data locality, to achieve high throughput and energy efficiency. Domain-specific accelerators have been shown to provide significant improvements in performance and energy efficiency compared to traditional general-purpose processors and are becoming increasingly popular in a range of applications such as computer vision and speech recognition. However, designing these architectures and managing their resources can be challenging, as it requires a deep understanding of the workload and the system's unique properties. Achieving a favorable balance between performance and power consumption is not always straightforward and requires careful design decisions to fully exploit the benefits of the underlying hardware. This dissertation aims to address these challenges by presenting solutions that enable low energy consumption without compromising performance for heterogeneous embedded systems. Specifically, this dissertation will focus on three topics: (i) the utilization of approximate computing concepts and approximate accelerators for energy-efficient DNN inference,(ii) the integration of formal properties in the systematic employment of approximate computing concepts, and (iii) resource management techniques on heterogeneous embedded systems.In summary, this dissertation provides a comprehensive study of solutions that can improve the energy efficiency of heterogeneous embedded systems, enabling them to perform computationally intensive tasks associated with modern applications that incorporate DNNs without compromising on performance. The results of this dissertation demonstrate the effectiveness of the proposed solutions and their potential for wide-ranging practical applications. Embedded Systems Formal Properties Hardware Accelerators Hardware-Software Codesign Machine Learning Neural Networks
10	Software Performance Estimation Techniques in a Co-Design Environment Subramanian, Sriram 02 September 2003 (has links) No description available. Computer Science software performance estimation VCC SPIM MiBench hardware-software-codesign

Search results