Global ETD Search

1	Real-Time Operating System Hardware Extension Core for System-on-Chip Designs Best, Joel 08 January 2013 (has links) This thesis presents a real-time operating system hardware extension core which supports the integration of hardware accelerators into real-time system-on-chip designs as hardware tasks. The hardware extension core utilizes reconfigurable logic to manage synchronization events, data transfers, and hardware task control. A reduction in interrupt latency, frequency, and execution time provides performance and predictability improvements for real-time applications. Required communication between the CPU and hardware accelerators is also reduced significantly. Compared to a software implementation, synthetic benchmarks of common synchronization tasks show up to a 41% increase in synchronization performance. Analysis of a test case design for audio encoding and encryption using three hardware accelerators shows results of a 2.89x throughput improvement in comparison to the use of software device driver tasks. Overall, this design simplifies the integration of hardware accelerators into real-time system-on-chip designs while improving the performance and predictability of these systems. Real-Time Systems System-on-Chip Hardware Accelerators Reconfigurable Logic
2	Optimisation des performances et de la complexité dans les architectures multiprocesseurs hétérogènes sur puce / Performance and complexity optimization in heterogeneous multiprocessors system on chip Dammak Masmoudi, Bouthaina 06 November 2015 (has links) Les travaux présentés dans cette thèse visent le développement d'une méthodologie capable d’estimer rapidement les performances d’une architecture MPSoC avec des instructions spécialisées. Pour ces architectures, l’outil proposé intègre une méthodologie de partage des accélérateurs hardwares pour les mêmes motifs de calcul. L’idée est donc de trouver dans les différentes applications parallèles exécutées par les différents processeurs des motifs de calcul communs. Ces motifs seront alors implantés sur le FPGA par un nombre réduit d’accélérateurs partagés entre les processeurs. Grâce à des modèles de programmation mixte, la méthodologie développée est capable de trouver, pour des performances exigés par le concepteur, le nombre optimal d’accélérateurs privés et/ou partagés pour les différents motifs. / No summary in english Architecture MPSoC Accélérateurs hardwares Fpga MPSoC Architecture Hardware accelerators Fpga
3	A Computational Approach to Custom Data Representation for Hardware Accelerators Kinsman, Adam 04 1900 (has links) <p> This thesis details the application of computational methods to the problem of determining custom data representations when building hardware accelerators for numerical computations. A majority of scientific applications which require hardware acceleration are implemented in IEEE-754 double precision. However, in many cases the error tolerance requirements of the application are much less than the accuracy which IEEE-754 double precision provides. By leveraging custom data representations, a more resource efficient hardware implementation arises thereby enabling greater parallelism and thus higher performance of the accelerator. </p> <p> The existing custom representation methods are unable to guarantee robust representations while at the same time adequately supporting ill-conditioned operators. Support for both of these scenarios is necessary for accelerating scientific calculations. To address this, we propose the use of a computational method based on Satisfiability-Modulo Theory (SMT). By capturing a calculation as a set of constraints, an SMT instance can be formulated which provides meaningful bounds even in the presence of ill-conditioned operators. At the same time, the analytical nature of SMT satisfies the need for robustness. Utilizing block vector arithmetic, our SMT approach is extended to provide scalability to large instances involving vector calculus which arise in scientific calculations. Atop this foundation, a unified error model is proposed which deals simultaneously with absolute and relative error, thereby providing the means of supporting both fixed-point and custom floating-point data types. Iterative algorithm analysis is leveraged to derive constraints for the SMT method. The application of the method to several scientific algorithms is discussed by way of case studies. </p> / Thesis / Doctor of Philosophy (PhD) custom data representation hardware accelerators computational method numerical computation
4	RESOURCE-AWARE OPTIMIZATION TECHNIQUES FOR MACHINE LEARNING INFERENCE ON HETEROGENEOUS EMBEDDED SYSTEMS Spantidi, Ourania 01 May 2023 (has links) (PDF) With the increasing adoption of Deep Neural Networks (DNNs) in modern applications, there has been a proliferation of computationally and power-hungry workloads, which has necessitated the use of embedded systems with more sophisticated, heterogeneous approaches to accommodate these requirements. One of the major solutions to tackle these challenges has been the development of domain-specific accelerators, which are highly optimized for the computationally intensive tasks associated with DNNs. These accelerators are designed to take advantage of the unique properties of DNNs, such as parallelism and data locality, to achieve high throughput and energy efficiency. Domain-specific accelerators have been shown to provide significant improvements in performance and energy efficiency compared to traditional general-purpose processors and are becoming increasingly popular in a range of applications such as computer vision and speech recognition. However, designing these architectures and managing their resources can be challenging, as it requires a deep understanding of the workload and the system's unique properties. Achieving a favorable balance between performance and power consumption is not always straightforward and requires careful design decisions to fully exploit the benefits of the underlying hardware. This dissertation aims to address these challenges by presenting solutions that enable low energy consumption without compromising performance for heterogeneous embedded systems. Specifically, this dissertation will focus on three topics: (i) the utilization of approximate computing concepts and approximate accelerators for energy-efficient DNN inference,(ii) the integration of formal properties in the systematic employment of approximate computing concepts, and (iii) resource management techniques on heterogeneous embedded systems.In summary, this dissertation provides a comprehensive study of solutions that can improve the energy efficiency of heterogeneous embedded systems, enabling them to perform computationally intensive tasks associated with modern applications that incorporate DNNs without compromising on performance. The results of this dissertation demonstrate the effectiveness of the proposed solutions and their potential for wide-ranging practical applications. Embedded Systems Formal Properties Hardware Accelerators Hardware-Software Codesign Machine Learning Neural Networks
5	Efficient Modeling for DNN Hardware Resiliency Assessment / EFFICIENT MODELING FOR DNN HARDWARE RESILIENCY ASSESSMENT Mahmoud, Karim January 2025 (has links) Deep neural network (DNN) hardware accelerators are critical enablers of the current resurgence in machine learning technologies. Adopting machine learning in safety-critical systems imposes additional reliability requirements on hardware design. Addressing these requirements mandates an accurate assessment of the impact caused by permanent faults in the processing engines (PE). Carrying out this reliability assessment early in the design process allows for addressing potential reliability concerns when it is less costly to perform design revisions. However, the large size of modern DNN hardware and the complexity of the DNN applications running on it present barriers to efficient reliability evaluation before proceeding with the design implementation. Considering these barriers, this dissertation proposes two methodologies to assess fault resiliency in integer arithmetic units in DNN hardware. Using the information from the data streaming patterns of the DNN accelerators, which are known before the register-transfer level (RTL) implementation, the first methodology enables fault injection experiments to be carried out in PE units at the pre-RTL stage during architectural design space exploration. This is achieved in a DNN simulation framework that captures the mapping between a model's operations and the hardware's arithmetic units. This facilitates a fault resiliency comparison of state-of-the-art DNN accelerators comprising thousands of PE units. The second methodology introduces accurate and efficient modelling of the impact of permanent faults in integer multipliers. It avoids the need for computationally intensive circuit models, e.g., netlists, to inject faults in integer arithmetic units, thus scaling the fault resiliency assessment to accelerators with thousands of PE units with negligible simulation time overhead. As a first step, we formally analyze the impact of permanent faults affecting the internal nodes of two integer multiplier architectures. This analysis indicates that, for most internal faults, the impact on the output is independent of the operands involved in the arithmetic operation. As the second step, we develop a statistical fault injection approach based on the likelihood of a fault being triggered in the applications that run on the target DNN hardware. By modelling the impact of faults in internal nodes of arithmetic units using fault-free operations, fault injection campaigns run three orders of magnitude faster than using arithmetic circuit models in the same simulation environment. The experiments also show that the proposed method's accuracy is on par with that of using netlists to model arithmetic circuitry in which faults are injected. Using the proposed methods, one can conduct fault assessment experiments for various DNN models and hardware architectures, examining the sensitivity of DNN model-related and hardware architecture-related features on the DNN accelerator's reliability. In addition to understanding the impact of permanent hardware faults on the accuracy of DNN models running on defective hardware, the outcomes of these experiments can yield valuable insights for designers seeking to balance fault criticality and performance, thereby facilitating the development of more reliable DNN hardware in the future. / Thesis / Doctor of Philosophy (PhD) / The reliability of Deep Neural Network (DNN) hardware has become critical in recent years, especially for the adoption of machine learning in safety-critical applications. Evaluating the reliability of DNN hardware early in the design process enables addressing potential reliability concerns before committing to full implementation. However, the large size and complexity of DNN hardware impose challenges in evaluating its reliability in an efficient manner. In this dissertation, two novel methodologies are proposed to address these challenges. The first methodology introduces an efficient method to describe the mapping of operations of DNN applications to the processing engines of a target DNN hardware architecture in a high-performance computing DNN simulation environment. This approach allows for assessing the fault resiliency of large hardware architectures, incorporating thousands of processing engines while using fewer simulation resources compared to existing methods. The second methodology introduces an accurate and efficient approach to modelling the impact of permanent faults in integer arithmetic units of DNN hardware during inference. By leveraging the special characteristics of integer arithmetic units, this method achieves fault assessment at negligible computational overhead relative to running DNN inference in the fault-free mode in state-of-the-art DNN frameworks.
6	Neural network computing using on-chip accelerators Eldridge, Schuyler 05 November 2016 (has links) The use of neural networks, machine learning, or artificial intelligence, in its broadest and most controversial sense, has been a tumultuous journey involving three distinct hype cycles and a history dating back to the 1960s. Resurgent, enthusiastic interest in machine learning and its applications bolsters the case for machine learning as a fundamental computational kernel. Furthermore, researchers have demonstrated that machine learning can be utilized as an auxiliary component of applications to enhance or enable new types of computation such as approximate computing or automatic parallelization. In our view, machine learning becomes not the underlying application, but a ubiquitous component of applications. This view necessitates a different approach towards the deployment of machine learning computation that spans not only hardware design of accelerator architectures, but also user and supervisor software to enable the safe, simultaneous use of machine learning accelerator resources. In this dissertation, we propose a multi-transaction model of neural network computation to meet the needs of future machine learning applications. We demonstrate that this model, encompassing a decoupled backend accelerator for inference and learning from hardware and software for managing neural network transactions can be achieved with low overhead and integrated with a modern RISC-V microprocessor. Our extensions span user and supervisor software and data structures and, coupled with our hardware, enable multiple transactions from different address spaces to execute simultaneously, yet safely. Together, our system demonstrates the utility of a multi-transaction model to increase energy efficiency improvements and improve overall accelerator throughput for machine learning applications. Computer engineering RISC-V Hardware accelerators Hardware/software co-design Multi-transaction computation Neural networks Simultaneous multithreading
7	Trace-based Performance Analysis for Hardware Accelerators / Leistungsanalyse hardwarebeschleunigter Anwendungen mittels Programmspuren Juckeland, Guido 14 February 2013 (has links) (PDF) This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness to power consumption of computing devices has led to an interest in hybrid computing architectures as well. High-end computers, workstations, and mobile devices start to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well. Yet, they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel technical approach is using program states in order to display similarities and differences between the potentially very large number of event streams and, thus, enables a fast way to spot load imbalances. The thesis concludes with the in-depth analysis of a case-study of PIConGPU---a highly parallel, multi-hybrid plasma physics simulation---that benefited greatly from the developed performance analysis methods. / Diese Dissertation zeigt, wie der Ablauf von Anwendungsteilen, die auf Hardwarebeschleuniger ausgelagert wurden, als Programmspur mit aufgezeichnet werden kann. Damit wird die bekannte Technik der Leistungsanalyse von Anwendungen mittels Programmspuren so erweitert, dass auch diese neue Parallelitätsebene mit erfasst wird. Die Beschränkungen von Computersystemen bezüglich der elektrischen Leistungsaufnahme hat zu einer steigenden Anzahl von hybriden Computerarchitekturen geführt. Sowohl Hochleistungsrechner, aber auch Arbeitsplatzcomputer und mobile Endgeräte nutzen heute Hardwarebeschleuniger um rechenintensive, parallele Programmteile auszulagern und so den skalaren Hauptprozessor zu entlasten und nur für nicht parallele Programmteile zu verwenden. Dieses Ausführungsschema ist typischerweise asynchron: der Skalarprozessor kann, während der Hardwarebeschleuniger rechnet, selbst weiterarbeiten. Die Leistungsanalyse-Werkzeuge der Hersteller von Hardwarebeschleunigern decken den Standardfall (ein Host-System mit einem Hardwarebeschleuniger) sehr gut ab, scheitern aber an einer Unterstützung von hochparallelen Rechnersystemen. Die vorliegende Dissertation untersucht, in wie weit auch multi-hybride Anwendungen die Aktivität von Hardwarebeschleunigern aufzeichnen können. Dazu wird die vorhandene Methode zur Erzeugung von Programmspuren für hochparallele Anwendungen entsprechend erweitert. In dieser Untersuchung wird zuerst eine allgemeine Methodik entwickelt, mit der sich für jede API-gestützte Hardwarebeschleunigung eine Programmspur erstellen lässt. Darauf aufbauend wird eine eigene Programmierschnittstelle entwickelt, die es ermöglicht weitere leistungsrelevante Daten aufzuzeichnen. Die Umsetzung dieser Schnittstelle wird am Beispiel von NVIDIA CUPTI darstellt. Ein weiterer Teil der Arbeit beschäftigt sich mit der Darstellung von Programmspuren, welche Aufzeichnungen von den unterschiedlichen Parallelitätsebenen enthalten. Um die Einschränkungen klassischer Leistungsprofile oder Zeitachsendarstellungen zu überwinden, wird mit den parallelen Programmablaufgraphen (PPFGs) eine neue graphenbasisierte Darstellungsform eingeführt. Dieser neuartige Ansatz zeigt eine Programmspur als eine Folge von Programmzuständen mit gemeinsamen und unterchiedlichen Abläufen. So können divergierendes Programmverhalten und Lastimbalancen deutlich einfacher lokalisiert werden. Die Arbeit schließt mit der detaillierten Analyse von PIConGPU -- einer multi-hybriden Simulation aus der Plasmaphysik --, die in großem Maße von den in dieser Arbeit entwickelten Analysemöglichkeiten profiert hat. Leistungsanalyse Hardwarebeschleuniger GPUs CUDA OpenCL Tracing Particle-in-Cell Performance Analysis Hardware accelerators GPUs CUDA OpenCL Tracing Particle-in-Cell ddc:004 rvk:ST 150
8	Hardware assisted memory checkpointing and applications in debugging and reliability Doudalis, Ioannis 25 July 2011 (has links) The problems of software debugging and system reliability/availability are among the most challenging problems the computing industry is facing today, with direct impact on the development and operating costs of computing systems. A promising debugging technique that assists programmers identify and fix the causes of software bugs a lot more efficiently is bidirectional debugging, which enables the user to execute the program in "reverse", and a typical method used to recover a system after a fault is backwards error recovery, which restores the system to the last error-free state. Both reverse execution and backwards error recovery are enabled by creating memory checkpoints, which are used to restore the program/system to a prior point in time and re-execute until the point of interest. The checkpointing frequency is the primary factor that affects both the latency of reverse execution and the recovery time of the system; more frequent checkpoints reduce the necessary re-execution time. Frequent creation of checkpoints poses performance challenges, because of the increased number of memory reads and writes necessary for copying the modified system/program memory, and also because of software interventions, additional synchronization and I/O, etc., needed for creating a checkpoint. In this thesis I examine a number of different hardware accelerators, whose role is to create frequent memory checkpoints in the background, at minimal performance overheads. For the purpose of reverse execution, I propose the HARE and Euripus hardware checkpoint accelerators. HARE and Euripus create different types of checkpoints, and employ different methods for keeping track of the modified memory. As a result, HARE and Euripus have different hardware costs and provide different functionality which directly affects the latency of reverse execution. For improving the availability of the system, I propose the Kyma hardware accelerator. Kyma enables simultaneous creation of checkpoints at different frequencies, which allows the system to recover from multiple types of errors and tolerate variable error-detection latencies. The Kyma and Euripus hardware engines have similar architectures, but the functionality of the Kyma engine is optimized for further reducing the performance overheads and improving the reliability of the system. The functionality of the Kyma and Euripus engines can be combined into a unified accelerator that can serve the needs of both bidirectional debugging and system recovery. Memory checkpointing Hardware accelerators Computer systems reliability Computer architecture Debugging Debugging in computer science Memory management (Computer science) Software maintenance
9	Trace-based Performance Analysis for Hardware Accelerators Juckeland, Guido 05 February 2013 (has links) This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness to power consumption of computing devices has led to an interest in hybrid computing architectures as well. High-end computers, workstations, and mobile devices start to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well. Yet, they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel technical approach is using program states in order to display similarities and differences between the potentially very large number of event streams and, thus, enables a fast way to spot load imbalances. The thesis concludes with the in-depth analysis of a case-study of PIConGPU---a highly parallel, multi-hybrid plasma physics simulation---that benefited greatly from the developed performance analysis methods. / Diese Dissertation zeigt, wie der Ablauf von Anwendungsteilen, die auf Hardwarebeschleuniger ausgelagert wurden, als Programmspur mit aufgezeichnet werden kann. Damit wird die bekannte Technik der Leistungsanalyse von Anwendungen mittels Programmspuren so erweitert, dass auch diese neue Parallelitätsebene mit erfasst wird. Die Beschränkungen von Computersystemen bezüglich der elektrischen Leistungsaufnahme hat zu einer steigenden Anzahl von hybriden Computerarchitekturen geführt. Sowohl Hochleistungsrechner, aber auch Arbeitsplatzcomputer und mobile Endgeräte nutzen heute Hardwarebeschleuniger um rechenintensive, parallele Programmteile auszulagern und so den skalaren Hauptprozessor zu entlasten und nur für nicht parallele Programmteile zu verwenden. Dieses Ausführungsschema ist typischerweise asynchron: der Skalarprozessor kann, während der Hardwarebeschleuniger rechnet, selbst weiterarbeiten. Die Leistungsanalyse-Werkzeuge der Hersteller von Hardwarebeschleunigern decken den Standardfall (ein Host-System mit einem Hardwarebeschleuniger) sehr gut ab, scheitern aber an einer Unterstützung von hochparallelen Rechnersystemen. Die vorliegende Dissertation untersucht, in wie weit auch multi-hybride Anwendungen die Aktivität von Hardwarebeschleunigern aufzeichnen können. Dazu wird die vorhandene Methode zur Erzeugung von Programmspuren für hochparallele Anwendungen entsprechend erweitert. In dieser Untersuchung wird zuerst eine allgemeine Methodik entwickelt, mit der sich für jede API-gestützte Hardwarebeschleunigung eine Programmspur erstellen lässt. Darauf aufbauend wird eine eigene Programmierschnittstelle entwickelt, die es ermöglicht weitere leistungsrelevante Daten aufzuzeichnen. Die Umsetzung dieser Schnittstelle wird am Beispiel von NVIDIA CUPTI darstellt. Ein weiterer Teil der Arbeit beschäftigt sich mit der Darstellung von Programmspuren, welche Aufzeichnungen von den unterschiedlichen Parallelitätsebenen enthalten. Um die Einschränkungen klassischer Leistungsprofile oder Zeitachsendarstellungen zu überwinden, wird mit den parallelen Programmablaufgraphen (PPFGs) eine neue graphenbasisierte Darstellungsform eingeführt. Dieser neuartige Ansatz zeigt eine Programmspur als eine Folge von Programmzuständen mit gemeinsamen und unterchiedlichen Abläufen. So können divergierendes Programmverhalten und Lastimbalancen deutlich einfacher lokalisiert werden. Die Arbeit schließt mit der detaillierten Analyse von PIConGPU -- einer multi-hybriden Simulation aus der Plasmaphysik --, die in großem Maße von den in dieser Arbeit entwickelten Analysemöglichkeiten profiert hat. info:eu-repo/classification/ddc/004 ddc:004
10	Photonic Deep Neural Network Accelerators for Scaling to the Next Generation of High-Performance Processing Shiflett, Kyle D. January 2022 (has links) No description available. Electrical Engineering Computer Engineering Computer Science computer architecture hardware accelerators networks-on-chip machine learning neural networks silicon photonics emerging technology analog computing mixed-signed computing

Search results