  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Real-Time Operating System Hardware Extension Core for System-on-Chip Designs

Best, Joel 08 January 2013
This thesis presents a real-time operating system hardware extension core which supports the integration of hardware accelerators into real-time system-on-chip designs as hardware tasks. The hardware extension core utilizes reconfigurable logic to manage synchronization events, data transfers, and hardware task control. A reduction in interrupt latency, frequency, and execution time provides performance and predictability improvements for real-time applications. Required communication between the CPU and hardware accelerators is also reduced significantly. Compared to a software implementation, synthetic benchmarks of common synchronization tasks show up to a 41% increase in synchronization performance. Analysis of a test case design for audio encoding and encryption using three hardware accelerators shows results of a 2.89x throughput improvement in comparison to the use of software device driver tasks. Overall, this design simplifies the integration of hardware accelerators into real-time system-on-chip designs while improving the performance and predictability of these systems.
2

Optimisation des performances et de la complexité dans les architectures multiprocesseurs hétérogènes sur puce / Performance and complexity optimization in heterogeneous multiprocessors system on chip

Dammak Masmoudi, Bouthaina 06 November 2015
The work presented in this thesis aims to develop a methodology for quickly estimating the performance of an MPSoC architecture with specialized instructions. For such architectures, the proposed tool integrates a methodology for sharing hardware accelerators among identical computation patterns. The idea is to find computation patterns that are common to the different parallel applications executed by the different processors. These patterns are then implemented on the FPGA by a reduced number of accelerators shared among the processors. Using mixed programming models, the methodology finds, for the performance required by the designer, the optimal number of private and/or shared accelerators for the different patterns.
3

A Computational Approach to Custom Data Representation for Hardware Accelerators

Kinsman, Adam 04 1900
This thesis details the application of computational methods to the problem of determining custom data representations when building hardware accelerators for numerical computations. A majority of scientific applications which require hardware acceleration are implemented in IEEE-754 double precision. However, in many cases the error tolerance requirements of the application are much less than the accuracy which IEEE-754 double precision provides. By leveraging custom data representations, a more resource efficient hardware implementation arises thereby enabling greater parallelism and thus higher performance of the accelerator.
The existing custom representation methods are unable to guarantee robust representations while at the same time adequately supporting ill-conditioned operators. Support for both of these scenarios is necessary for accelerating scientific calculations. To address this, we propose the use of a computational method based on Satisfiability-Modulo Theory (SMT). By capturing a calculation as a set of constraints, an SMT instance can be formulated which provides meaningful bounds even in the presence of ill-conditioned operators. At the same time, the analytical nature of SMT satisfies the need for robustness. Utilizing block vector arithmetic, our SMT approach is extended to provide scalability to large instances involving vector calculus which arise in scientific calculations. Atop this foundation, a unified error model is proposed which deals simultaneously with absolute and relative error, thereby providing the means of supporting both fixed-point and custom floating-point data types. Iterative algorithm analysis is leveraged to derive constraints for the SMT method. The application of the method to several scientific algorithms is discussed by way of case studies. / Thesis / Doctor of Philosophy (PhD)
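To make the abstract's point about error tolerance concrete, here is a minimal sketch of how an absolute error tolerance translates into a fixed-point fraction width. It is not the SMT-based method the thesis proposes; the function names `fractional_bits_for` and `quantize` are hypothetical, and the only assumption is round-to-nearest quantization, whose worst-case error on a grid of step 2^-f is 2^-(f+1).

```python
import math


def fractional_bits_for(abs_tol: float) -> int:
    """Smallest fraction width f such that round-to-nearest error <= abs_tol.

    Assumption (illustrative, not from the thesis): rounding to a grid of
    step 2**-f has a worst-case absolute error of 2**-(f + 1).
    """
    return max(0, math.ceil(-math.log2(abs_tol) - 1))


def quantize(x: float, frac_bits: int) -> float:
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


if __name__ == "__main__":
    tol = 1e-3
    f = fractional_bits_for(tol)
    q = quantize(math.pi, f)
    print(f"fraction bits: {f}, quantized pi: {q}, error: {abs(q - math.pi):.6f}")
```

A tolerance of 1e-3 needs far fewer fraction bits than the 52 explicit fraction bits of an IEEE-754 double, which is the kind of gap a custom representation can exploit.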
4

RESOURCE-AWARE OPTIMIZATION TECHNIQUES FOR MACHINE LEARNING INFERENCE ON HETEROGENEOUS EMBEDDED SYSTEMS

Spantidi, Ourania 01 May 2023
With the increasing adoption of Deep Neural Networks (DNNs) in modern applications, there has been a proliferation of computationally and power-hungry workloads, which has necessitated the use of embedded systems with more sophisticated, heterogeneous approaches to accommodate these requirements. One of the major solutions to tackle these challenges has been the development of domain-specific accelerators, which are highly optimized for the computationally intensive tasks associated with DNNs. These accelerators are designed to take advantage of the unique properties of DNNs, such as parallelism and data locality, to achieve high throughput and energy efficiency. Domain-specific accelerators have been shown to provide significant improvements in performance and energy efficiency compared to traditional general-purpose processors and are becoming increasingly popular in a range of applications such as computer vision and speech recognition. However, designing these architectures and managing their resources can be challenging, as it requires a deep understanding of the workload and the system's unique properties. Achieving a favorable balance between performance and power consumption is not always straightforward and requires careful design decisions to fully exploit the benefits of the underlying hardware. This dissertation aims to address these challenges by presenting solutions that enable low energy consumption without compromising performance for heterogeneous embedded systems. 
Specifically, this dissertation will focus on three topics: (i) the utilization of approximate computing concepts and approximate accelerators for energy-efficient DNN inference, (ii) the integration of formal properties in the systematic employment of approximate computing concepts, and (iii) resource management techniques on heterogeneous embedded systems. In summary, this dissertation provides a comprehensive study of solutions that can improve the energy efficiency of heterogeneous embedded systems, enabling them to perform computationally intensive tasks associated with modern applications that incorporate DNNs without compromising on performance. The results of this dissertation demonstrate the effectiveness of the proposed solutions and their potential for wide-ranging practical applications.
5

Neural network computing using on-chip accelerators

Eldridge, Schuyler 05 November 2016
The use of neural networks, machine learning, or artificial intelligence, in its broadest and most controversial sense, has been a tumultuous journey involving three distinct hype cycles and a history dating back to the 1960s. Resurgent, enthusiastic interest in machine learning and its applications bolsters the case for machine learning as a fundamental computational kernel. Furthermore, researchers have demonstrated that machine learning can be utilized as an auxiliary component of applications to enhance or enable new types of computation such as approximate computing or automatic parallelization. In our view, machine learning becomes not the underlying application, but a ubiquitous component of applications. This view necessitates a different approach towards the deployment of machine learning computation that spans not only hardware design of accelerator architectures, but also user and supervisor software to enable the safe, simultaneous use of machine learning accelerator resources. In this dissertation, we propose a multi-transaction model of neural network computation to meet the needs of future machine learning applications. We demonstrate that this model, in which a backend accelerator for inference and learning is decoupled from the hardware and software that manage neural network transactions, can be achieved with low overhead and integrated with a modern RISC-V microprocessor. Our extensions span user and supervisor software and data structures and, coupled with our hardware, enable multiple transactions from different address spaces to execute simultaneously, yet safely. Together, our system demonstrates the utility of a multi-transaction model for improving energy efficiency and overall accelerator throughput for machine learning applications.
6

Trace-based Performance Analysis for Hardware Accelerators / Leistungsanalyse hardwarebeschleunigter Anwendungen mittels Programmspuren

Juckeland, Guido 14 February 2013
This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness of the power consumption of computing devices has led to an interest in hybrid computing architectures as well. High-end computers, workstations, and mobile devices have started to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well. Yet, they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API-based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step, the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel technical approach uses program states to display similarities and differences between the potentially very large number of event streams and thus enables a fast way to spot load imbalances.
The thesis concludes with the in-depth analysis of a case study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation that benefited greatly from the developed performance analysis methods.
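As a rough illustration of what an event log spanning host and accelerator streams can look like, the sketch below records enter/leave events per stream and sums the busy time of each stream, which is one simple way to spot a load imbalance. The event tuple layout, the stream names, and the function `busy_time_per_stream` are assumptions made for this example; they are not the trace format or the performance API described in the thesis, and no CUPTI calls are used.

```python
from collections import defaultdict


def busy_time_per_stream(events):
    """Sum the time each stream spends between matching enter/leave events.

    `events` is an iterable of (stream, kind, timestamp) tuples, where kind
    is "enter" or "leave". This layout is hypothetical and chosen only for
    illustration; nesting of regions is not handled.
    """
    open_at = {}
    busy = defaultdict(float)
    for stream, kind, ts in sorted(events, key=lambda e: e[2]):
        if kind == "enter":
            open_at[stream] = ts
        elif kind == "leave" and stream in open_at:
            busy[stream] += ts - open_at.pop(stream)
    return dict(busy)


if __name__ == "__main__":
    # The host launches work asynchronously and continues while the device runs.
    log = [
        ("host", "enter", 0.0),
        ("gpu0", "enter", 1.0),
        ("host", "leave", 2.0),
        ("gpu0", "leave", 5.0),
    ]
    print(busy_time_per_stream(log))
```

Because the host and device intervals overlap in time, a single merged timeline is needed to see that the accelerator, not the host, dominates the run in this toy log.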
7

Hardware assisted memory checkpointing and applications in debugging and reliability

Doudalis, Ioannis 25 July 2011
Software debugging and system reliability/availability are among the most challenging problems the computing industry faces today, with direct impact on the development and operating costs of computing systems. A promising debugging technique that helps programmers identify and fix the causes of software bugs much more efficiently is bidirectional debugging, which enables the user to execute the program in "reverse". A typical method used to recover a system after a fault is backwards error recovery, which restores the system to the last error-free state. Both reverse execution and backwards error recovery are enabled by creating memory checkpoints, which are used to restore the program/system to a prior point in time and re-execute until the point of interest. The checkpointing frequency is the primary factor that affects both the latency of reverse execution and the recovery time of the system; more frequent checkpoints reduce the necessary re-execution time. Frequent creation of checkpoints poses performance challenges, because of the increased number of memory reads and writes necessary for copying the modified system/program memory, and also because of the software interventions, additional synchronization, and I/O needed for creating a checkpoint. In this thesis I examine a number of different hardware accelerators, whose role is to create frequent memory checkpoints in the background, at minimal performance overheads. For the purpose of reverse execution, I propose the HARE and Euripus hardware checkpoint accelerators. HARE and Euripus create different types of checkpoints, and employ different methods for keeping track of the modified memory. As a result, HARE and Euripus have different hardware costs and provide different functionality which directly affects the latency of reverse execution. For improving the availability of the system, I propose the Kyma hardware accelerator.
Kyma enables simultaneous creation of checkpoints at different frequencies, which allows the system to recover from multiple types of errors and tolerate variable error-detection latencies. The Kyma and Euripus hardware engines have similar architectures, but the functionality of the Kyma engine is optimized for further reducing the performance overheads and improving the reliability of the system. The functionality of the Kyma and Euripus engines can be combined into a unified accelerator that can serve the needs of both bidirectional debugging and system recovery.
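The relationship between checkpointing frequency and re-execution time can be sketched with a toy model. It assumes checkpoints are taken at fixed intervals counted in executed steps and that returning to a target point means restoring the latest checkpoint at or before it and re-executing forward. The function names are hypothetical, and the model says nothing about how HARE, Euripus, or Kyma track modified memory.

```python
def reexecution_steps(target: int, interval: int) -> int:
    """Steps to re-execute after restoring the latest checkpoint <= target.

    Toy model (an assumption for illustration): a checkpoint exists at every
    multiple of `interval` steps.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    last_checkpoint = (target // interval) * interval
    return target - last_checkpoint


def worst_case_reexecution(interval: int) -> int:
    """Largest possible re-execution distance for a given interval."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return interval - 1


if __name__ == "__main__":
    for interval in (1000, 100, 10):
        print(interval, reexecution_steps(1999, interval), worst_case_reexecution(interval))
```

In this model the worst-case re-execution distance shrinks linearly with the interval, while the number of checkpoints to create grows inversely, which is the cost the proposed hardware accelerators aim to hide.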
8

Trace-based Performance Analysis for Hardware Accelerators

Juckeland, Guido 05 February 2013
This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness of the power consumption of computing devices has led to an interest in hybrid computing architectures as well. High-end computers, workstations, and mobile devices have started to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well. Yet, they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API-based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step, the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel technical approach uses program states to display similarities and differences between the potentially very large number of event streams and thus enables a fast way to spot load imbalances.
The thesis concludes with the in-depth analysis of a case study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation that benefited greatly from the developed performance analysis methods.
9

Photonic Deep Neural Network Accelerators for Scaling to the Next Generation of High-Performance Processing

Shiflett, Kyle D. January 2022
No description available.
10

Implementation trade-offs for FPGA accelerators / Compromis pour l'implémentation d'accélérateurs sur FPGA

Deest, Gaël 14 December 2017
Hardware acceleration is the use of custom hardware architectures to perform some computations faster or more efficiently than on general-purpose hardware. Accelerators have traditionally been used mostly in resource-constrained environments, such as embedded systems, where resource-efficiency was paramount. Over the last fifteen years, with the end of empirical scaling laws, they also made their way to datacenters and High-Performance Computing environments. FPGAs constitute a convenient implementation platform for such accelerators, allowing subtle, application-specific trade-offs between all performance metrics (throughput/latency, area, energy, accuracy, etc.). However, identifying good trade-offs is a challenging task, as the design space is usually extremely large. This thesis proposes design methodologies to address this problem.
First, we focus on performance-accuracy trade-offs in the context of floating-point to fixed-point conversion. Using fixed-point arithmetic instead of floating-point is an effective way to reduce hardware resource usage, but comes at a price in numerical accuracy. The validity of a fixed-point implementation can be assessed either with numerical simulations or with analytical models derived from the algorithm. Compared to simulation-based methods, analytical approaches enable more exhaustive design space exploration and can thus increase the quality of the final architecture. However, they are currently only applicable to limited sets of algorithms. In the first part of this thesis, we extend such techniques to multi-dimensional linear filters, such as image processing kernels. Our technique is implemented as a source-level analysis using techniques from the polyhedral compilation toolset, and validated against simulations with real-world input. In the second part of this thesis, we focus on iterative stencil computations, a naturally-arising pattern found in many scientific and embedded applications. Because of this diversity, there is no single best architecture for stencils: each algorithm has unique computational features (update formula, dependences) and each application has different performance constraints/requirements. To address this problem, we propose a family of hardware accelerators for stencils, featuring carefully-chosen design knobs, along with simple performance models to drive the exploration. Our architecture is implemented as an HLS-optimized code generation flow, and performance is measured with actual execution on the board. We show that these models can be used to identify the most interesting design points for each use case.
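For readers unfamiliar with the term, an iterative stencil repeatedly updates every point of a grid from a fixed neighbourhood of the previous iteration's values. The sketch below is a software illustration of that pattern only, using a 3-point averaging update on a 1D grid with fixed boundaries. The update formula and the function name `jacobi_1d` are assumptions chosen for the example; it does not represent the accelerator architecture or the HLS code generation flow from the thesis.

```python
def jacobi_1d(values, steps):
    """Apply `steps` iterations of a 3-point averaging stencil.

    Each interior point becomes the mean of itself and its two neighbours
    from the previous iteration; the two boundary points stay fixed.
    """
    cur = list(values)
    for _ in range(steps):
        nxt = cur[:]  # read from cur, write to nxt: no in-place dependences
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


if __name__ == "__main__":
    print(jacobi_1d([0, 0, 3, 0, 0], 1))
    print(jacobi_1d([0, 0, 3, 0, 0], 2))
```

The update formula (here a 3-point mean) and the dependences between iterations are the computational features the abstract refers to when it says each stencil algorithm is unique.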
