121

Parallel reconfigurable hardware architectures for video processing applications

Ali, Karim Mohamed Abedallah 08 February 2018
Embedded video applications are now involved in sophisticated transportation systems such as autonomous vehicles. Designers of these applications face many challenges, among them: developing, verifying, and testing complex algorithms under restricted time-to-market constraints; the need for design-automation tools to increase productivity; the high computing rates required to exploit the inherent parallelism and satisfy real-time constraints; and reducing power consumption to extend the operating duration before recharging the vehicle. In this thesis work, we used FPGA technologies to tackle some of these challenges and design parallel reconfigurable hardware architectures for embedded video streaming applications. First, we implemented a flexible parallel architecture with two main contributions: (1) We proposed a generic model for pixel distribution/collection to tackle the problem of high-volume data transfer through the system. The required model parameters are first defined, then the architecture generation is automated to minimize development time. (2) We applied frequency scaling as a technique for reducing power consumption, deriving the equations for calculating the maximum level of parallelism as well as those for sizing the FIFOs inserted for clock domain crossing. As the number of logic cells on a single FPGA chip increases, moving to higher abstraction levels becomes inevitable to shorten time-to-market and increase design productivity. During the design phase, the space of architectural alternatives is large, and the alternatives differ in hardware utilization, power consumption, and performance. To tackle this, we developed the ViPar tool with two main contributions: (1) An empirical model was introduced to estimate power consumption based on hardware utilization (Slice and BRAM) and operating frequency; in addition, we derived equations for estimating the hardware resources and execution time of each design point during design space exploration. (2) Given the main characteristics of the parallel architecture, such as the level of parallelism, the number of input/output ports, and the pixel distribution pattern, ViPar automatically generates the parallel hardware architecture for the selected designs. In the context of an industrial collaboration with NAVYA, we used ViPar together with high-level synthesis tools to implement a parallel hardware architecture for the Multi-window Sum of Absolute Difference stereo matching algorithm. In this implementation, we presented a set of guiding steps to adapt the high-level description code efficiently for hardware implementation, and we explored the design space across alternatives differing in hardware resources, performance, frequency, and power consumption. During the thesis work, our designs were implemented and tested experimentally on the Xilinx Zynq ZC706 (XC7Z045-FFG900) evaluation board.
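To make the ViPar-style estimation concrete, here is a minimal Python sketch of a linear power model fitted to measured design points, plus a first-order clock-domain-crossing FIFO sizing rule. All numbers are hypothetical, and both formulas are generic stand-ins for, not reproductions of, the equations derived in the thesis.

```python
import math
import numpy as np

# Hypothetical measurements: (slices, BRAMs, frequency in MHz) per design point
# and measured power in mW. Real values would come from board measurements.
X = np.array([[1200, 4, 100],
              [2400, 8, 100],
              [2400, 8, 150],
              [4800, 16, 150]], dtype=float)
y = np.array([310.0, 455.0, 590.0, 905.0])

# Fit power ~ a*slices + b*brams + c*freq + d by least squares.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def estimate_power_mw(slices, brams, freq_mhz):
    """Estimate the power (mW) of a candidate design point."""
    return float(coef @ [slices, brams, freq_mhz, 1.0])

def cdc_fifo_depth(burst_len, f_write_mhz, f_read_mhz):
    """First-order sizing rule for a clock-domain-crossing FIFO when the
    writer is faster than the reader; not the thesis's exact derivation."""
    return math.ceil(burst_len * (f_write_mhz - f_read_mhz) / f_write_mhz)

print(estimate_power_mw(3600, 12, 125))
print(cdc_fifo_depth(burst_len=1024, f_write_mhz=150, f_read_mhz=100))
```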
122

A Design Kit for Mobile Device-Based Interaction Techniques

Korzetz, Mandy, Kühn, Romina, Aßmann, Uwe, Schlegel, Thomas 23 July 2021
Besides designing the graphical interface of mobile applications, mobile phones and their built-in sensors enable various possibilities to engage with digital content in a physical, device-based manner that moves beyond the screen. So-called mobile device-based interactions are characterized by device movements and positions as well as user actions in real space. So far, there is little guidance available for novice designers and developers to ideate and design new solutions for specific individual or collaborative use cases. Hence, the potential for designing mobile device-based interactions is seldom fully exploited. To address this issue, we propose a design kit for mobile device-based interaction techniques following a morphological approach. Overall, the kit comprises seven dimensions with several elements each; these can be easily combined with each other to form an interaction technique by selecting at least one element of each dimension. The design kit can be used to support designers in exploring novel mobile interaction techniques for specific interaction problems in the ideation phase of the design process, but also in the analysis of existing device-based interaction solutions.
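The morphological combination described in the abstract, picking at least one element from each dimension, can be sketched as a simple enumeration. The dimension names and elements below are illustrative placeholders, not the kit's actual seven dimensions.

```python
from itertools import product

# Hypothetical dimensions and elements; the abstract does not list the kit's
# actual seven dimensions, so these names are invented for illustration.
dimensions = {
    "movement": ["tilt", "shake", "flip"],
    "position": ["in-hand", "on-table"],
    "feedback": ["vibration", "sound", "none"],
}

# Enumerate the base case of one element per dimension; the kit also allows
# selecting more than one element per dimension (subsets instead of singletons).
techniques = [dict(zip(dimensions, combo))
              for combo in product(*dimensions.values())]
print(len(techniques))   # 18 single-choice combinations for this toy kit
print(techniques[0])     # e.g. {'movement': 'tilt', 'position': 'in-hand', ...}
```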
123

Design Space Exploration and Architecture Design for Inference and Training Deep Neural Networks

Qi, Yangjie January 2021
No description available.
124

Design space exploration for co-mapping of periodic and streaming applications in a shared platform

Yuhan, Zhang January 2023
As embedded systems advance, the complexity and multifaceted requirements of products have increased significantly. One trend in this domain is the combination of different types of application models with multiprocessors as the platform. However, existing design space exploration techniques are often limited to one particular model, and combining diverse application models may cause compatibility issues. Additionally, embedded system design inherently involves multiple objectives: beyond the essential functionality, metrics such as power consumption, resource utilization, cost, and safety must be considered. These diverse metrics result in a vast design space, so effective design space exploration plays a crucial role. This thesis addresses these challenges by proposing a co-mapping approach for two distinct models: the periodically activated tasks model for real-time applications and the synchronous dataflow model for digital signal processing. Our primary goal is to co-map these two kinds of models onto a multi-core platform and explore trade-offs between the solutions, using the number of used resources and the throughput of the synchronous dataflow model as performance metrics. We adopt a combination method in which periodic tasks are given precedence to ensure their deadlines are met; the remaining processor resources are then allocated to the synchronous dataflow model. Both the execution of the periodic tasks and the synchronous dataflow model are managed by a scheduler, which prevents resource contention and optimizes the utilization of available processor resources. To balance the different metrics, we apply Pareto optimization as a guiding principle. This thesis uses the IDeSyDe tool, an extension of the ForSyDe group's current design space exploration tool, following the Design Space Identification methodology; the implementation is based on Scala and Python, running on the Java virtual machine. The experimental results confirm the successful mapping and scheduling of the periodically activated tasks model and the synchronous dataflow model onto the shared multiprocessor platform. IDeSyDe finds the Pareto-optimal solutions, maximizing the throughput of the synchronous dataflow model while minimizing resource consumption. This thesis offers valuable insight into running different application models on a shared platform, particularly for developers interested in using IDeSyDe. Due to time constraints, however, our test case may not fully demonstrate the scalability of the method, and additional tests could demonstrate its effectiveness more thoroughly. For further reference, the code can be checked in the GitHub repository at*.
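A minimal sketch of the Pareto filtering used to trade off the two chosen metrics (resources to minimize, synchronous-dataflow throughput to maximize); the candidate (resources, throughput) pairs are hypothetical stand-ins for mappings evaluated by the exploration.

```python
def pareto_front(points):
    """Keep the (resources, throughput) pairs not dominated by any other:
    a point dominates another if it uses no more resources and achieves at
    least the same throughput, with at least one strict improvement."""
    return sorted((r, t) for r, t in points
                  if not any(r2 <= r and t2 >= t and (r2 < r or t2 > t)
                             for r2, t2 in points))

# Hypothetical (cores used, SDF throughput) results from candidate mappings.
candidates = [(2, 10.0), (3, 14.0), (3, 12.0), (4, 14.5), (5, 14.5)]
print(pareto_front(candidates))  # [(2, 10.0), (3, 14.0), (4, 14.5)]
```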
125

Automatic Design Space Exploration of Fault-tolerant Embedded Systems Architectures

Tierno, Antonio 26 January 2023
Embedded systems may have competing design objectives, such as maximizing reliability, increasing functional safety, minimizing product cost, and minimizing energy consumption. Architectures must therefore be configured to meet varied requirements and multiple design objectives. Reliability and safety in particular are receiving increasing attention, so the configuration of fault-tolerance mechanisms is a critical design decision. This work proposes a method for the automatic selection of appropriate fault-tolerant design patterns that optimizes multiple objective functions simultaneously. First, we present an exact method that leverages the power of Satisfiability Modulo Theories to encode the problem symbolically, based on a novel reliability assessment that is part of the evaluation of alternative designs. We then empirically evaluate the performance of a near-optimal approximate variant that allows us to solve the problem even when the instance size makes the exact method intractable in terms of computing resources. The efficiency and scalability of the method are validated with a series of experiments of different sizes and characteristics, and by comparison with existing methods on a test problem that is widely used in the reliability optimization literature.
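To make the multi-objective selection concrete, here is a brute-force sketch that assigns one fault-tolerance pattern per component and keeps the cost/reliability Pareto set. The patterns, costs, and component reliabilities are illustrative, and the exhaustive search stands in for the thesis's SMT encoding; only the TMR formula 3R² − 2R³ is the standard majority-voting result.

```python
from itertools import product

# Candidate patterns per component: (name, cost, reliability transform).
# Costs are made-up units; "duplex" assumes ideal fault detection.
PATTERNS = [
    ("none",   1.0, lambda R: R),
    ("duplex", 2.0, lambda R: 1 - (1 - R) ** 2),     # idealized standby pair
    ("tmr",    3.0, lambda R: 3 * R**2 - 2 * R**3),  # triple modular redundancy
]

def explore(component_reliabilities):
    """Assign one pattern per component, compute series-system reliability,
    and return the (cost, reliability, assignment) Pareto front."""
    results = []
    for combo in product(PATTERNS, repeat=len(component_reliabilities)):
        cost = sum(c for _, c, _ in combo)
        rel = 1.0
        for (_, _, f), R in zip(combo, component_reliabilities):
            rel *= f(R)  # series system: every component must work
        results.append((cost, rel, [name for name, _, _ in combo]))
    return [p for p in results
            if not any(q[0] <= p[0] and q[1] >= p[1] and (q[0] < p[0] or q[1] > p[1])
                       for q in results)]

for cost, rel, names in sorted(explore([0.95, 0.99])):
    print(cost, round(rel, 5), names)
```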
126

A body-centric framework for generating and evaluating novel interaction techniques

Wagner, Julie 06 December 2012
This thesis introduces BodyScape, a body-centric framework that accounts for how users coordinate their movements within and across their own limbs in order to interact with a wide range of devices, across multiple surfaces. It introduces a graphical notation that describes interaction techniques in terms of (1) motor assemblies responsible for performing a control task (input motor assemblies) or bringing the body into a position to visually perceive output (output motor assemblies), and (2) the movement coordination of motor assemblies, relative to the body or fixed in the world, with respect to the interactive environment. The thesis applies BodyScape to (1) investigate the role of support in a set of novel bimanual interaction techniques for hand-held devices, (2) analyze the competing effects across multiple input movements, and (3) compare twelve pan-and-zoom techniques on a wall-sized display to determine the roles of guidance and interference in performance. Using BodyScape to characterize interaction clarifies the role of device support on the user's balance and, consequently, on comfort and performance. It allows designers to identify situations in which multiple body movements interfere with each other, with a corresponding decrease in performance. Finally, it highlights the trade-offs among different combinations of techniques, enabling the analysis and generation of a variety of multi-surface interaction techniques. I argue that including a body-centric perspective when defining interaction techniques is essential for addressing the combinatorial explosion of interactive devices in multi-surface environments.
127

Design, Analysis, and Applications of Approximate Arithmetic Modules

Ullah, Salim 06 April 2022
From the initial computing machines, Colossus of 1943 and ENIAC of 1945, to modern high-performance data centers and the Internet of Things (IoT), four design goals, i.e., high performance, energy efficiency, resource utilization, and ease of programmability, have remained a beacon of development for the computing industry. During this period, the computing industry has exploited the advantages of technology scaling and microarchitectural enhancements to achieve these goals. However, with the end of Dennard scaling, these techniques have diminishing energy and performance advantages. Therefore, it is necessary to explore alternative techniques for satisfying the computational and energy requirements of modern applications. Towards this end, one promising technique is analyzing and surrendering the strict notion of correctness in various layers of the computation stack. Most modern applications across the computing spectrum, from data centers to IoT devices, interact with and analyze real-world data and make decisions accordingly. These applications are broadly classified as Recognition, Mining, and Synthesis (RMS). Instead of producing a single golden answer, these applications produce several feasible answers, and they possess an inherent error resilience to the inexactness of the processed data and the corresponding operations. Utilizing this inherent error resilience, the paradigm of Approximate Computing relaxes the strict notion of computation correctness to realize high-performance and energy-efficient systems with acceptable quality outputs. Prior work on circuit-level approximations has mainly focused on Application-Specific Integrated Circuits (ASICs). However, ASIC-based solutions suffer from long time-to-market and costly development cycles. These limitations of ASICs can be overcome by utilizing the reconfigurable nature of Field Programmable Gate Arrays (FPGAs). However, due to architectural differences between ASICs and FPGAs, applying ASIC-based approximation techniques to FPGA-based systems does not yield proportional performance and energy gains. Therefore, to exploit the principles of approximate computing in FPGA-based hardware accelerators for error-resilient applications, FPGA-optimized approximation techniques are required. Further, most state-of-the-art approximate arithmetic operators lack a generic approximation methodology for implementing new approximate designs as an application's accuracy and performance requirements change. These works also lack a methodology where a machine learning model can be used to correlate an approximate operator with its impact on the output quality of an application. This thesis focuses on these research challenges by designing and exploring FPGA-optimized, logic-based approximate arithmetic operators. As multiplication is one of the most computationally complex and frequently used arithmetic operations in modern applications, such as Artificial Neural Networks (ANNs), we have considered it for most of the approximation techniques proposed in this thesis. The primary focus of the work is to provide a framework for generating FPGA-optimized approximate arithmetic operators, together with efficient techniques for exploring approximate operators when implementing hardware accelerators for error-resilient applications. Towards this end, we first present various designs of resource-optimized, high-performance, and energy-efficient accurate multipliers.
Although modern FPGAs host high-performance DSP blocks to perform multiplication and other arithmetic operations, our analysis and results show that the orthogonal approach of having resource-efficient and high-performance multipliers is necessary for implementing high-performance accelerators. Due to the differences in the type of data processed by various applications, the thesis presents individual designs for unsigned, signed, and constant multipliers. Compared to the multiplier IPs provided by the FPGA synthesis tool, our proposed designs provide significant performance gains. We then explore the designed accurate multipliers and provide a library of approximate unsigned/signed multipliers. The proposed approximations target reductions in the total utilized resources, critical path delay, and energy consumption of the multipliers. We have explored various statistical error metrics to characterize the approximation-induced accuracy degradation of the approximate multipliers, and we have utilized the designed multipliers in various error-resilient applications to evaluate their impact on the applications' output quality and performance. Based on our analysis of the designed approximate multipliers, we identify the need for a framework to design application-specific approximate arithmetic operators, i.e., operators that implement only the logic needed to satisfy the application's overall output accuracy and performance constraints. Towards this end, we present a generic design methodology for implementing FPGA-based application-specific approximate arithmetic operators from their accurate implementations, according to the application's accuracy and performance requirements. In this regard, we utilize various machine learning models to identify feasible approximate arithmetic configurations for various applications, as well as different machine learning models and optimization techniques to efficiently explore the large design space of individual operators and their utilization in various applications. In this thesis, we have used the proposed methodology to design approximate adders and multipliers. The thesis also explores other layers of the computation stack (cross-layer) for possible approximations to satisfy an application's accuracy and performance requirements. Towards this end, we first present a low bit-width and highly accurate quantization scheme for pre-trained Deep Neural Networks (DNNs); the proposed scheme does not require re-training (fine-tuning the parameters) after quantization. We also present a resource-efficient FPGA-based multiplier that utilizes our proposed quantization scheme. Finally, we present a framework to allow the intelligent exploration and highly accurate identification of the feasible design points in the large design space enabled by cross-layer approximations. The proposed framework utilizes a novel Polynomial Regression (PR)-based method to model approximate arithmetic operators; the PR-based representation enables machine learning models to better correlate an approximate operator's coefficients with their impact on an application's output quality.

Contents:
1. Introduction: 1.1 Inherent Error Resilience of Applications; 1.2 Approximate Computing Paradigm (1.2.1 Software Layer Approximation; 1.2.2 Architecture Layer Approximation; 1.2.3 Circuit Layer Approximation); 1.3 Problem Statement; 1.4 Focus of the Thesis; 1.5 Key Contributions and Thesis Overview
2. Preliminaries: 2.1 Xilinx FPGA Slice Structure; 2.2 Multiplication Algorithms (2.2.1 Baugh-Wooley's Multiplication Algorithm; 2.2.2 Booth's Multiplication Algorithm; 2.2.3 Sign Extension for Booth's Multiplier); 2.3 Statistical Error Metrics; 2.4 Design Space Exploration and Optimization Techniques (2.4.1 Genetic Algorithm; 2.4.2 Bayesian Optimization); 2.5 Artificial Neural Networks
3. Accurate Multipliers: 3.1 Introduction; 3.2 Related Work; 3.3 Unsigned Multiplier Architecture; 3.4 Motivation for Signed Multipliers; 3.5 Baugh-Wooley's Multiplier; 3.6 Booth's Algorithm-based Signed Multipliers (3.6.1 Booth-Mult Design; 3.6.2 Booth-Opt Design; 3.6.3 Booth-Par Design); 3.7 Constant Multipliers; 3.8 Results and Discussion (3.8.1 Experimental Setup and Tool Flow; 3.8.2 Performance comparison of the proposed accurate unsigned multiplier; 3.8.3 Performance comparison of the proposed accurate signed multiplier with the state-of-the-art accurate multipliers; 3.8.4 Performance comparison of the proposed constant multiplier with the state-of-the-art accurate multipliers); 3.9 Conclusion
4. Approximate Multipliers: 4.1 Introduction; 4.2 Related Work; 4.3 Unsigned Approximate Multipliers (4.3.1 Approximate 4 × 4 Multiplier (Approx-1); 4.3.2 Approximate 4 × 4 Multiplier (Approx-2); 4.3.3 Approximate 4 × 4 Multiplier (Approx-3)); 4.4 Designing Higher Order Approximate Unsigned Multipliers (4.4.1 Accurate Adders for Implementing 8 × 8 Approximate Multipliers from 4 × 4 Approximate Multipliers; 4.4.2 Approximate Adders for Implementing Higher-order Approximate Multipliers); 4.5 Approximate Signed Multipliers (Booth-Approx); 4.6 Results and Discussion (4.6.1 Experimental Setup and Tool Flow; 4.6.2 Evaluation of the Proposed Approximate Unsigned Multipliers; 4.6.3 Evaluation of the Proposed Approximate Signed Multiplier); 4.7 Conclusion
5. Designing Application-specific Approximate Operators: 5.1 Introduction; 5.2 Related Work; 5.3 Modeling Approximate Arithmetic Operators (5.3.1 Accurate Multiplier Design; 5.3.2 Approximation Methodology; 5.3.3 Approximate Adders); 5.4 DSE for FPGA-based Approximate Operators Synthesis (5.4.1 DSE using Bayesian Optimization; 5.4.2 MOEA-based Optimization; 5.4.3 Machine Learning Models for DSE); 5.5 Results and Discussion (5.5.1 Experimental Setup and Tool Flow; 5.5.2 Accuracy-Performance Analysis of Approximate Adders; 5.5.3 Accuracy-Performance Analysis of Approximate Multipliers; 5.5.4 AppAxO MBO; 5.5.5 ML Modeling; 5.5.6 DSE using ML Models; 5.5.7 Proposed Approximate Operators); 5.6 Conclusion
6. Quantization of Pre-trained Deep Neural Networks: 6.1 Introduction; 6.2 Related Work (6.2.1 Commonly Used Quantization Techniques); 6.3 Proposed Quantization Techniques (6.3.1 L2L: Log_2_Lead Quantization; 6.3.2 ALigN: Adaptive Log_2_Lead Quantization; 6.3.3 Quantitative Analysis of the Proposed Quantization Schemes; 6.3.4 Proposed Quantization Technique-based Multiplier); 6.4 Results and Discussion (6.4.1 Experimental Setup and Tool Flow; 6.4.2 Image Classification; 6.4.3 Semantic Segmentation; 6.4.4 Hardware Implementation Results); 6.5 Conclusion
7. A Framework for Cross-layer Approximations: 7.1 Introduction; 7.2 Related Work; 7.3 Error-analysis of approximate arithmetic units (7.3.1 Application Independent Error-analysis of Approximate Multipliers; 7.3.2 Application Specific Error Analysis); 7.4 Accelerator Performance Estimation; 7.5 DSE Methodology; 7.6 Results and Discussion (7.6.1 Experimental Setup and Tool Flow; 7.6.2 Behavioral Analysis; 7.6.3 Accelerator Performance Estimation; 7.6.4 DSE Performance); 7.7 Conclusion
8. Conclusions and Future Work
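As an illustration of how approximation-induced errors are characterized statistically, the sketch below implements a generic truncation-based approximate 4 × 4 multiplier (not the thesis's Approx-1/2/3 designs) and computes common error metrics over the full input space.

```python
import itertools

def truncated_mult(a, b, cut=2):
    """Toy approximate 4x4 multiplier: accumulate the partial products but
    drop the `cut` least-significant columns. A generic truncation scheme,
    used here only to illustrate error characterization."""
    acc = 0
    for i in range(4):
        for j in range(4):
            if i + j >= cut and (b >> j) & 1:
                acc += ((a >> i) & 1) << (i + j)
    return acc

# Statistical error metrics over all 16x16 unsigned input pairs.
errs = [abs(a * b - truncated_mult(a, b))
        for a, b in itertools.product(range(16), repeat=2)]
mean_error_distance = sum(errs) / len(errs)
max_error = max(errs)
error_rate = sum(e > 0 for e in errs) / len(errs)
print(mean_error_distance, max_error, error_rate)
```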
128

Processor design-space exploration through fast simulation

Khan, Taj Muhammad 12 May 2011
Simulation is a vital tool used by architects to develop new architectures. However, because of the complexity of modern architectures and the length of recent benchmarks, detailed simulation of programs can take extremely long. This impedes the exploration of the processor design space that architects must perform to find the optimal configuration of processor parameters. Sampling is one technique that reduces simulation time without adversely affecting the accuracy of the results. It rests on the observation that a program's execution is composed of repeating code regions, or phases, so each phase can be simulated in detail just once and the whole-program performance computed from the per-phase results. Two questions then arise: which parts of the program should be simulated (representative sampling analyzes the execution into phases and simulates each once; statistical sampling picks samples at random), and how is the system state restored, i.e., warmed up, before each sample? Yet most sampling techniques either ignore the warm-up issue or require significant development effort on the part of the user. In this thesis we tackle the problem of reconciling state-of-the-art warm-up techniques and the latest sampling mechanisms, with the triple objective of keeping user effort minimal, achieving good accuracy, and being agnostic to software and hardware changes. We show that both representative and statistical sampling can be adapted to use warm-up mechanisms that accommodate the underlying architecture's warm-up requirements on the fly. We present experimental results with accuracy and speed comparable to the latest research, while relieving the user of warm-up concerns and hiding the simulation details; we also observe that statistical sampling gives better results than representative sampling. Finally, we leverage statistical calculations to provide an estimate of the robustness of the final results.
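The statistical-sampling idea can be sketched in a few lines: simulate randomly placed samples in detail and report a confidence interval as the robustness estimate. The `measure` callback below is a hypothetical stand-in for detailed simulation of one sample, not an interface from the thesis.

```python
import random
import statistics

def sample_cpi(trace_len, n_samples, sample_len, measure):
    """Estimate mean CPI from randomly placed samples and report a ~95%
    normal-approximation confidence interval for the estimate."""
    starts = [random.randrange(trace_len - sample_len) for _ in range(n_samples)]
    cpis = [measure(s, sample_len) for s in starts]
    mean = statistics.mean(cpis)
    half = 1.96 * statistics.stdev(cpis) / n_samples ** 0.5
    return mean, (mean - half, mean + half)

# Hypothetical stand-in: pretend CPI varies with the program phase.
mean, ci = sample_cpi(10**9, 50, 10**6,
                      lambda start, n: 1.0 + 0.3 * ((start // 10**8) % 3))
print(mean, ci)
```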
129

Dynamic instruction set extension of microprocessors with embedded FPGAs

Bauer, Heiner 13 April 2017
Increasingly complex applications and recent shifts in technology scaling have created a large demand for microprocessors that can perform tasks more quickly and more energy-efficiently. Conventional microarchitectures exploit multiple levels of parallelism to increase instruction throughput and use application-specific instruction sets or hardware accelerators to increase energy efficiency. Reconfigurable microprocessors adopt the same principle of providing application-specific hardware, however, with the significant advantage of post-fabrication flexibility. Not only does this offer similar gains in performance but also the flexibility to configure each device individually. This thesis explored the benefit of a tightly coupled, fine-grained reconfigurable microprocessor. In contrast to previous research, a detailed design space exploration of logical architectures for island-style field programmable gate arrays (FPGAs) was performed in the context of a commercial 22nm process technology. Other research projects either reused general-purpose architectures or spent little effort to design and characterize custom fabrics, which are critical to system performance and the practicality of frequently proposed high-level software techniques. Here, detailed circuit implementations and a custom area model were used to estimate the performance of over 200 different logical FPGA architectures with single-driver routing. Results of this exploration revealed tradeoffs and trends similar to those described by previous studies: the number of lookup table (LUT) inputs and the structure of the global routing network were shown to have a major impact on the area-delay product. However, the results suggested a much larger region of efficient architectures than before. Finally, an architecture with 5-LUTs and 8 logic elements per cluster was selected. Modifications to the microprocessor, which was based on an industry-proven instruction set architecture, and to its software toolchain provided access to this embedded reconfigurable fabric via custom instructions. The baseline microprocessor was characterized with estimates from signoff data for a 28nm hardware implementation. A modified academic FPGA tool flow was used to transform Verilog implementations of custom instructions into a post-routing netlist with timing annotations. Simulation-based verification of the system was performed with a cycle-accurate processor model and diverse application benchmarks, ranging from signal processing, over encryption, to the computation of elementary functions. For these benchmarks, a significant increase in performance, with speedups from 3 to 15 relative to the baseline microprocessor, was achieved with the extended instruction set. Except in one case, application speedup clearly outweighed the area overhead of the extended system, even though the modeled fabric architecture was primitive and contained no explicit arithmetic enhancements. The insights into fundamental tradeoffs of island-style FPGA architectures, the developed exploration flow, and the concrete cost model are relevant for the development of more advanced architectures. Hence, this work is a successful proof of concept and has laid the basis for further investigations into architectural extensions and physical implementations. Potential for further optimization was identified on multiple levels, and numerous directions for future research are described.
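The area-delay-product comparison that drives such an exploration can be sketched as follows. The candidate architectures and numbers are invented for illustration and do not reproduce the thesis's circuit implementations or area model.

```python
# Hypothetical (LUT size, logic elements per cluster) candidates with
# normalized modeled area and critical-path delay (ns).
candidates = {
    (4, 8):  (1.00, 1.30),
    (5, 8):  (1.15, 1.10),   # the configuration selected in the thesis
    (6, 10): (1.45, 1.00),
}

def area_delay_product(area, delay):
    """Standard efficiency figure of merit for FPGA logical architectures."""
    return area * delay

for arch, (a, d) in sorted(candidates.items()):
    print(arch, round(area_delay_product(a, d), 3))
best = min(candidates.items(), key=lambda kv: area_delay_product(*kv[1]))
print("best by ADP:", best[0])
```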
130

Methods for parameterizing and exploring Pareto frontiers using barycentric coordinates

Daskilewicz, Matthew John 08 April 2013
The research objective of this dissertation is to create and demonstrate methods for parameterizing the Pareto frontiers of continuous multi-attribute design problems using barycentric coordinates, and in doing so, to enable intuitive exploration of optimal trade spaces. This work is enabled by two observations about Pareto frontiers that have not been previously addressed in the engineering design literature. First, the observation that the mapping between non-dominated designs and Pareto-efficient response vectors is a bijection almost everywhere suggests that points on the Pareto frontier can be inverted to find their corresponding design variable vectors. Second, the observation that certain common classes of Pareto frontiers are topologically equivalent to simplices suggests that a barycentric coordinate system will be more useful for parameterizing the frontier than the Cartesian coordinate systems typically used to parameterize the design and objective spaces. By defining such a coordinate system, the design problem may be reformulated from y = f(x) to (y,x) = g(p), where x is a vector of design variables, y is a vector of attributes, and p is a vector of barycentric coordinates. Exploration of the design problem using p as the independent variables has the following desirable properties: 1) Every vector p corresponds to a particular Pareto-efficient design, and every Pareto-efficient design corresponds to a particular vector p. 2) The number of p-coordinates is equal to the number of attributes, regardless of the number of design variables. 3) Each attribute y_i has a corresponding coordinate p_i such that increasing the value of p_i corresponds to a motion along the Pareto frontier that improves y_i monotonically. The primary contribution of this work is the development of three methods for forming a barycentric coordinate system on the Pareto frontier, two of which are entirely original. The first method, named "non-domination level coordinates," constructs a coordinate system based on the (k-1)-attribute non-domination levels of a discretely sampled Pareto frontier. The second method is based on a modification to an existing "normal boundary intersection" multi-objective optimizer that adaptively redistributes its search basepoints in order to sample from the entire frontier uniformly; the weights associated with each basepoint can then serve as a coordinate system on the frontier. The third method, named "Pareto simplex self-organizing maps," uses a modified self-organizing map training algorithm with a barycentric-grid node topology to iteratively conform a coordinate grid to the sampled Pareto frontier.
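A minimal sketch of the barycentric reparameterization: with k sampled Pareto-efficient attribute vectors acting as simplex corners, a coordinate vector p (non-negative, summing to one) selects a point on the frontier. The linear interpolation used here is only a first-order illustration of the y = g(p) idea, since real frontiers are curved and the dissertation's three methods construct the mapping differently; the attribute vectors are hypothetical.

```python
import numpy as np

# Hypothetical 3-attribute example: each corner is the sampled frontier point
# that is best in one attribute (all attributes to be minimized).
vertices = np.array([
    [1.0, 5.0, 9.0],   # best in attribute 0
    [4.0, 1.0, 8.0],   # best in attribute 1
    [5.0, 6.0, 2.0],   # best in attribute 2
])

def frontier_point(p):
    """Map barycentric coordinates p to an attribute vector by convex
    combination of the simplex corners."""
    p = np.asarray(p, dtype=float)
    assert p.min() >= 0 and abs(p.sum() - 1.0) < 1e-9
    return p @ vertices

print(frontier_point([1/3, 1/3, 1/3]))  # "center" of the trade space
print(frontier_point([1, 0, 0]))        # recovers the corner best in attribute 0
```

Note how increasing p_i pulls the result toward the corner that is best in attribute y_i, mirroring the monotone-improvement property listed above.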
