121

Design Space Exploration and Architecture Design for Inference and Training Deep Neural Networks

Qi, Yangjie January 2021 (has links)
No description available.
122

Design space exploration for co-mapping of periodic and streaming applications in a shared platform

Yuhan, Zhang January 2023 (has links)
As embedded systems advance, the complexity and multifaceted requirements of products have increased significantly. A trend in this domain is the selection of different types of application models and multiprocessors as the platform. However, existing design space exploration techniques often target only one particular model, and combining diverse application models may cause compatibility issues. Additionally, embedded system design inherently involves multiple objectives. Beyond the essential functionalities, other metrics must also be considered, such as power consumption, resource utilization, cost, safety, etc. The consideration of these diverse metrics results in a vast design space, so effective design space exploration also plays a crucial role. This thesis addresses these challenges by proposing a co-mapping approach for two distinct models: the periodically activated tasks model for real-time applications and the synchronous dataflow model for digital signal processing. Our primary goal is to co-map these two kinds of models onto a multi-core platform and explore trade-offs between the solutions. We choose the number of used resources and the throughput of the synchronous dataflow model as our performance metrics for assessment. We adopt a combination method in which periodic tasks are given precedence to ensure their deadlines are met. The remaining processor resources are then allocated to the synchronous dataflow model. Both the execution of periodic tasks and the synchronous dataflow model are managed by a scheduler, which prevents resource contention and optimizes the utilization of available processor resources. To achieve a balance between different metrics, we implement Pareto optimization as a guiding principle in our approach. This thesis uses the IDeSyDe tool, an extension of the ForSyDe group's current design space exploration tool, following the Design Space Identification methodology. The implementation is based on Scala and Python, running on the Java virtual machine. The experimental results affirm the successful mapping and scheduling of the periodically activated tasks model and the synchronous dataflow model onto the shared multi-processor platform. We find the Pareto-optimal solutions with IDeSyDe, strategically aiming to maximize the throughput of the synchronous dataflow model while concurrently minimizing resource consumption. This thesis provides valuable insight into the application of different models on a shared platform, particularly for developers interested in utilizing IDeSyDe. However, due to time constraints, our test case may not fully demonstrate the scalability of our method; additional tests could establish its effectiveness more conclusively. For further reference, the code can be checked in the GitHub repository at*.
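To make the Pareto step in the abstract above concrete, here is a minimal sketch of dominance filtering over the two stated metrics, number of used resources (minimized) and synchronous dataflow throughput (maximized); the candidate mappings below are hypothetical placeholders, not output of IDeSyDe:

```python
# Hedged sketch: Pareto filtering over (used resources, SDF throughput).
# The candidate mappings are made-up tuples for illustration only.
def dominates(a, b):
    """a dominates b if it uses no more resources, has no less throughput, and is strictly better in one."""
    res_a, thr_a = a
    res_b, thr_b = b
    return (res_a <= res_b and thr_a >= thr_b) and (res_a < res_b or thr_a > thr_b)

def pareto_front(candidates):
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

# (used cores, SDF iterations per second) for some hypothetical co-mappings
mappings = [(2, 120.0), (3, 180.0), (4, 185.0), (3, 150.0), (5, 200.0), (4, 210.0)]
print(pareto_front(mappings))   # -> [(2, 120.0), (3, 180.0), (4, 210.0)]
```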
123

Automatic Design Space Exploration of Fault-tolerant Embedded Systems Architectures

Tierno, Antonio 26 January 2023 (has links)
Embedded Systems may have competing design objectives, such as to maximize the reliability, increase the functional safety, minimize the product cost, and minimize the energy consumption. The architectures must therefore be configured to meet varied requirements and multiple design objectives. In particular, reliability and safety are receiving increasing attention. Consequently, the configuration of fault-tolerant mechanisms is a critical design decision. This work proposes a method for the automatic selection of appropriate fault-tolerant design patterns, simultaneously optimizing multiple objective functions. Firstly, we present an exact method that leverages the power of Satisfiability Modulo Theory to encode the problem with a symbolic technique. It is based on a novel assessment of reliability which is part of the evaluation of alternative designs. Afterwards, we empirically evaluate the performance of a near-optimal approximation variant that allows us to solve the problem even when the instance size makes it intractable in terms of computing resources. The efficiency and scalability of this method are validated with a series of experiments of different sizes and characteristics, and by comparing it with existing methods on a test problem that is widely used in the reliability optimization literature.
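A minimal sketch of the kind of SMT-based selection described above, using the z3 Optimize engine; the components, patterns, reliability figures and area budget are illustrative assumptions, not the thesis's actual encoding or reliability model:

```python
# Hedged sketch: pick one fault-tolerance pattern per component (none /
# duplication / TMR) with z3's Optimize engine, maximizing log-reliability
# under an area budget. All numbers are assumed for illustration.
import math
from z3 import Optimize, Int, If, Sum, sat

components = ["sensor_if", "controller", "actuator_drv"]
# pattern id -> (reliability, area cost); assumed values
patterns = {0: (0.990, 1.0),    # no redundancy
            1: (0.999, 2.1),    # duplication with comparison
            2: (0.99995, 3.3)}  # triple modular redundancy

opt = Optimize()
choice = {c: Int(f"pattern_{c}") for c in components}
for c in components:
    opt.add(choice[c] >= 0, choice[c] <= 2)

# Work with log-reliability so the system reliability (a product) becomes a sum.
log_rel = Sum([If(choice[c] == p, math.log(r), 0.0)
               for c in components for p, (r, _) in patterns.items()])
cost = Sum([If(choice[c] == p, a, 0.0)
            for c in components for p, (_, a) in patterns.items()])

opt.add(cost <= 7.0)        # an assumed area budget
opt.maximize(log_rel)       # one point of the cost/reliability trade-off
if opt.check() == sat:
    model = opt.model()
    print({c: model[choice[c]].as_long() for c in components})
```

Working in log-reliability keeps the objective linear, which is one common way to make a product-of-reliabilities objective tractable for an SMT optimizer.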
124

A body-centric framework for generating and evaluating novel interaction techniques

Wagner, Julie 06 December 2012 (has links)
This thesis introduces BodyScape, a body-centric framework that accounts for how users coordinate their movements within and across their own limbs in order to interact with a wide range of devices, across multiple surfaces. It introduces a graphical notation that describes interaction techniques in terms of (1) motor assemblies responsible for performing a control task (input motor assembly) or bringing the body into a position to visually perceive output (output motor assembly), and (2) the movement coordination of motor assemblies, relative to the body or fixed in the world, with respect to the interactive environment. This thesis applies BodyScape to 1) investigate the role of support in a set of novel bimanual interaction techniques for hand-held devices, 2) analyze the competing effect across multiple input movements, and 3) compare twelve pan-and-zoom techniques on a wall-sized display to determine the roles of guidance and interference on performance. Using BodyScape to characterize interaction clarifies the role of device support on the user's balance and subsequent comfort and performance. It allows designers to identify situations in which multiple body movements interfere with each other, with a corresponding decrease in performance. Finally, it highlights the trade-offs among different combinations of techniques, enabling the analysis and generation of a variety of multi-surface interaction techniques. I argue that including a body-centric perspective when defining interaction techniques is essential for addressing the combinatorial explosion of interactive devices in multi-surface environments.
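As one possible way to make the notation above concrete, a small sketch of how an interaction technique could be recorded in terms of input/output motor assemblies and their coordination frame; the field names and the example technique are assumptions for illustration, not the thesis's formal notation:

```python
# Hedged sketch: a data representation loosely following the BodyScape terms
# above (motor assemblies, input/output roles, body-relative vs. world-fixed
# coordination). Names and the example are assumed, not from the thesis.
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    INPUT = "performs a control task"
    OUTPUT = "positions the body to perceive output"

class Frame(Enum):
    BODY_RELATIVE = "moves with the user's body"
    WORLD_FIXED = "anchored in the interactive environment"

@dataclass
class MotorAssembly:
    limbs: tuple[str, ...]   # e.g. ("dominant hand", "forearm")
    role: Role
    frame: Frame

@dataclass
class InteractionTechnique:
    name: str
    assemblies: list[MotorAssembly]

# Example: bimanual pan-and-zoom on a wall-sized display (illustrative only).
pan_zoom = InteractionTechnique(
    name="bimanual pan-and-zoom",
    assemblies=[
        MotorAssembly(("non-dominant hand",), Role.INPUT, Frame.BODY_RELATIVE),
        MotorAssembly(("dominant hand",), Role.INPUT, Frame.WORLD_FIXED),
        MotorAssembly(("head", "torso"), Role.OUTPUT, Frame.WORLD_FIXED),
    ],
)
print(len(pan_zoom.assemblies), "motor assemblies")
```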
125

Design, Analysis, and Applications of Approximate Arithmetic Modules

Ullah, Salim 06 April 2022 (has links)
From the initial computing machines, Colossus of 1943 and ENIAC of 1945, to modern high-performance data centers and Internet of Things (IOTs), four design goals, i.e., high-performance, energy-efficiency, resource utilization, and ease of programmability, have remained a beacon of development for the computing industry. During this period, the computing industry has exploited the advantages of technology scaling and microarchitectural enhancements to achieve these goals. However, with the end of Dennard scaling, these techniques have diminishing energy and performance advantages. Therefore, it is necessary to explore alternative techniques for satisfying the computational and energy requirements of modern applications. Towards this end, one promising technique is analyzing and surrendering the strict notion of correctness in various layers of the computation stack. Most modern applications across the computing spectrum---from data centers to IoTs---interact and analyze real-world data and take decisions accordingly. These applications are broadly classified as Recognition, Mining, and Synthesis (RMS). Instead of producing a single golden answer, these applications produce several feasible answers. These applications possess an inherent error-resilience to the inexactness of processed data and corresponding operations. Utilizing these applications' inherent error-resilience, the paradigm of Approximate Computing relaxes the strict notion of computation correctness to realize high-performance and energy-efficient systems with acceptable quality outputs. The prior works on circuit-level approximations have mainly focused on Application-specific Integrated Circuits (ASICs). However, ASIC-based solutions suffer from long time-to-market and high-cost developing cycles. These limitations of ASICs can be overcome by utilizing the reconfigurable nature of Field Programmable Gate Arrays (FPGAs). However, due to architectural differences between ASICs and FPGAs, the utilization of ASIC-based approximation techniques for FPGA-based systems does not result in proportional performance and energy gains. Therefore, to exploit the principles of approximate computing for FPGA-based hardware accelerators for error-resilient applications, FPGA-optimized approximation techniques are required. Further, most state-of-the-art approximate arithmetic operators do not have a generic approximation methodology to implement new approximate designs for an application's changing accuracy and performance requirements. These works also lack a methodology where a machine learning model can be used to correlate an approximate operator with its impact on the output quality of an application. This thesis focuses on these research challenges by designing and exploring FPGA-optimized logic-based approximate arithmetic operators. As multiplication operation is one of the computationally complex and most frequently used arithmetic operations in various modern applications, such as Artificial Neural Networks (ANNs), we have, therefore, considered it for most of the proposed approximation techniques in this thesis. The primary focus of the work is to provide a framework for generating FPGA-optimized approximate arithmetic operators and efficient techniques to explore approximate operators for implementing hardware accelerators for error-resilient applications. Towards this end, we first present various designs of resource-optimized, high-performance, and energy-efficient accurate multipliers. 
Although modern FPGAs host high-performance DSP blocks to perform multiplication and other arithmetic operations, our analysis and results show that the orthogonal approach of having resource-efficient and high-performance multipliers is necessary for implementing high-performance accelerators. Due to the differences in the type of data processed by various applications, the thesis presents individual designs for unsigned, signed, and constant multipliers. Compared to the multiplier IPs provided by the FPGA Synthesis tool, our proposed designs provide significant performance gains. We then explore the designed accurate multipliers and provide a library of approximate unsigned/signed multipliers. The proposed approximations target the reduction in the total utilized resources, critical path delay, and energy consumption of the multipliers. We have explored various statistical error metrics to characterize the approximation-induced accuracy degradation of the approximate multipliers. We have also utilized the designed multipliers in various error-resilient applications to evaluate their impact on applications' output quality and performance. Based on our analysis of the designed approximate multipliers, we identify the need for a framework to design application-specific approximate arithmetic operators. An application-specific approximate arithmetic operator intends to implement only the logic that can satisfy the application's overall output accuracy and performance constraints. Towards this end, we present a generic design methodology for implementing FPGA-based application-specific approximate arithmetic operators from their accurate implementations according to the applications' accuracy and performance requirements. In this regard, we utilize various machine learning models to identify feasible approximate arithmetic configurations for various applications. We also utilize different machine learning models and optimization techniques to efficiently explore the large design space of individual operators and their utilization in various applications. In this thesis, we have used the proposed methodology to design approximate adders and multipliers. This thesis also explores other layers of the computation stack (cross-layer) for possible approximations to satisfy an application's accuracy and performance requirements. Towards this end, we first present a low bit-width and highly accurate quantization scheme for pre-trained Deep Neural Networks (DNNs). The proposed quantization scheme does not require re-training (fine-tuning the parameters) after quantization. We also present a resource-efficient FPGA-based multiplier that utilizes our proposed quantization scheme. Finally, we present a framework to allow the intelligent exploration and highly accurate identification of the feasible design points in the large design space enabled by cross-layer approximations. The proposed framework utilizes a novel Polynomial Regression (PR)-based method to model approximate arithmetic operators. The PR-based representation enables machine learning models to better correlate an approximate operator's coefficients with their impact on an application's output quality.
Table of contents:
1. Introduction 1.1 Inherent Error Resilience of Applications 1.2 Approximate Computing Paradigm 1.2.1 Software Layer Approximation 1.2.2 Architecture Layer Approximation 1.2.3 Circuit Layer Approximation 1.3 Problem Statement 1.4 Focus of the Thesis 1.5 Key Contributions and Thesis Overview
2. Preliminaries 2.1 Xilinx FPGA Slice Structure 2.2 Multiplication Algorithms 2.2.1 Baugh-Wooley’s Multiplication Algorithm 2.2.2 Booth’s Multiplication Algorithm 2.2.3 Sign Extension for Booth’s Multiplier 2.3 Statistical Error Metrics 2.4 Design Space Exploration and Optimization Techniques 2.4.1 Genetic Algorithm 2.4.2 Bayesian Optimization 2.5 Artificial Neural Networks
3. Accurate Multipliers 3.1 Introduction 3.2 Related Work 3.3 Unsigned Multiplier Architecture 3.4 Motivation for Signed Multipliers 3.5 Baugh-Wooley’s Multiplier 3.6 Booth’s Algorithm-based Signed Multipliers 3.6.1 Booth-Mult Design 3.6.2 Booth-Opt Design 3.6.3 Booth-Par Design 3.7 Constant Multipliers 3.8 Results and Discussion 3.8.1 Experimental Setup and Tool Flow 3.8.2 Performance comparison of the proposed accurate unsigned multiplier 3.8.3 Performance comparison of the proposed accurate signed multiplier with the state-of-the-art accurate multipliers 3.8.4 Performance comparison of the proposed constant multiplier with the state-of-the-art accurate multipliers 3.9 Conclusion
4. Approximate Multipliers 4.1 Introduction 4.2 Related Work 4.3 Unsigned Approximate Multipliers 4.3.1 Approximate 4 × 4 Multiplier (Approx-1) 4.3.2 Approximate 4 × 4 Multiplier (Approx-2) 4.3.3 Approximate 4 × 4 Multiplier (Approx-3) 4.4 Designing Higher Order Approximate Unsigned Multipliers 4.4.1 Accurate Adders for Implementing 8 × 8 Approximate Multipliers from 4 × 4 Approximate Multipliers 4.4.2 Approximate Adders for Implementing Higher-order Approximate Multipliers 4.5 Approximate Signed Multipliers (Booth-Approx) 4.6 Results and Discussion 4.6.1 Experimental Setup and Tool Flow 4.6.2 Evaluation of the Proposed Approximate Unsigned Multipliers 4.6.3 Evaluation of the Proposed Approximate Signed Multiplier 4.7 Conclusion
5. Designing Application-specific Approximate Operators 5.1 Introduction 5.2 Related Work 5.3 Modeling Approximate Arithmetic Operators 5.3.1 Accurate Multiplier Design 5.3.2 Approximation Methodology 5.3.3 Approximate Adders 5.4 DSE for FPGA-based Approximate Operators Synthesis 5.4.1 DSE using Bayesian Optimization 5.4.2 MOEA-based Optimization 5.4.3 Machine Learning Models for DSE 5.5 Results and Discussion 5.5.1 Experimental Setup and Tool Flow 5.5.2 Accuracy-Performance Analysis of Approximate Adders 5.5.3 Accuracy-Performance Analysis of Approximate Multipliers 5.5.4 AppAxO MBO 5.5.5 ML Modeling 5.5.6 DSE using ML Models 5.5.7 Proposed Approximate Operators 5.6 Conclusion
6. Quantization of Pre-trained Deep Neural Networks 6.1 Introduction 6.2 Related Work 6.2.1 Commonly Used Quantization Techniques 6.3 Proposed Quantization Techniques 6.3.1 L2L: Log_2_Lead Quantization 6.3.2 ALigN: Adaptive Log_2_Lead Quantization 6.3.3 Quantitative Analysis of the Proposed Quantization Schemes 6.3.4 Proposed Quantization Technique-based Multiplier 6.4 Results and Discussion 6.4.1 Experimental Setup and Tool Flow 6.4.2 Image Classification 6.4.3 Semantic Segmentation 6.4.4 Hardware Implementation Results 6.5 Conclusion
7. A Framework for Cross-layer Approximations 7.1 Introduction 7.2 Related Work 7.3 Error-analysis of approximate arithmetic units 7.3.1 Application Independent Error-analysis of Approximate Multipliers 7.3.2 Application Specific Error Analysis 7.4 Accelerator Performance Estimation 7.5 DSE Methodology 7.6 Results and Discussion 7.6.1 Experimental Setup and Tool Flow 7.6.2 Behavioral Analysis 7.6.3 Accelerator Performance Estimation 7.6.4 DSE Performance 7.7 Conclusion
8. Conclusions and Future Work
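A minimal sketch of the statistical error characterization mentioned in the abstract (and in Section 2.3 of the table of contents), applied to a toy 4x4 approximate multiplier; the truncation-based approximation is a generic stand-in, not one of the proposed Approx-1/2/3 designs:

```python
# Hedged sketch: compute common statistical error metrics (error distance,
# mean/normalized error distance, error rate) over the full 4x4-bit input
# space of an assumed approximate multiplier.
def approx_mult_4x4(a: int, b: int) -> int:
    exact = a * b
    return exact & ~0x3          # assumed approximation: drop the two LSBs

def error_metrics(bits: int = 4):
    n = 1 << bits
    max_prod = (n - 1) ** 2
    eds, errors = [], 0
    for a in range(n):
        for b in range(n):
            ed = abs(a * b - approx_mult_4x4(a, b))
            eds.append(ed)
            errors += ed != 0
    total = n * n
    return {
        "mean_error_distance": sum(eds) / total,
        "normalized_mean_ed": sum(eds) / (total * max_prod),
        "max_error_distance": max(eds),
        "error_rate": errors / total,
    }

print(error_metrics())
```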
126

Processor design-space exploration through fast simulation

Khan, Taj Muhammad 12 May 2011 (has links)
Simulation is a vital tool used by architects to develop new architectures. However, because of the complexity of modern architectures and the length of recent benchmarks, detailed simulation of programs can take extremely long times. This impedes the exploration of the processor design space which architects need to perform to find the optimal configuration of processor parameters. Sampling is one technique which reduces the simulation time without adversely affecting the accuracy of the results. Yet, most sampling techniques either ignore the warm-up issue or require significant development effort on the part of the user. In this thesis we tackle the problem of reconciling state-of-the-art warm-up techniques and the latest sampling mechanisms with the triple objective of keeping user effort to a minimum, achieving good accuracy, and being agnostic to software and hardware changes. We show that both the representative and statistical sampling techniques can be adapted to use warm-up mechanisms which can accommodate the underlying architecture's warm-up requirements on-the-fly. We present experimental results which show an accuracy and speed comparable to the latest research. Also, we leverage statistical calculations to provide an estimate of the robustness of the final results. We also found that statistical sampling gives better results than representative sampling.
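A minimal sketch of the statistical-sampling idea discussed above: estimate mean CPI from randomly placed simulation windows and attach a confidence interval; the synthetic trace and the sample sizes are assumptions for illustration, not the thesis's experimental setup:

```python
# Hedged sketch: statistical sampling of a performance trace. A real flow
# would drive a cycle-accurate simulator; here a synthetic per-instruction
# CPI trace stands in for detailed simulation output.
import random
import statistics

def sample_cpi(full_trace_cpi, n_samples=50, sample_len=1000):
    """Average CPI over n_samples randomly placed windows of the trace."""
    means = []
    for _ in range(n_samples):
        start = random.randrange(len(full_trace_cpi) - sample_len)
        window = full_trace_cpi[start:start + sample_len]
        means.append(sum(window) / sample_len)
    est = statistics.mean(means)
    # 95% confidence interval from the standard error of the sample means.
    ci = 1.96 * statistics.stdev(means) / (len(means) ** 0.5)
    return est, ci

trace = [1.0 + 0.5 * ((i // 5000) % 3) for i in range(200_000)]  # assumed phases
estimate, half_width = sample_cpi(trace)
print(f"estimated CPI = {estimate:.3f} +/- {half_width:.3f}")
```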
127

Dynamic instruction set extension of microprocessors with embedded FPGAs

Bauer, Heiner 13 April 2017 (has links) (PDF)
Increasingly complex applications and recent shifts in technology scaling have created a large demand for microprocessors which can perform tasks more quickly and more energy efficiently. Conventional microarchitectures exploit multiple levels of parallelism to increase instruction throughput and use application specific instruction sets or hardware accelerators to increase energy efficiency. Reconfigurable microprocessors adopt the same principle of providing application specific hardware, however, with the significant advantage of post-fabrication flexibility. Not only does this offer similar gains in performance but also the flexibility to configure each device individually. This thesis explored the benefit of a tightly coupled and fine-grained reconfigurable microprocessor. In contrast to previous research, a detailed design space exploration of logical architectures for island-style field programmable gate arrays (FPGAs) has been performed in the context of a commercial 22nm process technology. Other research projects either reused general purpose architectures or spent little effort to design and characterize custom fabrics, which are critical to system performance and the practicality of frequently proposed high-level software techniques. Here, detailed circuit implementations and a custom area model were used to estimate the performance of over 200 different logical FPGA architectures with single-driver routing. Results of this exploration revealed similar tradeoffs and trends described by previous studies. The number of lookup table (LUT) inputs and the structure of the global routing network were shown to have a major impact on the area delay product. However, results suggested a much larger region of efficient architectures than before. Finally, an architecture with 5-LUTs and 8 logic elements per cluster was selected. Modifications to the microprocessor, which was based on an industry proven instruction set architecture, and its software toolchain provided access to this embedded reconfigurable fabric via custom instructions. The baseline microprocessor was characterized with estimates from signoff data for a 28nm hardware implementation. A modified academic FPGA tool flow was used to transform Verilog implementations of custom instructions into a post-routing netlist with timing annotations. Simulation-based verification of the system was performed with a cycle-accurate processor model and diverse application benchmarks, ranging from signal processing, over encryption, to the computation of elementary functions. For these benchmarks, a significant increase in performance with speedups from 3 to 15 relative to the baseline microprocessor was achieved with the extended instruction set. Except for one case, application speedup clearly outweighed the area overhead for the extended system, even though the modeled fabric architecture was primitive and contained no explicit arithmetic enhancements. Insights into fundamental tradeoffs of island-style FPGA architectures, the developed exploration flow, and a concrete cost model are relevant for the development of more advanced architectures. Hence, this work is a successful proof of concept and has laid the basis for further investigations into architectural extensions and physical implementations. Potential for further optimization was identified on multiple levels and numerous directions for future research were described.
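A hedged sketch of the kind of architecture sweep described above: enumerate logical FPGA parameters and rank configurations by area-delay product; the analytical area and delay models are invented placeholders, not the circuit-level, 22nm-calibrated models used in the thesis:

```python
# Hedged sketch: exhaustive sweep over (LUT inputs, cluster size) with
# placeholder cost models, ranked by area-delay product.
from itertools import product

def area_model(lut_size, cluster_size):
    # assumed: LUT area grows exponentially with inputs, plus per-cluster routing
    return cluster_size * (2 ** lut_size) + 40 * cluster_size ** 1.5

def delay_model(lut_size, cluster_size):
    # assumed: fewer, larger LUTs shorten the critical path; clusters add local delay
    logic_levels = 12 / lut_size
    return logic_levels * (0.5 + 0.05 * lut_size) + 0.02 * cluster_size

candidates = []
for k, n in product(range(3, 8), range(2, 17, 2)):   # LUT inputs, cluster size
    adp = area_model(k, n) * delay_model(k, n)
    candidates.append((adp, k, n))

for adp, k, n in sorted(candidates)[:5]:
    print(f"{k}-LUT, {n} elements/cluster: area-delay product {adp:.1f}")
```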
128

Methods for parameterizing and exploring Pareto frontiers using barycentric coordinates

Daskilewicz, Matthew John 08 April 2013 (has links)
The research objective of this dissertation is to create and demonstrate methods for parameterizing the Pareto frontiers of continuous multi-attribute design problems using barycentric coordinates, and in doing so, to enable intuitive exploration of optimal trade spaces. This work is enabled by two observations about Pareto frontiers that have not been previously addressed in the engineering design literature. First, the observation that the mapping between non-dominated designs and Pareto efficient response vectors is a bijection almost everywhere suggests that points on the Pareto frontier can be inverted to find their corresponding design variable vectors. Second, the observation that certain common classes of Pareto frontiers are topologically equivalent to simplices suggests that a barycentric coordinate system will be more useful for parameterizing the frontier than the Cartesian coordinate systems typically used to parameterize the design and objective spaces. By defining such a coordinate system, the design problem may be reformulated from y = f(x) to (y,x) = g(p), where x is a vector of design variables, y is a vector of attributes and p is a vector of barycentric coordinates. Exploration of the design problem using p as the independent variables has the following desirable properties: 1) Every vector p corresponds to a particular Pareto efficient design, and every Pareto efficient design corresponds to a particular vector p. 2) The number of p-coordinates is equal to the number of attributes regardless of the number of design variables. 3) Each attribute y_i has a corresponding coordinate p_i such that increasing the value of p_i corresponds to a motion along the Pareto frontier that improves y_i monotonically. The primary contribution of this work is the development of three methods for forming a barycentric coordinate system on the Pareto frontier, two of which are entirely original. The first method, named "non-domination level coordinates," constructs a coordinate system based on the (k-1)-attribute non-domination levels of a discretely sampled Pareto frontier. The second method is based on a modification to an existing "normal boundary intersection" multi-objective optimizer that adaptively redistributes its search basepoints in order to sample from the entire frontier uniformly. The weights associated with each basepoint can then serve as a coordinate system on the frontier. The third method, named "Pareto simplex self-organizing maps," uses a modified self-organizing map training algorithm with a barycentric-grid node topology to iteratively conform a coordinate grid to the sampled Pareto frontier.
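A small sketch of the barycentric-weight idea behind the second method: convex combinations of the k single-objective anchor points both generate evenly spread basepoints for a normal-boundary-intersection style search and serve as coordinates on the (k-1)-simplex; the anchor values below are illustrative, not from the dissertation:

```python
# Hedged sketch: barycentric weight vectors on a regular simplex grid, used as
# convex-combination weights over assumed anchor points (single-objective optima).
from itertools import combinations
import numpy as np

def simplex_grid(k, divisions):
    """All barycentric weight vectors (p_i >= 0, sum p_i = 1) on a regular grid."""
    # stars-and-bars: choose k-1 divider positions among divisions + k - 1 slots
    for cuts in combinations(range(divisions + k - 1), k - 1):
        parts, prev = [], -1
        for c in cuts + (divisions + k - 1,):
            parts.append(c - prev - 1)
            prev = c
        yield tuple(part / divisions for part in parts)

# Attribute vectors of the three single-objective optima (assumed numbers).
anchors = np.array([[1.0, 0.2, 0.1],
                    [0.1, 1.0, 0.3],
                    [0.2, 0.3, 1.0]])

for p in simplex_grid(k=3, divisions=4):
    basepoint = np.array(p) @ anchors   # barycentric combination of the anchors
    print(np.round(p, 2), "->", np.round(basepoint, 2))
```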
129

Design, Implementation and Evaluation of a Configurable NoC for AcENoCs FPGA Accelerated Emulation Platform

Lotlikar, Swapnil Subhash August 2010 (has links)
The heterogeneous nature and the demand for extensive parallel processing in modern applications have resulted in widespread use of Multicore System-on-Chip (SoC) architectures. The emerging Network-on-Chip (NoC) architecture provides an energy-efficient and scalable communication solution for Multicore SoCs, serving as a powerful replacement for traditional bus-based solutions. The key to successful realization of such architectures is a flexible, fast and robust emulation platform for fast design space exploration. In this research, we present the design and evaluation of a highly configurable NoC used in AcENoCs (Accelerated Emulation platform for NoCs), a flexible and cycle accurate field programmable gate array (FPGA) emulation platform for validating NoC architectures. Along with the implementation details, we also discuss the various design optimizations and tradeoffs, and assess the performance improvements of AcENoCs over existing simulators and emulators. We design a hardware library consisting of routers and links using the Verilog hardware description language (HDL). The router is parameterized and has a configurable number of physical ports, virtual channels (VCs) and pipeline depth. A packet switched NoC is constructed by connecting the routers in either 2D-Mesh or 2D-Torus topology. The NoC is integrated in the AcENoCs platform and prototyped on a Xilinx Virtex-5 FPGA. The NoC was evaluated under various synthetic and realistic workloads generated by AcENoCs' traffic generators implemented on the Xilinx MicroBlaze embedded processor. In order to validate the NoC design, performance metrics like average latency and throughput were measured and compared against the results obtained using standard network simulators. FPGA implementation of the NoC using Xilinx tools indicated a 76% LUT utilization for a 5x5 2D-Mesh network. A VC allocator was found to be the single largest consumer of hardware resources within a router. The router design synthesized at a frequency of 135MHz, 124MHz and 109MHz for 3-port, 4-port and 5-port configurations, respectively. The operational frequency of the router in the AcENoCs environment was limited only by the software execution latency even though the hardware itself could be clocked at a much higher rate. The AcENoCs emulator showed speedup improvements of 10000-12000X over HDL simulators and 5-15X over software simulators, without sacrificing cycle accuracy.
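A minimal sketch of dimension-ordered (XY) routing on the 2D-Mesh topology used by the emulated NoC; the mesh size matches the 5x5 evaluation network, but the per-hop latency figure is an assumption, not an AcENoCs parameter:

```python
# Hedged sketch: XY routing on a 2D mesh -- resolve the X dimension first,
# then Y, which keeps routes minimal and deadlock-free on a mesh.
def xy_route(src, dst):
    """Return the list of (x, y) router coordinates visited from src to dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # resolve X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then resolve Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

MESH = 5                                # 5x5 2D-Mesh, as in the evaluation
ROUTER_CYCLES_PER_HOP = 3               # assumed per-hop pipeline latency

route = xy_route((0, 0), (4, 2))
print(route)
print("zero-load latency:", (len(route) - 1) * ROUTER_CYCLES_PER_HOP, "cycles")
```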
130

Multivariate Synergies in Pharmaceutical Roll Compaction: The quality influence of raw materials and process parameters by design of experiments

Souihi, Nabil January 2014 (has links)
Roll compaction is a continuous process commonly used in the pharmaceutical industry for dry granulation of moisture and heat sensitive powder blends. It is intended to increase bulk density and improve flowability. Roll compaction is a complex process that depends on many factors, such as feed powder properties, processing conditions and system layout. Some of the variability in the process remains unexplained. Accordingly, modeling tools are needed to understand the properties and the interrelations between raw materials, process parameters and the quality of the product. It is important to look at the whole manufacturing chain from raw materials to tablet properties. The main objective of this thesis was to investigate the impact of raw materials, process parameters and system design variations on the quality of intermediate and final roll compaction products, as well as their interrelations. In order to do so, we have conducted a series of systematic experimental studies and utilized chemometric tools, such as design of experiments, latent variable models (i.e. PCA, OPLS and O2PLS) as well as mechanistic models based on the rolling theory of granular solids developed by Johanson (1965). More specifically, we have developed a modeling approach to elucidate the influence of different brittle filler qualities of mannitol and dicalcium phosphate and their physical properties (i.e. flowability, particle size and compactability) on intermediate and final product quality. This approach allows the possibility of introducing new fillers without additional experiments, provided that they are within the previously mapped design space. Additionally, this approach is generic and could be extended beyond fillers. Furthermore, in contrast to many other materials, the results revealed that some qualities of the investigated fillers demonstrated improved compactability following roll compaction. In one study, we identified the design space for a roll compaction process using a risk-based approach. The influence of process parameters (i.e. roll force, roll speed, roll gap and milling screen size) on different ribbon, granule and tablet properties was evaluated. In another study, we demonstrated the significant added value of the combination of near-infrared chemical imaging, texture analysis and multivariate methods in the quality assessment of the intermediate and final roll compaction products. Finally, we have also studied the roll compaction of an intermediate drug load formulation at different scales and using roll compactors with different feed screw mechanisms (i.e. horizontal and vertical). The horizontal feed screw roll compactor was also equipped with an instrumented roll technology allowing the measurement of normal stress on ribbon. Ribbon porosity was primarily found to be a function of normal stress, exhibiting a quadratic relationship. A similar quadratic relationship was also observed between roll force and ribbon porosity of the vertically fed roll compactor. A combination of design of experiments, latent variable and mechanistic models led to a better understanding of the critical process parameters and showed that scale up/transfer between equipment is feasible.
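A brief sketch of the kind of analysis described above: a small factorial design over roll-compaction factors plus a quadratic fit of ribbon porosity against normal stress; all factor levels, porosity readings and fitted coefficients are invented placeholders used only to show the method, not data from the study:

```python
# Hedged sketch: a two-level full factorial design and a quadratic
# (response-surface style) fit of ribbon porosity vs. normal stress.
from itertools import product
import numpy as np

# Two-level full factorial design for roll force, roll speed, roll gap (coded units).
factors = {"roll_force": (-1, 1), "roll_speed": (-1, 1), "roll_gap": (-1, 1)}
design = list(product(*factors.values()))
print(f"{len(design)}-run full factorial:", design[:3], "...")

# Quadratic fit of ribbon porosity vs. measured normal stress (hypothetical data).
stress = np.array([20.0, 35.0, 50.0, 65.0, 80.0])       # MPa, assumed
porosity = np.array([0.32, 0.26, 0.22, 0.20, 0.19])      # fraction, assumed
coeffs = np.polyfit(stress, porosity, deg=2)              # porosity ~ a*s^2 + b*s + c
print("quadratic model coefficients:", np.round(coeffs, 5))
print("predicted porosity at 70 MPa:", round(float(np.polyval(coeffs, 70.0)), 3))
```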
