31

The V-SLAM Hurdler : A Faster V-SLAM System using Online Semantic Dynamic-and-Hardness-aware Approximation / V-SLAM Häcklöparen : Ett Snabbare V-SLAM System med Online semantisk Dynamisk-och-Hårdhetsmedveten Approximation

Mingxuan, Liu January 2022 (has links)
Visual Simultaneous Localization And Mapping (V-SLAM) and object detection algorithms are two critical prerequisites for modern XR applications. V-SLAM allows XR devices to geometrically map the environment and localize themselves within it, simultaneously. Furthermore, object detectors based on Deep Neural Networks (DNNs) can be used to semantically understand what the features in the environment represent. However, both of these algorithms are computationally expensive, which makes it challenging for them to achieve good real-time performance on device. In this thesis, we first present TensorRT Quantized YOLOv4 (TRTQ-YOLOv4), a faster implementation of the YOLOv4 architecture [1] using FP16 reduced precision and INT8 quantization, powered by the NVIDIA TensorRT framework [2]. Second, we propose the V-SLAM Hurdler: a faster V-SLAM system using online dynamic-and-hardness-aware approximation. The proposed system integrates the base RGB-D V-SLAM, ORB-SLAM3 [3], with the INT8 TRTQ-YOLOv4 object detector, a novel Entropy-based Degree-of-Difficulty Estimator, an Online Hardness-aware Approximation Controller, and a Dynamic Object Eraser, applying online dynamic-and-hardness-aware approximation to the base V-SLAM system at runtime while increasing its robustness in dynamic scenes. We first evaluate the proposed object detector on a public object detection dataset. The FP16-precision TRTQ-YOLOv4 runs 2× faster than the full-precision model without loss of accuracy, while the INT8-quantized TRTQ-YOLOv4 is almost 3× faster than the full-precision one with only a 0.024 loss in mAP@50:5:95. Second, we evaluate our proposed V-SLAM system on a public RGB-D SLAM dataset. In static scenes, the proposed system speeds up the base V-SLAM system by 21.2% on average with only a 0.7% loss of accuracy. In dynamic scenes, it not only accelerates the base system by 23.5% but also improves its accuracy by 89.3%, making it as robust as in static scenes. Lastly, a comparison against state-of-the-art SLAM systems designed for dynamic environments shows that our system outperforms most of the compared methods in highly dynamic scenes.
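The abstract leaves the Entropy-based Degree-of-Difficulty Estimator unspecified; one plausible reading is that the entropy of the detector's class-confidence distributions gauges how hard a frame is. A minimal sketch under that assumption — all names here, including `frame_difficulty`, are illustrative, not the thesis implementation:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def frame_difficulty(detections, num_classes):
    """Mean per-detection class-confidence entropy, normalized to [0, 1].
    detections: one class-probability vector (e.g., a softmax output) per
    detected object. Near-uniform probabilities -> entropy is maximal and
    the frame counts as hard; peaked probabilities -> the frame is easy."""
    if not detections:
        return 0.0
    max_entropy = math.log2(num_classes)  # entropy of the uniform distribution
    mean = sum(shannon_entropy(p) for p in detections) / len(detections)
    return mean / max_entropy

# Two confident detections and one ambiguous one:
dets = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.34, 0.33, 0.33]]
print(f"difficulty = {frame_difficulty(dets, num_classes=3):.2f}")
```

A runtime controller in the spirit of the one described could then apply stronger approximation to the V-SLAM pipeline when frames score as easy, and back off when they score as hard.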
32

[en] ADAPTIVE RELAXED SYNCHRONIZATION THROUGH THE USE OF SUPERVISED LEARNING METHODS / [pt] RELAXAMENTO ADAPTATIVO DA SINCRONIZAÇÃO ATRAVÉS DO USO DE MÉTODOS DE APRENDIZAGEM SUPERVISIONADA

ANDRE LUIS CAVALCANTI BUENO 31 July 2018 (has links)
[en] Parallel computing systems have become pervasive, being used to interact with the physical world and to process large amounts of data from various sources. The continuous improvement of computational performance is therefore essential to keep up with the increasing rate at which information needs to be processed. Some of these applications admit lower quality in the final result in exchange for increased execution performance. This work evaluates the feasibility of using supervised learning methods to ensure that the Relaxed Synchronization technique, used to increase execution performance, provides results within acceptable error limits. To do so, we created a methodology that uses some of the input data to assemble test cases that, when executed, provide representative input values for training supervised learning methods. This way, when the user runs the application (in the same training environment) with a new input, the trained classification algorithm suggests the relaxed-synchronization factor best suited to the application/input/execution-environment triple. We used this methodology in some well-known parallel applications and showed that, by combining Relaxed Synchronization with supervised learning methods, it was possible to stay within the agreed maximum error rate. In addition, we evaluated the performance gain obtained with this technique for a number of scenarios in each application.
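As a rough illustration of the train-then-suggest workflow described above (the feature set, labels, and model choice here are assumptions, not the thesis's pipeline), a classifier fitted on profiled test cases can map input characteristics to the largest relaxation factor whose observed error stayed within the agreed bound:

```python
from sklearn.tree import DecisionTreeClassifier

# Offline phase: each profiled test case pairs input features (here,
# hypothetically, problem size and data sparsity) with the largest
# synchronization-relaxation factor that kept the final error within
# the agreed bound in this execution environment.
X_train = [[1_000, 0.10], [1_000, 0.90], [50_000, 0.10], [50_000, 0.90]]
y_train = [4, 2, 2, 1]  # relaxation-factor labels (placeholder values)

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Online phase: for a new input in the same environment, the trained
# model suggests a relaxation factor before the parallel run starts.
factor = clf.predict([[20_000, 0.50]])[0]
print(f"suggested relaxation factor: {factor}")
```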
33

Automatický multikriteriální paralelní evoluční návrh a aproximace obvodů / Automated Multi-Objective Parallel Evolutionary Circuit Design and Approximation

Hrbáček, Radek Unknown Date (has links)
Power consumption and energy efficiency are becoming some of the most important parameters in computer system design, mainly because of the limited power budgets of battery-powered devices and the very high energy consumption of growing data centers and cloud infrastructure. At the same time, in a growing number of applications users are willing to tolerate inexact or erroneous computations to some degree, thanks to the imperfections of human senses, the statistical nature of many computations, noise in input data, and so on. Approximate computing, an emerging research area in computer engineering, exploits this relaxation of functional requirements to increase the efficiency of computer systems in terms of energy consumption, performance, or complexity. Error-tolerant applications can be implemented more efficiently and still serve their purpose with the same or only slightly reduced quality. Although new methods for designing approximate computing systems are emerging, there is still a lack of automated design methods that offer a wide range of trade-off solutions for a given task. Moreover, conventional methods often produce solutions that are far from optimal. Evolutionary algorithms deliver innovative solutions to complex optimization and design problems, but they suffer from several drawbacks, such as poor scalability and the high number of generations needed to reach competitive results. Multi-objective design is particularly desirable for approximate computing, yet most existing methods do not support it. This thesis introduces a new automated multi-objective parallel evolutionary algorithm for the design and approximation of digital circuits. The method is based on Cartesian genetic programming; to improve scalability, a new highly parallelized implementation was developed. The multi-objective design follows the principles of the NSGA-II algorithm. The performance of the implementation was evaluated on several different tasks, namely the design of (approximate) arithmetic circuits, Boolean functions with high nonlinearity, and approximate logic circuits for triple modular redundancy. In these tasks, significant improvements were achieved compared with the state of the art.
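To make the multi-objective element concrete: NSGA-II ranks candidate circuits by Pareto dominance over objectives such as error and power, and returns a whole front of trade-off solutions rather than a single circuit. A minimal dominance filter (the CGP encoding, crowding distance, and the rest of the NSGA-II machinery are omitted here):

```python
def dominates(a, b):
    """Circuit a dominates b if it is no worse in every objective
    (lower error, lower power) and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(population):
    """Return the candidates not dominated by any other candidate."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]

# (error, power) pairs for candidate approximate circuits -- toy values.
candidates = [(0.00, 9.0), (0.01, 5.0), (0.02, 4.8), (0.02, 7.0), (0.05, 3.0)]
print(pareto_front(candidates))  # the trade-off curve offered to the designer
```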
34

Aproximativní implementace aritmetických operací v obrazových filtrech / Approximate Implementation of Arithmetic Operations in Image Filters

Válek, Matěj January 2021 (has links)
This master's thesis deals with the approximate implementation of arithmetic operations in image filters, in particular with the use of approximation techniques to modify how multiplication is performed in a non-trivial image filter. Several techniques are employed, such as converting floating-point multiplication to fixed-point multiplication, and using evolutionary algorithms, especially Cartesian genetic programming, to create new approximate multipliers that exhibit an acceptable error while reducing the computational cost of filtering. The results are evolutionarily designed approximate multipliers that take the data distribution in the image filter into account, their deployment in the image filter, and a comparison of the original filter with the approximated filter on a set of color images.
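A minimal sketch of the float-to-fixed-point conversion mentioned above (the Q8 format and bit width are assumptions; the thesis pairs this idea with evolved approximate multipliers tuned to the filter's data distribution):

```python
FRAC_BITS = 8  # assumed Q8 format: 8 fractional bits

def to_fixed(x: float) -> int:
    """Quantize a float to a Q8 fixed-point integer."""
    return int(round(x * (1 << FRAC_BITS)))

def fixed_mul(a: int, b: int) -> int:
    """Fixed-point multiply: one integer multiply, then drop excess fraction bits."""
    return (a * b) >> FRAC_BITS

def to_float(x: int) -> float:
    return x / (1 << FRAC_BITS)

# A filter kernel weight times a pixel intensity, in fixed point:
w, px = to_fixed(0.25), to_fixed(0.59)
print(to_float(fixed_mul(w, px)))  # 0.14453125 (exact product: 0.1475)
```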
35

Analyse de fiabilité de circuits logiques et de mémoire basés sur dispositif spintronique / Reliability analysis of spintronic device based logic and memory circuits

Wang, You 13 February 2017 (has links)
Spin transfer torque magnetic tunnel junction (STT-MTJ) has been considered a promising candidate for the next generation of non-volatile memories and logic circuits, because it provides a solution for overcoming the bottleneck of increasing static power caused by CMOS technology scaling. However, its commercialization is limited by poor reliability, which deteriorates severely as the device scales down. This thesis focuses on the reliability of MTJ-based non-volatile logic and memory circuits. First, a compact model of the MTJ including the main reliability issues is proposed and validated by comparison with experimental data. Based on this accurate model, the reliability of typical circuits is analyzed and a reliability-optimization methodology is proposed. Finally, the stochastic switching behavior is exploited in several new designs of conventional applications.
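For a flavor of the stochastic switching behavior the last sentence refers to: STT-MTJ compact models commonly describe sub-critical-current switching as a thermally activated (Néel-Brown-type) process. The sketch below samples that model with placeholder parameters, not fitted values from the thesis:

```python
import math
import random

def switch_probability(t_pulse, current, i_c0=1.0, delta=40.0, tau0=1e-9):
    """P(switch) within a pulse of t_pulse seconds at current < i_c0, using a
    thermally activated model: tau = tau0 * exp(delta * (1 - I/Ic0)),
    P = 1 - exp(-t/tau). delta is the thermal stability factor; all
    parameter values here are illustrative placeholders."""
    tau = tau0 * math.exp(delta * (1.0 - current / i_c0))
    return 1.0 - math.exp(-t_pulse / tau)

def monte_carlo_switches(n_trials, t_pulse, current):
    """Count how many of n_trials independent write attempts succeed."""
    p = switch_probability(t_pulse, current)
    return sum(random.random() < p for _ in range(n_trials))

p = switch_probability(t_pulse=10e-9, current=0.9)
print(f"P(switch) ~ {p:.3f}; {monte_carlo_switches(10_000, 10e-9, 0.9)}/10000 switched")
```

Under this model a longer pulse or a higher current raises the switching probability, which is exactly the trade-off that write-reliability optimization has to navigate.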
36

Design, Analysis, and Applications of Approximate Arithmetic Modules

Ullah, Salim 06 April 2022 (has links)
From the initial computing machines, Colossus of 1943 and ENIAC of 1945, to modern high-performance data centers and the Internet of Things (IoT), four design goals, i.e., high performance, energy efficiency, resource utilization, and ease of programmability, have remained a beacon of development for the computing industry. During this period, the computing industry has exploited the advantages of technology scaling and microarchitectural enhancements to achieve these goals. However, with the end of Dennard scaling, these techniques have diminishing energy and performance advantages. Therefore, it is necessary to explore alternative techniques for satisfying the computational and energy requirements of modern applications. Towards this end, one promising technique is analyzing and surrendering the strict notion of correctness in various layers of the computation stack. Most modern applications across the computing spectrum---from data centers to IoT devices---interact with and analyze real-world data and take decisions accordingly. These applications are broadly classified as Recognition, Mining, and Synthesis (RMS). Instead of producing a single golden answer, these applications produce several feasible answers, and they possess an inherent error-resilience to the inexactness of processed data and corresponding operations. Utilizing this inherent error-resilience, the paradigm of Approximate Computing relaxes the strict notion of computation correctness to realize high-performance and energy-efficient systems with acceptable quality outputs. Prior works on circuit-level approximations have mainly focused on Application-specific Integrated Circuits (ASICs). However, ASIC-based solutions suffer from long time-to-market and high-cost development cycles. These limitations of ASICs can be overcome by utilizing the reconfigurable nature of Field Programmable Gate Arrays (FPGAs). However, due to architectural differences between ASICs and FPGAs, the utilization of ASIC-based approximation techniques for FPGA-based systems does not result in proportional performance and energy gains. Therefore, to exploit the principles of approximate computing for FPGA-based hardware accelerators for error-resilient applications, FPGA-optimized approximation techniques are required. Further, most state-of-the-art approximate arithmetic operators do not have a generic approximation methodology to implement new approximate designs for an application's changing accuracy and performance requirements. These works also lack a methodology where a machine learning model can be used to correlate an approximate operator with its impact on the output quality of an application. This thesis focuses on these research challenges by designing and exploring FPGA-optimized logic-based approximate arithmetic operators. As multiplication is one of the most computationally complex and most frequently used arithmetic operations in various modern applications, such as Artificial Neural Networks (ANNs), we have therefore considered it for most of the proposed approximation techniques in this thesis. The primary focus of the work is to provide a framework for generating FPGA-optimized approximate arithmetic operators and efficient techniques to explore approximate operators for implementing hardware accelerators for error-resilient applications. Towards this end, we first present various designs of resource-optimized, high-performance, and energy-efficient accurate multipliers.
Although modern FPGAs host high-performance DSP blocks to perform multiplication and other arithmetic operations, our analysis and results show that the orthogonal approach of having resource-efficient and high-performance multipliers is necessary for implementing high-performance accelerators. Due to the differences in the type of data processed by various applications, the thesis presents individual designs for unsigned, signed, and constant multipliers. Compared to the multiplier IPs provided by the FPGA synthesis tool, our proposed designs provide significant performance gains. We then explore the designed accurate multipliers and provide a library of approximate unsigned/signed multipliers. The proposed approximations target the reduction of the total utilized resources, critical path delay, and energy consumption of the multipliers. We have explored various statistical error metrics to characterize the approximation-induced accuracy degradation of the approximate multipliers. We have also utilized the designed multipliers in various error-resilient applications to evaluate their impact on applications' output quality and performance. Based on our analysis of the designed approximate multipliers, we identify the need for a framework to design application-specific approximate arithmetic operators. An application-specific approximate arithmetic operator intends to implement only the logic that can satisfy the application's overall output accuracy and performance constraints. Towards this end, we present a generic design methodology for implementing FPGA-based application-specific approximate arithmetic operators from their accurate implementations according to the applications' accuracy and performance requirements. In this regard, we utilize various machine learning models to identify feasible approximate arithmetic configurations for various applications. We also utilize different machine learning models and optimization techniques to efficiently explore the large design space of individual operators and their utilization in various applications. In this thesis, we have used the proposed methodology to design approximate adders and multipliers. This thesis also explores other layers of the computation stack (cross-layer) for possible approximations to satisfy an application's accuracy and performance requirements. Towards this end, we first present a low bit-width and highly accurate quantization scheme for pre-trained Deep Neural Networks (DNNs). The proposed quantization scheme does not require re-training (fine-tuning of the parameters) after quantization. We also present a resource-efficient FPGA-based multiplier that utilizes our proposed quantization scheme. Finally, we present a framework to allow the intelligent exploration and highly accurate identification of the feasible design points in the large design space enabled by cross-layer approximations. The proposed framework utilizes a novel Polynomial Regression (PR)-based method to model approximate arithmetic operators. The PR-based representation enables machine learning models to better correlate an approximate operator's coefficients with their impact on an application's output quality. (A small illustrative sketch follows the chapter list below.)

Table of contents:
1. Introduction: 1.1 Inherent Error Resilience of Applications; 1.2 Approximate Computing Paradigm (1.2.1 Software Layer Approximation; 1.2.2 Architecture Layer Approximation; 1.2.3 Circuit Layer Approximation); 1.3 Problem Statement; 1.4 Focus of the Thesis; 1.5 Key Contributions and Thesis Overview
2. Preliminaries: 2.1 Xilinx FPGA Slice Structure; 2.2 Multiplication Algorithms (2.2.1 Baugh-Wooley's Multiplication Algorithm; 2.2.2 Booth's Multiplication Algorithm; 2.2.3 Sign Extension for Booth's Multiplier); 2.3 Statistical Error Metrics; 2.4 Design Space Exploration and Optimization Techniques (2.4.1 Genetic Algorithm; 2.4.2 Bayesian Optimization); 2.5 Artificial Neural Networks
3. Accurate Multipliers: 3.1 Introduction; 3.2 Related Work; 3.3 Unsigned Multiplier Architecture; 3.4 Motivation for Signed Multipliers; 3.5 Baugh-Wooley's Multiplier; 3.6 Booth's Algorithm-based Signed Multipliers (3.6.1 Booth-Mult Design; 3.6.2 Booth-Opt Design; 3.6.3 Booth-Par Design); 3.7 Constant Multipliers; 3.8 Results and Discussion (3.8.1 Experimental Setup and Tool Flow; 3.8.2 Performance comparison of the proposed accurate unsigned multiplier; 3.8.3 Performance comparison of the proposed accurate signed multiplier with the state-of-the-art accurate multipliers; 3.8.4 Performance comparison of the proposed constant multiplier with the state-of-the-art accurate multipliers); 3.9 Conclusion
4. Approximate Multipliers: 4.1 Introduction; 4.2 Related Work; 4.3 Unsigned Approximate Multipliers (4.3.1 Approximate 4 × 4 Multiplier (Approx-1); 4.3.2 Approximate 4 × 4 Multiplier (Approx-2); 4.3.3 Approximate 4 × 4 Multiplier (Approx-3)); 4.4 Designing Higher Order Approximate Unsigned Multipliers (4.4.1 Accurate Adders for Implementing 8 × 8 Approximate Multipliers from 4 × 4 Approximate Multipliers; 4.4.2 Approximate Adders for Implementing Higher-order Approximate Multipliers); 4.5 Approximate Signed Multipliers (Booth-Approx); 4.6 Results and Discussion (4.6.1 Experimental Setup and Tool Flow; 4.6.2 Evaluation of the Proposed Approximate Unsigned Multipliers; 4.6.3 Evaluation of the Proposed Approximate Signed Multiplier); 4.7 Conclusion
5. Designing Application-specific Approximate Operators: 5.1 Introduction; 5.2 Related Work; 5.3 Modeling Approximate Arithmetic Operators (5.3.1 Accurate Multiplier Design; 5.3.2 Approximation Methodology; 5.3.3 Approximate Adders); 5.4 DSE for FPGA-based Approximate Operators Synthesis (5.4.1 DSE using Bayesian Optimization; 5.4.2 MOEA-based Optimization; 5.4.3 Machine Learning Models for DSE); 5.5 Results and Discussion (5.5.1 Experimental Setup and Tool Flow; 5.5.2 Accuracy-Performance Analysis of Approximate Adders; 5.5.3 Accuracy-Performance Analysis of Approximate Multipliers; 5.5.4 AppAxO MBO; 5.5.5 ML Modeling; 5.5.6 DSE using ML Models; 5.5.7 Proposed Approximate Operators); 5.6 Conclusion
6. Quantization of Pre-trained Deep Neural Networks: 6.1 Introduction; 6.2 Related Work (6.2.1 Commonly Used Quantization Techniques); 6.3 Proposed Quantization Techniques (6.3.1 L2L: Log_2_Lead Quantization; 6.3.2 ALigN: Adaptive Log_2_Lead Quantization; 6.3.3 Quantitative Analysis of the Proposed Quantization Schemes; 6.3.4 Proposed Quantization Technique-based Multiplier); 6.4 Results and Discussion (6.4.1 Experimental Setup and Tool Flow; 6.4.2 Image Classification; 6.4.3 Semantic Segmentation; 6.4.4 Hardware Implementation Results); 6.5 Conclusion
7. A Framework for Cross-layer Approximations: 7.1 Introduction; 7.2 Related Work; 7.3 Error-analysis of approximate arithmetic units (7.3.1 Application Independent Error-analysis of Approximate Multipliers; 7.3.2 Application Specific Error Analysis); 7.4 Accelerator Performance Estimation; 7.5 DSE Methodology; 7.6 Results and Discussion (7.6.1 Experimental Setup and Tool Flow; 7.6.2 Behavioral Analysis; 7.6.3 Accelerator Performance Estimation; 7.6.4 DSE Performance); 7.7 Conclusion
8. Conclusions and Future Work
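As a deliberately small instance of the kind of approximate multiplier and statistical error metrics the abstract and chapter list refer to (the thesis's FPGA-optimized designs are structurally different; operand truncation here is just the simplest illustrative approximation):

```python
def approx_mul(a: int, b: int, drop_bits: int = 2) -> int:
    """Approximate multiply: zero the drop_bits least-significant bits of one
    operand, i.e., skip its cheapest partial-product rows."""
    return ((a >> drop_bits) << drop_bits) * b

def error_metrics(bits: int = 4, drop_bits: int = 2):
    """Exhaustive statistical error metrics over all bits-wide operand pairs
    (feasible for a 4 x 4 multiplier: only 256 input combinations)."""
    n = 1 << bits
    dists = [abs(a * b - approx_mul(a, b, drop_bits))
             for a in range(n) for b in range(n)]
    mean_ed = sum(dists) / len(dists)        # mean error distance
    max_ed = max(dists)                      # worst-case error distance
    error_rate = sum(d > 0 for d in dists) / len(dists)
    return mean_ed, max_ed, error_rate

print("mean ED = %.2f, max ED = %d, error rate = %.2f" % error_metrics())
# -> mean ED = 11.25, max ED = 45, error rate = 0.70
```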
37

Metodologie pro automatický návrh nízkopříkonových aproximativních obvodů / Automated Design Methodology for Approximate Low Power Circuits

Mrázek, Vojtěch January 2018 (has links)
The spread of modern embedded and battery-powered mobile systems has raised the importance of power consumption in the design of these systems. Although modern design techniques optimize for power, the electrical consumption of these circuits keeps growing with their complexity. There is, however, a whole class of applications that do not need a fully accurate output. This has given rise to a technique known as approximate computing, which makes it possible to significantly reduce the power consumption of circuits at the cost of introducing a small error into the computation. This work focuses on the use of evolutionary algorithms in this area. Although these algorithms have already been applied successfully to the synthesis of both exact and approximate circuits, they face scalability problems — the ability to approximate complex circuits. The goal of this dissertation is to show that approximate logic synthesis based on genetic programming can achieve an excellent trade-off between power consumption and error. Four different applications were analyzed at three levels of description. Using Cartesian genetic programming with a modified representation, we reduced the power consumption of small circuits described at the transistor level, usable, for example, in a technology library. We further introduced a new method for approximating arithmetic circuits, such as adders and multipliers, described at the gate level. Moreover, by employing formal verification methods, the whole design process can guarantee a specified approximation error. These circuits were used to significantly reduce power consumption in neural networks for image recognition and in the discrete cosine transform of an HEVC encoder. Using a new error metric independent of the input data distribution, we designed complex approximate median filters suitable for signal processing. The dissertation represents a comprehensive methodology for the design of approximate circuits at different levels of description that, moreover, guarantees that the specified approximation error is not exceeded.
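The guaranteed-error claim above rests on formal verification in the thesis; for narrow operand widths the same worst-case question can be settled by exhaustive checking, as this toy bound on a lower-part-OR adder shows (the LOA is a well-known approximate adder from the literature, used here purely as an illustration, not one of the thesis's circuits):

```python
def loa_add(a: int, b: int, k: int = 2) -> int:
    """Lower-part-OR adder (LOA): OR the k low bits instead of adding them,
    add the high parts exactly, and drop the carry out of the low part."""
    low_mask = (1 << k) - 1
    low = (a & low_mask) | (b & low_mask)
    high = ((a >> k) + (b >> k)) << k
    return high | low

def worst_case_error(bits: int = 6, k: int = 2) -> int:
    """Exhaustively certify the approximation error bound for this width."""
    n = 1 << bits
    return max(abs((a + b) - loa_add(a, b, k))
               for a in range(n) for b in range(n))

print("worst-case error:", worst_case_error())  # 3, i.e. 2**k - 1 for k = 2
```

For realistic bit widths exhaustive simulation stops scaling, which is exactly where the formal methods referenced in the abstract take over.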
38

ENERGY EFFICIENT EDGE INFERENCE SYSTEMS

Soumendu Kumar Ghosh (14060094) 07 August 2023 (has links)
Deep Learning (DL)-based edge intelligence has garnered significant attention in recent years due to the rapid proliferation of the Internet of Things (IoT), embedded, and intelligent systems, collectively termed edge devices. Sensor data streams acquired by these edge devices are processed by a Deep Neural Network (DNN) application that runs on the device itself or in the cloud. However, the high computational complexity and energy consumption of processing DNNs often limit their deployment on these edge inference systems due to limited compute, memory, and energy resources. Furthermore, high costs, strict application latency demands, data privacy, security constraints, and the absence of reliable edge-cloud network connectivity heavily impact edge application efficiency in the case of cloud-assisted DNN inference. Inevitably, performance and energy efficiency are of utmost importance in these edge inference systems, aside from the accuracy of the application. To facilitate energy-efficient edge inference systems running computationally complex DNNs, this dissertation makes three key contributions.

The first contribution adopts a full-system approach to Approximate Computing, a design paradigm that trades off a small degradation in application quality for significant energy savings. Within this context, we present the foundational concepts of AxIS, the first approximate edge inference system that jointly optimizes the constituent subsystems, leading to substantial energy benefits compared to optimization of the individual subsystems. To illustrate the efficacy of this approach, we demonstrate multiple versions of an approximate smart camera system that executes various DNN-based unimodal computer vision applications, showcasing how the sensor, memory, compute, and communication subsystems can all be synergistically approximated for energy-efficient edge inference.

Building on this foundation, the second contribution extends AxIS to multimodal AI, harnessing data from multiple sensor modalities to impart human-like cognitive and perceptual abilities to edge devices. By exploring optimization techniques for multiple sensor modalities and subsystems, this research reveals the impact of synergistic modality-aware optimizations on system-level accuracy-efficiency (AE) trade-offs, culminating in the introduction of SysteMMX, the first AE-scalable cognitive system that allows efficient multimodal inference at the edge. To illustrate the practicality and effectiveness of this approach, we present an in-depth case study centered around a multimodal system that leverages RGB and Depth sensor modalities for image segmentation tasks.

The final contribution focuses on optimizing the performance of an edge-cloud collaborative inference system through intelligent DNN partitioning and computation offloading. We delve into the realm of distributed inference across edge devices and cloud servers, unveiling the challenges associated with finding the optimal partitioning point in DNNs for significant inference latency speedup. To address these challenges, we introduce PArtNNer, a platform-agnostic and adaptive DNN partitioning framework capable of dynamically adapting to changes in communication bandwidth and cloud server load. Unlike existing approaches, PArtNNer does not require pre-characterization of underlying edge computing platforms, making it a versatile and efficient solution for real-world edge-cloud scenarios.

Overall, this thesis provides novel insights, innovative techniques, and intelligent solutions to enable energy-efficient AI at the edge. The contributions presented herein serve as a solid foundation for future researchers to build upon, driving innovation and shaping the trajectory of research in edge AI.
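To make the partitioning problem concrete: given per-layer edge and cloud latencies and the size of each layer's input tensor, the textbook formulation scans every cut point for the lowest end-to-end latency. The sketch below shows only that static formulation with placeholder numbers; PArtNNer's contribution is choosing the point adaptively, without pre-characterizing the platform:

```python
def best_partition(edge_ms, cloud_ms, act_kb, bandwidth_kbps):
    """Choose cut point p: run layers [0..p) on the edge, upload the tensor
    feeding layer p, run layers [p..n) in the cloud (p = n: fully on-device).
    act_kb[i] is the size of the tensor entering layer i (act_kb[0] = input)."""
    n = len(edge_ms)

    def latency(p):
        upload = act_kb[p] / bandwidth_kbps * 1000.0 if p < n else 0.0
        return sum(edge_ms[:p]) + upload + sum(cloud_ms[p:])

    return min(range(n + 1), key=latency)

# Toy 4-layer network: early layers have large activations; cloud is 4x faster.
edge = [10.0, 12.0, 30.0, 25.0]   # per-layer latency on the edge (ms)
cloud = [2.5, 3.0, 7.5, 6.25]     # per-layer latency in the cloud (ms)
acts = [300.0, 200.0, 40.0, 8.0]  # KB entering each layer
print("offload from layer:", best_partition(edge, cloud, acts, bandwidth_kbps=500))
# -> 3: shrink the tensor on-device first, then ship the small activation
```

When bandwidth or server load shifts, the optimal cut point moves, which is the dynamic behavior PArtNNer tracks online.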
