Deploying neural networks on edge devices and integrating them into our daily lives is attracting growing attention. However, their high computational cost makes many embedded applications challenging. The primary objective of my doctoral studies is to contribute to resolving this dilemma by optimizing neural networks and designing correspondingly efficient neural processing units for edge devices. This work took algorithmic research, specifically the optimization of deep neural networks, as its starting point and then applied the findings to steer the architecture design of Neural Processing Units (NPUs). The optimization of neural network models started with single-precision neural network quantization and progressed to mixed precision. The NPU architecture development followed the algorithmic research findings to achieve hardware/software co-design. Furthermore, a new approach to hardware and software co-development was introduced to expedite the prototyping and performance assessment of NPUs. This approach targets early-stage development: it helps developers focus on the design and optimization of NPUs and significantly shortens the development cycle. In the final project, a machine-learning-based approach was applied to explore and optimize the computational and memory resources of the NPU. The work as a whole covers several different areas, from algorithmic research to hardware design, but all of them aim to improve the inference efficiency of neural networks. Specifically, algorithm optimization aims to reduce the memory footprint and computational cost of neural networks. The NPU design, on the other hand, focuses on improving the utilization of hardware resources. The proposed software and hardware co-development approach shortens the design cycle and speeds up design iteration. The order presented above corresponds to the structure of this dissertation. Each chapter is devoted to one topic and covers relevant research, methodology, and experimental results.
1 Introduction
2 Convolutional Neural Networks
2.1 Convolutional layer
2.1.1 Padding
2.1.2 Convolution
2.1.3 Batch Normalization
2.1.4 Nonlinearity
2.2 Pooling Layer
2.3 Fully Connected Layer
2.4 Characterization
2.4.1 Composition of Operations and Parameters
2.4.2 Arithmetic Intensity
2.5 Optimization
3 Quantization with Double-Stage Squeeze-and-Threshold
3.1 Overview
3.1.1 Binarization
3.1.2 Multi-bit Quantization
3.2 Quantization of Convolutional Neural Networks
3.2.1 Quantization Scheme
3.2.2 Operator fusion of Conv2D
3.3 Activation Quantization with Squeeze-and-Threshold
3.3.1 Double-Stage Squeeze-and-Threshold
3.3.2 Inference Optimization
3.4 Experiment
3.4.1 Ablation Study of Squeeze-and-Threshold
3.4.2 Comparison with State-of-the-art Methods
3.5 Summary
4 Low-Precision Neural Architecture Search
4.1 Overview
4.2 Differentiable Architecture Search
4.2.1 Gumbel Softmax
4.2.2 Disadvantage and Solution
4.3 Low-Precision Differentiable Architecture Search
4.3.1 Convolution Sharing
4.3.2 Forward-and-Backward Scaling
4.3.3 Power Estimation
4.3.4 Architecture of Supernet
4.4 Experiment
4.4.1 Effectiveness of solutions to the dominance problem
4.4.2 Softmax and Gumbel Softmax
4.4.3 Optimizer and Inverted Learning Rate Scheduler
4.4.4 NAS Method Evaluation
4.4.5 Searched Model Analysis
4.4.6 NAS Cost Analysis
4.4.7 NAS Training Analysis
4.5 Summary
5 Configurable Sparse Neural Processing Unit
5.1 Overview
5.2 NPU Architecture
5.2.1 Buffer
5.2.2 Reshapeable Mixed-Precision MAC Array
5.2.3 Sparsity
5.2.4 Post Process Unit
5.3 Mapping
5.3.1 Mixed-Precision MAC
5.3.2 MAC Array
5.3.3 Support of Other Operation
5.3.4 Configurability
5.4 Experiment
5.4.1 Performance Analysis of Runtime Configuration
5.4.2 Roofline Performance Analysis
5.4.3 Mixed-Precision
5.4.4 Comparison with Cortex-M7
5.5 Summary
6 Agile Development and Rapid Design Space Exploration
6.1 Overview
6.1.1 Agile Development
6.1.2 Design Space Exploration
6.2 Agile Development Infrastructure
6.2.1 Chisel Backend
6.2.2 NPU Software Stack
6.3 Modeling and Exploration
6.3.1 Area Modeling
6.3.2 Performance Modeling
6.3.3 Layered Exploration Framework
6.4 Experiment
6.4.1 Efficiency of Agile Development Infrastructure
6.4.2 Effectiveness of Agile Development Infrastructure
6.4.3 Area Modeling
6.4.4 Performance Modeling
6.4.5 Rapid Exploration and Pareto Front
6.5 Summary
7 Summary and Outlook
7.1 Summary
7.2 Outlook
A Appendix of Double-Stage ST Quantization
A.1 Training setting of ResNet-18 in Table 3.3
A.2 Training setting of ReActNet in Table 3.4
A.3 Training setting of ResNet-18 in Table 3.4
A.4 Pseudocode Implementation of Double-Stage ST
B Appendix of Low-Precision Neural Architecture Search
B.1 Low-Precision NAS on CIFAR-10
B.2 Low-Precision NAS on Tiny-ImageNet
B.3 Low-Precision NAS on ImageNet
Bibliography
Identifier | oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:91609 |
Date | 28 May 2024 |
Creators | Wu, Binyi |
Contributors | Mayr, Christian Georg, Benini, Luca, Technische Universität Dresden |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text |
Rights | info:eu-repo/semantics/openAccess |