The growing complexity of neural networks makes them challenging to deploy on resource-constrained embedded or mobile devices. With millions of weights and biases, modern deep neural networks have large memory, power and computational requirements. In this thesis, we devise and explore three quantization methods (post-training, in-training and combined quantization) that quantize 32-bit floating-point weights and biases to fixed-point parameters of lower bit width while also achieving significant pruning, leading to model compression. We use the total absolute gradient accumulated over the training process as the indicator of a parameter's importance to the network. The most important parameters are quantized the least. The post-training quantization method sorts and clusters the accumulated gradients of the full parameter set and subsequently assigns a bit width to each cluster. The in-training quantization method sorts the accumulated gradients and divides them into two groups after each training epoch. The larger group, consisting of the parameters with the lowest accumulated gradients, is quantized. The combined quantization method performs in-training quantization followed by post-training quantization. We assume that the quantized parameters are stored in compressed sparse row format, a sparse matrix storage format. On LeNet-300-100 (MNIST dataset), LeNet-5 (MNIST dataset), AlexNet (CIFAR-10 dataset) and VGG-16 (CIFAR-10 dataset), post-training quantization achieves 7.62x, 10.87x, 6.39x and 12.43x compression, in-training quantization achieves 22.08x, 21.05x, 7.95x and 12.71x compression and combined quantization achieves 57.22x, 50.19x, 13.15x and 13.53x compression, respectively. Our methods quantize at the cost of accuracy, and we present our work in the light of the accuracy-compression trade-off.

Master of Science

Neural networks are being employed in many different real-world applications. By learning the complex relationship between the input data and ground-truth output data during the training process, neural networks can predict outputs on new input data obtained in real time. To do so, a typical deep neural network often needs millions of numerical parameters, all of which must be stored in memory. In this research, we explore techniques for reducing the storage requirements for neural network parameters. We propose software methods that convert 32-bit neural network parameters to values that can be stored using fewer bits. Our methods also convert the majority of the numerical parameters to zero. Using special storage methods that only require storage of non-zero parameters, we gain significant compression benefits. On typical benchmarks like LeNet-300-100 (MNIST dataset), LeNet-5 (MNIST dataset), AlexNet (CIFAR-10 dataset) and VGG-16 (CIFAR-10 dataset), our methods can achieve up to 57.22x, 50.19x, 13.15x and 13.53x compression, respectively. Storage benefits are achieved at the cost of classification accuracy, and we present our work in the light of the accuracy-compression trade-off.
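The gradient-based importance measure and the two quantization passes described in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the function names, the fixed-point layout (one sign bit, all remaining bits fractional), the `low_fraction` split, the list of bit widths, and the use of equal-sized groups in place of the clustering step are all assumptions made for illustration.

```python
import numpy as np

def accumulate_abs_gradient(acc_grad, grad):
    """Add the absolute gradient of one training step to the running total."""
    return acc_grad + np.abs(grad)

def quantize_fixed_point(values, bit_width):
    """Round to signed fixed point.

    Assumption: one sign bit and (bit_width - 1) fractional bits, so the
    representable range is [-1, 1 - 2**-(bit_width - 1)].
    """
    frac_bits = bit_width - 1
    scale = 2.0 ** frac_bits
    q = np.clip(np.round(values * scale), -(2 ** frac_bits), 2 ** frac_bits - 1)
    return q / scale

def in_training_quantize(weights, acc_grad, low_fraction=0.75, bit_width=4):
    """Quantize the group of parameters with the lowest accumulated gradients.

    `low_fraction` and `bit_width` are hypothetical settings; weights and
    acc_grad are flat 1-D arrays of equal length.
    """
    order = np.argsort(acc_grad)              # least important first
    low = order[: int(low_fraction * weights.size)]
    out = weights.copy()
    out[low] = quantize_fixed_point(weights[low], bit_width)
    return out

def post_training_quantize(weights, acc_grad, bit_widths=(2, 4, 8)):
    """Assign one bit width per group of sorted accumulated gradients.

    Assumption: equal-sized groups stand in for the clustering step, and
    bit widths increase with importance.
    """
    order = np.argsort(acc_grad)
    out = weights.copy()
    for idx, bits in zip(np.array_split(order, len(bit_widths)), bit_widths):
        out[idx] = quantize_fixed_point(weights[idx], bits)
    return out
```

In this sketch, small weights with low accumulated gradients round to zero at low bit widths, which is one way quantization can also produce the pruning effect the abstract mentions.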
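To make the storage argument concrete, the following sketch estimates a compression ratio for one weight matrix stored in compressed sparse row (CSR) format against dense 32-bit storage. The accounting here is an assumption, not the one used in the thesis: every non-zero value is charged `value_bits`, and every column index and row pointer is charged a hypothetical `index_bits`.

```python
import numpy as np

def to_csr(dense):
    """Build the three CSR arrays (values, column indices, row pointers)."""
    rows, cols = np.nonzero(dense)            # indices in row-major order
    data = dense[rows, cols]
    counts = np.bincount(rows, minlength=dense.shape[0])
    indptr = np.concatenate(([0], np.cumsum(counts)))
    return data, cols, indptr

def compression_ratio(dense, value_bits, index_bits=16, dense_bits=32):
    """Dense storage size divided by CSR storage size, both in bits.

    `index_bits` is an assumed width for column indices and row pointers.
    """
    data, indices, indptr = to_csr(dense)
    csr_bits = (data.size * value_bits
                + indices.size * index_bits
                + indptr.size * index_bits)
    return dense.size * dense_bits / csr_bits
```

Because index storage is charged per non-zero entry, the ratio under this accounting depends on both the bit width of the surviving values and the fraction of parameters driven to zero.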
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/98617 |
Date | 29 May 2020 |
Creators | Gaopande, Meghana Laxmidhar |
Contributors | Electrical and Computer Engineering, Abbott, A. Lynn, Huang, Jia-Bin, Williams, Ryan K. |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |