Autonomous driving requires lightweight and robust perception systems that can rapidly and accurately interpret the complex driving environment. This dissertation investigates the transformative capacity of discrete wavelet transform (DWT), inverse DWT, CNNs, and transformers as foundational elements to develop lightweight perception architectures for autonomous vehicles. The inherent properties of DWT, including its invertibility, sparsity, time-frequency localization, and ability to capture multi-scale information, present an inductive bias. Similarly, transformers capture long-range dependency between features. By harnessing these attributes, novel wavelet-enhanced deep learning architectures are introduced. The first contribution is introducing a lightweight backbone network that can be employed for real-time processing. This network balances processing speed and accuracy, outperforming established models like ResNet-50 and VGG16 in terms of accuracy while remaining computationally efficient. Moreover, a multiresolution attention mechanism is introduced for CNNs to enhance feature extraction. This mechanism directs the network's focus toward crucial features while suppressing less significant ones. Likewise, a transformer model is proposed by leveraging the properties of DWT with vision transformers. The proposed wavelet-based transformer utilizes the convolution theorem in the frequency domain to mitigate the computational burden on vision transformers caused by multi-head self-attention. Furthermore, a proposed wavelet-multiresolution-analysis-based 3D object detection model exploits DWT's invertibility, ensuring comprehensive environmental information capture. Lastly, a multimodal fusion model is presented to use information from multiple sensors. Sensors have limitations, and there is no one-fits-all sensor for specific applications. Therefore, multimodal fusion is proposed to use the best out of different sensors. Using a transformer to capture long-range feature dependencies, this model effectively fuses the depth cues from LiDAR with the rich texture derived from cameras. The multimodal fusion model is a promising approach that integrates backbone networks and transformers to achieve lightweight and competitive results for 3D object detection. Moreover, the proposed model utilizes various network optimization methods, including pruning, quantization, and quantization-aware training, to minimize the computational load while maintaining optimal performance. The experimental results across various datasets for classification networks, attention mechanisms, 3D object detection, and multimodal fusion indicate a promising direction in developing a lightweight and robust perception system for robotics, particularly in autonomous driving.
Identifer | oai:union.ndltd.org:MSSTATE/oai:scholarsjunction.msstate.edu:td-7080 |
Date | 10 May 2024 |
Creators | Alaba, Simegnew Yihunie |
Publisher | Scholars Junction |
Source Sets | Mississippi State University |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | Theses and Dissertations |
Page generated in 0.0047 seconds