Multimodal Sensor Fusion with Object Detection Networks for Automated Driving

Object detection is one of the key tasks of environment perception for highly automated vehicles. To achieve a high level of performance and fault tolerance, automated vehicles are equipped with an array of different sensors to observe their environment. Perception systems for automated vehicles usually rely on Bayesian fusion methods to combine information from different sensors late in the perception pipeline, in a highly abstract, low-dimensional representation. More recent research on deep learning object detection proposes fusing information in a higher-dimensional space directly within the convolutional neural networks to significantly increase performance. However, the resulting deep learning architectures violate key non-functional requirements of a real-world safety-critical perception system for a series-production vehicle, notably modularity, fault tolerance and traceability.

This dissertation presents a modular multimodal perception architecture for detecting objects using camera, lidar and radar data that is entirely based on deep learning and that was designed to respect the above requirements. The presented method is applicable to any region-based, two-stage object detection architecture (such as Faster R-CNN by Ren et al.). Information is fused in the high-dimensional feature space of a convolutional neural network. The feature map of a convolutional neural network is shown to be a suitable representation in which to fuse multimodal sensor data, and to be a suitable interface for combining different parts of object detection networks in a modular fashion. The implementation centers around a novel neural network architecture that learns a transformation of feature maps from one sensor modality and input space to another and can thereby map feature representations into a common feature space. It is shown how transformed feature maps from different sensors can be fused in this common feature space to increase object detection performance by up to 10% compared to the unimodal baseline networks. Feature extraction front ends of the architecture are interchangeable, and different sensor modalities can be integrated with little additional training effort. Variants of the presented method are able to predict object distance from monocular camera images and to detect objects from radar data.
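The feature-space fusion described above can be sketched in a minimal form: a learned channel-mixing transformation (a 1x1 convolution, here written as a plain matrix) maps one modality's feature map into the other's feature space, after which the maps are concatenated and mixed back down. All shapes, names and the random weights below are illustrative assumptions for the sake of the sketch, not details taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature-map shapes: (channels, height, width)
C_cam, C_lidar, H, W = 64, 32, 16, 16
f_cam = rng.standard_normal((C_cam, H, W))      # camera front-end output
f_lidar = rng.standard_normal((C_lidar, H, W))  # lidar front-end output

def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution (channel-mixing matrix) at every spatial location."""
    return np.einsum('oc,chw->ohw', weights, feature_map)

# Stand-in for the learned transformation network: maps the lidar feature
# map into the camera's feature space (in practice these weights are trained).
W_transform = rng.standard_normal((C_cam, C_lidar)) * 0.1
f_lidar_mapped = conv1x1(f_lidar, W_transform)  # (C_cam, H, W)

# Fuse in the common feature space: concatenate along the channel axis,
# then mix back down to the original channel count with another 1x1 conv.
fused = np.concatenate([f_cam, f_lidar_mapped], axis=0)  # (2*C_cam, H, W)
W_fuse = rng.standard_normal((C_cam, 2 * C_cam)) * 0.1
f_fused = conv1x1(fused, W_fuse)                         # (C_cam, H, W)

print(f_fused.shape)
```

Because the fused map has the same shape as a unimodal feature map, it can be handed to an unmodified second-stage detection head, which is what makes the feature map a natural modular interface between front ends and detector.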

Results are verified using a large labeled, multimodal automotive dataset created during the course of this dissertation. The processing pipeline and methodology for creating this dataset along with detailed statistics are presented as well.

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:77012
Date: 07 January 2022
Creators: Schröder, Enrico
Contributors: Hamker, F., Masrur, A., Technische Universität Chemnitz, AUDI AG
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: English
Detected Language: English
Type: info:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text
Rights: info:eu-repo/semantics/openAccess