Table of Links

2 BACKGROUND: OMNIDIRECTIONAL 3D OBJECT DETECTION
3.1 Experiment Setup
3.2 Observations
3.3 Summary and Challenges
5 MULTI-BRANCH OMNIDIRECTIONAL 3D OBJECT DETECTION
5.1 Model Design
5.2 Model Adaptation
6.1 Performance Prediction
6.2 Execution Scheduling
8.1 Testbed and Dataset
8.2 Experiment Setup
8.3 Performance
8.4 Robustness
8.5 Component Analysis
8.6 Overhead
5 MULTI-BRANCH OMNIDIRECTIONAL 3D OBJECT DETECTION
We describe a multi-branch design of our omnidirectional 3D detection model. We also explain how the model is modified to fit the given device and latency target.
5.1 Model Design
The goal of our multi-branch model is to allow each region of the surrounding space to be processed with the varying capabilities of the BEV-based 3D detectors detailed in Table 1. Instead of simply keeping all model types loaded in memory, we modularized the models and combined their core modules, eliminating redundant ones. The model follows the baseline pipeline of BEV-based 3D detection illustrated in Figure 2. However, as shown in Figure 7, an image can be processed by one of several modules at each inference stage. Each branch within the model represents a unique execution path for an image, comprising a sequence of module decisions across all stages. We detail the branch design following the order of the baseline pipeline.

The initial stage of BEV feature generation extracts 2D feature maps from all multi-view images. At this stage, each image can be processed through one of four backbone networks, each differentiated by its capacity and input resolution, as detailed in Table 1. From the extracted 2D feature maps, the model infers 3D depth information in each camera's coordinate frame. To enable adjustment of the quality and latency of depth prediction, our model is equipped with two types of DepthNets. One is a simple 1-layer convolutional network trained using only object depth loss, which provides sparse supervision signals. The other is a deep convolutional network trained on dense depth maps derived from 3D point clouds. As illustrated in Figure 8, the densely supervised DepthNet generates an accurate and detailed depth map from an image. The depth prediction quality is further increased by incorporating a large-scale backbone network with high-resolution images, providing a clear distinction between foreground objects and the background.
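As a rough sketch of this per-view routing (our own simplification; the module names, channel sizes, input resolution, and number of depth bins below are assumptions rather than the paper's code), a single image could be passed through one selected backbone and one selected DepthNet as follows:

```python
# Minimal sketch of per-view module selection at this stage. Module names,
# channel sizes, and the number of depth bins are illustrative assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn

class PerViewStage(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder "backbones" of different capacity; the paper pairs
        # ResNet variants with different input resolutions (Table 1).
        self.backbones = nn.ModuleDict({
            "small": nn.Conv2d(3, 64, 3, stride=2, padding=1),
            "large": nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1),
                                   nn.ReLU(),
                                   nn.Conv2d(64, 64, 3, padding=1)),
        })
        # Two DepthNet options: a shallow head (sparse object-depth
        # supervision) and a deeper head (dense point-cloud supervision).
        self.depthnets = nn.ModuleDict({
            "sparse": nn.Conv2d(64, 60, 1),
            "dense": nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                                   nn.ReLU(),
                                   nn.Conv2d(64, 60, 1)),
        })

    def forward(self, image: torch.Tensor, backbone: str, depthnet: str):
        feat = self.backbones[backbone](image)                  # 2D feature map
        depth = self.depthnets[depthnet](feat).softmax(dim=1)   # per-pixel depth distribution
        return feat, depth

stage = PerViewStage()
feat, depth = stage(torch.randn(1, 3, 256, 448), backbone="large", depthnet="dense")
```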
BEV feature maps are generated by projecting the extracted 2D feature map into a unified grid space using the predicted depth and camera parameters. Fusing the generated BEV features over time enhances the robustness of perception by providing contiguous observations of objects. As shown in Figure 7, we developed a module that can fuse the BEV features extracted from consecutive images of each camera view $i$. Due to the possible movements of the camera (or a robot carrying the camera), directly fusing the BEV features may result in misalignment of spatial features over time, greatly reducing the accuracy gain. Therefore, the module aligns the last frame's BEV feature map $F^{i}_{t-1}$ into the current frame's camera coordinate system. The aligned BEV feature map $F^{\prime i}_{t-1 \to t}$ is acquired as follows:

$$F^{\prime i}_{t-1 \to t} = \mathcal{T}\left(F^{i}_{t-1},\, P_{t-1 \to t}\right)$$
where $\mathcal{T}$ is the operation of transforming the spatial locations of features in the BEV grid using the relative camera pose, i.e., the camera motion $P_{t-1 \to t}$. To estimate $P_{t-1 \to t}$, we employ a neural network that uses inertial sensor data and consecutive images. At runtime, Panopticus buffers the previous image $I^{i}_{t-1}$ and $F^{i}_{t-1}$ of each camera view $i$. Since the prediction results of $P_{t-1 \to t}$ are consistent across camera views, the model predicts it once per frame, allowing the predicted $P_{t-1 \to t}$ to be shared with the alignment operations for the other views.
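One plausible way to realize the alignment operator $\mathcal{T}$ is an affine warp of the previous BEV grid using the planar component of the relative pose. The sketch below is our own assumed implementation, not the paper's; the grid extent, axis mapping, and pose-sign convention are illustrative:

```python
# Minimal sketch of the alignment operator T, assuming a planar relative
# motion (yaw + x/y translation) and a square BEV grid spanning
# [-grid_extent_m, grid_extent_m] meters; not the paper's implementation.
import math
import torch
import torch.nn.functional as F

def align_bev(prev_bev: torch.Tensor, yaw: float, tx: float, ty: float,
              grid_extent_m: float = 51.2) -> torch.Tensor:
    """Warp the previous BEV feature map (B, C, H, W) into the current frame."""
    b = prev_bev.size(0)
    # Normalize the metric translation to the [-1, 1] coordinates used by
    # affine_grid / grid_sample. Which BEV axis maps to the grid's x or y,
    # and whether the pose or its inverse is applied, depends on convention.
    tx_n, ty_n = tx / grid_extent_m, ty / grid_extent_m
    cos, sin = math.cos(yaw), math.sin(yaw)
    theta = torch.tensor([[cos, -sin, tx_n],
                          [sin,  cos, ty_n]], dtype=prev_bev.dtype)
    grid = F.affine_grid(theta.repeat(b, 1, 1), prev_bev.shape, align_corners=False)
    # Locations that fall outside the previous grid are zero-padded.
    return F.grid_sample(prev_bev, grid, align_corners=False, padding_mode="zeros")

aligned = align_bev(torch.randn(1, 64, 128, 128), yaw=0.02, tx=0.5, ty=0.0)
```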
After aggregating $F^{i}_{t}$ and the aligned $F^{\prime i}_{t-1 \to t}$ across all camera views, the model concatenates the integrated BEV feature maps, which are then fed into the BEV head. The resulting 3D bounding boxes are post-processed using non-maximum suppression (NMS) to remove duplicates. Panopticus utilizes a 3D object tracker to keep track of the latest states of detected objects. We employed a 3D Kalman filter that can efficiently forecast and update the status of tracked objects. Our model features a branch that outputs the predicted states of tracked objects, skipping new detection for the target camera view. The future state $\mathbf{x}^{\prime}_{t}$ of a tracked object at the incoming time $t$ is predicted as follows:

$$\mathbf{x}^{\prime}_{t} = A_{t-1}\,\mathbf{x}_{t-1}$$
where $A_{t-1}$ and $\mathbf{x}_{t-1}$ are the state transition model and the object state vector at time $t-1$, respectively. $\mathbf{x}_{t-1}$ is parameterized by the object's 3D location $(x, y, z)$, velocity $(v_x, v_y, v_z)$, and size $(w, h, l)$. Future state estimation of the target object using $A_{t-1}$ simply applies the velocity predicted by the BEV head to its location, which is processed instantaneously. Overall, the image from each camera view can be processed by one of 17 branches: 16 detection branches and 1 lightweight tracker branch utilizing objects' predicted states.
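Under the constant-velocity model described above, the prediction step amounts to advancing each tracked object's location by its velocity over the frame interval. The snippet below sketches this step; the state layout and time step are assumptions, and a complete Kalman filter would also propagate covariance and perform measurement updates:

```python
# Minimal sketch of the tracker branch's prediction step, x'_t = A_{t-1} x_{t-1},
# under a constant-velocity model. The state layout and time step are assumed,
# and a full Kalman filter would also propagate covariance and run updates.
import numpy as np

def predict_state(x_prev: np.ndarray, dt: float) -> np.ndarray:
    """x_prev = [x, y, z, vx, vy, vz, w, h, l]; advance location by velocity."""
    A = np.eye(9)
    A[0, 3] = A[1, 4] = A[2, 5] = dt    # location += velocity * dt
    return A @ x_prev

x_prev = np.array([10.0, 2.0, 0.5, 1.5, 0.0, 0.0, 1.8, 1.6, 4.2])
x_pred = predict_state(x_prev, dt=0.1)  # e.g., a 10 Hz camera stream
```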
5.2 Model Adaptation
Panopticus supports offline model adaptation to meet the memory and latency constraints of a given device. To achieve this, the memory consumption and processing latency of all modules in the model are profiled on the target device. First, the model is adjusted to adhere to the memory constraints. For example, on the Jetson Orin Nano [8] with a limited 6.3 GB of memory, large modules such as the R152 backbone are detached to avoid exceeding the memory capacity. Second, some inference branches require significant computation; in fact, inferring a single image may surpass the target latency. To meet latency requirements, branches whose latency profiles exceed the target limit are also removed. The model, with its number of branches $N$ changed, is then deployed on the target device.
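The sketch below illustrates this offline adaptation under assumed data structures and numbers (the BranchProfile container, the greedy detachment of the largest modules, and the example profiles are ours, not the paper's procedure):

```python
# Minimal sketch of the offline adaptation step. The data structures, numbers,
# and greedy module-detachment heuristic are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class BranchProfile:
    modules: tuple          # e.g., ("R152", "dense_depthnet", "temporal")
    latency_ms: float       # profiled single-image latency on the target device

def adapt(branches, module_memory_mb, memory_budget_mb, latency_target_ms):
    # 1) Detach the largest modules until the remaining set fits in memory.
    kept = set(module_memory_mb)
    for name in sorted(module_memory_mb, key=module_memory_mb.get, reverse=True):
        if sum(module_memory_mb[m] for m in kept) <= memory_budget_mb:
            break
        kept.discard(name)
    # 2) Keep branches whose modules all survived and whose profiled latency
    #    is within the target.
    return [b for b in branches
            if set(b.modules) <= kept and b.latency_ms <= latency_target_ms]

branches = [BranchProfile(("R50", "sparse_depthnet"), 38.0),
            BranchProfile(("R152", "dense_depthnet", "temporal"), 210.0)]
memory = {"R50": 350.0, "R152": 900.0, "sparse_depthnet": 20.0,
          "dense_depthnet": 120.0, "temporal": 60.0}
kept_branches = adapt(branches, memory, memory_budget_mb=600.0, latency_target_ms=100.0)
```

In this toy example, the R152-based branch is dropped both because the backbone does not fit the memory budget and because its profiled latency exceeds the target.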
This paper is available on arXiv under a CC BY 4.0 Deed (Attribution 4.0 International) license.