Table of Links

2 BACKGROUND: OMNIDIRECTIONAL 3D OBJECT DETECTION
3.1 Experiment Setup
3.2 Observations
3.3 Summary and Challenges
5 MULTI-BRANCH OMNIDIRECTIONAL 3D OBJECT DETECTION
5.1 Model Design
5.2 Model Adaptation
6.1 Performance Prediction
6.2 Execution Scheduling
8.1 Testbed and Dataset
8.2 Experiment Setup
8.3 Performance
8.4 Robustness
8.5 Component Analysis
8.6 Overhead
5 MULTI-BRANCH OMNIDIRECTIONAL 3D OBJECT DETECTION
We describe a multi-branch design of our omnidirectional 3D detection model. We also explain how the model is modified to fit the given device and latency target.
5.1 Model Design
The goal of our multi-branch model is to allow each region of the surrounding space to be processed with the varying capabilities of the BEV-based 3D detectors detailed in Table 1. Instead of simply keeping all model types loaded in memory, we modularized the models and combined their core modules, eliminating redundant ones. The model follows the baseline pipeline of BEV-based 3D detection illustrated in Figure 2. However, as shown in Figure 7, an image can be processed by one of several modules at each inference stage. Each branch within the model represents a unique execution path for an image, comprising a sequence of module decisions across all stages. We detail the branch design following the order of the baseline pipeline.

The initial stage of BEV feature generation extracts 2D feature maps from all multi-view images. At this stage, each image can be processed through one of four backbone networks, each differentiated by its capacity and input resolution, as detailed in Table 1. From the extracted 2D feature maps, the model infers 3D depth information in each camera's coordinate frame. To enable adjustment of the quality and latency of depth prediction, our model is equipped with two types of DepthNets. One is a simple 1-layer convolutional network trained using only object depth loss, which provides sparse supervision signals. The other is a deep convolutional network trained on dense depth maps derived from 3D point clouds. As illustrated in Figure 8, the densely supervised DepthNet generates an accurate and detailed depth map from an image. The depth prediction quality is further increased by incorporating a large-scale backbone network with high-resolution images, providing a clear distinction between foreground objects and the background.
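As a rough sketch of this per-view routing (our own simplification; the module names, channel sizes, input resolution, and number of depth bins below are assumptions rather than the paper's code), a single image could be passed through one selected backbone and one selected DepthNet as follows:

```python
# Minimal sketch of per-view module selection at this stage. Module names,
# channel sizes, and the number of depth bins are illustrative assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn

class PerViewStage(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder "backbones" of different capacity; the paper pairs
        # ResNet variants with different input resolutions (Table 1).
        self.backbones = nn.ModuleDict({
            "small": nn.Conv2d(3, 64, 3, stride=2, padding=1),
            "large": nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1),
                                   nn.ReLU(),
                                   nn.Conv2d(64, 64, 3, padding=1)),
        })
        # Two DepthNet options: a shallow head (sparse object-depth
        # supervision) and a deeper head (dense point-cloud supervision).
        self.depthnets = nn.ModuleDict({
            "sparse": nn.Conv2d(64, 60, 1),
            "dense": nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                                   nn.ReLU(),
                                   nn.Conv2d(64, 60, 1)),
        })

    def forward(self, image: torch.Tensor, backbone: str, depthnet: str):
        feat = self.backbones[backbone](image)                  # 2D feature map
        depth = self.depthnets[depthnet](feat).softmax(dim=1)   # per-pixel depth distribution
        return feat, depth

stage = PerViewStage()
feat, depth = stage(torch.randn(1, 3, 256, 448), backbone="large", depthnet="dense")
```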
BEV feature maps are generated by projecting the extracted 2D feature map into a unified grid space using the predicted depth and camera parameters. Fusing the generated BEV features over time enhances the robustness of perception by providing contiguous observations of objects. As shown in Figure 7, we developed a module that can fuse the BEV features extracted from consecutive images of each camera view $i$. Due to the possible movements of the camera (or a robot carrying the camera), directly fusing the BEV features may result in misalignment of spatial features over time, greatly reducing the accuracy gain. Therefore, the module aligns the last frame's BEV feature map $F^{i}_{t-1}$ into the current frame's camera coordinate system. The aligned BEV feature map $F^{\prime i}_{t-1 \to t}$ is acquired as follows:

$$F^{\prime i}_{t-1 \to t} = \mathcal{T}\left(F^{i}_{t-1},\, P_{t-1 \to t}\right)$$
where $\mathcal{T}$ is the operation of transforming the spatial locations of features in the BEV grid using the relative camera pose, i.e., the camera motion $P_{t-1 \to t}$. To estimate $P_{t-1 \to t}$, we employ a neural network that uses inertial sensor data and consecutive images. At runtime, Panopticus buffers the previous image $I^{i}_{t-1}$ and $F^{i}_{t-1}$ of each camera view $i$. Since the prediction results of $P_{t-1 \to t}$ are consistent across camera views, the model predicts it once per frame, allowing the predicted $P_{t-1 \to t}$ to be shared with the alignment operations for the other views.
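One plausible way to realize the alignment operator $\mathcal{T}$ is an affine warp of the previous BEV grid using the planar component of the relative pose. The sketch below is our own assumed implementation, not the paper's; the grid extent, axis mapping, and pose-sign convention are illustrative:

```python
# Minimal sketch of the alignment operator T, assuming a planar relative
# motion (yaw + x/y translation) and a square BEV grid spanning
# [-grid_extent_m, grid_extent_m] meters; not the paper's implementation.
import math
import torch
import torch.nn.functional as F

def align_bev(prev_bev: torch.Tensor, yaw: float, tx: float, ty: float,
              grid_extent_m: float = 51.2) -> torch.Tensor:
    """Warp the previous BEV feature map (B, C, H, W) into the current frame."""
    b = prev_bev.size(0)
    # Normalize the metric translation to the [-1, 1] coordinates used by
    # affine_grid / grid_sample. Which BEV axis maps to the grid's x or y,
    # and whether the pose or its inverse is applied, depends on convention.
    tx_n, ty_n = tx / grid_extent_m, ty / grid_extent_m
    cos, sin = math.cos(yaw), math.sin(yaw)
    theta = torch.tensor([[cos, -sin, tx_n],
                          [sin,  cos, ty_n]], dtype=prev_bev.dtype)
    grid = F.affine_grid(theta.repeat(b, 1, 1), prev_bev.shape, align_corners=False)
    # Locations that fall outside the previous grid are zero-padded.
    return F.grid_sample(prev_bev, grid, align_corners=False, padding_mode="zeros")

aligned = align_bev(torch.randn(1, 64, 128, 128), yaw=0.02, tx=0.5, ty=0.0)
```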
After aggregating $F^{i}_{t}$ and the aligned $F^{\prime i}_{t-1 \to t}$ across all camera views, the model concatenates the integrated BEV feature maps, which are then fed into the BEV head. The resulting 3D bounding boxes are post-processed using non-maximum suppression (NMS) to remove duplicates. Panopticus utilizes a 3D object tracker to keep track of the latest states of detected objects. We employed a 3D Kalman filter that can efficiently forecast and update the status of tracked objects. Our model features a branch that outputs the predicted states of tracked objects, skipping new detection for the target camera view. The future state $\mathbf{x}^{\prime}_{t}$ of a tracked object at the incoming time $t$ is predicted as follows:

$$\mathbf{x}^{\prime}_{t} = A_{t-1}\,\mathbf{x}_{t-1}$$
where $A_{t-1}$ and $\mathbf{x}_{t-1}$ are the state transition model and the object state vector at time $t-1$, respectively. $\mathbf{x}_{t-1}$ is parameterized by the object's 3D location $(x, y, z)$, velocity $(v_x, v_y, v_z)$, and size $(w, h, l)$. Future state estimation of the target object using $A_{t-1}$ simply applies the velocity predicted by the BEV head to its location, which is processed instantaneously. Overall, the image from each camera view can be processed by one of 17 branches: 16 detection branches and 1 lightweight tracker branch utilizing objects' predicted states.
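Under the constant-velocity model described above, the prediction step amounts to advancing each tracked object's location by its velocity over the frame interval. The snippet below sketches this step; the state layout and time step are assumptions, and a complete Kalman filter would also propagate covariance and perform measurement updates:

```python
# Minimal sketch of the tracker branch's prediction step, x'_t = A_{t-1} x_{t-1},
# under a constant-velocity model. The state layout and time step are assumed,
# and a full Kalman filter would also propagate covariance and run updates.
import numpy as np

def predict_state(x_prev: np.ndarray, dt: float) -> np.ndarray:
    """x_prev = [x, y, z, vx, vy, vz, w, h, l]; advance location by velocity."""
    A = np.eye(9)
    A[0, 3] = A[1, 4] = A[2, 5] = dt    # location += velocity * dt
    return A @ x_prev

x_prev = np.array([10.0, 2.0, 0.5, 1.5, 0.0, 0.0, 1.8, 1.6, 4.2])
x_pred = predict_state(x_prev, dt=0.1)  # e.g., a 10 Hz camera stream
```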
5.2 Model Adaptation
Panopticus supports offline model adaptation to meet the memory and latency constraints of a given device. To achieve this, the memory consumption and processing latency of all modules in the model are profiled on the target device. First, the model is adjusted to adhere to the memory constraints. For example, on the Jetson Orin Nano [8] with a limited 6.3 GB of memory, large modules such as the R152 backbone are detached to avoid exceeding the memory capacity. Second, some inference branches require significant computation; in fact, inferring a single image may surpass the target latency. To meet latency requirements, branches whose latency profiles exceed the target limit are also removed. The model, with its number of branches $N$ changed, is then deployed on the target device.
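The sketch below illustrates this offline adaptation under assumed data structures and numbers (the BranchProfile container, the greedy detachment of the largest modules, and the example profiles are ours, not the paper's procedure):

```python
# Minimal sketch of the offline adaptation step. The data structures, numbers,
# and greedy module-detachment heuristic are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class BranchProfile:
    modules: tuple          # e.g., ("R152", "dense_depthnet", "temporal")
    latency_ms: float       # profiled single-image latency on the target device

def adapt(branches, module_memory_mb, memory_budget_mb, latency_target_ms):
    # 1) Detach the largest modules until the remaining set fits in memory.
    kept = set(module_memory_mb)
    for name in sorted(module_memory_mb, key=module_memory_mb.get, reverse=True):
        if sum(module_memory_mb[m] for m in kept) <= memory_budget_mb:
            break
        kept.discard(name)
    # 2) Keep branches whose modules all survived and whose profiled latency
    #    is within the target.
    return [b for b in branches
            if set(b.modules) <= kept and b.latency_ms <= latency_target_ms]

branches = [BranchProfile(("R50", "sparse_depthnet"), 38.0),
            BranchProfile(("R152", "dense_depthnet", "temporal"), 210.0)]
memory = {"R50": 350.0, "R152": 900.0, "sparse_depthnet": 20.0,
          "dense_depthnet": 120.0, "temporal": 60.0}
kept_branches = adapt(branches, memory, memory_budget_mb=600.0, latency_target_ms=100.0)
```

In this toy example, the R152-based branch is dropped both because the backbone does not fit the memory budget and because its profiled latency exceeds the target.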
This paper is available on arXiv under a CC BY 4.0 Deed (Attribution 4.0 International) license.