4. Experiments
For our experiments, we select scenes from the Nvidia Dynamic Scenes Dataset [35]. Scenes in this dataset are captured using a sparse set of 12 stationary cameras arranged in two rows, producing images of resolution 1015×1920. Each static scene we use is a single frame taken from one of the dynamic scenes. For the backbone NeRF, we use the static and dynamic versions of K-Planes [25] implemented in nerfstudio [30]. For each scene, we perform inpainting by replacing a foreground object with another, text-prompted object of different geometry. We demonstrate the effectiveness of our method through qualitative intermediate and final results. In addition, we explain the individual parts of our design through ablations and comparisons with our baseline.
4.1. Qualitative results
3D Examples. We show several 3D inpainting examples in Figure 2. For each inpainting task, we show two renderings of the final NeRF from different views to demonstrate multiview consistency. Additionally, we show the first seed image, another pre-processed image, and the RGB and depth maps at three stages: before training, after warmup training, and after convergence. These before-and-after images demonstrate the efficacy of each stage of our method. As shown in Figure 2, roughly consistent pre-processed images are sufficient to optimize a coarse inpainted NeRF during warmup training, and the geometry (represented by the depth map) converges during this stage. Fine convergence across views is then achieved in the final training stage. All 3D inpainting tasks are trained on a single Nvidia RTX 4090 GPU. Warmup training takes approximately 0.5–1 hour, and the main training stage with IDU takes approximately 1–2 hours.
4D Example. We show a 4D inpainting example in Figure 3 to demonstrate that our method has the potential to generalize to dynamic NeRFs. In this example, we remove the foreground object from the seed-view video using E2FGVI [11], a flow-based method optimized through feature propagation and content hallucination. To transfer motion to the generated object, we first track key points, then estimate a rigid transformation between the tracked key points, and finally propagate the pixels along that transformation. This dynamic scene consists of 16 frames, in which the first frame includes the first seed image. As shown in Figure 3, the generated object converges overall with correct motion in all illustrated frames.
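The motion-transfer step can be illustrated with a minimal sketch. The text does not specify the estimator or whether the transformation is computed in 2D or 3D, so the code below assumes a 2D least-squares rigid fit (the Kabsch/Procrustes solution) on image-space key points. The function names `estimate_rigid_2d` and `propagate_pixels` are hypothetical and not taken from the paper.

```python
import numpy as np


def estimate_rigid_2d(src, dst):
    """Least-squares rotation R and translation t such that dst ≈ src @ R.T + t.

    Assumption: key points are 2D image coordinates of shape (N, 2) with
    one-to-one correspondence between src and dst (e.g. from a tracker).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 2x2 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def propagate_pixels(coords, R, t):
    """Move pixel coordinates of shape (N, 2) along the rigid transformation."""
    return np.asarray(coords, dtype=float) @ R.T + t


if __name__ == "__main__":
    # Synthetic key points in frame 0 and their tracked positions in frame 1.
    angle = np.deg2rad(10.0)
    R_demo = np.array([[np.cos(angle), -np.sin(angle)],
                       [np.sin(angle), np.cos(angle)]])
    kp0 = np.array([[100.0, 50.0], [140.0, 60.0], [120.0, 90.0]])
    kp1 = kp0 @ R_demo.T + np.array([3.0, -1.5])

    R, t = estimate_rigid_2d(kp0, kp1)
    # Pixels belonging to the generated object in frame 0 (made-up values).
    object_pixels = np.array([[110.0, 55.0], [125.0, 70.0]])
    print(propagate_pixels(object_pixels, R, t))
```

In a sketch like this, the transformation would be re-estimated for each frame pair and applied to the object's pixels from the seed frame. Resampling colors at the warped, non-integer locations is left out.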