Instruct-4DGS: Efficient Dynamic Scene Editing via 4D Gaussian-based Static-Dynamic Separation

Joohyun Kwon1*, Hanbyel Cho2*, Junmo Kim2† (*Equal contribution, †Corresponding author)
1DGIST, South Korea; 2KAIST, South Korea
CVPR 2025

TL;DR: An efficient 4D dynamic scene editing method built on 4D Gaussian Splatting: edits are applied only to the static 3D Gaussians and then refined via score distillation, achieving faster, high-quality edits.

Abstract

Recent 4D dynamic scene editing methods require editing thousands of 2D images used for dynamic scene synthesis and updating the entire scene with additional training loops, resulting in several hours of processing to edit a single dynamic scene. Therefore, these methods are not scalable with respect to the temporal dimension of the dynamic scene (i.e., the number of timesteps). In this work, we propose Instruct-4DGS, an efficient dynamic scene editing method that is more scalable with respect to the temporal dimension. To achieve computational efficiency, we leverage a 4D Gaussian representation that models a 4D dynamic scene by combining static 3D Gaussians with a HexPlane-based deformation field, which captures dynamic information. We then perform editing solely on the static 3D Gaussians, which constitute the minimal yet sufficient component required for visual editing. To resolve the misalignment between the edited 3D Gaussians and the deformation field, which may arise from the editing process, we introduce a refinement stage using a score distillation mechanism. Extensive editing results demonstrate that Instruct-4DGS is efficient, reducing editing time by more than half compared to existing methods while achieving high-quality edits that better follow user instructions.
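For intuition, the static-dynamic separation can be sketched as follows. This is a minimal illustration, not the released implementation: the small MLP below merely stands in for the HexPlane-based deformation field, and all names and shapes are hypothetical.

import torch
import torch.nn as nn

class ToyDeformationField(nn.Module):
    # Stand-in for the HexPlane-based deformation field: maps canonical
    # Gaussian centers and a timestep to per-Gaussian position offsets.
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, xyz_canonical, t):
        # xyz_canonical: (N, 3) static Gaussian centers; t: scalar in [0, 1]
        t_col = torch.full_like(xyz_canonical[:, :1], float(t))
        return self.mlp(torch.cat([xyz_canonical, t_col], dim=-1))

# Static (canonical) 3D Gaussians: only these are modified during editing.
xyz_canonical = torch.randn(1000, 3, requires_grad=True)
deform = ToyDeformationField()

# At render time, the scene at timestep t is obtained by deforming the
# static Gaussians; the deformation field itself is left untouched.
xyz_t = xyz_canonical + deform(xyz_canonical, t=0.5)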

Method

Overall pipeline of our proposed dynamic scene editing method (Instruct-4DGS): To obtain the target dynamic scene for editing, we first optimize the 4D Gaussians on a multi-camera captured video dataset. We then edit the static canonical 3D Gaussians by editing only the multiview images corresponding to the first timestep. Finally, we apply score-based temporal refinement to mitigate motion artifacts without any additional image editing.
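In pseudocode, the three stages look roughly like this. The function names and bodies below are hypothetical placeholders for the stages described above, not functions from the released code.

import torch

def optimize_4d_gaussians(videos):
    # Stage 1: fit static canonical 3D Gaussians plus a deformation field
    # (HexPlane-based in the paper; a toy linear layer stands in here).
    xyz = torch.randn(1000, 3, requires_grad=True)   # canonical Gaussian centers
    deform_field = torch.nn.Linear(4, 3)             # toy stand-in for the deformation field
    return xyz, deform_field

def edit_canonical_gaussians(xyz, edited_first_frame_views):
    # Stage 2: re-optimize only the static Gaussians against 2D edits of the
    # multiview images from the first timestep (editing loop omitted here).
    return xyz

def refine_with_score_distillation(xyz, deform_field, instruction):
    # Stage 3: short score-distillation refinement that re-aligns the edited
    # Gaussians with the frozen deformation field (no further 2D editing).
    return xyz

instruction = "turn the scene into a Van Gogh painting"   # example prompt
xyz, deform_field = optimize_4d_gaussians(videos=None)
xyz = edit_canonical_gaussians(xyz, edited_first_frame_views=None)
xyz = refine_with_score_distillation(xyz, deform_field, instruction)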

Results

Qualitative Comparison

Qualitative comparison of visual quality: We compare our method (i.e., Instruct-4DGS) with the baseline (i.e., Instruct 4D-to-4D) on DyNeRF’s coffee_martini and sear_steak scenes, as well as Technicolor's Painter and Train scenes.

[Per-scene video comparisons: Instruct-4DGS (ours) vs. Instruct 4D-to-4D (baseline) on DyNeRF coffee_martini, DyNeRF sear_steak, Technicolor Painter, and Technicolor Train.]

Qualitative Comparison of Temporal Consistency

Qualitative comparison of temporal consistency: The baseline shows noticeable flickering artifacts across timesteps. In contrast, Instruct-4DGS effectively avoids such artifacts by editing only the static component and applying score-based temporal refinement.

Ablation Study

Ablation study of the dynamic scene editing method: Each pie chart shows the proportion of user preferences (1st-4th ranks) for each method variant. Our proposed method (denoted as “Ours (w/o refine {E, D})”) achieves the highest preference score.

Effectiveness of Score-based Temporal Refinement

Effectiveness of score-based temporal refinement: Score-based temporal refinement resolves the misalignment between the canonical 3D Gaussians and the original deformation field that is introduced during the 3D Gaussian editing stage. Without requiring any additional 2D image edits, this refinement completes dynamic scene editing within a few hundred iterations.
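To make the idea concrete, below is a minimal sketch of a score-distillation-style refinement loop in the spirit of this stage. It is our own illustration, not the paper's implementation: render and diffusion_eps are hypothetical callables (a differentiable rasterizer and a frozen instruction-conditioned diffusion model), the noise schedule is omitted, and only the canonical Gaussian parameters receive gradients while the deformation field stays frozen.

import random
import torch

def score_refine(xyz_canonical, deform_field, render, diffusion_eps,
                 video_timesteps, iters=500, lr=1e-3):
    # Illustrative score-distillation refinement: only the edited canonical
    # Gaussian parameters are updated; the deformation field is frozen.
    opt = torch.optim.Adam([xyz_canonical], lr=lr)
    for _ in range(iters):
        t = random.choice(video_timesteps)                        # random video timestep
        xyz_t = xyz_canonical + deform_field(xyz_canonical, t)    # deform to timestep t
        img = render(xyz_t)                                        # differentiable render

        # SDS-style update: perturb the rendering with noise, query the frozen
        # diffusion model for its predicted noise, and use the residual as the
        # gradient pushed back through the renderer (noise schedule omitted).
        tau = torch.randint(20, 980, (1,))                         # diffusion timestep
        noise = torch.randn_like(img)
        eps_pred = diffusion_eps(img + noise, tau)
        grad = (eps_pred - noise).detach()

        opt.zero_grad()
        img.backward(gradient=grad)                                # chain rule through the renderer
        opt.step()
    return xyz_canonical

Because no additional 2D image editing is involved, each iteration only requires one render and one diffusion forward pass, which is consistent with the refinement converging within a few hundred iterations.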

More Results (Ours)

Qualitative Results on HyperNeRF

Qualitative results of our Instruct-4DGS on the HyperNeRF dataset (a monocular dataset): We evaluate our method on the Interp_chickchicken scene.

Qualitative Results with Varying Camera Poses

Qualitative results of our Instruct-4DGS under various camera poses on the DyNeRF dataset: We render the edited dynamic scene from novel camera poses to evaluate the spatial consistency of our method. Our Instruct-4DGS produces view-consistent and geometrically plausible results.

BibTeX

@misc{kwon2025instruct4dgsefficientdynamicscene,
  title={Instruct-4DGS: Efficient Dynamic Scene Editing via 4D Gaussian-based Static-Dynamic Separation},
  author={Joohyun Kwon and Hanbyel Cho and Junmo Kim},
  year={2025},
  eprint={2502.02091},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.02091},
}