PhysMotion: Physics-Grounded Dynamics From a Single Image

1University of California, Los Angeles, 2University of Utah
*Equal Contributions

Leveraging an intermediate 3D representation, PhysMotion is a novel framework that generates high-quality, physics-grounded dynamics from only a single input image.

Abstract

We introduce PhysMotion, a novel framework that leverages principled physics-based simulations to guide intermediate 3D representations generated from a single image and input conditions (e.g., an applied force or torque), producing high-quality, physically plausible video generation. By utilizing continuum mechanics-based simulations as prior knowledge, our approach addresses the limitations of traditional data-driven generative models and results in more consistent, physically plausible motion. Our framework begins by reconstructing a feed-forward 3D Gaussian representation from a single image through geometry optimization. This representation is then time-stepped using a differentiable Material Point Method (MPM) with continuum mechanics-based elastoplasticity models, which provides a strong foundation for realistic dynamics, albeit at a coarse level of detail. To enhance geometry and appearance and to ensure spatiotemporal consistency, we refine the initial simulation using a text-to-image (T2I) diffusion model with cross-frame attention, resulting in a physically plausible video that retains intricate details comparable to the input image. We conduct comprehensive qualitative and quantitative evaluations to validate the efficacy of our method.
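To make the dynamics stage concrete, below is a minimal, self-contained sketch of one explicit MLS-MPM time step for elastic particles in NumPy (fixed-corotated constitutive model, quadratic B-spline transfers). It is an illustrative toy, not our differentiable implementation: boundary handling and the elastoplastic return mapping are omitted, and all names and parameters are assumptions.

import numpy as np

def quadratic_weights(fx):
    # Quadratic B-spline weights for the three grid nodes around a particle.
    return [0.5 * (1.5 - fx) ** 2, 0.75 - (fx - 1.0) ** 2, 0.5 * (fx - 0.5) ** 2]

def mpm_step(x, v, F, C, mass, vol, mu, lam, dx, dt, res, gravity=-9.8):
    """One explicit MLS-MPM step. x, v: (N, 3); F, C: (N, 3, 3). Particles are
    assumed to stay at least one cell away from the domain boundary."""
    inv_dx = 1.0 / dx
    grid_v = np.zeros((res, res, res, 3))  # holds momentum, then velocity
    grid_m = np.zeros((res, res, res))

    # Particle-to-grid: scatter mass, momentum, and internal-force impulse.
    for p in range(len(x)):
        base = (x[p] * inv_dx - 0.5).astype(int)
        fx = x[p] * inv_dx - base
        w = quadratic_weights(fx)
        # Fixed-corotated Kirchhoff stress: 2*mu*(F - R)F^T + lam*J(J - 1)I.
        U, sig, Vt = np.linalg.svd(F[p])
        R, J = U @ Vt, np.prod(sig)
        kirchhoff = 2 * mu * (F[p] - R) @ F[p].T + lam * J * (J - 1) * np.eye(3)
        affine = -(dt * vol * 4 * inv_dx * inv_dx) * kirchhoff + mass * C[p]
        for i in range(3):
            for j in range(3):
                for k in range(3):
                    offs = np.array([i, j, k])
                    dpos = (offs - fx) * dx
                    weight = w[i][0] * w[j][1] * w[k][2]
                    node = tuple(base + offs)
                    grid_v[node] += weight * (mass * v[p] + affine @ dpos)
                    grid_m[node] += weight * mass

    # Grid update: momentum -> velocity, then external forces (gravity, y-up).
    occupied = grid_m > 0
    grid_v[occupied] /= grid_m[occupied][:, None]
    grid_v[..., 1][occupied] += dt * gravity

    # Grid-to-particle: gather velocity and affine field, update F, advect.
    for p in range(len(x)):
        base = (x[p] * inv_dx - 0.5).astype(int)
        fx = x[p] * inv_dx - base
        w = quadratic_weights(fx)
        new_v, new_C = np.zeros(3), np.zeros((3, 3))
        for i in range(3):
            for j in range(3):
                for k in range(3):
                    offs = np.array([i, j, k])
                    weight = w[i][0] * w[j][1] * w[k][2]
                    gv = grid_v[tuple(base + offs)]
                    new_v += weight * gv
                    new_C += 4 * inv_dx * weight * np.outer(gv, offs - fx)
        v[p], C[p] = new_v, new_C
        F[p] = (np.eye(3) + dt * new_C) @ F[p]  # deformation-gradient update
        x[p] = x[p] + dt * new_v                # advection
    return x, v, F, C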

Pipeline

Given a single input image, we introduce a novel pipeline that generates high-fidelity, physics-grounded video with 3D understanding. The pipeline consists of two main stages. First, we perform a single-view 3DGS reconstruction of the segmented object from the input image and synthesize coarse, physics-grounded object dynamics. Next, we apply diffusion-based video enhancement to produce the final enhanced video with backgrounds, enabling users to create visually compelling, physics-driven videos from a single image with an applied conditional force or torque.
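The hand-off between the two stages can be pictured as converting reconstructed Gaussian kernels into simulation particles. The sketch below is a hedged illustration under simple assumptions (each Gaussian center becomes one particle, with mass derived from an opacity-weighted ellipsoid volume); the exact coupling in our implementation may differ.

import numpy as np

def gaussians_to_particles(centers, scales, opacities, density=1000.0):
    """centers: (N, 3) kernel means; scales: (N, 3) per-axis standard
    deviations; opacities: (N,) values in [0, 1]; density in kg/m^3."""
    # Approximate each kernel by its one-standard-deviation ellipsoid.
    volumes = (4.0 / 3.0) * np.pi * np.prod(scales, axis=1)
    masses = density * opacities * volumes         # assumed mass model
    velocities = np.zeros_like(centers)            # object starts at rest
    F = np.tile(np.eye(3), (len(centers), 1, 1))   # undeformed gradient
    return centers.copy(), velocities, F, masses, volumes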

Comparison

We compare PhysMotion with several open-source state-of-the-art image-to-video generation models: I2VGen-XL, CogVideoX-5B, MotionI2V, DragAnything, and DynamiCrafter. The first three methods use text-based conditions, while the latter two use trajectory-based conditions to generate dynamics from a single image. We use GPT-4o to generate, for each image, a prompt describing plausible dynamics for video generation.
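For reference, prompt generation along these lines could look like the sketch below, using the official openai Python package; the instruction text and image handling are assumptions for illustration, not the exact prompts used in our evaluation.

import base64
from openai import OpenAI  # official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def dynamics_prompt(image_path):
    # Encode the image so it can be passed inline to the vision endpoint.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "In one sentence, describe a physically plausible "
                         "motion of the main object, phrased as a prompt for "
                         "image-to-video generation."},  # assumed instruction
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content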

[Side-by-side video comparisons on three examples: Diamond, Fox, and Tennis. Each example shows results from Ours, CogVideoX-5B, DynamiCrafter, MotionI2V, DragAnything, and I2VGen-XL.]

Generative Video Enhancement

Our generative video enhancement pipeline successfully captures the realistic texture of bread in the torn area, as shown in the comparison below.


[Side-by-side videos: w/o generative video enhancement vs. w/ generative video enhancement.]
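The enhancement stage relies on cross-frame attention in the T2I diffusion model (see the abstract). The following is a minimal, self-contained PyTorch sketch of the general idea: each frame attends to its own tokens and to an anchor frame's tokens so appearance stays consistent across frames. The shapes, the anchor choice, and the concatenation scheme are illustrative assumptions, not our exact design.

import torch

def cross_frame_attention(q, k, v, anchor=0):
    """q, k, v: (frames, tokens, dim) projections from one attention layer."""
    f, t, d = k.shape
    # Broadcast the anchor frame's keys/values to every frame.
    k_anchor = k[anchor:anchor + 1].expand(f, t, d)
    v_anchor = v[anchor:anchor + 1].expand(f, t, d)
    # Each frame attends jointly to its own tokens and the anchor's tokens.
    k_joint = torch.cat([k, k_anchor], dim=1)  # (frames, 2*tokens, dim)
    v_joint = torch.cat([v, v_anchor], dim=1)
    attn = torch.softmax(q @ k_joint.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ v_joint                      # (frames, tokens, dim)

# Toy shapes only; in a diffusion U-Net these would come from the
# self-attention projections of the video frames being denoised.
q = torch.randn(8, 256, 64)
k = torch.randn(8, 256, 64)
v = torch.randn(8, 256, 64)
out = cross_frame_attention(q, k, v)  # appearance tied to the anchor frame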

BibTeX

@article{tan2024physmotion,
      title={PhysMotion: Physics-Grounded Dynamics From a Single Image}, 
      author={Tan, Xiyang and Jiang, Ying and Li, Xuan and Zong, Zeshun and Xie, Tianyi and Yang, Yin and Jiang, Chenfanfu},
      journal={arXiv preprint arXiv:2411.17189},
      year={2024}
}