We introduce PhysMotion, a novel framework that leverages principled physics-based simulation to guide intermediate 3D representations generated from a single image and input conditions (e.g., an applied force or torque), producing high-quality, physically plausible video generation. By using continuum-mechanics-based simulation as prior knowledge, our approach addresses the limitations of traditional data-driven generative models and yields more consistent, physically plausible motion. Our framework first reconstructs a 3D Gaussian representation from a single image in a feed-forward manner, followed by geometry optimization. This representation is then time-stepped using a differentiable Material Point Method (MPM) with continuum-mechanics-based elastoplasticity models, which provides a strong foundation for realistic dynamics, albeit at a coarse level of detail. To enhance geometry and appearance and to ensure spatiotemporal consistency, we refine the initial simulation with a text-to-image (T2I) diffusion model equipped with cross-frame attention, producing a physically plausible video that retains intricate details comparable to the input image. We conduct comprehensive qualitative and quantitative evaluations to validate the efficacy of our method.
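To make the elastoplasticity treatment concrete, the snippet below sketches a standard return mapping that projects a particle's deformation gradient back onto the elastic region by clamping its singular values (snow-style plasticity in the spirit of Stomakhin et al. 2013). The function name and the thresholds theta_c and theta_s are illustrative defaults, not PhysMotion's exact constitutive model.

import numpy as np

def return_mapping(F, theta_c=2.5e-2, theta_s=7.5e-3):
    """Project a deformation gradient F (3x3) back onto the elastic
    region by clamping its singular values; theta_c / theta_s bound
    compression / stretch. Illustrative, not the paper's exact model."""
    U, sigma, Vt = np.linalg.svd(F)
    sigma = np.clip(sigma, 1.0 - theta_c, 1.0 + theta_s)
    return U @ np.diag(sigma) @ Vt

# Example: an over-stretched deformation gradient is projected back,
# with singular values clamped to [0.975, 1.0075] under the defaults.
F = np.diag([1.2, 1.0, 0.9])
F_elastic = return_mapping(F)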
Given a single input image, we introduce a novel pipeline to generate high-fidelity, physics-grounded video with 3D understanding. Our pipeline consists of two main stages: first, we perform single-view 3DGS reconstruction of the segmented object from the input image and synthesize coarse, physics-grounded object dynamics; second, we apply diffusion-based video enhancement to produce the final video with backgrounds, enabling users to create visually compelling, physics-driven videos from a single image with an applied conditional force or torque.
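For readers who prefer code, the sketch below outlines the two stages. The component functions (segment, reconstruct, simulate, enhance) are passed in as callables because they are hypothetical placeholders for the corresponding modules, not actual APIs from the PhysMotion codebase; this is a structural sketch, not the implementation.

def physmotion_pipeline(image, segment, reconstruct, simulate, enhance,
                        force=None, torque=None, n_frames=48):
    # Stage 1: single-view 3DGS reconstruction and coarse physics-grounded
    # dynamics under the user-specified force/torque.
    mask = segment(image)                      # isolate the object
    gaussians = reconstruct(image, mask)       # feed-forward 3DGS + geometry optimization
    coarse = simulate(gaussians, force, torque, n_frames)  # differentiable MPM

    # Stage 2: diffusion-based video enhancement with cross-frame attention,
    # composited with the (inpainted) background from the input image.
    return enhance(coarse, image, mask)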
We compare PhysMotion with several open-source state-of-the-art image-to-video generation models: I2VGen-XL, CogVideoX-5B, DynamiCrafter, Motion-I2V, and DragAnything. The first three methods use text-based conditions, while the latter two use trajectory-based conditions to generate dynamics from a single image. We use GPT-4o to generate, for each image, text prompts describing plausible dynamics for video generation.
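As an illustration of this prompt-generation step, the following minimal sketch uses the OpenAI Python client to ask GPT-4o for a dynamics prompt given an image. The exact instruction wording is an assumption, since the paper does not specify it.

from openai import OpenAI
import base64

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def dynamics_prompt(image_path):
    """Ask GPT-4o for a one-sentence prompt describing plausible dynamics
    of the pictured object. The instruction text is illustrative only."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe, in one sentence, a physically plausible "
                         "motion of the main object for video generation."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content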
Our generative video enhancement pipeline successfully captures the realistic texture of the bread in the torn region, as shown on the right.
Left: w/o generative video enhancement. Right: w/ generative video enhancement.
@article{tan2024physmotion,
  title={PhysMotion: Physics-Grounded Dynamics From a Single Image},
  author={Tan, Xiyang and Jiang, Ying and Li, Xuan and Zong, Zeshun and Xie, Tianyi and Yang, Yin and Jiang, Chenfanfu},
  journal={arXiv preprint arXiv:2411.17189},
  year={2024}
}