RealWonder: Real-Time Physical Action-Conditioned Video Generation
Abstract
RealWonder enables real-time action-conditioned video generation by integrating 3D reconstruction, physics simulation, and a distilled video generator to simulate physical consequences of 3D actions.
Current video generation models cannot simulate the physical consequences of 3D actions such as forces and robotic manipulations, because they lack a structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is to use physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480×832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder will open new opportunities for applying video models to immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/
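To make the three-stage design in the abstract concrete, here is a minimal sketch of how such a pipeline could be wired together, assuming a per-frame loop of reconstruction, simulation, and distilled generation. All names below (`SceneReconstructor`, `PhysicsSimulator`, `DistilledVideoGenerator`, `realwonder_frame`) are hypothetical placeholders, not the authors' API; the stubs only stand in for the paper's 3D reconstruction, physics bridge, and 4-step distilled diffusion components.

```python
# Hypothetical sketch of a RealWonder-style pipeline, as described in the
# abstract: 3D action -> physics simulation -> flow/RGB conditioning -> video.
# All class and method names are illustrative, not the released code's API.
import numpy as np

class SceneReconstructor:
    """Stand-in for stage 1: 3D reconstruction from a single image."""
    def reconstruct(self, image: np.ndarray) -> dict:
        # A real system would estimate geometry and appearance; here we
        # return placeholders so the sketch runs end to end.
        return {"geometry": np.zeros((1024, 3)), "texture": image}

class PhysicsSimulator:
    """Stand-in for stage 2: the physics bridge that turns a continuous
    3D action into visual conditioning (optical flow + coarse RGB)."""
    def step(self, scene: dict, action: np.ndarray, height: int, width: int):
        # Placeholder outputs; a real simulator would advance the scene
        # under `action` and render per-pixel motion and color.
        flow = np.zeros((height, width, 2), dtype=np.float32)
        rgb = scene["texture"].copy()
        return flow, rgb

class DistilledVideoGenerator:
    """Stand-in for stage 3: the distilled video diffusion model, run
    with only 4 denoising steps per the paper."""
    def __init__(self, num_steps: int = 4):
        self.num_steps = num_steps
    def generate(self, flow: np.ndarray, rgb: np.ndarray) -> np.ndarray:
        frame = rgb.astype(np.float32)
        for _ in range(self.num_steps):  # 4-step distilled sampling loop
            frame = frame  # placeholder for one denoising step on (flow, rgb)
        return frame.astype(np.uint8)

def realwonder_frame(image, action, recon, sim, gen):
    """One interactive frame: reconstruct, simulate the action, generate."""
    scene = recon.reconstruct(image)
    flow, rgb = sim.step(scene, action, *image.shape[:2])
    return gen.generate(flow, rgb)

if __name__ == "__main__":
    image = np.zeros((480, 832, 3), dtype=np.uint8)  # 480x832, as in the paper
    action = np.array([0.0, 0.0, -9.8])              # e.g. an applied force
    frame = realwonder_frame(image, action, SceneReconstructor(),
                             PhysicsSimulator(), DistilledVideoGenerator())
    print(frame.shape)  # (480, 832, 3)
```

The point of the sketch is the data flow: the action never enters the video model directly; it is converted by the simulator into flow and RGB signals, which is the intermediate-bridge idea the abstract describes.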
Community
Papers similar to this one, recommended by the Semantic Scholar API:
- PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation (2026)
- Video Generation Models in Robotics - Applications, Research Challenges, Future Directions (2026)
- ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors (2026)
- Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control (2026)
- VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control (2026)
- Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering (2026)
- Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures (2026)