PEVA predicts future video frames from previous frames and specified 3D pose changes, enabling generation of atomic action videos, counterfactual simulations, and extended video sequences.
Challenges in Modeling Embodied Agents
Recent progress in world models has enhanced our ability to forecast future states for planning and control, spanning from intuitive physics to multi-step video prediction. However, most existing models fall short when applied to truly embodied agents: those that physically interact with the real world through complex, high-dimensional actions. Unlike abstract control signals, embodied agents operate with a physically grounded action space and perceive the environment from an egocentric perspective, which introduces unique challenges such as dynamic viewpoints and diverse real-world scenarios.
- Context-Dependent Perception and Action: Identical visual inputs can lead to different movements depending on the context, reflecting the goal-directed nature of human behavior.
- Complexity of Human Motion: Full-body movements involve over 48 degrees of freedom with hierarchical and time-dependent dynamics, making control highly intricate.
- Egocentric Vision Limitations: First-person views reveal intentions but obscure the body itself, requiring models to infer physical actions from indirect visual cues.
- Delayed Visual Feedback: Perception often trails behind action execution, necessitating long-term temporal reasoning and prediction.
To build effective world models for embodied agents, it is essential to ground them in realistic human-like behaviors, where vision guides action and the body’s movements encompass both locomotion and manipulation.
Introducing PEVA: Predicting Egocentric Video from Whole-Body Actions
We developed PEVA, a novel model designed to forecast egocentric video frames conditioned on detailed human body actions. PEVA leverages kinematic pose trajectories structured by the body's joint hierarchy to simulate how physical movements influence the environment from a first-person viewpoint. Trained on Nymeria, a comprehensive dataset combining real-world egocentric videos with precise body pose captures, PEVA employs an autoregressive conditional diffusion transformer architecture. This approach enables the model to handle complex, high-dimensional motion sequences and supports a hierarchical evaluation framework that rigorously tests embodied prediction and control capabilities.
Encoding Human Motion: Structured Action Representation
To effectively link human motion with egocentric visual input, PEVA encodes each action as a comprehensive, high-dimensional vector capturing full-body dynamics. Instead of relying on simplified control signals, the model encodes global translation and relative joint rotations based on the body’s kinematic tree. Specifically, motion is represented in 3D space with 3 degrees of freedom for root translation and 15 upper-body joints, using Euler angles for joint rotations, resulting in a 48-dimensional action space. Motion capture data is synchronized with video frames and transformed into a pelvis-centered local coordinate system to ensure invariance to position and orientation. Normalization of positions and rotations stabilizes training, while inter-frame motion differences enable the model to associate physical movements with their visual outcomes over time.
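As a rough sketch, the per-frame action vector described above can be assembled from root translation and joint-rotation deltas. The function and variable names here are illustrative, not PEVA's actual code; the only facts taken from the text are the dimensions (3 for root translation, 15 upper-body joints with 3 Euler angles each, 48 total) and the use of inter-frame differences:

```python
import numpy as np

def encode_action(root_pos_t, root_pos_prev, joint_eulers_t, joint_eulers_prev):
    """Build a 48-D action vector from pelvis-centered pose data
    (hypothetical sketch of the representation described above)."""
    # Inter-frame root translation in the pelvis-centered frame: 3 dims
    d_root = root_pos_t - root_pos_prev                            # shape (3,)
    # Inter-frame Euler-angle deltas for 15 upper-body joints: 15 x 3 = 45 dims
    d_joints = (joint_eulers_t - joint_eulers_prev).reshape(-1)    # shape (45,)
    # Concatenate into the full 48-D action
    return np.concatenate([d_root, d_joints])                      # shape (48,)
```

In practice these deltas would also be normalized, as the text notes, to stabilize training.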
PEVA Architecture: Autoregressive Conditional Diffusion Transformer
Building upon the Conditional Diffusion Transformer (CDiT) framework, originally designed for navigation tasks with simple control inputs, PEVA extends this approach to handle the complexity of whole-body human motion. Key innovations include:
- Random Timeskips: Training with variable temporal skips enables the model to capture both short-term dynamics and longer activity patterns.
- Sequence-Level Training: Loss functions are applied over prefixes of motion sequences, encouraging coherent predictions across entire action sequences.
- Action Embeddings: High-dimensional action vectors are concatenated and integrated into each AdaLN layer, conditioning the model on detailed whole-body movements.
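The AdaLN conditioning in the last bullet can be sketched as follows: the action vector is mapped to per-channel scale and shift parameters that modulate normalized activations. This is a minimal NumPy illustration of the mechanism, not PEVA's implementation; the weight initialization and dimensions are assumptions:

```python
import numpy as np

class AdaLN:
    """Minimal adaptive layer norm: a 48-D action vector produces
    per-channel scale and shift for the normalized activations."""
    def __init__(self, action_dim, hidden_dim, rng=None):
        rng = rng or np.random.default_rng(0)
        # Linear map from action vector to (scale, shift) pairs
        self.W = rng.normal(0, 0.02, size=(action_dim, 2 * hidden_dim))
        self.hidden_dim = hidden_dim

    def __call__(self, x, action):
        # Standard layer norm over the channel dimension
        mu = x.mean(-1, keepdims=True)
        sigma = x.std(-1, keepdims=True) + 1e-6
        x_norm = (x - mu) / sigma
        # Action-conditioned modulation
        params = action @ self.W
        scale, shift = params[: self.hidden_dim], params[self.hidden_dim:]
        return x_norm * (1 + scale) + shift
```

Conditioning every AdaLN layer this way lets the whole-body action influence each transformer block rather than only the input.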
Inference: Sampling and Rollout Mechanism
During inference, PEVA generates future video frames by conditioning on a set of past context frames. These frames are encoded into latent representations, and noise is added to the target frame, which is then progressively denoised through the diffusion process. To optimize efficiency, attention mechanisms are restricted: within-frame attention focuses on the target frame, while cross-frame attention is applied only to the most recent context frame. For action-conditioned predictions, an autoregressive rollout strategy is employed: starting with context frames encoded by a VAE encoder, the current action is appended, and the model predicts the next frame. This predicted frame is added to the context, the oldest frame is dropped, and the process repeats for each subsequent action. Finally, predicted latents are decoded back into pixel space using a VAE decoder.
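The rollout loop described above can be sketched as follows. The `model` and `vae` interfaces here are hypothetical placeholders; the sliding-window structure (append the prediction, drop the oldest context latent) is the part taken from the text:

```python
def rollout(model, vae, context_frames, actions, context_len=4):
    """Hypothetical sketch of PEVA's action-conditioned rollout."""
    # Encode the initial context frames into latents with the VAE encoder
    context = [vae.encode(f) for f in context_frames]
    predicted = []
    for action in actions:
        # The diffusion model denoises the next latent, conditioned on the
        # context latents and the current whole-body action
        next_latent = model.denoise(context, action)
        predicted.append(next_latent)
        # Slide the window: append the prediction, drop the oldest latent
        context = (context + [next_latent])[-context_len:]
    # Decode predicted latents back to pixel space
    return [vae.decode(z) for z in predicted]
```

Because each prediction feeds back into the context, errors can compound over long rollouts, which is why the long-horizon evaluations below are a meaningful test.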
Decomposing Movements: Atomic Actions
To evaluate PEVA’s understanding of how specific joint-level movements affect egocentric views, complex human motions are broken down into atomic actions. These include simple hand gestures (e.g., moving the left hand up or right) and whole-body movements (e.g., moving forward or rotating). Below are examples illustrating these atomic actions:
Whole-Body Movements
Move Forward
Rotate Left
Rotate Right
Left Hand Movements
Move Left Hand Up
Move Left Hand Down
Move Left Hand Left
Move Left Hand Right
Right Hand Movements
Move Right Hand Up
Move Right Hand Down
Move Right Hand Left
Move Right Hand Right
Extended Video Generation: Long Rollouts
PEVA demonstrates the capacity to generate visually and semantically consistent video sequences over extended durations. Below are examples of 16-second video rollouts conditioned on full-body motion, showcasing the model’s ability to maintain coherence across time:
Sequence 1
Sequence 2
Sequence 3
Visual Planning Through Simulation
PEVA supports planning by simulating multiple candidate action sequences and evaluating them based on perceptual similarity to a target goal, measured using the Learned Perceptual Image Patch Similarity (LPIPS) metric. This enables the model to discard suboptimal paths and identify sequences that achieve desired outcomes. For example, PEVA can rule out routes leading to irrelevant locations like a sink or outdoors and instead find the correct path to open a fridge. Similarly, it can avoid actions such as grabbing nearby plants when the goal is to reach a shelf.
Example: PEVA excludes paths leading to the sink or outdoors, selecting the correct route to open the fridge.
Example: PEVA avoids grabbing plants and going to the kitchen, finding a reasonable action sequence to reach the shelf.
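The simulate-and-score procedure above reduces to a simple loop: roll out each candidate action sequence, measure the perceptual distance between the final predicted frame and the goal image (LPIPS in PEVA; any distance function fits the sketch), and keep the lowest-cost candidate. All names here are illustrative:

```python
def plan_by_simulation(world_model, goal_image, candidate_seqs, perceptual_distance):
    """Sketch of goal-conditioned planning by forward simulation."""
    best_seq, best_cost = None, float("inf")
    for seq in candidate_seqs:
        # Simulate the candidate action sequence to a final egocentric frame
        final_frame = world_model(seq)
        # Score against the goal with a perceptual metric (e.g. LPIPS)
        cost = perceptual_distance(final_frame, goal_image)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost
```

Sequences leading to the wrong location (the sink, outdoors) simply score a high perceptual distance from the goal frame and are discarded.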
Visual Planning via Energy Minimization
Planning is formulated as an energy minimization problem, optimized using the Cross-Entropy Method (CEM). This approach focuses on refining action sequences for either the left or right arm while keeping other body parts static. Below are representative examples illustrating PEVA’s ability to predict action sequences that approximate target goals:
Predicted sequence raises the right arm toward a mixing stick. Note: left arm movement is not predicted, highlighting a current limitation.
Sequence approaches the kettle but does not fully grasp it, indicating room for improvement in fine-grained control.
Sequence pulls the left arm inward, closely matching the goal.
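The CEM optimization used above can be sketched generically: sample action sequences from a Gaussian, keep the lowest-energy elites, and refit the sampling distribution to them. The hyperparameters below are illustrative defaults, not PEVA's settings:

```python
import numpy as np

def cem_plan(energy, dim, iters=10, pop=64, n_elite=8, rng=None):
    """Minimal Cross-Entropy Method for energy minimization over actions."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        # Sample a population of candidate action vectors
        samples = rng.normal(mu, sigma, size=(pop, dim))
        costs = np.array([energy(s) for s in samples])
        # Keep the lowest-energy elites and refit the distribution to them
        elites = samples[np.argsort(costs)[:n_elite]]
        mu, sigma = elites.mean(0), elites.std(0) + 1e-6
    return mu
```

In PEVA's setting, the energy would be the perceptual distance between the rollout's final frame and the goal, with the optimization restricted to one arm's action dimensions while the rest of the body is held static.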
Performance Evaluation
PEVA’s effectiveness is validated through extensive quantitative assessments, demonstrating superior performance in generating high-fidelity egocentric videos from whole-body actions. The model excels in perceptual quality, maintains temporal coherence over long sequences, and benefits from scaling with increased model size.
Comparative Perceptual Metrics
Comparison of perceptual quality metrics across various models.
Atomic Action Generation Accuracy
Performance comparison on generating videos of atomic actions.
FID Scores Over Time
Fréchet Inception Distance (FID) comparison illustrating video quality across different models and prediction lengths.
Model Scaling Benefits
PEVA demonstrates improved performance with larger model sizes, highlighting strong scaling capabilities.
Looking Ahead: Future Enhancements
While PEVA marks a significant step toward embodied video prediction and planning, several avenues remain for advancement. Current planning capabilities are limited to simulating candidate arm actions without full trajectory optimization or long-horizon planning. Future work aims to extend PEVA to closed-loop control and interactive environments, incorporate explicit task intent and semantic goal conditioning, and integrate object-centric representations. Additionally, moving beyond image similarity metrics toward more task-relevant objectives will enhance practical applicability.