PEVA predicts future video frames from previous frames and specified 3D pose changes, enabling generation of atomic action videos, counterfactual simulations, and extended video sequences.
Challenges in Modeling Embodied Agents
Recent progress in world models has enhanced our ability to forecast future states for planning and control, spanning from intuitive physics to multi-step video prediction. However, most existing models fall short when applied to truly embodied agents: those that physically interact with the real world through complex, high-dimensional actions. Unlike abstract control signals, embodied agents operate with a physically grounded action space and perceive the environment from an egocentric perspective, which introduces unique challenges such as dynamic viewpoints and diverse real-world scenarios.
- Context-Dependent Perception and Action: Identical visual inputs can lead to different movements depending on the context, reflecting the goal-directed nature of human behavior.
- Complexity of Human Motion: Full-body movements involve over 48 degrees of freedom with hierarchical and time-dependent dynamics, making control highly intricate.
- Egocentric Vision Limitations: First-person views reveal intentions but obscure the body itself, requiring models to infer physical actions from indirect visual cues.
- Delayed Visual Feedback: Perception often trails behind action execution, necessitating long-term temporal reasoning and prediction.
To build effective world models for embodied agents, it is essential to ground them in realistic human-like behaviors, where vision guides action and the body’s movements encompass both locomotion and manipulation.
Introducing PEVA: Predicting Egocentric Video from Whole-Body Actions
We developed PEVA, a novel model designed to forecast egocentric video frames conditioned on detailed human body actions. PEVA leverages kinematic pose trajectories structured by the body's joint hierarchy to simulate how physical movements influence the environment from a first-person viewpoint. Trained on Nymeria, a comprehensive dataset combining real-world egocentric videos with precise body pose captures, PEVA employs an autoregressive conditional diffusion transformer architecture. This approach enables the model to handle complex, high-dimensional motion sequences and supports a hierarchical evaluation framework that rigorously tests embodied prediction and control capabilities.
Encoding Human Motion: Structured Action Representation
To effectively link human motion with egocentric visual input, PEVA encodes each action as a comprehensive, high-dimensional vector capturing full-body dynamics. Instead of relying on simplified control signals, the model encodes global translation and relative joint rotations based on the body’s kinematic tree. Specifically, motion is represented in 3D space with 3 degrees of freedom for root translation and 15 upper-body joints, using Euler angles for joint rotations, resulting in a 48-dimensional action space. Motion capture data is synchronized with video frames and transformed into a pelvis-centered local coordinate system to ensure invariance to position and orientation. Normalization of positions and rotations stabilizes training, while inter-frame motion differences enable the model to associate physical movements with their visual outcomes over time.
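As a rough sketch, the per-frame action vector described above can be assembled from root translation and joint-rotation deltas. The function and variable names here are illustrative, not PEVA's actual code; the only facts taken from the text are the dimensions (3 for root translation, 15 upper-body joints with 3 Euler angles each, 48 total) and the use of inter-frame differences:

```python
import numpy as np

def encode_action(root_pos_t, root_pos_prev, joint_eulers_t, joint_eulers_prev):
    """Build a 48-D action vector from pelvis-centered pose data
    (hypothetical sketch of the representation described above)."""
    # Inter-frame root translation in the pelvis-centered frame: 3 dims
    d_root = root_pos_t - root_pos_prev                            # shape (3,)
    # Inter-frame Euler-angle deltas for 15 upper-body joints: 15 x 3 = 45 dims
    d_joints = (joint_eulers_t - joint_eulers_prev).reshape(-1)    # shape (45,)
    # Concatenate into the full 48-D action
    return np.concatenate([d_root, d_joints])                      # shape (48,)
```

In practice these deltas would also be normalized, as the text notes, to stabilize training.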
PEVA Architecture: Autoregressive Conditional Diffusion Transformer
Building upon the Conditional Diffusion Transformer (CDiT) framework, originally designed for navigation tasks with simple control inputs, PEVA extends this approach to handle the complexity of whole-body human motion. Key innovations include:
- Random Timeskips: Training with variable temporal skips enables the model to capture both short-term dynamics and longer activity patterns.
- Sequence-Level Training: Loss functions are applied over prefixes of motion sequences, encouraging coherent predictions across entire action sequences.
- Action Embeddings: High-dimensional action vectors are concatenated and integrated into each AdaLN layer, conditioning the model on detailed whole-body movements.
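The AdaLN conditioning in the last bullet can be sketched as follows: the action vector is mapped to per-channel scale and shift parameters that modulate normalized activations. This is a minimal NumPy illustration of the mechanism, not PEVA's implementation; the weight initialization and dimensions are assumptions:

```python
import numpy as np

class AdaLN:
    """Minimal adaptive layer norm: a 48-D action vector produces
    per-channel scale and shift for the normalized activations."""
    def __init__(self, action_dim, hidden_dim, rng=None):
        rng = rng or np.random.default_rng(0)
        # Linear map from action vector to (scale, shift) pairs
        self.W = rng.normal(0, 0.02, size=(action_dim, 2 * hidden_dim))
        self.hidden_dim = hidden_dim

    def __call__(self, x, action):
        # Standard layer norm over the channel dimension
        mu = x.mean(-1, keepdims=True)
        sigma = x.std(-1, keepdims=True) + 1e-6
        x_norm = (x - mu) / sigma
        # Action-conditioned modulation
        params = action @ self.W
        scale, shift = params[: self.hidden_dim], params[self.hidden_dim:]
        return x_norm * (1 + scale) + shift
```

Conditioning every AdaLN layer this way lets the whole-body action influence each transformer block rather than only the input.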
Inference: Sampling and Rollout Mechanism
During inference, PEVA generates future video frames by conditioning on a set of past context frames. These frames are encoded into latent representations, and noise is added to the target frame, which is then progressively denoised through the diffusion process. To optimize efficiency, attention mechanisms are restricted: within-frame attention focuses on the target frame, while cross-frame attention is applied only to the most recent context frame. For action-conditioned predictions, an autoregressive rollout strategy is employed: starting with context frames encoded by a VAE encoder, the current action is appended, and the model predicts the next frame. This predicted frame is added to the context, the oldest frame is dropped, and the process repeats for each subsequent action. Finally, predicted latents are decoded back into pixel space using a VAE decoder.
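The rollout loop described above can be sketched as follows. The `model` and `vae` interfaces here are hypothetical placeholders; the sliding-window structure (append the prediction, drop the oldest context latent) is the part taken from the text:

```python
def rollout(model, vae, context_frames, actions, context_len=4):
    """Hypothetical sketch of PEVA's action-conditioned rollout."""
    # Encode the initial context frames into latents with the VAE encoder
    context = [vae.encode(f) for f in context_frames]
    predicted = []
    for action in actions:
        # The diffusion model denoises the next latent, conditioned on the
        # context latents and the current whole-body action
        next_latent = model.denoise(context, action)
        predicted.append(next_latent)
        # Slide the window: append the prediction, drop the oldest latent
        context = (context + [next_latent])[-context_len:]
    # Decode predicted latents back to pixel space
    return [vae.decode(z) for z in predicted]
```

Because each prediction feeds back into the context, errors can compound over long rollouts, which is why the long-horizon evaluations below are a meaningful test.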
Decomposing Movements: Atomic Actions
To evaluate PEVA’s understanding of how specific joint-level movements affect egocentric views, complex human motions are broken down into atomic actions. These include simple hand gestures (e.g., moving the left hand up or right) and whole-body movements (e.g., moving forward or rotating). Below are examples illustrating these atomic actions:
Whole-Body Movements
Move Forward
Rotate Left
Rotate Right
Left Hand Movements
Move Left Hand Up
Move Left Hand Down
Move Left Hand Left
Move Left Hand Right
Right Hand Movements
Move Right Hand Up
Move Right Hand Down
Move Right Hand Left
Move Right Hand Right
Extended Video Generation: Long Rollouts
PEVA demonstrates the capacity to generate visually and semantically consistent video sequences over extended durations. Below are examples of 16-second video rollouts conditioned on full-body motion, showcasing the model’s ability to maintain coherence across time:
Sequence 1
Sequence 2
Sequence 3
Visual Planning Through Simulation
PEVA supports planning by simulating multiple candidate action sequences and evaluating them based on perceptual similarity to a target goal, measured using the Learned Perceptual Image Patch Similarity (LPIPS) metric. This enables the model to discard suboptimal paths and identify sequences that achieve desired outcomes. For example, PEVA can rule out routes leading to irrelevant locations like a sink or outdoors and instead find the correct path to open a fridge. Similarly, it can avoid actions such as grabbing nearby plants when the goal is to reach a shelf.
Example: PEVA excludes paths leading to the sink or outdoors, selecting the correct route to open the fridge.
Example: PEVA avoids grabbing plants and going to the kitchen, finding a reasonable action sequence to reach the shelf.
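The simulate-and-score procedure above reduces to a simple loop: roll out each candidate action sequence, measure the perceptual distance between the final predicted frame and the goal image (LPIPS in PEVA; any distance function fits the sketch), and keep the lowest-cost candidate. All names here are illustrative:

```python
def plan_by_simulation(world_model, goal_image, candidate_seqs, perceptual_distance):
    """Sketch of goal-conditioned planning by forward simulation."""
    best_seq, best_cost = None, float("inf")
    for seq in candidate_seqs:
        # Simulate the candidate action sequence to a final egocentric frame
        final_frame = world_model(seq)
        # Score against the goal with a perceptual metric (e.g. LPIPS)
        cost = perceptual_distance(final_frame, goal_image)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost
```

Sequences leading to the wrong location (the sink, outdoors) simply score a high perceptual distance from the goal frame and are discarded.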
Visual Planning via Energy Minimization
Planning is formulated as an energy minimization problem, optimized using the Cross-Entropy Method (CEM). This approach focuses on refining action sequences for either the left or right arm while keeping other body parts static. Below are representative examples illustrating PEVA’s ability to predict action sequences that approximate target goals:
Predicted sequence raises the right arm toward a mixing stick. Note: left arm movement is not predicted, highlighting a current limitation.
Sequence approaches the kettle but does not fully grasp it, indicating room for improvement in fine-grained control.
Sequence pulls the left arm inward, closely matching the goal.
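The CEM optimization used above can be sketched generically: sample action sequences from a Gaussian, keep the lowest-energy elites, and refit the sampling distribution to them. The hyperparameters below are illustrative defaults, not PEVA's settings:

```python
import numpy as np

def cem_plan(energy, dim, iters=10, pop=64, n_elite=8, rng=None):
    """Minimal Cross-Entropy Method for energy minimization over actions."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        # Sample a population of candidate action vectors
        samples = rng.normal(mu, sigma, size=(pop, dim))
        costs = np.array([energy(s) for s in samples])
        # Keep the lowest-energy elites and refit the distribution to them
        elites = samples[np.argsort(costs)[:n_elite]]
        mu, sigma = elites.mean(0), elites.std(0) + 1e-6
    return mu
```

In PEVA's setting, the energy would be the perceptual distance between the rollout's final frame and the goal, with the optimization restricted to one arm's action dimensions while the rest of the body is held static.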
Performance Evaluation
PEVA’s effectiveness is validated through extensive quantitative assessments, demonstrating superior performance in generating high-fidelity egocentric videos from whole-body actions. The model excels in perceptual quality, maintains temporal coherence over long sequences, and benefits from scaling with increased model size.
Comparative Perceptual Metrics
Comparison of perceptual quality metrics across various models.
Atomic Action Generation Accuracy
Performance comparison on generating videos of atomic actions.
FID Scores Over Time
Fréchet Inception Distance (FID) comparison illustrating video quality across different models and prediction lengths.
Model Scaling Benefits
PEVA demonstrates improved performance with larger model sizes, highlighting strong scaling capabilities.
Looking Ahead: Future Enhancements
While PEVA marks a significant step toward embodied video prediction and planning, several avenues remain for advancement. Current planning capabilities are limited to simulating candidate arm actions without full trajectory optimization or long-horizon planning. Future work aims to extend PEVA to closed-loop control and interactive environments, incorporate explicit task intent and semantic goal conditioning, and integrate object-centric representations. Additionally, moving beyond image similarity metrics toward more task-relevant objectives will enhance practical applicability.