Whole-Body Conditioned Egocentric Video Prediction

PEVA predicts future video frames from previous frames and specified 3D pose changes, enabling generation of atomic action videos, counterfactual simulations, and extended video sequences.

Challenges in Modeling Embodied Agents

Recent progress in world models has enhanced our ability to forecast future states for planning and control, spanning from intuitive physics to multi-step video prediction. However, most existing models fall short when applied to truly embodied agents-those that physically interact with the real world through complex, high-dimensional actions. Unlike abstract control signals, embodied agents operate with a physically grounded action space and perceive the environment from an egocentric perspective, which introduces unique challenges such as dynamic viewpoints and diverse real-world scenarios.

Context-Dependent Perception and Action: Identical visual inputs can correspond to different movements, and the same action can appear differently depending on context, reflecting the complexity of goal-driven human behavior.
High-Dimensional, Structured Control: Human full-body motion involves over 48 degrees of freedom, governed by hierarchical and time-dependent dynamics.
Egocentric Vision Limits Body Visibility: First-person views reveal intentions but obscure the body itself, requiring models to infer physical actions from indirect visual cues.
Delayed Perceptual Feedback: Visual information often lags behind actions, necessitating long-term temporal reasoning and prediction.

To build effective world models for embodied agents, it is essential to ground learning in real-world, egocentric data that captures the interplay between perception, intention, and complex body movements, including both locomotion and manipulation.

Introducing PEVA: Predicting Egocentric Video from Whole-Body Actions

We present PEVA, a novel framework designed to forecast egocentric video conditioned on detailed human body motion. PEVA leverages kinematic pose trajectories structured by the body’s joint hierarchy to simulate how physical actions influence the environment from a first-person viewpoint. Trained on Nymeria-a large-scale dataset combining real-world egocentric video with precise body pose capture-PEVA employs an autoregressive conditional diffusion transformer to model the intricate relationship between full-body motion and visual outcomes.

This approach marks a pioneering effort to capture complex embodied behaviors and environmental interactions through human-perspective video prediction, enabling applications such as action-conditioned video synthesis and embodied planning.

Capturing Motion with Structured Action Representations

To effectively link human motion with egocentric vision, PEVA encodes each action as a comprehensive, high-dimensional vector that reflects both global translation and detailed joint rotations. The model represents motion in 3D space, using 3 degrees of freedom for root translation and 15 upper-body joints, each described by Euler angles, resulting in a 48-dimensional action space. Motion capture data is synchronized with video frames and transformed into a pelvis-centered local coordinate system to ensure invariance to position and orientation. Normalization of positions and rotations stabilizes training, while inter-frame motion differences enable the model to learn temporal dynamics connecting physical movement to visual changes.

Architecture: Autoregressive Conditional Diffusion Transformer

Building upon the Conditional Diffusion Transformer (CDiT) framework, originally designed for navigation tasks with simple control inputs, PEVA extends this architecture to handle the complexity of whole-body human motion. Key innovations include:

Random Timeskips: Training with variable temporal skips allows the model to capture both short-term dynamics and longer activity patterns.
Sequence-Level Training: Loss functions are applied over prefixes of motion sequences, encouraging coherent predictions across entire action sequences.
Action Embeddings: High-dimensional action vectors are embedded and concatenated to condition each adaptive layer normalization (AdaLN) layer, enabling nuanced control over the generation process.

Inference and Rollout Mechanism

During inference, PEVA generates future video frames by conditioning on a set of past context frames encoded into latent representations. The model progressively denoises noisy latent frames using the diffusion process. To optimize efficiency, attention mechanisms are restricted: self-attention is applied only within the target frame, and cross-attention is limited to the last context frame. For action-conditioned predictions, an autoregressive rollout strategy is employed-each predicted frame is appended to the context, and the oldest frame is dropped as the sequence advances. The final latent outputs are decoded back into pixel space using a variational autoencoder (VAE) decoder.

Decomposing Movements into Atomic Actions

To evaluate PEVA’s understanding of fine-grained motion effects, complex human movements are broken down into atomic actions, such as directional hand movements and whole-body rotations or translations. This decomposition allows the model to demonstrate precise control over egocentric video generation in response to specific joint-level commands.

Examples of Body Movement Actions

Move Forward

Rotate Left

Rotate Right

Left Hand Movements

Move Left Hand Up

Move Left Hand Down

Move Left Hand Left

Move Left Hand Right

Right Hand Movements

Move Right Hand Up

Move Right Hand Down

Move Right Hand Left

Move Right Hand Right

Maintaining Coherence in Extended Video Generation

PEVA excels at producing visually and semantically consistent video sequences over long durations. Demonstrations include 16-second rollouts conditioned on full-body motion, showcasing the model’s ability to sustain realistic egocentric perspectives and action consequences over time.

Sequence 1

Sequence 2

Sequence 3

Visual Planning Through Action Simulation

PEVA supports planning by simulating multiple candidate action sequences and evaluating them based on perceptual similarity to a target goal, measured using Learned Perceptual Image Patch Similarity (LPIPS). This enables the model to discard suboptimal paths and identify sequences that achieve desired outcomes.

Example: The model eliminates routes leading to the sink or outdoors, successfully finding the path to open the fridge.

Example: The model avoids paths involving grabbing plants or going to the kitchen, instead finding a plausible sequence to reach the shelf.

Optimizing Action Sequences for Visual Goals

Planning is formulated as an energy minimization problem, solved via the Cross-Entropy Method (CEM). PEVA optimizes sequences of arm movements while keeping other body parts fixed, demonstrating the ability to generate plausible action plans that approach target poses or interactions.

Example: Predicted sequence raises the right arm toward a mixing stick, though left arm adjustments are not modeled.

Example: Sequence reaches toward a kettle but does not fully grasp it, illustrating partial goal achievement.

Example: Predicted actions pull the left arm inward, closely matching the goal pose.

Performance Evaluation

PEVA’s effectiveness is validated through multiple quantitative metrics, demonstrating superior perceptual quality, temporal coherence, and scalability compared to baseline models. The model consistently generates high-fidelity egocentric videos conditioned on whole-body actions.

Comparative Perceptual Metrics

Comparison of perceptual quality metrics across different video prediction models.

Atomic Action Generation Accuracy

Performance comparison on generating videos of atomic actions.

FID Scores Over Time

Fréchet Inception Distance (FID) comparison illustrating video quality degradation over time for various models.

Model Scaling Benefits

PEVA exhibits improved performance with increased model size, highlighting strong scaling capabilities.

Looking Ahead: Future Enhancements

While PEVA sets a foundation for embodied video prediction and planning, several avenues remain for advancement. Current limitations include restricted planning scope focused on arm actions, absence of long-horizon trajectory optimization, and lack of explicit task or semantic goal conditioning. Future research aims to integrate closed-loop control, interactive environments, and object-centric representations, as well as incorporate high-level goal conditioning to enhance task-directed planning and execution.