The Future of Vision AI: How Apple’s AIMV2 Leverages Images and Text to Lead the Pack

The landscape of vision model pre-training has undergone significant evolution, especially with the rise of Large Language Models (LLMs). Traditionally, vision models operated within fixed, predefined paradigms, but LLMs have introduced a more flexible approach, unlocking new ways to leverage pre-trained vision encoders. This shift has prompted a reevaluation of pre-training methodologies for vision models to better align with multimodal applications.

In a new paper, Multimodal Autoregressive Pre-training of Large Vision Encoders, an Apple research team introduces AIMV2, a family of vision encoders trained with a multimodal autoregressive pre-training strategy. Unlike conventional methods, AIMV2 is trained to predict both image patches and text tokens within a unified sequence. This combined objective enables the model to excel in a range of tasks, such as image recognition, visual grounding, and multimodal understanding.

The key innovation of AIMV2 lies in its ability to generalize the unimodal autoregressive framework to a multimodal setting. By treating image patches and text tokens as a single sequence, AIMV2 unifies the prediction process for both modalities. This approach enhances its capacity to understand complex visual and textual relationships.
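To make the idea of a single sequence concrete, here is a minimal sketch in plain Python. It assumes a tiny grayscale image stored as nested lists, and the helper names `patchify` and `build_sequence` are hypothetical, chosen for illustration rather than taken from the AIMV2 code. It shows only how image patches and text tokens might be laid out one after the other, not how the model processes them.

```python
from typing import List, Tuple

Patch = List[float]


def patchify(image: List[List[float]], patch: int) -> List[Patch]:
    """Split an H x W image into non-overlapping patch x patch tiles.

    Tiles are returned in raster order (left to right, top to bottom),
    each flattened into a list of pixel values.
    """
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def build_sequence(image: List[List[float]],
                   text_tokens: List[int],
                   patch: int) -> List[Tuple[str, object]]:
    """Place image patches first, then text tokens, in one sequence."""
    seq: List[Tuple[str, object]] = [("image", p) for p in patchify(image, patch)]
    seq += [("text", t) for t in text_tokens]
    return seq


# Toy input: a 4 x 4 image and three made-up token ids.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
sequence = build_sequence(image, text_tokens=[101, 7, 42], patch=2)
print([kind for kind, _ in sequence])
```

With a 4 x 4 image and 2 x 2 patches, the sequence holds four image entries followed by three text entries, so one autoregressive pass can cover both modalities.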

The pre-training process of AIMV2 involves a causal multimodal decoder that first predicts image patches, followed by the generation of text tokens in an autoregressive manner. This simple yet effective design offers multiple advantages:

  1. Simplicity and Efficiency: The pre-training process does not require large batch sizes or complex inter-batch communication, making it easier to implement and scale.
  2. Alignment with LLM Multimodal Applications: The architecture naturally integrates with LLM-driven multimodal systems, enabling smooth interoperability.
  3. Denser Supervision: By extracting learning signals from every image patch and text token, AIMV2 achieves denser supervision compared to traditional discriminative objectives, facilitating more efficient training.
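The combined objective can be sketched as a sum of two per-element losses. The article does not spell out the loss functions, so the version below is an assumption: squared-error regression for image patches, cross-entropy for text tokens, and a hypothetical weight `alpha` that balances the two. The function names are illustrative only.

```python
import math
from typing import List


def patch_regression_loss(pred: List[List[float]],
                          target: List[List[float]]) -> float:
    """Mean squared error averaged over every pixel of every patch."""
    total, count = 0.0, 0
    for p, t in zip(pred, target):
        for a, b in zip(p, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def text_cross_entropy(logits: List[List[float]], targets: List[int]) -> float:
    """Average negative log-likelihood of each target token under softmax(logits)."""
    total = 0.0
    for row, tgt in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[tgt]
    return total / len(targets)


def multimodal_loss(pred_patches: List[List[float]],
                    true_patches: List[List[float]],
                    text_logits: List[List[float]],
                    text_targets: List[int],
                    alpha: float = 1.0) -> float:
    """Text loss plus alpha times image loss (alpha is an assumed weight)."""
    return (text_cross_entropy(text_logits, text_targets)
            + alpha * patch_regression_loss(pred_patches, true_patches))


# Toy example: two 2-pixel patches and one text position over a 2-word vocabulary.
loss = multimodal_loss(
    pred_patches=[[0.0, 1.0], [2.0, 3.0]],
    true_patches=[[0.0, 1.0], [2.0, 5.0]],
    text_logits=[[0.0, 0.0]],
    text_targets=[0],
)
print(loss)
```

Every patch and every token contributes a term to the total, which is the sense in which the supervision is denser than a single per-pair contrastive score.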

The architecture of AIMV2 is centered on the Vision Transformer (ViT), a well-established model for vision tasks. However, the AIMV2 team introduces key modifications to enhance its performance:

  • Constrained Self-Attention: A prefix attention mask is applied within the vision encoder, enabling bidirectional attention during inference without additional adjustments.
  • Feedforward and Normalization Upgrades: SwiGLU is used in the feedforward network (FFN), and all normalization layers are replaced with RMSNorm. Both choices follow techniques that have proven successful in language modeling and are adopted to improve training stability and efficiency.
  • Unified Multimodal Decoder: A shared decoder handles the autoregressive generation of image patches and text tokens simultaneously, further strengthening AIMV2’s multimodal capabilities.
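The first two modifications can be illustrated with a small plain-Python sketch. It uses the standard textbook definitions of a prefix attention mask, RMSNorm, and a SwiGLU feedforward layer, not code from the AIMV2 release. The function names, weight shapes, and the `eps` default are assumptions made for the example.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def prefix_attention_mask(seq_len: int, prefix_len: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Positions inside the prefix are visible to everyone (bidirectional);
    positions after the prefix are visible only causally (j <= i).
    """
    return [[(j < prefix_len) or (j <= i) for j in range(seq_len)]
            for i in range(seq_len)]


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def silu(v: float) -> float:
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> Vector:
    """SwiGLU feedforward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# A 4-position sequence whose first 2 positions form the bidirectional prefix.
mask = prefix_attention_mask(seq_len=4, prefix_len=2)
for row in mask:
    print(["x" if allowed else "." for allowed in row])

# Setting the prefix to the full length yields a fully bidirectional mask.
full = prefix_attention_mask(seq_len=4, prefix_len=4)
```

When the prefix covers the whole sequence, every position can attend to every other, which is one way to read the article's point that bidirectional attention at inference needs no further adjustment.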

Empirical evaluations reveal the impressive capabilities of AIMV2. The AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k using a frozen trunk, demonstrating its potential for high-performance image recognition. Moreover, AIMV2 consistently surpasses state-of-the-art contrastive models, such as CLIP and SigLIP, in multimodal image understanding across diverse benchmarks.

One of the key contributors to this success is AIMV2’s ability to fully utilize the learning signals from all input tokens and image patches. This dense supervision approach allows for more effective training with fewer samples compared to other self-supervised or vision-language pre-trained models.

AIMV2 represents a significant step forward in the development of vision encoders. By unifying image and text prediction under a single multimodal autoregressive framework, AIMV2 achieves superior performance across a broad range of tasks. Its straightforward pre-training process, combined with architectural improvements like SwiGLU and RMSNorm, ensures scalability and adaptability. As vision models continue to scale, AIMV2 offers a blueprint for more efficient, versatile, and unified multimodal learning systems.

The code is available on the project's repository. The paper Multimodal Autoregressive Pre-training of Large Vision Encoders is available online.


Author: Hecate He | Editor: Chain Zhang


