
Meta AI Releases Segment Anything Model 3 (SAM 3) for Promptable Concept Segmentation in Images and Videos


How can one efficiently identify, segment, and monitor every occurrence of a specific concept within extensive collections of images and videos using straightforward prompts? Meta AI has introduced the Segment Anything Model 3 (SAM 3), an open-source, unified foundation model designed for promptable segmentation across both images and videos. Unlike previous models that operated primarily at the pixel level, SAM 3 works directly with visual concepts. It can detect, segment, and track objects based on text prompts as well as visual cues such as points, bounding boxes, and masks. Compared to its predecessor SAM 2, SAM 3 excels at exhaustively locating all instances of an open-vocabulary concept (for instance, identifying every “red baseball cap” throughout a lengthy video) using a single, versatile model.

Advancing from Visual Prompts to Concept-Based Segmentation

Previous iterations of SAM concentrated on interactive segmentation, where users would click or draw a box to generate a single mask. While effective for isolated tasks, this approach struggled to scale when the goal was to locate all instances of a concept across vast image or video datasets. SAM 3 introduces the formal framework of Promptable Concept Segmentation (PCS), which accepts concept prompts and outputs instance masks along with consistent identities for every matching object in both images and videos.

Concept prompts in SAM 3 combine concise noun phrases with visual examples. The model can interpret detailed descriptions such as “yellow school bus” or “player in red jersey” and can also utilize exemplar crops as positive or negative references to clarify subtle visual distinctions. Textual prompts define the concept, while exemplar images help resolve ambiguities in fine-grained visual features. Additionally, SAM 3 integrates seamlessly as a vision module within multimodal large language models, which can generate complex referring expressions and then distill them into concise concept prompts for SAM 3 to process.
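The two halves of a concept prompt described above, a short noun phrase plus optional positive or negative exemplar crops, can be sketched as a small data structure. The names and shapes below are illustrative assumptions, not Meta's released API:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Exemplar:
    """A hypothetical exemplar crop used as a visual reference."""
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    positive: bool = True           # False marks a hard-negative example

@dataclass
class ConceptPrompt:
    """Hypothetical pairing of a noun phrase with exemplar references."""
    phrase: str
    exemplars: List[Exemplar] = field(default_factory=list)

    def positives(self) -> List[Exemplar]:
        return [e for e in self.exemplars if e.positive]

prompt = ConceptPrompt(
    phrase="player in red jersey",
    exemplars=[
        Exemplar(box=(40, 60, 120, 220), positive=True),
        # A visually similar distractor, supplied as a negative reference:
        Exemplar(box=(300, 50, 380, 210), positive=False),
    ],
)
```

The text defines the concept; the exemplars disambiguate fine-grained appearance, which is why both positive and negative references are useful.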

Innovative Architecture: Presence Token and Tracking Mechanisms

SAM 3 comprises 848 million parameters and integrates a detector and a tracker that share a unified vision encoder. The detector employs a DETR-based architecture conditioned on three types of inputs: text prompts, geometric prompts, and image exemplars. This design decouples the core image representation from the prompting interface, enabling the backbone to support a wide range of segmentation tasks efficiently.
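The decoupling described above can be sketched as follows. This is a minimal illustration of the idea, with names and token shapes invented for the example, not Meta's implementation: the image representation is computed once and reused, while whichever prompt types were supplied are concatenated into one conditioning sequence.

```python
def build_detector_input(image_tokens, text_tokens=(), geom_tokens=(), exemplar_tokens=()):
    """Combine a shared image representation with whichever prompts were given."""
    prompt_tokens = list(text_tokens) + list(geom_tokens) + list(exemplar_tokens)
    return {"image": image_tokens, "prompt": prompt_tokens}

# The backbone output is computed once and shared across prompt types.
shared_image = ["img_tok_0", "img_tok_1"]
text_only = build_detector_input(shared_image, text_tokens=["yellow", "school", "bus"])
mixed = build_detector_input(shared_image, text_tokens=["bus"], exemplar_tokens=["crop_0"])
```

Because the prompt interface is separate from the image encoder, the same backbone serves text-only, geometry-only, and mixed prompting without retraining per task.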

A notable enhancement in SAM 3 is the introduction of the presence token, which predicts whether each candidate bounding box or mask corresponds to the requested concept. This feature is crucial when dealing with closely related prompts, such as “a player in white” versus “a player in red,” as it reduces misclassification and enhances precision in open vocabulary scenarios. Importantly, the model separates recognition (classifying candidates as the concept) from localization (determining the shape and position of boxes and masks).
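The recognition/localization split can be illustrated with a toy scoring rule. This is an assumption-laden sketch, not Meta's code: per-candidate localization scores are gated by a single prompt-level presence probability, so a confident "this concept is absent" judgment suppresses all candidates at once.

```python
def gated_scores(candidate_scores, presence_prob):
    """Gate per-candidate localization scores by a prompt-level presence probability."""
    return [s * presence_prob for s in candidate_scores]

# The same candidates under two closely related prompts: one the presence
# head accepts, one it rejects.
candidates = [0.9, 0.7, 0.2]
present = gated_scores(candidates, presence_prob=0.95)  # concept judged present
absent = gated_scores(candidates, presence_prob=0.05)   # concept judged absent
```

Separating the two decisions means a well-localized box for the wrong concept (say, "a player in red" under the prompt "a player in white") can be rejected without degrading the localization head.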

For video applications, SAM 3 reuses the transformer encoder-decoder tracker from SAM 2 but integrates it more tightly with the new detector. This tracker maintains consistent instance identities across frames and supports interactive refinements. The modular design of detector and tracker minimizes interference between tasks, scales effectively with increasing data and concepts, and preserves an interactive interface for point-based refinements similar to earlier Segment Anything models.
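The identity-maintenance idea can be sketched with a deliberately simplified stand-in for the tracker (greedy IoU matching, not SAM 3's transformer tracker): each new detection either inherits the ID of the previous-frame box it overlaps, or receives a fresh ID.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def propagate_ids(prev, detections, next_id, thresh=0.5):
    """prev: {id: box} from the last frame; returns ({id: box}, next_id)."""
    assigned, used = {}, set()
    for box in detections:
        best = max(prev, key=lambda i: iou(prev[i], box), default=None)
        if best is not None and best not in used and iou(prev[best], box) >= thresh:
            assigned[best] = box   # same object: keep its identity
            used.add(best)
        else:
            assigned[next_id] = box  # new object: allocate a fresh identity
            next_id += 1
    return assigned, next_id

tracks, next_id = propagate_ids({0: (0, 0, 10, 10)},
                                [(1, 0, 11, 10), (50, 50, 60, 60)], next_id=1)
```

In SAM 3 this matching is learned rather than hand-coded, but the contract is the same: stable IDs per instance across frames, with room for interactive corrections.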

Introducing the SA-Co Dataset and Benchmark Suite

To facilitate training and evaluation of Promptable Concept Segmentation, Meta has developed the SA-Co family of datasets and benchmarks. The SA-Co benchmark encompasses approximately 270,000 unique concepts, over 50 times more than previous open-vocabulary segmentation benchmarks. Each image or video is annotated with noun phrases and dense instance masks for all objects matching those phrases, including negative prompts where no matching objects exist.

The accompanying data engine has automatically annotated more than 4 million unique concepts, establishing SA-Co as the largest high-quality open-vocabulary segmentation corpus to date. This engine leverages extensive ontologies combined with automated validation and supports hard-negative mining, handling phrases that are visually similar but semantically distinct. Such scale and rigor are vital for training models that respond robustly to diverse text prompts in real-world scenarios.
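The annotation structure described above, phrases paired with dense instance masks plus hard negatives that match nothing, might look roughly like the record below. This is an invented illustration of the shape, not the released SA-Co schema; the `"..."` mask strings are placeholders.

```python
import json

# Illustrative record: each prompt maps to the instance masks it matches,
# and hard-negative phrases keep an explicit empty list.
record = {
    "image_id": "example_000001",
    "prompts": {
        "yellow school bus": [
            {"instance_id": 0, "mask_rle": "..."},  # placeholder mask encoding
            {"instance_id": 1, "mask_rle": "..."},
        ],
        "yellow taxi": [],  # hard negative: visually similar phrase, no matches
    },
}
serialized = json.dumps(record, indent=2)
```

Keeping negatives as explicit empty lists is what lets a benchmark score a model's refusals, not just its detections.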

Performance Highlights on Images and Videos

On the SA-Co image benchmarks, SAM 3 achieves between 75% and 80% of human-level performance as measured by the cgF1 metric. Competing models like OWLv2, DINO-X, and Gemini 2.5 trail significantly behind. For example, in the SA-Co Gold box detection task, SAM 3 attains a cgF1 score of 55.7, whereas OWLv2 scores 24.5, DINO-X 22.5, and Gemini 2.5 only 14.4. These results demonstrate that a single unified model can surpass specialized detectors in open vocabulary segmentation tasks.

In video segmentation and tracking, SAM 3 has been evaluated on datasets including SA-V, YT-Temporal 1B, SmartGlasses, LVVIS, and BURST. It achieves 30.3 cgF1 and 58.0 pHOTA on SA-V, 50.8 cgF1 and 69.9 pHOTA on YT-Temporal 1B, 36.4 cgF1 and 63.6 pHOTA on SmartGlasses, and records 36.3 mAP and 44.5 HOTA on LVVIS and BURST respectively. These metrics confirm that SAM 3’s architecture effectively handles both image-based PCS and long-duration video tracking within a single framework.

Implications for Data Annotation Platforms and Benchmarking

For annotation platforms focused on data-centric workflows, such as Encord, SAM 3 represents a significant advancement beyond their current integrations of SAM and SAM 2 for auto-labeling and video tracking. These platforms already enable users to auto-annotate over 90% of images with high mask accuracy using foundation models within QA-driven pipelines. Other annotation tools like CVAT, SuperAnnotate, and Picsellia are also adopting Segment Anything-style models for zero-shot labeling, model-in-the-loop annotation, and MLOps workflows. SAM 3's promptable concept segmentation and unified image-video tracking open new avenues for editorial control and benchmarking, such as quantifying how labeling costs and annotation quality change when moving from SAM 2 to SAM 3, especially on dense video datasets or in multimodal environments.

Summary of Key Innovations

  1. SAM 3 consolidates image and video segmentation into a single 848-million-parameter foundation model that supports diverse prompt types including text, exemplars, points, and bounding boxes for Promptable Concept Segmentation.
  2. The SA-Co dataset and benchmark introduce approximately 270,000 evaluated concepts and over 4 million automatically annotated concepts, making it one of the most extensive open vocabulary segmentation resources available.
  3. SAM 3 significantly outperforms previous open vocabulary segmentation systems, achieving 75-80% of human cgF1 performance on SA-Co and more than doubling the scores of competitors like OWLv2 and DINO-X on critical detection metrics.
  4. The model architecture separates a DETR-based detector from a SAM 2-style video tracker enhanced with a presence token, enabling stable instance tracking over extended video sequences while maintaining interactive refinement capabilities.

Concluding Insights

SAM 3 marks a pivotal evolution in segmentation technology, transitioning from promptable visual segmentation to promptable concept segmentation within a unified model for both images and videos. Leveraging the expansive SA-Co benchmark, SAM 3 approaches human-level performance on complex segmentation tasks. Its modular design, combining a DETR-based detector with a sophisticated video tracker, positions it as a practical and scalable vision foundation model suitable for integration into intelligent agents and commercial products. As such, SAM 3 sets a new standard for open vocabulary segmentation at production scale.
