Meta AI's Segment Anything Model 2 (SAM 2): A Breakthrough in Real-Time Segmentation

Allan Kouidri
-
9/3/2024

Meta AI's Segment Anything Model 2 (SAM 2) represents a significant advancement in the field of computer vision, specifically in the areas of image and video segmentation. Released on July 29, 2024, SAM 2 builds upon the capabilities of its predecessor, the original SAM, by introducing real-time, promptable segmentation for both static images and dynamic video content [1].

What is SAM 2?

SAM 2, or Segment Anything Model 2, is an advanced computer vision model developed by Meta AI that extends the capabilities of its predecessor, the original SAM. It is designed to perform real-time, promptable segmentation on both static images and dynamic video content. The model is capable of generating segmentation masks based on user inputs, such as clicks, bounding boxes, or initial masks, making it highly versatile and user-friendly. SAM 2 is particularly notable for its ability to handle both image and video data within a unified framework, offering enhanced performance and efficiency in a variety of applications.

Key Features of SAM 2

  • Unified Segmentation Model: SAM 2 is the first model to unify image and video segmentation. It allows users to segment objects in real-time using prompts such as clicks, boxes, or masks.
  • Advanced Architecture: The model's architecture includes a streaming memory design, a transformer-based image encoder, and a mask decoder. This design enables SAM 2 to process video frames sequentially and maintain context across frames, which is crucial for accurately tracking objects over time.
  • Promptable Visual Segmentation (PVS): SAM 2 supports PVS, allowing users to specify areas of interest in images or videos through various input types. The model then generates segmentation masks based on these inputs (see the minimal sketch after this list).
  • Zero-Shot Generalization: SAM 2 can segment objects in previously unseen images and videos without requiring custom adaptation, making it highly versatile for a wide range of applications.
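
To make the promptable workflow concrete, here is a minimal sketch using the image predictor from the official segment-anything-2 package [2]. It is an illustration, not the exact method used in this article: the checkpoint path, config file name, and click coordinates are placeholders, and file names may differ between releases of the repository.

import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder paths: point these to the checkpoint and config you downloaded
# from the segment-anything-2 repository (names may vary between releases).
checkpoint = "./checkpoints/sam2_hiera_small.pt"
model_cfg = "sam2_hiera_s.yaml"

# Build the model (a CUDA-capable GPU is assumed here)
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
image = np.array(Image.open("your_image.jpg").convert("RGB"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    # A single foreground click at (x, y); label 1 = foreground, 0 = background
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

# Keep the candidate mask with the highest predicted quality score
best_mask = masks[np.argmax(scores)]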

SAM 2 Dataset

The dataset used to train and evaluate SAM 2 is a crucial component of its development, providing the diverse and comprehensive data necessary to achieve its advanced segmentation capabilities. 

Here are the key aspects of the SAM 2 dataset:

Dataset Composition

  • Diverse Content: The dataset includes a wide range of images and video sequences, capturing various scenes, objects, and environments. The SA-V dataset contains approximately 51,000 videos and 643,000 masklets (spatio-temporal segmentation masks) and is distributed under a CC BY 4.0 license.
  • Annotated Data: Each image and video frame in the dataset is meticulously annotated with segmentation masks. Human labelers employed an interactive, human-in-the-loop method, utilizing SAM 2 to annotate the videos. These annotations were subsequently used to enhance the performance of the SAM 2 model.
  • Multi-Modal Inputs: The dataset supports multiple types of prompts, such as clicks, bounding boxes, and initial masks, reflecting the promptable nature of SAM 2. This variety allows the model to learn how to respond to different user inputs effectively.
SA-V dataset statistics. Source: [1]

Challenges Addressed by the Dataset

  • Dynamic Video Content: The inclusion of video data poses unique challenges, such as maintaining context across frames and handling object motion. The dataset is designed to address these challenges by providing sequences that require the model to track and segment moving objects accurately.
  • Complex Scenes: The dataset includes complex scenes with multiple overlapping objects, varying lighting conditions, and diverse backgrounds. This complexity ensures that SAM 2 is robust and capable of handling real-world scenarios.
Example videos from the SA-V dataset with masklets overlaid. [1]

Dataset Size and Quality

  • Large Scale: The dataset is extensive, containing a large number of images and video frames. This scale is essential for training deep learning models like SAM 2, which require vast amounts of data to learn effectively.
  • High-Quality Annotations: The quality of annotations is critical for model performance. The dataset used for SAM 2 is carefully annotated to ensure accuracy and consistency, providing reliable ground truth for model training and evaluation.

Applications of the Dataset

  • Training and Evaluation: The dataset is used to train SAM 2, enabling it to learn the complex task of segmentation across different types of content. It is also used to evaluate the model's performance, ensuring it meets the desired accuracy and efficiency standards.
  • Benchmarking: The dataset serves as a benchmark for comparing SAM 2 with other segmentation models, highlighting its strengths and areas for improvement.

In summary, the SAM 2 dataset is a foundational element of the model's development, providing the diverse, high-quality data necessary for achieving state-of-the-art segmentation performance in both images and videos.

SAM 2 Architecture

Overview of the SAM 2 architecture. [1]

Key Components of SAM 2 Architecture

  1. Transformer-Based Image Encoder:
    • The image encoder in SAM 2 is based on a transformer architecture, which is known for its ability to capture long-range dependencies and contextual information within images. This component processes input images to extract rich feature representations that are essential for accurate segmentation.
  2. Streaming Memory Design:
    • A unique aspect of SAM 2 is its streaming memory design, which allows the model to handle video data efficiently. This design enables the model to maintain context across sequential video frames, which is crucial for tracking objects over time. The streaming memory acts as a buffer, storing information from previous frames to inform current segmentation decisions.
  3. Mask Decoder:
    • The mask decoder is responsible for generating segmentation masks based on the features extracted by the image encoder. It translates the high-level feature representations into pixel-level segmentation masks, which delineate the boundaries of objects within the image or video.
  4. Promptable Segmentation:
    • SAM 2 supports promptable segmentation, meaning it can generate segmentation masks based on user-provided prompts. These prompts can be in the form of clicks, bounding boxes, or initial masks, allowing users to specify areas of interest and guide the segmentation process.
  5. Real-Time Processing:
    • The architecture is optimized for real-time processing, making it suitable for applications that require immediate feedback, such as video editing and augmented reality. The efficient design ensures that the model can handle high-resolution images and videos without significant latency.

Innovations in SAM 2 Architecture

  • Unified Approach: Unlike traditional models that separate image and video segmentation tasks, SAM 2 unifies these processes within a single architecture. This integration allows for seamless transitions between static and dynamic content, enhancing the model's versatility.
  • Zero-Shot Generalization: The architecture is designed to generalize well to new, unseen data without requiring task-specific fine-tuning. This capability is particularly valuable for applications where the model must adapt to a wide range of scenarios and object types.

Challenges Addressed by the Architecture

  • Handling Dynamic Content: By incorporating a streaming memory design, SAM 2 effectively addresses the challenge of maintaining context in dynamic video content, ensuring accurate object tracking over time.
  • Reducing User Interactions: The promptable nature of the architecture reduces the need for extensive user input, streamlining the segmentation process and making it more efficient.

In summary, the SAM 2 architecture represents a significant advancement in the field of segmentation models, offering a robust and efficient solution for both image and video segmentation tasks. Its innovative design and real-time capabilities make it a powerful tool for a wide range of applications.
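
As an illustration of how the streaming memory design is exposed in practice, the sketch below uses the video predictor from the official segment-anything-2 package [2]: a single click on the first frame is propagated through the whole clip. This is only a sketch; checkpoint and config paths, the frame directory, and the click coordinates are placeholders, and method names may differ slightly between releases of the repository.

import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths: use the checkpoint and config you downloaded from the repository
checkpoint = "./checkpoints/sam2_hiera_small.pt"
model_cfg = "sam2_hiera_s.yaml"

predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # init_state loads the frames (here, a directory of JPEG frames) and
    # allocates the streaming memory used to track objects across frames
    state = predictor.init_state(video_path="./videos/my_clip")

    # Prompt object 1 with one foreground click on the first frame
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt: the memory bank carries context from frame to frame
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # binary masks for this frame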

Segment Anything 2 Model Versions

The Segment Anything Model 2 (SAM 2) is available in several versions, each designed to balance performance and resource requirements. These versions vary primarily in terms of model size, which affects their computational efficiency and segmentation accuracy. Here is an overview of the different SAM 2 model versions [2]:

Model                 | Size (M params) | Speed (FPS)            | SA-V test (J&F)
sam2_hiera_tiny       | 38.9            | 47.2                   | 75.0
sam2_hiera_small      | 46              | 43.3 (53.0 compiled*)  | 74.9
sam2_hiera_base_plus  | 80.8            | 34.8 (43.8 compiled*)  | 74.7
sam2_hiera_large      | 224.4           | 24.2 (30.2 compiled*)  | 76.0

Model Versions

  1. Tiny (149 MB)
    • Purpose: This version is optimized for environments with limited computational resources. It is suitable for applications where speed and efficiency are prioritized over the highest possible accuracy.
    • Use Cases: Ideal for mobile or embedded systems where memory and processing power are constrained.
  2. Small (176 MB)
    • Purpose: The small version offers a slight increase in model capacity compared to the tiny version, providing a balance between efficiency and improved segmentation performance.
    • Use Cases: Suitable for applications requiring a moderate level of accuracy without significantly increasing computational demands.
  3. Base Plus (b+) (309 MB)
    • Purpose: This version is designed to deliver enhanced performance with a more substantial model size, offering better accuracy for more demanding segmentation tasks.
    • Use Cases: Appropriate for desktop applications or cloud-based services where there is more flexibility in terms of computational resources.
  4. Large (856 MB)
    • Purpose: The large version is the most powerful, designed to maximize segmentation accuracy. It leverages a larger model capacity to handle complex tasks with high precision.
    • Use Cases: Best suited for high-end applications such as detailed video editing, scientific research, and industrial use cases where accuracy is critical.

Easily run SAM 2

Setup

With the Ikomia API, you can effortlessly run zero-shot segmentation with SAM 2 in just a few lines of code.

To get started, you need to install the API in a virtual environment [3].


pip install ikomia

Run SAM 2 with a few lines of code

You can also directly open the notebook we have prepared.

Automatic mask generator

When no prompt is provided, SAM 2 automatically generates masks over the entire image.


from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display


# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_segment_anything_2", auto_connect=True)

# Set parameters for the automatic mask generator (no prompt)
algo.set_parameters({
    "model_name": "sam2_hiera_small",
    "cuda": "True",
    "points_per_side": "32",
    "input_size_percent": "80",
    "apply_postprocessing": "True"
})
# Run directly on your image
wf.run_on(url="https://raw.githubusercontent.com/facebookresearch/segment-anything-2/main/notebooks/images/cars.jpg")

# Display your image
display(algo.get_image_with_mask())

Segment Anything 2 automatic segmentation

Box prompt prediction


from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display


# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_segment_anything_2", auto_connect=True)

# Setting parameters: boxes on the wheels
algo.set_parameters({
    "input_box": "[[425, 600, 700, 875], [1240, 675, 1400, 750], [1375, 550, 1650, 800]]"
})
# Run directly on your image
wf.run_on(url="https://github.com/facebookresearch/segment-anything/blob/main/notebooks/images/truck.jpg?raw=true")

# Display your image
display(algo.get_image_with_mask())

Segment Anything 2 segmentation using box prompt

General parameters:

  • model_name [str]: The SAM 2 model can be loaded with four different encoders:
    • sam2_hiera_tiny - 38.9 M parameters
    • sam2_hiera_small - 46 M parameters
    • sam2_hiera_base_plus - 80.8 M parameters
    • sam2_hiera_large - 224.4 M parameters
  • cuda [bool]: If True, inference runs on the GPU (CUDA); if False, it runs on the CPU.
  • input_size_percent [int]: Percentage of the original input image size used for inference. Can be reduced to save memory.

Prompt predictor parameters:

  • input_box [list]: An Nx4 list of box prompts, each in [x1, y1, x2, y2] format, e.g. [[x1, y1, x2, y2]] for a single box or [[x1, y1, x2, y2], [x1, y1, x2, y2]] for several boxes.
  • input_point [list]: An Nx2 list of point prompts in pixel coordinates, e.g. [[x, y]] for a single point or [[x1, y1], [x2, y2]] for several points (see the point-prompt example below).
  • input_point_label [list]: A length-N list of labels for the point prompts: 1 indicates a foreground point and 0 indicates a background point.
  • multimask_output [bool]: If True, the model returns three candidate masks; the model's predicted quality score can then be used to select the best one. For unambiguous prompts, such as when several prompts are provided, multimask_output=False can give better results.
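
For example, a point prompt can be passed to the same algorithm via input_point and input_point_label. This is a minimal sketch using the parameters above; the click coordinates are illustrative and should be adapted to your image.

from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_segment_anything_2", auto_connect=True)

# Two foreground clicks (label 1) on the object of interest (illustrative coordinates)
algo.set_parameters({
    "input_point": "[[500, 375], [1125, 625]]",
    "input_point_label": "[1, 1]",
    "multimask_output": "False"
})

# Run directly on your image
wf.run_on(url="https://github.com/facebookresearch/segment-anything/blob/main/notebooks/images/truck.jpg?raw=true")

# Display your image
display(algo.get_image_with_mask())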

Automatic predictor parameters:

  • points_per_side [int or None]: The number of points to be sampled along one side of the image; the total number of points is points_per_side**2.
  • points_per_batch [int]: The number of points run simultaneously by the model. Higher numbers may be faster but use more GPU memory.
  • iou_thresh [float]: A filtering threshold in [0, 1], using the model's predicted mask quality (an example combining these filters follows this list).
  • stability_score_thresh [float]: A filtering threshold in [0, 1], using the stability of the mask under changes to the cutoff used to binarize the model's mask predictions.
  • stability_score_offset [float]: The amount to shift the cutoff when calculating the stability score.
  • box_nms_thresh [float]: The box IoU cutoff used by non-maximal suppression to filter duplicate masks.
  • crop_n_layers [int]: If >0, mask prediction will be run again on crops of the image. Sets the number of layers to run, where layer i has 2**i image crops.
  • crop_nms_thresh [float]: The box IoU cutoff used by non-maximal suppression to filter duplicate masks between different crops.
  • crop_overlap_ratio [float]: Sets the degree to which crops overlap. In the first crop layer, crops will overlap by this fraction of the image length. Later layers with more crops scale down this overlap.
  • crop_n_points_downscale_factor [int]: The number of points-per-side sampled in layer n is scaled down by crop_n_points_downscale_factor**n.
  • use_m2m [bool]: Whether to add a one-step refinement using previous mask predictions.
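
As a final illustration, the automatic predictor parameters above can be combined to trade mask quantity for quality, for example by sampling more points and filtering more aggressively. The sketch below only varies parameters listed above; the threshold values are illustrative, not recommended defaults.

from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_segment_anything_2", auto_connect=True)

# Denser point sampling with stricter quality filtering (illustrative values)
algo.set_parameters({
    "model_name": "sam2_hiera_small",
    "cuda": "True",
    "points_per_side": "64",
    "iou_thresh": "0.9",
    "stability_score_thresh": "0.95",
    "box_nms_thresh": "0.7"
})

# Run directly on your image
wf.run_on(url="https://raw.githubusercontent.com/facebookresearch/segment-anything-2/main/notebooks/images/cars.jpg")

# Display your image
display(algo.get_image_with_mask())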

Exploring Image Segmentation with SAM 2 and the Ikomia Ecosystem

In this tutorial, we covered the essentials of SAM 2 and demonstrated its application. We explored how to use the automatic mask generator to create segmentation masks for entire images and how to segment specific objects with the prompt-based predictor.

The Ikomia API streamlines the creation of Computer Vision workflows, making it easy to experiment with different parameters to achieve optimal results.

For further details on the API, consult the documentation. You can also explore the cutting-edge algorithms available on Ikomia HUB and experiment with Ikomia STUDIO, which provides a user-friendly interface with the same capabilities as the API.

References

[1] SAM 2: Segment Anything in Images and Videos - https://arxiv.org/pdf/2408.00714

[2] segment-anything-2 repository - https://github.com/facebookresearch/segment-anything-2

[3] How to install a virtual environment
