MobileSAM: The Future of Mobile Image Segmentation

Allan Kouidri
-
2/12/2024

MobileSAM (Mobile Segment Anything Model) marks a significant milestone in making advanced AI-powered image segmentation accessible on mobile devices. Its innovative architecture, coupled with the key features of decoupled distillation and mobile optimization, paves the way for a new era in mobile vision applications.

What is MobileSAM?

MobileSAM is a streamlined and efficient variant of the Segment Anything Model (SAM), optimized for mobile applications. The innovation primarily addresses the challenge posed by the original SAM's resource-intensive image encoder. MobileSAM introduces a lightweight image encoder, significantly reducing the model's size and computational demands without compromising performance.

Original SAM and MobileSAM with a box as the prompt. [1]

Key Features and Innovations

Decoupled Distillation

The essence of decoupled distillation lies in its separation of the knowledge distillation process into two distinct phases. 

  1. First, the image encoder is distilled: knowledge is transferred from the heavy ViT-H-based encoder of the original SAM to a much smaller image encoder. Because the lightweight encoder is distilled from the default one, it remains inherently aligned with the default mask decoder, ensuring compatibility and maintaining performance.
  2. Optionally, the mask decoder can then be fine-tuned to better align it with the distilled image encoder. This stage is not always necessary, because the image embeddings produced by the student encoder already closely resemble those of the original teacher encoder.
Decoupled distillation for SAM. [1]

The decoupled distillation process effectively addresses the optimization challenges posed by the coupled optimization of the image encoder and mask decoder. By decoupling these components, MobileSAM achieves substantial resource savings, significantly reducing the model's size while maintaining performance parity with the original SAM. 

The reduced computational requirements make MobileSAM a practical and efficient solution for mobile and resource-constrained environments.
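
Conceptually, the first distillation stage reduces to a simple regression between embeddings: the student encoder is trained so its output matches the teacher's. Below is a minimal PyTorch sketch of that idea; the encoder objects and training-loop details are illustrative assumptions, not the authors' actual code.

import torch
import torch.nn.functional as F

# Illustrative stage-1 distillation step: the student (e.g., TinyViT) learns to
# reproduce the teacher's (ViT-H) image embeddings with a plain MSE loss.
# `teacher` and `student` are assumed to return embeddings of the same shape.
def distillation_step(teacher, student, optimizer, images):
    with torch.no_grad():
        target = teacher(images)       # teacher embeddings, kept frozen
    pred = student(images)             # student embeddings
    loss = F.mse_loss(pred, target)    # match embeddings directly; the mask
                                       # decoder plays no role in this stage
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()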

Efficiency and Performance

MobileSAM maintains comparable performance to the original SAM while streamlining the process by substituting the heavy ViT-H encoder with a more compact TinyViT encoder.

This modification significantly reduces computational load, enabling MobileSAM to process an image in about 12ms on a single GPU, with the image encoder and mask decoder contributing 8ms and 4ms, respectively, to the total runtime.

| Image Encoder | Original SAM | MobileSAM |
|---------------|--------------|-----------|
| Parameters    | 611M         | 5M        |
| Speed         | 452ms        | 8ms       |

Comparison of the ViT-based image encoders. [1]

| Full Pipeline (Enc + Dec) | Original SAM | MobileSAM |
|---------------------------|--------------|-----------|
| Parameters                | 615M         | 9.66M     |
| Speed                     | 456ms        | 12ms      |

Comparison of the whole pipeline (encoder + decoder). [1]
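
To sanity-check latency figures like these on your own hardware, you can time the encoder and decoder stages separately. The sketch below uses plain wall-clock timing; the `encoder` and `decoder` callables in the commented usage are hypothetical names for the loaded model components.

import time
import torch

def avg_latency_ms(fn, *args, warmup=3, runs=10):
    """Average wall-clock latency of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)                      # warm up kernels and caches
    if torch.cuda.is_available():
        torch.cuda.synchronize()       # flush pending GPU work before timing
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000

# Hypothetical usage, once `encoder`, `decoder`, and their inputs are loaded:
# enc_ms = avg_latency_ms(encoder, image)
# dec_ms = avg_latency_ms(decoder, embedding, prompts)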

Optimized for Mobile Environments

With a deep understanding of the constraints in mobile environments, MobileSAM is engineered to operate smoothly on mobile devices. It strikes the perfect balance between speed, size, and accuracy, making it an ideal choice for real-time applications.

Easily run MobileSAM for segmentation

The Ikomia API allows for fast segmentation using MobileSAM with minimal coding.

Setup

To begin, install the API in a virtual environment [2]. This setup ensures a smooth and efficient start with the API's capabilities.
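
For example, with Python's built-in venv module (the environment name here is arbitrary):

python -m venv ikomia_env
source ikomia_env/bin/activate
# On Windows: ikomia_env\Scripts\activate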

Run MobileSAM segmentation with a few lines of code

You can also directly open the notebook we have prepared.


pip install ikomia

Inference using a box prompt


from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display


# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_mobile_segment_anything", auto_connect=True)

# Setting parameters: boxes on the wheels
algo.set_parameters({
    "input_box": "[[425, 600, 700, 875], [1240, 675, 1400, 750], [1375, 550, 1650, 800]]"
})

# Run directly on your image
wf.run_on(url="https://github.com/facebookresearch/segment-anything/blob/main/notebooks/images/truck.jpg?raw=true")

# Inspect your result
display(algo.get_image_with_mask())

MobileSAM segmentation of the truck wheels using box prompts

Inference using the automatic mask generator


from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display


# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_mobile_segment_anything", auto_connect=True)

# Setting parameters: number of points sampled per image side (automatic mode)
algo.set_parameters({
    "points_per_side": "16",
})

# Run directly on your image
wf.run_on(url="https://github.com/Ikomia-dev/notebooks/blob/main/examples/img/img_work.jpg?raw=true")

# Display your image
display(algo.get_image_with_mask())

MobileSAM automatic mask generation on a work desk image

List of parameters:

  • input_box (list): An Nx4 array of box prompts to the model, in [XYXY] or [[XYXY], [XYXY]] format.
  • draw_graphic_input (Boolean): When set to True, it allows you to draw graphics (box or point) over the object you wish to segment. If set to False, MobileSAM will automatically generate masks for the entire image.
  • mask_id (int) - default '1': When a single graphic point is selected, MobileSAM generates three candidate masks (the three best scores). You can select which mask to output using the mask_id parameter (1, 2 or 3); see the example after this list.
  • input_point (list, optional): An Nx2 array of point prompts to the model. Each point is in [X, Y] in pixels.
  • input_point_label (list, optional): A length-N array of labels for the point prompts. 1 indicates a foreground point and 0 indicates a background point.
  • points_per_side (int) - default '32': (Automatic detection mode). The number of points to be sampled along one side of the image. The total number of points is points_per_side**2.
  • points_per_batch (int) - default '64': (Automatic detection mode). Sets the number of points run simultaneously by the model. Higher numbers may be faster but use more GPU memory.
  • stability_score_thres (float) - default '0.95': Filtering threshold in [0,1], using the stability of the mask under changes to the cutoff used to binarize the model's mask predictions.
  • box_nms_thres (float) - default '0.7': The box IoU cutoff used by non-maximal suppression to filter duplicate masks.
  • iou_thres (float) - default '0.88': A filtering threshold in [0,1], using the model's predicted mask quality.
  • crop_n_layers (int) - default '0': If >0, mask prediction will be run again on crops of the image. Sets the number of layers to run, where each layer has 2**i_layer image crops.
  • crop_nms_thres (float) - default '0': The box IoU cutoff used by non-maximal suppression to filter duplicate masks between different crops.
  • crop_overlap_ratio (float) - default '512 / 1500': Sets the degree to which crops overlap. In the first crop layer, crops will overlap by this fraction of the image length.
  • crop_n_points_downscale_factor (int) - default '1': The number of points-per-side sampled in layer n is scaled down by crop_n_points_downscale_factor**n.
  • min_mask_region_area (int) - default '0': If >0, postprocessing will be applied to remove disconnected regions and holes in masks with an area smaller than min_mask_region_area.
  • input_size_percent (int) - default '100': Percentage size of the input image. Can be reduced to save memory usage.
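
For instance, combining input_point, input_point_label, and mask_id from the list above gives a point-prompt workflow. The coordinates below are illustrative, not taken from the article:

from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_mobile_segment_anything", auto_connect=True)

# Setting parameters: one foreground point on an object (illustrative coordinates)
algo.set_parameters({
    "input_point": "[[500, 700]]",     # [X, Y] in pixels
    "input_point_label": "[1]",        # 1 = foreground point
    "mask_id": "2",                    # pick one of the 3 candidate masks
})

# Run directly on your image
wf.run_on(url="https://github.com/facebookresearch/segment-anything/blob/main/notebooks/images/truck.jpg?raw=true")

# Inspect your result
display(algo.get_image_with_mask())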

Create your workflow using MobileSAM & Stable Diffusion

In this article, we've explored image segmentation with MobileSAM. 

The Ikomia API significantly streamlines the integration of diverse algorithms from various platforms, offering a cohesive and efficient image processing experience. Imagine segmenting part of an image with the Segment Anything Model and then, with the same ease, using Stable Diffusion's inpainting to replace it, all driven by simple text commands.
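
As a rough sketch of what such a chain could look like with the Ikomia API (the inpainting task name below is an assumption; check the Ikomia HUB for the exact identifier and parameters):

from ikomia.dataprocess.workflow import Workflow

# Init your workflow
wf = Workflow()

# Segment the region to replace with MobileSAM (illustrative box prompt)
sam = wf.add_task(name="infer_mobile_segment_anything", auto_connect=True)
sam.set_parameters({"input_box": "[[425, 600, 700, 875]]"})

# Hypothetical inpainting task name; auto_connect chains it to the SAM output
inpaint = wf.add_task(name="infer_hf_stable_diffusion_inpaint", auto_connect=True)
inpaint.set_parameters({"prompt": "a shiny chrome wheel"})

# Run directly on your image
wf.run_on(url="https://github.com/facebookresearch/segment-anything/blob/main/notebooks/images/truck.jpg?raw=true")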

Discover the Fusion of SAM and Stable Diffusion for Inpainting→

A standout feature of the Ikomia API is its seamless ability to bridge algorithms from disparate sources such as YOLO, Hugging Face, and OpenMMLab. It simplifies the process by eliminating the complexities of managing numerous dependencies.

  • For comprehensive instructions on leveraging this powerful API, consult the Ikomia documentation
  • To further enrich your experience, explore the Ikomia HUB for an array of advanced algorithms.

Engage with Ikomia STUDIO, which offers a user-friendly environment while preserving the full capabilities of the API.

References

[1] MobileSAM – GitHub official repository

[2] How to create a virtual environment in Python
