MobileSAM (Mobile Segment Anything Model) marks a significant milestone in making advanced AI-powered image segmentation accessible on mobile devices. Its innovative architecture, coupled with the key features of decoupled distillation and mobile optimization, paves the way for a new era in mobile vision applications.
What is MobileSAM?
MobileSAM is a streamlined and efficient variant of the Segment Anything Model (SAM), optimized for mobile applications. It primarily addresses the challenge posed by the original SAM's resource-intensive image encoder: MobileSAM introduces a lightweight image encoder, significantly reducing the model's size and computational demands without compromising performance.
Key Features and Innovations
Decoupled Distillation
The essence of decoupled distillation lies in its separation of the knowledge distillation process into two distinct phases.
Initially, the process involves distilling the image encoder by transferring knowledge from the heavier ViT-H-based SAM to a SAM with a smaller image encoder. The lightweight image encoder, distilled from the default image encoder, is inherently aligned with the default mask decoder, ensuring compatibility and maintaining performance.
Optionally, further fine-tuning of the mask decoder may be performed to better align it with the distilled image encoder. This optional fine-tuning stage, however, is not always necessary due to the close resemblance of the generated image encoding from the student image encoder to that of the original teacher encoder.
The decoupled distillation process effectively addresses the optimization challenges posed by the coupled optimization of the image encoder and mask decoder. By decoupling these components, MobileSAM achieves substantial resource savings, significantly reducing the model's size while maintaining performance parity with the original SAM.
The reduced computational requirements make MobileSAM a practical and efficient solution for mobile and resource-constrained environments.
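To make the idea concrete, the minimal sketch below illustrates the encoder-only distillation objective: the student is trained to reproduce the teacher's image embeddings with a simple MSE loss. The two encoders here are tiny placeholders, not SAM's ViT-H or MobileSAM's TinyViT, so treat this as an illustration of the principle rather than the actual training code.

import torch
import torch.nn as nn

# Placeholder encoders: in MobileSAM the teacher is SAM's ViT-H image encoder
# and the student is a TinyViT; both map an image to a 256-channel embedding.
teacher_encoder = nn.Sequential(nn.Conv2d(3, 256, kernel_size=16, stride=16))
student_encoder = nn.Sequential(nn.Conv2d(3, 256, kernel_size=16, stride=16))

teacher_encoder.requires_grad_(False).eval()  # the teacher stays frozen
optimizer = torch.optim.AdamW(student_encoder.parameters(), lr=1e-4)
mse = nn.MSELoss()

def distillation_step(images: torch.Tensor) -> float:
    """One encoder-only distillation step: match the teacher's image embeddings."""
    with torch.no_grad():
        target = teacher_encoder(images)   # teacher embedding, no gradients
    pred = student_encoder(images)         # student embedding
    loss = mse(pred, target)               # align student with teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: one step on a dummy batch (SAM itself works on 1024x1024 inputs)
print(distillation_step(torch.randn(2, 3, 256, 256)))

Because the student only has to reproduce the frozen teacher's embeddings, the default mask decoder can be reused as-is, which is why the optional decoder fine-tuning described above is often unnecessary.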
Efficiency and Performance
MobileSAM maintains comparable performance to the original SAM while streamlining the process by substituting the heavy ViT-H encoder with a more compact TinyViT encoder.
This modification significantly reduces computational load, enabling MobileSAM to process an image in about 12ms on a single GPU, with the image encoder and mask decoder contributing 8ms and 4ms, respectively, to the total runtime.
| Image Encoder | Original SAM | MobileSAM |
| --- | --- | --- |
| Parameters | 611M | 5M |
| Speed | 452ms | 8ms |

Comparison of the ViT-based image encoders. [1]
| Full pipeline (Enc + Dec) | Original SAM | MobileSAM |
| --- | --- | --- |
| Parameters | 611M | 5M |
| Speed | 456ms | 12ms |

Comparison of the whole pipeline (encoder + decoder). [1]
Optimized for Mobile Environments
With a deep understanding of the constraints in mobile environments, MobileSAM is engineered to operate smoothly on mobile devices. It strikes the perfect balance between speed, size, and accuracy, making it an ideal choice for real-time applications.
Easily run MobileSAM for segmentation
The Ikomia API allows for fast segmentation using MobileSAM with minimal coding.
Setup
To begin, install the API in a virtual environment [2]. This setup ensures a smooth and efficient start with the API's capabilities.
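For instance, on a Unix-like system the setup typically looks like this (the API is published on PyPI as the ikomia package; adapt the activation command on Windows):

python -m venv ikomia_env
source ikomia_env/bin/activate   # Windows: ikomia_env\Scripts\activate
pip install ikomia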
Run MobileSAM segmentation with a few lines of code
You can also directly open the ready-to-use notebook we have prepared.
from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display
# Init your workflow
wf = Workflow()
# Add algorithm
algo = wf.add_task(name="infer_mobile_segment_anything", auto_connect=True)
# Setting parameters: boxes on the wheels
algo.set_parameters({
"input_box": "[[425, 600, 700, 875], [1240, 675, 1400, 750], [1375, 550, 1650, 800]]"
})
# Run directly on your image
wf.run_on(url="https://github.com/facebookresearch/segment-anything/blob/main/notebooks/images/truck.jpg?raw=true")
# Inspect your result
display(algo.get_image_with_mask())
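Box prompts are not the only option: you can also segment from point prompts. The variation below uses the input_point, input_point_label and mask_id parameters described in the list further down; the coordinates are only an example.

from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_mobile_segment_anything", auto_connect=True)

# Setting parameters: a single foreground point on the object (X, Y in pixels)
algo.set_parameters({
    "input_point": "[[500, 375]]",
    "input_point_label": "[1]",
    "mask_id": "1"
})

# Run directly on your image
wf.run_on(url="https://github.com/facebookresearch/segment-anything/blob/main/notebooks/images/truck.jpg?raw=true")

# Inspect your result
display(algo.get_image_with_mask())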
Inference using the automatic mask generator
from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display
# Init your workflow
wf = Workflow()
# Add algorithm
algo = wf.add_task(name="infer_mobile_segment_anything", auto_connect=True)
# Setting parameters: number of points sampled per side for automatic mask generation
algo.set_parameters({
"points_per_side": "16",
})
# Run directly on your image
wf.run_on(url="https://github.com/Ikomia-dev/notebooks/blob/main/examples/img/img_work.jpg?raw=true")
# Display your image
display(algo.get_image_with_mask())
List of parameters (a combined usage sketch follows this list):
input_box (list): An Nx4 array of box prompts given to the model, in [XYXY] format for a single box or [[XYXY], [XYXY]] for several boxes (pixel coordinates).
draw_graphic_input (Boolean): When set to True, it allows you to draw graphics (box or point) over the object you wish to segment. If set to False, MobileSAM will automatically generate masks for the entire image.
mask_id (int) - default '1': When a single graphic point is selected, MobileSAM will generate three candidate masks for that point (the three best scores). You can select which mask to output using the mask_id parameter (1, 2 or 3).
input_point (list, optional): An Nx2 array of point prompts to the model. Each point is [X, Y] in pixels.
input_point_label (list, optional): A length N array of labels for the point prompts. 1 indicates a foreground point and 0 indicates a background point.
points_per_side (int or None) - default '32': (Automatic detection mode). The number of points sampled along one side of the image for automatic mask generation. The total number of points is points_per_side**2.
points_per_batch (int) - default '64': (Automatic detection mode). Sets the number of points run simultaneously by the model. Higher numbers may be faster but use more GPU memory.
stability_score_thres (float) - default '0.95': Filtering threshold in [0,1], using the stability of the mask under changes to the cutoff used to binarize the model's mask predictions.
box_nms_thres (float) - default '0.7': The box IoU cutoff used by non-maximal suppression to filter duplicate masks.
iou_thres (float) - default '0.88': A filtering threshold in [0,1], using the model's predicted mask quality.
crop_n_layers (int) - default '0': If >0, mask prediction will be run again on crops of the image. Sets the number of layers to run, where each layer has 2**i_layer image crops.
crop_nms_thres (float) - default '0': The box IoU cutoff used by non-maximal suppression to filter duplicate masks between different crops.
crop_n_points_downscale_factor (int) - default '1': The number of points-per-side sampled in layer n is scaled down by crop_n_points_downscale_factor**n.
min_mask_region_area (int) - default '0': If >0, postprocessing will be applied to remove disconnected regions and holes in masks with area smaller than min_mask_region_area.
input_size_percent (int) - default '100': Percentage size of the input image. Can be reduced to save memory usage.
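As a combined usage sketch, here is how several of the automatic-mode parameters above can be set in a single call; the values are arbitrary examples, not recommendations.

from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_mobile_segment_anything", auto_connect=True)

# Tune the automatic mask generator: sampling grid and filtering thresholds
algo.set_parameters({
    "points_per_side": "32",          # 32*32 points sampled over the image
    "points_per_batch": "64",         # points processed per model pass
    "iou_thres": "0.90",              # keep only high-quality masks
    "stability_score_thres": "0.95",  # stability filtering
    "box_nms_thres": "0.7",           # suppress duplicate masks
    "min_mask_region_area": "100",    # drop tiny regions and holes
    "input_size_percent": "50"        # downscale the input to save memory
})

# Run directly on your image
wf.run_on(url="https://github.com/Ikomia-dev/notebooks/blob/main/examples/img/img_work.jpg?raw=true")

# Display your image
display(algo.get_image_with_mask())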
Create your workflow using MobileSAM & Stable Diffusion
In this article, we've explored image segmentation with MobileSAM.
The Ikomia API significantly streamlines the integration of diverse algorithms from various platforms, offering a cohesive and efficient image processing experience. Imagine segmenting part of an image with the Segment Anything Model and then, with the same ease, using Stable Diffusion's inpainting to replace it, all driven by simple text commands.
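A rough sketch of that chaining could look like the code below. Note the assumptions: the inpainting task name infer_hf_stable_diffusion_inpaint and its prompt parameter are assumed to be available on Ikomia HUB; check the HUB listing for the exact name and parameters before running it.

from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# 1) Segment the region of interest with MobileSAM (box prompt)
sam = wf.add_task(name="infer_mobile_segment_anything", auto_connect=True)
sam.set_parameters({"input_box": "[[425, 600, 700, 875]]"})

# 2) Inpaint the segmented region with Stable Diffusion
#    (assumed task name and parameter; verify on Ikomia HUB)
inpaint = wf.add_task(name="infer_hf_stable_diffusion_inpaint", auto_connect=True)
inpaint.set_parameters({"prompt": "a shiny chrome wheel"})

# Run the whole chain on the same image
wf.run_on(url="https://github.com/facebookresearch/segment-anything/blob/main/notebooks/images/truck.jpg?raw=true")

# Inspect the inpainted result
display(inpaint.get_output(0).get_image())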
A standout feature of the Ikomia API is its seamless ability to bridge algorithms from disparate sources such as YOLO, Hugging Face, and OpenMMLab. It simplifies the process by eliminating the complexities of managing numerous dependencies.
For comprehensive instructions on leveraging this powerful API, consult the Ikomia documentation.
To further enrich your experience, explore the Ikomia HUB for an array of advanced algorithms.
Engage with Ikomia STUDIO, which offers a user-friendly environment while preserving the full capabilities of the API.