Object detection has long been a cornerstone task in Computer Vision, with applications ranging from autonomous driving to medical imaging. However, traditional object detection models have been limited by their closed-set nature - they can only detect objects from a predefined set of categories they were trained on.
This limitation has motivated research into open-vocabulary object detection, where models can localize and classify objects described by arbitrary text queries, even for object categories not seen during training. A major breakthrough in this area came with the introduction of OWL-ViT (Vision Transformer for Open-World Localization) in 2022 [1].
OWL-ViT leveraged large pretrained vision-language models like CLIP to enable zero-shot object detection. While groundbreaking, OWL-ViT still left room for improvement, particularly in detecting rare object categories.
Enter OWLv2 - the next evolution in open-vocabulary object detection developed by researchers at Google DeepMind [2]. OWLv2 builds on the foundation of OWL-ViT but introduces key innovations to dramatically scale up training and improve performance. Let's dive deep into what makes OWLv2 tick and how it pushes the boundaries of open-vocabulary detection.
At its core, OWLv2 uses a similar architecture to the original OWL-ViT model:
1. Vision Transformer (ViT) Image Encoder: The model leverages a Vision Transformer to process the input image. This architecture is scalable and effective for extracting rich visual features from the image.
2. Text Encoder: A text encoder, typically based on CLIP, is used to encode text queries. This encoder transforms textual descriptions into embeddings that can be compared with visual features extracted by the image encoder.
3. Detection Heads: The model includes detection heads that predict bounding boxes and classify objects within those boxes. The architecture integrates classification and localization tasks to identify objects and their locations in the image.
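To make the interplay between these three components concrete, here is a simplified sketch of how an OWL-style detector can be wired together. This is purely illustrative and not the official implementation (the real model adds an objectness prediction and many other details):

```python
# Simplified, illustrative sketch of an OWL-style open-vocabulary detector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenVocabDetector(nn.Module):
    def __init__(self, image_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # ViT returning per-patch tokens, shape [B, N, D]
        self.text_encoder = text_encoder     # CLIP-style text encoder returning [Q, D]
        self.class_proj = nn.Linear(embed_dim, embed_dim)  # maps patch tokens into the text space
        self.box_head = nn.Sequential(                     # predicts one box per patch token
            nn.Linear(embed_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, 4)
        )

    def forward(self, images, text_queries):
        patch_tokens = self.image_encoder(images)        # [B, N, D] visual features
        query_embeds = self.text_encoder(text_queries)   # [Q, D] one embedding per text query
        boxes = self.box_head(patch_tokens).sigmoid()    # [B, N, 4] normalized (cx, cy, w, h)
        class_embeds = F.normalize(self.class_proj(patch_tokens), dim=-1)
        query_embeds = F.normalize(query_embeds, dim=-1)
        # Similarity between every predicted box and every text query acts as the class logits,
        # so the "classes" are defined purely by whatever text queries are passed in.
        logits = torch.einsum("bnd,qd->bnq", class_embeds, query_embeds)
        return boxes, logits
```

Because the class logits are just similarities to text embeddings, detecting a new category at inference time only requires passing a new text query - no retraining needed.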
The key difference is that OWLv2 introduces several optimizations to make training more efficient, such as dropping uninformative image patches early (token dropping), using a lightweight objectness head to select only the most promising instances for the more expensive heads (instance selection), and packing several images into each training example (mosaic augmentation).
These optimizations allow OWLv2 to achieve 2x higher training throughput compared to OWL-ViT (e.g., 2.2 vs 1.0 examples/second/core for the L/14 variant at 840x840 resolution on TPUv3 hardware). Crucially, these changes only affect training - at inference time, OWLv2 is identical to OWL-ViT, maintaining the latter's flexibility and zero-shot capabilities.
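As an illustration of one of these efficiency tricks, the sketch below drops the image patches with the lowest pixel variance before they enter the transformer, on the assumption that near-uniform patches (sky, walls, etc.) rarely contain useful detail. The heuristic and drop ratio here are illustrative placeholders rather than the paper's exact settings:

```python
import torch

def drop_uniform_patches(patches: torch.Tensor, keep_ratio: float = 0.5):
    """Keep only the most 'informative' patches, ranked by pixel variance.

    patches: [B, N, P] tensor of N flattened pixel patches per image.
    Returns the kept patches [B, K, P] and their indices [B, K].
    """
    variance = patches.float().var(dim=-1)                    # [B, N] per-patch pixel variance
    k = max(1, int(patches.shape[1] * keep_ratio))
    keep_idx = variance.topk(k, dim=1).indices                # indices of the K highest-variance patches
    batch_idx = torch.arange(patches.shape[0]).unsqueeze(1)   # [B, 1] for advanced indexing
    return patches[batch_idx, keep_idx], keep_idx
```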
The real magic of OWLv2 lies not just in architectural tweaks, but in how it leverages massive amounts of weakly-supervised data through self-training. The researchers developed a three-step process they call the OWL-ST (OWL Self-Training) recipe:
1. Pseudo-annotation: Use an existing open-vocabulary detector (in this case, OWL-ViT) to generate bounding box predictions on a large dataset of web images with associated text (10 billion images from the WebLI dataset).
2. Self-training: Train a new OWLv2 model from scratch on these pseudo-annotations.
3. Fine-tuning (optional): Briefly fine-tune the self-trained model on a smaller dataset with human-annotated bounding boxes.
This approach allows OWLv2 to benefit from the vast amount of image-text pairs available on the web, without requiring expensive human annotation.
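For intuition, here is a minimal sketch of what the pseudo-annotation step could look like using the OWL-ViT checkpoint available in Hugging Face Transformers. The authors used their own large-scale pipeline; the checkpoint, queries, and confidence threshold below are placeholder assumptions:

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32").eval()

def pseudo_annotate(image, queries, score_threshold=0.1):
    """Run the existing detector with text queries and keep boxes above a confidence threshold."""
    inputs = processor(text=[queries], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs=outputs, threshold=score_threshold, target_sizes=target_sizes
    )[0]
    return [
        {"box": box.tolist(), "label": queries[int(label)], "score": float(score)}
        for box, score, label in zip(results["boxes"], results["scores"], results["labels"])
    ]

# e.g. queries derived from the image's alt-text (see the N-gram discussion below)
annotations = pseudo_annotate(Image.open("web_image.jpg"), ["a dog", "a frisbee"])
```

In the actual recipe, a loop like this runs over billions of web images, and the surviving boxes become the training targets for the new OWLv2 model.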
However, several key challenges needed to be addressed to make this work effectively:
A critical decision in the pseudo-annotation step is what label space to use - in other words, what text queries should the initial detector use to label the web images? The researchers explored two main approaches:
1. Human-curated vocabulary: Combining label sets from existing object detection datasets (LVIS, Objects365, OpenImagesV4, and Visual Genome) to create a fixed set of 2,520 common object categories.
2. Machine-generated queries: Extracting N-grams (up to 10 words long) directly from the text associated with each image, with minimal filtering.
Interestingly, they found that the machine-generated approach, despite being noisier, led to better generalization to unseen classes and datasets. This suggests that the diversity of weak supervision from web data is more valuable than human curation for open-vocabulary performance.
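To make the machine-generated query idea concrete, here is a minimal sketch that extracts word N-grams from an image's associated text and uses them directly as detection queries. The filtering below is deliberately simplistic compared to what the authors actually did:

```python
def ngram_queries(alt_text, max_n=10, stopwords=("a", "an", "the", "of", "in", "on", "and")):
    """Turn free-form image alt-text into candidate detection queries via word N-grams."""
    words = [w for w in alt_text.lower().split() if w.isalpha()]
    queries = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Crude filtering: skip N-grams that start or end with a stopword.
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            queries.add(" ".join(gram))
    return sorted(queries)

print(ngram_queries("a golden retriever catching a frisbee in the park"))
# includes queries such as 'golden retriever', 'frisbee', 'park',
# 'golden retriever catching a frisbee', ...
```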
Not all pseudo-annotations are created equal. To improve the quality of the training data, the researchers experimented with filtering the pseudo-annotations based on the detection confidence scores from the initial OWL-ViT model.
They found that aggressive filtering (keeping only high-confidence detections) led to better performance when training on smaller subsets of the data. However, as they scaled up to billions of training examples, using a lower confidence threshold and keeping more data yielded the best results.
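A small, hypothetical illustration of this filtering step - the thresholds are made up, the point being that the best value shifts downward as the dataset grows:

```python
# Example pseudo-annotations in the format produced by the sketch above (scores are made up).
annotations = [
    {"box": [10, 20, 200, 180], "label": "a dog", "score": 0.92},
    {"box": [150, 30, 220, 90], "label": "a frisbee", "score": 0.34},
    {"box": [5, 5, 60, 40], "label": "park", "score": 0.08},
]

def filter_pseudo_annotations(annotations, score_threshold):
    """Keep only pseudo-boxes whose detection confidence exceeds the threshold."""
    return [ann for ann in annotations if ann["score"] >= score_threshold]

# Hypothetical thresholds: strict filtering pays off on small training subsets,
# while at web scale a lower threshold (more, noisier data) works better.
small_subset = filter_pseudo_annotations(annotations, score_threshold=0.5)   # keeps 1 box
web_scale    = filter_pseudo_annotations(annotations, score_threshold=0.1)   # keeps 2 boxes
```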
To make training on billions of examples feasible, the researchers combined these data choices with the training-efficiency optimizations described earlier, such as token dropping and instance selection.
The impact of OWLv2 and the OWL-ST training recipe is nothing short of remarkable. On the challenging LVIS benchmark, which includes many rare object categories, OWLv2 achieved significant improvements over the original OWL-ViT and previous open-vocabulary detectors.
What's particularly impressive is how OWLv2 performs on rare object categories for which it has never seen human-annotated bounding boxes. This demonstrates the power of leveraging web-scale data and self-training to improve open-vocabulary generalization.
At the time of writing, OWLv2 is the state-of-the-art open-source model for zero-shot object detection [3].
With the Ikomia API, you can effortlessly run zero-shot object detection with just a few lines of code.
To get started, you need to install the API in a virtual environment [4].
You can also directly load the notebook we have prepared.
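Below is a minimal sketch of what such a workflow could look like. The algorithm name and parameter keys are indicative and may differ from the ones published on Ikomia HUB, so check the algorithm's page for the exact identifiers:

```python
from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Create a workflow and add the OWLv2 inference algorithm from Ikomia HUB
# (the algorithm name below is indicative; see Ikomia HUB for the exact name).
wf = Workflow()
algo = wf.add_task(name="infer_owl_v2", auto_connect=True)

# Indicative parameters: the text prompt to detect and a confidence threshold.
algo.set_parameters({
    "prompt": "a cat, a remote control",
    "conf_thres": "0.2",
})

# Run the workflow on your image
wf.run_on(path="path/to/your/image.jpg")

# Display the image with the predicted boxes
display(algo.get_image_with_graphics())
```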
The text prompt, the confidence threshold, and the model variant can all be adjusted through the algorithm's parameters; the full list is documented on the algorithm's page on Ikomia HUB.
The success of OWLv2 and the OWL-ST recipe carries an important lesson for the field of computer vision: self-training on weakly-supervised, web-scale image-text data - a strategy that has already transformed image classification and language modeling - can also unlock open-vocabulary detection, without additional human annotation effort.
[1] Simple Open-Vocabulary Object Detection with Vision Transformers - https://arxiv.org/pdf/2205.06230
[2] Scaling Open-Vocabulary Object Detection - https://arxiv.org/pdf/2306.09683
[3] Zero-Shot Object Detection on LVIS v1.0 val (Papers with Code) - https://paperswithcode.com/sota/zero-shot-object-detection-on-lvis-v1-0-val