Object detection has long been a cornerstone task in Computer Vision, with applications ranging from autonomous driving to medical imaging. However, traditional object detection models have been limited by their closed-set nature - they can only detect objects from a predefined set of categories they were trained on.
This limitation has motivated research into open-vocabulary object detection, where models can localize and classify objects described by arbitrary text queries, even for object categories not seen during training. A major breakthrough in this area came with the introduction of OWL-ViT (Vision Transformer for Open-World Localization) in 2022 [1].
OWL-ViT leveraged large pretrained vision-language models like CLIP to enable zero-shot object detection. While groundbreaking, OWL-ViT still left room for improvement, particularly in detecting rare object categories.
Enter OWLv2 - the next evolution in open-vocabulary object detection developed by researchers at Google DeepMind [2]. OWLv2 builds on the foundation of OWL-ViT but introduces key innovations to dramatically scale up training and improve performance. Let's dive deep into what makes OWLv2 tick and how it pushes the boundaries of open-vocabulary detection.
At its core, OWLv2 uses a similar architecture to the original OWL-ViT model:
1. Vision Transformer (ViT) Image Encoder: The model leverages a Vision Transformer to process the input image. This architecture is scalable and effective for extracting rich visual features from the image.
2. Text Encoder: A text encoder, typically based on CLIP, is used to encode text queries. This encoder transforms textual descriptions into embeddings that can be compared with visual features extracted by the image encoder.
3. Detection Heads: The model includes detection heads that predict bounding boxes and classify objects within those boxes. The architecture integrates classification and localization tasks to identify objects and their locations in the image.
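To make the interplay between these three components concrete, here is a simplified sketch of how an OWL-style detector can be wired together. This is purely illustrative and not the official implementation (the real model adds an objectness prediction and many other details):

```python
# Simplified, illustrative sketch of an OWL-style open-vocabulary detector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenVocabDetector(nn.Module):
    def __init__(self, image_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # ViT returning per-patch tokens, shape [B, N, D]
        self.text_encoder = text_encoder     # CLIP-style text encoder returning [Q, D]
        self.class_proj = nn.Linear(embed_dim, embed_dim)  # maps patch tokens into the text space
        self.box_head = nn.Sequential(                     # predicts one box per patch token
            nn.Linear(embed_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, 4)
        )

    def forward(self, images, text_queries):
        patch_tokens = self.image_encoder(images)        # [B, N, D] visual features
        query_embeds = self.text_encoder(text_queries)   # [Q, D] one embedding per text query
        boxes = self.box_head(patch_tokens).sigmoid()    # [B, N, 4] normalized (cx, cy, w, h)
        class_embeds = F.normalize(self.class_proj(patch_tokens), dim=-1)
        query_embeds = F.normalize(query_embeds, dim=-1)
        # Similarity between every predicted box and every text query acts as the class logits,
        # so the "classes" are defined purely by whatever text queries are passed in.
        logits = torch.einsum("bnd,qd->bnq", class_embeds, query_embeds)
        return boxes, logits
```

Because the class logits are just similarities to text embeddings, detecting a new category at inference time only requires passing a new text query - no retraining needed.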
The key difference is that OWLv2 introduces several optimizations to make training more efficient, such as dropping uninformative image patches early (token dropping), using a lightweight objectness head to select only the most promising instances for the more expensive heads (instance selection), and packing several images into each training example (mosaic augmentation).
These optimizations allow OWLv2 to achieve 2x higher training throughput compared to OWL-ViT (e.g., 2.2 vs 1.0 examples/second/core for the L/14 variant at 840x840 resolution on TPUv3 hardware). Crucially, these changes only affect training - at inference time, OWLv2 is identical to OWL-ViT, maintaining the latter's flexibility and zero-shot capabilities.
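As an illustration of one of these efficiency tricks, the sketch below drops the image patches with the lowest pixel variance before they enter the transformer, on the assumption that near-uniform patches (sky, walls, etc.) rarely contain useful detail. The heuristic and drop ratio here are illustrative placeholders rather than the paper's exact settings:

```python
import torch

def drop_uniform_patches(patches: torch.Tensor, keep_ratio: float = 0.5):
    """Keep only the most 'informative' patches, ranked by pixel variance.

    patches: [B, N, P] tensor of N flattened pixel patches per image.
    Returns the kept patches [B, K, P] and their indices [B, K].
    """
    variance = patches.float().var(dim=-1)                    # [B, N] per-patch pixel variance
    k = max(1, int(patches.shape[1] * keep_ratio))
    keep_idx = variance.topk(k, dim=1).indices                # indices of the K highest-variance patches
    batch_idx = torch.arange(patches.shape[0]).unsqueeze(1)   # [B, 1] for advanced indexing
    return patches[batch_idx, keep_idx], keep_idx
```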
The real magic of OWLv2 lies not just in architectural tweaks, but in how it leverages massive amounts of weakly-supervised data through self-training. The researchers developed a three-step process they call the OWL-ST (OWL Self-Training) recipe:
1. Pseudo-annotation: Use an existing open-vocabulary detector (in this case, OWL-ViT) to generate bounding box predictions on a large dataset of web images with associated text (10 billion images from the WebLI dataset).
2. Self-training: Train a new OWLv2 model from scratch on these pseudo-annotations.
3. Fine-tuning (optional): Briefly fine-tune the self-trained model on a smaller dataset with human-annotated bounding boxes.
This approach allows OWLv2 to benefit from the vast amount of image-text pairs available on the web, without requiring expensive human annotation.
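For intuition, here is a minimal sketch of what the pseudo-annotation step could look like using the OWL-ViT checkpoint available in Hugging Face Transformers. The authors used their own large-scale pipeline; the checkpoint, queries, and confidence threshold below are placeholder assumptions:

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32").eval()

def pseudo_annotate(image, queries, score_threshold=0.1):
    """Run the existing detector with text queries and keep boxes above a confidence threshold."""
    inputs = processor(text=[queries], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs=outputs, threshold=score_threshold, target_sizes=target_sizes
    )[0]
    return [
        {"box": box.tolist(), "label": queries[int(label)], "score": float(score)}
        for box, score, label in zip(results["boxes"], results["scores"], results["labels"])
    ]

# e.g. queries derived from the image's alt-text (see the N-gram discussion below)
annotations = pseudo_annotate(Image.open("web_image.jpg"), ["a dog", "a frisbee"])
```

In the actual recipe, a loop like this runs over billions of web images, and the surviving boxes become the training targets for the new OWLv2 model.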
However, several key challenges needed to be addressed to make this work effectively:
A critical decision in the pseudo-annotation step is what label space to use - in other words, what text queries should the initial detector use to label the web images? The researchers explored two main approaches:
1. Human-curated vocabulary: Combining label sets from existing object detection datasets (LVIS, Objects365, OpenImagesV4, and Visual Genome) to create a fixed set of 2,520 common object categories.
2. Machine-generated queries: Extracting N-grams (up to 10 words long) directly from the text associated with each image, with minimal filtering.
Interestingly, they found that the machine-generated approach, despite being noisier, led to better generalization to unseen classes and datasets. This suggests that the diversity of weak supervision from web data is more valuable than human curation for open-vocabulary performance.
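To make the machine-generated query idea concrete, here is a minimal sketch that extracts word N-grams from an image's associated text and uses them directly as detection queries. The filtering below is deliberately simplistic compared to what the authors actually did:

```python
def ngram_queries(alt_text, max_n=10, stopwords=("a", "an", "the", "of", "in", "on", "and")):
    """Turn free-form image alt-text into candidate detection queries via word N-grams."""
    words = [w for w in alt_text.lower().split() if w.isalpha()]
    queries = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Crude filtering: skip N-grams that start or end with a stopword.
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            queries.add(" ".join(gram))
    return sorted(queries)

print(ngram_queries("a golden retriever catching a frisbee in the park"))
# includes queries such as 'golden retriever', 'frisbee', 'park',
# 'golden retriever catching a frisbee', ...
```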
Not all pseudo-annotations are created equal. To improve the quality of the training data, the researchers experimented with filtering the pseudo-annotations based on the detection confidence scores from the initial OWL-ViT model.
They found that aggressive filtering (keeping only high-confidence detections) led to better performance when training on smaller subsets of the data. However, as they scaled up to billions of training examples, using a lower confidence threshold and keeping more data yielded the best results.
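A small, hypothetical illustration of this filtering step - the thresholds are made up, the point being that the best value shifts downward as the dataset grows:

```python
# Example pseudo-annotations in the format produced by the sketch above (scores are made up).
annotations = [
    {"box": [10, 20, 200, 180], "label": "a dog", "score": 0.92},
    {"box": [150, 30, 220, 90], "label": "a frisbee", "score": 0.34},
    {"box": [5, 5, 60, 40], "label": "park", "score": 0.08},
]

def filter_pseudo_annotations(annotations, score_threshold):
    """Keep only pseudo-boxes whose detection confidence exceeds the threshold."""
    return [ann for ann in annotations if ann["score"] >= score_threshold]

# Hypothetical thresholds: strict filtering pays off on small training subsets,
# while at web scale a lower threshold (more, noisier data) works better.
small_subset = filter_pseudo_annotations(annotations, score_threshold=0.5)   # keeps 1 box
web_scale    = filter_pseudo_annotations(annotations, score_threshold=0.1)   # keeps 2 boxes
```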
To make training on billions of examples feasible, the researchers combined these data choices with the training-efficiency optimizations described earlier, such as token dropping and instance selection.
The impact of OWLv2 and the OWL-ST training recipe is nothing short of remarkable. On the challenging LVIS benchmark, which includes many rare object categories, OWLv2 achieved significant improvements over the original OWL-ViT and previous open-vocabulary detectors.
What's particularly impressive is how OWLv2 performs on rare object categories for which it has never seen human-annotated bounding boxes. This demonstrates the power of leveraging web-scale data and self-training to improve open-vocabulary generalization.
At the time of writing, OWLv2 is the state-of-the-art open-source model for zero-shot object detection [3].
With the Ikomia API, you can effortlessly run zero-shot object detection with just a few lines of code.
To get started, you need to install the API in a virtual environment [4].
You can also directly load the notebook we have prepared.
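Below is a minimal sketch of what such a workflow could look like. The algorithm name and parameter keys are indicative and may differ from the ones published on Ikomia HUB, so check the algorithm's page for the exact identifiers:

```python
from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Create a workflow and add the OWLv2 inference algorithm from Ikomia HUB
# (the algorithm name below is indicative; see Ikomia HUB for the exact name).
wf = Workflow()
algo = wf.add_task(name="infer_owl_v2", auto_connect=True)

# Indicative parameters: the text prompt to detect and a confidence threshold.
algo.set_parameters({
    "prompt": "a cat, a remote control",
    "conf_thres": "0.2",
})

# Run the workflow on your image
wf.run_on(path="path/to/your/image.jpg")

# Display the image with the predicted boxes
display(algo.get_image_with_graphics())
```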
The text prompt, the confidence threshold, and the model variant can all be adjusted through the algorithm's parameters; the full list is documented on the algorithm's page on Ikomia HUB.
The success of OWLv2 and the OWL-ST recipe carries an important lesson for the field of computer vision: self-training on weakly-supervised, web-scale image-text data - a strategy that has already transformed image classification and language modeling - can also unlock open-vocabulary detection, without additional human annotation effort.
[1] Simple Open-Vocabulary Object Detection with Vision Transformers - https://arxiv.org/pdf/2205.06230
[2] Scaling Open-Vocabulary Object Detection - https://arxiv.org/pdf/2306.09683
[3] Zero-Shot Object Detection on LVIS v1.0 val (Papers with Code) - https://paperswithcode.com/sota/zero-shot-object-detection-on-lvis-v1-0-val