Zero-Shot Learning (ZSL) is a machine learning paradigm that enables models to classify data into categories they have never encountered during training.
If you've ever tried to gather large quantities of labeled data, you know it's a tough job. Really tough.
Things get even trickier when you have countless different categories (like every possible type of vehicle) that your model needs to learn.
So, what's the solution?
One approach is to reduce our dependency on labeled data. This brings us to zero-shot learning (ZSL), where the model learns to classify categories it has never seen before.
Imagine a vehicle classification model that can identify a "hovercraft," even though it never saw a labeled example of one during training.
Sound incredible?
In the next section, we’ll dive into how this seemingly magical method works, using examples of models that operate on a zero-shot learning framework.
Zero-Shot Learning (ZSL) is fundamentally a subfield of Transfer Learning. The general idea behind ZSL is to leverage the knowledge acquired from the training instances (seen classes) and apply it to the task of classifying unseen instances (unseen classes). This transfer of knowledge enables the model to make predictions on categories it has never encountered during training.
Zero-Shot Learning (ZSL) can be approached in various ways, each with its own strengths and applications. Here are some of the primary methods used:
Attribute-based methods rely on semantic attributes that are shared among different categories. These attributes act as a bridge between seen and unseen classes by providing a common set of features that can be used for classification [3].
Example: Identifying animals based on shared attributes like "has fur" or "can fly." If a model knows these attributes for known animals (seen classes), it can infer the presence of these attributes in unknown animals (unseen classes) and classify them accordingly.
Strengths:
Applications:
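To make the idea concrete, here is a minimal sketch of attribute-based classification, loosely in the spirit of [3]. The attribute signatures and the predicted attribute scores are made-up illustrative values; in practice the scores would come from per-attribute classifiers trained on the seen classes.

```python
import numpy as np

# Hypothetical binary attribute signatures: [has_fur, can_fly, lays_eggs, lives_in_water]
class_attributes = {
    "dog":     np.array([1, 0, 0, 0]),   # seen class
    "eagle":   np.array([0, 1, 1, 0]),   # seen class
    "penguin": np.array([0, 0, 1, 1]),   # unseen class: no labeled images at training time
}

def classify_by_attributes(predicted_attributes: np.ndarray) -> str:
    """Pick the class whose attribute signature best matches the per-attribute
    scores predicted for an input image."""
    scores = {
        name: float(np.dot(predicted_attributes, signature))
        for name, signature in class_attributes.items()
    }
    return max(scores, key=scores.get)

# Attribute scores for a new image, as produced by detectors trained on seen classes only.
predicted = np.array([0.1, 0.2, 0.9, 0.8])
print(classify_by_attributes(predicted))  # -> "penguin", an unseen class
```

Because the unseen class is described purely by its attributes, no labeled penguin images are ever needed.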
Transfer-based approaches use knowledge from related tasks or domains to classify new categories. The model is pretrained on a related task and then adapted to the new task with minimal additional training [4].
Example: Applying a model trained on vehicle types to classify new transportation modes. If the model has learned to identify cars, bikes, and buses, it can transfer this knowledge to recognize new categories like electric scooters or autonomous drones.
Strengths:
Applications:
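One way to picture this transfer is the cross-modal recipe of [4]: a projection from image-feature space into a word-embedding space is learned on the seen classes, and an unseen class is then recognized by the word vector its projected features land closest to. The sketch below uses placeholder vectors and a hand-written projection matrix purely for illustration.

```python
import numpy as np

# Hypothetical word embeddings for class names (in practice: word2vec/GloVe vectors).
word_vectors = {
    "car":              np.array([0.9, 0.1, 0.0]),   # seen class
    "bicycle":          np.array([0.1, 0.9, 0.0]),   # seen class
    "electric_scooter": np.array([0.3, 0.8, 0.2]),   # unseen class
}

# Projection from a 4-d image-feature space into the 3-d word-vector space.
# In a real system W is learned on seen classes; these numbers are placeholders.
W = np.array([
    [0.8, 0.1, 0.0],
    [0.1, 0.7, 0.1],
    [0.0, 0.2, 0.1],
    [0.1, 0.0, 0.3],
])

def classify(image_features: np.ndarray) -> str:
    """Project image features into the semantic space and return the nearest class."""
    z = image_features @ W
    sims = {name: float(np.dot(z, v) / (np.linalg.norm(z) * np.linalg.norm(v)))
            for name, v in word_vectors.items()}
    return max(sims, key=sims.get)

print(classify(np.array([0.2, 0.9, 0.5, 0.1])))  # -> "electric_scooter" with these placeholder values
```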
Generative models create synthetic data for unseen categories based on learned attributes and features. These models can generate examples of new classes, allowing the classifier to learn from this synthetic data [5].
Example: Using a generative adversarial network (GAN) to create images of new animal species based on learned features from existing species.
Strengths:
Applications:
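The sketch below compresses the usual feature-generation recipe into a few lines: a stand-in "conditional generator" (a fixed linear map plus noise, rather than a trained conditional VAE or GAN as in [5]) synthesizes feature vectors for a class from its attribute description, and an ordinary classifier is then trained on those features. All class names, attribute vectors, and features are invented for illustration; real seen-class features would come from an image encoder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical semantic descriptions (attribute vectors) for each class.
attributes = {
    "horse": np.array([1.0, 0.0]),   # seen class
    "tiger": np.array([0.0, 1.0]),   # seen class
    "zebra": np.array([0.6, 0.5]),   # unseen class, known only through its attributes
}

# Stand-in for a trained conditional generator: attributes -> 16-d feature space plus noise.
A = rng.standard_normal((2, 16))
def generate_features(attr: np.ndarray, n: int) -> np.ndarray:
    return attr @ A + 0.1 * rng.standard_normal((n, 16))

# Mix (simulated) seen-class features with synthetic features for the unseen class,
# then fit a completely standard classifier on the combined set.
X_train, y_train = [], []
for cls in ["horse", "tiger", "zebra"]:
    X_train.append(generate_features(attributes[cls], 50))
    y_train += [cls] * 50

clf = LogisticRegression(max_iter=1000).fit(np.vstack(X_train), y_train)
print(clf.predict(generate_features(attributes["zebra"], 1)))  # expected: ['zebra']
```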
Graph-based methods use knowledge graphs to represent relationships between known and unknown categories. These methods leverage relational data to aid classification [6].
Example: A knowledge graph representing relationships between different animal species, including shared habitats, dietary habits, and phylogenetic relationships. This graph helps the model infer the category of an unseen animal based on its relationship to known animals.
Strengths:
Applications:
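As a toy illustration of the idea, the snippet below synthesizes a classifier for an unseen class by averaging the classifier weights of its neighbors in a hand-written graph. Real graph-based methods such as [6] learn this propagation with graph convolutional networks instead of simple averaging, and the graph, weights, and class names here are all placeholders.

```python
import numpy as np

# Toy knowledge graph: edges connect related species (shared habitat, diet, lineage).
edges = {
    "wolf":  ["dog", "fox"],
    "dog":   ["wolf"],
    "fox":   ["wolf"],
    "dingo": ["dog", "wolf"],   # unseen class, connected only to seen relatives
}

# Classifier weight vectors learned for the seen classes (placeholder numbers).
seen_weights = {
    "dog":  np.array([0.9, 0.1, 0.2]),
    "wolf": np.array([0.8, 0.3, 0.1]),
    "fox":  np.array([0.2, 0.9, 0.1]),
}

def infer_unseen_weights(cls: str) -> np.ndarray:
    """One step of graph propagation: average the weights of seen neighbors."""
    neighbors = [seen_weights[n] for n in edges[cls] if n in seen_weights]
    return np.mean(neighbors, axis=0)

print(infer_unseen_weights("dingo"))  # synthesized classifier weights for the unseen class
```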
Let’s take a practical example: OpenAI's CLIP (Contrastive Language-Image Pretraining) is a powerful example of Zero-Shot Learning in action. CLIP is trained on an extensive dataset of image-text pairs, which teaches it to associate specific phrases with the visual patterns they describe. When making predictions, CLIP leverages these learned associations to classify a new image by comparing its embedding with the embeddings of candidate text labels and choosing the closest match [7].
The core technique driving CLIP's success is contrastive learning. This method pulls the embedding of each image toward the embedding of its paired textual description while pushing it away from the embeddings of mismatched descriptions. By aligning image features and text features in a shared space, contrastive learning enhances the model's ability to classify new images based on conceptual similarity rather than relying solely on exact label matches.
CLIP undergoes a rigorous training process that involves several critical steps:
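The full recipe is described in the CLIP paper [7]; at a high level, each training step takes a batch of matching image-text pairs, encodes both sides, computes a pairwise similarity matrix, and applies a symmetric cross-entropy loss so that each image scores highest against its own caption and vice versa. Below is a simplified PyTorch sketch of that objective; the batch size, embedding dimension, and random features stand in for the outputs of the real image and text encoders.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_features: torch.Tensor,
                    text_features: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N matching image-text pairs."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))                       # matching pairs lie on the diagonal
    loss_images = F.cross_entropy(logits, targets)      # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_images + loss_texts) / 2

# Placeholder batch of 8 pairs with 512-d embeddings from hypothetical encoders.
print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)))
```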
During the inference phase, Zero-Shot Learning (ZSL) models apply their trained knowledge to classify new, unseen data. This process involves several key steps to ensure accurate and efficient classification:
The first step in zero-shot inference involves creating a comprehensive list of potential categories, each represented by an encoded label. This is done using a pretrained text encoder that can understand and convert textual descriptions of categories into a vector representation.
Process:
Outcome: A robust set of encoded category labels that capture the semantic essence of each category.
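One concrete way to do this is with a pretrained CLIP checkpoint from the Hugging Face transformers library; the label set and prompt wording below are only illustrative choices.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate categories, phrased as short prompts so they read like natural captions.
labels = ["a photo of a car", "a photo of a bicycle", "a photo of a hovercraft"]
text_inputs = processor(text=labels, return_tensors="pt", padding=True)

with torch.no_grad():
    label_vectors = model.get_text_features(**text_inputs)                    # shape (3, 512)
label_vectors = label_vectors / label_vectors.norm(dim=-1, keepdim=True)      # unit-normalize
```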
The next step is encoding the new, unseen image using a pretrained image encoder. This encoder converts the visual information from the image into a vector representation that can be compared with the label encodings.
Process:
Outcome: A detailed vector representation of the new image that is ready for comparison with the label vectors.
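Continuing the same sketch, the image side goes through CLIP's image encoder; the file path is a placeholder for whatever unseen image you want to classify.

```python
from PIL import Image

image = Image.open("unseen_vehicle.jpg")                   # placeholder path
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_vector = model.get_image_features(**image_inputs)                   # shape (1, 512)
image_vector = image_vector / image_vector.norm(dim=-1, keepdim=True)         # unit-normalize
```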
With both the image and label vectors prepared, the next step is to assess the similarity between the encoded image and each encoded label. This is typically done using a similarity metric, such as cosine similarity, which measures the cosine of the angle between two vectors.
Process:
Outcome: A ranked list of category labels based on their similarity to the new image.
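Because both sets of vectors were unit-normalized, cosine similarity reduces to a dot product; continuing the sketch above:

```python
# One similarity score per candidate label.
similarities = (image_vector @ label_vectors.T).squeeze(0)    # shape (3,)
for label, score in zip(labels, similarities.tolist()):
    print(f"{label}: {score:.3f}")
```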
The final step is to classify the new image into the category it most closely resembles based on the similarity assessment.
Process:
Outcome: The new image is assigned to the most similar category, completing the zero-shot classification process.
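The prediction is simply the label with the highest similarity; scaling the scores and applying a softmax (CLIP uses a learned temperature for this, roughly equivalent to the factor of 100 below) turns them into a probability distribution over the candidate categories.

```python
best = similarities.argmax().item()
print("Predicted category:", labels[best])

# Optional: convert similarities into probabilities over the candidate labels.
probs = (100.0 * similarities).softmax(dim=-1)
print({label: round(p, 3) for label, p in zip(labels, probs.tolist())})
```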
[1] P. Mignone, G. Pio, S. Džeroski, et al., "Multi-task learning for the simultaneous reconstruction of the human and mouse gene regulatory networks," Scientific Reports, vol. 10, 22295, 2020, doi: 10.1038/s41598-020-78033-7.
[2] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, "Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2251-2265, 2019, doi: 10.1109/TPAMI.2018.2857768.
[3] C. H. Lampert, H. Nickisch, and S. Harmeling, "Attribute-Based Classification for Zero-Shot Visual Object Categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453-465, 2014, doi: 10.1109/TPAMI.2013.140.
[4] R. Socher et al., "Zero-Shot Learning Through Cross-Modal Transfer," Advances in Neural Information Processing Systems (NeurIPS), 2013.
[5] A. Mishra et al., "A Generative Model for Zero Shot Learning Using Conditional Variational Autoencoders," IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
[6] M. Kampffmeyer et al., "Rethinking Knowledge Graph Propagation for Zero-Shot Learning," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11479-11488.
[7] A. Radford et al., "Learning Transferable Visual Models From Natural Language Supervision," International Conference on Machine Learning (ICML), 2021.