Understanding PuLID: Advanced Identity Customization in AI Image Generation

Allan Kouidri
-
7/1/2024
AI avatar generation with PuLID

What is PuLID?

PuLID enables fast and high-quality identity customization. The system quickly learns the defining features of a face from your photos and accurately transfers these characteristics to new AI-generated images. This process creates unique visuals in just a few seconds while staying faithful to the identity in the original photos [1].

PuLID AI avatar generation example

How does PuLID work?

PuLID (Pure and Lightning ID Customization) is an advanced, tuning-free method designed for customizing identities in text-to-image (T2I) generation models. The framework ensures high fidelity to the user's identity while minimizing disruptions to the original model's behavior. Here’s a detailed breakdown of how PuLID works:

Overview of PuLID framework [2].

Conventional Diffusion Branch

  • Description: The upper half of the framework image illustrates the conventional diffusion training process.
  • Functionality: This branch handles the standard diffusion-denoising training process.
  • Key Feature: Uses SDXL-Lightning, an acceleration technique that enables very fast inference in just 4 steps.
  • Process (a minimal sketch follows this list):
    • During the forward diffusion process, noise is systematically added to the data sample based on a predefined noise schedule, producing a progressively noisier sample at each timestep.
    • The reverse denoising process involves the model predicting and removing this noise to revert the sample to its original state or a clean version of it.
    • The model uses conditions such as textual prompts and ID features to guide this denoising process.
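
To make this concrete, here is a minimal, hypothetical PyTorch sketch of one diffusion training step; unet, alphas_cumprod, text_embeds, and id_tokens are illustrative placeholders, not PuLID's actual code:

import torch
import torch.nn.functional as F

def diffusion_training_step(unet, x0, alphas_cumprod, text_embeds, id_tokens):
    # Forward process: add noise to the clean latent x0 at a random timestep
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    # Reverse process: the UNet predicts the noise, guided by text and ID conditions
    pred_noise = unet(x_t, t, text_embeds, id_tokens)

    # Standard denoising objective
    return F.mse_loss(pred_noise, noise)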

Lightning T2I Branch

  • Description: The lower half of the framework image demonstrates the Lightning T2I training branch introduced by PuLID.
  • Functionality: Leverages recent fast sampling methods to iteratively denoise from pure noise to high-quality images in a limited number of steps (specifically 4 steps in this implementation; a minimal loop is sketched after this list).
  • Purpose:
    • Generates high-quality images quickly, enabling accurate ID loss calculation.
    • Constructs contrastive paths to align features between images with and without ID conditions, ensuring minimal disruption to the model's behavior.
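
As a rough illustration (not the actual implementation), the 4-step loop could look like the following; scheduler, unet, and vae are hypothetical stand-ins for SDXL-Lightning components with a diffusers-style interface:

import torch

def lightning_t2i(unet, scheduler, vae, text_embeds, id_tokens, num_steps=4):
    # Start from pure Gaussian noise in latent space
    x = torch.randn(1, 4, 128, 128, device=text_embeds.device)
    scheduler.set_timesteps(num_steps)

    # Only a handful of denoising steps are needed thanks to Lightning distillation
    for t in scheduler.timesteps:
        eps = unet(x, t, text_embeds, id_tokens)
        x = scheduler.step(eps, t, x).prev_sample

    # Decode the clean latent into an image precise enough for ID loss
    return vae.decode(x)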

ID Encoder

  • Description: The ID encoder is shown as a component that processes identity features.
  • Functionality:
    • Employs two backbones commonly used in the ID customization domain: a face recognition model and a CLIP image encoder.
    • Extracts ID features from provided images.
  • Process (a simplified sketch follows this list):
    • The feature vectors from the last layers of both backbones are concatenated.
    • A Multilayer Perceptron (MLP) maps these concatenated features into tokens representing the global ID features.
    • Additional MLPs map multi-layer features of CLIP into tokens representing local ID features.
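
A simplified sketch of this fusion is shown below; the dimensions and layer counts are illustrative assumptions, not the paper's exact values:

import torch
import torch.nn as nn

class IDEncoderSketch(nn.Module):
    def __init__(self, face_dim=512, clip_dim=1280, token_dim=2048, n_clip_layers=4):
        super().__init__()
        # Global ID tokens from the concatenated last-layer features
        self.global_mlp = nn.Sequential(
            nn.Linear(face_dim + clip_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )
        # One MLP per intermediate CLIP layer for local ID tokens
        self.local_mlps = nn.ModuleList(
            nn.Linear(clip_dim, token_dim) for _ in range(n_clip_layers)
        )

    def forward(self, face_feat, clip_last, clip_layers):
        # face_feat: (B, face_dim) from the face recognition backbone
        # clip_last: (B, clip_dim) last-layer CLIP features
        # clip_layers: list of (B, clip_dim) intermediate CLIP features
        global_tok = self.global_mlp(torch.cat([face_feat, clip_last], dim=-1))
        local_toks = [mlp(f) for mlp, f in zip(self.local_mlps, clip_layers)]
        return torch.stack([global_tok, *local_toks], dim=1)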

Contrastive Pair

  • Description: Involves constructing contrastive paths to ensure the ID embedding process does not disrupt the original model's behavior.
  • Functionality (a simplified version is sketched after this list):
    • Two paths are constructed from the same prompt and initial latent state: one with ID embedding and one without.
    • Aligns the UNet features from both paths semantically.
  • Purpose: Teaches the model to integrate ID information in a way that maintains the consistency of non-ID-related image elements (like background and style).
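
Conceptually, the alignment objective keeps UNet features from the two paths close to each other. Below is a deliberately simplified version; the paper's actual alignment loss is more elaborate:

import torch.nn.functional as F

def alignment_loss(feats_with_id, feats_without_id):
    # Each list holds UNet features from matching layers of the two paths,
    # generated from the same prompt and the same initial latent.
    loss = 0.0
    for f_id, f_ref in zip(feats_with_id, feats_without_id):
        # Penalize drift: the ID path should stay close to the ID-free path
        loss = loss + (1.0 - F.cosine_similarity(f_id, f_ref.detach(), dim=-1)).mean()
    return loss / len(feats_with_id)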

Accurate ID Loss

  • Description: Focuses on calculating ID loss accurately to ensure high ID fidelity.
  • Functionality:
    • Uses the precise, high-quality images generated by the Lightning T2I branch.
    • Calculates the ID loss by comparing the face embeddings of the generated image with ground truth face embeddings.
  • Process (sketched after this list):
    • Generates an accurate image from pure noise conditioned on the ID in just 4 steps.
    • Ensures that the ID loss is calculated in a setting that closely aligns with the actual test conditions, improving the precision of ID fidelity.
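
In simplified form, the ID loss is a face-embedding distance; face_model below is a placeholder for a pretrained face recognition network:

import torch.nn.functional as F

def id_loss(face_model, generated_image, reference_embedding):
    # Embed the face from the clean image produced by the 4-step Lightning branch
    generated_embedding = face_model(generated_image)
    # Penalize dissimilarity to the ground-truth identity embedding
    return (1.0 - F.cosine_similarity(generated_embedding,
                                      reference_embedding, dim=-1)).mean()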

In summary, the PuLID framework diagram captures the interaction between these key components, which together deliver a highly efficient and accurate method for ID customization in text-to-image generation models.

The combination of the conventional diffusion branch with SDXL Lightning, the innovative Lightning T2I branch, the robust ID encoder, the careful construction of contrastive pairs, and the precise calculation of accurate ID loss, all contribute to the superior performance of PuLID in maintaining high ID fidelity while minimizing disruptions to the original model's behavior.

Easily run PuLID 

Effortlessly generate AI avatars using PuLID through the Ikomia Imaginarium Web App, or locally with just a few lines of code.

Run PuLID with Imaginarium

Generate your AI avatar with no code needed using the Ikomia Imaginarium Web App.

Run PuLID with a few lines of code

To get started, you need to install the API in a virtual environment [3].


pip install ikomia

You can also directly open the notebook we have prepared.


from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display


# Init your workflow
wf = Workflow()

# Add the PuLID algorithm
pulid = wf.add_task(name="infer_pulid", auto_connect=True)

# Set parameters
pulid.set_parameters({
    'prompt':'portrait, color, cinematic, in garden, soft light, detailed face, wonderwoman costume, golden boomerang tiara, short hair',
    'guidance_scale':'1.2',
    'guidance_scale_id':'0.8',
    'num_inference_steps':'4',
    'seed':'-1',
    'width':'1024',
    'height':'1024',
    'mode':'fidelity',
    'num_images_per_prompt':'1',
    'id_mix':'False'
    })

# Run on your image  
wf.run_on(url="https://images.pexels.com/photos/4927360/pexels-photo-4927360.jpeg?cs=srgb&dl=pexels-anntarazevich-4927360.jpg&fm=jpg&w=1280&h=1920")

display(pulid.get_output(0).get_image())

PuLID output
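
If you want to save the result to disk, the output is a NumPy array, so you can write it with OpenCV, for example (assuming the array is in RGB order):

import cv2

# Convert RGB -> BGR before writing, since OpenCV expects BGR
img = pulid.get_output(0).get_image()
cv2.imwrite("pulid_avatar.png", cv2.cvtColor(img, cv2.COLOR_RGB2BGR))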

List of parameters:

  • prompt (str): Text prompt to guide the image generation.
  • negative_prompt (str, optional) - default 'flaws in the eyes, flaws in the face, flaws, lowres, non-HDRi, low quality, worst quality, artifacts noise, text, watermark, glitch, deformed, mutated, ugly, disfigured, hands, low resolution, partially rendered objects, deformed or partially rendered eyes, deformed, deformed eyeballs, cross-eyed, blurry': The prompt describing what not to include in the image. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
  • num_inference_steps (int) - default '4': Number of denoising steps.
  • guidance_scale (float) - default '1.2': Stable Diffusion scale for classifier-free guidance. Recommended range: [1, 1.5]; a value of 1 is faster.
  • guidance_scale_id (float) - default '0.8': ID guidance scale. Recommended between [0, 5].
  • seed (int) - default '-1': Seed value. '-1' generates a random number between 0 and 1919655350.
  • num_images_per_prompt (int) - default '1': Number of generated images.
  • mode (str) - default 'fidelity': Mode of image generation, 'fidelity' or 'extremely style'. In practice, we don't see much difference between the two.
  • width (int) - default '1024': Output width. If not divisible by 8, it will be automatically adjusted to a multiple of 8.
  • height (int) - default '1024': Output height. If not divisible by 8, it will be automatically adjusted to a multiple of 8.
  • id_mix (bool) - default 'False': Turn this on to mix two ID images; otherwise, leave it off.

Resources

  • Browse the Ikomia HUB to play with more diffusion models, such as:

    • Face inpainting with RealVisXL
    • SDXL and SDXL Turbo
    • Stable Cascade

  • For more information on how to use the API, see the Ikomia documentation. It's set up to help you get the most out of the API's offerings.

  • Ikomia STUDIO complements the ecosystem by offering a no-code, visual approach to image processing, reflecting the API's features in an accessible interface.

References

[1] https://github.com/ToTheBeginning/PuLID

[2] https://arxiv.org/abs/2404.16022

[3] How to create a virtual environment in Python
