Kandinsky AI: Leading the Way in Advanced Text-to-Image Generation

Allan Kouidri
-
1/19/2024
Cat skate cyberpunk

The Kandinsky AI series represents a significant advancement in the field of AI-driven text-to-image generation. This model series, developed by a team from Russia, has evolved through several iterations, each bringing new features and improvements in image synthesis from text descriptions.

The Evolution and Impact of Kandinsky AI Models: Harnessing Latent Diffusion in Text-to-Image Generation

The Kandinsky 2 models, the culmination of several successive versions, mark a pivotal moment in AI-driven image synthesis. Latent diffusion models, central to Kandinsky 2, generate images in a compressed latent space and then refine and decode them into detailed, full-resolution images.

This method, offering enhanced control and creativity, empowers Kandinsky 2 to effectively transform complex textual prompts into vivid, detailed images. 

This advancement not only elevates the quality and versatility of AI imagery but also signifies the model's prowess in producing highly realistic and contextually accurate visuals.

Kandinsky evolution
Comparison of the Kandinsky versions [1]

Kandinsky AI 2.0: The Multilingual Pioneer

Kandinsky AI 2.0 was presented as the first multilingual text2image model. It was notable for its large UNet of 1.2 billion parameters and incorporated two multilingual text encoders: mCLIP-XLMR with 560M parameters and mT5-encoder-small with 146M parameters.

These encoders, combined with multilingual training datasets, opened up new possibilities in text2image generation across different languages.

Kandinsky AI 2.1: Advancements in Image Synthesis

Kandinsky AI 2.1 was a leap forward, building upon the solid foundation laid by its predecessor. It was recognized for its state-of-the-art (SOTA) capabilities in multilingual text-to-image latent diffusion. 

This model leveraged the strengths of DALL-E 2 and Latent Diffusion models, incorporating both CLIP visual and text embeddings for generating the latent representation of images. 

The introduction of an image prior model, which creates a CLIP visual embedding from a text prompt, was a key innovation. Additionally, the model offered an image-blending capability, allowing two CLIP visual embeddings to be combined to produce a blended image.
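To make the blending idea concrete, here is a minimal sketch that treats blending as a weighted average of two embedding vectors. The function name `blend_embeddings` and the three-dimensional toy vectors are assumptions made for illustration; real CLIP embeddings are much larger, and the exact blending rule used inside Kandinsky may differ.

```python
from typing import List, Sequence


def blend_embeddings(
    embeddings: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Return the weighted average of equally sized embedding vectors."""
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("Provide one weight per embedding.")
    total = sum(weights)
    if total <= 0:
        raise ValueError("Weights must sum to a positive value.")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("All embeddings must have the same length.")
    # Normalise by the weight sum so the result stays on the same scale.
    return [
        sum(w * e[i] for e, w in zip(embeddings, weights)) / total
        for i in range(dim)
    ]


# Toy stand-ins for the CLIP visual embeddings of two images (hypothetical values).
cat = [1.0, 0.0, 2.0]
robot = [0.0, 1.0, 4.0]

blended = blend_embeddings([cat, robot], [0.5, 0.5])
print(blended)  # [0.5, 0.5, 3.0]
```

Changing the weights moves the result toward one source image or the other; in this sketch, a weight of 1 for one embedding and 0 for the other returns the first embedding unchanged.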

One of the significant architectural changes in Kandinsky 2.1 was the shift from VQGAN generative models to a specifically trained MoVQGAN model. This allowed for improved effectiveness in image generation. With 3.3 billion parameters, including a text encoder, image prior, CLIP image encoder, Latent Diffusion UNet, and MoVQ encoder/decoder, Kandinsky 2.1 demonstrated notable improvements in image synthesis quality.

Kandinsky architecture
Overview of the Kandinsky 2.2 text-to-image architecture [1]

Kandinsky AI 2.2: The Latest Evolution

Kandinsky 2.2 brought further enhancements, primarily through the integration of the CLIP-ViT-G image encoder. This upgrade significantly improved the model's ability to generate aesthetically pleasing and accurate images.

Another notable addition was the ControlNet mechanism, allowing for precise control over the image generation process and enabling the manipulation of images based on text guidance.

Kandinsky 2.2 was designed to be more adaptable and versatile, capable of generating images at various resolutions and aspect ratios. This model was trained on a mixture of datasets, including the LAION HighRes dataset and a collection of 2M high-quality, high-resolution images with descriptions, thereby enhancing its performance in generating more aesthetic pictures and better understanding text.
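When experimenting with different aspect ratios, a small helper can pick a width and height for you. The sketch below is illustrative only: the function `size_for_aspect` is hypothetical, and snapping both sides to a multiple of 64 is an assumption borrowed from common latent diffusion practice, not a documented Kandinsky requirement.

```python
from typing import Tuple


def size_for_aspect(
    aspect_w: int, aspect_h: int, max_side: int = 1024, multiple: int = 64
) -> Tuple[int, int]:
    """Return (width, height) whose longest side is max_side.

    Both sides are snapped to the nearest multiple of `multiple`
    (an assumption, see the text above).
    """
    if aspect_w <= 0 or aspect_h <= 0:
        raise ValueError("Aspect ratio terms must be positive.")
    if aspect_w >= aspect_h:
        width, height = float(max_side), max_side * aspect_h / aspect_w
    else:
        width, height = max_side * aspect_w / aspect_h, float(max_side)

    def snap(value: float) -> int:
        return max(multiple, int(round(value / multiple)) * multiple)

    return snap(width), snap(height)


print(size_for_aspect(1, 1))                 # (1024, 1024)
print(size_for_aspect(16, 9))                # (1024, 576)
print(size_for_aspect(5, 3, max_side=1280))  # (1280, 768)
```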

 

| Specification        | Kandinsky 2.1     | Kandinsky 2.2                         |
|----------------------|-------------------|---------------------------------------|
| Model type           | Latent Diffusion  | Latent Diffusion                      |
| Number of parameters | 3.3 billion       | 4.6 billion                           |
| Text encoder         | 0.6 billion       | 0.6 billion                           |
| Diffusion Mapping    | 1.0 billion       | 1.0 billion                           |
| U-Net                | 1.2 billion       | 1.2 billion                           |
| ViT                  | 0.5 billion       | 1.8 billion                           |
| MoVQ                 | 0.08 billion      | 0.08 billion                          |
| Dataset volume       | 1.2 billion pairs | 1.5 billion pairs                     |
| Image quality        | Good              | Very good                             |
| Image size           | 768×768           | 1024×1024 / various aspect ratios     |
| Release              | April 4, 2023     | July 12, 2023                         |
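As a quick sanity check on the comparison above, the component sizes can be summed and compared with the reported totals. This is plain arithmetic on the listed figures; the dictionary layout is just one convenient way to write it down.

```python
# Component sizes in billions of parameters: (Kandinsky 2.1, Kandinsky 2.2),
# copied from the comparison above.
components = {
    "Text encoder": (0.6, 0.6),
    "Diffusion Mapping": (1.0, 1.0),
    "U-Net": (1.2, 1.2),
    "ViT": (0.5, 1.8),
    "MoVQ": (0.08, 0.08),
}

total_21 = sum(sizes[0] for sizes in components.values())
total_22 = sum(sizes[1] for sizes in components.values())

print(f"Kandinsky 2.1: {total_21:.2f}B")          # Kandinsky 2.1: 3.38B
print(f"Kandinsky 2.2: {total_22:.2f}B")          # Kandinsky 2.2: 4.68B
print(f"Difference: {total_22 - total_21:.2f}B")  # Difference: 1.30B
```

The sums come to roughly 3.38 and 4.68 billion, in line with the reported 3.3 and 4.6 billion up to rounding. The whole 1.3 billion difference comes from the larger ViT image encoder.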

Kandinsky AI 2.2: Diverse Variants for Creative Expression

The Kandinsky 2.2 model includes a range of variants catering to different image synthesis needs:

Text-to-Image

Generates images directly from textual descriptions.
Man portrait Kandinsky
'Portrait of a man, cinematic'

Image-to-Image

Transforms an input image according to new text specifications.

Fantasy landscape img2img Kandinsky

Image Mixing (Fusion)

Combines elements of different images based on textual guidance.

Kandinsky image mixing/fusion

ControlNet

Allows precise control over the image generation process, tailored by text inputs.

Controlnet Kandinsky, cat to robot

Inpainting

Edits or completes parts of an image based on textual cues, useful for restoring or modifying images.

Kandinsky cat to dog inpainting

Each variant leverages the core strengths of the Kandinsky 2.2 model, offering flexibility and creativity in image synthesis tasks.

Practical Applications and Usage

The Kandinsky 2 series, particularly version 2.2, has wide-ranging applications. It can be used in design to rapidly convert textual ideas into visual concepts, streamlining the creative process. In education, these models can turn complex textual descriptions into visual illustrations, making learning more engaging and accessible.

Get started with Ikomia API 

Using the Ikomia API, you can effortlessly create images with Kandinsky 2.2 in just a few lines of code.

To get started, you need to install the API in a virtual environment [2].


pip install ikomia

Run Kandinsky with a few lines of code

You can also directly open the notebook we have prepared.

Note: This workflow uses about 11 GB of GPU memory on Google Colab (T4).


from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display


# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_kandinsky_2", auto_connect=False)

algo.set_parameters({
    'model_name': 'kandinsky-community/kandinsky-2-2-decoder',
    'prompt': 'A Woman Jedi fighter performs a beautiful move with one lightsabre, full body, dark galaxy background, look at camera, Ancient Chinese style, cinematic, 4K.',
    'negative_prompt': 'low quality, bad quality',
    'prior_num_inference_steps': '25',
    'prior_guidance_scale': '4.0',
    'num_inference_steps': '100',
    'guidance_scale': '1.0',
    'seed': '-1',
    'width': '1280',
    'height': '768',
    })

# Generate your image
wf.run()

# Display the image
display(algo.get_output(0).get_image())

Woman jedi kandinsky

model_name (str) - default 'kandinsky-community/kandinsky-2-2-decoder': Name of the latent diffusion model.

prompt (str) - default 'portrait of a young women, blue eyes, cinematic': Text prompt to guide the image generation.

negative_prompt (str, optional) - default 'low quality, bad quality': The prompt describing what the image generation should steer away from. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

prior_num_inference_steps (int) - default '25': Number of denoising steps of the prior model (CLIP).

prior_guidance_scale (float) - default '4.0': A higher guidance scale encourages the prior model to follow the text prompt more closely, usually at the expense of lower image quality (minimum: 1; maximum: 20).

num_inference_steps (int) - default '100': The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

guidance_scale (float) - default '1.0': A higher guidance scale encourages the decoder to generate images that are closely linked to the text prompt, usually at the expense of lower image quality (minimum: 1; maximum: 20).

height (int) - default '768': The height in pixels of the generated image.

width (int) - default '768': The width in pixels of the generated image.

seed (int) - default '-1': Seed value. '-1' generates a random number between 0 and 191965535.
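The parameter list above can be wrapped in a small helper that checks values before they are passed to the task. This is a sketch under stated assumptions: `build_kandinsky_params` is a hypothetical function, not part of the Ikomia API, and it only enforces the ranges given in the descriptions above. Values are converted to strings because the example above passes every parameter as a string.

```python
from typing import Dict


# Hypothetical helper, not part of the Ikomia API: it only prepares the
# dictionary that is later passed to algo.set_parameters().
def build_kandinsky_params(
    prompt: str,
    negative_prompt: str = "low quality, bad quality",
    prior_num_inference_steps: int = 25,
    prior_guidance_scale: float = 4.0,
    num_inference_steps: int = 100,
    guidance_scale: float = 1.0,
    width: int = 768,
    height: int = 768,
    seed: int = -1,
) -> Dict[str, str]:
    # Both guidance scales are documented with a 1 to 20 range.
    scales = {
        "prior_guidance_scale": prior_guidance_scale,
        "guidance_scale": guidance_scale,
    }
    for name, value in scales.items():
        if not 1 <= value <= 20:
            raise ValueError(f"{name} must be between 1 and 20, got {value}")

    # Step counts and image sizes must be positive to make sense.
    positives = {
        "prior_num_inference_steps": prior_num_inference_steps,
        "num_inference_steps": num_inference_steps,
        "width": width,
        "height": height,
    }
    for name, value in positives.items():
        if value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")

    values = {
        "model_name": "kandinsky-community/kandinsky-2-2-decoder",
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "prior_num_inference_steps": prior_num_inference_steps,
        "prior_guidance_scale": prior_guidance_scale,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "seed": seed,
        "width": width,
        "height": height,
    }
    return {key: str(value) for key, value in values.items()}


params = build_kandinsky_params(
    prompt="Portrait of a man, cinematic", width=1280, height=768
)
print(params["width"], params["height"], params["guidance_scale"])  # 1280 768 1.0
```

The resulting dictionary can then be handed to `algo.set_parameters(params)` in the workflow shown earlier.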

Note: The "prior model" maps the text prompt to a CLIP image embedding that captures the desired image content, while the "decoder model" turns this embedding into the actual image.
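To see the two stages from the note above as separate steps, here is a rough sketch that assumes the Hugging Face diffusers library, which this article does not otherwise use. The function name `generate_two_stage` is hypothetical, the prior checkpoint name `kandinsky-community/kandinsky-2-2-prior` is an assumption that does not appear in the article, and a CUDA GPU with `torch` and `diffusers` installed is assumed. Whether the Ikomia task works the same way internally is not something this sketch claims.

```python
def generate_two_stage(
    prompt: str,
    negative_prompt: str = "low quality, bad quality",
    width: int = 768,
    height: int = 768,
):
    """Run the prior and the decoder as two explicit steps (sketch)."""
    # Imported lazily: torch and diffusers are assumed to be installed separately.
    import torch
    from diffusers import KandinskyV22Pipeline, KandinskyV22PriorPipeline

    # Stage 1: the prior turns the text prompt into CLIP image embeddings.
    prior = KandinskyV22PriorPipeline.from_pretrained(
        "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
    ).to("cuda")
    image_embeds, negative_image_embeds = prior(
        prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=25,
        guidance_scale=4.0,
    ).to_tuple()

    # Stage 2: the decoder turns the embeddings into an image.
    decoder = KandinskyV22Pipeline.from_pretrained(
        "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
    ).to("cuda")
    result = decoder(
        image_embeds=image_embeds,
        negative_image_embeds=negative_image_embeds,
        width=width,
        height=height,
        num_inference_steps=100,
        guidance_scale=1.0,
    )
    return result.images[0]


# Example call (downloads the model weights on first use):
# image = generate_two_stage("Portrait of a man, cinematic")
# image.save("portrait.png")
```

The step counts and guidance scales mirror the defaults listed above, so the two calls line up with the `prior_*` parameters and the decoder parameters of the Ikomia task.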

Create more AI-Driven Art with Kandinsky AI 2.2 and Beyond

In this article, we've explored the innovative world of image creation using Kandinsky 2.2, a versatile AI model.

With Kandinsky 2.2, you can easily engage in various image synthesis tasks such as image-to-image transformation, image fusion, inpainting, and even ControlNet, all through a few lines of code with the Ikomia API. 

If you're seeking more cutting-edge diffusion models, consider exploring Stable Diffusion XL and SDXL Turbo, notable competitors in the open-source domain.

  • For comprehensive insights, our documentation offers detailed guidance.
  • Dive into a range of state-of-the-art algorithms on Ikomia HUB.
  • Leverage Ikomia STUDIO for a user-friendly, no-code interface that retains full API functionality.

References

[1] https://github.com/ai-forever/Kandinsky-2

[2] How to create a virtual environment in Python
