AI image generation is currently captivating minds, offering a remarkable capability to produce visually stunning images from textual descriptions. The enchanting quality of this technology signals a transformative shift in how humans approach art creation. The introduction of Stable Diffusion represents a significant milestone in this field, democratizing access to a high-performance model with impressive image quality, speed, and relatively modest resource requirements.
This is a brief introduction to how Stable Diffusion works.
Stable Diffusion is versatile in that it can be used in a number of different ways.
To understand better, let’s delve into the core components, how they work together, and what the choices in image generation options and parameters mean.
What are the components of Stable Diffusion?
Stable Diffusion comprises multiple components and models rather than being a single, monolithic model. Looking inside, we first notice a text-understanding component that translates the input text into a numerical representation capturing the ideas it expresses. That is a high-level description for now; we'll get into more of the machine learning details later. This text encoder is a specialized Transformer language model, specifically the text encoder of a CLIP model. It takes the input text and produces a series of numbers representing each word or token in the text (a vector per token). This information is then fed into the Image Generator, which itself consists of several components.
The image generator goes through two stages:
Image Information Creator
This element serves as the secret sauce in Stable Diffusion, contributing significantly to its improved performance compared to earlier models. The process involves running this component through multiple steps to generate image information, with the default steps parameter often set to 50 or 100 in Stable Diffusion interfaces and libraries.
Notably, the image information creator operates exclusively in the image information space, also known as the latent space, a concept we’ll delve into later. This characteristic enhances its speed compared to previous diffusion models that operated in pixel space. In technical terms, this component consists of a UNet neural network and a scheduling algorithm.
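To make the steps parameter concrete, here is a small sketch using a scheduler from the Hugging Face diffusers library (assuming it is installed; the DDIM scheduler is just one of several options commonly used with Stable Diffusion). The scheduler decides which noise levels the U-Net visits across those steps.

```python
# A small sketch, assuming the Hugging Face `diffusers` library is installed.
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)   # the "steps" parameter exposed by most interfaces

# The scheduler picks which of the 1000 training noise levels the U-Net
# actually visits during those 50 denoising steps.
print(scheduler.timesteps)
```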
The word “diffusion” describes what happens in this component. It is the step-by-step processing of information that leads to a high-quality image being generated in the end (by the next component, the image decoder).
Image Decoder
The image decoder generates an image from the information provided by the information creator, running only once at the end of the process to produce the final pixel image.
With this we come to see the three main components (each with its own neural network) that make up Stable Diffusion:
ClipText for text encoding.
U-Net + Scheduler to gradually process/diffuse information in the information (latent) space.
Auto-encoder Decoder that paints the final image using the processed information array.
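These three components are easy to see in code. The following is a minimal sketch, assuming the Hugging Face diffusers library and the runwayml/stable-diffusion-v1-5 checkpoint are available; the pipeline object exposes each component by name.

```python
# A minimal sketch, assuming the Hugging Face `diffusers` library and the
# runwayml/stable-diffusion-v1-5 checkpoint are available.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# The three main components described above, plus the scheduling algorithm:
print(type(pipe.text_encoder).__name__)  # ClipText: the CLIP text encoder
print(type(pipe.unet).__name__)          # U-Net: the noise predictor
print(type(pipe.scheduler).__name__)     # scheduler used alongside the U-Net
print(type(pipe.vae).__name__)           # autoencoder whose decoder paints the final image

# Generating an image runs all of these in sequence.
image = pipe("a photograph of an astronaut riding a horse",
             num_inference_steps=50).images[0]
```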
So what is Diffusion?
Credits: Jay Alammar
Diffusion is the process that produces an information array that the image decoder uses to paint the final image.
The process unfolds in sequential steps, with each step adding more relevant information. To get an intuitive sense of this, we can inspect the initial random latent array: passing it through the image decoder shows that it amounts to nothing but visual noise.
Diffusion occurs iteratively, with each step working on an input latent array to generate another array that more accurately mirrors the input text and incorporates visual details gleaned from the model’s training images.
How does it work?
Generating images using diffusion models relies on powerful computer vision models. With a sufficiently large dataset, these models can learn complex operations.
The approach involves adding noise to an image in controlled increments over multiple steps. Each noised image, paired with the amount of noise added (the step number), then serves as a training example, so this process yields many training examples for every image in the dataset. By training a noise predictor on this dataset, we obtain an effective tool that generates images when configured appropriately.
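As a rough illustration of how such a training example might be constructed, here is a toy sketch in PyTorch. The linear noise schedule and the function name are illustrative choices, not the exact ones used by any particular Stable Diffusion release.

```python
import torch

# Toy sketch: add a controlled amount of noise to an image and remember both
# the noise and the step at which it was added.
def make_training_example(image, num_steps=1000):
    betas = torch.linspace(1e-4, 0.02, num_steps)        # illustrative noise schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative fraction of signal kept

    t = torch.randint(0, num_steps, (1,))                # pick a random step
    noise = torch.randn_like(image)                      # sample Gaussian noise
    a = alphas_cumprod[t].sqrt()
    b = (1.0 - alphas_cumprod[t]).sqrt()
    noisy_image = a * image + b * noise                  # the noised input

    # The noise predictor is trained to map (noisy_image, t) back to `noise`.
    return noisy_image, t, noise
```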
Painting images by removing noise
The trained noise predictor takes a noisy image and the number of the denoising step, and predicts a portion of noise. Subtracting that predicted noise from the image produces an output closer to the images the model learned from. The result is not an exact replica of any specific training image; rather, it matches the distribution of pixel arrangements characteristic of the familiar world, where the sky tends to be blue, people generally have two eyes, and cats typically have pointy ears and an unimpressed expression.
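The generation loop can be sketched in a few lines. This is a deliberately simplified version: real schedulers use more careful update rules than a fixed step size, and noise_predictor here is a stand-in for the trained U-Net.

```python
import torch

# A toy denoising loop. `noise_predictor(x, t)` stands in for a trained model that
# predicts the noise present in x at step t; the fixed step size is a simplification.
def generate(noise_predictor, shape=(1, 4, 64, 64), num_steps=50, step_size=0.1):
    x = torch.randn(shape)                      # start from pure random noise
    for t in reversed(range(num_steps)):
        predicted_noise = noise_predictor(x, torch.tensor([t]))
        x = x - step_size * predicted_noise     # subtract a portion of the predicted noise
    return x                                    # a latent array the image decoder can paint
```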
The Text Encoder: A Transformer Language Model
The language-understanding component is a Transformer language model that processes the text prompt and produces token embeddings. The Stable Diffusion model, as released, uses ClipText (a GPT-based model) for this purpose.
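To make this concrete, here is a small sketch using the Hugging Face transformers library; openai/clip-vit-large-patch14 is the CLIP text encoder commonly paired with Stable Diffusion v1, but treat the exact checkpoint name as an assumption.

```python
# A small sketch, assuming the Hugging Face `transformers` library is installed.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photograph of an astronaut riding a horse"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # one vector per token, e.g. torch.Size([1, 77, 768])
```

The output is one embedding vector per token, which is exactly the representation handed to the image generator.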
Larger/better language models have a significant effect on the quality of image generation models.
How is CLIP trained?
CLIP is trained on a dataset of 400 million images paired with captions. In practice, these images were collected from the web along with their “alt” text.
CLIP comprises both an image encoder and a text encoder. The training procedure can be simplified by considering the pairing of an image with its caption. Each element is encoded using the image and text encoders, and their resulting embeddings are compared using cosine similarity. Initially, during the training process, the similarity might be low, even if the text accurately describes the image. To address this, we update the two models so that subsequent embeddings exhibit higher similarity upon encoding.
By repeating this across the dataset with large batch sizes, we end up with encoders able to produce embeddings in which an image of a cat and the sentence “a picture of a cat” are similar. The training process also needs negative examples: images and captions that don't match, to which the model must assign low similarity scores.
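A toy version of this objective, with the matching pairs on the diagonal acting as positives and every other pairing in the batch acting as negatives, might look like the following (the function name and temperature value are illustrative).

```python
import torch
import torch.nn.functional as F

# Toy CLIP-style contrastive objective. `image_embs` and `text_embs` are batches
# of embeddings produced by the image and text encoders (assumed given here).
def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)

    # Cosine similarity between every image and every caption in the batch.
    logits = image_embs @ text_embs.T / temperature

    # Matching pairs sit on the diagonal; all other pairings act as negatives.
    targets = torch.arange(len(image_embs))
    loss_images = F.cross_entropy(logits, targets)      # image -> correct caption
    loss_texts = F.cross_entropy(logits.T, targets)     # caption -> correct image
    return (loss_images + loss_texts) / 2
```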
Feeding Text Information Into The Image Generation Process
Incorporating text into the image generation process requires modifying the noise predictor to take the text as an input. Consequently, the dataset now contains the encoded text. As we are working within the latent space, both the input images and the predicted noise exist in this latent space.
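In code, this conditioning shows up as an extra argument to the noise predictor. The sketch below uses the U-Net class from the diffusers library; the checkpoint name and the random placeholder tensors are assumptions for illustration.

```python
# A sketch, assuming `diffusers` and a Stable Diffusion v1 checkpoint are available.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5",
                                            subfolder="unet")

latents = torch.randn(1, 4, 64, 64)         # noisy latents (latent space, not pixels)
timestep = torch.tensor([10])               # current denoising step
text_embeddings = torch.randn(1, 77, 768)   # placeholder for the CLIP token embeddings

# The encoded text is passed alongside the latents and the time step.
noise_pred = unet(latents, timestep, encoder_hidden_states=text_embeddings).sample
```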
Layers of the U-Net Noise Predictor
Without Text:
Let’s first look at a diffusion U-Net that does not use text. Inside, we see that:
The U-Net is a series of layers that work on transforming the latents array
Each layer operates on the output of the previous layer
Some of the outputs are fed (via residual connections) into the processing later in the network
The time step is transformed into a time step embedding vector, and that’s what gets used in the layers
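Putting these points together, here is a toy, heavily simplified block in PyTorch. It is not the real U-Net, just an illustration of a layer that transforms the latents, carries a residual connection, and consumes the time-step embedding.

```python
import torch
import torch.nn as nn

# Toy U-Net-style block (illustrative only): each layer works on the previous
# layer's output, adds a residual connection, and injects the time-step embedding.
class TinyResBlock(nn.Module):
    def __init__(self, channels, time_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, channels)

    def forward(self, latents, time_emb):
        h = torch.relu(self.conv1(latents))
        h = h + self.time_proj(time_emb)[:, :, None, None]   # inject the time step
        h = self.conv2(h)
        return latents + h                                    # residual connection

# Usage: latents of shape (batch, 4, 64, 64) and a time embedding of size 128.
block = TinyResBlock(channels=4, time_dim=128)
out = block(torch.randn(1, 4, 64, 64), torch.randn(1, 128))
```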
With Text:
The main change needed to support text inputs (technical term: text conditioning) is the addition of an attention layer between the ResNet blocks.
Note that the ResNet blocks don't look at the text directly. Instead, the attention layers merge the text representations into the latents, and the next ResNet block can then use that incorporated text information in its processing.
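A toy cross-attention layer illustrating this merging step might look like the following; the dimensions loosely match Stable Diffusion v1 (a 64x64 latent grid and 77 CLIP token embeddings), but the class itself is a sketch, not the real implementation.

```python
import torch
import torch.nn as nn

# Toy cross-attention (illustrative only): the latents provide the queries, the
# text token embeddings provide the keys and values, so text information gets
# merged into the latent representation between ResNet blocks.
class TinyCrossAttention(nn.Module):
    def __init__(self, latent_dim, text_dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, latents, text_emb):
        # latents: (batch, height*width, latent_dim); text_emb: (batch, tokens, text_dim)
        attended, _ = self.attn(query=latents, key=text_emb, value=text_emb)
        return latents + attended            # residual: latents enriched with text

# Usage with shapes loosely matching Stable Diffusion v1.
layer = TinyCrossAttention(latent_dim=320, text_dim=768)
out = layer(torch.randn(1, 64 * 64, 320), torch.randn(1, 77, 768))
```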
Conclusion
Stable Diffusion represents a breakthrough in AI image generation, combining a text encoder, a U-Net with a scheduler, and an autoencoder decoder. It uses ClipText for text encoding and a diffusion process, trained by adding noise to images in controlled increments and learning to remove it, to generate high-quality images.