Preface#
This article briefly documents my personal notes on the working principles and model structure of the diffusion model, written while studying Stable Diffusion.
Main Text#
1. Composition of the Stable Diffusion Model#
The Stable Diffusion model consists of three parts: a text feature extraction model, an image feature encoding/decoding model, and a noise prediction model.
The basic neural network structures used include Transformer and ResNet.
- Text feature extraction model
The Text Encoder is a Transformer-based language model, such as ClipText or BERT, which generates token embeddings for the input text prompt. Its purpose is to understand the prompt and produce high-dimensional features that guide the prediction direction of the subsequent noise prediction model (see the sketch after this list).
- Image feature encoding and decoding model
The encoding and decoding here are handled by an Image Encoder and an Image Decoder, used in the training and inference stages respectively. In the training stage, the Image Encoder and Text Encoder are trained as a pair so that input text prompts and their corresponding images are matched. In the inference stage, the Image Decoder recovers the final image from the latent output of the text-guided noise predictor (see the sketch after this list).
- Noise prediction model
This is the protagonist of this article: the diffusion model. As the name suggests, the noise prediction model does not predict images; it predicts noise. It is the core component of Stable Diffusion and also the main part of DALL-E 2 and Imagen. Almost all large-scale image generation models now use a diffusion structure.
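To make the first two components concrete, here is a minimal sketch of the text feature extraction step, assuming the Hugging Face transformers library and the openai/clip-vit-large-patch14 checkpoint (the text encoder used by Stable Diffusion v1); the prompt string is just an arbitrary example.

```python
from transformers import CLIPTokenizer, CLIPTextModel

# Tokenizer and Transformer-based text encoder (CLIP's text tower)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a photograph of an astronaut riding a horse"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

# High-dimensional token features that later guide the noise predictor
text_embeddings = text_encoder(tokens.input_ids).last_hidden_state  # (1, 77, 768)
```

And a sketch of the image feature encoding/decoding step, assuming the diffusers library and the Stable Diffusion v1.5 VAE weights; the 0.18215 scaling factor is the constant used by the v1 models.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # placeholder for a real image scaled to [-1, 1]

with torch.no_grad():
    # Encoder: explicit image -> latent features (1, 4, 64, 64)
    latents = vae.encode(image).latent_dist.sample() * 0.18215
    # Decoder: latent features -> explicit image (1, 3, 512, 512)
    decoded = vae.decode(latents / 0.18215).sample
```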
2. Diffusion Model#
1. Working Principles#
Diffusion means iteratively predicting the noise that was added at the corresponding step of the forward process, subtracting the predicted noise from the input, and then making the next prediction. This step-by-step diffusion process ultimately yields a noise-free image.
The above describes how the diffusion model "diffuses" from the perspective of inference. The training process runs in the opposite direction: at each step, noise is added on top of the previous step, so the added noise serves as the label and the model's prediction serves as the output; the loss between the two is computed and training proceeds step by step.
(Figure: the training process)
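To make the training direction concrete, here is a minimal sketch of one DDPM-style training step in PyTorch. The noise predictor `model`, the clean input `x0`, and the linear beta schedule are illustrative assumptions, not the exact Stable Diffusion training code.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, num_timesteps=1000):
    """One training step: add noise to x0 (the label), let the model predict it (the output)."""
    # A common linear beta schedule and its cumulative alpha products
    betas = torch.linspace(1e-4, 0.02, num_timesteps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, num_timesteps, (x0.shape[0],))   # random timestep per sample
    noise = torch.randn_like(x0)                           # the label
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)

    # Forward diffusion: the noisy input at step t
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    noise_pred = model(x_t, t)                             # the output
    return F.mse_loss(noise_pred, noise)                   # loss between output and label
```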
2. Model Construction#
The overall framework is based on a UNet, with ResNet and Transformer blocks used internally; attention is mainly used for text feature embedding. The text-conditioned diffusion structure is introduced in the next section; here we only analyze the basic image diffusion structure.
The diffusion model processes high-dimensional features in the latent space rather than operating directly on explicit images, which keeps inference efficient.
The diffusion model iterates the same diffusion block many times: every step has the same structure but corresponds to a different stage of the denoising generation. The latent-space features obtained at these intermediate stages can be decoded into explicit images, letting us observe how an image is generated step by step (a sketch of this sampling loop follows).
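A minimal sketch of this iterative denoising in latent space, assuming the diffusers library, a DDIM scheduler, and the Stable Diffusion v1.5 UNet and VAE weights; the all-zero text embedding is only a placeholder (text conditioning is covered in the next section), so this is not the actual pipeline code.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler

repo = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")
scheduler.set_timesteps(50)                            # 50 denoising iterations

latents = torch.randn(1, 4, 64, 64)                    # start from pure noise in latent space
text_emb = torch.zeros(1, 77, 768)                     # placeholder conditioning

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    # One scheduler step: subtract the predicted noise to get the next latents
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# Latents from any intermediate step can be decoded into an explicit image
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample       # (1, 3, 512, 512)
```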
3. Diffusion Model with Text#
The principle of diffusion introduced so far is the process of iteratively predicting and subtracting noise to restore an image. Without a text prompt, the input is just a random noise image, and the generated result is essentially meaningless. Specific images can only be generated with guiding feature information, so the text prompt information needs to be embedded into the diffusion structure.
The text prompt information has already been transformed into text feature vectors by the text feature extraction model. All that remains is to add an attention module after each step of the diffusion model to embed the text features into the latent-space features (a toy sketch follows below).
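A toy sketch of the idea (not the actual Stable Diffusion attention block): the flattened latent feature map acts as the query, the text embeddings act as the keys and values, and the attention output is added back to the latents. The dimensions 320 and 768 are illustrative choices, loosely matching SD v1's first UNet stage and CLIP text features.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Latent features (queries) attend to text features (keys/values)."""
    def __init__(self, latent_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)

    def forward(self, latent_tokens, text_emb):
        out, _ = self.attn(latent_tokens, text_emb, text_emb)
        return latent_tokens + out                      # residual connection

# Example shapes: a 64x64 latent map with 320 channels, 77 text tokens of dim 768
latent_tokens = torch.randn(1, 64 * 64, 320)
text_emb = torch.randn(1, 77, 768)
fused = TextCrossAttention()(latent_tokens, text_emb)   # (1, 4096, 320)
```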
4. Diffusion VS GAN#
Image generation is one of the hottest directions in AIGC. The diffusion model has opened a new era of image generation, achieving unprecedented image quality and model flexibility, and has been applied in DALL-E 2, Imagen, and Stable Diffusion.
The diffusion model is based on iterative operations: each iteration only needs to optimize a simple objective, yet together they reach the final goal. The price is a huge demand for computing power (think about it, DALL-E 2, Imagen, Stable Diffusion, which one is not a giant model 😂). These models are difficult for small and medium-sized companies to reproduce, and they don't even have enough devices! In the end they can only rely on APIs from the top companies, which further monopolizes the market.
Before the diffusion model, Generative Adversarial Networks (GANs) were the common basic architecture for image generation models. Compared with diffusion models, GANs have no iterative sampling and therefore higher inference efficiency. However, because their training process is unstable, scaling up GANs is very difficult and requires careful tuning of the network architecture and training factors. Even so, the image generation results of GANs are remarkable: models such as StyleGAN, Pix2Pix, and PULSE produce very stunning results, and their model sizes are much smaller than those of diffusion models, so they can run on ordinary devices.
In summary, GAN-based models and diffusion-based models each have their own advantages and disadvantages; it cannot be said that the diffusion model has defeated GANs. P.S.: there is a recent study on scaling up GAN image generation (see the references).
Conclusion#
References:
The Illustrated Stable Diffusion
Scaling up GANs for Text-to-Image Synthesis
Disclaimer#
This article is only for personal learning purposes.
This article is synchronized with HUGHIE Blog.