Introduction

VQ-GAN is a proposal to use transformers, which have shown strong performance across deep learning, to synthesize high-quality images. The authors first use CNNs, a class of neural networks well suited to exploiting locality in images, to extract perceptually rich image constituents. They then use a transformer, with its ability to model long-range interactions freely, to compose these constituents into globally coherent images.

VQ-GAN stands for Vector Quantized Generative Adversarial Network. It is a combination of two machine learning methodologies: vector quantization and the generative adversarial network. A GAN consists of two neural networks, a generator and a discriminator, trained together: the generator tries to produce ever more realistic samples, and the discriminator improves its ability to spot fakes. Vector quantization is a technique from lossy data compression in which a large set of vectors is approximated by a small, finite set of vectors called the codebook; each vector from the large set is mapped to one of the codebook entries. In VQ-GAN, the VQ layer acts like a lookup table that maps continuous inputs (images) to a discrete set of outputs. These outputs can then be fed to a transformer, and together they constitute the generator. DALL-E from OpenAI uses a similar discrete-latent approach to generate images from textual descriptions.

CNN

The first phase is to learn a codebook of image constituents for the transformer to use. An image x is then represented by a spatial collection of codebook entries. To learn such a codebook, the authors combine a CNN with quantization. The CNN consists of an encoder E and a decoder G, which learn to represent images with codes from a discrete codebook \(Z = \{z_k\}_{k=1}^{K}\). An image x is approximated by \(\hat{x} = G(z_\mathbf{q})\), where \(z_\mathbf{q}\) is the quantized version of the encoding \(\hat{z} = E(x)\): each spatial vector of \(\hat{z}\) is replaced by its nearest codebook entry, \(q(\hat{z}) = \arg\min_{z_k \in Z} \lVert \hat{z} - z_k \rVert\). The reconstruction is therefore \(\hat{x} = G(z_\mathbf{q}) = G(q(E(x)))\). Since the argmin is not differentiable, backpropagation passes the gradient straight through the quantization step, as if it were the identity.
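
To make the straight-through trick concrete, here is a minimal PyTorch sketch of the quantization step (the names `quantize`, `z_e`, and `codebook` are illustrative, not taken from the paper's released code):

```python
import torch

def quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (B, H, W, C) continuous encoder outputs
    codebook: (K, C) learnable embedding vectors
    """
    flat = z_e.reshape(-1, z_e.shape[-1])                  # (B*H*W, C)
    # Squared L2 distance from every encoder vector to every codebook entry
    dists = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ codebook.t()
             + codebook.pow(2).sum(1))                     # (B*H*W, K)
    indices = dists.argmin(dim=1)                          # nearest entry per vector
    z_q = codebook[indices].view_as(z_e)
    # Straight-through estimator: the forward pass uses z_q, but the backward
    # pass copies gradients to z_e as if quantization were the identity.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, indices
```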

VQ-VAE

VAE (Variational Autoencoder) belongs to the family of generative models that use deep learning techniques to learn complex data distributions. After learning the data distribution, the mathematics of Bayesian inference is used to generate new samples from that distribution.

Like any traditional autoencoder architecture, a VAE has an encoder, and the underlying philosophy is the same: it takes datapoints as input and, during the encoding process, learns a more efficient and representative encoding of the dataset. One could also say it learns the essential characteristics of the input data, since constraints are placed on the learning process. The learned representation is called the latent distribution, since it is not directly or trivially observed from the data. The VAE departs from the traditional autoencoder in that it is probabilistic by nature: for each input datapoint it outputs the parameters of a conditional probability distribution over the latent variables.

Directly sampling from the latent distribution is hard, so we use the reparameterization trick: we sample from a standard normal distribution and shift and rescale that sample by the mean and standard deviation output by the encoder. If the data is real-valued, the decoder's output distribution is usually Gaussian; if the data is binary, it is usually Bernoulli. The loss function of a VAE has two parts. The first is the reconstruction loss, which measures how well the decoder output matches the original input; it is typically the negative log-likelihood of the input given the decoder's output. The second is the KL divergence between the encoder's output distribution and a prior over the latent variables, usually a standard normal distribution. This KL term encourages the encoder to produce distributions resembling the prior.
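
A minimal sketch of the reparameterization trick and the two-part VAE loss, assuming binary data with a Bernoulli likelihood and a standard normal prior (the function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, decoder, mu, logvar):
    """Reconstruction + KL loss; mu and logvar are the encoder's outputs."""
    # Reparameterization trick: sample eps ~ N(0, I), then shift and rescale
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    z = mu + std * eps

    # The decoder is assumed to output Bernoulli probabilities in [0, 1]
    x_hat = decoder(z)
    # Reconstruction term: negative log-likelihood of x under the decoder's output
    rec = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL divergence between N(mu, sigma^2) and the standard normal prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```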

VQ-GAN

VQ-GAN is composed of several building blocks: a VQ-VAE encoder, an image transformer, and a discriminator. The VQ-VAE is a vector-quantized variational autoencoder that compresses the input image into a discrete latent space. The image transformer generates in this discrete latent space, which keeps the computational complexity manageable, and we can sample from it using the codebook to generate new images. Since the latent space preserves the patterns in the images, the generated images retain spatial coherence and semantics. Note that VQ-GAN makes use of a transformer. Transformers have shown superior performance to CNNs in many settings, but they are mostly known for natural language processing tasks; here one is used for image synthesis. The authors were able to synthesize images at a resolution of 1024x1024, a significant improvement over previous transformer-based methods.

The vector quantization step turns the continuous encoder output into a discrete set of fixed values. During backpropagation, the gradient is passed straight through to the encoder, as if the quantization did not exist. The codebook itself is also learned: in each forward pass, after quantization, the distance between the encoder's output and the selected codebook vector is computed; the codebook is updated to minimize this distance, while a commitment loss keeps the encoder's output close to the chosen code.

Here is the loss function:

\[L_{VQ}(E, G, Z) = \lVert x - \hat{x} \rVert_2^2 + \lVert \text{sg}[E(x)] - z_\mathbf{q} \rVert_2^2 + \lVert \text{sg}[z_\mathbf{q}] - E(x) \rVert_2^2\]

\(L_{rec} = \lVert x - \hat{x} \rVert_2^2\) is the reconstruction loss, \(\text{sg}[\cdot]\) denotes the stop-gradient operator, the second term updates the codebook, and the last term is the commitment loss.
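
In code, the stop-gradient operator corresponds to detaching a tensor from the computation graph. A minimal PyTorch sketch of the three terms, with illustrative names (`z_e` for the encoder output E(x), `z_q` for the selected codebook vectors):

```python
import torch.nn.functional as F

def vq_loss(x, x_hat, z_e, z_q):
    """The three terms of L_VQ; .detach() plays the role of sg[.]."""
    rec_loss = F.mse_loss(x_hat, x)                # ||x - x_hat||^2 (mean over elements)
    codebook_loss = F.mse_loss(z_q, z_e.detach())  # ||sg[E(x)] - z_q||^2, updates the codebook
    commit_loss = F.mse_loss(z_e, z_q.detach())    # ||sg[z_q] - E(x)||^2, commits E(x) to the codes
    return rec_loss + codebook_loss + commit_loss
```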

GAN

To learn a perceptually rich codebook, the authors extend the VQ-VAE training with an adversarial objective: a patch-based discriminator D is trained to distinguish real images from reconstructions. The aim is not to model individual pixels but to make the latent representation of the image capture perceptually important structure:

\[L_{GAN}(\{E, G, Z\}, D) = \log D(x) + \log(1 - D(\hat{x}))\]
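
A minimal sketch of this adversarial objective in its vanilla (log) form, assuming a hypothetical patch-based discriminator `disc` that returns one logit per image patch; practical implementations often substitute a hinge loss:

```python
import torch
import torch.nn.functional as F

def gan_losses(disc, x, x_hat):
    """Discriminator and generator losses averaged over patches."""
    logits_real = disc(x)
    logits_fake = disc(x_hat.detach())  # detach so this term only trains the discriminator
    # Discriminator maximizes log D(x) + log(1 - D(x_hat))
    d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    # Generator side: the non-saturating form -log D(x_hat), commonly used in practice
    g_loss = F.binary_cross_entropy_with_logits(disc(x_hat), torch.ones_like(logits_fake))
    return d_loss, g_loss
```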

The objective for finding the optimal compression model \(Q^*\) is:

\[Q^* = \arg\min_{E, G, Z} \max_{D} \; \mathbb{E}_{x \sim p(x)} \left[ L_{VQ}(E, G, Z) + \lambda L_{GAN}(\{E, G, Z\}, D) \right]\]

with the adaptive weight \(\lambda = \frac{\nabla_{G_L}[L_{rec}]}{\nabla_{G_L}[L_{GAN}] + \delta}\), where \(L_{rec}\) is the perceptual reconstruction loss, \(\nabla_{G_L}[\cdot]\) denotes the gradient of its argument with respect to the last layer \(L\) of the decoder, and \(\delta = 10^{-6}\) is added for numerical stability.
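
A sketch of how this adaptive weight can be computed with automatic differentiation, taking gradient norms with respect to the decoder's last layer (using norms and clamping the result are common implementation choices, not spelled out in the formula above):

```python
import torch

def adaptive_weight(rec_loss, gan_loss, last_layer_weight, delta=1e-6):
    """lambda = ||grad_{G_L} L_rec|| / (||grad_{G_L} L_GAN|| + delta)."""
    rec_grads = torch.autograd.grad(rec_loss, last_layer_weight, retain_graph=True)[0]
    gan_grads = torch.autograd.grad(gan_loss, last_layer_weight, retain_graph=True)[0]
    lam = rec_grads.norm() / (gan_grads.norm() + delta)
    # Detach so lambda acts as a constant scale; clamp for numerical stability.
    return lam.clamp(0, 1e4).detach()
```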


Image transformer

The difference from previous methods is that the transformer operates in the latent space. We encode the images into quantized encodings, each of which corresponds to an index into the codebook. Given a sequence of indices s, we can map them back to their codebook entries and decode the result into an image. Image generation in the latent space is then an autoregressive prediction task: predict the next index of the sequence. The full representation of an image has likelihood p(s), and we maximize the log-likelihood of the data representations:

\[L_{transformer} = \mathbb{E}_{x \sim p(x)}\left[-\log p(s)\right]\]

If we want to condition the image generation process on a label or another image, we instead learn the likelihood of the sequence given the conditioning information c, \(p(s \mid c)\).
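
A sketch of the next-index training objective, assuming a hypothetical autoregressive `transformer` that returns logits over the K codebook entries for every position and optionally accepts conditioning tokens `cond`:

```python
import torch.nn.functional as F

def transformer_loss(transformer, indices, cond=None):
    """Cross-entropy over next-index predictions, i.e. E[-log p(s)] (or p(s|c))."""
    inputs, targets = indices[:, :-1], indices[:, 1:]   # shift by one position
    logits = transformer(inputs, cond)                  # (B, L-1, K) logits over the codebook
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```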

High-resolution image generation

Results of image generation from a model trained on ImageNet:

[Figure: samples generated on ImageNet]

Trained on CelebA-HQ:

[Figure: face samples generated on CelebA-HQ]

Conditioning on keypoint figures to create fashion images:

[Figure: keypoint-conditioned fashion image samples]

Comparison between VQ-GAN, VQ-VAE-2, and some other models:

[Figure: comparison between VQ-GAN, VQ-VAE-2, and other models]

Conclusion

The authors of VQ-GAN tackle some of the fundamental difficulties that previously limited transformer models to low-resolution images. These challenges mainly revolve around the quadratic complexity of self-attention that comes into play when attempting to model images directly at the pixel level, which becomes computationally infeasible for large images.

To overcome these issues, the authors propose an innovative approach that fundamentally changes how images are represented within the model. Rather than dealing with individual pixels, their method focuses on what they term ‘perceptually rich image constituents.’ This concept essentially means breaking down an image into its most meaningful and perceptually significant parts, rather than its most basic and numerous elements (the pixels).

The crux of their approach is the combination of Convolutional Neural Networks (CNNs) and transformers. CNNs have proven to be remarkably efficient at handling image data, thanks to their ability to effectively capture the spatial dependencies in images through the application of filters. On the other hand, transformers excel at modeling relationships between elements, irrespective of their spatial distances, thanks to their self-attention mechanism.

By having CNNs model the image constituents and transformers handle their compositions, the authors create a synergistic model that capitalizes on the strengths of both architectures. The CNN efficiently processes the image data and reduces its dimensionality, while the transformer captures the relationships between the different parts of the image.

This technique has led to some impressive results, including the successful synthesis of high-resolution images. It’s a significant milestone as it marks the first time that transformer-based architectures have been used to produce images at such a high resolution. The approach has been shown to outperform state-of-the-art convolutional methods in experiments, demonstrating the benefits of combining convolutional inductive biases and transformer expressivity.

Moreover, the approach comes with a general mechanism for conditional synthesis, meaning it can generate images based on certain conditions or prompts. This feature provides a wide array of potential applications, particularly in the realm of neural rendering, making this method a promising avenue for future research and development in image synthesis.

In conclusion, VQ-GAN combines multiple techniques, namely transformers, autoencoding, and quantization, and has proved successful in image synthesis tasks. This opens up transformer applications beyond text-based tasks, such as high-resolution image generation. However, training requires substantial computational resources and powerful hardware, including multiple high-performance GPUs, which limits applicability in resource-constrained settings.