Introduction

Wasserstein loss function

The Wasserstein loss function, also known as the Wasserstein distance or the Earth Mover’s Distance, is a measure of the distance between two probability distributions. In the context of generative adversarial networks (GANs), the Wasserstein loss function is used as an alternative to the traditional binary cross-entropy loss function.

The main idea behind the Wasserstein loss function is to encourage the generator network to produce realistic images that are as close as possible to the real images in the dataset, while at the same time discouraging the discriminator network from being too confident in its predictions. This is achieved by training the discriminator (often called the critic in this setting) to output a real-valued score instead of a binary classification; this score is used to estimate the distance between the distribution of the real images and the distribution of the fake images generated by the generator network.

Assume that \(p_r\) is the real distribution of the data and \(p_g\) is the generated distribution. The Wasserstein or EM (Earth Mover's) distance is the infimum (greatest lower bound), over all joint distributions of \(p_r\) and \(p_g\), of the expected distance between paired samples:

\[W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x,y) \sim \gamma} \left[ \lVert x - y \rVert \right]\]

where \(\Pi(p_r, p_g)\) is the set of all joint distributions \(\gamma(x, y)\) whose marginals are \(p_r\) and \(p_g\).
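
As a concrete illustration (not from the original papers), the 1-D Wasserstein distance between two empirical distributions can be computed with `scipy.stats.wasserstein_distance`:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real_samples = rng.normal(loc=0.0, scale=1.0, size=10000)  # samples from p_r = N(0, 1)
fake_samples = rng.normal(loc=2.0, scale=1.0, size=10000)  # samples from p_g = N(2, 1)

# For two Gaussians with equal variance, the Wasserstein distance is the
# gap between the means, so this prints a value close to 2.0.
print(wasserstein_distance(real_samples, fake_samples))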

This infimum is intractable to compute directly, so we use the Kantorovich-Rubinstein duality of the Wasserstein distance:

\[W(p_r, p_{\theta}) = \sup_{\lVert f \rVert_L \leq 1} \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_{\theta}}[f(x)]\]

where the supremum is taken over all functions \(f\) that are 1-Lipschitz continuous. The Lipschitz constraint bounds the gradient of \(f\), which protects training from exploding gradients. A function is K-Lipschitz continuous if:

\[\left\lvert \frac{f(x_1) - f(x_2)}{x_1 - x_2} \right\rvert \leq K, \quad \forall x_1, x_2 \in \mathbb{R}\]
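
For example, \(f(x) = \sin(x)\) is 1-Lipschitz, since its slope \(|\cos(x)|\) never exceeds 1, while \(f(x) = x^2\) is not Lipschitz on all of \(\mathbb{R}\), because its slope grows without bound. In practice, the original WGAN paper enforces the constraint approximately by clipping the critic's weights to a small interval such as \([-0.01, 0.01]\).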

The discriminator (critic) is then updated by gradient ascent on:

\[\nabla_{\phi} \left[ \frac{1}{m} \sum_{i=1}^{m} f_{\phi}(x^{(i)}) - \frac{1}{m} \sum_{i=1}^{m} f_{\phi}(G_{\theta}(z^{(i)})) \right]\]

with \(\phi\) being the weights of the discriminator network, and \(\theta\) being the weights of the generator network.

The generator is updated by gradient ascent on:

\[\nabla_{\theta} \frac{1}{m} \sum_{i=1}^{m} f_{\phi}(G_{\theta}(z^{(i)}))\]
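
A minimal TensorFlow sketch of these two updates, using the weight clipping from the original WGAN paper to enforce the Lipschitz constraint (the function names and arguments here are illustrative; `critic` and `generator` are assumed to be `tf.keras` models):

import tensorflow as tf

clip_value = 0.01  # weight-clipping range from the original WGAN paper

def critic_step(critic, generator, real_images, latent_dim, opt_c):
  z = tf.random.normal([tf.shape(real_images)[0], latent_dim])
  with tf.GradientTape() as tape:
    # Ascend f(x) on real images minus f(G(z)) on fakes by
    # minimizing the negated difference.
    loss = -(tf.reduce_mean(critic(real_images))
             - tf.reduce_mean(critic(generator(z))))
  grads = tape.gradient(loss, critic.trainable_variables)
  opt_c.apply_gradients(zip(grads, critic.trainable_variables))
  # Approximately enforce the 1-Lipschitz constraint by weight clipping.
  for w in critic.trainable_variables:
    w.assign(tf.clip_by_value(w, -clip_value, clip_value))

def generator_step(critic, generator, batch_size, latent_dim, opt_g):
  z = tf.random.normal([batch_size, latent_dim])
  with tf.GradientTape() as tape:
    # Ascend f(G(z)) by minimizing its negative.
    loss = -tf.reduce_mean(critic(generator(z)))
  grads = tape.gradient(loss, generator.trainable_variables)
  opt_g.apply_gradients(zip(grads, generator.trainable_variables))

Both losses are written as minimizations of the negated objectives, which is equivalent to the gradient ascent described above.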

One of the advantages of the Wasserstein loss function is that it produces more stable training dynamics compared to the binary cross-entropy loss function, which can suffer from problems such as mode collapse or vanishing gradients. The Wasserstein loss function also provides a meaningful measure of the distance between the real and fake distributions, which can be used to evaluate the quality of the generated images. ProgressiveGAN uses the Wasserstein loss (with a gradient penalty to enforce the Lipschitz constraint), and StyleGAN builds on ProgressiveGAN.

ProgressiveGAN

The Progressive GAN paper, titled “Progressive Growing of GANs for Improved Quality, Stability, and Variation,” was published in 2018 by researchers from NVIDIA. The paper introduces a novel training method for generative adversarial networks (GANs) called progressive growing, which enables the generation of high-quality, diverse, and stable images.

ProgressiveGAN starts training with a generator and a discriminator for images of size 4x4, then gradually adds layers to both networks until they can handle images of up to 1024x1024. This is a way to generate large images reliably. The final generator and discriminator are composite functions: \(G = G_1 \circ G_2 \circ \dots \circ G_N\) and \(D = D_1 \circ D_2 \circ \dots \circ D_N\).

This procedure allows the network to learn the big picture first and then focus on finer-scale detail, instead of learning all scales at once. During training, the generator and discriminator are mirror images of each other and always grow together. All layers remain trainable, and after each new layer is added there is a fading-in process that smooths the new layer into the network.

The fading-in process works as follows:

[Figure: fading in a new layer while doubling the resolution, from the ProgressiveGAN paper]

When the resolution doubles from x to 2x, the new 2x layers are treated like a residual block. The output of the x-resolution layers is converted to RGB and upsampled, skipping the new 2x layers entirely, and is then convexly combined, in RGB space, with the output of the new 2x layers, with a weight \(\alpha\) that increases linearly from 0 to 1.
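
A sketch of this combination (the helpers `upsample`, `new_block`, `to_rgb_old`, and `to_rgb_new` are hypothetical stand-ins for the corresponding layers, not names from the paper's code):

def faded_output(x_features, alpha, upsample, new_block, to_rgb_old, to_rgb_new):
  # Old path: convert the x-resolution features to RGB, then upsample 2x,
  # skipping the newly added 2x block entirely.
  old_rgb = upsample(to_rgb_old(x_features))
  # New path: upsample the features and run them through the new 2x block.
  new_rgb = to_rgb_new(new_block(upsample(x_features)))
  # Convex combination in RGB; alpha ramps linearly from 0 to 1 during the
  # transition, so the new block is faded in smoothly.
  return (1.0 - alpha) * old_rgb + alpha * new_rgb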

To assess the results, the authors measure statistical similarity at multiple scales, both coarse and fine, since a good generator should produce images whose local structure matches the training set at every scale. They compute the statistical similarity between local image patches from the generated images and from the target set, starting at a patch resolution of 16x16 and going up to the full resolution (1024x1024). Each patch is first normalized with respect to the mean and standard deviation of each color channel; the sliced Wasserstein distance is then used to measure the similarity between generated patches \(x_i\) and target patches \(y_i\). A small Wasserstein distance indicates that the distributions are similar. The distance at the coarsest resolution (16x16) reflects similarity in large-scale structure (the big picture), while the distance at the finest level reflects similarity in pixel-level attributes such as sharpness and edges.
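
A minimal NumPy sketch of the sliced Wasserstein distance between two equally sized sets of flattened, per-channel-normalized patches (a simplification of the paper's evaluation pipeline; the number of random projections is an arbitrary choice here):

import numpy as np

def sliced_wasserstein(patches_x, patches_y, n_projections=128, seed=0):
  # patches_x, patches_y: arrays of shape (n_patches, patch_dim),
  # already normalized per color channel.
  rng = np.random.default_rng(seed)
  dim = patches_x.shape[1]
  # Random unit directions onto which the high-dimensional patches are projected.
  dirs = rng.normal(size=(dim, n_projections))
  dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
  proj_x = patches_x @ dirs  # shape (n_patches, n_projections)
  proj_y = patches_y @ dirs
  # In 1-D, the Wasserstein distance between equal-size samples is the mean
  # distance between sorted values; average over all projections.
  proj_x.sort(axis=0)
  proj_y.sort(axis=0)
  return np.mean(np.abs(proj_x - proj_y))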

Code example

import tensorflow as tf
tf.random.set_seed(0)
import imageio
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_hub as hub

progan = hub.load("https://tfhub.dev/google/progan-128/1").signatures['default']

latent_dim = 512  # latent vector size expected by the progan-128 module

# Generate and display one image from a random latent vector.
plt.imshow(np.array(progan(tf.random.normal([latent_dim]))['default'])[0])

# Plot a 2x2 grid of generated samples.
fig = plt.figure(figsize=(16, 16))
for i in range(4):
  plt.subplot(2, 2, i + 1)
  plt.imshow(np.array(progan(tf.random.normal([latent_dim]))['default'])[0])
plt.savefig('image')
plt.show()

[Generated sample images from the progan-128 module]

from tensorflow_docs.vis import embed  # pip install tensorflow-docs

def animate(images):
  images = np.array(images)
  converted_images = np.clip(images * 255, 0, 255).astype(np.uint8)
  imageio.mimsave('./animation.gif', converted_images)
  return embed.embed_file('./animation.gif')

def interpolate_hypersphere(v1, v2, num_steps):
  # Interpolate between v1 and v2 while keeping the norm of every
  # intermediate vector equal to the norm of v1.
  v1_norm = tf.norm(v1)
  v2_normalized = v2 * (v1_norm / tf.norm(v2))
  vectors = []
  for step in range(num_steps):
    interpolated = v1 + (v2_normalized - v1) * step / (num_steps - 1)
    interpolated_normalized = interpolated * (v1_norm / tf.norm(interpolated))
    vectors.append(interpolated_normalized)
  return tf.stack(vectors)

def interpolate_between_vectors():
  v1 = tf.random.normal([latent_dim])
  v2 = tf.random.normal([latent_dim])

  # Creates a tensor with 50 steps of interpolation between v1 and v2.
  vectors = interpolate_hypersphere(v1, v2, 50)

  # Uses the module to generate images from the latent space.
  interpolated_images = progan(vectors)['default']

  return interpolated_images

interpolated_images = interpolate_between_vectors()
animate(interpolated_images)

[Animation: interpolation between two latent vectors]

StyleGAN

StyleGAN is an extension of ProgressiveGAN that was published in 2019, also by NVIDIA researchers. It keeps ProgressiveGAN's approach of gradually adding layers but changes the generator design, borrowing ideas from the traditional neural style transfer literature. Because of this, StyleGAN is able to separate the style and content of the generated images, which allows more control over the final image output.

[Figure: the StyleGAN generator, with a mapping network feeding styles into the synthesis network]

In a traditional progressive GAN, the random latent vector goes through normalization, then through convolutional layers that double the resolution at each stage. In StyleGAN, the latent vector goes through normalization and then a mapping network of 8 fully connected layers (an MLP), whose output feeds a synthesis network that generates the image. The mapping network maps the noise vector into an intermediate latent space that captures the style of the image, while the synthesis network takes this intermediate latent vector and generates the image itself. The synthesis network looks like a progressive network that doubles the resolution in each phase. It starts from a learned constant tensor; in each block, the features go through convolution, per-pixel Gaussian noise is added, and AdaIN applies the style:

\[\mathrm{AdaIN}(x_i, y) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}\]
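
Here \(y = (y_s, y_b)\) is a per-channel scale and bias computed from the intermediate latent vector by a learned affine transformation. A minimal TensorFlow sketch of the operation (the shapes, and the existence of an affine layer elsewhere producing `y_scale` and `y_bias`, are assumptions for illustration):

import tensorflow as tf

def adain(x, y_scale, y_bias, eps=1e-8):
  # x: feature maps of shape (batch, height, width, channels);
  # y_scale, y_bias: per-channel style parameters of shape (batch, channels),
  # produced from the intermediate latent vector by a learned affine layer.
  mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
  normalized = (x - mean) / tf.sqrt(var + eps)
  # Reshape the style parameters so they broadcast over spatial dimensions.
  y_scale = tf.reshape(y_scale, [-1, 1, 1, tf.shape(y_scale)[-1]])
  y_bias = tf.reshape(y_bias, [-1, 1, 1, tf.shape(y_bias)[-1]])
  return y_scale * normalized + y_bias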

The paper demonstrates the effectiveness of the StyleGAN architecture by training the model on large datasets of images, such as faces and animals, and generating high-quality images that exhibit a wide range of styles and variations. The paper also compares the results of StyleGAN with other state-of-the-art GAN architectures and shows that StyleGAN outperforms them in terms of visual quality, diversity, and stability.

Overall, the StyleGAN paper represents a significant contribution to the field of deep learning and computer vision, advancing the state-of-the-art in image synthesis and enabling new applications in areas such as computer graphics, virtual reality, and creative arts.