Introduction

The first difference between CycleGAN and other GANs is that it performs unpaired image-to-image translation. Unpaired means the two training collections are not matched one-to-one: there is no corresponding target image for each input image, and the translation only transfers the style of one collection onto the other. So it is a style-transfer GAN, but the mechanism is different from StyleGAN.


The second difference is the translation mechanism. CycleGAN has two discriminators, \(D_X\) and \(D_Y\), and two generators (two mapping functions), G and F, over two image domains X and Y. G translates X into Y, as usual, and F translates Y back into X; F acts as the inverse of G. The authors add a new term to the loss function called the cycle consistency loss. When you translate x to y and back to x', the cycle consistency loss makes sure that x' stays close to x. This is called forward cycle consistency: x -> G(x) -> F(G(x)) \(\approx\) x, with the composite function \(F \circ G: X \rightarrow X\). In the same vein, when you translate y to x and back to y', the cycle consistency loss makes sure that y' stays close to y. This is called backward cycle consistency: y -> F(y) -> G(F(y)) \(\approx\) y, with the composite function \(G \circ F: Y \rightarrow Y\). In natural language, this process is equivalent to translating a sentence from English to Spanish and then translating that Spanish sentence back to English: we need to recover the original English sentence.

The task resembles style transfer, but classical style transfer uses a different technique: it matches the Gram-matrix statistics of pretrained deep features to capture the strokes of a single style image, then paints the input image with those strokes. Here the authors want to learn mapping functions between two image collections, rather than between just two images.
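
To make this concrete, here is a minimal sketch of the cycle consistency term in TensorFlow, using the gen_G / gen_F naming from the code further below and illustrative real_x / real_y image batches:

import tensorflow as tf

def cycle_consistency_loss(gen_G, gen_F, real_x, real_y):
    # Forward cycle: x -> G(x) -> F(G(x)) should reconstruct x
    cycled_x = gen_F(gen_G(real_x))
    # Backward cycle: y -> F(y) -> G(F(y)) should reconstruct y
    cycled_y = gen_G(gen_F(real_y))
    forward_loss = tf.reduce_mean(tf.abs(cycled_x - real_x))   # L1 distance
    backward_loss = tf.reduce_mean(tf.abs(cycled_y - real_y))  # L1 distance
    return forward_loss + backward_loss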

The objective is made up of three losses:

  • For the function G that converts X to Y:

with G: X \(\rightarrow\) Y, we have \(Loss_{GAN}(G,D_Y,X,Y) = E_y{[\log D_Y(y)]} + E_x{[\log(1-D_Y(G(x)))]}\). The problem becomes \(\min_G \max_{D_Y} Loss_{GAN}(G,D_Y,X,Y)\)

  • For the function F that converts Y to X:

with F: Y \(\rightarrow\) X, the problem becomes \(\min_F \max_{D_X} Loss_{GAN}(F,D_X,Y,X)\)

  • The cycle consistency loss encourages forward and backward cycle consistency:
\[Loss_{cycle}(G,F) = E_x{[ \mid \mid F(G(x)) - x \mid \mid_1 ]} + E_y{[ \mid \mid G(F(y)) - y \mid \mid_1]}\]

The full objective function is:

\[Loss(G,F,D_X,D_Y) = Loss_{GAN}(G,D_Y,X,Y) + Loss_{GAN}(F,D_X,Y,X) + \alpha Loss_{cycle}(G,F)\] \[G^*, F^* = \arg\min_{G,F} \max_{D_X, D_Y} Loss(G,F,D_X,D_Y)\]
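
In practice this min-max problem is solved by alternating gradient steps: the discriminators are updated to increase the objective while the generators are held fixed, and the generators are updated to decrease it. A rough sketch of one training iteration follows, assuming the models and the adversarial / cycle consistency loss helpers are already defined; the helper names here are illustrative, not the paper's code:

with tf.GradientTape(persistent=True) as tape:
    fake_y = gen_G(real_x, training=True)   # X -> Y
    fake_x = gen_F(real_y, training=True)   # Y -> X

    # Generators: fool the discriminators and keep the cycles consistent
    g_loss = (generator_adv_loss(disc_Y(fake_y))
              + generator_adv_loss(disc_X(fake_x))
              + 10.0 * cycle_consistency_loss(gen_G, gen_F, real_x, real_y))

    # Discriminators: separate real images from generated ones
    d_loss = (discriminator_adv_loss(disc_Y(real_y), disc_Y(fake_y))
              + discriminator_adv_loss(disc_X(real_x), disc_X(fake_x)))

# Apply these with separate optimizers for the generators and the discriminators
gen_grads = tape.gradient(g_loss, gen_G.trainable_variables + gen_F.trainable_variables)
disc_grads = tape.gradient(d_loss, disc_X.trainable_variables + disc_Y.trainable_variables)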

For the generator, the authors use an architecture similar to the one used in neural style transfer: three convolutions, several residual blocks, two fractionally-strided convolutions with stride \(\frac{1}{2}\), and one convolution that maps features back to RGB. They use six residual blocks for 128x128 images and nine blocks for 256x256 and higher-resolution images. For the discriminator, 70x70 PatchGANs are used. A PatchGAN classifies patches of the image as real or fake rather than the entire image, which gives equivalent results with less computation.

During training, a least-squares loss replaces the negative log likelihood. In particular, G is trained to minimize \(E_x{[(D(G(x))-1)^2]}\) and D to minimize \(E_y{[(D(y)-1)^2]} + E_x{[D(G(x))^2]}\). They set \(\alpha = 10\) and use the Adam optimizer with a batch size of 1 and a learning rate of 0.0002. The learning rate is kept constant for the first 100 epochs and then linearly decayed to zero over the next 100 epochs.
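
A minimal sketch of these least-squares losses, written as Keras loss functions (the argument names are illustrative; the real/fake outputs come from the PatchGAN discriminators):

mse = tf.keras.losses.MeanSquaredError()

def generator_loss_fn(fake_disc_output):
    # G is trained to push D(G(x)) towards 1, i.e. E_x[(D(G(x)) - 1)^2]
    return mse(tf.ones_like(fake_disc_output), fake_disc_output)

def discriminator_loss_fn(real_disc_output, fake_disc_output):
    # D is trained to push D(y) towards 1 and D(G(x)) towards 0,
    # i.e. E_y[(D(y) - 1)^2] + E_x[D(G(x))^2]
    real_loss = mse(tf.ones_like(real_disc_output), real_disc_output)
    fake_loss = mse(tf.zeros_like(fake_disc_output), fake_disc_output)
    return real_loss + fake_loss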

Code example

We use the horses-and-zebras dataset provided in TensorFlow Datasets, a mix of real images of horses and zebras in different settings. The generator uses the structure of a ResNet. ResNet is short for Residual Network, in which the input of a block is added directly to its output via a skip connection, so the original feature maps are passed on to later layers. This is done because the information is warped and transformed after many convolutional layers; to carry the original structure along as additional information, the input simply skips several layers and is added to the output of the later layer.

Input

horseandzebras
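
The snippets below come from a Keras/TensorFlow implementation and rely on a few imports and globals that are not shown. A minimal setup might look like the following; the initializer and input-size values are assumptions stated here for completeness:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_addons as tfa  # provides InstanceNormalization

# Globals used by the layers and model builders below (assumed values)
input_img_size = (256, 256, 3)
kernel_init = keras.initializers.RandomNormal(mean=0.0, stddev=0.02)
gamma_init = keras.initializers.RandomNormal(mean=0.0, stddev=0.02)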

class ReflectionPadding2D(layers.Layer):
    """Implements Reflection Padding as a layer.

    Args:
        padding(tuple): Amount of padding for the
        spatial dimensions.

    Returns:
        A padded tensor with the same type as the input tensor.
    """

    def __init__(self, padding=(1, 1), **kwargs):
        self.padding = tuple(padding)
        super().__init__(**kwargs)

    def call(self, input_tensor, mask=None):
        padding_width, padding_height = self.padding
        padding_tensor = [
            [0, 0],
            [padding_height, padding_height],
            [padding_width, padding_width],
            [0, 0],
        ]
        return tf.pad(input_tensor, padding_tensor, mode="REFLECT")


def residual_block(
    x,
    activation,
    kernel_initializer=kernel_init,
    kernel_size=(3, 3),
    strides=(1, 1),
    padding="valid",
    gamma_initializer=gamma_init,
    use_bias=False,
):
    dim = x.shape[-1]
    input_tensor = x

    x = ReflectionPadding2D()(input_tensor)
    x = layers.Conv2D(
        dim,
        kernel_size,
        strides=strides,
        kernel_initializer=kernel_initializer,
        padding=padding,
        use_bias=use_bias,
    )(x)
    x = tfa.layers.InstanceNormalization(gamma_initializer=gamma_initializer)(x)
    x = activation(x)

    x = ReflectionPadding2D()(x)
    x = layers.Conv2D(
        dim,
        kernel_size,
        strides=strides,
        kernel_initializer=kernel_initializer,
        padding=padding,
        use_bias=use_bias,
    )(x)
    x = tfa.layers.InstanceNormalization(gamma_initializer=gamma_initializer)(x)
    x = layers.add([input_tensor, x])
    return x


def downsample(
    x,
    filters,
    activation,
    kernel_initializer=kernel_init,
    kernel_size=(3, 3),
    strides=(2, 2),
    padding="same",
    gamma_initializer=gamma_init,
    use_bias=False,
):
    x = layers.Conv2D(
        filters,
        kernel_size,
        strides=strides,
        kernel_initializer=kernel_initializer,
        padding=padding,
        use_bias=use_bias,
    )(x)
    x = tfa.layers.InstanceNormalization(gamma_initializer=gamma_initializer)(x)
    if activation:
        x = activation(x)
    return x


def upsample(
    x,
    filters,
    activation,
    kernel_size=(3, 3),
    strides=(2, 2),
    padding="same",
    kernel_initializer=kernel_init,
    gamma_initializer=gamma_init,
    use_bias=False,
):
    x = layers.Conv2DTranspose(
        filters,
        kernel_size,
        strides=strides,
        padding=padding,
        kernel_initializer=kernel_initializer,
        use_bias=use_bias,
    )(x)
    x = tfa.layers.InstanceNormalization(gamma_initializer=gamma_initializer)(x)
    if activation:
        x = activation(x)
    return x

def get_resnet_generator(
    filters=64,
    num_downsampling_blocks=2,
    num_residual_blocks=9,
    num_upsample_blocks=2,
    gamma_initializer=gamma_init,
    name=None,
):
    img_input = layers.Input(shape=input_img_size, name=name + "_img_input")
    x = ReflectionPadding2D(padding=(3, 3))(img_input)
    x = layers.Conv2D(filters, (7, 7), kernel_initializer=kernel_init, use_bias=False)(
        x
    )
    x = tfa.layers.InstanceNormalization(gamma_initializer=gamma_initializer)(x)
    x = layers.Activation("relu")(x)

    # Downsampling
    for _ in range(num_downsampling_blocks):
        filters *= 2
        x = downsample(x, filters=filters, activation=layers.Activation("relu"))

    # Residual blocks
    for _ in range(num_residual_blocks):
        x = residual_block(x, activation=layers.Activation("relu"))

    # Upsampling
    for _ in range(num_upsample_blocks):
        filters //= 2
        x = upsample(x, filters, activation=layers.Activation("relu"))

    # Final block
    x = ReflectionPadding2D(padding=(3, 3))(x)
    x = layers.Conv2D(3, (7, 7), padding="valid")(x)
    x = layers.Activation("tanh")(x)

    model = keras.models.Model(img_input, x, name=name)
    return model

def get_discriminator(
    filters=64, kernel_initializer=kernel_init, num_downsampling=3, name=None
):
    img_input = layers.Input(shape=input_img_size, name=name + "_img_input")
    x = layers.Conv2D(
        filters,
        (4, 4),
        strides=(2, 2),
        padding="same",
        kernel_initializer=kernel_initializer,
    )(img_input)
    x = layers.LeakyReLU(0.2)(x)

    num_filters = filters
    for num_downsample_block in range(3):
        num_filters *= 2
        if num_downsample_block < 2:
            x = downsample(
                x,
                filters=num_filters,
                activation=layers.LeakyReLU(0.2),
                kernel_size=(4, 4),
                strides=(2, 2),
            )
        else:
            x = downsample(
                x,
                filters=num_filters,
                activation=layers.LeakyReLU(0.2),
                kernel_size=(4, 4),
                strides=(1, 1),
            )

    x = layers.Conv2D(
        1, (4, 4), strides=(1, 1), padding="same", kernel_initializer=kernel_initializer
    )(x)

    model = keras.models.Model(inputs=img_input, outputs=x, name=name)
    return model


# Get the generators
gen_G = get_resnet_generator(name="generator_G")
gen_F = get_resnet_generator(name="generator_F")

# Get the discriminators
disc_X = get_discriminator(name="discriminator_X")
disc_Y = get_discriminator(name="discriminator_Y")
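
As a quick sanity check of the wiring (illustrative only), a random 256x256 image can be pushed through one generator and the corresponding discriminator:

# A random batch of one 256x256 RGB image scaled to [-1, 1]
dummy_horse = tf.random.uniform((1, 256, 256, 3), minval=-1.0, maxval=1.0)

fake_zebra = gen_G(dummy_horse)     # (1, 256, 256, 3), same size as the input
patch_scores = disc_Y(fake_zebra)   # (1, 32, 32, 1): one realness score per patch
print(fake_zebra.shape, patch_scores.shape)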

Generator F

generatorF

Generator G

generatorG

Discriminator X

discriminatorX

Discriminator Y

discriminatorY

Results

horse2zebra

StarGAN

For a normal CycleGAN, we need to build an independent model for each pair of image domains. StarGAN is a CycleGAN-style model that can handle multi-domain image translation. An image can belong to multiple domains, such as: white hair, blond hair, with hat, wearing glasses, happy, sad, etc. With the usual CycleGAN, one model is needed per pair of domains, so for k domains we need k(k-1) models. To achieve multi-domain translation, the generator G is trained to translate an input image x into an output image y conditioned on a target domain label c: \(G(x,c) \rightarrow y\). The discriminator D now predicts both whether an image is real and its domain labels: \(D: x \rightarrow \{D_{src}(x), D_{lbs}(x)\}\)
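
One common way to implement this conditioning (a sketch, not necessarily the authors' exact code) is to tile the target label vector c over the spatial dimensions and concatenate it with the input image before it enters the generator:

def concat_label(image_batch, label_batch):
    # Tile the label vector c over the spatial grid and append it as extra
    # channels, so the generator sees both the image and the target domain.
    height = tf.shape(image_batch)[1]
    width = tf.shape(image_batch)[2]
    labels = label_batch[:, tf.newaxis, tf.newaxis, :]    # (N, 1, 1, k)
    labels = tf.tile(labels, (1, height, width, 1))        # (N, H, W, k)
    return tf.concat([image_batch, labels], axis=-1)       # (N, H, W, 3 + k)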

The adversarial loss becomes:

\[Loss_{adv}=E_x{[logD_{src}(x)]} + E_{x,c}{[log(1-D_{src}(G(x,c)))]}\]

The domain classification loss for real images is:

\[Loss_{lbs}^{r} = E_{x,c}{[-log D_{lbs}(c \mid x) ]}\]

The domain classification loss for fake images is:

\[Loss_{lbs}^{f} = E_{x,c}{[-logD_{lbs}(c\mid G(x,c))]}\]

When we minimize the adversarial and classification losses, G will generate realistic images that are classified into the correct target domain. To enforce consistency, the authors also use a cycle consistency loss for the generator:

\[Loss_{cc} = E_{x,c} {[\mid \mid x - G(G(x,c),c)\mid \mid_1 ]}\]

Combining these losses gives us the objective functions for G and D:

\[Loss_D = - Loss_{adv} + \alpha Loss_{lbs}^r\] \[Loss_G = Loss_{adv} + \alpha Loss_{lbs}^f + \beta Loss_{cc}\]

In their experiments, they set \(\alpha = 1\) and \(\beta = 10\), so the cycle consistency term is weighted ten times more heavily than the classification term.
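
A hedged sketch of the two objectives, using binary cross-entropy for both the source term and the (multi-label) domain classification term; d_src / d_lbs denote the two discriminator heads and are illustrative names, and the generator uses the common non-saturating form of the adversarial term:

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def stargan_d_loss(d_src_real, d_src_fake, d_lbs_real, real_labels, alpha=1.0):
    # -Loss_adv: push D_src(x) towards "real" and D_src(G(x, c)) towards "fake"
    adv = (bce(tf.ones_like(d_src_real), d_src_real)
           + bce(tf.zeros_like(d_src_fake), d_src_fake))
    cls_real = bce(real_labels, d_lbs_real)   # Loss_lbs^r on real images
    return adv + alpha * cls_real

def stargan_g_loss(d_src_fake, d_lbs_fake, target_labels, x, reconstructed_x,
                   alpha=1.0, beta=10.0):
    adv = bce(tf.ones_like(d_src_fake), d_src_fake)        # fake should look real
    cls_fake = bce(target_labels, d_lbs_fake)               # Loss_lbs^f: match target c
    cycle = tf.reduce_mean(tf.abs(x - reconstructed_x))     # Loss_cc: reconstruct x
    return adv + alpha * cls_fake + beta * cycle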

For training, the generator has two convolutional layers with a stride of two (for downsampling), six residual blocks, and two transposed convolutional layers with a stride of two (for upsampling). PatchGANs are used for the discriminator. StarGAN is trained on multiple datasets with different label sets, so the authors add a mask vector that specifies which dataset's labels to focus on.

The following example shows results from a pretrained model. Notice that the middle input image is a drawing and has a hand next to the face, which makes it a bit more difficult to translate.

Code example

We clone this repo and here are the results:

reference-3-2