
Definition

UNET is a neural network architecture originally designed for image segmentation in medical research. Below is the architecture from the original paper. It is built from the usual components: convolutional layers, max pooling, and so on. The left half is the encoder, which compresses the input image into increasingly abstract feature maps. The right half is the decoder, which upsamples and decompresses those feature maps back to the original image size. To avoid losing spatial information along the way, each decoder block concatenates its input with the feature map produced by the corresponding encoder block (a skip connection).

[Figure: the UNET architecture from the original paper]

The main principle of UNET is that it labels each pixel as belonging to a specific cell/region or not, so it uses a pixel-wise cross-entropy loss. The per-pixel activations go through a softmax as usual to output probabilities over K regions/classes.

\[L = -\sum_{x \in \Omega} w(x) \log(p_{k(x)}(x))\]

where \(\Omega\) is the set of pixel positions and \(k(x)\) is the true label of pixel \(x\).

The term \(w(x)\) gives different weights to different pixels. This teaches the network the boundaries between regions, so that we get separating lines between nearby cells. Here is the weight map:

\[w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right)\]

where \(w_c\) balances the class frequencies, \(d_1\) is the distance to the border of the nearest cell, and \(d_2\) is the distance to the border of the second nearest cell. In the original paper, they set \(w_0 = 10\) and \(\sigma \approx 5\) pixels.
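A common way to compute this border weighting in practice is with distance transforms over an instance-labeled mask. Below is a minimal sketch using scipy; the helper name unet_weight_map is ours, and the class-balancing term \(w_c\) is omitted for brevity:

import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(labels, w0=10.0, sigma=5.0):
    """Border weight term of the UNET weight map.
    labels: 2D int array, 0 = background, 1..K = cell instances."""
    ids = [i for i in np.unique(labels) if i != 0]
    if len(ids) < 2:
        # fewer than two cells: no between-cell borders to emphasize
        return np.zeros(labels.shape, dtype=np.float32)
    # distance from every pixel to each individual cell
    dists = np.stack([distance_transform_edt(labels != i) for i in ids], axis=-1)
    dists.sort(axis=-1)
    d1, d2 = dists[..., 0], dists[..., 1]  # nearest and second-nearest cell
    w = w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
    # the border term only matters on background pixels between cells
    return (w * (labels == 0)).astype(np.float32)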

For initialization, they draw weights from a Gaussian distribution with a standard deviation of \(\sqrt{\frac{2}{N}}\), where N is the number of incoming nodes of one neuron. This makes each feature map have approximately unit variance.
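This scheme is He initialization, which Keras ships as the he_normal initializer (a truncated normal with standard deviation \(\sqrt{2/N}\)), so the same effect can be obtained per layer:

from tensorflow.keras.layers import Conv2D, Input

inputs = Input((512, 512, 3))
# he_normal draws from a truncated normal with std sqrt(2 / fan_in),
# matching the initialization described in the paper
x = Conv2D(64, 3, padding="same", kernel_initializer="he_normal")(inputs)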

Since the data are microscopy images, the model needs to be robust to shifts, rotations, deformations, and gray-value variations. The authors generate smooth deformations by sampling random displacement vectors on a coarse grid and interpolating them to per-pixel displacements.
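A minimal sketch of such an elastic deformation for a single-channel image, using scipy; the function name and the alpha/sigma values here are illustrative, not the paper's exact recipe:

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=34.0, sigma=4.0, rng=None):
    """Smooth random deformation: sample per-pixel random displacements,
    smooth them with a Gaussian, scale by alpha, then resample."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(image, [ys + dy, xs + dx], order=1, mode="reflect")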

Code implementation

Let’s examine a Kaggle challenge (the 2018 Data Science Bowl) in which we need to segment the nuclei in cell images, using UNET. Examples of the images and their masks:

[Figure: three sample images with their segmentation masks]

from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, MaxPool2D, Conv2DTranspose, Concatenate, Input
from tensorflow.keras.models import Model

def conv_block(inputs, num_filters):
    # two rounds of 3x3 convolution + batch norm + ReLU
    x = Conv2D(num_filters, 3, padding="same")(inputs)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)

    x = Conv2D(num_filters, 3, padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)

    return x

def encoder_block(inputs, num_filters):
    s = conv_block(inputs, num_filters)  # skip features, passed to the decoder
    p = MaxPool2D((2, 2))(s)             # downsampled output, fed deeper
    return s, p

def decoder_block(inputs, skip_features, num_filters):
    x = Conv2DTranspose(num_filters, (2, 2), strides=2, padding="same")(inputs)  # upsample 2x
    x = Concatenate()([x, skip_features])  # skip connection from the encoder
    x = conv_block(x, num_filters)
    return x

def build_unet(input_shape):
    # Input layer
    inputs = Input(input_shape)

    # Encoder: each block halves the spatial resolution and doubles the filters
    s1, p1 = encoder_block(inputs, 64)
    s2, p2 = encoder_block(p1, 128)
    s3, p3 = encoder_block(p2, 256)
    s4, p4 = encoder_block(p3, 512)

    # Bottleneck
    b1 = conv_block(p4, 1024)

    # Decoder: each block upsamples and concatenates the matching skip features
    d1 = decoder_block(b1, s4, 512)
    d2 = decoder_block(d1, s3, 256)
    d3 = decoder_block(d2, s2, 128)
    d4 = decoder_block(d3, s1, 64)

    # Output layer: per-pixel probability of "nucleus" (binary segmentation)
    outputs = Conv2D(1, 1, padding="same", activation="sigmoid")(d4)

    model = Model(inputs, outputs, name="UNET")
    return model

model = build_unet((512, 512, 3))
model.summary()

As metrics, we make use of IoU and the Dice coefficient to measure the similarity between the prediction and the ground truth.
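For two pixel sets A (the prediction) and B (the ground truth), the two metrics are defined as:

\[IoU = \frac{|A \cap B|}{|A \cup B|}, \qquad Dice = \frac{2|A \cap B|}{|A| + |B|}\]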

import numpy as np
import tensorflow as tf

# Metrics
def iou(y_true, y_pred):
    # wrap a numpy computation so it can be used as a Keras metric
    def f(y_true, y_pred):
        intersection = (y_true * y_pred).sum()
        union = y_true.sum() + y_pred.sum() - intersection
        x = (intersection + 1e-15) / (union + 1e-15)  # epsilon avoids division by zero
        return x.astype(np.float32)
    return tf.numpy_function(f, [y_true, y_pred], tf.float32)

smooth = 1e-15
def dice_coef(y_true, y_pred):
    y_true = tf.keras.layers.Flatten()(y_true)
    y_pred = tf.keras.layers.Flatten()(y_pred)
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2. * intersection + smooth) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def dice_loss(y_true, y_pred):
    return 1.0 - dice_coef(y_true, y_pred)

# hyper params
batch_size = 4
lr = 1e-4
num_epochs = 10
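As a variation (not used in the runs below), the dice_loss defined above can also serve directly as the training objective:

from tensorflow.keras.optimizers import Adam
model.compile(loss=dice_loss, optimizer=Adam(lr), metrics=[dice_coef, iou])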

For the plain UNET model (without image augmentation), the Dice coefficient is 0.83, which shows high similarity between the ground truth and the predictions. Plotting some predictions, we can see that there are still issues: borders between touching nuclei are not handled well, and some nuclei are not fully filled in (holes inside). We plot the UNET results together with Canny (a traditional computer vision algorithm used to trace edges).

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import Recall, Precision

model = build_unet((IMG_HEIGHT, IMG_WIDTH, 3))
metrics = [dice_coef, iou, Recall(), Precision()]
model.compile(loss="binary_crossentropy", optimizer=Adam(lr), metrics=metrics)
train_steps = (len(X_train)//batch_size)

model.fit(
        X_train, Y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        steps_per_epoch=train_steps,
    )
# dice coef 0.83
model.save('model_data_science_bowl_2018.h5')
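For the Canny comparison, an edge map can be produced with OpenCV in a couple of lines; this assumes X_test holds 8-bit RGB images, and the two thresholds are illustrative:

import cv2
from matplotlib import pyplot as plt

gray = cv2.cvtColor(X_test[0], cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)  # hysteresis thresholds: 100 (weak), 200 (strong)
plt.imshow(edges, cmap="gray")
plt.show()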

[Figure: predictions of the baseline UNET, alongside Canny edge detection]

Data augmentation

When the dataset is small, data augmentation is a common technique to improve accuracy: from one image, we can shift it to the right, rotate it a bit, or apply an affine transformation. After some simple augmentation of the data, the Dice coefficient rises to 0.84.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

BATCH_SIZE = 4
# Create the training image and mask generators with identical augmentations
data_gen_args = dict(shear_range=0.5, rotation_range=50, zoom_range=0.2,
                     width_shift_range=0.2, height_shift_range=0.2, fill_mode='reflect')
image_datagen = ImageDataGenerator(**data_gen_args)
mask_datagen = ImageDataGenerator(**data_gen_args)

# Keep the same seed for the image and mask generators so they stay aligned
seed = 1
image_datagen.fit(X_train[:int(X_train.shape[0]*0.9)], augment=True, seed=seed)
mask_datagen.fit(Y_train[:int(Y_train.shape[0]*0.9)], augment=True, seed=seed)

x = image_datagen.flow(X_train[:int(X_train.shape[0]*0.9)], batch_size=BATCH_SIZE, shuffle=True, seed=seed)
y = mask_datagen.flow(Y_train[:int(Y_train.shape[0]*0.9)], batch_size=BATCH_SIZE, shuffle=True, seed=seed)

# Check that the augmented images and masks still line up

from matplotlib import pyplot as plt
%matplotlib inline

plt.imshow(next(x)[0].astype(np.uint8))
plt.show()
plt.imshow(np.squeeze(next(y)[0].astype(np.uint8)))
plt.show()

# Create a training generator that yields (image batch, mask batch) pairs
train_generator = zip(x, y)

# Fit model
# earlystopper = EarlyStopping(patience=3, verbose=1)
# checkpointer = ModelCheckpoint('model-dsbowl2018-1.h5', verbose=1, save_best_only=True)
model2 = build_unet((IMG_HEIGHT, IMG_WIDTH, 3))
metrics = [dice_coef, iou, Recall(), Precision()]
model2.compile(loss="binary_crossentropy", optimizer=Adam(lr), metrics=metrics)
train_steps = len(X_train) // batch_size

results = model2.fit(train_generator, steps_per_epoch=train_steps, epochs=10)
# dice coef 0.84
model2.save("model_augmented_data.h5")
print('done')

[Figure: predictions of the UNET trained with augmented data]

Filters and feature maps

To understand this neural net better, we plot some of the filters (mostly 3×3 matrices) and the feature maps (the matrices produced after the input is convolved with those filters). The following are the filters of the first convolutional layer. They look rather simple, since the early layers only detect very simple features. When we visualize the first feature maps (the input after a single convolution), we can see that the initial layers look for contrast: they distinguish the inside and outside of the nuclei. When we plot a middle feature map (the input after many convolutions), the images show a very stark difference at the borders, each in a different direction. So halfway through the network, the neurons have learned to outline the nuclei, clearly recognizing them. These images show that visualization is a great way to see into the black box of a neural network: to see what the neurons see after each layer.

[Figure: 3×3 filters of the first convolutional layer]
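The filter weights themselves can be read straight out of the trained model. A minimal sketch (assuming, as in build_unet above, that the first Conv2D layer sits at index 1):

# weights of the first Conv2D layer: shape (3, 3, in_channels, num_filters)
filters, biases = model.layers[1].get_weights()
# normalize to 0..1 so the weights are displayable as images
filters = (filters - filters.min()) / (filters.max() - filters.min())

fig = plt.figure(figsize=(8, 8))
for i in range(16):
    fig.add_subplot(4, 4, i + 1)
    plt.imshow(filters[:, :, 0, i], cmap='gray')  # first input channel only
    plt.axis('off')
plt.show()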

# Redefine the model to output the feature maps of selected conv layers
ixs = [1, 4, 8, 11, 15]
outputs = [model.layers[i].output for i in ixs]
model = Model(inputs=model.inputs, outputs=outputs)
img = X_test[0]
# expand dimensions so the image represents a single 'sample'
img = np.expand_dims(img, axis=0)
# get the feature maps of all selected layers
feature_maps = model.predict(img)
# take the output of the first conv layer and drop the batch dimension
fmaps = np.squeeze(feature_maps[0])


# Plot the first 16 channels of the feature map in a 4x4 grid
fig = plt.figure(figsize=(10, 10))
rows, columns = 4, 4

for i in range(16):
    fig.add_subplot(rows, columns, i + 1)
    plt.imshow(fmaps[:, :, i])
    plt.axis('off')

plt.savefig('fmap0')
plt.show()

[Figure: feature maps of the first convolutional layer]

[Figure: feature maps of a middle convolutional layer]