TOC

Definition

Recall the fully connected neural net: a linear combination of input features followed by a nonlinear transformation (an activation function). For this kind of net we need to flatten the input into a one-dimensional array, and a one-dimensional array throws away any two-dimensional structure. Meanwhile, a lot of real-world data inherently has more than one dimension. For example, a colored image has 3 channels, each channel being a single-colored image of width w x height h (pixels), so the image is naturally represented as 3 matrices of size w x h. Moreover, if we flatten such an image to feed it into a fully connected network as we did before, the number of connections and parameters grows into the millions quite fast. A convolutional neural net (CNN) has far fewer parameters given the same number of neurons and hidden layers, making it easier to train and compute.

Convo layer

To process such structured data, we use a convolutional neural network: instead of connecting every input to every neuron, we transform (convolve) the input with small matrix operations. The function used to transform the input is called a filter or kernel, effectively a small matrix (usually 3x3) full of weights (\(\theta\)). The filter transforms the entire input matrix by sliding over it from the top left to the bottom right. Each step of the slide can move one, two or more positions (the stride); a larger stride shrinks the output relative to the original input. Since sliding a filter loses some values at the borders, we can pad the original input matrix with zeros around it so that the algorithm leaves nothing untouched. The result of the sliding then has a bias (\(\theta_0\)) added and is passed through an activation, just like the machine learning models we already know (though the bias addition and activation could also be done after the pooling layer explained below). Each filter produces an output called an activation map, or rather contributes one slice of its depth: a convolution layer with 10 filters (kernels) results in an activation map of depth 10. Each depth slice carries one simple piece of information about the original input, as we explain shortly below.
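
For reference, with input width \(W\), filter size \(f\), padding \(p\) and stride \(s\), the output width is

\[W_{out} = \left\lfloor \frac{W + 2p - f}{s} \right\rfloor + 1\]

and similarly for the height. For example, a \(28 \times 28\) input with a \(3 \times 3\) filter, \(p = 1\) and \(s = 1\) gives a \(28 \times 28\) activation map, while the same setup with \(s = 2\) gives \(14 \times 14\). This is exactly the formula used in the conv2d function below.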

The filters in the early layers of the net are dubbed low-level neurons because they resemble biological simple neurons (or locally receptive neurons): in the brain, simple neurons are in charge of identifying very basic shapes such as a horizontal or a vertical line. Those simple signals are then sent to groups of more complex neurons, and each of those groups (separately!) combines the lines to recognize more abstract patterns (corners, wrinkles, eyes, noses, etc.).

!pip install opencv-python
import numpy as np
import matplotlib.pyplot as plt
import cv2 
from PIL import Image

%matplotlib inline

X = np.array(Image.open('house.jpg'))
X = X.dot([0.299, 0.5870, 0.114]) # to grey

def conv2d(X, F, s = 1, p = 1):
    # convolve the 2D input X with the filter F, using stride s and zero-padding p
    (w1, h1) = X.shape
    f = F.shape[0]                  # filter size (assumed square)
    w2 = int((w1 + 2*p - f)/s) + 1  # output dimensions
    h2 = int((h1 + 2*p - f)/s) + 1
    Y = np.zeros((w2, h2))
    X_pad = np.pad(X, pad_width = p, mode = 'constant', constant_values = 0)
    for i in range(w2):
        for j in range(h2):
            idw = i*s               # top-left corner of the current window
            idh = j*s
            # absolute filter response at this position (abs for nicer visualization)
            Y[i, j] = np.abs(np.sum(X_pad[idw:(idw+f), idh:(idh+f)]*F))
    return Y

F1 = np.array([[-1, -1, -1], # filter that only sees horizontal lines
              [0, 0, 0],
              [1, 1, 1]])

Y1 = conv2d(X, F1, s = 1, p = 1)
# plt.imshow(Y1)
F2 = np.array([[1, 0, -1], # filter that only sees vertical lines
             [1, 0, -1],
             [1, 0, -1]])
Y2 = conv2d(X, F2, s = 3, p = 1)
# plt.imshow(Y2)


fig = plt.figure(figsize=(20, 20))
  
# setting values to rows and column variables
rows = 1
columns = 2
  
# Adds a subplot at the 1st position
fig.add_subplot(rows, columns, 1)
  
# showing image
plt.imshow(Y1)
plt.axis('off')
plt.title("horizontal seer")
  
# Adds a subplot at the 2nd position
fig.add_subplot(rows, columns, 2)
  
# showing image
plt.imshow(Y2)
plt.axis('off')
plt.title("vertical seer")

Figure: the two activation maps side by side, "horizontal seer" (left) and "vertical seer" (right).

Those filters can also be seen as the visual receptive fields of the neurons in the activation maps. The receptive field is small: each neuron is connected not to the whole input but only to a limited patch of it, hence the term locally receptive. Note that although the width and height shrink, the depth of the activation map (one slice per filter) can increase quite significantly in some cases.

Note that there is also Conv1D, whose kernel is a one-dimensional vector (e.g. of length 3), used for sequence-like data.
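
As a minimal sketch (the function name and example values are illustrative), a 1D convolution slides the kernel along a sequence the same way conv2d above slides it over an image:

import numpy as np

def conv1d(x, f, s=1):
    # slide a length-len(f) filter over the 1D signal x with stride s
    n_out = (len(x) - len(f)) // s + 1
    return np.array([np.sum(x[i*s:i*s+len(f)] * f) for i in range(n_out)])

x = np.array([0, 0, 1, 1, 1, 0, 0], dtype=float)
f = np.array([-1, 0, 1], dtype=float)  # responds to rising and falling edges
print(conv1d(x, f))  # [ 1.  1.  0. -1. -1.]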

Batch normalization layer

This layer is usually put right after the convo layer, early in the net, because normalizing the activations there stabilizes and speeds up training. Batch normalization is a technique that standardizes a layer's activations back to (roughly) zero mean and unit variance to help the gradient descent process. For each batch, after calculating the activations, we compute the mean and standard deviation of the inputs, normalize, and then rescale with two learnable parameters \(\gamma\) (scale) and \(\beta\) (shift):

\[\mu = \frac{1}{m} \sum_{i=1}^{m} x_i\] \[\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2\] \[\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}\] \[y_i = \gamma \hat{x}_i + \beta\]
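
A minimal numpy sketch of this normalization (names are illustrative, and \(\gamma\), \(\beta\) appear as plain arguments rather than trained parameters):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize each feature of the batch to zero mean and unit variance,
    # then rescale and shift with the learnable parameters gamma and beta
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(32, 10) * 5 + 3          # 32 samples, 10 features
print(batch_norm(batch).mean(axis=0).round(3))   # ~0 for every feature
print(batch_norm(batch).std(axis=0).round(3))    # ~1 for every feature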

Pooling layer

After the convolution, we can add a pooling layer. A pooling window is usually 2x2 (up to 5x5), and it keeps a single value per window, most often the maximum, sometimes the average. The purpose is to reduce the matrix size, and hence complexity, computation and overfitting; the depth of the activation map stays the same. Gradient descent flows through max pooling normally, but only via the element that was the maximum.
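
A minimal numpy sketch of 2x2 max pooling (illustrative names, assuming a square window and no padding):

import numpy as np

def max_pool2d(X, size=2, stride=2):
    # keep only the maximum value of each size x size window
    h, w = X.shape
    h_out, w_out = (h - size) // stride + 1, (w - size) // stride + 1
    Y = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            Y[i, j] = X[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return Y

X = np.arange(16).reshape(4, 4)
print(max_pool2d(X))  # [[ 5.  7.]
                      #  [13. 15.]]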

Dropout layer

Since the number of parameters can grow large, there is a regularization technique called dropout that temporarily zeroes out a random fraction of the neurons during training. Think of this analogy: in a company, when a person is absent, others learn to take over his job; with regular absences, the employees become more generalized in their skills, which is better for the company overall. The same holds for the neural net, since it overfits less.
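
A minimal sketch of dropout at training time (using the inverted-dropout convention, where the surviving activations are scaled up so their expected value is unchanged; names are illustrative):

import numpy as np

def dropout(x, rate=0.5, training=True):
    # zero out a random fraction `rate` of the activations during training
    # and scale the survivors by 1/(1-rate); do nothing at inference time
    if not training:
        return x
    mask = (np.random.rand(*x.shape) >= rate).astype(x.dtype)
    return x * mask / (1.0 - rate)

x = np.ones((2, 5))
print(dropout(x, rate=0.4))  # roughly 40% of the entries become 0, the rest ~1.67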

Residual layer

There is an idea in deep learning that after many transformations, the data loses some of its information. To preserve the information in the input, we simply add the original input to the output of the convolution layers; those inputs are said to skip some transformations (a skip connection). U-Net and ResNet are networks that implement this idea.
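
A minimal Keras sketch of the idea (a simplified residual block, not the exact ResNet block): the input skips over two convolutions and is added back to their output.

import tensorflow as tf

def residual_block(x, filters):
    # two convolutions, then add the original input back (the skip connection);
    # `filters` must match the channel count of x for the addition to work
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([y, shortcut])
    return tf.keras.layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(28, 28, 64))
outputs = residual_block(inputs, 64)
model = tf.keras.Model(inputs, outputs)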

Code example

In this example, let's explore Fashion MNIST (similar to MNIST but with clothing images). First, we add an extra channel dimension to the data so that each image becomes a 28 x 28 x 1 tensor (a multidimensional array), and we scale the pixel values to [0, 1] for training.

import numpy as np
import tensorflow as tf

mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = mnist
X_train_full = np.expand_dims(X_train_full, axis=-1).astype(np.float32) / 255
X_test = np.expand_dims(X_test.astype(np.float32), axis=-1) / 255
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

A CNN architecture is more involved than a fully connected net. First we put a convolution layer (usually with relu activation), then a max pool, then another convolution layer (often with twice as many filters, to represent richer information), another max pool, and so on. When the convolutional part seems deep enough to capture the complex features of the data, we move on to the dense layers. Before that, we need a Flatten layer, which has no weights: it simply unrolls the neurons of the last convolution output into a one-dimensional vector. Then we stack some dense layers, mixed with dropout (rate typically 0.2 to 0.5). The last layer, as usual, is a softmax for a classification problem, or a linear output (relu if the targets are non-negative) for regression.

from functools import partial
import keras
import tensorflow as tf
import numpy as np

tf.random.set_seed(42)
DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, padding="same",
                        activation="relu", kernel_initializer="he_normal")

model = tf.keras.Sequential([
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPool2D(),
    DefaultConv2D(filters=128),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=64, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))
score = model.evaluate(X_test, y_test)
X_new = X_test[:10]  # pretend we have new images
y_pred = model.predict(X_new)

Epoch 1/10

1719/1719 [==============================] - 240s 136ms/step - loss: 0.7840 - accuracy: 0.7094 - val_loss: 0.3669 - val_accuracy: 0.8788

Epoch 2/10

1719/1719 [==============================] - 229s 133ms/step - loss: 0.4747 - accuracy: 0.8292 - val_loss: 0.2914 - val_accuracy: 0.8920

Epoch 3/10

1719/1719 [==============================] - 226s 132ms/step - loss: 0.3912 - accuracy: 0.8591 - val_loss: 0.2578 - val_accuracy: 0.9042

Epoch 4/10

1719/1719 [==============================] - 226s 132ms/step - loss: 0.3450 - accuracy: 0.8765 - val_loss: 0.2484 - val_accuracy: 0.9084

Epoch 5/10

1719/1719 [==============================] - 230s 134ms/step - loss: 0.3077 - accuracy: 0.8894 - val_loss: 0.2387 - val_accuracy: 0.9096

Epoch 6/10

1719/1719 [==============================] - 231s 134ms/step - loss: 0.2845 - accuracy: 0.8979 - val_loss: 0.2483 - val_accuracy: 0.9092

Epoch 7/10

1719/1719 [==============================] - 230s 134ms/step - loss: 0.2568 - accuracy: 0.9071 - val_loss: 0.2445 - val_accuracy: 0.9184

Epoch 8/10

1719/1719 [==============================] - 233s 136ms/step - loss: 0.2368 - accuracy: 0.9129 - val_loss: 0.2285 - val_accuracy: 0.9186

Epoch 9/10

1719/1719 [==============================] - 228s 133ms/step - loss: 0.2233 - accuracy: 0.9188 - val_loss: 0.2314 - val_accuracy: 0.9162

Epoch 10/10

1719/1719 [==============================] - 232s 135ms/step - loss: 0.2075 - accuracy: 0.9234 - val_loss: 0.2334 - val_accuracy: 0.9182

313/313 [==============================] - 11s 37ms/step - loss: 0.2620 - accuracy: 0.9158

1/1 [==============================] - 0s 142ms/step

Conclusion

In computer vision, one of the differences between traditional machine learning and deep learning is that deep learning works on data at a scale much larger than what a traditional system can handle. It is able to do so thanks to its large number of parameters (millions to billions). In a deep network, the layers decrease in spatial size but can increase in depth, which gives a better representation (feature learning) of the data. One more thing: at the end of the convolutional layers, the output is a stack of two-dimensional feature maps. To turn them into a one-dimensional prediction vector, we run the output through a flatten operation that unrolls those matrices, and then feed it to the fully connected layers.