Neural Network

Introduction

A neural network, also referred to as deep learning, is an advanced machine learning model.

Neural networks are suitable for nearly all kinds of machine learning tasks. Compared to traditional models, a neural network tends to perform better when the training set is large.

Component

Layer

A neural network consists of different layers. A layer is a group of neurons that takes the same or similar features as input and outputs a few numbers together. The first layer (layer 0) is called the input layer, where input and output are the same. The last layer is called the output layer, which outputs the value of the neural network. The input and output layers are the only two layers visible to us; therefore, the other layers are called hidden layers.

There are different types of hidden layers:

  • Dense layer: each neuron's output is a function of all the activation outputs of the previous layer;
  • Convolutional layer: each neuron only looks at part of the previous layer's outputs, and different neurons may look at the same or overlapping outputs.
  • ...


Fig. 1. Multilayer perceptron

Neuron

Each layer is made up of one or more neurons. Each neuron is a traditional machine learning model, such as linear regression or logistic regression. The output of a neuron is called its activation, and the function the neuron applies is called its activation function, because its output activates the neurons of the next layer.

The magic of a neural network is that it can learn new features by itself, so we do not need to specify which neurons feed which. In fact, each neuron takes all the activations of the previous layer as input, but the parameters attached to some of those activations may be zero. We only need to provide the training set and define the structure of the neural network; the network then produces the most suitable new features. In other words, a neuron (a traditional model) is effectively a new feature.

The structure of a neural network is called its architecture. It defines the number of layers and the number of neurons in each layer.


Fig. 2.

Notation

  • $a^{[i]}$ = output of layer i;
  • $\vec{w}^{[i]}, b^{[i]}$ = parameters of layer i.

Forward propagation algorithm

Forward propagation is the series of steps used to compute $f$. It produces an inference or prediction of $y$, so it plays the same role as $\widehat{y}$ in a traditional model.
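Using the notation above, each layer's activations are computed from the previous layer's activations. This is a sketch of the standard per-layer formula, with $\vec{a}^{[0]}=\vec{x}$ and the subscript $j$ indexing the neurons within a layer:
$$a_j^{[i]}=g\!\left(\vec{w}_j^{[i]}\cdot\vec{a}^{[i-1]}+b_j^{[i]}\right)$$
Applying this layer by layer, from layer 1 to the output layer, yields $f$.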


Fig. 3. Handwritten digit recognition

Numpy and Tensorflow

The data representation in numpy is slightly different from that in tensorflow. In numpy, we can represent data either as a matrix or as a vector:

import numpy as np
x = np.array([[200, 17]])   # 1x2 matrix (one row, two columns)
x = np.array([[200],[17]])  # 2x1 matrix (two rows, one column)
x = np.array([200, 17])     # 1-D array: just a row vector, not a matrix

In tensorflow, however, data is represented as matrices. Therefore, when using numpy and tensorflow together, it is advisable to store data in matrix form.
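As a minimal sketch of that advice, a 1-D numpy vector can be promoted to a matrix with reshape before it is handed to tensorflow (the variable names here are only illustrative):

import numpy as np

x_vec = np.array([200, 17])    # a 1-D vector of two features
x_mat = x_vec.reshape(1, -1)   # reshaped into a 1x2 matrix, the form tensorflow expects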

The following is an implementation of a neural network for coffee roasting using numpy and tensorflow (assuming the neural network has already been trained):


Fig. 4. Coffee roasting (two inputs)

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense

x = np.array([[200.0, 17.0]])                   # input example with two features
layer_1 = Dense(units=3, activation='sigmoid')  # hidden layer with 3 neurons
a1 = layer_1(x)                                 # activations of layer 1 (shape 1x3)

layer_2 = Dense(units=1, activation='sigmoid')  # output layer with 1 neuron
a2 = layer_2(a1)                                # final output (shape 1x1)

The data type of a1 and a2 is a tensor, which is a built-in type in tensorflow:

When we print a1:
tf.Tensor([[0.2 0.7 0.3]], shape=(1, 3), dtype=float32)

We can also print it in the form of numpy:

When we print a1.numpy():
array([[0.2, 0.7, 0.3]], dtype=float32)

A tensor is tensorflow's own representation of a matrix: it packages the data together with its shape in a form designed for efficient computation, which is why tensor operations can be run on a GPU.
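As a small sketch, converting in both directions uses standard tensorflow calls (tf.convert_to_tensor to go from a numpy array to a tensor, and .numpy() to go back):

import numpy as np
import tensorflow as tf

x = np.array([[200.0, 17.0]])   # numpy matrix
t = tf.convert_to_tensor(x)     # wrap the same data as a tensorflow tensor
x_back = t.numpy()              # convert the tensor back to a numpy array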

Instead of building a neural network layer by layer, we can directly string the layers together to form the network. That is what Sequential does:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([Dense(units=3, activation='sigmoid'), Dense(units=1, activation='sigmoid')])
...
model.predict(x)
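The elided step above is where the model would normally be compiled and trained. Below is a minimal sketch of what that typically looks like, assuming a tiny illustrative coffee-roasting dataset and a binary cross-entropy loss (both are assumptions, not taken from these notes):

import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

X = np.array([[200.0, 17.0], [120.0, 5.0]])   # illustrative training inputs
Y = np.array([[1.0], [0.0]])                  # illustrative labels (good / bad roast)

model = Sequential([Dense(units=3, activation='sigmoid'),
                    Dense(units=1, activation='sigmoid')])
model.compile(loss=tf.keras.losses.BinaryCrossentropy())   # a loss suited to binary classification
model.fit(X, Y, epochs=10)                                 # train the parameters
model.predict(np.array([[200.0, 17.0]]))                   # inference, as in the snippet above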

Vectorization

GPUs and some CPU instructions are very good at large matrix multiplications. Because a neural network can be vectorized, it can be computed very quickly on such hardware.

For layer 1 of the coffee-roasting network in Fig. 4, the vectorized version is:

import numpy as np

A = np.array([[200, 17]])                  # 1x2 input (one example, two features)
W = np.array([[1, -3, 5], [-2, 4, -6]])    # 2x3 weights: one column per neuron
B = np.array([[-1, 1, 2]])                 # 1x3 biases

def g(z):                                  # sigmoid activation
    return 1 / (1 + np.exp(-z))

def dense(A_in, W, B):
    Z = np.matmul(A_in, W) + B             # matrix multiplication plus bias
    A_out = g(Z)                           # A_out is a row vector of activations
    return A_out
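Building on the dense function above, a full vectorized forward pass simply chains the layers. The second layer's parameters W2 and B2 below are illustrative values, not taken from the notes:

W2 = np.array([[-7.0], [8.0], [9.0]])   # 3x1 weights for the single output neuron
B2 = np.array([[3.0]])                  # 1x1 bias

def sequential(X):
    A1 = dense(X, W, B)      # layer 1: 1x2 input -> 1x3 activations
    A2 = dense(A1, W2, B2)   # layer 2: 1x3 activations -> 1x1 output
    return A2

f_x = sequential(A)          # forward propagation for the input A defined above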

Activation function

An activation function is essentially a post-processing of the linear part of the model that creates a new model $f$. Using an activation function, we can divide our model into two parts. The first part is the same for all models:
$$z=\vec{w}\cdot\vec{x}+b$$
The second part is the activation function $g(z)$; to obtain different models, we only need to select the most suitable $g(z)$. There are three commonly used activation functions: the linear function (identity), Sigmoid (soft step), and ReLU (rectified linear unit).
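A minimal numpy sketch of these three activation functions (the function names are only illustrative):

import numpy as np

def linear(z):      # identity: g(z) = z
    return z

def sigmoid(z):     # soft step: g(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

def relu(z):        # rectified linear unit: g(z) = max(0, z)
    return np.maximum(0, z)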

Linear function


Fig. 5. Linear function

With the linear activation function, we do nothing to the first part of the model, so the model is just a linear regression model:
$$f=g(z)=\vec{w}\cdot\vec{x}+b$$
Since a linear function of a linear function is still a linear function, we do not use the linear activation in hidden layers; otherwise, those hidden layers would be useless.
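To see why, compose two layers that both use the linear activation (written here with scalar parameters for simplicity):
$$a^{[1]}=w^{[1]}x+b^{[1]},\qquad a^{[2]}=w^{[2]}a^{[1]}+b^{[2]}=\left(w^{[2]}w^{[1]}\right)x+\left(w^{[2]}b^{[1]}+b^{[2]}\right)$$
The result is still just a linear function of $x$, so the extra layer adds no expressive power.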

Sigmoid


Fig. 6. Sigmoid

Sigmoid is useful when the output has just two possible values, so it is often used in the output layer for binary classification.

ReLU


Fig. 7. ReLU

ReLU is one of the most commonly used activation functions in hidden layers. Because its slope is constant on each of the negative and positive semi-axes (it does not flatten out at both ends the way Sigmoid does), gradient descent with ReLU usually converges much faster. In addition, ReLU works well because it has an "off" region, which lets neurons be stitched together to form complex non-linear functions:


Fig. 8. Unit0+Unit1+Unit2

Andrew Ng suggests that for the output layer we should select the activation function that produces exactly the kind of output we need, while for hidden layers it is advisable to choose ReLU as the default activation function.

Softmax regression

Softmax regression, or the softmax activation function, is used for multiclass classification. Multiclass classification is an extension of binary classification in which the number of possible output classes is more than two.


Fig. 9. Multiclass classification

In binary classification, $g(z)$ is the probability that $y=1$, and the probability that $y=0$ is simply $1-g(z)$. In multiclass classification we cannot do that directly, so softmax computes the probability of every possible class. We use $z_i$ for the score of class $i$ and $a_i$ for its probability:
$$z_1=\vec{w_1}\cdot\vec{x}+b_1;\qquad a_1=\frac{e^{z_1}}{e^{z_1}+...+e^{z_n}}=P(y=1\mid\vec{x})$$
$$...$$
$$z_n=\vec{w_n}\cdot\vec{x}+b_n;\qquad a_n=\frac{e^{z_n}}{e^{z_1}+...+e^{z_n}}=P(y=n\mid\vec{x})$$
And the loss function is:
$$L(a_1,...,a_n,y)=\begin{cases}
-\log{a_1},&y=1 \\
...& \\
-\log{a_n},&y=n
\end{cases}$$

The loss function pushes $a_i$ towards 1 when $y=i$. Binary classification is a special case where $n=2$.
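To check the $n=2$ case, dividing the numerator and denominator of $a_1$ by $e^{z_1}$ shows that softmax with two classes reduces to the Sigmoid of the score difference:
$$a_1=\frac{e^{z_1}}{e^{z_1}+e^{z_2}}=\frac{1}{1+e^{z_2-z_1}}=g(z_1-z_2),\qquad a_2=1-a_1$$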

Softmax is a special activation in a neural network because it is really a whole layer: its output is a vector whose elements are the probabilities of the possible classes.
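A minimal numpy sketch of such a softmax layer's output (subtracting the maximum before exponentiating is a common numerical-stability trick; it does not change the result):

import numpy as np

def softmax(z):
    # z: the vector (z_1, ..., z_n) of scores for one example
    ez = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return ez / ez.sum()         # a_i = e^(z_i) / (e^(z_1) + ... + e^(z_n))

a = softmax(np.array([2.0, 1.0, 0.1]))   # a vector of class probabilities summing to 1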


Fig. 10. Neural network with softmax

Multi-label classification

Multi-label classification is another type of classification in which a single input can be assigned several labels at once. To realize this, we simply use several sigmoid units in the output layer, one per label.
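A Keras sketch of such an output layer, assuming three labels and illustrative hidden-layer sizes (none of these numbers come from the notes):

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=3, activation='sigmoid'),   # one sigmoid unit per label: three independent yes/no outputs
])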


Fig. 11. Multi-label classification

Adam algorithm

The Adam algorithm is an optimization of gradient descent that adjusts the learning rate $\alpha$ automatically. In Adam, each parameter has its own learning rate $\alpha_j$, all initialized to the same value (a usage sketch follows the list below):

  • If $w_j$ or $b$ keeps moving in the same direction, it increases $\alpha_j$;
  • If $w_j$ or $b$ keeps oscillating, it reduces $\alpha_j$.
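A minimal Keras sketch of switching the optimizer to Adam; the initial learning rate of 1e-3, the layer sizes, and the binary cross-entropy loss are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([Dense(units=3, activation='sigmoid'),
                    Dense(units=1, activation='sigmoid')])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),   # Adam adapts each parameter's step size
              loss=tf.keras.losses.BinaryCrossentropy())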


Fig. 12. Adam algorithm