Neural Networks and Deep Learning

Hui Lin @Google

Ming Li @Amazon

Course Website

https://course2020.scientistcafe.com/

Types of Neural Network

Figure adapted from slides by Andrew Ng, Deep Learning Specialization

A Little Bit of History – Perceptron

\[z^{(i)} = w_0 + w_1x_1^{(i)} + w_2x_2^{(i)}\]

\[pred^{(i)}=\begin{cases} \begin{array}{c} 1\\ -1 \end{array} & \begin{array}{c} if\ z^{(i)}>0\\ if\ z^{(i)}\le0 \end{array}\end{cases}\]

Perceptron Algorithm

Start with random weights. Set a maximum number of epochs M. For each epoch, loop over the (permuted) training samples and update the weights:

\[w_0 = w_0 + \eta(actual^{(i)} - pred^{(i)})\]

\[w_1 = w_1 + \eta(actual^{(i)} - pred^{(i)})x_1^{(i)}\] \[w_2 = w_2 + \eta(actual^{(i)} - pred^{(i)})x_2^{(i)}\]
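A minimal base-R sketch of this update rule (the function name `perceptron_train`, the learning rate `eta = 0.1`, and the epoch cap `M = 100` are illustrative choices, not from the slides):

```r
# Perceptron training sketch: two features x1, x2 and labels in {-1, 1}
perceptron_train <- function(x1, x2, actual, eta = 0.1, M = 100) {
  w <- runif(3, -0.5, 0.5)                  # random start: w0 (intercept), w1, w2
  for (epoch in seq_len(M)) {
    for (i in sample(seq_along(actual))) {  # one random permutation per epoch
      z    <- w[1] + w[2] * x1[i] + w[3] * x2[i]
      pred <- if (z > 0) 1 else -1
      w    <- w + eta * (actual[i] - pred) * c(1, x1[i], x2[i])
    }
  }
  w
}
```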

Perceptron Algorithm

Logistic Regression as A Neural Network

\[X=\left[\begin{array}{cccc} x_{1}^{(1)} & x_{1}^{(2)} & \dotsb & x_{1}^{(m)}\\ x_{2}^{(1)} & x_{2}^{(2)} & \dotsb & x_{2}^{(m)}\\ \vdots & \vdots & \vdots & \vdots\\ x_{n_{x}}^{(1)} & x_{n_{x}}^{(2)} & \dots & x_{n_{x}}^{(m)} \end{array}\right]\in\mathbb{R}^{n_{x}\times m}\]

\[y=[y^{(1)},y^{(2)},\dots,y^{(m)}] \in \mathbb{R}^{1 \times m}\]

\(\hat{y}^{(i)} = \sigma(w^Tx^{(i)} + b)\) where \(\sigma(z) = \frac{1}{1+e^{-z}}\)
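A hedged base-R sketch of this vectorized forward pass together with the usual cross-entropy loss (the names `sigmoid`, `forward`, and `loss` are illustrative):

```r
# X is an (n_x x m) matrix, w an n_x-length weight vector, b a scalar
sigmoid <- function(z) 1 / (1 + exp(-z))
forward <- function(X, w, b) sigmoid(t(w) %*% X + b)   # (1 x m) vector of predictions y_hat

# Cross-entropy loss averaged over the m samples
loss <- function(y_hat, y) -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
```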

Logistic Regression as A Neural Network


Forward Propagation

Forward Propagation

Backward Propagation

Stochastic Gradient Descent (SGD)

Neural Network: 0 Hidden Layer Neural Network

Neural Network: 1 Hidden Layer Neural Network

Neural Network: 1 Layer Neural Network

Neural Network: 1 Layer Neural Network

Neural Network: 1 Hidden Layer Neural Network

Across m Samples

Across m Samples

MNIST Dataset

Image Data

1 Hidden Layer Neural Network Example

Forward and Backward Propagation

Forward and Backward Propagation

Deep Neural Network

Batch, Mini-batch, Stochastic Gradient Descent

\[x=[\underbrace{x^{(1)},x^{(2)},\cdots,x^{(1000)}}_{\text{mini-batch }1}\mid\cdots\mid\cdots\,x^{(m)}]\in\mathbb{R}^{n_{x}\times m}\]

\[y=[\underbrace{y^{(1)},y^{(2)},\cdots,y^{(1000)}}_{\text{mini-batch }1}\mid\cdots\mid\cdots\,y^{(m)}]\in\mathbb{R}^{1\times m}\]
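A base-R sketch of splitting shuffled columns into mini-batches of 1,000, assuming `x` is the \(n_x \times m\) feature matrix and `y` the \(1 \times m\) label matrix (the helper name `make_minibatches` is illustrative):

```r
# Shuffle the m columns and split them into mini-batches of size 1000
make_minibatches <- function(x, y, batch_size = 1000) {
  m    <- ncol(x)
  perm <- sample(m)                                   # random permutation of the sample indices
  idx  <- split(perm, ceiling(seq_along(perm) / batch_size))
  lapply(idx, function(i) list(x = x[, i, drop = FALSE],
                               y = y[, i, drop = FALSE]))
}
```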

Batch, Mini-batch, Stochastic Gradient Descent

Activation Functions

Activation Functions

Activation Functions

  Advantages of the ReLU activation function:
  1. fast computation;
  2. non-linearity;
  3. reduced likelihood of vanishing gradients;
  4. unconstrained response.
  Other activation functions:
    • Sigmoid: studied extensively in the past, but not as good as ReLU in deep learning because of the vanishing gradient problem when there are many layers
    • Hyperbolic tangent function (tanh)
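A base-R sketch comparing these activation functions (plotting choices are illustrative):

```r
# ReLU, sigmoid, and tanh on a grid of inputs; tanh() is built into base R
relu    <- function(z) pmax(0, z)
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- seq(-5, 5, by = 0.1)
plot(z, relu(z), type = "l", ylim = c(-1.5, 3), ylab = "activation")
lines(z, sigmoid(z), lty = 2)
lines(z, tanh(z), lty = 3)
legend("topleft", legend = c("ReLU", "sigmoid", "tanh"), lty = 1:3)
```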

Deal with Overfitting: Regularization

For logistic regression,

\[\underset{w,b}{min}J(w,b)= \frac{1}{m} \Sigma_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + penalty\]

where

\[L_2\ penalty=\frac{\lambda}{2}\parallel w \parallel_2^2 = \frac{\lambda}{2}\Sigma_{i=1}^{n_x}w_i^2\]

\[L_1\ penalty = \lambda\Sigma_{i=1}^{n_x}|w_i|\]

For a neural network,

\[J(w^{[1]},b^{[1]},\dots,w^{[L]},b^{[L]})=\frac{1}{m}\Sigma_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2}\Sigma_{l=1}^{L} \parallel w^{[l]} \parallel^2_F\] where \(\parallel w^{[l]} \parallel^2_F = \Sigma_{i=1}^{n^{[l]}}\Sigma_{j=1}^{n^{[l-1]}} (w^{[l]}_{ij})^2\)
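A minimal sketch of attaching an L2 (Frobenius-norm) weight penalty to a layer with the keras R package; the layer sizes and the penalty strength 0.001 are illustrative values:

```r
library(keras)

# A hidden layer whose weight matrix is penalized with an L2 term added to the loss
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(784),
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%  # penalty proportional to sum(w^2)
  layer_dense(units = 10, activation = "softmax")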

Deal with Overfitting: Dropout

Deal with Overfitting

Exponentially Weighted Averages

Suppose we have the following 100 days’ temperature data:

\[ \theta_{1}=49F, \theta_{2}=53F, \dots, \theta_{99}=70F, \theta_{100}=69F\]

The weighted average is defined as:

\[V_t = \beta V_{t-1}+(1-\beta)\theta_t\]

And we have:

\[\begin{array}{c} V_{0}=0\\ V_{1}=\beta V_0 + (1-\beta)\theta_1\\ V_2=\beta V_1 + (1-\beta)\theta_2\\ \vdots \\ V_{100}= \beta V_{99} + (1-\beta)\theta_{100} \end{array}\]
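A base-R sketch of this recursion (the function name `ewa` and the default `beta = 0.95` are illustrative):

```r
# Exponentially weighted average of a series theta, starting from V_0 = 0
ewa <- function(theta, beta = 0.95) {
  V      <- numeric(length(theta))
  V_prev <- 0
  for (t in seq_along(theta)) {
    V[t]   <- beta * V_prev + (1 - beta) * theta[t]
    V_prev <- V[t]
  }
  V
}
```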

Corrected Exponentially Weighted Averages

\[V_t^{corrected} = \frac{V_t}{1-\beta^t}\]

For example,

\[\begin{array}{cc} \beta=0.95\\ V_{0}=0\\ V_{1}=0.05\theta_{1} & V_{1}^{corrected}=\frac{V_{1}}{1-0.95}=\theta_{1}\\ V_{2}=0.95V_{1}+0.05\theta_{2}=0.0475\theta_{1}+0.05\theta_{2} & V_{2}^{corrected}=\frac{V_{2}}{1-0.95^{2}}=0.4872\theta_{1}+0.5128\theta_{2} \end{array}\]
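A short base-R check of the worked example above:

```r
# Bias-corrected exponentially weighted average on the first two temperatures
theta <- c(49, 53)
beta  <- 0.95
V <- numeric(length(theta)); V_prev <- 0
for (t in seq_along(theta)) { V[t] <- beta * V_prev + (1 - beta) * theta[t]; V_prev <- V[t] }
V_corrected <- V / (1 - beta ^ seq_along(theta))
V_corrected   # 49.00 and 0.4872*49 + 0.5128*53 ≈ 51.05
```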

With vs. Without Correction

Momentum

On iteration t, compute \(dw\), \(db\) using samples in one mini-batch and update the parameters as follows:

\[V_{dw} = \beta V_{dw}+(1-\beta)dw\]

\[V_{db} = \beta V_{db}+(1-\beta)db\]

\[w=w-\alpha V_{dw};\ \ b=b-\alpha V_{db}\]
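A base-R sketch of one momentum step, assuming `V` is a list holding `V$dw` and `V$db` initialized to zero and `dw`, `db` are the current mini-batch gradients (the function name and default `alpha`, `beta` values are illustrative):

```r
# One momentum update: smooth the gradients, then take a step along the smoothed direction
momentum_step <- function(w, b, dw, db, V, alpha = 0.01, beta = 0.9) {
  V$dw <- beta * V$dw + (1 - beta) * dw
  V$db <- beta * V$db + (1 - beta) * db
  list(w = w - alpha * V$dw, b = b - alpha * V$db, V = V)
}
```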

RMSprop

It was proposed by Geoffrey Hinton in his Coursera course Neural Networks for Machine Learning.

On iteration t, compute \(dw\), \(db\) using the current mini-batch.

\[S_{dw}=\beta S_{dw} + (1-\beta)dw^2\]

\[S_{db}=\beta S_{db} + (1-\beta)db^2\]

\[w = w - \alpha \frac{dw}{\sqrt{S_{dw}}};\ b=b-\alpha \frac{db}{\sqrt{S_{db}}}\]
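A base-R sketch of one RMSprop step; a small `eps` is added in the denominator for numerical stability (as in the Adam update below), and the default hyperparameter values are illustrative:

```r
# One RMSprop update: scale each gradient by the root of its moving average of squares
rmsprop_step <- function(w, b, dw, db, S, alpha = 0.001, beta = 0.9, eps = 1e-8) {
  S$dw <- beta * S$dw + (1 - beta) * dw^2
  S$db <- beta * S$db + (1 - beta) * db^2
  list(w = w - alpha * dw / (sqrt(S$dw) + eps),
       b = b - alpha * db / (sqrt(S$db) + eps),
       S = S)
}
```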

Adaptive Moment Estimation (Adam)

On iteration t, compute \(dw\), \(db\) using the current mini-batch.

\[\begin{cases} \begin{array}{c} V_{dw}=\beta_{1}V_{dw}+(1-\beta_{1})dw\\ V_{db}=\beta_{1}V_{db}+(1-\beta_{1})db \end{array} & momentum\ update\ \beta_{1}\end{cases}\]

\[\begin{cases} \begin{array}{c} S_{dw}=\beta_{2}S_{dw}+(1-\beta_{2})dw^{2}\\ S_{db}=\beta_{2}S_{db}+(1-\beta_{2})db^{2} \end{array} & RMSprop\ update\ \beta_{2}\end{cases}\]

\[\begin{cases} \begin{array}{c} V_{dw}^{corrected}=\frac{V_{dw}}{1-\beta_{1}^{t}}\\ V_{db}^{corrected}=\frac{V_{db}}{1-\beta_{1}^{t}} \end{array}\end{cases};\ \ \begin{cases} \begin{array}{c} S_{dw}^{corrected}=\frac{S_{dw}}{1-\beta_{2}^{t}}\\ S_{db}^{corrected}=\frac{S_{db}}{1-\beta_{2}^{t}} \end{array}\end{cases}\]

\[w=w-\alpha \frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}} +\epsilon};\ b=b-\alpha\frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}\]
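A base-R sketch of one Adam step, using the commonly cited defaults \(\beta_1=0.9\), \(\beta_2=0.999\), \(\epsilon=10^{-8}\); `V` and `S` are lists holding the moving averages, initialized to zero:

```r
# One Adam update at iteration t, given mini-batch gradients dw and db
adam_step <- function(w, b, dw, db, V, S, t,
                      alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) {
  V$dw <- beta1 * V$dw + (1 - beta1) * dw            # momentum term
  V$db <- beta1 * V$db + (1 - beta1) * db
  S$dw <- beta2 * S$dw + (1 - beta2) * dw^2          # RMSprop term
  S$db <- beta2 * S$db + (1 - beta2) * db^2
  V_dw_c <- V$dw / (1 - beta1^t); V_db_c <- V$db / (1 - beta1^t)   # bias correction
  S_dw_c <- S$dw / (1 - beta2^t); S_db_c <- S$db / (1 - beta2^t)
  list(w = w - alpha * V_dw_c / (sqrt(S_dw_c) + eps),
       b = b - alpha * V_db_c / (sqrt(S_db_c) + eps),
       V = V, S = S)
}
```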

Recap of A Few Key Concepts

Use Keras R Package

  1. Data preprocessing (from images to a list of input features)
    • One image is a 28x28 matrix of grayscale values \(\rightarrow\) 784 columns of features
    • Scale the values to between 0 and 1 by dividing each value by 255
    • Make the response categorical (i.e. 10 columns, with the column for the corresponding digit set to 1 and the rest set to 0)
  2. Load the keras package and build a neural network with a few layers
    • Define a placeholder object for the NN structure
    • 1st layer: 256 nodes, fully connected, ‘relu’ activation, connected to the 784 input features
    • 2nd layer: 128 nodes, fully connected, ‘relu’ activation
    • 3rd layer: 64 nodes, fully connected, ‘relu’ activation
    • 4th layer: 10 nodes, fully connected, ‘softmax’ activation, connected to the 10 output columns
    • Add dropout to the first three layers to prevent overfitting
  3. Compile the NN model: define the loss function, optimizer, and metrics to track

  4. Fit the NN model on the training dataset: define the number of epochs, the mini-batch size, and the validation split used during training, where the metrics will be checked

  5. Predict on the testing dataset using the fitted NN model

R Scripts
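A hedged end-to-end sketch of steps 1–5 above with the keras R package; the layer sizes follow the list, while the dropout rates, optimizer, number of epochs, batch size, and validation split are illustrative choices:

```r
library(keras)

# 1. Data preprocessing: 28x28 grayscale images -> 784 features scaled to [0, 1]
mnist   <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), 784)) / 255
x_test  <- array_reshape(mnist$test$x,  c(nrow(mnist$test$x),  784)) / 255
y_train <- to_categorical(mnist$train$y, 10)     # one-hot encode the 10 digits
y_test  <- to_categorical(mnist$test$y, 10)

# 2. Define the network: 256 -> 128 -> 64 -> 10, with dropout on the first three layers
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

# 3. Compile: loss function, optimizer, and metrics to track
model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics   = c("accuracy")
)

# 4. Fit: epochs, mini-batch size, and validation split
history <- model %>% fit(
  x_train, y_train,
  epochs = 30, batch_size = 128,
  validation_split = 0.2
)

# 5. Predict on the test set
pred_probs  <- model %>% predict(x_test)             # (n_test x 10) class probabilities
pred_digits <- apply(pred_probs, 1, which.max) - 1   # most likely digit for each image
```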