Recurrent Neural Networks

Hui Lin @Google

Ming Li @Amazon

Types of Neural Network

Why sequence?

Speech Recognition		\(\longrightarrow\)	Get your facts first, then you can distort them as you please.
Music generation	\(\emptyset\)	\(\longrightarrow\)
Sentiment classification	Great movie ? Are you kidding me ! Not worth the money.	\(\longrightarrow\)
DNA sequence analysis	ACGGGGCCTACTGTCAACTG	\(\longrightarrow\)	AC GGGGCCTACTG TCAACTG
Machine translation	网红脸	\(\longrightarrow\)	Internet celebrity face
Video activity recognition		\(\longrightarrow\)	Running
Named entity recognition	Use Netlify and Hugo.	\(\longrightarrow\)	Use Netlify and Hugo.

RNN types

rectangle: a vector
- green: input vector
- blue: output vector
- red: intermediate state vector
arrow: matrix multiplications

Notation

x: Use(\(x^{<1>}\)) Netlify(\(x^{<2>}\)) and(\(x^{<3>}\)) Hugo(\(x^{<4>}\)) .(\(x^{<5>}\))
y: 0 (\(y^{<1>}\)) 1(\(y^{<2>}\)) 0(\(y^{<3>}\)) 1(\(y^{<4>}\)) 0(\(y^{<5>}\))
\(x^{(i)<t>}\), \(T_x^{(i)}\) (\(i^{th}\) sample)
\(y^{(i)<t>}\), \(T_y^{(i)}\) (\(i^{th}\) sample)

Representing words

One Hot Encoding (OHE)

\(\left[\begin{array}{c} a[1]\\ aaron[2]\\ \vdots\\ and[360]\\ \vdots\\ Hugo[4075]\\ \vdots\\ Netlify[5210]\\ \vdots\\ use[8320]\\ \vdots\\ Zulu[10000] \end{array}\right]\Longrightarrow use=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right], Netlify=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ \vdots\\ 0 \end{array}\right], and=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0 \end{array}\right], Hugo=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0 \end{array}\right]\)
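A minimal sketch of the one-hot encoding above, assuming a toy five-word vocabulary in place of the 10,000-word one on the slide. The helper name `one_hot` and the 0-based indices are illustrative assumptions, not part of the original material.

```python
import numpy as np

# Toy vocabulary (an assumption; the slide uses |V| = 10,000).
vocab = ["a", "and", "hugo", "netlify", "use"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(word_to_index), dtype=int)
    vec[word_to_index[word.lower()]] = 1
    return vec

sentence = ["Use", "Netlify", "and", "Hugo"]
# One row per time step: shape (T_x, |V|).
X = np.stack([one_hot(w, word_to_index) for w in sentence])
```

Each row of `X` contains exactly one 1, so any two distinct words are orthogonal to each other under this representation.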

What is RNN?

x: Use(\(x^{<1>}\)) Netlify(\(x^{<2>}\)) and(\(x^{<3>}\)) Hugo(\(x^{<4>}\)) .(\(x^{<5>}\))
y: 0 (\(y^{<1>}\)) 1(\(y^{<2>}\)) 0(\(y^{<3>}\)) 1(\(y^{<4>}\)) 0(\(y^{<5>}\))
\(x^{(i)<t>}\), \(T_x^{(i)}\) (\(i^{th}\) sample)
\(y^{(i)<t>}\), \(T_y^{(i)}\) (\(i^{th}\) sample)

Forward Propagation

\(a^{<0>} = \mathbf{0}\); \(a^{<1>} = g(W_{aa}a^{<0>} + W_{ax}x^{<1>} + b_a)\)

\(\hat{y}^{<1>} = g'(W_{ya}a^{<1>} + b_y)\)

\(a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)\)

\(\hat{y}^{<t>} = g'(W_{ya}a^{<t>} + b_y)\)
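The recurrence above can be sketched in NumPy as follows. The slides do not name the activations, so this sketch assumes g = tanh and g' = sigmoid; the dimensions and random weights are also illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x_seq, W_aa, W_ax, W_ya, b_a, b_y):
    """Apply the recurrence to a sequence; x_seq has shape (T_x, n_x)."""
    a = np.zeros(W_aa.shape[0])                    # a^{<0>} = 0
    states, outputs = [], []
    for x_t in x_seq:
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)   # g = tanh (assumed)
        y_hat = sigmoid(W_ya @ a + b_y)            # g' = sigmoid (assumed)
        states.append(a)
        outputs.append(y_hat)
    return np.array(states), np.array(outputs)

# Toy sizes and random parameters (assumptions for illustration only).
rng = np.random.default_rng(0)
n_x, n_a, n_y = 5, 3, 1
W_aa = rng.normal(scale=0.1, size=(n_a, n_a))
W_ax = rng.normal(scale=0.1, size=(n_a, n_x))
W_ya = rng.normal(scale=0.1, size=(n_y, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(n_y)

x_seq = np.eye(n_x)[[4, 3, 1, 2]]                  # four one-hot inputs
states, outputs = rnn_forward(x_seq, W_aa, W_ax, W_ya, b_a, b_y)
```

The same weight matrices are reused at every time step; only the state `a` changes as the sequence is read.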

Forward Propagation

\(L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log(\hat{y}^{<t>}) - (1-y^{<t>})\log(1-\hat{y}^{<t>})\)

\(L(\hat{y}, y) = \sum_{t=1}^{T_y}L^{<t>}(\hat{y}^{<t>}, y^{<t>})\)
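A short sketch of this loss: the per-step binary cross-entropy summed over the sequence. The labels are the ones from the notation slide; the predicted probabilities and the clipping constant `eps` are hypothetical values added for illustration.

```python
import numpy as np

def sequence_loss(y_hat, y, eps=1e-12):
    """Sum of per-time-step binary cross-entropy losses."""
    # Clip predictions away from 0 and 1 so the logarithms stay finite.
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    per_step = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    return per_step.sum()

y = [0, 1, 0, 1, 0]                  # labels for "Use Netlify and Hugo ."
y_hat = [0.1, 0.8, 0.2, 0.9, 0.1]    # hypothetical model outputs
loss = sequence_loss(y_hat, y)
```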

Backpropagation through time

Deep RNN

Vanishing gradients with RNNs

The cat, which ate already, was full.
The cats, which ate already, were full.

LSTM

Word representation

Vocabulary = [a, aaron, …, zulu], |V| = 10,000
One hot representation

\[\begin{array}{cccccc} Man & Woman & King & Queen & Apple & Pumpkin\\ (5391) & (9853) & (4914) & (7157) & (456) & (6332)\\ \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ 0\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ \vdots\\ 1\\ \vdots\\ 0\\ 0\\ 0\\ 0\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right] \end{array}\]

Word representation

My favourite Christmas dessert is pumpkin ____
My favourite Christmas dessert is apple ____

\[\begin{array}{cccccc} Man & Woman & King & Queen & Apple & Pumpkin\\ (5391) & (9853) & (4914) & (7157) & (456) & (6332)\\ \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ 0\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ \vdots\\ 1\\ \vdots\\ 0\\ 0\\ 0\\ 0\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right] \end{array}\]

Featurized representation: word embedding

Analogies

man \(\longrightarrow\) woman \(\approx\) king \(\longrightarrow\) ?

Analogies

\(e_{man} - e_{woman} = [-2, -0.01, 0.03, 0]^{T} \approx [-2, 0, 0, 0]^{T}\)
\(e_{king} - e_{queen} = [-1.92, -0.02, 0.01, -0.01]^{T} \approx [-2, 0, 0, 0]^{T}\)

Analogies

\(e_{man} - e_{woman} \approx e_{king} - e_{?}\)

\(\rightarrow \underset{w}{\operatorname{argmax}} \{sim (e_{w}, e_{king} - e_{man} + e_{woman})\}\)

Cosine similarity

\(sim(e_w, e_{king}-e_{man}+e_{woman})\) = ?

Cosine similarity: \(sim(a,b) = \frac{a^{T}b}{ ||a||_{2} ||b||_{2}}\)
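A minimal sketch of solving the analogy with cosine similarity. The four-dimensional embeddings below are hypothetical values chosen only so that their differences match the ones shown on the Analogies slide; excluding the three query words from the candidates is a common convention and an assumption here.

```python
import numpy as np

def cosine_similarity(a, b):
    """sim(a, b) = a^T b / (||a||_2 ||b||_2)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings (not taken from a trained model).
embeddings = {
    "man":     np.array([-1.00,  0.01,  0.03, 0.09]),
    "woman":   np.array([ 1.00,  0.02,  0.00, 0.09]),
    "king":    np.array([-0.95,  0.93,  0.70, 0.02]),
    "queen":   np.array([ 0.97,  0.95,  0.69, 0.03]),
    "apple":   np.array([ 0.00, -0.01,  0.03, 0.95]),
    "pumpkin": np.array([ 0.01,  0.00, -0.02, 0.97]),
}

# Vector the answer should be close to: e_king - e_man + e_woman.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# Search all words except the three used to form the query.
candidates = [w for w in embeddings if w not in {"man", "woman", "king"}]
best = max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))
```

With these toy vectors, `best` is "queen", because its embedding points in almost the same direction as `target`.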

Embedding matrix

In practice, we look up a word's embedding by its index instead of multiplying the embedding matrix by the word's one-hot vector.
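A small sketch showing why the lookup is equivalent to the multiplication: multiplying an embedding matrix by a one-hot vector simply selects one column. The matrix name `E`, its shape, and its random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4                   # toy sizes (assumed)
E = rng.normal(size=(embed_dim, vocab_size))    # one column per word

j = 7                                           # index of some word
o_j = np.zeros(vocab_size)
o_j[j] = 1.0                                    # its one-hot vector

via_matmul = E @ o_j      # touches every entry of E
via_lookup = E[:, j]      # reads a single column
```

Both give the same vector, but the lookup avoids multiplying by the zeros in `o_j`, which matters when the vocabulary has thousands of words.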

Data Preprocessing

Why you should avoid removing STOPWORDS

Data Preprocessing

Tokenize and Pad
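A minimal pure-Python sketch of tokenizing and padding, assuming whitespace tokenization, id 0 reserved for padding, and padding or truncating at the end of each sequence. The function names are hypothetical, and a deep learning library's own tokenizer may behave differently.

```python
def build_vocab(texts):
    """Assign each lowercase token an integer id, starting at 1 (0 = padding)."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def tokenize(text, vocab):
    """Convert a text into a list of token ids, skipping unknown tokens."""
    return [vocab[t] for t in text.lower().split() if t in vocab]

def pad(seq, max_len, pad_id=0):
    """Truncate or right-pad a sequence so it has exactly max_len ids."""
    return seq[:max_len] + [pad_id] * max(0, max_len - len(seq))

texts = ["Great movie", "Not worth the money"]
vocab = build_vocab(texts)
encoded = [tokenize(t, vocab) for t in texts]
padded = [pad(seq, max_len=4) for seq in encoded]
```

After padding, every sequence has the same length, so a batch can be stored as a single rectangular array.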

Some Papers

Cho et al., 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
Chung et al., 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
Hochreiter & Schmidhuber, 1997. Long Short-Term Memory
