Recurrent Models Overview

Recurrent Layers: SimpleRNN, LSTM, GRU

Jake Batsuuri
Computronium Blog

--

What’s SimpleRNN?

SimpleRNN is the recurrent layer object in Keras.

from keras.layers import SimpleRNN

Remember that each data point is a full sequence, for example an entire review, with its length measured in timesteps.

Like every other layer, SimpleRNN processes data in batches, so it takes as input a tensor of shape (batch_size, timesteps, input_features).
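
A quick way to make those shapes concrete (a minimal sketch; the batch size, sequence length and feature count below are arbitrary):

import numpy as np
from keras.models import Sequential
from keras.layers import SimpleRNN

# a toy batch: 16 sequences, 100 timesteps each, 32 features per timestep
dummy_batch = np.random.random((16, 100, 32))

model = Sequential()
model.add(SimpleRNN(8, input_shape=(None, 32)))  # by default, only the last state is returned
print(model.predict(dummy_batch).shape)  # (16, 8)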

All recurrent layers in Keras can be run in two different modes, returning either:

  • All of the successive outputs, aka the full sequence of states
  • Just the last state

Let’s look at writing a simple recurrent model:

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))
model.summary()

The model summary is:

________________________________________________________________
Layer (type)                 Output Shape              Param #
================================================================
embedding_22 (Embedding)     (None, None, 32)          320000
________________________________________________________________
simplernn_10 (SimpleRNN)     (None, 32)                2080
================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0

We can see that the output here is just the last state. If instead we ask for the full sequence of states to be returned:

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.summary()

Then the model looks like:

________________________________________________________________
Layer (type)                 Output Shape              Param #
================================================================
embedding_23 (Embedding)     (None, None, 32)          320000
________________________________________________________________
simplernn_11 (SimpleRNN)     (None, None, 32)          2080
================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0

How do you increase the representational power of a network?

Remember that a recurrent layer combines the current input with the previous output, preserving a relationship between timesteps when the data is sequential.

Remember also that stacking convolutional layers lets each successive layer learn ever more abstract concepts.

To stack recurrent layers in Keras, every intermediate layer has to return its full sequence of states:

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))

model.add(SimpleRNN(32))
model.summary()
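
As a sanity check on the parameter counts: each SimpleRNN(32) layer fed 32-dimensional vectors has 32 × 32 input weights, 32 × 32 recurrent weights and 32 biases, i.e. 2,080 parameters, matching the summaries above; this four-layer stack therefore adds 4 × 2,080 = 8,320 parameters on top of the embedding’s 320,000.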

Although spatial hierarchy is an essential property of deep convolutional models, deepening a recurrent model doesn’t give the same kind of theoretical boost. In practice, however, the added representational power does improve performance on certain tasks.

It depends on whether the data has higher-order abstract patterns to be learned. Often it doesn’t.

How do you train an RNN?
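
Going back to the review data, the training code below assumes the reviews have already been loaded and padded to a fixed length; a minimal sketch of that preprocessing (the values of max_features and maxlen here are assumptions, not taken from this article) might look like:

from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000  # assumed: keep only the 10,000 most frequent words
maxlen = 500          # assumed: cut reviews off after 500 tokens

(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)

With the data in place, the model and training loop are: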

from keras.layers import Dense

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

When we plot the results of this model training, we get:

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

We get an accuracy of about 85%, which isn’t all that great… Hence the need for something more powerful.

When we try to find out why, we realize that our recurrence decays when it comes to retaining information from previous inputs. The last state it will remember, but a bunch of steps ago? Fuggetaboutit.

This is due to the vanishing gradient problem: as a network gets deeper (or, unrolled in time, longer), the update signal all but disappears for the earlier, more distant elements.
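
A toy illustration of the effect (just a sketch, not tied to any particular Keras layer): pushing a gradient back through the same recurrent weight matrix over and over shrinks it roughly geometrically.

import numpy as np

np.random.seed(0)
W = 0.1 * np.random.randn(32, 32)   # a small recurrent weight matrix
grad = np.ones(32)                  # pretend this is the gradient at the last timestep

for t in range(1, 51):              # push it back through 50 timesteps
    grad = W.T @ grad
    if t % 10 == 0:
        print(t, np.linalg.norm(grad))
# the norm collapses toward zero, so the earliest timesteps receive almost no update signal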

What’s the LSTM Layer?

The Long Short-Term Memory (LSTM) layer was invented to solve the vanishing gradient problem. A simplistic analogy: imagine a conveyor belt running alongside your unrolled recurrent layer. On this conveyor belt you can store information from previous states, and at each timestep the layer can take a weighted portion of it, even information from long ago that would normally have all but vanished.

In pseudocode, it looks like this:

output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(c_t, Vo) + bo)

i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)
f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)
k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)

c_t+1 = i_t * k_t + c_t * f_t
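
The same thing as a runnable sketch, in plain NumPy for a single timestep (the weights are randomly initialized just to make the shapes concrete, and the choice of sigmoid for the gates and tanh elsewhere is an assumption; this illustrates the pseudocode, not Keras’ actual implementation):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

units, features = 4, 3
rng = np.random.RandomState(0)

input_t = rng.randn(features)   # input at the current timestep
state_t = rng.randn(units)      # output state from the previous timestep
c_t = rng.randn(units)          # the carry, i.e. the long-term information track

# one weight matrix per term, randomly initialized
Ui, Wi, bi = rng.randn(units, units), rng.randn(units, features), np.zeros(units)
Uf, Wf, bf = rng.randn(units, units), rng.randn(units, features), np.zeros(units)
Uk, Wk, bk = rng.randn(units, units), rng.randn(units, features), np.zeros(units)
Uo, Wo, Vo, bo = rng.randn(units, units), rng.randn(units, features), rng.randn(units, units), np.zeros(units)

i_t = sigmoid(Ui @ state_t + Wi @ input_t + bi)
f_t = sigmoid(Uf @ state_t + Wf @ input_t + bf)
k_t = np.tanh(Uk @ state_t + Wk @ input_t + bk)
c_next = i_t * k_t + c_t * f_t                                    # updated carry: c_t+1
output_t = np.tanh(Uo @ state_t + Wo @ input_t + Vo @ c_t + bo)   # output/state for this timestep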

If a simple RNN takes as input:

  • The current input
  • The state from the previous timestep

then the LSTM takes as input:

  • The current input
  • The state from the previous timestep
  • A long-term information carrier (the carry track)

Let’s see its performance:

from keras.layers import LSTM

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

This gives us an accuracy of 89%.

What’s the GRU Layer?

The GRU (Gated Recurrent Unit) and the LSTM both try to solve the vanishing gradient problem. Accuracy-wise they’re on par with each other, but the GRU is a bit cheaper to run. A simple GRU model (anticipating the temperature data introduced below) might look like:

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=20, validation_data=val_gen, validation_steps=val_steps)

GRUs are simpler and are sometimes the preferred choice among language-focused practitioners, while others argue the LSTM is more sophisticated and should in theory offer better accuracy. The only consensus is that they are comparable, and you should probably try both on your task, favoring the LSTM for language-related and more complex problems and the GRU for simpler tasks with less data.

Let’s work through a full example of a GRU on temporal data:

import os

data_dir = '/users/fchollet/Downloads/jena_climate'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')
f = open(fname)
data = f.read()
f.close()
lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]
print(header)
print(len(lines))

This is a dataset of temperature, pressure, humidity and so on, recorded every 10 minutes. Since it’s in a CSV, let’s parse it into a NumPy array, dropping the date-time column:

import numpy as np

float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values

For our prediction problem we define the following (see the sketch after this list for how these numbers fall out of the sampling rate):

  • delay = 144 — Targets will be 24 hours in the future; the data comes in 10-minute intervals, so there are 6 data points per hour
  • lookback = 1440 — Observations will go back 10 days
  • step = 6 — Observations will be subsampled to one data point per hour
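
Since the raw data arrives every 10 minutes, these constants fall straight out of the sampling rate; a small sketch to make the arithmetic explicit:

samples_per_hour = 60 // 10              # one observation every 10 minutes
delay = 24 * samples_per_hour            # targets 24 hours ahead -> 144 steps
lookback = 10 * 24 * samples_per_hour    # look back 10 days -> 1440 steps
step = samples_per_hour                  # subsample to one point per hour
print(delay, lookback, step)             # 144 1440 6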

It’s always a good idea to normalize the data. Note that the mean and standard deviation are computed only on the first 200,000 timesteps, the portion we’ll use for training, so no information leaks in from the validation or test ranges:

mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std

Our model consumes data in batches, so we’ll need a generator that yields batches of samples and targets from this array:

def generator(data, lookback, delay, min_index, max_index, shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
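
To sanity-check the generator, we can draw a single batch and look at the shapes (the exact feature count depends on the columns in the CSV):

probe_gen = generator(float_data, lookback=1440, delay=144, min_index=0, max_index=200000, shuffle=True, step=6, batch_size=128)
samples, targets = next(probe_gen)
print(samples.shape)  # (128, 240, float_data.shape[-1]); 240 = 1440 // 6
print(targets.shape)  # (128,)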

Just like in every machine learning problem, we’ll create three sets of data, each served by its own generator over a disjoint slice of the array:

  • Training
  • Validation
  • Testing

lookback = 1440
step = 6
delay = 144
batch_size = 128

train_gen = generator(float_data, lookback=lookback, delay=delay, min_index=0, max_index=200000, shuffle=True, step=step, batch_size=batch_size)
val_gen = generator(float_data, lookback=lookback, delay=delay, min_index=200001, max_index=300000, step=step, batch_size=batch_size)
test_gen = generator(float_data, lookback=lookback, delay=delay, min_index=300001, max_index=None, step=step, batch_size=batch_size)

# How many steps to draw from val_gen in order to see the entire validation set
val_steps = (300000 - 200001 - lookback)
# How many steps to draw from test_gen in order to see the entire test set
test_steps = (len(float_data) - 300001 - lookback)
print(val_steps) #98559
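
Note that val_steps as computed here counts individual samples rather than batches; since the generator yields batches of 128, dividing by batch_size, i.e. val_steps = (300000 - 200001 - lookback) // batch_size, would be enough to cover the validation range once and makes each validation pass much cheaper. The same applies to test_steps.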

Our model is the simple GRU from before:

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=20, validation_data=val_gen, validation_steps=val_steps)

This model isn’t regularized, so let’s add dropout and recurrent dropout:

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.GRU(32, dropout=0.2, recurrent_dropout=0.2, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=40, validation_data=val_gen, validation_steps=val_steps)

A model like this gets us a mean absolute error (MAE) of about 0.265, which once we denormalize comes out to roughly 2.35 degrees Celsius. That’s a pretty decent prediction. We can tune the model further by stacking more recurrent layers or by hyperparameter optimization.
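
The denormalization works because temperature is the second column of float_data (index 1, the same column the generator uses for its targets), so multiplying the normalized MAE by that column’s standard deviation recovers an error in degrees Celsius:

# std was computed over the first 200,000 rows earlier; column 1 is temperature
celsius_mae = 0.265 * std[1]
print(celsius_mae)  # roughly 2.35 degrees Celsius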

Other Articles

This post is part of a series of stories that explores the fundamentals of deep learning:

1. Linear Algebra Data Structures and Operations
Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions
Inverses, Linear Dependence, Eigen-decompositions, SVD
3. Probability Theory Ideas and Concepts
Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models
Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning
Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations
Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I
Generalization, MLE, Kullback-Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II
Regularization, Capacity, Parameters, Hyper-parameters
9. Principal Component Analysis Breakdown
Motivation, Derivation
10. Feed-forward Neural Networks
Layers, definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning
Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification
The most generic way to setup your deep learning models to categorize movie reviews
15. General Architectural Design Considerations for Neural Networks
Universal Approximation Theorem, Depth, Connections
16. Classifying Text Data into Multiple Classes
Single-Label Multi-class Classification
17. Convolutional Models Overview
Convolutions, Kernels, Downsampling & Properties
18. Working Understanding of Convolutional Models
Creating, Preprocessing, Data Augmentation, Feature Extraction, Fine Tuning
19. Convolutional Models for Sequential Data
And easing into Recurrent Neural Networks
20. Recurrent Models Overview
Recurrent Layers: SimpleRNN, LSTM, GRU

Up Next…

Coming up next is probably Recurrent Neural Networks and LSTM Layers. If you would like me to write another article explaining a topic in-depth, please leave a comment.

For the table of contents and more content click here.
