Recurrent Models Overview

Recurrent Layers: SimpleRNN, LSTM, GRU

Jake Batsuuri
Computronium Blog

--

What’s SimpleRNN?

SimpleRNN is the recurrent layer object in Keras.

from keras.layers import SimpleRNN

Remember that each data point is a full sequence, for example an entire review, with its length measured in timesteps.

Like every other layer, SimpleRNN processes data in batches, so it takes as input a tensor of shape (batch_size, timesteps, input_features).
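
A quick way to make those shapes concrete (a minimal sketch; the batch size, sequence length and feature count below are arbitrary):

import numpy as np
from keras.models import Sequential
from keras.layers import SimpleRNN

# a toy batch: 16 sequences, 100 timesteps each, 32 features per timestep
dummy_batch = np.random.random((16, 100, 32))

model = Sequential()
model.add(SimpleRNN(8, input_shape=(None, 32)))  # by default, only the last state is returned
print(model.predict(dummy_batch).shape)  # (16, 8)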

All recurrent layers in Keras can be run in two different modes, returning either:

  • All of the successive outputs, aka the full sequence of states
  • Just the last state

Let’s look at writing a simple recurrent model:

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))
model.summary()

The model summary is:

________________________________________________________________
Layer (type)                 Output Shape              Param #
================================================================
embedding_22 (Embedding)     (None, None, 32)          320000
________________________________________________________________
simplernn_10 (SimpleRNN)     (None, 32)                2080
================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0

We can see that the output here is just the last state. If instead we ask for the full sequence of states to be returned:

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.summary()

Then the model looks like:

________________________________________________________________
Layer (type)                 Output Shape              Param #
================================================================
embedding_23 (Embedding)     (None, None, 32)          320000
________________________________________________________________
simplernn_11 (SimpleRNN)     (None, None, 32)          2080
================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0

How do you increase the representational power of a network?

Remember that a recurrent layer combines the current input with the previous output, preserving a relationship between timesteps when the data is sequential.

Remember also that stacking convolutional layers lets each successive layer learn ever more abstract concepts.

To stack recurrent layers in Keras, every intermediate layer has to return its full sequence of states:

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))

model.add(SimpleRNN(32))
model.summary()
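
As a sanity check on the parameter counts: each SimpleRNN(32) layer fed 32-dimensional vectors has 32 × 32 input weights, 32 × 32 recurrent weights and 32 biases, i.e. 2,080 parameters, matching the summaries above; this four-layer stack therefore adds 4 × 2,080 = 8,320 parameters on top of the embedding’s 320,000.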

Although spatial hierarchy is an essential property of deep convolutional models, deepening a recurrent model doesn’t give the same kind of theoretical boost. In practice, however, the added representational power does improve performance on certain tasks.

It depends on whether the data has higher-order abstract patterns to be learned. Often it doesn’t.

How do you train an RNN?
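
Going back to the review data, the training code below assumes the reviews have already been loaded and padded to a fixed length; a minimal sketch of that preprocessing (the values of max_features and maxlen here are assumptions, not taken from this article) might look like:

from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000  # assumed: keep only the 10,000 most frequent words
maxlen = 500          # assumed: cut reviews off after 500 tokens

(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)

With the data in place, the model and training loop are: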

from keras.layers import Dense

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

When we plot the results of this model training, we get:

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

We get an accuracy of about 85%, which isn’t all that great… Hence the need for something more powerful.

When we try to find out why, we realize that our recurrence decays when it comes to retaining information from previous inputs. The last state it will remember, but a bunch of steps ago? Fuggetaboutit.

This is due to the vanishing gradient problem: as a network gets deeper (or, unrolled in time, longer), the update signal all but disappears for the earlier, more distant elements.
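
A toy illustration of the effect (just a sketch, not tied to any particular Keras layer): pushing a gradient back through the same recurrent weight matrix over and over shrinks it roughly geometrically.

import numpy as np

np.random.seed(0)
W = 0.1 * np.random.randn(32, 32)   # a small recurrent weight matrix
grad = np.ones(32)                  # pretend this is the gradient at the last timestep

for t in range(1, 51):              # push it back through 50 timesteps
    grad = W.T @ grad
    if t % 10 == 0:
        print(t, np.linalg.norm(grad))
# the norm collapses toward zero, so the earliest timesteps receive almost no update signal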

What’s the LSTM Layer?

The Long Short-Term Memory (LSTM) layer was invented to solve the vanishing gradient problem. A simplistic analogy: imagine a conveyor belt running alongside your unrolled recurrent layer. On this conveyor belt you can store information from previous states, and at each timestep the layer can take a weighted portion of it, even information from long ago that would normally have all but vanished.

In pseudocode, it looks like this:

output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(c_t, Vo) + bo)

i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)
f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)
k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)

c_t+1 = i_t * k_t + c_t * f_t
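
The same thing as a runnable sketch, in plain NumPy for a single timestep (the weights are randomly initialized just to make the shapes concrete, and the choice of sigmoid for the gates and tanh elsewhere is an assumption; this illustrates the pseudocode, not Keras’ actual implementation):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

units, features = 4, 3
rng = np.random.RandomState(0)

input_t = rng.randn(features)   # input at the current timestep
state_t = rng.randn(units)      # output state from the previous timestep
c_t = rng.randn(units)          # the carry, i.e. the long-term information track

# one weight matrix per term, randomly initialized
Ui, Wi, bi = rng.randn(units, units), rng.randn(units, features), np.zeros(units)
Uf, Wf, bf = rng.randn(units, units), rng.randn(units, features), np.zeros(units)
Uk, Wk, bk = rng.randn(units, units), rng.randn(units, features), np.zeros(units)
Uo, Wo, Vo, bo = rng.randn(units, units), rng.randn(units, features), rng.randn(units, units), np.zeros(units)

i_t = sigmoid(Ui @ state_t + Wi @ input_t + bi)
f_t = sigmoid(Uf @ state_t + Wf @ input_t + bf)
k_t = np.tanh(Uk @ state_t + Wk @ input_t + bk)
c_next = i_t * k_t + c_t * f_t                                    # updated carry: c_t+1
output_t = np.tanh(Uo @ state_t + Wo @ input_t + Vo @ c_t + bo)   # output/state for this timestep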

If a simple RNN takes as input:

  • The current input
  • The state from the previous timestep

then the LSTM takes as input:

  • The current input
  • The state from the previous timestep
  • A long-term information carrier (the carry track)

Let’s see its performance:

from keras.layers import LSTM

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

This gives us an accuracy of 89%.

What’s the GRU Layer?

The GRU (Gated Recurrent Unit) and the LSTM both try to solve the vanishing gradient problem. Accuracy-wise they’re on par with each other, but the GRU is a bit cheaper to run. A simple GRU model (anticipating the temperature data introduced below) might look like:

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=20, validation_data=val_gen, validation_steps=val_steps)

GRUs are simpler and are sometimes the preferred choice among language-focused practitioners, while others argue the LSTM is more sophisticated and should in theory offer better accuracy. The only consensus is that they are comparable, and you should probably try both on your task, favoring the LSTM for language-related and more complex problems and the GRU for simpler tasks with less data.

Let’s work through a full example of a GRU on temporal data:

import os

data_dir = '/users/fchollet/Downloads/jena_climate'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')
f = open(fname)
data = f.read()
f.close()
lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]
print(header)
print(len(lines))

This is a dataset of temperature, pressure, humidity and so on, recorded every 10 minutes. Since it’s in a CSV, let’s parse it into a NumPy array, dropping the date-time column:

import numpy as np

float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values

For our prediction problem we define the following (see the sketch after this list for how these numbers fall out of the sampling rate):

  • delay = 144 — Targets will be 24 hours in the future; the data comes in 10-minute intervals, so there are 6 data points per hour
  • lookback = 1440 — Observations will go back 10 days
  • step = 6 — Observations will be subsampled to one data point per hour
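
Since the raw data arrives every 10 minutes, these constants fall straight out of the sampling rate; a small sketch to make the arithmetic explicit:

samples_per_hour = 60 // 10              # one observation every 10 minutes
delay = 24 * samples_per_hour            # targets 24 hours ahead -> 144 steps
lookback = 10 * 24 * samples_per_hour    # look back 10 days -> 1440 steps
step = samples_per_hour                  # subsample to one point per hour
print(delay, lookback, step)             # 144 1440 6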

It’s always a good idea to normalize the data. Note that the mean and standard deviation are computed only on the first 200,000 timesteps, the portion we’ll use for training, so no information leaks in from the validation or test ranges:

mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std

Our model consumes data in batches, so we’ll need a generator that yields batches of samples and targets from this array:

def generator(data, lookback, delay, min_index, max_index, shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
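
To sanity-check the generator, we can draw a single batch and look at the shapes (the exact feature count depends on the columns in the CSV):

probe_gen = generator(float_data, lookback=1440, delay=144, min_index=0, max_index=200000, shuffle=True, step=6, batch_size=128)
samples, targets = next(probe_gen)
print(samples.shape)  # (128, 240, float_data.shape[-1]); 240 = 1440 // 6
print(targets.shape)  # (128,)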

Just like in every machine learning problem, we’ll create three sets of data, each served by its own generator over a disjoint slice of the array:

  • Training
  • Validation
  • Testing

lookback = 1440
step = 6
delay = 144
batch_size = 128

train_gen = generator(float_data, lookback=lookback, delay=delay, min_index=0, max_index=200000, shuffle=True, step=step, batch_size=batch_size)
val_gen = generator(float_data, lookback=lookback, delay=delay, min_index=200001, max_index=300000, step=step, batch_size=batch_size)
test_gen = generator(float_data, lookback=lookback, delay=delay, min_index=300001, max_index=None, step=step, batch_size=batch_size)

# How many steps to draw from val_gen in order to see the entire validation set
val_steps = (300000 - 200001 - lookback)
# How many steps to draw from test_gen in order to see the entire test set
test_steps = (len(float_data) - 300001 - lookback)
print(val_steps) #98559
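
Note that val_steps as computed here counts individual samples rather than batches; since the generator yields batches of 128, dividing by batch_size, i.e. val_steps = (300000 - 200001 - lookback) // batch_size, would be enough to cover the validation range once and makes each validation pass much cheaper. The same applies to test_steps.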

Our model is the simple GRU from before:

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=20, validation_data=val_gen, validation_steps=val_steps)

This model isn’t regularized, so let’s add dropout and recurrent dropout:

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.GRU(32, dropout=0.2, recurrent_dropout=0.2, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=40, validation_data=val_gen, validation_steps=val_steps)

A model like this gets us a mean absolute error (MAE) of about 0.265, which once we denormalize comes out to roughly 2.35 degrees Celsius. That’s a pretty decent prediction. We can tune the model further by stacking more recurrent layers or by hyperparameter optimization.
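
The denormalization works because temperature is the second column of float_data (index 1, the same column the generator uses for its targets), so multiplying the normalized MAE by that column’s standard deviation recovers an error in degrees Celsius:

# std was computed over the first 200,000 rows earlier; column 1 is temperature
celsius_mae = 0.265 * std[1]
print(celsius_mae)  # roughly 2.35 degrees Celsius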

Other Articles

This post is part of a series of stories that explores the fundamentals of deep learning:

1. Linear Algebra Data Structures and Operations
Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions
Inverses, Linear Dependence, Eigen-decompositions, SVD
3. Probability Theory Ideas and Concepts
Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models
Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning
Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations
Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I
Generalization, MLE, Kullback-Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II
Regularization, Capacity, Parameters, Hyper-parameters
9. Principal Component Analysis Breakdown
Motivation, Derivation
10. Feed-forward Neural Networks
Layers, definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning
Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification
The most generic way to setup your deep learning models to categorize movie reviews
15. General Architectural Design Considerations for Neural Networks
Universal Approximation Theorem, Depth, Connections
16. Classifying Text Data into Multiple Classes
Single-Label Multi-class Classification
17. Convolutional Models Overview
Convolutions, Kernels, Downsampling & Properties
18. Working Understanding of Convolutional Models
Creating, Preprocessing, Data Augmentation, Feature Extraction, Fine Tuning
19. Convolutional Models for Sequential Data
And easing into Recurrent Neural Networks
20. Recurrent Models Overview
Recurrent Layers: SimpleRNN, LSTM, GRU

Up Next…

Coming up next is probably Recurrent Neural Networks and LSTM Layers. If you would like me to write another article explaining a topic in-depth, please leave a comment.

For the table of contents and more content click here.
