Convolutional Models for Sequential Data

Easing Into Recurrent Neural Networks

Jake Batsuuri
Computronium Blog

--

Remember these two useful properties of Convolutional Models.

Translation Invariance

A convolutional model can learn a pattern in one place, say the lower right corner, and from then on detect it anywhere in the image.

Spatial Hierarchy

A convolutional model can learn patterns in a hierarchical fashion, much like we do. The first layers will learn relatively simple patterns, like horizontalness and verticalness etc. Then the second layers will put these together to learn such things as corners. And so on with each new layer.

So if we take the translation invariance property and apply it to sequential data, such as text or time series, we can get a really nice pattern recognition program.

An image is a plane, a two-dimensional object, whereas sequential data is a line, a one-dimensional object.

However, even simple objects like lines can have regularity and patterns. If instead of 2D convolutional layers we use 1D convolutional layers, we can get a pretty neat text classifier or time series forecaster.

A 1D convolutional layer works mainly by looking at short subsections of the full sequence, learning what they look like, and then finding other, similar subsections.

For example, in the data above, the convolution layer would find 6 different subsections all very similar.
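To make this concrete, here is a minimal sketch with a made-up toy signal (not the data from the post): the same 3-step motif is planted at three positions, and a kernel equal to that motif is slid along the signal with NumPy's `np.correlate`. In a real layer the kernel weights would be learned rather than hand-picked, so treat the names and values here as assumptions for illustration.

```python
import numpy as np

# Hypothetical toy signal: all zeros, with one motif planted at three positions.
motif = np.array([1.0, 3.0, 1.0])
signal = np.zeros(20)
for start in (2, 9, 15):
    signal[start:start + 3] = motif

# Hand-picked kernel equal to the motif (a trained layer would learn its weights).
# mode="valid" slides the kernel only over positions where it fits entirely.
response = np.correlate(signal, motif, mode="valid")

# The response is largest wherever the motif occurs, regardless of position.
peak_positions = np.flatnonzero(response == response.max())
print(peak_positions.tolist())  # [2, 9, 15]
```

The kernel never needed to know where the motif was; the same weights respond to it at every position.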

Another cool use for 1D convolutions is the morphological analysis of words. Imagine running a convolutional layer over a text. The model could learn to spot instances of a word and effectively become a word counter. Furthermore, it could find related words such as “happiness” and “unhappiness”. Given that “un” is a common prefix that negates the root word, it might also find other “un” words, such as “unworthy” and “unfriendly”.

How do 1D convolutions work?

For an image, we represented the picture as a matrix whose elements hold the channel values of each pixel: a single channel for grayscale, or 3 channels for RGB.

For simplicity, let’s take grayscale only. An image might have a height and width of 1000 pixels, and our model would only look at a tiny window of 5 by 5 pixels at a time.

For a 1D convolution, the data is just an array, instead of a matrix (for grayscale) or a tensor (for RGB). Its length might similarly be 1000, and we take a subarray of 5 elements at a time.

We take the dot product of those 5 values with the kernel weights and get a single value, which goes into our output. Then we slide the window along by one step and repeat.
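The sliding-window dot product can be sketched in a few lines of NumPy. This is a bare-bones illustration rather than how Keras implements it; the function name `conv1d_valid` is made up, the weights are random instead of learned, and there is a single input channel and a single filter.

```python
import numpy as np

def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide `kernel` over `sequence`, taking one dot product per position."""
    window = len(kernel)
    out_len = len(sequence) - window + 1   # only positions where the window fits
    output = np.empty(out_len)
    for i in range(out_len):
        output[i] = np.dot(sequence[i:i + window], kernel) + bias
    return output

rng = np.random.default_rng(0)
sequence = rng.random(1000)   # a length-1000 array, as in the text
kernel = rng.random(5)        # a window of 5 weights (random here, learned in practice)

output = conv1d_valid(sequence, kernel)
print(output.shape)  # (996,)
```

A length-1000 input with a window of 5 gives 1000 - 5 + 1 = 996 output values, one per window position.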

What does a 1D convolutional model code look like?

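from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

# Inferred from the summary below: 10,000-word vocabulary, reviews padded to 500 tokens.
max_features = 10000
max_len = 500

model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))  # word index -> 128-d vector
model.add(layers.Conv1D(32, 7, activation='relu'))  # 32 filters, window of 7 timesteps
model.add(layers.MaxPooling1D(5))                   # keep the max of every 5 timesteps
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())              # max over all timesteps, per filter
# Sigmoid added so the output is a probability, which binary_crossentropy expects by default.
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()

# Recent Keras versions use learning_rate; older ones spell it lr.
model.compile(optimizer=RMSprop(learning_rate=1e-4), loss='binary_crossentropy', metrics=['acc'])
# x_train and y_train are assumed to be the padded IMDB sequences and their 0/1 labels.
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)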

The model looks just like any other deep learning model: it follows the same pattern of stacking layers on layers, downsampling along the way, then collapsing the result and running a classifier or regressor at the end.

Because 2D kernels are square, the number of weights grows with the square of the kernel size, which limited how big we could make our kernels. In 1D layers the number of weights grows only linearly with the kernel size, so we can make the kernel a bit bigger, such as the 7 used here, without worrying much about compute time.

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 500, 128)          1280000
_________________________________________________________________
conv1d (Conv1D)              (None, 494, 32)           28704
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 98, 32)            0
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 92, 32)            7200
_________________________________________________________________
global_max_pooling1d (Global (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 1)                 33
=================================================================
Total params: 1,315,937
Trainable params: 1,315,937
Non-trainable params: 0
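To see where those numbers come from, here is a small sanity check in plain Python. The helper names are made up for illustration, and the vocabulary size of 10,000 is inferred from the embedding's parameter count (1,280,000 / 128) rather than stated in the post. It assumes the Keras defaults of no padding and a stride of 1 for the convolutions.

```python
def conv1d_out_len(length, kernel_size):
    # Without padding and with stride 1, the window fits in length - kernel_size + 1 places.
    return length - kernel_size + 1

def conv1d_params(in_features, filters, kernel_size):
    # One weight per (window position, input feature, filter), plus one bias per filter.
    return kernel_size * in_features * filters + filters

max_features, embed_dim, max_len = 10000, 128, 500   # vocabulary size is inferred

embedding_params = max_features * embed_dim       # 1,280,000
len1 = conv1d_out_len(max_len, 7)                 # 494
conv1_params = conv1d_params(embed_dim, 32, 7)    # 28,704
len2 = len1 // 5                                  # 98 (max pooling with window 5)
len3 = conv1d_out_len(len2, 7)                    # 92
conv2_params = conv1d_params(32, 32, 7)           # 7,200
dense_params = 32 * 1 + 1                         # 33 (32 weights and 1 bias)

total = embedding_params + conv1_params + conv2_params + dense_params
print(len1, len2, len3)  # 494 98 92
print(total)             # 1315937
```

Note that the pooling layers contribute no parameters at all, and that the embedding dominates the total.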

The main hero of our story is Conv1D, a layer object. It’s compatible with other layers and downsamplers. It takes as input 3D tensors with shape (samples, time, features) and returns 3D tensors with the same layout, though not the same sizes: without padding the time axis gets shorter, and the features axis becomes the number of filters. In the summary above, (None, 500, 128) goes in and (None, 494, 32) comes out.

Remember that we use downsamplers for 2 main purposes: to reduce our model to a manageable number of coefficients, and to help with spatial hierarchy. For the latter, a layer like max pooling doesn’t learn anything itself; it keeps only the strongest activation in each window and disregards the other information.

For Conv1D, we use MaxPooling1D, and instead of a Flatten layer at the end we can also use GlobalMaxPooling1D, which takes the maximum over the whole time axis and leaves one value per filter, a flat vector for our dense classifier.
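Both pooling operations are simple enough to sketch in NumPy. The function names below are made up, the feature maps are random stand-ins, and the batch (samples) axis is left out to keep things short; the shapes match the ones in the model summary.

```python
import numpy as np

def max_pool_1d(x, pool_size):
    """x has shape (time, features). Uses non-overlapping windows; leftover steps are dropped."""
    steps = x.shape[0] // pool_size
    trimmed = x[:steps * pool_size]
    # Group the time axis into windows, then keep the max of each window per feature.
    return trimmed.reshape(steps, pool_size, x.shape[1]).max(axis=1)

def global_max_pool_1d(x):
    """Keep the single largest value over the whole time axis, per feature."""
    return x.max(axis=0)

rng = np.random.default_rng(0)
feature_map = rng.random((494, 32))   # stand-in for the first Conv1D output

pooled = max_pool_1d(feature_map, 5)
print(pooled.shape)  # (98, 32)

final_map = rng.random((92, 32))      # stand-in for the second Conv1D output
flat = global_max_pool_1d(final_map)
print(flat.shape)    # (32,)
```

Neither function has any weights, which is why both pooling layers show 0 parameters in the summary.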

A model like this essentially takes the 50k IMDB movie reviews and classifies each one into 1 of 2 groups, positive or negative.

The second way to process sequential data is the Recurrent Neural Network, or RNN.

Why the need for an RNN?

With feedforward neural networks (FNNs), we’d have to ingest the entire sequence at once in order to process it. For long sequences this quickly becomes impractical, so we use RNNs. Plus, we know that useful built-in properties, like translation invariance in convolutional models, give us better results. We’ll learn more about RNN properties in a bit.

In rough terms, an RNN iterates through the sequence elements while maintaining a state containing information about what it has seen so far.

This is pretty vague. So let’s specify.

If our dataset has 50,000 data points, the RNN treats each data point as a complete sequence. For each new sequence, the RNN resets its state.

In a forward pass, whereas an FNN would go through each node once, an RNN loops over the sequence elements internally. Why though?

People and animals basically add information about their surroundings incrementally to their mental model, while maintaining an internal representation of those surroundings. The whole point is to predict what happens next.

An RNN does this at a simpler level: it maintains an internal state that summarizes the data it has been shown so far.

How do RNNs generate their output?

The RNN loops at each timestep, and at each timestep, it takes in 2 inputs:

  • The input feature
  • The current internal state

And it produces one output for that timestep.

One thing to mention is, when the RNN first starts, at time step t=0, there is no current internal state, so the vector representing the state is just all zeroes.

Furthermore, the output it just generated becomes the state for the next iteration. In pseudocode it would look something like this:

state_t = 0
for input_t in input_sequence:
    output_t = f(input_t, state_t)
    state_t = output_t

The function f, can even be defined as such:

f(input_t, state_t) = activation(dot(W, input_t) + dot(U, state_t) + b)

The output is calculated by taking the dot product of the input tensor with weight matrix W and of the state tensor with weight matrix U, adding the bias b, and passing the sum through an activation function. Let’s see a more realistic implementation:

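import numpy as np

timesteps = 100        # number of timesteps in the input sequence
input_features = 32    # size of each input vector
output_features = 64   # size of each output (and state) vector

inputs = np.random.random((timesteps, input_features))  # random data, for illustration
state_t = np.zeros((output_features,))                  # initial state: all zeros

# Random weights, for illustration; a trained RNN would learn these.
W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

successive_outputs = []
for input_t in inputs:
    # Combine the current input with the previous state to get the current output.
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    state_t = output_t  # the output becomes the state for the next timestep

# Stack rather than concatenate, so the result has shape (timesteps, output_features).
final_output_sequence = np.stack(successive_outputs, axis=0)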

Different kinds of RNN are characterized by their step function. For the simple RNN above, it is:

output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)

Remember that the RNN only keeps its state within each data point, meaning it has only short-term memory. In the next article we will address the need for long-term memory in an RNN.
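Here is a small sketch of what “only within each data point” means in practice. It wraps the same tanh step function in a hypothetical helper, `simple_rnn_forward`, that starts from a zero state on every call. The sizes and random weights are made up for illustration.

```python
import numpy as np

def simple_rnn_forward(sequence, W, U, b):
    """Run the step function over one sequence, starting from a zero state."""
    state_t = np.zeros(U.shape[0])   # the state is reset for every new sequence
    outputs = []
    for input_t in sequence:
        output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
        outputs.append(output_t)
        state_t = output_t
    return np.stack(outputs)         # shape (timesteps, output_features)

rng = np.random.default_rng(0)
input_features, output_features = 4, 3
W = rng.standard_normal((output_features, input_features))
U = rng.standard_normal((output_features, output_features))
b = np.zeros(output_features)

seq_a = rng.standard_normal((5, input_features))   # one data point, 5 timesteps
seq_b = rng.standard_normal((7, input_features))   # another data point, 7 timesteps

out_a_alone = simple_rnn_forward(seq_a, W, U, b)
simple_rnn_forward(seq_b, W, U, b)                  # processing seq_b in between...
out_a_after_b = simple_rnn_forward(seq_a, W, U, b)

# ...leaves no trace: seq_a gives exactly the same outputs either way.
print(np.array_equal(out_a_alone, out_a_after_b))  # True
print(out_a_alone.shape)                           # (5, 3)
```

Nothing the network saw in `seq_b` carries over to `seq_a`; whatever memory there is lives inside a single sequence.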

Other Articles

This post is part of a series of stories that explores the fundamentals of deep learning:

1. Linear Algebra Data Structures and Operations
Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions
Inverses, Linear Dependence, Eigen-decompositions, SVD
3. Probability Theory Ideas and Concepts
Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models
Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning
Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations
Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I
Generalization, MLE, Kullback-Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II
Regularization, Capacity, Parameters, Hyper-parameters
9. Principal Component Analysis Breakdown
Motivation, Derivation
10. Feed-forward Neural Networks
Layers, definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning
Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification
The most generic way to setup your deep learning models to categorize movie reviews
15. General Architectural Design Considerations for Neural Networks
Universal Approximation Theorem, Depth, Connections
16. Classifying Text Data into Multiple Classes
Single-Label Multi-class Classification
17. Convolutional Models Overview
Convolutions, Kernels, Downsampling & Properties
18. Working Understanding of Convolutional Models
Creating, Preprocessing, Data Augmentation, Feature Extraction, Fine Tuning
19. Convolutional Models for Sequential Data
And easing into Recurrent Neural Networks

Up Next…

Coming up next is probably Recurrent Neural Networks and LSTM Layers. If you would like me to write another article explaining a topic in-depth, please leave a comment.
