Language Processing with Recurrent Models

Bidirectional RNNs, Encoding, Word Embedding and Tips

Jake Batsuuri
Computronium Blog

--

What's a Bidirectional RNN?

A bidirectional RNN is an RNN variant that can sometimes increase performance. It is especially useful for natural language processing tasks.

The BD-RNN uses two regular RNNs: one processes the sequential data in its original order, the other processes it in reverse, and their representations are then merged.

This method doesn’t work very well for timeseries data, because chronological order carries real meaning there. For example, more recent events should have more weight when predicting what happens next, so reversing the sequence throws that structure away.

In language-related problems, on the other hand, it’s clear that “cat in the hat” and “tah eht ni tac” carry the same underlying meaning: “tah” and “hat” both refer to the same object. Hopefully it’s easy to see that, in the same way, reversing an image of a cat, or flipping it upside down, still gives you an image of a cat.

It’s also kind of funny when we talk about palindromes, like “Bob” or “racecar”.

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras import layers
from keras.models import Sequential

max_features = 10000
maxlen = 500

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Reverse each review so the model sees the text backwards
x_train = [x[::-1] for x in x_train]
x_test = [x[::-1] for x in x_test]

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(layers.Embedding(max_features, 128))
model.add(layers.LSTM(32))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

When we train the same model on reversed text, we get very similar accuracy. That’s great to see, but what’s even cooler is that the model accomplishes the task by learning very different representations than the one trained on forward text.

Thankfully there’s a dedicated layer, Bidirectional, that creates a second recurrent layer instance, feeds it the reversed data, and merges the two outputs for us, so we don’t have to write that code ourselves.

model = Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

And with some regularization this model can approach 90% accuracy, which is awesome.
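
As a rough sketch of what that regularization could look like, here is the same bidirectional model with dropout and recurrent dropout added to the LSTM; the dropout rates are illustrative guesses, not the exact settings behind that number:

# Same bidirectional model, with dropout added for regularization.
# The dropout rates here are illustrative, not tuned values.
model = Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(
    layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2)))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)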

Let’s do a simple text processing example:

Our models can only work with numeric data, so we first need to convert our text data into vectors and tensors. We can do this at 3 different levels:

  • Character level
  • Word level
  • N-gram level

Whichever level we pick, we assign a unique numeric vector to each unit at that level, so we can encode text into vectors and decode vectors back into text. Each unique unit is called a token, and the process of breaking text into these units is called tokenization.

For example, a large English corpus uses the 26 letters of the alphabet. You can count the frequency of each character, and each of the 26 characters is a token.

At the word level, the same corpus may have thousands of distinct words. Common words like “the” and “in” occur many times, but every occurrence is encoded as the same vector.

At the n-gram level, with n=2, we create a two-word phrase from every consecutive pair of words. From this we can again build a frequency table, since some bigrams occur more than once. Each distinct bigram is a token and gets its own numeric vector. The frequency table itself isn’t important here; I mention it just to illustrate the idea.
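
Here’s a minimal sketch of that, in plain Python with only the standard library, tokenizing one sentence at the word and bigram levels and counting token frequencies:

from collections import Counter

sentence = "the cat sat on the mat"
words = sentence.split()

# Word-level tokens and their frequencies
word_counts = Counter(words)            # {'the': 2, 'cat': 1, ...}

# Bigram-level tokens: every consecutive pair of words
bigrams = [' '.join(pair) for pair in zip(words, words[1:])]
bigram_counts = Counter(bigrams)        # {'the cat': 1, 'cat sat': 1, ...}

print(word_counts)
print(bigram_counts)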

Once we have decided on the abstraction level (characters, words, n-grams) and completed tokenization, we can decide how to vectorize the tokens. We can either:

  • One hot encode
  • Token embed

For one-hot encoding, we simply count all the unique words in the text, call this N, and assign each word a unique integer index under N. As long as there are no collisions, we’re good. We can do this at the word and n-gram levels too.

One-hot
00000001
00000010
00000100
00001000
00010000
00100000
01000000
10000000

At the word level, you can naively implement it yourself as follows, or use the prebuilt Keras methods shown afterwards:

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build a word -> index mapping (indices start at 1)
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# One-hot encode the first max_length words of each sample
max_length = 10
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

In keras:

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)
sequences = tokenizer.texts_to_sequences(samples)
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

One additional thing to consider: we often cut off the lower tail of the word-frequency distribution and take, say, only the 1,000 most frequent words, because this saves us compute time. That’s what num_words=1000 does above.

To see the word-to-index mapping the tokenizer learned (which you can invert to decode an encoded sequence):

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
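
And to actually decode, here is a minimal sketch that inverts that mapping and turns the sequences from above back into words (the '?' fallback is just a placeholder for unknown indices):

# Invert the word -> index mapping so sequences can be turned back into words
index_word = {index: word for word, index in word_index.items()}

decoded = [' '.join(index_word.get(i, '?') for i in seq) for seq in sequences]
print(decoded)  # e.g. ['the cat sat on the mat', 'the dog ate my homework']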

With word embeddings, instead of vectors that look like [0, 0, 0, …, 1, 0], we create vectors that look more like [0.243, 0.33454, …, 0.5553].

While a one-hot encoded vector can be of size 1,000 or more, embedded vectors can be much, much smaller.

How do we learn these fractional elements of the vector though?

We can learn them at the same time as our main task, on the data we have, or use pretrained word embeddings. What’s nice about embeddings is that they learn the meanings of words.

How do we know this?

Remember that vectors can be mapped into a geometric space. If you plot the embedded word vectors in that space, you start to see geometric relations between related words.
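
You can check this yourself once you have vectors for your words. Here is a minimal sketch that measures how related two words are by the cosine of the angle between their vectors; it assumes a dictionary of word vectors such as the embeddings_index built from GloVe later in this post, and the example words are just illustrative:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two word vectors; closer to 1 means more related
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# embeddings_index maps words to vectors, e.g. the GloVe dictionary loaded below
print(cosine_similarity(embeddings_index['king'], embeddings_index['queen']))
print(cosine_similarity(embeddings_index['king'], embeddings_index['carrot']))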

Why is it theoretically better to train the word embeddings on your own training data, or at least in a context close to the task at hand?

Well, languages aren’t isomorphic: English and Russian don’t have the same mappings, and features that exist in one language may not exist at all in the other.

Furthermore, two English speakers might not agree on the definition of a word, and therefore on its semantic relationship to other words.

Even further, the same person might use a word differently in different contexts. So context matters a lot in semantics.

Let’s embed some words:

from keras.layers import Embedding
embedding_layer = Embedding(1000, 64)

The 1000 and 64 signify, roughly, how big your one-hot vectors would have been and how big the embedded vectors are now. One-hot encoding is like a digital signal and word embeddings are like an analog, continuous signal, except that the embedding is trained to become analog and then frozen and used as a unique signal per token.
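
To make the shapes concrete, here is a small sketch (with an assumed sequence length of 10) showing that the layer turns integer word indices into 64-dimensional vectors:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(1000, 64, input_length=10))  # vocabulary of 1,000, 64-d vectors

# A batch of 2 samples, each a sequence of 10 word indices in [0, 1000)
dummy_input = np.random.randint(0, 1000, size=(2, 10))
output = model.predict(dummy_input)
print(output.shape)  # (2, 10, 64): each index became a 64-dimensional vector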

We can just use the word embeddings and a dense classifier to see what kind of accuracy we get:

from keras.datasets import imdb
from keras import preprocessing
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

max_features = 10000
maxlen = 20

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(10000, 8, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

This gives us 76% accuracy. Not bad.

What if we used pretrained word embeddings?

Before we do that we need to get the labels:

import os

imdb_dir = '/Users/username/Downloads/imdb_dataset'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

Word2vec is one of the first and most successful pretrained word embeddings. Another great one is GloVe. Next we tokenize and vectorize the texts so we can use the pretrained GloVe embeddings on them:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100
training_samples = 200
validation_samples = 10000
max_words = 10000

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

We need to manually download the GloVe embeddings here, then:

glove_dir = '/Users/fchollet/Downloads/glove.6B'

embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))

To illustrate some points we first define our model (note that embedding_dim = 100 is set further below, together with the embedding matrix):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

We need to load the word embeddings into the embedding layer:

model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

We freeze the word embeddings, cuz we don’t wanna mess with their nice structure during training. But we still need the embedding matrix itself; to build it:

embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

Now we are ready to train:

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))

model.save_weights('pre_trained_glove_model.h5')
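
To see how training went, we can plot the history object returned by fit. This is a standard matplotlib sketch, assuming the metric keys 'acc' and 'val_acc' from the compile step above:

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.show()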

The validation accuracy reaches about 50%, which we can improve with an LSTM or GRU layer, and maybe even by fine-tuning the word embeddings after the recurrent layer has been trained. Remember, we can do that by unfreezing layers, just like the last layers of a convolutional model.
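
As a sketch of that direction, here is one possible way to put an LSTM on top of the frozen GloVe embeddings instead of the Flatten and Dense stack; the layer size is illustrative:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

# Load and freeze the GloVe weights exactly as before
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=32,
                    validation_data=(x_val, y_val))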

Finally, remember that we picked only 200 training samples. That’s too few.

Other Articles

This post is part of a series of stories that explores the fundamentals of deep learning:

1. Linear Algebra Data Structures and Operations
Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions
Inverses, Linear Dependence, Eigen-decompositions, SVD
3. Probability Theory Ideas and Concepts
Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models
Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning
Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations
Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I
Generalization, MLE, Kullback-Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II
Regularization, Capacity, Parameters, Hyper-parameters
9. Principal Component Analysis Breakdown
Motivation, Derivation
10. Feed-forward Neural Networks
Layers, definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning
Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning
Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification
The most generic way to setup your deep learning models to categorize movie reviews
15. General Architectural Design Considerations for Neural Networks
Universal Approximation Theorem, Depth, Connections
16. Classifying Text Data into Multiple Classes
Single-Label Multi-class Classification
17. Convolutional Models Overview
Convolutions, Kernels, Downsampling & Properties
18. Working Understanding of Convolutional Models
Creating, Preprocessing, Data Augmentation, Feature Extraction, Fine Tuning
19. Convolutional Models for Sequential Data
And easing into Recurrent Neural Networks
20. Recurrent Models Overview
Recurrent Layers: SimpleRNN, LSTM, GRU
21. Language Processing with Recurrent Models
Bidirectional RNNs, Encoding, Word Embeddings and Tips

Up Next…

Coming up next is probably LSTM Text Generation. If you would like me to write another article explaining a topic in-depth, please leave a comment.

For the table of contents and more content click here.
