Text Generation with Python and TensorFlow (Keras) — Part 2

Danish Khan
8 min read · Dec 1, 2020

Note: This is part 2 of a three-part mini series. Part 1

In this post, we’ll go through the process of implementing an LSTM and using it to generate text.

This is how the text generation will work:

  • Given a starting character, the model will learn the probabilities regarding what character will come next.
  • We will then chain these probabilities together to create an output of many characters (the toy sketch below illustrates the idea).
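
To make that concrete, here’s a tiny toy sketch of chaining next-character probabilities into a string. The probabilities are made up for illustration; in our project, the LSTM will learn them from data:

import random

# Hypothetical next-character probabilities, just for illustration.
next_char_probs = {
    "h": {"e": 0.8, "a": 0.2},
    "e": {"l": 0.7, "r": 0.3},
    "l": {"l": 0.3, "o": 0.7},
    "o": {" ": 1.0},
}

text = "h"
while text[-1] in next_char_probs:
    options = next_char_probs[text[-1]]
    # Pick the next character at random, weighted by its probability.
    text += random.choices(list(options), weights=options.values())[0]

print(text)  # e.g. "hello "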

I’ll be using Google Colab for this project, though you can follow along using any other notebook service or using your own hardware. Setting up a Colab notebook is very easy and straightforward, since you don’t need to install anything. In case you are using Colab, you can use a GPU as a hardware accelerator to improve training speeds by going to Runtime -> Change runtime type -> Hardware accelerator.

Importing libraries

We’ll be needing the following libraries:

  • Numpy
  • Keras
  • NLTK (Natural Language ToolKit)
  • sys

Let’s go ahead and import all of these into our program.

import numpy
import sys
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
nltk.download("stopwords")

You’ll notice that we downloaded something called “stopwords”. We’ll need this later.

We need data to train our model on. For this project I’ll be using William Shakespeare’s Romeo and Juliet, which is available to download for free on Project Gutenberg. You are free to use any other text you have.

If you are using a file from Project Gutenberg, you’ll notice that there is some information and legal text about the usage of the book at the beginning and end of the text file. You may choose to remove it if you want to.

Upload your text file to your notebook and read it:

data = open("RomeoAndJuliet.txt").read()

To make things easier for this example, we’ll convert all the text to lowercase. We will also do some preprocessing to clean the data, and then convert the text file into arrays that our model can use.

We’ll need to convert our words into tokens before we can make arrays. A token is basically a sequence of characters that are grouped together as a useful semantic unit for processing.

[Image: an example of tokenization]

Tokenization can be a very complicated process. In English, you can’t simply remove every punctuation mark and white space you come across, since that can change the meaning of the words. It is even more difficult in languages like Arabic, where a single word can comprise up to four independent tokens. There will be links at the end of the post if you want to learn more about this topic.

We’ll create an instance of a tokenizer and use it on our text file.

Finally, we’ll remove the tokens that appear in a list of stop words, since they do not add significant value to the sentence. This is where the “stopwords” corpus that we downloaded earlier comes in handy. It contains words (‘ourselves’, ‘hers’, ‘between’, ‘yourself’, ‘but’, ‘again’) which do not contribute to the meaning of a sentence in a significant manner. We use this list to filter out the stop words.
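
If you’re curious, you can peek at the list yourself. The first few entries look something like this (the exact contents depend on your NLTK version):

print(stopwords.words('english')[:10])
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]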

Let’s create a function that does all that:

def tokenize(input):
    # Lowercase the text, split it into word tokens, and drop English stop words.
    input = input.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(input)
    stop_words = set(stopwords.words('english'))
    filtered = filter(lambda token: token not in stop_words, tokens)
    return " ".join(filtered)
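
As a quick sanity check, running the function on a single line of the play should drop the stop words and punctuation. It should print something like this:

print(tokenize("But soft, what light through yonder window breaks?"))
# soft light yonder window breaks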

Now call this function on our file:

processed_input = tokenize(data)

As stated before, we’ll need to convert our data into arrays of numbers. The first step is to assign a number to each character that appears in our data and to create a dictionary that holds these character-number pairs.

We first sort the list of the set of all characters that are in our data. We then use the enumerate function to get the numbers which represent these characters. Finally, we create the dictionary which holds these values.

characters = sorted(list(set(processed_input)))
char_to_num = dict((c, i) for i, c in enumerate(characters))
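
The exact contents depend on your text, but you can inspect the mapping like this:

print(characters[:10])   # the first few characters, in sorted order
print(char_to_num['a'])  # the number assigned to 'a' (assuming 'a' appears in your text)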

Now that we’ve transformed the data into the form it needs to be in, we can begin making a dataset out of it, which we’ll feed into our network. We need to define how long we want an individual sequence (one complete mapping of input characters to integers) to be. We’ll set a length of 100 for now, and declare empty lists to store our input and output data:

seq_length = 100
x_data = []
y_data = []

We now need to create input-output pairs to train and test the model on. For this, we’ll go through the entire list of inputs and chop them up into sequences of 100 characters, converting these characters into numbers. These will be the inputs. The output will be the next character which comes after a single sequence of 100 characters, converted into its corresponding number.

input_len = len(processed_input)  # total number of characters in the processed text

for i in range(0, input_len - seq_length, 1):
    in_seq = processed_input[i:i + seq_length]
    out_seq = processed_input[i + seq_length]
    x_data.append([char_to_num[char] for char in in_seq])
    y_data.append(char_to_num[out_seq])
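
To see what this windowing does, here’s the same idea on a toy string with a sequence length of 4:

toy = "romeo"
for i in range(0, len(toy) - 4, 1):
    print(toy[i:i + 4], "->", toy[i + 4])
# prints: rome -> o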

We now have our training data features and labels, stored as x_data and y_data. Next, we have to convert our input sequences into a numpy array that our model can use. We will also scale the values in the array to floats between 0 and 1, which makes them easier for the network to work with.

vocab_len = len(characters)  # number of unique characters in the text

X = numpy.reshape(x_data, (len(x_data), seq_length, 1))
X = X / float(vocab_len)

We will now one-hot encode our label data:

y = np_utils.to_categorical(y_data)
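
It’s worth printing the shapes at this point to confirm everything lines up: X should have one row per sequence, and y one column per unique character:

print(X.shape)  # (number of sequences, 100, 1)
print(y.shape)  # (number of sequences, number of unique characters)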

Now that all the data processing is done, we can finally create our LSTM model. We’ll define the type of model (in this case sequential), add a few LSTM layers and dropout layers (to prevent overfitting), and then a final Dense layer that will output the probability of each character being the next one.

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
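
You can print a quick overview of the architecture to confirm the layers stacked as intended:

model.summary()  # lists each layer with its output shape and parameter count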

We will now compile the model, after which it will be ready for training.

model.compile(loss='categorical_crossentropy', optimizer='adam')

It takes the model quite a while to train, and for this reason we’ll save the weights during training and reload them when the training is finished. We’ll set a checkpoint that saves the weights to a file and pass it as a callback when we fit the model.

filepath = "model_weights_saved.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
desired_callbacks = [checkpoint]

We will now train the model:

model.fit(X, y, epochs=4, batch_size=256, callbacks=desired_callbacks)
[Image: training output. Training can take some time.]

Now that we have a trained model, we load in the weights. If you are using Google Colab, now would also be a good time to download your weights if you want to use the model in the future, since all of your files will be deleted when you disconnect from the runtime.

filename = "model_weights_saved.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

Since we converted all our data into numbers before, we will need to define a dictionary that will convert the output of the model back into readable text.

num_to_char = dict((i, c) for i, c in enumerate(characters))
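
A quick way to verify the two dictionaries are consistent is a round-trip check on any character that appears in your text:

assert num_to_char[char_to_num['a']] == 'a'  # assumes 'a' occurs in the text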

To use the model for character generation, we provide it with a random sequence from our input data (the seed), which it will use to generate a string of characters.

start = numpy.random.randint(0, len(x_data) - 1)
pattern = x_data[start]
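
Since the pattern is stored as numbers, you can decode it to see which snippet of text was picked as the seed:

print("Seed: \"" + "".join([num_to_char[value] for value in pattern]) + "\"")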

Now to FINALLY generate text, we’re going to iterate through our chosen number of characters, convert our seed into float values, feed those values into the model, and ask it to predict what comes next. We append the predicted character to the pattern, drop the first character, and repeat the process a set number of times (1000 in this example), printing the generated characters as we go.

for i in range(1000):
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(vocab_len)
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = num_to_char[index]
    sys.stdout.write(result)
    # Slide the window forward: append the new character and drop the first one.
    pattern.append(index)
    pattern = pattern[1:len(pattern)]

This is what my model generated:

oe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe tooe too

It is … slightly disappointing, to say the least. The generated text doesn’t make any sense, and the model immediately started repeating a simple pattern.

However, the longer you train the model, the better it will become. We trained the model for 4 epochs. This is the result of the same model trained for 50 epochs:

stay thou hast shall street ere shou hast shall street ere shou hast shall street ere thou hast shall

The words have started making sense now, even though the model still goes back to repeating a pattern fairly quickly.

And this is the result of the model after it was trained for 150 epochs:

wound shall stay thy lady capulet sun haste serve holy shy sorch bome thou wilt sword shall fortune thy live shall stay thy lady capulet sun haste serve holy shy sorch bome

You can experiment with more training time and increasing the number of layers to get a better model.
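
One more thing worth experimenting with: the loop above always picks the single most likely character (argmax), which is part of why the output locks into repeating loops. A common tweak is temperature sampling, where you draw the next character from the predicted distribution instead. Here’s a minimal sketch of one way to do it, reusing the numpy import and vocab from above:

def sample(prediction, temperature=1.0):
    # Rescale the predicted probabilities (lower temperature = more conservative),
    # then draw a random index instead of always taking the most likely one.
    prediction = numpy.asarray(prediction).astype('float64')
    prediction = numpy.log(prediction + 1e-8) / temperature
    prediction = numpy.exp(prediction)
    prediction = prediction / numpy.sum(prediction)
    return numpy.random.choice(len(prediction), p=prediction)

# In the generation loop, replace numpy.argmax(prediction) with:
# index = sample(prediction[0], temperature=0.5)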

Note: If you decide to train your models for a very long time, keep in mind that a Google Colab notebook recycles after 12 hours even if the browser is kept open. Checkpointing would be a good way to get a substantial amount of training done.

In the next post, we’ll see how we can further improve our model and the different ways in which NLP can be used in daily life.

If you want to read more about tokenization, you can do so here.

More reading:

Dropout layers: https://keras.io/api/layers/regularization_layers/dropout/ and https://towardsdatascience.com/machine-learning-part-20-dropout-keras-layers-explained-8c9f6dc4c9ab

Stay safe and have a nice day!

Danish Khan is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/
