“El ingenioso hidalgo don Quijote de la Mancha”, translated into English as “The Ingenious Nobleman Sir Quixote of La Mancha” or simply “Don Quixote”, is one of the classics of Spanish literature, written by Cervantes in the 17th century. The style of the original book is rather difficult to read these days, as it is written in a form of old Castilian, and it is certainly even more difficult to emulate for someone living in the 21st century. Can we use deep learning to solve this problem? The answer is YES!
The text of El Quijote is freely available on Project Gutenberg. I used chapters 1 to 10, roughly 110k characters and 20k words.
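Before slicing, the text needs some cleaning: as the training samples below show, it has been lowercased and accents have been stripped. Here is a minimal sketch of such a cleaning step (the filename quijote.txt and the exact rules are assumptions, not necessarily what the repository does):

import unicodedata

# Load the raw text downloaded from Project Gutenberg (assumed local filename)
with open('quijote.txt', encoding='utf-8') as f:
    raw_text = f.read().lower()

# Strip accent marks (e.g. 'condición' -> 'condicion'); note this also folds 'ñ' into 'n'
raw_text = ''.join(c for c in unicodedata.normalize('NFD', raw_text)
                   if unicodedata.category(c) != 'Mn')

chars = sorted(set(raw_text))                      # vocabulary of distinct characters
char_to_int = {c: i for i, c in enumerate(chars)}  # forward dictionary
n_vocab = len(chars)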
The model takes as input a sequence of characters, say 100 characters. The goal of the model is to predict the next character. After fitting the model to the training data, we sequentially produce ‘predictions’ of characters so that new words and full sentences are generated.
The training data is sliced so that, for each training sample, X is a column vector of 100 dimensions (one character index per entry) and Y is a single character. Here are the first observations:
| X(m) | Y(m) |
|---|---|
| 'primera parte del ingenioso hidalgo don quijote de la mancha\ncapitulo primero. que trata de la condi' | 'c' |
| 'rimera parte del ingenioso hidalgo don quijote de la mancha\ncapitulo primero. que trata de la condic' | 'i' |
| 'imera parte del ingenioso hidalgo don quijote de la mancha\ncapitulo primero. que trata de la condici' | 'o' |
| 'mera parte del ingenioso hidalgo don quijote de la mancha\ncapitulo primero. que trata de la condicio' | 'n' |
| 'era parte del ingenioso hidalgo don quijote de la mancha\ncapitulo primero. que trata de la condicion' | ' ' |
| 'ra parte del ingenioso hidalgo don quijote de la mancha\ncapitulo primero. que trata de la condicion ' | 'y' |
| 'a parte del ingenioso hidalgo don quijote de la mancha\ncapitulo primero. que trata de la condicion y' | ' ' |
| ' parte del ingenioso hidalgo don quijote de la mancha\ncapitulo primero. que trata de la condicion y ' | 'e' |
| 'parte del ingenioso hidalgo don quijote de la mancha\ncapitulo primero. que trata de la condicion y e' | 'j' |
| … | … |
Note several things:
- Each sample slides the 100-character window one position to the right, so consecutive samples overlap by 99 characters.
- The text has been lowercased and accents removed ('condicion' instead of 'condición'), which shrinks the character vocabulary.
- Newline characters ('\n') are kept, so the model can also learn where lines break.
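A minimal sketch of this slicing, reusing raw_text and char_to_int from the cleaning sketch above (the variable names are assumptions, not necessarily the repository's):

import numpy as np

seq_length = 100
dataX, dataY = [], []

# Slide a 100-character window over the text, one character at a time
for i in range(len(raw_text) - seq_length):
    seq_in = raw_text[i:i + seq_length]    # 100 input characters
    seq_out = raw_text[i + seq_length]     # the character to predict
    dataX.append([char_to_int[c] for c in seq_in])
    dataY.append(char_to_int[seq_out])

# Reshape to (samples, time steps, features) and normalize for the LSTM
X = np.reshape(dataX, (len(dataX), seq_length, 1)) / float(n_vocab)
# One-hot encode the targets
y = np.eye(n_vocab)[dataY]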
The basic block of the model is a [Long Short-Term Memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory) block. LSTM-based models have proved to work really well in practice thanks to their forgetting and updating capabilities. Here is a quick summary of the components of an LSTM block:
- A cell state that carries information across time steps.
- A forget gate that decides which parts of the cell state to discard.
- An input gate that decides which new information to write into the cell state.
- An output gate that decides which parts of the cell state are exposed as the hidden state.
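For the curious, here is an illustrative numpy sketch of a single LSTM step; Keras implements all of this internally, so this snippet is purely didactic:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b hold one weight matrix / bias vector per gate ('f', 'i', 'o', 'c')
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])        # forget gate
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])        # input gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])        # output gate
    c_tilde = np.tanh(W['c'] @ x + U['c'] @ h_prev + b['c'])  # candidate cell state
    c = f * c_prev + i * c_tilde   # forget part of the old state, write new info
    h = o * np.tanh(c)             # new hidden state
    return h, c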
For Don Quixote we stack 3 LSTM layers of 512 units, each followed by a dropout layer. A dropout layer randomly “switches off” some neurons at each pass of the optimization algorithm, making the model less prone to over-fitting. In Keras (Python), the model is defined as follows:
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM

# define the 3-layer LSTM model
model = Sequential()
model.add(LSTM(512, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(512))
model.add(Dropout(0.5))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
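The checkpoint filename used later (weights-improvement-3L-512-23-1.2375.hdf5) suggests that training used Keras' ModelCheckpoint callback. A sketch of what the training call may have looked like (the epoch count and batch size here are illustrative guesses):

from keras.callbacks import ModelCheckpoint

# save the weights whenever the loss improves, encoding epoch and loss in the name
filepath = "weights-improvement-3L-512-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', save_best_only=True, mode='min')

model.fit(X, y, epochs=30, batch_size=128, callbacks=[checkpoint])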
The rest of the code is publicly available in this repository.
For generating sequences we do the following:
1. Pick a random seed sequence from the training data.
2. Take the latest seq_length characters, reshape and normalize them, and ask the model for the probability of each character appearing next.
3. Sample the next character from that distribution (with a temperature parameter) rather than always taking the most likely one.
4. Append the sampled character to the sequence and repeat from step 2.
5. Finally, translate the generated indices back to characters.
The above, translated to Python:
import numpy as np

def generate_words(model, chars, n_vocab, dataX, seq_length):
    # backward dictionary: integer index -> character
    int_to_char = dict((i, c) for i, c in enumerate(chars))
    # pick a random seed
    start = np.random.randint(0, len(dataX) - 1)
    pattern = list(dataX[start])
    print("Seed:")
    print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
    # generate characters
    for i in range(3000):
        # select the latest sequence
        pattern_aux = pattern[len(pattern) - seq_length:]
        x = np.reshape(pattern_aux, (1, len(pattern_aux), 1))
        x = x / float(n_vocab)
        # predict the probability of each character appearing next
        prediction = model.predict(x, verbose=0)
        # sample the next character index (temperature 0.5)
        index = sample(prediction[0], 0.5)
        # append the new element to the running sequence
        pattern.append(index)
    # translate indices back to characters
    seq_in = [int_to_char[value] for value in pattern]
    print('\nGenerated text:\n')
    print(''.join(seq_in))
    print('\n\t*** THE END ***')
    return seq_in
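The sample helper above is defined in the repository; a common temperature-based implementation, shown here as an assumption of what it does, reweights the predicted distribution before drawing a character (a temperature of 0.5 biases sampling towards the most likely characters):

import numpy as np

def sample(preds, temperature=1.0):
    # reweight the predicted distribution: low temperature -> more conservative
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-8) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw one index from the reweighted distribution
    return np.argmax(np.random.multinomial(1, preds, 1))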
Running the main file as follows generates new words:
python3 DonQuijote.py -w weights-improvement-3L-512-23-1.2375.hdf5
An example of the output (takes around 1 minute to generate):
con todo eso, se le dejaron de ser su romance v me dejase, porque no le dejare y facilidad de su modo
que de la lanza en la caballeriza, por el mesmo camino, y la donde se le habia de haber de los que el campo,
porque el estaba la cabeza que le parece a le puerto y de contento, son de la primera entre algunas cosas
de la venta, con tanta furia de su primer algunos que a los caballeros andantes a su lanza, y, aunque
el no puede le dios te parecian y a tu parte, se dios ser puede los viera en la caballeria en la
caballeria en altas partes de la mancha,
It is surprisingly good! Some thoughts:
- Most of the generated words are valid Spanish, and the model clearly picks up Cervantes' register (“caballeros andantes”, “la mancha”).
- Punctuation, line breaks and word boundaries are learned purely from characters.
- Sentences are locally plausible but lack global coherence, and a few words are outright inventions.
A minor modification to the generation algorithm requires every newly generated word to exist in the book. If a word does not, it is rejected and re-sampled; the process repeats until a valid word is produced.
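As a sketch of this rejection step (the function below is hypothetical, not the repository's exact code), assume book_words is a Python set containing every word of the training text, and reuse sample and int_to_char from above:

import numpy as np

def generate_valid_word(model, pattern, int_to_char, book_words, n_vocab, seq_length):
    # keep sampling candidate words until one appears in the book's vocabulary
    while True:
        candidate = list(pattern)
        word_chars = []
        while True:
            x = np.reshape(candidate[-seq_length:], (1, seq_length, 1)) / float(n_vocab)
            index = sample(model.predict(x, verbose=0)[0], 0.5)
            candidate.append(index)
            char = int_to_char[index]
            if char in (' ', '\n'):
                break                       # a full word has been produced
            word_chars.append(char)
        word = ''.join(word_chars)
        if word == '' or word in book_words:
            return word, candidate          # accept and return the extended sequence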
python3 DonQuijote.py -w weights-improvement-3L-512-23-1.2375.hdf5 -o True
al cual le parecieron don quijote de la mancha, en cuando le daba a le senor tio en el corral, y tio que andaba muy
acerto los dos viejos, y, al caso de van manera con el de tu escudero. don quijote y mas venta a su asno, con toda
su amo pasa dios de la caballeria y de al que habia leido, no habia de ser tu escudero: la suelo del camino de la
venta, de que san caballo de los que le habia dejado; a este libro es este es el mismo coche, como te ve don mucho
deseos de los que el caballero le hallaba; y al corral con la cabeza que aquel sabio en la gente de la lanza
y tan las demas y camas de tu escudero,
With this modification, the character-generating model is guaranteed to output only words that actually exist in the book, so every generated word is at least correctly spelled.
FloydHub is a cloud-based platform for training and deploying deep learning models. It is really easy and straightforward to use. Here is a quick guide on how you can train your models with only a few commands:
pip3 install -U floyd-cli
floyd login
cd DonQuijote_RNN
floyd init DonQuijote_RNN
floyd run --gpu --env tensorflow-1.3 "python3 DonQuijote.py"
The model was trained for roughly 5 hours on a FloydHub GPU instance (61 GB of RAM). The weights of the model have been uploaded to the repository.