Blog: Hello, RNN!
In this post, we briefly cover the simplest form of a recurrent neural network. In his excellent post, “The Unreasonable Effectiveness of Recurrent Neural Networks”, Andrej Karpathy introduces a simple character-level language model capable of generating the text ‘hello’. Here, we will write working code for his conceptual example, and we will do it in PyTorch!
data = DataHandler("Hello, RNN!")
We’ll instantiate our data handler class with the string “Hello, RNN!”. The class determines the set of unique characters in the string, creates mappings between these characters and integer representations (e.g., ‘H’ → 0, ‘R’ → 5), and converts the characters into one-hot encodings.
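As a rough illustration of what such a class might do internally, here is a minimal pure-Python sketch of the vocabulary and one-hot steps. The names `chars`, `char_to_idx`, `idx_to_char`, and `one_hot` are assumptions made for this example and are not taken from the actual DataHandler; the exact integers assigned to each character depend on how the unique characters are ordered, so they may differ from the mapping quoted above.

```python
# Hypothetical sketch of the vocabulary step; all names are illustrative.
text = "Hello, RNN!"

# Sort the unique characters so the mapping is deterministic across runs.
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}


def one_hot(index, size):
    """Return a list of zeros of length `size` with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# One one-hot vector per character in the original string.
encoded = [one_hot(char_to_idx[ch], len(chars)) for ch in text]
```

Each row of `encoded` has exactly one nonzero entry, and `idx_to_char` lets us turn a predicted index back into a character.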
Lastly, the class generates X, y data for training the model. As described in Karpathy’s post (see figure below), our model takes in a sequence of characters. For a given input character, the model should predict the next character in the sequence. For example, given the string ‘hello’, if we feed ‘h’ to the model, it should predict ‘e’; if we feed in ‘e’, it should predict ‘l’, and so on. The power of the RNN lies in this capability: when we feed in an ‘l’, the model learns whether the output should be another ‘l’ or an ‘o’, depending on the sequence of characters it has processed up to that point.
In our class, X and y are offset by one character. For example, if our string is “Hello”, then X = [‘H’, ‘e’, ‘l’, ‘l’] and y = [‘e’, ‘l’, ‘l’, ‘o’] (except that we use one-hot encodings rather than characters).
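The one-character offset can be sketched with plain string slicing. This is only an illustration using raw characters (the variable names `X_chars`, `y_chars`, and `pairs` are made up for the example); the real class works with encoded vectors.

```python
# Illustrative only: build input/target pairs offset by one character.
text = "Hello"

X_chars = list(text[:-1])  # every character except the last
y_chars = list(text[1:])   # every character except the first

# Pair each input character with the character that follows it.
pairs = list(zip(X_chars, y_chars))

# The same input 'l' appears with two different targets, which is why
# the model needs its hidden state to tell the two cases apart.
targets_for_l = [y for x, y in pairs if x == "l"]
```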
We instantiate our model by giving it the number of characters in our dataset and the number of hidden units that make up the hidden state.
net = HelloRNN(num_chars=data.num_characters, num_hidden=50)
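For readers who want a picture of what a model with this constructor could look like, here is a minimal sketch built on PyTorch's `nn.RNN` and `nn.Linear`. It is an assumption about the architecture, not the post's actual HelloRNN: the layer names, the `batch_first=True` layout, and the return of both logits and hidden state are choices made for this example.

```python
import torch
import torch.nn as nn


class HelloRNN(nn.Module):
    """Hypothetical single-layer character RNN (a sketch, not the post's class)."""

    def __init__(self, num_chars, num_hidden):
        super().__init__()
        # Recurrent layer: one-hot character in, hidden state out.
        self.rnn = nn.RNN(input_size=num_chars, hidden_size=num_hidden,
                          batch_first=True)
        # Projects each hidden state to one logit per character.
        self.out = nn.Linear(num_hidden, num_chars)

    def forward(self, x, hidden=None):
        # x has shape (batch, seq_len, num_chars).
        states, hidden = self.rnn(x, hidden)
        logits = self.out(states)  # (batch, seq_len, num_chars)
        return logits, hidden


# Assumed sizes: 9 unique characters, a sequence of 10 inputs.
net = HelloRNN(num_chars=9, num_hidden=50)
x = torch.zeros(1, 10, 9)
logits, hidden = net(x)
```

The hidden state returned alongside the logits is what carries information from earlier characters to later ones.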
We will also need an optimizer and a loss criterion. Here, we’ll use Adam from PyTorch’s built-in optimizers (torch.optim) and cross-entropy loss (nn.CrossEntropyLoss).
optimizer = optim.Adam(net.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
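One practical detail worth sketching: `nn.CrossEntropyLoss` takes raw logits and, in its most common usage, integer class indices as targets. If the labels are stored as one-hot vectors, they can be converted with `argmax`. The shapes and label values below are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

# Assumed shapes: 4 time steps, 9 possible characters.
logits = torch.randn(4, 9)
one_hot_targets = F.one_hot(torch.tensor([1, 2, 2, 3]), num_classes=9)

# Convert one-hot rows to class indices.
targets = one_hot_targets.argmax(dim=1)

loss = criterion(logits, targets)

# The same value computed in two explicit steps: log-softmax, then
# negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
```

Because the softmax happens inside the loss, the model itself should not apply one to its outputs.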
A nice aspect of PyTorch is that its automatic differentiation computes the gradients for us. In a basic PyTorch training loop, each pass through the data (here we have only one example, so we skip the usual inner loop over batches) does the following:
- Zero out the gradients using the optimizer
- Pass the data to the network to obtain the output (in the form of logits)
- Compute the loss from the output logits and character labels via CrossEntropyLoss; this function applies (log-)softmax to the outputs internally, so the network returns raw logits
- Compute the gradients over all trainable parameters in the model via loss.backward()
- Iterate over all parameters and clamp the gradients (useful for addressing the issue of exploding gradients in recurrent models)
- Apply computed gradients to their respective parameters via optimizer.step()
- Do other useful things, such as printing information to stdout, evaluating the model, logging results, etc.
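The steps above can be sketched end to end as follows. This is a self-contained toy version under stated assumptions, not the post's full code: `TinyRNN` is a stand-in model, the string is ‘hello’, the clamp range of ±5 and the 200 iterations are arbitrary choices, and targets are stored as class indices rather than one-hot vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

# Toy data: inputs are one-hot, targets are the next character's index.
text = "hello"
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
indices = torch.tensor([char_to_idx[ch] for ch in text])
X = F.one_hot(indices[:-1], num_classes=len(chars)).float().unsqueeze(0)
y = indices[1:]


class TinyRNN(nn.Module):
    """Stand-in model for this sketch (hypothetical)."""

    def __init__(self, num_chars, num_hidden):
        super().__init__()
        self.rnn = nn.RNN(num_chars, num_hidden, batch_first=True)
        self.out = nn.Linear(num_hidden, num_chars)

    def forward(self, x):
        states, _ = self.rnn(x)
        return self.out(states).squeeze(0)  # (seq_len, num_chars)


net = TinyRNN(len(chars), 50)
optimizer = optim.Adam(net.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

losses = []
for epoch in range(200):
    optimizer.zero_grad()            # zero out the gradients
    logits = net(X)                  # forward pass, raw logits
    loss = criterion(logits, y)      # loss from logits and labels
    loss.backward()                  # compute gradients
    for p in net.parameters():       # clamp gradients element-wise
        if p.grad is not None:
            p.grad.clamp_(-5.0, 5.0)
    optimizer.step()                 # apply the gradients
    losses.append(loss.item())       # bookkeeping for later inspection
```

PyTorch also provides `torch.nn.utils.clip_grad_value_`, which performs the same element-wise clamping in one call.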
That’s all for now! A pointer to the full code is below.