Over my long career as an environment artist working for video game and virtual reality developers, I’ve spent many tedious hours gathering and creating resources to populate virtual sets. I can imagine a time in the near future when all we would need to create a fully realized virtual world is to give an A.I. assistant a description of what we want, along with a few drawings, photographs, or videos showing some ideas, and in short order this digital assistant would hand back an environment reflecting those ideas. Think something like the Holodeck from Star Trek. My recent explorations into neural networks and machine learning suggest to me that we are quickly approaching this reality.
One of the first pieces of technology we would need to achieve this sci-fi fantasy is object classification and pose estimation. There are several approaches to this problem, but the one I decided to study and implement is RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints by Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. Their published paper can be found at https://arxiv.org/abs/1603.06208. I found the description in the paper difficult to follow, but they’ve published a GitHub repository with a PyTorch implementation, found here: https://github.com/kanezaki/pytorch-rotationnet. I dug deeply into the code in an attempt to understand its methodology and to explore whether this would be a viable approach to achieving a small but important part of automated virtual set creation. These are my findings.
The basic idea behind RotationNet is to train a CNN using a fixed number of images that represent an object rotated in a specific, predetermined order. During training, the pose (i.e. the angle from which the image was captured) is discovered as a latent feature, meaning that it is determined through unsupervised learning. After training, anywhere from 1 to the total number of images in a single training datapoint can be fed into the CNN, and the object’s category and pose will be predicted.
Training involves a CNN pre-trained on Imagenet, but the output is customized and then processed to determine the view order, which is the latent feature discovered during training.
Processing the CNN Output to Generate the Target Tensor
As mentioned earlier, I found it difficult to completely grasp the training process by reading the paper. Thanks to fast.ai I understand how to set up and train a CNN, but the post-CNN processing was difficult for me to understand. The PyTorch implementation code is, thankfully, very concise and clear, and stepping through it really helped me understand what the authors are doing. It was also helpful for me to create illustrations of the various tensors as they were created, reshaped, and updated. I present here my illustrations and commentary.
An Imagenet pretrained CNN (AlexNet in my implementation) is adjusted so that it will output a large two-dimensional tensor. The first dimension of this tensor needs to be the number of samples that are being simultaneously input into the CNN (the minibatch size) times the number of training images per datapoint. The second dimension needs to be the number of classes plus one (the extra “background” class) times the number of training images per datapoint. It is then reshaped so that the classes align along the first dimension, as illustrated below.
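The shape bookkeeping described above can be sketched in a few lines of numpy. The sizes `nsamp`, `nview`, and `nclass` below are hypothetical stand-ins, and random values stand in for the network's real activations:

```python
import numpy as np

nsamp, nview, nclass = 2, 3, 4   # hypothetical minibatch size, views, classes

# CNN output: one row per input image (nsamp datapoints x nview images each),
# one (nclass + 1)-wide column group per view slot
out = np.random.randn(nsamp * nview, (nclass + 1) * nview)

# Reshape so that each row holds the nclass + 1 activations for a single
# (image, view-slot) combination, aligning the classes along one dimension
out = out.reshape(nsamp * nview * nview, nclass + 1)
print(out.shape)   # → (18, 5)
```

Note that the element count is unchanged; the reshape only regroups the per-view column blocks into extra rows.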
This reshaped tensor is what will later be used during the cross-entropy loss calculation.
A softmax calculation is then performed along the columns to scale and normalize the activations. The next step consists of subtracting the “background” class from each of the other classes and then discarding it. I don’t believe this step is described in the paper, but I suspect it weakens the softmax values that have a strong “background” activation, making the softmax values for the desired class label relatively stronger. The resulting tensor is then reshaped a couple more times so that the final tensor’s first dimension is class, second is image, and third is sample. These reshapings set up the tensor for the “score” calculation that will determine which combination of views has the strongest softmax values.
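The softmax, background subtraction, and rearrangement can be sketched like this (a numpy approximation with hypothetical sizes, following the axis order described above; the published code's exact operations may differ in detail):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

nsamp, nview, nclass = 2, 3, 4                       # hypothetical sizes
logits = np.random.randn(nsamp * nview * nview, nclass + 1)

probs = softmax(logits)                              # normalize each row
# Subtract the "background" column (the last one) from every real class,
# then discard it
probs = probs[:, :nclass] - probs[:, [nclass]]
# Rearrange to (class, image/view-slot, sample) for the score computation
probs = probs.reshape(nsamp, nview * nview, nclass).transpose(2, 1, 0)
print(probs.shape)   # → (4, 9, 2)
```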
Next, the “score” for each view candidate is calculated. For each sample, the tensor values from the previous step are added together for each view candidate and stored in the score tensor. The resulting score tensor contains values that represent which view candidates have the strongest softmax values. Another one-dimensional tensor is created that will contain the target classes. It’s initialized with the “background” class. From the score tensor the view candidate that has the highest score is selected, and the target class tensor is populated with the target label in the positions determined by this selection.
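Putting the scoring and target generation together, here is a numpy sketch with a circular table of view candidates and hypothetical sizes. The index arithmetic reflects my reading of the published PyTorch code, so treat it as an illustration rather than a definitive reference:

```python
import numpy as np

nsamp, nview, nclass = 2, 3, 4
bg = nclass                                        # index of the "background" class
target = np.array([1, 2])                          # true class label per sample

# Each row of vcand is one hypothesis for which view came first
vcand = np.array([np.roll(np.arange(nview), -i) for i in range(nview)])

# Output of the previous step: (class, image/view-slot, sample)
probs = np.random.randn(nclass, nview * nview, nsamp)

# Sum the activations belonging to each view-candidate hypothesis
scores = np.zeros((nview, nclass, nsamp))
for j in range(nview):
    for k in range(nview):
        scores[j] += probs[:, vcand[j, k] * nview + k, :]

# Target tensor: "background" everywhere except the winning alignment
target_ = np.full(nsamp * nview * nview, bg)
for n in range(nsamp):
    j_max = scores[:, target[n], n].argmax()       # best-scoring candidate
    for k in range(nview):
        target_[n * nview * nview + vcand[j_max, k] * nview + k] = target[n]

print((target_ != bg).sum())   # → 6  (nview labeled positions per sample)
```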
Now that we have our target tensor, loss can be calculated and backpropagation performed.
I worked through the first part of the fast.ai MOOC classes taught by Jeremy Howard, and attended the in-person classes earlier this spring. I’m really excited about his approach and the libraries that simplify and streamline the whole process of training a neural network. To help dig deep into their libraries and develop my understanding of them I tried to work within the fast.ai ecosystem as much as possible. I was mostly successful, but there were a few complications and I did need to alter their library code slightly to account for the differences the RotationNet implementation requires. I’ll go into those in more detail below.
For those not familiar with fast.ai, it’s an excellent collection of libraries built on top of PyTorch that streamlines and encodes a lot of current best practices for deep learning. It currently supports various deep learning techniques, including CNNs, RNNs, U-Nets, tabular data, NLP, and more. I’ve been having a lot of fun playing around with it over the past few months to develop some computer vision ideas (e.g. https://github.com/amathis726/self-steering-ue4-beta), and this project is my attempt to dig deeper into the libraries and achieve a greater understanding, as well as develop some cool computer vision tools that might have useful applications down the road.
To keep things as flexible as possible, fast.ai has implemented into its training loop the concept of “callbacks,” which are basically just a way to insert some custom functionality into the training loop at various predetermined points in the training process. Because RotationNet does extensive processing to the output tensor of the CNN before calculating the loss I knew I’d have to implement a few callbacks. Specifically, I created four with significant functionality:
Because each datapoint of the training dataset consists of multiple images, I couldn’t shuffle each minibatch randomly; otherwise the image groupings would be lost and the latent view variable could not be determined. Yet when I tried training the CNN with unshuffled minibatches I was unable to achieve accuracies above 30%. So, at the beginning of each epoch, I had to implement a custom shuffle routine that reorders the groups of images in a minibatch while keeping each group, and its internal order, intact.
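The shuffle routine is essentially the following (a simplified sketch; `shuffle_groups` is a hypothetical name, and the real callback applies this idea within fast.ai's data pipeline rather than to a bare index list):

```python
import random

def shuffle_groups(n_items, group_size, seed=None):
    """Shuffle the order of image groups while keeping each group's
    internal (rotation) order intact; returns a flat index list."""
    assert n_items % group_size == 0
    groups = [list(range(i, i + group_size))
              for i in range(0, n_items, group_size)]
    random.Random(seed).shuffle(groups)        # reorder groups, not images
    return [i for g in groups for i in g]
```

For example, `shuffle_groups(9, 3)` might return `[3, 4, 5, 0, 1, 2, 6, 7, 8]`: the three-image groups move around, but each group's internal order survives.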
This callback is called immediately after a minibatch has worked its way through the CNN, but before the loss is calculated and backpropagation begins. This is the exact time that I needed to process the output tensor (process illustrated above) to generate the target class tensor.
The fast.ai libraries have a system built in to automatically calculate loss and accuracy, and display it to the user in a nice table. It works beautifully for most deep learning applications, but RotationNet needs accuracy not just for a single image, but for a group of images. This callback is called right before the validation phase for a minibatch, so this is the perfect point to call my own customized validation scoring.
This callback is called at the end of each epoch. Because RotationNet needs customized metrics, this is the point when its custom losses and accuracy are calculated and displayed to the user.
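In outline, the four callbacks hang off the training loop something like this. This is a skeleton only: the hook names loosely follow fast.ai v1's Callback API, and the bodies are placeholders for the logic described above, not my actual implementation:

```python
class RotationNetCallback:
    """Skeleton of the four callback hooks described above (hook names
    loosely follow fast.ai v1's Callback API; bodies are placeholders)."""

    def on_epoch_begin(self, **kwargs):
        """Reshuffle the minibatch order by image group, keeping each
        group's internal rotation order intact."""

    def on_loss_begin(self, last_output=None, **kwargs):
        """Process the CNN output tensor and generate the target class
        tensor before the loss is calculated."""
        return last_output

    def on_batch_end(self, **kwargs):
        """Run the customized group-level validation scoring."""

    def on_epoch_end(self, **kwargs):
        """Compute and display the custom losses and accuracies."""
```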
fast.ai Source Code Changes
Because RotationNet has a few requirements that set it apart from other image classification tasks, I had to make some very minor changes to fast.ai’s library source code. Specifically, in the training loop, before calculating loss, I needed to process the CNN output and generate the target class tensor. Fast.ai’s on_loss_begin callback only returns one argument, so I had to add another to make sure the target class tensor got saved for later use. During the loss-calculation routine I also needed to add a check for whether the model was evaluating or training. If evaluating, it shouldn’t create the target class tensor, because the view variable is latent and discovered during training, and thus shouldn’t be evaluated at validation time.
The authors of the RotationNet paper created their own dataset, called MIRO, to prove that their approach has real-world applications. It consists of 12 classes, with 9 training datapoints and one validation datapoint for each class. Obviously this is a very tiny dataset, but I think it succeeds in showing that RotationNet’s approach is feasible and able to make accurate predictions for both classification and pose. Here are a few examples of my results using MIRO:
The top row shows a random set of images from one datapoint in the validation set. The bottom row shows a representative image from the training set, showing predicted class and pose. The final printout at the bottom shows the ultimate class prediction, which is simply the class that received the most predictions. For the example above, there were 15 total images input into the CNN for evaluation (only the first 10 are shown). Since “bus” was the overwhelming majority of predictions, the final prediction is “bus”.
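The final, group-level prediction is just a majority vote over the per-image predictions, e.g.:

```python
from collections import Counter

# Hypothetical per-image class predictions for one validation datapoint
preds = ['bus', 'bus', 'car', 'bus', 'plant', 'bus']
final = Counter(preds).most_common(1)[0][0]   # most frequent class wins
print(final)   # → bus
```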
Notice that the predicted pose is actually a number, not an image, as might be incorrectly inferred from my graphic. The images in the second row are only a graphical representation of the pose number. Because “pose 13” is difficult for us humans to decipher on its own, it’s easier to validate results when they are represented this way.
The pose number represents the view that was determined during training. The view that had the strongest “score” during training is designated as the first pose (pose 0), and, because the views are captured in a predetermined rotation order, the pose numbers that follow the first pose follow this same order. As a simple example, let’s assume there were only three views in the training data. Since the views are created in a predetermined rotation order (either clockwise or counter-clockwise) then there can be only three view candidates ((1,2,3), (2,3,1), and (3,1,2)). When the first pose is determined during the scoring phase the pose order that follows is determined at the same time (i.e. if the first pose is 1 then the order is 1,2,3, if the first pose is 2 then the order is 2,3,1, etc.).
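The three-view example above corresponds to a small circular candidate table, which can be generated like this (0-indexed here, whereas the text above counts views from 1):

```python
import numpy as np

nview = 3
# Every legal view ordering is a circular shift of the capture order
vcand = np.array([np.roll(np.arange(nview), -i) for i in range(nview)])
print(vcand.tolist())   # → [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
```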
More MIRO examples
The results are encouraging. This seems like a valid approach to object recognition and pose estimation. After implementing object detection functionality, I can imagine providing a piece of concept art, or a video or photos of a set walk-through, and having the CNN recognize the objects represented in it. Theoretically, it should then be possible to pull models from a database and orient them similarly to how they are positioned in the representation.
Thanks for reading. Ideas, feedback, and suggestions are always welcome.
Code can be found at: https://github.com/amathis726/rotationNet