Blog: Transferring Knowledge Across Learning Processes
The current wave of machine learning is the epitome of division of labour: a given model performs one narrowly defined task, and performs it to perfection. While these tasks can be quite complex, each requires a distinct model trained to solve that task and only that task. In contrast, a true artificial general intelligence would be an agent able to adapt to changes in its environment and to new tasks.
When humans adapt to new tasks, we draw heavily on previous knowledge and experience. This is a crucial ability that lets us learn new things with very few demonstrations. For instance, most people would be able to use Netflix without any demonstrations because they can extrapolate from general website design patterns.
Replicating this type of behaviour in a machine learning algorithm is very challenging. A common approach is to simplify the problem: take a model trained on some other task as our starting point and fine-tune it to the task we now want to solve. While very simple, fine-tuning is a highly effective method in Computer Vision [1, 2, 3] and Natural Language Processing [4, 5, 6, 7] when the tasks to transfer between are similar.
When tasks are not so similar, or when there is severe data scarcity, fine-tuning fails. Fundamentally, the problem with fine-tuning and similar transfer learning methods is that they use only information from the pre-trained model, but ignore the process of learning. This can induce a catastrophic loss of information, as the pre-trained model will discard information that is not useful for the task it is being trained on, but that would have been useful for the task we actually care about. When you learn to navigate the Netflix website, you use abstract knowledge about how websites work, not the details of how to navigate some specific website.
Towards an adaptable artificial intelligence
What we need is a more principled approach to sharing information between tasks. In particular, a way of sharing information between two (or more) learning processes. One such approach is given by meta-learning (or learning to learn).
In meta-learning, we treat information transfer as a learning problem in its own right, and optimise for maximal transfer. Because meta-learning learns to transfer knowledge between a set of tasks, it also learns to transfer knowledge to new, unseen tasks that resemble those it was trained on. As such, meta-learning has the potential to produce a truly general artificial intelligence that can smoothly adapt to new tasks with very few demonstrations [8, 9, 10].
Current meta-learning focuses exclusively on the so-called few-shot learning problem. In this setting, we have access to a relatively rich distribution of tasks, but have relatively few samples from each task and thus can only take a handful of gradient steps before we overfit. For this to work, tasks must be very similar, or a handful of gradient steps will not be sufficient to learn a task. While this is an important case of meta-learning, it is by no means all that meta-learning has to offer.
We argue that meta-learning can contribute even more when tasks are complex and require many training steps. In fact, precisely because such tasks require many training steps, meta-learning can have a dramatic impact on the rate of convergence and, ultimately, the final performance obtained. Yet there are currently no meta-learning algorithms designed for large-scale meta-learning, and most are computationally unfeasible beyond the few-shot setting.
To remedy the situation, we propose Leap, a light-weight meta-learning algorithm designed specifically for long training processes – even those with millions of gradient descent steps.
Designing a Scalable Meta-Learner
To design such a meta-learner, we face several tough challenges. The first is computational: meta-learning requires learning from learning processes, so it inherently scales poorly. Most few-shot algorithms require some form of backpropagation through the learning process, and as such are totally unfeasible here. Hence, our first constraint is that we cannot backprop through the learning process. The second problem is how to maintain a consistent transfer learning objective over very long learning processes. For instance, looking at the final loss won’t do: if there are a million steps between initialisation and convergence, there’s no way of telling which steps were right and which were wrong. We need to derive a novel meta-objective from first principles.
What does it mean when we say learning a new task was easy? Beyond the emotional experience, what we mean is that it didn’t take us too long and that the process was consistent (in the sense that we didn’t have a lot of false starts). From a machine learning perspective, this implies rapid convergence. It also implies that updating parameters should monotonically improve performance (well, in expectation at least). Oscillating back and forth is equivalent to not knowing how to improve upon your current skill level.
Both these notions revolve around how we travel from our initialisation to our final destination on the model’s loss surface. The ideal is going straight downhill to the best parameterisation for the given task; the worst case is taking a long detour with lots of back-and-forths. In Leap, we leverage the following insight:
Transferring knowledge means influencing the parameter trajectory such that it converges as rapidly and smoothly as possible.
Scaling meta-learning: Leap
To make this intuition crisp, we need a formal framework to work in. While we won’t get into the details here (for that, read the paper), we’ll flesh out the high-level ideas. You can safely skip the math if you prefer.
A learning process starts from some initial guess of the model parameters p(0) and updates that guess via some update rule u that depends on the learning objective L, some observational input x, and its target output y,
p(k) = p(k-1) + u(L(x; p(k-1)), y).
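For instance, plain gradient descent instantiates u as a step along the negative gradient of L. A minimal sketch of this recursion; the squared-error loss, the data, and the learning rate below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def grad_L(p, x, y):
    # Gradient of an illustrative squared-error objective L = (p @ x - y)^2.
    return 2 * (p @ x - y) * x

def u(p, x, y, lr=0.1):
    # Plain gradient descent as one instance of the update rule u.
    return -lr * grad_L(p, x, y)

p = np.zeros(2)                    # initial guess p(0)
x, y = np.array([1.0, 2.0]), 3.0   # a single (input, target) pair
for k in range(100):               # p(k) = p(k-1) + u(...)
    p = p + u(p, x, y)
# After convergence, the model fits the target: p @ x ≈ 3.
```

Any other optimiser (momentum, Adam, etc.) fits the same template by swapping in a different u.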
This sequence eventually converges, at which point our model has hopefully learned to solve the task. The length of this process can be formally described by the distance traversed from the initial guess to the final parameterisation, say p(K). This distance is given by summing up the length of each update:
d(p(0), p(K)) = ∑_{k=0}^{K−1} | p(k+1) − p(k) | .
Assuming that we converge to a good minimum on this task, the distance measure d of this process tells us if it was an easy or hard task. If it is small, it means we didn’t have to travel far, so our initial guess was good. If it is large, our initial guess was poor as we had to travel a lot to get there.
Consequently, we can transfer knowledge across learning processes by learning an initialisation such that the expected distance we have to travel on any task we expect to encounter is as short as possible.
This is the overall objective of Leap. Given a distribution of tasks that we can sample from, Leap learns an initialisation p(0) such that the expected distance of a learning process drawn from that task distribution is as short as possible. Thus, Leap extracts information from a variety of learning processes during meta-training and condenses it into a good initial guess that makes learning a new task as easy as possible. Importantly, this initial guess has nothing to do with the details of the final parameterisation on any given task; it is meta-learned to facilitate the process of learning those parameters, whatever they might be.
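To make this concrete, here is a heavily simplified, hypothetical sketch of such a meta-training loop on toy quadratic tasks. The real Leap objective measures distance on the loss surface and uses a pull-forward correction (see the paper); here we just accumulate, along a frozen trajectory, the per-step gradients of the Euclidean trajectory length, so every name and constant below is an illustrative assumption:

```python
import numpy as np
import random

def meta_train(p0, tasks, inner_lr=0.1, inner_steps=50,
               meta_lr=0.005, meta_steps=200):
    # For each sampled task, run the inner learning process and accumulate
    # a crude gradient of the trajectory length; then move p(0) so that
    # future trajectories are expected to be shorter.
    p0 = p0.copy()
    for _ in range(meta_steps):
        grad_fn = random.choice(tasks)      # sample a task
        p, meta_grad = p0.copy(), np.zeros_like(p0)
        for _ in range(inner_steps):
            p_next = p - inner_lr * grad_fn(p)
            step = p_next - p
            norm = np.linalg.norm(step)
            if norm > 0:
                meta_grad -= step / norm    # gradient of |step| w.r.t. p(k)
            p = p_next
        p0 -= meta_lr * meta_grad           # shorten future trajectories
    return p0

# Toy task distribution: quadratic losses with minima clustered around c.
c = np.array([2.0, 2.0])
rng = np.random.default_rng(0)
tasks = [(lambda p, t=c + 0.1 * rng.standard_normal(2): 2 * (p - t))
         for _ in range(10)]

p0 = meta_train(np.zeros(2), tasks)
# p0 moves towards the cluster centre, where every task is a short trip away.
```

Note that the inner loop is plain training with no backprop through the learning process, which is what keeps the memory cost constant.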
Taking a leap
We’re now ready to see if meta-learning can leap beyond the few-shot setting. To test Leap, we took a standard benchmark, Omniglot, and turned it into a much harder problem that cannot be solved within a few gradient steps. Omniglot is a dataset of 50 distinct alphabets, where each alphabet consists of 20 hand-drawn samples of each of its characters. To solve a task, the model draws data from the alphabet’s dataset and tries to predict which character each sample depicts. We allowed the model to take 100 gradient steps.
During meta-training, we allowed the meta-learner to see a subset of alphabets, ranging from 1 to 25, and held out 10 alphabets for final evaluation. To see if Leap offers any benefits, we tested it against no pre-training, multi-headed fine-tuning (which is cheating a bit, because it has more parameters and uses transduction, but let’s ignore that), the popular MAML meta-learner, its first-order approximation (FOMAML), and Reptile, a simplification of MAML that iteratively moves the initialisation a small portion in the direction of the final task parameterisation (as such, it is a special case of Leap that only looks at the distance between the starting point p(0) and the final point p(K)). So, how did we do?
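As an aside, the Reptile update just described fits in a few lines. A sketch on toy quadratic tasks; all names and constants are illustrative assumptions, not the benchmark setup:

```python
import numpy as np
import random

def reptile_step(p0, grad_fn, inner_lr=0.1, inner_steps=50, meta_lr=0.5):
    # Run the inner learning process to its final parameters p(K), then
    # move the initialisation a small portion towards p(K).
    p = p0.copy()
    for _ in range(inner_steps):
        p = p - inner_lr * grad_fn(p)
    return p0 + meta_lr * (p - p0)

# Toy task distribution: quadratic losses with minima clustered around c.
c = np.array([1.0, -1.0])
rng = np.random.default_rng(1)
tasks = [(lambda p, t=c + 0.1 * rng.standard_normal(2): 2 * (p - t))
         for _ in range(10)]

p0 = np.zeros(2)
for _ in range(100):
    p0 = reptile_step(p0, random.choice(tasks))
# The initialisation drifts towards the centre of the task distribution.
```

Because it only compares p(0) and p(K), Reptile ignores everything that happens along the trajectory, which is exactly the information Leap exploits.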
When we don’t have an accurate representation of the task distribution, i.e. fewer than 5 tasks, standard fine-tuning is as good as it gets. That is to be expected, since we cannot generalise if our task distribution is degenerate. As the task distribution grows richer, Leap converges much faster than fine-tuning. Importantly, we find that the meta-learned initialisation performs much better from the start, which means it would also do much better on tasks where we face data scarcity. Strikingly, the few-shot learning algorithms (MAML, FOMAML) do not cut it at all.
We also tested Leap in a more demanding scenario where each task is a distinct computer vision dataset (see paper for details). Here, learning a task required thousands of parameter updates. We found that Leap not only improved the rate of convergence, but also the final performance obtained, as faster convergence protects against overfitting. Finally, we went a little overboard and tested Leap on a very difficult transfer learning problem in Reinforcement Learning, the Atari suite. Here, learning a task requires many millions of parameter updates.
While Leap has no overhead during task training, we still need to collect full training trajectories, which renders meta-training costly. Even so, we found that meta-training for relatively few steps (a few hundred) gave some modest improvement on our held-out games. This improvement was not due to faster convergence, but to more consistent exploration across seeds. While not definitive, we take these as encouraging signs that meta-learning can tackle extremely complex problems; very exciting!
Leap is a first step towards a general meta-learner that can tackle any level of complexity. It is simple, light-weight (constant memory, negligible compute overhead, linear complexity), and can be integrated with other meta-learning algorithms that tackle other challenges, such as probabilistic reasoning or embedded task inference. We hope you’ve found some inspiration for your next idea and we look forward to seeing what you come up with. Don’t hesitate to reach out!
[1] He, Kaiming, et al. “Mask R-CNN.” ICCV. 2017.
[2] Zhao, Hengshuang, et al. “Pyramid scene parsing network.” CVPR. 2017.
[3] Papandreou, George, et al. “Towards accurate multi-person pose estimation in the wild.” CVPR. 2017.
[4] Peters, Matthew E., et al. “Deep contextualized word representations.” NAACL-HLT. 2018.
[5] Howard, Jeremy, and Sebastian Ruder. “Universal Language Model Fine-tuning for Text Classification.” ACL. 2018.
[6] Radford, Alec, et al. “Improving Language Understanding by Generative Pre-Training.” 2018.
[7] Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv:1810.04805. 2018.
[8] Koch, Gregory, et al. “Siamese neural networks for one-shot image recognition.” ICML Deep Learning Workshop. 2015.
[9] Vinyals, Oriol, et al. “Matching Networks for One Shot Learning.” NeurIPS. 2016.
[10] Finn, Chelsea, et al. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.” ICML. 2017.
[11] Ravi, Sachin, and Hugo Larochelle. “Optimization as a Model for Few-Shot Learning.” ICLR. 2017.
[12] Lake, Brenden, et al. “One shot learning of simple visual concepts.” CogSci. 2011.
[13] Nichol, Alex, et al. “On First-Order Meta-Learning Algorithms.” arXiv:1803.02999. 2018.