Oleksiy Ostapenko, Tassilo Klein, Moin Nabi (ML Research)
Humans have an extraordinary ability to learn continuously throughout their lifetime. The ability to apply previously learned knowledge to new situations, environments and tasks is a key feature of human intelligence. On the biological level, this is commonly attributed to the ability to selectively store and govern memories over a sufficiently long period of time in neural connections called synapses. Unlike biological brains, conventional Artificial Neural Networks (ANNs) do not possess the ability to control the strength of synaptic connections between neurons. This leads to extremely short memory lifetimes in ANNs, an effect known as catastrophic forgetting.
In the past decade, most research in the field of Artificial Intelligence (AI) was directed towards exceeding human-level performance on isolated, clearly defined tasks such as playing computer games, filtering spam emails, distinguishing cats from dogs and recognising speech, to name just a few. As a result, most of the AI surrounding us in our day-to-day life can be referred to as Artificial Narrow Intelligence, or weak AI. Strong AI, in contrast, refers to human-like AI that can perform any intelligent task while learning continuously, forgetting selectively, adapting quickly to new tasks and making use of previous experience. These properties have only recently started receiving attention from AI researchers.
Why continual learning? The key to ever-changing scenarios
Forgetting and missing knowledge transfer are among the main challenges on the way from weak AI to strong AI. Unlike humans, who forget selectively, machines forget catastrophically. Accordingly, while a “baby learns to crawl, walk and then run” (~Dave Waters), AI would completely forget how to crawl once it learned how to walk, and it would forget how to walk once it learned how to run. Before reviewing possible solutions to the challenge of continual lifelong learning, let us consider a simple example of an AI-based clothes catalogue search. A machine learning model trained on a dataset containing clothing items from season (A) would perform extremely well when searching among this season’s (A) products. However, once the season changes, fashion trends might change as well, and new product categories, models and styles might be added to the catalogue (e.g. high heels instead of sneakers, long jackets instead of short jackets, etc.). The model trained on the data of the first season (A) would not perform well when searching through items that have been added in the new season. Moreover, simply training our model on the data from the new season would lead to catastrophic forgetting of the ability to search among the items of the previous season.
What are the common ways of solving forgetting?
One of the earliest techniques for mitigating catastrophic forgetting in ANNs is known as experience replay, or “rehearsal”. Continuing with our catalogue search example, in order to retain the information learned in the first season, the machine learning model is simply retrained from scratch on a mixture of data from both seasons, i.e. previously learned knowledge is replayed to the model trained on the data of the new season. In general, retraining the model every time the data distribution “shifts” leads to exploding data storage costs and growing effort to maintain intelligent systems, not to mention a dramatic reduction in system scalability. Finally, storing raw data from previous tasks can violate the data privacy requirements of real-world applications.
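The rehearsal idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rehearsal_training_set` and the toy catalogue items are hypothetical, and the actual model training is left out. The point is that every past season must be kept as raw data, so storage grows with each new season.

```python
import random

def rehearsal_training_set(stored_seasons, new_season, seed=0):
    """Mix the raw data of all stored seasons with the new season's data."""
    mixed = [example for season in stored_seasons for example in season]
    mixed += list(new_season)
    random.Random(seed).shuffle(mixed)  # shuffle so old and new data are interleaved
    return mixed

# Hypothetical toy data: (item id, class label) pairs.
season_a = [("sneaker_%d" % i, "sneakers") for i in range(3)]
season_b = [("heel_%d" % i, "high_heels") for i in range(2)]

stored = [season_a]                      # raw data kept from earlier seasons
train_b = rehearsal_training_set(stored, season_b)
stored.append(season_b)                  # storage keeps growing with every season

print(len(train_b), "examples to retrain on;", len(stored), "seasons stored")
```

A model retrained on `train_b` sees both seasons again, which is what prevents forgetting, at the price of storing and reprocessing all earlier data.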
In this context, many researchers have focused on simulating neural plasticity in ANNs, thereby reducing the need to store raw data (1,2,3,4,5,6). This is usually done in the so-called “task-incremental” setup, where every newly added chunk of data is considered a separate task and the task label is assumed to be available at test time. Coming back to the catalogue search example, this would require the season label (task label) to be included in each query; hence, classifying a given garment would require a priori information about the season it belongs to. Having such a task label automatically restricts the output of the model to the classes that belong to the assumed task; in our example, it restricts the model to one particular season. These assumptions can rarely be fulfilled in real-world applications.
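To make the role of the task label concrete, here is a minimal sketch. The class names, the `TASKS` mapping and the logits are made-up values for illustration, and `predict` is a hypothetical helper, not part of any of the cited methods. With a task label the prediction is restricted to that season's classes; without one the model has to choose among all classes seen so far.

```python
def predict(logits, task_classes=None):
    """Return the index of the highest-scoring class.

    If task_classes is given (task label known), only those classes compete;
    otherwise every class the model knows is a candidate.
    """
    candidates = range(len(logits)) if task_classes is None else task_classes
    return max(candidates, key=lambda c: logits[c])

CLASSES = ["sneakers", "short_jacket", "high_heels", "long_jacket"]
TASKS = {"A": [0, 1], "B": [2, 3]}   # season label -> class indices (assumed)

# Hypothetical model scores for a high-heel image; the old class "sneakers"
# scores highest, e.g. because of interference between seasons.
logits = [2.0, 0.1, 1.5, 0.3]

with_label = predict(logits, TASKS["B"])  # season B given: only classes 2 and 3 compete
without_label = predict(logits)           # no season given: all four classes compete

print(CLASSES[with_label], CLASSES[without_label])
```

In this toy case the task label hides the mistake: the restricted prediction is correct, while the unrestricted one is not.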
A separate line of work tackles a more realistic scenario. In this “class-incremental” scenario, the classification output of the model is extended continuously as new classes are learned. A common strategy here is to introduce a so-called generative memory component (e.g. 7,8,9). Instead of storing raw data, a generative model such as a GAN or VAE (see previous blogpost) is trained to generate the experience to be replayed. Hence, in the catalogue example, items of the first season (with their corresponding classes) would be generated and replayed to the model.
Existing generative memory approaches mostly rely on the idea of deep generative replay, where the generative model is repeatedly retrained on a mixture of currently available real data (new season) and replay episodes synthesised by the previous generator (past season). However, apart from being highly inefficient to train, these approaches are severely prone to an effect known as “semantic drifting”: the quality of the images generated at each memory replay depends on the previously generated images, which makes the approach susceptible to error propagation and results in a loss of quality and, ultimately, forgetting.
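The replay loop and the resulting drift can be illustrated with a deliberately tiny stand-in for a generative model. In the sketch below, which is an assumption-laden toy and not a GAN or VAE, the “generator” is just a mean and standard deviation per class, fitted to one-dimensional data. All names (`fit_generator`, `sample`, `generative_replay`) are hypothetical.

```python
import random
import statistics

def fit_generator(samples_by_class):
    """'Train' the toy generator: one (mean, std) pair per class."""
    return {c: (statistics.mean(xs), statistics.stdev(xs))
            for c, xs in samples_by_class.items()}

def sample(generator, n, rng):
    """Synthesise n replay samples for every class the generator knows."""
    return {c: [rng.gauss(mu, sigma) for _ in range(n)]
            for c, (mu, sigma) in generator.items()}

def generative_replay(tasks, n_replay=20, seed=0):
    rng = random.Random(seed)
    generator, history = {}, []
    for real_data in tasks:
        replayed = sample(generator, n_replay, rng)   # old classes: synthetic only
        training_data = {**replayed, **real_data}     # new classes: real data
        generator = fit_generator(training_data)      # retrain on the mixture
        history.append(dict(generator))
    return history

# One new class per task; class c is centred at c (toy data).
data_rng = random.Random(1)
tasks = [{c: [data_rng.gauss(float(c), 1.0) for _ in range(50)]} for c in range(4)]

history = generative_replay(tasks)
# How far the remembered mean of class 0 has moved from its first estimate.
drift = [abs(h[0][0] - history[0][0][0]) for h in history]
print(drift)
```

After the first task, class 0 is only ever re-estimated from samples of the previous estimate, so its statistics drift in a random walk. This is the same error propagation mechanism as in semantic drifting, in a much simpler setting.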
Proposed solution — Plasticity learning in a generative memory network
So far, we have learned that experience replay is a simple and useful strategy to overcome forgetting in ANNs in general, and particularly in the challenging “class-incremental” situation. Yet, this strategy is only applicable when the replay episodes are not kept as raw data but in the form of relevant and efficiently stored memory patterns.
To address this, in our recent work we proposed a method called Dynamic Generative Memory (DGM), an end-to-end trainable continual learning framework that simulates synaptic plasticity with learnable hard attention masks applied to the parameters of a generative network (GAN). Hard attention masking identifies the network segments that are essential for memorising the currently learned information and prevents them from being updated during future learning. The network is further incentivised to reuse previously learned knowledge stored in such “reserved” network segments, yielding positive forward transfer of knowledge. Hence, in our product catalogue example, knowledge about the catalogue items from the previous season could be effectively reused when learning about the new season’s items. All in all, DGM can learn new tasks without the need to replay old knowledge, which improves training efficiency and makes it more robust against catastrophic forgetting.
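The effect of a hard mask on parameter updates can be sketched as follows. This is a simplified illustration and not the DGM implementation: in DGM the masks are learned, whereas here the mask for the first task is simply assumed to be given, and the helper names `reserve` and `masked_sgd_step` are hypothetical.

```python
def reserve(reserved, task_mask):
    """Accumulate masks: a parameter stays reserved once any task claimed it."""
    return [r or m for r, m in zip(reserved, task_mask)]

def masked_sgd_step(weights, grads, reserved, lr=0.1):
    """Gradient step that leaves reserved parameters untouched."""
    return [w if r else w - lr * g
            for w, g, r in zip(weights, grads, reserved)]

weights = [0.5, -0.2, 0.8, 0.1]

# Assume that after task 1 the (binary) mask marks the first two parameters
# as essential for the knowledge learned so far.
task1_mask = [True, True, False, False]
reserved = reserve([False] * len(weights), task1_mask)

# While learning task 2, gradients arrive for all parameters, but only the
# free ones are updated. Reserved weights can still be used in the forward
# pass, which is what allows knowledge reuse.
grads = [1.0, 1.0, 1.0, 1.0]
updated = masked_sgd_step(weights, grads, reserved)
print(updated)
```

Because the reserved weights never change, whatever the generator learned for the first task is preserved exactly, which is why no replay to the generator is needed in this scheme.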
Consequently, DGM can generate informative and diverse samples of previously learned categories at any step of continual learning as displayed in the picture below. Replaying these samples to the task solving model (D) yields a model that can retain high classification performance on all classes that have been seen during the continual learning process.
Given a limited network size, it is inevitable that, with a growing number of tasks to learn, the model capacity is depleted at some point. This issue is aggravated when simulating neural plasticity with parameter-level hard attention masking. In order to guarantee enough capacity and constant expressive power of the underlying network, DGM keeps the number of “free” parameters (i.e. ones that can be effectively updated) constant by expanding the network with exactly the number of parameters that were reserved for the previous task. The key idea here is that, given positive forward transfer of knowledge (i.e. parameter reusability), the number of parameters reserved for new tasks should decrease over time and the network growth should saturate at a certain point.
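The bookkeeping behind this expansion rule is simple enough to sketch. The numbers below are made up for illustration (they are not results from the paper), and `simulate_growth` is a hypothetical helper: after each task the network grows by exactly the number of newly reserved parameters, so the number of free parameters stays constant.

```python
def simulate_growth(initial_size, reservations):
    """Track (total, free) parameter counts after each task.

    reservations[t] is the number of free parameters that task t reserves.
    """
    total, reserved_total = initial_size, 0
    trace = []
    for newly_reserved in reservations:
        free = total - reserved_total
        if newly_reserved > free:
            raise ValueError("a task cannot reserve more than the free capacity")
        reserved_total += newly_reserved
        total += newly_reserved            # expand by what was just reserved
        trace.append((total, total - reserved_total))
    return trace

# Assumed reservations that shrink over time, as expected when later tasks
# can reuse knowledge stored for earlier ones.
trace = simulate_growth(initial_size=1000, reservations=[400, 250, 120, 50, 10])
for task, (total, free) in enumerate(trace, start=1):
    print("after task %d: total=%d free=%d" % (task, total, free))
# Totals: 1400, 1650, 1770, 1820, 1830; free stays at 1000 throughout.
```

With shrinking reservations the total size levels off, which is the saturation behaviour described above. If every task reserved the same amount, the network would instead grow linearly.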
For technical details on the DGM method please refer to the full paper on arXiv.
Although it is still far from solving the issue of catastrophic forgetting entirely, and despite several limitations, DGM demonstrates efficient network growth and robustness against catastrophic forgetting in a challenging “class-incremental” setup. We believe that the presented research can help advance our understanding of continual learning, an essential ability on the way towards strong AI that is able to learn (and forget) adaptively and progressively over time.
Our work on lifelong learning is presented at CVPR 2019.
About the author: Oleksiy Ostapenko, an Associate Research Engineer in the SAP machine learning research team, works on the challenges of continual lifelong learning discussed in this post and in his paper, which will be presented at this year’s CVPR.