Blog: Dealing with Data Scarcity in Natural Language Processing
Machine learning is notoriously data-hungry. Generally speaking, the more labelled data you use to train your model, the better it gets. This need for mostly manual labelling in supervised machine learning has been a big hindrance to applying Natural Language Processing in industry. Countless specialized domains or low-resource languages exist for which little data is available. As a result, there is little hope of training supervised machine learning models for them. In this blog post, we’ll explore a number of ways to reduce this dependence of NLP models on large collections of manually labelled data and thereby widen the boundaries of NLP.
It is often said we live in the age of Big Data. This is true, of course: every day, enormous numbers of texts are created: e-mails, tweets, text messages, blogs, research papers, news articles, legislation, books on every conceivable topic — you name it, there are plenty of texts about it. This situation holds great promise for Natural Language Processing. On the one hand, all these texts are potential training data for us to train and tune our models on. On the other, NLP can be used to uncover information in this deluge of data and help people make sense of these texts.
However, when companies want to apply NLP to their specific task, they often find themselves in a fix. They may be drowning in data, but very little of it is readily usable for training NLP models. Their documents are often completely unstructured, lack any relevant metadata and are completely mismanaged. Many dreams of Artificial Intelligence have been shattered because such companies have only been able to collect a few example documents for the task they want to automate, and that’s it.
In many cases, this is a showstopper. Most of the successful NLP applications rely on supervised machine learning, which is notoriously data-hungry. In supervised learning, the model relies on labelled training examples to learn a mapping from the input (typically a text or a sentence) to the output (a label such as “positive” or “negative”, the answer to the input question, the translated input sentence, etc.). Traditionally, such training examples are collected by having subject matter experts label training data. Needless to say this is an effective, but difficult, time-consuming and expensive process.
However, recently there has been increased interest in how we can reduce this dependence of NLP models on such manual labelling. The first family of solutions focuses on the training data. They address the question of how we can collect more data while reducing the labelling effort. The second family of approaches turns to the models instead. They focus on how we can train better NLP models with less training data. In this article, we’ll explore these two approaches. Let’s start with the data first.
Data is king
Collecting and cleaning data is not the favourite pastime of your average NLPer. Yet, from an economic perspective it makes sense to focus on your data. There’s no denying AI algorithms and models are becoming increasingly commoditized. Pretrained models exist for many tasks, user-friendly machine learning libraries make training models accessible for many developers, and automated machine learning (AutoML) partly automates the task of data scientists and researchers.
Your real asset is your data. This is what sets you apart from your competitors, and the amount and quality of your data will determine whether or not your company is ready for AI. So, how can you create valuable training data without having to label every text in your database? While the range of options is endless, we’ll focus on three tried and tested solutions here: active learning, semi-supervised learning and weak supervision.
Like many of the semi-supervised approaches we’ll discuss below, active learning is an iterative procedure that grows your training data step by step. The difference with semi-supervised learning is that active learning still relies on manual labelling. However, it reduces the labelling effort by iteratively selecting informative training instances. One of the most popular methods is uncertainty sampling, where the algorithm picks those examples that the model at that iteration is most uncertain about.
Let’s take a simple example as an illustration. If you’re training a model for sentiment analysis, this model may be very quick to learn that “I hated this movie” is a negative sentence, and label it with a high confidence. At the same time, it may be very uncertain about the classification of other sentences with more infrequent words, such as “This movie was a stinker”. Active learning allows the model to acquire new information more quickly by having a human annotator label such difficult examples and adding them to the training set. Collecting training sets in this way usually brings down the labelling effort considerably.
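To make uncertainty sampling concrete, here is a minimal sketch of its "least confident" variant. The function name, the example sentences and the probabilities are all made up for illustration; in practice the probabilities would come from whatever model you are training.

```python
def least_confident(probabilities, k):
    """Return the indices of the k examples the model is least sure about."""
    # The confidence of a prediction is the probability of its most likely class.
    confidence = [max(p) for p in probabilities]
    # Rank example indices from least to most confident and keep the first k.
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]


# Hypothetical unlabelled pool with made-up model outputs:
# [P(negative), P(positive)] for each sentence.
pool = [
    "I hated this movie",
    "This movie was a stinker",
    "A wonderful film",
    "Not exactly a masterpiece",
]
probabilities = [
    [0.97, 0.03],
    [0.55, 0.45],
    [0.08, 0.92],
    [0.40, 0.60],
]

# These are the sentences we would send to the human annotator next.
to_label = [pool[i] for i in least_confident(probabilities, k=2)]
print(to_label)
```

With these numbers, the annotator gets the two sentences the model hesitates about, while the sentence it already classifies with 97% confidence stays in the pool.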
Several tools exist to support this active learning process. The most user-friendly is probably Prodigy, a web application that makes data labelling much more efficient. Prodigy relies on active learning to select the most informative training examples and continuously updates the model as you’re labelling. Moreover, its simple binary labelling choices make labelling a much more efficient process than it usually is.
Even active learning solutions can require a considerable labelling effort. That’s why some people turn to semi-supervised learning, which allows them to increase the size of their training data without needing manual labelling. This is possible by turning to the labels given by the model instead. Obviously it would be foolish to have a basic model annotate unlabelled data and expect its performance to go up when you retrain it on the resulting dataset. Instead, just as active learning selects the most informative examples, there are clever ways of selecting those examples that your model or set of models is likely to get right.
The most basic type of semi-supervised learning is self-training. Here we train one initial model and iteratively grow the dataset by having it label those training examples it is most confident about. This is almost the opposite of active learning, where we selected those examples our model was least confident about. This immediately highlights one problem with self-training: the examples that our model labels with the highest confidence are usually not the most informative ones. Additionally, this approach carries the risk that our imperfect model labels new examples incorrectly and doubles down on its errors.
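A single round of self-training could look roughly like the sketch below. Everything here is illustrative: the function names, the 0.9 threshold and the toy_predict_proba lookup, which merely stands in for a real model. In a real loop you would retrain the model on the enlarged training set after every round.

```python
def select_confident(probabilities, threshold=0.9):
    """Return (index, predicted_label) pairs for sufficiently confident predictions."""
    selected = []
    for i, p in enumerate(probabilities):
        label = max(range(len(p)), key=lambda c: p[c])  # most likely class
        if p[label] >= threshold:
            selected.append((i, label))
    return selected


def self_training_round(labelled, unlabelled, predict_proba, threshold=0.9):
    """Move confidently predicted examples from the unlabelled pool to the training set."""
    probabilities = [predict_proba(text) for text in unlabelled]
    chosen = dict(select_confident(probabilities, threshold))
    new_labelled = labelled + [(unlabelled[i], label) for i, label in chosen.items()]
    remaining = [text for i, text in enumerate(unlabelled) if i not in chosen]
    return new_labelled, remaining


def toy_predict_proba(text):
    # Stand-in for a trained model: made-up [P(negative), P(positive)] scores.
    scores = {
        "I hated every minute": [0.95, 0.05],
        "This movie was a stinker": [0.55, 0.45],
        "I loved it": [0.04, 0.96],
    }
    return scores[text]


labelled = [("I hated this movie", 0)]
unlabelled = ["I hated every minute", "This movie was a stinker", "I loved it"]
labelled, unlabelled = self_training_round(labelled, unlabelled, toy_predict_proba)
print(labelled)
print(unlabelled)
```

Note how the one sentence that would teach the model something new stays behind in the unlabelled pool, which is exactly the weakness described above.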
One way to select more informative instances is by training several models instead of one. In co-training, we train two models with mutually exclusive feature sets. At each iteration, we move the instances that are labelled with high confidence by one model to the training data of the other. Similarly, having several models can help us mitigate the risk of adding incorrectly labelled training data. In democratic co-learning, we train several models with different inductive biases, by choosing different algorithms or giving the models access to differently sampled training data. Here we iteratively add examples to the training data if the majority of models agree on the label.
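The agreement step in democratic co-learning can be sketched as a simple majority vote. This is a simplification for illustration only: the function name and the votes are made up, and the models' confidence scores are ignored.

```python
from collections import Counter


def majority_label(votes):
    """Return the label backed by a strict majority of models, or None if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Made-up predictions of three hypothetical models for three unlabelled examples.
votes_per_example = [
    ["pos", "pos", "neg"],
    ["neg", "neg", "neg"],
    ["pos", "neg", "neutral"],
]

# Only examples with a majority label would be added to the training data.
agreed = [majority_label(votes) for votes in votes_per_example]
print(agreed)
```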
With these more advanced approaches, semi-supervised learning can be very effective and provide a strong baseline for several NLP tasks. For a more extensive overview of possible techniques, we refer the reader to Sebastian Ruder’s blog post on this topic.
Active learning and semi-supervised learning can only leverage knowledge internal to the data. The machine is on its own, and may not be able to capture knowledge that is sometimes self-evident. That’s why it would be useful to integrate domain knowledge in the labelling process, by having domain experts assist the machine in labelling new data. This is a process we call weak supervision.
There are several ways of using domain knowledge to label new data. The simplest is probably to write heuristic rules. For example, in sentiment analysis for tweets, we can assume that most tweets that contain happy emojis like 😃 convey a positive emotion. In distant supervision, we rely on existing knowledge sources to label new data. This is done very often in relation extraction, where databases like DBpedia are used to collect known relations. For example, if a database contains the relation born_in(Barack Obama, Honolulu), we can use this information to label every sentence containing the words Obama and Honolulu with the born_in relation.
Training examples that are obtained in this way are not always labelled correctly. In the sentence “Barack Obama visits Honolulu”, there is no reference to his being born there. As a result, applying such simple rules blindly can lead to suboptimal results. Snorkel is a system for combining such potentially low-quality labelling functions and using them to train high-quality end models.
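As a rough sketch, such heuristic rules and distant supervision lookups can be written as small labelling functions. The names, the label constants and the one-entry knowledge base below are assumptions for illustration; they are not taken from Snorkel or any other library.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

HAPPY_EMOJI = "\U0001F603"  # the grinning face emoji

# Toy knowledge base for distant supervision: (subject, object) -> relation.
KNOWN_RELATIONS = {("Barack Obama", "Honolulu"): "born_in"}


def lf_happy_emoji(tweet):
    """Heuristic rule: a happy emoji suggests a positive tweet, otherwise abstain."""
    return POSITIVE if HAPPY_EMOJI in tweet else ABSTAIN


def distant_label(sentence):
    """Label a sentence with a known relation if it mentions both entities."""
    for (subject, obj), relation in KNOWN_RELATIONS.items():
        if subject in sentence and obj in sentence:
            return relation
    return None


print(lf_happy_emoji("Best day ever " + HAPPY_EMOJI))
print(distant_label("Barack Obama was born in Honolulu."))
print(distant_label("Barack Obama visits Honolulu"))
```

The last call returns born_in even though the sentence says nothing about a birthplace. That is precisely the kind of label noise that blind rule application introduces.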
Snorkel’s process is as follows. First, a developer writes labelling functions and evaluates them on a small set of labelled training data. Snorkel allows us to evaluate the accuracy and coverage of all our labelling functions, and their overlaps and conflicts with each other. Next, it trains a generative label model over these labelling functions that learns how best to combine them. Finally, this label model outputs probabilistic labels that we can use to train an end model.
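To give an idea of what coverage and conflicts mean, here is a plain-Python sketch over a small label matrix (rows are examples, columns are labelling functions, -1 means the function abstains). It does not use Snorkel's API, and the majority vote at the end is only a naive baseline, not the generative label model that Snorkel trains. The function names and the matrix are made up.

```python
from collections import Counter

ABSTAIN = -1


def coverage(label_matrix, j):
    """Fraction of examples on which labelling function j does not abstain."""
    return sum(row[j] != ABSTAIN for row in label_matrix) / len(label_matrix)


def conflict_rate(label_matrix, j):
    """Fraction of examples where function j disagrees with another voting function."""
    conflicts = 0
    for row in label_matrix:
        if row[j] == ABSTAIN:
            continue
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if any(v != row[j] for v in others):
            conflicts += 1
    return conflicts / len(label_matrix)


def majority_vote(row):
    """Naive combination: the most frequent vote, or ABSTAIN on ties and empty rows."""
    counts = Counter(v for v in row if v != ABSTAIN).most_common()
    if not counts or (len(counts) > 1 and counts[0][1] == counts[1][1]):
        return ABSTAIN
    return counts[0][0]


# Made-up outputs of three labelling functions on four examples.
label_matrix = [
    [1, 1, -1],
    [-1, 0, 0],
    [1, 0, -1],
    [-1, -1, -1],
]

for j in range(3):
    print(j, coverage(label_matrix, j), conflict_rate(label_matrix, j))
print([majority_vote(row) for row in label_matrix])
```

A learned label model improves on the majority vote by estimating how much each labelling function can be trusted, so that it does not have to give up on rows like the third one, where two functions contradict each other.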
Domain experts aren’t always developers, and vice versa. For that reason, some researchers have experimented with putting more power into the hands of the domain experts. BabbleLabble is one such experiment. Instead of relying on developers to write labelling functions, it asks human labellers to explain why they labelled an example in a certain way. A semantic parser then translates the natural-language explanation that the labeller provides to code. The resulting labelling functions are used by Snorkel to train a label model and label a larger training set. As the original paper shows, this approach can lead to a drastic reduction in the size of the original training set, without asking too much of the labeller. Solutions like this one aren’t widely applied yet, but they offer an interesting perspective on how we can attack the current bottlenecks in Machine Learning.
Approaches like active learning, semi-supervised learning and weak supervision allow us to create more labelled training data without needlessly increasing the labelling effort. If they’re applied correctly, this increased training set will result in a better model.
Models rule the world
So far we’ve focused on collecting more labelled training data. While the benefits are clear, in the last year or so another approach has taken center stage in NLP: transfer learning. In transfer learning, models leverage knowledge they have learnt from sources other than the labelled data for the task at hand. These sources can be labelled corpora for different tasks, but also large corpora of unlabelled texts, such as Wikipedia articles or extensive collections of texts crawled from the web.
Pretrained task-specific models
The most straightforward application of transfer learning consists in finetuning an existing model that has already been trained for a task similar to yours. Let’s say you need to do named entity recognition in newspaper articles. Named entity recognition is the task of automatically identifying the names of people, organizations, locations and similar “entities” in text. Because named entity recognition is such a common task, many pre-trained models exist that you can download and apply to your texts. However, because many of these models will have been trained on other texts than newspaper articles, their performance on newspaper article texts will likely be suboptimal.
For example, if you want to do named entity recognition in newspaper articles, it can make sense to start from a model that has already been trained for named entity recognition on a different type of texts. Although its performance on your texts will probably be lower than that on the original text type, it will already have learnt a lot about the type of contexts where people, organizations or locations are likely to be found. If you have a lot of labelled data yourself, you can retrain a model from scratch. You can be pretty confident you’ll outperform the off-the-shelf models. However, if your resources are limited, it would be a shame to ignore all the information in the off-the-shelf models. In this case, it makes much more sense to take a pre-trained model and finetune it on your data. You can have the best of both worlds: you can combine the existing knowledge in the pre-trained models with the idiosyncratic information in your data and obtain a state-of-the-art model without the need for enormous collections of labelled data.
Pretrained generic models
Obviously, you can only use the solution above if you can find a pretrained model for your language and task. As soon as you move away from traditional NLP tasks such as NER and sentiment analysis, or work in other languages, this can become problematic. There’s no need to despair, however. The most impactful research in transfer learning recently has focused on training generic models that can be adapted to a wide range of NLP tasks.
Unlike task-specific models, these generic models make use of unlabelled training data. This is great, because unlabelled data is what we have most of. In order to acquire knowledge from unlabelled data, most transfer learning approaches are trained for a language modelling task. In language modelling, the goal is to predict a word on the basis of its context. There’s no need to label any data: we know what the words in the context are. Texts are what we call self-labelled for this task. As they learn to predict the word on the basis of its context, language models pick up a lot about language: word meaning, syntax, co-reference, and so on. This knowledge is widely applicable to virtually any NLP task.
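The sketch below shows why texts are self-labelled for language modelling: every word in a raw text can serve as the target for the words that precede it. This is the next-word variant of the task, with naive whitespace tokenization; the function name and the context size are chosen for illustration only.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, target) pairs for next-word prediction."""
    tokens = text.lower().split()  # naive whitespace tokenization
    examples = []
    for i in range(1, len(tokens)):
        # The target is token i; the context is the (up to) context_size tokens before it.
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples


for context, target in next_word_examples("The cat sat on the mat"):
    print(context, "->", target)
```

A six-word sentence yields five training examples without anyone labelling anything, which is why this approach scales to corpora as large as Wikipedia.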
You could argue that we’ve been applying transfer learning ever since the deep learning revolution in NLP. This is true: pretrained word embeddings (word2vec, GloVe, fastText and the like) are a type of poor man’s transfer learning. Like the new flavour of approaches, they leverage knowledge from unlabelled data by learning word meaning on the basis of its context. However, the more recent transfer learning models go further than that. ELMo introduces contextualized word embeddings, which are sensitive to the individual context in which a word is used. ULMFit and BERT even pre-train a full neural network instead of just an embedding layer.
Let’s take ULMFit as an example. In the pre-training phase, ULMFit is trained on a large collection of unlabelled texts. To predict individual words, its last layer has as many output cells as there are words in the vocabulary. For each context it sees, it predicts the probability of all these words. In a second step, the language model can be finetuned on any additional unlabelled data you may have for your task. If you’re doing sentiment analysis and you have a collection of unlabelled reviews, you can make the language model more sensitive to the language in your particular task. Then, you need to convert the model for your particular task. This is done by replacing the final layer with a task-specific one. In sentiment analysis, this can be a sigmoid layer with a single cell or a softmax layer with two cells. The rest of the network design is kept intact. Finally, you can train this network on your labelled data.
The benefits of transfer learning are undisputed: since the introduction of these LM-based approaches, state-of-the-art results have been falling like dominoes. Thanks to their language knowledge, ULMFit, BERT and similar models make it possible to train a model on much less data. Howard and Ruder show, for example, that for some tasks ULMFit needs 100 times fewer labelled training instances than a model that has been trained from scratch.
It’s clear working on your data and models pays off. Thanks to semi-supervised learning and weak supervision you can make better use of unlabelled, task-specific training data. Thanks to transfer learning, you can even leverage unlabelled, generic training data. Obviously, this is not an either-or story. In March 2019, the Snorkel team achieved a new state-of-the-art score on the GLUE benchmark. This benchmark brings together nine sentence-level language understanding tasks, such as sentiment analysis, textual entailment, text similarity, etc. The team realized this score by combining traditional supervision, transfer learning (with BERT), multi-task learning (they trained a single model to predict multiple tasks), dataset slicing (they added a task head to their model for particularly difficult data slices) and ensembling (they combined several BERT models: cased/uncased and trained on different training/validation test splits). It’s clear creativity and diligence pay off.