Blog: The Latest in AI Research — Issue #7
The AI Research newsletter by Daniël Heres
Papers of Interest
Learning and using contextual representations for Natural Language Processing has recently become popular, following the development of techniques like ELMo, ULMFiT and BERT. These techniques tackle the problem of learning to represent a sentence or document. This representation can then be transferred to other tasks, such as text classification or question answering. What the models have in common is that they are all language models: models that predict one or more hidden words or sentences given their context. Such models can be used to generate text by sampling one word at a time from the model until a full sentence or document has been generated.
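To make the idea of step-by-step sampling concrete, here is a minimal sketch using a toy bigram model, which predicts the next word from the previous word only. It is far simpler than ELMo, ULMFiT or BERT; the corpus and the function names are made up for illustration.

```python
import random
from collections import defaultdict


def train_bigram_model(corpus):
    """Count how often each word follows each other word (toy language model)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_sentence(counts, rng, max_words=20):
    """Generate text by repeatedly sampling the next word given the previous one."""
    word, words = "<s>", []
    for _ in range(max_words):
        followers = counts[word]
        # Sample proportionally to how often each follower was observed.
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == "</s>":
            break
        words.append(word)
    return " ".join(words)


# Illustrative corpus, not taken from any of the papers.
corpus = [
    "the model predicts the next word",
    "the model generates a sentence",
    "a sentence is a sequence of words",
]
counts = train_bigram_model(corpus)
print(sample_sentence(counts, random.Random(0)))
```

Real language models replace the count table with a neural network that conditions on a much longer context, but the sampling loop has the same shape.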
In this work, the authors from OpenAI scale up a transformer-based language model in both model size, measured by the number of parameters (1.5 billion), and data set size: 40GB of text from links shared on Reddit, filtered to have a minimum number of upvotes. They show that increasing the size of the model increases its accuracy considerably.
Interestingly, the paper shows they don’t even have to fine-tune to achieve good or even state-of-the-art performance on a range of other NLP tasks. For example, simply appending the text “TL;DR:” (which stands for “too long; didn’t read”) after a document and sampling a number of words gets a good result on a summarization task.
Partly because OpenAI didn’t release the weights of the bigger model (releasing them is the norm in this subfield), the work got a lot of attention in the media, with titles like THE AI TEXT GENERATOR THAT’S TOO DANGEROUS TO MAKE PUBLIC (Wired). After a heated online debate, a live discussion hosted by “This Week in Machine Learning & AI” was streamed on YouTube.
Data augmentation is used a lot in machine learning to learn more accurate models. For example, in computer vision, images can be rotated or mirrored, their contrast can be changed, and so on. These transformations help prevent the model from fitting only the training data (also called overfitting) and allow it to perform better on data points it has never seen.
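The three image transformations mentioned above can be sketched in a few lines. This is a minimal illustration on a tiny grayscale image stored as nested lists; real pipelines would use an image library, and the function names here are made up.

```python
def mirror(image):
    """Flip an image left to right."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_contrast(image, factor, mid=128):
    """Push pixel values away from (factor > 1) or towards (factor < 1) mid-gray,
    clipping the result to the valid range 0..255."""
    return [
        [min(255, max(0, round(mid + factor * (p - mid)))) for p in row]
        for row in image
    ]


# A 2x2 grayscale "image" with values in 0..255 (illustrative data).
image = [[0, 64], [128, 255]]

# Each augmented copy keeps the label of the original training example.
augmented = [mirror(image), rotate90(image), adjust_contrast(image, 1.5)]
```

During training, one of these transformations is typically applied at random each time an example is drawn, so the model rarely sees exactly the same input twice.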
In this article, the authors show a number of ways to augment data for speech recognition, such as warping and masking audio features. By combining this data augmentation with a well-performing speech recognition model, the authors improve on existing state-of-the-art results on the LibriSpeech and Switchboard datasets.
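As a rough sketch of the masking part (warping is left out), the function below zeroes one random band of time steps and one random band of frequency bins in a spectrogram. The function name, the parameters and the choice of zero as the fill value are assumptions for illustration, not the authors' implementation.

```python
import random


def mask_spectrogram(spec, max_time_width, max_freq_width, rng):
    """Return a copy of `spec` (time steps x frequency bins) with one band of
    time steps and one band of frequency bins set to zero."""
    n_time, n_freq = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched

    # Time masking: blank out a random block of consecutive time steps.
    t_width = rng.randint(0, max_time_width)
    t_start = rng.randint(0, n_time - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [0.0] * n_freq

    # Frequency masking: blank out a random block of consecutive frequency bins.
    f_width = rng.randint(0, max_freq_width)
    f_start = rng.randint(0, n_freq - f_width)
    for row in out:
        for f in range(f_start, f_start + f_width):
            row[f] = 0.0
    return out


# Illustrative input: 10 time steps, 8 frequency bins, all ones.
spec = [[1.0] * 8 for _ in range(10)]
masked = mask_spectrogram(spec, max_time_width=3, max_freq_width=2,
                          rng=random.Random(0))
```

The mask widths must not exceed the spectrogram dimensions; a fresh random mask is drawn every time an utterance is used for training.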
Transformers, and models with attention more generally, have become popular recently and are part of many state-of-the-art models, especially in NLP. Attention is an algorithmic concept that allows a machine learning model to learn from sequential data. The attention weights are multiplied with the input data or intermediate representations, allowing the model to “focus” on certain time steps more than on others. For the interested reader, see this informative background article, which explains the relation to the idea of routing information and to the older idea of capsule networks.
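For readers who prefer code, here is a minimal sketch of scaled dot-product attention for a single query, the variant used in Transformers. The vectors are made-up toy values; in a real model the queries, keys and values are learned projections of the input.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attention weights (one per time step) and the weighted
    average of the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Three time steps with 2-dimensional keys and values (illustrative numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, output = attend([4.0, 0.0], keys, values)
```

The query points in the same direction as the first key, so the first and third time steps receive more weight than the second; the output is pulled towards their values.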
However, as the authors argue, self-attention over a sequence of length N needs an N * N attention matrix, which requires a lot of computation and memory. They therefore propose sparse attention: each element attends only to a subset of the other elements in the sequence, which reduces the cost to roughly N * square root of N. The model can now be applied to problems with longer sequences, such as images and audio, and it learns sparser patterns, which the authors suggest may be beneficial. The model achieves state-of-the-art performance on a couple of datasets, often with fewer parameters. The authors share the code for running the sparse attention kernels efficiently on a GPU.
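The saving can be illustrated by counting attended positions. The sketch below builds a simplified strided mask in the spirit of the paper: each position looks at a local window of recent positions plus every stride-th earlier position, with the stride set to the square root of N. The exact pattern and the names are assumptions for illustration, not the authors' kernels.

```python
import math


def full_attention_pairs(n):
    """Number of (query, key) pairs when every position attends to itself
    and to all earlier positions."""
    return n * (n + 1) // 2


def strided_attention_mask(n, stride):
    """Causal mask in which position i attends to the most recent `stride`
    positions (including itself) and to every earlier position that lies a
    multiple of `stride` steps back."""
    return [
        [j <= i and (i - j < stride or (i - j) % stride == 0) for j in range(n)]
        for i in range(n)
    ]


n = 256
stride = math.isqrt(n)  # 16, so each row has on the order of sqrt(n) entries
mask = strided_attention_mask(n, stride)
sparse_pairs = sum(sum(row) for row in mask)

print("full:", full_attention_pairs(n), "sparse:", sparse_pairs)
```

Every row of the mask contains fewer than 2 * sqrt(N) attended positions, so the total grows with N * sqrt(N) rather than N * N.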
To get models to learn tasks with good accuracy, we often need data sets with a large number of human-annotated labels. Unsupervised learning holds the promise of clustering and finding similarities in data automatically, but as of today it sees relatively little use outside data exploration. This is mainly because supervised learning usually leads to models that are much more accurate.
In this work, a new unsupervised method is proposed that achieves surprisingly good performance on classification and segmentation tasks. The basic idea is to apply a simple transformation, such as mirroring the image, and train the model so that its outputs for the original and the transformed version of the same data point carry the same information.
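One way to make "the same information" measurable is the mutual information between the cluster assignments predicted for each original data point and for its transformed copy. The sketch below assumes that this is the quantity being maximized and that the model outputs a probability distribution over clusters; the function name and the toy predictions are illustrative.

```python
import math


def mutual_information(probs_a, probs_b):
    """Mutual information between paired soft cluster assignments.

    probs_a[i] and probs_b[i] are the predicted cluster distributions for the
    original and the transformed version of data point i (assumed setup).
    """
    n, k = len(probs_a), len(probs_a[0])
    # Joint distribution over (cluster of original, cluster of transformed).
    joint = [
        [sum(pa[r] * pb[c] for pa, pb in zip(probs_a, probs_b)) / n
         for c in range(k)]
        for r in range(k)
    ]
    row = [sum(joint[r]) for r in range(k)]
    col = [sum(joint[r][c] for r in range(k)) for c in range(k)]
    return sum(
        joint[r][c] * math.log(joint[r][c] / (row[r] * col[c]))
        for r in range(k)
        for c in range(k)
        if joint[r][c] > 0
    )


# Consistent, balanced predictions: both views agree on the cluster.
consistent = [[1.0, 0.0], [0.0, 1.0]]
# Uninformative predictions: every point is spread evenly over both clusters.
uniform = [[0.5, 0.5], [0.5, 0.5]]

high = mutual_information(consistent, consistent)
low = mutual_information(uniform, uniform)
```

The score is high only when both views agree and the clusters are used evenly, which discourages the trivial solution of assigning every data point to the same cluster.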
Deep Reinforcement Learning is moving more and more from academic research towards practical use cases in robotics, self-driving cars and much more. One big downside, however, is the often enormous amount of simulation and the number of frames needed to train smart RL agents.
In this work, the authors propose an algorithm for meta-RL that allows an agent to adapt better to new, unknown environments, greatly improving sample efficiency, that is, reducing the number of data points needed to learn a task. You can find the results on a number of tasks and the link to the code on Papers With Code.
Colorful Code and Delicate Datasets
TensorFlow 2.0 is coming! TensorFlow 2.0 will clean up the TensorFlow API and remove tf.contrib, but will also focus on laying foundations for the next wave of ML. It will focus on bringing best practices and off-the-shelf models to TensorFlow, with support for distributed learning and with tf.keras as an important building block.
Snorkel MeTaL is an NLP framework that achieves high performance on the GLUE natural language understanding benchmark. The highest-performing submissions use both transfer learning and multi-task learning as a learning strategy, suggesting that this will be an important way of achieving high-performing models and predictions in the future.
In The Bitter Lesson, Richard Sutton shares his view on the “bitter” lessons we can learn from the history of machine learning. He mainly argues that we rely more and more on “brute force” computation, using learning and search over data and simulations, rather than on rules and human-crafted algorithms. The article received a lot of attention and also a number of responses, including Rodney Brooks’s “A Better Lesson” and Max Welling’s “Do we still need models or just more data and compute?”
In a live-streamed meeting with investors, Tesla showed their progress on and approach to developing self-driving cars. The presentations go into a lot of detail about the hardware they use, the models, and the way they learn from data collected by cars already driving on the roads.
OpenAI showed that their reinforcement learning Dota agents can beat professional Dota players. The agents are trained using relatively standard reinforcement learning algorithms, LSTMs, manual reward shaping and many hours of self-play (180 years in total). In public matches, the agents also won 99.4% of their games. A number of players did find a strategy that the agents couldn’t handle.