Blog: BERT Meets GPUs
The Bidirectional Encoder Representations from Transformers (BERT) neural network architecture (Devlin et al., 2018) has recently attracted a lot of interest from researchers and industry alike, owing to the impressive accuracy it achieves on a wide range of tasks (see the GLUE leaderboard). To help the NLP community, we have optimized BERT to take advantage of NVIDIA Volta GPUs and Tensor Cores. With these optimizations, BERT-large can be pre-trained in 3.3 days on four DGX-2H nodes (a total of 64 Volta GPUs). We have fully open-sourced our data preparation scripts and training recipes so that others can reproduce the results of the original publication and build on them for further research or production use. The code is available on GitHub in our NVIDIA Deep Learning Examples repository, which contains several high-performance training recipes that use Volta Tensor Cores.
BERT is an unsupervised deep learning language model that requires only unannotated datasets for training. This makes it possible to train on far more data than approaches that rely on annotated data sources. BERT also captures language more fully by using bidirectional context across its layers of transformer encoders. BERT training consists of two steps: pre-training the language model in an unsupervised fashion on vast amounts of unannotated data, and then fine-tuning this pre-trained model for specific NLP tasks such as question answering, sentence classification, or sentiment analysis. Fine-tuning typically adds an extra layer or two for the specific task and further trains the model on a task-specific annotated dataset, starting from the pre-trained backbone weights. The end-to-end process is summarized in Figure 1, and the results are covered in the following sections.
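To make the "extra layer or two" concrete, here is a minimal pure-Python sketch of a classification head placed on top of a pre-trained backbone. It is an illustration only, not code from the repository: the names `init_head` and `classify`, the toy hidden size, and the stand-in `pooled` vector are all hypothetical, and a real fine-tuning run would also update the backbone weights.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_head(hidden_size, num_labels, seed=0):
    """Create the task layer. Unlike the backbone, it starts from random weights."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
               for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weights, bias


def classify(pooled_output, head):
    """Map the backbone's sentence representation to label probabilities."""
    weights, bias = head
    return softmax(linear(pooled_output, weights, bias))


# Hypothetical stand-in for the vector a pre-trained backbone would produce.
HIDDEN_SIZE = 8  # toy value for illustration; BERT-large uses 1024
pooled = [0.1 * i for i in range(HIDDEN_SIZE)]

head = init_head(HIDDEN_SIZE, num_labels=2)
probs = classify(pooled, head)
print(probs)
```

The point of the sketch is the division of labor: the backbone supplies a representation learned during pre-training, and only the small head is new for each task.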
The exact steps used to prepare a raw corpus for BERT training can affect the final prediction accuracy. We therefore provide fully automated scripts that download, extract, format, and preprocess the data into TFRecords suitable for pre-training, so that the results we present here can be reproduced. Each step is decoupled from the others, so you can add further steps or additional data to suit your application.
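The idea of decoupled preparation steps can be sketched as a list of small functions that each take the previous step's output. This is a hedged illustration, not the published scripts: the function names are hypothetical, the sentence splitter is deliberately naive, and the "one sentence per line, blank line between documents" layout is an assumption about a convenient intermediate format. Writing TFRecords is omitted.

```python
import re


def extract_text(raw_docs):
    """Strip surrounding whitespace and drop empty documents."""
    return [d.strip() for d in raw_docs if d.strip()]


def split_sentences(docs):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [[s for s in re.split(r"(?<=[.!?])\s+", d) if s] for d in docs]


def format_corpus(docs_as_sentences):
    """Emit one sentence per line, with a blank line between documents."""
    return "\n\n".join("\n".join(sents) for sents in docs_as_sentences)


def run_pipeline(data, steps):
    """Run each step on the output of the previous one."""
    for step in steps:
        data = step(data)
    return data


# Steps are independent, so one can be swapped out or a new one inserted.
STEPS = [extract_text, split_sentences, format_corpus]

raw = [
    "  BERT is a language model. It is pre-trained on text.  ",
    "",
    "Wikipedia is large! Is it enough?",
]
corpus = run_pipeline(raw, STEPS)
print(corpus)
```

Adding a domain-specific cleaning step would then amount to inserting one more function into `STEPS`.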
The bulk of the pre-training data is derived from Wikipedia, and we use the full English dataset. BooksCorpus provides a broader range of training input since Wikipedia contains few first-person narrative passages and conversations. This combination serves to provide a good, general-purpose set of pre-trained weights to use with your fine-tuning tasks. Accuracy may be further improved by incorporating data sources more closely related to your task.
While existing pre-trained weights are available, you may want to pre-train BERT on your own data related to the fine-tuning task at hand. For example, biomedical literature contains many domain-specific proper nouns, and question answering accuracy can be improved by nearly 10% by pre-training on relevant data sources (Lee et al., 2019).
Mixed precision training offers up to a 3x speedup by performing many operations in half-precision format using the Tensor Cores in the Volta and Turing architectures. In TensorFlow, automatic mixed precision can be enabled with only a couple of lines of code. We have also found that the XLA JIT compiler provides a significant speedup when used in conjunction with mixed precision.
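As a rough sketch of what "a couple of lines" can look like, the snippet below sets two environment variables before TensorFlow is imported. `TF_ENABLE_AUTO_MIXED_PRECISION` is the switch NVIDIA documents for its NGC TensorFlow containers, and `TF_XLA_FLAGS=--tf_xla_auto_jit=2` is TensorFlow's XLA auto-clustering flag. Whether a given TensorFlow build honors each variable, and whether the training scripts in the repository use these variables rather than their own command-line options, are assumptions to check against the relevant documentation. The helper name `enable_amp_and_xla` is hypothetical.

```python
import os


def enable_amp_and_xla(env=os.environ):
    """Request automatic mixed precision and XLA via environment variables.

    Assumption: the TensorFlow build in use reads these variables. Setting
    them before TensorFlow is imported is the conservative choice.
    """
    # Automatic mixed precision graph rewrite (NVIDIA NGC TensorFlow containers).
    env["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
    # XLA JIT auto-clustering; note this overwrites any existing TF_XLA_FLAGS.
    env["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"
    return env


enable_amp_and_xla()
# import tensorflow as tf  # import only after the variables are set
```

Keeping the switches outside the model code means the same training script can be run with and without them when comparing loss curves and throughput.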
The test results in Table 1 show that our recipe (data preparation, pre-training, fine-tuning, and inference) with BERT-large and mixed precision on Volta GPUs reproduces the accuracies reported in the original paper. The results were obtained with TensorFlow 1.13.1 in the 19.03-py3 NGC container with XLA enabled.
Figure 2 shows training loss versus training steps for BERT-large; the runs with and without mixed precision follow identical curves. Mixed precision training with Tensor Cores provides a significant speedup: training takes about 6.8 days without mixed precision, about 4.5 days (1.5x faster) with automatic mixed precision enabled, and only 3.3 days (2.1x faster) with the XLA compiler enabled as well. For this experiment, we used a batch size of 4 per GPU, which gives a global batch size of 256 and closely mimics the original published results. Recent research has shown that BERT can train successfully with larger batch sizes, which leaves room for additional speedup. For instance, while the speedup at batch size 4 per GPU is 2.1x, the speedup at batch size 8 per GPU is about 3x.
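The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in this post (64 GPUs, batch size 4 per GPU, and the three training times); the variable names are ours.

```python
# Hardware and batch configuration quoted in the text.
NUM_GPUS = 64          # four DGX-2H nodes
PER_GPU_BATCH = 4
global_batch = NUM_GPUS * PER_GPU_BATCH

# Time to pre-train BERT-large, in days, for each configuration.
DAYS = {"fp32": 6.8, "amp": 4.5, "amp_xla": 3.3}

# Speedup relative to the single-precision baseline, rounded to one decimal.
speedup = {name: round(DAYS["fp32"] / days, 1) for name, days in DAYS.items()}

# Fraction of the baseline training time that remains with AMP and XLA.
remaining_fraction = DAYS["amp_xla"] / DAYS["fp32"]

print(global_batch)        # 256
print(speedup)             # {'fp32': 1.0, 'amp': 1.5, 'amp_xla': 2.1}
print(remaining_fraction)  # a little under 0.5
```

The last value confirms that the combined configuration brings training time to slightly under half of the single-precision time.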
Fine-tuning and Inference
In our BERT repository, we also provide scripts that fine-tune a pre-trained BERT model for the SQuAD question answering task using mixed precision. These scripts were used to generate the fine-tuning results labeled "Present Work" in Table 1. Our peak fine-tuning throughput is approximately 143 sentences/second at a global batch size of 160 sequences on a single DGX-1 system with 8 V100-32GB GPUs. Our inference performance on a single V100-32GB GPU is 119 sentences/second.
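For a sense of scale, the throughput figures above can be broken down per GPU and per training step. This is a back-of-the-envelope sketch: it assumes the global batch is split evenly across the 8 GPUs and that "sentences" and "sequences" count the same unit, neither of which is stated explicitly here.

```python
# Fine-tuning figures quoted in the text (single DGX-1, 8 GPUs).
GLOBAL_BATCH = 160           # sequences per training step
NUM_GPUS = 8
TRAIN_THROUGHPUT = 143.0     # sentences per second, whole system
INFER_THROUGHPUT = 119.0     # sentences per second, single GPU

# Assumption: the global batch is divided evenly across GPUs.
per_gpu_batch = GLOBAL_BATCH // NUM_GPUS

# Time for one training step at peak throughput, in seconds.
step_time_s = round(GLOBAL_BATCH / TRAIN_THROUGHPUT, 2)

# Training throughput attributable to each GPU.
per_gpu_train_throughput = round(TRAIN_THROUGHPUT / NUM_GPUS, 1)

print(per_gpu_batch)             # 20
print(step_time_s)               # 1.12
print(per_gpu_train_throughput)  # 17.9
```

Per GPU, inference is much faster than training, as expected, since inference skips the backward pass and weight updates.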
Conclusion and Additional Work
We have optimized time-to-solution for end-to-end BERT training using GPUs and mixed precision training with Tensor Cores. Mixed precision training achieved the same accuracy as single-precision training while substantially speeding up the process: time to train was reduced to less than half of the single-precision time. All of the scripts and recipes needed to reproduce our work using TensorFlow, including data preparation, have been published on GitHub. Additionally, Megatron-LM is a PyTorch repository for large language model research that can be used to train BERT, and NVIDIA will continue to update it with additional large-scale language modeling research. Finally, the BERT in GluonNLP for MXNet repository also supports training BERT with mixed precision on GPUs. We hope that these GPU resources help accelerate research and development in the NLP community.
Christopher Forster; Thor Johnsen; Swetha Mandava; Sharath Turuvekere Sreenivas; Deyu Fu; Julie Bernauer; Allison Gray; Sharan Chetlur
Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
GLUE leaderboard.
Lee, Jinhyuk, et al. "BioBERT: a pre-trained biomedical language representation model for biomedical text mining." arXiv preprint arXiv:1901.08746 (2019).