Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 2018, Pretraining of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/abs/1810.04805
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach: https://arxiv.org/abs/1907.11692