Google beats the records in NLP with their new XLNet

Google released a record-breaking language model called XLNet. What is a language model, and why is Google investing so much in Natural Language Processing research?

When will computers learn to understand us? 

It’s difficult to get computers to do what we want them to. They have no common sense and therefore take everything literally. Today’s post is about how researchers are using neural networks to get computers to understand the world better. Neural networks are computer programs that learn things from data. We describe neural networks in more detail in our blog post, “What is a neural network.”

To teach neural networks to understand human language, researchers train them to guess the next word in a sentence, a bit like iPhones and Huey, Dewey, and Louie do:


Donald Duck: Counter Spy (Cheerios premium giveaway, 1947) – © Disney


Such neural networks are called language models, and they are crucial for getting computers to understand humans.

Why are language models so important?

Language models are important for three main reasons:

1. It requires much knowledge of the world to guess the right word.

It might seem simple to you to guess the next word in a sentence, but it is very difficult for computers. If I say, “I am not following my vegan diet, sometimes I eat a little …” and you want to guess that the next word is “turkey,” you must know what it means not to be a vegan.

2. There is almost unlimited training data available.

Language models can be trained with any text. We can create a training example from any sentence simply by hiding its last word and keeping that word as the correct answer. Today’s language models learn from texts that are thousands of books long.
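As a rough illustration of the idea above, here is a minimal Python sketch that turns sentences into training examples by hiding the last word. The function name make_training_example and the sample sentences are our own choices for illustration. Real language models learn from far larger texts and usually split text into pieces smaller than whole words.

```python
def make_training_example(sentence: str) -> tuple[str, str]:
    """Split a sentence into (visible context, hidden last word)."""
    # Drop the final punctuation mark so it does not stick to the last word.
    words = sentence.rstrip(".!?").split()
    if len(words) < 2:
        raise ValueError("need at least two words to hide one")
    return " ".join(words[:-1]), words[-1]


# Hypothetical sample sentences, chosen only for illustration.
sentences = [
    "I eat spaghetti with meatballs.",
    "The ball bounced because it was made of rubber.",
]

examples = [make_training_example(s) for s in sentences]
for context, target in examples:
    # The model would see the context and be asked to guess the target.
    print(f"{context} ___  ->  {target}")
```

Because every sentence in every book or web page can be processed this way, no human has to label the data by hand.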

3. Language models can pass their knowledge on to other neural networks.

Once a language model has learned to guess the next word in a sentence, it has acquired lots of knowledge about the world. When the language model is connected to another neural network, that knowledge is shared, so the second network also gains knowledge about the world.

It all might sound complicated, but it’s just like two people sharing information.

So what are researchers working on today?

The idea of passing information from one neural network to another has been around for a long time and has been used to understand images for many years.  

In January 2018, a language model called “ULMFiT” showed that this technique also works very well for text. Soon after, another language model called “GPT” improved on this idea by using a more advanced kind of neural network called a “Transformer neural network.”

Then, a language model called “BERT” did even better by not just guessing the next word in a sentence but also guessing words in the middle of sentences, such as guessing that X is “spaghetti” in the sentence “I eat X with meatballs.”
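To make the difference from next-word guessing concrete, here is a small Python sketch that hides a word in the middle of a sentence instead of at the end. The helper name mask_word and the placeholder “X” are assumptions made for this example and are not taken from BERT’s actual code.

```python
def mask_word(sentence: str, index: int, mask: str = "X") -> tuple[str, str]:
    """Replace the word at position `index` with a placeholder.

    Returns the masked sentence and the hidden word.
    """
    words = sentence.split()
    target = words[index]
    words[index] = mask
    return " ".join(words), target


masked, answer = mask_word("I eat spaghetti with meatballs", 2)
print(masked)  # I eat X with meatballs
print(answer)  # spaghetti
```

The model can now use the words on both sides of X to make its guess, rather than only the words that come before it.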

Recently, the same team that made “GPT” released a second version, “GPT-2,” which is even better because it was trained for longer, on much more data, and on higher-quality data.


A BERT language model example that allows chatbots and conversational AI to understand context.


Google has now set a new record with the new XLNet 

Each new model has read more text than the one before. The “ULMFiT” model had read about 1,000 books, and the new GPT-2 model has read about 10,000 books. Learning from so many books requires a lot of computing power: it is estimated that renting enough computers to train the GPT-2 model would cost around $50,000.

So what is the Google XLNet? 

XLNet is a general language model just like ULMFiT, BERT, and GPT. XLNet beats their performance by using several neat tricks:

  • BERT used a neural architecture called the “Transformer.” However, Google has subsequently released an update to the Transformer called Transformer-XL. The Transformer-XL architecture is better at handling long, complicated sentences.
  • Rather than just guessing X in “I eat X with meatballs,” XLNet also guesses X in shuffled sentences such as “I X with meatballs eat” and “X meatballs eat I with.”
  • XLNet is “autoregressive,” whereas BERT is an “autoencoder.” Autoregressive models are better at generating new text, whereas autoencoders are better at reconstructing text they have already learned from.
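The shuffling trick can be pictured with a few lines of Python. This is only a sketch of the idea as described above: it lists every possible ordering of the example sentence. As far as we understand, the real XLNet does not literally reorder the input text but changes the order in which words are predicted inside the network, so treat this as an analogy rather than the actual implementation.

```python
from itertools import permutations

words = "I eat X with meatballs".split()

# Every possible ordering of the five words (5! = 120 in total).
orders = [" ".join(p) for p in permutations(words)]

print(len(orders))  # 120

# The two shuffled sentences mentioned above are among them.
for example in ("I X with meatballs eat", "X meatballs eat I with"):
    print(example, "->", example in orders)
```

Even a five-word sentence gives 120 different orderings, which hints at how much extra training signal a model can get from a single sentence.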

What will happen in the future?   

Language models are likely to keep becoming better and better. The internet is large enough that there is still room for making computers that read even more text. Many researchers are working on this problem, so we will almost certainly see improvements to the models used as well.

Even once we get to a point where language models have read the entire internet, they will still lack much of the common sense that we humans take for granted. Humans rely on a lot of context and background knowledge when communicating.

One particularly interesting approach to this problem is using “knowledge bases.” Knowledge bases are large collections of facts about the world, such as “A dog is a mammal” and “Mammals are animals.” The Chinese search engine Baidu has released its own language model, which outperforms BERT by incorporating such knowledge into the model. Having beaten BERT, they snarkily decided to name their model “Enhanced Representation through kNowledge IntEgration,” or ERNIE for short.
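To give a feel for what a knowledge base is, here is a tiny Python sketch that stores the two facts above and chains them together. The dictionary layout and the function name ancestors are our own assumptions for illustration. This is not how ERNIE feeds knowledge into its neural network; it only shows how simple facts can be combined into new conclusions.

```python
# A toy knowledge base: each entry reads "<key> is a <value>".
is_a = {
    "dog": "mammal",
    "mammal": "animal",
}


def ancestors(concept: str, facts: dict[str, str]) -> list[str]:
    """Follow "is a" facts upwards and return every category found."""
    chain = []
    seen = set()  # guards against facts that loop back on themselves
    while concept in facts and concept not in seen:
        seen.add(concept)
        concept = facts[concept]
        chain.append(concept)
    return chain


print(ancestors("dog", is_a))  # ['mammal', 'animal']
```

Nobody told the program directly that a dog is an animal, yet it can work that out from the two facts it was given.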

Rivalry among research teams might seem petty, but it is also pushing the research community to new levels of performance. Natural Language Processing is moving forward at an impressively fast pace. Make sure to check out our blog frequently to stay up to date with the latest developments.

Update: Baidu just released ERNIE 2.0, a record-breaking language model that sets a new standard for what is possible with general-purpose language models.

We mentioned above how the Baidu team released a language model named ERNIE. Today, Baidu has released ERNIE 2.0, an update to ERNIE that beats XLNet. The Baidu team were kind enough to let us have a preliminary view of their research, which is now publicly available.

What does it mean to “beat the records”?

Language models compete in competitions called “benchmarks”. One of the most commonly recognized benchmarks is the GLUE (General Language Understanding Evaluation) benchmark.

The GLUE benchmark tests language models on nine different tasks. The tasks test the language model’s ability to perform intelligent operations such as:

– Annotate sentences with grammatical information. For example, can it correctly detect the nouns, verbs, and adjectives of a sentence, and how they relate to one another?

– Determine the sentiment of a sentence. For example, is the sentence a negative statement about something, or is it perhaps a positive movie review?

– Understand the entailment between sentences. For example, does the sentence “If you help the needy, God will reward you” entail that “Giving money to a poor man has good consequences”?

– Know what coreference words such as “it” and “they” refer to. For example, does the language model know that “it” refers to the ball in a sentence such as “The ball bounced because it was made of rubber”?

– Detect if one sentence is a paraphrase of another. For example, does the language model know that “He said that the witness distorted evidence” means the same as “The witness was accused of distorting evidence by him”?

– Tell if one sentence is the answer to another specific question. For example, can the language model tell if “About 5 meters” is a reasonable answer to the question “How tall is a giraffe?”?

Running a language model such as ERNIE 2.0 on the GLUE tasks results in a test score. The better the score, the higher the language model will be ranked. Academic articles usually include a scoring table that shows the performance of a new model compared to the previous state-of-the-art models.
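A scoring table of this kind boils down to simple arithmetic. The Python sketch below assumes that the overall score is a plain average of the per-task scores. The model names (model_a, model_b), the task labels, and all the numbers are invented for illustration and are not real benchmark results.

```python
from statistics import mean

# Hypothetical per-task scores (invented numbers, not real results).
task_scores = {
    "model_a": {"sentiment": 90.0, "paraphrase": 85.0, "entailment": 80.0},
    "model_b": {"sentiment": 92.0, "paraphrase": 88.0, "entailment": 84.0},
}

# Assumption: the overall score is the plain average across tasks.
overall = {name: mean(scores.values()) for name, scores in task_scores.items()}

# Rank the models from highest to lowest overall score.
ranking = sorted(overall, key=overall.get, reverse=True)

for position, name in enumerate(ranking, start=1):
    print(position, name, overall[name])
```

With these made-up numbers, model_b averages 88.0 and is ranked above model_a, which averages 85.0.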

What makes ERNIE 2.0 significant?

Most general-purpose language models are trained in a two-step process. First, the language model is trained on a massive number of books and internet articles, such as the entire English Wikipedia, or even a big part of the internet. After the initial training, the language model is fine-tuned on a smaller set of texts that relate to the target domain. In the case of chatbots, this means training the model on a smaller set of conversations.
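The two-step process can be imitated with a toy model. The Python sketch below is not a neural network: it is a simple counter that remembers which word tends to follow which. The training texts and the function names train and guess_next are made up for this example. The point is only to show how a second, smaller round of training on domain text can change what the model guesses.

```python
from collections import Counter, defaultdict


def train(model, text):
    """Count, for every word, which words follow it in the text."""
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1


def guess_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word)
    return followers.most_common(1)[0][0] if followers else None


model = defaultdict(Counter)

# Step 1: "pre-train" on general text (a made-up stand-in for Wikipedia).
train(model, "the river bank of the town . the bank of the river")
before = guess_next(model, "bank")

# Step 2: "fine-tune" on domain text (made-up chatbot conversations).
train(model, "my bank account is locked . open a bank account . check my bank account")
after = guess_next(model, "bank")

print(before)  # of
print(after)   # account
```

After the second step, the toy model has adapted to the banking domain while still keeping everything it counted in the first step.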

ERNIE 2.0 also starts by training on a massive number of books and internet articles. However, rather than just fine-tuning on the target domain, ERNIE 2.0 can pre-train on an unlimited number of subtasks. Such subtasks can teach ERNIE to become better at distinguishing between nouns and verbs, between questions and statements, or even between fact and fiction.

So what happens next?

The code and model of ERNIE 2.0 have been publicly released. In the coming weeks, the data science team at BotXO will run experiments with the model to see what can be achieved with the current state-of-the-art performance in NLP.

In the future, we can be sure that new general-purpose language models will improve NLP performance even further. Be sure to regularly check our blog to stay up to date with the latest news.