BotXO has trained the most advanced Danish AI language model yet. The BotXO BERT model has been fed with a staggering 1.6 billion Danish words and it is also available as open source.
Google’s BERT model is one of the best-known machine learning models for text analysis. Google has released English, Chinese and multilingual models, but BotXO is now the first to release an extensive open source Danish BERT model. Researchers from the University of Copenhagen and the Alexandra Institute have found that the language model performs better than models by Google, Zalando, and Stanford University. The conclusions of the study can be found in DaNE: A Named Entity Resource for Danish.
The code is downloadable from github.com/botxo.
Language models are constantly evolving
Remember our blog post about Google’s XLNet, where we explained how the Chinese company Baidu was beating Google’s BERT model with its new language model, ERNIE 2.0? We also mentioned that BotXO would continue to experiment with what can be achieved with the current state-of-the-art performance in Natural Language Processing (NLP).
This led BotXO to build upon Google’s BERT model, because any Danish private company, educational institution, NGO or public organisation in need of AI in Danish could benefit greatly from it. It also makes BotXO one of the very few companies in Denmark to improve and support Danish AI by publishing open source code. This is no small thing. On the one hand, the Danish BERT model is highly useful for the whole Danish AI and machine learning ecosystem. On the other hand, it encourages the whole industry to democratize AI by making new developments publicly available to everyone.
But first, what is a BERT model?
How much text has the Danish model read?
When working with AI language models, part of the challenge is collecting the huge amounts of text needed to build an extensive model. BotXO has overcome this obstacle by turning the model into a massive bookworm.
BotXO’s Danish BERT model has read 1.6 billion words, equivalent to more than 30,000 novels. Although this might sound like a lot, the model could have read even more; the limiting factor is that it is difficult to find much more publicly available Danish text.
What can a BERT model be used for?
The model’s general language understanding can serve as the first step in a text analysis pipeline. The model reads a text and returns a vector, which is a point in a coordinate system. The shorter the distance between the points returned for two different texts, the closer the texts are in meaning, so the vectors can be used to determine whether different pieces of text are related. By combining the model’s general language understanding with task-specific data, e.g. texts annotated as positive or negative, the BERT model can help with sentiment analysis, entity extraction and the other disciplines in Natural Language Processing.
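As a rough illustration of comparing texts by the distance between their vectors, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for this example and do not come from the BotXO model; real BERT vectors have hundreds of dimensions. The function names are also only illustrative.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (closer to 1.0 means more similar direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_text(query, vectors):
    """Return the text (other than the query) whose vector lies nearest to the query's vector."""
    others = [text for text in vectors if text != query]
    return min(others, key=lambda text: euclidean_distance(vectors[query], vectors[text]))

# Hypothetical vectors, invented for illustration only.
vectors = {
    "I like Chinese food": [0.9, 0.1, 0.3],
    "Spring rolls are my favourite dish": [0.8, 0.2, 0.4],
    "The train is delayed again": [0.1, 0.9, 0.2],
}

print(closest_text("I like Chinese food", vectors))
```

With these made-up numbers, the two food-related sentences sit close together and the sentence about the train sits far away, which is the behaviour one would hope for from real model output.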
The Danish BERT model can be used for sentiment analysis in Danish. For instance, it can detect different kinds of bias in a text, identify the purpose and context of a text, and point out relevant words. This is useful to multiple industries such as e-commerce, finance, the tech industry and the public sector.
Why is it so important to Denmark?
Why do we need a Danish BERT model?
Google has released a multilingual BERT model, but it is trained on more than a hundred different languages. Danish text, therefore, only constitutes 1% of the total amount of data. The model has a vocabulary of 120,000 words*, so it only has room for about 1,200 Danish words. BotXO’s model, on the other hand, has a vocabulary of 32,000 Danish words.
(* In fact, “words” is a bit inaccurate. In reality, the model splits rare words into smaller pieces, so that “Inconsequential”, for example, becomes “In-”, “-con-” and “-sequential”. Because many of these word pieces are shared between languages, there is room for more than 1,200 Danish “words” in Google’s multilingual model.)
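The word splitting described in the footnote can be sketched as a greedy, longest-match-first lookup against a fixed vocabulary. The tiny vocabulary below is invented for this example, and the "##" prefix on continuation pieces follows the convention of Google's published BERT vocabularies; this is a simplified sketch, not the tokenizer shipped with the Danish model.

```python
def split_into_subwords(word, vocabulary):
    """Greedily split one word into the longest vocabulary pieces available.

    Pieces that continue a word carry a "##" prefix.
    Returns ["[UNK]"] if the word cannot be covered by the vocabulary.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocabulary:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocabulary = {"in", "##con", "##sequential", "food", "##s"}

print(split_into_subwords("inconsequential", toy_vocabulary))
print(split_into_subwords("foods", toy_vocabulary))
print(split_into_subwords("xyz", toy_vocabulary))
```

Because short pieces such as "in" or "##s" can appear in words from many languages, a shared vocabulary stretches further than a simple per-language word count suggests.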
How does the model learn from text?
The model learns in two different ways:
First, it reads a sentence, e.g. “I like Chinese food, especially spring rolls.” Then, it hides some of the words from itself: “I like [HIDDEN] food, especially spring rolls.” Next, it tries to guess the hidden word. If it guesses wrong, it adjusts itself so that it does better the next time. If, on the other hand, it guesses correctly, the model knows that it has understood the meaning of the text. In this example, the model learns that spring rolls belong to Chinese cuisine.
Second, the model reads the next sentence in the text, for example: “That’s why I often do my grocery shopping in the Asian supermarket.” It also reads a random sentence from another book: “At 7 p.m., Mads Jensen was arrested.” The model then tries to figure out which of the two sentences logically follows the first sentence, “I like Chinese food, especially spring rolls.”
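The two learning tasks above can be sketched as the construction of training examples. This is a simplified illustration in plain Python, not BotXO's training code: the function names are hypothetical, the 15% hiding rate is the one reported for the original BERT, and the original BERT uses a "[MASK]" token with some extra randomisation that is left out here.

```python
import random

HIDDEN = "[HIDDEN]"  # placeholder used in the article; original BERT uses "[MASK]"

def hide_words(words, hide_probability=0.15, rng=None):
    """Hide a random subset of words.

    Returns (hidden_words, answers), where answers maps each hidden
    position to the original word the model would have to guess.
    """
    if rng is None:
        rng = random.Random()
    hidden_words = list(words)
    answers = {}
    for position, word in enumerate(words):
        if rng.random() < hide_probability:
            hidden_words[position] = HIDDEN
            answers[position] = word
    return hidden_words, answers

def make_next_sentence_example(first, true_next, random_other, rng=None):
    """Pair the first sentence with either its real successor or a random sentence.

    Returns (first, candidate, is_next); the model would be trained to predict is_next.
    """
    if rng is None:
        rng = random.Random()
    if rng.random() < 0.5:
        return first, true_next, True
    return first, random_other, False

sentence = "I like Chinese food , especially spring rolls .".split()
print(hide_words(sentence))
print(make_next_sentence_example(
    "I like Chinese food, especially spring rolls.",
    "That's why I often do my grocery shopping in the Asian supermarket.",
    "At 7 p.m., Mads Jensen was arrested.",
))
```

The output varies from run to run because the hidden positions and the sentence pairing are chosen at random. Passing a seeded `random.Random` instance as `rng` makes the examples reproducible.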
How can we use the BERT model?
In line with our mission at BotXO to develop Danish AI and make it publicly available, it only made sense for the Danish BotXO BERT model to be open source. This means that others can develop it further and use it to improve their products and services, as well as to create new solutions.
The model and the instructions for data scientists and engineers are available for free here: github.com/botxo. We hope that you will support Danish AI by sharing the link in your organization. If your organization needs something industry-specific and you don’t have the time, ability or resources to build it yourself – aka you’re not a developer – we can set it up for you on our platform. Just get in touch with us at firstname.lastname@example.org.
Follow our blog to keep an eye on the latest AI news, chatbot best practices, and much more.