BotXO Releases the First-Ever Norwegian BERT Model, Improves the Danish Model, and Starts a Model Zoo Initiative
BotXO’s open-source Danish BERT Model has sparked quite a bit of interest. Danish newspaper Børsen wrote an article about it, and many Danish data scientists have participated in discussions about it on GitHub.
Many of our customers here at BotXO are also running experiments and are using the models for different projects.
Today, the BotXO data science team is releasing the first-ever BERT model trained on Norwegian data: the Norwegian BERT model. Most importantly, we hope that the model will help data scientists in Norway build state-of-the-art Natural Language Processing solutions. We encourage Norwegian data scientists and managers to reach out to us, just as the Danish community did.
Today, we are also releasing an improved version of the Danish model. You can find both the updated Danish BERT model and the new Norwegian BERT model in the same GitHub repository.
Why Release a Norwegian Model?
The Norwegian language is used only in Norway, where there are approximately 4.6 million native speakers. As with Danish, this means that the language is often overlooked when Natural Language Processing tools are built.
By open-sourcing a Norwegian BERT model, we hope to help the community build their own Natural Language Processing solutions.
Our chatbots at BotXO support Norwegian out of the box, and with our prebuilt intents for Norwegian it is easy to get started setting up a state-of-the-art chatbot.
How Are the Models Trained?
We train BERT models on a new kind of computer chip called a TPU, short for Tensor Processing Unit. As the name suggests, the chip is excellent at “tensor” operations, which are exactly the kind of operations needed to train Deep Neural Networks.
Just as “vector” means a list of numbers and “matrix” means a rectangle of numbers, “tensor” is just a fancy word for a box of numbers. A 1-dimensional tensor is a vector, a 2-dimensional tensor is a matrix, and anything with more dimensions (such as a box) is simply called a tensor.
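To make the vector, matrix, and box picture concrete, here is a minimal sketch in plain Python that counts the dimensions of a nested list of numbers. The helper names tensor_rank and tensor_shape are made up for this illustration and do not come from any library; they also assume the nested lists are non-empty and regular.

```python
def tensor_rank(x):
    """Count how many levels of nesting a (non-empty, regular) list has."""
    rank = 0
    while isinstance(x, list):
        rank += 1
        x = x[0]  # step one level deeper
    return rank


def tensor_shape(x):
    """Return the size of each dimension, outermost first."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0]
    return tuple(shape)


vector = [1, 2, 3]                       # a list of numbers
matrix = [[1, 2, 3], [4, 5, 6]]          # a rectangle of numbers
box = [[[1, 2], [3, 4]],                 # a box of numbers
       [[5, 6], [7, 8]],
       [[9, 10], [11, 12]]]

for name, value in [("vector", vector), ("matrix", matrix), ("box", box)]:
    print(name, tensor_rank(value), tensor_shape(value))
```

The vector has one dimension, the matrix two, and the box three. Deep learning libraries store the same structures in optimized arrays rather than nested lists.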
Renting Google’s TPUs, which is the only way to access them, costs a lot of money, so it is important to make the algorithms run as fast as possible to keep the cost down.
Where Do the Training Data Come From?
We use text fetched from the internet to train our BERT models.
The non-profit organization Common Crawl periodically gathers huge amounts of data from the internet. By automatically detecting the language of the text, we can create a data set of (for example) Norwegian data.
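As a rough illustration of the idea of filtering crawled text by language, the sketch below guesses a language by counting common stopwords. This is not the method we use in production. The tiny word lists, the function names guess_language and keep_language, and the sample documents are all assumptions made up for this example. A real pipeline would more likely rely on a trained language identification model.

```python
import re

# Tiny, hand-picked stopword lists (hypothetical, far too small for real use).
STOPWORDS = {
    "no": {"og", "ikke", "jeg", "det", "er", "på", "som", "til", "av", "vi"},
    "en": {"the", "and", "is", "of", "to", "in", "that", "it", "for", "with"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def keep_language(documents, lang):
    """Keep only the documents guessed to be written in `lang`."""
    return [doc for doc in documents if guess_language(doc) == lang]


documents = [
    "Jeg er ikke sikker på det, men vi håper at modellen er til hjelp.",
    "The model is trained on text that it finds in the crawl.",
    "12345 67890",
]

norwegian = keep_language(documents, "no")
print(norwegian)
```

Only the first sample document is kept. Stopword counting breaks down for closely related languages such as Norwegian and Danish, which is one reason a proper language identification model is the better choice at scale.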
Because it takes a lot of time to read through such vast amounts of data, we run our algorithms on multiple computers at once and make sure that the algorithms themselves are as fast as possible.
What Are We Going to Do Next?
Now that we have released a Norwegian model, we are going to target the remaining Nordic languages, which are Swedish and Finnish. Afterwards, we are going to start working on the remaining European languages.
However, since Natural Language Processing research is progressing so rapidly, it is increasingly challenging to maintain a repository of models that are up to date with state-of-the-art research.
That is why we have decided to pick a different strategy: Rather than releasing more European models, we are going to release our data sets formatted for training new BERT models in many different languages.
Importantly, we hope that we can get the European NLP community to help us train models that are up-to-date with state of the art General Purpose Language Models.
Please share this article and remember to check the blog regularly for updates on our new Model Zoo initiative!