BotXO Releases Swedish BERT Model, Completing the Scandinavian Trio


BotXO Releases Swedish BERT Model

After the successful release of Danish and Norwegian BERT models, BotXO is ready with a model for Swedish, the Scandinavian language with the most speakers. The Swedish BERT model has been trained on 25 GB of raw text data, more than ten times as much data as the previously biggest Swedish BERT model. As with the Danish and Norwegian models, it can be downloaded freely from BotXO's GitHub profile here: BotXO Nordic BERT.
 
BotXO hopes that the model will contribute to the Swedish Natural Language Processing community in the same way that the Danish and Norwegian BERT models have contributed in Denmark and Norway. The hope is that Swedish data scientists will share their findings.

How Is the Swedish Language Different?

Swedish has a different set of characters than Danish and Norwegian. Apart from the usual English letters, Swedish uses the vowels Å, Ä, and Ö. More than 10 million people speak the language, almost as many as the populations of Denmark and Norway combined.
 
BotXO is in the process of running a more in-depth analysis of the data for the different languages. They are convinced that careful analysis will allow them to improve the quality of the training data.
 
As an example of their current findings, consider the peculiarity of Lsjbot, a Swedish Wikipedia bot that skews automatically gathered datasets in Swedish.

What Is Lsjbot?

Lsjbot is an automated article-creating program, or bot, that mostly creates articles for Swedish Wikipedia.
 
Consider the following chart of Wikipedia articles in different languages:
[Chart: Wikipedia distribution of articles in different languages]

Do you notice something strange? The second most represented language, Cebuano, is spoken by only 16 million people in the southern part of the Philippines. The third most represented language by article count is Swedish. What is going on?  

It turns out that most of the Swedish articles were contributed by the Swedish physicist Sverker Johansson. Or rather, Lsjbot, an automatic robot that Sverker Johansson created. The robot reads data from a database and publishes automatically written articles. As the articles are automatically generated, they all use the same kinds of expressions.  

For example, the bot might read information about an animal and write something along the lines of “The average adult [Giraffe] is [4.6m-6.1m] tall and weighs [800kg]. Its diet mainly consists of [leaves, seeds, and fruit]”, replacing the particular statistics and names for different animals.

While this might be great for filling out a Wikipedia page, it isn’t beneficial for training Natural Language Processing algorithms. Since the sentences are likely to be very similar for different animals, the algorithm is going to be biased towards these particular expressions. In turn, this skews the model and impacts performance negatively.
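To make the problem concrete, here is a minimal Python sketch of how templated text can show up in a corpus. It is not BotXO's analysis method. The template string, the helper names (`fill_template`, `word_ngrams`, `repeated_ngram_share`) and the sample sentences are all made up for illustration. The idea is simply to measure how many word trigrams in a set of texts also occur in another text of the same set.

```python
import re
from collections import Counter

# Hypothetical template, modelled on the giraffe example above.
TEMPLATE = (
    "The average adult {name} is {height} tall and weighs {weight}. "
    "Its diet mainly consists of {diet}."
)


def fill_template(name, height, weight, diet):
    """Produce one bot-style article from the shared template."""
    return TEMPLATE.format(name=name, height=height, weight=weight, diet=diet)


def word_ngrams(text, n=3):
    """Return the lowercase word n-grams of a text as a list of tuples."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def repeated_ngram_share(corpus, n=3):
    """Fraction of n-gram occurrences whose n-gram appears in more than one text."""
    doc_freq = Counter()
    for doc in corpus:
        # Count each n-gram once per text, so we measure cross-text repetition.
        doc_freq.update(set(word_ngrams(doc, n)))

    total = 0
    repeated = 0
    for doc in corpus:
        for gram in word_ngrams(doc, n):
            total += 1
            if doc_freq[gram] > 1:
                repeated += 1
    return repeated / total if total else 0.0


# Made-up sample data: two templated articles and two hand-written sentences.
bot_articles = [
    fill_template("giraffe", "4.6m-6.1m", "800kg", "leaves, seeds, and fruit"),
    fill_template("moose", "1.4m-2.1m", "500kg", "twigs, bark, and aquatic plants"),
]
human_articles = [
    "Giraffes roam the savanna in loose herds.",
    "Moose are solitary animals that live in northern forests.",
]

bot_share = repeated_ngram_share(bot_articles)
human_share = repeated_ngram_share(human_articles)
print(f"templated: {bot_share:.2f}, hand-written: {human_share:.2f}")
```

In this toy data the templated texts share a noticeable fraction of their trigrams, while the hand-written ones share none. A real cleaning pipeline would need a larger sample and a carefully chosen threshold, so treat this only as an illustration of why generated articles can dominate the statistics a model learns from.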

You might be wondering: Why is Cebuano the second most represented language? I’ll give you a hint: Sverker Johansson’s wife is from the Philippines…

What Can You Expect in the Future From BotXO?

Besides training a Finnish BERT model, BotXO is going to work on running a more detailed analysis of the data for different languages. By publishing high-quality data sets, BotXO hopes to get data scientists from all over Europe to contribute to its efforts in improving Natural Language Processing for all European languages.

Article written by: Jens Dahl Møllerhøj

Designed by: Patrycja Hala Saçan
