BotXO Releases Swedish BERT Model
How Is Swedish Language Different?
What Is Lsjbot?
Do you notice something strange? The second most represented language, Cebuano, is spoken by only 16 million people in the southern part of the Philippines. The third most represented language by article count is Swedish. What is going on?
It turns out that most of the Swedish articles were contributed by the Swedish physicist Sverker Johansson. Or rather, Lsjbot, an automatic robot that Sverker Johansson created. The robot reads data from a database and publishes automatically written articles. As the articles are automatically generated, they all use the same kinds of expressions.
For example, an article might read information about an animal, and write something along the lines of “The average adult [Giraffe] is [4.6m-6.1m] tall and weighs [800kg]. Its diet mainly consists of [leaves, seeds, and fruit]”; replacing the particular statistics and names for different animals.
While this might be great for fill out a Wikipedia page, it isn’t beneficial for training Natural Language Processing algorithms. Since the sentences are likely to be very similar for different animals, the algorithm is going to be biased towards these particular expressions. In turn, it skews the model and impacts performance negatively.
You might be wondering: Why is Cebuano the second most represented language? I’ll give you a hint: Sverker Johansson’s wife is from the Philippines…