Building Multilingual LLMs: Why Data Quality Matters
In Artificial Intelligence, building systems that understand and generate human language requires immense quantities of language data. This data forms the bedrock of a large language model's (LLM's) ability to comprehend and produce human-like language. Yet not all data is created equal, and recognizing this can be the deciding factor in a model's effectiveness.
Examining the model cards of recent European and American LLMs reveals an abundance of textual data for training these models. However, only a fraction of it comprises high-quality, well-curated information, while the majority is opportunistically gathered by scraping the Web.
As Tilde embarked on developing TildeLM, a multilingual foundational LLM, the adage "garbage in, garbage out" became a significant concern. On the one hand, we calculated that training a 30-billion-parameter foundational LLM demands a dataset of approximately 600-700 billion words. On the other, upon scrutinizing the quality of available data, what initially seemed a treasure trove soon resembled Pandora's box.
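The arithmetic behind such an estimate can be sketched with the widely used compute-optimal ("Chinchilla-style") heuristic of roughly 20 training tokens per parameter. Both ratios below are illustrative assumptions, not Tilde's actual figures; in particular, the tokens-per-word conversion varies a lot by tokenizer and language mix:

```python
def training_budget_words(n_params, tokens_per_param=20, tokens_per_word=1.0):
    """Rough word-count estimate for a compute-optimal training run.

    tokens_per_param (~20) follows the Chinchilla heuristic;
    tokens_per_word depends heavily on the tokenizer and language mix.
    Both defaults are illustrative assumptions, not Tilde's figures.
    """
    tokens = n_params * tokens_per_param
    return tokens / tokens_per_word

# 30B parameters -> roughly 600 billion tokens under these assumptions
print(f"{training_budget_words(30e9):.2e} words")
```

Nudging the tokens-per-parameter ratio toward the low twenties lands in the 600-700 billion range the article cites, which is why the word budget dwarfs any curated corpus.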
Most data available for training LLMs is sourced from Common Crawl and the Internet Archive, repositories built by scraping countless web pages. While these sources provide an abundance of content, they present notable drawbacks, particularly for languages other than English.
Data Distribution: English Versus Global Languages
Irrespective of the dataset, English dominates the linguistic landscape. The volume of English words across all datasets not only surpasses any single language but also outnumbers languages from entire regions. This discrepancy is why, currently, over 90% of the training data for most LLMs is in English, causing many languages to be underrepresented. This imbalance fuels “English-centrism,” where AI models excel in English but falter with the nuances and cultural intricacies of other languages. For speakers of underrepresented languages, this translates to lower-quality AI tools and restricted access to advanced technologies.
In addition to the limited volume, a substantial portion of non-English text is subpar machine-translated content. For instance, when we reviewed the 300 most frequent web domains in the HPLT-v2 Latvian data, we discovered that a full 25% consist largely of machine-translated pages.
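The first step of such an audit, ranking domains by document count so the most influential sources can be spot-checked, is straightforward to automate. A minimal sketch, assuming one source URL per crawled document (the manual inspection for machine translation still follows):

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(urls, n=300):
    """Count documents per host and return the n most frequent domains.

    `urls` is assumed to hold one source URL per crawled document;
    judging whether a domain serves machine-translated content
    remains a manual step after this ranking.
    """
    counts = Counter(urlparse(u).hostname for u in urls if u)
    return counts.most_common(n)

docs = [
    "https://example.lv/page1",
    "https://example.lv/page2",
    "https://translated.example.com/lv/page",
]
print(top_domains(docs, n=2))  # [('example.lv', 2), ('translated.example.com', 1)]
```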
This is problematic not only because the translations are poor but also because translations generally do not represent a language well, as they closely mimic the source language. This phenomenon, known as translationese, results in unnatural phrasing, grammatical inaccuracies, and lost cultural context. Consequently, AI models trained on such data may struggle to understand or generate nuanced content in these languages.
Challenges in Data Selection and Ethical Concerns
Beyond surface problems such as faulty grammar or failure to adhere to linguistic norms, more alarming issues arise. Analyzing data from ostensibly valid sources revealed content that was unsuitable or irrelevant to our training objectives, and models trained on such material raise ethical and safety concerns. We often faced challenging decisions about what content was acceptable: while it is easy to exclude obviously inappropriate material like pornography, which contributes nothing meaningful to language understanding, other content is trickier to discern.
Notable among these is politically charged content, especially from pro-Russian media sources. This content often embodies strong anti-Western and anti-LGBTQ narratives, pro-Russian sentiment, and anti-Ukrainian propaganda. Thanks to the efforts of organizations like the National Electronic Mass Media Council of Latvia, which maintains a list of banned sites, excluding many such sites is straightforward. However, pro-Russian Serbian media posed a unique challenge, as these aren’t banned within the EU. We found many sites presenting hearsay or ‘expert opinions’ with fake photos to propagate Kremlin narratives concerning Western military aggression.
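Once a banned-domain list is available, excluding those sites reduces to a host match that also catches subdomains. A minimal sketch, assuming a plain-text list with one domain per line (the actual format of the published list may differ):

```python
from urllib.parse import urlparse

def load_blocklist(path):
    """Read one banned domain per line; this format is an assumption,
    not the actual format of the published list."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_blocked(url, blocklist):
    """Match the host and every parent domain against the blocklist,
    so news.banned.example is excluded along with banned.example."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return any(".".join(parts[i:]) in blocklist for i in range(len(parts)))

blocklist = {"banned.example"}
print(is_blocked("https://news.banned.example/article", blocklist))  # True
```

Sites that are not on any official list, like the Serbian media mentioned above, cannot be caught this way and still require manual review.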
Such content is problematic due to its political bias and its tendency to present falsehoods as facts, particularly in domains like history, medicine, and social issues, fostering discord among Europeans. Training an LLM with such data risks reinforcing harmful stereotypes and misinformation, which could undermine the model’s impartiality and usefulness in real-world applications.
The Necessity for High-Quality, Diverse Data
It is evident that developing an effective LLM requires more than sheer text volume; it demands high-quality, diverse, and trustworthy datasets that give models the signals needed to grasp cultural context and complex reasoning. As the examples above illustrate, relying solely on vast quantities of Web data is naive.
Regrettably, many high-quality datasets in languages other than English are small or fragmented. The nature of LLM training necessitates lengthy text passages to assist models in understanding narrative flow and context. Without such resources, the LLM’s language “understanding” remains superficial. Despite their quality, brief snippets lack the depth required to develop a refined model.
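The length requirement can be made concrete with a simple filter that keeps only documents long enough to carry narrative flow. The threshold below is an illustrative assumption, not a figure from TildeLM's pipeline:

```python
MIN_WORDS = 500  # illustrative threshold, not TildeLM's actual cutoff

def long_documents(docs, min_words=MIN_WORDS):
    """Keep documents long enough to carry narrative flow and context;
    sentence-level corpora fail this filter by construction."""
    return [d for d in docs if len(d.split()) >= min_words]

corpus = ["A lone sentence.", "word " * 600]
print(len(long_documents(corpus)))  # 1
```

This is exactly why sentence-level resources, however clean, cannot substitute for full documents: every entry falls below any reasonable document-length cutoff.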
Additionally, licensing restrictions further complicate the situation. Many academic projects funded by national governments have produced sentence-level corpora unsuitable for LLM training. These resources, designated for scholarly research, are inaccessible for commercial use, leaving commercial researchers dependent on opportunistically sourced data.
Impact on Language and Encouraging Collaborations
Using low-quality data affects not only LLM performance but also influences us and the languages themselves. As AI-generated content becomes increasingly common in emails, articles, and marketing materials, the way language is used and perceived changes. Because these tools reproduce the linguistic patterns of their training data, flaws included, widespread AI-generated text can gradually normalize errors or unnatural phrasing, potentially eroding language richness.
Despite these challenges, encouraging examples of collaboration are emerging. Data donors are contributing curated, high-quality datasets to address these issues. Our first partner, the Estonian Language Institute (EKI), has actively ensured that the Estonian language is well-represented in AI training. By offering its resources, EKI has provided diverse materials, including literary works and government publications. These datasets are crucial for training models to understand formal and informal language, enabling tools to serve the community with more accuracy and cultural sensitivity.
Similarly, SpeakLeash, a grassroots organization in Poland, is making noteworthy progress in preserving Polish. Operated by volunteers, SpeakLeash builds and catalogues datasets specifically for use in AI tools.
Both organizations have significantly contributed to TildeLM, ensuring Baltic and Eastern European languages are equally represented with depth and nuance. Beyond these, other institutions such as the National Library of Finland, Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences, Slovenian Language Model initiative, and others have also extended support.
These initiatives demonstrate how local communities and organizations can actively preserve their linguistic heritage in digital form.
The “More Is Better” Approach: A Fundamental Flaw
The challenges encountered while developing TildeLM — from data quality issues to modern challenges like misinformation — underscore the flawed nature of the "more is better" approach. With datasets too vast to verify, the problem grows daily. It is clear that the pathway for LLMs should not lie in accumulating ever-larger data volumes, as high-quality data may become scant in the coming years.
Only time will reveal if AI’s future hinges on collaborative efforts between technology companies, academic institutions, and cultural organizations. Encouraging partnerships with entities like the Estonian Language Institute and SpeakLeash indicate that such cooperation is feasible. It remains to be seen if these can scale beyond individual successes to create higher-quality models trained on smaller, more reliable datasets. The outcome could decide whether AI genuinely serves all languages and cultures equitably.