The World’s Best AI Models Operate in English. Other Languages — Even Major Ones Like Cantonese — Risk Falling Further Behind
How do you translate “dim sum”? For many English speakers, this question might seem unusual, as the term is widely recognized to describe the assortment of small dishes served during a Cantonese-style brunch and needs no translation.
However, for developers like Jacky Chan, who launched a Cantonese large language model (LLM) through his startup Votee last year, words like “dim sum” present considerable challenges. While human translators can intuitively identify loanwords and decide which terms should be left untranslated, machines struggle to do so.
According to Chan, “It’s not natural enough. When you see it, you know it’s not something a human writes.” The awkwardness points to a broader problem: today’s AI models excel in English and other major languages, but stumble over a wide range of smaller ones.
When AI models encounter unfamiliar words or phrases that have no counterpart in another culture, they might invent a translation. Aliya Bhatia, a senior policy analyst at the Center for Democracy & Technology, explains, “As a result, many machine-created datasets could feature mistranslations, words that no native speaker actually uses in a specific language.”
LLMs require extensive amounts of data. They break text from books, articles, and websites into smaller word sequences to build a training dataset. From this, an LLM learns to predict the next word in a sequence, eventually mastering text generation.
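A rough sketch of that idea in Python can make it concrete. The snippet below uses a naive whitespace tokenizer and a toy sentence, both purely illustrative, to show how running text becomes the (context, next word) pairs a model learns to predict from; real LLMs use subword tokenizers and neural networks at vastly larger scale.

```python
# Minimal illustration of how text becomes next-word training examples.
# The whitespace "tokenizer" and example sentence are illustrative only.

def make_training_pairs(text: str, context_size: int = 3):
    tokens = text.split()  # naive tokenization for illustration
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]
        target = tokens[i]  # the word the model learns to predict
        pairs.append((context, target))
    return pairs

corpus = "dim sum is a style of Cantonese cuisine served in small portions"
for context, target in make_training_pairs(corpus):
    print(context, "->", target)
```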
Although AI has become remarkably proficient at generating text, particularly in English, it performs significantly worse in other languages. Approximately half of all web content is in English, providing abundant digital resources for LLMs to learn from. Many other languages lack this abundance of data.
Low-resource languages face significant challenges due to limited online data. The category includes not only endangered languages that are no longer being passed down to younger generations, but also widely spoken ones like Cantonese, Vietnamese, and Bahasa Indonesia.
Factors such as limited internet access and government regulation can hinder the creation of digital content. Indonesia’s government, for instance, has the authority to remove online content without an appeals process, which often leads to self-censorship. As a result, data in some regional languages may not accurately represent local cultures.
This lack of resources leads to a performance gap. Non-English LLMs frequently produce unintelligible or inaccurate responses. They also struggle with languages that don’t use the Latin script of English, or whose complex tonal features are difficult to render in writing or encode for a machine.
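Part of the gap is visible at the tokenization stage. The sketch below, assuming the open-source tiktoken library and an English-centric encoding chosen only for illustration, counts how many tokens the same kind of short sentence consumes in English versus informal written Cantonese; exact counts vary by tokenizer, but non-Latin scripts typically cost more tokens per character.

```python
# Rough illustration: tokenizers trained mostly on English text tend to
# spend more tokens per character on non-Latin scripts, making training
# and generation in those languages less efficient.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # illustrative choice of encoding

samples = {
    "English": "Dim sum is served in small portions.",
    "Cantonese": "點心係一種細碟嘅廣東菜。",  # informal written Cantonese, for illustration
}

for language, sentence in samples.items():
    tokens = enc.encode(sentence)
    print(f"{language}: {len(sentence)} characters -> {len(tokens)} tokens")
```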
The most effective AI models currently operate in English and, to a lesser extent, Mandarin Chinese, reflecting the geographic locations of the world’s major tech companies. Outside of these markets, numerous developers are striving to make AI accessible to everyone.
For example, South Korean internet company Naver has developed an LLM, HyperCLOVA X, which it boasts is trained on 6,500 times more Korean data than GPT-4. Naver’s efforts extend to markets like Saudi Arabia and Thailand to expand its influence in creating “sovereign AI,” which is AI specifically tailored to the needs of individual countries. “We focus on what companies and governments that want to use AI would want, and what needs Big Tech can’t fulfill,” CEO Choi Soo-Yeon said to Fortune last year.
Similarly, in Indonesia, telecom operator Indosat and tech startup Goto are working together to launch a 70 billion parameter LLM that operates in Bahasa Indonesia and five other local languages, including Javanese, Balinese, and Bataknese.
One major challenge facing non-English LLMs is achieving a scale similar to their English counterparts. The largest models are colossal: the patterns they extract from billions of word sequences are stored as numerical values known as parameters. Estimates suggest GPT-4 has around 1.8 trillion parameters; DeepSeek’s R1, by comparison, has 671 billion.
The Southeast Asian Languages in One Model (SEA-LION) project demonstrates efforts to build powerful non-English LLMs. It has produced two LLMs from scratch: one with 3 billion parameters and another with 7 billion, which are significantly smaller than leading English and Chinese models.
Votee’s Chan experiences these difficulties firsthand with Cantonese, which is spoken by 85 million people in southern China and Hong Kong. Informal written and spoken Cantonese follows different grammar from formal written Chinese, and the Cantonese data available in digital form is often scarce or of low quality.
According to Chan, training on digitized Cantonese texts is like “learning from a library with many books, but they have lots of typos, they are poorly translated, or they’re just plain wrong.” Without a comprehensive dataset, an LLM can’t generate complete and accurate results.
Data for low-resource languages is often dominated by formal texts like legal documents, religious materials, or Wikipedia entries since these are more likely to undergo digitization. This bias limits the LLM’s tone, vocabulary, style, and knowledge.
LLMs lack an inherent sense of truth, leading them to inevitably replicate false or incomplete information as fact. A model trained solely on Vietnamese pop music, for example, may struggle to address historical questions accurately, especially those unrelated to Vietnam.
Machine-translating English content into the target language is one way to augment limited training data. Chan notes, “We synthesize the data using AI so that we can have more data to do the training.”
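Chan doesn’t detail Votee’s pipeline, but the general approach can be sketched with an off-the-shelf translation model. In the snippet below, the Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-zh English-to-Chinese model are assumptions chosen purely for illustration, not a description of Votee’s system; notably, the output is standard written Chinese rather than colloquial Cantonese.

```python
# Sketch of translation-based data augmentation: machine-translate English
# text into the target language to pad out a scarce training corpus.
# Model choice is illustrative; output would still need human review.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")

english_corpus = [
    "Dim sum is served in small portions.",
    "The restaurant opens early on weekends.",
]

synthetic_corpus = []
for sentence in english_corpus:
    result = translator(sentence)[0]["translation_text"]
    synthetic_corpus.append(result)

print(synthetic_corpus)
```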
But machine translation is fraught with risks, such as missed linguistic or cultural nuances. A Georgia Tech study of cultural bias in Arabic LLMs discovered that AI models trained on Arabic datasets still displayed Western bias, referencing alcoholic beverages in Islamic religious contexts. Much of the pre-training data was found to be machine-translated from English, allowing foreign cultural values to infiltrate.
In the long run, AI-generated content might subsequently degrade the quality of low-resource language datasets, akin to “a photocopy of a photocopy” as Chan describes. In 2024, the journal Nature raised concerns about “model collapse,” where AI-generated text might contaminate training data for future LLMs, leading to reduced performance.
This threat is particularly pronounced for low-resource languages. With fewer sources of authentic content available, AI-generated text can quickly come to make up a larger share of online material in a given language.
Large businesses are beginning to recognize the potential in building non-English AI. Yet, despite being formidable players in their fields, they still fall short of the scale achieved by tech giants like Alibaba, OpenAI, and Microsoft.
Bhatia emphasizes that investment from more organizations, both for-profit and nonprofit, is crucial to making multilingual AI a truly global technology. “If LLMs are going to be used to equip people with access to economic opportunities, educational resources, and more, they should work in the languages people use,” she states.