Unbabel says its new AI model has dethroned OpenAI’s GPT-4 as the tech industry’s best language translator

Unbabel, a tech company that provides both machine and human-based translation services for businesses, has created a new AI model that it says beats OpenAI’s GPT-4o and other commercially available AI systems on translation between English and six commonly spoken European and Asian languages.

Translation has been one of the more attractive business use cases for large language models (LLMs), the kind of AI systems that underpin chatbots like OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude. And to date, GPT-4o, the latest version of OpenAI’s most powerful AI model, has outperformed all competitors when it comes to translating languages for which large amounts of digital text exist. (GPT-4’s performance on “low-resource languages,” which have far fewer digital documents to train from, has never been as good.)

Unbabel tested its AI model, which it calls TowerLLM, against GPT-4o and the original GPT-4, as well as OpenAI’s GPT-3.5 and competing models from Google and the language translation company DeepL. It looked at translation from English to Spanish, French, German, Portuguese, Italian, and Korean. In almost every case, TowerLLM narrowly edged out GPT-4o and GPT-4. TowerLLM’s highest accuracy came in English-Korean translations, where it beat OpenAI’s best models by about 1.5%. On English-German translations, GPT-4 and GPT-4o were a fraction of a percentage point better.

Unbabel also tested its model on translations of documents for specific professional domains, such as finance, medicine, law, and technical writing. Here again, TowerLLM performed between 1% and 2% better than OpenAI’s best models.

Unbabel’s results have not been independently verified. But if confirmed, GPT-4’s defeat at translation may indicate that the model, which has remained the top-performing LLM on most language benchmarks despite debuting 15 months ago (an eternity in the fast-paced world of AI development), is now vulnerable to newer AI systems trained with different methods. OpenAI is reportedly training a more powerful LLM, although its release date remains uncertain.

Unbabel, which has headquarters in both San Francisco and Lisbon, said TowerLLM was trained from the start on a large public dataset of multilingual text. As a result, the model also performs better on reasoning tasks in multiple languages than some competing open-source AI models of a similar size from companies such as Meta and French AI startup Mistral.

TowerLLM was then fine-tuned with a carefully curated dataset of high-quality translations between language pairs. Unbabel used another AI model it had trained to assess translation quality, called COMETKiwi, to help curate this fine-tuning dataset.
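Conceptually, this kind of quality-based curation amounts to a filtering step: score each candidate (source, translation) pair with a quality-estimation model and keep only the pairs that clear a threshold. The minimal Python sketch below illustrates the idea only; the `estimate_quality` function is a toy stand-in, not COMETKiwi's actual API, and the threshold is arbitrary.

```python
# Hypothetical sketch of quality-based data curation, in the spirit of
# Unbabel's approach: score candidate (source, translation) pairs and keep
# only high-scoring ones for fine-tuning.
# NOTE: `estimate_quality` is a toy stand-in, NOT the real COMETKiwi model.

def estimate_quality(source: str, translation: str) -> float:
    """Stand-in quality-estimation score in [0, 1].

    Toy heuristic for illustration only: penalize large length mismatches
    between source and translation. A real QE model would score semantic
    adequacy and fluency instead.
    """
    shorter = min(len(source), len(translation))
    longer = max(len(source), len(translation), 1)
    return shorter / longer

def curate(pairs, threshold=0.5):
    """Keep only pairs whose estimated quality clears the threshold."""
    return [(src, tgt) for src, tgt in pairs
            if estimate_quality(src, tgt) >= threshold]

candidates = [
    # Plausible pair: similar lengths, passes the toy filter.
    ("The cat sits on the mat.", "Le chat est assis sur le tapis."),
    # Mismatched pair: the "translation" is far longer, so it is dropped.
    ("Good morning.", "Bonjour tout le monde, comment allez-vous aujourd'hui ?"),
]
curated = curate(candidates)
```

The design point is that the filter, not the downstream trainer, encodes the quality bar: swapping the toy heuristic for a genuine QE model leaves the rest of the pipeline unchanged.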

João Graça, Unbabel’s chief technology officer, told Fortune that most other LLMs have a higher proportion of English-language text in their initial training set and only pick up the ability to translate incidentally. But TowerLLM was trained on a dataset that was specifically designed to include a large amount of multilingual text. He also said that fine-tuning on the smaller, curated dataset of high-quality translations was key to the resulting model’s superior performance.

It was one of several recent examples in which smaller AI models have equaled or exceeded the performance of much larger ones when trained on better-quality datasets. For instance, Microsoft created a small language model called Phi-3, with just 3.8 billion parameters (the tunable variables in the model), that outperforms models more than double its size, thanks to training on what Microsoft called a “textbook-quality” dataset. “The insight from Phi is that people should focus on the quality of the data,” Graça said. He noted that all AI companies are now using the same basic algorithmic design with some subtle variations. What differentiates the models is data. “It’s all about the data and the training curriculum, which is how you give the data to the model,” he said.

TowerLLM is currently available in two sizes, one with 7 billion parameters and one with 13 billion. An earlier version of the model, which debuted in January, came close to GPT-4’s performance, but didn’t quite exceed it. That model also only worked for 10 language pairs. The new model edges past GPT-4 and supports 18 language pairs.

TowerLLM has only been tested against GPT-4o on translation, meaning that GPT-4o may still hold an advantage at other tasks such as reasoning, coding, writing, and summarization.

Graça said that Unbabel plans to expand the number of languages TowerLLM supports, adding 10 more soon. The model is also being fine-tuned to work on the very specific translation tasks that businesses often care most about, such as translating complex legal documents or patent and copyright information. It has been trained to get better at “transcreation,” the skill of translating a piece of content not word for word but so that it captures subtle cultural nuances, such as using the colloquial expressions or slang that a native speaker of a certain generation would use, Graça said.
