I built something kinda similar, and made it open source. It picks the top x models based on my research, translates with them, then has a final judge model critique, compare, and synthesise a combined best translation. You can try it at https://nuenki.app/translator if you're interested, and my data is at https://nuenki.app/blog
https://nuenki.app/blog/the_more_llms_think_the_worse_they_t...
Your results accord with my own (much less systematic) tests of the translation of short texts by reasoning models. The issue becomes more fuzzy with the translation of longer texts, where quality is more difficult to evaluate objectively. I'll drop you an email with some thoughts.
The only thing that actually worked was knowing the target language and sitting down with multiple LLMs, going through the translation one sentence at a time with a translation memory tool wired in.
The LLMs are good, but they make a lot of strange mistakes a human never would: weird grammatical adherence to English structures, false-friend mistakes that no bilingual person would make, and so on. Bizarrely, many of these would not be caught between LLMs -- sometimes I would get _increasingly_ unnatural outputs instead of more natural ones.
This isn't just English to Asian languages; it happens even with English to German or French. I shipped something to a German editor and he rewrote 50% of the lines.
LLMs are good editors and suggestors for alternatives, but I've found that if you can't actually read your target language to some degree, you're lost in the woods.
I have been astounded at the sophistication of LLM translation, and haven't encountered a single false-friend example ever. Maybe it depends a lot on which models you're using? Or it thinks you're trying to have a conversation that code-switches mid-sentence, which is a thing LLMs can do if you want?
I live in Japan. Almost every day I find myself asking things like “what does X mean in this specific setting?” or “how do I tell Y to that specific person via this specific medium?”.
Much of this can be further automated via custom instructions, so that e.g. the LLM knows that text in a particular language should be automatically translated and explained.
Great ideas. I'll think about adding those features to the system in my next vibe-coding session.
What I automated in the MVP I vibe-coded yesterday could all be done alone by a human user with access to the LLMs, of course. The point of such an app would be to guide people who are not familiar with the issues and intricacies of translation so that they can get better translations for their purposes.
I have no intention to try to commercialize my app, as there would be no moat. Anyone who wanted to could feed this thread to Claude, ask it to write a starting prompt for Claude Code, and produce a similar system in probably less time than it took me.
The word "potatoes" in the context of a specific 500-page book has little ambiguity. The same word extracted from that book and fed to a translator (human or machine) in isolation is much more ambiguous. You probably don't need the whole book, but the solution space does shrink as you give translators more of the surrounding content and show how the word is used elsewhere in the original and in other parts of the translation.
It's similar to how GPS works. With one satellite, your location is about as precise as "somewhere on Earth, I guess". It gets more precise as you add satellites, each further reducing the margin of error.
RIP global power consumption
It can be as simple as discussing one's own religion.
I also have a few heuristics (e.g. "I can't translate" in many different languages) to detect if it deviates from that.
It works pretty well.
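For anyone curious what such a heuristic might look like: it can be as simple as a substring check against known refusal phrases. This is an illustrative sketch, not the actual list or code used -- the phrases below are made up for the example:

```python
# Illustrative sketch: detect when a model has refused to translate
# by matching known refusal phrases in several languages.
# The phrase list is hypothetical, not the one actually used.
REFUSAL_PHRASES = [
    "i can't translate",          # English
    "je ne peux pas traduire",    # French
    "ich kann nicht übersetzen",  # German
    "no puedo traducir",          # Spanish
]

def looks_like_refusal(output: str) -> bool:
    """Return True if the model output matches a known refusal phrase."""
    lowered = output.lower()
    return any(phrase in lowered for phrase in REFUSAL_PHRASES)
```

In practice you'd also want to catch cases where the model echoes the input unchanged, which is another common failure mode.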
As a human translator, if I were starting to translate a text in the middle and I wanted my translation to flow naturally with what had been translated before, I would want both a summary of the previous content and notes about how to handle specific names and terms and maybe about the writing style as well. When I start working on the project again tomorrow, I’ll see if Claude Code can come up with a solution along those lines.
Disclaimer: I work for Soniox.
How is this part done? How are they chosen/combined to give the best results? Any info would be appreciated as I've seen this sort of thing mentioned before, but details have been scant!
https://www.gally.net/temp/20250619translationprogramtrace/i...
The Anthropic servers seem to have been overloaded, but if you read through the above it should give you an idea of how the program is currently doing that synthesis. If you want the code as it stands now (a single HTML file), feel free to email me. My website is linked from my profile page.
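Speaking generally (this is not Nuenki's actual code, and `call_llm` is a hypothetical stand-in for whatever chat-completion client you use), a synthesis step like the one in that trace usually amounts to building a judge prompt that contains all candidate translations and asking one model to critique and combine them:

```python
# Hypothetical sketch of a judge/synthesis step. `call_llm` stands in
# for any chat-completion API call; it is not a real library function.
def synthesize(source_text: str, candidates: dict[str, str], call_llm) -> str:
    """Ask a judge model to combine several candidate translations."""
    numbered = "\n\n".join(
        f"Translation {i} (from {model}):\n{text}"
        for i, (model, text) in enumerate(candidates.items(), start=1)
    )
    prompt = (
        "You are an expert translator acting as a judge.\n\n"
        f"Source text:\n{source_text}\n\n"
        f"Candidate translations:\n{numbered}\n\n"
        "Critique each candidate, then produce a single combined "
        "translation that takes the best phrasing from each. "
        "Output only the final translation."
    )
    return call_llm(prompt)
```

The interesting design decisions are all in the prompt wording and in which model gets to be the judge, not in the plumbing.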
Interesting. Curious if you modeled the cost of that single translation with the multiple LLM calls and how that compares to a human.
I didn't double-check the module's arithmetic, but it seems to have been in the ballpark, as my total API costs for OpenAI, Anthropic, and Google yesterday--when I was testing this system repeatedly--came to about eight dollars.
A human translator would charge many, many times more.
Google Translate can't, but LLMs given enough context can. I've been testing and experimenting with LLMs extensively for translation between Japanese and English for more than two years, and, when properly prompted, they are really good. I say this as someone who worked for twenty years as a freelance translator of Japanese and who still does translation part-time.
Just yesterday, as it happens, I spent the day with Claude Code vibe-coding a multi-LLM system for translating between Japanese and English. You give it a text to be translated, and it asks you questions that it generates on the fly about the purpose of the translation and how you want it translated--literal or free, adapted to the target-language culture or not, with or without footnotes, etc. It then writes a prompt based on your answers, sends the text to models from OpenAI, Anthropic, and Google, creates a combined draft from the three translations, and then sends that draft back to the three models for several rounds of revision, checking, and polishing. I had time to run only a few tests on real texts before going to bed, but the results were really good--better than any model alone when I've tested them, much better than Google Translate, and as good as top-level professional human translation.
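The loop described above boils down to fan-out, combine, then revise. A rough sketch of that control flow, with `ask_model(provider, prompt)` as a hypothetical stand-in for the three providers' real API clients (this is not the actual vibe-coded program):

```python
# Rough sketch of a fan-out / combine / revise translation loop.
# `ask_model(provider, prompt)` is a hypothetical stand-in for
# real OpenAI/Anthropic/Google API calls.
PROVIDERS = ["openai", "anthropic", "google"]

def translate(text: str, brief: str, ask_model, rounds: int = 3) -> str:
    # Fan out: each provider translates independently, guided by the
    # brief built from the user's answers about purpose and style.
    prompt = f"{brief}\n\nTranslate the following text:\n{text}"
    drafts = {p: ask_model(p, prompt) for p in PROVIDERS}

    # Combine: merge the three drafts into a single working draft.
    combined = ask_model(
        PROVIDERS[0],
        "Merge these drafts into one best translation:\n\n"
        + "\n\n".join(f"[{p}]\n{d}" for p, d in drafts.items()),
    )

    # Revise: several rounds of checking and polishing by all models.
    for _ in range(rounds):
        reviews = {
            p: ask_model(
                p,
                f"Improve this translation of:\n{text}\n\nDraft:\n{combined}",
            )
            for p in PROVIDERS
        }
        combined = ask_model(
            PROVIDERS[0],
            "Produce the final polished translation from these revisions:\n\n"
            + "\n\n".join(reviews.values()),
        )
    return combined
```

Each revision round costs seven API calls in this sketch, which is why the per-text cost adds up quickly compared with a single-model translation.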
The situation is different with interpreting, especially in person. If that were how I made my living, I wouldn't be too worried yet. But for straight translation work where the translator's personality and individual identity aren't emphasized, it's becoming increasingly hard for humans to compete.