
> ... translators’ and interpreters’ work is mostly about ensuring context, navigating ambiguity, and handling cultural sensitivity. This is what Google Translate cannot currently do.

Google Translate can't, but LLMs given enough context can. I've been testing and experimenting with LLMs extensively for translation between Japanese and English for more than two years, and, when properly prompted, they are really good. I say this as someone who worked for twenty years as a freelance translator of Japanese and who still does translation part-time.

Just yesterday, as it happens, I spent the day with Claude Code vibe-coding a multi-LLM system for translating between Japanese and English. You give it a text to be translated, and it asks you questions that it generates on the fly about the purpose of the translation and how you want it translated--literal or free, adapted to the target-language culture or not, with or without footnotes, etc. It then writes a prompt based on your answers, sends the text to models from OpenAI, Anthropic, and Google, creates a combined draft from the three translations, and then sends that draft back to the three models for several rounds of revision, checking, and polishing. I had time to run only a few tests on real texts before going to bed, but the results were really good--better than any model alone when I've tested them, much better than Google Translate, and as good as top-level professional human translation.
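The orchestration loop is roughly this shape (a minimal sketch, not the actual code -- the model callables here are placeholders for real OpenAI/Anthropic/Google API calls, and the prompts are illustrative):

```python
def translate_with_ensemble(text, brief, models, rounds=2):
    """Translate `text` per the user's `brief` with several models,
    merge the drafts, then run several rounds of revision.
    `models` maps a model name to a callable taking a prompt string."""
    prompt = f"Translate according to these instructions:\n{brief}\n\nText:\n{text}"
    drafts = {name: call(prompt) for name, call in models.items()}

    # Ask one model to synthesize the drafts into a single combined draft.
    merge_prompt = ("Combine these translations into one best draft:\n\n"
                    + "\n\n".join(f"[{n}]\n{d}" for n, d in drafts.items()))
    draft = next(iter(models.values()))(merge_prompt)

    # Several rounds of revision: each model checks and polishes in turn.
    for _ in range(rounds):
        for name, call in models.items():
            draft = call(f"Revise, check, and polish this translation:\n{draft}")
    return draft
```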

The situation is different with interpreting, especially in person. If that were how I made my living, I wouldn't be too worried yet. But for straight translation work where the translator's personality and individual identity aren't emphasized, it's becoming increasingly hard for humans to compete.


Alex-Programs
I've ended up doing a lot of research into LLM translation, because my language learning tool (https://nuenki.app) uses it a lot.

I built something kinda similar, and made it open source. It picks the top x models based on my research, translates with them, then has a final judge model critique, compare, and synthesise a combined best translation. You can try it at https://nuenki.app/translator if you're interested, and my data is at https://nuenki.app/blog

tanvach
Very neat, love how there’s a formality level selection! Google Translate has such a strong tendency to use very formal language (at least when translating into Thai) that it’s almost useless in real life. The English-to-Thai examples I’ve tried so far have been quite natural.
mordechai9000
I assumed Google errs on the side of formality because being informal in an inappropriate context is worse than being too formal for someone who is obviously not a native speaker. Not for Thai in particular, just in general.
anticensor
Swedish has informal as the default register instead.
tkgally OP
Very nice! Thanks for the links.
Alex-Programs
I just did some more research, which you might find interesting. Thinking actually makes them translate worse!

https://nuenki.app/blog/the_more_llms_think_the_worse_they_t...

tkgally OP
Also very interesting! Excellent research design and presentation, too.

Your results accord with my own (much less systematic) tests of the translation of short texts by reasoning models. The issue becomes fuzzier with longer texts, where quality is more difficult to evaluate objectively. I'll drop you an email with some thoughts.

f38zf5vdt
It's funny -- I independently implemented the same thing (without vibe coding) and found it doesn't actually work. What I ended up with was a game of telephone where errors were often introduced and propagated between the models.

The only thing that actually worked was knowing the target language and sitting down with multiple LLMs, going through the translation one sentence at a time with a translation memory tool wired in.

The LLMs are good, but they make a lot of strange mistakes a human never would: weird grammatical adherence to English structures, false-friend mistakes that no bilingual person would make, and so on. Bizarrely, many of these would not be caught between LLMs -- sometimes I would get _increasingly_ unnatural outputs instead of more natural ones.

This isn't just for English to Asian languages; it happens even with English to German or French. I shipped something to a German editor and he rewrote 50% of the lines.

LLMs are good editors and good at suggesting alternatives, but I've found that if you can't actually read your target language to some degree, you're lost in the woods.

crazygringo
That doesn't match my experience at all. Maybe it's something to do with what your prompts are asking for or the way you're passing translations? Or the size of chunks being translated?

I have been astounded at the sophistication of LLM translation and haven't encountered a single false-friend example ever. Maybe it depends a lot on which models you're using? Or it thinks you're trying to have a conversation that code-switches mid-sentence, which is a thing LLMs can do if you want?

f38zf5vdt
I'm using o3 and Gemini Pro 2.5, paying for the high tier subscriptions. The complaints I get are from native speakers -- editors and end consumers. The LLMs tend to overfit to the English language, sometimes make up idioms that don't exist, use false friend words (especially verbs), directly translate English idioms, and so on. I've translated several book length texts now and I've seen it all.
felipeerias
It is hard to convey just how important it is to be able to provide additional context, ask follow-up questions, and reason about the text.

I live in Japan. Almost every day I find myself asking things like “what does X mean in this specific setting?” or “how do I tell Y to that specific person via this specific medium?”.

Much of this can be further automated via custom instructions, so that e.g. the LLM knows that text in a particular language should be automatically translated and explained.

tkgally OP
> It is hard to convey just how important it is to be able to ... ask follow-up questions, and reason about the text.

Great ideas. I'll think about adding those features to the system in my next vibe-coding session.

What I automated in the MVP I vibe-coded yesterday could all be done alone by a human user with access to the LLMs, of course. The point of such an app would be to guide people who are not familiar with the issues and intricacies of translation so that they can get better translations for their purposes.

I have no intention to try to commercialize my app, as there would be no moat. Anyone who wanted to could feed this thread to Claude, ask it to write a starting prompt for Claude Code, and produce a similar system in probably less time than it took me.

numpad0
Maybe we should stop using advanced and somewhat hand-wavy vocabulary like "context" for that. The thing is that the prompt has to be long enough.

The word "potatoes" in the context of a specific 500-page book has little ambiguity. The same word extracted from that book and fed to a translator (human or machine) in isolation would be much more ambiguous. You probably don't need the whole book, but the solution space does shrink as you give translators more of the content and more examples of how the word is used, both in the original and in other parts of the translation.

It's similar to how GPS works. With one satellite, your location is about as precise as "on Earth, I guess". It gets more precise as you add more satellites that further and further reduce the margin of error.

aidenn0
I'm bad with names, so for any Japanese literature, I need to take notes; it's not unusual to see one character referred to by 3 names. Then you might have 3 characters that are all referred to as Tanaka-san at different points in time.
Casteil
> It then writes a prompt based on your answers, sends the text to models from OpenAI, Anthropic, and Google, creates a combined draft from the three translations, and then sends that draft back to the three models for several rounds of revision, checking, and polishing.

RIP global power consumption

jiehong
The problem with LLMs for translation is that they refuse to translate when the topic doesn’t follow their policies, even if the context shows it’s fine.

It can be as simple as discussing one’s own religion.

Alex-Programs
I made a tool which translates sentences as you browse, for immersion[0]. I solved this by giving the model a code (specifically, "483") to return in any refusal. Then, if I detect that in the output, I fail over to another model+provider.

I also have a few heuristics (e.g. "I can't translate" in many different languages) to detect if it deviates from that.

It works pretty well.

[0] https://nuenki.app
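The failover logic is simple enough to sketch (the sentinel "483" is the one I use; the provider callables and refusal phrases here are illustrative placeholders, not my actual list):

```python
REFUSAL_SENTINEL = "483"
# Backstop heuristics in case a model deviates from the sentinel instruction.
REFUSAL_PHRASES = ["i can't translate", "i cannot translate",
                   "je ne peux pas traduire"]

def looks_like_refusal(output):
    lowered = output.lower()
    return REFUSAL_SENTINEL in output or any(p in lowered for p in REFUSAL_PHRASES)

def translate_with_failover(text, providers):
    """Try each provider in order, skipping any output that looks like a refusal."""
    prompt = (f"Translate the following text. If you must refuse for any reason, "
              f"reply with only the code {REFUSAL_SENTINEL}.\n\n{text}")
    for call in providers:
        output = call(prompt)
        if not looks_like_refusal(output):
            return output
    raise RuntimeError("All providers refused")
```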

Jensson
You can just turn that off, at least on Google's models.
boredhedgehog
What's your approach for dealing with a text too long for an ordinary context window? If I split it into chunks, each one needs some kind of summary of the previous ones for context, and I'm always unsure how detailed they should be.
tkgally OP
I haven’t developed an approach to that yet. In my tests yesterday, I did run into errors when the texts were too long for the context windows, but I haven’t tried to solve the problem.

As a human translator, if I were starting to translate a text in the middle and I wanted my translation to flow naturally with what had been translated before, I would want both a summary of the previous content and notes about how to handle specific names and terms and maybe about the writing style as well. When I start working on the project again tomorrow, I’ll see if Claude Code can come up with a solution along those lines.
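In code, that approach might look something like this rough sketch -- translate chunk by chunk while carrying forward a rolling summary and a glossary of established renderings (the `call` function is a placeholder for an LLM API call, and the prompts are illustrative):

```python
def translate_long_text(chunks, call, glossary=None):
    """Translate chunks in order, giving each one a summary of the
    previous content plus established renderings for names and terms."""
    glossary = glossary or {}
    summary = "(start of text)"
    translated = []
    for chunk in chunks:
        prompt = (
            "Translate the following chunk so it flows naturally with "
            "what came before.\n"
            f"Summary of the previous content: {summary}\n"
            f"Established renderings for names/terms: {glossary}\n\n"
            f"Chunk:\n{chunk}"
        )
        translated.append(call(prompt))
        # Update the rolling summary for the next chunk (another LLM call;
        # the glossary could be maintained the same way).
        summary = call(f"Summarize briefly for a translator:\n{summary}\n{chunk}")
    return "\n".join(translated)
```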

Try Soniox for real-time translation (interpreting). With the limited context it has in real-time, it's actually really good.

https://soniox.com

Disclaimer: I work for Soniox.

dr_dshiv
I’ve been looking for that! Thanks
djaychela
>creates a combined draft from the three translations

How is this part done? How are they chosen/combined to give the best results? Any info would be appreciated as I've seen this sort of thing mentioned before, but details have been scant!

tkgally OP
The combination, compilation, and evaluation process was designed by Claude Code and I haven’t examined or evaluated it closely yet. Here is the log of a session I did just now, after a few more hours of vibe-polishing the program:

https://www.gally.net/temp/20250619translationprogramtrace/i...

The Anthropic servers seem to have been overloaded, but if you read through the above it should give you an idea of how the program is currently doing that synthesis. If you want the code as it stands now (a single HTML file), feel free to email me. My website is linked from my profile page.

jedberg
> You give it a text to be translated ... and then sends that draft back to the three models for several rounds of revision, checking, and polishing.

Interesting. Curious if you modeled the cost of that single translation with the multiple LLM calls and how that compares to a human.

tkgally OP
I had Claude Code write a module that monitored the incoming and outgoing token counts and displayed the accumulated costs. A Japanese-to-English translation that yielded about a thousand words in English cost around US$0.40.

I didn't double-check the module's arithmetic, but it seems to have been in the ballpark, as my total API costs for OpenAI, Anthropic, and Google yesterday--when I was testing this system repeatedly--came to about eight dollars.

A human translator would charge many, many times more.
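The arithmetic the module does is straightforward -- sum per-model token counts against per-million-token prices. A minimal sketch (the prices in the example below are illustrative placeholders, not the actual rates I paid):

```python
def run_cost(usage, prices):
    """Total USD cost of one run.
    usage:  {model: (input_tokens, output_tokens)}
    prices: {model: (usd_per_million_input, usd_per_million_output)}"""
    total = 0.0
    for model, (tokens_in, tokens_out) in usage.items():
        price_in, price_out = prices[model]
        total += tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
    return total
```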

bboygravity
You just created the software for a profitable business. People would use that and pay for it.
bugtodiffer
but it is easy to build a competitor
