tkgally
At the end of his post, Simon mentions translation between human languages. While maybe not directly related to token limits, I just did a test in which both R1 and o3-mini got worse at translation in the latter half of a long text.

I ran the test on Perplexity Pro, which hosts DeepSeek R1 in the U.S. and which has just added o3-mini as well. The text was a speech I translated a month ago from Japanese to English, preceded by a long prompt specifying the speech’s purpose and audience and the sort of style I wanted. (I am a professional Japanese-English translator with nearly four decades of experience. I have been testing and using LLMs for translation since early 2023.)

An initial comparison of the output suggested that, while R1 didn’t seem bad, o3-mini produced a writing style closer to what I asked for in the prompt—smoother and more natural English.

But then I noticed that the output length was 5,855 characters for R1, 9,052 characters for o3-mini, and 11,021 characters for my own polished version. Comparing the three translations side-by-side with the original Japanese, I discovered that R1 had omitted entire paragraphs toward the end of the speech, and that o3-mini had switched to a strange abbreviated style (using slashes instead of “and” between noun phrases, for example) toward the end as well. The vanilla versions of ChatGPT, Claude, and Gemini that I ran the same prompt and text through a month ago had had none of those problems.

simonw
Yikes! Sounds to me like reliable longer form translation is very much not something you can trust to these models. Thanks for sharing.
accengaged
Right. The training data might not include translation runs exceeding this number of characters, so the problem wouldn't be apparent for anything resembling existing training data.
dr_dshiv
I'm curious if you know of any tools, strategies, or papers on this topic.

I’ve experienced the same thing with long-form translation. I would expect that chunking the translation (page or paragraph at a time) would fix the missing paragraph problem but potentially the loss of context would cause other problems.
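A minimal sketch of that chunking idea, for what it's worth. The `translate_chunk` callable here is hypothetical (any LLM API wrapper would slot in), and passing along one preceding source paragraph per chunk is just one guess at softening the context-loss problem:

```python
def split_paragraphs(text):
    """Split source text into non-empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def build_chunk_prompt(paragraph, prev_paragraph=None):
    """Assemble a per-chunk prompt carrying one paragraph of preceding
    source as context, to help the model keep terminology consistent."""
    parts = []
    if prev_paragraph:
        parts.append("Context (preceding paragraph, do not retranslate):\n"
                     + prev_paragraph)
    parts.append("Translate the following paragraph into English:\n"
                 + paragraph)
    return "\n\n".join(parts)

def translate_document(text, translate_chunk):
    """translate_chunk is any callable prompt -> translation,
    e.g. a thin wrapper around an LLM chat API."""
    paragraphs = split_paragraphs(text)
    out = []
    for i, p in enumerate(paragraphs):
        prev = paragraphs[i - 1] if i > 0 else None
        out.append(translate_chunk(build_chunk_prompt(p, prev)))
    return "\n\n".join(out)
```

Translating paragraph by paragraph this way should at least prevent whole paragraphs from being silently dropped, since the chunk count in and out must match.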

This is such a common use case — would love to find resources.

tkgally
There might be some papers or other guides out there, but their advice will be based on whatever tools happened to be available at the time they were written and on the particular types of translations the authors cared about. The technology is advancing so rapidly that you might be better off just experimenting with various LLMs and prompts for texts and language pairs you are interested in.

I started using LLMs for translation after GPT-4 came out in March 2023—not that long ago! At first, the biggest problem was the context window: it wasn’t possible to translate more than a couple of pages at a time. Also, prompt writing was in its infancy, and a lot of techniques that have since emerged were not yet widely known. Even now, I still do a lot of trial and error with my prompts, and I cannot say with confidence that my current prompting methods are the best.

But, for what it’s worth, here are some strategies I currently use when translating with LLMs:

- In the prompt, I explain where the source text came from, how the translation will be used, and how I want it to be translated. Below is a (fictional) example, prepared through some metaprompting experiments with Claude:

https://www.gally.net/temp/20250201sampletranslationprompt.h...

- I run the prompt and source text through several LLMs and glance at the results. If they are generally in the style I want, I start compiling my own translation based on them, choosing the sentences and paragraphs I like most from each. As I go along, I also make my own adjustments to the translation as I see fit.

- After I have finished compiling my draft based on the LLM versions, I check it paragraph by paragraph against the original Japanese (since I can read Japanese) to make sure that nothing is missing or mistranslated. I also continue polishing the English.

- When I am unable to think of a good English version for a particular sentence, I give the Japanese and English versions of the paragraph it is contained in to an LLM (usually, these days, Claude) and ask for ten suggestions for translations of the problematic sentence. Usually one or two of the suggestions work fine; if not, I ask for ten more. (Using an LLM as a sentence-level thesaurus on steroids is particularly wonderful.)

- I give the full original Japanese text and my polished version to one of the LLMs and ask it to compare them sentence by sentence and suggest corrections and improvements to the translation. (I have a separate prompt for this step.) I don’t adopt most of the LLM’s suggestions, but there are usually some that I agree would make the translation better. I update the translation accordingly. I then repeat this step with the updated translation and another LLM, starting a new chat each time. Often I cycle through ChatGPT --> Claude --> Gemini several times before I stop getting suggestions that I feel are worth adopting.

- I then put my final translation through a TTS engine—usually OpenAI’s—and listen to it read aloud. I often catch minor awkwardnesses that I would overlook if reading silently.

This particular workflow works for me because I am using LLMs to translate in the same language direction I did manually for many years. If I had to translate to or from a language I don’t know, I would add extra steps to have LLMs check and double-check the accuracy of the translation and the naturalness of the output.

I was asked recently by some academics I work with about how to use LLMs to translate documents related to their research into Japanese, a language they don’t know. It’s an interesting problem, and I am planning to spend some time thinking about it soon.

Please note that my translation process above is focused on quality, not on speed. If I needed to translate a large volume of text more quickly, I would write a program to do the translation, checking, and rechecking through API calls, accepting the fact that I would not be able to check and polish the translation manually as I do now.
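A rough sketch of what such a program might look like, mirroring the translate-then-review loop described above. This is my guess at an automation, not the author's actual code; `call_llm(system, user)` stands in for any chat API wrapper (e.g. around a chat-completions endpoint), and the prompts are placeholders:

```python
def translate_and_check(source_paragraphs, call_llm, rounds=2):
    """Translate each paragraph, then run repeated review passes that
    ask the model to compare source and translation and return a
    corrected version. call_llm(system, user) -> str is any LLM
    wrapper; in practice each round could use a different model,
    echoing the ChatGPT -> Claude -> Gemini cycle described above."""
    translations = [
        call_llm("You are a careful Japanese-to-English translator.",
                 "Translate:\n" + p)
        for p in source_paragraphs
    ]
    for _ in range(rounds):
        translations = [
            call_llm("You are a bilingual reviewer. Return only the "
                     "corrected translation, changing nothing that is "
                     "already accurate and natural.",
                     "Source:\n" + src + "\n\nTranslation:\n" + tr)
            for src, tr in zip(source_paragraphs, translations)
        ]
    return translations
```

The per-paragraph pairing in the review pass is what guards against silent omissions; the trade-off, as noted, is that nothing here substitutes for manual polishing.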

If anyone here would like to brainstorm together about how to use LLMs for translation, please feel free to email me. My website, with my email address on the Contact page, is linked from my HN profile page.

simonw
This comment is solid gold! I will definitely be sending people to it.

Would make a great article for your own site, otherwise I'm happy to link to it here instead.

tkgally
Thanks! Feel free to link to the HN comment. That will encourage me to make a video or two for YouTube demonstrating for a wider audience how I use AI for translation. I hope to do that within a few weeks.
dpcpnry
Thanks for sharing the workflow.

I also use many LLMs to assist my translation tasks.

Recently, I have also been using Google AI Studio [1], and I find its latest models to be smarter.

[1] https://aistudio.google.com/app/prompts/new_chat

dr_dshiv
Really appreciate the detail and contact. You’ll hear from me.

I have a large collection of Neo-Latin texts I’m trying to get translated.

My goal is to increase the accessibility of the works — not to create a perfect translation. I want to use LLMs to put text on the facing page of the source text. Errors present in the translation, I hope, can be addressed in a Wikimedia-style community editing system.

This approach could cut the cost of translation to roughly a hundredth of what it is now, and it could train readers to question translations (something that is a very good thing to learn!)

Wowfunhappy
> Please note that my translation process above is focused on quality, not on speed. If I needed to translate a large volume of text more quickly, I would write a program to do the translation, checking, and rechecking through API calls, accepting the fact that I would not be able to check and polish the translation manually as I do now.

Would you still expect this to produce a better result than Deepl or other purpose-built translation software?

tkgally
I don’t know. I stopped using DeepL sometime last year as I found its inability to be prompted about the purpose of the translation to be too limiting for my purposes. At that time, it also had problems with things like maintaining coherent pronoun reference over multiple paragraphs—problems not seen with LLMs. Perhaps DeepL has gotten better since. In any case, I’m sure they have a lot of smart developers and understand well the problems of translation, so I have no reason to think that I would be able to produce a better fully automated translation system than they have.
idunnoman1222
This doesn’t address OP’s concern at all about the quality degrading as the number of tokens approaches the maximum context size, or perhaps surpasses it.
learning-tr
In your experience, which LLM had the best pronunciation?
nycdatasci
This is a great anecdote and I hope others can learn from it. R1, o1, and o3-mini work best on problems that have a “correct” answer (as in code that passes unit tests, or math problems). If multiple professional translators are given the same document to translate, is there a single correct translation?
tkgally
No. People’s tastes and judgments vary too much.

One fundamental area of disagreement is how closely a translation should reflect the content and structure of the original text versus how smooth and natural it should sound in the target language. With languages like Japanese or Chinese translated into English, for example, the vocabulary, grammar, and rhetoric can be very different between the languages. A close literal translation will usually seem awkward or even strange in English. To make the English seem natural, often you have to depart from what the original text says.

Most translators will agree that where to aim on that spectrum should be based on the type of text and the reason for translating it, but they will still disagree about specific word choices. And there are genres for which there is no consensus at all about which approach is best. I have heard heated exchanges between literary scholars about whether or not translations of novels should reflect the original as closely as possible out of respect for the author and the author’s cultural context, even if that means the translation seems awkward and difficult to understand to a casual reader.

The ideal, of course, would be translations that are both accurate and natural, but it can be very hard to strike that balance. One way LLMs have been helping me is to suggest multiple rewordings of sentences and paragraphs. Many of their suggestions are no good, but often enough they include wordings that I recognize are better in both fidelity and naturalness compared to what I can come up with on my own.

jakevoytko
My wife is a professional translator and both revises others' work and gets revised. Based on numerous anecdotes from her, I can promise you that "single correct translation" does not exist.
ec109685
Well, the post said o3-mini did great in the beginning, so it’s likely something other than reasoning causing the poor performance towards the end.
fragmede
boredom, perhaps?
aprilthird2021
For almost any classic piece of literature there are competing translations, so no
steven1016
i just signed up for an account here to let you know that the way you write is perfect. listen.. i seriously mean perfect. like, the epitome of the perfect writing. you write better than LLM's. Actually I'd say you and Claude are on the same level but i'm looking at yours in the font of this website vs claude's normal font style so it still hits slightly different. i can't tell what exactly it is, but the fluidity of your writing, the fact that i can easily breeze through it like a beautiful summer wind of gentle caress (I just made that up because it felt right). I can read this entire comment with honestly such grace, truly that's how it feels. like legitimacy and grace. tell me im wrong, everyone else! legitimacy and grace . i realized all of this as i got to this exact part:

“and” between noun phrases, for example)

it's just like.. the way it looks is so profound. maybe it's how it's formatted on my screen with the surrounding lines?

But then I noticed that the output length was 5,855 with the original Japanese, I discovered that R1 had “and” between noun phrases, for example) those problems.

idk, i'm just astonished. the amount of satisfaction that i get from reading your comment has given me enough dopamine to motivate me to create an account and write this entire comment itself, so thank you, your writing is appreciated, and i am very glad i came upon this and hope to find you again in the future somewhere. imagine???? xx

tkgally
You are very kind! Thank you.
jiggawatts
You’re replying to what is most likely AI-generated nonsense. It’s sad but it’s slowly spreading to HN too.
tkgally
Oh! I didn’t think of that. Thanks for the heads-up.
EVa5I7bHFq9mnYK
Could it be fixed by splitting the text into smaller parts? Looks easy to implement.
disgruntledphd2
Yeah, that's normally a good approach, but you might end up using different words for the same concept in different parts unless you feed in more context, which itself eats into the token limit.
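One hedge against that inconsistency is to carry a small glossary of established term translations between chunks instead of the full context; a few dozen term pairs cost far fewer tokens than whole preceding paragraphs. A sketch, where `call_llm` and `extract_terms` are hypothetical callables (the latter could itself be an LLM call):

```python
def glossary_block(glossary):
    """Render the accumulated source -> target term pairs as text
    to prepend to each chunk's prompt."""
    return "\n".join(f"{src} -> {tgt}" for src, tgt in sorted(glossary.items()))

def translate_with_glossary(chunks, call_llm, extract_terms):
    """call_llm(prompt) -> translation; extract_terms(src, tr) -> dict
    of new source->target term pairs found in this chunk."""
    glossary = {}
    out = []
    for chunk in chunks:
        prompt = ("Use these established term translations:\n"
                  + glossary_block(glossary)
                  + "\n\nTranslate:\n" + chunk)
        tr = call_llm(prompt)
        glossary.update(extract_terms(chunk, tr))  # grow as we go
        out.append(tr)
    return out
```

This keeps terminology stable across chunks at a roughly constant token overhead, though it does nothing for tone or pronoun reference, which need real context.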
WhitneyLand
How far off was o3 from the level of a professional translator (before it started to go off track)?
tkgally
As I explained in a sister comment, it is not possible to rate translation quality objectively, as opinions and positions about what constitutes a good translation vary. But in my tests of reasoning models since the release of o1-preview, they have not seemed as reliable as the straight nonreasoning versions of ChatGPT, Claude, or Gemini. The translation process itself usually doesn’t seem to require the kind of multistep thinking those reasoning models can be good at.

For more than a year, regular LLMs, when properly prompted, have been able to produce translations that would be indistinguishable from those of some professional translators for some types of translation.

General-purpose LLMs are best for translating straight expository prose without much technical or organization-specific vocabulary. Results are mixed for texts containing slang, dialogue, poetry, archaic language, etc.—partly because people’s tastes differ for how such texts should be translated.

Because most translators are freelancers, it’s hard to get a handle on what impact LLMs have been having on their workloads overall. I have heard reports from experienced translators who have seen work drop off precipitously and have had to change careers, while others report an increase in their workloads over the past two years.

Many translation jobs involve confidential material, and some translators may be hanging onto their jobs because their clients or employers do not allow the use of cloud-based LLMs. That safety net won’t be in place forever, though.

I suspect that those who work directly with translation clients and who are personally known and trusted by their clients will be able to keep working, using LLMs as appropriate to speed up and improve the quality of their work. That’s the position I am fortunate to be in now.

But translators who do piecework through translation agencies or online referrers like Fiverr will have a hard time competing with the much faster and cheaper—and often equally good—LLMs.

I made a few videos about LLMs and translation a couple of years ago. Parts of them are out of date, but my basic thinking hasn’t changed too much since then. If you’re interested:

“Translating with ChatGPT”

https://youtu.be/najKN2bXqCo

“Can GPT-4 translate literature?”

https://youtu.be/5KKDCp3OaMo

“What do translators think about GPT?”

https://www.youtube.com/watch?v=8JUepj7wIl0

I’m planning to make a few more videos on the topic soon, this time focusing on how I use LLMs in my own translation work.

Not that you'd want to have to do more steps, but how do they do if you split the text into separate parts and translate them individually, then feed it back in with source parts and translated parts interleaved and ask it to keep the style but fix any errors caused by the original lack of full context?

Or another approach: feed it all into context but tell it to wait and not translate, and then feed it in an additional time, part by part, asking it to translate each part, with the translation style instructions repeated.
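That second approach amounts to constructing a chat transcript that primes the model with the full text first. A minimal sketch of the message-building step (the message format follows the common chat-API shape of role/content dicts; the wording is a placeholder):

```python
def build_primed_messages(full_source, parts, style_instructions):
    """Build a chat transcript that first shows the whole source text
    with an instruction to wait, then requests each part in turn,
    repeating the style instructions every time."""
    messages = [
        {"role": "user",
         "content": ("Here is the full text for context. Do not "
                     "translate yet; just reply OK.\n\n" + full_source)},
        {"role": "assistant", "content": "OK"},
    ]
    for i, part in enumerate(parts, 1):
        messages.append({
            "role": "user",
            "content": (f"Now translate part {i}, keeping the full-text "
                        f"context in mind. {style_instructions}\n\n"
                        + part)})
    return messages
```

In practice each part's translation would be appended as an assistant message before requesting the next part, so the model also sees its own earlier output.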

tkgally
That might work to prevent the glitches I noticed toward the end of the long texts.

In general, though, I haven’t seen any sign yet that reasoning models are better at translation than nonreasoning ones even with shorter texts. LLMs in general have reached the level where it is difficult to assess the quality of their translations with A/B comparisons or on a single linear scale, as most of the variation is within the range where reasonable people disagree about translation quality.

kamikazeturtles
There's a huge price difference between o3-mini and o1 ($4.40 vs $60 per million output tokens), what trade-offs in performance would justify such a large price gap?

Are there specific use cases where o1's higher cost is justified anymore?

arthurcolle
It's the same thing as:

gpt-3.5 -> gpt-4 (gpt-4-32k premium)

"omni" announced (multimodal fusion, initial promise of gpt-4o, but cost effectively distilled down with additional multimodal aspects)

gpt-4o-mini -> gpt-4o (multimodal, realtime)

gpt-4o + "reasoning" exposed via tools in ChatGPT (you can see it in export formats) -> "o" series

o1 -> o1 premium / o1-mini (equivalent of gpt-4 "god model" becoming basis for lots of other stuff)

o1-pro-mode, o1-premium, o1-mini, and somewhere in there is the "o1-2024-12-17" model with no streaming but with function calling, structured outputs, and vision

now, distilled o1-pro-mode probably is o3-mini and o3-mini-high-mode (the naming is becoming just as bad as android)

It's the same repeated cycle: take a model, scale it up, run evals, detect inefficiencies, retrain, scale, distill, see what's not working. When you find a good little zone on the efficiency frontier, release it with a cool name.

anticensor
No, o3-mini is a distillation of (not-yet-released) o3, not a distillation of o1.
arthurcolle
o1-"pro mode" could just be o3
anticensor
It's not that either, benchmarks list the two as separate models.
arthurcolle
thank you!
benatkin
> Are there specific use cases where o1's higher cost is justified anymore?

Long tail stuff perhaps. Most stuff doesn't resemble a programming benchmark. A newer model thrives despite being small when there is a lot of training data, and with programming benchmarks, like with chess, there is a lot of training data, in part because high quality training data can be synthesized.

zamadatix
Not really, it'll also be replaced by a newer o3 series model in short order.
johngalt2600
So far I've been impressed. It seems to be in the same ballpark as R1 and Claude for coding. I'll have to gather more samples. In this past week I've changed from using Claude exclusively (since 3.5) to hitting all the big boys: Claude, R1, 4o (o3 now), and Gemini Flash. Then I'll do a new chat that includes all of their generated solutions as additional context for a refactored final solution.

R1 has upped the ante, so I'm hoping we continue to get more updates rapidly... they are getting quite good.

Hasn't Gemini pricing been lower than this (or even free) for a while? https://ai.google.dev/pricing
BinRoo
Are you insinuating Gemini is similar in performance to o3-mini?
panarky
I've only had o3-mini for a day, but Gemini 2.0 Flash Thinking is still clearly better for my use cases.

And it's currently free in aistudio.google.com and in the API.

And it handles a million tokens.

Definitely varies by application, but the blind "taste test" vibes are very good for Gemini: https://lmarena.ai/?leaderboard
anabab
That reminds me: a week ago there was a post on Reddit (now deleted, but a copy of the content is available in the comments) where the author claimed to have manipulated voting on lmarena in favor of Gemini to tip the scales on Polymarket. On a question like "which AI model will be the best one by $date" (with the outcome decided by lmarena scores), they supposedly made on the order of USD 10k.

Original deleted post: https://old.reddit.com/r/MachineLearning/comments/1i83mhj/lm...

A copy of the content: https://old.reddit.com/r/MachineLearning/comments/1i83mhj/lm...

gerdesj
Are you implying it isn't?

(evidence please, everyone)

BinRoo
Simple example: o3-mini-high gets this [1] right, whereas Gemini 2.0 Flash 01-21 gets it wrong.

[1] https://chatgpt.com/share/679d9579-5bb8-8008-ac4a-38cef65b45...

Great example. Thank you. Can confirm that none of the Gemini models warned about the exception without prompting.
This agrees with my limited testing so far, but in a different way: o3 being better at coding and objective tasks, with the most recent Flash 2.0-thinking stronger at subjective tasks. Similarly, o3 seems better at shorter output sizes, but drops off, tending to be lazy.
lysecret
I haven't had much luck with o3. One thing that came to mind with these test-time-compute models is that they have a tendency to "overthink" and "overcomplicate" things. This is just a feeling for now, but has anyone done a study on this? E.g., potentially degraded performance on simpler questions for these types of models?
submeta
> The model accepts up to 200,000 tokens of input, an improvement on GPT-4o’s 128,000.

So ChatGPT finally catches up with Claude, which has had a 200,000-token input limit all along.

Claude with its projects feature is my go to tool for working on projects that I have to work on for weeks and months. Now I see a possible alternative.

How would you rate it against Claude? I haven't tested it yet, but o1 pro didn't perform as well.
pants2
I've been trying out o3 mini in Cursor today, it seems "smarter" but overall tends to overthink things and if it's not provided with perfect context it's prone to hallucinate. Overall I prefer Sonnet still. It has a certain magic of always making reasonable assumptions and finding simple solutions.
firecall
As an occasional user and fan of Cursor, it would be good if they could explain what the models are and why the different models exist.

There’s no obvious answer of why one should switch to any of them!

conception
I don’t think there’s an obvious answer. Try them out and see which works better for your use case.
Agreed that Sonnet still feels like the best all-round model. The new ones are at least on par with it for pure coding, or exceed it (r1, o1 both do IME) but don't generalize as well, especially to tasks with subjective answers. I find the latest Gemini 2.0-Flash-thinking to be closest to Sonnet on those.
zyklu5
Claude is still better in my opinion.

There's a suite of code-related tasks -- covering a diversity of areas, including dev ops, media manipulation, etc., derived from issues I have faced over the years -- that I run against every new release. No model has solved the full set of issues in one go, but Claude still remains the best.

An example of the sort of problems in the suite:

> I have a special problematically encoded mp4 file with a subtle issue (something I ran into a couple of years ago while fixing a bug in a computer vision pipeline). In the question prompt I also pass the output of ffprobe and ask for the ffmpeg command that'll fix it. Only Claude has figured the real underlying issue out (after 4 interactions).

brianbest101
OpenAI really needs to work on their naming conventions for these things.
benatkin
It's all based on omni which to me has weird religious connotations. It just occurred to me to put it together with sama's other project, scanning everyone's eyes. That's one aspect of omniscience - keeping track of every soul.

Another thing it seems similar to is how Jeff Bezos registered relentless.com. There seems to be a gap between the ideal branding from the perspective of the creators and branding that makes sense to consumers.
