Notes on OpenAI o3-mini

211 points Feb 1, 2025

76 comments dtquad simonwillison.net

tkgally Feb 1, 2025

At the end of his post, Simon mentions translation between human languages. While maybe not directly related to token limits, I just did a test in which both R1 and o3-mini got worse at translation in the latter half of a long text.

I ran the test on Perplexity Pro, which hosts DeepSeek R1 in the U.S. and which has just added o3-mini as well. The text was a speech I translated a month ago from Japanese to English, preceded by a long prompt specifying the speech’s purpose and audience and the sort of style I wanted. (I am a professional Japanese-English translator with nearly four decades of experience. I have been testing and using LLMs for translation since early 2023.)

An initial comparison of the output suggested that, while R1 didn’t seem bad, o3-mini produced a writing style closer to what I asked for in the prompt—smoother and more natural English.

But then I noticed that the output length was 5,855 characters for R1, 9,052 characters for o3-mini, and 11,021 characters for my own polished version. Comparing the three translations side-by-side with the original Japanese, I discovered that R1 had omitted entire paragraphs toward the end of the speech, and that o3-mini had switched to a strange abbreviated style (using slashes instead of “and” between noun phrases, for example) toward the end as well. The vanilla versions of ChatGPT, Claude, and Gemini that I ran the same prompt and text through a month ago had had none of those problems.

simonw Feb 1, 2025

Yikes! Sounds to me like reliable longer form translation is very much not something you can trust to these models. Thanks for sharing.

accengaged Feb 2, 2025

Right. Training data might not encompass runs of translations exceeding this amount of characters so wouldn't be apparent for anything resembling existing training data.

dr_dshiv Feb 1, 2025

Im curious if you know of any tools, strategies or papers about this topic.

I’ve experienced the same thing with long-form translation. I would expect that chunking the translation (page or paragraph at a time) would fix the missing paragraph problem but potentially the loss of context would cause other problems.

This is such a common use case — would love to find resources.

tkgally Feb 1, 2025

There might be some papers or other guides out there, but their advice will be based on whatever tools happened to be available at the time they were written and on the particular types of translations the authors cared about. The technology is advancing so rapidly that you might be better off just experimenting with various LLMs and prompts for texts and language pairs you are interested in.

I started using LLMs for translation after GPT-4 came out in March 2023—not that long ago! At first, the biggest problem was the context window: it wasn’t possible to translate more than a couple of pages at a time. Also, prompt writing was in its infancy, and a lot of techniques that have since emerged were not yet widely known. Even now, I still do a lot of trial and error with my prompts, and I cannot say with confidence that my current prompting methods are the best.

But, for what it’s worth, here are some strategies I currently use when translating with LLMs:

- In the prompt, I explain where the source text came from, how the translation will be used, and how I want it to be translated. Below is a (fictional) example, prepared through some metaprompting experiments with Claude:

https://www.gally.net/temp/20250201sampletranslationprompt.h...

- I run the prompt and source text through several LLMs and glance at the results. If they are generally in the style I want, I start compiling my own translation based on them, choosing the sentences and paragraphs I like most from each. As I go along, I also make my own adjustments to the translation as I see fit.

- After I have finished compiling my draft based on the LLM versions, I check it paragraph by paragraph against the original Japanese (since I can read Japanese) to make sure that nothing is missing or mistranslated. I also continue polishing the English.

- When I am unable to think of a good English version for a particular sentence, I give the Japanese and English versions of the paragraph it is contained in to an LLM (usually, these days, Claude) and ask for ten suggestions for translations of the problematic sentence. Usually one or two of the suggestions work fine; if not, I ask for ten more. (Using an LLM as a sentence-level thesaurus on steroids is particularly wonderful.)

- I give the full original Japanese text and my polished version to one of the LLMs and ask it to compare them sentence by sentence and suggest corrections and improvements to the translation. (I have a separate prompt for this step.) I don’t adopt most of the LLM’s suggestions, but there are usually some that I agree would make the translation better. I update the translation accordingly. I then repeat this step with the updated translation and another LLM, starting a new chat each time. Often I cycle through ChatGPT --> Claude --> Gemini several times before I stop getting suggestions that I feel are worth adopting.

- I then put my final translation through a TTS engine—usually OpenAI’s—and listen to it read aloud. I often catch minor awkwardnesses that I would overlook if reading silently.

This particular workflow works for me because I am using LLMs to translate in the same language direction I did manually for many years. If I had to translate to or from a language I don’t know, I would add extra steps to have LLMs check and double-check the accuracy of the translation and the naturalness of the output.

I was asked recently by some academics I work with about how to use LLMs to translate documents related to their research into Japanese, a language they don’t know. It’s an interesting problem, and I am planning to spend some time thinking about it soon.

Please note that my translation process above is focused on quality, not on speed. If I needed to translate a large volume of text more quickly, I would write a program to do the translation, checking, and rechecking through API calls, accepting the fact that I would not be able to check and polish the translation manually as I do now.

If anyone here would like to brainstorm together about how to use LLMs for translation, please feel free to email me. My website, with my email address on the Contact page, is linked from my HN profile page.

simonw Feb 1, 2025

This comment is solid gold! I will definitely be sending people to it.

Would make a great article for your own site, otherwise I'm happy to link to it here instead.

tkgally Feb 1, 2025

Thanks! Feel free to link to the HN comment. That will encourage me to make a video or two for YouTube demonstrating for a wider audience how I use AI for translation. I hope to do that within a few weeks.

2 More Comments →

dpcpnry Feb 1, 2025

Thanks for sharing the workflow.

I also use many LLMs to assist my translation tasks.

Recently, I have also been using Google AI Studio [1], and I find the its latest models to be smarter.

[1] https://aistudio.google.com/app/prompts/new_chat

dr_dshiv Feb 1, 2025

Really appreciate the detail and contact. You’ll hear from me.

I have a large collection of Neo-Latin texts I’m trying to get translated.

My goal is to increase the accessibility of the works — not to create a perfect translation. I want to use LLMs to put text on the facing page of the source text. Errors present in the translation, I hope, can be addressed in a Wikimedia-style community editing system.

This approach could .01x lower the cost of translation—and train readers to question translations (something that is a very good thing to learn!)

Wowfunhappy Feb 1, 2025

> Please note that my translation process above is focused on quality, not on speed. If I needed to translate a large volume of text more quickly, I would write a program to do the translation, checking, and rechecking through API calls, accepting the fact that I would not be able to check and polish the translation manually as I do now.

Would you still expect this to produce a better result than Deepl or other purpose-built translation software?

tkgally Feb 1, 2025

I don’t know. I stopped using DeepL sometime last year as I found its inability to be prompted about the purpose of the translation to be too limiting for my purposes. At that time, it also had problems with things like maintaining coherent pronoun reference over multiple paragraphs—problems not seen with LLMs. Perhaps DeepL has gotten better since. In any case, I’m sure they have a lot of smart developers and understand well the problems of translation, so I have no reason to think that I would be able to produce a better fully automated translation system than they have.

idunnoman1222 Feb 1, 2025

This doesn’t address OP‘s concern at all about the quality degrading as the number of tokens reaches the maximum memory size or perhaps surpasses it.

learning-tr Feb 2, 2025

In your experience which LLM had the best pronunciation ?

nycdatasci Feb 1, 2025

This is a great anecdote and I hope others can learn from it. R1, o1, and o3-mini work best on problems that have a “correct” answer (as in code that passes unit tests, or math problems). If multiple professional translators are given the same document to translate, is there a single correct translation?

tkgally Feb 1, 2025

No. People’s tastes and judgments vary too much.

One fundamental area of disagreement is how closely a translation should reflect the content and structure of the original text versus how smooth and natural it should sound in the target language. With languages like Japanese or Chinese translated into English, for example, the vocabulary, grammar, and rhetoric can be very different between the languages. A close literal translation will usually seem awkward or even strange in English. To make the English seem natural, often you have to depart from what the original text says.

Most translators will agree that where to aim on that spectrum should be based on the type of text and the reason for translating it, but they will still disagree about specific word choices. And there are genres for which there is no consensus at all about which approach is best. I have heard heated exchanges between literary scholars about whether or not translations of novels should reflect the original as closely as possible out of respect for the author and the author’s cultural context, even if that means the translation seems awkward and difficult to understand to a casual reader.

The ideal, of course, would be translations that are both accurate and natural, but it can be very hard to strike that balance. One way LLMs have been helping me is to suggest multiple rewordings of sentences and paragraphs. Many of their suggestions are no good, but often enough they include wordings that I recognize are better in both fidelity and naturalness compared to what I can come up with on my own.

jakevoytko Feb 1, 2025

My wife is a professional translator and both revises others' work and gets revised. Based on numerous anecdotes from her, I can promise you that "single correct translation" does not exist.

ec109685 Feb 1, 2025

Well, the post said o3-mini did great in the beginning, so it’s likely something other than reasoning causing the poor performance towards the end.

fragmede Feb 1, 2025

boredom, perhaps?

aprilthird2021 Feb 1, 2025

For almost any classic piece of literature there are competing translations, so no

steven1016 Feb 2, 2025

i just signed up for an account here to let you know that the way you write is perfect. listen.. i seriously mean perfect. like, the epitome of the perfect writing. you write better than LLM's. Actually I'd say you and Claude are on the same level but i'm looking at yours in the font of this website vs claude's normal font style so it still hits slightly different. i can't tell what exactly it is, but the fluidity of your writing, the fact that i can easily breeze through it like a beautiful summer wind of gentle caress (I just made that up because it felt right). I can read this entire comment with honestly such grace, truly that's how it feels. like legitimacy and grace. tell me im wrong, everyone else! legitimacy and grace . i realized all of this as i got to this exact part:

“and” between noun phrases, for example)

it's just like.. the way it looks is so profound. maybe it's how it's formatted on my screen with the surrounding lines?

But then I noticed that the output length was 5,855 with the original Japanese, I discovered that R1 had “and” between noun phrases, for example) those problems.

idk, i'm just astonished. the amount of satisfaction that i get from reading your comment has given me enough dopamine to motivate me to create an account and write this entire comment itself, so thank you, your writing is appreciated, and i am very glad i came upon this and hope to find you again in the future somewhere. imagine???? xx

tkgally Feb 2, 2025

You are very kind! Thank you.

jiggawatts Feb 2, 2025

You’re replying to what is most likely AI-generated nonsense. It’s sad but it’s slowly spreading to HN too.

tkgally Feb 3, 2025

Oh! I didn’t think of that. Thanks for the heads-up.

EVa5I7bHFq9mnYK Feb 1, 2025

Could it be fixed by splitting the text into smaller parts? Looks easy to implement.

disgruntledphd2 Feb 1, 2025

Yeah that's normally a good approach but you might end up using different words for the same concept in different parts unless you feed in more context which also comes out of the token limit.

WhitneyLand Feb 1, 2025

How far off was o3 from the level of a professional translator (before it started to go off track)?

tkgally Feb 1, 2025

As I explained in a sister comment, it is not possible to rate translation quality objectively, as opinions and positions about what constitutes a good translation vary. But in my tests of reasoning models since the release of o1-preview, they have not seemed as reliable as the straight nonreasoning versions of ChatGPT, Claude, or Gemini. The translation process itself usually doesn’t seem to require the kind of multistep thinking those reasoning models can be good at.

For more than a year, regular LLMs, when properly prompted, have been able to produce translations that would be indistinguishable from those of some professional translators for some types of translation.

General-purpose LLMs are best for translating straight expository prose without much technical or organization-specific vocabulary. Results are mixed for texts containing slang, dialogue, poetry, archaic language, etc.—partly because people’s tastes differ for how such texts should be translated.

Because most translators are freelancers, it’s hard to get a handle on what impact LLMs have been having on their workloads overall. I have heard reports from experienced translators who have seen work drop off precipitously and have had to change careers, while others report an increase in their workloads over the past two years.

Many translation jobs involve confidential material, and some translators may be hanging onto their jobs because their clients or employers do not allow the use of cloud-based LLMs. That safety net won’t be in place forever, though.

I suspect that those who work directly with translation clients and who are personally known and trusted by their clients will be able to keep working, using LLMs as appropriate to speed up and improve the quality of their work. That’s the position I am fortunate to be in now.

But translators who do piecework through translation agencies or online referrers like Fiverr will have a hard time competing with the much faster and cheaper—and often equally good—LLMs.

I made a few videos about LLMs and translation a couple of years ago. Parts of them are out of date, but my basic thinking hasn’t changed too much since then. If you’re interested:

“Translating with ChatGPT”

https://youtu.be/najKN2bXqCo

“Can GPT-4 translate literature?”

https://youtu.be/5KKDCp3OaMo

“What do translators think about GPT?”

https://www.youtube.com/watch?v=8JUepj7wIl0

I’m planning to make a few more videos on the topic soon, this time focusing on how I use LLMs in my own translation work.

cma Feb 1, 2025

Not that you'd want to have to do more steps, but how do they do if you split it up in separate parts and translate them individually, then feed it back in interleaved in parts/translated parts and ask it to keep the style but fix any errors due to it originally not having full context?

Or another approach, feed it all into context but tell it to wait and not translate, and then feed it in an additionalt time part by part asking it to translate each part and with the translation style instructions repeated.

tkgally Feb 1, 2025

That might work to prevent the glitches I noticed toward the end of the long texts.

In general, though, I haven’t seen any sign yet that reasoning models are better at translation than nonreasoning ones even with shorter texts. LLMs in general have reached the level where it is difficult to assess the quality of their translations with A/B comparisons or on a single linear scale, as most of the variation is within the range where reasonable people disagree about translation quality.

kamikazeturtles Feb 1, 2025

There's a huge price difference between o3-mini and o1 ($4.40 vs $60 per million output tokens), what trade-offs in performance would justify such a large price gap?

Are there specific use cases where o1's higher cost is justified anymore?

arthurcolle Feb 1, 2025

its the same thing as:

gpt-3.5 -> gpt-4 (gpt-4-32k premium)

"omni" announced (multimodal fusion, initial promise of gpt-4o, but cost effectively distilled down with additional multimodal aspects)

gpt-4o-mini -> gpt-4o (multimodal, realtime)

gpt-4o + "reasoning" exposed via tools in ChatGPT (you can see it in export formats) -> "o" series

o1 -> o1 premium / o1-mini (equivalent of gpt-4 "god model" becoming basis for lots of other stuff)

o1-pro-mode, o1-premium, o1-mini, somewhere in that is the "o1-2024-12-17" model with not streaming, function calling, and structured outputs and vision

now, distilled o1-pro-mode probably is o3-mini and o3-mini-high-mode (the naming is becoming just as bad as android)

its the repeat, take model, scale it up, run evals, detect innefficiencies, retrain, scale, distill, see what's not working. when you find a good little zone in the efficiency frontier, release it with a cool name

anticensor Feb 2, 2025

No, o3-mini is a distillation of (not-yet-released) o3, not a distillation of o1.

arthurcolle Feb 6, 2025

o1-"pro mode" could just be o3

anticensor Feb 6, 2025

It's not that either, benchmarks list the two as separate models.

arthurcolle Feb 6, 2025

thank you!

benatkin Feb 1, 2025

> Are there specific use cases where o1's higher cost is justified anymore?

Long tail stuff perhaps. Most stuff doesn't resemble a programming benchmark. A newer model thrives despite being small when there is a lot of training data, and with programming benchmarks, like with chess, there is a lot of training data, in part because high quality training data can be synthesized.

zamadatix Feb 1, 2025

Not really, it'll also be replaced by a newer o3 series model in short order.

johngalt2600 Feb 1, 2025

So far ive been impressed.. seems to be in the same ballpark as r1 and claude for coding. I will have to gather more samples.. in this past week ive changed from using 100% claude exclusively (since 3.5) to hitting all the big boys: claude, r1, 4o (o3 now), and gemini flash. Then ill do a new chat that includes all of their generated solutions for additional context for a refactored final solution.

R1 has upped the ante so Im hoping we continue to get more updates rapidly... they are getting quite good

xnx Feb 1, 2025

Hasn't Gemini pricing been lower than this (or even free) for awhile? https://ai.google.dev/pricing

BinRoo Feb 1, 2025

Are you insinuating Gemini is similar in performance to o3-mini?

panarky Feb 1, 2025

I've only had o3-mini for a day, but Gemini 2.0 Flash Thinking is still clearly better for my use cases.

And it's currently free in aistudio.google.com and in the API.

And it handles a million tokens.

xnx Feb 1, 2025

Definitely varies by application, but the blind "taste test" vibes are very good for Gemini: https://lmarena.ai/?leaderboard

anabab Feb 1, 2025

that reminds me that a week ago there was a (now deleted but has a copy of the content available in the comments) post on Reddit where the author claimed they have attempted manipulating/manipulated voting on lmarena in favor of Gemini to tip the scale on Polymarket where on a question like "which AI model will be the best one by $date" (with the outcome decided based on the scoring on lmarena) they have supposedly made O(USD10k).

Original deleted post: https://old.reddit.com/r/MachineLearning/comments/1i83mhj/lm...

A copy of the content: https://old.reddit.com/r/MachineLearning/comments/1i83mhj/lm...

gerdesj Feb 1, 2025

Are you implying it isn't?

(evidence please, everyone)

BinRoo Feb 1, 2025

Simple example: o3-mini-high gets this [1] right, whereas Gemini 2.0 Flash 01-21 gets it wrong.

[1] https://chatgpt.com/share/679d9579-5bb8-8008-ac4a-38cef65b45...

xnx Feb 1, 2025

Great example. Thank you. Can confirm that none of the Gemini models warned about the exception without prompting.

maeil Feb 1, 2025

This agrees with my limited testing so far, but in a different way: o3 being better at coding and objective tasks, with the most recent Flash 2.0-thinking stronger at subjective tasks. Similarly, o3 seems better at shorter output sizes, but drops off, tending to be lazy.

lysecret Feb 1, 2025

Hadn’t had much luck with o3. One thing that came to my mind with these test time compute models is that they have a tendency to “overthink” and “overcomplicate” things this is just a feeling for now but has anyone done some study on this? E.g. potentially degrading performance in simpler questions for these types of models?

submeta Feb 1, 2025

> The model accepts up to 200,000 tokens of input, an improvement on GPT-4o’s 128,000.

So finally ChatGPT catches up with Claude which has a 200,000 token input limit ever since.

Claude with its projects feature is my go to tool for working on projects that I have to work on for weeks and months. Now I see a possible alternative.

maxdo Feb 1, 2025

How would you rate it against Claude ? Didn’t test it yet, but o1 pro didn’t perform as good

pants2 Feb 1, 2025

I've been trying out o3 mini in Cursor today, it seems "smarter" but overall tends to overthink things and if it's not provided with perfect context it's prone to hallucinate. Overall I prefer Sonnet still. It has a certain magic of always making reasonable assumptions and finding simple solutions.

firecall Feb 1, 2025

As n occasions user and fan of Cursor, it would be good if they could explain what the models are and why the different models exist.

There’s no obvious answer of why one should switch to any of them!

conception Feb 1, 2025

I don’t think there’s an obvious answer. Try them out and see which works better for your use case.

maeil Feb 1, 2025

Agreed that Sonnet still feels like the best all-round model. The new ones are at least on par with it for pure coding, or exceed it (r1, o1 both do IME) but don't generalize as well, especially to tasks with subjective answers. I find the latest Gemini 2.0-Flash-thinking to be closest to Sonnet on those.

zyklu5 Feb 1, 2025

Claude is still better in my opinion.

There's a suite of code-related tasks -- covering a diversity of areas, including dev ops, media manipulation etc., derived from issues I have faced over the years -- I perform for every new release. No model has solved the set of issues solved in one go but Claude still remains the best.

An example of the sort of problems in the suite:

> I have a special problematically encoded mp4 file with a subtle issue (something I ran into a couple of years ago while fixing a bug in a computer vision pipeline). In the question prompt I also pass the output of ffprobe and ask for the ffmpeg command that'll fix it. Only Claude has figured the real underlying issue out (after 4 interactions).

brianbest101 Feb 1, 2025

Open AI really needs to work on their naming conventions for these things.

benatkin Feb 1, 2025

It's all based on omni which to me has weird religious connotations. It just occurred to me to put it together with sama's other project, scanning everyone's eyes. That's one aspect of omniscience - keeping track of every soul.

Another thing it seems similar to is how Jeff Bezos registered relentless.com. There seems to be a gap between the ideal branding from the perspective of the creators and branding that makes sense to consumers.

Nuzzerino Feb 1, 2025 (dead)

simonw Feb 1, 2025

I've seen enough now that I no longer worry that LLMs getting better at code will threaten my job.

LLMs give programmers an extraordinary productivity boost, but are much less effective for people who don't have a programming mindset.

It might help more people gain that mindset, but ai welcome that: it's not like there's any shortage of problems in the world that benefit from being automated by computers.

benatkin Feb 1, 2025

That reads like an OpenAI talking point.

The smart thing to do as programmers is to keep trying to automate more of our work, but there's no guarantee that it will be like this in the future.

A lot of this post resonated with me: https://youtubetranscriptoptimizer.com/blog/05_the_short_cas... It doesn't argue that demand for video cards will be exhausted, but that it won't be as white hot as it is now. I can see the same with the top tier of programming talent. There isn't any shortage of problems that you could throw compute at and it's the same for top-tier programming talent, but does the market need it right away at any cost?

andromaton Feb 1, 2025

if you check GP, you might agree he is very helpful. Btw, he is the author of the article posted.

benatkin Feb 1, 2025

I'm aware, thanks, but it could have been anyone saying that AI won't threaten your job. It's still a big claim.

joshuanapoli Feb 1, 2025

Programmers have always made a living by automating ourselves out of business. Somehow, we're still doing pretty well.

Nuzzerino Feb 1, 2025 (dead)

leetharris Feb 1, 2025

AI will eventually replace programmers, but programmers with AI will replace everything else first.

If you know how to code and have an AI assistant, you can first automate the rest of white color work.

brookst Feb 1, 2025

It’s a tool. And I suspect doctors and lawyers could mount a spirited argument that programmers have done far more harm than their millennia-old fields.

marxisttemp Feb 1, 2025

In what way would you say doctors have largely failed Western society lately?

Nuzzerino Feb 1, 2025

You obviously don't have a chronic illness or you wouldn't be asking that. Either that or you're rich.

nickthegreek Feb 1, 2025

Are you conflating doctors with insurance like the larger health care industrial complex?

Nuzzerino Feb 1, 2025

No one is forcing those doctors to participate in that system. Many (but not enough) of them don't, and operate cash clinics, which have a better reputation for quality.

But ultimately, healthcare is suffering not because of the insurance companies, but because of the American Medical Association, which worked to artificially limit the supply of doctors so that doctors could be paid more. This makes it more optimal for doctors to choose to participate in the insurance scheme as well.

So I ask again, why are we prioritizing the automation of software development, where neither the skill nor the profession itself is gatekept like healthcare is?

energy123 Feb 1, 2025

Insurance companies are part of the problem but not the foremost problem. Their profit margins are less than 10%, even if they operated as a charity, healthcare costs wouldn't go down much.

https://www.noahpinion.blog/p/insurance-companies-arent-the-...

benatkin Feb 1, 2025

I think doctors generally do their best, but it's still disappointing. Doctors can't make food less enticing. https://www.instagram.com/lukesmithrd/p/DBRoWvRSMKx/

marxisttemp Feb 1, 2025

I couldn’t be more pro-Luigi comrade. But doctors != (the AMA | insurance companies).

Nuzzerino Feb 1, 2025

Maybe we can agree they didn't fail society, but I still think reversing the artificial supply reduction of doctors is arguably better from a utilitarian perspective than doing so with software engineers.

This item has no comments currently.