I’ve been trying to discover how o1 works and how it was trained since the model’s release. It’s like a good mystery novel and I’ve enjoyed trying to piece things together. After spending an inordinate amount of time on the subject, I’d like to get my thoughts down in writing. I’m not pretending to have all the answers, but I do think I have a clearer understanding, at least compared to where I started.
It’ll come as no surprise to anyone that I do believe o1 is a large language model. I believe it has a similar (if not identical) architecture to the 4o models. What sets o1 apart is its unique training regime and, to some degree, how it works at run time. But I do think it largely builds on the foundations of the GPT series, even if they are trying a new naming scheme.
Based on pricing, as well as “intelligence,” it’s possible that these models are larger.¹ Maybe o1 is really 6 times larger than GPT-4o and o1-mini about the same size as 4o. However, they do appear to sample rather quickly, so maybe they’re not much bigger. The increased pricing could be due to a more advanced decoding strategy. Perhaps o1 is the same size as 4o, but is sampled best-of-6 on a token or “thought”/line basis and the best completion is selected by a reward model, maybe the elusive Q*. However, I think Q* has more to do with training than test time; more on that later.
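To make that speculation concrete, here’s a minimal sketch of what best-of-n selection with a reward model could look like. Everything in it is hypothetical: `sample_completion` and `reward_score` are stand-in stubs, and `n=6` simply echoes the pricing multiple above; OpenAI hasn’t published anything like this.

```python
# Hypothetical sketch of best-of-n decoding with a reward model.
# `sample_completion` and `reward_score` are stubs, not real APIs.

def sample_completion(prompt: str, temperature: float = 1.0) -> str:
    """Draw one completion from the base model (stub)."""
    raise NotImplementedError

def reward_score(prompt: str, completion: str) -> float:
    """Score a completion with a reward model (stub)."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 6) -> str:
    """Sample n completions and keep the one the reward model prefers.

    If applied per "thought" rather than per full completion, the same
    loop would run once per double-newline-delimited segment instead.
    """
    candidates = [sample_completion(prompt) for _ in range(n)]
    scored = [(reward_score(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```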
Another possibility is that o1 is as much a test of a new business strategy as a new training regimen. We know that while revenues have jumped to $3-4B ARR, losses are far ahead.² LLMs, especially at the GPT-3.5 tier, have been largely commoditized at this point, and GPT-4-level models have become very competitive. OpenAI needs to differentiate, and o1 gives it a new opportunity to create a real moat. They’ve admitted that they are keeping the chain-of-thought tokens hidden, among other reasons, for competitive advantage.³ As much as I would like to see the reasoning traces, they are entirely within their rights to do so. All this to say, it’s possible that the o1 models are marked up more than the 4o models, and if they can maintain a strong, possibly 10x, margin, their latest raise at an eye-watering $157B post-money valuation makes more sense.
They’ve actually shared some great examples of what the hidden chain-of-thought looks like in the release blog post⁴ and I’ve copied those to a GitHub repo⁵ for interested parties. It’s very revealing. Assuming OpenAI hasn’t modified the output, it appears far less structured and more “stream of consciousness” than you might think.
There is, however, one easily noticeable structure: almost every thought appears to be delimited by two newlines. As I hinted at earlier, it’s possible that these represent “atomic” thoughts in the output. The model has been trained, either purposefully or incidentally, to format its thinking this way. There are none of the “thinking,” “reflection,” etc. XML tags that are popular in some chain-of-thought methods, but there are patterns. What really sets the outputs apart are the raw verbosity and the intellectual humility. The model isn’t afraid to acknowledge its mistakes, backtrack, start over, and really take its time solving the problem. Most LLMs are much more assertive and confidently get to an answer in less time. But they are also often wrong, so it’s very refreshing to see a model take its time.
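Since the published traces seem to use blank lines as the only delimiter, segmenting a trace into candidate “thoughts” is a one-liner. The sample trace below is invented for illustration; the real examples are in the repo linked above.

```python
# Split a reasoning trace into "atomic" thoughts on blank lines.
# The trace text here is made up; real traces are in the linked repo.

trace = """First, let's restate the problem in our own words.

Hmm, that approach won't work. Let me try a substitution instead.

Wait, I made an arithmetic mistake above. Backtracking.

So the answer should be 42."""

thoughts = [t.strip() for t in trace.split("\n\n") if t.strip()]
for i, thought in enumerate(thoughts, 1):
    print(f"{i}. {thought}")
```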
The model also, surprisingly, has a sense of time! I was somewhat blown away by a comment from Bob McGrew in an article by The Verge⁶ that the model has said (maybe paraphrased) “Oh, I’m running out of time, let me get to an answer quickly” in its chain-of-thought. To me this implies that the model is given a time limit or compute budget and is either trained to be aware of time as it passes, or the decrementing budget is automatically interleaved into the output or a system message. This could explain how they scaled inference-time compute to get anywhere from 20% to 80% on the AIME benchmark. They’ve also acknowledged that they want to give developers more control over how much time is spent in the future, and perhaps there is a very straightforward way to do this.
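This is pure speculation on my part, but one straightforward mechanism would be to interleave the remaining budget into the context as system messages between thoughts. In this sketch, `generate_next_thought` and the `FINAL ANSWER` stop marker are both hypothetical.

```python
# Purely speculative sketch: inject a decrementing compute budget
# into the conversation as system messages between "thoughts".

import time

def generate_next_thought(messages: list[dict]) -> str:
    """Produce the next double-newline-delimited thought (stub)."""
    raise NotImplementedError

def think_with_budget(user_prompt: str, budget_seconds: float = 60.0) -> list[str]:
    messages = [{"role": "user", "content": user_prompt}]
    thoughts: list[str] = []
    deadline = time.monotonic() + budget_seconds
    while (remaining := deadline - time.monotonic()) > 0:
        # Make the model "aware of time" by stating the remaining budget.
        messages.append({"role": "system",
                         "content": f"{remaining:.0f} seconds of thinking remain."})
        thought = generate_next_thought(messages)
        messages.append({"role": "assistant", "content": thought})
        thoughts.append(thought)
        if thought.endswith("FINAL ANSWER"):  # hypothetical stop marker
            break
    return thoughts
```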
So the model has been trained to take the user’s request very seriously, spend considerable time “thinking” about the problem, including backtracking or adopting new strategies, and then return an answer summary very similar to the 4o models’ (though typically more verbose and comprehensive). In the ChatGPT interface, an auto-generated summary of o1’s reasoning trace is also presented, though there are no guarantees of its accuracy. I would guess a much, much smaller, cheaper model is generating the summary, probably 4o-mini. After all this ceremony, the results have been very good, though not better than 4o on all tasks.
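If a small model really does write the visible summary, the plumbing could be as simple as the sketch below, which uses the standard OpenAI Python client. That it’s gpt-4o-mini doing the summarizing is my guess, not something OpenAI has confirmed.

```python
# Sketch: summarize a hidden reasoning trace with a cheap model.
# Using gpt-4o-mini here is my guess, not confirmed by OpenAI.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_trace(hidden_trace: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Summarize this reasoning trace in a few short steps."},
            {"role": "user", "content": hidden_trace},
        ],
    )
    return response.choices[0].message.content
```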
o1-mini is actually somewhat better than the larger o1-preview at STEM-related tasks.⁷ Apparently they did not train it on as general a corpus, but on a more precise, targeted dataset. That, combined with the fact that it appears to have finished training, makes it a very powerful model for certain use cases. However, subjectively, in my experience o1-preview definitely feels like the deeper, more intelligent model, and I expect o1 to be better than o1-mini in nearly every way, but I could be wrong.