OpenAI recently introduced the latest model in its o series, o3, which succeeds the o1-preview and o1-mini models released earlier this year. Why isn't the new model called o2? OpenAI skipped that name to avoid a potential trademark conflict with the British telecommunications provider O2. The o3 models are not yet available to the public, only to select safety researchers who sign up for early access. Still, it's surprising to see a new 'o' model announced this soon after the initial release of the o1 models.
This blog covers the o3 announcement and offers a comparison with the o1 Preview and o1 Mini based on what we have so far, including technical specs, tests, pricing, and more.
Overview of ChatGPT o3: The Heavy Cost of Performance
As per OpenAI, the o3 model represents a major leap in AI technology. It is engineered to tackle complex reasoning tasks that require advanced cognitive capabilities. Unlike its predecessors, which were already impressive in their own right, o3 is designed to deliver responses that are not only faster but also more logically structured and accurate. That said, to say this model is 'pricey' would be an understatement. In OpenAI's published results, the high-scoring version of o3 used over $1,000 in compute per task, while the o1 models used about $5 per task and o1-mini only a few cents.
OpenAI has also positioned o3 as a model that could redefine the capabilities of AI systems, with some suggesting it may bring us closer to AGI than ever before (a claim that arguably serves the hype as much as the science). The o3 model has demonstrated remarkable performance on several benchmarks, particularly the ARC-AGI test, which measures an AI's ability to learn and adapt to new tasks without relying solely on pre-trained knowledge. On this test, o3 achieved an impressive score of 87.5%, surpassing previous models and even matching human performance levels. This leap in capability has led many to speculate whether we are witnessing the dawn of true AGI.
Key Features of OpenAI o3
- Enhanced Reasoning: The o3 model has demonstrated exceptional performance in reasoning tasks, achieving scores significantly higher than those of the o1 models on various benchmarks.
- Deliberative Alignment: A novel safety feature integrated into o3 allows it to evaluate its responses critically against safety protocols. This helps mitigate risks associated with misuse or manipulation of AI outputs.
- Performance Benchmarks: In tests such as the SWE-bench verified coding assessments and mathematical competitions like AIME 2024, o3 has outperformed previous models by substantial margins.
- Adaptive Learning: The model is designed to adapt its reasoning efforts based on task complexity, allowing it to balance speed and accuracy effectively.
Detailed Comparison of o3, o1 Preview, and o1 Mini
As expected, the o3 model performs better than both o1 Preview and o1 Mini. Here's a general comparison between the three to give you an idea of how things may look.
o3 Specifications vs o1-Preview and o1-Mini
| Feature | o3 | o1 Preview | o1 Mini |
| --- | --- | --- | --- |
| Context Window | (est.) 256K tokens | 128K tokens | 128K tokens |
| Maximum Output Tokens | (est.) 100K tokens | 32K tokens | 65.5K tokens |
| Input Cost | TBD | $15.00 per million tokens | $3.00 per million tokens |
| Output Cost | TBD | $60.00 per million tokens | $15.00 per million tokens |
| Performance Highlights | Superior reasoning and coding skills | Strong coding skills | Cost-effective for coding |
| Release Date | January/February 2025 (expected) | September 2024 | September 2024 |
Performance Tests
The performance of these models can be assessed through various benchmarks that evaluate their capabilities in coding, mathematics, and scientific reasoning.
Coding Performance:
o3 achieved a score of 2727 on Codeforces, significantly surpassing the 1891 score of o1 Preview. In the SWE-bench Verified tests, o3 scored 71.7%, compared to 48.9% for o1 Preview, indicating a marked improvement in coding ability.
Mathematical Reasoning:
In the AIME 2024 exam, o3 scored an impressive 96.7%, missing only one question. In contrast, o1 Preview scored 83.3%, highlighting the advancements made with the new model. On the EpochAI Frontier Math benchmark, one of the most challenging tests, o3 achieved a score of 25.2%, while previous models struggled to exceed 2%.
Scientific Reasoning:
For scientific questions assessed through GPQA Diamond (which includes PhD-level problems), o3 scored 87.7%, compared to 78% for o1 Preview, showcasing its superior analytical capabilities.
OpenAI o3 Use Cases
o3 is expected to excel in:
- Complex coding challenges requiring deep logical reasoning.
- Advanced mathematical problem-solving where accuracy is critical.
- Scientific research tasks that demand high-level comprehension and analysis.
o1 Preview serves well for:
- Tasks requiring detailed coding solutions.
- Legal analysis or other areas where complex reasoning is beneficial.
o1 Mini is best suited for:
- Quick coding tasks where efficiency is prioritized over depth.
- Applications needing basic reasoning without extensive computational resources.
OpenAI o3 vs o1 Preview vs o1 Mini Test Examples
Here are some NLP prompts you can use to test the coding and reasoning capabilities of OpenAI o3, o1 Preview, and o1 Mini:
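If you want to run a side-by-side comparison programmatically, here is a minimal sketch using the official `openai` Python SDK. The model identifiers and the sample prompt are illustrative (o3 is not yet accessible via the API), and the live API call is commented out so the script runs without an API key:

```python
# Sketch: send the same prompt to multiple models for a side-by-side comparison.
# Model names and the prompt below are illustrative placeholders.

def build_request(model: str, prompt: str) -> dict:
    """Build a chat-completions payload for one model/prompt pair."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

PROMPT = "Write a function that returns the nth Fibonacci number and explain your reasoning."
MODELS = ["o1-preview", "o1-mini"]  # swap in "o3" if/when it becomes available

for model in MODELS:
    payload = build_request(model, PROMPT)
    print(payload["model"], "->", len(payload["messages"]), "message(s)")
    # With the `openai` package installed and OPENAI_API_KEY set,
    # the actual call would look like:
    # from openai import OpenAI
    # client = OpenAI()
    # response = client.chat.completions.create(**payload)
    # print(response.choices[0].message.content)
```

Keeping the payload construction separate from the API call makes it easy to log, diff, and replay the exact same request across models.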
ChatGPT o3 Pricing: Comparison with o1 preview and mini
The exact input and output pricing for o3 has not yet been disclosed, but it is anticipated to be substantially higher given the model's advanced capabilities. In one of the benchmarks presented during OpenAI's ChatGPT o3 livestream, the cost per task was $20, with an average completion time of 1.3 minutes per task. And that was the high-efficiency version of the model, which limits reasoning; the full-compute version will cost considerably more.
For reference, o1 Preview costs $15 per million input tokens and $60 per million output tokens, a whopping 6x costlier than GPT-4o. o1 Mini offers a more budget-friendly option at $3 per million input tokens and $15 per million output tokens.
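To put those per-million-token rates in concrete terms, here is a quick sketch of the cost arithmetic for a single request at the o1 Preview prices quoted above (the token counts in the example are made up for illustration):

```python
# Per-token rates derived from o1 Preview's published per-million prices.
INPUT_RATE = 15.00 / 1_000_000   # dollars per input token
OUTPUT_RATE = 60.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at o1 Preview rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A 2,000-token prompt that produces an 8,000-token answer:
print(f"${request_cost(2_000, 8_000):.2f}")  # 0.03 + 0.48 = $0.51
```

Note that reasoning models bill their hidden chain-of-thought as output tokens, so long-reasoning answers can cost far more than the visible response length suggests.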
The Bottom Line
The OpenAI o3 model represents a major leap forward in AI development, bringing unprecedented reasoning capabilities to complex tasks like coding and mathematics. While it’s positioned as a premium offering, the more affordable o1 Mini provides a practical alternative for straightforward applications without breaking the bank.
As OpenAI refines these models through safety testing and user feedback ahead of its 2025 release, the AI landscape continues to evolve. But o3 is still far from release, so what about right now? You can still try various advanced models, such as Claude 3.5 Sonnet, GPT-4o, and Llama 405B on Bind AI Copilot. Select the model of your choice and get started today.