ChatGPT o3 vs o1 Preview vs o1 Mini: Which one is better?

OpenAI recently introduced the latest model in its o series, o3, which succeeds the o1 Preview and o1 Mini models released earlier this year. Why isn't the new model called o2? OpenAI skipped that name to avoid a potential trademark conflict with the British telecommunications provider O2. The o3 models are not yet available to the public, only to select safety researchers who sign up for early access. Still, it's surprising to see a new 'o' model this soon after the initial release of the o1 models.

This blog covers the o3 announcement and offers a comparison with the o1 Preview and o1 Mini based on what we have so far, including technical specs, tests, pricing, and more.

Overview of ChatGPT o3, the heavy cost of performance

As per OpenAI, the o3 model represents a major leap in AI technology. It is engineered to tackle complex reasoning tasks that require advanced cognitive capabilities. Unlike its predecessors, which were impressive in their own right, o3 is designed to deliver responses that are not only faster but also more logically structured and accurate. That said, to call this model 'pricey' would be an understatement: according to OpenAI's published benchmark figures, the high-scoring version of o3 used over $1,000 in compute for each task, while the o1 models used about $5 per task and o1 Mini only a few cents.

OpenAI has also positioned o3 as a model that could redefine the capabilities of AI systems, with some suggesting it may bring us closer to AGI than ever before (which is more to create hype than anything). The o3 model has demonstrated remarkable performance in several benchmarks, particularly in the ARC-AGI test, which measures an AI’s ability to learn and adapt to new tasks without relying solely on pre-trained knowledge. In this test, o3 achieved an impressive score of 87.5%, surpassing previous models and even matching human performance levels. This leap in capability has led many to speculate whether we are witnessing the dawn of true AGI.

Key Features of OpenAI o3

  • Enhanced Reasoning: The o3 model has demonstrated exceptional performance in reasoning tasks, achieving scores significantly higher than those of the o1 models on various benchmarks.
  • Deliberative Alignment: A novel safety feature integrated into o3 allows it to evaluate its responses critically against safety protocols. This helps mitigate risks associated with misuse or manipulation of AI outputs.
  • Performance Benchmarks: In tests such as the SWE-bench verified coding assessments and mathematical competitions like AIME 2024, o3 has outperformed previous models by substantial margins.
  • Adaptive Learning: The model is designed to adapt its reasoning efforts based on task complexity, allowing it to balance speed and accuracy effectively.

Detailed Comparison of o3, o1 Preview, and o1 Mini

As expected, the o3 model performs better than both o1 Preview and o1 Mini. Here's a general comparison between the three to give you an idea of how things may look.

o3 Specifications vs o1-Preview and o1-Mini

Feature                  o3                                   o1 Preview            o1 Mini
Context Window           256K tokens (est.)                   128K tokens           128K tokens
Maximum Output Tokens    100K tokens (est.)                   32K tokens            65.5K tokens
Input Cost               TBD                                  $15.00 per million    $3.00 per million
Output Cost              TBD                                  $60.00 per million    $15.00 per million
Performance Highlights   Superior reasoning and coding        Strong coding skills  Cost-effective for coding
Release Date             January/February 2025 (expected)     September 2024        September 2024

Performance Tests

The performance of these models can be assessed through various benchmarks that evaluate their capabilities in coding, mathematics, and scientific reasoning.

Coding Performance:

o3 achieved a score of 2727 on Codeforces, significantly surpassing the 1891 scored by o1 Preview. In the SWE-bench Verified tests, o3 scored 71.7%, compared to 48.9% for o1 Preview, indicating a marked improvement in coding ability.

Mathematical Reasoning:

In the AIME 2024 exam, o3 scored an impressive 96.7%, missing only one question. In contrast, o1 Preview scored 83.3%, highlighting the advancements made with the new model. On the EpochAI Frontier Math benchmark, one of the most challenging tests available, o3 achieved a score of 25.2%, while previous models struggled to exceed 2%.

Scientific Reasoning:

For scientific questions assessed through GPQA Diamond (which includes PhD-level problems), o3 scored 87.7%, compared to 78% for o1 Preview, showcasing its superior analytical capabilities.

OpenAI o3 Use Cases

o3 is expected to excel in:

  • Complex coding challenges requiring deep logical reasoning.
  • Advanced mathematical problem-solving where accuracy is critical.
  • Scientific research tasks that demand high-level comprehension and analysis.

o1 Preview serves well for:

  • Tasks requiring detailed coding solutions.
  • Legal analysis or other areas where complex reasoning is beneficial.

o1 Mini is best suited for:

  • Quick coding tasks where efficiency is prioritized over depth.
  • Applications needing basic reasoning without extensive computational resources.

OpenAI o3 vs o1 Preview vs o1 Mini Test Examples

Here are some NLP prompts you can try to test OpenAI o3, o1 Preview, and o1 Mini’s coding and reasoning capabilities. Copy these prompts and try them out:

1. Coding (Python): Write a Python function that takes a list of dictionaries as input, where each dictionary represents a product with keys ‘name’, ‘price’, and ‘category’. The function should return a new list containing only the dictionaries for products in the ‘electronics’ category, sorted by price in ascending order.
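If you want a reference point when comparing the models' answers to this prompt, here is a minimal sketch of one correct solution (the function name and sample data are our own, not part of the prompt):

```python
def filter_electronics(products):
    """Return only 'electronics' products, sorted by ascending price."""
    electronics = [p for p in products if p.get("category") == "electronics"]
    return sorted(electronics, key=lambda p: p["price"])

# Example usage with hypothetical products:
items = [
    {"name": "TV", "price": 300, "category": "electronics"},
    {"name": "Shirt", "price": 20, "category": "apparel"},
    {"name": "Phone", "price": 100, "category": "electronics"},
]
print(filter_electronics(items))  # Phone ($100) first, then TV ($300)
```

A model's answer should behave equivalently, whatever names it picks.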

2. Coding (JavaScript): Create a JavaScript function that asynchronously fetches data from a given URL using the Fetch API. The function should handle potential network errors and return the parsed JSON data if successful, or an error message otherwise.

3. Mathematics: Calculate the volume of a sphere with a radius of 5 centimeters. Express the answer in terms of pi and then as a numerical approximation to two decimal places.
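For checking the models' answers to the sphere question, the expected result is V = (4/3)πr³ = (500/3)π cm³, which a few lines of Python confirm:

```python
import math

radius = 5  # centimeters
volume = (4 / 3) * math.pi * radius**3  # = (500/3)*pi cubic centimeters

print(f"{volume:.2f}")  # 523.60
```

So a correct response should give (500/3)π cm³ exactly, or about 523.60 cm³.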

4. Science: Describe the process of photosynthesis in plants. Explain the role of chlorophyll in this process and how it converts light energy into chemical energy.

ChatGPT o3 Pricing: Comparison with o1 Preview and o1 Mini

The exact input and output pricing for o3 has not yet been disclosed, but it is anticipated to be far higher given its advanced capabilities. In one of the advanced benchmarks OpenAI presented during its ChatGPT o3 livestream, the cost per task was $20, with an average completion time of 1.3 minutes per task. And that was the high-efficiency version of the model, which limits reasoning. So you can imagine what the full version will cost.

For reference: o1 Preview costs $15 per million tokens for input and $60 per million tokens for output, which made it a whopping 6x costlier than GPT-4o. o1 Mini offers a more budget-friendly option at $3 per million tokens for input and $15 per million tokens for output.
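Per-request cost from these per-million-token rates is simple arithmetic; here is a quick sketch (the token counts in the example are hypothetical, purely for illustration):

```python
def request_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in USD, given rates quoted per million tokens."""
    return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

# A hypothetical request with 10K input and 2K output tokens:
print(round(request_cost(10_000, 2_000, 15.00, 60.00), 2))  # o1 Preview: 0.27
print(round(request_cost(10_000, 2_000, 3.00, 15.00), 2))   # o1 Mini:    0.06
```

At o3's rumored $20-per-task figure, the gap from even o1 Preview is enormous.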

The Bottom Line

The OpenAI o3 model represents a major leap forward in AI development, bringing unprecedented reasoning capabilities to complex tasks like coding and mathematics. While it’s positioned as a premium offering, the more affordable o1 Mini provides a practical alternative for straightforward applications without breaking the bank.

As OpenAI refines these models through safety testing and user feedback ahead of its 2025 release, the AI landscape continues to evolve. But o3 is still far from release, so what about right now? You can still try various advanced models, such as Claude 3.5 Sonnet, GPT-4o, and Llama 405B on Bind AI Copilot. Select the model of your choice and get started today.