
GPT o1 vs Claude 3.5 Sonnet: Which model is better for Coding?

OpenAI’s recent announcement of the OpenAI o1 model family has sparked debate among developers interested in AI code generation. According to OpenAI, the o1 models have been designed specifically for tasks like coding, which require stronger reasoning and contextual awareness. Many people have already started drawing comparisons between OpenAI o1, GPT-4o, and even Claude 3.5 Sonnet for their coding potential.

In this blog, we compare the capabilities of OpenAI o1, Claude 3.5 Sonnet, and GPT-4o. We’ll draw on real experiences and practical examples shared by users on Reddit and other platforms.

OpenAI o1 Overview

It’s important to know that “o1” is a new series of OpenAI models with an architecture different from GPT (i.e., generative pre-trained transformers), so calling it GPT o1 is not accurate; it is officially referred to as “OpenAI o1”. The model is designed for complex reasoning and problem-solving tasks, especially code generation. Unlike its predecessors, GPT-4o and GPT-4o mini, o1 spends more time reasoning with a chain of thought (CoT) and processing the input before generating output, which is particularly beneficial for coding challenges that require a deeper understanding of context and logic. According to OpenAI, the models can perform similarly to PhD students on challenging benchmark tasks.

The model also operates differently from previous generations: it can take 30 seconds or more to reason before answering, and it is not especially well suited to conversational, back-and-forth interaction. Giving it all of the context and instructions upfront in a single prompt tends to work better than drip-feeding follow-up prompts.
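For developers calling the model through the API, that means packing the full context into a single request. Below is a minimal sketch using the OpenAI Python SDK; the o1-preview model name and the single user message reflect the launch-time constraints (no system message, fixed sampling settings), and the refactoring task itself is just a hypothetical example:

```python
# Minimal sketch, assuming API access to o1-preview (pip install openai).
# At launch, o1 models accepted no system message, so all instructions and
# context go into one user message up front.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = """You are refactoring a Python codebase.
Constraints: Python 3.11, standard library only, keep the public API stable.
Task: rewrite the function below to avoid the O(n^2) membership checks.

def dedupe(items):
    out = []
    for x in items:
        if x not in out:
            out.append(x)
    return out
"""

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": prompt}],  # everything in one turn
)
print(response.choices[0].message.content)
```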

Key Features of OpenAI o1

  • Enhanced Reasoning: o1 is reported to outperform GPT-4o in complex reasoning tasks, achieving an impressive 89th percentile in competitive programming evaluations.
  • Contextual Awareness: With a context window of 128K tokens, o1 can handle larger inputs, making it suitable for extensive coding projects that require maintaining context over longer interactions.
  • Model Variants: OpenAI currently offers two versions of o1: the o1-preview, which is more powerful but slower, and the o1-mini, which is faster and cheaper, making it accessible for everyday coding tasks.

OpenAI o1 vs Claude 3.5 Sonnet for Coding

o1 Performance in Coding Tasks

When it comes to OpenAI o1’s coding abilities, early users have reported mixed experiences on Reddit and X (formerly Twitter). Many users report a noticeable reasoning boost over GPT-4o, while a few believe the hype won’t last more than a few weeks. GitHub Copilot’s blog suggests that o1-preview’s reasoning capability allows a deeper understanding of the code’s constraints and edge cases, which helps produce more efficient, higher-quality results. We’ll have to wait for the full release to see conclusive results.

Performance Metrics

Here are some performance metrics based on independent (and anecdotal) evaluations of each AI model’s capabilities in reasoning, context handling, speed, and error correction (courtesy: Aider, GitHub, Reddit). 

| Feature | OpenAI o1 | GPT-4o | Claude 3.5 Sonnet |
| --- | --- | --- | --- |
| Reasoning Ability | Superior | Moderate | Good |
| Context Window | 128K tokens | 128K tokens | 500K tokens (Enterprise) |
| Speed | Slower than Claude | Moderate | Fast |
| Error Correction | Good | Moderate | Excellent |

General technical specifications

| Model Name | Average | Spatial | Web of Lies v2 | Zebra Puzzle |
| --- | --- | --- | --- | --- |
| o1-mini (2024-09-12) | 77.33 | 50 | 100 | 82 |
| Claude 3.5 Sonnet (2024-06-20) | 58.67 | 48 | 80 | 48 |
| GPT-4o (2024-08-06) | 54.67 | 56 | 66 | 42 |
| meta-llama-3.1-405b-instruct-turbo | 53.33 | 46 | 80 | 34 |
| chatgpt-4o-latest | 52 | 42 | 68 | 46 |
| gpt-4-turbo-2024-04-09 | 51.33 | 46 | 70 | 38 |
| gpt-4o-2024-05-13 | 50 | 40 | 70 | 40 |
| gemini-1.5-pro-exp-0827 | 49.33 | 36 | 80 | 32 |
| gemini-1.5-pro-exp-0801 | 48.67 | 36 | 72 | 38 |
| gemini-1.5-flash-exp-0827 | 47.33 | 38 | 68 | 36 |
| gpt-4-0125-preview | 47.33 | 46 | 52 | 44 |
| deepseek-coder-v2 | 45.33 | 38 | 56 | 42 |

Benchmark test results

In the table above, we can see a big leap between o1 and GPT-4o’s reasoning and bias-detection capabilities.

Credit: Aider

The first benchmark run of o1-mini has it ~tied with GPT-4o on aider’s code editing benchmark.

Practical Examples of o1’s Coding Capabilities

To illustrate the differences in coding capabilities, let’s consider a few practical examples based on user experiences and discussions from GitHub Copilot’s o1 vs Claude analysis.

Look at the first example below, where both Claude 3.5 Sonnet, widely considered among the most advanced AI models, and OpenAI o1 provide the correct solution for the query prompt.

In the second example, we can see o1 outsmarting Claude and providing the correct (and better) solution. This suggests that o1 does have some real advantages.

Now, here are some example prompts (that you can also try yourself) to help you see which model is better. We’ve also provided context on how well each model performs.

Example 1: Generating a Simple Function

Prompt: “Write a Python function that takes a list of integers and returns the sum of all even numbers in the list. The function should handle empty lists and lists containing only one element.”

Try the prompt with Claude 3.5 Sonnet

  • OpenAI o1: The model generated a concise and efficient function that met all the requirements specified in the prompt. It handled edge cases gracefully and provided clear variable names for easy understanding.
  • GPT-4o: The generated function was correct but included unnecessary comments and validation checks that made the code longer than necessary. While the function worked as expected, the excessive code could be confusing for users.
  • Claude 3.5 Sonnet: The model produced a straightforward and well-structured function that addressed the prompt effectively. It used clear variable names, minimal comments, and a simple loop to iterate through the list, making the code easy to read and understand.
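For reference, a correct solution along the lines of what the stronger models produced can be as short as this (a representative sketch, not any model’s verbatim output):

```python
def sum_even_numbers(numbers):
    """Return the sum of all even integers in numbers.

    Handles empty lists (returns 0) and single-element lists naturally.
    """
    return sum(n for n in numbers if n % 2 == 0)

print(sum_even_numbers([]))            # 0
print(sum_even_numbers([7]))           # 0
print(sum_even_numbers([1, 2, 3, 4]))  # 6
```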

Example 2: Debugging Code

Prompt: “Debug this JavaScript function to remove all vowels from a string:”

```javascript
function removeVowels(str) {
  return str.replace(/[aeiou]/gi, ”);
}
```

(The bug: the replacement argument uses a typographic quote instead of a valid empty-string literal '', which is a syntax error.)

Try the prompt with GPT-4o

  • OpenAI o1: After carefully analyzing the provided code, o1 quickly identified the issue and suggested a corrected version that worked perfectly. The solution was concise and efficient, addressing the problem directly.
  • GPT-4o: While the model correctly identified the error in the code, the suggested fix was overly complicated and included unnecessary changes. The corrected version, although functional, could confuse users looking for a simple solution.
  • Claude 3.5 Sonnet: The model pinpointed the mistake in the code and provided a clear and concise correction. The suggested solution was easy to understand and implement, making it user-friendly and effective.
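To make the intended behavior concrete, here is a minimal reference implementation in Python (the language of the other examples), mirroring the corrected JavaScript, which simply passes '' as the replacement:

```python
import re

def remove_vowels(text):
    """Remove all vowels, upper- and lowercase, from text."""
    # JavaScript's gi flags map to re.IGNORECASE here; re.sub already
    # replaces all matches, so no "global" flag is needed.
    return re.sub(r"[aeiou]", "", text, flags=re.IGNORECASE)

assert remove_vowels("Hello World") == "Hll Wrld"
assert remove_vowels("") == ""
```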

Example 3: Writing Unit Tests

Prompt: “Develop a set of unit tests for a function that takes a list of strings as input and returns a new list containing only the strings that are palindromes. The function should handle empty lists and lists containing only one element.”

Try the prompt with Claude

  • OpenAI o1: The model generated a comprehensive set of unit tests that covered various scenarios, including edge cases. The tests were well-structured, easy to read, and ensured the function’s robustness and reliability.
  • GPT-4o: While the generated tests covered some basic cases, they lacked thoroughness and missed critical edge cases. The tests were adequate but might not provide enough confidence in the function’s behavior under certain conditions.
  • Claude 3.5 Sonnet: The model produced a well-structured and thorough set of unit tests that addressed the prompt effectively. The tests covered a wide range of scenarios, including edge cases, ensuring the function’s reliability and making it easier for users to trust the implementation.
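To show what “comprehensive” means here, a test suite along these lines covers the edge cases the prompt calls out. The filter_palindromes implementation is a hypothetical stand-in, included only so the tests run:

```python
import unittest

def filter_palindromes(strings):
    # Hypothetical function under test: keep only strings that read the
    # same forwards and backwards (case-sensitive for simplicity).
    return [s for s in strings if s == s[::-1]]

class TestFilterPalindromes(unittest.TestCase):
    def test_empty_list(self):
        self.assertEqual(filter_palindromes([]), [])

    def test_single_palindrome(self):
        self.assertEqual(filter_palindromes(["level"]), ["level"])

    def test_single_non_palindrome(self):
        self.assertEqual(filter_palindromes(["hello"]), [])

    def test_mixed_list(self):
        self.assertEqual(
            filter_palindromes(["racecar", "abc", "noon"]),
            ["racecar", "noon"],
        )

if __name__ == "__main__":
    unittest.main()
```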

OpenAI o1 vs Claude 3.5 Sonnet: Verdict

Choosing among OpenAI o1, GPT-4o, and Claude 3.5 Sonnet depends heavily on your requirements, and recent discussions highlight some notable distinctions and user experiences. While we can only offer a subjective, anecdotal verdict this early in o1’s lifecycle, it is essential to test multiple models on each of your use cases.

Claude 3.5 Sonnet vs OpenAI o1

Claude 3.5 Sonnet continues to be a strong competitor to OpenAI o1, particularly in terms of cost-effectiveness (it’s 4x cheaper). Users have noted that even though Claude 3.5 Sonnet is significantly cheaper, it still closely matches the performance of o1, especially on coding tasks that use Claude’s GitHub codebase integration. The model operates with a 200K token context window (up to 500K on Claude Enterprise), which enhances its ability to handle complex coding scenarios effectively. Anthropic’s internal evaluations indicate that Claude 3.5 Sonnet solved 64% of coding problems, outperforming its predecessor, Claude 3 Opus, which solved only 38%.

On the other hand, OpenAI o1 is often favored for intricate coding tasks that demand deep reasoning and extensive context retention. However, some users have found that o1-mini does not perform as close to o1-preview as expected, leading to a perception that it may not be the best option for all coding needs.

Claude 3.5 Sonnet: Still a Cost-Effective Leader for Coding

Claude 3.5 Sonnet is still a very effective model for coding and reasoning tasks, especially given that it is 4x cheaper than OpenAI o1.

For ChatGPT users, GPT-4o remains a viable choice for simpler coding tasks or when speed is of the essence. However, users have reported that it can sometimes produce verbose outputs and struggle with effective error correction. This verbosity can detract from its utility in fast-paced coding environments where clarity and conciseness are crucial.

The choice of model ultimately hinges on specific user needs. For instance, Claude 3.5 Sonnet’s speed and its accessibility via code-generation platforms such as Bind AI Copilot or Cursor make it particularly appealing for developers who need to iterate quickly and efficiently.

In contrast, OpenAI o1 is a very powerful model for more complex projects requiring nuanced understanding and extensive context retention. Users who frequently switch between models have noted that both Claude and OpenAI have their strengths, often relying on each to complement the other in different coding scenarios.

So yes, the bottom line is that while OpenAI o1 offers notable advancements for complex coding tasks, Claude 3.5 Sonnet presents the most compelling alternative, with cost efficiency and performance that isn’t far behind o1. You can try Claude 3.5 Sonnet and compare it with GPT-4o in Bind AI Copilot, and let us know which model you prefer for your tasks.