
Claude 3.7 Sonnet vs o3-mini: Which is Better for Coding?

Anthropic has officially launched Claude 3.7 Sonnet, the successor to the model the community unofficially dubbed "Claude 3.6." It has also launched 'Claude Code,' a new agentic coding tool for developers. We've already covered the announcement in detail in this article. Now the bigger question is: how good is the new Claude 3.7 Sonnet, and how well does it compare to OpenAI's o3-mini for coding tasks? This article is more than a Claude 3.7 Sonnet vs o3-mini High comparison, though. We also put the model up against its predecessor, Claude 3.5 Sonnet, and xAI's Grok 3 to see which is the best model for coding.

Let’s get going.

Claude 3.7 Sonnet and Claude Code – What are they?

Credit: Anthropic

Claude 3.7 Sonnet combines quick answers and deep thinking in one AI model. Unlike older models that could only do one or the other, this one lets you choose. Need a fast response? It can do that. Facing a tough coding or math problem? It can switch to a careful, step-by-step thinking mode. This flexibility makes it useful for many different tasks.
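If you're calling the model through the API, the mode switch is explicit rather than automatic. Here's a minimal sketch using Anthropic's Python SDK; the model ID and token budgets shown are illustrative, so check Anthropic's docs for current values:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Standard mode: a fast, direct answer.
quick = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Reverse a string in Python."}],
)

# Extended thinking mode: the model reasons step by step before answering.
# budget_tokens caps the reasoning; max_tokens must be larger than it.
deep = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=8192,
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=[{"role": "user", "content": "Debug this recursive descent parser..."}],
)
```

The same model serves both calls; only the request changes, which is what makes the hybrid design convenient.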

The model is especially good at coding. Testers say it produces clean, ready-to-use code with fewer mistakes than earlier versions. Despite these improvements, Anthropic kept the price the same: $3 per million input tokens and $15 per million output tokens, which keeps it affordable for a wide range of users.
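At those rates, it's straightforward to estimate what a session costs. A quick back-of-the-envelope sketch (the token counts are hypothetical):

```python
INPUT_PRICE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # dollars per output token

# Hypothetical coding session: a large prompt with pasted code context,
# plus a sizeable generated response.
input_tokens = 50_000
output_tokens = 8_000

cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"Estimated cost: ${cost:.2f}")  # -> Estimated cost: $0.27
```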

Accompanying Claude 3.7 Sonnet’s release is Claude Code, a command-line tool that brings agentic coding to developers’ fingertips. This powerful extension enables programmers to delegate substantial engineering tasks directly from their terminal, fundamentally changing how development work happens. Consider what Claude Code brings to the development process:

  • It dramatically accelerates workflows by completing in a single pass tasks that would typically require 45+ minutes of focused work.
  • It functions as a true programming partner by searching and reading code, editing files, writing and running tests, making commits, pushing to GitHub, and using command-line tools—all while keeping developers informed throughout the process.
  • It continuously improves through Anthropic’s commitment to enhancing tool reliability, supporting long-running commands, and refining in-app rendering based on real-world user feedback.

With Claude 3.7 Sonnet available across all Claude plans (except for extended thinking mode on the free tier) and through multiple cloud platforms, Anthropic has created an AI ecosystem that promises to fundamentally reshape how we approach complex intellectual tasks.

Understanding the Comparison

Before we get into the comparison, let's establish what we're working with:

  • o3-mini: From OpenAI, it's a smaller, efficient version of o3, optimized for STEM and especially coding, with low, medium, and high reasoning effort levels (o3-mini Performance); see the sketch after this list.
  • Claude 3.5 Sonnet: Predecessor to 3.7, known for strong coding and reasoning, setting benchmarks like HumanEval at 92.0%.
  • Grok 3: From xAI, claimed to be powerful, with strong coding benchmark scores like 79.4% on LiveCodeBench (Grok 3 Beta Announcement).
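The reasoning effort levels mentioned above are exposed directly as an API parameter. A minimal sketch with OpenAI's Python SDK (the prompt is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# reasoning_effort trades speed and cost for deeper reasoning:
# "low" for quick tasks, "high" for hard coding problems.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[
        {"role": "user", "content": "Write a function that merges overlapping intervals."}
    ],
)
print(response.choices[0].message.content)
```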

Claude 3.7 Sonnet vs o3-mini High vs Grok 3 Coding Performance Comparison

Both Claude 3.7 Sonnet and o3-mini shine in coding, but their strengths differ:

  • Claude 3.7 Sonnet is state-of-the-art on SWE-bench Verified and TAU-bench, suggesting it’s great for real-world software tasks.
  • o3-mini, especially its high version, scores 49.3% on SWE-bench and has a Codeforces Elo rating of 2130, indicating strength in competitive programming.
  • Grok 3, at 79.4% on LiveCodeBench, is competitive but lacks direct comparisons with the others.

Real-World Usage

  • Claude 3.7 Sonnet is praised for handling complex codebases and multi-step tasks, and is already used in a range of coding applications.
  • o3-mini is efficient and cost-effective, available in ChatGPT and via the API, and well suited to everyday coding tasks.
  • Claude 3.5 Sonnet has a strong track record, while Grok 3 is newer, with promising but less tested real-world performance.

Here’s a detailed table (courtesy: Anthropic) that gives us a glimpse of how Claude 3.7 Sonnet compares to its competitors.

Credit: Anthropic

This table covers reasoning, coding, tool use, multilingual capabilities, visual understanding, instruction following, and mathematical problem-solving for Claude 3.7 Sonnet and its competitors. Claude 3.7 Sonnet shines when given time to think, handling complex reasoning and math problems especially well. OpenAI's o3-mini performs solidly across many different tasks despite its smaller size. Grok 3 Beta seems built specifically for reasoning and math tasks, where it performs impressively.

It’s interesting to see how AI companies are taking different approaches to building their models. Some focus on making well-rounded assistants while others target specific abilities. The lack of standard testing methods makes it hard to directly compare these models, but it’s clear they’re all becoming more capable in their own ways.

Coding Benchmark Analysis

Credit: Anthropic

Software Engineering Performance (SWE-bench Verified):

Claude 3.7 Sonnet demonstrates significantly higher accuracy in software engineering tasks compared to o3-mini.

  • Claude 3.7 Sonnet achieves a 62.3% accuracy, with a potential increase to 70.3% when utilizing a custom scaffold.
  • OpenAI’s o3-mini scores 49.3%.
Credit: Anthropic

Agentic Tool Use Performance:

Claude 3.7 Sonnet exhibits superior agentic tool use on TAU-bench, outperforming its predecessor in both the retail and airline domains.

  • Retail: Claude 3.7 Sonnet achieves 81.2% accuracy, significantly higher than Claude 3.5 Sonnet's 71.5%.
  • Airline: Claude 3.7 Sonnet maintains its lead with 58.4% accuracy, while Claude 3.5 Sonnet "NEW" scores 54.2%.
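Agentic tool use of the kind TAU-bench measures comes down to the model choosing when to invoke developer-defined tools. Here's a minimal sketch of tool use with Anthropic's Python SDK; the lookup_order tool is hypothetical:

```python
import anthropic

client = anthropic.Anthropic()

# A hypothetical retail-domain tool the model may decide to call.
tools = [{
    "name": "lookup_order",
    "description": "Look up a customer order by its ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where is order A1234?"}],
)

# If the model chose to call the tool, the response contains a tool_use block.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. lookup_order {'order_id': 'A1234'}
```

Benchmarks like TAU-bench score how reliably the model makes calls like this across multi-turn customer interactions.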

Claude 3.7 Sonnet vs o3-mini vs Claude 3.5 Sonnet vs Grok 3 Detailed Performance Insights

  • Claude 3.7 Sonnet: Achieves state-of-the-art performance on SWE-bench Verified and TAU-bench, frameworks testing AI agents on complex real-world tasks. It’s reported to excel in instruction-following and agentic coding, with extended thinking mode enhancing math and science, likely benefiting coding. Early tests show it producing production-ready code with fewer errors, as noted by companies like Vercel and Canva.
  • o3-mini: Particularly strong in coding benchmarks, with o3-mini high outperforming its predecessors on Codeforces and SWE-bench. Its ability to handle real-world software problems is competitive, marginally overtaking DeepSeek-R1, though it lags behind Claude 3.5 Sonnet in some user experiences. The three reasoning effort levels (low, medium, high) allow flexibility, with high effort showing significant improvements.
  • Claude 3.5 Sonnet: With a 92.0% HumanEval score, it’s a benchmark leader for code generation, solving 64% of internal agentic coding problems, outperforming Claude 3 Opus at 38%. It’s fast and cost-effective, ideal for everyday coding, but likely surpassed by 3.7 Sonnet in complex tasks.
  • Grok 3: Scores 79.4% on LiveCodeBench, with claims of outperforming models like Claude 3.5 Sonnet and GPT-4o on various benchmarks. Its reasoning capabilities, enhanced by Think mode, make it suitable for complex problem-solving, but real-world coding data is limited.

Hands-on Test: Claude 3.7 Sonnet vs o3-mini High vs Grok 3

Now, on to something that will likely affect your choice of model more than any benchmark: a hands-on case study. We tested whether Claude 3.7 Sonnet, o3-mini, or Grok 3 performs better at designing a sophisticated HTML landing page for a company. So what were the results? First, here's what our prompt looked like, to give you an idea:

Create a high-converting, FOMO-driving HTML landing page for “Market Mavens,” a stock market investment advice service similar to Motley Fool. The page should use a gradient background and highlight important elements with eye-catching colors. (the full prompt was a lot longer than this; you can try it here.)

We used this prompt on each platform (except for o3-mini, which we ran through Bind AI for its efficiency), as you can see here:

Claude 3.7 Sonnet
o3-mini High via Bind AI
Grok 3

So, what did the results look like? Let’s see:

1. Claude 3.7 Sonnet

Claude generated the page in under 30 seconds, and it looked good:

Claude AI interface and its built-in code previewer

Here’s a section-by-section look at what Claude 3.7 Sonnet generated:

Header, Upper-body
Body
Endorsement
Lower-body, Footer

As you can see, the page looks good and has distinct sections for you to put your content in. Good stuff. But what about o3-mini?

2. o3-mini High

As stated above, we used Bind AI for our o3-mini testing due to its advanced IDE functionality and direct deployment options.

o3-mini and Bind AI’s impressive IDE

Here’s what o3-mini generated:

Header, Upper-body
Body, CTA
Lower-body, FAQ, Footer

You could argue that o3-mini handled certain sections better than Claude 3.7 Sonnet; take the countdown CTA, for example. Still, we have one more model to go.

3. Grok 3

Unfortunately, Grok 3 doesn't offer a real-time preview, so we used an external HTML tester to check the results. Here's what it produced:

Header (missing), Upper-body
Body, Pop-up, Video
CTA
FAQ, Footer

While Grok 3 omitted the header entirely, it handled the other sections well. Given the missing header, though, it ranks lowest of the three.

NOTE: Our case study produced fairly rudimentary results, because building anything more complex takes a lot of iteration and editing that we can't show in full here. Still, it should give you an idea of what to expect from each model.

Additional Coding Prompts to Test

Want to try something else? Here are some additional coding prompts to test each model with.

1. Write a Python script to scrape the latest news headlines from a website and save them to a CSV file (a reference sketch follows this list).

2. Create a RESTful API using Node.js and Express that allows users to create, read, update, and delete tasks in a to-do list application.

3. Design a simple HTML and CSS webpage that showcases a portfolio of a graphic designer, including sections for projects, skills, and contact information.

4. Write a SQL query to retrieve the names and email addresses of all customers who made a purchase in the last 30 days.

5. Create a JavaScript function that takes an array of numbers and returns a new array containing only the even numbers.
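For prompt 1, a known-good baseline makes it easier to judge each model's output. Here's a minimal sketch using requests and BeautifulSoup; the URL and CSS selector are placeholders you'd adapt to a real site:

```python
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/news"    # placeholder: swap in a real news page
HEADLINE_SELECTOR = "h2.headline"   # placeholder: inspect the page for the right selector

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
headlines = [tag.get_text(strip=True) for tag in soup.select(HEADLINE_SELECTOR)]

# Write one headline per row to a CSV file.
with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["headline"])
    writer.writerows([h] for h in headlines)

print(f"Saved {len(headlines)} headlines to headlines.csv")
```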

The Verdict

Given the data and our testing, Claude 3.7 Sonnet seems likely to excel at real-world software engineering thanks to its hybrid reasoning and state-of-the-art performance on SWE-bench Verified. o3-mini, at its high reasoning effort setting, is strong in competitive programming, as evidenced by its Codeforces Elo score of 2130. Claude 3.5 Sonnet remains a solid choice for general coding, while Grok 3 shows potential but lacks extensive real-world validation.

For developers, the choice depends on specific needs:

  • Real-world software tasks: Opt for Claude 3.7 Sonnet.
  • Competitive programming: Choose o3-mini high.
  • Emerging capabilities: Watch Grok 3 for future developments, but test thoroughly.

Direct comparisons are limited, so testing with specific coding tasks is recommended to determine the best fit. You can try o3-mini, DeepSeek R1, and Claude 3.7 Sonnet here.
