
Claude 3.5 Sonnet vs GPT-4o: Does Claude outperform GPT-4o?

Anthropic has launched Claude 3.5 Sonnet, the first model in its new Claude 3.5 family. The new Sonnet brings significant advancements in AI technology. Available through the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI, it offers a 200K token context window at $3 per million input tokens and $15 per million output tokens. The model runs at twice the speed of Claude 3 Opus and has an enhanced understanding of nuance, humor, and complex instructions.
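For readers who want to try it, here is a minimal sketch of calling Claude 3.5 Sonnet through the Anthropic Python SDK; the model identifier shown is the one published at launch, and the prompt is a placeholder:

```python
# Minimal sketch: calling Claude 3.5 Sonnet via the Anthropic Python SDK.
# Assumes the `anthropic` package is installed and ANTHROPIC_API_KEY is set.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # launch identifier for Claude 3.5 Sonnet
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain the 200K context window in one paragraph."}],
)
print(message.content[0].text)
```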

Overview of the Claude 3.5 Model Family

The preceding Claude 3 family consisted of three distinct variants, each designed for specific needs and applications. Claude 3 Haiku stood out for its rapid execution of lightweight tasks, making it the ideal choice where time efficiency was crucial; it excelled at quick responses and swift data retrieval. In contrast, Claude 3 Sonnet offered advanced reasoning capabilities, making it adept at moderately complex tasks that required deeper analysis and contextual understanding, such as detailed customer inquiries and intricate data analysis.

At the pinnacle of the series was Claude 3 Opus, designed for extensive, multi-step tasks demanding outstanding precision. It excelled at higher-order mathematics, sophisticated coding, and precise vision analysis, maintaining near-perfect recall over a 200K token context window. Capable of generating responses up to 4096 tokens, Claude 3 Opus was suited to comprehensive projects that demanded in-depth analysis, real-time customer interactions, auto-completions, and data extraction. Together, these models let the Claude 3 family cater to a wide range of applications, from ultra-fast tasks to intricate problem-solving and thorough analytical work.

With the new launch, Claude 3.5 Sonnet is designed to deliver frontier intelligence, operating at double the speed of Claude 3 Opus with improved cost-efficiency.

Anthropic’s internal tests show Claude 3.5 Sonnet solving 64% of coding problems, significantly higher than Claude 3 Opus’s 38% success rate. Claude 3 Opus has been very well regarded for code generation tasks, so the new Sonnet model outperforming it is a significant achievement. Whether Anthropic plans to release a Claude 3.5 Opus model is yet to be known.

Overview of GPT-4o

GPT-4o is OpenAI’s newest flagship model, and it represents a significant step forward in creating more natural human-computer interactions. Unlike previous models that focused primarily on text-based input and output, GPT-4o can accept any combination of text, audio, image, and video as input and generate any combination of text, audio, and image as output. This multimodal approach allows for a more seamless and intuitive user experience, with the model able to respond to audio inputs in as little as 232 milliseconds on average, rivaling human response times in conversation. Despite its advanced capabilities, GPT-4o remains highly efficient, matching the performance of GPT-4 Turbo on text and code tasks while being significantly faster and 50% cheaper through the API.
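For comparison, here is a minimal sketch of a text-only GPT-4o request through the OpenAI Python SDK (audio and image inputs use additional content types; the prompt is a placeholder):

```python
# Minimal sketch: a text-only GPT-4o request via the OpenAI Python SDK (v1+).
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize GPT-4o's modalities in one sentence."}],
)
print(response.choices[0].message.content)
```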

One of the key advancements in GPT-4o is its end-to-end training across text, vision, and audio modalities. This unified approach enables the model to process and understand context from multiple input types simultaneously, rather than relying on separate models for each task. As a result, GPT-4o can directly observe nuances like tone, multiple speakers, and background noises, and it can express itself through laughter, singing, and emotional responses. While the full potential of this multimodal model is still being explored, GPT-4o has already demonstrated state-of-the-art performance on various benchmarks, including visual perception tasks, multilingual evaluations, and audio understanding.

Published benchmarks for Claude 3.5 Sonnet vs GPT-4o

While 3.5 Sonnet appears to be an improvement over its predecessor, how does it compare against OpenAI’s GPT-4o? Let’s do a detailed comparison on specific real-world tasks. In the benchmarks published by Anthropic, the new Claude model outperforms GPT-4o and other models on graduate-level reasoning, undergraduate-level knowledge, coding, multilingual math, and general reasoning. Keep in mind that these benchmarks use a standard set of tasks, which may or may not represent your real-world use cases. Coding ability is the most significant accomplishment of the new model, and in this post we will take some real examples and compare the quality of outputs from Claude 3.5 Sonnet and GPT-4o.

Code Generation with Claude 3.5 Sonnet vs GPT-4o

For assessing the coding abilities of a model, a widely used benchmark is HumanEval, which was introduced by OpenAI. It consists of 164 hand-written programming problems spanning a wide range of difficulties and domains, including math, string manipulation, data structures, and algorithms. Each problem includes a function signature, docstring, body, and several test cases. To evaluate an LLM, the function signature and docstring are provided as a prompt, and the model is tasked with generating the corresponding function body. The generated code is then executed against the test cases to determine if it produces the expected outputs, with the percentage of problems solved correctly serving as the HumanEval score. This benchmark challenges LLMs to understand problem statements, implement the required logic, and handle edge cases correctly based solely on human-readable descriptions. A higher score indicates greater proficiency in understanding and solving coding problems based on human-written prompts.
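To make the setup concrete, here is a hypothetical HumanEval-style problem (illustrative only, not taken from the actual benchmark): the model sees only the signature and docstring, produces the body, and the harness runs test cases against it.

```python
# Illustrative HumanEval-style problem (not from the real benchmark).
# Prompt given to the model: the signature and docstring only.
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[0..i]."""
    # --- a plausible model-generated body ---
    result: list[int] = []
    current = float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

# The harness then executes test cases against the generated body:
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([]) == []
```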

Based on published benchmarks, Claude 3.5 Sonnet achieves a 92.0% score on HumanEval, higher than the 90.2% achieved by GPT-4o. This indicates very strong performance in code generation and error correction.

We will use the coding scenarios below to test the efficacy of these models. For each task, we’ve included a link to the query, which will automatically show you the results with GPT-4o and Claude.

The test cases, their tasks, and the evaluation criteria are listed below.

Test Case 1:

Python Code Generation

Write a script to generate an email address from a name and a domain.
  • The script should accept two input parameters: the person’s full name (first name and last name) and the domain name for the email address.
  • The script should generate a valid email address by converting the full name to lowercase, removing any spaces in the name, concatenating the first name, a dot (.), and the last name, and appending the @ symbol followed by the provided domain name.
  • Bonus points if the script includes additional features such as validating the format of the provided domain name, or allowing the user to choose between different email formats (e.g., firstname.lastname@domain.com, firstinitial.lastname@domain.com); a minimal sketch follows this list.
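Before looking at the model outputs, here is a minimal sketch of one way to solve this task ourselves (illustrative only, not either model’s actual output; the two format names are hypothetical choices):

```python
# Minimal sketch of the email-generation task (illustrative, not either
# model's actual output). Supports two hypothetical format choices.
import re

def generate_email(full_name: str, domain: str, fmt: str = "first.last") -> str:
    """Build an email address from a full name and a domain."""
    # Basic domain validation: dot-separated labels ending in a TLD.
    if not re.fullmatch(r"([a-z0-9-]+\.)+[a-z]{2,}", domain.lower()):
        raise ValueError(f"Invalid domain: {domain}")
    first, last = full_name.strip().lower().split()[:2]
    if fmt == "first.last":
        local = f"{first}.{last}"
    elif fmt == "f.last":
        local = f"{first[0]}.{last}"
    else:
        raise ValueError(f"Unknown format: {fmt}")
    return f"{local}@{domain.lower()}"

print(generate_email("Elon Musk", "tesla.com"))            # elon.musk@tesla.com
print(generate_email("Elon Musk", "tesla.com", "f.last"))  # e.musk@tesla.com
```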

Test Case 2:

Web Page Creation

Create an HTML file that displays a simple personal portfolio webpage. The webpage should include a header with your name, a profile picture, a brief introduction about yourself, and a list of your skills. Use basic HTML tags to structure the content and include some inline CSS to style the elements; a minimal sketch follows the criteria below.

  • The webpage should have a header with the person’s name.
  • The webpage should include a profile picture.
  • The webpage should have a brief introduction about the person.
  • The webpage should include a list of skills.
  • The HTML tags should be used correctly to structure the content.
  • Inline CSS should be used to style the elements.
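For reference, here is a minimal sketch of a script that writes such a page, with placeholder name, picture, and skills (Python here for consistency with the other examples):

```python
# Minimal sketch: writing a portfolio page that meets the listed criteria.
# (Illustrative only; the name, image path, and skills are placeholders.)
html = """<!DOCTYPE html>
<html>
<head><title>Jane Doe - Portfolio</title></head>
<body style="font-family: Arial, sans-serif; margin: 40px;">
  <h1 style="color: #2c3e50;">Jane Doe</h1>
  <img src="profile.jpg" alt="Profile picture" style="width: 150px; border-radius: 50%;">
  <p style="max-width: 600px;">Hi, I'm Jane, a software developer who enjoys
  building small, fast web tools.</p>
  <h2 style="color: #2c3e50;">Skills</h2>
  <ul>
    <li>Python</li>
    <li>JavaScript</li>
    <li>SQL</li>
  </ul>
</body>
</html>"""

with open("portfolio.html", "w") as f:
    f.write(html)
```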

Test Case 3:

API Query Generation

Write a query that calls OpenAI’s DALL-E 3 API to generate an image; a sketch of the expected request follows the criteria below.

  • The script should make a properly formatted API request to the DALL-E 3 service, including the necessary authentication headers and request parameters.
  • It should directly generate a cURL command.
  • It should return a valid response.
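For reference, the request the criteria describe looks roughly like this, shown in Python with `requests` for consistency with the other examples (the endpoint and parameters follow OpenAI’s images API; the prompt is a placeholder), though the test itself asks the models for a cURL command:

```python
# Minimal sketch of the expected DALL-E 3 request (shown in Python; the
# test itself asks for cURL). Assumes OPENAI_API_KEY is set.
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/images/generations",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "dall-e-3",
        "prompt": "A watercolor painting of a lighthouse at dawn",
        "n": 1,
        "size": "1024x1024",
    },
)
response.raise_for_status()
print(response.json()["data"][0]["url"])  # URL of the generated image
```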

Now let’s see how each of the models performed. (Please note that you will require a free trial to select Claude models and generate the outputs.)

#1: Python Code Generation

Claude 3.5 Sonnet: Claude does a pretty good job of creating multiple possible patterns for the email addresses. Typically, each company has a different pattern for an address, and there are 20+ possible patterns; we would have liked it to consider a few more.

Output:

Generated email addresses for Elon Musk at tesla.com:

elon@tesla.com, musk@tesla.com, elon.musk@tesla.com, emusk@tesla.com


Link to Result

GPT-4o: The code looks fine; however, it does not include all possible patterns. With additional instructions, GPT-4o would certainly include them.

Output:

Generated email address: elon.musk@tesla.com

Winner: Claude 3.5 Sonnet!

Overall, both models did well, but Claude readily delivered the expected outcomes.

#2: Web Page Creation

Claude 3.5 Sonnet: Claude created a visually appealing webpage, with essentially no information in the prompt beyond the instructions.

Scroll down to see the webpage Claude created.


Link to Result

GPT-4o: GPT-4o checked all the boxes to generate a page; however, it does not look visually pleasing. The styling and colors would need subsequent instructions.

See the webpage GPT-4o created below.

Winner: Claude 3.5 Sonnet!

Claude wins again. It seems to do just a little bit better and reduces the need for subsequent prompting.

#3: API Query Generation


Claude 3.5 Sonnet: Claude directly generated a cURL command and returned a result, exactly as expected.


Link to Result

GPT-4o: GPT ended up generating a bash script, which required taking additional steps. It does work; however, it did not generate a cURL command as requested.

Winner: Claude 3.5 Sonnet, again!

Per the evaluation criteria, the expectation was to get a cURL request. This one is debatable, as GPT-4o’s code would also have worked and added additional error validation as well.


This is what Claude 3.5 Sonnet generated for the web page:

Web page generated by GPT-4o


Overall Result: Does Claude 3.5 Sonnet outperform GPT-4o?

Based on the published benchmarks and our evaluation tests, it does indeed look like Claude’s new Sonnet model outperforms GPT-4o, at least on coding ability. GPT-4o is still a great model and preferred by many. The competition is heating up, especially with Claude 3.5 Sonnet being cheaper ($3 per million input tokens vs. $5 per million input tokens for GPT-4o).
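A quick back-of-the-envelope comparison of input-token costs at those published rates (output-token pricing is ignored here):

```python
# Back-of-the-envelope input-cost comparison at the published rates.
tokens = 10_000_000  # example workload: 10M input tokens

claude_cost = tokens / 1_000_000 * 3  # $3 per million input tokens
gpt4o_cost = tokens / 1_000_000 * 5   # $5 per million input tokens

print(f"Claude 3.5 Sonnet: ${claude_cost:,.0f}")  # $30
print(f"GPT-4o:            ${gpt4o_cost:,.0f}")   # $50 -> 40% more on inputs
```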

If you want to use AI for code generation or content writing, or to use multiple premium models (Claude, GPT, Mistral, Command R+), please try Bind AI.