GLM 5.2 vs Claude Opus 4.8 vs GPT-5.5 – Which Is Better for Coding?

June 22, 2026
10:31 am

Z.ai, the creator of the GLM models, recently launched GLM 5.2. It is an open-weight model that most developers were not following this month, and it beat every closed-source model on Design Arena’s human-preference coding leaderboard. That result matters more than any lab-published score because SWE-bench results are, at this point, partially a story companies tell about themselves. Claude Opus 4.8 posts 69.2% on SWE-bench Pro and wins that comparison. GPT-5.5 posts 58.6% on the same test and generates far more coverage. So where does GLM 5.2 fit? This coding comparison of GLM 5.2, Opus 4.8, and GPT-5.5 answers that, along with other disconnects you won’t learn about from social coverage.

Why Do AI Coding Benchmarks Keep Telling Contradictory Stories?

This is the question nobody wants to answer cleanly, but it is the right place to start. GPT-5.5 launched in April 2026 with a Terminal-Bench 2.0 score of 82.7% and dominated the headlines. Claude Opus 4.8 launched five weeks later with a Terminal-Bench 2.1 score of 85.0% and received considerably less attention. The problem is that those two scores were produced in different testing environments, and on different versions of the benchmark. GPT-5.5 ran on OpenAI’s Codex CLI harness. Opus 4.8 and GLM 5.2 ran on Terminus-2. Comparing them directly is like timing two sprinters on different tracks and declaring a winner.

The harness issue is not a technical footnote. It represents a structural problem in how frontier labs publish benchmark results: each lab optimizes its submission environment for its own model’s strengths, so third-party platforms like vals.ai frequently produce meaningfully different scores. On vals.ai’s standardized harness, GPT-5.5’s SWE-bench Verified score lands at 82.6%. Anthropic’s own system card puts Opus 4.8 at 88.6% on the same benchmark, but that figure is self-reported, so even these two numbers were not measured the same way.

SWE-bench Pro is the most trustworthy benchmark in this comparison because it pulls tasks from post-training-cutoff repositories, making memorization nearly impossible. On Scale AI’s standardized scaffolding across 1,865 tasks, the ranking is unambiguous: Claude Opus 4.8 at 69.2%, GLM 5.2 at 62.1%, GPT-5.5 at 58.6%. That ordering holds across every long-horizon benchmark available.

GLM 5.2 vs Claude Opus 4.8 vs GPT-5.5 – Benchmark Comparison

With the context above, here is the data.

Benchmark	GLM 5.2	Claude Opus 4.8	GPT-5.5
SWE-bench Pro (Scale SEAL harness)	62.1%	69.2%	58.6%
SWE-bench Verified	Not reported	88.6% (self-reported)	82.6% (vals.ai)
Terminal-Bench 2.1	81.0% (Terminus-2)	85.0% (Terminus-2)	83.4% (Codex CLI)*
FrontierSWE	~74%	~75%	Below GLM 5.2
Design Arena Coding (Elo)	1360 (1st overall)	Not ranked	Not ranked
PostTrainBench rank	2nd overall	1st overall	Below top 2

*Different harness. Not directly comparable to Terminus-2 scores.
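One way to keep the footnote’s caveat from getting lost is to group scores by harness before ranking anything. The sketch below does that with the Terminal-Bench 2.1 row from the table. The data layout and the function name `rank_within_harness` are illustrative choices made here, not part of any benchmark tooling.

```python
from collections import defaultdict

# Terminal-Bench 2.1 row from the table above: (model, score %, harness).
TERMINAL_BENCH_2_1 = [
    ("GLM 5.2", 81.0, "Terminus-2"),
    ("Claude Opus 4.8", 85.0, "Terminus-2"),
    ("GPT-5.5", 83.4, "Codex CLI"),
]


def rank_within_harness(rows):
    """Group results by harness and rank only models that share one."""
    by_harness = defaultdict(list)
    for model, score, harness in rows:
        by_harness[harness].append((model, score))
    # Sort each group from highest to lowest score.
    return {
        harness: sorted(results, key=lambda r: r[1], reverse=True)
        for harness, results in by_harness.items()
    }


rankings = rank_within_harness(TERMINAL_BENCH_2_1)
for harness, results in rankings.items():
    print(harness, results)
```

Grouped this way, Opus 4.8 and GLM 5.2 can be ranked against each other, while GPT-5.5 sits alone in its own group with nothing to be compared to.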

Two things stand out. Opus 4.8 leads on everything that resists gaming. GPT-5.5’s strongest numbers consistently come from OpenAI’s own toolchain or internal evals. The Design Arena result complicates that picture. It is a crowdsourced human-preference competition, not a lab-curated benchmark, and GLM 5.2 won it outright, beating Claude Fable 5, the previous number one. Human preference is what shipping software actually depends on, and that signal pointed in a different direction from the press cycle around GPT-5.5’s launch.

Which Model Is Best for Long, Complex Engineering Tasks?

The short answer is Claude Opus 4.8, but the margin depends heavily on which benchmark you look at. Three long-horizon tests measure this most directly.

FrontierSWE (open-ended technical work at the scale of hours to tens of hours, covering systems optimization, large-scale code construction, and applied ML research):

Claude Opus 4.8 leads the field
GLM 5.2 trails by about 1 percentage point, which is inside margin-of-error territory for benchmarks of this type
GPT-5.5 falls below both and does not rank near the top on this benchmark

PostTrainBench (each agent receives an H100 GPU and must improve a smaller model through post-training, rewarding sustained multi-step reasoning over extended compute budgets):

Claude Opus 4.8 finishes first overall
GLM 5.2 finishes second, ahead of every other model in the field, open or closed
GPT-5.5 does not finish in the top two

SWE-Marathon (ultra-long tasks including building compilers from scratch and developing production-grade services):

Claude Opus 4.8 leads GLM 5.2 by 13%
GLM 5.2 holds second place
GPT-5.5 trails both by a meaningful margin

What these three share is a consistent hierarchy: Opus 4.8 first, GLM 5.2 second, GPT-5.5 third. That consistency is more informative than any individual score. GPT-5.5 is a capable model, but the narrative that it leads on coding is not supported by long-horizon data. Its strongest results are in terminal execution, which is a real strength but a narrower capability than most of its coverage implies.

GLM 5.2 vs Claude Opus 4.8 vs GPT-5.5 – Pricing Comparison

This is the comparison most coverage sidesteps because the answer is uncomfortable. GLM 5.2 outperforms GPT-5.5 on SWE-bench Pro, tops the human-preference coding leaderboard, and costs roughly 6.8 times less per output token ($4.40 versus $30.00 per million). That is not a tradeoff. It is a structural problem for the closed-source pricing argument.

GLM 5.2 (via FriendliAI; self-hosting available at near-zero variable cost under MIT license):

Input: $1.40 per million tokens
Output: $4.40 per million tokens

Claude Opus 4.8 (Anthropic API; pricing unchanged from Opus 4.7):

Input: $5.00 per million tokens
Output: $25.00 per million tokens
Fast Mode (2.5x speed): $10.00 input / $50.00 output per million tokens

GPT-5.5 (OpenAI API):

Input: $5.00 per million tokens
Output: $30.00 per million tokens
GPT-5.5 Pro (extended deliberative reasoning): $30.00 input / $180.00 output per million tokens

OpenAI argued at launch that GPT-5.5 uses roughly 40% fewer output tokens on Codex-style tasks than GPT-5.4, bringing the effective cost increase to around 20% over the previous model. That math works when you compare GPT-5.5 to GPT-5.4. It collapses when you compare it to GLM 5.2. No token efficiency argument closes a gap of nearly seven times on output pricing, especially for teams running continuous agent pipelines where output volume is the primary cost driver. Opus 4.8 is a harder call. Its 7-point SWE-bench Pro lead over GLM 5.2 is real and consistent across data sets. Whether that edge justifies a 5.7x output token premium depends entirely on whether accuracy or throughput is the binding constraint in your system.
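The arithmetic behind those multiples is easy to check. The sketch below uses only the output prices and SWE-bench Pro scores quoted in this article. Two parts are assumptions made here for illustration: applying OpenAI’s 40% token reduction against GLM 5.2 (the claim was made relative to GPT-5.4, so this is deliberately generous to GPT-5.5), and treating the SWE-bench Pro score as a solve rate with equal output tokens per attempt.

```python
# Output prices in USD per million tokens, from the price lists above.
OUTPUT_PRICE = {"GLM 5.2": 4.40, "Claude Opus 4.8": 25.00, "GPT-5.5": 30.00}
# SWE-bench Pro scores from the benchmark table, as fractions.
SWE_BENCH_PRO = {"GLM 5.2": 0.621, "Claude Opus 4.8": 0.692, "GPT-5.5": 0.586}


def price_ratio(model, baseline="GLM 5.2", token_factor=1.0):
    """Output cost of `model` relative to `baseline`.

    token_factor scales how many output tokens `model` emits compared
    with the baseline (1.0 means the same volume).
    """
    return OUTPUT_PRICE[model] * token_factor / OUTPUT_PRICE[baseline]


def cost_per_solved_ratio(model, baseline="GLM 5.2"):
    """Relative output cost per solved task.

    Assumption: equal output tokens per attempt, and the SWE-bench Pro
    score stands in for the fraction of attempts that succeed.
    """
    model_cost = OUTPUT_PRICE[model] / SWE_BENCH_PRO[model]
    base_cost = OUTPUT_PRICE[baseline] / SWE_BENCH_PRO[baseline]
    return model_cost / base_cost


print(round(price_ratio("GPT-5.5"), 1))                    # 6.8
print(round(price_ratio("Claude Opus 4.8"), 1))            # 5.7
# Hypothetical: GPT-5.5 emitting 40% fewer tokens than GLM 5.2.
print(round(price_ratio("GPT-5.5", token_factor=0.6), 1))  # 4.1
print(round(cost_per_solved_ratio("Claude Opus 4.8"), 1))  # 5.1
```

Under these assumptions, even the generous token discount leaves GPT-5.5 at about four times GLM 5.2’s output cost, and the Opus 4.8 accuracy lead narrows its premium per solved task only from 5.7x to about 5.1x.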

Why Is GLM 5.2’s MIT License the Most Underreported Part of This Story?

The benchmark discussion risks burying the actual lede here. GLM 5.2 is the first open-weight model to top Design Arena’s coding leaderboard at this capability level, and it ships under a pure MIT license with no geographic limits, no usage restrictions, and full weight availability on HuggingFace and ModelScope. That combination has not previously existed at frontier-competitive performance. Z.ai’s documentation explicitly uses the phrase “no regional limits,” which is not incidental language. It is a direct response to the export control and access restrictions that increasingly define how closed-source AI gets deployed globally.

For teams evaluating coding models seriously, the MIT license changes three things that benchmarks cannot:

Infrastructure control: GLM 5.2 runs on-premises, in sovereign cloud environments, or in air-gapped systems. No closed-source model can offer that regardless of price negotiations.
Fine-tuning rights: The license permits fine-tuning on proprietary codebases without terms-of-service review or data-sharing concerns. For companies with IP or regulatory exposure, this is a substantive operational difference.
Vendor risk elimination: Weights already downloaded under an MIT license cannot be deprecated, repriced, or made unavailable through policy changes. In June 2026, Anthropic suspended access to Claude Fable 5 and Mythos for all users due to export control directives. GLM 5.2 users were unaffected.

The effort-level controls GLM 5.2 introduces add a practical dimension to the cost argument. The Max setting runs at peak performance but consumes around 85,000 output tokens per task. The High setting gives up only a few benchmark points while cutting token output nearly in half. That user-controlled compute tradeoff is rare at this level and directly addresses the cost objection without requiring a model switch.
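To put the effort-level tradeoff in dollar terms, here is a rough per-task estimate at the FriendliAI output price listed earlier. It counts output tokens only and treats “nearly in half” as exactly half, which is an assumption made here for illustration. The names `output_cost` and `HIGH_TOKEN_FRACTION` are likewise illustrative.

```python
GLM_OUTPUT_PRICE_PER_M = 4.40  # USD per million output tokens (FriendliAI)
MAX_TOKENS_PER_TASK = 85_000   # approximate figure quoted for the Max setting
HIGH_TOKEN_FRACTION = 0.5      # assumption: "nearly in half" taken as exactly half


def output_cost(tokens, price_per_million=GLM_OUTPUT_PRICE_PER_M):
    """Output-token cost in USD for one task (input tokens are ignored)."""
    return tokens / 1_000_000 * price_per_million


max_cost = output_cost(MAX_TOKENS_PER_TASK)
high_cost = output_cost(MAX_TOKENS_PER_TASK * HIGH_TOKEN_FRACTION)

print(f"Max:  ${max_cost:.3f} per task")   # Max:  $0.374 per task
print(f"High: ${high_cost:.3f} per task")  # High: $0.187 per task
```

At 10,000 tasks, that difference works out to roughly $3,740 versus $1,870 in output spend, before any input-token or hosting costs.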

Which Coding AI Model Should You Actually Be Using in 2026?

Based on the benchmark data and pricing above, here is the practical routing guide.

Choose Claude Opus 4.8 if:

Multi-file software engineering is your primary workload, where its 69.2% on SWE-bench Pro is the most consistent signal available
Dynamic Workflows, specifically the ability to spawn parallel subagents for repository-scale tasks, is something your team will actually use
You are already on Anthropic’s Max, Team, or Enterprise plan with Claude Code embedded in your workflow
The 7-point SWE-bench Pro lead over GLM 5.2 is worth the 5.7x output token premium for your specific accuracy requirements

Choose GPT-5.5 if:

Terminal-first workflows, DevOps pipelines, and shell automation are where your compute budget goes
Long-context retrieval at scale is critical and MRCR v2 performance at 512K to 1M tokens is a hard requirement (74.0%, up from 36.6% on GPT-5.4)
Your team is deeply embedded in Codex with high switching costs and ecosystem dependencies
System-level conceptual debugging across ambiguous multi-file failures matches GPT-5.5’s documented behavioral strengths

Choose GLM 5.2 if:

Output token volume is high and a six-to-one cost gap is a real architectural consideration
Sovereign or private infrastructure hosting is a security or compliance requirement that cannot be waived
MIT licensing for fine-tuning on proprietary code is important and closed-source terms are a legal concern
Agentic tool calling is central to your workflow, where GLM 5.2 surpasses GPT-5.5 when tools are enabled
You want native framework support across vLLM, SGLang, transformers, and KTransformers without additional middleware

The Bottom Line

GLM 5.2 vs Claude Opus 4.8 vs GPT-5.5 is not a benchmark horse race with a clean podium, and reducing it to one misses the structural shift this generation represents. Claude Opus 4.8 is the most capable publicly available model for multi-file software engineering, and the SWE-bench Pro lead is backed by the most tamper-resistant data available right now. GPT-5.5 owns terminal and pipeline-heavy workflows, even if its headline numbers deserve more scrutiny than they usually get. GLM 5.2 is the development worth watching: an open-weight model within 1% of Opus 4.8 on FrontierSWE, sitting at number one on the only benchmark real engineers voted on, at roughly one-sixth the output token cost and zero vendor dependency. The closed-source default assumption for production coding in 2026 is no longer obviously correct.
