Snowflake's CEO declared a tie. The iteration ledger didn't.

Snowflake CEO tested GLM-5.2 vs Claude Opus 4.7: 103 tasks, pass@3 near-tie, 2x token use, 5.7x cheaper output. Here's what the numbers actually mean for builders.

Creeta

Jun 24, 2026

Snowflake's CEO declared a tie. The iteration ledger didn't.

A CEO calling a near-tie is one thing. Reading the iteration counts underneath it is another — and the gap between the two is where this story actually lives.

The Snowflake DuckDB Experiment: Design and Outcomes

On June 24, 2026, Snowflake CEO Sridhar Ramaswamy said Zhipu/Z.ai's open-weight GLM-5.2 was competitive with Anthropic's Claude Opus 4.7 on real coding work at a fraction of the cost . The claim rests on an internal Snowflake evaluation, not a vendor marketing test: 103 real-world programming tasks, each attempted three times, where models had to produce code that ran simultaneously on DuckDB and the Snowflake platform .

The pass@3 result was a near-tie. GLM-5.2 solved 66% of the problems against Opus 4.7's 67% — a one-percentage-point gap . Read that number carefully: it covers SQL and data-engineering tasks only, scored by a single executive over one domain, not a peer-reviewed or reproducible public eval.

Metric (Snowflake internal eval)	GLM-5.2	Opus 4.7
Pass@3 (3 attempts)	66%	67%
First-attempt accuracy	47.6%	53.7%
Avg. iterations per task	99	80
Total tokens across run	~860M	439M

Ramaswamy's read cut both ways. He credited GLM-5.2 with uniquely validating certain cross-platform programs — code that had to run on both DuckDB and Snowflake at once — that Opus could not solve . He also flagged the failure mode.

"Only GLM could solve certain tasks," Ramaswamy noted, while observing that on others GLM "gave up too early" and ran excessive, ineffective verification loops (source: The Decoder, 2026-06).

One qualification matters before anyone maps this onto leaderboards. Snowflake benchmarked against Opus 4.7 — the model in production at the time — whereas most public comparison tables now lead with the newer Opus 4.8 . The targets are not the same model, so a near-tie here does not transfer cleanly to a near-tie there. What it does establish is a concrete, workload-grounded data point: on Snowflake's own SQL-heavy tasks, the open-weight challenger landed within a point of the premium model — and the cost of getting there is the next thing worth measuring.

Where the Dollar Multiple Comes From, and Where It Goes

The headline discount is real on list prices, but the size of it depends on how you buy. On Z.ai's official pricing, GLM-5.2 costs $1.40 per 1M input tokens and $4.40 per 1M output, with cached input at $0.26 per 1M and cache storage listed as free for a limited time. Anthropic's Opus 4.7 holds at $5 input and $25 output per 1M tokens under model ID claude-opus-4-7.

That puts GLM-5.2 input at roughly 3.6× cheaper and output at 5.7× cheaper on list. Third-party routing widens the gap: OpenRouter lists GLM-5.2 at $0.95 input / $3 output, about 19% and 12% of Opus 4.7's prices. The exact multiple shifts with buying channel, caching policy, and request mix — so "fraction of the cost" is directionally robust rather than a fixed number.

Channel	Input ($/1M)	Output ($/1M)	vs Opus 4.7 output
Opus 4.7 (Anthropic list)	$5.00	$25.00	—
GLM-5.2 (Z.ai list)	$1.40	$4.40	~17.6%
GLM-5.2 (OpenRouter)	$0.95	$3.00	~12%

Per-token price is not per-task spend, though. On Snowflake's agentic run, GLM-5.2 consumed roughly 860M tokens against Opus's ~439M — close to double. That overhead erodes the nominal 5.7× output multiple: a model that needs twice the tokens to finish gives back half its per-token discount.

Even so, the math still favored Z.ai on this workload. A ~2× token count against a 3.6–5.7× price advantage leaves real headroom per completed task. The picture changes at the edges:

Token-heavy jobs — long agent loops or large-context retrieval shrink the margin fastest, since GLM's iteration overhead scales with the run.
Latency-sensitive work — more tokens means more wall-clock time, which can outweigh a dollar saving where speed is the constraint.
Cache-friendly workloads — GLM's $0.26 cached input tilts economics further toward Z.ai when prompts repeat.

The practical read for builders weighing model routing: benchmark the discount on your own task shape, because the savings live in the token count, not the price sheet.

Doubled Iterations and Lower One-Shot Consistency

The token gap is not a rounding error — it is the dominant cost variable. On Snowflake's 103-program run, Opus 4.7 averaged 80 iterations per task and 439 million tokens total, while GLM-5.2 averaged 99 iterations and roughly 860 million tokens — nearly double the consumption for a near-identical pass@3 outcome . The discount survives that overhead, but it shrinks, and it shrinks unevenly across workloads.

First-attempt reliability is the other tradeoff. GLM-5.2 solved 47.6% of problems on the initial try versus Opus 4.7's 53.7% — a roughly 6-point gap . In retry-tolerant batch jobs that is negligible. In latency-sensitive or single-shot paths — an inline agent step, a synchronous code-fix endpoint — every failed first attempt is another round trip, and the token math tilts back toward Opus.

Public benchmarks echo the pattern, though they mostly compare against the newer Opus 4.8 rather than the 4.7 Snowflake tested. Z.ai's own Hugging Face card reports:

SWE-bench Pro: GLM-5.2 at 62.1 vs Opus 4.8 at 69.2
Terminal-Bench 2.1: 81.0 vs 85.0
SWE-Marathon: 13.0 vs 26.0 — Opus 4.8's widest lead, on the longest-horizon task

The SWE-Marathon spread is the one worth watching: it is precisely the sustained, multi-step agentic work where excess iterations compound.

Ramaswamy's framing suggests this is tunable rather than fixed. He noted that GLM sometimes "gave up too early" and ran "excessive, ineffective verification loops" . That reads as a prompt-engineering signal, not a hard architectural ceiling. Capping retry depth, tightening stop conditions, or adding an explicit verification budget in the system prompt could claw back much of the token overhead — which means the per-task economics you measure today are a floor you can likely lower, not a fixed verdict.

Geopolitical Provenance: Ascend Chips and Residency Concerns

That economic floor sits on top of an unusual supply chain. GLM-5.2 is Z.ai's open-weight Mixture-of-Experts flagship, published as zai-org/GLM-5.2 on Hugging Face under an MIT license on June 16, 2026. It carries 753B total parameters with roughly 40B active per forward pass , and extends usable context to 1M tokens, up from GLM-5.1's 200K .

The provenance is what makes this more than a pricing story. Per Decrypt, GLM-5.2 was reportedly trained entirely on Huawei Ascend chips with no NVIDIA involvement . Treat that as reported, not confirmed — Z.ai has not published a hardware disclosure.

The same caveat applies to the cost figure:

"~$25M training expenditure," — estimate attributed to Emad Mostaque (source: Decrypt).

That number is a third party's read, not a vendor disclosure . Useful as a directional signal, not a line item.

For developers, the open-weight license changes the governance math:

Hosted API: calling Z.ai's endpoint raises China data-residency and transfer questions for regulated workloads.
Self-host: an MIT-licensed downloadable checkpoint sidesteps that entirely — weights running on your own infrastructure remove the cross-border transfer concern for most enterprise scenarios .

The structural pattern is familiar. Business Insider and Axios both frame GLM-5.2 as another DeepSeek-style moment — an open-weight challenger from China that can be downloaded, modified, and operated inside users' own systems, reducing dependence on closed U.S. providers . Axios notes the reception reignited debate over how fast China is closing the gap, while flagging that benchmark wins do not automatically equal frontier parity given compute, data, and infrastructure constraints . Same playbook as DeepSeek R1 in early 2025 — the difference now is the workload is agentic coding, not chat.

High-Retry Scenarios vs. One-Shot Discipline

For builders weighing a routing switch, the decision splits cleanly by workload shape. GLM-5.2 is the stronger economic choice where retries are cheap and tolerated; Opus 4.7/4.8 holds the edge where a single attempt has to land. The Snowflake near-tie — 66% vs 67% on pass@3 — only holds at three attempts, so it maps directly onto harnesses that already retry.

GLM-5.2 favors:

Batch-style agentic jobs where wall-clock and token count aren't tightly capped.
pass@3-style harnesses that retry by design — exactly Snowflake's setup.
Non-latency-sensitive pipelines where verification loops are acceptable overhead.

Opus 4.7/4.8 favors:

Tasks needing single-attempt precision, where Opus led 53.7% vs 47.6% on first try .
Work outside the SQL/data-engineering domain the Snowflake eval measured.
Fixed per-token budgets, where GLM's roughly 2x token use makes retry overhead expensive.

One independent signal outside SQL points the same direction. Semgrep's cyber benchmark found GLM-5.2 at 39% F1 on IDOR detection versus Claude Code's 32%, at roughly $0.17 per vulnerability found . Different domain, same directional result: competitive accuracy at a large cost discount. That corroboration matters because the Snowflake test is not reproducible from a public endpoint, and no independently verified result currently exists for the SQL-agentic domain.

The takeaway is narrow on purpose. Ramaswamy declared a tie on one internal, domain-specific run; the iteration ledger — 99 iterations and ~860M tokens for GLM against 80 and ~439M for Opus — did not. Before you route Opus calls to GLM-5.2 inside Claude Code or a similar harness, run your own eval on your own tasks. The price gap is real; whether it survives your token profile is the only number that decides it.

Frequently asked questions

What is pass@3, and why does it matter for evaluating agentic coding performance?

Pass@3 measures whether a model solves a problem within up to three attempts, not on the first try alone. Agentic harnesses retry automatically, so pass@3 maps more closely to real-world success than single-shot accuracy. That framing is why GLM-5.2's weaker first-attempt rate doesn't sink it here: it landed 66% pass@3 versus Opus 4.7's 67%, even though Opus led 53.7% to 47.6% on first attempt . If your harness retries, pass@3 is the number to weigh; if it doesn't, first-shot reliability matters more.

How does GLM-5.2 price out per completed program after accounting for its higher token usage?

Per task, GLM-5.2 still costs less despite burning more tokens. On Snowflake's run it averaged ~860M tokens against Opus's ~439M — roughly 2× . But list output pricing is $4.40 per 1M for GLM-5.2 versus $25 per 1M for Opus 4.7 , so doubling token count does not close a ~5.7× output-price gap. The exact multiple shifts with your output/input ratio, caching, and whether you buy direct or through a router like OpenRouter, which lists GLM-5.2 at $0.95 input / $3 output .

Can I self-host GLM-5.2, and does that resolve enterprise data-residency concerns?

Yes. The weights ship under an MIT license as zai-org/GLM-5.2 on Hugging Face, so you can run them on your own infrastructure . Self-hosting eliminates API calls to Z.ai's servers, which removes the China data-transfer concern that hosted use raises. The Mixture-of-Experts design keeps roughly 40B parameters active per token out of 753B total , so inference requirements are more tractable than the headline size suggests — though serving 753B of weights still demands serious memory.

Is the Snowflake benchmark reproducible by third parties?

Not currently. It is an internal Snowflake evaluation disclosed through CEO Sridhar Ramaswamy's public statement on June 24, 2026 . The 103-task harness, the three-attempt protocol, and the DuckDB-plus-Snowflake test suite are not publicly downloadable, so no outside party can re-run it as specified. Treat it as a directional signal from a non-neutral source — Snowflake has a stake in cheaper model routing — rather than a settled, peer-reviewed result. Benchmark on your own workloads before acting on it.

Did the Snowflake test compare GLM-5.2 to Opus 4.7 or the newer Opus 4.8?

Opus 4.7. The near-parity framing applies specifically to that checkpoint. Published leaderboard tables — including Z.ai's own Hugging Face model card — benchmark GLM-5.2 mainly against the newer Opus 4.8, where Anthropic keeps a wider lead: 69.2 versus 62.1 on SWE-bench Pro and 85.0 versus 81.0 on Terminal-Bench 2.1 . So "matches Opus" is accurate for the 4.7 run Snowflake used, but does not extend cleanly to Anthropic's current top model.