MiniMax M3 benchmarks at $0.30/M: verified vs. vendor-only

MiniMax M3 at $0.30/M: what the 1M-sequence benchmarks mean, credential selection, and a quickstart.

MiniMax shipped M3 on 2026-06-01, and the headline is not a benchmark — it's the attention mechanism that makes a 1,000,000-token context window economically plausible. The catch: nearly every number comes from MiniMax itself.

Sparse attention and 1M-sequence support: M3's headline improvements

M3's core change is MSA (MiniMax Sparse Attention), which replaces full quadratic attention with KV-block selection: a "KV outer gather Q" scheme that uses KV blocks as the outer loop to aggregate queries. MiniMax claims that at 1M context MSA cuts per-token compute to roughly 1/20 of the prior generation, yielding more than 9x faster prefill, more than 15x faster decoding, and over 4x the throughput of open-source Flash-Sparse-Attention. That is what lets the model expose a 1M-token window instead of the M2.7-era 204,800-token ceiling.

The benchmarks are strong but vendor-reported, run on MiniMax infrastructure with non-public scoring setups, and not independently reproducible as of 2026-06-09.

Benchmark            M3 (vendor-reported)   Comparison
SWE-Bench Pro        59.0%                  Ahead of GPT-5.5 and Gemini 3.1 Pro, behind Opus 4.7
BrowseComp           83.5                   vs. Opus 4.7's 79.3
Terminal-Bench 2.1   66.0%

All figures above are sourced to MiniMax's launch post; technical press has flagged the "frontier" claims as unverified. As TechTimes noted, the methodology "relies on internal MiniMax infrastructure," so treat the leaderboard wins as marketing until third parties reproduce them.

Two more things shape how you'll use M3. It is natively multimodal: text, image, video, and computer use were trained together on roughly 100 trillion interleaved tokens, rather than a vision adapter being bolted onto a text backbone. Parameter count and MoE structure were undisclosed at launch. And despite the "open-weight" framing, the weights and technical report were not published at announcement; they were promised about 10 days after launch, with the license unspecified. Treat M3 as API-only for now. One practical limit: the API guarantees a 512K-token floor per request, while the full 1M band is initial-access-limited for Standard pay-as-you-go, so plan for 512K as your working ceiling.

What to resolve before writing a request

Before your first call, settle three things: which credential system you hold, which pricing band you fall into, and whether you can live without open weights. M3 runs two separate key systems that cannot be mixed: a Standard API key for pay-as-you-go, created at platform.minimax.io, and a Subscription Key tied to a Token Plan. Sending requests with the wrong key type returns quota errors. Decide this first, because it dictates every later header.

On price, Standard sits under a permanent 50% launch discount: for inputs up to 512K, $0.30/M input, $1.20/M output, and $0.06/M prompt-cache-read (list $0.60/$2.40/$0.12). Priority service via service_tier costs 50% more ($0.45/$1.80/$0.09) and buys scheduling priority and stable latency.
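To make the rate card concrete, here is a minimal cost-estimator sketch. The rates are the discounted figures quoted above; the names Rates and estimate_cost are illustrative, and the billing model is an assumption (cached input tokens are charged at the cache-read rate instead of the input rate, with no separate cache-write charge).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens."""
    input: float
    output: float
    cache_read: float


# Discounted rates quoted above for requests with inputs up to 512K tokens.
STANDARD = Rates(input=0.30, output=1.20, cache_read=0.06)
PRIORITY = Rates(input=0.45, output=1.80, cache_read=0.09)


def estimate_cost(rates: Rates, input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0) -> float:
    """Estimate USD cost of one request.

    Assumption: cached_tokens is the part of the input billed at the
    cache-read rate; the remainder is billed at the normal input rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh = input_tokens - cached_tokens
    total = (fresh * rates.input
             + cached_tokens * rates.cache_read
             + output_tokens * rates.output)
    return total / 1_000_000


if __name__ == "__main__":
    # A 400K-token prompt, 300K of it served from cache, with a 4K-token reply.
    for name, rates in (("standard", STANDARD), ("priority", PRIORITY)):
        print(name, round(estimate_cost(rates, 400_000, 4_000, 300_000), 4))
```

Under these assumptions the example request comes to about $0.053 on Standard and $0.079 on Priority, which shows how much of a long-context bill depends on the cache-hit share.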

Token Plan suits individuals and small teams: Plus $20/mo, Max $50/mo, Ultra $120/mo, with approximate monthly M3 quotas of ~1.7B, ~5.1B, and ~9.8B tokens against 5-hour rolling and weekly windows.
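A rough way to compare the plans with pay-as-you-go is to divide each plan price by its approximate quota. The sketch below does only that arithmetic; it assumes the full monthly quota is actually usable within the rolling windows, and the blended pay-as-you-go rate passed to break_even_million_tokens is a number you supply from your own input/output mix.

```python
# Plan price in USD per month and approximate monthly M3 quota in tokens,
# as quoted above.
PLANS = {
    "Plus": (20, 1.7e9),
    "Max": (50, 5.1e9),
    "Ultra": (120, 9.8e9),
}


def effective_usd_per_million(price_usd: float, quota_tokens: float) -> float:
    """Plan price spread over the whole quota, in USD per million tokens."""
    return price_usd / (quota_tokens / 1_000_000)


def break_even_million_tokens(price_usd: float,
                              blended_usd_per_million: float) -> float:
    """Monthly volume (millions of tokens) at which pay-as-you-go spend
    equals the plan price, for a caller-supplied blended rate."""
    return price_usd / blended_usd_per_million


if __name__ == "__main__":
    for name, (price, quota) in PLANS.items():
        print(name, round(effective_usd_per_million(price, quota), 4))
    # Plus versus pay-as-you-go at an input-only rate of $0.30/M.
    print(round(break_even_million_tokens(20, 0.30), 1))
```

If the quota figures hold, every plan works out to roughly a cent per million tokens, so the decision turns on whether the rolling windows fit your usage pattern rather than on unit price.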

If open weights are a hard requirement, they had not materialized as of 2026-06-09, with the license still unspecified — plan for API-only access until a public release lands.

How to point an Anthropic-compatible client at MiniMax M3

The fastest integration path is the Anthropic-compatible endpoint, and it needs no SDK swap: M3 is a drop-in at the URL level. Set ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic and put your key in ANTHROPIC_API_KEY (the pay-as-you-go Standard key or a Token Plan Subscription key, depending on which key system you resolved earlier), then call client.messages.create(model='MiniMax-M3', max_tokens=..., messages=[...]) against the installed Anthropic SDK.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY + ANTHROPIC_BASE_URL

resp = client.messages.create(
    model="MiniMax-M3",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
    thinking={"type": "disabled"},  # faster direct answers
)
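# resp.content is a list of content blocks; print only the text blocks.
for block in resp.content:
    if block.type == "text":
        print(block.text)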

This endpoint supports streaming, tool definitions, tool_choice, temperature in [0,2], and top_p in [0,1]. Thinking is on by default for M3; unlike M2.x, where it could not be turned off, you can set thinking={'type': 'disabled'} for faster direct answers or thinking={'type': 'adaptive'} to keep it on explicitly.

Prefer OpenAI tooling? Set OPENAI_BASE_URL=https://api.minimax.io/v1 and call chat.completions.create(model='MiniMax-M3', messages=[...]). To surface reasoning separately in reasoning_content/reasoning_details rather than buried in content, add extra_body={'reasoning_split': True}.
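A minimal sketch of the OpenAI-compatible call, assuming the openai Python package is installed and the key is exported as OPENAI_API_KEY. The helper names build_chat_kwargs and ask are illustrative. Because reasoning_content is a MiniMax extension field rather than part of the standard response type, the sketch reads it defensively.

```python
import os
from typing import Optional, Tuple

MINIMAX_OPENAI_BASE_URL = "https://api.minimax.io/v1"


def build_chat_kwargs(prompt: str, split_reasoning: bool = True) -> dict:
    """Keyword arguments for chat.completions.create on the OpenAI path."""
    kwargs = {
        "model": "MiniMax-M3",
        "messages": [{"role": "user", "content": prompt}],
    }
    if split_reasoning:
        # The OpenAI SDK merges extra_body into the JSON request body.
        kwargs["extra_body"] = {"reasoning_split": True}
    return kwargs


def ask(prompt: str) -> Tuple[Optional[str], Optional[str]]:
    """Send one prompt and return (reasoning, answer). Makes a live API call."""
    from openai import OpenAI  # lazy import: the builder works without the SDK

    client = OpenAI(base_url=MINIMAX_OPENAI_BASE_URL,
                    api_key=os.environ["OPENAI_API_KEY"])
    resp = client.chat.completions.create(**build_chat_kwargs(prompt))
    message = resp.choices[0].message
    # Assumption: with reasoning_split on, thinking arrives in reasoning_content.
    reasoning = getattr(message, "reasoning_content", None)
    return reasoning, message.content
```

Calling ask("Summarize this repo's build steps.") should return the reasoning and the answer as separate strings; if the reasoning comes back as None, check whether your SDK version exposes the extension field.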

For editor integrations (Cursor, Cline, Roo Code, Kilo Code, Claude Code), point the custom base URL at the MiniMax endpoint and set the model to MiniMax-M3. Some config pages still carry stale M2.7 wording, so verify the model field after saving.

Multimodal requests add content parts directly: image_url accepts JPEG/PNG/GIF/WEBP up to 10 MB with detail set to low/default/high, while video_url takes MP4/AVI/MOV/MKV up to 50 MB by URL or base64 (default fps 1, range 0.2–5), or up to 512 MB via the Files API using mm_file://file_id. The overall request body caps at 64 MB.
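The limits above are easy to check client-side before uploading anything. The sketch below picks a transport from the file extension and size; the function names are illustrative, and treating "MB" as 1024 * 1024 bytes is an assumption. It also checks one consequence that follows from base64's 4/3 expansion: a file near the 50 MB inline limit may not fit the 64 MB request body once encoded, assuming the body cap applies to the encoded payload.

```python
MB = 1024 * 1024  # assumption: the documented limits are binary megabytes

IMAGE_EXTS = {"jpeg", "jpg", "png", "gif", "webp"}
VIDEO_EXTS = {"mp4", "avi", "mov", "mkv"}


def choose_transport(filename: str, size_bytes: int) -> str:
    """Return 'inline' (URL or base64) or 'files_api'; raise if unsupported."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in IMAGE_EXTS:
        if size_bytes <= 10 * MB:
            return "inline"
        raise ValueError("image exceeds the 10 MB limit")
    if ext in VIDEO_EXTS:
        if size_bytes <= 50 * MB:
            return "inline"
        if size_bytes <= 512 * MB:
            return "files_api"
        raise ValueError("video exceeds the 512 MB Files API limit")
    raise ValueError(f"unsupported media type: {filename!r}")


def base64_len(size_bytes: int) -> int:
    """Length in bytes of the padded base64 encoding of size_bytes bytes."""
    return 4 * ((size_bytes + 2) // 3)


def fits_body_as_base64(size_bytes: int, other_bytes: int = 0) -> bool:
    """True if the encoded file plus the rest of the JSON fits in 64 MB."""
    return base64_len(size_bytes) + other_bytes <= 64 * MB
```

On these assumptions a 50 MB video passes the inline size check but fails fits_body_as_base64, so passing it by URL or through the Files API is the safer route for anything above roughly 48 MB.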

The message-history rule you cannot skip

In multi-turn tool-use sessions, append the model's full assistant content list (thinking, text, and tool_use blocks) back into history unchanged after every turn. M3's reasoning continuity depends on reading its own prior thinking blocks verbatim, so the history you store is part of the model's state, not just a transcript.

This matters because thinking is on by default for M3, and stripping or summarizing those thinking blocks between turns produces coherent-looking but silently corrupted responses, with no error raised. It is the most common integration mistake reported since the 2026-06-01 launch. If you need faster direct answers, disable reasoning explicitly with thinking={'type': 'disabled'} rather than editing history after the fact.
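The rule can be sketched as two small history helpers for the Anthropic-compatible path. The sketch uses plain dicts so it runs without the SDK; with the Anthropic SDK you would pass resp.content as assistant_content. The helper names are illustrative, and the tool_result shape follows the Anthropic Messages format.

```python
import copy


def append_assistant_turn(history: list, assistant_content: list) -> None:
    """Store the assistant turn exactly as returned.

    No filtering and no reordering: thinking, text, and tool_use blocks all
    go back into history in the order the model produced them.
    """
    history.append({"role": "assistant",
                    "content": copy.deepcopy(assistant_content)})


def append_tool_results(history: list, results: dict) -> None:
    """Append one user turn carrying a tool_result per executed tool call.

    results maps each tool_use id to that tool's output string.
    """
    history.append({
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": tool_id, "content": output}
            for tool_id, output in results.items()
        ],
    })


def pending_tool_calls(assistant_content: list) -> list:
    """Return the tool_use blocks the caller still has to execute."""
    return [b for b in assistant_content if b.get("type") == "tool_use"]
```

The loop is then: call the model, append_assistant_turn with the full content, run every block from pending_tool_calls, append_tool_results, and call the model again with the untouched history.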

On the OpenAI-compatible path the rule is the same shape: preserve the full assistant message, including tool_calls and reasoning_details, in stored history. The extra_body={'reasoning_split': True} flag only separates thinking into reasoning_content/reasoning_details on a single response; it is a per-request formatting choice, not a signal to modify what you persist.

One more silent failure to guard against is context overflow: call POST /anthropic/v1/messages/count_tokens before sending long inputs. The 512K-token boundary is the hard limit under current Standard access, and hitting it mid-session fails quietly rather than truncating gracefully.
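A preflight guard along these lines can sit in front of every long request. It is a sketch under stated assumptions: client is an anthropic.Anthropic instance pointed at the MiniMax base URL, whose messages.count_tokens method issues the POST above; "512K" is read as 512,000 tokens, which you should confirm against the platform docs; and whether output tokens count against the limit is not stated, so the reserve defaults to zero.

```python
MAX_INPUT_TOKENS = 512_000  # assumption: "512K" read as 512,000 tokens


class ContextBudgetError(RuntimeError):
    """Raised when a request would cross the configured token limit."""


def preflight(client, messages: list, reserve_for_output: int = 0,
              limit: int = MAX_INPUT_TOKENS,
              model: str = "MiniMax-M3") -> int:
    """Count input tokens and raise before sending if they would not fit.

    Returns the counted input tokens so callers can log or budget them.
    """
    counted = client.messages.count_tokens(model=model, messages=messages)
    needed = counted.input_tokens + reserve_for_output
    if needed > limit:
        raise ContextBudgetError(
            f"{needed} tokens needed, limit is {limit}; trim or split the input"
        )
    return counted.input_tokens
```

Raising a loud error here turns the quiet mid-session failure into something your agent loop can catch and handle, for example by splitting the input or starting a fresh session.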

From hello world to autonomous long-horizon sessions

Once a single request round-trips cleanly, the real question is whether M3 holds up across hours, not seconds. MiniMax publicized two agentic spans worth reproducing before you trust them: a roughly 12-hour autonomous reproduction of an ICLR 2025 award paper (18 commits, 23 figures), and a 24-hour CUDA kernel optimization run that made 147 benchmark submissions and 1,959 tool calls, lifting Hopper GPU utilization from 7.6% to 71.3% for a 9.4x speedup. Treat these as test designs, not proof, and run your own long-horizon eval before committing.

For a low-code harness, MiniMax Code (download at agent.minimaxi.com) is updated for M3 and adds an "Agent Team" for concurrent multi-stage workflows; the harness is slated for open-sourcing after launch. To evaluate without that harness, the model is on OpenRouter as minimax/minimax-m3 and available via Ollama for local runs.

Before routing production traffic, do three things: validate latency under your actual quota tier, exercise multi-turn tool-call loops to confirm history preservation holds, and run regressions on anything that previously ran on M2.7, since MSA changes prefill behavior in ways that can surface edge cases. The takeaway: M3 is a credible repo-scale coding and multimodal agent at $0.30/M, but the benchmarks are vendor-reported, so verify on your own workloads first.
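For the latency check, a small harness is enough to get started. The sketch below times any zero-argument callable, for example a lambda wrapping your real client.messages.create call, and summarizes the samples. The function names are illustrative, and the percentile uses the standard library's inclusive method so the estimate never extrapolates beyond the observed samples.

```python
import statistics
import time


def summarize(samples: list) -> dict:
    """Return p50, p95, and max for a list of latencies in seconds."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    cuts = statistics.quantiles(samples, n=20, method="inclusive")
    return {
        "p50": statistics.median(samples),
        "p95": cuts[18],
        "max": max(samples),
    }


def measure_latency(call, runs: int = 20) -> dict:
    """Time `call` sequentially `runs` times and summarize the results."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return summarize(samples)
```

Run it once per key type and service tier you plan to use, at the context lengths you expect in production, since sequential timing with short prompts says little about long-context behavior.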

Frequently asked questions

Is MiniMax M3 actually open-source?

Not yet, in practice. MiniMax markets M3 as open-weight, but neither the weights nor the technical report shipped at the 2026-06-01 launch. Both were promised within roughly 10 days after launch (mid-June), and the official GitHub repository still showed a pre-release README with no published releases when crawled. The exact license was unspecified at announcement. Treat M3 as API-only until the weights and a concrete license land publicly.

What is the difference between a Standard API key and a Subscription Key on MiniMax?

They are two separate billing systems, and mixing them produces quota errors. A Standard Open Platform API key bills pay-as-you-go and is created at platform.minimax.io. A Subscription Key is tied to Token Plan quota (Plus, Max, or Ultra) and draws on plan Credits rather than per-call billing. MiniMax states explicitly that the two key systems are distinct, and Token Plan quota refreshes on rolling 5-hour and weekly windows. Pick the key that matches how you intend to be billed before writing any client config.

How do I preserve M3's reasoning state across multi-turn tool calls?

Append the full assistant content list (thinking, text, and tool_use blocks) back into message history after each turn, unchanged. Reasoning continuity in M3 depends on those blocks remaining intact, so never strip, summarize, or drop the thinking block between turns; doing so silently corrupts the reasoning chain. On the OpenAI-compatible path the same rule applies: preserve the complete assistant message including tool_calls and reasoning_details, and use extra_body={'reasoning_split': True} if you want reasoning separated into reasoning_content.

Does M3 support video input, and what are the size limits?

Yes. M3 accepts video through video_url content parts in MP4, AVI, MOV, and MKV. Via URL or base64 the limit is 50 MB per request, with request bodies capped at 64 MB, and larger files up to 512 MB are supported through the Files API using mm_file://file_id. Frame sampling defaults to 1 fps and accepts a range of 0.2–5. Tune fps down for long clips to control token cost.
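To see how fps drives cost, it helps to count sampled frames. The sketch below assumes frames are roughly duration times fps and that token cost grows with frame count; the per-frame token figure is not given here, so the code stops at frames. Function names are illustrative.

```python
import math

FPS_MIN, FPS_MAX, FPS_DEFAULT = 0.2, 5.0, 1.0


def sampled_frames(duration_s: float, fps: float = FPS_DEFAULT) -> int:
    """Approximate number of frames sampled from a clip at the given fps."""
    if not FPS_MIN <= fps <= FPS_MAX:
        raise ValueError("fps must be within 0.2 to 5")
    # Round first so float noise (e.g. 600 * 0.2) cannot add a spurious frame.
    return math.ceil(round(duration_s * fps, 6))


def fps_for_frame_budget(duration_s: float, max_frames: int) -> float:
    """Highest fps that keeps the clip within max_frames, clamped to range.

    For very long clips the clamp to FPS_MIN can still exceed the budget;
    in that case trim or split the clip instead.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return max(FPS_MIN, min(FPS_MAX, max_frames / duration_s))
```

For a 10-minute clip, the default 1 fps samples about 600 frames while 0.2 fps samples about 120, a fivefold reduction in whatever the per-frame cost turns out to be.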

How does MiniMax M3 pricing compare to GPT-5.5 or Gemini 3.1 Pro?

MiniMax claims M3 runs at roughly 5–10% of comparable proprietary pricing. Standard pay-as-you-go for ≤512K inputs is $0.30/M input and $1.20/M output under a 50% launch discount (list $0.60/$2.40). The full 1M-sequence band is initial-access-limited, so a true apples-to-apples cost comparison at scale requires access approval from MiniMax sales. Remember too that the headline benchmarks are vendor-reported, so cost-per-quality should be judged against your own workloads.