MAI-Thinking-1 beats Anthropic's top model — per Microsoft

Seven MAI models at Build 2026. MAI-Thinking-1 is a 35B-active sparse MoE — specs, claimed scores, and what's still unverified externally.

MAI-Thinking-1 beats Anthropic's top model — per Microsoft
Share

MAI-Thinking-1: The Sparse MoE at the Center of Build 2026

MAI-Thinking-1 is Microsoft's first home-grown flagship reasoning model: a medium-sized sparse Mixture-of-Experts decoder with 35B active and roughly 1T total parameters, trained from scratch on Microsoft-operated Azure infrastructure . It anchored a June 2 keynote that announced not one model but seven, signaling that the Microsoft AI (MAI) team now intends to compete across modalities with its own stack rather than wrap a partner's .

Quick Answer: At Build 2026 (June 2–3), Microsoft unveiled MAI-Thinking-1, its first in-house flagship reasoning model — a sparse Mixture-of-Experts decoder with 35B active and ~1T total parameters. Microsoft reports 52.8% on SWE-Bench Pro, which it positions level with Anthropic's Claude Opus 4.6. It is in private preview on Microsoft Foundry.

The seven models, all built in-house at Microsoft AI and announced June 2–3, span reasoning, coding, image, transcription, and voice :

  • MAI-Thinking-1 — flagship reasoning model
  • MAI-Code-1-Flash — agentic coding model aimed at GitHub Copilot
  • MAI-Image-2.5 and MAI-Image-2.5-Flash — text-to-image plus image editing
  • MAI-Transcribe-1.5 — speech-to-text across 43 languages
  • MAI-Voice-2 and MAI-Voice-2-Flash (the latter listed as coming soon) — text-to-speech

The strategic framing was explicit. Microsoft describes these as models "trained from the ground up" to give it long-term independence and reduce reliance on external suppliers — most directly OpenAI — while seating MAI as a first-party, enterprise-oriented multimodal stack inside Microsoft Foundry rather than a consumer Copilot layer .

For now, access is gated. MAI-Thinking-1 is in private preview on Microsoft Foundry, with a public MAI Playground preview said to be coming soon, and it is also listed for distribution via OpenRouter, Fireworks AI, and Baseten . The sections that follow break down the architecture, the data-lineage choices, and how skeptically to read Microsoft's own benchmark numbers.

78 Layers, 512 Experts, 8,000 GB200 GPUs: The MAI-Base-1 Spec Sheet

MAI-Thinking-1 beats Anthropic's top model — per Microsoft

MAI-Base-1 is the foundation under MAI-Thinking-1: a decoder-only Transformer with alternating dense and Mixture-of-Experts feed-forward blocks, 78 layers, and routing that activates 8 of 512 experts per token . In its base configuration it carries about 34.7B active and 962B total parameters — sparse by design, so only a fraction of the network fires on any given token . The reasoning model built on top, MAI-Thinking-1, lands at roughly 35B active and ~1T total parameters with a 256K-token context window .

Microsoft says the model was pretrained from scratch on 8,000 NVIDIA GB200 GPUs on a Microsoft-operated Azure cluster, consuming 30T pretraining tokens plus 3.55T mid-training tokens . Notably, that corpus excludes synthetic language-model-generated content in favor of publicly available and licensed human sources — a choice we unpack in the next section .

SpecMAI-Base-1 (base)MAI-Thinking-1 (reasoning)
ArchitectureDecoder-only Transformer, dense + MoE blocksSparse MoE
Layers78
Expert routing8 of 512 per token8 of 512 per token
Active parameters~34.7B~35B
Total parameters~962B~1T
Context window256K tokens
Training hardware8,000 NVIDIA GB200 (Azure)Inherits base
Training tokens30T pretrain + 3.55T mid-trainInherits base

On the developer surface, MAI-Thinking-1 supports function calling, layered developer instructions, and the Chat Completions API, so existing tooling that already speaks that schema can target it with minimal rewiring .

One caveat on provenance worth flagging before you treat these figures as settled: Microsoft published a full technical report alongside the June 2, 2026 launch, but it has not been submitted to arXiv or an independent peer-review venue . Read it as a detailed first-party disclosure — richer than a typical model card, but vendor-authored and not externally vetted. The parameter counts, layer depth, and expert configuration are Microsoft's own accounting, useful for sizing the model against rivals but not yet confirmed by a third party.

Zero Distillation and Clean IP: Why Microsoft Made This Tradeoff

The data story behind MAI-Base-1 is a deliberate IP-provenance pitch, not just an engineering footnote. Microsoft says it pretrained the model entirely on publicly available and licensed human-generated sources — web text, public GitHub code, books, academic papers, news, multilingual corpora, and domain materials — and explicitly excluded synthetic, language-model-generated content . That "zero distillation" posture means the model inherits no outputs from another vendor's system, which is the point Microsoft is selling to enterprises.

The reasoning is straightforward for anyone who has worried about training-data lineage. Distilling from a competitor's model can carry copyright exposure from that model's outputs and contamination from its synthetic pipeline — risks that are hard to audit after the fact. By training from scratch on documented human sources, Microsoft can claim a cleaner chain of custody for the weights, which matters more in regulated procurement than a few benchmark points .

The first named application makes the logic concrete: a healthcare-focused model built with Mayo Clinic on MAI infrastructure, a domain where clean data lineage directly reduces regulatory and liability risk . In settings bound by HIPAA-style review, "where did this model's knowledge come from" is a question a procurement team has to answer, and "no competitor's synthetic output" is a cleaner answer than most foundation models can give today.

Microsoft frames the ownership angle bluntly. In the Build keynote, the MAI team argued that "the RLEs and the models you build inside of them become your moat" — the pitch being that customers fine-tune and own behavior rather than rent shared intelligence .

The tradeoff is cost. Excluding synthetic data and training from scratch is more expensive than distillation, which is partly why Microsoft also leaned on its in-house Maia 200 silicon, citing roughly a 1.4x efficiency improvement over commodity GPU clusters . That efficiency claim is the lever that could let Microsoft price MAI-Thinking-1 competitively once general-availability rates land — a number it has not yet published.

52.8% on SWE-Bench Pro: Microsoft's Figures, Awaiting External Eyes

MAI-Thinking-1 beats Anthropic's top model — per Microsoft

Every headline benchmark for MAI-Thinking-1 so far comes from Microsoft's own evaluation harness, not a neutral leaderboard. The reported scores are 52.8% on SWE-Bench Pro, 97.0% on AIME 2025, 94.5% on AIME 2026, and 87.7% on LiveCodeBench v6 . Microsoft frames the SWE-Bench Pro figure as toe-to-toe with Anthropic's Claude Opus 4.6 . Treat them as vendor claims until external runs land.

BenchmarkMAI-Thinking-1 (vendor-reported)Source
SWE-Bench Pro52.8%Microsoft harness
AIME 202597.0%Microsoft harness
AIME 202694.5%Microsoft harness
LiveCodeBench v687.7%Microsoft harness

Microsoft also ran a human-preference study through data vendor Surge, covering 1,276 single- and multi-turn tasks, and says blind raters preferred MAI-Thinking-1 over Claude Sonnet 4.6 in side-by-side comparison . The methodology — blind, multi-turn, large sample — is more rigorous than a single static benchmark. But the study was commissioned and run by Microsoft, so it carries the same conflict-of-interest caveat as the automated scores. A blind protocol controls for rater bias; it does not control for who chose the tasks.

To its credit, Microsoft attaches an anti-overfitting claim to the numbers:

"[The scores were achieved] without specifically targeting any of these benchmarks." — Microsoft AI, MAI keynote, Build 2026 (source: microsoft.ai)

That matters because benchmark-targeted training is a known way to inflate SWE-Bench and AIME figures. If true, it reduces one concern — that the model was tuned to the test rather than to the task. It does not substitute for independent replication, which is the only thing that converts a vendor number into a trusted one.

So what would external validation look like? Three signals are worth tracking. First, an LMSYS Chatbot Arena submission, which surfaces human-preference Elo across an open population rather than a commissioned panel. Second, an Artificial Analysis evaluation — Microsoft already cites that platform for its MAI-Transcribe-1.5 word-error-rate placement , so its reasoning index is the natural cross-check here. Third, community-run SWE-Bench Pro submissions, where independent harnesses can confirm or undercut the 52.8% figure. As of June 7, 2026, none of these external data points has been published — the private-preview rollout limits who can run the model at all. Until they appear, the honest read is that MAI-Thinking-1's scores are credible-but-unverified.

The Cheaper Coding Sibling: MAI-Code-1-Flash Joins GitHub Copilot

MAI-Code-1-Flash is the only model in the new MAI family with confirmed general availability today — MAI-Thinking-1 stays invite-only in Foundry private preview, but its coding sibling is already shipping inside GitHub Copilot. It is a roughly 5B-active-parameter agentic coding model that Microsoft says solves equivalent tasks with up to 60% fewer tokens, positioning it as the lightweight, cost-sensitive option rather than a frontier reasoner . The rollout to Copilot Free, Student, Pro, Pro+, and Max users in VS Code began June 2, 2026, starting with a limited set of users and expanding over the following weeks .

The benchmark pitch is framed against Anthropic's small model, not its flagship. In Microsoft's own production harness, MAI-Code-1-Flash outperformed Claude Haiku 4.5 on SWE-Bench Verified, SWE-Bench Pro, SWE-Bench Multilingual, and Terminal Bench 2 — including 51.2% versus 35.2% on SWE-Bench Pro — and scored 85.8% adjusted accuracy on Microsoft's 186-question, 34-category adversarial reasoning benchmark . To Microsoft's credit, it published a named weakness: Einstellung-trap accuracy (resisting a familiar-but-wrong solution path) sits below 50% . These are vendor-run figures, so the same "awaiting external eyes" caveat applies — but the disclosed failure mode is more useful signal than a clean scorecard.

Pricing is the part developers can act on now. After included AI-credit allowances, GitHub lists MAI-Code-1-Flash as GA and lightweight at the following per-million-token rates :

Token typePrice per 1M tokens
Input$0.75
Cached input$0.075
Output$4.50

Combined with the 60% token-reduction claim, the effective per-task cost is the real argument here: a small model that finishes in fewer tokens compounds the headline rate into a meaningfully lower bill for high-volume agentic loops. For now it is the one MAI model you can route Copilot traffic to and measure yourself — which makes it the practical entry point to evaluate whether Microsoft's first-party stack holds up outside the keynote slides.

Visual, Voice, and Transcription: The Remaining MAI Models

MAI-Thinking-1 beats Anthropic's top model — per Microsoft

Beyond reasoning and coding, Microsoft shipped four multimodal models that extend the MAI stack into images, speech, and voice — and several are already in production surfaces rather than preview. MAI-Image-2.5 handles both text-to-image and image-to-image editing, and Microsoft reports it ranked No. 2 on Arena's image-editing leaderboard and No. 3 for text-to-image as of June 2026 . The same writeup claims a +75 overall Arena-score gain over MAI-Image-2, including +107 specifically for text rendering — the failure mode most image models struggle with .

Pricing is public and worth reading closely if you batch image generation. MAI-Image-2.5 costs $5 per 1M text-input tokens, $8 per 1M image-input tokens, and $47 per 1M image-output tokens; the Flash variant drops those to $1.75, $1.75, and $19.50 respectively . Microsoft Learn lists both image models at version 2026-06-02 for Global Standard deployment in West Central US, East US, West US, West Europe, Sweden Central, South India, and UAE North, with PNG output and a 1,048,576-pixel maximum output area . Both are live in PowerPoint and rolling out on OneDrive .

On the speech side, MAI-Transcribe-1.5 widens language coverage from 25 to 43 and claims best-in-class FLEURS word error rate, reporting a No. 3 placement on Artificial Analysis at 2.4% WER . It adds keyword and content biasing for domain terminology — claimed up to 30% WER reduction on FLEURS — and cites up to 5x faster long-audio transcription than named rivals, though both figures are vendor-reported .

MAI-Voice-2 rounds out the family as a text-to-speech model in Foundry, already being integrated into VS Code and Dynamics 365 Contact Center. It supports 15 languages, zero-shot voice prompting from 5–60 seconds of reference audio, emotion tags, and explicit consent guardrails, and Microsoft says raters preferred it over MAI-Voice-1 in 72% of side-by-side tests . A lower-cost MAI-Voice-2-Flash is listed as coming soon .

The connecting logic is commercial, not just technical. As Microsoft framed it in the Build keynote, the pitch is "market-leading quality per dollar," with the bet that "the RLEs and the models you build inside of them become your moat" — customers fine-tuning and owning behavior rather than renting shared intelligence . Published per-modality prices and embedded surfaces like PowerPoint make that argument testable in a way the reasoning flagship, still in private preview, currently is not.

What Microsoft Hasn't Published About MAI-Thinking-1 Yet

For developers evaluating MAI-Thinking-1, the largest unknowns are commercial, not architectural. Microsoft has not published a per-token price for the reasoning flagship, even though its two siblings shipped with full public price sheets dated June 2, 2026. MAI-Code-1-Flash lists at $0.75 input, $0.075 cached input, and $4.50 output per 1M tokens , and MAI-Image-2.5 publishes $5 text input, $8 image input, and $47 image output per 1M tokens . The model whose quality Microsoft is promoting hardest is the one whose cost remains undisclosed.

Access terms are equally thin. MAI-Thinking-1 is in private preview on Microsoft Foundry , but rate limits, concurrency caps, and SLA terms are not publicly documented, and Microsoft has not specified how a developer qualifies for an invitation. A MAI Playground is described as a "public preview coming soon" with no date attached, and the lower-cost MAI-Voice-2-Flash variant is also listed as "coming soon" without a timeline . No general-availability date has been confirmed, and final API terms are unset.

The benchmark picture carries the same caveat. The headline figures — 52.8% on SWE-Bench Pro, positioned against Claude Opus 4.6, plus 97.0% on AIME 2025 and 87.7% on LiveCodeBench v6 — are vendor-reported, including the Surge human-preference result over Claude Sonnet 4.6 across 1,276 tasks . None has been independently replicated yet. The gap between Microsoft's claimed SWE-Bench Pro parity and external confirmation is the single open question that matters most for anyone deciding whether to build on it.

The concrete takeaway: treat MAI-Code-1-Flash and MAI-Image-2.5 as evaluable today, because they ship with prices and benchmarks you can test. Treat MAI-Thinking-1 as a credible but unpriced preview — worth requesting access to and worth benchmarking yourself, but not yet something to commit a production roadmap to until cost, limits, and third-party scores exist.

Frequently asked questions

What is MAI-Thinking-1 and how does it compare to GPT-4o or Claude?

MAI-Thinking-1 is Microsoft's first home-grown flagship reasoning model: a sparse Mixture-of-Experts design with 35B active and roughly 1T total parameters, trained from scratch without distillation from other LLMs . Against GPT-4o (OpenAI) and Claude Sonnet 4.6 (Anthropic), it exposes a similar Chat Completions API surface, so integration cost is comparable. The real differences are ownership and data lineage: Microsoft's pitch is supplier independence and clean IP provenance — zero synthetic training data — rather than a decisive benchmark lead . Read it as a strategic alternative, not a drop-in replacement chosen on scores alone.

How do I get access to MAI-Thinking-1 today?

As of June 2026, MAI-Thinking-1 is in private preview on Microsoft Foundry, which requires an invitation . Developers who want earlier hands-on access can also reach it through third-party distribution: it is listed on OpenRouter, Fireworks AI, and Baseten . A MAI Playground public preview is described as "coming soon," but Microsoft has not confirmed a date, and general availability terms are still unset .

Is MAI-Code-1-Flash available in Copilot right now?

Yes. MAI-Code-1-Flash began rolling out to GitHub Copilot on June 2, 2026, reaching Free, Student, Pro, Pro+, and Max users, starting with a limited set of VS Code users and expanding over the following weeks . GitHub's pricing page lists it as generally available and lightweight, charging (after included AI-credit allowances) $0.75 per 1M input tokens, $0.075 cached input, and $4.50 per 1M output tokens . Unlike the flagship reasoning model, it ships with prices and benchmarks you can evaluate today.

Should I trust Microsoft's MAI-Thinking-1 benchmark scores?

Treat them as a strong prior, not a verdict. The headline figures — 52.8% on SWE-Bench Pro, 97.0% on AIME 2025, and 87.7% on LiveCodeBench v6 — all come from Microsoft's own harness, and the human-preference result over Claude Sonnet 4.6 came from a Microsoft-commissioned Surge rater study across 1,276 tasks . As of June 7, 2026, no independent LMSYS, Artificial Analysis, or community SWE-Bench submissions exist for the model. Watch those leaderboards before committing a production decision to the vendor-reported numbers.

What does 'zero distillation' mean and why does it matter for enterprise use?

"Zero distillation" means Microsoft excluded all synthetic, LLM-generated text from pretraining. All 30T pretraining tokens (plus 3.55T mid-training tokens) come from publicly available and licensed human-generated sources — web, public GitHub code, books, academic papers, news, and multilingual text . The practical implication is provenance: no inherited copyright or IP exposure from another model's outputs . For regulated verticals such as healthcare, finance, and legal, that clean data lineage reduces one category of compliance and IP risk that distilled models can carry.