Distillation went mainstream. Then came the IP dispute.

R1's MIT license made open-source distillation legal. V4's OPD is how it works now. The IP dispute explains the limits.


"Distillation" has gone from a research footnote to the default way teams ship small, capable models — and now to the center of a US-China policy fight. But before the dispute, it helps to separate what is documented from what is just circulating.

What knowledge distillation is and where the open-deepthink story falls apart

Knowledge distillation is a training technique in which a smaller "student" model learns from a larger "teacher" model's output distribution, compressing much of the teacher's capability without rerunning frontier-scale pre-training. Instead of training the student against raw labels alone, you train it to mimic the teacher's probabilities (or generated responses), which transfers reasoning behavior at a fraction of the compute. Red Hat's January 2026 review of open-source AI confirms that distillation has moved from research curiosity to a mainstream efficiency lever for fine-tuning student models on teacher outputs.
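To make "mimic the teacher's probabilities" concrete, here is a minimal sketch of the classic soft-target loss, using only the standard library. The temperature value and the example logits are illustrative assumptions, not taken from any DeepSeek recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]  # hypothetical teacher logits for one token position
print(distillation_loss([4.0, 1.0, 0.5], teacher))      # matching student: zero loss
print(distillation_loss([0.5, 1.0, 4.0], teacher) > 0)  # mismatched student: positive loss
```

In a real pipeline this term is usually averaged over token positions and often mixed with an ordinary label loss; the sketch isolates the teacher-matching part.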

The "open-deepthink shipped a full knowledge distillation mode" claim, however, does not hold up. There is no GitHub release, official changelog, or credible technical press supporting it. The one real codebase matching the name, Astrodevil/Open-DeepThink-Researcher, is an MIT-licensed web-research agent built on DeepSeek-R1 and the Exa API, with only about 7 commits on main, no version tags, and no published releases. Its README mentions no distillation mode and no five-month roadmap. Treat the headline as unsubstantiated.

What is documented is the milestone the story rides on. On January 20, 2025, DeepSeek released DeepSeek-R1 under the MIT License, explicitly permitting distillation and commercialization, and shipped six smaller distilled variants the same day. API pricing for the reasoner ran roughly $0.14 (cache hit) to $0.55 (cache miss) per million input tokens and $2.19 per million output tokens. That is the real story, and the next sections trace where it led.

R1's MIT terms: what practitioners can legitimately distill and commercialize

Under the MIT License, distilling R1 into your own student models is legally clean: the license grants the right to "use, copy, modify, merge, publish, distribute, sublicense, and/or sell" the software without restriction. There is no non-compete carve-out and no clause reserving derivative reasoning models for the licensor. A model you train on R1 outputs (the distilled derivative) falls inside that grant, so you can commercialize it the same way you would any MIT-covered artifact. DeepSeek made the intent explicit when it shipped six distilled variants alongside R1 on January 20, 2025.

The contrast with closed teachers is where the same technical act diverges legally. OpenAI's and Anthropic's terms of service explicitly prohibit using API outputs to train systems that replicate or compete with their services. Generating teacher tokens and fitting a student to them is identical engineering in both cases, but distilling from R1 is a granted right, while distilling from a closed API is a contract breach the vendor can litigate. OpenAI has already accused DeepSeek of malpractice over distillation ahead of model launches, which underscores that provenance, not method, determines exposure.

"Permission is hereby granted, free of charge, to any person obtaining a copy of this software... to deal in the Software without restriction" — the MIT License grant DeepSeek attached to R1 (source: DeepSeek-R1, GitHub).

A short compliance checklist keeps distilled releases defensible:

  • Retain the MIT header. The license requires the copyright and permission notice to travel with the work; ship it in your distilled artifact's repo and model card.
  • Document the teacher-student relationship. State in release notes that the student was distilled from R1, including the date and source repo, so downstream users can audit the lineage.
  • Don't misrepresent provenance. Claiming a clean-room model when outputs came from R1 — or vice versa — invites both license and IP disputes you could have avoided.
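The checklist above can be partly automated at release time. The sketch below checks that the MIT permission notice is present and builds a lineage record for the model card; the field names, the student name, and the placeholder values are all hypothetical and should be adapted to your own release process.

```python
import json

MIT_MARKER = "Permission is hereby granted, free of charge"

def has_mit_notice(license_text):
    """True if the text contains the opening clause of the MIT permission notice."""
    # Collapse whitespace so a hard-wrapped LICENSE file still matches.
    return MIT_MARKER in " ".join(license_text.split())

def provenance_record(student_name, teacher_name, teacher_repo, distilled_on):
    """Build a lineage entry for a model card (field names are illustrative)."""
    return {
        "student": student_name,
        "distilled_from": teacher_name,
        "teacher_source": teacher_repo,
        "teacher_license": "MIT",
        "distilled_on": distilled_on,
    }

license_text = """MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files ..."""

record = provenance_record(
    student_name="my-student-7b",                 # hypothetical model name
    teacher_name="DeepSeek-R1",
    teacher_repo="<teacher repository URL>",      # fill in the repo you actually used
    distilled_on="<date of the distillation run>",
)
assert has_mit_notice(license_text), "ship the MIT notice with the distilled artifact"
print(json.dumps(record, indent=2))
```

Running a check like this in CI keeps the notice and the lineage statement from silently dropping out of a release.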

R1's distilled LLMs, from 1.5B to 70B: a size-to-accuracy map

At launch, DeepSeek shipped six distilled models alongside R1 itself: variants at 1.5B, 7B, 8B, 14B, 32B, and 70B parameters, all released on January 20, 2025, all derived from the same R1 teacher, and all carrying the MIT license. That spread is the practical payoff of the distillation pipeline: a single reasoning teacher compressed into a ladder of student sizes, so you pick the smallest model that clears your accuracy bar instead of paying for headroom you don't use.

The headline accuracy claim sits at the top of the ladder. DeepSeek reported that the 32B and 70B distilled variants perform on par with OpenAI's o1-mini on reasoning benchmarks, with R1 itself positioned against OpenAI o1 on math, code, and reasoning tasks. The benchmark suite cited for the comparison covers AIME 2024 (competition math), MATH-500, and LiveCodeBench (code generation), the standard reasoning triad. Treat "on par with o1-mini" as a vendor-reported result, not an independent verdict: it tells you the 32B/70B tier is the one to test first when you need o1-mini-class reasoning, not that every task will replicate the leaderboard.

| Distilled variant | License | Reasoning benchmarks | Practical tier |
|---|---|---|---|
| R1-Distill-1.5B | MIT | Entry reasoning | On-device / edge prototyping |
| R1-Distill-7B | MIT | Mid reasoning | Single-GPU apps, batch tasks |
| R1-Distill-8B | MIT | Mid reasoning | Single-GPU, Llama-base tooling |
| R1-Distill-14B | MIT | Upper-mid reasoning | Cost/accuracy balance |
| R1-Distill-32B | MIT | Reported on par with o1-mini | Production reasoning, self-hosted |
| R1-Distill-70B | MIT | Reported on par with o1-mini | Top accuracy, multi-GPU serving |

The economics shift once you self-host. DeepSeek's hosted R1 reasoner was priced around $0.14 per million input tokens on a cache hit and $0.55 per million on a cache miss, with output around $2.19 per million tokens. Those are recurring per-token costs that scale with usage. The MIT-licensed distilled weights invert that model: you download once, then pay only for the GPUs you run them on, with no marginal per-token fee. For high-volume workloads, a self-hosted 32B can undercut a per-token API once volume is large enough to amortize the hardware. The trade is upfront infrastructure and serving complexity against open-ended API spend.
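A rough break-even calculation makes the trade concrete. The API prices below come from the figures above; the GPU count, the $2.00 hourly rate, and the one-to-one input/output ratio are hypothetical assumptions you should replace with your own quotes and traffic profile.

```python
API_INPUT_PER_M = 0.55   # USD per million input tokens (cache miss), from the text
API_OUTPUT_PER_M = 2.19  # USD per million output tokens, from the text

def api_cost(input_tokens_m, output_tokens_m):
    """Monthly metered cost in USD for token volumes given in millions."""
    return input_tokens_m * API_INPUT_PER_M + output_tokens_m * API_OUTPUT_PER_M

def self_host_cost(gpu_hourly_usd, gpus, hours=730):
    """Monthly fixed cost of always-on GPUs; independent of token volume."""
    return gpu_hourly_usd * gpus * hours

def break_even_output_m(gpu_hourly_usd, gpus, input_per_output=1.0):
    """Output volume (millions of tokens per month) at which both costs match."""
    per_output_m = API_OUTPUT_PER_M + input_per_output * API_INPUT_PER_M
    return self_host_cost(gpu_hourly_usd, gpus) / per_output_m

# Hypothetical deployment: 2 GPUs at $2.00/hour each.
fixed = self_host_cost(2.00, 2)
print(f"self-host: ${fixed:,.0f}/month")
print(f"break-even: {break_even_output_m(2.00, 2):,.0f}M output tokens/month")
```

The sketch ignores engineering time and does not check whether the assumed GPUs can actually serve the break-even volume, so treat the result as a lower bound on the traffic you need before self-hosting pays off.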

The decision rule that falls out of this map: start at the smallest tier that meets your accuracy bar, validate on your own eval set rather than the published benchmarks, and only climb to 32B/70B when the smaller students miss. Because all six share the same license and teacher, moving up or down the ladder is a swap, not a re-architecture.
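The decision rule reduces to a short loop over the ladder. In this sketch the tier sizes come from the list above, while the scores are placeholder numbers for illustration only, standing in for results from your own eval set.

```python
TIERS_B = [1.5, 7, 8, 14, 32, 70]  # distilled sizes named in the text, in billions

def smallest_passing_tier(eval_scores, accuracy_bar):
    """Return the smallest tier whose score on your own eval meets the bar, else None."""
    for size in TIERS_B:
        score = eval_scores.get(size)
        if score is not None and score >= accuracy_bar:
            return size
    return None

# Placeholder scores, not measured results; replace with your own eval numbers.
my_scores = {1.5: 0.41, 7: 0.63, 8: 0.66, 14: 0.74, 32: 0.81, 70: 0.84}
print(smallest_passing_tier(my_scores, accuracy_bar=0.70))  # 14
print(smallest_passing_tier(my_scores, accuracy_bar=0.90))  # None
```

A `None` result is the signal that no distilled tier clears the bar and the full teacher, or a different approach, is needed.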

On-Policy Distillation in V4: iterative student-teacher correction vs. static KD

On-Policy Distillation (OPD) is a training method in which the student model first generates its own candidate responses and then consults multiple teacher models to correct them, rather than passively imitating pre-recorded teacher outputs. DeepSeek used it to train DeepSeek V4, launched on April 24, 2026. The shift matters because it changes which error modes the student can actually fix.

Classic offline knowledge distillation trains the student on a fixed corpus of pre-generated teacher outputs. It is one-pass and cheap: you sample the teacher once, then fit the student to that frozen distribution. The weakness is structural. Because the training signal is static, the student never sees corrections for the mistakes it makes on its own — when teacher coverage is sparse, the student's characteristic failure modes go unobserved and unpenalized. This is the regime the six R1 distilled checkpoints were trained under, and it is why a student can match a teacher on covered distributions yet drift on out-of-distribution prompts.

OPD inverts the data flow. The student generates first; multiple teacher LLMs then evaluate and correct those specific responses. The corrective signal is keyed to the student's actual output distribution: its own mistakes, not the teacher's hypothetical ones. In effect the student co-authors its training set, because each round targets the gaps the student currently has, which is what "on-policy" denotes. The trade-off is cost. Offline KD samples the teacher once; OPD requires repeated student generation plus teacher evaluation across iterations, so it spends more compute per step in exchange for a signal that is far better aligned to where the student is weak.

The two approaches sit at different points on a familiar curve:

  • Static (offline) KD — student fits frozen teacher outputs. One pass, low cost, no self-correction. Good when teacher coverage is dense and the deployment distribution is predictable.
  • On-Policy Distillation — student generates, teachers correct iteratively. Higher cost, signal aligned to the student's own error modes. Better for reasoning-intensive tasks where blind spots cluster off the teacher's sampled distribution.
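The on-policy loop can be shown on a toy problem: a student that is a single categorical distribution over four tokens. This is a minimal sketch of the general pattern (sample from the student, score the samples with teachers, update), not DeepSeek's V4 recipe. How V4 combines its teachers is not described here, so averaging the teachers' probabilities is an assumption, as are the learning rate, batch size, and teacher distributions.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student, teacher):
    """KL(student || teacher): penalizes mass the student puts where teachers do not."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def on_policy_distill(teachers, steps=300, batch=64, lr=0.5, seed=0):
    rng = random.Random(seed)
    vocab = len(teachers[0])
    # Assumption: the teacher panel is merged by averaging probabilities.
    target = [sum(t[k] for t in teachers) / len(teachers) for k in range(vocab)]
    logits = [0.0] * vocab  # the student starts uniform
    for _ in range(steps):
        probs = softmax(logits)
        # 1) The student generates: samples come from its own current policy.
        samples = rng.choices(range(vocab), weights=probs, k=batch)
        # 2) Teachers score those samples; the log-ratio is the per-sample correction.
        costs = [math.log(probs[y]) - math.log(target[y]) for y in samples]
        baseline = sum(costs) / batch  # batch-mean baseline for variance reduction
        # 3) Score-function estimate of the reverse-KL gradient w.r.t. the logits.
        grad = [0.0] * vocab
        for y, cost in zip(samples, costs):
            for k in range(vocab):
                indicator = 1.0 if k == y else 0.0
                grad[k] += (cost - baseline) * (indicator - probs[k]) / batch
        logits = [w - lr * g for w, g in zip(logits, grad)]
    return softmax(logits), target

teachers = [[0.70, 0.20, 0.05, 0.05], [0.60, 0.30, 0.05, 0.05]]  # hypothetical teachers
student, target = on_policy_distill(teachers)
print(reverse_kl([0.25] * 4, target), "->", reverse_kl(student, target))
```

The static alternative would draw `samples` from `target` once, before the loop, and never resample. The on-policy version pays for fresh student samples and teacher scoring on every step, which is the cost difference the two bullets describe.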

The reported outcome is incremental, not a leap. Asia Times frames V4 as narrowing the gap on reasoning while still trailing frontier models such as GPT-5.4 by roughly three to six months. The honest read: OPD accelerates convergence toward the teachers' competence but does not guarantee parity with the frontier. In the Asia Times report on the V4 debut, the model is positioned as "competitive on reasoning" while remaining months behind the leading systems: a gain in efficiency, not a claim of equivalence (source: Asia Times, 2026-04).

For builders, the practical signal is that the corrective-loop pattern is what pushed V4's reasoning forward — and reproducing it on your own student means budgeting for iterative generation and a panel of teacher evaluators, not a single sampling pass.

Permissive vs. proprietary distillation: MIT, Apache-2.0, and ToS-blocked teachers

The legal surface area of a distillation project is set by the teacher model's license, not by the technique you use. MIT-licensed teachers like DeepSeek-R1 are the cleanest case: distillation, commercialization, and redistribution are all explicitly permitted, which is exactly why R1 became the reference teacher for open derivative products after its release on January 20, 2025. Apache-2.0 teachers add a patent grant but stay permissive; ToS-restricted API outputs (OpenAI, Anthropic) sit at the opposite end, where training a competing model is contractually prohibited and increasingly litigated.

The practical trap is the gap between a license badge and a model's actual terms. Meta's Llama weights, for instance, carry supplemental usage conditions — including a cap that triggers for products above 700 million monthly active users — layered on top of an otherwise permissive grant. Read the model card and the accompanying use policy, not just the SPDX tag, before you commit a teacher to a shipping product.

| Teacher source | Distillation permitted? | Commercial use | Redistribution |
|---|---|---|---|
| MIT weights (DeepSeek-R1) | Yes, explicit | Yes | Yes |
| Apache-2.0 weights (Mistral 7B, Mixtral 8x7B, some Qwen variants) | Yes, with patent grant | Yes | Yes |
| Meta Llama (community license) | Yes, below MAU cap | Yes, <700M MAU; check card | Conditional |
| Closed APIs (OpenAI, Anthropic) | No; ToS bars training competitors | Output reuse contested | No |
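The matrix above can be encoded as a pre-flight check that runs before any teacher is wired into a training pipeline. The category strings and return labels in this sketch are hypothetical names chosen for illustration, and the function is a coarse screen rather than legal advice.

```python
LLAMA_MAU_CAP = 700_000_000  # monthly-active-user threshold cited in the text

def distillation_risk(teacher_terms, monthly_active_users=0):
    """Map a teacher's terms to a coarse go/no-go label."""
    if teacher_terms in ("MIT", "Apache-2.0"):
        return "permitted"
    if teacher_terms == "llama-community":
        if monthly_active_users <= LLAMA_MAU_CAP:
            return "permitted-with-conditions"
        return "needs-separate-license"
    if teacher_terms == "closed-api-tos":
        return "prohibited-by-terms"
    # Anything unrecognized fails closed: read the actual license first.
    return "unknown-read-the-license"

print(distillation_risk("MIT"))
print(distillation_risk("llama-community", monthly_active_users=1_000_000))
print(distillation_risk("closed-api-tos"))
```

Failing closed on unrecognized terms reflects the point about license badges: a familiar SPDX tag is not proof that the model card carries no extra conditions.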

Two builder takeaways follow. First, the ToS restriction on closed APIs is not limited to commercial launches: the standard "you may not use outputs to train competing models" clause creates legal exposure even for non-commercial fine-tuning, because the violation lies in how the data was obtained, not in whether you sell the result. Second, the open path is now well-trodden. Red Hat's January 2026 review treats distillation as a mainstream efficiency lever, and R1's MIT terms specifically permit distillation and commercialization, giving permissively licensed teachers the lowest legal risk for derivative products. When the teacher's license is permissive, your only real engineering question is quality; when the teacher is a closed API, the license question can end the project before the first training run.

The US-China IP dispute over distillation: OSTP, MATCH Act, and what follows

Distillation that R1's license plainly permits becomes a geopolitical flashpoint when the teacher is a closed US model. On April 23, 2026, the White House Office of Science and Technology Policy publicly characterized Chinese distillation as coordinated extraction at scale. That framing matters for builders because it shifts distillation from a licensing question into an export-control and national-security question, one that can reshape which teachers you are allowed to call regardless of any terms of service.

"Industrial-scale campaigns" using "tens of thousands of proxy accounts" to extract outputs from US models and strip safety guardrails — White House Office of Science and Technology Policy, characterizing Chinese distillation (source: Asia Times, 2026-04).

The policy moves arrived in a cluster. A US House Select Committee on China hearing on April 16, 2026 examined the issue, and the proposed MATCH Act, which targets Chinese access to ASML lithography, has been framed alongside distillation as a second, dual-use capability-extraction vector. The logic linking them is that both hardware fabrication and model outputs are treated as restricted capabilities that adversaries try to acquire indirectly: one through chip-making tools, the other through API queries. The OSTP statement landed one day before DeepSeek launched V4 on April 24, 2026, timing that looks unlikely to be coincidental.

Underneath the policy layer sits a private-sector dispute. OpenAI has formally accused DeepSeek of distillation malpractice ahead of multiple V4-era releases, alleging its outputs were used as teacher signal in violation of terms of service. What makes this consequential is the combination: a contractual ToS-enforcement claim is now carrying geopolitical weight, and the two reinforce each other. A ToS breach that might otherwise be a commercial arbitration becomes evidence in a national-security narrative, and the national-security narrative gives regulators a reason to codify what was previously a private contract term.

For practitioners, the takeaway is asymmetry of risk. Distilling from MIT-licensed open weights carries quality risk only; distilling from a closed US API now carries contractual, reputational, and potentially regulatory exposure that could harden into formal export controls. The legal status of ToS-based distillation remains contested rather than settled — which means the safest engineering decision and the safest legal decision currently point to the same place: permissively licensed teachers.

When to distill from R1: practitioner guidance on permissive vs. restricted teachers

For most builders, the permissive path is the one to take: distill from an MIT- or Apache-2.0-licensed teacher, retain the original license header in your derivative, and document the teacher-to-student relationship in your model card. DeepSeek-R1 shipped under the MIT License on January 20, 2025, with terms that explicitly permit distillation and commercialization. That makes the workflow legally clear, commercially viable, and reproducible: anyone auditing your pipeline can trace the provenance and verify the license terms themselves.

The restricted path is using a closed API's output as a training signal. It is technically feasible and often tempting, but the terms of service of providers like OpenAI prohibit it, and the legal status of ToS-based distillation remains contested rather than settled. Geopolitics now compounds the contractual risk: the White House Office of Science and Technology Policy in April 2026 characterized cross-border distillation as "industrial-scale campaigns" extracting outputs from US models. For any pipeline touching restricted teachers, especially across borders, that exposure is real and hardening.

On technique, match the method to the task. On-Policy Distillation, where the student generates responses and then consults teacher models to correct them (the approach behind DeepSeek V4's April 24, 2026 release), pays off on reasoning-intensive workloads. For commodity factual fine-tuning, static offline knowledge distillation remains sufficient and cheaper to operate; Red Hat's January 2026 review describes it as a mainstream efficiency lever rather than a research curiosity.

The concrete takeaway: default to R1 or another permissively licensed teacher, reserve on-policy correction for reasoning tasks that justify its cost, and keep closed-API outputs out of your training data until the legal picture settles. The cheapest engineering choice and the safest legal one currently coincide — build accordingly.

Frequently asked questions

Can I legally distill DeepSeek-R1 and commercialize the result?

Yes. DeepSeek-R1 was released under the MIT License on January 20, 2025, and the license explicitly permits distillation, modification, redistribution, and commercialization of derivatives. The one obligation that matters in practice: retain the original MIT copyright and permission header in your release so downstream users inherit the same terms.

What is On-Policy Distillation and how does it differ from standard KD?

Standard knowledge distillation trains a student on fixed, pre-generated teacher outputs: the teacher labels a dataset once and the student fits to it. On-Policy Distillation (OPD) inverts the order: the student generates its own candidate responses first, then one or more teacher models evaluate and correct them, which focuses learning on the student's actual error distribution. DeepSeek used OPD to train DeepSeek V4, launched April 24, 2026.

Can I use OpenAI or Anthropic API outputs to distill a smaller LLM?

No, not without legal exposure. Both providers' terms of service explicitly prohibit using API outputs to train systems that replicate or compete with their services, and OpenAI has publicly accused DeepSeek of malpractice over exactly this kind of distillation. MIT-licensed open weights such as R1 carry no equivalent restriction, which is why permissively licensed teachers remain the defensible path.

Why is the US government raising concerns about Chinese AI distillation?

On April 23, 2026, the White House Office of Science and Technology Policy characterized Chinese distillation as "industrial-scale campaigns" that use tens of thousands of proxy accounts to extract outputs from US models and strip their safety guardrails. A US House Select Committee on China hearing on April 16, 2026, and the proposed MATCH Act targeting Chinese access to ASML lithography belong to the same cluster of policy moves.

Which open-weight teachers can I distill from commercially in 2026?

MIT-licensed teachers are the cleanest option: DeepSeek R1 and its distilled variants permit commercial distillation outright. Apache-2.0 weights such as Mistral 7B and Mixtral 8x7B are similarly permissive and add a patent grant. For models under custom licenses, check the model-specific terms; Meta's Llama community license, for example, adds a monthly-active-user cap. Off-limits commercially: any output derived from OpenAI or Anthropic API calls, whose terms forbid competitive training.