NVIDIA's 550B finally lands: free to use, expensive to host

Nemotron 3 Ultra, 550B MoE (June 4 2026): hardware minimums, hosted API quickstart, NIM steps, benchmark check.

NVIDIA's 550B finally lands: free to use, expensive to host
Share

NVIDIA shipped its biggest open-weight model yet, and the weights are free to download — but standing it up yourself is a datacenter project, not a weekend one.

Nemotron 3 Ultra: What Landed on June 4

Nemotron 3 Ultra is a text-only, open-weight Mixture-of-Experts model with 550 billion total parameters that activate 55 billion per token (the "550B-A55B" label), released on Hugging Face on June 4, 2026 . It is not a standard transformer stack: it uses a hybrid Mamba-Attention design with "LatentMoE" routing — 108 layers, model dimension 8192, and 512 experts per layer with top-k 22 activated .

The license matters as much as the architecture. Ultra ships under the Linux Foundation's permissive OpenMDW-1.1, covering weights, architecture, training data, and recipes — with no output restrictions and no usage-fee clauses . Four checkpoints are published: BF16 base, BF16 post-trained, NVFP4 post-trained, and a GenRM reward model for RLHF . NVFP4 is the practical self-hosting path.

It was pre-trained on 20T tokens (data cutoff September 2025; post-training cutoff May 2026), with native Multi-Token Prediction layers for speculative decoding and a 1,048,576-token context window — though that 1M window comes with serving conditions we cover in the gotchas .

Before You Pull the Container

NVIDIA's 550B finally lands: free to use, expensive to host

This is datacenter-class hardware, not a workstation download. There is no single-consumer-GPU path to self-hosting Nemotron 3 Ultra, even at 4-bit. The BF16 profile needs 16× H100 across two nodes or 8× H200 in a single node, while the smaller NVFP4 footprint still demands 8× H100, 4–8× H200, or 2× B200/GB300 . Counts are profile-dependent, so verify your exact case against the NIM support matrix before ordering anything.

PrecisionMinimum GPUsDisk for model cache
BF1616× H100 (two nodes) or 8× H200~1.1–1.7 TB
NVFP48× H100, 4–8× H200, or 2× B200/GB300~330 GB

The toolchain has hard floors: Ubuntu 22.04 LTS or newer, NVIDIA Container Toolkit 1.14.0+, CUDA SDK 12.9+, driver 580+, and Docker 24.0+ . Budget disk before the first pull: the container image alone is ~38 GB on top of the model cache above. Provision a dedicated persistent volume mounted at /opt/nim/.cache so repeated runs don't re-download hundreds of gigabytes .

Quickstart: Hosted and Self-Hosted NIM

NVIDIA's 550B finally lands: free to use, expensive to host

If you want to call Nemotron 3 Ultra today without provisioning a single GPU, use NVIDIA's hosted NIM API. It speaks the OpenAI protocol, so the only changes from a standard client are the base URL and the model name. Generate an API key at build.nvidia.com, then point the OpenAI SDK at https://integrate.api.nvidia.com/v1 and call nvidia/nemotron-3-ultra-550b-a55b .

pip install openai

from openai import OpenAI
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="$NVIDIA_API_KEY",
)
resp = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    messages=[{"role": "user", "content": "Plan a 3-step refactor."}],
    temperature=1, top_p=0.95, max_tokens=16384,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "thinking": {"type": "enabled", "budget_tokens": 10000},
    },
)
print(resp.choices[0].message.content)

Temperature 1, top_p 0.95, and max_tokens 16384 are NVIDIA's reference settings, and reasoning is toggled through extra_body rather than a top-level field . The budget_tokens value caps how much the model "thinks" before answering — your lever for trading latency against depth.

Self-hosting follows two steps once the hardware and disk from the previous section are in place. First, set up access: join the NVIDIA Developer Program, generate an NGC Personal API Key, accept the model's NGC terms, then authenticate Docker with the literal username $oauthtoken and your NGC key as the password :

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
docker pull nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant

docker run --gpus=all -e NGC_API_KEY \
  -v /opt/nim/.cache:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-ultra-550b-a55b:2.0.5-variant \
  --reasoning-parser nemotron_v3

NIM 2.0.5 is built on vLLM 0.20.2 . The --reasoning-parser nemotron_v3 flag is what enables the thinking controls server-side; without it, enable_thinking is ignored. Tool calling needs two more launch flags: --enable-auto-tool-choice and --tool-call-parser qwen3_coder . One sharp edge: the container does not connect to MCP servers itself, so if you run an MCP-based agent you must convert MCP tool schemas to OpenAI tool format in your client before the request reaches NIM.

Gotchas: What the Docs Don't Say

NVIDIA's 550B finally lands: free to use, expensive to host

Beyond the launch flags, four discrepancies between NVIDIA's marketing and its own technical docs will bite you in production. Start with the parameter count: Hugging Face file metadata lists the BF16 artifact at 561B parameters , not the marketed 550B-A55B. It is not a red flag — but when you size hardware or cite specs, use the model card figure, not the press-release number.

The context window is the second trap. NIM's native serving limit is 262,144 tokens; the advertised 1,048,576-token (1M) context requires launching with VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and --max-model-len 1048576 . NVIDIA explicitly warns this reduces concurrency and should be quality-validated — so test it under your real request mix before promising 1M context to anyone.

Third, reasoning budget has a measured quality cliff. The NIM 2.0.5 release notes report MTP delivers ~17% higher chat and ~26% higher SWE throughput versus open-source vLLM, but with a roughly 5-percentage-point GPQA accuracy drop at maximum reasoning budget . Cap thinking_token_budget in production rather than letting the model think unboundedly.

Finally, treat the headline numbers skeptically. Independent testing matters here. As Artificial Analysis put it in its launch review:

"Nemotron 3 Ultra scores 47.7 on the Artificial Analysis Intelligence Index — the leading US open-weights model, ahead of gpt-oss-120b, but still behind Kimi K2.6 at 53.9," — Artificial Analysis (source: Artificial Analysis).

NVIDIA's "5.9× throughput" claim, meanwhile, is an internal measurement at 8k-input/64k-output against GLM-5.1 — favorable settings, not independently replicated. Benchmark your own workload before trusting it.

Where to Go From Here

Once Ultra is serving, the next step is shaping it to your workload. For fine-tuning, pull nvcr.io/nvidia/nemo-automodel:26.04.00 and run the NeMo Automodel LoRA PEFT recipe, validated on 4× GB200 (16 GPUs), with a documented H100-oriented 4×8 variant at ep_size 32 . If you are doing preference alignment, the published GenRM reward-model checkpoint drops straight into RLHF post-training via NeMo Megatron Bridge or the GRPO and GRPO-LoRA recipes in github.com/NVIDIA-NeMo/Nemotron .

For agents, CrewAI, LangChain Deep Agents, Hermes Agent, and OpenCode all wire in through the OpenAI-compatible chat interface — swap base_url and model, no further SDK changes . The takeaway: start on the hosted endpoint, prove the workload, then decide whether self-hosting's six-figure GPU bill is worth owning the weights.

Frequently asked questions

What is the minimum hardware to self-host Nemotron 3 Ultra?

The practical path is the NVFP4 checkpoint, which the NIM support matrix lists at 8× H100, 4–8× H200, or as few as 2× B200/GB300 . BF16 is far heavier — 16× H100 (two nodes) or 8× H200 . Requirements are profile-dependent and differ by precision and card generation, so treat the support matrix plus your chosen precision as authoritative rather than the looser figures on the model card.

Is the hosted API free, and what are the rate limits?

build.nvidia.com exposes a Free Endpoint tier alongside a Partner Endpoint and a Download option , so you can prototype without a GPU bill. Exact quotas and per-token pricing are not documented in the reviewed sources — they are endpoint-specific and largely unpublished. For production throughput, check the Partner Endpoint directly; do not assume the free tier's limits will hold under sustained agentic load.

How does the 1M-token context window actually work in NIM?

NIM's native context is 262,144 tokens by default; reaching the full 1,048,576-token window requires setting VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and launching with --max-model-len 1048576 . NVIDIA warns this reduces concurrency and recommends quality-validating long-context output before production use. The model itself reports ~94.7% on RULER at 1M , but serving capacity, not model capability, is the real constraint here.

Does it support structured output and tool calling out of the box?

Both work, but neither is on by default. Tool calling with tool_choice auto needs --enable-auto-tool-choice and --tool-call-parser qwen3_coder at NIM launch . Structured output uses response_format type json_object with schema validation handled on the client side . There is no native MCP connector — the NIM container will not reach your MCP servers, so your client must convert MCP schemas into OpenAI-style tools first.

How does Nemotron 3 Ultra compare to Kimi K2 and Qwen 3 on independent evals?

On Artificial Analysis's Intelligence Index, Ultra scored 47.7 — the leading US open-weights model, ahead of Nemotron 3 Super (36.0) and gpt-oss-120b (33.3), but behind Kimi K2.6 at 53.9 . So it tops US-hosted publicly available weights while trailing the current best open-weight globally. NVIDIA's own throughput claims (e.g. 5.9× versus GLM-5.1) rely on specific 8k-input/64k-output settings — treat those as directional until independently replicated.