The parity fix that quietly resets your profiling baseline

llama.cpp b9437: -fa auto added to llama-bench, -ngl default flips to -1. What changes and who's affected.

Jun 12, 2026

If you profile models with llama-bench, a small May 2026 build quietly changed two of its defaults — enough to make fresh numbers stop lining up with your old ones.

What Is b9437 and What Does It Add?

b9437 is a single-PR incremental tag of ggml-org/llama.cpp, published on May 30, 2026, not a milestone release. llama.cpp does not ship curated semantic versions; it cuts a fresh tagged build for essentially every merged commit on master, producing roughly 1–3 tags per day. So b9437's effective changelog is the title of the one PR it contains.

Quick Answer: b9437 is a per-commit llama.cpp tag from May 30, 2026 containing only PR #23714, "Support -fa auto in llama-bench." It changes two defaults in the benchmark harness — Flash Attention and GPU-layer offload — touching just two files (108 additions, 51 deletions). It alters no inference runtime, model format, or API.

The entire change is PR #23714, "Support -fa auto in llama-bench," authored by gaugarg-nv and merged after two approvals and 27 passing checks. It touched only two files under tools/llama-bench, llama-bench.cpp and its README.md, for 108 additions and 51 deletions.

The net effect is two convention changes inside the llama-bench harness: a new Flash Attention default and a reinterpreted GPU-layer offload default. There is zero impact on the inference runtime, model formats, quantization, or any API surface. Note the cadence, too: at the time of writing, the releases index listed b9611 (Jun 12, 2026) as Latest, leaving b9437 well behind and best treated as a historical waypoint.

From Boolean to Enum: The `-fa` Parameter Refactor

The core of b9437 is that llama-bench's -fa/--flash-attn flag stopped being a two-state boolean and became a three-state setting accepting on, off, or auto, with auto now the default. Before this change the harness could only force Flash Attention on or off, even though llama-server and llama-cli already exposed an auto mode that decides at runtime whether to enable it based on the model and backend. PR #23714 closes that gap.

Under the hood the flash_attn parameter changed type from a plain bool to the llama_flash_attn_type enum. In b9437 that enum carries three named values: LLAMA_FLASH_ATTN_TYPE_AUTO = -1, LLAMA_FLASH_ATTN_TYPE_DISABLED = 0, and LLAMA_FLASH_ATTN_TYPE_ENABLED = 1. When you pass auto, the value maps into the context parameters by enabling Flash Attention unless explicitly disabled and marking auto_fa true, so the actual decision is delegated to the runtime's auto-selection path rather than a hard flag.
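As a rough illustration of the three-state setting, the sketch below mirrors the enum values quoted above in Python. The class name `FlashAttnType`, the helper `parse_fa`, and the lookup table are hypothetical names invented for this example; they are not part of llama.cpp, and the real parsing lives in llama-bench.cpp.

```python
from enum import IntEnum


class FlashAttnType(IntEnum):
    """Python mirror of the three values quoted above for llama_flash_attn_type."""
    AUTO = -1
    DISABLED = 0
    ENABLED = 1


# Hypothetical lookup from the CLI spellings to the enum values.
_CLI_VALUES = {
    "auto": FlashAttnType.AUTO,
    "off": FlashAttnType.DISABLED,
    "on": FlashAttnType.ENABLED,
}


def parse_fa(value: str) -> FlashAttnType:
    """Translate an -fa argument string into the enum, rejecting unknown spellings."""
    try:
        return _CLI_VALUES[value.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported -fa value: {value!r}") from None


for spelling in ("on", "off", "auto"):
    member = parse_fa(spelling)
    print(f"-fa {spelling:<4} -> {member.name} ({int(member)})")
```

Keeping the integer values identical to the enum means the same class can later be used to decode the numbers that appear in exported rows.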

The motivation was practical, not theoretical. According to the PR author, the request came from an automation team running llama-bench for regression testing:

"The automation team wanted the harness to behave like llama-cli and llama-server, which already supported an auto mode." — gaugarg-nv, PR #23714 author (source: ggml-org/llama.cpp PR #23714)

The payoff is that you can now benchmark a model under the same Flash Attention decision logic the server or CLI would apply in production, removing a mismatch between what you measure and what actually runs. The diff itself was small: roughly 108 additions and 51 deletions across README.md and llama-bench.cpp.

Aspect	Before b9437	b9437 and after
Accepted `-fa` values	`on`, `off`	`on`, `off`, `auto`
Internal type	`bool`	`llama_flash_attn_type` enum
Default	`off` (boolean)	`auto` (`-1`)
Runtime mapping	Hard enable/disable	`auto` → enable unless disabled, set `auto_fa`, defer to runtime selection

For anyone scripting against the harness, the type change is the detail to note: a field that once read as a boolean now reports an integer enum, which the next section examines in the export output.

Layer Count vs. Layer Convention: The `-ngl` Reinterpretation

The -fa refactor did not arrive alone. PR #23714 also flipped the default value of -ngl/--n-gpu-layers (the number of model layers offloaded to GPU) from 99 to -1, and updated the option parser to accept negative integers. On its own that looks like a cosmetic tweak. It is actually a switch from an arbitrary numeric cap to the convention llama.cpp uses everywhere else, and it changes what some of your recorded runs report.

The reason sits in the public header. In llama.h, a negative n_gpu_layers is the sentinel meaning "store all layers in VRAM" rather than a literal layer count. The old default of 99 was a high-enough integer to offload essentially every model in practice, but it was still a fixed number standing in for "all." Setting the default to -1 aligns llama-bench with the same "offload everything" semantics that llama-server and llama-cli already follow, so the harness now expresses intent rather than a guessed ceiling.

For most models the two resolve identically at runtime: 99 and -1 both offload the entire network, so throughput numbers do not move. For those models the difference is in comparability, not performance. The exception is any model with more than 99 offloadable layers, where the old default would have left some layers on the CPU while -1 offloads all of them. And the printed ngl column now reads -1 instead of 99, which breaks scripts that parse that field or diff it against historical runs.
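The equivalence described above can be modelled in a few lines. This is a deliberately simplified sketch, not the runtime's actual offload logic: `offloaded_layers` is a hypothetical helper, and the layer counts in the loop are made-up illustrations rather than figures for any real model.

```python
def offloaded_layers(ngl: int, total_layers: int) -> int:
    """Simplified model: a negative ngl means 'all layers'; otherwise cap at the model's size."""
    if total_layers < 0:
        raise ValueError("total_layers must be non-negative")
    if ngl < 0:
        return total_layers
    return min(ngl, total_layers)


# Illustrative layer counts only (assumptions, not measurements).
for total in (33, 81, 127):
    old = offloaded_layers(99, total)
    new = offloaded_layers(-1, total)
    print(f"{total:>3} layers: -ngl 99 -> {old}, -ngl -1 -> {new}, same={old == new}")
```

Under this model the two defaults diverge only once the layer count exceeds 99, which is the comparability edge case the text calls out.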

The migration path is short:

Reproduce pre-b9437 metadata exactly: pass an explicit -ngl 99 so the recorded value and offload behavior match your archived runs.
Forward-looking scripts: use -ngl -1, or simply omit the flag, which is now the correct modern form and matches the runtime's "all layers" convention.

If your regression tooling stores raw llama-bench output, pin the flag explicitly on both old and new runs rather than relying on the default — that keeps the ngl column stable across the b9437 boundary regardless of which build produced the row.
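One way to make pinning hard to forget is to build the command line in a single helper. The sketch below only assembles an argument list and does not run anything. `bench_argv`, the binary name on PATH, and the model path are assumptions for illustration; the -m, -fa, and -ngl flags and the on/off/auto spellings follow the article's description of llama-bench.

```python
import shlex


def bench_argv(model_path: str, fa: str = "off", ngl: int = 99,
               binary: str = "llama-bench") -> list:
    """Build a llama-bench command with -fa and -ngl always spelled out."""
    if fa not in {"on", "off", "auto"}:
        raise ValueError(f"unsupported -fa value: {fa!r}")
    return [binary, "-m", model_path, "-fa", fa, "-ngl", str(ngl)]


# Hypothetical model path; the defaults reproduce the pre-b9437 conventions.
legacy = bench_argv("models/example.gguf")
modern = bench_argv("models/example.gguf", fa="auto", ngl=-1)

print(shlex.join(legacy))
print(shlex.join(modern))
```

Because both flags are always emitted, the recorded ngl and flash_attn values depend on the script rather than on whichever build happens to be installed.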

What Your Existing Export Scripts See Now

If you pipe llama-bench output into a parser, b9437 changes the shape of two columns. The flash_attn field is no longer two-valued: rows that once read true or false now carry the integer enum -1 (auto), 0 (off), or 1 (on). Any script that tests flash_attn == true or matches a boolean literal will silently misfire: it will not error, it will just read the wrong value.
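To see how quiet the failure is, consider two checks a parser might plausibly contain. The row dictionaries below are hand-written stand-ins for parsed output, not captured llama-bench rows, and both helper names are hypothetical.

```python
# Stand-ins for parsed rows after b9437: auto, off, on.
new_rows = [{"flash_attn": -1}, {"flash_attn": 0}, {"flash_attn": 1}]


def legacy_is_enabled(row: dict) -> bool:
    """A typical older check: compare against the boolean literal."""
    return row["flash_attn"] == True  # noqa: E712 (deliberately the fragile form)


def truthy_is_enabled(row: dict) -> bool:
    """Another common shortcut: rely on truthiness."""
    return bool(row["flash_attn"])


for row in new_rows:
    value = row["flash_attn"]
    print(f"flash_attn={value:>2}  equals-True={legacy_is_enabled(row)}  truthy={truthy_is_enabled(row)}")
```

Neither check raises. The equality test files an auto row (-1) under "off", while the truthiness test files the same row under "on", so two dashboards reading the same data can disagree without any error being logged.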

The b9437 README examples make this concrete: a default run logs flash_attn: -1 with the device label auto in CSV and JSON rows, because auto is now the default mode. The same applies to the ngl column: runs without an explicit -ngl flag now print -1 instead of 99. Dashboards that hardcode 99 as the "fully offloaded" sentinel will mislabel every fresh run.

Column	Before b9437	From b9437 (default run)	Migration note
`flash_attn`	`true` / `false` (boolean)	`-1` auto · `0` off · `1` on (enum)	Replace boolean checks; map `-1`→auto, `1`→on, `0`→off.
`ngl`	`99` when flag omitted	`-1` when flag omitted	Stop treating `99` as "all layers"; `-1` is the new sentinel.

The safest migration is to stop comparing against literals entirely. Normalize both fields at ingestion — coerce the flash_attn enum to a label and treat any negative ngl as "all layers" — so a single parser handles output from builds on either side of the b9437 boundary. For longitudinal regression data, re-key or backfill historical rows so old boolean and 99 values line up with the new enum and -1 convention before you chart them together.
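A minimal sketch of that ingestion step follows, assuming the field values arrive either as JSON-decoded booleans and integers or as CSV strings. The function names and the label vocabulary ("auto", "off", "on", "all") are choices made for this example, not anything llama-bench defines.

```python
from typing import Union

FA_LABELS = {-1: "auto", 0: "off", 1: "on"}


def normalize_flash_attn(value: Union[bool, int, str]) -> str:
    """Map old boolean values and new enum integers onto one label set."""
    if isinstance(value, bool):  # checked first: bool is a subclass of int
        return "on" if value else "off"
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "false"):
            return "on" if text == "true" else "off"
        value = int(text)  # CSV cells arrive as strings such as "-1"
    try:
        return FA_LABELS[value]
    except KeyError:
        raise ValueError(f"unknown flash_attn value: {value!r}") from None


def normalize_ngl(value: Union[int, str]) -> str:
    """Treat any negative ngl as 'all'; keep explicit layer counts unchanged."""
    n = int(value)
    return "all" if n < 0 else str(n)


for raw_fa, raw_ngl in [(True, 99), ("false", "99"), (-1, -1), ("1", "-1")]:
    print(normalize_flash_attn(raw_fa), normalize_ngl(raw_ngl))
```

Note that this sketch leaves an explicit 99 as "99" rather than rewriting it to "all". Whether an archived 99 really meant "everything" depends on the model's layer count, so that backfill is better done per dataset than inside a generic parser.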

Narrow Impact: Profiling Users Only, Not App Consumers

The blast radius of b9437 is small and confined to one tool. The only people who need to act are developers who compile or download llama.cpp and run llama-bench to profile token throughput, prompt-processing speed, or Flash Attention behavior across model, quantization, and backend combinations. PR #23714 touched exactly two files under tools/llama-bench, README.md and llama-bench.cpp, for 108 additions and 51 deletions in total. Nothing outside that harness moved.

A second group is affected indirectly: any CI job or regression script that parses llama-bench structured output, or that compares runs without pinning explicit -fa and -ngl flags. Those pipelines see the flash_attn field shift from a boolean to an enum (printed as -1 for auto) and the -ngl default flip from 99 to -1. If your scripts already pin both flags, the defaults never apply and your numbers stay stable across the boundary.
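A CI job can guard against unpinned runs with a small pre-flight check on the command it is about to launch. The helper below is a hypothetical sketch: it only looks for the flag spellings named in this article as separate arguments and does not attempt to replicate llama-bench's own option parsing.

```python
REQUIRED_PINS = {
    "flash attention": ("-fa", "--flash-attn"),
    "gpu layers": ("-ngl", "--n-gpu-layers"),
}


def unpinned_flags(argv: list) -> list:
    """Return the names of settings that the command leaves to build defaults."""
    missing = []
    for label, spellings in REQUIRED_PINS.items():
        if not any(arg in spellings for arg in argv):
            missing.append(label)
    return missing


# Hypothetical commands a pipeline might be about to run.
relies_on_defaults = ["llama-bench", "-m", "models/example.gguf"]
fully_pinned = ["llama-bench", "-m", "models/example.gguf", "-fa", "auto", "-ngl", "-1"]

print(unpinned_flags(relies_on_defaults))
print(unpinned_flags(fully_pinned))
```

Failing the job when the returned list is non-empty keeps default changes like this one from leaking into regression data.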

Everything else is untouched. llama-server, llama-cli, the llama-cpp-python bindings, the GGUF format, model weights, and every application-level consumer of the runtime see no change. There is no API surface modification, no model-format revision, and no inference path that behaves differently because this tag exists.

It is worth being precise about what b9437 is not. The release introduces no new model architectures, no new quantization formats, and no inference-quality changes. This is a tooling-layer parity fix that aligns llama-bench with the auto-selection logic llama-cli and llama-server already used, not a runtime change. If you ship products on top of llama.cpp rather than benchmark it directly, you can skip this tag entirely and track the current head instead.

Distributed Artifacts: 23 Platforms, Some Disabled Slots

The b9437 release page ships roughly 23 prebuilt assets covering desktop, mobile, and server targets, so most users can download a binary rather than compile from source. Coverage spans Apple, Linux, Android, and Windows, but the per-commit CI matrix does not run every accelerator slot on every tag, a routine gap worth checking before you pin a build for a specific GPU stack.

The named targets in this tag include:

Apple: macOS Apple Silicon (arm64), macOS Intel (x64), and an iOS XCFramework.
Linux: Ubuntu x64/arm64/s390x CPU, Ubuntu x64/arm64 Vulkan, Ubuntu x64 ROCm 7.2, Ubuntu x64 OpenVINO, plus an openEuler build.
Android: arm64 CPU.
Windows: x64/arm64 CPU, Vulkan, HIP, and two distinct CUDA binaries: x64 CUDA 12 (shipping CUDA 12.4 DLLs) and x64 CUDA 13 (shipping CUDA 13.3 DLLs).

Windows CUDA users should note that those two builds are not interchangeable: check your installed driver and toolkit version before choosing the CUDA 12.4 or CUDA 13.3 artifact. Meanwhile, several optional builds were marked disabled for this tag, including the macOS Apple Silicon KleidiAI and SYCL variants. That is expected behavior in a continuous-tagging pipeline, not a regression. The practical takeaway: if your workflow depends on a particular accelerator build, confirm that slot was actually enabled for the tag you want, and don't assume full coverage on any given build number.

b9437 in Perspective: The Continuous Tagging Cadence

b9437 is one data point in a release stream that grows by the day, not a milestone you should pin your expectations to. llama.cpp cuts a fresh tagged build for essentially every merged commit on master, producing roughly 1–3 (sometimes more) tagged releases per day. By that cadence, b9437 sits among more than 9,000 such tags, each typically carrying a single pull request; here, that is PR #23714's -fa auto parity work for llama-bench.

The tag is already well behind the project head. As of June 12, 2026, the releases index listed b9611 (dated Jun 12, 14:03) as Latest, putting b9437 174 build numbers back. The project follows no semantic versioning and maintains no separate "stable" versus "beta" tracks: the head is always the most feature-complete version, and there is no curated release branch to fall back to.
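The gap quoted above is plain subtraction on the numeric part of the two tag names. A small sketch, assuming tags always take the form of a lowercase b followed by digits (the helper names are hypothetical):

```python
def build_number(tag: str) -> int:
    """Extract the integer from a tag such as 'b9437'."""
    if not tag.startswith("b") or not tag[1:].isdigit():
        raise ValueError(f"not a b-style tag: {tag!r}")
    return int(tag[1:])


def builds_behind(tag: str, head: str) -> int:
    """How many build numbers separate a pinned tag from the head."""
    return build_number(head) - build_number(tag)


print(builds_behind("b9437", "b9611"))  # 174
```

A check like this is handy in a pinning script that should warn once the anchored tag drifts too far from the head.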

That shapes how you should use a number like b9437. For regression pinning, the tag is a valid stable anchor: it identifies an exact commit (aa46bda) and a fixed artifact set, so a benchmark script can reference it for reproducible runs. For anything else (new features, fixes, or current accelerator coverage), treat b9437 as a historical waypoint and track the Latest entry at the llama.cpp releases page instead. Pin the number when you need yesterday's result to reproduce tomorrow; follow the head when you need today's capabilities.

Frequently asked questions

Is llama.cpp b9437 a stable release I should pin to?

Not as a curated stable release. llama.cpp does not use semantic versioning; it cuts a fresh tagged build for essentially every merged commit, producing roughly 1–3 tagged releases per day. b9437 is a single-PR incremental tag built from commit aa46bda on May 30, 2026, not a curated milestone. For reproducibility, pin one validated tag and reuse it; for current features, track the daily Latest at the releases page.

Will upgrading to b9437 change my llama-bench results?

Yes, if you rely on default flags. In b9437, -fa/--flash-attn now defaults to auto instead of off, and -ngl/--n-gpu-layers defaults to -1 (all layers in VRAM) instead of 99. To reproduce pre-b9437 output exactly, pass -fa off -ngl 99 explicitly. The -ngl flip mainly affects comparability for models with more than 99 offloadable layers.

Why did flash_attn change from true/false to -1/0/1 in structured output?

The internal flash_attn parameter type changed from a boolean to the llama_flash_attn_type enum so it can carry three states instead of two: LLAMA_FLASH_ATTN_TYPE_AUTO = -1, LLAMA_FLASH_ATTN_TYPE_DISABLED = 0, and LLAMA_FLASH_ATTN_TYPE_ENABLED = 1. CSV and JSON rows now print the raw integer enum value rather than the old boolean; the README examples show flash_attn as -1 with the device label auto.

Does b9437 affect llama-server, llama-cli, or llama-cpp-python?

No. PR #23714 modified only two files, both under tools/llama-bench (README.md and llama-bench.cpp), for 108 additions and 51 deletions. The inference runtime, GGUF model format, quantization formats, and all language bindings such as llama-cpp-python are untouched; b9437 introduces no API or model-format change.

What is the current head of llama.cpp if b9437 is already stale?

As of June 12, 2026, b9611 (dated Jun 12, 14:03) was listed as Latest, well ahead of b9437. Because llama.cpp tags a build per merged commit, the Latest entry changes daily, so check the releases index for the current head before assuming any specific build number is newest.