96% of cuBLAS, no `unsafe`: what cuTile Rust proves

cuTile Rust applies Rust ownership to GPU kernels. Covers sm_80+ prereqs, the partition-dispatch pattern, and Grout.

Jun 26, 2026

96% of cuBLAS, no `unsafe`: what cuTile Rust proves

GPU programming usually asks Rust developers to surrender the borrow checker at the launch boundary: references collapse into raw pointers, and aliasing, synchronization, and stream lifetimes become hand-managed invariants. A new NVIDIA Labs paper argues that trade is unnecessary.

How cuTile Rust Extends the Borrow Discipline to GPU Dispatch

cuTile Rust is a tile-based DSL that carries Rust's ownership and borrowing rules across the host-to-GPU launch boundary — not just through host code. Introduced in "Fearless Concurrency on the GPU" (arXiv:2606.15991), submitted by NVIDIA researchers Melih Elibol, Jared Roesch, Isaac Gelado, Eric Buehler, and Michael Garland , it lets you author the kernel itself in idiomatic, memory-safe Rust rather than wrapping hand-written unsafe CUDA.

The mechanism is type construction, not a runtime lock. Before launch, mutable output tensors are partitioned into provably disjoint tiles; each tile program then receives an exclusive &mut view of its slice, while inputs arrive as shared & references . Because the partitions cannot overlap, the kernel is single-threaded in its semantics and data-race-free by construction, yet still compiles to massively parallel GPU code. As Melih Elibol put it, "each tile program gets an exclusive &mut view of its memory, plus the inputs as shared references" (source: users.rust-lang.org). Explicit unchecked types remain available for local opt-out when you need lower-level control.

The safety story would be academic if it cost throughput, but the reported numbers say otherwise. On an NVIDIA B200, cuTile Rust reaches 7 TB/s on memory-bound element-wise operations and 2 PFlop/s on GEMM — roughly 96% of cuBLAS, and within measurement noise of cuTile Python . End to end, the companion Qwen3 inference engine Grout reaches 171 generated tokens/s for Qwen3-4B on an RTX 5090 and 82 tokens/s for Qwen3-32B on a B200 in batch-1 decode . Those are the authors' own measurements on specific hardware — independent reproduction is not yet established — but they frame the central claim this article unpacks: safe Rust kernels without a measured performance penalty.

sm_80+, Driver ≥610, Rust 1.89: What the Crate Expects

Before any of that lands on your hardware, the crate sets a firm floor. cuTile Rust targets NVIDIA GPUs with compute capability sm_80 or higher — Ampere, Hopper, and Blackwell — which excludes Volta (V100) and earlier . It builds on CUDA 13.3, Rust 1.89+, and Linux, tested on Ubuntu 24.04; Windows and macOS are unsupported, and no AMD/ROCm or Metal backend exists as of June 2026 . CUDA 13.x needs driver ≥580 for minor-version compatibility, and CUDA 13.3 GA corresponds to Linux driver ≥610.43.02 .

Requirement	Minimum
GPU compute capability	sm_80+ (Ampere/Hopper/Blackwell)
CUDA toolkit	13.3
Linux driver	≥610.43.02 (≥580 for 13.x minor-compat)
Rust	1.89+
OS	Linux (Ubuntu 24.04 tested)
Tile IR toolchain	CMake 3.20+, C++17, Python 3.6+

The Tile IR toolchain itself — cuda-tile-translate and tileiras, which compile MLIR-based Tile IR bytecode into cubins — expects CMake 3.20+, C++17, and Python 3.6+ . Confirm your driver and GPU first; everything below assumes the floor is met.

Annotating, Partitioning, and Dispatching a cuTile Rust Crate

Writing a cuTile Rust kernel means declaring a #[cutile::module] block, annotating the function with #[cutile::entry()], and bringing the prelude into scope with use cutile::prelude::*. The macro rewrites that function into a GPU kernel and auto-generates the host-side launcher that partitions tensors — you write no hand-rolled dispatch code . The canonical element-wise add reads like ordinary Rust:

#[cutile::module]
mod kernel {
  #[cutile::entry()]
  fn add<const B: i32>(
    z: &mut Tensor<f32, {[B]}>,  // exclusive write
    x: &Tensor<f32, {[-1]}>,     // shared read
    y: &Tensor<f32, {[-1]}>,     // shared read
  ) {
    let tx = load_tile_like(x, z);
    let ty = load_tile_like(y, z);
    z.store(tx + ty);
  }
}

The signature is the contract. Mutable outputs are typed &mut Tensor<f32, {[B]}>; shared inputs are &Tensor<f32, {[-1]}>. The const-generic shape parameter encodes the tile size at the type level, so the borrow checker sees one exclusive writer and many immutable readers per tile .

On the host the recipe is short: create your tensors, call .partition([128]) on the mutable output before launch, then run kernel::add(z, x, y).sync()? for blocking execution. The generated launcher holds the operands while GPU work is in flight, and ownership of the tensors returns to you only after .sync() completes . Because the partitions are provably disjoint, each tile program is single-threaded in its semantics and data-race-free by construction.

For inference pipelines, cuTile Rust exposes a lazy DeviceOp model. Use .sync() for blocking dispatch, .into_future() (via IntoFuture) for async execution, and .graph() / CudaGraph::scope for CUDA graph capture and replay . The intended pattern builds a reusable layer graph once, borrows temporary buffers mutably inside each recorded op, and releases them after sync. Stream-order capture plus Rust lifetimes make buffer reuse visible to the type system, so ordering is enforced without manual annotation. Kernels JIT-compile through CUDA Tile IR, an MLIR-based intermediate representation, before reaching the GPU .

The safety idea is easy to feel out without a GPU. The illustrative Python below (executed; not the production Rust path) proves each tile's bounds once, then touches memory only through checked ranges — the same "prove disjointness, then trust the slice" shape cuTile Rust enforces at compile time:

from dataclasses import dataclass
from random import Random


@dataclass(frozen=True)
class Tile:
    row: range
    col: range
    red: range

    def proved(self, m: int, n: int, k: int) -> "Tile":
        assert 0 <= self.row.start <= self.row.stop <= m
        assert 0 <= self.col.start <= self.col.stop <= n
        assert 0 <= self.red.start <= self.red.stop <= k
        return self


def tiled_matmul(a, b, block=8):
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    proofs = 0
    for i in range(0, m, block):
        for j in range(0, n, block):
            for p in range(0, k, block):
                t = Tile(range(i, min(i + block, m)),
                         range(j, min(j + block, n)),
                         range(p, min(p + block, k))).proved(m, n, k)
                proofs += 1
                for r in t.row:
                    for q in t.red:
                        arq = a[r][q]
                        for s in t.col:
                            c[r][s] += arq * b[q][s]
    return c, proofs


def plain_matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


rng = Random(0)
size = 24
a = [[rng.random() for _ in range(size)] for _ in range(size)]
b = [[rng.random() for _ in range(size)] for _ in range(size)]
got, proofs = tiled_matmul(a, b)
want = plain_matmul(a, b)
err = max(abs(got[i][j] - want[i][j]) for i in range(size) for j in range(size))

print("cuTile idea in Python: prove tile bounds once, then use only checked ranges.")
print(f"tiles proved: {proofs}; unsafe operations: 0")
print(f"max error vs reference: {err:.2e}")
print("The 96%-of-cuBLAS claim is about Rust/CUDA performance; this shows the safety proof shape.")

Friction to Expect: No AMD, Evolving Macros, Unproven Concurrency

cuTile Rust is NVIDIA/CUDA-only today, and that constraint runs deep. There is no AMD/ROCm path, no Metal backend, and no portable WebGPU fallback — every kernel JIT-compiles through CUDA Tile IR into cubins . The compute-capability floor is hard: sm_80 (Ampere) or newer, paired with CUDA 13.3, Rust 1.89+, and Linux . Any pre-Ampere card is excluded outright.

The surface API is explicitly early-stage. The Tensor<f32, {[B]}> const-generic shape syntax and the #[cutile::module]/#[cutile::entry()] macro forms can change between releases . Pin your dependency in Cargo.lock before this lands in CI; treat API churn as expected, not exceptional.

Be precise about the headline numbers. The 96%-of-cuBLAS GEMM result and 171 tokens/s batch-1 decode for Qwen3-4B on an RTX 5090 are the authors' own measurements on specific hardware, including a B200 . An independent evaluation of the CUDA Tile Python stack reported 52–79% of cuBLAS for GEMM and only 53% of FlashAttention-2 throughput on RTX PRO 6000 Blackwell Server Edition — results that vary by workload and architecture . Multi-batch throughput, prefill latency, and model coverage beyond Qwen3 remain uncharacterized. Validate on your target GPU, batch distribution, and context length before you swap out a mature inference stack.

Grout: The Inference Reference for cuTile Rust Crate Authors

If you want to see cuTile Rust in a real decode path rather than a microbenchmark, read Grout. Grout is a cuTile-Rust Qwen3 inference engine co-authored by Eric Buehler, who also maintains mistral.rs, and it serves as the canonical production call-site pattern. Study how it structures lazy DeviceOp graphs, borrows temporary buffers mutably inside CudaGraph::scope capture, and recovers ownership only after .sync() — that ordering is the intended idiom for inference pipelines, where stream-order capture plus Rust lifetimes make buffer reuse visible to the type system.

This is the contrast that matters. Candle, Burn, and mistral.rs largely FFI into or wrap hand-written, often unsafe kernels; cuTile Rust offers a path to author the kernels themselves in safe Rust with no measured penalty. As lead author Melih Elibol frames the guarantee, "each tile program gets an exclusive &mut view of its memory, plus the inputs as shared references" .

Concrete next step: clone Grout, run the Qwen3-4B decode path — the authors report 171 generated tokens/s in batch-1 decode on an RTX 5090 — on an A100 or RTX 4090, and compare tok/s against a vllm>=0.8.4 baseline . The size of that gap — or its absence — is the real signal, not the headline.

Frequently asked questions

What NVIDIA GPU is required to run cuTile Rust?

You need an NVIDIA GPU with compute capability sm_80 (Ampere) or higher, plus CUDA 13.3, Rust 1.89+, and Linux (tested on Ubuntu 24.04) . That floor covers the RTX 3000/4000/5000 series, A100, H100, and B200, but excludes Volta (V100) and Turing (RTX 2000). On the driver side, CUDA 13.3 GA corresponds to a Linux driver of at least 610.43.02 .

How does cuTile Rust achieve data-race freedom without a runtime lock?

It moves the guarantee to compile time. Mutable output tensors are partitioned on the host into provably non-overlapping tiles before dispatch, and each tile program receives an exclusive &mut view of its slice while inputs arrive as shared & references . Because the partitions cannot alias, Rust's borrow checker — which permits one mutable reference or many immutable ones — rules out conflicting writes statically . No runtime synchronization primitive is inserted; the kernel is single-threaded in its semantics yet compiles to massively parallel GPU code.

Is cuTile Rust production-ready?

Not yet. The authors describe it as early-stage, so the API surface — including the Tensor<f32, {[B]}> const-generic shape syntax and the macro forms — may change . It is CUDA/Linux-only (sm_80+, CUDA 13.3), and multi-batch throughput, prefill, and broader model coverage beyond Qwen3 are uncharacterized. Grout is a useful reference call site, but validate your target GPU, driver, model, batch size, and graph-capture behavior before replacing a mature stack like vLLM or SGLang.

Does cuTile Rust work on AMD GPUs or Apple Silicon?

No. cuTile Rust JIT-compiles through CUDA Tile IR, which targets NVIDIA hardware (sm_80+) only, and as of June 2026 there is no ROCm, Metal, or WebGPU backend . The portable Rust-on-GPU ecosystem — Rust GPU and wgpu — does reach AMD and Apple Silicon, but it takes a different, non-CUDA approach and does not carry cuTile's ownership-across-launch model.

How does Grout's 171 tok/s on RTX 5090 compare to vLLM?

The authors report 171 generated tokens/s for Qwen3-4B batch-1 decode on an RTX 5090 and 82 tokens/s for Qwen3-32B on a B200, characterizing both as competitive with vLLM and SGLang and near the HBM roofline for memory-bound decoding . Treat that as the authors' own measurement — independent reproduction has not been published. For your own baseline, Qwen recommends vllm>=0.8.4 or sglang>=0.4.6.post1 .