docs(paged): consolidate the dev-trail docs into one canonical README

The paged-attention patch directory had accumulated ~55 scattered dev docs
(results, progress, scope, lever, and gap-analysis notes). Consolidate the
durable content of all of them into one canonical
backend/cpp/llama-cpp/patches/paged/README.md covering: what the patchset is,
the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM
decode path, NVFP4 FP4-MMA, the decode-first scheduler), the full 0001-0030
patch series table with bit-exact status, the GB10 benchmarks
(patched-vs-stock-vs-vLLM + the Apple M4 architectural note), the dev notes
(bit-exact methodology, the per-path gate, the MoE-parity conclusion, the
rejected/flat levers, the opt-in bf16-SSM mode), arch+quant generality, the
pin + canary maintenance policy, and the published NVFP4 gallery models.

Delete the consolidated-away dev trail. Keep the three operational docs the
README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md
(per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the
ship-as-own-backend design-of-record), plus the benchmark plots + csv. The
.patch files and the unit/bench .cpp are untouched.

Repoint every external reference to a deleted doc at the new README:
grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the
canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base
patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC
reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a
patch-internal path matcher, not a repo-doc link) is left intact.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-27 09:23:30 +00:00
parent a5a5b2ad80
commit fb2dc33d52
62 changed files with 325 additions and 12885 deletions


@@ -28,7 +28,7 @@
# build on 0019's code, the rejection cascades to them too. This is a
# PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
# upstream break (see backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
-# and PIN_BUMP_APPLY_CHECK.md). We exclude ONLY that dev-doc path and still
+# and README.md). We exclude ONLY that dev-doc path and still
# apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
# still fails the canary. prepare.sh tolerates the same hunk via
# `patch ... || true`; this mirrors that tolerance precisely.


@@ -850,7 +850,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// common_context_params_to_llama (patch 0026) only when the --ssm-bf16-tau CLI flag is unset.
// Unset / non-positive => env untouched, so stock stays byte-identical and bit-exact (an
// externally exported LLAMA_SSM_BF16_TAU still works as an escape hatch). NOTE: this mode is
-// NOT bit-exact (~91% same-top-p ceiling); see patches/paged/A_HYBRID_SSM_RESULTS.md.
+// NOT bit-exact (~91% same-top-p ceiling); see patches/paged/README.md (Dev notes).
} else if (!strcmp(optname, "ssm_bf16_tau") || !strcmp(optname, "ssm_hybrid_tau")) {
if (optval != NULL) {
try {


@@ -57,7 +57,7 @@ All variants (avx/avx2/avx512/cuda/…) copy the patched `llama.cpp/` tree, so t
- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
**token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
-(`ADDITIVE_DESIGN.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
+(see `paged/README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
`llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
`llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,


@@ -1,102 +0,0 @@
# Patch 0003 — paged gather-read: exact implementation plan
**Goal:** a sequence attends only its own (compacted) cells via `ggml_get_rows`, instead of the scattered
`[0,n_kv)` window. Token-identical (attention is permutation-invariant over the KV set). **Gated**: stock
path stays byte-identical (no new ops unless `LLAMA_KV_PAGED`).
**Base:** applies on top of 0001+0002 at the pin. Dev tree: `backend/cpp/llama-cpp-paged-dev` (branch `paged`).
## Design
The gather is keyed off one runtime index list (the sequence's used cells, in a fixed order), exposed as a
graph input (mirroring `k_idxs`). In `build_attn`, gather K, V **and the kq_mask** by that same index, so all
three stay aligned. `n_gathered` replaces `n_kv` for the attention. Only active when the cache is in paged
mode (a new `is_paged()` flag set when `LLAMA_KV_PAGED`/find_slot used permuted placement).
ggml note: `ggml_get_rows(a,b)` gathers `a`'s **ne1** by `b` (I32). Raw K is `[n_embd_k_gqa, kv_size, n_stream]`
→ ne1 = cells → direct. The mask is `[n_kv, n_tokens, 1, n_stream]` → n_kv is **ne0**, so gather as
`transpose → get_rows → transpose`.
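
The `transpose → get_rows → transpose` trick can be illustrated without ggml. The following is a minimal sketch on plain nested vectors, not the ggml implementation: `matrix`, `transpose`, `get_rows` and `gather_cols` are hypothetical helpers introduced only for illustration, and the matrix is indexed `[token][cell]` so the cell axis is the inner one, as with the mask's `ne0`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dense matrix, indexed m[row][col]. Not a ggml type.
using matrix = std::vector<std::vector<float>>;

// Swap rows and columns. An empty matrix transposes to an empty matrix.
matrix transpose(const matrix & m) {
    if (m.empty()) {
        return {};
    }
    matrix t(m[0].size(), std::vector<float>(m.size()));
    for (std::size_t r = 0; r < m.size(); ++r) {
        assert(m[r].size() == m[0].size());
        for (std::size_t c = 0; c < m[r].size(); ++c) {
            t[c][r] = m[r][c];
        }
    }
    return t;
}

// Row gather: output row j is input row idx[j].
matrix get_rows(const matrix & m, const std::vector<int32_t> & idx) {
    matrix out;
    out.reserve(idx.size());
    for (int32_t i : idx) {
        assert(i >= 0 && static_cast<std::size_t>(i) < m.size());
        out.push_back(m[static_cast<std::size_t>(i)]);
    }
    return out;
}

// Column gather built only from a row gather: transpose, gather rows, transpose back.
matrix gather_cols(const matrix & m, const std::vector<int32_t> & idx) {
    return transpose(get_rows(transpose(m), idx));
}
```

With a two-token mask over four cells, `gather_cols(mask, {0, 2, 3})` keeps the same two rows and drops cell 1 from each, which is the effect the plan wants on `kq_mask`.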
### KEY CORRECTIONS (found while implementing — these change the edits)
1. **Gather index = ALL used (non-empty) cells in `[0,n_kv)`, NOT `sinfo.idxs`.** `sinfo.idxs` is only the
*current ubatch's write slots*; attention reads the *full history*. The query set per token is masked by
`kq_mask`, so gathering the union of all used cells + gathering the mask the same way is token-identical
and drops exactly the empty (already-masked) cells. So: `gather = { i in [0,n_kv) : !cells.is_empty(i) }`.
2. **Static-graph size is fine because llama.cpp rebuilds the graph every ubatch.** `n_gather` (used-cell
count) is therefore a build-time constant for that ubatch — `build_input_gather_idxs` sizes the I32
tensor to `get_n_gather()` computed at build, `set_input_gather_idxs` fills the identical cell list. They
MUST use the same loop (`for i in [0,n_kv): if !is_empty(i) push i`) so build-order == fill-order.
3. **K/V gather can live entirely in `build_attn`, no cache get_k change.** The `get_k` 4d view is contiguous
in `[ne0,ne1,ne2]` from cell 0 (nb2 == n_embd_head*n_head_kv*elemsz), so for **single stream (ns==1)**:
`reshape_3d(k, n_embd_head*n_head_kv, n_kv, 1) → get_rows(., gi) → reshape_4d(., n_embd_head, n_head_kv, n_gather, 1)`.
Multi-stream (ns>1) breaks contiguity (nb3 uses kv_size) → gate to ns==1 first, multi-stream follow-up.
4. So the ONLY cache additions are `is_paged()`, `get_n_gather(n_kv)`, `build/set_input_gather_idxs(n_kv)`;
everything else (K/V/mask gather) is in `build_attn`. `set_input_kq_mask` is **unchanged** (built over
n_kv, then gathered). Smaller than the 7-edit estimate above.
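
Corrections 1 and 2 boil down to one shared loop. The following is a minimal sketch under stated assumptions, not the cache code: `cell_occupancy` is a hypothetical stand-in for the cache's `cells.is_empty(i)` query, and `used_cells` / `n_gather` are illustrative names. The point is that the build-time size and the fill-time index list come from the same traversal, so they cannot disagree.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the cache's occupancy: used[i] is true when cell i
// holds a token (the real cache would answer this via !cells.is_empty(i)).
using cell_occupancy = std::vector<bool>;

// The single canonical loop: every used cell in [0, n_kv), in ascending order.
std::vector<int32_t> used_cells(const cell_occupancy & used, uint32_t n_kv) {
    assert(n_kv <= used.size());
    std::vector<int32_t> idxs;
    for (uint32_t i = 0; i < n_kv; ++i) {
        if (used[i]) {
            idxs.push_back(static_cast<int32_t>(i));
        }
    }
    return idxs;
}

// Build-time size of the I32 gather tensor for this ubatch, derived from the
// same loop that later fills it.
uint32_t n_gather(const cell_occupancy & used, uint32_t n_kv) {
    return static_cast<uint32_t>(used_cells(used, n_kv).size());
}
```

For an occupancy of `{used, empty, used, used, empty, empty}` and `n_kv = 6`, the gather list is `{0, 2, 3}` and the tensor is sized to 3, so exactly the empty (already-masked) cells are dropped.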
## Edits
### 1. `src/llama-kv-cache.h` — declare gather infra (in `llama_kv_cache`)
```cpp
bool is_paged() const { return paged_active; } // near get_size()
ggml_tensor * build_input_gather_idxs(ggml_context * ctx, const slot_info & sinfo) const;
void set_input_gather_idxs (ggml_tensor * dst, const slot_info & sinfo) const;
uint32_t get_n_gather(const slot_info & sinfo) const; // == sum of used cells gathered
```
Add member `mutable bool paged_active = false;` and in `llama_kv_cache_context` forward the three (like
`build_input_k_idxs`/`get_n_kv`).
### 2. `src/llama-kv-cache.cpp`
- In `find_slot`, in the paged branch (0002), set `paged_active = true;` on success.
- `get_n_gather(sinfo)` = `sinfo.idxs[0].size()` summed over streams (the count actually placed).
- `build_input_gather_idxs`: `ggml_new_tensor_1d(ctx, GGML_TYPE_I32, get_n_gather(sinfo)); ggml_set_input(...)`.
- `set_input_gather_idxs`: fill `data[k++] = strm_off + sinfo.idxs[s][i]` for every placed cell (same order
the mask/k/v will see). This is the canonical gather order.
### 3. `src/llama-graph.h` — `llm_graph_input_attn_kv`
Add `ggml_tensor * gather_idxs = nullptr;` + `ggml_tensor * get_gather_idxs() const { return gather_idxs; }`.
### 4. `src/llama-graph.cpp`
- `llm_graph_input_attn_kv::set_input`: if `mctx->is_paged()` → `mctx->set_input_gather_idxs(gather_idxs, ...)`.
- `build_attn_inp_kv` (creates the input): if `mctx_cur->is_paged()` → `inp->gather_idxs =
mctx_cur->build_input_gather_idxs(ctx0, ...)`.
- `build_attn` (the kv overload, ~2356): after `k`,`v`,`kq_mask`:
```cpp
if (ggml_tensor * gi = inp->get_gather_idxs()) {
    // First cut: transposed V (the !flash_attn path) needs a column gather,
    // which is deferred to the follow-up described in section 5.
    GGML_ASSERT(!v_trans);
    k = ggml_get_rows(ctx0, k, gi);                                    // [d, n_gather, ...] (reshape view ok)
    v = ggml_get_rows(ctx0, v, gi);
    ggml_tensor * m = ggml_cont(ctx0, ggml_transpose(ctx0, kq_mask)); // [n_tokens, n_kv]
    m = ggml_get_rows(ctx0, m, gi);                                    // [n_tokens, n_gather]
    kq_mask = ggml_cont(ctx0, ggml_transpose(ctx0, m));               // [n_gather, n_tokens]
}
ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
```
Note: `get_k` returns the reshaped 4d view; gather must run on a cell-major shape. Simplest: add a paged
variant `get_k(ctx,il)` that returns `ggml_get_rows` of the **raw** `layers[ikv].k` then reshapes to
`[n_embd_head, n_head_kv, n_gather, ns]`. Do the gather in the cache, not the graph, for K/V; keep only the
mask gather in the graph. (Cleaner — revisit during impl.)
### 5. V-transposed path
When `!flash_attn`, V is stored transposed `[kv_size, n_embd_v_gqa]`, so gathering its **rows** (ne1 = n_embd)
won't work. Either gather columns via the same idx on the non-transposed store, or force `is_paged()` to
require flash-attn for the first cut (`GGML_ASSERT`) and handle v_trans in a follow-up.
## Verification (the gate)
```sh
cmake --build build-cpu --target llama-simple -j
M=Qwen3-0.6B.Q4_K_M.gguf ; P="<the 0002 prompt>"
build-cpu/bin/llama-simple -m $M -n 64 "$P" > a.txt # stock
LLAMA_KV_PAGED=1 build-cpu/bin/llama-simple -m $M -n 64 "$P" > b.txt # paged gather-read
diff a.txt b.txt # MUST be identical
```
Also assert (debug) that `n_gather < n_kv` on a multi-chunk sequence (proves compaction, not identity).
Export only when identical: `git format-patch HEAD~1 -o patches/ --start-number 3 -N`.
## Risks
- Mask transpose/layout: if `b.txt` diverges, dump the gathered mask vs expected for token 0; off-by-order
means the `set_input_gather_idxs` order ≠ the get_k gather order — they MUST use the identical loop.
- flash-attn vs not: do flash-attn first (simpler mask), then v_trans.


@@ -1,347 +0,0 @@
# A.2 - CUDA-graphing the paged decode: measured lever + gap diagnosis
Phase 1 (measure, do not punt). DGX GB10 (sm_121), dev tree `~/llama-paged-dev`
HEAD 089f78d (patch 0017), `build-cuda`. Model `q36-27b-nvfp4.gguf` (dense),
harness `llama-batched-bench`, fusion held OFF (`LLAMA_FUSE_NVFP4_QUANT=0`) for a
clean stock-kernel baseline. `decode_agg` = the `S_TG t/s` column.
## TL;DR verdict
CUDA-graphing the paged decode is **NOT a real throughput lever** (ceiling well
under 1%). The steady decode step is **GPU-compute-bound: 99.4-99.5% GPU-busy**.
Total GPU idle is ~0.5-0.6% of the step, split into within-step launch gaps
(0.37%, the only thing CUDA graphs remove) and a between-step host-loop gap
(0.24%, one ~2 ms gap per step). Graphs already engage on the default paged
decode and do collapse the launch gaps (0.37% -> 0.11%), but the GPU stays
99.4-99.5% busy either way, so decode_agg is unchanged. The 2.6x gap to vLLM
(148 vs 391) lives in the per-step GPU **kernel work** (FP4 GEMM + attention at
batch 128), not in launch overhead or the host loop.
The premise that "the paged decode runs eager (graphs reused=0)" did not survive
measurement: at the benchmarked context the default paged decode captures and
replays graphs exactly like stock non-paged. Two measurement traps (below)
explain the earlier "reused=0 / gap-bound" reading.
## Method note: a graph-enable trap that was corrected
`GGML_CUDA_DISABLE_GRAPHS` is read with `getenv(...) != nullptr`
(`ggml/src/ggml-cuda/common.cuh:1234`), so setting it to an **empty** string
still disables graphs. A first 4-cell pass that used
`GGML_CUDA_DISABLE_GRAPHS=""` for the "graphs ON" cells therefore ran graphs OFF
in all four cells (an OFF-vs-OFF comparison). The numbers below ("v2") unset the
variable with `env -u` for the ON cells. The `-lv 99` probe is unaffected (it
never set the variable).
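
The trap is the difference between a presence check and a value check. The sketch below is illustrative only: `disabled_by_presence` follows the `getenv(...) != nullptr` pattern quoted above, while `disabled_by_value` is a hypothetical stricter alternative that ggml does not use, shown for contrast.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Presence check, the pattern the note describes: any value, even the empty
// string, counts as "set".
bool disabled_by_presence(const char * name) {
    assert(name != nullptr);
    return std::getenv(name) != nullptr;
}

// Hypothetical stricter check (NOT what ggml does): only a non-empty value
// other than "0" counts.
bool disabled_by_value(const char * name) {
    assert(name != nullptr);
    const char * v = std::getenv(name);
    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
}
```

Under the presence check, exporting the variable as an empty string still reports "disabled"; only removing it from the environment (for example with `env -u`) flips the result.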
## Step 1 - the 4-cell decode_agg table (corrected, graphs genuinely enabled)
npp 128, ntg 128, npl 32 and 128, c 40960, b/ub 2048, fa on. `S_TG t/s`:
| cell | npl 32 | npl 128 |
|------------------|---------|---------|
| stock_graphon | 116.47 | 148.41 |
| stock_graphoff | 115.17 | 148.21 |
| paged_graphon | 116.21 | 148.60 |
| paged_graphoff | 114.62 | 147.65 |
ON vs OFF (the graph win):
| config | npl 32 | npl 128 |
|--------|--------|---------|
| stock | +1.13% | +0.13% |
| paged | +1.39% | +0.64% |
- (a) Does STOCK get a graph win? Essentially no: +0.13% at npl 128, +1.13% at
npl 32 (small-batch, where per-kernel launch overhead is relatively larger).
All within run-to-run noise (~1% at npl 32, ~0.2% at npl 128).
- (b) Does PAGED get a graph win? Same picture: +0.64% / +1.39%. Paged is NOT
eager at this config (see Step 2); it captures graphs like stock.
- (c) LEVER SIZE (proxy = stock graph win, now genuinely measured): +0.13% at
npl 128, +1.1% at npl 32. Negligible vs the 2.6x (=+164%) gap to vLLM.
All four cells sit at ~148 (npl 128) / ~115 (npl 32) within ~1%. The ~148 wall is
shared by stock and paged; it is not paged-specific. Calibration cross-check
(paged ON, ntg 64): 147.64, matching the reference 148-149.
## Step 2 - why the "eager" premise is wrong, and what actually mutates
CUDA-graph state machine (`ggml_backend_cuda_graph_compute` in
`ggml/src/ggml-cuda/ggml-cuda.cu`): warmup completes after a step whose node
properties did not change vs the previous step; any later change logs
`CUDA graph warmup reset` and reverts to eager until stable again.
`ggml_cuda_graph_update_required` memcmps every node's full `ggml_tensor` plus
each src's `data` ptr / `ne` / `nb`.
`-lv 99` probe, short context (npp 64, ntg 32, ctx <= 96):
- stock: `warmup complete` x2, `warmup reset` x0.
- paged: `warmup complete` x2, `warmup reset` x0.
Both capture and then replay silently. The `CUDA Graph id N reused` line stays 0
for both because llama rebuilds the cgraph each ubatch (new `cgraph->uid`), so
the uid fast-path never fires; the graph is still replayed via the
`instance != nullptr` path, which logs nothing. **"reused=0" is a false negative,
not evidence of eager execution.** (Trap #1.)
Cadence probe (npp 200, ntg 320, npl 4, ctx 200->520, crosses the 256 and 512
token boundaries), counts over ~320 decode steps:
| path | complete | reset | interpretation |
|-------------------------------|----------|-------|-------------------------------|
| paged in-kernel (default) | 10 | 8 | resets only at 256-boundaries |
| paged gather (KV_PAGED_GATHER)| 0 | 0 | never captures -> pure eager |
| stock non-paged | 10 | 8 | identical 256-cadence |
The 8 resets cluster at the two boundary crossings (timestamps ~9.9 s and ~34 s),
not per-step. The default paged decode is therefore captured for ~97% of steps,
re-warming only every ~256 tokens, with the **same cadence as stock**.
What mutates (the block-table / gather input):
- in-kernel decode (default): the block-table graph input
`idx = ggml_new_tensor_2d(ctx0, I32, n_view, n_stream)` with
`n_view = GGML_PAD(n_gather, 256)` (`src/paged-attn.cpp:199,213`). Its `ne[0]`
steps 256 -> 512 -> 768 only when the context crosses a 256-token boundary. The
kq_mask input (ne0 = n_kv, also padded to 256) steps in lockstep. So the
property change is per-256-tokens, not per-step.
- gather fallback (`LLAMA_KV_PAGED_GATHER=1`, transposed-V, or prefill): the
index input `idx = ggml_new_tensor_2d(ctx0, I32, n_gather, n_stream)`
(`src/paged-attn.cpp:106`) has `ne[0] = n_gather` (UNPADDED), which grows every
step (the unit's own comment, `src/paged-attn.cpp:28-30`: "n_gather grows every
step"). That changes a node property every step, warmup never completes, and
the path runs pure eager. This is the only "graphs reused=0" path, and it is
not the default decode path.
`LLAMA_KV_PAGED_DEBUG` dump at ctx 201 (first 2 decode calls, identical across
the pair): `in-kernel decode n_stream=4 n_kv=256 n_gather=201` -> block-table
`ne[0] = GGML_PAD(201,256) = 256`, stable until n_gather crosses 256.
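
The reset cadence follows directly from the padding arithmetic. As a minimal sketch (the helper names `pad_to` and `shape_changes` are hypothetical, and `pad_to` is assumed to perform the same power-of-two round-up as `GGML_PAD`), the following counts how often the index tensor's `ne[0]` changes as the context grows:

```cpp
#include <cassert>
#include <cstdint>

// Round x up to a multiple of n; n must be a power of two.
uint32_t pad_to(uint32_t x, uint32_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    return (x + n - 1) & ~(n - 1);
}

// Count the decode steps whose padded size differs from the previous step
// while the context grows from ctx_begin to ctx_end (inclusive).
// pad == 1 models the unpadded gather-fallback index.
uint32_t shape_changes(uint32_t ctx_begin, uint32_t ctx_end, uint32_t pad) {
    uint32_t changes = 0;
    uint32_t prev    = pad_to(ctx_begin, pad);
    for (uint32_t n = ctx_begin + 1; n <= ctx_end; ++n) {
        const uint32_t cur = pad_to(n, pad);
        if (cur != prev) {
            ++changes;
        }
        prev = cur;
    }
    return changes;
}
```

For the cadence probe's range, `shape_changes(200, 520, 256)` is 2 (the 256 and 512 crossings), while `shape_changes(200, 520, 1)` is 320, one change per step, which is why the unpadded path never finishes warmup.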
## Step 3 - where the step time goes (nsys), and a second trap
npl 128, ntg 24, ctx 56 (< 256, so the ON run stays captured after warmup).
Idle split by gap size: within-step launch gaps < 1 ms, between-step host gaps >= 1 ms.
Steady window = 40%-97% of the trace span (excludes model load / graph
reserve / prefill one-offs).
Trap #2: `nsys --trace=cuda` does NOT emit the kernels INSIDE a replayed CUDA
graph into `cuda_gpu_trace` by default. The graphs-ON trace had only 15,279 GPU
rows vs 84,946 for the identical OFF workload and reported a bogus 0.3% GPU-busy.
Re-profiling the ON case with `--cuda-graph-trace=node` restored all 84,946 rows
and 99.5% busy. **Any "decode is idle/gap-bound" reading taken from a graphs-ON
nsys trace without `--cuda-graph-trace=node` is artifactually idle-inflated** -
the likely source of the earlier "freed GPU time became idle gaps" conclusion.
Reliable steady-state numbers:
| trace | GPU rows | busy | within-step idle | between-step idle | host gap/step |
|--------------------------------|----------|--------|------------------|-------------------|---------------|
| OFF (eager) | 84,946 | 99.4% | 0.37% | 0.24% | ~2.0 ms |
| ON (captured, node-trace) | 84,946 | 99.5% | 0.11% | 0.38% | ~1.9 ms |
- CUDA graphs replay (cudaGraphLaunch=46) and collapse the launch path: ON has
~15k kernel launches/run vs OFF ~80k (cudaLaunchKernel 6,024 vs 31,764, plus
ExC 9,049 vs 48,165). That cuts within-step launch idle from 0.37% to 0.11%.
- But the GPU is 99.4-99.5% busy in both, so decode_agg is unchanged.
- Between-step host idle is one ~2 ms gap per decode step (the 128-way sample +
update_slots + batch build), 0.24-0.38% of the ~896 ms step.
Per-step decomposition at npl 128: ~896 ms/step, of which ~890 ms is GPU kernel
compute, ~2 ms host-loop gap, ~3 ms (eager) / ~1 ms (captured) launch gaps.
## The load-bearing question, answered
Within-step or between-step? **Neither is large.** The steady decode is 99.4%
GPU-busy; the entire idle budget is ~0.6% of the step. CUDA graphs already remove
the within-step launch fraction (0.37% -> 0.11%), and the between-step host gap is
~2 ms/step (0.24%). There is no large gap for a host-loop rewrite to reclaim
either; the host loop is currently **hidden under GPU compute** (the GPU stays
busy while the host syncs/schedules). It would only become a lever once the
kernels are fast enough to drop GPU-busy below the host time, i.e. it is a
second-order floor, not the present bottleneck.
## Verdict
1. CUDA-graphing the paged decode is not the lever. Graphs already engage on the
default decode; capturing reduces within-step launch idle from 0.37% to 0.11%
but leaves the GPU 99.4-99.5% busy, so decode_agg moves by ~0% (measured
+0.1% to +0.6% at npl 128, +1.1% to +1.4% at npl 32, all within noise).
2. The between-step host loop is not the present lever either (0.24%, ~2 ms/step,
hidden under GPU compute). It is the candidate floor only after the kernels
speed up.
3. The decode is GPU-compute-bound at this NVFP4 fusion-OFF baseline. The 2.6x
gap to vLLM is in the per-step GPU kernel work (FP4 GEMM + attention at batch
128). That, not graphs and not the host loop, is the throughput lever.
4. Corrected premises: paged is not perpetually eager (it captures with a
256-token reset cadence identical to stock); "graphs reused=0" was a uid
fast-path false negative; and a graphs-ON nsys trace under-counts GPU-busy
unless `--cuda-graph-trace=node` is set.
No code patch in Phase 1 (graphs are not the lever, so there is no paged
graph-capture patch to land). Evidence: `~/bench/a2_4cell_v2/`, `~/bench/a2_probe`,
`~/bench/a2_probe2`, `~/bench/a2_nsys/*.nsys-rep` on the DGX.
# Phase 2 - the real decode lever, located (per-kernel decomposition)
Phase 1 ended on "decode is GPU-compute-bound; the 2.6x gap to vLLM lives in the
per-step GPU kernel work (FP4 GEMM + attention at batch 128)." Phase 2 measured
that per-step GPU work directly - per kernel and per memcpy, on the Phase-1 nsys
`.sqlite` reps - and the "FP4 GEMM + attention" attribution does not survive the
measurement. Two corrections, then the lever.
The conditional Phase 2 fix (make the paged decode graph-capturable) is moot:
Phase 1 already showed the default paged decode captures, and the fresh re-check
below reconfirms the graph win is noise. Neither Phase 2 branch (within-step graph
fix / between-step host loop) is the lever; the lever is a third thing, measured
here.
## Fresh re-confirmation: graphs are not the lever
Independent run (npl128, ntg32, paged, fusion off), not reusing Phase 1's table:
| paged decode | S_TG t/s | vs vLLM 391 |
|---------------|----------|-------------|
| graphs ON | 146.03 | 37.3% |
| graphs OFF | 144.90 | 37.1% |
+0.78%, within noise - same verdict as Phase 1's 4-cell. The ON nsys rep is also
99.5% busy with the same ~3267 ms of memcpy as OFF: graphs capture the memcpy
nodes too, so they cannot remove either the copies or the compute.
## Correction 1: the model is a hybrid SSM, not a plain transformer
`q36-27b-nvfp4.gguf` has `general.architecture = qwen35` with
`qwen35.ssm.{conv_kernel,state_size,group_count,time_step_rank,inner_size}`. The
decode-window kernel cadence (per step, ~19.8 steps in the window) is 48
`gated_delta_net_cuda` + 48 `ssm_conv_f32` vs 16 `flash_attn_tile`, i.e. **48
gated-DeltaNet linear-attention layers : 16 full-attention layers** (a 3:1
hybrid, Qwen3-Next family). Paged attention only touches the 16 full-attention
layers.
## Correction 2: the 99.4% "busy" is ~19% D2D memcpy, not compute
Interval-union sweep over the steady decode window (last 17 s of the npl128/ntg24
OFF rep; single CUDA stream; running-max-end so it is overlap-correct):
| activity set | GPU busy | idle |
|------------------------|----------|-------|
| kernels only | 80.2% | 19.8% |
| kernels + memcpy (all) | 99.4% | 0.6% |
The 969 inter-kernel gaps (>=1 ms, ~48/step) that drop kernels-only to 80% are
filled by **D2D memcpy: 1584 copies/run (~80/step), ~230 MB each, ~2 ms each,
356 GB moved in 17 s**. At batch 128 a ~230 MB block is the gated-DeltaNet
recurrent state; these are the per-SSM-layer state copies. (HtoD copies = the
paged block-table/index upload: 731/run but only 3 ms total, negligible; DtoH
47 ms.) Phase 1's `cuda_gpu_trace`-based 99.4% counted these memcpys as "busy"
and lumped them into "GPU kernel compute" - they are memory movement, and they
are addressable.
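
The overlap-correct busy figure is an interval-union sweep. The following is a minimal sketch of that technique, not the SQL used for the table: `interval` and `union_busy` are hypothetical names, and timestamps are assumed to be integers (for example nanoseconds) with half-open `[start, end)` spans.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One GPU activity (kernel or memcpy) as a half-open span [start, end).
struct interval {
    int64_t start;
    int64_t end;
};

// Total time covered by at least one interval: sort by start, then sweep while
// tracking the running maximum end so overlapping spans are not double-counted.
int64_t union_busy(std::vector<interval> iv) {
    std::sort(iv.begin(), iv.end(),
              [](const interval & a, const interval & b) { return a.start < b.start; });
    int64_t busy    = 0;
    int64_t cur_end = 0;
    bool    have    = false;
    for (const interval & x : iv) {
        assert(x.end >= x.start);
        if (!have || x.start > cur_end) {
            busy   += x.end - x.start;   // disjoint from everything seen so far
            cur_end = x.end;
            have    = true;
        } else if (x.end > cur_end) {
            busy   += x.end - cur_end;   // only the part past the running max end
            cur_end = x.end;
        }
    }
    return busy;
}
```

Running it once over kernel spans only and once over kernels plus memcpy spans, then dividing each by the window length, gives the two rows of the table; the difference between them is time the GPU spent copying rather than computing.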
## Decode GPU-time decomposition (% of kernel+memcpy busy)
OFF/eager rep, steady window. `/step` = instances per decode step.
| share | activity | /step | role |
|-------|-----------------------------------|-------|-------------------------------|
| 23.4% | gated_delta_net_cuda | 48 | linear-attn recurrence |
| 21.9% | k_get_rows_float | 97 | SSM state / conv-state gather |
| 18.9% | MEMCPY DtoD | 80 | SSM recurrent-state copy |
| 15.5% | mul_mat_vec_q (nvfp4, ncols=1) | 48 | FP4 GEMV |
| 10.4% | mul_mat_q (nvfp4) | 352 | FP4 GEMM |
| 1.9% | quantize_mmq_nvfp4 | 448 | act requant for MMQ |
| 1.0% | concat_cont | 48 | SSM state glue |
| 0.8% | ssm_conv_f32 | 48 | SSM short conv |
| 0.7% | unary_gated_op silu | 112 | SSM gating |
| 0.4% | flash_attn_tile/_ext | 16 | FULL attention (paged) |
Grouped:
- gated-DeltaNet / SSM machinery (recurrence + get_rows gather + DtoD state copy
+ conv + gating glue): **~67% of decode**.
- FP4 matmul (GEMV + GEMM + requant + stream-k fixup): **~28%**.
- Full attention - everything paged attention optimizes: **~0.4%**.
## Verdict and scope of the real lever
1. CUDA graphs: not the lever (Phase 1, re-confirmed: +0.78%, noise). They capture
the memcpy too, so they cannot touch the copies or the compute.
2. Host loop: not the lever (true host idle in the union is 0.24%, ~41 ms/17 s).
3. FP4 GEMM: secondary, ~28%. Consistent with Track B P2a (making the FP4 GEMM 26%
faster left decode_agg flat) - it was never the long pole.
4. Paged / full attention: ~0.4% of decode. **No paged-attention change (graphs,
block-table stabilization, gather rewrite) can move decode_agg on this model**:
it optimizes under half a percent of the step. This is the structural reason
A.2, and the paged-decode track generally, cannot close the vLLM gap on
q36-27b: the model barely uses the path being optimized.
The throughput lever is the ggml **qwen35 gated-DeltaNet decode**. Per SSM layer
per step it re-materializes and D2D-copies the full recurrent state (~230 MB at
batch 128; ~80 copies/step, ~18 GB/step) and feeds the recurrence through ~2
`get_rows` gathers, so ~64% of decode (state copy + state gather + recurrence) is
SSM state plumbing. vLLM's gated-DeltaNet decode (the flash-linear-attention
`fused_recurrent_gated_delta_rule` path) keeps the state in place and fuses the
gather into the scan, avoiding both the per-layer D2D copy and the gathers.
Next-step scope (the real lever, to be done in the ggml/llama qwen35 SSM path -
not paged-attn, not a graph capture, not a block-table tweak):
1. Eliminate the per-layer recurrent-state D2D copy: update the state tensor
in place (or double-buffer / write-back), so the recurrence consumes and
produces the persistent state without a full-state copy each layer each step.
2. Fuse the `get_rows` state / conv-state gather into the recurrent kernel.
Ceiling from this rep (upper bound; assumes the work is fully removed, not just
overlapped):
- remove the DtoD state copy: reclaim 18.9% -> ~146 to ~180 t/s.
- remove copy + gather: reclaim ~41% -> ~146 to ~247 t/s, which puts llama within
~1.6x of vLLM 391 with the FP4 GEMM still untouched.
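
The ceiling figures follow from a simple upper-bound model: if a fraction of the step time is removed outright, throughput scales by the reciprocal of what remains. A minimal sketch, with `ceiling_tps` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Upper-bound throughput if a fraction `removed` of the per-step time vanishes
// entirely (not merely overlapped): new_tps = base_tps / (1 - removed).
double ceiling_tps(double base_tps, double removed) {
    assert(removed >= 0.0 && removed < 1.0);
    return base_tps / (1.0 - removed);
}
```

With the 146 t/s baseline, `ceiling_tps(146.0, 0.189)` lands near 180 t/s and `ceiling_tps(146.0, 0.41)` near 247 t/s, matching the two bullets above. Dividing vLLM's 391 by the second figure gives the quoted ~1.6x residual.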
No code patch in Phase 2 either: the lever is a gated-DeltaNet decode rewrite in
the SSM path, too large for this measurement pass and orthogonal to paged
attention. `patches/paged/0018` stays free. Evidence on the DGX:
`~/bench/a2_decompose/decode_decomp.txt` (per-kernel table + reproducing SQL in
its header), `~/bench/a2_decompose/SUMMARY.txt`, and the Phase-1 reps
`~/bench/a2_nsys/paged_off_npl128.sqlite` / `paged_on_npl128_node.sqlite`.
# A.2 final synthesis - the four-point verdict
All numbers measured on the DGX (GB10, sm_121, q36-27b-nvfp4 dense, fusion OFF,
`decode_agg` = `S_TG t/s`), npl 128 unless noted.
**1. CUDA-graph lever size (measured, not guessed).** +0.13% (4-cell, stock
ON-vs-OFF) to +0.78% (fresh paged re-check) at npl 128; +1.1% to +1.4% at npl 32.
All inside run-to-run noise. The earlier grounding GUESSED ~10-20% from a
94.6%-busy reading; direct measurement puts the steady decode at 99.4-99.5% busy,
so the real graph ceiling is < 1%, not 10-20%. The guess was wrong because the
busy-fraction it rested on was under-read (a graphs-ON nsys trace under-counts
GPU-busy unless `--cuda-graph-trace=node` is set - trap #2).
**2. Was "paged decode runs eager" fixed, and what is the decode_agg win?**
There was nothing to fix: the premise was false. At the benchmarked context the
DEFAULT in-kernel paged decode already captures and replays graphs, with a
256-token reset cadence identical to stock non-paged (10 complete / 8 reset over
~320 steps, resets clustered only at the 256/512 token boundaries). "graphs
reused=0" was a uid fast-path false negative, not eager execution (trap #1). The
only genuinely-eager path is the `LLAMA_KV_PAGED_GATHER=1` fallback (unpadded
index grows every step), which is not the default decode. Because graphs were
already engaged, the decode_agg win from "enabling" them is ~0 (+0.1% to +0.8%).
Graphs DID collapse within-step launch idle (0.37% -> 0.11%, ~80k -> ~15k
launches/run), but the GPU stays 99.4-99.5% busy, so throughput is unchanged.
**3. New llama %-of-vLLM @npl128.** Unchanged by A.2: 146-148.6 t/s vs vLLM 391 =
**37.3-38.0%**. Graphs ON vs OFF both land here (146.03 / 144.90 in the fresh
re-check; 148.41 / 148.21 in the 4-cell). A.2 did not move the percentage.
**4. Honest verdict - did A.2 move toward parity; residual + next lever.** No.
A.2 closed zero of the 2.6x gap, and it provably cannot on this model: paged /
full attention is ~0.4% of decode (16 full-attention layers vs 48 gated-DeltaNet
layers, a 3:1 hybrid SSM), so no graph / block-table / gather change to the paged
path can move decode_agg. The residual gap is structural and lives elsewhere:
~67% of decode is gated-DeltaNet / SSM state plumbing (23.4% recurrence + 21.9%
get_rows state gather + 18.9% D2D recurrent-state copy of ~230 MB per SSM layer
per step, ~18 GB/step), and ~28% is FP4 matmul (already shown secondary by Track
B: a 26%-faster GEMM left decode_agg flat). The within-step launch loop is solved
(graphs) and the between-step host loop is a 0.24% second-order floor hidden under
GPU compute - neither is the residual.
The next lever is NOT in this track. It is the ggml qwen35 gated-DeltaNet decode:
(1) eliminate the per-layer recurrent-state D2D copy (in-place / double-buffer
write-back), and (2) fuse the get_rows gather into the recurrent kernel - mirroring
vLLM's `fused_recurrent_gated_delta_rule`, which keeps the state in place and
fuses the gather. Measured ceiling on this rep: remove the copy -> ~146 to ~180
t/s; remove copy + gather -> ~146 to ~247 t/s (within ~1.6x of vLLM with FP4 GEMM
still untouched). That work is orthogonal to paged attention; `patches/paged/0018`
stays free.


@@ -1,107 +0,0 @@
# Additive layout for the paged-KV patch series - "hook, don't edit"
Goal: ship paged KV as a vendored patch series that **survives llama.cpp pin bumps with
minimal rebase pain**. PR #22569 (the upstream draft) was rejected by maintainers as
"slop" and is far too invasive to vendor - it rewrites core attention. Our series must be
the opposite: **additive**. This document is the design rule and the per-patch core-touch
budget.
## The rule
> Every change is either (a) **new code in a new vendored file** under `src/`, or (b) a
> **single, env-gated hook** at one call site in a core file that delegates to the new
> file. No logic lives in a core file. No core struct/signature is edited.
Why it works: a hook is a 1-3 line diff against a core file. When upstream churns that file,
`git apply` either still lands the hook (context unchanged) or fails *only on that tiny
hunk*, which is trivial to re-place. Logic embedded inside a core function (the PR #22569 /
old-0003 approach) conflicts on every bump and must be re-understood each time.
This is enforceable as a **core-touch budget**: each patch declares the core files it
touches and the line count; review rejects anything that grows logic in core.
## Why it's achievable here (grounded in the pinned source)
The two seams paged KV needs are both already abstract in llama.cpp at the pin
(`LLAMA_VERSION=f3e1828`), so new behavior plugs in without editing core types:
- **KV placement** - `llama_kv_cache::find_slot` already returns a `slot_info` of physical
cell indices. Paged placement is just *different indices*. 0002 already does this as one
gated block (`if (paged_mode) { ... continue; }`, 41 lines, one file). Ideal.
- **Graph inputs** - `llm_graph_input_i` is a pure-virtual base (`set_input()`), and
`llm_graph_result::add_input(llm_graph_input_ptr)` lets *any* code register a new input
subclass. So a paged graph input (the gather index) can be **a new class in a new file**,
added from a one-line hook - no edit to `llm_graph_input_attn_kv` or `llama-graph.h`.
## Per-patch core-touch budget
| # | Patch | New files (additive) | Core hooks (gated, minimal) | Core lines |
|---|-------|----------------------|------------------------------|-----------:|
| 0001 | vendor manager | `paged-kv-manager.{h,cpp}` | `CMakeLists.txt` +1 | 1 |
| 0002 | block placement | - | one `if(paged_mode){...continue;}` in `find_slot` | ~41 |
| 0003 | gather-read | `paged-attn.{h,cpp}` | `CMakeLists.txt` +1; **one** hook in `build_attn`; 2 tiny accessors on `llama_kv_cache_context` | ~8 |
| 0004 | on-demand alloc | (uses 0001 manager) | one branch in `find_slot` calling the manager | ~10 |
| 0005 | continuous batching | - | **LocalAI `grpc-server.cpp`** (already a LocalAI override, not a core patch) | 0 core |
| 0006 | prefix caching | (uses 0001 manager) | one hash-lookup hook in the 0004 alloc branch | ~6 |
Net core surface for the *entire* engine: `find_slot` (placement/alloc - where physical
cells are already chosen) + **one** line in `build_attn` + two accessors. Everything else
is new files or the LocalAI-side server loop.
## 0003 redesigned to the rule (replaces the 4-file-surgery plan)
The old `0003-gather-read-plan.md` edited `llama-kv-cache.{h,cpp}` + `llama-graph.{h,cpp}`
(including a field added to `llm_graph_input_attn_kv` and fill logic in its `set_input`).
The additive form removes the core-struct and core-`set_input` edits entirely:
**New file `src/paged-attn.{h,cpp}`** holds *all* logic:
- `class llm_graph_input_paged_gather : public llm_graph_input_i` - owns the `I32 [n_gather]`
gather-index tensor and a `const llama_kv_cache_context * mctx`. Its `set_input()` fills
the index with the sequence's used cells (`{ i in [0,n_kv) : !cells.is_empty(i) }`, the
same set the `kq_mask` keeps), in the canonical order.
- `paged_attn::gather(ctx0, res, mctx, v_trans, &k, &v, &kq_mask)` - when paged is active,
constructs that input via `res->add_input(...)`, and applies `ggml_get_rows` to `k`, `v`,
and the transposed `kq_mask` by the shared index (mask: `transpose -> get_rows ->
transpose`). When not active it returns immediately -> **stock path byte-identical**.
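A toy version of the gather-index fill and the row gather, on plain vectors rather than ggml tensors. `build_gather_idxs` and `get_rows` are illustrative helpers, not functions from the patch, and "canonical order" is assumed here to mean ascending cell index:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Gather index: positions of the non-empty cells in [0, n_kv), ascending.
std::vector<int32_t> build_gather_idxs(const std::vector<bool> & cell_empty) {
    std::vector<int32_t> idx;
    for (int32_t i = 0; i < (int32_t) cell_empty.size(); ++i) {
        if (!cell_empty[i]) {
            idx.push_back(i);
        }
    }
    return idx;
}

// Row gather in the spirit of ggml_get_rows: out[r] = rows[idx[r]].
std::vector<std::vector<float>> get_rows(const std::vector<std::vector<float>> & rows,
                                         const std::vector<int32_t> & idx) {
    std::vector<std::vector<float>> out;
    out.reserve(idx.size());
    for (int32_t i : idx) {
        out.push_back(rows[i]);
    }
    return out;
}
```

Applying the same `idx` to `k`, `v` and the mask keeps the three aligned, which is the property the shared index is there to guarantee.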
**Core hooks (the whole core diff for 0003):**
1. `src/llama-graph.cpp`, in `build_attn` right before `build_attn_mha` (~line 2357):
```cpp
paged_attn::gather(ctx0, res, mctx_cur, v_trans, &k, &v, &kq_mask); // no-op unless LLAMA_KV_PAGED
```
One line. No new field on `llm_graph_input_attn_kv`; the gather input is a *separate*
registered input, so `llama-graph.h` is untouched.
2. `src/llama-kv-cache.{h,cpp}`: two thin accessors on `llama_kv_cache_context` so the new
file can read the used-cell set without reaching into internals -
`uint32_t get_n_gather() const;` and `void get_gather_idxs(int32_t * dst) const;`
(delegate to `kv`/`sinfos[i_cur]`, mirroring the existing `get_n_kv` / `set_input_k_idxs`
pattern). ~8 lines total, no signature changes to existing methods.
3. `src/CMakeLists.txt`: `+ paged-attn.cpp`.
First cut: gate to **flash-attn + single-stream** (`GGML_ASSERT` otherwise) - the V-transposed
(non-FA) and multi-stream gathers are a localized follow-up entirely inside `paged-attn.cpp`,
no new core touch. Gate 0 stays the same: `diff` of greedy `llama-simple` output, stock vs
`LLAMA_KV_PAGED=1`, must be identical (attention is permutation-invariant over the gathered
KV set; `n_gather < n_kv` proves compaction, not identity).
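Why Gate 0 is expected to pass can be seen on a toy single-query attention with scalar values: a masked cell gets weight `exp(-inf) = 0`, so dropping it leaves the softmax-weighted sum unchanged. `attend` below is an illustrative helper, not the real kernel, and the real FA kernels may reorder accumulation, which is exactly what the `diff` gate checks.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Single-query softmax attention over scalar values: sum_i softmax(s)_i * v_i.
// Assumes at least one score is finite.
float attend(const std::vector<float> & scores, const std::vector<float> & values) {
    float m = -std::numeric_limits<float>::infinity();
    for (float s : scores) {
        m = std::fmax(m, s);
    }
    float num = 0.0f;
    float den = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        const float w = std::exp(scores[i] - m); // 0 for a masked (-inf) cell
        num += w * values[i];
        den += w;
    }
    return num / den;
}
```

Calling `attend` once on the full cell range with `-inf` scores on empty cells, and once on the compacted range, should give the same result while touching fewer cells in the second call.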
## Anti-drift practices (already in `README.md`, restated as policy)
- **Stacking patches, one concern each**, exported 1:1 from a dev branch via
`git format-patch`. On a pin bump, rebase the branch; only the conflicting small patch
needs a touch, and the failure names the exact step.
- **Default-off (`LLAMA_KV_PAGED`)** until each gate is green, so a partial series never
changes stock behavior - and the hooks compile to a no-op branch when the env is unset.
- **Dev tree:** `git worktree add <dev> <LLAMA_VERSION>` off any checkout that has the pin
(e.g. the existing llama.cpp clone), `git apply` the series, develop the next patch as one
commit, re-export. (Set up and verified for this pin during this work.)
## Status / next step
- 0001, 0002: done, additive, verified token-identical.
- 0003: **redesigned to the additive form above** (this doc). Dev tree at the pin with
0001+0002 applied is ready (`paged` branch). Remaining work is the focused
implement-and-verify block for `paged-attn.{h,cpp}` + the one `build_attn` hook, driven to
the token-identical Gate 0. That is a numerical-correctness task (mask/gather alignment,
FA-first), not a structural one - the structure is settled here.
- 0004-0006: follow the budget above; 0005 lands in LocalAI's `grpc-server.cpp` (no core
patch at all).


@@ -1,669 +0,0 @@
# ARCH_GENERALITY_AUDIT - llama-cpp-localai-paged backend
Source/build/gallery audit (no GPU, no hardware). Maps how arch-general the
paged backend's BUILD targeting is, and whether non-Blackwell / Metal / CPU
hosts get a working build.
## Section: backend-build-matrix (build targeting)
### 1. CUDA arch list: NOT Blackwell-only - it is the FULL upstream ggml default
There is NO explicit CUDA arch list anywhere in the paged build path:
- `.docker/llama-cpp-localai-paged-compile.sh` only injects
`-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}` *when* `CUDA_DOCKER_ARCH` is
non-empty (`if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]`).
- NO `backend-matrix.yml` row for `llama-cpp-localai-paged` sets
`CUDA_DOCKER_ARCH` (nor does any stock `llama-cpp` row). It is empty.
- `backend/cpp/llama-cpp/Makefile` (reused verbatim by the paged wrapper) sets
only `-DGGML_CUDA=ON` (+ `-DGGML_NATIVE=OFF`). It never sets
`CMAKE_CUDA_ARCHITECTURES` / `CUDA_DOCKER_ARCH`.
=> The compiled arch fan is whatever upstream llama.cpp / ggml-cuda picks by
default with `GGML_NATIVE=OFF` (the full multi-arch default, which includes
Blackwell sm_120 alongside the older archs ggml ships). This is IDENTICAL to
how the stock llama-cpp backend is targeted - the paged wrapper copies and reuses
the exact same Makefile + CMakeLists + prepare.sh, only forcing `LLAMA_PAGED=on`.
Consequence for NVFP4: the FP4-MMA kernel is compile-time gated *inside* the
ggml-cuda TU by `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer, sm_100
datacenter). Because the build emits the full arch fan (not a Blackwell-only
list), the sm_120 NVFP4-MMA path is present for Blackwell AND the older archs get
their own kernels (NVFP4 runs the non-MMA fallback at runtime on
Ampere/Ada/Hopper). NOTHING in the build pins this to GB10/sm_121. The binary is
arch-portable; only the NVFP4 *speedup* is Blackwell-specific, by kernel gating,
not by build targeting.
### 2. Variants built: CUDA + ROCm + SYCL + Vulkan + CPU (NOT CUDA-only)
`backend-matrix.yml` `include:` (Linux) ships, for `llama-cpp-localai-paged`,
one row per stock-llama-cpp Linux row (10 rows, lines 4889-5046):
- cublas CUDA 12.8 (linux/amd64)
- cublas CUDA 13.0 (linux/amd64)
- cublas CUDA 13.0 arm64 l4t (jetson)
- cublas CUDA 12.0 arm64 l4t (jetson)
- hipblas / ROCm 7.2.1 (linux/amd64) - AMDGPU_TARGETS = gfx908..gfx1201
- sycl_f32 (Intel oneAPI)
- sycl_f16 (Intel oneAPI)
- vulkan (linux/amd64)
- vulkan (linux/arm64)
- CPU (linux/amd64) + CPU (linux/arm64), built via the ggml
`CPU_ALL_VARIANTS` single-build (dlopen libggml-cpu-*.so by host CPU feature;
arm64 uses gcc-14 for SME).
So it is NOT CUDA-only. Per image, `compile.sh` builds: the accelerator variant
(or CPU_ALL_VARIANTS when BUILD_TYPE empty) + grpc-server + rpc-server.
### 3. THE GAP vs stock llama-cpp: NO Metal / Darwin build
This is the single build-targeting divergence:
- stock `llama-cpp` HAS a Darwin row in `includeDarwin`
(`-metal-darwin-arm64-llama-cpp`, line 5071) and a `metal:` capability key
(`metal: "metal-llama-cpp"`, index.yaml line 25).
- `llama-cpp-localai-paged` has ZERO entries in `includeDarwin` (grep past line
5048 = none) and NO `metal:` capability key in its meta-backend.
- There is NO `metal-*-llama-cpp-localai-paged` tag anywhere in
backend-matrix.yml or backend/index.yaml.
`scripts/changed-backends.js` already anticipates a future darwin paged row
(lines 78-81 map `backend === "llama-cpp-localai-paged"` to the C++ source dir),
but no such matrix row exists, so it is currently dead/forward-looking code.
Everything else (CUDA arch fan, ROCm gfx list, SYCL, Vulkan, CPU) matches stock
llama-cpp exactly.
### 4. Does a non-Blackwell / Metal / CPU host get a working build of THIS backend?
Meta-backend capabilities map (index.yaml lines 101-111):
default(cpu), nvidia(cuda12), intel(sycl-f16), amd(rocm), vulkan, nvidia-l4t,
nvidia-cuda-13, nvidia-cuda-12, nvidia-l4t-cuda-12/13. NO `metal:` key.
- Non-Blackwell NVIDIA (Ampere sm_80-86 / Ada sm_89 / Hopper sm_90 / datacenter
Blackwell sm_100): selects the SAME cuda12 / cuda13 image. That image is
compiled for the full arch fan, so it RUNS. NVFP4 falls back to the non-MMA
path on pre-Blackwell; on sm_100 it gets FP4-MMA but is compute-bound (HBM3e),
not the LPDDR5x-bound GB10 regime the patches were tuned for. WORKS, just
without the GB10-specific bandwidth win.
- AMD / Intel / Vulkan / CPU (amd64 + arm64) Linux hosts: each has its own
matching variant in the map + matrix. WORKS.
- Metal / macOS Apple Silicon: NO `metal:` key and NO darwin build. Capability
resolution falls back to `default` = `cpu-llama-cpp-localai-paged`, which is a
Linux (amd64/arm64) image, NOT a macOS-native build, so it will NOT run on
macOS. And because this is a SEPARATE meta-backend, it does NOT fall through to
the stock `llama-cpp` backend - a Mac user who explicitly selects
llama-cpp-localai-paged gets a non-running selection and must manually pick the
stock llama-cpp backend instead. DOES NOT WORK on Metal/macOS; no auto-fallback
to stock.
## Verdict (build-targeting)
- Arch-general on Linux: YES. The build is NOT Blackwell-only; it targets the
exact same full CUDA arch fan + the same ROCm/SYCL/Vulkan/CPU variant set as
stock llama-cpp. Any Linux host that can run stock llama-cpp can run THIS
backend; the NVFP4 speedup is the only Blackwell-gated piece, and that gating
is inside the kernel, not in the build matrix.
- Single gap: NO Metal/Darwin variant and NO `metal:` capability key. macOS /
Apple Silicon hosts have no working build of this backend and do not auto-fall
to stock llama-cpp. To close the gap, add an `includeDarwin` row
(`-metal-darwin-arm64-llama-cpp-localai-paged`, mirroring the stock llama-cpp
darwin row + the C++ source build path that changed-backends.js already
anticipates) and a `metal:` key to the paged meta-backend. (Note: NVFP4 has no
Metal MMA path, so a Metal build would deliver paged-KV behaviour only, no
NVFP4 acceleration - still a correctness/availability win over the current
broken selection.)
## Section: gguf-gallery-targeting (NVFP4 portability + hardware gating)
### 1. NVFP4 GGUFs LOAD + RUN on non-Blackwell - runs-via-dequant, NOT FP4-MMA-required
The published GGUFs use `file_type` MOSTLY_NVFP4 / `GGML_TYPE_NVFP4` (type id 40).
This is a standard ggml block-quant type with FULL software dequant + matmul
coverage across every backend, NOT a Blackwell-only format. Verified against the
paged backend's pinned ggml source (pin 0a2677c6, same upstream as stock
llama-cpp):
- CPU (any arch, amd64 + arm64): full support, no special hardware.
- `ggml/src/ggml-cpu/quants.c`: `quantize_row_nvfp4` (from_float) +
`ggml_vec_dot_nvfp4_q8_0_generic` (the matmul dot product), dequant via the
`kvalues_mxfp4` lookup table. Registered in the CPU type-traits table
(`ggml-cpu.c` line 283: `[GGML_TYPE_NVFP4] = { .from_float=..., .vec_dot=... }`).
- NVFP4 handled in all the CPU op switches (`ops.cpp` lines 674, 1125, 1255,
4424, 4701, 4925, 5651). LOADS + RUNS correctly on a pure-CPU host, just slow.
- CUDA, NON-Blackwell (Pascal/Volta/Turing/Ampere sm_80-86 / Ada sm_89 /
Hopper sm_90): RUNS correctly via the integer-quantized matmul paths, no
FP4-MMA needed.
- `convert.cu` registers `dequantize_row_nvfp4_cuda` as both the to_float and
to_fp16 dequant kernel (lines 759, 814) - the generic dequant->GEMM path.
- `mmvq.cu`: `vec_dot_nvfp4_q8_1` (DP4A integer dot, works on any GPU with
dp4a, i.e. Pascal sm_61+). This is the decode (gemv) path.
- `mmq.cuh`: NVFP4 has a `MMQ_DP4A_TXS_Q8_0_16` DP4A tile AND a separate
`MMQ_MMA_TILE_X_K_NVFP4` tile explicitly commented "NVFP4 Generic" (line
222), DISTINCT from `MMQ_MMA_TILE_X_K_FP4` "MXFP4 and NVFP4 Blackwell" (line
221). So there are three tiers: DP4A (oldest), generic-MMA (Turing+), and
Blackwell-native FP4-MMA.
- The Blackwell path is a runtime FLAG, not a requirement:
`mmq.cu` line 125 `const bool use_native_fp4 = blackwell_mma_available(cc)
&& (... NVFP4)`. When false (non-Blackwell), it falls through to the generic
quantized kernel. Grep for any abort/unsupported on NVFP4+blackwell = NONE.
No `GGML_ABORT`, no garbage - just the non-MMA kernel.
- Vulkan: has `dequant_nvfp4.comp` + NVFP4 in `ggml-vulkan.cpp` / dequant_funcs;
LOADS + RUNS on Vulkan hosts (AMD/Intel/NVIDIA) via dequant.
- Metal: NVFP4 referenced only in `ggml-metal-device.m` (type registration /
size), NO Metal NVFP4 compute kernel. On Apple Silicon NVFP4 tensors would
fall back to the CPU backend op-by-op (correct but slow) IF a Metal build
existed - which for THIS backend it does not (see build-targeting Section 3).
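The three CUDA tiers plus the dequant fallback can be summarised as a selection function. This is a sketch of the decision order only: `device_caps`, `nvfp4_tier` and `pick_nvfp4_tier` are hypothetical names, and the capability flags stand in for the real compute-capability checks (such as `blackwell_mma_available(cc)`), whose exact thresholds are deliberately not reproduced here.

```cpp
#include <cassert>

// Which NVFP4 matmul path a device would take, fastest first.
enum class nvfp4_tier { native_fp4_mma, generic_mma, dp4a, dequant_gemm };

// Hypothetical capability summary for one device.
struct device_caps {
    bool has_dp4a;          // integer dot product available
    bool has_generic_mma;   // tensor-core MMA usable by the generic NVFP4 tile
    bool has_blackwell_fp4; // native FP4-MMA available
};

// The native path is an optimisation flag, not a requirement: every device
// ends up on some tier, and none of them is an abort.
nvfp4_tier pick_nvfp4_tier(const device_caps & d) {
    if (d.has_blackwell_fp4) { return nvfp4_tier::native_fp4_mma; }
    if (d.has_generic_mma)   { return nvfp4_tier::generic_mma; }
    if (d.has_dp4a)          { return nvfp4_tier::dp4a; }
    return nvfp4_tier::dequant_gemm;
}
```

The property the audit relies on is that the function is total: there is no capability combination that leaves NVFP4 without a kernel.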
Bottom line: the NVFP4 GGUFs are PORTABLE. A Hopper/Ada/Ampere/CPU/Vulkan host
loads and runs them correctly (bit-faithful dequant), just WITHOUT the FP4-MMA
speedup. FP4-MMA is a Blackwell-only performance tier layered on top of a
fully-general software path, NOT a load/run gate. Off-Blackwell = runs-via-dequant,
correct-but-slow; never fail/garbage.
### 2. Gallery hardware-targeting GAP: nothing stops a non-Blackwell user
The 6 -paged entries declare NO machine-readable hardware targeting. The only
Blackwell signal is free prose in `description:` ("native Blackwell NVFP4
(FP4-MMA)", "Benchmarked on GB10 / DGX Spark") and a `nvfp4` string in `tags:`.
How LocalAI's gallery CAN express hardware gating (what exists):
- `tags:` are FREE-TEXT, search-only. `core/gallery/gallery.go` line 89 just does
`strings.Contains(lower(join(tags)), term)` for the search box + line 128
collects them for filter chips. There is NO tag that gates install or warns;
the `nvfp4` tag is purely discoverability.
- The model `ModelConfig` struct (`core/gallery/models.go`) has only
Description/Icon/License/URLs/Name/ConfigFile/Files/PromptTemplates. There is
NO capabilities / requirements / hardware field at the MODEL level. (Signing
`verification:` is the only structured gate, unrelated to hardware.)
- The `capabilities:` map (default/nvidia/intel/amd/metal/vulkan/...) is a
BACKEND-level concept in `backend/index.yaml` (paged entry lines 100-111). It
selects the backend IMAGE by detected accelerator FAMILY (nvidia vs amd vs
metal vs cpu). Crucially it does NOT and CANNOT distinguish Blackwell sm_120/121
from older NVIDIA - `nvidia: cuda12-llama-cpp-localai-paged` is served to ANY
NVIDIA GPU. There is no sub-nvidia (microarch) gating mechanism in the gallery
or the backend capability resolver.
So the gating gap is real: a non-Blackwell user browsing the gallery is offered
the NVFP4 entries with no machine-readable signal that they will run far below
the advertised "90-117% of vLLM" numbers (those numbers are GB10/LPDDR5x-bound
specific). It will install and run correctly, just slowly, and the bench claims
in the description will not hold.
### 3. How to express Blackwell-targeting (recommendation)
Given there is no microarch-gating primitive, the honest options are, in order:
a. DESCRIPTION + TAG (only thing available today, zero code): the entries already
say "native Blackwell NVFP4 (FP4-MMA)" - tighten it to a leading one-line
"Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other
NVIDIA/CPU via NVFP4 dequant but WITHOUT the FP4-MMA speedup and below the
quoted GB10 throughput." Add a `blackwell` tag alongside `nvfp4` for the
filter chip. This is the existing convention (other entries use free prose +
`nvidia` tag, e.g. line 2395; quant trade-offs are described in prose, e.g.
the Gemma "Mobile-optimized" notes lines 1312/1366). No other gallery entry
today encodes a GPU-microarch requirement, so prose is the de-facto standard.
b. If a structured signal is wanted, it would need a NEW field (e.g. a
`recommended_hardware` / `requires` note surfaced by the React UI import
dialog) - that is a feature, not a config tweak, and does not exist yet.
c. The `nvfp4` tag should at minimum be present on ALL six entries - the four
Qwopus/Qwen-MTP entries at lines 819/854/890 have only `[llm, gguf]` tags and
omit `nvfp4`, so they are not even discoverable/filterable as NVFP4, despite
being NVFP4 GGUFs. Inconsistent tagging is a secondary gap.
Verdict (gallery-targeting): NVFP4 GGUFs are safe to ship broadly (they run
everywhere via dequant, never fail), so the risk is PERFORMANCE-EXPECTATION, not
correctness. LocalAI has no microarch gating primitive; the only lever is the
description + tags. Recommend a one-line Blackwell-recommended hardware note +
consistent `nvfp4`/`blackwell` tags on all six, and tempering the GB10 bench
claims with the "runs slower off-Blackwell" caveat.
## Section: optimization-generality (patches 0013/0016 + 0017-0029)
Classifies each optimization as arch-GENERAL (ship everywhere, helps any arch),
GB10-TUNED (needs per-arch retuning of the magnitude/constants), or
Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists). Read from the
patch commit bodies + the diffs they touch; bit-exactness verdicts are the
patches' own md5/test-backend-ops gates.
Arch axis used: NVFP4 FP4-MMA needs `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer
+ sm_100 datacenter); Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 have none;
Metal/CPU/AMD/Intel have no NVFP4-MMA. Datacenter Blackwell sm_100 has FP4-MMA but
HBM3e (~8 TB/s) so it is COMPUTE-bound, not bandwidth-bound: every GB10
"bandwidth-bound" verdict inverts there. The FP4-MMA kernel itself is UPSTREAM
ggml-cuda gated by `BLACKWELL_MMA_AVAILABLE`; none of these patches add it - they
reshape/route/dedup around it (0017/0020/0023/0025) or are precision-agnostic.
### A. ARCH-GENERAL (ship everywhere; pure win or provably neutral)
Graph-shape, host-side, or gather/copy-elimination changes. No FP4, no
bandwidth-floor assumption. Bit-exact. Help or are neutral on any arch that runs
the code path.
- 0013 decoupled prefill-token budget - pure `update_slots()` scheduler policy,
zero libllama/ggml change, orthogonal to LLAMA_KV_PAGED, default-off
byte-identical. Latency/fairness lever (flattens decode-ITL spike from a
co-batched long prefill). No arch assumption.
- 0016 dynamic decode-first prefill budget - supersedes 0013; still pure
`update_slots()` policy, default-off byte-identical, T==n_batch degenerate case
== stock. Arch-neutral, identical paged on/off.
- 0024 paged-pool burst-reclaim - host-side block accounting + defrag + slot
release; never touches KV values or compute. Gated behind LLAMA_KV_PAGED. Fixes
a real fragmentation/throughput-collapse bug on long-lived servers.
Arch-independent host bookkeeping.
- 0029 block-table within-step host cache - memcpy-reuse of the host block table
across full-attention layers in one step; bit-exact, LLAMA_PAGED_NO_BT_CACHE=1
off. Helps host-bound decode (dense +2.7% on GB10), neutral when compute-bound
(MoE flat). The faster the GPU (e.g. sm_100), the MORE host-bound decode becomes, so
this win should be BIGGER off GB10.
- 0028 recurrent-state (conv-tap) gather fusion - eliminates a k_get_rows by
reading cache[ids[s]] in-kernel; bit-identical. It deletes a gather that vLLM has
no equivalent of, so it is a win on ANY arch running the GDN path; not FP4, not
bandwidth-floor specific.
- 0018 in-place SSM-state write-back + 0019 fused SSM-state gather + 0021
conv-state in-place fusion - remove a D2D state copy-back (0018), a state
get_rows (0019), and the 4-op conv chain + ring-state copy (0021), mirroring
vLLM's in-place recurrent update. Arithmetic byte-identical; what is removed is
plumbing, so wins on ANY arch running the gated-DeltaNet recurrence.
- Paged KV core (0001-0012) - paged KV manager, on-demand alloc, prefix caching,
in-kernel paged read. No precision or bandwidth-floor assumption; the most
portable part of the work, helps capacity/serving anywhere it compiles.
NOTE: 0018/0019/0021/0028 + the base GDN op have CUDA + CPU kernels ONLY (every
gate is "CUDA0 vs CPU"). General within {CUDA, HIP/ROCm (hipified ggml-cuda), CPU};
NOT covered on Metal/SYCL/Vulkan - see SAFETY #1.
### B. GENERAL-IN-DIRECTION but the MAGNITUDE was measured on the GB10 floor
Correct + beneficial everywhere, but the specific %/constants are GB10-bound.
- 0020 o_proj GDN MMVQ->MMQ reshape - collapses the GDN output to 2D so the
ssm_out matmul sees src1->ne[1]=128 and routes to MMQ (M=128 GEMM that amortizes
the weight read across 128 tokens) instead of MMVQ (built for batch<=8 with the
128 sequences stuck in ne[2]). Zero-cost view change, bit-identical, gated to the
gated-DeltaNet path. UNIVERSAL: MMVQ (mul_mat_vec_q) is structurally a batch<=8
GEMV and cannot amortize the weight read at a real M=128; MMQ does, on ALL CUDA
archs (dp4a pre-tensor-core still amortizes) and on HIP. So MMQ > MMVQ at M=128
is NOT GB10-specific - pure win wherever MMQ exists. RE-TUNE: the +31.7%
magnitude is on the GB10 BW floor; smaller % on sm_100 but still correct.
REGRESSION RISK: only at a genuinely tiny real M (single-stream decode n_seqs<=8)
could forcing MMQ be slower than MMVQ - see SAFETY #2. On Metal/Vulkan/SYCL the
MMVQ/MMQ split differs; the reshape is harmless (a view) but yields no benefit.
- 0023 MoE NVFP4 activation-quantize de-dup - for broadcast up/gate proj (ne11==1)
quantize the unique token activations once and gather the identical FP4 blocks
instead of re-quantizing per expert; bit-exact, ..._DEDUP=0 off.
DIRECTION-GENERAL (de-duplicating identical work is always good) but
NVFP4-block-layout specific (uint4 copy of block_fp4_mmq) and only matters where
activation-quant is a measurable decode bucket - on a compute-bound arch the
saved quant time may be off the critical path (even on GB10 the MoE TG win is
only +1.7%).
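The 0020 routing effect can be illustrated with shapes alone. The sketch assumes, as the note above states, that the kernel choice keys off `src1->ne[1]` with MMVQ covering batch<=8; `route` and `collapse_to_2d` are hypothetical helpers and the real dispatch has more conditions than this.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

enum class matmul_kernel { mmvq, mmq };

// From the note above: MMVQ is built for batch <= 8.
constexpr int64_t MMVQ_MAX_BATCH = 8;

// Simplified routing: looks at ne[1] only.
matmul_kernel route(const std::array<int64_t, 4> & src1_ne) {
    return src1_ne[1] <= MMVQ_MAX_BATCH ? matmul_kernel::mmvq : matmul_kernel::mmq;
}

// Collapse the trailing dims into ne[1]. A pure view: element count unchanged.
std::array<int64_t, 4> collapse_to_2d(const std::array<int64_t, 4> & ne) {
    return std::array<int64_t, 4>{{ ne[0], ne[1] * ne[2] * ne[3], 1, 1 }};
}

// Total number of elements described by a shape.
int64_t n_elements(const std::array<int64_t, 4> & ne) {
    return ne[0] * ne[1] * ne[2] * ne[3];
}
```

With 128 sequences parked in `ne[2]` the simplified router sees `ne[1] == 1` and picks MMVQ; after the collapse it sees 128 and picks MMQ, with no data movement.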
### C. GB10-TUNED (constants are GB10 winners; re-sweep per arch)
- 0022 GDN recurrence occupancy/coalescing retune - column-folding template params
NUM_WARPS/COLS_PER_WARP, default (16,8), env-selectable GDN_NW/GDN_CPW. The
reduction/FMA order is byte-identical (md5-gateable); only the warp/block->column
assignment changes to raise memory-level parallelism on a BANDWIDTH-BOUND kernel.
(16,8) is explicitly "the measured GB10 winner". Textbook per-arch tune: optimal
values depend on DRAM latency / L2 / SM count / occupancy. SAFE everywhere
(bit-exact, env-overridable, no forbidden float4 load) but unlikely optimal off
GB10; on a compute-bound arch (sm_100) the kernel may not even be the
bottleneck. Needs a per-arch GDN_NW/CPW sweep.
- 0017 FP4 dense-GEMM decode tile tune - shipped as a P0 bit-exact gate + DEFAULT-
OFF occupancy levers (GGML_CUDA_FP4_MMQ_Y / ..._MINBLOCKS / ..._DENSE_MMQ_X).
Honest GB10 verdict was a KILL-GATE: every cheap occupancy probe REGRESSED on
sm_121 (the M=128 tile is already weight-read optimal). Nothing on by default =>
byte-identical to stock everywhere. On a DIFFERENT FP4 arch (sm_100) the
kill-gate could flip; the levers are in place and inert, ready to re-sweep.
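A per-arch sweep needs the (NUM_WARPS, COLS_PER_WARP) pair resolved from the environment with a safe fallback. The sketch below assumes a 32-thread warp and a 1024-thread block limit, and uses hypothetical helper names; the real patch selects among compiled template instantiations, so only pairs that were instantiated would actually be selectable.

```cpp
#include <cassert>
#include <cstdlib>

struct gdn_launch {
    int num_warps;
    int cols_per_warp;
};

constexpr int WARP_SIZE             = 32;   // assumption: NVIDIA warp width
constexpr int MAX_THREADS_PER_BLOCK = 1024; // assumption: common block limit

// Parses a positive integer, or returns the fallback on any malformed input.
int parse_or(const char * s, int fallback) {
    if (s == nullptr || *s == '\0') {
        return fallback;
    }
    char * end = nullptr;
    const long v = std::strtol(s, &end, 10);
    return (*end == '\0' && v > 0 && v <= MAX_THREADS_PER_BLOCK) ? (int) v : fallback;
}

// Resolves the launch geometry; falls back to the (16,8) default when the
// requested warp count would exceed the block limit.
gdn_launch resolve_gdn_launch(const char * nw_env, const char * cpw_env) {
    gdn_launch cfg = { parse_or(nw_env, 16), parse_or(cpw_env, 8) };
    if (cfg.num_warps * WARP_SIZE > MAX_THREADS_PER_BLOCK) {
        cfg = gdn_launch{16, 8};
    }
    return cfg;
}
```

A sweep would call this with `std::getenv("GDN_NW")` and `std::getenv("GDN_CPW")` and record throughput per pair; the default resolves to 16 x 32 = 512 threads per block.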
### D. Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists)
- 0025 NVFP4 MoE-decode re-graph - keeps CUDA graphs on for the grouped
stream-k mul_mat_q id-path; env-gated LLAMA_MOE_FORCE_GRAPHS, default-off
byte-identical. The CUDA-graph mechanism is general, but the specific guard
condition (mmvq_mmid_max==8 for NVFP4 on sm_121) and the "graphs are safe here"
reasoning are tied to the NVFP4 grouped path on Blackwell. On a non-FP4 arch the
node would not take that branch -> inert.
- 0026 hybrid per-head SSM-state precision (bf16 SSM/conv cache) - adds
--cache-type-ssm/-conv + --ssm-bf16-tau (per-head f32-vs-bf16 by memory length).
Default f32 = bit-exact. PRECISION/bandwidth lever: bf16 halves the dominant GDN
decode byte stream, which only pays off on a BANDWIDTH-bound arch (GB10). On
sm_100 HBM3e it buys little. Value is bandwidth-floor specific; correctness is
precision-specific (opt-in, default-safe).
- NVFP4 GGUFs + the 6 gallery -paged rows - inherently Blackwell-precision-specific
for the FAST path: NVFP4 weights only get FP4-MMA on sm_120/121/100. Elsewhere
they run-via-dequant (correct, slow) per the gallery-targeting section above.
### Per-arch expected story
- Consumer Blackwell sm_120/121 (GB10 / dGPU): the validated target. dGPU sm_120
(GDDR7 ~1 TB/s) is less BW-starved than GB10 LPDDR5x 273 GB/s, so the
bandwidth-floor wins (0018/0019/0022/0026) shrink in % while the host-pipeline +
graph wins (0029/0025) and the MMQ reshape (0020) hold.
- Datacenter Blackwell sm_100 (HBM3e ~8 TB/s): FP4-MMA WORKS so NVFP4 stays fast
(precision bucket + 0025 carry over), BUT the BW floor is GONE -> compute-bound.
Re-tune: 0022 GDN_NW/CPW sweep; 0017 kill-gate may flip (levers ready). The
bandwidth-motivated wins (0018/0019, 0026 bf16-state) shrink toward neutral; the
host-pipeline/graph/MMQ-reshape general wins (0029/0025/0020) still help. Net:
works, faster GPU, needs a re-tune pass, do NOT assume the GB10 constants.
- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86: NO FP4-MMA. NVFP4 GGUFs + the FP4
levers (0017/0023/0025) are out of scope -> use a DIFFERENT quant (Q4_K/AWQ/GPTQ
etc). BUT the precision-agnostic infra still helps: paged KV core, scheduler
(0013/0016), burst-reclaim (0024), block-table cache (0029), SSM/conv
plumbing-removal (0018/0019/0021/0028); 0020 still routes the o_proj
MMVQ->MMQ in whatever quant it uses. Story: "no FP4 -> another quant, but paged +
SSM + scheduler infra is a pure win".
- Metal / CPU / AMD-ROCm / Intel-SYCL / Vulkan (all built by the matrix): no
NVFP4-MMA; paged KV + scheduler infra is the portable value. CPU has reference
kernels for every fused op (the bit-exact gate is CUDA0-vs-CPU). ROCm/HIP reuses
ggml-cuda (hipify) so it inherits the fused-op kernels. Metal/SYCL/Vulkan do NOT
get the new fused-op kernels (SAFETY #1).
### SAFETY / regression risks
1. Fused GDN/conv ops are CUDA + CPU only; emission is NOT backend-gated.
0018/0019/0021/0028 add new ops (ggml_gated_delta_net_inplace[_ids],
ggml_ssm_conv_update_inplace[_ids]) implemented for CUDA + CPU only. They are
emitted by the graph builder whenever cparams.fused_gdn_ar/fused_gdn_ch is set
(constructor sets fused_gdn_ch=true, auto_fgdn=true), with NO check that the
active compute backend is CUDA/HIP. Fine on CUDA/HIP/CPU. On Metal/SYCL/Vulkan
two failure modes: (a) the new GATED_DELTA_NET op has no kernel -> loud
supports_op/assert (but the base GDN op may already be CUDA/CPU-only upstream,
so a qwen35 model likely cannot run there regardless); (b) the fused conv
variant REUSES GGML_OP_SSM_CONV discriminated by a non-null src[3]/src[4] - a
backend that supports plain SSM_CONV but ignores the discriminator would compute
the WRONG plain conv -> SILENT corruption. That is the one genuine
silent-correctness risk. RECOMMENDATION: gate fused-op emission on the compute
backend being CUDA/HIP (or add a supports_op guard rejecting the discriminated
SSM_CONV where the fused handling is absent).
2. 0020 MMVQ->MMQ at tiny real M. MMQ is right at decode M=128 (the gallery
batched-serving regime); it would be wrong only at a genuine M<=8 (single-stream
decode, n_seqs=1). Bit-identical either way - only a potential perf regression
at tiny batch on non-GB10 archs (never the measured GB10 case). Worth confirming
the reshape still picks the better kernel at n_seqs=1 elsewhere.
3. 0022 default (16,8) off GB10: safe (bit-exact, env-overridable) but non-optimal;
do not ship it as the default for sm_100/Hopper/Ada without a GDN_NW/CPW sweep.
No correctness risk.
4. Gallery rows do not state a GPU-arch requirement (covered in the
gallery-targeting section): add a Blackwell (sm_120/121/100) recommended note.
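The guard recommended in SAFETY #1 could look roughly like the following. It is a sketch on stand-in types (`tensor_sketch`, `op_kind` are not ggml types) and assumes, per the text above, that the fused variants are marked by a non-null `src[3]` or `src[4]`:

```cpp
#include <cassert>

enum op_kind { OP_SSM_CONV, OP_OTHER };

// Stand-in for a graph node: an op plus up to five source pointers.
struct tensor_sketch {
    op_kind               op;
    const tensor_sketch * src[5];
};

// The fused conv variants reuse the SSM_CONV op and are told apart
// only by the extra sources being present.
bool is_fused_ssm_conv(const tensor_sketch & t) {
    return t.op == OP_SSM_CONV && (t.src[3] != nullptr || t.src[4] != nullptr);
}

// A backend without the fused handling must reject the discriminated form
// instead of silently running the plain conv on it.
bool supports_op_sketch(const tensor_sketch & t, bool backend_has_fused_conv) {
    if (is_fused_ssm_conv(t)) {
        return backend_has_fused_conv;
    }
    return true; // plain SSM_CONV and unrelated ops: out of scope for this guard
}
```

Rejecting the node turns the silent-corruption case into a visible "unsupported op" outcome, which is the same failure class as the missing GDN kernel.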
### One-line verdict
The PORTABLE core (paged KV 0001-0012, scheduler 0013/0016, burst-reclaim 0024,
block-table cache 0029, the SSM/conv plumbing-removal 0018/0019/0021/0028, the
o_proj MMQ reshape 0020) is arch-general and ships everywhere it compiles -
bit-exact, mostly default-safe, pure win or neutral. The FP4/NVFP4 levers
(0017/0023/0025, the GGUFs, the gallery) are Blackwell-precision-specific. The
occupancy/precision tunes (0017 levers, 0022, 0026) are GB10-bandwidth-floor-tuned
and need a per-arch re-sweep (especially on sm_100 where the BW floor is gone and
the regime flips to compute-bound). The single real SAFETY gap: the new fused
GDN/conv ops are CUDA+CPU-only with backend-ungated emission, so a Vulkan/SYCL/Metal
paged build of a gated-DeltaNet model could assert (GDN op) or silently miscompute
(discriminated SSM_CONV) - it should be compute-backend-gated.
## Section: patch-arch-safety (build-break / miscompile classification, 0017-0029)
This section is the narrow safety read: for EACH patch, does it (a) compile and
behave correctly on every build target, (b) compile only under
BLACKWELL_MMA_AVAILABLE with a fallback elsewhere, or (c) RISK a build break /
miscompile / crash on a non-Blackwell arch. Class letters here are
build-safety classes, distinct from the perf-generality buckets above. Note 0027
does not exist (numbering gap). The dispositive build facts: the backend is built
for CUDA 12/13, L4T arm64, ROCm/hipblas, SYCL f32/f16, CPU (amd64+arm64), Vulkan -
and NOT for darwin/Metal (no includeDarwin row), and the CUDA build emits the full
multi-arch fan (CUDA_DOCKER_ARCH unset; Dockerfile documents e.g. `75;86;89;120`),
so every .cu TU MUST already compile for non-Blackwell SASS.
Method: grepped every added line in 0017-0029 for arch-specific tokens
(BLACKWELL/__CUDA_ARCH__/sm_NNN/cp.async/ldmatrix/mma./asm volatile/cc gates).
The ONLY hits are in 0017 (all correctly `#if`-gated) and free-text comments. No
SSM/conv/GDN kernel in the series uses a Blackwell intrinsic or a hardcoded
sm_12x launch geometry.
| patch | class | build-safety note |
|-------|-------|-------------------|
| 0017 fp4-gemm-decode-tile-tune | (b) GATED | only Blackwell-specific patch; NVFP4 mmq_y/min-blocks levers behind `#if defined(BLACKWELL_MMA_AVAILABLE)` + `blackwell_mma_available(cc)` + `type==GGML_TYPE_NVFP4`, ALL default-off => default build byte-identical to stock on every arch. `get_mmq_y_device<type>()` templating has a default arg keeping stock behaviour for non-NVFP4. Builds on sm_80-90 (body stripped). |
| 0018 ssm-decode-inplace-state | (a) general | plain in-place GDN state write-back, no intrinsics; CPU mirror in ggml-cpu/ops.cpp. |
| 0019 ssm-decode-fused-gather | (a) general | `gdn_gather_nonident_kernel` = plain `<<<n_seqs,256>>>`; CPU mirror added. |
| 0020 gdn-oproj-mmq-reshape | (a) general | host-side reshape_2d in qwen35*/qwen3next.cpp, no device code. |
| 0021 conv-state-inplace-fusion | (a) general | new op reuses GGML_OP_SSM_CONV (4th src discriminator), no new enum => no ggml-cpu.c switch needed; `ssm_conv_update_f32` plain portable CUDA (threads=128, templated d_conv); CPU mirror + test case. |
| 0022 gdn-recurrence-occupancy-retune | (a) general | template NUM_WARPS/COLS_PER_WARP/MIN_BLOCKS; new default (16,8) = 512 thr/block, MIN_BLOCKS=2, within the 1024 limit on sm_70..120 and AMD; bit-exact for any (NW,CPW). NOT Blackwell-gated and NOT a break - just a GB10-tuned default applied everywhere (see SAFETY #3 above). |
| 0023 moe-nvfp4-quant-dedup | (a) general | `gather_mmq_fp4` = plain uint4 byte-copy kernel; reached ONLY inside the pre-existing `if (use_native_fp4)` branch (Blackwell-only at runtime) and uses `block_fp4_mmq`, a type that already compiles for the full arch fan pre-0023. Adds no new arch surface. |
| 0024 paged-pool-burst-reclaim | (a) general | pure host C++. |
| 0025 moe-nvfp4-decode-regraph | (a) general | host-side ggml-cuda.cu graph-guard relaxation, env-gated `LLAMA_MOE_FORCE_GRAPHS` default-off => byte-identical; predicate is runtime cc-generic. |
| 0026 hybrid-perhead-ssm-state | (a) general | mostly host plumbing; GDN kernel = same portable column-folded code; fill.cu instantiates `fill_kernel<nv_bfloat16>` (bf16 STORAGE-only, fine on all targeted arches; bf16-compute SSM plan is SHELVED so STATE_T stays f32 on the hot path). LOW-RISK verify item: confirm no bf16-arithmetic GDN instantiation reaches sm_75 if sm_75 ships. |
| 0028 recurrent-state-gather-fusion | (a) general | new op reuses GGML_OP_SSM_CONV (ids src + rs_head); `ssm_conv_gather_nonident_kernel` plain portable CUDA; CPU mirror + test cases. |
| 0029 blocktable-within-step-cache | (a) general | pure host C++ + host-timing instrumentation. |
### Specific lines that carry the only conditional/risk surface
- 0017: the ONLY correctly-gated arch surface:
  - `get_mmq_y_host`: `if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc))`
  - `get_mmq_y_device<type>()` / `mmq_get_min_blocks_device<type>()`: bodies inside `#if defined(BLACKWELL_MMA_AVAILABLE)`.
  All default to the stock value, so a default build is byte-identical everywhere.
- 0023: the gather kernel is default-on (GGML_CUDA_MOE_QUANT_DEDUP=1), but the call site
is `if (moe_quant_dedup && ne11 == 1)` strictly inside `if (use_native_fp4)`; on
non-Blackwell `use_native_fp4` is false, so the dedup never executes.
- 0022: the GB10-tuned launch geometry is `GDN_DEFAULT_NW 16` / `GDN_DEFAULT_CPW 8`
(=> 512 threads, MIN_BLOCKS=2). This is the closest thing to a "hardcoded for
GB10" launch config, but it is a correct, within-limits, bit-exact default for
ANY arch, runtime-overridable via GDN_NW/GDN_CPW. Not a break.
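
The 512-thread figure for 0022 follows from the warp count alone. A minimal sketch of that arithmetic, assuming a 32-lane warp (the notes quote only the 512-thread total and the 1024-thread limit); the helper names are hypothetical and are not part of the patch:

```python
WARP_SIZE = 32                 # assumption: 32 lanes per warp; not stated in the notes
MAX_THREADS_PER_BLOCK = 1024   # the per-block limit quoted above


def threads_per_block(num_warps: int) -> int:
    """Threads launched per block for a given NUM_WARPS value."""
    return num_warps * WARP_SIZE


def geometry_is_launchable(num_warps: int, cols_per_warp: int) -> bool:
    """Hypothetical validity check for a (GDN_NW, GDN_CPW) override."""
    if num_warps < 1 or cols_per_warp < 1:
        return False
    return threads_per_block(num_warps) <= MAX_THREADS_PER_BLOCK


print(threads_per_block(16))           # the (16, 8) default
print(geometry_is_launchable(16, 8))
print(geometry_is_launchable(64, 8))   # 64 warps would exceed the limit
```

Under that assumption the (16, 8) default sits at half the block limit, which is why it is launchable on every arch listed even though the values were tuned on GB10.
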
### THE ONE silent-correctness risk (cross-ref SAFETY #1 above)
0021/0028 (and 0018/0019 for the GDN op) implement their new ops for CUDA + CPU
ONLY, and the fused conv variants REUSE GGML_OP_SSM_CONV discriminated by a
non-null src[3]/src[4]. Emission is NOT gated on the active compute backend. A
backend that supports plain SSM_CONV but ignores the discriminator would run the
WRONG plain conv => SILENT corruption (not a build break). In practice the model
that emits these (qwen35 hybrid) also needs the fork-custom GDN op, which is
CUDA/CPU-only, so on Vulkan/SYCL the GDN node asserts/falls back FIRST and the
model cannot run there regardless; and Metal is not a build target. So the risk is
latent rather than live, but it should still be closed by gating fused-op emission
on a CUDA/HIP compute backend (or a supports_op guard rejecting the discriminated
SSM_CONV where fused handling is absent). This is the single item that could ever
miscompute silently; everything else is either build-safe or loud.
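
The hazard and the proposed `supports_op` guard can be illustrated with a toy dispatch model. This is a sketch only: the classes below are hypothetical stand-ins, not the ggml backend API, and the string results merely mark which kernel would have run.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy graph node: op name plus source slots (None = unused)."""
    op: str
    src: list = field(default_factory=lambda: [None] * 5)


def is_fused_ssm_conv(node: Node) -> bool:
    # The fused variant reuses the SSM_CONV enum and is told apart only
    # by a non-null src[3] (the discriminator described above).
    return node.op == "SSM_CONV" and node.src[3] is not None


class NaiveBackend:
    """Supports plain SSM_CONV and never looks at src[3]."""

    def supports_op(self, node: Node) -> bool:
        return node.op == "SSM_CONV"

    def run(self, node: Node) -> str:
        return "plain_conv"  # the wrong kernel for a fused node


class GuardedBackend(NaiveBackend):
    """Same kernel, but rejects the discriminated form it cannot handle."""

    def supports_op(self, node: Node) -> bool:
        return node.op == "SSM_CONV" and not is_fused_ssm_conv(node)


def dispatch(backend, node: Node) -> str:
    if not backend.supports_op(node):
        raise NotImplementedError(f"{node.op}: fused variant unsupported")
    return backend.run(node)


fused = Node("SSM_CONV")
fused.src[3] = "conv_state"  # marks the node as the fused variant

print(dispatch(NaiveBackend(), fused))  # silently runs the plain conv
try:
    dispatch(GuardedBackend(), fused)
except NotImplementedError as err:
    print("rejected:", err)
```

The guard turns a silent wrong answer into a loud rejection, which is the property the notes ask for before any non-CUDA/CPU backend is allowed to see these nodes.
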
### Build-safety verdict per target (would it COMPILE / RUN)
- CUDA sm_80 / 86 / 89 / 90 (Ampere/Ada/Hopper): BUILDS (0017 Blackwell code
`#if`-stripped + default-off; all other device code portable CUDA). qwen35 hybrid
models RUN (GDN + ssm_conv_update + gather have non-Blackwell kernels). NVFP4
GGUFs run via the stock non-FP4-MMA dequant/DP4A path; the FP4 levers are inert,
not broken. No patch in 0018-0029 breaks this build.
- CUDA sm_100 (datacenter Blackwell, HBM3e): BUILDS + every lever active
(BLACKWELL_MMA_AVAILABLE defined). Bit-exact. GB10-tuned launch defaults are
correct but tuned for the LPDDR5x BW floor; on HBM3e the regime is compute-bound,
so safe-but-not-necessarily-optimal (re-sweep 0022/0017 levers). No build/correctness risk.
- Metal: NOT a build target (no darwin row), so missing Metal kernels for the new
SSM_CONV/GDN ops cannot break a build or a run here. (The GDN op has no Metal
kernel regardless.)
- CPU (amd64 + arm64): BUILDS + RUNS - every new op ships a CPU mirror under the
reused enums; host patches are portable C++.
- ROCm/HIP, Intel SYCL, Vulkan: BUILD ok. The .cu additions hipify cleanly (no
Blackwell intrinsic outside the `#if`; 0022's 512-thread launch within AMD limits).
SYCL/Vulkan are separate backends that don't compile the .cu files and lack the
GDN op, so qwen35 hybrid models fall back/assert there rather than run; classic
(non-qwen35) models are unaffected because SSM_CONV semantics only change when the
qwen35 graph emits the discriminator src. The latent silent-SSM_CONV risk above
applies only if a backend both supports SSM_CONV and ignores the discriminator.
Verdict: of 0018-0029, none would break a non-Blackwell CUDA build, the CPU build,
or the ROCm/SYCL/Vulkan builds; 0017 is the only Blackwell-gated patch and is
default-off and `#if`-guarded. The sole non-build hazard is the latent
discriminated-SSM_CONV silent-miscompute on a hypothetical Vulkan/SYCL/Metal GDN
run, which should be closed by compute-backend-gating the fused-op emission.
## Section: CROSS-ARCH SYNTHESIS (final verdict)
Consolidates the four audit sections above into a single ship decision. The arch
axis: NVFP4 FP4-MMA requires `BLACKWELL_MMA_AVAILABLE` = sm_120/121 (consumer
Blackwell, GB10/RTX-50) + sm_100 (datacenter Blackwell). sm_90 Hopper / sm_89 Ada
/ sm_80-86 Ampere = NO FP4-MMA. Metal/CPU/AMD/Intel = no NVFP4-MMA. GB10's wins
are dominated by the LPDDR5x ~273 GB/s bandwidth floor; sm_100 has FP4-MMA but
HBM3e ~8 TB/s so it is COMPUTE-bound and every "bandwidth-bound" GB10 verdict
inverts there.
### 1. BUILD SAFETY: does it build + run WITHOUT CRASHING off-Blackwell?
YES on every target it builds for, with ONE latent silent-correctness hazard
(not a crash) to close before claiming non-Blackwell support. The build is NOT
GB10-pinned: there is no explicit CUDA arch list anywhere in the paged path
(`CUDA_DOCKER_ARCH` empty in every matrix row, identical to stock llama-cpp), so
the CUDA TUs compile the full upstream ggml arch fan and the NVFP4 FP4-MMA path
is gated INSIDE the kernel by `BLACKWELL_MMA_AVAILABLE`, never by the matrix.
| target | builds? | runs? | notes |
|--------|---------|-------|-------|
| CUDA sm_80/86/89/90 (Ampere/Ada/Hopper) | YES | YES | 0017 Blackwell code `#if`-stripped + default-off; all other device code portable. qwen35 hybrid models run (GDN + ssm_conv_update + gather have non-Blackwell kernels). NVFP4 GGUFs run via dequant/DP4A; FP4 levers inert, not broken. |
| CUDA sm_100 (datacenter Blackwell, HBM3e) | YES | YES | every lever active + bit-exact; GB10-tuned launch defaults are correct but compute-bound regime => safe-but-suboptimal (re-sweep, do not assume GB10 constants). |
| CPU (amd64 + arm64) | YES | YES | every new op ships a CPU mirror under the reused enums; host patches portable C++. |
| ROCm/HIP, Intel SYCL, Vulkan | YES | partial | .cu hipifies cleanly (no Blackwell intrinsic outside `#if`; 0022's 512-thread launch within AMD limits). SYCL/Vulkan don't compile the .cu and lack the GDN op, so qwen35 hybrid models assert/fall back rather than run; classic non-qwen35 models unaffected. |
| Metal / macOS | NOT BUILT | N/A | no `includeDarwin` row, no `metal:` capability key. Mac selection of this backend falls back to `default`=cpu (a Linux image) and does NOT run; no auto-fallthrough to stock llama-cpp. |
No patch in 0017-0029 breaks a non-Blackwell CUDA build, the CPU build, or the
ROCm/SYCL/Vulkan builds. The only thing that is not merely "suboptimal" is the
fused-conv silent-miscompute hazard (item RISKY-1 below), and even that is latent
because the co-emitted GDN op asserts first on the backends that lack it.
### 2. EVERY patch/opt, four buckets
SAFE-EVERYWHERE (ship as-is; bit-exact or default-off byte-identical; pure win or
neutral on any arch that runs the path):
- 0001-0012 paged KV core (manager, on-demand alloc, prefix caching, in-kernel paged read)
- 0013 / 0016 prefill-token budget scheduler (pure `update_slots()` policy, default-off byte-identical)
- 0018 in-place SSM-state write-back (CUDA+CPU; see RISKY-1 for backend coverage)
- 0019 fused SSM-state gather (CUDA+CPU)
- 0021 conv-state in-place fusion (CUDA+CPU)
- 0028 recurrent-state (conv-tap) gather fusion (CUDA+CPU)
- 0020 o_proj GDN MMVQ->MMQ reshape (zero-cost view, bit-identical; MMQ>MMVQ at M=128 is universal; magnitude GB10-bound, perf-only caveat at tiny real M=1, see RISKY-2)
- 0024 paged-pool burst-reclaim (pure host C++; fixes a real long-server fragmentation collapse)
- 0029 block-table within-step host cache (host memcpy reuse, bit-exact; bigger win the FASTER the GPU, i.e. MORE host-bound decode elsewhere)
BLACKWELL-ONLY, CLEAN FALLBACK (only meaningful where FP4-MMA exists; provably
inert/byte-identical elsewhere, never a break):
- 0017 FP4 dense-GEMM decode tile tune - levers `#if BLACKWELL_MMA_AVAILABLE` + `blackwell_mma_available(cc)` + `type==NVFP4`, ALL default-off => default build byte-identical to stock on every arch
- 0023 MoE NVFP4 activation-quant de-dup - plain uint4 copy kernel reached ONLY inside the pre-existing `if (use_native_fp4)` branch (false off-Blackwell); never executes there
- 0025 MoE NVFP4 decode re-graph - host-side CUDA-graph guard, env-gated `LLAMA_MOE_FORCE_GRAPHS` default-off; the NVFP4-grouped guard predicate is inert on non-FP4
- NVFP4 GGUFs + 6 gallery rows - FAST path is sm_120/121/100 only; elsewhere run-via-dequant (correct, slow), never a load/run gate
GB10-TUNED (works + safe everywhere, but the constants/magnitude are GB10
bandwidth-floor winners; re-sweep per arch, no correctness risk):
- 0022 GDN recurrence occupancy retune - column-fold default (16,8)=512thr/MIN_BLOCKS=2, bit-exact, env-overridable GDN_NW/GDN_CPW; within the 1024-thread limit on sm_70..120 + AMD. Optimal values depend on DRAM latency/L2/SM-count; on a compute-bound arch the kernel may not be the bottleneck.
- 0026 bf16 per-head SSM/conv cache - default f32 bit-exact (opt-in `--cache-type-ssm/-conv`); bf16 only pays off on a bandwidth-bound arch, buys little on sm_100 HBM3e. bf16 is STORAGE-only (`fill_kernel<nv_bfloat16>`), the bf16-compute SSM plan is shelved so STATE_T stays f32 on the hot path.
- 0017 / 0023 magnitudes (the % wins, not the gating) are also GB10-floor-bound.
RISKY (fix before claiming non-Blackwell ship; neither is a crash, one is silent):
- RISKY-1 (the one real gap) fused GDN/conv ops are CUDA+CPU-only with
backend-UNGATED, DEFAULT-ON emission. Confirmed: `cparams.fused_gdn_ch = true`
and `auto_fgdn = true` in the `llama_context` constructor; emission fires on
`(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar)` with NO compute-backend
check. The fused conv variant REUSES `GGML_OP_SSM_CONV` discriminated by a
non-null `src[3]` (verified: CUDA `if (dst->src[3] != nullptr)` branch at the
top of `ggml_cuda_op_ssm_conv`, CPU mirror in ops.cpp, NO supports_op guard). A
backend that supports plain SSM_CONV but ignores `src[3]` would compute the
WRONG plain conv => SILENT corruption. Latent today only because the co-emitted
fork-custom GDN op is CUDA/CPU-only, so on Vulkan/SYCL the GDN node asserts
first and the qwen35 hybrid model cannot run there anyway, and Metal is not
built. FIX: gate fused-op emission on a CUDA/HIP compute backend, OR add a
supports_op guard that rejects the discriminated SSM_CONV where fused handling
is absent. This is the single thing that could miscompute silently; close it
before a Vulkan/SYCL/Metal paged build of a gated-DeltaNet model is ever shipped.
- RISKY-2 (perf-only, not correctness) 0020 forces MMQ; at a genuine single-stream
decode M<=8 (n_seqs=1) MMQ could be slower than MMVQ off the GB10 batched
regime. Bit-identical either way. Confirm the reshape still picks the better
kernel at n_seqs=1 on non-GB10 archs.
### 3. NVFP4-GGUF + gallery targeting recommendation
Do NOT hardware-gate the entries (and you cannot: LocalAI has no microarch-gating
primitive - `tags:` are free-text/search-only, `ModelConfig` has no
hardware/requirements field, and backend `capabilities:` resolves by accelerator
FAMILY only, serving `nvidia: cuda12-...-paged` to ANY NVIDIA GPU with no
sub-nvidia resolution). The GGUFs run correctly everywhere via dequant, so the
risk is PERFORMANCE-EXPECTATION, not correctness; a hard gate would wrongly block
valid (slow) use. Recommended, in order:
1. (zero-code, do now) Lead each of the 6 descriptions with one honest line:
"Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other
NVIDIA/CPU via NVFP4 dequant but WITHOUT FP4-MMA and below the quoted GB10
throughput." Temper the "90-117% of vLLM" claims with that caveat (those are
LPDDR5x-bandwidth-bound specific).
2. (zero-code) Tag all six consistently with `nvfp4` + a new `blackwell` tag. The
four Qwopus/Qwen-MTP entries currently carry only `[llm, gguf]` and are not even
discoverable as NVFP4 despite being NVFP4 GGUFs: a secondary metadata-correctness gap.
3. (feature, later) A structured `recommended_hardware` field surfaced by the React
import dialog is the only way to express this machine-readably; it does not exist.
### 4. Per-arch roadmap (ranked by value / effort)
- sm_100 datacenter Blackwell - HIGH value, MEDIUM effort. FP4-MMA works so NVFP4
stays fast and the precision bucket (0017/0023/0025) carries over, but the BW
floor is gone => compute-bound. Needs: re-sweep 0022 GDN_NW/CPW; re-evaluate the
0017 kill-gate (levers ready, may flip); expect 0018/0019/0026 bandwidth wins to
shrink toward neutral while 0029/0025/0020 host/graph/MMQ wins still help. No
code change to be SAFE; a tuning pass to be OPTIMAL.
- Metal / macOS - MEDIUM value, MEDIUM effort. Add the `includeDarwin`
`-metal-darwin-arm64-llama-cpp-localai-paged` row + a `metal:` capability key
(changed-backends.js already anticipates the source path). Delivers paged-KV +
scheduler value only (no NVFP4-MMA on Metal); still a strict win over today's
broken Mac selection. MUST also land RISKY-1 first (Metal would otherwise hit the
discriminated-SSM_CONV path if it ever gained an SSM_CONV kernel without the
discriminator).
- CPU - LOW effort, already works. Reference kernels exist for every fused op;
paged KV + scheduler + reclaim are the portable value. Nothing to do.
- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 - MEDIUM value, LOW effort (no FP4
work). No FP4-MMA => pair the precision-agnostic infra (paged KV, 0013/0016,
0024, 0029, 0018/0019/0021/0028, 0020) with a DIFFERENT quant (Q4_K/AWQ/GPTQ).
Messaging: "no NVFP4 here, use another quant, but paged + SSM + scheduler infra
is a pure win". The GGUFs/gallery rows are out of scope for these.
### 5. What MUST be empirically verified (and on what hardware)
- GB10 (sm_121, user has it): the validated target; already measured. Re-confirm
bit-exactness gates after RISKY-1 fix.
- M4 Mac (user has it): (a) once an `includeDarwin` paged row exists, verify the
Metal build compiles + a NON-qwen35 model runs (paged KV path); (b) verify a
qwen35 hybrid model on Metal EITHER asserts loudly OR is correct - it must NOT
silently miscompute the discriminated SSM_CONV. This is the direct test of
RISKY-1 on real Metal. Do this BEFORE shipping a Metal paged build. Also verify
CPU correctness of every fused op on the Mac (arm64 CPU mirror).
- non-Blackwell NVIDIA (sm_80/86/89/90 - user would need to ACQUIRE, e.g. cloud
A100/L4/L40S/H100): verify (a) the cuda12/cuda13 paged image runs a qwen35
hybrid model correctly (GDN + ssm_conv_update + gather non-Blackwell kernels),
(b) NVFP4 GGUFs load + produce correct output via dequant/DP4A (not garbage),
(c) RISKY-2: that 0020's forced MMQ does not regress single-stream (n_seqs=1)
decode latency vs MMVQ. This is the only bucket needing hardware acquisition;
everything else is covered by the GB10 + M4 the user already has.
- sm_100 (datacenter Blackwell - cloud B200 if a tuning pass is wanted): only
needed to make sm_100 OPTIMAL, not to make it SAFE. Defer unless targeting it.
### 6. SHIP DECISION
SAFE TO SHIP TODAY as a Blackwell-targeted backend on Linux. The build is
arch-general (same arch fan + variant set as stock llama-cpp), every targeted
Linux variant builds and runs, and all Blackwell-specific code is `#if`-guarded and
either default-off or reached only behind the runtime `use_native_fp4` check (0023),
so a non-Blackwell build is byte-identical to stock on the FP4 path.
The NVFP4 GGUFs run everywhere via dequant (correct, slower), so broad gallery
exposure is a performance-expectation issue, not a correctness one.
MINIMUM to not break / mislead other archs:
1. (correctness, before ANY Vulkan/SYCL/Metal paged build of a gated-DeltaNet
model) Close RISKY-1: compute-backend-gate the fused GDN/conv op emission, or
add a supports_op guard rejecting the discriminated SSM_CONV. This is the only
hard requirement; it is latent on the current Linux targets but becomes live
the moment a Metal/Vulkan/SYCL paged build of qwen35 exists.
2. (availability, zero-risk) Add the `includeDarwin` paged row + `metal:` key so
Mac users get a working (paged-KV-only) build instead of a non-running
default=cpu selection with no fallthrough to stock.
3. (expectation, zero-code) Add the Blackwell-recommended hardware note + the
"runs slower off-Blackwell via dequant" caveat to the 6 gallery descriptions
and tag all six `nvfp4` + `blackwell`.
4. (perf, verify don't block) Confirm 0020 does not regress n_seqs=1 decode on
non-GB10 NVIDIA; if it does, gate the MMVQ->MMQ reshape on a real-M threshold.
Items 2-4 do not block a Linux Blackwell ship. Item 1 blocks only a future
non-CUDA paged build of a gated-DeltaNet model; on the current build targets the
hazard is latent (the GDN op asserts first). Net: ship for Blackwell/Linux now;
land item 1 before extending paged to Metal/Vulkan/SYCL.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,56 +0,0 @@
# A-build: hybrid per-head f32/bf16 SSM state - BUILD PROGRESS
Label: A-build (GPU agent). Base: DGX `~/llama-paged-dev` branch `paged` HEAD 2f4f5ab (patch 0025),
plus `BF16_SSM_STATE.diff` applied as the bf16 plumbing base. Goal: per-head mixed-dtype SSM state
(f32 long-memory heads, bf16 fast heads); default all-f32, bit-exact (`ssm_hybrid_tau_thresh=0.0`; the brief's `inf` wording is inverted, see KNOB SEMANTICS).
## Design recap (from SPEEDUP_HUNT.md A-hybrid-design)
- Classifier (host, model-load): tau_h = 1/(|ssm_a[il][h]| * softplus(ssm_dt[il][h])); f32 if tau_h>T.
ssm_a = SSM_A_NOSCAN = -exp(A_log) (verified qwen35.cpp:376). ssm_dt = SSM_DT bias.
- Split cache: per GDN layer, s_l (f32, n_f32 heads) + s_l_bf16 (bf16, n_bf16 heads). head_slot map.
- Kernel: ONE kernel templated +HYBRID; per-block (h_idx) branch on head_slot (uniform, no divergence).
Recurrence math byte-for-byte f32-register, untouched. Homogeneous (HYBRID=false) path bit-exact.
- Op: extra src[8]=state_bf16, src[9]=head_slot; backend detects hybrid = (src[9]!=null).
- CPU mirror: per-head partition read.
- test-backend-ops: MIXED case (some heads f32, some bf16) output-append, decode+prefill+keep_rs_t.
## DE-RISK GATE (must pass before sweep)
1. test-backend-ops GATED_DELTA_NET mixed PASS (CUDA mixed vs CPU mixed).
2. T_thresh=inf greedy md5 == 0023 baseline: dense 5951a5b4d624ce891e22ab5fca9bc439,
MoE 07db32c2bcb78d17a43ed18bc22705cd.
## KNOB SEMANTICS (IMPORTANT - brief endpoint wording corrected)
Rule (brief verbatim + physics + "start 32-64" guidance all agree): a head is kept f32 iff
`tau_h > T_thresh`, else bf16, with `tau_h = 1/(|ssm_a| * softplus(ssm_dt))` in tokens. Long-memory
(large tau) heads stay f32 (bf16 rounding error does not decay there, so it shows up as KL); fast
(small tau) heads -> bf16.
- ssm_hybrid_tau_thresh DEFAULT = 0.0 => every tau>0 -> ALL F32 (bit-exact opt-out; the gate runs here).
- ssm_hybrid_tau_thresh -> +inf => ALL BF16 (shelved mode).
- sweep: raising T (16/32/64/128 tokens) moves progressively more (longer-memory) heads to bf16 = more speed.
NOTE: the brief's "inf=>all-f32, 0=>all-bf16" sentence is INVERTED vs the operative rule it states
("keep f32 if tau>T") and vs "start 32-64" + the physics. Correct endpoints: 0=all-f32, inf=all-bf16.
Implemented the physically-correct rule; default 0.0 = bit-exact all-f32.
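
The rule and its two endpoints can be checked in a few lines. A minimal sketch, assuming scalar per-head `ssm_a` / `ssm_dt` values; the three head parameter pairs below are made up for illustration and the function names are hypothetical:

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def tau(ssm_a: float, ssm_dt: float) -> float:
    """Memory timescale in tokens: 1 / (|a| * softplus(dt))."""
    return 1.0 / (abs(ssm_a) * softplus(ssm_dt))


def classify(heads, t_thresh: float):
    """'f32' for heads with tau > t_thresh, else 'bf16'."""
    return ["f32" if tau(a, dt) > t_thresh else "bf16" for a, dt in heads]


# Made-up (ssm_a, ssm_dt) pairs: slow, medium, and fast head.
heads = [(-0.01, 0.0), (-0.5, 1.0), (-4.0, 2.0)]

print([round(tau(a, dt), 2) for a, dt in heads])
print(classify(heads, 0.0))        # default: every tau > 0 stays f32
print(classify(heads, 1.0))        # only heads with tau > 1 token stay f32
print(classify(heads, math.inf))   # nothing exceeds inf: all bf16
```

Because the comparison is a strict `tau > T`, a threshold of 0.0 keeps every head in f32 and `inf` moves every head to bf16, matching the corrected endpoints above.
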
## STATUS
- [x] ggml.h/ggml.c hybrid op builders
- [x] gated_delta_net.cu hybrid kernel + dispatch (one kernel, +HYBRID template, uniform per-block branch)
- [x] ops.cpp CPU hybrid read mirror (output-append; ids in-place is GPU-only, asserted)
- [x] test-backend-ops mixed case (32 cases: hc 4/8 x hs 64/128 x decode/prefill/keep_rs_t x kda)
- [x] de-risk gate 1: test-backend-ops GATED_DELTA_NET = 84/84 PASS (incl 32 hybrid mixed CUDA-vs-CPU)
- [x] cparam/CLI ssm_hybrid_tau_thresh plumbing (default 0.0; threaded context->cparams->memory->ctors)
- [x] memory-recurrent split cache + classifier (validated: real tau split, correct 2-partition layout)
- [x] delta-net-base hybrid op build (fused ids decode + bf16 rs_zero/extra mirror)
- [x] full build clean (sm_121; llama-completion/batched-bench/perplexity/test-backend-ops)
- [x] de-risk gate 2 (default/all-f32 md5 == 0023 both models, re-verified post-build)
- [x] hybrid-ON decode FIXED: the incoherence was head_slot being zeroed by clear(data=true) (whole-RS
buffer clear) after warm-up, never re-uploaded => every head -> f32-local-0 => split collapse.
Fix = persist head_slot_host + re-upload via upload_head_slots() after every buffer clear. Hybrid
decode now coherent; cross-op carry verified BYTE-EXACT (write==read both partitions).
- [x] A-gatesweep DONE: KL sweep T in {0.25..128} both models, single-seq c1024 (clean carry), drift.
SHIP GATE FAILS - no T passes MeanKLD<1e-3 AND top-p>=99.5% with meaningful speedup. Premise
(error concentrates in long-memory heads) REFUTED: KL scales with bf16 COUNT and saturates
~0.06/~91% (MoE saturates at the minimal split). Carry byte-exact => genuine bf16 sensitivity,
not a bug. Throughput lever real: dense +12.4% / MoE +11.5% decode @npl128 at T=128.
- [x] Shipped default-off (f32, bit-exact). De-risk gates re-verified on the clean build (84/84;
md5 == baseline both models). See A_HYBRID_SSM_RESULTS.md for the full tables.
Committed: DGX paged 33e7c65 (amended); worktree patch 0026 + A_HYBRID_SSM_RESULTS.md + this doc.


@@ -1,119 +0,0 @@
# A - HYBRID PER-HEAD f32/bf16 SSM STATE - BUILD + DE-RISK + GATE-SWEEP RESULTS
Label: A-build + A-gatesweep. Lands as patch 0026 on top of 0025 (DGX HEAD 2f4f5ab),
incorporating the bf16-SSM-state plumbing as the base. Built into `~/llama-paged-dev/build-cuda`
(sm_121); committed on the DGX `paged` branch (33e7c65, amended) and as
`patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch` + this doc in the worktree.
## VERDICT
The hybrid machinery is **CORRECT and complete** (both de-risk gates PASS; the carry is byte-exact;
the previously-open decode-incoherence bug is FIXED). The **ship gate FAILS**: no T_thresh reaches
`MeanKLD < 1e-3 AND Same-top-p >= 99.5%` for both models with any meaningful speedup. The design
premise - that the bf16 KL error concentrates in long-memory heads and is removed by keeping them
f32 at f32-fraction 0.30-0.50 - is **empirically refuted** on q36-27b and q36-35b-a3b-nvfp4: the KL
error scales with the bf16 head COUNT and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any
useful byte-saving. The bf16 byte-saving (and the decode speedup it buys) is real but cannot meet the
strict KL bar. **Shipped default-off (f32, bit-exact opt-out); the hybrid is opt-in only.**
## THE FIX (was: hybrid-ON decode incoherent)
Root cause: `llama_memory_recurrent::clear(data=true)` zeroes the WHOLE recurrent backend buffer with
`ggml_backend_buffer_clear`, which includes the per-layer `head_slot` maps. Those maps were uploaded
only once in the constructor. llama.cpp calls `clear(true)` to reset state after the warm-up run (and
on context resets), so by the time real prefill/decode runs, every `head_slot[h] == 0`. The kernel
decodes `head_slot==0` as "f32 head, local index 0", so EVERY head reads/writes f32-partition slot 0:
the split collapses (the bf16 partition is never written, every head collides on one f32 slot) and the
output is garbage. Warm-up showed correct values precisely because it ran before the clear.
Fix: persist the host-side maps (`head_slot_host`) and re-upload them after every buffer clear via a
new `upload_head_slots()` (called both at construction and at the end of `clear(true)`). 22 lines in
`src/llama-memory-recurrent.cpp` + 7 in the header. After the fix:
- head_slot reads back correct in every forward (e.g. `0 1 -1 -2`), in both llama-completion and
llama-perplexity;
- the bf16 partition is written (non-zero) every step;
- the cross-op state carry is **byte-exact**: at a continuation forward the op reads back EXACTLY what
the prior op wrote, element-for-element, in BOTH partitions (f32 `[0]=0.00303 [1]=-0.00074
[16384]=0.00054`, bf16 `[0]=-0.00023 [1]=0.00008 [16384]=0.00269` write == read), confirming there
is no addressing/scramble/corruption bug. The only residual difference from f32 is the bf16 rounding
of the bf16-partition heads.
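
The failure mode and the fix can be reproduced with a toy buffer model. This is a sketch, not the `llama_memory_recurrent` code: the class is a hypothetical stand-in, and the slot encoding (non-negative = f32 local index, negative = bf16 local index) is an assumption inferred from the `0 1 -1 -2` example above.

```python
class RecurrentBuffer:
    """Toy stand-in for the recurrent backend buffer (not the llama.cpp API)."""

    def __init__(self, head_slot_host, reupload_after_clear: bool):
        self.head_slot_host = list(head_slot_host)   # persisted host copy
        self.reupload_after_clear = reupload_after_clear
        self.head_slot_dev = [0] * len(self.head_slot_host)
        self.upload_head_slots()                     # upload at construction

    def upload_head_slots(self) -> None:
        self.head_slot_dev = list(self.head_slot_host)

    def clear(self) -> None:
        # Whole-buffer clear: zeroes the state AND the head_slot map.
        self.head_slot_dev = [0] * len(self.head_slot_dev)
        if self.reupload_after_clear:
            self.upload_head_slots()


def decode_slot(v: int):
    """Assumed encoding: v >= 0 is f32 index v; v < 0 is bf16 index -v - 1."""
    return ("f32", v) if v >= 0 else ("bf16", -v - 1)


slots = [0, 1, -1, -2]
buggy = RecurrentBuffer(slots, reupload_after_clear=False)
fixed = RecurrentBuffer(slots, reupload_after_clear=True)
buggy.clear()
fixed.clear()

print([decode_slot(v) for v in buggy.head_slot_dev])  # all heads land on one f32 slot
print([decode_slot(v) for v in fixed.head_slot_dev])  # split layout preserved
```

In the buggy variant every head decodes to the same f32 slot after the clear, which is the collision described above; re-uploading from the persisted host copy restores the two-partition layout.
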
## DE-RISK GATES - both PASS (re-verified on the final clean build)
1. **test-backend-ops GATED_DELTA_NET = 84/84 PASS, CUDA0 OK** (incl. the 32 mixed-dtype hybrid cases
vs CPU: head_count {4,8} x head_size {64,128} x {decode, prefill 33/64/100, keep_rs_t K=4} x kda).
2. **T=0 (default, all-f32) greedy md5 == 0023 baseline, both models**, NO `--ssm-bf16-tau`:
- dense q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == baseline
- MoE q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == baseline
The bit-exact opt-out is preserved byte-for-byte.
## SHIP GATE - the KL/throughput sweep (FAILS)
KL harness = the bf16-work GateBench: `llama-perplexity --kl-divergence` on wikitext-2-raw,
`-ngl 99 -fa on --seed 1`, base = T=0 (f32). The clean carry config is single-sequence
`-b 1024 -ub 512 -c 1024 --chunks 8` (one cross-ubatch bf16 round-trip; f32-vs-f32 floor = 100.000%
same-top-p, MeanKLD ~ -1.2e-5). Gate: `MeanKLD < 1e-3 AND Same-top-p >= 99.5% AND bounded drift`.
### Dense q36-27b-nvfp4 (H_v=48), c1024 single-seq
| T_thresh | bf16 heads | f32-frac | f_bytes | MeanKLD | Same-top-p |
|---------:|-----------:|--------:|--------:|---------:|-----------:|
| 0.25 | 14 | 0.964 | 0.982 | 0.00270 | 98.92% |
| 0.5 | 48 | 0.963 | 0.982 | 0.01439 | 96.18% |
| 1 | 118 | 0.935 | 0.968 | 0.06357 | 91.59% |
| 8 | ~610 | 0.735 | 0.868 | 0.05669 | 91.59% |
| 32 | ~1113 | 0.517 | 0.759 | 0.05724 | 90.97% |
| 64 | ~1304 | 0.434 | 0.717 | 0.06183 | 91.85% |
| 128 | ~1460 | 0.366 | 0.683 | 0.05980 | 91.56% |
Monotonic below the knee (T<=1), then a flat plateau. Best meaningful point T=0.25 (only ~1.8% byte
saving) already FAILS both criteria (KLD 0.0027 > 1e-3; top-p 98.92% < 99.5%). To pass the gate the
bf16 count must be < ~14 heads (f_bytes > 0.98) => no speedup.
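
The `f_bytes` column is consistent with each bf16 head costing half the bytes of an f32 head. A minimal check under that assumption (the notes do not state the formula explicitly), using the f32-frac and f_bytes values from the dense table:

```python
def f_bytes(f32_frac: float) -> float:
    """State bytes relative to all-f32, assuming a bf16 head is half an f32 head."""
    return f32_frac + 0.5 * (1.0 - f32_frac)


# (f32-frac, reported f_bytes) pairs copied from the dense table above.
dense = [
    (0.964, 0.982), (0.963, 0.982), (0.935, 0.968), (0.735, 0.868),
    (0.517, 0.759), (0.434, 0.717), (0.366, 0.683),
]

for frac, reported in dense:
    print(f"f32-frac {frac:.3f} -> f_bytes {f_bytes(frac):.4f} (table: {reported})")
```

Every row agrees with the reported column to within rounding, so the byte saving at a given T is simply half the bf16 fraction.
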
### MoE q36-35b-a3b-nvfp4 (H_v=32), c1024 single-seq
| T_thresh | bf16 heads | f32-frac | f_bytes | MeanKLD | Same-top-p |
|---------:|-----------:|--------:|--------:|---------:|-----------:|
| 0.25 | 23 | 0.940 | 0.970 | 0.03907 | 91.61% |
| 0.5 | 53 | 0.928 | 0.964 | 0.04620 | 90.31% |
| 1 | 78 | 0.910 | 0.955 | 0.04425 | 89.82% |
| 32 | 585 | 0.391 | 0.695 | 0.04552 | 90.09% |
MoE has NO low-KL regime: even the minimal split (23 bf16 heads, ~3% byte saving) is already at the
~0.045 / ~91% plateau. Fails the gate everywhere by a wide margin.
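
Applying the gate mechanically to both tables makes the failure explicit. A minimal sketch: the thresholds are the ones quoted in the gate definition, the rows are copied from the tables above, and the unquantified "bounded drift" criterion is modelled as a plain boolean (an assumption).

```python
KLD_MAX = 1e-3     # MeanKLD must be strictly below this
TOP_P_MIN = 99.5   # Same-top-p (%) must be at least this


def passes_ship_gate(mean_kld: float, same_top_p: float, drift_bounded: bool = True) -> bool:
    return mean_kld < KLD_MAX and same_top_p >= TOP_P_MIN and drift_bounded


# (T_thresh, MeanKLD, Same-top-p %) rows from the two tables.
dense = [
    (0.25, 0.00270, 98.92), (0.5, 0.01439, 96.18), (1, 0.06357, 91.59),
    (8, 0.05669, 91.59), (32, 0.05724, 90.97), (64, 0.06183, 91.85),
    (128, 0.05980, 91.56),
]
moe = [
    (0.25, 0.03907, 91.61), (0.5, 0.04620, 90.31),
    (1, 0.04425, 89.82), (32, 0.04552, 90.09),
]

for name, rows in (("dense", dense), ("moe", moe)):
    passing = [t for t, kld, top_p in rows if passes_ship_gate(kld, top_p)]
    print(name, "passing T values:", passing)

print("f32-vs-f32 floor passes:", passes_ship_gate(-1.2e-5, 100.0))
```

Only the f32-vs-f32 floor clears the gate; no measured T on either model does.
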
### Why it fails (the refutation)
The carry is byte-exact, so this is genuine bf16 rounding of the recurrent state, not a bug. The
gated-DeltaNet logit is extremely sensitive to ANY perturbation of the temporal state: even rounding a
handful of small-magnitude heads to bf16 flips ~9% of hard-wikitext argmaxes, and adding more bf16
heads does not flip materially more (saturation - the flips concentrate in an inherently-marginal
token pool). This matches the prior whole-bf16 finding (MeanKLD 0.05-0.17, top-p ~90%, "bounded but
LARGE, plateaus with context"). The error is NOT concentrated by tau, so f32-ing the long-memory heads
(or, tested, the fast heads - inverted classifier gives the same plateau) does not recover the gate.
## THROUGHPUT - the byte-saving lever IS real (but KL-gated out)
`llama-batched-bench -fa on -npp 128 -ntg 128 -npl 128`, `LLAMA_KV_PAGED=1`, decode_agg = S_TG t/s:
| model | T=0 (f32) | T=128 (f_bytes ~0.68) | gain |
|-------|----------:|----------------------:|-------:|
| dense | 529.0 | 594.4 | +12.4% |
| MoE | 1110.7 | 1238.1 | +11.5% |
So the split delivers the predicted recurrence-bandwidth win (~+12% decode at f_bytes ~0.68), but only
at T values whose KL is ~0.06 / ~91% top-p. There is no operating point with both a real speedup and a
passing KL.
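
The gain column can be reproduced directly from the two throughput columns. A short sketch using the S_TG values from the table above (the helper name is hypothetical):

```python
def gain_pct(base_tps: float, new_tps: float) -> float:
    """Relative decode throughput gain in percent."""
    return (new_tps / base_tps - 1.0) * 100.0


runs = {"dense": (529.0, 594.4), "MoE": (1110.7, 1238.1)}

for model, (f32_tps, hybrid_tps) in runs.items():
    print(f"{model}: {gain_pct(f32_tps, hybrid_tps):+.1f}%")
```
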
## RECOMMENDATION
- Ship 0026 as-is: **default `ssm_hybrid_tau_thresh = 0.0` (f32, bit-exact)**; the hybrid is opt-in via
`--ssm-bf16-tau` for callers who explicitly accept reduced precision for ~+12% decode. Do NOT put a
hybrid T in the gallery/recommended config - it does not pass the KL bar.
- Lever A is closed as a KL-passing speedup: the GDN recurrent state does not tolerate bf16 on a
head-subset basis. Speed beyond the f32 recurrence must come from elsewhere (the MoE FP4 GEMM /
re-graph levers, or NVFP4-dense quant), not from bf16-ing the SSM state.
- If a product later accepts a looser bar (e.g. top-p >= 95%), dense T=0.25 (98.92%) and T=0.5
(96.18%), both at f_bytes 0.982, are the only points that clear it, and they buy ~2% - still not
worth it; MoE has nothing.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,92 +0,0 @@
# FINAL apples-to-apples NVFP4 benchmark (GB10 / DGX Spark) - CLEAN env, containers stopped
# llama 0023 clean f7409c2 | LLAMA_KV_PAGED=1, LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget; beats stock 394s->142s TTFT@npl32), CUDA graphs ON, -c 131072 --parallel 128 -b 2048 -ub 512 -fa on
# vLLM 0.23.0 | CUDA graphs ON (no enforce-eager), util 0.85, max-model-len 4096, max-num-seqs 256, tp1
# client h2h_cli3.py: 512-tok UNIQUE-nonce prompt (fresh full prefill, defeats prefix caching), max_tokens=256, temp0, ignore_eos, stream+usage
# llama restarts server PER NPL (paged-pool degrades after high-npl bursts); vllm one server/combo + npl8 re-check. 1 measured pass/npl + ptok8 graph warmup. peak_gb engine = PEAK-PRE.
# started Fri Jun 26 04:43:38 AM CEST 2026 baseline=3.29 GB
[2026-06-26 04:43:38] [dense_llama] ==== START dense_llama (llama) baseline_mem=3.29 ====
[2026-06-26 04:43:38] [dense_llama] NPL=8 launching server PRE_GB=3.29
[2026-06-26 04:43:48] [dense_llama] NPL=8 ready LOADED_GB=47.06
[2026-06-26 04:43:55] [dense_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1. **Analyze User Input:** The user says "The capital of France is". This is a straightforward factual question with a clear answer.\n2. **Identify Key Entity:** France (country)\n3. **Identify Question Type:** Capit
[2026-06-26 04:44:30] [dense_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2040, "prompt_tok_total": 4195, "gen_per_req": 255.0, "agg_tps": 61.8, "decode_agg_tps": 82.5, "decode_perseq_tps": 9.57, "prefill_tps": 507.3, "ttft_mean_ms": 6038.1, "ttft_max_ms": 8270.0, "wall_s": 32.999}
[2026-06-26 04:44:30] [dense_llama] NPL=8 PEAK_GB=53.51
[2026-06-26 04:44:35] [dense_llama] NPL=8 server stopped mem=3.31
[2026-06-26 04:44:35] [dense_llama] NPL=32 launching server PRE_GB=3.31
[2026-06-26 04:44:40] [dense_llama] NPL=32 ready LOADED_GB=46.96
[2026-06-26 04:47:55] [dense_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8180, "prompt_tok_total": 16900, "gen_per_req": 255.6, "agg_tps": 43.2, "decode_agg_tps": 192.6, "decode_perseq_tps": 4.79, "prefill_tps": 115.0, "ttft_mean_ms": 133551.7, "ttft_max_ms": 147007.0, "wall_s": 189.49}
[2026-06-26 04:47:55] [dense_llama] NPL=32 PEAK_GB=69.63
[2026-06-26 04:48:01] [dense_llama] NPL=32 server stopped mem=3.32
[2026-06-26 04:48:01] [dense_llama] NPL=64 launching server PRE_GB=3.32
[2026-06-26 04:48:11] [dense_llama] NPL=64 ready LOADED_GB=46.97
[2026-06-26 04:55:10] [dense_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16382, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 39.8, "decode_agg_tps": 277.8, "decode_perseq_tps": 3.09, "prefill_tps": 95.9, "ttft_mean_ms": 321618.8, "ttft_max_ms": 352633.6, "wall_s": 411.603}
[2026-06-26 04:55:10] [dense_llama] NPL=64 PEAK_GB=83.96
[2026-06-26 04:55:16] [dense_llama] NPL=64 server stopped mem=3.30
[2026-06-26 04:55:16] [dense_llama] NPL=128 launching server PRE_GB=3.30
[2026-06-26 04:55:21] [dense_llama] NPL=128 ready LOADED_GB=47.09
[2026-06-26 05:13:18] [dense_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32767, "prompt_tok_total": 67969, "gen_per_req": 256.0, "agg_tps": 30.9, "decode_agg_tps": 384.6, "decode_perseq_tps": 1.86, "prefill_tps": 69.7, "ttft_mean_ms": 902762.7, "ttft_max_ms": 975832.6, "wall_s": 1061.031}
[2026-06-26 05:13:18] [dense_llama] NPL=128 PEAK_GB=93.82
[2026-06-26 05:13:25] [dense_llama] NPL=128 server stopped mem=3.31
[2026-06-26 05:13:25] [dense_llama] ==== DONE dense_llama POST_GB=3.31 ====
[2026-06-26 05:13:25] [dense_vllm] ==== START dense_vllm (vllm) baseline_mem=3.31 ====
[2026-06-26 05:13:25] [dense_vllm] launching vllm PRE_GB=3.31
[2026-06-26 05:21:15] [dense_vllm] vllm ready LOADED_GB=110.48
[2026-06-26 05:21:27] [dense_vllm] GATE='Here\'s a thinking process:\n\n1. **Analyze User Input:** The user says "The capital of France is"\n2. **Identify Key Entity/Question:** The question is asking for the capital city of France.\n3. **Retrieve Knowledge:** I know from general knowledge that t
[2026-06-26 05:21:59] [dense_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1959, "prompt_tok_total": 4195, "gen_per_req": 244.9, "agg_tps": 65.6, "decode_agg_tps": 70.4, "decode_perseq_tps": 8.76, "prefill_tps": 2096.2, "ttft_mean_ms": 1861.1, "ttft_max_ms": 2000.6, "wall_s": 29.843}
[2026-06-26 05:21:59] [dense_vllm] NPL=8 PEAK_GB=110.92
[2026-06-26 05:22:47] [dense_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8165, "prompt_tok_total": 16900, "gen_per_req": 255.2, "agg_tps": 176.3, "decode_agg_tps": 211.8, "decode_perseq_tps": 6.28, "prefill_tps": 2182.6, "ttft_mean_ms": 5353.2, "ttft_max_ms": 7741.4, "wall_s": 46.302}
[2026-06-26 05:22:47] [dense_vllm] NPL=32 PEAK_GB=110.87
[2026-06-26 05:23:59] [dense_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16314, "prompt_tok_total": 33828, "gen_per_req": 254.9, "agg_tps": 236.5, "decode_agg_tps": 309.1, "decode_perseq_tps": 4.38, "prefill_tps": 2088.9, "ttft_mean_ms": 9512.4, "ttft_max_ms": 16191.0, "wall_s": 68.976}
[2026-06-26 05:23:59] [dense_vllm] NPL=64 PEAK_GB=110.88
[2026-06-26 05:25:57] [dense_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32640, "prompt_tok_total": 67969, "gen_per_req": 255.0, "agg_tps": 288.4, "decode_agg_tps": 418.8, "decode_perseq_tps": 2.79, "prefill_tps": 1929.1, "ttft_mean_ms": 18449.5, "ttft_max_ms": 35227.7, "wall_s": 113.162}
[2026-06-26 05:25:57] [dense_vllm] NPL=128 PEAK_GB=110.95
[2026-06-26 05:26:27] [dense_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 2044, "prompt_tok_total": 4187, "gen_per_req": 255.5, "agg_tps": 68.1, "decode_agg_tps": 73.4, "decode_perseq_tps": 9.07, "prefill_tps": 1921.9, "ttft_mean_ms": 1877.6, "ttft_max_ms": 2178.1, "wall_s": 30.018}
[2026-06-26 05:26:35] [dense_vllm] ==== DONE dense_vllm POST_GB=3.53 ====
[2026-06-26 05:26:35] [moe_llama] ==== START moe_llama (llama) baseline_mem=3.53 ====
[2026-06-26 05:26:35] [moe_llama] NPL=8 launching server PRE_GB=3.53
[2026-06-26 05:26:50] [moe_llama] NPL=8 ready LOADED_GB=36.42
[2026-06-26 05:26:52] [moe_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1. **Analyze User Input:**\n - User says: "The capital of France is"\n - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2. **Identify Key Information:*
[2026-06-26 05:27:06] [moe_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2048, "prompt_tok_total": 4195, "gen_per_req": 256.0, "agg_tps": 156.8, "decode_agg_tps": 211.8, "decode_perseq_tps": 24.45, "prefill_tps": 1236.4, "ttft_mean_ms": 2477.1, "ttft_max_ms": 3392.9, "wall_s": 13.061}
[2026-06-26 05:27:06] [moe_llama] NPL=8 PEAK_GB=39.66
[2026-06-26 05:27:11] [moe_llama] NPL=8 server stopped mem=3.34
[2026-06-26 05:27:11] [moe_llama] NPL=32 launching server PRE_GB=3.34
[2026-06-26 05:27:16] [moe_llama] NPL=32 ready LOADED_GB=36.54
[2026-06-26 05:27:54] [moe_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8192, "prompt_tok_total": 16900, "gen_per_req": 256.0, "agg_tps": 235.6, "decode_agg_tps": 393.0, "decode_perseq_tps": 10.02, "prefill_tps": 1213.9, "ttft_mean_ms": 8225.2, "ttft_max_ms": 13921.9, "wall_s": 34.768}
[2026-06-26 05:27:54] [moe_llama] NPL=32 PEAK_GB=47.11
[2026-06-26 05:28:00] [moe_llama] NPL=32 server stopped mem=3.30
[2026-06-26 05:28:00] [moe_llama] NPL=64 launching server PRE_GB=3.30
[2026-06-26 05:28:05] [moe_llama] NPL=64 ready LOADED_GB=36.39
[2026-06-26 05:29:10] [moe_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16384, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 271.0, "decode_agg_tps": 527.0, "decode_perseq_tps": 6.15, "prefill_tps": 1152.3, "ttft_mean_ms": 15849.5, "ttft_max_ms": 29356.9, "wall_s": 60.449}
[2026-06-26 05:29:10] [moe_llama] NPL=64 PEAK_GB=57.13
[2026-06-26 05:29:16] [moe_llama] NPL=64 server stopped mem=3.28
[2026-06-26 05:29:16] [moe_llama] NPL=128 launching server PRE_GB=3.28
[2026-06-26 05:29:21] [moe_llama] NPL=128 ready LOADED_GB=36.48
[2026-06-26 05:34:19] [moe_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32760, "prompt_tok_total": 67969, "gen_per_req": 255.9, "agg_tps": 112.7, "decode_agg_tps": 726.4, "decode_perseq_tps": 3.73, "prefill_tps": 276.8, "ttft_mean_ms": 213017.2, "ttft_max_ms": 245528.7, "wall_s": 290.634}
[2026-06-26 05:34:19] [moe_llama] NPL=128 PEAK_GB=61.51
[2026-06-26 05:34:25] [moe_llama] NPL=128 server stopped mem=3.28
[2026-06-26 05:34:25] [moe_llama] ==== DONE moe_llama POST_GB=3.28 ====
[2026-06-26 05:34:25] [moe_vllm] ==== START moe_vllm (vllm) baseline_mem=3.28 ====
[2026-06-26 05:34:25] [moe_vllm] launching vllm PRE_GB=3.28
[2026-06-26 05:39:35] [moe_vllm] vllm ready LOADED_GB=109.46
[2026-06-26 05:39:38] [moe_vllm] GATE='Here\'s a thinking process:\n\n1. **Analyze User Input:**\n - User says: "The capital of France is"\n - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2. **Identify Key Information:**\n - C
[2026-06-26 05:39:47] [moe_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1900, "prompt_tok_total": 4195, "gen_per_req": 237.5, "agg_tps": 231.2, "decode_agg_tps": 256.5, "decode_perseq_tps": 31.84, "prefill_tps": 5186.5, "ttft_mean_ms": 768.8, "ttft_max_ms": 808.2, "wall_s": 8.217}
[2026-06-26 05:39:47] [moe_vllm] NPL=8 PEAK_GB=109.62
[2026-06-26 05:40:07] [moe_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 7794, "prompt_tok_total": 16900, "gen_per_req": 243.6, "agg_tps": 426.4, "decode_agg_tps": 500.8, "decode_perseq_tps": 14.9, "prefill_tps": 6223.4, "ttft_mean_ms": 1830.4, "ttft_max_ms": 2714.2, "wall_s": 18.28}
[2026-06-26 05:40:07] [moe_vllm] NPL=32 PEAK_GB=109.63
[2026-06-26 05:40:37] [moe_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 15927, "prompt_tok_total": 33828, "gen_per_req": 248.9, "agg_tps": 550.7, "decode_agg_tps": 686.1, "decode_perseq_tps": 9.83, "prefill_tps": 5926.5, "ttft_mean_ms": 3224.4, "ttft_max_ms": 5704.9, "wall_s": 28.92}
[2026-06-26 05:40:37] [moe_vllm] NPL=64 PEAK_GB=109.63
[2026-06-26 05:41:27] [moe_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 31795, "prompt_tok_total": 67969, "gen_per_req": 248.4, "agg_tps": 650.7, "decode_agg_tps": 882.2, "decode_perseq_tps": 6.05, "prefill_tps": 5300.5, "ttft_mean_ms": 6487.7, "ttft_max_ms": 12817.8, "wall_s": 48.863}
[2026-06-26 05:41:27] [moe_vllm] NPL=128 PEAK_GB=109.64
[2026-06-26 05:41:36] [moe_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 1702, "prompt_tok_total": 4187, "gen_per_req": 212.8, "agg_tps": 207.2, "decode_agg_tps": 226.4, "decode_perseq_tps": 28.06, "prefill_tps": 6021.3, "ttft_mean_ms": 642.7, "ttft_max_ms": 694.8, "wall_s": 8.213}
[2026-06-26 05:41:44] [moe_vllm] ==== DONE moe_vllm POST_GB=3.31 ====
==== ALL 16 ROWS COLLECTED (2 models x 2 engines x 4 npl) ====
decode_agg t/s (llama | vLLM | llama%vLLM):
DENSE q36-27b-nvfp4: npl8 82.5|70.4|117% npl32 192.6|211.8|91% npl64 277.8|309.1|90% npl128 384.6|418.8|92%
MoE q36-35b-a3b: npl8 211.8|256.5|83% npl32 393.0|500.8|78% npl64 527.0|686.1|77% npl128 726.4|882.2|82%
peak_gb (llama on-demand grows | vLLM fixed ~107 pool):
DENSE llama 53.5->93.8 ; vLLM ~110.9 flat
MoE llama 39.7->61.5 ; vLLM ~109.6 flat
Final CSV: final_benchmark.csv ; analysis: QWEN36_NVFP4_BENCH.md (FINAL section).
Cleanup: no leftover server/bench PIDs; GPU free (memnow 3.28 GB); local-ai + local-ai-worker
containers restarted (host returned). DONE.

@@ -1,628 +0,0 @@
# bf16 SSM-state cache: BUILD PLAN (PART C synthesis - hand this to the build agent)
Status: READ-ONLY design. Lands ON TOP of patch 0021 (conv-state in-place fusion, building
concurrently on the GPU). DEFAULT = bf16 SSM recurrent state, f32 opt-out. This PART C is the
executive build brief: ordered edits, acceptance gate, bench targets, semantics/back-compat/risk
register, and the de-risk-first item. PART A (cparams wiring), PART B (kernel/op plumbing) and the
Appendix (upstream precedent + numeric safety) below are the detailed reference each step points into.
The decision (settled by GDN_RECURRENCE_BYTE_GATE.md): the gated-DeltaNet recurrence is the dominant
decode kernel (51.6% of the step, 805 MB f32 state R+W/call at 74% of GB10 peak BW) and is ALREADY
single-pass (measured re-stream ~1.0x, hard-capped <=1.33x). The whole ~2x DRAM gap vs vLLM is purely
f32(llama) vs bf16(vLLM) state-cache WIDTH, not extra passes. Narrowing the persisted SSM state to
bf16 (load->f32, recurrence math in f32 UNCHANGED, store->bf16) halves the dominant term and reaches
vLLM parity-to-ahead. vLLM's own GDN state cache is bf16, so this is a fair equal-precision change.
## C.0 Synthesis decisions that OVERRIDE the per-part text
1. v1 ships `type_s` = BF16 (SSM recurrent state, the 805 MB lever) and KEEPS `type_r` = F32 (conv
state). Reason: `ggml_concat` at prefill (`build_conv_state`, delta-net-base.cpp:472) requires
same-type operands; a bf16 conv cache breaks the f32 `qkv_mixed` concat. Conv state is ~12.6 MB
(launch-bound, ~0 ms byte benefit), so keeping it f32 costs nothing. This OVERRIDES PART A §3a/§3b,
which set BOTH defaults to BF16: in v1 set the `type_r` / `cache_type_conv` DEFAULT to
`GGML_TYPE_F32`. `type_r`=bf16 is a v2 follow-up (needs an f32 staging view before the prefill
concat - PART B §B.6).
2. Keep ALL transient/scratch tensors f32: the GDN op OUTPUT scratch (ggml.c:6327), the 0019 gather
scratch, and the keep_rs_t prefill snapshot. ONLY the PERSISTED cache rows narrow to bf16 (the
src[5] read view and the src[6] in-place write view).
3. The gate REPLACES the bit-exact md5 gate for the bf16 default: bf16 is intentionally non-bit-exact
vs llama f32 (it is equal precision to vLLM's bf16). The 0018/0019 md5 gate STILL applies to (a)
patch 0021's conv fusion and (b) verifying the f32 opt-out path is byte-identical to the pre-bf16
f32 baseline.
## C.1 Ordered file-by-file edit list (build order, on top of 0021)
Order is dependency- and de-risk-driven: prove the kernel dtype-correct in ISOLATION before flipping
any default. Section refs point into PART A / PART B below.
STEP 1 - kernel + op made dtype-generic (the load/store conversion), validated standalone:
- 1a `ggml/src/ggml.c` - relax the F32-only state asserts to {F32,BF16} in the 3 GDN builders:
`ggml_gated_delta_net` (~6308), `_inplace` (~6370), `_inplace_ids` (~6430), on `state` and
`src_state_dst`. KEEP the op OUTPUT scratch F32 (6327). [PART B §B.2]
- 1b `ggml/src/ggml-cuda/ggml-cuda.cu` - `supports_op` `GGML_OP_GATED_DELTA_NET` (~3096): permit a
BF16 `src[5]`/`src[6]`. [PART B §B.3]
- 1c `ggml/src/ggml-cuda/gated_delta_net.cu` - template kernel+gather+launch on `bool STATE_BF16`;
`#include <cuda_bf16.h>`. LOAD `__bfloat162float` (~102), STORE `__float2bfloat16` (~207), GATHER
bf16->f32 scratch (~20). Cast `src_state`/`src_state_dst` pointers to `nv_bfloat16` on bf16; relax
dispatcher asserts (309-311) `sizeof(float)` -> `ggml_type_size(type)`. Keep gather scratch +
keep_rs_t snapshot f32. ALL recurrence math (106-200) UNCHANGED in f32 registers. [PART B §B.4,§B.8]
- 1d `ggml/src/ggml-cpu/ops.cpp` - matching bf16 load/store branch in the GDN reference (10726/10744/
10891 load via `GGML_BF16_TO_FP32`, 10758-10762 store via `GGML_FP32_TO_BF16`); relax `nb[]` asserts
to `ggml_type_size(type)`. [PART B §B.5]
- 1e `tests/test-backend-ops.cpp` - add a BF16-state `GATED_DELTA_NET` case covering BOTH `n_tokens==1`
decode AND a multi-token (prefill/chunk) + `keep_rs_t==true` path, CUDA bf16 vs CPU bf16 reference.
THIS IS THE DE-RISK GATE for Step 1 (see C.5). Build + pass before Step 2.
STEP 2 - cparams selection wiring (llama.cpp core):
- 2a `include/llama.h` (after :366) - add `enum ggml_type type_s;` and `type_r;` adjacent to
`type_k`/`type_v`, marked `[EXPERIMENTAL]`. [PART A §3a]
- 2b `src/llama-context.cpp:3468` (`llama_context_default_params`) - add `/*.type_s =*/ GGML_TYPE_BF16,`
and `/*.type_r =*/ GGML_TYPE_F32,`. THIS IS THE DEFAULT CHANGE (type_r stays F32 per C.0). [PART A §3a]
- 2c `src/llama-memory.h:19` (`struct llama_memory_params`) - add `ggml_type type_r;` and `type_s;`.
[PART A §3a]
- 2d `src/llama-context.cpp:325` (`params_mem` init) - pass `params.type_r` / `params.type_s`. [PART A §3a]
- 2e `src/llama-model.cpp` - replace the 3 hardcoded `GGML_TYPE_F32` pairs (2056-57 recurrent, 2098-99
hybrid_iswa, 2117-18 hybrid = the qwen35/qwen35moe path) with `params.type_r` / `params.type_s`.
[PART A §2/§3a]
STEP 3 - back-compat for saved recurrent state (REQUIRED, the default flips):
- 3a `src/llama-memory-recurrent.cpp` `state_read_data` - on `s_type_i_ref != live type` with both in
{F32,BF16}, CONVERT row-by-row during load instead of returning false (same for `r`). Bump the
recurrent state-file version. [PART A §5, option A]
STEP 4 - CLI / llama-server surface (needed by the gate harness):
- 4a `common/common.h:566` region - `cache_type_ssm = GGML_TYPE_BF16;` and
`cache_type_conv = GGML_TYPE_F32;` (conv default F32 per C.0). [PART A §3b]
- 4b `common/common.cpp:1589` region - `cparams.type_s = params.cache_type_ssm;` and
`cparams.type_r = params.cache_type_conv;`. [PART A §3b]
- 4c `common/arg.cpp` (after :2074) - add `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv`
via the existing `kv_cache_type_from_str` (arg.cpp:402); confirm `bf16` -> `GGML_TYPE_BF16`. The C.2
harness depends on `--cache-type-ssm {f32,bf16}`. [PART A §3b]
STEP 5 - LocalAI gRPC / YAML (force f32 from model config):
- 5a `backend/backend.proto` - `string CacheTypeSSM` / `CacheTypeConv` (next free tags after 64);
regen proto. [PART A §3c]
- 5b `backend/cpp/llama-cpp/grpc-server.cpp:504` region - `params.cache_type_ssm =
kv_cache_type_from_str(request->cachetypessm());` + conv. [PART A §3c]
- 5c `core/config/model_config.go:935` - `CacheTypeSSM`/`CacheTypeConv` yaml fields. [PART A §3c]
- 5d `core/backend/options.go:247` - map into the request. [PART A §3c]
- 5e `core/config/meta/registry.go` + `build_test.go` - register `cache_type_ssm`/`cache_type_conv`
as static fields (gate). [PART A §3c]
STEP 6 - capability fallback (heterogeneous / CPU-offload safety):
- 6a `src/llama-context.cpp:518-595` - an `auto_fgdn`-style device-match probe: if a participating
device lacks the bf16 GDN load/store specialization (CPU-offloaded GDN layer, non-GB10 backend),
demote `type_s` to F32 BEFORE alloc and log once. [PART A §4]
## C.2 Acceptance gate (REPLACES the bit-exact md5 gate)
bf16 is intentionally non-bit-exact, so the 0018/0019 md5 byte-equality gate does NOT apply to the
bf16 default. The gate is teacher-forced KL-divergence + PPL-delta + greedy coherence + a
long-context drift sweep, vs the SAME model run f32. All commands on `dgx.casa` (DO NOT run during
this design - GPU busy). Binaries `~/llama-paged-dev/build*/bin`; models `~/bench/q36-27b-nvfp4.gguf`
(dense) and `~/bench/q36-35b-a3b-nvfp4.gguf` (MoE); scratch `~/bench/klgate`.
Why teacher-forced (not self-greedy): a self-greedy decode lets each precision pick its own argmax,
so after the first divergence the contexts differ and per-token logits are no longer comparable (you
measure trajectory divergence, not numeric error). `llama-perplexity --kl-divergence` feeds both
precisions the IDENTICAL token stream and compares output distributions position-by-position; the
greedy trajectory is validated SEPARATELY by the Same-top-p metric + a coherence read.
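
The position-by-position comparison described above can be pictured with a small sketch. This is not `llama-perplexity`'s implementation: the function names, the toy logits, and the KL direction (reference distribution first) are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))

def same_top(ref_logits, test_logits):
    """True when both precisions agree on the argmax token."""
    argmax = lambda xs: max(range(len(xs)), key=xs.__getitem__)
    return argmax(ref_logits) == argmax(test_logits)

def compare_streams(ref_positions, test_positions):
    """Mean KLD and top-token agreement (%) over positions fed the SAME tokens."""
    klds = [kl_divergence(r, t) for r, t in zip(ref_positions, test_positions)]
    agree = [same_top(r, t) for r, t in zip(ref_positions, test_positions)]
    return sum(klds) / len(klds), 100.0 * sum(agree) / len(agree)

# Toy data: two positions, three-token vocabulary (hypothetical values).
f32_logits = [[2.0, 1.0, 0.1], [0.5, 0.4, 3.0]]
bf16_logits = [[2.01, 0.99, 0.1], [0.5, 0.41, 2.98]]
mean_kld, same_top_pct = compare_streams(f32_logits, bf16_logits)
```

Because both runs see the identical token stream, each pair of logit vectors refers to the same context, which is what makes the per-position KL meaningful.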
Corpus (one-time): wikitext-2 raw test (~280k tokens) into `~/bench/klgate`. KL mode needs at least
`2*n_ctx` tokens; any fixed UTF-8 file of >=8k tokens works as long as base AND test share it.
256-token headline gate (per model; shown for dense):
```
M=~/bench/q36-27b-nvfp4.gguf; F=~/bench/klgate/wikitext-2-raw/wiki.test.raw; D=~/bench/klgate
COMMON="-m $M -f $F -c 256 -b 256 -ngl 99 -fa on --seed 1 --chunks 32"
# (a) f32 BASE: reference logits + f32 PPL
llama-perplexity $COMMON --cache-type-ssm f32 --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32.c256.base.log
# (b) bf16 TEST: KL(bf16||f32) + bf16 PPL + Same-top-p
llama-perplexity $COMMON --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.bf16.c256.kl.log
```
Noise floor (run FIRST, mandatory - GPU reductions are not bit-deterministic, so KLD has a non-zero
floor; bf16 is judged against BOTH the absolute threshold AND this floor):
```
llama-perplexity $COMMON --cache-type-ssm f32 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32f32.floor.log
```
Record `Mean KLD_floor` and `Same-top-p_floor` (expect KLD ~1e-6..1e-5, top-p ~100%).
Coherence spot-check (greedy trajectory, reuses the 0018/0019 `--temp 0 --seed 1` convention):
```
P="Explain how a transformer language model generates text, step by step."
for T in f32 bf16; do llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm $T 2>/dev/null > $D/q27.greedy.$T.txt; done
diff $D/q27.greedy.f32.txt $D/q27.greedy.bf16.txt && echo "GREEDY BYTE-IDENTICAL"
```
Long-context drift sweep (verifies the g<1 decay bound: bf16 state-rounding error must stay FLAT, not
accumulate, as context grows - the GDN state spans the whole window):
```
for C in 256 1024 2048 4096; do
  CMN="-m $M -f $F -c $C -b $C -ngl 99 -fa on --seed 1 --chunks 8"
  llama-perplexity $CMN --cache-type-ssm f32 --kl-divergence-base $D/q27.f32.c$C.kld >/dev/null
  llama-perplexity $CMN --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c$C.kld | tee $D/q27.bf16.c$C.kl.log
done
```
f32 opt-out verification (the safety valve must actually select f32 and reproduce the committed f32
greedy md5 from 0018/0019 - the bf16 default must NOT change the f32-path output):
```
llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm f32 2>/dev/null | md5sum # == 0018/0019 f32 baseline md5
```
Repeat the WHOLE gate verbatim for the MoE model (`M=~/bench/q36-35b-a3b-nvfp4.gguf`).
PASS/FAIL (bf16 ships as DEFAULT only if ALL rows pass for BOTH dense and MoE):
| metric | source | PASS threshold |
|---|---|---|
| Mean KLD | 256-gate (b) | **< 1e-3 nats** (hard limit, set by the brief) |
| Mean KLD vs floor | (b) vs floor | <= ~5x `Mean KLD_floor` (bounded signal, not pure noise) |
| Same top p | (b) | **>= 99.5%** (100% => greedy byte-identical to f32) |
| PPL-delta `ln(PPL_bf16/PPL_f32)` | (a)+(b) | **abs < 0.005** (PPL within +-0.5%) |
| Max / 99.9% KLD | (b) | report; flag if Max > 0.05 (tail outliers) |
| Coherence | greedy | fluent + on-topic; byte-identical if Same-top-p=100% |
| Long-context drift | sweep | MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0% |
If any row fails for a model: keep THAT model on f32 (gallery YAML `cache_type_ssm: f32`) while the
global default stays bf16; the cparams f32 fallback is the safety valve. MoE has fewer GDN layers
(31 vs 48) and smaller per-head state (H_v=32 vs 48), so expected KLD <= dense; same thresholds.
Same-top-p is the bridge to the old md5 harness: at 100% the bf16 greedy output is byte-identical to
f32 and the 0018/0019 md5 gate would still pass - the strongest possible non-bit-exact result.
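
The automatable rows of the PASS/FAIL table can be sketched as a small checker. The thresholds are copied from the table; the metric key names, the helper names, and the example numbers are hypothetical, and the "~5x" floor ratio is taken as exactly 5.0. The Max-KLD row (report only) and the manual coherence read are left out.

```python
import math

# Thresholds copied from the PASS/FAIL table; "~5x" is taken as exactly 5.0 here.
MAX_MEAN_KLD = 1e-3
MAX_FLOOR_RATIO = 5.0
MIN_SAME_TOP_PCT = 99.5
MAX_ABS_LN_PPL_RATIO = 0.005
MAX_DRIFT_RATIO = 1.5
MIN_SAME_TOP_PCT_LONG = 99.0

def gate_rows(m):
    """Map each automatable table row to True (pass) or False (fail)."""
    ln_ppl_ratio = math.log(m["ppl_bf16"] / m["ppl_f32"])
    return {
        "mean_kld": m["mean_kld"] < MAX_MEAN_KLD,
        "mean_kld_vs_floor": m["mean_kld"] <= MAX_FLOOR_RATIO * m["mean_kld_floor"],
        "same_top_p": m["same_top_pct"] >= MIN_SAME_TOP_PCT,
        "ppl_delta": abs(ln_ppl_ratio) < MAX_ABS_LN_PPL_RATIO,
        "long_context_drift": (
            m["mean_kld_4096"] <= MAX_DRIFT_RATIO * m["mean_kld"]
            and m["same_top_pct_4096"] >= MIN_SAME_TOP_PCT_LONG
        ),
    }

def ships_bf16_default(dense, moe):
    """bf16 ships as the default only if every row passes for BOTH models."""
    return all(gate_rows(dense).values()) and all(gate_rows(moe).values())

# Hypothetical numbers for illustration only (not measured results).
example = {
    "mean_kld": 2e-5, "mean_kld_floor": 8e-6, "same_top_pct": 99.8,
    "ppl_f32": 6.50, "ppl_bf16": 6.51,
    "mean_kld_4096": 2.6e-5, "same_top_pct_4096": 99.4,
}
```

A model whose rows do not all pass would stay on `cache_type_ssm: f32` while the global default remains bf16, as described above.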
## C.3 Bench targets + nsys confirmation
Dense q36-27b-nvfp4 (48 GDN layers, S_v=128, H_v=48), npl128, GB10/sm_121, graphs-OFF
apples-to-apples (the measured baseline):
- Recurrence per call: 3.98 ms (f32, 805 MB R+W, 74% peak) -> **~2.0-3.0 ms** (bf16, ~413 MB R+W).
2.0 ms = 74% peak retained; 3.0 ms = conservative 50% peak on the smaller footprint.
- Recurrence per step: 191 ms -> ~96-143 ms (save ~48-95 ms).
- Step time: 384 ms -> **289-339 ms**.
- Decode throughput: ~335 -> **360-443 tok/s** = parity-to-ahead of vLLM (327 ms / 391 tok/s).
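
The projection in the bullets above follows from simple arithmetic, sketched here under two assumptions that are not stated as code in the source: one recurrence call per GDN layer per step, and all non-recurrence step time unchanged by the state dtype. The function name is hypothetical.

```python
N_GDN_LAYERS = 48      # dense q36-27b-nvfp4
NPL = 128              # parallel sequences decoded per step
STEP_F32_MS = 384.0    # measured baseline step time
CALL_F32_MS = 3.98     # measured f32 recurrence time per call

def projected(call_bf16_ms):
    """Project step time (ms) and decode tok/s for a given bf16 per-call time."""
    recurrence_f32_ms = N_GDN_LAYERS * CALL_F32_MS
    other_ms = STEP_F32_MS - recurrence_f32_ms   # assumed unaffected by the state dtype
    step_ms = other_ms + N_GDN_LAYERS * call_bf16_ms
    return step_ms, NPL * 1000.0 / step_ms

best_step_ms, best_tps = projected(2.0)   # 74% of peak bandwidth retained
slow_step_ms, slow_tps = projected(3.0)   # conservative 50% of peak
```

With these inputs the fast end reproduces the 289 ms / 443 tok/s figures. The slow end comes out near 337 ms, so the band quoted in the bullets is slightly more conservative than this simple model.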
MoE q36-35b-a3b-nvfp4 (31 GDN layers, H_v=32): state per (seq,layer) = `128*128*32*4` bytes = 2.0 MiB f32 ->
per-call R+W ~537 MB f32 -> ~268 MB bf16. Fewer layers + smaller state => smaller ABSOLUTE recurrence
savings, and MoE decode is more GEMM-bound (the `MUL_MAT_ID` expert path), so the bf16-state win is a
smaller FRACTION of the MoE step. Target: a measurable per-call halving of the GDN recurrence time
with the C.2 KL gate passing; no absolute MoE step target is asserted here (the MoE step is
MUL_MAT_ID-dominated, a separate lever from this one).
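
The per-call byte figures for both models can be re-derived as follows. This is a minimal sketch that counts only the recurrent state (one read plus one write per sequence); the function name is hypothetical, and S_v=128 for the MoE model is taken from the `128*128*32*4` expression above.

```python
S_V = 128    # the state is S_V x S_V per head
NPL = 128    # parallel sequences

def state_rw_bytes(h_v, bytes_per_elem, n_seq=NPL):
    """Bytes read + written for the recurrent state in one per-layer call."""
    per_seq_layer = S_V * S_V * h_v * bytes_per_elem
    return 2 * n_seq * per_seq_layer    # one read and one write per sequence

dense_f32 = state_rw_bytes(48, 4)    # 805,306,368 bytes, i.e. ~805 MB
dense_bf16 = state_rw_bytes(48, 2)   # half of dense_f32
moe_f32 = state_rw_bytes(32, 4)      # 536,870,912 bytes, i.e. ~537 MB
moe_bf16 = state_rw_bytes(32, 2)     # 268,435,456 bytes, i.e. ~268 MB
```

The state term alone halves to about 403 MB for the dense model; the ~413 MB figure quoted in the dense bullets is taken from the source and is not re-derived here.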
nsys confirmation (the measurement that proves the lever landed):
```
GGML_CUDA_DISABLE_GRAPHS=1 nsys profile -o ssmbf16 --force-overwrite true \
llama-batched-bench -m $M -npp 8 -ntg 12 -npl 128 -ub 2048
nsys stats --report cuda_gpu_kern_sum ssmbf16.nsys-rep | grep -i gated_delta_net
```
Confirm: `gated_delta_net_cuda` mean duration/call drops 3.98 -> 2.0-3.0 ms; step time + tok/s land in
the 289-339 ms / 360-443 tok/s band; the f32 opt-out reproduces the 3.98 ms f32 call. The gate is the
JOINT condition: per-call speed in band AND KL<1e-3 - neither alone ships bf16.
## C.4 Default / opt-out semantics, back-compat, risk register
Semantics:
- DEFAULT `type_s` = `GGML_TYPE_BF16` (SSM recurrent state). `type_r` = `GGML_TYPE_F32` in v1 (conv
state; bf16 is v2). This is the INVERSE of KV: KV defaults to F16 and is opt-IN to further
compression, while the SSM state defaults to bf16 and is opt-OUT back to f32.
- Opt-out: `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML) -> bit-exact f32
recurrence. Per-model opt-out lives in gallery YAML if a model fails the gate; the global default
stays bf16.
- Silent capability fallback: the C.1 STEP 6 device-match probe demotes `type_s` to F32 before alloc
on devices lacking the bf16 GDN specialization (CPU offload / non-GB10) and logs once.
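
The three semantics above combine into a single resolution order, sketched here. The function name, the string type names, and the boolean capability flag are hypothetical stand-ins for the cparams wiring and the STEP 6 probe; restricting the accepted set to f32/bf16 is an assumption based on STEP 1a.

```python
from typing import Optional

DEFAULT_TYPE_S = "bf16"
SUPPORTED = {"f32", "bf16"}   # assumption: only these two state dtypes are accepted

def effective_type_s(requested: Optional[str], all_devices_have_bf16_gdn: bool) -> str:
    """Pick the SSM-state dtype: explicit request, else default, then capability demotion."""
    chosen = requested if requested else DEFAULT_TYPE_S   # unset or empty keeps the default
    if chosen not in SUPPORTED:
        raise ValueError(f"unsupported SSM state type: {chosen}")
    if chosen == "bf16" and not all_devices_have_bf16_gdn:
        return "f32"   # silent fallback before alloc; the real code would log once
    return chosen
```

An explicit `f32` always wins, so the opt-out stays bit-exact regardless of what the device probe reports.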
Back-compat (the ONE real breakage): `llama-memory-recurrent.cpp` serializes the per-layer state
dtype and HARD-matches on restore (mismatch -> `"mismatched s type"` -> returns false). The f32->bf16
default flip makes OLD f32-saved sessions fail to restore against a bf16 build. Fix = STEP 3a: convert
row-by-row on mismatch (both in {F32,BF16}) + bump the recurrent state-file version. KV never hit this
because `type_k`/`type_v` were EXPERIMENTAL and never default-changed; the SSM default FLIP is what
forces the convert/version work.
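
The convert-on-mismatch idea behind STEP 3a can be illustrated in a few lines. The actual fix lives in C++ inside `state_read_data`; this sketch only shows the decision and the conversion, with hypothetical function names and string type tags, and assumes round-to-nearest-even for the f32 to bf16 direction.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Round an f32 value to bf16 (nearest, ties to even); return the 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:      # NaN: keep it a quiet NaN
        return ((bits >> 16) | 0x0040) & 0xFFFF
    bits += 0x7FFF + ((bits >> 16) & 1)       # rounding bias, ties go to the even mantissa
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_f32(h: int) -> float:
    """Widen a bf16 bit pattern to f32 (exact: the low 16 mantissa bits are zero)."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]

def convert_row(values, saved_type: str, live_type: str):
    """Convert one saved state row to the live cache dtype, or refuse the restore."""
    allowed = {"f32", "bf16"}
    if saved_type == live_type:
        return list(values)
    if saved_type not in allowed or live_type not in allowed:
        raise ValueError("mismatched s type")
    if live_type == "bf16":
        return [f32_to_bf16_bits(v) for v in values]   # f32 floats -> bf16 bit patterns
    return [bf16_bits_to_f32(h) for h in values]       # bf16 bit patterns -> f32 floats
```

Widening bf16 to f32 is lossless, so only the f32 to bf16 direction rounds the saved state.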
Risk register:
- **R1 numeric drift (KL gate fails).** Likelihood LOW: g<1 geometric decay contracts per-step bf16
rounding to a bounded series (~`eps/(1-exp(g_mean))`), f32 registers confine rounding to one
per-step cache write, and vLLM ships this exact config in production. Mitigation: C.2 gate +
per-model f32 opt-out + global f32 fallback.
- **R2 prefill / keep_rs_t / gather state path (the silent-corruption landmine).** The conversion
points are documented for DECODE; the SAME kernel also runs the chunked prefill path, the keep_rs_t
snapshot (writes to f32 scratch while the cache is bf16), and the 0019 gather (reads bf16 cache ->
f32 scratch). A dtype mistake on any of these corrupts the state at the prefill->decode handoff and
surfaces ONLY as long-context drift, which a decode-only 256-token gate can mask. Mitigation: STEP
1e test-backend-ops MUST cover the multi-token prefill + keep_rs_t==true path, not just decode; the
C.2 long-context sweep is the second net. (This is C.5, the single biggest risk.)
- **R3 MoE MUL_MAT_ID path.** The GDN recurrence op is IDENTICAL for dense and MoE; the MoE expert
GEMM (`MUL_MAT_ID`) does NOT touch the SSM state, so bf16-state is orthogonal to the expert path.
Residual risk: `qwen35moe` `build_recurrent_attn` must route the same bf16 state view (it shares
delta-net-base.cpp). Mitigation: run the full C.2 gate on the MoE model; the test-backend-ops case
is arch-agnostic.
- **R4 conv-state coupling with patch 0021.** Flipping `type_r` to bf16 breaks `ggml_concat` at
prefill (different types). Mitigation: v1 keeps `type_r`=F32 (C.0); `type_r`=bf16 deferred to v2
with an f32 staging view (PART B §B.6).
- **R5 back-compat restore failure.** Mitigation: STEP 3a convert + version bump (above).
## C.5 Single biggest risk + how the build agent de-risks it FIRST
Single biggest risk: **R2 - silent state corruption on the NON-decode state paths** (chunked prefill,
the keep_rs_t snapshot, the 0019 gather). The 805 MB measurement and every conversion-point in the
cheat-sheet describe the STEADY decode path (`n_tokens==1`, `!keep_rs_t`). But the bf16 cache is ALSO
read/written by the multi-token prefill path and the prefill/rollback snapshot (which targets f32
scratch while the cache is bf16). A dtype bug there does not crash and can leave the 256-token
decode gate essentially unchanged; it corrupts the recurrent state at the prefill->decode boundary and shows up ONLY as
long-context drift - exactly the failure a quick gate misses.
De-risk FIRST (before ANY default flip or wiring): implement STEP 1 (kernel + op dtype-generic) and
STEP 1e (test-backend-ops) ONLY, then prove the kernel is dtype-correct in ISOLATION by forcing a
bf16 state allocation behind a temporary debug flag and running test-backend-ops with a case that
exercises (a) single-token decode, (b) a multi-token prefill chunk, and (c) `keep_rs_t==true`,
comparing CUDA bf16 against the CPU bf16 reference AND against the f32 path within tolerance. Only
after that case is GREEN does the build agent proceed to STEP 2 (flip the default) and the C.2
model-level gate. This decouples kernel dtype-correctness from the cparams wiring, so a Step-1 bug is
caught by a deterministic op test in minutes instead of as a fuzzy long-context regression after the
full stack is wired.
---
# bf16 SSM state cache — cparams wiring (DEFAULT bf16 + f32 opt-out)
Label: cparams-default-fallback (READ-ONLY design). Mirrors the KV-cache `type_k`/`type_v`
precision plumbing exactly. Designed against HEAD-after-patch-0021 (conv-state in-place fusion).
This is lever (2) of GDN_RECURRENCE_BYTE_GATE.md: the recurrent SSM state cache is the dominant
decode byte stream (805 MB R+W/call, 51.6% of step, single-pass f32 = at the BW floor). The whole
~2x DRAM gap vs vLLM is f32(llama) vs bf16(vLLM) state width. Storing the persisted state in bf16
(load→f32, recurrence math in f32 UNCHANGED, store→bf16) halves the dominant term. vLLM's GDN state
cache is bf16, so bf16-default is the fair equal-precision comparison → make it the DEFAULT.
---
## 1. The KV-cache template we mirror (exact chain for type_k / type_v)
```
CLI common/arg.cpp:2052 -ctk/--cache-type-k TYPE → params.cache_type_k
(common_params, common/common.h:566, default GGML_TYPE_F16)
glue common/common.cpp:1589 cparams.type_k = params.cache_type_k (cparams = llama_context_params)
API include/llama.h:365 llama_context_params.type_k // [EXPERIMENTAL]
llama-context.cpp:3468 default in llama_context_default_params() = GGML_TYPE_F16
mem llama-context.cpp:326 llama_memory_params params_mem.type_k = params.type_k
llama-memory.h:19 struct llama_memory_params { ggml_type type_k; type_v; ... }
alloc llama-model.cpp:2030 create_memory(params_mem, cparams) → KV cache uses params.type_k
```
Key facts:
- `type_k`/`type_v` are NOT stored in `struct llama_cparams` (src/llama-cparams.h). They ride in
`llama_context_params` → `llama_memory_params` and are consumed directly at cache-alloc time.
We mirror that: NO new `llama_cparams` field is needed.
- KV default is opt-IN to compression (F16 default, pass `-ctk q8_0` to shrink). SSM is the INVERSE:
bf16 DEFAULT, pass an explicit `f32` to opt out / restore bit-exactness.
## 2. Where the SSM state type is currently hardcoded (the targets)
The recurrent cache constructor already accepts the types — only the model hardcodes F32:
- `src/llama-memory-recurrent.cpp:22-23` ctor params `ggml_type type_r, type_s`
- `r_l` (line 100, `n_embd_r`) = short conv state → `type_r` (TINY: conv_width-1 taps × conv_dim)
- `s_l` (line 101, `n_embd_s`) = SSM recurrent state → `type_s` (THE 805 MB/call dominant)
- `src/llama-memory-hybrid.h:32-33` ctor params `type_r, type_s` (qwen35 / qwen35moe path)
- Hardcoded `GGML_TYPE_F32` call sites in `src/llama-model.cpp::create_memory`:
- 2056-2057 `llama_memory_recurrent(...)` (pure recurrent arches)
- 2098-2099 `llama_memory_hybrid_iswa(...)` recurrent_type_r / recurrent_type_s
- 2117-2118 `llama_memory_hybrid(...)` recurrent_type_k / recurrent_type_v (mislabeled; they are r/s)
Note: `qwen35` / `qwen35moe` are HYBRID (filter_attn/filter_recr, no SWA) → they take the
`llama_memory_hybrid` branch (2108-2118). That is the call site that matters for the parity push.
## 3. New plumbing (parallel chain `type_s` / `type_r`)
### 3a. Public API + cparams glue (llama.cpp side)
| File | Change |
|------|--------|
| `include/llama.h` (after :366) | Add `enum ggml_type type_s; // data type for recurrent SSM state cache [EXPERIMENTAL]` and `enum ggml_type type_r; // data type for recurrent conv state cache [EXPERIMENTAL]`. Place adjacent to `type_k`/`type_v`. |
| `src/llama-context.cpp:3468` (default params) | Add `/*.type_s =*/ GGML_TYPE_BF16,` and `/*.type_r =*/ GGML_TYPE_BF16,`. **This is the DEFAULT change.** (Overridden for v1 by C.0 item 1: the `type_r` default is `GGML_TYPE_F32`.) |
| `src/llama-memory.h:19` (`struct llama_memory_params`) | Add `ggml_type type_r;` and `ggml_type type_s;` next to `type_k`/`type_v`. |
| `src/llama-context.cpp:325` (`params_mem` init) | Add `/*.type_r =*/ params.type_r,` and `/*.type_s =*/ params.type_s,`. |
| `src/llama-model.cpp` 2056-57 / 2098-99 / 2117-18 | Replace the 3 hardcoded `GGML_TYPE_F32` pairs with `params.type_r` / `params.type_s`. |
### 3b. CLI / llama-server (common side)
| File | Change |
|------|--------|
| `common/common.h:566` region | Add `ggml_type cache_type_ssm = GGML_TYPE_BF16;` and `ggml_type cache_type_conv = GGML_TYPE_BF16;` (mirror `cache_type_k/v`; note the DEFAULT is BF16, not F16). (Overridden for v1 by C.0 item 1: the `cache_type_conv` default is `GGML_TYPE_F32`.) |
| `common/common.cpp:1589` region | Add `cparams.type_s = params.cache_type_ssm;` and `cparams.type_r = params.cache_type_conv;`. |
| `common/arg.cpp` (after :2074) | Add `--cache-type-ssm TYPE` (`-ctssm`) → `params.cache_type_ssm = kv_cache_type_from_str(value)`, and `--cache-type-conv TYPE` (`-ctconv`). Reuse the existing `kv_cache_type_from_str` (arg.cpp:402). Help text: "recurrent SSM state cache type (default bf16; pass f32 for bit-exact recurrence)". |
`kv_cache_type_from_str` already accepts `f32`/`bf16`/`f16` — no change needed; just confirm `bf16`
maps to `GGML_TYPE_BF16` (add the case if absent).
### 3c. LocalAI gRPC backend (so users can force f32 from model YAML)
Mirror `CacheTypeKey` exactly:
| File | Change |
|------|--------|
| `backend/backend.proto:419` region | Add `string CacheTypeSSM = NN;` and `string CacheTypeConv = NN;` (next free field tags). Regenerate proto. |
| `backend/cpp/llama-cpp/grpc-server.cpp:504` region | `if (!request->cachetypessm().empty()) params.cache_type_ssm = kv_cache_type_from_str(request->cachetypessm());` and the conv equivalent. (grpc-server already has its own `kv_cache_type_from_str`; ensure it knows `bf16`.) |
| `core/config/model_config.go:935` region | Add `CacheTypeSSM string yaml:"cache_type_ssm,omitempty"` and `CacheTypeConv string yaml:"cache_type_conv,omitempty"`. |
| `core/backend/options.go:247` region | Add `CacheTypeSSM: c.CacheTypeSSM,` and `CacheTypeConv: c.CacheTypeConv,` to the request build. |
| `core/config/meta/registry.go:161` + `core/config/meta/build_test.go:140` | Register `cache_type_ssm` / `cache_type_conv` as static fields (the `staticFields` slice + registry map) so the meta-config gate passes. |
LocalAI semantics: leaving `cache_type_ssm` UNSET in YAML → empty gRPC string → backend keeps its
BF16 default. Setting `cache_type_ssm: f32` → forces the f32 opt-out (bit-exact recurrence).
## 4. Default / fallback semantics
- **DEFAULT = `GGML_TYPE_BF16`** for both SSM state (`type_s`) and conv state (`type_r`). (C.0 item 1 overrides the conv half for v1: `type_r` defaults to `GGML_TYPE_F32`.)
- SSM state (`type_s`) is the lever: f32→bf16 halves 805→413 MB/call → ~3.98→~2.0-3.0 ms/call.
- Conv state (`type_r`) is negligible bytes; default it bf16 too for consistency, but it can stay
f32 with zero perf cost if patch-0021's in-place conv path assumes f32 — see §6.
- **Opt-out = `GGML_TYPE_F32`** via `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML).
Restores bit-exact recurrence; use when the KL gate (<1e-3 / PPL-delta over 256-tok greedy) fails
for a given model, or for deterministic regression baselines.
- **Silent capability fallback**: gate the bf16 path behind a device-match probe modeled on
`auto_fgdn` (llama-context.cpp:518-595). If the GDN recurrence kernel's bf16 load/store
specialization is unavailable on a participating device (e.g. a CPU-offloaded GDN layer with no
bf16 op, or a non-GB10 backend), fall back to `GGML_TYPE_F32` for `type_s` BEFORE cache alloc and
log it once. This keeps "bf16 default" from breaking heterogeneous/CPU setups.
- The kernel contract leaves the math unchanged: load bf16→f32 into `s_shard` (registers stay f32), do all
recurrence arithmetic in f32, store f32→bf16. Only the persisted cache is rounded, once per step;
geometric decay (`exp(g) < 1`) bounds the rounding, so it does not accumulate without limit.
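The load/compute/store contract in the last bullet can be sketched on the host as below. This is a minimal illustration, not the CUDA kernel: the helper names (`bf16_bits_to_f32`, `f32_to_bf16_bits`, `toy_state_step`) are hypothetical, the cache is modelled as raw 16-bit words, and the recurrence is a toy elementwise `s <- decay * s + update` standing in for the real gated-DeltaNet update.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical host-side helpers: bf16 is the upper 16 bits of an IEEE-754 f32.
static inline float bf16_bits_to_f32(uint16_t h) {
    uint32_t u = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Narrow f32 -> bf16 with round-to-nearest-even; NaNs are kept as quiet NaNs.
static inline uint16_t f32_to_bf16_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffffu) > 0x7f800000u) {
        return static_cast<uint16_t>((u >> 16) | 0x0040u);
    }
    u += 0x7fffu + ((u >> 16) & 1u);
    return static_cast<uint16_t>(u >> 16);
}

// One decode step of a toy recurrence (an assumption for illustration only).
// The cache is bf16; everything between the load and the store is f32.
static void toy_state_step(uint16_t * cache, const float * update, float decay, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float s = bf16_bits_to_f32(cache[i]);  // load: bf16 -> f32
        s = decay * s + update[i];             // math: f32 only
        cache[i] = f32_to_bf16_bits(s);        // store: f32 -> bf16, once per step
    }
}
```

Calling `toy_state_step` repeatedly on the same `cache` reproduces the "rounded once per step" behaviour: the only precision loss is at the final store of each call.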
## 5. Back-compat (the one real breakage — saved sessions / state files)
`src/llama-memory-recurrent.cpp` SERIALIZES the per-layer state tensor dtype and does a HARD match
on restore:
- write: `state_write_data` writes `s_type_i = (int32_t)s_l[il]->type` (line ~900) and the r type.
- read: `state_read_data` reads `s_type_i_ref`, compares to current `s_l[il]->type`, and on
mismatch logs `"mismatched s type (%d != %d, layer %d)"` and **returns false** (restore FAILS).
Same for `r` type.
Consequence of the default flip f32→bf16:
- Sessions SAVED by an old f32-default build will FAIL to RESTORE against a new bf16-default build
(and vice versa), because the serialized `s_type_i_ref` (F32) ≠ the new cache type (BF16).
Required handling (pick one, recommend A):
- **A (convert on mismatch, recommended)**: in `state_read_data`, when `s_type_i_ref != current`
and both ∈ {F32, BF16}, convert row-by-row during load (`ggml_fp32_to_bf16_row` /
`ggml_bf16_to_fp32_row`) instead of returning false. Same for `r`. Bump the recurrent state-file
version so older readers reject cleanly. This makes old f32 sessions loadable into bf16 caches
(with one bf16 rounding) and bf16 sessions loadable into f32 caches (lossless).
- **B (pin precision to the saved file)**: if a session is being restored, read `s_type_i_ref`
first and set `type_s`/`type_r` from it, overriding the default for that context. Keeps restore
working but silently disables the bf16 win for resumed sessions.
- **C (document-only)**: keep the hard match; document that bf16-default invalidates cross-version
saved recurrent states. Lowest effort, worst UX. Not recommended given parity is the goal.
KV-cache parallel: `type_k`/`type_v` were always EXPERIMENTAL and non-default-changing, so the KV
path never had to solve this. The SSM default-FLIP is what forces the convert/version work — call it
out as the single most load-bearing back-compat item.
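Option A's convert-on-mismatch read can be sketched as follows. This is a standalone illustration under stated assumptions: `StateType`, `elem_size` and `restore_state_row` are hypothetical names, the enum values do not correspond to real `ggml_type` values, and the actual patch would use the ggml row-conversion helpers rather than the hand-rolled bit conversions shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the serialized dtype tag (not real ggml_type values).
enum class StateType : int32_t { F32, BF16, Other };

static size_t elem_size(StateType t) {
    switch (t) {
        case StateType::F32:  return sizeof(float);
        case StateType::BF16: return sizeof(uint16_t);
        default:              return 0;  // unsupported for conversion
    }
}

static uint16_t f32_to_bf16_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffffu) > 0x7f800000u) return static_cast<uint16_t>((u >> 16) | 0x0040u);
    u += 0x7fffu + ((u >> 16) & 1u);  // round-to-nearest-even
    return static_cast<uint16_t>(u >> 16);
}

static float bf16_bits_to_f32(uint16_t h) {
    uint32_t u = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Restore n elements saved as `saved` into a cache row of type `current`.
// Same type: raw copy. F32 <-> BF16: convert. Anything else: fail, as the hard match does today.
static bool restore_state_row(StateType saved, const void * src,
                              StateType current, void * dst, size_t n) {
    const size_t src_sz = elem_size(saved);
    const size_t dst_sz = elem_size(current);
    if (src_sz == 0 || dst_sz == 0) return false;
    if (saved == current) {
        std::memcpy(dst, src, n * src_sz);
        return true;
    }
    const unsigned char * in  = static_cast<const unsigned char *>(src);
    unsigned char *       out = static_cast<unsigned char *>(dst);
    for (size_t i = 0; i < n; ++i) {
        if (saved == StateType::F32) {           // F32 file -> BF16 cache (rounds)
            float f;
            std::memcpy(&f, in + i * src_sz, sizeof f);
            const uint16_t h = f32_to_bf16_bits(f);
            std::memcpy(out + i * dst_sz, &h, sizeof h);
        } else {                                 // BF16 file -> F32 cache (exact)
            uint16_t h;
            std::memcpy(&h, in + i * src_sz, sizeof h);
            const float f = bf16_bits_to_f32(h);
            std::memcpy(out + i * dst_sz, &f, sizeof f);
        }
    }
    return true;
}
```

The design point is that only the F32/BF16 pair is converted; any other mismatch still returns `false`, so the existing failure mode is preserved for types the recurrence kernels do not support.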
## 6. Coupling notes / sequencing
- Land ON TOP of patch 0021 (conv-state in-place fusion). If 0021's fused conv write assumes an f32
conv-state tensor, either (a) extend it to the cache tensor's dtype, or (b) keep `type_r` = F32 by
default and make ONLY `type_s` bf16 (conv bytes are negligible, so this loses nothing perf-wise and
de-risks 0021). Decision: ship `type_s`=BF16 first; make `type_r`=BF16 a follow-up gated on 0021's
conv path being dtype-generic.
- Kernel side (separate patch, not this wiring): `ggml/src/ggml-cuda/gated_delta_net.cu` currently
takes `const float * curr_state` / `float * state_dst` and does `s_shard[r] = read_state[i]`
(line 102) — hardcoded f32. The bf16 build needs the dispatch to read `s0->type` and route a
bf16 load/store specialization; the gather kernel `gdn_gather_nonident_kernel` (line 7, `const
float * cache`) likewise needs a bf16 variant. The cparams wiring here only selects the cache
dtype; the kernel patch consumes it. Patches 0018 (in-place) / 0019 (gather) state asserts must be
relaxed from f32-only to {f32,bf16}.
- CPU mirror `ggml-cpu/ops.cpp` GDN path needs the same bf16 load/store for CI parity / fallback.
## 7. Validation gate
- KL < 1e-3 and PPL-delta within tolerance vs the f32-state build over a 256-token greedy run, per
model (dense q36-27b-nvfp4, MoE q36-35b-a3b-nvfp4). If a model fails, that model sets
`cache_type_ssm: f32` in its gallery YAML (per-model opt-out) — the global default stays bf16.
- Add a `test-backend-ops` case for the GDN recurrence with bf16 state (mirror the 0021 harness:
dense text md5 + MoE byte check) to lock the load→f32→store→bf16 contract.
---
# Appendix - label `upstream-bf16-precedent` (READ-ONLY research)
Precedent + numeric-safety justification for the §1-7 wiring above. Sources: paged dev tree
(`dgx.casa:~/llama-paged-dev`, branch `paged`) and the vLLM checkout
(`~/vllm-bench/.../site-packages/vllm`).
## A.1 Upstream llama.cpp: recurrent-cache f32 is HARDCODED (no f16/bf16 path), not a documented numeric guard
The asymmetry to override: the attention KV cache type is user-tunable; the recurrent state cache is not.
- KV cache: `llama_context_params.type_k/type_v` default `GGML_TYPE_F16`
(`src/llama-context.cpp:3468-3469`), `[EXPERIMENTAL]` in `include/llama.h:365-366`, plumbed from
user params (`attn_type_k = params.type_k`).
- Recurrent/SSM cache: `llama_memory_recurrent(... type_r, type_s ...)` and the hybrid wrappers take
the recurrent types as ctor args, but EVERY call site in `src/llama-model.cpp` passes the literal
`GGML_TYPE_F32` (2056-2057 pure-recurrent; 2098-2099 hybrid-iswa `recurrent_type_r/s`;
2117-2118 hybrid `recurrent_type_k/v`). No cparams field feeds these - compile-time constants.
So mamba/mamba2/rwkv/falcon-h1/nemotron-h/qwen3.5 ALL get f32 recurrent + conv state unconditionally.
- Alloc: `r = ggml_new_tensor_2d(ctx, type_r, ...)`, `s = ggml_new_tensor_2d(ctx, type_s, ...)`
(`src/llama-memory-recurrent.cpp:100-101`). No f16 branch anywhere.
Is f32 a deliberate numeric constraint? Structural, not documented:
- `ggml_ssm_conv` / `ggml_ssm_conv_update_inplace` HARD-ASSERT f32 on conv state/kernel/x_cur/dst
plus `nb[0]==sizeof(float)` (`ggml/src/ggml.c:5581-5584,5589,5597`). Conv path is f32-locked at the
builder.
- `ggml_ssm_scan` does NOT assert input state `s` dtype, but hardcodes its OUTPUT as
`GGML_TYPE_F32` (`ggml/src/ggml.c:5662`); scan kernels read `s` as `float *`.
- `ggml/src/ggml-cuda/gated_delta_net.cu` takes `const float * curr_state`, `float * state`,
`float * state_dst`; the per-(seq,head) shard `float s_shard[rows_per_lane]` is loaded/stored as raw
float (34-102). Same in `ggml-cpu/ops.cpp`.
- NO code comment anywhere justifies "f32 for precision". The constraint is that the ops were written
float-only. => recurrent-cache-f32 is a hardcoded implementation default to override deliberately:
the 3 literal `GGML_TYPE_F32` call-site pairs (gate behind `type_s`/`type_r` per §3), the
gated_delta_net.cu load/store convert, and KEEP conv f32 unless its asserts are extended (conv bytes
are negligible - only the temporal `type_s` state needs bf16).
## A.2 vLLM: GDN temporal state cache is bf16 BY DEFAULT, fp32-accumulated in-kernel (the exact design)
- Dtype: `qwen3_next.py:780-787` -> `MambaStateDtypeCalculator.gated_delta_net_state_dtype` ->
`_mamba_state_dtype` (`mamba_utils.py:84-96`):
`conv_state_dtype = get_kv_cache_torch_dtype(mamba_cache_dtype, model_dtype)`;
`if mamba_ssm_cache_dtype == "auto": temporal_state_dtype = conv_state_dtype`.
With both knobs default `"auto"`, `get_kv_cache_torch_dtype("auto", model_dtype)` returns
`model_dtype` (`torch_utils.py:293-297`) = bf16 for Qwen3-Next => BOTH conv and temporal state are
bf16 by default. Explicit opt-out: `--mamba-ssm-cache-dtype float32` (mirror of our f32 fallback).
- In-kernel numerics (decode), `fla/ops/fused_recurrent.py`:
`b_h = tl.load(p_h0).to(tl.float32)` (303) load bf16->fp32; q/k/v/g/beta `.to(tl.float32)` (309-318);
recurrence in fp32 `b_h*=exp(g); b_v-=sum(b_h*b_k); b_v*=beta; b_h+=b_v*b_k; b_o=sum(b_h*b_q)`
(327-331); `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))` (337) store fp32->bf16. Prefill chunk path
identical (`b_h=tl.zeros(...,tl.float32)`, `+= load().to(fp32)`, 102/120).
=> byte-for-byte the proposed llama lever: load bf16->f32, math in f32 (UNCHANGED order, matches
gated_delta_net.cu's v-g*kv -> *beta -> S-update -> S^T q), store f32->bf16; only the persisted cache
crosses the bf16 boundary, once per step.
- vLLM numeric guards: NONE beyond fp32 accumulation - no per-step renorm, no clamp, no Kahan. Optional
`use_qk_l2norm_in_kernel` normalizes q,k (keeps k unit-norm) but does not touch the state.
- KDA nuance: `kda_state_dtype` returns `(state_dtype, torch.float32)` - Kimi Delta Attention keeps a
fp32 secondary component. qwen3.5 is `gated_delta_net` (fully-bf16 temporal state), but this shows
vLLM judged a fp32 component necessary for one delta variant -> reinforces keeping the f32 toggle.
Verdict: vLLM's own GDN state cache is bf16, so bf16-state in llama is a FAIR equal-precision target,
not a regression vs the competitor. bf16 brings llama TO vLLM's precision.
## A.3 Numeric-safety assessment for bf16 gated-DeltaNet state
Update: `S <- S*diag(exp(g)) + beta * k (x) (v - S k)`, with
`g = -exp(A_log)*softplus(a+dt_bias) < 0` in exact arithmetic, so `exp(g) in (0,1)` (strict
geometric decay), and `beta = sigmoid(.) in (0,1)`.
- Decay bounds error accumulation. bf16 keeps 8 significand bits (7 stored + 1 implicit) ->
per-element rel rounding `eps ~= 2^-8 ~= 3.9e-3` under round-to-nearest. An error injected at step t
is multiplied by `exp(g)<1` every later step -> carry-error is a CONTRACTING geometric series bounded
by ~`eps/(1-exp(g_mean))`, NOT linear/unbounded. That bound is a small constant multiple of one
step's eps only while `exp(g)` sits well below 1; it grows as `1/(1-exp(g))` for slowly decaying
heads. The recurrence is a contraction map - no divergence. (The "per-step renorm" framing is not a
literal renorm op in either codebase; the bound IS the `exp(g)<1` contraction + `beta in (0,1)` +
unit-norm k from the l2norm capping `||k (x) delta||`.)
- fp32 register accumulation is the minimal-error placement: load bf16->f32, do `S k`, `v-g*kv`,
`*beta`, the outer-product accumulate and `S^T q` ALL in fp32 (UNCHANGED math), store f32->bf16 once.
Identical to vLLM, which ships this as the Qwen3-Next default with no reported quality regression -
the strongest empirical safety evidence.
- Dominant risk is small KL/PPL drift, not instability. Gate KL<1e-3 + PPL-delta over 256-tok greedy
vs the f32 build; fall back to f32 via the §3c toggle if it fails. Keep conv state f32 (ssm_conv* is
f32-locked, conv bytes negligible) - no reason to risk it.
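The geometric-series bound from the first bullet can be checked numerically with the sketch below. It is a scalar simplification, assumed here for illustration: it models only the worst-case carried error `e <- decay * e + eps` and ignores the `k (x) (v - S k)` term; `rounding_error_bound` and `carried_error` are hypothetical names.

```cpp
#include <cstddef>

// Closed-form limit of e_{t+1} = decay * e_t + eps, valid for 0 <= decay < 1.
static double rounding_error_bound(double eps, double decay) {
    return eps / (1.0 - decay);
}

// Worst-case carried error after `steps` steps, starting from e_0 = 0:
// each step shrinks the old error by `decay` and injects one fresh rounding `eps`.
static double carried_error(double eps, double decay, size_t steps) {
    double e = 0.0;
    for (size_t t = 0; t < steps; ++t) {
        e = decay * e + eps;
    }
    return e;
}
```

With `eps = 2^-8`, a decay of 0.5 caps the carried error at 2 eps, while a decay of 0.999 allows roughly 1000 eps. The bound is always finite, but how small it is depends entirely on how far `exp(g)` sits below 1.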
Bottom line: (1) upstream recurrent-cache f32 is a hardcoded implementation default (conv asserts f32;
scan/gdn kernels float-only; no numeric-rationale comments) - override via §3's `type_s`/`type_r`
plumbing, bf16-default + f32 opt-out, touching only the temporal state. (2) vLLM's GDN temporal state
is bf16 by default (auto->model_dtype), fp32-accumulated, with `--mamba-ssm-cache-dtype float32`
opt-out - a fair equal-precision target. (3) bf16 GDN state is numerically safe: `exp(g)<1` decay contracts
rounding to a bounded geometric series, fp32 registers confine bf16 rounding to one per-step cache
write, and vLLM ships this exact config in production. KL<1e-3 / PPL gate + f32 fallback is the right
safety net.
---
# PART B - label `bf16-kernel-plumbing` (the kernel/op edits §6 defers)
Part A wires the cache DTYPE selection (cparams -> memory_params -> `s_l`/`r_l` alloc). Part B is the
consuming half: every kernel/op that reads or writes those caches, and the exact
load->f32->compute(f32, UNCHANGED)->store->bf16 conversion points. Traced against HEAD-after-0021 on
`dgx.casa:~/llama-paged-dev` (branch `paged`).
## B.1 Complete set of state-cache READERS/WRITERS (one op family only)
`s_l` (ssm_states_all) reaches compute through exactly ONE op family - the gated-DeltaNet recurrence -
via a strided VIEW from `build_rs` (graph base) that carries the cache dtype. The cache-touching srcs:
- `src[5]` `src_state` - the s0 read view (the cache, or the 0019 gather scratch).
- `src[6]` `src_state_dst` - the 0018 in-place write-back target (a view INTO the cache).
- `src[7]` `ids` - I32 seq map for the 0019 gather (no dtype concern).
No other op reads `s_l`. `build_rs` only re-strides (dtype rides through); the 0019
`gdn_gather_nonident_kernel` is the only other reader. So bf16 awareness localizes to: the 3 ggml.c
builders (asserts), cuda `supports_op`, `gated_delta_net.cu`, and the CPU mirror in `ops.cpp`.
## B.2 ggml.c builder asserts (relax F32-only -> {F32,BF16})
File `ggml/src/ggml.c`:
- `ggml_gated_delta_net` (6287): line 6308 `GGML_ASSERT(state->type == GGML_TYPE_F32)` ->
`... == GGML_TYPE_F32 || ... == GGML_TYPE_BF16`.
- `ggml_gated_delta_net_inplace` (6349): same `state` assert (~6366-6370) + any `src_state_dst`
type assert -> allow BF16.
- `ggml_gated_delta_net_inplace_ids` (6417): same `state` + `src_state_dst` relax.
- KEEP the op OUTPUT scratch f32: line 6327 `ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne)` stays. The
`[attn_scores | new_states]` output is a TRANSIENT graph tensor; the bf16 persisted write goes
through `src_state_dst`/`state` (in-place). The non-in-place fallback `cpy`s scratch->cache and
`ggml_cpy` already type-converts f32->bf16.
## B.3 CUDA supports_op
`ggml/src/ggml-cuda/ggml-cuda.cu`, `supports_op` case `GGML_OP_GATED_DELTA_NET` (3096): allow a BF16
`src[5]`/`src[6]` (add BF16 to the permitted state-src types).
## B.4 CUDA recurrence kernel `ggml/src/ggml-cuda/gated_delta_net.cu`
Template the kernel + gather + launch on the CACHE-pointer dtype (`bool STATE_BF16`); keep f32 valid so
the f32 opt-out is the SAME kernel. Include `<cuda_bf16.h>`; convert with `__bfloat162float` /
`__float2bfloat16`. ALL recurrence math (lines 106-200) stays in f32 registers, byte-for-byte UNCHANGED.
- Signatures: line 39 `const float * curr_state` -> `const STATE_T * curr_state`; line 57
`float * state_dst` -> `STATE_T * state_dst`; `read_state` (85-88) -> `const STATE_T * read_state`.
- LOAD (s0 -> f32 regs), lines 100-103:
`if constexpr (STATE_BF16) s_shard[r]=__bfloat162float(read_state[i]); else s_shard[r]=read_state[i];`
`s_shard` stays `float`.
- STORE-BACK (f32 regs -> bf16 cache):
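- non-keep final write (203-208): `if constexpr (STATE_BF16) state[col*S_v+i]=__float2bfloat16(s_shard[r]); else state[col*S_v+i]=s_shard[r];` (use `if constexpr`, as in the load, rather than a `?:` whose two operands would have different types).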
- keep_rs_t snapshot (191-200) targets `dst + attn_score_elems` = the f32 OUTPUT scratch (kept f32
per B.2); this is the prefill/rollback path (n_rs_seq>0), decode is `!keep_rs_t`. KEEP it f32.
Only the CACHE pointers (`curr_state` src[5], `state_dst` src[6]) are STATE_T.
- 0019 gather `gdn_gather_nonident_kernel` (7-30): `const float * cache` -> `const STATE_T * cache`;
`if constexpr (STATE_BF16) dst[i]=__bfloat162float(src[i]); else dst[i]=src[i];`. Keep `scratch` OUTPUT f32 (pool alloc
326-333 stays `ggml_cuda_pool_alloc<float>`) so the non-identity read path feeds f32; the identity
in-place path reads bf16 directly. `read_state`'s dtype follows the branch that selected it.
- Dispatcher (270-353):
- casts 299/323 `(const float *)src_state->data`, 312 `(float *)src_state_dst->data` ->
`(const nv_bfloat16 *)` / `(nv_bfloat16 *)` when `type == GGML_TYPE_BF16`; branch launch on type.
- asserts 309-311: `src_state_dst->type == GGML_TYPE_F32` -> allow BF16; `nb[0] == sizeof(float)` ->
`== ggml_type_size(type)`; `nb[1] == S_v*S_v*H*sizeof(float)` -> `... * ggml_type_size(type)`.
- q/k/v/g/beta strides (348-353) are ACTIVATION (f32) strides - UNCHANGED. Kernel indexes state by
ELEMENT (`col*S_v+i`), so the typed pointer halves the byte stride implicitly.
- `launch_gated_delta_net` (212-) + S_v switch (230-260): thread `STATE_BF16` into the
`gated_delta_net_cuda<S_v, KDA, keep_rs_t, STATE_BF16>` instantiations.
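The compile-time dispatch described in this section can be sketched in host-only C++ as below. It is an illustration under assumptions, not the CUDA code: `bf16_host` is a hypothetical stand-in for `nv_bfloat16`, `state_elem_t`, `load_state` and `store_state` are made-up names, and the bit-level conversion replaces the `__bfloat162float` / `__float2bfloat16` intrinsics (NaN handling is omitted for brevity). It needs C++17 for `if constexpr`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical host stand-in for nv_bfloat16 (16-bit storage only).
struct bf16_host { uint16_t bits; };

// Cache element type selected at compile time by the STATE_BF16 flag.
template <bool STATE_BF16>
using state_elem_t = std::conditional_t<STATE_BF16, bf16_host, float>;

// Load one cache element into an f32 register value.
template <bool STATE_BF16>
static inline float load_state(const state_elem_t<STATE_BF16> * p, size_t i) {
    if constexpr (STATE_BF16) {
        uint32_t u = static_cast<uint32_t>(p[i].bits) << 16;  // widen bf16 -> f32
        float f;
        std::memcpy(&f, &u, sizeof f);
        return f;
    } else {
        return p[i];                                          // f32 cache: plain load
    }
}

// Store an f32 register value back into the cache at its own dtype.
template <bool STATE_BF16>
static inline void store_state(state_elem_t<STATE_BF16> * p, size_t i, float v) {
    if constexpr (STATE_BF16) {
        uint32_t u;
        std::memcpy(&u, &v, sizeof u);
        u += 0x7fffu + ((u >> 16) & 1u);                      // round-to-nearest-even
        p[i].bits = static_cast<uint16_t>(u >> 16);
    } else {
        p[i] = v;                                             // f32 cache: exact store
    }
}
```

Because both helpers index the typed pointer by element, the bf16 instantiation covers half the bytes per index without any stride arithmetic changing, and the `false` instantiation compiles to the same plain float load/store as the existing f32 path.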
## B.5 CPU reference `ggml/src/ggml-cpu/ops.cpp` (parity / CI / CPU-offload fallback)
`ggml_compute_forward_gated_delta_net_one_chunk` (10662) + `_f32` (10847), dispatch (10915):
- LOAD: 10726 `const float * state_in_base = (const float *)src_state->data`, the rs_head/gather read
10744-10745, and 10891 `const float * cache = (const float *)src_state->data` -> when
`src_state->type == GGML_TYPE_BF16`, read `GGML_BF16_TO_FP32(((const ggml_bf16_t*)..)[..])`.
- STORE: 10758-10762 `inplace_state_base = (float *)src_state_dst->data` -> store
`((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`; relax asserts `nb[0]`/`nb[1]`
to `ggml_type_size(type)`. Keep ONE impl, branch load/store on `src_state->type`.
## B.6 Conv state (`r_l`) -> bf16 : DEFER (optional, low-value, prefill snag)
Conv state ~12.6 MB total, LAUNCH-bound (0021 removed concat/cpy); bf16 saves ~0 ms, adds complexity:
- DECODE (0021 fused) `ggml_ssm_conv_update_inplace` (ggml.c:5566) asserts 5581-5584
`conv_states/conv_state_dst->type == F32`; CUDA `ssm_conv_update_f32` (ssm-conv.cu:131) + CPU
`ggml_compute_forward_ssm_conv_update_f32` (ops.cpp:9471) read/write f32. To bf16: relax the 2
asserts, template tap LOAD (`__bfloat162float`) + ring write-back STORE (`__float2bfloat16`), cast
`conv_states`/`conv_state_dst` ptrs in both dispatchers.
- PREFILL (non-fused) `build_conv_state` (delta-net-base.cpp:449-524): `conv_states=build_rs(...)`
(bf16 view) then `ggml_concat(conv_states, qkv_mixed, 0)` (472). **`ggml_concat` requires same type**
- qkv_mixed is f32 -> bf16 conv cache BREAKS the prefill concat (needs an f32 staging view of the
taps first; the ring write-back `ggml_cpy` at 496/520 already converts; concat is the blocker).
RECOMMENDATION: keep `type_r` = F32 in v1 (matches Part A §6). Ship `type_s`=BF16 first; `type_r`=BF16
is a follow-up that adds the f32 staging view.
## B.7 Confirm UNTOUCHED: full-attn KV-cache (16 layers) + FP4 weights
- KV-cache: the `llama_kv_cache` half of `llama_memory_hybrid`, alloc with `params.type_k/type_v`
(llama-model.cpp 2030-2031 / 2089-2090 / 2108-2109). Part A changes ONLY the recurrent half's
`type_s`; `attn_type_k`/`attn_type_v` untouched. Paged-KV gather (0003-0011), flash-attn,
`type_k()/type_v()` accessors (kv-cache.h 161-162/381-382) unaffected.
- FP4 weights (nvfp4 dense + MoE): model weights, separate from runtime state caches; recurrence/conv
kernels read STATE not weights. FP4 GEMM (0017/0020) untouched.
- Activations (q/k/v/g/beta, attn-out, z) stay f32 (<1% of bytes). Only persisted `s_l` rows narrow.
## B.8 Conversion-point cheat-sheet (the ONLY numeric-precision boundaries)
1. CUDA load `gated_delta_net.cu` ~102: `s_shard[r] = __bfloat162float(read_state[i])`.
2. CUDA store ~207: `state[col*S_v+i] = __float2bfloat16(s_shard[r])`.
3. CUDA gather ~20: `dst[i] = __bfloat162float(src[i])` (bf16 cache -> f32 scratch).
4. CPU load `ops.cpp` ~10726/10744/10891: `GGML_BF16_TO_FP32(((ggml_bf16_t*)src_state->data)[..])`.
5. CPU store ~10762: `((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`.
Everything between (1)/(4) and (2)/(5) is f32-register math, identical to today's f32 kernel. Only the
persisted cache rounds to bf16 once per step; `exp(g)<1` geometric decay bounds the rounding.
## B.9 File-by-file edit table (Part B)
| File | Edit |
|---|---|
| `ggml/src/ggml.c` | relax `state`/`src_state_dst` F32 asserts -> allow BF16 in the 3 GDN builders (6308, ~6370, ~6430); keep output scratch F32 (6327) |
| `ggml/src/ggml-cuda/ggml-cuda.cu` | `supports_op` GATED_DELTA_NET (3096): allow BF16 state src |
| `ggml/src/ggml-cuda/gated_delta_net.cu` | template kernel+gather+launch on STATE_BF16; `__bfloat162float` load / `__float2bfloat16` store; cast src_state/src_state_dst ptrs; relax dispatcher asserts (309-311) to `ggml_type_size(type)`; keep gather scratch + keep_rs snapshot f32 |
| `ggml/src/ggml-cpu/ops.cpp` | bf16 load/store branch in GDN ref (10726/10744/10758-10762/10891); relax asserts |
| `tests/test-backend-ops.cpp` | add BF16-state GATED_DELTA_NET case (CUDA bf16 vs CPU bf16) |
| (deferred) conv: `ggml.c:5581-84`, `ssm-conv.cu:131`, `ops.cpp:9471`, `delta-net-base.cpp:472` | v2 only - f32 staging before prefill concat |
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,37 +0,0 @@
# bf16 SSM state - build/de-risk progress
DECISION (user override of plan): f32 DEFAULT + bf16 OPT-IN. type_s default = GGML_TYPE_F32.
Conv state (type_r) stays F32. Recurrence math stays f32 (load->f32, store->cache dtype).
## STEP 1 (dtype-generic kernel + op) - DONE + DE-RISK GATE 1 PASSED
Files (DGX ~/llama-paged-dev):
- ggml/src/ggml.c: 3 GDN builder asserts F32 -> {F32,BF16}; state_dst nb[0] -> ggml_type_size.
- ggml/src/ggml-cuda/gated_delta_net.cu: gdn_state_t<STATE_BF16> alias; gather + recurrence kernel +
launchers templated on STATE_BF16; __bfloat162float load / __float2bfloat16 store; gather scratch
shares cache dtype (uniform read); dispatcher detects src_state->type, GDN_DISPATCH macro 8-way.
- ggml/src/ggml-cpu/ops.cpp: byte-based read base + read_bf16 load conversion; bf16 in-place
convert-store after token loop; bf16 gather widening; relaxed asserts to ggml_type_size.
- ggml/src/ggml-cpu/ggml-cpu.c: work-size +S_v*S_v for bf16 in-place.
- tests/test-backend-ops.cpp: state_type field on test_gated_delta_net; 16 bf16 cases (hs 64+128 x
decode/prefill/keep_rs x kda).
GATE 1: build clean (EXIT=0); test-backend-ops -o GATED_DELTA_NET = 52/52 OK (CUDA bf16 vs CPU bf16).
## STEP 2/3/4 (cparams opt-in wiring) - DONE
f32 DEFAULT everywhere; --cache-type-ssm bf16 opts in.
- llama.h/llama-context.cpp/llama-memory.h/llama-model.cpp: type_r/type_s plumbed, DEFAULT F32.
- common.h/common.cpp/arg.cpp: cache_type_ssm/conv (F32 default) + --cache-type-ssm/-conv CLI.
- llama-memory-recurrent.cpp: convert-on-mismatch f32<->bf16 (r and s) via ggml_*_row API.
## EXTRA FIX (plan B.1 miss): build_rs rs_zero clear used ggml_scale (f32-only) -> bf16 abort.
- llama-graph.cpp: f32 keeps ggml_scale_inplace (bit-exact); non-f32 uses ggml_fill_inplace.
- fill.cu + ops.cpp + ggml.c: added BF16 to ggml_fill. get_rows/cpy already bf16-capable.
## DE-RISK GATE - ALL PASS
- build clean EXIT=0 (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench).
- test-backend-ops -o GATED_DELTA_NET = 52/52 (16 bf16 cases: decode/prefill/keep_rs x kda x hs64/128).
- f32 default md5: dense 5951a5b4... MoE 07db32c2... == 0023 (non-invasive; also --cache-type-ssm f32 matches).
- bf16 opt-in: coherent "Paris", no crash; byte-identical to f32 on 48-tok sample (Same-top-p 100%).
- diff backup: ~/llama-paged-dev/BF16_SSM_STATE.diff (1003 lines, 15 files). NOT committed/pushed.
READY FOR C.2 KL GATE (GateBench).

@@ -1,203 +0,0 @@
# bf16 SSM-state cache - BUILD + DE-RISK RESULTS
Label: bf16-build-derisk (the GPU build agent). Lands on top of patch 0023 (HEAD f7409c2) on the DGX
dev tree `~/llama-paged-dev` (branch `paged`). Status: **DE-RISK GATE PASSED, READY FOR THE C.2 KL
GATE (GateBench).** Work is built into `build-cuda` and saved as `~/llama-paged-dev/BF16_SSM_STATE.diff`
(uncommitted on the dev tree; the 0024 ship/shelve decision is gated on GateBench's KL results).
## DECISION applied (user override of the plan): f32 DEFAULT + bf16 OPT-IN
The plan defaulted bf16; the user wants f32 to stay the bit-exact DEFAULT and bf16 to be opt-IN via
`--cache-type-ssm bf16`. So `type_s` default = `GGML_TYPE_F32`, `type_r` default = `GGML_TYPE_F32`
(conv stays f32 always, per plan C.0). Only the persisted RECURRENT (temporal) state narrows to bf16
when opted in; recurrence math stays f32 (load->f32, compute f32, store->cache dtype). The opt-in is
non-invasive: with no flag the output is byte-identical to 0023.
## Files changed (15; full diff = ~/llama-paged-dev/BF16_SSM_STATE.diff, 1003 lines)
STEP 1 - dtype-generic kernel + op (the de-risk core):
- `ggml/src/ggml.c` - 3 GDN builder `state`/`state_dst` asserts F32 -> {F32,BF16}; `state_dst->nb[0]`
`sizeof(float)` -> `ggml_type_size(state_dst->type)`. Also relaxed the `ggml_fill` builder assert to
allow BF16 (needed by the rs_zero clear; see below).
- `ggml/src/ggml-cuda/gated_delta_net.cu` - `gdn_state_t<STATE_BF16>` alias (`nv_bfloat16`/`float`);
recurrence kernel + gather kernel + both launchers + the dispatcher templated on `STATE_BF16`.
LOAD `__bfloat162float`, STORE `__float2bfloat16`; the gather scratch is allocated at the CACHE
dtype so `read_state` is a single uniform dtype (no mixed-dtype read path - eliminates the plan-R2
landmine). The keep_rs snapshot + the non-in-place final write stay f32 (op output scratch); the
bf16 store happens ONLY on the in-place cache path. `supports_op` already returned `true`
unconditionally for GATED_DELTA_NET, so no change there.
- `ggml/src/ggml-cpu/ops.cpp` - byte-based prior-state read base + `read_bf16` load conversion
(`GGML_BF16_TO_FP32`); bf16 in-place convert-store after the per-(head,seq) token loop
(`GGML_FP32_TO_BF16`); bf16-widening non-identity gather; relaxed `nb[]` asserts to
`ggml_type_size`. Added a `ggml_compute_forward_fill_bf16` + dispatch case.
- `ggml/src/ggml-cuda/fill.cu` - BF16 case in the fill kernel switch.
- `ggml/src/ggml-cpu/ggml-cpu.c` - GDN work-size adds the extra `S_v*S_v` f32 buffer when the cache is
bf16 in-place (mirror of `need_work` in ops.cpp).
- `tests/test-backend-ops.cpp` - `state_type` field on `test_gated_delta_net`; 16 bf16-state cases
(head_size 64 + 128 x {decode, multi-token prefill 33/64/100, keep_rs_t K=4} x kda 0/1, n_seqs 1/2).
STEP 2/3/4 - cparams opt-in wiring (f32 DEFAULT):
- `include/llama.h` - `type_r`/`type_s` in `llama_context_params` (adjacent to type_k/type_v).
- `src/llama-context.cpp` - default-params `type_r = type_s = GGML_TYPE_F32`; `params_mem` passes them.
- `src/llama-memory.h` - `type_r`/`type_s` in `llama_memory_params`.
- `src/llama-model.cpp` - the 3 hardcoded `GGML_TYPE_F32` recurrent ctor pairs (recurrent /
hybrid_iswa / hybrid = the qwen35/qwen35moe path) now pass `params.type_r` / `params.type_s`.
- `src/llama-memory-recurrent.cpp` - back-compat: `state_read_data` converts f32<->bf16 on type
mismatch (helper `recurrent_read_convert_rows` via the public `ggml_bf16_to_fp32_row` /
`ggml_fp32_to_bf16_row`) instead of failing, for both r and s; lets an f32-saved session restore
into a bf16 cache and vice versa.
- `src/llama-graph.cpp` - `build_rs` rs_zero clear: f32 keeps the exact `ggml_scale_inplace(.,0)` op
(bit-exactness); bf16 (and any non-f32) state uses `ggml_fill_inplace(.,0)` (CUDA scale is f32-only;
this was the one extra state-touching op the plan's "one op family" claim missed). get_rows + cpy
on the extra-states path already support bf16, so no change needed there.
- `common/common.h` / `common/common.cpp` / `common/arg.cpp` - `cache_type_ssm` / `cache_type_conv`
(default F32) + `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv` CLI (reusing the
existing `kv_cache_type_from_str`, which already maps `f32`/`bf16`).
## DE-RISK GATE - ALL PASS
1. **Build clean** (build-cuda, CUDA arch 121): EXIT=0 for ggml/ggml-cuda/ggml-cpu/llama/llama-common
and the binaries (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench).
2. **test-backend-ops -o GATED_DELTA_NET = 52/52 PASS** (CUDA backend vs CPU reference). Includes all
16 new bf16-state cases (CUDA bf16 vs CPU bf16) covering decode (n_tokens==1), multi-token
prefill/chunk (33/64/100), and keep_rs_t (K=4), with kda on/off and head_size 64 + 128 (production
S_v). The bf16 op test is the deterministic R2 de-risk for the load/compute/store contract.
3. **f32-default md5 == 0023 baseline (opt-in is non-invasive):**
- dense q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == 0023 (no flag AND `--cache-type-ssm f32`)
- MoE q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == 0023
Command: `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`.
4. **bf16 opt-in coherence + engaged (dense, `--cache-type-ssm bf16`):** no crash; coherent + on-topic.
- 48-tok greedy ("The capital of France is"): "**Paris**." - byte-identical to f32 (md5 5951a5b4...),
i.e. Same-top-p = 100% over that short sample (the `exp(g)<1` decay bounds the per-step rounding so the
argmax trajectory is unchanged at short length).
- 256-tok greedy ("Explain how a transformer LM generates text, step by step"): fluent, well-structured
step-by-step explanation, and the bf16 md5 (`fc82b4cd44f8ec999c3b6843eb3f8c61`) **DIVERGES** from
f32 (`554cc667a2e62a47c34a999a127ac7e5`) - definitive proof that bf16 is genuinely ENGAGED (not a
silent f32 fallback) and behaves as expected (non-bit-exact, coherent). The per-token divergence
is exactly what the C.2 teacher-forced KL gate quantifies.
- Independent proof bf16 is allocated: BEFORE the build_rs fill fix, decode aborted in
`ggml_cuda_op_scale` on the recurrent-state tensor - an f32 cache would never have reached that
bf16-only failure, so the opt-in demonstrably allocates bf16. Wiring is also directly traceable:
`--cache-type-ssm bf16` -> cache_type_ssm -> cparams.type_s -> params_mem.type_s -> the
llama_memory_hybrid recurrent `s_l` alloc.
CONFIRM: ready for the C.2 KL-divergence + PPL-delta + long-context drift gate (GateBench).
## A landmine fixed beyond the plan (record for the gate/ship agents)
The plan B.1 asserted `s_l` reaches compute through ONLY the gated-DeltaNet op. It also flows through
`build_rs`: (a) the rs_zero restart-slot clear used `ggml_scale_inplace(.,0)`, and `ggml_cuda_op_scale`
hard-asserts f32 -> the first bf16 decode aborted in scale. Fixed by routing the bf16 clear through
`ggml_fill` (with a new bf16 fill path). (b) the extra-states `ggml_get_rows` + `ggml_cpy` already
support bf16 (verified) - no change. This is exactly the kind of non-decode state path the de-risk
was meant to surface; it is now covered end-to-end (the bf16 coherence run exercises rs_zero on the
fresh-sequence prompt).
## NOT done in this phase (next agents)
- STEP 5 LocalAI gRPC/YAML (`CacheTypeSSM`/`CacheTypeConv` proto + grpc-server + model_config +
options + meta registry) - needed to force f32/bf16 from a gallery YAML; not on the de-risk gate.
- STEP 6 capability fallback (device-match probe to demote bf16->f32 before alloc on a device lacking
the bf16 GDN/fill path, e.g. CPU-offloaded GDN). The CPU reference DOES implement bf16 (load/store/
gather/fill) so a CPU fallback is numerically correct today, but the probe is the clean guard.
- The C.2 KL/PPL/long-context gate + the C.3 nsys per-call bench - GateBench (GPU gate agent, runs
sequentially after this build phase; binaries are pre-built in build-cuda).
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# C.2/C.3 ACCEPTANCE GATE + PARITY BENCH RESULTS (label bf16-gate-bench)
Status: **GATE FAILS -> NO-SHIP. KEEP SHELVED. patch 0024 NOT created; nothing committed.**
All runs on `dgx.casa` build-cuda binaries, wikitext-2-raw test, `-ngl 99 -fa on --seed 1`.
Corpus: `~/bench/klgate/wikitext-2-raw/wiki.test.raw` (symlink to `~/wikitext-2-raw`, ~280k tokens).
## 1. KL acceptance gate
### Noise floor (f32-vs-f32, c256 chunks32) - the non-determinism floor
| model | Mean KLD | Max KLD | Same-top-p | ln(PPL(Q)/PPL(base)) |
|---|---|---|---|---|
| dense q27 | -1.3e-5 | 1e-6 | 100.000% | +0.001256 |
| MoE q35 | ~0 (-3e-7) | 5.9e-5 | 100.000% | +0.000607 |
### Headline 256-token gate (bf16-vs-f32, c256 chunks32) - PASSES, but vacuously
bf16 c256 is **byte-identical to the floor** for both models (Mean KLD -1.3e-5 dense / ~0 MoE,
Same-top-p 100%, identical PPL). Reason: a single 256-token window is processed in ONE ubatch
(ub512 > 256), so the recurrent state is written to the bf16 cache only ONCE at the chunk end and is
NEVER read back to produce that window's logits. The 256-token gate therefore does NOT exercise the
bf16 round-trip at all - it is blind to the actual cost.
### Long-context drift sweep (bf16-vs-f32, chunks8) - FAILS HARD for BOTH models
| model | ctx | Mean KLD | Same-top-p | Max KLD | 99.9% KLD |
|---|---|---|---|---|---|
| dense | 256 | -1.3e-5 | 100.000% | 1e-6 | 0 |
| dense | 1024 | 0.0647 | 91.54% | 20.17 | 7.69 |
| dense | 2048 | 0.1739 | 90.65% | 24.89 | 18.18 |
| dense | 4096 | 0.1258 | 90.40% | 26.03 | 17.22 |
| MoE | 256 | ~0 | 100.000% | 5.6e-5 | 4.9e-5 |
| MoE | 1024 | 0.0472 | 90.04% | 5.13 | 0.95 |
| MoE | 2048 | 0.0442 | 90.84% | 1.85 | 1.11 |
| MoE | 4096 | 0.0422 | 89.97% | 2.76 | 0.83 |
Gate thresholds: Mean KLD < 1e-3; Same-top-p >= 99.5%; |ln(PPL ratio)| < 0.005;
drift MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0%.
Result: 256-tok PASS (vacuous); **drift FAIL by ~50-170x on Mean KLD and ~9 pts on Same-top-p**
(top-p ~90% = roughly 1 token in 10 flips its argmax at >=1024 ctx). FAIL for both dense and MoE.
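The thresholds above can be sketched as a small checker. This is a minimal sketch, not the harness used for these runs: the function names are hypothetical, and the `floor` clamp is an assumption (the 256-token Mean KLD is at or below zero, so a literal 1.5x ratio would be degenerate; the sketch clamps the baseline to the 1e-3 headline threshold).

```python
# Hypothetical helper names; the numeric thresholds are the ones quoted in the gate definition.
MEAN_KLD_MAX = 1e-3
SAME_TOP_P_MIN = 99.5      # percent
LN_PPL_RATIO_MAX = 0.005
DRIFT_RATIO_MAX = 1.5
DRIFT_TOP_P_MIN = 99.0     # percent


def passes_headline(mean_kld, same_top_p, ln_ppl_ratio):
    """256-token gate: Mean KLD, Same-top-p and |ln(PPL ratio)| all within bounds."""
    return (mean_kld < MEAN_KLD_MAX
            and same_top_p >= SAME_TOP_P_MIN
            and abs(ln_ppl_ratio) < LN_PPL_RATIO_MAX)


def passes_drift(mean_kld_256, mean_kld_4096, same_top_p_4096, floor=MEAN_KLD_MAX):
    """Long-context drift gate. ASSUMPTION: baseline is clamped to `floor`,
    because the measured 256-token Mean KLD is ~0 or slightly negative."""
    baseline = max(mean_kld_256, floor)
    return (mean_kld_4096 <= DRIFT_RATIO_MAX * baseline
            and same_top_p_4096 >= DRIFT_TOP_P_MIN)


# Values copied from the tables above.
print(passes_headline(-1.3e-5, 100.0, 0.001256))  # dense c256 row -> True
print(passes_drift(-1.3e-5, 0.1258, 90.40))       # dense ctx 4096 -> False
print(passes_drift(0.0, 0.0422, 89.97))           # MoE ctx 4096 ("~0" baseline) -> False
```

Even with the generous clamp, both models fail the drift gate on both criteria, which matches the verdict above.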
### Discrimination (is it a bug or genuine bf16?) - dense c1024 chunks8
- **f32 long-context floor c1024**: Mean KLD -1.2e-5, Same-top-p 100% -> the bf16 divergence is REAL
signal, not a long-context measurement artifact.
- **bf16 KLD is invariant to ubatch-boundary count** (= the cross-ubatch state read-back frequency):
ub1024 (0 internal boundaries) 0.0642 / 91.19%; ub512 (1) 0.0647 / 91.54%; ub256 (3) 0.0639 /
91.17%; ub64 (15) 0.0682 / 90.97%. Flat. -> The error is INTRINSIC to bf16 over the long
recurrence INSIDE a single op call, NOT a chunked-prefill/keep_rs/gather handoff bug (R2 ruled out;
test-backend-ops 52/52 already passed). The error PLATEAUS with context (contraction), i.e. it is
bounded but LARGE: the gated-DeltaNet has long-memory heads (exp(g) ~ 1), so the g<1 decay does NOT
tightly contract the per-step bf16 rounding the way the plan's A.3 optimistically assumed.
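The internal-boundary counts quoted for each ubatch size follow from simple arithmetic, sketched below. The function name is hypothetical; the only assumption is that a window of `ctx` tokens is split into consecutive ubatches of at most `ubatch` tokens.

```python
import math


def internal_ubatch_boundaries(ctx, ubatch):
    """Number of ubatch hand-offs (state write + read-back) inside one window."""
    return math.ceil(ctx / ubatch) - 1


# The c1024 discrimination runs: ub1024, ub512, ub256, ub64.
for ub in (1024, 512, 256, 64):
    print(ub, internal_ubatch_boundaries(1024, ub))  # 0, 1, 3, 15

# The 256-token headline gate with ub512: no internal boundary at all.
print(internal_ubatch_boundaries(256, 512))  # 0
```

The last line is the reason the 256-token gate is vacuous: with zero internal boundaries the persisted state is never read back inside the window.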
Note (CORRECTED): this is NOT vLLM's precision. vLLM keeps the GDN **temporal state in f32** (proven
three ways in BITEXACT_VS_VLLM.md: empirical kernel-boundary tensor dtype, the config chain, and the
bandwidth regime; only vLLM's tiny conv state is bf16). So bf16 temporal here is a step BELOW vLLM's
recurrent precision, not a match. (An earlier byte-gate draft mislabeled vLLM as bf16-state; that was
refuted.) At equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak BW).
## 2. Parity bench - the perf lever IS real
### nsys recurrence per-call (graphs OFF, npp4 ntg32 npl128) - gated_delta_net_cuda Avg
| model | f32 ms/call | bf16 ms/call | delta |
|---|---|---|---|
| dense q27 | 3.381 | 1.726 | **-49.0%** |
| MoE q35 | 2.245 | 1.153 | **-48.6%** |
The predicted 3.49 -> 2-3 ms/call lever LANDED (even beat it). Total GPU time dropped too (dense
~12.05 -> ~9.05 s graphs-off). bf16 halving the persisted SSM-state bytes halves the dominant decode
kernel, exactly as designed.
### End-to-end decode throughput (S_TG aggregate, npp128 ntg128, graphs ON unless noted)
| model | npl | f32 t/s | bf16 t/s | note |
|---|---|---|---|---|
| dense | 32 | 212 | 239 | +12.8% |
| dense | 128 | 371-376 (stable) | 287 / 336 / 487 / 498 (BIMODAL) | clean ~490 = +31%; bad runs from a CUDA-graph instability on the dense path |
| dense | 128 | 371.67 (graphsOFF) | 486.68 (graphsOFF) | clean +31% |
| MoE | 32 | 449 | 509 | +13.4% |
| MoE | 128 | 767 | 958 | +24.9% (clean, nsys-corroborated) |
% of vLLM (391 t/s dense reference): f32 default = 95-96% (bit-exact, higher precision than vLLM);
bf16 clean ~490 = **125%** (but unstable on dense + fails the numeric gate). MoE bf16 +25% is clean.
## 3. DECISION: NO-SHIP / KEEP SHELVED
- The KL gate **fails** the long-context drift criterion for both models: bf16 SSM state changes
~10% of tokens at >=1024 ctx vs our f32 (Same-top-p ~90%, Mean KLD 0.04-0.17). It is therefore NOT
a quality-neutral opt-in and cannot honor the project's "f32 bit-exact default" promise.
- Per the task rule (gate FAIL -> do not ship as usable): **patch 0024 was NOT created and nothing was
committed** (DGX tree stays uncommitted; backup `~/llama-paged-dev/BF16_SSM_STATE.diff`).
- The perf lever is genuinely real (recurrence halves; dense ~490 t/s = 125% of vLLM when clean; MoE
+25%), but bf16 temporal is BELOW vLLM's precision (vLLM keeps temporal f32), so it remains a valid
FUTURE option only if shipped as an explicitly-labeled "reduced-precision, NON-bit-exact, below-vLLM"
mode (never quality-neutral), AND the dense CUDA-graph throughput instability (bimodal 287..498) is
fixed first. The principled path is hybrid per-head precision (f32 long-memory heads + bf16 fast
heads) - keeps precision at-or-above vLLM while capturing most of the speedup.
- Recommendation: keep the shipped default as f32 bit-exact (95% of vLLM at higher precision). Shelve
bf16. Optional follow-up lever if precision must be cut: bf16 only on the SHORT-memory heads (those
with exp(g) well below 1), keeping long-memory heads f32 - a mixed-precision state that could pass
the gate while still cutting bytes; not implemented/measured here.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,339 +0,0 @@
# Bit-exact vs vLLM, and the f32-preserving-parity hunt (Qwen3.5 gated-DeltaNet)
Label: crossengine-bitexact (READ-ONLY, no GPU). Adversarial source+numerics study.
Model: q36-27b-nvfp4 (dense, `Qwen3_5ForConditionalGeneration`) / q36-35b-a3b-nvfp4
(MoE, `Qwen3_5MoeForConditionalGeneration`). Engines: llama dev `~/llama-paged-dev`,
vLLM 0.23.0 `~/vllm-bench`. Decode B=128, enforce-eager / graphs-off, GB10 (~273 GB/s).
> **CORRECTION NOTICE (supersedes the earlier draft of this file).** A prior pass concluded
> "vLLM's GDN state cache is bf16, so the 2x recurrence-DRAM gap is f32(llama)-vs-bf16(vLLM)
> width" (old B2/B3). **That is wrong.** It read `gated_delta_net_state_dtype(..., mamba_ssm_cache_dtype="auto")`
> as auto->model-dtype=bf16, but it did **not** trace the Qwen3.5-specific config override that
> reassigns `mamba_ssm_cache_dtype` from `"auto"` to `"float32"` *before* the state dtype is
> resolved. **vLLM stores this model's gated-DeltaNet temporal state in float32**, the same width
> as llama. Proof chain in Part B. Everything in Part C is re-derived from the corrected dtype.
>
> **INDEPENDENT RE-VERIFICATION (this pass, live DGX source).** Separate sub-agents reached
> *opposite* dtype readings (one f32, two bf16). The contradiction was resolved by reading every
> link of the chain directly, not by majority vote. All eight links confirm **float32 temporal
> state**: `config.json text_config.mamba_ssm_dtype = "float32"` (both served models);
> `config/cache.py:129` default `mamba_ssm_cache_dtype = "auto"`; the bench scripts
> (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`, `serve_nvfp4.sh`) pass **only**
> `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`,
> no `--dtype`); `config/vllm.py:847 __post_init__` -> `:856 try_verify_and_update_config()` (runs at
> finalize, before any state-dtype resolution); `MODELS_CONFIG_MAP` (`models/config.py:622-623`) maps
> both `Qwen3_5ForConditionalGeneration` and `Qwen3_5MoeForConditionalGeneration` ->
> `Qwen3_5ForConditionalGenerationConfig`; its override body (`config.py:546-549`)
> `if mamba_ssm_cache_dtype=="auto": cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype` **fires**
> (value "float32"); `mamba_utils.py:91-94` then takes the `!= "auto"` branch ->
> `temporal = STR_DTYPE_TO_TORCH_DTYPE["float32"] = torch.float32` (conv stays bf16);
> `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **temporal** (2nd)
> tuple element and allocates the cache (`:1136`) at f32; `ssm_state = self_kv_cache[1]` (`:1316/1596/1664`).
> The two bf16 sub-agent readings are **refuted** - they stopped at the `cache.py` default "auto" and
> never traced the `__post_init__` override. **Numeric corroboration:** at the measured vLLM duration
> 3.62 ms/call, bf16 (402 MB) would imply 111 GB/s = 41% peak (implausibly low for a tuned BW-bound
> Triton kernel); f32 (805 MB) implies 222 GB/s = 81% peak (the expected regime). f32 is the only
> reading consistent with both source *and* the measured time.
## Headline (two answers)
1. **Bit-exact-vs-vLLM (identical logits / probabilities) is IMPOSSIBLE - for this model and for any
two distinct engines.** B4 = CONFIRMED. The sharpest proof is the GDN recurrence itself: the two
kernels evaluate an *algebraically reassociated* expression (`g.Sigma` vs `Sigma.g`) on *different
reduction trees*, so they diverge **even if both ran pure f32 with identical inputs**. On top of
that the FP4 GEMM uses different operand precision (8-bit vs 4/16-bit activations) and different
accumulation - a >>ULP divergence in every projection and the LM head.
2. **bf16 SSM state is NOT the only way to reach vLLM decode throughput, and an f32-preserving lever
was missed.** vLLM reaches its throughput **with an f32 GDN state** (proven). Both engines move the
same ~805 MB f32/recurrence-call; the ~10% per-call gap is a bandwidth-**efficiency** gap on equal
bytes (llama ~74% of peak, vLLM ~81%), i.e. an occupancy/grid/coalescing lever that is **bit-exact
vs llama's own f32**. bf16 state is an *optional over-clock* (goes AHEAD of vLLM on the recurrence),
not a parity requirement. B2/B3 (as "bf16 width is the lever") = REFUTED.
---
# The five questions, answered (synthesis)
**Q1. Can llama be BIT-EXACT with vLLM? NO.** Two *binding* (>>ULP) divergence sources make
bit-identical logits impossible on their own: **(A1)** the FP4 GEMM - llama MMQ quantizes the
activation to **q8_1 (8-bit)** while vLLM runs cutlass **w4a4 (4-bit acts)** or marlin **w4a16
(16-bit acts)**; different operand precision + accumulation order -> ~1e-2 relative error in *every*
projection and the LM head; **(A2)** the GDN recurrence - llama computes `g*(Sigma round(S*k))`
(scalar decay *after* the reduction) while vLLM computes `Sigma round(round(g*h)*k)` (decay rounded
into each element *before* the reduction): an IEEE-754 reassociation on *different reduction trees*
(warp butterfly vs Triton `tl.sum`) that diverges **even with identical pure-f32 state and inputs**.
A dozen further ops (L2/RMSNorm, MRoPE, gate `exp`, flash-attn softmax) add close-but-not-equal
rounding. Cross-engine bit-exactness is impossible *in general* (FP non-associativity across distinct
GEMM/recurrence/norm kernel stacks); the determinism literature only buys run-to-run determinism
*within* one engine. **Weaker form reachable:** greedy **top-1 token agreement** is the right gate
(top-1 / KL / PPL-delta, never md5). It is probabilistic (flips at low-margin steps), **compounds**
with length (once one token differs the SSM/KV states fork), and is *weaker here* than a
same-precision run because of the A8-vs-A4 GEMM gap.
**Q2. Is bf16 SSM state the only path to vLLM decode throughput? NO - an f32-preserving lever exists
and bf16 is not even required for parity.** vLLM carries the **same f32 temporal state** (proven +
re-verified), so the recurrence gap is **bandwidth EFFICIENCY on equal f32 bytes** (llama 74% vs vLLM
81% of GB10 peak), ~10% per call, *not* a 2x width gap. The lever: **retune `gated_delta_net_cuda`
74% -> ~81%** - it launches 196608 tiny one-column blocks (butterfly-reduce per token); fold toward
fewer/larger `BV x BK` tiles + vectorized `f32x4` loads + better row coalescing, **keeping the
per-column reduction order -> BIT-EXACT vs llama's own f32** (md5-gateable). **Cost vs bf16:** zero
precision risk and bit-exact, but it can only **match** vLLM's recurrence BW (81%), never beat it;
worth ~+5% (~335 -> ~351 tok/s, ~90% of vLLM), and it caps below 100% unless stacked with the other
bit-exact levers (conv fusion 0021, activation fold, oproj MMQ 0020). The adversarial sweep of every
other f32 avenue (lossless sub-f32, delta/low-rank/sparse, recompute+checkpoint, 2nd-stream/overlap,
chunked recurrence) **FAILS** to beat it; recompute is bit-exact but only **ties** the irreducible
one-full-state-READ floor and is now moot (vLLM also writes f32, so you match its achieved BW, you
don't need to eliminate the write). bf16 remains the **only** lever that goes *ahead* of vLLM on the
recurrence (~440 tok/s) - an **over-clock**, not a requirement.
**Q3. Does bf16 state MATCH vLLM's precision or DEGRADE below it? It DEGRADES below vLLM.** (This
corrects the `precision-ground-truth` sub-agent's "matching, not degrading" claim, which rested on
the refuted bf16 reading.) vLLM keeps the **temporal/recurrent** state in **f32**; only its small
**conv** state is bf16 (llama keeps conv f32, so llama is *more* precise there). So bf16 **temporal**
state in llama (~8 mantissa bits) sits **below vLLM's f32 temporal** (~24 bits) - it is a deliberate
precision-for-speed trade, KL/PPL-gated vs llama's own f32 *and* a step under vLLM's recurrent-state
precision. A genuine "match vLLM's envelope" change would be f32 temporal (as today) + bf16 conv -
which costs llama precision only on a tiny stream and buys almost no BW.
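The mantissa gap can be made concrete with a toy accumulator. This is an illustration only, not the GDN kernel: it assumes an idealised long-memory head with decay g = 1.0 and a constant small update, and it emulates f32 and bf16 storage with the standard library (`to_f32`, `to_bf16` and `run` are hypothetical helper names; finite positive values only).

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to the nearest bf16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def run(store, steps=256, g=1.0, update=2.0 ** -9):
    """Update in f32 registers, then persist the state through `store` each step."""
    state = store(1.0)
    for _ in range(steps):
        state = store(to_f32(g * state + update))
    return state


print(run(to_f32))   # 1.5: every update is retained
print(run(to_bf16))  # 1.0: each update is below half a bf16 ulp and is rounded away
```

With f32 storage all 256 updates accumulate; with bf16 storage the state never moves. Real heads have g slightly below 1 and data-dependent updates, so the actual error is what the KL gate measures, not this toy.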
**Q4. What can "parity" mean here? Throughput at equal precision + a distributional quality bar -
never identical bits.** Bit-identical logits are impossible cross-engine, so "parity" = **(a)**
throughput (tok/s in the harness) at **(b)** a quality bar measured by **top-1 greedy agreement,
KL(llama||vLLM)/step, and PPL-delta**, never md5. Both engines already run the recurrence math in f32
registers; at **equal** precision (llama f32 temporal == vLLM f32 temporal) the *only* open variable
is throughput, and that gap is closable **bit-exactly** (Q2). If llama adopts bf16 temporal, "parity"
must be restated as "throughput >= vLLM at KL/PPL within gate vs llama's own f32" and reported as the
precision-for-speed trade it is.
**Q5. Did the prior analysis get B1-B4 right? B1 mostly; B2/B3 REFUTED; B4 CONFIRMED. Overturn the
"bf16 is required" framing - keep the bit-exact levers.**
- **B1 TRUE** (single-pass f32, load-once/store-once, 74% peak) - but its sub-claim "more efficient
than vLLM (41%)" is **REFUTED** (41% was the bf16 artifact; vLLM is ~81%, *more* efficient).
- **B2 REFUTED** - not an f32-vs-bf16 width gap; equal f32 bytes both sides, ~10% efficiency gap.
- **B3 REFUTED** as written - vLLM reaches its throughput **with f32 state**; a bit-exact f32
occupancy retune reaches vLLM's recurrence BW. bf16 is optional.
- **B4 CONFIRMED** - impossible, on two independent grounds (structural A1+A2; general FP
non-associativity across distinct kernel stacks).
- **Plan disposition:** do **not** overturn the conv-fusion (0021) bit-exact lever - keep it.
**Re-prioritize the bit-exact f32 occupancy/coalescing retune of `gated_delta_net_cuda` as the
parity path.** Treat bf16 temporal state as an explicitly-gated **over-clock for going beyond
vLLM**, reported as a precision-for-speed trade (below vLLM's f32 recurrent precision), NOT as a
parity-matching change.
---
# PART A - Divergence inventory (per source: bit-identical vs close)
Per decode layer the two engines run *different kernels* for: FP4 GEMMs (proj + LM head), depthwise
conv+SiLU, q/k L2-norm, the GDN recurrence, gated RMSNorm; and on the hybrid's full-attention layers:
RMSNorm q/k-norm, MRoPE, flash attention, a sigmoid gate.
## A1. NVFP4 dequant + FP4 GEMM -- NOT bit-identical (diverges >> ULP)
- **llama**: MMQ (`mmq.cuh` `block_fp4_mmq`, nvfp4 block=16, 4x ue4m3 sub-scales). Host path
(`ggml-cuda.cu` ~1955-2014) **quantizes the activation (src1) to q8_1** (`block_q8_1_mmq`, **8-bit**,
block 32) and accumulates over K in the MMQ tile (DP4A / Blackwell FP4-MMA); tile order set by
`mmq_y`/`mmq_x` + the warp-MMA fragment layout.
- **vLLM**: `compressed_tensors_w4a4_nvfp4` -> cutlass FP4 GEMM on Blackwell (**4-bit** activations,
w4a4, per-group act-quant, e4m3 block scale x global FP8 tensor scale) or marlin fp4 fallback
(**16-bit** activations, w4a16, dequant->bf16 then bf16 GEMM). `apply_weights` -> `self.kernel`.
- **Verdict: not close.** (a) *Operand precision differs*: llama 8-bit acts vs vLLM 4-bit (cutlass) or
16-bit (marlin) - per-GEMM outputs differ at ~1e-2 relative, not ULP. (b) Scale-application order
differs. (c) Accumulation tiling/order differs (MMQ fragment vs cutlass/marlin). This is the largest
divergence and is present in every projection + the LM head, so logits differ materially on its own.
## A2. gated-DeltaNet recurrence -- NOT bit-identical, AND provably so even in pure f32
Both single-pass over an **f32** state (Part B). llama: `gated_delta_net.cu`
`gated_delta_net_cuda<128,KDA=false>`; vLLM: `fused_recurrent.py`
`fused_recurrent_gated_delta_rule_packed_decode_kernel`. Scalar-gate (GDA) path, `g.ne0==1`.
With S[k][v] (llama, transposed) == h[v][k] (vLLM):
```
llama: kv[v] = Sigma_k S_old[k][v]*k[k]        # OLD state; g applied AFTER the sum
       delta = (v[v] - g*kv[v])*beta;  S_new = g*S_old + k(x)delta;  o[v] = Sigma_k S_new[k][v]*q[k]
vLLM:  h'    = g*h_old                          # decay rounded into EVERY element first
       kv[v] = Sigma_k h'[v][k]*k[k] = Sigma_k round(g*h_old)*k;  b_v = (v[v]-kv[v])*beta
       h_new = h' + b_v(x)k;  o[v] = Sigma_k h_new[v][k]*q[k]
```
Algebraically identical (g scalar). **Numerically not**, for two structural reasons; the first survives even
with identical f32 state, identical inputs, and an identical reduction tree:
- **Reassociation:** llama forms `g*(Sigma round(S*k))` (scalar multiply *after* the reduction);
vLLM forms `Sigma round(round(g*h)*k)` (decay rounded into each element *before* the reduction).
Distributing a multiply across a sum is exact in R, not in IEEE-754. This is not a precision knob.
- **Different reduction trees:** llama `warp_reduce_sum<32>` (4 sequential per-lane FMAs + 5-step
butterfly) vs vLLM `tl.sum(...,1)` (Triton tree over the 128-wide BK axis).
**Verdict: not bit-identical; cannot be made so without rewriting one kernel to the other's op order.**
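The reassociation point can be checked on a two-element toy case. This is a minimal sketch, not either kernel: f32 rounding is emulated with the standard library, both variants use the same sequential reduction (so only the placement of `g` differs), no FMA is modelled, and the function names and inputs are hypothetical, chosen so that the rounding ties are easy to follow by hand.

```python
import struct


def f32(x):
    """Round a Python float to the nearest f32 value (ties to even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def g_kv_llama_order(state, k, g):
    """g * (Sigma_k round(S*k)): scalar decay applied AFTER the reduction."""
    acc = 0.0
    for s, kk in zip(state, k):
        acc = f32(acc + f32(s * kk))
    return f32(g * acc)


def kv_vllm_order(state, k, g):
    """Sigma_k round(round(g*h)*k): decay rounded into each element BEFORE the reduction."""
    acc = 0.0
    for h, kk in zip(state, k):
        acc = f32(acc + f32(f32(g * h) * kk))
    return acc


state = [1.0 + 2.0 ** -23, 1.0]   # both exactly representable in f32
k = [1.0, 1.0]
g = 0.75

a = g_kv_llama_order(state, k, g)
b = kv_vllm_order(state, k, g)
print(a == 1.5)                 # True: the sum rounds to 2.0 first, then 0.75 * 2.0
print(b == 1.5 + 2.0 ** -23)    # True: the per-element decay keeps the low bit
print(a == b)                   # False: one f32 ulp apart on identical inputs
```

The two orders differ by one ulp with identical f32 inputs and an identical reduction order, so matching the reduction tree alone would not make the kernels bit-identical.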
## A3. Depthwise conv1d (width 4) + SiLU -- NOT bit-identical
llama `ggml_ssm_conv` (ascending-j f32 FMA) + `ggml_silu`, conv state cached **f32**. vLLM
`causal_conv1d_update` (Triton) + SiLU, conv state cached **bf16** (`conv_state_dtype = bf16`; only the
*temporal* SSM state is forced f32 - Part B). Different kernel + different conv-state width + FMA order.
(Patch 0021 fuses llama's chain bit-exactly vs *llama's own* f32 path, not vs vLLM.)
## A4. q/k L2-norm + RMSNorm/RMSNormGated -- NOT bit-identical (close, ~1e-6)
L2-norm: llama standalone `ggml_l2_norm` (f32 tree) vs vLLM `l2norm_fwd`/in-kernel fold
(`USE_QK_L2NORM_IN_KERNEL`). RMSNorm: llama `ggml_rms_norm` vs vLLM `vllm_c` fused kernel (run log:
`rms_norm=['vllm_c','native']`); gated out-norm `build_norm_gated`=RMS*SiLU(z) vs `RMSNormGated`.
Different variance reduction tree / eps placement / fusion boundary.
## A5. MRoPE + gate scalar pipeline -- NOT bit-identical (close)
MRoPE: `ggml_rope_multi` (ggml sin/cos) vs vLLM rotary cos/sin cache (different theta eval + apply
order). Gate: vLLM computes `-exp(A_log)*softplus(a+dt)` then `exp` **in-kernel**; llama computes
`softplus(alpha+ssm_dt)*ssm_a` as split graph ops with `ssm_a` baking `-exp(A_log)` at GGUF-convert
time (rounded once), writes/reloads the intermediate, `expf` in-kernel. Same algebra, different
rounding points + convert-time vs runtime `exp(A_log)`.
## A6. Flash attention (full-attn layers) -- NOT bit-identical (close)
llama `ggml_flash_attn_ext` -> `fattn-mma-f16`/`fattn-vec` (online softmax, F16/F32 PV accum per
`GGML_PREC`) vs vLLM FlashInfer/FA2. Different tiling => different running max/sum order => different
rounding.
## A7. SiLU/sigmoid primitives + fusion -- equivalent IF inputs matched (they never do)
Both ultimately use the same hardware `expf`/`__nv_expf`; the primitives could match given identical
inputs, but every upstream value has diverged, and vLLM fuses act+quant / norm+quant differently than
llama's separate ops (run log `fuse_act_quant=True`), moving the rounding points.
### Inventory summary
| Source | bit-identical? | divergence size |
|---|---|---|
| FP4 GEMM (proj/LM head): MMQ q8_1(A8) vs cutlass w4a4(A4)/marlin w4a16 | **NO** | **>>ULP (~1e-2)** |
| GDN recurrence: hand-CUDA warp-reduce vs Triton tl.sum | **NO (provable even in f32)** | reassoc + tree |
| conv1d+SiLU: f32 conv-state vs bf16 conv-state | NO | dtype + order |
| L2-norm / RMSNorm | NO | ~1e-6 (tree) |
| MRoPE | NO | ~ULP-1e-6 |
| gate softplus/exp | NO | rounding points |
| flash attention | NO | softmax tiling |
| silu/sigmoid primitive | identical IFF inputs equal | inputs never equal |
Any single NO makes the logits differ. A1 and A2 differ by far more than ULP -> the logit vectors are
not close-to-equal at the bit level; they agree only to a few significant digits.
---
# PART B - The decisive f32-state correction (proof from source)
The byte-gate inferred vLLM's GDN temporal state is **bf16** (402 MB/call, 41% peak) and built the
"bf16-width is the lever" case on it. The byte count was *inferred from the dtype*; ncu byte counters
were blocked, so only the **duration** (3.62 ms/call) was measured. The dtype inference is falsified:
1. `config.json`: `architectures=["Qwen3_5ForConditionalGeneration"]`, `text_config.dtype=bfloat16`,
and **`text_config.mamba_ssm_dtype = "float32"`**.
2. `models/config.py:590 MODELS_CONFIG_MAP` maps `"Qwen3_5ForConditionalGeneration"` (line 622) and
`"Qwen3_5MoeForConditionalGeneration"` (623) to `Qwen3_5ForConditionalGenerationConfig`.
3. `Qwen3_5ForConditionalGenerationConfig.verify_and_update_config` (config.py:536-562):
`mamba_ssm_dtype = getattr(hf_text_config,"mamba_ssm_dtype")` (="float32"); if
`cache_config.mamba_ssm_cache_dtype == "auto"` (the default) it executes
**`cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype`** -> sets it to **"float32"**.
4. This override runs at config finalization: `config/vllm.py:856` -> `try_verify_and_update_config()`
(vllm.py:1880-1900) looks up the arch in `MODELS_CONFIG_MAP` and calls `verify_and_update_config`.
It runs **before** any layer/model state-dtype resolution.
5. The bench left it default: `h2h_dense_vllm.sh` = `vllm serve .../q36-27b-nvfp4-vllm --enforce-eager
--gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`; `dl-logs/vllm_dense.log`
non-default args confirm none). So the override fires and the value is "float32".
6. State dtype resolution reads the **already-overridden** value:
- `gdn/base.py:53-57` `get_state_dtype()` -> `gated_delta_net_state_dtype(model_dtype=bf16,
cache_config.mamba_cache_dtype="auto", cache_config.mamba_ssm_cache_dtype="float32")`.
- `qwen3_5.py:678 get_mamba_state_dtype_from_config` likewise passes
`vllm_config.cache_config.mamba_ssm_cache_dtype` (= "float32", post-override) - **not** a raw "auto".
- `mamba_utils.py _mamba_state_dtype`: conv_state = `get_kv_cache_torch_dtype("auto", bf16)` = **bf16**;
temporal_state, since `mamba_ssm_cache_dtype != "auto"`, = `STR_DTYPE_TO_TORCH_DTYPE["float32"]`
= **torch.float32** (key verified: `torch_utils.py:33 "float32": torch.float32`).
7. `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **second** tuple
element (temporal) = **float32**, allocates the cache `dtype=state_dtype`. The packed_decode kernel
round-trips f32: `b_h = tl.load(p_h0).to(f32)` ... `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))`
with `p_ht.dtype == initial_state.dtype == float32`.
**=> vLLM's gated-DeltaNet temporal (recurrent) state cache for this model is float32, identical width
to llama's f32 state.** The earlier "bf16" reading hardcoded the third arg as `"auto"` and missed the
override at step 3-4. Only the small *conv* state is bf16 in vLLM (f32 in llama: divergence A3, tiny
byte stream).
## Re-derived efficiency table (measured duration + PROVEN f32 byte volume)
| kernel | state dtype (PROVEN) | bytes R+W/call | duration/call | eff. BW | % of 273 peak |
|---|---|---|---|---|---|
| llama `gated_delta_net_cuda` | f32 | 805 MB | 3.98 ms | 202 GB/s | **74%** |
| vLLM `..._packed_decode` | **f32 (not bf16)** | **805 MB (not 402)** | 3.62 ms | **222 GB/s** | **~81%** |
- **B1 (single-pass f32 byte floor): TRUE** (load-once/store-once `s_shard`, coalesced). *Sub-claim
"more BW-efficient than vLLM (41%)" REFUTED* - 41% was the bf16 artifact; at the correct f32 byte
count vLLM is at ~81%, i.e. **more** efficient than llama.
- **B2 ("the gap is f32-vs-bf16 width"): REFUTED.** Equal f32 bytes both sides; the ~10% per-call gap
is bandwidth **efficiency** on equal bytes, not width.
- **B3 ("vLLM throughput REQUIRES bf16 state"): REFUTED.** vLLM reaches it *with f32 state*.
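The efficiency table is plain arithmetic on the measured duration and the byte volume. A minimal sketch (the helper name is hypothetical; 273 GB/s is the GB10 peak used throughout this note):

```python
PEAK_GBPS = 273.0  # GB10 peak DRAM bandwidth used in this note


def effective_bw(bytes_mb, duration_ms):
    """Return (GB/s, percent of peak). MB per ms is numerically GB per s."""
    gbps = bytes_mb / duration_ms
    return gbps, 100.0 * gbps / PEAK_GBPS


for label, mb, ms in (
    ("llama f32", 805.0, 3.98),
    ("vLLM f32 (proven)", 805.0, 3.62),
    ("vLLM bf16 (refuted reading)", 402.0, 3.62),
):
    gbps, pct = effective_bw(mb, ms)
    print(f"{label}: {gbps:.0f} GB/s, {pct:.0f}% of peak")
```

The same measured 3.62 ms gives ~81% of peak under the f32 byte count and ~41% under the bf16 one, which is the numeric corroboration cited in the correction notice.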
---
# PART C - The f32-preserving lever, and where recompute/bf16 land
Since vLLM hits ~81% on the **same f32 byte volume** llama runs at ~74%, the missed lever is **raising
llama's `gated_delta_net_cuda` achieved BW 74% -> ~81%**, bit-exact, NOT dtype width:
- llama grid `(H=48, n_seqs=128, ceil(S_v/4)=32) = 196608` blocks/128 thr, each warp owns ONE state
column + warp-reduces over 128 rows. vLLM grid `(NV=4, B*HV=6144) = 24576` programs (num_warps=1),
each owns a BV=32 x BK=128 tile. llama's far-finer blocking (8x more blocks, one column of work each,
a butterfly reduce/token) is the likely ~7-point deficit. Retune toward fewer/larger blocks (more
columns/block, vectorized f32x4 loads, better row coalescing) - changes thread/tile mapping + load
width only, **keeps the per-column reduction order -> bit-exact vs llama's own f32**.
- Upper bound: 74%->81% on ~50% of the step ~= -17 ms/step (384 -> ~367), ~+5% -> ~351 tok/s (~90% of
vLLM 391), stacking with the landed bit-exact levers (oproj MMQ 0020 @86%, conv fusion 0021).
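The grid sizes and the upper-bound estimate can be reproduced as below. This is a back-of-envelope sketch under stated assumptions: recurrence time scales inversely with achieved bandwidth, everything else in the 384 ms step is unchanged, and throughput is taken as B / step time (which lands slightly below the ~335 and ~351 tok/s harness figures quoted above).

```python
import math

H, N_SEQS, S_V = 48, 128, 128

# Launch geometry quoted above.
llama_blocks = H * N_SEQS * math.ceil(S_V / 4)   # (48, 128, 32)
vllm_programs = 4 * (N_SEQS * H)                 # (NV=4, B*HV=6144)
print(llama_blocks, vllm_programs, llama_blocks // vllm_programs)  # 196608 24576 8

# Upper bound if the recurrence goes from 74% to 81% of peak on the same bytes.
recurrence_ms = 3.98 * 48                        # 48 GDN layers per step
saved_ms = recurrence_ms * (1.0 - 74.0 / 81.0)
step_ms = 384.0 - saved_ms
tok_s = N_SEQS / (step_ms / 1000.0)              # ASSUMPTION: tok/s = B / step time
print(f"saved {saved_ms:.1f} ms/step, step {step_ms:.0f} ms, ~{tok_s:.0f} tok/s")
```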
**Other f32-preserving avenues (adversarial sweep) - none beats the simple bf16 over-clock, but the
occupancy tune above is the real bit-exact win:**
- *Lossless sub-f32 state:* generic float compression is data-dependent (1.1-1.5x, never a guaranteed
2x) and breaks the 128-consecutive-f32 coalescing a BW-bound kernel depends on. The state is dense,
full-rank, non-symmetric (sum of `k(x)delta`, k!=delta) -> no low-rank/half-storage. FAILS.
- *Recompute (checkpoint every N + rank-1 replay):* eliminates the per-step WRITE; the per-step full
dense f32 READ (the `S^T k` / `S^T q` matvecs need every element; the checkpoint is itself a full
read) is irreducible. Optimal N~=11 -> ~473 MB/step (0.587x), realistically ~0.65-0.75x after
replay/latency overhead. A genuine bit-exact path but it only reaches - never beats - the read floor,
at large kernel/graph complexity. **Note: this was over-weighted before because vLLM was assumed
bf16; now that vLLM is f32 too and runs at 81%, you do NOT need to cut the write to match vLLM - you
need to match vLLM's achieved BW on the same f32 bytes.** Recompute is dominated.
- *2nd stream / overlap / pipelining:* DRAM BW (273) is one shared resource; the whole decode step is
uniformly BW-bound (state traffic + ~13.5 GB/step dense NVFP4 weight traffic both hit 273), so
overlapping two BW-bound phases sums to ~0. FAILS.
- *Equivalent recurrence with less decode traffic:* chunked gated-delta-rule is a prefill lever (C=1 at
decode); attention/materialization-free form is O(t) over the prefix. FAILS.
**bf16 SSM state is therefore an OPTIONAL over-clock**, the only lever that goes *ahead* of vLLM on the
recurrence (halving the 805 MB state traffic -> ~440 tok/s) - but it takes llama below both its own f32 and vLLM's f32
precision, so it must be **KL/PPL-gated vs llama's own f32**, never md5. f32-only parity-class
throughput is plausible from the SUM of bit-exact levers (recurrence occupancy + conv fusion + oproj
MMQ + activation fold); none require bf16.
---
# PART D - Verdict on B4 + the meaningful weaker form
## Bit-exact-vs-vLLM: IMPOSSIBLE (B4 CONFIRMED) - two independent grounds
1. **Structural (this model):** A1 (FP4 GEMM operand precision + accumulation) and A2 (recurrence
`g.Sigma` vs `Sigma.g` + different reduction trees) make per-layer outputs differ by >>ULP, so logits
cannot be bit-identical. A2 shows it is not a precision knob: the kernels evaluate a *reassociated
expression*, differing **even given identical f32 state and inputs**.
2. **General (any two engines):** IEEE-754 add/mul are non-associative; two engines that tile, reduce,
fuse, and quantize differently cannot produce bit-identical results for a non-trivial transformer.
Field determinism work (batch-invariant / fixed-reduction kernels, "defeating nondeterminism in LLM
inference") delivers **run-to-run determinism WITHIN one engine**; it does **not** and cannot deliver
**cross-engine** bit-exactness (that needs identical kernel+tiling+reduction-order+dtype for *every*
op). Cross-engine bit-exactness is essentially never achieved in practice. Bit-exactness is only a
meaningful gate **within** an engine (how llama patches 0018-0021 are validated by md5).
## Greedy-token match (argmax robustness) - the right weaker form, but probabilistic
Because logits differ mostly in low-order bits (A4-A7) plus a few-significant-digit GEMM/recurrence gap
(A1-A2), the **argmax** frequently coincides whenever the top-1/top-2 logit margin exceeds the
cross-engine noise. This is the only meaningful cross-engine "equivalence"; gate on **top-1 agreement /
KL / PPL-delta**, never md5. Caveats: not guaranteed per-token (low-margin steps can flip); it
**compounds** - once one greedy token differs the sequences fork and the KV/SSM states diverge, so
agreement degrades with length (high on short continuations, drift on long ones); and the FP4 A4-vs-A8
gap (A1) makes the per-step divergence *larger* here than a same-precision bf16-vs-bf16 comparison,
weakening greedy agreement for this model specifically.
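The distributional gate can be sketched in a few lines. This is a minimal illustration with made-up logits, not the llama.cpp or vLLM tooling; all names are hypothetical, and each engine is assumed to expose per-step logit vectors over the same vocabulary.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) for one decode step."""
    p, q = softmax(ref_logits), softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def top1(logits):
    return max(range(len(logits)), key=logits.__getitem__)


def compare(ref_steps, test_steps):
    """Return (mean KL per step, top-1 agreement in percent)."""
    n = len(ref_steps)
    mean_kld = sum(kl_divergence(r, t) for r, t in zip(ref_steps, test_steps)) / n
    agree = 100.0 * sum(top1(r) == top1(t) for r, t in zip(ref_steps, test_steps)) / n
    return mean_kld, agree


# Made-up logits: step 0 has a wide top-1 margin, step 1 a narrow one.
ref = [[4.0, 1.0, 0.0], [2.0, 1.9, 0.0]]
test = [[3.9, 1.1, 0.0], [1.95, 2.0, 0.0]]
mean_kld, agreement = compare(ref, test)
print(agreement)       # 50.0: only the low-margin step flips its argmax
print(mean_kld > 0.0)  # True
```

The wide-margin step keeps its argmax under the perturbation while the narrow-margin step flips, which is the per-token behavior described above.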
**Bottom line:** target near-vLLM via KL/PPL/top-1-agreement, not bit-exactness. Reserve bit-exact
gating for intra-llama validation (the f32 recurrence-occupancy lever and the conv fusion qualify;
bf16 state does not and must be KL/PPL-gated vs llama's own f32).
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,53 +0,0 @@
# GDN Recurrence Byte-Gate - Progress (agent: ncu-byte-gate)
## Hard blocker on direct DRAM counters
- ncu HW perf counters: ERR_NVGPUCTRPERM (NVreg_RestrictProfilingToAdminUsers=restricted, root-only).
- nsys --gpu-metrics-devices: same ERR_NVGPUCTRPERM.
- No passwordless sudo on dgx.casa. DRAM byte counters UNOBTAINABLE without root.
- FALLBACK (decisive, no perfcounters needed): CUPTI kernel TIMING (allowed) + exact byte
geometry from kernel source => implied effective BW + a hard mathematical cap on re-stream factor.
## Byte geometry (exact, from gated_delta_net.cu + GGUF)
- Qwen3.5 dense q36-27b-nvfp4: 48 GDN layers, H=48 v-heads, S_v=128 (square state 128x128/head).
- State per (seq,head) = 128*128 f32 = 64 KiB. Per seq per layer (48 heads) = 48*64KiB = 3.0 MiB.
- Kernel is SINGLE-PASS by construction: loads s_shard[] ONCE into regs, recurrence in-register,
writes state ONCE (read_state coalesced 128 consecutive f32/warp; writeback coalesced).
l2norm/sigmoid/softplus/gate act on small q/k/g/beta (NOT the 805MB state); gather no-ops at
steady decode (identity seqs). => NO multi-pass state re-streaming exists to fuse away.
- Minimal bytes/call (B=128): state R+W = 128*48*16384*4*2 = 805.3 MB; +q/k/v/out ~10 MB = ~816 MB.
- Floor time @273 GB/s = 816MB/273 = 2.99 ms/call.
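The byte geometry above as a short calculation. A sketch only: the constants are the ones listed in this section, and the ~10 MB for q/k/v/out is taken as given rather than derived.

```python
B, HEADS, S_V, F32_BYTES = 128, 48, 128, 4
PEAK_GBPS = 273.0

state_per_head = S_V * S_V * F32_BYTES          # 65536 B = 64 KiB
state_per_seq = HEADS * state_per_head          # 3.0 MiB per sequence per layer
state_rw = B * state_per_seq * 2                # read once + write once per call
total_mb = state_rw / 1e6 + 10.0                # + ~10 MB q/k/v/out (as stated above)
floor_ms = total_mb / PEAK_GBPS                 # MB / (GB/s) = ms

print(state_per_head // 1024, "KiB per (seq, head)")
print(state_per_seq / 2**20, "MiB per seq per layer")
print(f"{state_rw / 1e6:.1f} MB state R+W, floor {floor_ms:.2f} ms/call")
```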
## Measured (clean nsys CUDA timing, graphs OFF, npp8 ntg12 npl128, build-cuda-base df1cc97)
- llama gated_delta_net_cuda steady decode: 480 calls, grid(48,128,32), avg 3.98 ms/call
(min 3.90, max 4.33; very tight => bandwidth-bound). 48 layers => 191 ms/step (50% of 384 ms).
- Implied effective BW @1.0x bytes = 816MB/3.98ms = 205 GB/s = 75% of 273 peak.
- HARD CAP: max bytes movable in 3.98ms @273 peak = 1.087 GB = 1.33x minimal.
=> re-stream factor in [1.0x, 1.33x]. 2x re-streaming PHYSICALLY IMPOSSIBLE.
Source proves single-pass+coalesced => ~1.0x, kernel at ~75% peak.
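The hard cap is the measured duration multiplied by peak bandwidth, divided by the minimal byte count. A minimal sketch (the helper name is hypothetical):

```python
def restream_cap(duration_ms, min_mb, peak_gbps=273.0):
    """Max bytes movable in the measured time at peak BW, and the ratio to the minimum."""
    max_mb = duration_ms * peak_gbps   # ms * GB/s = MB
    return max_mb, max_mb / min_mb


max_mb, factor = restream_cap(3.98, 816.0)
print(f"max {max_mb / 1000:.3f} GB in 3.98 ms -> at most {factor:.2f}x the minimal bytes")
```

Any re-stream factor above this ratio would require more than peak DRAM bandwidth, so a 2x re-stream cannot fit in the measured 3.98 ms.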
## Conv-path (same trace, steady-decode region kernels, per-call):
- ssm_conv_f32: 672 calls whole-trace avg 135.9us (incl prefill); decode-region TBD
- concat_cont: 576 calls avg 169.6us ; concat_non_cont 96 calls (prefill big)
- cpy_scalar: 896 calls avg 123.7us ; gdn_gather_nonident 672 calls avg 153.9us (mostly no-op)
## vLLM (apples-to-apples: NSEQ=128, enforce_eager=True; postssm_decomp/vllm_decode.sqlite)
- vLLM state dtype = model_dtype = BF16 (_mamba_state_dtype default "auto"; config dtype=bfloat16).
Geometry identical to llama (H=48, k/v head_dim 128, S_v 128).
- vLLM fused_recurrent_gated_delta_rule_packed_decode steady: 3.62 ms/call (grid 4x6144x1),
bf16 state R+W = 402.6 MB => 111 GB/s = 41% peak. SINGLE-PASS (load p_h0 once -> f32 regs ->
store bf16 once).
- llama 3.98 ms/call, f32 805.3 MB => 202 GB/s = 74% peak. llama kernel is MORE BW-efficient.
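The two efficiency figures can be compared with one helper. A sketch only: it counts state read+write bytes (the small q/k/v/out traffic is ignored, as in the bullets above), and the function name is hypothetical.

```python
# Bandwidth efficiency of the two recurrence kernels, state bytes only.
PEAK_BW = 273e9  # bytes/s

def bw_efficiency(state_bytes, ms_per_call):
    """Return (achieved bytes/s, fraction of peak) for one kernel call."""
    bw = state_bytes / (ms_per_call * 1e-3)
    return bw, bw / PEAK_BW

llama_bw, llama_frac = bw_efficiency(805.3e6, 3.98)   # f32 state
vllm_bw, vllm_frac = bw_efficiency(402.6e6, 3.62)     # bf16 state

print(f"llama: {llama_bw / 1e9:.0f} GB/s, {llama_frac:.0%} of peak")
print(f"vLLM:  {vllm_bw / 1e9:.0f} GB/s, {vllm_frac:.0%} of peak")
```
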
## Conv-path (llama steady decode, per call x48 layers)
- concat_cont 169.6us (8.14 ms/step) + cpy_scalar 120.1us (5.76) + ssm_conv_f32 115.9us (5.56)
= ~19.5 ms/step. Conv state ~12.6 MB (tiny) => LAUNCH-bound, not byte-bound => fusion lever (~5%).
- l2_norm 6.8us, gdn_gather 1.21us (no-op identity seqs => gather does NOT re-stream state).
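The per-step conv-path cost is the per-call times multiplied by the 48 layers. A quick sketch using the numbers above and the 384 ms step from the Measured section; the dictionary layout is illustrative.

```python
# Conv-path cost per decode step: per-call microseconds x 48 layers.
LAYERS = 48
STEP_MS = 384.0   # full decode step, from the Measured section

per_call_us = {"concat_cont": 169.6, "cpy_scalar": 120.1, "ssm_conv_f32": 115.9}
per_step_ms = {name: us * LAYERS / 1e3 for name, us in per_call_us.items()}
conv_ms = sum(per_step_ms.values())
share = conv_ms / STEP_MS

for name, ms in per_step_ms.items():
    print(f"{name:13s} {ms:5.2f} ms/step")
print(f"total {conv_ms:.1f} ms/step = {share:.1%} of the step")
```
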
## FINAL VERDICT (DONE)
- llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x physically impossible @273 peak).
- NO-BUILD fused single-pass recurrence: already single-pass, coalesced, 74% peak (> vLLM 41%);
gate ops touch tiny q/k/g/beta, not the 805MB state => recovers ~0 state bytes.
- BUILD bf16 SSM state (design lever (2)): the 2x gap vs vLLM is 100% f32-vs-bf16 cache width.
805->413 MB => saves ~45-95 ms/step => step 384 -> 289-339 ms = parity-to-ahead of vLLM 327.
Non-bit-exact vs llama f32 but equal to vLLM's own bf16 precision.
- Findings written: GDN_RECURRENCE_BYTE_GATE.md (MEASUREMENT + VERDICT section appended).

View File

@@ -1,57 +0,0 @@
# B_MOE_PROGRESS.md - B-3 (mmq_y-down warp-remap, patch 0028) checkpoint
Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev`
`2f4f5ab`, branch `b-work`), independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`.
## Prior: B-2 (act-quant retune) = NEGATIVE (no lift, no patch 0027). MoE ~85% of vLLM @npl128.
B-2 proved the act-quant tax (~2%) is already optimally tiled; the structural MoE residual is the
grouped FP4 `mul_mat_q<NVFP4>` GEMM (~27%, LPDDR5x BW floor) + bf16 projections (~10.5%). => try B-3.
## The lever (B-3 / SPEEDUP_HUNT B rank #3)
mmq_y-down warp-remap of the NVFP4 FP4-MMA grouped GEMM `mul_mat_q<NVFP4>` in `ggml/.../mmq.cuh`.
mmq_y tiles the weight-row (N) dimension; lowering 128->64 raises resident CTAs (smaller per-CTA
shared + accumulator + 128 vs 256 threads/CTA => ~2x blocks/SM) to hide LPDDR5x weight-load latency,
WITHOUT re-reading weights (each weight row lives in exactly one row-tile => BW-neutral). The MoE
GEMM runs at ~35% of peak BW (occupancy-limited, NOT BW-saturated), so more resident CTAs is the
right mechanism - and it is the ONE untested occupancy lever (M-tile = NEUTRAL 0015, MINBLOCKS =
+8.7% slower 0017).
## The coupling that makes it a real kernel change (not the 0017 knob alone)
The FP4-MMA path has `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; tile_C::I==16 for the
m16n8k64 block-scaled FP4 MMA). nwarps is global `256/warp_size = 8`, so mmq_y is pinned at 128. The
0017 `GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP this assert at mmq_y=64. B-3 makes nwarps TYPE-AWARE:
`mmq_get_nwarps_device<type>()` returns mmq_y/16 = 4 for NVFP4-reduced (else stock 8), keeping the
coupling. 2 new overloads (device template + host 3-arg) + 9 call-site swaps to `<type>`. Default
GGML_CUDA_FP4_MMQ_Y==128 returns stock nwarps for EVERY type => default build byte-identical to stock.
## Bit-exactness note (the real risk)
The per-output K-reduction order is mmq_y-INVARIANT (each output row owned by one thread). BUT mmq_y=64
DOUBLES nty (row-tiles), changing the stream-k kbc partition => an output tile's K-range may be split
across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different grouping =>
FP non-associativity CAN perturb the last logit bits => greedy argmax COULD flip. So B-3 is NOT
bit-exact-by-construction in the md5 sense; the md5 gate is EMPIRICAL. md5 fail => not bit-exact => STOP.
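The non-associativity risk can be shown with a toy split-K sum. This is an illustration only, not the stream-k code: Python floats are f64 (the kernel accumulates in f32), and `split_sum` is a hypothetical helper that sums two contiguous partial ranges and then combines them, the way a fixup pass recombines partial tiles.

```python
# Same values, same order, different split point => different rounding.
def split_sum(values, split):
    """Sum values[:split] and values[split:] separately, then combine."""
    left = 0.0
    for v in values[:split]:
        left += v
    right = 0.0
    for v in values[split:]:
        right += v
    return left + right

vals = [0.1, 0.2, 0.3]
a = split_sum(vals, 1)   # 0.1 + (0.2 + 0.3)
b = split_sum(vals, 2)   # (0.1 + 0.2) + 0.3

print(a, b, a == b)
```

The two results differ only in the last bit, which is exactly the scale at which a near-tie greedy argmax could flip.
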
## Status: COMPLETE - BIT-EXACT but FLAT. No patch 0028. Full result + assessment in B_MOE_RESULTS.md.
- [x] Source-read mmq.cuh: nwarps/mmq_y coupling, FP4 MMA vec_dot, kernel+fixup+launch+case sites.
- [x] Edited mmq.cuh: 2 nwarps overloads + 9 `<type>` swaps. git diff clean (37+/11-).
- [x] BEFORE baseline (stock-0025 binaries, same session): dense md5 5951a5b4==ref, moe 07db32c2==ref;
MoE S_TG npl32=441.98, npl128=756.47.
- [x] BUILD build-cuda @mmq_y=64 (full cuda rebuild): EXIT=0 - compiles (static_assert holds at 4*16=64).
- [x] md5 GATE PASS both models @64; test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.
- [x] Clean back-to-back A/B (build-cuda-base @128 vs build-cuda @64), 3 reps: npl32 +0.29%,
npl128 +0.40% - within the ~0.4% noise band. FLAT.
- [x] nsys A/B: grouped GEMM kernel mmq_y=64 -1.3% FASTER, BUT stream_k_fixup +42% costlier + SSM (40%)
dominant & untouched => end-to-end inert. BW-bound confirmed (same as 0015/0017/B-2).
- [x] DECIDED: FLAT -> no patch 0028. Dev tree reverted to pristine 0025 (no ggml diff), build-cuda
reconfigured to default + rebuilt. Bit-exact MoE ceiling = ~85% @npl128 / ~87.5% @npl32 of vLLM.
- [x] ASSESS + RECOMMEND (in B_MOE_RESULTS.md): residual = structural Marlin-NvFp4 grouped-GEMM gap,
uncloseable bit-exactly; fall back to 0026 bf16-SSM opt-in (default-off, fails MoE KL gate, ~95%).
## Gate references
- dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439
- MoE q36-35b-a3b-nvfp4 md5 == 07db32c2bcb78d17a43ed18bc22705cd
- gate cmd: `llama-completion -m M -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`
- bench: `llama-batched-bench -m M -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128` (S_TG=decode_agg)
- vLLM ref decode_agg @npl128 = 882.2 t/s (npl32 ref 500.8).
Assisted-by: Claude:opus-4.8 [Claude Code]

View File

@@ -1,232 +0,0 @@
# B_MOE_RESULTS.md - B-2 (down_proj act-quant retune / M1) RESULT: NEGATIVE (no headroom)
Agent: B2-build (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`,
branch `b-work`), independent of the held hybrid 0026 (`33e7c65`). Lever: SPEEDUP_HUNT.md section B,
rank #2 ("down_proj act-quant retune (M1): bit-exact, bounded - act-quant is ~2% of MoE step").
## VERDICT
**The existing `blockDim.x = 128` is ALREADY the kernel-level optimum for `quantize_mmq_nvfp4` on
GB10 sm_121. B-2 has zero headroom: there is nothing to bake (128 is the current default), and it
does NOT lift MoE decode (end-to-end flat within 0.4% noise across all block sizes). No patch 0027.**
MoE stays ~85% of vLLM @npl128 / ~87.5% @npl32, well below vLLM => the remaining MoE lever is B-3.
## The change that was built+measured (bit-exact, then REVERTED - did not lift)
`ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch. Replaced the hardcoded
`constexpr int nvfp4_block_size = 128` with a `static const int` selected once from env
`LLAMA_MOE_QUANT_BLOCK` (default 128), `block_num_y` recomputed from the SAME `blockDim.x`. ~20 LOC.
### Why ANY block size is provably byte-identical (the bit-exact invariant)
`quantize_mmq_nvfp4` maps thread -> column purely via the global linear index
`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread
communication (no shared memory, no warp reduction) and every thread writing its OWN disjoint output
sub-block (its own `sub` slot in `block_fp4_mmq`: `yqs[2*sub+0/1]`, `d4[sub]`). The per-thread quant
body (amax, the 5-offset fp8-code search, the q0/q1 nibble packing, the writeback) is untouched. So
the (thread)->output-byte map - and the produced bytes - are invariant to `blockDim.x`. Confirmed
empirically: md5 identical at block 64, 128, AND 256, both models.
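The invariant can be modelled by enumerating the global index each thread computes. A sketch under assumptions: `n_sub_blocks` (the number of sub-blocks to quantize) and the out-of-range guard are illustrative, and the real kernel's grid has more dimensions than the single `y` axis modelled here.

```python
# Model of the thread -> column map: gy = blockDim.x * blockIdx.y + threadIdx.x.
def covered_indices(n_sub_blocks, block_dim_x):
    """List the gy values that do work, in launch order, for one block size."""
    block_num_y = -(-n_sub_blocks // block_dim_x)    # ceil division
    seen = []
    for block_idx_y in range(block_num_y):
        for thread_idx_x in range(block_dim_x):
            gy = block_dim_x * block_idx_y + thread_idx_x
            if gy < n_sub_blocks:                    # assumed bounds guard
                seen.append(gy)
    return seen

N = 1000   # hypothetical sub-block count, deliberately not a multiple of 64
layouts = {b: covered_indices(N, b) for b in (64, 128, 256)}
print(all(layouts[b] == list(range(N)) for b in layouts))
```

Every block size visits each `gy` exactly once, so with no cross-thread communication the bytes written per `gy` cannot depend on `blockDim.x`.
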
## GATE (bit-exact) - BOTH MODELS PASS at default AND at non-128 blocks
greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`:
| block | dense q36-27b-nvfp4 md5 | MoE q36-35b-a3b-nvfp4 md5 |
|------:|-------------------------|---------------------------|
| 128 (default) | 5951a5b4d624ce891e22ab5fca9bc439 == ref | 07db32c2bcb78d17a43ed18bc22705cd == ref |
| 64 | 5951a5b4...439 == ref | 07db32c2...5cd == ref |
| 256 | 5951a5b4...439 == ref | 07db32c2...5cd == ref |
test-backend-ops (CUDA0): **MUL_MAT 1146/1146 PASS**, **MUL_MAT_ID 806/806 PASS**.
## MEASUREMENT 1 - end-to-end MoE decode_agg (S_TG t/s), the actual throughput
`llama-batched-bench -m q36-35b-a3b-nvfp4.gguf -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`,
1 rep/block (run-to-run noise ~0.3-0.5%):
| block | npl=32 S_TG | npl=128 S_TG |
|------:|------------:|-------------:|
| 32 | 437.54 | 750.41 |
| 64 | 437.82 | 751.68 |
| 96 | 437.69 | 749.46 |
| **128 (base/default)** | **438.14** | **751.76** |
| 160 | 436.38 | 750.99 |
| 192 | 436.81 | 751.61 |
| 256 | 437.77 | 750.14 |
Spread: npl32 = 1.76 t/s (0.4%), npl128 = 2.3 t/s (0.3%) - all within noise. **No block size lifts
end-to-end decode.** Expected: the act-quant is ~2% of the MoE step, so even a perfect (0 ns) quantize
kernel caps the end-to-end win at ~2%, and 128 is already optimal => measured 0%.
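The ~2% cap is Amdahl's law applied to the act-quant share. A minimal sketch, assuming the ~2% share quoted above; the function name is illustrative.

```python
# Amdahl's law: end-to-end speedup when only a fraction of the step gets faster.
def amdahl_speedup(fraction, local_speedup):
    """Speedup of the whole step if `fraction` of it runs `local_speedup`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

ACT_QUANT_SHARE = 0.02   # act-quant is ~2% of the MoE step

ceiling = amdahl_speedup(ACT_QUANT_SHARE, float("inf"))   # a 0 ns quantize kernel
modest = amdahl_speedup(ACT_QUANT_SHARE, 1.10)            # a 10% faster kernel

print(f"perfect kernel: +{(ceiling - 1) * 100:.2f}% end-to-end")
print(f"10% faster:     +{(modest - 1) * 100:.2f}% end-to-end")
```

Even a 10% kernel win would land well inside the 0.3-0.5% run-to-run noise.
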
## MEASUREMENT 2 - nsys kernel-level delta on quantize_mmq_nvfp4 (the meaningful B-2 metric)
`nsys --report cuda_gpu_kern_sum`, MoE, `GGML_CUDA_DISABLE_GRAPHS=1 -npp 4 -ntg 32 -npl 128`,
8,193 kernel invocations (the kernel is 2.0-2.2% of GPU time in this decode-heavy window):
| block | total ns | avg ns | median ns | vs 128 (total) |
|------:|---------:|-------:|----------:|---------------:|
| 64 | 127,523,328 | 15,564.9 | 12,256 | +8.7% slower |
| **128 (default)** | **117,371,424** | **14,325.8** | **11,488** | baseline (fastest) |
| 192 | 128,970,464 | 15,741.5 | 12,032 | +9.9% slower |
| 256 | 125,422,048 | 15,308.4 | 11,936 | +6.9% slower |
**128 is a clean local minimum**: faster than 64 below and 192/256 above. Its immediate neighbors 96 and
160 are end-to-end-neutral; nsys-stats flaked on their re-runs, so they have no kernel-level numbers here
and are not expected to beat a bracketed local min. The alternatives regress the kernel by 7-10%, yet on a
kernel that is ~2% of GPU time that is only ~0.15-0.2% of the step, below the noise band, which is why
end-to-end is flat: this BW-bound, 256-tiny-expert model has no col-tile/occupancy headroom in
the act-quant - the same conclusion patch 0015 reached for the M-tile and patch 0017 for MINBLOCKS.
## WHERE MoE STANDS (decode_agg, this base = 0025 with the re-graph)
vLLM ref @npl128 = 882.2, @npl32 = 500.8.
- npl128: 751.8 / 882.2 = **85.2% of vLLM**
- npl32: 438.1 / 500.8 = **87.5% of vLLM**
B-2 adds 0 (within noise). MoE is **still well below vLLM** => **TRY B-3** (the mmq_y-down warp-remap
on the grouped `mul_mat_q<NVFP4,M-tile=64>` GEMM, ~27% of the MoE step - the only untested MoE GEMM
lever; SPEEDUP_HUNT B rank #3, real kernel change, bit-exact, predicted bounded on this BW-bound
model). The structural MoE residual is the FP4 grouped GEMM at the LPDDR5x BW floor + the bf16
projections (~10.5%); the act-quant tax (~2%) is NOT where the gap lives and is already optimally
tiled. Recurrence (~48%) is already past vLLM (0018-0022).
## DECISION
No patch 0027 (B-2 does not lift; dev tree reverted to pristine 0025). The `LLAMA_MOE_QUANT_BLOCK`
hook + this measurement confirm 128 is the GB10 optimum, should other hardware ever want re-tuning.
Hand off to B-3 (patch 0028) as the next MoE GEMM lever.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# B-3 (mmq_y-down warp-remap of the NVFP4 grouped GEMM) RESULT: BIT-EXACT but FLAT (no patch 0028)
Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`,
branch `b-work`), independent of the held hybrid 0026. Lever: SPEEDUP_HUNT.md section B rank #3 - the
0017-deferred structural `mmq_y`-down warp-remap on the grouped FP4-MMA `mul_mat_q<NVFP4>` (the ~26-27%
MoE-specific GEMM), the only untested MoE GEMM occupancy lever.
## VERDICT
**Bit-exact (md5 PASS both models + test-backend-ops PASS), but end-to-end FLAT: npl128 +0.3-0.4%
(consistent direction, kernel-backed) and npl32 +0.1-0.3%, both inside the ~0.4% run-to-run band. The
warp-remap makes the grouped GEMM kernel ITSELF ~1.3% faster (occupancy DID rise) but the step is
BW/SSM-bound, so it does NOT lift MoE decode. No patch 0028.** MoE stays ~85% of vLLM @npl128.
## The change that was built+measured (bit-exact, then REVERTED)
`ggml/src/ggml-cuda/mmq.cuh`. The FP4-MMA path couples the weight-row tile to the warp count via the
invariant `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; `tile_C::I==16` for the m16n8k64
block-scaled FP4 MMA). `nwarps` is global `256/warp_size = 8`, pinning `mmq_y=128`; the 0017
`GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP that assert at 64. B-3 makes nwarps TYPE-AWARE: a new
`mmq_get_nwarps_device<type>()` (+ 3-arg host overload) returns `mmq_y/16 = 4` for NVFP4-reduced (else
the stock 8), so `mmq_y=64 -> nwarps=4 -> 128 threads/CTA` (vs 256) -> ~2x resident CTAs. 2 overloads +
9 `<type>` call-site swaps (kernel, process_tile, write_back_mma, stream_k_fixup, nvfp4 loader, 2 host).
Built with `-DGGML_CUDA_FP4_MMQ_Y=64`; the compile SUCCEEDS (the static_assert now holds at 4*16=64).
**Default `GGML_CUDA_FP4_MMQ_Y==128` returns stock nwarps for every type => a default build is
byte-identical to stock** (the bit-exact opt-out, proven by the md5 below at 128).
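The coupling can be modelled numerically. This is a Python model of the constraint, not the CUDA template: `nwarps_for` and `check_coupling` are hypothetical stand-ins for `mmq_get_nwarps_device<type>()` and the `static_assert`, and the warp size of 32 is the usual CUDA value.

```python
# Model of the nwarps / mmq_y coupling on the FP4-MMA path.
TILE_C_I = 16                        # rows per warp for the m16n8k64 FP4 MMA tile
WARP_SIZE = 32
STOCK_NWARPS = 256 // WARP_SIZE      # 8, the global default

def nwarps_for(is_nvfp4, mmq_y):
    """Type-aware warp count: NVFP4 derives it from mmq_y, other types keep stock."""
    return mmq_y // TILE_C_I if is_nvfp4 else STOCK_NWARPS

def check_coupling(nwarps, mmq_y):
    """Mirror of static_assert(nwarps * tile_C::I == mmq_y)."""
    return nwarps * TILE_C_I == mmq_y

for mmq_y in (128, 64):
    n = nwarps_for(True, mmq_y)
    print(mmq_y, n, n * WARP_SIZE, check_coupling(n, mmq_y))
```

At `mmq_y=128` the model returns the stock 8 warps, which is why the default build is unchanged; at 64 the stock 8 would violate the coupling while the remapped 4 satisfies it.
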
### Bit-exactness is EMPIRICAL here (not by-construction)
The per-output K-reduction order is mmq_y-invariant (each output row owned by one thread), but mmq_y=64
DOUBLES `nty` (row-tiles), changing the stream-k `kbc` partition => an output tile's K-range can be
split across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different
grouping => FP non-associativity COULD perturb the last logit bits and flip a greedy argmax. It did NOT
for the gate prompt (md5 matched), but B-3 is therefore NOT bit-exact-by-construction - a default-ON
ship would be a (small) precision risk. This is a second reason not to ship it for a 0% gain.
## GATE (bit-exact) - BOTH MODELS PASS
greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`:
- dense q36-27b-nvfp4 = 5951a5b4d624ce891e22ab5fca9bc439 == ref
- MoE q36-35b-a3b-nvfp4 = 07db32c2bcb78d17a43ed18bc22705cd == ref
- test-backend-ops CUDA0: **MUL_MAT 1146/1146 PASS, MUL_MAT_ID 806/806 PASS.**
## MEASUREMENT 1 - end-to-end MoE decode_agg, clean BACK-TO-BACK A/B (build-cuda-base @128 vs build-cuda @64)
`llama-batched-bench -m q36-35b-a3b-nvfp4 -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`, S_TG t/s,
3 reps alternating (no concurrent load):
| npl | mmq_y=128 (base) mean | mmq_y=64 (B-3) mean | delta |
|----:|----------------------:|--------------------:|------:|
| 32 | 437.6 (437.3-437.7) | 438.8 (438.4-439.1) | +0.29% |
| 128 | 750.1 (748.9-751.1) | 753.1 (753.0-753.4) | +0.40% |
Every B-3 rep edges the base by +0.3-0.4% @npl128 (consistent, kernel-backed), but the delta is the same
size as the base build's own rep-to-rep spread (748.9-751.1, ~0.3%) - it is at the edge of noise, NOT a
meaningful lift. This caps the end-to-end win at well under 1%, nowhere near the gap to vLLM (882).
## MEASUREMENT 2 - nsys kernel-level A/B (the meaningful B-3 evidence), clean, no concurrent load
`GGML_CUDA_DISABLE_GRAPHS=1 nsys ... -npp 4 -ntg 32 -npl 128`, decode-isolated window, `cuda_gpu_kern_sum`:
| kernel (% of window) | mmq_y=128 total ns | mmq_y=64 total ns | delta |
|---------------------------------|-------------------:|------------------:|-------:|
| gated_delta_net (SSM, ~40%) | 2,335,951,709 | 2,334,847,390 | 0.0% (untouched, DOMINANT) |
| **mul_mat_q<NVFP4, M-tile 64>** (MoE GEMM, ~26%) | **1,502,548,958** | **1,483,685,630** | **-1.26% (kernel faster)** |
| mul_mat_q<NVFP4, M-tile 128> (router, ~3.7%) | 224,532,704 | 210,885,920 | -6.1% |
| quantize_mmq_nvfp4 (act-quant, ~2%) | 119,118,624 | 118,718,496 | -0.3% |
| **mul_mat_q_stream_k_fixup<128>** (~0.6%) | **26,848,479** | **38,117,532** | **+42% (fixup COSTLIER)** |
The warp-remap DOES what it claims at the kernel level: the grouped GEMM is **-1.3%** (more resident
CTAs hide a sliver of weight-load latency). But (a) it is only ~26% of the step, (b) halving mmq_y
DOUBLES the row-tiles so the stream-k fixup recombination grows **+42%** (+11.3M ns), eating ~60% of the
GEMM's 18.9M-ns saving, and (c) the step is dominated by the gated_delta_net SSM (~40%, untouched, and
already PAST vLLM's BW efficiency per 0018-0022) with the GEMM itself at the LPDDR5x BW floor. Net
mul_mat region saving ~7.6M ns on a ~5.8B-ns window = ~0.13%; end-to-end +0.3-0.4% (within noise).
**This is the definitive BW-bound proof: even a real occupancy win on the target kernel does not move
end-to-end** - the same outcome as patch 0015 (M-tile NEUTRAL), 0017 (MINBLOCKS +8.7% slower), and B-2
(act-quant FLAT). The MoE grouped GEMM is bandwidth-limited, not occupancy-limited, at the kernel exit.
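The net-saving arithmetic can be reproduced from the table. A sketch using only the grouped-GEMM and fixup rows and the ~5.8B-ns window quoted above; the dictionary keys are illustrative.

```python
# Net effect of mmq_y=64 on the grouped GEMM + stream-k fixup (ns, from the table).
base = {"moe_gemm": 1_502_548_958, "fixup": 26_848_479}
b3 = {"moe_gemm": 1_483_685_630, "fixup": 38_117_532}
WINDOW_NS = 5.8e9   # approximate total GPU time in the profiled window

gemm_saving = base["moe_gemm"] - b3["moe_gemm"]   # kernel got faster
fixup_cost = b3["fixup"] - base["fixup"]          # recombination got costlier
eaten = fixup_cost / gemm_saving                  # share of the win lost to fixup
net = gemm_saving - fixup_cost
net_share = net / WINDOW_NS

print(f"GEMM saving {gemm_saving / 1e6:.1f}M ns, fixup cost {fixup_cost / 1e6:.1f}M ns")
print(f"fixup eats {eaten:.0%}; net {net / 1e6:.1f}M ns = {net_share:.2%} of window")
```
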
## DECISION
No patch 0028 (B-3 does not lift end-to-end; bit-exactness is empirical, not by-construction; the fixup
penalty + BW floor swamp the +1.3% kernel win). Dev tree reverted to pristine 0025 (no ggml diff),
build-cuda reconfigured to default (no flag) and rebuilt. The `mmq_get_nwarps_device<type>()` remap is a
correct, reusable warp-remap should occupancy-bound FP4 hardware ever appear; it is inert on GB10.
---
# FINAL ASSESSMENT - the honest bit-exact MoE ceiling, and the recommendation
## The bit-exact MoE GEMM/launch track is now EXHAUSTED
| MoE lever (bit-exact) | result | MoE decode_agg @npl128 |
|-----------------------|--------|------------------------|
| 0025 re-graph (B-1, LANDED) | the ONLY bit-exact MoE win | ~82% -> **~85%** of vLLM |
| B-2 act-quant retune (no patch) | FLAT (128 already optimal) | +0% |
| B-3 mmq_y-down warp-remap (no patch) | FLAT (kernel -1.3%, e2e +0.3% noise) | +0% |
**Honest bit-exact MoE ceiling on GB10 = ~85% of vLLM @npl128 (753 / 882.2), ~87.5% @npl32 (439 / 500.8).**
B-1 (re-graph, in 0025) banked the move from ~82% to ~85%; B-2 and B-3 each add 0. The grouped-GEMM/
launch track has no remaining bit-exact headroom.
## Is the residual the structural Marlin-MoE gap? YES.
The remaining ~15% is structural and uncloseable bit-exactly, decomposed from the nsys:
- **Grouped FP4 GEMM (~26%) is at the LPDDR5x BW floor.** B-3 proved an occupancy win there is
end-to-end-inert. vLLM ships a purpose-built **Marlin-NvFp4** grouped GEMM (a different, more
bandwidth-efficient schedule); llama runs native FP4-MMA W4A4 (a HIGHER arithmetic tier, but the
decode shape is BW-bound so the tier does not help). This is THE structural gap and matches
FP4_GEMM_SCOPE_B.md's "MoE ceiling ~76% from the GEMM track alone."
- **The SSM recurrence (~40%) is already PAST vLLM** (84.6% vs 82.4% peak BW, 0018-0022) - not a lever.
- **bf16 projections (~10.5%)** - both engines pay a similar cost; not a bit-exact lever.
No bit-exact lever closes the structural grouped-GEMM gap. ~85% is the honest bit-exact MoE plateau.
## RECOMMENDATION: ship the bit-exact ~85% as DEFAULT; expose 0026 bf16-SSM as a documented opt-in for the last ~10% on MoE (NOT default, NOT in the recommended config)
Per the user's decision rule ("pursue B first; if it cannot reach/beat vLLM on MoE, fall back to the
held hybrid/bf16 opt-in"): **B (bit-exact) cannot reach vLLM on MoE (~85%), so the fallback applies -
but with a hard caveat the team must carry.**
1. **DEFAULT = the bit-exact plateau (0025 with the re-graph), MoE ~85% of vLLM.** This is the honest,
precision-safe ship: the recurrence already BEATS vLLM's BW efficiency, the GEMM is the same FP4
arithmetic class, and the output is byte-identical to the f32 reference. Do not claim MoE *parity*
bit-exactly - claim ~85% with a precision profile at-or-above vLLM.
2. **FALLBACK (opt-in only) = 0026 hybrid bf16-SSM.** It is the ONLY remaining MoE lever (it speeds the
~40% recurrence, the part B does not touch): measured **+11.5% MoE decode** (1110.7 -> 1238.1 t/s in
the 0026 harness) -> would lift MoE ~85% -> **~95% of vLLM**. BUT: (a) it is **non-bit-exact**; (b) it
**FAILS the MoE KL ship-gate by a wide margin** (MeanKLD ~0.045 / Same-top-p ~91% vs the 1e-3 / 99.5%
bar - the gated-DeltaNet state is hypersensitive to bf16; A_HYBRID_SSM_RESULTS.md: "MoE has NO low-KL
regime ... Do NOT put a hybrid T in the gallery/recommended config"); and (c) even then it reaches
**~95%, not a clean beat** of vLLM, while conceding precision vLLM keeps (all-f32 SSM state).
=> Ship 0026 default-OFF (`ssm_hybrid_tau_thresh = 0` / no `--ssm-bf16-tau`); expose the bf16-SSM as
an EXPLICIT opt-in flag for callers who knowingly accept a real MoE precision regression for ~+11.5%
decode (~95% of vLLM). Keep it OUT of the gallery/recommended MoE config.
**Bottom line for the parent:** bit-exact MoE on GB10 plateaus at **~85% of vLLM** and the residual is
the structural Marlin-NvFp4 grouped-GEMM gap that NO bit-exact lever closes (B-1 banked the re-graph;
B-2 and B-3 are 0). Bit-exact does NOT reach/beat vLLM on MoE. The only lever that closes more (to ~95%)
is the held 0026 bf16-SSM, which is **non-bit-exact AND fails the MoE KL gate** - so it ships **opt-in,
default-off, not in the recommended config**, not as the default. Recommend shipping the honest ~85%
bit-exact default and documenting the opt-in for users who accept the precision tradeoff. Do not market
MoE parity; the bit-exact default is ~85% with a precision profile at-or-above vLLM, which is the
defensible claim.
Assisted-by: Claude:opus-4.8 [Claude Code]

View File

@@ -1,499 +0,0 @@
# Durable scope: token-granular continuous-batch scheduler for llama-server on GB10
Build-ready plan. **Not implemented in this workflow** (serving-loop rewrite). This
document scopes the durable path to give llama-server's `update_slots()` a vLLM-v1-style
token-granular continuous-batch scheduler, and records the single honest finding that
re-shapes what the change can and cannot buy.
Hardware: NVIDIA GB10 / DGX Spark (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
LPDDR5x ~273 GB/s. Models: dense Qwen3.6-27B NVFP4 (`~/bench/q36-27b-nvfp4.gguf`),
MoE Qwen3.6-35B-A3B NVFP4 (`~/bench/q36-35b-a3b-nvfp4.gguf`). Dev tree `~/llama-paged-dev`
(branch `paged`, HEAD `151343b`, patch 0015), `build-cuda` sm_121, `LLAMA_KV_PAGED=1`.
Scheduler code: `tools/server/server-context.cpp::update_slots()` (LocalAI override that
`#include`s it: `backend/cpp/llama-cpp/grpc-server.cpp`).
## TL;DR (the honest reframe)
Three findings, read directly from the source at HEAD `151343b` and from the committed
NVFP4 re-run (`QWEN36_NVFP4_BENCH.md`), collapse the apparent size of this work and reset
what it is allowed to claim:
1. **The unified mixed batch already exists.** `update_slots()` already builds ONE
`llama_batch` per step = {every ready decode token} **+** {a bounded chunk of prefill
tokens}, in a fixed two-phase order: Phase 1 (lines 2604-2719) appends every
`SLOT_STATE_GENERATING` slot's sampled token **unconditionally** (no budget gate), then
Phase 2 (lines 2753-3330) fills the remaining batch capacity with prompt tokens. Decode
is therefore **already claimed first and never dropped or capped** - the exact property
vLLM's "RUNNING-before-WAITING" pass works to guarantee is **free** here by construction.
2. **The chunked-prefill slot state already exists and already persists across steps.** A
slot in `SLOT_STATE_PROCESSING_PROMPT` with `slot.prompt.n_tokens() < slot.task->n_tokens()`
is a partial prefill; it stays in that state and resumes next step until its prompt is
fully ingested, at which point it flips to `SLOT_STATE_DONE_PROMPT` -> `GENERATING`
(line 3252, then 3502). Multiple slots can be `PROCESSING_PROMPT` and `GENERATING`
simultaneously; there is **no global "one prefill at a time" gate**. So the mission's
"allow a slot to be mid-prefill while others decode in the same step" is **not a state
machine to build - it is already the behaviour.** This is the single biggest de-risking
fact in this document.
3. **What is genuinely missing is the budget POLICY, and it is small.** Patch 0013
(`LLAMA_PREFILL_BUDGET`) is a single **static** per-step prefill cap, consumed greedily by
slots in iteration order. It is not decode-load-aware (does not subtract the live decode
count `D`), not adaptive (one constant across npl 8..128), and not fair (the first
`PROCESSING_PROMPT` slot can eat the whole budget). The durable delta is to convert that
static cap into vLLM's **dynamic, decode-first, per-slot-fair token budget**: one total
per-step budget `T`, decode claims its `D` tokens first, prefill gets the **leftover**
`T - D` distributed across waiting prompts with a per-slot cap. That is ~the only
behavioural change. **No new slot states, no batch-formation rewrite.**
### The honest ceiling (this is load-bearing for how the work is scoped and sold)
The committed re-run and a dedicated profiling pass (`QWEN36_NVFP4_BENCH.md`, plus
`~/bench/stag_128.json`) establish that **the residual ~2.4x high-concurrency decode gap is a
decode-KERNEL batch-scaling ceiling, not a scheduler defect**:
- At npl8 the kernels are **at parity** (dense 99%, MoE 84% of vLLM decode).
- A clean staggered full-batch-128 run, with **all 128 slots cleanly decoding and zero
prefill starvation**, still tops out at **decode_agg 157.4 tok/s** (dense) - the same
~157-161 ceiling that four independent measurements converge on. vLLM does **390.7** at the
same effective batch. With a *perfect* scheduler the kernel still gives ~157. **The
scheduler cannot lift this.**
- Patch 0013 budget-256 **already reaches ~161** (the ceiling) at npl128. So a token-granular
scheduler buys **little additional steady-state decode_agg** over 0013 on the all-at-once
workload.
Therefore this scheduler's deliverable is **NOT "match vLLM's 391/811 decode."** It is:
- **Close the 12x TTFT gap** (dense 305 s @ 0013 / 491 s stock -> vLLM's ~25 s, and ~2 s on
staggered arrival) - the genuine, large win.
- **Robustly HOLD the decode ceiling** (~161 dense / ~333 MoE @npl128) **without
per-workload budget tuning** - 0013 needs a hand-picked constant (256 for dense, costs MoE
TTFT, net-negative at low npl); the dynamic `T - D` budget is self-tuning across the whole
npl range and across dense vs MoE.
- **Burst-robustness**: bounded TTFT for *all* concurrently-arriving prompts (kill the
burst-TTFT spread), and no admission collapse under sustained load.
Closing the residual 2.4x decode-throughput gap is a **separate, named lever**: the
paged-attention **decode-kernel** batch-scaling work (patches 0009-0011 territory) and/or
CUDA-graphed decode. It is called out explicitly in P3 and is **out of this scope's
scheduler mandate**. We must measure and sell this work on **TTFT + burst-robustness +
self-tuning hold of the ceiling**, never on a decode_agg number the kernel forbids.
## The gap, precisely localized (recap of the committed bench)
At matched NVFP4 on one GB10 box (`QWEN36_NVFP4_BENCH.md`), llama (patch 0015) vs vLLM 0.23.0,
decode_agg tok/s | TTFT mean, npl swept 8/32/64/128:
| npl | dense llama (0013 b256) | dense vLLM | MoE llama (0013 b256) | MoE vLLM |
|----:|------------------------:|-----------:|----------------------:|---------:|
| 8 | 63.5 / 4.3 s | 64.3 / 2.6 s | 169.3 / 1.7 s | 202.0 / 0.8 s |
| 32 | 105.7 / 23.1 s | 189.8 / 7.5 s | 239.0 / 9.0 s | 462.0 / 2.3 s |
| 64 | 132.0 / 109 s | 284.2 / 13 s | 277.0 / 16.2 s | 624.5 / 4.1 s |
| 128 | **161.2 / 305 s** | 390.7 / 24.8 s | **333.5 / 98 s** | 811.1 / 8.0 s |
Both models converge to the **same ~41% of vLLM decode at npl128** after 0013. That
convergence is the signal: once prefill starvation is removed, a dense model and a
12x-cheaper-prefill MoE land on the **identical** ceiling -> the residual is **not prefill**
and **not the kernel-at-parity-@npl8** - it is the **quality of the per-step batching
decision** (TTFT/robustness) plus the **kernel decode ceiling** (the throughput residual).
This scope addresses the first; it names the second as the separate lever.
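The convergence is visible when the decode_agg columns are turned into llama/vLLM ratios. A short sketch over the table's tok/s values only (TTFT is omitted); the dictionary layout is illustrative.

```python
# llama decode_agg as a fraction of vLLM, per npl, from the table above.
# Each entry is (llama tok/s, vLLM tok/s).
decode_agg = {
    8:   {"dense": (63.5, 64.3),   "moe": (169.3, 202.0)},
    32:  {"dense": (105.7, 189.8), "moe": (239.0, 462.0)},
    64:  {"dense": (132.0, 284.2), "moe": (277.0, 624.5)},
    128: {"dense": (161.2, 390.7), "moe": (333.5, 811.1)},
}

ratio = {
    npl: {model: llama / vllm for model, (llama, vllm) in row.items()}
    for npl, row in decode_agg.items()
}

for npl, row in ratio.items():
    print(f"npl={npl:3d}  dense {row['dense']:.0%}  MoE {row['moe']:.0%}")
```
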
## What already exists (reuse, do NOT rebuild)
All line numbers verified at `tools/server/server-context.cpp` HEAD `151343b`.
- **[A] decode-first co-batch** - Phase 1, lines 2604-2719. Iterates `slots`; every
`SLOT_STATE_GENERATING` slot (gated only by `can_batch_with`, line 2611) is pushed to
`generating[]`; line 2715-2719 `for (slot : generating) slot.update_batch(batch)` appends
its sampled token (+ draft tokens) via `common_batch_add`. After this loop,
`batch.n_tokens == D` (the decode-token count). **No budget gate** - decode always goes in.
- **[B] chunked-prefill state per slot** - the pair `slot.prompt.n_tokens()` (=
`num_computed_tokens`) vs `slot.task->n_tokens()` (= `num_tokens`). A `PROCESSING_PROMPT`
slot with `prompt.n_tokens() < task->n_tokens()` resumes next step (Phase 2 re-enters it).
Transition to `DONE_PROMPT` at line 3252 when the prompt is exhausted; to `GENERATING` at
line 3502. **This is exactly vLLM's "leave the request in `running`, advance
`num_computed_tokens` next step" - already implemented.**
- **[C] single shared batch + compute chunking** - one `llama_batch` holds decode+prefill;
the compute loop (lines ~3366-3378) `for (i=0; i<batch.n_tokens; i+=n_tokens){ n_tokens =
min(n_batch, batch.n_tokens-i); llama_decode(batch_view); }` runs it as one `llama_decode`
when `batch.n_tokens <= n_batch`; `n_ubatch` (512) splitting happens inside `llama_decode`.
- **[D] patch 0013 static prefill budget** - the thing to supersede. Read once at lines
2737-2747 (`n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))`, a CONSTANT for
the run); enforced as an extra `while` predicate at line 3188 (`n_prompt_budgeted <
n_prefill_budget`), counter at 3214, outer break at 3326. `0` = disabled = byte-identical
stock.
- **[E] productization seam** - `backend/cpp/llama-cpp/grpc-server.cpp` lines 781-791 parse
the model option `max_prefill_tokens` / `mpt` / `prefill_budget` and `setenv`
`LLAMA_PREFILL_BUDGET` before context init (same pattern as `kv_paged`). New knobs hang off
this seam identically.
- **[F] paged KV (patches 0001-0011)** - on-demand block allocation keyed by sequence
position. Batch formation only changes **which** tokens are in a step; paged alloc is
driven by the per-slot sequence positions, which are unchanged. Orthogonal (see Correctness).
## vLLM v1 reference algorithm (the target, for fidelity)
From `vllm/v1/core/sched/scheduler.py::schedule()` (0.23.0, on the box). The unifying idea:
there is no prefill phase vs decode phase. Every request advances `num_computed_tokens`
toward `num_tokens` by up to N this step; for a decoder N=1, for a prefiller N=remaining
prompt. One per-step `token_budget = max_num_batched_tokens` bounds the TOTAL (decode +
prefill). Pass 1 visits `running` first (decoders cost 1 each -> all decode claimed before
any prefill is sized); Pass 2 admits `waiting` (new prompts) only with leftover budget, each
chunked by `min(remaining_prompt, long_prefill_token_threshold, leftover_budget)`. Caps:
`max_num_seqs` (concurrent sequences), `long_prefill_token_threshold` (~4% of max_model_len,
per-request prompt-chunk cap so one giant prompt cannot monopolize a step). Net: decode batch
maximal every step (-> the GEMM-batching throughput vLLM gets), prefill always makes bounded
progress (-> low, flat TTFT), one `model.forward()` per step.
The mapping to llama is clean because [A]+[B] already give us "running visited first" and
"prefiller resumes next step." We are missing only: **one total budget `T`, leftover `T - D`
sizing, and the per-request chunk cap with fair distribution.**
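The two-pass budget idea can be sketched as a toy model. This is NOT vLLM's `schedule()`: it ignores KV-block allocation, preemption and `max_num_seqs`, and the `Request` class and `schedule_step` function are hypothetical names for illustration.

```python
# Toy model of one decode-first scheduling step with a shared token budget.
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    num_tokens: int                 # tokens that must be computed in total
    num_computed_tokens: int = 0    # tokens already computed

    @property
    def remaining(self):
        return self.num_tokens - self.num_computed_tokens

def schedule_step(running, waiting, token_budget, chunk_cap):
    """Return {request name: tokens scheduled this step}; inputs are not mutated."""
    scheduled = {}
    budget = token_budget
    # Pass 1: running requests first (a decoder has exactly 1 token remaining).
    # Pass 2: waiting prompts, sized from whatever budget is left over.
    for req in list(running) + list(waiting):
        if budget == 0:
            break
        n = min(req.remaining, chunk_cap, budget)
        if n > 0:
            scheduled[req.name] = n
            budget -= n
    return scheduled

decoders = [Request(f"d{i}", num_tokens=101, num_computed_tokens=100) for i in range(3)]
prompts = [Request("p0", num_tokens=1000), Request("p1", num_tokens=300)]
step = schedule_step(decoders, prompts, token_budget=512, chunk_cap=256)
print(step)
```

With a budget of 512 the three decoders take 3 tokens, `p0` is capped at 256 by the per-request chunk cap, and `p1` gets the remaining 253, so one long prompt cannot monopolize the step.
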
## The unified per-step batch-formation algorithm (the design)
New knobs (all default to current behaviour; env set before context init like `LLAMA_KV_PAGED`):
- `T` = `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step token
budget (decode + prefill), the analogue of `max_num_batched_tokens`. Default `n_batch`
(2048). Clamped `T = min(T, n_batch)` so the existing single-`llama_decode` chunking is
unchanged.
- `PREFILL_CAP` = `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot max prompt tokens per
step, the `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04 * n_ctx))`,
floored at `n_ubatch` (512) so a single prompt still makes a full ubatch of progress.
- Back-compat: if only the legacy `LLAMA_PREFILL_BUDGET` is set (new knobs unset), behave
exactly as 0013 (static cap) - 0013 is the degenerate `T = n_batch`, no-leftover case.
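How the defaults resolve can be sketched numerically. A minimal model of the rules above, not the server code: `resolve_budget` and `prefill_budget_step` are hypothetical helpers, env parsing and the legacy `LLAMA_PREFILL_BUDGET` fallback are not modelled, and "floored at `n_ubatch`" is read as a lower bound on the default cap.

```python
# Resolve T and PREFILL_CAP from the knob rules, then size the per-step leftover.
import math

def resolve_budget(n_batch, n_ubatch, n_ctx, max_batch_tokens=None, prefill_cap=None):
    """Return (T, PREFILL_CAP) after applying defaults and clamps."""
    t = n_batch if max_batch_tokens is None else min(max_batch_tokens, n_batch)
    if prefill_cap is None:
        # Default: 4% of the context, at most T, but never below one ubatch.
        prefill_cap = max(n_ubatch, min(t, math.ceil(0.04 * n_ctx)))
    return t, prefill_cap

def prefill_budget_step(t, d):
    """Leftover prompt-token budget once D decode tokens are claimed."""
    return max(0, t - d)

t, cap = resolve_budget(n_batch=2048, n_ubatch=512, n_ctx=32768)
print(t, cap, prefill_budget_step(t, d=128))
```

The leftover shrinks automatically as the live decode count `D` grows, which is the self-tuning property the static 0013 cap lacks.
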
Pseudocode, mapping to real variables and seams (the `>>` lines are the change vs today):
```
common_batch_clear(batch); // line 2594
// PASS 1 - DECODE FIRST (unchanged: lines 2604-2719)
for (slot : slots) if (slot.state == GENERATING && can_batch_with) generating.push(slot);
... speculative draft ...
for (slot : generating) slot.update_batch(batch); // appends decode (+draft) tokens
>> D = batch.n_tokens; // NEW seam: decode load is now final (after 2719)
>> T = min(LLAMA_MAX_BATCH_TOKENS ? : n_batch, n_batch);
>> prefill_budget_step = max(0, T - D); // DYNAMIC leftover, auto-shrinks with D
>> prefill_cap_per_slot = PREFILL_CAP; // long_prefill_token_threshold analogue
>> n_prompt_budgeted = 0; // total prompt tokens added this step (subsumes 0013)
// PASS 2 - PREFILL FILLS THE LEFTOVER (lines 2753-3330, budget made dynamic + per-slot fair)
if (cont_batching || batch.n_tokens == 0) {
>> for (k = 0; k < n_slots; ++k) { // round-robin start offset (fairness, see P2)
>> slot = slots[(rr_start + k) % n_slots];
if (!slot.is_processing() || !can_batch_with) continue;
if (slot.state == STARTED) slot.state = PROCESSING_PROMPT; // line 2782 (unchanged)
>> slot_prompt_added = 0; // NEW: per-slot chunk counter (reset each slot)
// inner prompt-fill (lines 3187-3239), guard now triple-bounded:
while (slot.prompt.n_tokens() < slot.task->n_tokens()
>> && batch.n_tokens < T // was: < n_batch
>> && n_prompt_budgeted < prefill_budget_step // was: 0013 static n_prefill_budget
>> && slot_prompt_added < prefill_cap_per_slot) {// NEW: per-slot cap -> fair distribution
common_batch_add(batch, cur_tok, pos_next, {slot.id}, need_embd);
slot.prompt.tokens.push_back(cur_tok);
slot.n_prompt_tokens_processed++;
n_prompt_budgeted++; slot_prompt_added++;
... checkpoint-boundary breaks (unchanged) ...
}
if (slot.prompt.n_tokens() == slot.task->n_tokens()) slot.state = DONE_PROMPT; // line 3252
... checkpoint creation (unchanged) ...
>> if (batch.n_tokens >= T) break; // was: >= n_batch (line 3320)
>> if (n_prompt_budgeted >= prefill_budget_step) break; // was: 0013 break (line 3326)
}
}
for (i=0; i<batch.n_tokens; i+=n) { n=min(n_batch,batch.n_tokens-i); llama_decode(view); } // unchanged
```
The whole change is: (a) compute `prefill_budget_step = T - D` at the new seam after line
2719 instead of reading a static env constant at 2737; (b) bound the inner/outer loops by `T`
and the dynamic budget instead of `n_batch` and the static budget; (c) add `slot_prompt_added`
with `prefill_cap_per_slot` for per-slot fairness; (d) a round-robin start offset so the same
early slots do not always win the leftover.
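The budget arithmetic of (a)-(d) can be isolated from the server into a toy planner. This is a sketch under stated assumptions: `toy_slot`, `step_plan` and `plan_step` are hypothetical names, each generating slot is assumed to cost exactly one decode token (no speculative draft), and checkpoint-boundary breaks and `can_batch_with` are ignored.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy slot: only the fields the budget arithmetic needs (hypothetical, not server_slot).
struct toy_slot {
    bool generating;        // true: contributes 1 decode token this step
    int  prompt_remaining;  // prompt tokens still to prefill (ignored if generating)
};

struct step_plan {
    int D = 0;                 // decode tokens claimed in pass 1
    std::vector<int> prefill;  // prompt tokens granted to each slot in pass 2
};

inline step_plan plan_step(const std::vector<toy_slot> & slots,
                           int T, int prefill_cap, size_t rr_start) {
    step_plan plan;
    plan.prefill.assign(slots.size(), 0);

    // PASS 1: decode first, every generating slot claims one token.
    for (const auto & s : slots) {
        if (s.generating) plan.D++;
    }

    // Seam: the leftover is dynamic and shrinks as decode load rises.
    int budget = std::max(0, T - plan.D);

    // PASS 2: prefill fills the leftover, starting at the round-robin offset.
    for (size_t k = 0; k < slots.size() && budget > 0; ++k) {
        const size_t i = (rr_start + k) % slots.size();
        if (slots[i].generating) continue;
        // triple bound: remaining prompt, per-slot cap, leftover budget
        const int chunk = std::min({slots[i].prompt_remaining, prefill_cap, budget});
        plan.prefill[i] = chunk;
        budget -= chunk;
    }
    return plan;
}
```

In this model the total of `D` plus all granted chunks never exceeds `T`, and changing `rr_start` changes which waiting slot is served first without changing the total.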
**Why this holds the decode ceiling without tuning.** `T` bounds total tokens per step ->
bounds step compute time -> decode steps fire at a steady high rate (high decode-steps/sec).
As decode load `D` rises, `prefill_budget_step = T - D` auto-shrinks, so prefill never inflates
the step beyond `T` even at npl128. This is the mechanism by which 0013's hand-tuned 256
reaches 161; here it is reached **automatically across the npl range** because the budget is
`T - D`, not a constant. **Why this closes TTFT.** Prefill gets whatever decode leaves over
(`prefill_budget_step = max(0, T - D)`, and `T` is sized so the leftover stays > 0 until the box is
fully decode-saturated), distributed across waiting prompts by `prefill_cap_per_slot`, so every
prompt makes bounded progress every step instead of waiting for a dedicated prefill burst.
## Slot state machine changes (minimal - this is the headline de-risk)
**No new states. No state-transition rewrite.** The existing 6-state machine
(`IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT / GENERATING`, lines 67-72)
already encodes everything:
- "mid-prefill while others decode" = a `PROCESSING_PROMPT` slot coexisting with `GENERATING`
slots in the same step. **Already happens** (Phase 1 and Phase 2 populate one batch).
- "chunked-prefill state per slot" = `(state == PROCESSING_PROMPT) && (prompt.n_tokens() <
task->n_tokens())`. **Already persisted** across `update_slots()` calls; Phase 2 re-enters
the slot and resumes from `prompt.n_tokens()`.
The only **additions** are per-step scheduler scratch, not slot lifecycle state:
1. `slot_prompt_added` - a per-slot, per-step counter (local to the Phase-2 loop body), for
the per-slot chunk cap. Not stored on the slot across steps.
2. A `rr_start` round-robin offset (one `size_t` on the server, advanced each step) so the
leftover budget is distributed fairly across `PROCESSING_PROMPT` slots rather than always
draining the lowest-index slot first (this is what kills the burst-TTFT *spread* - without
it, slot 0's prompt finishes first every time and the last slots starve).
3. Optional, P2: a per-step admission cap `K` on how many `STARTED -> PROCESSING_PROMPT`
transitions begin in one step. This falls out of the budget arithmetic already (a bounded
`prefill_budget_step` with a per-slot floor admits only `~budget/floor` prompts/step), so it
may need no explicit code; if made explicit it is the `max_num_seqs`-style "don't admit a
new prefill if the step is full" guard, mapped onto the pre-allocated `n_parallel` slots.
That is the entire state-machine footprint: two pieces of per-step scratch and an optional cap.
The mission's feared "slot-state rewrite" does not materialize.
## How it supersedes / subsumes patch 0013
| property | 0013 (static cap) | this scheduler (dynamic `T - D`) |
|----------|-------------------|----------------------------------|
| per-step prefill bound | constant `n_prefill_budget` | `T - D`, shrinks as decode load rises |
| decode-load aware | no (ignores `D`) | yes (leftover after decode) |
| works across npl 8..128 with one config | no (256 best @128, net-negative @8) | yes (self-tuning) |
| fair across multiple waiting prompts | no (greedy, slot 0 wins) | yes (`prefill_cap_per_slot` + round-robin) |
| TTFT on bursty arrival | raises it (defers first tokens) | bounded for all prompts |
| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - **kept** |
0013 is the **degenerate case** `T = n_batch` with `prefill_budget_step` pinned to a constant
and no per-slot cap. The patch keeps `LLAMA_PREFILL_BUDGET` working for back-compat (when the
new knobs are unset). When `LLAMA_MAX_BATCH_TOKENS` is set, the static path is replaced by the
dynamic one. **Default (all knobs unset) = byte-identical stock**, exactly like 0013.
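The knob precedence described here can be written as a three-way selection. A minimal sketch with hypothetical names (`prefill_mode`, `select_prefill_mode`); treating "only `LLAMA_PREFILL_CAP` set" as dynamic is an assumption, since the text only spells out the `LLAMA_MAX_BATCH_TOKENS` case.

```cpp
// Which prefill-budget path is active for a given set of knobs (hypothetical enum).
enum class prefill_mode {
    STOCK,        // all knobs unset: byte-identical stock behaviour
    STATIC_0013,  // only the legacy LLAMA_PREFILL_BUDGET is set
    DYNAMIC,      // a new knob is set: budget is T - D
};

inline prefill_mode select_prefill_mode(bool has_max_batch_tokens,
                                        bool has_prefill_cap,
                                        bool has_legacy_prefill_budget) {
    // Any new knob replaces the static path (assumption for the prefill_cap-only case).
    if (has_max_batch_tokens || has_prefill_cap) return prefill_mode::DYNAMIC;
    // Legacy knob alone keeps 0013 semantics for back-compat.
    if (has_legacy_prefill_budget) return prefill_mode::STATIC_0013;
    return prefill_mode::STOCK;
}
```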
## Correctness
- **KV cache during chunked prefill** - unchanged from today. A `PROCESSING_PROMPT` slot already
advances `slot.prompt.tokens` / `pos_next()` chunk by chunk across steps; we only change the
chunk SIZE per step, not how positions or sequence ids are assigned. `common_batch_add`
receives the same `(tok, pos, {slot.id})` tuples in the same order. No new KV state.
- **Determinism** - greedy (temp 0) output can differ from a single-`n_batch`-chunk run only by
the **intrinsic flash-attn chunk-size FP grouping** that 0013 already documented and bounded:
pure stock `-b256` diverges from `-b2048` the same way with this patch inactive; output stays
coherent and answers correctly. The op-level math per token is position-determined and
unchanged; only the FA reduction grouping over a step's token mix shifts. The deterministic
oracle is the CPU backend / the op test (bit-exact); the GB10 CUDA greedy-decode band applies
to end-to-end only, never to the op test.
- **Paged KV (patches 0001-0011)** - **orthogonal**. Paged on-demand block allocation is keyed
by sequence position and slot/stream, which this change does not touch; it changes only which
tokens are in a given `llama_decode`. The in-kernel paged decode read (0009-0011) operates
per-token via the block tables regardless of what prefill tokens are co-batched. Required gate:
run the full P0-P2 suite with `LLAMA_KV_PAGED=1` **and** `=0` and confirm **identical
scheduling decisions** (same per-step token counts, same admission order) - paged must be a
no-op on the scheduler.
- **`can_batch_with` constraint** (line 302) - a batch admits only slots with the same
`task->type` and equal LoRA. Homogeneous-completion serving (the benchmark and the dominant
LocalAI case) satisfies it, so the mixed decode+prefill batch forms freely. Mixed task types /
per-request LoRA fall back to separate batches - a pre-existing bound, not a regression; note
it, do not try to lift it here.
- **Checkpoint interaction (a real, orthogonal serving defect to account for)** - each slot that
reaches `DONE_PROMPT` may call `create_checkpoint` (line 2147), ~149 MiB per checkpoint on the
dense 27B, gated by `n_ctx_checkpoints > 0` (line 3133). Profiling found that under sustained
heavy load the checkpoint subsystem **thrashes**: admission collapsed to one slot every ~13 s,
zero decoding for 290 s, while `/slots` itself serialized behind a 13 s `update_slots` step.
This is **independent** of the decode/prefill mix but it **masks** the scheduler's win if left
on. **P0 must isolate it** (run with `n_ctx_checkpoints=0`), and **P2's admission decision
should be checkpoint-cost-aware** on the 128 GB unified box (do not admit a fresh prefill whose
checkpoint would thrash the pool). Treat as a named co-defect, not part of the core batching
change.
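The paged-KV gate above ("identical scheduling decisions") amounts to comparing two recorded per-step traces. The following is a sketch of such a check, assuming a hypothetical `step_trace` record captured once with `LLAMA_KV_PAGED=1` and once with `=0`; how the trace is captured is not shown and is not part of the source.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One scheduler step as recorded by a (hypothetical) trace hook.
struct step_trace {
    int n_tokens;               // batch.n_tokens submitted this step
    std::vector<int> admitted;  // slot ids admitted to prefill, in admission order
};

// Returns the index of the first step where the two runs differ, or -1 if identical.
inline long first_divergence(const std::vector<step_trace> & a,
                             const std::vector<step_trace> & b) {
    const size_t n = std::min(a.size(), b.size());
    for (size_t i = 0; i < n; ++i) {
        if (a[i].n_tokens != b[i].n_tokens || a[i].admitted != b[i].admitted) {
            return (long) i;
        }
    }
    // Same prefix but different length: the shorter run ended early.
    if (a.size() != b.size()) return (long) n;
    return -1;
}
```

A result of `-1` is the pass condition; any other value names the first step to inspect.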
## Phased plan P0 -> P3 (work, payoff, files, risk)
| Phase | Work | Files | Expected payoff (dense / MoE @npl128 unless noted) | Risk |
|-------|------|-----------------------------------------------------|-------|------|
| **P0** baseline + metrics harness | Per-step effective-decode-batch poller (`/slots`), TTFT percentiles (p50/p90/p99/max), `decode_agg` over the fully-overlapped window, decode-ITL (worst freeze / median), **step-time histogram**, admission rate (slots/s reaching GENERATING), checkpoint-event log. Lock the staggered-arrival ceiling (**157.4** dense, all-128 clean) and the all-at-once burst pathology as the two reference traces. Isolate checkpoints (`n_ctx_checkpoints=0`). | dev-tree only: `~/bench/` (reuse `stag.py`, `slot_poll.py`, `h2h_cli.py`, `h2h_moe_sweep.sh`; `stag_128.json`, `h2h_real128b.json`) | **None** (gate). Locks correctness + the 157/333 ceiling so any regression is caught. | Low |
| **P1** unified mixed-batch formation | Replace the static budget read (2737-2747) with the **dynamic `T - D`** computed at the new seam after line 2719; bound the inner/outer Phase-2 loops by `T` (3188, 3320) and `prefill_budget_step` (3326) instead of `n_batch` and the static cap. No per-slot cap, no round-robin yet (that is P2). | `tools/server/server-context.cpp` (seam @2719, knob read, 3188, 3320, 3326); mirror to `0016-paged-continuous-batch-scheduler.patch` | **TTFT**: removes the burst penalty 0013 inflicts - staggered TTFT ~2 s, burst TTFT collapses toward vLLM's ~25 s / 8 s. **Decode**: holds the ceiling **(~161 / ~333)** *without per-workload tuning* (0013 needed 256 hand-picked). No new throughput beyond the ceiling - by design. | Low-Med (loop-bound edits in a hot path; default-off gate makes it byte-identical stock) |
| **P2** scheduling policy / fairness | Add `slot_prompt_added` + `prefill_cap_per_slot` (the `long_prefill_token_threshold` analogue) and the **round-robin start offset**; optional explicit per-step admission cap `K` + checkpoint-cost-aware admission. Tune `T`, `PREFILL_CAP` on GB10 (dense vs MoE, npl 8/32/64/128). | `server-context.cpp` (Phase-2 loop body @2753-3330, server-level `rr_start`); `grpc-server.cpp` (options `max_batch_tokens`/`mbt`, `prefill_cap` @781-791) | **TTFT spread**: bounds first-token latency for **all** concurrently-arriving prompts (kills the burst-TTFT spread, e.g. dense max 305 s -> single-digit-s on staggered, bounded on burst). **Robustness**: no admission collapse under sustained load; decode batch stays maximal so the *time-averaged* decode_agg on real (non-burst) traffic rises toward the staggered 157/333 because slots reach GENERATING fast. | Med (fairness + admission logic; e2e coherence + A/B vs 0013 required) |
| **P3** residual decode throughput | **Honest boundary: this is the decode-KERNEL lever, NOT the scheduler.** The scheduler has delivered TTFT + robustness + ceiling-hold. Closing the residual 2.4x (161 -> 391 dense, 333 -> 811 MoE) requires paged-attention **decode-kernel** batch-scaling (patches 0009-0011 territory) and/or **CUDA-graphed decode** (the now-uniform decode-only step is graph-capturable). Scope/track separately. | (separate scope) `ggml/src/ggml-cuda/` decode-read kernels; optional CUDA-graph capture seam in `update_slots` | This is **where 391/811 would come from**; it is **out of this scope's mandate** and must not be charged against the scheduler. The scheduler makes the decode step uniform (a precondition that *helps* a future graph capture). | High (kernel work; the GB10 occupancy wall, see below) |
**Per-phase payoff vs the mission targets (TTFT 25 s / 8 s, decode 391 / 811 @npl128):**
- **TTFT 25 s / 8 s** - reached by **P1 + P2** (the 12x gap is the scheduler's to close; on
staggered arrival it goes below the vLLM burst figure to ~2 s).
- **Decode 391 / 811** - **NOT a P1/P2 deliverable.** P1/P2 hold **161 / 333** (= ~41% of vLLM,
the kernel ceiling) robustly and tuning-free. The remaining ~2.4x is **P3 kernel**, a separate
lever. Pre-registering this split is the point: the scheduler is judged on TTFT + holding the
ceiling, the kernel on the throughput residual.
## GB10 considerations
- **Bandwidth floor ~273 GB/s** is the *cause* of the decode ceiling (NVFP4 weight-read +
paged-KV gather per step). The scheduler cannot lift a bandwidth/kernel floor - it can only
keep the batch *at* the ceiling. Size `T ~= n_batch` (2048) so the compute step stays a single
`llama_decode`; `n_ubatch` (512) governs the internal split.
- **`T` is the ITL/TTFT trade knob** (vLLM's `max_num_batched_tokens`): larger `T` = more
prefill/step = faster TTFT but bigger per-step ITL spike; smaller `T` = smoother ITL, slower
TTFT. Because the budget is `T - D`, the spike is bounded at `T` regardless of decode load.
Default `T = n_batch`; expect to tune down toward ~1024 for ITL-sensitive serving.
- **Checkpoint ~149 MiB/slot thrash** on the 128 GB unified box - admission must be
checkpoint-cost-aware (P2); P0 measures with checkpoints off to isolate the batching win.
- **Memory**: paged on-demand KV (dense 52->94 GB, MoE 39->61 GB across npl) vs vLLM's flat
~112 GB pre-reservation - llama's standing multi-tenant advantage, unaffected by this change.
- **Eager mode** both engines today; **CUDA-graphed decode** is the P3 kernel lever, and the
scheduler's uniform decode-only step is a precondition that *helps* a future capture.
## Biggest risks and how to de-risk
1. **"Slot-state rewrite" (the feared big risk) = actually LOW.** The mid-prefill-while-others-
decode state and the chunked-prefill resume already exist ([B]); we add only per-step scratch
(`slot_prompt_added`, `rr_start`), not lifecycle states. **De-risk**: keep all 6 states
untouched; gate every change behind the new knobs; default-off = byte-identical 0013/stock,
verified by an A/B diff of per-step token counts.
2. **Correctness regression in the mixed batch = the FA chunk-grouping nondeterminism.** Already
documented and bounded by 0013 (stock `-b256` vs `-b2048` diverge identically). **De-risk**:
op-test bit-exact where deterministic; greedy-coherence e2e on both models; A/B vs 0013 with
the new knobs set to reproduce 0013 (`T = n_batch`, no leftover) and confirm **byte-identical**
to 0013.
3. **Paged-KV interaction = LOW (orthogonal positions).** **De-risk**: run the whole P0-P2 suite
with `LLAMA_KV_PAGED=1` and `=0`; assert identical scheduling decisions (paged must be a
no-op on batch formation). This is a hard gate, not a spot check.
4. **Checkpoint thrash masks the win = MEDIUM.** A real serving defect that can swamp the
scheduler's signal. **De-risk**: P0 isolates it (`n_ctx_checkpoints=0`); P2 makes admission
checkpoint-cost-aware; report the scheduler metrics both with and without checkpoints so the
batching win is legible independent of the checkpoint co-defect.
5. **Honest-payoff risk = the decode_agg number barely moves over 0013 (kernel ceiling), so the
work can be mis-judged as "no win."** This is the most important risk to manage. **De-risk**:
frame and measure on **TTFT percentiles, burst-TTFT spread, step-time histogram, admission
rate, and tuning-free ceiling-hold across npl/dense/MoE** - the axes the scheduler actually
moves - and **pre-register the decode-kernel as the separate residual-closer** (P3) so the
scheduler is never charged with the 391/811 number the kernel forbids.
## Commit / hygiene
Scope doc only (this file). **No engine change committed in this workflow.** Bench and parity
scripts stay dev-tree-only (`~/bench/`, `~/llama-paged-dev/benches/`). When P1/P2 are
implemented they mirror to
`backend/cpp/llama-cpp/patches/paged/0016-paged-continuous-batch-scheduler.patch` (next free slot
after 0015) and the LocalAI option lands in `grpc-server.cpp` beside `max_prefill_tokens`. Commit
with `git commit -s`, trailer
`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes. Do not push
(human pushes).
---
## Review / risk (adversarial, source-verified)
Skeptical staff review against the actual source at HEAD `151343b` (server-context.cpp,
llama-batch.cpp, llama-kv-cache.cpp, paged-*.cpp), grpc-server.cpp in this worktree, and the
committed `QWEN36_NVFP4_BENCH.md` plus the vLLM H2H serve logs/scripts on the box.
### Verdict: the scope is SOUND. GO on P0 -> P1, CONDITIONAL P2, separate-track P3.
The central de-risking claims check out against the code, and the load-bearing honesty (decode
residual is a kernel ceiling, not a scheduler defect) is correct and now further corroborated.
Two calibration fixes are required before P1 (below); neither changes the go decision.
### (1) Tractability - CONFIRMED bounded; zero libllama changes. What enables/blocks it, concretely:
- **Enables (already-exercised path, not new surface).** A mixed prefill+decode ubatch with
per-seq different `n_past` is the *existing* behaviour. `llama_batch` carries per-token `pos`
and `seq_id` (`common_batch_add(batch, tok, pos_next(), {slot.id}, ...)`); `llama_kv_cache` +
`paged_alloc::place()` place each `(seq, pos)` independently; `llama_kv_cache::init_batch`
(line 742) already splits the mixed batch into ubatches. **The server emits exactly this mixed
decode+prefill batch today** - patch 0013 ships it and produces coherent output - so the new
scheduler changes only the *count* of prefill tokens, never the batch *structure*. There is no
`llama_decode`/ubatch/KV rewrite in scope.
- **Blocks: nothing in libllama.** The only constraints are pre-existing and orthogonal to the
target workload: (i) `can_batch_with` (same task type + equal LoRA per batch); (ii)
`split_equal(sequential=true)` errors on *coupled* sequences (shared-prompt parallel sampling),
forcing `-kvu`. Neither is introduced by this change.
- **Correction to fold in:** the scope's [C] and the pseudocode imply contiguous `split_simple`
chunking. The real serving/benchmark config (`--parallel 128`, `kv_unified` default = `false`
-> `n_stream = n_seq_max = 128`) takes the **`split_equal(n_ubatch, sequential=true)`** path
(llama-kv-cache.cpp:742), which balances per-sequence rather than slicing contiguously. This
does not break anything (0013 already hits it) but it means the actual scheduled object is a
split_equal ubatch set; P0 must characterize that ubatch shape (not assume contiguous 512-chunks)
and the determinism band is over split_equal groupings. Lock the split path (unified vs not) in
the A/B so the byte-identical-to-0013 gate is meaningful. grpc seam [E] verified at
grpc-server.cpp:761-786 (`kv_paged`, `max_prefill_tokens`/`mpt`); new `mbt`/`prefill_cap` knobs
hang off it identically.
### (2) Does it close the gap - the 2.4x is NOT CUDA graphs, and the TTFT root is quantified.
- **CUDA graphs ruled out (verified).** Both NVFP4 H2H vLLM servers ran `--enforce-eager`
(`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`; engine logs show `enforce_eager=True`,
`cudagraph_mode=NONE`, `CompilationMode.NONE`). So the npl128 2.4x decode gap is a genuine
**eager-mode kernel + per-step host-overhead** gap (ggml graph rebuild/realloc + ~1k kernel
launches per step on the weak Grace cores, paged-KV gather, MoE expert gather). The scheduler
cannot touch it; the staggered all-128-decoding 157.4 tok/s ceiling is solid. Scope is right to
refuse the 391/811 number. (CUDA graphs are a future *both-sides* lever, not the current cause.)
- **The TTFT gap has a measured root the scope under-uses: prefill_tps collapse.** From the bench,
llama `prefill_tps` falls 1117 -> 752 -> 465 -> **125** (dense, npl 8/32/64/128) while vLLM holds
**flat ~1420** (MoE: 2813 -> 657 vs vLLM flat ~4263). That collapse - not a separate "scheduling
quality" abstraction - is the direct cause of the 491 s / 85 s TTFT, and it is exactly what the
dynamic `T - D` budget attacks: when decode load `D` is low (early in a burst) the leftover
`T - D` lets prefill take ~`n_batch` per step, and because llama's *larger per-step chunk*
compensates for its ~2.4x slower steps, a `T = 2048` budget can sustain prefill_tps at or above
vLLM's ~1420 during the drain. **So burst-TTFT parity is mechanically plausible, not just
"toward"** - the static budget-256 throttles prefill to 256/step (hence its weak 305 s) where the
dynamic budget would not. This strengthens P1's case beyond what the doc claims.
- **Mandatory calibration fix:** that TTFT win **couples to a decode-ITL knob**. Spending the full
`T - D` on prefill during the drain makes those steps full `T`-token (mixed) computes, so
co-batched decoders get 1 token per slow step (ITL spike) *during the drain* - precisely vLLM's
tradeoff, navigated by `T`. The 157/333 ceiling is the **post-drain steady state**, not the
drain phase. Therefore the scope must **co-report drain-phase decode-ITL alongside TTFT** and
treat `T` as the published trade knob; reporting TTFT alone would hide the cost and reporting
decode_agg alone would hide the win (it is averaged across drain + steady state, which is why it
"barely moves"). Soften "P1+P2 reach 25 s / 8 s": the defensible claim is *staggered/realistic
arrival ~2 s, and all-at-once burst approaching vLLM with a tunable decode-ITL cost*.
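The static-versus-dynamic argument above can be illustrated with a toy step counter for an all-at-once burst. This is a model, not a measurement: `steps_to_drain` is a hypothetical helper, it counts scheduler steps only (it says nothing about step duration, which grows with the tokens per step), prompts are served greedily in slot order, and a slot is assumed to start decoding on the step after its prompt completes.

```cpp
#include <algorithm>
#include <vector>

// Counts steps until every prompt of a burst has finished prefill.
//   static_budget > 0 : constant per-step prefill budget (0013-style, ignores D)
//   static_budget == 0: dynamic budget T - D, D = slots already decoding
// Returns -1 if prefill can make no progress.
inline int steps_to_drain(int n_prompts, int prompt_len, int T, int static_budget) {
    if (n_prompts <= 0 || prompt_len <= 0) return 0;
    std::vector<int> remaining(n_prompts, prompt_len);
    int done = 0;
    int steps = 0;
    while (done < n_prompts) {
        const int D = done;                       // one decode token per finished slot
        const int leftover = std::max(0, T - D);  // what decode leaves in the step
        int budget = static_budget > 0 ? std::min(static_budget, leftover) : leftover;
        if (budget == 0) return -1;
        for (int i = 0; i < n_prompts && budget > 0; ++i) {
            if (remaining[i] == 0) continue;
            const int chunk = std::min(remaining[i], budget);
            remaining[i] -= chunk;
            budget       -= chunk;
            if (remaining[i] == 0) done++;
        }
        steps++;
    }
    return steps;
}
```

For 128 prompts of 128 tokens and `T = 2048`, a static budget of 256 needs 64 steps in this model, while the dynamic budget needs far fewer because the early steps carry close to `T` prompt tokens. Those early steps are also the full-`T` mixed steps whose decode-ITL cost the calibration fix asks to co-report.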
### (3) Correctness - paged orthogonality confirmed at source; the real risks are config, not code.
- **Paged-KV is the same `llama_kv_cache` class** with `paged_alloc::` hooks inside the existing
find_slot/placement (llama-kv-cache.cpp:1043-1083), driven by per-slot `(seq, pos)` - which this
change does not touch. `init_batch`/split is paged-agnostic. The scope's "orthogonal" claim is
verified, not asserted. Keep the hard `LLAMA_KV_PAGED=1` vs `=0` identical-decisions gate.
- **Determinism**: the FA grouping nondeterminism is over **split_equal** ubatches in the real
config; the `T = n_batch` A/B-must-be-byte-identical-to-0013 gate is the right oracle and is
sound (default-off path is untouched).
- **Low-concurrency regression**: gated to byte-identical when knobs unset; the only live vector is
a **mis-tuned `T`** spiking ITL at low npl (the scope already flags `T` defaults). Config hygiene,
not a code risk. Add a guard/floor so `T` cannot be set below `n_ubatch`.
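The guard suggested in the last bullet could look like the following. A minimal sketch: `clamp_batch_budget` is a hypothetical name, "unset" is modelled as a value `<= 0`, and `n_ubatch <= n_batch` is assumed.

```cpp
#include <algorithm>
#include <cstdint>

// Resolve the per-step budget T so it can never fall below one ubatch
// and never exceed n_batch. Assumes n_ubatch <= n_batch.
inline int32_t clamp_batch_budget(int32_t requested, int32_t n_ubatch, int32_t n_batch) {
    if (requested <= 0) return n_batch;  // knob unset: default T = n_batch
    return std::max(n_ubatch, std::min(requested, n_batch));
}
```

With `n_ubatch = 512` and `n_batch = 2048`, a request of 128 is raised to 512 and a request of 4096 is lowered to 2048.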
### (4) Smaller higher-ROI step - yes, and the scope already contains it (P1).
The minimal high-ROI change is **P1 alone**: replace the static read (server-context.cpp:2737-2747)
with `prefill_budget_step = max(floor, T - batch.n_tokens)` computed after the decode-fill at line
2719, and bound the Phase-2 loops by `T` / that budget (3188, 3320, 3326). That is a handful of
line edits at named seams, default-off, and it captures the self-tuning + the bulk of the TTFT win.
The even-smaller validation spike: a one-line `n_prefill_budget = max(floor, T - batch.n_tokens)`
to confirm the prefill_tps/TTFT mechanism before writing the full P1. **P2** (round-robin +
`prefill_cap_per_slot` + checkpoint-aware admission) is genuinely higher-effort and lower-marginal
(it buys TTFT *spread*/tail and burst robustness, not the median); **gate P2 on P1's measured
burst-TTFT-spread and drain-ITL**, do not commit to it up front. There is no smaller step that also
fixes the static budget's npl-dependence - tuning 0013's constant cannot (256 is net-negative at
npl8 and costs MoE TTFT), so P1 is the floor.
### Realistic effort / payoff and sequencing
- **P0** ~0.5-1 wk (harness largely exists in `~/bench/`): add drain-phase decode-ITL to the metric
set, lock the split path, isolate checkpoints (`n_ctx_checkpoints=0`). Gate only.
- **P1** ~2-4 days: small diff + the A/B-vs-0013 byte-identical gate + the npl/dense/MoE sweep.
Payoff: self-tuning hold of 161/333 with no hand-picked constant; burst-TTFT 3-10x better than
0013 (plausibly approaching vLLM on the burst, parity on staggered), at a published `T`-tunable
decode-ITL cost. **This is the high-ROI core and the clean supersession of 0013.**
- **P2** ~1-2 wk, conditional: fairness/admission + checkpoint-cost-awareness + tuning. Payoff: TTFT
tail/spread + no admission collapse under sustained load. Worth it only if P1 metrics show a
residual spread/robustness problem.
- **P3** separate track, high effort: the *only* path to 391/811 is the eager-kernel + per-step
host-overhead residual. Highest-value probe is a **CUDA-graph capture of the steady-state
pure-decode step** - but note this works *independent of the scheduler* (the all-128-decoding
step is already fixed-shape today); the scheduler neither blocks nor specially enables it, so do
not credit graphs to the scheduler. The scope's "uniform decode step is a precondition" is a mild
over-claim; correct it to "graphs apply to the pure-decode steady state, which the scheduler does
not change."
### Bottom line
GO. The work is correctly localized to `update_slots()` batch-formation policy, requires no
libllama changes (the mixed per-seq batch is the existing, shipping path), and supersedes 0013
cleanly. The honest ceiling is real and well-stated; the two fixes are (a) co-report drain-phase
decode-ITL with TTFT and stop selling/charging the decode_agg number, and (b) acknowledge the
`split_equal`/`n_stream=128` path in the determinism and ubatch-shape analysis. Sequence
P0 -> P1, measure, then decide P2; keep P3 (kernel/CUDA-graph) on its own track as the sole owner
of the 2.4x throughput residual.


@@ -1,106 +0,0 @@
# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact)
The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3).
After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at
the f32 byte floor), the conv path was the only remaining bit-exact decode lever.
## What changed
A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a
non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain:

    qkv_mixed transpose -> ggml_concat (build width-K window)     [concat_cont 8.14 ms/step]
    -> ggml_ssm_conv (depthwise conv)                              [ssm_conv_f32 ~8.6 ms/step]
    -> ggml_silu                                                   [folded into ssm_conv on CUDA]
    -> ggml_cpy of the shifted ring state into the conv cache      [cpy_scalar 5.76 ms/step]

with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from
the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME
ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes
the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot
the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is
vLLM's `causal_conv1d_update`.
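The per-(channel, sequence) work of the fused update can be sketched as a scalar CPU reference. This is an illustration, not the patch's `ggml_compute_forward_ssm_conv_update_f32`: `conv_update_ref` is a hypothetical name, the flat row-major layouts are assumptions chosen to mirror the shapes listed under Files, the state is updated in the same buffer it is read from (the patch reads a gathered scratch and writes the cache view), and no bit-exactness against the CUDA kernel is claimed.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Layout assumptions (for this sketch only), all row-major:
//   state : [n_seqs][channels][K-1]  cached taps, oldest first
//   weight: [channels][K]            tap 0 multiplies the oldest input
//   x_cur : [n_seqs][channels]       current token
//   out   : [n_seqs][channels]       conv (+ optional silu) output
inline void conv_update_ref(std::vector<float> & state, const std::vector<float> & weight,
                            const std::vector<float> & x_cur, std::vector<float> & out,
                            int K, int channels, int n_seqs, bool apply_silu) {
    out.assign((size_t) n_seqs * channels, 0.0f);
    for (int s = 0; s < n_seqs; ++s) {
        for (int c = 0; c < channels; ++c) {
            const size_t idx = (size_t) s * channels + c;
            float *       st = state.data()  + idx * (size_t) (K - 1);
            const float * w  = weight.data() + (size_t) c * K;
            const float   x  = x_cur[idx];

            // Width-K window = K-1 cached taps followed by the current token,
            // accumulated in ascending tap order.
            float acc = 0.0f;
            for (int t = 0; t < K; ++t) {
                const float tap = (t < K - 1) ? st[t] : x;
                acc += tap * w[t];
            }
            out[idx] = apply_silu ? acc / (1.0f + std::exp(-acc)) : acc;

            // Shift the ring state by one token and append the current one.
            for (int t = 0; t + 1 < K - 1; ++t) st[t] = st[t + 1];
            if (K > 1) st[K - 2] = x;
        }
    }
}
```

One call replaces the concat, conv, silu and state copy for a single decode token; the three cached taps `{1, 2, 3}` with a new token `4` become `{2, 3, 4}` after the call.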
Files:
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace`
(src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs],
src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu).
- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32<apply_silu,d_conv>` (one thread per
(channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of
`ggml_cuda_op_ssm_conv`.
- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over
channels) + branch in `ggml_compute_forward_ssm_conv`.
- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs
conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write
target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed.
- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path
(`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/
rollback keep the existing concat+ssm_conv+silu+cpy chain.
- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output
vs the CPU reference across backends.
## Gate: test-backend-ops (CUDA0 vs CPU reference)
- SSM_CONV: 45/45 OK (unchanged path intact)
- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
- SSM_CONV_BIAS_SILU: 90/90 OK
## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on)
Byte-identical to the clean Lever-1 (0019/0020) baseline, both models:
| model | baseline md5 | fused md5 | result |
|--------------------|----------------------------------|----------------------------------|-----------------|
| q36-27b-nvfp4 | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL |
| q36-35b-a3b-nvfp4 | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL |
## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after
Dense q36-27b-nvfp4:
| mode | npl | baseline | fused | delta |
|-----------|-----|----------|--------|---------|
| CUDA-graph| 32 | 199.76 | 202.99 | +1.6% |
| CUDA-graph| 128 | 336.35 | 347.14 | +3.2% |
| eager | 32 | 196.07 | 197.61 | +0.8% |
| eager | 128 | 333.62 | 342.97 | +2.8% |
MoE q36-35b-a3b-nvfp4:
| mode | npl | baseline | fused | delta |
|-----------|-----|----------|--------|---------|
| CUDA-graph| 32 | 421.72 | 432.39 | +2.5% |
| CUDA-graph| 128 | 689.74 | 713.54 | +3.5% |
| eager | 32 | 421.05 | 432.46 | +2.7% |
| eager | 128 | 689.15 | 713.87 | +3.6% |
Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at
**88.8% of vLLM 391** (up from 86.0%). The lift holds in BOTH graph and eager modes.
## Step time + nsys kernel delta
Per-step decode time (dense npl128, T_TG / ntg=128):
- baseline 48.711 s / 128 = 380.6 ms/step
- fused 47.197 s / 128 = 368.7 ms/step -> **-11.9 ms/step** (just under the predicted 12-14 ms/step saving)
- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step)
nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels:
| kernel | baseline calls | fused calls | per-step (eager) |
|---------------------|----------------|-------------|------------------|
| concat_cont (decode)| 1152 | 0 (GONE) | 7.95 -> 0 ms |
| cpy_scalar (decode) | 1152 of 3648 | 0 (GONE) | 4.29 -> 0 ms |
| ssm_conv_f32 (decode)| 1152 of 2736 | 0 (prefill-only) | 8.65 -> 0 ms |
| ssm_conv_update | 0 | 1152 | 0 -> 7.56 ms |
Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont
and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel.
prefill keeps the original chain (concat_non_cont 1584, ssm_conv_f32 1584 unchanged).
## Verdict
Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM
391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free;
de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,639 +0,0 @@
# Critical-Path Gap Analysis - GDN decode region
## vllm-gdn-compare (READ-ONLY, no GPU) - vLLM decode GDN kernel inventory vs llama
### Source ground truth
- Local checkout `/home/mudler/_git/vllm` and the DGX's benchmarked venv
`/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm` are STRUCTURALLY
IDENTICAL (same file `qwen_gdn_linear_attn.py`, byte-for-byte same line numbers
1287/1344/1457/1644/1684). So the analysis below is faithful to what was actually
benchmarked on the GB10. Both are a recent dev build (`__version__ = "dev"`), same
era as the "0.23.0" reference; the GDN path is the refactored
`vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`.
### The headline: vLLM runs the entire GDN region at decode as 2 Triton kernels + 3 GEMMs, ALL fused
Per Qwen3.5 gated-DeltaNet (linear-attn) layer, vLLM decode launches:
| # | Kernel | What is folded in |
|---|--------|-------------------|
| 1 | `in_proj_qkvz` GEMM | (quantized matmul - shared with llama) |
| 2 | `in_proj_ba` GEMM | (quantized matmul - shared with llama) |
| 3 | `_causal_conv1d_update_kernel` (causal_conv1d.py:1193) | conv1d **+ silu activation fused in** (the `activation` arg) |
| 4 | `fused_recurrent_gated_delta_rule_packed_decode_kernel` (fused_recurrent.py:256-336) | **l2norm(q), l2norm(k), scale, softplus gate, A_log decay exp(g), sigmoid(beta), the delta-rule recurrence (b_h*=exp(g); delta update), the output b_o=sum(b_h*b_q), AND the SSM state write-back** - all in one kernel |
| 5 | `RMSNormGated` (gated rms_norm) | **output gate silu/sigmoid * z fused into the rms_norm**; the comment notes the norm+quant is further fusable by the compilation pass (`fuse_norm_quant`) |
| 6 | `out_proj` GEMM | (quantized matmul - shared with llama) |
So the GDN-region "glue" elementwise op count in vLLM is effectively ZERO separate
launches. Everything llama runs as standalone ggml nodes - conv-silu, gate
sigmoid, softplus, l2norm, scale, decay mul, residual add, gather - is absorbed
into kernels #3, #4, and #5.
Verified kernel bodies:
- `fused_recurrent_gated_delta_rule_packed_decode_kernel` lines 313-336:
`b_q/sqrt(sum(b_q^2)+eps)`, `b_k/sqrt(...)` (l2norm), `b_q*scale`,
`softplus_x=where(x<=thr, log(1+exp(x)), x)`, `g_val=-exp(A_log)*softplus_x`,
`beta_val=sigmoid(b)`, `b_h*=exp(g_val)`, `b_v-=sum(b_h*b_k)`, `b_v*=beta_val`,
`b_h+=b_v*b_k`, `b_o=sum(b_h*b_q)`, `tl.store(p_o,...)`, `tl.store(p_ht,...)`.
ONE kernel = recurrence + ALL gating + l2norm + state writeback.
- The non-packed variant `fused_sigmoid_gating_delta_rule_update_kernel`
(fused_sigmoid_gating.py:24-179) is the same fusion (used for the spec-decode /
mixed-batch path); both fold gate+l2norm+recurrence+writeback into one launch.
- Decode dispatch: `_forward_core` (line 1286-1298) routes pure non-spec decode to
`_forward_core_decode_non_spec` (line 1644), which calls exactly
`causal_conv1d_update` (#3) then `fused_recurrent_gated_delta_rule_packed_decode`
(#4). `_output_projection` (line 851) does `self.norm(core_attn_out, z)` (#5,
gated rmsnorm) then `out_proj` (#6).
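To make the list of folded operations concrete, here is a minimal scalar reference of one decode step for a single head, following the order of operations quoted from the kernel body. It is a sketch, not the Triton kernel: the K x V state layout, the default `eps`, and the softplus threshold of 20 are assumptions, and `a` is taken to be the gate pre-activation with any bias already added.

```python
import math


def softplus(x: float, threshold: float = 20.0) -> float:
    """Thresholded softplus: where(x <= thr, log(1 + exp(x)), x)."""
    return math.log(1.0 + math.exp(x)) if x <= threshold else x


def fused_decode_step(h, q, k, v, a, b, A_log, scale, eps=1e-6):
    """One gated delta-rule decode step for a single head.

    h: state as a K x V list of lists (updated in place)
    q, k: length-K vectors; v: length-V vector
    a: raw gate input, b: raw beta input (scalars)
    Returns the length-V output vector.
    """
    K, V = len(k), len(v)
    # l2norm(q), l2norm(k), then scale q
    q_norm = math.sqrt(sum(x * x for x in q) + eps)
    k_norm = math.sqrt(sum(x * x for x in k) + eps)
    q = [x / q_norm * scale for x in q]
    k = [x / k_norm for x in k]
    # gating scalars: g = -exp(A_log) * softplus(a), beta = sigmoid(b)
    g = -math.exp(A_log) * softplus(a)
    beta = 1.0 / (1.0 + math.exp(-b))
    decay = math.exp(g)
    # decay the state: b_h *= exp(g)
    for i in range(K):
        for j in range(V):
            h[i][j] *= decay
    # delta: b_v -= sum(b_h * b_k); b_v *= beta
    dv = [(v[j] - sum(h[i][j] * k[i] for i in range(K))) * beta for j in range(V)]
    # state write-back: b_h += b_v * b_k
    for i in range(K):
        for j in range(V):
            h[i][j] += k[i] * dv[j]
    # output: b_o = sum(b_h * b_q)
    return [sum(h[i][j] * q[i] for i in range(K)) for j in range(V)]
```

Everything between the function's first and last line is what the text describes as a single launch; in an unfused graph each commented stage would be its own node.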
### vLLM ALSO captures decode in a FULL CUDA graph - the launch bubbles are gone entirely
`vllm/v1/attention/backends/gdn_attn.py`:
- `_cudagraph_support = AttentionCGSupport.UNIFORM_BATCH` (line 82)
- `use_full_cuda_graph = cudagraph_mode.has_full_cudagraphs()` (line 113)
- `build_for_cudagraph_capture` (line 509): "only decode is supported for full
cudagraphs with Mamba" / "GDN only supports decode-only full CUDAGraph capture".
So at decode vLLM captures the WHOLE forward (all 64 layers: the 48 GDN linear-attn layers
+ the 16 (1-in-4) full-attn layers + projections + conv + recurrence + gated rmsnorm)
into a single replayed CUDA graph. Per-kernel host launch latency and the
data-dependent inter-op gaps are eliminated at replay time. Even the 2 Triton
kernels per GDN layer incur no host-side launch bubble during graph replay.
### Why this is the 62%-vs-40% explanation (not GEMM throughput)
- llama runs the GDN region as ~7-9 separate ggml nodes per layer at decode
(`ssm_conv`, `gated_delta_net` recurrence, `gdn_gather`, `k_bin_bcast` mul,
`silu`, `sigmoid`, `l2_norm`, `op_add`, `concat`), each a host-launched kernel,
serially data-dependent (conv -> gate -> recurrence -> gather), with the gating
elementwise wedged between recurrence steps. Each launch + the dependency stall
is a bubble ON the critical path. x48 layers x ~8 ops = ~384 launch bubbles/step.
- vLLM has 2 fused Triton kernels per GDN layer AND wraps them in a CUDA graph, so
the GDN-region inter-op bubble count at decode is ~0. The recurrence kernel
itself is already near-parity in llama (gated_delta_net 1.47 ms/call vs vLLM).
The gap is the surrounding launch/sync overhead, which is exactly the 60% idle
measured (llama ~40% busy vs vLLM 62%).
- This matches why P2a and Lever 2 were FLAT: they shrink GPU-busy kernels that are
already overlapped with the 42% mul_mat_q GEMM. The real wall-clock lever is the
SERIAL GDN gating chain's launch bubbles, which vLLM removed by (a) fusion into
the recurrence kernel and (b) CUDA-graph capture.
### What llama would need to match vLLM (two independent wins, either helps)
1. **Op fusion (Lever 3).** Collapse the GDN per-layer gating chain into the
recurrence kernel: fold conv-silu, q/k l2norm, scale, softplus+A_log gate,
sigmoid-beta, the exp-decay mul, the residual add, and the SSM-state write-back
INTO the `gated_delta_net` CUDA kernel (and fuse the output gate silu*z into the
final rms_norm). Target: from ~8 GDN nodes/layer down to ~2 (conv-fused +
recurrence-fused), mirroring vLLM's `fused_recurrent_gated_delta_rule_packed_decode`.
The conv silu fold and the l2norm/scale/gate fold are the high-value pieces -
they are pure elementwise prologues sitting ON the serial chain between conv and
recurrence.
2. **CUDA-graph the decode step.** Even without fusion, capturing the decode forward
in a CUDA graph removes the per-node host launch latency for all ~384 nodes/step.
(Prior A.2 work flagged ggml-cuda graph capture as the orthogonal lever; the
measured GDN structure here is exactly why it should move the wall.) vLLM gets
BOTH; llama gets neither today.
### Bottom line for the gap-analysis agent
The candidate explanation is confirmed at the source level: vLLM's GDN decode region
is 2 fused Triton kernels under a full CUDA graph vs llama's ~8 separate
host-launched, serially-dependent ggml nodes. That launch/bubble delta - not GEMM
compute - is the 62%-vs-40% busy gap. A timeline gap analysis on the existing nsys
trace should show idle GPU between the GDN sub-ops (conv -> gate -> recurrence ->
gather) per layer; if it does, Lever 3 (gating-into-recurrence fusion) and/or
decode CUDA-graph capture are the levers that will move the wall, unlike P2a/Lever 2.
---
## roofline-decode (READ-ONLY, no GPU) - decode-step roofline + bubble budget + parity target
Goal: bound the q36-27b-nvfp4 decode step by the bandwidth floor and the compute floor,
compare to measured llama 384 ms/step vs vLLM 327 ms/step, size the unexplained "bubble
budget", and state the step-time target for parity. Cross-checks vllm-gdn-compare above.
### Inputs (measured / GGUF metadata, no new GPU work)
- DGX GB10 (sm_121): LPDDR5x **273 GB/s**, dense NVFP4 MMA peak ~**500 TFLOP/s** (sparse ~1 PFLOP/s).
Both numbers are shared identically by llama and vLLM (same HW, same weights).
- q36-27b-nvfp4 GGUF (arch qwen35): block_count **64** (full_attention_interval 4 ->
**16 full-attn + 48 GDN** layers), d_model 5120, FFN 17408, attn 24 heads / 4 kv-heads,
head_dim 256, ssm conv_kernel 4 / state_size 128 / group_count 16 / inner_size 6144.
Weight file = **18.804 GB** (NVFP4 + FP8 block-scales + f16 norms/embd), fully GPU-resident.
- Measured llama decode (dense_base.out, -fa -npp128 -ntg128 -npl128, B=128, 128 TG steps):
T_TG 49.154 s / 128 = **384.0 ms/step** (S_TG 333.3 tok/s; matches STATE "~381 ms").
- vLLM dense ref **391 tok/s @128** => 128/391 = **327.4 ms/step**.
### 1. Bandwidth floor (bytes that MUST cross LPDDR5x per step / 273 GB/s)
| term | bytes/step | basis |
|------|-----------|-------|
| Weights (batched, read ONCE/step, reused across all 128 seqs) | ~18.4 GB | file 18.804 minus ~0.4 GB sparse input-embd lookup; lm_head fully read |
| SSM state R+W (48 GDN layers x 128 seqs) | ~19 GB (bracket 10-38) | ~1.5 MB/layer/seq R+W, kernel-grounded: gated_delta_net 1.47 ms/call -> ~400 MB/call @273 GB/s; theoretical d_inner*d_state f32 doubles it |
| KV cache read (16 attn layers, avg 192 ctx, 128 seqs, f16) | ~1.6 GB | 64 KiB/tok over 16 layers; max-ctx 256 -> 2.1 GB |
| Activation/quantize/gate intermediates R+W | ~3 GB | quantize_mmq_nvfp4 + k_bin_bcast + silu + rms tensors @ batch 128 |
| **TOTAL** | **~42 GB/step** | bracket 32-61 GB |
**Bandwidth floor = 42/273 = ~154 ms/step** (central; bracket ~117-224 ms).
Weight-only HARD sub-floor (unavoidable, both engines pay it): 18.4/273 = **67 ms/step**.
KEY: even at batch 128 the FP4 GEMM is STILL memory-bound, not MMA-bound. AI = 2*128/0.53 B
= ~483 FLOP/byte << GB10 ridge 500e12/273e9 = 1832 FLOP/byte. The 42% `mul_mat_q<NVFP4,m=128>`
GPU time is weight-DRAM streaming, not tensor cores -> first-principles reason P2a (-26% MMA
occupancy) and Lever-2 were FLAT on decode.
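The floor and arithmetic-intensity numbers in this section can be reproduced with a few lines. This is a sketch using only the central estimates from the table above; the helper names are illustrative.

```python
BW_GBPS = 273.0          # LPDDR5x bandwidth quoted above
PEAK_FLOPS = 500e12      # dense NVFP4 MMA peak quoted above


def floor_ms(gigabytes: float, bw_gbps: float = BW_GBPS) -> float:
    """Minimum step time (ms) needed to move `gigabytes` at `bw_gbps`."""
    return gigabytes / bw_gbps * 1000.0


def arithmetic_intensity(batch: int, bytes_per_weight: float) -> float:
    """FLOP per weight byte for a batched GEMM: 2 FLOP per weight per sequence."""
    return 2.0 * batch / bytes_per_weight


def ridge_point(peak_flops: float = PEAK_FLOPS, bw_gbps: float = BW_GBPS) -> float:
    """Intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_flops / (bw_gbps * 1e9)


# Central per-step byte estimates from the table (GB).
step_bytes_gb = {"weights": 18.4, "ssm_state": 19.0, "kv_cache": 1.6, "activations": 3.0}
total_gb = sum(step_bytes_gb.values())
central_floor = floor_ms(total_gb)                      # ~154 ms
weight_floor = floor_ms(step_bytes_gb["weights"])       # ~67 ms
memory_bound = arithmetic_intensity(128, 0.53) < ridge_point()
```

`memory_bound` evaluates to true because the batch-128 intensity (~483 FLOP/byte) sits well below the ridge (~1832 FLOP/byte).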
### 2. Compute floor (FLOPs / ~500 TFLOP/s dense FP4)
| term | FLOPs/step | floor |
|------|-----------|-------|
| FP4 GEMM (all dense matmuls): 2 * ~26e9 params * 128 seqs | 6.66 TFLOP | / 500e12 = **13.3 ms** (6.7 ms @ sparse 1 PFLOP) |
| GDN recurrence (state update + read-out, 48 layers x 128 seqs) | ~0.04 TFLOP | < 0.1 ms (state-bound, not FLOP-bound) |
| **TOTAL** | ~6.7 TFLOP | **~13 ms/step (~4% of the step)** |
### 3. Verdict / bubble budget / parity target
```
                  compute floor   bandwidth floor     MEASURED step   x above bw-floor
GB10 dense-FP4    ~13 ms          ~154 ms (117-224)
vLLM dense @128                                       327 ms          ~2.1x (1.5-2.8x)
llama dense @128                                      384 ms          ~2.5x (1.7-3.3x)
```
- **Binding floor = bandwidth (~130-155 ms), NOT compute (~13 ms).** Compute floor is ~25x
below the wall -> FP4-MMA throughput is irrelevant; matches P2a/Lever-2 flatness exactly.
- **Both engines run ~2-2.8x ABOVE the bandwidth floor.** vLLM itself reaches only ~40-47%
LPDDR5x efficiency -> even the reference is LATENCY/occupancy bound, not byte-bound.
Confirms prior "decode is 2.5x above its bandwidth floor" work.
- **Bubble budget** (wall - bandwidth floor, central 154): vLLM ~**173 ms**, llama ~**230 ms**.
= kernel-launch latency + occupancy gaps + serial data-dependency stalls.
- **The llama-vs-vLLM gap = 384 - 327 = 57 ms/step (14.8% of the step) is 100% BUBBLE.**
Both engines share IDENTICAL bandwidth AND compute floors (same 18.8 GB NVFP4 weights, same
SSM state, same KV, same GB10 273 GB/s + 500 TFLOP/s). Bytes moved and FLOPs executed are identical,
so the entire 57 ms differential lives in critical-path bubble - NOT bandwidth, NOT compute.
**Parity target: 327 ms/step (391 tok/s @128). llama must shave 57 ms/step = 14.8% off 384 ms.**
Neither floor can move (both already shared with vLLM), so the 57 ms can ONLY come from
collapsing critical-path bubbles -> structurally-correct case for Lever 3 (fuse the serial GDN
gating chain) and/or decode CUDA-graph capture, exactly the two wins vllm-gdn-compare found vLLM
already has. P2a/Lever-2 were flat because they freed OVERLAPPED GPU-busy time BELOW the floor.
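The budget and parity numbers are plain subtraction against the central floor. A sketch of that sizing follows (names are illustrative); it only measures the residual above the floor and does not by itself show what that residual is made of.

```python
def bubble_budget_ms(wall_ms: float, bw_floor_ms: float) -> float:
    """Step time not explained by the bandwidth floor."""
    return wall_ms - bw_floor_ms


def parity_gap(llama_ms: float, vllm_ms: float) -> tuple[float, float]:
    """(ms/step to shave, fraction of the llama step) needed to match vLLM."""
    gap = llama_ms - vllm_ms
    return gap, gap / llama_ms


BW_FLOOR_MS = 154.0   # central bandwidth floor from section 1
vllm_bubble = bubble_budget_ms(327.0, BW_FLOOR_MS)
llama_bubble = bubble_budget_ms(384.0, BW_FLOOR_MS)
gap_ms, gap_frac = parity_gap(384.0, 327.0)
```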
### Cross-check / sizing for the gap-analysis (timeline) agent
- GPU-busy sum from nsysab_new (ntg24 window, /24 steps): FP4 GEMM ~243 + gated_delta_net ~76 +
GDN glue (k_bin_bcast mul ~49, silu ~34, concat ~19, gdn_gather ~21, ssm_conv ~12, l2_norm ~6,
op_add ~10) ~152 + quantize ~62 = **~555 ms GPU-busy vs 384 ms wall** -> sum >> wall by ~1.45x,
so heavy overlap is real and GPU-busy% buckets ARE misleading. Do NOT sum kernel times; the
wall is the critical path.
- Concrete budget: if the inter-kernel IDLE gaps + non-overlapped launch latency along the serial
GDN chain (ssm_conv -> gated_delta_net -> gating elementwise -> gdn_gather, x48 layers x N steps)
sum to **>= 57 ms/step**, Lever 3 is justified AND sized. If those critical-path gaps total
< 57 ms, parity is NOT reachable via GDN-gate fusion alone and the gap is elsewhere (GDN core
kernel slower than vLLM fused_recurrent, or scheduler/H2D).
- Structural corroboration (agrees with vllm-gdn-compare): vLLM runs the GDN region as 2 fused
Triton kernels under a full CUDA graph; llama splits it into ssm_conv + gated_delta_net +
gdn_gather + ~6 serially data-dependent elementwise gate kernels. ~384 host-launched nodes/step
on a chain that cannot overlap is precisely the mechanism that produces llama's extra ~57 ms.
Floors are engine-independent lower bounds; the timeline agent owns proving the 57 ms is
recoverable on the critical path. Roofline says: target 327 ms, shave 57 ms, and it can ONLY
come from bubble (not bytes, not FLOPs).
Assisted-by: Claude:opus-4.8 [Claude Code]
## lever3-design (READ-ONLY, no GPU) - concrete fusion of the serial GDN gate chain into the recurrence kernel
### What actually feeds/consumes the recurrence kernel today (qwen35 decode, fused_gdn_ar)
Traced in `src/models/qwen35.cpp::build_layer_attn_linear` ->
`src/models/delta-net-base.cpp::build_recurrent_attn` (fused !keep branch) ->
`ggml/src/ggml-cuda/gated_delta_net.cu`. The model is GDA (g->ne[0]==1, scalar
gate per head; kda=false in the kernel). S_v = ssm_d_state = 128, so the kernel
runs the `<128>` template: warp_size==S_v==128, num_warps=4, rows_per_lane==1,
grid (H, n_seqs, S_v/4=32 z-tiles). Each warp owns one output column `col`; the
128 lanes hold the full head-vector (one element per lane).
Serial pre-GDN gate chain (each a standalone host-launched ggml node, all on the
critical path between the in-proj GEMMs and the recurrence):
1. `beta = ggml_sigmoid(ssm_beta @ cur)` -> kernel reads `beta_val = *beta_t`
2. `alpha = ssm_alpha @ cur`
3. `ggml_add(alpha, ssm_dt)` (k_bin_bcast op_add)
4. `ggml_softplus(...)` (unary_op<softplus>, 1248 inst)
5. `ggml_mul(softplus, ssm_a)` (k_bin_bcast op_mul; ssm_a = -exp(A_log), baked) -> g; kernel does `expf(g_t)`
6. `ssm_conv` then `ggml_silu` (conv path; may already hit the upstream SSM_CONV+SILU fuse) -> v_conv, and the q/k slices
7. `ggml_l2_norm(q_conv)`, `ggml_l2_norm(k_conv)` (l2_norm_f32<32>, 2496 inst = 1248x2) -> kernel reads q_reg/k_reg
Post-GDN gate (consumes kernel output):
8. `build_norm_gated(output, ssm_norm, z)` = rms_norm(output)*ssm_norm (RMS_NORM+MUL fused) then `silu(z)*.` (unary_gated_op<silu>, the 5.9% bucket)
### The fusion: fold steps 1,3,4,5,7 INTO gated_delta_net_cuda (a "fused-gate" mode)
These five are exactly the per-(head) scalar gates (sigmoid beta; softplus+dt+ssm_a
-> g) and the per-head-vector L2 norms of q/k - and the kernel ALREADY loads every
operand it needs:
- It reads `beta_val` (scalar) -> pass RAW beta, do `beta_val = 1.f/(1.f+expf(-raw))` in-kernel. Removes node 1.
- It reads `g_t` (scalar, GDA) and does `expf(g_t)` -> pass RAW alpha + per-head `ssm_dt[h]` + per-head `ssm_a[h]`, compute `g = ssm_a[h]*op_softplus(alpha + ssm_dt[h])` in-kernel, keep the existing `expf(g)`. `op_softplus(x) = (x>20)?x:logf(1+expf(x))` (copy `ggml_compute_softplus_f32` verbatim). Removes nodes 3,4,5.
- It loads the full q/k head-vector into `q_reg[r]`/`k_reg[r]` (one element per lane at S_v==128). L2-normalize in registers: `float qss = warp_reduce_sum<128>(q_reg[0]*q_reg[0]); q_reg[0] *= rsqrtf(qss + eps* ... )` matching the l2_norm formula, same for k. Each warp redundantly recomputes the (identical) norm for its column - cheap, no shared mem, no extra launch. Removes nodes 7 (x2). `eps` (= f_norm_rms_eps) passed as a kernel float param.
That collapses the pre-GDN serial chain to just: in-proj GEMMs -> build_conv_state(concat) -> ssm_conv(+silu) -> [single fused gated_delta_net kernel]. Five gate nodes (six kernel launches, since step 7 is two l2_norm calls) are removed per SSM layer per decode step.
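As a host-side illustration of the scalar part of the fold (nodes 1, 3, 4, 5), the sketch below computes the gates once as a chain of separate steps and once inline. It is a reference in Python doubles, not the CUDA kernel, and the function names are hypothetical; the point is that the two paths apply the same formula in the same order.

```python
import math


def softplus(x: float) -> float:
    # Same thresholded form the text quotes: (x > 20) ? x : logf(1 + expf(x))
    return x if x > 20.0 else math.log(1.0 + math.exp(x))


def gate_chain_unfused(alpha: float, beta_raw: float, ssm_dt: float, ssm_a: float):
    """Nodes 1, 3, 4, 5 as separate steps (one value per head)."""
    beta = 1.0 / (1.0 + math.exp(-beta_raw))   # node 1: sigmoid
    x = alpha + ssm_dt                         # node 3: add
    sp = softplus(x)                           # node 4: softplus
    g = sp * ssm_a                             # node 5: mul
    return beta, g


def gate_fused(alpha: float, beta_raw: float, ssm_dt: float, ssm_a: float):
    """The same math done inline, as a fused kernel prologue would."""
    beta = 1.0 / (1.0 + math.exp(-beta_raw))
    g = ssm_a * softplus(alpha + ssm_dt)
    return beta, g
```

Because these are pointwise operations with no reduction, both functions return identical values for the same inputs.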
### Why the OUTPUT gate (step 8) is NOT folded into this kernel
The output gated-rmsnorm reduces over the full head_v_dim (S_v=128) per (head,seq).
In this kernel those 128 elements are produced by 128 DIFFERENT (warp x z-tile)
blocks (4 warps x 32 z-tiles), so an in-kernel head-wide reduction would need a
grid-global sync - not feasible without a grid redesign. Leave step 8 as the
existing RMS_NORM+MUL + unary_gated<silu> fusion (already 2 launches, not in scope).
The conv-silu (step 6) is a convolution, structurally separate; rely on the
existing upstream SSM_CONV(+ADD)+SILU fuse rather than pulling it into the
recurrence kernel.
### Implementation scope
- `ggml/include/ggml.h`: new builder `ggml_gated_delta_net_inplace_ids_fused_gate(ctx, q_raw, k_raw, v, alpha_raw, beta_raw, cache4d, state_dst, ids, ssm_a, ssm_dt, rs_head, eps)` (or an op-param flag GDN_FUSE_GATE on the existing builder + 2 extra srcs). src budget: current op uses src[0..7]; add ssm_a -> src[8], ssm_dt -> src[9]. GGML_MAX_SRC==10, so it fits EXACTLY (zero headroom - note for review).
- `ggml/src/ggml.c`: builder + a new op-param i32 flag (e.g. params[2]=fuse_gate) + f32 param for eps; assert shapes (ssm_a/ssm_dt are [num_v_heads]).
- `ggml/src/ggml-cuda/gated_delta_net.cu`: in `gated_delta_net_cuda`, gate the in-kernel sigmoid/softplus-gate/l2norm behind a `bool FUSE_GATE` template param (4th template bool, keeps the non-fused path byte-identical and avoids register bloat when off). Read ssm_a[h_idx], ssm_dt[h_idx]; compute g per head; sigmoid raw beta; warp-reduce q_reg/k_reg sumsq -> rsqrtf scale. Plumb the 2 new src pointers + eps through `launch_gated_delta_net` and `ggml_cuda_op_gated_delta_net` (read src[8],src[9], op_param eps/flag). The `gdn_gather_nonident` path is unaffected (it gathers state, not q/k/g/beta).
- `ggml/src/ggml-cpu/ops.cpp`: mirror in `ggml_compute_forward_gated_delta_net_one_chunk` (host sigmoid/softplus/l2norm before the per-token math) for CPU parity / test-backend-ops.
- `src/models/delta-net-base.cpp::build_recurrent_attn` (the fused !keep + ids branch, and the inplace non-ids branch): call the fused-gate builder, pass raw alpha/beta/q/k + ssm_a + ssm_dt + eps.
- `src/models/qwen35.cpp` / `qwen35moe.cpp` / `qwen3next.cpp` `build_layer_attn_linear`: when the fuse flag is on, DROP `ggml_sigmoid(beta)`, `ggml_add(alpha,dt)`, `ggml_softplus`, `ggml_mul(.,ssm_a)`, and the two `ggml_l2_norm` nodes; hand the raw tensors + `model.layers[il].ssm_a`, `ssm_dt` to build_recurrent_attn. The conv-silu and z/output-gate path are unchanged.
- Guard the whole thing behind `cparams.fused_gdn_gate` / env `LLAMA_FUSE_GDN_GATE` (default OFF) so it A/Bs against the clean Lever-1 build exactly like P2a/Lever-2, and only the recurrent (GDA) qwen35 family path is touched.
### Numeric considerations / bit-exactness
- sigmoid(beta), softplus(alpha+dt), and the `g = ssm_a*softplus` mul/add are pointwise fp32 with the SAME formula/order as the standalone ggml ops -> these can be **bit-exact** (no reduction). softplus must copy `(x>20)?x:logf(1+expf(x))` exactly.
- q/k l2norm is the ONE op with a reduction: the standalone `l2_norm_f32<32>` does its own warp/block reduction; the in-kernel `warp_reduce_sum<128>` tree may differ in the last ULP, and the eps placement (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) must match the ggml l2_norm exactly. Expect **near-bit-exact, not guaranteed byte-identical** greedy output. So unlike Levers 1/2, gate this on a **PPL/KL tolerance** (KL logit delta < ~1e-3, PPL delta within noise) rather than md5 identity. If byte-identity is required, exclude l2norm from the fold (keep nodes 7) and fuse only sigmoid/softplus/gate - but that drops the value to ~0.3% and is probably not worth it.
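The eps-placement concern can be seen with two small reference functions. This is a sketch in Python doubles with hypothetical names; which of the two forms the ggml l2_norm actually uses has to be read from the source, as the bullet above says.

```python
import math


def l2norm_rsqrt(x, eps):
    """x * rsqrt(sumsq + eps): eps is added inside the square root."""
    s = sum(v * v for v in x)
    return [v / math.sqrt(s + eps) for v in x]


def l2norm_clamp(x, eps):
    """x / max(sqrt(sumsq), eps): eps only clamps the norm from below."""
    n = math.sqrt(sum(v * v for v in x))
    return [v / max(n, eps) for v in x]
```

For a well-scaled vector such as `[3.0, 4.0]` the two agree to about 1e-8 but are not identical; for a tiny vector such as `[1e-4, 0.0]` with `eps = 1e-6` they diverge by an order of magnitude, so the fused kernel has to copy the exact form to stay comparable.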
### Estimated kernels-removed-per-layer and the honest ceiling
- Removed per SSM decode layer-step: sigmoid(beta) + add(dt) + softplus + mul(ssm_a) + l2norm(q) + l2norm(k) = **6 host-launched kernels -> 0**, collapsing 7 nodes (incl. recurrence) to 1. Across 48 SSM layers = **~288 launches/step removed** (matches the instance deltas: l2_norm 2496, softplus 1248, sigmoid 1248, plus the alpha-add/ssm_a-mul share of op_add/op_mul).
- GPU-BUSY ceiling of the removed ops is small: l2_norm 1.0% + softplus ~0% + sigmoid 0.3% + (dt add + ssm_a mul share of op_add 1.7% / op_mul 8.5%). The point of Lever 3 is NOT the freed busy-time (P2a/Lever-2 proved freeing overlapped busy-time is flat) - it is removing ~288 LAUNCH BUBBLES/step that sit on the serial conv->gate->recurrence dependency where the GPU is otherwise idle. The win is wall-clock only if those specific bubbles are on the critical path.
### RISK (must be settled before building)
1. **Same trap as P2a/Lever-2 if the bubbles overlap.** If the scheduler already
overlaps these pre-GDN gate kernels with an adjacent layer's 42% mul_mat_q GEMM,
Lever 3 is FLAT. **Precondition: the timeline gap analysis must show idle GPU
between ssm_conv -> (sigmoid/softplus/l2norm) -> gated_delta_net per layer** at
batch=128 BEFORE building. If the trace shows the gate ops back-to-back with no
gap (overlapped), do NOT build op-fusion; go to lever (2) below.
2. **The bigger bubbles may be elsewhere on the chain.** The large buckets are op_mul
8.5% and unary_gated<silu> 5.9% - much of which is the POST-GDN output gate and
FFN, which this fusion does NOT touch. If the gap analysis pins the dominant idle
to the post-GDN region or to inter-layer launch latency generally, the
higher-leverage Lever 3 is **decode CUDA-graph capture** (removes host launch
latency for ALL ~384 nodes/step at once, exactly what vLLM does), not per-op
fusion. CUDA-graph is the strictly larger hammer here; op-fusion only helps the
pre-GDN slice. Recommend measuring the per-sub-op gap first and preferring the
CUDA-graph lever if the bubbles are spread across the step rather than concentrated
in the pre-GDN gate slice.
3. **src-slot exhaustion** (src[8],src[9] use the last 2 of GGML_MAX_SRC=10) - any
later op needing more srcs on this node has zero headroom; flag for review.
## cudagraph-coverage (READ-ONLY, no GPU) - does the CUDA graph cover the GDN serial chain at B=128?
### Verdict: YES, the graph covers GDN at batch=128 (dense model). No GDN op forces graph-disable or per-step re-instantiation.
Source: `ggml/src/ggml-cuda/ggml-cuda.cu` (graph state machine), `gated_delta_net.cu`
(fused op), `src/models/delta-net-base.cpp` (graph build), `src/llama-memory-recurrent.cpp`
(recurrent head), all on dev tree `~/llama-paged-dev` (HEAD df1cc97, Lever-1). Cross-checked
against the committed A2_CUDAGRAPH_DECODE.md + DECODE_PARITY_EXPLORE.md measurements.
### How graph-disable / re-instantiation are decided (this fork's state machine)
- `ggml_cuda_graph_check_compability` (ggml-cuda.cu:3251) disables the graph for ONLY two
reasons: (a) a split-buffer src, (b) `GGML_OP_MUL_MAT_ID` with non-quantized weights OR
`node->ne[2] > get_mmvq_mmid_max(...)` [TAG_MUL_MAT_ID_CUDA_GRAPHS]. GATED_DELTA_NET,
SSM_CONV, SSM_SCAN, GET_ROWS, CONCAT, the gating elementwise ops are NOT in the disable
list. So no GDN op forces graph-disable.
- `ggml_cuda_graph_update_required` (3297) memcmps, per node, the full `ggml_tensor` struct
(incl. `op_params` and `data`) + each src's `data` ptr / `ne` / `nb`. ANY delta -> the
warmup state machine (ggml_backend_cuda_graph_compute:4464) resets `warmup_complete` and the
WHOLE graph (one key = `cgraph->nodes[0]`) runs eager that step until stable again. Buffer
CONTENTS are NOT compared - a contents-only change (e.g. ids values) is graph-safe.
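A toy model of the update check described above may help: it compares per-node properties (op params, data pointer, shape, strides, source pointers) and never looks at buffer contents. `NodeProps` and `update_required` are hypothetical stand-ins, not the ggml structures, and the pointer and shape values are made up.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NodeProps:
    """Hypothetical stand-in for the per-node properties that get compared."""
    op: str
    op_params: tuple
    data_ptr: int
    ne: tuple
    nb: tuple
    src_ptrs: tuple = ()


def update_required(prev: list, cur: list) -> bool:
    """True if any node's properties changed, i.e. the captured graph is stale."""
    if len(prev) != len(cur):
        return True
    return any(p != c for p, c in zip(prev, cur))


# Illustrative node; contents of the buffers are deliberately not modeled.
gdn = NodeProps(op="GATED_DELTA_NET", op_params=(0, 0), data_ptr=0x1000,
                ne=(128, 48, 128, 1), nb=(4, 512, 24576, 3145728))
same_step = update_required([gdn], [gdn])                              # contents-only change
moved_head = update_required([gdn], [replace(gdn, op_params=(0, 1))])  # op_param changed
```

`same_step` is false and `moved_head` is true, which mirrors the rule that a contents-only change is graph-safe while any property delta forces a re-warm.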
### Why the GDN region's properties are STABLE across steady decode steps
The fused decode path is `ggml_gated_delta_net_inplace_ids` (delta-net-base.cpp:558-560):
```
state_dst = ggml_view_2d(ctx, ssm_states_all, n_embd_s, n_seqs, nb1,
kv_head * n_embd_s * elsize); // offset = kv_head
ggml_gated_delta_net_inplace_ids(..., cache4d, state_dst, ids, /*rs_head=*/(int)kv_head);
```
Both the `state_dst` view byte-offset and the `rs_head` op_param (read back as
`ggml_get_op_params_i32(dst,1)` in gated_delta_net.cu:330) derive from
`kv_head = mctx_cur->get_head()`. In `llama_memory_recurrent::find_slot`
(llama-memory-recurrent.cpp:610-689) the n_seqs used cells are SWAPPED into the contiguous
range `[min .. min+n_seqs-1]` and `head = min`. The recurrent cache does NOT grow per token
(one state cell per sequence, unlike the KV cache). For a steady 128-seq continuous batch the
same sequences own the same tails every step, so `min`/`head` are constant (=0) -> state_dst
offset constant, rs_head op_param constant. The GDN inputs (q,k,v,g,b, cache4d, ids) are
fixed-shape (n_seqs=128, n_rs slots), so ne/nb are stable, and ggml-alloc hands out the same
compute-buffer offsets each same-topology ubatch -> data ptrs stable. The `ids` (s_copy)
tensor's CONTENTS change per step but its address/ne/nb do not -> graph-safe.
### The fused GDN op is capture-safe (no host-sync, no capture-time cudaMalloc)
`gated_delta_net.cu`: the op launches `gdn_gather_nonident_kernel` + `gated_delta_net_cuda`
on `ctx.stream()` with NO `cudaStreamSynchronize` / host `cudaMemcpy` / `cudaMalloc`. The
gather scratch is `ggml_cuda_pool_alloc` (VMM pool, served from reserved memory after warmup,
no real cudaMalloc during capture). `gdn_gather_nonident` early-returns for identity sequences
(`ids[s]==rs_head+s`), which is the steady-decode case, so its 3.7% is a launched-but-mostly-
noop kernel - still captured into the graph like any other. Capture succeeds (the build runs,
graphs engage), confirming none of these break stream capture.
### The only re-instantiation is NOT GDN-driven
A2 already measured the re-warm cadence: the graph re-instantiates every ~256 tokens because
the FULL-ATTENTION block-table input `idx` has `ne[0]=GGML_PAD(n_kv,256)` (and kq_mask in
lockstep) - those step at 256-token boundaries (paged-attn.cpp:199-213). ~97% of decode steps
replay the captured graph. This is a full-attention-layer input, not a GDN op. (The unpadded
`LLAMA_KV_PAGED_GATHER` fallback grows `ne[0]` every step and runs pure-eager, but that is not
the default decode path and is not the GDN/SSM path.)
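The 256-token cadence follows from the padding alone. The sketch below assumes `GGML_PAD` has the usual round-up-to-a-power-of-two-multiple semantics and counts how often the padded `n_kv` changes over a decode run; how many eager steps each change costs depends on the warmup state machine and is not modeled.

```python
def ggml_pad(x: int, n: int) -> int:
    """Round x up to a multiple of n (n must be a power of two)."""
    return (x + n - 1) & ~(n - 1)


def shape_changes(n_kv_start: int, steps: int, pad: int = 256) -> int:
    """Count decode steps at which the padded n_kv (hence ne[0]) changes."""
    changes = 0
    prev = ggml_pad(n_kv_start, pad)
    for step in range(1, steps + 1):
        cur = ggml_pad(n_kv_start + step, pad)
        if cur != prev:
            changes += 1
        prev = cur
    return changes
```

Starting from a 128-token prompt, 1024 decode steps cross the padding boundary only four times, so the captured graph's input shapes are unchanged for the large majority of steps.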
### Reconciliation with the "~40% util / 60% idle bubbles" premise (it is refuted for GDN)
The committed nsys sweeps (A2_CUDAGRAPH_DECODE.md, DECODE_PARITY_EXPLORE.md) show the steady
decode is ~99.4-99.5% GPU-BUSY with graphs ON (measured with `--cuda-graph-trace=node`; a
graphs-ON trace WITHOUT that flag under-counts GPU rows and falsely reports idle - Trap #2).
Total exposed idle is ~0.65% of the step; the within-step launch fraction graphs remove is
0.34% (0.37%->0.11%) and is ALREADY collapsed - the GDN sub-op launch gaps are inside the
captured region. The "40% utilization" in the STATE is BANDWIDTH-roofline util, not idle SMs:
decode moves ~55.5 GB/step at 2.48x the 273 GB/s floor, SSM state r+w = 66% of step bytes. The
GDN recurrence is memory-bandwidth-bound at low occupancy (~12-16%), not launch-gap-bound. So
"60% idle bubbles on the serial GDN chain" is not supported by the traces; the gap to vLLM is
SSM-state memory traffic, consistent with P2a/Lever-2 being flat (freeing GPU-busy time, not
wall-clock).
### Graph-safe lever for GDN: none new
- GDN is already graph-covered; there is no "make the GDN ops graph-safe" lever to build - they
are already safe and captured.
- The only genuinely graph-NON-covered idle is the BETWEEN-step host gap (~2 ms/step, ~0.4%):
ggml rebuilds/reallocs the cgraph each step with a new `cgraph->uid`, so the uid fast-path in
ggml_cuda_graph_update_required never fires and the host re-dispatches ~3100 launches on the
Grace cores between graph launches (vLLM builds its graph once + persistent device metadata).
A persistent/reused cgraph across decode steps would let the uid fast-path fire and shrink the
host gap - but at 0.4% of the step it is second-order to the SSM bandwidth floor.
- CAVEAT (MoE, qwen35moe): MUL_MAT_ID at B=128 can trip [TAG_MUL_MAT_ID_CUDA_GRAPHS]
(`ne[2] > mmvq_mmid_max`), disabling the WHOLE MoE-decode graph (GDN included) into eager.
That is a MUL_MAT_ID disable, not a GDN break, and does not touch the dense 335 tok/s headline;
worth a separate confirm for the MoE model.
## decode-timeline-gap (GPU, label gap-analysis) - the decisive fresh node-level measurement
This is the new GPU run the analysis was waiting on. It arbitrates between the
roofline/vllm-gdn-compare theory ("57 ms = 100% bubble, Lever 3 closes it") and the
cudagraph-coverage source verdict ("~99.4% busy, bandwidth-bound, bubbles refuted").
The measurement confirms the latter and refutes the former, with per-kernel numbers.
### Capture (the trap the prior `--trace=cuda` fell into is now avoided)
`nsys profile --trace=cuda --cuda-graph-trace=node` on build-cuda-base (clean
Lever-1, HEAD df1cc97, git-clean mmq.cuh), q36-27b-nvfp4 dense, `-fa on -npp 128
-ntg 24 -npl 128 -c 33000`. Artifacts on DGX: `~/llama-paged-dev/nsysgap.{nsys-rep,
sqlite}`. The decode step is a single CUDA graph (graphId=11, 23 replays = steps
2-24; graphId=1 x8 = prefill). Plain `--trace=cuda` recorded each step as ONE opaque
~380 ms block, so the widely-cited `nsysab_new.kern.txt` breakdown (mul_mat_q 42%,
gated_delta_net 13%) is PREFILL + the single eager capture step, NOT decode. With
node-level trace the graph expands: 168201 kernels = 91499 graph-internal + 76702
eager prefill. **All graph kernels on stream 14 (single stream) -> strictly serial,
no overlap, so any inter-kernel gap is pure GPU idle.**
### One steady decode step (window between decode launches 22413.26 / 22796.74 ms, width 383.48 ms)
Exactly 48 `gated_delta_net` + 16 `flash_attn` = one clean step (48 GDN + 16 attn).
2965 kernels.
| classification | ms/step | % of step |
|---|---|---|
| (a) inter-kernel LAUNCH gaps + (b) SERIAL-DEPENDENCY stalls (LAG sum, single stream) | **0.225** | **0.06%** |
| (c) within-kernel time (GPU running) | 380.4 | 99.94% |
Zero gaps > 5 us. Largest single gap 2.40 us. 1260 sub-1us gaps + 1700 back-to-back.
**The decode step is 99.94% GPU-busy. There are no bubbles.** This independently
confirms cudagraph-coverage's ~99.4% and **refutes** roofline-decode's "57 ms = 100%
bubble" and vllm-gdn-compare's "~384 launch bubbles/step on the critical path".
nvidia-smi's "40% util" = low SM/compute efficiency WITHIN kernels (c) (memory-latency-
bound, ~12-16% achieved occupancy), not wall-clock idle.
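The LAG-sum classification can be sketched as follows, assuming the kernel records have already been exported as (start, end) pairs for one stream and sorted by start time. The function names are illustrative and the sample timestamps in the usage note are synthetic, not taken from the trace.

```python
def idle_gaps_us(kernels):
    """Gaps between consecutive kernels on one stream.

    kernels: list of (start_us, end_us) tuples, sorted by start time.
    Returns a list of gap lengths in microseconds (0 for back-to-back).
    """
    gaps = []
    for (_, prev_end), (start, _) in zip(kernels, kernels[1:]):
        gaps.append(max(0.0, start - prev_end))
    return gaps


def busy_fraction(kernels):
    """Share of the window from first start to last end spent inside kernels."""
    window = kernels[-1][1] - kernels[0][0]
    idle = sum(idle_gaps_us(kernels))
    return (window - idle) / window
```

For example, `busy_fraction([(0.0, 10.0), (10.0, 25.0), (26.0, 40.0)])` gives 0.975: one 1 us gap in a 40 us window. On a single serial stream every such gap is GPU idle time, which is why the sum is a valid upper bound on launch and dependency stalls.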
### Real decode kernel mix (% of the 380.4 ms step) - corrects the prefill-contaminated kern_sum
| kernel | n/step | ms | % | grid CTAs | waves/48SM |
|---|---|---|---|---|---|
| gated_delta_net_cuda | 48 | **196.37** | **51.6** | 48x128x32 = 196608 | 4096 |
| mul_mat_q (FP4 in/out/qkv/o proj) | 496 | 92.90 | 24.4 | 136 | 1.5 |
| quantize_mmq_nvfp4 | 496 | 17.13 | 4.5 | 483 | 10 |
| nvjet GEMM (lm_head) | 1 | 11.91 | 3.1 | 1944 | 40 |
| flash_attn_ext_f16 (16 attn layers) | 16 | 11.67 | 3.1 | 48 | 1.0 |
| concat_cont (conv-state) | 48 | 8.01 | 2.1 | 20480 | 427 |
| cpy_scalar | 64 | 7.62 | 2.0 | 49152 | 1024 |
| k_get_rows_float | 49 | 7.08 | 1.9 | 15098 | 315 |
| k_bin_bcast (gate mul + add) | 720 | 6.59 | 1.7 | 3169 | 66 |
| ssm_conv_f32 | 48 | 5.64 | 1.5 | 10240 | 213 |
| unary_gated (silu/sigmoid) | 128 | 5.36 | 1.4 | 5888 | 123 |
| mul_mat_q_stream_k_fixup | 304 | 3.94 | 1.0 | 192 | 4 |
| rms_norm_f32 | 209 | 3.52 | 0.9 | 1764 | 37 |
| l2_norm_f32 | 96 | 0.64 | 0.2 | | |
| gdn_gather_nonident | 48 | **0.061** | 0.016 | | |
- `gated_delta_net` is **51.6% of the step**, the single dominant term. The
previously-cited "1.47 ms/call near-vLLM" was the EAGER average over 1248 calls
(range 0.046-4.42 ms = prefill warmups + capture); true steady decode is
**4.08-4.11 ms/call** (gridY=128 = the 128 seqs). 2.8x higher than believed.
- It launches 196608 CTAs / 4096 waves = NOT occupancy-starved; the cost is
bandwidth-bound state traffic (~384 MB read + ~384 MB write per layer for the
48-head x 128-seq x [state 128 x head_v 128] recurrent state, ~190 GB/s effective).
- The Lever-3 narrow target (gating glue) = k_bin_bcast 6.59 + silu/sigmoid 5.36 +
l2_norm 0.64 + softplus 0.13 = **12.76 ms = 3.35%** of the step. `gdn_gather` is
**0.06 ms** (negligible - it early-returns on identity ids as predicted).
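
The state-traffic estimate in the bullets above can be re-derived with a few lines of arithmetic. This is a back-of-envelope sketch, assuming an f32 state (4 bytes/element) and exactly one read plus one write of the state per call; the constant names are illustrative and not taken from the kernel source.

```python
# Recurrent-state traffic for one gated_delta_net call at B=128 (dense model).
HEADS = 48          # value heads (from the notes)
SEQS = 128          # sequences in the batch (gridY)
STATE_DIM = 128     # state rows per head
HEAD_V = 128        # head_v columns per head
BYTES_F32 = 4       # assumption: state is stored as f32
MS_PER_CALL = 4.08  # measured steady-decode time per call

state_bytes = HEADS * SEQS * STATE_DIM * HEAD_V * BYTES_F32
state_mib = state_bytes / 2**20

# Assumption: one full read and one full write of the state per call.
traffic_bytes = 2 * state_bytes
effective_gbps = traffic_bytes / (MS_PER_CALL * 1e-3) / 1e9

print(f"state per layer: {state_mib:.0f} MiB read + {state_mib:.0f} MiB write")
print(f"effective bandwidth: {effective_gbps:.0f} GB/s")
```

Depending on whether "MB" is read as 10^6 or 2^20 bytes, the result lands in the same ~190-200 GB/s range quoted above.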
### The three answers (with numbers)
1. **Bubbles on the serial GDN critical path?** NO. 0.225 ms idle/step = 0.06%,
zero gaps > 5 us. CUDA graphs eliminated launch overhead; serial dependencies do
not produce idle (each kernel starts < 1 us after the previous). The premise is
refuted by direct measurement.
2. **Would Lever 3 (fuse the gating chain) shrink the step or overlap away?** It
shrinks it, but only by its hard ceiling **12.76 ms = 3.35%** (380 -> 367 ms, 336
-> ~348 tok/s, 86% -> 89% of vLLM). It does NOT close the 14% / 53-57 ms gap.
IMPORTANT mechanism correction: the step is single-stream and 99.94% busy, so
there is NO overlap to absorb freed time (the lever3-design RISK #1 "same trap as
P2a if overlapped" does NOT apply - nothing overlaps). So removing those kernels'
GPU-time DOES cut wall-clock - but the win is removing their HBM byte traffic, NOT
launch bubbles (there are none). And the value is the measured ~12.76 ms, not the
"~288 launch bubbles" framing (those launches cost ~0 inside the graph). This also
explains P2a/Lever-2 flatness correctly: NOT "overlapped busy-time" (no overlap),
but P2a tuned the prefill large-M GEMM (decode GEMMs are 136-CTA tail-bound, untouched)
and Lever-2 relocated mandatory quantize work into the GEMM prologue (net zero).
3. **Do CUDA graphs cover the GDN region at B=128?** YES, fully. Whole step = one
graph, 23 replays, ~0.2 ms host gap between steps. `gdn_gather_nonident` and the
in-place state ops are graph-internal nodes (graphNodeId != 0); no fragmentation.
Confirms cudagraph-coverage. Note: lever #2 from vllm-gdn-compare ("CUDA-graph the
decode step") is ALREADY IN EFFECT in this build and did not close the gap - so it
is spent, not pending.
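
The Lever-3 ceiling in answer 2 is plain subtraction, which is only valid because the step is single-stream with no overlap (removed GPU-time equals removed wall-clock). A minimal sketch, assuming the 327 ms vLLM step and the batch of 128 quoted elsewhere in these notes; `project` is a hypothetical helper, not part of any tool.

```python
VLLM_STEP_MS = 327.0   # vLLM step time on the same hardware (from the notes)
BATCH = 128            # sequences decoded per step

def project(step_ms, removed_ms):
    """Project step time, throughput and %-of-vLLM after removing GPU-time.

    Only meaningful for a single-stream, ~100%-busy step with no overlap.
    """
    new_step = step_ms - removed_ms
    return {
        "step_ms": new_step,
        "tok_s": BATCH / new_step * 1000.0,
        "pct_of_vllm": VLLM_STEP_MS / new_step * 100.0,
    }

baseline = project(380.4, 0.0)
lever3 = project(380.4, 12.76)   # gating-glue ceiling
for name, r in (("baseline", baseline), ("lever3", lever3)):
    print(f"{name}: {r['step_ms']:.1f} ms, {r['tok_s']:.0f} tok/s, "
          f"{r['pct_of_vllm']:.0f}% of vLLM")
```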
### Verdict against roofline-decode's own sizing test
roofline-decode stated: "if critical-path gaps total < 57 ms, parity is NOT reachable
via GDN-gate fusion alone and the gap is elsewhere (GDN core kernel slower than vLLM
fused_recurrent)." **Measured gaps = 0.225 ms << 57 ms.** Therefore, by that test, the
53-57 ms / 14% gap is NOT bubble and NOT closable by gating fusion. It lives in
**kernel GPU-time**, dominated by the `gated_delta_net` recurrence (51.6%, bandwidth-
bound) and secondarily the FP4 GEMM + quantize stack (29%). The "57 ms = 100% bubble"
roofline conclusion was an inference from the prefill-contaminated GPU-busy sum
(~555 ms vs 384 ms "implies overlap"); the node-level decode-only measurement shows
per-step GPU-busy = wall (no overlap), so that inference does not hold.
### Recommendation (resized)
- The real lever is the `gated_delta_net` recurrence kernel itself (196 ms, 51.6%):
match vLLM's `fused_recurrent_gated_delta_rule_packed_decode` (vllm-gdn-compare
kernel #4) which folds l2norm + gate + decay + recurrence + state-writeback into a
SINGLE pass over the state, reducing HBM round-trips of the state. The win is byte
reduction in a memory-bound single-stream step, not bubble removal.
- The lever3-design fusion is still worth doing as a component of that (it removes
~12.76 ms = 3.35% of real byte traffic, and unlike its own RISK section feared, it
will NOT be flat because there is no overlap), but on its own it is a ~3% lever, not
the gap-closer. Build it folded into a single-pass recurrence kernel, not as an
isolated gate fold.
- Next decisive measurement (future GPU-agent run): profile vLLM's decode step at
npl128 with the same node-level method and compare per-region GPU-time (GDN
recurrence vs GEMM vs attention) to localize exactly where vLLM spends its 53-57 ms
less. Both engines move near-identical bytes only if vLLM's fused recurrence does
not re-stream state; the per-kernel A/B will show whether the gap is the recurrence
pass or the GEMM/quantize stack.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## SYNTHESIS (final) - the validated decode-parity picture, ranked plan, and verdict
Reconciles all six investigation sections above plus the three adversarial verdicts
(Verify A/B/C). One sentence: **the "~60% idle" never existed; the decode step is
99.94% GPU-busy single-stream, so the 14% gap to vLLM is kernel GPU-time, dominated by
the bandwidth-bound `gated_delta_net` recurrence (51.6%), and the only gap-closing levers
are byte-reduction inside that kernel - NOT launch-bubble removal.**
### 1. The proven critical-path decomposition of the decode step
Decisive node-level trace (`nsys --cuda-graph-trace=node`, clean Lever-1 build df1cc97,
q36-27b-nvfp4 dense, npl128, GB10/48SM/sm_121, commit a7238525, nsysgap.sqlite). One
steady step = single replayed CUDA graph (graphId=11, 23 replays), all 2965 kernels on
ONE stream (stream 14, strictly serial -> every inter-kernel gap is pure idle). Window
383.48 ms.
BUBBLE CLASSIFICATION (the "where is the ~60% idle" answer - it is NOT idle):
| bucket | ms/step | % step | note |
|---|---|---|---|
| (a) inter-kernel launch bubbles | ~0 | ~0 | graph replay collapses host launch latency |
| (b) serial-dependency stalls (GDN chain) | included in 0.225 | 0.06 | each kernel starts < 1 us after prev; zero gaps > 5 us, max 2.40 us |
| (a)+(b) total exposed idle (LAG sum) | **0.225** | **0.06%** | 1700 kernels back-to-back |
| (d) between-step HOST gap (cgraph rebuild, new uid) | ~0.2 | ~0.05 | the ONLY graph-non-covered idle; ~0.4% in older eager-tail traces |
| (c) within-kernel GPU-busy | **380.4** | **99.94%** | this is the whole step |
The nvidia-smi "40%" is within-kernel SM/bandwidth efficiency (~12-16% achieved
occupancy on memory-latency-bound kernels), NOT wall-clock idle.
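
The "LAG sum" behind buckets (a)+(b) is the sum of gaps between each kernel's start and the previous kernel's end on the single stream. The sketch below shows that computation on synthetic timestamps; the tuple layout and the `exposed_idle` name are assumptions for illustration, not the schema of the nsys export.

```python
def exposed_idle(kernels):
    """Sum inter-kernel gaps for kernels on ONE strictly serial stream.

    kernels: list of (start_ns, end_ns) tuples, in any order.
    """
    ordered = sorted(kernels)
    gaps = [
        max(0, cur_start - prev_end)
        for (_, prev_end), (cur_start, _) in zip(ordered, ordered[1:])
    ]
    window = ordered[-1][1] - ordered[0][0]
    idle = sum(gaps)
    return {
        "idle_ns": idle,
        "idle_pct": 100.0 * idle / window,
        "max_gap_ns": max(gaps, default=0),
        "gaps_over_5us": sum(g > 5_000 for g in gaps),
    }

# Synthetic three-kernel trace with sub-microsecond gaps between kernels.
trace = [(0, 4_080_000), (4_080_600, 4_270_000), (4_270_400, 8_350_000)]
stats = exposed_idle(trace)
print(stats)
```

On a multi-stream trace this per-stream sum would overstate idle; there an interval union across streams is the appropriate measure.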
KERNEL GPU-TIME DECOMPOSITION of the 380.4 ms busy step (this is where the gap lives):
| kernel | ms | % step | regime |
|---|---|---|---|
| `gated_delta_net_cuda<128>` (48x, 4.08 ms/call) | **196.37** | **51.6** | bandwidth-bound f32 recurrent-state R+W (~384 MB R + 384 MB W/layer) |
| `mul_mat_q` FP4 GEMM (496x) | 92.90 | 24.4 | memory-bound weight stream, 136-CTA tail-bound at decode |
| `quantize_mmq_nvfp4` (496x) | 17.13 | 4.5 | mandatory act-quant (Lever-2 only relocated it) |
| `nvjet` lm_head GEMM | 11.91 | 3.1 | |
| `flash_attn_ext_f16` (16 attn layers) | 11.67 | 3.1 | |
| `concat_cont` (conv-state splice) | 8.01 | 2.1 | Lever-1 target |
| `cpy_scalar` (conv-state writeback + dup) | 7.62 | 2.0 | Lever-1 target (the conv-state share) |
| `k_get_rows_float` | 7.08 | 1.9 | |
| `k_bin_bcast` (gate mul + add) | 6.59 | 1.7 | Lever-3 gate-fold target (partial - rest is residual adds) |
| `ssm_conv_f32` | 5.64 | 1.5 | folds into Lever-1 |
| `unary_gated` (silu/sigmoid) | 5.36 | 1.4 | mostly FFN + output-gate (Lever 3 does NOT touch) |
| `mul_mat_q_stream_k_fixup` | 3.94 | 1.0 | |
| `rms_norm_f32` | 3.52 | 0.9 | |
| `l2_norm_f32` | 0.64 | 0.2 | Lever-3 gate-fold target |
| `gdn_gather_nonident` | 0.061 | 0.016 | negligible (early-returns on identity ids) |
GDN region (recurrence + conv + concat + cpy + gather + l2norm) >= 210 ms = 55%+ of the step.
The widely-cited "gated_delta_net 13%, 1.47 ms/call near-vLLM" from nsysab_new.kern.txt was
PREFILL + the single eager capture step contaminating the average over 1248 calls (range
0.046-4.42 ms); true steady decode is 4.08 ms/call, 2.8x higher, 51.6% of the step.
### 2. Claims A / B / C: which HOLD, which are REFUTED, and the residual uncertainty
**CLAIM A** ("the ~60% decode GPU-idle is inter-op launch bubbles ON the serial GDN
chain"): **REFUTED.** Measured idle = 0.225 ms = 0.06%, not the ~53-57 ms the claim
requires (two-plus orders of magnitude short). Zero gaps > 5 us; CUDA-graph replay
already collapsed launch latency; serial data-dependency does NOT equal idle when the
graph dispatches nodes back-to-back. The "40%" was a misread of within-kernel SM
efficiency; the "555 ms busy-sum > 384 ms wall implies overlap" was a prefill-contaminated
`--trace=cuda` artifact (each step recorded as one opaque ~380 ms block).
**CLAIM B** ("Lever 3 - gate fusion - moves the wall, unlike P2a/Lever-2, by removing
serial launch bubbles"): **REFUTED on mechanism.** (i) There are no bubbles to remove
(0.06%). (ii) The contrast is fictional: the step is single-stream with ZERO overlap
anywhere, so P2a/Lever-2 were NOT flat because they "optimized overlapped work" - P2a
tuned the prefill large-M GEMM (decode GEMMs are a different 136-CTA tail regime) and
Lever-2 merely relocated mandatory quantize work into the GEMM prologue (net zero).
(iii) Where the claim is trivially true (any kernel removal cuts wall in a 99.94%-busy
single-stream step), the slice Lever 3 actually fuses ceilings at **12.76 ms = 3.35%**
(k_bin_bcast 6.59 + silu/sigmoid 5.36 + l2_norm 0.64 + softplus 0.13 - and even that
over-counts, since silu is mostly untouched FFN/output-gate). So the wall DOES move, but
only ~3% (380 -> ~367 ms, 86% -> ~89% of vLLM), and NOT for the claimed reason. Lever 3
is a component, not the gap-closer.
**CLAIM C** ("the residual gap is software-closable LATENCY, not a GB10 hardware floor"):
**REFUTED as worded** (no latency, no idle to close - same data as A). The "not a hardware
floor" half is **UNSETTLED, not proven.** vLLM hits 327 ms on the same silicon, so it is
not an absolute hard floor - but whether the dominant 51.6% `gated_delta_net` term is
software-closable in BIT-EXACT form turns on one unmeasured quantity (below).
RESIDUAL UNCERTAINTY (the single open question that decides everything):
- **The DRAM byte-traffic ratio of llama's recurrence vs vLLM's.** Every section above
ESTIMATED the GDN state bytes (~190 GB/s effective, ~70% of 273 GB/s peak); none MEASURED
it. If llama's `gated_delta_net_cuda<128>` moves ~2x the minimal (s0-read + s1-write)
bytes because the un-fused gate/l2norm/writeback/gather ops re-stream state through HBM,
then the 51.6% is software-closable by a single-pass fused recurrence (Claim C spirit
HOLDS). If llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same,
the recurrence is at the GB10 LPDDR5x floor for this state size -> the gap is a
hardware/architecture floor and is NOT closable in bit-exact form (Claim C REFUTED on
both halves). This is the one measurement that converts the verdict from "refuted as
worded" to a definitive yes/no.
- **The MoE model (qwen35moe) is untested.** At B=128 MUL_MAT_ID can trip
[TAG_MUL_MAT_ID_CUDA_GRAPHS] (`ne[2] > mmvq_mmid_max`) and disable the WHOLE MoE-decode
graph into eager, where the ~3100 per-step launches re-dispatch serially on the Grace
cores and inter-op bubbles WOULD reappear. For MoE only, Claim A could partially hold.
The dense 335 tok/s headline is fully settled.
### 3. Ranked implementation plan for the remaining ~14% (57 ms/step, 384 -> 327)
Every win must come from kernel GPU-time (bytes), because bubbles = 0 and both engines
share identical bandwidth/compute floors. Ranked by expected recovery.
| # | Lever | ms/step recovered | -> % of vLLM | bit-exact | tractability | gate |
|---|---|---|---|---|---|---|
| **1** | **Single-pass fused GDN recurrence** (fold l2norm+gate+decay+recurrence+state-writeback+gather into ONE pass over state, mirroring vLLM `fused_recurrent_gated_delta_rule_packed_decode`) - cuts state HBM round-trips | **0 to ~40** (= the byte-delta; UNKNOWN until ncu) | 86% -> up to ~98% | near (l2norm reduction; KL < ~1e-3) | HIGH (kernel rewrite) | **ncu byte-ratio test FIRST** |
| 2 | **Conv-state concat -> ssm_conv fusion** (Lever 1): pass conv-state + new token as separate srcs, update conv state in place (vLLM `causal_conv1d_update`); removes concat_cont + the conv-state cpy | **~8-12** (concat 8.01 + cpy share of 7.62) | +2-3% | YES | MEDIUM | no-regret, build regardless |
| 3 | **Gate-chain fold** (Lever 3 as designed): sigmoid-beta + softplus+dt+ssm_a gate + q/k l2norm into the recurrence kernel | **~12.76 ceiling** (3.35%) - but SUBSUMED by #1 | +3% | near (l2norm) | MEDIUM | build as a COMPONENT of #1, not standalone |
| 4 | **bf16 recurrent + conv state** (Lever 5): halve the 196 ms recurrence + conv traffic; keep f32 in-register accumulation | **~70-90** (if floor-bound) | could reach/exceed parity | NO (parity-tolerance decision; must match vLLM stored dtype) | HIGH (rewrite + parity validation) | the ONLY lever that moves the floor kernel; separate precision track |
| 5 | gdn_gather skip-launch at steady decode | ~0.06 | ~0 | YES | trivial | not worth it (micro) |
| 6 | GDN occupancy split | 0 | 0 | - | - | NOT a lever: 196608 CTAs / 4096 waves, already saturated, bandwidth-bound |
| 7 | quantize_mmq attack (Lever 2) | 0 | 0 | - | - | SPENT - relocated mandatory work, proven flat |
| 8 | decode CUDA-graph capture | 0 | 0 | - | - | SPENT - ALREADY in effect (graphId=11), did not close gap |
| 9 | persistent cgraph (uid fast-path) | ~0.2 (0.05-0.4%) | ~0 | YES | MEDIUM | second-order to the SSM floor |
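Lever numbers in this paragraph and in section 4 are the rank `#` in the table above, not
the historical Lever-N names quoted in its second column. Levers 1, 3, and the gather of
#5 are the SAME kernel rewrite: build them together as a single-pass recurrence. Levers
6/7/8 are dead (at-floor or already-shipped). Lever 4 is a distinct,
bit-exactness-breaking precision track.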
### 4. The honest verdict and the single highest-value next step
**Is true (bit-exact) decode parity reachable?** UNCERTAIN, and it hinges entirely on the
unmeasured byte ratio:
- If llama's recurrence re-streams state (~2x bytes from un-fused ops): YES - a single-pass
fused recurrence (Lever 1) plus conv fusion (Lever 2) plausibly recover ~20-40 ms, taking
llama to ~345-365 ms = ~90-95% of vLLM, near-bit-exact (gate on KL tolerance).
- If llama is already at the GB10 bandwidth floor for f32 state: NO in bit-exact form - the
57 ms is a hardware floor, and only bf16 state (Lever 4, non-bit-exact) closes it.
Either way, the gating-fold-alone path tops out at ~89% of vLLM, so the project should NOT
ship the isolated gate fold as "the parity lever."
**SINGLE highest-value next IMPLEMENTATION step:** build the **single-pass fused GDN
recurrence kernel** (Lever 1 = fold gate + l2norm + state-writeback + gather into one pass
over the recurrent state) - BUT gate the build on one cheap measurement first, because it
is a HIGH-effort kernel rewrite that is worthless if the recurrence is already byte-minimal.
**The measurement that confirms it before over-investing (one short GPU run, gap-analysis
agent only):** `ncu` on `gated_delta_net_cuda<128>` at B=128 vs vLLM's
`fused_recurrent_gated_delta_rule_packed_decode_kernel` for identical layer dims, two
counters:
- `dram__bytes.sum` (actual DRAM bytes/call)
- `dram__throughput.avg.pct_of_peak_sustained_elapsed` (achieved % of 273 GB/s)
Decision rule:
- llama moves ~2x minimal bytes OR vLLM moves materially fewer for the same math -> redundant
un-fused state round-trips -> BUILD the single-pass fused recurrence; predicted recovery
scales with the byte delta (up to ~40 ms). This is the gap-closer.
- llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same -> the
recurrence is at the GB10 hardware floor -> do NOT build the fusion for throughput (only
the ~3% gate-fold ceiling remains); the sole remaining lever is bf16 state (Lever 4,
accept non-bit-exact), and bit-exact parity is NOT reachable.
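
The decision rule can be written down as a small function so the ncu result maps to a verdict mechanically. This is a sketch only: the notes say "~2x", "materially fewer" and "> 85% of peak" without exact cutoffs, so the `redundant_ratio` and `fewer_ratio` defaults below are hypothetical placeholders to be tuned, and `recurrence_verdict` is an illustrative name.

```python
def recurrence_verdict(llama_bytes, vllm_bytes, minimal_bytes, llama_pct_peak,
                       redundant_ratio=1.5, fewer_ratio=0.8, floor_pct=85.0):
    """Map measured DRAM bytes/call and %-of-peak to a build decision.

    redundant_ratio and fewer_ratio are assumed thresholds (hypothetical);
    floor_pct is the "> 85% of peak" figure from the notes.
    """
    redundant = llama_bytes >= redundant_ratio * minimal_bytes
    vllm_leaner = vllm_bytes <= fewer_ratio * llama_bytes
    if redundant or vllm_leaner:
        return "build fused recurrence"
    if llama_pct_peak > floor_pct:
        return "hardware floor"
    return "inconclusive"

# Minimal traffic = one state read + one state write (f32, 48 x 128 x 128 x 128).
minimal = 2 * 48 * 128 * 128 * 128 * 4

print(recurrence_verdict(2.0 * minimal, 1.0 * minimal, minimal, 70.0))
print(recurrence_verdict(1.05 * minimal, 1.0 * minimal, minimal, 88.0))
print(recurrence_verdict(1.05 * minimal, 1.0 * minimal, minimal, 60.0))
```

The third case (near-minimal bytes but well below peak) is not covered by the rule as written in the notes; the sketch reports it as inconclusive rather than guessing.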
**No-regret parallel work** (build regardless of the ncu outcome, bit-exact, medium effort):
the conv-state concat -> ssm_conv in-place fusion (Lever 2, ~8-12 ms = +2-3% toward parity),
which removes concat_cont (8.01 ms) and the conv-state writeback cpy off a bandwidth-bound,
single-stream step where their full GPU-time is wall-clock.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,185 +0,0 @@
# llama-server vs vLLM: decode-step gap decomposition (DGX Spark, GB10 / sm_121)
Profiling study (no engine changes). Question: in a matched apples-to-apples comparison
(both batched servers, NVFP4-class weights, prefix caching on, both eager), why is
`llama-server` ~4-6x slower **per decode step** than vLLM on Qwen3-32B at a
1024-token shared-prefix / batch-32 fan-out, and what is closable vs structural?
Hardware: NVIDIA GB10 (sm_121), unified LPDDR5X. Model: Qwen3-32B, 64 layers.
llama side: `~/llama-paged-dev/build-cuda/bin/llama-server`, `q3-32b-nvfp4-dense.gguf`
(NVFP4 weights, type-40 FP4-MMA path), `-ngl 99 --parallel 32 -c 40960 -fa on`,
`GGML_CUDA_DISABLE_GRAPHS=1` (eager). vLLM 0.23.0 NVFP4A16 (W4A16/Marlin),
`--enforce-eager`. Workload: 1024-token shared prefix + unique 32-token suffix,
K=32 concurrent, generate 64. All profiling scripts are dev-tree only
(`~/bench/decode_study/`); minimal in-code timers were not needed (server already
reports per-slot `eval time`, which excludes prompt-eval = pure decode).
## TL;DR
1. **The real-server decode is GPU-BOUND, not host-bound.** During steady decode
the GPU is **~94.6% utilized** (nvidia-smi, real run) / 85-95% busy (nsys).
Per-slot CPU sampling, detokenize, and `update_slots` are fully hidden: a 5-stage
sampler chain gives the *identical* step time as greedy (1346 vs 1343 ms). The
"GPU stalls on the CPU serving loop" hypothesis is **refuted** for this workload.
2. **At 1024 context the decode step is ~84% KV/attention, ~16% weight GEMM** - the
opposite of the thin-batch-GEMM story. Attention scaling with context length, not
the matmul, is the load-bearing cost.
3. **The worktree's paged KV engine is a decode REGRESSION: ~1.85x slower than
stock** at 1024 ctx (paged 1279-1343 ms/step vs stock 650-729 ms/step). It
gathers K/V/mask into a contiguous buffer (`ggml_get_rows`) every layer every
step, then runs a dense FA kernel - paying a full extra KV read+copy that vLLM's
in-kernel PagedAttention never pays. Paging helps prefix-prefill memory; it hurts
decode latency.
4. Even **stock** llama-server (~650-729 ms/step) is **~4-5x slower than vLLM**
(~120-185 ms/step). The residual gap is the **long-context decode-attention
kernel** and, secondarily, the **thin-batch FP4 weight GEMM** - both kernel-maturity
gaps vs vLLM's FlashInfer/FA paged-decode + Marlin, not serving-loop gaps.
## The measured numbers (batch 32, server-reported pure-decode step time)
`server_decode_step_ms` = max / mean-of-top-8 of per-slot `eval time ms-per-token`
(the most-contended, full-batch-32 slots; excludes prompt eval).
| config | decode step ms (max / top8) | client wall ms/step |
|------------------------------------------|-----------------------------|---------------------|
| paged, ctx 1024, greedy | 1343 / 1279 | 1468 |
| paged, ctx 1024, **heavy 5-sampler** | 1346 / 1280 | 1470 |
| **stock** (no paging), ctx 1024, greedy | **729 / 650** | 768 |
| paged, **ctx 64** (short), greedy | **215 / 215** | 253 |
| vLLM NVFP4A16, ctx 1024 (K=32) | **~120-185** (270 tok/s) | - |
The brief's reference ~828 ms/step sits between the stock (650-729) and paged
(1279-1343) numbers measured here; the decomposition below is what is robust. Our
fan-out shares no prefix across the 32 slots (each slot independently prefills 1056
tokens - confirmed in the log), so the 32 sequences are genuinely concurrent and the
"max" slot is maximally contended, which is why our paged max runs above 828.
### Context sweep - decode step is attention-scaling, not fixed overhead
Pure-decode step vs shared-prefix length (paged, batch 32):
| prefix ctx | decode step ms |
|-----------|----------------|
| 64 | 215 |
| 128 | ~290 |
| 256 | ~410 |
| 512 | ~660 |
| 1024 | ~1280 |
Roughly linear in context length: ~1 ms of added step time per added context token.
The **215 ms at ctx 64 is the fixed floor** (weight GEMM + activations + norm/rope +
loop + sampling, attention negligible). Everything above it scales with KV length =
attention + KV plumbing. At 1024 ctx the fixed floor is only ~16% of the step.
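
The "~1 ms per added context token" slope and the ~16% floor share can be checked directly from the sweep table. A minimal sketch using a plain least-squares fit; the table values are approximate, so the fit is indicative only, and `least_squares` is a local helper rather than a library call.

```python
# (prefix ctx, decode step ms) from the context sweep table (approximate values).
sweep = [(64, 215.0), (128, 290.0), (256, 410.0), (512, 660.0), (1024, 1280.0)]

def least_squares(points):
    """Ordinary least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = least_squares(sweep)

# Share of the ctx-1024 step that sits above the ctx-64 floor.
floor_ms = sweep[0][1]
step_1024 = sweep[-1][1]
context_share = (step_1024 - floor_ms) / step_1024

print(f"slope: {slope:.2f} ms per context token")
print(f"context-scaling share at ctx 1024: {context_share:.0%}")
```

The fitted intercept is an extrapolation to zero context and should not be read as the floor; the measured ctx-64 step is the better floor estimate.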
## Where the ~1280 ms paged decode step goes (nsys, pure-decode window)
`nsys profile --delay=70 --duration=25 --trace=cuda` windowed onto steady 32-way
decode (`srv_decode2.nsys-rep`; an earlier 25-60s window was discarded because nsys's
own slowdown stretched the 32 prefills into it, inflating GEMM to a misleading 58%).
GPU busy in-window 85.5% (nsys adds gaps; the real run is ~94.6% by nvidia-smi).
| bucket | % GPU time | abs (of ~1280 ms) | what it is |
|--------------------------------|-----------:|------------------:|------------|
| `flash_attn_ext_f16` ATTENTION | **47.7%** | ~610 ms | decode attention over the 1056-cell KV |
| `cpy_scalar` KV copy/cast | 18.3% | ~234 ms | KV write + f32->f16 casts |
| `get_rows/set_rows` KV gather | 17.8% | ~228 ms | **paged** gather of K/V/mask to contiguous |
| `mul_mat_q` + `quantize_mmq` | 15.7% | ~201 ms | NVFP4 weight GEMM (+ activation requant) |
| rmsnorm / silu / rope / add | ~0.6% | ~8 ms | elementwise |
Cross-check: the GEMM bucket (~201 ms) matches the ctx-64 floor (215 ms) - i.e. the
weight matmul is ~the entire short-context step, and is context-independent, as
expected. KV/attention buckets (47.7+18.3+17.8 = **83.8%**) match the context-sweep
finding that ~84% of the step scales with context.
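
The cross-check is a percentage-to-milliseconds conversion plus a sum over the KV/attention buckets. A small sketch of that arithmetic, assuming the ~1280 ms step as the base; the dictionary keys are shortened labels for the table rows.

```python
STEP_MS = 1280.0  # paged decode step at ctx 1024

# % of GPU time per bucket, from the nsys table.
buckets = {
    "flash_attn_ext_f16": 47.7,
    "cpy_scalar": 18.3,
    "get_rows/set_rows": 17.8,
    "mul_mat_q+quantize_mmq": 15.7,
    "elementwise": 0.6,
}

bucket_ms = {name: STEP_MS * pct / 100.0 for name, pct in buckets.items()}

# Buckets that scale with KV length (attention + KV plumbing).
kv_names = ["flash_attn_ext_f16", "cpy_scalar", "get_rows/set_rows"]
kv_pct = sum(buckets[name] for name in kv_names)

for name, ms in bucket_ms.items():
    print(f"{name:26s} {ms:7.1f} ms")
print(f"KV/attention share: {kv_pct:.1f}%")
```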
Power signature: ~33-36 W at 94% "utilization" (GB10 can pull far more). High util%
+ low power = the kernels are **memory/latency-bound, not compute-saturated** - the
classic decode signature (stream 19 GB of NVFP4 weights + a growing KV every step).
### Stock vs paged decomposition
- **Stock** (~650 ms): ~215 ms GEMM floor + ~435 ms attention/KV (contiguous KV read
directly by the FA kernel, **no gather**).
- **Paged** (~1280 ms): same ~215 ms floor + ~610 ms attention + **~455 ms paged
gather/copy overhead** (the `get_rows` of K/V/mask plus the extra KV copy that
feeds the dense FA kernel). That ~455 ms (~36% of the step) is the paged engine's
self-inflicted cost and is the entire ~1.85x stock->paged regression.
## vLLM decode architecture mapped onto each llama bucket
vLLM at ~120-185 ms/step is faster on **every** bucket:
| llama bucket (paged) | ms | vLLM equivalent | does vLLM avoid it? |
|-----------------------------|-------|-----------------|---------------------|
| paged KV gather (get_rows) | ~228 | PagedAttention reads blocks **in-kernel** via a block table | **Yes - entirely.** No gather op exists. |
| KV copy/cast | ~234 | KV written once into block pool; FA reads it in place | Mostly - no per-step recopy |
| decode attention | ~610 | FlashInfer / FA paged-decode GQA kernel, split over KV | Same op, far faster kernel on sm_121 |
| weight GEMM + act quant | ~201 | fused Marlin/Machete W4A16 dequant+MMA, no separate quant pass | Faster + removes the requant kernel |
| CPU sampling / loop | ~0 (hidden) | on-GPU batched sampling | N/A here - already hidden on llama side too |
vLLM's whole-step (~150 ms) is **less than llama's GEMM floor alone (~215 ms)**, so
vLLM is ahead on the matmul *and* the attention *and* avoids the gather. The gap is a
stack of kernel-efficiency wins, not one silver bullet.
## Ranked levers - closable vs structural
1. **Remove the paged gather regression. [Tractable, ~455 ms / ~36% on the paged
path; net-zero risk - it is a regression]** The worktree's paged engine makes
decode 1.85x slower than stock by gathering K/V/mask to contiguous every layer
every step (patch 0003 `ggml_get_rows`). For latency-bound decode, **do not enable
paged KV** - it only ever helps prefix-prefill *memory*, never decode latency.
Fully recovering this *and* keeping paging requires reading paged blocks
in-kernel like vLLM (a from-scratch paged-attention CUDA kernel) - see lever 2.
2. **Long-context decode-attention kernel. [Biggest real lever, ~435 ms of stock /
~610 ms of paged; partly structural]** Even stock is attention-bound at 1024 ctx.
llama.cpp's `flash_attn_ext_f16` decode path is ~4-5x slower than vLLM's
FlashInfer/FA paged-decode GQA kernel on this Blackwell-class part. This is the
cost that *grows with context* - exactly the regime the brief targets. Tractable in
principle (a proper flash-decoding / split-K-over-KV kernel, and a true in-kernel
paged read that also kills lever 1's gather), but it is deep CUDA work on a new
arch and partly gated by kernel maturity on sm_121. **Highest-impact, hardest.**
3. **Thin-batch FP4 weight GEMM floor. [Tractable, ~201-215 ms / 15-30%; bounded]**
The NVFP4 `mul_mat_q` + separate `quantize_mmq` activation pass is memory-bound and
less efficient than vLLM's fused Marlin/Machete W4A16. Fusing dequant into the MMA
and folding the activation quant into the GEMM is tractable kernel work. Bounded
impact: the floor cannot drop below weight-read-bound (~19 GB / LPDDR5X memory BW per step).
4. **Host serving loop / per-slot sampling. [NOT a lever]** Measured zero: greedy ==
heavy-sampler step time; GPU 94.6% busy. On-GPU/batched sampling buys nothing until
the kernels (levers 1-3) get fast enough to expose host overhead. Refutes the
"host-bound serving loop" hypothesis for this decode-bound workload.
5. **Continuous-batch scheduler. [NOT the gap / structural elsewhere]** llama-server
already fuses all 32 slots into one decode step (one set of kernels per step over
batch 32 - confirmed in the trace). vLLM's continuous/chunked-prefill batching wins
on *mixed* prefill+decode overlap, but the steady decode-step gap measured here is
kernel-bound, not scheduler-bound.
## Honest bottom line
The ~4-6x per-step gap is **GPU-kernel-bound**, and it decomposes as:
- ~36% of the *paged* step is a **self-inflicted gather regression** - remove it
(don't run paged for decode-latency workloads).
- The remaining ~4-5x vs vLLM (true even for stock) is **kernel efficiency**:
llama.cpp's long-context decode-attention and thin-batch FP4 GEMM are slower than
vLLM's PagedAttention + Marlin on GB10. That is a **kernel project** (in-kernel
paged attention + flash-decoding + fused W4A16 GEMM), not a serving-loop project.
- Sampling, detokenize, `update_slots`, and the continuous-batch scheduler are **not**
the gap; the GPU is ~95% busy on memory-bound kernels the whole step.
What is closable: lever 1 (immediately, by not paging), lever 3 (bounded, with kernel
work). What is structural / hard: lever 2 (the decode-attention kernel + a real
in-kernel paged read), which is where the context-scaling gap actually lives and where
any serious effort to approach vLLM on GB10 must go.
## Reproduction (dev-tree only, `~/bench/decode_study/`)
- `launch_srv.sh` / `runcfg.sh` - launch llama-server (paged on/off) and a config.
- `client.py` - K=32 token-id fan-out (1024 prefix + 32 suffix), `SAMP=greedy|heavy`.
- `d2drv.sh` - nsys pure-decode window (delay 70s past prefill) -> `srv_decode2.nsys-rep`.
- `cat2.py` - kernel-time categorization from the sqlite export.
- vLLM side: `~/bench/run_vllm.sh` + `vllm_prefix.py` (K=32, ~270 tok/s).

@@ -1,756 +0,0 @@
# Decode parity exploration (post-SSM-fix) - per-agent findings
Post-SSM-fix decode (patches 0018 in-place state write-back + 0019 fused gather):
dense q36-27b-nvfp4 decode_agg = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact.
The remaining +54% to parity is the question each section below probes. All numbers
DGX GB10 (sm_121), fusion OFF baseline, `decode_agg` = `S_TG t/s`.
---
## Section: per-token-latency (critical path / host-loop) - READ-ONLY
**Verdict: the per-step critical path and host loop are NOT the residual lever.
Post-SSM the GPU is still ~99% busy at npl128; the entire exposed-idle budget is
~0.65% of the step (~2.4 ms), of which graphs already remove the within-step half
(0.34%) and the between-step host gap is ~2 ms/step (~0.4% post-SSM). The 64-layer
sequential chain does NOT under-fill the GPU at batch 128 - every kernel's grid
saturates the SMs on its own. The +54% to parity is GPU kernel work (FP4 GEMM
efficiency + LPDDR5x weight bandwidth), not serialization or host overhead.**
### 1. Measured exposed-idle structure (a2_nsys pre-SSM rep, read-only sqlite sweep)
`paged_off_npl128.sqlite`, steady window 40-97% of trace (14.78 s, ~16.5 decode
steps at the pre-SSM ~896 ms/step). Overlap-correct interval-union sweep:
| activity set | busy % | exposed idle |
|-------------------------|---------|--------------|
| kernels only | 80.25% | 19.74% |
| kernels + memcpy (all) | 99.35% | **0.65%** |
- The 19.4% kernels-only gap = **841 big gaps (median 3.35 ms, ~51/step)** that are
filled by D2D memcpy. These ARE the per-layer gated-DeltaNet recurrent-state copies
(the `gated_delta_net -> ggml_cpy(state->cache) -> next layer reads state` chain).
They were a real critical-path serialization, and **patches 0018/0019 removed exactly
these** (D2D bucket 18.9% -> 0.23%; get_rows gather 18.8% -> 0.7%). Decode rose
+37.8% (186 -> 256 t/s), ~matching the work removed -> the kernels reflowed
back-to-back, so post-SSM these big gaps are CLOSED, not re-exposed (inferred from
the throughput scaling; the post-SSM nsys was not re-profiled by this read-only agent).
- The TRUE exposed idle (kernel+memcpy union) is **0.65%**: 18 host gaps >=0.5 ms,
**median 2.06 ms, max 2.85 ms, ~1.1/step**. This is the single between-step host gap
(sample-128 + `update_slots` + next-batch build) that does NOT overlap GPU compute.
- Within-step launch gaps: 24,190 micro-gaps, median 2.14 us, summing to 50.6 ms =
**0.34%** of the window - the pure launch overhead that CUDA graphs collapse
(measured 0.37% -> 0.11% in A2_CUDAGRAPH_DECODE; graphs already engage on the
default paged decode with a 256-token reset cadence).
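
The "overlap-correct interval-union sweep" merges overlapping activity intervals before summing, so a memcpy that overlaps a kernel is not double-counted. A minimal sketch on synthetic intervals; `busy_fraction` and the tuple layout are illustrative assumptions, not the script used for the numbers above.

```python
def busy_fraction(intervals, window_start, window_end):
    """Fraction of [window_start, window_end) covered by the union of intervals.

    intervals: iterable of (start, end) pairs; overlaps are merged, not summed.
    """
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in intervals
        if e > window_start and s < window_end
    )
    busy = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Disjoint from the running interval: close it and start a new one.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or touching: extend the running interval.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / (window_end - window_start)

# Synthetic example: a memcpy fills the gap between two kernels.
kernels = [(0, 40), (60, 100)]
memcpys = [(35, 62)]
kernels_only = busy_fraction(kernels, 0, 100)
with_memcpy = busy_fraction(kernels + memcpys, 0, 100)
print(kernels_only, with_memcpy)
```

This mirrors the table: a large kernels-only gap that disappears once memcpy activity is included in the union.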
### 2. Post-SSM scaling of the FIXED host gap
The ~2 ms/step between-step host gap is FIXED work (independent of GPU kernel time).
As decode accelerated it grew only as a fraction of a shrinking step:
| build | step ms @npl128 | host gap | host gap % of step |
|---------------|-----------------|----------|--------------------|
| pre-SSM (146) | ~877 | ~2 ms | 0.24% |
| post-SSM (256)| ~499 | ~2 ms | **~0.40%** |
| vLLM (391) | ~328 | (n/a) | (would be ~0.6%) |
Even fully removing it (perfect overlap) buys ~0.4%. It is a second-order floor, not
the lever - it only becomes material once the kernels are fast enough to drop GPU-busy
below the host time, which is not the case at 65% of parity.
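
The table's last column follows from a fixed ~2 ms gap divided by a step time derived from throughput. A short sketch of that relationship, assuming step time = batch / tok/s; `host_gap_pct` is a hypothetical helper.

```python
BATCH = 128        # sequences per decode step (npl128)
HOST_GAP_MS = 2.0  # fixed between-step host gap (approximate, from the notes)

def host_gap_pct(tok_s):
    """Return (step_ms, host gap as % of step) for an aggregate decode rate."""
    step_ms = BATCH / tok_s * 1000.0
    return step_ms, 100.0 * HOST_GAP_MS / step_ms

for label, tok_s in (("pre-SSM", 146.0), ("post-SSM", 256.0), ("vLLM", 391.0)):
    step_ms, pct = host_gap_pct(tok_s)
    print(f"{label:9s} step {step_ms:6.1f} ms  host gap {pct:.2f}%")
```

The fraction grows only because the denominator shrinks; the gap itself stays constant, which is why it remains a second-order term at this throughput.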
### 3. The 64-layer chain does NOT under-fill the GPU at batch 128
The decode is an intrinsically sequential depth-64 chain (autoregressive: layer N
needs layer N-1; cannot be parallelized across layers). The question is whether each
individual kernel fills the SMs at batch 128. It does:
- **GDN kernel** (`gated_delta_net.cu`): launch grid `dim3(H, n_seqs, ceil(S_v/4))`
= `48 x 128 x 32 = 196,608 blocks` (dense, H=48 value heads, S_v=128). Block
`(warp_size, 4, 1)`. Massively oversubscribes the GB10 SMs. Each warp loads its
state shard into registers once and runs a single `n_tokens==1` iteration - O(1) in
context (confirmed flat across 4x ctx in GDN_DECODE_VERIFY).
- **FP4 GEMM** (`mul_mat_q`, mmq_x=128): M=128 token tile, well into the M-batched
regime, full SM occupancy (and Track B P2a already showed it goes 2 CTA/SM).
- The 99.35% kernel+memcpy busy reading IS the direct proof there is no under-fill at
npl128: if the chain under-filled, busy% would be well below 99%.
Under-fill only appears at LOW batch (npl32/npl4), where it manifests as the
weight-bandwidth/GEMV regime (npl32 = 170 t/s vs npl128 = 256): fewer tokens amortize
the same per-step weight read, NOT idle SMs. That is a bandwidth floor, not a
host/scheduler problem.
### 4. What the host actually does per step (eager rep runtime API)
Steady-window `CUPTI_ACTIVITY_KIND_RUNTIME` totals (host-thread wall, overlaps GPU):
| API | n | total | avg |
|---------------------------|-------|---------|---------|
| cudaStreamSynchronize | 1723 | 7775 ms | 4513 us |
| cudaLaunchKernelExC | 30983 | 4045 ms | 131 us |
| cudaLaunchKernel | 20385 | 2694 ms | 132 us |
| cudaMemcpyAsync | 2085 | 96 ms | 46 us |
~104 stream-syncs/step and ~3100 kernel launches/step in eager mode (collapsed by
graphs to ~900 launches/step). The 7.8 s of sync is the host BLOCKING on the busy
GPU (it overlaps GPU compute, it is NOT exposed idle) - the GPU stays 99.4% busy. The
sampled-token path is `cudaMemcpyAsync` (96 ms total, negligible, non-blocking). The
only NON-overlapped residue is the ~2 ms/step between-step gap in section 1.
### 5. vLLM host-loop comparison (per VLLM_DECODE_GROUNDING.md)
vLLM's eager decode is host-cheap BY CONSTRUCTION and hides the host fully behind the
async CUDA stream WITHOUT pipelined scheduling (`async_scheduling` was OFF; it won the
2.4x with synchronous scheduling): persistent pre-allocated input buffers updated by
vectorized numpy (no per-token Python), attention metadata `build()` once per step
reused across all layers, no GPU->CPU sync in the hot path, sampled-token D2H
non-blocking + event-gated, and a fixed small launch sequence (~2 ops/Linear). The
next-step host prep overlaps the current-step GPU compute on the async stream. The key
asymmetry vs llama: vLLM builds its graph ONCE and reuses persistent device
KV/block metadata; ggml rebuilds/reallocates the cgraph each decode step (new
`cgraph->uid`) and re-dispatches ~3100 launches from the loop on the weak Grace cores.
But this asymmetry is hidden under GPU compute on BOTH sides at npl128: llama's host
loop is a 0.4% exposed gap, not a 2x lever. vLLM's host cheapness is why ITS step is
328 ms host-free, but llama's 499 ms is also ~99% GPU - the 171 ms difference is GPU
kernel time (FP4 GEMM), not host.
### 6. Is any host/serialization lever CUDA-graph or scheduler addressable?
- **Within-step launch idle (0.34%)**: CUDA-graph addressable, ALREADY captured by
default (0.37 -> 0.11%). Worth ~0% of decode_agg (measured +0.1-0.8%, noise).
Nothing left to win here.
- **Between-step host gap (~2 ms, ~0.4%)**: NOT removed by a graph (the graph replays
the forward; the host still samples + runs `update_slots` + rebuilds the batch
between replays). It is SCHEDULER addressable - overlap step N+1's host prep with
step N's GPU compute, mirroring vLLM's persistent-buffer + build-once-reuse +
non-blocking-D2H pattern (and ideally reuse the ggml cgraph across steps instead of
rebuilding it every ubatch). But the ceiling is ~0.4% of the step, so it is a
cleanup, not a parity lever.
- **The +54% to parity is none of the above.** It is GPU kernel work: post-SSM the FP4
GEMM family is ~48% of decode (the dominant residual), GDN recurrence ~22.5%, and the
decode is weight-bandwidth/latency-bound on LPDDR5x (Track B P2a: a -24.7% FP4-GEMM
kernel left decode_agg FLAT, the freed compute became idle gaps -> decode is not
GEMM-compute-bound but bandwidth/latency-bound). The lever lives in cutting DRAM
traffic per step (fused act-quant to drop the separate `quantize_mmq` pass, native
FP4-MMA, and/or NVFP4-dense weight quant), NOT in the host loop or CUDA graphs.
### Evidence
- Read-only sqlite sweeps on `~/bench/a2_nsys/paged_off_npl128.sqlite` (this agent).
- `gated_delta_net.cu` launch grid (DGX `~/llama-paged-dev`).
- A2_CUDAGRAPH_DECODE.md, SSM_DECODE_FIX_RESULTS.md, GDN_DECODE_VERIFY.md,
VLLM_DECODE_GROUNDING.md, THROUGHPUT_B_P2a_POSTSSM_RESULTS.md.
# Decode-Parity Exploration
## Section: gdn-source-compare (llama gated_delta_net.cu vs vLLM fused_recurrent_gated_delta_rule)
### Model config (Qwen3.5-27B dense, from vLLM config.json)
- linear_key_head_dim K = 128, linear_value_head_dim V = 128
- linear_num_key_heads = 16, linear_num_value_heads = 48 (GVA 3:1), conv_kernel = 4
- 64 layers, full_attention_interval 4 -> 48 linear (GDN) : 16 full-attn
- Recurrent state per (seq, v-head) = V*K = 128*128 = 16384 f32 = 64 KiB.
Per layer per seq = 48 * 64 KiB = 3 MiB. Both engines store state in f32.
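
A minimal sketch of the state-size arithmetic, using only the config values listed above. The variable names are illustrative.

```python
# Config values quoted in the section above.
K = 128                 # linear_key_head_dim
V = 128                 # linear_value_head_dim
N_VALUE_HEADS = 48      # linear_num_value_heads
N_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4
F32_BYTES = 4

KIB = 1024
MIB = 1024 * 1024

# One recurrent state matrix per (sequence, value head).
state_elems_per_head = V * K
state_bytes_per_head = state_elems_per_head * F32_BYTES

# All value heads of one GDN layer, for one sequence.
state_bytes_per_layer_seq = N_VALUE_HEADS * state_bytes_per_head

# Every 4th layer is full attention; the rest are linear (GDN) layers.
n_full_attn_layers = N_LAYERS // FULL_ATTENTION_INTERVAL
n_gdn_layers = N_LAYERS - n_full_attn_layers

print(state_bytes_per_head // KIB, "KiB per (seq, v-head)")
print(state_bytes_per_layer_seq // MIB, "MiB per layer per seq")
print(n_gdn_layers, "GDN layers,", n_full_attn_layers, "full-attn layers")
```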
### Which kernels run at decode
- llama: ggml_gated_delta_net_inplace_ids -> gated_delta_net_cuda<S_v=128, KDA=false, keep_rs_t=false>.
Gate is SCALAR per head (graph reshapes gate/beta to ne[0]=1), so the cheaper !KDA branch runs (one expf per token, not per-channel).
- vLLM: enable_packed_recurrent_decode -> fused_recurrent_gated_delta_rule_packed_decode_kernel
(the dedicated single-token decode kernel, NOT the generic varlen fwd kernel).
### The state HBM traffic is IDENTICAL - it is NOT the lever
Per (seq, v-head) per decode token both engines read 64 KiB state + write 64 KiB state, f32, coalesced.
The dominant memory term is equal. llama is NOT moving more state bytes than vLLM.
=> The 1.46 ms/call is llama achieving LOWER effective bandwidth on the SAME bytes,
plus extra non-state work, NOT a fundamental HBM-traffic deficit. Hence closable.
### Algorithmic / parallelization delta (the real differences)
1) Reduction strategy (biggest structural difference)
- llama: WARP-PER-OUTPUT-COLUMN. State stored transposed M[col][i]=S[i][col]. Each warp owns
one V-column; the contraction over the 128 K-rows is a cross-lane warp_reduce_sum.
TWO warp_reduce_sum per token (one for kv = S^T@k, one for attn = S^T@q) = ~10 shuffle
rounds on the critical path, with n_tokens=1 they are NOT amortized.
- vLLM: THREAD-PER-OUTPUT-ROW. b_h is a [BV,BK]=[32,128] tile; each thread owns a FULL K-row
of state. sum(b_h*b_k, axis=K) and sum(b_h*b_q) are THREAD-LOCAL 128-wide reductions -
ZERO cross-thread shuffles. Outer-product update b_h += b_v*b_k is also thread-local.
Same FLOPs, but vLLM has no shuffle-reduction latency in the recurrence.
2) Occupancy / launch geometry (likely the dominant bandwidth gap)
- llama: block = (32 lanes, 4 warps) = 128 threads; grid = (H=48, n_seqs, ceil(128/4)=32).
Per (head,seq) it launches 32 blocks * 128 threads = 4096 threads to touch a 16384-elem state
(only 4 state elems/thread). launch_bounds(128, 2) budgets registers for >=2 blocks/SM; with
s_shard[4]+k_reg[4]+q_reg[4]+addressing the register pressure caps it near ~2 blocks = 8 warps/SM
(~12-16% occupancy on GB10). A memory-bound kernel at ~8 warps/SM cannot generate enough in-flight
loads to saturate 273 GB/s -> low achieved bandwidth on the state read/write.
- vLLM: 1 warp/program (num_warps=1), grid (NV=4, B*HV), small register footprint, num_stages=3
software-pipelines (prefetches) the state load. Far higher memory-level parallelism per SM.
3) Redundant non-state traffic in llama
- q,k re-loaded by EVERY column-warp: 128 column-warps/head each reload the same 128-float q and k
=> ~128x amplified L2 loads of q/k per head/token (vLLM reloads ~4x, once per NV program).
Small (L2-resident) but adds load-issue + L2 pressure competing with the state stream.
- Output store: llama writes attn_data[col] from lane 0 only (31/32 lanes idle), scattered
single-float stores; vLLM stores a contiguous BV=32 vector (coalesced).
4) Fusion delta (per-layer kernel-launch / HBM round-trip count)
- vLLM packed_decode FUSES into ONE kernel: q/k l2norm + q*scale + softplus(a+dt_bias) +
(-exp(A_log)) gate + sigmoid(beta) + the recurrence + state write-back.
- llama computes these as SEPARATE ggml ops/kernels in the graph before the GDN op:
ggml_l2_norm(q), ggml_l2_norm(k), ggml_add(+dt), ggml_softplus, ggml_mul(gate),
ggml_sigmoid(beta) (+ conv/silu), each a launch + small HBM round-trip. Plus a separate
gdn_gather_nonident_kernel launch per layer (a no-op at steady-state decode: every block
early-returns on the identity check, but still a grid launch of n_seqs blocks).
Across 48 linear layers this is ~6-10 extra small kernels/layer (~300-480 extra launches/token).
Whether this dominates depends on CUDA-graph capture (see A2_CUDAGRAPH_DECODE.md); if captured,
launch latency is hidden and the cost reverts to the per-op HBM round-trips + dependency gaps.
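
The reduction-strategy difference in item 1 can be illustrated with a small host-side simulation. This is a sketch, not either engine's kernel: the lane-to-row assignment (lane `i` owns rows `i, i+32, i+64, i+96`) and the function names are assumptions made for illustration. It shows that a warp-wide butterfly sum over 32 lanes needs 5 shuffle rounds, so two such reductions per token give the ~10 rounds quoted above, while a thread that owns a full K-row needs none.

```python
WARP_SIZE = 32
K_ROWS = 128

def warp_reduce_sum(lane_values):
    """Simulate a butterfly (xor-shuffle) sum across one warp.

    Returns (total, number of shuffle rounds)."""
    vals = list(lane_values)
    assert len(vals) == WARP_SIZE
    rounds = 0
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane adds the value held by its xor-partner lane.
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(WARP_SIZE)]
        rounds += 1
        offset //= 2
    return vals[0], rounds

def column_dot_warp(state_col, k):
    """Warp-per-output-column: each lane sums its rows, then lanes are reduced."""
    partial = [
        sum(state_col[r] * k[r] for r in range(lane, K_ROWS, WARP_SIZE))
        for lane in range(WARP_SIZE)
    ]
    return warp_reduce_sum(partial)

def row_dot_thread(state_row, k):
    """Thread-per-output-row: the whole contraction is thread-local."""
    return sum(s * x for s, x in zip(state_row, k)), 0

state = list(range(K_ROWS))   # integer test data keeps the sums exact
k = [2] * K_ROWS

warp_val, warp_rounds = column_dot_warp(state, k)
thread_val, thread_rounds = row_dot_thread(state, k)
print(warp_val, warp_rounds, thread_val, thread_rounds)
```

Both paths compute the same dot product; only the number of cross-lane steps on the critical path differs.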
### What a faster llama GDN decode kernel would need (optimization scope)
- A. Re-parallelize like vLLM: thread/lane owns a full K-row (or K-shard) so the kv and attn
contractions become register-local FMAs, eliminating the two warp_reduce_sum per token.
- B. Raise occupancy for the memory-bound regime: drop/raise the launch_bounds minBlocks hint
(the `,2)` is too low), shrink the block, cut registers, and add a software-prefetch of the next
state shard so more state loads are in flight per SM. This directly lifts achieved bandwidth on
the equal state bytes - the single highest-leverage change.
- C. Load q,k ONCE per (head,seq) into shared memory instead of 128x per-column reload; coalesce
the output store across the warp.
- D. Fuse the gate/l2norm/scale (softplus, exp(A_log), sigmoid, l2norm) INTO the recurrence kernel,
reading raw a/b/A_log/dt_bias from registers, removing ~6 elementwise passes + their HBM round-trips
per layer (matches vLLM's packed_decode). Drop the gather no-op kernel at steady-state decode
(or fold the identity check into the recurrence prologue, which it already partly does).
- E. (Longer term) bf16 state would HALVE the dominant traffic, but vLLM keeps f32 too, so this is a
divergence-from-reference not a parity lever.
### Bottom line
llama's GDN decode kernel is NOT moving more state HBM bytes than vLLM (the dominant term is equal),
so the 1.46 ms/call is an EFFICIENCY gap, not a traffic floor: (1) cross-lane (intra-warp) shuffle reductions on
the n_tokens=1 critical path, (2) low occupancy (~8 warps/SM from launch_bounds + register pressure)
starving memory-level parallelism so the equal state bytes move at lower effective bandwidth, plus
(3) 128x redundant q/k L2 loads and (4) ~6-10 unfused gate/norm elementwise kernels per layer that
vLLM folds into one packed-decode kernel. Highest-leverage fixes: raise occupancy + prefetch (B) and
row-local reductions (A); secondary: gate/norm fusion (D) and q/k shared-mem reuse (C).
---
## Section: validate-findings (adversarial re-derivation from raw DGX data) - READ-ONLY
Re-queried `CUPTI_ACTIVITY_KIND_KERNEL` + `CUPTI_ACTIVITY_KIND_MEMCPY` directly (kernel and
memcpy summed separately so D2D is never lumped into compute), not from summary text.
### CLAIM 1 - decode decomposition
PRE-FIX (`a2_nsys/paged_off_npl128.sqlite`, last 17s) vs `decode_decomp.txt`, match <=0.1pp:
gated_delta_net 23.40% (doc 23.43), k_get_rows 21.99% (21.88), MEMCPY-DtoD 18.89% / 382 GB /
1583 ops (18.90 / 356 GB / 1584), mul_mat_vec_q 15.53% (15.51), mul_mat_q 10.48% (10.37).
=> CONFIRMED exactly. gated_delta_net = largest single non-GEMM kernel; FP4-GEMM group ~28%;
full attention 0.37%.
D2D collapse: only on-box post-fix decomp is `ssm_decomp/after.sqlite`; MEMCPY-DtoD there =
526 ops / 0.9 ms / 0.05 GB = 0.008% of busy (from 382 GB / 18.89%). => CONFIRMED, stronger than
the doc's "0.23%" (382 GB state copy-back gone; exact "0.23%/2.93GB/734ops" not reproducible -
my DtoD 0.05 GB, the 2.16 GB is DtoH).
FLAG (refutes part of the Step-2 decomp): `after.sqlite` is a Step-1 build (patch 0018 only),
NOT Step-2. It still shows k_get_rows_float 28.44% (gated_delta_net 28.96%, FP4-GEMM group ~33%),
no `gdn_gather_nonident` kernel, profiled S_TG=164 (~Step-1 180, not Step-2 256); mtime 00:31
predates the 08:48 rebuild that carried patch 0019. The Step-2 split in `SSM_DECODE_FIX_RESULTS`
("get_rows 18.8%->0.7%, FP4-GEMM ->48%, GDN 22.5%") has NO surviving sqlite, and the script meant
to produce it (`ssm_decomp.sh`) CRASHED (Python SyntaxError, see `ssm_decomp_after.out`). So
"FP4-GEMM ~48%" is UNVERIFIED against raw Step-2 data: measured ~33% on Step-1; removing the 28%
get_rows bucket lifts it to ~46% arithmetically, so ~48% is plausible but not directly measured.
Section 1 above and SSM_DECODE_FIX_RESULTS both inherit this unverified Step-2 split.
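
The "~46% arithmetically" step is a renormalization: when a bucket disappears from the profile, every remaining share is divided by what is left. A minimal sketch with the Step-1 figures quoted above (the helper name is illustrative):

```python
def share_after_removal(share_pct: float, removed_pct: float) -> float:
    """Share of a component once another bucket is removed from the total."""
    return 100.0 * share_pct / (100.0 - removed_pct)

fp4_gemm_step1 = 33.0     # measured FP4-GEMM group share on the Step-1 trace
get_rows_step1 = 28.44    # k_get_rows_float share on the same trace

projected = share_after_removal(fp4_gemm_step1, get_rows_step1)
print(round(projected, 1))
```

This only shows the ~48% figure is plausible; it does not replace a direct Step-2 measurement.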
### CLAIM 2 - 146 -> ~257 ("+66%")
146.23 baseline CONFIRMED (`ssm_decode_baseline.out`); final 256.57 / 252.50 / 254.02 across
SSM_DECODE_FIX_RESULTS + THROUGHPUT_B_P2a, within ~1.6%. Magnitude CONFIRMED. TRAP: 146->257 is
+76% (146->254 = +74%), NOT +66%. "66%" is the % of vLLM (257/391 = 65.7%), not the speedup.
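
The trap is a mix-up between two different ratios. A short sketch with the numbers from this claim (function names are illustrative):

```python
def speedup_pct(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0

def share_of_reference_pct(value: float, reference: float) -> float:
    """`value` expressed as a percentage of `reference`."""
    return value / reference * 100.0

baseline = 146.23   # pre-fix decode t/s
vllm_ref = 391.0    # vLLM enforce_eager reference t/s

print(round(speedup_pct(baseline, 257.0)))                # gain to ~257 t/s
print(round(speedup_pct(baseline, 254.0)))                # gain to 254 t/s
print(round(share_of_reference_pct(257.0, vllm_ref), 1))  # percent of vLLM
```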
### CLAIM 3 - P2a GEMM-remap FLAT on decode
THROUGHPUT_B_P2a: dense npl128 252.50->254.02 (+0.6% noise), npl32 -0.4%, MoE flat; FP4 GEMM
kernel itself -24.7%, PREFILL +12.7%. Pre-SSM corroborated by THROUGHPUT_B_P1. => CONFIRMED.
### CLAIM 4 - 65% of vLLM (254 vs 391)
254/391 = 64.96%, 256.57/391 = 65.6%; vLLM 391 = enforce_eager apples ref. => CONFIRMED.
### Traps checked
GGML_CUDA_DISABLE_GRAPHS set `=1` explicitly (not the empty-value trap); graphs ON/OFF within
noise. memcpy-in-compute lumping AVOIDED (separate table sums). Decomp reps are ntg24-under-nsys
(S_TG 149/164) - valid for SHARES only; throughput correctly from unprofiled ntg128 logs.
### Net verdict
Claim 1: pre-fix decomp CONFIRMED exact; D2D collapse CONFIRMED (stronger); Step-2 0.7%/48% split
UNVERIFIED (producer script crashed, only post-fix sqlite is Step-1). Claim 2: magnitude CONFIRMED,
"+66%" label REFUTED (true +76%; 66% = % of vLLM). Claim 3: CONFIRMED. Claim 4: CONFIRMED.
---
## Section: weight-bandwidth (whole-step DRAM budget, READ-ONLY math)
Agent label: weight-bandwidth. Method: exact GGUF tensor accounting (q36-27b-nvfp4,
arch qwen35, 64 layers) + activation-state math + existing nsys/decode_decomp; no GPU started.
Config = the production decode number: llama-batched-bench -fa on -npp128 -ntg128 -npl 128
(B = n_parallel = 128 sequences, S_TG = 254 t/s post-0019). GB10 LPDDR5x peak ~273 GB/s.
### Exact per-step DRAM byte budget at B=128 (ctx avg ~192 over the ntg128 window)
NVFP4 type-40 = 0.5625 B/weight (4-bit data + e4m3 per-16 micro-scale; verified: 5120*48*0.5625=138240).
WEIGHTS (read ONCE per step, shared across all 128 seqs):
- NVFP4 layer weights (type40, 64 layers): 13,062.7 MB = 12.76 GB
(per SSM layer 205.6 MB x48 = 9867.7 MB ; per full-attn layer 199.7 MB x16 = 3195.0 MB)
- LM head output.weight: type 30 = **bf16, NOT quantized** = 2425 MB = 2.37 GB (read in full each step)
- per-layer norms/conv1d/ssm_a/dt_bias (type0 f32): 10.1 MB
- token_embd: EXCLUDED (get_rows gathers only 128 rows, negligible)
=> WEIGHTS TOTAL = 15.14 GB / step
PER-SEQUENCE STATE (x128 seqs, read + write every step):
- SSM recurrent state: inner_size(6144) x state_size(128) x 4B(f32) = 3.0 MB / layer / seq
x 48 SSM layers x 128 seq = 18.43 GB read + 18.43 GB write = **36.86 GB / step**
- conv state: conv_k(4) x conv_dim(10240) x 4B = 160 KB / layer / seq
x 48 x 128 = 0.96 GB read + 0.96 GB write = 1.92 GB / step
- KV cache (16 full-attn layers, GQA n_kv_head=4, k+v_len=512, f16):
4096 B/tok/layer x 16 x ~192 ctx x 128 seq = ~1.6 GB read / step
TOTAL ~= 15.14 (W) + 36.86 (SSM state) + 1.92 (conv) + 1.6 (KV) = **~55.5 GB / step**
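
The per-sequence terms of this budget can be recomputed from the dimensions quoted above. The sketch below works in exact bytes and reports MiB; the GB figures in the text are rounded, so only the raw products are checked here. Variable names are illustrative.

```python
MIB = 1024 * 1024

# NVFP4: 4-bit payload plus one 8-bit e4m3 scale per 16 weights.
NVFP4_BYTES_PER_WEIGHT = 4 / 8 + 1 / 16
check_tensor_bytes = 5120 * 48 * NVFP4_BYTES_PER_WEIGHT

N_SSM_LAYERS = 48
N_ATTN_LAYERS = 16
BATCH = 128
AVG_CTX = 192          # average context over the ntg128 window, as stated

# Per layer, per sequence.
ssm_state_bytes = 6144 * 128 * 4        # inner_size x state_size x f32
conv_state_bytes = 4 * 10240 * 4        # conv_k x conv_dim x f32
kv_bytes_per_tok_layer = 4 * 512 * 2    # n_kv_head x (k+v len) x f16

# One direction (read OR write) across all layers and sequences.
ssm_one_way_mib = ssm_state_bytes * N_SSM_LAYERS * BATCH / MIB
conv_one_way_mib = conv_state_bytes * N_SSM_LAYERS * BATCH / MIB
kv_read_mib = kv_bytes_per_tok_layer * N_ATTN_LAYERS * AVG_CTX * BATCH / MIB

print(NVFP4_BYTES_PER_WEIGHT, check_tensor_bytes)
print(ssm_one_way_mib, conv_one_way_mib, kv_read_mib)
```

State and conv are read and written every step, so their one-way figures count twice in the total; the KV cache is counted as a read only.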
### Floor vs measured -- decode is NOT at the bandwidth floor
Bandwidth floor = 55.5 GB / 273 GB/s = **203 ms/step**
Measured llama = 128 tok / 254 t/s = **504 ms/step** => **2.48x the floor** (eff BW 110 GB/s = 40% of peak)
vLLM 391 t/s = 128 / 391 = **327 ms/step** => 1.61x the floor (eff BW 170 GB/s = 62% of peak)
The SAME 55.5 GB/step floor applies to vLLM: identical NVFP4 weights, and its
fused_recurrent_gated_delta_rule reads+writes the identical f32 recurrent state. So both engines
face the same DRAM wall; vLLM simply moves those bytes at 62% of peak vs llama's 40%. The 62/40 =
1.55x utilization gap equals the 254->391 (1.54x) throughput gap up to rounding, as it must when both engines move the same bytes. => Decode-parity is a
bandwidth-UTILIZATION / launch-serialization problem, NOT a DRAM-traffic-volume problem. Bandwidth
is not the binding constraint (we sit 2.5x above the floor); confirms the GDN-kernel section above.
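
The floor-versus-measured comparison reduces to three conversions. A minimal sketch using the section's own inputs (55.5 GB/step, 273 GB/s peak, batch 128); function names are illustrative:

```python
PEAK_GBPS = 273.0       # GB10 LPDDR5x peak, as quoted
BYTES_PER_STEP_GB = 55.5
BATCH = 128

def floor_ms(bytes_gb: float, peak_gbps: float = PEAK_GBPS) -> float:
    """Step time if every byte moved at peak bandwidth."""
    return bytes_gb / peak_gbps * 1000.0

def step_ms(tokens_per_s: float, batch: int = BATCH) -> float:
    """One decode step emits `batch` tokens."""
    return batch / tokens_per_s * 1000.0

def effective_gbps(bytes_gb: float, ms: float) -> float:
    return bytes_gb / (ms / 1000.0)

floor = floor_ms(BYTES_PER_STEP_GB)
for name, tps in (("llama", 254.0), ("vLLM", 391.0)):
    ms = step_ms(tps)
    bw = effective_gbps(BYTES_PER_STEP_GB, ms)
    print(name, round(ms), round(ms / floor, 2), round(bw), round(100 * bw / PEAK_GBPS))
```

Because both engines are charged the same byte count, the ratio of their utilizations is the ratio of their throughputs by construction.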
### Traffic composition is STATE-dominated at B=128 (qualifies the "weight-quant" verdict)
SSM state r+w = 66% of step traffic; weights = 27%; conv = 3.5%; KV = 3%.
At B=128 weights are a minority of traffic, and we are 2.5x above the floor anyway -> NVFP4-dense
weight quant (the QWEN36_NVFP4 verdict's lever) cannot move batch-128 decode much. Weight-quant
helps PREFILL (compute/weight-bound, already +12.7% from the GEMM remap) and LOW-batch decode.
Cross-check at B=32: traffic ~25.2 GB/step (weights now 60%), floor 92 ms, measured 189 ms = 2.05x
floor. The sublinear scaling 32->128 (4x batch, only 1.5x throughput: 169->254) is fully explained
by per-seq state traffic growing with B while weights stay amortized -> at B=128 the step has become
state-traffic-heavy but is STILL 2.5x off the floor, i.e. latency/overlap-bound, not byte-bound.
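
The B=32 cross-check follows from a two-term traffic model: weights are read once per step, per-sequence terms scale with B. The sketch below assumes the per-sequence terms scale linearly with batch and that the average context is unchanged; both are simplifications, and the names are illustrative.

```python
PEAK_GBPS = 273.0
WEIGHTS_GB = 15.14
PER_SEQ_GB_AT_128 = 36.86 + 1.92 + 1.6   # SSM state + conv state + KV at B=128

def traffic_gb(batch: int) -> float:
    """Per-step DRAM traffic: shared weights plus linearly scaled per-seq terms."""
    return WEIGHTS_GB + PER_SEQ_GB_AT_128 * batch / 128

def floor_ms(batch: int) -> float:
    return traffic_gb(batch) / PEAK_GBPS * 1000.0

def measured_ms(batch: int, tokens_per_s: float) -> float:
    return batch / tokens_per_s * 1000.0

b32_traffic = traffic_gb(32)
b32_floor = floor_ms(32)
b32_measured = measured_ms(32, 169.0)

print(round(b32_traffic, 1), round(100 * WEIGHTS_GB / b32_traffic))
print(round(b32_floor), round(b32_measured), round(b32_measured / b32_floor, 2))
```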
### Redundant traffic llama reads that vLLM avoids (cut list, by impact)
1. (HISTORICAL, FIXED by 0018) Redundant DtoD recurrent-state copy = +18.4 GB/step EXTRA
(pre-fix decode_decomp: MEMCPY-DtoD 18.9%, 80 copies/step ~230 MB each = 18.4 GB; nsys window
356 GB/19.8 steps). This doubled state traffic and was the dominant pre-fix waste. Verified gone
post-fix: the THROUGHPUT_B_P2a A/B kernel sum (npp128 ntg24 npl128) lists gated_delta_net /
mul_mat_q / quantize but NO MEMCPY-DtoD term. (The committed ~/bench/a2_nsys sqlites are all
PRE-fix S_TG~149 traces; re-profiling deferred to the designated profiler.) This single removal
(18.4 GB/273 ~= 67 ms/step of bytes plus the killed overlap stalls) is the bulk of 146->254.
2. conv state as a SEPARATE ssm_conv kernel + separate buffer: 1.92 GB r+w/step AND 48 extra kernel
launches/step. vLLM folds the causal conv into its recurrence kernel. Cut ~= 7 ms bytes + 48
launches/step of serialization.
3. Residual get_rows gather post-0019 (~0.7%, decode_decomp pre-fix k_get_rows was 21.9% / ~96
ops/step = 2/SSM-layer): vLLM indexes the per-seq state in-kernel; llama still does a small
gather/scatter. ~0.13 GB. 0019 already folds most of it; fold the identity check fully into the
recurrence prologue.
4. quantize_mmq_nvfp4: 448 ops/step re-quantizing activations to NVFP4 before each FP4 matmul.
Activation BYTES are negligible, but it is 448 extra kernel launches/step that vLLM fuses into
the GEMM prologue -> pure launch latency, not traffic.
5. NOT redundant: weight bytes (identical NVFP4 to vLLM), SSM-state r+w (inherent, vLLM pays it),
NVFP4 scale scalars (8 B/tensor). Note the LM head is bf16 not quantized (2.37 GB/step, 16% of
weight traffic) -- fp8 LM head would save ~1.2 GB/step but only matters if vLLM also quantizes it.
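
To compare the byte-side value of the items in this cut list, each can be converted to milliseconds at peak bandwidth. This is a sketch of the byte cost only; it ignores launch latency and overlap, which the list says dominate items 2 and 4. Names are illustrative.

```python
PEAK_GBPS = 273.0

def ms_at_peak(gb: float) -> float:
    """Time to move `gb` gigabytes at peak DRAM bandwidth."""
    return gb / PEAK_GBPS * 1000.0

# Item 1: 80 copies/step of ~230 MB each (pre-fix).
dtod_gb = 80 * 230 / 1000.0

# Item 5: an fp8 LM head would halve the 2.37 GB bf16 read.
lm_head_saving_gb = 2.37 / 2

cuts_gb = {
    "dtod_state_copy (fixed by 0018)": dtod_gb,
    "conv_state r+w": 1.92,
    "lm_head bf16 -> fp8": lm_head_saving_gb,
}

for name, gb in sorted(cuts_gb.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gb:.2f} GB/step, {ms_at_peak(gb):.1f} ms at peak")
```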
### Bottom line (weight-bandwidth)
At B=128, decode moves ~55.5 GB/step and takes 2.48x the bandwidth-floor step time at 273 GB/s (40% util) vs vLLM's 1.61x
(62% util). Same bytes, same floor for both engines -> decode is bandwidth-UTILIZATION-bound, not
traffic-bound. There is NO large redundant-byte stream left to cut post-0018/0019 (the 18.4 GB/step
DtoD redundancy is already gone); the remaining 254->391 is recovered by raising achieved bandwidth
(occupancy + prefetch on the GDN state loads, conv fusion to drop 48 launches/step) so the EXISTING
55.5 GB/step moves at vLLM's 62% instead of 40%. Weight-quant (NVFP4-dense) is a PREFILL / low-batch
lever, largely orthogonal to the batch-128 decode-parity gap.
---
## Section: explore-other-levers (broad sweep for OTHER llama-specific decode inefficiencies) - READ-ONLY, no GPU
Scope handoff: GDN-kernel internals -> `gdn-source-compare`; host loop / graphs / gaps ->
`per-token-latency`; weight-byte / utilization -> `weight-bandwidth` section above (which already
covers the BF16 lm_head and the "same bytes, 40% vs 62% util" framing - I concur, no need to repeat).
This section covers the levers NONE of those own: the FP4 act-quant fusion, the M=128-vs-M=1 ggml
fusion gate, TMA scoping, and the conv-state residual.
**Terminology fix that matters for the whole doc:** in this repo's benches **"fusion OFF" means
`LLAMA_FUSE_NVFP4_QUANT=0`** (Track A's NVFP4 act-quant producer), confirmed in
`a2_nsys.sh`/`a2_4cell.sh`/`trackA_clean.sh`. It does NOT set `GGML_CUDA_DISABLE_FUSION`, so the
**standard ggml-cuda elementwise/GLU/rope fusion is ON** in every result. The header's "fusion OFF
baseline" is only about the act-quant producer.
**Framing (consistent with the sections above, sharpened):** the binder is bandwidth-UTILIZATION /
the kernel-dependency chain, not traffic or per-kernel compute (P2a -24.7% GEMM and graphs both
flat). The thing that raises utilization AND shortens the chain is the same: **fewer, fused kernels
per step** - removing whole passes vLLM doesn't run. So rank by "whole pass eliminated", not "us
shaved".
### L1. Re-test Track A act-quant fusion (`LLAMA_FUSE_NVFP4_QUANT=1`) POST-SSM. [impact ~8-11%, tractability HIGH - code exists, owned by tasks 38-41]
`quantize_mmq_nvfp4` is a standalone full-activation requantize run once per NVFP4 GEMM at M=128
(the weight-bandwidth section counts 448 such launches/step). vLLM has **zero** equivalent:
`rms_quant_fusion.py:98` folds it into RMSNorm, `act_quant_fusion.py:40,128` into SiLU+mul - the
activation never hits a temp buffer. Track A built exactly this fused producer (tasks 38-40 DONE),
but `LLAMA_FUSE_NVFP4_QUANT=1` regressed, and EVERY post-SSM bench ran with it OFF. **The regression
is likely stale:** pre-SSM the GPU was 99% busy on the state-copy chain, so folding act-quant into
the norm only relocated busy work into a lower-occupancy kernel with no idle to reclaim. Post-SSM the
chain has real idle and removing 448 launches/step both shortens the dependency chain and lifts
utilization - exactly the post-0018/0019 bind. Highest-value CLEAN removal; needs only a re-bench
(re-run `trackA_clean.sh` on the post-0019 build), not new code. Do not treat the prior regression
as final.
### L2. M=128 norm->matmul prologue fusion - the ggml fusion gate that does NOT fire at decode batch. [impact ~5-15% aggregate, tractability MEDIUM]
ggml-cuda's built-in `rms_norm+mul+mul_mat_vec_q` fusion (`ggml_cuda_should_fuse_mul_mat_vec_q`,
ggml-cuda.cu:2502) is gated to `dst->ne[1]==1` - it ONLY fires at **npl1** (M=1). At npl128
(`mul_mat_q`, M=128) it does NOT fire, so the per-layer RMSNorm stays a separate kernel feeding the
GEMM and the act path is unfused (L1). vLLM fuses both into the GEMM prologue at all M. This is the
M=128 generalization of the existing M=1 fusion + L1; largest aggregate surface but real kernel work.
Implies a regime split worth stating loudly: **npl1 single-stream latency already gets this fusion;
the npl128 throughput number does not** - tune the two separately.
### L3. TMA weight feed: a PREFILL / npl1-latency lever, NOT an npl128-decode lever.
Answering the brief's question (GEMM idle = FEED problem TMA fixes, or off-critical-path TMA can't?):
P2a cut GEMM compute and the freed time became IDLE, so at npl128 the GEMM finishes early and the
stall is BETWEEN kernels, not inside the GEMM waiting on weight tiles. TMA accelerates a
*feed-stalled* GEMM; at npl128 the GEMM is not the binder, so TMA won't move npl128 S_TG. It pays on
(a) **prefill** (compute/feed-bound; the remap already gave +12.7%) and (b) **npl1 decode**, a pure
weight-feed GEMV (full model / 273 GB/s ~ 19-20 tok/s ceiling). Scope TMA to prefill + low-batch
latency; do not bank it for batch-128 decode parity. (Consistent with the weight-bandwidth section's
"NVFP4-dense is a prefill/low-batch lever".)
### L4. In-place / `ids` conv-state - apply the 0018/0019 pattern to `ssm_conv`. [impact ~1-3%, tractability HIGH, proven pattern, bit-exact-able]
After the SSM fix the residual D2D is the conv-state copy (`build_conv_state`,
delta-net-base.cpp:449-525: `build_rs` reads 3 prior samples, `ggml_concat` the new token, writes
the last 3 back), plus `ssm_conv` (~0.8-1.5%) and a per-GDN-layer `concat_cont` (48/step). The exact
in-place + `ids`-read treatment from 0018/0019 applies to the conv state, and `ssm_conv`+`concat`
can fold into the GDN kernel prologue (it already has `ids` plumbing). Small ceiling but bit-exact,
low-risk, and removes ~48 launches/step from the chain - this is the "conv fusion to drop 48
launches/step" the weight-bandwidth section calls for, made concrete via the proven patch pattern.
### Deferred (covered by other sections, I concur)
- GDN occupancy / row-local reductions / gate-norm fusion -> `gdn-source-compare`. Add only: bf16
state halves the dominant traffic but vLLM keeps f32, so it is a divergence-from-reference, not a
parity lever - last priority, quality-risk.
- BF16 lm_head / weight-byte / 40%-vs-62% utilization -> `weight-bandwidth` section. lm_head NVFP4 is
an absolute ~1-2% trim, not a vLLM-relative gap (vLLM likely keeps it bf16 too).
- Full-attention KV path (16 attn layers, 0.4-1.8%, O(ctx) but tiny) -> CLOSED, not a lever.
### Bottom line (this section's net-new)
Ranked by "whole pass vLLM eliminated": **L1 (re-test act-quant fusion post-SSM - clean removable
pass, code already written, just needs a post-0019 re-bench)** > **L2 (M=128 norm/act prologue
fusion - biggest aggregate surface, real work)** > **L4 (conv-state in-place - cheap, proven 0018/0019
pattern, -48 launches/step)**. **L3 (TMA) is mis-scoped if aimed at npl128 decode** - it is a prefill
/ npl1-latency lever, same bucket as NVFP4-dense weight quant. Caveat inherited from
`validate-findings`: the post-SSM act-quant absolute share (L1) is on an unverified Step-2 decomp
(only clean post-fix sqlite is Step-1); re-measure on a clean Step-2 nsys when the profiler runs.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## Section: profile-both-engines (GROUND-TRUTH post-SSM nsys of llama AND vLLM at npl128) - THE GPU PROFILER
Agent label: profile-both-engines (the only GPU agent). Fresh post-SSM nsys traces of
BOTH engines at the same shape (128-seq decode, 128-token prompts), q36-27b-nvfp4 dense.
llama = `build-cuda-base` (no FP4 flag, byte-identical to stock, HEAD 46d7dd8 = patch 0019
SSM fix), `llama-batched-bench -npp128 -ntg32 -npl128 -fa on`, eager (DISABLE_GRAPHS=1) for
a clean per-kernel trace. vLLM = 0.23.0 in-process offline (`VLLM_ENABLE_V1_MULTIPROCESSING=0`
so cudaProfilerApi controls the worker), enforce_eager, max_num_seqs 256, 128 prompts.
Decode-only windows (prefill excluded), overlap-correct interval-union busy, GPU-accurate
per-call kernel durations. This is the post-SSM **Step-2** trace `validate-findings` flagged
as having no surviving sqlite - it now exists: `~/bench/postssm_decomp/`.
### 0. THROUGHPUT GROUND TRUTH (un-profiled, prefill-subtracted) - resolves the 391 reference
The vLLM 391 reference is real and reproduced. Prefill-subtracted decode step (two-length
w16/w64 timing, in-process, batch 128):
| engine / mode | ms/step | decode tok/s | notes |
|--------------------------|---------|--------------|--------------------------------|
| llama post-SSM (graphs) | ~510-522| **245-251** | S_TG @npl128 ntg32 (this run) |
| vLLM enforce_eager | 324.9 | **394.0** | == the ~391 ref (h2h log 371-384)|
| vLLM cuda-graphs | 304.9 | **419.8** | graphs buy only +6% |
- **CUDA graphs are NOT the parity lever**: vLLM is already 394 t/s EAGER; graphs add +6%
(394->420). llama-batched-bench already runs WITH graphs at 245. So the gap is eager-vs-eager
kernel work, confirming `per-token-latency` and `A2_CUDAGRAPH_DECODE`.
- TRAP I hit and corrected: the FIRST vLLM nsys window (0.35-0.99) read 468 ms/step / 273 t/s -
WRONG, contaminated by prefill chunked-GDN kernels AND eager-nsys host overhead. The tight
decode-only window (0.62-0.98) reads **326.5 ms/step**, matching the un-profiled 324.9 ms
to within 0.5% -> the tight window is faithful; per-kernel numbers below use it.
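
The table's tok/s column and the window check are direct conversions from ms/step at batch 128. A minimal sketch (function names are illustrative):

```python
BATCH = 128

def decode_tps(ms_per_step: float, batch: int = BATCH) -> float:
    """Aggregate decode tokens/s when each step emits `batch` tokens."""
    return batch / (ms_per_step / 1000.0)

def rel_diff_pct(a: float, b: float) -> float:
    return abs(a - b) / b * 100.0

vllm_eager = decode_tps(324.9)
vllm_graphs = decode_tps(304.9)
contaminated_window = decode_tps(468.0)

print(round(vllm_eager, 1), round(vllm_graphs, 1), round(contaminated_window, 1))
print(round(rel_diff_pct(326.5, 324.9), 2))   # tight nsys window vs un-profiled
```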
### 1. POST-SSM per-step decode decomposition, SIDE BY SIDE (GPU-accurate, prefill-free)
Both at batch 128. llama 510 ms/step (98.7% GPU-busy), vLLM 326 ms/step (97.9% GPU-busy).
ms/step = on-device kernel time per real decode step (nsys host overhead does not inflate GPU
kernel duration; per-step = GPU-ms / real-step-count from the decode-only GDN call count).
| component (per step) | llama ms/step | llama % | vLLM ms/step | vLLM % |
|-----------------------------|---------------|---------|--------------|--------|
| GDN linear-attn recurrence | 193 (48x4.03) | 38% | 174 (48x3.62)| 53% |
| FP4 matmul + act-quant | **236** | **46%** | **117** | **36%**|
| - mul_mat_vec_q (GEMV) | 132 (48x2.75) | 26% | - | - |
| - mul_mat_q (GEMM) | 88 (448 calls)| 17% | cutlass 61 | 19% |
| - quantize_mmq_nvfp4 | 16 (448) | 3% | nvjet 53+cvt2| 17% |
| full attention (16 layers) | 6.6 (16) | 1.3% | 6.2 (16) | 1.9% |
| SSM conv + glue/elementwise | ~45 | 9% | ~22 | 7% |
| MEMCPY (D2D+H2D) | 2.5 (131 MB) | 0.5% | 0.36 (85 MB) | 0.1% |
| **TOTAL** | **~510** | 100% | **~326** | 100% |
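
The per-step and percentage columns above follow from calls/step times ms/call, divided by the step time. A sketch that rebuilds a few rows from the quoted per-call figures (names are illustrative):

```python
LLAMA_STEP_MS = 510.0
VLLM_STEP_MS = 326.0

def per_step_ms(calls_per_step: int, ms_per_call: float) -> float:
    return calls_per_step * ms_per_call

def share_pct(component_ms: float, step_ms: float) -> float:
    return 100.0 * component_ms / step_ms

llama_gdn = per_step_ms(48, 4.03)
vllm_gdn = per_step_ms(48, 3.62)
llama_gemv = per_step_ms(48, 2.75)

# FP4 path totals: GEMV + GEMM + act-quant on llama; the table's total on vLLM.
llama_fp4 = 132.0 + 88.0 + 16.0
vllm_fp4 = 117.0

for name, ms, step in (
    ("llama GDN", llama_gdn, LLAMA_STEP_MS),
    ("vLLM GDN", vllm_gdn, VLLM_STEP_MS),
    ("llama mul_mat_vec_q", llama_gemv, LLAMA_STEP_MS),
    ("llama FP4 path", llama_fp4, LLAMA_STEP_MS),
    ("vLLM FP4 path", vllm_fp4, VLLM_STEP_MS),
):
    print(f"{name}: {ms:.0f} ms/step, {share_pct(ms, step):.0f}%")
```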
### 2. The three load-bearing comparisons (the brief)
**(1) GDN compute: llama vs vLLM = NOT the gap.** Per-call GPU duration:
llama `gated_delta_net_cuda<128>` = **4.03 ms/call**, vLLM
`fused_recurrent_gated_delta_rule_packed_decode` = **3.62 ms/call**. llama is only **+11%**
slower per call (+19 ms/step). GDN is comparable; it is the largest single kernel on BOTH sides
(38% llama, 53% vLLM) but it explains only ~19 ms of the 185 ms gap (~10%). This REFUTES the
framing that the GDN kernel is the dominant residual lever - it is a minor overage post-0018/0019.
(The `gdn-source-compare` occupancy/shuffle deltas are real but worth ~19 ms/step, not 1.5x.)
**(2) DRAM bytes/step: llama does NOT read more.** Explicit memcpy: llama **131 MB/step** vs
vLLM **85 MB/step** - llama moves a hair more in copies but both are <0.5% of the step. The big
per-layer state copies are GONE (pre-SSM 18 GB/step DtoD -> post-SSM 131 MB/step) - **the SSM fix
(0018/0019) is confirmed working in this trace.** Weight DRAM (read inside the GEMM/GEMV kernels,
not memcpy) is the SAME ~15 GB NVFP4 for both engines; at 273 GB/s that is a ~55 ms floor, and
BOTH engines sit far above it (326-510 ms), so BOTH are compute/kernel-bound, NOT
weight-bandwidth-bound, and llama reads no extra bytes. The 254-vs-391 gap is NOT a byte-volume
deficit - it is effective-bandwidth/compute-efficiency in the FP4 matmul kernels (see 3).
**(3) GPU-busy% / idle structure: identical, both ~98% busy.** llama 98.7% busy (1.3% idle),
vLLM 97.9% busy (2.1% idle). Neither engine is idle/gap/host-bound at npl128. The entire gap is
the GPU doing MORE kernel-time per step on llama: llama's non-GDN GPU work = ~310 ms/step vs
vLLM's ~146 ms/step. That 164 ms delta is concentrated in the FP4 matmul path.
### 3. THE single biggest llama-specific overage: the FP4 matmul path (+119 ms/step = 64% of the gap)
llama spends **236 ms/step** on FP4 matmul+quant; vLLM does ALL its matmul (cutlass FP4 GEMM +
cublas nvjet + act cvt) in **117 ms/step** - even though vLLM ALSO carries ~18 ms/step of extra
PyTorch eager elementwise glue that llama's fused ggml kernels avoid. llama is **2.0x slower on
FP4 matmul**, and that +119 ms is **64% of the entire 185 ms/step gap**.
Inside llama's FP4 path the dominant, untouched cost is **`mul_mat_vec_q` = 132 ms/step (26% of
decode), 48 calls/step (exactly one per GDN layer), 2.75 ms/call, grid 5120x128**. This is the
**FP4 GEMV ("vec_q") kernel running at decode batch 128** for the gated-DeltaNet in-projections -
a non-tensor-core, memory-bound-style kernel doing M=128 work without GEMM-grade weight-read
amortization. vLLM runs the equivalent projections through cutlass batched FP4 GEMM (tensor-core,
weight read amortized across the 128-row batch) at a fraction of the cost. **There is no
GEMV-at-batch-128 on the vLLM side at all.**
Key cross-check with Track B P2a: P2a optimized `mul_mat_q` (the 17%/88 ms tensor-core GEMM, made
it -24.7%) and decode stayed FLAT - because the BIG FP4 cost is `mul_mat_vec_q` (26%/132 ms),
which P2a never touched. **Track B optimized the wrong FP4 kernel.** The lever is to route the
GDN in-projection at M=128 through a tensor-core GEMM (mul_mat_q / MMQ) instead of the vec_q path,
and to fuse the act-quant (L1) + the norm prologue (L2) so the 448 `quantize_mmq_nvfp4` launches
fold away - exactly what `explore-other-levers` L1/L2 propose. My measurement RANKS them: the
mul_mat_vec_q->GEMM routing is the single highest-value target (132 ms), then act-quant fusion
(16 ms + 448 launches), then the GDN +19 ms.
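
The ranking in this section can be reproduced from the table figures. The sketch takes the 185 ms/step gap as quoted in the text rather than recomputing it, and the dictionary keys are labels chosen for illustration.

```python
GAP_MS = 185.0   # llama-vs-vLLM step-time gap, as quoted

# llama minus vLLM, ms/step, from the side-by-side table.
overage_ms = {
    "fp4_matmul_path": 236.0 - 117.0,
    "gdn_recurrence": 193.0 - 174.0,
}
gap_share_pct = {name: 100.0 * ms / GAP_MS for name, ms in overage_ms.items()}

# Inside llama's FP4 path, absolute ms/step per kernel family.
llama_fp4_kernels_ms = {
    "mul_mat_vec_q": 132.0,
    "mul_mat_q": 88.0,
    "quantize_mmq_nvfp4": 16.0,
}
ranked = sorted(llama_fp4_kernels_ms, key=llama_fp4_kernels_ms.get, reverse=True)

print({k: round(v) for k, v in gap_share_pct.items()})
print(ranked)
```

The two overages do not sum to the full gap; the remainder sits in the conv, glue and memcpy rows of the table.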
### 4. Reconciling with the `weight-bandwidth` section (unification, not contradiction)
weight-bandwidth concluded "same 55.5 GB/step, llama 40% util vs vLLM 62% util -> utilization-bound."
My per-kernel data LOCALIZES that utilization gap: it lives in the **FP4 matmul kernels** (which
do the bulk of the ~15 GB weight read), NOT in the GDN state traffic. GDN moves its (equal) state
bytes at comparable rate on both engines (4.03 vs 3.62 ms/call). So the "40% vs 62%" is the
`mul_mat_vec_q`/`mul_mat_q` weight-read efficiency vs cutlass FP4 GEMM. Raising decode parity =
raise the FP4-matmul achieved bandwidth (tensor-core GEMM routing + act/norm prologue fusion),
not the GDN kernel and not byte-cutting.
### Verdict (profiler)
- Reproduced both engines at their true operating points: llama 245 / vLLM 394 eager / 420 graphs.
Graphs are not the lever (+6%). Both engines ~98% GPU-busy; gap is GPU kernel-time, not idle/host.
- GDN compute is comparable (llama +11%/call, +19 ms/step) - NOT the dominant residual.
- bytes/step: llama does not read more (131 vs 85 MB memcpy; identical weight bytes); SSM fix's
18 GB/step DtoD removal CONFIRMED in-trace.
- **The single biggest llama-specific overage is the FP4 matmul path: 236 vs 117 ms/step (+119 ms
= 64% of the 185 ms gap), dominated by `mul_mat_vec_q` (FP4 GEMV at batch 128, 132 ms/step, 26%,
one per GDN layer).** Highest-value lever = route the GDN in-projection through a tensor-core FP4
GEMM at M=128 + fuse act-quant/norm prologue (L1/L2). Track B optimized the wrong FP4 kernel.
### Evidence (DGX, this agent)
- `~/bench/postssm_decomp/postssm_base.{nsys-rep,sqlite,gpu_trace.csv,run.log}` (llama post-SSM).
- `~/bench/postssm_decomp/vllm_decode.{nsys-rep,gpu_trace.csv}` (vLLM eager decode trace).
- `~/bench/postssm_decomp/vllm_decode_g1.*` (vLLM graphs run), `~/bench/vllm_tps.py` (throughput).
- Scripts: `~/bench/postssm_llama_decomp.sh`, `~/bench/vllm_nsys_run.sh`, `~/bench/decode_decomp2.py`
(decode-only windowed, overlap-correct, MB-memcpy, per-step reconstruction).
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## Section: SYNTHESIS (cross-check + ground-truth + ranked levers + verdict) - FINALIZED
Agent label: synthesize. Read-only (no GPU). Cross-checks all sections above against the
fresh `profile-both-engines` ground-truth, then mechanism-confirms the dominant lever by
reading the model graph + ggml-cuda dispatch source on the DGX (`~/llama-paged-dev`, HEAD
46d7dd8 = patch 0019). All throughput vs the vLLM 391 t/s eager apples-to-apples reference.
### 0. Headline
Post-SSM dense decode = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact. The residual is
NOT a hardware/architecture floor and NOT the GDN recurrence kernel, the host loop, CUDA
graphs, or DRAM byte-volume. It is ONE concrete, llama-specific kernel-routing defect:
**the gated-DeltaNet output projection (`ssm_out`) runs as an FP4 GEMV (`mul_mat_vec_q`)
at decode batch 128 instead of a tensor-core FP4 GEMM (MMQ), costing 132 ms/step = 26% of
decode = the single biggest overage vs vLLM (which packs the same projection into a cutlass
M=128 GEMM).** The fix is a ~2-line reshape, bit-exact, and is the highest-value next step.
### 1. Cross-check: which prior findings HELD, were REFUTED, or are SUPERSEDED
HELD (confirmed by both the adversarial re-derivation and the fresh profile):
- Pre-fix decomposition (gated_delta_net 23.4%, k_get_rows 21.9%, MEMCPY-DtoD 18.9% / 382 GB,
mul_mat_vec_q 15.5%, mul_mat_q 10.5%): reproduced to <=0.1pp (validate-findings).
- SSM-fix D2D collapse: the 18.4 GB/step redundant recurrent-state copy is GONE. Confirmed
three ways: validate (18.9% -> 0.008% on the post-fix sqlite), weight-bandwidth (A/B kernel
sum lists no DtoD term), and IN-TRACE by the profiler (18 GB/step DtoD -> 131 MB/step). The
SSM fix (0018/0019) is the real breakthrough and is working.
- P2a FP4-GEMM occupancy remap FLAT on decode (+0.6% noise) while the `mul_mat_q` kernel itself
shrank -24.7% and prefill rose +12.7%: confirmed. Decode is not GEMM-occupancy-bound.
- 65% of vLLM (254/391 = 64.96%, 256.6/391 = 65.6%): confirmed.
- Decode is NOT at the bandwidth floor: 55.5 GB/step moved at 2.48x the 273 GB/s floor (40% util)
vs vLLM 1.61x (62% util) on the SAME bytes. Confirmed + LOCALIZED below.
- Host loop / 64-layer serialization is NOT the lever: both engines ~98% GPU-busy at npl128
(llama 98.7%, vLLM 97.9%); the entire exposed-idle budget is ~0.65%. Confirmed by the profiler.
- CUDA graphs are NOT the lever: vLLM is 394 t/s EAGER, graphs add only +6% (420); llama already
runs with graphs. Confirmed by the profiler.
REFUTED / CORRECTED:
- "GDN recurrence kernel is the dominant residual lever" (the STATE brief's "gated_delta_net
1.46 ms/call, the largest single kernel" and the gdn-source-compare framing): REFUTED. The
profiler's fresh side-by-side per-call duration is llama 4.03 ms vs vLLM 3.62 ms = only +11% /
+19 ms/step = ~10% of the 184 ms gap. It IS the largest single kernel on both sides (38% llama,
53% vLLM) but the largest GAP is elsewhere. (The brief's "1.46 ms/call" is a stale/narrower
window; the authoritative post-SSM per-call is 4.03 ms.) gdn-source-compare's occupancy/shuffle/
fusion anatomy is correct but addresses a SECONDARY +19 ms target, not parity.
- "+66% SSM-fix gain" label: REFUTED. 146 -> 254-257 is +74 to +76%; "66%" is the percent-of-vLLM,
not the speedup (validate-findings).
SUPERSEDED (the gap validate-findings flagged, now filled by real data):
- The "FP4-GEMM ~48% / get_rows 0.7% / GDN 22.5%" Step-2 split had NO surviving sqlite (the
producer script crashed; only a Step-1 build was on the box). The profiler's fresh Step-2 trace
replaces it with a FINER, load-bearing breakdown: the ~46% "FP4 matmul" bucket is NOT one GEMM
family - it splits into `mul_mat_vec_q` 26% (the o_proj GEMV, the real culprit), `mul_mat_q` 17%
(the tensor-core GEMM P2a already optimized), and `quantize_mmq_nvfp4` 3%. Lumping them as
"48% FP4-GEMM" hid that Track B P2a optimized the 17% MMQ while the 26% MMVQ was the bind. This
is why P2a was flat on decode: **it optimized the wrong FP4 kernel.**
### 2. Ground-truth per-step decode decomposition + the single biggest overage
From the profiler's fresh post-SSM eager nsys, both at batch 128, prefill-free, GPU-accurate:
| component (per decode step) | llama ms | llama% | vLLM ms | vLLM% | gap (llama-vLLM) |
|-----------------------------|----------|--------|---------|-------|------------------|
| GDN recurrence kernel | 193 | 38% | 174 | 53% | **+19** |
| FP4 matmul + act-quant | 236 | 46% | 117 | 36% | **+119** |
| - mul_mat_vec_q (o_proj GEMV) | **132** | **26%** | 0 | - | **+132** |
| - mul_mat_q (MMQ GEMM) | 88 | 17% | 61 (cutlass) | 19% | +27 |
| - quantize_mmq_nvfp4 | 16 | 3% | 55 (nvjet+cvt)| 17% | -39 |
| full attention (16 layers) | 6.6 | 1.3% | 6.2 | 1.9% | +0.4 |
| SSM conv + glue/elementwise | 45 | 9% | 22 | 7% | +23 |
| MEMCPY | 2.5 | 0.5% | 0.36 | 0.1% | +2 |
| **TOTAL** | **~510** | 100% | **~326**| 100% | **+184** |
The +119 ms FP4-matmul gap is ENTIRELY the `mul_mat_vec_q` o_proj GEMV (+132), partly offset
by llama being -39 on activation-quant (16 vs vLLM's heavier eager 55) and +27 on the MMQ. So
the one lever that matters is the +132 ms/step o_proj GEMV; everything else nets to ~+52 ms.
**MECHANISM (confirmed by source read, not inferred).** In the dense Qwen3.5-27B GDN block
(`src/models/qwen3next.cpp` `build_recurrent`), the recurrent core keeps the SSM layout
`[feat, n_seq_tokens, n_seqs]`. At decode `n_seq_tokens=1, n_seqs=128`. The output projection is:
```cpp
// current code (qwen3next.cpp, end of the GDN block)
ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm,
head_v_dim * num_v_heads, n_seq_tokens, n_seqs); // [6144, 1, 128]
cur = build_lora_mm(model.layers[il].ssm_out, final_output); // <-- the matmul
cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs); // collapse AFTER
```
`final_output` is 3D `[6144, n_seq_tokens=1, n_seqs=128]`, so `src1->ne[1] = 1`. The ggml-cuda
dispatch (`ggml-cuda.cu:2553`) picks MMVQ when `src1->ne[1] <= MMVQ_MAX_BATCH_SIZE (8)`, with the
128 sequences carried in `ne[2]`. Result: a per-sequence FP4 GEMV, output rows 5120 x 128 seqs =
**`mul_mat_vec_q`, grid 5120x128, 48 calls/step (one per GDN layer)** - matching the profiler's
trace exactly. MMVQ does NOT amortize the `ssm_out` weight read into shared memory across the 128
sequences (it is built for batch <=8), so each of the 128 sequences re-streams the weight tiles -
the "40% vs 62% utilization" the weight-bandwidth section measured lives HERE, in this kernel, not
in the GDN state traffic. vLLM packs all 128 decode tokens into one cutlass M=128 GEMM (its GDN
kernel is literally `..._PACKED_decode`), so it has NO GEMV-at-batch-128 at all.
This also pins WHY it is decode-specific: at prefill the tokens are in `ne[1]` (n_seq_tokens=prompt
len), so `ne[1] >> 8` -> MMQ already; only the decode layout (128 seqs x 1 token, batched in ne[2])
trips the GEMV path. The in-projection (`wqkv`) is unaffected: its input is the 2D residual stream
`[n_embd, 128]` (reshaped to 3D only AFTER the matmul), so `ne[1]=128` -> MMQ today. The o_proj is
the unique 3D-input matmul, which is exactly why the profiler counted one MMVQ per GDN layer.
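
The dispatch behaviour described above can be sketched as a tiny standalone model. This is an illustrative simplification, not the ggml-cuda source: `pick_matmul_path`, `ne1_3d`, `ne1_2d` and `kMmvqMaxBatch` are hypothetical names, and the only rule encoded is the one quoted in this section (MMVQ when `src1->ne[1] <= 8`, MMQ otherwise).

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of the dispatch rule quoted in this section; NOT the ggml-cuda code.
// kMmvqMaxBatch mirrors the MMVQ_MAX_BATCH_SIZE value (8) cited above.
constexpr int64_t kMmvqMaxBatch = 8;

enum class MatmulPath { MMVQ, MMQ };

// ne1 stands for the activation tensor's second dimension (src1->ne[1]).
inline MatmulPath pick_matmul_path(int64_t ne1) {
    assert(ne1 >= 1);
    return ne1 <= kMmvqMaxBatch ? MatmulPath::MMVQ : MatmulPath::MMQ;
}

// SSM layout [feat, n_seq_tokens, n_seqs]: ne[1] is n_seq_tokens, the sequences sit in ne[2].
inline int64_t ne1_3d(int64_t n_seq_tokens, int64_t /*n_seqs*/) {
    return n_seq_tokens;
}

// Collapsed 2D layout [feat, n_seq_tokens * n_seqs]: every decode token lands in ne[1].
inline int64_t ne1_2d(int64_t n_seq_tokens, int64_t n_seqs) {
    return n_seq_tokens * n_seqs;
}
```

Under this model, decode (`n_seq_tokens=1, n_seqs=128`) selects MMVQ in the 3D layout and MMQ once collapsed to 2D, while a long prefill selects MMQ in either layout.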
### 3. Ranked remaining decode levers (impact x tractability, cumulative ceiling toward 391)
Anchored on llama 256.6 t/s (499 ms/step) -> vLLM 391 (327 ms/step), gap 172 ms/step. Recover
figures past Lever 1 are ESTIMATES (the profiler measured the costs, not the post-fix kernels);
each needs a confirming re-profile. Ceilings are cumulative.
| # | lever | targets (ms/step) | est. recover | cumulative decode_agg | % of vLLM | tractability |
|---|-------|-------------------|--------------|-----------------------|-----------|--------------|
| 1 | **o_proj MMVQ -> MMQ** (collapse final_output to 2D before `ssm_out`) | vec_q 132 | ~100-110 | ~320-330 | **~82-85%** | **VERY HIGH** (2-line reshape, bit-exact, MMQ already proven on NVFP4 at M=128 by the in_proj) |
| 2 | act-quant + norm prologue fusion (explore L1 `LLAMA_FUSE_NVFP4_QUANT=1` re-bench + L2 M=128 gate) | quant 16 + 448 launches/step | ~15-25 | ~345-360 | ~88-92% | MED-HIGH (producer code exists, tasks 38-40; needs post-0019 re-bench, the pre-SSM regression is stale) |
| 3 | GDN-area fusion + occupancy (gdn A-D: row-local reduction, raise launch_bounds occupancy, fold gate/l2norm/softplus into the recurrence) | GDN +19 + glue +23 | ~25-40 | ~375-388 | ~96-99% | MED-LOW (real kernel rewrite + numeric re-validation) |
| 4 | conv-state in-place + conv fuse (explore L4, the proven 0018/0019 pattern on `ssm_conv`/concat) | part of glue, 48 launches/step | ~5-10 | ~388-395 | ~99-101% | HIGH (bit-exact, proven pattern) |
| - | between-step host gap / cgraph reuse | ~2 ms/step | ~2 | +~0.4% | n/a | LOW value (cleanup, not a parity lever) |
| x | CUDA graphs | - | 0 | already on | n/a | NOT a lever (+6% even for vLLM) |
| x | TMA weight-feed / NVFP4-dense weight-quant | prefill / npl1 | 0 at npl128 | n/a | n/a | MIS-SCOPED for batch-128 decode (prefill / low-batch levers; prefill already +12.7%) |
Note on Lever 1+2 coupling: routing the o_proj to MMQ ADDS one activation-quant (q8_1/NVFP4) per
o_proj, so Lever 2 (fusing that quant into the preceding `build_norm_gated`) compounds Lever 1
rather than overlapping it. Lever 3's "glue +23 ms" and Lever 1's quant are the same elementwise
passes vLLM folds into its packed kernel, so 2+3 share surface - treat the estimates as a band,
not a sum.
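
The cumulative ceilings in the table follow from converting aggregate throughput to a per-step time, subtracting the estimated recovery, and converting back. A minimal sketch of that arithmetic (the helper names are hypothetical; the recovered milliseconds are the table's estimates, not measurements):

```cpp
#include <cassert>

// decode_agg (t/s) = npl / step_s, so step_ms = 1000 * npl / tps.
inline double step_ms_from_tps(double npl, double tps) {
    assert(tps > 0.0);
    return 1000.0 * npl / tps;
}

inline double tps_from_step_ms(double npl, double step_ms) {
    assert(step_ms > 0.0);
    return 1000.0 * npl / step_ms;
}

// Projected throughput if recovered_ms is removed from every decode step.
inline double tps_after_recovery(double npl, double tps, double recovered_ms) {
    return tps_from_step_ms(npl, step_ms_from_tps(npl, tps) - recovered_ms);
}
```

For example, `tps_after_recovery(128, 256.6, 105)` lands inside the ~320-330 t/s band quoted for Lever 1.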
### 4. Verdict: is true decode parity reachable?
**Yes, parity is reachable in software, and the residual is NOT a hardware/architecture floor.**
Proof of "not a floor": both engines read identical NVFP4 weights and read+write identical f32
recurrent state = identical 55.5 GB/step DRAM floor (203 ms) on the identical GB10 LPDDR5x; vLLM
achieves 62% bandwidth utilization (327 ms/step) where llama achieves 40% (499 ms/step). The 1.54x
throughput gap equals the 1.55x utilization gap, and that utilization gap is now LOCALIZED to
specific llama kernels - chiefly the o_proj MMVQ - every one of which is closable in software. The
GDN recurrence (the supposed floor) is only +11%/call between the two engines.
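
The floor and utilization figures in this argument can be re-derived with a few lines of arithmetic. This is a sketch using only the numbers quoted above (55.5 GB/step, 273 GB/s, 499 and 327 ms/step); the function names are hypothetical.

```cpp
#include <cassert>

// Time to move gb_per_step at the full bandwidth bw_gb_per_s, in milliseconds.
inline double dram_floor_ms(double gb_per_step, double bw_gb_per_s) {
    assert(bw_gb_per_s > 0.0);
    return 1000.0 * gb_per_step / bw_gb_per_s;
}

// Fraction of the bandwidth floor actually achieved by a measured step time.
inline double bandwidth_utilization(double floor_ms, double step_ms) {
    assert(step_ms > 0.0);
    return floor_ms / step_ms;
}
```

With these inputs the floor comes out near 203 ms, llama at about 0.41 and vLLM at about 0.62, so the utilization ratio tracks the step-time ratio.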
How far each tier reaches:
- The first ~84% of parity (256 -> ~325) is nearly FREE: Lever 1 is a 2-line reshape that moves
the GDN output projection from a per-sequence FP4 GEMV to a tensor-core M=128 FP4 GEMM, bit-exact,
no new kernel (MMQ already runs the in-projection at this exact shape and type).
- ~84% -> ~92% (Levers 1+2) is low-effort: the fused act-quant producer already exists (tasks
38-40), it just needs a post-0019 re-bench because its pre-SSM regression was measured when the
GPU was 99% busy on the now-removed state-copy chain (no idle to reclaim then; real idle now).
- ~92% -> ~100% (Levers 3+4) is the diminishing-returns tail and the only genuinely HARD work:
matching vLLM's fully-fused `packed_decode` GDN kernel (row-local reductions, higher occupancy,
folding the gate/l2norm/softplus elementwise passes into the recurrence). This last ~8% is "hard
but not floored" - it is kernel engineering, not a hardware wall.
**Single highest-value next step (do this first):** apply Lever 1 - collapse `final_output` to 2D
`[head_v_dim*num_v_heads, n_seq_tokens*n_seqs]` BEFORE the `ssm_out` matmul (drop the now-redundant
post-matmul `reshape_2d`):
```cpp
// route the GDN output projection through tensor-core MMQ at decode:
// M = n_seq_tokens*n_seqs (=128 at decode) instead of ne[1]=1 -> MMVQ GEMV. Free, bit-exact.
ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm,
head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cur = build_lora_mm(model.layers[il].ssm_out, final_output); // now [n_embd, n_tokens], M=128 MMQ
```
Then the profiler re-measures the realized o_proj-as-MMQ cost on a clean post-0019 nsys (the one
number this synthesis estimates rather than measures) and confirms the 256 -> ~320-330 lift. The
same 3D-input-matmul pattern almost certainly affects the MoE checkpoint (q36-35b-a3b) decode and
any other matmul that consumes a tensor still in the `[feat, 1, n_seqs]` SSM layout - grep those
and apply the same collapse. Levers 2-4 follow in priority order; none requires a model or accuracy
compromise, so bit-exactness is preserved throughout.
### Evidence (this section)
- Source read (DGX `~/llama-paged-dev`, read-only): `src/models/qwen3next.cpp` (GDN in/out proj
layout, lines ~286-305 and ~518-528), `ggml/src/ggml-cuda/ggml-cuda.cu:2553` (MMVQ dispatch on
`ne[1]<=8`), `ggml/src/ggml-cuda/mmvq.cuh:3` (`MMVQ_MAX_BATCH_SIZE 8`), `mmq.cu:267` (NVFP4 is
MMQ-supported).
- All five prior sections of this doc + the profiler's `~/bench/postssm_decomp/` traces.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,184 +0,0 @@
# F16/BF16 Glue Probe - the dense decode residual to vLLM
Question: dense decode parity sits at llama 384.6 vs vLLM 418.8 t/s @ npl128 = 91.8%.
The 49% SSM recurrence (f32 BOTH engines) and the 27% NVFP4 GEMM (W4A4 BOTH) are
precision-matched. The residual ~8% may be partly that llama runs the NON-recurrence
GLUE (attention, norms, activations, elementwise, residual stream) in F32 while vLLM
runs the model in BF16. This probe settles, empirically on q36-27b-nvfp4 @npl128, how
much of that residual is realistically f16/bf16-closable.
Model: Qwen3.5-27B NVFP4 (dense). 64 layers = 16 attention + 48 gated-DeltaNet
(SSM) recurrent. Build b104-f7409c2 (patch 0023), verified git-clean and coherent.
The bf16 SSM work was never applied to the tree (only saved as a diff backup);
ggml-cuda needed no recompile on rebuild, so the binary is bit-identical to clean 0023.
## (1) Current KV / state dtype (SETTLED)
From the `-v` init log:
- ATTENTION KV cache (16 of 64 layers):
`K (f16): 1280 MiB, V (f16): 1280 MiB` => **DEFAULT IS ALREADY F16.**
- RECURRENT cache (48 gated-DeltaNet layers):
`R (f32): 180 MiB` (conv state), `S (f32): 4608 MiB` (SSM state) => **f32.**
Consequence: the attention KV is ALREADY at vLLM's 16-bit bit-width. `--cache-type f16`
is a literal no-op; the cheap KV lever is spent. The f32 lives in (a) the recurrent
SSM/conv state (matched to vLLM, the bf16 version is shelved for failing the f32 KL
gate) and (b) the intermediate-activation glue (norms, residual stream, attention
compute, activations) - that glue is where llama still pays f32 vs vLLM bf16.
## (2) Decode kernel budget (nsys --cuda-graph-trace=node, npl128, 39 steady steps)
step span 342.0 ms ; sum-of-kernels 338.8 ms ; **kern/span 99.0%** - the decode is
GPU-bound, kernels back-to-back, nsys overhead negligible. The measured bench step
(128 tok / 373.5 t/s = 342.8 ms) equals the nsys span, so the %-of-step figures below
ARE wall-time fractions.
OUT of scope - already precision-matched (83.2% of the step):
| kernel | ms/step | % |
|---|---:|---:|
| gated_delta_net (SSM recurrence, f32 BOTH) | 167.1 | 49.3 |
| mul_mat_q NVFP4 (W4A4 GEMM, BOTH) | 93.0 | 27.4 |
| quantize_mmq_nvfp4 (FP4 act-quant) | 17.6 | 5.2 |
| mul_mat_q stream_k fixup (FP4 reduction) | 4.1 | 1.2 |
F16-ABLE GLUE - f32 in llama, bf16 in vLLM:
Budget A (clean compute glue, decoupled from the f32 state):
| kernel | ms/step |
|---|---:|
| flash_attn_ext | 11.94 |
| unary_gated_op (silu) | 5.16 |
| k_bin_bcast (mul) | 4.72 |
| rms_norm | 3.58 |
| k_bin_bcast (add, residual)| 1.67 |
| l2_norm | 0.65 |
| cpy_scalar | 0.37 |
| rope | 0.26 |
| sigmoid | 0.22 |
| softplus | 0.09 |
| flash_attn fixups | 0.08 |
| **Budget A total** | **28.74 ms = 8.4% of step** |
Budget B (+ the non-FP4 cublas GEMM): + nvjet 12.17 ms => **40.91 ms = 12.0%**.
Recurrence-coupled data movement (NOT bit-safe f16-able - needs the f32 state to go
bf16, which is the shelved work that fails the f32 KL gate):
ssm_conv 8.37 + k_get_rows_float 6.98 + k_set_rows 0.66 + gdn_gather 0.06 = 16.08 ms = 4.7%.
## (3) Cache-type A/B (decode_agg S_TG t/s, dense)
| npl | DEFAULT | F16-explicit | Q8_0 |
|---:|---:|---:|---:|
| 32 | 209.05 | 208.75 | 208.63 |
| 128 | 373.46 | 373.56 | 374.71 |
- F16-explicit == DEFAULT (0.03% delta) => proves the default KV is already f16; the
flag is a no-op.
- Q8_0 (8-bit, half the f16 KV bytes) is within noise at every npl => the attention KV
bandwidth is NOT a decode bottleneck (it is 16/64 layers; flash_attn is 3.5% of the
step). The KV-cache dtype is not a decode lever for this model.
- Coherence (48-tok greedy, "The capital of France is"): default and q8_0 both fully
coherent; q8_0 only causes minor greedy-path divergence, no quality break. But since
q8_0 buys zero speed and is not bit-exact, it is pointless here.
## Read: how much of the ~8% dense residual is f16-closable
The gap is ~27 ms/step (llama 332.8 ms vs vLLM 305.7 ms at npl128).
f16 does not zero the glue, it speeds it up. Realistic recovery:
- Memory-bound glue (norms + elementwise + activations + copies + rope = 16.7 ms):
f16 halves the bytes => ~50% => ~8.4 ms.
- flash_attn_ext (12.0 ms): KV is ALREADY f16 and the accumulation must stay f32
(vLLM also f32-accumulates), so only the Q/projection side helps => ~25% => ~3.0 ms.
- Budget A realistic recovery ~= **11.4 ms**.
- nvjet non-FP4 GEMM (12.2 ms): bf16 tensor cores vs f32 ~= 40-50% => ~5 ms, but
uncertain (may already run TF32) => Budget A + nvjet recovery ~= **16 ms**.
So f16/bf16 glue realistically recovers **~11 ms (glue only) to ~16 ms (+GEMM) of the
~27 ms gap = roughly 40-60% of the dense residual.** That moves parity 91.8% ->
~95-96%, NOT a full close. The remaining ~3-4% is structural: cublas GEMM efficiency
on the non-FP4 paths, graph/launch scheduling vs vLLM, and the irreducible f32
accumulation in attention and the recurrence.
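
The recovery estimate above is simple enough to write out. The sketch below encodes this section's assumed fractions (about 50% on the memory-bound glue, about 25% on `flash_attn_ext`, plus a separately supplied GEMM gain); those fractions are estimates from the text, not measurements, and the names are hypothetical.

```cpp
#include <cassert>

// Rough estimate only: the 0.50 / 0.25 fractions are this section's assumptions.
struct GlueRecovery {
    double glue_only_ms;  // memory-bound glue + the Q/projection side of flash-attn
    double with_gemm_ms;  // glue_only_ms plus the uncertain non-FP4 GEMM gain
};

inline GlueRecovery estimate_recovery(double mem_glue_ms, double flash_attn_ms,
                                      double gemm_gain_ms) {
    GlueRecovery r{};
    r.glue_only_ms = 0.50 * mem_glue_ms + 0.25 * flash_attn_ms;
    r.with_gemm_ms = r.glue_only_ms + gemm_gain_ms;
    return r;
}

// Parity = vLLM step time / llama step time after removing recovered_ms per step.
inline double parity_after(double llama_step_ms, double vllm_step_ms, double recovered_ms) {
    assert(llama_step_ms > recovered_ms);
    return vllm_step_ms / (llama_step_ms - recovered_ms);
}
```

Calling `estimate_recovery(16.7, 12.0, 5.0)` and feeding either result into `parity_after(332.8, 305.7, ...)` gives values between 0.95 and 0.97, consistent with the ~95-96% ceiling stated here.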
Caveats for a build decision:
1. The single largest f16-able line (flash_attn 11.9 ms) is the LEAST recoverable
(KV already f16, accumulate stays f32). The cleanly recoverable mass is the
norms+elementwise+activations (~16.7 ms).
2. The recurrence-coupled 4.7% (ssm_conv + state gather) is only f16-able by taking the
SSM/conv state to bf16 = the already-built, already-shelved work that fails the f32
KL gate. It is OUT of a bit-safe f16 build.
3. f16 glue is NON-bit-exact (same category as the shelved bf16 SSM state). It would be
an OPT-IN fast path, not the bit-exact default. Realistic ceiling ~95-96% parity for
a meaningful (norms/elementwise/activations + optionally nvjet) f16 conversion, at
the cost of leaving the 95%-bit-exact f32 plateau.
## (4) What it costs to capture it: NOT a flag (source map, read-only)
The asymmetry confirmed at the source level (DGX `~/llama-paged-dev` @ f7409c2, tree
git-clean; vLLM ref from BITEXACT_VS_VLLM.md):
- vLLM `text_config.dtype = bfloat16` => the ENTIRE non-quantized compute (residual
stream, RMSNorm I/O with f32-internal reduction, FlashAttention out, SiLU, gating,
conv state) runs in BF16. Only the gated-DeltaNet temporal SSM state is f32
(`mamba_ssm_dtype="float32"`, matched to llama).
- llama's intermediate activations are F32 **by construction, everywhere**:
`ggml_mul_mat` hardcodes an F32 result (ggml.c:3250), so the stream snaps back to F32
after EVERY projection (Q/K/V/O, wqkv, ssm in/out, ffn up/gate/down, eh_proj, lm_head).
`ggml_rms_norm`/`ggml_l2_norm`/`ggml_silu`/`ggml_add`/`ggml_mul`/`flash_attn_ext`/
`ggml_ssm_conv` all preserve/emit F32. There is no point where the stream is f16.
There is **no vLLM-style global model-compute-dtype knob** in ggml/llama. You cannot flip
one model-load flag. Three escalating options, all opt-in / non-bit-exact:
- A flag: does not exist and cannot exist as-is - the F32 is structural, not a default.
- Option 1 (targeted per-op f16, no new kernels): silu/sigmoid/softplus (unary.cu),
add/mul (binbcast.cu), rope already have f16 paths. But the residual stream stays F32,
so each op must be wrapped cast(F16)->op->cast(F32), adding 2 `cpy` ops per op. At
decode these ops are tiny and memory-bound; the cast traffic ~= the op traffic, so the
net win is near-zero or negative unless the cast is FUSED into the producer/consumer.
Crucially this CANNOT capture the norms - the largest glue item.
- Option 2 (the real lever, multi-file code change): carry the residual stream in F16
across the layer, cast to F32 only at the quantize boundary. Requires (a) f16 projection
output (patch `ggml_mul_mat` to honor a dst-type, or a cpy->F16 after each proj),
(b) **NEW F16 template instantiations in norm.cu** for rms_norm / l2_norm / fused
rms+mul / fused rms+mul+add (today hard-`GGML_ASSERT(type==F32)` at norm.cu:441-442,
465-466, 525-527, 601-604) keeping the f32 reduction, (c) optionally an F16 ssm-conv.cu,
plus graph-dtype plumbing in qwen35.cpp / llama-graph.cpp to thread F16 through
inpL/cur/the residual adds. The single biggest code item is the norm.cu f16 kernels -
the exact band vLLM runs in bf16 that Option 1 cannot reach.
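
Why Option 1 tends not to pay off can be illustrated with a rough byte-traffic model. The sketch assumes a single-input elementwise op over `n` elements, no fusion, and that every intermediate round-trips through memory; these are simplifying assumptions for illustration, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kBytesF32 = 4;
constexpr int64_t kBytesF16 = 2;

// Plain F32 op: read n floats, write n floats.
inline int64_t bytes_f32_op(int64_t n) {
    assert(n >= 0);
    return n * kBytesF32 + n * kBytesF32;
}

// Option 1 without fusion: cast(F32->F16), op in F16, cast(F16->F32).
inline int64_t bytes_wrapped_f16_op(int64_t n) {
    assert(n >= 0);
    const int64_t cast_in  = n * kBytesF32 + n * kBytesF16;
    const int64_t op       = n * kBytesF16 + n * kBytesF16;
    const int64_t cast_out = n * kBytesF16 + n * kBytesF32;
    return cast_in + op + cast_out;
}

// Option 2 style: the stream is already F16, so the op reads and writes F16 only.
inline int64_t bytes_native_f16_op(int64_t n) {
    assert(n >= 0);
    return n * kBytesF16 + n * kBytesF16;
}
```

In this model the unfused wrapped op moves twice the bytes of the plain F32 op, while an F16 stream moves half, which is the gap between Option 1 and Option 2.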
Must-stay-f32 regardless (vLLM does the same): RMSNorm/L2Norm sum-of-squares reduction;
FlashAttention KQ/softmax accumulation (forced `GGML_PREC_F32`, llama-graph.cpp:2117);
the gated-DeltaNet recurrent SSM temporal state (f32 BOTH engines, out of scope); the
src1->q8_1/nvfp4 activation quantization reads F32, so the stream must be F32 at every
projection boundary no matter what.
## Verdict: probe-further-then-decide, leaning not-worth-it for the default
f16 does NOT meaningfully close the dense residual on its own, and what it can close is a
multi-file non-bit-exact build, not a flag.
- Precision is NOT the dominant cause of the 8% gap. 83.2% of the decode step (recurrence
49.3% + FP4 GEMM 27.4% + FP4 act-quant/fixup 6.4%) is already precision-matched f32/W4A4
on both engines. The f16-able glue is only 8.4% of the step (Budget A); of the ~27 ms
gap, f16 realistically recovers ~11 ms (glue) to ~16 ms (+ the uncertain nvjet GEMM) =
40-60% of the residual. The remaining ~3-4% is kernel/scheduling efficiency (non-FP4
cublas GEMM, graph-launch overhead, irreducible f32 accumulation) that f16 cannot touch.
- The recoverable mass is the norm+elementwise+activation band, which is precisely the
part that needs NEW f16 norm kernels (Option 2). The no-new-kernel ops (Option 1) are
too small and their cast overhead likely eats the win.
- Any version is opt-in / non-bit-exact, the same gate-failing category as the already
shelved bf16-SSM-state work. It cannot be the bit-exact f32 default; it is a second,
separately-maintained fast path with a ~95-96% ceiling.
Recommendation: do NOT build the f16 glue path now. Ship the 95%-bit-exact f32 plateau
(patches 0018-0023) as the default. If chasing the last 4% later, the only lever worth a
build is Option 2's norm.cu f16 kernels + f16 residual stream (recovers the norm/elementwise
band, ~11 ms); gate it behind an explicit opt-in flag and validate it against the same KL
threshold as bf16-SSM before shipping. The non-FP4 cublas GEMM efficiency and graph-launch
scheduling - the structural ~3-4% - are a better long-term target than precision, because
they help the bit-exact default too.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,532 +0,0 @@
# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — build-ready scope + honest go/no-go
Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** Track B is the
residual-kernel track after track A (fuse the standalone `quantize_mmq_fp4` activation-requant, the
8.2% decode bucket — tasks 38-41, the fused `rms_norm+mul+nvfp4-quant` producer + prequantized-MMQ
consumer) is handled separately. Track B owns the **weight GEMM**, the ~59% bucket.
**The load-bearing question, restated:** at the decode batch shape (M≈128 tokens fused into one
ubatch, NVFP4 weights), is the weight GEMM **compute-bound** (FP4-MMA throughput is the lever →
parity reachable with a better kernel) or **bandwidth-bound** (273 GB/s weight-read is a hard floor →
parity capped)? And given the GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach
vLLM's **391 (dense) / 811 (MoE)** decode-agg tok/s @npl128, or only partway?
Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x.
Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the
committed nsys decomposition + measured GB10 specs + a source read of the FP4-MMA kernel; **no new GPU
benchmarks were run** (track A is on the box).
## 0. Grounded inputs (measured, committed)
| quantity | value | source |
|---|---|---|
| LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` |
| LPDDR5x bandwidth (achieved, batch-1 weight read) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 study |
| FP4 (NVFP4/MXFP4) dense peak | **~427-500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 |
| BF16 / INT8 peak | ~213 TFLOP/s / ~215 TOPS (INT8 == BF16 on GB10) | same §2 |
| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); **~3% dense / ~35%-of-BW MoE at decode** | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file; ~18 GB matmul tensors | `du` on DGX |
| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step @npl128 (~98% experts hit) | `du` on DGX |
| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
| Measured per-step @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
| Aggregate decode @npl128 (the parity scoreboard) | dense **161** (llama) vs **391** (vLLM); MoE **333** vs **811** | `QWEN36_NVFP4_BENCH.md` |
`decode_agg = npl / step_s = 128 / step_s`. Crossover formula throughout:
`M* = b · peak / (2 · BW)`, `b` = bytes per weight element. Below `M*` bandwidth-bound, above it
compute-bound.
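
The crossover formula can be evaluated directly. The sketch below is a hypothetical helper; the example inputs (about 18 GB of matmul weights for a 27e9-parameter model, 273 GB/s, and an assumed 500e12 FLOP/s FP4 peak) are taken from the inputs quoted in this scope and are illustrative.

```cpp
#include <cassert>

// M* = b * peak / (2 * BW): batch size at which a weight GEMM moves from
// bandwidth-bound (M < M*) to compute-bound (M > M*).
inline double crossover_m(double bytes_per_weight, double peak_flops, double bw_bytes_per_s) {
    assert(bw_bytes_per_s > 0.0);
    return bytes_per_weight * peak_flops / (2.0 * bw_bytes_per_s);
}

inline bool is_bandwidth_bound(double m, double m_star) {
    return m < m_star;
}
```

With `b = 18e9 / 27e9` the crossover is roughly 610, so a decode batch of 128 sits on the bandwidth-bound side for an ideal kernel.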
---
## 1. The kernel-approach decision: TUNE the existing FP4-MMA `mul_mat_q`, do NOT write a cutlass kernel
This is the first thing track B must settle, and the evidence settles it decisively.
| option | verdict | why |
|---|---|---|
| **(A) Tune the existing `mul_mat_q<NVFP4>` FP4-MMA path** | **CHOSEN — the tractable spine** | The kernel already exists, is **bit-exact** (`test-backend-ops MUL_MAT` 1103/1103), is genuine **W4A4** (below), and already **beats vLLM at batch-1 prefill** (MXFP4 1153 t/s vs vLLM's 800 W4A16 — vLLM has no FP4 cubins on sm_121). The deficit is **decode-shape scheduling**, not the math op. Host-side selection + a bounded occupancy tune respects the GB10 lessons and is build-ready against known files/lines. |
| **(B) New cutlass-style SM120 FP4 collective** | **REJECTED** | Repeats the **proven GB10 dead-end**: the from-scratch W4A16 BF16 GEMM hit only ~9-15 TFLOP/s (¼ of MMQ) and was **STOPPED** (`W4A16_MARLIN_KERNEL_PLAN.md`) because deep `cp.async` + XOR-swizzle **collapse GB10 occupancy**. Worse, **CUTLASS's own SM120 grouped block-scaled FP4 GEMM is broken on consumer Blackwell** (garbage/init-fail; CUTLASS #3096/#2800), which is the exact reason vLLM falls back to **BF16 Marlin** for its MoE on sm_121. "Port cutlass" is not even a working option for the MoE arm. |
| **(C) Marlin-style W4A16 (FP4→BF16 dequant + BF16 HMMA)** | **REJECTED for the win, noted for context** | This is what **vLLM's MoE actually runs** on sm_121 (W4A16, BF16 activations, dequant-in-mainloop). On GB10 **INT8 == BF16 == ½ FP4 rate**, so a BF16-HMMA path concedes the 2× FP4 advantage llama already has. We do not want to *descend* to vLLM's slower arithmetic class; we want to keep the FP4-MMA class and schedule it better. |
**Decision: track B = tune `mul_mat_q<NVFP4>` (dense, `mmq.cu`/`mmq.cuh`) + the grouped `mul_mat_q`
id-branch (MoE, `mmid.cu` + the same `mmq.cuh`).** No new kernel, no rewrite, no descent to BF16.
The win is kernel *engineering around an FP4-MMA llama already possesses*, so there is **no
hardware-instruction wall** — but it is gated by whether MMQ's occupancy-bound design can be pushed
to the bandwidth floor at the thin decode M-tile.
### What "the existing path" actually is (source-read, DGX `ggml/src/ggml-cuda/`)
Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens fused into one
ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128× M=1). The NVFP4 path:
`mmq.cu` `use_native_fp4` gate (L125) → `quantize_mmq_fp4_cuda` act-quant (L138 dense / L200 id;
**track A's fuse target**) → `mul_mat_q` → `vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) →
`mma_block_scaled_fp4` (`mma.cuh:1126`).
**Confirmed W4A4 (this corrects an earlier "A is 8-bit-class" framing):** `block_fp4_mmq`
(`mmq.cuh:53`) is `uint32_t d4[4]` (four `ue4m3` block scales) + `int8_t qs[4*32]` = **256 FP4 (e2m1)
values packed 2-per-byte**. `quantize_mmq_fp4_cuda` (`quantize.cu:422`) emits FP4 via
`ggml_cuda_float_to_fp4_e2m1`. The MMA is
`mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`
(`mma.cuh:1145`) — **both operands e2m1, ue4m3 block scales**. So llama's dense FP4-MMA path is
already the *same arithmetic class as vLLM's cutlass W4A4 dense*. The `sizeof(block_fp4_mmq) ==
sizeof(block_q8_1_mmq)` static_assert is a shared-tile-footprint convention, **not** an 8-bit
activation. **Consequence: there is no "make activations 4-bit" work to do and no activation-traffic
halving to win — that is already banked. The entire dense deficit is scheduling/occupancy.**
Geometry (`vec_dot_fp4_fp4_mma`): `MMQ_NWARPS=8`, `iter_k=MMQ_ITER_K_FP4=512`, tiles
`tile_A<16,8,int>` (weights, 16 N-rows × 64 FP4-in-K), `tile_B<8,8,int>` (acts, 8 M-cols × 64
FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags = MMQ_TILE_NE_K/tile_A::J`. The M loop
is `for (j0=0; j0<mmq_x; j0 += ntx*tile_C::J)` — M tiled in steps of `tile_C::J=8`.
---
## 2. The roofline — answering the load-bearing question
**Answer: BANDWIDTH-bound on the hardware roofline, but COMPUTE-bound in practice by the kernel's own
under-occupancy. The 273 GB/s is NOT the wall at the parity target.**
### 2a. DENSE Qwen3.6-27B, M=128
`b = 18e9/27e9 = 0.667 B/param`; FLOPs/step `= 2·128·27e9 = 6.91 TFLOP`.
- **Weight-read floor** (18 GB read ONCE for all 128 tokens): @273 GB/s = **65.9 ms → 1,942 tok/s**;
@216 GB/s = 83 ms → 1,542 tok/s.
- **Crossover** at FP4 peak: `M* = 0.667·500e12/(2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM
at decode is BANDWIDTH-bound.** At the kernel's *achieved* ~3% efficiency the effective peak
collapses to ~14.7 TFLOP/s and drags M* to ≈18 (`0.667·14.7e12/(2·273e9)`), putting the *current*
kernel in self-inflicted compute-bound territory.
- **Where llama sits:** GEMM = 59.2% × 795 ms = **471 ms = 14.7 TFLOP/s = 2.9% of FP4 peak = 7.1×
slower than the 66 ms weight-read floor.** Not a bandwidth wall — a kernel running deep in
compute-bound territory at single-digit efficiency.
- **Where vLLM sits:** its whole step is 328 ms, shorter than llama's GEMM bucket alone (471 ms). The
**entire 2.42× gap is the GEMM.**
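The roofline arithmetic in this subsection can be reproduced with three small helpers. This is a minimal sketch; the function names are hypothetical, and the inputs (18 GB of weights, 27e9 parameters, 273 GB/s, 500 TFLOP/s FP4 peak, 471 ms GEMM bucket) are the figures quoted above.

```cpp
// Time in milliseconds to stream the weight set once at a given bandwidth.
inline double weight_read_floor_ms(double weight_bytes, double bw_bytes_per_s) {
    return weight_bytes / bw_bytes_per_s * 1e3;
}

// Batch size M* at which compute time equals weight-read time:
//   2 * M * params / peak_flops == bytes_per_param * params / bw
inline double crossover_m(double bytes_per_param, double peak_flops, double bw_bytes_per_s) {
    return bytes_per_param * peak_flops / (2.0 * bw_bytes_per_s);
}

// Achieved TFLOP/s of a GEMM bucket that processes M tokens in gemm_ms.
inline double achieved_tflops(double m, double params, double gemm_ms) {
    return 2.0 * m * params / (gemm_ms * 1e-3) / 1e12;
}
```

Calling `weight_read_floor_ms(18e9, 273e9)` gives the ~66 ms dense floor, `crossover_m(18e9 / 27e9, 500e12, 273e9)` gives the ~611 crossover, and `achieved_tflops(128, 27e9, 471)` gives the ~14.7 TFLOP/s datum.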
### 2b. MoE Qwen3.6-35B-A3B, M=128
@npl128, 128 tok × top-8 / 256 experts ⇒ ~98% experts read ⇒ ~22 GB/step (the full weight set), per-
expert M ≈ **4 tokens**.
- **Weight-read floor:** 22/273 = **80.6 ms → 1,588 tok/s** (@216: 102 ms → 1,255).
- **Compute floor:** only ~3B active params ⇒ 0.77 TFLOP ⇒ 1.5 ms @peak — **trivial. MoE decode is
purely bandwidth/occupancy-bound, never compute-bound.** The hard part is saturating 273 GB/s while
feeding ragged M≈4 tiles.
- **Where llama sits:** GEMM = 59% × 384 = **227 ms = 97 GB/s = 35% of peak BW** (occupancy/tile-fill
loss, not compute).
- **Where vLLM sits:** step 158 ms ≈ grouped Marlin-NvFp4 at the ~80 ms floor + ~78 ms non-GEMM —
already pushing the MoE BW floor.
**Both weight-read floors (dense ~1,940, MoE ~1,590 tok/s) sit well ABOVE vLLM's 391/811 (about 5× for
dense, 2× for MoE). Bandwidth is not the wall; the GB10 FP4-MMA occupancy efficiency is.**
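The "~98% of experts read" and "M ≈ 4 per expert" figures follow from a simple routing model. The sketch below assumes uniform, independent top-k routing per token, which is an idealization and not something the measurements above establish; the function names are hypothetical.

```cpp
#include <cmath>

// Expected fraction of experts touched by at least one token, assuming each
// token picks top_k distinct experts uniformly at random (an idealization).
inline double expected_expert_coverage(int n_tokens, int top_k, int n_experts) {
    const double miss_one_token = 1.0 - static_cast<double>(top_k) / n_experts;
    return 1.0 - std::pow(miss_one_token, n_tokens);
}

// Mean number of tokens routed to each expert.
inline double mean_tokens_per_expert(int n_tokens, int top_k, int n_experts) {
    return static_cast<double>(n_tokens) * top_k / n_experts;
}
```

With 128 tokens, top-8 routing and 256 experts, the coverage comes out just above 98% and the mean density is exactly 4 tokens per expert, which is why nearly the full weight set is streamed while each expert tile stays almost empty.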
---
## 3. The code-level inefficiencies, and the M-tile asymmetry that drives the whole plan
The selection is `mul_mat_q_case` (`mmq.cuh:4108`): it loops `mmq_x = 8..mmq_x_max(=128) step 8` and
keeps the `mmq_x` that **minimizes `ntiles_x = ceil(ncols_max/mmq_x)`**, stopping at `ntiles_x==1`.
`mmq_y` (the weight-row tile) is pinned at **128** by `get_mmq_y_host` (L143). This produces the
single most important structural fact for track B:
> **`mmq_x` tiles M (tokens / output columns) — shrinking it RE-READS the weights `ntiles_x` times.
> `mmq_y` tiles N (weight rows / output rows) — shrinking it does NOT re-read weights (each weight row
> lives in exactly one row-tile); it only lowers shared footprint and raises occupancy.** The two
> regimes pick opposite knobs:
| | dense decode (M=128, no `expert_bounds`) | MoE decode (per-expert M≈4) |
|---|---|---|
| selection picks | `mmq_x=128` → `ntiles_x=1` → **weights read ONCE** (the one-read optimum) | `mmq_x=128` applied **per expert** → tile ~3% filled |
| shrink `mmq_x`? | **NO — re-reads 18 GB ×`ntiles_x`**, fatal in the BW-bound regime | **YES, FREE** — 1 col-tile/expert regardless, no re-read → strictly occupancy-positive |
| FP4-MMA M-frag fill | **full** (128/`tile_C::J`=16 frag-groups, all live) → no fragment waste | **wasted** (~1 of 8/16 frag-groups live, rest masked tails) |
| BW-neutral occupancy lever | **`mmq_y`↓** (more resident CTAs, weights still read once) — kernel-structure change | **`mmq_x`↓** (toward density ≈8) — host-side template switch |
| dominant loss | **occupancy** at the heavy 128×128 tile (exposed weight-load latency) | **tile-fill** (dense-tuned M-tile applied to ragged M≈4) |
This asymmetry is the spine of the plan: **MoE's lever is host-only `mmq_x`↓ (already landed as patch
0015 auto-cap→64; ideal ≈8-16); dense's lever is `mmq_y`↓ + occupancy, a bounded kernel change.**
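The selection rule and the re-read cost of shrinking `mmq_x` can be sketched on the host. This is a simplified illustration, not the code in `mul_mat_q_case`: the real selection also checks granularity and shared-memory limits, and `pick_mmq_x` and `dense_weight_bytes_read` are hypothetical names.

```cpp
#include <climits>

inline int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Walk mmq_x upward in steps of 8, keep the value that yields the fewest
// column tiles, and stop as soon as a single tile covers all columns.
inline int pick_mmq_x(int ncols_max, int mmq_x_max = 128) {
    int best = 0;
    int best_ntiles = INT_MAX;
    for (int mmq_x = 8; mmq_x <= mmq_x_max && best_ntiles > 1; mmq_x += 8) {
        const int ntiles_x = ceil_div(ncols_max, mmq_x);
        if (ntiles_x < best_ntiles) {
            best = mmq_x;
            best_ntiles = ntiles_x;
        }
    }
    return best;
}

// Dense case: every column tile streams the full weight set again.
inline double dense_weight_bytes_read(double weight_bytes, int m, int mmq_x) {
    return weight_bytes * ceil_div(m, mmq_x);
}
```

For M=128 the loop ends at `mmq_x=128` with one column tile, and forcing `mmq_x=64` doubles the dense weight traffic from 18 GB to 36 GB per step, which is the cost the dense column of the table refuses to pay.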
The five inefficiencies, ranked:
1. **Separate activation-quant pass (track A's bucket, 8.2%).** `quantize_mmq_fp4_cuda` writes the
whole activation tensor to `block_fp4_mmq` in a standalone kernel; vLLM fuses `scaled_fp4_quant`
into the preceding RMSNorm/SiLU epilogue. **Handoff (track A → B):** B must consume A's prequantized
`block_fp4_mmq` y-tile in place of calling `quantize_mmq_fp4_cuda`, so the fusion saves the
activation round-trip, not just the launch (see §4.4).
2. **No weight-load software pipeline → exposed latency at thin M (the #1 dense kernel lever).**
`load_tiles_nvfp4_nvfp4` (`mmq.cuh:946`) does plain global→shared stores → `__syncthreads` →
`vec_dot_fp4_fp4_mma` (`load_ldmatrix` of A + MMA): a **load→sync→compute→repeat** cadence with **no
`cp.async` double-buffering** overlapping the next k-block weight load with the current MMA. At
M=128 the per-tile MMA work is small, so serialized weight-load latency dominates → 2.9% (dense) /
35%-of-BW (MoE). **Caveat (the GB10 wall):** a *deep* pipeline + XOR-swizzle collapses GB10
occupancy (`W4A16_MARLIN_KERNEL_PLAN.md`). The fix is **occupancy-first** (raise resident CTAs to
hide latency via CTA-parallelism), **shallow 2-stage prefetch second**, never Marlin's 4-stage.
3. **`mmq_x` maximized for dense = occupancy-heavy, but pinned by the one-read constraint.** At dense
decode the 128×128 tile (8 warps, large shared) is low-occupancy on the occupancy-dominated GB10 —
but you cannot shrink `mmq_x` without doubling the 18 GB weight read. So the dense occupancy fix is
**`mmq_y`↓** (BW-neutral), not `mmq_x`↓.
4. **MoE per-expert M-tile waste (the structural MoE gap).** The 128-wide (or patch-0015 64-wide)
tile is applied per expert at density ≈4, so the accumulator is ~3-6% filled and ~1 `tile_C`
frag-group is live, the rest masked `need_check` tails. Ideal `mmq_x` ≈ tokens/expert ≈ 8 (= `tile_C::J`).
At ≤1 col-tile/expert this costs **no** extra weight read → strictly occupancy-positive. (This is
the MoE arm of inefficiency 3; scoped in `MOE_GROUPED_GEMM_SCOPE.md`.)
5. **`iter_k=512` (FP4) couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → larger
shared footprint → adverse in the occupancy-bound regime. A P2 tuning knob.
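The MoE tile-fill numbers in item 4 reduce to two ratios. The helpers below are a hypothetical sketch of that arithmetic, taking `tile_C::J = 8` from the geometry quoted earlier.

```cpp
#include <algorithm>

// Fraction of the M-tile that holds real tokens for one expert.
inline double tile_fill_fraction(int tokens_per_expert, int mmq_x) {
    return std::min(1.0, static_cast<double>(tokens_per_expert) / mmq_x);
}

// Number of tile_C column groups the tile is split into.
inline int total_frag_groups(int mmq_x, int tile_c_j = 8) {
    return mmq_x / tile_c_j;
}

// Number of those groups that contain at least one real token.
inline int live_frag_groups(int tokens_per_expert, int mmq_x, int tile_c_j = 8) {
    const int covered = std::min(tokens_per_expert, mmq_x);
    return (covered + tile_c_j - 1) / tile_c_j;
}
```

At density 4 the fill is 3.1% for a 128-wide tile and 6.25% for a 64-wide tile, with one live group out of 16 or 8 respectively; at `mmq_x=8` the single group is the whole tile.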
**Ruled out (do not chase):** redundant weight reads on the *current* selection (none — dense
`ntiles_x=1`, MoE ≤1 col-tile/expert); stream-K fixup (it *helps* fill the small GB10 grid at thin M);
raw FP4-MMA peak rate (already beats Q4-MMQ and is BW-bound at batch 1 — latency-hiding binds first).
---
## 4. The specific build-ready changes
All against DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`. Every change is gated and defaults to exact
stock behavior until proven.
### 4.1 Dense M-tile / occupancy (the make-or-break)
- **Keep `mmq_x=128` at dense decode** (the one-weight-read optimum; do **not** shrink it — that
re-reads 18 GB). Lock this as an invariant in P0.
- **Make `mmq_y` decode-selectable** (`get_mmq_y_host`/`get_mmq_y_device`, L143/L157). Today pinned
128; try **64** (and 96) at decode. `mmq_y` is coupled to `nwarps × tile_C::I` via the MMQ
static_assert, so this is a **warp/fragment remap** (bounded kernel change), not a pure host switch:
fewer N-frags per warp or fewer warps → smaller per-CTA shared → **more resident CTAs → latency
hidden by CTA-parallelism**, with **weights still read once** (BW-neutral). This is the primary
dense occupancy lever and respects every GB10 rule.
- **Host-only knobs first (P1, zero kernel):** the `mmq_get_granularity_host` choice (L274 — sets
`rows_per_warp=2·granularity`, `ntx`), and the stream-k-vs-xy-tiling threshold (`launch_mul_mat_q`
~L3954, `tiles_efficiency_percent` L4001). Plus one **empirical A/B**: does eating a 2× weight
re-read at `mmq_x=64` buy enough occupancy to net positive? (Diagnostic: if yes, occupancy is badly
broken and P2 `mmq_y`↓ has large upside; if no, the tile is already BW-saturated and P2's ceiling is
lower.) All behind `GGML_CUDA_FP4_MMQ_Y` / `GGML_CUDA_FP4_GRAN` / `GGML_CUDA_FP4_FORCE_STREAMK`.
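One way the "gated, defaults to stock" rule could look for the `mmq_y` knob is sketched below. Only the environment variable name comes from the text; `parse_mmq_y` and `mmq_y_from_env` are hypothetical helpers, and the accepted set {64, 96, 128} is an assumption based on the values the plan proposes to try.

```cpp
#include <cstdlib>

// Parse a candidate mmq_y value. Anything unset, malformed, or outside the
// assumed allowed set falls back to the stock value of 128.
inline int parse_mmq_y(const char * value, int fallback = 128) {
    if (value == nullptr || *value == '\0') {
        return fallback;
    }
    char * end = nullptr;
    const long v = std::strtol(value, &end, 10);
    if (*end != '\0') {
        return fallback; // trailing garbage: ignore the override
    }
    return (v == 64 || v == 96 || v == 128) ? static_cast<int>(v) : fallback;
}

// Read the override once from the environment.
inline int mmq_y_from_env() {
    return parse_mmq_y(std::getenv("GGML_CUDA_FP4_MMQ_Y"));
}
```

Falling back on any unexpected value keeps the unset and mis-set cases on the stock path, which is what the byte-identical-when-unset requirement needs.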
### 4.2 FP4-MMA fragment usage
- Fragments stay `tile_A<16,8,int>` / `tile_B<8,8,int>` / `tile_C<16,8,float>` — these match the
`m16n8k64` block-scaled FP4 MMA and must not change (they are the instruction shape). At dense M=128
all 16 `tile_C::J`-groups are live → **no dense fragment work needed**. The lever is *how many of
these tiles are resident per SM* (occupancy), set by `mmq_y`/`nwarps`/granularity, not the fragment
shape.
- MoE: shrink `mmq_x` toward `tile_C::J`=8 so the live frag-group count matches density (§4.3).
### 4.3 MoE M-tile (`MOE_GROUPED_GEMM_SCOPE.md`, partly landed)
- **Patch 0015 already auto-caps `mmq_x`→64 at decode** via per-expert density in `mul_mat_q_case`
(the `expert_bounds != nullptr` block, L4118-4165; env `LLAMA_MOE_DECODE_TILE`,
`LLAMA_MOE_DENSITY_MAX`). Tighten the decode tile toward **8-16** (= density) and sweep.
- **Optional [2]: block-padded `mm_ids_helper`** (`mmid.cu`) — pad each expert segment to a multiple
of the tile, removing `need_check` masked tails and tightening the stream-k schedule. Medium risk
(scatter + write-back masking); behind `LLAMA_MOE_BLOCK_ALIGN`.
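The block-padding idea can be illustrated with a host-side offset computation. This is a sketch of the layout arithmetic only, under the assumption that each expert segment is rounded up to a whole number of tiles; `padded_expert_offsets` is a hypothetical name and does not describe what `mm_ids_helper` does today.

```cpp
#include <cstddef>
#include <vector>

// Start offset of each expert segment after padding every segment up to a
// multiple of `tile`. The last entry is the total padded row count.
inline std::vector<int> padded_expert_offsets(const std::vector<int> & tokens_per_expert,
                                              int tile) {
    std::vector<int> offsets(tokens_per_expert.size() + 1, 0);
    for (std::size_t e = 0; e < tokens_per_expert.size(); ++e) {
        const int padded = (tokens_per_expert[e] + tile - 1) / tile * tile;
        offsets[e + 1] = offsets[e] + padded;
    }
    return offsets;
}
```

With counts `{0, 1, 2, 9}` and a tile of 8 the segments start at rows 0, 0, 8 and 16 and the buffer ends at row 32. Every tile then begins on a segment boundary, which is what lets the masked tails go away, at the price of scattering into and masking the padding rows on write-back.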
### 4.4 Scale handling + the act-quant fusion handoff (the track A → B ABI contract)
- **Weight scales** (`ue4m3`, one per 16 weights) load in `load_tiles_nvfp4_nvfp4` into `x_sc`
(`x_u32 + 64 + kbx`), consumed as `scaleA` in `vec_dot_fp4_fp4_mma` and passed as the block-scale
operand to `mma_block_scaled_fp4`. **No change** — already a first-class MMA scale operand.
- **Activation scales** (`ue4m3`) live in the `block_fp4_mmq` y-tile `d4[4]`, consumed as `scaleB`.
- **The handoff contract:** track B must hold the **`block_fp4_mmq` y-tile layout invariant**
(`uint32_t d4[4]` ue4m3 scales + `int8_t qs[128]` = 256 packed FP4, `mmq.cuh:53`). Track A's fused
`rms_norm+mul+nvfp4-quant` producer (task 39) writes exactly this struct; track B's "prequantized
MMQ consumer" (task 40) makes `mul_mat_q` accept a prebuilt `src1_q8_1` buffer and **skip the
`quantize_mmq_fp4_cuda` call** (`mmq.cu:138`/`200`). The numerics must be **bit-identical** to the
unfused path (same `e2m1` rounding, same `ue4m3` block scale per 16) so the parity gate stays green
with the fusion on or off. B owns the consumer seam; A owns the producer kernel; the `block_fp4_mmq`
struct is the frozen interface between them.
### 4.5 GB10-fit rules (binding constraints on every kernel change)
- **Small shared mem + high occupancy.** Do **not** add deep `cp.async` stages or XOR-swizzle shared
layouts — they are exactly what collapsed W4A16 on GB10 (`W4A16_MARLIN_KERNEL_PLAN.md`: a 16 KB
XOR-swizzle dropped q4_K from 6.63→2.84 TFLOPS).
- **Preserve the skew-pad** (`MMQ_MMA_TILE_X_K_FP4 = 2·MMQ_TILE_NE_K + 8 + 4`, the `% 8 == 4`
padding, `mmq.cuh:221/233`) — conflict-free `ldmatrix` at ~zero shared cost.
- **Stay on the FP4-MMA path** (`block_fp4_mmq` / `mma_block_scaled_fp4`) — the only path at GB10's
FP4 = 2× INT8/BF16 rate. Never descend to BF16/INT8 (1:1 on GB10).
- **Occupancy beats a conflict-free-but-wide layout.** Buy latency-hiding with *more resident CTAs*
(smaller `mmq_y`, smaller shared), not a deeper pipeline.
- Tuning is **empirical**: `nsys` (throughput) is available, **`ncu` is not** on the DGX (no driver
perms). Sweep configs, measure decode_agg, bracket thermals (same-session cold A/B only).
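The skew-pad rule is a compile-time property and can be pinned with a `static_assert`. The sketch below assumes `MMQ_TILE_NE_K` is 32, which is not stated in this document; the names ending in `_SKETCH` and `fp4_tile_row_stride` are hypothetical.

```cpp
// Assumed value of MMQ_TILE_NE_K (not stated in the text).
constexpr int MMQ_TILE_NE_K_SKETCH = 32;

// Row stride of the FP4 x-tile as given in the text: 2*MMQ_TILE_NE_K + 8 + 4.
constexpr int fp4_tile_row_stride(int tile_ne_k) {
    return 2 * tile_ne_k + 8 + 4;
}

// The padding rule the text asks to preserve: stride % 8 == 4.
static_assert(fp4_tile_row_stride(MMQ_TILE_NE_K_SKETCH) % 8 == 4,
              "skew-pad invariant broken");
```

A guard of this kind would make any tile-shape experiment that accidentally removes the padding fail at build time rather than show up later as a throughput regression.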
---
## 5. Correctness / parity gate (every phase)
- **Primary, bit-exact:** `test-backend-ops test -o MUL_MAT -b CUDA0` and
`test-backend-ops test -o MUL_MAT_ID -b CUDA0` must stay **1103/1103** with the flag set **and**
unset, and **byte-identical** when unset. The CPU reference is the deterministic oracle; the op test
is exact (the GB10 greedy-decode non-determinism band applies only to end-to-end, never to the op
test).
- **Add decode-shape cases if absent:** `type_a ∈ {NVFP4, MXFP4}`, `type_b = F32`, dense **n=128** at
the real FFN K/N; for `_ID`, `n_mats=128, n_expert_used=8, n_tokens ∈ {8,32,64,128}` **plus ragged
small-M** (experts with 0/1/2 tokens, `n_tokens` not a multiple of `mmq_x`) — exactly where `mmq_x`/
`mmq_y` changes and block-pad masking can leak.
- **Fusion-handoff parity (P3):** with track A's fused producer on, the prequantized-consumer path
must produce dst **identical** to the unfused `quantize_mmq_fp4_cuda` path (same `e2m1`/`ue4m3`
rounding).
- **End-to-end:** `llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`
(dense) and `q36-35b-a3b-nvfp4.gguf` (MoE); confirm decode_agg climbs per §6 and output stays within
the documented CUDA batch-shape non-determinism band vs the CPU oracle. All scripts **dev-tree-only**.
---
## 6. Phased plan, with expected decode_agg at each phase
Per-step model used (ms @npl128): **dense 795** = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162;
**MoE 384** = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81. `decode_agg = 128 / step_s`.
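The per-step model is plain addition, so the phase tables below can be re-derived mechanically. The sketch uses a hypothetical `StepModel` struct; the bucket values are the ones listed above.

```cpp
// Per-step time buckets in milliseconds.
struct StepModel {
    double gemm_ms;
    double act_ms;
    double gdn_ms;
    double attn_ms;
    double rest_ms;
};

inline double step_ms(const StepModel & s) {
    return s.gemm_ms + s.act_ms + s.gdn_ms + s.attn_ms + s.rest_ms;
}

// Aggregate decode throughput in tokens/s for npl parallel sequences.
inline double decode_agg(const StepModel & s, int npl = 128) {
    return npl / (step_ms(s) * 1e-3);
}
```

For example, `decode_agg({471, 65, 83, 14, 162})` reproduces the dense baseline of about 161 tok/s, and replacing the GEMM bucket with 66 ms and the act bucket with 0 gives a 325 ms step, about 394 tok/s.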
### DENSE (parity target 391)
| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 391** | risk |
|---|---|---:|---:|---:|---:|---|
| **P0** harness | Lock baseline: 1103/1103, decode n=128 perf, nsys window, the 471 ms / 2.9% eff datum. Pin `mmq_x=128` one-read invariant. | 471 | 795 | **161** | 41% | low |
| **P1** host-only tile/grid + re-read A/B | granularity + stream-k threshold sweep; the `mmq_x=64` re-read-vs-occupancy diagnostic. **Honest: small**, since `mmq_x` is pinned, so this mostly de-risks P2. | ~400 | ~724 | **~177** | ~45% | low |
| **P2** `mmq_y`↓ + occupancy/shallow-prefetch | The make-or-break: raise resident CTAs (`mmq_y` 128→64, granularity, shallow 2-stage weight prefetch, skew-pad), push GEMM toward the **66-81 ms BW floor (17-21% FP4 eff)**. **KILL-GATE: if eff plateaus <15% (GEMM >110 ms) → dense parity OFF, report partial.** | **66-81** | 390-405 | **316-328** | **81-84%** | **med-high** |
| **P3** co-land track A | Consume A's prequantized `block_fp4_mmq` y-tile; the 65 ms act bucket folds away. | 66-81 | **325-340** | **376-394** | **96-101%** | low |
Dense climb: **161 → ~177 → 316-328 → 376-394** tok/s = **41% → 45% → 81-84% → 96-101% of vLLM 391.**
Robust to the 273-vs-216 GB/s uncertainty (@216 GB/s P3 → ~359 tok/s = 92%). **Parity within error,
contingent on P2 clearing the kill-gate and on A landing.**
### MoE (parity target 811)
| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 811** | risk |
|---|---|---:|---:|---:|---:|---|
| **P0** harness | Lock 1103/1103 + the monotonic `85→1771` batched-bench curve + 227 ms / 35%-BW datum. | 227 | 384 | **333** | 41% | low |
| **P1/P4** MoE `mmq_x`↓ (patch 0015 → tighten to 8-16) | Free per-expert tile shrink (no re-read); reclaim the waste from the 3-6% tile fill, raise occupancy. | ~140 | ~297 | **~431** | ~53% | low |
| **P2** block-pad align + occupancy | Remove `need_check` tails, tighten stream-k; push toward the 80 ms floor. | ~100 | ~257 | **~498** | ~61% | med |
| **P3** co-land track A | act bucket (31 ms) folds away; GEMM at the ~80 ms floor. | 80 | **207** | **618** | **76% — CEILING** | low |
MoE climb: **333 → ~431 → ~498 → 618** tok/s = **41% → 53% → 61% → 76% of vLLM 811.** **The 76% is the
hard ceiling from the GEMM track:** even a *perfect* weight-read-floor grouped GEMM leaves llama's
non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) at **1.6× vLLM's whole ~78 ms non-GEMM**, so the step
cannot drop below ~207 ms. The remaining ~49 ms to vLLM's 158 ms step is elementwise + host-loop
(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md`), **outside track B.**
### Explicitly NOT in scope (and why)
- A from-scratch W4A16 / CUTLASS SM120 collective — repeats the STOPPED occupancy dead-end and
CUTLASS's grouped FP4 is broken on sm_121.
- Deep multi-stage `cp.async` / XOR-swizzle — proven to collapse GB10 occupancy.
- "Make activations 4-bit" — already W4A4; no work, no win there.
- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity but
**separate tracks**; B owns the GEMM only.
---
## 7. The honest ceiling — does B reach TRUE PARITY?
- **DENSE: TRUE PARITY is PLAUSIBLY REACHABLE, conditional, no margin.** The entire 2.42× gap is the
GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is **bandwidth-bound, not
hardware-capped**. **B (GEMM → BW floor) + A (act-fuse) lands 376-394 tok/s = 90-103% of vLLM 391.**
The catch: it needs **~17-21% FP4-MMA efficiency at decode M=128**, and GB10 has only demonstrated
~17% — and that at the *easier* prefill M=512 tile. It is a **reach, not a lock**, gated by the P2
occupancy kill-gate and contingent on track A. **GO (conditional).**
- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~76% of vLLM (618 vs 811)**
even with a perfect weight-read-floor grouped GEMM, because (1) the MoE floor is the hardest
grouped-GEMM regime (M≈4/expert, vLLM ships purpose-built Marlin-NvFp4) and (2) ~24% of the step is
non-GEMM outside this track. Worth doing (333 → ~618, a 1.85× and a real win), but it **cannot
deliver 811 alone.** **PARTIAL / NO-GO for parity-from-B.**
- **The 273 GB/s is not the ceiling — the GB10 FP4-MMA occupancy efficiency is.** Decode M=128 is a
*different* regime from the dead W4A16 path: bandwidth/occupancy-bound (saturate LPDDR5x at a thin
M-tile via resident CTAs), not compute-throughput-bound (pack MMAs). The existing path is already at
the BW floor at batch 1 (88 ms), so the work is **keeping it bandwidth-bound as M grows to 128**
(occupancy via `mmq_y`↓ + shallow prefetch), a **tune of a working path**, not the greenfield
rewrite. The binding risk is whether that occupancy can be bought without tripping the GB10 wall —
which is exactly what the P2 kill-gate measures.
**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity with
vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the top of the demonstrated efficiency envelope
with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track alone (ceiling ~76%);
MoE parity is a B-plus-non-GEMM program. **Verdict: GO for dense (conditional, B+A, kill-gated),
PARTIAL for MoE.**
---
## 8. One-paragraph summary
The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read
floors 2-5× above vLLM, so **273 GB/s is not the wall**, but llama's FP4-MMA kernel runs at ~3% of
FP4 peak, in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). The path is already
**W4A4** and already **beats vLLM at batch-1 prefill**, so the fix is **tuning the existing
`mul_mat_q<NVFP4>`**, not a cutlass rewrite (a proven GB10 dead-end, and broken on sm_121 anyway). The
M-tile asymmetry sets the levers: **dense** is pinned at `mmq_x=128` (one weight read) so its occupancy
win is **`mmq_y`↓ + shallow prefetch** (BW-neutral), while **MoE**'s win is the free per-expert
**`mmq_x`↓** (patch 0015). **Track B (GEMM → BW floor) + track A (fuse act-quant)** plausibly reaches
**90-103% of vLLM dense (391)**: TRUE PARITY is on the table for dense, but only at the **top of the
demonstrated GB10 FP4-efficiency envelope (~17-21%)**, with **no margin**, gated by the P2 occupancy
kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor
sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense
(conditional, B+A), PARTIAL for MoE.**
---
## 9. Adversarial review (skeptical staff CUDA engineer, post-W4A16): the parity go / no-go
Reviewer stance: I lived through the W4A16 GB10 effort that plateaued at ~9-15 TFLOP/s (~21% of the
BF16 ceiling) after multi-week work and was STOPPED at the occupancy wall. I read this scope and the
grounding (`QWEN36_NVFP4_BENCH`, `VLLM_DECODE_GROUNDING`, `GDN_DECODE_VERIFY`, `DECODE_GAP_STUDY`,
`BLACKWELL_KERNEL_GAPS`, `W4A16_MARLIN_KERNEL_PLAN`) and stress-tested the verdict against them. Net:
the plan is **directionally right and tractably scoped**, the kernel-approach decision (tune, do not
rewrite) is correct, but the **"GO for dense, TRUE PARITY 96-103%" headline outruns its own caveats**.
The honest landing is **dense ~80-90% (parity is the optimistic tail), MoE ~55-65% (parity not
reachable from B)**. The decision to commit to B is nonetheless sound, for a reason the doc under-sells
(low regret), and there is **one technical gap (TMA) and one sequencing error (A last) that must be
fixed**.
### 9.1 Is this the W4A16 wall again? No - and the batch-scaling signature proves why
The decisive evidence the doc has but does not fully exploit is the **npl-sweep** (`QWEN36_NVFP4_BENCH`):
dense llama-as-%-of-vLLM = **99 / 56 / 46 / 41** at npl 8 / 32 / 64 / 128. At **npl8 the kernels are at
parity** (99%); the gap **opens monotonically as M grows**. Decompose this:
- At M=8 the dense GEMM is weight-read-bound at the floor (~88 ms, same as batch-1). llama == vLLM there,
so **llama's FP4-MMA kernel demonstrably HITS the weight-read floor at small M.** This is the existence
proof the W4A16 path never had: it is a *working, floor-reaching* FP4-MMA kernel, not a greenfield
build stuck at 1/4 of MMQ.
- At M=128 vLLM's GEMM **stays at ~88 ms** (flat: it amortizes the one weight read over 128 tokens and
hides the MMA behind the load), while **llama's balloons to 471 ms** (5.4x). llama **falls off the
floor** as M grows; vLLM **holds it**.
So the problem is **not** "build a fast 4-bit GEMM from scratch on an occupancy-hostile part" (the dead
W4A16 problem). It is **"keep a working FP4-MMA kernel on the bandwidth floor as the M-tile grows from 8
to 128"** - a tune of a working path. **Verdict: this is NOT the W4A16 wall** (different regime, working
path, dual existence proof at M=8 and from vLLM at M=128). **But it shares W4A16's one binding
constraint:** holding the floor as M grows requires hiding LPDDR5x weight-load latency at the larger
tile, which is the same occupancy / latency-hiding game GB10 historically loses. The doc is right that
it is a different and more tractable regime; it under-states that the *binding risk is identical*.
### 9.2 Why is vLLM 2.4x faster if both share 273 GB/s? Compute-side scheduling, and the gap is ~82% (not 100%) GEMM
The load-bearing question, settled by 9.1: at M=128 the gap is **not** that vLLM beats the shared
bandwidth floor - it is that **llama falls off the floor into self-inflicted compute/occupancy-bound
territory while vLLM stays on it.** The lever is therefore latency-hiding at the M=128 tile
(compute-side scheduling: occupancy, prefetch, tile shape), with the 273 GB/s weight-read floor as the
hard target both engines share. This confirms the doc's roofline and its central claim that the kernel,
not the hardware, is the limiter.
**But the doc's "the entire 2.42x dense gap is the GEMM" is an ~82% truth, not a 100% one.** Decompose
the dense step (numbers from the doc's own inputs):
```
llama step @npl128 795 ms (decode_agg 161)
vLLM step @npl128 328 ms (decode_agg 391)
total gap 467 ms
llama GEMM 471 ms
vLLM GEMM (at the floor) ~66-88 ms (66 @273 GB/s spec, 88 @216 GB/s achieved)
=> GEMM gap 383-405 ms = 82-87% of the 467 ms total gap
=> non-GEMM gap 62-84 ms = 13-18% of the total gap
```
So **B alone (GEMM -> floor) caps near ~80-84%** (step 412-390 ms = 311-328 t/s), **not parity.** Parity
needs the non-GEMM 62-84 ms too: ~65 ms of it is track A's act-quant bucket, the residual ~0-19 ms is
elementwise + host outside both A and B. This is the crux of the sequencing answer (9.6): **B is
necessary but on its own lands ~80%; it is track A that tips dense over the parity line, not B.** The
parity story is *entirely* contingent on A, which the P3 framing buries.
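The decomposition above is two subtractions and a ratio. The helpers below are a hypothetical sketch that re-derives it from the same four inputs, so the 82-87% share and the 412 ms "B alone" step can be checked against other floor assumptions.

```cpp
// Fraction of the total step-time gap explained by the GEMM bucket.
inline double gemm_share_of_gap(double llama_step_ms, double vllm_step_ms,
                                double llama_gemm_ms, double vllm_gemm_ms) {
    const double total_gap = llama_step_ms - vllm_step_ms;
    const double gemm_gap  = llama_gemm_ms - vllm_gemm_ms;
    return gemm_gap / total_gap;
}

// Step time if only the GEMM bucket is brought down to the given floor.
inline double step_after_gemm_fix_ms(double llama_step_ms, double llama_gemm_ms,
                                     double gemm_floor_ms) {
    return llama_step_ms - llama_gemm_ms + gemm_floor_ms;
}
```

`gemm_share_of_gap(795, 328, 471, 88)` is about 0.82 and `gemm_share_of_gap(795, 328, 471, 66)` about 0.87; `step_after_gemm_fix_ms(795, 471, 88)` is the 412 ms step quoted for B alone.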
### 9.3 The sharpest risk the doc misses: vLLM's existence proof uses the technique the doc forbids (TMA)
vLLM holds the M=128 floor with **cutlass SM120 = TMA + a warp-specialized deep async producer/consumer
pipeline** (Research 1). That deep pipeline is **exactly what the doc forbids on GB10** (rule 4.5: "do
not add deep cp.async stages ... they collapsed W4A16"). So **B's chosen GB10-friendly route (`mmq_y`-down
occupancy + a shallow 2-stage prefetch) is a different bet from the one that produced the existence
proof.** Reaching the same floor by a friendlier route is plausible but **unproven**, and if the
occupancy-only route plateaus short of the floor, B underperforms its target with no fallback in scope.
The doc conflates two different things under "deep pipeline":
- **manual `cp.async` + XOR-swizzle** - register/shared-hungry, **collapsed W4A16 occupancy on GB10**
(correctly banned).
- **TMA (tensor-memory-accelerator) bulk async copy** - a single descriptor drives the copy, **far lower
register/occupancy cost**, and it is precisely how cutlass gets pipeline depth **without** the
occupancy hit (Research 1 says this explicitly). TMA is available on sm_120/121.
**Recommendation (binding):** B must put a **TMA-driven weight feed in scope as a first-class P2 option**,
not categorically forbid pipeline depth. The occupancy-only route is the right *first* experiment
(cheapest, respects the W4A16 lesson), but if P2 plateaus below the floor, **TMA is the demonstrated way
to get depth without the occupancy collapse** and is what the vLLM existence proof actually uses.
Declaring the floor "unreachable" without trying TMA would repeat the W4A16 mistake in reverse:
abandoning the path that works because the *manual* version of it failed.
### 9.4 Tractability: bounded tune, confirmed - with the TMA caveat
The proposed changes are genuinely **bounded and build-ready**, not a greenfield kernel:
- **MoE arm = DEMONSTRATED tractable.** Patch 0015 already auto-caps `mmq_x` per-expert and is committed
and measured. Tightening to 8-16 + block-pad is the same lever, lower risk. This is real, banked
evidence that the "tune `mul_mat_q`" approach works on this exact kernel family.
- **Dense arm = plausibly bounded.** `mmq_y`-down is a warp/fragment remap that touches the
`nwarps x tile_C::I == mmq_y` static_assert coupling, so it is a contained *kernel* edit (not a pure
host switch, as the doc itself notes). The host-only P1 knobs are zero-risk. The **prefetch piece is
where the residual occupancy risk lives** - and per 9.3, TMA belongs here.
- **Rejecting (B) cutlass-rewrite and (C) BF16-Marlin-descent is correct.** Cutlass grouped FP4 is broken
on sm_121 (the reason vLLM itself falls to Marlin for MoE); BF16 Marlin concedes GB10's 2x FP4 edge.
**Verdict: tractable, not greenfield.** The MoE arm is proven; the dense arm is a contained edit with a
real but bounded occupancy risk, gated by the P2 kill-gate. The one scope gap is TMA (9.3).
### 9.5 Honest expected outcome (the numbers I would defend)
| | B alone | B + A (median) | B + A (optimistic, spec BW) | parity? |
|---|---:|---:|---:|---|
| **DENSE** (target 391) | ~80-84% (311-328 t/s) | **~92-95% (360-372 t/s)** | ~101% (394 t/s) | **optimistic tail only** |
| **MoE** (target 811) | ~53-61% (431-498 t/s) | **~70-76% (570-618 t/s)** | 76% (618 t/s, CEILING) | **no** |
Reconciliation with the doc: the doc's B+A = "96-103%" uses the **spec-BW (66 ms floor)** end. At the
**achieved 216 GB/s (88 ms floor)** the same arithmetic gives **~94%**, and that still assumes B hits the
floor. So the honest dense median is **~92-95%, with TRUE PARITY as the upside, not the expectation**,
contingent on a conjunction of three things: (a) P2 clears the occupancy kill-gate to the floor, (b) the
GB10-friendly *or* TMA feed actually reaches the cutlass floor (9.3), and (c) track A lands. Three ANDs =
tail, not median.
**The low-regret point the doc under-sells (and the real reason to commit):** even the *kill-gate-tripped*
outcome is a large win. At the doc's own 15%-FP4-eff kill threshold (GEMM ~110 ms), B+A still lands
**~89%** (step 369 ms); at a merely-partial occupancy win (eff 3% -> 5%, GEMM ~276 ms) B+A still lands
**~61%**. Since the M=8 parity proof guarantees the floor is reachable in principle and patch 0015 proves
the tune works, **getting *some* improvement at M=128 is high-probability; the only open question is how
close to the floor.** So the outcome distribution is heavily positive (very likely 60-90%, possibly
parity) with a bounded downside - B is **low-regret**, which matters more for the go decision than whether
the parity tail hits.
### 9.6 Sequencing vs track A: land A FIRST (the doc has this backwards)
The doc runs A as a parallel track merging at **P3 (last)**. That is backwards for de-risking, for three
reasons:
1. **A defines B's interface.** B's "prequantized-MMQ consumer" consumes A's fused `block_fp4_mmq`
producer (the frozen struct in 4.4). Building B against a not-yet-landed producer means B's consumer
seam is speculative until P3.
2. **A defines B's baseline and the kill-gate threshold.** A alone (act-fuse, folding the 65 ms / 8.2%
bucket, plus any of the elementwise/host it captures) plausibly moves dense **41% -> ~50-55%** before
B touches a kernel. B's *true residual is the GEMM after A removed the act round-trip*, not the raw
59%. Running B's P2 against the stock 41% baseline mis-sizes the required GEMM speedup and the
<15%-eff kill-gate.
3. **A is lower-risk and independently shippable.** It is the safe win; it should not wait behind the
risky kernel tune.
**Recommendation:** land A (tasks 38-41) first, **re-measure** the decode_agg and the GEMM share
post-A, **then** run B's P2 and recompute the kill-gate against the post-A number. This makes the
make-or-break decision cheaper, better-informed, and bankable-either-way.
### 9.7 Verdict (go / no-go)
- **DENSE: CONDITIONAL GO - commit to B, but scope and message it as "close most of the GEMM gap"
(expected ~80-90%, parity the upside), NOT "true parity."** Justified because: the approach is
bounded/tractable (9.4), it is a working-path tune with a dual existence proof (9.1), and the outcome
is low-regret (9.5) - even a tripped kill-gate roughly doubles today's 41%. Conditions: (i) **land A
first** (9.6); (ii) **gate hard at P2** (eff < 15% -> stop chasing parity, but keep the partial win);
(iii) **put TMA in scope** as the floor-reaching fallback before declaring the floor unreachable (9.3).
- **MoE: NO-GO for parity from B (confirmed).** The doc's ~76% ceiling is honest, arguably optimistic
(it assumes the ragged M~4/expert grouped GEMM hits its 80 ms floor, the hardest regime, where vLLM
ships purpose-built Marlin). Realistic B+A landing **~70-76%**, B alone ~55-61%. Still worth doing -
the `mmq_x`-down / block-pad work is cheap and partly landed (patch 0015) - but it must be sold as a
**1.7-1.85x win, not parity**; MoE parity is a **B-plus-non-GEMM** program (elementwise fusion, host
CUDA-graph, GDN bf16 state).
- **One line for the parent:** GB10 can plausibly reach **dense** decode parity with vLLM only at the
**top of its FP4 envelope and only as B + A together** (B alone caps ~80%; A is what tips it over),
and **cannot** reach **MoE** parity from the GEMM track alone (ceiling ~76%). **Commit to B** as a
high-value, low-regret, bounded GEMM-gap-closing tune (honest expected landing **dense ~80-90%, MoE
~55-65%**), **sequence track A first**, **gate at P2**, and **add a TMA weight-feed option** so the
occupancy-only route is not the only shot at the floor that vLLM's TMA pipeline demonstrably reaches.

@@ -1,96 +0,0 @@
# Patch 0030 - fused-op backend gate (audit RISKY-1 fix) - RESULTS
Closes the single latent silent-miscompute hazard from `ARCH_GENERALITY_AUDIT.md`
(RISKY-1): the fused GDN / discriminated-SSM_CONV decode ops are CUDA+CPU-only but
were emitted DEFAULT-ON with no backend guard.
## The hazard
- `cparams.fused_gdn_ar = fused_gdn_ch = auto_fgdn = true` are set unconditionally
in the `llama_context` constructor (`src/llama-context.cpp`).
- Patches 0018/0019/0026 add `ggml_gated_delta_net_inplace[_ids][_hybrid]`
(reuse `GGML_OP_GATED_DELTA_NET` with extra src slots).
- Patches 0021/0028 add `ggml_ssm_conv_update_inplace[_ids]` which **reuse
`GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]`/`src[4]`** (ring/ids).
- Both families have CUDA + CPU kernels only. No `supports_op` change was made for
the discriminated variants.
- A backend that supports **plain** `SSM_CONV` but ignores the discriminator
(Vulkan/SYCL/Metal) returns `supports_op==true` for the node; the scheduler
assigns the discriminated conv to it; it runs the **wrong plain conv** =>
SILENT corruption (not a crash).
- The upstream `auto_fgdn` resolution only inspects `GATED_DELTA_NET` nodes, so the
discriminated-`SSM_CONV` safety was only **incidentally** covered (GDN-op and
discriminated-conv happened to share backend coverage). The hazard goes live the moment a
non-CUDA paged build of a gated-DeltaNet model exists.
## The fix (emission gate, not supports_op)
Chosen route: **gate the emission on the active compute backend type.** The
`supports_op` route would require editing every other backend's per-device
`supports_op` (Vulkan/SYCL/Metal/...) to reject the discriminated `SSM_CONV` -
invasive, fragile, and not centrally exposed by the ggml backend interface. The
emission gate is self-contained in the fork's own code.
`src/llama-context.cpp`, in `llama_context::sched_reserve()`, immediately before
the existing `if (cparams.auto_fgdn)` resolution block: if any **non-CPU** compute
backend has a reg name other than `"CUDA"` / `"ROCm"` (HIP) / `"MUSA"` (the three
`GGML_CUDA_NAME` values - all the same hipified ggml-cuda TU that carries the
discriminated-op handling), force
`fused_gdn_ar = fused_gdn_ch = auto_fgdn = false`.
Every emission site keys off these flags:
`conv_decode_fused = (n_seq_tokens==1) && (n_rs_seq==0) && fused_gdn_ar`
(qwen35/qwen35moe/qwen3next + `build_conv_state_fused`) and
`fused = (n_seq_tokens==1) ? fused_gdn_ar : fused_gdn_ch` (delta-net-base). With
the flags false the graph takes the upstream non-fused branch: a **plain
`ggml_ssm_conv` (no discriminator) + `ggml_silu`**, which every backend handles
correctly.
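A minimal model of the gate logic described above, as a sketch only: the function names, the `(dev_type, reg_name)` tuples and the dict of flags are hypothetical stand-ins, not the code in `llama-context.cpp`. It shows the intended behavior: flags survive only when every non-CPU backend reports one of the three ggml-cuda reg names, and the emission predicate follows the flag.

```python
# Hypothetical model of the emission gate; not the patch's actual code.
FUSED_OK_REG_NAMES = {"CUDA", "ROCm", "MUSA"}  # the three GGML_CUDA_NAME values named above


def resolve_fused_flags(backends, flags):
    """backends: list of (dev_type, reg_name) pairs, dev_type "CPU" or anything else.
    flags: dict holding fused_gdn_ar / fused_gdn_ch / auto_fgdn.
    Returns a new dict; the input is not modified."""
    backend_ok = all(
        dev_type == "CPU" or reg_name in FUSED_OK_REG_NAMES
        for dev_type, reg_name in backends
    )
    if backend_ok:
        return dict(flags)  # CUDA-family or CPU only: leave the flags untouched
    # Any other non-CPU backend: force the upstream non-fused branch.
    return {**flags, "fused_gdn_ar": False, "fused_gdn_ch": False, "auto_fgdn": False}


def conv_decode_fused(n_seq_tokens, n_rs_seq, flags):
    """Mirrors the emission-site predicate quoted above."""
    return n_seq_tokens == 1 and n_rs_seq == 0 and flags["fused_gdn_ar"]
```

With the flags forced off, `conv_decode_fused` is false for every decode step, which is what routes the graph to the plain `ggml_ssm_conv` + `ggml_silu` branch.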
## CUDA byte-identical invariant
On a CUDA backend the reg name is `"CUDA"`, so `fgdn_backend_ok` stays true, the
flags are left untouched, and the emitted decode graph is unchanged. The fix only
changes behavior when a non-CPU backend outside the CUDA/ROCm/MUSA family is active. The
CUDA decode graph is byte-identical to pre-0030 **by construction** (no flag flips on CUDA),
so the existing greedy md5 gates are unaffected on the validated GB10 target.
## Verification
- COMPILE (GPU-free, done on a CPU box): reconstructed the exact source state
(upstream pin `9d5d882d` + paged patches `0001-0029`, .md docs stripped) and
applied 0030. CPU-only build (`-DGGML_CUDA=OFF`) of `llama` + `test-backend-ops`
links `libllama.so` and the test binary with **0 errors**; the edited
`llama-context.cpp` compiles clean (uses only the already-included `<cstring>`
and the backend-reg API already used in this TU:
`ggml_backend_dev_backend_reg` / `ggml_backend_reg_name` /
`ggml_backend_dev_type`).
- 0030 applies cleanly on a fresh pin+0001-0029 tree via both `git apply --check`
(Makefile path) and `patch -p1 -N` (prepare.sh path).
- test-backend-ops correctness is a **CUDA0-vs-CPU** comparison; a CPU-only run
skips CPU-vs-CPU by design ("Skipping CPU backend"). The relevant test cases are
registered and will be exercised by the DGX CUDA run:
`test_ssm_conv` / `test_ssm_conv_update` (SSM_CONV_UPDATE) /
`test_ssm_conv_update_ids` (SSM_CONV_UPDATE_IDS) /
`test_gated_delta_net` (+ `_hybrid`).
## Pending on the DGX (GPU)
The CUDA-side confirmation could not be run from the CPU box (the DGX cloudflared
tunnel `jp-6.prem.io` was returning `websocket: bad handshake` for the whole
session - origin offline). To run on the DGX `~/llama-paged-dev` (branch `paged`)
once reachable, then commit 0030 there too:
```
test-backend-ops test -o SSM_CONV
test-backend-ops test -o SSM_CONV_UPDATE
test-backend-ops test -o SSM_CONV_UPDATE_IDS
test-backend-ops test -o GATED_DELTA_NET # expect: 2/2 backends passed, OK
```
Greedy md5 (only if >40GB VRAM free; must equal the established baselines):
`q36-27b-nvfp4 == 5951a5b4d624ce891e22ab5fca9bc439`,
`q36-35b-a3b-nvfp4 == 07db32c2bcb78d17a43ed18bc22705cd`. Since 0030 does not flip
any flag on CUDA, the md5 is unchanged by code-path argument; the run is a
belt-and-suspenders confirmation, not a correctness dependency.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,89 +0,0 @@
# Decode-Parity: Parked Levers (future exploration)
**Context.** The bit-exact decode-parity effort shipped patches **0018-0023**: dense decode
38% -> **95% of vLLM** @npl128 on GB10 / DGX Spark (LPDDR5x ~273 GB/s), every patch
**byte-identical to llama's own f32 output** (md5-gated). The gated-DeltaNet recurrence (the
dominant ~50% kernel) now runs at **84.6% of peak BW = past vLLM's 82.4%**, at the DRAM floor.
bf16 SSM state was fully built and **shelved** (real +25-31% lever but fails the f32 KL gate).
The remaining non-recurrence kernels (FP4 GEMM, attention, lm_head) are at their bit-exact
floor: any knob changes a reduction order vs the f32 reference. So further *bit-exact* decode
gains are marginal; the levers below are the honest pick-up points, ranked by promise.
---
## 1. Hybrid-precision SSM state (the most promising)
The bf16 build (`BF16_SSM_STATE_RESULTS.md`) proved the throughput lever is large -
recurrence **-49%/call** (dense 3.38 -> 1.73 ms), dense decode ~**490 t/s = 125% of vLLM** (clean
runs), MoE @128 **+24.9%** - but bf16 fails the f32 KL gate (KLD 0.06-0.17 at >=1024 ctx,
~10% argmax flips). The discrimination showed the error is **intrinsic to bf16 over the
long-memory heads** (exp(g) ~ 1, where the per-step decay does not contract the rounding);
short/fast-decaying heads are fine.
**Lever:** a per-head (or per-channel) precision split - keep the long-memory heads (g near 1)
in f32, store the fast-decaying heads (g well below 1, where rounding contracts) in bf16. Could
capture most of the speedup while passing the KL gate. Needs a g-magnitude classifier at graph
build + a mixed-dtype recurrent-state cache. **HIGH promise, moderate effort.** The bf16 kernel
plumbing already exists (DGX `~/llama-paged-dev/BF16_SSM_STATE.diff`); this adds the per-head
dtype selection on top.
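A rough sketch of what the per-head split could look like, under loud assumptions: `choose_state_dtypes`, `state_bytes` and the `decay_threshold` of 0.99 are hypothetical, and a single representative `g` per head is assumed to be available (the gate is token-dependent in practice, so how to obtain it at graph build is left open here).

```python
import math


def choose_state_dtypes(g_per_head, decay_threshold=0.99):
    """g_per_head: one representative log-decay g per head (decay factor is exp(g)).
    Heads whose decay factor is near 1 (long memory) stay f32; the rest go bf16.
    The threshold is an illustrative guess, not a validated value."""
    return ["f32" if math.exp(g) >= decay_threshold else "bf16" for g in g_per_head]


def state_bytes(dtypes, s_v=128):
    """Recurrent-state bytes for one sequence and one layer under a mixed-dtype layout."""
    bytes_per_elem = {"f32": 4, "bf16": 2}
    return sum(s_v * s_v * bytes_per_elem[d] for d in dtypes)
```

Any real threshold would have to be picked against the f32 KL gate, not chosen up front.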
*Note (precision, corrected):* plain bf16 (no split) is a legitimate **opt-in for precision-tolerant
deployments**, but it is *below* vLLM's recurrent precision, NOT equal to it. vLLM keeps the
gated-DeltaNet **temporal state in f32** (proven three ways in `BITEXACT_VS_VLLM.md`; only its tiny
conv state is bf16, and llama keeps even that f32). So bf16 here trades *below-vLLM* precision for
*above-vLLM* throughput. We declined it as the default because both llama's f32 AND vLLM's f32 are a
higher bar - and at equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak BW),
so we do not need bf16 to match vLLM's recurrence.
## 2. Dense CUDA-graph instability
The bf16 dense decode was **bimodal** across runs (287 / 336 / 487 / 498 t/s) - a dense-path
CUDA-graph capture/replay instability (good runs hit ~490). The f32 dense path measured stable
(371-376) but the bimodality is a latent fragility worth root-causing; a robust graph capture on
the dense path could stabilize and possibly lift dense decode. **Moderate promise**, diagnostic.
## 3. Dense rms_norm -> fp4 producer-fold (~1.5-2.5%, parked as flat-risk)
The last bit-exact bucket (`RMSNORM_FP4_FOLD.md`). Folding the standalone `quantize_mmq_nvfp4`
into the rms_norm+mul producer at the FFN boundary (f32 output dead -> droppable) could recover
~1.5-2.5% dense. Parked because: the Lever-2 precedent measured **flat**, it has the worst
gain/plumbing ratio (3-op `{RMS_NORM,MUL,MUL_MAT(NVFP4)}` graph fusion + a pre-quantized-src1
GEMM path + scratch-pool / CUDA-graph-lifetime plumbing), and the gain risks being swallowed by
the ~0.3-0.5% bench noise floor. Revisit only with the inter-node graph-CSE plumbing built and
proven on a same-build flag toggle (decode_agg lift above noise AND md5 == 0023). **LOW promise.**
## 4. Datacenter Blackwell (sm_100)
This effort targeted **consumer** Blackwell sm_12x (sm_120 RTX 50-series, sm_121 GB10). Datacenter
Blackwell (B100/B200/GB200, sm_100 / cc 10.0) has HBM3e (much higher BW) and different MMA
characteristics - the LPDDR5x bandwidth floor that dominates GB10 decode does **not** apply, so the
whole calculus changes (likely compute-bound, not BW-bound; the recurrence would not be the binding
kernel). A separate investigation if datacenter Blackwell becomes a target.
## 5. Prefill / TTFT scheduler + paged-pool burst degradation (HIGH priority - the weakest benchmark number)
The final benchmark (`QWEN36_NVFP4_BENCH.md`) exposed TTFT as the clear weak spot vs vLLM. Two distinct
issues:
- **Static decode-first budget tradeoff:** the QoS budget (patches 0013/0016, `LLAMA_MAX_BATCH_TOKENS=512`)
maximizes decode tok/s + memory but throttles burst-prefill, so under a synchronized 128-way burst TTFT
climbs to **903 s dense / 213 s MoE @npl128** vs vLLM's chunked-prefill 6-18 s. A dynamic/adaptive budget
(by concurrency + queue depth), or matching vLLM's chunked-prefill interleave, would rebalance.
- **Paged-pool burst-degradation BUG (concrete, found in the benchmark):** after a high-npl burst, a
server's *subsequent lower-npl* prefill collapses (fresh npl8 = 507 t/s / 6 s TTFT; npl8 after an npl64
burst = 65 t/s / 64 s). Decode stays robust; only prefill degrades -> root-cause the paged-pool state
that persists across the burst.
**HIGH promise** for the serving experience: decode (dense 90-117%, MoE 77-83% of vLLM) and memory (1.5-3x
lower) are already strong; TTFT is the one number holding back a clean public win.
## 6. MoE-specific recurrence tuning
The occupancy retune (0022) was tuned on the dense path; it lifted MoE +8.3% as a side effect. The
MoE path (`MUL_MAT_ID` grouped GEMM + the shared GDN recurrence, expert routing changes the GEMM
shapes) may have MoE-specific occupancy headroom. Worth a MoE-targeted reprofile.
---
*All shelved per the host handover - experiments parked. Pick up from the linked result docs in this
directory.*


@@ -1,208 +0,0 @@
# GDN decode verify: is llama.cpp's Gated-Delta-Net decode O(1) or an O(ctx) re-scan?
Verdict-first, then the evidence. This closes lever 5 of `VLLM_DECODE_GROUNDING.md` ("Verify
llama's GDN/linear-attention decode path"): on the Qwen3.6 hybrid models, is llama re-scanning the
context (O(ctx)) in the linear-attention layers, or keeping vLLM's O(1)-in-context recurrent state?
Method: GGUF-metadata + source reading on the `paged` dev tree (`~/llama-paged-dev`, build-cuda
sm_121) on `dgx.casa`, plus nsys CUDA-kernel decode traces on `~/bench/q36-27b-nvfp4.gguf`
(GB10 / DGX Spark, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`). Models:
`~/bench/q36-27b-nvfp4.gguf` (dense, arch `qwen35`), `~/bench/q36-35b-a3b-nvfp4.gguf`
(MoE, arch `qwen35moe`).
## TL;DR verdict
**llama.cpp's GDN decode is EFFICIENT: it is O(1)-in-context, a single fused CUDA kernel that
reads + updates a fixed-size cached recurrent state, structurally identical to vLLM's
`fused_recurrent_gated_delta_rule`. It is NOT a re-scan, NOT a context-scaling blowup, and NOT a
major contributor to the ~2.4x eager-decode gap.** There is no GDN-specific bottleneck to fix, so
the cheap model-specific lever this probe was hunting for does not exist. The 2.4x is the general
kernel work (the FP4 weight GEMM, which dominates the step, plus the O(ctx) full-attention decode
kernel in the minority of full-attention layers), exactly as `VLLM_DECODE_GROUNDING.md` concluded.
The decisive datum: at matched batch (npl4), pure decode, 4x more context, the GDN kernel time is
**flat** while the full-attention kernel grows ~3.1x:
| kernel | ctx 1024 | ctx 4096 | ratio | meaning |
|--------|---------:|---------:|------:|---------|
| `gated_delta_net_cuda` (GDN linear-attn) | 10.3 us/launch | 8.0 us/launch | **0.78x (flat)** | **O(1) in ctx** |
| `flash_attn_tile` (full-attn layers) | 27.1 us/launch | 85.0 us/launch | **3.1x** | O(ctx), as expected |
| total ms / decode step | 84.9 | 86.0 | 1.01x | GEMM-bound, ctx-independent |
Identical decode-step counts in both windows (~190 steps, ~9134 GDN launches), so this is a
per-step like-for-like comparison: the GDN layers do **not** get more expensive as context grows.
## 1. Architecture (confirmed from GGUF metadata + tensor names)
Both Qwen3.6 models are hybrid: a `full_attention_interval` of 4 means every 4th layer is standard
full attention and the other 3/4 are Gated-Delta-Net (GDN) linear attention with a recurrent state.
**Dense Qwen3.6-27B (`general.architecture = qwen35`):**
- `block_count = 64`, `full_attention_interval = 4` -> **16 full-attention layers + 48 GDN layers**.
- Full-attn: `head_count = 24`, `head_count_kv = 4` (GQA), `key_length = value_length = 256`,
rope `freq_base = 1e7`, mrope sections `[11,11,10,0]`.
- GDN/SSM: `ssm.state_size = 128`, `ssm.conv_kernel = 4`, `ssm.group_count = 16`,
`ssm.time_step_rank = 48`, `ssm.inner_size = 6144`. So the recurrent state per GDN layer is
`[S_v=128, S_v=128, H_v=48]` per sequence (`H_v = inner_size/state_size = 6144/128 = 48` value
heads), i.e. a 128x128 state matrix per head, ~3.1 MB (F32) per sequence per layer.
**MoE Qwen3.6-35B-A3B (`general.architecture = qwen35moe`):**
- `block_count = 41`, `full_attention_interval = 4` (~10 full-attn + ~31 GDN layers).
- `head_count = 16`, `head_count_kv = 2`, `key_length = value_length = 256`,
`expert_count = 256`, `expert_used_count = 8`, `expert_feed_forward_length = 512`.
- Same SSM dims: `state_size = 128`, `conv_kernel = 4`, `group_count = 16`,
`inner_size = 4096` -> `H_v = 32` value heads.
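The layer split and per-layer state size follow from the metadata by simple arithmetic. A small sketch, assuming the full-attention count is `block_count // full_attention_interval` (this reproduces 16/48 for the dense model and the approximate 10/31 quoted for the MoE model); `hybrid_layout` is a hypothetical helper name.

```python
def hybrid_layout(block_count, full_attention_interval, inner_size, state_size):
    """Returns (n_full_attn_layers, n_gdn_layers, H_v, f32 state bytes per seq per GDN layer)."""
    n_full = block_count // full_attention_interval  # assumption: every Nth layer, floor
    n_gdn = block_count - n_full
    h_v = inner_size // state_size                   # value heads
    state_bytes_f32 = state_size * state_size * h_v * 4
    return n_full, n_gdn, h_v, state_bytes_f32


dense = hybrid_layout(64, 4, 6144, 128)
moe = hybrid_layout(41, 4, 4096, 128)
```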
**Tensor names confirm the op split (27B, per-layer dump):**
- GDN layers (e.g. `blk.0.*`): `ssm_alpha`, `ssm_beta`, `ssm_conv1d`, `ssm_a`, `ssm_dt.bias`,
`ssm_norm`, `ssm_out`, plus `attn_qkv` / `attn_gate` (the in/out projections of the linear-attn
block). No `attn_k/v/output`, no per-head q/k norm.
- Full-attn layers (e.g. `blk.3.*`, every 4th): `attn_q`, `attn_k`, `attn_v`, `attn_output`,
`attn_q_norm`, `attn_k_norm`. No `ssm_*`.
llama loads the GDN layers through the **recurrent memory** (`llama-memory-recurrent`), not the KV
cache: the conv state and the SSM state live in `conv_states_all` / `ssm_states_all` and are read
and written every step. Only the 16/10 full-attention layers use the (paged) KV cache. This is the
SSM-style recurrent path, not standard attention.
## 2. llama.cpp GDN decode implementation: O(1) recurrent-state update (code-proven)
Graph build (shared by both models): `src/models/delta-net-base.cpp`, dispatched from
`src/models/qwen35.cpp` and `src/models/qwen35moe.cpp` (the MoE class inherits
`llm_build_delta_net_base` and calls the same `build_recurrent_attn`, qwen35moe.cpp:472).
**Decode dispatch (`build_delta_net`, delta-net-base.cpp:425-447):** when `n_seq_tokens == 1`
(decode), it takes `build_delta_net_fused` if `cparams.fused_gdn_ar` (the default, see below), else
`build_delta_net_autoregressive`. Both are O(1):
- `build_delta_net_autoregressive` (delta-net-base.cpp:289-371) is the explicit rank-1 recurrence on
the fixed-size state `s` shaped `[S_v, S_v, H_v, n_seqs]`: `s *= exp(g)` (decay),
`sk = sum_rows(s * k)`, `d = (v - sk^T) * beta`, `s += k (x) d^T` (rank-1 update),
`o = sum_rows(s * q)`. **No loop over past tokens, no KV read** - it touches only the state and
the single new token's q/k/v/g/beta. `GGML_ASSERT(n_tokens == 1)`.
- `build_delta_net_fused` (delta-net-base.cpp:373-423) collapses the same recurrence into one op,
`ggml_gated_delta_net(q, k, v, g, b, s, K=1)`.
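The recurrence listed above can be written out for a single head as a plain reference. This is a sketch under assumptions: the `state[i][j]` layout (i = key dim, j = value dim), the scalar per-head `g`, and the name `gdn_decode_step` are illustrative choices, not the ggml layout or API.

```python
import math


def gdn_decode_step(state, q, k, v, g, beta, scale=1.0):
    """One decode step for one head. Mutates `state` in place and returns the output.
    Cost depends only on the state size, never on how many tokens came before."""
    n_k, n_v = len(k), len(v)
    decay = math.exp(g)
    for i in range(n_k):                      # s *= exp(g)
        for j in range(n_v):
            state[i][j] *= decay
    sk = [sum(state[i][j] * k[i] for i in range(n_k)) for j in range(n_v)]
    d = [(v[j] - sk[j]) * beta for j in range(n_v)]   # d = (v - sk) * beta
    for i in range(n_k):                      # rank-1 update: s += k (x) d
        for j in range(n_v):
            state[i][j] += k[i] * d[j]
    return [scale * sum(state[i][j] * q[i] for i in range(n_k)) for j in range(n_v)]
```

Feeding the same `k`/`v` pair twice leaves the state unchanged on the second step (the delta is zero), which is a quick sanity check on the update order.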
**State is cached across steps, not rebuilt (`build_recurrent_attn`, delta-net-base.cpp:527-606):**
the input state `s` is read from `ssm_states_all` via `build_rs`, and the new state is copied back
with `ggml_cpy(new_state, view(ssm_states_all, ... kv_head ...))` (lines 555-558). The causal-conv
state is handled the same way in `build_conv_state` (449-525): the previous `conv_kernel-1 = 3`
samples are read from `conv_states_all`, the new token is appended, and the last 3 are written back.
So both pieces of GDN state persist in the recurrent cache exactly like a KV cache persists tokens -
this is the recurrent analogue, fixed size, independent of context length.
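The conv-state handling is a fixed-size rolling window. A one-channel sketch, assuming taps are stored oldest-first and the kernel is applied in the same order (`conv_state_step` is a hypothetical name; the real op works on whole tensors):

```python
def conv_state_step(conv_state, x_new, kernel):
    """conv_state: the previous conv_kernel-1 samples for one channel (oldest first).
    Returns (conv output for the new token, updated state of the same length)."""
    window = conv_state + [x_new]                 # append the new token
    assert len(window) == len(kernel)
    y = sum(w * s for w, s in zip(kernel, window))
    return y, window[1:]                          # write back the last conv_kernel-1 samples
```

The state length never changes, so the per-step cost is independent of how many tokens have been decoded.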
**Defaults (`src/llama-context.cpp:200-201`):** `cparams.fused_gdn_ar = true` and
`fused_gdn_ch = true`. They are only auto-disabled if the fused op cannot be scheduled on the same
device as the layer (`device_gdn != device_kv`, lines 540-595); on a single GB10 with `-ngl 99`
that does not happen, so the **fused single-kernel path is what runs**.
**The CUDA kernel (`ggml/src/ggml-cuda/gated_delta_net.cu`) is the crux, and it is unambiguously
O(1) in context:**
- Launch grid `dim3(H, n_seqs, ceil(S_v/4))` and block `(min(warp,S_v), 4, 1)` (lines 184-185):
the grid spans heads x sequences x state-columns. **There is no context-length dimension and no
context-length argument anywhere in the kernel signature** (q/k/v/g/beta are the new token(s)
`[S_v, H, n_tokens, n_seqs]`; `curr_state` is the fixed `[S_v, S_v, H, n_seqs]`).
- Each warp loads its shard of the fixed-size state into registers **once** (lines 57-61), then
loops `for (t = 0; t < n_tokens; t++)` (line 63). At decode `n_tokens == 1`, so it is a single
iteration: read the one new token, do the rank-1 update
`s_shard[r] = g * s_shard[r] + k[i] * delta_col` and the readout `attn = S^T q` (lines 84-141),
then write the updated state back (lines 161-167). No second loop, no read of any past KV.
- Work per decode step is therefore proportional to `S_v * S_v * H * n_seqs` (the state size x
batch) and **constant in context length**. This is precisely vLLM's
`fused_recurrent_gated_delta_rule_packed_decode_kernel` (one batched launch updating a
fixed-size `[K,V]` state) cited in the grounding doc.
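The launch shape quoted above can be reproduced with a few lines; note that context length is not an input anywhere. The helper names are hypothetical, and the warp size of 32 is an assumption for NVIDIA hardware.

```python
def gdn_launch_dims(h, n_seqs, s_v, warp_size=32, num_warps=4):
    """Grid dim3(H, n_seqs, ceil(S_v/4)) and block (min(warp, S_v), 4, 1)."""
    grid = (h, n_seqs, -(-s_v // num_warps))   # ceil division
    block = (min(warp_size, s_v), num_warps, 1)
    return grid, block


def state_elements(h, n_seqs, s_v):
    """Elements touched per decode step: S_v * S_v * H * n_seqs."""
    return s_v * s_v * h * n_seqs
```

For the dense model at batch 128 this gives a `(48, 128, 32)` grid, the same at ctx 1024 and ctx 4096.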
A chunked GPU kernel for prefill is a TODO (delta-net-base.cpp:181 `//TODO: Add chunked kernel`);
the chunked CPU/graph path (`build_delta_net_chunking`) only runs for multi-token ubatches
(prefill), never at decode.
## 3. nsys decode profiling: GDN is a small share and does not scale with context
Qwen3.6-27B NVFP4, sm_121, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`, `llama-server` driven
to steady decode by a looping completion client. Kernel time bucketed by name (full classifier and
sqlites under `~/bench/gdn_study/`).
**(a) Share at the headline batch (npl128, ctx 1024), GPU 92.7% busy:**
| bucket | % of busy | us/launch |
|--------|----------:|----------:|
| GEMM_weight (`mul_mat_q`/`mul_mat_vec_q`) | 59.2 | - |
| **GDN_recurrent (`gated_delta_net_cuda`)** | **8.9** | 369 |
| GEMM_act_quant (`quantize_mmq_nvfp4`) | 8.2 | - |
| elementwise / act_glu / norm / rope | ~13.5 | - |
| embed_gather (`get_rows`) | 2.9 | - |
| **ATTENTION_full (`flash_attn`, 16 layers)** | **1.8** | 107 |
| copy_cast (`cpy`) | 1.8 | - |
| **GDN_conv (`ssm_conv`)** | **1.5** | - |
The whole GDN path (recurrent 8.9% + conv 1.5%) is ~10% of the step; full attention is ~2%; the
**weight GEMM dominates at ~67% (59.2% GEMM + 8.2% act-quant requant)**. This is the dense model,
where the grounding predicted the GEMM would be the lever.
**(b) Share at low batch (npl32, ctx 1024), weight-bandwidth (GEMV) regime, GPU ~100%:**
GEMM_weight 88.7%, GDN_recurrent 0.8%, ATTENTION_full 0.7%, GDN_conv 0.3%. At low batch the
weight-read GEMV swamps everything and GDN is negligible; the GDN share tracks the batch, not the
context.
**(c) Context-scaling control (the decisive test): matched batch npl4, pure decode, ctx 1024 vs
4096.** Small batch -> fast prefill -> a clean pure-decode capture (verified: GEMM is the M=1
`mul_mat_vec_q` decode GEMV, and the client completed decode rounds inside the window). Identical
decode-step counts (~190 steps, gated_delta_net launched 9141 vs 9134 times), so per-launch time is
a true per-step comparison:
| kernel / bucket | ctx 1024 | ctx 4096 | ratio |
|-----------------|---------:|---------:|------:|
| `gated_delta_net_cuda` us/launch | 10.3 | **8.0** | **0.78x (flat)** |
| GDN_recurrent share | 0.6% | 0.4% | flat/down |
| `ssm_conv` (GDN_conv) us/launch | 5.2 | 5.2 | 1.00x |
| `flash_attn_tile` us/launch | 27.1 | **85.0** | **3.14x** |
| ATTENTION_full share | 0.6% | 1.8% | 3.0x up |
| total ms / decode step | 84.9 | 86.0 | 1.01x |
The GDN kernel time is flat (even a hair faster) across a 4x context increase, while the
full-attention kernel grows ~3x, exactly the O(1)-vs-O(ctx) signature. The total step time barely
moves because at this batch the (context-independent) FP4 weight GEMM is 88% of the step. This is
the empirical confirmation of the code analysis: **llama's GDN decode does not re-scan the context.**
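The ratios in the table are plain quotients of the per-launch timings; a short check, using only the numbers reported above (the dict and helper are illustrative):

```python
# (ctx 1024, ctx 4096) timings copied from the table above.
measurements = {
    "gated_delta_net_cuda_us": (10.3, 8.0),
    "ssm_conv_us": (5.2, 5.2),
    "flash_attn_tile_us": (27.1, 85.0),
    "total_ms_per_step": (84.9, 86.0),
}


def ctx_ratio(name):
    """ctx-4096 value divided by ctx-1024 value, rounded to two decimals."""
    at_1024, at_4096 = measurements[name]
    return round(at_4096 / at_1024, 2)


ratios = {name: ctx_ratio(name) for name in measurements}
```

An O(ctx) kernel would be expected to approach 4x over this window; the GDN kernel stays below 1x.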
(An earlier npl32 ctx4096 attempt was discarded: with 32 parallel slots each independently
prefilling ~4100 tokens, the nsys window caught prefill, not steady decode - the `mul_mat_q(M=128)`
+ `flash_attn_ext_f16(ctx4096)` signature gave it away. The npl4 runs above avoid this by keeping
prefill short.)
## 4. Verdict and fix scope
**Efficient, not a bottleneck.** llama.cpp runs the Qwen3.6 GDN/linear-attention layers as a fused,
single-CUDA-kernel, O(1)-in-context recurrent-state update, with the conv and SSM state cached in
the recurrent memory across decode steps. It is algorithmically the same as vLLM's O(1)
`fused_recurrent` decode. The probe's worst case (llama re-scanning context => GDN layers ballooning
with context and concurrency) is **falsified**: the GDN kernel is flat across 4x context, and the
op carries no context-length parameter at all.
**So the GDN path is not the cheap model-specific lever.** It is a small-to-moderate, context-flat
share of the step (~0.4-0.8% at low batch, ~10% including conv at batch 128), and removing it would
not dent the 2.4x. The gap is the general kernel work, confirming `VLLM_DECODE_GROUNDING.md`:
1. the **FP4 weight GEMM** is the dominant bucket (~59% GEMM + ~8% `quantize_mmq_nvfp4` requant that
vLLM fuses away via native FP4-MMA / grouped Marlin); this is the biggest, hardest lever.
2. the **full-attention decode kernel** is the O(ctx) residual (the only thing that grows with
context, ~3x per-launch over 4x ctx), in the minority of full-attention layers.
If anything on the GDN side is ever worth touching, it is a bounded micro-optimization, not a
complexity fix: the kernel is memory-bound on the F32 recurrent state (state read+write is
`S_v^2 * H * batch` f32 elements = ~0.8 GB per GDN layer per step over 273 GB/s at batch 128, hence the ~8.9% share), and this
traffic is **intrinsic to the architecture - vLLM pays the identical state I/O**, so it is not a
llama-specific inefficiency. A future win could keep the recurrent state in bf16 or fuse the
`ssm_conv` + gated-norm into the delta-net kernel to shave that ~10%, but the ceiling is small and
it does not close the 2.4x. The throughput effort stays where the grounding put it: the FP4 GEMM
(fused act-quant + native FP4-MMA) and the full-attention decode kernel, with a CUDA-graphed
steady-state step as the bounded host-side add-on.
## Reproduce
- Metadata: `python3 gguf-py/gguf/scripts/gguf_dump.py --no-tensors ~/bench/q36-27b-nvfp4.gguf`.
- Code: `src/models/delta-net-base.cpp` (build_delta_net 425, autoregressive 289, fused 373,
build_recurrent_attn 527, build_conv_state 449); `src/llama-context.cpp:200-201,540-595`
(fused_gdn defaults/guard); `ggml/src/ggml-cuda/gated_delta_net.cu` (kernel 4-168, launch grid
184-185, dispatch 226-312).
- Profiles: `~/bench/gdn_study/drv.sh <label> <P> <K> <ctx> <delay> <dur>` runs `llama-server` under
nsys and drives `clientloop.py`; `catgdn.py <sqlite>` buckets kernels. Sqlites:
`gdn_npl128_ctx1024`, `gdn_npl32_ctx1024`, `gdn_npl4_ctx1024`, `gdn_npl4_ctx4096`.


@@ -1,344 +0,0 @@
# GDN recurrence byte gate + fused single-pass kernel design
Label: llama-fused-recurrence-design (READ-ONLY, no GPU). Source-and-math design only;
the byte-ratio measurement itself is produced by the `ncu-byte-gate` agent.
## TL;DR (the correction the workflow was set up to settle)
**The recurrence kernel is ALREADY single-pass on the f32 state.** `gated_delta_net_cuda<128>`
(after patches 0018 in-place write + 0019 fused gather) loads the whole `s0` column into registers
ONCE (`s_shard[rows_per_lane]`), runs the entire token loop in registers, and writes the new state
back ONCE - directly into the persistent cache slot (0018) or scratch. For decode `n_tokens==1`,
`keep_rs_t==false`: one register load, one register store, no re-read of state from DRAM.
The byte-gate's working hypothesis - "un-fused l2norm/gate/decay/recurrence/state-writeback/gather
each touching the f32 state, so a fused pass halves DRAM bytes" - is **false for the state**. Only
the recurrence kernel touches the 3 MB/seq state. The surrounding ops (`l2_norm`, `silu`, `sigmoid`,
the `gate` exp/softplus, `ssm_conv`, `concat`, `cpy`) all operate on the **small activations**
(q/k/v/g/beta), which are 100-800x smaller than the state. There is no 2x state re-streaming to
recover; the recurrence kernel is byte-minimal on state by construction.
Therefore a fused single-pass kernel **cannot move the dominant 196 ms recurrence** - that cost is
f32-state read+write bandwidth, already a single pass. The two real levers are decoupled:
1. **Fold the surrounding activation ops into the kernel** (MEDIUM effort): recovers the small
per-op buckets (`ssm_conv` 1.5% + `silu`/`sigmoid` 1.4% + 2x `l2_norm` + `concat` 2.1% + conv
`cpy` 2.0%, ~6-8% of the step) plus per-op launch overhead. Bit-exact. Ceiling ~93-96% of vLLM.
2. **bf16 state cache** (HIGH effort, NON-bit-exact): halves the dominant byte stream. The only
large lever on the 196 ms. Target KL < 1e-3 by keeping f32 register accumulation, storing only
the persisted cache in bf16.
Which of (1)/(2) is worth building hinges on the `ncu-byte-gate` byte ratio (below).
## Byte arithmetic (dense q36-27b-nvfp4, decode, npl128, S_v=128, H_v=48, batch=128)
State per (seq, GDN layer) = S_v^2 * H_v = 128*128*48 = 786,432 f32 = **3.0 MiB**.
Per kernel call (one GDN layer, full 128-seq batch), single pass:
- state read = 786,432 * 128 * 4 = 402.65 MB
- state write = 402.65 MB
- **state R+W = 805.3 MB/call** (768 MiB)
- activations (q,k 1 MB each; v 3 MB; attn-out 3 MB; g,beta tiny) ~= 8 MB/call = **<1%**.
Measured 4.08 ms/call (node-level trace) -> effective **197.4 GB/s**.
GB10 / DGX Spark LPDDR5X peak ~= **273 GB/s** -> **~72% of peak.**
48 GDN layers/step -> 38.7 GB of state traffic/step -> 196 ms = 51.6% of the 383.48 ms step. The ~8 MB
of activation traffic is noise; state is 99% of the recurrence bytes.
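The byte arithmetic above can be re-derived in a few lines. The inputs (4.08 ms/call, 273 GB/s peak, 48 GDN layers) are the figures quoted in this section; the variable names are illustrative.

```python
S_V, H_V, BATCH, F32_BYTES = 128, 48, 128, 4
N_GDN_LAYERS = 48
MS_PER_CALL = 4.08      # measured, node-level trace
PEAK_GBPS = 273.0       # GB10 LPDDR5X peak, as quoted

state_elems = S_V * S_V * H_V                            # per (seq, GDN layer)
state_mib = state_elems * F32_BYTES / 2**20
read_bytes = state_elems * BATCH * F32_BYTES             # one full-batch state read
rw_bytes = 2 * read_bytes                                # single pass: one read + one write
effective_gbps = rw_bytes / (MS_PER_CALL * 1e-3) / 1e9
fraction_of_peak = effective_gbps / PEAK_GBPS
state_gb_per_step = rw_bytes * N_GDN_LAYERS / 1e9
recurrence_ms_per_step = MS_PER_CALL * N_GDN_LAYERS
```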
### What this means for the open question
- The recurrence is single-pass, coalesced (transposed layout: lane reads `state[col*S_v + i]`,
consecutive lanes -> consecutive `i`), running at ~72% of peak BW. It is NOT at the 85% hardware
floor, but it is NOT re-streaming state either. The 72->85% headroom (~30 ms, bit-exact) is an
occupancy/coalescing tune, NOT a fusion win.
- vLLM `fused_recurrent_gated_delta_rule` does the SAME single-pass recurrence. If vLLM's recurrent
state cache is bf16 (model dtype) while llama's is f32, vLLM moves HALF the bytes on the dominant
stream - that alone is ~98 ms, i.e. essentially the whole residual decode gap. **This is the
single most decision-relevant number for the `ncu-byte-gate` agent to confirm: the dtype/bytes of
vLLM's GDN state cache vs llama's f32, plus llama's measured achieved-BW % on the recurrence
kernel.** If vLLM is bf16-state -> build (2). If vLLM is also f32-state and at ~85% -> llama is
at the floor, only (1) + coalescing remain and bit-exact parity tops out ~95%.
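The ~30 ms headroom figure follows from holding the bytes fixed and raising the achieved fraction of peak from ~72% to 85%. A sketch using this section's numbers (`recurrence_ms_at` is a hypothetical helper):

```python
def recurrence_ms_at(frac_of_peak, bytes_per_call=805_306_368,
                     peak_gbps=273.0, n_layers=48):
    """Per-step recurrence time if every GDN call ran at the given fraction of peak BW."""
    ms_per_call = bytes_per_call / (frac_of_peak * peak_gbps * 1e9) * 1e3
    return ms_per_call * n_layers


current_ms = recurrence_ms_at(0.723)     # roughly the measured 196 ms
floor_ms = recurrence_ms_at(0.85)        # same bytes at the 85% hardware floor
headroom_ms = current_ms - floor_ms
```

This is purely a bandwidth model; it assumes the kernel stays memory-bound at both operating points.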
## The fused single-pass kernel design
Two deliverables, layered. Build (1) first (bit-exact, de-risks the graph), gate (2) on the byte
verdict.
### (1) `ggml_gated_delta_net_decode_fused` - fold the activation ops into the kernel
Folds the pre-recurrence activation ops and the post-recurrence gated RMSNorm into the existing
single-pass recurrence kernel, so q/k/v/g are produced and consumed in registers/shared and never
make a separate DRAM round-trip, and the per-op launches collapse to one.
Current decode op chain in `build_layer_attn_linear` (qwen35.cpp 386-461), per GDN layer:
```
wqkv GEMM -> qkv_mixed (keep: GEMM, separate)
wqkv_gate GEMM -> z (keep: GEMM, separate)
ssm_beta GEMM -> beta -> sigmoid [FOLD beta sigmoid]
ssm_alpha GEMM -> alpha -> +ssm_dt -> softplus -> *ssm_a (gate) [FOLD softplus/mul -> per-head g]
build_conv_state: reshape, transpose qkv, CONCAT, cpy [concat/cpy -> conv-state plumbing, see note]
ggml_ssm_conv(conv_input, conv_kernel) [FOLD depthwise conv, K=4]
ggml_silu(conv_output) [FOLD silu]
views q_conv/k_conv/v_conv
ggml_l2_norm(q_conv); ggml_l2_norm(k_conv) [FOLD 2x l2norm]
[repeat_4d skipped on fused path]
ggml_gated_delta_net_inplace_ids(...) <-- THE recurrence kernel (196 ms)
build_norm_gated(output, ssm_norm, z): RMSNorm + silu(z) + mul [FOLD post gated-RMSNorm]
ssm_out GEMM (keep: GEMM, separate)
```
Fold list (what moves INTO the kernel):
- `beta` sigmoid: scalar per (head,seq); apply in-kernel when reading beta.
- `gate` g = softplus(alpha+dt)*a (GDA, g->ne0==1): scalar per (head,seq); compute/exp in-kernel.
The kernel already does `expf(*g_t)` (non-KDA path, line 85) - so feed RAW `alpha`+`dt` and the
`a` scale and do softplus+mul+exp in-kernel; removes the `add`/`softplus`/`mul` launches.
- `ssm_conv` (depthwise causal conv1d, kernel width 4) + `silu`: per channel a length-4 dot of the
conv state with `ssm_conv1d` then silu. This is the prologue: each warp/thread, before loading
state, computes its q/k/v channel by reading 3 cached conv-state taps + the current qkv_mixed
token, dotting the 4-wide kernel, applying silu. The conv state (conv_kernel-1=3 taps x conv_dim)
is tiny and already cached; fold its read here and its 1-token shift write into the epilogue
(replaces the `concat`+`cpy` conv-state update).
- `l2_norm` of q and k: a warp reduction over S_v of the per-head q/k vector. The recurrence kernel
already does warp reductions over S_v (the kv/attn dot products) - the l2norm reuses the same
warp-reduce primitive on q_reg/k_reg right after they are loaded, before the recurrence math.
- Post: `build_norm_gated` = RMSNorm(output, ssm_norm) * silu(z). The kernel already holds the
attn output `attn_col` per (head,seq,col) in registers at the end; fold an S_v warp-reduce RMS,
multiply by `ssm_norm` weight and by `silu(z)` (z read once), and write the final gated output -
removing the `rms_norm`+`silu`+`mul` launches and one activation round-trip.
State traffic UNCHANGED (still one read + one write). Activation traffic for conv/silu/l2norm/norm
collapses into the kernel's register/shared path; ~6 separate launches become 0. Expected recovery:
the ~6-8% surrounding-op buckets + launch overhead. **Bit-exact** if the numeric ordering is held
(see Numeric notes). Conservative ceiling ~365-375 tok/s dense (~93-96% of vLLM 391).
Data flow (per (h_idx=head, sequence=seq) block, decode n_tokens=1, S_v=128, num_warps=4):
1. PDL sync.
2. Prologue (per channel/lane): read 3 conv-state taps + current `qkv_mixed[t]` for this channel,
dot with `ssm_conv1d[0..3]`, add conv bias if any, `silu`. Produces this lane's q/k/v element.
3. l2norm q,k: warp-reduce sum(q^2), sum(k^2) over the S_v dim; scale q_reg,k_reg by rsqrt(.+eps).
4. Load `s0` column into `s_shard` (UNCHANGED single read).
5. Recurrence (UNCHANGED math: g-decay, kv = S^T k, delta = (v - g*kv)*beta, S = g*S + k(x)delta,
attn = S^T q * scale).
6. Write `s_shard` back to cache slot ONCE (UNCHANGED single write). Write the 1-token-shifted conv
state back to the conv cache (replaces concat+cpy).
7. Epilogue gated-RMSNorm: warp-reduce sum(attn^2) over S_v -> RMS; multiply by `ssm_norm[col]` and
by `silu(z[col])` (z loaded once); write final output element. ssm_out GEMM stays separate.
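Steps 3 and 7 above reduce to two small per-head helpers. A reference sketch, assuming the `rsqrt(sum + eps)` form written in step 3 and a mean-square RMS in step 7; the standalone ggml ops may place `eps` differently, so these helpers (names hypothetical) are not a bit-exact mirror.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def l2_normalize(x, eps):
    """Step 3: scale by rsqrt(sum(x^2) + eps) over the S_v dim."""
    inv = 1.0 / math.sqrt(sum(e * e for e in x) + eps)
    return [e * inv for e in x]


def gated_rms_norm(attn, weight, z, eps):
    """Step 7: RMS over S_v, then multiply by the norm weight and silu(z)."""
    inv_rms = 1.0 / math.sqrt(sum(a * a for a in attn) / len(attn) + eps)
    return [a * inv_rms * w * silu(g) for a, w, g in zip(attn, weight, z)]
```

Both reductions run over the same S_v dimension the recurrence already reduces over, which is why the design reuses the warp-reduce primitive.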
Inputs added to the op: `ssm_conv1d` weight, `ssm_norm` weight, `z`, conv-state cache view, raw
`alpha`/`dt`/`a`, eps. This is a wider op signature (src[8..]) - acceptable; gate it behind a new
`cparams.fused_gdn_decode` resolved exactly like `auto_fgdn` (graph_reserve + device-match probe,
llama-context.cpp 518-595) so it silently falls back to the current op chain if any device lacks it.
### (2) bf16 recurrent-state cache - the dominant-term lever (NON-bit-exact)
Only build if `ncu-byte-gate` shows vLLM moves fewer state bytes (bf16) OR llama's f32 recurrence is
already >=85% of peak (then f32 is at the floor and bf16 is the only way down).
- Store `ssm_states_all` (the recurrent-state cache) as bf16. Halves the dominant 805 MB/call -> at
the same ~197 GB/s -> ~2.04 ms/call -> ~98 ms/step saved (196 -> ~98). Dense projected
335 -> ~440+ tok/s (>= vLLM 391) if BW-bound holds; smaller dtype usually achieves a HIGHER % of
peak, so likely better.
- Kernel change: read state -> convert bf16->f32 into `s_shard` (registers stay f32); all recurrence
arithmetic in f32 (UNCHANGED); on write, convert f32->bf16. Accumulation precision is preserved
within a step; only the PERSISTED state is rounded to bf16 each step.
- Numerics: the recurrent state decays geometrically (g<1), so per-step bf16 rounding does not
accumulate unboundedly, but it is NOT bit-exact. Validate KL < 1e-3 vs the f32-state build over a
256-token greedy run; if KL fails, fall back to f32 state (keep it a cparams toggle). This is the
ONLY path to near-bit-exact parity-or-better on the dominant term; bit-EXACT parity on the 196 ms is
unreachable because the f32 state bytes are irreducible (single pass already).
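A minimal sketch of the widen/compute/narrow pattern described above, assuming a round-to-nearest-even f32->bf16 conversion. The helper names and the toy `decay * s + update` recurrence are hypothetical stand-ins; the real kernel would keep its own recurrence math unchanged.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// f32 -> bf16 with round-to-nearest-even (NaN is kept quiet).
uint16_t f32_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffffu) > 0x7f800000u) return (uint16_t) ((u >> 16) | 0x40u);
    u += 0x7fffu + ((u >> 16) & 1u);
    return (uint16_t) (u >> 16);
}

// bf16 -> f32 is exact: the 16 bits become the high half of the f32 word.
float bf16_to_f32(uint16_t h) {
    const uint32_t u = (uint32_t) h << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// One step over a bf16-persisted state. `decay` and `update` stand in for the
// real recurrence (hypothetical); only the load/store width differs from f32.
void step_bf16_state(std::vector<uint16_t> & state, float decay, float update) {
    for (uint16_t & h : state) {
        float s = bf16_to_f32(h);  // load: bf16 -> f32 register
        s = decay * s + update;    // all arithmetic stays in f32
        h = f32_to_bf16(s);        // store: f32 -> bf16, the only rounding point
    }
}
```

Because each element is stored in 2 bytes instead of 4, the state traffic per call halves while the in-step accumulation precision is untouched.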
## Numeric / bit-exactness notes (for fold (1))
- l2norm/RMS use f32 warp-reduce accumulation (matches `ggml_l2_norm`/`ggml_rms_norm` f32 sum).
Order of summation across lanes differs from the standalone op's sequential sum -> floating
reassociation. To stay bit-exact, replicate the standalone op's reduction order, OR accept a
tiny reassociation delta and gate on KL<1e-3 (the workflow's near-bit-exact target). Recommend:
ship fold (1) behind the cparams probe and assert greedy md5 match vs the current chain (0019
already established the harness: dense text md5, MoE byte-identical).
- Recurrence math, scale, g-exp order, beta apply: keep EXACTLY as in `gated_delta_net_cuda` /
`ops.cpp` reference (lines 84-141 .cu, 10685-10730 ops.cpp). Do not reorder the
v - g*kv -> *beta -> S update -> S^T q sequence.
- conv: depthwise dot of width-4 kernel in f32, then silu - identical to `ggml_ssm_conv`+`ggml_silu`
if done in the same order.
- gate softplus: `softplus(x)=log1p(exp(x))`; match ggml's `ggml_softplus` (has the >20 fast path)
to stay bit-exact.
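The reassociation concern in the first bullet can be reproduced on the CPU. The sketch below compares a sequential f32 sum with a butterfly (xor-shuffle style) reduction that emulates how lanes combine in a warp reduce; both function names are hypothetical, and the input is a deliberately adversarial toy vector rather than real q/k data.

```cpp
#include <cstddef>
#include <vector>

// Sequential left-to-right f32 sum, as a standalone single-thread op would do.
float sequential_sum(const std::vector<float> & v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Butterfly reduction emulating a warp xor-shuffle: at each stage every lane i
// adds the value held by lane (i ^ off). Requires a power-of-two lane count.
float butterfly_sum(std::vector<float> v) {
    for (size_t off = v.size() / 2; off > 0; off /= 2) {
        std::vector<float> next(v.size());
        for (size_t i = 0; i < v.size(); ++i) next[i] = v[i] + v[i ^ off];
        v = next;
    }
    return v[0];
}
```

For `{1e8f, 1.0f, -1e8f, 1.0f}` the two orders round differently in f32, so the results are not bit-identical even though both are valid sums. This is why the md5 gate needs either a matched reduction order or a KL gate instead.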
## Implementation scope
- (1) `.cu`: extend `gated_delta_net_cuda` with a decode-fused template specialization (or a new
kernel) that does conv+silu prologue, q/k l2norm, recurrence, conv-state shift write, gated-RMSNorm
epilogue. Add `ggml_cuda_op` dispatch. CPU mirror in `ops.cpp` for parity/CI.
- (1) `ggml.h`/`ggml.c`: new builder `ggml_gated_delta_net_decode_fused` (extra src: ssm_conv1d,
ssm_norm, z, conv-cache view, alpha/dt/a, eps + op_params for eps).
- (1) graph edits: `delta-net-base.cpp build_recurrent_attn` (add the decode-fused branch alongside
the existing fused/ids branch); `qwen35.cpp` + `qwen35moe.cpp` `build_layer_attn_linear` (route
the pre/post ops into the op when `cparams.fused_gdn_decode`); leave `qwen3next.cpp`,
`kimi-linear.cpp`, the non-fused and rollback (n_rs_seq>0) paths unchanged.
- (1) `llama-context.cpp`: `auto_fgdn`-style device-match probe to enable/disable the decode-fused
op (silent fallback). `cparams.h`/`cparams.fused_gdn_decode`.
- (2) bf16 state: cache dtype change in the recurrent-memory allocation + the kernel load/store
convert + a `cparams` toggle + KL gate. Touches `gated_delta_net.cu` load/store, the inplace/ids
builders' state asserts, and the recurrent cache type.
## Risk register
- (1) is MEDIUM effort, bit-exact-targetable, but bounded upside (~6-8% + launches; ceiling ~95% of
vLLM). Worth it only if the workflow wants >90% and accepts no bf16.
- (2) is the only large lever on the dominant 196 ms but is NON-bit-exact (KL-gated). If vLLM is
f32-state, (2) takes llama BELOW vLLM's precision, not toward parity - a product call, not a perf
call.
- The widened op signature (many srcs) raises maintenance cost and the device-match probe matters
(CPU offload of a GDN layer must fall back cleanly).
- Do NOT expect a fused recurrence to cut the 196 ms: it is already one read + one write of f32
state. Re-confirm with the `ncu-byte-gate` achieved-BW number before committing HIGH effort.
---
# MEASUREMENT + VERDICT (label ncu-byte-gate, THE GPU agent) - GATE SETTLED
The design above predicted the answer; this is the decisive measurement that confirms it.
## VERDICT: NO-BUILD the fused single-pass recurrence. BUILD bf16 SSM state (design's lever (2)).
Deciding number: **llama re-stream factor = ~1.0x** (mathematically capped at <=1.33x; >=1.5x is
physically impossible). llama's recurrence kernel is ALREADY single-pass, coalesced, and at
**74% of GB10 peak BW** - MORE bandwidth-efficient than vLLM's fused triton kernel (41% of peak).
The whole 2x DRAM gap vs vLLM is **f32 (llama) vs bf16 (vLLM) state-cache width**, not re-streaming.
## ncu HW counters were BLOCKED; timing + geometry gave the byte ratio anyway
- `ncu dram__bytes` and `nsys --gpu-metrics-devices` both return `ERR_NVGPUCTRPERM`
(`NVreg_RestrictProfilingToAdminUsers` restricted, root-only; no passwordless sudo on dgx.casa).
DRAM byte counters are unobtainable on this box.
- Decisive fallback (no perf counters): CUPTI kernel TIMING (allowed) + EXACT byte geometry from
the kernel source. bytes_moved <= peak_BW x duration gives a HARD CAP on the re-stream factor;
comparing implied effective BW between llama and vLLM (same model, same B, both eager) settles it.
## Measured (clean nsys CUDA timing; build-cuda-base df1cc97 Lever-1; both B=128, both eager with CUDA graphs OFF)
llama: `llama-batched-bench -npp 8 -ntg 12 -npl 128 -ub 2048`, GGML_CUDA_DISABLE_GRAPHS=1.
vLLM: postssm_decomp/vllm_decode.sqlite, NSEQ=128, enforce_eager=True (apples-to-apples).
| kernel | state dtype | bytes R+W/call | duration/call (steady) | eff. BW | % of 273 peak | re-stream |
|---|---|---|---|---|---|---|
| llama gated_delta_net_cuda | f32 | 805.3 MB | **3.98 ms** (min 3.90 max 4.33, grid 48x128x32) | 202 GB/s | **74%** | ~1.0x |
| vLLM fused_recurrent...packed_decode | bf16 | 402.6 MB | **3.62 ms** (min 3.53 max 3.96, grid 4x6144x1) | 111 GB/s | **41%** | ~1.0x |
- llama recurrence/step = 3.98 x 48 = **191 ms** (50% of 384 ms step; matches STATE 196 ms).
- vLLM recurrence/step = 3.62 x 48 = **174 ms**. Per-call gap llama-vs-vLLM is only +10%, NOT 2.8x.
The old "1.47 ms near-vLLM" was prefill-contaminated; clean decode is 3.98 ms (confirms STATE).
- Both kernels verified SINGLE-PASS in source (llama: s_shard load-once/store-once, 128 consecutive
f32/warp = coalesced; vLLM packed_decode: `b_h += load(p_h0).to(f32)` once, `store(p_ht, b_h.to(bf16))`
once). vLLM cache dtype = state_dtype = model_dtype = bf16 (`_mamba_state_dtype` default "auto" ->
model dtype; config.json dtype=bfloat16). Geometry identical (H=48, k/v head_dim 128, S_v 128).
## Why re-stream ~1.0x (the gate number)
Most bytes a 3.98 ms call could move at 273 GB/s peak = 1.087 GB = **1.33x the 816 MB minimal**.
1.5x/2x re-stream would need >peak BW -> impossible. Source proves single-pass+coalesced -> 1.0x end:
~816 MB at 202 GB/s = 74% peak. A fused single-pass rewrite recovers ~0 state bytes => NO-BUILD.
## The lever: bf16 SSM state (design (2)) - confirmed, large, parity-to-ahead
2x recurrence bytes vs vLLM = 100% f32-vs-bf16 cache. llama's kernel is the more efficient one
(74% vs 41% peak), so bf16 state (cache + load/store bf16, f32 register compute, exactly as vLLM):
- 805.3 -> ~413 MB => at 74% peak ~2.0 ms/call => 191 -> ~96 ms/step, save ~95 ms => step ~289 ms
(~443 tok/s, AHEAD of vLLM at 327 ms/step = 391 tok/s). Conservative (50% peak on smaller footprint): ~3.0 ms/call =>
save ~45 ms => step ~339 ms = roughly vLLM parity. Range = parity-to-ahead.
- NON-bit-exact vs llama's f32 reference, but EQUAL precision to vLLM (which is bf16). Gate on
PPL/KL vs the f32 build, not md5. "Bit-exact parity with vLLM" was never on the table - vLLM is bf16.
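The projection above can be reproduced from the measured inputs. This is a sketch under the note's own assumptions (the kernel stays bandwidth-bound, 48 calls per step, 128 sequences, decimal units); `project_bf16_state` is a hypothetical helper, and its unrounded outputs land close to, not exactly on, the rounded figures quoted above.

```cpp
struct Projection {
    double ms_per_call;
    double step_ms;
    double tok_per_s;
};

// Project step time if the state traffic shrinks to `state_bytes` and the
// kernel sustains `frac_of_peak` of `peak_bw`.
Projection project_bf16_state(double state_bytes, double peak_bw, double frac_of_peak,
                              int calls_per_step, int n_seqs,
                              double base_ms_per_call, double base_step_ms) {
    Projection p{};
    p.ms_per_call = state_bytes / (peak_bw * frac_of_peak) * 1e3;
    // Only the recurrence term moves; the rest of the step is held constant.
    p.step_ms     = base_step_ms - calls_per_step * (base_ms_per_call - p.ms_per_call);
    p.tok_per_s   = n_seqs / (p.step_ms * 1e-3);
    return p;
}
```

Calling it with `frac_of_peak = 0.74` gives the optimistic end and `0.50` the conservative end of the parity-to-ahead range.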
## Conv-path (no-regret conv-fusion lever sizing), llama steady decode, per call x48
concat_cont 169.6 us (8.14 ms/step) + cpy_scalar 120.1 us (5.76) + ssm_conv_f32 115.9 us (5.56)
= ~19.5 ms/step (~5%). Conv STATE ~12.6 MB (tiny) -> this is LAUNCH/small-kernel overhead, not bytes
-> a FUSION lever (design (1)), secondary to bf16 state. l2_norm 6.8 us, gdn_gather 1.21 us (no-op,
identity seqs -> confirms gather does NOT re-stream state at steady decode).
## One-line answer
llama: 805 MB/call, 74% peak, re-stream ~1.0x (<=1.33x). vLLM: 402 MB/call (bf16), 41% peak.
conv-path: ~12.6 MB (launch-bound ~19.5 ms/step, not byte-bound).
=> NO-BUILD fused recurrence (already single-pass, more efficient than vLLM); BUILD bf16 state
(halves the dominant 805 MB, ~45-95 ms/step, parity-to-ahead). Deciding number: re-stream ~1.0x.
---
# FINAL DECISION (synthesis of all four agents) - the five points
This closes the workflow. Inputs: `ncu-byte-gate` (measured byte ratio), `vllm-fused-recurrence-study`
(vLLM's single-pass boundary), `llama-fused-recurrence-design` (the fold/levers), `conv-fusion-design`
(the no-regret conv in-place lever). They agree on every number; the decision is unambiguous.
## (1) Byte-ratio verdict - the decisive number
**llama is at the hardware bandwidth floor, NOT re-streaming.** Re-stream factor = **~1.0x**, hard
capped at **<=1.33x** (the most bytes a 3.98 ms call can move at 273 GB/s peak is 1.087 GB = 1.33x
the 816 MB minimal; >=1.5x is physically impossible). The recurrence kernel runs at **74% of GB10
peak BW** (805.3 MB R+W / 3.98 ms = 202 GB/s) - MORE bandwidth-efficient than vLLM's fused triton
`packed_decode` at **41% of peak** (402.6 MB / 3.62 ms = 111 GB/s). Source confirms both are
single-pass and coalesced (llama `s_shard` load-once/store-once, 128 consecutive f32/warp; vLLM
`b_h = load(p_h0)` once -> f32 regs -> `store(p_ht, b_h.to(bf16))` once). The entire 2x DRAM gap
vs vLLM is **100% f32 (llama) vs bf16 (vLLM) state-cache WIDTH**, not extra passes.
## (2) Fused single-pass GDN recurrence: **NO-BUILD**
A fused single-pass rewrite recovers **~0 state bytes** because the kernel is already one read + one
write of the f32 state, and the un-fused l2norm/sigmoid/softplus/gate ops act on the tiny
q/k/g/beta projections (8 MB/call, <1%), not the 805 MB state. There is no second pass to fuse away.
Expected ceiling if built anyway: unchanged 191 ms recurrence -> no movement on the dominant 50% of
the step. **Do not build it.** This refutes the workflow's founding hypothesis with a measured cap.
## (3) Conv-state in-place fusion (`conv-fusion-design`): **GO - confirmed, bit-exact, no-regret**
This is independent of the recurrence verdict and holds regardless. Build a fused
`ggml_ssm_conv_update_inplace` (mirrors the 0018/0019 in-place pattern) that, at decode
(`n_seq_tokens==1 && !keep && fused-AR && n_rs_seq==0`), assembles the width-4 conv window in
registers from the cached K-1=3 taps + the native `qkv_mixed` token, computes the depthwise conv,
folds `silu`, and writes the 1-token-shifted ring state back in place.
- Eliminates `concat_cont` (8.14 ms/step), `cpy_scalar` (5.76 ms/step), the transpose
materialization, and the separate `ggml_silu`; replaces `ssm_conv` with a ~1.6x-byte fused kernel
(5.56 -> ~9 ms). **Net ~12-14 ms/step = +3.1 to +3.7%** -> dense 335 -> ~346-349 tok/s @npl128
(88.5-89.3% of vLLM 391).
- **Bit-exact**: identical ascending-j width-4 FMA order as `ssm_conv_f32` at i==0, same `silu`
primitive, same f32 state bytes written - only the producing node changes. Greedy output is
bit-identical to the 0018/0019 baseline. LOW risk, additive to everything else.
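The in-place conv update for one channel of one sequence can be sketched as follows. It assumes the width-4 case from the text (3 cached taps plus the current token); `conv_update_inplace` is a hypothetical CPU stand-in, not the proposed CUDA op.

```cpp
#include <cmath>

// state: the K-1 = 3 cached taps, oldest first. w: the width-4 depthwise kernel.
// Returns silu(conv) and leaves `state` shifted by one token, in place.
float conv_update_inplace(float state[3], const float w[4], float x_cur) {
    // Assemble the full window in locals before any write to `state`.
    const float win[4] = { state[0], state[1], state[2], x_cur };
    // Ascending-j accumulation order, as the bit-exactness note requires.
    float acc = 0.0f;
    for (int j = 0; j < 4; ++j) acc += win[j] * w[j];
    // Write the 1-token-shifted state back (replaces concat + cpy).
    state[0] = win[1];
    state[1] = win[2];
    state[2] = win[3];
    // Folded silu: x * sigmoid(x).
    return acc / (1.0f + std::exp(-acc));
}
```

Loading the window before writing is what makes reading and writing the same cache slot safe within one thread.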
## (4) Recurrence floor-mover: bf16 SSM state - **BUILD (gated product call)**, and the bit-exact question
Since the recurrence is at the f32 byte floor, the **only** lever on the dominant 191 ms (50% of the
step) is narrowing the state-cache width to bf16, exactly as vLLM does.
- Store `ssm_states_all` in bf16; load bf16->f32 into `s_shard`, run ALL recurrence arithmetic in
f32 (UNCHANGED), store f32->bf16. 805.3 -> ~413 MB/call -> ~2.0-3.0 ms/call -> save **~45-95 ms/
step** -> step 384 -> **289-339 ms** = parity-to-ahead of vLLM (327 ms / 391 tok/s; projected
360-443 tok/s @npl128).
- **Bit-exact parity is UNREACHABLE on this term, by construction.** The f32 state bytes are
irreducible (single pass already), so matching vLLM's *speed* on the recurrence requires matching
vLLM's *width* (bf16). bf16 state is non-bit-exact vs llama's own f32 reference, but it is **equal
precision to vLLM** (vLLM's state cache is itself bf16). "Bit-exact parity with vLLM" was never on
the table - vLLM is the less-precise reference here. Gate the build on **KL < 1e-3 / PPL-delta**
over a 256-token greedy run, not on md5, with a `cparams` f32 fallback. The geometric state decay
(g<1) bounds per-step bf16 rounding, so accumulation is well-behaved.
- Bit-exact gains that ARE reachable (vs llama f32): the conv fusion (3) and the activation-fold
lever (1) - together ~9-11% - but they top out near ~93-96% of vLLM and never touch the 50%
recurrence term.
## (5) Ranked build order + the single highest-value next step
1. **Conv-state in-place fusion (BUILD NEXT - no-regret).** Bit-exact, LOW risk, saves ~12-14 ms/step (~+3% tok/s),
reuses the proven 0018/0019 in-place op pattern. Build this first because it is risk-free, purely
additive, and de-risks the in-place conv-cache plumbing the bf16 work also touches.
Confirming measurement: nsys decode trace shows `concat_cont` and `cpy_scalar` GONE, step
384 -> ~370-372 ms, and greedy md5 IDENTICAL to the 0019 baseline (dense text md5, MoE
byte-identical).
2. **bf16 SSM state cache (HIGHEST-VALUE lever - gated product call).** The ONLY lever on the
dominant 50% recurrence term: saves ~45-95 ms/step, step -> 289-339 ms = parity-to-ahead of vLLM.
Non-bit-exact vs llama f32, equal precision to vLLM. Confirming measurement: `gated_delta_net_cuda`
duration/call drops 3.98 -> 2.0-3.0 ms in nsys; **KL < 1e-3 / PPL-delta vs the f32 build over
256-token greedy** passes; step time and tok/s hit the 289-339 ms / 360-443 tok/s band; cparams
f32 fallback verified.
3. **Activation-op fold, design lever (1) (OPTIONAL, bit-exact, diminishing).** After build step #1 (conv fusion) takes the
conv/silu buckets, the residual fold (q/k l2norm + gate softplus/sigmoid + gated-RMSNorm epilogue
+ launch overhead) is ~3-5%; bit-exact but bounded. Build only if the goal is >90% of vLLM with
no bf16. Confirming measurement: per-op launch count for the GDN layer collapses to ~1; greedy
md5 unchanged.
**Single highest-value next implementation step: bf16 SSM state cache (#2)** - it is the only change
that moves the dominant 191 ms term and reaches vLLM parity-to-ahead. Its confirming measurement is
the `gated_delta_net_cuda` per-call time dropping to ~2.0-3.0 ms AND the KL<1e-3 gate passing.
**Recommended immediate build: the conv fusion (#1) first** (no-regret, bit-exact) so the bf16 work
lands on an already-cleaned conv path; ship #2 as a `cparams`-gated, KL-validated product option.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,34 +0,0 @@
# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed in both trees.
## What
Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
(read path gather -> indexed in-kernel read; values + reduction order unchanged).
## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
- MoE npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
## Artifacts
- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
## gather-bench landed (worktree)
Rigorous same-session A/B (DGX GB10) validated patch 0028 bit-exact and lifting both models;
results folded into LEVER1_GATHER_RESULTS.md and the regenerated 0028 patch. The bench files
first landed in this worktree via concurrent merge c1f1d1e8e (origin/master sweep); this commit
re-anchors them with sign-off attribution. DGX llama tree dedicated commit: fafe878 (code
byte-identical to 944636c; docs-only amend). Both trees committed, not pushed.

@@ -1,163 +0,0 @@
# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact)
The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest
kernel vLLM has no equivalent of (~5.2 ms/step MoE decode; also present in dense): vLLM updates its
gated-DeltaNet recurrent state in-place inside the fused decode kernel, while llama ran a separate
`ggml_get_rows` gather. Patch 0019 fused the SSM recurrent-state gather; patch 0021 fused the conv
compute/write-back but KEPT a `build_rs` gather for the conv taps ("tiny; not one of the eliminated
buckets"). This patch closes that residual.
## Which gather was still firing (nsys-located, DGX GB10 sm_121)
Profiled MoE `q36-35b-a3b-nvfp4` at batch-128 decode (`llama-batched-bench -npp128 -ntg24 -npl128
-fa on`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`). The decode-window `k_get_rows_float<float,float>`
distribution was bimodal: a BIG cluster of **~720 instances (= 30 GDN layers x 24 decode steps) at
~115 us each** plus small embedding/router gathers.
The big gather's geometry (`grid=(ne10=128, block_num_y=96, 1)`) decodes to **128 rows (= n_seqs
active sequences) of ne00 = 24576 floats**. With the model's real dims (`d_conv=4, d_inner=4096,
n_group=16, d_state=128`):
- `n_embd_r = (d_conv-1) * (d_inner + 2*n_group*d_state) = 3 * 8192 = 24576` -> `block_num_y=96` EXACT match.
- `n_embd_s = d_state * d_inner = 524288` (the SSM state, gridY 2048 - already fused by 0019).
So the residual `k_get_rows` is the **conv-state tap gather** in `build_conv_state_fused`
(`src/models/delta-net-base.cpp`), which called the plain 4-arg `build_rs` -> `ggml_get_rows` of the
24576-float conv-state row x 128 sequences, once per GDN layer per decode step (~3.4 ms/step here,
~5.2 ms/step at steady ntg=128). The SSM-state gather is already fused, so this conv gather is the
last `k_get_rows` in the GDN decode path.
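The geometry decode above is reproducible with a few integer expressions. The 256-thread gather block size is an assumption inferred from the numbers in the text (24576 / 96 = 256, 524288 / 2048 = 256); the struct and function names are hypothetical.

```cpp
#include <cstdint>

struct GdnDims {
    int64_t d_conv;
    int64_t d_inner;
    int64_t n_group;
    int64_t d_state;
};

// Floats per sequence in the conv-state row: (d_conv-1) taps over all conv channels.
int64_t n_embd_r(const GdnDims & d) {
    return (d.d_conv - 1) * (d.d_inner + 2 * d.n_group * d.d_state);
}

// Floats per sequence in the SSM recurrent state.
int64_t n_embd_s(const GdnDims & d) {
    return d.d_state * d.d_inner;
}

// Grid-y blocks a row gather needs for a row of ne00 floats.
// block_size = 256 is an assumption inferred from the profiled grids.
int64_t get_rows_blocks_y(int64_t ne00, int64_t block_size = 256) {
    return (ne00 + block_size - 1) / block_size;
}
```

Matching the profiled `block_num_y` against these two row sizes is how the residual gather is attributed to the conv taps rather than the SSM state.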
## What changed (mirror of the 0019 SSM gather fusion; bit-exact by construction)
New op `ggml_ssm_conv_update_inplace_ids` (reuses `GGML_OP_SSM_CONV`, discriminated by a non-null
`src[4]` = ids). Instead of a pre-gathered tap scratch, it takes the FULL conv-state cache (`src[0]`)
plus the per-sequence `ids` (= the recurrent-state `s_copy`, `src[4]`; `op_params[1]=rs_head`) and
reads each active sequence's prior K-1 taps directly from `cache[ids[s]]` in the kernel. This removes
the separate `k_get_rows` launch.
Race-free, exactly mirroring 0019:
- **Identity** sequences (`ids[s] == rs_head + s`, the whole AR-decode path) read the taps in place
from the `conv_state_dst` write slot. The kernel loads the full conv window into registers before
it writes the 1-token-shifted ring back, so read==write slot is race-free per (channel, seq) thread.
- **Non-identity** sequences (reorder / `rs_zero` remap at a prefill->decode boundary) are gathered
into a disjoint scratch by a small `ssm_conv_gather_nonident_kernel` first (no-op at steady decode),
so the update kernel never reads a slot another block writes.
The read VALUES are unchanged (identity in-place taps == the gathered taps == `cache[ids[s]]`); only
the read PATH changes from a `ggml_get_rows` materialization to an indexed in-kernel read. The conv
math, ascending-tap FMA order, silu and the ring write-back are byte-identical to 0021.
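The identity / non-identity split can be sketched on the CPU as two passes. This is an illustrative reduction to one channel with silu omitted; `conv_update_ids` and its flat `[n_cells][taps]` layout are hypothetical and do not mirror the real tensor strides.

```cpp
#include <cstddef>
#include <vector>

// cache: [n_cells][taps] for a single channel. ids[s] is the cell sequence s
// reads from; sequence s always writes to cell rs_head + s.
std::vector<float> conv_update_ids(std::vector<float> & cache, size_t taps,
                                   const std::vector<int> & ids, int rs_head,
                                   const std::vector<float> & w,      // taps + 1 weights
                                   const std::vector<float> & x_cur)  // one token per sequence
{
    const size_t n_seqs = ids.size();
    // Pass 1: gather only non-identity rows into a disjoint scratch, so pass 2
    // never reads a cell that another sequence has already overwritten.
    std::vector<float> scratch(n_seqs * taps, 0.0f);
    for (size_t s = 0; s < n_seqs; ++s) {
        if (ids[s] != rs_head + (int) s) {
            for (size_t t = 0; t < taps; ++t) {
                scratch[s * taps + t] = cache[(size_t) ids[s] * taps + t];
            }
        }
    }
    // Pass 2: load the window, convolve, write the shifted taps to the write slot.
    std::vector<float> out(n_seqs, 0.0f);
    for (size_t s = 0; s < n_seqs; ++s) {
        const bool ident = ids[s] == rs_head + (int) s;
        float * dst = &cache[(size_t) (rs_head + (int) s) * taps];
        const float * src = ident ? dst : &scratch[s * taps];
        std::vector<float> win(taps + 1);
        for (size_t t = 0; t < taps; ++t) win[t] = src[t];
        win[taps] = x_cur[s];
        float acc = 0.0f;
        for (size_t j = 0; j <= taps; ++j) acc += win[j] * w[j];  // ascending taps
        for (size_t t = 0; t < taps; ++t) dst[t] = win[t + 1];
        out[s] = acc;
    }
    return out;
}
```

With `ids = {1, 0}` both sequences are non-identity and each writes the cell the other reads, which is the case the scratch pass exists for; at steady decode every sequence is identity and pass 1 does nothing.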
Files:
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: `ggml_ssm_conv_update_inplace_ids` builder
(src[0]=full cache [K-1,channels,n_cells], src[1]=conv_kernel, src[2]=x_cur, src[3]=conv_state_dst,
src[4]=ids; op_params[0]=fuse_silu, op_params[1]=rs_head).
- `ggml/src/ggml-cuda/ssm-conv.cu`: `ssm_conv_gather_nonident_kernel` + `ssm_conv_update_ids_f32`
kernel + `ggml_cuda_op_ssm_conv_update_ids` + a `src[4]`-discriminated branch in `ggml_cuda_op_ssm_conv`.
- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_ids_f32` (window copied to a
local before the possibly-aliasing write) + dispatch branch.
- `src/models/delta-net-base.cpp`: `build_conv_state_fused` now feeds the FULL cache + ids through the
`build_rs` `get_state_rows` lambda (the rs_zero clear + extra-states copy still run around it),
exactly like the 0019 recurrent-attn fusion. The `qwen35` / `qwen35moe` / `qwen3next` callers are
unchanged (they already route the single-token decode path here).
- `tests/test-backend-ops.cpp`: `test_ssm_conv_update_ids` (16 cases) - ids = a shuffled permutation
with `rs_head=0`, so each case exercises BOTH the identity in-place read and the non-identity cache
read; validates the conv+silu output vs the CPU reference.
## GATE: test-backend-ops (CUDA0 vs CPU, 2/2 backends)
- SSM_CONV_UPDATE_IDS: OK (NEW; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
- SSM_CONV_UPDATE: OK (0021 path intact)
- SSM_CONV: OK
- GATED_DELTA_NET: OK
- GET_ROWS: OK
## GATE: greedy bit-exactness (--temp 0 --seed 1 -n 48, -fa on) - BOTH models BYTE-IDENTICAL
| model | baseline md5 | 0028 md5 | result |
|--------------------|----------------------------------|----------------------------------|-----------------|
| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | BYTE-IDENTICAL |
| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | BYTE-IDENTICAL |
(Built on the `paged` branch at its f32 default, i.e. 0026, where the hybrid default is f32; the
baseline was re-confirmed on the same build before the edit.)
## nsys proof - the gather is eliminated (MoE decode, npp128 ntg24 npl128, same window)
| kernel | before | after |
|-------------------------------------|---------------|-------------------------------|
| `k_get_rows_float<float,float>` cnt | 10174 | 9454 (720 fewer = 30 GDN x 24)|
| `k_get_rows_float<float,float>` sum | 186.3 ms | 102.8 ms (-83.5 ms) |
| conv update kernel | `ssm_conv_update_f32` 720 | `ssm_conv_update_ids_f32` 720 |
| `ssm_conv_gather_nonident_kernel` | - | 720 x ~1.1 us = 0.8 ms (no-op at decode) |
The 720 big ~115 us conv gathers are gone; the only added work is a ~1.1 us no-op gather kernel per
layer-step (all sequences identity during steady AR decode). This matches 0019's "no-op at decode,
median ~1.2 us" non-identity gather.
## Preliminary throughput (post-fusion, single point; rigorous A/B is the bench phase)
- MoE `q36-35b-a3b-nvfp4` npl128 (`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`): **783.9 t/s**, step
163.3 ms/step (MOE_GAP @0025 was 752.3 t/s / 169.8 ms/step => -6.5 ms/step in this stack).
- dense `q36-27b-nvfp4` npl128: **377.3 t/s** (~96% of vLLM 391; includes 0022/0026 base gains).
- npl128 ran clean (EXIT=0) on both - the non-identity boundary path does not crash.
## Verdict
Bit-exact (both md5 gates byte-identical, all test-backend-ops pass), the residual `k_get_rows` conv
gather is eliminated (nsys-confirmed), and decode throughput improves. Helps BOTH dense and MoE (the
shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path (after 0019 SSM-state
+ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
Assisted-by: Claude:opus-4.8 [Claude Code]
## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
| model | base (0026) | lever1 (0028) | recorded baseline |
|-------------------|----------------------------------|----------------------------------|----------------------------------|
| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
| npl | base S_TG | lever1 S_TG | delta | vs vLLM 391 |
|-----|-----------|-------------|--------|----------------|
| 32 | 208.56 | 209.39 | +0.40% | - |
| 128 | 369.95 | 377.83 | +2.13% | 94.6% -> 96.6% |
MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
| npl | base S_TG | lever1 S_TG | delta | vs vLLM 901 |
|-----|-----------|-------------|--------|----------------|
| 32 | 456.85 | 459.56 | +0.59% | - |
| 128 | 763.47 | 777.95 | +1.90% | 84.7% -> 86.3% |
Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
| kernel | base (0026) | lever1 (0028) |
|---------------------------------|------------------------|----------------------------------------------|
| k_get_rows_float<float,float> | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms |
| delta | | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
| ssm_conv_update(_ids)_f32 | 219.71 ms (update) | 225.75 ms (update_ids, +6 ms) |
| ssm_conv_gather_nonident_kernel | - | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
the -3.13 ms/step throughput delta at npl128.
### Verdict (gather-bench)
Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.

@@ -1,77 +0,0 @@
# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
bit-exact tensor reshape that re-routes the per-layer SSM output projection
from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
## The mechanism (profiled, both engines)
Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
the ssm_out weight read across the 128 sequences. vLLM packs the same projection
into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
only the output projection was in 3D SSM layout.
## The fix
In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
proven by the in-projection.
```
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
...
cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
```
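Why the reshape changes the kernel choice can be sketched as a toy dispatch rule. The `ne[1] <= 8` threshold is taken from the mechanism note above; the real ggml-cuda dispatch checks more conditions than this, and `route`, `Shape`, and `MatmulPath` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

enum class MatmulPath { MMVQ, MMQ };

// ggml-style shape: ne[0] is the innermost (row) dimension.
using Shape = std::array<int64_t, 4>;

// Toy routing rule: small ne[1] goes to the vector kernel, otherwise to the GEMM.
MatmulPath route(const Shape & src1_ne, int64_t mmvq_max_batch = 8) {
    return src1_ne[1] <= mmvq_max_batch ? MatmulPath::MMVQ : MatmulPath::MMQ;
}

// Total element count; a reshape of contiguous data must preserve it.
int64_t nelements(const Shape & ne) {
    return ne[0] * ne[1] * ne[2] * ne[3];
}
```

The 3D view `{6144, 1, 128, 1}` and the 2D view `{6144, 128, 1, 1}` hold the same 786432 contiguous values, but only the 2D view exposes the 128 sequences in `ne[1]`, so only it reaches the batched path.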
## Gates (all PASS)
- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
post-SSM baseline build:
- dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
- MoE q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
- Coherent dense + MoE output (greedy text inspected).
## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
S_TG t/s (decode aggregate):
| model           | npl | baseline | Lever 1 | delta  |
|-----------------|-----|----------|---------|--------|
| dense q36-27b   | 32  | 170.52   | 200.00  | +17.3% |
| dense q36-27b   | 128 | 254.92   | 335.80  | +31.7% |
| MoE q36-35b-a3b | 32  | 373.28   | 420.77  | +12.7% |
| MoE q36-35b-a3b | 128 | 560.66   | 691.24  | +23.3% |
Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
up from 65% post-SSM).
## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
| kernel                             | baseline            | Lever 1              |
|------------------------------------|---------------------|----------------------|
| mul_mat_vec_q<NVFP4, m=1> (o_proj) | 132.8 ms / 48 inst  | 0 ms / 0 inst        |
| mul_mat_q<NVFP4, m=128>            | 5463 ms / 8800 inst | 5827 ms / 10000 inst |
The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
(+1200 instances, +363 ms over the window), and its per-call average DROPS
(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
old GEMV: the amortized weight read is the win.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,83 +0,0 @@
# LEVER 4 - NVFP4 the bf16 MoE GDN/attn projections: KL-GATE FAIL, no-ship
GPU agent (L4-gatebench), DGX GB10 (sm_121, BLACKWELL_NATIVE_FP4=1). Build at 0028 (HEAD fafe878,
branch `paged`). Lever 4 hypothesis (from `MOE_GAP_VS_VLLM.md` + the lever-4 scope): the MoE GGUF's
GDN/attn projections (in_proj_qkvz=attn_qkv, in_proj_ba=ssm_alpha/ssm_beta, out_proj=ssm_out,
attn_gate, full-attn attn_q/k/v/output) are left in BF16 by nvidia modelopt while the dense
q36-27b-nvfp4 (unsloth) already ships them NVFP4. The scope called this a "quant-provenance accident"
and proposed re-quantizing them to NVFP4 to recover the ~20.3->13.8ms projection-GEMM bucket.
**Verdict: KL-GATE FAIL on every axis, for both variants. STOP, do NOT ship. No 0029 GGUF, no
gallery entry, no bench, no nsys** (per spec: KL fails first -> report, do not bench/ship). The bf16
projections are a **deliberate precision choice, not an accident** - re-quantizing them costs ~6% PPL.
## Gate setup (all bit-changing -> KLD gate per spec)
- Reference (the "f32" of the gate): `~/work/darwin_36b_opus/f16.gguf` - the full-precision f16 GGUF
of the same Qwen3.6-35B-A3B model (qwen35moe, 41 blocks, vocab 248320, embd 2048). Verified it
matches the NVFP4 baseline shape; its own PPL = 7.376 is self-consistent with the KLD base.
- KL base: `llama-perplexity --kl-divergence-base` over `wiki.test.raw`, c512, 16 chunks (8192 tok),
-ngl 99, seed 1. Base file `~/bench/l4gate/klbase_moe.dat` (2.0 GB). f16 PPL(base) = 7.3734.
- Candidates scored with `--kl-divergence` against that base, identical c512/16-chunks/seed.
- Current "bf16-projection GGUF" baseline = `~/bench/q36-35b-a3b-nvfp4.gguf` (the shipping NVFP4:
experts NVFP4, GDN/attn projections BF16). It is the reference for the PPL-delta and argmax gates.
## Measurements (16 chunks, c512, 8192 tokens, wiki.test.raw)
| model | PPL(Q) | PPL delta vs baseline | Mean KLD-to-f16 | Same-top-p (argmax agree vs f16) | RMS dp |
|-------|--------|-----------------------|-----------------|----------------------------------|--------|
| baseline NVFP4 (proj BF16, shipping) | 7.3896 | - (reference) | 0.1366 | 84.31% | 9.20% |
| **projq FULL** (190 proj -> NVFP4, incl. in_proj_ba) | 7.8705 | **+6.51%** | 0.1638 | 81.72% | 10.47% |
| **projq CONS** (130 proj -> NVFP4, in_proj_ba kept BF16) | 7.8440 | **+6.15%** | 0.1716 | 82.16% | 10.82% |
Baseline vs f16: PPL ratio 1.0022 (+0.22%), i.e. the shipping NVFP4 is already near-f16 - because
modelopt put the quant-sensitive GDN/attn projections in BF16 and only the experts (designed for FP4)
in NVFP4. projq pushes the projections to NVFP4 and PPL ratio jumps to 1.067 (FULL) / 1.064 (CONS).
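The percentages and ratios above follow directly from the PPL column. A minimal sketch that recomputes them, assuming only the table values (the names `pct_delta` and `report` are illustrative, not part of any tool):

```python
# PPL values copied from the measurements table above.
F16_PPL = 7.3734        # f16 reference, PPL(base)
BASELINE_PPL = 7.3896   # shipping NVFP4, projections kept in BF16
CANDIDATES = {"projq FULL": 7.8705, "projq CONS": 7.8440}


def pct_delta(ppl, ref):
    """Relative PPL change versus a reference, in percent."""
    return (ppl / ref - 1.0) * 100.0


def report():
    """Collect every delta quoted in the section into one dict."""
    rows = {"baseline vs f16": pct_delta(BASELINE_PPL, F16_PPL)}
    for name, ppl in CANDIDATES.items():
        rows[name + " vs baseline"] = pct_delta(ppl, BASELINE_PPL)
        rows[name + " vs f16"] = pct_delta(ppl, F16_PPL)
    return rows


if __name__ == "__main__":
    for label, delta in report().items():
        print(f"{label:26s} {delta:+.2f}%")
```

The "vs baseline" rows are the PPL-delta gate input; the "vs f16" rows correspond to the 1.0022 / 1.067 / 1.064 ratios.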
## Gate verdict (all three conditions FAIL)
1. **PPL delta < ~1% vs the bf16-projection GGUF -> FAIL.** FULL +6.51%, CONS +6.15%. Off by ~6x.
2. **KLD-to-f32 < 0.06 -> FAIL.** The shipping baseline NVFP4 itself sits at 0.137 mean KLD vs f16
(per-token KLD is naturally high at 248K vocab), and projq raises it to 0.164 (FULL) / 0.172 (CONS).
Whatever the intended reference granularity, projq is strictly worse than the baseline, not < 0.06.
3. **Zero greedy-argmax flips -> FAIL.** Per-token top-1 agreement vs f16 drops from 84.31% (baseline)
to 81.72% (FULL) / 82.16% (CONS): the requant flips the argmax on ~2.2-2.6% MORE tokens than the
shipping model. (A direct `llama-cli --temp 0 -n 48` greedy diff was attempted but the paged
llama-cli build segfaults at teardown on ALL models incl. baseline - not projq-specific - so the
8192-token Same-top-p above is the argmax measure used; it is strictly stronger than a 48-tok probe.)
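The three conditions can be expressed as one small check. This is a sketch under stated assumptions: the thresholds are the ones quoted in the list, and the argmax condition is approximated by "Same-top-p must not drop below the baseline", since the direct greedy diff was not available. `GateRow` and `gate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GateRow:
    ppl: float       # PPL(Q)
    mean_kld: float  # mean KLD versus the f16 base
    same_top: float  # Same-top-p versus f16, in percent


BASELINE = GateRow(7.3896, 0.1366, 84.31)
FULL = GateRow(7.8705, 0.1638, 81.72)
CONS = GateRow(7.8440, 0.1716, 82.16)


def gate(cand, base, max_ppl_pct=1.0, max_kld=0.06):
    """Return pass/fail per gate condition for one candidate."""
    ppl_pct = (cand.ppl / base.ppl - 1.0) * 100.0
    return {
        "ppl_delta": ppl_pct < max_ppl_pct,
        "kld": cand.mean_kld < max_kld,
        # Proxy for "zero argmax flips": agreement must not fall below baseline.
        "argmax": cand.same_top >= base.same_top,
    }


if __name__ == "__main__":
    for name, row in (("FULL", FULL), ("CONS", CONS)):
        print(name, gate(row, BASELINE))
```

Both variants fail all three keys, which is the "every axis" verdict.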
CONSERVATIVE (keeping the most quant-sensitive in_proj_ba=ssm_alpha/ssm_beta in BF16) recovered almost
nothing: 7.844 vs 7.871. The damage is in the BULK attn/GDN projections (attn_qkv, ssm_out, attn_gate,
attn_q/k/v/output), not the tiny in_proj_ba. An attn_gate-excluded third variant would, at best, shave
a fraction of a percent off a 6% miss - not worth a GPU pass. lm_head was already NVFP4 in the baseline
(and in vLLM's checkpoint), so it is not a variable here and was never the issue.
## Why the premise was wrong (root cause of the failure)
The scope assumed vLLM runs these projections in NVFP4. It does not. vLLM runs the **nvidia modelopt
checkpoint** (`~/bench/q36-35b-a3b-nvfp4-vllm`), which is the SAME provenance that left these exact
projections in BF16. So:
- The baseline GGUF's bf16 projections **match vLLM** already. They are not a llama-vs-vLLM gap.
- modelopt left in_proj_qkvz/in_proj_ba/out_proj/attn_q/k/v/output in BF16 **because they are
quant-sensitive in this hybrid gated-DeltaNet + attention model** - the gate confirms this empirically
at ~6% PPL. The dense q36-27b-nvfp4 (unsloth) tolerating NVFP4 projections does not transfer: it is a
different (non-MoE, different-provenance) model and a different sensitivity profile.
- Re-quantizing them is therefore not "matching vLLM" - it is going BEYOND vLLM's precision and paying
for it in quality. The ~20.3ms projection-GEMM bucket is the price of running these projections in
high precision; vLLM pays the same precision cost (its nvjet/cutlass bf16 GEMMs), so the bucket is NOT
the lever it looked like. The L4 speed win is real but only purchasable with a 6% PPL regression -
rejected by the gate.
## Disposition / artifacts
- Both projq GGUFs exist on DGX but are **dead** (do not publish): `~/bench/q36-35b-a3b-nvfp4-projq.gguf`
(FULL, md5 1bd32114..., sha256 88b7e812...), `~/bench/q36-35b-a3b-nvfp4-projq-cons.gguf` (CONS, md5
6847ebe3..., sha256 ca035111...). The L4-requant pin files (`~/bench/pins_projq_{full,cons}.txt`) and
`/tmp/gen_pins.py` remain if a future, kernel-side (not precision-side) approach is ever revisited.
- Gate logs: `~/bench/l4gate/` - `f16base.log`, `kld.{baseline,projqFULL,projqCONS}.log`,
`klbase_moe.dat`.
- No code change, no patch, no commit to the DGX `llama-paged-dev` tree. No `-paged` gallery entry.
- MoE remains at 86.3% of vLLM @ npl128; this lever does not move it within the quality budget.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,73 +0,0 @@
# Lever 5 - block-table within-step host cache (patch 0029)
## What
`get_block_table()` is called once per full-attention layer per decode step. The
KV cell layout (and therefore the block table bytes) is fixed for the whole step;
it only changes in `apply()` when the ubatch's slots are committed. The old path
recomputed the full table on every full-attention layer of every step.
Patch 0029 builds the table once per step and reuses the bytes (`memcpy`) for the
remaining full-attention layers, invalidating the cache in `apply()`. The reused
bytes are identical to a fresh compute, so the change is bit-exact. Disable with
`LLAMA_PAGED_NO_BT_CACHE=1`.
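The caching pattern described above can be illustrated with a toy model. This is not the patch's C++ code: `BlockTableCache`, `build_fn` and `run_decode` are hypothetical names, and the `enabled` flag only stands in for the `LLAMA_PAGED_NO_BT_CACHE=1` switch.

```python
class BlockTableCache:
    """Within-step host cache for the block table (illustrative only)."""

    def __init__(self, build_fn, enabled=True):
        self._build = build_fn   # hypothetical callback: computes the table bytes
        self._enabled = enabled  # False models LLAMA_PAGED_NO_BT_CACHE=1
        self._cached = None
        self.builds = 0          # number of full recomputes

    def get_block_table(self):
        if self._enabled and self._cached is not None:
            return self._cached  # reuse: same bytes as a fresh compute
        table = bytes(self._build())
        self.builds += 1
        if self._enabled:
            self._cached = table
        return table

    def apply(self):
        # The ubatch's slots are committed here, so the layout may change.
        self._cached = None


def run_decode(n_steps, n_full_attn_layers, enabled=True):
    """Count table builds over a toy decode loop."""
    step_id = [0]
    cache = BlockTableCache(lambda: step_id[0].to_bytes(4, "little"), enabled)
    for step in range(n_steps):
        step_id[0] = step
        cache.apply()
        tables = [cache.get_block_table() for _ in range(n_full_attn_layers)]
        assert all(t == tables[0] for t in tables)  # every layer sees the same bytes
    return cache.builds
```

With 128 decode steps and 10 full-attention layers, the cached variant builds the table 128 times and the uncached variant 1280 times, while every layer still receives identical bytes.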
## Host-side get_block_table time (the lever)
`llama-batched-bench`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`,
`-npp 128 -ntg 128 -npl 128 -ngl 99 -fa on`, measured with the in-tree
`[L5INSTR]` host timers (aggregate over the full bench, n=2048 dense / 1280 MoE
get_block_table calls):
| model | get_block_table host, cache OFF | cache ON | reduction |
|-------|--------------------------------:|---------:|----------:|
| MoE q36-35b-a3b-nvfp4 | 112.94 ms | 14.82 ms | -87% |
| dense q36-27b-nvfp4 | 193.78 ms | 16.90 ms | -91% |
The MoE 112.94 -> 14.82 ms is the "110 -> 14 ms host" headline. `set_inputs`
host time falls in lockstep (MoE 128.6 -> 32.0 ms; dense 220.2 -> 36.5 ms) and
`process_ubatch` host (hostproc) drops MoE 498.8 -> 413.0 ms, dense 730.1 ->
544.2 ms.
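The reduction column and the "in lockstep" claim can be rechecked from the quoted timings. A minimal sketch using only the numbers above (`HOST_MS` and `saved_and_reduction` are illustrative names):

```python
# (cache OFF, cache ON) host milliseconds, copied from the table and paragraph above.
HOST_MS = {
    "MoE get_block_table": (112.94, 14.82),
    "dense get_block_table": (193.78, 16.90),
    "MoE set_inputs": (128.6, 32.0),
    "dense set_inputs": (220.2, 36.5),
    "MoE process_ubatch": (498.8, 413.0),
    "dense process_ubatch": (730.1, 544.2),
}


def saved_and_reduction(off_ms, on_ms):
    """Return (ms saved, signed reduction in percent)."""
    saved = off_ms - on_ms
    return saved, -100.0 * saved / off_ms


if __name__ == "__main__":
    for name, (off_ms, on_ms) in HOST_MS.items():
        saved, pct = saved_and_reduction(off_ms, on_ms)
        print(f"{name:24s} saved {saved:7.2f} ms ({pct:+.0f}%)")
```

The milliseconds saved in `set_inputs` and `process_ubatch` are of the same order as the `get_block_table` saving for each model, which is what "in lockstep" refers to.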
## Throughput effect
Same bench, TG (decode) tokens/s, cache OFF -> ON:
| model | TG t/s OFF | TG t/s ON | delta | vs vLLM @npl128 |
|-------|-----------:|----------:|------:|----------------:|
| dense q36-27b-nvfp4 | 364.81 | 374.72 | +2.7% | 374.72 / 391 = 95.8% |
| MoE q36-35b-a3b | 752.19 | 756.97 | +0.6% (flat) | n/a |
- Dense decode is partly host-bound, so removing ~90% of the get_block_table host
time lifts dense TG by a few percent (run-to-run; ~0.4-2.7% across runs) and
pushes it to ~96-97.5% of the vLLM 391 t/s @npl128 reference.
- MoE decode is compute-bound (the FP4 GEMM dominates the step), so the ~98 ms of
saved host time is hidden behind GPU compute and is off the critical path: MoE
TG is flat. The deployment path (MoE) sees no regression and no win - the cache
is a pure pipeline cleanup there.
- npl=1 single-stream decode: get_block_table is tiny either way (MoE 0.64 ->
0.22 ms over 128 steps); the lever only matters at batch.
## Bit-exactness
`llama-completion -p "The capital of France is" -n 48 --temp 0 --seed 1`,
chat-template (conversation) path:
| path | md5 |
|------|-----|
| non-paged MoE | 07db32c2bcb78d17a43ed18bc22705cd |
| paged MoE, cache ON | 8cb0ce23777bf55f92f63d0292c756b0 |
| paged MoE, cache OFF (`LLAMA_PAGED_NO_BT_CACHE=1`) | 8cb0ce23777bf55f92f63d0292c756b0 |
| dense non-paged | 5951a5b4d624ce891e22ab5fca9bc439 |
| dense paged | 5951a5b4d624ce891e22ab5fca9bc439 |
cache ON == cache OFF confirms the lever is numerically neutral. The paged-MoE
md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a benign
FP-accumulation-order difference of the paged attention reduction, KL-validated
in PAGED_BITEXACT_NOTE.md (not introduced by this lever - it is present on the
0028 baseline too).
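The per-path reading of the md5 table can be written down as a small check. This sketch only compares the digests listed above; how each digest was produced from the completion output is not modelled, and `per_path_gate` is a hypothetical name.

```python
# Digests copied from the table above.
MD5 = {
    "moe_nonpaged": "07db32c2bcb78d17a43ed18bc22705cd",
    "moe_paged_cache_on": "8cb0ce23777bf55f92f63d0292c756b0",
    "moe_paged_cache_off": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_nonpaged": "5951a5b4d624ce891e22ab5fca9bc439",
    "dense_paged": "5951a5b4d624ce891e22ab5fca9bc439",
}


def per_path_gate(md5):
    """Compare digests within each path rather than across paths."""
    return {
        # The lever itself: cache ON must equal cache OFF.
        "cache_neutral": md5["moe_paged_cache_on"] == md5["moe_paged_cache_off"],
        "dense_paged_eq_nonpaged": md5["dense_paged"] == md5["dense_nonpaged"],
        # Expected to differ for MoE; that difference predates this lever.
        "moe_paged_eq_nonpaged": md5["moe_paged_cache_on"] == md5["moe_nonpaged"],
    }
```

Only `cache_neutral` gates this patch; the MoE paged-versus-non-paged mismatch is the pre-existing accumulation-order difference described above.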
## Verdict
Ship. Bit-exact per path, real host-pipe win on host-bound (dense) decode,
neutral on the compute-bound MoE deployment path.


@@ -420,8 +420,8 @@ may want a second lighter gallery variant (context_size 16384, parallel 4, drop
================================================================================
The two GGUFs already exist on the DGX dev box (final_benchmark.csv references
-q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; QWEN36_NVFP4_BENCH.md section "The 4 models"
-documents provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
+q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; README.md "Models" + "Benchmarks"
+document provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
tensors from nvidia modelopt weights). To publish:
1. HF repos (suggest two, under the org that owns the gallery-referenced weights):
@@ -434,7 +434,7 @@ tensors from nvidia modelopt weights). To publish:
3. Model card metadata: base_model Qwen/Qwen3.6-*, library_name gguf, quantization NVFP4,
pipeline_tag text-generation, license (confirm Qwen3.6 license terms - apache-2.0 vs
Qwen community license), a note that it REQUIRES the llama-cpp-paged backend (NVFP4 +
-paged), and the GB10 benchmark table (link QWEN36_NVFP4_BENCH.md numbers).
+paged), and the GB10 benchmark table (link README.md "Benchmarks" numbers).
4. NVFP4 requires a llama.cpp new enough to read the NVFP4 GGUF type. Confirm the pinned
LLAMA_VERSION in backend/cpp/llama-cpp/Makefile supports NVFP4 tensor types (the dev
tree that produced the GGUFs did). If the current pin predates NVFP4 GGUF support, the


@@ -1,143 +0,0 @@
# Patch 0015 findings: expert-density-aware MoE token-tile auto-select
The durable follow-up to patch 0014 (`MOE_TOKEN_TILE_CAP.md`): replace the blunt,
opt-in `LLAMA_MOE_MMQ_X` global cap with a host-side, **default-on** density-aware
`mmq_x` auto-select in `mul_mat_q_case`. Companion to
`0015-paged-expert-density-aware-moe-token-tile-auto-select.patch`. Dev tree
`~/llama-paged-dev` (branch `paged`), `build-cuda` sm_121.
Primary model: **Qwen3.6-35B-A3B NVFP4** (`~/bench/q36-35b-a3b-nvfp4.gguf`),
**256 experts, top-8**, expert FFN 512, GDN linear attention (SSM inner 4096),
41 layers. This is a different beast from 0014's Qwen3-Coder-30B-A3B (128 experts,
larger expert FFN, standard attention).
## What it does (vs 0014)
`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max` (= `ne12`,
the per-expert column upper bound = token count) in one column-tile, i.e. stock
**maximizes** the tile (128 on Blackwell). Applied per expert at MoE decode, where
per-expert density is tiny, that 128-wide tile is mostly padding.
Patch 0014 capped `mmq_x` globally on the ids path via `LLAMA_MOE_MMQ_X` (decode
**and** prefill), which cost ~1.3% prefill. Patch 0015 instead estimates the
per-expert density host-side, from args the ids path already passes:
```
ne_get_rows = ncols_dst = ne12 * n_expert_used (token-expert assignments)
n_experts = nchannels_x = ne02
density = ceil(ne_get_rows / min(ne_get_rows, n_experts)) (tokens/expert)
```
and caps to the small tile (default 64) **only when `density <= density_max`**, so
the high-density prefill ubatch keeps the big 128 tile. Prefill-safe by construction.
No new kernel: the selection only lowers the loop's upper bound to an
already-compiled, granularity- and shared-memory-validated `mmq_x`.
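The estimate and the cap decision can be sketched in a few lines. This is a Python illustration of the host-side logic described above, not the patch's C++; the function names, the zero-row guard, and the default `stock_max=128` (the Blackwell tile quoted earlier) are assumptions made for the example.

```python
import math


def expert_density(ne12, n_expert_used, n_experts):
    """Estimated tokens per expert, following the formula above."""
    ne_get_rows = ne12 * n_expert_used  # token-expert assignments
    if ne_get_rows == 0:                # guard added for the sketch only
        return 0
    return math.ceil(ne_get_rows / min(ne_get_rows, n_experts))


def mmq_x_upper_bound(ne12, n_expert_used, n_experts,
                      stock_max=128, decode_tile=64, density_max=8):
    """Upper bound handed to the existing mmq_x selection loop."""
    density = expert_density(ne12, n_expert_used, n_experts)
    # Cap only at low per-expert density; otherwise keep the stock bound.
    return decode_tile if density <= density_max else stock_max


if __name__ == "__main__":
    print(mmq_x_upper_bound(128, 8, 256))  # decode npl128, 256 experts
    print(mmq_x_upper_bound(512, 8, 256))  # prefill ubatch, 256 experts
```

Because the function only ever returns the stock bound or a smaller tile that is already compiled, no new kernel instantiation is involved.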
## The threshold matters: `density_max = 8`, not `tile/4 = 16`
The cap must fire for decode but not for a prefill ubatch. Each has per-expert
density `n_tokens * n_used / n_experts`. At the standard `n_ubatch=512`, `n_used=8`:
```
128 experts 256 experts
prefill ubatch (512) 32 16
decode npl128 (128) 8 4
```
`tile/4 = 16` (0014's first auto-select draft default) **equals the 256-expert
prefill density** and caps prefill: measured -2.0% to -2.9% S_PP on q36-35b-a3b.
`density_max = 8` sits at or above the decode density and strictly below the prefill density for every
`n_experts` in `[128, 511]`, so it caps decode and leaves prefill on the big tile. This single
default change is what makes the patch prefill-safe on the 256-expert model.
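The `[128, 511]` range can be recovered by brute force. A minimal sketch, assuming the same density estimate as above with `n_used=8`, a 512-token prefill ubatch and a 128-token decode batch (`separates` and `safe_range` are illustrative names):

```python
import math


def density(n_tokens, n_used, n_experts):
    """Per-expert density estimate, as defined earlier in this document."""
    rows = n_tokens * n_used
    return math.ceil(rows / min(rows, n_experts))


def separates(n_experts, density_max, n_used=8, n_ubatch=512, npl=128):
    """True if decode gets capped while the prefill ubatch does not."""
    decode = density(npl, n_used, n_experts)
    prefill = density(n_ubatch, n_used, n_experts)
    return decode <= density_max < prefill


def safe_range(density_max, lo=1, hi=2048):
    """All expert counts for which the threshold separates decode from prefill."""
    return [n for n in range(lo, hi + 1) if separates(n, density_max)]


if __name__ == "__main__":
    for dmax in (8, 16):
        r = safe_range(dmax)
        print(dmax, r[0], r[-1], 256 in r)
```

With `density_max = 16` the 256-expert model falls outside the safe range, because its prefill density is exactly 16 and the cap fires; with `density_max = 8` it is inside.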
## Measurements (default-on vs stock, median of 5 reps)
`llama-batched-bench`, q36-35b-a3b-nvfp4.gguf, `-fa on -npp 128 -ntg 128`, GB10
sm_121. STOCK = `LLAMA_MOE_AUTO_TILE=0` (exact stock selection); 0015 = default.
```
npl S_TG stock S_TG 0015 dTG% S_PP stock S_PP 0015 dPP%
8 183.59 183.18 -0.22% 1489.2 1500.1 +0.73%
32 264.02 263.44 -0.22% 2034.5 2033.5 -0.05%
64 311.76 310.41 -0.43% 2028.3 2027.6 -0.03%
128 336.10 337.32 +0.36% 2025.0 2027.7 +0.13%
```
Raw npl128 reps: S_TG 0015 `[337.3, 336.9, 336.4, 338.9, 338.1]` vs stock
`[336.2, 336.1, 335.9, 336.9, 335.8]` (distributions overlap); S_PP 0015
`[2028.6, 2023.0, 2024.9, 2028.0, 2027.7]` vs stock `[2024.9, 2025.0, 2023.2,
2029.4, 2029.0]`.
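The medians and the "distributions overlap" remark can be checked from the raw reps. A small sketch using only the five-rep lists quoted above (`summarize` is an illustrative helper):

```python
from statistics import median

S_TG = {
    "0015": [337.3, 336.9, 336.4, 338.9, 338.1],
    "stock": [336.2, 336.1, 335.9, 336.9, 335.8],
}
S_PP = {
    "0015": [2028.6, 2023.0, 2024.9, 2028.0, 2027.7],
    "stock": [2024.9, 2025.0, 2023.2, 2029.4, 2029.0],
}


def summarize(patched, stock):
    """Return (median patched, median stock, delta in %, ranges overlap?)."""
    m_p, m_s = median(patched), median(stock)
    overlap = min(patched) <= max(stock) and min(stock) <= max(patched)
    return m_p, m_s, (m_p / m_s - 1.0) * 100.0, overlap


if __name__ == "__main__":
    for name, reps in (("S_TG", S_TG), ("S_PP", S_PP)):
        m_p, m_s, delta, overlap = summarize(reps["0015"], reps["stock"])
        print(f"{name}: {m_p} vs {m_s} ({delta:+.2f}%), overlap={overlap}")
```

Overlapping min/max ranges at n=5 are the reason the npl128 delta is read as noise rather than as a gain.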
### Honest read: neutral on this model
On q36-35b-a3b the decode delta is **within run-to-run noise** (npl128 +0.36%,
npl<=64 slightly negative) and prefill is **neutral** (within +/-0.7%, well inside
the 1% target). The `+5%` decode target from the localmaxxing reference does **not**
materialize here. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
lever has nothing to bite on.
### npl128 decode tile sweep confirms 64 is the only useful width
`density_max=8` fixed, varying `LLAMA_MOE_DECODE_TILE`, S_TG @ npl128 vs stock:
```
TILE8 TILE16 TILE32 TILE64 TILE96
-6.31% -3.18% -0.17% +0.70% -0.76%
```
Smaller tiles are **worse**, not better: more column-tiles per expert = more
grid/scheduling overhead, and the FP4-MMA has a minimum efficient width. So matching
the tile to the literal density (4) is counterproductive; 64 is the sweet spot,
same as 0014.
## Why ship it default-on anyway
1. **Removes 0014's prefill cost by construction.** The cap is density-gated, not
global, so prefill keeps its 128 tile (S_PP neutral above).
2. **Banks the col-tile-bound gain for free.** At npl128 the auto-select picks
`tile=64` for a 128-expert model (decode density 8 <= 8), i.e. exactly 0014's
`cap64`, so it reproduces 0014's **+4.8% @npl128 on Qwen3-Coder-30B** without the
-1.3% prefill cost. (That model was unavailable to re-bench here; the tile choice
is identical by construction.)
3. **Prefill-safe and decode-neutral on the SSM model**, so it is harmless where it
does not help.
4. **Correctness-gated** by the P0 harness (below).
## Conservative by design (known limitation)
A pure-density gate cannot separate two cases with the **same** per-expert density:
Qwen3-Coder npl256 decode (density 16) and the 256-expert prefill ubatch (density
16) are identical to the estimator. `density_max=8` therefore **forgoes 0014's
+2.3% @npl256** on the 128-expert model to keep 256-expert prefill safe. Recovering
it needs an `ne12`-aware (absolute token count) gate in addition to density; scoped
as future work, not implemented.
## Knobs
- `LLAMA_MOE_AUTO_TILE=0` : disable the auto-select, exact stock `mmq_x` selection.
- `LLAMA_MOE_MMQ_X=<n>` (patch 0014) : **kept** as a manual override; when > 0 it
forces the old blunt global cap and bypasses the auto-select (explicit A/B knob).
- `LLAMA_MOE_DECODE_TILE=<n>` : the small tile (default 64).
- `LLAMA_MOE_DENSITY_MAX=<n>` : the density ceiling (default 8).
## P0 correctness gate
`tests/test-backend-ops` `test_mul_mat_id` is extended with a ragged small-M
NVFP4/MXFP4 MoE decode-density block: 128 experts, top-8, m=768, k=2048, n in
`{16,33,64,128,130,200,256,512}` spanning the cap boundary (n>=130 keeps the 128
tile at `density_max=8`, n<=128 takes tile 64) and ragged token counts (experts with
0/1/2 tokens, n not a multiple of the tile). All 16 shapes pass the CUDA-vs-CPU
oracle on GB10 both default-on and with `LLAMA_MOE_AUTO_TILE=0`; full `MUL_MAT_ID`
suite 2/2 backends OK. Off the ids path nothing changes (non-MoE `mul_mat`
byte-identical to stock).
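The split of the test shapes around the cap boundary follows from the density estimate. A sketch assuming `n` is the token count and using the defaults quoted above (128 experts, top-8, `density_max=8`); `takes_small_tile` is a hypothetical helper, not part of `test-backend-ops`.

```python
import math

N_VALUES = [16, 33, 64, 128, 130, 200, 256, 512]
QUANT_TYPES = ["NVFP4", "MXFP4"]


def takes_small_tile(n, n_experts=128, top_k=8, density_max=8):
    """True if the auto-select would cap this shape to the small tile."""
    rows = n * top_k
    dens = math.ceil(rows / min(rows, n_experts))
    return dens <= density_max


capped = [n for n in N_VALUES if takes_small_tile(n)]
uncapped = [n for n in N_VALUES if not takes_small_tile(n)]
n_shapes = len(N_VALUES) * len(QUANT_TYPES)

if __name__ == "__main__":
    print("tile 64 :", capped)
    print("tile 128:", uncapped)
    print("shapes  :", n_shapes)
```

Eight token counts across the two FP4 types give the 16 shapes mentioned above, with four on each side of the boundary.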
## Verdict
- Correct, prefill-safe, default-on density-aware tile select; the durable design
0014's own doc scoped. Supersedes 0014's global cap as the default path; the
`LLAMA_MOE_MMQ_X` knob is retained as a manual override.
- **Net effect on q36-35b-a3b NVFP4: neutral** (decode within noise, prefill neutral)
because the model is SSM/bandwidth-bound, not col-tile-bound. The lever's real win
lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128), banked here at zero
prefill cost.


@@ -1,21 +0,0 @@
# MOE_GAP_PROGRESS.md - moe-gap-groundtruth GPU agent checkpoint
Status: **DONE.** Both-engine MoE decode decomposition complete. Findings in `MOE_GAP_VS_VLLM.md`.
## Runs (DGX GB10 sm_121, GPU free, foreground)
- llama: `build-cuda` 2f4f5ab (0025), `llama-batched-bench -npp128 -ntg128 -npl128 -c32768 -fa on`,
`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`. S_TG=752.3 t/s, step 169.8 ms, busy 97.5%.
Artifacts on DGX: `~/llama-paged-dev/moe_gap_llama.{nsys-rep,trace.csv}`.
- vLLM 0.23.0 graphs-ON (FULL_AND_PIECEWISE, the 882-ref config): `~/bench/moe_gap_vllm.py` under
`nsys --capture-range=cudaProfilerApi`. step 142.0 ms, busy 99.7%.
Artifacts on DGX: `~/bench/moe_gap_vllm.{nsys-rep,trace.csv}`, script `~/bench/moe_gap_vllm.py`.
- Extractor: `~/bench/decode_decomp2.py` (dual-engine, steps = GDN-kernel-count / 30; cross-checked vs
flash/reshape_cache = 10x and vs throughput). Grouped-MoE GEMM isolated by per-call duration (LONG/SHORT).
## Result (1 line)
Gap = 27.8 ms/step (llama 83.6% of vLLM). **MoE grouped GEMM is a llama WIN** (native FP4-MMA W4A4 47.3 ms
vs Marlin W4A16 50.0 ms). The 15% is bf16-projections+convert (+6.5), recurrence state-gather plumbing
(+6.6, led by k_get_rows 5.2 ms), graph/overlap (+7.0), W4A4 act-quant tax (+3.3), router/glue (+5.4).
Marlin is NOT the lever; do not build a W4A16 MoE GEMM.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,484 +0,0 @@
# MOE_GAP_VS_VLLM.md - ground-truth both-engine MoE decode decomposition (where vLLM's ~15% lives)
> **READ THE FINAL SECTION FIRST ("RESIDUAL-ASSESS (FINAL)" at the bottom).** It concludes the hunt and
> CORRECTS one premise used throughout the body below: this doc assumes vLLM runs the GDN/attn projections
> as NVFP4-Marlin. It does NOT. vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the
> projection bucket is a matched-precision (bf16) gap, not a quant gap. Lever 4 (NVFP4 the projections) is
> REJECTED (+6% PPL, and not even a vLLM gap). The MoE is at its bit-exact ceiling (~86-88% of vLLM).
THE GPU AGENT (label `moe-gap-groundtruth`), DGX GB10 (sm_121). First **side-by-side, both-engine,
per-kernel ms/step** decomposition of the MoE decode gap. All prior B work decomposed llama ONLY; this
profiles vLLM's decode step too and computes the per-bucket `llama - vLLM` delta to pinpoint the gap.
Model `q36-35b-a3b-nvfp4` (40 layers: 30 GDN linear-attn + 10 full-attn, 256 experts top-8, vocab 248320).
Both engines profiled at **batch 128 decode** with `nsys --cuda-graph-trace=node`, steady-decode window,
per-step normalized by GDN-kernel-count / 30 (cross-checked vs flash/reshape_cache counts and throughput).
- **llama**: `build-cuda` tip `2f4f5ab` (patch 0025), `llama-batched-bench -npp 128 -ntg 128 -npl 128
-c 32768 -fa on`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1` (the re-graph ON = the 752 t/s ship point).
Measured **S_TG = 752.3 t/s** => **step = 169.8 ms**, GPU-busy 97.5% (idle 2.5% = 4.2 ms/step).
- **vLLM 0.23.0**: `q36-35b-a3b-nvfp4-vllm`, **CUDA graphs ON** (`cudagraph_mode=FULL_AND_PIECEWISE`,
the 882-reference config, NOT enforce_eager), MARLIN NvFp4 MoE, 128 seqs x 128-tok prompt x 128 gen.
Measured **step = 142.0 ms** (= 901 t/s-equiv), GPU-busy 99.7% (idle 0.3% = 0.4 ms/step).
- Gap reproduced: **169.8 - 142.0 = 27.8 ms/step** (llama 83.6% of vLLM here; matches the ~85% server number).
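The step-time arithmetic above is easy to reproduce. A minimal sketch, assuming one decode step emits one token for each of the 128 sequences (`tokens_per_s` is an illustrative helper):

```python
BATCH = 128             # sequences decoded per step
LLAMA_STEP_MS = 169.8
VLLM_STEP_MS = 142.0


def tokens_per_s(step_ms, batch=BATCH):
    """Throughput equivalent of a per-step wall time."""
    return batch * 1000.0 / step_ms


gap_ms = LLAMA_STEP_MS - VLLM_STEP_MS
llama_fraction = VLLM_STEP_MS / LLAMA_STEP_MS  # llama speed as a fraction of vLLM

if __name__ == "__main__":
    print(f"gap            {gap_ms:.1f} ms/step")
    print(f"llama / vLLM   {llama_fraction * 100:.1f}%")
    print(f"vLLM equiv     {tokens_per_s(VLLM_STEP_MS):.0f} t/s")
    print(f"llama equiv    {tokens_per_s(LLAMA_STEP_MS):.0f} t/s")
```

Applied to llama's 169.8 ms the same formula gives about 754 t/s, close to but not exactly the measured 752.3 t/s, so the step figure should be read as a normalized value rather than a pure inverse of S_TG.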
## THE HEADLINE: the MoE grouped GEMM is NOT vLLM's advantage - it is a llama WIN
Grouped MoE-expert GEMM, isolated by per-call duration (LONG calls = the per-expert grouped GEMM):
| grouped MoE-expert GEMM | ms/step | what |
|-------------------------|--------:|------|
| **llama** `mul_mat_q<NVFP4,M-tile=64>` (+stream-k fixup + gather) | **48.3** | native Blackwell FP4-MMA **W4A4** |
| **vLLM** `marlin_moe_wna16::Marlin` | **50.0** | **W4A16** (FP4 weights dequant-in-kernel -> bf16 MMA) |
**llama's native-FP4 grouped GEMM is ~1.7 ms/step FASTER than vLLM's Marlin W4A16 at the ragged
tiny-M (~4 rows/expert) decode shape** (pure GEMM core 47.3 vs 50.0). Both read the same ~4-bit weight
bytes and are bandwidth-bound, so they tie to within a few %, and llama's 2x-rate FP4-MMA edges it.
**=> Marlin is NOT faster here; a Marlin-style W4A16 MoE GEMM in llama would make the MoE GEMM SLOWER.**
This directly answers the brief's load-bearing question #1/#2 and extends the prior `w4a16-marlin` DENSE
conclusion ("the win was NVFP4-dense-quant, not the Marlin kernel") to MoE: **the MoE GEMM kernel is not
the lever; llama already beats Marlin there.**
## Side-by-side per-step decomposition (ms/step, kernel-time attribution)
| bucket | llama ms | vLLM ms | Δ llama-vLLM | note |
|--------|---------:|--------:|-------------:|------|
| **Recurrence / SSM** | **79.3** | **72.7** | **+6.6** | core kernel is a llama WIN (70.0 vs 71.1); the gap is llama's state-gather/conv plumbing |
| **MoE-expert grouped GEMM** | 48.3 | 50.0 | **-1.7** | **llama FASTER** (native FP4-MMA W4A4 vs Marlin W4A16) |
| **Dense projections (+glue)** | **20.3** | **13.8** | **+6.5** | llama runs GDN/attn projections in BF16 cublas; vLLM runs them as compact NVFP4-Marlin; +2.9 ms is llama's bf16<->f32 `convert_unary` glue vLLM never pays |
| **Norms / glue / memcpy** | 9.6 | 6.0 | +3.6 | llama `k_bin_bcast` (expert-combine+residual) 4.3 + memcpy 2.4 heavier |
| **Act-quant (W4A4 tax)** | 3.3 | 0.0 | **+3.3** | `quantize_mmq_nvfp4`; vLLM W4A16 keeps acts bf16 => structurally ZERO |
| **Router / align** | 2.4 | 0.5 | +1.9 | llama computes router via a full FP4 GEMM (1.6) + argsort/scatter; vLLM fuses topk/align |
| **Attention (full-attn)** | 2.8 | 2.6 | +0.2 | parity |
| kernel-time subtotal | 166.1 | 145.7 | +20.4 | |
| **GPU idle (host bubble)** | 4.2 | 0.4 | **+3.8** | graph coverage: llama partially-graphed (0025) vs vLLM FULL_AND_PIECEWISE |
| cross-stream overlap (union<sumdur) | ~0.8 | ~4.0 | ~-3.2 (vLLM overlaps more) | vLLM runs more kernels concurrently |
| **STEP TOTAL (wall)** | **169.8** | **142.0** | **+27.8** | |
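The kernel-time rows can be summed and ranked mechanically. This sketch uses only the per-bucket values from the table; because those are rounded to 0.1 ms, the sums can differ from the stated subtotals by about 0.1.

```python
# (llama ms, vLLM ms) per bucket, copied from the table above.
BUCKETS = {
    "Recurrence / SSM": (79.3, 72.7),
    "MoE-expert grouped GEMM": (48.3, 50.0),
    "Dense projections (+glue)": (20.3, 13.8),
    "Norms / glue / memcpy": (9.6, 6.0),
    "Act-quant (W4A4 tax)": (3.3, 0.0),
    "Router / align": (2.4, 0.5),
    "Attention (full-attn)": (2.8, 2.6),
}

llama_sum = sum(l for l, _ in BUCKETS.values())
vllm_sum = sum(v for _, v in BUCKETS.values())

# Largest llama-minus-vLLM delta first.
ranked = sorted(BUCKETS.items(), key=lambda kv: kv[1][0] - kv[1][1], reverse=True)

if __name__ == "__main__":
    for name, (l, v) in ranked:
        print(f"{name:28s} {l - v:+5.1f}")
    print(f"{'kernel-time subtotal':28s} {llama_sum - vllm_sum:+5.1f}")
```

The ranking puts recurrence and dense projections at the top and the grouped MoE GEMM at the bottom, the only bucket with a negative delta.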
### Per-engine top kernels (ms/step)
```
llama (752 t/s, step 169.8 ms, 97.5% busy)         | vLLM (901-equiv, step 142.0 ms, 99.7% busy)
 70.0  gated_delta_net_cuda          REC core      |  71.1  fused_recurrent_gated_delta   REC core
 47.3  mul_mat_q grouped MoE (M=64)  MoE GEMM      |  50.0  marlin_moe_wna16::Marlin      MoE GEMM
  8.2  nvjet 192x136 (bf16 proj)     PROJ          |   4.0  nvjet 128x72 (bf16 proj)      PROJ
  5.2  k_get_rows_float REC-GATHER   REC       <-- |   2.8  marlin dense (lm_head NVFP4)  PROJ
  4.5  cutlass::Kernel2 (bf16 GEMM)  PROJ          |   2.7  nvjet 128x64 (bf16 proj)      PROJ
  4.3  k_bin_bcast (combine+resid)   GLUE          |   2.5  flash_fwd_splitkv             ATTN
  4.1  nvjet 128x64 (bf16 proj)      PROJ          |   2.0  marlin dense small (NVFP4)    PROJ
  3.4  ssm_conv_update_f32           REC           |   1.6  causal_conv1d_update          REC
  3.3  quantize_mmq_nvfp4 W4A4 TAX   ACTQ      <-- |   1.4  std::enable_if (glue)         GLUE
  2.9  convert_unary bf16<->f32      PROJ-GLUE <-- |   1.2  reduce_kernel                 GLUE
  2.8  flash_attn_tile               ATTN          |   1.0  cutlass::device (fp8 lin)     PROJ
  2.4  MEMCPY-Device (SSM state)     GLUE          |   0.8  nvjet 32x64                   PROJ
  1.6  mul_mat_q router (M=128)      ROUTER        |   0.4  act_and_mul (SwiGLU)          GLUE
  1.5  rms_norm_f32                  GLUE          |   0.2  topkGating / moe_align        ROUTE
  ...                                              |   0.1  reshape_and_cache_flash       ATTN

<-- vLLM has no equiv of these; k_get_rows_float + convert_unary (5.2+2.9 = 8 ms) are pure llama tax
```
## WHERE THE 27.8 ms ACTUALLY IS (ranked) - and it is NOT the Marlin GEMM
1. **Dense projections + bf16<->f32 glue: +6.5 ms.** llama keeps the GDN/attn linear projections (and
the lm_head) in **BF16** (cublas `nvjet`/`cutlass`, full-precision weight reads) and pays a 2.9 ms
`convert_unary` bf16<->f32 tax around them; vLLM runs the same projections as **compact NVFP4-Marlin
W4A16** (4-bit weight read, ~4x less BW) and stays bf16 end-to-end (no convert). This is the
**`NVFP4-dense-quant` lever the prior `w4a16-marlin` project already identified - applied to the
still-bf16 projections**, not the MoE GEMM.
2. **Recurrence state-gather/conv plumbing: +6.6 ms.** The recurrence CORE kernel is a **llama win**
(gated_delta_net 70.0 vs vLLM fused_recurrent 71.1, confirming "past vLLM on BW efficiency"). The gap
is entirely the surrounding plumbing: **`k_get_rows_float` 5.2 ms (the recurrent-state gather)** +
`ssm_conv_update` 3.4 vs vLLM's single `causal_conv1d_update` 1.6. vLLM has **no gather** - its
recurrent state is updated in-place inside the fused decode kernel. `k_get_rows` is the single biggest
llama-specific kernel vLLM has no equivalent of.
3. **Graph coverage + stream overlap: ~+7.0 ms combined** (idle +3.8, cross-stream overlap ~+3.2). vLLM
FULL_AND_PIECEWISE is 99.7% busy with more concurrent kernels; llama (partially graphed post-0025) is
97.5% busy with thinner overlap.
4. **W4A4 act-quant tax: +3.3 ms.** `quantize_mmq_nvfp4`; vLLM's W4A16 choice makes this structurally 0.
Fusing the quant into the preceding op (as vLLM fuses act_quant into RMSNorm/SiLU) would erase it.
5. **Router GEMM + norms/glue: +5.4 ms.** llama computes router logits via a full FP4 GEMM (1.6) and has
heavier `k_bin_bcast` combine/residual + memcpy; vLLM fuses routing into tiny topk/align kernels.
## THE SINGLE BIGGEST vLLM-MoE ADVANTAGE
**Not the Marlin GEMM.** It is a near-tie between two ~6.5 ms buckets, both bf16-precision-related:
- **Dense projections (+6.5 ms)** - vLLM runs the GDN/attn projections + lm_head as NVFP4-Marlin while
llama runs them BF16 + a 2.9 ms convert tax. Single biggest *bucket* delta.
- **Recurrent-state gather (+5.2 ms, kernel `k_get_rows_float`)** - the single biggest *kernel* vLLM
avoids entirely (in-place fused state vs llama's separate gather). Plus +1.8 ms more REC plumbing.
The MoE grouped GEMM (the brief's hypothesis) is a **-1.7 ms llama win**, so it is explicitly ruled out.
## ANSWERS TO THE BRIEF
1. **WHERE is vLLM's 15%?** Spread across bf16-projection BW (+6.5) + recurrence state-gather plumbing
(+6.6) + graph/overlap (+7.0) + act-quant tax (+3.3) + router/glue (+5.4). **NOT the MoE GEMM.**
2. **Is Marlin faster at tiny-M decode?** **No.** llama native FP4-MMA W4A4 = 47.3 ms vs Marlin W4A16 =
50.0 ms. Marlin is ~5% slower here; both are at the LPDDR5x BW floor.
3. **Should llama implement a Marlin-style W4A16 MoE GEMM?** **No** - it would slow the MoE GEMM and is
not where the gap lives. The `w4a16-marlin` DENSE verdict ("NVFP4-dense-quant, not the Marlin kernel")
carries to MoE. The real, ordered levers are: **(a) NVFP4-quantize the still-bf16 GDN/attn projections
+ lm_head** (close ~+6.5, the largest, bit-changing but the same class of move vLLM makes); **(b) fuse
away the recurrent-state gather `k_get_rows`** (~+5, bit-exact, the biggest single-kernel win);
**(c) fuller CUDA-graph coverage + stream overlap** (~+7, bit-exact); **(d) fuse the W4A4 act-quant
into the preceding op** (+3.3, bit-exact). None of these is a new MoE GEMM.
---
# FINAL DECISION (cross-agent synthesis) - "can we do what vLLM does on MoE?"
Three agents converged on the same verdict from independent angles: `moe-gap-groundtruth`
(the measured both-engine nsys decomposition above), `vllm-marlin-study` (source-read of vLLM's
`moe_wna16_marlin_gemm` / `moe_align_block_size` / `prepare_nvfp4_moe_layer_for_marlin` on the DGX),
and `marlin-port-feasibility` (read-only assessment of the dense W4A16 scaffold + prior STOP). All
three agree, and the measurement is the arbiter. Below is the decision the user asked for.
## (1) WHERE the 15% lives - decisive
The gap is **27.8 ms/step (llama at 83.6% of vLLM)** and it is **NOT one kernel - it is a sum of small
deltas, and the MoE grouped GEMM is on llama's side of the ledger.** Ranked:
| rank | lever | Δ ms/step | bit-exact? | this is... |
|-----:|-------|----------:|:----------:|------------|
| 1 | Graph coverage + cross-stream overlap | ~+7.0 | **yes** | scheduler/runtime (idle +3.8, overlap +3.2) |
| 2 | Recurrence state-gather/conv plumbing (`k_get_rows_float` 5.2 + conv) | +6.6 | **yes** | llama-only kernels; vLLM updates state in-place |
| 3 | Dense GDN/attn projections + lm_head (bf16 vs NVFP4) + convert glue | +6.5 | **no** | the NVFP4-dense-quant lever, on the projections |
| 4 | Router GEMM + norms/combine/memcpy glue | +5.4 | mostly yes | llama router = full FP4 GEMM; vLLM fuses topk/align |
| 5 | W4A4 act-quant tax (`quantize_mmq_nvfp4`) | +3.3 | **yes** | vLLM's W4A16 makes this structurally 0 |
| - | **MoE-expert grouped GEMM** | **-1.7** | - | **llama WIN** - native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 |
**The Marlin GEMM is explicitly ruled out as the source of the gap.** Both engines read the same ~22 GB
of ~4-bit expert weights once per step and are LPDDR5x-bandwidth-bound; on that weight stream they tie,
and llama's 2x-rate FP4-MMA edges Marlin's half-rate bf16 MMA. It is **not the FP4-MMA-vs-Marlin
distinction in the experts; it is the projections in the DENSE path, the recurrence plumbing, and the
runtime/graph** that cost llama the 15%. Not one kernel, not the expert GEMM, not routing alone.
## (2) Can llama MATCH it - and HOW
**Yes - to within a few percent, and NOT with a Marlin/W4A16 MoE GEMM.** The two biggest *compute*
kernels (the gated-DeltaNet SSM core 70.0 vs 71.1, and the MoE grouped GEMM 47.3 vs 50.0) are **already
llama wins.** The gap is overhead/scheduling/precision-of-the-other-tensors, all of which llama can
attack on its existing W4A4 FP4-MMA expert path. The four levers, in recommended build order:
| order | build | gain | bit-exact / gate | effort |
|------:|-------|-----:|------------------|--------|
| 1st | **Fuse away the recurrent-state gather `k_get_rows_float`** (update SSM state in-place in the GDN decode path, fold `ssm_conv_update`) | ~+5 ms (~3% of step) - biggest single-kernel win | **bit-exact** (no md5 rebaseline) | medium - CUDA, the GDN decode kernel |
| 2nd | **Fuller CUDA-graph coverage + stream overlap** (extend the 0025 re-graph to the remaining MoE/projection nodes, overlap independent streams) | ~+7 ms combined; 0025 already banked ~+1.9% | **bit-exact** | medium - scheduler, partly done |
| 3rd | **NVFP4-quantize the still-bf16 GDN/attn projections + lm_head** (the same move vLLM makes on its dense path; 4-bit weight read ~4x less BW, kills the 2.9 ms bf16<->f32 convert) | ~+6.5 ms - biggest *bucket* | **bit-changing** (re-baselines md5 gates; precision-UPGRADE, see below) | medium-high - new NVFP4 weight path for non-expert linears |
| 4th | **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** (as vLLM fuses act-quant) | +3.3 ms | **bit-exact** | low-medium |
**Reach:** the three bit-exact levers (1+2+4 ~= +15.3 ms) alone close the gap to ~154.5 ms/step
=> ~830 t/s = **~94% of vLLM, with zero precision change and zero md5 rebaseline.** Adding the
NVFP4-projection lever (3, +6.5) reaches ~148 ms => ~865 t/s = **~96-97% of vLLM**, with the residual
being router/glue and the irreducible cross-stream-overlap that is structural to how ggml schedules
host-launched nodes vs vLLM's single fused graph. Because llama's two heaviest kernels are already
ahead, **parity-or-better is physically reachable** once the plumbing/overhead is removed; vLLM has no
arithmetic advantage on this hardware (its W4A16 path runs at half the FP4-MMA rate; it only wins on
overhead and on the dense-path weight-read BW).
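The reach estimate is plain subtraction on the measured step time. A sketch assuming the lever gains from the build-order table are additive (an optimistic simplification) and that one step decodes 128 tokens; `reach` is an illustrative helper.

```python
LLAMA_STEP_MS = 169.8
BATCH = 128

# Estimated gain per lever in ms/step, from the build-order table above.
LEVER_GAIN_MS = {
    1: 5.0,  # fuse away k_get_rows_float          (bit-exact)
    2: 7.0,  # graph coverage + stream overlap     (bit-exact)
    3: 6.5,  # NVFP4 the bf16 projections          (bit-changing)
    4: 3.3,  # fuse W4A4 act-quant                 (bit-exact)
}


def reach(levers, step_ms=LLAMA_STEP_MS):
    """Projected (step ms, tokens/s) if the given levers landed in full."""
    new_step = step_ms - sum(LEVER_GAIN_MS[i] for i in levers)
    return new_step, BATCH * 1000.0 / new_step


if __name__ == "__main__":
    for label, levers in (("bit-exact only", (1, 2, 4)), ("all four", (1, 2, 3, 4))):
        step, tps = reach(levers)
        print(f"{label:15s} {step:6.1f} ms/step  ~{tps:.0f} t/s")
```

These are ceilings, not forecasts: each lever is credited with its full measured delta, and the final section of the document revises the lever-3 premise.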
## (3) The leading lever, in full - and the Marlin question, settled
**The user's specific hypothesis - "do what vLLM does = a Marlin-style W4A16 grouped MoE GEMM" - is
REJECTED, by measurement and by feasibility.**
- **It is not where the gap is.** The MoE GEMM is a **-1.7 ms llama win.** A W4A16 Marlin MoE GEMM would
make that bucket SLOWER (half-rate bf16 MMA on the ~27% GEMM bucket), not faster.
- **Its entire intrinsic upside is the ~2% act-quant tax** (W4A16 has no activation quantize). That
+2% ceiling is **smaller than the +1.9% the bit-exact 0025 re-graph already banked**, at vastly higher
effort and with a precision change. And the act-quant tax is independently closeable bit-exactly by
lever 4 (fuse it into the preceding op) without touching the GEMM.
- **The scaffold does not help.** `paged/kernel/w4a16/marlin-w4a16.cu` is dense-only, Q4_0/Q4_K, with no
grouped/MUL_MAT_ID path and no NVFP4 dequant. A real MoE Marlin is effectively a from-scratch port of
`moe_wna16_marlin_gemm` (per-expert M-tiles, block-padded `moe_align` token-sort, stream-K over ragged
segments, NVFP4->bf16 in-kernel dequant). vLLM only reaches the BW floor via cutlass-SM120 TMA +
warp-specialized pipelining; the GB10 occupancy-only route the dense scaffold tried **plateaued at
~9 TFLOPS / 178 t/s (~5x under MMQ)** and STOPPED at the occupancy wall (XOR-swizzle + deep cp.async
collapse GB10 occupancy). Realistic outcome of an MoE port: **a net REGRESSION** on the 27% GEMM
bucket. Multi-week, high-risk, DGX-only, no `ncu`, for a +2% ceiling. **Do not build it.**
**Why vLLM runs W4A16 at all:** not because it is better - because sm_121 (consumer Blackwell / GB10)
has no working cutlass FP4 MoE cubins (vLLM whitelists only sm_100/103 datacenter Blackwell for native
FP4 MoE; the engine literally warns it is falling back to "Weight-only FP4 ... Marlin kernel"). On GB10,
W4A16 is HALF the FP4-MMA rate. **llama's native W4A4 FP4-MMA is the higher hardware tier; matching vLLM
does NOT mean copying its W4A16 fallback.**
**Precision / gate (the brief's key nuance, assessed honestly):** the observation that W4A16 (bf16 acts)
is a strict activation-precision UPGRADE over W4A4 (FP4 acts), with better KL-to-f32, is **correct but
unmonetizable here.** (a) The current W4A4 MoE default is **already bit-exact to the f32 reference**
(test-backend-ops MUL_MAT_ID 806/806, greedy md5 stable on both models) - you get no quality credit for
being more precise than a default that already passes, and the precision-sensitive site is the
gated-DeltaNet SSM *state* (a different op, addressed by the separate 0026 bf16-SSM opt-in), not the MoE
GEMM. (b) W4A16 is **non-bit-exact vs the W4A4 default, so adopting it re-baselines every shipped md5
gate** - a real cost for a +2% throughput ceiling that is itself likely negative. So the precision angle
does not flip the verdict: it would be a precision upgrade nobody needs, bought with a slower,
occupancy-hostile, gate-rebaselining kernel. The one genuinely precision-positive AND throughput-positive
move that quantizes weights is **lever 3 (NVFP4 projections)** - and that is W4A16 on the DENSE linears
(where it cuts weight-read BW), not on the experts.
## (4) HONEST VERDICT + recommended build
**VERDICT: We can essentially match vLLM on MoE decode (~94% bit-exact, ~96-97% with the projection
quant, parity-or-better physically in reach), but NOT by doing "what vLLM does" in the sense the question
implies. A Marlin/W4A16 grouped MoE GEMM is the wrong lever - the MoE GEMM is already a llama win and a
W4A16 port would regress it. The 15% is bf16 dense-projection bandwidth + recurrence-gather plumbing +
graph/overlap overhead + a 2% act-quant tax + router glue. Every piece is closeable on llama's existing
native-FP4 expert path, mostly bit-exactly.**
**Recommended build (ship order, none of it a new MoE GEMM):**
1. **`k_get_rows` SSM-state-gather fusion** - bit-exact, ~+5 ms, biggest single-kernel win, no rebaseline. **Do first.**
2. **Extend CUDA-graph coverage + stream overlap** beyond 0025 - bit-exact, ~+7 ms combined, partly banked.
3. **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** - bit-exact, +3.3 ms, erases the act-quant tax (the only thing W4A16 would have bought) without W4A16.
4. **NVFP4-quantize the bf16 GDN/attn projections + lm_head** - +6.5 ms (biggest bucket), bit-changing
(re-gate md5; precision-UPGRADE, the same NVFP4-dense-quant move vLLM makes). Ship as default after
re-gating, or as an opt-in if the md5 rebaseline is undesirable.
**Do NOT build:** the W4A16/Marlin grouped MoE GEMM (`paged/kernel/w4a16/` scaffold is dense-only and not
reusable). Neither default nor opt-in: +2% ceiling < the already-banked bit-exact +1.9%, likely a net
regression on the 27% GEMM bucket, multi-week high-risk, and it rebaselines every gate. The dense
`w4a16-marlin` STOP transfers to MoE, and MORE strongly (the tiny-M decode shape is purely BW-bound, so
the FP4-vs-bf16 tier is a wash that the weight-read floor erases - leaving only the half-rate downside).
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# LEVER 4 (scope) - NVFP4-quantize the still-bf16 MoE GDN/attn projections (+lm_head), the +6.5 ms bucket
Label `L4-scope`, READ-ONLY (no GPU). This scopes lever 4 as numbered in the recommended ship order above
(the row marked "3rd" in the lever table) - the single biggest *bucket* in that table
(**Dense projections +glue, +6.5 ms**) and the only remaining MoE lever with a real, measurable
gain after levers 2 and 3 both came back FLAT measurement-STOPs (no patch, no commit - see
`LEVER2_GRAPH_COVERAGE_RESULTS.md`, `LEVER3_ACTQUANT_FUSION_RESULTS.md`, `LEVERS_23_PROGRESS.md`). Lever 4
is **bit-changing** (re-gates md5; gate on KL-to-f32, not bit-exact md5). Below: the root cause, the
path, effort, the precision/KL story, the expected gain, and the default-vs-opt-in recommendation.
## Root cause: the MoE GGUF's projections are bf16 only because of its quant PROVENANCE
The "still-bf16 GDN/attn projections" are **MoE-specific, and they are an accident of how the MoE
checkpoint was quantized - not a llama limitation.** The two GGUFs have different quant lineages:
- **Dense `q36-27b-nvfp4` (unsloth, native-Blackwell FP4, 304 NVFP4 tensors):** the GDN/attn projections
ARE already NVFP4. Proven directly - `DECODE_PARITY_EXPLORE.md:594` shows the dense `ssm_out`
(GDN out-projection) running as an **FP4 GEMV/MMQ** (`mul_mat_vec_q`/`mul_mat_q<NVFP4>`), and the
in_proj runs FP4 MMQ at M=128. This is exactly why the **dense decode is already at 96.6% of vLLM** -
it has essentially no bf16-projection bucket left.
- **MoE `q36-35b-a3b-nvfp4` (nvidia modelopt, 241 NVFP4 tensors):** modelopt quantized the **256-expert
FFN** tensors to NVFP4 (the 241 count is dominated by the packed grouped-expert tensors) but **left the
GDN/attn linear projections in BF16** - `in_proj_qkvz`, `in_proj_ba`, the GDN `out_proj`/`ssm_out`, and
the full-attn `attn_q/k/v/output`. Those are exactly the **bf16 nvjet/cutlass projection GEMMs** seen in
the MoE decode top-kernel list (8.2 `nvjet 192x136` + 4.5 `cutlass::Kernel2` + 4.1 `nvjet 128x64`)
plus the 2.9 ms `convert_unary` bf16<->f32 glue = the **20.3 ms projection bucket** vs vLLM's 13.8 ms
(vLLM runs the same projections, and on this modelopt checkpoint even its lm_head, as NVFP4-Marlin -
see its `2.8 marlin dense (lm_head NVFP4)` kernel).
**=> Lever 4 is overwhelmingly a MoE-GGUF move:** bring the MoE GGUF's GDN/attn projections to the SAME
NVFP4 the DENSE GGUF already ships and that vLLM already runs on the identical weights. It is not a new
capability - the dense GGUF is the existence proof that llama runs and ships these projections in NVFP4.
## (1) THE PATH + EFFORT
Two ways to get the projection weights into NVFP4:
- **PATH A - offline re-quantize to a NEW GGUF variant (RECOMMENDED, = exactly what vLLM does).** Re-run
`llama-quantize` on the MoE source with the `--tensor-type` selector EXPANDED to also capture the
GDN/attn projection tensor-name patterns that the modelopt checkpoint left bf16 (the GDN `in_proj_*` /
`out_proj`/`ssm_out` and full-attn `attn_q/k/v/output` weights), producing e.g.
`q36-35b-a3b-nvfp4-projq.gguf`. **ZERO kernel/runtime code:** NVFP4 weights already flow end-to-end -
the loader auto-creates the per-tensor NVFP4 sidecar scales when `type == GGML_TYPE_NVFP4`
(`llama-model.cpp:1459`), and the projection GEMMs then route to the already-tuned `mul_mat_q<NVFP4>`
(patch 0017) instead of cublas/nvjet. The dense GGUF is the live proof this path works and gates clean.
**Effort: LOW-MEDIUM** - the only "build" is the quantize recipe + a KL gate harness + a gallery/index
entry + a RELEASE note. Risk items: (i) confirm the exact bf16 tensor list with a CPU `gguf_dump`
(metadata-only, no GPU); (ii) NVFP4 needs the contraction dim divisible by the 16-elt block - any
projection whose row dim is not a multiple of 16 stays bf16 (or needs padding), which is the most
likely reason a given tensor was left bf16 and must be checked per-tensor; (iii) the lm_head decision
(below).
- **PATH B - runtime quantize bf16->NVFP4 at load.** Convert the bf16 projection weights in-memory at
model load (one-time ue4m3 per-block scale-search), GGUF unchanged. **Worse choice:** needs new
load-time quant code (MEDIUM), and it *silently* changes the output of an existing GGUF for current
users (an implicit, non-opt-in precision change) - strictly inferior to an explicit new artifact.
Only attractive if shipping a new GGUF is somehow impossible; it is not.
## (2) PRECISION / KL story (honest)
Quantizing the projection WEIGHTS bf16 -> NVFP4 (e2m1 + per-16 ue4m3 scale) is a per-weight precision
**downgrade vs the current bf16** on those specific tensors (it adds ~4-bit weight-quant error), and -
because they route to the W4A4 MMQ path - it also FP4-quantizes those projections' activations. It is
NOT a precision upgrade over bf16; it is the **same W4A4/W4A16-class move vLLM already makes on these
same projections**, so at matched precision it is apples-to-apples with vLLM. Non-bit-exact => **re-gate
on KL-to-f32, not md5.**
**KL estimate: should PASS with margin.** Three independent reasons: (a) the dense GGUF ALREADY ships
these GDN/attn projections in NVFP4 and passes its greedy gate (`5951a5b4...`), so the move is
empirically proven shippable on this architecture; (b) the 256 experts already run W4A4 NVFP4 and pass
(test-backend-ops MUL_MAT_ID 806/806, greedy md5 stable) - the GDN/attn projections are the same class of
linear op and arguably less sensitive than the expert FFN; (c) this is a per-step, **non-accumulating**
weight-quant error - structurally unlike the bf16-GDN-*state* experiment (`BF16_SSM_STATE_RESULTS.md`)
that FAILED the KL gate (KLD 0.06-0.17, ~10% argmax flips) because that error *accumulated* through the
recurrence. Expect KLD-to-f32 well under that failed-state threshold and a sub-percent PPL delta (the
broader NVFP4-dense ~+4.8% PPL-vs-Q4_K figure is for full-model NVFP4; here only a minority of residual
projection tensors move). **The one genuinely risky tensor is lm_head** (logit-direct;
`OTHER_PATHS_INVESTIGATION.md` flags that an NVFP4 lm_head can flip the greedy argmax). For the MoE, quantizing lm_head is
*fair* (vLLM's modelopt checkpoint already runs lm_head NVFP4), so include it but gate it explicitly on
argmax-agreement; if it flips the greedy probe, keep lm_head bf16 and bank only the GDN/attn portion.
Recommended gate: **KLD-to-f32 < the bf16-state failure floor (~0.06) AND PPL delta < ~1% vs the current
bf16-projection GGUF AND zero greedy-argmax flips on the -n 48 probe.**
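
The three-part gate above is a plain conjunction, which a harness could encode directly. This is a hedged sketch: the `GateResult` type, its field names, and `passes_gate` are hypothetical, and only the thresholds come from the text; how KLD, PPL and argmax flips are measured is out of scope here.

```python
from dataclasses import dataclass

KLD_MAX = 0.06            # the bf16-state failure floor quoted above
PPL_DELTA_MAX_PCT = 1.0   # vs the current bf16-projection GGUF

@dataclass
class GateResult:
    kld_to_f32: float     # KL divergence against the f32 reference
    ppl_delta_pct: float  # perplexity change in percent
    argmax_flips: int     # greedy-argmax disagreements on the probe

def passes_gate(r: GateResult) -> bool:
    """All three conditions must hold; any single failure rejects the variant."""
    return (
        r.kld_to_f32 < KLD_MAX
        and r.ppl_delta_pct < PPL_DELTA_MAX_PCT
        and r.argmax_flips == 0
    )

# Hypothetical inputs, for illustration only.
print(passes_gate(GateResult(kld_to_f32=0.01, ppl_delta_pct=0.4, argmax_flips=0)))
print(passes_gate(GateResult(kld_to_f32=0.01, ppl_delta_pct=6.2, argmax_flips=0)))
```

Keeping the gate as one boolean makes the lm_head fallback simple: re-run it on the variant with lm_head left in bf16 and ship whichever variant passes.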
## (3) EXPECTED MoE GAIN
Closing the +6.5 ms projection bucket = bringing llama's 20.3 ms projection bucket down to vLLM's
~13.8 ms (NVFP4 cuts the projection weight-read ~4x - 2.37 GB-class bf16 -> ~0.56 B/wt - and the W4A4
MMQ path stays in the quantized domain, **erasing the 2.9 ms `convert_unary` bf16<->f32 glue**). llama's
native FP4-MMA is faster per-FLOP than vLLM's W4A16-Marlin and these projections are BW-bound, so llama
lands at parity-or-slightly-better, same as the expert GEMM (where W4A4 beat Marlin by 1.7 ms).
- With **lm_head also NVFP4** (fair on this modelopt MoE, vLLM did it): full ~**+6.5 ms** =>
step 169.8 -> ~163.3 ms => ~785 t/s.
- With **lm_head kept bf16** (conservative): ~**+4 to +5 ms** (the GDN/attn projections + the convert
glue; lm_head's ~bf16 GEMM stays) => step 169.8 -> ~165-166 ms => ~768-775 t/s.
In MOE_GAP frame (vLLM 142.0 ms / 901 t/s-equiv): **MoE moves from 86.3% (post-lever-1 / 0028) toward
~89-91% of vLLM** (full bucket) or ~88% (lm_head bf16). This is the **largest single banked MoE gain
available** - lever 1 (gather) shipped, levers 2 and 3 banked nothing, and the MoE GEMM is already a
llama win - so after lever 4 the residual is just router/glue + the structural cross-stream-overlap and
the ~4.2 ms host bubble (reachable only via a paged-attn host-pipeline edit, not a quant or graph knob).
## (4) RECOMMENDATION: ship as a SEPARATE OPT-IN gallery GGUF variant (KL-gated), not a re-gated default
**Ship lever 4 as a distinct, opt-in gallery variant** (e.g. `q36-35b-a3b-nvfp4-projq` / `-w4a4full`),
**not** as a silent replacement of the default MoE GGUF. Rationale:
1. The current default MoE GGUF is **md5-bit-exact-gated** (`07db32c2...` shipped); making it default
forces a permanent md5 rebaseline of every gate - the hard line this whole track has held (levers 2+3
STOPPED rather than cross it). A new artifact sidesteps that for users who chose the f32-lineage GGUF.
2. Path A produces a **new GGUF anyway** (offline re-quant), so a separate gallery entry costs nothing
extra and makes the throughput<->precision choice explicit and reversible.
3. The gain (~+4-6.5 ms, ~86% -> ~88-91% of vLLM) is real but modest - not worth forcing a precision
change on default-path users.
4. **Promotion path:** because lever 4 only brings the MoE GGUF to the SAME NVFP4 the dense GGUF already
ships *as its default* and that vLLM already runs, a clean KL gate (KLD << 0.06, PPL delta < ~0.5%,
zero argmax flips) is a strong case to PROMOTE the variant to the default MoE GGUF in a later release.
Ship opt-in first to preserve the bit-exact default and avoid a forced rebaseline; promote if the
gate is clean and lm_head NVFP4 holds.
**Effort summary:** LOW-MEDIUM, dominated by the KL gate + gallery wiring, NOT code (zero new kernel; the
NVFP4 weight path - loader sidecar scales + tuned `mul_mat_q<NVFP4>` - is already in tree and proven by
the dense GGUF). Highest-ROI remaining MoE lever. **Do first among remaining MoE work**, ahead of any
non-bit-exact recurrence-plumbing or the rejected W4A16/Marlin GEMM.
Assisted-by: Claude:opus-4.8 [Claude Code]
> **SUPERSEDED:** the lever-4 scope above was optimistic and PRE-GATE. The L4 KL gate FAILED
> (+6.15-6.51% PPL, see `LEVER4_PROJNVFP4_RESULTS.md`) and the premise was wrong (vLLM keeps these
> projections BF16 too). Lever 4 is REJECTED - do NOT ship. See the FINAL section below.
---
# RESIDUAL-ASSESS (FINAL, concludes the hunt) - convert-glue + bf16-GEMM verdicts, the bit-exact MoE ceiling
Label `residual-assess`, DGX GB10 (sm_121). After lever 1 shipped (0028, MoE 86.3% of vLLM @npl128,
bit-exact), levers 2+3 flat, lever 4 REJECTED (KL-gate FAIL, AND vLLM keeps the same projections bf16),
and lever 5 flat for MoE (host-side, off the compute-bound critical path; dense gets +0.41%), this is the
final honest assessment of the two remaining sub-levers inside the 20.3-vs-13.8 ms projection bucket.
Both are **bit-CHANGING or at-the-BW-floor.** The hunt is DONE.
## CORRECTION that reframes the projection bucket
The body above assumed **vLLM runs the GDN/attn projections as NVFP4-Marlin.** FALSE (confirmed by the L4
gate). vLLM runs the **same nvidia-modelopt checkpoint** as the GGUF, which keeps `in_proj_qkvz`,
`in_proj_ba`, `out_proj`, `attn_gate`, and full-attn `attn_q/k/v/output` in **BF16**. llama and vLLM run
these projections at the **same precision (bf16).** The +6.5 ms projection-bucket delta is therefore NOT
a precision/quant gap - it is (a) llama's f32-residual-stream convert tax and (b) bf16-GEMM kernel /
round-trip efficiency, both at matched bf16 precision.
## (1) convert-glue verdict (3.24 ms/step measured): NOT bit-exact eliminable
Empirical split (`moe_dec` nsys, per-step over 43 decode steps):
- `convert_unary<float,bf16>` (input, f32 act -> bf16): **1.73 ms/step**, 186 calls/step
- `convert_unary<bf16,float>` (output, bf16 -> f32): **1.52 ms/step**, 186 calls/step (equal count = every
bf16 projection round-trips)
Source root cause (`ggml/src/ggml-cuda/ggml-cuda.cu:1663-1690`, the `src0->type == BF16` cuBLAS path):
ggml converts f32 activations to bf16, runs `cublasGemmEx` bf16xbf16 with **CUBLAS_COMPUTE_32F** but
writes the result to a **bf16** buffer (`dst_bf16`, `CUDA_R_16BF`), then widens bf16 -> f32. The f32
accumulator is **rounded to bf16 and then widened back** - it drops 16 of f32's 23 mantissa bits (bf16 keeps 7), and that
bf16-rounded value feeds the f32 residual stream.
- The **output round-trip is load-bearing for the shipped numerics.** The fp16-fp32-compute path 40 lines
down (`:1729`, `dst CUDA_R_32F`) proves cuBLAS CAN write the f32 accumulator directly - so the bf16
output write+convert is a removable ggml inefficiency. BUT removing it (f32-direct output) changes the
value from "bf16-rounded" to "full-f32" => greedy md5 (`07db32c2`) re-baselines. It is a **precision
boundary (an upgrade), exactly like lever 4.** NOT bit-exact.
- The **input convert is intrinsic** to a bf16 GEMM (cuBLAS needs bf16 inputs; ggml's residual stream is
f32). The only bit-exact move is to fuse the f32->bf16 cast into the producing op's epilogue (same RNE
rounding, one fewer launch) - but that is per-site ggml graph surgery for a sub-1.7 ms launch ceiling,
and it is **subsumed by the (rejected) lever-4 move**: NVFP4-quantizing the weights routes the
projection to `mul_mat_q<NVFP4>` (W4A4) and deletes the entire bf16 cuBLAS path - input convert, GEMM,
output convert - in one shot.
- vLLM pays ~0 here because it runs an **end-to-end bf16 residual stream** (no f32 intermediate). Matching
that = converting llama's residual stream to bf16 = a global precision change, md5 rebaseline. Also not
bit-exact.
**Verdict: bit-exact-eliminable = NO.** The f32<->bf16 round-trip is load-bearing for the current md5 (the
bf16-rounded output IS the shipped value). Every way to remove it (f32-direct GEMM output, bf16 residual
stream, or NVFP4 weights) is bit-changing. The one bit-exact sliver (fuse the input cast into the
producer) is ~1.7 ms ceiling, high per-site effort, and redundant with lever 4. (Aside: the f32-direct
GEMM output is a genuine upstreamable ggml win - faster AND more precise - but it rebaselines md5, so it
is off the bit-exact table for this hunt.)
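
To make the "bf16-rounded value IS the shipped value" point concrete, the round-trip can be emulated on the CPU. This is a minimal sketch, assuming round-to-nearest-even on finite inputs (NaN/Inf handling is omitted); it illustrates the numerics only and is not the ggml or cuBLAS code path.

```python
import struct

def f32_to_bits(x: float) -> int:
    """Bit pattern of x stored as an IEEE-754 binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]

def round_trip_bf16(x: float) -> float:
    """f32 -> bf16 (round-to-nearest-even) -> f32, for finite inputs."""
    bits = f32_to_bits(x)
    lsb = (bits >> 16) & 1                      # lowest bit that survives in bf16
    bits = (bits + 0x7FFF + lsb) & 0xFFFFFFFF   # RNE bias, ties go to even
    return bits_to_f32(bits & 0xFFFF0000)       # the low 16 bits are discarded

acc = 1.0 + 2.0 ** -10   # an f32 accumulator value carrying low-order detail
print(acc, "->", round_trip_bf16(acc))
```

Any change that lets the f32 accumulator reach the residual stream unrounded alters exactly these low-order bits, which is why it moves the greedy md5 even though it is a precision improvement.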
## (2) bf16 projection GEMM verdict (17.27 ms/step measured): BW-bound at the floor, no kernel lever
Per-step bf16-projection GEMM (nvjet cuBLASLt + cutlass bf16, `moe_dec` nsys): **17.27 ms/step, 225
calls/step.** Roofline at the M=128 decode shape:
- Arithmetic intensity ~= 2*M FLOP / 2 bytes-per-weight = **M = 128 FLOP/byte** (the weight read
dominates; activations/output negligible at M=128).
- GB10: LPDDR5x unified BW ~= **273 GB/s**; bf16 tensor-core peak >= ~250 TFLOPS => ridge point ~=
250e12 / 273e9 ~= **>900 FLOP/byte.** 128 << 900 => **memory-bandwidth-bound by ~7x.**
- Achieved: 17.27 ms at 273 GB/s = **~4.7 GB of bf16 projection weights streamed per step** - i.e. the
GEMM moves the weight bytes at ~full LPDDR5x bandwidth. **It is at the BW floor.**
The nvjet kernels are `tmaAB` (TMA-streamed on both operands) - the optimal Blackwell weight-streaming
access pattern; vLLM's cutlass does the same and reads the **same bf16 bytes.** A cutlass swap cannot beat
the byte floor. The only way faster is **fewer weight bytes = quantize** (lever 4, ~4x fewer bytes) -
bit-changing AND rejected on quality (+6% PPL) AND not even a vLLM-parity gap. The residual ~3.5 ms of the
llama-vs-vLLM GEMM-bucket delta traces to llama's extra `dst_bf16` write+read round-trip traffic (the
convert glue of verdict 1), not a worse GEMM kernel.
**Verdict: at the bandwidth floor; no bit-exact (nor even same-precision) kernel lever exists.** nvjet
already streams the weights near-optimally.
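
The roofline numbers in this verdict follow from three inputs quoted above (M = 128, ~273 GB/s, >= ~250 TFLOPS bf16). A short sketch of the arithmetic, with variable names chosen for illustration:

```python
M = 128                   # tokens per decode step (the npl128 shape)
BYTES_PER_WEIGHT = 2      # bf16
BW = 273e9                # LPDDR5x bandwidth, bytes/s
PEAK = 250e12             # bf16 tensor-core rate, FLOP/s (lower bound from the text)
GEMM_S = 17.27e-3         # measured bf16-projection GEMM time per step, seconds

intensity = 2 * M / BYTES_PER_WEIGHT   # FLOP per weight byte: each weight does 2*M FLOP
ridge = PEAK / BW                      # intensity where compute and bandwidth balance
bytes_streamed = GEMM_S * BW           # bytes moved if the GEMM runs at full bandwidth

print(f"arithmetic intensity: {intensity:.0f} FLOP/byte")
print(f"ridge point:          {ridge:.0f} FLOP/byte")
print(f"bandwidth-bound by:   {ridge / intensity:.1f}x")
print(f"bytes per step:       {bytes_streamed / 1e9:.2f} GB")
```

Because the intensity sits well below the ridge point, a faster MMA kernel does not help; only reducing the bytes read per step would.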
## (3) The bit-exact MoE ceiling, and the irreducible residual
| MoE lever | status | bit-exact? | MoE gain |
|-----------|--------|:----------:|----------|
| 1 - recurrent-state gather fusion (0028) | **SHIPPED** | yes | banked -> 86.3% of vLLM |
| 2 - graph coverage / overlap | flat | yes | ~0 |
| 3 - act-quant fusion | flat | yes | ~0 |
| 5 - block-table within-step cache | flat for MoE | yes | ~0 (host off compute-bound path; dense +0.41%) |
| 4 - NVFP4 projections | REJECTED | no | +6% PPL, not a vLLM gap |
| convert-glue elimination | this assess | **no** (precision boundary) | bit-changing only |
| bf16-GEMM kernel | this assess | **no** (BW floor) | none |

**Realistic bit-exact MoE ceiling = ~86-88% of vLLM @npl128. The shipped state (lever 1, 86.3%) is
essentially AT it.** Lever 5 adds nothing to MoE. No clean bit-exact MoE lever remains.
**The irreducible ~12-14% residual to vLLM is structural, not a missing optimization:**
1. **f32-residual-stream convert tax (~3.2 ms/step)** - ggml runs an f32 graph and casts per bf16
projection; vLLM runs bf16 end-to-end. Removing it is a precision change.
2. **bf16-GEMM BW floor + round-trip traffic (~3.5 ms/step)** - both engines at the LPDDR5x byte floor on
bf16 weights; the delta is the round-trip traffic (= item 1, bit-changing).
3. **Recurrence-plumbing remainder** - mostly banked by lever 1; the core SSM kernel is already a llama
win.
4. **Between-replay host loop + graph/overlap bubble** - sampling needs logits between graph replays;
irreducible at this batch shape.
## CONCLUSION: the MoE-parity hunt is DONE
The MoE is at its bit-exact ceiling. The two heaviest MoE compute kernels (the gated-DeltaNet SSM core and
the NVFP4 expert grouped GEMM) are **already llama wins**, so there is no arithmetic gap to close. The
remaining 12-14% is the f32-vs-bf16 graph-precision tax, the bf16-weight BW floor, and the irreducible
host loop - none of which is a clean bit-exact lever, and the one bit-changing option (quantize the
projections) is rejected on quality and is not even a vLLM-parity gap. **No further lever remains for MoE.** The
only clean win left in the whole track is DENSE (+0.41% from lever 5), gated behind first resolving the
pre-existing paged-MoE baseline md5 drift (paged `8cb0ce23` vs canonical `07db32c2`) the L5 finish flagged.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,220 +0,0 @@
# Durable scope: grouped FP4-MMA MoE GEMM for ggml CUDA on GB10 (sm_121)
Build-ready plan. **Not implemented in this workflow** (large kernel work). This
document scopes the durable path to match or beat vLLM MoE grouped-GEMM efficiency
on GB10 for the Qwen3-30B-A3B-class mxfp4 MoE, and records the single honest
finding that re-shapes the whole effort.
Hardware: NVIDIA GB10 (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
LPDDR5X ~273 GB/s. Model: Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`,
HEAD at patch 0013), `build-cuda` sm_121.
## TL;DR (the honest reframe)
**The grouped GEMM the mission scoped to build from scratch already exists in
upstream ggml, and it already runs on GB10 for mxfp4.** For mxfp4 experts on
sm_121 `ggml_cuda_should_use_mmq()` returns true (`turing_mma_available`), so
MUL_MAT_ID takes the **grouped mmq path**, which already contains both vLLM
building blocks:
1. a moe_align / token-sort-by-expert (`mmid.cu` `mm_ids_helper`:
count -> warp-scan/cumsum -> scatter into expert-sorted contiguous buffers),
2. a **single persistent stream-k grouped FP4-MMA GEMM** (one `mul_mat_q` launch;
grid flattened into kbc-continuous space over expert x col-tile x row-tile x
k-block; native FP4 MMA via `block_fp4_mmq` under `BLACKWELL_MMA_AVAILABLE`).
The per-expert host-side row-gather loop in `ggml-cuda.cu`
`ggml_cuda_mul_mat_id()` (~L2632-2790) - the path the mission's root-cause
analysis describes as "the cliff" - is a **fallback only reached when
`should_use_mmq()==false`** (f16/bf16 experts, non-Blackwell). It is **never the
GB10 mxfp4 path.**
Consequence: the "npl128 MoE cliff" does not exist on the current dev HEAD.
Re-measured batched-bench decode (`S_TG` t/s) on the mxfp4 MoE rises monotonically
`85 / 278 / 637 / 950 / 1306 / 1771` at npl `1 / 8 / 32 / 64 / 128 / 256`. The
original `253/505/830/620` cliff was a real high-batch regression that has since
been **fixed upstream** (FP4-native grouped mmq + MoE stream-k balancing), not a
batched-bench artifact.
**Therefore the durable work is NOT "port moe_align + a grouped GEMM."** It is a
**surgical fix to the one place ggml diverges from vLLM: the M-tile (token-tile)
sizing heuristic.** This document scopes that delta, plus the optional
block-padded align, plus the parity gate and phased plan. It also records what is
intentionally NOT built and why (the W4A16 occupancy wall).
## The one structural gap: M-tile sizing
`mul_mat_q_case` / `launch_mul_mat_q` pick `mmq_x` (the token/M tile) by
**minimizing** `ntiles_x = ceil(ncols_max / mmq_x)` over the **aggregate** token
count (`ncols_max = ne12`). On Blackwell `get_mmq_x_max = 128`, so the heuristic
always selects the **largest** `mmq_x` that fits shared memory. vLLM's
CUTLASS/Triton fused_moe does the **opposite**: a small tuned `BLOCK_SIZE_M`
(typ. 16/32/64), padded **per expert**.
ggml then applies its over-large `mmq_x` **per expert**. In MoE decode the token count
per expert is tiny - Qwen3-30B-A3B top-8 of 128: at npl64 ~512 assignments over
~126 activated experts ~= 4 tok/expert; at npl128 ~1024 over ~128 ~= 8 tok/expert.
So each expert's single M-tile of width 128 is **3-6% filled** -> ragged tiny-M
tiles run a dense-GEMM-tuned config, wasting MMA M-throughput, and (with
`need_check`) every expert runs as a masked partial tail.
The FP4 MMA N-fragment (`tile_C::J`) is 8, so the **ideal M-tile ~= tokens/expert
(~8)**, 16x smaller than the 128 ggml picks. This mismatch is the durable gap.
Critically for GB10: at tokens/expert <= 8 there is exactly **one col-tile per
expert**, so a smaller `mmq_x` causes **no extra weight re-read** (weight rows are
re-read only across multiple col-tiles, of which there is one) while it **lowers
shared-mem footprint and raises occupancy** - strictly aligned with the GB10
occupancy lessons.
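
The fill percentages above come from simple division. A minimal sketch, assuming each token contributes `top_k` expert assignments spread evenly over the activated experts; the function name is illustrative.

```python
TOP_K = 8      # experts used per token (top-8 of 128)
MMQ_X = 128    # token-tile width the current heuristic picks on Blackwell

def tile_fill(npl: int, active_experts: int, top_k: int = TOP_K, mmq_x: int = MMQ_X):
    """Return (assignments, mean tokens per expert, fraction of the M-tile used)."""
    assignments = npl * top_k
    tokens_per_expert = assignments / active_experts
    return assignments, tokens_per_expert, tokens_per_expert / mmq_x

for npl, active in [(64, 126), (128, 128)]:
    total, per_expert, fill = tile_fill(npl, active)
    print(f"npl{npl}: {total} assignments, {per_expert:.1f} tok/expert, {fill:.1%} of a {MMQ_X}-wide tile")
```

The even-spread assumption gives the mean only; real routing is ragged, so individual experts can sit well below or above it.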
## What already exists (reuse, do NOT rebuild)
Engine files on DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`:
- **[A] moe_align / scatter** = `mmid.cu` `mm_ids_helper`. One CUDA block per
expert (`gridDim.x = n_experts`); warp counts tokens routed to this expert,
warp-scan for the compaction index, scatters into `ids_src1` (column gather
permutation, expert-sorted contiguous), `ids_dst` (output scatter), and writes
`expert_bounds[expert] = prefix start`, `expert_bounds[n_experts] = total`.
This **is** count -> cumsum -> permute; `expert_bounds` is the analogue of
vLLM's `num_tokens_post_padded` boundaries. No `-1` pad today because segments
are exact (not block-padded).
- **[B] persistent grouped FP4 GEMM** = `mmq.cuh` `mul_mat_q` stream-k
(kernel ~L3542, `process_tile` ~L3447, launch ~L3943, case-select ~L4055).
Single launch, fixed grid (`nsm` CTAs, or `ntiles` when >=90% tile efficiency).
Each CTA walks a contiguous `kbc` slice of (expert `zt` via `expert_bounds`,
col-tile `jt`, row-tile `it`, k-block) space; the weight row-tile (`mmq_y=128`
x K) is loaded once per col-tile in the `process_tile` k-loop; empty col-tiles
past `col_diff` are SKIPPED by advancing `kbc += blocks_per_ne00`; a
`stream_k_fixup` pass recombines split tiles.
- **[C] native FP4-MMA expert weights** = `block_fp4_mmq` + `MMQ_MMA_TILE_X_K_FP4`
(== Q8_1 tile, skew-pad +4) under `BLACKWELL_MMA_AVAILABLE`;
`quantize_mmq_fp4_cuda` quantizes activations to the q8-style y-layout **with
the `ids_src1` gather fused** (one pass, no separate row-copy).
Dispatch seam: `ggml-cuda.cu` `ggml_cuda_mul_mat_id()` (~L2632-2790). For mxfp4
with `ne2`(tokens) > 7, `should_use_mmq()` -> true -> `ggml_cuda_mul_mat_q()`
(`mmq.cu` id-branch ~L162-225) -> `mm_ids_helper` then ONE
`mul_mat_q_switch_type`. The per-expert host loop below it is the gated fallback.
(Below npl8, MXFP4 mmid routes through `mmvq` - `MMVQ_MAX_BATCH_SIZE=8`, mmid max
7 for turing_plus - which is fine for thin batch and out of scope here.)
## What to add (the durable delta, priority order)
### [1] Expert-aware M-tile selection (host-side only, zero new kernel)
In `mul_mat_q_case` / `launch_mul_mat_q`, when `ids != null`, choose `mmq_x` from
**per-expert density** (~`ne_get_rows / n_active_experts`, derivable cheaply, or
capped via env) instead of minimizing `ntiles` over aggregate `ncols_max`.
- `mmq_x` is a **compile-time template** (switch 8..128 step 8), so this is a pure
host-side SELECTION change - it picks a different already-compiled instantiation.
**Zero new kernel. Very low risk, high leverage.** Matches vLLM `BLOCK_SIZE_M`.
- Doubles as near-term lever-1: env-gated `LLAMA_MOE_MMQ_X` cap at the knee.
- GB10-aligned: smaller `mmq_x` -> smaller shared mem -> higher occupancy, and at
tokens/expert <= 8 (one col-tile/expert) it costs no extra weight read.
This is the single highest-leverage change and the seed of the durable port.
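
The difference between the two selection rules can be sketched on the host side. Everything here is a hedged illustration: both function names are hypothetical, the shared-memory fit check is omitted, ties in the aggregate rule are assumed to resolve to the first (smallest) tile reaching the minimum, and `cap` stands in for the proposed `LLAMA_MOE_MMQ_X` value (assumed to be a multiple of 8).

```python
import math

MMQ_X_STEP = 8     # instantiations exist for 8..128 in steps of 8
MMQ_X_MAX = 128    # Blackwell maximum quoted above

def mmq_x_aggregate(ncols_max: int) -> int:
    """Current rule as described: minimise ceil(ncols_max / mmq_x) over the aggregate count."""
    best, best_ntiles = MMQ_X_STEP, math.inf
    for mmq_x in range(MMQ_X_STEP, MMQ_X_MAX + 1, MMQ_X_STEP):
        ntiles = math.ceil(ncols_max / mmq_x)
        if ntiles < best_ntiles:
            best, best_ntiles = mmq_x, ntiles
    return best

def mmq_x_expert_aware(n_assignments: int, n_active_experts: int, cap=None) -> int:
    """Proposed rule: size the tile to the mean tokens per activated expert."""
    per_expert = math.ceil(n_assignments / max(n_active_experts, 1))
    mmq_x = math.ceil(per_expert / MMQ_X_STEP) * MMQ_X_STEP   # round up to a compiled width
    mmq_x = min(max(mmq_x, MMQ_X_STEP), MMQ_X_MAX)
    if cap is not None:
        mmq_x = min(mmq_x, cap)
    return mmq_x

print(mmq_x_aggregate(1024))            # aggregate rule at the npl128 assignment count
print(mmq_x_expert_aware(1024, 128))    # density rule for the same batch
```

Both rules return a width from the existing compiled set, which is what keeps the change a host-side selection rather than a new kernel.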
### [2] Block-padded moe_align (the moe_align_block_size port proper)
Extend `mm_ids_helper` to pad each expert segment up to a multiple of the chosen
block: write a sentinel (`-1`) `ids_dst` for pad lanes, put `expert_bounds` on
block boundaries. Then every col-tile is **full**, which:
- drops the `need_check` masking + per-expert partial-tail MMA,
- makes the stream-k `kbc` space exact (no skipped tiles, cleaner persistent
schedule), removing the `col_diff` skip branch.
Medium risk: touches the scatter, the `col_diff`/`need_check` logic, and the
`write_back` masking (pad rows must not write output). This is the proper
`moe_align_block_size` analogue and the durable second step.
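
Block-padding the sorted segments can be sketched as a post-pass over an expert-sorted index list. This is an illustrative CPU reference under stated assumptions: `pad_segments` and its arguments are hypothetical names, `-1` is the pad sentinel proposed above, and an expert with zero tokens is assumed to receive no tile at all.

```python
PAD = -1  # sentinel for pad lanes; pad rows must not write output

def pad_segments(sorted_ids, expert_bounds, block):
    """Pad each expert's segment up to a multiple of `block`.

    sorted_ids is expert-sorted; expert e owns
    sorted_ids[expert_bounds[e]:expert_bounds[e + 1]].
    Returns (padded_ids, padded_bounds) with every bound on a block boundary.
    """
    padded_ids, padded_bounds = [], [0]
    for e in range(len(expert_bounds) - 1):
        segment = sorted_ids[expert_bounds[e]:expert_bounds[e + 1]]
        n_pad = (-len(segment)) % block          # 0 when already aligned or empty
        padded_ids.extend(segment)
        padded_ids.extend([PAD] * n_pad)
        padded_bounds.append(len(padded_ids))
    return padded_ids, padded_bounds

ids, bounds = pad_segments([10, 11, 12, 13, 14, 15], [0, 2, 3, 6, 6], block=4)
print(ids)
print(bounds)
```

With every segment a whole number of blocks, each tile is full, so the consumer only needs to skip lanes whose id equals the sentinel on write-back.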
### [3] Bespoke masked-grouped FP4 kernel - ONLY if [1]+[2] insufficient
A CUTLASS/DeepGEMM-style masked-grouped FP4 kernel. **Largest risk, likely
unnecessary** given [B] is already a persistent stream-k grouped GEMM. Listed for
completeness; do not start without [1]+[2] measured as insufficient.
## Integration into ggml_mul_mat_id (dispatch seam + gated fallback)
- The seam is unchanged: `ggml_cuda_mul_mat_id()` -> `should_use_mmq()` ->
`ggml_cuda_mul_mat_q()`. [1] and [2] live entirely inside the mmq id-branch
(`mmq.cu` ~L162-225) and its callees (`mmq.cuh` selection/launch, `mmid.cu`
scatter). No change to the host dispatch decision.
- **Gated fallback preserved**: the existing per-expert host loop
(`should_use_mmq()==false` path) stays as-is for f16/bf16 experts and
non-Blackwell GPUs. The new selection only fires on the grouped path.
- **Env gates** (off = exact current behavior):
- `LLAMA_MOE_MMQ_X=<8..128>` - cap/override the token tile for the id-path
(lever-1 + [1] manual knob).
- `LLAMA_MOE_BLOCK_ALIGN=0|1` - enable block-padded scatter ([2]).
Default both off until parity + throughput proven, then flip [1]'s
auto-selection on by default.
## Correctness / parity gate
Primary: `tests/test-backend-ops.cpp` `test_mul_mat_id` (~L4181). The CPU
reference is **deterministic** - the op test must be **bit-exact**.
- Sweep `type_a` in {`MXFP4`, `NVFP4`}, `type_b = F32`, `n_mats = 128`,
`n_expert_used = 8`, `n_tokens` in {8, 32, 64, 128} (the decode-density band).
- **Add ragged small-M shapes** to the harness if absent (n_tokens not a multiple
of mmq_x; experts with 0/1/2 tokens) - these are exactly where [1]/[2] change
tile geometry and where block-pad masking can leak.
- Pass criterion: new `mmq_x` selection and padded-align produce dst **identical**
to current op-test output (op test is exact; the GB10 CUDA greedy-decode
non-determinism band applies only to end-to-end, never to the op test).
- End-to-end sanity: `llama-batched-bench` on `~/bench/qwen3coder-mxfp4.gguf`,
`-fa on -npp 128 -ntg 128`, npl 8/32/64/128/256; confirm `S_TG` stays monotonic
and `S_PP` flat ~3050-3090. Verify greedy-decode output within the documented
CUDA batch-shape non-determinism band (CPU is the deterministic oracle).
Bench/parity scripts stay **dev-tree-only** (`~/llama-paged-dev/benches/`).
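
The "`S_TG` stays monotonic" check in the end-to-end sanity step is easy to automate once the batched-bench numbers are collected. A small sketch, assuming the `S_TG` values have already been parsed into a list ordered by npl; the helper name is hypothetical and the parsing of `llama-batched-bench` output is not shown.

```python
def is_strictly_increasing(values):
    """True when every entry is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

# S_TG (t/s) figures quoted in the TL;DR.
current_head = [85, 278, 637, 950, 1306, 1771]   # npl 1/8/32/64/128/256
original_cliff = [253, 505, 830, 620]            # the earlier high-batch regression

print(is_strictly_increasing(current_head))
print(is_strictly_increasing(original_cliff))
```

Storing the reference curve alongside the check lets the same script flag a returning cliff as well as a general slowdown.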
## Phased plan, expected payoff, risk per phase
| Phase | Work | Expected payoff | Risk |
|-------|------|-----------------|------|
| **P0** harness | Add ragged small-M + MXFP4/NVFP4 mmid shapes to `test_mul_mat_id`; capture current bit-exact baseline + the monotonic batched-bench curve as the reference. | None (gate). Locks correctness + the 85->1771 t/s baseline so any regression is caught. | Low. |
| **P1** sort op | Confirm `mm_ids_helper` is the moe_align; if [2] is pursued, prototype the block-pad scatter behind `LLAMA_MOE_BLOCK_ALIGN`. | Enables exact stream-k schedule; removes `need_check` masking (P3 payoff). | Medium (scatter + write-back masking). |
| **P2** grouped GEMM ([1]) | Expert-aware `mmq_x` selection in `mul_mat_q_case`/launch, `LLAMA_MOE_MMQ_X` gate. | The headline: reclaim the 3-6% M-tile fill waste at npl64-128. Modeled as removing wasted MMA M-throughput on every activated expert; net throughput up at high batch with no extra weight read. | **Low** (host-side template selection, no new kernel). |
| **P3** tune ([2] + fixup) | Land block-padded align; tune `mmq_x` per density, profile stream-k `fixup` overhead and `mmq_x`/`mmq_y` tile choice with nsys on the grouped `mul_mat_q<MXFP4>` kernel. | Remove per-expert partial-tail MMA; tighten the persistent schedule. Diminishing vs P2; this is pure micro-efficiency toward/past vLLM's saturated grouped-GEMM. | Medium-high (kernel masking paths). |
**Honest payoff framing:** the npl128 "cliff" is already gone on HEAD, so there is
no broken path to unlock. The durable win is **matching vLLM's saturated
grouped-GEMM M-tiling** (small per-expert block) and erasing the dense-GEMM-tuned
M-tile mismatch - a micro-efficiency gain at large effective batch, not a
step-change. vLLM 0.23.0 cannot even serve this model on GB10 (bf16 MoE-warmup
hang + hard reboot; GGUF loader can't map fused qwen3moe experts), and llama
already uses the same sorted-grouped-GEMM algorithm, so structural parity is
**already met**; this closes the residual kernel micro-gap.
## The biggest risk: the GB10 W4A16 occupancy wall
The dominant risk is **repeating the W4A16 dead-end** that hit only ~9 TFLOPS /
178 t/s on GB10. GB10 is **occupancy-dominated**: deep `cp.async` pipelines and
XOR-swizzle shared layouts **collapse occupancy** there. Any P3 kernel work MUST:
- keep **small shared mem + high occupancy** (do NOT add deep `cp.async` stages
or XOR-swizzle - they are exactly what killed W4A16);
- preserve the **skew-pad (+4)** tile layout already in `MMQ_MMA_TILE_X_K_FP4`;
- stay on the **FP4-MMA path** (`block_fp4_mmq`), the only path that hits Blackwell
FP4 = 2x INT8/BF16 rate;
- respect the ~273 GB/s LPDDR5X weight-read floor (dense decode is already at it;
MoE wins come from occupancy/tile fit, not bandwidth).
Smaller `mmq_x` ([1]) is **strictly consistent** with these lessons: it reduces
shared-mem footprint, raises occupancy, and at tokens/expert <= 8 adds no weight
re-read. So the low-risk lever ([1]) is also the one most aligned with what GB10
rewards - which is why it leads the plan and [3] is gated behind it.
## Commit / hygiene
Scope doc only (this file). No engine change committed in this workflow. Bench and
parity scripts are dev-tree-only. Commit with `git commit -s`, trailer
`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes.
Do not push (human pushes). When [1]/[2] are implemented they mirror to
`backend/cpp/llama-cpp/patches/paged/0014-*` (next free slot).

@@ -1,71 +0,0 @@
# MOE_QUANT_DEDUP_RESULTS.md - patch 0023 (qwen35moe NVFP4 activation-quantize de-dup)
Bit-exact MoE decode/prefill lever. Built + measured on DGX GB10 (sm_121a) on top of HEAD
8a3229f (patch 0022). Companion analysis: NONRECURRENCE_BITEXACT.md (section "nonrec-build").
## What
ggml `mul_mat_id` quantizes the EXPERT-GATHERED activation rows: it allocates
`ne11_flat = ne12 * n_expert_used` rows and quantizes each via `quantize_mmq_nvfp4(..., ids_src1)`.
For the broadcast up/gate projections the activation is the per-token hidden state, the SAME for
every expert that token routes to (`ne11 == 1`). So the stock path re-quantizes each token
`n_expert_used` times (4x for q36-35b-a3b).
`quantize_mmq_nvfp4` computes each `block_fp4_mmq` as a pure per-thread function of its 16
consecutive inputs (per-thread amax, the +/-2 ue4m3 search, the e2m1 packing - NO cross-thread
shfl/reduction). So the quantized block for a given token is byte-identical no matter which
expert slot it lands in.
## Lever
When `ne11 == 1` (broadcast up/gate):
1. Quantize the `ne12` UNIQUE token activations once into a compact buffer
(`quantize_mmq_fp4_cuda(src1_d, nullptr, ..., ne12, 1, 1)`, row stride `s12`).
2. Gather the `block_fp4_mmq` rows into the expert-gathered layout keyed by `ids_src1`
(`gather_mmq_fp4`): `block_fp4_mmq == 9 * uint4 == 144 B`, copied with a coalesced uint4
kernel whose output is written fully contiguously (`gathered[t] = unique[ib_u*9 + w]`).
Pure byte copy of identical blocks => the gathered buffer is byte-for-byte identical to
re-quantizing each gathered row. The MMQ GEMM is UNTOUCHED. `down_proj`
(`ne11 == n_expert_used`, distinct per expert) keeps the stock re-quantize path.
The first gather draft (one thread copies one 144 B struct, scattered) was uncoalesced and cost
478 ms - it ate 84% of the quantize saving and decode stayed flat. The shipped coalesced-uint4
gather costs 32 ms.
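
Why quantize-once-then-gather is byte-identical can be shown with a toy model. This is a sketch under stated assumptions: `toy_quantize_block` is a hypothetical stand-in (a scale from the block amax plus rounded 4-bit codes), not the real NVFP4 math; the only property it shares with `quantize_mmq_nvfp4` is the one the argument needs, namely that each block is a pure function of its own 16 inputs.

```python
import struct

BLOCK = 16  # elements per quantization sub-block, as in the text


def toy_quantize_block(values):
    """Hypothetical per-block quantizer: depends only on its 16 inputs."""
    amax = max(abs(v) for v in values)
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = bytes((int(round(v / scale)) + 8) & 0xF for v in values)
    return struct.pack("<f", scale) + codes


def quantize_row(row):
    """Quantize one activation row block by block."""
    return b"".join(toy_quantize_block(row[i:i + BLOCK])
                    for i in range(0, len(row), BLOCK))


def stock_path(tokens, ids):
    """Re-quantize every gathered row (one per token/expert slot)."""
    return [quantize_row(tokens[t]) for t in ids]


def dedup_path(tokens, ids):
    """Quantize each unique token once, then do a pure byte gather."""
    unique = [quantize_row(row) for row in tokens]
    return [unique[t] for t in ids]
```

Here `ids` plays the role of `ids_src1`: the token index for each gathered row. With 4 slots per token, `stock_path` quantizes 4x as many rows as `dedup_path` yet returns the same bytes.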
## Measurements (MoE = q36-35b-a3b-nvfp4, dense = q36-27b-nvfp4; -fa on, -npp 128 -ntg 128)
nsys decode-isolated (`--cuda-graph-trace=node`, npp8 ntg128 npl128), per-run kernel sums:
| kernel | dedup off | dedup on | note |
|-----------------------|-----------|----------|------|
| quantize_mmq_nvfp4 | 868 ms | 457 ms | |
| gather_mmq_fp4 | - | 32 ms | |
| net quantize path | 868 ms | 489 ms | -379 ms decode GPU-time |
| gated_delta_net (50%) | unchanged | unchanged | |
| mul_mat_q<NVFP4> | unchanged | unchanged | |
Decode S_TG (t/s), back-to-back same-build A/B (default-on vs GGML_CUDA_MOE_QUANT_DEDUP=0):
| model | npl32 off->on | npl128 off->on |
|-----------------|------------------|-----------------------|
| MoE q36-35b-a3b | 440.3 -> 442.8 (+0.6%) | 745.2 -> 758.1 (+1.73%) |
| dense q36-27b | 207.4 -> 206.9 (flat) | 373.28 -> 373.24 (byte-flat) |
Prefill: MoE T_PP 7.69 -> 7.38 s (~ -4% time). Dense unaffected (no `mul_mat_id`).
## Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022)
| model | md5 (default on) | == 0022 |
|------------------|--------------------------------------|---------|
| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | yes (dense untouched) |
| q36-35b-a3b-nvfp4| 07db32c2bcb78d17a43ed18bc22705cd | yes (on == off == 0022) |
test-backend-ops: MUL_MAT 1115/1115, MUL_MAT_ID 805/805 (default on).
## Knob
On by default. `GGML_CUDA_MOE_QUANT_DEDUP=0` restores the stock per-expert re-quantize path
(byte-identical output, used as the A/B baseline).
Commits: DGX dev tree f7409c2; worktree patch `0023-qwen35moe-nvfp4-quant-dedup.patch`.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,99 +0,0 @@
# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)
Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
`build-cuda` sm_121.
## Headline (honest): there is no npl128 cliff to erase on this build
The mission premise was a 25% decode drop at npl128 (batched-bench 253/505/830/620
@ npl 8/32/64/128). It does **not** reproduce. Stock decode is monotonic:
```
llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
npl       1     8    32    64   128   256
stock    85   282   629   935  1295  1779   <- monotonic, no knee
```
The old cliff was a real high-batch regression since fixed upstream: mxfp4 MoE
decode on GB10 already takes the sorted grouped FP4-MMA GEMM (MUL_MAT_ID ->
`ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper` moe_align/scatter + one
persistent stream-k `mul_mat_q`), i.e. vLLM's algorithm. See
`MOE_GROUPED_GEMM_SCOPE.md`.
## What the knob does
`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
(top-8/128 => ~1/16 of `ne12`), so each expert's `mmq_x`-wide col-tile is only
~6% filled: the MMA accumulator tile is `mmq_x`-wide at compile time and wastes
throughput on the padding columns, and the larger y-tile lowers occupancy.
`LLAMA_MOE_MMQ_X=<n>` caps `mmq_x` on the MUL_MAT_ID path only
(`expert_bounds != nullptr`). It only lowers the selection-loop upper bound and
still chooses from the same granularity/shared-memory-validated `mmq_x` set stock
already uses for smaller batches - no new kernel configuration. Default
(unset/<=0) = disabled => byte-identical to stock.
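
The fill arithmetic above, as a small sketch. It assumes perfectly uniform routing (real routing is ragged); the function names are hypothetical.

```python
def tokens_per_expert(n_tokens, top_k, n_experts):
    """Average tokens routed to each expert under uniform routing."""
    return n_tokens * top_k / n_experts


def tile_fill(n_tokens, top_k, n_experts, mmq_x):
    """Fraction of an mmq_x-wide column tile holding real tokens."""
    return min(1.0, tokens_per_expert(n_tokens, top_k, n_experts) / mmq_x)


# Decode at ne12 = 128, top-8 of 128 experts.
decode_density = tokens_per_expert(128, 8, 128)   # 8 tokens per expert
fill_at_128 = tile_fill(128, 8, 128, mmq_x=128)   # 1/16 of the tile
fill_at_64 = tile_fill(128, 8, 128, mmq_x=64)     # 1/8 of the tile
```

Under this model, halving `mmq_x` doubles the filled fraction of each per-expert tile, which is the effect the cap is after.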
## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)
Decode throughput, S_TG t/s:
```
npl   stock   cap16   cap32   cap64
  1      85      85      85      85
  8     282     280     282     282
 32     629     623     629     628
 64     935     915     949     934
128    1295    1204    1344    1357   <- cap64 +4.8% (cap16 -7%)
256    1779    1370    1723    1820   <- cap64 +2.3% (cap16 -23%)
```
Prefill throughput, S_PP t/s (the cost):
```
npl     stock   cap16   cap32   cap64
128      3083    1817    2559    3038
256      3084    1818    2560    3046
delta            -41%    -17%   -1.3%
```
Reproducibility (interleaved off/cap64, two reps each):
```
npl   off rep1/rep2   cap64 rep1/rep2
128   1300 / 1290     1357.5 / 1357.0
256   1786 / 1782     1826.3 / 1824.5
```
cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
## Why 64 is the only value that helps net
A 512-token prefill ubatch routes ~32 tokens/expert. cap16/cap32 force those into
16/32-wide tiles, overflowing into extra col-tiles + weight re-reads -> prefill
craters (-41% / -17%). cap64 still holds the prefill density in one tile (32 < 64)
so prefill is near-neutral (-1.3%), while decode (~8 tokens/expert at npl128) gets
the fuller, higher-occupancy tile.
## Verdict
- Real but **modest** high-effective-batch DECODE micro-optimization
(+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
continuous batching already scales). Shipped as an opt-in, default-off knob;
recommended value 64 for decode-heavy high-concurrency deployments.
- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
for single-stream generation and stays coherent; thousands of capped MoE
matmuls at npl128/256 ran with no CUDA error / NaN.
## Durable follow-up (scoped, not implemented)
Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
small tile while prefill keeps its large tile automatically (removes the ~1.3%
prefill cost). Plus the block-padded `moe_align` in `mm_ids_helper`. See
`MOE_GROUPED_GEMM_SCOPE.md`.

@@ -1,323 +0,0 @@
# NONRECURRENCE_BITEXACT.md - bit-exact non-recurrence decode levers (label nonrec-design, READ-ONLY, no GPU)
Post-0022 the gated-DeltaNet recurrence is at 84.6% BW = 102.6% of vLLM (3.488 ms/call), past parity.
The remaining ~5% to vLLM lives in the non-recurrence path. Per the node-level decode trace (nsys
`--cuda-graph-trace=node`, clean build, q36-27b-nvfp4 dense, npl128) the decode step is ONE replayed
CUDA graph, ALL kernels on a SINGLE stream (stream 14), strictly serial, 99.94% GPU-busy, 0.06% idle.
That single-stream-99.94%-busy fact is load-bearing for everything below: there is NO overlap, so any
kernel GPU-time genuinely removed (or any kernel folded away) cuts wall-clock 1:1; and conversely, if a
"faster kernel" leaves wall-clock flat, then the kernel did NOT actually get faster at the decode shape.
Post-recurrence-fix kernel mix of the ~367 ms decode step (was 380.4 pre-0022; recurrence now smaller):
- `mul_mat_q` FP4 GEMM (496 calls/step) ~24% (the biggest non-recurrence bucket)
- `quantize_mmq_nvfp4` (496/step) ~4.5%
- `nvjet` lm_head GEMM ~3.1%
- `flash_attn_ext_f16` (16 attn layers) ~3.1%
- elementwise glue: k_bin_bcast (gate mul+add) ~1.7%, unary_gated silu/sigmoid ~1.4%, rms_norm ~0.9%,
l2_norm ~0.2%, plus conv-state concat_cont/cpy (Lever-1 territory, not in this scope).
Files read on the DGX 0022 tree (HEAD 8a3229f): `mmq.cuh`, `mmq.cu`, `quantize.cu`, `gated_delta_net.cu`,
`fattn.cu`, `fattn-common.cuh`.
---
## RESOLUTION of the P2a puzzle (load-bearing) - mmq_y=64 / minblocks: bit-exact but FLAT on decode
The existing P2a machinery is two NVFP4-gated, default-stock flags in `mmq.cuh`:
- `GGML_CUDA_FP4_MMQ_Y` (L143-163): overrides the weight-row N-tile `mmq_y` 128 -> 64/96 for NVFP4 on
Blackwell. mmq_y tiles N (output rows); each weight row lives in exactly one row-tile, so total weight
traffic is unchanged. **Bit-exact**: the per-output K-reduction is the `for frag` loop in
`vec_dot_fp4_fp4_mma` (L1097-1108, `sum[...] += C.x[l]`), whose order is independent of mmq_y. md5-
verified in prior runs (1115/805 gate, byte-identical).
- `GGML_CUDA_FP4_MINBLOCKS` (L205-216): raises the `__launch_bounds__` min-blocks operand (L3579-3585)
for NVFP4 so >1 CTA co-resides per SM. **Bit-exact**: register allocation / occupancy cannot change
results.
The paradox restated: P2a made a standalone `mul_mat_q<NVFP4,m=128>` -24.7% faster (bit-exact), yet
decode was FLAT (335->336 post-0020). The trace says decode is 99.94% single-stream busy and mul_mat_q
is ~24% of it, so a -24.7% cut should give ~+6%. RESOLUTION (airtight, from the single-stream fact):
> On a 99.94%-busy single stream, freed kernel GPU-time MUST lower the wall 1:1. Decode is flat =>
> mmq_y=64 did NOT free per-call GPU-time at the DECODE shapes => the -24.7% was measured at a
> NON-decode shape (a single large-N or prefill-M GEMM that runs enough waves to reach asymptotic
> throughput). There is no contradiction; the two measurements are at different GEMM shapes.
Mechanism (grounded in the launch path, `launch_mul_mat_q` L3989-4088): decode runs ONE `mul_mat_q` per
weight with mmq_x=128 fused tokens => ntx=1, and the grid is `nty = N / mmq_y` CTAs (xy-tiling, or
stream-k at nsm=48 when `tiles_efficiency_percent < 90`, L4044-4047). The 496 decode GEMMs have small N:
- FFN up/gate N=17408 -> nty=136 CTAs (mmq_y=128) = ceil(136/48)=3 waves, last wave 40/48=83% full
- FFN down / qkv / o-proj N~5120-6144 -> nty=40-48 CTAs = 1 wave (and eff<90 => stream-k at 48 CTAs)
So EVERY decode GEMM is a 1-3 wave, 40-136 CTA kernel: it is **ramp + tail (wave-quantization) bound**,
dominated by the first-wave weight-load latency before any MMA can start plus the fractional last wave -
NOT by steady-state occupancy. mmq_y=64 doubles the grid (272 CTAs, 6 waves for the fat FFN) which only
helps the ASYMPTOTIC achieved-BW the microbench measures; at 1-3 waves there is no steady state for it
to act over, and each CTA now carries half the arithmetic-per-weight-load so the ramp is relatively MORE
exposed. minblocks=2 is worse: the FP4 MMA is register-bound at ~255 regs/thread (the `(256,1)` bound),
so forcing 2 CTAs/SM register-caps to ~128 regs => heavy spill => net-negative. Both are the in-wave
occupancy lever, and the decode GEMM has no in-wave occupancy problem - it has a too-few-waves problem.
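
The wave counts quoted above follow from simple grid arithmetic. A minimal sketch, assuming one resident CTA per SM per wave on 48 SMs and a grid of `N / mmq_y` CTAs (ntx=1); `wave_stats` is a hypothetical helper, not a function in the tree.

```python
import math


def wave_stats(n_rows, mmq_y, n_sm=48):
    """Return (ctas, waves, last_wave_fill) for an N-row GEMM tiled by mmq_y."""
    ctas = math.ceil(n_rows / mmq_y)
    waves = math.ceil(ctas / n_sm)
    last_wave_ctas = ctas - (waves - 1) * n_sm
    return ctas, waves, last_wave_ctas / n_sm


fat_ffn_128 = wave_stats(17408, 128)  # (136, 3, 40/48)
fat_ffn_64 = wave_stats(17408, 64)    # (272, 6, 32/48)
thin_lo = wave_stats(5120, 128)       # (40, 1, 40/48)
thin_hi = wave_stats(6144, 128)       # (48, 1, 1.0)
```

Dropping `mmq_y` to 64 only adds waves for the fat FFN; the thin GEMMs go from one wave to two, so there is still no steady state for the occupancy lever to act over.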
VERDICT: re-test P2a (mmq_y=64, and 96) and minblocks=2 ON TOP of 0022 because it is a FREE one-build
re-test (flags already exist, default stock). **Design prediction: still ~flat (maybe +1-2% from the
one fat-FFN N=17408 GEMM that has 3->6 waves of room; ~0% from the 1-wave thin GEMMs).** The decisive
measurement for the reprofile agent is NOT a standalone microbench - it is the PER-CALL `mul_mat_q`
GPU-time at the REAL decode shapes (the 496 calls), flag on vs off, summed. If per-call decode time
drops, it ships (free bit-exact win). If per-call decode time is ~unchanged (predicted), the -24.7%
was a large-N artifact and the GEMM has no bit-exact occupancy lever - confirming the structural wall.
WHY the decode GEMM has no high-value bit-exact lever: its bottleneck is wave-quantization at a small
grid. The only knobs that change the grid are (a) mmq_y-down [bit-exact, flat per above], (b) mmq_x-down
[FORBIDDEN: re-reads the 18 GB weights ntiles_x times, strictly worse, and pins one-read], (c) the
stream-k-vs-tiling threshold [FORBIDDEN for bit-exactness: stream-k splits each output tile's K-sum
across CTAs and re-adds via the fixup kernel - a DIFFERENT K-accumulation order than one-CTA-full-K
tiling, so flipping the L4047 threshold changes which path a GEMM takes and breaks md5 vs the 0022
baseline]. So at the bandwidth/wave-quant floor for these tiny grids, 3% FP4 efficiency is structural;
no order-preserving change moves it.
---
## RANKED bit-exact non-recurrence levers
Ranked by expected bit-exact decode gain. "Bit-exact-safe" = keeps the exact reduction/FMA order; the
gate is md5-identity to llama 0022 f32 output on both models (dense + MoE), greedy temp0.
### 1. Quantize producer-fold (Track A) - bit-exact-safe - ceiling 4.5%, realistic ~2-2.5%
Fold `quantize_mmq_nvfp4` (4.5%, ~17 ms, 496/step) into the PRODUCER epilogue (the rms_norm / silu that
emits each GEMM's activation), so the f32 activation is quantized to `block_fp4_mmq` directly from the
producer's registers instead of being written to HBM as f32 and re-read by a standalone quantize kernel.
- **Bit-exactness: SAFE, and unusually clean.** `quantize_mmq_nvfp4` (quantize.cu:78-171) computes
`amax_raw` PER-THREAD over the thread's own QK_NVFP4_SUB=16 values (L108-118) with NO cross-thread
shfl/reduction (unlike `quantize_mmq_q8_1` which does a warp shfl_xor). Each thread independently runs
the +/-2 ue4m3 scale search (L120-150) and `ggml_cuda_float_to_fp4_e2m1` packing (L155-166). So the
output block is a pure per-thread function of its 16 inputs. Copy that arithmetic VERBATIM into the
producer epilogue and the `block_fp4_mmq` bytes are identical => md5-safe. The only requirement is the
producer thread-layout owns contiguous 16-element K-sub-blocks (feasible for an rms_norm/silu epilogue).
- **Expected gain:** the win is removing the standalone kernel's f32 activation READ (the producer already
holds the f32); the quant compute + fp4 write still happen (now folded). So ~the read-half of the 17 ms,
~2-2.5% of the step, and it is REAL because the step is single-stream 99.94% busy (no overlap to hide
the removed kernel).
- **Trap / caveat:** the SPENT "Lever-2" was a DIFFERENT fusion (quantize -> GEMM *consumer* prologue,
measured net-zero because the GEMM still reads the same activation bytes). Track A is the *producer*
fold and removes a true f32 round-trip, so it is not subject to that flatness - but it needs real
producer-kernel surgery + the frozen `block_fp4_mmq` ABI (mmq.cuh:53), more plumbing than the others.
- Ranked #1: largest cleanly-bit-exact non-GEMM bucket, no reduction trap (per-thread quant).
### 2. Activation / op fold - POINTWISE subset only - bit-exact-safe - realistic ~1.5-2.5%
Fold the pure pointwise glue off the single-stream chain into the adjacent kernel's epilogue/prologue:
the GDN residual ADDs and gate MULs (`k_bin_bcast`, ~1.7%), the `silu`/`sigmoid` (`unary_gated`, ~1.4%,
the part that is the output gate, not FFN), and the post-GDN gate MUL after the output rms_norm.
- **Bit-exactness: SAFE for the pointwise ops only.** Add/mul/silu/sigmoid are elementwise fp32 with the
same formula and the same op order whether standalone or folded => byte-identical. This is the bit-exact
half of the prior Lever-3 design.
- **THE TRAP (FORBIDDEN half):** the `rms_norm`/`l2_norm` REDUCTIONS must NOT be re-folded with a
different reduction tree. The standalone `l2_norm_f32<32>`/`rms_norm_f32` use a specific warp/block
reduction; folding the norm into a kernel with a different `warp_reduce_sum` width or eps placement
(`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) changes the last ULP => breaks md5. Fold the MUL that
FOLLOWS the norm (pointwise, safe); do NOT fold the norm's reduction. (This is the direct analog of the
f32x4 lane-remap trap that blocked the recurrence's vectorized state loads: any change to a reduction's
grouping is forbidden.)
- **Expected gain:** ceiling ~3.3% (the Lever-3 slice), realistic ~1.5-2.5% once the norm reductions are
excluded. Real (single-stream, no overlap), bounded, lower plumbing than #1 (no new ABI).
- Ranked #2: smaller than #1 and the high-value pieces (norms) are off-limits.
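
Both halves of the trap can be reproduced in a few lines. This is an illustrative sketch only: it uses Python float64 rather than the kernels' f32, and the two summation orders are generic stand-ins for "a different reduction tree", not the actual warp/block reductions.

```python
import math


def sum_sequential(xs):
    """Left-to-right accumulation."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def sum_pairwise(xs):
    """Tree reduction: same inputs, different grouping."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return sum_pairwise(xs[:mid]) + sum_pairwise(xs[mid:])


def scale_rsqrt(sumsq, eps):
    """Scale factor for x * rsqrt(sumsq + eps)."""
    return 1.0 / math.sqrt(sumsq + eps)


def scale_max(sumsq, eps):
    """Scale factor for x / max(sqrt(sumsq), eps)."""
    return 1.0 / max(math.sqrt(sumsq), eps)


xs = [1e16, 1.0, -1e16, 1.0]
order_differs = sum_sequential(xs) != sum_pairwise(xs)
eps_differs = scale_rsqrt(4.0, 1e-6) != scale_max(4.0, 1e-6)
```

Either difference is enough to flip a last bit downstream, which is why only the pointwise MUL after the norm is a safe fold.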
### 3. mul_mat_q occupancy retune (existing P2a: mmq_y=64/96, minblocks=2) - bit-exact-safe - ~FLAT
See the P2a resolution above. Bit-exact-safe (N-tiling / register-cap preserve the K-reduction order;
md5-verified). Design prediction FLAT on decode (decode GEMMs are 40-136 CTA, 1-3 wave, ramp/tail-bound;
the -24.7% was an asymptotic large-N number). **Worth the one-build re-test only because it is free**
(flags exist, default stock). Possible marginal +1-2% from the single N=17408 fat-FFN GEMM (3->6 waves).
Measure PER-CALL decode-shape `mul_mat_q` time, not a microbench. Ranked #3: zero plumbing, but low/zero
expected gain - it is the diagnostic that confirms the GEMM wall is structural, not a shippable lever.
### 4. Attention occupancy (flash_attn_ext_f16) - NO bit-exact lever - NO-GO
`flash_attn_ext_f16` is ~3.1% (11.67 ms, 16 attn layers), grid 48 CTAs = exactly ONE full wave on 48
SMs (trace). There is no occupancy headroom (already 1 wave, perfectly filled, no tail) and no in-wave
under-occupancy to fix. The only knobs that change the attention grid are split-KV / parallel_blocks /
a different KV-tile (the `ncols1`/`ncols2`/`cols_per_block` selection in `fattn.cu`), and EVERY one of
them changes the online-softmax running-max/sum RESCALING ORDER across KV blocks => NOT bit-exact
(forbidden, the softmax-rescale analog of the reduction-tree trap). At 3.1% with one full wave the
attention is effectively at floor. Ranked last: no bit-exact lever exists; do not pursue.
---
## FORBIDDEN levers (require a precision or accumulation-order change - excluded by the gate)
- Stream-k vs plain-tiling threshold flip for the GEMM wave-quant tail: splits + re-adds the K-sum across
CTAs => different f32 accumulation order than one-CTA-full-K tiling => breaks md5.
- Vectorized / lane-remapped tile loads in the GEMM (`load_tiles_nvfp4_nvfp4` / `load_ldmatrix`): any
remap of which lane holds which K-element changes the MMA fragment->accumulator mapping => can change
the per-output sum grouping => forbidden (the f32x4 lane-remap trap, same class that blocked the
recurrence's vectorized state loads).
- mmq_x-down at dense decode: re-reads the 18 GB weights `ntiles_x` times. Order-preserving but strictly
slower and breaks the one-read invariant; not a lever.
- Folding rms_norm / l2_norm with a different reduction tree or eps placement: last-ULP change => md5 break.
- flash-attn split-KV / KV-retile: changes the online-softmax rescale order => not bit-exact.
- bf16 state / bf16 anything: precision change, SHELVED, forbidden by the gate.
---
## One-line summary for the parent
The remaining non-recurrence decode gap has NO single big bit-exact lever. The largest cleanly bit-exact
win is the **quantize producer-fold (Track A, ~2-2.5%, the per-16 NVFP4 quant has no cross-thread
reduction so it copies verbatim into the rms_norm/silu epilogue)**; second is the **pointwise activation
fold (~1.5-2.5%, fold the residual adds / gate muls / silu but NOT the norm reductions)**; the
**mul_mat_q occupancy retune (P2a mmq_y/minblocks) is bit-exact but predicted FLAT** (decode GEMMs are
small-grid wave-quant/ramp-bound, so the -24.7% asymptotic number does not apply per-call - confirmed by
the airtight single-stream-99.94%-busy logic, re-test only because the flag is free); and **attention has
NO bit-exact lever** (already one full wave; every grid knob changes the softmax rescale order). The
P2a puzzle is resolved: not a contradiction - the -24.7% and the flat decode are simply at different GEMM
shapes (large-N asymptotic vs 1-3-wave decode per-call).
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# EMPIRICAL P2a RE-TEST ON 0022 (label reprofile-puzzle, GPU agent) - measured, build + bench + nsys
The design section above PREDICTED P2a flat from the single-stream logic. This section is the GPU
measurement that CONFIRMS it byte-for-byte, plus one load-bearing correction: an early "+11% decode"
A/B was a STALE-BASELINE artifact, not the flag. Box: DGX GB10 (sm_121a), HEAD 8a3229f (patch 0022),
SM+MEM clock pinned 2190 MHz (verified via `nvidia-smi dmon`, identical base vs flag - NOT a clock story).
## (1) Fresh node-level decode decomposition (nsys --cuda-graph-trace=node, dense q36-27b-nvfp4, npl128)
Per-instance trace windowed to one steady decode step (103 steady steps, step = 48 GDN-layer boundaries):
Committed-default build (build-cuda-base, 336 t/s @128) -- step span 383.1 ms, kernel-busy 99.24-99.30%:

| kernel | ms/step | share | note |
|--------|---------|-------|------|
| gated_delta_net (SSM recurrence) | 193.97 | 51.0% | BINDING KERNEL |
| mul_mat_q<NVFP4,m=128,nc=0> | 93.64 | 24.6% | the P2a target |
| quantize_mmq_nvfp4 | 16.77 | 4.4% | |
| nvjet (cublas lm_head GEMM) | 12.07 | 3.2% | |
| flash_attn_ext_f16 | 11.69 | 3.1% | |

Remaining kernels (ms/step): concat_cont 8.14, cpy_scalar 7.49, k_get_rows 7.29, ssm_conv 6.55, silu 5.32, k_bin_bcast 4.67,
mul_mat_q_stream_k_fixup 3.95, rms_norm 3.56, ... ; SUM 380.1 ms = 99.24% of the 383.1 ms wall.
conv-inplace + GDN(16,8) build (the 374 t/s state) -- step span 345.3 ms, kernel-busy 99.0%:
gated_delta_net 167.99 (49.2%), mul_mat_q<NVFP4,128,0> 93.79 (27.5%), quantize 17.66 (5.2%),
nvjet 12.05 (3.5%), flash_attn 11.66 (3.4%), ssm_conv(fused update) 8.44 (2.5%), k_get_rows 7.32 (2.1%).
BINDING KERNEL = gated_delta_net (~49-51% of the step) in BOTH; mul_mat_q<NVFP4,m=128> is #2 (~25-27.5%).
Decode is ~99.0-99.3% GPU-busy single-stream (consistent with the 99.94% figure in the design section; ~0 idle, strictly serial).
## (2) P2a A/B - the -DGGML_CUDA_FP4_MMQ_Y=64 nwarps-remap, re-applied + built + bit-exact-gated on 0022
The committed 0022 machinery was PARTIAL (patch 0017 templated get_mmq_y_device<type> but left
mmq_get_nwarps_device() stock -> mmq_y=64 + nwarps=8 fails static_assert nwarps*tile_C::I==mmq_y at
mmq.cuh:3280). Re-derived the full threading: templated mmq_get_nwarps_device<type>() -> mmq_y/16 (=4)
for NVFP4+flag; type-aware mmq_get_nwarps_host(...,type); threaded <type> through the NVFP4 loader (998),
write_back_mma (3266), process_tile (3500), mul_mat_q launch_bounds (3579/83/85) + body (3602),
stream_k_fixup launch_bounds (3832) + body (3843), 2 host launch sites (3994/4172). Reverted after.
cuobjdump proof the flag took effect: mul_mat_q<NVFP4,m=128,nc=0> STACK 112 -> 56 (256-thr/8-warp CTA
-> 128-thr/4-warp CTA => 1 -> 2 resident CTAs/SM). REG 255 (HW-capped), no new spill.
BIT-EXACT GATE (HELD): test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805; greedy md5 base==flag
IDENTICAL = 5951a5b4d624ce891e22ab5fca9bc439 (matches the prior P2a gate hash). Byte-identical output.
CLEAN A/B (same build dir, ONLY mmq.cuh toggled => non-mmq .o byte-identical; back-to-back, pinned clocks),
S_TG t/s, llama-batched-bench -fa on -npp128 -ntg128:

| model | npl 32 | npl 128 | verdict |
|-------|--------|---------|---------|
| DENSE q36-27b | 208.02 -> 207.51 (-0.2%) | 374.30 -> 373.19 (-0.3%) | FLAT |
| MoE q36-35b-a3b | 438.83 -> 439.30 (+0.1%) | 745.71 -> 745.07 (-0.1%) | FLAT |

Prefill S_PP also flat at 0022 (npp128 1056->1050; npp2048/npl1 1028.85->1024.19).
## (3) RESOLUTION - why FLAT, where the GEMM time goes, and a correction to the prior "-24.7%->+6%" logic
Decode-isolated per-kernel A/B (node trace, same-source toggle, identical non-mmq code), ms/step:

| kernel | base | flag | note |
|--------|------|------|------|
| gated_delta_net | 167.99 | 167.89 | IDENTICAL - it is byte-identical code, untouched |
| mul_mat_q<NVFP4,128,0> | 93.79 | 92.74 | -1.1%, FLAT; the P2a target, decode shape |
| mul_mat_q_stream_k_fixup | 3.96 | 5.65 | +1.7 ms, REGRESSES at nwarps/2=2 |

=> the decode mmq FAMILY is flat-to-slightly-WORSE; the flag delivers ~nothing at the m=128 decode shape.
The "-24.7%" is REAL but it is a PREFILL-shape number. Full-run aggregate (npp128 ntg128, prefill+decode)
mul_mat_q<NVFP4,128>: 19630 -> 17569 ms = -10.5%; subtracting the flat decode portion (93.8x128 vs
92.7x128) leaves the prefill-shape portion at 7625 -> 5699 ms = -25.3% (matches the prior -24.7%). So the
occupancy lever genuinely cuts the COMPUTE/occupancy-bound prefill-shape GEMM ~25%, and ~0 of the
BANDWIDTH-bound m=128 decode-shape GEMM (it reads the full NVFP4 weight matrix from 273 GB/s LPDDR5x; the
mmq_y knob is deliberately bandwidth-neutral - every weight row still read once - so it cannot move a
bandwidth-bound wall). Confirmed at the SOURCE-of-decode level, not inferred.
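
The subtraction that isolates the prefill-shape portion can be checked directly. A sketch using the per-step decode figures from the A/B above and assuming 128 decode steps per run (ntg 128); `prefill_portion` is a hypothetical helper name.

```python
def prefill_portion(total_ms, decode_ms_per_step, n_steps=128):
    """Full-run mul_mat_q time minus the decode-shape share."""
    return total_ms - decode_ms_per_step * n_steps


base_prefill = prefill_portion(19630, 93.79)   # ~7625 ms
flag_prefill = prefill_portion(17569, 92.74)   # ~5698 ms

total_change = (17569 - 19630) / 19630                           # ~-10.5%
prefill_change = (flag_prefill - base_prefill) / base_prefill    # ~-25.3%
decode_change = (92.74 - 93.79) / 93.79                          # ~-1.1%
```

The full-run -10.5% is a blend of a ~-25% prefill-shape cut and a ~-1% decode-shape cut, which is why a whole-run kernel sum overstates the decode benefit.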
Reconciling with "99.94% busy single stream => a -24.7% cut should give ~+6%": the PREMISE is false. The
flag does NOT cut the decode mul_mat_q by 24.7% (it cuts it 1.1%). There is therefore NO freed time on the
99% busy stream - so the "where does the freed time go (idle gaps?)" question is moot: no time is freed at
the decode shape. The contradiction dissolves: mul_mat_q IS on the critical path AND single-stream-busy, but
the lever simply doesn't accelerate the decode-shape invocation. (Net it slightly hurts via stream_k_fixup.)
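
The expectation and the measurement can be put side by side with a serial-chain (Amdahl-style) estimate. A minimal sketch, assuming a fully serial step so that a kernel's time share converts 1:1 into wall-clock; the step span 345.3 ms is the conv-inplace + GDN(16,8) figure quoted earlier, and `throughput_gain` is a hypothetical helper.

```python
def throughput_gain(share, kernel_cut):
    """Throughput gain when a kernel holding `share` of a serial step
    gets `kernel_cut` (fractional) faster."""
    return 1.0 / (1.0 - share * kernel_cut) - 1.0


# What the design section expected IF decode mul_mat_q were cut by 24.7%.
expected = throughput_gain(0.24, 0.247)            # ~+6.3%

# What the measured decode-shape cut (93.79 -> 92.74 ms/step) can buy.
measured_share = 93.79 / 345.3
measured_cut = (93.79 - 92.74) / 93.79
achievable = throughput_gain(measured_share, measured_cut)   # ~+0.3%

# mul_mat_q plus stream_k_fixup together, flag minus base, ms/step.
family_delta_ms = (92.74 + 5.65) - (93.79 + 3.96)            # ~+0.64
```

Under this model the measured kernel delta is worth about +0.3% at best, and the fixup regression more than cancels it, consistent with the flat-to-slightly-negative S_TG numbers.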
CORRECTION to an earlier in-session A/B (recorded so the parent does not chase it): a first pass showed
build-cuda-base 334.6 -> "flag" 372 (+11%). That was a STALE-BASELINE artifact, NOT the flag. build-cuda-base
(binaries 18:46) was compiled from a pre-0021 source - it has NO ssm_conv_update_f32 (cuobjdump symbol count
0 vs 4 in the 0022 build) and the un-retuned GDN default (gated_delta_net 194 vs 168 ms/step). Those ~40 ms
of non-mmq differences (conv fuse ~14 ms + GDN ~26 ms) are the entire 334.6->373 gap. With a correct
same-source baseline (toggle ONLY mmq.cuh in one build dir) the flag is flat (373.19 vs 374.30). Lesson:
the only valid P2a A/B holds every non-mmq .o byte-identical; comparing two independently-built trees mixes
in whatever other flag/patch state each was built from.
## VERDICT
P2a (mmq_y=64 nwarps-remap) is BIT-EXACT (md5-identical, 1115/805) and a genuine ~25% PREFILL-shape FP4-GEMM
kernel win, but it is FLAT on decode (dense and MoE, npl 32 and 128) on 0022, AND flat on end-to-end prefill
S_PP at 0022 (prefill is GDN/other-bound at these sizes, not mmq-bound). It is NOT a decode-parity lever and
the decode commit-gate (lift decode_agg) is NOT met -> do NOT ship for decode. The binding decode kernel is
gated_delta_net (~50%); the only decode levers left are the bit-exact folds in the design section above
(quantize producer-fold ~2-2.5%, pointwise activation fold ~1.5-2.5%) and the GDN-region launch/fusion that
vLLM already has. The mmq P2a machinery was reverted; the 0022 tree is left git-clean.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# nonrec-build (GPU agent) - built + measured. Lever shipped: MoE NVFP4 quantize de-dup (patch 0023)
Box: DGX GB10 (sm_121a), baseline = clean rebuild of HEAD 8a3229f (patch 0022) in build-cuda
(verified: mmq.cu.o rebuilt from clean source; the A/B-left binary was stale). md5 references
locked: q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd.
Baseline decode S_TG: dense 208.7/373.6, MoE 441/746 (npl 32/128). ncu unavailable (no
GPU-counter permission, no sudo) -> all verdicts are nsys + back-to-back same-build A/B.
## Levers EVALUATED
### A. quantize_mmq_nvfp4 occupancy retune (token-packing) - BIT-EXACT, FLAT -> not shipped
The decode quantize at the K=2048 shape is grid (128,1,1) = 128 CTAs = ~2.67 waves on 48 SMs.
Unlike mul_mat_q (bandwidth-bound on LPDDR5x, so P2a was flat), quantize moves trivial memory,
so I tried packing TPB token-rows per CTA (blockDim.y) to cut wave-quant - each thread still
quantizes its own 16 consecutive values, so byte-identical (md5 5951a5b4/07db32c2 held at TPB
1/2/4, after fixing the output ib index to use the token i1 not blockIdx.x). Result: DENSE npl128
DEAD-FLAT 373.25 across TPB 1/2/4; npl32 and MoE flat-to-slightly-WORSE at TPB>1. The decode
quantize is at its best config already (TPB=1 = max CTA parallelism = best latency hiding;
fewer/bigger CTAs hurt). Second bit-exact occupancy lever (after P2a) proven flat. Reverted.
### B. skip-ALL-quantize probe (NON-bit-exact, diagnostic) - the +30% MoE number is an ARTIFACT
Skipping quantize_mmq_fp4_cuda entirely (garbage buffer, FP4-MMA timing data-independent) showed
DENSE +2.7%/+3.7% (npl128/32) but MoE +29.9%/+43.9%. The MoE figure is NOT a valid ceiling: the
garbage activation also corrupts the router (ffn_gate_inp) quantize -> degenerate topk expert
selection -> less / better-localized expert work -> artificially fast. The authoritative
decode decomposition (nsys --cuda-graph-trace=node, npp8 ntg128 npl128) shows quantize is only
3.7% of MoE decode GPU-time, not the ~23% that a +29.9% speedup would imply (1 - 1/1.299). Dense +2.7% IS real (rms_norm-fold territory, see D).
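The artifact argument is Amdahl-style arithmetic. The sketch below is illustrative only (the helper names are hypothetical, not from the codebase): it converts a skip-probe speedup into the time fraction it would imply, and converts the nsys-measured fraction into the largest speedup that deleting the kernel could give.

```python
def implied_fraction(speedup: float) -> float:
    """Time fraction a kernel must occupy if deleting it yields `speedup`."""
    return 1.0 - 1.0 / speedup


def max_speedup(fraction: float) -> float:
    """Upper bound on speedup from deleting a kernel that takes `fraction` of the time."""
    return 1.0 / (1.0 - fraction)


# Skip-all-quantize probe on MoE at npl128: +29.9% throughput.
moe_probe_fraction = implied_fraction(1.299)
# nsys decode decomposition: quantize is 3.7% of MoE decode GPU-time.
moe_real_bound = max_speedup(0.037)

print(f"probe would imply quantize = {moe_probe_fraction:.1%} of decode")
print(f"nsys share bounds the real gain at +{moe_real_bound - 1:.1%}")
```

The two numbers disagree by a wide margin, which is the signal that the probe changed the amount of expert work rather than only removing the quantize.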
### C. SHIPPED - MoE NVFP4 activation-quantize de-dup (patch 0023) - BIT-EXACT, lifts decode+prefill
ggml mul_mat_id quantizes the gathered rows ne11_flat = ne12*n_expert_used. For the broadcast
up/gate proj (ne11==1) every expert of a token sees the SAME token activation, so stock
re-quantizes each token n_expert_used (=4 here) times. quantize_mmq_nvfp4 has NO cross-thread
reduction (per-16-element per-thread), so the gathered blocks are byte-identical across experts.
Lever: quantize the ne12 unique tokens once, then gather the block_fp4_mmq rows into the
expert-gathered layout with a coalesced uint4 copy (block_fp4_mmq = 9 uint4 = 144 B). GEMM
untouched; down_proj (ne11==n_expert_used, distinct) keeps stock.
- Gather v1 (per-thread 144 B struct copy) was UNCOALESCED: gather 478 ms ate 84% of the 570 ms
quantize saving -> flat. Gather v2 (coalesced uint4, output written contiguously) = 32 ms.
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), gather +32 ms, net -379 ms.
- DECODE S_TG: MoE npl128 745.2 -> 758.1 (+1.73%), npl32 +0.6%. PREFILL T_PP -4%. DENSE byte-flat.
- BIT-EXACT GATE (default on): q36-27b 5951a5b4 (unchanged), q36-35b-a3b 07db32c2 (on==off==0022);
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0
restores stock. Committed: DGX f7409c2 + worktree patch 0023.
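Why the de-dup is byte-identical can be shown with a toy model. This is a minimal sketch under stated assumptions: `quantize_row` is a stand-in block quantizer (not the real NVFP4 encoding), the routing table is made up, and the gathered layout is simplified to token-major order. The only property it shares with `quantize_mmq_nvfp4` is the one the lever relies on: each 16-element block is a pure function of its own 16 inputs.

```python
from typing import List, Tuple

BLOCK = 16  # elements quantized together, mirroring the per-16-element scheme

QuantRow = Tuple[Tuple[float, Tuple[int, ...]], ...]


def quantize_row(row: List[float]) -> QuantRow:
    """Toy block quantizer: every block depends only on its own 16 values."""
    out = []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        scale = max(abs(v) for v in block) / 7.0 or 1.0  # avoid a zero scale
        out.append((scale, tuple(round(v / scale) for v in block)))
    return tuple(out)


def stock(tokens, experts_per_token) -> List[QuantRow]:
    """Gather first, then quantize every gathered row (once per selected expert)."""
    gathered = [tokens[t] for t, experts in enumerate(experts_per_token) for _ in experts]
    return [quantize_row(r) for r in gathered]


def dedup(tokens, experts_per_token) -> List[QuantRow]:
    """Quantize each unique token once, then gather the already-quantized rows."""
    q = [quantize_row(r) for r in tokens]
    return [q[t] for t, experts in enumerate(experts_per_token) for _ in experts]


tokens = [[(i * 31 + j * 7) % 13 - 6.0 for j in range(32)] for i in range(3)]
routing = [[0, 5], [2, 5], [1, 3]]  # hypothetical expert ids per token
print(stock(tokens, routing) == dedup(tokens, routing))
```

A quantizer with any cross-row or cross-thread reduction would break this equality, which is why the argument does not transfer to the distinct `down_proj` rows.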
### D. NOT built - dense quantize producer-fold (rms_norm -> fp4) - real but ~2.7%, needs graph fusion
Dense decode quantize is ~2.7% (skip B, real). Folding it into the rms_norm+mul producer is
bit-exact-feasible (keep the strided sumsq reduction byte-identical, re-partition only the
writeback to 16-consecutive-per-thread + the verbatim per-thread quant) but requires a 3-op
{RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion hoisting the GEMM into the producer node and a
mul_mat_q pre-quantized-src1 path (the scratch is a per-call pool buffer). High plumbing for
~2.7% dense only; left for a follow-up. mul_mat_q (bandwidth wall), flash_attn (softmax rescale
order), lm_head (cublas) have NO bit-exact lever.
## Verdict
The non-recurrence path has ONE shippable bit-exact decode lever found and built: the MoE
quantize de-dup (0023, +1.73% MoE npl128 decode, ~4% lower prefill time, dense untouched, byte-identical).
It is the only redundant-work bucket; the rest of the non-recurrence kernels are at their
bit-exact floor (mul_mat_q bandwidth-bound, quantize occupancy-flat, attention softmax-locked).
The remaining bit-exact headroom is the dense rms_norm->fp4 producer-fold (~2.7% dense, graph-
fusion surgery, not built) and then bf16 state (precision change, shelved) - no other bit-exact
lever moves the LPDDR5x-bandwidth-bound, recurrence-dominated (~50%, past vLLM parity) decode wall.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,119 +0,0 @@
# OCCUPANCY_RETUNE_RESULTS.md - CRUX SETTLED: vLLM recurrence state is FLOAT32 (805 MB/call)
Phase: vllm-f32-confirm (GPU agent). DGX GB10, peak DRAM BW = 273 GB/s.
Checkpoint: ~/bench/q36-27b-nvfp4-vllm (vLLM 0.23.0), ~/bench/q36-27b-nvfp4.gguf (llama HEAD 58426b5, conv-fusion 0021).
NOTE: ncu HW perf-counters are perm-blocked on this node (RmProfilingAdminOnly:1, no passwordless sudo, ERR_NVGPUCTRPERM).
Settled WITHOUT counters: (a) empirical tensor dtype at the kernel boundary, (b) nsys/CUPTI kernel timing (counter-free), (c) source+config chain.
## VERDICT: f32. The close-check is RIGHT. The byte-gate (402 MB/bf16) is WRONG. BUILD THE BIT-EXACT OCCUPANCY RETUNE.
vLLM carries the gated-DeltaNet TEMPORAL/recurrent state in FLOAT32 and moves 805.3 MB/call, NOT 402 MB bf16.
Both engines move the SAME ~805 MB f32 recurrent state per call. The gap is pure BANDWIDTH EFFICIENCY on equal f32 bytes.
## vLLM (kernel: fused_recurrent_gated_delta_rule_packed_decode)
- EMPIRICAL tensor at kernel boundary (initial_state = self.kv_cache[1], qwen_gdn_linear_attn.py:1316/1492):
dtype=torch.float32 elem_bytes=4 shape=(1553, 48, 128, 128) per-slot state = 786432 elems = 3.000 MiB (f32)
- MB/call (B=128, Read+Write) = 128 * 48*128*128 * 4 bytes * 2 = 805,306,368 B = 805.3 MB (bf16 would be 402.7 MB)
- Runtime engine config: cache_config.mamba_ssm_cache_dtype = float32 (mamba_cache_dtype=auto/bf16 for conv)
- Source chain: config.json text_config.mamba_ssm_dtype=float32 -> Qwen3_5ForConditionalGenerationConfig.verify_and_update_config
sets cache_config.mamba_ssm_cache_dtype="float32" -> MambaStateDtypeCalculator._mamba_state_dtype else-branch
-> temporal_state_dtype = torch.float32 (conv state = bf16; temporal/SSM state = f32).
- Kernel timing (CUDA events, eager B=128, 432 steady-decode calls): median 3.578 ms/call, min 3.499, mean 3.593, p90 3.635
BW @ median = 805.3MB / 3.578ms = 225.1 GB/s = 82.4% of 273 peak (min 84.3%, p90 81.1%)
## llama (kernel: gated_delta_net_cuda<128, 0, 0>)
- Kernel signature: all operands const float* (q,k,v,g,beta,curr_state) + float* state_dst => recurrent state is f32. Source-confirmed.
- Identical state geometry (48 value-heads x 128 head_v x 128 head_k, B=128) => MB/call (R+W) = 805.3 MB f32 (same as vLLM).
- Fresh nsys (--cuda-graph-trace=node, build-cuda-base, -npp128 -ntg24 -npl128, q36-27b-nvfp4.gguf):
gated_delta_net = 25.4% of GPU time (#2 kernel after nvfp4 mul_mat_q).
Decode cluster isolated = exactly n=1152 calls (= 24 ntg x 48 GDN layers), B=128 steady state:
median 4.0211 ms/call, mean 4.0315 => 200.3 GB/s = 73.4% of 273 peak.
(Consistent with prior GAP_PROGRESS 4.08ms/~70% and context 3.98ms/202GB/s/74%.)
## THE GAP (equal f32 bytes, different efficiency)
llama 805.3 MB / 4.021 ms = 200.3 GB/s = 73.4% peak
vLLM 805.3 MB / 3.578 ms = 225.1 GB/s = 82.4% peak
=> vLLM takes ~11% less time per recurrence call (~12% higher throughput) at IDENTICAL byte volume => ~9 pts more DRAM BW efficiency.
Retune target: 73.4% -> ~82% peak, recurrence 4.02 -> ~3.58 ms/call, KEEPING exact per-column f32
reduction/FMA order (md5-gateable bit-identical). bf16 plan stays SHELVED (optional over-clock only).
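The byte and bandwidth figures above follow from the state geometry and the measured medians. A small sketch that reproduces them (the function names are illustrative; the constants are the ones quoted in this note):

```python
PEAK_GBPS = 273.0  # GB10 peak DRAM bandwidth quoted above


def state_bytes_per_call(batch=128, heads=48, head_v=128, head_k=128, elem_bytes=4):
    """Recurrent-state traffic per call: one read plus one write of every slot."""
    return batch * heads * head_v * head_k * elem_bytes * 2


def bandwidth(bytes_moved: int, ms: float):
    """Return (GB/s, percent of peak) for a call that moves `bytes_moved` in `ms`."""
    gbps = bytes_moved / (ms * 1e-3) / 1e9
    return gbps, 100.0 * gbps / PEAK_GBPS


f32_bytes = state_bytes_per_call()               # f32 state, as measured
bf16_bytes = state_bytes_per_call(elem_bytes=2)  # what a bf16 state would move

for name, ms in (("llama", 4.021), ("vLLM", 3.578)):
    gbps, pct = bandwidth(f32_bytes, ms)
    print(f"{name}: {f32_bytes / 1e6:.1f} MB in {ms} ms = {gbps:.1f} GB/s = {pct:.1f}% of peak")
```

Swapping `elem_bytes=2` into the loop shows why the bf16 hypothesis was implausible: it would put vLLM at roughly half the measured bandwidth for the same kernel time.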
---
# retune-build (BUILD AGENT) — patch 0022 SHIPPED
vLLM verdict re-checked first: **f32, 805 MB/call** (the close-check is right, the byte-gate's 402 MB/bf16
is wrong). The bf16-state plan stays SHELVED. Built the bit-exact occupancy/coalescing retune.
## The change — bit-exact column folding (Lever A + B + D)
`ggml/src/ggml-cuda/gated_delta_net.cu` `gated_delta_net_cuda`: two new template params
`NUM_WARPS` (default 4) and `COLS_PER_WARP` (default 1) plus `MIN_BLOCKS`. Each warp now owns
`COLS_PER_WARP` columns of the 128x128 recurrent state instead of 1, looping the existing per-column
body over `col, col+NUM_WARPS, ...` inside a per-block column tile of `NUM_WARPS*COLS_PER_WARP` columns;
`grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)`.
Why it is bit-exact: the S_v rows of every column stay sharded across the lanes by the SAME strided
mapping `i = r*warp_size + lane`, and every column's per-lane FMA accumulation and
`warp_reduce_sum<warp_size>` XOR-butterfly are byte-for-byte unchanged. Only the
`(warp,block)->column` assignment and the order a warp visits its columns differ, and a column's f32
value provably does not depend on either: columns are fully independent, since column c reads only its own
S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta. The forbidden `float4`
state load (Lever E), which would repartition a lane to 4 contiguous rows and change the reduction
grouping, was NOT done; this keeps the md5 invariant. Every global access stays identically coalesced
(32 consecutive lanes -> one 128 B line), so this is a latency-coverage / scheduling win, NOT a
coalescing change: per-warp memory-level parallelism rises because COLS_PER_WARP independent
state-load bursts are issued before any reduction, and the independent butterfly reductions
interleave to hide each other's shfl latency. The S_v=128 tile is env-selectable via
`GDN_NW`/`GDN_CPW` for one-build re-tuning; with neither set, it defaults to the measured GB10 winner
**(NUM_WARPS=16, COLS_PER_WARP=8)**.
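The column-folding argument can be checked mechanically. The sketch below is a host-side model, not the kernel: the exact in-tile index formula (`z*tile + warp + j*NUM_WARPS`) is an assumption reconstructed from the description above. It verifies that every swept tile assigns each of the 128 columns to exactly one warp, while the lane-to-row mapping inside a column never changes.

```python
S_V = 128        # columns (and rows) of the per-head recurrent state
WARP_SIZE = 32


def column_owner_map(num_warps: int, cols_per_warp: int) -> dict:
    """Map each column to the (grid.z block, warp) that computes it."""
    tile = num_warps * cols_per_warp
    assert S_V % tile == 0, "tile must divide S_v"
    owner = {}
    for z in range(S_V // tile):                # grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)
        for warp in range(num_warps):
            for j in range(cols_per_warp):      # col, col+NUM_WARPS, ...
                col = z * tile + warp + j * num_warps
                assert col not in owner, "column assigned twice"
                owner[col] = (z, warp)
    return owner


def lane_rows(lane: int) -> list:
    """Rows of any column handled by one lane: i = r*warp_size + lane."""
    return [r * WARP_SIZE + lane for r in range(S_V // WARP_SIZE)]


for nw, cpw in ((4, 1), (8, 8), (16, 8), (32, 4)):
    owner = column_owner_map(nw, cpw)
    covered = sorted(owner) == list(range(S_V))
    print(f"{nw:2d} x {cpw}: grid.z={S_V // (nw * cpw):2d}, all columns covered once: {covered}")
```

Because `lane_rows` does not take the tile shape as an input, the per-column accumulation order is the same for every configuration; only `column_owner_map` changes.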
## %peak sweep — GB10, CUDA 13, sm_121 (nsys CUPTI timing; HW counters perm-blocked)
Metric: median of the 1152 (=ntg24 x 48 layers) B=128 decode calls, each moving 805.3 MB f32 (R+W),
isolated by the [2.5ms,6ms] band; %peak vs 273 GB/s. Baseline re-isolation reproduced the confirm
agent's 4.021 ms / 73.4% exactly (n=1152).
| NUM_WARPS x COLS_PER_WARP | ms/call | GB/s | %peak |
|---------------------------|---------|------|-------|
| base (0021) | 4.021 | 200.3| 73.4 |
| 4 x 1 (control == base) | 4.034 | 199.7| 73.1 |
| 4 x 2 | 3.887 | 207.2| 75.9 |
| 4 x 4 | 3.775 | 213.3| 78.1 |
| 8 x 1 | 3.837 | 209.9| 76.9 |
| 8 x 2 | 3.749 | 214.8| 78.7 |
| 8 x 4 | 3.699 | 217.7| 79.9 |
| 8 x 8 | 3.586 | 224.6| 82.3 |
| 16 x 2 | 3.665 | 219.8| 80.5 |
| 16 x 4 | 3.585 | 224.7| 82.3 |
| **16 x 8 (WINNER/default)** | **3.488** | **230.9** | **84.6** |
| 32 x 4 | 3.489 | 230.8| 84.6 |
Plateau ~84.5% at the grid.z=1 tiles; (16,8) picked as default (512-thread block, no spill, no
1024-thread .minnctapersm warning). **84.6% > vLLM 82.4%.**
## Gates (both PASS, non-negotiable)
- **md5 BYTE-IDENTICAL to the 0021 baseline**, greedy `--temp 0 --seed 1 -n 48`, both models, winner
(16,8 default) AND (4,1 control):
- q36-27b-nvfp4 (dense): `5951a5b4d624ce891e22ab5fca9bc439` (baseline == winner == control)
- q36-35b-a3b-nvfp4 (MoE): `07db32c2bcb78d17a43ed18bc22705cd` (baseline == winner == control)
- **test-backend-ops -o GATED_DELTA_NET: 36/36 PASS** (covers head_size=128, kda=0/1, prefill K>1).
## Decode throughput — base vs flag(16,8), llama-batched-bench -npp128 -ntg128 -fa on
| model | npl | base S_TG t/s | flag S_TG t/s | gain |
|-------|-----|---------------|---------------|------|
| dense 27b | 32 | 199.2 | 207.6 | +4.2% |
| dense 27b | 128 | 335.9 | 373.2 | +11.1% |
| MoE 35b-a3b | 32 | 420.6 | 440.0 | +4.6% |
| MoE 35b-a3b | 128 | 688.4 | 745.7 | +8.3% |
Prefill S_PP unchanged (dense ~930, MoE ~2185 t/s) — no regression. Stable across 3 samples.
## Parity vs vLLM (recurrence kernel)
Recurrence kernel BW: before 200.3 GB/s = 89.0% of vLLM's 225.1; **after 230.9 GB/s = 102.6% of vLLM**
(3.488 ms/call < vLLM 3.578 ms/call). The recurrence bandwidth gap that this workflow set out to close
is closed and slightly exceeded; the remaining decode-parity delta lives in the non-recurrence path
(matmul/attn), not in gated-DeltaNet.
Shipped: patch 0022, committed on the DGX dev tree and the LocalAI worktree. No push.


@@ -1,511 +0,0 @@
# OTHER_PATHS_INVESTIGATION.md
Read-only investigation of the four post-0023 paths (A MoE grouped-GEMM, B lm_head GEMM,
C TTFT/paged-pool burst, D dense CUDA-graph). One section per agent. No GPU except the
moe-gpu-profile agent.
---
## A. MoE grouped-GEMM gap (label: moe-gemm-source, READ-ONLY, no GPU)
### The decisive finding: vLLM's MoE on GB10 is MARLIN W4A16, not a native-FP4 grouped GEMM
Engine-log ground truth (`VLLM_DECODE_GROUNDING.md`, from `~/bench/h2h_moe_vllm.log`):
`"Using 'MARLIN' NvFp4 MoE backend ... Your GPU does not have native support for FP4
computation ... Weight-only FP4 compression will be used leveraging the Marlin kernel"`.
vLLM does NOT take its native-FP4 cutlass/trtllm MoE path on sm_121 (it whitelists only
sm_100/103 datacenter Blackwell for FP4-MMA MoE). So on this box vLLM's MoE is:
- `moe_align_block_size` (BLOCK-PADDED token-sort; `num_tokens_post_padded`, sentinel pad rows),
- **2 grouped `moe_wna16_marlin_gemm` launches per MoE block** (gate_up, then SiLU+mul, then down),
each ONE launch over ALL experts, `use_fp32_reduce=True`,
- **W4A16: activations stay bf16, NEVER quantized**; FP4 weights dequantized in-kernel to bf16,
bf16 MMA,
- the whole decode step under a FULL CUDA graph.
llama's MUL_MAT_ID on GB10 (mmq.cu id-branch + mmid.cu + mmq.cuh stream-k) is:
- `mm_ids_helper` token-sort/scatter, **NO block padding** (exact segments, `expert_bounds`),
- **activation FP4 quantize** (`quantize_mmq_fp4`) of the expert-gathered rows = W4A4,
- **1 persistent stream-k `mul_mat_q<NVFP4>` launch per projection**, native Blackwell FP4-MMA
(`block_fp4_mmq`), fp32 accumulate + `stream_k_fixup`,
- per-expert-density `mmq_x` (M-tile) select (patches 0014/0015, default tile 64 @ density<=8),
- NOT under a CUDA graph.
### So the "missing fused grouped GEMM" does not exist - llama already HAS it
llama's grouped FP4-MMA stream-k IS the same sorted-grouped-GEMM algorithm vLLM uses, and on
GB10 llama's MoE GEMM is at a HIGHER-precision/native-FP4 tier than vLLM's W4A16 Marlin. The
MoE decode gap (77-83% of vLLM vs dense 90-117%) is therefore NOT a grouped-GEMM-architecture
deficit. The MoE-specific EXTRA gap (the ~10-15pt that MoE is worse than dense) decomposes as:
1. **W4A4 activation-quantize tax (llama-only, the biggest MoE-specific discrete cost).**
llama quantizes activations to FP4 for the MoE GEMM; vLLM (W4A16) keeps them bf16 and pays
ZERO activation quantize. At MoE decode npl128 that is 1024 up/gate rows (patch 0023 dedup'd
the broadcast ones to 128 unique + a coalesced block gather) PLUS 1024 down_proj rows
(distinct per expert, CANNOT be dedup'd). nsys decode-isolated (`MOE_QUANT_DEDUP_RESULTS.md`):
`quantize_mmq_nvfp4` is still **457 ms** of decode GPU-time after the 0023 up/gate dedup; the
remaining bulk is the down_proj per-expert re-quantize. vLLM's W4A16 choice is actually SMART
for MoE decode on a bandwidth-bound box: keeping activations bf16 adds negligible activation
bandwidth at M~8/expert but ELIMINATES the entire quantize pass.
2. **Un-graphed extra MoE nodes' launch bubbles.** Per MoE layer llama runs mm_ids_helper +
quantize + gather + 2 grouped GEMMs + SiLU/mul + down-quantize + moe_sum as separate
host-launched ggml nodes, none under a CUDA graph; vLLM runs moe_align + 2 grouped launches
under a full decode graph. This is the SAME launch-bubble root cause `CRITICALPATH_GAP_ANALYSIS.md`
pins for the GDN region (57 ms/step dense = 100% bubble), amplified for MoE by the extra
quantize/gather/scatter nodes - consistent with MoE being relatively worse than dense.
3. **Ragged tiny-M tile + `need_check` partial-tail MMA** in the grouped stream-k. Already
addressed by 0014/0015 and measured **NEUTRAL** on q36-35b-a3b: that model is bandwidth/
SSM-recurrence-bound, not col-tile-occupancy-bound (the `LLAMA_MOE_DECODE_TILE` sweep shows 64
is the only non-negative width and it is within noise). So the M-tile lever has nothing to
bite on for THIS model; it banks +4.8% only on col-tile-bound MoE (Qwen3-Coder-30B).
### Bit-exact llama MoE-GEMM levers (ranked)
- **M1 (bit-exact, modest): down_proj activation-quantize kernel retune.** The remaining ~457 ms
is dominated by the down_proj per-expert FP4 re-quantize (`ne11==n_expert_used`, no dedup
possible). The per-block quantize is a pure per-thread function of 16 consecutive inputs (the
property 0023 exploited to make its gather bit-exact), so the launch GEOMETRY can be retuned
(occupancy/coalescing, like 0022 did for the recurrence and like 0023's coalesced-uint4 gather
fix) while the quantized bytes stay BYTE-IDENTICAL. Also worth checking whether the down gather
(`ids_src1`) is redundant when the SwiGLU intermediate is already expert-contiguous. Scope:
nsys the down-branch `quantize_mmq_fp4` on GB10, retune block/grid, gate on test-backend-ops
MUL_MAT_ID exact + greedy md5 == 0023. Expected: low single-digit % at npl128 (bounded - it is
a fraction of a fraction of the step), but it is the only clean quantize-axis lever left after
0023 and it is strictly bit-exact.
- **M2 (bit-exact, the structurally-correct big one, SHARED with path D/A.2): CUDA-graph the MoE
decode step.** Graph replay does not change numerics => bit-exact. The MoE-specific extra node
count (quantize+gather+scatter+2 GEMM+silu+sum/layer, none graphed) makes the launch-bubble tax
larger for MoE than dense, which is exactly why MoE sits at 77-83% while dense is 90-117%.
Capturing the decode forward removes those bubbles. This is the same lever the GDN/A.2 work
scoped; it helps MoE MORE than dense. Highest-leverage bit-exact MoE win, but it is a
decode-graph-capture project, not a MoE-GEMM kernel edit.
- **M0 (already shipped): 0017 `GGML_CUDA_FP4_MINBLOCKS` (min-resident-CTAs register-cap) and
0014/0015 (`mmq_x` density auto-tile) already cover the FP4-MMA occupancy + M-tile axes of the
SHARED `mul_mat_q<NVFP4>` kernel.** 0017 is bit-exact (register allocation cannot change
results) and was tuned on dense; a MoE-targeted min-blocks re-sweep (grouped per-expert M-tiles
have different occupancy than the dense M=128 GEMM) is a cheap bit-exact follow-up, but
MOE_DENSITY_AUTO_TILE already found this model is bandwidth-bound, so headroom is likely small.
### NOT recommended (explicitly out of scope)
- **W4A16 bf16-activation MoE GEMM (matching vLLM's Marlin choice).** This is the single biggest
MoE-specific structural difference and would erase the activation-quantize tax entirely, but it
(a) is NOT bit-exact (bf16 activations vs llama's FP4), and (b) is the W4A16 occupancy-wall
dead-end the docs flag (only ~9 TFLOP/178 t/s on GB10). Do not pursue.
### Verdict / ranking of path A
Path A is NOT a missing-kernel opportunity - llama already runs the sorted-grouped-FP4-MMA GEMM,
at a higher native-FP4 tier than vLLM's GB10 W4A16 Marlin fallback. The MoE-specific extra gap is
(1) the W4A4 activation-quantize tax vLLM structurally avoids by choosing W4A16, and (2) the same
un-graphed launch-bubble tax as the GDN region, amplified by MoE's extra nodes. The only purely
bit-exact, MoE-GEMM-local lever left is M1 (down_proj quantize retune, modest). The real MoE
bit-exact win is M2 (CUDA-graph the decode step), which is the SAME lever as path A.2/D and helps
MoE more than dense - so A's best lever collapses into the decode-graph effort rather than
standing alone. Recommend ranking A's standalone kernel value BELOW the decode-graph (M2/D) and
the lm_head (B) levers; fold A into the decode-graph build, and keep M1 as a cheap bit-exact
bank-shot.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## B. lm_head GEMM (label: cublas-lmhead, READ-ONLY, no GPU)
### The decisive fact: lm_head is BF16, not NVFP4 - so it CANNOT take the FP4 MMQ path
`output.weight` (the LM head) in q36-35b-a3b-nvfp4 is **type 30 = GGML_TYPE_BF16, NOT quantized**
(verified in `DECODE_PARITY_EXPLORE.md:298`: "2425 MB = 2.37 GB, read in full each step", 16% of
weight traffic). This is by construction: the model was quantized with `--tensor-type attn/ffn=
nvfp4`, which converts the attn+ffn tensors to NVFP4 and **leaves `output.weight` (and `tok_embd`)
at base BF16** - the standard recipe, because the final projection is the most logit-sensitive
tensor. The NVFP4 sidecar scales (`output_s`, `output_in_s`) are only created when
`output->type == GGML_TYPE_NVFP4` (`llama-model.cpp:1459`), so for the BF16 head `model.output_s`
is null.
### Why it runs cublas/nvjet and not MMQ (exact routing trace)
Graph: `qwen35moe.cpp:244` `cur = build_lora_mm(model.output, cur, model.output_s)` ->
`llama-graph.cpp:1093` is just `ggml_mul_mat(ctx0, w, cur)` (the null `w_s` skips the scale `ggml_mul`).
Then `ggml_cuda_mul_mat` (`ggml-cuda.cu:2540`) decides the kernel:
- `use_mul_mat_q` / `use_mul_mat_vec_q` BOTH require `ggml_is_quantized(src0->type)`. BF16 is NOT
quantized (`is_quantized=false` for F16/BF16/F32; NVFP4 IS `is_quantized=true`, `ggml.c:748`).
=> **both MMQ paths are ineligible for the BF16 head.** (If the head were NVFP4 it WOULD route to
the tuned FP4 `mul_mat_q` - this is exactly the difference.)
- At decode npl128 the activation `src1->ne[1] = 128` columns: `use_mul_mat_vec_f` is gated off by
the mmvf batch cap; `use_mul_mat_f` (the MMF bf16 tensor-core GEMM) is gated off by
`ggml_cuda_should_use_mmf` for the wide `151936-row x 128-col` shape.
- `use_batched_cublas_bf16` is true, but the batched-cublas branch additionally requires
`src1->ne[2]*src1->ne[3] > 1` (a 3D/4D multi-batch GEMM). The decode lm_head is 2D
(`ne[2]*ne[3] == 1`) => **batched-cublas branch is skipped.**
- => falls through to the final `else`: `ggml_cuda_op_mul_mat_cublas`. With `src0` BF16 +
bf16-MMA hardware it takes the BF16 branch (`ggml-cuda.cu:1663`): `cublasGemmEx(CUDA_R_16BF,
CUDA_R_16BF -> CUBLAS_COMPUTE_32F, TENSOR_OP)`. **That cublasLt kernel is `nvjet_sm121`.**
Cost (both models): dense `nvjet` lm_head = **12.17 ms = 3.66% of the 332.8 ms dense step**
(`F16_DENSE_RESIDUAL_PROBE.md:65`); MoE = **11.91 ms = 3.1%** (`CRITICALPATH_GAP_ANALYSIS.md:398`).
### CRITICAL correction: the current head is NOT "f32-lm_head" - it is already BF16-rounded
The task brief calls the baseline "f32-lm_head"; it is not. The cublas BF16 branch **downcasts the
F32 activation to BF16**, does BF16xBF16 with F32 accumulate, **writes the result as BF16** (dst is
`CUDA_R_16BF`), then upcasts BF16->F32. So today's "bit-exact reference" logits are already
**BF16-precision**, not f32. Two consequences:
1. Any NVFP4/FP8 head swap is measured against a BF16 baseline, not f32 - the precision delta vs
the *true* f32 head is partly already paid.
2. A *different* BF16 GEMM kernel that also F32-accumulates and BF16-rounds the output is
**bit-identical for the vast majority of logits** (differs only at rare BF16 rounding ties) -
this is what makes option (c) below "essentially bit-exact".
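To make the "already BF16-rounded" point concrete, here is a minimal sketch of an f32 -> bf16 -> f32 round trip in pure Python. It assumes round-to-nearest-even for the downcast, which is the common convention but is not confirmed by this note for the cublas path, and it ignores NaN handling.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Round an f32 value to bf16 bits with round-to-nearest-even (NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # add half an ulp, ties go to the even mantissa
    return (bits >> 16) & 0xFFFF


def bf16_round(x: float) -> float:
    """f32 -> bf16 -> f32 round trip, i.e. what a BF16-typed GEMM output retains."""
    return struct.unpack("<f", struct.pack("<I", f32_to_bf16_bits(x) << 16))[0]


logit = 12.3456789
print(f"f32 logit {logit!r} survives as {bf16_round(logit)!r} after the BF16 write-back")
```

Two kernels that both accumulate in f32 and round the result this way can only disagree when their f32 sums land on opposite sides of a rounding boundary, which is the sense in which option (c) is "essentially bit-exact".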
### The options, and which break bit-exactness
- **(a) NVFP4-quantize the head -> tuned FP4 MMQ. BIGGEST win, BREAKS bit-exactness.** Weight
2.37 GB BF16 -> ~0.67 GB NVFP4 (0.5625 B/wt vs 2 B/wt = ~3.6x fewer bytes) AND it then hits the already-tuned
`mul_mat_q<NVFP4>` (0017) instead of cublas. Memory-bound floor drops ~3.6x => save ~8-9 ms =
~2.5% of the dense step. But NVFP4 < BF16 precision => **different logit bits, can flip the greedy
argmax** = NOT bit-exact; and it is **UNFAIR vs vLLM**, which keeps its LM head BF16
(`DECODE_PARITY_EXPLORE.md:358`: "fp8 LM head ... only matters if vLLM also quantizes it"). This
is the same opt-in, non-bit-exact bucket as the f16-glue probe (already concluded SKIP).
- **(b) FP8 / Q8_0 head.** Smaller error than NVFP4 but still != BF16 bits => still NOT bit-exact,
and it is not even on the tuned FP4 MMQ path, so it buys less speed than (a). No reason to prefer.
- **(c) Keep BF16 weight, swap the kernel (custom skinny wide-vocab streaming GEMM, or a cublasLt
algo heuristic tuned for the thin-M / huge-N memory-bound shape).** The ONLY essentially-bit-exact
option (F32 accumulate + BF16 round = identical except rounding ties, per the correction above).
### Realistic lever + scope: there is NO good bit-exact lever here
Bandwidth math kills option (c): `nvjet` moves 2.37 GB in ~11.9-12.2 ms = **~195-199 GB/s = ~72% of
the GB10's 273 GB/s peak**. The lm_head GEMM is therefore **already one of the MOST
bandwidth-efficient kernels in the step** - the overall decode step runs at only 40% util /
110 GB/s (`DECODE_PARITY_EXPLORE.md`). The bit-exact ceiling is tiny: even a perfect
DRAM-saturating kernel (199 -> 273 GB/s) takes 11.9 -> ~8.7 ms = **save ~3 ms = ~0.9% of the dense
step**, and beating cublas's own tuned nvjet on a pure weight-stream shape is NOT guaranteed (it may
already be near-optimal). High kernel-writing effort, uncertain sub-1% payoff.
(`F16_DENSE_RESIDUAL_PROBE.md:97` independently estimates a bf16-glue nvjet recovery of only ~5 ms
and flags it "uncertain - may already run TF32" - consistent with little headroom.)
The structural reason: the head must read the **entire 2.37 GB weight for just 128 output columns**
(inherently memory-bound), and **you cannot cut those weight bytes without changing the dtype** -
i.e. bit-exactness and the only real speedup (fewer weight bytes) are **mutually exclusive** here.
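The ceiling for a bit-exact kernel swap is a one-line roofline estimate. A sketch using only the figures quoted in this section (variable names are illustrative):

```python
PEAK_GBPS = 273.0      # GB10 peak DRAM bandwidth
HEAD_GB = 2.37         # BF16 output.weight, read in full every step
DENSE_STEP_MS = 332.8  # dense decode step used for the percentage


def achieved_gbps(gigabytes: float, ms: float) -> float:
    """Bandwidth achieved when streaming `gigabytes` in `ms` milliseconds."""
    return gigabytes / (ms * 1e-3)


measured_ms = 11.9
ideal_ms = HEAD_GB / PEAK_GBPS * 1e3   # time at 100% of peak, same bytes
saving_ms = measured_ms - ideal_ms

print(f"measured: {achieved_gbps(HEAD_GB, measured_ms):.0f} GB/s")
print(f"ideal:    {ideal_ms:.2f} ms -> saves {saving_ms:.2f} ms "
      f"= {100 * saving_ms / DENSE_STEP_MS:.2f}% of the dense step")
```

Since the byte count is fixed by the BF16 dtype, `ideal_ms` is a hard floor for any kernel that keeps the weights unchanged.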
### Verdict / ranking of path B
The lm_head cublas/nvjet GEMM is a **dead end for a bit-exact win**: it is already at ~72% of peak DRAM
bandwidth (one of the step's most bandwidth-efficient major kernels), so a bit-exact kernel swap caps at <1% with real risk and
no guarantee of beating cublas. The only large win - NVFP4-quantizing the head (~2.5%) - is
explicitly non-bit-exact AND unfair vs vLLM (which keeps BF16), so it lands in the same opt-in
non-bit-exact bucket as f16-glue that was already shelved. Rank B's bit-exact value **at the bottom**
of the four paths. The one worthwhile note for the team is the correction that the head is already
BF16 (not f32), which slightly narrows what "bit-exact" even protects here; if the project ever
opens a *non*-bit-exact opt-in track, NVFP4-head (option a) is a clean ~2.5% dense lever that rides
the existing tuned FP4 MMQ - but it must be gated as opt-in and excluded from any vLLM-parity claim.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## A.2 / D. GPU-measured MoE decode decomposition + dense-graph stability (label: moe-gpu-profile, THE GPU AGENT)
nsys `--cuda-graph-trace=node` on a steady MoE decode at npl128 (q36-35b-a3b-nvfp4, HEAD f7409c2,
clean 0023 build-cuda). The measurement was decode-isolated: the run has a prefill phase (16384 tok,
the big-GEMM region) followed by 64 steady decode steps; I segmented the timeline by GPU-idle gaps,
dropped the prefill window, and aggregated per-kernel time over the 64-step decode window only
(`moe_decode_npl128.{nsys-rep,trace.csv}` on the DGX; extractor `decfull.py`/`grid.py`).
### MoE decode window: 98.3% GPU-bound, ~165 ms/step. Per-kernel share of decode GPU-time:
```
41.9% gated_delta_net_cuda REC (SHARED with dense, already tuned 0018-0022)
26.9% mul_mat_q<NVFP4, M-tile=64> MOE expert grouped GEMM (MUL_MAT_ID) <-- biggest MoE-specific kernel
7.7% nvjet_sm121 (cublas bf16) attn/gdn bf16 projections + the BF16 lm_head (path B)
2.7% cutlass_80 bf16 s16816 relu bf16 GEMM (shared-expert / gate)
2.7% k_bin_bcast (mul/add) expert-combine + routing-weight scale + glue
2.6% k_get_rows_float REC recurrent-state gather
2.4% flash_attn_ext_f16 attention
2.3% mul_mat_q<NVFP4, M-tile=128> router / non-grouped FP4 GEMM
2.1% ssm_conv(+update) REC
2.0% quantize_mmq_nvfp4 MOE W4A4 activation-quantize tax (3.25 ms/step)
1.8% convert_unary bf16<->f32 glue around the bf16 projections
1.5% cpy_scalar glue
0.9% rms_norm
0.5% REC gating act | 0.5% streamk_fixup | 0.3% mm_ids_helper | 0.3% argsort |
0.2% l2norm | 0.2% set_rows | 0.1% gather_mmq_fp4 | <0.1% topk/softmax/reduce (routing)
```
Bucketed: **Recurrence (shared, tuned) ~= 47.5%** (gdn 41.9 + get_rows 2.6 + ssm_conv 2.1 + gating
0.5 + l2norm 0.2 + set_rows 0.2). **MoE FFN+routing block ~= 31%** (grouped GEMM 26.9 + activation
quant 2.0 + streamk 0.5 + mm_ids_helper/argsort/gather/softmax/topk/reduce ~1.3 + the expert-combine
share of bin_bcast). **cublas/cutlass bf16 projections ~= 10.4%** (nvjet 7.7 + cutlass 2.7).
Attention ~2.4%. The recurrence is the single biggest term but it is shared with dense and already
the subject of 0018-0022, so it is NOT a MoE lever.
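The bucket totals can be re-derived from the per-kernel shares. The grouping below copies the figures from the table; the `routing_misc` entry uses the ~1.3% aggregate quoted in the text, and the expert-combine share of `k_bin_bcast` is left out because the note does not separate it.

```python
# Per-kernel share of MoE decode GPU-time (%), copied from the table above.
recurrence = {"gated_delta_net": 41.9, "get_rows": 2.6, "ssm_conv": 2.1,
              "gating_act": 0.5, "l2norm": 0.2, "set_rows": 0.2}
moe_ffn = {"grouped_gemm": 26.9, "quantize": 2.0, "streamk_fixup": 0.5,
           "routing_misc": 1.3}
bf16_proj = {"nvjet": 7.7, "cutlass": 2.7}


def bucket_total(bucket: dict) -> float:
    """Sum a bucket's shares, rounded to the table's one-decimal precision."""
    return round(sum(bucket.values()), 1)


for name, bucket in (("recurrence", recurrence),
                     ("MoE FFN+routing", moe_ffn),
                     ("bf16 projections", bf16_proj)):
    print(f"{name}: {bucket_total(bucket)}%")
```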
### The biggest MoE-specific kernel (the lever): mul_mat_q<NVFP4, M-tile=64> grouped GEMM
26.9% of decode = ~43.5 ms/step, avg **373 us/call**, grids of **2048 and 8192** 64-wide tiles
(blk=32 = 1 warp/block). Compare the dense FFN GEMM in the same family at npl128: `mul_mat_q<NVFP4,
M-tile=128>` avg **31 us/call**, grid 48. The grouped per-expert GEMM is ~12x the per-call cost and
launches ~40-170x more tiles (2048/48 to 8192/48) because each of 128 experts is a separate tiny-M sub-GEMM (128 tokens x
top-k / 128 experts ~= a handful of rows per expert) padded into 64-wide tiles. This is exactly the
ragged-tiny-M / col-tile-occupancy axis section A's 0014/0015 `mmq_x` density auto-tile already
covers and measured NEUTRAL on this bandwidth-bound a3b model. MMQ FP4 is integer/FP4-exact
independent of tile geometry, so this kernel IS bit-exact to retune (occupancy/min-blocks/M-tile),
but the headroom on THIS model is small (it is bandwidth-bound, not tile-occupancy-bound).
### Confirmations / quantifications of section A (from live GPU, not source-reading):
1. **Un-graphed at npl128: CONFIRMED in source, but NOT the npl128 bottleneck.** NVFP4 on sm121
(turing_plus path) has `mmvq_mmid_max = 8` (`mmvq.cu:145`); MoE decode batch ne[2]=128 > 8, so
`[TAG_MUL_MAT_ID_CUDA_GRAPHS]` (`ggml-cuda.cu:3273`) disables CUDA graphs for the WHOLE step and
the MMQ grouped path (not MMVQ) is taken. HOWEVER the measured decode window is **98.3% GPU-util
with ~7.8 us inter-step host gaps** - at npl128 the kernels are large enough to fully hide the
per-op launch latency, so the un-graphed launch-bubble tax is negligible HERE. The un-graphed
penalty is a SMALL-npl problem; at npl128 the MoE gap is in-kernel (grouped GEMM + quantize),
not host bubbles. This refines A's M2: graphing the decode step helps small-npl MoE much more
than npl128 MoE.
2. **W4A4 activation-quantize tax: CONFIRMED present but only 2.0% at npl128.** `quantize_mmq_nvfp4`
= 3.25 ms/step in the decode-isolated window (A's 457 ms figure is a whole-run/different-window
total). Real, and vLLM-W4A16 avoids it, but it is a small-single-digit term, not dominant.
3. **lm_head/projection cublas (path B): CONFIRMED ~12.4 ms/step** of nvjet in MoE decode (matches
B's 11.91 ms), but that 7.7% bundle is mostly per-layer attn/gdn bf16 projections, not just the
one lm_head.
### D. Dense CUDA-graph stability: f32 dense is STABLE, the bimodality was a BF16-only artifact
Dense (q36-27b-nvfp4) has no MUL_MAT_ID, so it stays fully CUDA-graphed. Measured S_TG @npl128:
```
intra-process (1 load, 6x npl=128, npp8/ntg48, N_KV=7168):   376.2 376.2 375.7 375.1 375.3 374.9 (spread <0.4%)
inter-process (6 separate procs, fresh graph capture each):  373.6 377.0 376.8 376.6 376.2 375.7 (spread ~0.9%)
committed heavy config (npl128 ntg128, N_KV=32768):          333.3 / 334.8 / 335.9 (spread ~0.8%)
```
No bimodality in either replay (intra-process) or capture (inter-process). The custom graph state
machine (`ggml-cuda.cu:4484`: warmup_complete requires 2 property-stable calls; the one-time capture
cost lands in T_PP, not S_TG) absorbs capture into prefill, which is the only "hint" (the first
in-process measurement has a slightly higher T_PP and a marginally lower S_TG, fully bounded). The
287/336/487/498 bimodality in the brief was the shelved BF16 SSM-state path (BF16_SSM_STATE.diff,
never applied), not the shipped f32 path. There is NO graphs-off env in this fork (graph enable is
compile-time USE_CUDA_GRAPH + the warmup machine), so a graph-disable A/B would need a rebuild; given
the f32 path is already stable to <1%, path D is a non-issue and not worth the rebuild.
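The stability claim rests on the spread of the S_TG samples. A sketch of that calculation, assuming "spread" means (max - min) / mean, which the note does not define explicitly:

```python
def spread_pct(samples: list) -> float:
    """Range of the samples as a percentage of their mean."""
    mean = sum(samples) / len(samples)
    return 100.0 * (max(samples) - min(samples)) / mean


intra = [376.2, 376.2, 375.7, 375.1, 375.3, 374.9]   # one load, six replays
inter = [373.6, 377.0, 376.8, 376.6, 376.2, 375.7]   # six processes, fresh capture each
heavy = [333.3, 334.8, 335.9]                        # npl128 ntg128, N_KV=32768

for name, samples in (("intra-process", intra), ("inter-process", inter), ("heavy", heavy)):
    print(f"{name}: spread {spread_pct(samples):.2f}%")
```

A bimodal run such as the 287/336/487/498 set would show a spread of tens of percent under the same definition, so the two regimes are easy to tell apart.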
### Verdict (GPU agent)
- The MoE decode gap vs vLLM at npl128 is **in-kernel, not host-overhead**: 98.3% GPU-util rules
out the un-graphed launch-bubble story AT npl128. The single biggest MoE-specific kernel is the
`mul_mat_q<NVFP4, M-tile=64>` grouped GEMM (26.9%, 43.5 ms/step); it is bit-exact to retune but
bandwidth-bound on this a3b model (A's auto-tile already measured neutral), so the standalone
bit-exact MoE-GEMM lever is REAL but BOUNDED. The recurrence (47.5%) is shared and already tuned.
- **Path D (dense graph instability) is closed: the shipped f32 dense path is stable (<1%, no
bimodality).** No latent fragility, no rebuild warranted.
- Net ranking from the GPU side agrees with A/B: the MoE-GEMM and lm_head levers are both bounded
and partly non-bit-exact; the only structurally large bit-exact MoE win (A's M2, graph the decode
step) pays off mostly at SMALL npl, not at the npl128 where the benchmark gap is reported.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## C. TTFT / paged-pool burst degradation (label: ttft-burst-rootcause, READ-ONLY, source + committed traces)
Files read: `paged/paged_kv_manager.{h,cpp}`; patches `0004` (on-demand alloc), `0007` (persistent
manager + ref-counted prefix), `0008` (server cross-request share), `0013`/`0016` (prefill budget);
docs `QWEN36_NVFP4_BENCH.md`, `BENCHMARK_PROGRESS.md`, `CHUNKED_PREFILL_PLAN.md`,
`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`, `P1_DYNAMIC_BUDGET_RESULTS.md`, `FUTURE_LEVERS.md`.
### Part 1 - the static decode-first budget: why a 128-way burst hits 903 s dense / 213 s MoE TTFT
How the budget schedules (patch 0016, `server-context.cpp::update_slots`): each step builds ONE
mixed batch. Phase 1 appends every GENERATING slot's single sampled token UNCONDITIONALLY (no budget
gate), so after Phase 1 `batch.n_tokens == D` (the live decode load). Phase 2 then fills prompt
tokens, bounded by three predicates: the hard `batch.n_tokens < n_batch` (2048) ceiling, a per-step
`prefill_budget_step`, and a per-slot `prefill_cap_per_slot`. **Decode is structurally claimed first
and never capped; only prefill is throttled.**
At the shipped config (`LLAMA_MAX_BATCH_TOKENS=512`, i.e. T=512=n_ubatch) the dynamic terms
degenerate to constants:
- `prefill_budget_step = max(n_ubatch, T - D) = max(512, 512-D) = 512` for all D in [0,128] - the
floor binds, the `T-D` adaptivity NEVER bites (exactly the "structural note" in
`P1_DYNAMIC_BUDGET_RESULTS.md`).
- `prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) = min(512, 5243) = 512`, clamped to 512.
So each step admits at most 512 prefill tokens TOTAL and up to 512 per single slot. Each benchmark
prompt is exactly 512 tokens and there is NO round-robin (0016 drains slots in index order):
**the first waiting slot consumes the entire 512-token step budget with its whole prompt; the 128
prompts prefill strictly SERIALLY, one prompt per step.** Slot k's first token appears after ~k
prefill steps and each step co-batches the accumulating decode load, so step time grows. Mean TTFT
~= (half the prompts) x step_time ~= **903 s dense** (each step reads the full 28B NVFP4 weights) /
**213 s MoE** (3B active = cheaper steps). Decode_agg stays high (384/726 t/s) because Phase 1 seats
every decode token every step. This is the **deliberate decode-first tradeoff**: T=512 was chosen
for decode throughput + memory; TTFT was the sacrificed axis. The 903 s is partly self-inflicted by
the floor budget + lack of fairness, not a kernel limit (dense `prefill_tps` collapses to ~70 t/s
under the throttle vs vLLM's flat ~1420).
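
The degenerate budget can be checked numerically. The following is a minimal sketch that assumes the formulas quoted above (and the `T == n_batch` pin described in the 0016 notes); the function name `prefill_budget` is hypothetical and is not a symbol from the patch.

```python
import math

def prefill_budget(n_batch, n_ubatch, n_ctx, max_batch_tokens, decode_load):
    """Return (prefill_budget_step, prefill_cap_per_slot) for one scheduler step."""
    # T: total per-step token budget, clamped to [n_ubatch, n_batch].
    T = min(max(max_batch_tokens, n_ubatch), n_batch)
    # Leftover after the decode tokens D, floored at one ubatch.
    budget_step = max(n_ubatch, T - decode_load)
    # Per-slot cap: min(T, 4% of the context), clamped to [n_ubatch, n_batch].
    cap = min(max(min(T, math.ceil(0.04 * n_ctx)), n_ubatch), n_batch)
    if T == n_batch:
        cap = n_batch  # pinned in the degenerate T == n_batch case
    return budget_step, cap

# Bench config: -b 2048 -ub 512 -c 131072 with LLAMA_MAX_BATCH_TOKENS=512.
shipped = {prefill_budget(2048, 512, 131072, 512, D) for D in range(0, 129)}
assert shipped == {(512, 512)}  # constant for every decode load D in [0, 128]

# With T=1024 the T - D term is live: the budget shrinks as D rises.
print(prefill_budget(2048, 512, 131072, 1024, 64))
```

At `T=512` the set collapses to the single pair `(512, 512)`, which is the "floor binds" behaviour described above; only a larger `T` makes the budget depend on `D`.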
The fix (chunked-interleave / fair dynamic budget = P2 of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`,
NOT implemented), three pieces in `update_slots` Phase 2, zero libllama change:
1. Raise T toward `n_batch` (2048) so the per-step total budget is large; keep decode-first via the
REAL `prefill_budget_step = T - D` (leftover auto-shrinks as D rises, so the step never inflates
past T even at npl128).
2. A per-slot chunk cap MUCH smaller than the budget (the `long_prefill_token_threshold` analogue),
e.g. 128-256 tokens, so one prompt cannot monopolize the step.
3. A round-robin start offset over PROCESSING_PROMPT slots so leftover budget spreads across MANY
waiting prompts per step.
Net: instead of "one full 512-prompt per step" (serial, last prompt waits 128 steps), each step
admits small chunks from ~T/cap prompts at once, so all 128 advance in lockstep and TTFT collapses
from O(k*step) to O(constant) - the vLLM 6-18 s regime. 0016's per-slot-cap variable already exists
but is inert at the shipped config and lacks the round-robin spreader. Honest boundary (already in
the docs): this closes TTFT, it does NOT lift the ~161/333 decode ceiling (a separate lever).
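
The three pieces can be illustrated with a toy admission pass. This is a sketch only: `admit_prefill` and its `start` offset are hypothetical names, the real logic would live in `update_slots` Phase 2, and the sketch models a single step with no decode load.

```python
def admit_prefill(waiting, budget, cap, start):
    """One Phase-2 pass: hand out prompt-token chunks round-robin.

    waiting: remaining prompt tokens per slot
    budget:  total prefill tokens for this step (the T - D leftover)
    cap:     per-slot chunk cap
    start:   round-robin start offset (hypothetical; would rotate each step)
    Returns {slot_index: tokens_admitted}.
    """
    admitted = {}
    n = len(waiting)
    for i in range(n):
        if budget <= 0:
            break
        slot = (start + i) % n
        chunk = min(waiting[slot], cap, budget)
        if chunk > 0:
            admitted[slot] = chunk
            budget -= chunk
    return admitted

waiting = [512] * 128
# Shipped behaviour: cap == budget == 512, index order, so one slot takes the step.
serial = admit_prefill(waiting, budget=512, cap=512, start=0)
# Proposed shape: T raised to 2048 (D = 0 here), small cap, rotating offset.
fair = admit_prefill(waiting, budget=2048, cap=128, start=5)
print(len(serial), len(fair))
```

With a 2048-token budget and a 128-token cap, 16 prompts (`T/cap`) advance in one step instead of one. The sketch says nothing about step time or the resulting TTFT, which would have to be measured.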
### Part 2 - the burst-degradation BUG: later lower-npl prefill collapses 507 -> 65 t/s, decode fine, restart cures it
The signature - prefill-only collapse, decode untouched, persists in-process, a server restart fully
cures it (the benchmark's documented "restart per npl" workaround) - points to persistent paged-pool
host state never restored short of `clear()`/teardown. Two compounding mechanisms, both confirmable
from the patch source:
**(1) RECLAMATION GAP - blocks are returned ONLY on a FULL-range wipe.** `paged_alloc` returns a
sequence's blocks to the pool in exactly two places (patch 0004, kept in 0007): `clear()` ->
`release_all`, and `seq_rm(seq, p0, p1)` ONLY when `p0 == 0 && p1 == MAX`. But llama-server's normal
slot lifecycle issues PARTIAL truncations: slot reuse with a retained common/BOS prefix calls
`seq_rm(slot.id, n_past, -1)` with `n_past > 0` (patch 0008 itself calls
`common_context_seq_rm(ctx, slot.id, n_past, -1)`); context-shift / partial rewinds likewise. None
satisfy `p0 == 0`, so the release hook never fires: the kv-cache frees those CELLS but the manager
still believes the sequence owns those BLOCKS. The two desync and the manager's effective free pool
shrinks every time. Patch 0008's own comment is the smoking gun - it added the `n_past < 16` gate
because a mismatched full-prompt reservation vs suffix-only submission "never leaves stale blocks
(which otherwise fragment the paged pool ... and crashed the server under high fan-out)". 0008 only
closed that hole for the narrow `share()` path; the general partial-`seq_rm` path stays unhooked, so
over a high-fan-out burst leaked blocks accumulate and never return.
**(2) FRAGMENTATION / NO COMPACTION - the free queue is permuted by the burst and never rebuilt.**
Even for cleanly freed blocks, `BlockPool::free_blocks` just `prepend_n`/`append_n`s them in free
order; no compaction, no pristine reset. After a high-fan-out burst (many interleaved alloc/free
across many seqs in the unified pool, or reversed-order frees in a per-stream pool) the free queue is
a scrambled permutation of physical block ids. A subsequent LOW-npl prefill then `popleft`s
physically SCATTERED blocks, so its 512-token KV scatter-WRITE plus the in-kernel paged-attention
GATHER lose locality across the KV span -> prefill throughput collapses. Decode is a single-token
append per step with a gather amortized over tiny per-step work, so it barely notices - exactly the
observed "prefill collapses, decode robust". The scramble + leak persist for the process lifetime
(only `clear()`/restart rebuilds a contiguous free queue) - precisely why restart-per-npl restores
507 t/s. Contributing factor: slots used in the burst but not reassigned next run are never released
(release fires only on next-task divergence), so a low-npl run sees a reduced, fragmented pool and
falls back to the stock contiguous allocator more often (the `place()->false->res.idxs.clear()`
fallback in find_slot), scanning a littered cell array - another prefill-only slowdown.
Fix scope (all gated behind `LLAMA_KV_PAGED`, default-off byte-identical, no libllama API change):
- **Fix-1 (core, ~30-50 lines): close the reclamation gap.** Add
`paged::PagedKVManager::truncate(seq, n_keep)` that frees the trailing blocks of a request beyond
block index `ceil(n_keep/bs)` (ref-counted, mirroring vLLM's free of the truncated block suffix),
expose `paged_alloc::truncate(cache, stream, seq, n_keep)`, and call it from
`llama_kv_cache::seq_rm` for the `p1 == MAX && p0 > 0` case (ideally any `[p0,p1)`). Manager
accounting then tracks the kv-cache exactly; the leak stops.
- **Fix-2 (small): defrag on empty.** When a stream's cells reach `get_used() == 0`, rebuild that
manager's free queue to pristine contiguous order (or recreate the manager) so a reused pool
starts unfragmented.
- **Fix-3 (small): release on slot completion.** Add a paged release at server `slot.release()` so
finished-but-idle sequences return blocks promptly and a later low-npl run sees a full, compact
pool.
- **Fix-4 (optional hardening): best-fit / contiguous-run preference** in `get_new_blocks` + a
defrag pass before the find_slot stock-fallback fires.
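
Fix-1 and Fix-2 can be sketched on a toy pool. `ToyPagedPool` is a hypothetical stand-in, not the real `PagedKVManager`; `block_size` plays the role of `bs` above, and the free-queue order is the only thing "defrag" restores here.

```python
from collections import deque

class ToyPagedPool:
    """Toy block pool illustrating truncate (Fix-1) and defrag-on-empty (Fix-2)."""

    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.free = deque(range(num_blocks))  # pristine contiguous order
        self.refs = [0] * num_blocks
        self.owned = {}  # seq id -> block ids in position order

    def num_free(self):
        return len(self.free)

    def reserve(self, seq, n_tokens):
        """Grow seq's block list until it covers n_tokens positions."""
        blocks = self.owned.setdefault(seq, [])
        need = -(-n_tokens // self.block_size)  # ceil division
        while len(blocks) < need:
            b = self.free.popleft()
            self.refs[b] = 1
            blocks.append(b)

    def truncate(self, seq, n_keep):
        """Fix-1: free the trailing blocks beyond ceil(n_keep / block_size)."""
        blocks = self.owned.get(seq, [])
        keep = -(-n_keep // self.block_size)
        while len(blocks) > keep:
            b = blocks.pop()
            self.refs[b] -= 1
            if self.refs[b] == 0:  # ref-counted: a shared block would survive
                self.free.append(b)
        if not blocks:
            self.owned.pop(seq, None)
        if not self.owned:
            # Fix-2: nothing is in use, so rebuild the pristine free order.
            self.free = deque(range(self.num_blocks))

pool = ToyPagedPool(num_blocks=8, block_size=16)
pool.reserve(seq=0, n_tokens=64)   # blocks 0..3
pool.reserve(seq=1, n_tokens=64)   # blocks 4..7
pool.truncate(seq=0, n_keep=20)    # models seq_rm(0, 20, -1): keeps 2 blocks
print(pool.num_free())
```

Without `truncate`, the partial removal would leave the toy pool reporting zero free blocks even though the cache no longer uses the tail, which is the desync described in mechanism (1).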
Validation repro (GPU-bound, for a later profiling pass): npl64 burst then npl8 on ONE server;
assert npl8 `prefill_tps` within ~10% of a fresh-server npl8, and that `paged_alloc::num_free`
returns to the fresh value after the burst drains.
### Verdict / ranking of path C
Two distinct things: a **BUG** (Part 2) and a **tuning tradeoff** (Part 1). Rank the BUG first - it
is a true correctness/hygiene defect, not a tradeoff: a long-lived production server silently
degrades under ordinary mixed load and currently REQUIRES the "restart per npl" crutch, unacceptable
in real serving. Fix scope is small and localized to the paged-alloc unit + one `seq_rm` call site,
default-off byte-identical, with a crisp pass/fail repro. The chunked-interleave scheduler (Part 1)
is the bigger HEADLINE (the weakest benchmark number, 903 s/213 s burst TTFT vs vLLM 6-18 s) but a
larger effort with a deliberate TTFT-vs-decode-ITL tradeoff to navigate. The two are complementary:
the scheduler reduces how punishing each burst is; the bug fix ensures the pool survives the burst
so the NEXT request is not poisoned.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## SYNTHESIS - ranking and the first build target (label: orchestrator)
The brief framed two tracks: **BIT-EXACT** levers (help the shipped f32 parity DEFAULT, included in
the vLLM-parity claim) and **SERVING** levers (gated behind `LLAMA_KV_PAGED`, default-off
byte-identical, outside the parity claim). The decisive cross-cutting finding from all four agents:
**there is no compelling first build target on the bit-exact decode-default track** - A is bounded,
B is a sub-1% dead end, D is closed - **while the SERVING track has one clear, high-ROI, tractable,
low-risk, byte-identical-default first target: the paged-pool burst-degradation bug.**
### Per-path scorecard
| Path | Expected gain | Tractability | Bit-exactness | Net |
|------|---------------|--------------|---------------|-----|
| **A** MoE grouped-GEMM | Standalone kernel: **bounded, low single-digit %** at npl128 (model is bandwidth-bound; 0014/0015 M-tile auto-tile already NEUTRAL here). The big MoE win = M2 = graph-the-decode-step, which is SHARED with D and pays off mostly at SMALL npl, not the npl128 benchmark point. | M1 (down_proj quantize retune) cheap; M2 a decode-graph-capture project (large). | M1 strictly bit-exact (byte-identical quantized output); M2 bit-exact (replay). Helps the DEFAULT. | Real but **bounded**; no clean standalone kernel win. Keep M1 as a cheap bank-shot; fold M2 into a decode-graph effort. |
| **B** cublas lm_head (nvjet) | Bit-exact ceiling **<1%** (~3 ms; nvjet already ~72% of peak HBM, the step's most efficient major kernel). The only big win (NVFP4 head ~2.5%) is non-bit-exact AND unfair vs vLLM (which keeps BF16). | Custom skinny-GEMM = high effort, uncertain it beats cublas. | Bit-exact option caps <1%; the 2.5% option is a logits change (opt-in only). | **Dead end** for the default. Rank LAST. |
| **C** TTFT / paged-pool burst | **Part 2 bug:** restores prefill from collapsed 65 -> ~507 t/s after a burst (removes the "restart per npl" crutch). **Part 1 scheduler:** the headline - 903 s/213 s burst TTFT -> vLLM 6-18 s regime. | **Part 2: small + localized** (paged-alloc unit + 1 seq_rm call site). Part 1: larger (fairness + admission + tuning). | Both gated behind `LLAMA_KV_PAGED`, **default-off byte-identical**. SERVING track (doesn't touch the parity-default numerics). | **Highest ROI x tractability.** Part 2 is a true correctness defect with a crisp repro. |
| **D** dense CUDA-graph instability | **Zero** - f32 dense measured STABLE (<1% spread, no bimodality). The 287/336/487/498 bimodality was the SHELVED BF16 SSM path, not the shipped f32 path. | n/a (would need a rebuild for a graphs-off A/B). | n/a | **CLOSED.** Not worth any work. |
### Ranked order (ROI x tractability x bit-exactness)
1. **C-Part2 - paged-pool burst-degradation bug fix.** Small, localized, default-off byte-identical,
crisp pass/fail repro, removes a real production-serving defect + the benchmark's restart crutch.
2. **C-Part1 - chunked-interleave / fair dynamic budget.** The public-facing TTFT headline closer,
but a larger effort and a deliberate TTFT-vs-ITL tradeoff. Do it AFTER the bug fix (the scheduler
reduces burst pain; the bug fix keeps the pool alive across bursts).
3. **A-M1 - down_proj activation-quantize kernel retune** (cheap bit-exact bank-shot for the default;
bounded payoff on this bandwidth-bound model). Optionally folded with a future decode-graph build
(A-M2 / the shared MoE+GDN decode-graph capture), which is the only structurally large bit-exact
MoE lever but a big project that helps small-npl more than npl128.
4. **B - lm_head kernel swap.** Bit-exact ceiling <1% with real risk. Skip unless a non-bit-exact
opt-in track opens (then NVFP4-head ~2.5% dense, gated, excluded from parity claims).
5. **D - dense graph instability.** Closed, no work.
### THE FIRST BUILD TARGET: paged-pool burst-degradation bug fix (C-Part2)
**Why this one:** it is the only candidate that is simultaneously (a) high ROI - fixes a real
correctness defect that forces the "restart per npl" crutch in long-lived serving, (b) tractable -
small and localized to the paged-alloc unit plus one `seq_rm` call site, (c) safe for the parity
claim - gated behind `LLAMA_KV_PAGED`, default-off byte-identical, and (d) verifiable with a crisp
pass/fail repro. Every bit-exact-default alternative is bounded (A), a dead end (B), or closed (D).
**Implementation plan (incremental, each step independently shippable):**
1. **Fix-1 (core):** add `paged::PagedKVManager::truncate(seq, n_keep)` that ref-count-frees the
trailing blocks beyond block index `ceil(n_keep/bs)`; expose
`paged_alloc::truncate(cache, stream, seq, n_keep)`; call it from `llama_kv_cache::seq_rm` for the
`p1 == MAX && p0 > 0` case (ideally any `[p0,p1)`). Closes the reclamation gap so manager
accounting tracks the kv-cache exactly.
2. **Fix-2:** defrag-on-empty - when a stream reaches `get_used() == 0`, rebuild its free queue to
pristine contiguous order.
3. **Fix-3:** paged release at server `slot.release()` so finished-idle sequences return blocks
promptly.
4. **Fix-4 (optional):** best-fit / contiguous-run preference in `get_new_blocks` + a defrag pass
before the find_slot stock fallback.
**Confirming measurement (the explicit repro, GPU-bound):** on ONE long-lived server, run an npl64
burst, let it drain, then run npl8. PASS if (i) npl8 `prefill_tps` is within ~10% of a fresh-server
npl8 (vs the ~65 vs ~507 collapse today), and (ii) `paged_alloc::num_free` returns to the
fresh-start value after the burst drains (proves no leaked blocks). Decode t/s must be unchanged.
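
The pass criteria above can be written down as a small check. This is a sketch under assumptions: the function name and argument names are hypothetical, the measurements would come from the benchmark harness, and the 1% decode tolerance is an illustrative choice for "unchanged".

```python
def burst_repro_passes(fresh_prefill_tps, post_burst_prefill_tps,
                       fresh_num_free, post_burst_num_free,
                       fresh_decode_tps, post_burst_decode_tps,
                       prefill_tol=0.10, decode_tol=0.01):
    """Evaluate the npl64-then-npl8 repro against a fresh-server npl8 run."""
    # (i) post-burst prefill throughput within ~10% of the fresh-server value
    prefill_ok = abs(post_burst_prefill_tps - fresh_prefill_tps) <= prefill_tol * fresh_prefill_tps
    # (ii) the free-block count returns to the fresh-start value (no leak)
    pool_ok = post_burst_num_free == fresh_num_free
    # decode throughput must be unchanged (decode_tol is an assumed noise band)
    decode_ok = abs(post_burst_decode_tps - fresh_decode_tps) <= decode_tol * fresh_decode_tps
    return prefill_ok and pool_ok and decode_ok

# Today's reported collapse (507 -> 65 t/s prefill) fails the check;
# the block counts and decode numbers here are placeholders.
print(burst_repro_passes(507.0, 65.0, 1024, 1024, 100.0, 100.0))
```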
**Bit-exact gate it MUST pass:**
- With `LLAMA_KV_PAGED` unset, the build is byte-identical to HEAD f7409c2 (the fix lives entirely
inside the paged path) - `test-backend-ops` + the greedy-decode md5 against the 0023 baseline are
unchanged.
- With `LLAMA_KV_PAGED` set, the fix changes only block ACCOUNTING and PLACEMENT, never KV values or
compute, so the greedy-decode md5 on a fixed prompt is identical before vs after the fix (and the
post-burst run produces the same tokens as a fresh-server run).
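
A greedy-decode md5 comparison of this kind might look as follows. The serialization (token ids joined as decimal text) and both function names are assumptions for illustration; the actual gate script and its hashing input are not shown in this document.

```python
import hashlib

def decode_md5(token_ids):
    """md5 over a greedy-decode token sequence (ids joined as decimal text)."""
    data = ",".join(str(t) for t in token_ids).encode("utf-8")
    return hashlib.md5(data).hexdigest()

def gate_passes(baseline_tokens, candidate_tokens):
    """True when the candidate run reproduces the baseline token-for-token."""
    return decode_md5(baseline_tokens) == decode_md5(candidate_tokens)

baseline = [101, 2009, 2003, 1037, 3231]   # placeholder ids, not real output
print(gate_passes(baseline, list(baseline)))
```

Any single differing token changes the digest, so one comparison covers the whole sequence on a fixed prompt and seed.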
**Paths NOT worth building now:** B (lm_head, sub-1% bit-exact ceiling, the only big win is a
non-bit-exact unfair-vs-vLLM logits change), and D (dense graph instability, measured stable -
closed). A's standalone kernel value is bounded; keep A-M1 as a cheap follow-up and fold A-M2 into a
later decode-graph project, but it is not the first target.
**First target: ship the paged-pool burst-degradation bug fix (C-Part2, Fix-1 + Fix-2 + Fix-3).**
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,305 +0,0 @@
# P1 results: dynamic decode-first prefill-token budget (patch 0016)
Implements **P1** of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`: replace patch 0013's
**static** per-step prefill cap with a **dynamic, decode-first** token budget in
`tools/server/server-context.cpp::update_slots()`. Policy change only, zero
libllama changes, default-off byte-identical. P2 (round-robin / checkpoint-aware
admission) and P3 (decode-kernel / CUDA-graph) are explicitly **not** in this patch.
## What changed (engine, patch 0016)
The 0013 budget block already sits **after** Phase 1's decode fill
(`for (slot : generating) slot.update_batch(batch)`, lines 2716-2720), so at that
point `batch.n_tokens == D` is the live decode load. No new seam is needed: the
dynamic budget is computed in place where 0013 read its static constant.
| seam (post-0015 line) | before (0013) | after (0016) |
|---|---|---|
| budget block @2737-2747 | `n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))` (static constant) | `D = batch.n_tokens`; `T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch)`; `prefill_budget_step = max(n_ubatch, T - D)`; `prefill_cap_per_slot = clamp(min(T, ceil(0.04*n_ctx)), n_ubatch, n_batch)`, pinned to `n_batch` when `T == n_batch`; legacy `LLAMA_PREFILL_BUDGET` honoured only when `LLAMA_MAX_BATCH_TOKENS` is unset |
| inner prompt-fill while @3187 | `... && batch.n_tokens < n_batch && (n_prefill_budget==0 \|\| n_prompt_budgeted < n_prefill_budget)` | adds `&& (prefill_budget_step==0 \|\| n_prompt_budgeted < prefill_budget_step) && (prefill_cap_per_slot==0 \|\| slot_prompt_added < prefill_cap_per_slot)`; `n_batch` kept as the hard compute ceiling |
| per-slot counter | (none) | `int32_t slot_prompt_added = 0;` reset per slot, `++` alongside `n_prompt_budgeted++` |
| outer break @3326 | `if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) break;` | `if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) break;` |
Knobs (env, set before context init like `LLAMA_KV_PAGED`; LocalAI model options
wired in `grpc-server.cpp` beside `max_prefill_tokens`):
- `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step
token budget `T` (decode + prefill), the vLLM `max_num_batched_tokens` analogue.
Default `n_batch`, clamped `[n_ubatch, n_batch]`.
- `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot prompt-chunk cap, the
`long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04*n_ctx))`
floored at `n_ubatch`. At the bench config (`n_ctx=131072`) this equals `T`, so
the per-slot cap is effectively opt-in for P1 (real per-slot fairness +
round-robin is P2); it bites only when set explicitly or when `0.04*n_ctx < T`.
- `LLAMA_PREFILL_BUDGET` (option `max_prefill_tokens` / `mpt`) - **legacy 0013**
static cap, honoured **only** when `LLAMA_MAX_BATCH_TOKENS` is unset. 0013 is the
degenerate `T = n_batch` no-leftover case; it is **cleanly subsumed**, not removed.
## Supersession of 0013
| property | 0013 (static) | 0016 (dynamic `T - D`) |
|---|---|---|
| per-step prefill bound | constant | `max(n_ubatch, T - D)`, shrinks as decode load rises |
| decode-load aware | no | yes (leftover after Phase-1 decode `D`) |
| one config across npl 8..128 | no (256 best @128, net-negative @8) | yes (self-tuning) |
| long-prompt monopoly guard | no | per-slot `slot_prompt_added` cap |
| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - kept |
| legacy knob | `LLAMA_PREFILL_BUDGET` | preserved when dynamic knob unset |
## Determinism / byte-identical analysis (verified by construction)
The hard ceiling `batch.n_tokens < n_batch` is **kept** in the inner loop (not
replaced by `< T`). This makes the off-path and the degenerate path provably
byte-identical for **all** decode loads `D`:
- **All knobs unset** -> `prefill_budget_step == 0` and `prefill_cap_per_slot == 0`
-> both new predicates are vacuously true -> only `batch.n_tokens < n_batch`
binds -> **bit-for-bit stock**. The outer break is `prefill_budget_step > 0`
guarded, so it never fires. Identical to 0013's off-path by construction.
- **Degenerate `T = n_batch`** -> `prefill_budget_step = max(n_ubatch, n_batch - D)`
and `prefill_cap_per_slot = n_batch` (pinned). The budget bound
`n_prompt_budgeted < n_batch - D` is equivalent to `batch.n_tokens < n_batch`
(since `batch.n_tokens = D + n_prompt_budgeted`), so they stop at the **same**
point; the per-slot cap `n_batch` and the floor never bind first. When `D` is so
large that `n_batch - D < n_ubatch`, the kept `batch.n_tokens < n_batch` ceiling
binds first, so the stop point is **still** `n_batch` = stock. Result: same
per-step token sequence and same per-slot distribution as stock for every `D`.
- **Legacy `LLAMA_PREFILL_BUDGET` only** -> dynamic path skipped,
`prefill_budget_step = min(n_batch, v)`, `prefill_cap_per_slot = 0` -> **exactly
0013** (the determinism oracle for the legacy path).
- **`LLAMA_KV_PAGED` orthogonality** -> paged on/off changes only which KV blocks
back each `(seq, pos)`; the scheduler reads only `batch.n_tokens`, slot states,
and `n_ctx`/`n_batch`/`n_ubatch` - none paged-dependent. Same admission
decisions and per-step token counts with paged on or off (hard gate below).
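
The degenerate-path argument can be brute-forced over every decode load. The sketch below is a closed-form model of the inner-loop stop point for a single slot with an ample prompt, assuming the predicates as described above; the function names are hypothetical.

```python
import math

def stock_prompt_tokens(n_batch, D, avail):
    """Stock: prompt tokens admitted while batch.n_tokens < n_batch."""
    return min(avail, max(0, n_batch - D))

def budgeted_prompt_tokens(n_batch, n_ubatch, n_ctx, T, D, avail):
    """0016: the same hard ceiling plus the per-step budget and per-slot cap."""
    budget = max(n_ubatch, T - D)
    cap = min(max(min(T, math.ceil(0.04 * n_ctx)), n_ubatch), n_batch)
    if T == n_batch:
        cap = n_batch  # pinned in the degenerate case
    ceiling = max(0, n_batch - D)  # the kept batch.n_tokens < n_batch bound
    return min(avail, ceiling, budget, cap)

n_batch, n_ubatch, n_ctx = 2048, 512, 131072
mismatches = [
    D for D in range(0, n_batch + 1)
    if budgeted_prompt_tokens(n_batch, n_ubatch, n_ctx, n_batch, D, 10**6)
    != stock_prompt_tokens(n_batch, D, 10**6)
]
assert mismatches == []  # T == n_batch stops at the stock point for every D
```

The floor `max(n_ubatch, ...)` never binds first because the kept ceiling `n_batch - D` is always the smaller bound when `T == n_batch`. A non-degenerate `T` (for example 512) does diverge from stock, as intended.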
## Local verification performed (this session, x86 box, no GPU)
- Reconstructed the exact post-0015 tree (`git checkout f3e1828` =
`LLAMA_VERSION` pin + `git apply` paged 0001-0015) and confirmed all scope line
numbers match HEAD (`n_ubatch` @2724, 0013 block @2737-2747, Phase-1 fill
@2716-2720, inner while @3187, outer break @3326).
- Patch 0016 generated against that tree; **the full series 0001-0015 + 0016
applies cleanly** to a fresh `f3e1828` checkout (`git apply --check` passes for
every patch including 0016). Stat: `1 file changed, 85 insertions(+), 22
deletions(-)`.
- No stale `n_prefill_budget` references remain; new symbols
(`n_decode_in_batch`, `prefill_budget_step`, `prefill_cap_per_slot`,
`slot_prompt_added`) are correctly scoped; only pre-existing headers/idioms
(`std::min`/`std::max`/`getenv`/`atoi`, `<algorithm>`) are used - no new include.
- Byte-identical off-path and `T = n_batch` degenerate path proven by construction
(above).
## Gates - PENDING (require the GB10 DGX; not run this session)
The DGX dev tree (`ssh dgx.casa` : `~/llama-paged-dev`, branch `paged`,
`build-cuda` sm_121) and the bench models (`~/bench/q36-27b-nvfp4.gguf`,
`~/bench/q36-35b-a3b-nvfp4.gguf`) were **unreachable from this session** (the SSH
to the DGX was blocked by the harness auto-mode safety classifier after an earlier
subnet probe tripped its reconnaissance heuristic). The build + the four gates +
the A/B sweep below were therefore **not executed**. Numbers must be filled by a
re-run on the DGX (or with `ssh dgx.casa` allowlisted). Methodology is locked here
so the re-run is mechanical.
Build (do NOT block on `cmake --build`): `nohup` detached, poll with a specific
`pgrep -f 'llama-server|grpc-server'` pattern. Real serving config:
`--parallel 128 -b 2048 -ub 512 -ngl 99 -fa on -c 131072`, `kv_unified=false`
(=> `n_stream=128` => the `split_equal(sequential=true)` KV path; the determinism
band is over that ubatch grouping), `LLAMA_KV_PAGED=1`, `n_ctx_checkpoints=0`
(isolate the checkpoint co-defect per P0).
| # | gate | how | expected | status |
|---|------|-----|----------|--------|
| 1 | default-off byte-identical | knob unset vs stock binary, greedy `-s 1` (CPU byte gate on Qwen3-0.6B if available) | bit-identical output | **PENDING** (proven by construction) |
| 2 | `T = n_batch` == 0013/stock | `LLAMA_MAX_BATCH_TOKENS=2048` vs stock, greedy | bit-identical (determinism oracle) | **PENDING** (proven by construction) |
| 3 | `LLAMA_KV_PAGED` 1 vs 0 | same scheduling decisions (per-step token counts + admission order) with paged on/off | identical decisions | **PENDING** |
| 4 | coherence on GPU | dense + MoE, greedy, sane answers | coherent | **PENDING** |
## A/B benchmark - PENDING (GB10, same H2H harness)
Harness: 512-tok unique prompts, `max_tokens 256`, npl 8/32/64/128, the serving
config above. Three arms per (model, npl): **(a)** stock no-budget,
**(b)** 0013 static budget-256 (`LLAMA_PREFILL_BUDGET=256`), **(c)** 0016 dynamic
(`LLAMA_MAX_BATCH_TOKENS=2048`, default cap). Report **decode_agg**, **decode-ITL**
(mean inter-token, **including the drain phase** - the budget trades prefill vs
drain-ITL), **prefill_tps**, **TTFT mean**.
Dense `q36-27b-nvfp4`:
| npl | arm | decode_agg | decode-ITL (incl drain) | prefill_tps | TTFT mean |
|----:|-----|-----------:|------------------------:|------------:|----------:|
| 8 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
| 32 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
| 64 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
| 128 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
MoE `q36-35b-a3b-nvfp4`: same table, **PENDING**.
Reference ceilings to validate against (from `QWEN36_NVFP4_BENCH.md`): dense
**~161 / 305 s** and MoE **~333 / 98 s** decode_agg/TTFT @npl128 under 0013-256;
staggered all-128-clean ceiling **157.4** dense.
### Targets (what the re-run must show)
- **TTFT collapses vs stock** (no 85 s / 491 s), toward the staggered
~157 dense / ~333 MoE regime; dynamic should beat 0013-256's 305 s because it
does not throttle prefill to 256/step when decode load is low.
- **Ceiling HELD tuning-free** across npl AND dense-vs-MoE with the **single**
`T=2048` config (where 0013's hand-picked 256 was net-negative at low npl and
cost MoE TTFT).
- **No low-concurrency regression** at npl8 vs stock.
- **Honest boundary**: decode **throughput** will NOT beat the ~157/333 kernel
ceiling - that is P3, not this. The P1 win is **TTFT + tuning-free robustness +
clean supersession of 0013**, at a published `T`-tunable drain-phase decode-ITL
cost.
## Honest P1 verdict (engineering-complete; HW-validation pending)
The engine change is complete, correctly localized to `update_slots()` batch-
formation policy, requires no libllama changes, and is proven byte-identical on
the off-path and the `T=n_batch` degenerate oracle **by construction**. It cleanly
supersedes 0013 (legacy knob preserved). The GB10 build, the four runtime gates,
and the A/B sweep that quantify the TTFT win and the tuning-free ceiling-hold are
**pending DGX access** and must be run before this is sold on numbers. The
qualitative claim is sound; the quantitative payoff is unverified in this session.
## Staggered-arrival evaluation
Ran on the GB10 DGX (`dgx.casa`, dev tree `~/llama-paged-dev` @ `253cbae`, patch
0016 BUILT, `build-cuda` sm_121). The prior all-at-once **BURST** H2H (all N
requests at t=0) is structurally adversarial to *any* prefill budget: under a
burst, TTFT is prefill-rate-bound, so a per-step prefill cap can only slow the
drain. That burst showed 0016 ~= 0013, no win. A **STAGGERED** arrival (requests
trickle in while others are already decoding) is the regime 0016 is designed for:
when a new prefill arrives, the decode-first budget should keep the
already-decoding slots flowing (low/flat inter-token latency) while the new
prefill takes only the leftover `T - D`. This section measures exactly that.
### Harness (staggered client, dev-tree-only)
`~/bench/stagger_cli.py` issues N requests at a **fixed inter-arrival rate** (not
all at once) against `/v1/completions`, `stream=true`, `temperature 0`,
`ignore_eos`, 512 unique-prefix tokens per prompt (unique leading token defeats
prefix caching). It records, per request, the send time, the TTFT, and the
absolute timestamp of **every** generated token (full ITL series); raw dumps go to
`~/bench/stag_*/raw_*.json`, analysed by `~/bench/stagger_agg.py`. Server flags are
**identical to the prior H2H** (`abrun.sh`): `--parallel 128 -b 2048 -ub 512 -ngl
99 -fa on -c 131072 --no-kv-unified` with `LLAMA_KV_PAGED=1` (verified
`n_ctx_seq=1024`, i.e. `n_stream=128` per-sequence KV, kv_unified=false; checkpoints
at the default max=32, identical across all arms). Three to four arms per model,
**env-only** difference, sequenced on the single GPU with PID-file stop between
arms: **stock** (no knobs), **0013** static (`LLAMA_PREFILL_BUDGET=256`), **0016**
dynamic (`LLAMA_MAX_BATCH_TOKENS=512`, and `1024`).
**Metric definitions.** *Arrival window* = `[first send, last send]`. *In-window
ITL* = inter-token gaps whose token lands inside the arrival window = the ITL seen
by already-decoding slots **while new prefills are still arriving** -> the
decode-protection metric (mean/p95/max). *freezes >Ns* = count of in-window gaps
exceeding N seconds (decode stalls caused by a prefill admission). *TTFT* =
first-token latency per newly-arriving request. *decode agg* = total generated /
decode span (a staggered-run aggregate, **not** the saturated kernel ceiling; it
is depressed by the arrival ramp + checkpoint overhead and is not the P1 figure of
merit). *wall* = last token - first send.
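
These definitions translate into a short aggregation routine. This is a sketch, not the contents of `stagger_agg.py`: the record layout (`send`, `tokens`), the function names, and the nearest-rank p95 are assumptions.

```python
import math

def in_window_gaps(requests):
    """Inter-token gaps whose later token lands inside [first send, last send]."""
    first_send = min(r["send"] for r in requests)
    last_send = max(r["send"] for r in requests)
    gaps = []
    for r in requests:
        ts = r["tokens"]  # absolute timestamps of every generated token, seconds
        for prev, cur in zip(ts, ts[1:]):
            if first_send <= cur <= last_send:
                gaps.append(cur - prev)
    return gaps

def summarize(gaps, thresholds=(1.0, 2.0)):
    """Mean / p95 / max plus freeze counts; None when no token lands in the window."""
    if not gaps:
        return None  # e.g. the near-burst case, where the metric is empty
    ordered = sorted(gaps)
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]  # nearest-rank percentile
    return {
        "mean": sum(ordered) / len(ordered),
        "p95": p95,
        "max": ordered[-1],
        "freezes": {t: sum(1 for g in ordered if g > t) for t in thresholds},
    }

def ttft(request):
    """First-token latency of one request."""
    return request["tokens"][0] - request["send"]

reqs = [
    {"send": 0.0, "tokens": [1.0, 1.5, 3.0, 4.5]},  # 1.5 s stall ends inside the window
    {"send": 3.0, "tokens": [6.0, 6.5]},
]
stats = summarize(in_window_gaps(reqs))
print(stats["freezes"])
```

In the toy data the arrival window is `[0.0, 3.0]`, so only the first request's gaps ending at 1.5 s and 3.0 s count; the second of those is a freeze above 1 s.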
### Dense `q36-27b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival (~19 s window) - the discriminating regime
| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 1494 / 2691 / 2693 | 45 / 35 | 26891 / 46083 | 94.1 | 174.4 |
| 0013 (pb256) | 527 / 640 / 650 | 0 / 0 | 44763 / 90338 | 81.2 | 201.8 |
| 0016 (mbt512) | 730 / 897 / 901 | 0 / 0 | 33320 / 66595 | 88.4 | 185.8 |
| 0016 (mbt1024) | 1320 / 2050 / 2051 | 46 / 5 | 33402 / 62636 | 72.4 | 226.8 |
**Read:** stock's in-flight decoders **freeze ~2.7 s** every time a new prefill is
admitted (35 freezes >2 s, in-window p95 2691 ms). Both small-cap budget arms
(0013, mbt512) keep the in-flight ITL **flat and spike-free** (0 freezes >1 s).
`mbt512` beats `0013` on **TTFT** (p95 66.6 s vs 90.3 s, mean 33.3 s vs 44.8 s),
**throughput** (88.4 vs 81.2) and **wall** (186 s vs 202 s) at the same spike-free
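protection. `mbt1024` admits bigger prefill chunks, so it reintroduces spikes (46
freezes >1 s, 5 >2 s) for only a marginal TTFT p95 gain (62.6 s vs 66.6 s; the mean is
unchanged at 33.4 s vs 33.3 s) and lower throughput (72.4 vs 88.4) -> the per-step
prefill-chunk size is the protection/TTFT dial.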
### Dense, light load: 32 reqs, max_tokens 64, 400 ms inter-arrival (~12 s window) - non-saturated control
| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 810 / 2324 / 2324 | 25 / 15 | 10604 / 18872 | 49.0 | 42.3 |
| 0013 (pb256) | 443 / 572 / 607 | 0 / 0 | 18608 / 38347 | 38.0 | 54.7 |
| 0016 (mbt512) | 597 / 858 / 863 | 0 / 0 | 14506 / 28055 | 43.9 | 47.4 |
Same shape with shorter, churning requests: stock 15 freezes >2 s, both budget
arms 0; `mbt512` again beats `0013` on TTFT (p95 28.1 s vs 38.3 s), throughput and
wall at equal protection.
### MoE `q36-35b-a3b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival
| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 706 / 1146 / 1148 | 132 / 0 | 2774 / 5105 | 202.4 | 81.1 |
| 0013 (pb256) | 194 / 273 / 280 | 0 / 0 | 18205 / 36023 | 170.3 | 96.5 |
| 0016 (mbt512) | 275 / 366 / 373 | 0 / 0 | 11940 / 22453 | 191.4 | 85.8 |
MoE decode is ~2x faster (3 B active), so the baseline ITL is ~240 ms and stock's
prefill freezes are shorter (~1.1 s, 132 of them >1 s, none >2 s) but **still
present**; budget arms hold the in-flight ITL near baseline (p95 273-366 ms).
`mbt512` again beats `0013` at equal freeze-free protection (TTFT mean 11.9 s vs 18.2 s,
p95 22.5 s vs 36.0 s, throughput 191.4 vs 170.3 tok/s, wall 85.8 s vs 96.5 s). Because MoE prefill is cheap, **stock's
TTFT is far lower** (2.8 s mean) - the TTFT cost of decode protection is most
visible here.
### Near-burst control: dense, 64 reqs, 150 ms inter-arrival (~9.5 s window)
At 150 ms the 64 prompts pile in faster than the ~94-127 tok/s drain, so the run
degenerates into a **burst** (window 9.5 s << per-request TTFT of 240-308 s; no
token lands inside the window, so the in-window protection metric is empty). This
reproduces the prior burst null: TTFT stock 267 s / 0013 291 s / mbt512 279 s /
mbt1024 240 s, decode agg 127 / 102 / 106 / 122, wall 401 / 443 / 432 / 375 s -
budget ~= stock: stock has the best throughput and beats 0013 and mbt512 on TTFT; only
mbt1024 edges it on TTFT and wall. This is the control, not 0016's target regime.
### Structural note (intellectual honesty)
At `T = 512 = n_ubatch`, `prefill_budget_step = max(n_ubatch, T - D) = 512`
**constant**, so `mbt512` behaves as a *static* 512-token prefill cap - the dynamic
floor binds and the `T - D` term never bites. Its edge over `0013`'s 256 is
therefore mostly "a larger, `n_ubatch`-aligned cap", not the adaptivity per se. The
genuine decode-adaptive `T - D` is exercised only at `T >= 1024` (`mbt1024`:
prefill chunk ~`1024 - D`, auto-shrinking as decode load `D` rises). Across all
settings the per-step prefill-chunk size is a clean, monotonic protection/TTFT
dial: 256 (0013) -> 512 (mbt512) -> ~960 (mbt1024) trades flatter decode for lower
TTFT. The distinctive value of the dynamic budget is the **safety property**: it
lets you set a *high* `T` for low-load TTFT while guaranteeing the per-step token
count auto-shrinks so decode is never starved when load rises - which is precisely
what stock lacks (stock = unbounded prefill chunk = the freezes).
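The floor-versus-adaptive behaviour described above can be checked with a small sketch. It assumes only the formula quoted in this note, `max(n_ubatch, T - D)`; the function and argument names are illustrative and are not taken from the patch.

```python
def prefill_budget_step(max_batch_tokens: int, n_ubatch: int, n_decoding: int) -> int:
    """Per-step prefill token budget, max(n_ubatch, T - D).

    max_batch_tokens is T, n_decoding is D (slots already decoding).
    Names are hypothetical; only the formula comes from the note.
    """
    return max(n_ubatch, max_batch_tokens - n_decoding)


if __name__ == "__main__":
    n_ubatch = 512
    for T in (512, 1024):
        for D in (0, 16, 64):
            # At T == n_ubatch the floor always binds; at T = 1024 the
            # T - D term shrinks as decode load rises.
            print(f"T={T} D={D} -> budget {prefill_budget_step(T, n_ubatch, D)}")
```

With `D = 64` and `T = 1024` this yields 960, which matches the "~960 (mbt1024)" figure quoted above, while every `T = 512` row stays at 512.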
### Verdict (honest)
- **Does 0016 keep the in-flight decoders' ITL low/flat when new prefills arrive,
vs stock's spikes?** **Yes, decisively, on staggered traffic.** Stock's
already-decoding slots freeze on every prefill admission (dense: 35 freezes >2 s,
in-window ITL p95 2.7 s; light: 15 >2 s; MoE: 132 >1 s). Every budget arm
(0013, mbt512) eliminates them (0 freezes >1 s, flat in-window ITL). This is the
real P1 win and it shows **only** under staggered arrival, never under the burst.
- **Does it bound new-request TTFT?** Relative to **0013**, yes (26-38 % lower TTFT
across dense and MoE). Relative to **stock**, **no** - stock has the lowest TTFT
precisely because it lets prefill stampede the decoders (that stampede *is* the
freeze). New-req TTFT vs in-flight ITL is a genuine Pareto tradeoff, not a free
lunch; this does not manufacture a TTFT-beats-stock claim.
- **Does the dynamic budget beat BOTH stock AND 0013, or is it ~= 0013 here too?**
It **does not tie 0013 here** (unlike the burst): at `T=512`, 0016 sits at a
strictly better point on the protection/TTFT frontier than 0013-256 (equal
spike-free protection, materially lower TTFT/throughput/wall), and it adds a
principled, decode-adaptive, single-`T` way to move along that frontier (one
config across dense and MoE) that 0013's hand-picked 256 cannot. It does **not**
strictly dominate stock: 0016 wins decode smoothness (no multi-second freezes),
stock wins raw TTFT/throughput. Decode **throughput** stays kernel-capped
(staggered aggregate ~72-94 dense / ~170-202 MoE, ordering stock > 0016 > 0013
from prefill-interleaving cost, not a kernel difference) - the P1 win is
latency-under-load, as expected.
**Bottom line:** 0016 **earns its keep over 0013 on staggered traffic** - same
spike-free decode protection at a strictly better TTFT/throughput/wall point, plus
a decode-adaptive knob that holds one config across loads and model types. Against
stock it is a deliberately different operating point that trades a few seconds of
new-request TTFT to remove the multi-second in-flight decode freezes stock cannot
avoid. Keep 0016; recommend `LLAMA_MAX_BATCH_TOKENS=512` as the default
protective setting and higher `T` when low-load TTFT matters more than ITL
flatness.


@@ -1,107 +0,0 @@
# Paged-KV: GPU 0007 re-run + shared-prefix throughput benchmark
DGX Spark (NVIDIA GB10, sm_121 / cc 12.1), CUDA 13, dev tree `~/llama-paged-dev`
branch `paged`, base pin `f3e182816421c648188b5eab269853bf1531d950`, full paged
engine (0001-0004, 0006, 0007). All paged behaviour stays gated by
`LLAMA_KV_PAGED`; default-off is byte-identical to stock. Models:
`Qwen3-0.6B-Q8_0.gguf` and `Qwen3-32B-Q4_K_M.gguf`.
## Deliverable 1 - GPU run of the 0007 prefix-engine correctness driver
The committed driver `examples/simple/paged-prefix-engine.cpp` hardcodes
`n_gpu_layers = 0`. For this GPU run it was given a dev-only
`PAGED_NGL` env override (`mp.n_gpu_layers = getenv("PAGED_NGL") ? atoi(...) : 0`),
rebuilt in `build-cuda`, run, then the edit was **reverted** so the committed
driver stays byte-clean (it is dev scaffolding, never shipped in a patch).
Three runs of the same Gate-0 driver, Qwen3-0.6B, `LLAMA_KV_PAGED=1`:
| binary / offload | result |
|------------------------------------------|-------------------------|
| committed `build-cpu` driver | **ALL PASS (failures=0)** |
| `build-cuda`, `PAGED_NGL=99` (all layers)| GATE FAILED (failures=3)|
| `build-cuda`, `PAGED_NGL=0` (same binary)| GATE FAILED (failures=2)|
**The GPU run did NOT print ALL PASS - reported honestly.** But the failures are
narrow and are not a paged-engine bug:
- Every **structural / mechanical** paged invariant PASSES on GPU, in both
scenarios (boundary and mid-block): prefill computed ONLY the suffix (32 prefix
tokens skipped), shared prefix block-aligned, shared-block `ref_cnt == 2` while
both sequences hold it, ref drops `2 -> 1` on freeing one sharer, only the
private (suffix) blocks are returned, and the prefix block returns to the pool
once all sharers free. The cross-request KV reuse mechanism itself is GPU-clean.
- The only failures are the **exact greedy-token byte-identical** assertions
(e.g. boundary `B-shared` vs `B-from-scratch`). They diverge at a single near-tie
token (boundary: 2nd generated token `17971` vs `5671`) and then cascade
autoregressively.
Root cause is **CUDA float-kernel non-determinism, not the paged logic**: the
*same* CUDA binary fails the exact-token assertions even with `PAGED_NGL=0` (zero
layers offloaded), whereas the genuine `build-cpu` binary passes all 16/16. The
CUDA backend (loaded via `ggml_backend_load_all`) uses non-associative reductions
whose result differs between the full-prefill batch shape and the
incremental-suffix batch shape; under greedy decode a single logit near-tie flips
and the sequences cascade apart. This refines the earlier note in
`PAGED_GPU_VERIFY.md` (which framed it as "not GPU-specific" and had no CPU pass
to compare against): the CPU build now passes clean, so the divergence is a strict
test-assertion artefact of CUDA float ordering, not a defect in 0006/0007.
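A minimal illustration of the mechanism, in plain Python rather than CUDA: floating-point addition is not associative, so the same terms reduced in a different grouping can differ in the last bit, and a greedy pick that sits on a near-tie can flip. The `rival` logit is constructed artificially to sit exactly on the tie; it is not data from the run.

```python
def greedy(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])


terms = [0.1, 0.2, 0.3]
full = (terms[0] + terms[1]) + terms[2]    # one reduction order
split = terms[0] + (terms[1] + terms[2])   # same terms, different grouping

# Hypothetical competing token whose logit sits exactly on the near-tie.
rival = max(full, split)

tok_full = greedy([full, rival])
tok_split = greedy([split, rival])

if __name__ == "__main__":
    print("sums differ:", full != split)
    print("greedy pick (full grouping): ", tok_full)
    print("greedy pick (split grouping):", tok_split)
```

Once one token differs, every later step conditions on a different history, which is the autoregressive cascade the driver reports.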
## Deliverable 2 - shared-prefix throughput benchmark (the real-win test)
Dev-only driver `examples/simple/paged-prefix-bench.cpp` (registered in
`examples/simple/CMakeLists.txt`, dev tree only - not in any shipped patch).
Workload: `K` sequences that all share a `P`-token common prefix (a system /
RAG preamble), each with a unique `S`-token suffix; prefill only (`G=0`,
generation is identical compute in both modes so it is excluded from the
headline). GPU, `-ngl 99`, `kv_unified = true`.
- **NO-SHARE (stock):** `LLAMA_KV_PAGED` unset; every sequence prefills the full
`P+S` tokens. Total prefill work `= K*(P+S)`.
- **PAGED-SHARE:** `LLAMA_KV_PAGED=1`; the prefix is computed ONCE on seq 0,
committed via `paged_prefix_api::commit`, then every other seq calls
`paged_prefix_api::share` to physically reuse the ref-counted prefix blocks and
prefills ONLY its suffix. Total prefill work `= P + K*S`.
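The two work formulas above can be evaluated directly. This is a sketch of the arithmetic only; `prefill_tokens` is an illustrative helper, not a function from the bench driver.

```python
def prefill_tokens(k: int, p: int, s: int, share: bool) -> int:
    """Total prefill tokens for K sequences with a P-token shared prefix
    and S-token unique suffixes."""
    if share:
        return p + k * s      # prefix computed once, K suffixes
    return k * (p + s)        # every sequence recomputes the prefix


if __name__ == "__main__":
    for k in (16, 32):
        no_share = prefill_tokens(k, 1024, 32, share=False)
        shared = prefill_tokens(k, 1024, 32, share=True)
        print(f"K={k}: no-share {no_share}, share {shared}, "
              f"{no_share / shared:.1f}x fewer tokens")
```

For `P=1024, S=32` this gives 16896 vs 1536 tokens at `K=16` and 33792 vs 2048 at `K=32`.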
**`kv_unified` note:** this engine's cross-request share is built around the
*unified* stream-0 pool (ref-counted shared cells), so `kv_unified = true` is what
makes the share engage - the same setting the committed 0007 driver uses. With
`kv_unified = true` the share engaged in every run (evidence below).
### Reuse actually engaged (share mode)
In every share run: `kshare(seq 1) = 1024` (the full block-aligned prefix is
reused, not recomputed), the shared prefix block's `ref_cnt == K` (all sharers
point at one physical copy), and `prefill_tokens_submitted` collapses from
`K*(P+S)` to `P + K*S`.
### Results (P=1024, S=32, prefill-only)
| model | K | mode | prefill tokens | prefill time | raw tok/s | shared ref_cnt |
|--------------|----|-----------|----------------|--------------|-----------|----------------|
| Qwen3-0.6B | 32 | no-share | 33792 | 4.659 s | 7253 | - |
| Qwen3-0.6B | 32 | **share** | 2048 | **0.554 s** | 3695 | 32 |
| Qwen3-32B | 16 | no-share | 16896 | 26.14 s | 647 | - |
| Qwen3-32B | 16 | **share** | 1536 | **3.64 s** | 422 | 16 |
| Qwen3-32B | 32 | no-share | 33792 | 61.91 s | 546 | - |
| Qwen3-32B | 32 | **share** | 2048 | **6.02 s** | 340 | 32 |
### Verdict: YES, a real and substantial win, and it grows with K
- Prefill wall-time speedup: **0.6B K=32 -> 8.4x**, **32B K=16 -> 7.2x**,
**32B K=32 -> 10.3x**. The win grows with the number of sharers because
no-share prefix recompute is `O(K)` while the shared prefix is `O(1)` plus
`K` tiny suffixes.
- Note the honest caveat in the raw-throughput column: share mode submits small
32-token suffix batches that are *less* GPU-efficient than the large no-share
batches (32B: 340-422 vs 546-647 tok/s; 0.6B: 3695 vs 7253 tok/s). The win is
**not** higher tok/s - it is computing ~11-16x **fewer** tokens. On a fast GB10
prefill that still nets a 7-10x wall-time reduction because prefill is
compute-bound and the shared prefix dominates the token count.
- This is exactly the many-users-one-system-prompt / RAG-preamble fan-out
scenario, and the paged cross-request prefix cache delivers there.
Scaffolding (`paged-prefix-bench.cpp`, the `PAGED_NGL` driver tweak) stays
dev-tree-only and is not part of any shipped patch.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,81 +0,0 @@
# Paged-KV GPU verification + full backend CUDA build
Verification run on a DGX Spark (NVIDIA GB10, compute capability 12.1 / sm_121),
CUDA 13.0, against pin `f3e182816421c648188b5eab269853bf1531d950`. Models:
`Qwen3-0.6B-Q8_0.gguf` (core gate) and `Qwen3-32B-Q4_K_M.gguf` (sanity).
All paged behaviour stays gated by `LLAMA_KV_PAGED` (env) / the `kv_paged`
server option; default-off is byte-identical to stock.
## Deliverable 1 - GPU-path correctness (all on GPU, `-ngl 99`)
CUDA build of the dev tree configured with
`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`;
all paged drivers (`llama-simple`, `llama-paged-multiseq`,
`llama-paged-prefix`, `llama-paged-prefix-engine`) compiled clean under sm_121.
1. Core token-identical gate - PASS. `llama-simple` greedy, Qwen3-0.6B, `-ngl 99`:
stock (env unset) vs `LLAMA_KV_PAGED=1` output is BYTE-IDENTICAL. The paged
path is genuinely engaged: `LLAMA_KV_PAGED_DEBUG=1` shows the device gather
firing (`[paged-attn] gather n_stream=1 ...`), per-token block placement
(`[paged-alloc] ... grew`), and the stock run uses CUDA Graphs while the paged
run takes the distinct gather path - yet output matches exactly.
2. Multi-stream - PASS. `llama-paged-multiseq -s 4 -ngl 99`, stock vs paged:
all 4 concurrent sequences BYTE-IDENTICAL on GPU (n_seqs=4, CUDA0 compute
buffer matches expectation). Same result reproduced on the CPU build.
Prefix recompute-skip (`llama-paged-prefix-engine`, patch 0007) - MIXED, and
this is a dev-scaffolding driver ("not shipped"); it was never built on CPU
(absent from the CPU Gate-0 set), so there is no prior CPU pass to match.
The driver hardcodes `n_gpu_layers = 0`; a reported test-harness-only env
override (`PAGED_NGL`) was added to run it at `-ngl 99` (29/29 layers
offloaded confirmed), then reverted. Results are IDENTICAL on CPU and GPU
(so not a GPU issue):
- PASS: measured recompute-skip (32 prefix tokens skipped, block-aligned),
ref-count == 2 on shared block, ref drop 2->1 on free, only-private-blocks
returned, block returned to pool.
- FAIL: 2 of ~16 greedy-token-equality assertions. `boundary` case diverges
from the from-scratch baseline at the 2nd generated token (`17971` vs
`5671`) and then completely; `mid-block` "A re-shareable after free, output
unchanged" also differs. Driver prints `GATE FAILED (failures=2)`.
This is a divergence in the prefix recompute-skip path (0006/0007), NOT in the
core gather gate, and not GPU-specific. Reported, not fixed (out of scope).
3. 32B GPU sanity - PASS. `LLAMA_KV_PAGED=1 llama-simple -ngl 99 -n 16` on
Qwen3-32B-Q4_K_M (65/65 layers offloaded): coherent output
("The capital of France is Paris..."), no crash, no OOM.
## Deliverable 2 - full backend build with the paged patches
Built in a nested LocalAI tree on the DGX; gRPC v1.59.0 built from source
(LocalAI bundle; the system protobuf ships no CMake CONFIG) in ~26 min.
- (2a) `make llama.cpp LLAMA_PAGED=on` - PASS. All 6 paged patches
(0001,0002,0003,0004,0006,0007) `git apply` cleanly to the pin (EXIT=0). The 8
vendored paged sources land in `llama.cpp/src/` and are BYTE-IDENTICAL to the
dev tree; `grpc-server.cpp` carries the `kv_paged`/`paged_attention` option
(patch 0005); `llama-kv-cache.cpp` has the env-gated hooks.
- (2b) grpc-server under CUDA sm_121 - PASS (with the single-application caveat
below). 89 MB ARM aarch64 executable, build ~139 s, linked against
libcudart.so.13 / libcublas.so.13; binary contains the paged option strings
and `paged_alloc`/`paged_attn`/gather symbols.
- (2c) `make llama.cpp LLAMA_PAGED=off` - PASS. "skipping paged-attention patch
series", EXIT=0, NO `paged-*` sources in the checkout (clean escape hatch).
### Build-flow finding: paged patches are applied TWICE in the on-flow
A plain `make grpc-server LLAMA_PAGED=on` FAILS to compile. The paged series is
applied by BOTH the Makefile `llama.cpp` target (`git apply`) AND `prepare.sh`
(`patch -p1`). On the already-git-applied tree, `prepare.sh` hits "Reversed (or
previously applied) patch detected! Assume -R? [n]", declines, and re-applies the
pure-addition hunks a second time. `llama_kv_cache::get_n_gather` etc. end up
defined twice -> redefinition errors in `llama-kv-cache.cpp` (`.rej`/`.orig`
litter `src/`). Single application (one of the two appliers) compiles clean -
the 2b build above used a single git-apply with `prepare.sh` patching suppressed.
Reported only; the fix (drop one of the two application sites for
`patches/paged/`) is out of scope for this verification.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,120 +0,0 @@
# PAGED_POOL_BURST_FIX (patch 0024)
Fixes the paged-pool **burst-degradation bug** identified in `OTHER_PATHS_INVESTIGATION.md`
(section C, Part 2): on a long-lived `llama-server` with `LLAMA_KV_PAGED=1`, a high-fan-out
prefill burst strands KV blocks in the host-side paged pool, so a subsequent lower-npl prefill
draws from a depleted / fragmented pool and its throughput collapses (the benchmark's documented
"restart the server per npl" crutch). Decode is unaffected. The fix touches **only host-side block
accounting and placement - never KV values or compute** - so it is gated behind `LLAMA_KV_PAGED`
and is byte-identical to HEAD with the flag unset.
## Root cause (two compounding host-side defects)
1. **Reclamation gap.** `paged_alloc` returned a sequence's blocks only on a full-range wipe
(`seq_rm(seq, 0, MAX)`). A partial **tail** truncation `seq_rm(seq, p0>0, MAX)` - which
`llama-server` issues on every reused slot and before a cross-request prefix splice - freed the
kv-cache CELLS but left the manager owning the trailing BLOCKS. The two desync; the free pool
shrinks. (Applies to pure-attention paged caches; on hybrid SSM models the partial seq_rm is
rejected by the recurrent cache before it reaches the attention cache, so the dominant leak there
is #1b below.)
1b. **Idle-slot retention.** Stock `llama-server` keeps a finished slot's KV resident for that
slot's own next-prompt cache. Under the paged engine, the blocks of the many slots a burst
touches but a later low-npl run never reassigns are stranded for the process lifetime - a later
run sees a depleted pool.
2. **No compaction.** `BlockPool::free_blocks` returns blocks in free order; after a burst the free
queue is a scrambled permutation of physical ids, so a later prefill pops physically scattered
blocks and its KV scatter-write + paged-attention gather lose locality.
## The fix (all behind `LLAMA_KV_PAGED`; `LLAMA_PAGED_NO_RECLAIM=1` restores pre-fix behavior)
- **Fix-1 - reclaim trailing blocks.** `paged::PagedKVManager::truncate(seq, n_keep)` frees every
block at index >= `ceil(n_keep/bs)` (ref-counted, mirroring vLLM's free of a truncated suffix),
exposed as `paged_alloc::truncate(cache, stream, seq, n_keep)` and called from
`llama_kv_cache::seq_rm` for the `p1 == MAX && p0 > 0` case. Manager accounting now tracks the
kv-cache exactly. (`src/paged-kv-manager.*`, `src/paged-alloc.*`, `src/llama-kv-cache.cpp`)
- **Fix-2 - defrag on empty.** When the pool becomes fully idle (`all_free()`),
`defrag_free_pool()` relinks the free queue into ascending block-id order (`FreeBlockQueue::rebuild`),
preserving content-cache hashes. Triggered after `release`/`truncate`. (`src/paged-kv-manager.*`,
`src/paged-alloc.*`)
- **Fix-3 - release on slot completion.** At `server_slot::release()` the paged engine issues
`prompt_clear()` (full seq_rm: clears cells AND releases+defrags the blocks) and drops the
slot-local prompt cache, so a finished-idle slot returns its blocks promptly; cross-request reuse
still works through the committed paged content cache. (`tools/server/server-context.cpp`)
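The accounting behind Fix-1 and Fix-2 can be sketched with a toy host-side manager. This is an assumption-laden model, not the patch's `PagedKVManager`: it has no ref-counting and no content-cache hashes, and every name in it is hypothetical. It shows only the `ceil(n_keep/bs)` truncation rule and the "sort the free queue when the pool is fully idle" idea.

```python
import math
from collections import deque


class ToyPagedManager:
    """Toy block accounting: no ref-counts, no hashes, host-side only."""

    def __init__(self, n_blocks: int, block_size: int = 16):
        self.bs = block_size
        self.n_blocks = n_blocks
        self.free = deque(range(n_blocks))   # free queue of physical ids
        self.tables = {}                     # seq -> list of physical ids

    def allocate(self, seq: int, n_tokens: int):
        need = math.ceil(n_tokens / self.bs)
        table = self.tables.setdefault(seq, [])
        while len(table) < need:
            table.append(self.free.popleft())
        return table

    def truncate(self, seq: int, n_keep: int) -> int:
        """Free every block at index >= ceil(n_keep / bs)."""
        keep = math.ceil(n_keep / self.bs)
        table = self.tables.get(seq, [])
        freed = table[keep:]
        del table[keep:]
        self.free.extend(freed)
        self._defrag_if_idle()
        return len(freed)

    def release(self, seq: int) -> int:
        freed = self.tables.pop(seq, [])
        self.free.extend(freed)
        self._defrag_if_idle()
        return len(freed)

    def _defrag_if_idle(self):
        # Relink the free queue in ascending id order once nothing is held.
        if len(self.free) == self.n_blocks:
            self.free = deque(sorted(self.free))


if __name__ == "__main__":
    m = ToyPagedManager(n_blocks=64)
    print(len(m.allocate(0, 512)), "blocks allocated")
    print(m.truncate(0, 256), "blocks reclaimed by truncate(0, 256)")
    print(m.truncate(0, 16), "blocks reclaimed by truncate(0, 16)")
    m.release(0)
    print("free queue ascending:", list(m.free) == list(range(64)))
```

The block counts mirror the unit-test figures quoted later in this note (32 blocks for 512 tokens, 16 reclaimed at `n_keep = 256`, one block left at `n_keep = 16`).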
## Validation (DGX GB10, dense q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023)
### Bit-exactness (the parity-safe property)
Greedy decode, fixed prompt/seed, 48 tokens, `llama-completion`:
| build / flag | md5 |
|---|---|
| 0023 baseline (paged off) | `5951a5b4d624ce891e22ab5fca9bc439` |
| AFTER paged **off** | `5951a5b4d624ce891e22ab5fca9bc439` (== baseline) |
| AFTER paged **on**, reclaim default-on | `5951a5b4d624ce891e22ab5fca9bc439` (== baseline) |
| AFTER paged **on**, `LLAMA_PAGED_NO_RECLAIM=1` | `5951a5b4d624ce891e22ab5fca9bc439` (== baseline) |
Identical across the board: the fix changes no KV value or compute. `test-backend-ops` is unaffected
by construction (the change touches only host-side block accounting in libllama and the server; no
ggml operator is modified) and was re-run green against the fixed `libllama`.
### Host-side unit test (`llama-paged-reclaim-unit`, no GPU)
- Fix-1: `allocate(0,512)` -> 32 blocks; `truncate(0,256)` reclaims exactly **16** trailing blocks;
`truncate(0,16)` returns to 1 block; `free` returns to pristine.
- Fix-2: 8 blocks freed in scrambled order then `defrag_free_pool()` -> next `block_table` pops
**ascending** physical ids. `UNIT PASS`.
### Repro on the model (`llama-paged-burst-bench`, A/B on one binary via `LLAMA_PAGED_NO_RECLAIM`)
NSLOT=64, NPL=8, PP=512, pool=2527 blocks. Same binary, A/B by env.
- **Fix-2 (fragmentation -> prefill).** Fresh npl8 vs npl8 after a scrambling burst+drain:
- BEFORE (`NO_RECLAIM`): prefill 870.5 -> 822.1 t/s, **ratio 0.944** (fragmented free queue).
- AFTER (defrag on): prefill 869.2 -> 867.8 t/s, **ratio 0.998** (free queue compacted).
- **Fix-3 mechanism (idle-slot leak -> reclaim).** Burst 64 sequences left idle, then full-release
(what Fix-3's `prompt_clear` issues at `slot.release()`): pool free
**2527 (pristine) -> 479 (64 idle slots strand 2048 blocks) -> 2527 (reclaimed == fresh)**. The
leaked-block count is exactly 64 x ceil(512/16) = 2048.
- Decode is untouched throughout (single-token append; the fix only moves/accounts blocks).
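The stranded-block figure is plain arithmetic, reproduced here as a sketch. The block size of 16 is taken from the `ceil(512/16)` term above; the variable names are illustrative.

```python
import math


def blocks_for(n_tokens: int, block_size: int = 16) -> int:
    """Blocks needed to hold n_tokens of KV."""
    return math.ceil(n_tokens / block_size)


pool_blocks = 2527            # pristine pool size reported above
idle_slots = 64               # burst sequences left idle
stranded = idle_slots * blocks_for(512)
free_after_burst = pool_blocks - stranded
free_after_reclaim = free_after_burst + stranded

if __name__ == "__main__":
    print(stranded, free_after_burst, free_after_reclaim)
```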
### Server repro (`llama-server`, one long-lived process, FRESH-npl8 -> BURST-npl64 -> POST-npl8)
`-c 36000 -np 64 -b 2048 -ub 512`, `LLAMA_MAX_BATCH_TOKENS=512`, distinct 512-token prompts,
`cache_prompt:false`, A/B by `LLAMA_PAGED_NO_RECLAIM`. Aggregate prefill = total prompt tokens / wave
wall.
| wave | BEFORE (`NO_RECLAIM`) | AFTER (fix) |
|---|---|---|
| FRESH-npl8 | 488 t/s (wall 8.4 s) | 525 t/s (wall 7.8 s) |
| POST-npl8 (after burst) | **44 t/s (wall 93 s)** | **532 t/s (wall 7.7 s)** |
| post / fresh | **0.090 (11x collapse)** | **1.01 (recovered, within 1%)** |
| paged release lines in log | 17 | **96** (Fix-3 fires at each slot completion) |
| `CANARY_TOKENS_MATCH` (fresh vs post, identical prompts) | **YES** | **YES** |
The bug reproduces exactly (the investigation's 507 -> 65 collapse; here 488 -> 44); the fix restores
POST-npl8 to within ~1% of fresh and the release-log count jumps from 17 to 96, confirming Fix-3
returns each finished slot's blocks. The canary tokens are identical fresh-vs-post in BOTH arms:
paged placement is value-invariant, so the fix never changes the served output - only when the pool
recovers. Decode is structurally untouched (release happens after a request completes); greedy md5
above proves decode values are byte-identical.
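The table's throughput and ratio columns follow from the stated definition (total prompt tokens / wave wall). The sketch below assumes each npl8 wave carries exactly 8 prompts of 512 tokens, which is inferred from the setup line rather than stated per wave; the walls are the rounded values from the table, so the results are approximate.

```python
def aggregate_prefill_tps(n_requests: int, prompt_tokens: int, wall_s: float) -> float:
    """Aggregate prefill throughput = total prompt tokens / wave wall."""
    return n_requests * prompt_tokens / wall_s


# Walls (seconds) copied from the table; 8 x 512 tokens per wave is assumed.
fresh_before = aggregate_prefill_tps(8, 512, 8.4)
post_before = aggregate_prefill_tps(8, 512, 93.0)
fresh_after = aggregate_prefill_tps(8, 512, 7.8)
post_after = aggregate_prefill_tps(8, 512, 7.7)

if __name__ == "__main__":
    print(f"BEFORE: {fresh_before:.0f} -> {post_before:.0f} t/s, "
          f"post/fresh {post_before / fresh_before:.3f}")
    print(f"AFTER:  {fresh_after:.0f} -> {post_after:.0f} t/s, "
          f"post/fresh {post_after / fresh_after:.2f}")
```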
## Tradeoff / scope notes
- On **hybrid SSM models** (qwen35), the recurrent cache rejects a partial tail `seq_rm`, so the
hybrid wrapper never forwards it to the attention cache: Fix-1 effectively applies to
pure-attention paged caches, while the hybrid leak is dominated by idle-slot retention (Fix-3) and
fragmentation (Fix-2). Confirmed by the unit test (Fix-1 logic) and Test-C (2048 blocks stranded
by 64 idle slots, returned to fresh on reclaim).
- Fix-3 clears a finished slot's KV at `release()`, so a repeated-prompt workload loses the
slot-local prompt cache. Cross-request reuse normally falls back to the committed paged content
cache, but that publish path (`paged_prefix_api::commit`) is itself a no-op on hybrid wrappers, so
for hybrid + repeated prompts Fix-3 trades prompt-cache reuse for pool hygiene. Gated behind
`LLAMA_KV_PAGED`; `LLAMA_PAGED_NO_RECLAIM=1` restores the stock retain-idle behavior.
## Files
- `src/paged-kv-manager.{h,cpp}` - `truncate`, `defrag_free_pool`/`defrag_free_queue`,
`FreeBlockQueue::rebuild`, `all_free`/`total_blocks`.
- `src/paged-alloc.{h,cpp}` - `truncate`, `reclaim_active`, defrag-on-empty in `release`/`truncate`,
`num_free_global`/`num_managers`.
- `src/llama-kv-cache.cpp` - partial-tail-seq_rm reclaim hook.
- `src/paged-prefix-api.{h,cpp}` - `num_free_global`/`num_managers` introspection passthrough.
- `tools/server/server-context.cpp` - Fix-3 paged release at `slot.release()`.
- `examples/simple/paged-reclaim-unit.cpp`, `paged-burst-bench.cpp` - dev test scaffolding.
Assisted-by: Claude:opus-4.8 [Claude Code]


@@ -1,111 +0,0 @@
# Paged llama.cpp vs vLLM - apples-to-apples (batched + NVFP4 + prefix cache)
Definitive matched comparison on a DGX Spark (GB10, sm_121). Both engines batched,
both NVFP4-class weights, both with prefix caching on, both eager (no CUDA graphs).
Workload: shared 1024-token system prefix + unique 32-token suffix, generate 64
tokens, K requests fired concurrently (cold fan-out), one client hitting both
OpenAI-compatible servers with identical token-id prompts.
This run fixes the two confounders in the earlier comparison (a *serial* Q4_K dev
driver vs a *batched* FP4 vLLM server). Here both sides are batched and NVFP4.
## Setup
- llama.cpp: `llama-server` built from the paged dev tree (`~/llama-paged-dev`,
branch `paged`, patches 0001-0007), CUDA `build-cuda/` (sm_121).
`LLAMA_KV_PAGED=1`, `-ngl 99 --parallel 32 -c 40960`, model
`q3-32b-nvfp4-dense.gguf` (NVFP4 weights, FP4-MMA kernel). OpenAI `/completion`.
- vLLM 0.23.0: `vllm serve q3-32b-nvfp4a16/` (compressed-tensors W4A16 / Marlin),
`--enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.9
--max-num-seqs 64`, APC on (default). OpenAI `/v1/completions`.
## Finding 1 - the paged cross-request prefix cache does NOT engage in llama-server
This is itself a key result. The paged engine has two distinct mechanisms:
1. Physical paged block placement (patches 0002/0004) - runs inside
`llama_kv_cache::find_slot`, gated only by `LLAMA_KV_PAGED`. This DOES engage in
the server: with `LLAMA_KV_PAGED_DEBUG=1`, 2 concurrent shared-prefix requests
produced 14 `[paged-alloc] ... grew` lines, one stream per `seq`.
2. Cross-request prefix recompute-skip (patch 0007) - the actual fan-out win
(`shares N prefix blocks ... prefix NOT recomputed`, ref-counted block sharing).
This is reachable ONLY through `paged_prefix_api::share/commit`
(`src/paged-prefix-api.cpp`), which only the standalone driver calls.
Evidence it does not reach the server:
- Static: `grep -rn "paged_prefix\|share_prefix\|LLAMA_KV_PAGED" tools/server/`
returns nothing; `nm` on the binary finds no `paged_prefix` symbol use from the
server path. Nothing in `llama_decode` or the server calls `share`/`commit`.
- Runtime: the 2-request verify run logged **0** `shares prefix blocks` /
`NOT recomputed` lines. Both `seq=0` and `seq=1` independently grew to 65 blocks,
each allocating and recomputing the full ~972-token prefix separately - no
cross-slot KV block sharing, no `ref_cnt>1`.
So the 0007 recompute-skip, proven in the driver, does **not** yet reach the
server. Closing it needs server-side wiring: when admitting a slot whose prompt
shares a prefix with another live/committed slot, the server would have to call
the `paged_prefix_api::share` / `commit` seam. That is a future patch.
Note: llama-server has its OWN native prefix reuse (the slot prompt cache /
"context checkpoints"). In the K=32 wave the server reused the prefix cached by the
earlier wave, so prefill was only the 32-token suffix (`prompt eval ... / 32
tokens`). But that is a separate mechanism, it only helps prefill, and prefill is
not the bottleneck here (see below), so it does not change the verdict.
## Finding 2 - the matched comparison
Both batched, both NVFP4, both prefix-cache on, both eager. Cold concurrent fan-out,
identical token-id prompts via one client.
| K | engine | wall (s) | aggregate gen tok/s | req/s | vLLM speedup |
|----|----------|----------|---------------------|-------|--------------|
| 16 | llama.cpp| 50.7 | 18.9 | 0.30 | - |
| 16 | vLLM | 8.57 | 119.5 | 1.87 | ~5.9x |
| 32 | llama.cpp| 58.3 | 34.0 | 0.53 | - |
| 32 | vLLM | 8.86 | 231.1 | 3.61 | ~6.6x |
vLLM APC confirmed engaged: prefix cache hit rate 90.9% (K=16), 95.5% (K=32),
enforce_eager (CUDA graphs disabled), `enable_prefix_caching=True`.
### Verdict: not competitive - vLLM ~6x faster, and prefix caching is not why
With every confounder removed (both batched, both NVFP4, both eager, both with
prefix caching on), vLLM is still ~6x faster end-to-end. The gap is decode-bound,
not prefill/cache-bound:
- The G=64 workload is dominated by decode. In the llama K=32 run, decode was
52.98s of the 58.3s wall; prefill was ~3.5s (and only the 32-token suffix, since
the server's native prompt cache already reused the prefix). So even perfect
prefix sharing - paged or native - cannot move the total much.
- llama.cpp batched decode: **~828 ms per decode step** at batch 32
(1.21 tok/s per sequence).
- vLLM batched decode: ~170 tok/s aggregate gen at 32 running reqs ->
**~185 ms per step**, roughly **4-5x faster per decode step**.
- CUDA graphs are NOT the differentiator: both sides are eager (llama
`graphs reused = 0`, vLLM `--enforce-eager`). The win is vLLM's batched-decode
efficiency: PagedAttention + fused W4A16 (Marlin) GEMMs + chunked-prefill
scheduler, versus llama.cpp's per-step eager graph and NVFP4-GGUF decode path on
this Blackwell-class part.
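The per-step figures in the list above can be re-derived with a short sketch. It assumes one decode step emits one token per running sequence, so G=64 generated tokens means 64 steps; the helper names are illustrative.

```python
def ms_per_step_from_wall(decode_s: float, n_steps: int) -> float:
    """Step latency from total decode wall time."""
    return 1000.0 * decode_s / n_steps


def ms_per_step_from_throughput(batch: int, agg_tok_s: float) -> float:
    """Step latency from aggregate tok/s, one token per sequence per step."""
    return 1000.0 * batch / agg_tok_s


llama_ms = ms_per_step_from_wall(52.98, 64)        # llama.cpp K=32 run
vllm_ms = ms_per_step_from_throughput(32, 170.0)   # vLLM at 32 running reqs

if __name__ == "__main__":
    print(f"llama.cpp: {llama_ms:.0f} ms/step "
          f"({1000.0 / llama_ms:.2f} tok/s per sequence)")
    print(f"vLLM:      {vllm_ms:.0f} ms/step")
    print(f"ratio:     {llama_ms / vllm_ms:.1f}x")
```

The vLLM figure comes out slightly above the rounded "~185 ms" quoted above; the ratio still lands in the stated 4-5x band.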
Because decode dominates, wiring the paged 0007 recompute-skip into the server
(Finding 1) would mainly remove redundant prefill across slots - a real saving for
short-generation / long-prefix RAG fan-out, but at G=64 it is a few seconds against
a decode floor that is already ~6x slower than vLLM. The fan-out win does not, on
its own, make llama.cpp competitive here; the decode kernel/batching gap is the
load-bearing factor.
## Caveats
- NVFP4-GGUF is double-quant and is speed-representative (it routes onto the
FP4-MMA kernel); output quality is not the subject of this run.
- vLLM side is NVFP4A16 (W4A16 / Marlin) - 4-bit weights, 16-bit activations;
llama side is NVFP4 weights on FP4-MMA. Both are NVFP4-weight class.
- One llama request per run hit an intermittent HTTP 500 ("output does not match
the expected Content-only format" - a Qwen3 thinking-output quirk on
`/completion`), so llama counts were 15/16 and 31/32. The failed request returns
early and reduces batch contention for the rest, so a clean 16/16 / 32/32 llama
run would be marginally slower - i.e. the ~6x gap reported here is conservative
(favorable to llama.cpp).
- Both servers cold-started; numbers are end-to-end wall from the concurrent
client. Disk healthy (~325 GB free), GPU otherwise idle.


@@ -1,165 +0,0 @@
# Paged-attention closing measurements: stock GPU determinism + vLLM comparison
Two closing measurements for the paged-attention series, run on a DGX Spark
(NVIDIA GB10, compute capability 12.1 / sm_121), CUDA 13. Dev tree
`~/llama-paged-dev` branch `paged`, paged engine gated by env `LLAMA_KV_PAGED`
(default-off = stock). Models: `Qwen3-0.6B-Q8_0.gguf` and
`Qwen3-32B-Q4_K_M.gguf` (llama.cpp), `Qwen3-32B` nvfp4a16 / W4A16 HF safetensors
(vLLM 0.23.0). All dev drivers are dev-tree-only and not shipped.
## Deliverable 1: stock GPU determinism across batch shapes (no paging)
Question: is the patch-0007 GPU byte-identity "failure" (a near-tie greedy token
flips on CUDA, e.g. 17971 vs 5671) caused by paging, or is it inherent stock
CUDA non-determinism from running the same tokens in a different batch shape?
Method: a new dev-only driver `llama-paged-batchshape` (paging explicitly OFF:
`unsetenv("LLAMA_KV_PAGED")`). For a prompt `[P+S]` it greedy-decodes two ways,
both stock contiguous KV:
- (a) `full` - prefill the whole `[P+S]` in ONE `llama_decode`.
- (b) `split` - prefill `P` in one `llama_decode`, then `S` in a second.
The two paths write byte-for-byte identical token ids; the only difference is the
batch shape submitted to the kernels (full prefill vs P-then-S), which changes
the float reduction order in the GEMMs and therefore the KV values by tiny
amounts. 5 distinct prompts, suffix S=16.
### Single next token (the literal T_full vs T_split)
Both CPU and CUDA returned the SAME greedy next token for all 5 prompts
(0/5 flips). BUT the top-2 logit gap measurably changes with the batch shape on
CUDA, proving the float order does differ:
```
CUDA, S=8: prompt 1 T_full=1896 (gap 0.07072) T_split=1896 (gap 0.17986)
CUDA, S=8: prompt 4 T_full=49584 (gap 0.93304) T_split=49584 (gap 0.85785)
```
The argmax simply did not flip on the immediate next token for these prompts -
the gaps, while shifting, stayed wide enough.
### Generated stream (what 0007 actually byte-asserts)
0007 asserts byte-identity over a *generated* token stream, where the tiny
prefill-shape KV perturbation accumulates and eventually crosses a near-tie.
Generating G tokens greedily from `full` vs `split` and reporting first
divergence:
| gen length | CPU diverged | CUDA diverged |
|-----------|--------------|---------------|
| G=24 (0007 default) | 1/5 (prompt 0 @ step 5) | 2/5 (prompt 1 @ step 3, prompt 4 @ step 6) |
| G=64 | 2/5 (steps 5, 42) | 3/5 (steps 3, 6, 30) |
Example CUDA divergence, pure stock, zero paging:
`prompt 1: DIVERGES at gen step 3: full=1260 split=576`.
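The comparison the driver reports reduces to finding the first index at which two greedy streams differ. A minimal sketch follows; the token streams are made up for illustration (only the 1260 / 576 pair echoes the log line above), and zero-based step numbering is an assumption.

```python
from typing import Optional, Sequence


def first_divergence(full: Sequence[int], split: Sequence[int]) -> Optional[int]:
    """Return the first step at which the two streams differ, else None.

    Only the common prefix length is compared.
    """
    for step, (a, b) in enumerate(zip(full, split)):
        if a != b:
            return step
    return None


if __name__ == "__main__":
    # Hypothetical streams: identical for three steps, then a near-tie flips.
    full_stream = [11, 22, 33, 1260, 44, 55]
    split_stream = [11, 22, 33, 576, 99, 77]
    print("diverges at step", first_divergence(full_stream, split_stream))
    print("identical streams:", first_divergence([1, 2, 3], [1, 2, 3]))
```

Everything after the first differing step is expected to differ as well, since each stream then conditions on its own history.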
### Verdict (Deliverable 1): HYPOTHESIS HELD
The 0007 GPU byte-identity failure is **stock batch-shape non-determinism, not a
paged bug**. With paging entirely OFF, stock llama.cpp produces a different
greedy token stream when the same prompt is processed in a full-prefill batch vs
a split (prefix-then-suffix) batch - exactly the shape difference that 0007's
prefix-share path introduces (full B-from-scratch vs prefix-cached + suffix-only).
Refinement (reported honestly): it is **not strictly CUDA-only**. CPU exhibits
the same divergence, just less often and later (1/5 vs 2/5 at G=24, and CPU's
flips land at later generation steps). This is exactly why 0007's small, short
CPU scenarios happened to pass 16/16 while the CUDA run flipped: CUDA's larger
parallel reductions reorder more aggressively, so a near-tie crosses earlier and
more frequently. The phenomenon is floating-point GEMM-batching non-determinism,
inherent to both backends; paging is not the cause.
## Deliverable 2: vLLM vs llama.cpp+paged on a shared-prefix fan-out
Workload: K requests share a 1024-token system prefix, each with a unique
32-token suffix, then generate 64 tokens. Both engines cache the shared prefix
(vLLM automatic prefix caching ON by default; llama.cpp via the paged
cross-request prefix cache, `LLAMA_KV_PAGED=1`).
The quantization pairing is a realistic apples-to-oranges comparison, reported honestly:
- llama.cpp: Qwen3-32B **Q4_K_M** (GGUF), `-ngl 99`, CUDA dequant kernels.
- vLLM: Qwen3-32B **nvfp4a16 (W4A16)**, served via the **Marlin FP4
weight-only** kernel because GB10 (sm_121) has **no native FP4 compute** -
i.e. vLLM is on a slower-than-ideal kernel path here. vLLM also ran
`enforce_eager=True` (no CUDA graphs / torch.compile; the env lacked a working
inductor/ninja toolchain), so the vLLM numbers are if anything **conservative**.
### vLLM (automatic prefix caching), end-to-end
APC hits confirmed in the engine log: **"Prefix cache hit rate: 97.0%"**,
`prefix_cache_hits 33040/34848` (K=16) and `99344/102432` (K=32).
| K | APC | prefill wall (G=1) | total wall (G=64) | throughput |
|---|-----|--------------------|--------------------|-----------|
| 16 | ON | 0.749 s | 6.63 s | 2.41 req/s |
| 16 | OFF | 20.19 s | 27.21 s | 0.59 req/s |
| 32 | ON | 1.13 s | 7.56 s | 4.23 req/s |
| 32 | OFF | 40.19 s | 48.71 s | 0.66 req/s |
vLLM's APC cuts the fan-out prefill ~27x (K=16) to ~36x (K=32) vs APC-off; the
huge ratio reflects how slow the FP4-emulation prefill is when forced to
recompute all K prefixes.
### llama.cpp + paged prefix cache (prefill phase)
The paged shared-prefix bench (`llama-paged-prefix-bench`, `BENCH_GEN=0`,
`PAGED_NGL=99`). Reuse confirmed: `kshare(seq1)=1024`, shared-block
`ref_cnt = K` (all sequences hold the one prefix), 15360 / 31744 prefix tokens
skipped.
| K | mode | prefill tokens submitted | prefill wall | vs no-share |
|---|------|--------------------------|--------------|-------------|
| 16 | PAGED-SHARE | 1536 | 3.66 s | 7.15x |
| 16 | NO-SHARE | 16896 | 26.17 s | 1.0x |
| 32 | PAGED-SHARE | 2048 | 6.04 s | 10.3x |
| 32 | NO-SHARE | 33792 | 62.17 s | 1.0x |
The paged prefix cache delivers the expected **7.15x (K=16) / 10.3x (K=32)**
prefill wall-time reduction - the headline cross-request prefix-skip win, on a
real 32B model on GPU.
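The token counts in the table follow directly from the workload definition (1024-token shared prefix, 32-token unique suffix, K requests). A minimal sketch of that accounting, with a hypothetical helper name:

```python
def prefill_tokens(k, prefix=1024, suffix=32):
    """Return (shared, unshared, skipped) prefill token counts for K requests.

    shared:   the prefix is prefilled once, plus one suffix per request
    unshared: every request prefills prefix + suffix from scratch
    skipped:  prefix tokens that K-1 requests did not have to recompute
    """
    shared = prefix + k * suffix
    unshared = k * (prefix + suffix)
    skipped = (k - 1) * prefix
    return shared, unshared, skipped


for k in (16, 32):
    print(k, prefill_tokens(k))
# 16 (1536, 16896, 15360)
# 32 (2048, 33792, 31744)
```

The submitted-token ratio (11x at K=16, 16.5x at K=32) is larger than the measured wall-time ratio, so the wall-time saving is not simply proportional to the tokens skipped.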
### Head-to-head, both engines caching the shared prefix
Prefill of the cached fan-out (vLLM G=1, ~prefill; llama.cpp G=0, pure prefill):
| K | llama.cpp+paged prefill | vLLM APC prefill | vLLM faster by |
|---|-------------------------|------------------|----------------|
| 16 | 3.66 s | 0.749 s | ~4.9x |
| 32 | 6.04 s | 1.13 s | ~5.3x |
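The quoted ratios can be re-derived from the wall times in the three tables. The sketch below only restates the measured numbers; the dictionary layout is illustrative.

```python
def speedup(slow_s, fast_s):
    """How many times faster the second wall time is than the first."""
    return slow_s / fast_s


# (slow, fast) prefill wall times in seconds, copied from the tables.
vllm_apc = {16: (20.19, 0.749), 32: (40.19, 1.13)}   # APC off, APC on
paged = {16: (26.17, 3.66), 32: (62.17, 6.04)}       # no-share, paged-share

for k in (16, 32):
    apc = speedup(*vllm_apc[k])                       # vLLM: APC off vs on
    share = speedup(*paged[k])                        # llama.cpp: no-share vs share
    lead = speedup(paged[k][1], vllm_apc[k][1])       # cached prefill, head-to-head
    print(f"K={k}: APC {apc:.1f}x, paged {share:.2f}x, vLLM lead {lead:.1f}x")
```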
### Verdict (Deliverable 2): competitive in kind, behind in absolute terms
With both engines caching the shared prefix, **llama.cpp+paged is qualitatively
competitive but absolutely behind vLLM on this GB10 box**:
- **Same optimization, same order of magnitude.** llama.cpp's paged prefix cache
reproduces exactly the win vLLM's APC gives (skipping the shared-prefix
recompute) and yields a 7-10x prefill reduction vs its own no-share baseline.
On the RAG/system-prompt fan-out the algorithmic gap is closed: llama.cpp no
longer pays K x prefix.
- **vLLM still wins head-to-head by ~5x on the cached prefill** (0.75s vs 3.66s
at K=16; 1.13s vs 6.04s at K=32), and by more end-to-end because it does
**continuous batched decode** (all K sequences decoded in one fused step)
while the llama.cpp paged *dev driver* decodes each sequence serially. That
decode-batching gap is a property of the serving stack, not of the paged
prefix cache. Notably vLLM wins here while handicapped (eager mode, FP4
weight-only emulation with no native FP4 on GB10); a tuned vLLM would lead by
more.
- **Honest caveats / blockers.** (1) Quant differs (Q4_K_M vs nvfp4a16). (2) The
comparison is prefill-vs-prefill plus vLLM end-to-end; a clean llama.cpp
end-to-end on this driver is blocked because its generation phase has a
stale-logits bug (`get_logits_ith` reads seq 0's prefill index after later
sequences' prefills overwrote the logits buffer -> segfault), and even fixed
its decode is serial, so it would not be apples-to-apples vs vLLM's batched
decode. The fair end-to-end llama.cpp number needs the grpc / llama-server
continuous-batching path, not this dev scaffold. (3) vLLM ran eager + FP4
emulation, making its numbers conservative.
Bottom line: paged gives llama.cpp the cross-request prefix-skip that vLLM's APC
provides, which is the categorical win and removes the K x prefix penalty on
RAG/system-prompt fan-out. On absolute wall-time on this hardware vLLM retains a
~5x prefill lead and a larger end-to-end lead from continuous batched decode and
a more optimized serving stack.


@@ -1,107 +0,0 @@
# Pin-bump apply-feasibility check: paged patch series vs latest llama.cpp tip
Date: 2026-06-27. Scope: textual `git apply` feasibility ONLY. No compile, no
bit-exact gate (those require the DGX GPU and the manual PIN_SYNC process). This
report answers one question: if we bumped the pin to the latest upstream tip,
would the vendored paged patch series still apply?
## Pins
| | commit | subject |
|---|---|---|
| Current shipped pin | `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1` | model : Add label for LFM2.5-230M (#25008) |
| Latest master tip | `c299a92c38b6de6a1139617652b66081828648db` | binaries : Improve rpc-server and export-graph-ops names (#25045) |
Gap: the pin is **23 commits behind** the latest master tip (`ahead_by: 23`,
GitHub compare API). The upstream range touched many files across the tree
(modifications plus at least one rename).
## Method
Two fresh shallow clones of `ggml-org/llama.cpp` (the current pin as a baseline,
and the latest master tip as the target). The series
`backend/cpp/llama-cpp/patches/paged/0*.patch` (28 files: 0001-0030, gaps at
0005 and 0027) was applied IN ORDER to each tree.
Each patch was classified two ways:
- **`git apply --check -p1`** - this is the BUILD's real apply method
(`backend/cpp/llama-cpp/Makefile`'s `llama.cpp` target does
`git apply --verbose "$p" || exit 1`). This is the only signal that decides
whether a bumped build succeeds. `git apply` natively tolerates `@@`
line-number offsets but NOT context-line changes.
- **GNU `patch -p1` dry-run** - the `prepare.sh` fallback method, used here as a
recovery probe to tell a fixable offset/fuzz from a genuine conflict.
Running against BOTH pins isolates bump-induced failures from pre-existing,
pin-independent quirks of the shipped series.
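A minimal sketch of the in-order check-then-apply loop described above. The function names are hypothetical and this is not the script that produced the report; `git -C <dir>` and `git apply --check` are standard git options. What to do after a failure (stop, or fall back to a GNU `patch` probe) is left to the caller.

```python
import subprocess
from typing import Callable, Dict, Sequence


def git_apply(tree: str, args: Sequence[str]) -> bool:
    """Run `git -C <tree> apply <args>` and report whether it succeeded."""
    proc = subprocess.run(["git", "-C", tree, "apply", *args], capture_output=True)
    return proc.returncode == 0


def apply_series(tree: str, patches: Sequence[str],
                 run: Callable[[str, Sequence[str]], bool] = git_apply) -> Dict[str, str]:
    """Check each patch against the tree in order; apply it only if the check passes.

    Patch paths should be absolute, because `git -C` changes directory first.
    """
    results = {}
    for patch in patches:
        if run(tree, ["--check", "-p1", patch]):
            run(tree, ["-p1", patch])   # advance the tree for the next patch
            results[patch] = "CLEAN"
        else:
            results[patch] = "FAIL"     # tree is left untouched by a failed check
    return results
```

Running the same loop against two trees (the current pin and the latest tip) and comparing the two result maps is what separates bump-induced failures from pre-existing ones.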
## Result: the bump is CLEAN / offset-tolerant. Zero re-exports needed for the bump.
The series behaves **identically** under `git apply` on the latest tip and on
the current pin.
- **27 / 28 patches apply CLEAN under `git apply`** on the latest tip (same 27
as on the current pin).
- **1 / 28 fails `git apply` (0019) - and it fails identically on the current
pin too**, for a reason that has nothing to do with the bump (see below). Its
code applies fine.
- **No new conflicts.** Not a single patch that applied on the current pin fails
on the latest tip.
- **Zero context-fuzz anywhere.** Every recovery the GNU-patch probe reported is
a pure line-number offset, which `git apply` absorbs natively.
### What the 23-commit jump actually changed
Only which patches `git apply` has to place at a line offset (context drift from
the 23 upstream commits). All still apply CLEAN; none needs re-export.
- Offset-placed on the current pin (6): 0009, 0017, 0018, 0020, 0021, 0024.
- Offset-placed on the latest tip (10): 0009, 0015, 0017, 0018, 0020, 0021,
0024, 0025, 0026, 0028.
- New offsets introduced by the bump (4): **0015, 0025, 0026, 0028** - all
remain CLEAN under `git apply` (line offset only, no fuzz, no conflict).
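The "new offsets" line is a set difference of the two lists above, which can be checked mechanically:

```python
# Patches that `git apply` placed at a line offset, per the lists above.
offset_on_pin = {"0009", "0017", "0018", "0020", "0021", "0024"}
offset_on_tip = {"0009", "0015", "0017", "0018", "0020", "0021",
                 "0024", "0025", "0026", "0028"}

new_offsets = sorted(offset_on_tip - offset_on_pin)
print(new_offsets)                     # ['0015', '0025', '0026', '0028']
print(offset_on_pin <= offset_on_tip)  # True: no offset disappeared in the bump
```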
### The single `git apply` failure (0019) is pre-existing, not a bump regression
`0019-qwen35-ssm-decode-fused-gather.patch` fails `git apply` on BOTH pins. The
sole cause is its first hunk, a *modify* hunk against `SSM_DECODE_FIX_RESULTS.md`,
a dev-only doc that exists on the DGX dev tree (from an unshipped docs commit)
but is absent from any clean upstream checkout:
```
error: SSM_DECODE_FIX_RESULTS.md: No such file or directory
```
`git apply` is atomic, so that one stray hunk rejects the whole patch. 0019's 8
real code files (ggml.h, ggml-cpu/ops.cpp, ggml-cuda/gated_delta_net.cu, ggml.c,
delta-net-base.cpp, models.h, qwen35.cpp, qwen35moe.cpp) all apply cleanly (the
GNU-patch probe applies them with only line offsets and reports 0 failed code
hunks). This is exactly the pre-existing finding documented in
`PIN_SYNC_9d5d882d.md` ("Pre-existing finding ... NOT introduced by this
pin-sync, NOT fixed here ... a separate cleanup, out of scope"). It is identical
at both pins, so it is NOT introduced by a bump. Stripping the stray dev-doc
hunk from 0019 (and the analogous 0021 *create* hunk for
`CONV_STATE_FUSION_RESULTS.md`, which happens to apply fine) is a cleanup that
should happen regardless of any pin bump.
## Verdict
A pin bump from `9d5d882d` to the latest tip `c299a92c` is **textually clean**:
the full paged series applies via the build's `git apply` with only benign
line-number offsets and zero conflicts - no patch needs re-export for the bump.
The lone `git apply` failure (0019) is a pre-existing shipped-series defect (a
stray dev-doc hunk), present identically on the current pin, and unrelated to the
bump.
## Caveats (why this does NOT authorise shipping a bump)
This is a textual apply check only. It does NOT verify that the patches are still
SEMANTICALLY correct against the 23 upstream commits, that the result
compiles, or that it stays bit-exact. The 23 upstream commits touched many files;
a clean text-apply can still hide a semantic break (e.g. a function the kernel
patches call was refactored). The manual PIN_SYNC process on the DGX GPU
(rebuild + `test-backend-ops` + the greedy-md5 bit-exact gate + a decode bench)
remains the gate before any pin is advanced. This report only establishes that
the bump's textual conflict surface is empty, so that pin-sync would start from a
clean apply.


@@ -1,301 +0,0 @@
# Pin-sync: paged patch-stack -> llama.cpp 9d5d882d
Status: COMPLETE. The paged patch-stack (0001-0024) was rebased onto llama.cpp
`9d5d882d`, rebuilt clean (CUDA sm_121), and the bit-exact gate is GREEN on both
the dense and MoE NVFP4 baselines. The LocalAI-side `.patch` files were then
re-exported from the rebased commits; **4 patch files changed** and are updated
in this commit. A quick decode bench confirms the patchset performs the same on
the new tip.
## Early-warning canary: when to run the NEXT pin-sync
The shipped pin (this file's tip, mirrored in
`backend/cpp/llama-cpp-localai-paged/Makefile`) is advanced ONLY by this manual,
GPU-verified PIN_SYNC. Because the paged backend is excluded from the nightly
auto-bumper (`.github/workflows/bump_deps.yaml`), nothing nightly tells you when
upstream has drifted past the patches. That signal comes from a dedicated
scheduled canary:
- **Workflow:** `.github/workflows/llama-cpp-paged-canary.yml` (weekly, plus
`workflow_dispatch`). It resolves the latest `ggml-org/llama.cpp` master tip,
then in two jobs (a) APPLIES the full series to that tip with the build's own
`git apply` method via `.github/scripts/paged-canary-apply.sh`, and (b)
COMPILES the paged backend (cublas) against it using the same base-grpc-cuda-12
toolchain + `make grpc-server` target the shipped build uses.
- **Green** = the series still applies and compiles on upstream HEAD; nothing to
do.
- **Red** = upstream moved out from under the patches. **Canary red -> run a
PIN_SYNC** (rebase the patches onto the new tip, pass the bit-exact gate on the
GPU, re-export the `.patch` files, then advance the pin). The canary is
signal-only: it opens no PR and never moves the pin, so the shipped build and
the dep-bump PRs stay green regardless.
- **0019 handling:** the canary apply helper excludes ONLY the stray
`SSM_DECODE_FIX_RESULTS.md` dev-doc hunk (the pre-existing quirk documented in
the "Pre-existing finding" section below and in `PIN_BUMP_APPLY_CHECK.md`),
applying 0019's real code hunks atomically. So that benign quirk never
false-positives the canary, but a genuine code break in 0019 still turns it
red.
## Upstream jump
- OLD LocalAI pin: `8be759e6`
- NEW LocalAI pin (target): `9d5d882d` ("model : Add label for LFM2.5-230M (#25008)")
- Upstream jump `8be759e6..9d5d882d` = **17 commits**.
### Note on the dev-tree base (important)
The DGX dev tree's `paged` branch was NOT based on the old pin `8be759e6`. Its
real base (merge-base of `paged` with both pins) is `f3e1828`
("mtmd: llava_uhd should no longer use batch dim (#24732)"), which is an ancestor
of `8be759e6` by 92 commits. So the rebase traversed `f3e1828..9d5d882d` =
**109 upstream commits**, a strictly larger surface than the 17-commit pin bump.
The end state (paged patches on `9d5d882d`) is identical either way; the larger
traverse only means the conflict surface was the worst case, and it still came
through bit-exact.
## Rebase
- Command: `git rebase --onto 9d5d882d f3e1828 paged` (merge.conflictStyle=diff3).
- 26 commits replayed (23 shipped patch-commits, the 2 dev-scaffolding "Gate-0/
FA-gate driver" commits, and 1 docs commit; the scaffolding/docs commits are not
shipped as `.patch` files).
- Backup ref before rebase: `paged-prerebase-backup` = `a8a9d12` (old patch 0024).
- New rebased range: `9d5d882d..paged`, HEAD = `2ee65c2` (patch 0024).
### Conflicts during rebase (3 commits, ALL in `tools/server/server-context.cpp`)
Every rebase conflict was in the llama-server continuous-batch scheduler wiring,
all of which is gated behind env (`LLAMA_KV_PAGED` / `LLAMA_PREFILL_BUDGET` /
`LLAMA_MAX_BATCH_TOKENS`) and therefore a strict no-op for the gate (the gate
uses `llama-completion`, not the server, with no env set). The root cause was a
single upstream refactor of `update_slots()`:
- the outer slot loop became `iterate(slots, [&](server_slot & slot){...})`,
replacing bottom-of-loop `break` with a top-of-lambda
`if (!add_ok || batch.size() >= n_batch) return;` (the `add_ok` flag is set
false on `batch.add()` failure);
- the embedding/rerank early-exits changed `continue;` -> `return;`;
- the `server_batch` token count accessor was renamed `batch.n_tokens` ->
`batch.size()` (`server_batch` has a `.size()` method and **no** `.n_tokens`
member; the raw `llama_batch` in `send_embedding`/`send_rerank` keeps `.n_tokens`).
**patch 0008** (`240758e`, cross-request prefix share) - 1 conflict.
Hunk 3 (the prefix-commit block) collided with the `continue`->`return` refactor.
Hunks 1 (namespace shim) and 2 (the share block) applied cleanly. Resolved by
keeping HEAD's refactored structure and re-inserting the `[paged 0008]`
`paged_prefix_api::commit(...)` block verbatim after `slot.state = SLOT_STATE_GENERATING;`
and before `if (slot.can_speculate())`, re-indented to the new (de-nested) level,
with the identical `paged_kv_commit && cache_prompt && !has_mtmd` guard. Semantics
unchanged.
**patch 0013** (`6d37431`, static `LLAMA_PREFILL_BUDGET`) - 3 conflicts.
- C1: inserted the `n_prefill_budget` / `n_prompt_budgeted` var block before
HEAD's new `auto & alora_scale = batch.alora_scale;` references (upstream moved
alora_scale/disabled_id into the `server_batch` struct).
- C2: merged the budget gate into HEAD's `while (... batch.size() < n_batch ...)`
(took upstream's `batch.size()` rename, kept the budget condition).
- C3: the original outer `break` was translated to the new idiom `add_ok = false;`
(exact semantic equivalent of "stop admitting prompts to remaining slots"); the
upstream-removed `if (batch.n_tokens >= n_batch) break;` was dropped (now handled
by the top-of-lambda check).
**patch 0016** (`02fa047`, dynamic decode-first budget, supersedes 0013) - 2
conflicts + 1 clean-hunk fix.
- The big budget-block rewrite hunk applied cleanly (its expected parent == the
faithfully-resolved 0013 block).
- Clean-hunk fix: the clean-applied line `const int32_t n_decode_in_batch = batch.n_tokens;`
referenced the `server_batch` member, which has no `.n_tokens` -> changed to
`batch.size()` (== D, the Phase-1 decode load; identical value).
- C-A: while-condition -> took THEIRS (dynamic `prefill_budget_step` +
`prefill_cap_per_slot`), adopted `batch.size()`.
- C-B: admission break -> 0016 dynamic budget check with `break` -> `add_ok = false`,
dropped the upstream-removed `batch.n_tokens >= n_batch` break.
OFF-path invariant verified by construction in all three: with the env knobs
unset (`prefill_budget_step == prefill_cap_per_slot == 0`, `paged_kv_* == false`)
the added conditions never fire, so the scheduler is byte-identical to stock HEAD.
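The `break` -> `add_ok = false` translation can be illustrated with a toy model. This is a deliberately simplified Python stand-in, not the server code: the admission rule, the `budget` parameter (0 meaning "knob unset") and all function names are assumptions made for the example. It shows that a flag checked at the top of a per-slot callback stops admission for the remaining slots exactly as a loop `break` did.

```python
def admit_with_break(prompts, n_batch, budget=0):
    """Pre-refactor shape: a plain loop that stops admitting with `break`."""
    batch, admitted = 0, []
    for i, n_tokens in enumerate(prompts):
        if batch >= n_batch:
            break
        if budget and batch + n_tokens > budget:
            break                      # stop admitting prompts to remaining slots
        batch += n_tokens
        admitted.append(i)
    return admitted


def iterate(slots, fn):
    """Stand-in for the refactored helper: call fn on every slot."""
    for slot in slots:
        fn(slot)


def admit_with_flag(prompts, n_batch, budget=0):
    """Post-refactor shape: a callback that returns early once add_ok is False."""
    state = {"batch": 0, "add_ok": True}
    admitted = []

    def visit(item):
        i, n_tokens = item
        if not state["add_ok"] or state["batch"] >= n_batch:
            return                     # top-of-lambda check replaces the break
        if budget and state["batch"] + n_tokens > budget:
            state["add_ok"] = False    # same effect as the old break
            return
        state["batch"] += n_tokens
        admitted.append(i)

    iterate(list(enumerate(prompts)), visit)
    return admitted


prompts = [100, 200, 300, 50]
print(admit_with_break(prompts, 512, budget=256), admit_with_flag(prompts, 512, budget=256))
print(admit_with_break(prompts, 512), admit_with_flag(prompts, 512))
```

With `budget=0` the added condition never fires in either version, which mirrors the OFF-path argument: the extra check is inert when the knob is unset.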
### Kernel patches: ZERO rebase conflicts
Patches 0017-0024 - which touch the bit-exact compute paths
(`gated_delta_net.cu` +330, `mmq.cu`/`mmq.cuh` +209, `ssm-conv.cu` +112,
`quantize.cu`, `fattn.cu`, `src/models/qwen35.cpp`/`qwen35moe.cpp`/`qwen3next.cpp`,
`src/llama-kv-cache.*`, `src/paged-*`, `tests/test-backend-ops.cpp` +79) - all
applied **cleanly** during the rebase (3-way). No math, reduction order, or kernel
context was touched during conflict resolution.
## Clean rebuild
`cmake --build build-cuda --target clean && cmake --build build-cuda -j20`,
preserving the existing CMakeCache (CMAKE_CUDA_ARCHITECTURES=121, GGML_CUDA=ON,
GGML_CUDA_FA=ON, GGML_CUDA_GRAPHS=ON, GGML_CUDA_NCCL=ON). Result: BUILD_EXIT=0,
all targets at 100%. (The only log "error" is a benign webui `dist.tar.gz`
download miss, unrelated to the gate binaries.)
## GATE: ALL GREEN
(a) `test-backend-ops` (Backend CUDA0):
| op | result |
|----|--------|
| GATED_DELTA_NET | 36/36 OK |
| SSM_CONV | 45/45 OK |
| MUL_MAT | 1146/1146 OK |
| MUL_MAT_ID | 806/806 OK |
(b) greedy md5 (`llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`):
| model | md5 | baseline | verdict |
|-------|-----|----------|---------|
| dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
Bit-exactness preserved across the upstream jump.
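The greedy-md5 gate reduces to hashing the completion output and comparing it with a recorded baseline. A minimal sketch, assuming the hash is taken over the raw output bytes of the `llama-completion` run (the exact bytes hashed are not stated here, and the helper names are hypothetical):

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of the raw bytes."""
    return hashlib.md5(data).hexdigest()


def gate(output: bytes, baseline_md5: str) -> str:
    """PASS only when the output hashes to exactly the recorded baseline."""
    return "PASS" if md5_hex(output) == baseline_md5 else "FAIL"


# Self-check with a well-known digest rather than a real model output.
print(gate(b"hello", "5d41402abc4b2a76b9719d911017c592"))   # PASS
print(gate(b"hellp", "5d41402abc4b2a76b9719d911017c592"))   # FAIL
```

A digest comparison is all-or-nothing: a single differing token anywhere in the 48 generated tokens changes the hash, which is what makes it a bit-exactness gate rather than a similarity metric.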
## Decode bench sanity (rebased build, post-pin-sync)
`llama-batched-bench -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`,
S_TG (decode) tok/s at npl128, patch defaults on:
| model | npl128 S_TG (new tip) | post-0023 reference | delta |
|-------|----------------------|---------------------|-------|
| dense `q36-27b-nvfp4` | **366.41** | 373.2 | -1.8% |
| MoE `q36-35b-a3b-nvfp4` | **751.11** | 745.7 | +0.7% |
Both within the +/-3% noise band -> the patchset performs the same on `9d5d882d`.
(npl32 also matches: dense 205.83 vs 207.6; MoE 438.29 vs 440.0.)
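The delta column and the noise-band check are simple relative differences. A small sketch using the numbers quoted above (helper names are illustrative):

```python
def pct_delta(new, ref):
    """Relative change of `new` against `ref`, in percent."""
    return (new - ref) / ref * 100.0


def within_band(new, ref, band_pct=3.0):
    """True when the change stays inside the +/- band_pct noise band."""
    return abs(pct_delta(new, ref)) <= band_pct


# (new tip, post-0023 reference) decode tok/s, copied from the text.
runs = {
    "dense npl128": (366.41, 373.2),
    "MoE npl128": (751.11, 745.7),
    "dense npl32": (205.83, 207.6),
    "MoE npl32": (438.29, 440.0),
}
for name, (new, ref) in runs.items():
    print(f"{name}: {pct_delta(new, ref):+.1f}% within band: {within_band(new, ref)}")
```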
## Export phase: re-export `.patch` files and pick the ones that changed
The committed `.patch` files were generated against the old base. Each shipped
patch was re-exported from its rebased commit (`git format-patch -1 <commit>`) and
compared body-to-body against the committed file (ignoring the volatile `From`
commit-hash line and the `index` blob-hash lines). Classification:
- **CONTENT (real hunk-body change -> MUST update):** `0008`, `0013`, `0015`, `0016`.
- **LINENUM only (hunk bodies byte-identical, only `@@` line-numbers shifted ->
still apply cleanly, left as-is):** `0009`, `0017`, `0018`, `0019`, `0020`,
`0021`, `0024`.
- **IDENTICAL (no change at all):** `0001`, `0002`, `0003`, `0004`, `0006`,
`0007`, `0010`, `0011`, `0012`, `0014`, `0022`, `0023`.
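The three-way classification can be sketched as a comparison at two levels of normalization. This is an illustrative reimplementation under stated assumptions, not the script actually used: it treats any line starting with `From ` or `index ` as volatile, and it considers two patches LINENUM-equal when they match after the numbers in the `@@` headers are blanked.

```python
import re

# Matches the numeric part of a unified-diff hunk header.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@")


def strip_volatile(text):
    """Drop the commit-hash `From` line and the blob-hash `index` lines."""
    return [ln for ln in text.splitlines()
            if not ln.startswith("From ") and not ln.startswith("index ")]


def strip_line_numbers(lines):
    """Blank out hunk-header line numbers so only hunk bodies are compared."""
    return [HUNK_HEADER.sub("@@", ln) for ln in lines]


def classify(committed, reexported):
    """Return IDENTICAL, LINENUM or CONTENT for a committed vs re-exported patch."""
    a, b = strip_volatile(committed), strip_volatile(reexported)
    if a == b:
        return "IDENTICAL"
    if strip_line_numbers(a) == strip_line_numbers(b):
        return "LINENUM"
    return "CONTENT"
```

Note that a LINENUM result from a text comparison does not by itself prove the committed file still applies; the separate `git apply --check` sweep described next covers that.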
An independent isolated `git apply --check` sweep (each shipped patch vs the
rebased pre-state tree) agreed exactly: the same 4 (`0008`/`0013`/`0015`/`0016`)
are the only ones that no longer `git apply` to `9d5d882d`. The build applies the
series with plain `git apply` (Makefile) which tolerates `@@` line-number offsets,
so the 7 LINENUM patches still apply (verified) and are intentionally not churned.
### 0015 was a 4th change beyond the 3 rebase conflicts
The rebase reported only 3 conflicts (`0008`/`0013`/`0016`). `0015`
(expert-density MoE token-tile auto-select) rebased *cleanly* via 3-way merge, but
its committed `.patch` no longer applies to `9d5d882d` via plain `git apply`:
upstream inserted a new test case
(`test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)`)
in `tests/test-backend-ops.cpp` right at `0015`'s insertion anchor, so the hunk's
context lines shifted. `0015`'s own inserted lines are unchanged - it is a pure
context re-anchor, no behavioral change. This is exactly why a per-patch
re-export/apply-check was run instead of trusting the 3-conflict count.
### What changed in each updated patch (From/index hash noise aside)
- `0008`: same `[paged 0008]` commit block (identical env-guard + `paged_prefix_api::commit`
call), re-indented to the refactored `update_slots` lambda level and re-anchored
after `slot.state = SLOT_STATE_GENERATING;`; `@@` headers updated.
- `0013`: budget var-block / while-gate / admission-break re-expressed against the
refactored loop (`batch.size()`, `add_ok=false`); `@@` headers updated.
- `0015`: hunk context re-anchored around the new upstream test case; inserted
lines identical; `@@` header updated.
- `0016`: dynamic budget block + `n_decode_in_batch = batch.size()` + admission
`add_ok=false` against the refactored loop; `@@` headers updated.
## Equivalence proof (the updated series == the gate-green tree)
The 4 updated files are byte-faithful `git format-patch -1` exports of the
gate-green rebased commits (`240758e`, `6d37431`, `5349f82`, `02fa047`). Applying
the full corrected series (the 19 unchanged committed patches + the 4 re-exports)
in order to a fresh bare `9d5d882d` checkout with plain `git apply` succeeds for
all 23 patches, and the resulting tree is **byte-identical to the gate-green
`paged` tip (`2ee65c2`) for every code file** (`git diff` over all paths except
`*.md` and the unshipped `examples/simple/*` scaffold drivers is empty). So the
shipped `.patch` series reproduces exactly the tree that passed test-backend-ops,
the md5 bit-exact gate, and the bench.
## Shipped-build bug FIXED: stray dev-doc hunks stripped from the patch series
The pin-sync export captured dev-only result/progress docs that live in the DGX
dev tree (`~/llama-paged-dev`) but are ABSENT from a clean `ggml-org/llama.cpp`
checkout. The shipped build applies the paged series with **strict `git apply`**
(the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
`git apply --verbose "$p" || { echo "paged patch failed"; exit 1; }`), which is
atomic: a single hunk against a missing file REJECTS the entire patch and the
`exit 1` fails the build. (`prepare.sh` uses tolerant `patch -pN -N ... || true`,
but it is guarded by the `src/paged-kv-manager.cpp` sentinel and skipped at build
time once the Makefile has applied the series, so the strict `git apply` is the
real shipped path.)
Root failure was patch `0019`'s *modify* hunk against `SSM_DECODE_FIX_RESULTS.md`
(`index 2e7c8c2..77879e4 100644`): on a clean tree `git apply` cannot find the
file to modify ("No such file or directory") and rejects all of `0019`, which
then cascades to `0021`/`0022`/`0026`/`0028` (they build on `0019`'s code). The
build therefore only succeeded on the DGX (where the doc exists) and FAILED on CI
/ any clean checkout.
Fixed by stripping every stray non-source hunk so the patches contain ONLY
llama.cpp source changes. Stripped hunks (dev docs absent from a clean
`9d5d882d` checkout):
| patch | stripped dev-doc hunk(s) | hunk kind |
|-------|--------------------------|-----------|
| `0019` | `SSM_DECODE_FIX_RESULTS.md` | modify (the root reject) |
| `0020` | `LEVER1_OPROJ_MMQ_RESULTS.md` | create |
| `0021` | `CONV_STATE_FUSION_RESULTS.md` | create |
| `0028` | `LEVER1_GATHER_PROGRESS.md`, `LEVER1_GATHER_RESULTS.md` | create |
(The `create` hunks did not reject on their own - `git apply` will create a new
file even on a clean tree - but they polluted the build tree with stray dev docs
and violated the source-only invariant, so they were stripped too.) For each
patch the `diff --git a/<devdoc> ...` section was removed along with its diffstat
per-file line, any `create mode` trailer, and the `N files changed, ...` summary
was corrected; **every llama.cpp SOURCE hunk is byte-identical** (verified by
sha256 of each patch's source-diff tail before vs after the strip).
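A minimal sketch of the core of such a strip: drop every `diff --git` section whose path matches a dev-doc predicate and keep everything else byte-for-byte. The function and predicate are hypothetical, and the sketch deliberately does not do the diffstat, `create mode` and summary-line corrections described above; it also assumes the stripped section is not the last one, since a trailing format-patch signature would be dropped along with it.

```python
def strip_sections(patch_text, is_dev_doc):
    """Remove the `diff --git` sections whose target path satisfies is_dev_doc."""
    out, keep = [], True               # keep the mail header before the first diff
    for line in patch_text.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # Header form: diff --git a/<path> b/<path>
            path = line.split(" b/", 1)[-1].strip()
            keep = not is_dev_doc(path)
        if keep:
            out.append(line)
    return "".join(out)


demo = (
    "From abc Mon Sep 17 00:00:00 2001\nSubject: demo\n---\n"
    "diff --git a/NOTES_RESULTS.md b/NOTES_RESULTS.md\n"
    "--- a/NOTES_RESULTS.md\n+++ b/NOTES_RESULTS.md\n@@ -1 +1,2 @@\n old\n+new\n"
    "diff --git a/src/x.cpp b/src/x.cpp\n"
    "--- a/src/x.cpp\n+++ b/src/x.cpp\n@@ -1 +1,2 @@\n int a;\n+int b;\n"
)
print(strip_sections(demo, lambda p: p.endswith(".md")))
```

Because kept lines are copied unchanged, hashing the source-diff portion before and after the strip, as the verification above did with sha256, is a cheap way to confirm no source hunk was altered.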
Verified on a fresh `git clone` of `ggml-org/llama.cpp` at this pin `9d5d882d`:
- BEFORE the strip, strict `git apply` of the series: OK through `0018`, then
`0019` FAILS ("SSM_DECODE_FIX_RESULTS.md: No such file or directory") -> the
Makefile `exit 1`s; continue-mode shows the full cascade `0019` `0021` `0022`
`0026` `0028` failing.
- AFTER the strip, strict `git apply` of the full series `0001..0030` reaches
**exit 0** (every patch OK, sentinel `src/paged-kv-manager.cpp` created, zero
stray `*_RESULTS.md`/`*_PROGRESS.md` in the tree). The tolerant `patch -p1`
path (prepare.sh fallback) also applies with zero rejects.
## Durable fix: keep patch exports SOURCE-ONLY
The pin-sync / re-export step MUST NOT capture dev-only artifacts into the shipped
`.patch` files. A clean `ggml-org/llama.cpp` checkout contains its own real docs
(`README.md`, `docs/`, `AGENTS.md`, ...) but NOT LocalAI dev notes - anything
matching `*_RESULTS.md`, `*_PROGRESS.md`, `*.diff`, `final_benchmark.csv`,
`LEVER*`, `BENCH*`, `paged-*-bench.cpp`, or any path that does not exist at the
pin is a dev artifact and must be excluded. Concretely, when re-exporting:
- prefer `git format-patch -1 <commit> -- ':!*.md' ':!*.diff' ':!*.csv'` (or an
explicit pathspec of the llama.cpp source dirs `src/ ggml/ common/ include/
tools/ tests/ cmake/`) so dev docs never enter the patch body;
- keep the dev-notes commits SEPARATE from the code commits on the dev branch, so
a per-commit export is naturally source-only;
- after export, gate the result: clone the pin and apply the full series with
strict `git apply` (no `--exclude`, no `|| true`); it MUST reach exit 0. The weekly
canary (`.github/workflows/llama-cpp-paged-canary.yml`) does this against
upstream HEAD; now that the patches are source-only its `0019`
`SSM_DECODE_FIX_RESULTS.md` `--exclude` workaround
(`.github/scripts/paged-canary-apply.sh`) is no longer needed and can be removed
on the next canary touch.
The upcoming `c299a92c` pin-bump re-export MUST follow this: produce source-only
patches and pass the strict-`git apply` gate on a clean checkout before advancing
the pin.
## Historical note (pre-strip)
Before this cleanup, `0019` carried the `SSM_DECODE_FIX_RESULTS.md` modify hunk
identically in the old and new exports (LINENUM class) and was left untouched
during the pin-sync to keep the rebase faithful; `0021`'s
`CONV_STATE_FUSION_RESULTS.md` was a create hunk that applied but still leaked a
dev doc. Both are now removed by the source-only strip above.
## Source of truth
The rebased branch on the DGX dev tree (`~/llama-paged-dev`, branch `paged`, HEAD
`2ee65c2`) is the source of truth; `paged-prerebase-backup` (`a8a9d12`) retains
the pre-rebase state.


@@ -1,286 +0,0 @@
# QUANT_GENERALITY - are the paged decode opts NVFP4-specific or quant-agnostic?
Source-verified classification of the paged decode optimizations (patches 0013-0029)
as either QUANT-AGNOSTIC (operate on the gated-DeltaNet f32/bf16 recurrent state, the
paged serving host path, or the matmul ROUTING - independent of the model's weight
quantization, so they help a Q4_K / Q8_0 / bf16 Qwen3.6 as much as an NVFP4 one) or
NVFP4-SPECIFIC (only fire for / only help GGML_TYPE_NVFP4 weights on a Blackwell GPU).
READ-ONLY, NO GPU. Every classification below is taken from the patch body source,
not from the prose claims. Hardware referenced for the empirical plan only.

---
## 1. THE GROUND TRUTH GATE: what makes anything NVFP4-specific
There is exactly ONE runtime gate in the whole ggml-cuda matmul stack that means
"NVFP4 on Blackwell":
    mmq.cu: const bool use_native_fp4 = blackwell_mma_available(cc)
                                     && (src0->type == GGML_TYPE_NVFP4 ...);
(confirmed in ARCH_GENERALITY_AUDIT.md section gguf-targeting-1 and in patch 0023's
own diff context). A patch is NVFP4-specific iff the code it changes lives INSIDE a
`use_native_fp4` / `type == GGML_TYPE_NVFP4` / `blackwell_mma_available(cc)` branch.
Everything else - the gated-DeltaNet recurrence, the conv update, the SSM/conv state
caches, the MMQ-vs-MMVQ dispatch, the CUDA-graph guard, the host scheduler and paged
pool - is dtype-independent.
The recurrent state is the decisive fact: in this hybrid model the gated-DeltaNet
temporal state, the conv ring state, q/k/v/g/beta and the SSM scratch are ALL
GGML_TYPE_F32 (asserted explicitly in every new op builder: see 0018 ggml.c
`GGML_ASSERT(state->type == GGML_TYPE_F32)`, 0019 same, 0021/0028 conv asserts
`conv_states->type == GGML_TYPE_F32`). The weight quantization type never enters the
recurrence or conv kernels. So any patch that only touches those is quant-agnostic by
construction.

---
## 2. PER-PATCH CLASSIFICATION (with source evidence)
| patch | what it changes | classification | source evidence |
|-------|-----------------|----------------|-----------------|
| 0013 | static per-step prefill-token budget (LLAMA_PREFILL_BUDGET) | QUANT-AGNOSTIC | tools/server/server-context.cpp only; a host scheduler loop bound on prompt-token COUNT; no dtype anywhere; default-off byte-identical |
| 0014 | manual MoE token-tile (mmq_x) cap | QUANT-AGNOSTIC | mmq.cuh `mul_mat_q_case`; cap applies on `args.expert_bounds != nullptr` (the MUL_MAT_ID grouped path) for ANY templated `<type>`; no NVFP4 branch |
| 0015 | density-aware MoE token-tile auto-select | QUANT-AGNOSTIC | mmq.cuh; gate is `expert_bounds != nullptr` + per-expert density only, NEVER on src0 type. PROVEN on a non-NVFP4 model: the measured +4.8% win was Qwen3-Coder-30B (128 larger experts), test gate covers MXFP4 AND NVFP4 |
| 0016 | dynamic decode-first prefill budget (supersedes 0013) | QUANT-AGNOSTIC | update_slots() policy only; "identical decisions paged on or off", zero libllama/dtype touch; default-off |
| 0017 | FP4 GEMM decode mmq_y / minblocks tile tune | NVFP4-SPECIFIC, but DEFAULT-OFF / INERT | mmq.cuh `get_mmq_y_host`: fires only `type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)`. BUT the patch is a recorded NO-BUILD: every occupancy probe REGRESSED (kill-gate tripped), so nothing is enabled by default. Default build is byte-identical to stock; it changes no behavior |
| 0018 | in-place SSM recurrent-state write-back | QUANT-AGNOSTIC | gated_delta_net.cu + ggml.c; operates on the f32 recurrent state cache (`state->type == GGML_TYPE_F32`); removes a D2D f32 state copy. Weights never read by this op |
| 0019 | fused recurrent-state gather (ids read, no get_rows) | QUANT-AGNOSTIC | reads the f32 state cache via ids; builder asserts F32 on q/k/v/g/beta/state/state_dst; mirrors ggml_ssm_scan. No weight dtype involved |
| 0020 | gated-DeltaNet o_proj MMVQ->MMQ reshape | QUANT-AGNOSTIC (routing) | qwen35.cpp/qwen35moe.cpp/qwen3next.cpp: a 2D-vs-3D RESHAPE of the f32 activation so `src1->ne[1]=128` routes to MMQ instead of batch-1 MMVQ. The MMVQ(ne[1]<=8)-vs-MMQ dispatch is a generic ggml-cuda decision present for EVERY quantized type. See section 3 |
| 0021 | in-place conv-state fusion (conv+silu+ring write) | QUANT-AGNOSTIC | ssm-conv.cu + ggml.c new op asserts `conv_states/conv_kernel/x_cur/conv_state_dst == GGML_TYPE_F32`; pure f32 conv-state work |
| 0022 | gated_delta_net_cuda occupancy/coalescing retune | QUANT-AGNOSTIC | gated_delta_net.cu kernel: q/k/v/g/beta/state are all f32; the COLS_PER_WARP/NUM_WARPS fold is a scheduling change on the f32 recurrence. Never touches a weight tensor |
| 0023 | MoE NVFP4 activation-quantize de-dup | NVFP4-SPECIFIC | mmq.cu: the `gather_mmq_fp4` de-dup is INSIDE `if (use_native_fp4) { ... }`. Gathers `block_fp4_mmq`. The non-FP4 path (`quantize_mmq_q8_1_cuda`) is untouched. Confirmed NVFP4-only |
| 0024 | paged-pool burst reclaim (truncate/defrag/release) | QUANT-AGNOSTIC | paged-alloc / paged-kv-manager / llama-kv-cache host accounting; "never KV values or compute, no ggml op touched"; gated behind LLAMA_KV_PAGED |
| 0025 | MoE-decode CUDA-graph re-graph (graph-safe id path) | QUANT-AGNOSTIC (corrects hypothesis) | ggml-cuda.cu: relaxes the MUL_MAT_ID graph guard when `ggml_is_quantized(src0) && ggml_cuda_should_use_mmq(...)`. Gated on the GENERIC quantized-MMQ grouped path, NOT on NVFP4. See section 4 |
| 0026 | hybrid per-head f32/bf16 SSM state (--cache-type-ssm / tau) | QUANT-AGNOSTIC, default-off (and precision-changing) | common/arg.cpp + cparams type_s/type_r + tau; changes the RECURRENT-STATE cache dtype (f32 default, bf16 opt-in). Independent of the weight quant; default tau=0 keeps bit-exact f32 |
| 0028 | residual conv-tap gather fusion (ids read) | QUANT-AGNOSTIC | ssm-conv.cu new SSM_CONV_UPDATE_IDS op reads the f32 conv cache via ids; eliminates the last k_get_rows in the GDN decode path. f32 throughout |
| 0029 | block-table within-step host cache | QUANT-AGNOSTIC | llama-kv-cache.cpp / paged-attn.cpp: memcpy-reuse of an int32 block table across full-attn layers of a step; pure host pipeline, bit-exact |
(There is no patch 0027.)
### Summary count
- QUANT-AGNOSTIC (helps any weight quant): 0013, 0014, 0015, 0016, 0018, 0019, 0020,
0021, 0022, 0024, 0025, 0026, 0028, 0029 - 14 of 16 landed patches.
- NVFP4-SPECIFIC: 0023 (the only landed NVFP4-only optimization) + 0017 (NVFP4-only but
default-off / inert, no measured win).
---
## 3. 0020 IN DETAIL - MMQ-over-MMVQ at batched decode is a win for ANY quantized type
The hypothesis is CONFIRMED. 0020 is not an FP4 trick:
- The gated-DeltaNet op left its output in 3D SSM layout `[value_dim, n_seq_tokens=1,
n_seqs=128]`, so the ssm_out matmul saw `src1->ne[1] = 1` with the 128 sequences
stuck in `ne[2]`.
- ggml-cuda dispatches `ne[1] <= 8` to MMVQ (the batch<=8 GEMV) and larger to MMQ
(the tensor-core GEMM). This `ne[1]`-threshold dispatch is type-INDEPENDENT: it is
the same routing for Q4_K, Q8_0, Q6_K, MXFP4, NVFP4 - every k-/legacy-quant has BOTH
an MMVQ (mmvq.cu vec_dot) AND an MMQ (mmq.cuh) path.
- The fix is a `ggml_reshape_2d` to `[value_dim, n_seq_tokens*n_seqs] = [6144, 128]` so
`src1->ne[1] = 128` routes to the M=128 MMQ GEMM that amortizes the ssm_out weight
read across all 128 sequences. Same contiguous data, bit-identical.
Why it generalizes: at batched decode (npl 32-128) the weight read of ssm_out is the
cost, and MMVQ at the degenerate batch-1 shape re-reads / fails to amortize the weight
for whatever dtype the weight is. MMQ at M=128 reads each weight tile once for all 128
tokens. That amortization is a pure bandwidth win that exists for every quantized
weight type, not just NVFP4. A Q4_K or Q8_0 Qwen3.6 has the exact same 3D-SSM-output ->
batch-1-MMVQ pathology and gets the same MMQ amortization from the reshape. (The patch
already routes the in-projection through MMQ; only the output was stuck in 3D.)
The same logic underwrites 0014/0015 (the MoE `mmq_x` token-tile is a generic grouped-
MMQ knob; the win was measured on a non-NVFP4 Qwen3-Coder-30B) and 0025 (section 4).
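
The routing argument in this section can be sketched as a toy model. This is a minimal illustration, assuming the dispatch depends only on `src1->ne[1]` with the `<= 8` threshold quoted above; the real ggml-cuda dispatch checks more conditions, and `route` / `reshape_2d` are hypothetical helper names, not ggml functions.

```python
from math import prod

MMVQ_MAX_BATCH = 8  # threshold quoted in the text: ne[1] <= 8 goes to MMVQ


def route(src1_ne):
    """Toy dispatch: pick the kernel family from src1->ne[1] only."""
    return "MMVQ" if src1_ne[1] <= MMVQ_MAX_BATCH else "MMQ"


def reshape_2d(ne):
    """Collapse [d0, n_seq_tokens, n_seqs] into [d0, n_seq_tokens * n_seqs]."""
    d0, n_seq_tokens, n_seqs = ne
    return (d0, n_seq_tokens * n_seqs)


ssm_3d = (6144, 1, 128)       # 3D SSM layout: the 128 sequences sit in ne[2]
ssm_2d = reshape_2d(ssm_3d)   # 2D layout: the 128 sequences sit in ne[1]

# Same element count, so the reshape is a view over the same contiguous data.
assert prod(ssm_3d) == prod(ssm_2d)

print(route(ssm_3d), "->", route(ssm_2d))
```

Nothing in `route` looks at a weight dtype, which is the point of the section: the same shape change flips the route for any quantized weight that has both kernel families.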
---
## 4. 0025 CORRECTS THE HYPOTHESIS - it is quant-agnostic, not NVFP4-specific
The hypothesis listed "the act-quant / quantize_mmq_nvfp4 portions of 0025" as
NVFP4-specific. That is a patch-number mismatch. The ACTUAL patch 0025
(0025-qwen35moe-nvfp4-moe-decode-regraph.patch) does NOT contain any act-quant /
quantize_mmq_nvfp4 code. Its entire diff is one hunk in ggml-cuda.cu:

    bool mmid_needs_sync = !ggml_is_quantized(src0->type) || node->ne[2] > mmvq_mmid_max;
    if (mmid_needs_sync && ggml_is_quantized(src0->type) &&
            getenv("LLAMA_MOE_FORCE_GRAPHS") &&
            ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[2], src0->ne[2])) {
        mmid_needs_sync = false; // keep CUDA graphs on for the grouped-MMQ id path
    }

The relax condition is `ggml_is_quantized(src0->type) && ggml_cuda_should_use_mmq(...)`
- the GENERIC quantized grouped-MMQ id-path, NOT NVFP4. `should_use_mmq()` returns true
for Q4_K / Q8_0 / etc. at large enough batch just as for NVFP4. So a Q4_K or Q8_0 MoE
Qwen3.6 whose MUL_MAT_ID takes the grouped MMQ path also keeps CUDA graphs across the
MoE decode step under LLAMA_MOE_FORCE_GRAPHS. 0025 is quant-agnostic.
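
The guard can be restated as a small truth-function to make the quant-agnostic claim easy to check. This is a sketch, not the ggml-cuda code: `uses_mmq` stands in for the result of `ggml_cuda_should_use_mmq(...)`, `force_graphs` for the `LLAMA_MOE_FORCE_GRAPHS` environment check, and the `mmvq_mmid_max` value passed in the example is a hypothetical placeholder.

```python
def mmid_needs_sync(is_quantized, ne2, mmvq_mmid_max, force_graphs, uses_mmq):
    """Mirror of the 0025 hunk: True means the MUL_MAT_ID node blocks CUDA graphs."""
    needs_sync = (not is_quantized) or ne2 > mmvq_mmid_max
    # Relax the guard only for the grouped-MMQ id path, and only when opted in.
    if needs_sync and is_quantized and force_graphs and uses_mmq:
        needs_sync = False
    return needs_sync


# Any quantized weight that takes the grouped MMQ path behaves the same way:
# the function has no input that distinguishes NVFP4 from Q4_K or Q8_0.
large_batch = dict(is_quantized=True, ne2=128, mmvq_mmid_max=8, uses_mmq=True)
print(mmid_needs_sync(force_graphs=False, **large_batch))
print(mmid_needs_sync(force_graphs=True, **large_batch))
```

The only type-dependent inputs are `is_quantized` and whatever `ggml_cuda_should_use_mmq` decides, which matches the reading above that the relaxation is gated on the generic quantized-MMQ path.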
LEVER2_GRAPH_COVERAGE_RESULTS.md confirms this is the role of 0025 ("0025's
[TAG_MUL_MAT_ID_CUDA_GRAPHS] env-gate keeps the grouped MMQ id-path graph-safe").
Where the hypothesis's "act-quant / quantize_mmq_nvfp4" actually lives: that is
LEVER 3 (LEVER3_ACTQUANT_FUSION_RESULTS.md - fuse W4A4 act-quant into RMSNorm/SiLU),
which is genuinely NVFP4-specific, BUT it was a measurement STOP and NEVER LANDED (no
patch 0030, no commit). Likewise LEVER 4 (NVFP4 the still-bf16 GDN/attn projections,
LEVER4_PROJNVFP4_RESULTS.md) is NVFP4-specific but FAILED its KL gate (~6% PPL) and was
NOT shipped. So the only NVFP4-specific code that actually landed is 0023 (+ inert 0017).
### Net correction to the hypothesis
- 0018/0019, 0021, 0022, 0028, 0026, 0013/0016, 0029, 0020: CONFIRMED quant-agnostic.
- 0023: CONFIRMED NVFP4-specific.
- 0025: WRONG in the hypothesis -> it is QUANT-AGNOSTIC (CUDA-graph guard on the generic
quantized grouped-MMQ path). The NVFP4-specific "act-quant" work the hypothesis was
thinking of is LEVER 3, which is unshipped (STOP), not patch 0025.
- Bonus: 0014/0015 (not in the hypothesis) are quant-agnostic, and 0017 is
NVFP4-specific but default-off/inert.
---
## 5. RELATIVE-IMPACT BY WEIGHT-QUANT SIZE
Decode is bandwidth-bound on the weight read. The quant-agnostic opts target work whose
absolute cost does NOT depend on the weight quant: the f32 recurrence, the f32 conv state, the
host pipeline. The weight-read buckets (MoE expert GEMM + dense projections) scale
~linearly with bits-per-weight. So the quant-agnostic opts deliver the same ABSOLUTE
millisecond saving at every quant, but the RELATIVE % shrinks as the weight grows.
Anchor: the measured MoE q36-35b-a3b NVFP4 decode step (MOE_GAP_VS_VLLM.md, step =
169.8 ms, GPU-busy 97.5%), split into quant-agnostic vs weight-quant-scaling buckets:
| bucket | ms/step @ NVFP4 | scales with weight bits? | which opts touch it |
|--------|-----------------|--------------------------|---------------------|
| Recurrence core (gated_delta_net) | 70.0 | NO (f32 state) | 0022 |
| Recurrent-state + conv gather/plumbing (k_get_rows 5.2 + ssm_conv 3.4) | ~8.6 | NO (f32) | 0018/0019/0021/0028 |
| Host bubble (sample+batch+block-table) | 4.2 | NO (host) | 0013/0016/0024/0029 |
| Router / norms / glue | ~5.4 | mostly NO | 0014/0015 partial |
| MoE expert GEMM | 47.3 | YES (4-bit now) | (weight read) |
| Dense GDN/attn projections + convert glue | 20.3 | YES | (weight read) |
| W4A4 act-quant tax (quantize_mmq_nvfp4) | 3.3 | (FP4 only) | 0023 |
Quant-agnostic, weight-size-fixed total: ~70.0 + 8.6 + 4.2 + 5.4 = ~88 ms (~52% of the
NVFP4 step). Weight-read buckets: 47.3 + 20.3 = ~67.6 ms (~40%).
Model the weight-read buckets as scaling with bytes-per-weight relative to NVFP4 (4-bit
= 1x): Q8_0 ~ 2x, bf16 ~ 4x. Hold the ~88 ms fixed (the recurrence f32 byte stream and
host time do not change with the weight quant), and recompute the recurrence/host
fraction of the step:
| weight quant | weight-read buckets (ms, est.) | fixed quant-agnostic (ms) | step (ms, est.) | recurrence+host % of step |
|--------------|--------------------------------|---------------------------|-----------------|---------------------------|
| NVFP4 (4-bit) | ~68 (1x) | ~88 | ~156 (~159 with act-quant ~3) | ~55% of the estimate; ~52% of the measured 169.8 ms step (measured ~50%) |
| Q8_0 (8-bit) | ~136 (2x) | ~88 | ~224 | ~39% |
| bf16 (16-bit) | ~272 (4x) | ~88 | ~360 | ~24% |
Reading this:
- The quant-agnostic SSM/serving opts deliver the SAME ~ms savings at Q8/bf16 as at
NVFP4 (they remove fixed f32/host work). The headline % speedups quoted in the patch
bodies (e.g. 0019 dense npl128 +37.8%, 0020 +31.7%, 0022 +11.1%) are the LARGEST at
NVFP4 precisely because the fixed recurrence is the biggest fraction of the smallest
(4-bit weight) step. The same absolute removal is a smaller % of a Q8 step and a much
smaller % of a bf16 step, because the weight-read denominator grows.
- This MATCHES the brief's decomposition framing (recurrence ~40-50%, GEMM ~26-28% at
NVFP4): at NVFP4 the recurrence dominates, so the recurrence-targeting opts are where
the win is; as the weight quant grows the GEMM dominates and the recurrence opts
matter relatively less (but never zero, and never negative).
- Corollary: the ONE NVFP4-specific landed lever, 0023, only addresses the ~3.3 ms FP4
act-quant tax (and only the broadcast up/gate share of it). That is the smallest bucket,
and its measured win is +1.7%. The big bit-exact wins are all quant-agnostic.
So the optimization set is overwhelmingly general: a Q4_K / Q8_0 / bf16 Qwen3.6 gets the
full recurrence + conv + serving + MMQ-routing benefit; only the small FP4 act-quant
de-dup (0023) does nothing for it (and the inert 0017 was never enabled).
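
The scaling model of this section is simple enough to recompute. The sketch below uses the rounded bucket totals from the text (~88 ms fixed, ~68 ms weight read at NVFP4) and the stated 1x / 2x / 4x bytes-per-weight factors; it leaves out the ~3.3 ms FP4 act-quant bucket, and the names are illustrative only.

```python
FIXED_MS = 88.0               # recurrence + conv plumbing + host + glue (quant-agnostic)
WEIGHT_READ_MS_NVFP4 = 68.0   # MoE expert GEMM + dense projections at 4-bit
BYTES_SCALE = {"NVFP4": 1, "Q8_0": 2, "bf16": 4}  # bytes per weight relative to NVFP4


def modeled_step(quant):
    """Return (estimated step ms, fixed share of that step) for a weight quant."""
    weight_ms = WEIGHT_READ_MS_NVFP4 * BYTES_SCALE[quant]
    step_ms = weight_ms + FIXED_MS
    return step_ms, FIXED_MS / step_ms


for quant in BYTES_SCALE:
    step_ms, share = modeled_step(quant)
    print(f"{quant:6s} step ~{step_ms:.0f} ms, fixed share ~{share:.0%}")
```

For NVFP4 this modeled share comes out higher than the ~52% figure, because that figure divides the same ~88 ms by the measured 169.8 ms step rather than by the sum of the two modeled buckets.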
---
## 6. EMPIRICAL CONFIRMATION PLAN (specify only - DO NOT run; the GPU is busy)
Goal: prove on hardware that the quant-agnostic opts FIRE and LIFT a non-NVFP4 Qwen3.6,
isolating them from the one NVFP4-specific lever.
### 6.1 Hardware
GB10 / DGX Spark (sm_121), when free. The DGX has live deployments; this plan is
read-only until then. (Any Blackwell or non-Blackwell CUDA host also works to prove
quant-GENERALITY - the recurrence/serving opts are not Blackwell-gated; only the NVFP4
FP4-MMA tier is. Running on a non-Blackwell card would ALSO demonstrate the opts help
where there is no use_native_fp4 path at all - a strong second proof.)
### 6.2 Build the non-NVFP4 control GGUF first (prerequisite)
The same Qwen3.6 architecture, re-quantized so the weights are NOT NVFP4 but the
gated-DeltaNet/conv recurrence is still f32:
- Source: the existing q36-27b (dense) and/or q36-35b-a3b (MoE) f16/bf16 GGUF already
on the DGX (~/work/darwin_36b_opus/f16.gguf is the MoE f16 used as the LEVER4 KL
base; an equivalent dense f16 exists).
- Produce: `llama-quantize f16.gguf q36-27b-Q4_K_M.gguf Q4_K_M` (primary control) and
optionally `... Q8_0` and keep the f16/bf16 as the 16-bit control. Q4_K_M is the
cleanest contrast: 4-bit like NVFP4 but a totally different (k-quant, non-FP4-MMA)
weight path, so any shared win is provably from the f32 recurrence / routing, not
from FP4.
- Note: this requantize is free (no retrain) and must be done before any A/B.
### 6.3 Bit-exact gate per path (same method as the patch bodies)
For the bit-EXACT quant-agnostic opts (0018/0019/0020/0021/0022/0028/0029 and the
host 0013/0016/0024 default-off), the gate is: greedy `llama-completion --temp 0
--seed 1 --ignore-eos -n 256`, md5 of the output, patches-ON == patches-OFF on the
Q4_K_M control. Per path:
- non-paged Q4_K vs paged Q4_K: expect the same benign paged-reduction FP-order
delta noted in PAGED_BITEXACT_NOTE.md / 0029. Gate with KLD/PPL, not md5, across the
paged boundary; output stays md5-exact within a fixed paged/non-paged setting.
- patches-on vs patches-off (see toggles 6.4) on the Q4_K control: byte-identical md5.
- 0026 (bf16 SSM state) is precision-CHANGING -> gate with KLD-to-f16 + PPL, not md5,
exactly like LEVER4 did; default tau=0 stays md5-exact.
- test-backend-ops on the build: GATED_DELTA_NET, SSM_CONV, SSM_CONV_UPDATE,
SSM_CONV_UPDATE_IDS, MUL_MAT, MUL_MAT_ID, GET_ROWS all green (these op tests are
dtype-parametrized and already include non-FP4 types).
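
The md5 half of the gate amounts to a byte-for-byte comparison of two greedy outputs. A minimal sketch, assuming the two `llama-completion` outputs have already been captured as bytes (how they are captured is outside this sketch, and the function names are hypothetical):

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex md5 digest of a captured greedy output."""
    return hashlib.md5(data).hexdigest()


def bit_exact(patches_on: bytes, patches_off: bytes) -> bool:
    """True when the patches-ON and patches-OFF outputs hash identically."""
    return md5_hex(patches_on) == md5_hex(patches_off)


# Placeholder byte strings standing in for two captured outputs.
print(bit_exact(b"same tokens", b"same tokens"))
print(bit_exact(b"same tokens", b"one token differs"))
```

This only applies within a fixed paged or non-paged setting; the precision-changing paths (0026, and the paged boundary itself) need the KLD/PPL gate instead.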
### 6.4 The clean A/B (decode_agg, llama-batched-bench)
Two arms, SAME Q4_K_M control GGUF, `-fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`,
report S_TG (decode aggregate), median of 5 reps:
- Arm A (patches-OFF baseline): the cleanest is two builds - the pre-0018 paged commit
(the SSM opts not yet present) vs HEAD. If a rebuild is not wanted, approximate
OFF on the single HEAD binary by setting every disabling toggle at once:
fused GDN off (cparams.fused_gdn_ar/ch path disabled - the "fusion off" mode the
patch docs A/B against), `GDN_NW=4 GDN_CPW=1` (0022 pre-retune), `LLAMA_MOE_AUTO_TILE=0`
(0015), no `LLAMA_MOE_FORCE_GRAPHS` (0025 off), `LLAMA_PAGED_NO_BT_CACHE=1` (0029),
`LLAMA_PAGED_NO_RECLAIM=1` (0024), `LLAMA_PREFILL_BUDGET`/`LLAMA_MAX_BATCH_TOKENS`
unset (0013/0016), tau=0 / ctssm f32 (0026). The two-build form is preferred for a
publishable number; the env form is a fast same-binary sanity A/B.
- Arm B (patches-ON default): stock defaults (fusion on, 16x8, auto-tile on,
FORCE_GRAPHS on for the MoE graph arm, bt-cache on, reclaim on).
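
The same-binary Arm A environment can be assembled programmatically so that no toggle is forgotten. The sketch below covers only the environment variables named in the list above; the fused-GDN switch and the tau / SSM cache type are not environment variables in this text, so they are out of scope here. `arm_a_env` is a hypothetical helper name.

```python
# Toggles that must be SET to approximate patches-OFF on the HEAD binary.
ARM_A_SET = {
    "GDN_NW": "4",                    # 0022 pre-retune
    "GDN_CPW": "1",                   # 0022 pre-retune
    "LLAMA_MOE_AUTO_TILE": "0",       # 0015 off
    "LLAMA_PAGED_NO_BT_CACHE": "1",   # 0029 off
    "LLAMA_PAGED_NO_RECLAIM": "1",    # 0024 off
}

# Toggles that must be UNSET (0025 and 0013/0016 are opt-in via these).
ARM_A_UNSET = ("LLAMA_MOE_FORCE_GRAPHS", "LLAMA_PREFILL_BUDGET", "LLAMA_MAX_BATCH_TOKENS")


def arm_a_env(base_env):
    """Return a copy of base_env with every disabling toggle applied at once."""
    env = dict(base_env)
    env.update(ARM_A_SET)
    for name in ARM_A_UNSET:
        env.pop(name, None)
    return env


print(sorted(arm_a_env({"LLAMA_KV_PAGED": "1", "LLAMA_MOE_FORCE_GRAPHS": "1"})))
```

The returned dict can be passed as the `env` argument of a process launcher; the two-build form remains the preferred source for a publishable number.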
### 6.5 What result confirms quant-generality
1. The quant-agnostic opts FIRE on Q4_K: nsys on Arm B (Q4_K) shows the same kernel
deltas the NVFP4 runs showed - `k_get_rows_float` bucket collapses (0019/0028),
`concat_cont` + decode `cpy_scalar` gone and `ssm_conv_update` present (0021), the
o_proj `mul_mat_vec_q m=1` bucket gone and absorbed into `mul_mat_q m=128`
(0020 - now a Q4_K MMQ kernel, proving the routing win is not FP4-bound),
`get_block_table` host time down ~90% (0029).
2. The opts LIFT the non-NVFP4 model: Arm B S_TG > Arm A S_TG on the Q4_K control at
npl 32 and 128, with the recurrence/routing opts contributing the bulk (expect a
smaller % than the NVFP4 runs per section 5, but clearly positive and of the same
absolute ms order).
3. The NVFP4-specific lever does NOTHING on Q4_K: toggling 0023
(`GGML_CUDA_MOE_QUANT_DEDUP=0` vs default) shows ZERO delta on the Q4_K MoE control
(it never enters the `use_native_fp4` branch) - the negative control that isolates
the one NVFP4-only optimization from the general ones.
A clean pass = Arm B beats Arm A on Q4_K with the SSM/conv/routing/host kernel deltas
present and 0023 inert. That proves the decode wins are quant-general; NVFP4 is just the
weight quant where they show the largest PERCENTAGE because its weight read is smallest.
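
Criteria 2 and 3 can be written down as a check over the median-of-5 S_TG values. This is a sketch under assumptions: `noise_pct` is a hypothetical run-to-run tolerance that this plan does not specify, and criterion 1 (the nsys kernel deltas) is a manual inspection that the code does not cover.

```python
from statistics import median


def lift_pct(baseline_reps, candidate_reps):
    """Percent change of the candidate median S_TG over the baseline median."""
    base, cand = median(baseline_reps), median(candidate_reps)
    return (cand / base - 1.0) * 100.0


def clean_pass(arm_a, arm_b, dedup_off, dedup_on, noise_pct=1.0):
    """Arm B must beat Arm A; toggling 0023 must stay within the noise band."""
    lifts = lift_pct(arm_a, arm_b) > noise_pct
    inert = abs(lift_pct(dedup_off, dedup_on)) <= noise_pct
    return lifts and inert
```

The check would be run once per npl (32 and 128) on the Q4_K control, with the four rep lists taken from `llama-batched-bench`.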
---
## 7. ONE-LINE VERDICT
14 of the 16 landed paged decode patches (0013-0029) are quant-agnostic: they act on the
f32 gated-DeltaNet/conv recurrent state, the paged serving host path, or the generic
MMQ-vs-MMVQ / CUDA-graph routing, none of which read the weight tensor's quant type. Only
0023 is genuinely NVFP4-specific (and 0017 is NVFP4-only but default-off/inert). The
hypothesis was right except for 0025, which is quant-agnostic (a generic
`ggml_is_quantized && should_use_mmq` CUDA-graph guard); the NVFP4-specific "act-quant"
work it was conflated with is LEVER 3, which never shipped. The opts deliver fixed
absolute ms savings at any weight quant; the % is largest at NVFP4 only because its
4-bit weight read makes the fixed recurrence the biggest slice of the step.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,464 +0,0 @@
# Qwen3.6 NVFP4-vs-NVFP4: llama.cpp vs vLLM on GB10 (DGX Spark)
Apples-to-apples benchmark. Both engines run the **same NVFP4 weights** on the **same box**
(GB10, sm_121, LPDDR5x unified memory ~273 GB/s). The question is not "who wins the HW
lottery" but "at matched NVFP4, on one bandwidth-limited box, does our paged llama.cpp
(patch 0015, expert-density-aware MoE token-tile auto-select, default-on) sit at par with /
ahead of / behind vLLM?"
---
# FINAL shipping benchmark (patch 0023, f32 bit-exact build) — 2026-06-26
This is the **publishable, plot-ready** apples-to-apples result. Both engines at their **best
realistic config** (no handicapping either side), matched NVFP4 weights, one clean GB10 box
(LocalAI service containers stopped for the duration, restored after). Raw rows in
[`final_benchmark.csv`](final_benchmark.csv); per-row checkpoint log in
[`BENCHMARK_PROGRESS.md`](BENCHMARK_PROGRESS.md).
## Build under test (the clean shipping result)
- **llama.cpp** = patch **0023**, dev tree `~/llama-paged-dev` HEAD **`f7409c2`**, git-clean
(the shelved bf16-GDN-state work was reverted; `git diff` empty at HEAD before the
`build-cuda` rebuild). Greedy gate confirmed canonical f32 output on both models. The bf16
GDN-state path is **shelved** (it fails the f32 KL gate); the shipped plateau is the
**95%-bit-exact f32** stack (patches 0018-0023). dense greedy md5 `5951a5b4…`, MoE
`07db32c2…` are the 0023 references (the *transcript* md5 also encodes llama-cli UI chrome,
which has since changed, so the build was verified instead via the clean git tree + full
rebuild + the greedy numerical gate).
## Config (both engines at BEST realistic config)
- **llama-server**: `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
`LLAMA_KV_PAGED=1`, **CUDA graphs ON** (`USE_GRAPHS=1`, default), and the QoS prefill budget
**`LLAMA_MAX_BATCH_TOKENS=512`** (patch 0016 decode-first dynamic budget). 512 is the
`n_ubatch` floor and is the best of the swept budgets: at npl32 it gives 133 s TTFT vs
**394 s for stock** (no budget) — lower budget = stronger decode-first = better burst TTFT,
and decode throughput is budget-independent.
- **vLLM 0.23.0**: its strongest honest decode config — **CUDA graphs ON** (NOT
`--enforce-eager`; `cudagraph_mode=FULL_AND_PIECEWISE`), `--gpu-memory-utilization 0.85
--max-model-len 4096 --max-num-seqs 256 -tp 1`, chunked prefill on, prefix caching off.
- **Client** (`h2h_cli3.py`, identical async harness both sides): 512-token **unique-nonce**
prompt (fresh full prefill every request, defeats all prefix caching), `max_tokens=256`,
`temperature=0`, `ignore_eos=True`, streaming with usage; concurrency npl 8/32/64/128.
- **Precision asymmetry (in llama's disfavour, yet llama still competes)**: llama runs
**f32 GDN recurrent state + q8 activations**; vLLM runs **bf16 GDN state + w4a4**. The
numbers below are llama at *higher* precision.
## DENSE — Qwen3.6-27B NVFP4 (`q36-27b-nvfp4`)
| npl | engine | decode_agg t/s | decode_perseq t/s | prefill t/s | ttft_mean ms | peak_gb | engine_gb |
|----:|--------|---------------:|------------------:|------------:|-------------:|--------:|----------:|
| 8 | llama | **82.5** | 9.57 | 507 | 6 038 | 53.5 | 50.2 |
| 8 | vLLM | 70.4 | 8.76 | 2096 | 1 861 | 110.9 | 107.6 |
| 32 | llama | **192.6** | 4.79 | 115 | 133 552 | 69.6 | 66.3 |
| 32 | vLLM | 211.8 | 6.28 | 2183 | 5 353 | 110.9 | 107.6 |
| 64 | llama | **277.8** | 3.09 | 96 | 321 619 | 84.0 | 80.6 |
| 64 | vLLM | 309.1 | 4.38 | 2089 | 9 512 | 110.9 | 107.6 |
| 128 | llama | **384.6** | 1.86 | 70 | 902 763 | 93.8 | 90.5 |
| 128 | vLLM | 418.8 | 2.79 | 1929 | 18 450 | 111.0 | 107.6 |
**llama decode as % of vLLM (dense):** npl8 **117%**, npl32 **91%**, npl64 **90%**, npl128 **92%**.
## MoE — Qwen3.6-35B-A3B NVFP4 (`q36-35b-a3b-nvfp4`)
| npl | engine | decode_agg t/s | decode_perseq t/s | prefill t/s | ttft_mean ms | peak_gb | engine_gb |
|----:|--------|---------------:|------------------:|------------:|-------------:|--------:|----------:|
| 8 | llama | 211.8 | 24.45 | 1236 | 2 477 | 39.7 | 36.1 |
| 8 | vLLM | 256.5 | 31.84 | 5187 | 769 | 109.6 | 106.3 |
| 32 | llama | 393.0 | 10.02 | 1214 | 8 225 | 47.1 | 43.8 |
| 32 | vLLM | 500.8 | 14.90 | 6223 | 1 830 | 109.6 | 106.4 |
| 64 | llama | 527.0 | 6.15 | 1152 | 15 850 | 57.1 | 53.8 |
| 64 | vLLM | 686.1 | 9.83 | 5927 | 3 224 | 109.6 | 106.4 |
| 128 | llama | 726.4 | 3.73 | 277 | 213 017 | 61.5 | 58.2 |
| 128 | vLLM | 882.2 | 6.05 | 5301 | 6 488 | 109.6 | 106.4 |
**llama decode as % of vLLM (MoE):** npl8 **83%**, npl32 **78%**, npl64 **77%**, npl128 **82%**.
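
The two percentage lines can be recomputed directly from the `decode_agg` columns of the tables above. A small sketch (the dictionary layout and function name are illustrative; the values are copied from the tables):

```python
# npl -> (llama decode_agg t/s, vLLM decode_agg t/s), copied from the FINAL tables.
DENSE = {8: (82.5, 70.4), 32: (192.6, 211.8), 64: (277.8, 309.1), 128: (384.6, 418.8)}
MOE = {8: (211.8, 256.5), 32: (393.0, 500.8), 64: (527.0, 686.1), 128: (726.4, 882.2)}


def llama_pct_of_vllm(rows):
    """Whole-percent llama decode as a share of vLLM decode, per npl."""
    return {npl: round(100.0 * llama / vllm) for npl, (llama, vllm) in rows.items()}


print("dense:", llama_pct_of_vllm(DENSE))
print("MoE:  ", llama_pct_of_vllm(MOE))
```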
## Plots (decode throughput vs concurrency)
Generated from [`final_benchmark.csv`](final_benchmark.csv) (matplotlib); the per-point label is
llama as a share of vLLM decode at that concurrency.
![dense decode vs npl](qwen36_dense_decode_vs_npl.png)
![MoE decode vs npl](qwen36_moe_decode_vs_npl.png)
## The honest public story (let the numbers speak)
1. **Decode throughput — the headline.** On the dense 27B, paged llama.cpp **matches/beats
vLLM**: 117% of vLLM at npl8 and a steady **90-92%** across npl32-128 — at *higher*
precision (f32 GDN state + q8 act vs vLLM bf16 + w4a4). On the MoE 35B-A3B llama lands at
**77-83%** of vLLM decode — close, but vLLM's fused grouped-GEMM MoE keeps a clear edge.
2. **Memory — a decisive llama win.** vLLM's pre-reserved pool is a **flat ~107 GB** at every
concurrency (the `--gpu-memory-utilization 0.85` design). llama's **on-demand paged KV**
uses **50-90 GB (dense)** and **36-58 GB (MoE)**, growing with load: at the operating point
most people actually run (npl≤32) llama uses **~1.5-3× less unified memory**, and even at
npl128 it stays below vLLM. This is the "fits where vLLM OOMs" axis.
3. **TTFT — vLLM's win, llama's disclosed tradeoff.** vLLM's chunked prefill absorbs a
128-way simultaneous burst gracefully (6-18 s). llama's decode-first QoS budget protects
decode throughput by throttling burst-prefill, so TTFT climbs at high concurrency
(dense npl128 **903 s**, MoE npl128 **213 s**). It is *bounded relative to no-budget*
(stock is worse) but high in absolute terms under a synchronized burst. Under realistic
staggered arrival this is far milder; for a synchronized-burst benchmark it is the cost of
the decode-first scheduler. **Decode and memory are unaffected.**
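
The memory claim in point 2 can be checked against the `engine_gb` columns. A sketch, assuming `engine_gb` is the column being compared (the `peak_gb` column gives similar ratios); the layout and names are illustrative:

```python
# npl -> (llama engine_gb, vLLM engine_gb), copied from the FINAL tables.
ENGINE_GB = {
    "dense": {8: (50.2, 107.6), 32: (66.3, 107.6), 64: (80.6, 107.6), 128: (90.5, 107.6)},
    "moe": {8: (36.1, 106.3), 32: (43.8, 106.4), 64: (53.8, 106.4), 128: (58.2, 106.4)},
}


def memory_ratio(model):
    """How many times more unified memory vLLM holds than llama, per npl."""
    return {npl: vllm / llama for npl, (llama, vllm) in ENGINE_GB[model].items()}


for model in ENGINE_GB:
    print(model, {npl: f"{r:.2f}x" for npl, r in memory_ratio(model).items()})
```

The ratio shrinks as npl grows because llama's paged KV grows with load while vLLM's pool is fixed, but it stays above 1x at every measured point.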
**Bottom line for the GB10 / DGX Spark page:** with matched NVFP4 weights, paged llama.cpp
delivers **90-117% of vLLM dense decode** and **77-83% of vLLM MoE decode** at **equal-or-higher
precision** and **1.5-3× lower memory** (on-demand paged KV vs a fixed 107 GB pool). The
remaining gap is MoE-decode and burst-TTFT, not dense-decode or memory.
## Anomalies / methodology notes (rigour)
- **Paged-pool burst degradation (real, worked around).** After a high-npl burst, a llama
server's *subsequent lower-npl* prefill collapses (npl8 fresh = 507 t/s / 6 s TTFT; the same
npl8 *after* an npl64 burst = 65 t/s / 64 s TTFT). Decode is unaffected. To measure clean
per-config prefill/TTFT, **the llama server is restarted per npl** (cheap vs the prefill
cost). vLLM has no such degradation — verified by an end-of-sweep npl8 re-check that matched
the opening npl8 (dense 70.4→73.4, MoE 256.5→226.4) — so vLLM uses one server per combo.
- **Fresh-prefill discipline.** Every measured request uses a unique nonce so prefill is always
a full fresh compute (the task's "defeat prefix caching" intent); vLLM ran with
`enable_prefix_caching=False`, llama with `cache_prompt:false`. Apples-to-apples.
- **No bimodality observed.** With per-npl restart + a cheap (ptok=8) graph warmup, the early
two-pass checks matched within <0.5% (npl8 486/484 t/s), so the headline uses one stable
measured pass per (model,engine,npl).
- **Clean environment.** The benchmark's peak (dense ~94 GB) plus the idle LocalAI worker's
~30 GB resident model OOM-cycled the service containers on the first attempt and corrupted
one run; the `local-ai`/`local-ai-worker` containers were stopped for the measurement
(baseline ~3.3 GB, ~120 GB free) and **restarted afterwards** to return the host.
- **peak_gb** is absolute unified-memory used (`MemTotal-MemAvailable`) peak; `engine_gb` =
peak minus the ~3.3 GB OS baseline (the per-config engine footprint).
- **Internal-consistency check (decode_agg vs perseq×npl).** `decode_agg_tps` is the steady-state
aggregate over the decode window; `decode_perseq_tps` is each sequence's lifetime rate (output
tokens ÷ total request latency, so it *includes* the TTFT queue wait). They coincide when
TTFT ≪ decode-window (vLLM npl8: 70.4 vs 70.1, +0.5%) and diverge exactly as TTFT grows, on
**both** engines (the agg vs perseq×npl gap rises monotonically with `ttft_mean`: vLLM 0.5%→17%,
llama 8%→62% across npl8→128, mirroring its 6 s→903 s TTFT). The relationship is governed by
TTFT, not a measurement artifact, and the FINAL rows are distinct from the historical patch-0015
table (no stale-baseline carry-over).
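
The internal-consistency figures can be reproduced from the dense table. The sketch below assumes the gap is defined as `decode_agg / (decode_perseq × npl) - 1`, which is the definition that matches the quoted percentages; the names are illustrative.

```python
# npl -> (decode_agg t/s, decode_perseq t/s), copied from the dense FINAL table.
DENSE = {
    "llama": {8: (82.5, 9.57), 32: (192.6, 4.79), 64: (277.8, 3.09), 128: (384.6, 1.86)},
    "vLLM": {8: (70.4, 8.76), 32: (211.8, 6.28), 64: (309.1, 4.38), 128: (418.8, 2.79)},
}


def gap_pct(agg, perseq, npl):
    """Percent by which the aggregate exceeds per-sequence rate times concurrency."""
    return (agg / (perseq * npl) - 1.0) * 100.0


def gaps(engine):
    """Gap per npl, in ascending npl order."""
    return [gap_pct(agg, perseq, npl) for npl, (agg, perseq) in sorted(DENSE[engine].items())]


for engine in DENSE:
    print(engine, [f"{g:.1f}%" for g in gaps(engine)])
```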
---
## Setup (historical — patch 0015 run; FINAL section above is the shipping 0023 result)
- **Box**: GB10 / DGX Spark, sm_121, unified LPDDR5x (~273 GB/s). Memory figures are
unified-memory used GB (`MemTotal-MemAvailable`), so they cover weights + KV + runtime.
- **llama.cpp**: dev tree `~/llama-paged-dev` branch `paged` HEAD `151343b` (patch 0015),
`build-cuda` sm_121, `LLAMA_KV_PAGED=1`, `llama-server -c 131072 --parallel 128 -b 2048
-ub 512 -ngl 99 -fa on`. **NOTE: run WITHOUT `max_prefill_tokens` (patch 0013) - see the
TTFT caveat in the verdict.**
- **vLLM**: 0.23.0, `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096
--max-num-seqs 256 -tp 1`.
- **Client**: identical async client for both engines. Per request: 512-token unique prompt
(unique leading tokens defeat cross-request prefix caching), `max_tokens=256`,
`temperature=0`, `ignore_eos=True`, streaming with usage. Concurrency (npl) swept 8/32/64/128.
- **Metrics** (localmaxxing.com schema): `decode_agg_tps` (aggregate decode tok/s across all
live seqs), `decode_perseq_tps` (mean per-sequence decode), `prefill_tps`, `ttft_mean_ms`,
`PEAK_GB` (unified-memory peak).
## The 2 models (NVFP4, matched weights; one llama.cpp GGUF and one vLLM checkpoint each)
| Model | llama.cpp GGUF | vLLM checkpoint | Match |
|-------|----------------|-----------------|-------|
| DENSE Qwen3.6-27B (28B dense) | `q36-27b-nvfp4.gguf` (native Blackwell FP4) | `q36-27b-nvfp4-vllm/` (unsloth TRUE W4A4) | clean W4A4 both sides |
| MoE Qwen3.6-35B-A3B (36B total, ~3B active) | `q36-35b-a3b-nvfp4.gguf` (241 NVFP4 tensors, nvidia weights) | `q36-35b-a3b-nvfp4-vllm/` (nvidia modelopt; vLLM picks Marlin NvFp4 MoE + FA2) | NVFP4 weight-only, identical nvidia weights |
---
## Results (decode aggregate tok/s, per-seq, prefill, TTFT, peak GB)
### MoE Qwen3.6-35B-A3B (~3B active)
| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
| 8 | llama | 170.2 | 20.27 | 2813 | 855 | 38.98 |
| 8 | vLLM | 202.0 | 24.92 | 4648 | 799 | 111.49 |
| 32 | llama | 235.4 | 6.77 | 2005 | 4970 | 43.06 |
| 32 | vLLM | 462.0 | 13.59 | 4755 | 2308 | 111.26 |
| 64 | llama | 271.7 | 3.88 | 2389 | 7205 | 52.53 |
| 64 | vLLM | 624.5 | 8.90 | 4784 | 4072 | 111.46 |
| 128 | llama | 292.2 | 2.05 | 657 | 84800 | 61.42 |
| 128 | vLLM | 811.1 | 5.46 | 4263 | 7980 | 111.61 |
llama decode as % of vLLM: **84 / 51 / 44 / 36** at npl 8/32/64/128.
### DENSE Qwen3.6-27B
| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
| 8 | llama | 63.8 | 7.60 | 1117 | 2029 | 51.72 |
| 8 | vLLM | 64.3 | 7.98 | 1514 | 2593 | 112.07 |
| 32 | llama | 108.9 | 3.08 | 752 | 13212 | 61.48 |
| 32 | vLLM | 189.8 | 5.57 | 1555 | 7477 | 112.09 |
| 64 | llama | 126.2 | 1.78 | 465 | 53818 | 74.90 |
| 64 | vLLM | 284.2 | 3.92 | 1526 | 12942 | 112.11 |
| 128 | llama | 134.6 | 0.93 | 125 | 491195 | 94.03 |
| 128 | vLLM | 390.7 | 2.50 | 1420 | 24806 | 112.12 |
llama decode as % of vLLM: **99 / 57 / 44 / 34** at npl 8/32/64/128.
---
## Verdict
**At matched NVFP4 on one GB10 box: llama.cpp is at parity only at low concurrency; vLLM
scales substantially better as concurrency rises.**
1. **npl=8 (low concurrency): near parity.** Dense 99%, MoE 84% of vLLM decode. The MoE's
~3B active shows: per-seq decode 20-25 tok/s (MoE) vs 8 tok/s (dense) on both engines.
2. **npl>=32 (high concurrency): vLLM pulls decisively ahead** - decode ~1.7-2x (npl32) rising to
~2.8-2.9x (npl128) on both models. vLLM scales monotonically (dense 64->391, MoE 202->811);
llama plateaus (dense 64->135, MoE 170->292).
3. **TTFT is the clearest gap, and it is largely self-inflicted here.** llama's TTFT explodes
at high concurrency (dense **491 s**, MoE **85 s** at npl128) while vLLM stays bounded (25 s,
8 s). **This run used llama WITHOUT `max_prefill_tokens` (patch 0013)** - so 128 concurrent
512-token prefills starve each other and the decode. Crucially, that starvation also drags
`decode_agg` down: while many slots are stuck prefilling, fewer are actually decoding, so the
measured aggregate understates llama's steady-state decode. A re-run with `max_prefill_tokens`
(the QoS budget this PR already ships) is expected to bound TTFT AND raise high-concurrency
decode by keeping all slots live.
4. **Memory: llama wins on efficiency.** vLLM pre-reserves the whole pool (~112 GB at
gpu-mem-util 0.85); llama grows on demand (MoE 38->61 GB, dense 52->94 GB). The paged
on-demand KV is materially more memory-efficient / multi-tenant-friendly.
5. **vs the localmaxxing reference (259.5 MoE / 254.8 dense top-speed):** those are single-stream
on fast datacenter HW. GB10 per-seq decode tops out far lower (MoE ~25, dense ~8 tok/s at
npl8) - the LPDDR5x ~273 GB/s bandwidth floor, as expected. The reference is a ceiling, not a
GB10 target.
### Honest bottom line
The "par-or-beat vLLM" goal is **met at low concurrency but NOT at high concurrency** on these
NVFP4 models. vLLM's continuous-batched decode + bounded prefill scheduling scale better on a
bandwidth-limited box. Two of the three gap drivers are addressable on our side: (a) **prefill
starvation** - re-run with `max_prefill_tokens` (patch 0013), which this PR ships; (b) **decode
batching efficiency at high concurrency** - the runtime/scheduler lever (the small/unsaturated
regime). The kernel itself is at parity (npl8). Next step: a fair re-run with the prefill budget
on, plus decode-batch tuning, to get llama's true high-concurrency numbers before concluding the
absolute gap.
---
## Fair re-run (max_prefill_tokens on)
The prior tables ran llama-server **without** the QoS prefill budget (patch 0013). This section
re-runs the same A/B with `LLAMA_PREFILL_BUDGET` set, sweeping the per-step prompt-token cap over
**256 / 512 / 1024**. Everything else is byte-identical to the prior run: dev-tree llama-server
(branch paged, HEAD `151343b`), `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
`LLAMA_KV_PAGED=1`, same workload (512-token unique prompt, `max_tokens=256`, `temperature=0`,
`ignore_eos`), same harness (`h2h_moe_sweep.sh` -> `h2h_cli.py`). vLLM numbers are unchanged
(carried over from the committed dense table, not re-run).
### DENSE Qwen3.6-27B - budget sweep (decode agg tok/s | TTFT mean ms | peak GB)
| npl | metric | stock (no budget) | budget 256 | budget 512 | budget 1024 | vLLM |
|----:|--------|------------------:|-----------:|-----------:|------------:|-----:|
| 8 | decode agg | 63.8 | 63.5 | 63.8 | 63.5 | 64.3 |
| 8 | TTFT ms | 2029 | 4255 | 3756 | 2653 | 2593 |
| 32 | decode agg | 108.9 | 105.7 | 107.7 | 108.8 | 189.8 |
| 32 | TTFT ms | 13212 | 23114 | 18934 | 13912 | 7477 |
| 64 | decode agg | 126.2 | 132.0 | 131.2 | 118.2 | 284.2 |
| 64 | TTFT ms | 53818 | 109455 | 74272 | 92450 | 12942 |
| 128 | decode agg | 134.6 | **161.2** | 146.9 | 128.3 | 390.7 |
| 128 | TTFT ms | 491195 | **305423** | 543448 | 424058 | 24806 |
Peak host GB is budget-independent (on-demand paged KV grows with concurrency): ~51.5 (npl8) ->
~61.5 (npl32) -> ~74.7 (npl64) -> ~93.5 (npl128) for every budget, vs vLLM's flat ~112.1.
### Best budget = 256 (only the saturated npl128 regime benefits)
At the fully-saturated point (npl128), **budget 256 is the clear winner on both axes**:
- **decode_agg: 134.6 -> 161.2 tok/s (+19.8%)** vs the starved stock run.
- **TTFT mean: 491.2 s -> 305.4 s (-37.8%, -186 s)** vs stock.
- llama decode as % of vLLM at npl128: **34.5% -> 41.3%**. TTFT still ~12x vLLM's 24.8 s.
Larger budgets help less at npl128 (512 -> 146.9 tok/s; 1024 -> 128.3, i.e. ~stock) because a
looser cap lets a long prefill grab a bigger slice per step and re-introduce decode jitter. So
the tightest cap (256) protects in-flight decode the most when the box is saturated.
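
The npl128 deltas quoted above follow from the sweep table. A short recomputation sketch (values copied from the npl128 rows; the names are illustrative):

```python
def pct_change(before, after):
    """Signed percent change from before to after."""
    return (after / before - 1.0) * 100.0


# Dense npl128: stock (no budget) vs budget 256, plus the vLLM reference.
STOCK_DECODE, B256_DECODE, VLLM_DECODE = 134.6, 161.2, 390.7   # tok/s
STOCK_TTFT_MS, B256_TTFT_MS, VLLM_TTFT_MS = 491195, 305423, 24806

print(f"decode: {pct_change(STOCK_DECODE, B256_DECODE):+.1f}%")
print(f"TTFT:   {pct_change(STOCK_TTFT_MS, B256_TTFT_MS):+.1f}%")
print(f"share of vLLM decode: {100 * STOCK_DECODE / VLLM_DECODE:.1f}% -> "
      f"{100 * B256_DECODE / VLLM_DECODE:.1f}%")
print(f"TTFT vs vLLM: {B256_TTFT_MS / VLLM_TTFT_MS:.1f}x")
```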
### Honest caveat: this bursty workload is the worst case for TTFT
At npl 8 / 32 / 64 the budget **raised** TTFT (e.g. npl8 2029 -> 4255 ms at budget 256) and left
decode_agg roughly flat. Reason: the harness fires all N requests simultaneously, so at t=0 there
is **no in-flight decode to protect** - capping prefill purely defers first tokens. The budget
only pays off once enough slots are decoding that an unbounded prefill would starve them, which on
this box happens only at npl128. Budget 1024 tracks stock closely at light load (npl8 TTFT 2653 ~
stock 2029) because a 512-token prompt fits in one <=1024 step. In a steadier (staggered) arrival
pattern the budget would protect decode jitter without the burst-TTFT penalty; that regime is not
exercised here.
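
Why a tighter cap defers first tokens in an all-at-once burst can be seen with a deliberately crude count of scheduler steps. This is a toy model under loud assumptions, not the scheduler: it treats the budget as a single per-step cap on prompt tokens shared by all slots, ignores decode tokens and `n_ubatch`, and models the no-budget case as the `-b 2048` batch size.

```python
import math

PROMPT_TOKENS = 512   # per-request prompt length used by the harness
STOCK_CAP = 2048      # assumption: no budget behaves like the -b 2048 batch size


def steps_to_drain(n_requests, budget, prompt_tokens=PROMPT_TOKENS):
    """Steps until every simultaneously arriving prompt has been prefilled."""
    return math.ceil(n_requests * prompt_tokens / budget)


for npl in (8, 32, 64, 128):
    row = {cap: steps_to_drain(npl, cap) for cap in (256, 512, 1024, STOCK_CAP)}
    print(f"npl={npl:3d}", row)
```

Under this model the last request of a burst waits proportionally longer as the cap tightens, which is consistent with the TTFT increase seen at npl 8 / 32 / 64; it says nothing about the decode protection that the budget buys once slots are live.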
### Bottom line (dense)
The prefill budget is a **real but narrow** lever on this workload: at maximum saturation
(npl128) budget=256 lifts decode_agg ~20% and cuts TTFT ~38% vs the starved run, moving llama
from 34.5% to 41.3% of vLLM decode. It does **not** close the gap - vLLM still decodes ~2.4x
faster and keeps TTFT ~12x lower at npl128, and scales monotonically where llama plateaus. At
light/moderate concurrency the budget is net-negative for TTFT in this all-at-once workload, so it
should be applied selectively (high-concurrency serving), not as an unconditional default.
## MoE 35B-A3B fair re-run (max_prefill_tokens on)
Same build (HEAD 151343b, P0+P1 patch 0015), same flags (`-c 131072 --parallel 128 -b 2048
-ub 512 -ngl 99 -fa on`, `LLAMA_KV_PAGED=1`), same all-at-once harness (512-tok unique prompt,
gen 256, temp 0, ignore_eos). Swept the dense winner budget 256 plus neighbor 512.
### Primary table - budget 256 (decode_agg tok/s / TTFT mean ms / peak host GB; vLLM column is decode_agg / TTFT only)
| npl | stock (no budget) | budget 256 (best) | budget 512 | vLLM |
|----:|------------------:|------------------:|-----------:|-----:|
| 8 | 170.2 / 855 / - | 169.3 / 1655 / 38.95 | 172.1 / 1488 / 38.82 | 202.0 / 799 |
| 32 | 235.4 / 4970 / - | 239.0 / 9034 / 42.93 | 234.7 / 7260 / 42.72 | 462.0 / 2308 |
| 64 | 271.7 / 7205 / - | 277.0 / 16249 / 51.96 | 274.5 / 13660 / 52.53 | 624.5 / 4072 |
| 128 | 292.2 / 84800 / - | **333.5 / 98106 / 61.42** | 300.8 / 92470 / 61.45 | 811.1 / 7980 |
Peak host GB (paged KV, budget-independent): ~38.9 (npl8) -> ~42.8 (npl32) -> ~52 (npl64) ->
~61.4 (npl128). Far below the dense run (94 GB @npl128) - only ~3B params are active, so the KV
plus activations footprint stays light even fully saturated.
### MoE inverts the dense story: the budget buys decode, NOT TTFT
Unlike the dense 27B (where the stock run was prefill-starved to 491 s TTFT @npl128 and the budget
cut it 38%), the MoE stock run was **never prefill-starved**: 3B active params make prefill cheap,
so stock TTFT @npl128 was already only 84.8 s. Capping prefill therefore cannot rescue TTFT - it
can only **defer first tokens to free decode steps**. Result at npl128 with budget 256:
- **decode_agg: 292.2 -> 333.5 tok/s (+14.1%)** vs the starved stock run.
- **TTFT mean: 84.8 s -> 98.1 s (+15.7%, WORSE)** - the budget costs latency here.
- llama decode as % of vLLM @npl128: **36.0% -> 41.1%**. TTFT now ~12.3x vLLM's 7.98 s.
Budget 512 is the milder trade (decode +3.0% to 300.8, TTFT +9.0% to 92.5 s @npl128). Budget 256
maximizes decode throughput; 512 if you want to bleed less TTFT. At npl 8/32/64 both budgets are
net-negative or flat on decode and clearly raise TTFT (e.g. npl64 7.2 s -> 16.2 s @b256), the same
all-at-once burst artifact seen in the dense run.
### Does the ~3B-active decode scale better now? Yes - the plateau is gone
The headline win is the **decode scaling curve**, not any single point:
| npl step | stock decode_agg | budget-256 decode_agg |
|---------:|-----------------:|----------------------:|
| 8 -> 32 | 170 -> 235 (+38%) | 169 -> 239 (+41%) |
| 32 -> 64 | 235 -> 272 (+16%) | 239 -> 277 (+16%) |
| 64 -> 128| 272 -> 292 (**+7.4%**, plateauing) | 277 -> 333.5 (**+20.4%**, still climbing) |
Stock MoE decode **plateaus** at saturation (+7.4% over the last doubling) because unbounded
prefills keep stealing steps from the many ready decode slots. Budget 256 removes that ceiling -
decode keeps climbing +20% into npl128, so more of the 128 slots actually decode concurrently.
This is the cleanest evidence that patch 0013 protects in-flight decode once enough slots are live.
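
The step-over-step growth in the table above can be recomputed as follows. This is only an arithmetic check on the rounded decode_agg values quoted in that table; `growth_pct` is an illustrative helper name, not part of the benchmark harness.

```python
def growth_pct(before: float, after: float) -> float:
    """Percentage change going from `before` to `after`."""
    return 100.0 * (after / before - 1.0)


# decode_agg (tok/s) at npl 8, 32, 64, 128, rounded as in the step table.
stock = [170, 235, 272, 292]
budget_256 = [169, 239, 277, 333.5]

for name, series in (("stock", stock), ("budget 256", budget_256)):
    steps = [growth_pct(a, b) for a, b in zip(series, series[1:])]
    print(name, [f"{s:+.1f}%" for s in steps])
```

The last doubling (64 -> 128) is where the two series separate: roughly +7.4% for stock versus roughly +20.4% with budget 256.
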
### Bottom line (MoE)
For the A3B MoE the prefill budget is a **decode-throughput lever, paid for in TTFT** - the mirror
image of the dense case. Budget 256 lifts decode_agg +14% @npl128 and, more importantly, restores
monotonic decode scaling (kills the stock plateau), moving llama from 36.0% to 41.1% of vLLM
decode - the same ~41% ceiling the dense run hit. It does **not** close the gap: vLLM still decodes
~2.4x faster (811 vs 333.5) and holds TTFT ~12x lower (8.0 s vs 98.1 s) @npl128, and scales
monotonically and steeply where llama only partially recovers. Net: apply the budget to saturated
MoE serving when decode throughput is the objective and some extra TTFT is acceptable; for
latency-sensitive MoE serving leave it off (stock TTFT was already not the bottleneck here).
---
## Fair re-run verdict
This is the synthesis after patch 0013 (`max_prefill_tokens` / `LLAMA_PREFILL_BUDGET`) was turned
on for both models. It answers three questions: how much of the apparent gap was prefill
starvation, what genuine gap to vLLM remains after that artifact is removed, and where that leaves
the "par-or-beat vLLM" goal.
### 1. How much did patch 0013 close the gap?
The original (stock) tables blamed two things on llama: an exploding TTFT and a flat decode curve
at high concurrency. The budget re-run shows these were **two different problems with two
different root causes**, and only one was prefill starvation.
**Dense 27B - was genuinely prefill-starved.** Dense prefill is expensive (full 27B weights per
token), so 128 simultaneous 512-token prefills truly starved both first-tokens and decode. Budget
256 @npl128:
| metric @npl128 | stock | budget 256 | vLLM | what closed |
|----------------|------:|-----------:|-----:|-------------|
| TTFT mean | 491.2 s | **305.4 s** (-37.8%) | 24.8 s | starvation real; -186 s recovered |
| decode_agg | 134.6 | **161.2** (+19.8%) | 390.7 | freed slots now decode |
| llama as % of vLLM decode | 34.5% | **41.3%** | 100% | +6.8 pts |
Dense llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **99 / 56 / 46 / 41** (was 99/57/44/34).
The fix moved mainly the saturated tail; npl 8/32 were never starved and are essentially unchanged.
**MoE 35B-A3B - was NOT prefill-starved (the inversion).** Only ~3B active params, so prefill was
already cheap and stock TTFT @npl128 was 84.8 s, not dense's 491 s. There was no starvation to
rescue, so the budget could not cut TTFT - it instead converted deferred prefill into decode
steps. Budget 256 @npl128:
| metric @npl128 | stock | budget 256 | vLLM | direction |
|----------------|------:|-----------:|-----:|-----------|
| TTFT mean | 84.8 s | 98.1 s (+15.7%, WORSE) | 7.98 s | budget costs latency here |
| decode_agg | 292.2 | **333.5** (+14.1%) | 811.1 | plateau removed |
| llama as % of vLLM decode | 36.0% | **41.1%** | 100% | +5.1 pts |
MoE llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **84 / 52 / 44 / 41** (was 84/51/44/36).
The decisive MoE finding is the scaling curve, not the point: stock decode plateaued over the last
doubling (64->128 = +7.4%); budget 256 restored monotonic scaling (+20.4%), proving the stock flat
curve was unbounded prefill stealing steps from ready decode slots, not a kernel ceiling.
**Combined takeaway.** Both models converge to the **same ~41% of vLLM decode at npl128** after the
fix. That convergence is the signal: once prefill starvation is removed, dense and a 12x-cheaper-
prefill MoE land on the identical ceiling, which means the remaining gap is **not** about prefill
at all - it is the decode scheduler.
### 2. The honest remaining gap to vLLM
After patch 0013, the residual gap is the **continuous-batched-decode efficiency** lever, and it is
real, not an artifact:
- vLLM still decodes **~2.4x faster** at npl128 on both models (390.7 vs 161.2 dense; 811.1 vs
333.5 MoE).
- vLLM holds TTFT **~12x lower** at npl128 (24.8 vs 305.4 s dense; 8.0 vs 98.1 s MoE) - and does so
while decoding faster, i.e. no latency/throughput trade.
- **vLLM scales monotonically and steeply** (dense 64->391, MoE 202->811 across npl 8->128); llama,
even with the budget, only **partially** recovers its scaling (dense 64->161, MoE 170->334).
The mechanism: vLLM's scheduler interleaves prefill and decode at token granularity (chunked
prefill + paged continuous batching) every step, keeping the GPU saturated with a near-optimal mix.
Patch 0013 is a coarser tool - a static per-step prefill **cap** - which protects in-flight decode
but does not actively schedule the prefill/decode mix, and on the bursty all-at-once harness it
defers first tokens (the TTFT penalty at npl 8/32/64, and the MoE TTFT regression @npl128). The gap
that remains is the **quality of the step-by-step batching decision**, not raw kernel speed: at
npl8 the kernels are at parity (dense 99%, MoE 84%), so the per-token math is competitive - what
vLLM does better is keeping more sequences productively in-flight every step as concurrency rises.
### 3. Where this leaves "par-or-beat vLLM", and the last lever
**Where llama is competitive today (NVFP4, GB10):**
- **Low concurrency (npl<=8): at parity.** Dense 99%, MoE 84% of vLLM decode, comparable TTFT.
For single-user / few-stream local serving - LocalAI's dominant mode - llama.cpp is already
there on matched NVFP4.
- **Memory efficiency: llama wins outright at every concurrency.** On-demand paged KV (dense
52->94 GB, MoE 39->61 GB) vs vLLM's flat ~112 GB pre-reservation. On a 128 GB unified box this is
the difference between multi-tenant headroom and OOM - a genuine product advantage, not a
consolation.
**Where llama is not competitive:** high-concurrency decode throughput (npl>=32), where vLLM is
~2-2.4x ahead and the budget only lifts llama to ~41% of vLLM decode at npl128.
**The last lever** is therefore *not* another prefill knob (0013 has extracted what a static cap
can give) and *not* the kernel (at parity @npl8). It is **token-granular continuous-batch
scheduling**: actively interleaving chunked prefill with decode every step rather than capping
prefill, so all live slots decode while new prefills trickle in - exactly what closes vLLM's
monotonic-scaling advantage. A staggered (non-burst) arrival pattern would also let 0013 protect
decode jitter without the burst-TTFT penalty seen here, narrowing the practical gap for real
serving traffic that does not arrive all-at-once.
### Bottom line
Patch 0013 is validated and worth shipping as a **selective, high-concurrency QoS lever**: it
recovers dense TTFT 38% and lifts saturated decode +14-20%, converging both models to ~41% of
vLLM. But it is honestly **not a gap-closer**. The "par-or-beat vLLM" goal is **met at low
concurrency and on memory efficiency, and not met at high-concurrency decode throughput.** The
remaining ~2.4x is a continuous-batched-decode scheduling gap, not a prefill-starvation or kernel
gap - and that is the next (harder) lever, distinct from anything 0013 can touch.


@@ -0,0 +1,317 @@
# LocalAI paged-attention llama.cpp patch series
This directory holds the vendored patch series that turns stock llama.cpp into
LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The patches are
applied on top of a pinned upstream llama.cpp at build time; nothing here is a
fork - it is a source-only `*.patch` stack plus this single canonical doc.
> One-file rule: this README is the canonical reference for the patch series. The
> only other docs kept in this directory are operational and linked below:
> - [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
> - [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.
---
## 1. What it is
`llama-cpp-localai-paged` is the LocalAI paged-attention llama.cpp backend: a
vendored patch series over upstream llama.cpp that adds
- a **paged KV cache** (vLLM-style block manager: on-demand fixed-size blocks,
free pool, ref-counted blocks) with a **block-table flash-attention** read so
the attention kernels index physical cells instead of a contiguous buffer;
- **cross-request prefix sharing** - concurrent requests that share a long
prefix physically reuse one committed copy of the prefix blocks and prefill
only their divergent suffix;
- a **decode-first prefill scheduler** - a dynamic per-step prefill-token budget
decoupled from `n_batch`, so a long prefill never freezes co-batched decode;
- **GB10 / Blackwell NVFP4 decode optimizations** for the Qwen3.6 hybrid
gated-DeltaNet (SSM) models, where the recurrent-state plumbing - not the FP4
GEMM - dominates the decode step.
It is **pinned to llama.cpp `c299a92c`** ("binaries : Improve rpc-server and
export-graph-ops names", #25045) and advanced only by a manual, bit-exact-gated
[pin-sync process](PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
(see section 7).
The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is
enabled per-model at runtime via the gallery `options:` knobs (`paged_kv:true`,
`max_batch_tokens:`, `kv_unified:false`, ...). Against unpatched llama.cpp the
runtime hooks are inert, so a single `grpc-server.cpp` is shared between the
clean and the paged build.
---
## 2. Architecture
The decode step on these models breaks into three cost centers; the patch series
attacks each one.
**Paged KV manager + block-table flash-attn.** A host-side `PagedKVManager`
(`FreeBlockQueue` / `BlockPool` / chained-hash content cache) hands out
fixed-size KV blocks on demand and reclaims them per-sequence (ref-counted, with
copy-on-write for shared prefixes). The attention path reads through a **block
table** - an `I32 [n_view, n_stream]` position-ordered physical-cell index passed
as `src[5]` of `ggml_flash_attn_ext` - so the CUDA fattn vec/tile kernels and the
CPU reference map logical KV index `j` to physical cell `block_table[seq*ne11+j]`
and read K/V in place. Token-position ordering keeps the flash-attn online-softmax
reduction order identical to stock. A null block table is the stock contiguous
read, byte-identical.
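
A minimal sketch of the logical-to-physical indexing described above, assuming a flat block table with one row of `n_view` entries per sequence (the text names the stride `ne11`; `n_view` is used here as an illustrative stand-in). This is plain Python for explanation only, not the CUDA or CPU kernel.

```python
def paged_read(cache, block_table, seq, n_view):
    """Return the cells of sequence `seq` in logical (token-position) order.

    cache       : list of physical KV cells, in arbitrary placement order
    block_table : flat list; entry [seq * n_view + j] is the physical cell
                  holding logical position j of sequence `seq`
    """
    return [cache[block_table[seq * n_view + j]] for j in range(n_view)]


# Two sequences of four tokens each, scattered across eight physical cells.
cache = [None] * 8
placement = {0: [5, 2, 7, 0], 1: [1, 6, 3, 4]}
for seq, cells in placement.items():
    for j, cell in enumerate(cells):
        cache[cell] = f"seq{seq}-tok{j}"

block_table = placement[0] + placement[1]
print(paged_read(cache, block_table, seq=1, n_view=4))
```

Because the table is position-ordered, the reader visits tokens in the same order a contiguous cache would, which is the property the paragraph relies on for an unchanged reduction order.
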
**The gated-DeltaNet (GDN / SSM) decode path.** The Qwen3.6 hybrid models are 48
gated-DeltaNet (linear-attention / SSM) layers + 16 full-attention layers. On
GB10 the recurrent-state plumbing, not the weight GEMM, is the dominant decode
cost. The series fuses that plumbing to mirror vLLM's
`fused_recurrent_gated_delta_rule`: the recurrent state is read from and written
to its cache slot in place (no copy-back, no `get_rows` materialization), the
conv state is updated in place, the output projection is reshaped to route to the
tensor-core MMQ GEMM, and the recurrence kernel is occupancy-retuned - all
bit-exact (md5-gateable) against the f32 baseline.
**NVFP4 native FP4-MMA on Blackwell.** The NVFP4 dense/expert weight GEMM uses
Blackwell's native FP4-MMA. The series removes a redundant activation-requantize
in the MoE broadcast projections (bit-exact byte copy of identical blocks) and
keeps CUDA graphs on for the grouped-MMQ MoE decode step. These are the only
NVFP4-specific optimizations; on non-Blackwell hardware the FP4 path falls back
to dequant.
**The prefill/decode scheduler.** `update_slots()` already emits one unified
mixed prefill+decode batch per step. The scheduler patches change only the *count*
of prefill tokens admitted per step: decode tokens are claimed first
(decode-first), then a dynamic budget `max(n_ubatch, T - D)` (where `D` is the
live decode load and `T` is `LLAMA_MAX_BATCH_TOKENS`) admits prefill, auto-
shrinking as decode load rises. Pure scheduler policy, byte-identical when off,
orthogonal to the paged allocator.
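
The budget rule above can be sketched as a small function. The formula `max(n_ubatch, T - D)` is taken from the text; the function and parameter names are illustrative assumptions, not the identifiers used in `update_slots()`.

```python
def prefill_budget(n_ubatch: int, max_batch_tokens: int, n_decode: int) -> int:
    """Prefill tokens admitted this step under the decode-first policy.

    Decode tokens are claimed first; prefill gets what is left of the
    per-step token target T, but never less than one micro-batch.
    """
    return max(n_ubatch, max_batch_tokens - n_decode)


# Assumed example values: T = 2048, n_ubatch = 512.
for live_decode in (0, 128, 1800):
    print(live_decode, prefill_budget(512, 2048, live_decode))
```

The floor at `n_ubatch` keeps prefill making progress even when decode load alone would consume the whole target.
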
---
## 3. Patch series (0001-0030)
28 patches (0005 and 0027 are intentionally unused). "Bit-exact" = greedy md5 /
`test-backend-ops` byte-identical to the relevant baseline; the gate methodology
is in section 5.
### Paged-KV core (0001-0012)
| # | What it does | Bit-exact |
|---|---|---|
| 0001 | Vendor the host-side paged KV block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache). Pure C++17, nothing uses it yet. | n/a (no behavior) |
| 0002 | Place each sequence at permuted, non-contiguous block positions in `find_slot` (proves attention is invariant to physical KV placement). | yes (token-identical) |
| 0003 | Gather K/V/mask down to each stream's non-empty cells before `build_attn_mha`, position-sorted so the FA reduction order matches stock. | yes |
| 0004 | Drive paged placement through the vendored manager: blocks popped on demand, returned on seq end. Core kv-cache struct untouched. | yes (stock path byte-identical) |
| 0006 | Host-side cross-request prefix caching: hash prefix blocks, reuse matching physical blocks (ref-count++), COW-privatise before a divergent write. | yes (default off) |
| 0007 | Wire the prefix cache into the engine so a new sequence physically shares cached prefix blocks and skips recomputing the shared prefix. | yes (verified byte-identical) |
| 0008 | Wire cross-request prefix share into the llama-server continuous-batch loop so concurrent shared-prefix requests prefill only the suffix (36x fewer prefill tokens at K=32). | within CUDA batch-shape non-determinism band |
| 0009 | Replace the per-step gather with an **in-kernel paged read** (block table as `src[5]`); the K/V `get_rows` is gone. Decode step at batch32 691->696ms (was 1279ms gathered). | yes on CPU/batch1; GPU batch>1 within vec-vs-mma band |
| 0010 | Graft the block-table read into the tile kernel; add a dispatch guard so a present block table routes ONLY to vec/tile (never the mma/wmma kernels that ignore it). | yes (CPU byte-identical; vec route) |
| 0011 | Route the GQA-grouped F16 decode to the **tile kernel** (native head-group reuse) by default; vec for everything else. Paged decode to within 1.8% of stock. | vs stock-mma: different-kernel rounding; bit-exact vs vec |
| 0012 | Defensive `GGML_ASSERT(n_view % 64 == 0)` so a future pad/tile change can't silently reintroduce a past-end KV leak on the tile route. | yes (additive assert) |
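
The ref-counted sharing and copy-on-write behavior that rows 0004, 0006 and 0007 describe can be illustrated with a toy pool. `ToyBlockPool` and its methods are hypothetical names for this sketch; the real `BlockPool` / `PagedKVManager` interfaces are not shown in this document, and KV data copying is omitted.

```python
class ToyBlockPool:
    """Toy free-list allocator with reference counts and copy-on-write."""

    def __init__(self, n_blocks: int):
        self.free = list(range(n_blocks))  # blocks available for allocation
        self.refs = {}                     # block id -> number of sequences using it

    def alloc(self) -> int:
        block = self.free.pop(0)
        self.refs[block] = 1
        return block

    def share(self, block: int) -> int:
        """A second sequence reuses an existing (prefix) block."""
        self.refs[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refs[block] -= 1
        if self.refs[block] == 0:
            del self.refs[block]
            self.free.append(block)

    def writable(self, block: int) -> int:
        """Return a block that is safe to write: privatise it if it is shared."""
        if self.refs[block] == 1:
            return block
        self.refs[block] -= 1
        return self.alloc()


pool = ToyBlockPool(4)
prefix = pool.alloc()            # sequence A prefills the prefix
shared = pool.share(prefix)      # sequence B reuses it
private = pool.writable(shared)  # B diverges and gets its own block
print(prefix, private, pool.refs)
```
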
### Decode-first scheduler (0013, 0016)
| # | What it does | Bit-exact |
|---|---|---|
| 0013 | `LLAMA_PREFILL_BUDGET`: a static per-step prefill-token budget decoupled from `n_batch` (vLLM `--max-num-batched-tokens` analogue). Flattens the decode ITL spike a long prefill inflicts (8.5x smaller worst freeze). | yes (off/short = byte-identical; == `-b` chunking) |
| 0016 | Supersede 0013 with a **dynamic decode-first** budget: `max(n_ubatch, T-D)`, auto-shrinking as decode load `D` rises. Policy-only inside `update_slots()`, zero libllama changes. | yes (default-off byte-identical) |
(0014/0015 are the MoE token-tile levers: 0014 adds `LLAMA_MOE_MMQ_X` (opt-in
high-batch decode micro-opt, +4.8% on Qwen3-Coder-30B), 0015 makes it a
default-on, density-aware auto-select that is prefill-safe by construction. Both
bit-exact. 0017 is the dense FP4-GEMM occupancy-tune track: bit-exact gate green,
but every cheap occupancy lever regressed on GB10, so nothing is enabled - it
ships as the parity gate + default-off instrumentation only.)
### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)
These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact.
| # | What it does | Effect (dense q36-27b / MoE q36-35b-a3b @npl128) |
|---|---|---|
| 0018 | **In-place SSM state write-back** - the recurrence writes its final state directly into the cache slot, removing the ~225MB/copy D2D memcpy (18.9% of decode time). | dense +23.5% / MoE +18.9% |
| 0019 | **Fused recurrent-state gather** - the op reads each sequence's prior state directly from `cache[ids[seq]]` (no `get_rows` materialization); race-free in-place + ids read. | dense +37.8% / MoE +35.3% |
| 0020 | **o_proj MMVQ->MMQ reshape** - collapse the GDN output to 2D so the output projection routes to the M=128 tensor-core MMQ GEMM (was a batch<=8 MMVQ GEMV). The single biggest decode-parity lever. | dense +31.7% (->85.9% of vLLM) / MoE +23.3% |
| 0021 | **Conv-state in-place fusion** - one `ggml_ssm_conv_update_inplace` op replaces the 4-op conv chain (transpose+concat+conv+silu+ring-cpy), writing the shifted ring state in place. | dense +3.2% / MoE +3.5% |
| 0022 | **GDN recurrence occupancy/coalescing retune** - column-folding (NUM_WARPS/COLS_PER_WARP) raises memory-level parallelism on the bandwidth-bound B=128 recurrence kernel; per-column f32 FMA order unchanged. 73.4%->84.6% of GB10 peak BW. | dense +11.1% / MoE +8.3% |
| 0028 | **Recurrent conv-tap gather fusion** - the last `k_get_rows` in the GDN decode path (the conv-state tap gather) becomes an indexed in-kernel read. | dense ~377 t/s / MoE ~784 t/s |
### MoE NVFP4 quant (0023, 0025)
| # | What it does | Bit-exact |
|---|---|---|
| 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) |
| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) |
### Pool reclaim, block-table cache, backend gate, opt-in bf16-SSM
| # | What it does | Bit-exact |
|---|---|---|
| 0024 | **Paged-pool burst-reclaim** - truncate trailing blocks on partial-tail `seq_rm`, defrag the free queue when idle, release blocks on slot completion. Fixes the long-server burst-degradation bug (post-burst prefill collapse 488->44 t/s, restored to 532). Host-side accounting only. | yes |
| 0029 | **Block-table within-step host cache** - the block table is fixed for the whole step; cache it on first build and memcpy it for the other full-attention layers (get_block_table -87%/-91%). | yes, per path (paged-MoE ref `8cb0ce23`) |
| 0030 | **Fused-op backend gate** - the fused GDN / discriminated SSM_CONV ops are CUDA-family + CPU only; force them off on any non-CUDA compute backend so a Vulkan/SYCL/Metal build can't silently run the wrong plain-conv kernel. | yes on CUDA (byte-identical pre-0030); safety gate elsewhere |
| 0026 | **Hybrid per-head bf16 SSM state (opt-in)** - `--ssm-bf16-tau` / option `ssm_bf16_tau`: fast-decaying GDN heads (memory length below the tau threshold) persist state as bf16, halving that head's decode byte stream (~+12% decode). | default tau=0 = f32 = **bit-exact**; the bf16 mode is **NOT** bit-exact (~91% same-top-p) |
---
## 4. Benchmarks
Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
**Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](qwen36_dense_decode_vs_npl.png),
[`qwen36_moe_decode_vs_npl.png`](qwen36_moe_decode_vs_npl.png); raw data
[`final_benchmark.csv`](final_benchmark.csv).
### (a) + (b) Patched vs stock vs vLLM
The **stock** and **patched** columns are the same binary, env-toggled, on the
**same harness** (`llama-batched-bench`) - so "x over stock" is an exact
apples-to-apples measure of the patch series' contribution. The **vLLM** column
is a **different harness** (vLLM server + client continuous batching), so the
cross-engine "% of vLLM" is **indicative, not apples-to-apples**.
**Dense Qwen3.6-27B-NVFP4** (t/s):
| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
|----:|------:|--------:|-----:|------------------:|---------------------:|
| 8 | 65.7 | 84.0 | 71.1 | 118% | 1.28x |
| 32 | 113.7 | 204.0 | 207.6 | 98% | 1.79x |
| 64 | 134.3 | 294.9 | 309.7 | 95% | 2.20x |
| 128 | 143.5 | 371.2 | 422.4 | 88% | 2.59x |
**MoE Qwen3.6-35B-A3B-NVFP4** (t/s):
| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
|----:|------:|--------:|------:|-----------------:|---------------------:|
| 8 | 181.4 | 227.4 | 315.1 | 72% | 1.25x |
| 32 | 260.8 | 455.7 | 681.9 | 67% | 1.75x |
| 64 | 306.8 | 612.3 | 765.5 | 80% | 2.00x |
| 128 | 331.3 | 772.6 | 1011.7 | 76% | 2.33x |
**Caveat on the vLLM column.** Besides the different harness, the vLLM MoE
@npl128 number here (1011.7 at 128/128) runs *hotter* than the 901 t/s reference
config (512/256), so the MoE "% of vLLM" reads **76% here vs ~86% against the
reference config**. Memory: llama uses **1.5-3x less** memory than vLLM.
**Takeaway.** The patch series gives up to **2.59x (dense) / 2.33x (MoE)** over
stock on the same harness. Dense ranges from **ahead of vLLM at npl8 (118%) to
88% at npl128**; MoE trails at 67-80% - the remaining gap is structural (see
section 5).
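
The two derived columns can be recomputed from the raw t/s values. The sketch below does so for the dense table and for the MoE caveat; it is an arithmetic check only, with illustrative helper names.

```python
def pct_of(value: float, reference: float) -> int:
    """`value` as a whole-number percentage of `reference`."""
    return round(100.0 * value / reference)


def speedup(patched: float, stock: float) -> float:
    """Patched-over-stock ratio, rounded to two decimals."""
    return round(patched / stock, 2)


# Dense Qwen3.6-27B-NVFP4 rows: npl -> (stock, patched, vLLM) in t/s.
dense = {
    8: (65.7, 84.0, 71.1),
    32: (113.7, 204.0, 207.6),
    64: (134.3, 294.9, 309.7),
    128: (143.5, 371.2, 422.4),
}

for npl, (stock, patched, vllm) in dense.items():
    print(npl, f"{pct_of(patched, vllm)}% of vLLM", f"{speedup(patched, stock)}x over stock")

# MoE @npl128: same patched number against the two vLLM configurations.
print(pct_of(772.6, 1011.7), pct_of(772.6, 901.0))
```
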
### (c) Apple M4 (16GB) - for curiosity only
No t/s figures: the 24GB NVFP4 GGUF did not finish downloading, and it would not
fit in 16GB RAM anyway (it would page from SSD). Architectural findings:
- Metal `supports_op` **excludes NVFP4** from `MUL_MAT` / `MUL_MAT_ID` /
`GET_ROWS`, so the FP4 matmuls **fall back to CPU** - there is no Apple
FP4-MMA.
- `GATED_DELTA_NET` and `SSM_CONV` / `SSM_SCAN` **do** have Metal kernels.
Verdict: NVFP4 Qwen3.6 needs **Blackwell FP4-MMA + >24GB RAM**; a 16GB M4 is not
a fit. A Metal-native Q4_K Qwen3.6 would be a different artifact.
---
## 5. Dev notes - what we learned
**Bit-exact methodology.** Every bit-exact patch is gated two ways: (1) a greedy
md5 gate - `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France
is" -n 48 --temp 0 --seed 1 | md5sum`, paged paths prefixed with
`LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged MoE), on the default
chat-template path; and (2) `test-backend-ops` (CUDA0 vs CPU oracle) for every
touched op (`SSM_CONV*`, `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md)).
Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5
(`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this
is a benign FP-accumulation-order difference of the paged attention reduction,
**KL-validated** against the f16 reference: KLD(paged||f16) 0.13600 <=
KLD(nonpaged||f16) 0.13660, PPL within +/-0.29, ~zero probability bias - two
equivalent FP-reorderings of the same quantized model, not a regression. Future
paged-MoE regressions therefore compare to `8cb0ce23`, not `07db32c2`.
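
A minimal sketch of a per-path comparison, assuming the eight-character values quoted above are prefixes of the md5 hex digest of the captured completion output (the text does not say so explicitly). The helper names are hypothetical, and the sketch does not invoke `llama-completion` itself.

```python
import hashlib

# Reference md5 prefixes quoted in this section, keyed by (model, KV path).
REFERENCE_MD5 = {
    ("dense", "paged"): "5951a5b4",
    ("dense", "non-paged"): "5951a5b4",
    ("moe", "paged"): "8cb0ce23",
    ("moe", "non-paged"): "07db32c2",
}


def md5_prefix(output: bytes, length: int = 8) -> str:
    """First `length` hex characters of the md5 digest of `output`."""
    return hashlib.md5(output).hexdigest()[:length]


def gate_passes(model: str, path: str, output: bytes) -> bool:
    """Compare captured output against the reference for its own path only."""
    return md5_prefix(output) == REFERENCE_MD5[(model, path)]
```

Keying the table by path is the point: a paged-MoE run is checked against `8cb0ce23` and never against the non-paged `07db32c2`.
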
**MoE-parity conclusion** (the residual gap is structural). The two heaviest MoE
decode kernels - the GDN-SSM recurrence and the NVFP4-expert GEMM - are llama
**wins** after this series (the recurrence runs at 102.6% of vLLM's bandwidth;
the GEMM ties vLLM at the LPDDR5x BW floor). The residual gap is **bf16-projection
bandwidth + the host scheduling loop**, both at the LPDDR5x floor - not a kernel
llama is losing. The MoE GEMM kernel is *not* where the gap lives.
**Rejected / flat levers** (recorded so they are not re-tried):
- **Lever 2 - graph/stream coverage: FLAT.** Bit-exact graph coverage was
exhausted by 0025; more graph/stream overlap is a no-op or small regression on
this model.
- **Lever 3 - act-quant fusion: FLAT.** The W4A4 act-quant tax is removable only
by W4A16 (a precision change, rejected) or a structural kernel rewrite; no
further bit-exact lever clears it. 0023 already banks the de-dup.
- **Lever 4 - NVFP4 the bf16 GDN/attn projections: REJECTED (KL-gate fail).**
Quantizing the projections to NVFP4 costs ~+6% PPL; vLLM deliberately keeps the
same bf16 projections. No-ship.
- **W4A16-Marlin MoE GEMM: REJECTED.** It would be a precision upgrade nobody
needs bought with a ~5% slower kernel; both kernels are already at the BW floor.
(The "the win was NVFP4-dense-quant, not the Marlin kernel" dense verdict
carries over to MoE.)
**Opt-in bf16-SSM fast mode** (patch 0026, `ssm_bf16_tau`). The design premise -
that bf16 KL error concentrates in long-memory heads and can be removed by
keeping them f32 - is **empirically refuted**: the error scales with the bf16
head *count* and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any useful
byte saving, and the carry is byte-exact (genuine bf16 rounding, not a bug). The
byte-saving (and ~+12% decode) is real but cannot meet a strict KL bar, so it
ships **default-off (f32, bit-exact)** and opt-in only. Do not put a hybrid tau
in a recommended/gallery config.
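
For intuition about why bf16 state storage cannot be bit-exact, the sketch below rounds an f32 value to bf16 (round-to-nearest, ties to even) in plain Python. It illustrates the number format only; it is not the patch's kernel code, and `round_to_bf16` is a hypothetical helper that handles finite inputs only.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round a finite value to the nearest bf16 (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # f32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the dropped 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent and top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Near 1.0, bf16 values are spaced 2**-7 apart, so smaller updates are lost.
for value in (1.0, 1.0 + 2**-8, 1.0 + 3 * 2**-8):
    print(value, "->", round_to_bf16(value))
```

Each store drops 16 of the 32 bits, which is where the byte saving comes from and also why every bf16 head contributes some rounding error regardless of its memory length.
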
---
## 6. Architecture and quant generality
(From the arch-generality and quant-generality audits.)
- **15 of 16 optimizations are quant-AGNOSTIC.** Only **0023** (NVFP4
activation-quantize de-dup) is NVFP4-specific. The SSM/paged/MMQ optimizations
help **any quant** of these models (the GDN recurrence, conv, gather and
o_proj-MMQ levers operate on the f32 recurrent state and the routing layout,
not on the weight dtype).
- **Arch-safe to build everywhere.** NVFP4 use is Blackwell-gated and falls back
to dequant on other hardware; the GB10-tuned occupancy params (0022) are
perf-only and env-selectable (`GDN_NW` / `GDN_CPW`), so they never change
correctness on other GPUs. Patch 0030 makes the fused-op emission CUDA-family +
CPU only, so a non-CUDA paged build routes to the safe upstream non-fused path.
---
## 7. Pin + maintenance policy
- **Pinned to llama.cpp `c299a92c`.** The pin is advanced **only** by the manual
[`PIN_SYNC`](PIN_SYNC_c299a92c.md) process: rebase the source-only patch series
onto the new tip, rebuild on GPU, and pass the bit-exact gate on every path
(dense + MoE, paged + non-paged) plus `test-backend-ops`. The `9d5d882d ->
c299a92c` jump (23 upstream commits) needed zero patch changes and did not
change decode output.
- **Decoupled from the nightly auto-bumper.** There is deliberately **no**
`bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could
silently shift the tree out from under the patches.
- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../../../.github/workflows/llama-cpp-paged-canary.yml)
(via [`.github/scripts/paged-canary-apply.sh`](../../../../../.github/scripts/paged-canary-apply.sh))
tries the patch series against the latest upstream tip with the build's own
strict `git apply`. **Red = upstream drifted past the series -> run a
PIN_SYNC** (do not bump the pin blindly). The canary references
[`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md).
---
## 8. Models
The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery:
| Gallery entry | Weights (HuggingFace) | Notes |
|---|---|---|
| `qwen3.6-27b-nvfp4-paged` | [`mudler/Qwen3.6-27B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF) | Dense, native Blackwell NVFP4 (FP4-MMA). |
| `qwen3.6-35b-a3b-nvfp4-paged` | [`mudler/Qwen3.6-35B-A3B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF) | MoE (256 experts, top-8), `file_type MOSTLY_NVFP4`. |
Both gallery entries set `backend: llama-cpp-localai-paged` and the paged serving config
(`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`,
`flash_attention:on`, `context_size`). They intentionally stay bit-exact (no
`ssm_bf16_tau`). The full backend-split + gallery plan is in
[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md).


@@ -1,400 +0,0 @@
# RMSNORM_FP4_FOLD.md - ceiling-critic verdict (label ceiling-critic, READ-ONLY, no GPU)
Completeness audit of the post-0022/0023 bit-exact decode surface: is the rms_norm -> fp4
producer-fold the BEST remaining bit-exact decode lever, or is something better being missed?
Source: all paged/*.md verdicts + the 0019/0021/0023 patch diffs (local, read-only). No GPU touched.
## Starting line (post-0023)
- Dense q36-27b-nvfp4: 373.2 t/s @ npl128 = 95.4% of vLLM 391. Dense is UNTOUCHED by 0023.
- MoE q36-35b-a3b: 758 t/s @ npl128 (0023 +1.73%).
- Decode = ONE replayed CUDA graph, single stream, 99.94% GPU-busy, 0.06% idle. Removed/folded
kernel GPU-time cuts wall 1:1, and DISJOINT folds STACK 1:1 (each removes a distinct kernel).
- gated_delta_net recurrence = ~50% of the step, at 84.6% peak BW (past vLLM's 82.4%), PLATEAUED.
## TIER 0 - confirmed NO bit-exact lever (dead, do not pursue)
(a) GDN recurrence past 84.6% - NO. The 0022 sweep is MONOTONIC toward grid.z=1: 8x4 (grid.z=4,
32 cols/block) = 79.9%, 16x4/8x8 (grid.z=2) = 82.3%, 16x8/32x4 (grid.z=1, all 128 cols in one
block = max in-flight independent state-loads per warp) = 84.6%. grid.z>1 is the WRONG direction
(fewer cols/block = less memory-level parallelism = lower BW), already measured worse. The only
thing past 84.6% is the float4/vectorized load or a different row-partition, BOTH of which
repartition which rows a lane sums into the warp-butterfly = a different reduction grouping =
breaks md5 (the exact f32x4 trap that was explicitly avoided). 84.6% (230.9 of 273 GB/s) is at
the practical LPDDR5x DRAM ceiling AND past vLLM. No bit-exact decomposition exists. FLOOR.
(b) flash_attn_ext_f16 (3.1%) - NO. 48 CTAs = exactly one full wave, no occupancy headroom, no tail.
Every grid knob (split-KV / parallel_blocks / ncols / cols_per_block / KV-retile) changes the
online-softmax running-max/sum RESCALE ORDER across KV blocks = forbidden. FLOOR.
(c) lm_head (nvjet/cublas, 3.1%) - NO. cublas-internal; any algo/kernel swap changes the K-accum
order vs the current f32 reference = breaks md5. Already tuned. No knob. NO lever.
(d) mul_mat_q FP4 GEMM (~24-27%, the biggest bucket) - NO decode lever. P2a (mmq_y=64 / minblocks=2)
is bit-exact (1115/805, md5-identical) but MEASURED FLAT on decode (decode mmq -1.1%, stream_k
fixup +1.7ms = net worse). The -24.7% is a PREFILL large-N asymptotic number; the m=128 decode
GEMM is LPDDR5x-bandwidth-bound and mmq_y is deliberately bandwidth-neutral. FLOOR.
=> Of the four largest buckets (recurrence 50% + GEMM 25% + lm_head 3% + attn 3% = ~81% of the
step), NONE has any bit-exact lever left. All remaining headroom lives in the ~12% of small,
foldable glue/quantize/gather buckets below.
## TIER 1 - the bit-exact-feasible folds, RANKED by ROI (gain / plumbing+risk)
Confirmed bit-exact-foldable buckets from the post-0021/0022 node trace:
- quantize_mmq_nvfp4 ........ 4.5% (dense-foldable ~2.7% ceiling; fold captures ~2-2.5%)
- k_get_rows_float .......... 1.9-2.1% (STILL LIVE post-0021; pure gather)
- pointwise glue ............ ~3.1% (k_bin_bcast 1.7% + silu/sigmoid output-gate 1.4%; ~1.5-2.5% net)
Rank 1 - POINTWISE ACTIVATION FOLD (~1.5-2.5%, MEDIUM plumbing, NO new ABI). Best ROI/risk of the
three. Fold k_bin_bcast residual-adds + gate-muls and the silu/sigmoid output gate into adjacent
kernel epilogues/prologues. Pure elementwise f32, same formula+order standalone or folded =
byte-identical. STRICT EXCLUSION: do NOT re-fold the rms_norm/l2_norm REDUCTIONS (reduction-tree /
eps-placement trap). No frozen ABI, no GEMM surgery. Well-scoped already (NONRECURRENCE Lever #2).
Rank 2 - rms_norm -> fp4 PRODUCER-FOLD (the proposed lever) (~2-2.5% realistic dense, HIGHEST
plumbing). LARGEST single clean dense bucket and HIGHEST-confidence ROI (skip-B measured dense
+2.7% for the whole quantize; the fold removes the f32 round-trip, keeps the quant compute, so
~2-2.5%). BIT-EXACT VERDICT: SOUND, and NOT the f32x4-trap class. The trap changed a REDUCTION
grouping; this fold touches only (i) the sumsq block-reduce, kept BYTE-IDENTICAL, and (ii) the
writeback, where the post-norm normalize-MUL is pointwise (order-independent, identical out_i for
any thread partition) and the NVFP4 quant is per-16-consecutive PER-THREAD with NO cross-thread
shfl (verified in quantize.cu; 0023 already shipped on exactly this property and held the byte
gate). Re-partitioning the writeback to 16-consecutive-per-thread therefore changes only WHO
writes/quantizes each element, not the VALUES or the reduction. md5-safe. BUT it carries the worst
plumbing-to-ROI ratio: 3-op {RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion + a mul_mat_q
prequantized-src1 path + the frozen block_fp4_mmq ABI + a per-call scratch pool. This is the
LAST-MILE lever, not the first.
Rank 3 - GET_ROWS / STATE-GATHER FOLD (~up to 2%, LOW-MEDIUM plumbing, ZERO reduction risk -
but UNDER-SCOPED). k_get_rows_float is STILL 7.29-7.32 ms = ~2.1% of the step post-0021/0022; the
0021 author KEPT the build_rs conv-tap + recurrent-state gathers, explicitly deferring them
("tiny; not one of the eliminated buckets"), NOT proving them unfoldable. A gather is a pure copy
with NO reduction = the SAFEST possible bit-exact fold (the exact property the 0023 dedup
exploited). Folding the residual build_rs gathers into their consuming kernel (read from cache via
ids/block-table instead of a pre-gathered f32 scratch, mirroring 0019's gather-free recurrence) is
bit-exact by construction. Ranked 3 only because the FOLDABLE FRACTION needs a one-pass source
scoping (some of the 2% may be the "tiny" conv-tap part already); the ROI is lower-confidence than
Rank 1/2, but the RISK is the lowest of all. THIS IS THE "SOMETHING BEING MISSED": it is a live
~2% bit-exact bucket that the current plan does not address.
## IS THE fp4 FOLD THE RIGHT NEXT BUILD?
DEFENSIBLE, but NOT unambiguously the best by ROI. It is the largest single well-understood
bit-exact dense bucket and the verdict is sound (no trap). HOWEVER its plumbing is the highest of
the three folds, and the POINTWISE fold matches its realistic gain (~1.5-2.5%) at MEDIUM plumbing
with no new ABI, while the GET_ROWS fold offers ~2% at the lowest risk (pure copy). The fp4 fold has
the worst gain/plumbing ratio of the candidates.
Recommended build order (all bit-exact, all stack 1:1 on the serial single stream):
1. POINTWISE activation fold first (cheapest, no ABI, ~1.5-2.5%).
2. GET_ROWS gather fold second, after a short source-scoping pass (~up to 2%, lowest risk).
3. rms_norm -> fp4 producer-fold LAST (the high-plumbing last mile, ~2-2.5% dense), built only if
the remaining gap to the chosen target still justifies the ABI/graph-fusion surgery.
If the workflow insists on a SINGLE decisive lever and accepts the plumbing, the fp4 fold is the
biggest one and a legitimate choice - but it should be sequenced after the cheap folds, not before.
## HONEST BIT-EXACT CEILING
The three folds remove DISJOINT kernels on a 99.94%-busy serial stream, so they STACK:
~2-2.5% (fp4) + ~1.5-2.5% (pointwise) + ~2% (get_rows) = ~5.5-7% gross on dense.
373 t/s + ~5.5-7% = ~393-399 t/s = ~100-102% of vLLM 391.
=> The bit-exact dense ceiling is vLLM PARITY-to-slightly-ahead (~100%), NOT 95%. Declaring the
ceiling at ~95% would leave ~4-5% of identified, bit-exact-FEASIBLE fold headroom unbuilt.
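The stacking arithmetic above can be re-derived with a few lines of Python. This is only a sketch of the doc's own first-order model (throughput scaled by `1 + f`); the dictionary keys are illustrative labels, and the numbers are the ranges quoted above, not new measurements.

```python
# Re-derive the "folds stack 1:1" ceiling from the ranges quoted in the text.
baseline_tps = 373.0   # dense q36-27b-nvfp4 @ npl128 (post-0023)
vllm_tps = 391.0       # vLLM reference

# (low %, high %) of the decode step each fold is estimated to remove.
folds = {"fp4": (2.0, 2.5), "pointwise": (1.5, 2.5), "get_rows": (2.0, 2.0)}

gross_lo = sum(lo for lo, _ in folds.values())
gross_hi = sum(hi for _, hi in folds.values())

# First-order model used in the text: throughput scales by (1 + f).
# (Strictly, removing a fraction f of step time scales throughput by 1/(1-f),
# which is slightly more optimistic; the doc uses the linear approximation.)
tps_lo = baseline_tps * (1 + gross_lo / 100)
tps_hi = baseline_tps * (1 + gross_hi / 100)

print(f"gross fold headroom: {gross_lo:.1f}-{gross_hi:.1f}%")
print(f"projected: {tps_lo:.1f}-{tps_hi:.1f} t/s "
      f"= {100 * tps_lo / vllm_tps:.1f}-{100 * tps_hi / vllm_tps:.1f}% of vLLM")
```

The projection is a gross ceiling; the shippable figure discussed next applies a haircut on top of it.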
Realistic SHIPPABLE ceiling (fold inefficiency + the realistic-vs-ceiling haircut + some buckets
resisting clean folding): ~98-100% of vLLM dense. The recurrence (50%) is already past vLLM and
at the DRAM floor; attention/lm_head/mul_mat_q have no bit-exact lever; everything left is the
~6% of small folds above. There is no fourth large bit-exact lever hiding anywhere.
Caveat that frames the whole result: vLLM 391 is a LOWER-precision reference (w4a4/w4a16 acts vs
llama's q8_1; the recurrence is algebraically reassociated). Bit-exact-vs-vLLM is IMPOSSIBLE; the
only meaningful cross-engine bar is throughput + top-1/KL, and llama at 373 (95%) bit-exact f32 is
already doing strictly MORE precise arithmetic at near-equal throughput. Closing the last ~5% with
the folds reaches throughput parity at higher precision - a strong result, but each fold is a
diminishing 1.5-2.5% at rising plumbing. The bf16-state over-clock (shelved) is the only thing that
goes materially AHEAD, and it is non-bit-exact (KL-gated), out of scope for this gate.
Assisted-by: Claude:opus-4.8 [Claude Code]
====================================================================================================
# RMS_NORM -> NVFP4 PRODUCER-FOLD - PRECISE IMPLEMENTATION DESIGN (label fold-design, READ-ONLY, no GPU)
Design-only, no GPU. Reads: DGX `~/llama-paged-dev` HEAD f7409c2 (patch 0023) + `git stash@{0}`
(trackA1-prequant-nvfp4-fused-rmsnorm) + norm.cu/quantize.cu/mmq.cu/mmq.cuh/ggml-cuda.cu/qwen35.cpp.
## 0. One-line verdict
The fold is bit-exact-FEASIBLE, BUT the Lever-2 stash that exists as the starting point is
(a) almost certainly bit-INEXACT and (b) was measured FLAT. The single mandatory fix is the
reduction block_size dispatch; the single thing that makes it not-flat is de-dup-across-siblings
+ skipping the dead f32 write at the FFN boundary. Build the FFN boundary first, gate on a measured
per-call producer-vs-removed-quantize win before extending. Honest expectation: ~1.5-2.5% dense
best case, real risk of flat (Lever-2 precedent). Lower-risk alternative in Section 7.
## 1. Which graph nodes fuse
Both boundaries already collapse rms_norm+gain into ONE `rms_norm_f32<bs, do_multiply=true>` kernel
(existing fuse, ggml-cuda.cu:3675). That kernel's f32 output is the byte-exact target.
- FFN (STRONGEST), qwen35.cpp:188-192 + build_layer_ffn:478-487:
`attn_post_norm = build_norm(cur, RMS)` feeds EXACTLY `ffn_up` + `ffn_gate` (both NVFP4 MMQ at
m=128). NO non-NVFP4 consumer (residual = pre-norm `cur`; ffn_down eats silu(gate)*up). => the
f32 normed tensor is DEAD once both GEMMs read fp4 -> producer can skip the f32 write. An existing
`{MUL_MAT, MUL_MAT, GLU}` fuse (ggml-cuda.cu:3631) already groups up+gate+GLU -> the natural seam.
- GDN/attn (weaker), qwen35.cpp:161 + build_qkvz:228-243:
`attn_norm = build_norm(inpL, RMS)` feeds `wqkv` + `wqkv_gate` (NVFP4 MMQ, share src1) AND
`ssm_beta` + `ssm_alpha` (small N=n_v_heads -> MMVQ, READ THE f32). => f32 still live, producer
MUST write f32 -> smaller win.
- MoE FFN (qwen35moe.cpp) goes via mul_mat_id, already 0023-deduped -> out of scope. Fold = dense only.
## 2. Byte-exact target (norm.cu rms_norm_f32<bs,true>)
Dispatch (norm.cu:304-380): `bs = (ncols < 1024) ? 256 : 1024`, shmem 32*float.
```
float tmp = 0.0f;
for (int col = tid; col < ncols; col += bs) tmp += x[col] * x[col];               // (R1) strided sumsq grouping
tmp = block_reduce<SUM, bs>(tmp, s_sum);                                           // (R2) tree width depends on bs
const float mean  = tmp / ncols;                                                   // (R3) exact div
const float scale = rsqrtf(mean + eps);                                            // (R3) exact eps placement
for (int col = tid; col < ncols; col += bs) dst[col] = scale * x[col] * mul[col];  // (W) per-channel gain, mul_col==col
```
(W) is per-column independent (scale block-uniform) -> writeback may be re-partitioned. (R1/R2/R3)
are the ONLY order-sensitive parts and must stay byte-identical.
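The split between the order-sensitive reduction and the partition-invariant writeback can be illustrated on the host. The following is a minimal Python sketch, not the CUDA kernel: `f32`, `sumsq` and `writeback` are hypothetical helpers, f32 rounding is emulated with `struct`, the pairwise tree only stands in for `block_reduce`, and `1/sqrt` stands in for `rsqrtf`. The input data is synthetic.

```python
import math
import struct

def f32(x: float) -> float:
    """Round a Python float (f64) to the nearest IEEE-754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sumsq(x, bs):
    """(R1) strided per-lane partial sums, then (R2) a pairwise tree over bs lanes."""
    lanes = []
    for tid in range(bs):
        tmp = 0.0
        for col in range(tid, len(x), bs):
            tmp = f32(tmp + f32(x[col] * x[col]))
        lanes.append(tmp)
    while len(lanes) > 1:  # bs is a power of two
        lanes = [f32(lanes[i] + lanes[i + 1]) for i in range(0, len(lanes), 2)]
    return lanes[0]

def writeback(x, mul, scale, order):
    """(W) visit columns in `order`; each output depends only on its own column."""
    dst = [0.0] * len(x)
    for col in order:
        dst[col] = f32(f32(scale * x[col]) * mul[col])
    return dst

ncols, eps = 5120, 1e-6
x = [f32(math.sin(0.37 * i) * (1 + i % 7)) for i in range(ncols)]
mul = [f32(1.0 + 0.001 * (i % 13)) for i in range(ncols)]

bs = 256 if ncols < 1024 else 1024          # the canonical dispatch rule
tmp = sumsq(x, bs)
scale = f32(1.0 / math.sqrt(f32(f32(tmp / ncols) + eps)))

# Two writeback partitions: strided-by-bs vs 16-consecutive-per-thread.
strided = [c for tid in range(bs) for c in range(tid, ncols, bs)]
per16 = [c for s in range(ncols // 16) for c in range(16 * s, 16 * s + 16)]
out_strided = writeback(x, mul, scale, strided)
out_per16 = writeback(x, mul, scale, per16)

print("writeback identical across partitions:", out_strided == out_per16)
print("sumsq with bs=1024:", tmp, "| with bs=256:", sumsq(x, 256))
```

The writeback comparison holds by construction, since every output is a pure function of its own column and the block-uniform `scale`. The two `sumsq` values group the same addends differently, so they may differ in the last bits; that is the grouping the kernel must not change.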
## 3. Fused producer kernel (quantize.cu) - deltas vs the stash
Start from stash `rms_norm_mul_quantize_nvfp4_kernel` + the factored `quantize_nvfp4_write_subblock`
(verbatim per-thread NVFP4 quant). Required changes:
1. TEMPLATE on block_size + launch `bs=(ncols<1024)?256:1024` (NOT the stash's hardcoded 256). MANDATORY.
2. Reduction pass VERBATIM (R1/R2/R3): scalar strided sumsq, `block_reduce<SUM,bs>`, `mean=tmp/ncols`,
`scale=rsqrtf(mean+eps)`. Byte-identical once bs matches.
3. Writeback re-partitioned to 16-consecutive-per-thread: `for s=tid; s<n_sub; s+=bs`, col0=s*16,
`v=scale*xr[col]*mul[col]` (col<ncols else 0), amax=max|v|, `quantize_nvfp4_write_subblock(vals,
amax, sub, y+ib)`, `ib=k_block*ne11+row`, n_sub=ncols_padded/16. x is re-read (canonical does too).
4. `template<bool write_f32>`: FALSE at FFN (skip `dr[col]=v` -> drop the producer's f32 store),
TRUE at GDN (beta/alpha read it). THIS is what turns re-bucketing into a real traffic cut.
Buffer ABI frozen: block_fp4_mmq = {uint32_t d4[4]; int8_t qs[128]} = 144B = 9 uint4 = 4*block_q8_1
(mmq.cuh:53). Same layout quantize_mmq_fp4_cuda emits; GEMM stride
s12=ne11*ne10_padded*sizeof(block_fp4_mmq)/(QK_K*sizeof(int)).
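The size identities quoted for the frozen buffer ABI can be checked with `ctypes`. This is a sketch only: the `BlockFp4Mmq` fields mirror the layout given above, while the `BlockQ8_1` layout (two fp16 values followed by 32 int8 quants) is an assumption about ggml's `block_q8_1`, labeled as such in the code.

```python
import ctypes

class BlockFp4Mmq(ctypes.Structure):
    # Layout as quoted in the text: {uint32_t d4[4]; int8_t qs[128]}
    _fields_ = [("d4", ctypes.c_uint32 * 4), ("qs", ctypes.c_int8 * 128)]

class BlockQ8_1(ctypes.Structure):
    # ASSUMPTION: two fp16 values (stored here as raw uint16) + 32 int8 quants.
    _fields_ = [("ds", ctypes.c_uint16 * 2), ("qs", ctypes.c_int8 * 32)]

UINT4_BYTES = 16  # a CUDA uint4 is four 32-bit words

fp4_bytes = ctypes.sizeof(BlockFp4Mmq)
q8_1_bytes = ctypes.sizeof(BlockQ8_1)

print("block_fp4_mmq bytes:", fp4_bytes)
print("as uint4 units:", fp4_bytes // UINT4_BYTES)
print("as block_q8_1 units:", fp4_bytes // q8_1_bytes)
```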
## 4. mul_mat_q prequantized-src1 plumbing (mmq.cu/mmq.cuh)
Re-add the stash hook on top of 0023: `ggml_cuda_mul_mat_q(..., const char* src1_prequantized=nullptr)`.
In the NON-ids branch: if non-null, skip quantize_mmq_fp4_cuda + the local pool alloc, point mmq_args
src1_q8_1 at it. GEMM byte-UNTOUCHED (the bit-exactness firewall). 0023 ids-branch untouched (orthogonal).
Sharing across non-adjacent siblings:
- FFN (preferred): extend `{MUL_MAT,MUL_MAT,GLU}` to `{RMS_NORM,MUL,MUL_MAT,MUL_MAT,GLU}` super-fuse;
one producer (write_f32=false) + one pool buf spanning both GEMMs + GLU, all in one handler. Clean.
- GDN/general: a scratch cache keyed by the normed tensor ptr (graph-eval lifetime); defer until FFN wins.
The stash folds only ONE consumer with a stack-scoped qbuf -> the sibling still standalone-quantizes
(a key reason it was flat; nsys showed quantize 12896->10816, not ->0).
## 5. Bit-exactness argument
(1) NVFP4 quant of each 16-elem sub-block = PURE per-thread function, NO cross-thread shfl/reduction
(quantize.cu; the exact property 0023 shipped on). => writeback re-partition cannot change a byte.
(2) v=scale*x[col]*mul[col] byte-identical iff scale identical (R1/R2/R3 preserved via bs dispatch)
AND expression verbatim (left-assoc, scalar). Per-column independent -> partition-invariant.
=> produced block_fp4_mmq bytes == standalone == 0022/0023 baseline; GEMM untouched -> md5 held.
Gate: BATCHED (ne[1]>8) md5 == 5951a5b4 dense + 1115/1115 - NOT just batch=1 (the gate Lever-2 skipped).
## 6. THE TRAP
- block_size trap (the stash's latent bug): canonical = `ncols<1024?256:1024`; qwen35 n_embd is
1024/2560/4096 (qwen35.cpp:30-31) -> canonical is rms_norm_f32<1024> (LEVER2 nsys confirms). Stash
hardcodes 256 -> different strided grouping {tid,tid+256,...} vs {tid,tid+1024,...} AND 8-warp vs
32-warp reduce -> different f32 order -> md5 break. FIX = template+dispatch matching bs.
- f32x4 vectorize trap (recurrence class): do NOT vectorize the sumsq load or align the reduction
partition to the 16-consecutive writeback. Keep sumsq scalar + strided-by-bs.
- eps/assoc: `rsqrtf(mean+eps)`, `mean=tmp/ncols`, `(scale*x)*mul`. Never reassociate.
- GEMM K-reduction / stream-k / tile loads: forbidden (NONRECURRENCE FORBIDDEN list). Fold only
changes WHO writes src1.
## 7. Contrast with Lever-2 + lower-risk alternative
Lever-2 (stash) was FLAT (+0.3% dense) and NET-ADDED GPU-time (+2.3% fused vs -1.1% quantize -0.9%
rms_norm) because it (a) folded only 1 of 2 siblings, (b) always wrote f32, (c) bs=256 (wrong AND
non-canonical). It md5'd only batch=1 (fuse off) -> bit-inexactness never caught. The new fold beats
it ONLY with de-dup-both-siblings + skip-dead-f32-at-FFN; without BOTH, expect flat again.
LOWER-RISK alt (recommend evaluating first): dense quantize DE-DUP, no fold - keep the efficient
standalone quantize, quantize the shared normed activation ONCE, reuse for wqkv+wqkv_gate /
ffn_up+ffn_gate (CSE keyed by src1 ptr, the dense analog of 0023). ZERO reduction risk (rms_norm
untouched), much less plumbing; ceiling ~<=1% (redundant half only), which the fold's de-dup half
captures anyway. The fold's only incremental value is the f32 round-trip read, which Lever-2 showed
is easily eaten by the fused kernel's added work.
## 8. Scope + build order (the gate)
Scope dense qwen35: quantize.cu/.cuh (templated kernel + bs dispatch), mmq.cu/.cuh (src1_prequantized
on non-ids path), ggml-cuda.cu (FFN super-fuse, gate: NVFP4 src0 + Blackwell + ne[1]>MMVQ_MAX_BATCH_SIZE
+ ne2==ne3==1 + per-channel gain; flag LLAMA_FUSE_NVFP4_QUANT).
Build order: (1) FFN super-fuse only (write_f32=false + de-dup); measure per-call producer GPU-time
vs the two removed quantizes (nsys node trace, same-build flag toggle); SHIP only if decode_agg
actually lifts AND batched md5==0022/1115. (2) Only if (1) lifts, add the GDN boundary (write_f32=true,
keyed scratch). Realistic: ~1.5-2.5% dense FFN best case; ceiling +2.7% (skip-ALL) is unreachable
(fold keeps quant compute+write). If step 1 is flat, dense quantize is at its bit-exact floor -> stop.
Assisted-by: Claude:opus-4.8 [Claude Code]
====================================================================================================
# RE-PROFILE TARGET MEASUREMENT (label reprofile-target, THE GPU agent) - post-0023, HEAD f7409c2
Fresh node-level nsys re-profile of the DENSE decode to confirm the fold target size, foldable
fraction, critical-path status, and the realistic recoverable ceiling, BEFORE BuildFold commits.
## Build-dir correction (acted on)
The orchestrator framed `build-cuda-base` as the clean 0023 build. It is NOT: empirically
`build-cuda-base` = stale pre-0021 (336.71 t/s), the real post-0023 build is `build-cuda` (371.81 t/s,
git-clean tree, no mmq.cuh P2a remap). All numbers below are from `build-cuda`. (Dense profiling is
unaffected by the 0023 MoE de-dup knob - dense has no MoE.)
## Confirmed baseline
- llama-batched-bench dense q36-27b-nvfp4 npl128 ntg128: 371.81 t/s, 344 ms/decode-step. CONFIRMS the
~343 ms / ~373 t/s target. (build-cuda-base stale = 336.71 t/s.)
- nsys --cuda-graph-trace=node, 103 steady windowed steps: step span 345.0 ms, mean kernel-busy 99.0%,
sum-of-kernels/span = 98.9% (< 100% => no NET overlap; serial single stream, ~1.1% idle).
## Dense decode decomposition (ms/step)
gated_delta_net 168.06 (49.2%) BINDING | mul_mat_q<NVFP4,128> 93.57 (27.4%) |
**quantize_mmq_nvfp4 17.55 (5.1%)** | nvjet 12.02 (3.5%) | flash_attn_ext 11.64 (3.4%) |
ssm_conv 8.56 (2.5%) | k_get_rows_float 7.32 (2.1%) | silu 5.36 | k_bin_bcast(mul) 4.64 |
stream_k_fixup 3.95 | rms_norm 3.53 (1.0%). TOTAL kernel 341.25.
## quantize_mmq_nvfp4 at the dense decode shape (the answer)
- TOTAL: 17.55 ms/step = 5.1% of kernel time = 5.08% of the 345 ms wall. 496 quant calls/step (1 per
NVFP4 GEMM src1). CONFIRMS the verdict's 17.66 ms / ~4.5-5% (the stray "3.7%" reading was wrong).
- Decomposes EXACTLY by input dim K (graph-verified in qwen35.cpp; 64 layers = 48 GDN + 16 attn):
- K=5120 (368/step) FOLDABLE: GDN {wqkv, wqkv_gate, beta, alpha} + attn {wq,wk,wv} + both {ffn_up,
ffn_gate}. All fed by a plain rms_norm+mul (attn_norm or attn_post_norm). beta/alpha CONFIRMED
foldable: they read the same `cur` as wqkv (qwen35.cpp:359/366).
- K=6144 (64/step) UNAVOIDABLE: ssm_out (gated-norm = rms_norm + mul(ssm_norm) + mul(silu(gate)),
two muls break the chain) + wo (attn-gated producer).
- K=17408 (64/step) UNAVOIDABLE: ffn_down (silu(gate)*up producer).
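
The per-K call counts follow directly from the layer mix. A small Python tally, using only the layer counts and consumer lists stated above (the dictionary names are illustrative):

```python
# Quantize calls per decode step, grouped by the GEMM input dim K.
n_gdn, n_attn = 48, 16
n_layers = n_gdn + n_attn  # 64

calls = {
    5120: {  # fed by a plain rms_norm+mul -> FOLDABLE
        "gdn: wqkv, wqkv_gate, beta, alpha": 4 * n_gdn,
        "attn: wq, wk, wv": 3 * n_attn,
        "all layers: ffn_up, ffn_gate": 2 * n_layers,
    },
    6144: {  # gated producers -> UNAVOIDABLE
        "gdn: ssm_out": n_gdn,
        "attn: wo": n_attn,
    },
    17408: {  # silu(gate)*up producer -> UNAVOIDABLE
        "all layers: ffn_down": n_layers,
    },
}

per_k = {k: sum(v.values()) for k, v in calls.items()}
total = sum(per_k.values())
for k, n in per_k.items():
    print(f"K={k}: {n} quants/step")
print("total:", total)
```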
## Foldable portion (measured) - LARGER than the byte-model 2.7%
The quant kernel is NOT byte-proportional: ffn_down (K=17408) measures 3.62 ms but a byte-model
predicts 5.75 ms. Small-K quants are launch/overhead-bound (flat ~21.7 us floor, K=5120 vs 6144
indistinguishable), so the byte model UNDER-counts the numerous small-K (foldable) calls.
- byte-model FOLDABLE = 9.73 ms = 2.82% of step
- flat-split FOLDABLE = 11.90 ms = 3.45% of step (368 small-K quants, the physically correct one)
- => true FOLDABLE raw GPU-time = 9.7 - 11.9 ms = 2.8% - 3.4% of step. UNAVOIDABLE = ssm_out+wo
~2.1 ms + ffn_down 3.62 ms = ~5.7 ms (1.6%).
- Sub-split for the build order: the FFN boundary alone (ffn_up+ffn_gate, f32 DEAD -> cleanest fold)
= 128 quants/step ~4.1 ms; the input-norm boundary (wqkv/wqkv_gate/wq/wk/wv, +beta/alpha keep f32)
= ~7.8 ms raw but lower net efficiency.
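
The two foldable estimates can be reproduced from the figures in this section. The sketch below assumes the byte model weights each call by K, and that the flat-split spreads everything except the measured ffn_down time evenly over the small-K calls; both are readings of the text, not the profiling script itself, and the results match the quoted values only to within rounding.

```python
# Byte-model vs flat-split estimate of the FOLDABLE quantize time.
total_ms = 17.55           # quantize_mmq_nvfp4 per step
step_ms = 345.0            # decode step span
ffn_down_measured = 3.62   # measured K=17408 quantize time (ms/step)
counts = {5120: 368, 6144: 64, 17408: 64}   # calls/step by K

# Byte model: time proportional to elements quantized (K * calls).
weight = {k: k * n for k, n in counts.items()}
byte_ms = {k: total_ms * w / sum(weight.values()) for k, w in weight.items()}

# Flat split: small-K calls share one per-call cost (launch/overhead-bound).
small_calls = counts[5120] + counts[6144]
per_call_ms = (total_ms - ffn_down_measured) / small_calls
flat_foldable = per_call_ms * counts[5120]
flat_gated = per_call_ms * counts[6144]      # ssm_out + wo

print(f"byte model: ffn_down {byte_ms[17408]:.2f} ms, foldable {byte_ms[5120]:.2f} ms "
      f"({100 * byte_ms[5120] / step_ms:.2f}% of step)")
print(f"flat split: foldable {flat_foldable:.2f} ms "
      f"({100 * flat_foldable / step_ms:.2f}% of step), ssm_out+wo {flat_gated:.2f} ms")
print(f"FFN boundary (128 calls): {128 * per_call_ms:.2f} ms")
```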
## Critical path: YES (1:1)
98.9% kern/span, 99.0% busy, single serial stream, no net overlap. The quant kernels are inline on the
serial decode chain; removing their GPU-time cuts the wall ~1:1. Not a gap-fill (there are no gaps).
## Realistic recoverable - and the honest haircut
RAW foldable removed = 9.7-11.9 ms. NET recoverable is LESS, for reasons the fold-design + ceiling-critic
already flagged and this profile does not overturn:
- the fused producer KEEPS the quant compute + the fp4 write (only the f32 round-trip read is saved,
and the f32 write is droppable ONLY at the FFN boundary where it is dead);
- Lever-2 precedent: the existing stash fold measured FLAT (+0.3% dense) because it folded 1 of 2
siblings, always wrote f32, and used a non-canonical bs=256 reduction;
- TENSION TO FLAG: the critic cites a skip-B probe of only ~+2.7% for the WHOLE quantize, yet the whole
quantize is 5.1% on a 98.9%-serial stream (which predicts ~5.1% if cleanly 1:1). Either these small
kernels are not perfectly 1:1, or the skip probe is unreliable (same class as the NONREC
garbage-routing skip artifact). This caps the realistic NET nearer the conservative end.
=> Realistic NET recoverable: ~1.5 - 2.5% dense (consistent with fold-design Section 8), real risk of
FLAT. Optimistic ceiling if the f32 round-trips fully convert: up to ~3% (371.8 -> ~383 t/s); do not
bank above ~2.5%.
## VERDICT (GPU-measurement view)
- The target is REAL: foldable raw GPU-time 9.7-11.9 ms (2.8-3.4%, slightly LARGER than the prior 2.7%
byte-model floor), squarely on the single-stream critical path (1:1), bit-exact-FEASIBLE (no precision
change), and the largest single clean dense bucket left after the plateaued recurrence.
- BUT the NET recoverable is the contested ~1.5-2.5% with a documented FLAT risk, and this fold has the
HIGHEST plumbing of the three identified folds. Worst gain/plumbing ratio of the candidates.
- RECOMMENDATION: build is DEFENSIBLE but should be SEQUENCED AFTER the cheaper pointwise + get_rows
folds (per ceiling-critic). If built as the single decisive lever, do the FFN boundary FIRST (cleanest
~4.1 ms, f32 dead), gate per-call producer-GPU-time vs the two removed quantizes, and SHIP only if
decode_agg actually lifts AND batched md5 == 5951a5b4 (1115/1115). Kill-switch: if the only bit-exact
construction forces re-partitioning the sumsq reduction (changing accumulation order), abort - not
bit-exact.
Assisted-by: Claude:opus-4.8 [Claude Code]
====================================================================================================
# BUILD VERDICT (label fold-build, THE GPU agent) - post-0023, HEAD f7409c2 = patch 0023
DECISION: NO BUILD. The bit-exact decode ceiling is effectively reached for any lever that justifies
its plumbing. The proposed rms_norm -> fp4 producer-fold is NOT built (it was already built once and
measured FLAT), and the recommended lower-risk alternative (dense quantize de-dup) does NOT have a
clean, contained construction for the portion that matters. Tree left clean at 0023; nothing committed
to the code; this verdict appended only.
I extended the read-only agents' analysis with the two things they could not verify from the .md
verdicts alone: (1) the prior EMPIRICAL fold attempt, and (2) the actual graph/dispatch structure in
the source. Both kill the build.
## 1. The fp4 producer-fold was ALREADY BUILT and measured FLAT (decisive)
LEVER2_PROGRESS.md + stash@{0} (trackA1-prequant-nvfp4-fused-rmsnorm) is exactly this fold. Measured:
- dense q36-27b-nvfp4 npl128: 333.32 -> 334.44 t/s (+0.3%), npl32 -0.5%
- MoE q36-35b-a3b npl128: 690.23 -> 690.89 (+0.1%), npl32 -0.3%
nsys A/B (fusion fires): quantize_mmq_nvfp4 -2080 inst (-1.1%), rms_norm_f32<1024> -2080 (-0.9%),
NEW rms_norm_mul_quantize_nvfp4 +2080 (+2.3%). NET GPU-time = +0.3%. The fused producer ADDS BACK
the GPU-time it removes - it RELOCATES work, it does not remove it. The +0.3% wall is exactly
consistent with strict 1:1 wall scaling on the single serial stream (reprofile's own model). So the
fold is not a victim of a bad implementation that a rewrite fixes - it is structurally flat: the
producer must still read x, compute sumsq, normalize, quantize and WRITE the fp4 blocks; the only
recoverable traffic is the f32 round-trip, which the fused kernel's extra work eats (Lever-2 proved
this empirically; fold-design Section 7 and reprofile both predicted it). The design's two "fixes"
(de-dup both siblings + skip dead f32 at FFN) do not change this: the skip-f32 saves one f32 write at
the FFN boundary only (~0.5% of step), and the de-dup-both-siblings is item 2 below.
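A quick tally of the Lever-2 A/B numbers quoted above, as a Python sketch. It only restates the measured throughputs and per-kernel GPU-time deltas from this section; the variable names are illustrative.

```python
# Lever-2 A/B: throughput deltas and net per-kernel GPU-time delta.
def pct(before: float, after: float) -> float:
    return 100.0 * (after / before - 1.0)

dense_npl128 = (333.32, 334.44)
moe_npl128 = (690.23, 690.89)

# Share of GPU-time moved by the fusion, in % (from the nsys A/B).
kernel_delta = {
    "quantize_mmq_nvfp4": -1.1,
    "rms_norm_f32<1024>": -0.9,
    "rms_norm_mul_quantize_nvfp4 (new)": +2.3,
}
net_gpu_time = sum(kernel_delta.values())

print(f"dense: {pct(*dense_npl128):+.1f}%   MoE: {pct(*moe_npl128):+.1f}%")
print(f"net GPU-time: {net_gpu_time:+.1f}%")
```

Both throughput deltas sit inside the +-0.3-0.5% bench noise cited elsewhere in this verdict.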
## 2. The dense quantize de-dup is NOT a clean analog of 0023 (the meaningful part is infeasible)
This is the critical finding the read-only agents missed. 0023's MoE de-dup lifted +1.73% because the
redundancy is INTRA-CALL: inside ONE mul_mat_id, the broadcast (ne11==1) up/gate quantize repeats the
SAME token n_expert_used times, all within a single ggml_cuda_mul_mat_q call, so de-dup is a contained
quantize-once + gather with a stack-scoped buffer. NO precedent issue, NO cross-node lifetime.
The DENSE redundancy is INTER-NODE and that is a different, much harder problem:
- The shared-src1 GEMMs are SEPARATE graph nodes. build_qkvz (qwen35.cpp:228-243) emits wqkv MM,
reshape, wqkv_gate MM; then ssm_beta MM, reshape, sigmoid; ssm_alpha MM, reshape, add, softplus,
mul. The four src1-sharing MMs are INTERSPERSED with reshape/sigmoid/softplus/add/mul - they are
NOT consecutive graph nodes, so ggml's consecutive-op fusion framework cannot match them. A
contained, single-handler de-dup (the only kind with safe buffer lifetime, like 0023) is impossible
for the qkvz bucket.
- De-duping them therefore requires graph-level CSE: recognize 2-4 non-adjacent MUL_MAT nodes share
src1, quantize once, and keep that pool buffer alive across the intervening nodes until the last
sibling GEMM consumes it - under CUDA-graph CAPTURE (buffer addresses baked at capture, the pool
must not recycle the buffer between siblings). This is the SAME high-plumbing scratch-pool +
src1_prequantized path the fold needs, with real implementation risk (graph-capture
non-determinism / crashes), and NO precedent in the tree. fold-design's "much less plumbing"
framing for this alternative is optimistic - the hard part (inter-node buffer sharing under graphs)
is common to both.
- The qkvz bucket (the big one, ~192 redundant quants/step ~= 1.4%) is exactly the inter-node case.
- The ONLY contained, tractable dense de-dup is the FFN {MUL_MAT,MUL_MAT,GLU} (consecutive; build_ffn
LLM_FFN_PAR). But that existing fusion executes ONLY via ggml_cuda_mul_mat_vec (gated on batch<=8;
ggml_cuda_should_fuse_mul_mat_vec_q). At npl128 (m=128) it falls through to two separate MMQ nodes.
Adding an MMQ-path branch to quantize src1 once captures only the FFN redundancy = ~64 quants/step
~= 0.5% of the step - below the +-0.3-0.5% bench noise the runs already show, not worth a new
fusion code path + the risk to the byte gate.
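The intra-handler vs inter-node distinction can be made concrete with a toy graph. The sketch below is not ggml's fusion API: the `(op, name, src1)` tuples, `sibling_groups` and `is_consecutive` are hypothetical, and the node order simply transcribes the build_qkvz and FFN sequences described above.

```python
from collections import defaultdict

def sibling_groups(nodes):
    """Group MUL_MAT node indices by their shared src1 tensor."""
    groups = defaultdict(list)
    for i, (op, _name, src1) in enumerate(nodes):
        if op == "MUL_MAT":
            groups[src1].append(i)
    return {src: idx for src, idx in groups.items() if len(idx) > 1}

def is_consecutive(idx):
    """True if the siblings are adjacent graph nodes (matchable by op-sequence fusion)."""
    return all(b - a == 1 for a, b in zip(idx, idx[1:]))

qkvz = [
    ("MUL_MAT", "wqkv", "attn_norm"), ("RESHAPE", "", None),
    ("MUL_MAT", "wqkv_gate", "attn_norm"),
    ("MUL_MAT", "ssm_beta", "attn_norm"), ("RESHAPE", "", None), ("SIGMOID", "", None),
    ("MUL_MAT", "ssm_alpha", "attn_norm"), ("RESHAPE", "", None),
    ("ADD", "", None), ("SOFTPLUS", "", None), ("MUL", "", None),
]
ffn = [
    ("MUL_MAT", "ffn_up", "attn_post_norm"),
    ("MUL_MAT", "ffn_gate", "attn_post_norm"),
    ("GLU", "", None),
]

for label, graph in (("qkvz", qkvz), ("ffn", ffn)):
    for src, idx in sibling_groups(graph).items():
        lifetime = idx[-1] - idx[0] + 1  # nodes the shared quant buffer must outlive
        print(f"{label}: src1={src} siblings at {idx}, "
              f"consecutive={is_consecutive(idx)}, buffer lifetime={lifetime} nodes")
```

In this toy model the FFN pair is adjacent, so one handler could own the quantized buffer, whereas the qkvz siblings are spread across intervening nodes and the buffer would have to stay alive between them.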
## 3. The pointwise + get_rows folds are not clean wins either
- Pointwise: the cheap ops are ALREADY fused in the tree - {RMS_NORM,MUL(,ADD)} -> rms_norm_fused
(ggml-cuda.cu:4194/4199), {SSM_CONV,(ADD),SILU} -> ssm_conv (4204/4209), {UNARY(silu/sigmoid/
softplus),MUL} -> unary_mul (4216). The residual silu 5.36 + k_bin_bcast 4.64 ms is the un-fusable
remainder inside the GDN gating chain feeding the 50% binding gated_delta_net kernel; GAP_PROGRESS
measured the whole gating-glue ceiling at only 3.35% and folding further means surgery on the binding
kernel. Lower-confidence, needs a GPU node-scoping pass - not a clean lever.
- get_rows: 0019 already folded the main recurrent-state gathers; the residual ~2% is an unquantified
mix of the conv-tap (already deferred as "tiny") and leftovers - under-scoped, not a confirmed win.
## 4. Tree state / gates
- Dev tree clean at HEAD f7409c2 (git diff empty; mmq.cuh/mmq.cu/quantize.cu no uncommitted diff -
no P2a remap to revert). build-cuda = the clean 0023 build (371.81 t/s dense @npl128, per reprofile).
- No code change made -> no md5 gate needed (baseline 27b = 5951a5b4, 35b = 07db32c2 unchanged).
- No GPU build/bench launched (no buildable candidate clears the ROI bar; re-confirming the baseline
the reprofile already measured would waste the GPU window).
## 5. FINAL BIT-EXACT CEILING
Dense q36-27b-nvfp4: 371.81 t/s @npl128 = 95.1% of vLLM 391. MoE q36-35b-a3b: 758.1 @npl128 (0023).
This is the bit-exact f32 decode plateau and there is no single decisive bit-exact lever left:
- gated_delta_net recurrence (~50%) is at 84.6% peak LPDDR5x BW, PAST vLLM (82.4%) - DRAM floor.
- mul_mat_q NVFP4 GEMM (~27%), flash_attn (~3.4%), lm_head nvjet (~3.5%) have NO bit-exact lever
(any knob changes a K-/softmax-reduction order vs the f32 reference).
- The remaining ~5% of small foldable buckets is real GPU-time on the critical path, but the largest
piece (the fp4 fold, ~1.5-2.5%) is EMPIRICALLY FLAT, the next (dense qkvz quant de-dup, ~1.4%) has
no clean inter-node construction and shares the fold's flat-risk, and the contained remainder is
each <=0.5% (FFN de-dup) or entangled in the binding kernel (pointwise) - none clears the
plumbing/risk bar for a 1:1 single-stream gain that the bench noise floor (~0.3-0.5%) can swallow.
FRAME: vLLM 391 is a LOWER-precision (w4a4) reference; bit-exact-vs-vLLM is impossible. llama at 371.81
bit-exact f32 is doing strictly MORE precise arithmetic at ~95% of vLLM's throughput. The only thing
that goes materially further is bf16 state (precision change, KL-gated, out of scope, shelved).
RECOMMENDATION: ship the 0023 plateau as the bit-exact decode result. Do not build the fp4 fold (flat).
If a future agent insists on the dense qkvz de-dup, it must first build the inter-node graph-CSE
scratch-pool/CUDA-graph-lifetime plumbing and prove on a same-build flag toggle that decode_agg lifts
above the +-0.5% noise AND batched md5 == 5951a5b4 - and should expect the Lever-2 flat outcome.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,138 +0,0 @@
# GB10 same-day head-to-head server sweep: llama-server (paged) vs vLLM
Date: 2026-06-23. Hardware: GB10 / DGX Spark (sm_121, 128 GB LPDDR5x unified, ~273 GB/s
weight-read floor). GPU otherwise idle (sibling vLLM had exited; LocalAI docker workers
stopped for the run).
This sweep **replaces** the stale, carried-over "~75-80% of vLLM" figure (commit 07985ba4,
pre-co-batching, single-point). It measures *real serving* steady-state aggregate decode
throughput across the full concurrency curve, for three model classes, with one identical
client driving both engines.
## Method
- **llama**: `llama-server` from the paged dev tree (`~/llama-paged-dev/build-cuda`, HEAD =
patch 0013 / commit 17d97cb), `LLAMA_KV_PAGED=1`, `-fa on -ngl 999 --parallel 128 -c 65536`.
- **vLLM**: 0.23.0, `vllm serve --enforce-eager --enable-prefix-caching --max-num-seqs >=128
--max-model-len 4096` (APC on, eager per the GB10 no-CUDA-graphs edge).
- **Client** (`sweep_client2.py`): N concurrent **non-streaming** `/v1/completions`, short
shared prompt, `max_tokens=min_tokens=256`, `ignore_eos=true`. Aggregate decode tok/s =
total generated tokens / wall. Non-streaming keeps the Python client off the critical path
(one JSON parse per request, not per token), so the **server** is the bottleneck. Validated:
vLLM pushed 4227 tok/s through the exact same client where llama topped out at 2087, so the
client is not the cap. Both engines use the identical client + prompt -> apples-to-apples.
- npl (concurrency) sweep: 8 / 32 / 64 / 128.
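
The client method can be sketched as follows. This is not `sweep_client2.py` itself: `build_payload`, `run_level` and `fake_post` are hypothetical names, the HTTP call is abstracted behind a `post` callable so the sketch runs without a server, and reading `usage.completion_tokens` from the reply is an assumption about the OpenAI-compatible response shape.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def build_payload(prompt: str, n_tokens: int = 256) -> dict:
    """Non-streaming completion request with a fixed output length."""
    return {
        "prompt": prompt,
        "max_tokens": n_tokens,
        "min_tokens": n_tokens,   # force exactly n_tokens of decode
        "ignore_eos": True,
        "stream": False,          # one JSON parse per request, not per token
    }

def run_level(post, npl: int, prompt: str, n_tokens: int = 256) -> float:
    """Fire npl concurrent requests; return aggregate decode tok/s."""
    payload = build_payload(prompt, n_tokens)
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=npl) as pool:
        replies = list(pool.map(lambda _: post(payload), range(npl)))
    wall = time.perf_counter() - t0
    total_tokens = sum(r["usage"]["completion_tokens"] for r in replies)
    return total_tokens / wall

def fake_post(payload: dict) -> dict:
    """Stand-in for an HTTP POST to /v1/completions (ASSUMED response shape)."""
    time.sleep(0.01)
    return {"usage": {"completion_tokens": payload["max_tokens"]}}

for npl in (8, 32, 64, 128):
    print(npl, round(run_level(fake_post, npl, "Hello")), "tok/s (fake server)")
```

Swapping `fake_post` for a real HTTP call against either engine keeps the measurement identical on both sides, which is the property the method relies on.
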
Quant parity:
- Dense: llama **NVFP4-dense GGUF** (weight-only FP4, 16-bit compute) vs vLLM **NVFP4A16**
(weight FP4, 16-bit activation) -> matched precision class.
- Small: llama **Q8_0** vs vLLM **bf16** (closest loadable form).
- MoE: llama **mxfp4** GGUF. **vLLM could not serve this MoE on GB10 at all** (see below), so
there is no vLLM MoE column.
## Results: aggregate decode tok/s (higher is better)
### Dense 32B (llama NVFP4-dense vs vLLM NVFP4A16)
| npl | llama (NVFP4) | vLLM (NVFP4A16) | llama % of vLLM |
|----:|--------------:|----------------:|----------------:|
| 8 | 83.2 | 85.9 | **96.9%** |
| 32 | 228.9 | 301.3 | 76.0% |
| 64 | 367.1 | 507.8 | 72.3% |
| 128 | 520.6 | 604.0 | 86.2% |
Plateau: neither has plateaued at 128 (both still climbing, weight-read bound). llama is at
**parity at batch-8** (97%), dips to ~72% mid-curve (npl 32-64), recovers to 86% at 128.
### Small Qwen3-0.6B (llama Q8_0 vs vLLM bf16)
| npl | llama (Q8_0) | vLLM (bf16) | llama % of vLLM |
|----:|-------------:|------------:|----------------:|
| 8 | 911.3 | 923.0 | **98.7%** |
| 32 | 1701.6 | 2531.4 | 67.2% |
| 64 | 1911.7 | 3497.1 | 54.7% |
| 128 | 2087.6 | 4227.6 | 49.4% |
Plateau: **llama plateaus hard** at ~2.0-2.1k by npl 64-128 (+9% from 64->128). vLLM keeps
scaling (3497 -> 4227). For a tiny runtime-bound model, vLLM's scheduler/batching amortizes
better; llama-server's per-token host cost (sampling, detok, slot mgmt) caps it. This is the
worst llama-vs-vLLM ratio in the sweep (down to 49%).
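The ratio columns in the two tables above, and the llama plateau figure, can be recomputed from the raw tok/s values. A short Python check (values copied from the tables; names are illustrative):

```python
# npl -> (llama tok/s, vLLM tok/s)
dense = {8: (83.2, 85.9), 32: (228.9, 301.3), 64: (367.1, 507.8), 128: (520.6, 604.0)}
small = {8: (911.3, 923.0), 32: (1701.6, 2531.4), 64: (1911.7, 3497.1), 128: (2087.6, 4227.6)}

def ratios(table):
    """llama throughput as a percentage of vLLM, per concurrency level."""
    return {npl: round(100 * llama / vllm, 1) for npl, (llama, vllm) in table.items()}

dense_pct = ratios(dense)
small_pct = ratios(small)
llama_small_gain = 100 * (small[128][0] / small[64][0] - 1)  # npl 64 -> 128

print("dense:", dense_pct)
print("small:", small_pct)
print(f"small llama gain 64->128: +{llama_small_gain:.0f}%")
```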
### MoE Qwen3-Coder-30B-A3B (llama mxfp4; vLLM = NOT SERVABLE on GB10)
| npl | llama (mxfp4) | vLLM |
|----:|--------------:|-----:|
| 8 | 290.0 | n/a |
| 32 | 582.5 | n/a |
| 64 | 931.8 | n/a |
| 128 | 1041.3 | n/a |
llama-server scales cleanly to **1041 tok/s** at npl 128 with **no npl-128 expert-activation
cliff** (unlike the prior `llama-batched-bench` MoE numbers 253/505/830/620 that peaked at 64;
short-prompt continuous batching in the server avoids it).
**vLLM could not serve this MoE on GB10 (two independent failures):**
1. **bf16** (`Qwen/Qwen3-Coder-30B-A3B-Instruct`, the only HF form on the box): loads the
56.9 GB of weights, then **hangs at the MoE warmup** (`Using MoEPrepareAndFinalize
NoDPEPModular` -> `Model loading took ...`), GPU 0% util, and **takes the whole box down
(hard reboot)**. Reproduced twice. With tight `--gpu-memory-utilization` it still hangs at
the same step before the API server ever comes up.
2. **mxfp4 GGUF** (same weights llama uses): vLLM 0.23.0's GGUF loader **cannot map the fused
qwen3moe expert tensors** (`RuntimeError: Failed to map GGUF parameters (48):
['model.layers.N.mlp.experts.gate_up_proj', ...]`). Engine init fails outright.
So on GB10, llama.cpp is the *only* engine of the two that serves this 30B-A3B MoE at all -
an availability win, independent of throughput.
## Batch-8 anomaly triage (dense NVFP4) -- RESOLVED
The prior mixed-load run reported llama batch-8 steady decode at **471 ms/step (~19% of vLLM
aggregate, ~17 tok/s)**. This sweep does **not** reproduce it. Clean isolated batch-8 decode:
- `llama-server` batch-8 dense paged = **83.2 tok/s** aggregate = ~96 ms/step = **96.9% of
vLLM's 85.9** (parity, both at the LPDDR5x weight-read floor).
`llama-batched-bench` cross-check, dense NVFP4, `-npp 16 -ntg 128 -npl 1,8`, the three
hypotheses isolated (S_TG = decode tok/s aggregate at batch 8):
| config | batch-8 S_TG t/s | ms/decode-step |
|-----------------------|-----------------:|---------------:|
| paged, ctx 65536 | 90.32 | 88.6 |
| stock, ctx 65536 | 88.39 | 90.5 |
| paged, ctx 163840 | 89.33 | 89.6 |
| stock, ctx 163840 | 87.72 | 91.2 |
Conclusion: clean batch-8 dense decode is **~88-90 tok/s (~89 ms/step) regardless of all three
suspects**:
- **Paged overhead?** No -- paged is within 2% of stock, and at ctx 65k paged is *faster*
(90.3 vs 88.4). The decode path is not paying a paged penalty at batch-8.
- **The 163840-token ctx allocation?** No -- ctx 163840 == ctx 65536 within 1% (89.3 vs 90.3).
The large allocation does not slow steady-state decode.
- **NVFP4 decode cost?** This *is* the cost -- ~89 ms/step is the GB10 weight-read floor for a
32B at batch-8 (it matches vLLM's 86 tok/s server and exceeds it at the kernel level: 90 vs
86). It is the hardware ceiling, not a bug.
The 471 ms/step is ~5.3x slower than this clean floor and is explained by none of the three.
It was a **mixed-load artifact**: the 8 decoders were time-sharing the GPU with a concurrent
prefill (a large prompt / chunked prefill landing on the same steps). That decode-vs-prefill
contention is exactly the stall **patch 0013 (`LLAMA_PREFILL_BUDGET`)** bounds. In steady-state
isolated decode, batch-8 dense is at **parity with vLLM (97%)**, not 19%.
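The tok/s and ms/step figures in this triage are two views of the same number. A small sketch of the conversion, assuming one decode step emits exactly one token per active sequence (the helper name is illustrative):

```python
def ms_per_step(agg_tok_s: float, batch: int) -> float:
    """A decode step emits one token per sequence, so step time = batch / aggregate rate."""
    return 1000.0 * batch / agg_tok_s

BATCH = 8

# Aggregate decode tok/s copied from the text and table above.
measured = {
    "llama-server paged": 83.2,
    "batched-bench paged ctx 65536": 90.32,
    "batched-bench stock ctx 65536": 88.39,
}
step_ms = {name: ms_per_step(rate, BATCH) for name, rate in measured.items()}

# The anomalous mixed-load figure, converted the other way.
ANOMALY_MS = 471.0
anomaly_tok_s = 1000.0 * BATCH / ANOMALY_MS
slowdown = ANOMALY_MS / 89.0  # versus the ~89 ms/step clean floor

for name, ms in step_ms.items():
    print(f"{name}: {ms:.1f} ms/step")
print(f"471 ms/step = {anomaly_tok_s:.1f} tok/s, {slowdown:.1f}x the clean floor")
```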
## Aggregate map (replaces the carried 75-80%)
llama-server (paged) as a fraction of vLLM, by regime:
- **Low concurrency (batch-8): parity, 97-99%** on both measurable classes. Both engines sit on
the LPDDR5x weight-read floor; there is nothing to win.
- **Dense 32B, mid-to-high concurrency: 72-86%.** Dips to ~72% at npl 32-64, recovers to 86% at
128. Both still climbing (weight-bound), neither plateaus by 128.
- **Small 0.6B, mid-to-high concurrency: 49-67%.** llama plateaus ~2.0k; vLLM scales to 4.2k.
Runtime/scheduler-bound regime -- vLLM's batching wins; this is llama's weakest ratio.
- **MoE 30B-A3B: llama-only.** vLLM cannot serve it on GB10 (bf16 reboots the box at MoE
warmup; GGUF expert tensors unmappable). llama serves it at 290 -> 1041 tok/s, scaling
cleanly with no npl-128 cliff.
Net: the single "75-80%" number is replaced by a curve. It is *roughly* right only for the
dense mid-band; it is too optimistic for the small model at high concurrency (49%) and moot for
MoE (where llama is the only option). The headline is parity at low concurrency and a hardware
(not engine) ceiling on dense decode.

@@ -1,567 +0,0 @@
# SPEEDUP_HUNT.md - the post-0023 vLLM decode close/beat hunt
Accumulator for the four-lever speedup hunt on the clean pin-synced base (llama.cpp
9d5d882d, bit-exact md5 == 0023 baseline). Levers (current-brief labels):
A = hybrid per-head SSM precision, B = MoE grouped-GEMM, C = structural dense residual
(lm_head + graph/launch), D = f16 glue.
---
## D - f16 GLUE: confirm lower-priority (label: D-f16-confirm, READ-ONLY no GPU)
Re-read `F16_DENSE_RESIDUAL_PROBE.md` (the lever-D doc) plus `BF16_SSM_STATE_RESULTS.md`
(lever A's parent work) and `OTHER_PATHS_INVESTIGATION.md` (the B/lm_head + graph
analysis). Verdict: **D is correctly deprioritized. Dominated by both A and B. Build
later behind an opt-in flag only if the last ~4% dense is ever chased; do NOT build now.**
### The numbers that pin D below A and B
- D's reachable mass is TINY. The dense decode gap to vLLM is ~27 ms/step (llama 332.8 ms
vs vLLM 305.7 ms @npl128). 83.2% of the step (recurrence 49.3% + FP4 GEMM 27.4% + FP4
act-quant/fixup 6.4%) is ALREADY precision-matched f32/W4A4 on both engines - f16 cannot
touch it. The f16-able glue is only **8.4% of the step** (Budget A = 28.74 ms: norms +
elementwise + activations + flash_attn + rope + copies).
- f16 does not zero the glue, it halves the bytes of the memory-bound part. Realistic
recovery from the probe: ~11 ms (glue only) to ~16 ms (+ the uncertain nvjet GEMM) =
**40-60% of the 27 ms residual**. That moves dense parity 91.8% -> ~95-96%, NOT a close.
- The single largest f16-able line (flash_attn 11.9 ms) is the LEAST recoverable (KV is
ALREADY f16, the KQ/softmax accumulate stays forced f32 = vLLM does the same). The cleanly
recoverable band is just the norms+elementwise+activations (~16.7 ms -> ~8.4 ms saved).
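The parity projection above follows directly from the two step times. A sketch of the arithmetic, assuming throughput is inversely proportional to step time at fixed npl (names are illustrative):

```python
LLAMA_STEP_MS = 332.8  # llama dense step @npl128, from the text
VLLM_STEP_MS = 305.7   # vLLM dense step @npl128, from the text

def parity_pct(ms_saved: float) -> float:
    """llama throughput as % of vLLM after shaving ms_saved off the llama step."""
    return 100.0 * VLLM_STEP_MS / (LLAMA_STEP_MS - ms_saved)

gap_ms = LLAMA_STEP_MS - VLLM_STEP_MS

for saved in (0.0, 11.0, 16.0):
    share = 100.0 * saved / gap_ms
    print(f"save {saved:4.1f} ms -> {parity_pct(saved):.1f}% of vLLM "
          f"({share:.0f}% of the {gap_ms:.1f} ms gap)")
```

Even the optimistic 16 ms case leaves llama below vLLM, which is why the text calls this "NOT a close".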
### Dominated by A (parity-and-beyond) and B (the bigger gap) - confirmed
- **A dominates on the same dense axis.** A targets the recurrence, which is 49.3% of the
dense step - i.e. ~6x the mass D can touch. The bf16-SSM measurement already proved the
recurrence kernel halves (-49%/call) and clean dense bf16 hit ~490 t/s = **125% of vLLM**
(`BF16_SSM_STATE_RESULTS.md` sec 2). A's hybrid per-head variant keeps the long-memory
heads f32 to pass the KL gate that plain bf16 failed (drift FAIL ~10% argmax flips @>=1024
ctx) while banking most of that +25-31%. So A is the parity-AND-BEYOND lever on dense;
D's ceiling is ~96% parity. A wins outright.
- **B is the bigger gap.** MoE sits at ~82% (726 vs 882) vs dense ~92%; the MoE-specific
kernel (mul_mat_q<NVFP4,M-tile=64> grouped GEMM, 26.9% of MoE decode = ~43.5 ms/step) and
the W4A4 act-quant tax are real MoE deltas. D is a DENSE-only lever (the MoE step is
recurrence + FP4-GEMM + bf16-projection dominated; the f16 glue band is even smaller
there) - it does nothing for the larger MoE gap. B addresses where the bench is worst.
- **C overlaps and out-prioritizes D's residual.** The probe's own conclusion: the
remaining ~3-4% after f16 is structural (non-FP4 cublas/nvjet GEMM efficiency +
graph/launch scheduling), and those help the BIT-EXACT default too, unlike D which is
opt-in non-bit-exact. C's graph/launch work is the better long-term dense target.
### Is there a cheap subset of D worth folding into a later build?
**No cheap subset that pays.** The probe maps D to three escalating options:
- A flag: does not exist and cannot exist - the F32 stream is STRUCTURAL
(`ggml_mul_mat` hardcodes an F32 result, so the residual stream snaps back to F32 after
every projection; rms_norm/l2_norm/silu/add/mul/flash_attn/ssm_conv all emit F32).
- **Option 1 (the "cheap" one: per-op f16 on ops that already have f16 paths - silu/sigmoid/
softplus/add/mul/rope): NET NEAR-ZERO OR NEGATIVE.** Because the residual stream stays F32,
each op must be wrapped cast(F16)->op->cast(F32) = 2 extra `cpy` ops. At decode these ops
are tiny and memory-bound, so the cast traffic ~= the op traffic and the win is eaten unless
the cast is FUSED into producer/consumer. Crucially Option 1 CANNOT reach the norms - the
largest glue item. So the only "cheap" subset is the one that does not actually help.
- Option 2 (the real lever): carry the residual stream in F16 across the layer, which needs
NEW F16 template instantiations in norm.cu (rms_norm / l2_norm / fused rms+mul / rms+mul+add,
today hard-`GGML_ASSERT(type==F32)`) keeping the f32 reduction, an f16 projection-output
path, plus graph-dtype plumbing in qwen35.cpp/llama-graph.cpp. Multi-file, recovers ~11 ms,
and is **non-bit-exact** (same gate-failing category as the shelved bf16-SSM state). Not cheap.
There is no fold-in-for-free subset: the only no-new-kernel piece (Option 1) is net-zero, and
the only piece that captures real mass (Option 2 norm.cu f16 kernels) is a multi-file build.
### THE D PRIORITY CALL
D is correctly deprioritized, below A, B, and C:
- **Reachable mass:** D 8.4% of the dense step vs A's 49.3% recurrence; D is dense-only and
does nothing for the bigger MoE (B) gap.
- **Ceiling:** D tops out ~95-96% dense parity; A is already parity-AND-BEYOND (125% clean,
hybrid keeps most of it inside the KL gate).
- **Bit-exactness:** D is opt-in NON-bit-exact (same bucket as shelved bf16-SSM and the
NVFP4-head); it cannot improve the shipped f32 bit-exact default, whereas C's structural
graph/launch work does help the default.
### RECOMMENDATION: build LATER (opt-in only), not now; no cheap subset to fold in
Do NOT build the f16 glue path now. Ship the bit-exact f32 plateau (~95% of vLLM; patches 0018-0023)
as the default. If the last ~4% dense is ever chased, the ONLY worthwhile piece is Option 2's
norm.cu f16 kernels + f16 residual stream (recovers the norm/elementwise band, ~11 ms); gate
it behind an explicit opt-in flag and validate it against the SAME KL threshold that failed
plain bf16-SSM before shipping. Skip Option 1 entirely (cast overhead eats the win). Prefer
the structural ~3-4% (non-FP4 cublas GEMM efficiency + graph/launch scheduling, lever C) over
D, because that helps the bit-exact default too. D stays the lowest-priority of the four levers.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## A - HYBRID PER-HEAD f32/bf16 SSM STATE (label: A-hybrid-design, READ-ONLY no GPU)
Goal: capture most of the whole-bf16 SSM-state win (recurrence -49%/call; dense ~490 t/s = 125% of
vLLM; MoE +25%) WITHOUT the KL failure (whole-bf16 MeanKLD 0.05-0.17, Same-top-p ~90%, ~10% argmax
flips @>=1024 ctx). Keep f32 on the long-memory heads (where bf16 rounding does NOT contract and the
KL error concentrates); bf16 only the fast-decaying heads. Stays at-or-above vLLM precision (vLLM
keeps ALL temporal state f32) while landing ABOVE vLLM throughput.
### Why the error concentrates in long-memory heads (the physics)
qwen35/qwen35moe take the NON-KDA path: per (head h, token t) the decay is ONE scalar
(gated_delta_net.cu `g_val = expf(g[h,t])`, `S <- g_val*S + k(x)delta`). The gate (qwen35.cpp):
`g[h,t] = ssm_a[h] * softplus(alpha[h,t] + ssm_dt[h])`, with `ssm_a[h] = -exp(A_log[h]) <= 0` =>
decay = exp(g) in (0,1]. Two STATIC per-head weights set the timescale: ssm_a[h] (tensor
SSM_A_NOSCAN, [n_v_heads]) = decay-rate SCALE (|ssm_a| small => structurally long-memory); ssm_dt[h]
(SSM_DT "bias", [n_v_heads]) = softplus operating point. bf16 carry-error per step is contracting,
bounded ~`eps*tau_h`, eps~2^-8~3.9e-3, head memory length `tau_h ~ 1/(|ssm_a[h]|*softplus(ssm_dt[h]))`
tokens. Error scales LINEARLY with tau_h => long-memory heads blow up the KL (matches the measured
plateau-but-large failure). Keep those f32.
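The linear-in-tau claim can be illustrated with a first-order model. This is a sketch under stated assumptions, not the kernel's behaviour: the per-token `alpha` is taken as 0 so the decay is constant, rounding injects a fixed `eps` per step, and the head weights below are hypothetical.

```python
import math

EPS_BF16 = 2.0 ** -8  # rounding scale quoted in the text (~3.9e-3)

def softplus(x: float) -> float:
    return math.log1p(math.exp(x))

def tau_tokens(ssm_a: float, ssm_dt: float) -> float:
    """Memory length in tokens, with the per-token alpha taken as 0."""
    return 1.0 / (abs(ssm_a) * softplus(ssm_dt))

def error_gain(tau: float) -> float:
    """Steady-state gain of e <- g*e + eps for constant decay g = exp(-1/tau).

    The geometric series sums to 1/(1-g), which approaches tau for tau >> 1.
    """
    g = math.exp(-1.0 / tau)
    return 1.0 / (1.0 - g)

# Hypothetical (ssm_a, ssm_dt) pairs, not read from any checkpoint.
heads = {"fast": (-1.0, 0.0), "long": (-0.01, 0.0)}
for name, (a, dt) in heads.items():
    tau = tau_tokens(a, dt)
    print(f"{name}: tau={tau:.1f} tokens, accumulated error ~{EPS_BF16 * error_gain(tau):.3f}")
```

Under this model a head with 100x smaller `|ssm_a|` carries roughly 100x the accumulated rounding error, which is the reason the design keeps those heads in f32.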
### Classification: per-head STATIC, at model load (NOT per-token)
g is per-token but the long-vs-fast PROPERTY is per-head static (dominated by ssm_a/ssm_dt). A cache
row's dtype must be stable across the sequence => a per-token threshold is impossible; classify ONCE
at load into a per-(layer,head) dtype mask.
- TIER 1 (default, zero-cost, deterministic): pure-weights. `tau_h = 1/(|ssm_a[il][h]|*
softplus(ssm_dt[il][h]))`; keep f32 if tau_h > T_thresh, else bf16. T_thresh is THE knob (start
32-64; sweep on GateBench). eps*tau_h => a single T_thresh sets a uniform per-head error ceiling.
- TIER 2 (optional): short calibration pass measures per-head time-mean of actual exp(g[h,t]); write
mask to a model-hash sidecar (paid once). Use only if Tier 1 lands just above the gate.
cparam `ssm_hybrid_tau_thresh` / `--ssm-bf16-tau`: 0 => all-f32 (today's bit-exact default, since every
head has tau_h > 0); inf => all-bf16 (the shelved mode); the hybrid band is in between.
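A minimal sketch of the Tier 1 rule ("keep f32 if tau_h > T_thresh, else bf16"), evaluated once per layer. The function name and the sample weights are hypothetical; a real classifier would read `ssm_a` / `ssm_dt` from the loaded model tensors.

```python
import math

def softplus(x: float) -> float:
    return math.log1p(math.exp(x))

def classify_heads(ssm_a, ssm_dt, tau_thresh):
    """Tier-1 rule from the text: keep f32 if tau_h > tau_thresh, else bf16.

    Returns a list of booleans, True meaning the head's state is stored as bf16.
    """
    is_bf16 = []
    for a, dt in zip(ssm_a, ssm_dt):
        tau = 1.0 / (abs(a) * softplus(dt))  # assumes ssm_a != 0
        is_bf16.append(not tau > tau_thresh)
    return is_bf16

# Hypothetical weights for one layer with 4 value heads.
ssm_a = [-2.0, -0.5, -0.05, -0.005]
ssm_dt = [0.0, 0.0, 0.0, 0.0]

mask = classify_heads(ssm_a, ssm_dt, tau_thresh=32.0)
n_bf16 = sum(mask)
print(mask, f"{n_bf16}/{len(mask)} heads bf16")
```

Because the inputs are static weights, the mask is deterministic and can be computed at load time with no calibration data.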
### Mixed-dtype cache layout: two homogeneous partitions per slot (packed)
Split persisted s_l ([S_v,S_v,H,slots] f32, n_embd_s=S_v*S_v*H) into TWO dtype-homogeneous sub-caches
sized by head COUNT (this is what saves bytes): `s_l_f32 [S_v*S_v*n_f32, slots]` f32 +
`s_l_bf16 [S_v*S_v*n_bf16, slots]` bf16. Static map `head_slot[h]={is_bf16, local_idx}`. q/k/v/g/beta
KEEP natural head order (no activation permute). Block h_idx -> head_slot -> base + local_idx*S_v*S_v.
Recurrence R+W bytes scale by `f_bytes = (n_f32 + n_bf16/2)/H = 1 - 0.5*(n_bf16/H)`. In-place/ids
identity stays race-free (each head writes its own partition; read==write slot, registers before
store). (Cheaper coarse fallback = per-LAYER dtype, near-zero layout code, but long-memory heads span
most layers => too coarse; per-head is the right granularity.)
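The packed two-partition layout can be sketched as a host-side index map. This is an illustration only: the helper names, the 4-head mask, and the `S_V` value are assumptions, and offsets are in elements within one slot's partition row.

```python
def build_head_slots(is_bf16):
    """Map natural head index -> (is_bf16, local index inside its partition)."""
    slots, n_f32, n_bf16 = [], 0, 0
    for flag in is_bf16:
        if flag:
            slots.append((True, n_bf16))
            n_bf16 += 1
        else:
            slots.append((False, n_f32))
            n_f32 += 1
    return slots, n_f32, n_bf16

def state_offset(slot, s_v):
    """Element offset of a head's S_v x S_v state inside its own partition."""
    _, local_idx = slot
    return local_idx * s_v * s_v

def f_bytes(n_f32, n_bf16):
    """Recurrence read+write bytes relative to an all-f32 cache."""
    return (n_f32 + n_bf16 / 2) / (n_f32 + n_bf16)

S_V = 128                          # illustrative head dimension
mask = [True, False, True, True]   # hypothetical per-head classification
slots, n_f32, n_bf16 = build_head_slots(mask)

print(slots)
print("f_bytes =", f_bytes(n_f32, n_bf16))
```

Heads keep their natural order in the activations; only the state lookup goes through `head_slot`, so no activation permute is needed.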
### Kernel: single launch, runtime per-head branch (on top of BF16_SSM_STATE.diff)
Reuse the existing bf16 plumbing (gdn_state_t alias, __bfloat162float load / __float2bfloat16 store,
gather template, dtype-detect dispatcher). Hybrid change: pass BOTH bases (`const float* s_f32_base`,
`const nv_bfloat16* s_bf16_base`, + the two state_dst views) + device `head_slot[]`; branch load/store
on `head_slot[h_idx].is_bf16` (UNIFORM per block => no warp divergence). Recurrence math byte-for-byte
untouched (f32 registers). keep_rs_t snapshots stay f32 (op-output scratch). gdn_gather_nonident
becomes per-head dtype-aware (still disjoint-scratch race-free). ONE op call + ONE launch.
### KL-gate plan + estimated pass / f32 fraction / speedup
KLD contribution ~ (eps*tau_h)^2 => dominated by the top-tau heads; removing the top ~25-40% by tau
cuts MeanKLD 1-2 orders. Honest estimate: ~30-40% f32 PASSES Same-top-p>=99.5% and brings MeanKLD to
1e-3..1e-2; strict <1e-3 may need ~40-50% f32. Find the exact fraction by sweeping T_thresh on the
EXISTING GateBench harness (noise floor -> 256-tok gate -> drift sweep 256/1024/2048/4096, both
models). Hybrid is STRICTLY safer than vLLM (vLLM = all-f32 temporal; we f32 exactly the unsafe
heads). Long-memory heads are the minority (~20-40%) => design band f in [0.30, 0.50].
Speedup (dense, bandwidth-bound recurrence, graphs-off): f32 3.38 ms/call, whole-bf16 1.73 (-49%);
hybrid ~ f_bytes*3.38 => f=0.30 -> 2.20 ms (-35%, ~70% of bf16 win); f=0.50 -> 2.54 ms (-25%, ~50%).
Throughput (dense f32 ~371-384=95% vLLM; whole-bf16 ~490=125%; vLLM ref 419): f=0.30 -> ~454 t/s
(~108% vLLM, gate-likely); f=0.50 -> ~430 t/s (~103% vLLM, most robust). MoE: smaller absolute
recurrence (31 GDN layers, H_v=32) + MUL_MAT_ID-bound step (lever B) => hybrid keeps the +13-25%
recurrence share KL-passing but does not alone close the MoE GEMM gap. Joint gate: nsys per-call bytes
down AND KL<1e-3 both models.
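The per-call estimates follow from the bandwidth-bound assumption that kernel time scales with bytes moved. A sketch of that arithmetic, using the two measured per-call times from the text (function names are illustrative):

```python
F32_MS_PER_CALL = 3.38   # measured f32 recurrence, from the text
BF16_MS_PER_CALL = 1.73  # measured whole-bf16 recurrence, from the text

def hybrid_ms_per_call(f32_fraction: float) -> float:
    """Bandwidth-bound estimate: f32 heads cost full bytes, bf16 heads half."""
    f_bytes = f32_fraction + (1.0 - f32_fraction) / 2.0
    return f_bytes * F32_MS_PER_CALL

def share_of_bf16_win(f32_fraction: float) -> float:
    """Fraction of the whole-bf16 saving that the hybrid retains."""
    saved = F32_MS_PER_CALL - hybrid_ms_per_call(f32_fraction)
    return saved / (F32_MS_PER_CALL - BF16_MS_PER_CALL)

for f in (0.30, 0.50):
    ms = hybrid_ms_per_call(f)
    cut = 100.0 * (1.0 - ms / F32_MS_PER_CALL)
    print(f"f={f:.2f}: {ms:.2f} ms/call (-{cut:.0f}%), "
          f"{100.0 * share_of_bf16_win(f):.0f}% of the bf16 win")
```

This only models the recurrence kernel; the end-to-end t/s figures in the text depend on the rest of the step and are not derived here.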
### Scope on top of BF16_SSM_STATE.diff
Reuse verbatim: gdn_state_t alias, templated load/store, gather template, dispatcher dtype-detect,
type_s/type_r cparams, CPU mirror, back-compat row convert, bf16 fill, test-backend-ops bf16 cases.
NEW: (1) classifier ~80-150 LOC (host fn over ssm_a/ssm_dt -> head_is_bf16[layer][head] + counts +
T_thresh cparam/CLI; optional Tier-2 calib+sidecar). (2) split cache layout ~150-250 LOC (BIGGEST:
llama-memory-recurrent.cpp alloc s_l_f32+s_l_bf16 by per-layer counts; build_rs builds two views +
passes head_slot; n_embd_s split). (3) kernel ~120-200 LOC (two bases + device map, runtime per-head
branch at load/in-place-store/gather/dispatch; math untouched; STATE_BF16 template stays as the
all-bf16 case). (4) ids/in-place per-head (state_dst two partition views; per-head gather; identity
unchanged). (5) CPU mirror per-head branch. (6) test-backend-ops MIXED-dtype-state case (decode +
multi-token prefill + keep_rs_t = the R2 corruption net). (7) gate: sweep T_thresh for min-f32 passing
KL<1e-3 + Same-top-p>=99.5% + drift both models; nsys per-call confirms f_bytes; md5 that T_thresh=0
(all heads f32 under the tau_h > T_thresh rule) reproduces the f32 baseline (bit-exact opt-out preserved).
Net: principled path ABOVE vLLM throughput (dense ~430-454 vs vLLM 419) at-or-above vLLM precision,
KL-gated. Biggest new item = the split-tensor cache layout; classifier + kernel bounded; gate is a
threshold sweep on the existing harness.
---
## B - MoE GROUPED-GEMM + RE-GRAPH (label: B-moe-profile-design, THE GPU AGENT)
GPU-measured on DGX GB10 (sm_121), dev tree `~/llama-paged-dev` HEAD `2ee65c2` (patch 0024; the
decode kernels are byte-identical to 0023/f7409c2 - 0024 is the serving-only burst-reclaim).
`build-cuda`, model `q36-35b-a3b-nvfp4`, `llama-batched-bench -fa on -npp 128 -ntg 128`,
`LLAMA_KV_PAGED=1`. `decode_agg = S_TG t/s`. Batched-bench is the clean-kernel measure (no server
scheduler overhead), so its npl128 = ~743 t/s sits ABOVE the server final_benchmark 726 t/s; the
re-graph % gain below transfers to both paths (same kernels, same graph-disable).
### 1. MoE decode decomposition @npl128 - RE-CONFIRMED on the current HEAD
Fresh nsys `--cuda-graph-trace=node`, decode-isolated steady window, % of summed kernel GPU-time
(reproduces the 0023 profile in `OTHER_PATHS_INVESTIGATION.md` A.2/D within noise; window is
95.4% kernels-only busy / 96.8% with memcpy = GPU-compute-bound):
```
42.3% gated_delta_net_cuda REC (shared w/ dense; ALREADY tuned past vLLM, 0018-0022: 84.6% vs 82.4% peak BW)
~29.5% mul_mat_q<NVFP4> MoE FP4 GEMM = grouped M-tile=64 (~27%, biggest MoE-specific) + router M-tile=128 (~2.3%)
~10.5% nvjet_sm121 + cutlass (bf16) attn/gdn bf16 projections + the BF16 lm_head (path B)
3.1% k_get_rows_float REC state gather
2.7% k_bin_bcast expert-combine + routing-weight scale + glue
2.1% ssm_conv_update_f32 REC
2.0% quantize_mmq_nvfp4 W4A4 activation-quant tax (3.25 ms/step; vLLM-W4A16 avoids it)
1.8% convert_unary bf16<->f32 glue around the bf16 projections
1.4% MEMCPY-DtoD (SSM state copy fused away by 0018-0019; now small)
0.5% mul_mat_q_stream_k_fixup | 0.32% mm_ids_helper | 0.19% argsort | 0.14% gather_mmq_fp4 (0023 dedup) | 0.3% flash_attn
```
Bucketed: **Recurrence/SSM ~48% (shared, tuned past vLLM, NOT a MoE lever)**; **MoE FP4 GEMM+routing
~33%**; **bf16 projections ~10.5%**; act-quant tax ~2%; attention ~0.3%.
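The bucket totals can be reproduced by summing the profile lines. The percentages are copied from the listing above; which kernel belongs to which bucket is an assumption made here to match the stated totals, not something the profile itself records.

```python
# % of summed kernel GPU-time, copied from the nsys listing above.
profile_pct = {
    "gated_delta_net_cuda": 42.3,
    "mul_mat_q<NVFP4>": 29.5,
    "nvjet_sm121 + cutlass (bf16)": 10.5,
    "k_get_rows_float": 3.1,
    "k_bin_bcast": 2.7,
    "ssm_conv_update_f32": 2.1,
    "quantize_mmq_nvfp4": 2.0,
    "convert_unary": 1.8,
    "MEMCPY-DtoD": 1.4,
    "mul_mat_q_stream_k_fixup": 0.5,
    "mm_ids_helper": 0.32,
    "argsort": 0.19,
    "gather_mmq_fp4": 0.14,
    "flash_attn": 0.3,
}

# Assumed bucket membership (chosen to match the totals quoted in the text).
buckets = {
    "recurrence/SSM": ["gated_delta_net_cuda", "k_get_rows_float", "ssm_conv_update_f32"],
    "MoE FP4 GEMM + routing": ["mul_mat_q<NVFP4>", "k_bin_bcast", "mul_mat_q_stream_k_fixup",
                               "mm_ids_helper", "argsort", "gather_mmq_fp4"],
    "bf16 projections": ["nvjet_sm121 + cutlass (bf16)"],
    "act-quant tax": ["quantize_mmq_nvfp4"],
}

totals = {name: sum(profile_pct[k] for k in kernels) for name, kernels in buckets.items()}
for name, pct in totals.items():
    print(f"{name}: {pct:.1f}%")
```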
### 2. RE-GRAPH the MoE decode step - TESTED + MEASURED (the headline finding)
**Un-graphed status CONFIRMED, and the disable is purely conservative.** NVFP4 on sm_121 has
`get_mmvq_mmid_max_batch_turing_plus(NVFP4)=8` (`mmvq.cu:139-148`). At MoE decode `ne[2]=npl > 8`,
so every MUL_MAT_ID node trips the disable in `ggml_cuda_graph_check_compability`
(`ggml-cuda.cu:3278`: `node->ne[2] > mmvq_mmid_max => use_cuda_graph=false` for the WHOLE step).
BUT the path actually taken at `ne[2]>8` on Blackwell NVFP4 is `ggml_cuda_should_use_mmq()==true`
(`ggml-cuda.cu:2664`) -> the **grouped stream-k `mul_mat_q` id-branch**, launched on one stream with
**NO host sync** (verified: zero `cudaStreamSynchronize`/`Memcpy` in `mmq.cu`/`mmid.cu`). The stream
sync the disable guards against lives ONLY in the per-expert host-loop fallback, which is never
reached when `should_use_mmq` is true. So graphs are SAFE for the grouped path; the disable is a
conservative over-guard (upstream TODO + ggml-org/llama.cpp#18958).
**The lever (env-gated, bit-exact, built+measured here).** Relax the disable when the node takes
the grouped MMQ path. Patch (one function, one TU, 9 s incremental build):
```c
// ggml-cuda.cu ggml_cuda_graph_check_compability(), [TAG_MUL_MAT_ID_CUDA_GRAPHS]
bool mmid_needs_sync = !ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max;
if (mmid_needs_sync && ggml_is_quantized(node->src[0]->type) &&
getenv("LLAMA_MOE_FORCE_GRAPHS") != nullptr &&
ggml_cuda_should_use_mmq(node->src[0]->type, cc, node->src[1]->ne[2], node->src[0]->ne[2])) {
mmid_needs_sync = false; // grouped stream-k id-path is sync-free => graph-safe
}
if (mmid_needs_sync) { use_cuda_graph = false; ... }
```
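The decision logic of the excerpt above can be modelled as a pure function to check the cases by hand. This is a Python restatement for illustration only: `takes_mmq_path` stands in for the result of `ggml_cuda_should_use_mmq(...)` and `force_graphs` for the `LLAMA_MOE_FORCE_GRAPHS` environment check.

```python
def graph_allowed_for_mmid(quantized: bool, ne2: int, mmvq_mmid_max: int,
                           force_graphs: bool, takes_mmq_path: bool) -> bool:
    """Return True if a MUL_MAT_ID node leaves CUDA graphs enabled for the step."""
    needs_sync = (not quantized) or ne2 > mmvq_mmid_max
    if needs_sync and quantized and force_graphs and takes_mmq_path:
        needs_sync = False  # grouped stream-k id-path is sync-free
    return not needs_sync

MMID_MAX = 8  # NVFP4 value quoted in the text

cases = {
    "npl 8, stock": graph_allowed_for_mmid(True, 8, MMID_MAX, False, True),
    "npl 16, stock": graph_allowed_for_mmid(True, 16, MMID_MAX, False, True),
    "npl 16, FORCE + MMQ": graph_allowed_for_mmid(True, 16, MMID_MAX, True, True),
    "npl 16, FORCE, no MMQ": graph_allowed_for_mmid(True, 16, MMID_MAX, True, False),
    "unquantized, FORCE": graph_allowed_for_mmid(False, 16, MMID_MAX, True, True),
}
for name, ok in cases.items():
    print(f"{name}: graphs {'on' if ok else 'off'}")
```

The relaxation never applies to unquantized weights or to nodes that would fall back to the per-expert host loop, so the original guard still covers the path that actually synchronizes.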
**Measured A/B (2 reps each, rock-solid; OFF=stock graphs-disabled, ON=LLAMA_MOE_FORCE_GRAPHS=1):**
| npl | OFF decode_agg | ON decode_agg | gain | OFF %vLLM | ON %vLLM |
|----:|---------------:|--------------:|-----:|----------:|---------:|
| 8 | 226.0 | 226.4 | +0.2% (noise; ne2=8<=mmid_max, so the MMVQ path already graphs and FORCE is inert) | 88% | 88% |
| 32 | 433.8 | 452.7 | **+4.4%** | 86.6% | **90.4%** |
| 64 | 589.0 | 605.9 | **+2.9%** | 85.9% | **88.3%** |
| 128 | 743.1 | 757.1 | **+1.9%** | 84.2% | **85.8%** |
(vLLM ref 256.5 / 500.8 / 686.1 / 882.2.) The win is largest at small batch (more host-launch
overhead relative to kernel work) and shrinks as kernels dominate at npl128 - exactly the ~1.7%
within-step launch-idle the prior agent measured at 98.3% GPU-busy. This REFINES the prior "graphs
won't help npl128" verdict: it DOES help (+1.9%, above noise), and helps npl32-64 substantially
(+3-4%). **Bit-exact by construction** (graph replay re-issues the identical kernel sequence with
identical args; FORCE only flips `use_cuda_graph`; the shipped f32 dense path already runs graphed).
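The gain and %vLLM columns can be recomputed from the raw rates. A short sketch with the OFF/ON numbers and the vLLM reference copied from the table and the parenthetical above (names are illustrative):

```python
VLLM_REF = {8: 256.5, 32: 500.8, 64: 686.1, 128: 882.2}
OFF = {8: 226.0, 32: 433.8, 64: 589.0, 128: 743.1}   # stock, graphs disabled
ON = {8: 226.4, 32: 452.7, 64: 605.9, 128: 757.1}    # LLAMA_MOE_FORCE_GRAPHS=1

def ab_row(npl: int):
    """Return (gain %, OFF as % of vLLM, ON as % of vLLM) for one npl."""
    gain_pct = 100.0 * (ON[npl] / OFF[npl] - 1.0)
    off_pct = 100.0 * OFF[npl] / VLLM_REF[npl]
    on_pct = 100.0 * ON[npl] / VLLM_REF[npl]
    return gain_pct, off_pct, on_pct

for npl in sorted(OFF):
    gain, off_pct, on_pct = ab_row(npl)
    print(f"npl={npl:3d}  gain {gain:+.1f}%  {off_pct:.1f}% -> {on_pct:.1f}% of vLLM")
```

Above npl 8 the gain falls monotonically with npl, consistent with launch overhead being a shrinking share of the step as kernel work grows.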
**Bit-exact gate - both PASS (measured):** `test-backend-ops -o MUL_MAT_ID -b CUDA0` = **806/806,
CUDA0 OK** (the grouped FP4 kernel is untouched - the edit is host-only graph-compat logic); and a
**parallel-greedy np16** run (ne2=16>8, i.e. the grouped MMQ path under graphs ON vs eager OFF) gives
**byte-identical generated content ON==OFF** (md5 `04c4761...` both, 16/16 completions, diff empty).
**SHIP CANDIDATE -> patch 0025** (default-off env now; safe to flip to `should_use_mmq`-gated
default-ON since it is a pure, gated, bit-exact win).
### 3. Grouped-GEMM occupancy headroom - EXHAUSTED on this model (cheap levers), one structural lever left
- The FP4-MMA `mul_mat_q<NVFP4>` is **register-bound to 1 CTA/SM** (`__launch_bounds__(256,1)`,
~255 regs/thread = ~12.5% thread occupancy). Grouped grids: ~2048 and ~8192 64-wide tiles.
- **M-tile (col-tile) axis NEUTRAL** (runtime `LLAMA_MOE_DECODE_TILE`, npl128): TILE32 742.4 /
TILE64 744.2 / TILE96 747.1 - all within 0.6%. Re-confirms patch 0015: this 256-tiny-expert model
is **bandwidth/SSM-bound, not col-tile-occupancy-bound**, so the M-tile lever has nothing to bite.
- **Cheap occupancy lever already measured (patch 0017):** compile-time `GGML_CUDA_FP4_MINBLOCKS=2`
on MoE @npl128 = **+0.4% (noise)**, and nsys showed it makes the dense FP4 GEMM **+8.7% SLOWER**
(register-cap spills, occupancy did not usefully rise). So the cheap register-cap lever is spent.
- **Only untested grouped-GEMM lever = the structural `mmq_y`-down (nwarps=4 warp-remap)** - the
0017-deferred P2. `mmq_y` tiles N (weight rows), not M, so shrinking it does NOT re-read weights
(BW-neutral) and raises resident CTAs. Bit-exact (warp/fragment remap, same FP4-MMA math), but a
real kernel change (the `nwarps x tile_C::I == mmq_y` static_assert coupling), and predicted
BOUNDED on this BW-bound model. Not a cheap toggle; do only if the re-graph + M1 banks are
insufficient.
### 4. W4A16 option (skip the act-quant, vLLM's Marlin choice) - NOT recommended
vLLM on GB10 runs **MARLIN W4A16** MoE (engine-log confirmed: "Your GPU does not have native FP4 ...
Marlin kernel"): bf16 activations NEVER quantized, FP4 weights dequant-in-kernel to bf16, **bf16
MMA**, under a full CUDA graph. It does this because CUTLASS's native-FP4 grouped GEMM is broken on
consumer sm_121 (whitelists only sm_100/103 datacenter Blackwell). llama instead runs **native
Blackwell FP4-MMA W4A4** grouped stream-k - a HIGHER arithmetic tier (GB10 FP4 = 2x INT8/BF16 rate).
The W4A4 act-quant tax llama pays (`quantize_mmq_nvfp4`) is **only ~2.0% of MoE decode** (3.25 ms/step
after the 0023 up/gate dedup). Adopting W4A16 to erase it would: (a) be **NOT bit-exact** (bf16 acts
!= FP4 acts -> different logits); (b) **descend to BF16-class MMA** (concede GB10's 2x FP4 rate - the
grouped GEMM, ~27% of the step, would run at HALF the MMA rate); (c) re-enter the **W4A16 occupancy
wall** (the prior GB10 W4A16 effort plateaued ~9 TFLOP/178 t/s). The BW saving is a sliver (acts are
tiny vs the ~weight read at M~4/expert), so it trades a bit-exact 2% for a non-bit-exact, slower,
occupancy-hostile path. **Reject.** The act-quant tax is better attacked bit-exactly via the down_proj
quantize retune (M1).
### 5. RANKED MoE levers (expected gain, bit-exactness, tractability)
1. **RE-GRAPH the MoE decode (this patch, -> 0025): MEASURED +4.4% npl32 / +2.9% npl64 / +1.9% npl128.**
Bit-exact, tiny (one function, one TU), low-risk, built+measured. **The clear #1.** Helps the
server path AND small-npl most (where llama was weakest: npl32 86.6%->90.4% of vLLM).
2. **down_proj act-quant retune (M1): bit-exact, bounded (act-quant is ~2%).** Cheap bank-shot;
retune `quantize_mmq_nvfp4` block/grid (byte-identical output, like 0023's gather). Low single-%.
3. **Grouped-GEMM `mmq_y`-down warp-remap: bit-exact, BW-neutral, the 0017-deferred P2.** Speculative,
predicted bounded on this BW-bound model; real kernel work. Only if 1+2 insufficient.
4. **M-tile / MINBLOCKS occupancy: EXHAUSTED** (measured neutral-to-negative). Do not pursue.
5. **W4A16: REJECT** (non-bit-exact, slower BF16 arithmetic, occupancy wall). Not even a clean opt-in.
**Net:** the bit-exact MoE-GEMM-region headroom from 1+2(+3) is ~3-6% at npl128 (MoE ~84% -> ~88-90%
of vLLM) and ~4-5% at npl32-64. Full MoE parity is NOT reachable from the GEMM/launch track alone:
the remaining gap is the grouped GEMM (~27%, FP4-MMA at the LPDDR5x BW floor - hardest regime, vLLM
ships purpose-built Marlin-NvFp4) + the bf16 projections (~10.5%). The recurrence (~48%) is already
PAST vLLM. The single highest-ROI, ship-now item is the re-graph patch (0025).
---
## C - STRUCTURAL DENSE RESIDUAL: lm_head + scheduling (label: C-structural-design, READ-ONLY no GPU)
Source-confirmed on DGX `~/llama-paged-dev` @ HEAD `2ee65c2` plus committed traces
(`CRITICALPATH_GAP_ANALYSIS.md`, `A2_CUDAGRAPH_DECODE.md`, `F16_DENSE_RESIDUAL_PROBE.md`,
`OTHER_PATHS_INVESTIGATION.md` sec B). Numbers are dense q36-27b-nvfp4 @npl128: step ~333 ms
(384 t/s), gap to vLLM (419 t/s = 305 ms) is ~27-28 ms/step. **Verdict: lever C is a near
dead-end for a bit-exact dense win; rank it LAST of A/B/C/D for the bit-exact default.**
### How the lm_head is stored, and why it routes to cublas/nvjet (not the tuned FP4 MMQ)
`output.weight` is **GGML_TYPE_BF16** (NOT quantized): the `--tensor-type attn/ffn=nvfp4`
recipe converts only attn+ffn, leaving the logit-sensitive final projection (and tok_embd)
at base BF16. Confirmed: `llama-model.cpp:1460` creates the NVFP4 scale `output_s` ONLY
`if (output->type == GGML_TYPE_NVFP4)`, so for the BF16 head `model.output_s` is null, and
`build_lora_mm` (`llama-graph.cpp:1087`) collapses to a plain `ggml_mul_mat`. In
`ggml_cuda_mul_mat` dispatch (`ggml-cuda.cu:2599-2629`): `use_mul_mat_q`/`use_mul_mat_vec_q`
both require `ggml_is_quantized(src0)` (BF16 fails => the tuned FP4 path is INELIGIBLE);
MMF is gated off for the wide `vocab x 128` shape; `use_batched_cublas_bf16` is true but the
batched branch additionally needs `src1->ne[2]*ne[3] > 1` (the 2D decode lm_head fails it).
Falls through to `ggml_cuda_op_mul_mat_cublas` BF16 branch (`:1662`): downcast F32 act ->
BF16, `cublasGemmEx(16BF x 16BF -> COMPUTE_32F)` = **nvjet_sm121**, output rounded BF16 ->
upcast F32. Shape M=vocab(151936) x N=128 x K=5120: a tall-skinny output GEMM reading the
ENTIRE BF16 head weight for 128 columns = inherently **memory-bound**. On the dense model
this is the ONLY non-FP4 cublas GEMM in decode. Cost: nvjet = 11.91 ms = 3.1-3.6% of step.
**CRITICAL CORRECTION the team must carry:** the baseline is NOT "f32 lm_head". The cublas
BF16 branch downcasts the activation F32->BF16 AND rounds the output to BF16. Today's
"bit-exact reference" logits are ALREADY BF16-precision on both input and output. So
"bit-exact" for lever C only protects BF16-rounded logits, which is exactly why option (c)
is "essentially bit-exact" and why any meaningful lm_head speedup requires changing the dtype.
### lm_head bit-exact lever + gain - bandwidth math kills it
nvjet moves the full BF16 head weight in 11.9-12.2 ms = ~195-199 GB/s = ~72% of GB10's
273 GB/s peak: it is ALREADY one of the most bandwidth-efficient kernels in the step (the
overall decode step runs at only ~40% util / ~110 GB/s). The bit-exact ceiling is the
remaining bandwidth headroom only:
- **(c) keep BF16 weight, swap the kernel** (custom skinny wide-vocab streaming GEMM, or a
hand-picked cublasLt algo/workspace heuristic for the thin-N/huge-M shape). The ONLY
essentially-bit-exact option. Perfect LPDDR5x saturation 199 -> 273 GB/s = 11.9 -> ~8.7 ms =
**save ~3 ms = ~0.9-1.0% of step = ~11% of the 27 ms gap.** REALISTIC gain: 0 to 3 ms,
leaning toward 0 - cublasLt already selected nvjet as its best algo, so beating it on a
pure weight-stream is not guaranteed, and it is high kernel-writing effort. (F16 probe
independently estimates the same nvjet recovery as "~5 ms, uncertain - may already run TF32".)
Structural reason it is near-zero: the head must read the entire BF16 weight for 128 columns;
you CANNOT cut those weight bytes without changing the dtype. Bit-exactness and the only real
speedup (fewer weight bytes) are mutually exclusive here.
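The bit-exact ceiling for a kernel swap is pure bandwidth arithmetic: the weight bytes are fixed, so the best possible time is the current time scaled by achieved/peak bandwidth. A sketch using the figures from the text (the function name and defaults are illustrative):

```python
PEAK_GBPS = 273.0  # GB10 peak memory bandwidth quoted in the text

def kernel_swap_ceiling(t_ms: float, achieved_gbps: float,
                        step_ms: float = 333.0, gap_ms: float = 27.0):
    """Best case if the same bytes were streamed at peak bandwidth."""
    best_ms = t_ms * achieved_gbps / PEAK_GBPS
    saved_ms = t_ms - best_ms
    return {
        "best_ms": best_ms,
        "saved_ms": saved_ms,
        "pct_of_step": 100.0 * saved_ms / step_ms,
        "pct_of_gap": 100.0 * saved_ms / gap_ms,
    }

ceiling = kernel_swap_ceiling(t_ms=11.9, achieved_gbps=199.0)
for key, value in ceiling.items():
    print(f"{key}: {value:.2f}")
```

This is an upper bound: it assumes a replacement kernel could hold peak bandwidth for the whole weight stream, which the text already considers unlikely.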
### lm_head NON-bit-exact options (excluded from any vLLM-parity claim)
- **(a) NVFP4-quantize the head -> tuned FP4 MMQ.** Biggest win, BREAKS bit-exactness.
Weight ~4x fewer bytes (BF16 ~1.5-2.4 GB -> NVFP4 ~0.4-0.6 GB) AND rides the already-tuned
`mul_mat_q<NVFP4>` (patch 0017): memory floor drops ~4x = **save ~8-9 ms = ~2.5% of step**.
BUT NVFP4 < BF16 precision => different logit bits, can flip greedy argmax, AND it is
**UNFAIR vs vLLM** (which keeps its LM head BF16). Same opt-in non-bit-exact bucket as the
shelved bf16-SSM / f16-glue; exclude from parity claims.
- (b) FP8 / Q8_0 head: smaller error than NVFP4 but still != BF16 bits AND not on the tuned
FP4 MMQ path, so it buys less speed than (a). No reason to prefer.
- (existing knob) `GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` (`ggml-cuda.cu:1610`): 16-bit accumulate
on this exact GEMM, faster but NON-bit-exact (16F vs 32F accumulate). Non-bit-exact track only.
### Scheduling / launch bit-exact lever + gain - ~0.05%
The decode step is GPU-bound at 99.94% (node-level trace, single stream, graphId replayed).
CUDA graphs ALREADY collapse within-step launch latency: exposed idle = 0.225 ms/step = 0.06%,
zero gaps > 5 us, graph ON vs OFF = +0.13% @npl128 (noise). Graphs are NOT a pending dense
lever - they are already in effect. The ONLY graph-non-covered overhead is the BETWEEN-step
host gap: ggml rebuilds the cgraph each step with a NEW `cgraph->uid`, so the uid fast-path in
`ggml_cuda_graph_update_required` never fires and the host re-dispatches ~3100 launches between
graph launches. MEASURED exposed cost: ~0.2 ms/step = ~0.05% (most of the ~2 ms host loop
overlaps GPU compute). **Bit-exact lever:** make the cgraph PERSISTENT/reused across decode
steps so the uid fast-path fires (replay-only => bit-exact). GAIN ~0.2 ms/step = ~0.05%, medium
effort (touches ggml graph lifetime), second-order. No other per-step host overhead is exposed
(the host loop is HIDDEN under GPU compute until the kernels get fast enough to drop GPU-busy
below host time).
### Quantified realistic bit-exact total for lever C
lm_head kernel swap 0 to ~3 ms (upper ~0.9%, realistically ~0) + persistent cgraph ~0.2 ms
(~0.05%) = **combined bit-exact ceiling ~3.2 ms = ~0.95% of the 333 ms step = ~12% of the
27 ms gap.** Moves dense parity 91.8% -> at most ~92.7%, realistically <0.5% net (<1.5 ms).
The "~3-4%" in the brief is the lm_head's TOTAL cost, NOT what is bit-exactly recoverable: only
the bandwidth headroom (~3 ms) and host gap (~0.2 ms) are recoverable; the other ~9 ms is the
irreducible BF16 weight stream BOTH engines pay (vLLM keeps a BF16 head too). **Rank C LAST for
the bit-exact default.** Its one durable note for the team: the lm_head logits are ALREADY
BF16-rounded (not f32), which both narrows what option (c) must preserve and is exactly why the
only meaningful lm_head speedup requires a dtype change (= non-bit-exact + unfair vs vLLM).
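
The ceiling arithmetic above can be re-derived in a few lines. This is only a sketch over the figures quoted in this section (333 ms step, 27 ms gap, ~3 ms lm_head headroom, ~0.2 ms host gap); the function name `ceiling` is illustrative and not part of any tool.

```python
# Figures quoted in this section, in ms per dense decode step at npl128.
STEP_MS = 333.0
GAP_MS = 27.0
LM_HEAD_HEADROOM_MS = 3.0   # upper bound; the notes say realistically ~0
PERSISTENT_CGRAPH_MS = 0.2


def ceiling(step_ms, gap_ms, *gains_ms):
    """Return (total ms, % of the step, % of the gap) for a set of recoverable gains."""
    total = sum(gains_ms)
    return total, 100.0 * total / step_ms, 100.0 * total / gap_ms


total_ms, pct_of_step, pct_of_gap = ceiling(
    STEP_MS, GAP_MS, LM_HEAD_HEADROOM_MS, PERSISTENT_CGRAPH_MS
)
print(f"{total_ms:.1f} ms = {pct_of_step:.2f}% of step = {pct_of_gap:.1f}% of gap")
```

Setting `LM_HEAD_HEADROOM_MS` to 0 gives the "realistically" case, where only the host gap remains.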
Source (DGX @2ee65c2): `llama-model.cpp:1460`, `llama-graph.cpp:1087`, `qwen35.cpp:222` /
`qwen35moe.cpp:246`, `ggml-cuda.cu:2599-2629` / `:1662-1690` / `:1610`.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
# RANK + PLAN - the final synthesis (build order, A handoff, B/C/D queue)
This is the decision section: all four levers measured/designed, ranked by gain x tractability
x gate, the concrete A build plan, and the ordered B/C/D queue with each one's trigger. Base:
clean pin-synced llama.cpp 9d5d882d, bit-exact md5 == 0023. Dense gap to vLLM ~27 ms/step (384
vs 419 t/s @npl128); MoE ~82% (726 vs 882). Recurrence already PAST vLLM (84.6% vs 82.4% peak BW).
## (1) Per-lever scorecard: gain (dense + MoE), tractability, gate
| Lever | Dense decode gain | MoE decode gain | Tractability | Quality gate | Bit-exact? |
|-------|-------------------|-----------------|--------------|--------------|------------|
| **B re-graph (patch 0025)** | ~0 (dense already graphed) | **MEASURED +4.4% npl32 / +2.9% npl64 / +1.9% npl128** (MoE 84%->86% .. 90% of vLLM) | **VERY HIGH - already built+measured**, 1 fn / 1 TU / 9 s build | md5 byte-identical: **PASSED** (MUL_MAT_ID 806/806 + parallel-greedy md5 identical) | YES |
| **A hybrid per-head SSM** | **+25% to +35%/call recurrence -> ~430-454 t/s = 103-108% of vLLM** (ABOVE vLLM) | keeps the +13-25% recurrence share KL-passing; does NOT alone close the MoE GEMM floor | MEDIUM-HIGH - builds on `BF16_SSM_STATE.diff`; biggest new piece = split-dtype cache layout (~150-250 LOC) | **KL<1e-3 + Same-top-p>=99.5% + drift sweep 256/1024/2048/4096 both models**; md5 that T_thresh=inf == f32 baseline | f32 default YES; hybrid is at-or-above vLLM precision, KL-gated |
| **B M1 down_proj retune** | ~0 | bit-exact, bounded (act-quant is ~2% of MoE step) - low single-% | HIGH - block/grid retune of `quantize_mmq_nvfp4`, byte-identical output | md5 byte-identical | YES |
| **B mmq_y-down warp-remap** | small (shared FP4 GEMM) | bit-exact, BW-neutral, predicted BOUNDED on this BW-bound model | LOW-MEDIUM - real kernel change (nwarps x tile_C coupling) | test-backend-ops MUL_MAT_ID + md5 | YES |
| **C lm_head kernel swap** | 0 to ~3 ms (~0.9%, realistically ~0; uncertain it beats nvjet) | ~0 | LOW payoff - high kernel-writing effort, not guaranteed to beat cublasLt | md5 (BF16-rounded logits) | YES (essentially) |
| **C persistent cgraph** | ~0.2 ms (~0.05%) | ~0 (B's re-graph already covers MoE host gap) | MEDIUM - touches ggml graph lifetime, for 0.05% | replay-only = bit-exact, md5 | YES |
| **D f16 glue (Option 2)** | ~11-16 ms = 40-60% of residual -> 91.8% -> ~95-96% (NOT a close) | ~0 (dense-only lever) | LOW-MEDIUM - new norm.cu f16 kernels, multi-file | **NON-bit-exact, must pass the SAME KL<1e-3 that plain bf16-SSM FAILED** | NO - opt-in only |
Notes that decide the ranking:
- **B's re-graph helps ONLY MoE** (dense decode is already graphed; the disable is the MoE
MUL_MAT_ID `ne[2]>8` over-guard). It is the single highest-ROI item because it is already
built, measured, and gated - zero remaining build risk, just a default flip.
- **A is the only lever that moves dense ABOVE vLLM** (103-108%) and it does it at-or-above
vLLM precision (vLLM keeps ALL temporal state f32; A keeps f32 on exactly the unsafe heads).
It reaches the largest mass (recurrence = 49.3% dense / ~48% MoE = ~6x what D can touch).
- **C and D are dead-or-tiny for the bit-exact default.** C's bit-exact ceiling is <1% with
real risk; D is non-bit-exact, dense-only, and tops out at ~96% parity (not a close).
## (2) Ranked build order (gain x tractability x gate) - A confirmed as the build lead
1. **B re-graph (patch 0025) - LAND NOW.** Already built + measured + both gates PASSED. The
only remaining decision is flipping the default from env-gated (`LLAMA_MOE_FORCE_GRAPHS`) to
`should_use_mmq`-gated default-ON. Zero new build, measured +1.9-4.4% MoE, bit-exact. This
is not a "build" so much as a "ship"; it precedes A because it is free and de-risked.
2. **A hybrid per-head SSM - THE BUILD LEAD (user-greenlit, CONFIRMED by evidence).** The only
lever that takes dense ABOVE vLLM and the only principled fix for the bf16-SSM KL failure.
Largest reachable mass, bounded build on an existing diff, KL-gated. Build plan in (3).
3. **B M1 down_proj act-quant retune** - cheap bit-exact bank-shot, run after A while the GPU
is warm. Bounded (~2% act-quant tax), byte-identical-output retune.
4. **B mmq_y-down warp-remap** - only if 1+2+3 leave MoE short of target; real kernel work,
predicted bounded on this BW-bound model.
5. **C persistent cgraph** - a bit-exact ~0.05% micro-win for the default; build only if a
broad graph-lifetime refactor is happening anyway (not worth a standalone effort).
6. **C lm_head BF16 kernel swap** - near-zero, uncertain, high effort. Effectively shelved.
7. **D f16 glue (Option 2 norm.cu kernels)** - LAST, opt-in only, non-bit-exact, dense-only,
gated by the same KL threshold bf16-SSM failed. Build only if the last ~4% dense is chased
AFTER A lands and is shown insufficient. Skip Option 1 entirely (cast overhead eats the win).
**Why A over B as the lead, despite B's re-graph being measured:** B's re-graph is already
DONE - it is a ship, not a build. For the NEW build effort, A is correctly the lead: it is the
only lever with a path ABOVE vLLM on dense, it attacks the largest mass (recurrence, shared by
both models), and it converts the already-proven whole-bf16 win (490 t/s = 125% vLLM, but KL
FAIL) into a KL-passing form. B's remaining items (M1, mmq_y) are bounded single-% bank-shots
that cannot reach parity on their own (the residual MoE gap is the FP4 grouped GEMM at the
LPDDR5x BW floor + bf16 projections, both structural). So: ship 0025, then build A, then bank B.
## (3) CONCRETE A BUILD PLAN (hand to the build agent)
**Objective:** a per-head mixed-dtype SSM state cache - f32 on long-memory heads, bf16 on
fast-decaying heads - that captures 50-70% of the whole-bf16 recurrence win (-25% to -35%/call)
while PASSING KL<1e-3. Builds directly on the existing `BF16_SSM_STATE.diff` (untracked backup
on DGX `~/llama-paged-dev`). Target dense ~430-454 t/s (103-108% of vLLM 419), MoE +13-25%
recurrence share KL-passing. f32 default stays bit-exact (md5 == 0023 baseline).
**Reuse VERBATIM from BF16_SSM_STATE.diff** (do NOT rewrite): `gdn_state_t<STATE_BF16>` alias,
templated `__bfloat162float` load / `__float2bfloat16` store, the gather template, the dtype-
detect dispatcher, `type_s`/`type_r` cparam wiring, the CPU mirror, the back-compat row convert,
the bf16 fill path, and the test-backend-ops bf16 cases.
**NEW work items (in build order):**
1. **Head classifier (~80-150 LOC, do first, no GPU).** Host function over `ssm_a` (tensor
`SSM_A_NOSCAN`, `[n_v_heads]`, = `-exp(A_log)`) and `ssm_dt` (tensor `SSM_DT`, `[n_v_heads]`):
for each (layer il, head h) compute `tau_h = 1 / (|ssm_a[il][h]| * softplus(ssm_dt[il][h]))`;
set `head_is_bf16[il][h] = (tau_h <= T_thresh)`. Emit per-layer `n_f32`/`n_bf16` counts +
the `head_slot[il][h] = {is_bf16, local_idx}` map. Add cparam `ssm_hybrid_tau_thresh` / CLI
`--ssm-bf16-tau` (inf => all-f32 bit-exact default; 0 => all-bf16; hybrid band in between).
Runs in microseconds at load, no data, no GPU. (Optional Tier-2: a short calibration pass
measuring per-head time-mean of actual `exp(g[h,t])` -> model-hash sidecar; only if Tier 1
lands just above the gate.)
2. **Split-dtype cache layout (~150-250 LOC - THE BIGGEST piece).** In
`llama-memory-recurrent.cpp`: replace the single `s_l` ([S_v,S_v,H,slots] f32) with two
dtype-homogeneous sub-caches sized by per-layer head COUNT (this is what saves the bytes):
`s_l_f32 [S_v*S_v*n_f32, slots]` f32 + `s_l_bf16 [S_v*S_v*n_bf16, slots]` bf16. In
`build_rs` (`delta-net-base.cpp`): build the two views + pass the `head_slot` map; split the
`n_embd_s` accessors. q/k/v/g/beta KEEP natural head order (no activation permute - they come
from the projection GEMMs). Coarser per-LAYER fallback is REJECTED (long-memory heads span
most layers => too coarse; per-head is the right granularity).
3. **Recurrence kernel: single launch, runtime per-head branch (~120-200 LOC).** Pass BOTH
bases (`const float* s_f32_base`, `const nv_bfloat16* s_bf16_base`) + the two `state_dst`
partition views + the device `head_slot[]` map. Branch on `head_slot[h_idx].is_bf16` at the
load site, the in-place store site, the gather, and the dispatcher. The branch is UNIFORM
within a block (all threads share `h_idx` = `blockIdx.x`) => **NO warp divergence**. The
recurrence math (the ~140-260 region) stays byte-for-byte f32-register, untouched. `keep_rs_t`
snapshots stay f32 (op-output scratch). The `STATE_BF16` template stays as the all-bf16
special case.
4. **ids / in-place per-head.** `state_dst` becomes two partition views; `gdn_gather_nonident`
becomes per-head dtype-aware (copies each head's `S_v*S_v` block from the right partition of
`cache[ids[s]]`; still disjoint-scratch race-free). Each head writes its own partition slot
(read==write slot, loaded to registers before store) => the identity / in-place property is
preserved.
5. **CPU mirror (ops.cpp)** per-head dtype branch for CI / CPU-offload parity.
6. **test-backend-ops: a MIXED-dtype-state GATED_DELTA_NET case** (some heads f32, some bf16)
vs the CPU ref, covering decode + multi-token prefill + `keep_rs_t` (this is the R2
silent-corruption net - do NOT skip it).
7. **Gate (GPU, GateBench harness, already built).** Sweep `T_thresh` to find the MINIMUM f32
fraction that passes: noise floor first, then the 256-tok KL gate, then the long-context
drift sweep 256/1024/2048/4096, BOTH models (dense q36-27b + MoE q36-35b-a3b). Pass bar =
**KL<1e-3 AND Same-top-p>=99.5% AND drift bounded**. nsys per-call confirms `f_bytes` =
`(n_f32 + n_bf16/2)/H` dropped. md5 that `T_thresh=inf` reproduces the f32 baseline (the
bit-exact opt-out MUST be preserved).
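
A minimal sketch of work item 1 (the head classifier), in Python rather than the C++ host code it would ship as. It implements the predicate exactly as written above, `tau_h = 1 / (|ssm_a| * softplus(ssm_dt))` and `is_bf16 = tau_h <= T_thresh`, for one layer. The function names are hypothetical, and it assumes `ssm_a` is strictly negative (it is `-exp(A_log)`), so the division is safe.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def classify_heads(ssm_a, ssm_dt, t_thresh):
    """Classify one layer's heads by memory time constant.

    Returns (head_slot, n_f32, n_bf16), where head_slot[h] = (is_bf16, local_idx)
    and local_idx is the head's index inside its dtype partition.
    """
    head_slot, n_f32, n_bf16 = [], 0, 0
    for a, dt in zip(ssm_a, ssm_dt):
        tau = 1.0 / (abs(a) * softplus(dt))
        if tau <= t_thresh:          # fast-decaying head: bf16 partition
            head_slot.append((True, n_bf16))
            n_bf16 += 1
        else:                        # long-memory head: stays f32
            head_slot.append((False, n_f32))
            n_f32 += 1
    return head_slot, n_f32, n_bf16


slots, n_f32, n_bf16 = classify_heads([-0.01, -1.0, -5.0], [0.0, 0.0, 0.0], 2.0)
print(slots, n_f32, n_bf16)
```

With the comparison as written, a threshold of 0 keeps every head f32 and `math.inf` sends every head to bf16, so how the `--ssm-bf16-tau` endpoints map onto this predicate is something the build should pin down before the md5 gate is run.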
**Expected result (from the physics + the whole-bf16 measurement):** KLD contribution per head
~ `(eps*tau_h)^2` (eps~2^-8~3.9e-3) is dominated by the top-tau heads, so removing the top
~25-40% by tau cuts MeanKLD by 1-2 orders. Design band **f32 fraction f in [0.30, 0.50]**:
- f=0.30 (n_bf16/H=0.70): `f_bytes`=0.65 -> ~2.20 ms/call (-35%), captures ~70% of the bf16
win -> dense **~454 t/s = ~108% of vLLM** (gate-likely, MeanKLD ~1e-3..1e-2).
- f=0.50: `f_bytes`=0.75 -> ~2.54 ms/call (-25%), captures ~50% -> dense **~430 t/s = ~103% of
vLLM** (most robust pass; strict KL<1e-3 may need this fraction).
The exact f is found by the T_thresh sweep. **MoE:** A keeps the +13-25% recurrence share
KL-passing but does NOT by itself close the MoE GEMM gap (that is B). Joint ship gate = nsys
per-call bytes down AND KL<1e-3 for BOTH models; neither alone ships. Hybrid is STRICTLY safer
than vLLM (we keep f32 exactly where bf16 is unsafe; vLLM keeps all-f32 everywhere).
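
The `f_bytes` figures in the design band follow directly from `(n_f32 + n_bf16/2)/H`. A small sketch, assuming (as the ms/call projections above implicitly do) that per-call recurrence time scales linearly with state bytes moved:

```python
def f_bytes(f32_fraction):
    """State bytes per call relative to all-f32: f32 heads cost 1, bf16 heads cost 1/2."""
    n_f32 = f32_fraction
    n_bf16 = 1.0 - f32_fraction
    return n_f32 + n_bf16 / 2.0


for f in (0.30, 0.50):
    fb = f_bytes(f)
    print(f"f={f:.2f}: f_bytes={fb:.2f}, state traffic {100 * (fb - 1):+.0f}%")
```

The two endpoints bracket the band: `f_bytes(1.0)` is the f32 baseline and `f_bytes(0.0)` is the whole-bf16 case at half the state traffic.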
## (4) Ordered B / C / D queue with build triggers
- **B-1 re-graph default flip (patch 0025): trigger = NOW / immediate.** Already built, measured
(+1.9-4.4% MoE), both gates PASSED. Flip env-gated -> `should_use_mmq`-gated default-ON. No
dependency on A. Ship first.
- **B-2 down_proj act-quant retune (M1): trigger = after A's kernel work lands** (reuse the warm
GPU window). Bit-exact block/grid retune of `quantize_mmq_nvfp4`, byte-identical output.
Bounded ~1% (act-quant is ~2% of the MoE step). Run it; it is cheap.
- **B-3 mmq_y-down warp-remap: trigger = ONLY if B-1 + B-2 + A leave MoE below the target.**
Real kernel change, BW-neutral, predicted bounded on this BW-bound model. Speculative; gate by
test-backend-ops MUL_MAT_ID + md5.
- **C-1 persistent cgraph: trigger = ONLY if a broader ggml graph-lifetime refactor is already
in flight.** Standalone it is ~0.05%, not worth the graph-lifetime touch. Bit-exact (replay).
- **C-2 lm_head BF16 kernel swap: trigger = effectively NEVER for the default** (0 to ~3 ms,
uncertain it beats nvjet, high effort). Documented; not queued.
- **D Option 2 f16-glue norm.cu kernels: trigger = ONLY if dense parity is still wanted AFTER A
lands AND A is shown insufficient, AND an opt-in non-bit-exact mode is acceptable.** Multi-file,
recovers ~11 ms (norm/elementwise band), gated by the SAME KL<1e-3 that plain bf16-SSM failed.
Skip Option 1 (net-zero cast overhead). Lowest priority of all.
**Bottom line:** ship 0025 now (free, measured MoE +1.9-4.4%), then build A (the only path
ABOVE vLLM on dense, KL-gated, ~430-454 t/s = 103-108% of vLLM), then bank B-2/B-3 on MoE. C is
last for the bit-exact default (<1%, dead-end); D is opt-in-only and dense-only, behind the KL
gate, only if the last ~4% is ever chased. The recurrence is already PAST vLLM; A converts that
proven win into a KL-passing form, and the MoE GEMM floor (the structural residual) is the one
piece no bit-exact lever fully closes - vLLM ships purpose-built Marlin-NvFp4 there.
Assisted-by: Claude:opus-4.8 [Claude Code]

@@ -1,184 +0,0 @@
# SSM decode fix - qwen35 gated-DeltaNet in-place recurrent-state write-back (patch 0018)
Follow-up to `A2_CUDAGRAPH_DECODE.md`. That analysis located the real decode lever
on the Qwen3.6 hybrid-SSM models (arch `qwen35`, 48 gated-DeltaNet linear-attn
layers : 16 full-attn layers) and ruled out the FP4 GEMM, CUDA graphs, the host
loop, and attention. The corrected per-kernel + per-memcpy decode decomposition
attributed ~67% of decode GPU time to SSM-state plumbing:
gated_delta_net 23.4% | get_rows state-gather 21.9% | D2D state-copy 18.9% (= ~67%)
FP4 matmul ~28% | full attention 0.4%
Root cause: per SSM layer per step the fused `gated_delta_net` op wrote its new
recurrent state into graph scratch, then a **separate `ggml_cpy` persisted the
full ~225 MB state into the recurrent-state cache** (1584 D2D ops, 356 GB, 18.9%
of decode over the profile window). vLLM's `fused_recurrent_gated_delta_rule`
keeps the state in place (no copy).
## STEP 1 (this patch): kill the per-layer D2D state copy-back
`ggml_gated_delta_net_inplace` (new builder, `src[6] = state_dst`) makes the op
write its final recurrent state **directly into the active sequences' contiguous
cache slot** (at `kv_head`), eliminating the copy-back. The op output then carries
only the attention scores. SSM arithmetic is unchanged - only the destination
pointer of the final-state write moved.
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new `ggml_gated_delta_net_inplace` op
builder. `dst` retains the same `[attn | state]` layout so the attention-output
view is identical; the state region is left unused.
- `ggml/src/ggml-cuda/gated_delta_net.cu`: kernel/launch/op-handler thread an
optional `state_dst`; final-state (`!keep_rs`) write targets it when present.
- `ggml/src/ggml-cpu/ops.cpp`: K==1 path operates in place on the `state_dst`
cache view (kept CPU-correct for non-CUDA runs / CI).
- `src/models/delta-net-base.cpp`: `build_recurrent_attn` uses the in-place op on
the fused decode/prefill path and drops the `ggml_cpy`. The rollback path
(`n_rs_seq > 0`) is unchanged. The get_rows state gather is unchanged (STEP 2).
### Correctness gate
- **Bit-identical**: greedy (`--temp 0 --seed 1`) `llama-completion` output on
`q36-27b-nvfp4` is byte-for-byte identical between the copy-back baseline and the
in-place build (`diff` -> IDENTICAL).
- **Coherent**: dense + MoE multi-paragraph greedy generations are on-topic and
correct (Rayleigh scattering; Roman Empire 27 BCE / Actium 31 BCE; primes;
additive vs subtractive color).
- Gated to the `qwen35` / gated-DeltaNet fused path; rollback and all non-SSM
archs untouched (they never construct the in-place op).
### Measured decode_agg (`S_TG t/s`, npp 128, ntg 128, -fa on, paged on, fusion off)
Dense `q36-27b-nvfp4`:
| npl | baseline | in-place | delta | % of vLLM (391 @128) |
|-----|----------|----------|---------|----------------------|
| 32 | 113.74 | 136.39 | +19.9% | - |
| 128 | 146.23 | 180.53 | +23.5% | 37.4% -> 46.2% |
The npl-128 result lands on the predicted copy-removal ceiling (~180 t/s).
MoE `q36-35b-a3b-nvfp4`:
| npl | baseline | in-place | delta |
|-----|----------|----------|---------|
| 32 | 246.79 | 279.41 | +13.2% |
| 128 | 313.36 | 372.62 | +18.9% |
### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
The D2D state-copy bucket collapsed:
| bucket | before | after |
|-------------------|---------------------|----------------------|
| MEMCPY D2D | 18.9% / 356 GB / 1584 ops | 0.23% / 2.93 GB / 734 ops |
The ~225 MB/copy recurrent-state copy-back is gone (122x fewer D2D bytes); the
residual D2D is the small conv-state copies. With it removed, the remaining decode
buckets are `gated_delta_net` 26.0%, FP4 matmul ~37.5%, and `get_rows` state
gather 18.8%.
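
The per-copy size and the 122x figure can be checked from the nsys table alone. A sketch, assuming the trace's "GB" means 1e9 bytes (the notes do not say whether it is GB or GiB):

```python
GB = 1e9  # assumption: decimal gigabytes

before_bytes, before_ops = 356 * GB, 1584
after_bytes, after_ops = 2.93 * GB, 734

# Average size of one D2D op before the patch, in MB.
mb_per_copy_before = before_bytes / before_ops / 1e6
# Average size of the residual (conv-state) copies after the patch, in MB.
mb_per_copy_after = after_bytes / after_ops / 1e6
# Reduction in total D2D bytes.
reduction = before_bytes / after_bytes

print(f"{mb_per_copy_before:.0f} MB/copy -> {mb_per_copy_after:.1f} MB/copy, "
      f"{reduction:.0f}x fewer bytes")
```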
## STEP 2 (not in this patch): fuse the get_rows state gather
The state gather is now the largest single non-GEMM bucket (18.8%). It is a pure
materialization: `build_rs` calls `ggml_get_rows(cache, s_copy_main)` to copy each
sequence's previous state into a contiguous scratch tensor before the op reads it.
`ggml_ssm_scan` already avoids this by taking the `ids` tensor (`src[6]`) and
reading the per-seq state directly from the full cache. The same fusion applies
here: give `ggml_gated_delta_net` an `ids` source, read `curr_state` from
`cache + ids[seq]*D` in the kernel, and pass the full cache via the `build_rs`
`get_state_rows` lambda (mirroring `mamba-base.cpp`). Predicted ceiling with both
steps: ~247 t/s (~63% of vLLM dense @128), GEMM untouched.
## Verdict on the path to parity
STEP 1 removes ~half of the SSM plumbing overhead and is the dominant, lowest-risk
lever; it is bit-exact and shipped here. STEP 2 (gather fusion) has a proven ggml
precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLLM
after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel
track. No paged/graph/block-table change can move decode on this model (full
attention is 0.4% of decode).
## STEP 2 (patch 0019): fuse the recurrent-state gather into the op
After Step 1 the largest non-GEMM decode bucket was the recurrent-state
`get_rows` gather (18.8% of decode GPU time): `build_rs` materialized each
sequence's prior state into a contiguous scratch via `ggml_get_rows` before the
gated-DeltaNet op read it. Step 2 eliminates that materialization, mirroring
`ggml_ssm_scan`'s `ids` source.
`ggml_gated_delta_net_inplace_ids` takes the FULL recurrent-state cache plus the
`s_copy` ids (`src[5]` = full cache `[S_v, S_v, H, n_rs_slots]`, `src[7]` = ids,
`op_param[1]` = `rs_head`) and reads each sequence's prior state directly from
`cache[ids[seq]]`. Combined with Step 1's in-place write the op now reads AND
writes the cache directly: no recurrent-state materialization at all. The
`build_recurrent_attn` fused path feeds the full cache and ids through the
`build_rs` `get_state_rows` lambda exactly like `mamba-base.cpp`, keeping the
`rs_zero` clear and the extra-states copy around the op.
### Race-free by construction (CUDA)
In-place write plus an ids read of the same cache is only safe when the read slot
equals the write slot. `s_copy(s) = cells[s + head].src0`, which is identity
(`rs_head + s`) for stable continuing sequences (the entire AR decode path) but
can remap on sequence reorder or `rs_zero` (e.g. multiple new sequences in one
prefill ubatch). The kernel handles both per (seq, head) block on device:
- identity sequences read `s0` in place from the destination slot `state_dst`
(the kernel loads all of `s0` into registers before it writes the new state,
so reading and writing the same slot is race-free) -- no materialization;
- non-identity sequences read from a disjoint scratch that a small
`gdn_gather_nonident_kernel` copies from `cache[ids[seq]]` first, so the
recurrence never reads a slot another block writes.
`ids` stays a device pointer (dereferenced only in the kernels; the input is
device-resident at op-execute time, so a host read segfaults). The CPU op
mirrors the same logic (host identity check + a serial gather in the dispatcher
for the non-identity case). The math is unchanged, so the result is bit-identical
to the `get_rows` path in every case.
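
The identity / non-identity split can be modelled on the host to see why it matches the `get_rows` path. This is a toy Python model, not the CUDA kernel: each "state" is a single float, `update` stands in for the recurrence, and both function names are hypothetical.

```python
def step_materialized(cache, ids, rs_head, update):
    """Reference path: gather every prior state first (get_rows), then write back."""
    gathered = [cache[i] for i in ids]
    for s, s0 in enumerate(gathered):
        cache[rs_head + s] = update(s0)
    return cache


def step_fused(cache, ids, rs_head, update):
    """Fused path: identity sequences read in place, the rest go through scratch."""
    n_seqs = len(ids)
    # Phase 1 (stands in for the gather kernel): copy non-identity sources into a
    # disjoint scratch before any destination slot is overwritten.
    scratch = {s: cache[ids[s]] for s in range(n_seqs) if ids[s] != rs_head + s}
    # Phase 2 (stands in for the recurrence): the prior state is loaded before
    # the store, so an identity sequence can read and write the same slot.
    for s in range(n_seqs):
        dst = rs_head + s
        s0 = scratch[s] if s in scratch else cache[dst]
        cache[dst] = update(s0)
    return cache


cache = [1.0, 2.0, 3.0, 4.0]
ids = [1, 0, 2]                      # seqs 0 and 1 swap sources, seq 2 is identity
update = lambda state: state * 10 + 1
print(step_fused(list(cache), ids, 0, update))
print(step_materialized(list(cache), ids, 0, update))
```

Dropping the scratch phase and reading `cache[ids[s]]` directly inside the loop would let sequence 1 read slot 0 after sequence 0 has already overwritten it, which is the hazard the gather kernel exists to prevent.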
Gated to the `qwen35` / `qwen35moe` fused decode/prefill path; `qwen3next`,
`kimi-linear`, the non-fused path and the rollback (`n_rs_seq > 0`) path are
untouched (they keep the materialized-state overload).
### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off)
Dense `q36-27b-nvfp4`:
| npl | Step 1 (baseline) | Step 2 | delta | % of vLLM (391 @128) |
|-----|-------------------|----------|---------|----------------------|
| 32 | 137.64 | 170.68 | +24.0% | - |
| 128 | 186.25 | 256.57 | +37.8% | 47.6% -> 65.6% |
The npl-128 result (256.57 t/s) beats the predicted ~247 t/s Step-2 ceiling.
MoE `q36-35b-a3b-nvfp4`:
| npl | Step 1 (baseline) | Step 2 | delta |
|-----|-------------------|----------|---------|
| 32 | 299.68 | 366.69 | +22.4% |
| 128 | 409.30 | 553.63 | +35.3% |
(Step-1 baselines re-measured in the same session; the brief's reference figures
were 136 / 180 dense and 279 / 373 MoE.)
### Bit-exact gate
Greedy (`--temp 0 --seed 1`) `llama-completion` output (256 tokens, paged on,
fusion off) vs the Step-1 build:
- dense `q36-27b-nvfp4`: model text byte-identical (md5 match);
- MoE `q36-35b-a3b-nvfp4`: byte-identical;
- Step-2 dense run1 == run2 (deterministic, no race).
### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
The recurrent-state gather bucket collapsed:
| kernel | Step 1 | Step 2 |
|----------------------------|----------|-----------------------------------------|
| `k_get_rows_float` | 18.8% | 0.7% (residual: embeddings / conv-state)|
| `gdn_gather_nonident` | - | 1.7% (no-op at decode, median ~1.2 us) |
| `gated_delta_net_cuda` | 26.0% | 22.5% |
| FP4 GEMM family | ~37.5% | ~48% (now the dominant residual) |
The SSM state gather is effectively eliminated. The residual decode gap to vLLM
is now the FP4 GEMM (~48% of decode), a separate kernel track.

@@ -1,126 +0,0 @@
# Track B P0 + P1 results: the FP4-MMA decode-GEMM occupancy tune (GB10, sm_121)
Measured on the DGX (GB10 / DGX Spark, sm_121, `~/llama-paged-dev`, branch `paged`). Implements
`FP4_GEMM_SCOPE_B.md` P0 (baseline + bit-exact gate) and P1 (the cheap host/occupancy tile tune).
Dev-tree commit: **089f78d** (`feat(paged): FP4 decode GEMM track-B P0 gate + default-off occupancy
instrumentation`). Patch artifact: `0017-fp4-gemm-decode-tile-tune.patch`.
**Headline verdict: the P1 occupancy kill-gate TRIPPED.** None of the cheap host/occupancy levers
lift dense or MoE decode_agg on GB10; every dense probe regresses and the nsys evidence shows the
FP4 GEMM kernel gets *slower* under register-capping. Nothing is enabled by default (the levers are
compile-time/env gated and the default build is byte-identical to stock). The one untested lever is
the structural `mmq_y`-down, which is **not** a host switch: it is coupled to `nwarps` by the
`nwarps*tile_C::I == mmq_y` static_assert, so it requires an `nwarps=4` warp-remap (P2 kernel work).
All benches: `llama-batched-bench -fa on -c 32768 -ngl 99 -npp 128 -ntg 128 -npl 32,128`.
`decode_agg = S_TG` (aggregate decode tok/s). 3 reps dense, 2 reps MoE; medians below.
## P0 baseline (mmq_y=128, minblocks=1 — stock)
### Bit-exact parity gate (CPU oracle vs CUDA, deterministic)
- `test-backend-ops -o MUL_MAT -b CUDA0`: **1115/1115** (1103 stock + 12 new NVFP4/MXFP4 dense
decode-shape cases), NVFP4 0 fail.
- `test-backend-ops -o MUL_MAT_ID -b CUDA0`: **805/805**, NVFP4 0 fail.
- New P0 cases exercise the weight-row (`mmq_y`) tiling boundary: `type_a ∈ {NVFP4, MXFP4}`,
`m ∈ {2048 (exact at mmq_y 64/128), 1600 (ragged vs 128), 2050 (ragged vs both 64 & 128 →
need_check last row-tile)}`, `n ∈ {32, 128}` (decode M), `k = 2048`. They make the oracle cover
the `mmq_y`/min-blocks changes and stay bit-exact with every lever on.
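
The 12 new cases and their "exact vs ragged" labels follow from the shape grid above. A short sketch that enumerates them; `ragged` is an illustrative helper, not a function in `test-backend-ops`:

```python
from itertools import product

TYPES = ("NVFP4", "MXFP4")
M_VALUES = (2048, 1600, 2050)   # weight rows
N_VALUES = (32, 128)            # decode batch
K = 2048


def ragged(m, mmq_y):
    """True when the last weight-row tile is partial for this tile height."""
    return m % mmq_y != 0


cases = [(t, m, n, K) for t, m, n in product(TYPES, M_VALUES, N_VALUES)]
print(len(cases), "cases")
for m in M_VALUES:
    print(m, {mmq_y: ragged(m, mmq_y) for mmq_y in (64, 128)})
```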
### Decode throughput (decode_agg = S_TG)
| model | npl32 | npl128 |
|---|---:|---:|
| DENSE q36-27b-nvfp4 | 117.3 | **149.5** |
| MoE q36-35b-a3b-nvfp4 (stock mmq_x=128/expert) | 262.6 | **336.3** |
(For reference the scope §6 cites dense 161 / MoE 333 from a server harness; this is the cleaner
batched-bench A/B baseline. The relative P0→P1 deltas below are what the kill-gate turns on.)
### nsys FP4 GEMM efficiency (dense, `-npp 64 -ntg 48 -npl 128`)
The decode FP4 weight GEMM kernel = `mul_mat_q<NVFP4(40), mmq_x=128, need_check=0>`:
- **33.2 %** of GPU kernel time, total **2.782 s** / 4576 inst, **avg 608 µs/launch**.
- Plus `quantize_mmq_nvfp4` 9.1 % (the act-quant bucket — track A's target), `mul_mat_q<…,16,…>`
5.8 % (prefill ubatch tiling), stream-k fixups ~0.5 %.
This is the locked baseline; P1 must lower the GEMM kernel time (raise FP4-eff) to pass.
## P1 — the cheap occupancy levers (all default-off, byte-identical when off)
Three bit-exact, gated levers were added (`mmq.cuh`):
- `GGML_CUDA_FP4_MMQ_Y` (default 128): type-aware `get_mmq_y_host/device` plumbing for an NVFP4
weight-row tile override. **Inert** — see "the mmq_y wall" below.
- `GGML_CUDA_FP4_MINBLOCKS` (default 1): NVFP4-only `__launch_bounds__` min-resident-CTAs lever
(register-caps the FP4-MMA kernel so >1 CTA co-resides). The bounded occupancy probe.
- `GGML_CUDA_FP4_DENSE_MMQ_X` (env, default off): dense col-tile re-read occupancy diagnostic
(the §4.1 A/B: does eating a 2× weight re-read at a smaller `mmq_x` buy net occupancy?).
P1 parity: with `MINBLOCKS=2` the gate stays **MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0
fail** — register allocation is result-neutral, so bit-exactness holds.
### DENSE decode_agg @ npl128 — every occupancy probe REGRESSES
| config | npl32 | npl128 | Δ vs P0 @npl128 |
|---|---:|---:|---:|
| P0 stock (mmq_y=128, minblocks=1) | 117.3 | **149.5** | — |
| MINBLOCKS=2 (2 resident CTAs via reg-cap) | 115.7 | 147.9 | **-1.1 %** |
| DENSE_MMQ_X=64 (2 col-tiles, 2× weight re-read) | 115.3 | 144.3 | **-3.5 %** |
| DENSE_MMQ_X=32 (4 col-tiles, 4× weight re-read) | 115.4 | 141.7 | **-5.2 %** |
### MoE decode_agg @ npl128 — mmq_x-down regresses; min-blocks neutral
| config | npl32 | npl128 | Δ vs stock @npl128 |
|---|---:|---:|---:|
| stock (mmq_x=128/expert) | 262.6 | **336.3** | — |
| TILE32 | 262.1 | 336.0 | -0.1 % |
| TILE16 | 261.1 | 324.0 | **-3.7 %** |
| TILE8 | 260.8 | 316.6 | **-5.9 %** |
| MINBLOCKS=2 | 260.0 | 337.7 | +0.4 % (noise) |
The MoE result reproduces patch 0015 exactly: q36-35b-a3b (256 tiny experts, GDN linear attention)
decode is GDN/bandwidth-bound, **not** col-tile-occupancy-bound, so tightening `mmq_x` below 64
(the brief's "8-16 ideal") monotonically *loses*. 64 ≈ 32 ≈ stock is the floor.
### nsys kill-gate evidence (the decisive datum)
`mul_mat_q<NVFP4,128,0>` under MINBLOCKS=2: **2.782 s → 3.025 s**, avg **608 µs → 661 µs
(+8.7 % SLOWER)**. The FP4-MMA kernel needs >128 regs/thread; forcing 2 CTAs/SM register-caps it,
which **spills to local memory**, so the GEMM does *more* work per launch — occupancy did not
usefully rise, it inverted. FP4-eff went **down**, not up. Kill-gate tripped, with hard evidence.
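
The per-launch averages and the +8.7 % figure can be reproduced from the two kernel totals. A sketch, assuming the MINBLOCKS=2 trace has the same 4576 kernel instances as the P0 trace (the notes only state the count for P0):

```python
LAUNCHES = 4576  # P0 instance count; assumed unchanged under MINBLOCKS=2


def avg_us(total_s, launches=LAUNCHES):
    """Average kernel time per launch in microseconds."""
    return total_s / launches * 1e6


p0 = avg_us(2.782)
capped = avg_us(3.025)
slowdown_pct = 100.0 * (capped / p0 - 1.0)
print(f"{p0:.0f} us -> {capped:.0f} us ({slowdown_pct:+.1f}%)")
```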
## Why P1 can't lift it (and why mmq_y-down is P2, not P1)
The two orthogonal occupancy probes both regress: register-capping (minblocks↑) spills, and
col-tile-shrinking (mmq_x↓) re-reads the 18 GB weight set. This says the **dense M=128 tile is
already weight-read / one-read-optimal at mmq_x=128** — it is not occupancy-starved in a way the
cheap levers can fix. This contradicts the scope's central "self-inflicted occupancy, recover it by
raising resident CTAs" hypothesis *for the cheap levers*.
The only lever that raises resident CTAs **without** spilling and **without** extra weight reads is
the structural `mmq_y`-down (smaller weight-row tile → smaller shared + smaller accumulator → more
CTAs, weights still read once). But `mmq_y` is **rigidly** `nwarps * tile_C::I = 8 * 16 = 128`
(the `mmq.cuh:3258` static_assert; `tile_C::I=16` is the fixed `m16n8k64` MMA shape). So
`mmq_y=64` requires **`nwarps=4`** — a warp-remap, not a host switch. That remap threads `nwarps`
through ~13 NVFP4-reachable sites including the **shared** `vec_dot_fp4_fp4_mma` (used by both NVFP4
and MXFP4) and the loader/kernel nwarps lockstep, with real risk of a silent shared-mem/thread-block
mismatch. It was scoped but **deferred to P2** (the scope's own phase table also places `mmq_y`-down
at P2, after the P1 host-only knobs). The `get_mmq_y` host/device plumbing is committed and inert so
P2 only has to add the `nwarps` half.
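
The coupling is simple to state as arithmetic. A sketch of the `nwarps * tile_C::I == mmq_y` constraint, with the fixed 16-row accumulator tile from the text and the standard 32-thread CUDA warp; `nwarps_for` is an illustrative name:

```python
TILE_C_I = 16     # rows of the m16n8k64 accumulator tile (fixed by the MMA shape)
WARP_SIZE = 32    # threads per CUDA warp


def nwarps_for(mmq_y, tile_i=TILE_C_I):
    """Warp count forced by nwarps * tile_C::I == mmq_y."""
    if mmq_y % tile_i != 0:
        raise ValueError(f"mmq_y={mmq_y} is not a multiple of tile_C::I={tile_i}")
    return mmq_y // tile_i


for mmq_y in (128, 64):
    nwarps = nwarps_for(mmq_y)
    print(f"mmq_y={mmq_y}: nwarps={nwarps}, threads/block={nwarps * WARP_SIZE}")
```

Halving `mmq_y` therefore halves the thread block as well, which is why every site that assumes the stock warp count has to move in lockstep.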
## Honest verdict vs the scope targets
- **DENSE:** P1 (host knobs + min-blocks + re-read diagnostic) does **not** move decode_agg toward
the 391 target — it slightly *regresses* (149.5 → 147.9, 38 % of vLLM). The scope's P1 row
(~177, "honest: small") was optimistic; on GB10 the cheap levers are net-negative. The remaining
upside lives entirely in the P2 `mmq_y`-down (nwarps=4) kernel remap **plus** track A. Whether
that clears the floor is now an *open, unproven* question — the cheap-lever evidence here leans
*against* large occupancy upside (the tile already looks one-read-optimal), so the P2 ceiling is
  plausibly lower than the scope's 316-328.
- **MoE:** the mmq_x-down lever (the brief's MoE P1) is a **confirmed dead-end on this model**
(regresses; GDN/BW-bound, reproduces patch 0015). min-blocks is neutral. No host-level MoE win.
**Kill-gate: TRIPPED on both arms.** Per the brief this is *not* forced into a default-on change.
Committed: the P0 bit-exact gate + the default-off instrumentation + this honest record. Not pushed.
## Reproduce
```
# default (byte-identical stock): build-cuda as-is -> MUL_MAT 1115/1115, MUL_MAT_ID 805/805
# occupancy probe: cmake build with -DGGML_CUDA_FP4_MINBLOCKS=2 (or flip the macro default)
# dense re-read A/B: GGML_CUDA_FP4_DENSE_MMQ_X=64 ./llama-batched-bench -m q36-27b-nvfp4.gguf ...
# nsys: nsys profile --trace cuda ... ; nsys stats --report cuda_gpu_kern_sum (watch mul_mat_q<40,128,0>)
```

@@ -1,315 +0,0 @@
# vLLM 0.23.0 eager-decode grounding: where the ~2.4x decode gap to llama.cpp comes from
Source-reading + grounding only (no GPU, no benchmarking, no llama code changes). This
decomposes vLLM 0.23.0's per-decode-step work in `enforce_eager` mode and attributes the
measured ~2.4x decode-throughput gap on GB10 (DGX Spark, sm_121) to its parts, so the
throughput thread can decide what llama.cpp would actually need (CUDA-graphed decode vs new
kernels) before anyone touches a kernel.
Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
LPDDR5x ~273 GB/s. vLLM install read: `/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm/`
(on `dgx.casa`, read-only). Evidence: engine logs `~/bench/h2h_dense_vllm.log`,
`~/bench/h2h_moe_vllm.log`; nsys decode trace `~/bench/decode_study/srv_decode2.sqlite`
(reproduced here via `cat2.py`); committed `QWEN36_NVFP4_BENCH.md`, `DECODE_GAP_STUDY.md`,
`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`.
## TL;DR (the evidence-based answer)
At batch ~128, ~1024 ctx, NVFP4, `enforce_eager` (no CUDA graphs on either side), vLLM decodes
~2.4x faster than llama.cpp. Decomposed:
1. **The gap is dominantly a KERNEL-efficiency gap, not a host-overhead gap.** The strongest
single datum: during steady llama decode the GPU is **~94.6% busy** (nvidia-smi, real run) /
85.5% in the nsys window (`DECODE_GAP_STUDY.md`; nsys adds gaps). A GPU that is already ~95%
busy has at most ~5% exposed host bubble, so a CUDA graph (which only removes host/launch
overhead) can recover at most that bubble. **CUDA-graphing llama's decode is therefore a
minority lever: on the order of ~5-15% of the step, i.e. roughly ~10-20% of the 2.4x.** The
remaining ~80-90% is the GPU spending its busy time in kernels that are simply slower per unit
work than vLLM's.
2. **vLLM's eager decode step is cheap on the host by construction**, so its host time is small
to begin with and hides behind the async CUDA stream: persistent pre-allocated input buffers
updated with vectorized numpy (no per-token Python), attention metadata built once per step and
shared across all layers, no GPU->CPU sync in the hot path, and a fixed small kernel-launch
sequence per layer (2 ops per Linear, 2 grouped Marlin launches for *all* MoE experts).
`async_scheduling` was **off** in this run (absent from both engine logs; default resolves to
the synchronous `Scheduler`, `config/scheduler.py:168-176`), so vLLM achieved the 2.4x with
*synchronous* per-step scheduling. The host advantage is structural, not pipelining.
3. **Where vLLM's kernels win:** (a) attention reads paged KV **in-kernel** via a block table in
one batched `flash_attn_varlen_func` launch, with **no gather/copy** (vLLM never pays llama's
paged `get_rows` + `cpy` tax, which is ~36% of llama's *paged* step); (b) the dense NVFP4 GEMM
is a **native FP4-MMA cutlass** kernel with the activation-quant **fused** into the preceding
RMSNorm/SiLU (no standalone `quantize_mmq` requant pass); (c) the MoE experts are **one grouped
Marlin kernel per projection for all experts** (W4A16, in-kernel dequant); (d) on these Qwen3.6
models a fraction of layers are **GDN linear-attention** whose decode is an **O(1)-in-context
recurrent state update**, not an O(ctx) KV read.
4. **Sampling is not the gap** on either side: vLLM samples all ~128 sequences with a handful of
batched on-GPU kernels (FlashInfer), and greedy costs the same as a heavy sampler chain. This
mirrors llama's own finding (`DECODE_GAP_STUDY.md`: greedy 1343 ms == 5-sampler 1346 ms).
## The measured gap (apples-to-apples, both eager)
From `QWEN36_NVFP4_BENCH.md` (matched NVFP4 weights, one GB10 box, vLLM 0.23.0
`--enforce-eager`, llama patch 0015 + budget-256), decode aggregate tok/s at npl128:
| model | llama (best) | vLLM | ratio | per-step (128 tok) llama -> vLLM |
|-------|-------------:|-----:|------:|----------------------------------|
| DENSE Qwen3.6-27B | 161.2 | 390.7 | **2.42x** | ~795 ms -> ~328 ms |
| MoE Qwen3.6-35B-A3B | 333.5 | 811.1 | **2.43x** | ~384 ms -> ~158 ms |
Both models converge to ~41% of vLLM at npl128 after llama's prefill-starvation is removed
(patch 0013), and at npl8 the kernels are at or near parity (dense 99%, MoE 84%). So the residual ~2.4x
is a steady-state decode property at high batch, not a prefill or scheduler artifact (the
scheduler was separately proven not to be the lever: a clean all-128-decoding run still tops out
at 157-161 dense / 333 MoE - `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`).
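The per-step column of the table follows directly from the aggregate tok/s, since one decode step at npl128 emits 128 tokens. A minimal sketch of that arithmetic, using only the throughput figures from the table (the helper name `step_ms` is illustrative, not from any benchmark script):

```python
BATCH = 128  # decode tokens emitted per step at npl128

# Aggregate decode tok/s copied from the table above.
measured = {
    "dense": {"llama": 161.2, "vllm": 390.7},
    "moe": {"llama": 333.5, "vllm": 811.1},
}


def step_ms(tokens_per_s: float, batch: int = BATCH) -> float:
    """Wall time of one decode step that emits `batch` tokens."""
    return 1000.0 * batch / tokens_per_s


for name, tps in measured.items():
    ratio = tps["vllm"] / tps["llama"]
    print(
        f"{name}: {step_ms(tps['llama']):.0f} ms -> "
        f"{step_ms(tps['vllm']):.0f} ms per step, ratio {ratio:.2f}x"
    )
```

The results land within a millisecond of the rounded values in the table, which is all the "~" in that column claims.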
## Confirmed configuration (both sides eager, no CUDA graphs)
vLLM, both models (engine logs):
- `enforce_eager=True`, `CompilationMode.NONE`, `cudagraph_mode=<CUDAGraphMode.NONE>`:
`"Enforce eager set, disabling torch.compile and CUDAGraphs ... -cc.mode=none
-cc.cudagraph_mode=none"`, `"Cudagraph is disabled under eager mode"`. So no torch.compile, no
inductor, no graph capture: the model runs as pure eager dispatch of custom ops.
- Attention: `"Using FLASH_ATTN attention backend out of ['FLASH_ATTN','FLASHINFER','TRITON_ATTN',
'FLEX_ATTENTION']"`, `"Using FlashAttention version 2"`.
- Dense weight GEMM: `"Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM"` (native W4A4
cutlass FP4-MMA), `"Enabled custom fusions: norm_quant, act_quant"`, FlashInfer autotuned the
`fp4_gemm` (16 configs) at startup.
- MoE weight GEMM: `"Using 'MARLIN' NvFp4 MoE backend out of ['FLASHINFER_TRTLLM',...,'MARLIN',
'EMULATION']"` with `"Your GPU does not have native support for FP4 computation ... Weight-only
FP4 compression will be used leveraging the Marlin kernel"` (so MoE experts = W4A16 weight-only
Marlin: in-kernel dequant + bf16 MMA), plus `"FlashInferFP8ScaledMM"` for the FP8 attention
linears.
- Both models are **hybrid GDN**: `"Using Triton/FLA GDN prefill kernel"` and `"Setting attention
block size to 784/1056 tokens to ensure attention page size >= mamba page size"` (dense 784, MoE
1056). A decode-time `fused_recurrent_gated_delta_rule_packed_decode_kernel` is JIT-compiled.
- Sampling: `"Using FlashInfer for top-p & top-k sampling."`
- `async_scheduling` not present in either log -> synchronous `Scheduler`.
llama side (the brief's premise, corroborated by `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` review):
`-fa on`, paged KV, eager (no engaged CUDA graphs at batched decode). The `DECODE_GAP_STUDY.md`
nsys run explicitly set `GGML_CUDA_DISABLE_GRAPHS=1` to match.
## Decomposition of vLLM's eager decode step
All file paths below are under
`/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm/`. The driver is
`v1/worker/gpu_model_runner.py::execute_model` (line 4005): host preprocess under
`synchronize_input_prep()`, then `_model_forward` under `set_forward_context`, then `compute_logits`;
sampling is a separate `sample_tokens` (line 4357). Under eager, `_determine_batch_execution_and_padding`
(line 3768) dispatches `CUDAGraphMode.NONE`, and `_model_forward` (line 3718) just calls
`self.model(...)` directly: no capture, no replay, same code every step.
### (a) Attention - one batched in-kernel paged-decode launch + O(1) GDN layers
- **Full-attention layers (FA2):** `v1/attention/backends/flash_attn.py`. `FlashAttentionImpl.forward`
(667-848) issues **one** `flash_attn_varlen_func` (796-818) over all ~128 decode tokens, passing
`key_cache`/`value_cache` (the raw paged block pools, **not gathered**), `cu_seqlens_q`,
`seqused_k`, and **`block_table=attn_metadata.block_table`**. The kernel walks the block table to
fetch each sequence's KV pages directly. In-kernel paged read confirmed: there is **no gather/copy**
in the Python layer; the only KV write is `reshape_and_cache_flash` (a scatter of the new token via
`slot_mapping`). FA2 disables vLLM's AOT host scheduler (`aot_schedule = (fa_version==3)` is False,
333), so `schedule()` returns `None` (445-469): the per-step metadata `build()` (388-575) is **pure
reference/scalar assembly**, no Python loop over the 128 sequences, no host scheduling, no sync.
- **Built once per step, reused across layers:** `supports_update_block_table=True` (300); the first
full-attn layer calls `build()`, every later layer reuses it via `update_block_table()` (577-586,
a `copy.copy`). So `build()` runs **once per decode step** for the whole KV group, not per layer.
- **GDN linear-attention layers (the hybrid half):** `model_executor/layers/mamba/gdn/
qwen_gdn_linear_attn.py`, kernels in `model_executor/layers/fla/ops/fused_recurrent.py`. Pure decode
takes `_forward_core_decode_non_spec` (1644-1696): two state-update kernels only -
`causal_conv1d_update` + `fused_recurrent_gated_delta_rule_packed_decode` (Triton kernel 255-336,
grid `(NV, B*HV)` = one batched launch over all 128 rows). Each program updates a **fixed-size
[K,V] recurrent state** (`b_h *= exp(g); b_h += (beta*(v - h.k)) outer k; o = h.q`) - **no loop over
the 1024 past tokens, no KV read.** This is **O(1) in context length**, while FA2 streams ~ctx KV
per head per row. On these Qwen3.6 models the GDN layers make a chunk of the decode cost flat in
ctx, a structural cheapness llama only gets if its GGUF implements GDN the same way (see lever 5 in
the ranked list).
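The recurrent update quoted in the GDN bullet can be written out in a few lines of NumPy. This is a single-head reference sketch of that formula only, assuming a `[K, V]` state and scalar `g` and `beta` per step; it is not the Triton kernel, and the function name is made up for illustration:

```python
import numpy as np


def gdn_decode_step(h, q, k, v, g, beta):
    """One gated-delta-rule decode update for a single head.

    h: [K, V] recurrent state; q, k: [K]; v: [V]; g, beta: scalars.
    The cost depends only on K and V, never on how many tokens came before.
    """
    h = h * np.exp(g)                        # b_h *= exp(g)  (gated decay)
    pred = h.T @ k                           # h.k: what the state currently returns for k
    h = h + np.outer(k, beta * (v - pred))   # b_h += (beta*(v - h.k)) outer k
    o = h.T @ q                              # o = h.q
    return h, o


rng = np.random.default_rng(0)
K, V = 8, 4
h = np.zeros((K, V))
for _ in range(1024):                        # 1024 decode steps, state never grows
    q, k = rng.normal(size=K), rng.normal(size=K)
    k /= np.linalg.norm(k)
    h, o = gdn_decode_step(h, q, k, rng.normal(size=V), g=-0.1, beta=0.5)
print(h.shape, o.shape)
```

The state keeps its `[K, V]` shape after any number of steps, which is the sense in which the layer is O(1) in context length.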
### (b) Weight GEMM - native FP4-MMA (dense) / grouped Marlin (MoE), M-batched, fused quant
- **Dense NVFP4 linear:** `model_executor/layers/quantization/modelopt.py::ModelOptNvFp4LinearMethod.apply`
(1226-1232) -> `model_executor/kernels/linear/nvfp4/flashinfer.py::apply_weights` (56-89): exactly
two GPU ops - `scaled_fp4_quant` (activation -> packed FP4 + blockscale) then
`flashinfer_scaled_fp4_mm` (the autotuned `fp4_gemm`, a **native W4A4 cutlass FP4-MMA** whose
**dequant is fused into the MMA epilogue** via the precomputed `alpha = in_gscale*w_gscale`). The
activation-quant is itself folded away: `compilation/passes/fusion/rms_quant_fusion.py:98`
(`norm_quant`: RMSNorm -> `scaled_fp4_quant` fused) and `act_quant_fusion.py:40,128`
(`act_quant`: SiLU+mul -> FP4 fused). **There is no standalone full-tensor requantize pass** like
llama's `quantize_mmq`, and the weight is never dequantized to a temp buffer.
- **MoE experts (Marlin W4A16):** `model_executor/layers/fused_moe/experts/marlin_moe.py`.
`fused_marlin_moe` (227) does **one** `moe_align_block_size` token-sort then `_fused_marlin_moe`
(59) issues **exactly two grouped kernels** - `moe_wna16_marlin_gemm` for gate_up (137) and for
down (194) - **each a single launch covering ALL experts** (it walks `expert_ids`/`sorted_token_ids`
internally; no Python loop over experts), with a `silu_and_mul` between and a `moe_sum` reduce
after. W4A16 means weights are dequantized in-kernel and activations stay bf16 (never requantized).
- **Decode-M batching (the key throughput property):** the dense GEMM reshapes activations to (M, K)
with M = total decode tokens (~128) and reads each FP4 weight **once for all 128 tokens**; the MoE
grouped GEMM reads each routed expert's weight **once** for the ~M*topk/E tokens routed to it. At
M~128 with FP4 weights these are weight-read (memory) bound, as expected with the GB10's
~273 GB/s LPDDR5x as the floor, but the bytes are amortized over the whole batch. This is the
ideal case and it is the same regime llama is in, so the GEMM gap is kernel efficiency (fused
quant + native FP4 MMA), not a batching defect.
- **Host cost per layer (eager):** each `Linear.apply()` dispatches at most 2 `torch.ops` kernels; a
dense layer's GEMM+norm/act portion is ~7-11 launches, a MoE expert block is ~5-6 launches **for all
experts combined** (expert count does not multiply launches). Fixed, small, no per-tile/per-expert
Python.
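The amortization argument in the decode-M bullet can be made concrete with a back-of-envelope calculation. The sketch below takes the ~19 GB/step weight-read figure and the ~273 GB/s bandwidth quoted elsewhere in this document and assumes, as an idealization, that weight traffic per step does not depend on M (the dense case; a MoE step reads only the routed experts):

```python
BANDWIDTH_GB_S = 273.0      # GB10 LPDDR5x, as quoted in this document
WEIGHT_GB_PER_STEP = 19.0   # "~19 GB/step" weight read, as quoted in this document


def weight_read_floor_ms(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound on step time if the step did nothing but stream the weights once."""
    return 1000.0 * weight_gb / bandwidth_gb_s


floor_ms = weight_read_floor_ms(WEIGHT_GB_PER_STEP, BANDWIDTH_GB_S)
print(f"weight-read floor: {floor_ms:.1f} ms per step")
for batch in (1, 8, 128):
    # Idealized: the same bytes are read once per step regardless of batch size.
    print(f"  M={batch:>3}: {floor_ms / batch:.2f} ms of weight read per token")
```

Under those assumptions the floor is the same per step at any M, so the per-token share of it shrinks linearly with batch size; both engines sit in that regime at M~128.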
### (c) Sampling - fully batched on-GPU, negligible
`v1/sample/sampler.py::Sampler.forward` (72) operates on the whole `[num_seqs, vocab]` logits
tensor: batched `argmax` (greedy, 240) or temperature `div_` + one FlashInfer
`top_k_top_p_sampling_from_logits` (`v1/sample/ops/topk_topp_sampler.py:493`) + `torch.where`
(296-301). **No per-sequence Python loop** in the hot path. Per-seq params live as pre-staged GPU
tensors `temperature/top_p/top_k[num_seqs]` (`v1/worker/gpu_input_batch.py:184-205`), copied once via
non-blocking H2D and rebuilt only on batch change (`refresh_metadata`, 815-829). Greedy and the full
chain are the same batched-op class. Sampled-token D2H is async (CUDA-event gated, 243-313);
detokenization runs on CPU in the async output processor (`v1/engine/output_processor.py`). Sampling
is a negligible tail and does not stall the GPU loop - exactly as on the llama side.
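To illustrate what "no per-sequence Python loop" means in practice, here is a NumPy sketch of sampling a whole `[num_seqs, vocab]` logits tensor with per-sequence temperatures in a fixed number of array ops. It is not vLLM's sampler: the Gumbel-max trick stands in for the FlashInfer top-k/top-p kernel, and the convention that temperature 0 means greedy is an assumption of this sketch.

```python
import numpy as np


def sample_batch(logits, temperature, rng):
    """logits: [num_seqs, vocab]; temperature: [num_seqs], 0 meaning greedy."""
    greedy = logits.argmax(axis=-1)                       # one batched argmax
    safe_t = np.where(temperature > 0, temperature, 1.0)  # avoid dividing by zero
    scaled = logits / safe_t[:, None]                     # batched temperature scaling
    # Gumbel-max: argmax(scaled + Gumbel noise) is a draw from softmax(scaled).
    u = rng.uniform(low=np.finfo(float).tiny, high=1.0, size=logits.shape)
    random = (scaled - np.log(-np.log(u))).argmax(axis=-1)
    return np.where(temperature > 0, random, greedy)      # per-row select, like torch.where


rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 32000))
temperature = np.where(np.arange(128) % 2 == 0, 0.0, 0.8)  # mix greedy and sampled rows
tokens = sample_batch(logits, temperature, rng)
print(tokens.shape)
```

The op count is the same for one sequence or 128, which is why greedy and a heavier chain fall in the same cost class.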
### (d) Host / Python per-step loop - cheap by construction, hidden behind the async stream
`execute_model` host prep, all incremental on persistent buffers (`_prepare_inputs`, 1872+):
- `block_table.commit_block_table` started **first** to overlap its copy with following CPU work
(1890); each step appends only newly-allocated block ids (`append_row`), usually <=1 at decode.
- positions / token gather are **vectorized numpy + a single `torch.index_select`** into the
pre-allocated `input_ids.cpu` (1928-1939); `query_start_loc`/`seq_lens` set by slice ops
(1979-1990). `slot_mapping` is one Triton kernel (`v1/worker/block_table.py`). **No per-token, no
per-request Python loop** in the steady decode path.
- `CommonAttentionMetadata` assembled once (2287-2305), then the attention builder runs once per KV
group (see (a)).
- The forward runs under `set_forward_context(...)` with `cudagraph_runtime_mode=NONE`; `_model_forward`
is a direct `self.model(...)`.
- **No GPU->CPU sync in the hot path:** the sampled-token copy is `non_blocking` + event-gated;
`execute_model` returns after launching the forward, and the cheap host prep for the next step
overlaps the GPU executing the current step on the async CUDA stream (CUDA launches are
non-blocking). `async_scheduling` was off, so this overlap is just ordinary CUDA async, not
pipelined scheduling - yet it is enough because the host work is so small.
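The "vectorized, persistent-buffer" host prep described in the list above can be pictured with the slot-mapping computation for a pure-decode step. The source notes that vLLM computes `slot_mapping` in a Triton kernel; the NumPy version below is only an illustration of the same indexing, with a hypothetical block size and made-up block ids:

```python
import numpy as np

BLOCK_SIZE = 16  # hypothetical page size in tokens


def decode_slot_mapping(block_table, positions, block_size=BLOCK_SIZE):
    """Flat KV-pool slot for the one new token of every sequence.

    block_table: [num_seqs, max_blocks] physical block ids.
    positions:   [num_seqs] position of the token being written this step.
    """
    rows = np.arange(block_table.shape[0])
    block_ids = block_table[rows, positions // block_size]  # one fancy-index, no loop
    return block_ids * block_size + positions % block_size


block_table = np.array([[3, 7], [5, 2]])
positions = np.array([17, 4])       # persistent buffer, updated in place each step
slots = decode_slot_mapping(block_table, positions)
print(slots)                        # [113  84]
positions += 1                      # next step: no reallocation, no per-request Python
```

Sequence 0 writes position 17 into its second block (id 7) at offset 1, sequence 1 writes position 4 into its first block (id 5) at offset 4.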
What llama-server's per-step C++ loop pays that vLLM does not (host side, graph-addressable):
ggml rebuilds/reallocates the compute graph each decode step and dispatches ~1k kernel launches from
the loop on the weak Grace ARM cores (`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` review). vLLM's persistent
buffers + build-once-reuse metadata + fixed launch sequence are exactly the things that keep its eager
step host-cheap; llama could borrow these (persistent device KV/block metadata, build the ggml graph
once and reuse it, zero per-step host sync) to shrink the bubble **without** a full CUDA graph.
## The llama side, for the split (nsys, reproduced)
`~/bench/decode_study/cat2.py` over `srv_decode2.sqlite` (Qwen3-32B dense, pure full-attention, 64
layers, batch 32, 1024 ctx, paged, eager), reproduced here:
```
window_span_s 24.960 sum_kernel_s 21.348 gpu_busy_pct 85.5
ATTENTION (flash_attn_ext_f16) 10.177 s 47.7%
kv_copy_cast (cpy_*) 3.903 s 18.3%
embed_gather_rows (get/set) 3.803 s 17.8% <- the PAGED gather tax
GEMM_weight (mul_mat) 3.173 s 14.9%
GEMM_act_quant (quantize_mmq) 0.172 s 0.8%
rmsnorm/silu/rope/add ~0.12 s ~0.6%
```
So on llama's paged decode step: ~84% is KV/attention (attention 47.7% + KV copy 18.3% + paged
gather 17.8%), ~16% is weight GEMM, and the host loop is **hidden** (GPU 85-94% busy; greedy ==
heavy-sampler step time). Mapping each bucket to vLLM:
| llama bucket (paged) | nsys % | vLLM equivalent | vLLM avoids it? |
|----------------------|------:|-----------------|-----------------|
| paged KV gather (`get_rows`) | 17.8% | block table read **in-kernel** | **Yes, entirely** (no such op) |
| KV copy/cast (`cpy_*`) | 18.3% | KV written once into block pool, read in place | Mostly |
| decode attention (`flash_attn_ext_f16`) | 47.7% | FA2 paged-decode varlen (+ O(1) GDN layers) | Same op, faster kernel; GDN is cheaper still |
| weight GEMM + act quant | 15.7% | fused native-FP4 / grouped Marlin, no separate requant | Faster + removes the requant kernel |
| host serving loop / sampling | ~0 (hidden) | cheap persistent-buffer prep, batched GPU sampling | Both hidden; vLLM also cheap |
Note: the nsys decomposition is on **Qwen3-32B (pure attention)**; the 2.4x throughput numbers are on
**Qwen3.6 hybrid GDN** models. The bucket *shares* differ between the two (GDN shifts work off
attention), but the lesson - llama's step is GPU-bound on attention + the paged gather + FP4 GEMM,
with the host hidden - transfers.
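The grouped percentages used in this section (~84% KV/attention, ~36% gather tax, ~16% GEMM, 85.5% busy) can be re-derived from the per-bucket kernel seconds in the nsys listing. A small sketch, with the numbers copied from that listing and bucket names shortened for readability:

```python
# Kernel seconds per bucket, copied from the nsys listing above.
kernel_s = {
    "attention": 10.177,
    "kv_copy_cast": 3.903,
    "paged_gather": 3.803,
    "gemm_weight": 3.173,
    "gemm_act_quant": 0.172,
}
SUM_KERNEL_S = 21.348
WINDOW_SPAN_S = 24.960

share = {name: 100.0 * s / SUM_KERNEL_S for name, s in kernel_s.items()}

gather_tax_pct = share["kv_copy_cast"] + share["paged_gather"]
kv_attention_pct = share["attention"] + gather_tax_pct
gemm_pct = share["gemm_weight"] + share["gemm_act_quant"]
gpu_busy_pct = 100.0 * SUM_KERNEL_S / WINDOW_SPAN_S

print(f"KV/attention {kv_attention_pct:.1f}%  gather tax {gather_tax_pct:.1f}%")
print(f"GEMM {gemm_pct:.1f}%  GPU busy in window {gpu_busy_pct:.1f}%")
```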
## The split of the 2.4x: kernel vs host (graph-addressable)
Anchored on the measured **~94.6% GPU busy** during steady llama decode (nvidia-smi,
`DECODE_GAP_STUDY.md`):
- **Host / CUDA-graph-addressable: the minority, ~5-15% of the llama step (=> ~10-20% of the 2.4x).**
A GPU that is ~95% busy exposes at most ~5% host idle; a CUDA graph (capture-once, replay) removes
per-step launch latency + ggml graph rebuild/realloc and can tighten inter-kernel gaps, plausibly
recovering ~5-15% of the step in the best case. On llama's ~795 ms dense step that is ~40-120 ms of
the ~467 ms gap. **A CUDA graph cannot close a 2.4x gap**, because the gap is mostly the GPU's busy
time, not idle. (The fraction shrinks further at batch 128 vs the nsys batch 32: the per-step launch
count is fixed while per-kernel work grows, so host overhead is a smaller share at higher batch.)
- **Kernel efficiency: the majority, ~80-90% of the 2.4x.** The GPU's busy time goes into kernels that
are slower per unit work than vLLM's, decomposed:
- **the paged gather regression (~36% of llama's *paged* step; `get_rows`+`cpy`)** - vLLM never pays
it because it reads paged KV in-kernel. This is the single biggest discrete, llama-specific,
addressable chunk, but removing it only restores llama's own *stock* path; stock is still ~2x off
vLLM (`DECODE_GAP_STUDY.md`).
- **long-context decode-attention** (the largest residual; attention is ~48% of the step and grows
with ctx) - llama's `flash_attn_ext_f16` decode is slower than vLLM's FA2 paged-decode on sm_121,
and slower still than the O(1) GDN layers on these models.
- **the FP4 weight GEMM floor** (~15-30%) - vLLM fuses the activation-quant into the norm/SiLU and
uses native FP4-MMA / grouped Marlin; llama runs `mul_mat_q` + a separate `quantize_mmq` requant.
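The host-side bound can be checked with the numbers already given. The sketch below assumes the best case for a CUDA graph, namely that it removes all GPU idle time and speeds up no kernel, and evaluates it at both busy measurements (94.6% from nvidia-smi, 85.5% in the nsys window) on the dense step times from the benchmark table:

```python
LLAMA_STEP_MS = 795.0   # dense llama step at npl128, from the table above
VLLM_STEP_MS = 328.0    # dense vLLM step at npl128, from the table above


def graph_upper_bound(busy_frac: float):
    """Best case for a CUDA graph: all idle time disappears, kernels unchanged."""
    idle_ms = (1.0 - busy_frac) * LLAMA_STEP_MS
    gap_ms = LLAMA_STEP_MS - VLLM_STEP_MS
    speedup = LLAMA_STEP_MS / (LLAMA_STEP_MS - idle_ms)
    return idle_ms, idle_ms / gap_ms, speedup


for busy in (0.946, 0.855):
    idle_ms, gap_share, speedup = graph_upper_bound(busy)
    print(
        f"busy {busy:.1%}: idle {idle_ms:.0f} ms, "
        f"{gap_share:.0%} of the gap, at most {speedup:.2f}x"
    )
```

Even taking the lower busy figure, the recoverable idle time is a fraction of the ~467 ms gap, so the remainder has to come from kernel time.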
## Ranked list: what llama would need to close the 2.4x, and how much each buys
1. **Do not pay the paged gather at decode. [largest discrete, llama-addressable; ~36% of the paged
step]** Either disable paged KV for decode-latency workloads, or read paged blocks **in-kernel via
a block table** like vLLM (no `get_rows`/`cpy`). This is a kernel change (a real in-kernel
paged-decode read), not a graph change. Caveat: it only brings the paged path back to llama-stock;
stock is still ~2x off vLLM, so this is necessary but not sufficient.
2. **Faster long-context decode-attention kernel. [biggest residual; partly structural]** A proper
flash-decoding / split-K-over-KV, GQA-grouped, in-kernel-paged decode kernel for sm_121 (this also
subsumes lever 1). Deep CUDA work, gated by kernel maturity on Blackwell-class parts. This is where
the context-scaling gap lives and where most of the 2.4x is.
3. **Fused FP4 weight GEMM. [bounded; ~15-30%]** Fold the activation-quant into the preceding norm/SiLU
(vLLM's `norm_quant`/`act_quant`) and into the GEMM epilogue; use native FP4-MMA where the part
supports it. Removes the separate `quantize_mmq` pass. Bounded below by weight-read bandwidth
(~19 GB/step over 273 GB/s).
4. **CUDA-graph the steady-state pure-decode step. [smallest, cheapest; ~10-20% of the gap]** Capture
the all-128-decoding step once and replay (it is already fixed-shape at steady decode - the
scheduler does not need to change to enable this, per `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` P3).
Recovers the ~5% GPU-idle bubble + ggml per-step graph rebuild/realloc + launch latency on the weak
Grace cores. A real, independent, low-risk win, but bounded by the ~95%-busy measurement: it does
**not** close the kernel gap. Cheaper host-side half-measures that need no graph: persistent device
KV/block metadata, build the ggml graph once and reuse it, and remove any per-step host sync (mirror
vLLM's persistent-buffer + build-once-reuse + non-blocking-D2H pattern).
5. **Verify llama's GDN/linear-attention decode path. [architectural, model-specific]** On these
Qwen3.6 hybrids vLLM runs the linear-attention layers as an O(1)-in-ctx recurrent state update. If
llama's GGUF runs those layers as full attention (O(ctx)) rather than a recurrent state, that is a
per-layer decode cost vLLM structurally avoids on exactly these models - check before attributing
the whole residual to the full-attention kernel.
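Levers 1 and 2 both hinge on reading KV pages where they live instead of gathering them first. The NumPy sketch below contrasts the two shapes for a single head and a single query: a gather-then-attend path (the `get_rows` + `cpy` pattern) and a block-table walk with an online softmax. It is a CPU illustration under assumed shapes (`[num_blocks, block_size, d]` pools, equal K and V head dims), not either engine's kernel, and the Python loop stands in for what a GPU kernel would do per block.

```python
import numpy as np


def attend_gathered(k_pool, v_pool, block_row, seq_len, q):
    """Copy the sequence's pages into a contiguous buffer, then attend."""
    d = q.shape[0]
    k = k_pool[block_row].reshape(-1, d)[:seq_len]   # fancy indexing copies the pages
    v = v_pool[block_row].reshape(-1, d)[:seq_len]
    s = k @ q / np.sqrt(d)
    p = np.exp(s - s.max())
    return (p / p.sum()) @ v


def attend_in_place(k_pool, v_pool, block_row, seq_len, q):
    """Walk the block table and read each page in place, with an online softmax."""
    block_size, d = k_pool.shape[1], q.shape[0]
    m, l, acc = -np.inf, 0.0, np.zeros(d)
    for i, blk in enumerate(block_row):
        n = min(block_size, seq_len - i * block_size)
        if n <= 0:
            break
        s = k_pool[blk, :n] @ q / np.sqrt(d)   # basic slicing: a view, no gathered copy
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)              # rescale running sums to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_pool[blk, :n]
        m = m_new
    return acc / l


rng = np.random.default_rng(0)
k_pool = rng.normal(size=(8, 4, 16))   # 8 physical blocks of 4 tokens, head dim 16
v_pool = rng.normal(size=(8, 4, 16))
block_row = np.array([5, 1, 6])        # this sequence's pages, in logical order
q = rng.normal(size=16)
same = np.allclose(
    attend_gathered(k_pool, v_pool, block_row, 10, q),
    attend_in_place(k_pool, v_pool, block_row, 10, q),
)
print(same)
```

Both paths compute the same attention output; the difference is only whether an intermediate contiguous copy of the KV is materialized, which is the cost the ranked list asks llama to stop paying.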
## Honest bottom line
The ~2.4x eager decode gap is **dominantly a kernel-efficiency gap (~80-90%), not a host-overhead
gap.** The decisive evidence is that llama's GPU is already ~94.6% busy during steady decode, so the
CUDA-graph-addressable host slice is a minority (~10-20% of the gap), recoverable but bounded. The
bulk of vLLM's advantage is concrete kernel work: an in-kernel paged-decode read that eliminates
llama's gather/copy tax (~36% of the paged step), a faster long-context decode-attention kernel, a
fused native-FP4 GEMM, and (on these specific models) O(1)-in-ctx GDN linear-attention layers. vLLM's
host loop is cheap by construction (persistent buffers, build-once-reuse metadata, no hot-path sync,
fixed small launch sequence) and it achieved the 2.4x with *synchronous* scheduling and *no* CUDA
graphs - so the host is not where vLLM's lead comes from, and a CUDA graph is the cheapest but
smallest of llama's available levers, not the silver bullet. The throughput effort should be scoped
as kernel work (in-kernel paged-decode read + flash-decoding attention + fused FP4 GEMM) with a
CUDA-graphed steady-state decode as a separate, bounded, lower-risk add-on.
## Key source citations (on dgx.casa, read-only)
- Eager driver / host loop: `v1/worker/gpu_model_runner.py` execute_model 4005, _model_forward 3718,
_prepare_inputs 1872, _determine_batch_execution_and_padding 3768, sample_tokens 4357,
synchronize_input_prep 3704; `v1/worker/block_table.py`; `v1/worker/gpu_input_batch.py:184-205`.
- Attention: `v1/attention/backends/flash_attn.py` (forward 667-848, varlen call 796-818, builder
388-575, update_block_table 577-586); `model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`
(decode 1644-1696); `model_executor/layers/fla/ops/fused_recurrent.py` (kernel 255-336).
- GEMM: `model_executor/kernels/linear/nvfp4/flashinfer.py:56-89`;
`model_executor/layers/quantization/modelopt.py` (NvFp4 LinearMethod 1103-1232, MoE 1381-1666);
`model_executor/layers/fused_moe/experts/marlin_moe.py` (59-225, 227-360, 732-895);
`compilation/passes/fusion/rms_quant_fusion.py:98`, `act_quant_fusion.py:40,128`.
- Sampling: `v1/sample/sampler.py:72-302`; `v1/sample/ops/topk_topp_sampler.py:55,460-497`;
`v1/sample/metadata.py`; `v1/engine/output_processor.py`.
- Config: `config/scheduler.py:146,168-176` (async_scheduling default -> sync Scheduler).
- Evidence: `~/bench/h2h_dense_vllm.log`, `~/bench/h2h_moe_vllm.log`, `~/bench/decode_study/cat2.py`
over `srv_decode2.sqlite`; this worktree `QWEN36_NVFP4_BENCH.md`, `DECODE_GAP_STUDY.md`,
`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`.


@@ -125,7 +125,7 @@ For getting started, see the available backends in LocalAI here: https://github.
LocalAI supports various types of backends:
- **LLM Backends**: For running language models (e.g., llama.cpp, vLLM, SGLang, transformers, MLX)
- **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:<tokens>]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md` for the quality/throughput profile).
- **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:<tokens>]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp/patches/paged/README.md` for the quality/throughput profile).
- **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo)
- **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS)
- **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step)


@@ -20,7 +20,7 @@
# persist their state as bf16 (LLAMA_SSM_BF16_TAU), halving that head's decode byte
# stream. Default off (0) = every head f32 = bit-exact; when enabled the mode is NOT
# bit-exact (~91% same-top-p, beats vLLM dense) - see
# backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md for the quality profile.
# backend/cpp/llama-cpp/patches/paged/README.md for the quality profile.
# The two NVFP4 entries below intentionally stay bit-exact (no ssm_bf16_tau).
# =============================================================================
- name: "qwen3.6-27b-nvfp4-paged"