docs(paged): consolidate the dev-trail docs into one canonical README

The paged-attention patch directory had accumulated ~55 scattered dev docs (results, progress, scope, lever, and gap-analysis notes). Consolidate the durable content of all of them into one canonical backend/cpp/llama-cpp/patches/paged/README.md covering:

- what the patchset is
- the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM decode path, NVFP4 FP4-MMA, the decode-first scheduler)
- the full 0001-0030 patch series table with bit-exact status
- the GB10 benchmarks (patched-vs-stock-vs-vLLM + the Apple M4 architectural note)
- the dev notes (bit-exact methodology, the per-path gate, the MoE-parity conclusion, the rejected/flat levers, the opt-in bf16-SSM mode)
- arch+quant generality
- the pin + canary maintenance policy
- the published NVFP4 gallery models

Delete the consolidated-away dev trail. Keep the three operational docs the README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md (per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the ship-as-own-backend design-of-record), plus the benchmark plots + csv. The .patch files and the unit/bench .cpp are untouched.

Repoint every external reference to a deleted doc at the new README: grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a patch-internal path matcher, not a repo-doc link) is left intact.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:57:14 -04:00 · 2026-06-27 09:23:30 +00:00
parent a5a5b2ad80
commit fb2dc33d52
62 changed files with 325 additions and 12885 deletions
--- a/.github/scripts/paged-canary-apply.sh
+++ b/.github/scripts/paged-canary-apply.sh
@@ -28,7 +28,7 @@
 # build on 0019's code, the rejection cascades to them too. This is a
 # PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
 # upstream break (see backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
-# and PIN_BUMP_APPLY_CHECK.md). We exclude ONLY that dev-doc path and still
+# and README.md). We exclude ONLY that dev-doc path and still
 # apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
 # still fails the canary. prepare.sh tolerates the same hunk via
 # `patch ... || true`; this mirrors that tolerance precisely.
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -850,7 +850,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
        // common_context_params_to_llama (patch 0026) only when the --ssm-bf16-tau CLI flag is unset.
        // Unset / non-positive => env untouched, so stock stays byte-identical and bit-exact (an
        // externally exported LLAMA_SSM_BF16_TAU still works as an escape hatch). NOTE: this mode is
-        // NOT bit-exact (~91% same-top-p ceiling); see patches/paged/A_HYBRID_SSM_RESULTS.md.
+        // NOT bit-exact (~91% same-top-p ceiling); see patches/paged/README.md (Dev notes).
        } else if (!strcmp(optname, "ssm_bf16_tau") || !strcmp(optname, "ssm_hybrid_tau")) {
            if (optval != NULL) {
                try {
--- a/backend/cpp/llama-cpp/patches/README.md
+++ b/backend/cpp/llama-cpp/patches/README.md
@@ -57,7 +57,7 @@ All variants (avx/avx2/avx512/cuda/…) copy the patched `llama.cpp/` tree, so t
 - **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
  **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
 - **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
-  (`ADDITIVE_DESIGN.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
+  (see `paged/README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
  subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
  `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
  `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
--- a/backend/cpp/llama-cpp/patches/paged/0003-gather-read-plan.md
+++ b/backend/cpp/llama-cpp/patches/paged/0003-gather-read-plan.md
@@ -1,102 +0,0 @@
-# Patch 0003 — paged gather-read: exact implementation plan
-
-**Goal:** a sequence attends only its own (compacted) cells via `ggml_get_rows`, instead of the scattered
-`[0,n_kv)` window. Token-identical (attention is permutation-invariant over the KV set). **Gated**: stock
-path stays byte-identical (no new ops unless `LLAMA_KV_PAGED`).
-
-**Base:** applies on top of 0001+0002 at the pin. Dev tree: `backend/cpp/llama-cpp-paged-dev` (branch `paged`).
-
-## Design
-
-The gather is keyed off one runtime index list (the sequence's used cells, in a fixed order), exposed as a
-graph input (mirroring `k_idxs`). In `build_attn`, gather K, V **and the kq_mask** by that same index, so all
-three stay aligned. `n_gathered` replaces `n_kv` for the attention. Only active when the cache is in paged
-mode (a new `is_paged()` flag set when `LLAMA_KV_PAGED`/find_slot used permuted placement).
-
-ggml note: `ggml_get_rows(a,b)` gathers `a`'s **ne1** by `b` (I32). Raw K is `[n_embd_k_gqa, kv_size, n_stream]`
-→ ne1 = cells → direct. The mask is `[n_kv, n_tokens, 1, n_stream]` → n_kv is **ne0**, so gather as
-`transpose → get_rows → transpose`.
-
-### KEY CORRECTIONS (found while implementing — these change the edits)
-
-1. **Gather index = ALL used (non-empty) cells in `[0,n_kv)`, NOT `sinfo.idxs`.** `sinfo.idxs` is only the
-   *current ubatch's write slots*; attention reads the *full history*. The query set per token is masked by
-   `kq_mask`, so gathering the union of all used cells + gathering the mask the same way is token-identical
-   and drops exactly the empty (already-masked) cells. So: `gather = { i in [0,n_kv) : !cells.is_empty(i) }`.
-
-2. **Static-graph size is fine because llama.cpp rebuilds the graph every ubatch.** `n_gather` (used-cell
-   count) is therefore a build-time constant for that ubatch — `build_input_gather_idxs` sizes the I32
-   tensor to `get_n_gather()` computed at build, `set_input_gather_idxs` fills the identical cell list. They
-   MUST use the same loop (`for i in [0,n_kv): if !is_empty(i) push i`) so build-order == fill-order.
-
-3. **K/V gather can live entirely in `build_attn`, no cache get_k change.** The `get_k` 4d view is contiguous
-   in `[ne0,ne1,ne2]` from cell 0 (nb2 == n_embd_head*n_head_kv*elemsz), so for **single stream (ns==1)**:
-   `reshape_3d(k, n_embd_head*n_head_kv, n_kv, 1) → get_rows(., gi) → reshape_4d(., n_embd_head, n_head_kv, n_gather, 1)`.
-   Multi-stream (ns>1) breaks contiguity (nb3 uses kv_size) → gate to ns==1 first, multi-stream follow-up.
-
-4. So the ONLY cache additions are `is_paged()`, `get_n_gather(n_kv)`, `build/set_input_gather_idxs(n_kv)`;
-   everything else (K/V/mask gather) is in `build_attn`. `set_input_kq_mask` is **unchanged** (built over
-   n_kv, then gathered). Smaller than the 7-edit estimate above.
-
-## Edits
-
-### 1. `src/llama-kv-cache.h` — declare gather infra (in `llama_kv_cache`)
-```cpp
-    bool        is_paged() const { return paged_active; }            // near get_size()
-    ggml_tensor * build_input_gather_idxs(ggml_context * ctx, const slot_info & sinfo) const;
-    void          set_input_gather_idxs (ggml_tensor * dst, const slot_info & sinfo) const;
-    uint32_t      get_n_gather(const slot_info & sinfo) const;       // == sum of used cells gathered
-```
-Add member `mutable bool paged_active = false;` and in `llama_kv_cache_context` forward the three (like
-`build_input_k_idxs`/`get_n_kv`).
-
-### 2. `src/llama-kv-cache.cpp`
-- In `find_slot`, in the paged branch (0002), set `paged_active = true;` on success.
-- `get_n_gather(sinfo)` = `sinfo.idxs[0].size()` summed over streams (the count actually placed).
-- `build_input_gather_idxs`: `ggml_new_tensor_1d(ctx, GGML_TYPE_I32, get_n_gather(sinfo)); ggml_set_input(...)`.
-- `set_input_gather_idxs`: fill `data[k++] = strm_off + sinfo.idxs[s][i]` for every placed cell (same order
-  the mask/k/v will see). This is the canonical gather order.
-
-### 3. `src/llama-graph.h` — `llm_graph_input_attn_kv`
-Add `ggml_tensor * gather_idxs = nullptr;` + `ggml_tensor * get_gather_idxs() const { return gather_idxs; }`.
-
-### 4. `src/llama-graph.cpp`
-- `llm_graph_input_attn_kv::set_input`: if `mctx->is_paged()` → `mctx->set_input_gather_idxs(gather_idxs, ...)`.
-- `build_attn_inp_kv` (creates the input): if `mctx_cur->is_paged()` → `inp->gather_idxs =
-  mctx_cur->build_input_gather_idxs(ctx0, ...)`.
-- `build_attn` (the kv overload, ~2356): after `k`,`v`,`kq_mask`:
-```cpp
-if (ggml_tensor * gi = inp->get_gather_idxs()) {
-    k = ggml_get_rows(ctx0, k, gi);                                   // [d, n_gather, ...] (reshape view ok)
-    v = v_trans ? /* gather columns */ : ggml_get_rows(ctx0, v, gi);
-    ggml_tensor * m = ggml_cont(ctx0, ggml_transpose(ctx0, kq_mask)); // [n_tokens, n_kv]
-    m = ggml_get_rows(ctx0, m, gi);                                   // [n_tokens, n_gather]
-    kq_mask = ggml_cont(ctx0, ggml_transpose(ctx0, m));              // [n_gather, n_tokens]
-}
-ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
-```
-Note: `get_k` returns the reshaped 4d view; gather must run on a cell-major shape. Simplest: add a paged
-variant `get_k(ctx,il)` that returns `ggml_get_rows` of the **raw** `layers[ikv].k` then reshapes to
-`[n_embd_head, n_head_kv, n_gather, ns]`. Do the gather in the cache, not the graph, for K/V; keep only the
-mask gather in the graph. (Cleaner — revisit during impl.)
-
-### 5. V-transposed path
-When `!flash_attn`, V is stored transposed `[kv_size, n_embd_v_gqa]`; gather its **rows** (ne1 = n_embd) won't
-work — gather columns via the same idx on the non-transposed store, OR force `is_paged()` to require
-flash-attn for the first cut (`GGML_ASSERT`) and handle v_trans in a follow-up.
-
-## Verification (the gate)
-```sh
-cmake --build build-cpu --target llama-simple -j
-M=Qwen3-0.6B.Q4_K_M.gguf ; P="<the 0002 prompt>"
-build-cpu/bin/llama-simple -m $M -n 64 "$P" > a.txt                    # stock
-LLAMA_KV_PAGED=1 build-cpu/bin/llama-simple -m $M -n 64 "$P" > b.txt   # paged gather-read
-diff a.txt b.txt        # MUST be identical
-```
-Also assert (debug) that `n_gather < n_kv` on a multi-chunk sequence (proves compaction, not identity).
-Export only when identical: `git format-patch HEAD~1 -o patches/ --start-number 3 -N`.
-
-## Risks
-- Mask transpose/layout: if `b.txt` diverges, dump the gathered mask vs expected for token 0; off-by-order
-  means the `set_input_gather_idxs` order ≠ the get_k gather order — they MUST use the identical loop.
-- flash-attn vs not: do flash-attn first (simpler mask), then v_trans.
--- a/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md
+++ b/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md
@@ -1,347 +0,0 @@
-# A.2 - CUDA-graphing the paged decode: measured lever + gap diagnosis
-
-Phase 1 (measure, do not punt). DGX GB10 (sm_121), dev tree `~/llama-paged-dev`
-HEAD 089f78d (patch 0017), `build-cuda`. Model `q36-27b-nvfp4.gguf` (dense),
-harness `llama-batched-bench`, fusion held OFF (`LLAMA_FUSE_NVFP4_QUANT=0`) for a
-clean stock-kernel baseline. `decode_agg` = the `S_TG t/s` column.
-
-## TL;DR verdict
-
-CUDA-graphing the paged decode is **NOT a real throughput lever** (ceiling well
-under 1%). The steady decode step is **GPU-compute-bound: 99.4-99.5% GPU-busy**.
-Total GPU idle is ~0.5-0.6% of the step, split into within-step launch gaps
-(0.37%, the only thing CUDA graphs remove) and a between-step host-loop gap
-(0.24%, one ~2 ms gap per step). Graphs already engage on the default paged
-decode and do collapse the launch gaps (0.37% -> 0.11%), but the GPU stays
-99.4-99.5% busy either way, so decode_agg is unchanged. The 2.6x gap to vLLM
-(148 vs 391) lives in the per-step GPU **kernel work** (FP4 GEMM + attention at
-batch 128), not in launch overhead or the host loop.
-
-The premise that "the paged decode runs eager (graphs reused=0)" did not survive
-measurement: at the benchmarked context the default paged decode captures and
-replays graphs exactly like stock non-paged. Two measurement traps (below)
-explain the earlier "reused=0 / gap-bound" reading.
-
-## Method note: a graph-enable trap that was corrected
-
-`GGML_CUDA_DISABLE_GRAPHS` is read with `getenv(...) != nullptr`
-(`ggml/src/ggml-cuda/common.cuh:1234`), so setting it to an **empty** string
-still disables graphs. A first 4-cell pass that used
-`GGML_CUDA_DISABLE_GRAPHS=""` for the "graphs ON" cells therefore ran graphs OFF
-in all four cells (an OFF-vs-OFF comparison). The numbers below ("v2") unset the
-variable with `env -u` for the ON cells. The `-lv 99` probe is unaffected (it
-never set the variable).
-
-## Step 1 - the 4-cell decode_agg table (corrected, graphs genuinely enabled)
-
-npp 128, ntg 128, npl 32 and 128, c 40960, b/ub 2048, fa on. `S_TG t/s`:
-
-| cell             | npl 32  | npl 128 |
-|------------------|---------|---------|
-| stock_graphon    | 116.47  | 148.41  |
-| stock_graphoff   | 115.17  | 148.21  |
-| paged_graphon    | 116.21  | 148.60  |
-| paged_graphoff   | 114.62  | 147.65  |
-
-ON vs OFF (the graph win):
-
-| config | npl 32 | npl 128 |
-|--------|--------|---------|
-| stock  | +1.13% | +0.13%  |
-| paged  | +1.39% | +0.64%  |
-
-- (a) Does STOCK get a graph win? Essentially no: +0.13% at npl 128, +1.13% at
-  npl 32 (small-batch, where per-kernel launch overhead is relatively larger).
-  All within run-to-run noise (~1% at npl 32, ~0.2% at npl 128).
-- (b) Does PAGED get a graph win? Same picture: +0.64% / +1.39%. Paged is NOT
-  eager at this config (see Step 2); it captures graphs like stock.
-- (c) LEVER SIZE (proxy = stock graph win, now genuinely measured): +0.13% at
-  npl 128, +1.1% at npl 32. Negligible vs the 2.6x (=+164%) gap to vLLM.
-
-All four cells sit at ~148 (npl 128) / ~115 (npl 32) within ~1%. The ~148 wall is
-shared by stock and paged; it is not paged-specific. Calibration cross-check
-(paged ON, ntg 64): 147.64, matching the reference 148-149.
-
-## Step 2 - why the "eager" premise is wrong, and what actually mutates
-
-CUDA-graph state machine (`ggml_backend_cuda_graph_compute` in
-`ggml/src/ggml-cuda/ggml-cuda.cu`): warmup completes after a step whose node
-properties did not change vs the previous step; any later change logs
-`CUDA graph warmup reset` and reverts to eager until stable again.
-`ggml_cuda_graph_update_required` memcmps every node's full `ggml_tensor` plus
-each src's `data` ptr / `ne` / `nb`.
-
-`-lv 99` probe, short context (npp 64, ntg 32, ctx <= 96):
-- stock:  `warmup complete` x2, `warmup reset` x0.
-- paged:  `warmup complete` x2, `warmup reset` x0.
-Both capture and then replay silently. The `CUDA Graph id N reused` line stays 0
-for both because llama rebuilds the cgraph each ubatch (new `cgraph->uid`), so
-the uid fast-path never fires; the graph is still replayed via the
-`instance != nullptr` path, which logs nothing. **"reused=0" is a false negative,
-not evidence of eager execution.** (Trap #1.)
-
-Cadence probe (npp 200, ntg 320, npl 4, ctx 200->520, crosses the 256 and 512
-token boundaries), counts over ~320 decode steps:
-
-| path                          | complete | reset | interpretation                |
-|-------------------------------|----------|-------|-------------------------------|
-| paged in-kernel (default)     | 10       | 8     | resets only at 256-boundaries |
-| paged gather (KV_PAGED_GATHER)| 0        | 0     | never captures -> pure eager  |
-| stock non-paged               | 10       | 8     | identical 256-cadence         |
-
-The 8 resets cluster at the two boundary crossings (timestamps ~9.9 s and ~34 s),
-not per-step. The default paged decode is therefore captured for ~97% of steps,
-re-warming only every ~256 tokens, with the **same cadence as stock**.
-
-What mutates (the block-table / gather input):
-- in-kernel decode (default): the block-table graph input
-  `idx = ggml_new_tensor_2d(ctx0, I32, n_view, n_stream)` with
-  `n_view = GGML_PAD(n_gather, 256)` (`src/paged-attn.cpp:199,213`). Its `ne[0]`
-  steps 256 -> 512 -> 768 only when the context crosses a 256-token boundary. The
-  kq_mask input (ne0 = n_kv, also padded to 256) steps in lockstep. So the
-  property change is per-256-tokens, not per-step.
-- gather fallback (`LLAMA_KV_PAGED_GATHER=1`, transposed-V, or prefill): the
-  index input `idx = ggml_new_tensor_2d(ctx0, I32, n_gather, n_stream)`
-  (`src/paged-attn.cpp:106`) has `ne[0] = n_gather` (UNPADDED), which grows every
-  step (the unit's own comment, `src/paged-attn.cpp:28-30`: "n_gather grows every
-  step"). That changes a node property every step, warmup never completes, and
-  the path runs pure eager. This is the only "graphs reused=0" path, and it is
-  not the default decode path.
-
-`LLAMA_KV_PAGED_DEBUG` dump at ctx 201 (first 2 decode calls, identical across
-the pair): `in-kernel decode n_stream=4 n_kv=256 n_gather=201` -> block-table
-`ne[0] = GGML_PAD(201,256) = 256`, stable until n_gather crosses 256.
-
-## Step 3 - where the step time goes (nsys), and a second trap
-
-npl 128, ntg 24, ctx 56 (< 256, so the ON run stays captured after warmup).
-Idle split by gap size: within-step launch gaps < 1 ms, between-step host gaps
->= 1 ms. Steady window = 40%-97% of the trace span (excludes model load / graph
-reserve / prefill one-offs).
-
-Trap #2: `nsys --trace=cuda` does NOT emit the kernels INSIDE a replayed CUDA
-graph into `cuda_gpu_trace` by default. The graphs-ON trace had only 15,279 GPU
-rows vs 84,946 for the identical OFF workload and reported a bogus 0.3% GPU-busy.
-Re-profiling the ON case with `--cuda-graph-trace=node` restored all 84,946 rows
-and 99.5% busy. **Any "decode is idle/gap-bound" reading taken from a graphs-ON
-nsys trace without `--cuda-graph-trace=node` is artifactually idle-inflated** -
-the likely source of the earlier "freed GPU time became idle gaps" conclusion.
-
-Reliable steady-state numbers:
-
-| trace                          | GPU rows | busy   | within-step idle | between-step idle | host gap/step |
-|--------------------------------|----------|--------|------------------|-------------------|---------------|
-| OFF (eager)                    | 84,946   | 99.4%  | 0.37%            | 0.24%             | ~2.0 ms       |
-| ON (captured, node-trace)      | 84,946   | 99.5%  | 0.11%            | 0.38%             | ~1.9 ms       |
-
-- CUDA graphs replay (cudaGraphLaunch=46) and collapse the launch path: ON has
-  ~15k kernel launches/run vs OFF ~80k (cudaLaunchKernel 6,024 vs 31,764, plus
-  ExC 9,049 vs 48,165). That cuts within-step launch idle from 0.37% to 0.11%.
-- But the GPU is 99.4-99.5% busy in both, so decode_agg is unchanged.
-- Between-step host idle is one ~2 ms gap per decode step (the 128-way sample +
-  update_slots + batch build), 0.24-0.38% of the ~896 ms step.
-
-Per-step decomposition at npl 128: ~896 ms/step, of which ~890 ms is GPU kernel
-compute, ~2 ms host-loop gap, ~3 ms (eager) / ~1 ms (captured) launch gaps.
-
-## The load-bearing question, answered
-
-Within-step or between-step? **Neither is large.** The steady decode is 99.4%
-GPU-busy; the entire idle budget is ~0.6% of the step. CUDA graphs already remove
-the within-step launch fraction (0.37% -> 0.11%), and the between-step host gap is
-~2 ms/step (0.24%). There is no large gap for a host-loop rewrite to reclaim
-either; the host loop is currently **hidden under GPU compute** (the GPU stays
-busy while the host syncs/schedules). It would only become a lever once the
-kernels are fast enough to drop GPU-busy below the host time, i.e. it is a
-second-order floor, not the present bottleneck.
-
-## Verdict
-
-1. CUDA-graphing the paged decode is not the lever. Graphs already engage on the
-   default decode; capturing reduces within-step launch idle from 0.37% to 0.11%
-   but leaves the GPU 99.4-99.5% busy, so decode_agg moves by ~0% (measured
-   +0.1% to +0.6% at npl 128, +1.1% to +1.4% at npl 32, all within noise).
-2. The between-step host loop is not the present lever either (0.24%, ~2 ms/step,
-   hidden under GPU compute). It is the candidate floor only after the kernels
-   speed up.
-3. The decode is GPU-compute-bound at this NVFP4 fusion-OFF baseline. The 2.6x
-   gap to vLLM is in the per-step GPU kernel work (FP4 GEMM + attention at batch
-   128). That, not graphs and not the host loop, is the throughput lever.
-4. Corrected premises: paged is not perpetually eager (it captures with a
-   256-token reset cadence identical to stock); "graphs reused=0" was a uid
-   fast-path false negative; and a graphs-ON nsys trace under-counts GPU-busy
-   unless `--cuda-graph-trace=node` is set.
-
-No code patch in Phase 1 (graphs are not the lever, so there is no paged
-graph-capture patch to land). Evidence: `~/bench/a2_4cell_v2/`, `~/bench/a2_probe`,
-`~/bench/a2_probe2`, `~/bench/a2_nsys/*.nsys-rep` on the DGX.
-
-# Phase 2 - the real decode lever, located (per-kernel decomposition)
-
-Phase 1 ended on "decode is GPU-compute-bound; the 2.6x gap to vLLM lives in the
-per-step GPU kernel work (FP4 GEMM + attention at batch 128)." Phase 2 measured
-that per-step GPU work directly - per kernel and per memcpy, on the Phase-1 nsys
-`.sqlite` reps - and the "FP4 GEMM + attention" attribution does not survive the
-measurement. Two corrections, then the lever.
-
-The conditional Phase 2 fix (make the paged decode graph-capturable) is moot:
-Phase 1 already showed the default paged decode captures, and the fresh re-check
-below reconfirms the graph win is noise. Neither Phase 2 branch (within-step graph
-fix / between-step host loop) is the lever; the lever is a third thing, measured
-here.
-
-## Fresh re-confirmation: graphs are not the lever
-
-Independent run (npl128, ntg32, paged, fusion off), not reusing Phase 1's table:
-
-| paged decode  | S_TG t/s | vs vLLM 391 |
-|---------------|----------|-------------|
-| graphs ON     | 146.03   | 37.3%       |
-| graphs OFF    | 144.90   | 37.1%       |
-
-+0.78%, within noise - same verdict as Phase 1's 4-cell. The ON nsys rep is also
-99.5% busy with the same ~3267 ms of memcpy as OFF: graphs capture the memcpy
-nodes too, so they cannot remove either the copies or the compute.
-
-## Correction 1: the model is a hybrid SSM, not a plain transformer
-
-`q36-27b-nvfp4.gguf` has `general.architecture = qwen35` with
-`qwen35.ssm.{conv_kernel,state_size,group_count,time_step_rank,inner_size}`. The
-decode-window kernel cadence (per step, ~19.8 steps in the window) is 48
-`gated_delta_net_cuda` + 48 `ssm_conv_f32` vs 16 `flash_attn_tile`, i.e. **48
-gated-DeltaNet linear-attention layers : 16 full-attention layers** (a 3:1
-hybrid, Qwen3-Next family). Paged attention only touches the 16 full-attention
-layers.
-
-## Correction 2: the 99.4% "busy" is ~19% D2D memcpy, not compute
-
-Interval-union sweep over the steady decode window (last 17 s of the npl128/ntg24
-OFF rep; single CUDA stream; running-max-end so it is overlap-correct):
-
-| activity set           | GPU busy | idle  |
-|------------------------|----------|-------|
-| kernels only           | 80.2%    | 19.8% |
-| kernels + memcpy (all) | 99.4%    | 0.6%  |
-
-The 969 inter-kernel gaps (>=1 ms, ~48/step) that drop kernels-only to 80% are
-filled by **D2D memcpy: 1584 copies/run (~80/step), ~230 MB each, ~2 ms each,
-356 GB moved in 17 s**. At batch 128 a ~230 MB block is the gated-DeltaNet
-recurrent state; these are the per-SSM-layer state copies. (HtoD copies = the
-paged block-table/index upload: 731/run but only 3 ms total, negligible; DtoH
-47 ms.) Phase 1's `cuda_gpu_trace`-based 99.4% counted these memcpys as "busy"
-and lumped them into "GPU kernel compute" - they are memory movement, and they
-are addressable.
-
-## Decode GPU-time decomposition (% of kernel+memcpy busy)
-
-OFF/eager rep, steady window. `/step` = instances per decode step.
-
-| share | activity                          | /step | role                          |
-|-------|-----------------------------------|-------|-------------------------------|
-| 23.4% | gated_delta_net_cuda              | 48    | linear-attn recurrence        |
-| 21.9% | k_get_rows_float                  | 97    | SSM state / conv-state gather |
-| 18.9% | MEMCPY DtoD                       | 80    | SSM recurrent-state copy      |
-| 15.5% | mul_mat_vec_q (nvfp4, ncols=1)    | 48    | FP4 GEMV                      |
-| 10.4% | mul_mat_q (nvfp4)                 | 352   | FP4 GEMM                      |
-|  1.9% | quantize_mmq_nvfp4                | 448   | act requant for MMQ           |
-|  1.0% | concat_cont                       | 48    | SSM state glue                |
-|  0.8% | ssm_conv_f32                      | 48    | SSM short conv                |
-|  0.7% | unary_gated_op silu               | 112   | SSM gating                    |
-|  0.4% | flash_attn_tile/_ext              | 16    | FULL attention (paged)        |
-
-Grouped:
-- gated-DeltaNet / SSM machinery (recurrence + get_rows gather + DtoD state copy
-  + conv + gating glue): **~67% of decode**.
-- FP4 matmul (GEMV + GEMM + requant + stream-k fixup): **~28%**.
-- Full attention - everything paged attention optimizes: **~0.4%**.
-
-## Verdict and scope of the real lever
-
-1. CUDA graphs: not the lever (Phase 1, re-confirmed: +0.78%, noise). They capture
-   the memcpy too, so they cannot touch the copies or the compute.
-2. Host loop: not the lever (true host idle in the union is 0.24%, ~41 ms/17 s).
-3. FP4 GEMM: secondary, ~28%. Consistent with Track B P2a (making the FP4 GEMM 26%
-   faster left decode_agg flat) - it was never the long pole.
-4. Paged / full attention: ~0.4% of decode. **No paged-attention change (graphs,
-   block-table stabilization, gather rewrite) can move decode_agg on this model**
-   - it optimizes under half a percent of the step. This is the structural reason
-   A.2, and the paged-decode track generally, cannot close the vLLM gap on
-   q36-27b: the model barely uses the path being optimized.
-
-The throughput lever is the ggml **qwen35 gated-DeltaNet decode**. Per SSM layer
-per step it re-materializes and D2D-copies the full recurrent state (~230 MB at
-batch 128; ~80 copies/step, ~18 GB/step) and feeds the recurrence through ~2
-`get_rows` gathers, so ~61% of decode (state copy + state gather + recurrence) is
-SSM state plumbing. vLLM's gated-DeltaNet decode (the flash-linear-attention
-`fused_recurrent_gated_delta_rule` path) keeps the state in place and fuses the
-gather into the scan, avoiding both the per-layer D2D copy and the gathers.
-
-Next-step scope (the real lever, to be done in the ggml/llama qwen35 SSM path -
-not paged-attn, not a graph capture, not a block-table tweak):
-1. Eliminate the per-layer recurrent-state D2D copy: update the state tensor
-   in place (or double-buffer / write-back), so the recurrence consumes and
-   produces the persistent state without a full-state copy each layer each step.
-2. Fuse the `get_rows` state / conv-state gather into the recurrent kernel.
-
-Ceiling from this rep (upper bound; assumes the work is fully removed, not just
-overlapped):
-- remove the DtoD state copy: reclaim 18.9% -> ~146 to ~180 t/s.
-- remove copy + gather: reclaim ~41% -> ~146 to ~247 t/s, which puts llama within
-  ~1.6x of vLLM 391 with the FP4 GEMM still untouched.
-
-No code patch in Phase 2 either: the lever is a gated-DeltaNet decode rewrite in
-the SSM path, too large for this measurement pass and orthogonal to paged
-attention. `patches/paged/0018` stays free. Evidence on the DGX:
-`~/bench/a2_decompose/decode_decomp.txt` (per-kernel table + reproducing SQL in
-its header), `~/bench/a2_decompose/SUMMARY.txt`, and the Phase-1 reps
-`~/bench/a2_nsys/paged_off_npl128.sqlite` / `paged_on_npl128_node.sqlite`.
-
-# A.2 final synthesis - the four-point verdict
-
-All numbers measured on the DGX (GB10, sm_121, q36-27b-nvfp4 dense, fusion OFF,
-`decode_agg` = `S_TG t/s`), npl 128 unless noted.
-
-**1. CUDA-graph lever size (measured, not guessed).** +0.13% (4-cell, stock
-ON-vs-OFF) to +0.78% (fresh paged re-check) at npl 128; +1.1% to +1.4% at npl 32.
-All inside run-to-run noise. The earlier grounding GUESSED ~10-20% from a
-94.6%-busy reading; direct measurement puts the steady decode at 99.4-99.5% busy,
-so the real graph ceiling is < 1%, not 10-20%. The guess was wrong because the
-busy-fraction it rested on was under-read (a graphs-ON nsys trace under-counts
-GPU-busy unless `--cuda-graph-trace=node` is set - trap #2).
-
-**2. Was "paged decode runs eager" fixed, and what is the decode_agg win?**
-There was nothing to fix: the premise was false. At the benchmarked context the
-DEFAULT in-kernel paged decode already captures and replays graphs, with a
-256-token reset cadence identical to stock non-paged (10 complete / 8 reset over
-~320 steps, resets clustered only at the 256/512 token boundaries). "graphs
-reused=0" was a uid fast-path false negative, not eager execution (trap #1). The
-only genuinely-eager path is the `LLAMA_KV_PAGED_GATHER=1` fallback (unpadded
-index grows every step), which is not the default decode. Because graphs were
-already engaged, the decode_agg win from "enabling" them is ~0 (+0.1% to +0.8%).
-Graphs DID collapse within-step launch idle (0.37% -> 0.11%, ~80k -> ~15k
-launches/run), but the GPU stays 99.4-99.5% busy, so throughput is unchanged.
-
-**3. New llama %-of-vLLM @npl128.** Unchanged by A.2: 146-148.6 t/s vs vLLM 391 =
-**37.3-38.0%**. Graphs ON vs OFF both land here (146.03 / 144.90 in the fresh
-re-check; 148.41 / 148.21 in the 4-cell). A.2 did not move the percentage.
-
-**4. Honest verdict - did A.2 move toward parity; residual + next lever.** No.
-A.2 closed zero of the 2.6x gap, and it provably cannot on this model: paged /
-full attention is ~0.4% of decode (16 full-attention layers vs 48 gated-DeltaNet
-layers, a 3:1 hybrid SSM), so no graph / block-table / gather change to the paged
-path can move decode_agg. The residual gap is structural and lives elsewhere:
-~67% of decode is gated-DeltaNet / SSM state plumbing (23.4% recurrence + 21.9%
-get_rows state gather + 18.9% D2D recurrent-state copy of ~230 MB per SSM layer
-per step, ~18 GB/step), and ~28% is FP4 matmul (already shown secondary by Track
-B: a 26%-faster GEMM left decode_agg flat). The within-step launch loop is solved
-(graphs) and the between-step host loop is a 0.24% second-order floor hidden under
-GPU compute - neither is the residual.
-
-The next lever is NOT in this track. It is the ggml qwen35 gated-DeltaNet decode:
-(1) eliminate the per-layer recurrent-state D2D copy (in-place / double-buffer
-write-back), and (2) fuse the get_rows gather into the recurrent kernel - mirroring
-vLLM's `fused_recurrent_gated_delta_rule`, which keeps the state in place and
-fuses the gather. Measured ceiling on this rep: remove the copy -> ~146 to ~180
-t/s; remove copy + gather -> ~146 to ~247 t/s (within ~1.6x of vLLM with FP4 GEMM
-still untouched). That work is orthogonal to paged attention; `patches/paged/0018`
-stays free.
--- a/backend/cpp/llama-cpp/patches/paged/ADDITIVE_DESIGN.md
+++ b/backend/cpp/llama-cpp/patches/paged/ADDITIVE_DESIGN.md
@@ -1,107 +0,0 @@
-# Additive layout for the paged-KV patch series - "hook, don't edit"
-
-Goal: ship paged KV as a vendored patch series that **survives llama.cpp pin bumps with
-minimal rebase pain**. PR #22569 (the upstream draft) was rejected by maintainers as
-"slop" and is far too invasive to vendor - it rewrites core attention. Our series must be
-the opposite: **additive**. This document is the design rule and the per-patch core-touch
-budget.
-
-## The rule
-
-> Every change is either (a) **new code in a new vendored file** under `src/`, or (b) a
-> **single, env-gated hook** at one call site in a core file that delegates to the new
-> file. No logic lives in a core file. No core struct/signature is edited.
-
-Why it works: a hook is a 1-3 line diff against a core file. When upstream churns that file,
-`git apply` either still lands the hook (context unchanged) or fails *only on that tiny
-hunk*, which is trivial to re-place. Logic embedded inside a core function (the PR #22569 /
-old-0003 approach) conflicts on every bump and must be re-understood each time.
-
-This is enforceable as a **core-touch budget**: each patch declares the core files it
-touches and the line count; review rejects anything that grows logic in core.
-
-## Why it's achievable here (grounded in the pinned source)
-
-The two seams paged KV needs are both already abstract in llama.cpp at the pin
-(`LLAMA_VERSION=f3e1828`), so new behavior plugs in without editing core types:
-
- **KV placement** - `llama_kv_cache::find_slot` already returns a `slot_info` of physical
-  cell indices. Paged placement is just *different indices*. 0002 already does this as one
-  gated block (`if (paged_mode) { ... continue; }`, 41 lines, one file). Ideal.
- **Graph inputs** - `llm_graph_input_i` is a pure-virtual base (`set_input()`), and
-  `llm_graph_result::add_input(llm_graph_input_ptr)` lets *any* code register a new input
-  subclass. So a paged graph input (the gather index) can be **a new class in a new file**,
-  added from a one-line hook - no edit to `llm_graph_input_attn_kv` or `llama-graph.h`.
-
-## Per-patch core-touch budget
-
-| # | Patch | New files (additive) | Core hooks (gated, minimal) | Core lines |
-|---|-------|----------------------|------------------------------|-----------:|
-| 0001 | vendor manager | `paged-kv-manager.{h,cpp}` | `CMakeLists.txt` +1 | 1 |
-| 0002 | block placement | - | one `if(paged_mode){...continue;}` in `find_slot` | ~41 |
-| 0003 | gather-read | `paged-attn.{h,cpp}` | `CMakeLists.txt` +1; **one** hook in `build_attn`; 2 tiny accessors on `llama_kv_cache_context` | ~8 |
-| 0004 | on-demand alloc | (uses 0001 manager) | one branch in `find_slot` calling the manager | ~10 |
-| 0005 | continuous batching | - | **LocalAI `grpc-server.cpp`** (already a LocalAI override, not a core patch) | 0 core |
-| 0006 | prefix caching | (uses 0001 manager) | one hash-lookup hook in the 0004 alloc branch | ~6 |
-
-Net core surface for the *entire* engine: `find_slot` (placement/alloc - where physical
-cells are already chosen) + **one** line in `build_attn` + two accessors. Everything else
-is new files or the LocalAI-side server loop.
-
-## 0003 redesigned to the rule (replaces the 4-file-surgery plan)
-
-The old `0003-gather-read-plan.md` edited `llama-kv-cache.{h,cpp}` + `llama-graph.{h,cpp}`
-(including a field added to `llm_graph_input_attn_kv` and fill logic in its `set_input`).
-The additive form removes the core-struct and core-`set_input` edits entirely:
-
-**New file `src/paged-attn.{h,cpp}`** holds *all* logic:
- `class llm_graph_input_paged_gather : public llm_graph_input_i` - owns the `I32 [n_gather]`
-  gather-index tensor and a `const llama_kv_cache_context * mctx`. Its `set_input()` fills
-  the index with the sequence's used cells (`{ i in [0,n_kv) : !cells.is_empty(i) }`, the
-  same set the `kq_mask` keeps), in the canonical order.
- `paged_attn::gather(ctx0, res, mctx, v_trans, &k, &v, &kq_mask)` - when paged is active,
-  constructs that input via `res->add_input(...)`, and applies `ggml_get_rows` to `k`, `v`,
-  and the transposed `kq_mask` by the shared index (mask: `transpose -> get_rows ->
-  transpose`). When not active it returns immediately -> **stock path byte-identical**.
-
-**Core hooks (the whole core diff for 0003):**
-1. `src/llama-graph.cpp`, in `build_attn` right before `build_attn_mha` (~line 2357):
-   ```cpp
-   paged_attn::gather(ctx0, res, mctx_cur, v_trans, &k, &v, &kq_mask); // no-op unless LLAMA_KV_PAGED
-   ```
-   One line. No new field on `llm_graph_input_attn_kv`; the gather input is a *separate*
-   registered input, so `llama-graph.h` is untouched.
-2. `src/llama-kv-cache.{h,cpp}`: two thin accessors on `llama_kv_cache_context` so the new
-   file can read the used-cell set without reaching into internals -
-   `uint32_t get_n_gather() const;` and `void get_gather_idxs(int32_t * dst) const;`
-   (delegate to `kv`/`sinfos[i_cur]`, mirroring the existing `get_n_kv` / `set_input_k_idxs`
-   pattern). ~8 lines total, no signature changes to existing methods.
-3. `src/CMakeLists.txt`: `+ paged-attn.cpp`.
-
-First cut: gate to **flash-attn + single-stream** (`GGML_ASSERT` otherwise) - the V-transposed
-(non-FA) and multi-stream gathers are a localized follow-up entirely inside `paged-attn.cpp`,
-no new core touch. Gate 0 stays the same: `diff` of greedy `llama-simple` output, stock vs
-`LLAMA_KV_PAGED=1`, must be identical (attention is permutation-invariant over the gathered
-KV set; `n_gather < n_kv` proves compaction, not identity).
-
-## Anti-drift practices (already in `README.md`, restated as policy)
-
- **Stacking patches, one concern each**, exported 1:1 from a dev branch via
-  `git format-patch`. On a pin bump, rebase the branch; only the conflicting small patch
-  needs a touch, and the failure names the exact step.
- **Default-off (`LLAMA_KV_PAGED`)** until each gate is green, so a partial series never
-  changes stock behavior - and the hooks compile to a no-op branch when the env is unset.
- **Dev tree:** `git worktree add <dev> <LLAMA_VERSION>` off any checkout that has the pin
-  (e.g. the existing llama.cpp clone), `git apply` the series, develop the next patch as one
-  commit, re-export. (Set up and verified for this pin during this work.)
-
-## Status / next step
-
- 0001, 0002: done, additive, verified token-identical.
- 0003: **redesigned to the additive form above** (this doc). Dev tree at the pin with
-  0001+0002 applied is ready (`paged` branch). Remaining work is the focused
-  implement-and-verify block for `paged-attn.{h,cpp}` + the one `build_attn` hook, driven to
-  the token-identical Gate 0. That is a numerical-correctness task (mask/gather alignment,
-  FA-first), not a structural one - the structure is settled here.
- 0004-0006: follow the budget above; 0005 lands in LocalAI's `grpc-server.cpp` (no core
-  patch at all).
--- a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
+++ b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md
@@ -1,669 +0,0 @@
-# ARCH_GENERALITY_AUDIT - llama-cpp-localai-paged backend
-
-Source/build/gallery audit (no GPU, no hardware). Maps how arch-general the
-paged backend's BUILD targeting is, and whether non-Blackwell / Metal / CPU
-hosts get a working build.
-
-## Section: backend-build-matrix (build targeting)
-
-### 1. CUDA arch list: NOT Blackwell-only - it is the FULL upstream ggml default
-
-There is NO explicit CUDA arch list anywhere in the paged build path:
-
- `.docker/llama-cpp-localai-paged-compile.sh` only injects
-  `-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}` *when* `CUDA_DOCKER_ARCH` is
-  non-empty (`if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]`).
- NO `backend-matrix.yml` row for `llama-cpp-localai-paged` sets
-  `CUDA_DOCKER_ARCH` (nor does any stock `llama-cpp` row). It is empty.
- `backend/cpp/llama-cpp/Makefile` (reused verbatim by the paged wrapper) sets
-  only `-DGGML_CUDA=ON` (+ `-DGGML_NATIVE=OFF`). It never sets
-  `CMAKE_CUDA_ARCHITECTURES` / `CUDA_DOCKER_ARCH`.
-
-=> The compiled arch fan is whatever upstream llama.cpp / ggml-cuda picks by
-default with `GGML_NATIVE=OFF` (the full multi-arch default, which includes
-Blackwell sm_120 alongside the older archs ggml ships). This is IDENTICAL to
-how the stock llama-cpp backend is targeted - the paged wrapper copies and reuses
-the exact same Makefile + CMakeLists + prepare.sh, only forcing `LLAMA_PAGED=on`.
-
-Consequence for NVFP4: the FP4-MMA kernel is compile-time gated *inside* the
-ggml-cuda TU by `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer, sm_100
-datacenter). Because the build emits the full arch fan (not a Blackwell-only
-list), the sm_120 NVFP4-MMA path is present for Blackwell AND the older archs get
-their own kernels (NVFP4 runs the non-MMA fallback at runtime on
-Ampere/Ada/Hopper). NOTHING in the build pins this to GB10/sm_121. The binary is
-arch-portable; only the NVFP4 *speedup* is Blackwell-specific, by kernel gating,
-not by build targeting.
-
-### 2. Variants built: CUDA + ROCm + SYCL + Vulkan + CPU (NOT CUDA-only)
-
-`backend-matrix.yml` `include:` (Linux) ships, for `llama-cpp-localai-paged`,
-one row per stock-llama-cpp Linux row (10 rows, lines 4889-5046):
-
- cublas CUDA 12.8 (linux/amd64)
- cublas CUDA 13.0 (linux/amd64)
- cublas CUDA 13.0 arm64 l4t (jetson)
- cublas CUDA 12.0 arm64 l4t (jetson)
- hipblas / ROCm 7.2.1 (linux/amd64) - AMDGPU_TARGETS = gfx908..gfx1201
- sycl_f32 (Intel oneAPI)
- sycl_f16 (Intel oneAPI)
- vulkan (linux/amd64)
- vulkan (linux/arm64)
- CPU (linux/amd64) + CPU (linux/arm64), built via the ggml
-  `CPU_ALL_VARIANTS` single-build (dlopen libggml-cpu-*.so by host CPU feature;
-  arm64 uses gcc-14 for SME).
-
-So it is NOT CUDA-only. Per image, `compile.sh` builds: the accelerator variant
-(or CPU_ALL_VARIANTS when BUILD_TYPE is empty) + grpc-server + rpc-server.
-
-### 3. THE GAP vs stock llama-cpp: NO Metal / Darwin build
-
-This is the single build-targeting divergence:
-
- stock `llama-cpp` HAS a Darwin row in `includeDarwin`
-  (`-metal-darwin-arm64-llama-cpp`, line 5071) and a `metal:` capability key
-  (`metal: "metal-llama-cpp"`, index.yaml line 25).
- `llama-cpp-localai-paged` has ZERO entries in `includeDarwin` (grep past line
-  5048 = none) and NO `metal:` capability key in its meta-backend.
- There is NO `metal-*-llama-cpp-localai-paged` tag anywhere in
-  backend-matrix.yml or backend/index.yaml.
-
-`scripts/changed-backends.js` already anticipates a future darwin paged row
-(lines 78-81 map `backend === "llama-cpp-localai-paged"` to the C++ source dir),
-but no such matrix row exists, so it is currently dead/forward-looking code.
-
-Everything else (CUDA arch fan, ROCm gfx list, SYCL, Vulkan, CPU) matches stock
-llama-cpp exactly.
-
-### 4. Does a non-Blackwell / Metal / CPU host get a working build of THIS backend?
-
-Meta-backend capabilities map (index.yaml lines 101-111):
-default(cpu), nvidia(cuda12), intel(sycl-f16), amd(rocm), vulkan, nvidia-l4t,
-nvidia-cuda-13, nvidia-cuda-12, nvidia-l4t-cuda-12/13.  NO `metal:` key.
-
- Non-GB10 NVIDIA (Ampere sm_80-86 / Ada sm_89 / Hopper sm_90 / datacenter
-  Blackwell sm_100): selects the SAME cuda12 / cuda13 image. That image is
-  compiled for the full arch fan, so it RUNS. NVFP4 falls back to the non-MMA
-  path on pre-Blackwell; on sm_100 it gets FP4-MMA but is compute-bound (HBM3e),
-  not the LPDDR5x-bound GB10 regime the patches were tuned for. WORKS, just
-  without the GB10-specific bandwidth win.
- AMD / Intel / Vulkan / CPU (amd64 + arm64) Linux hosts: each has its own
-  matching variant in the map + matrix. WORKS.
- Metal / macOS Apple Silicon: NO `metal:` key and NO darwin build. Capability
-  resolution falls back to `default` = `cpu-llama-cpp-localai-paged`, which is a
-  Linux (amd64/arm64) image, NOT a macOS-native build, so it will NOT run on
-  macOS. And because this is a SEPARATE meta-backend, it does NOT fall through to
-  the stock `llama-cpp` backend - a Mac user who explicitly selects
-  llama-cpp-localai-paged gets an image that cannot run and must manually pick the
-  stock llama-cpp backend instead. DOES NOT WORK on Metal/macOS; no auto-fallback
-  to stock.
-
-## Verdict (build-targeting)
-
- Arch-general on Linux: YES. The build is NOT Blackwell-only; it targets the
-  exact same full CUDA arch fan + the same ROCm/SYCL/Vulkan/CPU variant set as
-  stock llama-cpp. Any Linux host that can run stock llama-cpp can run THIS
-  backend; the NVFP4 speedup is the only Blackwell-gated piece, and that gating
-  is inside the kernel, not in the build matrix.
- Single gap: NO Metal/Darwin variant and NO `metal:` capability key. macOS /
-  Apple Silicon hosts have no working build of this backend and do not auto-fall
-  to stock llama-cpp. To close the gap, add an `includeDarwin` row
-  (`-metal-darwin-arm64-llama-cpp-localai-paged`, mirroring the stock llama-cpp
-  darwin row + the C++ source build path that changed-backends.js already
-  anticipates) and a `metal:` key to the paged meta-backend. (Note: NVFP4 has no
-  Metal MMA path, so a Metal build would deliver paged-KV behaviour only, no
-  NVFP4 acceleration - still a correctness/availability win over the current
-  broken selection.)
-
-## Section: gguf-gallery-targeting (NVFP4 portability + hardware gating)
-
-### 1. NVFP4 GGUFs LOAD + RUN on non-Blackwell - runs-via-dequant, NOT FP4-MMA-required
-
-The published GGUFs use `file_type` MOSTLY_NVFP4 / `GGML_TYPE_NVFP4` (type id 40).
-This is a standard ggml block-quant type with FULL software dequant + matmul
-coverage across every backend, NOT a Blackwell-only format. Verified against the
-paged backend's pinned ggml source (pin 0a2677c6, same upstream as stock
-llama-cpp):
-
- CPU (any arch, amd64 + arm64): full support, no special hardware.
-  - `ggml/src/ggml-cpu/quants.c`: `quantize_row_nvfp4` (from_float) +
-    `ggml_vec_dot_nvfp4_q8_0_generic` (the matmul dot product), dequant via the
-    `kvalues_mxfp4` lookup table. Registered in the CPU type-traits table
-    (`ggml-cpu.c` line 283: `[GGML_TYPE_NVFP4] = { .from_float=..., .vec_dot=... }`).
-  - NVFP4 handled in all the CPU op switches (`ops.cpp` lines 674, 1125, 1255,
-    4424, 4701, 4925, 5651). LOADS + RUNS correctly on a pure-CPU host, just slow.
- CUDA, NON-Blackwell (Pascal/Volta/Turing/Ampere sm_80-86 / Ada sm_89 /
-  Hopper sm_90): RUNS correctly via the integer-quantized matmul paths, no
-  FP4-MMA needed.
-  - `convert.cu` registers `dequantize_row_nvfp4_cuda` as both the to_float and
-    to_fp16 dequant kernel (lines 759, 814) - the generic dequant->GEMM path.
-  - `mmvq.cu`: `vec_dot_nvfp4_q8_1` (DP4A integer dot, works on any GPU with
-    dp4a, i.e. Pascal sm_61+). This is the decode (gemv) path.
-  - `mmq.cuh`: NVFP4 has a `MMQ_DP4A_TXS_Q8_0_16` DP4A tile AND a separate
-    `MMQ_MMA_TILE_X_K_NVFP4` tile explicitly commented "NVFP4 Generic" (line
-    222), DISTINCT from `MMQ_MMA_TILE_X_K_FP4` "MXFP4 and NVFP4 Blackwell" (line
-    221). So there are three tiers: DP4A (oldest), generic-MMA (Turing+), and
-    Blackwell-native FP4-MMA.
-  - The Blackwell path is a runtime FLAG, not a requirement:
-    `mmq.cu` line 125 `const bool use_native_fp4 = blackwell_mma_available(cc)
-    && (... NVFP4)`. When false (non-Blackwell), it falls through to the generic
-    quantized kernel. Grep for any abort/unsupported on NVFP4+blackwell = NONE.
-    No `GGML_ABORT`, no garbage - just the non-MMA kernel.
- Vulkan: has `dequant_nvfp4.comp` + NVFP4 in `ggml-vulkan.cpp` / dequant_funcs
-  - LOADS + RUNS on Vulkan hosts (AMD/Intel/NVIDIA) via dequant.
- Metal: NVFP4 referenced only in `ggml-metal-device.m` (type registration /
-  size), NO Metal NVFP4 compute kernel. On Apple Silicon NVFP4 tensors would
-  fall back to the CPU backend op-by-op (correct but slow) IF a Metal build
-  existed - which for THIS backend it does not (see build-targeting Section 3).
-
-Bottom line: the NVFP4 GGUFs are PORTABLE. A Hopper/Ada/Ampere/CPU/Vulkan host
-loads and runs them correctly (bit-faithful dequant), just WITHOUT the FP4-MMA
-speedup. FP4-MMA is a Blackwell-only performance tier layered on top of a
-fully-general software path, NOT a load/run gate. Off-Blackwell = runs-via-dequant,
-correct-but-slow; never fail/garbage.
-
-### 2. Gallery hardware-targeting GAP: nothing stops a non-Blackwell user
-
-The 6 -paged entries declare NO machine-readable hardware targeting. The only
-Blackwell signal is free prose in `description:` ("native Blackwell NVFP4
-(FP4-MMA)", "Benchmarked on GB10 / DGX Spark") and a `nvfp4` string in `tags:`.
-
-How LocalAI's gallery CAN express hardware gating (what exists):
- `tags:` are FREE-TEXT, search-only. `core/gallery/gallery.go` line 89 just does
-  `strings.Contains(lower(join(tags)), term)` for the search box + line 128
-  collects them for filter chips. There is NO tag that gates install or warns;
-  the `nvfp4` tag is purely discoverability.
- The model `ModelConfig` struct (`core/gallery/models.go`) has only
-  Description/Icon/License/URLs/Name/ConfigFile/Files/PromptTemplates. There is
-  NO capabilities / requirements / hardware field at the MODEL level. (Signing
-  `verification:` is the only structured gate, unrelated to hardware.)
- The `capabilities:` map (default/nvidia/intel/amd/metal/vulkan/...) is a
-  BACKEND-level concept in `backend/index.yaml` (paged entry lines 100-111). It
-  selects the backend IMAGE by detected accelerator FAMILY (nvidia vs amd vs
-  metal vs cpu). Crucially it does NOT and CANNOT distinguish Blackwell sm_120/121
-  from older NVIDIA - `nvidia: cuda12-llama-cpp-localai-paged` is served to ANY
-  NVIDIA GPU. There is no sub-nvidia (microarch) gating mechanism in the gallery
-  or the backend capability resolver.
-
-So the gating gap is real: a non-Blackwell user browsing the gallery is offered
-the NVFP4 entries with no machine-readable signal that they will run far below
-the advertised "90-117% of vLLM" numbers (those numbers are GB10/LPDDR5x-bound
-specific). It will install and run correctly, just slowly, and the bench claims
-in the description will not hold.
-
-### 3. How to express Blackwell-targeting (recommendation)
-
-Given there is no microarch-gating primitive, the honest options are, in order:
-
-a. DESCRIPTION + TAG (only thing available today, zero code): the entries already
-   say "native Blackwell NVFP4 (FP4-MMA)" - tighten it to a leading one-line
-   "Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other
-   NVIDIA/CPU via NVFP4 dequant but WITHOUT the FP4-MMA speedup and below the
-   quoted GB10 throughput." Add a `blackwell` tag alongside `nvfp4` for the
-   filter chip. This is the existing convention (other entries use free prose +
-   `nvidia` tag, e.g. line 2395; quant trade-offs are described in prose, e.g.
-   the Gemma "Mobile-optimized" notes lines 1312/1366). No other gallery entry
-   today encodes a GPU-microarch requirement, so prose is the de-facto standard.
-b. If a structured signal is wanted, it would need a NEW field (e.g. a
-   `recommended_hardware` / `requires` note surfaced by the React UI import
-   dialog) - that is a feature, not a config tweak, and does not exist yet.
-c. The `nvfp4` tag should at minimum be present on ALL six entries - the four
-   Qwopus/Qwen-MTP entries at lines 819/854/890 have only `[llm, gguf]` tags and
-   omit `nvfp4`, so they are not even discoverable/filterable as NVFP4, despite
-   being NVFP4 GGUFs. Inconsistent tagging is a secondary gap.
-
-Verdict (gallery-targeting): NVFP4 GGUFs are safe to ship broadly (they run
-everywhere via dequant, never fail), so the risk is PERFORMANCE-EXPECTATION, not
-correctness. LocalAI has no microarch gating primitive; the only lever is the
-description + tags. Recommend a one-line Blackwell-recommended hardware note +
-consistent `nvfp4`/`blackwell` tags on all six, and tempering the GB10 bench
-claims with the "runs slower off-Blackwell" caveat.
-
-## Section: optimization-generality (patches 0013/0016 + 0017-0029)
-
-Classifies each optimization as arch-GENERAL (ship everywhere, helps any arch),
-GB10-TUNED (needs per-arch retuning of the magnitude/constants), or
-Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists). Read from the
-patch commit bodies + the diffs they touch; bit-exactness verdicts are the
-patches' own md5/test-backend-ops gates.
-
-Arch axis used: NVFP4 FP4-MMA needs `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer
-+ sm_100 datacenter); Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 have none;
-Metal/CPU/AMD/Intel have no NVFP4-MMA. Datacenter Blackwell sm_100 has FP4-MMA but
-HBM3e (~8 TB/s) so it is COMPUTE-bound, not bandwidth-bound: every GB10
-"bandwidth-bound" verdict inverts there. The FP4-MMA kernel itself is UPSTREAM
-ggml-cuda gated by `BLACKWELL_MMA_AVAILABLE`; none of these patches add it - they
-reshape/route/dedup around it (0017/0020/0023/0025) or are precision-agnostic.
-
-### A. ARCH-GENERAL (ship everywhere; pure win or provably neutral)
-
-Graph-shape, host-side, or gather/copy-elimination changes. No FP4, no
-bandwidth-floor assumption. Bit-exact. Help or are neutral on any arch that runs
-the code path.
-
- 0013 decoupled prefill-token budget - pure `update_slots()` scheduler policy,
-  zero libllama/ggml change, orthogonal to LLAMA_KV_PAGED, default-off
-  byte-identical. Latency/fairness lever (flattens decode-ITL spike from a
-  co-batched long prefill). No arch assumption.
- 0016 dynamic decode-first prefill budget - supersedes 0013; still pure
-  `update_slots()` policy, default-off byte-identical, T==n_batch degenerate case
-  == stock. Arch-neutral, identical paged on/off.
- 0024 paged-pool burst-reclaim - host-side block accounting + defrag + slot
-  release; never touches KV values or compute. Gated behind LLAMA_KV_PAGED. Fixes
-  a real fragmentation/throughput-collapse bug on long-lived servers.
-  Arch-independent host bookkeeping.
- 0029 block-table within-step host cache - memcpy-reuse of the host block table
-  across full-attention layers in one step; bit-exact, LLAMA_PAGED_NO_BT_CACHE=1
-  off. Helps host-bound decode (dense +2.7% on GB10), neutral when compute-bound
-  (MoE flat). The faster the GPU (e.g. sm_100), the MORE host-bound decode is, so
-  the BIGGER this win elsewhere.
- 0028 recurrent-state (conv-tap) gather fusion - eliminates a k_get_rows by
-  reading cache[ids[s]] in-kernel; bit-identical. Deleting a gather that vLLM
-  has no equivalent of is a win on ANY arch running the GDN path; not FP4, not
-  bandwidth-floor specific.
- 0018 in-place SSM-state write-back + 0019 fused SSM-state gather + 0021
-  conv-state in-place fusion - remove a D2D state copy-back (0018), a state
-  get_rows (0019), and the 4-op conv chain + ring-state copy (0021), mirroring
-  vLLM's in-place recurrent update. Arithmetic byte-identical; what is removed is
-  plumbing, so wins on ANY arch running the gated-DeltaNet recurrence.
- Paged KV core (0001-0012) - paged KV manager, on-demand alloc, prefix caching,
-  in-kernel paged read. No precision or bandwidth-floor assumption; the most
-  portable part of the work, helps capacity/serving anywhere it compiles.
-
-NOTE: 0018/0019/0021/0028 + the base GDN op have CUDA + CPU kernels ONLY (every
-gate is "CUDA0 vs CPU"). General within {CUDA, HIP/ROCm (hipified ggml-cuda), CPU};
-NOT covered on Metal/SYCL/Vulkan - see SAFETY #1.
-
-### B. GENERAL-IN-DIRECTION but the MAGNITUDE was measured on the GB10 floor
-
-Correct + beneficial everywhere, but the specific %/constants are GB10-bound.
-
- 0020 o_proj GDN MMVQ->MMQ reshape - collapses the GDN output to 2D so the
-  ssm_out matmul sees src1->ne[1]=128 and routes to MMQ (M=128 GEMM that amortizes
-  the weight read across 128 tokens) instead of MMVQ (built for batch<=8 with the
-  128 sequences stuck in ne[2]). Zero-cost view change, bit-identical, gated to the
-  gated-DeltaNet path. UNIVERSAL: MMVQ (mul_mat_vec_q) is structurally a batch<=8
-  GEMV and cannot amortize the weight read at a real M=128; MMQ does, on ALL CUDA
-  archs (dp4a pre-tensor-core still amortizes) and on HIP. So MMQ > MMVQ at M=128
-  is NOT GB10-specific - pure win wherever MMQ exists. RE-TUNE: the +31.7%
-  magnitude is on the GB10 BW floor; smaller % on sm_100 but still correct.
-  REGRESSION RISK: only at a genuinely tiny real M (single-stream decode n_seqs<=8)
-  could forcing MMQ be slower than MMVQ - see SAFETY #2. On Metal/Vulkan/SYCL the
-  MMVQ/MMQ split differs; the reshape is harmless (a view) but yields no benefit.
- 0023 MoE NVFP4 activation-quantize de-dup - for broadcast up/gate proj (ne11==1)
-  quantize the unique token activations once and gather the identical FP4 blocks
-  instead of re-quantizing per expert; bit-exact, ..._DEDUP=0 off.
-  DIRECTION-GENERAL (de-duplicating identical work is always good) but
-  NVFP4-block-layout specific (uint4 copy of block_fp4_mmq) and only matters where
-  activation-quant is a measurable decode bucket - on a compute-bound arch the
-  saved quant time may be off the critical path (even on GB10 the MoE TG win is
-  only +1.7%).
-
-### C. GB10-TUNED (constants are GB10 winners; re-sweep per arch)
-
- 0022 GDN recurrence occupancy/coalescing retune - column-folding template params
-  NUM_WARPS/COLS_PER_WARP, default (16,8), env-selectable GDN_NW/GDN_CPW. The
-  reduction/FMA order is byte-identical (md5-gateable); only the warp/block->column
-  assignment changes to raise memory-level parallelism on a BANDWIDTH-BOUND kernel.
-  (16,8) is explicitly "the measured GB10 winner". Textbook per-arch tune: optimal
-  values depend on DRAM latency / L2 / SM count / occupancy. SAFE everywhere
-  (bit-exact, env-overridable, no forbidden float4 load) but unlikely optimal off
-  GB10; on a compute-bound arch (sm_100) the kernel may not even be the
-  bottleneck. Needs a per-arch GDN_NW/CPW sweep.
- 0017 FP4 dense-GEMM decode tile tune - shipped as a P0 bit-exact gate + DEFAULT-
-  OFF occupancy levers (GGML_CUDA_FP4_MMQ_Y / ..._MINBLOCKS / ..._DENSE_MMQ_X).
-  Honest GB10 verdict was a KILL-GATE: every cheap occupancy probe REGRESSED on
-  sm_121 (the M=128 tile is already weight-read optimal). Nothing on by default =>
-  byte-identical to stock everywhere. On a DIFFERENT FP4 arch (sm_100) the
-  kill-gate could flip; the levers are in place and inert, ready to re-sweep.
-
-### D. Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists)
-
- 0025 NVFP4 MoE-decode re-graph - keeps CUDA graphs on for the grouped
-  stream-k mul_mat_q id-path; env-gated LLAMA_MOE_FORCE_GRAPHS, default-off
-  byte-identical. The CUDA-graph mechanism is general, but the specific guard
-  condition (mmvq_mmid_max==8 for NVFP4 on sm_121) and the "graphs are safe here"
-  reasoning are tied to the NVFP4 grouped path on Blackwell. On a non-FP4 arch the
-  node would not take that branch -> inert.
- 0026 hybrid per-head SSM-state precision (bf16 SSM/conv cache) - adds
-  --cache-type-ssm/-conv + --ssm-bf16-tau (per-head f32-vs-bf16 by memory length).
-  Default f32 = bit-exact. PRECISION/bandwidth lever: bf16 halves the dominant GDN
-  decode byte stream, which only pays off on a BANDWIDTH-bound arch (GB10). On
-  sm_100 HBM3e it buys little. Value is bandwidth-floor specific; correctness is
-  precision-specific (opt-in, default-safe).
- NVFP4 GGUFs + the 6 gallery -paged rows - inherently Blackwell-precision-specific
-  for the FAST path: NVFP4 weights only get FP4-MMA on sm_120/121/100. Elsewhere
-  they run-via-dequant (correct, slow) per the gallery-targeting section above.
-
-### Per-arch expected story
-
- Consumer Blackwell sm_120/121 (GB10 / dGPU): the validated target. dGPU sm_120
-  (GDDR7 ~1 TB/s) is less BW-starved than GB10 LPDDR5x 273 GB/s, so the
-  bandwidth-floor wins (0018/0019/0022/0026) shrink in % while the host-pipeline +
-  graph wins (0029/0025) and the MMQ reshape (0020) hold.
- Datacenter Blackwell sm_100 (HBM3e ~8 TB/s): FP4-MMA WORKS so NVFP4 stays fast
-  (precision bucket + 0025 carry over), BUT the BW floor is GONE -> compute-bound.
-  Re-tune: 0022 GDN_NW/CPW sweep; 0017 kill-gate may flip (levers ready). The
-  bandwidth-motivated wins (0018/0019, 0026 bf16-state) shrink toward neutral; the
-  host-pipeline/graph/MMQ-reshape general wins (0029/0025/0020) still help. Net:
-  works, faster GPU, needs a re-tune pass, do NOT assume the GB10 constants.
- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86: NO FP4-MMA. NVFP4 GGUFs + the FP4
-  levers (0017/0023/0025) are out of scope -> use a DIFFERENT quant (Q4_K/AWQ/GPTQ
-  etc). BUT the precision-agnostic infra still helps: paged KV core, scheduler
-  (0013/0016), burst-reclaim (0024), block-table cache (0029), SSM/conv
-  plumbing-removal (0018/0019/0021/0028); 0020 still routes the o_proj
-  MMVQ->MMQ in whatever quant it uses. Story: "no FP4 -> another quant, but paged +
-  SSM + scheduler infra is a pure win".
- Metal / CPU / AMD-ROCm / Intel-SYCL / Vulkan (all but Metal built by the matrix): no
-  NVFP4-MMA; paged KV + scheduler infra is the portable value. CPU has reference
-  kernels for every fused op (the bit-exact gate is CUDA0-vs-CPU). ROCm/HIP reuses
-  ggml-cuda (hipify) so it inherits the fused-op kernels. Metal/SYCL/Vulkan do NOT
-  get the new fused-op kernels (SAFETY #1).
-
-### SAFETY / regression risks
-
-1. Fused GDN/conv ops are CUDA + CPU only; emission is NOT backend-gated.
-   0018/0019/0021/0028 add new ops (ggml_gated_delta_net_inplace[_ids],
-   ggml_ssm_conv_update_inplace[_ids]) implemented for CUDA + CPU only. They are
-   emitted by the graph builder whenever cparams.fused_gdn_ar/fused_gdn_ch is set
-   (constructor sets fused_gdn_ch=true, auto_fgdn=true), with NO check that the
-   active compute backend is CUDA/HIP. Fine on CUDA/HIP/CPU. On Metal/SYCL/Vulkan
-   two failure modes: (a) the new GATED_DELTA_NET op has no kernel -> loud
-   supports_op/assert (but the base GDN op may already be CUDA/CPU-only upstream,
-   so a qwen35 model likely cannot run there regardless); (b) the fused conv
-   variant REUSES GGML_OP_SSM_CONV discriminated by a non-null src[3]/src[4] - a
-   backend that supports plain SSM_CONV but ignores the discriminator would compute
-   the WRONG plain conv -> SILENT corruption. That is the one genuine
-   silent-correctness risk. RECOMMENDATION: gate fused-op emission on the compute
-   backend being CUDA/HIP (or add a supports_op guard rejecting the discriminated
-   SSM_CONV where the fused handling is absent).
-2. 0020 MMVQ->MMQ at tiny real M. MMQ is right at decode M=128 (the gallery
-   batched-serving regime); it would be wrong only at a genuine M<=8 (single-stream
-   decode, n_seqs=1). Bit-identical either way - only a potential perf regression
-   at tiny batch on non-GB10 archs (never the measured GB10 case). Worth confirming
-   the reshape still picks the better kernel at n_seqs=1 elsewhere.
-3. 0022 default (16,8) off GB10: safe (bit-exact, env-overridable) but non-optimal;
-   do not ship it as the default for sm_100/Hopper/Ada without a GDN_NW/CPW sweep.
-   No correctness risk.
-4. Gallery rows do not state a GPU-arch requirement (covered in the
-   gallery-targeting section): add a Blackwell (sm_120/121/100) recommended note.
-
-### One-line verdict
-
-The PORTABLE core (paged KV 0001-0012, scheduler 0013/0016, burst-reclaim 0024,
-block-table cache 0029, the SSM/conv plumbing-removal 0018/0019/0021/0028, the
-o_proj MMQ reshape 0020) is arch-general and ships everywhere it compiles -
-bit-exact, mostly default-safe, pure win or neutral. The FP4/NVFP4 levers
-(0017/0023/0025, the GGUFs, the gallery) are Blackwell-precision-specific. The
-occupancy/precision tunes (0017 levers, 0022, 0026) are GB10-bandwidth-floor-tuned
-and need a per-arch re-sweep (especially on sm_100 where the BW floor is gone and
-the regime flips to compute-bound). The single real SAFETY gap: the new fused
-GDN/conv ops are CUDA+CPU-only with backend-ungated emission, so a Vulkan/SYCL/Metal
-paged build of a gated-DeltaNet model could assert (GDN op) or silently miscompute
-(discriminated SSM_CONV) - it should be compute-backend-gated.
-
-## Section: patch-arch-safety (build-break / miscompile classification, 0017-0029)
-
-This section is the narrow safety read: for EACH patch, does it (a) compile and
-behave correctly on every build target, (b) compile only under
-BLACKWELL_MMA_AVAILABLE with a fallback elsewhere, or (c) RISK a build break /
-miscompile / crash on a non-Blackwell arch. Class letters here are
-build-safety classes, distinct from the perf-generality buckets above. Note 0027
-does not exist (numbering gap). The dispositive build facts: the backend is built
-for CUDA 12/13, L4T arm64, ROCm/hipblas, SYCL f32/f16, CPU (amd64+arm64), Vulkan -
-and NOT for darwin/Metal (no includeDarwin row), and the CUDA build emits the full
-multi-arch fan (CUDA_DOCKER_ARCH unset; Dockerfile documents e.g. `75;86;89;120`),
-so every .cu TU MUST already compile for non-Blackwell SASS.
-
-Method: grepped every added line in 0017-0029 for arch-specific tokens
-(BLACKWELL/__CUDA_ARCH__/sm_NNN/cp.async/ldmatrix/mma./asm volatile/cc gates).
-The ONLY hits are in 0017 (all correctly `#if`-gated) and free-text comments. No
-SSM/conv/GDN kernel in the series uses a Blackwell intrinsic or a hardcoded
-sm_12x launch geometry.
-
-| patch | class | build-safety note |
-|-------|-------|-------------------|
-| 0017 fp4-gemm-decode-tile-tune | (b) GATED | only Blackwell-specific patch; NVFP4 mmq_y/min-blocks levers behind `#if defined(BLACKWELL_MMA_AVAILABLE)` + `blackwell_mma_available(cc)` + `type==GGML_TYPE_NVFP4`, ALL default-off => default build byte-identical to stock on every arch. `get_mmq_y_device<type>()` templating has a default arg keeping stock behaviour for non-NVFP4. Builds on sm_80-90 (body stripped). |
-| 0018 ssm-decode-inplace-state | (a) general | plain in-place GDN state write-back, no intrinsics; CPU mirror in ggml-cpu/ops.cpp. |
-| 0019 ssm-decode-fused-gather | (a) general | `gdn_gather_nonident_kernel` = plain `<<<n_seqs,256>>>`; CPU mirror added. |
-| 0020 gdn-oproj-mmq-reshape | (a) general | host-side reshape_2d in qwen35*/qwen3next.cpp, no device code. |
-| 0021 conv-state-inplace-fusion | (a) general | new op reuses GGML_OP_SSM_CONV (4th src discriminator), no new enum => no ggml-cpu.c switch needed; `ssm_conv_update_f32` plain portable CUDA (threads=128, templated d_conv); CPU mirror + test case. |
-| 0022 gdn-recurrence-occupancy-retune | (a) general | template NUM_WARPS/COLS_PER_WARP/MIN_BLOCKS; new default (16,8) = 512 thr/block, MIN_BLOCKS=2, within the 1024 limit on sm_70..120 and AMD; bit-exact for any (NW,CPW). NOT Blackwell-gated and NOT a break - just a GB10-tuned default applied everywhere (see SAFETY #3 above). |
-| 0023 moe-nvfp4-quant-dedup | (a) general | `gather_mmq_fp4` = plain uint4 byte-copy kernel; reached ONLY inside the pre-existing `if (use_native_fp4)` branch (Blackwell-only at runtime) and uses `block_fp4_mmq`, a type that already compiles for the full arch fan pre-0023. Adds no new arch surface. |
-| 0024 paged-pool-burst-reclaim | (a) general | pure host C++. |
-| 0025 moe-nvfp4-decode-regraph | (a) general | host-side ggml-cuda.cu graph-guard relaxation, env-gated `LLAMA_MOE_FORCE_GRAPHS` default-off => byte-identical; predicate is runtime cc-generic. |
-| 0026 hybrid-perhead-ssm-state | (a) general | mostly host plumbing; GDN kernel = same portable column-folded code; fill.cu instantiates `fill_kernel<nv_bfloat16>` (bf16 STORAGE-only, fine on all targeted arches; bf16-compute SSM plan is SHELVED so STATE_T stays f32 on the hot path). LOW-RISK verify item: confirm no bf16-arithmetic GDN instantiation reaches sm_75 if sm_75 ships. |
-| 0028 recurrent-state-gather-fusion | (a) general | new op reuses GGML_OP_SSM_CONV (ids src + rs_head); `ssm_conv_gather_nonident_kernel` plain portable CUDA; CPU mirror + test cases. |
-| 0029 blocktable-within-step-cache | (a) general | pure host C++ + host-timing instrumentation. |
-
-### Specific lines that carry the only conditional/risk surface
-
-- 0017: the ONLY correctly-gated arch surface:
-  - `get_mmq_y_host`: `if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc))`
-  - `get_mmq_y_device<type>()` / `mmq_get_min_blocks_device<type>()`: bodies inside `#if defined(BLACKWELL_MMA_AVAILABLE)`.
-  All default to the stock value, so a default build is byte-identical everywhere.
-- 0023: the gather kernel is default-on (GGML_CUDA_MOE_QUANT_DEDUP=1), but the call site
-  is `if (moe_quant_dedup && ne11 == 1)` strictly inside `if (use_native_fp4)`; on
-  non-Blackwell `use_native_fp4` is false so the dedup never executes.
-- 0022: the GB10-tuned launch geometry is `GDN_DEFAULT_NW 16` / `GDN_DEFAULT_CPW 8`
-  (=> 512 threads, MIN_BLOCKS=2). This is the closest thing to a "hardcoded for
-  GB10" launch config, but it is a correct, within-limits, bit-exact default for
-  ANY arch, runtime-overridable via GDN_NW/GDN_CPW. Not a break.
-
-### THE ONE silent-correctness risk (cross-ref SAFETY #1 above)
-
-0021/0028 (and 0018/0019 for the GDN op) implement their new ops for CUDA + CPU
-ONLY, and the fused conv variants REUSE GGML_OP_SSM_CONV discriminated by a
-non-null src[3]/src[4]. Emission is NOT gated on the active compute backend. A
-backend that supports plain SSM_CONV but ignores the discriminator would run the
-WRONG plain conv => SILENT corruption (not a build break). In practice the model
-that emits these (qwen35 hybrid) also needs the fork-custom GDN op, which is
-CUDA/CPU-only, so on Vulkan/SYCL the GDN node asserts/falls back FIRST and the
-model cannot run there regardless; and Metal is not a build target. So the risk is
-latent rather than live, but it should still be closed by gating fused-op emission
-on a CUDA/HIP compute backend (or a supports_op guard rejecting the discriminated
-SSM_CONV where fused handling is absent). This is the single item that could ever
-miscompute silently; everything else is either build-safe or loud.
-
-### Build-safety verdict per target (would it COMPILE / RUN)
-
-- CUDA sm_80 / 86 / 89 / 90 (Ampere/Ada/Hopper): BUILDS (0017 Blackwell code
-  `#if`-stripped + default-off; all other device code portable CUDA). qwen35 hybrid
-  models RUN (GDN + ssm_conv_update + gather have non-Blackwell kernels). NVFP4
-  GGUFs run via the stock non-FP4-MMA dequant/DP4A path; the FP4 levers are inert,
-  not broken. No patch in 0018-0029 breaks this build.
-- CUDA sm_100 (datacenter Blackwell, HBM3e): BUILDS + every lever active
-  (BLACKWELL_MMA_AVAILABLE defined). Bit-exact. GB10-tuned launch defaults are
-  correct but tuned for the LPDDR5x BW floor; on HBM3e the regime is compute-bound,
-  so safe-but-not-necessarily-optimal (re-sweep 0022/0017 levers). No build/correctness risk.
-- Metal: NOT a build target (no darwin row), so missing Metal kernels for the new
-  SSM_CONV/GDN ops cannot break a build or a run here. (The GDN op has no Metal
-  kernel regardless.)
-- CPU (amd64 + arm64): BUILDS + RUNS - every new op ships a CPU mirror under the
-  reused enums; host patches are portable C++.
-- ROCm/HIP, Intel SYCL, Vulkan: BUILD ok. The .cu additions hipify cleanly (no
-  Blackwell intrinsic outside the `#if`; 0022's 512-thread launch within AMD limits).
-  SYCL/Vulkan are separate backends that don't compile the .cu files and lack the
-  GDN op, so qwen35 hybrid models fall back/assert there rather than run; classic
-  (non-qwen35) models are unaffected because SSM_CONV semantics only change when the
-  qwen35 graph emits the discriminator src. The latent silent-SSM_CONV risk above
-  applies only if a backend both supports SSM_CONV and ignores the discriminator.
-
-Verdict: of 0018-0029, none would break a non-Blackwell CUDA build, the CPU build,
-or the ROCm/SYCL/Vulkan builds; 0017 is the only Blackwell-gated patch and is
-default-off and `#if`-guarded. The sole non-build hazard is the latent
-discriminated-SSM_CONV silent-miscompute on a hypothetical Vulkan/SYCL/Metal GDN
-run, which should be closed by compute-backend-gating the fused-op emission.
-
-## Section: CROSS-ARCH SYNTHESIS (final verdict)
-
-Consolidates the four audit sections above into a single ship decision. The arch
-axis: NVFP4 FP4-MMA requires `BLACKWELL_MMA_AVAILABLE` = sm_120/121 (consumer
-Blackwell, GB10/RTX-50) + sm_100 (datacenter Blackwell). sm_90 Hopper / sm_89 Ada
-/ sm_80-86 Ampere = NO FP4-MMA. Metal/CPU/AMD/Intel = no NVFP4-MMA. GB10's wins
-are dominated by the LPDDR5x ~273 GB/s bandwidth floor; sm_100 has FP4-MMA but
-HBM3e ~8 TB/s so it is COMPUTE-bound and every "bandwidth-bound" GB10 verdict
-inverts there.
-
-### 1. BUILD SAFETY: does it build + run WITHOUT CRASHING off-Blackwell?
-
-YES on every target it builds for, with ONE latent silent-correctness hazard
-(not a crash) to close before claiming non-Blackwell support. The build is NOT
-GB10-pinned: there is no explicit CUDA arch list anywhere in the paged path
-(`CUDA_DOCKER_ARCH` empty in every matrix row, identical to stock llama-cpp), so
-the CUDA TUs compile the full upstream ggml arch fan and the NVFP4 FP4-MMA path
-is gated INSIDE the kernel by `BLACKWELL_MMA_AVAILABLE`, never by the matrix.
-
-| target | builds? | runs? | notes |
-|--------|---------|-------|-------|
-| CUDA sm_80/86/89/90 (Ampere/Ada/Hopper) | YES | YES | 0017 Blackwell code `#if`-stripped + default-off; all other device code portable. qwen35 hybrid models run (GDN + ssm_conv_update + gather have non-Blackwell kernels). NVFP4 GGUFs run via dequant/DP4A; FP4 levers inert, not broken. |
-| CUDA sm_100 (datacenter Blackwell, HBM3e) | YES | YES | every lever active + bit-exact; GB10-tuned launch defaults are correct but compute-bound regime => safe-but-suboptimal (re-sweep, do not assume GB10 constants). |
-| CPU (amd64 + arm64) | YES | YES | every new op ships a CPU mirror under the reused enums; host patches portable C++. |
-| ROCm/HIP, Intel SYCL, Vulkan | YES | partial | .cu hipifies cleanly (no Blackwell intrinsic outside `#if`; 0022's 512-thread launch within AMD limits). SYCL/Vulkan don't compile the .cu and lack the GDN op, so qwen35 hybrid models assert/fall back rather than run; classic non-qwen35 models unaffected. |
-| Metal / macOS | NOT BUILT | N/A | no `includeDarwin` row, no `metal:` capability key. Mac selection of this backend falls back to `default`=cpu (a Linux image) and does NOT run; no auto-fallthrough to stock llama-cpp. |
-
-No patch in 0017-0029 breaks a non-Blackwell CUDA build, the CPU build, or the
-ROCm/SYCL/Vulkan builds. The only thing that is not merely "suboptimal" is the
-fused-conv silent-miscompute hazard (item RISKY-1 below), and even that is latent
-because the co-emitted GDN op asserts first on the backends that lack it.
-
-### 2. EVERY patch/opt, four buckets
-
-SAFE-EVERYWHERE (ship as-is; bit-exact or default-off byte-identical; pure win or
-neutral on any arch that runs the path):
-- 0001-0012 paged KV core (manager, on-demand alloc, prefix caching, in-kernel paged read)
-- 0013 / 0016 prefill-token budget scheduler (pure `update_slots()` policy, default-off byte-identical)
-- 0018 in-place SSM-state write-back  (CUDA+CPU; see RISKY-1 for backend coverage)
-- 0019 fused SSM-state gather          (CUDA+CPU)
-- 0021 conv-state in-place fusion      (CUDA+CPU)
-- 0028 recurrent-state (conv-tap) gather fusion (CUDA+CPU)
-- 0020 o_proj GDN MMVQ->MMQ reshape (zero-cost view, bit-identical; MMQ>MMVQ at M=128 is universal; magnitude GB10-bound, perf-only caveat at tiny real M=1, see RISKY-2)
-- 0024 paged-pool burst-reclaim (pure host C++; fixes a real long-server fragmentation collapse)
-- 0029 block-table within-step host cache (host memcpy reuse, bit-exact; the win grows the FASTER the GPU, i.e. the MORE host-bound decode is elsewhere)
-
-BLACKWELL-ONLY, CLEAN FALLBACK (only meaningful where FP4-MMA exists; provably
-inert/byte-identical elsewhere, never a break):
-- 0017 FP4 dense-GEMM decode tile tune - levers `#if BLACKWELL_MMA_AVAILABLE` + `blackwell_mma_available(cc)` + `type==NVFP4`, ALL default-off => default build byte-identical to stock on every arch
-- 0023 MoE NVFP4 activation-quant de-dup - plain uint4 copy kernel reached ONLY inside the pre-existing `if (use_native_fp4)` branch (false off-Blackwell); never executes there
-- 0025 MoE NVFP4 decode re-graph - host-side CUDA-graph guard, env-gated `LLAMA_MOE_FORCE_GRAPHS` default-off; the NVFP4-grouped guard predicate is inert on non-FP4
-- NVFP4 GGUFs + 6 gallery rows - FAST path is sm_120/121/100 only; elsewhere run-via-dequant (correct, slow), never a load/run gate
-
-GB10-TUNED (works + safe everywhere, but the constants/magnitude are GB10
-bandwidth-floor winners; re-sweep per arch, no correctness risk):
-- 0022 GDN recurrence occupancy retune - column-fold default (16,8)=512thr/MIN_BLOCKS=2, bit-exact, env-overridable GDN_NW/GDN_CPW; within the 1024-thread limit on sm_70..120 + AMD. Optimal values depend on DRAM latency/L2/SM-count; on a compute-bound arch the kernel may not be the bottleneck.
-- 0026 bf16 per-head SSM/conv cache - default f32 bit-exact (opt-in `--cache-type-ssm/-conv`); bf16 only pays off on a bandwidth-bound arch, buys little on sm_100 HBM3e. bf16 is STORAGE-only (fill_kernel<nv_bfloat16>), the bf16-compute SSM plan is shelved so STATE_T stays f32 on the hot path.
-- 0017 / 0023 magnitudes (the % wins, not the gating) are also GB10-floor-bound.
-
-RISKY (fix before claiming non-Blackwell ship; neither is a crash, one is silent):
-- RISKY-1 (the one real gap) fused GDN/conv ops are CUDA+CPU-only with
-  backend-UNGATED, DEFAULT-ON emission. Confirmed: `cparams.fused_gdn_ch = true`
-  and `auto_fgdn = true` in the `llama_context` constructor; emission fires on
-  `(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar)` with NO compute-backend
-  check. The fused conv variant REUSES `GGML_OP_SSM_CONV` discriminated by a
-  non-null `src[3]` (verified: CUDA `if (dst->src[3] != nullptr)` branch at the
-  top of `ggml_cuda_op_ssm_conv`, CPU mirror in ops.cpp, NO supports_op guard). A
-  backend that supports plain SSM_CONV but ignores `src[3]` would compute the
-  WRONG plain conv => SILENT corruption. Latent today only because the co-emitted
-  fork-custom GDN op is CUDA/CPU-only, so on Vulkan/SYCL the GDN node asserts
-  first and the qwen35 hybrid model cannot run there anyway, and Metal is not
-  built. FIX: gate fused-op emission on a CUDA/HIP compute backend, OR add a
-  supports_op guard that rejects the discriminated SSM_CONV where fused handling
-  is absent. This is the single thing that could miscompute silently; close it
-  before a Vulkan/SYCL/Metal paged build of a gated-DeltaNet model is ever shipped.
-- RISKY-2 (perf-only, not correctness) 0020 forces MMQ; at a genuine single-stream
-  decode M<=8 (n_seqs=1) MMQ could be slower than MMVQ off the GB10 batched
-  regime. Bit-identical either way. Confirm the reshape still picks the better
-  kernel at n_seqs=1 on non-GB10 archs.
-
-### 3. NVFP4-GGUF + gallery targeting recommendation
-
-Do NOT hardware-gate the entries (and you cannot: LocalAI has no microarch-gating
-primitive - `tags:` are free-text/search-only, `ModelConfig` has no
-hardware/requirements field, and backend `capabilities:` resolves by accelerator
-FAMILY only, serving `nvidia: cuda12-...-paged` to ANY NVIDIA GPU with no
-sub-nvidia resolution). The GGUFs run correctly everywhere via dequant, so the
-risk is PERFORMANCE-EXPECTATION, not correctness; a hard gate would wrongly block
-valid (slow) use. Recommended, in order:
-1. (zero-code, do now) Lead each of the 6 descriptions with one honest line:
-   "Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other
-   NVIDIA/CPU via NVFP4 dequant but WITHOUT FP4-MMA and below the quoted GB10
-   throughput." Temper the "90-117% of vLLM" claims with that caveat (those are
-   LPDDR5x-bandwidth-bound specific).
-2. (zero-code) Tag all six consistently with `nvfp4` + a new `blackwell` chip. The
-   four Qwopus/Qwen-MTP entries currently carry only `[llm, gguf]` and are not even
-   discoverable as NVFP4 despite being NVFP4 GGUFs - secondary correctness-of-metadata gap.
-3. (feature, later) A structured `recommended_hardware` field surfaced by the React
-   import dialog is the only way to express this machine-readably; it does not exist.
-
-### 4. Per-arch roadmap (ranked by value / effort)
-
-- sm_100 datacenter Blackwell - HIGH value, MEDIUM effort. FP4-MMA works so NVFP4
-  stays fast and the precision bucket (0017/0023/0025) carries over, but the BW
-  floor is gone => compute-bound. Needs: re-sweep 0022 GDN_NW/CPW; re-evaluate the
-  0017 kill-gate (levers ready, may flip); expect 0018/0019/0026 bandwidth wins to
-  shrink toward neutral while 0029/0025/0020 host/graph/MMQ wins still help. No
-  code change to be SAFE; a tuning pass to be OPTIMAL.
-- Metal / macOS - MEDIUM value, MEDIUM effort. Add the `includeDarwin`
-  `-metal-darwin-arm64-llama-cpp-localai-paged` row + a `metal:` capability key
-  (changed-backends.js already anticipates the source path). Delivers paged-KV +
-  scheduler value only (no NVFP4-MMA on Metal); still a strict win over today's
-  broken Mac selection. MUST also land RISKY-1 first (Metal would otherwise hit the
-  discriminated-SSM_CONV path if it ever gained an SSM_CONV kernel without the
-  discriminator).
-- CPU - LOW effort, already works. Reference kernels exist for every fused op;
-  paged KV + scheduler + reclaim are the portable value. Nothing to do.
-- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 - MEDIUM value, LOW effort (no FP4
-  work). No FP4-MMA => pair the precision-agnostic infra (paged KV, 0013/0016,
-  0024, 0029, 0018/0019/0021/0028, 0020) with a DIFFERENT quant (Q4_K/AWQ/GPTQ).
-  Messaging: "no NVFP4 here, use another quant, but paged + SSM + scheduler infra
-  is a pure win". The GGUFs/gallery rows are out of scope for these.
-
-### 5. What MUST be empirically verified (and on what hardware)
-
-- GB10 (sm_121, user has it): the validated target; already measured. Re-confirm
-  bit-exactness gates after RISKY-1 fix.
-- M4 Mac (user has it): (a) once an `includeDarwin` paged row exists, verify the
-  Metal build compiles + a NON-qwen35 model runs (paged KV path); (b) verify a
-  qwen35 hybrid model on Metal EITHER asserts loudly OR is correct - it must NOT
-  silently miscompute the discriminated SSM_CONV. This is the direct test of
-  RISKY-1 on real Metal. Do this BEFORE shipping a Metal paged build. Also verify
-  CPU correctness of every fused op on the Mac (arm64 CPU mirror).
-- non-Blackwell NVIDIA (sm_80/86/89/90 - user would need to ACQUIRE, e.g. cloud
-  A100/L4/L40S/H100): verify (a) the cuda12/cuda13 paged image runs a qwen35
-  hybrid model correctly (GDN + ssm_conv_update + gather non-Blackwell kernels),
-  (b) NVFP4 GGUFs load + produce correct output via dequant/DP4A (not garbage),
-  (c) RISKY-2: that 0020's forced MMQ does not regress single-stream (n_seqs=1)
-  decode latency vs MMVQ. This is the only bucket needing hardware acquisition;
-  everything else is covered by the GB10 + M4 the user already has.
-- sm_100 (datacenter Blackwell - cloud B200 if a tuning pass is wanted): only
-  needed to make sm_100 OPTIMAL, not to make it SAFE. Defer unless targeting it.
-
-### 6. SHIP DECISION
-
-SAFE TO SHIP TODAY as a Blackwell-targeted backend on Linux. The build is
-arch-general (same arch fan + variant set as stock llama-cpp), every targeted
-Linux variant builds and runs, and all Blackwell-specific code is default-off +
-`#if`-guarded so a non-Blackwell build is byte-identical to stock on the FP4 path.
-The NVFP4 GGUFs run everywhere via dequant (correct, slower), so broad gallery
-exposure is a performance-expectation issue, not a correctness one.
-
-MINIMUM to not break / mislead other archs:
-1. (correctness, before ANY Vulkan/SYCL/Metal paged build of a gated-DeltaNet
-   model) Close RISKY-1: compute-backend-gate the fused GDN/conv op emission, or
-   add a supports_op guard rejecting the discriminated SSM_CONV. This is the only
-   hard requirement; it is latent on the current Linux targets but becomes live
-   the moment a Metal/Vulkan/SYCL paged build of qwen35 exists.
-2. (availability, zero-risk) Add the `includeDarwin` paged row + `metal:` key so
-   Mac users get a working (paged-KV-only) build instead of a non-running
-   default=cpu selection with no fallthrough to stock.
-3. (expectation, zero-code) Add the Blackwell-recommended hardware note + the
-   "runs slower off-Blackwell via dequant" caveat to the 6 gallery descriptions
-   and tag all six `nvfp4` + `blackwell`.
-4. (perf, verify don't block) Confirm 0020 does not regress n_seqs=1 decode on
-   non-GB10 NVIDIA; if it does, gate the MMVQ->MMQ reshape on a real-M threshold.
-
-Items 2-4 do not block a Linux Blackwell ship. Item 1 blocks only a future
-non-CUDA paged build of a gated-DeltaNet model; on the current build targets the
-hazard is latent (the GDN op asserts first). Net: ship for Blackwell/Linux now;
-land item 1 before extending paged to Metal/Vulkan/SYCL.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/A_HYBRID_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/A_HYBRID_PROGRESS.md
@@ -1,56 +0,0 @@
-# A-build: hybrid per-head f32/bf16 SSM state - BUILD PROGRESS
-
-Label: A-build (GPU agent). Base: DGX `~/llama-paged-dev` branch `paged` HEAD 2f4f5ab (patch 0025),
-plus `BF16_SSM_STATE.diff` applied as the bf16 plumbing base. Goal: per-head mixed-dtype SSM state
-(f32 long-memory heads, bf16 fast heads); default `ssm_hybrid_tau_thresh=0.0` (all-f32, bit-exact; see KNOB SEMANTICS).
-
-## Design recap (from SPEEDUP_HUNT.md A-hybrid-design)
-- Classifier (host, model-load): tau_h = 1/(|ssm_a[il][h]| * softplus(ssm_dt[il][h])); f32 if tau_h>T.
-  ssm_a = SSM_A_NOSCAN = -exp(A_log) (verified qwen35.cpp:376). ssm_dt = SSM_DT bias.
-- Split cache: per GDN layer, s_l (f32, n_f32 heads) + s_l_bf16 (bf16, n_bf16 heads). head_slot map.
-- Kernel: ONE kernel templated +HYBRID; per-block (h_idx) branch on head_slot (uniform, no divergence).
-  Recurrence math byte-for-byte f32-register, untouched. Homogeneous (HYBRID=false) path bit-exact.
-- Op: extra src[8]=state_bf16, src[9]=head_slot; backend detects hybrid = (src[9]!=null).
-- CPU mirror: per-head partition read.
-- test-backend-ops: MIXED case (some heads f32, some bf16) output-append, decode+prefill+keep_rs_t.
-
-## DE-RISK GATE (must pass before sweep)
-1. test-backend-ops GATED_DELTA_NET mixed PASS (CUDA mixed vs CPU mixed).
-2. T_thresh=0 (default, all-f32) greedy md5 == 0023 baseline: dense 5951a5b4d624ce891e22ab5fca9bc439,
-   MoE 07db32c2bcb78d17a43ed18bc22705cd.
-
-## KNOB SEMANTICS (IMPORTANT - brief endpoint wording corrected)
-Rule (brief verbatim + physics + "start 32-64" guidance all agree): a head is kept f32 iff
-tau_h > T_thresh, else bf16. tau_h = 1/(|ssm_a|*softplus(ssm_dt)) in tokens. Long-memory (large tau)
-heads stay f32 (bf16 rounding error does not contract there, so it surfaces as KL); fast (small tau) heads -> bf16.
-- ssm_hybrid_tau_thresh DEFAULT = 0.0  => every tau>0 -> ALL F32 (bit-exact opt-out; the gate runs here).
-- ssm_hybrid_tau_thresh -> +inf        => ALL BF16 (shelved mode).
-- sweep: raise T (16/32/64/128 tokens) to bf16 progressively more (longer-memory) heads = more speed.
-NOTE: the brief's "inf=>all-f32, 0=>all-bf16" sentence is INVERTED vs the operative rule it states
-("keep f32 if tau>T") and vs "start 32-64" + the physics. Correct endpoints: 0=all-f32, inf=all-bf16.
-Implemented the physically-correct rule; default 0.0 = bit-exact all-f32.
-
-## STATUS
-- [x] ggml.h/ggml.c hybrid op builders
-- [x] gated_delta_net.cu hybrid kernel + dispatch (one kernel, +HYBRID template, uniform per-block branch)
-- [x] ops.cpp CPU hybrid read mirror (output-append; ids in-place is GPU-only, asserted)
-- [x] test-backend-ops mixed case (32 cases: hc 4/8 x hs 64/128 x decode/prefill/keep_rs_t x kda)
-- [x] de-risk gate 1: test-backend-ops GATED_DELTA_NET = 84/84 PASS (incl 32 hybrid mixed CUDA-vs-CPU)
-- [x] cparam/CLI ssm_hybrid_tau_thresh plumbing (default 0.0; threaded context->cparams->memory->ctors)
-- [x] memory-recurrent split cache + classifier (validated: real tau split, correct 2-partition layout)
-- [x] delta-net-base hybrid op build (fused ids decode + bf16 rs_zero/extra mirror)
-- [x] full build clean (sm_121; llama-completion/batched-bench/perplexity/test-backend-ops)
-- [x] de-risk gate 2 (default/all-f32 md5 == 0023 both models, re-verified post-build)
-- [x] hybrid-ON decode FIXED: the incoherence was head_slot being zeroed by clear(data=true) (whole-RS
-      buffer clear) after warm-up, never re-uploaded => every head -> f32-local-0 => split collapse.
-      Fix = persist head_slot_host + re-upload via upload_head_slots() after every buffer clear. Hybrid
-      decode now coherent; cross-op carry verified BYTE-EXACT (write==read both partitions).
-- [x] A-gatesweep DONE: KL sweep T in {0.25..128} both models, single-seq c1024 (clean carry), drift.
-      SHIP GATE FAILS - no T passes MeanKLD<1e-3 AND top-p>=99.5% with meaningful speedup. Premise
-      (error concentrates in long-memory heads) REFUTED: KL scales with bf16 COUNT and saturates
-      ~0.06/~91% (MoE saturates at the minimal split). Carry byte-exact => genuine bf16 sensitivity,
-      not a bug. Throughput lever real: dense +12.4% / MoE +11.5% decode @npl128 at T=128.
-- [x] Shipped default-off (f32, bit-exact). De-risk gates re-verified on the clean build (84/84;
-      md5 == baseline both models). See A_HYBRID_SSM_RESULTS.md for the full tables.
-
-Committed: DGX paged 33e7c65 (amended); worktree patch 0026 + A_HYBRID_SSM_RESULTS.md + this doc.
--- a/backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md
@@ -1,119 +0,0 @@
-# A - HYBRID PER-HEAD f32/bf16 SSM STATE - BUILD + DE-RISK + GATE-SWEEP RESULTS
-
-Label: A-build + A-gatesweep. Lands as patch 0026 on top of 0025 (DGX HEAD 2f4f5ab),
-incorporating the bf16-SSM-state plumbing as the base. Built into `~/llama-paged-dev/build-cuda`
-(sm_121); committed on the DGX `paged` branch (33e7c65, amended) and as
-`patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch` + this doc in the worktree.
-
-## VERDICT
-
-The hybrid machinery is **CORRECT and complete** (both de-risk gates PASS; the carry is byte-exact;
-the previously-open decode-incoherence bug is FIXED). The **ship gate FAILS**: no T_thresh reaches
-`MeanKLD < 1e-3 AND Same-top-p >= 99.5%` for both models with any meaningful speedup. The design
-premise - that the bf16 KL error concentrates in long-memory heads and is removed by keeping them
-f32 at f32-fraction 0.30-0.50 - is **empirically refuted** on q36-27b and q36-35b-a3b-nvfp4: the KL
-error scales with the bf16 head COUNT and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any
-useful byte-saving. The bf16 byte-saving (and the decode speedup it buys) is real but cannot meet the
-strict KL bar. **Shipped default-off (f32, bit-exact opt-out); the hybrid is opt-in only.**
-
-## THE FIX (was: hybrid-ON decode incoherent)
-
-Root cause: `llama_memory_recurrent::clear(data=true)` zeroes the WHOLE recurrent backend buffer with
-`ggml_backend_buffer_clear`, which includes the per-layer `head_slot` maps. Those maps were uploaded
-only once in the constructor. llama.cpp calls `clear(true)` to reset state after the warm-up run (and
-on context resets), so by the time real prefill/decode runs, every `head_slot[h] == 0`. The kernel
-decodes `head_slot==0` as "f32 head, local index 0", so EVERY head reads/writes f32-partition slot 0:
-the split collapses (the bf16 partition is never written, every head collides on one f32 slot) and the
-output is garbage. Warm-up showed correct values precisely because it ran before the clear.
-
-Fix: persist the host-side maps (`head_slot_host`) and re-upload them after every buffer clear via a
-new `upload_head_slots()` (called both at construction and at the end of `clear(true)`). 22 lines in
-`src/llama-memory-recurrent.cpp` + 7 in the header. After the fix:
-- head_slot reads back correct in every forward (e.g. `0 1 -1 -2`), in both llama-completion and
-  llama-perplexity;
-- the bf16 partition is written (non-zero) every step;
-- the cross-op state carry is **byte-exact**: at a continuation forward the op reads back EXACTLY what
-  the prior op wrote, element-for-element, in BOTH partitions (f32 `[0]=0.00303 [1]=-0.00074
-  [16384]=0.00054`, bf16 `[0]=-0.00023 [1]=0.00008 [16384]=0.00269` write == read), confirming there
-  is no addressing/scramble/corruption bug. The only residual difference from f32 is the bf16 rounding
-  of the bf16-partition heads.
-
-## DE-RISK GATES - both PASS (re-verified on the final clean build)
-
-1. **test-backend-ops GATED_DELTA_NET = 84/84 PASS, CUDA0 OK** (incl. the 32 mixed-dtype hybrid cases
-   vs CPU: head_count {4,8} x head_size {64,128} x {decode, prefill 33/64/100, keep_rs_t K=4} x kda).
-2. **T=0 (default, all-f32) greedy md5 == 0023 baseline, both models**, NO `--ssm-bf16-tau`:
-   - dense q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == baseline
-   - MoE   q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == baseline
-   The bit-exact opt-out is preserved byte-for-byte.
-
-## SHIP GATE - the KL/throughput sweep (FAILS)
-
-KL harness = the bf16-work GateBench: `llama-perplexity --kl-divergence` on wikitext-2-raw,
-`-ngl 99 -fa on --seed 1`, base = T=0 (f32). The clean carry config is single-sequence
-`-b 1024 -ub 512 -c 1024 --chunks 8` (one cross-ubatch bf16 round-trip; f32-vs-f32 floor = 100.000%
-same-top-p, MeanKLD ~ -1.2e-5). Gate: `MeanKLD < 1e-3 AND Same-top-p >= 99.5% AND bounded drift`.
-
-### Dense q36-27b-nvfp4 (H_v=48), c1024 single-seq
-
-| T_thresh | bf16 heads | f32-frac | f_bytes | MeanKLD  | Same-top-p |
-|---------:|-----------:|--------:|--------:|---------:|-----------:|
-| 0.25     | 14         | 0.964   | 0.982   | 0.00270  | 98.92%     |
-| 0.5      | 48         | 0.963   | 0.982   | 0.01439  | 96.18%     |
-| 1        | 118        | 0.935   | 0.968   | 0.06357  | 91.59%     |
-| 8        | ~610       | 0.735   | 0.868   | 0.05669  | 91.59%     |
-| 32       | ~1113      | 0.517   | 0.759   | 0.05724  | 90.97%     |
-| 64       | ~1304      | 0.434   | 0.717   | 0.06183  | 91.85%     |
-| 128      | ~1460      | 0.366   | 0.683   | 0.05980  | 91.56%     |
-
-Monotonic below the knee (T<=1), then a flat plateau. Best meaningful point T=0.25 (only ~1.8% byte
-saving) already FAILS both criteria (KLD 0.0027 > 1e-3; top-p 98.92% < 99.5%). To pass the gate the
-bf16 count must be < ~14 heads (f_bytes > 0.98) => no speedup.
-
-### MoE q36-35b-a3b-nvfp4 (H_v=32), c1024 single-seq
-
-| T_thresh | bf16 heads | f32-frac | f_bytes | MeanKLD  | Same-top-p |
-|---------:|-----------:|--------:|--------:|---------:|-----------:|
-| 0.25     | 23         | 0.940   | 0.970   | 0.03907  | 91.61%     |
-| 0.5      | 53         | 0.928   | 0.964   | 0.04620  | 90.31%     |
-| 1        | 78         | 0.910   | 0.955   | 0.04425  | 89.82%     |
-| 32       | 585        | 0.391   | 0.695   | 0.04552  | 90.09%     |
-
-MoE has NO low-KL regime: even the minimal split (23 bf16 heads, ~3% byte saving) is already at the
-~0.045 / ~91% plateau. Fails the gate everywhere by a wide margin.
-
-### Why it fails (the refutation)
-
-The carry is byte-exact, so this is genuine bf16 rounding of the recurrent state, not a bug. The
-gated-DeltaNet logit is extremely sensitive to ANY perturbation of the temporal state: even rounding a
-handful of small-magnitude heads to bf16 flips ~9% of hard-wikitext argmaxes, and adding more bf16
-heads does not flip materially more (saturation - the flips concentrate in an inherently-marginal
-token pool). This matches the prior whole-bf16 finding (MeanKLD 0.05-0.17, top-p ~90%, "bounded but
-LARGE, plateaus with context"). The error is NOT concentrated by tau, so f32-ing the long-memory heads
-(or the fast heads: the inverted classifier was also tested and gives the same plateau) does not recover the gate.
-
-## THROUGHPUT - the byte-saving lever IS real (but KL-gated out)
-
-`llama-batched-bench -fa on -npp 128 -ntg 128 -npl 128`, `LLAMA_KV_PAGED=1`, decode_agg = S_TG t/s:
-
-| model | T=0 (f32) | T=128 (f_bytes ~0.68) | gain   |
-|-------|----------:|----------------------:|-------:|
-| dense | 529.0     | 594.4                 | +12.4% |
-| MoE   | 1110.7    | 1238.1                | +11.5% |
-
-So the split delivers the predicted recurrence-bandwidth win (~+12% decode at f_bytes ~0.68), but only
-at T values whose KL is ~0.06 / ~91% top-p. There is no operating point with both a real speedup and a
-passing KL.
-
-## RECOMMENDATION
-
- Ship 0026 as-is: **default `ssm_hybrid_tau_thresh = 0.0` (f32, bit-exact)**; the hybrid is opt-in via
-  `--ssm-bf16-tau` for callers who explicitly accept reduced precision for ~+12% decode. Do NOT put a
-  hybrid T in the gallery/recommended config - it does not pass the KL bar.
- Lever A is closed as a KL-passing speedup: the GDN recurrent state does not tolerate bf16 on a
-  head-subset basis. Speed beyond the f32 recurrence must come from elsewhere (the MoE FP4 GEMM /
-  re-graph levers, or NVFP4-dense quant), not from bf16-ing the SSM state.
- If a product later accepts a looser bar (e.g. top-p >= 95%), dense T=0.5 (96.18%, f_bytes 0.982) is
-  the only near-miss and buys only a ~2% byte saving - still not worth it; MoE has nothing.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/BENCHMARK_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/BENCHMARK_PROGRESS.md
@@ -1,92 +0,0 @@
-# FINAL apples-to-apples NVFP4 benchmark (GB10 / DGX Spark) - CLEAN env, containers stopped
-# llama 0023 clean f7409c2 | LLAMA_KV_PAGED=1, LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget; beats stock 394s->142s TTFT@npl32), CUDA graphs ON, -c 131072 --parallel 128 -b 2048 -ub 512 -fa on
-# vLLM 0.23.0 | CUDA graphs ON (no enforce-eager), util 0.85, max-model-len 4096, max-num-seqs 256, tp1
-# client h2h_cli3.py: 512-tok UNIQUE-nonce prompt (fresh full prefill, defeats prefix caching), max_tokens=256, temp0, ignore_eos, stream+usage
-# llama restarts server PER NPL (paged-pool degrades after high-npl bursts); vllm one server/combo + npl8 re-check. 1 measured pass/npl + ptok8 graph warmup. peak_gb engine = PEAK-PRE.
-# started Fri Jun 26 04:43:38 AM CEST 2026 baseline=3.29 GB
-
-[2026-06-26 04:43:38] [dense_llama] ==== START dense_llama (llama) baseline_mem=3.29 ====
-[2026-06-26 04:43:38] [dense_llama] NPL=8 launching server PRE_GB=3.29
-[2026-06-26 04:43:48] [dense_llama] NPL=8 ready LOADED_GB=47.06
-[2026-06-26 04:43:55] [dense_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1.  **Analyze User Input:** The user says "The capital of France is". This is a straightforward factual question with a clear answer.\n2.  **Identify Key Entity:** France (country)\n3.  **Identify Question Type:** Capit
-[2026-06-26 04:44:30] [dense_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2040, "prompt_tok_total": 4195, "gen_per_req": 255.0, "agg_tps": 61.8, "decode_agg_tps": 82.5, "decode_perseq_tps": 9.57, "prefill_tps": 507.3, "ttft_mean_ms": 6038.1, "ttft_max_ms": 8270.0, "wall_s": 32.999}
-[2026-06-26 04:44:30] [dense_llama] NPL=8 PEAK_GB=53.51
-[2026-06-26 04:44:35] [dense_llama] NPL=8 server stopped mem=3.31
-[2026-06-26 04:44:35] [dense_llama] NPL=32 launching server PRE_GB=3.31
-[2026-06-26 04:44:40] [dense_llama] NPL=32 ready LOADED_GB=46.96
-[2026-06-26 04:47:55] [dense_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8180, "prompt_tok_total": 16900, "gen_per_req": 255.6, "agg_tps": 43.2, "decode_agg_tps": 192.6, "decode_perseq_tps": 4.79, "prefill_tps": 115.0, "ttft_mean_ms": 133551.7, "ttft_max_ms": 147007.0, "wall_s": 189.49}
-[2026-06-26 04:47:55] [dense_llama] NPL=32 PEAK_GB=69.63
-[2026-06-26 04:48:01] [dense_llama] NPL=32 server stopped mem=3.32
-[2026-06-26 04:48:01] [dense_llama] NPL=64 launching server PRE_GB=3.32
-[2026-06-26 04:48:11] [dense_llama] NPL=64 ready LOADED_GB=46.97
-[2026-06-26 04:55:10] [dense_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16382, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 39.8, "decode_agg_tps": 277.8, "decode_perseq_tps": 3.09, "prefill_tps": 95.9, "ttft_mean_ms": 321618.8, "ttft_max_ms": 352633.6, "wall_s": 411.603}
-[2026-06-26 04:55:10] [dense_llama] NPL=64 PEAK_GB=83.96
-[2026-06-26 04:55:16] [dense_llama] NPL=64 server stopped mem=3.30
-[2026-06-26 04:55:16] [dense_llama] NPL=128 launching server PRE_GB=3.30
-[2026-06-26 04:55:21] [dense_llama] NPL=128 ready LOADED_GB=47.09
-[2026-06-26 05:13:18] [dense_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32767, "prompt_tok_total": 67969, "gen_per_req": 256.0, "agg_tps": 30.9, "decode_agg_tps": 384.6, "decode_perseq_tps": 1.86, "prefill_tps": 69.7, "ttft_mean_ms": 902762.7, "ttft_max_ms": 975832.6, "wall_s": 1061.031}
-[2026-06-26 05:13:18] [dense_llama] NPL=128 PEAK_GB=93.82
-[2026-06-26 05:13:25] [dense_llama] NPL=128 server stopped mem=3.31
-[2026-06-26 05:13:25] [dense_llama] ==== DONE dense_llama POST_GB=3.31 ====
-[2026-06-26 05:13:25] [dense_vllm] ==== START dense_vllm (vllm) baseline_mem=3.31 ====
-[2026-06-26 05:13:25] [dense_vllm] launching vllm PRE_GB=3.31
-[2026-06-26 05:21:15] [dense_vllm] vllm ready LOADED_GB=110.48
-[2026-06-26 05:21:27] [dense_vllm] GATE='Here\'s a thinking process:\n\n1.  **Analyze User Input:** The user says "The capital of France is"\n2.  **Identify Key Entity/Question:** The question is asking for the capital city of France.\n3.  **Retrieve Knowledge:** I know from general knowledge that t
-[2026-06-26 05:21:59] [dense_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1959, "prompt_tok_total": 4195, "gen_per_req": 244.9, "agg_tps": 65.6, "decode_agg_tps": 70.4, "decode_perseq_tps": 8.76, "prefill_tps": 2096.2, "ttft_mean_ms": 1861.1, "ttft_max_ms": 2000.6, "wall_s": 29.843}
-[2026-06-26 05:21:59] [dense_vllm] NPL=8 PEAK_GB=110.92
-[2026-06-26 05:22:47] [dense_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8165, "prompt_tok_total": 16900, "gen_per_req": 255.2, "agg_tps": 176.3, "decode_agg_tps": 211.8, "decode_perseq_tps": 6.28, "prefill_tps": 2182.6, "ttft_mean_ms": 5353.2, "ttft_max_ms": 7741.4, "wall_s": 46.302}
-[2026-06-26 05:22:47] [dense_vllm] NPL=32 PEAK_GB=110.87
-[2026-06-26 05:23:59] [dense_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16314, "prompt_tok_total": 33828, "gen_per_req": 254.9, "agg_tps": 236.5, "decode_agg_tps": 309.1, "decode_perseq_tps": 4.38, "prefill_tps": 2088.9, "ttft_mean_ms": 9512.4, "ttft_max_ms": 16191.0, "wall_s": 68.976}
-[2026-06-26 05:23:59] [dense_vllm] NPL=64 PEAK_GB=110.88
-[2026-06-26 05:25:57] [dense_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32640, "prompt_tok_total": 67969, "gen_per_req": 255.0, "agg_tps": 288.4, "decode_agg_tps": 418.8, "decode_perseq_tps": 2.79, "prefill_tps": 1929.1, "ttft_mean_ms": 18449.5, "ttft_max_ms": 35227.7, "wall_s": 113.162}
-[2026-06-26 05:25:57] [dense_vllm] NPL=128 PEAK_GB=110.95
-[2026-06-26 05:26:27] [dense_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 2044, "prompt_tok_total": 4187, "gen_per_req": 255.5, "agg_tps": 68.1, "decode_agg_tps": 73.4, "decode_perseq_tps": 9.07, "prefill_tps": 1921.9, "ttft_mean_ms": 1877.6, "ttft_max_ms": 2178.1, "wall_s": 30.018}
-[2026-06-26 05:26:35] [dense_vllm] ==== DONE dense_vllm POST_GB=3.53 ====
-[2026-06-26 05:26:35] [moe_llama] ==== START moe_llama (llama) baseline_mem=3.53 ====
-[2026-06-26 05:26:35] [moe_llama] NPL=8 launching server PRE_GB=3.53
-[2026-06-26 05:26:50] [moe_llama] NPL=8 ready LOADED_GB=36.42
-[2026-06-26 05:26:52] [moe_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1.  **Analyze User Input:**\n   - User says: "The capital of France is"\n   - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2.  **Identify Key Information:*
-[2026-06-26 05:27:06] [moe_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2048, "prompt_tok_total": 4195, "gen_per_req": 256.0, "agg_tps": 156.8, "decode_agg_tps": 211.8, "decode_perseq_tps": 24.45, "prefill_tps": 1236.4, "ttft_mean_ms": 2477.1, "ttft_max_ms": 3392.9, "wall_s": 13.061}
-[2026-06-26 05:27:06] [moe_llama] NPL=8 PEAK_GB=39.66
-[2026-06-26 05:27:11] [moe_llama] NPL=8 server stopped mem=3.34
-[2026-06-26 05:27:11] [moe_llama] NPL=32 launching server PRE_GB=3.34
-[2026-06-26 05:27:16] [moe_llama] NPL=32 ready LOADED_GB=36.54
-[2026-06-26 05:27:54] [moe_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8192, "prompt_tok_total": 16900, "gen_per_req": 256.0, "agg_tps": 235.6, "decode_agg_tps": 393.0, "decode_perseq_tps": 10.02, "prefill_tps": 1213.9, "ttft_mean_ms": 8225.2, "ttft_max_ms": 13921.9, "wall_s": 34.768}
-[2026-06-26 05:27:54] [moe_llama] NPL=32 PEAK_GB=47.11
-[2026-06-26 05:28:00] [moe_llama] NPL=32 server stopped mem=3.30
-[2026-06-26 05:28:00] [moe_llama] NPL=64 launching server PRE_GB=3.30
-[2026-06-26 05:28:05] [moe_llama] NPL=64 ready LOADED_GB=36.39
-[2026-06-26 05:29:10] [moe_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16384, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 271.0, "decode_agg_tps": 527.0, "decode_perseq_tps": 6.15, "prefill_tps": 1152.3, "ttft_mean_ms": 15849.5, "ttft_max_ms": 29356.9, "wall_s": 60.449}
-[2026-06-26 05:29:10] [moe_llama] NPL=64 PEAK_GB=57.13
-[2026-06-26 05:29:16] [moe_llama] NPL=64 server stopped mem=3.28
-[2026-06-26 05:29:16] [moe_llama] NPL=128 launching server PRE_GB=3.28
-[2026-06-26 05:29:21] [moe_llama] NPL=128 ready LOADED_GB=36.48
-[2026-06-26 05:34:19] [moe_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32760, "prompt_tok_total": 67969, "gen_per_req": 255.9, "agg_tps": 112.7, "decode_agg_tps": 726.4, "decode_perseq_tps": 3.73, "prefill_tps": 276.8, "ttft_mean_ms": 213017.2, "ttft_max_ms": 245528.7, "wall_s": 290.634}
-[2026-06-26 05:34:19] [moe_llama] NPL=128 PEAK_GB=61.51
-[2026-06-26 05:34:25] [moe_llama] NPL=128 server stopped mem=3.28
-[2026-06-26 05:34:25] [moe_llama] ==== DONE moe_llama POST_GB=3.28 ====
-[2026-06-26 05:34:25] [moe_vllm] ==== START moe_vllm (vllm) baseline_mem=3.28 ====
-[2026-06-26 05:34:25] [moe_vllm] launching vllm PRE_GB=3.28
-[2026-06-26 05:39:35] [moe_vllm] vllm ready LOADED_GB=109.46
-[2026-06-26 05:39:38] [moe_vllm] GATE='Here\'s a thinking process:\n\n1.  **Analyze User Input:**\n   - User says: "The capital of France is"\n   - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2.  **Identify Key Information:**\n   - C
-[2026-06-26 05:39:47] [moe_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1900, "prompt_tok_total": 4195, "gen_per_req": 237.5, "agg_tps": 231.2, "decode_agg_tps": 256.5, "decode_perseq_tps": 31.84, "prefill_tps": 5186.5, "ttft_mean_ms": 768.8, "ttft_max_ms": 808.2, "wall_s": 8.217}
-[2026-06-26 05:39:47] [moe_vllm] NPL=8 PEAK_GB=109.62
-[2026-06-26 05:40:07] [moe_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 7794, "prompt_tok_total": 16900, "gen_per_req": 243.6, "agg_tps": 426.4, "decode_agg_tps": 500.8, "decode_perseq_tps": 14.9, "prefill_tps": 6223.4, "ttft_mean_ms": 1830.4, "ttft_max_ms": 2714.2, "wall_s": 18.28}
-[2026-06-26 05:40:07] [moe_vllm] NPL=32 PEAK_GB=109.63
-[2026-06-26 05:40:37] [moe_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 15927, "prompt_tok_total": 33828, "gen_per_req": 248.9, "agg_tps": 550.7, "decode_agg_tps": 686.1, "decode_perseq_tps": 9.83, "prefill_tps": 5926.5, "ttft_mean_ms": 3224.4, "ttft_max_ms": 5704.9, "wall_s": 28.92}
-[2026-06-26 05:40:37] [moe_vllm] NPL=64 PEAK_GB=109.63
-[2026-06-26 05:41:27] [moe_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 31795, "prompt_tok_total": 67969, "gen_per_req": 248.4, "agg_tps": 650.7, "decode_agg_tps": 882.2, "decode_perseq_tps": 6.05, "prefill_tps": 5300.5, "ttft_mean_ms": 6487.7, "ttft_max_ms": 12817.8, "wall_s": 48.863}
-[2026-06-26 05:41:27] [moe_vllm] NPL=128 PEAK_GB=109.64
-[2026-06-26 05:41:36] [moe_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 1702, "prompt_tok_total": 4187, "gen_per_req": 212.8, "agg_tps": 207.2, "decode_agg_tps": 226.4, "decode_perseq_tps": 28.06, "prefill_tps": 6021.3, "ttft_mean_ms": 642.7, "ttft_max_ms": 694.8, "wall_s": 8.213}
-[2026-06-26 05:41:44] [moe_vllm] ==== DONE moe_vllm POST_GB=3.31 ====
-
-==== ALL 16 ROWS COLLECTED (2 models x 2 engines x 4 npl) ====
-decode_agg t/s (llama | vLLM | llama%vLLM):
- DENSE q36-27b-nvfp4:  npl8 82.5|70.4|117%  npl32 192.6|211.8|91%  npl64 277.8|309.1|90%  npl128 384.6|418.8|92%
- MoE   q36-35b-a3b:    npl8 211.8|256.5|83%  npl32 393.0|500.8|78%  npl64 527.0|686.1|77%  npl128 726.4|882.2|82%
-peak_gb (llama grows on demand | vLLM holds a fixed ~107 GB pool):
- DENSE llama 53.5->93.8 ; vLLM ~110.9 flat
- MoE   llama 39.7->61.5 ; vLLM ~109.6 flat
-Final CSV: final_benchmark.csv ; analysis: QWEN36_NVFP4_BENCH.md (FINAL section).
-Cleanup: no leftover server/bench PIDs; GPU free (memnow 3.28 GB); local-ai + local-ai-worker
-containers restarted (host returned). DONE.
--- a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PLAN.md
+++ b/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PLAN.md
@@ -1,628 +0,0 @@
-# bf16 SSM-state cache: BUILD PLAN (PART C synthesis - hand this to the build agent)
-
-Status: READ-ONLY design. Lands ON TOP of patch 0021 (conv-state in-place fusion, building
-concurrently on the GPU). DEFAULT = bf16 SSM recurrent state, f32 opt-out. This PART C is the
-executive build brief: ordered edits, acceptance gate, bench targets, semantics/back-compat/risk
-register, and the de-risk-first item. PART A (cparams wiring), PART B (kernel/op plumbing) and the
-Appendix (upstream precedent + numeric safety) below are the detailed reference each step points into.
-
-The decision (settled by GDN_RECURRENCE_BYTE_GATE.md): the gated-DeltaNet recurrence is the dominant
-decode kernel (51.6% of the step, 805 MB f32 state R+W/call at 74% of GB10 peak BW) and is ALREADY
-single-pass (measured re-stream ~1.0x, hard-capped <=1.33x). The whole ~2x DRAM gap vs vLLM is purely
-f32(llama) vs bf16(vLLM) state-cache WIDTH, not extra passes. Narrowing the persisted SSM state to
-bf16 (load->f32, recurrence math in f32 UNCHANGED, store->bf16) halves the dominant term and reaches
-vLLM parity-to-ahead. vLLM's own GDN state cache is bf16, so this is a fair equal-precision change.
-
-## C.0 Synthesis decisions that OVERRIDE the per-part text
-
-1. v1 ships `type_s` = BF16 (SSM recurrent state, the 805 MB lever) and KEEPS `type_r` = F32 (conv
-   state). Reason: `ggml_concat` at prefill (`build_conv_state`, delta-net-base.cpp:472) requires
-   same-type operands; a bf16 conv cache breaks the f32 `qkv_mixed` concat. Conv state is ~12.6 MB
-   (launch-bound, ~0 ms byte benefit), so keeping it f32 costs nothing. This OVERRIDES PART A §3a/§3b,
-   which set BOTH defaults to BF16: in v1 set the `type_r` / `cache_type_conv` DEFAULT to
-   `GGML_TYPE_F32`. `type_r`=bf16 is a v2 follow-up (needs an f32 staging view before the prefill
-   concat - PART B §B.6).
-2. Keep ALL transient/scratch tensors f32: the GDN op OUTPUT scratch (ggml.c:6327), the 0019 gather
-   scratch, and the keep_rs_t prefill snapshot. ONLY the PERSISTED cache rows narrow to bf16 (the
-   src[5] read view and the src[6] in-place write view).
-3. The C.2 KL gate REPLACES the bit-exact md5 gate for the bf16 default: bf16 is intentionally non-bit-exact
-   vs llama f32 (it is equal precision to vLLM's bf16). The 0018/0019 md5 gate STILL applies to (a)
-   patch 0021's conv fusion and (b) verifying the f32 opt-out path is byte-identical to the pre-bf16
-   f32 baseline.
-
-## C.1 Ordered file-by-file edit list (build order, on top of 0021)
-
-Order is dependency- and de-risk-driven: prove the kernel dtype-correct in ISOLATION before flipping
-any default. Section refs point into PART A / PART B below.
-
-STEP 1 - kernel + op made dtype-generic (the load/store conversion), validated standalone:
- 1a `ggml/src/ggml.c` - relax the F32-only state asserts to {F32,BF16} in the 3 GDN builders:
-  `ggml_gated_delta_net` (~6308), `_inplace` (~6370), `_inplace_ids` (~6430), on `state` and
-  `src_state_dst`. KEEP the op OUTPUT scratch F32 (6327). [PART B §B.2]
- 1b `ggml/src/ggml-cuda/ggml-cuda.cu` - `supports_op` `GGML_OP_GATED_DELTA_NET` (~3096): permit a
-  BF16 `src[5]`/`src[6]`. [PART B §B.3]
- 1c `ggml/src/ggml-cuda/gated_delta_net.cu` - template kernel+gather+launch on `bool STATE_BF16`;
-  `#include <cuda_bf16.h>`. LOAD `__bfloat162float` (~102), STORE `__float2bfloat16` (~207), GATHER
-  bf16->f32 scratch (~20). Cast `src_state`/`src_state_dst` pointers to `nv_bfloat16` on bf16; relax
-  dispatcher asserts (309-311) `sizeof(float)` -> `ggml_type_size(type)`. Keep gather scratch +
-  keep_rs_t snapshot f32. ALL recurrence math (106-200) UNCHANGED in f32 registers. [PART B §B.4,§B.8]
- 1d `ggml/src/ggml-cpu/ops.cpp` - matching bf16 load/store branch in the GDN reference (10726/10744/
-  10891 load via `GGML_BF16_TO_FP32`, 10758-10762 store via `GGML_FP32_TO_BF16`); relax `nb[]` asserts
-  to `ggml_type_size(type)`. [PART B §B.5]
- 1e `tests/test-backend-ops.cpp` - add a BF16-state `GATED_DELTA_NET` case covering BOTH `n_tokens==1`
-  decode AND a multi-token (prefill/chunk) + `keep_rs_t==true` path, CUDA bf16 vs CPU bf16 reference.
-  THIS IS THE DE-RISK GATE for Step 1 (see C.5). Build + pass before Step 2.
-
-STEP 2 - cparams selection wiring (llama.cpp core):
- 2a `include/llama.h` (after :366) - add `enum ggml_type type_s;` and `type_r;` adjacent to
-  `type_k`/`type_v`, marked `[EXPERIMENTAL]`. [PART A §3a]
- 2b `src/llama-context.cpp:3468` (`llama_context_default_params`) - add `/*.type_s =*/ GGML_TYPE_BF16,`
-  and `/*.type_r =*/ GGML_TYPE_F32,`. THIS IS THE DEFAULT CHANGE (type_r stays F32 per C.0). [PART A §3a]
- 2c `src/llama-memory.h:19` (`struct llama_memory_params`) - add `ggml_type type_r;` and `type_s;`.
-  [PART A §3a]
- 2d `src/llama-context.cpp:325` (`params_mem` init) - pass `params.type_r` / `params.type_s`. [PART A §3a]
- 2e `src/llama-model.cpp` - replace the 3 hardcoded `GGML_TYPE_F32` pairs (2056-57 recurrent, 2098-99
-  hybrid_iswa, 2117-18 hybrid = the qwen35/qwen35moe path) with `params.type_r` / `params.type_s`.
-  [PART A §2/§3a]
-
-STEP 3 - back-compat for saved recurrent state (REQUIRED, the default flips):
- 3a `src/llama-memory-recurrent.cpp` `state_read_data` - on `s_type_i_ref != live type` with both in
-  {F32,BF16}, CONVERT row-by-row during load instead of returning false (same for `r`). Bump the
-  recurrent state-file version. [PART A §5, option A]
-
-STEP 4 - CLI / llama-server surface (needed by the gate harness):
- 4a `common/common.h:566` region - `cache_type_ssm = GGML_TYPE_BF16;` and
-  `cache_type_conv = GGML_TYPE_F32;` (conv default F32 per C.0). [PART A §3b]
- 4b `common/common.cpp:1589` region - `cparams.type_s = params.cache_type_ssm;` and
-  `cparams.type_r = params.cache_type_conv;`. [PART A §3b]
- 4c `common/arg.cpp` (after :2074) - add `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv`
-  via the existing `kv_cache_type_from_str` (arg.cpp:402); confirm `bf16` -> `GGML_TYPE_BF16`. The C.2
-  harness depends on `--cache-type-ssm {f32,bf16}`. [PART A §3b]
-
-STEP 5 - LocalAI gRPC / YAML (force f32 from model config):
- 5a `backend/backend.proto` - `string CacheTypeSSM` / `CacheTypeConv` (next free tags after 64);
-  regen proto. [PART A §3c]
- 5b `backend/cpp/llama-cpp/grpc-server.cpp:504` region - `params.cache_type_ssm =
-  kv_cache_type_from_str(request->cachetypessm());` + conv. [PART A §3c]
- 5c `core/config/model_config.go:935` - `CacheTypeSSM`/`CacheTypeConv` yaml fields. [PART A §3c]
- 5d `core/backend/options.go:247` - map into the request. [PART A §3c]
- 5e `core/config/meta/registry.go` + `build_test.go` - register `cache_type_ssm`/`cache_type_conv`
-  as static fields (gate). [PART A §3c]
-
-STEP 6 - capability fallback (heterogeneous / CPU-offload safety):
- 6a `src/llama-context.cpp:518-595` - an `auto_fgdn`-style device-match probe: if a participating
-  device lacks the bf16 GDN load/store specialization (CPU-offloaded GDN layer, non-GB10 backend),
-  demote `type_s` to F32 BEFORE alloc and log once. [PART A §4]
-
-## C.2 Acceptance gate (REPLACES the bit-exact md5 gate)
-
-bf16 is intentionally non-bit-exact, so the 0018/0019 md5 byte-equality gate does NOT apply to the
-bf16 default. The gate is teacher-forced KL-divergence + PPL-delta + greedy coherence + a
-long-context drift sweep, vs the SAME model run f32. All commands on `dgx.casa` (DO NOT run during
-this design - GPU busy). Binaries `~/llama-paged-dev/build*/bin`; models `~/bench/q36-27b-nvfp4.gguf`
-(dense) and `~/bench/q36-35b-a3b-nvfp4.gguf` (MoE); scratch `~/bench/klgate`.
-
-Why teacher-forced (not self-greedy): a self-greedy decode lets each precision pick its own argmax,
-so after the first divergence the contexts differ and per-token logits are no longer comparable (you
-measure trajectory divergence, not numeric error). `llama-perplexity --kl-divergence` feeds both
-precisions the IDENTICAL token stream and compares output distributions position-by-position; the
-greedy trajectory is validated SEPARATELY by the Same-top-p metric + a coherence read.
-
-Corpus (one-time): wikitext-2 raw test (~280k tokens) into `~/bench/klgate`. KL mode needs
->= 2*n_ctx tokens; any fixed >=8k-token UTF-8 file works as long as base AND test share it.
-
-256-token headline gate (per model; shown for dense):
-```
-M=~/bench/q36-27b-nvfp4.gguf; F=~/bench/klgate/wikitext-2-raw/wiki.test.raw; D=~/bench/klgate
-COMMON="-m $M -f $F -c 256 -b 256 -ngl 99 -fa on --seed 1 --chunks 32"
-# (a) f32 BASE: reference logits + f32 PPL
-llama-perplexity $COMMON --cache-type-ssm f32  --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32.c256.base.log
-# (b) bf16 TEST: KL(bf16||f32) + bf16 PPL + Same-top-p
-llama-perplexity $COMMON --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.bf16.c256.kl.log
-```
-Noise floor (mandatory; run right after the f32 base (a), whose `.kld` it reads, and before judging (b) - GPU reductions are not
-bit-deterministic, so KLD has a non-zero floor; bf16 is judged against BOTH the absolute threshold AND this floor):
-```
-llama-perplexity $COMMON --cache-type-ssm f32 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32f32.floor.log
-```
-Record `Mean KLD_floor` and `Same-top-p_floor` (expect KLD ~1e-6..1e-5, top-p ~100%).
-
-Coherence spot-check (greedy trajectory, reuses the 0018/0019 `--temp 0 --seed 1` convention):
-```
-P="Explain how a transformer language model generates text, step by step."
-for T in f32 bf16; do llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm $T 2>/dev/null > $D/q27.greedy.$T.txt; done
-diff $D/q27.greedy.f32.txt $D/q27.greedy.bf16.txt && echo "GREEDY BYTE-IDENTICAL"
-```
-Long-context drift sweep (verifies the g<1 decay bound: bf16 state-rounding error must stay FLAT, not
-accumulate, as context grows - the GDN state spans the whole window):
-```
-for C in 256 1024 2048 4096; do
-  CMN="-m $M -f $F -c $C -b $C -ngl 99 -fa on --seed 1 --chunks 8"
-  llama-perplexity $CMN --cache-type-ssm f32  --kl-divergence-base $D/q27.f32.c$C.kld >/dev/null
-  llama-perplexity $CMN --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c$C.kld | tee $D/q27.bf16.c$C.kl.log
-done
-```
-f32 opt-out verification (the safety valve must actually select f32 and reproduce the committed f32
-greedy md5 from 0018/0019 - the bf16 default must NOT change the f32-path output):
-```
-llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm f32 2>/dev/null | md5sum  # == 0018/0019 f32 baseline md5
-```
-Repeat the WHOLE gate verbatim for the MoE model (`M=~/bench/q36-35b-a3b-nvfp4.gguf`).
-
-PASS/FAIL (bf16 ships as DEFAULT only if ALL rows pass for BOTH dense and MoE):
-
-| metric | source | PASS threshold |
-|---|---|---|
-| Mean KLD | 256-gate (b) | **< 1e-3 nats** (hard, the brief) |
-| Mean KLD vs floor | (b) vs floor | <= ~5x `Mean KLD_floor` (bounded signal, not pure noise) |
-| Same-top-p | (b) | **>= 99.5%** (100% => greedy byte-identical to f32) |
-| PPL-delta `ln(PPL_bf16/PPL_f32)` | (a)+(b) | **abs < 0.005** (PPL within +-0.5%) |
-| Max / 99.9% KLD | (b) | report; flag if Max > 0.05 (tail outliers) |
-| Coherence | greedy | fluent + on-topic; byte-identical if Same-top-p=100% |
-| Long-context drift | sweep | MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0% |
-
-If any row fails for a model: keep THAT model on f32 (gallery YAML `cache_type_ssm: f32`) while the
-global default stays bf16; the cparams f32 fallback is the safety valve. MoE has fewer GDN layers
-(31 vs 48) and smaller per-head state (H_v=32 vs 48), so expected KLD <= dense; same thresholds.
-Same-top-p is the bridge to the old md5 harness: at 100% the bf16 greedy output is byte-identical to
-f32 and the 0018/0019 md5 gate would still pass - the strongest possible non-bit-exact result.
-
-## C.3 Bench targets + nsys confirmation
-
-Dense q36-27b-nvfp4 (48 GDN layers, S_v=128, H_v=48), npl128, GB10/sm_121, graphs-OFF
-apples-to-apples (the measured baseline):
- Recurrence per call: 3.98 ms (f32, 805 MB R+W, 74% peak) -> **~2.0-3.0 ms** (bf16, ~413 MB R+W).
-  2.0 ms = 74% peak retained; 3.0 ms = conservative 50% peak on the smaller footprint.
- Recurrence per step: 191 ms -> ~96-143 ms (save ~48-95 ms).
- Step time: 384 ms -> **289-339 ms**.
- Decode throughput: ~335 -> **360-443 tok/s** = parity-to-ahead of vLLM (327 ms / 391 tok/s).
-
-MoE q36-35b-a3b-nvfp4 (31 GDN layers, H_v=32): state per (seq,layer) = 128*128*32*4 = 2.0 MiB f32 ->
-per-call R+W ~537 MB f32 -> ~268 MB bf16. Fewer layers + smaller state => smaller ABSOLUTE recurrence
-savings, and MoE decode is more GEMM-bound (the `MUL_MAT_ID` expert path), so the bf16-state win is a
-smaller FRACTION of the MoE step. Target: a measurable per-call halving of the GDN recurrence time
-with the C.2 KL gate passing; no absolute MoE step target is asserted here (the MoE step is
-MUL_MAT_ID-dominated, a separate lever from this one).
-
-nsys confirmation (the measurement that proves the lever landed):
-```
-GGML_CUDA_DISABLE_GRAPHS=1 nsys profile -o ssmbf16 --force-overwrite true \
-  llama-batched-bench -m $M -npp 8 -ntg 12 -npl 128 -ub 2048
-nsys stats --report cuda_gpu_kern_sum ssmbf16.nsys-rep | grep -i gated_delta_net
-```
-Confirm: `gated_delta_net_cuda` mean duration/call drops 3.98 -> 2.0-3.0 ms; step time + tok/s land in
-the 289-339 ms / 360-443 tok/s band; the f32 opt-out reproduces the 3.98 ms f32 call. The gate is the
-JOINT condition: per-call speed in band AND KL<1e-3 - neither alone ships bf16.
-
-## C.4 Default / opt-out semantics, back-compat, risk register
-
-Semantics:
- DEFAULT `type_s` = `GGML_TYPE_BF16` (SSM recurrent state). `type_r` = `GGML_TYPE_F32` in v1 (conv
-  state; bf16 is v2). This is the INVERSE of KV (KV is opt-IN to compression at F16 default; SSM is
-  opt-OUT to f32).
- Opt-out: `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML) -> bit-exact f32
-  recurrence. Per-model opt-out lives in gallery YAML if a model fails the gate; the global default
-  stays bf16.
- Silent capability fallback: the C.1 STEP 6 device-match probe demotes `type_s` to F32 before alloc
-  on devices lacking the bf16 GDN specialization (CPU offload / non-GB10) and logs once.
-
-Back-compat (the ONE real breakage): `llama-memory-recurrent.cpp` serializes the per-layer state
-dtype and HARD-matches on restore (mismatch -> `"mismatched s type"` -> returns false). The f32->bf16
-default flip makes OLD f32-saved sessions fail to restore against a bf16 build. Fix = STEP 3a: convert
-row-by-row on mismatch (both in {F32,BF16}) + bump the recurrent state-file version. KV never hit this
-because `type_k`/`type_v` were EXPERIMENTAL and never default-changed; the SSM default FLIP is what
-forces the convert/version work.
-
-Risk register:
- **R1 numeric drift (KL gate fails).** Likelihood LOW: g<1 geometric decay contracts per-step bf16
-  rounding to a bounded series (~`eps/(1-exp(g_mean))`), f32 registers confine rounding to one
-  per-step cache write, and vLLM ships this exact config in production. Mitigation: C.2 gate +
-  per-model f32 opt-out + global f32 fallback.
- **R2 prefill / keep_rs_t / gather state path (the silent-corruption landmine).** The conversion
-  points are documented for DECODE; the SAME kernel also runs the chunked prefill path, the keep_rs_t
-  snapshot (writes to f32 scratch while the cache is bf16), and the 0019 gather (reads bf16 cache ->
-  f32 scratch). A dtype mistake on any of these corrupts the state at the prefill->decode handoff and
-  surfaces ONLY as long-context drift, which a decode-only 256-token gate can mask. Mitigation: STEP
-  1e test-backend-ops MUST cover the multi-token prefill + keep_rs_t==true path, not just decode; the
-  C.2 long-context sweep is the second net. (This is C.5, the single biggest risk.)
- **R3 MoE MUL_MAT_ID path.** The GDN recurrence op is IDENTICAL for dense and MoE; the MoE expert
-  GEMM (`MUL_MAT_ID`) does NOT touch the SSM state, so bf16-state is orthogonal to the expert path.
-  Residual risk: `qwen35moe` `build_recurrent_attn` must route the same bf16 state view (it shares
-  delta-net-base.cpp). Mitigation: run the full C.2 gate on the MoE model; the test-backend-ops case
-  is arch-agnostic.
- **R4 conv-state coupling with patch 0021.** Flipping `type_r` to bf16 breaks `ggml_concat` at
-  prefill (different types). Mitigation: v1 keeps `type_r`=F32 (C.0); `type_r`=bf16 deferred to v2
-  with an f32 staging view (PART B §B.6).
- **R5 back-compat restore failure.** Mitigation: STEP 3a convert + version bump (above).
-
-## C.5 Single biggest risk + how the build agent de-risks it FIRST
-
-Single biggest risk: **R2 - silent state corruption on the NON-decode state paths** (chunked prefill,
-the keep_rs_t snapshot, the 0019 gather). The 805 MB measurement and every conversion-point in the
-cheat-sheet describe the STEADY decode path (`n_tokens==1`, `!keep_rs_t`). But the bf16 cache is ALSO
-read/written by the multi-token prefill path and the prefill/rollback snapshot (which targets f32
-scratch while the cache is bf16). A dtype bug there does not crash and barely moves the 256-token
-decode md5; it corrupts the recurrent state at the prefill->decode boundary and shows up ONLY as
-long-context drift - exactly the failure a quick gate misses.
-
-De-risk FIRST (before ANY default flip or wiring): implement STEP 1 (kernel + op dtype-generic) and
-STEP 1e (test-backend-ops) ONLY, then prove the kernel is dtype-correct in ISOLATION by forcing a
-bf16 state allocation behind a temporary debug flag and running test-backend-ops with a case that
-exercises (a) single-token decode, (b) a multi-token prefill chunk, and (c) `keep_rs_t==true`,
-comparing CUDA bf16 against the CPU bf16 reference AND against the f32 path within tolerance. Only
-after that case is GREEN does the build agent proceed to STEP 2 (flip the default) and the C.2
-model-level gate. This decouples kernel dtype-correctness from the cparams wiring, so a Step-1 bug is
-caught by a deterministic op test in minutes instead of as a fuzzy long-context regression after the
-full stack is wired.
-
---
-
-# bf16 SSM state cache — cparams wiring (DEFAULT bf16 + f32 opt-out)
-
-Label: cparams-default-fallback (READ-ONLY design). Mirrors the KV-cache `type_k`/`type_v`
-precision plumbing exactly. Designed against HEAD-after-patch-0021 (conv-state in-place fusion).
-
-This is lever (2) of GDN_RECURRENCE_BYTE_GATE.md: the recurrent SSM state cache is the dominant
-decode byte stream (805 MB R+W/call, 51.6% of step, single-pass f32 = at the BW floor). The whole
-~2x DRAM gap vs vLLM is f32(llama) vs bf16(vLLM) state width. Storing the persisted state in bf16
-(load→f32, recurrence math in f32 UNCHANGED, store→bf16) halves the dominant term. vLLM's GDN state
-cache is bf16, so bf16-default is the fair equal-precision comparison → make it the DEFAULT.
-
---
-
-## 1. The KV-cache template we mirror (exact chain for type_k / type_v)
-
-```
-CLI   common/arg.cpp:2052     -ctk/--cache-type-k TYPE → params.cache_type_k
-                              (common_params, common/common.h:566, default GGML_TYPE_F16)
-  ↓
-glue  common/common.cpp:1589  cparams.type_k = params.cache_type_k   (cparams = llama_context_params)
-  ↓
-API   include/llama.h:365     llama_context_params.type_k  // [EXPERIMENTAL]
-      llama-context.cpp:3468  default in llama_context_default_params() = GGML_TYPE_F16
-  ↓
-mem   llama-context.cpp:326   llama_memory_params params_mem.type_k = params.type_k
-      llama-memory.h:19       struct llama_memory_params { ggml_type type_k; type_v; ... }
-  ↓
-alloc llama-model.cpp:2030    create_memory(params_mem, cparams) → KV cache uses params.type_k
-```
-
-Key facts:
- `type_k`/`type_v` are NOT stored in `struct llama_cparams` (src/llama-cparams.h). They ride in
-  `llama_context_params` → `llama_memory_params` and are consumed directly at cache-alloc time.
-  We mirror that: NO new `llama_cparams` field is needed.
- KV default is opt-IN to compression (F16 default, pass `-ctk q8_0` to shrink). SSM is the INVERSE:
-  bf16 DEFAULT, pass an explicit `f32` to opt out / restore bit-exactness.
-
-## 2. Where the SSM state type is currently hardcoded (the targets)
-
-The recurrent cache constructor already accepts the types — only the model hardcodes F32:
-
- `src/llama-memory-recurrent.cpp:22-23` ctor params `ggml_type type_r, type_s`
-  - `r_l` (line 100, `n_embd_r`) = short conv state  → `type_r` (TINY: conv_width-1 taps × conv_dim)
-  - `s_l` (line 101, `n_embd_s`) = SSM recurrent state → `type_s` (THE 805 MB/call dominant)
- `src/llama-memory-hybrid.h:32-33` ctor params `type_r, type_s` (qwen35 / qwen35moe path)
- Hardcoded `GGML_TYPE_F32` call sites in `src/llama-model.cpp::create_memory`:
-  - 2056-2057  `llama_memory_recurrent(...)`            (pure recurrent arches)
-  - 2098-2099  `llama_memory_hybrid_iswa(...)`          recurrent_type_r / recurrent_type_s
-  - 2117-2118  `llama_memory_hybrid(...)`               recurrent_type_k / recurrent_type_v (mislabeled; they are r/s)
-
-Note: `qwen35` / `qwen35moe` are HYBRID (filter_attn/filter_recr, no SWA) → they take the
-`llama_memory_hybrid` branch (2108-2118). That is the call site that matters for the parity push.
-
-## 3. New plumbing (parallel chain `type_s` / `type_r`)
-
-### 3a. Public API + cparams glue (llama.cpp side)
-
-| File | Change |
-|------|--------|
-| `include/llama.h` (after :366) | Add `enum ggml_type type_s; // data type for recurrent SSM state cache [EXPERIMENTAL]` and `enum ggml_type type_r; // data type for recurrent conv state cache [EXPERIMENTAL]`. Place adjacent to `type_k`/`type_v`. |
-| `src/llama-context.cpp:3468` (default params) | Add `/*.type_s =*/ GGML_TYPE_BF16,` and `/*.type_r =*/ GGML_TYPE_BF16,`. **This is the DEFAULT change.** |
-| `src/llama-memory.h:19` (`struct llama_memory_params`) | Add `ggml_type type_r;` and `ggml_type type_s;` next to `type_k`/`type_v`. |
-| `src/llama-context.cpp:325` (`params_mem` init) | Add `/*.type_r =*/ params.type_r,` and `/*.type_s =*/ params.type_s,`. |
-| `src/llama-model.cpp` 2056-57 / 2098-99 / 2117-18 | Replace the 3 hardcoded `GGML_TYPE_F32` pairs with `params.type_r` / `params.type_s`. |
-
-### 3b. CLI / llama-server (common side)
-
-| File | Change |
-|------|--------|
-| `common/common.h:566` region | Add `ggml_type cache_type_ssm = GGML_TYPE_BF16;` and `ggml_type cache_type_conv = GGML_TYPE_BF16;` (mirror `cache_type_k/v`; note the DEFAULT is BF16, not F16). |
-| `common/common.cpp:1589` region | Add `cparams.type_s = params.cache_type_ssm;` and `cparams.type_r = params.cache_type_conv;`. |
-| `common/arg.cpp` (after :2074) | Add `--cache-type-ssm TYPE` (`-ctssm`) → `params.cache_type_ssm = kv_cache_type_from_str(value)`, and `--cache-type-conv TYPE` (`-ctconv`). Reuse the existing `kv_cache_type_from_str` (arg.cpp:402). Help text: "recurrent SSM state cache type (default bf16; pass f32 for bit-exact recurrence)". |
-
-`kv_cache_type_from_str` already accepts `f32`/`bf16`/`f16` — no change needed; just confirm `bf16`
-maps to `GGML_TYPE_BF16` (add the case if absent).
-
-### 3c. LocalAI gRPC backend (so users can force f32 from model YAML)
-
-Mirror `CacheTypeKey` exactly:
-
-| File | Change |
-|------|--------|
-| `backend/backend.proto:419` region | Add `string CacheTypeSSM = NN;` and `string CacheTypeConv = NN;` (next free field tags). Regenerate proto. |
-| `backend/cpp/llama-cpp/grpc-server.cpp:504` region | `if (!request->cachetypessm().empty()) params.cache_type_ssm = kv_cache_type_from_str(request->cachetypessm());` and the conv equivalent. (grpc-server already has its own `kv_cache_type_from_str`; ensure it knows `bf16`.) |
-| `core/config/model_config.go:935` region | Add `CacheTypeSSM string yaml:"cache_type_ssm,omitempty"` and `CacheTypeConv string yaml:"cache_type_conv,omitempty"`. |
-| `core/backend/options.go:247` region | Add `CacheTypeSSM: c.CacheTypeSSM,` and `CacheTypeConv: c.CacheTypeConv,` to the request build. |
-| `core/config/meta/registry.go:161` + `core/config/meta/build_test.go:140` | Register `cache_type_ssm` / `cache_type_conv` as static fields (the `staticFields` slice + registry map) so the meta-config gate passes. |
-
-LocalAI semantics: leaving `cache_type_ssm` UNSET in YAML → empty gRPC string → backend keeps its
-BF16 default. Setting `cache_type_ssm: f32` → forces the f32 opt-out (bit-exact recurrence).
-
-## 4. Default / fallback semantics
-
- **DEFAULT = `GGML_TYPE_BF16`** for both SSM state (`type_s`) and conv state (`type_r`).
-  - SSM state (`type_s`) is the lever: f32→bf16 halves 805→413 MB/call → ~3.98→~2.0-3.0 ms/call.
-  - Conv state (`type_r`) is negligible bytes; default it bf16 too for consistency, but it can stay
-    f32 with zero perf cost if patch-0021's in-place conv path assumes f32 — see §6.
- **Opt-out = `GGML_TYPE_F32`** via `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML).
-  Restores bit-exact recurrence; use when the KL gate (<1e-3 / PPL-delta over 256-tok greedy) fails
-  for a given model, or for deterministic regression baselines.
- **Silent capability fallback**: gate the bf16 path behind a device-match probe modeled on
-  `auto_fgdn` (llama-context.cpp:518-595). If the GDN recurrence kernel's bf16 load/store
-  specialization is unavailable on a participating device (e.g. a CPU-offloaded GDN layer with no
-  bf16 op, or a non-GB10 backend), fall back to `GGML_TYPE_F32` for `type_s` BEFORE cache alloc and
-  log it once. This keeps "bf16 default" from breaking heterogeneous/CPU setups.
- The kernel contract is unchanged-math: load bf16→f32 into `s_shard` (registers stay f32), all
-  recurrence arithmetic in f32, store f32→bf16. Only the persisted cache is rounded per step;
-  geometric decay (`exp(g)<1`) bounds the rounding, so the error does not accumulate unboundedly.
-
-## 5. Back-compat (the one real breakage — saved sessions / state files)
-
-`src/llama-memory-recurrent.cpp` SERIALIZES the per-layer state tensor dtype and does a HARD match
-on restore:
- write: `state_write_data` writes `s_type_i = (int32_t)s_l[il]->type` (line ~900) and the r type.
- read: `state_read_data` reads `s_type_i_ref`, compares to current `s_l[il]->type`, and on
-  mismatch logs `"mismatched s type (%d != %d, layer %d)"` and **returns false** (restore FAILS).
-  Same for `r` type.
-
-Consequence of the default flip f32→bf16:
- Sessions SAVED by an old f32-default build will FAIL to RESTORE against a new bf16-default build
-  (and vice versa), because the serialized `s_type_i_ref` (F32) ≠ the new cache type (BF16).
-
-Required handling (pick one, recommend A):
- **A (convert on mismatch, recommended)**: in `state_read_data`, when `s_type_i_ref != current`
-  and both ∈ {F32, BF16}, convert row-by-row during load (`ggml_fp32_to_bf16_row` / `ggml_bf16_to_fp32_row`) instead
-  of returning false. Same for `r`. Bump the recurrent state-file version so older readers reject
-  cleanly. This makes old f32 sessions loadable into bf16 caches and round-trips safely.
- **B (pin precision to the saved file)**: if a session is being restored, read `s_type_i_ref`
-  first and set `type_s`/`type_r` from it, overriding the default for that context. Keeps restore
-  working but silently disables the bf16 win for resumed sessions.
- **C (document-only)**: keep the hard match; document that bf16-default invalidates cross-version
-  saved recurrent states. Lowest effort, worst UX. Not recommended given parity is the goal.
-
-KV-cache parallel: `type_k`/`type_v` were always EXPERIMENTAL and non-default-changing, so the KV
-path never had to solve this. The SSM default-FLIP is what forces the convert/version work — call it
-out as the single most load-bearing back-compat item.
-
-## 6. Coupling notes / sequencing
-
- Land ON TOP of patch 0021 (conv-state in-place fusion). If 0021's fused conv write assumes an f32
-  conv-state tensor, either (a) extend it to the cache tensor's dtype, or (b) keep `type_r` = F32 by
-  default and make ONLY `type_s` bf16 (conv bytes are negligible, so this loses nothing perf-wise and
-  de-risks 0021). Decision: ship `type_s`=BF16 first; make `type_r`=BF16 a follow-up gated on 0021's
-  conv path being dtype-generic.
- Kernel side (separate patch, not this wiring): `ggml/src/ggml-cuda/gated_delta_net.cu` currently
-  takes `const float * curr_state` / `float * state_dst` and does `s_shard[r] = read_state[i]`
-  (line 102) — hardcoded f32. The bf16 build needs the dispatch to read `s0->type` and route a
-  bf16 load/store specialization; the gather kernel `gdn_gather_nonident_kernel` (line 7, `const
-  float * cache`) likewise needs a bf16 variant. The cparams wiring here only selects the cache
-  dtype; the kernel patch consumes it. Patches 0018 (in-place) / 0019 (gather) state asserts must be
-  relaxed from f32-only to {f32,bf16}.
- CPU mirror `ggml-cpu/ops.cpp` GDN path needs the same bf16 load/store for CI parity / fallback.
-
-## 7. Validation gate
-
- KL < 1e-3 and PPL-delta within tolerance vs the f32-state build over a 256-token greedy run, per
-  model (dense q36-27b-nvfp4, MoE q36-35b-a3b-nvfp4). If a model fails, that model sets
-  `cache_type_ssm: f32` in its gallery YAML (per-model opt-out) — the global default stays bf16.
- Add a `test-backend-ops` case for the GDN recurrence with bf16 state (mirror the 0021 harness:
-  dense text md5 + MoE byte check) to lock the load→f32→store→bf16 contract.
-
---
-
-# Appendix - label `upstream-bf16-precedent` (READ-ONLY research)
-
-Precedent + numeric-safety justification for the §1-7 wiring above. Sources: paged dev tree
-(`dgx.casa:~/llama-paged-dev`, branch `paged`) and the vLLM checkout
-(`~/vllm-bench/.../site-packages/vllm`).
-
-## A.1 Upstream llama.cpp: recurrent-cache f32 is HARDCODED (no f16/bf16 path), not a documented numeric guard
-
-The asymmetry to override: the attention KV cache type is user-tunable; the recurrent state cache is not.
-
- KV cache: `llama_context_params.type_k/type_v` default `GGML_TYPE_F16`
-  (`src/llama-context.cpp:3468-3469`), `[EXPERIMENTAL]` in `include/llama.h:365-366`, plumbed from
-  user params (`attn_type_k = params.type_k`).
- Recurrent/SSM cache: `llama_memory_recurrent(... type_r, type_s ...)` and the hybrid wrappers take
-  the recurrent types as ctor args, but EVERY call site in `src/llama-model.cpp` passes the literal
-  `GGML_TYPE_F32` (2056-2057 pure-recurrent; 2098-2099 hybrid-iswa `recurrent_type_r/s`;
-  2117-2118 hybrid `recurrent_type_k/v`). No cparams field feeds these - compile-time constants.
-  So mamba/mamba2/rwkv/falcon-h1/nemotron-h/qwen3.5 ALL get f32 recurrent + conv state unconditionally.
- Alloc: `r = ggml_new_tensor_2d(ctx, type_r, ...)`, `s = ggml_new_tensor_2d(ctx, type_s, ...)`
-  (`src/llama-memory-recurrent.cpp:100-101`). No f16 branch anywhere.
-
-Is f32 a deliberate numeric constraint? Structural, not documented:
- `ggml_ssm_conv` / `ggml_ssm_conv_update_inplace` HARD-ASSERT f32 on conv state/kernel/x_cur/dst
-  plus `nb[0]==sizeof(float)` (`ggml/src/ggml.c:5581-5584,5589,5597`). Conv path is f32-locked at the
-  builder.
- `ggml_ssm_scan` does NOT assert input state `s` dtype, but hardcodes its OUTPUT as
-  `GGML_TYPE_F32` (`ggml/src/ggml.c:5662`); scan kernels read `s` as `float *`.
- `ggml/src/ggml-cuda/gated_delta_net.cu` takes `const float * curr_state`, `float * state`,
-  `float * state_dst`; the per-(seq,head) shard `float s_shard[rows_per_lane]` is loaded/stored as raw
-  float (34-102). Same in `ggml-cpu/ops.cpp`.
- NO code comment anywhere justifies "f32 for precision". The constraint is that the ops were written
-  float-only. => recurrent-cache-f32 is a hardcoded implementation default to override deliberately:
-  the 3 literal `GGML_TYPE_F32` call-site pairs (gate behind `type_s`/`type_r` per §3), the
-  gated_delta_net.cu load/store convert, and KEEP conv f32 unless its asserts are extended (conv bytes
-  are negligible - only the temporal `type_s` state needs bf16).
-
-## A.2 vLLM: GDN temporal state cache is bf16 BY DEFAULT, fp32-accumulated in-kernel (the exact design)
-
- Dtype: `qwen3_next.py:780-787` -> `MambaStateDtypeCalculator.gated_delta_net_state_dtype` ->
-  `_mamba_state_dtype` (`mamba_utils.py:84-96`):
-  `conv_state_dtype = get_kv_cache_torch_dtype(mamba_cache_dtype, model_dtype)`;
-  `if mamba_ssm_cache_dtype == "auto": temporal_state_dtype = conv_state_dtype`.
-  With both knobs default `"auto"`, `get_kv_cache_torch_dtype("auto", model_dtype)` returns
-  `model_dtype` (`torch_utils.py:293-297`) = bf16 for Qwen3-Next => BOTH conv and temporal state are
-  bf16 by default. Explicit opt-out: `--mamba-ssm-cache-dtype float32` (mirror of our f32 fallback).
- In-kernel numerics (decode), `fla/ops/fused_recurrent.py`:
-  `b_h = tl.load(p_h0).to(tl.float32)` (303) load bf16->fp32; q/k/v/g/beta `.to(tl.float32)` (309-318);
-  recurrence in fp32 `b_h*=exp(g); b_v-=sum(b_h*b_k); b_v*=beta; b_h+=b_v*b_k; b_o=sum(b_h*b_q)`
-  (327-331); `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))` (337) store fp32->bf16. Prefill chunk path
-  identical (`b_h=tl.zeros(...,tl.float32)`, `+= load().to(fp32)`, 102/120).
-  => byte-for-byte the proposed llama lever: load bf16->f32, math in f32 (UNCHANGED order, matches
-  gated_delta_net.cu's v-g*kv -> *beta -> S-update -> S^T q), store f32->bf16; only the persisted cache
-  crosses the bf16 boundary, once per step.
- vLLM numeric guards: NONE beyond fp32 accumulation - no per-step renorm, no clamp, no Kahan. Optional
-  `use_qk_l2norm_in_kernel` normalizes q,k (keeps k unit-norm) but does not touch the state.
- KDA nuance: `kda_state_dtype` returns `(state_dtype, torch.float32)` - Kimi Delta Attention keeps an
-  fp32 secondary component. qwen3.5 is `gated_delta_net` (fully-bf16 temporal state), but this shows
-  vLLM judged an fp32 component necessary for one delta variant -> reinforces keeping the f32 toggle.
-
-Verdict: vLLM's own GDN state cache is bf16, so bf16-state in llama is a FAIR equal-precision target,
-not a regression vs the competitor. bf16 brings llama TO vLLM's precision.
-
-## A.3 Numeric-safety assessment for bf16 gated-DeltaNet state
-
-Update (decay first, matching the A.2 kernel order): `S' = S*diag(exp(g))`, then `S <- S' + beta * k (x) (v - S' k)`, with
-`g = -exp(A_log)*softplus(a+dt_bias) < 0` so `exp(g) in (0,1)` (strict geometric decay) and
-`beta = sigmoid(.) in (0,1)`.
-
- Decay bounds error accumulation. bf16 has 8 significand bits (7 stored + 1 implicit) -> per-element
-  rel rounding `eps ~= 2^-8 ~= 3.9e-3`. An error injected at step t is multiplied by `exp(g)<1` every
-  later step -> carry-error is a CONTRACTING geometric series bounded by ~`eps/(1-exp(g_mean))`, a small
-  constant multiple of one step's eps, NOT linear/unbounded. The recurrence is a contraction map - no
-  divergence. (The "per-step renorm" framing is not a literal renorm op in either codebase; the bound
-  IS the `exp(g)<1` contraction + `beta in (0,1)` + unit-norm k from the l2norm capping `||k (x) delta||`.)
- fp32 register accumulation is the minimal-error placement: load bf16->f32, do `S k`, `v-g*kv`,
-  `*beta`, the outer-product accumulate and `S^T q` ALL in fp32 (UNCHANGED math), store f32->bf16 once.
-  Identical to vLLM, which ships this as the Qwen3-Next default with no reported quality regression -
-  the strongest empirical safety evidence.
- Dominant risk is small KL/PPL drift, not instability. Gate KL<1e-3 + PPL-delta over 256-tok greedy
-  vs the f32 build; fall back to f32 via the §3c toggle if it fails. Keep conv state f32 (ssm_conv* is
-  f32-locked, conv bytes negligible) - no reason to risk it.
-
-Bottom line: (1) upstream recurrent-cache f32 is a hardcoded implementation default (conv asserts f32;
-scan/gdn kernels float-only; no numeric-rationale comments) - override via §3's `type_s`/`type_r`
-plumbing, bf16-default + f32 opt-out, touching only the temporal state. (2) vLLM's GDN temporal state
-is bf16 by default (auto->model_dtype), fp32-accumulated, with `--mamba-ssm-cache-dtype float32`
-opt-out - a fair equal-precision target. (3) bf16 GDN state is numerically safe: exp(g)<1 decay contracts
-rounding to a bounded geometric series, fp32 registers confine bf16 rounding to one per-step cache
-write, and vLLM ships this exact config in production. KL<1e-3 / PPL gate + f32 fallback is the right
-safety net.
-
---
-
-# PART B - label `bf16-kernel-plumbing` (the kernel/op edits §6 defers)
-
-Part A wires the cache DTYPE selection (cparams -> memory_params -> `s_l`/`r_l` alloc). Part B is the
-consuming half: every kernel/op that reads or writes those caches, and the exact
-load->f32->compute(f32, UNCHANGED)->store->bf16 conversion points. Traced against HEAD-after-0021 on
-`dgx.casa:~/llama-paged-dev` (branch `paged`).
-
-## B.1 Complete set of state-cache READERS/WRITERS (one op family only)
-`s_l` (ssm_states_all) reaches compute through exactly ONE op family - the gated-DeltaNet recurrence -
-via a strided VIEW from `build_rs` (graph base) that carries the cache dtype. The cache-touching srcs:
- `src[5]` `src_state` - the s0 read view (the cache, or the 0019 gather scratch).
- `src[6]` `src_state_dst` - the 0018 in-place write-back target (a view INTO the cache).
- `src[7]` `ids` - I32 seq map for the 0019 gather (no dtype concern).
-No other op reads `s_l`. `build_rs` only re-strides (dtype rides through); the 0019
-`gdn_gather_nonident_kernel` is the only other reader. So bf16 awareness localizes to: the 3 ggml.c
-builders (asserts), cuda `supports_op`, `gated_delta_net.cu`, and the CPU mirror in `ops.cpp`.
-
-## B.2 ggml.c builder asserts (relax F32-only -> {F32,BF16})
-File `ggml/src/ggml.c`:
- `ggml_gated_delta_net` (6287): line 6308 `GGML_ASSERT(state->type == GGML_TYPE_F32)` ->
-  `... == GGML_TYPE_F32 || ... == GGML_TYPE_BF16`.
- `ggml_gated_delta_net_inplace` (6349): same `state` assert (~6366-6370) + any `src_state_dst`
-  type assert -> allow BF16.
- `ggml_gated_delta_net_inplace_ids` (6417): same `state` + `src_state_dst` relax.
- KEEP the op OUTPUT scratch f32: line 6327 `ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne)` stays. The
-  `[attn_scores | new_states]` output is a TRANSIENT graph tensor; the bf16 persisted write goes
-  through `src_state_dst`/`state` (in-place). The non-in-place fallback `cpy`s scratch->cache and
-  `ggml_cpy` already type-converts f32->bf16.
-
-## B.3 CUDA supports_op
-`ggml/src/ggml-cuda/ggml-cuda.cu`, `supports_op` case `GGML_OP_GATED_DELTA_NET` (3096): allow a BF16
-`src[5]`/`src[6]` (add BF16 to the permitted state-src types).
-
-## B.4 CUDA recurrence kernel `ggml/src/ggml-cuda/gated_delta_net.cu`
-Template the kernel + gather + launch on the CACHE-pointer dtype (`bool STATE_BF16`); keep f32 valid so
-the f32 opt-out is the SAME kernel. Include `<cuda_bf16.h>`; convert with `__bfloat162float` /
-`__float2bfloat16`. ALL recurrence math (lines 106-200) stays in f32 registers, byte-for-byte UNCHANGED.
- Signatures: line 39 `const float * curr_state` -> `const STATE_T * curr_state`; line 57
-  `float * state_dst` -> `STATE_T * state_dst`; `read_state` (85-88) -> `const STATE_T * read_state`.
- LOAD (s0 -> f32 regs), lines 100-103:
-  `if constexpr (STATE_BF16) s_shard[r]=__bfloat162float(read_state[i]); else s_shard[r]=read_state[i];`
-  `s_shard` stays `float`.
- STORE-BACK (f32 regs -> bf16 cache):
-  - non-keep final write (203-208): `state[col*S_v+i] = STATE_BF16 ? __float2bfloat16(s_shard[r]) : s_shard[r];`
-  - keep_rs_t snapshot (191-200) targets `dst + attn_score_elems` = the f32 OUTPUT scratch (kept f32
-    per B.2); this is the prefill/rollback path (n_rs_seq>0), decode is `!keep_rs_t`. KEEP it f32.
-    Only the CACHE pointers (`curr_state` src[5], `state_dst` src[6]) are STATE_T.
- 0019 gather `gdn_gather_nonident_kernel` (7-30): `const float * cache` -> `const STATE_T * cache`;
-  `dst[i] = STATE_BF16 ? __bfloat162float(src[i]) : src[i];`. Keep `scratch` OUTPUT f32 (pool alloc
-  326-333 stays `ggml_cuda_pool_alloc<float>`) so the non-identity read path feeds f32; the identity
-  in-place path reads bf16 directly. `read_state`'s dtype follows the branch that selected it.
- Dispatcher (270-353):
-  - casts 299/323 `(const float *)src_state->data`, 312 `(float *)src_state_dst->data` ->
-    `(const nv_bfloat16 *)` / `(nv_bfloat16 *)` when `type == GGML_TYPE_BF16`; branch launch on type.
-  - asserts 309-311: `src_state_dst->type == GGML_TYPE_F32` -> allow BF16; `nb[0] == sizeof(float)` ->
-    `== ggml_type_size(type)`; `nb[1] == S_v*S_v*H*sizeof(float)` -> `... * ggml_type_size(type)`.
-  - q/k/v/g/beta strides (348-353) are ACTIVATION (f32) strides - UNCHANGED. Kernel indexes state by
-    ELEMENT (`col*S_v+i`), so the typed pointer halves the byte stride implicitly.
-  - `launch_gated_delta_net` (212-) + S_v switch (230-260): thread `STATE_BF16` into the
-    `gated_delta_net_cuda<S_v, KDA, keep_rs_t, STATE_BF16>` instantiations.
-
-## B.5 CPU reference `ggml/src/ggml-cpu/ops.cpp` (parity / CI / CPU-offload fallback)
-`ggml_compute_forward_gated_delta_net_one_chunk` (10662) + `_f32` (10847), dispatch (10915):
- LOAD: 10726 `const float * state_in_base = (const float *)src_state->data`, the rs_head/gather read
-  10744-10745, and 10891 `const float * cache = (const float *)src_state->data` -> when
-  `src_state->type == GGML_TYPE_BF16`, read `GGML_BF16_TO_FP32(((const ggml_bf16_t*)..)[..])`.
- STORE: 10758-10762 `inplace_state_base = (float *)src_state_dst->data` -> store
-  `((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`; relax asserts `nb[0]`/`nb[1]`
-  to `ggml_type_size(type)`. Keep ONE impl, branch load/store on `src_state->type`.
-
-## B.6 Conv state (`r_l`) -> bf16 : DEFER (optional, low-value, prefill snag)
-Conv state ~12.6 MB total, LAUNCH-bound (0021 removed concat/cpy); bf16 saves ~0 ms, adds complexity:
- DECODE (0021 fused) `ggml_ssm_conv_update_inplace` (ggml.c:5566) asserts 5581-5584
-  `conv_states/conv_state_dst->type == F32`; CUDA `ssm_conv_update_f32` (ssm-conv.cu:131) + CPU
-  `ggml_compute_forward_ssm_conv_update_f32` (ops.cpp:9471) read/write f32. To bf16: relax the 2
-  asserts, template tap LOAD (`__bfloat162float`) + ring write-back STORE (`__float2bfloat16`), cast
-  `conv_states`/`conv_state_dst` ptrs in both dispatchers.
- PREFILL (non-fused) `build_conv_state` (delta-net-base.cpp:449-524): `conv_states=build_rs(...)`
-  (bf16 view) then `ggml_concat(conv_states, qkv_mixed, 0)` (472). **`ggml_concat` requires same type**
-  - qkv_mixed is f32 -> bf16 conv cache BREAKS the prefill concat (needs an f32 staging view of the
-  taps first; the ring write-back `ggml_cpy` at 496/520 already converts; concat is the blocker).
-RECOMMENDATION: keep `type_r` = F32 in v1 (matches Part A §6). Ship `type_s`=BF16 first; `type_r`=BF16
-is a follow-up that adds the f32 staging view.
-
-## B.7 Confirm UNTOUCHED: full-attn KV-cache (16 layers) + FP4 weights
- KV-cache: the `llama_kv_cache` half of `llama_memory_hybrid`, alloc with `params.type_k/type_v`
-  (llama-model.cpp 2030-2031 / 2089-2090 / 2108-2109). Part A changes ONLY the recurrent half's
-  `type_s`; `attn_type_k`/`attn_type_v` untouched. Paged-KV gather (0003-0011), flash-attn,
-  `type_k()/type_v()` accessors (kv-cache.h 161-162/381-382) unaffected.
- FP4 weights (nvfp4 dense + MoE): model weights, separate from runtime state caches; recurrence/conv
-  kernels read STATE not weights. FP4 GEMM (0017/0020) untouched.
- Activations (q/k/v/g/beta, attn-out, z) stay f32 (<1% of bytes). Only persisted `s_l` rows narrow.
-
-## B.8 Conversion-point cheat-sheet (the ONLY numeric-precision boundaries)
-1. CUDA load   `gated_delta_net.cu` ~102: `s_shard[r] = __bfloat162float(read_state[i])`.
-2. CUDA store  ~207: `state[col*S_v+i] = __float2bfloat16(s_shard[r])`.
-3. CUDA gather ~20: `dst[i] = __bfloat162float(src[i])` (bf16 cache -> f32 scratch).
-4. CPU load    `ops.cpp` ~10726/10744/10891: `GGML_BF16_TO_FP32(((ggml_bf16_t*)src_state->data)[..])`.
-5. CPU store   ~10762: `((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`.
-Everything between (1)/(4) and (2)/(5) is f32-register math, identical to today's f32 kernel. Only the
-persisted cache rounds to bf16 once per step; exp(g)<1 geometric decay bounds the rounding.
-
-## B.9 File-by-file edit table (Part B)
-| File | Edit |
-|---|---|
-| `ggml/src/ggml.c` | relax `state`/`src_state_dst` F32 asserts -> allow BF16 in the 3 GDN builders (6308, ~6370, ~6430); keep output scratch F32 (6327) |
-| `ggml/src/ggml-cuda/ggml-cuda.cu` | `supports_op` GATED_DELTA_NET (3096): allow BF16 state src |
-| `ggml/src/ggml-cuda/gated_delta_net.cu` | template kernel+gather+launch on STATE_BF16; `__bfloat162float` load / `__float2bfloat16` store; cast src_state/src_state_dst ptrs; relax dispatcher asserts (309-311) to `ggml_type_size(type)`; keep gather scratch + keep_rs snapshot f32 |
-| `ggml/src/ggml-cpu/ops.cpp` | bf16 load/store branch in GDN ref (10726/10744/10758-10762/10891); relax asserts |
-| `tests/test-backend-ops.cpp` | add BF16-state GATED_DELTA_NET case (CUDA bf16 vs CPU bf16) |
-| (deferred) conv: `ggml.c:5581-84`, `ssm-conv.cu:131`, `ops.cpp:9471`, `delta-net-base.cpp:472` | v2 only - f32 staging before prefill concat |
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PROGRESS.md
@@ -1,37 +0,0 @@
-# bf16 SSM state - build/de-risk progress
-
-DECISION (user override of plan): f32 DEFAULT + bf16 OPT-IN. type_s default = GGML_TYPE_F32.
-Conv state (type_r) stays F32. Recurrence math stays f32 (load->f32, store->cache dtype).
-
-## STEP 1 (dtype-generic kernel + op) - DONE + DE-RISK GATE 1 PASSED
-Files (DGX ~/llama-paged-dev):
- ggml/src/ggml.c: 3 GDN builder asserts F32 -> {F32,BF16}; state_dst nb[0] -> ggml_type_size.
- ggml/src/ggml-cuda/gated_delta_net.cu: gdn_state_t<STATE_BF16> alias; gather + recurrence kernel +
-  launchers templated on STATE_BF16; __bfloat162float load / __float2bfloat16 store; gather scratch
-  shares cache dtype (uniform read); dispatcher detects src_state->type, GDN_DISPATCH macro 8-way.
- ggml/src/ggml-cpu/ops.cpp: byte-based read base + read_bf16 load conversion; bf16 in-place
-  convert-store after token loop; bf16 gather widening; relaxed asserts to ggml_type_size.
- ggml/src/ggml-cpu/ggml-cpu.c: work-size +S_v*S_v for bf16 in-place.
- tests/test-backend-ops.cpp: state_type field on test_gated_delta_net; 16 bf16 cases (hs 64+128 x
-  decode/prefill/keep_rs x kda).
-GATE 1: build clean (EXIT=0); test-backend-ops -o GATED_DELTA_NET = 52/52 OK (CUDA bf16 vs CPU bf16).
-
-## STEP 2/3/4 (cparams opt-in wiring) - IN PROGRESS
-f32 DEFAULT everywhere; --cache-type-ssm bf16 opts in.
-
-## STEP 2/3/4 (cparams opt-in) - DONE
- llama.h/llama-context.cpp/llama-memory.h/llama-model.cpp: type_r/type_s plumbed, DEFAULT F32.
- common.h/common.cpp/arg.cpp: cache_type_ssm/conv (F32 default) + --cache-type-ssm/-conv CLI.
- llama-memory-recurrent.cpp: convert-on-mismatch f32<->bf16 (r and s) via ggml_*_row API.
-
-## EXTRA FIX (plan B.1 miss): build_rs rs_zero clear used ggml_scale (f32-only) -> bf16 abort.
- llama-graph.cpp: f32 keeps ggml_scale_inplace (bit-exact); non-f32 uses ggml_fill_inplace.
- fill.cu + ops.cpp + ggml.c: added BF16 to ggml_fill. get_rows/cpy already bf16-capable.
-
-## DE-RISK GATE - ALL PASS
- build clean EXIT=0 (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench).
- test-backend-ops -o GATED_DELTA_NET = 52/52 (16 bf16 cases: decode/prefill/keep_rs x kda x hs64/128).
- f32 default md5: dense 5951a5b4... MoE 07db32c2... == 0023 (non-invasive; also --cache-type-ssm f32 matches).
- bf16 opt-in: coherent "Paris", no crash; byte-identical to f32 on 48-tok sample (Same-top-p 100%).
- diff backup: ~/llama-paged-dev/BF16_SSM_STATE.diff (1003 lines, 15 files). NOT committed/pushed.
-READY FOR C.2 KL GATE (GateBench).
--- a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_RESULTS.md
@@ -1,203 +0,0 @@
-# bf16 SSM-state cache - BUILD + DE-RISK RESULTS
-
-Label: bf16-build-derisk (the GPU build agent). Lands on top of patch 0023 (HEAD f7409c2) on the DGX
-dev tree `~/llama-paged-dev` (branch `paged`). Status: **DE-RISK GATE PASSED, READY FOR THE C.2 KL
-GATE (GateBench).** Work is built into `build-cuda` and saved as `~/llama-paged-dev/BF16_SSM_STATE.diff`
-(uncommitted on the dev tree; the 0024 ship/shelve decision is gated on GateBench's KL results).
-
-## DECISION applied (user override of the plan): f32 DEFAULT + bf16 OPT-IN
-The plan defaulted bf16; the user wants f32 to stay the bit-exact DEFAULT and bf16 to be opt-IN via
-`--cache-type-ssm bf16`. So `type_s` default = `GGML_TYPE_F32`, `type_r` default = `GGML_TYPE_F32`
-(conv stays f32 always, per plan C.0). Only the persisted RECURRENT (temporal) state narrows to bf16
-when opted in; recurrence math stays f32 (load->f32, compute f32, store->cache dtype). The opt-in is
-non-invasive: with no flag the output is byte-identical to 0023.
-
-## Files changed (15; full diff = ~/llama-paged-dev/BF16_SSM_STATE.diff, 1003 lines)
-
-STEP 1 - dtype-generic kernel + op (the de-risk core):
- `ggml/src/ggml.c` - 3 GDN builder `state`/`state_dst` asserts F32 -> {F32,BF16}; `state_dst->nb[0]`
-  `sizeof(float)` -> `ggml_type_size(state_dst->type)`. Also relaxed the `ggml_fill` builder assert to
-  allow BF16 (needed by the rs_zero clear; see below).
- `ggml/src/ggml-cuda/gated_delta_net.cu` - `gdn_state_t<STATE_BF16>` alias (`nv_bfloat16`/`float`);
-  recurrence kernel + gather kernel + both launchers + the dispatcher templated on `STATE_BF16`.
-  LOAD `__bfloat162float`, STORE `__float2bfloat16`; the gather scratch is allocated at the CACHE
-  dtype so `read_state` is a single uniform dtype (no mixed-dtype read path - eliminates the plan-R2
-  landmine). The keep_rs snapshot + the non-in-place final write stay f32 (op output scratch); the
-  bf16 store happens ONLY on the in-place cache path. `supports_op` already returned `true`
-  unconditionally for GATED_DELTA_NET, so no change there.
- `ggml/src/ggml-cpu/ops.cpp` - byte-based prior-state read base + `read_bf16` load conversion
-  (`GGML_BF16_TO_FP32`); bf16 in-place convert-store after the per-(head,seq) token loop
-  (`GGML_FP32_TO_BF16`); bf16-widening non-identity gather; relaxed `nb[]` asserts to
-  `ggml_type_size`. Added a `ggml_compute_forward_fill_bf16` + dispatch case.
- `ggml/src/ggml-cuda/fill.cu` - BF16 case in the fill kernel switch.
- `ggml/src/ggml-cpu/ggml-cpu.c` - GDN work-size adds the extra `S_v*S_v` f32 buffer when the cache is
-  bf16 in-place (mirror of `need_work` in ops.cpp).
- `tests/test-backend-ops.cpp` - `state_type` field on `test_gated_delta_net`; 16 bf16-state cases
-  (head_size 64 + 128 x {decode, multi-token prefill 33/64/100, keep_rs_t K=4} x kda 0/1, n_seqs 1/2).
-
-STEP 2/3/4 - cparams opt-in wiring (f32 DEFAULT):
- `include/llama.h` - `type_r`/`type_s` in `llama_context_params` (adjacent to type_k/type_v).
- `src/llama-context.cpp` - default-params `type_r = type_s = GGML_TYPE_F32`; `params_mem` passes them.
- `src/llama-memory.h` - `type_r`/`type_s` in `llama_memory_params`.
- `src/llama-model.cpp` - the 3 hardcoded `GGML_TYPE_F32` recurrent ctor pairs (recurrent /
-  hybrid_iswa / hybrid = the qwen35/qwen35moe path) now pass `params.type_r` / `params.type_s`.
- `src/llama-memory-recurrent.cpp` - back-compat: `state_read_data` converts f32<->bf16 on type
-  mismatch (helper `recurrent_read_convert_rows` via the public `ggml_bf16_to_fp32_row` /
-  `ggml_fp32_to_bf16_row`) instead of failing, for both r and s; lets an f32-saved session restore
-  into a bf16 cache and vice versa.
- `src/llama-graph.cpp` - `build_rs` rs_zero clear: f32 keeps the exact `ggml_scale_inplace(.,0)` op
-  (bit-exactness); bf16 (and any non-f32) state uses `ggml_fill_inplace(.,0)` (CUDA scale is f32-only;
-  this was the one extra state-touching op the plan's "one op family" claim missed). get_rows + cpy
-  on the extra-states path already support bf16, so no change needed there.
- `common/common.h` / `common/common.cpp` / `common/arg.cpp` - `cache_type_ssm` / `cache_type_conv`
-  (default F32) + `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv` CLI (reusing the
-  existing `kv_cache_type_from_str`, which already maps `f32`/`bf16`).
-
-## DE-RISK GATE - ALL PASS
-
-1. **Build clean** (build-cuda, CUDA arch 121): EXIT=0 for ggml/ggml-cuda/ggml-cpu/llama/llama-common
-   and the binaries (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench).
-2. **test-backend-ops -o GATED_DELTA_NET = 52/52 PASS** (CUDA backend vs CPU reference). Includes all
-   16 new bf16-state cases (CUDA bf16 vs CPU bf16) covering decode (n_tokens==1), multi-token
-   prefill/chunk (33/64/100), and keep_rs_t (K=4), with kda on/off and head_size 64 + 128 (production
-   S_v). The bf16 op test is the deterministic R2 de-risk for the load/compute/store contract.
-3. **f32-default md5 == 0023 baseline (opt-in is non-invasive):**
-   - dense  q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == 0023  (no flag AND `--cache-type-ssm f32`)
-   - MoE    q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == 0023
-   Command: `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`.
-4. **bf16 opt-in coherence + engaged (dense, `--cache-type-ssm bf16`):** no crash; coherent + on-topic.
-   - 48-tok greedy ("The capital of France is"): "**Paris**." - byte-identical to f32 (md5 5951a5b4...),
-     i.e. Same-top-p = 100% over that short sample (the g<1 decay bounds the per-step rounding so the
-     argmax trajectory is unchanged at short length).
-   - 256-tok greedy ("Explain how a transformer LM generates text, step by step"): fluent, well-structured
-     step-by-step explanation, and the bf16 md5 (`fc82b4cd44f8ec999c3b6843eb3f8c61`) **DIVERGES** from
-     f32 (`554cc667a2e62a47c34a999a127ac7e5`) - definitive proof that bf16 is genuinely ENGAGED (not a
-     silent f32 fallback) and behaves as expected (non-bit-exact, coherent). The per-token divergence
-     is exactly what the C.2 teacher-forced KL gate quantifies.
-   - Independent proof bf16 is allocated: BEFORE the build_rs fill fix, decode aborted in
-     `ggml_cuda_op_scale` on the recurrent-state tensor - an f32 cache would never have reached that
-     bf16-only failure, so the opt-in demonstrably allocates bf16. Wiring is also directly traceable:
-     `--cache-type-ssm bf16` -> cache_type_ssm -> cparams.type_s -> params_mem.type_s -> the
-     llama_memory_hybrid recurrent `s_l` alloc.
-
-CONFIRM: ready for the C.2 KL-divergence + PPL-delta + long-context drift gate (GateBench).
-
-## A landmine fixed beyond the plan (record for the gate/ship agents)
-The plan B.1 asserted `s_l` reaches compute through ONLY the gated-DeltaNet op. It also flows through
-`build_rs`: (a) the rs_zero restart-slot clear used `ggml_scale_inplace(.,0)`, and `ggml_cuda_op_scale`
-hard-asserts f32 -> the first bf16 decode aborted in scale. Fixed by routing the bf16 clear through
-`ggml_fill` (with a new bf16 fill path). (b) the extra-states `ggml_get_rows` + `ggml_cpy` already
-support bf16 (verified) - no change. This is exactly the kind of non-decode state path the de-risk
-was meant to surface; it is now covered end-to-end (the bf16 coherence run exercises rs_zero on the
-fresh-sequence prompt).
-
-## NOT done in this phase (next agents)
- STEP 5 LocalAI gRPC/YAML (`CacheTypeSSM`/`CacheTypeConv` proto + grpc-server + model_config +
-  options + meta registry) - needed to force f32/bf16 from a gallery YAML; not on the de-risk gate.
- STEP 6 capability fallback (device-match probe to demote bf16->f32 before alloc on a device lacking
-  the bf16 GDN/fill path, e.g. CPU-offloaded GDN). The CPU reference DOES implement bf16 (load/store/
-  gather/fill) so a CPU fallback is numerically correct today, but the probe is the clean guard.
- The C.2 KL/PPL/long-context gate + the C.3 nsys per-call bench - GateBench (GPU gate agent, runs
-  sequentially after this build phase; binaries are pre-built in build-cuda).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-# C.2/C.3 ACCEPTANCE GATE + PARITY BENCH RESULTS (label bf16-gate-bench)
-
-Status: **GATE FAILS -> NO-SHIP. KEEP SHELVED. patch 0024 NOT created; nothing committed.**
-All runs on `dgx.casa` build-cuda binaries, wikitext-2-raw test, `-ngl 99 -fa on --seed 1`.
-Corpus: `~/bench/klgate/wikitext-2-raw/wiki.test.raw` (symlink to `~/wikitext-2-raw`, ~280k tokens).
-
-## 1. KL acceptance gate
-
-### Noise floor (f32-vs-f32, c256 chunks32) - the non-determinism floor
-| model | Mean KLD | Max KLD | Same-top-p | ln(PPL(Q)/PPL(base)) |
-|---|---|---|---|---|
-| dense q27 | -1.3e-5 | 1e-6 | 100.000% | +0.001256 |
-| MoE q35   | ~0 (-3e-7) | 5.9e-5 | 100.000% | +0.000607 |
-
-### Headline 256-token gate (bf16-vs-f32, c256 chunks32) - PASSES, but vacuously
-bf16 c256 is **byte-identical to the floor** for both models (Mean KLD -1.3e-5 dense / ~0 MoE,
-Same-top-p 100%, identical PPL). Reason: a single 256-token window is processed in ONE ubatch
-(ub512 > 256), so the recurrent state is written to the bf16 cache only ONCE at the chunk end and is
-NEVER read back to produce that window's logits. The 256-token gate therefore does NOT exercise the
-bf16 round-trip at all - it is blind to the actual cost.
-
-### Long-context drift sweep (bf16-vs-f32, chunks8) - FAILS HARD for BOTH models
-| model | ctx | Mean KLD | Same-top-p | Max KLD | 99.9% KLD |
-|---|---|---|---|---|---|
-| dense | 256  | -1.3e-5 | 100.000% | 1e-6 | 0 |
-| dense | 1024 | 0.0647 | 91.54% | 20.17 | 7.69 |
-| dense | 2048 | 0.1739 | 90.65% | 24.89 | 18.18 |
-| dense | 4096 | 0.1258 | 90.40% | 26.03 | 17.22 |
-| MoE   | 256  | ~0      | 100.000% | 5.6e-5 | 4.9e-5 |
-| MoE   | 1024 | 0.0472 | 90.04% | 5.13 | 0.95 |
-| MoE   | 2048 | 0.0442 | 90.84% | 1.85 | 1.11 |
-| MoE   | 4096 | 0.0422 | 89.97% | 2.76 | 0.83 |
-
-Gate thresholds: Mean KLD < 1e-3; Same-top-p >= 99.5%; |ln(PPL ratio)| < 0.005;
-drift MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0%.
-Result: 256-tok PASS (vacuous); **drift FAIL by ~50-170x on Mean KLD and ~9 pts on Same-top-p**
-(top-p ~90% = roughly 1 token in 10 flips its argmax at >=1024 ctx). FAIL for both dense and MoE.
-
-### Discrimination (is it a bug or genuine bf16?) - dense c1024 chunks8
- **f32 long-context floor c1024**: Mean KLD -1.2e-5, Same-top-p 100% -> the bf16 divergence is REAL
-  signal, not a long-context measurement artifact.
- **bf16 KLD is invariant to ubatch-boundary count** (= the cross-ubatch state read-back frequency):
-  ub1024 (0 internal boundaries) 0.0642 / 91.19%; ub512 (1) 0.0647 / 91.54%; ub256 (3) 0.0639 /
-  91.17%; ub64 (15) 0.0682 / 90.97%. Flat. -> The error is INTRINSIC to bf16 over the long
-  recurrence INSIDE a single op call, NOT a chunked-prefill/keep_rs/gather handoff bug (R2 ruled out;
-  test-backend-ops 52/52 already passed). The error PLATEAUS with context (contraction), i.e. it is
-  bounded but LARGE: the gated-DeltaNet has long-memory heads (exp(g) ~ 1), so the g<1 decay does NOT
-  tightly contract the per-step bf16 rounding the way the plan's A.3 optimistically assumed.
-
-Note (CORRECTED): this is NOT vLLM's precision. vLLM keeps the GDN **temporal state in f32** (proven
-three ways in BITEXACT_VS_VLLM.md: empirical kernel-boundary tensor dtype, the config chain, and the
-bandwidth regime; only vLLM's tiny conv state is bf16). So bf16 temporal here is a step BELOW vLLM's
-recurrent precision, not a match. (An earlier byte-gate draft mislabeled vLLM as bf16-state; that was
-refuted.) At equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak BW).
-
-## 2. Parity bench - the perf lever IS real
-
-### nsys recurrence per-call (graphs OFF, npp4 ntg32 npl128) - gated_delta_net_cuda Avg
-| model | f32 ms/call | bf16 ms/call | delta |
-|---|---|---|---|
-| dense q27 | 3.381 | 1.726 | **-49.0%** |
-| MoE q35   | 2.245 | 1.153 | **-48.6%** |
-
-The predicted 3.49 -> 2-3 ms/call lever LANDED (even beat it). Total GPU time dropped too (dense
-~12.05 -> ~9.05 s graphs-off). bf16 halving the persisted SSM-state bytes halves the dominant decode
-kernel, exactly as designed.
-
-### End-to-end decode throughput (S_TG aggregate, npp128 ntg128, graphs ON unless noted)
-| model | npl | f32 t/s | bf16 t/s | note |
-|---|---|---|---|---|
-| dense | 32  | 212 | 239 | +12.8% |
-| dense | 128 | 371-376 (stable) | 287 / 336 / 487 / 498 (BIMODAL) | clean ~490 = +31%; bad runs from a CUDA-graph instability on the dense path |
-| dense | 128 | 371.67 (graphsOFF) | 486.68 (graphsOFF) | clean +31% |
-| MoE   | 32  | 449 | 509 | +13.4% |
-| MoE   | 128 | 767 | 958 | +24.9% (clean, nsys-corroborated) |
-
-% of vLLM (391 t/s dense reference): f32 default = 95-96% (bit-exact, higher precision than vLLM);
-bf16 clean ~490 = **125%** (but unstable on dense + fails the numeric gate). MoE bf16 +25% is clean.
-
-## 3. DECISION: NO-SHIP / KEEP SHELVED
- The KL gate **fails** the long-context drift criterion for both models: bf16 SSM state changes
-  ~10% of tokens at >=1024 ctx vs our f32 (Same-top-p ~90%, Mean KLD 0.04-0.17). It is therefore NOT
-  a quality-neutral opt-in and cannot honor the project's "f32 bit-exact default" promise.
- Per the task rule (gate FAIL -> do not ship as usable): **patch 0024 was NOT created and nothing was
-  committed** (DGX tree stays uncommitted; backup `~/llama-paged-dev/BF16_SSM_STATE.diff`).
- The perf lever is genuinely real (recurrence halves; dense ~490 t/s = 125% of vLLM when clean; MoE
-  +25%), but bf16 temporal is BELOW vLLM's precision (vLLM keeps temporal f32), so it remains a valid
-  FUTURE option only if shipped as an explicitly-labeled "reduced-precision, NON-bit-exact, below-vLLM"
-  mode (never quality-neutral), AND the dense CUDA-graph throughput instability (bimodal 287..498) is
-  fixed first. The principled path is hybrid per-head precision (f32 long-memory heads + bf16 fast
-  heads) - keeps precision at-or-above vLLM while capturing most of the speedup.
- Recommendation: keep the shipped default as f32 bit-exact (95% of vLLM at higher precision). Shelve
-  bf16. Optional follow-up lever if precision must be cut: bf16 only on the SHORT-memory heads (those
-  with exp(g) well below 1), keeping long-memory heads f32 - a mixed-precision state that could pass
-  the gate while still cutting bytes; not implemented/measured here.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/BITEXACT_VS_VLLM.md
+++ b/backend/cpp/llama-cpp/patches/paged/BITEXACT_VS_VLLM.md
@@ -1,339 +0,0 @@
-# Bit-exact vs vLLM, and the f32-preserving-parity hunt (Qwen3.5 gated-DeltaNet)
-
-Label: crossengine-bitexact (READ-ONLY, no GPU). Adversarial source+numerics study.
-Model: q36-27b-nvfp4 (dense, `Qwen3_5ForConditionalGeneration`) / q36-35b-a3b-nvfp4
-(MoE, `Qwen3_5MoeForConditionalGeneration`). Engines: llama dev `~/llama-paged-dev`,
-vLLM 0.23.0 `~/vllm-bench`. Decode B=128, enforce-eager / graphs-off, GB10 (~273 GB/s).
-
-> **CORRECTION NOTICE (supersedes the earlier draft of this file).** A prior pass concluded
-> "vLLM's GDN state cache is bf16, so the 2x recurrence-DRAM gap is f32(llama)-vs-bf16(vLLM)
-> width" (old B2/B3). **That is wrong.** It read `gated_delta_net_state_dtype(..., mamba_ssm_cache_dtype="auto")`
-> as auto->model-dtype=bf16, but it did **not** trace the Qwen3.5-specific config override that
-> reassigns `mamba_ssm_cache_dtype` from `"auto"` to `"float32"` *before* the state dtype is
-> resolved. **vLLM stores this model's gated-DeltaNet temporal state in float32**, the same width
-> as llama. Proof chain in Part B. Everything in Part C is re-derived from the corrected dtype.
->
-> **INDEPENDENT RE-VERIFICATION (this pass, live DGX source).** Separate sub-agents reached
-> *opposite* dtype readings (one f32, two bf16). The contradiction was resolved by reading every
-> link of the chain directly, not by majority vote. All eight links confirm **float32 temporal
-> state**: `config.json text_config.mamba_ssm_dtype = "float32"` (both served models);
-> `config/cache.py:129` default `mamba_ssm_cache_dtype = "auto"`; the bench scripts
-> (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`, `serve_nvfp4.sh`) pass **only**
-> `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`,
-> no `--dtype`); `config/vllm.py:847 __post_init__` -> `:856 try_verify_and_update_config()` (runs at
-> finalize, before any state-dtype resolution); `MODELS_CONFIG_MAP` (`models/config.py:622-623`) maps
-> both `Qwen3_5ForConditionalGeneration` and `Qwen3_5MoeForConditionalGeneration` ->
-> `Qwen3_5ForConditionalGenerationConfig`; its override body (`config.py:546-549`)
-> `if mamba_ssm_cache_dtype=="auto": cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype` **fires**
-> (value "float32"); `mamba_utils.py:91-94` then takes the `!= "auto"` branch ->
-> `temporal = STR_DTYPE_TO_TORCH_DTYPE["float32"] = torch.float32` (conv stays bf16);
-> `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **temporal** (2nd)
-> tuple element and allocates the cache (`:1136`) at f32; `ssm_state = self_kv_cache[1]` (`:1316/1596/1664`).
-> The two bf16 sub-agent readings are **refuted** - they stopped at the `cache.py` default "auto" and
-> never traced the `__post_init__` override. **Numeric corroboration:** at the measured vLLM duration
-> 3.62 ms/call, bf16 (402 MB) would imply 111 GB/s = 41% peak (implausibly low for a tuned BW-bound
-> Triton kernel); f32 (805 MB) implies 222 GB/s = 81% peak (the expected regime). f32 is the only
-> reading consistent with both source *and* the measured time.
-
-## Headline (two answers)
-
-1. **Bit-exact-vs-vLLM (identical logits / probabilities) is IMPOSSIBLE - for this model and for any
-   two distinct engines.** B4 = CONFIRMED. The sharpest proof is the GDN recurrence itself: the two
-   kernels evaluate an *algebraically reassociated* expression (`g.Sigma` vs `Sigma.g`) on *different
-   reduction trees*, so they diverge **even if both ran pure f32 with identical inputs**. On top of
-   that the FP4 GEMM uses different operand precision (8-bit vs 4/16-bit activations) and different
-   accumulation - a >>ULP divergence in every projection and the LM head.
-
-2. **bf16 SSM state is NOT the only way to reach vLLM decode throughput, and an f32-preserving lever
-   was missed.** vLLM reaches its throughput **with an f32 GDN state** (proven). Both engines move the
-   same ~805 MB f32/recurrence-call; the ~10% per-call gap is a bandwidth-**efficiency** gap on equal
-   bytes (llama ~74% of peak, vLLM ~81%), i.e. an occupancy/grid/coalescing lever that is **bit-exact
-   vs llama's own f32**. bf16 state is an *optional over-clock* (goes AHEAD of vLLM on the recurrence),
-   not a parity requirement. B2/B3 (as "bf16 width is the lever") = REFUTED.
-
---
-
-# The five questions, answered (synthesis)
-
-**Q1. Can llama be BIT-EXACT with vLLM? NO.** Two *binding* (>>ULP) divergence sources make
-bit-identical logits impossible on their own: **(A1)** the FP4 GEMM - llama MMQ quantizes the
-activation to **q8_1 (8-bit)** while vLLM runs cutlass **w4a4 (4-bit acts)** or marlin **w4a16
-(16-bit acts)**; different operand precision + accumulation order -> ~1e-2 relative error in *every*
-projection and the LM head; **(A2)** the GDN recurrence - llama computes `g*(Sigma round(S*k))`
-(scalar decay *after* the reduction) while vLLM computes `Sigma round(round(g*h)*k)` (decay rounded
-into each element *before* the reduction): an IEEE-754 reassociation on *different reduction trees*
-(warp butterfly vs Triton `tl.sum`) that diverges **even with identical pure-f32 state and inputs**.
-A dozen further ops (L2/RMSNorm, MRoPE, gate `exp`, flash-attn softmax) add close-but-not-equal
-rounding. Cross-engine bit-exactness is impossible *in general* (FP non-associativity across distinct
-GEMM/recurrence/norm kernel stacks); the determinism literature only buys run-to-run determinism
-*within* one engine. **Weaker form reachable:** greedy **top-1 token agreement** is the right gate
-(top-1 / KL / PPL-delta, never md5). It is probabilistic (flips at low-margin steps), **compounds**
-with length (once one token differs the SSM/KV states fork), and is *weaker here* than a
-same-precision run because of the A8-vs-A4 GEMM gap.
-
-**Q2. Is bf16 SSM state the only path to vLLM decode throughput? NO - an f32-preserving lever exists
-and bf16 is not even required for parity.** vLLM carries the **same f32 temporal state** (proven +
-re-verified), so the recurrence gap is **bandwidth EFFICIENCY on equal f32 bytes** (llama 74% vs vLLM
-81% of GB10 peak), ~10% per call, *not* a 2x width gap. The lever: **retune `gated_delta_net_cuda`
-74% -> ~81%** - it launches 196608 tiny one-column blocks (butterfly-reduce per token); fold toward
-fewer/larger `BV x BK` tiles + vectorized `f32x4` loads + better row coalescing, **keeping the
-per-column reduction order -> BIT-EXACT vs llama's own f32** (md5-gateable). **Cost vs bf16:** zero
-precision risk and bit-exact, but it can only **match** vLLM's recurrence BW (81%), never beat it;
-worth ~+5% (~335 -> ~351 tok/s, ~90% of vLLM), and it caps below 100% unless stacked with the other
-bit-exact levers (conv fusion 0021, activation fold, oproj MMQ 0020). The adversarial sweep of every
-other f32 avenue (lossless sub-f32, delta/low-rank/sparse, recompute+checkpoint, 2nd-stream/overlap,
-chunked recurrence) **FAILS** to beat it; recompute is bit-exact but only **ties** the irreducible
-one-full-state-READ floor and is now moot (vLLM also writes f32, so you match its achieved BW, you
-don't need to eliminate the write). bf16 remains the **only** lever that goes *ahead* of vLLM on the
-recurrence (~440 tok/s) - an **over-clock**, not a requirement.
-
-**Q3. Does bf16 state MATCH vLLM's precision or DEGRADE below it? It DEGRADES below vLLM.** (This
-corrects the `precision-ground-truth` sub-agent's "matching, not degrading" claim, which rested on
-the refuted bf16 reading.) vLLM keeps the **temporal/recurrent** state in **f32**; only its small
-**conv** state is bf16 (llama keeps conv f32, so llama is *more* precise there). So bf16 **temporal**
-state in llama (~8 mantissa bits) sits **below vLLM's f32 temporal** (~24 bits) - it is a deliberate
-precision-for-speed trade, KL/PPL-gated vs llama's own f32 *and* a step under vLLM's recurrent-state
-precision. A genuine "match vLLM's envelope" change would be f32 temporal (as today) + bf16 conv -
-which costs llama precision only on a tiny stream and buys almost no BW.
-
-**Q4. What can "parity" mean here? Throughput at equal precision + a distributional quality bar -
-never identical bits.** Bit-identical logits are impossible cross-engine, so "parity" = **(a)**
-throughput (tok/s in the harness) at **(b)** a quality bar measured by **top-1 greedy agreement,
-KL(llama||vLLM)/step, and PPL-delta**, never md5. Both engines already run the recurrence math in f32
-registers; at **equal** precision (llama f32 temporal == vLLM f32 temporal) the *only* open variable
-is throughput, and that gap is closable **bit-exactly** (Q2). If llama adopts bf16 temporal, "parity"
-must be restated as "throughput >= vLLM at KL/PPL within gate vs llama's own f32" and reported as the
-precision-for-speed trade it is.
-
-**Q5. Did the prior analysis get B1-B4 right? B1 mostly; B2/B3 REFUTED; B4 CONFIRMED. Overturn the
-"bf16 is required" framing - keep the bit-exact levers.**
- **B1 TRUE** (single-pass f32, load-once/store-once, 74% peak) - but its sub-claim "more efficient
-  than vLLM (41%)" is **REFUTED** (41% was the bf16 artifact; vLLM is ~81%, *more* efficient).
- **B2 REFUTED** - not a f32-vs-bf16 width gap; equal f32 bytes both sides, ~10% efficiency gap.
- **B3 REFUTED** as written - vLLM reaches its throughput **with f32 state**; a bit-exact f32
-  occupancy retune reaches vLLM's recurrence BW. bf16 is optional.
- **B4 CONFIRMED** - impossible, on two independent grounds (structural A1+A2; general FP
-  non-associativity across distinct kernel stacks).
- **Plan disposition:** do **not** overturn the conv-fusion (0021) bit-exact lever - keep it.
-  **Re-prioritize the bit-exact f32 occupancy/coalescing retune of `gated_delta_net_cuda` as the
-  parity path.** Treat bf16 temporal state as an explicitly-gated **over-clock for going beyond
-  vLLM**, reported as a precision-for-speed trade (below vLLM's f32 recurrent precision), NOT as a
-  parity-matching change.
-
---
-
-# PART A - Divergence inventory (per source: bit-identical vs close)
-
-Per decode layer the two engines run *different kernels* for: FP4 GEMMs (proj + LM head), depthwise
-conv+SiLU, q/k L2-norm, the GDN recurrence, gated RMSNorm; and on the hybrid's full-attention layers:
-RMSNorm q/k-norm, MRoPE, flash attention, a sigmoid gate.
-
-## A1. NVFP4 dequant + FP4 GEMM -- NOT bit-identical (diverges >> ULP)
-
- **llama**: MMQ (`mmq.cuh` `block_fp4_mmq`, nvfp4 block=16, 4x ue4m3 sub-scales). Host path
-  (`ggml-cuda.cu` ~1955-2014) **quantizes the activation (src1) to q8_1** (`block_q8_1_mmq`, **8-bit**,
-  block 32) and accumulates over K in the MMQ tile (DP4A / Blackwell FP4-MMA); tile order set by
-  `mmq_y`/`mmq_x` + the warp-MMA fragment layout.
- **vLLM**: `compressed_tensors_w4a4_nvfp4` -> cutlass FP4 GEMM on Blackwell (**4-bit** activations,
-  w4a4, per-group act-quant, e4m3 block scale x global FP8 tensor scale) or marlin fp4 fallback
-  (**16-bit** activations, w4a16, dequant->bf16 then bf16 GEMM). `apply_weights` -> `self.kernel`.
- **Verdict: not close.** (a) *Operand precision differs*: llama 8-bit acts vs vLLM 4-bit (cutlass) or
-  16-bit (marlin) - per-GEMM outputs differ at ~1e-2 relative, not ULP. (b) Scale-application order
-  differs. (c) Accumulation tiling/order differs (MMQ fragment vs cutlass/marlin). This is the largest
-  divergence and is present in every projection + the LM head, so logits differ materially on its own.
-
-## A2. gated-DeltaNet recurrence -- NOT bit-identical, AND provably so even in pure f32
-
-Both single-pass over an **f32** state (Part B). llama: `gated_delta_net.cu`
-`gated_delta_net_cuda<128,KDA=false>`; vLLM: `fused_recurrent.py`
-`fused_recurrent_gated_delta_rule_packed_decode_kernel`. Scalar-gate (GDA) path, `g.ne0==1`.
-With S[k][v] (llama, transposed) == h[v][k] (vLLM):
-
-```
-llama:  kv[v] = Sigma_k S_old[k][v]*k[k]      # OLD state; g applied AFTER the sum
-        delta = (v[v] - g*kv[v])*beta;  S_new = g*S_old + k(x)delta;  o[v]=Sigma_k S_new[k][v]*q[k]
-vLLM:   h' = g*h_old                          # decay rounded into EVERY element first
-        kv[v]=Sigma_k h'[v][k]*k[k]=Sigma_k round(g*h_old)*k;  b_v=(v[v]-kv[v])*beta
-        h_new = h' + b_v(x)k;  o[v]=Sigma_k h_new[v][k]*q[k]
-```
-
-Algebraically identical (g scalar). **Numerically not**, for two structural reasons that survive even
-with identical f32 state and identical inputs:
- **Reassociation:** llama forms `g*(Sigma round(S*k))` (scalar multiply *after* the reduction);
-  vLLM forms `Sigma round(round(g*h)*k)` (decay rounded into each element *before* the reduction).
-  Distributing a multiply across a sum is exact in R, not in IEEE-754. This is not a precision knob.
- **Different reduction trees:** llama `warp_reduce_sum<32>` (4 sequential per-lane FMAs + 5-step
-  butterfly) vs vLLM `tl.sum(...,1)` (Triton tree over the 128-wide BK axis).
-**Verdict: not bit-identical; cannot be made so without rewriting one kernel to the other's op order.**
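-
-The reassociation can be written out under a simplified rounding model. This is a sketch, not either
-kernel's exact arithmetic: `fl(.)` stands for one round-to-nearest f32 operation, each sum stands for
-that engine's own reduction tree, the row index is written `i` to avoid the `k[k]` clash, and FMA
-contraction is ignored (all assumptions of this illustration):
-
-```latex
-\begin{aligned}
-\text{llama:}\quad & \mathrm{fl}\Big( g \cdot \sum_{i} \mathrm{fl}\big( S_{\mathrm{old}}[i][v] \, k_i \big) \Big) \\
-\text{vLLM:}\quad  & \sum_{i} \mathrm{fl}\Big( \mathrm{fl}\big( g \cdot h_{\mathrm{old}}[v][i] \big) \, k_i \Big)
-\end{aligned}
-```
-
-In exact arithmetic both equal `g * Sigma_i S_old[i][v]*k_i`. In IEEE-754 the first form applies the
-decay once to the reduced sum, the second rounds it into every state element before the sum, so the two
-results generally differ in the low bits even before the reduction trees are taken into account.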
-
-## A3. Depthwise conv1d (width 4) + SiLU -- NOT bit-identical
-llama `ggml_ssm_conv` (ascending-j f32 FMA) + `ggml_silu`, conv state cached **f32**. vLLM
-`causal_conv1d_update` (Triton) + SiLU, conv state cached **bf16** (`conv_state_dtype = bf16`; only the
-*temporal* SSM state is forced f32 - Part B). Different kernel + different conv-state width + FMA order.
-(Patch 0021 fuses llama's chain bit-exactly vs *llama's own* f32 path, not vs vLLM.)
-
-## A4. q/k L2-norm + RMSNorm/RMSNormGated -- NOT bit-identical (close, ~1e-6)
-L2-norm: llama standalone `ggml_l2_norm` (f32 tree) vs vLLM `l2norm_fwd`/in-kernel fold
-(`USE_QK_L2NORM_IN_KERNEL`). RMSNorm: llama `ggml_rms_norm` vs vLLM `vllm_c` fused kernel (run log:
-`rms_norm=['vllm_c','native']`); gated out-norm `build_norm_gated`=RMS*SiLU(z) vs `RMSNormGated`.
-Different variance reduction tree / eps placement / fusion boundary.
-
-## A5. MRoPE + gate scalar pipeline -- NOT bit-identical (close)
-MRoPE: `ggml_rope_multi` (ggml sin/cos) vs vLLM rotary cos/sin cache (different theta eval + apply
-order). Gate: vLLM computes `-exp(A_log)*softplus(a+dt)` then `exp` **in-kernel**; llama computes
-`softplus(alpha+ssm_dt)*ssm_a` as split graph ops with `ssm_a` baking `-exp(A_log)` at GGUF-convert
-time (rounded once), writes/reloads the intermediate, `expf` in-kernel. Same algebra, different
-rounding points + convert-time vs runtime `exp(A_log)`.
-
-## A6. Flash attention (full-attn layers) -- NOT bit-identical (close)
-llama `ggml_flash_attn_ext` -> `fattn-mma-f16`/`fattn-vec` (online softmax, F16/F32 PV accum per
-`GGML_PREC`) vs vLLM FlashInfer/FA2. Different tiling => different running max/sum order => different
-rounding.
-
-## A7. SiLU/sigmoid primitives + fusion -- equivalent IF inputs match (they never do)
-Both ultimately use the same hardware `expf`/`__nv_expf`; the primitives could match given identical
-inputs, but every upstream value has diverged, and vLLM fuses act+quant / norm+quant differently than
-llama's separate ops (run log `fuse_act_quant=True`), moving the rounding points.
-
-### Inventory summary
-
-| Source | bit-identical? | divergence size |
-|---|---|---|
-| FP4 GEMM (proj/LM head): MMQ q8_1(A8) vs cutlass w4a4(A4)/marlin w4a16 | **NO** | **>>ULP (~1e-2)** |
-| GDN recurrence: hand-CUDA warp-reduce vs Triton tl.sum | **NO (provable even in f32)** | reassoc + tree |
-| conv1d+SiLU: f32 conv-state vs bf16 conv-state | NO | dtype + order |
-| L2-norm / RMSNorm | NO | ~1e-6 (tree) |
-| MRoPE | NO | ~ULP-1e-6 |
-| gate softplus/exp | NO | rounding points |
-| flash attention | NO | softmax tiling |
-| silu/sigmoid primitive | identical IFF inputs equal | inputs never equal |
-
-Any single NO makes the logits differ. A1 and A2 differ by far more than ULP -> the logit vectors are
-not close-to-equal at the bit level; they agree only to a few significant digits.
-
---
-
-# PART B - The decisive f32-state correction (proof from source)
-
-The byte-gate inferred vLLM's GDN temporal state is **bf16** (402 MB/call, 41% peak) and built the
-"bf16-width is the lever" case on it. The byte count was *inferred from the dtype*; ncu byte counters
-were blocked, so only the **duration** (3.62 ms/call) was measured. The dtype inference is falsified:
-
-1. `config.json`: `architectures=["Qwen3_5ForConditionalGeneration"]`, `text_config.dtype=bfloat16`,
-   and **`text_config.mamba_ssm_dtype = "float32"`**.
-2. `models/config.py:590 MODELS_CONFIG_MAP` maps `"Qwen3_5ForConditionalGeneration"` (line 622) and
-   `"Qwen3_5MoeForConditionalGeneration"` (623) to `Qwen3_5ForConditionalGenerationConfig`.
-3. `Qwen3_5ForConditionalGenerationConfig.verify_and_update_config` (config.py:536-562):
-   `mamba_ssm_dtype = getattr(hf_text_config,"mamba_ssm_dtype")` (="float32"); if
-   `cache_config.mamba_ssm_cache_dtype == "auto"` (the default) it executes
-   **`cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype`** -> sets it to **"float32"**.
-4. This override runs at config finalization: `config/vllm.py:856` -> `try_verify_and_update_config()`
-   (vllm.py:1880-1900) looks up the arch in `MODELS_CONFIG_MAP` and calls `verify_and_update_config`.
-   It runs **before** any layer/model state-dtype resolution.
-5. The bench left it default: `h2h_dense_vllm.sh` = `vllm serve .../q36-27b-nvfp4-vllm --enforce-eager
-   --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`; `dl-logs/vllm_dense.log`
-   non-default args confirm none). So the override fires and the value is "float32".
-6. State dtype resolution reads the **already-overridden** value:
-   - `gdn/base.py:53-57` `get_state_dtype()` -> `gated_delta_net_state_dtype(model_dtype=bf16,
-     cache_config.mamba_cache_dtype="auto", cache_config.mamba_ssm_cache_dtype="float32")`.
-   - `qwen3_5.py:678 get_mamba_state_dtype_from_config` likewise passes
-     `vllm_config.cache_config.mamba_ssm_cache_dtype` (= "float32", post-override) - **not** a raw "auto".
-   - `mamba_utils.py _mamba_state_dtype`: conv_state = `get_kv_cache_torch_dtype("auto", bf16)` = **bf16**;
-     temporal_state, since `mamba_ssm_cache_dtype != "auto"`, = `STR_DTYPE_TO_TORCH_DTYPE["float32"]`
-     = **torch.float32** (key verified: `torch_utils.py:33 "float32": torch.float32`).
-7. `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **second** tuple
-   element (temporal) = **float32**, allocates the cache `dtype=state_dtype`. The packed_decode kernel
-   round-trips f32: `b_h = tl.load(p_h0).to(f32)` ... `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))`
-   with `p_ht.dtype == initial_state.dtype == float32`.
-
-**=> vLLM's gated-DeltaNet temporal (recurrent) state cache for this model is float32, identical width
-to llama's f32 state.** The earlier "bf16" reading hardcoded the third arg as `"auto"` and missed the
-override at steps 3-4. Only the small *conv* state is bf16 in vLLM (f32 in llama: divergence A3, tiny
-byte stream).
-
-## Re-derived efficiency table (measured duration + PROVEN f32 byte volume)
-
-| kernel | state dtype (PROVEN) | bytes R+W/call | duration/call | eff. BW | % of 273 GB/s peak |
-|---|---|---|---|---|---|
-| llama `gated_delta_net_cuda` | f32 | 805 MB | 3.98 ms | 202 GB/s | **74%** |
-| vLLM `..._packed_decode` | **f32 (not bf16)** | **805 MB (not 402)** | 3.62 ms | **222 GB/s** | **~81%** |
-
- **B1 (single-pass f32 byte floor): TRUE** (load-once/store-once `s_shard`, coalesced). *Sub-claim
-  "more BW-efficient than vLLM (41%)" REFUTED* - 41% was the bf16 artifact; at the correct f32 byte
-  count vLLM is at ~81%, i.e. **more** efficient than llama.
- **B2 ("the gap is f32-vs-bf16 width"): REFUTED.** Equal f32 bytes both sides; the ~10% per-call gap
-  is bandwidth **efficiency** on equal bytes, not width.
- **B3 ("vLLM throughput REQUIRES bf16 state"): REFUTED.** vLLM reaches it *with f32 state*.
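-
-The table's numbers can be reproduced by hand. The sketch below assumes the per-call state volume is
-`H x n_seqs x S_v x S_v` elements with H=48, n_seqs=128, S_v=128 (the launch dimensions quoted for the
-llama kernel in Part C); that decomposition is a reconstruction that happens to match the 805 MB figure,
-not something the blocked ncu byte counters measured:
-
-```latex
-\begin{aligned}
-N_{\mathrm{state}} &= 48 \cdot 128 \cdot 128^2 = 100{,}663{,}296 \ \text{elements} \\
-\text{bytes}_{\mathrm{f32}}  &= 2\,(\text{R+W}) \cdot 4 \cdot N_{\mathrm{state}} \approx 805\ \text{MB}, \qquad
-\text{bytes}_{\mathrm{bf16}} = 2 \cdot 2 \cdot N_{\mathrm{state}} \approx 402\ \text{MB} \\
-\text{llama (f32):}\quad & 805\ \text{MB} / 3.98\ \text{ms} \approx 202\ \text{GB/s} \approx 74\%\ \text{of}\ 273 \\
-\text{vLLM (f32):}\quad  & 805\ \text{MB} / 3.62\ \text{ms} \approx 222\ \text{GB/s} \approx 81\%\ \text{of}\ 273 \\
-\text{vLLM (if bf16):}\quad & 402\ \text{MB} / 3.62\ \text{ms} \approx 111\ \text{GB/s} \approx 41\%\ \text{of}\ 273
-\end{aligned}
-```
-
-The last line is the refuted reading: the same measured 3.62 ms only lands in a plausible bandwidth
-regime if the bytes moved are the f32 volume.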
-
---
-
-# PART C - The f32-preserving lever, and where recompute/bf16 land
-
-Since vLLM hits ~81% on the **same f32 byte volume** llama runs at ~74%, the missed lever is **raising
-llama's `gated_delta_net_cuda` achieved BW 74% -> ~81%**, bit-exact, NOT dtype width:
- llama grid `(H=48, n_seqs=128, ceil(S_v/4)=32) = 196608` blocks/128 thr, each warp owns ONE state
-  column + warp-reduces over 128 rows. vLLM grid `(NV=4, B*HV=6144) = 24576` programs (num_warps=1),
-  each owns a BV=32 x BK=128 tile. llama's far-finer blocking (8x more blocks, one column of work each,
-  a butterfly reduce/token) is the likely ~7-point deficit. Retune toward fewer/larger blocks (more
-  columns/block, vectorized f32x4 loads, better row coalescing) - changes thread/tile mapping + load
-  width only, **keeps the per-column reduction order -> bit-exact vs llama's own f32**.
- Upper bound: 74%->81% on ~50% of the step saves ~17 ms/step (384 -> ~367 ms), i.e. ~+5% -> ~351 tok/s
-  (~90% of vLLM 391), stacking with the landed bit-exact levers (oproj MMQ 0020 @86%, conv fusion 0021).
-
-**Other f32-preserving avenues (adversarial sweep) - none beats the simple bf16 over-clock, but the
-occupancy tune above is the real bit-exact win:**
- *Lossless sub-f32 state:* generic float compression is data-dependent (1.1-1.5x, never a guaranteed
-  2x) and breaks the 128-consecutive-f32 coalescing a BW-bound kernel depends on. The state is dense,
-  full-rank, non-symmetric (sum of `k(x)delta`, k!=delta) -> no low-rank/half-storage. FAILS.
- *Recompute (checkpoint every N + rank-1 replay):* eliminates the per-step WRITE; the per-step full
-  dense f32 READ (the `S^T k` / `S^T q` matvecs need every element; the checkpoint is itself a full
-  read) is irreducible. Optimal N~=11 -> ~473 MB/step (0.587x), realistically ~0.65-0.75x after
-  replay/latency overhead. A genuine bit-exact path but it only reaches - never beats - the read floor,
-  at large kernel/graph complexity. **Note: this was over-weighted before because vLLM was assumed
-  bf16; now that vLLM is f32 too and runs at 81%, you do NOT need to cut the write to match vLLM - you
-  need to match vLLM's achieved BW on the same f32 bytes.** Recompute is dominated.
- *2nd stream / overlap / pipelining:* DRAM BW (273) is one shared resource; the whole decode step is
-  uniformly BW-bound (state traffic + ~13.5 GB/step dense NVFP4 weight traffic both hit 273), so
-  overlapping two BW-bound phases sums to ~0. FAILS.
- *Equivalent recurrence with less decode traffic:* chunked gated-delta-rule is a prefill lever (C=1 at
-  decode); attention/materialization-free form is O(t) over the prefix. FAILS.
-
-**bf16 SSM state is therefore an OPTIONAL over-clock**, the only lever that goes *ahead* of vLLM on the
-recurrence (halving the 805 MB state traffic -> ~440 tok/s) - but it takes llama below both its own f32 and vLLM's f32
-precision, so it must be **KL/PPL-gated vs llama's own f32**, never md5. f32-only parity-class
-throughput is plausible from the SUM of bit-exact levers (recurrence occupancy + conv fusion + oproj
-MMQ + activation fold); none require bf16.
-
---
-
-# PART D - Verdict on B4 + the meaningful weaker form
-
-## Bit-exact-vs-vLLM: IMPOSSIBLE (B4 CONFIRMED) - two independent grounds
-
-1. **Structural (this model):** A1 (FP4 GEMM operand precision + accumulation) and A2 (recurrence
-   `g.Sigma` vs `Sigma.g` + different reduction trees) make per-layer outputs differ by >>ULP, so logits
-   cannot be bit-identical. A2 shows it is not a precision knob: the kernels evaluate a *reassociated
-   expression*, differing **even given identical f32 state and inputs**.
-2. **General (any two engines):** IEEE-754 add/mul are non-associative; two engines that tile, reduce,
-   fuse, and quantize differently cannot produce bit-identical results for a non-trivial transformer.
-   Field determinism work (batch-invariant / fixed-reduction kernels, "defeating nondeterminism in LLM
-   inference") delivers **run-to-run determinism WITHIN one engine**; it does **not** and cannot deliver
-   **cross-engine** bit-exactness (that needs identical kernel+tiling+reduction-order+dtype for *every*
-   op). Cross-engine bit-exactness is essentially never achieved in practice. Bit-exactness is only a
-   meaningful gate **within** an engine (how llama patches 0018-0021 are validated by md5).
-
-## Greedy-token match (argmax robustness) - the right weaker form, but probabilistic
-Because logits differ mostly in low-order bits (A4-A7) plus a few-significant-digit GEMM/recurrence gap
-(A1-A2), the **argmax** frequently coincides whenever the top-1/top-2 logit margin exceeds the
-cross-engine noise. This is the only meaningful cross-engine "equivalence"; gate on **top-1 agreement /
-KL / PPL-delta**, never md5. Caveats: not guaranteed per-token (low-margin steps can flip); it
-**compounds** - once one greedy token differs the sequences fork and the KV/SSM states diverge, so
-agreement degrades with length (high on short continuations, drift on long ones); and the FP4 A4-vs-A8
-gap (A1) makes the per-step divergence *larger* here than a same-precision bf16-vs-bf16 comparison,
-weakening greedy agreement for this model specifically.
-
-**Bottom line:** target near-vLLM via KL/PPL/top-1-agreement, not bit-exactness. Reserve bit-exact
-gating for intra-llama validation (the f32 recurrence-occupancy lever and the conv fusion qualify;
-bf16 state does not and must be KL/PPL-gated vs llama's own f32).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
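
The two verdicts above (cross-engine bit-exactness is impossible, greedy-token match is probabilistic) can be illustrated with a minimal Python sketch. It is not taken from either engine: the helper names `argmax` and `top1_agreement` and the toy logits are hypothetical, chosen only to show that reordering a float sum changes the result and that a low-margin step can flip the argmax while a high-margin step does not.

```python
# Assumption: toy data only; neither llama.cpp nor vLLM is involved here.

def argmax(xs):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)

def top1_agreement(logits_a, logits_b):
    """Fraction of steps where both engines pick the same greedy token."""
    same = sum(1 for a, b in zip(logits_a, logits_b) if argmax(a) == argmax(b))
    return same / len(logits_a)

# IEEE-754 addition is not associative: the same three terms, two groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
reassociation_changes_bits = left != right

# Hypothetical per-step logits from two engines that differ by small noise.
engine_a = [[2.000, 1.000, 0.5],   # wide top-1/top-2 margin
            [1.000, 1.001, 0.2]]   # margin smaller than the noise
engine_b = [[2.010, 0.990, 0.5],
            [1.001, 1.000, 0.2]]

agreement = top1_agreement(engine_a, engine_b)
print(reassociation_changes_bits, agreement)
```

The first step agrees because its margin dwarfs the perturbation; the second flips. This is why the gate proposed above is top-1 agreement / KL / PPL-delta rather than md5.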
--- a/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md
@@ -1,53 +0,0 @@
-# GDN Recurrence Byte-Gate - Progress (agent: ncu-byte-gate)
-
-## Hard blocker on direct DRAM counters
- ncu HW perf counters: ERR_NVGPUCTRPERM (NVreg_RestrictProfilingToAdminUsers=restricted, root-only).
- nsys --gpu-metrics-devices: same ERR_NVGPUCTRPERM.
- No passwordless sudo on dgx.casa. DRAM byte counters UNOBTAINABLE without root.
- FALLBACK (decisive, no perfcounters needed): CUPTI kernel TIMING (allowed) + exact byte
-  geometry from kernel source => implied effective BW + a hard mathematical cap on re-stream factor.
-
-## Byte geometry (exact, from gated_delta_net.cu + GGUF)
- Qwen3.5 dense q36-27b-nvfp4: 48 GDN layers, H=48 v-heads, S_v=128 (square state 128x128/head).
- State per (seq,head) = 128*128 f32 = 64 KiB. Per seq (one layer, 48 heads) = 48*64KiB = 3.0 MiB.
- Kernel is SINGLE-PASS by construction: loads s_shard[] ONCE into regs, recurrence in-register,
-  writes state ONCE (read_state coalesced 128 consecutive f32/warp; writeback coalesced).
-  l2norm/sigmoid/softplus/gate act on small q/k/g/beta (NOT the 805MB state); gather no-ops at
-  steady decode (identity seqs). => NO multi-pass state re-streaming exists to fuse away.
- Minimal bytes/call (B=128): state R+W = 128*48*16384*4*2 = 805.3 MB; +q/k/v/out ~10 MB = ~816 MB.
- Floor time @273 GB/s = 816MB/273 = 2.99 ms/call.
-
-## Measured (clean nsys CUDA timing, graphs OFF, npp8 ntg12 npl128, build-cuda-base df1cc97)
- llama gated_delta_net_cuda steady decode: 480 calls, grid(48,128,32), avg 3.98 ms/call
-  (min 3.90, max 4.33; very tight => bandwidth-bound). 48 layers => 191 ms/step (50% of 384 ms).
- Implied effective BW @1.0x bytes = 816MB/3.98ms = 205 GB/s = 75% of 273 peak.
- HARD CAP: max bytes movable in 3.98ms @273 peak = 1.087 GB = 1.33x minimal.
-  => re-stream factor in [1.0x, 1.33x]. 2x re-streaming PHYSICALLY IMPOSSIBLE.
-  Source proves single-pass+coalesced => ~1.0x, kernel at ~75% peak.
-
-## Conv-path (same trace, steady-decode region kernels, per-call):
- ssm_conv_f32: 672 calls whole-trace avg 135.9us (incl prefill); decode-region TBD
- concat_cont: 576 calls avg 169.6us ; concat_non_cont 96 calls (prefill big)
- cpy_scalar: 896 calls avg 123.7us ; gdn_gather_nonident 672 calls avg 153.9us (mostly no-op)
-
-## vLLM (apples-to-apples: NSEQ=128, enforce_eager=True; postssm_decomp/vllm_decode.sqlite)
- vLLM state dtype = model_dtype = BF16 (_mamba_state_dtype default "auto"; config dtype=bfloat16).
-  Geometry identical to llama (H=48, k/v head_dim 128, S_v 128).
- vLLM fused_recurrent_gated_delta_rule_packed_decode steady: 3.62 ms/call (grid 4x6144x1),
-  bf16 state R+W = 402.6 MB => 111 GB/s = 41% peak. SINGLE-PASS (load p_h0 once -> f32 regs ->
-  store bf16 once).
- llama 3.98 ms/call, f32 805.3 MB => 202 GB/s = 74% peak. llama kernel is MORE BW-efficient.
-
-## Conv-path (llama steady decode, per call x48 layers)
- concat_cont 169.6us (8.14 ms/step) + cpy_scalar 120.1us (5.76) + ssm_conv_f32 115.9us (5.56)
-  = ~19.5 ms/step. Conv state ~12.6 MB (tiny) => LAUNCH-bound, not byte-bound => fusion lever (~5%).
- l2_norm 6.8us, gdn_gather 1.21us (no-op identity seqs => gather does NOT re-stream state).
-
-## FINAL VERDICT (DONE)
- llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x physically impossible @273 peak).
- NO-BUILD fused single-pass recurrence: already single-pass, coalesced, 74% peak (> vLLM 41%);
-  gate ops touch tiny q/k/g/beta, not the 805MB state => recovers ~0 state bytes.
- BUILD bf16 SSM state (design lever (2)): the 2x gap vs vLLM is 100% f32-vs-bf16 cache width.
-  805->413 MB => ~45-95 ms/step => step 384 -> 289-339 ms = parity-to-ahead of vLLM 327.
-  Non-bit-exact vs llama f32 but equal to vLLM's own bf16 precision.
- Findings written: GDN_RECURRENCE_BYTE_GATE.md (MEASUREMENT + VERDICT section appended).
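
The byte geometry and the hard cap in the note above are plain arithmetic, and can be re-derived with a short Python sketch. All inputs are the figures quoted in the note (128 sequences, 48 heads, a 128x128 f32 state, ~816 MB minimal traffic, 3.98 ms/call, 273 GB/s peak); the variable names are illustrative and nothing here is measured independently.

```python
# Figures copied from the note; MB and GB are decimal (1e6, 1e9).
PEAK_BW = 273e9          # bytes/s, LPDDR5x peak quoted in the note
N_SEQS = 128             # B
N_HEADS = 48             # H (v-heads)
S_V = 128                # state is S_V x S_V per head
F32_BYTES = 4

def state_bytes_per_call():
    """State read once + written once for every (seq, head), one layer."""
    return N_SEQS * N_HEADS * S_V * S_V * F32_BYTES * 2

MIN_BYTES = 816e6        # state R+W plus ~10 MB of q/k/v/out, per the note
AVG_TIME_S = 3.98e-3     # measured average kernel duration per call

floor_time_s = MIN_BYTES / PEAK_BW               # best case at peak bandwidth
effective_bw = MIN_BYTES / AVG_TIME_S            # implied BW at 1.0x bytes
peak_fraction = effective_bw / PEAK_BW
max_restream = PEAK_BW * AVG_TIME_S / MIN_BYTES  # cap on bytes moved vs minimal

print(state_bytes_per_call(), floor_time_s, effective_bw, max_restream)
```

The last quantity is the re-stream cap: even at full peak bandwidth the kernel's measured duration cannot move more than about 1.33x the minimal byte volume.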
--- a/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md
@@ -1,57 +0,0 @@
-# B_MOE_PROGRESS.md - B-3 (mmq_y-down warp-remap, patch 0028) checkpoint
-
-Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev`
-`2f4f5ab`, branch `b-work`), independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`.
-
-## Prior: B-2 (act-quant retune) = NEGATIVE (no lift, no patch 0027). MoE ~85% of vLLM @npl128.
-B-2 proved the act-quant tax (~2%) is already optimally tiled; the structural MoE residual is the
-grouped FP4 `mul_mat_q<NVFP4>` GEMM (~27%, LPDDR5x BW floor) + bf16 projections (~10.5%). => try B-3.
-
-## The lever (B-3 / SPEEDUP_HUNT B rank #3)
-mmq_y-down warp-remap of the NVFP4 FP4-MMA grouped GEMM `mul_mat_q<NVFP4>` in `ggml/.../mmq.cuh`.
-mmq_y tiles the weight-row (N) dimension; lowering 128->64 raises resident CTAs (smaller per-CTA
-shared + accumulator + 128 vs 256 threads/CTA => ~2x blocks/SM) to hide LPDDR5x weight-load latency,
-WITHOUT re-reading weights (each weight row lives in exactly one row-tile => BW-neutral). The MoE
-GEMM runs at ~35% of peak BW (occupancy-limited, NOT BW-saturated), so more resident CTAs is the
-right mechanism - and it is the ONE untested occupancy lever (M-tile = NEUTRAL 0015, MINBLOCKS =
-+8.7% slower 0017).
-
-## The coupling that makes it a real kernel change (not the 0017 knob alone)
-The FP4-MMA path has `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; tile_C::I==16 for the
-m16n8k64 block-scaled FP4 MMA). nwarps is global `256/warp_size = 8`, so mmq_y is pinned at 128. The
-0017 `GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP this assert at mmq_y=64. B-3 makes nwarps TYPE-AWARE:
-`mmq_get_nwarps_device<type>()` returns mmq_y/16 = 4 for NVFP4-reduced (else stock 8), keeping the
-coupling. 2 new overloads (device template + host 3-arg) + 9 call-site swaps to `<type>`. Default
-GGML_CUDA_FP4_MMQ_Y==128 returns stock nwarps for EVERY type => default build byte-identical to stock.
-
-## Bit-exactness note (the real risk)
-The per-output K-reduction order is mmq_y-INVARIANT (each output row owned by one thread). BUT mmq_y=64
-DOUBLES nty (row-tiles), changing the stream-k kbc partition => an output tile's K-range may be split
-across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different grouping =>
-FP non-associativity CAN perturb the last logit bits => greedy argmax COULD flip. So B-3 is NOT
-bit-exact-by-construction in the md5 sense; the md5 gate is EMPIRICAL. md5 fail => not bit-exact => STOP.
-
-## Status: COMPLETE - BIT-EXACT but FLAT. No patch 0028. Full result + assessment in B_MOE_RESULTS.md.
- [x] Source-read mmq.cuh: nwarps/mmq_y coupling, FP4 MMA vec_dot, kernel+fixup+launch+case sites.
- [x] Edited mmq.cuh: 2 nwarps overloads + 9 `<type>` swaps. git diff clean (37+/11-).
- [x] BEFORE baseline (stock-0025 binaries, same session): dense md5 5951a5b4==ref, moe 07db32c2==ref;
-      MoE S_TG npl32=441.98, npl128=756.47.
- [x] BUILD build-cuda @mmq_y=64 (full cuda rebuild): EXIT=0 - compiles (static_assert holds at 4*16=64).
- [x] md5 GATE PASS both models @64; test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.
- [x] Clean back-to-back A/B (build-cuda-base @128 vs build-cuda @64), 3 reps: npl32 +0.29%,
-      npl128 +0.40% - within the ~0.4% noise band. FLAT.
- [x] nsys A/B: grouped GEMM kernel 1.3% FASTER at mmq_y=64, BUT stream_k_fixup 42% costlier + SSM (40%)
-      dominant & untouched => end-to-end inert. BW-bound confirmed (same as 0015/0017/B-2).
- [x] DECIDED: FLAT -> no patch 0028. Dev tree reverted to pristine 0025 (no ggml diff), build-cuda
-      reconfigured to default + rebuilt. Bit-exact MoE ceiling = ~85% @npl128 / ~87.5% @npl32 of vLLM.
- [x] ASSESS + RECOMMEND (in B_MOE_RESULTS.md): residual = structural Marlin-NvFp4 grouped-GEMM gap,
-      uncloseable bit-exactly; fall back to 0026 bf16-SSM opt-in (default-off, fails MoE KL gate, ~95%).
-
-## Gate references
- dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439
- MoE   q36-35b-a3b-nvfp4 md5 == 07db32c2bcb78d17a43ed18bc22705cd
- gate cmd: `llama-completion -m M -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`
- bench: `llama-batched-bench -m M -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128` (S_TG=decode_agg)
- vLLM ref decode_agg @npl128 = 882.2 t/s (npl32 ref 500.8).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
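
The nwarps/mmq_y coupling described in the note above can be modelled in a few lines of Python. This is a sketch of the invariant as the note states it, not the `mmq.cuh` code: `nwarps_for`, `coupling_holds` and `threads_per_cta` are hypothetical names, and `"Q4_0"` stands in for any non-NVFP4 type.

```python
WARP_SIZE = 32
TILE_C_I = 16        # rows per warp tile for the FP4 MMA, as quoted in the note
STOCK_MMQ_Y = 128    # default weight-row tile

def nwarps_for(type_name, fp4_mmq_y=STOCK_MMQ_Y):
    """Type-aware warp count: reduced only for NVFP4 with a lowered mmq_y."""
    stock = 256 // WARP_SIZE                  # 8 warps, the global default
    if type_name == "NVFP4" and fp4_mmq_y < STOCK_MMQ_Y:
        return fp4_mmq_y // TILE_C_I          # keeps nwarps * 16 == mmq_y
    return stock

def coupling_holds(nwarps, mmq_y):
    """Python stand-in for static_assert(nwarps * tile_C::I == mmq_y)."""
    return nwarps * TILE_C_I == mmq_y

def threads_per_cta(nwarps):
    return nwarps * WARP_SIZE

for mmq_y in (128, 64):
    n = nwarps_for("NVFP4", mmq_y)
    print(mmq_y, n, threads_per_cta(n), coupling_holds(n, mmq_y))
```

With the stock 8 warps the check fails at mmq_y=64, which is the assert the 0017 knob alone would trip; the type-aware count restores it at 4 warps and 128 threads per CTA, while the default of 128 leaves every type on the stock path.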
--- a/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md
@@ -1,232 +0,0 @@
-# B_MOE_RESULTS.md - B-2 (down_proj act-quant retune / M1) RESULT: NEGATIVE (no headroom)
-
-Agent: B2-build (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`,
-branch `b-work`), independent of the held hybrid 0026 (`33e7c65`). Lever: SPEEDUP_HUNT.md section B,
-rank #2 ("down_proj act-quant retune (M1): bit-exact, bounded - act-quant is ~2% of MoE step").
-
-## VERDICT
-**The existing `blockDim.x = 128` is ALREADY the kernel-level optimum for `quantize_mmq_nvfp4` on
-GB10 sm_121. B-2 has zero headroom: there is nothing to bake (128 is the current default), and it
-does NOT lift MoE decode (end-to-end flat within 0.4% noise across all block sizes). No patch 0027.**
-MoE stays ~85% of vLLM @npl128 / ~87.5% @npl32, well below vLLM => the remaining MoE lever is B-3.
-
-## The change that was built+measured (bit-exact, then REVERTED - did not lift)
-`ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch. Replaced the hardcoded
-`constexpr int nvfp4_block_size = 128` with a `static const int` selected once from env
-`LLAMA_MOE_QUANT_BLOCK` (default 128), `block_num_y` recomputed from the SAME `blockDim.x`. ~20 LOC.
-
-### Why ANY block size is provably byte-identical (the bit-exact invariant)
-`quantize_mmq_nvfp4` maps thread -> column purely via the global linear index
-`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread
-communication (no shared memory, no warp reduction) and every thread writing its OWN disjoint output
-sub-block (its own `sub` slot in `block_fp4_mmq`: `yqs[2*sub+0/1]`, `d4[sub]`). The per-thread quant
-body (amax, the 5-offset fp8-code search, the q0/q1 nibble packing, the writeback) is untouched. So
-the (thread)->output-byte map - and the produced bytes - are invariant to `blockDim.x`. Confirmed
-empirically: md5 identical at block 64, 128, AND 256, both models.
-
-## GATE (bit-exact) - BOTH MODELS PASS at default AND at non-128 blocks
-greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`:
-
-| block | dense q36-27b-nvfp4 md5 | MoE q36-35b-a3b-nvfp4 md5 |
-|------:|-------------------------|---------------------------|
-| 128 (default) | 5951a5b4d624ce891e22ab5fca9bc439 == ref | 07db32c2bcb78d17a43ed18bc22705cd == ref |
-| 64 | 5951a5b4...439 == ref | 07db32c2...5cd == ref |
-| 256 | 5951a5b4...439 == ref | 07db32c2...5cd == ref |
-
-test-backend-ops (CUDA0): **MUL_MAT 1146/1146 PASS**, **MUL_MAT_ID 806/806 PASS**.
-
-## MEASUREMENT 1 - end-to-end MoE decode_agg (S_TG t/s), the actual throughput
-`llama-batched-bench -m q36-35b-a3b-nvfp4.gguf -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`,
-1 rep/block (run-to-run noise ~0.3-0.5%):
-
-| block | npl=32 S_TG | npl=128 S_TG |
-|------:|------------:|-------------:|
-| 32 | 437.54 | 750.41 |
-| 64 | 437.82 | 751.68 |
-| 96 | 437.69 | 749.46 |
-| **128 (base/default)** | **438.14** | **751.76** |
-| 160 | 436.38 | 750.99 |
-| 192 | 436.81 | 751.61 |
-| 256 | 437.77 | 750.14 |
-
-Spread: npl32 = 1.76 t/s (0.4%), npl128 = 2.3 t/s (0.3%) - all within noise. **No block size lifts
-end-to-end decode.** Expected: the act-quant is ~2% of the MoE step, so even a perfect (0 ns) quantize
-kernel caps the end-to-end win at ~2%, and 128 is already optimal => measured 0%.
-
-## MEASUREMENT 2 - nsys kernel-level delta on quantize_mmq_nvfp4 (the meaningful B-2 metric)
-`nsys --report cuda_gpu_kern_sum`, MoE, `GGML_CUDA_DISABLE_GRAPHS=1 -npp 4 -ntg 32 -npl 128`,
-8,193 kernel invocations (the kernel is 2.0-2.2% of GPU time in this decode-heavy window):
-
-| block | total ns | avg ns | median ns | vs 128 (total) |
-|------:|---------:|-------:|----------:|---------------:|
-| 64 | 127,523,328 | 15,564.9 | 12,256 | +8.7% slower |
-| **128 (default)** | **117,371,424** | **14,325.8** | **11,488** | baseline (fastest) |
-| 192 | 128,970,464 | 15,741.5 | 12,032 | +9.9% slower |
-| 256 | 125,422,048 | 15,308.4 | 11,936 | +6.9% slower |
-
-**128 is a clean local minimum** (faster than 64 below and 192/256 above; its immediate neighbors 96
-and 160 are end-to-end-neutral, and although nsys-stats flaked on their re-runs they cannot beat a
-bracketed local min). The alternatives regress 7-10% at the kernel level yet change end-to-end by 0%,
-which is what a ~2%-of-step kernel predicts: this BW-bound, 256-tiny-expert model has no col-tile/occupancy
-headroom in the act-quant - the same conclusion patch 0015 reached for the M-tile and patch 0017 for MINBLOCKS.
-
-## WHERE MoE STANDS (decode_agg, this base = 0025 with the re-graph)
-vLLM ref @npl128 = 882.2, @npl32 = 500.8.
- npl128: 751.8 / 882.2 = **85.2% of vLLM**
- npl32:  438.1 / 500.8 = **87.5% of vLLM**
-
-B-2 adds 0 (within noise). MoE is **still well below vLLM** => **TRY B-3** (the mmq_y-down warp-remap
-on the grouped `mul_mat_q<NVFP4,M-tile=64>` GEMM, ~27% of the MoE step - the only untested MoE GEMM
-lever; SPEEDUP_HUNT B rank #3, real kernel change, bit-exact, predicted bounded on this BW-bound
-model). The structural MoE residual is the FP4 grouped GEMM at the LPDDR5x BW floor + the bf16
-projections (~10.5%); the act-quant tax (~2%) is NOT where the gap lives and is already optimally
-tiled. Recurrence (~48%) is already past vLLM (0018-0022).
-
-## DECISION
-No patch 0027 (B-2 does not lift; dev tree reverted to pristine 0025). The `LLAMA_MOE_QUANT_BLOCK`
-hook + this measurement confirm 128 is the GB10 optimum, should other hardware ever want re-tuning.
-Hand off to B-3 (patch 0028) as the next MoE GEMM lever.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-# B-3 (mmq_y-down warp-remap of the NVFP4 grouped GEMM) RESULT: BIT-EXACT but FLAT (no patch 0028)
-
-Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`,
-branch `b-work`), independent of the held hybrid 0026. Lever: SPEEDUP_HUNT.md section B rank #3 - the
-0017-deferred structural `mmq_y`-down warp-remap on the grouped FP4-MMA `mul_mat_q<NVFP4>` (the ~26-27%
-MoE-specific GEMM), the only untested MoE GEMM occupancy lever.
-
-## VERDICT
-**Bit-exact (md5 PASS both models + test-backend-ops PASS), but end-to-end FLAT: npl128 +0.3-0.4%
-(consistent direction, kernel-backed) and npl32 +0.1-0.3%, both inside the ~0.4% run-to-run band. The
-warp-remap makes the grouped GEMM kernel ITSELF ~1.3% faster (occupancy DID rise) but the step is
-BW/SSM-bound, so it does NOT lift MoE decode. No patch 0028.** MoE stays ~85% of vLLM @npl128.
-
-## The change that was built+measured (bit-exact, then REVERTED)
-`ggml/src/ggml-cuda/mmq.cuh`. The FP4-MMA path couples the weight-row tile to the warp count via the
-invariant `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; `tile_C::I==16` for the m16n8k64
-block-scaled FP4 MMA). `nwarps` is global `256/warp_size = 8`, pinning `mmq_y=128`; the 0017
-`GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP that assert at 64. B-3 makes nwarps TYPE-AWARE: a new
-`mmq_get_nwarps_device<type>()` (+ 3-arg host overload) returns `mmq_y/16 = 4` for NVFP4-reduced (else
-the stock 8), so `mmq_y=64 -> nwarps=4 -> 128 threads/CTA` (vs 256) -> ~2x resident CTAs. 2 overloads +
-9 `<type>` call-site swaps (kernel, process_tile, write_back_mma, stream_k_fixup, nvfp4 loader, 2 host).
-Built with `-DGGML_CUDA_FP4_MMQ_Y=64`; the compile SUCCEEDS (the static_assert now holds at 4*16=64).
-**Default `GGML_CUDA_FP4_MMQ_Y==128` returns stock nwarps for every type => a default build is
-byte-identical to stock** (the bit-exact opt-out, proven by the md5 below at 128).
-
-### Bit-exactness is EMPIRICAL here (not by-construction)
-The per-output K-reduction order is mmq_y-invariant (each output row owned by one thread), but mmq_y=64
-DOUBLES `nty` (row-tiles), changing the stream-k `kbc` partition => an output tile's K-range can be
-split across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different
-grouping => FP non-associativity COULD perturb the last logit bits and flip a greedy argmax. It did NOT
-for the gate prompt (md5 matched), but B-3 is therefore NOT bit-exact-by-construction - a default-ON
-ship would be a (small) precision risk. This is a second reason not to ship it for a 0% gain.
-
-## GATE (bit-exact) - BOTH MODELS PASS
-greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`:
- dense q36-27b-nvfp4 = 5951a5b4d624ce891e22ab5fca9bc439 == ref
- MoE   q36-35b-a3b-nvfp4 = 07db32c2bcb78d17a43ed18bc22705cd == ref
- test-backend-ops CUDA0: **MUL_MAT 1146/1146 PASS, MUL_MAT_ID 806/806 PASS.**
-
-## MEASUREMENT 1 - end-to-end MoE decode_agg, clean BACK-TO-BACK A/B (build-cuda-base @128 vs build-cuda @64)
-`llama-batched-bench -m q36-35b-a3b-nvfp4 -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`, S_TG t/s,
-3 reps alternating (no concurrent load):
-
-| npl | mmq_y=128 (base) mean | mmq_y=64 (B-3) mean | delta |
-|----:|----------------------:|--------------------:|------:|
-| 32  | 437.6 (437.3-437.7)   | 438.8 (438.4-439.1) | +0.29% |
-| 128 | 750.1 (748.9-751.1)   | 753.1 (753.0-753.4) | +0.40% |
-
-Every B-3 rep edges the base by +0.3-0.4% @npl128 (consistent, kernel-backed), but that delta is the same
-size as the base's own run-to-run spread (748.9-751.1, ~0.3%) - it is at the edge of noise, NOT a meaningful
-lift. It caps the end-to-end win at well under 1%, nowhere near the gap to vLLM (882).
-
-## MEASUREMENT 2 - nsys kernel-level A/B (the meaningful B-3 evidence), clean, no concurrent load
-`GGML_CUDA_DISABLE_GRAPHS=1 nsys ... -npp 4 -ntg 32 -npl 128`, decode-isolated window, `cuda_gpu_kern_sum`:
-
-| kernel (% of window)            | mmq_y=128 total ns | mmq_y=64 total ns | delta  |
-|---------------------------------|-------------------:|------------------:|-------:|
-| gated_delta_net (SSM, ~40%)     | 2,335,951,709      | 2,334,847,390     | 0.0% (untouched, DOMINANT) |
-| **mul_mat_q<NVFP4, M-tile 64>** (MoE GEMM, ~26%) | **1,502,548,958** | **1,483,685,630** | **-1.26% (kernel faster)** |
-| mul_mat_q<NVFP4, M-tile 128> (router, ~3.7%)     | 224,532,704       | 210,885,920       | -6.1%  |
-| quantize_mmq_nvfp4 (act-quant, ~2%)              | 119,118,624       | 118,718,496       | -0.3%  |
-| **mul_mat_q_stream_k_fixup<128>** (~0.6%)        | **26,848,479**    | **38,117,532**    | **+42% (fixup COSTLIER)** |
-
-The warp-remap DOES what it claims at the kernel level: the grouped GEMM is **-1.3%** (more resident
-CTAs hide a sliver of weight-load latency). But (a) it is only ~26% of the step, (b) halving mmq_y
-DOUBLES the row-tiles so the stream-k fixup recombination grows **+42%** (+11.3M ns), eating ~60% of the
-GEMM's 18.9M-ns saving, and (c) the step is dominated by the gated_delta_net SSM (~40%, untouched, and
-already PAST vLLM's BW efficiency per 0018-0022) with the GEMM itself at the LPDDR5x BW floor. Net
-mul_mat region saving ~7.6M ns on a ~5.8B-ns window = ~0.13%; end-to-end +0.3-0.4% (within noise).
-**This is the definitive BW-bound proof: even a real occupancy win on the target kernel does not move
-end-to-end** - the same outcome as patch 0015 (M-tile NEUTRAL), 0017 (MINBLOCKS +8.7% slower), and B-2
-(act-quant FLAT). The MoE grouped GEMM is bandwidth-limited, not occupancy-limited, at the kernel exit.
-
-## DECISION
-No patch 0028 (B-3 does not lift end-to-end; bit-exactness is empirical, not by-construction; the fixup
-penalty + BW floor swamp the 1.3% kernel win). Dev tree reverted to pristine 0025 (no ggml diff),
-build-cuda reconfigured to default (no flag) and rebuilt. The `mmq_get_nwarps_device<type>()` remap is a
-correct, reusable warp-remap should occupancy-bound FP4 hardware ever appear; it is inert on GB10.
-
---
-
-# FINAL ASSESSMENT - the honest bit-exact MoE ceiling, and the recommendation
-
-## The bit-exact MoE GEMM/launch track is now EXHAUSTED
-| MoE lever (bit-exact) | result | MoE decode_agg @npl128 |
-|-----------------------|--------|------------------------|
-| 0025 re-graph (B-1, LANDED) | the ONLY bit-exact MoE win | ~82% -> **~85%** of vLLM |
-| B-2 act-quant retune (no patch) | FLAT (128 already optimal) | +0% |
-| B-3 mmq_y-down warp-remap (no patch) | FLAT (kernel -1.3%, e2e +0.3% noise) | +0% |
-
-**Honest bit-exact MoE ceiling on GB10 = ~85% of vLLM @npl128 (753 / 882.2), ~87.5% @npl32 (439 / 500.8).**
-B-1 (re-graph, in 0025) banked the move from ~82% to ~85%; B-2 and B-3 each add 0. The grouped-GEMM/
-launch track has no remaining bit-exact headroom.
-
-## Is the residual the structural Marlin-MoE gap? YES.
-The remaining ~15% is structural and uncloseable bit-exactly, decomposed from the nsys:
- **Grouped FP4 GEMM (~26%) is at the LPDDR5x BW floor.** B-3 proved an occupancy win there is
-  end-to-end-inert. vLLM ships a purpose-built **Marlin-NvFp4** grouped GEMM (a different, more
-  bandwidth-efficient schedule); llama runs native FP4-MMA W4A4 (a HIGHER arithmetic tier, but the
-  decode shape is BW-bound so the tier does not help). This is THE structural gap and matches
-  FP4_GEMM_SCOPE_B.md's "MoE ceiling ~76% from the GEMM track alone."
- **The SSM recurrence (~40%) is already PAST vLLM** (84.6% vs 82.4% peak BW, 0018-0022) - not a lever.
- **bf16 projections (~10.5%)** - both engines pay similar; not a bit-exact lever.
-
-No bit-exact lever closes the structural grouped-GEMM gap. ~85% is the honest bit-exact MoE plateau.
-
-## RECOMMENDATION: ship the bit-exact ~85% as DEFAULT; expose 0026 bf16-SSM as a documented opt-in for the last ~10% on MoE (NOT default, NOT in the recommended config)
-
-Per the user's decision rule ("pursue B first; if it cannot reach/beat vLLM on MoE, fall back to the
-held hybrid/bf16 opt-in"): **B (bit-exact) cannot reach vLLM on MoE (~85%), so the fallback applies -
-but with a hard caveat the team must carry.**
-
-1. **DEFAULT = the bit-exact plateau (0025 with the re-graph), MoE ~85% of vLLM.** This is the honest,
-   precision-safe ship: the recurrence already BEATS vLLM's BW efficiency, the GEMM is the same FP4
-   arithmetic class, and the output is byte-identical to the f32 reference. Do not claim MoE *parity*
-   bit-exactly - claim ~85% with a precision profile at-or-above vLLM.
-
-2. **FALLBACK (opt-in only) = 0026 hybrid bf16-SSM.** It is the ONLY remaining MoE lever (it speeds the
-   ~40% recurrence, the part B does not touch): measured **+11.5% MoE decode** (1110.7 -> 1238.1 t/s in
-   the 0026 harness) -> would lift MoE ~85% -> **~95% of vLLM**. BUT: (a) it is **non-bit-exact**; (b) it
-   **FAILS the MoE KL ship-gate by a wide margin** (MeanKLD ~0.045 / Same-top-p ~91% vs the 1e-3 / 99.5%
-   bar - the gated-DeltaNet state is hypersensitive to bf16; A_HYBRID_SSM_RESULTS.md: "MoE has NO low-KL
-   regime ... Do NOT put a hybrid T in the gallery/recommended config"); and (c) even then it reaches
-   **~95%, not a clean beat** of vLLM, while conceding precision vLLM keeps (all-f32 SSM state).
-
-   => Ship 0026 default-OFF (`ssm_hybrid_tau_thresh = 0` / no `--ssm-bf16-tau`); expose the bf16-SSM as
-   an EXPLICIT opt-in flag for callers who knowingly accept a real MoE precision regression for ~+11.5%
-   decode (~95% of vLLM). Keep it OUT of the gallery/recommended MoE config.
-
-**Bottom line for the parent:** bit-exact MoE on GB10 plateaus at **~85% of vLLM** and the residual is
-the structural Marlin-NvFp4 grouped-GEMM gap that NO bit-exact lever closes (B-1 banked the re-graph;
-B-2 and B-3 are 0). Bit-exact does NOT reach/beat vLLM on MoE. The only lever that closes more (to ~95%)
-is the held 0026 bf16-SSM, which is **non-bit-exact AND fails the MoE KL gate** - so it ships **opt-in,
-default-off, and outside the recommended config**. Recommend shipping the honest ~85%
-bit-exact default and documenting the opt-in for users who accept the precision tradeoff. Do not market
-MoE parity; the bit-exact default is ~85% with a precision profile at-or-above vLLM, which is the
-defensible claim.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
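
The B-2 bit-exact invariant in the results above (thread -> column via `gy = blockDim.x*blockIdx.y + threadIdx.x`, no cross-thread communication, disjoint output slots) can be checked with a small Python simulation. This is a sketch under assumptions: `SUB = 16` stands in for `QK_NVFP4_SUB`, the `gy >= n_sub` bounds guard is assumed, and `quantize_sub_block` is a hypothetical placeholder for the real per-thread quant body, which is not reproduced here.

```python
import math

SUB = 16  # assumed sub-block width (stand-in for QK_NVFP4_SUB)

def quantize_sub_block(values):
    """Placeholder per-thread body: depends only on this thread's own values."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return (0.0, tuple(0 for _ in values))
    scale = amax / 6.0
    return (scale, tuple(round(v / scale) for v in values))

def simulate_launch(row, block_dim_x):
    """Emulate one launch: every thread writes only its own output slot."""
    n_sub = len(row) // SUB
    grid_y = math.ceil(n_sub / block_dim_x)   # recomputed from the same blockDim.x
    out = [None] * n_sub
    for block_idx_y in range(grid_y):
        for thread_idx_x in range(block_dim_x):
            gy = block_dim_x * block_idx_y + thread_idx_x   # global linear index
            if gy >= n_sub:                                  # assumed bounds guard
                continue
            i0_base = gy * SUB
            out[gy] = quantize_sub_block(row[i0_base:i0_base + SUB])
    return out

# Deterministic toy row: 300 sub-blocks, not a multiple of any block size used.
row = [((i * 37) % 101 - 50) / 7.0 for i in range(SUB * 300)]
results = {b: simulate_launch(row, b) for b in (64, 128, 256)}
identical = results[64] == results[128] == results[256]
print(identical)
```

Because `gy` enumerates the same set of sub-blocks for any block size and each slot is a function of its own inputs only, the outputs cannot depend on `blockDim.x`; only the launch shape, and therefore the timing, changes.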
--- a/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md
+++ b/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md
@@ -1,499 +0,0 @@
-# Durable scope: token-granular continuous-batch scheduler for llama-server on GB10
-
-Build-ready plan. **Not implemented in this workflow** (serving-loop rewrite). This
-document scopes the durable path to give llama-server's `update_slots()` a vLLM-v1-style
-token-granular continuous-batch scheduler, and records the single honest finding that
-re-shapes what the change can and cannot buy.
-
-Hardware: NVIDIA GB10 / DGX Spark (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
-LPDDR5x ~273 GB/s. Models: dense Qwen3.6-27B NVFP4 (`~/bench/q36-27b-nvfp4.gguf`),
-MoE Qwen3.6-35B-A3B NVFP4 (`~/bench/q36-35b-a3b-nvfp4.gguf`). Dev tree `~/llama-paged-dev`
-(branch `paged`, HEAD `151343b`, patch 0015), `build-cuda` sm_121, `LLAMA_KV_PAGED=1`.
-Scheduler code: `tools/server/server-context.cpp::update_slots()` (LocalAI override that
-`#include`s it: `backend/cpp/llama-cpp/grpc-server.cpp`).
-
-## TL;DR (the honest reframe)
-
-Three findings, read directly from the source at HEAD `151343b` and from the committed
-NVFP4 re-run (`QWEN36_NVFP4_BENCH.md`), collapse the apparent size of this work and reset
-what it is allowed to claim:
-
-1. **The unified mixed batch already exists.** `update_slots()` already builds ONE
-   `llama_batch` per step = {every ready decode token} **+** {a bounded chunk of prefill
-   tokens}, in a fixed two-phase order: Phase 1 (lines 2604-2719) appends every
-   `SLOT_STATE_GENERATING` slot's sampled token **unconditionally** (no budget gate), then
-   Phase 2 (lines 2753-3330) fills the remaining batch capacity with prompt tokens. Decode
-   is therefore **already claimed first and never dropped or capped** - the exact property
-   vLLM's "RUNNING-before-WAITING" pass works to guarantee is **free** here by construction.
-
-2. **The chunked-prefill slot state already exists and already persists across steps.** A
-   slot in `SLOT_STATE_PROCESSING_PROMPT` with `slot.prompt.n_tokens() < slot.task->n_tokens()`
-   is a partial prefill; it stays in that state and resumes next step until its prompt is
-   fully ingested, at which point it flips to `SLOT_STATE_DONE_PROMPT` -> `GENERATING`
-   (line 3252, then 3502). Multiple slots can be `PROCESSING_PROMPT` and `GENERATING`
-   simultaneously; there is **no global "one prefill at a time" gate**. So the mission's
-   "allow a slot to be mid-prefill while others decode in the same step" is **not a state
-   machine to build - it is already the behaviour.** This is the single biggest de-risking
-   fact in this document.
-
-3. **What is genuinely missing is the budget POLICY, and it is small.** Patch 0013
-   (`LLAMA_PREFILL_BUDGET`) is a single **static** per-step prefill cap, consumed greedily by
-   slots in iteration order. It is not decode-load-aware (does not subtract the live decode
-   count `D`), not adaptive (one constant across npl 8..128), and not fair (the first
-   `PROCESSING_PROMPT` slot can eat the whole budget). The durable delta is to convert that
-   static cap into vLLM's **dynamic, decode-first, per-slot-fair token budget**: one total
-   per-step budget `T`, decode claims its `D` tokens first, prefill gets the **leftover**
-   `T - D` distributed across waiting prompts with a per-slot cap. That is essentially the only
-   behavioural change. **No new slot states, no batch-formation rewrite.**
-
-### The honest ceiling (this is load-bearing for how the work is scoped and sold)
-
-The committed re-run and a dedicated profiling pass (`QWEN36_NVFP4_BENCH.md`, plus
-`~/bench/stag_128.json`) establish that **the residual ~2.4x high-concurrency decode gap is a
-decode-KERNEL batch-scaling ceiling, not a scheduler defect**:
-
- At npl8 the kernels are **at parity** (dense 99%, MoE 84% of vLLM decode).
- A clean staggered full-batch-128 run, with **all 128 slots cleanly decoding and zero
-  prefill starvation**, still tops out at **decode_agg 157.4 tok/s** (dense) - the same
-  ~157-161 ceiling that four independent measurements converge on. vLLM does **390.7** at the
-  same effective batch. With a *perfect* scheduler the kernel still gives ~157. **The
-  scheduler cannot lift this.**
- Patch 0013 budget-256 **already reaches ~161** (the ceiling) at npl128. So a token-granular
-  scheduler buys **little additional steady-state decode_agg** over 0013 on the all-at-once
-  workload.
-
-Therefore this scheduler's deliverable is **NOT "match vLLM's 391/811 decode."** It is:
-
- **Close the 12x TTFT gap** (dense 305 s @ 0013 / 491 s stock -> vLLM's ~25 s, and ~2 s on
-  staggered arrival) - the genuine, large win.
- **Robustly HOLD the decode ceiling** (~161 dense / ~333 MoE @npl128) **without
-  per-workload budget tuning** - 0013 needs a hand-picked constant (256 for dense, costs MoE
-  TTFT, net-negative at low npl); the dynamic `T - D` budget is self-tuning across the whole
-  npl range and across dense vs MoE.
- **Burst-robustness**: bounded TTFT for *all* concurrently-arriving prompts (kill the
-  burst-TTFT spread), and no admission collapse under sustained load.
-
-Closing the residual 2.4x decode-throughput gap is a **separate, named lever**: the
-paged-attention **decode-kernel** batch-scaling work (patches 0009-0011 territory) and/or
-CUDA-graphed decode. It is called out explicitly in P3 and is **out of this scope's
-scheduler mandate**. We must measure and sell this work on **TTFT + burst-robustness +
-self-tuning hold of the ceiling**, never on a decode_agg number the kernel forbids.
-
-## The gap, precisely localized (recap of the committed bench)
-
-At matched NVFP4 on one GB10 box (`QWEN36_NVFP4_BENCH.md`), llama (patch 0015) vs vLLM 0.23.0,
-decode_agg tok/s | TTFT mean, npl swept 8/32/64/128:
-
-| npl | dense llama (0013 b256) | dense vLLM | MoE llama (0013 b256) | MoE vLLM |
-|----:|------------------------:|-----------:|----------------------:|---------:|
-| 8   | 63.5  / 4.3 s   | 64.3  / 2.6 s | 169.3 / 1.7 s  | 202.0 / 0.8 s |
-| 32  | 105.7 / 23.1 s  | 189.8 / 7.5 s | 239.0 / 9.0 s  | 462.0 / 2.3 s |
-| 64  | 132.0 / 109 s   | 284.2 / 13 s  | 277.0 / 16.2 s | 624.5 / 4.1 s |
-| 128 | **161.2 / 305 s** | 390.7 / 24.8 s | **333.5 / 98 s** | 811.1 / 8.0 s |
-
-Both models converge to the **same ~41% of vLLM decode at npl128** after 0013. That
-convergence is the signal: once prefill starvation is removed, a dense model and a
-12x-cheaper-prefill MoE land on the **identical** ceiling, so the residual is **not prefill**
-and **not the low-batch kernel** (which is at parity at npl8). What remains is the **quality
-of the per-step batching decision** (TTFT/robustness) plus the **kernel decode ceiling** (the
-throughput residual). This scope addresses the first; it names the second as the separate lever.
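-
-The percentages quoted in this document follow directly from the table. As a quick arithmetic
-check, the sketch below recomputes llama's share of vLLM's decode_agg at npl 8 and npl 128. The
-struct and function names are illustrative only; they are not part of the bench scripts.
-
-```cpp
-#include <cstdio>
-
-// One table row: aggregate decode tok/s for llama (0013, budget 256) and vLLM.
-struct DecodeRow {
-    int    npl;
-    double llama_tok_s;
-    double vllm_tok_s;
-};
-
-// Fraction of vLLM's aggregate decode throughput that llama reaches.
-inline double fraction_of_vllm(const DecodeRow & r) {
-    return r.llama_tok_s / r.vllm_tok_s;
-}
-
-// Values copied from the table above (npl 8 and npl 128 rows only).
-static const DecodeRow dense_rows[] = { {8, 63.5, 64.3},   {128, 161.2, 390.7} };
-static const DecodeRow moe_rows[]   = { {8, 169.3, 202.0}, {128, 333.5, 811.1} };
-
-// Print each ratio as a percentage.
-inline void print_fractions() {
-    for (const DecodeRow & r : dense_rows) {
-        std::printf("dense npl%-3d %.1f%%\n", r.npl, 100.0 * fraction_of_vllm(r));
-    }
-    for (const DecodeRow & r : moe_rows) {
-        std::printf("MoE   npl%-3d %.1f%%\n", r.npl, 100.0 * fraction_of_vllm(r));
-    }
-}
-```
-
-The ratios come out at roughly 99% (dense) and 84% (MoE) at npl8, and roughly 41% for both
-models at npl128, which is the convergence the paragraph above relies on.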
-
-## What already exists (reuse, do NOT rebuild)
-
-All line numbers verified at `tools/server/server-context.cpp` HEAD `151343b`.
-
- **[A] decode-first co-batch** - Phase 1, lines 2604-2719. Iterates `slots`; every
-  `SLOT_STATE_GENERATING` slot (gated only by `can_batch_with`, line 2611) is pushed to
-  `generating[]`; line 2715-2719 `for (slot : generating) slot.update_batch(batch)` appends
-  its sampled token (+ draft tokens) via `common_batch_add`. After this loop,
-  `batch.n_tokens == D` (the decode-token count). **No budget gate** - decode always goes in.
- **[B] chunked-prefill state per slot** - the pair `slot.prompt.n_tokens()` (=
-  `num_computed_tokens`) vs `slot.task->n_tokens()` (= `num_tokens`). A `PROCESSING_PROMPT`
-  slot with `prompt.n_tokens() < task->n_tokens()` resumes next step (Phase 2 re-enters it).
-  Transition to `DONE_PROMPT` at line 3252 when the prompt is exhausted; to `GENERATING` at
-  line 3502. **This is exactly vLLM's "leave the request in `running`, advance
-  `num_computed_tokens` next step" - already implemented.**
- **[C] single shared batch + compute chunking** - one `llama_batch` holds decode+prefill;
-  the compute loop (lines ~3366-3378) `for (i=0; i<batch.n_tokens; i+=n_tokens){ n_tokens =
-  min(n_batch, batch.n_tokens-i); llama_decode(batch_view); }` runs it as one `llama_decode`
-  when `batch.n_tokens <= n_batch`; `n_ubatch` (512) splitting happens inside `llama_decode`.
- **[D] patch 0013 static prefill budget** - the thing to supersede. Read once at lines
-  2737-2747 (`n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))`, a CONSTANT for
-  the run); enforced as an extra `while` predicate at line 3188 (`n_prompt_budgeted <
-  n_prefill_budget`), counter at 3214, outer break at 3326. `0` = disabled = byte-identical
-  stock.
- **[E] productization seam** - `backend/cpp/llama-cpp/grpc-server.cpp` lines 781-791 parse
-  the model option `max_prefill_tokens` / `mpt` / `prefill_budget` and `setenv`
-  `LLAMA_PREFILL_BUDGET` before context init (same pattern as `kv_paged`). New knobs hang off
-  this seam identically.
- **[F] paged KV (patches 0001-0011)** - on-demand block allocation keyed by sequence
-  position. Batch formation only changes **which** tokens are in a step; paged alloc is
-  driven by the per-slot sequence positions, which are unchanged. Orthogonal (see Correctness).
-
-## vLLM v1 reference algorithm (the target, for fidelity)
-
-From `vllm/v1/core/sched/scheduler.py::schedule()` (0.23.0, on the box). The unifying idea:
-there is no prefill phase vs decode phase. Every request advances `num_computed_tokens`
-toward `num_tokens` by up to N this step; for a decoder N=1, for a prefiller N=remaining
-prompt. One per-step `token_budget = max_num_batched_tokens` bounds the TOTAL (decode +
-prefill). Pass 1 visits `running` first (decoders cost 1 each -> all decode claimed before
-any prefill is sized); Pass 2 admits `waiting` (new prompts) only with leftover budget, each
-chunked by `min(remaining_prompt, long_prefill_token_threshold, leftover_budget)`. Caps:
-`max_num_seqs` (concurrent sequences), `long_prefill_token_threshold` (~4% of max_model_len,
-per-request prompt-chunk cap so one giant prompt cannot monopolize a step). Net: decode batch
-maximal every step (-> the GEMM-batching throughput vLLM gets), prefill always makes bounded
-progress (-> low, flat TTFT), one `model.forward()` per step.
-
-The mapping to llama is clean because [A]+[B] already give us "running visited first" and
-"prefiller resumes next step." We are missing only: **one total budget `T`, leftover `T - D`
-sizing, and the per-request chunk cap with fair distribution.**
-
-## The unified per-step batch-formation algorithm (the design)
-
-New knobs (all default to current behaviour; env set before context init like `LLAMA_KV_PAGED`):
-
- `T` = `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step token
-  budget (decode + prefill), the analogue of `max_num_batched_tokens`. Default `n_batch`
-  (2048). Clamped `T = min(T, n_batch)` so the existing single-`llama_decode` chunking is
-  unchanged.
- `PREFILL_CAP` = `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot max prompt tokens per
-  step, the `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04 * n_ctx))`,
-  floored at `n_ubatch` (512) so a single prompt still makes a full ubatch of progress.
- Back-compat: if only the legacy `LLAMA_PREFILL_BUDGET` is set (new knobs unset), behave
-  exactly as 0013 (static cap) - 0013 is the degenerate `T = n_batch`, no-leftover case.
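-
-The defaulting rules for the two new knobs can be sketched as a small pure function. This is an
-illustration under stated assumptions, not the patch: `SchedKnobs` and `resolve_knobs` are
-hypothetical names, a value `<= 0` stands for "knob unset", the `n_ubatch` floor is applied only
-to the default cap, and the legacy `LLAMA_PREFILL_BUDGET` path is left out.
-
-```cpp
-#include <algorithm>
-#include <cmath>
-#include <cstdint>
-
-struct SchedKnobs {
-    int32_t T;            // total per-step token budget (decode + prefill)
-    int32_t prefill_cap;  // max prompt tokens one slot may add per step
-};
-
-// requested_T / requested_cap <= 0 mean "knob unset" (assumption of this sketch).
-inline SchedKnobs resolve_knobs(int32_t requested_T, int32_t requested_cap,
-                                int32_t n_batch, int32_t n_ubatch, int32_t n_ctx) {
-    SchedKnobs k;
-    // T defaults to n_batch and is always clamped to n_batch.
-    k.T = requested_T > 0 ? std::min(requested_T, n_batch) : n_batch;
-    if (requested_cap > 0) {
-        k.prefill_cap = requested_cap;
-    } else {
-        // Default: min(T, ceil(0.04 * n_ctx)), floored at n_ubatch.
-        const int32_t four_pct = static_cast<int32_t>(std::ceil(0.04 * n_ctx));
-        k.prefill_cap = std::max(n_ubatch, std::min(k.T, four_pct));
-    }
-    return k;
-}
-```
-
-With `n_batch = 2048`, `n_ubatch = 512` and both knobs unset, a 32768-token context yields a cap
-of 1311 tokens per slot, while an 8192-token context hits the 512-token floor.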
-
-Pseudocode, mapping to real variables and seams (the `>>` lines are the change vs today):
-
-```
-common_batch_clear(batch);                                  // line 2594
-
-// PASS 1 - DECODE FIRST (unchanged: lines 2604-2719)
-for (slot : slots) if (slot.state == GENERATING && can_batch_with) generating.push(slot);
-... speculative draft ...
-for (slot : generating) slot.update_batch(batch);           // appends decode (+draft) tokens
-
->> D = batch.n_tokens;                                       // NEW seam: decode load is now final (after 2719)
->> T = min(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_batch);      // knob unset -> falls back to n_batch
->> prefill_budget_step  = max(0, T - D);                     // DYNAMIC leftover, auto-shrinks with D
->> prefill_cap_per_slot = PREFILL_CAP;                       // long_prefill_token_threshold analogue
->> n_prompt_budgeted    = 0;                                 // total prompt tokens added this step (subsumes 0013)
-
-// PASS 2 - PREFILL FILLS THE LEFTOVER (lines 2753-3330, budget made dynamic + per-slot fair)
-if (cont_batching || batch.n_tokens == 0) {
->>  for (k = 0; k < n_slots; ++k) {                          // round-robin start offset (fairness, see P2)
->>      slot = slots[(rr_start + k) % n_slots];
-        if (!slot.is_processing() || !can_batch_with) continue;
-        if (slot.state == STARTED) slot.state = PROCESSING_PROMPT;     // line 2782 (unchanged)
->>      slot_prompt_added = 0;                               // NEW: per-slot chunk counter (reset each slot)
-        // inner prompt-fill (lines 3187-3239), guard now triple-bounded:
-        while (slot.prompt.n_tokens() < slot.task->n_tokens()
->>             && batch.n_tokens   < T                       // was: < n_batch
->>             && n_prompt_budgeted < prefill_budget_step    // was: 0013 static n_prefill_budget
->>             && slot_prompt_added < prefill_cap_per_slot) {// NEW: per-slot cap -> fair distribution
-            common_batch_add(batch, cur_tok, pos_next, {slot.id}, need_embd);
-            slot.prompt.tokens.push_back(cur_tok);
-            slot.n_prompt_tokens_processed++;
-            n_prompt_budgeted++; slot_prompt_added++;
-            ... checkpoint-boundary breaks (unchanged) ...
-        }
-        if (slot.prompt.n_tokens() == slot.task->n_tokens()) slot.state = DONE_PROMPT;  // line 3252
-        ... checkpoint creation (unchanged) ...
->>      if (batch.n_tokens >= T) break;                      // was: >= n_batch (line 3320)
->>      if (n_prompt_budgeted >= prefill_budget_step) break; // was: 0013 break (line 3326)
-    }
-}
-
-for (i=0; i<batch.n_tokens; i+=n) { n=min(n_batch,batch.n_tokens-i); llama_decode(view); }  // unchanged
-```
-
-The whole change is: (a) compute `prefill_budget_step = T - D` at the new seam after line
-2719 instead of reading a static env constant at 2737; (b) bound the inner/outer loops by `T`
-and the dynamic budget instead of `n_batch` and the static budget; (c) add `slot_prompt_added`
-with `prefill_cap_per_slot` for per-slot fairness; (d) a round-robin start offset so the same
-early slots do not always win the leftover.
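-
-The Pass 2 arithmetic can be modelled in isolation. The sketch below is a standalone model, not
-the `update_slots()` code: `plan_prefill_step` and its parameters are hypothetical names, it
-ignores the checkpoint-boundary breaks and `can_batch_with`, and it collapses the three loop
-guards into a single `min`, which is equivalent because `batch.n_tokens` equals `D` plus the
-prompt tokens added so far.
-
-```cpp
-#include <algorithm>
-#include <cstddef>
-#include <cstdint>
-#include <vector>
-
-// remaining[i] = prompt tokens slot i still has to ingest (0 if it is not prefilling).
-// Returns the number of prompt tokens each slot adds to the batch in this step.
-inline std::vector<int32_t> plan_prefill_step(const std::vector<int32_t> & remaining,
-                                              int32_t D, int32_t T,
-                                              int32_t prefill_cap_per_slot,
-                                              size_t rr_start) {
-    const size_t n_slots = remaining.size();
-    std::vector<int32_t> added(n_slots, 0);
-
-    // Dynamic leftover: decode has already claimed D tokens of the total budget T.
-    int32_t budget = std::max<int32_t>(0, T - D);
-
-    // Visit slots starting at the round-robin offset so low-index slots do not always win.
-    for (size_t k = 0; k < n_slots && budget > 0; ++k) {
-        const size_t i = (rr_start + k) % n_slots;
-        // Bounded by the slot's remaining prompt, the per-slot cap, and the step leftover.
-        const int32_t take = std::min({remaining[i], prefill_cap_per_slot, budget});
-        added[i] = take;
-        budget  -= take;
-    }
-    return added;
-}
-```
-
-For example, with `T = 2048`, `D = 48`, a cap of 512 and three prefilling slots that each hold
-1000 unread prompt tokens, every prefilling slot receives 512 tokens. At `D = 1900` the
-148-token leftover goes entirely to whichever prefilling slot `rr_start` reaches first.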
-
-**Why this holds the decode ceiling without tuning.** `T` bounds total tokens per step ->
-bounds step compute time -> decode steps fire at a steady high rate (high decode-steps/sec).
-As decode load `D` rises, `prefill_budget_step = T - D` auto-shrinks, so prefill never inflates
-the step beyond `T` even at npl128. This is the mechanism by which 0013's hand-tuned 256
-reaches 161; here it is reached **automatically across the npl range** because the budget is
-`T - D`, not a constant. **Why this closes TTFT.** Prefill gets a leftover that is never
-negative (`prefill_budget_step >= 0`) and, because `T` is sized above the decode load, stays
-strictly positive until the box is fully decode-saturated. That leftover is distributed across
-waiting prompts by `prefill_cap_per_slot`, so every prompt makes bounded progress every step
-instead of waiting for a dedicated prefill burst.
-
-## Slot state machine changes (minimal - this is the headline de-risk)
-
-**No new states. No state-transition rewrite.** The existing 6-state machine
-(`IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT / GENERATING`, lines 67-72)
-already encodes everything:
-
- "mid-prefill while others decode" = a `PROCESSING_PROMPT` slot coexisting with `GENERATING`
-  slots in the same step. **Already happens** (Phase 1 and Phase 2 populate one batch).
- "chunked-prefill state per slot" = `(state == PROCESSING_PROMPT) && (prompt.n_tokens() <
-  task->n_tokens())`. **Already persisted** across `update_slots()` calls; Phase 2 re-enters
-  the slot and resumes from `prompt.n_tokens()`.
-
-The only **additions** are per-step scheduler scratch, not slot lifecycle state:
-
-1. `slot_prompt_added` - a per-slot, per-step counter (local to the Phase-2 loop body), for
-   the per-slot chunk cap. Not stored on the slot across steps.
-2. A `rr_start` round-robin offset (one `size_t` on the server, advanced each step) so the
-   leftover budget is distributed fairly across `PROCESSING_PROMPT` slots rather than always
-   draining the lowest-index slot first (this is what kills the burst-TTFT *spread* - without
-   it, slot 0's prompt finishes first every time and the last slots starve).
-3. Optional, P2: a per-step admission cap `K` on how many `STARTED -> PROCESSING_PROMPT`
-   transitions begin in one step. This falls out of the budget arithmetic already (a bounded
-   `prefill_budget_step` with a per-slot floor admits only `~budget/floor` prompts/step), so it
-   may need no explicit code; if made explicit it is the `max_num_seqs`-style "don't admit a
-   new prefill if the step is full" guard, mapped onto the pre-allocated `n_parallel` slots.
-
-That is the entire state-machine footprint: two pieces of per-step scratch and an optional cap.
-The mission's feared "slot-state rewrite" does not materialize.
-
-## How it supersedes / subsumes patch 0013
-
-| property | 0013 (static cap) | this scheduler (dynamic `T - D`) |
-|----------|-------------------|----------------------------------|
-| per-step prefill bound | constant `n_prefill_budget` | `T - D`, shrinks as decode load rises |
-| decode-load aware | no (ignores `D`) | yes (leftover after decode) |
-| works across npl 8..128 with one config | no (256 best @128, net-negative @8) | yes (self-tuning) |
-| fair across multiple waiting prompts | no (greedy, slot 0 wins) | yes (`prefill_cap_per_slot` + round-robin) |
-| TTFT on bursty arrival | raises it (defers first tokens) | bounded for all prompts |
-| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - **kept** |
-
-0013 is the **degenerate case** `T = n_batch` with `prefill_budget_step` pinned to a constant
-and no per-slot cap. The patch keeps `LLAMA_PREFILL_BUDGET` working for back-compat (when the
-new knobs are unset). When `LLAMA_MAX_BATCH_TOKENS` is set, the static path is replaced by the
-dynamic one. **Default (all knobs unset) = byte-identical stock**, exactly like 0013.
-
-## Correctness
-
- **KV cache during chunked prefill** - unchanged from today. A `PROCESSING_PROMPT` slot already
-  advances `slot.prompt.tokens` / `pos_next()` chunk by chunk across steps; we only change the
-  chunk SIZE per step, not how positions or sequence ids are assigned. `common_batch_add`
-  receives the same `(tok, pos, {slot.id})` tuples in the same order. No new KV state.
- **Determinism** - greedy (temp 0) output can differ from a single-`n_batch`-chunk run only by
-  the **intrinsic flash-attn chunk-size FP grouping** that 0013 already documented and bounded:
-  pure stock `-b256` diverges from `-b2048` the same way with this patch inactive; output stays
-  coherent and answers correctly. The op-level math per token is position-determined and
-  unchanged; only the FA reduction grouping over a step's token mix shifts. The deterministic
-  oracle is the CPU backend / the op test (bit-exact); the GB10 CUDA greedy-decode band applies
-  to end-to-end only, never to the op test.
- **Paged KV (patches 0001-0011)** - **orthogonal**. Paged on-demand block allocation is keyed
-  by sequence position and slot/stream, which this change does not touch; it changes only which
-  tokens are in a given `llama_decode`. The in-kernel paged decode read (0009-0011) operates
-  per-token via the block tables regardless of what prefill tokens are co-batched. Required gate:
-  run the full P0-P2 suite with `LLAMA_KV_PAGED=1` **and** `=0` and confirm **identical
-  scheduling decisions** (same per-step token counts, same admission order) - paged must be a
-  no-op on the scheduler.
- **`can_batch_with` constraint** (line 302) - a batch admits only slots with the same
-  `task->type` and equal LoRA. Homogeneous-completion serving (the benchmark and the dominant
-  LocalAI case) satisfies it, so the mixed decode+prefill batch forms freely. Mixed task types /
-  per-request LoRA fall back to separate batches - a pre-existing bound, not a regression; note
-  it, do not try to lift it here.
- **Checkpoint interaction (a real, orthogonal serving defect to account for)** - each slot that
-  reaches `DONE_PROMPT` may call `create_checkpoint` (line 2147), ~149 MiB per checkpoint on the
-  dense 27B, gated by `n_ctx_checkpoints > 0` (line 3133). Profiling found that under sustained
-  heavy load the checkpoint subsystem **thrashes**: admission collapsed to one slot every ~13 s,
-  zero decoding for 290 s, while `/slots` itself serialized behind a 13 s `update_slots` step.
-  This is **independent** of the decode/prefill mix but it **masks** the scheduler's win if left
-  on. **P0 must isolate it** (run with `n_ctx_checkpoints=0`), and **P2's admission decision
-  should be checkpoint-cost-aware** on the 128 GB unified box (do not admit a fresh prefill whose
-  checkpoint would thrash the pool). Treat as a named co-defect, not part of the core batching
-  change.
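-
-The paged-KV gate in the list above (identical scheduling decisions with `LLAMA_KV_PAGED=1` and
-`=0`) reduces to comparing two per-step traces. A minimal sketch, assuming a harness that logs
-one record per `update_slots()` step; `StepRecord` and `first_divergence` are hypothetical and
-not part of the server or the bench scripts:
-
-```cpp
-#include <algorithm>
-#include <cstddef>
-#include <cstdint>
-#include <vector>
-
-// Hypothetical per-step record a P0 harness could log.
-struct StepRecord {
-    int32_t n_decode;               // decode tokens in the step (D)
-    int32_t n_prefill;              // prompt tokens in the step
-    std::vector<int32_t> admitted;  // slot ids that began prefill this step, in order
-};
-
-inline bool same_step(const StepRecord & a, const StepRecord & b) {
-    return a.n_decode == b.n_decode && a.n_prefill == b.n_prefill && a.admitted == b.admitted;
-}
-
-// Index of the first step where the two traces differ, or -1 if they are identical.
-// A length mismatch counts as a divergence at the end of the shorter trace.
-inline long first_divergence(const std::vector<StepRecord> & paged_on,
-                             const std::vector<StepRecord> & paged_off) {
-    const size_t n = std::min(paged_on.size(), paged_off.size());
-    for (size_t i = 0; i < n; ++i) {
-        if (!same_step(paged_on[i], paged_off[i])) {
-            return static_cast<long>(i);
-        }
-    }
-    return paged_on.size() == paged_off.size() ? -1 : static_cast<long>(n);
-}
-```
-
-The gate passes only when `first_divergence` returns -1 for every benchmark trace; any other
-value points at the first step whose token counts or admission order changed.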
-
-## Phased plan P0 -> P3 (work, payoff, files, risk)
-
-| Phase | Work | Files | Expected payoff (dense / MoE @npl128 unless noted) | Risk |
-|-------|------|-----------------------------------------------------|-------|------|
-| **P0** baseline + metrics harness | Per-step effective-decode-batch poller (`/slots`), TTFT percentiles (p50/p90/p99/max), `decode_agg` over the fully-overlapped window, decode-ITL (worst freeze / median), **step-time histogram**, admission rate (slots/s reaching GENERATING), checkpoint-event log. Lock the staggered-arrival ceiling (**157.4** dense, all-128 clean) and the all-at-once burst pathology as the two reference traces. Isolate checkpoints (`n_ctx_checkpoints=0`). | dev-tree only: `~/bench/` (reuse `stag.py`, `slot_poll.py`, `h2h_cli.py`, `h2h_moe_sweep.sh`; `stag_128.json`, `h2h_real128b.json`) | **None** (gate). Locks correctness + the 157/333 ceiling so any regression is caught. | Low |
-| **P1** unified mixed-batch formation | Replace the static budget read (2737-2747) with the **dynamic `T - D`** computed at the new seam after line 2719; bound the inner/outer Phase-2 loops by `T` (3188, 3320) and `prefill_budget_step` (3326) instead of `n_batch` and the static cap. No per-slot cap, no round-robin yet (that is P2). | `tools/server/server-context.cpp` (seam @2719, knob read, 3188, 3320, 3326); mirror to `0016-paged-continuous-batch-scheduler.patch` | **TTFT**: removes the burst penalty 0013 inflicts - staggered TTFT ~2 s, burst TTFT collapses toward vLLM's ~25 s / 8 s. **Decode**: holds the ceiling **(~161 / ~333)** *without per-workload tuning* (0013 needed 256 hand-picked). No new throughput beyond the ceiling - by design. | Low-Med (loop-bound edits in a hot path; default-off gate makes it byte-identical stock) |
-| **P2** scheduling policy / fairness | Add `slot_prompt_added` + `prefill_cap_per_slot` (the `long_prefill_token_threshold` analogue) and the **round-robin start offset**; optional explicit per-step admission cap `K` + checkpoint-cost-aware admission. Tune `T`, `PREFILL_CAP` on GB10 (dense vs MoE, npl 8/32/64/128). | `server-context.cpp` (Phase-2 loop body @2753-3330, server-level `rr_start`); `grpc-server.cpp` (options `max_batch_tokens`/`mbt`, `prefill_cap` @781-791) | **TTFT spread**: bounds first-token latency for **all** concurrently-arriving prompts (kills the burst-TTFT spread, e.g. dense max 305 s -> single-digit-s on staggered, bounded on burst). **Robustness**: no admission collapse under sustained load; decode batch stays maximal so the *time-averaged* decode_agg on real (non-burst) traffic rises toward the staggered 157/333 because slots reach GENERATING fast. | Med (fairness + admission logic; e2e coherence + A/B vs 0013 required) |
-| **P3** residual decode throughput | **Honest boundary: this is the decode-KERNEL lever, NOT the scheduler.** The scheduler has delivered TTFT + robustness + ceiling-hold. Closing the residual 2.4x (161 -> 391 dense, 333 -> 811 MoE) requires paged-attention **decode-kernel** batch-scaling (patches 0009-0011 territory) and/or **CUDA-graphed decode** (the now-uniform decode-only step is graph-capturable). Scope/track separately. | (separate scope) `ggml/src/ggml-cuda/` decode-read kernels; optional CUDA-graph capture seam in `update_slots` | This is **where 391/811 would come from**; it is **out of this scope's mandate** and must not be charged against the scheduler. The scheduler makes the decode step uniform (a precondition that *helps* a future graph capture). | High (kernel work; the GB10 occupancy wall, see below) |
-
-**Per-phase payoff vs the mission targets (TTFT 25 s / 8 s, decode 391 / 811 @npl128):**
-
- **TTFT 25 s / 8 s** - reached by **P1 + P2** (the 12x gap is the scheduler's to close; on
-  staggered arrival it goes below the vLLM burst figure to ~2 s).
- **Decode 391 / 811** - **NOT a P1/P2 deliverable.** P1/P2 hold **161 / 333** (= ~41% of vLLM,
-  the kernel ceiling) robustly and tuning-free. The remaining ~2.4x is **P3 kernel**, a separate
-  lever. Pre-registering this split is the point: the scheduler is judged on TTFT + holding the
-  ceiling, the kernel on the throughput residual.
-
-## GB10 considerations
-
- **Bandwidth floor ~273 GB/s** is the *cause* of the decode ceiling (NVFP4 weight-read +
-  paged-KV gather per step). The scheduler cannot lift a bandwidth/kernel floor - it can only
-  keep the batch *at* the ceiling. Size `T ~= n_batch` (2048) so the compute step stays a single
-  `llama_decode`; `n_ubatch` (512) governs the internal split.
- **`T` is the ITL/TTFT trade knob** (vLLM's `max_num_batched_tokens`): larger `T` = more
-  prefill/step = faster TTFT but bigger per-step ITL spike; smaller `T` = smoother ITL, slower
-  TTFT. Because the budget is `T - D`, the spike is bounded at `T` regardless of decode load.
-  Default `T = n_batch`; expect to tune down toward ~1024 for ITL-sensitive serving.
- **Checkpoint ~149 MiB/slot thrash** on the 128 GB unified box - admission must be
-  checkpoint-cost-aware (P2); P0 measures with checkpoints off to isolate the batching win.
- **Memory**: paged on-demand KV (dense 52->94 GB, MoE 39->61 GB across npl) vs vLLM's flat
-  ~112 GB pre-reservation - llama's standing multi-tenant advantage, unaffected by this change.
- **Eager mode** both engines today; **CUDA-graphed decode** is the P3 kernel lever, and the
-  scheduler's uniform decode-only step is a precondition that *helps* a future capture.
-
-## Biggest risks and how to de-risk
-
-1. **"Slot-state rewrite" (the feared big risk) = actually LOW.** The
-   mid-prefill-while-others-decode state and the chunked-prefill resume already exist ([B]); we
-   add only per-step scratch (`slot_prompt_added`, `rr_start`), not lifecycle states.
-   **De-risk**: keep all 6 states
-   untouched; gate every change behind the new knobs; default-off = byte-identical 0013/stock,
-   verified by an A/B diff of per-step token counts.
-2. **Correctness regression in the mixed batch = the FA chunk-grouping nondeterminism.** Already
-   documented and bounded by 0013 (stock `-b256` vs `-b2048` diverge identically). **De-risk**:
-   op-test bit-exact where deterministic; greedy-coherence e2e on both models; A/B vs 0013 with
-   the new knobs set to reproduce 0013 (`T = n_batch`, no leftover) and confirm **byte-identical**
-   to 0013.
-3. **Paged-KV interaction = LOW (orthogonal positions).** **De-risk**: run the whole P0-P2 suite
-   with `LLAMA_KV_PAGED=1` and `=0`; assert identical scheduling decisions (paged must be a
-   no-op on batch formation). This is a hard gate, not a spot check.
-4. **Checkpoint thrash masks the win = MEDIUM.** A real serving defect that can swamp the
-   scheduler's signal. **De-risk**: P0 isolates it (`n_ctx_checkpoints=0`); P2 makes admission
-   checkpoint-cost-aware; report the scheduler metrics both with and without checkpoints so the
-   batching win is legible independent of the checkpoint co-defect.
-5. **Honest-payoff risk = the decode_agg number barely moves over 0013 (kernel ceiling), so the
-   work can be mis-judged as "no win."** This is the most important risk to manage. **De-risk**:
-   frame and measure on **TTFT percentiles, burst-TTFT spread, step-time histogram, admission
-   rate, and tuning-free ceiling-hold across npl/dense/MoE** - the axes the scheduler actually
-   moves - and **pre-register the decode-kernel as the separate residual-closer** (P3) so the
-   scheduler is never charged with the 391/811 number the kernel forbids.
-
-## Commit / hygiene
-
-Scope doc only (this file). **No engine change committed in this workflow.** Bench and parity
-scripts stay dev-tree-only (`~/bench/`, `~/llama-paged-dev/benches/`). When P1/P2 are
-implemented they mirror to
-`backend/cpp/llama-cpp/patches/paged/0016-paged-continuous-batch-scheduler.patch` (next free
-slot after 0015) and the LocalAI option lands in `grpc-server.cpp` beside `max_prefill_tokens`.
-Commit with `git commit -s`, trailer
-`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes. Do not push
-(human pushes).
-
---
-
-## Review / risk (adversarial, source-verified)
-
-Skeptical staff review against the actual source at HEAD `151343b` (server-context.cpp,
-llama-batch.cpp, llama-kv-cache.cpp, paged-*.cpp), grpc-server.cpp in this worktree, and the
-committed `QWEN36_NVFP4_BENCH.md` plus the vLLM H2H serve logs/scripts on the box.
-
-### Verdict: the scope is SOUND. GO on P0 -> P1, CONDITIONAL P2, separate-track P3.
-
-The central de-risking claims check out against the code, and the load-bearing honesty (decode
-residual is a kernel ceiling, not a scheduler defect) is correct and now further corroborated.
-Two calibration fixes are required before P1 (below); neither changes the go decision.
-
-### (1) Tractability - CONFIRMED bounded; zero libllama changes. What enables/blocks it, concretely:
-
- **Enables (already-exercised path, not new surface).** A mixed prefill+decode ubatch with
-  per-seq different `n_past` is the *existing* behaviour. `llama_batch` carries per-token `pos`
-  and `seq_id` (`common_batch_add(batch, tok, pos_next(), {slot.id}, ...)`); `llama_kv_cache` +
-  `paged_alloc::place()` place each `(seq, pos)` independently; `llama_kv_cache::init_batch`
-  (line 742) already splits the mixed batch into ubatches. **The server emits exactly this mixed
-  decode+prefill batch today** - patch 0013 ships it and produces coherent output - so the new
-  scheduler changes only the *count* of prefill tokens, never the batch *structure*. There is no
-  `llama_decode`/ubatch/KV rewrite in scope.
- **Blocks: nothing in libllama.** The only constraints are pre-existing and orthogonal to the
-  target workload: (i) `can_batch_with` (same task type + equal LoRA per batch); (ii)
-  `split_equal(sequential=true)` errors on *coupled* sequences (shared-prompt parallel sampling),
-  forcing `-kvu`. Neither is introduced by this change.
- **Correction to fold in:** the scope's [C] and the pseudocode imply contiguous `split_simple`
-  chunking. The real serving/benchmark config (`--parallel 128`, `kv_unified` default = `false`
-  -> `n_stream = n_seq_max = 128`) takes the **`split_equal(n_ubatch, sequential=true)`** path
-  (llama-kv-cache.cpp:742), which balances per-sequence rather than slicing contiguously. This
-  does not break anything (0013 already hits it) but it means the actual scheduled object is a
-  split_equal ubatch set; P0 must characterize that ubatch shape (not assume contiguous 512-chunks)
-  and the determinism band is over split_equal groupings. Lock the split path (unified vs not) in
-  the A/B so the byte-identical-to-0013 gate is meaningful. grpc seam [E] verified at
-  grpc-server.cpp:761-786 (`kv_paged`, `max_prefill_tokens`/`mpt`); new `mbt`/`prefill_cap` knobs
-  hang off it identically.
-
-### (2) Does it close the gap - the 2.4x is NOT CUDA graphs, and the TTFT root is quantified.
-
- **CUDA graphs ruled out (verified).** Both NVFP4 H2H vLLM servers ran `--enforce-eager`
-  (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`; engine logs show `enforce_eager=True`,
-  `cudagraph_mode=NONE`, `CompilationMode.NONE`). So the npl128 2.4x decode gap is a genuine
-  **eager-mode kernel + per-step host-overhead** gap (ggml graph rebuild/realloc + ~1k kernel
-  launches per step on the weak Grace cores, paged-KV gather, MoE expert gather). The scheduler
-  cannot touch it; the staggered all-128-decoding 157.4 tok/s ceiling is solid. Scope is right to
-  refuse the 391/811 number. (CUDA graphs are a future *both-sides* lever, not the current cause.)
- **The TTFT gap has a measured root the scope under-uses: prefill_tps collapse.** From the bench,
-  llama `prefill_tps` falls 1117 -> 752 -> 465 -> **125** (dense, npl 8/32/64/128) while vLLM holds
-  **flat ~1420** (MoE: 2813 -> 657 vs vLLM flat ~4263). That collapse - not a separate "scheduling
-  quality" abstraction - is the direct cause of the 491 s / 85 s TTFT, and it is exactly what the
-  dynamic `T - D` budget attacks: when decode load `D` is low (early in a burst) the leftover
-  `T - D` lets prefill take ~`n_batch` per step, and because llama's *larger per-step chunk*
-  compensates for its ~2.4x slower steps, a `T = 2048` budget can sustain prefill_tps at or above
-  vLLM's ~1420 during the drain. **So burst-TTFT parity is mechanically plausible, not just
-  "toward"** - the static budget-256 throttles prefill to 256/step (hence its weak 305 s) where the
-  dynamic budget would not. This strengthens P1's case beyond what the doc claims.
- **Mandatory calibration fix:** that TTFT win **couples to a decode-ITL knob**. Spending the full
-  `T - D` on prefill during the drain makes those steps full `T`-token (mixed) computes, so
-  co-batched decoders get 1 token per slow step (ITL spike) *during the drain* - precisely vLLM's
-  tradeoff, navigated by `T`. The 157/333 ceiling is the **post-drain steady state**, not the
-  drain phase. Therefore the scope must **co-report drain-phase decode-ITL alongside TTFT** and
-  treat `T` as the published trade knob; reporting TTFT alone would hide the cost and reporting
-  decode_agg alone would hide the win (it is averaged across drain + steady state, which is why it
-  "barely moves"). Soften "P1+P2 reach 25 s / 8 s": the defensible claim is *staggered/realistic
-  arrival ~2 s, and all-at-once burst approaching vLLM with a tunable decode-ITL cost*.
-
-### (3) Correctness - paged orthogonality confirmed at source; the real risks are config, not code.
-
- **Paged-KV is the same `llama_kv_cache` class** with `paged_alloc::` hooks inside the existing
-  find_slot/placement (llama-kv-cache.cpp:1043-1083), driven by per-slot `(seq, pos)` - which this
-  change does not touch. `init_batch`/split is paged-agnostic. The scope's "orthogonal" claim is
-  verified, not asserted. Keep the hard `LLAMA_KV_PAGED=1` vs `=0` identical-decisions gate.
- **Determinism**: the FA grouping nondeterminism is over **split_equal** ubatches in the real
-  config; the `T = n_batch` A/B-must-be-byte-identical-to-0013 gate is the right oracle and is
-  sound (default-off path is untouched).
- **Low-concurrency regression**: gated to byte-identical when knobs unset; the only live vector is
-  a **mis-tuned `T`** spiking ITL at low npl (the scope already flags `T` defaults). Config hygiene,
-  not a code risk. Add a guard/floor so `T` cannot be set below `n_ubatch`.
-
-### (4) Smaller higher-ROI step - yes, and the scope already contains it (P1).
-
-The minimal high-ROI change is **P1 alone**: replace the static read (server-context.cpp:2737-2747)
-with `prefill_budget_step = max(floor, T - batch.n_tokens)` computed after the decode-fill at line
-2719, and bound the Phase-2 loops by `T` / that budget (3188, 3320, 3326). That is a handful of
-line edits at named seams, default-off, and it captures the self-tuning + the bulk of the TTFT win.
-The even-smaller validation spike: a one-line `n_prefill_budget = max(floor, T - batch.n_tokens)`
-to confirm the prefill_tps/TTFT mechanism before writing the full P1. **P2** (round-robin +
-`prefill_cap_per_slot` + checkpoint-aware admission) is genuinely higher-effort and lower-marginal
-(it buys TTFT *spread*/tail and burst robustness, not the median); **gate P2 on P1's measured
-burst-TTFT-spread and drain-ITL**, do not commit to it up front. There is no smaller step that also
-fixes the static budget's npl-dependence - tuning 0013's constant cannot (256 is net-negative at
-npl8 and costs MoE TTFT), so P1 is the floor.
-
-### Realistic effort / payoff and sequencing
-
- **P0** ~0.5-1 wk (harness largely exists in `~/bench/`): add drain-phase decode-ITL to the metric
-  set, lock the split path, isolate checkpoints (`n_ctx_checkpoints=0`). Gate only.
- **P1** ~2-4 days: small diff + the A/B-vs-0013 byte-identical gate + the npl/dense/MoE sweep.
-  Payoff: self-tuning hold of 161/333 with no hand-picked constant; burst-TTFT 3-10x better than
-  0013 (plausibly approaching vLLM on the burst, parity on staggered), at a published `T`-tunable
-  decode-ITL cost. **This is the high-ROI core and the clean supersession of 0013.**
- **P2** ~1-2 wk, conditional: fairness/admission + checkpoint-cost-awareness + tuning. Payoff: TTFT
-  tail/spread + no admission collapse under sustained load. Worth it only if P1 metrics show a
-  residual spread/robustness problem.
- **P3** separate track, high effort: the *only* path to 391/811 is the eager-kernel + per-step
-  host-overhead residual. Highest-value probe is a **CUDA-graph capture of the steady-state
-  pure-decode step** - but note this works *independent of the scheduler* (the all-128-decoding
-  step is already fixed-shape today); the scheduler neither blocks nor specially enables it, so do
-  not credit graphs to the scheduler. The scope's "uniform decode step is a precondition" is a mild
-  over-claim; correct it to "graphs apply to the pure-decode steady state, which the scheduler does
-  not change."
-
-### Bottom line
-
-GO. The work is correctly localized to `update_slots()` batch-formation policy, requires no
-libllama changes (the mixed per-seq batch is the existing, shipping path), and supersedes 0013
-cleanly. The honest ceiling is real and well-stated; the two fixes are (a) co-report drain-phase
-decode-ITL with TTFT and stop selling/charging the decode_agg number, and (b) acknowledge the
-`split_equal`/`n_stream=128` path in the determinism and ubatch-shape analysis. Sequence
-P0 -> P1, measure, then decide P2; keep P3 (kernel/CUDA-graph) on its own track as the sole owner
-of the 2.4x throughput residual.
--- a/backend/cpp/llama-cpp/patches/paged/CONV_STATE_FUSION_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/CONV_STATE_FUSION_RESULTS.md
@@ -1,106 +0,0 @@
-# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact)
-
-The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3).
-After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at
-the f32 byte floor), the conv path was the only remaining bit-exact decode lever.
-
-## What changed
-
-A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a
-non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain:
-
-    qkv_mixed transpose -> ggml_concat (build width-K window)   [concat_cont 8.14 ms/step]
-    -> ggml_ssm_conv (depthwise conv)                           [ssm_conv_f32 ~8.6 ms/step]
-    -> ggml_silu                                                [folded into ssm_conv on CUDA]
-    -> ggml_cpy of the shifted ring state into the conv cache   [cpy_scalar 5.76 ms/step]
-
-with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from
-the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME
-ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes
-the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot
-the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is
-vLLM's `causal_conv1d_update`.
-
-Files:
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace`
-  (src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs],
-  src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu).
- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32<apply_silu,d_conv>` (one thread per
-  (channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of
-  `ggml_cuda_op_ssm_conv`.
- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over
-  channels) + branch in `ggml_compute_forward_ssm_conv`.
- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs
-  conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write
-  target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed.
- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path
-  (`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/
-  rollback keep the existing concat+ssm_conv+silu+cpy chain.
- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output
-  vs the CPU reference across backends.
-
-## Gate: test-backend-ops (CUDA0 vs CPU reference)
-
- SSM_CONV: 45/45 OK (unchanged path intact)
- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
- SSM_CONV_BIAS_SILU: 90/90 OK
-
-## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on)
-
-Byte-identical to the clean Lever-1 (0019/0020) baseline, both models:
-
-| model              | baseline md5                     | fused md5                        | result          |
-|--------------------|----------------------------------|----------------------------------|-----------------|
-| q36-27b-nvfp4      | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL  |
-| q36-35b-a3b-nvfp4  | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL  |
-
-## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after
-
-Dense q36-27b-nvfp4:
-
-| mode      | npl | baseline | fused  | delta   |
-|-----------|-----|----------|--------|---------|
-| CUDA-graph| 32  | 199.76   | 202.99 | +1.6%   |
-| CUDA-graph| 128 | 336.35   | 347.14 | +3.2%   |
-| eager     | 32  | 196.07   | 197.61 | +0.8%   |
-| eager     | 128 | 333.62   | 342.97 | +2.8%   |
-
-MoE q36-35b-a3b-nvfp4:
-
-| mode      | npl | baseline | fused  | delta   |
-|-----------|-----|----------|--------|---------|
-| CUDA-graph| 32  | 421.72   | 432.39 | +2.5%   |
-| CUDA-graph| 128 | 689.74   | 713.54 | +3.5%   |
-| eager     | 32  | 421.05   | 432.46 | +2.7%   |
-| eager     | 128 | 689.15   | 713.87 | +3.6%   |
-
-Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at
-**88.8% of vLLM 391** (up from 86.0%). The lift holds in BOTH graph and eager modes.
-
-## Step time + nsys kernel delta
-
-Per-step decode time (dense npl128, T_TG / ntg=128):
- baseline 48.711 s / 128 = 380.6 ms/step
- fused    47.197 s / 128 = 368.7 ms/step  -> **-11.9 ms/step** (in line with the predicted ~12-14 ms/step saving)
- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step)
-
-nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels:
-
-| kernel                | baseline calls | fused calls      | per-step (eager) |
-|-----------------------|----------------|------------------|------------------|
-| concat_cont (decode)  | 1152           | 0 (GONE)         | 7.95 -> 0 ms     |
-| cpy_scalar (decode)   | 1152 of 3648   | 0 (GONE)         | 4.29 -> 0 ms     |
-| ssm_conv_f32 (decode) | 1152 of 2736   | 0 (prefill-only) | 8.65 -> 0 ms     |
-| ssm_conv_update       | 0              | 1152             | 0 -> 7.56 ms     |
-
-Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont
-and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel.
-Prefill keeps the original chain (concat_non_cont 1584 calls, ssm_conv_f32 1584 calls, both unchanged).
-
-## Verdict
-
-Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM
-391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free;
-de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
+++ b/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
@@ -1,639 +0,0 @@
-# Critical-Path Gap Analysis - GDN decode region
-
-## vllm-gdn-compare (READ-ONLY, no GPU) - vLLM decode GDN kernel inventory vs llama
-
-### Source ground truth
- Local checkout `/home/mudler/_git/vllm` and the DGX's benchmarked venv
-  `/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm` are STRUCTURALLY
-  IDENTICAL (same file `qwen_gdn_linear_attn.py`, byte-for-byte same line numbers
-  1287/1344/1457/1644/1684). So the analysis below is faithful to what was actually
-  benchmarked on the GB10. Both are a recent dev build (`__version__ = "dev"`), same
-  era as the "0.23.0" reference; the GDN path is the refactored
-  `vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`.
-
-### The headline: vLLM runs the entire GDN region at decode as 2 Triton kernels + 3 GEMMs, ALL fused
-Per Qwen3.5 gated-DeltaNet (linear-attn) layer, vLLM decode launches:
-
-| # | Kernel | What is folded in |
-|---|--------|-------------------|
-| 1 | `in_proj_qkvz` GEMM | (quantized matmul - shared with llama) |
-| 2 | `in_proj_ba` GEMM | (quantized matmul - shared with llama) |
-| 3 | `_causal_conv1d_update_kernel` (causal_conv1d.py:1193) | conv1d **+ silu activation fused in** (the `activation` arg) |
-| 4 | `fused_recurrent_gated_delta_rule_packed_decode_kernel` (fused_recurrent.py:256-336) | **l2norm(q), l2norm(k), scale, softplus gate, A_log decay exp(g), sigmoid(beta), the delta-rule recurrence (b_h*=exp(g); delta update), the output b_o=sum(b_h*b_q), AND the SSM state write-back** - all in one kernel |
-| 5 | `RMSNormGated` (gated rms_norm) | **output gate silu/sigmoid * z fused into the rms_norm**; the comment notes the norm+quant is further fusable by the compilation pass (`fuse_norm_quant`) |
-| 6 | `out_proj` GEMM | (quantized matmul - shared with llama) |
-
-So the GDN-region "glue" elementwise op count in vLLM is effectively ZERO separate
-launches. Everything llama runs as standalone ggml nodes - conv-silu, gate
-sigmoid, softplus, l2norm, scale, decay mul, residual add, gather - is absorbed
-into kernels #3, #4, and #5.
-
-Verified kernel bodies:
- `fused_recurrent_gated_delta_rule_packed_decode_kernel` lines 313-336:
-  `b_q/sqrt(sum(b_q^2)+eps)`, `b_k/sqrt(...)` (l2norm), `b_q*scale`,
-  `softplus_x=where(x<=thr, log(1+exp(x)), x)`, `g_val=-exp(A_log)*softplus_x`,
-  `beta_val=sigmoid(b)`, `b_h*=exp(g_val)`, `b_v-=sum(b_h*b_k)`, `b_v*=beta_val`,
-  `b_h+=b_v*b_k`, `b_o=sum(b_h*b_q)`, `tl.store(p_o,...)`, `tl.store(p_ht,...)`.
-  ONE kernel = recurrence + ALL gating + l2norm + state writeback.
- The non-packed variant `fused_sigmoid_gating_delta_rule_update_kernel`
-  (fused_sigmoid_gating.py:24-179) is the same fusion (used for the spec-decode /
-  mixed-batch path); both fold gate+l2norm+recurrence+writeback into one launch.
- Decode dispatch: `_forward_core` (line 1286-1298) routes pure non-spec decode to
-  `_forward_core_decode_non_spec` (line 1644), which calls exactly
-  `causal_conv1d_update` (#3) then `fused_recurrent_gated_delta_rule_packed_decode`
-  (#4). `_output_projection` (line 851) does `self.norm(core_attn_out, z)` (#5,
-  gated rmsnorm) then `out_proj` (#6).
-
-### vLLM ALSO captures decode in a FULL CUDA graph - the launch bubbles are gone entirely
-`vllm/v1/attention/backends/gdn_attn.py`:
- `_cudagraph_support = AttentionCGSupport.UNIFORM_BATCH` (line 82)
- `use_full_cuda_graph = cudagraph_mode.has_full_cudagraphs()` (line 113)
- `build_for_cudagraph_capture` (line 509): "only decode is supported for full
-  cudagraphs with Mamba" / "GDN only supports decode-only full CUDAGraph capture".
-
-So at decode vLLM captures the WHOLE forward (all 64 layers: the 48 GDN linear-attn layers
-+ the 16 full-attn layers (1 in 4) + projections + conv + recurrence + gated rmsnorm)
-into a single replayed CUDA graph. Per-kernel host launch latency and the
-data-dependent inter-op gaps are eliminated at replay time. Even the 2 Triton
-kernels per GDN layer incur no host-side launch bubble during graph replay.
-
-### Why this is the 62%-vs-40% explanation (not GEMM throughput)
- llama runs the GDN region as ~7-9 separate ggml nodes per layer at decode
-  (`ssm_conv`, `gated_delta_net` recurrence, `gdn_gather`, `k_bin_bcast` mul,
-  `silu`, `sigmoid`, `l2_norm`, `op_add`, `concat`), each a host-launched kernel,
-  serially data-dependent (conv -> gate -> recurrence -> gather), with the gating
-  elementwise wedged between recurrence steps. Each launch + the dependency stall
-  is a bubble ON the critical path. x48 layers x ~8 ops = ~384 launch bubbles/step.
- vLLM has 2 fused Triton kernels per GDN layer AND wraps them in a CUDA graph, so
-  the GDN-region inter-op bubble count at decode is ~0. The recurrence kernel
-  itself is already near parity with vLLM's in llama (gated_delta_net 1.47 ms/call).
-  The gap is the surrounding launch/sync overhead, which shows up as the ~60% idle
-  measured on llama (~40% busy vs vLLM's 62%).
- This matches why P2a and Lever 2 were FLAT: they shrink GPU-busy kernels that are
-  already overlapped with the 42% mul_mat_q GEMM. The real wall-clock lever is the
-  SERIAL GDN gating chain's launch bubbles, which vLLM removed by (a) fusion into
-  the recurrence kernel and (b) CUDA-graph capture.
-
-### What llama would need to match vLLM (two independent wins, either helps)
-1. **Op fusion (Lever 3).** Collapse the GDN per-layer gating chain into the
-   recurrence kernel: fold conv-silu, q/k l2norm, scale, softplus+A_log gate,
-   sigmoid-beta, the exp-decay mul, the residual add, and the SSM-state write-back
-   INTO the `gated_delta_net` CUDA kernel (and fuse the output gate silu*z into the
-   final rms_norm). Target: from ~8 GDN nodes/layer down to ~2 (conv-fused +
-   recurrence-fused), mirroring vLLM's `fused_recurrent_gated_delta_rule_packed_decode`.
-   The conv silu fold and the l2norm/scale/gate fold are the high-value pieces -
-   they are pure elementwise prologues sitting ON the serial chain between conv and
-   recurrence.
-2. **CUDA-graph the decode step.** Even without fusion, capturing the decode forward
-   in a CUDA graph removes the per-node host launch latency for all ~384 nodes/step.
-   (Prior A.2 work flagged ggml-cuda graph capture as the orthogonal lever; the
-   measured GDN structure here is exactly why it should move the wall.) vLLM gets
-   BOTH; llama gets neither today.
-
-### Bottom line for the gap-analysis agent
-The candidate explanation is confirmed at the source level: vLLM's GDN decode region
-is 2 fused Triton kernels under a full CUDA graph vs llama's ~8 separate
-host-launched, serially-dependent ggml nodes. That launch/bubble delta - not GEMM
-compute - is the 62%-vs-40% busy gap. A timeline gap analysis on the existing nsys
-trace should show idle GPU between the GDN sub-ops (conv -> gate -> recurrence ->
-gather) per layer; if it does, Lever 3 (gating-into-recurrence fusion) and/or
-decode CUDA-graph capture are the levers that will move the wall, unlike P2a/Lever 2.
-
---
-
-## roofline-decode (READ-ONLY, no GPU) - decode-step roofline + bubble budget + parity target
-
-Goal: bound the q36-27b-nvfp4 decode step by the bandwidth floor and the compute floor,
-compare to measured llama 384 ms/step vs vLLM 327 ms/step, size the unexplained "bubble
-budget", and state the step-time target for parity. Cross-checks vllm-gdn-compare above.
-
-### Inputs (measured / GGUF metadata, no new GPU work)
- DGX GB10 (sm_121): LPDDR5x **273 GB/s**, dense NVFP4 MMA peak ~**500 TFLOP/s** (sparse ~1 PFLOP/s).
-  Both numbers are shared identically by llama and vLLM (same HW, same weights).
- q36-27b-nvfp4 GGUF (arch qwen35): block_count **64** (full_attention_interval 4 ->
-  **16 full-attn + 48 GDN** layers), d_model 5120, FFN 17408, attn 24 heads / 4 kv-heads,
-  head_dim 256, ssm conv_kernel 4 / state_size 128 / group_count 16 / inner_size 6144.
-  Weight file = **18.804 GB** (NVFP4 + FP8 block-scales + f16 norms/embd), fully GPU-resident.
- Measured llama decode (dense_base.out, -fa -npp128 -ntg128 -npl128, B=128, 128 TG steps):
-  T_TG 49.154 s / 128 = **384.0 ms/step** (S_TG 333.3 tok/s; matches STATE "~381 ms").
- vLLM dense ref **391 tok/s @128** => 128/391 = **327.4 ms/step**.
-
-### 1. Bandwidth floor (bytes that MUST cross LPDDR5x per step / 273 GB/s)
-| term | bytes/step | basis |
-|------|-----------|-------|
-| Weights (batched, read ONCE/step, reused across all 128 seqs) | ~18.4 GB | file 18.804 minus ~0.4 GB sparse input-embd lookup; lm_head fully read |
-| SSM state R+W (48 GDN layers x 128 seqs) | ~19 GB (bracket 10-38) | ~1.5 MB/layer/seq R+W, kernel-grounded: gated_delta_net 1.47 ms/call -> ~400 MB/call @273 GB/s; theoretical d_inner*d_state f32 doubles it |
-| KV cache read (16 attn layers, avg 192 ctx, 128 seqs, f16) | ~1.6 GB | 64 KiB/tok over 16 layers; max-ctx 256 -> 2.1 GB |
-| Activation/quantize/gate intermediates R+W | ~3 GB | quantize_mmq_nvfp4 + k_bin_bcast + silu + rms tensors @ batch 128 |
-| **TOTAL** | **~42 GB/step** | bracket 32-61 GB |
-
-**Bandwidth floor = 42/273 = ~154 ms/step** (central; bracket ~117-224 ms).
-Weight-only HARD sub-floor (unavoidable, both engines pay it): 18.4/273 = **67 ms/step**.
-
-KEY: even at batch 128 the FP4 GEMM is STILL memory-bound, not MMA-bound. AI = 2*128/0.53 B
-= ~483 FLOP/byte << GB10 ridge 500e12/273e9 = 1832 FLOP/byte. The 42% `mul_mat_q<NVFP4,m=128>`
-GPU time is weight-DRAM streaming, not tensor cores -> first-principles reason P2a (-26% MMA
-occupancy) and Lever-2 were FLAT on decode.
-
-### 2. Compute floor (FLOPs / ~500 TFLOP/s dense FP4)
-| term | FLOPs/step | floor |
-|------|-----------|-------|
-| FP4 GEMM (all dense matmuls): 2 * ~26e9 params * 128 seqs | 6.66 TFLOP | / 500e12 = **13.3 ms** (6.7 ms @ sparse 1 PFLOP) |
-| GDN recurrence (state update + read-out, 48 layers x 128 seqs) | ~0.04 TFLOP | < 0.1 ms (state-bound, not FLOP-bound) |
-| **TOTAL** | ~6.7 TFLOP | **~13 ms/step (~4% of the step)** |
-
-### 3. Verdict / bubble budget / parity target
-```
-                    compute floor   bandwidth floor    MEASURED step   x above bw-floor
-GB10 dense-FP4      ~13 ms          ~154 ms (117-224)
-vLLM dense @128                                        327 ms          ~2.1x (1.5-2.8x)
-llama dense @128                                       384 ms          ~2.5x (1.7-3.3x)
-```
- **Binding floor = bandwidth (~130-155 ms), NOT compute (~13 ms).** Compute floor is ~25x
-  below the wall -> FP4-MMA throughput is irrelevant; matches P2a/Lever-2 flatness exactly.
- **Both engines run ~2-2.8x ABOVE the bandwidth floor.** vLLM itself reaches only ~40-47%
-  LPDDR5x efficiency -> even the reference is LATENCY/occupancy bound, not byte-bound.
-  Confirms prior "decode is 2.5x above its bandwidth floor" work.
- **Bubble budget** (wall - bandwidth floor, central 154): vLLM ~**173 ms**, llama ~**230 ms**.
-  = kernel-launch latency + occupancy gaps + serial data-dependency stalls.
- **The llama-vs-vLLM gap = 384 - 327 = 57 ms/step (14.8% of the step) is 100% BUBBLE.**
-  Both engines share IDENTICAL bandwidth AND compute floors (same 18.8 GB NVFP4 weights, same
-  SSM state, same KV, same GB10 273 GB/s + 500 TFLOP/s). Bytes moved and FLOPs executed are equal,
-  so the entire 57 ms differential lives in critical-path bubble - NOT bandwidth, NOT compute.
-
-**Parity target: 327 ms/step (391 tok/s @128). llama must shave 57 ms/step = 14.8% off 384 ms.**
-Neither floor can move (both already shared with vLLM), so the 57 ms can ONLY come from
-collapsing critical-path bubbles -> structurally-correct case for Lever 3 (fuse the serial GDN
-gating chain) and/or decode CUDA-graph capture, exactly the two wins vllm-gdn-compare found vLLM
-already has. P2a/Lever-2 were flat because they freed OVERLAPPED GPU-busy time BELOW the floor.
-
-### Cross-check / sizing for the gap-analysis (timeline) agent
- GPU-busy sum from nsysab_new (ntg24 window, /24 steps): FP4 GEMM ~243 + gated_delta_net ~76 +
-  GDN glue (k_bin_bcast mul ~49, silu ~34, concat ~19, gdn_gather ~21, ssm_conv ~12, l2_norm ~6,
-  op_add ~10) ~152 + quantize ~62 = **~555 ms GPU-busy vs 384 ms wall** -> sum >> wall by ~1.45x,
-  so heavy overlap is real and GPU-busy% buckets ARE misleading. Do NOT sum kernel times; the
-  wall is the critical path.
- Concrete budget: if the inter-kernel IDLE gaps + non-overlapped launch latency along the serial
-  GDN chain (ssm_conv -> gated_delta_net -> gating elementwise -> gdn_gather, x48 layers x N steps)
-  sum to **>= 57 ms/step**, Lever 3 is justified AND sized. If those critical-path gaps total
-  < 57 ms, parity is NOT reachable via GDN-gate fusion alone and the gap is elsewhere (GDN core
-  kernel slower than vLLM fused_recurrent, or scheduler/H2D).
- Structural corroboration (agrees with vllm-gdn-compare): vLLM runs the GDN region as 2 fused
-  Triton kernels under a full CUDA graph; llama splits it into ssm_conv + gated_delta_net +
-  gdn_gather + ~6 serially data-dependent elementwise gate kernels. ~384 host-launched nodes/step
-  on a chain that cannot overlap is precisely the mechanism that produces llama's extra ~57 ms.
-
-Floors are engine-independent lower bounds; the timeline agent owns proving the 57 ms is
-recoverable on the critical path. Roofline says: target 327 ms, shave 57 ms, and it can ONLY
-come from bubble (not bytes, not FLOPs).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
-## lever3-design (READ-ONLY, no GPU) - concrete fusion of the serial GDN gate chain into the recurrence kernel
-
-### What actually feeds/consumes the recurrence kernel today (qwen35 decode, fused_gdn_ar)
-Traced in `src/models/qwen35.cpp::build_layer_attn_linear` ->
-`src/models/delta-net-base.cpp::build_recurrent_attn` (fused !keep branch) ->
-`ggml/src/ggml-cuda/gated_delta_net.cu`. The model is GDA (g->ne[0]==1, scalar
-gate per head; kda=false in the kernel). S_v = ssm_d_state = 128, so the kernel
-runs the `<128>` template: warp_size==S_v==128, num_warps=4, rows_per_lane==1,
-grid (H, n_seqs, S_v/4=32 z-tiles). Each warp owns one output column `col`; the
-128 lanes hold the full head-vector (one element per lane).
-
-Serial pre-GDN gate chain (each a standalone host-launched ggml node, all on the
-critical path between the in-proj GEMMs and the recurrence):
-1. `beta = ggml_sigmoid(ssm_beta @ cur)`            -> kernel reads `beta_val = *beta_t`
-2. `alpha = ssm_alpha @ cur`
-3. `ggml_add(alpha, ssm_dt)`  (k_bin_bcast op_add)
-4. `ggml_softplus(...)`        (unary_op<softplus>, 1248 inst)
-5. `ggml_mul(softplus, ssm_a)` (k_bin_bcast op_mul; ssm_a = -exp(A_log), baked)  -> g; kernel does `expf(g_t)`
-6. `ssm_conv` then `ggml_silu` (conv path; may already hit the upstream SSM_CONV+SILU fuse) -> v_conv, and the q/k slices
-7. `ggml_l2_norm(q_conv)`, `ggml_l2_norm(k_conv)` (l2_norm_f32<32>, 2496 inst = 1248x2) -> kernel reads q_reg/k_reg
-
-Post-GDN gate (consumes kernel output):
-8. `build_norm_gated(output, ssm_norm, z)` = rms_norm(output)*ssm_norm (RMS_NORM+MUL fused) then `silu(z)*.` (unary_gated_op<silu>, the 5.9% bucket)
-
-### The fusion: fold steps 1,3,4,5,7 INTO gated_delta_net_cuda (a "fused-gate" mode)
-These five are exactly the per-(head) scalar gates (sigmoid beta; softplus+dt+ssm_a ->
-g) and the per-head-vector L2 norms of q/k - and the kernel ALREADY loads every
-operand it needs:
- It reads `beta_val` (scalar) -> pass RAW beta, do `beta_val = 1.f/(1.f+expf(-raw))` in-kernel. Removes node 1.
- It reads `g_t` (scalar, GDA) and does `expf(g_t)` -> pass RAW alpha + per-head `ssm_dt[h]` + per-head `ssm_a[h]`, compute `g = ssm_a[h]*op_softplus(alpha + ssm_dt[h])` in-kernel, keep the existing `expf(g)`. `op_softplus(x) = (x>20)?x:logf(1+expf(x))` (copy `ggml_compute_softplus_f32` verbatim). Removes nodes 3,4,5.
- It loads the full q/k head-vector into `q_reg[r]`/`k_reg[r]` (one element per lane at S_v==128). L2-normalize in registers: `float qss = warp_reduce_sum<128>(q_reg[0]*q_reg[0]); q_reg[0] *= rsqrtf(qss + eps* ... )` matching the l2_norm formula, same for k. Each warp redundantly recomputes the (identical) norm for its column - cheap, no shared mem, no extra launch. Removes nodes 7 (x2). `eps` (= f_norm_rms_eps) passed as a kernel float param.
-
-That collapses the pre-GDN serial chain to just: in-proj GEMMs -> build_conv_state(concat) -> ssm_conv(+silu) -> [single fused gated_delta_net kernel]. 5 gate nodes (6 kernel launches, since l2_norm runs once for q and once for k) removed per SSM layer per decode step.
-
-### Why the OUTPUT gate (step 8) is NOT folded into this kernel
-The output gated-rmsnorm reduces over the full head_v_dim (S_v=128) per (head,seq).
-In this kernel those 128 elements are produced by 128 DIFFERENT (warp x z-tile)
-blocks (4 warps x 32 z-tiles), so an in-kernel head-wide reduction would need a
-grid-global sync - not feasible without a grid redesign. Leave step 8 as the
-existing RMS_NORM+MUL + unary_gated<silu> fusion (already 2 launches, not in scope).
-The conv-silu (step 6) is a convolution, structurally separate; rely on the
-existing upstream SSM_CONV(+ADD)+SILU fuse rather than pulling it into the
-recurrence kernel.
-
-### Implementation scope
- `ggml/include/ggml.h`: new builder `ggml_gated_delta_net_inplace_ids_fused_gate(ctx, q_raw, k_raw, v, alpha_raw, beta_raw, cache4d, state_dst, ids, ssm_a, ssm_dt, rs_head, eps)` (or an op-param flag GDN_FUSE_GATE on the existing builder + 2 extra srcs). src budget: current op uses src[0..7]; add ssm_a -> src[8], ssm_dt -> src[9]. GGML_MAX_SRC==10, so it fits EXACTLY (zero headroom - note for review).
- `ggml/src/ggml.c`: builder + a new op-param i32 flag (e.g. params[2]=fuse_gate) + f32 param for eps; assert shapes (ssm_a/ssm_dt are [num_v_heads]).
- `ggml/src/ggml-cuda/gated_delta_net.cu`: in `gated_delta_net_cuda`, gate the in-kernel sigmoid/softplus-gate/l2norm behind a `bool FUSE_GATE` template param (4th template bool, keeps the non-fused path byte-identical and avoids register bloat when off). Read ssm_a[h_idx], ssm_dt[h_idx]; compute g per head; sigmoid raw beta; warp-reduce q_reg/k_reg sumsq -> rsqrtf scale. Plumb the 2 new src pointers + eps through `launch_gated_delta_net` and `ggml_cuda_op_gated_delta_net` (read src[8],src[9], op_param eps/flag). The `gdn_gather_nonident` path is unaffected (it gathers state, not q/k/g/beta).
- `ggml/src/ggml-cpu/ops.cpp`: mirror in `ggml_compute_forward_gated_delta_net_one_chunk` (host sigmoid/softplus/l2norm before the per-token math) for CPU parity / test-backend-ops.
- `src/models/delta-net-base.cpp::build_recurrent_attn` (the fused !keep + ids branch, and the inplace non-ids branch): call the fused-gate builder, pass raw alpha/beta/q/k + ssm_a + ssm_dt + eps.
- `src/models/qwen35.cpp` / `qwen35moe.cpp` / `qwen3next.cpp` `build_layer_attn_linear`: when the fuse flag is on, DROP `ggml_sigmoid(beta)`, `ggml_add(alpha,dt)`, `ggml_softplus`, `ggml_mul(.,ssm_a)`, and the two `ggml_l2_norm` nodes; hand the raw tensors + `model.layers[il].ssm_a`, `ssm_dt` to build_recurrent_attn. The conv-silu and z/output-gate path are unchanged.
- Guard the whole thing behind `cparams.fused_gdn_gate` / env `LLAMA_FUSE_GDN_GATE` (default OFF) so it A/Bs against the clean Lever-1 build exactly like P2a/Lever-2, and only the recurrent (GDA) qwen35 family path is touched.
-
-### Numeric considerations / bit-exactness
- sigmoid(beta), softplus(alpha+dt), and the `g = ssm_a*softplus` mul/add are pointwise fp32 with the SAME formula/order as the standalone ggml ops -> these can be **bit-exact** (no reduction). softplus must copy `(x>20)?x:logf(1+expf(x))` exactly.
- q/k l2norm is the ONE op with a reduction: the standalone `l2_norm_f32<32>` does its own warp/block reduction; the in-kernel `warp_reduce_sum<128>` tree may differ in the last ULP, and the eps placement (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) must match the ggml l2_norm exactly. Expect **near-bit-exact, not guaranteed byte-identical** greedy output. So unlike Levers 1/2, gate this on a **PPL/KL tolerance** (KL logit delta < ~1e-3, PPL delta within noise) rather than md5 identity. If byte-identity is required, exclude l2norm from the fold (keep nodes 7) and fuse only sigmoid/softplus/gate - but that drops the value to ~0.3% and is probably not worth it.
-
-### Estimated kernels-removed-per-layer and the honest ceiling
- Removed per SSM decode layer-step: sigmoid(beta) + add(dt) + softplus + mul(ssm_a) + l2norm(q) + l2norm(k) = **6 host-launched kernels -> 0**, collapsing 7 nodes (incl. recurrence) to 1. Across 48 SSM layers = **~288 launches/step removed** (matches the instance deltas: l2_norm 2496, softplus 1248, sigmoid 1248, plus the alpha-add/ssm_a-mul share of op_add/op_mul).
- GPU-BUSY ceiling of the removed ops is small: l2_norm 1.0% + softplus ~0% + sigmoid 0.3% + (dt add + ssm_a mul share of op_add 1.7% / op_mul 8.5%). The point of Lever 3 is NOT the freed busy-time (P2a/Lever-2 proved freeing overlapped busy-time is flat) - it is removing ~288 LAUNCH BUBBLES/step that sit on the serial conv->gate->recurrence dependency where the GPU is otherwise idle. The win is wall-clock only if those specific bubbles are on the critical path.
-
-### RISK (must be settled before building)
-1. **Same trap as P2a/Lever-2 if the bubbles overlap.** If the scheduler already
-   overlaps these pre-GDN gate kernels with an adjacent layer's 42% mul_mat_q GEMM,
-   Lever 3 is FLAT. **Precondition: the timeline gap analysis must show idle GPU
-   between ssm_conv -> (sigmoid/softplus/l2norm) -> gated_delta_net per layer** at
-   batch=128 BEFORE building. If the trace shows the gate ops back-to-back with no
-   gap (overlapped), do NOT build op-fusion; go to lever (2) below.
-2. **The bigger bubbles may be elsewhere on the chain.** The large buckets are op_mul
-   8.5% and unary_gated<silu> 5.9% - much of which is the POST-GDN output gate and
-   FFN, which this fusion does NOT touch. If the gap analysis pins the dominant idle
-   to the post-GDN region or to inter-layer launch latency generally, the
-   higher-leverage Lever 3 is **decode CUDA-graph capture** (removes host launch
-   latency for ALL ~384 nodes/step at once, exactly what vLLM does), not per-op
-   fusion. CUDA-graph is the strictly larger hammer here; op-fusion only helps the
-   pre-GDN slice. Recommend measuring the per-sub-op gap first and preferring the
-   CUDA-graph lever if the bubbles are spread across the step rather than concentrated
-   in the pre-GDN gate slice.
-3. **src-slot exhaustion** (src[8],src[9] use the last 2 of GGML_MAX_SRC=10) - any
-   later op needing more srcs on this node has zero headroom; flag for review.
-
-## cudagraph-coverage (READ-ONLY, no GPU) - does the CUDA graph cover the GDN serial chain at B=128?
-
-### Verdict: YES, the graph covers GDN at batch=128 (dense model). No GDN op forces graph-disable or per-step re-instantiation.
-
-Source: `ggml/src/ggml-cuda/ggml-cuda.cu` (graph state machine), `gated_delta_net.cu`
-(fused op), `src/models/delta-net-base.cpp` (graph build), `src/llama-memory-recurrent.cpp`
-(recurrent head), all on dev tree `~/llama-paged-dev` (HEAD df1cc97, Lever-1). Cross-checked
-against the committed A2_CUDAGRAPH_DECODE.md + DECODE_PARITY_EXPLORE.md measurements.
-
-### How graph-disable / re-instantiation are decided (this fork's state machine)
- `ggml_cuda_graph_check_compability` (ggml-cuda.cu:3251) disables the graph for ONLY two
-  reasons: (a) a split-buffer src, (b) `GGML_OP_MUL_MAT_ID` with non-quantized weights OR
-  `node->ne[2] > get_mmvq_mmid_max(...)` [TAG_MUL_MAT_ID_CUDA_GRAPHS]. GATED_DELTA_NET,
-  SSM_CONV, SSM_SCAN, GET_ROWS, CONCAT, the gating elementwise ops are NOT in the disable
-  list. So no GDN op forces graph-disable.
- `ggml_cuda_graph_update_required` (3297) memcmps, per node, the full `ggml_tensor` struct
-  (incl. `op_params` and `data`) + each src's `data` ptr / `ne` / `nb`. ANY delta -> the
-  warmup state machine (ggml_backend_cuda_graph_compute:4464) resets `warmup_complete` and the
-  WHOLE graph (one key = `cgraph->nodes[0]`) runs eager that step until stable again. Buffer
-  CONTENTS are NOT compared - a contents-only change (e.g. ids values) is graph-safe.
-
-### Why the GDN region's properties are STABLE across steady decode steps
-The fused decode path is `ggml_gated_delta_net_inplace_ids` (delta-net-base.cpp:558-560):
-```
-state_dst = ggml_view_2d(ctx, ssm_states_all, n_embd_s, n_seqs, nb1,
-                         kv_head * n_embd_s * elsize);   // offset = kv_head
-ggml_gated_delta_net_inplace_ids(..., cache4d, state_dst, ids, /*rs_head=*/(int)kv_head);
-```
-Both the `state_dst` view byte-offset and the `rs_head` op_param (read back as
-`ggml_get_op_params_i32(dst,1)` in gated_delta_net.cu:330) derive from
-`kv_head = mctx_cur->get_head()`. In `llama_memory_recurrent::find_slot`
-(llama-memory-recurrent.cpp:610-689) the n_seqs used cells are SWAPPED into the contiguous
-range `[min .. min+n_seqs-1]` and `head = min`. The recurrent cache does NOT grow per token
-(one state cell per sequence, unlike the KV cache). For a steady 128-seq continuous batch the
-same sequences own the same tails every step, so `min`/`head` are constant (=0) -> state_dst
-offset constant, rs_head op_param constant. The GDN inputs (q,k,v,g,b, cache4d, ids) are
-fixed-shape (n_seqs=128, n_rs slots), so ne/nb are stable, and ggml-alloc hands out the same
-compute-buffer offsets each same-topology ubatch -> data ptrs stable. The `ids` (s_copy)
-tensor's CONTENTS change per step but its address/ne/nb do not -> graph-safe.
-
-### The fused GDN op is capture-safe (no host-sync, no capture-time cudaMalloc)
-`gated_delta_net.cu`: the op launches `gdn_gather_nonident_kernel` + `gated_delta_net_cuda`
-on `ctx.stream()` with NO `cudaStreamSynchronize` / host `cudaMemcpy` / `cudaMalloc`. The
-gather scratch is `ggml_cuda_pool_alloc` (VMM pool, served from reserved memory after warmup,
-no real cudaMalloc during capture). `gdn_gather_nonident` early-returns for identity sequences
-(`ids[s]==rs_head+s`), which is the steady-decode case, so its 3.7% is a launched-but-mostly-
-noop kernel - still captured into the graph like any other. Capture succeeds (the build runs,
-graphs engage), confirming none of these break stream capture.
-
-### The only re-instantiation is NOT GDN-driven
-A2 already measured the re-warm cadence: the graph re-instantiates every ~256 tokens because
-the FULL-ATTENTION block-table input `idx` has `ne[0]=GGML_PAD(n_kv,256)` (and kq_mask in
-lockstep) - those step at 256-token boundaries (paged-attn.cpp:199-213). ~97% of decode steps
-replay the captured graph. This is a full-attention-layer input, not a GDN op. (The unpadded
-`LLAMA_KV_PAGED_GATHER` fallback grows `ne[0]` every step and runs pure-eager, but that is not
-the default decode path and is not the GDN/SSM path.)
-
-### Reconciliation with the "~40% util / 60% idle bubbles" premise (it is refuted for GDN)
-The committed nsys sweeps (A2_CUDAGRAPH_DECODE.md, DECODE_PARITY_EXPLORE.md) show the steady
-decode is ~99.4-99.5% GPU-BUSY with graphs ON (measured with `--cuda-graph-trace=node`; a
-graphs-ON trace WITHOUT that flag under-counts GPU rows and falsely reports idle - Trap #2).
-Total exposed idle is ~0.65% of the step; the within-step launch fraction graphs remove is
-0.34% (0.37%->0.11%) and is ALREADY collapsed - the GDN sub-op launch gaps are inside the
-captured region. The "40% utilization" in the STATE is BANDWIDTH-roofline util, not idle SMs:
-decode moves ~55.5 GB/step at 2.48x the 273 GB/s floor, SSM state r+w = 66% of step bytes. The
-GDN recurrence is memory-bandwidth-bound at low occupancy (~12-16%), not launch-gap-bound. So
-"60% idle bubbles on the serial GDN chain" is not supported by the traces; the gap to vLLM is
-SSM-state memory traffic, consistent with P2a/Lever-2 being flat (freeing GPU-busy time, not
-wall-clock).
-
-### Graph-safe lever for GDN: none new
- GDN is already graph-covered; there is no "make the GDN ops graph-safe" lever to build - they
-  are already safe and captured.
- The only genuinely graph-NON-covered idle is the BETWEEN-step host gap (~2 ms/step, ~0.4%):
-  ggml rebuilds/reallocs the cgraph each step with a new `cgraph->uid`, so the uid fast-path in
-  ggml_cuda_graph_update_required never fires and the host re-dispatches ~3100 launches on the
-  Grace cores between graph launches (vLLM builds its graph once + persistent device metadata).
-  A persistent/reused cgraph across decode steps would let the uid fast-path fire and shrink the
-  host gap - but at 0.4% of the step it is second-order to the SSM bandwidth floor.
- CAVEAT (MoE, qwen35moe): MUL_MAT_ID at B=128 can trip [TAG_MUL_MAT_ID_CUDA_GRAPHS]
-  (`ne[2] > mmvq_mmid_max`), disabling the WHOLE MoE-decode graph (GDN included) into eager.
-  That is a MUL_MAT_ID disable, not a GDN break, and does not touch the dense 335 tok/s headline;
-  worth a separate confirm for the MoE model.
-
-## decode-timeline-gap (GPU, label gap-analysis) - the decisive fresh node-level measurement
-
-This is the new GPU run the analysis was waiting on. It arbitrates between the
-roofline/vllm-gdn-compare theory ("57 ms = 100% bubble, Lever 3 closes it") and the
-cudagraph-coverage source verdict ("~99.4% busy, bandwidth-bound, bubbles refuted").
-The measurement confirms the latter and refutes the former, with per-kernel numbers.
-
-### Capture (the trap the prior `--trace=cuda` fell into is now avoided)
-`nsys profile --trace=cuda --cuda-graph-trace=node` on build-cuda-base (clean
-Lever-1, HEAD df1cc97, git-clean mmq.cuh), q36-27b-nvfp4 dense,
-`-fa on -npp 128 -ntg 24 -npl 128 -c 33000`. Artifacts on DGX: `~/llama-paged-dev/nsysgap.{nsys-rep,
-sqlite}`. The decode step is a single CUDA graph (graphId=11, 23 replays = steps
-2-24; graphId=1 x8 = prefill). Plain `--trace=cuda` recorded each step as ONE opaque
-~380 ms block, so the widely-cited `nsysab_new.kern.txt` breakdown (mul_mat_q 42%,
-gated_delta_net 13%) is PREFILL + the single eager capture step, NOT decode. With
-node-level trace the graph expands: 168201 kernels = 91499 graph-internal + 76702
-eager prefill. **All graph kernels on stream 14 (single stream) -> strictly serial,
-no overlap, so any inter-kernel gap is pure GPU idle.**
-
-### One steady decode step (window between decode launches 22413.26 / 22796.74 ms, width 383.48 ms)
-Exactly 48 `gated_delta_net` + 16 `flash_attn` kernels = one clean step (48 GDN layers +
-16 attention layers), 2965 kernels in total.
-
-| classification | ms/step | % of step |
-|---|---|---|
-| (a) inter-kernel LAUNCH gaps + (b) SERIAL-DEPENDENCY stalls (LAG sum, single stream) | **0.225** | **0.06%** |
-| (c) within-kernel time (GPU running) | 380.4 | 99.94% |
-
-Zero gaps > 5 us. Largest single gap 2.40 us. 1260 sub-1us gaps + 1700 back-to-back.
-**The decode step is 99.94% GPU-busy. There are no bubbles.** This independently
-confirms cudagraph-coverage's ~99.4% and **refutes** roofline-decode's "57 ms = 100%
-bubble" and vllm-gdn-compare's "~384 launch bubbles/step on the critical path".
-nvidia-smi's "40% util" = low SM/compute efficiency WITHIN kernels (c) (memory-latency-
-bound, ~12-16% achieved occupancy), not wall-clock idle.
-
-### Real decode kernel mix (% of the 380.4 ms step) - corrects the prefill-contaminated kern_sum
-| kernel | n/step | ms | % | grid CTAs | waves/48SM |
-|---|---|---|---|---|---|
-| gated_delta_net_cuda | 48 | **196.37** | **51.6** | 48x128x32 = 196608 | 4096 |
-| mul_mat_q (FP4 in/out/qkv/o proj) | 496 | 92.90 | 24.4 | 136 | 1.5 |
-| quantize_mmq_nvfp4 | 496 | 17.13 | 4.5 | 483 | 10 |
-| nvjet GEMM (lm_head) | 1 | 11.91 | 3.1 | 1944 | 40 |
-| flash_attn_ext_f16 (16 attn layers) | 16 | 11.67 | 3.1 | 48 | 1.0 |
-| concat_cont (conv-state) | 48 | 8.01 | 2.1 | 20480 | 427 |
-| cpy_scalar | 64 | 7.62 | 2.0 | 49152 | 1024 |
-| k_get_rows_float | 49 | 7.08 | 1.9 | 15098 | 315 |
-| k_bin_bcast (gate mul + add) | 720 | 6.59 | 1.7 | 3169 | 66 |
-| ssm_conv_f32 | 48 | 5.64 | 1.5 | 10240 | 213 |
-| unary_gated (silu/sigmoid) | 128 | 5.36 | 1.4 | 5888 | 123 |
-| mul_mat_q_stream_k_fixup | 304 | 3.94 | 1.0 | 192 | 4 |
-| rms_norm_f32 | 209 | 3.52 | 0.9 | 1764 | 37 |
-| l2_norm_f32 | 96 | 0.64 | 0.2 | | |
-| gdn_gather_nonident | 48 | **0.061** | 0.016 | | |
-
- `gated_delta_net` is **51.6% of the step**, the single dominant term. The
-  previously-cited "1.47 ms/call near-vLLM" was the EAGER average over 1248 calls
-  (range 0.046-4.42 ms = prefill warmups + capture); true steady decode is
-  **4.08-4.11 ms/call** (gridY=128 = the 128 seqs). 2.8x higher than believed.
- It launches 196608 CTAs / 4096 waves = NOT occupancy-starved; the cost is
-  bandwidth-bound state traffic (~384 MB read + ~384 MB write per layer for the
-  48-head x 128-seq x [state 128 x head_v 128] recurrent state, ~190 GB/s effective).
- The Lever-3 narrow target (gating glue) = k_bin_bcast 6.59 + silu/sigmoid 5.36 +
-  l2_norm 0.64 + softplus 0.13 = **12.76 ms = 3.35%** of the step. `gdn_gather` is
-  **0.06 ms** (negligible - it early-returns on identity ids as predicted).
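-
-A rough cross-check of that state-traffic estimate, assuming the recurrent state is
-stored as f32 (4 bytes/element) with the shape quoted above, and taking the measured
-196.37 ms over 48 calls as the time base (a sanity sketch, not a measured byte count):
-
-```latex
-\begin{aligned}
-\text{state bytes/layer} &= 48 \times 128 \times (128 \times 128) \times 4\,\text{B}
-  = 402{,}653{,}184\,\text{B} \approx 384\,\text{MiB} \\
-\text{read + write/layer} &\approx 2 \times 402.7\,\text{MB} = 805.3\,\text{MB} \\
-\text{time/call} &= 196.37\,\text{ms} / 48 \approx 4.09\,\text{ms} \\
-\text{effective bandwidth} &\approx 805.3\,\text{MB} / 4.09\,\text{ms}
-  \approx 197\,\text{GB/s} \approx 72\%\ \text{of}\ 273\,\text{GB/s}
-\end{aligned}
-```
-
-This lands close to the ~190 GB/s quoted above. It counts only one state read and one
-state write per call, so any extra state round-trips would push the real figure higher.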
-
-### The three answers (with numbers)
-1. **Bubbles on the serial GDN critical path?** NO. 0.225 ms idle/step = 0.06%,
-   zero gaps > 5 us. CUDA graphs eliminated launch overhead; serial dependencies do
-   not produce idle (each kernel starts < 1 us after the previous). The premise is
-   refuted by direct measurement.
-2. **Would Lever 3 (fuse the gating chain) shrink the step or overlap away?** It
-   shrinks it, but only by its hard ceiling **12.76 ms = 3.35%** (380 -> 367 ms, 336
-   -> ~348 tok/s, 86% -> 89% of vLLM). It does NOT close the 14% / 53-57 ms gap.
-   IMPORTANT mechanism correction: the step is single-stream and 99.94% busy, so
-   there is NO overlap to absorb freed time (the lever3-design RISK #1 "same trap as
-   P2a if overlapped" does NOT apply - nothing overlaps). So removing those kernels'
-   GPU-time DOES cut wall-clock - but the win is removing their HBM byte traffic, NOT
-   launch bubbles (there are none). And the value is the measured ~12.76 ms, not the
-   "~288 launch bubbles" framing (those launches cost ~0 inside the graph). This also
-   explains P2a/Lever-2 flatness correctly: NOT "overlapped busy-time" (no overlap),
-   but P2a tuned the prefill large-M GEMM (decode GEMMs are 136-CTA tail-bound, untouched)
-   and Lever-2 relocated mandatory quantize work into the GEMM prologue (net zero).
-3. **Do CUDA graphs cover the GDN region at B=128?** YES, fully. Whole step = one
-   graph, 23 replays, ~0.2 ms host gap between steps. `gdn_gather_nonident` and the
-   in-place state ops are graph-internal nodes (graphNodeId != 0); no fragmentation.
-   Confirms cudagraph-coverage. Note: lever #2 from vllm-gdn-compare ("CUDA-graph the
-   decode step") is ALREADY IN EFFECT in this build and did not close the gap - so it
-   is spent, not pending.
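-
-The projection in answer 2 can be reproduced with simple arithmetic. This sketch assumes
-throughput is 128 sequences (one token each) per step divided by the step time, and takes
-vLLM's step as the 327 ms quoted elsewhere in these notes:
-
-```latex
-\begin{aligned}
-\text{current} &: 128 / 0.3804\,\text{s} \approx 336.5\ \text{tok/s},
-  \quad 327 / 380.4 \approx 86.0\%\ \text{of vLLM} \\
-\text{after gate fold} &: 380.4 - 12.76 = 367.64\,\text{ms},
-  \quad 128 / 0.36764\,\text{s} \approx 348.2\ \text{tok/s},
-  \quad 327 / 367.64 \approx 89.0\%\ \text{of vLLM} \\
-\text{remaining gap} &: 367.64 - 327 \approx 40.6\,\text{ms}
-\end{aligned}
-```
-
-Under those assumptions, even the full 12.76 ms ceiling leaves roughly 40 ms of the gap
-untouched, which is why the gate fold alone cannot be the gap-closer.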
-
-### Verdict against roofline-decode's own sizing test
-roofline-decode stated: "if critical-path gaps total < 57 ms, parity is NOT reachable
-via GDN-gate fusion alone and the gap is elsewhere (GDN core kernel slower than vLLM
-fused_recurrent)." **Measured gaps = 0.225 ms << 57 ms.** Therefore, by that test, the
-53-57 ms / 14% gap is NOT bubble and NOT closable by gating fusion. It lives in
-**kernel GPU-time**, dominated by the `gated_delta_net` recurrence (51.6%, bandwidth-
-bound) and secondarily the FP4 GEMM + quantize stack (29%). The "57 ms = 100% bubble"
-roofline conclusion was an inference from the prefill-contaminated GPU-busy sum
-(~555 ms vs 384 ms "implies overlap"); the node-level decode-only measurement shows
-per-step GPU-busy = wall (no overlap), so that inference does not hold.
-
-### Recommendation (resized)
- The real lever is the `gated_delta_net` recurrence kernel itself (196 ms, 51.6%):
-  match vLLM's `fused_recurrent_gated_delta_rule_packed_decode` (vllm-gdn-compare
-  kernel #4) which folds l2norm + gate + decay + recurrence + state-writeback into a
-  SINGLE pass over the state, reducing HBM round-trips of the state. The win is byte
-  reduction in a memory-bound single-stream step, not bubble removal.
- The lever3-design fusion is still worth doing as a component of that (it removes
-  ~12.76 ms = 3.35% of real byte traffic, and unlike its own RISK section feared, it
-  will NOT be flat because there is no overlap), but on its own it is a ~3% lever, not
-  the gap-closer. Build it folded into a single-pass recurrence kernel, not as an
-  isolated gate fold.
- Next decisive measurement (future GPU-agent run): profile vLLM's decode step at
-  npl128 with the same node-level method and compare per-region GPU-time (GDN
-  recurrence vs GEMM vs attention) to localize exactly where vLLM saves its 53-57 ms.
-  Both engines move near-identical bytes only if vLLM's fused recurrence does
-  not re-stream state; the per-kernel A/B will show whether the gap is the recurrence
-  pass or the GEMM/quantize stack.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## SYNTHESIS (final) - the validated decode-parity picture, ranked plan, and verdict
-
-Reconciles all six investigation sections above plus the three adversarial verdicts
-(Verify A/B/C). One sentence: **the "~60% idle" never existed; the decode step is
-99.94% GPU-busy single-stream, so the 14% gap to vLLM is kernel GPU-time, dominated by
-the bandwidth-bound `gated_delta_net` recurrence (51.6%), and the only gap-closing levers
-are byte-reduction inside that kernel - NOT launch-bubble removal.**
-
-### 1. The proven critical-path decomposition of the decode step
-
-Decisive node-level trace (`nsys --cuda-graph-trace=node`, clean Lever-1 build df1cc97,
-q36-27b-nvfp4 dense, npl128, GB10/48SM/sm_121, commit a7238525, nsysgap.sqlite). One
-steady step = single replayed CUDA graph (graphId=11, 23 replays), all 2965 kernels on
-ONE stream (stream 14, strictly serial -> every inter-kernel gap is pure idle). Window
-383.48 ms.
-
-BUBBLE CLASSIFICATION (the "where is the ~60% idle" answer - it is NOT idle):
-
-| bucket | ms/step | % step | note |
-|---|---|---|---|
-| (a) inter-kernel launch bubbles | ~0 | ~0 | graph replay collapses host launch latency |
-| (b) serial-dependency stalls (GDN chain) | included in 0.225 | 0.06 | each kernel starts < 1 us after prev; zero gaps > 5 us, max 2.40 us |
-| (a)+(b) total exposed idle (LAG sum) | **0.225** | **0.06%** | 1700 kernels back-to-back |
-| (d) between-step HOST gap (cgraph rebuild, new uid) | ~0.2 | ~0.05 | the ONLY graph-non-covered idle; ~0.4% in older eager-tail traces |
-| (c) within-kernel GPU-busy | **380.4** | **99.94%** | this is the whole step |
-
-The nvidia-smi "40%" is within-kernel SM/bandwidth efficiency (~12-16% achieved
-occupancy on memory-latency-bound kernels), NOT wall-clock idle.
-
-KERNEL GPU-TIME DECOMPOSITION of the 380.4 ms busy step (this is where the gap lives):
-
-| kernel | ms | % step | regime |
-|---|---|---|---|
-| `gated_delta_net_cuda<128>` (48x, 4.08 ms/call) | **196.37** | **51.6** | bandwidth-bound f32 recurrent-state R+W (~384 MB R + 384 MB W/layer) |
-| `mul_mat_q` FP4 GEMM (496x) | 92.90 | 24.4 | memory-bound weight stream, 136-CTA tail-bound at decode |
-| `quantize_mmq_nvfp4` (496x) | 17.13 | 4.5 | mandatory act-quant (Lever-2 only relocated it) |
-| `nvjet` lm_head GEMM | 11.91 | 3.1 | |
-| `flash_attn_ext_f16` (16 attn layers) | 11.67 | 3.1 | |
-| `concat_cont` (conv-state splice) | 8.01 | 2.1 | Lever-1 target |
-| `cpy_scalar` (conv-state writeback + dup) | 7.62 | 2.0 | Lever-1 target (the conv-state share) |
-| `k_get_rows_float` | 7.08 | 1.9 | |
-| `k_bin_bcast` (gate mul + add) | 6.59 | 1.7 | Lever-3 gate-fold target (partial - rest is residual adds) |
-| `ssm_conv_f32` | 5.64 | 1.5 | folds into Lever-1 |
-| `unary_gated` (silu/sigmoid) | 5.36 | 1.4 | mostly FFN + output-gate (Lever 3 does NOT touch) |
-| `mul_mat_q_stream_k_fixup` | 3.94 | 1.0 | |
-| `rms_norm_f32` | 3.52 | 0.9 | |
-| `l2_norm_f32` | 0.64 | 0.2 | Lever-3 gate-fold target |
-| `gdn_gather_nonident` | 0.061 | 0.016 | negligible (early-returns on identity ids) |
-
-GDN region (recurrence + conv + concat + cpy + gather + l2norm) >= 210 ms = 55%+ of the step.
-The widely-cited "gated_delta_net 13%, 1.47 ms/call near-vLLM" from nsysab_new.kern.txt was
-PREFILL + the single eager capture step contaminating the average over 1248 calls (range
-0.046-4.42 ms); true steady decode is 4.08 ms/call, 2.8x higher, 51.6% of the step.
-
-### 2. Claims A / B / C: which HOLD, which are REFUTED, and the residual uncertainty
-
-**CLAIM A** ("the ~60% decode GPU-idle is inter-op launch bubbles ON the serial GDN
-chain"): **REFUTED.** Measured idle = 0.225 ms = 0.06%, not the ~53-57 ms the claim
-requires (two-plus orders of magnitude short). Zero gaps > 5 us; CUDA-graph replay
-already collapsed launch latency; serial data-dependency does NOT equal idle when the
-graph dispatches nodes back-to-back. The "40%" was a misread of within-kernel SM
-efficiency; the "555 ms busy-sum > 384 ms wall implies overlap" was a prefill-contaminated
-`--trace=cuda` artifact (each step recorded as one opaque ~380 ms block).
-
-**CLAIM B** ("Lever 3 - gate fusion - moves the wall, unlike P2a/Lever-2, by removing
-serial launch bubbles"): **REFUTED on mechanism.** (i) There are no bubbles to remove
-(0.06%). (ii) The contrast is fictional: the step is single-stream with ZERO overlap
-anywhere, so P2a/Lever-2 were NOT flat because they "optimized overlapped work" - P2a
-tuned the prefill large-M GEMM (decode GEMMs are a different 136-CTA tail regime) and
-Lever-2 merely relocated mandatory quantize work into the GEMM prologue (net zero).
-(iii) Where the claim is trivially true (any kernel removal cuts wall in a 99.94%-busy
-single-stream step), the slice Lever 3 actually fuses ceilings at **12.76 ms = 3.35%**
-(k_bin_bcast 6.59 + silu/sigmoid 5.36 + l2_norm 0.64 + softplus 0.13 - and even that
-over-counts, since silu is mostly untouched FFN/output-gate). So the wall DOES move, but
-only ~3% (380 -> ~367 ms, 86% -> ~89% of vLLM), and NOT for the claimed reason. Lever 3
-is a component, not the gap-closer.
-
-**CLAIM C** ("the residual gap is software-closable LATENCY, not a GB10 hardware floor"):
-**REFUTED as worded** (no latency, no idle to close - same data as A). The "not a hardware
-floor" half is **UNSETTLED, not proven.** vLLM hits 327 ms on the same silicon, so it is
-not an absolute hard floor - but whether the dominant 51.6% `gated_delta_net` term is
-software-closable in BIT-EXACT form turns on one unmeasured quantity (below).
-
-RESIDUAL UNCERTAINTY (the single open question that decides everything):
- **The DRAM byte-traffic ratio of llama's recurrence vs vLLM's.** Every section above
-  ESTIMATED the GDN state bytes (~190 GB/s effective, ~70% of 273 GB/s peak); none MEASURED
-  it. If llama's `gated_delta_net_cuda<128>` moves ~2x the minimal (s0-read + s1-write)
-  bytes because the un-fused gate/l2norm/writeback/gather ops re-stream state through HBM,
-  then the 51.6% is software-closable by a single-pass fused recurrence (Claim C spirit
-  HOLDS). If llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same,
-  the recurrence is at the GB10 LPDDR5x floor for this state size -> the gap is a
-  hardware/architecture floor and is NOT closable in bit-exact form (Claim C REFUTED on
-  both halves). This is the one measurement that converts the verdict from "refuted as
-  worded" to a definitive yes/no.
- **The MoE model (qwen35moe) is untested.** At B=128 MUL_MAT_ID can trip
-  [TAG_MUL_MAT_ID_CUDA_GRAPHS] (`ne[2] > mmvq_mmid_max`) and disable the WHOLE MoE-decode
-  graph into eager, where the ~3100 per-step launches re-dispatch serially on the Grace
-  cores and inter-op bubbles WOULD reappear. For MoE only, Claim A could partially hold.
-  The dense 335 tok/s headline is fully settled.
-
-### 3. Ranked implementation plan for the remaining ~14% (57 ms/step, 384 -> 327)
-
-Every win must come from kernel GPU-time (bytes), because bubbles = 0 and both engines
-share identical bandwidth/compute floors. Ranked by expected recovery.
-
-| # | Lever | ms/step recovered | -> % of vLLM | bit-exact | tractability | gate |
-|---|---|---|---|---|---|---|
-| **1** | **Single-pass fused GDN recurrence** (fold l2norm+gate+decay+recurrence+state-writeback+gather into ONE pass over state, mirroring vLLM `fused_recurrent_gated_delta_rule_packed_decode`) - cuts state HBM round-trips | **0 to ~40** (= the byte-delta; UNKNOWN until ncu) | 86% -> up to ~98% | near (l2norm reduction; KL < ~1e-3) | HIGH (kernel rewrite) | **ncu byte-ratio test FIRST** |
-| 2 | **Conv-state concat -> ssm_conv fusion** (Lever 1): pass conv-state + new token as separate srcs, update conv state in place (vLLM `causal_conv1d_update`); removes concat_cont + the conv-state cpy | **~8-12** (concat 8.01 + cpy share of 7.62) | +2-3% | YES | MEDIUM | no-regret, build regardless |
-| 3 | **Gate-chain fold** (Lever 3 as designed): sigmoid-beta + softplus+dt+ssm_a gate + q/k l2norm into the recurrence kernel | **~12.76 ceiling** (3.35%) - but SUBSUMED by #1 | +3% | near (l2norm) | MEDIUM | build as a COMPONENT of #1, not standalone |
-| 4 | **bf16 recurrent + conv state** (Lever 5): halve the 196 ms recurrence + conv traffic; keep f32 in-register accumulation | **~70-90** (if floor-bound) | could reach/exceed parity | NO (parity-tolerance decision; must match vLLM stored dtype) | HIGH (rewrite + parity validation) | the ONLY lever that moves the floor kernel; separate precision track |
-| 5 | gdn_gather skip-launch at steady decode | ~0.06 | ~0 | YES | trivial | not worth it (micro) |
-| 6 | GDN occupancy split | 0 | 0 | - | - | NOT a lever: 196608 CTAs / 4096 waves, already saturated, bandwidth-bound |
-| 7 | quantize_mmq attack (Lever 2) | 0 | 0 | - | - | SPENT - relocated mandatory work, proven flat |
-| 8 | decode CUDA-graph capture | 0 | 0 | - | - | SPENT - ALREADY in effect (graphId=11), did not close gap |
-| 9 | persistent cgraph (uid fast-path) | ~0.2 (0.05-0.4%) | ~0 | YES | MEDIUM | second-order to the SSM floor |
-
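-Lever numbers from here on refer to the # column of this table, not to the historical
-Lever-N names in parentheses. Levers 1, 3, and the gather of #5 are the SAME kernel
-rewrite: build them together as a single-pass recurrence. Levers 6/7/8 are dead
-(at-floor or already-shipped). Lever 4 is a distinct, bit-exactness-breaking precision track.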
-
-### 4. The honest verdict and the single highest-value next step
-
-**Is true (bit-exact) decode parity reachable?** UNCERTAIN, and it hinges entirely on the
-unmeasured byte ratio:
- If llama's recurrence re-streams state (~2x bytes from un-fused ops): YES - a single-pass
-  fused recurrence (Lever 1) plus conv fusion (Lever 2) plausibly recover ~20-40 ms, taking
-  llama to ~345-365 ms = ~90-95% of vLLM, near-bit-exact (gate on KL tolerance).
- If llama is already at the GB10 bandwidth floor for f32 state: NO in bit-exact form - the
-  57 ms is a hardware floor, and only bf16 state (Lever 4, non-bit-exact) closes it.
-
-Either way, the gating-fold-alone path tops out at ~89% of vLLM, so the project should NOT
-ship the isolated gate fold as "the parity lever."
-
-**SINGLE highest-value next IMPLEMENTATION step:** build the **single-pass fused GDN
-recurrence kernel** (Lever 1 = fold gate + l2norm + state-writeback + gather into one pass
-over the recurrent state) - BUT gate the build on one cheap measurement first, because it
-is a HIGH-effort kernel rewrite that is worthless if the recurrence is already byte-minimal.
-
-**The measurement that confirms it before over-investing (one short GPU run, gap-analysis
-agent only):** `ncu` on `gated_delta_net_cuda<128>` at B=128 vs vLLM's
-`fused_recurrent_gated_delta_rule_packed_decode_kernel` for identical layer dims, two
-counters:
- `dram__bytes.sum` (actual DRAM bytes/call)
- `dram__throughput.avg.pct_of_peak_sustained_elapsed` (achieved % of 273 GB/s)
-
-Decision rule:
- llama moves ~2x minimal bytes OR vLLM moves materially fewer for the same math -> redundant
-  un-fused state round-trips -> BUILD the single-pass fused recurrence; predicted recovery
-  scales with the byte delta (up to ~40 ms). This is the gap-closer.
- llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same -> the
-  recurrence is at the GB10 hardware floor -> do NOT build the fusion for throughput (only
-  the ~3% gate-fold ceiling remains); the sole remaining lever is bf16 state (Lever 4,
-  accept non-bit-exact), and bit-exact parity is NOT reachable.
-
-**No-regret parallel work** (build regardless of the ncu outcome, bit-exact, medium effort):
-the conv-state concat -> ssm_conv in-place fusion (Lever 2, ~8-12 ms = +2-3% toward parity),
-which removes concat_cont (8.01 ms) and the conv-state writeback cpy off a bandwidth-bound,
-single-stream step where their full GPU-time is wall-clock.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/DECODE_GAP_STUDY.md
+++ b/backend/cpp/llama-cpp/patches/paged/DECODE_GAP_STUDY.md
@@ -1,185 +0,0 @@
-# llama-server vs vLLM: decode-step gap decomposition (DGX Spark, GB10 / sm_121)
-
-Profiling study (no engine changes). Question: in a matched apples-to-apples setup (both
-batched servers, NVFP4-class weights, prefix caching on, both eager), why is
-`llama-server` ~4-6x slower **per decode step** than vLLM on Qwen3-32B at a
-1024-token shared-prefix / batch-32 fan-out, and which part of the gap is closable vs structural?
-
-Hardware: NVIDIA GB10 (sm_121), unified LPDDR5X. Model: Qwen3-32B, 64 layers.
-llama side: `~/llama-paged-dev/build-cuda/bin/llama-server`, `q3-32b-nvfp4-dense.gguf`
-(NVFP4 weights, type-40 FP4-MMA path), `-ngl 99 --parallel 32 -c 40960 -fa on`,
-`GGML_CUDA_DISABLE_GRAPHS=1` (eager). vLLM 0.23.0 NVFP4A16 (W4A16/Marlin),
-`--enforce-eager`. Workload: 1024-token shared prefix + unique 32-token suffix,
-K=32 concurrent, generate 64. All profiling scripts are dev-tree only
-(`~/bench/decode_study/`); minimal in-code timers were not needed (server already
-reports per-slot `eval time`, which excludes prompt-eval = pure decode).
-
-## TL;DR
-
-1. **The real-server decode is GPU-BOUND, not host-bound.** During steady decode
-   the GPU is **~94.6% utilized** (nvidia-smi, real run) / 85-95% busy (nsys).
-   Per-slot CPU sampling, detokenize, and `update_slots` are fully hidden: a 5-stage
-   sampler chain gives effectively the *same* step time as greedy (1346 vs 1343 ms). The
-   "GPU stalls on the CPU serving loop" hypothesis is **refuted** for this workload.
-2. **At 1024 context the decode step is ~84% KV/attention, ~16% weight GEMM** - the
-   opposite of the thin-batch-GEMM story. Attention scaling with context length, not
-   the matmul, is the load-bearing cost.
-3. **The worktree's paged KV engine is a decode REGRESSION: ~1.85x slower than
-   stock** at 1024 ctx (paged 1279-1343 ms/step vs stock 650-729 ms/step). It
-   gathers K/V/mask into a contiguous buffer (`ggml_get_rows`) every layer every
-   step, then runs a dense FA kernel - paying a full extra KV read+copy that vLLM's
-   in-kernel PagedAttention never pays. Paging helps prefix-prefill memory; it hurts
-   decode latency.
-4. Even **stock** llama-server (~650-729 ms/step) is **~4-5x slower than vLLM**
-   (~120-185 ms/step). The residual gap is the **long-context decode-attention
-   kernel** and, secondarily, the **thin-batch FP4 weight GEMM** - both kernel-maturity
-   gaps vs vLLM's FlashInfer/FA paged-decode + Marlin, not serving-loop gaps.
-
-## The measured numbers (batch 32, server-reported pure-decode step time)
-
-`server_decode_step_ms` = max / mean-of-top-8 of per-slot `eval time ms-per-token`
-(the most-contended, full-batch-32 slots; excludes prompt eval).
-
-| config                                   | decode step ms (max / top8) | client wall ms/step |
-|------------------------------------------|-----------------------------|---------------------|
-| paged, ctx 1024, greedy                  | 1343 / 1279                 | 1468                |
-| paged, ctx 1024, **heavy 5-sampler**     | 1346 / 1280                 | 1470                |
-| **stock** (no paging), ctx 1024, greedy  | **729 / 650**               | 768                 |
-| paged, **ctx 64** (short), greedy        | **215 / 215**               | 253                 |
-| vLLM NVFP4A16, ctx 1024 (K=32)           | **~120-185** (270 tok/s)    | -                   |
-
-The brief's reference ~828 ms/step sits between the stock (650-729) and paged
-(1279-1343) numbers measured here; the decomposition below is what is robust. Our
-fan-out shares no prefix across the 32 slots (each slot independently prefills 1056
-tokens - confirmed in the log), so the 32 sequences are genuinely concurrent and the
-"max" slot is maximally contended, which is why our paged max (1343 ms) runs above 828.
-
-### Context sweep - decode step is attention-scaling, not fixed overhead
-
-Pure-decode step vs shared-prefix length (paged, batch 32):
-
-| prefix ctx | decode step ms |
-|-----------|----------------|
-| 64        | 215            |
-| 128       | ~290           |
-| 256       | ~410           |
-| 512       | ~660           |
-| 1024      | ~1280          |
-
-Roughly linear in context length: ~1 ms of added step time per added context token.
-The **215 ms at ctx 64 is the fixed floor** (weight GEMM + activations + norm/rope +
-loop + sampling, attention negligible). Everything above it scales with KV length =
-attention + KV plumbing. At 1024 ctx the fixed floor is only ~16% of the step.
-
-## Where the ~1280 ms paged decode step goes (nsys, pure-decode window)
-
-`nsys profile --delay=70 --duration=25 --trace=cuda` windowed onto steady 32-way
-decode (`srv_decode2.nsys-rep`; an earlier 25-60s window was discarded because nsys's
-own slowdown stretched the 32 prefills into it, inflating GEMM to a misleading 58%).
-GPU busy in-window 85.5% (nsys adds gaps; the real run is ~94.6% by nvidia-smi).
-
-| bucket                         | % GPU time | abs (of ~1280 ms) | what it is |
-|--------------------------------|-----------:|------------------:|------------|
-| `flash_attn_ext_f16` ATTENTION | **47.7%**  | ~610 ms           | decode attention over the 1056-cell KV |
-| `cpy_scalar` KV copy/cast      | 18.3%      | ~234 ms           | KV write + f32->f16 casts |
-| `get_rows/set_rows` KV gather  | 17.8%      | ~228 ms           | **paged** gather of K/V/mask to contiguous |
-| `mul_mat_q` + `quantize_mmq`   | 15.7%      | ~201 ms           | NVFP4 weight GEMM (+ activation requant) |
-| rmsnorm / silu / rope / add    | ~0.6%      | ~8 ms             | elementwise |
-
-Cross-check: the GEMM bucket (~201 ms) matches the ctx-64 floor (215 ms) - i.e. the
-weight matmul is ~the entire short-context step, and is context-independent, as
-expected. KV/attention buckets (47.7+18.3+17.8 = **83.8%**) match the context-sweep
-finding that ~84% of the step scales with context.
-
-Power signature: ~33-36 W at 94% "utilization" (GB10 can pull far more). High util%
-+ low power = the kernels are **memory/latency-bound, not compute-saturated** - the
-classic decode signature (stream 19 GB of NVFP4 weights + a growing KV every step).
-
-### Stock vs paged decomposition
-
- **Stock** (~650 ms): ~215 ms GEMM floor + ~435 ms attention/KV (contiguous KV read
-  directly by the FA kernel, **no gather**).
- **Paged** (~1280 ms): same ~215 ms floor + ~610 ms attention + **~455 ms paged
-  gather/copy overhead** (the `get_rows` of K/V/mask plus the extra KV copy that
-  feeds the dense FA kernel). That ~455 ms (~36% of the step) is the paged engine's
-  self-inflicted cost and is the entire ~1.85x stock->paged regression.
-
-## vLLM decode architecture mapped onto each llama bucket
-
-vLLM at ~120-185 ms/step is faster on **every** bucket:
-
-| llama bucket (paged)        | ms    | vLLM equivalent | does vLLM avoid it? |
-|-----------------------------|-------|-----------------|---------------------|
-| paged KV gather (get_rows)  | ~228  | PagedAttention reads blocks **in-kernel** via a block table | **Yes - entirely.** No gather op exists. |
-| KV copy/cast                | ~234  | KV written once into block pool; FA reads it in place | Mostly - no per-step recopy |
-| decode attention            | ~610  | FlashInfer / FA paged-decode GQA kernel, split over KV | Same op, far faster kernel on sm_121 |
-| weight GEMM + act quant     | ~201  | fused Marlin/Machete W4A16 dequant+MMA, no separate quant pass | Faster + removes the requant kernel |
-| CPU sampling / loop         | ~0 (hidden) | on-GPU batched sampling | N/A here - already hidden on llama side too |
-
-vLLM's whole step (~150 ms) is **less than llama's GEMM floor alone (~215 ms)**, so
-vLLM is ahead on the matmul *and* the attention *and* avoids the gather. The gap is a
-stack of kernel-efficiency wins, not one silver bullet.
-
-## Ranked levers - closable vs structural
-
-1. **Remove the paged gather regression. [Tractable, ~455 ms / ~36% on the paged
-   path; net-zero risk - it is a regression]** The worktree's paged engine makes
-   decode 1.85x slower than stock by gathering K/V/mask to contiguous every layer
-   every step (patch 0003 `ggml_get_rows`). For latency-bound decode, **do not enable
-   paged KV** - it only ever helps prefix-prefill *memory*, never decode latency.
-   Fully recovering this *and* keeping paging requires reading paged blocks
-   in-kernel like vLLM (a from-scratch paged-attention CUDA kernel) - see lever 2.
-
-2. **Long-context decode-attention kernel. [Biggest real lever, ~435 ms of stock /
-   ~610 ms of paged; partly structural]** Even stock is attention-bound at 1024 ctx.
-   llama.cpp's `flash_attn_ext_f16` decode path is ~4-5x slower than vLLM's
-   FlashInfer/FA paged-decode GQA kernel on this Blackwell-class part. This is the
-   cost that *grows with context* - exactly the regime the brief targets. Tractable in
-   principle (a proper flash-decoding / split-K-over-KV kernel, and a true in-kernel
-   paged read that also kills lever 1's gather), but it is deep CUDA work on a new
-   arch and partly gated by kernel maturity on sm_121. **Highest-impact, hardest.**
-
-3. **Thin-batch FP4 weight GEMM floor. [Tractable, ~201-215 ms / 15-30%; bounded]**
-   The NVFP4 `mul_mat_q` + separate `quantize_mmq` activation pass is memory-bound and
-   less efficient than vLLM's fused Marlin/Machete W4A16. Fusing dequant into the MMA
-   and folding the activation quant into the GEMM is tractable kernel work. Bounded
-   impact: the floor cannot drop below weight-read-bound (~19 GB / memory BW per step).
-
-4. **Host serving loop / per-slot sampling. [NOT a lever]** Measured zero: greedy ==
-   heavy-sampler step time; GPU 94.6% busy. On-GPU/batched sampling buys nothing until
-   the kernels (levers 1-3) get fast enough to expose host overhead. Refutes the
-   "host-bound serving loop" hypothesis for this decode-bound workload.
-
-5. **Continuous-batch scheduler. [NOT the gap / structural elsewhere]** llama-server
-   already fuses all 32 slots into one decode step (one set of kernels per step over
-   batch 32 - confirmed in the trace). vLLM's continuous/chunked-prefill batching wins
-   on *mixed* prefill+decode overlap, but the steady decode-step gap measured here is
-   kernel-bound, not scheduler-bound.
-
-## Honest bottom line
-
-The ~4-6x per-step gap is **GPU-kernel-bound**, and it decomposes as:
-
- ~36% of the *paged* step is a **self-inflicted gather regression** - remove it
-  (don't run paged for decode-latency workloads).
- The remaining ~4-5x vs vLLM (true even for stock) is **kernel efficiency**:
-  llama.cpp's long-context decode-attention and thin-batch FP4 GEMM are slower than
-  vLLM's PagedAttention + Marlin on GB10. That is a **kernel project** (in-kernel
-  paged attention + flash-decoding + fused W4A16 GEMM), not a serving-loop project.
- Sampling, detokenize, `update_slots`, and the continuous-batch scheduler are **not**
-  the gap; the GPU is ~95% busy on memory-bound kernels the whole step.
-
-What is closable: lever 1 (immediately, by not paging), lever 3 (bounded, with kernel
-work). What is structural / hard: lever 2 (the decode-attention kernel + a real
-in-kernel paged read), which is where the context-scaling gap actually lives and where
-any serious effort to approach vLLM on GB10 must go.
-
-## Reproduction (dev-tree only, `~/bench/decode_study/`)
-
- `launch_srv.sh` / `runcfg.sh` - launch llama-server (paged on/off) and a config.
- `client.py` - K=32 token-id fan-out (1024 prefix + 32 suffix), `SAMP=greedy|heavy`.
- `d2drv.sh` - nsys pure-decode window (delay 70s past prefill) -> `srv_decode2.nsys-rep`.
- `cat2.py` - kernel-time categorization from the sqlite export.
- vLLM side: `~/bench/run_vllm.sh` + `vllm_prefix.py` (K=32, ~270 tok/s).
--- a/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
+++ b/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
@@ -1,756 +0,0 @@
-# Decode parity exploration (post-SSM-fix) - per-agent findings
-
-Post-SSM-fix decode (patches 0018 in-place state write-back + 0019 fused gather):
-dense q36-27b-nvfp4 decode_agg = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact.
-The remaining +54% to parity is the question each section below probes. All numbers
-DGX GB10 (sm_121), fusion OFF baseline, `decode_agg` = `S_TG t/s`.
-
---
-
-## Section: per-token-latency (critical path / host-loop) - READ-ONLY
-
-**Verdict: the per-step critical path and host loop are NOT the residual lever.
-Post-SSM the GPU is still ~99% busy at npl128; the entire exposed-idle budget is
-~0.65% of the step (~2.4 ms), of which graphs already remove the within-step half
-(0.34%) and the between-step host gap is ~2 ms/step (~0.4% post-SSM). The 64-layer
-sequential chain does NOT under-fill the GPU at batch 128 - every kernel's grid
-saturates the SMs on its own. The +54% to parity is GPU kernel work (FP4 GEMM
-efficiency + LPDDR5x weight bandwidth), not serialization or host overhead.**
-
-### 1. Measured exposed-idle structure (a2_nsys pre-SSM rep, read-only sqlite sweep)
-
-`paged_off_npl128.sqlite`, steady window 40-97% of trace (14.78 s, ~16.5 decode
-steps at the pre-SSM ~896 ms/step). Overlap-correct interval-union sweep:
-
-| activity set            | busy %  | exposed idle |
-|-------------------------|---------|--------------|
-| kernels only            | 80.25%  | 19.74%       |
-| kernels + memcpy (all)  | 99.35%  | **0.65%**    |
-
- The 19.4% kernels-only gap = **841 big gaps (median 3.35 ms, ~51/step)** that are
-  filled by D2D memcpy. These ARE the per-layer gated-DeltaNet recurrent-state copies
-  (the `gated_delta_net -> ggml_cpy(state->cache) -> next layer reads state` chain).
-  They were a real critical-path serialization, and **patches 0018/0019 removed exactly
-  these** (D2D bucket 18.9% -> 0.23%; get_rows gather 18.8% -> 0.7%). Decode rose
-  +37.8% (186 -> 256 t/s), ~matching the work removed -> the kernels reflowed
-  back-to-back, so post-SSM these big gaps are CLOSED, not re-exposed (inferred from
-  the throughput scaling; the post-SSM nsys was not re-profiled by this read-only agent).
- The TRUE exposed idle (kernel+memcpy union) is **0.65%**: 18 host gaps >=0.5 ms,
-  **median 2.06 ms, max 2.85 ms, ~1.1/step**. This is the single between-step host gap
-  (sample-128 + `update_slots` + next-batch build) that does NOT overlap GPU compute.
- Within-step launch gaps: 24,190 micro-gaps, median 2.14 us, summing to 50.6 ms =
-  **0.34%** of the window - the pure launch overhead that CUDA graphs collapse
-  (measured 0.37% -> 0.11% in A2_CUDAGRAPH_DECODE; graphs already engage on the
-  default paged decode with a 256-token reset cadence).
-
-### 2. Post-SSM scaling of the FIXED host gap
-
-The ~2 ms/step between-step host gap is FIXED work (independent of GPU kernel time).
-As decode accelerated it grew only as a fraction of a shrinking step:
-
-| build (t/s)    | step ms @npl128 | host gap | host gap % of step |
-|----------------|-----------------|----------|--------------------|
-| pre-SSM (146)  | ~877            | ~2 ms    | 0.24%              |
-| post-SSM (256) | ~499            | ~2 ms    | **~0.40%**         |
-| vLLM (391)     | ~328            | (n/a)    | (would be ~0.6%)   |
-
-Even fully removing it (perfect overlap) buys ~0.4%. It is a second-order floor, not
-the lever - it only becomes material once the kernels are fast enough to drop GPU-busy
-below the host time, which is not the case at 65% of parity.
-
-### 3. The 64-layer chain does NOT under-fill the GPU at batch 128
-
-The decode is an intrinsically sequential depth-64 chain (autoregressive: layer N
-needs layer N-1; cannot be parallelized across layers). The question is whether each
-individual kernel fills the SMs at batch 128. It does:
-
- **GDN kernel** (`gated_delta_net.cu`): launch grid `dim3(H, n_seqs, ceil(S_v/4))`
-  = `48 x 128 x 32 = 196,608 blocks` (dense, H=48 value heads, S_v=128). Block
-  `(warp_size, 4, 1)`. Massively oversubscribes the GB10 SMs. Each warp loads its
-  state shard into registers once and runs a single `n_tokens==1` iteration - O(1) in
-  context (confirmed flat across 4x ctx in GDN_DECODE_VERIFY).
- **FP4 GEMM** (`mul_mat_q`, mmq_x=128): M=128 token tile, well into the M-batched
-  regime, full SM occupancy (and Track B P2a already showed it goes 2 CTA/SM).
- The 99.35% kernel+memcpy busy reading IS the direct proof there is no under-fill at
-  npl128: if the chain under-filled, busy% would be well below 99%.
-
-Under-fill only appears at LOW batch (npl32/npl4), where it manifests as the
-weight-bandwidth/GEMV regime (npl32 = 170 t/s vs npl128 = 256): fewer tokens amortize
-the same per-step weight read, NOT idle SMs. That is a bandwidth floor, not a
-host/scheduler problem.
-
-### 4. What the host actually does per step (eager rep runtime API)
-
-Steady-window `CUPTI_ACTIVITY_KIND_RUNTIME` totals (host-thread wall, overlaps GPU):
-
-| API                       |   n   | total   | avg     |
-|---------------------------|-------|---------|---------|
-| cudaStreamSynchronize     | 1723  | 7775 ms | 4513 us |
-| cudaLaunchKernelExC       | 30983 | 4045 ms | 131 us  |
-| cudaLaunchKernel          | 20385 | 2694 ms | 132 us  |
-| cudaMemcpyAsync           | 2085  |   96 ms |  46 us  |
-
-~104 stream-syncs/step and ~3100 kernel launches/step in eager mode (collapsed by
-graphs to ~900 launches/step). The 7.8 s of sync is the host BLOCKING on the busy
-GPU (it overlaps GPU compute, it is NOT exposed idle) - the GPU stays 99.4% busy. The
-sampled-token path is `cudaMemcpyAsync` (96 ms total, negligible, non-blocking). The
-only NON-overlapped residue is the ~2 ms/step between-step gap in section 1.
-
-### 5. vLLM host-loop comparison (per VLLM_DECODE_GROUNDING.md)
-
-vLLM's eager decode is host-cheap BY CONSTRUCTION and hides the host fully behind the
-async CUDA stream WITHOUT pipelined scheduling (`async_scheduling` was OFF; it won the
-2.4x with synchronous scheduling): persistent pre-allocated input buffers updated by
-vectorized numpy (no per-token Python), attention metadata `build()` once per step
-reused across all layers, no GPU->CPU sync in the hot path, sampled-token D2H
-non-blocking + event-gated, and a fixed small launch sequence (~2 ops/Linear). The
-next-step host prep overlaps the current-step GPU compute on the async stream. The key
-asymmetry vs llama: vLLM builds its graph ONCE and reuses persistent device
-KV/block metadata; ggml rebuilds/reallocates the cgraph each decode step (new
-`cgraph->uid`) and re-dispatches ~3100 launches from the loop on the weak Grace cores.
-
-But this asymmetry is hidden under GPU compute on BOTH sides at npl128: llama's host
-loop is a 0.4% exposed gap, not a 2x lever. vLLM's host cheapness is why ITS step is
-328 ms host-free, but llama's 499 ms is also ~99% GPU - the 171 ms difference is GPU
-kernel time (FP4 GEMM), not host.
-
-### 6. Is any host/serialization lever CUDA-graph or scheduler addressable?
-
- **Within-step launch idle (0.34%)**: CUDA-graph addressable, ALREADY captured by
-  default (0.37 -> 0.11%). Worth ~0% of decode_agg (measured +0.1-0.8%, noise).
-  Nothing left to win here.
- **Between-step host gap (~2 ms, ~0.4%)**: NOT removed by a graph (the graph replays
-  the forward; the host still samples + runs `update_slots` + rebuilds the batch
-  between replays). It is SCHEDULER addressable - overlap step N+1's host prep with
-  step N's GPU compute, mirroring vLLM's persistent-buffer + build-once-reuse +
-  non-blocking-D2H pattern (and ideally reuse the ggml cgraph across steps instead of
-  rebuilding it every ubatch). But the ceiling is ~0.4% of the step, so it is a
-  cleanup, not a parity lever.
- **The +54% to parity is none of the above.** It is GPU kernel work: post-SSM the FP4
-  GEMM family is ~48% of decode (the dominant residual), GDN recurrence ~22.5%, and the
-  decode is weight-bandwidth/latency-bound on LPDDR5x (Track B P2a: a -24.7% FP4-GEMM
-  kernel left decode_agg FLAT, the freed compute became idle gaps -> decode is not
-  GEMM-compute-bound but bandwidth/latency-bound). The lever lives in cutting DRAM
-  traffic per step (fused act-quant to drop the separate `quantize_mmq` pass, native
-  FP4-MMA, and/or NVFP4-dense weight quant), NOT in the host loop or CUDA graphs.
-
-### Evidence
- Read-only sqlite sweeps on `~/bench/a2_nsys/paged_off_npl128.sqlite` (this agent).
- `gated_delta_net.cu` launch grid (DGX `~/llama-paged-dev`).
- A2_CUDAGRAPH_DECODE.md, SSM_DECODE_FIX_RESULTS.md, GDN_DECODE_VERIFY.md,
-  VLLM_DECODE_GROUNDING.md, THROUGHPUT_B_P2a_POSTSSM_RESULTS.md.
-
---
-
-## Section: gdn-source-compare (llama gated_delta_net.cu vs vLLM fused_recurrent_gated_delta_rule)
-
-### Model config (Qwen3.5-27B dense, from vLLM config.json)
- linear_key_head_dim K = 128, linear_value_head_dim V = 128
- linear_num_key_heads = 16, linear_num_value_heads = 48 (GVA 3:1), conv_kernel = 4
- 64 layers, full_attention_interval 4 -> 48 linear (GDN) : 16 full-attn
- Recurrent state per (seq, v-head) = V*K = 128*128 = 16384 f32 = 64 KiB.
-  Per layer per seq = 48 * 64 KiB = 3 MiB. Both engines store state in f32.
-
-### Which kernels run at decode
- llama: ggml_gated_delta_net_inplace_ids -> gated_delta_net_cuda<S_v=128, KDA=false, keep_rs_t=false>.
-  Gate is SCALAR per head (graph reshapes gate/beta to ne[0]=1), so the cheaper !KDA branch runs (one expf per token, not per-channel).
- vLLM: enable_packed_recurrent_decode -> fused_recurrent_gated_delta_rule_packed_decode_kernel
-  (the dedicated single-token decode kernel, NOT the generic varlen fwd kernel).
-
-### The state HBM traffic is IDENTICAL - it is NOT the lever
-Per (seq, v-head) per decode token both engines read 64 KiB state + write 64 KiB state, f32, coalesced.
-The dominant memory term is equal. llama is NOT moving more state bytes than vLLM.
-=> The 1.46 ms/call is llama achieving LOWER effective bandwidth on the SAME bytes,
-   plus extra non-state work, NOT a fundamental HBM-traffic deficit. Hence closable.
-
-### Algorithmic / parallelization delta (the real differences)
-
-1) Reduction strategy (biggest structural difference)
-   - llama: WARP-PER-OUTPUT-COLUMN. State stored transposed M[col][i]=S[i][col]. Each warp owns
-     one V-column; the contraction over the 128 K-rows is a cross-lane warp_reduce_sum.
-     TWO warp_reduce_sum per token (one for kv = S^T@k, one for attn = S^T@q) = ~10 shuffle
-     rounds on the critical path; with n_tokens=1 they are NOT amortized.
-   - vLLM: THREAD-PER-OUTPUT-ROW. b_h is a [BV,BK]=[32,128] tile; each thread owns a FULL K-row
-     of state. sum(b_h*b_k, axis=K) and sum(b_h*b_q) are THREAD-LOCAL 128-wide reductions -
-     ZERO cross-thread shuffles. Outer-product update b_h += b_v*b_k is also thread-local.
-   Same FLOPs, but vLLM has no shuffle-reduction latency in the recurrence.
-
-2) Occupancy / launch geometry (likely the dominant bandwidth gap)
-   - llama: block = (32 lanes, 4 warps) = 128 threads; grid = (H=48, n_seqs, ceil(128/4)=32).
-     Per (head,seq) it launches 32 blocks * 128 threads = 4096 threads to touch a 16384-elem state
-     (only 4 state elems/thread). launch_bounds(128, 2) budgets registers for >=2 blocks/SM; with
-     s_shard[4]+k_reg[4]+q_reg[4]+addressing the register pressure caps it near ~2 blocks = 8 warps/SM
-     (~12-16% occupancy on GB10). A memory-bound kernel at ~8 warps/SM cannot generate enough in-flight
-     loads to saturate 273 GB/s -> low achieved bandwidth on the state read/write.
-   - vLLM: 1 warp/program (num_warps=1), grid (NV=4, B*HV), small register footprint, num_stages=3
-     software-pipelines (prefetches) the state load. Far higher memory-level parallelism per SM.
-
-3) Redundant non-state traffic in llama
-   - q,k re-loaded by EVERY column-warp: 128 column-warps/head each reload the same 128-float q and k
-     => ~128x amplified L2 loads of q/k per head/token (vLLM reloads ~4x, once per NV program).
-     Small (L2-resident) but adds load-issue + L2 pressure competing with the state stream.
-   - Output store: llama writes attn_data[col] from lane 0 only (31/32 lanes idle), scattered
-     single-float stores; vLLM stores a contiguous BV=32 vector (coalesced).
-
-4) Fusion delta (per-layer kernel-launch / HBM round-trip count)
-   - vLLM packed_decode FUSES into ONE kernel: q/k l2norm + q*scale + softplus(a+dt_bias) +
-     (-exp(A_log)) gate + sigmoid(beta) + the recurrence + state write-back.
-   - llama computes these as SEPARATE ggml ops/kernels in the graph before the GDN op:
-     ggml_l2_norm(q), ggml_l2_norm(k), ggml_add(+dt), ggml_softplus, ggml_mul(gate),
-     ggml_sigmoid(beta) (+ conv/silu), each a launch + small HBM round-trip. Plus a separate
-     gdn_gather_nonident_kernel launch per layer (a no-op at steady-state decode: every block
-     early-returns on the identity check, but still a grid launch of n_seqs blocks).
-   Across 48 linear layers this is ~6-10 extra small kernels/layer (~300-480 extra launches/token).
-   Whether this dominates depends on CUDA-graph capture (see A2_CUDAGRAPH_DECODE.md); if captured,
-   launch latency is hidden and the cost reverts to the per-op HBM round-trips + dependency gaps.
-
-### What a faster llama GDN decode kernel would need (optimization scope)
- A. Re-parallelize like vLLM: thread/lane owns a full K-row (or K-shard) so the kv and attn
-  contractions become register-local FMAs, eliminating the two warp_reduce_sum per token.
- B. Raise occupancy for the memory-bound regime: drop/raise the launch_bounds minBlocks hint
-  (the `,2)` is too low), shrink the block, cut registers, and add a software-prefetch of the next
-  state shard so more state loads are in flight per SM. This directly lifts achieved bandwidth on
-  the equal state bytes - the single highest-leverage change.
- C. Load q,k ONCE per (head,seq) into shared memory instead of 128x per-column reload; coalesce
-  the output store across the warp.
- D. Fuse the gate/l2norm/scale (softplus, exp(A_log), sigmoid, l2norm) INTO the recurrence kernel,
-  reading raw a/b/A_log/dt_bias from registers, removing ~6 elementwise passes + their HBM round-trips
-  per layer (matches vLLM's packed_decode). Drop the gather no-op kernel at steady-state decode
-  (or fold the identity check into the recurrence prologue, which it already partly does).
- E. (Longer term) bf16 state would HALVE the dominant traffic, but vLLM keeps f32 too, so this is a
-  divergence-from-reference not a parity lever.
-
-### Bottom line
-llama's GDN decode kernel is NOT moving more state HBM bytes than vLLM (the dominant term is equal),
-so the 1.46 ms/call is an EFFICIENCY gap, not a traffic floor: (1) cross-warp shuffle reductions on
-the n_tokens=1 critical path, (2) low occupancy (~8 warps/SM from launch_bounds + register pressure)
-starving memory-level parallelism so the equal state bytes move at lower effective bandwidth, plus
-(3) 128x redundant q/k L2 loads and (4) ~6-10 unfused gate/norm elementwise kernels per layer that
-vLLM folds into one packed-decode kernel. Highest-leverage fixes: raise occupancy + prefetch (B) and
-row-local reductions (A); secondary: gate/norm fusion (D) and q/k shared-mem reuse (C).
-
---
-
-## Section: validate-findings (adversarial re-derivation from raw DGX data) - READ-ONLY
-
-Re-queried `CUPTI_ACTIVITY_KIND_KERNEL` + `CUPTI_ACTIVITY_KIND_MEMCPY` directly (kernel and
-memcpy summed separately so D2D is never lumped into compute), not from summary text.
-
-### CLAIM 1 - decode decomposition
-PRE-FIX (`a2_nsys/paged_off_npl128.sqlite`, last 17s) vs `decode_decomp.txt`, shares match to ~0.1pp:
-gated_delta_net 23.40% (doc 23.43), k_get_rows 21.99% (21.88), MEMCPY-DtoD 18.89% / 382 GB /
-1583 ops (18.90 / 356 GB / 1584), mul_mat_vec_q 15.53% (15.51), mul_mat_q 10.48% (10.37).
-=> CONFIRMED exactly. gated_delta_net = largest single non-GEMM kernel; FP4-GEMM group ~28%;
-full attention 0.37%.
-
-D2D collapse: only on-box post-fix decomp is `ssm_decomp/after.sqlite`; MEMCPY-DtoD there =
-526 ops / 0.9 ms / 0.05 GB = 0.008% of busy (from 382 GB / 18.89%). => CONFIRMED, stronger than
-the doc's "0.23%" (382 GB state copy-back gone; exact "0.23%/2.93GB/734ops" not reproducible -
-my DtoD 0.05 GB, the 2.16 GB is DtoH).
-
-FLAG (refutes part of the Step-2 decomp): `after.sqlite` is a Step-1 build (patch 0018 only),
-NOT Step-2. It still shows k_get_rows_float 28.44% (gated_delta_net 28.96%, FP4-GEMM group ~33%),
-no `gdn_gather_nonident` kernel, profiled S_TG=164 (~Step-1 180, not Step-2 256); mtime 00:31
-predates the 08:48 rebuild that carried patch 0019. The Step-2 split in `SSM_DECODE_FIX_RESULTS`
-("get_rows 18.8%->0.7%, FP4-GEMM ->48%, GDN 22.5%") has NO surviving sqlite, and the script meant
-to produce it (`ssm_decomp.sh`) CRASHED (Python SyntaxError, see `ssm_decomp_after.out`). So
-"FP4-GEMM ~48%" is UNVERIFIED against raw Step-2 data: measured ~33% on Step-1; removing the 28%
-get_rows bucket lifts it to ~46% arithmetically, so ~48% is plausible but not directly measured.
-Section 1 above and SSM_DECODE_FIX_RESULTS both inherit this unverified Step-2 split.
-
-### CLAIM 2 - 146 -> ~257 ("+66%")
-146.23 baseline CONFIRMED (`ssm_decode_baseline.out`); final 256.57 / 252.50 / 254.02 across
-SSM_DECODE_FIX_RESULTS + THROUGHPUT_B_P2a, within ~1.6%. Magnitude CONFIRMED. TRAP: 146->257 is
-+76% (146->254 = +74%), NOT +66%. "66%" is the % of vLLM (257/391 = 65.7%), not the speedup.
-
-### CLAIM 3 - P2a GEMM-remap FLAT on decode
-THROUGHPUT_B_P2a: dense npl128 252.50->254.02 (+0.6% noise), npl32 -0.4%, MoE flat; FP4 GEMM
-kernel itself -24.7%, PREFILL +12.7%. Pre-SSM corroborated by THROUGHPUT_B_P1. => CONFIRMED.
-
-### CLAIM 4 - 65% of vLLM (254 vs 391)
-254/391 = 64.96%, 256.57/391 = 65.6%; vLLM 391 = enforce_eager apples ref. => CONFIRMED.
-
-### Traps checked
-GGML_CUDA_DISABLE_GRAPHS set `=1` explicitly (not the empty-value trap); graphs ON/OFF within
-noise. memcpy-in-compute lumping AVOIDED (separate table sums). Decomp reps are ntg24-under-nsys
-(S_TG 149/164) - valid for SHARES only; throughput correctly from unprofiled ntg128 logs.
-
-### Net verdict
-Claim 1: pre-fix decomp CONFIRMED exact; D2D collapse CONFIRMED (stronger); Step-2 0.7%/48% split
-UNVERIFIED (producer script crashed; the only post-fix sqlite is Step-1). Claim 2: magnitude
-CONFIRMED, "+66%" label REFUTED (true +76%; 66% = % of vLLM). Claim 3: CONFIRMED. Claim 4: CONFIRMED.
-
---
-
-## Section: weight-bandwidth (whole-step DRAM budget, READ-ONLY math)
-
-Agent label: weight-bandwidth. Method: exact GGUF tensor accounting (q36-27b-nvfp4,
-arch qwen35, 64 layers) + activation-state math + existing nsys/decode_decomp; no GPU started.
-Config = the production decode number: llama-batched-bench -fa on -npp128 -ntg128 -npl 128
-(B = n_parallel = 128 sequences, S_TG = 254 t/s post-0019). GB10 LPDDR5x peak ~273 GB/s.
-
-### Exact per-step DRAM byte budget at B=128 (ctx avg ~192 over the ntg128 window)
-
-NVFP4 type-40 = 0.5625 B/weight (4-bit data + e4m3 per-16 micro-scale; verified: 5120*48*0.5625=138240).
-
-WEIGHTS (read ONCE per step, shared across all 128 seqs):
-  - NVFP4 layer weights (type40, 64 layers): 13,062.7 MB = 12.76 GB
-      (per SSM layer 205.6 MB x48 = 9867.7 MB ; per full-attn layer 199.7 MB x16 = 3195.0 MB)
-  - LM head output.weight: type 30 = **bf16, NOT quantized** = 2425 MB = 2.37 GB (read in full each step)
-  - per-layer norms/conv1d/ssm_a/dt_bias (type0 f32): 10.1 MB
-  - token_embd: EXCLUDED (get_rows gathers only 128 rows, negligible)
-  => WEIGHTS TOTAL = 15.14 GB / step
-
-PER-SEQUENCE STATE (x128 seqs, read + write every step):
-  - SSM recurrent state: inner_size(6144) x state_size(128) x 4B(f32) = 3.0 MB / layer / seq
-      x 48 SSM layers x 128 seq = 18.43 GB read + 18.43 GB write = **36.86 GB / step**
-  - conv state: conv_k(4) x conv_dim(10240) x 4B = 160 KB / layer / seq
-      x 48 x 128 = 0.96 GB read + 0.96 GB write = 1.92 GB / step
-  - KV cache (16 full-attn layers, GQA n_kv_head=4, k+v_len=512, f16):
-      4096 B/tok/layer x 16 x ~192 ctx x 128 seq = ~1.6 GB read / step
-
-  TOTAL ~= 15.14 (W) + 36.86 (SSM state) + 1.92 (conv) + 1.6 (KV) = **~55.5 GB / step**
-
-### Floor vs measured -- decode is NOT at the bandwidth floor
-
-  Bandwidth floor = 55.5 GB / 273 GB/s = **203 ms/step**
-  Measured llama  = 128 tok / 254 t/s   = **504 ms/step**  => **2.48x the floor** (eff BW 110 GB/s = 40% of peak)
-  vLLM 391 t/s    = 128 / 391           = **327 ms/step**  => 1.61x the floor (eff BW 170 GB/s = 62% of peak)
-
-  The SAME 55.5 GB/step floor applies to vLLM: identical NVFP4 weights, and its
-  fused_recurrent_gated_delta_rule reads+writes the identical f32 recurrent state. So both engines
-  face the same DRAM wall; vLLM simply moves those bytes at 62% of peak vs llama's 40%. The 62/40 =
-  1.55x utilization gap is EXACTLY the 254->391 (1.54x) throughput gap. => Decode-parity is a
-  bandwidth-UTILIZATION / launch-serialization problem, NOT a DRAM-traffic-volume problem. Bandwidth
-  is not the binding constraint (we sit 2.5x above the floor); confirms the GDN-kernel section above.
-
-### Traffic composition is STATE-dominated at B=128 (qualifies the "weight-quant" verdict)
-
-  SSM state r+w = 66% of step traffic; weights = 27%; conv = 3.5%; KV = 3%.
-  At B=128 weights are a minority of traffic, and we are 2.5x above the floor anyway -> NVFP4-dense
-  weight quant (the QWEN36_NVFP4 verdict's lever) cannot move batch-128 decode much. Weight-quant
-  helps PREFILL (compute/weight-bound, already +12.7% from the GEMM remap) and LOW-batch decode.
-  Cross-check at B=32: traffic ~25.2 GB/step (weights now 60%), floor 92 ms, measured 189 ms = 2.05x
-  floor. The sublinear scaling 32->128 (4x batch, only 1.5x throughput: 169->254) is fully explained
-  by per-seq state traffic growing with B while weights stay amortized -> at B=128 the step has become
-  state-traffic-heavy but is STILL 2.5x off the floor, i.e. latency/overlap-bound, not byte-bound.
-
-### Redundant traffic llama reads that vLLM avoids (cut list, by impact)
-
-  1. (HISTORICAL, FIXED by 0018) Redundant DtoD recurrent-state copy = +18.4 GB/step EXTRA
-     (pre-fix decode_decomp: MEMCPY-DtoD 18.9%, 80 copies/step ~230 MB each = 18.4 GB; nsys window
-     356 GB/19.8 steps). This doubled state traffic and was the dominant pre-fix waste. Verified gone
-     post-fix: the THROUGHPUT_B_P2a A/B kernel sum (npp128 ntg24 npl128) lists gated_delta_net /
-     mul_mat_q / quantize but NO MEMCPY-DtoD term. (The committed ~/bench/a2_nsys sqlites are all
-     PRE-fix S_TG~149 traces; re-profiling deferred to the designated profiler.) This single removal
-     (18.4 GB/273 ~= 67 ms/step of bytes plus the killed overlap stalls) is the bulk of 146->254.
-  2. conv state as a SEPARATE ssm_conv kernel + separate buffer: 1.92 GB r+w/step AND 48 extra kernel
-     launches/step. vLLM folds the causal conv into its recurrence kernel. Cut ~= 7 ms bytes + 48
-     launches/step of serialization.
-  3. Residual get_rows gather post-0019 (~0.7%, decode_decomp pre-fix k_get_rows was 21.9% / ~96
-     ops/step = 2/SSM-layer): vLLM indexes the per-seq state in-kernel; llama still does a small
-     gather/scatter. ~0.13 GB. 0019 already folds most of it; fold the identity check fully into the
-     recurrence prologue.
-  4. quantize_mmq_nvfp4: 448 ops/step re-quantizing activations to NVFP4 before each FP4 matmul.
-     Activation BYTES are negligible, but it is 448 extra kernel launches/step that vLLM fuses into
-     the GEMM prologue -> pure launch latency, not traffic.
-  5. NOT redundant: weight bytes (identical NVFP4 to vLLM), SSM-state r+w (inherent, vLLM pays it),
-     NVFP4 scale scalars (8 B/tensor). Note the LM head is bf16 not quantized (2.37 GB/step, 16% of
-     weight traffic) -- fp8 LM head would save ~1.2 GB/step but only matters if vLLM also quantizes it.
-
-### Bottom line (weight-bandwidth)
-At B=128, decode moves ~55.5 GB/step and runs at 2.48x the 273 GB/s floor (40% util) vs vLLM's 1.61x
-(62% util). Same bytes, same floor for both engines -> decode is bandwidth-UTILIZATION-bound, not
-traffic-bound. There is NO large redundant-byte stream left to cut post-0018/0019 (the 18.4 GB/step
-DtoD redundancy is already gone); the remaining 254->391 is recovered by raising achieved bandwidth
-(occupancy + prefetch on the GDN state loads, conv fusion to drop 48 launches/step) so the EXISTING
-55.5 GB/step moves at vLLM's 62% instead of 40%. Weight-quant (NVFP4-dense) is a PREFILL / low-batch
-lever, largely orthogonal to the batch-128 decode-parity gap.
-
---
-
-## Section: explore-other-levers (broad sweep for OTHER llama-specific decode inefficiencies) - READ-ONLY, no GPU
-
-Scope handoff: GDN-kernel internals -> `gdn-source-compare`; host loop / graphs / gaps ->
-`per-token-latency`; weight-byte / utilization -> `weight-bandwidth` section above (which already
-covers the BF16 lm_head and the "same bytes, 40% vs 62% util" framing - I concur, no need to repeat).
-This section covers the levers NONE of those own: the FP4 act-quant fusion, the M=128-vs-M=1 ggml
-fusion gate, TMA scoping, and the conv-state residual.
-
-**Terminology fix that matters for the whole doc:** in this repo's benches **"fusion OFF" means
-`LLAMA_FUSE_NVFP4_QUANT=0`** (Track A's NVFP4 act-quant producer), confirmed in
-`a2_nsys.sh`/`a2_4cell.sh`/`trackA_clean.sh`. It does NOT set `GGML_CUDA_DISABLE_FUSION`, so the
-**standard ggml-cuda elementwise/GLU/rope fusion is ON** in every result. The header's "fusion OFF
-baseline" is only about the act-quant producer.
-
-**Framing (consistent with the sections above, sharpened):** the binder is bandwidth-UTILIZATION /
-the kernel-dependency chain, not traffic or per-kernel compute (P2a -24.7% GEMM and graphs both
-flat). The thing that raises utilization AND shortens the chain is the same: **fewer, fused kernels
-per step** - removing whole passes vLLM doesn't run. So rank by "whole pass eliminated", not "us
-shaved".
-
-### L1. Re-test Track A act-quant fusion (`LLAMA_FUSE_NVFP4_QUANT=1`) POST-SSM. [impact ~8-11%, tractability HIGH - code exists, owned by tasks 38-41]
-`quantize_mmq_nvfp4` is a standalone full-activation requantize run once per NVFP4 GEMM at M=128
-(the weight-bandwidth section counts 448 such launches/step). vLLM has **zero** equivalent:
-`rms_quant_fusion.py:98` folds it into RMSNorm, `act_quant_fusion.py:40,128` into SiLU+mul - the
-activation never hits a temp buffer. Track A built exactly this fused producer (tasks 38-40 DONE),
-but `LLAMA_FUSE_NVFP4_QUANT=1` regressed, and EVERY post-SSM bench ran with it OFF. **The regression
-is likely stale:** pre-SSM the GPU was 99% busy on the state-copy chain, so folding act-quant into
-the norm only relocated busy work into a lower-occupancy kernel with no idle to reclaim. Post-SSM the
-chain has real idle and removing 448 launches/step both shortens the dependency chain and lifts
-utilization - exactly the post-0018/0019 bind. Highest-value CLEAN removal; needs only a re-bench
-(re-run `trackA_clean.sh` on the post-0019 build), not new code. Do not treat the prior regression
-as final.
-
-### L2. M=128 norm->matmul prologue fusion - the ggml fusion gate that does NOT fire at decode batch. [impact ~5-15% aggregate, tractability MEDIUM]
-ggml-cuda's built-in `rms_norm+mul+mul_mat_vec_q` fusion (`ggml_cuda_should_fuse_mul_mat_vec_q`,
-ggml-cuda.cu:2502) is gated to `dst->ne[1]==1` - it ONLY fires at **npl1** (M=1). At npl128
-(`mul_mat_q`, M=128) it does NOT fire, so the per-layer RMSNorm stays a separate kernel feeding the
-GEMM and the act path is unfused (L1). vLLM fuses both into the GEMM prologue at all M. This is the
-M=128 generalization of the existing M=1 fusion + L1; largest aggregate surface but real kernel work.
-Implies a regime split worth stating loudly: **npl1 single-stream latency already gets this fusion;
-the npl128 throughput number does not** - tune the two separately.
-
-### L3. TMA weight feed: a PREFILL / npl1-latency lever, NOT an npl128-decode lever.
-Answering the brief's question (GEMM idle = FEED problem TMA fixes, or off-critical-path TMA can't?):
-P2a cut GEMM compute and the freed time became IDLE, so at npl128 the GEMM finishes early and the
-stall is BETWEEN kernels, not inside the GEMM waiting on weight tiles. TMA accelerates a
-*feed-stalled* GEMM; at npl128 the GEMM is not the binder, so TMA won't move npl128 S_TG. It pays on
-(a) **prefill** (compute/feed-bound; the remap already gave +12.7%) and (b) **npl1 decode**, a pure
-weight-feed GEMV (full model / 273 GB/s ~ 19-20 tok/s ceiling). Scope TMA to prefill + low-batch
-latency; do not bank it for batch-128 decode parity. (Consistent with the weight-bandwidth section's
-"NVFP4-dense is a prefill/low-batch lever".)
-
-### L4. In-place / `ids` conv-state - apply the 0018/0019 pattern to `ssm_conv`. [impact ~1-3%, tractability HIGH, proven pattern, bit-exact-able]
-After the SSM fix the residual D2D is the conv-state copy (`build_conv_state`,
-delta-net-base.cpp:449-525: `build_rs` reads 3 prior samples, `ggml_concat` the new token, writes
-the last 3 back), plus `ssm_conv` (~0.8-1.5%) and a per-GDN-layer `concat_cont` (48/step). The exact
-in-place + `ids`-read treatment from 0018/0019 applies to the conv state, and `ssm_conv`+`concat`
-can fold into the GDN kernel prologue (it already has `ids` plumbing). Small ceiling but bit-exact,
-low-risk, and removes ~48 launches/step from the chain - this is the "conv fusion to drop 48
-launches/step" the weight-bandwidth section calls for, made concrete via the proven patch pattern.
-
-### Deferred (covered by other sections, I concur)
-- GDN occupancy / row-local reductions / gate-norm fusion -> `gdn-source-compare`. Add only: bf16
-  state halves the dominant traffic but vLLM keeps f32, so it is a divergence-from-reference, not a
-  parity lever - last priority, quality-risk.
-- BF16 lm_head / weight-byte / 40%-vs-62% utilization -> `weight-bandwidth` section. lm_head NVFP4 is
-  an absolute ~1-2% trim, not a vLLM-relative gap (vLLM likely keeps it bf16 too).
-- Full-attention KV path (16 attn layers, 0.4-1.8%, O(ctx) but tiny) -> CLOSED, not a lever.
-
-### Bottom line (this section's net-new)
-Ranked by "whole pass vLLM eliminated": **L1 (re-test act-quant fusion post-SSM - clean removable
-pass, code already written, just needs a post-0019 re-bench)** > **L2 (M=128 norm/act prologue
-fusion - biggest aggregate surface, real work)** > **L4 (conv-state in-place - cheap, proven 0018/0019
-pattern, -48 launches/step)**. **L3 (TMA) is mis-scoped if aimed at npl128 decode** - it is a prefill
-/ npl1-latency lever, same bucket as NVFP4-dense weight quant. Caveat inherited from
-`validate-findings`: the post-SSM act-quant absolute share (L1) is on an unverified Step-2 decomp
-(only clean post-fix sqlite is Step-1); re-measure on a clean Step-2 nsys when the profiler runs.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## Section: profile-both-engines (GROUND-TRUTH post-SSM nsys of llama AND vLLM at npl128) - THE GPU PROFILER
-
-Agent label: profile-both-engines (the only GPU agent). Fresh post-SSM nsys traces of
-BOTH engines at the same shape (128-seq decode, 128-token prompts), q36-27b-nvfp4 dense.
-llama = `build-cuda-base` (no FP4 flag, byte-identical to stock, HEAD 46d7dd8 = patch 0019
-SSM fix), `llama-batched-bench -npp128 -ntg32 -npl128 -fa on`, eager (DISABLE_GRAPHS=1) for
-a clean per-kernel trace. vLLM = 0.23.0 in-process offline (`VLLM_ENABLE_V1_MULTIPROCESSING=0`
-so cudaProfilerApi controls the worker), enforce_eager, max_num_seqs 256, 128 prompts.
-Decode-only windows (prefill excluded), overlap-correct interval-union busy, GPU-accurate
-per-call kernel durations. This is the post-SSM **Step-2** trace `validate-findings` flagged
-as having no surviving sqlite - it now exists: `~/bench/postssm_decomp/`.
-
-### 0. THROUGHPUT GROUND TRUTH (un-profiled, prefill-subtracted) - resolves the 391 reference
-
-The vLLM 391 reference is real and reproduced. Prefill-subtracted decode step (two-length
-w16/w64 timing, in-process, batch 128):
-
-| engine / mode            | ms/step | decode tok/s | notes                          |
-|--------------------------|---------|--------------|--------------------------------|
-| llama post-SSM (graphs)  | ~510-522| **245-251**  | S_TG @npl128 ntg32 (this run)  |
-| vLLM enforce_eager       | 324.9   | **394.0**    | == the ~391 ref (h2h log 371-384)|
-| vLLM cuda-graphs         | 304.9   | **419.8**    | graphs buy only +6%            |
-
-- **CUDA graphs are NOT the parity lever**: vLLM is already 394 t/s EAGER; graphs add +6%
-  (394->420). llama-batched-bench already runs WITH graphs at 245. So the gap is eager-vs-eager
-  kernel work, confirming `per-token-latency` and `A2_CUDAGRAPH_DECODE`.
-- TRAP I hit and corrected: the FIRST vLLM nsys window (0.35-0.99) read 468 ms/step / 273 t/s -
-  WRONG, contaminated by prefill chunked-GDN kernels AND eager-nsys host overhead. The tight
-  decode-only window (0.62-0.98) reads **326.5 ms/step**, matching the un-profiled 324.9 ms
-  to within 0.5% -> the tight window is faithful; per-kernel numbers below use it.
-
-### 1. POST-SSM per-step decode decomposition, SIDE BY SIDE (GPU-accurate, prefill-free)
-
-Both at batch 128. llama 510 ms/step (98.7% GPU-busy), vLLM 326 ms/step (97.9% GPU-busy).
-ms/step = on-device kernel time per real decode step (nsys host overhead does not inflate GPU
-kernel duration; per-step = GPU-ms / real-step-count from the decode-only GDN call count).
-
-| component (per step)        | llama ms/step | llama % | vLLM ms/step | vLLM % |
-|-----------------------------|---------------|---------|--------------|--------|
-| GDN linear-attn recurrence  | 193 (48x4.03) | 38%     | 174 (48x3.62)| 53%    |
-| FP4 matmul + act-quant      | **236**       | **46%** | **117**      | **36%**|
-|   - mul_mat_vec_q (GEMV)     | 132 (48x2.75) | 26%     | -            | -      |
-|   - mul_mat_q (GEMM)         | 88 (448 calls)| 17%     | cutlass 61   | 19%    |
-|   - quantize_mmq_nvfp4       | 16 (448)      | 3%      | nvjet 53 + cvt 2 | 17% |
-| full attention (16 layers)  | 6.6 (16)      | 1.3%    | 6.2 (16)     | 1.9%   |
-| SSM conv + glue/elementwise | ~45           | 9%      | ~22          | 7%     |
-| MEMCPY (D2D+H2D)            | 2.5 (131 MB)  | 0.5%    | 0.36 (85 MB) | 0.1%   |
-| **TOTAL**                   | **~510**      | 100%    | **~326**     | 100%   |
-
-### 2. The three load-bearing comparisons (the brief)
-
-**(1) GDN compute: llama vs vLLM = NOT the gap.** Per-call GPU duration:
-llama `gated_delta_net_cuda<128>` = **4.03 ms/call**, vLLM
-`fused_recurrent_gated_delta_rule_packed_decode` = **3.62 ms/call**. llama is only **+11%**
-slower per call (+19 ms/step). GDN is comparable; it is the largest single kernel on BOTH sides
-(38% llama, 53% vLLM) but it explains only ~19 ms of the 185 ms gap (~10%). This REFUTES the
-framing that the GDN kernel is the dominant residual lever - it is a minor overage post-0018/0019.
-(The `gdn-source-compare` occupancy/shuffle deltas are real but worth ~19 ms/step, not 1.5x.)
-
-**(2) DRAM bytes/step: llama does NOT read more.** Explicit memcpy: llama **131 MB/step** vs
-vLLM **85 MB/step** - llama moves a hair more in copies but both are <0.5% of the step. The big
-per-layer state copies are GONE (pre-SSM 18 GB/step DtoD -> post-SSM 131 MB/step) - **the SSM fix
-(0018/0019) is confirmed working in this trace.** Weight DRAM (read inside the GEMM/GEMV kernels,
-not memcpy) is the SAME ~15 GB NVFP4 for both engines; at 273 GB/s that is a ~55 ms floor, and
-BOTH engines sit far above it (326-510 ms), so BOTH are compute/kernel-bound, NOT
-weight-bandwidth-bound, and llama reads no extra bytes. The 254-vs-391 gap is NOT a byte-volume
-deficit - it is effective-bandwidth/compute-efficiency in the FP4 matmul kernels (see 3).
-
-**(3) GPU-busy% / idle structure: identical, both ~98% busy.** llama 98.7% busy (1.3% idle),
-vLLM 97.9% busy (2.1% idle). Neither engine is idle/gap/host-bound at npl128. The entire gap is
-the GPU doing MORE kernel-time per step on llama: llama's non-GDN GPU work = ~310 ms/step vs
-vLLM's ~146 ms/step. That 164 ms delta is concentrated in the FP4 matmul path.
-
-### 3. THE single biggest llama-specific overage: the FP4 matmul path (+119 ms/step = 64% of the gap)
-
-llama spends **236 ms/step** on FP4 matmul+quant; vLLM does ALL its matmul (cutlass FP4 GEMM +
-cublas nvjet + act cvt) in **117 ms/step** - even though vLLM ALSO carries ~18 ms/step of extra
-PyTorch eager elementwise glue that llama's fused ggml kernels avoid. llama is **2.0x slower on
-FP4 matmul**, and that +119 ms is **64% of the entire 185 ms/step gap**.
-
-Inside llama's FP4 path the dominant, untouched cost is **`mul_mat_vec_q` = 132 ms/step (26% of
-decode), 48 calls/step (exactly one per GDN layer), 2.75 ms/call, grid 5120x128**. This is the
-**FP4 GEMV ("vec_q") kernel running at decode batch 128** for the gated-DeltaNet output projection (`ssm_out`; see SYNTHESIS) -
-a non-tensor-core, memory-bound-style kernel doing M=128 work without GEMM-grade weight-read
-amortization. vLLM runs the equivalent projections through cutlass batched FP4 GEMM (tensor-core,
-weight read amortized across the 128-row batch) at a fraction of the cost. **There is no
-GEMV-at-batch-128 on the vLLM side at all.**
-
-Key cross-check with Track B P2a: P2a optimized `mul_mat_q` (the 17%/88 ms tensor-core GEMM, made
-it -24.7%) and decode stayed FLAT - because the BIG FP4 cost is `mul_mat_vec_q` (26%/132 ms),
-which P2a never touched. **Track B optimized the wrong FP4 kernel.** The lever is to route the
-GDN output projection (`ssm_out`) at M=128 through a tensor-core GEMM (mul_mat_q / MMQ) instead of the vec_q path,
-and to fuse the act-quant (L1) + the norm prologue (L2) so the 448 `quantize_mmq_nvfp4` launches
-fold away - exactly what `explore-other-levers` L1/L2 propose. My measurement RANKS them: the
-mul_mat_vec_q->GEMM routing is the single highest-value target (132 ms), then act-quant fusion
-(16 ms + 448 launches), then the GDN +19 ms.
-
-### 4. Reconciling with the `weight-bandwidth` section (unification, not contradiction)
-
-weight-bandwidth concluded "same 55.5 GB/step, llama 40% util vs vLLM 62% util -> utilization-bound."
-My per-kernel data LOCALIZES that utilization gap: it lives in the **FP4 matmul kernels** (which
-do the bulk of the ~15 GB weight read), NOT in the GDN state traffic. GDN moves its (equal) state
-bytes at comparable rate on both engines (4.03 vs 3.62 ms/call). So the "40% vs 62%" is the
-`mul_mat_vec_q`/`mul_mat_q` weight-read efficiency vs cutlass FP4 GEMM. Raising decode parity =
-raise the FP4-matmul achieved bandwidth (tensor-core GEMM routing + act/norm prologue fusion),
-not the GDN kernel and not byte-cutting.
-
-### Verdict (profiler)
-- Reproduced both engines at their true operating points: llama 245 / vLLM 394 eager / 420 graphs.
-  Graphs are not the lever (+6%). Both engines ~98% GPU-busy; gap is GPU kernel-time, not idle/host.
-- GDN compute is comparable (llama +11%/call, +19 ms/step) - NOT the dominant residual.
-- bytes/step: llama does not read more (131 vs 85 MB memcpy; identical weight bytes); SSM fix's
-  18 GB/step DtoD removal CONFIRMED in-trace.
-- **The single biggest llama-specific overage is the FP4 matmul path: 236 vs 117 ms/step (+119 ms
-  = 64% of the 185 ms gap), dominated by `mul_mat_vec_q` (FP4 GEMV at batch 128, 132 ms/step, 26%,
-  one per GDN layer).** Highest-value lever = route the GDN output projection (`ssm_out`) through a tensor-core FP4
-  GEMM at M=128 + fuse act-quant/norm prologue (L1/L2). Track B optimized the wrong FP4 kernel.
-
-### Evidence (DGX, this agent)
-- `~/bench/postssm_decomp/postssm_base.{nsys-rep,sqlite,gpu_trace.csv,run.log}` (llama post-SSM).
-- `~/bench/postssm_decomp/vllm_decode.{nsys-rep,gpu_trace.csv}` (vLLM eager decode trace).
-- `~/bench/postssm_decomp/vllm_decode_g1.*` (vLLM graphs run), `~/bench/vllm_tps.py` (throughput).
-- Scripts: `~/bench/postssm_llama_decomp.sh`, `~/bench/vllm_nsys_run.sh`, `~/bench/decode_decomp2.py`
-  (decode-only windowed, overlap-correct, MB-memcpy, per-step reconstruction).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## Section: SYNTHESIS (cross-check + ground-truth + ranked levers + verdict) - FINALIZED
-
-Agent label: synthesize. Read-only (no GPU). Cross-checks all sections above against the
-fresh `profile-both-engines` ground-truth, then mechanism-confirms the dominant lever by
-reading the model graph + ggml-cuda dispatch source on the DGX (`~/llama-paged-dev`, HEAD
-46d7dd8 = patch 0019). All throughput vs the vLLM 391 t/s eager apples-to-apples reference.
-
-### 0. Headline
-
-Post-SSM dense decode = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact. The residual is
-NOT a hardware/architecture floor and NOT the GDN recurrence kernel, the host loop, CUDA
-graphs, or DRAM byte-volume. It is ONE concrete, llama-specific kernel-routing defect:
-**the gated-DeltaNet output projection (`ssm_out`) runs as an FP4 GEMV (`mul_mat_vec_q`)
-at decode batch 128 instead of a tensor-core FP4 GEMM (MMQ), costing 132 ms/step = 26% of
-decode = the single biggest overage vs vLLM (which packs the same projection into a cutlass
-M=128 GEMM).** The fix is a ~2-line reshape, bit-exact, and is the highest-value next step.
-
-### 1. Cross-check: which prior findings HELD, were REFUTED, or are SUPERSEDED
-
-HELD (confirmed by both the adversarial re-derivation and the fresh profile):
-- Pre-fix decomposition (gated_delta_net 23.4%, k_get_rows 21.9%, MEMCPY-DtoD 18.9% / 382 GB,
-  mul_mat_vec_q 15.5%, mul_mat_q 10.5%): reproduced to <=0.1pp (validate-findings).
-- SSM-fix D2D collapse: the 18.4 GB/step redundant recurrent-state copy is GONE. Confirmed
-  three ways: validate (18.9% -> 0.008% on the post-fix sqlite), weight-bandwidth (A/B kernel
-  sum lists no DtoD term), and IN-TRACE by the profiler (18 GB/step DtoD -> 131 MB/step). The
-  SSM fix (0018/0019) is the real breakthrough and is working.
-- P2a FP4-GEMM occupancy remap FLAT on decode (+0.6% noise) while the `mul_mat_q` kernel itself
-  shrank -24.7% and prefill rose +12.7%: confirmed. Decode is not GEMM-occupancy-bound.
-- 65% of vLLM (254/391 = 64.96%, 256.6/391 = 65.6%): confirmed.
-- Decode is NOT at the bandwidth floor: 55.5 GB/step moved at 2.48x the 273 GB/s floor (40% util)
-  vs vLLM 1.61x (62% util) on the SAME bytes. Confirmed + LOCALIZED below.
-- Host loop / 64-layer serialization is NOT the lever: both engines ~98% GPU-busy at npl128
-  (llama 98.7%, vLLM 97.9%); the entire exposed-idle budget is ~0.65%. Confirmed by the profiler.
-- CUDA graphs are NOT the lever: vLLM is 394 t/s EAGER, graphs add only +6% (420); llama already
-  runs with graphs. Confirmed by the profiler.
-
-REFUTED / CORRECTED:
-- "GDN recurrence kernel is the dominant residual lever" (the STATE brief's "gated_delta_net
-  1.46 ms/call, the largest single kernel" and the gdn-source-compare framing): REFUTED. The
-  profiler's fresh side-by-side per-call duration is llama 4.03 ms vs vLLM 3.62 ms = only +11% /
-  +19 ms/step = ~10% of the 184 ms gap. It IS the largest single kernel on both sides (38% llama,
-  53% vLLM) but the largest GAP is elsewhere. (The brief's "1.46 ms/call" is a stale/narrower
-  window; the authoritative post-SSM per-call is 4.03 ms.) gdn-source-compare's occupancy/shuffle/
-  fusion anatomy is correct but addresses a SECONDARY +19 ms target, not parity.
-- "+66% SSM-fix gain" label: REFUTED. 146 -> 254-257 is +74 to +76%; "66%" is the percent-of-vLLM,
-  not the speedup (validate-findings).
-
-SUPERSEDED (the gap validate-findings flagged, now filled by real data):
-- The "FP4-GEMM ~48% / get_rows 0.7% / GDN 22.5%" Step-2 split had NO surviving sqlite (the
-  producer script crashed; only a Step-1 build was on the box). The profiler's fresh Step-2 trace
-  replaces it with a FINER, load-bearing breakdown: the ~46% "FP4 matmul" bucket is NOT one GEMM
-  family - it splits into `mul_mat_vec_q` 26% (the o_proj GEMV, the real culprit), `mul_mat_q` 17%
-  (the tensor-core GEMM P2a already optimized), and `quantize_mmq_nvfp4` 3%. Lumping them as
-  "48% FP4-GEMM" hid that Track B P2a optimized the 17% MMQ while the 26% MMVQ was the bind. This
-  is why P2a was flat on decode: **it optimized the wrong FP4 kernel.**
-
-### 2. Ground-truth per-step decode decomposition + the single biggest overage
-
-From the profiler's fresh post-SSM eager nsys, both at batch 128, prefill-free, GPU-accurate:
-
-| component (per decode step) | llama ms | llama% | vLLM ms | vLLM% | gap (llama-vLLM) |
-|-----------------------------|----------|--------|---------|-------|------------------|
-| GDN recurrence kernel       | 193      | 38%    | 174     | 53%   | **+19**          |
-| FP4 matmul + act-quant      | 236      | 46%    | 117     | 36%   | **+119**         |
-|   - mul_mat_vec_q (o_proj GEMV) | **132** | **26%** | 0   | -     | **+132**         |
-|   - mul_mat_q (MMQ GEMM)    | 88       | 17%    | 61 (cutlass) | 19% | +27             |
-|   - quantize_mmq_nvfp4      | 16       | 3%     | 55 (nvjet+cvt)| 17% | -39             |
-| full attention (16 layers)  | 6.6      | 1.3%   | 6.2     | 1.9%  | +0.4             |
-| SSM conv + glue/elementwise | 45       | 9%     | 22      | 7%    | +23              |
-| MEMCPY                      | 2.5      | 0.5%   | 0.36    | 0.1%  | +2               |
-| **TOTAL**                   | **~510** | 100%   | **~326**| 100%  | **+184**         |
-
-The +119 ms FP4-matmul gap decomposes as +132 (the `mul_mat_vec_q` o_proj GEMV) + 27 (MMQ) - 39
-(activation-quant: llama's 16 vs vLLM's heavier eager 55) ~= +120, so the GEMV alone exceeds the whole bucket.
-The one lever that matters is the +132 ms/step o_proj GEMV; everything else in the step nets to ~+52 ms (184 - 132).
-
-**MECHANISM (confirmed by source read, not inferred).** In the dense Qwen3.5-27B GDN block
-(`src/models/qwen3next.cpp` `build_recurrent`), the recurrent core keeps the SSM layout
-`[feat, n_seq_tokens, n_seqs]`. At decode `n_seq_tokens=1, n_seqs=128`. The output projection is:
-
-```cpp
-// current code (qwen3next.cpp, end of the GDN block)
-ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm,
-                                 head_v_dim * num_v_heads, n_seq_tokens, n_seqs); // [6144, 1, 128]
-cur = build_lora_mm(model.layers[il].ssm_out, final_output);                     // <-- the matmul
-cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);                 // collapse AFTER
-```
-
-`final_output` is 3D `[6144, n_seq_tokens=1, n_seqs=128]`, so `src1->ne[1] = 1`. The ggml-cuda
-dispatch (`ggml-cuda.cu:2553`) picks MMVQ when `src1->ne[1] <= MMVQ_MAX_BATCH_SIZE (8)`, with the
-128 sequences carried in `ne[2]`. Result: a per-sequence FP4 GEMV, output rows 5120 x 128 seqs =
-**`mul_mat_vec_q`, grid 5120x128, 48 calls/step (one per GDN layer)** - matching the profiler's
-trace exactly. MMVQ does NOT amortize the `ssm_out` weight read into shared memory across the 128
-sequences (it is built for batch <=8), so each of the 128 sequences re-streams the weight tiles -
-the "40% vs 62% utilization" the weight-bandwidth section measured lives HERE, in this kernel, not
-in the GDN state traffic. vLLM packs all 128 decode tokens into one cutlass M=128 GEMM (its GDN
-kernel is literally `..._PACKED_decode`), so it has NO GEMV-at-batch-128 at all.
-
-This also pins WHY it is decode-specific: at prefill the tokens are in `ne[1]` (n_seq_tokens=prompt
-len), so `ne[1] >> 8` -> MMQ already; only the decode layout (128 seqs x 1 token, batched in ne[2])
-trips the GEMV path. The in-projection (`wqkv`) is unaffected: its input is the 2D residual stream
-`[n_embd, 128]` (reshaped to 3D only AFTER the matmul), so `ne[1]=128` -> MMQ today. The o_proj is
-the unique 3D-input matmul, which is exactly why the profiler counted one MMVQ per GDN layer.
-
-### 3. Ranked remaining decode levers (impact x tractability, cumulative ceiling toward 391)
-
-Anchored on llama 256.6 t/s (499 ms/step) -> vLLM 391 (327 ms/step), gap 172 ms/step. Recover
-figures past Lever 1 are ESTIMATES (the profiler measured the costs, not the post-fix kernels);
-each needs a confirming re-profile. Ceilings are cumulative.
-
-| # | lever | targets (ms/step) | est. recover | cumulative decode_agg | % of vLLM | tractability |
-|---|-------|-------------------|--------------|-----------------------|-----------|--------------|
-| 1 | **o_proj MMVQ -> MMQ** (collapse final_output to 2D before `ssm_out`) | vec_q 132 | ~100-110 | ~320-330 | **~82-85%** | **VERY HIGH** (2-line reshape, bit-exact, MMQ already proven on NVFP4 at M=128 by the in_proj) |
-| 2 | act-quant + norm prologue fusion (explore L1 `LLAMA_FUSE_NVFP4_QUANT=1` re-bench + L2 M=128 gate) | quant 16 + 448 launches/step | ~15-25 | ~345-360 | ~88-92% | MED-HIGH (producer code exists, tasks 38-40; needs post-0019 re-bench, the pre-SSM regression is stale) |
-| 3 | GDN-area fusion + occupancy (gdn A-D: row-local reduction, raise launch_bounds occupancy, fold gate/l2norm/softplus into the recurrence) | GDN +19 + glue +23 | ~25-40 | ~375-388 | ~96-99% | MED-LOW (real kernel rewrite + numeric re-validation) |
-| 4 | conv-state in-place + conv fuse (explore L4, the proven 0018/0019 pattern on `ssm_conv`/concat) | part of glue, 48 launches/step | ~5-10 | ~388-395 | ~99-101% | HIGH (bit-exact, proven pattern) |
-| - | between-step host gap / cgraph reuse | ~2 ms/step | ~2 | +~0.4% | n/a | LOW value (cleanup, not a parity lever) |
-| x | CUDA graphs | - | 0 | already on | n/a | NOT a lever (+6% even for vLLM) |
-| x | TMA weight-feed / NVFP4-dense weight-quant | prefill / npl1 | 0 at npl128 | n/a | n/a | MIS-SCOPED for batch-128 decode (prefill / low-batch levers; prefill already +12.7%) |
-
-Note on Lever 1+2 coupling: routing the o_proj to MMQ ADDS one activation-quant (q8_1/NVFP4) per
-o_proj, so Lever 2 (fusing that quant into the preceding `build_norm_gated`) compounds Lever 1
-rather than overlapping it. Lever 3's "glue +23 ms" and Lever 1's quant are the same elementwise
-passes vLLM folds into its packed kernel, so 2+3 share surface - treat the estimates as a band,
-not a sum.
-
-### 4. Verdict: is true decode parity reachable?
-
-**Yes, parity is reachable in software, and the residual is NOT a hardware/architecture floor.**
-Proof of "not a floor": both engines read identical NVFP4 weights and read+write identical f32
-recurrent state = identical 55.5 GB/step DRAM floor (203 ms) on the identical GB10 LPDDR5x; vLLM
-achieves 62% bandwidth utilization (327 ms/step) where llama achieves 40% (499 ms/step). The 1.54x
-throughput gap equals the 1.55x utilization gap, and that utilization gap is now LOCALIZED to
-specific llama kernels - chiefly the o_proj MMVQ - every one of which is closable in software. The
-GDN recurrence (the supposed floor) is only +11%/call between the two engines.
-
-How far each tier reaches:
-- The first ~84% of parity (256 -> ~325) is nearly FREE: Lever 1 is a 2-line reshape that moves
-  the GDN output projection from a per-sequence FP4 GEMV to a tensor-core M=128 FP4 GEMM, bit-exact,
-  no new kernel (MMQ already runs the in-projection at this exact shape and type).
-- ~84% -> ~92% (Levers 1+2) is low-effort: the fused act-quant producer already exists (tasks
-  38-40), it just needs a post-0019 re-bench because its pre-SSM regression was measured when the
-  GPU was 99% busy on the now-removed state-copy chain (no idle to reclaim then; real idle now).
-- ~92% -> ~100% (Levers 3+4) is the diminishing-returns tail and the only genuinely HARD work:
-  matching vLLM's fully-fused `packed_decode` GDN kernel (row-local reductions, higher occupancy,
-  folding the gate/l2norm/softplus elementwise passes into the recurrence). This last ~8% is "hard
-  but not floored" - it is kernel engineering, not a hardware wall.
-
-**Single highest-value next step (do this first):** apply Lever 1 - collapse `final_output` to 2D
-`[head_v_dim*num_v_heads, n_seq_tokens*n_seqs]` BEFORE the `ssm_out` matmul (drop the now-redundant
-post-matmul `reshape_2d`):
-
-```cpp
-// route the GDN output projection through tensor-core MMQ at decode:
-// M = n_seq_tokens*n_seqs (=128 at decode) instead of ne[1]=1 -> MMVQ GEMV. Free, bit-exact.
-ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm,
-                                 head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-cur = build_lora_mm(model.layers[il].ssm_out, final_output); // now [n_embd, n_tokens], M=128 MMQ
-```
-
-Then the profiler re-measures the realized o_proj-as-MMQ cost on a clean post-0019 nsys (the one
-number this synthesis estimates rather than measures) and confirms the 256 -> ~320-330 lift. The
-same 3D-input-matmul pattern almost certainly affects the MoE checkpoint (q36-35b-a3b) decode and
-any other matmul that consumes a tensor still in the `[feat, 1, n_seqs]` SSM layout - grep those
-and apply the same collapse. Levers 2-4 follow in priority order; none requires a model or accuracy
-compromise, so bit-exactness is preserved throughout.
-
-### Evidence (this section)
-- Source read (DGX `~/llama-paged-dev`, read-only): `src/models/qwen3next.cpp` (GDN in/out proj
-  layout, lines ~286-305 and ~518-528), `ggml/src/ggml-cuda/ggml-cuda.cu:2553` (MMVQ dispatch on
-  `ne[1]<=8`), `ggml/src/ggml-cuda/mmvq.cuh:3` (`MMVQ_MAX_BATCH_SIZE 8`), `mmq.cu:267` (NVFP4 is
-  MMQ-supported).
-- All five prior sections of this doc + the profiler's `~/bench/postssm_decomp/` traces.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md
+++ b/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md
@@ -1,184 +0,0 @@
-# F16/BF16 Glue Probe - the dense decode residual to vLLM
-
-Question: dense decode parity sits at llama 384.6 vs vLLM 418.8 t/s @ npl128 = 91.8%.
-The 49% SSM recurrence (f32 BOTH engines) and the 27% NVFP4 GEMM (W4A4 BOTH) are
-precision-matched. The residual ~8% may be partly that llama runs the NON-recurrence
-GLUE (attention, norms, activations, elementwise, residual stream) in F32 while vLLM
-runs the model in BF16. This probe settles, empirically on q36-27b-nvfp4 @npl128, how
-much of that residual is realistically f16/bf16-closable.
-
-Model: Qwen3.5-27B NVFP4 (dense). 64 layers = 16 attention + 48 gated-DeltaNet
-(SSM) recurrent. Build b104-f7409c2 (patch 0023), verified git-clean and coherent.
-The bf16 SSM work was never applied to the tree (only saved as a diff backup);
-ggml-cuda needed no recompile on rebuild, so the binary is bit-identical to clean 0023.
-
-## (1) Current KV / state dtype  (SETTLED)
-
-From the `-v` init log:
-
-- ATTENTION KV cache (16 of 64 layers):
-  `K (f16): 1280 MiB, V (f16): 1280 MiB`  =>  **DEFAULT IS ALREADY F16.**
-- RECURRENT cache (48 gated-DeltaNet layers):
-  `R (f32): 180 MiB` (conv state), `S (f32): 4608 MiB` (SSM state)  =>  **f32.**
-
-Consequence: the attention KV is ALREADY at vLLM's 16-bit bit-width. `--cache-type f16`
-is a literal no-op; the cheap KV lever is spent. The f32 lives in (a) the recurrent
-SSM/conv state (matched to vLLM, the bf16 version is shelved for failing the f32 KL
-gate) and (b) the intermediate-activation glue (norms, residual stream, attention
-compute, activations) - that glue is where llama still pays f32 vs vLLM bf16.
-
-## (2) Decode kernel budget  (nsys --cuda-graph-trace=node, npl128, 39 steady steps)
-
-step span 342.0 ms ; sum-of-kernels 338.8 ms ; **kern/span 99.0%** - the decode is
-GPU-bound, kernels back-to-back, nsys overhead negligible. The measured bench step
-(128 tok / 373.5 t/s = 342.8 ms) equals the nsys span, so the %-of-step figures below
-ARE wall-time fractions.
-
-OUT of scope - already precision-matched (83.2% of the step):
-
-| kernel | ms/step | % |
-|---|---:|---:|
-| gated_delta_net (SSM recurrence, f32 BOTH) | 167.1 | 49.3 |
-| mul_mat_q NVFP4 (W4A4 GEMM, BOTH)          |  93.0 | 27.4 |
-| quantize_mmq_nvfp4 (FP4 act-quant)         |  17.6 |  5.2 |
-| mul_mat_q stream_k fixup (FP4 reduction)   |   4.1 |  1.2 |
-
-F16-ABLE GLUE - f32 in llama, bf16 in vLLM:
-
-Budget A (clean compute glue, decoupled from the f32 state):
-
-| kernel | ms/step |
-|---|---:|
-| flash_attn_ext            | 11.94 |
-| unary_gated_op (silu)     |  5.16 |
-| k_bin_bcast (mul)         |  4.72 |
-| rms_norm                  |  3.58 |
-| k_bin_bcast (add, residual) |  1.67 |
-| l2_norm                   |  0.65 |
-| cpy_scalar                |  0.37 |
-| rope                      |  0.26 |
-| sigmoid                   |  0.22 |
-| softplus                  |  0.09 |
-| flash_attn fixups         |  0.08 |
-| **Budget A total**        | **28.74 ms = 8.4% of step** |
-
-Budget B (+ the non-FP4 cublas GEMM): + nvjet 12.17 ms => **40.91 ms = 12.0%**.
-
-Recurrence-coupled data movement (NOT bit-safe f16-able - needs the f32 state to go
-bf16, which is the shelved work that fails the f32 KL gate):
-ssm_conv 8.37 + k_get_rows_float 6.98 + k_set_rows 0.66 + gdn_gather 0.06 = 16.07 ms = 4.7%.
-
-## (3) Cache-type A/B  (decode_agg S_TG t/s, dense)
-
-| npl | DEFAULT | F16-explicit | Q8_0 |
-|---:|---:|---:|---:|
-|  32 | 209.05 | 208.75 | 208.63 |
-| 128 | 373.46 | 373.56 | 374.71 |
-
- F16-explicit == DEFAULT (0.03% delta) => proves the default KV is already f16; the
-  flag is a no-op.
- Q8_0 (8-bit, half the f16 KV bytes) is within noise at every npl => the attention KV
-  bandwidth is NOT a decode bottleneck (it is 16/64 layers; flash_attn is 3.5% of the
-  step). The KV-cache dtype is not a decode lever for this model.
- Coherence (48-tok greedy, "The capital of France is"): default and q8_0 both fully
-  coherent; q8_0 only causes minor greedy-path divergence, no quality break. But since
-  q8_0 buys zero speed and is not bit-exact, it is pointless here.
-
-## Read: how much of the ~8% dense residual is f16-closable
-
-The gap is ~27 ms/step (llama 332.8 ms vs vLLM 305.7 ms at npl128).
-
-f16 does not zero the glue; it only speeds it up. Realistic recovery:
- Memory-bound glue (norms + elementwise + activations + copies + rope = 16.7 ms):
-  f16 halves the bytes => ~50% => ~8.4 ms.
- flash_attn_ext (12.0 ms): KV is ALREADY f16 and the accumulation must stay f32
-  (vLLM also f32-accumulates), so only the Q/projection side helps => ~25% => ~3.0 ms.
- Budget A realistic recovery ~= **11.4 ms**.
- nvjet non-FP4 GEMM (12.2 ms): bf16 tensor cores vs f32 => ~40-50% => ~5 ms, but
-  uncertain (may already run TF32) => Budget A + nvjet recovery ~= **16 ms**.
-
-So f16/bf16 glue realistically recovers **~11 ms (glue only) to ~16 ms (+GEMM) of the
-~27 ms gap = roughly 40-60% of the dense residual.** That moves parity 91.8% ->
-~95-96%, NOT a full close. The remaining ~3-4% is structural: cublas GEMM efficiency
-on the non-FP4 paths, graph/launch scheduling vs vLLM, and the irreducible f32
-accumulation in attention and the recurrence.
-
-Caveats for a build decision:
-1. The single largest f16-able line (flash_attn 11.9 ms) is the LEAST recoverable
-   (KV already f16, accumulate stays f32). The cleanly recoverable mass is the
-   norms+elementwise+activations (~16.7 ms).
-2. The recurrence-coupled 4.7% (ssm_conv + state gather) is only f16-able by taking the
-   SSM/conv state to bf16 = the already-built, already-shelved work that fails the f32
-   KL gate. It is OUT of a bit-safe f16 build.
-3. f16 glue is NON-bit-exact (same category as the shelved bf16 SSM state). It would be
-   an OPT-IN fast path, not the bit-exact default. Realistic ceiling ~95-96% parity for
-   a meaningful (norms/elementwise/activations + optionally nvjet) f16 conversion, at
-   the cost of leaving the 95%-bit-exact f32 plateau.
-
-## (4) What it costs to capture it: NOT a flag  (source map, read-only)
-
-The asymmetry confirmed at the source level (DGX `~/llama-paged-dev` @ f7409c2, tree
-git-clean; vLLM ref from BITEXACT_VS_VLLM.md):
- vLLM `text_config.dtype = bfloat16` => the ENTIRE non-quantized compute (residual
-  stream, RMSNorm I/O with f32-internal reduction, FlashAttention out, SiLU, gating,
-  conv state) runs in BF16. Only the gated-DeltaNet temporal SSM state is f32
-  (`mamba_ssm_dtype="float32"`, matched to llama).
- llama's intermediate activations are F32 **by construction, everywhere**:
-  `ggml_mul_mat` hardcodes an F32 result (ggml.c:3250), so the stream snaps back to F32
-  after EVERY projection (Q/K/V/O, wqkv, ssm in/out, ffn up/gate/down, eh_proj, lm_head).
-  `ggml_rms_norm`/`ggml_l2_norm`/`ggml_silu`/`ggml_add`/`ggml_mul`/`flash_attn_ext`/
-  `ggml_ssm_conv` all preserve/emit F32. There is no point where the stream is f16.
-
-There is **no vLLM-style global model-compute-dtype knob** in ggml/llama. You cannot flip
-one model-load flag. Three escalating options, all opt-in / non-bit-exact:
-
- A flag: does not exist and cannot exist as-is - the F32 is structural, not a default.
- Option 1 (targeted per-op f16, no new kernels): silu/sigmoid/softplus (unary.cu),
-  add/mul (binbcast.cu), rope already have f16 paths. But the residual stream stays F32,
-  so each op must be wrapped cast(F16)->op->cast(F32), adding 2 `cpy` ops per op. At
-  decode these ops are tiny and memory-bound; the cast traffic ~= the op traffic, so the
-  net win is near-zero or negative unless the cast is FUSED into the producer/consumer.
-  Crucially this CANNOT capture the norms - the largest glue item.
- Option 2 (the real lever, multi-file code change): carry the residual stream in F16
-  across the layer, cast to F32 only at the quantize boundary. Requires (a) f16 projection
-  output (patch `ggml_mul_mat` to honor a dst-type, or a cpy->F16 after each proj),
-  (b) **NEW F16 template instantiations in norm.cu** for rms_norm / l2_norm / fused
-  rms+mul / fused rms+mul+add (today hard-`GGML_ASSERT(type==F32)` at norm.cu:441-442,
-  465-466, 525-527, 601-604) keeping the f32 reduction, (c) optionally an F16 ssm-conv.cu,
-  plus graph-dtype plumbing in qwen35.cpp / llama-graph.cpp to thread F16 through
-  inpL/cur/the residual adds. The single biggest code item is the norm.cu f16 kernels -
-  the exact band vLLM runs in bf16 that Option 1 cannot reach.
-
-Must-stay-f32 regardless (vLLM does the same): RMSNorm/L2Norm sum-of-squares reduction;
-FlashAttention KQ/softmax accumulation (forced `GGML_PREC_F32`, llama-graph.cpp:2117);
-the gated-DeltaNet recurrent SSM temporal state (f32 BOTH engines, out of scope); the
-src1->q8_1/nvfp4 activation quantization reads F32, so the stream must be F32 at every
-projection boundary no matter what.
-
-## Verdict: probe-further-then-decide, leaning not-worth-it for the default
-
-f16 does NOT meaningfully close the dense residual on its own, and what it can close is a
-multi-file non-bit-exact build, not a flag.
-
- Precision is NOT the dominant cause of the 8% gap. 83.2% of the decode step (recurrence
-  49.3% + FP4 GEMM 27.4% + FP4 act-quant/fixup 6.4%) is already precision-matched f32/W4A4
-  on both engines. The f16-able glue is only 8.4% of the step (Budget A); of the ~27 ms
-  gap, f16 realistically recovers ~11 ms (glue) to ~16 ms (+ the uncertain nvjet GEMM) =
-  40-60% of the residual. The remaining ~3-4% is kernel/scheduling efficiency (non-FP4
-  cublas GEMM, graph-launch overhead, irreducible f32 accumulation) that f16 cannot touch.
- The recoverable mass is the norm+elementwise+activation band, which is precisely the
-  part that needs NEW f16 norm kernels (Option 2). The no-new-kernel ops (Option 1) are
-  too small and their cast overhead likely eats the win.
- Any version is opt-in / non-bit-exact, the same gate-failing category as the already
-  shelved bf16-SSM-state work. It cannot be the bit-exact f32 default; it is a second,
-  separately-maintained fast path with a ~95-96% ceiling.
-
-Recommendation: do NOT build the f16 glue path now. Ship the 95%-bit-exact f32 plateau
-(patches 0018-0023) as the default. If chasing the last 4% later, the only lever worth a
-build is Option 2's norm.cu f16 kernels + f16 residual stream (recovers the norm/elementwise
-band, ~11 ms); gate it behind an explicit opt-in flag and validate it against the same KL
-threshold as bf16-SSM before shipping. The non-FP4 cublas GEMM efficiency and graph-launch
-scheduling - the structural ~3-4% - are a better long-term target than precision, because
-they help the bit-exact default too.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
+++ b/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
@@ -1,532 +0,0 @@
-# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — build-ready scope + honest go/no-go
-
-Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** Track B is the
-residual-kernel track after track A (fuse the standalone `quantize_mmq_fp4` activation-requant, the
-8.2% decode bucket — tasks 38-41, the fused `rms_norm+mul+nvfp4-quant` producer + prequantized-MMQ
-consumer) is handled separately. Track B owns the **weight GEMM**, the ~59% bucket.
-
-**The load-bearing question, restated:** at the decode batch shape (M≈128 tokens fused into one
-ubatch, NVFP4 weights), is the weight GEMM **compute-bound** (FP4-MMA throughput is the lever →
-parity reachable with a better kernel) or **bandwidth-bound** (273 GB/s weight-read is a hard floor →
-parity capped)? And given the GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach
-vLLM's **391 (dense) / 811 (MoE)** decode-agg tok/s @npl128, or only partway?
-
-Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x.
-Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the
-committed nsys decomposition + measured GB10 specs + a source read of the FP4-MMA kernel; **no new GPU
-benchmarks were run** (track A is on the box).
-
-## 0. Grounded inputs (measured, committed)
-
-| quantity | value | source |
-|---|---|---|
-| LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` |
-| LPDDR5x bandwidth (achieved, batch-1 weight read) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 study |
-| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 |
-| BF16 / INT8 peak | ~213 TFLOP/s / ~215 TOPS (INT8 == BF16 on GB10) | same §2 |
-| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); **~3% dense / ~35%-of-BW MoE at decode** | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
-| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file; ~18 GB matmul tensors | `du` on DGX |
-| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step @npl128 (~98% experts hit) | `du` on DGX |
-| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
-| Measured per-step @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
-| Aggregate decode @npl128 (the parity scoreboard) | dense **161** (llama) vs **391** (vLLM); MoE **333** vs **811** | `QWEN36_NVFP4_BENCH.md` |
-
-`decode_agg = npl / step_s = 128 / step_s`. Crossover formula throughout:
-`M* = b · peak / (2 · BW)`, `b` = bytes per weight element (set compute time `2·M·P/peak` equal to
-weight-read time `b·P/BW` for `P` params). Below `M*` bandwidth-bound, above it compute-bound.
-
---
-
-## 1. The kernel-approach decision: TUNE the existing FP4-MMA `mul_mat_q`, do NOT write a cutlass kernel
-
-This is the first thing track B must settle, and the evidence settles it decisively.
-
-| option | verdict | why |
-|---|---|---|
-| **(A) Tune the existing `mul_mat_q<NVFP4>` FP4-MMA path** | **CHOSEN — the tractable spine** | The kernel already exists, is **bit-exact** (`test-backend-ops MUL_MAT` 1103/1103), is genuine **W4A4** (below), and already **beats vLLM at batch-1 prefill** (MXFP4 1153 t/s vs vLLM's 800 W4A16 — vLLM has no FP4 cubins on sm_121). The deficit is **decode-shape scheduling**, not the math op. Host-side selection + a bounded occupancy tune respects the GB10 lessons and is build-ready against known files/lines. |
-| **(B) New cutlass-style SM120 FP4 collective** | **REJECTED** | Repeats the **proven GB10 dead-end**: the from-scratch W4A16 BF16 GEMM hit only ~9–15 TFLOP/s (¼ of MMQ) and was **STOPPED** (`W4A16_MARLIN_KERNEL_PLAN.md`) because deep `cp.async` + XOR-swizzle **collapse GB10 occupancy**. Worse, **CUTLASS's own SM120 grouped block-scaled FP4 GEMM is broken on consumer Blackwell** (garbage/init-fail — CUTLASS #3096/#2800) — it is the exact reason vLLM falls back to **BF16 Marlin** for its MoE on sm_121. "Port cutlass" is not even a working option for the MoE arm. |
-| **(C) Marlin-style W4A16 (FP4→BF16 dequant + BF16 HMMA)** | **REJECTED for the win, noted for context** | This is what **vLLM's MoE actually runs** on sm_121 (W4A16, BF16 activations, dequant-in-mainloop). On GB10 **INT8 == BF16 == ½ FP4 rate**, so a BF16-HMMA path concedes the 2× FP4 advantage llama already has. We do not want to *descend* to vLLM's slower arithmetic class; we want to keep the FP4-MMA class and schedule it better. |
-
-**Decision: track B = tune `mul_mat_q<NVFP4>` (dense, `mmq.cu`/`mmq.cuh`) + the grouped `mul_mat_q`
-id-branch (MoE, `mmid.cu` + the same `mmq.cuh`).** No new kernel, no rewrite, no descent to BF16.
-The win is kernel *engineering around an FP4-MMA llama already possesses*, so there is **no
-hardware-instruction wall** — but it is gated by whether MMQ's occupancy-bound design can be pushed
-to the bandwidth floor at the thin decode M-tile.
-
-### What "the existing path" actually is (source-read, DGX `ggml/src/ggml-cuda/`)
-
-Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens fused into one
-ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128× M=1). The NVFP4 path:
-`mmq.cu` `use_native_fp4` gate (L125) → `quantize_mmq_fp4_cuda` act-quant (L138 dense / L200 id;
-**track A's fuse target**) → `mul_mat_q` → `vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) →
-`mma_block_scaled_fp4` (`mma.cuh:1126`).
-
-**Confirmed W4A4 (this corrects an earlier "A is 8-bit-class" framing):** `block_fp4_mmq`
-(`mmq.cuh:53`) is `uint32_t d4[4]` (four `ue4m3` block scales) + `int8_t qs[4*32]` = **256 FP4 (e2m1)
-values packed 2-per-byte**. `quantize_mmq_fp4_cuda` (`quantize.cu:422`) emits FP4 via
-`ggml_cuda_float_to_fp4_e2m1`. The MMA is
-`mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`
-(`mma.cuh:1145`) — **both operands e2m1, ue4m3 block scales**. So llama's dense FP4-MMA path is
-already the *same arithmetic class as vLLM's cutlass W4A4 dense*. The `sizeof(block_fp4_mmq) ==
-sizeof(block_q8_1_mmq)` static_assert is a shared-tile-footprint convention, **not** an 8-bit
-activation. **Consequence: there is no "make activations 4-bit" work to do and no activation-traffic
-halving to win — that is already banked. The entire dense deficit is scheduling/occupancy.**
-
-Geometry (`vec_dot_fp4_fp4_mma`): `MMQ_NWARPS=8`, `iter_k=MMQ_ITER_K_FP4=512`, tiles
-`tile_A<16,8,int>` (weights, 16 N-rows × 64 FP4-in-K), `tile_B<8,8,int>` (acts, 8 M-cols × 64
-FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags = MMQ_TILE_NE_K/tile_A::J`. The M loop
-is `for (j0=0; j0<mmq_x; j0 += ntx*tile_C::J)` — M tiled in steps of `tile_C::J=8`.
-
---
-
-## 2. The roofline — answering the load-bearing question
-
-**Answer: BANDWIDTH-bound on the hardware roofline, but pushed into COMPUTE-bound territory in
-practice by the kernel's own under-occupancy. The 273 GB/s is NOT the wall at the parity target.**
-
-### 2a. DENSE Qwen3.6-27B, M=128
-
-`b = 18e9/27e9 = 0.667 B/param`; FLOPs/step `= 2·128·27e9 = 6.91 TFLOP`.
-
- **Weight-read floor** (18 GB read ONCE for all 128 tokens): @273 GB/s = **65.9 ms → 1,942 tok/s**;
-  @216 GB/s = 83 ms → 1,542 tok/s.
- **Crossover** at FP4 peak: `M* = 0.667·500e12/(2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM
-  at decode is BANDWIDTH-bound.** At the kernel's *achieved* ~3% efficiency the effective peak
-  collapses to ~15 TFLOP/s and drags M* to ≈18, putting the *current* kernel in self-inflicted
-  compute-bound territory.
- **Where llama sits:** GEMM = 59.2% × 795 ms = **471 ms = 14.7 TFLOP/s = 2.9% of FP4 peak = 7.1×
-  slower than the 66 ms weight-read floor.** Not a bandwidth wall — a kernel running deep in
-  compute-bound territory at single-digit efficiency.
- **Where vLLM sits:** step 328 ms ≈ llama's GEMM bucket (471 ms) alone. The **entire 2.42× gap is
-  the GEMM.**
-
-### 2b. MoE Qwen3.6-35B-A3B, M=128
-
-@npl128, 128 tok × top-8 / 256 experts ⇒ ~98% experts read ⇒ ~22 GB/step (the full weight set), per-
-expert M ≈ **4 tokens**.
-
- **Weight-read floor:** 22/273 = **80.6 ms → 1,588 tok/s** (@216: 102 ms → 1,255).
- **Compute floor:** only ~3B active params ⇒ 0.77 TFLOP ⇒ 1.5 ms @peak — **trivial. MoE decode is
-  purely bandwidth/occupancy-bound, never compute-bound.** The hard part is saturating 273 GB/s while
-  feeding ragged M≈4 tiles.
- **Where llama sits:** GEMM = 59% × 384 = **227 ms = 97 GB/s = 35% of peak BW** (occupancy/tile-fill
-  loss, not compute).
- **Where vLLM sits:** step 158 ms ≈ grouped Marlin-NvFp4 at the ~80 ms floor + ~78 ms non-GEMM —
-  already pushing the MoE BW floor.
-
-**Both weight-read floors (dense ~1,940, MoE ~1,590 tok/s) sit well ABOVE vLLM's 391/811 (~5× dense,
-~2× MoE). Bandwidth is not the wall; the GB10 FP4-MMA occupancy efficiency is.**
-
---
-
-## 3. The code-level inefficiencies, and the M-tile asymmetry that drives the whole plan
-
-The selection is `mul_mat_q_case` (`mmq.cuh:4108`): it loops `mmq_x = 8..mmq_x_max(=128) step 8` and
-keeps the `mmq_x` that **minimizes `ntiles_x = ceil(ncols_max/mmq_x)`**, stopping at `ntiles_x==1`.
-`mmq_y` (the weight-row tile) is pinned at **128** by `get_mmq_y_host` (L143). This produces the
-single most important structural fact for track B:
-
-> **`mmq_x` tiles M (tokens / output columns) — shrinking it RE-READS the weights `ntiles_x` times.
-> `mmq_y` tiles N (weight rows / output rows) — shrinking it does NOT re-read weights (each weight row
-> lives in exactly one row-tile); it only lowers shared footprint and raises occupancy.** The two
-> regimes pick opposite knobs:
-
-| | dense decode (M=128, no `expert_bounds`) | MoE decode (per-expert M≈4) |
-|---|---|---|
-| selection picks | `mmq_x=128` → `ntiles_x=1` → **weights read ONCE** (the one-read optimum) | `mmq_x=128` applied **per expert** → tile ~3% filled |
-| shrink `mmq_x`? | **NO — re-reads 18 GB ×`ntiles_x`**, fatal in the BW-bound regime | **YES, FREE** — 1 col-tile/expert regardless, no re-read → strictly occupancy-positive |
-| FP4-MMA M-frag fill | **full** (128/`tile_C::J`=16 frag-groups, all live) → no fragment waste | **wasted** (~1 of 8 (`mmq_x=64`) or 16 (`mmq_x=128`) frag-groups live, rest masked tails) |
-| BW-neutral occupancy lever | **`mmq_y`↓** (more resident CTAs, weights still read once) — kernel-structure change | **`mmq_x`↓** (toward density ≈8) — host-side template switch |
-| dominant loss | **occupancy** at the heavy 128×128 tile (exposed weight-load latency) | **tile-fill** (dense-tuned M-tile applied to ragged M≈4) |
-
-This asymmetry is the spine of the plan: **MoE's lever is host-only `mmq_x`↓ (already landed as patch
-0015 auto-cap→64; ideal ≈8–16); dense's lever is `mmq_y`↓ + occupancy, a bounded kernel change.**
-
-The five inefficiencies, ranked:
-
-1. **Separate activation-quant pass (track A's bucket, 8.2%).** `quantize_mmq_fp4_cuda` writes the
-   whole activation tensor to `block_fp4_mmq` in a standalone kernel; vLLM fuses `scaled_fp4_quant`
-   into the preceding RMSNorm/SiLU epilogue. **Handoff (track A → B):** B must consume A's prequantized
-   `block_fp4_mmq` y-tile in place of calling `quantize_mmq_fp4_cuda`, so the fusion saves the
-   activation round-trip, not just the launch (see §4.4).
-
-2. **No weight-load software pipeline → exposed latency at thin M (the #1 dense kernel lever).**
-   `load_tiles_nvfp4_nvfp4` (`mmq.cuh:946`) does plain global→shared stores → `__syncthreads` →
-   `vec_dot_fp4_fp4_mma` (`load_ldmatrix` of A + MMA): a **load→sync→compute→repeat** cadence with **no
-   `cp.async` double-buffering** overlapping the next k-block weight load with the current MMA. At
-   M=128 the per-tile MMA work is small, so serialized weight-load latency dominates → 2.9% (dense) /
-   35%-of-BW (MoE). **Caveat (the GB10 wall):** a *deep* pipeline + XOR-swizzle collapses GB10
-   occupancy (`W4A16_MARLIN_KERNEL_PLAN.md`). The fix is **occupancy-first** (raise resident CTAs to
-   hide latency via CTA-parallelism), **shallow 2-stage prefetch second**, never Marlin's 4-stage.
-
-3. **`mmq_x` maximized for dense = occupancy-heavy, but pinned by the one-read constraint.** At dense
-   decode the 128×128 tile (8 warps, large shared) is low-occupancy on the occupancy-dominated GB10 —
-   but you cannot shrink `mmq_x` without doubling the 18 GB weight read. So the dense occupancy fix is
-   **`mmq_y`↓** (BW-neutral), not `mmq_x`↓.
-
-4. **MoE per-expert M-tile waste (the structural MoE gap).** The 128-wide (or patch-0015 64-wide)
-   tile is applied per expert at density ≈4, so the accumulator is ~3–6% filled and ~1 `tile_C` frag-
-   group is live, the rest masked `need_check` tails. Ideal `mmq_x` ≈ tokens/expert ≈ 8 (= `tile_C::J`).
-   At ≤1 col-tile/expert this costs **no** extra weight read → strictly occupancy-positive. (This is
-   the MoE arm of inefficiency 3; scoped in `MOE_GROUPED_GEMM_SCOPE.md`.)
-
-5. **`iter_k=512` (FP4) couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → larger
-   shared footprint → adverse in the occupancy-bound regime. A P2 tuning knob.
-
-**Ruled out (do not chase):** redundant weight reads on the *current* selection (none — dense
-`ntiles_x=1`, MoE ≤1 col-tile/expert); stream-K fixup (it *helps* fill the small GB10 grid at thin M);
-raw FP4-MMA peak rate (already beats Q4-MMQ and is BW-bound at batch 1 — latency-hiding binds first).
-
---
-
-## 4. The specific build-ready changes
-
-All against DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`. Every change is gated and defaults to exact
-stock behavior until proven.
-
-### 4.1 Dense M-tile / occupancy (the make-or-break)
-
- **Keep `mmq_x=128` at dense decode** (the one-weight-read optimum; do **not** shrink it — that
-  re-reads 18 GB). Lock this as an invariant in P0.
- **Make `mmq_y` decode-selectable** (`get_mmq_y_host`/`get_mmq_y_device`, L143/L157). Today pinned
-  128; try **64** (and 96) at decode. `mmq_y` is coupled to `nwarps × tile_C::I` via the MMQ
-  static_assert, so this is a **warp/fragment remap** (bounded kernel change), not a pure host switch:
-  fewer N-frags per warp or fewer warps → smaller per-CTA shared → **more resident CTAs → latency
-  hidden by CTA-parallelism**, with **weights still read once** (BW-neutral). This is the primary
-  dense occupancy lever and respects every GB10 rule.
- **Host-only knobs first (P1, zero kernel):** the `mmq_get_granularity_host` choice (L274 — sets
-  `rows_per_warp=2·granularity`, `ntx`), and the stream-k-vs-xy-tiling threshold (`launch_mul_mat_q`
-  ~L3954, `tiles_efficiency_percent` L4001). Plus one **empirical A/B**: does eating a 2× weight
-  re-read at `mmq_x=64` buy enough occupancy to net positive? (Diagnostic: if yes, occupancy is badly
-  broken and P2 `mmq_y`↓ has large upside; if no, the tile is already BW-saturated and P2's ceiling is
-  lower.) All behind `GGML_CUDA_FP4_MMQ_Y` / `GGML_CUDA_FP4_GRAN` / `GGML_CUDA_FP4_FORCE_STREAMK`.
-
-### 4.2 FP4-MMA fragment usage
-
- Fragments stay `tile_A<16,8,int>` / `tile_B<8,8,int>` / `tile_C<16,8,float>` — these match the
-  `m16n8k64` block-scaled FP4 MMA and must not change (they are the instruction shape). At dense M=128
-  all 16 `tile_C::J`-groups are live → **no dense fragment work needed**. The lever is *how many of
-  these tiles are resident per SM* (occupancy), set by `mmq_y`/`nwarps`/granularity, not the fragment
-  shape.
- MoE: shrink `mmq_x` toward `tile_C::J`=8 so the live frag-group count matches density (§4.3).
-
-### 4.3 MoE M-tile (`MOE_GROUPED_GEMM_SCOPE.md`, partly landed)
-
- **Patch 0015 already auto-caps `mmq_x`→64 at decode** via per-expert density in `mul_mat_q_case`
-  (the `expert_bounds != nullptr` block, L4118-4165; env `LLAMA_MOE_DECODE_TILE`,
-  `LLAMA_MOE_DENSITY_MAX`). Tighten the decode tile toward **8–16** (= density) and sweep.
- **Optional [2]: block-padded `mm_ids_helper`** (`mmid.cu`) — pad each expert segment to a multiple
-  of the tile, removing `need_check` masked tails and tightening the stream-k schedule. Medium risk
-  (scatter + write-back masking); behind `LLAMA_MOE_BLOCK_ALIGN`.
-
-### 4.4 Scale handling + the act-quant fusion handoff (the track A → B ABI contract)
-
- **Weight scales** (`ue4m3`, one per 16 weights) load in `load_tiles_nvfp4_nvfp4` into `x_sc`
-  (`x_u32 + 64 + kbx`), consumed as `scaleA` in `vec_dot_fp4_fp4_mma` and passed as the block-scale
-  operand to `mma_block_scaled_fp4`. **No change** — already a first-class MMA scale operand.
- **Activation scales** (`ue4m3`) live in the `block_fp4_mmq` y-tile `d4[4]`, consumed as `scaleB`.
- **The handoff contract:** track B must hold the **`block_fp4_mmq` y-tile layout invariant**
-  (`uint32_t d4[4]` ue4m3 scales + `int8_t qs[128]` = 256 packed FP4, `mmq.cuh:53`). Track A's fused
-  `rms_norm+mul+nvfp4-quant` producer (task 39) writes exactly this struct; track B's "prequantized
-  MMQ consumer" (task 40) makes `mul_mat_q` accept a prebuilt `src1_q8_1` buffer and **skip the
-  `quantize_mmq_fp4_cuda` call** (`mmq.cu:138`/`200`). The numerics must be **bit-identical** to the
-  unfused path (same `e2m1` rounding, same `ue4m3` block scale per 16) so the parity gate stays green
-  with the fusion on or off. B owns the consumer seam; A owns the producer kernel; the `block_fp4_mmq`
-  struct is the frozen interface between them.
-
-### 4.5 GB10-fit rules (binding constraints on every kernel change)
-
- **Small shared mem + high occupancy.** Do **not** add deep `cp.async` stages or XOR-swizzle shared
-  layouts — they are exactly what collapsed W4A16 on GB10 (`W4A16_MARLIN_KERNEL_PLAN.md`: a 16 KB
-  XOR-swizzle dropped q4_K from 6.63→2.84 TFLOPS).
- **Preserve the skew-pad** (`MMQ_MMA_TILE_X_K_FP4 = 2·MMQ_TILE_NE_K + 8 + 4`, the `% 8 == 4`
-  padding, `mmq.cuh:221/233`) — conflict-free `ldmatrix` at ~zero shared cost.
- **Stay on the FP4-MMA path** (`block_fp4_mmq` / `mma_block_scaled_fp4`) — the only path at GB10's
-  FP4 = 2× INT8/BF16 rate. Never descend to BF16/INT8 (1:1 on GB10).
- **Occupancy beats a conflict-free-but-wide layout.** Buy latency-hiding with *more resident CTAs*
-  (smaller `mmq_y`, smaller shared), not a deeper pipeline.
- Tuning is **empirical** — `nsys` (throughput) is available, **`ncu` is not** on the DGX (no driver
-  perms). Sweep configs, measure decode_agg, bracket thermals (same-session cold A/B only).
-
---
-
-## 5. Correctness / parity gate (every phase)
-
- **Primary, bit-exact:** `test-backend-ops test -o MUL_MAT -b CUDA0` and
-  `test-backend-ops test -o MUL_MAT_ID -b CUDA0` must stay **1103/1103** with the flag set **and**
-  unset, and **byte-identical** when unset. The CPU reference is the deterministic oracle; the op test
-  is exact (the GB10 greedy-decode non-determinism band applies only to end-to-end, never to the op
-  test).
- **Add decode-shape cases if absent:** `type_a ∈ {NVFP4, MXFP4}`, `type_b = F32`, dense **n=128** at
-  the real FFN K/N; for `_ID`, `n_mats=128, n_expert_used=8, n_tokens ∈ {8,32,64,128}` **plus ragged
-  small-M** (experts with 0/1/2 tokens, `n_tokens` not a multiple of `mmq_x`) — exactly where `mmq_x`/
-  `mmq_y` changes and block-pad masking can leak.
- **Fusion-handoff parity (P3):** with track A's fused producer on, the prequantized-consumer path
-  must produce dst **identical** to the unfused `quantize_mmq_fp4_cuda` path (same `e2m1`/`ue4m3`
-  rounding).
- **End-to-end:** `llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`
-  (dense) and `q36-35b-a3b-nvfp4.gguf` (MoE); confirm decode_agg climbs per §6 and output stays within
-  the documented CUDA batch-shape non-determinism band vs the CPU oracle. All scripts **dev-tree-only**.
-
---
-
-## 6. Phased plan, with expected decode_agg at each phase
-
-Per-step model used (ms @npl128): **dense 795** = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162;
-**MoE 384** = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81. `decode_agg = 128 / step_s`.
-
-### DENSE (parity target 391)
-
-| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 391** | risk |
-|---|---|---:|---:|---:|---:|---|
-| **P0** harness | Lock baseline: 1103/1103, decode n=128 perf, nsys window, the 471 ms / 2.9% eff datum. Pin `mmq_x=128` one-read invariant. | 471 | 795 | **161** | 41% | low |
-| **P1** host-only tile/grid + re-read A/B | granularity + stream-k threshold sweep; the `mmq_x=64` re-read-vs-occupancy diagnostic. **Honest: small** — `mmq_x` is pinned, so this mostly de-risks P2. | ~400 | ~724 | **~177** | ~45% | low |
-| **P2** `mmq_y`↓ + occupancy/shallow-prefetch | The make-or-break: raise resident CTAs (`mmq_y` 128→64, granularity, shallow 2-stage weight prefetch, skew-pad), push GEMM toward the **66–81 ms BW floor (17–21% FP4 eff)**. **KILL-GATE: if eff plateaus <15% (GEMM >110 ms) → dense parity OFF, report partial.** | **66–81** | 390–405 | **316–328** | **81–84%** | **med-high** |
-| **P3** co-land track A | Consume A's prequantized `block_fp4_mmq` y-tile; the 65 ms act bucket folds away. | 66–81 | **325–340** | **376–394** | **96–101%** | low |
-
-Dense climb: **161 → ~177 → 316–328 → 376–394** tok/s = **41% → 45% → 81–84% → 96–101% of vLLM 391.**
-Robust to the 273-vs-216 GB/s uncertainty (@216 GB/s P3 → ~359 tok/s = 92%). **Parity within error,
-contingent on P2 clearing the kill-gate and on A landing.**
-
-### MoE (parity target 811)
-
-| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 811** | risk |
-|---|---|---:|---:|---:|---:|---|
-| **P0** harness | Lock 1103/1103 + the monotonic `85→1771` batched-bench curve + 227 ms / 35%-BW datum. | 227 | 384 | **333** | 41% | low |
-| **P1/P4** MoE `mmq_x`↓ (patch 0015 → tighten to 8–16) | Free per-expert tile shrink (no re-read); reclaim the 3–6% fill waste, raise occupancy. | ~140 | ~297 | **~431** | ~53% | low |
-| **P2** block-pad align + occupancy | Remove `need_check` tails, tighten stream-k; push toward the 80 ms floor. | ~100 | ~257 | **~498** | ~61% | med |
-| **P3** co-land track A | act bucket (31 ms) folds away; GEMM at the ~80 ms floor. | 80 | **207** | **618** | **76% — CEILING** | low |
-
-MoE climb: **333 → ~431 → ~498 → 618** tok/s = **41% → 53% → 61% → 76% of vLLM 811.** **The 76% is the
-hard ceiling from the GEMM track:** even a *perfect* weight-read-floor grouped GEMM leaves llama's
-non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) at **1.6× vLLM's whole ~78 ms non-GEMM**, so the step
-cannot drop below ~207 ms. The remaining ~49 ms to vLLM's 158 ms step is elementwise + host-loop
-(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md`), **outside track B.**
-
-### Explicitly NOT in scope (and why)
-
- A from-scratch W4A16 / CUTLASS SM120 collective — repeats the STOPPED occupancy dead-end and
-  CUTLASS's grouped FP4 is broken on sm_121.
- Deep multi-stage `cp.async` / XOR-swizzle — proven to collapse GB10 occupancy.
- "Make activations 4-bit" — already W4A4; no work, no win there.
- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity but
-  **separate tracks**; B owns the GEMM only.
-
---
-
-## 7. The honest ceiling — does B reach TRUE PARITY?
-
- **DENSE: TRUE PARITY is PLAUSIBLY REACHABLE, conditional, no margin.** The entire 2.42× gap is the
-  GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is **bandwidth-bound, not
-  hardware-capped**. **B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s = 96–101% of vLLM 391.**
-  The catch: it needs **~17–21% FP4-MMA efficiency at decode M=128**, and GB10 has only demonstrated
-  ~17% — and that at the *easier* prefill M=512 tile. It is a **reach, not a lock**, gated by the P2
-  occupancy kill-gate and contingent on track A. **GO (conditional).**
-
- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~76% of vLLM (618 vs 811)**
-  even with a perfect weight-read-floor grouped GEMM, because (1) the MoE floor is the hardest
-  grouped-GEMM regime (M≈4/expert, vLLM ships purpose-built Marlin-NvFp4) and (2) ~24% of the step is
-  non-GEMM outside this track. Worth doing (333 → ~618, a 1.85× and a real win), but it **cannot
-  deliver 811 alone.** **PARTIAL / NO-GO for parity-from-B.**
-
- **The 273 GB/s is not the ceiling — the GB10 FP4-MMA occupancy efficiency is.** Decode M=128 is a
-  *different* regime from the dead W4A16 path: bandwidth/occupancy-bound (saturate LPDDR5x at a thin
-  M-tile via resident CTAs), not compute-throughput-bound (pack MMAs). The existing path is already at
-  the BW floor at batch 1 (88 ms), so the work is **keeping it bandwidth-bound as M grows to 128**
-  (occupancy via `mmq_y`↓ + shallow prefetch), a **tune of a working path**, not the greenfield
-  rewrite. The binding risk is whether that occupancy can be bought without tripping the GB10 wall —
-  which is exactly what the P2 kill-gate measures.
-
-**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity with
-vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the top of the demonstrated efficiency envelope
-with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track alone (ceiling ~76%);
-MoE parity is a B-plus-non-GEMM program. **Verdict: GO for dense (conditional, B+A, kill-gated),
-PARTIAL for MoE.**
-
---
-
-## 8. One-paragraph summary
-
-The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read
-floors 4–6× above vLLM, so **273 GB/s is not the wall** — but llama's FP4-MMA kernel runs at ~3% of
-FP4 peak, in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). The path is already
-**W4A4** and already **beats vLLM at batch-1 prefill**, so the fix is **tuning the existing
-`mul_mat_q<NVFP4>`**, not a cutlass rewrite (a proven GB10 dead-end, and broken on sm_121 anyway). The
-M-tile asymmetry sets the levers: **dense** is pinned at `mmq_x=128` (one weight read) so its occupancy
-win is **`mmq_y`↓ + shallow prefetch** (BW-neutral), while **MoE**'s win is the free per-expert
-**`mmq_x`↓** (patch 0015). **Track B (GEMM → BW floor) + track A (fuse act-quant)** plausibly reaches
-**96–103% of vLLM dense (391)** — TRUE PARITY on the table for dense, but only at the **top of the
-demonstrated GB10 FP4-efficiency envelope (~17–21%)**, with **no margin**, gated by the P2 occupancy
-kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor
-sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense
-(conditional, B+A), PARTIAL for MoE.**
-
---
-
-## 9. Adversarial review (skeptical staff CUDA engineer, post-W4A16): the parity go / no-go
-
-Reviewer stance: I lived through the W4A16 GB10 effort that plateaued at ~9-15 TFLOP/s (~21% of the
-BF16 ceiling) after multi-week work and was STOPPED at the occupancy wall. I read this scope and the
-grounding (`QWEN36_NVFP4_BENCH`, `VLLM_DECODE_GROUNDING`, `GDN_DECODE_VERIFY`, `DECODE_GAP_STUDY`,
-`BLACKWELL_KERNEL_GAPS`, `W4A16_MARLIN_KERNEL_PLAN`) and stress-tested the verdict against them. Net:
-the plan is **directionally right and tractably scoped**, the kernel-approach decision (tune, do not
-rewrite) is correct, but the **"GO for dense, TRUE PARITY 96-103%" headline outruns its own caveats**.
-The honest landing is **dense ~80-90% (parity is the optimistic tail), MoE ~55-65% (parity not
-reachable from B)**. The decision to commit to B is nonetheless sound, for a reason the doc under-sells
-(low regret), and there is **one technical gap (TMA) and one sequencing error (A last) that must be
-fixed**.
-
-### 9.1 Is this the W4A16 wall again? No - and the batch-scaling signature proves why
-
-The decisive evidence the doc has but does not fully exploit is the **npl-sweep** (`QWEN36_NVFP4_BENCH`):
-dense llama-as-%-of-vLLM = **99 / 56 / 46 / 41** at npl 8 / 32 / 64 / 128. At **npl8 the kernels are at
-parity** (99%); the gap **opens monotonically as M grows**. Decompose this:
-
- At M=8 the dense GEMM is weight-read-bound at the floor (~88 ms, same as batch-1). llama == vLLM there,
-  so **llama's FP4-MMA kernel demonstrably HITS the weight-read floor at small M.** This is the existence
-  proof the W4A16 path never had: it is a *working, floor-reaching* FP4-MMA kernel, not a greenfield
-  build stuck at 1/4 of MMQ.
- At M=128 vLLM's GEMM **stays at ~88 ms** (flat: it amortizes the one weight read over 128 tokens and
-  hides the MMA behind the load), while **llama's balloons to 471 ms** (5.4x). llama **falls off the
-  floor** as M grows; vLLM **holds it**.
-
-So the problem is **not** "build a fast 4-bit GEMM from scratch on an occupancy-hostile part" (the dead
-W4A16 problem). It is **"keep a working FP4-MMA kernel on the bandwidth floor as the M-tile grows from 8
-to 128"** - a tune of a working path. **Verdict: this is NOT the W4A16 wall** (different regime, working
-path, dual existence proof at M=8 and from vLLM at M=128). **But it shares W4A16's one binding
-constraint:** holding the floor as M grows requires hiding LPDDR5x weight-load latency at the larger
-tile, which is the same occupancy / latency-hiding game GB10 historically loses. The doc is right that
-it is a different and more tractable regime; it under-states that the *binding risk is identical*.
-
-### 9.2 Why is vLLM 2.4x faster if both share 273 GB/s? Compute-side scheduling, and the gap is ~82% (not 100%) GEMM
-
-The load-bearing question, settled by 9.1: at M=128 the gap is **not** that vLLM beats the shared
-bandwidth floor - it is that **llama falls off the floor into self-inflicted compute/occupancy-bound
-territory while vLLM stays on it.** The lever is therefore latency-hiding at the M=128 tile
-(compute-side scheduling: occupancy, prefetch, tile shape), with the 273 GB/s weight-read floor as the
-hard target both engines share. This confirms the doc's roofline and its central claim that the kernel,
-not the hardware, is the limiter.
-
-**But the doc's "the entire 2.42x dense gap is the GEMM" is an ~82% truth, not a 100% one.** Decompose
-the dense step (numbers from the doc's own inputs):
-
-```
-llama step @npl128            795 ms   (decode_agg 161)
-vLLM step  @npl128            328 ms   (decode_agg 391)
-total gap                     467 ms
-
-llama GEMM                    471 ms
-vLLM GEMM (at the floor)      ~66-88 ms   (66 @273 GB/s spec, 88 @216 GB/s achieved)
-=> GEMM gap                   383-405 ms  = 82-87% of the 467 ms total gap
-=> non-GEMM gap                62-84 ms   = 13-18% of the total gap
-```
-
-So **B alone (GEMM -> floor) caps near ~80-84%** (step 412-390 ms = 311-328 t/s), **not parity.** Parity
-needs the non-GEMM 62-84 ms too: ~65 ms of it is track A's act-quant bucket, the residual ~0-19 ms is
-elementwise + host outside both A and B. This is the crux of the sequencing answer (9.6): **B is
-necessary but on its own lands ~80%; it is track A that tips dense over the parity line, not B.** The
-parity story is *entirely* contingent on A, which the P3 framing buries.
-
-### 9.3 The sharpest risk the doc misses: vLLM's existence proof uses the technique the doc forbids (TMA)
-
-vLLM holds the M=128 floor with **cutlass SM120 = TMA + a warp-specialized deep async producer/consumer
-pipeline** (Research 1). That deep pipeline is **exactly what the doc forbids on GB10** (rule 4.5: "do
-not add deep cp.async stages ... they collapsed W4A16"). So **B's chosen GB10-friendly route (`mmq_y`-down
-occupancy + a shallow 2-stage prefetch) is a different bet from the one that produced the existence
-proof.** Reaching the same floor by a friendlier route is plausible but **unproven**, and if the
-occupancy-only route plateaus short of the floor, B underperforms its target with no fallback in scope.
-
-The doc conflates two different things under "deep pipeline":
- **manual `cp.async` + XOR-swizzle** - register/shared-hungry, **collapsed W4A16 occupancy on GB10**
-  (correctly banned).
- **TMA (tensor-memory-accelerator) bulk async copy** - a single descriptor drives the copy, **far lower
-  register/occupancy cost**, and it is precisely how cutlass gets pipeline depth **without** the
-  occupancy hit (Research 1 says this explicitly). TMA is available on sm_120/121.
-
-**Recommendation (binding):** B must put a **TMA-driven weight feed in scope as a first-class P2 option**,
-not categorically forbid pipeline depth. The occupancy-only route is the right *first* experiment
-(cheapest, respects the W4A16 lesson), but if P2 plateaus below the floor, **TMA is the demonstrated way
-to get depth without the occupancy collapse** and is what the vLLM existence proof actually uses.
-Declaring the floor "unreachable" without trying TMA would repeat the W4A16 mistake in reverse:
-abandoning the path that works because the *manual* version of it failed.
-
-### 9.4 Tractability: bounded tune, confirmed - with the TMA caveat
-
-The proposed changes are genuinely **bounded and build-ready**, not a greenfield kernel:
- **MoE arm = DEMONSTRATED tractable.** Patch 0015 already auto-caps `mmq_x` per-expert and is committed
-  and measured. Tightening to 8-16 + block-pad is the same lever, lower risk. This is real, banked
-  evidence that the "tune `mul_mat_q`" approach works on this exact kernel family.
- **Dense arm = plausibly bounded.** `mmq_y`-down is a warp/fragment remap that touches the
-  `nwarps x tile_C::I == mmq_y` static_assert coupling, so it is a contained *kernel* edit (not a pure
-  host switch, as the doc itself notes). The host-only P1 knobs are zero-risk. The **prefetch piece is
-  where the residual occupancy risk lives** - and per 9.3, TMA belongs here.
- **Rejecting (B) cutlass-rewrite and (C) BF16-Marlin-descent is correct.** Cutlass grouped FP4 is broken
-  on sm_121 (the reason vLLM itself falls to Marlin for MoE); BF16 Marlin concedes GB10's 2x FP4 edge.
-
-**Verdict: tractable, not greenfield.** The MoE arm is proven; the dense arm is a contained edit with a
-real but bounded occupancy risk, gated by the P2 kill-gate. The one scope gap is TMA (9.3).
-
-### 9.5 Honest expected outcome (the numbers I would defend)
-
-| | B alone | B + A (median) | B + A (optimistic, spec BW) | parity? |
-|---|---:|---:|---:|---|
-| **DENSE** (target 391) | ~80-84% (311-328 t/s) | **~92-95% (360-372 t/s)** | ~101% (394 t/s) | **optimistic tail only** |
-| **MoE** (target 811) | ~53-61% (431-498 t/s) | **~70-76% (570-618 t/s)** | 76% (618 t/s, CEILING) | **no** |
-
-Reconciliation with the doc: the doc's B+A = "96-103%" uses the **spec-BW (66 ms floor)** end. At the
-**achieved 216 GB/s (88 ms floor)** the same arithmetic gives **~94%**, and that still assumes B hits the
-floor. So the honest dense median is **~92-95%, with TRUE PARITY as the upside, not the expectation**,
-contingent on a conjunction of three things: (a) P2 clears the occupancy kill-gate to the floor, (b) the
-GB10-friendly *or* TMA feed actually reaches the cutlass floor (9.3), and (c) track A lands. Three ANDs =
-tail, not median.
-
-**The low-regret point the doc under-sells (and the real reason to commit):** even the *kill-gate-tripped*
-outcome is a large win. At the doc's own 15%-FP4-eff kill threshold (GEMM ~110 ms), B+A still lands
-**~89%** (step 369 ms); at a merely-partial occupancy win (eff 3% -> 5%, GEMM ~276 ms) B+A still lands
-**~61%**. Since the M=8 parity proof guarantees the floor is reachable in principle and patch 0015 proves
-the tune works, **getting *some* improvement at M=128 is high-probability; the only open question is how
-close to the floor.** So the outcome distribution is heavily positive (very likely 60-90%, possibly
-parity) with a bounded downside - B is **low-regret**, which matters more for the go decision than whether
-the parity tail hits.
-
-### 9.6 Sequencing vs track A: land A FIRST (the doc has this backwards)
-
-The doc runs A as a parallel track merging at **P3 (last)**. That is backwards for de-risking, for three
-reasons:
-1. **A defines B's interface.** B's "prequantized-MMQ consumer" consumes A's fused `block_fp4_mmq`
-   producer (the frozen struct in 4.4). Building B against a not-yet-landed producer means B's consumer
-   seam is speculative until P3.
-2. **A defines B's baseline and the kill-gate threshold.** A alone (act-fuse, folding the 65 ms /8.2%
-   bucket, plus any of the elementwise/host it captures) plausibly moves dense **41% -> ~50-55%** before
-   B touches a kernel. B's *true residual is the GEMM after A removed the act round-trip*, not the raw
-   59%. Running B's P2 against the stock 41% baseline mis-sizes the required GEMM speedup and the
-   <15%-eff kill-gate.
-3. **A is lower-risk and independently shippable.** It is the safe win; it should not wait behind the
-   risky kernel tune.
-
-**Recommendation:** land A (tasks 38-41) first, **re-measure** the decode_agg and the GEMM share
-post-A, **then** run B's P2 and recompute the kill-gate against the post-A number. This makes the
-make-or-break decision cheaper, better-informed, and bankable-either-way.
-
-### 9.7 Verdict (go / no-go)
-
- **DENSE: CONDITIONAL GO - commit to B, but scope and message it as "close most of the GEMM gap"
-  (expected ~80-90%, parity the upside), NOT "true parity."** Justified because: the approach is
-  bounded/tractable (9.4), it is a working-path tune with a dual existence proof (9.1), and the outcome
-  is low-regret (9.5) - even a tripped kill-gate roughly doubles today's 41%. Conditions: (i) **land A
-  first** (9.6); (ii) **gate hard at P2** (eff < 15% -> stop chasing parity, but keep the partial win);
-  (iii) **put TMA in scope** as the floor-reaching fallback before declaring the floor unreachable (9.3).
-
- **MoE: NO-GO for parity from B (confirmed).** The doc's ~76% ceiling is honest, arguably optimistic
-  (it assumes the ragged M~4/expert grouped GEMM hits its 80 ms floor, the hardest regime, where vLLM
-  ships purpose-built Marlin). Realistic B+A landing **~70-76%**, B alone ~53-61%. Still worth doing -
-  the `mmq_x`-down / block-pad work is cheap and partly landed (patch 0015) - but it must be sold as a
-  **1.7-1.85x win, not parity**; MoE parity is a **B-plus-non-GEMM** program (elementwise fusion, host
-  CUDA-graph, GDN bf16 state).
-
- **One line for the parent:** GB10 can plausibly reach **dense** decode parity with vLLM only at the
-  **top of its FP4 envelope and only as B + A together** (B alone caps ~80%; A is what tips it over),
-  and **cannot** reach **MoE** parity from the GEMM track alone (ceiling ~76%). **Commit to B** as a
-  high-value, low-regret, bounded GEMM-gap-closing tune (honest expected landing **dense ~80-90%, MoE
-  ~55-65%**), **sequence track A first**, **gate at P2**, and **add a TMA weight-feed option** so the
-  occupancy-only route is not the only shot at the floor that vLLM's TMA pipeline demonstrably reaches.
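
The dense-step arithmetic in sections 9.2 and 9.5 can be replayed in a few lines of Python. This is a minimal sketch, not part of the patchset. It assumes throughput is `npl * 1000 / step_ms` (which reproduces the quoted 161 tok/s at 795 ms), that track B replaces the 471 ms GEMM with the chosen floor, and that track A removes exactly the 65 ms act-quant bucket. All function and constant names are illustrative.

```python
# Replays the dense npl128 numbers quoted in sections 9.2 and 9.5.
# Assumption: tok/s = npl * 1000 / step_ms (795 ms -> ~161 tok/s).

NPL = 128               # concurrent sequences in the benchmark
LLAMA_STEP_MS = 795.0   # llama decode step at npl128
LLAMA_GEMM_MS = 471.0   # GEMM bucket inside that step
ACT_QUANT_MS = 65.0     # act-quant bucket that track A is expected to fold away
VLLM_TOK_S = 391.0      # vLLM dense decode throughput at npl128


def tok_per_s(step_ms, npl=NPL):
    """Aggregate decode throughput for one step that emits npl tokens."""
    return npl * 1000.0 / step_ms


def projected_step_ms(gemm_ms, with_track_a):
    """Step time if the GEMM shrinks to gemm_ms; optionally drop the act-quant bucket."""
    step = LLAMA_STEP_MS - (LLAMA_GEMM_MS - gemm_ms)
    if with_track_a:
        step -= ACT_QUANT_MS
    return step


def pct_of_vllm(step_ms):
    """Projected throughput as a percentage of the vLLM reference."""
    return 100.0 * tok_per_s(step_ms) / VLLM_TOK_S


if __name__ == "__main__":
    scenarios = [
        ("spec-BW floor", 66.0),
        ("achieved-BW floor", 88.0),
        ("kill-gate threshold", 110.0),
        ("partial occupancy win", 276.0),
    ]
    for label, gemm_ms in scenarios:
        b_only = pct_of_vllm(projected_step_ms(gemm_ms, with_track_a=False))
        b_plus_a = pct_of_vllm(projected_step_ms(gemm_ms, with_track_a=True))
        print(f"{label:22s} GEMM {gemm_ms:5.0f} ms  B alone {b_only:5.1f}%  B+A {b_plus_a:5.1f}%")
```

Running it shows why the review treats track A as the piece that tips dense over the parity line: under these assumptions the B-alone column stays below the B+A column by the same 65 ms in every scenario.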
--- a/backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md
@@ -1,96 +0,0 @@
-# Patch 0030 - fused-op backend gate (audit RISKY-1 fix) - RESULTS
-
-Closes the single latent silent-miscompute hazard from `ARCH_GENERALITY_AUDIT.md`
-(RISKY-1): the fused GDN / discriminated-SSM_CONV decode ops are CUDA+CPU-only but
-were emitted DEFAULT-ON with no backend guard.
-
-## The hazard
-
- `cparams.fused_gdn_ar = fused_gdn_ch = auto_fgdn = true` are set unconditionally
-  in the `llama_context` constructor (`src/llama-context.cpp`).
- Patches 0018/0019/0026 add `ggml_gated_delta_net_inplace[_ids][_hybrid]`
-  (reuse `GGML_OP_GATED_DELTA_NET` with extra src slots).
- Patches 0021/0028 add `ggml_ssm_conv_update_inplace[_ids]` which **reuse
-  `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]`/`src[4]`** (ring/ids).
- Both families have CUDA + CPU kernels only. No `supports_op` change was made for
-  the discriminated variants.
- A backend that supports **plain** `SSM_CONV` but ignores the discriminator
-  (Vulkan/SYCL/Metal) returns `supports_op==true` for the node; the scheduler
-  assigns the discriminated conv to it; it runs the **wrong plain conv** =>
-  SILENT corruption (not a crash).
- The upstream `auto_fgdn` resolution only inspects `GATED_DELTA_NET` nodes, so the
-  discriminated-`SSM_CONV` safety was only **incidentally** covered (GDN-op and
-  discriminated-conv happened to share backend coverage). It goes live the moment a
-  non-CUDA paged build of a gated-DeltaNet model exists.
-
-## The fix (emission gate, not supports_op)
-
-Chosen route: **gate the emission on the active compute backend type.** The
-`supports_op` route would require editing every other backend's per-device
-`supports_op` (Vulkan/SYCL/Metal/...) to reject the discriminated `SSM_CONV` -
-invasive, fragile, and not centrally exposed by the ggml backend interface. The
-emission gate is self-contained in the fork's own code.
-
-`src/llama-context.cpp`, in `llama_context::sched_reserve()`, immediately before
-the existing `if (cparams.auto_fgdn)` resolution block: if any **non-CPU** compute
-backend has a reg name other than `"CUDA"` / `"ROCm"` (HIP) / `"MUSA"` (the three
-`GGML_CUDA_NAME` values - all the same hipified ggml-cuda TU that carries the
-discriminated-op handling), force
-`fused_gdn_ar = fused_gdn_ch = auto_fgdn = false`.
-
-Every emission site keys off these flags:
-`conv_decode_fused = (n_seq_tokens==1) && (n_rs_seq==0) && fused_gdn_ar`
-(qwen35/qwen35moe/qwen3next + `build_conv_state_fused`) and
-`fused = (n_seq_tokens==1) ? fused_gdn_ar : fused_gdn_ch` (delta-net-base). With
-the flags false the graph takes the upstream non-fused branch: a **plain
-`ggml_ssm_conv` (no discriminator) + `ggml_silu`**, which every backend handles
-correctly.
-
-## CUDA byte-identical invariant
-
-On a CUDA backend the reg name is `"CUDA"`, so `fgdn_backend_ok` stays true, the
-flags are left untouched, and the emitted decode graph is unchanged. The fix only
-changes behavior on non-CUDA/non-CPU backends. CUDA decode graph is byte-identical
-to pre-0030 **by construction** (no flag flips on CUDA), so the existing greedy
-md5 gates are unaffected on the validated GB10 target.
-
-## Verification
-
- COMPILE (GPU-free, done on a CPU box): reconstructed the exact source state
-  (upstream pin `9d5d882d` + paged patches `0001-0029`, .md docs stripped) and
-  applied 0030. CPU-only build (`-DGGML_CUDA=OFF`) of `llama` + `test-backend-ops`
-  links `libllama.so` and the test binary with **0 errors**; the edited
-  `llama-context.cpp` compiles clean (uses only the already-included `<cstring>`
-  and the backend-reg API already used in this TU:
-  `ggml_backend_dev_backend_reg` / `ggml_backend_reg_name` /
-  `ggml_backend_dev_type`).
- 0030 applies cleanly on a fresh pin+0001-0029 tree via both `git apply --check`
-  (Makefile path) and `patch -p1 -N` (prepare.sh path).
- test-backend-ops correctness is a **CUDA0-vs-CPU** comparison; a CPU-only run
-  skips CPU-vs-CPU by design ("Skipping CPU backend"). The relevant test cases are
-  registered and will be exercised by the DGX CUDA run:
-  `test_ssm_conv` / `test_ssm_conv_update` (SSM_CONV_UPDATE) /
-  `test_ssm_conv_update_ids` (SSM_CONV_UPDATE_IDS) /
-  `test_gated_delta_net` (+ `_hybrid`).
-
-## Pending on the DGX (GPU)
-
-The CUDA-side confirmation could not be run from the CPU box (the DGX cloudflared
-tunnel `jp-6.prem.io` was returning `websocket: bad handshake` for the whole
-session - origin offline). Once the DGX is reachable, run the following in `~/llama-paged-dev`
-(branch `paged`), then commit 0030 there too:
-
-```
-test-backend-ops test -o SSM_CONV
-test-backend-ops test -o SSM_CONV_UPDATE
-test-backend-ops test -o SSM_CONV_UPDATE_IDS
-test-backend-ops test -o GATED_DELTA_NET   # expect: 2/2 backends passed, OK
-```
-
-Greedy md5 (only if >40GB VRAM free; must equal the established baselines):
-`q36-27b-nvfp4 == 5951a5b4d624ce891e22ab5fca9bc439`,
-`q36-35b-a3b-nvfp4 == 07db32c2bcb78d17a43ed18bc22705cd`. Since 0030 does not flip
-any flag on CUDA, the md5 is unchanged by code-path argument; the run is a
-belt-and-suspenders confirmation, not a correctness dependency.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
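
The gate's decision rule is small enough to model outside the tree. The Python below is a behavioural sketch only, following the rule described in the fix section: fused emission stays enabled only when every non-CPU compute backend reports a registry name of `CUDA`, `ROCm`, or `MUSA`. The real change is C++ in `llama_context::sched_reserve()`. The `(device_type, reg_name)` tuple layout and the names `fused_gdn_allowed` and `resolve_flags` are hypothetical.

```python
# Behavioural model of the 0030 emission gate (not the C++ patch itself).
# Each backend is a (device_type, reg_name) pair; this layout is illustrative.

FUSED_CAPABLE_REGS = {"CUDA", "ROCm", "MUSA"}  # the three GGML_CUDA_NAME values


def fused_gdn_allowed(backends):
    """Return False if any non-CPU backend is outside the ggml-cuda family."""
    for dev_type, reg_name in backends:
        if dev_type == "CPU":
            continue  # CPU carries kernels for the fused ops
        if reg_name not in FUSED_CAPABLE_REGS:
            return False
    return True


def resolve_flags(backends, flags):
    """Force the three fused flags off when the gate trips; otherwise copy them unchanged."""
    if fused_gdn_allowed(backends):
        return dict(flags)
    return {**flags, "fused_gdn_ar": False, "fused_gdn_ch": False, "auto_fgdn": False}


if __name__ == "__main__":
    defaults = {"fused_gdn_ar": True, "fused_gdn_ch": True, "auto_fgdn": True}
    print(resolve_flags([("GPU", "CUDA"), ("CPU", "CPU")], defaults))    # flags untouched
    print(resolve_flags([("GPU", "Vulkan"), ("CPU", "CPU")], defaults))  # flags forced off
```

On a CUDA backend the model returns the flags unchanged, which mirrors the byte-identical invariant argued above: the gate never flips a flag there.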
--- a/backend/cpp/llama-cpp/patches/paged/FUTURE_LEVERS.md
+++ b/backend/cpp/llama-cpp/patches/paged/FUTURE_LEVERS.md
@@ -1,89 +0,0 @@
-# Decode-Parity: Parked Levers (future exploration)
-
-**Context.** The bit-exact decode-parity effort shipped patches **0018-0023**: dense decode
-38% -> **95% of vLLM** @npl128 on GB10 / DGX Spark (LPDDR5x ~273 GB/s), every patch
-**byte-identical to llama's own f32 output** (md5-gated). The gated-DeltaNet recurrence (the
-dominant ~50% kernel) now runs at **84.6% of peak BW = past vLLM's 82.4%**, at the DRAM floor.
-bf16 SSM state was fully built and **shelved** (real +25-31% lever but fails the f32 KL gate).
-
-The remaining non-recurrence kernels (FP4 GEMM, attention, lm_head) are at their bit-exact
-floor: any knob changes a reduction order vs the f32 reference. So further *bit-exact* decode
-gains are marginal; the levers below are the honest pick-up points, ranked by promise.
-
---
-
-## 1. Hybrid-precision SSM state (the most promising)
-
-The bf16 build (`BF16_SSM_STATE_RESULTS.md`) proved the throughput lever is large -
-recurrence **-49%/call** (dense 3.38 -> 1.73 ms), dense decode ~**490 t/s = 125% of vLLM** (clean
-runs), MoE @128 **+24.9%** - but bf16 fails the f32 KL gate (KLD 0.06-0.17 at >=1024 ctx,
-~10% argmax flips). The discrimination showed the error is **intrinsic to bf16 over the
-long-memory heads** (exp(g) ~ 1, where the per-step decay does not contract the rounding);
-short/fast-decaying heads are fine.
-
-**Lever:** a per-head (or per-channel) precision split - keep the long-memory heads (g near 1)
-in f32, store the fast-decaying heads (g well below 1, where rounding contracts) in bf16. Could
-capture most of the speedup while passing the KL gate. Needs a g-magnitude classifier at graph
-build + a mixed-dtype recurrent-state cache. **HIGH promise, moderate effort.** The bf16 kernel
-plumbing already exists (DGX `~/llama-paged-dev/BF16_SSM_STATE.diff`); this adds the per-head
-dtype selection on top.
-
-*Note (precision, corrected):* plain bf16 (no split) is a legitimate **opt-in for precision-tolerant
-deployments**, but it is *below* vLLM's recurrent precision, NOT equal to it. vLLM keeps the
-gated-DeltaNet **temporal state in f32** (proven three ways in `BITEXACT_VS_VLLM.md`; only its tiny
-conv state is bf16, and llama keeps even that f32). So bf16 here trades *below-vLLM* precision for
-*above-vLLM* throughput. We declined it as the default because both llama's f32 AND vLLM's f32 are a
-higher bar - and at equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak BW),
-so we do not need bf16 to match vLLM's recurrence.
-
-## 2. Dense CUDA-graph instability
-
-The bf16 dense decode was **bimodal** across runs (287 / 336 / 487 / 498 t/s) - a dense-path
-CUDA-graph capture/replay instability (good runs hit ~490). The f32 dense path measured stable
-(371-376) but the bimodality is a latent fragility worth root-causing; a robust graph capture on
-the dense path could stabilize and possibly lift dense decode. **Moderate promise**, diagnostic.
-
-## 3. Dense rms_norm -> fp4 producer-fold (~1.5-2.5%, parked as flat-risk)
-
-The last bit-exact bucket (`RMSNORM_FP4_FOLD.md`). Folding the standalone `quantize_mmq_nvfp4`
-into the rms_norm+mul producer at the FFN boundary (f32 output dead -> droppable) could recover
-~1.5-2.5% dense. Parked because: the Lever-2 precedent measured **flat**, it has the worst
-gain/plumbing ratio (3-op `{RMS_NORM,MUL,MUL_MAT(NVFP4)}` graph fusion + a pre-quantized-src1
-GEMM path + scratch-pool / CUDA-graph-lifetime plumbing), and the gain risks being swallowed by
-the ~0.3-0.5% bench noise floor. Revisit only with the inter-node graph-CSE plumbing built and
-proven on a same-build flag toggle (decode_agg lift above noise AND md5 == 0023). **LOW promise.**
-
-## 4. Datacenter Blackwell (sm_100)
-
-This effort targeted **consumer** Blackwell sm_12x (sm_120 RTX 50-series, sm_121 GB10). Datacenter
-Blackwell (B100/B200/GB200, sm_100 / cc 10.0) has HBM3e (much higher BW) and different MMA
-characteristics - the LPDDR5x bandwidth floor that dominates GB10 decode does **not** apply, so the
-whole calculus changes (likely compute-bound, not BW-bound; the recurrence would not be the binding
-kernel). A separate investigation if datacenter Blackwell becomes a target.
-
-## 5. Prefill / TTFT scheduler + paged-pool burst degradation (HIGH priority - the weakest benchmark number)
-
-The final benchmark (`QWEN36_NVFP4_BENCH.md`) exposed TTFT as the clear weak spot vs vLLM. Two distinct
-issues:
- **Static decode-first budget tradeoff:** the QoS budget (patches 0013/0016, `LLAMA_MAX_BATCH_TOKENS=512`)
-  maximizes decode tok/s + memory but throttles burst-prefill, so under a synchronized 128-way burst TTFT
-  climbs to **903 s dense / 213 s MoE @npl128** vs vLLM's chunked-prefill 6-18 s. A dynamic/adaptive budget
-  (by concurrency + queue depth), or matching vLLM's chunked-prefill interleave, would rebalance.
- **Paged-pool burst-degradation BUG (concrete, found in the benchmark):** after a high-npl burst, a
-  server's *subsequent lower-npl* prefill collapses (fresh npl8 = 507 t/s / 6 s TTFT; npl8 after an npl64
-  burst = 65 t/s / 64 s). Decode stays robust; only prefill degrades -> root-cause the paged-pool state
-  that persists across the burst.
-
-**HIGH promise** for the serving experience: decode (dense 90-117%, MoE 77-83% of vLLM) and memory (1.5-3x
-lower) are already strong; TTFT is the one number holding back a clean public win.
-
-## 6. MoE-specific recurrence tuning
-
-The occupancy retune (0022) was tuned on the dense path; it lifted MoE +8.3% as a side effect. The
-MoE path (`MUL_MAT_ID` grouped GEMM + the shared GDN recurrence, expert routing changes the GEMM
-shapes) may have MoE-specific occupancy headroom. Worth a MoE-targeted reprofile.
-
---
-
-*All shelved per the host handover - experiments parked. Pick up from the linked result docs in this
-directory.*
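
The per-head split proposed in lever 1 can be illustrated with a toy classifier. This is a sketch under stated assumptions, not an existing implementation: it takes one representative log-decay `g` per head (in the real model `g` varies per token), keeps a head in f32 when `exp(g)` is at or above a hypothetical threshold of 0.99, and sizes each head's state as a 128x128 matrix as in the Qwen3.6 SSM dimensions. The names `choose_state_dtype` and `state_bytes` are illustrative.

```python
import math

# Toy model of a per-head precision split for the recurrent state.
# Assumptions: one representative log-decay g per head; threshold is hypothetical.

BYTES_PER_ELEM = {"f32": 4, "bf16": 2}


def choose_state_dtype(g_per_head, keep_f32_above=0.99):
    """Long-memory heads (exp(g) near 1) stay f32; fast-decaying heads go to bf16."""
    return ["f32" if math.exp(g) >= keep_f32_above else "bf16" for g in g_per_head]


def state_bytes(dtypes, s_v=128):
    """Bytes of recurrent state for one sequence and one layer, one s_v x s_v matrix per head."""
    return sum(s_v * s_v * BYTES_PER_ELEM[d] for d in dtypes)


if __name__ == "__main__":
    g = [-0.001, -0.02, -0.5, -2.0]  # made-up per-head log-decays
    dtypes = choose_state_dtype(g)
    all_f32 = state_bytes(["f32"] * len(g))
    print(dtypes, f"state bytes: {state_bytes(dtypes)} vs {all_f32} all-f32")
```

Whether such a split passes the KL gate is exactly the open question the lever describes; the sketch only shows how the classification and the state-size saving would be computed.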
--- a/backend/cpp/llama-cpp/patches/paged/GDN_DECODE_VERIFY.md
+++ b/backend/cpp/llama-cpp/patches/paged/GDN_DECODE_VERIFY.md
@@ -1,208 +0,0 @@
-# GDN decode verify: is llama.cpp's Gated-Delta-Net decode O(1) or an O(ctx) re-scan?
-
-Verdict-first, then the evidence. This closes lever 5 of `VLLM_DECODE_GROUNDING.md` ("Verify
-llama's GDN/linear-attention decode path"): on the Qwen3.6 hybrid models, is llama re-scanning the
-context (O(ctx)) in the linear-attention layers, or keeping vLLM's O(1)-in-context recurrent state?
-
-Method: GGUF-metadata + source reading on the `paged` dev tree (`~/llama-paged-dev`, build-cuda
-sm_121) on `dgx.casa`, plus nsys CUDA-kernel decode traces on `~/bench/q36-27b-nvfp4.gguf`
-(GB10 / DGX Spark, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`). Models:
-`~/bench/q36-27b-nvfp4.gguf` (dense, arch `qwen35`), `~/bench/q36-35b-a3b-nvfp4.gguf`
-(MoE, arch `qwen35moe`).
-
-## TL;DR verdict
-
-**llama.cpp's GDN decode is EFFICIENT: it is O(1)-in-context, a single fused CUDA kernel that
-reads + updates a fixed-size cached recurrent state, structurally identical to vLLM's
-`fused_recurrent_gated_delta_rule`. It is NOT a re-scan, NOT a context-scaling blowup, and NOT a
-major contributor to the ~2.4x eager-decode gap.** There is no GDN-specific bottleneck to fix, so
-the cheap model-specific lever this probe was hunting for does not exist. The 2.4x is the general
-kernel work (the FP4 weight GEMM, which dominates the step, plus the O(ctx) full-attention decode
-kernel in the minority of full-attention layers), exactly as `VLLM_DECODE_GROUNDING.md` concluded.
-
-The decisive datum: at matched batch (npl4), pure decode, 4x more context, the GDN kernel time is
-**flat** while the full-attention kernel grows ~3.1x:
-
-| kernel | ctx 1024 | ctx 4096 | ratio | meaning |
-|--------|---------:|---------:|------:|---------|
-| `gated_delta_net_cuda` (GDN linear-attn) | 10.3 us/launch | 8.0 us/launch | **~0.8x (flat, no growth)** | **O(1) in ctx** |
-| `flash_attn_tile` (full-attn layers) | 27.1 us/launch | 85.0 us/launch | **3.1x** | O(ctx), as expected |
-| total ms / decode step | 84.9 | 86.0 | 1.01x | GEMM-bound, ctx-independent |
-
-Identical decode-step counts in both windows (~190 steps, ~9134 GDN launches), so this is a
-per-step like-for-like comparison: the GDN layers do **not** get more expensive as context grows.
-
-## 1. Architecture (confirmed from GGUF metadata + tensor names)
-
-Both Qwen3.6 models are hybrid: a `full_attention_interval` of 4 means every 4th layer is standard
-full attention and the other 3/4 are Gated-Delta-Net (GDN) linear attention with a recurrent state.
-
-**Dense Qwen3.6-27B (`general.architecture = qwen35`):**
- `block_count = 64`, `full_attention_interval = 4` -> **16 full-attention layers + 48 GDN layers**.
- Full-attn: `head_count = 24`, `head_count_kv = 4` (GQA), `key_length = value_length = 256`,
-  rope `freq_base = 1e7`, mrope sections `[11,11,10,0]`.
- GDN/SSM: `ssm.state_size = 128`, `ssm.conv_kernel = 4`, `ssm.group_count = 16`,
-  `ssm.time_step_rank = 48`, `ssm.inner_size = 6144`. So the recurrent state per GDN layer is
-  `[S_v=128, S_v=128, H_v=48]` per sequence (`H_v = inner_size/state_size = 6144/128 = 48` value
-  heads), i.e. a 128x128 state matrix per head, ~3.1 MB (F32) per sequence per layer.
-
-**MoE Qwen3.6-35B-A3B (`general.architecture = qwen35moe`):**
- `block_count = 41`, `full_attention_interval = 4` (~10 full-attn + ~31 GDN layers).
- `head_count = 16`, `head_count_kv = 2`, `key_length = value_length = 256`,
-  `expert_count = 256`, `expert_used_count = 8`, `expert_feed_forward_length = 512`.
- Same SSM dims: `state_size = 128`, `conv_kernel = 4`, `group_count = 16`,
-  `inner_size = 4096` -> `H_v = 32` value heads.
-
-**Tensor names confirm the op split (27B, per-layer dump):**
- GDN layers (e.g. `blk.0.*`): `ssm_alpha`, `ssm_beta`, `ssm_conv1d`, `ssm_a`, `ssm_dt.bias`,
-  `ssm_norm`, `ssm_out`, plus `attn_qkv` / `attn_gate` (the in/out projections of the linear-attn
-  block). No `attn_k/v/output`, no per-head q/k norm.
- Full-attn layers (e.g. `blk.3.*`, every 4th): `attn_q`, `attn_k`, `attn_v`, `attn_output`,
-  `attn_q_norm`, `attn_k_norm`. No `ssm_*`.
-
-llama loads the GDN layers through the **recurrent memory** (`llama-memory-recurrent`), not the KV
-cache: the conv state and the SSM state live in `conv_states_all` / `ssm_states_all` and are read
-and written every step. Only the 16/10 full-attention layers use the (paged) KV cache. This is the
-SSM-style recurrent path, not standard attention.
-
-## 2. llama.cpp GDN decode implementation: O(1) recurrent-state update (code-proven)
-
-Graph build (shared by both models): `src/models/delta-net-base.cpp`, dispatched from
-`src/models/qwen35.cpp` and `src/models/qwen35moe.cpp` (the MoE class inherits
-`llm_build_delta_net_base` and calls the same `build_recurrent_attn`, qwen35moe.cpp:472).
-
-**Decode dispatch (`build_delta_net`, delta-net-base.cpp:425-447):** when `n_seq_tokens == 1`
-(decode), it takes `build_delta_net_fused` if `cparams.fused_gdn_ar` (the default, see below), else
-`build_delta_net_autoregressive`. Both are O(1):
-
- `build_delta_net_autoregressive` (delta-net-base.cpp:289-371) is the explicit rank-1 recurrence on
-  the fixed-size state `s` shaped `[S_v, S_v, H_v, n_seqs]`: `s *= exp(g)` (decay),
-  `sk = sum_rows(s * k)`, `d = (v - sk^T) * beta`, `s += k (x) d^T` (rank-1 update),
-  `o = sum_rows(s * q)`. **No loop over past tokens, no KV read** - it touches only the state and
-  the single new token's q/k/v/g/beta. `GGML_ASSERT(n_tokens == 1)`.
- `build_delta_net_fused` (delta-net-base.cpp:373-423) collapses the same recurrence into one op,
-  `ggml_gated_delta_net(q, k, v, g, b, s, K=1)`.
-
-**State is cached across steps, not rebuilt (`build_recurrent_attn`, delta-net-base.cpp:527-606):**
-the input state `s` is read from `ssm_states_all` via `build_rs`, and the new state is copied back
-with `ggml_cpy(new_state, view(ssm_states_all, ... kv_head ...))` (lines 555-558). The causal-conv
-state is handled the same way in `build_conv_state` (449-525): the previous `conv_kernel-1 = 3`
-samples are read from `conv_states_all`, the new token is appended, and the last 3 are written back.
-So both pieces of GDN state persist in the recurrent cache exactly like a KV cache persists tokens -
-this is the recurrent analogue, fixed size, independent of context length.
-
-**Defaults (`src/llama-context.cpp:200-201`):** `cparams.fused_gdn_ar = true` and
-`fused_gdn_ch = true`. They are only auto-disabled if the fused op cannot be scheduled on the same
-device as the layer (`device_gdn != device_kv`, lines 540-595); on a single GB10 with `-ngl 99`
-that does not happen, so the **fused single-kernel path is what runs**.
-
-**The CUDA kernel (`ggml/src/ggml-cuda/gated_delta_net.cu`) is the crux, and it is unambiguously
-O(1) in context:**
- Launch grid `dim3(H, n_seqs, ceil(S_v/4))` and block `(min(warp,S_v), 4, 1)` (lines 184-185):
-  the grid spans heads x sequences x state-columns. **There is no context-length dimension and no
-  context-length argument anywhere in the kernel signature** (q/k/v/g/beta are the new token(s)
-  `[S_v, H, n_tokens, n_seqs]`; `curr_state` is the fixed `[S_v, S_v, H, n_seqs]`).
- Each warp loads its shard of the fixed-size state into registers **once** (lines 57-61), then
-  loops `for (t = 0; t < n_tokens; t++)` (line 63). At decode `n_tokens == 1`, so it is a single
-  iteration: read the one new token, do the rank-1 update
-  `s_shard[r] = g * s_shard[r] + k[i] * delta_col` and the readout `attn = S^T q` (lines 84-141),
-  then write the updated state back (lines 161-167). No second loop, no read of any past KV.
- Work per decode step is therefore proportional to `S_v * S_v * H * n_seqs` (the state size x
-  batch) and **constant in context length**. This is the same scheme as vLLM's
-  `fused_recurrent_gated_delta_rule_packed_decode_kernel` (one batched launch updating a
-  fixed-size `[K,V]` state), cited in the grounding doc.
-
-A chunked GPU kernel for prefill is a TODO (delta-net-base.cpp:181 `//TODO: Add chunked kernel`);
-the chunked CPU/graph path (`build_delta_net_chunking`) only runs for multi-token ubatches
-(prefill), never at decode.
-
-## 3. nsys decode profiling: GDN is a small share and does not scale with context
-
-Qwen3.6-27B NVFP4, sm_121, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`, `llama-server` driven
-to steady decode by a looping completion client. Kernel time bucketed by name (full classifier and
-sqlites under `~/bench/gdn_study/`).
-
-**(a) Share at the headline batch (npl128, ctx 1024), GPU 92.7% busy:**
-
-| bucket | % of busy | us/launch |
-|--------|----------:|----------:|
-| GEMM_weight (`mul_mat_q`/`mul_mat_vec_q`) | 59.2 | - |
-| **GDN_recurrent (`gated_delta_net_cuda`)** | **8.9** | 369 |
-| GEMM_act_quant (`quantize_mmq_nvfp4`) | 8.2 | - |
-| elementwise / act_glu / norm / rope | ~13.5 | - |
-| embed_gather (`get_rows`) | 2.9 | - |
-| **ATTENTION_full (`flash_attn`, 16 layers)** | **1.8** | 107 |
-| copy_cast (`cpy`) | 1.8 | - |
-| **GDN_conv (`ssm_conv`)** | **1.5** | - |
-
-The whole GDN path (recurrent 8.9% + conv 1.5%) is ~10% of the step; full attention is ~2%; the
-**weight GEMM dominates at ~67% (59.2% GEMM + 8.2% act-quant requant)**. This is the dense model,
-where the grounding predicted the GEMM would be the lever.
-
-**(b) Share at low batch (npl32, ctx 1024), weight-bandwidth (GEMV) regime, GPU ~100%:**
-GEMM_weight 88.7%, GDN_recurrent 0.8%, ATTENTION_full 0.7%, GDN_conv 0.3%. At low batch the
-weight-read GEMV swamps everything and GDN is negligible; the GDN share tracks the batch, not the
-context.
-
-**(c) Context-scaling control (the decisive test): matched batch npl4, pure decode, ctx 1024 vs
-4096.** Small batch -> fast prefill -> a clean pure-decode capture (verified: GEMM is the M=1
-`mul_mat_vec_q` decode GEMV, and the client completed decode rounds inside the window). Identical
-decode-step counts (~190 steps, gated_delta_net launched 9141 vs 9134 times), so per-launch time is
-a true per-step comparison:
-
-| kernel / bucket | ctx 1024 | ctx 4096 | ratio |
-|-----------------|---------:|---------:|------:|
-| `gated_delta_net_cuda` us/launch | 10.3 | **8.0** | **0.78x (flat)** |
-| GDN_recurrent share | 0.6% | 0.4% | flat/down |
-| `ssm_conv` (GDN_conv) us/launch | 5.2 | 5.2 | 1.00x |
-| `flash_attn_tile` us/launch | 27.1 | **85.0** | **3.14x** |
-| ATTENTION_full share | 0.6% | 1.8% | 3.0x up |
-| total ms / decode step | 84.9 | 86.0 | 1.01x |
-
-The GDN kernel time is flat (even a hair faster) across a 4x context increase, while the
-full-attention kernel grows ~3x, exactly the O(1)-vs-O(ctx) signature. The total step time barely
-moves because at this batch the (context-independent) FP4 weight GEMM is 88% of the step. This is
-the empirical confirmation of the code analysis: **llama's GDN decode does not re-scan the context.**
-
-(An earlier npl32 ctx4096 attempt was discarded: with 32 parallel slots each independently
-prefilling ~4100 tokens, the nsys window caught prefill, not steady decode - the `mul_mat_q(M=128)`
-+ `flash_attn_ext_f16(ctx4096)` signature gave it away. The npl4 runs above avoid this by keeping
-prefill short.)
-
-## 4. Verdict and fix scope
-
-**Efficient, not a bottleneck.** llama.cpp runs the Qwen3.6 GDN/linear-attention layers as a fused,
-single-CUDA-kernel, O(1)-in-context recurrent-state update, with the conv and SSM state cached in
-the recurrent memory across decode steps. It is algorithmically the same as vLLM's O(1)
-`fused_recurrent` decode. The probe's worst case (llama re-scanning context => GDN layers ballooning
-with context and concurrency) is **falsified**: the GDN kernel is flat across 4x context, and the
-op carries no context-length parameter at all.
-
-**So the GDN path is not the cheap model-specific lever.** It is a small-to-moderate, context-flat
-share of the step (~0.4-0.8% at low batch, ~10% including conv at batch 128), and removing it
-entirely would not close the 2.4x gap. The gap is the general kernel work, confirming `VLLM_DECODE_GROUNDING.md`:
-1. the **FP4 weight GEMM** is the dominant bucket (~59% GEMM + ~8% `quantize_mmq_nvfp4` requant that
-   vLLM fuses away via native FP4-MMA / grouped Marlin); this is the biggest, hardest lever.
-2. the **full-attention decode kernel** is the O(ctx) residual (the only thing that grows with
-   context, ~3x per-launch over 4x ctx), in the minority of full-attention layers.
-
-If anything on the GDN side is ever worth touching, it is a bounded micro-optimization, not a
-complexity fix: the kernel is memory-bound on the F32 recurrent state (state read+write is
-`S_v^2 * H * batch` = ~0.79 GB/step over 273 GB/s at batch 128, hence the ~8.9% share), and this
-traffic is **intrinsic to the architecture - vLLM pays the identical state I/O**, so it is not a
-llama-specific inefficiency. A future win could keep the recurrent state in bf16 or fuse the
-`ssm_conv` + gated-norm into the delta-net kernel to shave that ~10%, but the ceiling is small and
-it does not close the 2.4x. The throughput effort stays where the grounding put it: the FP4 GEMM
-(fused act-quant + native FP4-MMA) and the full-attention decode kernel, with a CUDA-graphed
-steady-state step as the bounded host-side add-on.
-
-## Reproduce
-
- Metadata: `python3 gguf-py/gguf/scripts/gguf_dump.py --no-tensors ~/bench/q36-27b-nvfp4.gguf`.
- Code: `src/models/delta-net-base.cpp` (build_delta_net 425, autoregressive 289, fused 373,
-  build_recurrent_attn 527, build_conv_state 449); `src/llama-context.cpp:200-201,540-595`
-  (fused_gdn defaults/guard); `ggml/src/ggml-cuda/gated_delta_net.cu` (kernel 4-168, launch grid
-  184-185, dispatch 226-312).
- Profiles: `~/bench/gdn_study/drv.sh <label> <P> <K> <ctx> <delay> <dur>` runs `llama-server` under
-  nsys and drives `clientloop.py`; `catgdn.py <sqlite>` buckets kernels. Sqlites:
-  `gdn_npl128_ctx1024`, `gdn_npl32_ctx1024`, `gdn_npl4_ctx1024`, `gdn_npl4_ctx4096`.
--- a/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md
+++ b/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md
@@ -1,344 +0,0 @@
-# GDN recurrence byte gate + fused single-pass kernel design
-
-Label: llama-fused-recurrence-design (READ-ONLY, no GPU). Source-and-math design only;
-the byte-ratio measurement itself is produced by the `ncu-byte-gate` agent.
-
-## TL;DR (the correction the workflow was set up to settle)
-
-**The recurrence kernel is ALREADY single-pass on the f32 state.** `gated_delta_net_cuda<128>`
-(after patches 0018 in-place write + 0019 fused gather) loads the whole `s0` column into registers
-ONCE (`s_shard[rows_per_lane]`), runs the entire token loop in registers, and writes the new state
-back ONCE - directly into the persistent cache slot (0018) or scratch. For decode `n_tokens==1`,
-`keep_rs_t==false`: one register load, one register store, no re-read of state from DRAM.
-
-The byte-gate's working hypothesis - "un-fused l2norm/gate/decay/recurrence/state-writeback/gather
-each touching the f32 state, so a fused pass halves DRAM bytes" - is **false for the state**. Only
-the recurrence kernel touches the 3 MB/seq state. The surrounding ops (`l2_norm`, `silu`, `sigmoid`,
-the `gate` exp/softplus, `ssm_conv`, `concat`, `cpy`) all operate on the **small activations**
-(q/k/v/g/beta), which are 100-800x smaller than the state. There is no 2x state re-streaming to
-recover; the recurrence kernel is byte-minimal on state by construction.
-
-Therefore a fused single-pass kernel **cannot move the dominant 196 ms recurrence** - that cost is
-f32-state read+write bandwidth, already a single pass. The two real levers are decoupled:
-
-1. **Fold the surrounding activation ops into the kernel** (MEDIUM effort): recovers the small
-   per-op buckets (`ssm_conv` 1.5% + `silu`/`sigmoid` 1.4% + 2x `l2_norm` + `concat` 2.1% + conv
-   `cpy` 2.0%, ~6-8% of the step) plus per-op launch overhead. Bit-exact. Ceiling ~93-96% of vLLM.
-2. **bf16 state cache** (HIGH effort, NON-bit-exact): halves the dominant byte stream. The only
-   large lever on the 196 ms. Target KL < 1e-3 by keeping f32 register accumulation, storing only
-   the persisted cache in bf16.
-
-Which of (1)/(2) is worth building hinges on the `ncu-byte-gate` byte ratio (below).
-
-## Byte arithmetic (dense q36-27b-nvfp4, decode, npl128, S_v=128, H_v=48, batch=128)
-
-State per (seq, GDN layer) = S_v^2 * H_v = 128*128*48 = 786,432 f32 = **3.0 MiB**.
-
-Per kernel call (one GDN layer, full 128-seq batch), single pass:
- state read  = 786,432 * 128 * 4 = 402.65 MB
- state write = 402.65 MB
- **state R+W = 805.3 MB/call** (768 MiB)
- activations (q,k 1 MB each; v 3 MB; attn-out 3 MB; g,beta tiny) ~= 8 MB/call = **<1%**.
-
-Measured 4.08 ms/call (node-level trace) -> effective **197.4 GB/s**.
-GB10 / DGX Spark LPDDR5X peak ~= **273 GB/s** -> **~72% of peak.**
-
-48 GDN layers/step -> 38.7 GB of state traffic/step -> 196 ms = 51.6% of the 383.48 ms step. The
-~8 MB of activation traffic is noise; state is 99% of the recurrence bytes.
-
-### What this means for the open question
- The recurrence is single-pass, coalesced (transposed layout: lane reads `state[col*S_v + i]`,
-  consecutive lanes -> consecutive `i`), running at ~72% of peak BW. It is NOT at the 85% hardware
-  floor, but it is NOT re-streaming state either. The 72->85% headroom (~30 ms, bit-exact) is an
-  occupancy/coalescing tune, NOT a fusion win.
- vLLM `fused_recurrent_gated_delta_rule` does the SAME single-pass recurrence. If vLLM's recurrent
-  state cache is bf16 (model dtype) while llama's is f32, vLLM moves HALF the bytes on the dominant
-  stream - that alone is ~98 ms, i.e. essentially the whole residual decode gap. **This is the
-  single most decision-relevant number for the `ncu-byte-gate` agent to confirm: the dtype/bytes of
-  vLLM's GDN state cache vs llama's f32, plus llama's measured achieved-BW % on the recurrence
-  kernel.** If vLLM is bf16-state -> build (2). If vLLM is also f32-state and at ~85% -> llama is
-  at the floor, only (1) + coalescing remain and bit-exact parity tops out ~95%.
-
-## The fused single-pass kernel design
-
-Two deliverables, layered. Build (1) first (bit-exact, de-risks the graph), gate (2) on the byte
-verdict.
-
-### (1) `ggml_gated_delta_net_decode_fused` - fold the activation ops into the kernel
-
-Folds the pre-recurrence activation ops and the post-recurrence gated RMSNorm into the existing
-single-pass recurrence kernel, so q/k/v/g are produced and consumed in registers/shared and never
-make a separate DRAM round-trip, and the per-op launches collapse to one.
-
-Current decode op chain in `build_layer_attn_linear` (qwen35.cpp 386-461), per GDN layer:
-
-```
-wqkv GEMM -> qkv_mixed                                  (keep: GEMM, separate)
-wqkv_gate GEMM -> z                                     (keep: GEMM, separate)
-ssm_beta GEMM -> beta -> sigmoid                        [FOLD beta sigmoid]
-ssm_alpha GEMM -> alpha -> +ssm_dt -> softplus -> *ssm_a (gate) [FOLD softplus/mul -> per-head g]
-build_conv_state: reshape, transpose qkv, CONCAT, cpy   [concat/cpy -> conv-state plumbing, see note]
-ggml_ssm_conv(conv_input, conv_kernel)                  [FOLD depthwise conv, K=4]
-ggml_silu(conv_output)                                  [FOLD silu]
-views q_conv/k_conv/v_conv
-ggml_l2_norm(q_conv); ggml_l2_norm(k_conv)              [FOLD 2x l2norm]
-[repeat_4d skipped on fused path]
-ggml_gated_delta_net_inplace_ids(...)                   <-- THE recurrence kernel (196 ms)
-build_norm_gated(output, ssm_norm, z): RMSNorm + silu(z) + mul  [FOLD post gated-RMSNorm]
-ssm_out GEMM                                            (keep: GEMM, separate)
-```
-
-Fold list (what moves INTO the kernel):
- `beta` sigmoid: scalar per (head,seq); apply in-kernel when reading beta.
- `gate` g = softplus(alpha+dt)*a (GDA, g->ne0==1): scalar per (head,seq); compute/exp in-kernel.
-  The kernel already does `expf(*g_t)` (non-KDA path, line 85) - so feed RAW `alpha`+`dt` and the
-  `a` scale and do softplus+mul+exp in-kernel; removes the `add`/`softplus`/`mul` launches.
- `ssm_conv` (depthwise causal conv1d, kernel width 4) + `silu`: per channel a length-4 dot of the
-  conv state with `ssm_conv1d` then silu. This is the prologue: each warp/thread, before loading
-  state, computes its q/k/v channel by reading 3 cached conv-state taps + the current qkv_mixed
-  token, dotting the 4-wide kernel, applying silu. The conv state (conv_kernel-1=3 taps x conv_dim)
-  is tiny and already cached; fold its read here and its 1-token shift write into the epilogue
-  (replaces the `concat`+`cpy` conv-state update).
- `l2_norm` of q and k: a warp reduction over S_v of the per-head q/k vector. The recurrence kernel
-  already does warp reductions over S_v (the kv/attn dot products) - the l2norm reuses the same
-  warp-reduce primitive on q_reg/k_reg right after they are loaded, before the recurrence math.
- Post: `build_norm_gated` = RMSNorm(output, ssm_norm) * silu(z). The kernel already holds the
-  attn output `attn_col` per (head,seq,col) in registers at the end; fold an S_v warp-reduce RMS,
-  multiply by `ssm_norm` weight and by `silu(z)` (z read once), and write the final gated output -
-  removing the `rms_norm`+`silu`+`mul` launches and one activation round-trip.
-
-State traffic UNCHANGED (still one read + one write). Activation traffic for conv/silu/l2norm/norm
-collapses into the kernel's register/shared path; ~6 separate launches become 0. Expected recovery:
-the ~6-8% surrounding-op buckets + launch overhead. **Bit-exact** if the numeric ordering is held
-(see Numeric notes). Conservative ceiling ~365-375 tok/s dense (~93-96% of vLLM 391).
-
-Data flow (per (h_idx=head, sequence=seq) block, decode n_tokens=1, S_v=128, num_warps=4):
-1. PDL sync.
-2. Prologue (per channel/lane): read 3 conv-state taps + current `qkv_mixed[t]` for this channel,
-   dot with `ssm_conv1d[0..3]`, add conv bias if any, `silu`. Produces this lane's q/k/v element.
-3. l2norm q,k: warp-reduce sum(q^2), sum(k^2) over the S_v dim; scale q_reg,k_reg by rsqrt(.+eps).
-4. Load `s0` column into `s_shard` (UNCHANGED single read).
-5. Recurrence (UNCHANGED math: g-decay, kv = S^T k, delta = (v - g*kv)*beta, S = g*S + k(x)delta,
-   attn = S^T q * scale).
-6. Write `s_shard` back to cache slot ONCE (UNCHANGED single write). Write the 1-token-shifted conv
-   state back to the conv cache (replaces concat+cpy).
-7. Epilogue gated-RMSNorm: warp-reduce sum(attn^2) over S_v -> RMS; multiply by `ssm_norm[col]` and
-   by `silu(z[col])` (z loaded once); write final output element. ssm_out GEMM stays separate.
-
-Inputs added to the op: `ssm_conv1d` weight, `ssm_norm` weight, `z`, conv-state cache view, raw
-`alpha`/`dt`/`a`, eps. This is a wider op signature (src[8..]) - acceptable; gate it behind a new
-`cparams.fused_gdn_decode` resolved exactly like `auto_fgdn` (graph_reserve + device-match probe,
-llama-context.cpp 518-595) so it silently falls back to the current op chain if any device lacks it.
-
-### (2) bf16 recurrent-state cache - the dominant-term lever (NON-bit-exact)
-
-Only build if `ncu-byte-gate` shows vLLM moves fewer state bytes (bf16) OR llama's f32 recurrence is
-already >=85% of peak (then f32 is at the floor and bf16 is the only way down).
-
- Store `ssm_states_all` (the recurrent-state cache) as bf16. Halves the dominant 805 MB/call -> at
-  the same ~197 GB/s -> ~2.04 ms/call -> ~98 ms/step saved (196 -> ~98). Dense projected
-  335 -> ~440+ tok/s (>= vLLM 391) if BW-bound holds; smaller dtype usually achieves a HIGHER % of
-  peak, so likely better.
- Kernel change: read state -> convert bf16->f32 into `s_shard` (registers stay f32); all recurrence
-  arithmetic in f32 (UNCHANGED); on write, convert f32->bf16. Accumulation precision is preserved
-  within a step; only the PERSISTED state is rounded to bf16 each step.
- Numerics: the recurrent state decays geometrically (g<1), so per-step bf16 rounding does not
-  accumulate unboundedly, but it is NOT bit-exact. Validate KL < 1e-3 vs the f32-state build over a
-  256-token greedy run; if KL fails, fall back to f32 state (keep it a cparams toggle). This is the
-  ONLY path to bit-near parity-or-better on the dominant term; bit-EXACT parity on the 196 ms is
-  unreachable because the f32 state bytes are irreducible (single pass already).
-
-## Numeric / bit-exactness notes (for fold (1))
- l2norm/RMS use f32 warp-reduce accumulation (matches `ggml_l2_norm`/`ggml_rms_norm` f32 sum).
-  Order of summation across lanes differs from the standalone op's sequential sum -> floating
-  reassociation. To stay bit-exact, replicate the standalone op's reduction order, OR accept a
-  tiny reassociation delta and gate on KL<1e-3 (the workflow's near-bit-exact target). Recommend:
-  ship fold (1) behind the cparams probe and assert greedy md5 match vs the current chain (0019
-  already established the harness: dense text md5, MoE byte-identical).
- Recurrence math, scale, g-exp order, beta apply: keep EXACTLY as in `gated_delta_net_cuda` /
-  `ops.cpp` reference (lines 84-141 .cu, 10685-10730 ops.cpp). Do not reorder the
-  v - g*kv -> *beta -> S update -> S^T q sequence.
- conv: depthwise dot of width-4 kernel in f32, then silu - identical to `ggml_ssm_conv`+`ggml_silu`
-  if done in the same order.
- gate softplus: `softplus(x)=log1p(exp(x))`; match ggml's `ggml_softplus` (has the >20 fast path)
-  to stay bit-exact.
-
-## Implementation scope
- (1) `.cu`: extend `gated_delta_net_cuda` with a decode-fused template specialization (or a new
-  kernel) that does conv+silu prologue, q/k l2norm, recurrence, conv-state shift write, gated-RMSNorm
-  epilogue. Add `ggml_cuda_op` dispatch. CPU mirror in `ops.cpp` for parity/CI.
- (1) `ggml.h`/`ggml.c`: new builder `ggml_gated_delta_net_decode_fused` (extra src: ssm_conv1d,
-  ssm_norm, z, conv-cache view, alpha/dt/a, eps + op_params for eps).
- (1) graph edits: `delta-net-base.cpp build_recurrent_attn` (add the decode-fused branch alongside
-  the existing fused/ids branch); `qwen35.cpp` + `qwen35moe.cpp` `build_layer_attn_linear` (route
-  the pre/post ops into the op when `cparams.fused_gdn_decode`); leave `qwen3next.cpp`,
-  `kimi-linear.cpp`, the non-fused and rollback (n_rs_seq>0) paths unchanged.
- (1) `llama-context.cpp`: `auto_fgdn`-style device-match probe to enable/disable the decode-fused
-  op (silent fallback). `cparams.h`/`cparams.fused_gdn_decode`.
- (2) bf16 state: cache dtype change in the recurrent-memory allocation + the kernel load/store
-  convert + a `cparams` toggle + KL gate. Touches `gated_delta_net.cu` load/store, the inplace/ids
-  builders' state asserts, and the recurrent cache type.
-
-## Risk register
- (1) is MEDIUM effort, bit-exact-targetable, but bounded upside (~6-8% + launches; ceiling ~95% of
-  vLLM). Worth it only if the workflow wants >90% and accepts no bf16.
- (2) is the only large lever on the dominant 196 ms but is NON-bit-exact (KL-gated). If vLLM is
-  f32-state, (2) takes llama BELOW vLLM's precision, not toward parity - a product call, not a perf
-  call.
- The widened op signature (many srcs) raises maintenance cost and the device-match probe matters
-  (CPU offload of a GDN layer must fall back cleanly).
- Do NOT expect a fused recurrence to cut the 196 ms: it is already one read + one write of f32
-  state. Re-confirm with the `ncu-byte-gate` achieved-BW number before committing HIGH effort.
-
---
-
-# MEASUREMENT + VERDICT (label ncu-byte-gate, THE GPU agent) - GATE SETTLED
-
-The design above predicted the answer; this is the decisive measurement that confirms it.
-
-## VERDICT: NO-BUILD the fused single-pass recurrence. BUILD bf16 SSM state (design's lever (2)).
-
-Deciding number: **llama re-stream factor = ~1.0x** (mathematically capped at <=1.33x; >=1.5x is
-physically impossible). llama's recurrence kernel is ALREADY single-pass, coalesced, and at
-**74% of GB10 peak BW** - MORE bandwidth-efficient than vLLM's fused triton kernel (41% of peak).
-The whole 2x DRAM gap vs vLLM is **f32 (llama) vs bf16 (vLLM) state-cache width**, not re-streaming.
-
-## ncu HW counters were BLOCKED; timing + geometry gave the byte ratio anyway
- `ncu dram__bytes` and `nsys --gpu-metrics-devices` both return `ERR_NVGPUCTRPERM`
-  (`NVreg_RestrictProfilingToAdminUsers` restricted, root-only; no passwordless sudo on dgx.casa).
-  DRAM byte counters are unobtainable on this box.
- Decisive fallback (no perf counters): CUPTI kernel TIMING (allowed) + EXACT byte geometry from
-  the kernel source. bytes_moved <= peak_BW x duration gives a HARD CAP on the re-stream factor;
-  comparing implied effective BW between llama and vLLM (same model, same B, both eager) settles it.
-
-## Measured (clean nsys CUDA timing; build-cuda-base df1cc97 Lever-1; both B=128, both eager with CUDA graphs OFF)
-llama: `llama-batched-bench -npp 8 -ntg 12 -npl 128 -ub 2048`, GGML_CUDA_DISABLE_GRAPHS=1.
-vLLM:  postssm_decomp/vllm_decode.sqlite, NSEQ=128, enforce_eager=True (apples-to-apples).
-
-| kernel | state dtype | bytes R+W/call | duration/call (steady) | eff. BW | % of 273 peak | re-stream |
-|---|---|---|---|---|---|---|
-| llama gated_delta_net_cuda          | f32  | 805.3 MB | **3.98 ms** (min 3.90 max 4.33, grid 48x128x32) | 202 GB/s | **74%** | ~1.0x |
-| vLLM fused_recurrent...packed_decode | bf16 | 402.6 MB | **3.62 ms** (min 3.53 max 3.96, grid 4x6144x1)  | 111 GB/s | **41%** | ~1.0x |
-
- llama recurrence/step = 3.98 x 48 = **191 ms** (50% of 384 ms step; matches STATE 196 ms).
- vLLM recurrence/step  = 3.62 x 48 = **174 ms**. Per-call gap llama-vs-vLLM is only +10%, NOT 2.8x.
-  The old "1.47 ms near-vLLM" was prefill-contaminated; clean decode is 3.98 ms (confirms STATE).
- Both kernels verified SINGLE-PASS in source (llama: s_shard load-once/store-once, 128 consecutive
-  f32/warp = coalesced; vLLM packed_decode: `b_h += load(p_h0).to(f32)` once, `store(p_ht, b_h.to(bf16))`
-  once). vLLM cache dtype = state_dtype = model_dtype = bf16 (`_mamba_state_dtype` default "auto" ->
-  model dtype; config.json dtype=bfloat16). Geometry identical (H=48, k/v head_dim 128, S_v 128).
-
-## Why re-stream ~1.0x (the gate number)
-Most bytes a 3.98 ms call could move at 273 GB/s peak = 1.087 GB = **1.33x the 816 MB minimal**.
-A 1.5x/2x re-stream would need >peak BW -> impossible. Source proves single-pass+coalesced, so the
-factor is at the 1.0x end: ~816 MB at 202 GB/s = 74% peak. A fused rewrite recovers ~0 state bytes => NO-BUILD.
-
-## The lever: bf16 SSM state (design (2)) - confirmed, large, parity-to-ahead
-2x recurrence bytes vs vLLM = 100% f32-vs-bf16 cache. llama's kernel is the more efficient one
-(74% vs 41% peak), so bf16 state (cache + load/store bf16, f32 register compute, exactly as vLLM):
- 805.3 -> ~413 MB => at 74% peak ~2.0 ms/call => 191 -> ~96 ms/step, save ~95 ms => step ~289 ms
-  (~443 tok/s, AHEAD of vLLM's 327 ms / 391 tok/s). Conservative (50% peak on smaller footprint):
-  ~3.0 ms/call => save ~45 ms => step ~339 ms = vLLM parity. Range = parity-to-ahead.
- NON-bit-exact vs llama's f32 reference, but EQUAL precision to vLLM (which is bf16). Gate on
-  PPL/KL vs the f32 build, not md5. "Bit-exact parity with vLLM" was never on the table - vLLM is bf16.
-
-## Conv-path (no-regret conv-fusion lever sizing), llama steady decode, per call x48
-concat_cont 169.6 us (8.14 ms/step) + cpy_scalar 120.1 us (5.76) + ssm_conv_f32 115.9 us (5.56)
-= ~19.5 ms/step (~5%). Conv STATE ~12.6 MB (tiny) -> this is LAUNCH/small-kernel overhead, not bytes,
-so it is a FUSION lever (design (1)), secondary to bf16 state. l2_norm 6.8 us, gdn_gather 1.21 us (no-op,
-identity seqs -> confirms gather does NOT re-stream state at steady decode).
-
-## One-line answer
-llama: 805 MB/call, 74% peak, re-stream ~1.0x (<=1.33x). vLLM: 402 MB/call (bf16), 41% peak.
-conv-path: ~12.6 MB (launch-bound ~19.5 ms/step, not byte-bound).
-=> NO-BUILD fused recurrence (already single-pass, more efficient than vLLM); BUILD bf16 state
-(halves the dominant 805 MB, ~45-95 ms/step, parity-to-ahead). Deciding number: re-stream ~1.0x.
-
---
-
-# FINAL DECISION (synthesis of all four agents) - the five points
-
-This closes the workflow. Inputs: `ncu-byte-gate` (measured byte ratio), `vllm-fused-recurrence-study`
-(vLLM's single-pass boundary), `llama-fused-recurrence-design` (the fold/levers), `conv-fusion-design`
-(the no-regret conv in-place lever). They agree on every number; the decision is unambiguous.
-
-## (1) Byte-ratio verdict - the decisive number
-
-**llama is at the hardware bandwidth floor, NOT re-streaming.** Re-stream factor = **~1.0x**, hard
-capped at **<=1.33x** (the most bytes a 3.98 ms call can move at 273 GB/s peak is 1.087 GB = 1.33x
-the 816 MB minimal; >=1.5x is physically impossible). The recurrence kernel runs at **74% of GB10
-peak BW** (805.3 MB R+W / 3.98 ms = 202 GB/s) - MORE bandwidth-efficient than vLLM's fused triton
-`packed_decode` at **41% of peak** (402.6 MB / 3.62 ms = 111 GB/s). Source confirms both are
-single-pass and coalesced (llama `s_shard` load-once/store-once, 128 consecutive f32/warp; vLLM
-`b_h = load(p_h0)` once -> f32 regs -> `store(p_ht, b_h.to(bf16))` once). The entire 2x DRAM gap
-vs vLLM is **100% f32 (llama) vs bf16 (vLLM) state-cache WIDTH**, not extra passes.
-
-## (2) Fused single-pass GDN recurrence: **NO-BUILD**
-
-A fused single-pass rewrite recovers **~0 state bytes** because the kernel is already one read + one
-write of the f32 state, and the un-fused l2norm/sigmoid/softplus/gate ops act on the tiny
-q/k/v/g/beta activations (8 MB/call, <1%), not the 805 MB state. There is no second pass to fuse away.
-Expected ceiling if built anyway: unchanged 191 ms recurrence -> no movement on the dominant 50% of
-the step. **Do not build it.** This refutes the workflow's founding hypothesis with a measured cap.
-
-## (3) Conv-state in-place fusion (`conv-fusion-design`): **GO - confirmed, bit-exact, no-regret**
-
-This is independent of the recurrence verdict and holds regardless. Build a fused
-`ggml_ssm_conv_update_inplace` (mirrors the 0018/0019 in-place pattern) that, at decode
-(`n_seq_tokens==1 && !keep && fused-AR && n_rs_seq==0`), assembles the width-4 conv window in
-registers from the cached K-1=3 taps + the native `qkv_mixed` token, computes the depthwise conv,
-folds `silu`, and writes the 1-token-shifted ring state back in place.
- Eliminates `concat_cont` (8.14 ms/step), `cpy_scalar` (5.76 ms/step), the transpose
-  materialization, and the separate `ggml_silu`; replaces `ssm_conv` with a ~1.6x-byte fused kernel
-  (5.56 -> ~9 ms). **Net ~12-14 ms/step = +3.1 to +3.7%** -> dense 335 -> ~346-349 tok/s @npl128
-  (88.5-89.3% of vLLM 391).
- **Bit-exact**: the same ascending-j width-4 FMA order as `ssm_conv_f32` at i==0, same `silu`
-  primitive, same f32 state bytes written - only the producing node changes. Greedy output is
-  bit-identical to the 0018/0019 baseline. LOW risk, additive to everything else.
-
-## (4) Recurrence floor-mover: bf16 SSM state - **BUILD (gated product call)**, and the bit-exact question
-
-Since the recurrence is at the f32 byte floor, the **only** lever on the dominant 191 ms (50% of the
-step) is narrowing the state-cache width to bf16, exactly as vLLM does.
- Store `ssm_states_all` in bf16; load bf16->f32 into `s_shard`, run ALL recurrence arithmetic in
-  f32 (UNCHANGED), store f32->bf16. 805.3 -> ~413 MB/call -> ~2.0-3.0 ms/call -> save
-  **~45-95 ms/step** -> step 384 -> **289-339 ms** = parity-to-ahead of vLLM (327 ms / 391 tok/s;
-  projected 360-443 tok/s @npl128).
- **Bit-exact parity is UNREACHABLE on this term, by construction.** The f32 state bytes are
-  irreducible (single pass already), so matching vLLM's *speed* on the recurrence requires matching
-  vLLM's *width* (bf16). bf16 state is non-bit-exact vs llama's own f32 reference, but it is **equal
-  precision to vLLM** (vLLM's state cache is itself bf16). "Bit-exact parity with vLLM" was never on
-  the table - vLLM is the less-precise reference here. Gate the build on **KL < 1e-3 / PPL-delta**
-  over a 256-token greedy run, not on md5, with a `cparams` f32 fallback. The geometric state decay
-  (g<1) bounds per-step bf16 rounding, so accumulation is well-behaved.
- Bit-exact gains that ARE reachable (vs llama f32): the conv fusion (3) and the activation-fold
-  lever (1) - together ~9-11% - but they top out near ~93-96% of vLLM and never touch the 50%
-  recurrence term.
-
-## (5) Ranked build order + the single highest-value next step
-
-1. **Conv-state in-place fusion (BUILD NEXT - no-regret).** Bit-exact, LOW risk, saves
-   ~12-14 ms/step (~+3%), reuses the proven 0018/0019 in-place op pattern. Build this first because
-   it is risk-free, purely additive, and de-risks the in-place conv-cache plumbing the bf16 work also touches.
-   Confirming measurement: nsys decode trace shows `concat_cont` and `cpy_scalar` GONE, step
-   384 -> ~370-372 ms, and greedy md5 IDENTICAL to the 0019 baseline (dense text md5, MoE
-   byte-identical).
-2. **bf16 SSM state cache (HIGHEST-VALUE lever - gated product call).** The ONLY lever on the
-   dominant 50% recurrence term: saves ~45-95 ms/step, step -> 289-339 ms = parity-to-ahead of vLLM.
-   Non-bit-exact vs llama f32, equal precision to vLLM. Confirming measurement: `gated_delta_net_cuda`
-   duration/call drops 3.98 -> 2.0-3.0 ms in nsys; **KL < 1e-3 / PPL-delta vs the f32 build over
-   256-token greedy** passes; step time and tok/s hit the 289-339 ms / 360-443 tok/s band; cparams
-   f32 fallback verified.
-3. **Activation-op fold, design lever (1) (OPTIONAL, bit-exact, diminishing).** After the conv fusion (#1) takes the
-   conv/silu buckets, the residual fold (q/k l2norm + gate softplus/sigmoid + gated-RMSNorm epilogue
-   + launch overhead) is ~3-5%; bit-exact but bounded. Build only if the goal is >90% of vLLM with
-   no bf16. Confirming measurement: per-op launch count for the GDN layer collapses to ~1; greedy
-   md5 unchanged.
-
-**Single highest-value next implementation step: bf16 SSM state cache (#2)** - it is the only change
-that moves the dominant 191 ms term and reaches vLLM parity-to-ahead. Its confirming measurement is
-the `gated_delta_net_cuda` per-call time dropping to ~2.0-3.0 ms AND the KL<1e-3 gate passing.
-**Recommended immediate build: the conv fusion (#1) first** (no-regret, bit-exact) so the bf16 work
-lands on an already-cleaned conv path; ship #2 as a `cparams`-gated, KL-validated product option.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
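
Annotation on the deleted note above: the projected step-time and throughput band for the bf16 SSM state cache can be re-derived with a few lines of arithmetic. This is a minimal Python sketch, not code from the patchset. It assumes 128 sequences per decode step (the npl128 configuration, consistent with the note's 327 ms / 391 tok/s vLLM figure); the names `decode_tok_per_s`, `N_SEQS`, `BASE_STEP_MS` and `SAVING_MS` are illustrative.

```python
def decode_tok_per_s(n_seqs: int, step_ms: float) -> float:
    """Aggregate decode throughput when every sequence emits one token per step."""
    return n_seqs * 1000.0 / step_ms


N_SEQS = 128               # assumption: the npl128 configuration
BASE_STEP_MS = 384.0       # current step time quoted in the note
SAVING_MS = (45.0, 95.0)   # projected per-step saving from the bf16 state cache

# Step-time band after the saving, slowest first.
band_ms = tuple(BASE_STEP_MS - s for s in SAVING_MS)
# Matching throughput band in tokens per second.
band_tps = tuple(decode_tok_per_s(N_SEQS, ms) for ms in band_ms)

if __name__ == "__main__":
    print(band_ms, [round(t, 1) for t in band_tps])
```

Under these assumptions the fast end lands near the note's ~443 tok/s, while the slow end comes out near 378 tok/s, so the note's 360 tok/s floor is more conservative than this arithmetic alone.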
--- a/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_PROGRESS.md
@@ -1,34 +0,0 @@
-# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
-
-STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed to both trees.
-
-## What
-Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
-update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
-0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
-(read path gather -> indexed in-kernel read; values + reduction order unchanged).
-
-## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
-  MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
-  GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
-
-## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
- MoE   npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
-  step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
-
-## Artifacts
- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
-
-## gather-bench landed (worktree)
-
-Rigorous same-session A/B (DGX GB10) validated patch 0028 bit-exact and lifting both models;
-results folded into LEVER1_GATHER_RESULTS.md and the regenerated 0028 patch. The bench files
-first landed in this worktree via concurrent merge c1f1d1e8e (origin/master sweep); this commit
-re-anchors them with sign-off attribution. DGX llama tree dedicated commit: fafe878 (code
-byte-identical to 944636c; docs-only amend). Both trees committed, not pushed.
--- a/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_RESULTS.md
@@ -1,163 +0,0 @@
-# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact)
-
-The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest
-kernel vLLM has no equivalent of (~5.2 ms/step MoE decode; also present in dense): vLLM updates its
-gated-DeltaNet recurrent state in-place inside the fused decode kernel, while llama ran a separate
-`ggml_get_rows` gather. Patch 0019 fused the SSM recurrent-state gather; patch 0021 fused the conv
-compute/write-back but KEPT a `build_rs` gather for the conv taps ("tiny; not one of the eliminated
-buckets"). This patch closes that residual.
-
-## Which gather was still firing (nsys-located, DGX GB10 sm_121)
-
-Profiled MoE `q36-35b-a3b-nvfp4` at batch-128 decode (`llama-batched-bench -npp128 -ntg24 -npl128
-fa on`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`). The decode-window `k_get_rows_float<float,float>`
-distribution was bimodal: a BIG cluster of **~720 instances (= 30 GDN layers x 24 decode steps) at
-~115 us each** plus small embedding/router gathers.
-
-The big gather's geometry (`grid=(ne10=128, block_num_y=96, 1)`) decodes to **128 rows (= n_seqs
-active sequences) of ne00 = 24576 floats**. With the model's real dims (`d_conv=4, d_inner=4096,
-n_group=16, d_state=128`):
- `n_embd_r = (d_conv-1) * (d_inner + 2*n_group*d_state) = 3 * 8192 = 24576` -> `block_num_y=96` EXACT match.
- `n_embd_s = d_state * d_inner = 524288` (the SSM state, gridY 2048 - already fused by 0019).
-
-So the residual `k_get_rows` is the **conv-state tap gather** in `build_conv_state_fused`
-(`src/models/delta-net-base.cpp`), which called the plain 4-arg `build_rs` -> `ggml_get_rows` of the
-24576-float conv-state row x 128 sequences, once per GDN layer per decode step (~3.4 ms/step here,
-~5.2 ms/step at steady ntg=128). The SSM-state gather is already fused, so this conv gather is the
-last `k_get_rows` in the GDN decode path.
-
-## What changed (mirror of the 0019 SSM gather fusion; bit-exact by construction)
-
-New op `ggml_ssm_conv_update_inplace_ids` (reuses `GGML_OP_SSM_CONV`, discriminated by a non-null
-`src[4]` = ids). Instead of a pre-gathered tap scratch, it takes the FULL conv-state cache (`src[0]`)
-plus the per-sequence `ids` (= the recurrent-state `s_copy`, `src[4]`; `op_params[1]=rs_head`) and
-reads each active sequence's prior K-1 taps directly from `cache[ids[s]]` in the kernel. This removes
-the separate `k_get_rows` launch.
-
-Race-free, exactly mirroring 0019:
- **Identity** sequences (`ids[s] == rs_head + s`, the whole AR-decode path) read the taps in place
-  from the `conv_state_dst` write slot. The kernel loads the full conv window into registers before
-  it writes the 1-token-shifted ring back, so read==write slot is race-free per (channel, seq) thread.
- **Non-identity** sequences (reorder / `rs_zero` remap at a prefill->decode boundary) are gathered
-  into a disjoint scratch by a small `ssm_conv_gather_nonident_kernel` first (no-op at steady decode),
-  so the update kernel never reads a slot another block writes.
-
-The read VALUES are unchanged (identity in-place taps == the gathered taps == `cache[ids[s]]`); only
-the read PATH changes from a `ggml_get_rows` materialization to an indexed in-kernel read. The conv
-math, ascending-tap FMA order, silu and the ring write-back are byte-identical to 0021.
-
-Files:
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: `ggml_ssm_conv_update_inplace_ids` builder
-  (src[0]=full cache [K-1,channels,n_cells], src[1]=conv_kernel, src[2]=x_cur, src[3]=conv_state_dst,
-  src[4]=ids; op_params[0]=fuse_silu, op_params[1]=rs_head).
- `ggml/src/ggml-cuda/ssm-conv.cu`: `ssm_conv_gather_nonident_kernel` + `ssm_conv_update_ids_f32`
-  kernel + `ggml_cuda_op_ssm_conv_update_ids` + a `src[4]`-discriminated branch in `ggml_cuda_op_ssm_conv`.
- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_ids_f32` (window copied to a
-  local before the possibly-aliasing write) + dispatch branch.
- `src/models/delta-net-base.cpp`: `build_conv_state_fused` now feeds the FULL cache + ids through the
-  `build_rs` `get_state_rows` lambda (the rs_zero clear + extra-states copy still run around it),
-  exactly like the 0019 recurrent-attn fusion. The `qwen35` / `qwen35moe` / `qwen3next` callers are
-  unchanged (they already route the single-token decode path here).
- `tests/test-backend-ops.cpp`: `test_ssm_conv_update_ids` (16 cases) - ids = a shuffled permutation
-  with `rs_head=0`, so each case exercises BOTH the identity in-place read and the non-identity cache
-  read; validates the conv+silu output vs the CPU reference.
-
-## GATE: test-backend-ops (CUDA0 vs CPU, 2/2 backends)
-
- SSM_CONV_UPDATE_IDS: OK (NEW; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
- SSM_CONV_UPDATE: OK (0021 path intact)
- SSM_CONV: OK
- GATED_DELTA_NET: OK
- GET_ROWS: OK
-
-## GATE: greedy bit-exactness (--temp 0 --seed 1 -n 48, -fa on) - BOTH models BYTE-IDENTICAL
-
-| model              | baseline md5                     | 0028 md5                         | result          |
-|--------------------|----------------------------------|----------------------------------|-----------------|
-| q36-27b-nvfp4      | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | BYTE-IDENTICAL  |
-| q36-35b-a3b-nvfp4  | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | BYTE-IDENTICAL  |
-
-(Built on the `paged` branch in its f32-default configuration, i.e. 0026 with the hybrid default
-left at f32; the baseline was re-confirmed on the same build before the edit.)
-
-## nsys proof - the gather is eliminated (MoE decode, npp128 ntg24 npl128, same window)
-
-| kernel                              | before        | after                         |
-|-------------------------------------|---------------|-------------------------------|
-| `k_get_rows_float<float,float>` cnt | 10174         | 9454 (720 fewer = 30 GDN x 24)|
-| `k_get_rows_float<float,float>` sum | 186.3 ms      | 102.8 ms (-83.5 ms)           |
-| conv update kernel                  | `ssm_conv_update_f32` 720 | `ssm_conv_update_ids_f32` 720 |
-| `ssm_conv_gather_nonident_kernel`   | -             | 720 x ~1.1 us = 0.8 ms (no-op at decode) |
-
-The 720 big ~115 us conv gathers are gone; the only added work is a ~1.1 us no-op gather kernel per
-layer-step (all sequences identity during steady AR decode). This matches 0019's "no-op at decode,
-median ~1.2 us" non-identity gather.
-
-## Preliminary throughput (post-fusion, single point; rigorous A/B is the bench phase)
-
- MoE `q36-35b-a3b-nvfp4` npl128 (`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`): **783.9 t/s**, step
-  163.3 ms/step (MOE_GAP @0025 was 752.3 t/s / 169.8 ms/step => -6.5 ms/step in this stack).
- dense `q36-27b-nvfp4` npl128: **377.3 t/s** (~96% of vLLM 391; includes 0022/0026 base gains).
- npl128 ran clean (EXIT=0) on both - the non-identity boundary path does not crash.
-
-## Verdict
-
-Bit-exact (both md5 gates byte-identical, all test-backend-ops pass), the residual `k_get_rows` conv
-gather is eliminated (nsys-confirmed), and decode throughput improves. Helps BOTH dense and MoE (the
-shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path (after 0019 SSM-state
-+ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
-## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
-
-Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
-NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
-
-### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
-
-| model             | base (0026)                      | lever1 (0028)                    | recorded baseline                |
-|-------------------|----------------------------------|----------------------------------|----------------------------------|
-| q36-27b-nvfp4     | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
-| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
-
-test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
-GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
-
-### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
-
-dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
-
-| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 391    |
-|-----|-----------|-------------|--------|----------------|
-| 32  | 208.56    | 209.39      | +0.40% | -              |
-| 128 | 369.95    | 377.83      | +2.13% | 94.6% -> 96.6% |
-
-MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
-
-| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 901    |
-|-----|-----------|-------------|--------|----------------|
-| 32  | 456.85    | 459.56      | +0.59% | -              |
-| 128 | 763.47    | 777.95      | +1.90% | 84.7% -> 86.3% |
-
-Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
-
-### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
-
-| kernel                          | base (0026)            | lever1 (0028)                                |
-|---------------------------------|------------------------|----------------------------------------------|
-| k_get_rows_float<float,float>   | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms                       |
-| delta                           |                        | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
-| ssm_conv_update(_ids)_f32       | 219.71 ms (update)     | 225.75 ms (update_ids, +6 ms)                |
-| ssm_conv_gather_nonident_kernel | -                      | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
-
-The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
-into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
-the -3.13 ms/step throughput delta at npl128.
-
-### Verdict (gather-bench)
-
-Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
-gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
-throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
-MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
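
Annotation on the deleted LEVER1_GATHER_RESULTS.md above: the way the note identifies the residual gather from its launch geometry can be checked numerically. The sketch below is a hedged Python re-derivation, not code from the patch. The 256-float block width is an assumption inferred from 24576 / 96 in the note, and the function names are illustrative.

```python
def conv_state_row_floats(d_conv: int, d_inner: int, n_group: int, d_state: int) -> int:
    """Floats per sequence in the conv-state row (the note's n_embd_r)."""
    return (d_conv - 1) * (d_inner + 2 * n_group * d_state)


def ssm_state_row_floats(d_inner: int, d_state: int) -> int:
    """Floats per sequence in the recurrent SSM state (the note's n_embd_s)."""
    return d_state * d_inner


GET_ROWS_BLOCK = 256  # assumption: floats covered by one block along grid Y

# Model dimensions quoted in the note.
n_embd_r = conv_state_row_floats(d_conv=4, d_inner=4096, n_group=16, d_state=128)
n_embd_s = ssm_state_row_floats(d_inner=4096, d_state=128)

conv_grid_y = n_embd_r // GET_ROWS_BLOCK   # matches the profiled block_num_y
ssm_grid_y = n_embd_s // GET_ROWS_BLOCK    # the gather already fused by 0019

# Instances removed in the 64-step nsys window: one per GDN layer per step.
removed_instances = 30 * 64
```

A grid Y of 96 therefore points at the conv-state row and 2048 at the SSM state, which is how the note separates the two gathers.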
--- a/backend/cpp/llama-cpp/patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md
@@ -1,77 +0,0 @@
-# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
-
-The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
-(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
-bit-exact tensor reshape that re-routes the per-layer SSM output projection
-from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
-
-## The mechanism (profiled, both engines)
-
-Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
-The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
-(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
-to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
-`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
-128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
-the ssm_out weight read across the 128 sequences. vLLM packs the same projection
-into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
-only the output projection was in 3D SSM layout.
-
-## The fix
-
-In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
-the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
-decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
-MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
-so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
-2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
-proven by the in-projection.
-
-```
--    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
-+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-     ...
-     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
--    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-```
-
-## Gates (all PASS)
-
- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
-  post-SSM baseline build:
-  - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
-  - MoE   q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
- Coherent dense + MoE output (greedy text inspected).
-
-## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
-
-S_TG t/s (decode aggregate):
-
-| model            | npl | baseline | Lever 1 | delta   |
-|------------------|-----|----------|---------|---------|
-| dense q36-27b    |  32 |   170.52 |  200.00 | +17.3%  |
-| dense q36-27b    | 128 |   254.92 |  335.80 | +31.7%  |
-| MoE   q36-35b-a3b|  32 |   373.28 |  420.77 | +12.7%  |
-| MoE   q36-35b-a3b| 128 |   560.66 |  691.24 | +23.3%  |
-
-Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
-up from 65% post-SSM).
-
-## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
-
-The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
-
-| kernel                              | baseline            | Lever 1              |
-|-------------------------------------|---------------------|----------------------|
-| mul_mat_vec_q<NVFP4, m=1> (o_proj)  | 132.8 ms / 48 inst  | 0 ms / 0 inst        |
-| mul_mat_q<NVFP4, m=128>             | 5463 ms / 8800 inst | 5827 ms / 10000 inst |
-
-The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
-(+1200 instances, +363 ms over the window), and its per-call average DROPS
-(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
-than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
-~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
-old GEMV: the amortized weight read is the win.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
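
Annotation on the deleted LEVER1_OPROJ_MMQ_RESULTS.md above: the routing effect of the reshape can be illustrated with a toy dispatch rule. This Python sketch is a deliberate simplification under stated assumptions: it models only the `ne[1] <= 8` condition quoted in the note, whereas the real ggml-cuda dispatch also depends on tensor types and hardware. `matmul_path` and `MMVQ_MAX_BATCH` are hypothetical names.

```python
from math import prod

MMVQ_MAX_BATCH = 8  # threshold quoted in the note (ne[1] <= 8); assumed, not read from ggml-cuda


def matmul_path(src1_ne: tuple) -> str:
    """Pick the kernel family from src1's second dimension only (simplified)."""
    return "MMVQ" if src1_ne[1] <= MMVQ_MAX_BATCH else "MMQ"


VALUE_DIM, N_SEQ_TOKENS, N_SEQS = 6144, 1, 128   # decode shapes from the note

before = (VALUE_DIM, N_SEQ_TOKENS, N_SEQS)       # 3D SSM layout, ne[1] = 1
after = (VALUE_DIM, N_SEQ_TOKENS * N_SEQS)       # collapsed 2D layout, ne[1] = 128

# Both views cover the same number of elements; only the shape differs.
same_data = prod(before) == prod(after)
```

Because the element count is unchanged, the reshape is only a different view of the same contiguous buffer, which is why the note can claim bit-identical output while the kernel family changes.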
--- a/backend/cpp/llama-cpp/patches/paged/LEVER4_PROJNVFP4_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER4_PROJNVFP4_RESULTS.md
@@ -1,83 +0,0 @@
-# LEVER 4 - NVFP4 the bf16 MoE GDN/attn projections: KL-GATE FAIL, no-ship
-
-GPU agent (L4-gatebench), DGX GB10 (sm_121, BLACKWELL_NATIVE_FP4=1). Build at 0028 (HEAD fafe878,
-branch `paged`). Lever 4 hypothesis (from `MOE_GAP_VS_VLLM.md` + the lever-4 scope): the MoE GGUF's
-GDN/attn projections (in_proj_qkvz=attn_qkv, in_proj_ba=ssm_alpha/ssm_beta, out_proj=ssm_out,
-attn_gate, full-attn attn_q/k/v/output) are left in BF16 by nvidia modelopt while the dense
-q36-27b-nvfp4 (unsloth) already ships them NVFP4. The scope called this a "quant-provenance accident"
-and proposed re-quantizing them to NVFP4 to recover the ~20.3->13.8ms projection-GEMM bucket.
-
-**Verdict: KL-GATE FAIL on every axis, for both variants. STOP, do NOT ship. No 0029 GGUF, no
-gallery entry, no bench, no nsys** (per spec: KL fails first -> report, do not bench/ship). The bf16
-projections are a **deliberate precision choice, not an accident** - re-quantizing them costs ~6% PPL.
-
-## Gate setup (all bit-changing -> KLD gate per spec)
-
- Reference (the "f32" of the gate): `~/work/darwin_36b_opus/f16.gguf` - the full-precision f16 GGUF
-  of the same Qwen3.6-35B-A3B model (qwen35moe, 41 blocks, vocab 248320, embd 2048). Verified that it
-  matches the NVFP4 baseline shape; its own PPL (7.376) is self-consistent with the KLD base.
- KL base: `llama-perplexity --kl-divergence-base` over `wiki.test.raw`, c512, 16 chunks (8192 tok),
-  -ngl 99, seed 1. Base file `~/bench/l4gate/klbase_moe.dat` (2.0 GB). f16 PPL(base) = 7.3734.
- Candidates scored with `--kl-divergence` against that base, identical c512/16-chunks/seed.
- Current "bf16-projection GGUF" baseline = `~/bench/q36-35b-a3b-nvfp4.gguf` (the shipping NVFP4:
-  experts NVFP4, GDN/attn projections BF16). It is the reference for the PPL-delta and argmax gates.
-
-## Measurements (16 chunks, c512, 8192 tokens, wiki.test.raw)
-
-| model | PPL(Q) | PPL delta vs baseline | Mean KLD-to-f16 | Same-top-p (argmax agree vs f16) | RMS dp |
-|-------|--------|-----------------------|-----------------|----------------------------------|--------|
-| baseline NVFP4 (proj BF16, shipping) | 7.3896 | - (reference) | 0.1366 | 84.31% | 9.20% |
-| **projq FULL** (190 proj -> NVFP4, incl. in_proj_ba) | 7.8705 | **+6.51%** | 0.1638 | 81.72% | 10.47% |
-| **projq CONS** (130 proj -> NVFP4, in_proj_ba kept BF16) | 7.8440 | **+6.15%** | 0.1716 | 82.16% | 10.82% |
-
-Baseline vs f16: PPL ratio 1.0022 (+0.22%), i.e. the shipping NVFP4 is already near-f16 - because
-modelopt put the quant-sensitive GDN/attn projections in BF16 and only the experts (designed for FP4)
-in NVFP4. projq pushes the projections to NVFP4 and PPL ratio jumps to 1.067 (FULL) / 1.064 (CONS).
-
-## Gate verdict (all three conditions FAIL)
-
-1. **PPL delta < ~1% vs the bf16-projection GGUF -> FAIL.** FULL +6.51%, CONS +6.15%. Off by ~6x.
-2. **KLD-to-f32 < 0.06 -> FAIL.** The shipping baseline NVFP4 itself sits at 0.137 mean KLD vs f16
-   (per-token KLD is naturally high at 248K vocab), and projq raises it to 0.164 (FULL) / 0.172 (CONS).
-   Whatever the intended reference granularity, projq is strictly worse than the baseline, not < 0.06.
-3. **Zero greedy-argmax flips -> FAIL.** Per-token top-1 agreement vs f16 drops from 84.31% (baseline)
-   to 81.72% (FULL) / 82.16% (CONS): the requant flips the argmax on ~2.2-2.6 percentage points MORE tokens than the
-   shipping model. (A direct `llama-cli --temp 0 -n 48` greedy diff was attempted but the paged
-   llama-cli build segfaults at teardown on ALL models incl. baseline - not projq-specific - so the
-   8192-token Same-top-p above is the argmax measure used; it is strictly stronger than a 48-tok probe.)
-
-CONSERVATIVE (keeping the most quant-sensitive in_proj_ba=ssm_alpha/ssm_beta in BF16) recovered almost
-nothing: 7.844 vs 7.871. The damage is in the BULK attn/GDN projections (attn_qkv, ssm_out, attn_gate,
-attn_q/k/v/output), not the tiny in_proj_ba. An attn_gate-excluded third variant would, at best, shave
-a fraction of a percent off a 6% miss - not worth a GPU pass. lm_head was already NVFP4 in the baseline
-(and in vLLM's checkpoint), so it is not a variable here and was never the issue.
-
-## Why the premise was wrong (root cause of the failure)
-
-The scope assumed vLLM runs these projections in NVFP4. It does not. vLLM runs the **nvidia modelopt
-checkpoint** (`~/bench/q36-35b-a3b-nvfp4-vllm`), which is the SAME provenance that left these exact
-projections in BF16. So:
-
- The baseline GGUF's bf16 projections **match vLLM** already. They are not a llama-vs-vLLM gap.
- modelopt left in_proj_qkvz/in_proj_ba/out_proj/attn_q/k/v/output in BF16 **because they are
-  quant-sensitive in this hybrid gated-DeltaNet + attention model** - the gate confirms this empirically
-  at ~6% PPL. The dense q36-27b-nvfp4 (unsloth) tolerating NVFP4 projections does not transfer: it is a
-  different (non-MoE, different-provenance) model and a different sensitivity profile.
- Re-quantizing them is therefore not "matching vLLM" - it is going BEYOND vLLM's precision and paying
-  for it in quality. The ~20.3ms projection-GEMM bucket is the price of running these projections in
-  high precision; vLLM pays the same precision cost (its nvjet/cutlass bf16 GEMMs), so the bucket is NOT
-  the lever it looked like. The L4 speed win is real but only purchasable with a 6% PPL regression -
-  rejected by the gate.
-
-## Disposition / artifacts
-
- Both projq GGUFs exist on DGX but are **dead** (do not publish): `~/bench/q36-35b-a3b-nvfp4-projq.gguf`
-  (FULL, md5 1bd32114..., sha256 88b7e812...), `~/bench/q36-35b-a3b-nvfp4-projq-cons.gguf` (CONS, md5
-  6847ebe3..., sha256 ca035111...). The L4-requant pin files (`~/bench/pins_projq_{full,cons}.txt`) and
-  `/tmp/gen_pins.py` remain if a future, kernel-side (not precision-side) approach is ever revisited.
- Gate logs: `~/bench/l4gate/` - `f16base.log`, `kld.{baseline,projqFULL,projqCONS}.log`,
-  `klbase_moe.dat`.
- No code change, no patch, no commit to the DGX `llama-paged-dev` tree. No `-paged` gallery entry.
- MoE remains at 86.3% of vLLM @ npl128; this lever does not move it within the quality budget.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
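
Annotation on the deleted LEVER4_PROJNVFP4_RESULTS.md above: the PPL-delta condition of the gate reduces to a ratio check against the shipping baseline. A minimal Python sketch using only the perplexities from the measurements table; the 1% threshold is the note's "< ~1%" read as a hard limit, and the helper names are illustrative rather than part of any tool.

```python
def ppl_delta_pct(candidate_ppl: float, baseline_ppl: float) -> float:
    """Relative perplexity change of a candidate vs the baseline, in percent."""
    return (candidate_ppl / baseline_ppl - 1.0) * 100.0


PPL_GATE_PCT = 1.0     # assumption: the note's "< ~1%" treated as a hard limit
BASELINE_PPL = 7.3896  # shipping NVFP4 GGUF with BF16 projections
F16_PPL = 7.3734       # f16 reference PPL(base)

CANDIDATES = {"projq FULL": 7.8705, "projq CONS": 7.8440}

# Map each candidate to (delta in percent, whether it passes the gate).
results = {
    name: (ppl_delta_pct(ppl, BASELINE_PPL), ppl_delta_pct(ppl, BASELINE_PPL) < PPL_GATE_PCT)
    for name, ppl in CANDIDATES.items()
}

# How far the shipping baseline itself sits from the f16 reference.
baseline_vs_f16 = ppl_delta_pct(BASELINE_PPL, F16_PPL)
```

Both candidates miss the limit by roughly a factor of six, while the shipping baseline sits about 0.22% above f16, in line with the table.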
--- a/backend/cpp/llama-cpp/patches/paged/LEVER5_HOSTPIPE_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER5_HOSTPIPE_RESULTS.md
@@ -1,73 +0,0 @@
-# Lever 5 - block-table within-step host cache (patch 0029)
-
-## What
-
-`get_block_table()` is called once per full-attention layer per decode step. The
-KV cell layout (and therefore the block table bytes) is fixed for the whole step;
-it only changes in `apply()` when the ubatch's slots are committed. The old path
-recomputed the full table on every full-attention layer of every step.
-
-Patch 0029 builds the table once per step and reuses the bytes (`memcpy`) for the
-remaining full-attention layers, invalidating the cache in `apply()`. The reused
-bytes are identical to a fresh compute, so the change is bit-exact. Disable with
-`LLAMA_PAGED_NO_BT_CACHE=1`.
-
-## Host-side get_block_table time (the lever)
-
-`llama-batched-bench`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`,
-`-npp 128 -ntg 128 -npl 128 -ngl 99 -fa on`, measured with the in-tree
-`[L5INSTR]` host timers (aggregate over the full bench, n=2048 dense / 1280 MoE
-get_block_table calls):
-
-| model | get_block_table host, cache OFF | cache ON | reduction |
-|-------|--------------------------------:|---------:|----------:|
-| MoE  q36-35b-a3b-nvfp4 | 112.94 ms | 14.82 ms | -87% |
-| dense q36-27b-nvfp4    | 193.78 ms | 16.90 ms | -91% |
-
-The MoE 112.94 -> 14.82 ms is the "110 -> 14 ms host" headline. `set_inputs`
-host time falls in lockstep (MoE 128.6 -> 32.0 ms; dense 220.2 -> 36.5 ms) and
-`process_ubatch` host (hostproc) drops MoE 498.8 -> 413.0 ms, dense 730.1 ->
-544.2 ms.
-
-## Throughput effect
-
-Same bench, TG (decode) tokens/s, cache OFF -> ON:
-
-| model | TG t/s OFF | TG t/s ON | delta | vs vLLM @npl128 |
-|-------|-----------:|----------:|------:|----------------:|
-| dense q36-27b-nvfp4 | 364.81 | 374.72 | +2.7% | 374.72 / 391 = 95.8% |
-| MoE  q36-35b-a3b    | 752.19 | 756.97 | +0.6% (flat) | n/a |
-
- Dense decode is partly host-bound, so removing ~90% of the get_block_table host
-  time lifts dense TG by a few percent (run-to-run; ~0.4-2.7% across runs) and
-  pushes it to ~96-97.5% of the vLLM 391 t/s @npl128 reference.
- MoE decode is compute-bound (the FP4 GEMM dominates the step), so the ~98 ms of
-  saved host time is hidden behind GPU compute and is off the critical path: MoE
-  TG is flat. The deployment path (MoE) sees no regression and no win - the cache
-  is a pure pipeline cleanup there.
- npl=1 single-stream decode: get_block_table is tiny either way (MoE 0.64 ->
-  0.22 ms over 128 steps); the lever only matters at batch.
-
-## Bit-exactness
-
-`llama-completion -p "The capital of France is" -n 48 --temp 0 --seed 1`,
-chat-template (conversation) path:
-
-| path | md5 |
-|------|-----|
-| non-paged MoE | 07db32c2bcb78d17a43ed18bc22705cd |
-| paged MoE, cache ON  | 8cb0ce23777bf55f92f63d0292c756b0 |
-| paged MoE, cache OFF (`LLAMA_PAGED_NO_BT_CACHE=1`) | 8cb0ce23777bf55f92f63d0292c756b0 |
-| dense non-paged | 5951a5b4d624ce891e22ab5fca9bc439 |
-| dense paged | 5951a5b4d624ce891e22ab5fca9bc439 |
-
-cache ON == cache OFF confirms the lever is numerically neutral. The paged-MoE
-md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a benign
-FP-accumulation-order difference of the paged attention reduction, KL-validated
-in PAGED_BITEXACT_NOTE.md (not introduced by this lever - it is present on the
-0028 baseline too).
-
-## Verdict
-
-Ship. Bit-exact per path, real host-pipe win on host-bound (dense) decode,
-neutral on the compute-bound MoE deployment path.
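
Annotation on the deleted LEVER5_HOSTPIPE_RESULTS.md above: the within-step cache can be sketched as build once, reuse the bytes, invalidate on commit. This is a minimal Python model of that idea, not the C++ from patch 0029. The class name, the `compute` callback and the `n_computes` counter are hypothetical; only `get_block_table()` and `apply()` are names taken from the note.

```python
from typing import Callable, Optional


class BlockTableCache:
    """Caches the block-table bytes for the duration of one decode step."""

    def __init__(self, compute: Callable[[], bytes]) -> None:
        self._compute = compute              # builds the table from the current KV layout
        self._cached: Optional[bytes] = None
        self.n_computes = 0                  # how many times the table was actually built

    def get_block_table(self) -> bytes:
        # The first full-attention layer of a step builds the table;
        # later layers in the same step reuse the identical bytes.
        if self._cached is None:
            self._cached = bytes(self._compute())
            self.n_computes += 1
        return self._cached

    def apply(self) -> None:
        # Committing the ubatch's slots may change the cell layout,
        # so the cached table must not outlive this point.
        self._cached = None
```

A short usage example: with 16 full-attention layers, calling `get_block_table()` 16 times between two `apply()` calls runs `compute` once, and every call returns the same bytes a fresh computation would produce. That equality is the property the note relies on for bit-exactness.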
--- a/backend/cpp/llama-cpp/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md
+++ b/backend/cpp/llama-cpp/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md
@@ -420,8 +420,8 @@ may want a second lighter gallery variant (context_size 16384, parallel 4, drop
 ================================================================================

 The two GGUFs already exist on the DGX dev box (final_benchmark.csv references
-q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; QWEN36_NVFP4_BENCH.md section "The 4 models"
-documents provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
+q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; README.md "Models" + "Benchmarks"
+document provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
 tensors from nvidia modelopt weights). To publish:

  1. HF repos (suggest two, under the org that owns the gallery-referenced weights):
@@ -434,7 +434,7 @@ tensors from nvidia modelopt weights). To publish:
  3. Model card metadata: base_model Qwen/Qwen3.6-*, library_name gguf, quantization NVFP4,
     pipeline_tag text-generation, license (confirm Qwen3.6 license terms - apache-2.0 vs
     Qwen community license), a note that it REQUIRES the llama-cpp-paged backend (NVFP4 +
-     paged), and the GB10 benchmark table (link QWEN36_NVFP4_BENCH.md numbers).
+     paged), and the GB10 benchmark table (link README.md "Benchmarks" numbers).
  4. NVFP4 requires a llama.cpp new enough to read the NVFP4 GGUF type. Confirm the pinned
     LLAMA_VERSION in backend/cpp/llama-cpp/Makefile supports NVFP4 tensor types (the dev
     tree that produced the GGUFs did). If the current pin predates NVFP4 GGUF support, the
--- a/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
@@ -1,143 +0,0 @@
-# Patch 0015 findings: expert-density-aware MoE token-tile auto-select
-
-The durable follow-up to patch 0014 (`MOE_TOKEN_TILE_CAP.md`): replace the blunt,
-opt-in `LLAMA_MOE_MMQ_X` global cap with a host-side, **default-on** density-aware
-`mmq_x` auto-select in `mul_mat_q_case`. Companion to
-`0015-paged-expert-density-aware-moe-token-tile-auto-select.patch`. Dev tree
-`~/llama-paged-dev` (branch `paged`), `build-cuda` sm_121.
-
-Primary model: **Qwen3.6-35B-A3B NVFP4** (`~/bench/q36-35b-a3b-nvfp4.gguf`),
-**256 experts, top-8**, expert FFN 512, GDN linear attention (SSM inner 4096),
-41 layers. This is a different beast from 0014's Qwen3-Coder-30B-A3B (128 experts,
-larger expert FFN, standard attention).
-
-## What it does (vs 0014)
-
-`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max` (= `ne12`,
-the per-expert column upper bound = token count) in one column-tile, i.e. stock
-**maximizes** the tile (128 on Blackwell). Applied per expert at MoE decode, where
-per-expert density is tiny, that 128-wide tile is mostly padding.
-
-Patch 0014 capped `mmq_x` globally on the ids path via `LLAMA_MOE_MMQ_X` (decode
-**and** prefill), which cost ~1.3% prefill. Patch 0015 instead estimates the
-per-expert density host-side, from args the ids path already passes:
-
-```
-ne_get_rows = ncols_dst   = ne12 * n_expert_used        (token-expert assignments)
-n_experts   = nchannels_x = ne02
-density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
-```
-
-and caps to the small tile (default 64) **only when `density <= density_max`**, so
-the high-density prefill ubatch keeps the big 128 tile. Prefill-safe by construction.
-No new kernel: the selection only lowers the loop's upper bound to an
-already-compiled, granularity- and shared-memory-validated `mmq_x`.
-
-## The threshold matters: `density_max = 8`, not `tile/4 = 16`
-
-The cap must fire for decode but not for a prefill ubatch. Each has per-expert
-density `n_tokens * n_used / n_experts`. At the standard `n_ubatch=512`, `n_used=8`:
-
-```
-                       128 experts   256 experts
-prefill ubatch (512)        32            16
-decode npl128 (128)          8             4
-```
-
-`tile/4 = 16` (0014's first auto-select draft default) **equals the 256-expert
-prefill density** and caps prefill: measured -2.0% to -2.9% S_PP on q36-35b-a3b.
-`density_max = 8` separates decode (density <= 8) from prefill (density > 8) for every
-`n_experts` in `[128, 511]`, so it caps decode and leaves prefill on the big tile. This single
-default change is what makes the patch prefill-safe on the 256-expert model.
-
-## Measurements (default-on vs stock, median of 5 reps)
-
-`llama-batched-bench`, q36-35b-a3b-nvfp4.gguf, `-fa on -npp 128 -ntg 128`, GB10
-sm_121. STOCK = `LLAMA_MOE_AUTO_TILE=0` (exact stock selection); 0015 = default.
-
-```
-  npl   S_TG stock  S_TG 0015   dTG%     S_PP stock  S_PP 0015   dPP%
-    8      183.59     183.18  -0.22%        1489.2     1500.1  +0.73%
-   32      264.02     263.44  -0.22%        2034.5     2033.5  -0.05%
-   64      311.76     310.41  -0.43%        2028.3     2027.6  -0.03%
-  128      336.10     337.32  +0.36%        2025.0     2027.7  +0.13%
-```
-
-Raw npl128 reps: S_TG 0015 `[337.3, 336.9, 336.4, 338.9, 338.1]` vs stock
-`[336.2, 336.1, 335.9, 336.9, 335.8]` (distributions overlap); S_PP 0015
-`[2028.6, 2023.0, 2024.9, 2028.0, 2027.7]` vs stock `[2024.9, 2025.0, 2023.2,
-2029.4, 2029.0]`.
-
-### Honest read: neutral on this model
-
-On q36-35b-a3b the decode delta is **within run-to-run noise** (npl128 +0.36%,
-npl<=64 slightly negative) and prefill is **neutral** (within +/-0.7%, well inside
-the 1% target). The `+5%` decode target from the localmaxxing reference does **not**
-materialize here. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
-256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
-lever has nothing to bite on.
-
-### npl128 decode tile sweep confirms 64 is the only useful width
-
-`density_max=8` fixed, varying `LLAMA_MOE_DECODE_TILE`, S_TG @ npl128 vs stock:
-
-```
-  TILE8   TILE16  TILE32  TILE64  TILE96
- -6.31%   -3.18%  -0.17%  +0.70%  -0.76%
-```
-
-Smaller tiles are **worse**, not better: more column-tiles per expert = more
-grid/scheduling overhead, and the FP4-MMA has a minimum efficient width. So matching
-the tile to the literal density (4) is counterproductive; 64 is the sweet spot,
-same as 0014.
-
-## Why ship it default-on anyway
-
-1. **Removes 0014's prefill cost by construction.** The cap is density-gated, not
-   global, so prefill keeps its 128 tile (S_PP neutral above).
-2. **Banks the col-tile-bound gain for free.** At npl128 the auto-select picks
-   `tile=64` for a 128-expert model (decode density 8 <= 8), i.e. exactly 0014's
-   `cap64`, so it reproduces 0014's **+4.8% @npl128 on Qwen3-Coder-30B** without the
-   -1.3% prefill cost. (That model was unavailable to re-bench here; the tile choice
-   is identical by construction.)
-3. **Prefill-safe and decode-neutral on the SSM model**, so it is harmless where it
-   does not help.
-4. **Correctness-gated** by the P0 harness (below).
-
-## Conservative by design (known limitation)
-
-A pure-density gate cannot separate two cases with the **same** per-expert density:
-Qwen3-Coder npl256 decode (density 16) and the 256-expert prefill ubatch (density
-16) are identical to the estimator. `density_max=8` therefore **forgoes 0014's
-+2.3% @npl256** on the 128-expert model to keep 256-expert prefill safe. Recovering
-it needs an `ne12`-aware (absolute token count) gate in addition to density; scoped
-as future work, not implemented.
-
-## Knobs
-
-- `LLAMA_MOE_AUTO_TILE=0` : disable the auto-select, exact stock `mmq_x` selection.
-- `LLAMA_MOE_MMQ_X=<n>` (patch 0014) : **kept** as a manual override; when > 0 it
-  forces the old blunt global cap and bypasses the auto-select (explicit A/B knob).
-- `LLAMA_MOE_DECODE_TILE=<n>` : the small tile (default 64).
-- `LLAMA_MOE_DENSITY_MAX=<n>` : the density ceiling (default 8).
-
-## P0 correctness gate
-
-`tests/test-backend-ops` `test_mul_mat_id` is extended with a ragged small-M
-NVFP4/MXFP4 MoE decode-density block: 128 experts, top-8, m=768, k=2048, n in
-`{16,33,64,128,130,200,256,512}` spanning the cap boundary (n>=130 keeps the 128
-tile at `density_max=8`, n<=128 takes tile 64) and ragged token counts (experts with
-0/1/2 tokens, n not a multiple of the tile). All 16 shapes pass the CUDA-vs-CPU
-oracle on GB10 both default-on and with `LLAMA_MOE_AUTO_TILE=0`; full `MUL_MAT_ID`
-suite 2/2 backends OK. Off the ids path nothing changes (non-MoE `mul_mat`
-byte-identical to stock).
-
-## Verdict
-
-- Correct, prefill-safe, default-on density-aware tile select; the durable design
-  0014's own doc scoped. Supersedes 0014's global cap as the default path; the
-  `LLAMA_MOE_MMQ_X` knob is retained as a manual override.
-- **Net effect on q36-35b-a3b NVFP4: neutral** (decode within noise, prefill neutral)
-  because the model is SSM/bandwidth-bound, not col-tile-bound. The lever's real win
-  lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128), banked here at zero
-  prefill cost.
--- a/backend/cpp/llama-cpp/patches/paged/MOE_GAP_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_GAP_PROGRESS.md
@@ -1,21 +0,0 @@
-# MOE_GAP_PROGRESS.md - moe-gap-groundtruth GPU agent checkpoint
-
-Status: **DONE.** Both-engine MoE decode decomposition complete. Findings in `MOE_GAP_VS_VLLM.md`.
-
-## Runs (DGX GB10 sm_121, GPU free, foreground)
-- llama: `build-cuda` 2f4f5ab (0025), `llama-batched-bench -npp128 -ntg128 -npl128 -c32768 -fa on`,
-  `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`. S_TG=752.3 t/s, step 169.8 ms, busy 97.5%.
-  Artifacts on DGX: `~/llama-paged-dev/moe_gap_llama.{nsys-rep,trace.csv}`.
-- vLLM 0.23.0 graphs-ON (FULL_AND_PIECEWISE, the 882-ref config): `~/bench/moe_gap_vllm.py` under
-  `nsys --capture-range=cudaProfilerApi`. step 142.0 ms, busy 99.7%.
-  Artifacts on DGX: `~/bench/moe_gap_vllm.{nsys-rep,trace.csv}`, script `~/bench/moe_gap_vllm.py`.
-- Extractor: `~/bench/decode_decomp2.py` (dual-engine, steps = GDN-kernel-count / 30; cross-checked vs
-  flash/reshape_cache = 10x and vs throughput). Grouped-MoE GEMM isolated by per-call duration (LONG/SHORT).
-
-## Result (1 line)
-Gap = 27.8 ms/step (llama 83.6% of vLLM). **MoE grouped GEMM is a llama WIN** (native FP4-MMA W4A4 47.3 ms
-vs Marlin W4A16 50.0 ms). The 15% is bf16-projections+convert (+6.5), recurrence state-gather plumbing
-(+6.6, led by k_get_rows 5.2 ms), graph/overlap (+7.0), W4A4 act-quant tax (+3.3), router/glue (+5.4).
-Marlin is NOT the lever; do not build a W4A16 MoE GEMM.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_GAP_VS_VLLM.md
@@ -1,484 +0,0 @@
-# MOE_GAP_VS_VLLM.md - ground-truth both-engine MoE decode decomposition (where vLLM's ~15% lives)
-
-> **READ THE FINAL SECTION FIRST ("RESIDUAL-ASSESS (FINAL)" at the bottom).** It concludes the hunt and
-> CORRECTS one premise used throughout the body below: this doc assumes vLLM runs the GDN/attn projections
-> as NVFP4-Marlin. It does NOT. vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the
-> projection bucket is a matched-precision (bf16) gap, not a quant gap. Lever 4 (NVFP4 the projections) is
-> REJECTED (+6% PPL, and not even a vLLM gap). The MoE is at its bit-exact ceiling (~86-88% of vLLM).
-
-THE GPU AGENT (label `moe-gap-groundtruth`), DGX GB10 (sm_121). First **side-by-side, both-engine,
-per-kernel ms/step** decomposition of the MoE decode gap. All prior B work decomposed llama ONLY; this
-profiles vLLM's decode step too and computes the per-bucket `llama - vLLM` delta to pinpoint the gap.
-
-Model `q36-35b-a3b-nvfp4` (40 layers: 30 GDN linear-attn + 10 full-attn, 256 experts top-8, vocab 248320).
-Both engines profiled at **batch 128 decode** with `nsys --cuda-graph-trace=node`, steady-decode window,
-per-step normalized by GDN-kernel-count / 30 (cross-checked vs flash/reshape_cache counts and throughput).
-
-- **llama**: `build-cuda` tip `2f4f5ab` (patch 0025), `llama-batched-bench -npp 128 -ntg 128 -npl 128
-  -c 32768 -fa on`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1` (the re-graph ON = the 752 t/s ship point).
-  Measured **S_TG = 752.3 t/s** => **step = 169.8 ms**, GPU-busy 97.5% (idle 2.5% = 4.2 ms/step).
-- **vLLM 0.23.0**: `q36-35b-a3b-nvfp4-vllm`, **CUDA graphs ON** (`cudagraph_mode=FULL_AND_PIECEWISE`,
-  the 882-reference config, NOT enforce_eager), MARLIN NvFp4 MoE, 128 seqs x 128-tok prompt x 128 gen.
-  Measured **step = 142.0 ms** (= 901 t/s-equiv), GPU-busy 99.7% (idle 0.3% = 0.4 ms/step).
-- Gap reproduced: **169.8 - 142.0 = 27.8 ms/step** (llama 83.6% of vLLM here; matches the ~85% server number).
-
-## THE HEADLINE: the MoE grouped GEMM is NOT vLLM's advantage - it is a llama WIN
-
-Grouped MoE-expert GEMM, isolated by per-call duration (LONG calls = the per-expert grouped GEMM):
-
-| grouped MoE-expert GEMM | ms/step | what |
-|-------------------------|--------:|------|
-| **llama** `mul_mat_q<NVFP4,M-tile=64>` (+stream-k fixup + gather) | **48.3** | native Blackwell FP4-MMA **W4A4** |
-| **vLLM** `marlin_moe_wna16::Marlin` | **50.0** | **W4A16** (FP4 weights dequant-in-kernel -> bf16 MMA) |
-
-**llama's native-FP4 grouped GEMM is ~1.7 ms/step FASTER than vLLM's Marlin W4A16 at the ragged
-tiny-M (~4 rows/expert) decode shape** (pure GEMM core 47.3 vs 50.0). Both read the same ~4-bit weight
-bytes and are bandwidth-bound, so they tie to within a few %, and llama's 2x-rate FP4-MMA edges it.
-**=> Marlin is NOT faster here; a Marlin-style W4A16 MoE GEMM in llama would make the MoE GEMM SLOWER.**
-This directly answers the brief's load-bearing question #1/#2 and extends the prior `w4a16-marlin` DENSE
-conclusion ("the win was NVFP4-dense-quant, not the Marlin kernel") to MoE: **the MoE GEMM kernel is not
-the lever; llama already beats Marlin there.**
-
-## Side-by-side per-step decomposition (ms/step, kernel-time attribution)
-
-| bucket | llama ms | vLLM ms | Δ llama-vLLM | note |
-|--------|---------:|--------:|-------------:|------|
-| **Recurrence / SSM**           | **79.3** | **72.7** | **+6.6** | core kernel is a llama WIN (70.0 vs 71.1); the gap is llama's state-gather/conv plumbing |
-| **MoE-expert grouped GEMM**    | 48.3 | 50.0 | **-1.7** | **llama FASTER** (native FP4-MMA W4A4 vs Marlin W4A16) |
-| **Dense projections (+glue)**  | **20.3** | **13.8** | **+6.5** | llama runs GDN/attn projections in BF16 cublas; vLLM runs them as compact NVFP4-Marlin; +2.9 ms is llama's bf16<->f32 `convert_unary` glue vLLM never pays |
-| **Norms / glue / memcpy**      | 9.6 | 6.0 | +3.6 | llama `k_bin_bcast` (expert-combine+residual) 4.3 + memcpy 2.4 heavier |
-| **Act-quant (W4A4 tax)**       | 3.3 | 0.0 | **+3.3** | `quantize_mmq_nvfp4`; vLLM W4A16 keeps acts bf16 => structurally ZERO |
-| **Router / align**             | 2.4 | 0.5 | +1.9 | llama computes router via a full FP4 GEMM (1.6) + argsort/scatter; vLLM fuses topk/align |
-| **Attention (full-attn)**      | 2.8 | 2.6 | +0.2 | parity |
-| kernel-time subtotal           | 166.1 | 145.7 | +20.4 | |
-| **GPU idle (host bubble)**     | 4.2 | 0.4 | **+3.8** | graph coverage: llama partially-graphed (0025) vs vLLM FULL_AND_PIECEWISE |
-| cross-stream overlap (union<sumdur) | ~0.8 | ~4.0 | ~-3.2 (vLLM overlaps more) | vLLM runs more kernels concurrently |
-| **STEP TOTAL (wall)**          | **169.8** | **142.0** | **+27.8** | |
-
-### Per-engine top kernels (ms/step)
-
-```
-llama (752 t/s, step 169.8 ms, 97.5% busy)        vLLM (901-equiv, step 142.0 ms, 99.7% busy)
- 70.0  gated_delta_net_cuda          REC core      71.1  fused_recurrent_gated_delta   REC core
- 47.3  mul_mat_q grouped MoE (M=64)  MoE GEMM       50.0  marlin_moe_wna16::Marlin      MoE GEMM
-  8.2  nvjet 192x136 (bf16 proj)     PROJ            4.0  nvjet 128x72 (bf16 proj)      PROJ
-  5.2  k_get_rows_float  REC-GATHER  REC <-- vLLM    2.8  marlin dense (lm_head NVFP4)  PROJ
-  4.5  cutlass::Kernel2 (bf16 GEMM)  PROJ           has   2.7  nvjet 128x64 (bf16 proj)  PROJ
-  4.3  k_bin_bcast (combine+resid)   GLUE           no    2.5  flash_fwd_splitkv         ATTN
-  4.1  nvjet 128x64 (bf16 proj)      PROJ           equiv 2.0  marlin dense small (NVFP4) PROJ
-  3.4  ssm_conv_update_f32           REC            of    1.6  causal_conv1d_update      REC
-  3.3  quantize_mmq_nvfp4  W4A4 TAX   ACTQ <-- vLLM  these 1.4  std::enable_if (glue)     GLUE
-  2.9  convert_unary bf16<->f32      PROJ-GLUE <--   two   1.2  reduce_kernel             GLUE
-  2.8  flash_attn_tile               ATTN           (5.2+  1.0  cutlass::device (fp8 lin) PROJ
-  2.4  MEMCPY-Device (SSM state)     GLUE           2.9 =  0.8  nvjet 32x64               PROJ
-  1.6  mul_mat_q router (M=128)      ROUTER          8 ms  0.4  act_and_mul (SwiGLU)      GLUE
-  1.5  rms_norm_f32                  GLUE           pure   0.2  topkGating / moe_align    ROUTE
-  ...                                               llama  0.1  reshape_and_cache_flash   ATTN
-                                                     tax)
-```
-
-## WHERE THE 27.8 ms ACTUALLY IS (ranked) - and it is NOT the Marlin GEMM
-
-1. **Dense projections + bf16<->f32 glue: +6.5 ms.** llama keeps the GDN/attn linear projections (and
-   the lm_head) in **BF16** (cublas `nvjet`/`cutlass`, full-precision weight reads) and pays a 2.9 ms
-   `convert_unary` bf16<->f32 tax around them; vLLM runs the same projections as **compact NVFP4-Marlin
-   W4A16** (4-bit weight read, ~4x less BW) and stays bf16 end-to-end (no convert). This is the
-   **`NVFP4-dense-quant` lever the prior `w4a16-marlin` project already identified - applied to the
-   still-bf16 projections**, not the MoE GEMM.
-2. **Recurrence state-gather/conv plumbing: +6.6 ms.** The recurrence CORE kernel is a **llama win**
-   (gated_delta_net 70.0 vs vLLM fused_recurrent 71.1, confirming "past vLLM on BW efficiency"). The gap
-   is entirely the surrounding plumbing: **`k_get_rows_float` 5.2 ms (the recurrent-state gather)** +
-   `ssm_conv_update` 3.4 vs vLLM's single `causal_conv1d_update` 1.6. vLLM has **no gather** - its
-   recurrent state is updated in-place inside the fused decode kernel. `k_get_rows` is the single biggest
-   llama-specific kernel vLLM has no equivalent of.
-3. **Graph coverage + stream overlap: ~+7.0 ms combined** (idle +3.8, cross-stream overlap ~+3.2). vLLM
-   FULL_AND_PIECEWISE is 99.7% busy with more concurrent kernels; llama (partially graphed post-0025) is
-   97.5% busy with thinner overlap.
-4. **W4A4 act-quant tax: +3.3 ms.** `quantize_mmq_nvfp4`; vLLM's W4A16 choice makes this structurally 0.
-   Fusing the quant into the preceding op (as vLLM fuses act_quant into RMSNorm/SiLU) would erase it.
-5. **Router GEMM + norms/glue: +5.4 ms.** llama computes router logits via a full FP4 GEMM (1.6) and has
-   heavier `k_bin_bcast` combine/residual + memcpy; vLLM fuses routing into tiny topk/align kernels.
-
-## THE SINGLE BIGGEST vLLM-MoE ADVANTAGE
-
-**Not the Marlin GEMM.** It is a near-tie between two ~6.5 ms buckets, both bf16-precision-related:
-- **Dense projections (+6.5 ms)** - vLLM runs the GDN/attn projections + lm_head as NVFP4-Marlin while
-  llama runs them BF16 + a 2.9 ms convert tax. Single biggest *bucket* delta.
-- **Recurrent-state gather (+5.2 ms, kernel `k_get_rows_float`)** - the single biggest *kernel* vLLM
-  avoids entirely (in-place fused state vs llama's separate gather). Plus +1.8 ms more REC plumbing.
-
-The MoE grouped GEMM (the brief's hypothesis) is a **-1.7 ms llama win**, so it is explicitly ruled out.
-
-## ANSWERS TO THE BRIEF
-
-1. **WHERE is vLLM's 15%?** Spread across bf16-projection BW (+6.5) + recurrence state-gather plumbing
-   (+6.6) + graph/overlap (+7.0) + act-quant tax (+3.3) + router/glue (+5.4). **NOT the MoE GEMM.**
-2. **Is Marlin faster at tiny-M decode?** **No.** llama native FP4-MMA W4A4 = 47.3 ms vs Marlin W4A16 =
-   50.0 ms. Marlin is ~5% slower here; both are at the LPDDR5x BW floor.
-3. **Should llama implement a Marlin-style W4A16 MoE GEMM?** **No** - it would slow the MoE GEMM and is
-   not where the gap lives. The `w4a16-marlin` DENSE verdict ("NVFP4-dense-quant, not the Marlin kernel")
-   carries to MoE. The real, ordered levers are: **(a) NVFP4-quantize the still-bf16 GDN/attn projections
-   + lm_head** (close ~+6.5, the largest, bit-changing but the same class of move vLLM makes); **(b) fuse
-   away the recurrent-state gather `k_get_rows`** (~+5, bit-exact, the biggest single-kernel win);
-   **(c) fuller CUDA-graph coverage + stream overlap** (~+7, bit-exact); **(d) fuse the W4A4 act-quant
-   into the preceding op** (+3.3, bit-exact). None of these is a new MoE GEMM.
-
----
-
-# FINAL DECISION (cross-agent synthesis) - "can we do what vLLM does on MoE?"
-
-Three agents converged on the same verdict from independent angles: `moe-gap-groundtruth`
-(the measured both-engine nsys decomposition above), `vllm-marlin-study` (source-read of vLLM's
-`moe_wna16_marlin_gemm` / `moe_align_block_size` / `prepare_nvfp4_moe_layer_for_marlin` on the DGX),
-and `marlin-port-feasibility` (read-only assessment of the dense W4A16 scaffold + prior STOP). All
-three agree, and the measurement is the arbiter. Below is the decision the user asked for.
-
-## (1) WHERE the 15% lives - decisive
-
-The gap is **27.8 ms/step (llama at 83.6% of vLLM)** and it is **NOT one kernel - it is a sum of small
-deltas, and the MoE grouped GEMM is on llama's side of the ledger.** Ranked:
-
-| rank | lever | Δ ms/step | bit-exact? | this is... |
-|-----:|-------|----------:|:----------:|------------|
-| 1 | Graph coverage + cross-stream overlap | ~+7.0 | **yes** | scheduler/runtime (idle +3.8, overlap +3.2) |
-| 2 | Recurrence state-gather/conv plumbing (`k_get_rows_float` 5.2 + conv) | +6.6 | **yes** | llama-only kernels; vLLM updates state in-place |
-| 3 | Dense GDN/attn projections + lm_head (bf16 vs NVFP4) + convert glue | +6.5 | **no** | the NVFP4-dense-quant lever, on the projections |
-| 4 | Router GEMM + norms/combine/memcpy glue | +5.4 | mostly yes | llama router = full FP4 GEMM; vLLM fuses topk/align |
-| 5 | W4A4 act-quant tax (`quantize_mmq_nvfp4`) | +3.3 | **yes** | vLLM's W4A16 makes this structurally 0 |
-| - | **MoE-expert grouped GEMM** | **-1.7** | - | **llama WIN** - native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 |
-
-**The Marlin GEMM is explicitly ruled out as the source of the gap.** Both engines read the same ~22 GB
-of ~4-bit expert weights once per step and are LPDDR5x-bandwidth-bound; on that weight stream they tie,
-and llama's 2x-rate FP4-MMA edges Marlin's half-rate bf16 MMA. It is **not the projections-vs-Marlin
-distinction in the experts, it is the projections in the DENSE path, the recurrence plumbing, and the
-runtime/graph** that cost llama the 15%. Not distributed, not the expert GEMM, not routing alone.
-
-## (2) Can llama MATCH it - and HOW
-
-**Yes - to within a few percent, and NOT with a Marlin/W4A16 MoE GEMM.** The two biggest *compute*
-kernels (the gated-DeltaNet SSM core 70.0 vs 71.1, and the MoE grouped GEMM 47.3 vs 50.0) are **already
-llama wins.** The gap is overhead/scheduling/precision-of-the-other-tensors, all of which llama can
-attack on its existing W4A4 FP4-MMA expert path. The four levers, in recommended build order:
-
-| order | build | gain | bit-exact / gate | effort |
-|------:|-------|-----:|------------------|--------|
-| 1st | **Fuse away the recurrent-state gather `k_get_rows_float`** (update SSM state in-place in the GDN decode path, fold `ssm_conv_update`) | ~+5 ms (~3% of step) - biggest single-kernel win | **bit-exact** (no md5 rebaseline) | medium - CUDA, the GDN decode kernel |
-| 2nd | **Fuller CUDA-graph coverage + stream overlap** (extend the 0025 re-graph to the remaining MoE/projection nodes, overlap independent streams) | ~+7 ms combined; 0025 already banked ~+1.9% | **bit-exact** | medium - scheduler, partly done |
-| 3rd | **NVFP4-quantize the still-bf16 GDN/attn projections + lm_head** (the same move vLLM makes on its dense path; 4-bit weight read ~4x less BW, kills the 2.9 ms bf16<->f32 convert) | ~+6.5 ms - biggest *bucket* | **bit-changing** (re-baselines md5 gates; precision-UPGRADE, see below) | medium-high - new NVFP4 weight path for non-expert linears |
-| 4th | **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** (as vLLM fuses act-quant) | +3.3 ms | **bit-exact** | low-medium |
-
-**Reach:** the three bit-exact levers (1+2+4 ~= +15.3 ms) alone close the gap to ~154.5 ms/step
-=> ~830 t/s = **~94% of vLLM, with zero precision change and zero md5 rebaseline.** Adding the
-NVFP4-projection lever (3, +6.5) reaches ~148 ms => ~865 t/s = **~96-97% of vLLM**, with the residual
-being router/glue and the irreducible cross-stream-overlap that is structural to how ggml schedules
-host-launched nodes vs vLLM's single fused graph. Because llama's two heaviest kernels are already
-ahead, **parity-or-better is physically reachable** once the plumbing/overhead is removed; vLLM has no
-arithmetic advantage on this hardware (its W4A16 is half-rate FP4 - it only wins on overhead and on the
-dense-path weight-read BW).
-
-## (3) The leading lever, in full - and the Marlin question, settled
-
-**The user's specific hypothesis - "do what vLLM does = a Marlin-style W4A16 grouped MoE GEMM" - is
-REJECTED, by measurement and by feasibility.**
-
-- **It is not where the gap is.** The MoE GEMM is a **-1.7 ms llama win.** A W4A16 Marlin MoE GEMM would
-  make that bucket SLOWER (half-rate bf16 MMA on the ~27% GEMM bucket), not faster.
-- **Its entire intrinsic upside is the ~2% act-quant tax** (W4A16 has no activation quantize). That
-  +2% ceiling is **no larger than the +1.9% the bit-exact 0025 re-graph already banked**, at vastly higher
-  effort and with a precision change. And the act-quant tax is independently closeable bit-exactly by
-  lever 4 (fuse it into the preceding op) without touching the GEMM.
-- **The scaffold does not help.** `paged/kernel/w4a16/marlin-w4a16.cu` is dense-only, Q4_0/Q4_K, with no
-  grouped/MUL_MAT_ID path and no NVFP4 dequant. A real MoE Marlin is effectively a from-scratch port of
-  `moe_wna16_marlin_gemm` (per-expert M-tiles, block-padded `moe_align` token-sort, stream-K over ragged
-  segments, NVFP4->bf16 in-kernel dequant). vLLM only reaches the BW floor via cutlass-SM120 TMA +
-  warp-specialized pipelining; the GB10 occupancy-only route the dense scaffold tried **plateaued at
-  ~9 TFLOPS / 178 t/s (~5x under MMQ)** and STOPPED at the occupancy wall (XOR-swizzle + deep cp.async
-  collapse GB10 occupancy). Realistic outcome of an MoE port: **a net REGRESSION** on the 27% GEMM
-  bucket. Multi-week, high-risk, DGX-only, no `ncu`, for a +2% ceiling. **Do not build it.**
-
-**Why vLLM runs W4A16 at all:** not because it is better - because sm_121 (consumer Blackwell / GB10)
-has no working cutlass FP4 MoE cubins (vLLM whitelists only sm_100/103 datacenter Blackwell for native
-FP4 MoE; the engine literally warns it is falling back to "Weight-only FP4 ... Marlin kernel"). On GB10,
-W4A16 is HALF the FP4-MMA rate. **llama's native W4A4 FP4-MMA is the higher hardware tier; matching vLLM
-does NOT mean copying its W4A16 fallback.**
-
-**Precision / gate (the brief's key nuance, assessed honestly):** the observation that W4A16 (bf16 acts)
-is a strict activation-precision UPGRADE over W4A4 (FP4 acts), with better KL-to-f32, is **correct but
-unmonetizable here.** (a) The current W4A4 MoE default is **already bit-exact to the f32 reference**
-(test-backend-ops MUL_MAT_ID 806/806, greedy md5 stable on both models) - you get no quality credit for
-being more precise than a default that already passes, and the precision-sensitive site is the
-gated-DeltaNet SSM *state* (a different op, addressed by the separate 0026 bf16-SSM opt-in), not the MoE
-GEMM. (b) W4A16 is **non-bit-exact vs the W4A4 default, so adopting it re-baselines every shipped md5
-gate** - a real cost for a +2% throughput ceiling that is itself likely negative. So the precision angle
-does not flip the verdict: it would be a precision upgrade nobody needs, bought with a slower,
-occupancy-hostile, gate-rebaselining kernel. The one genuinely precision-positive AND throughput-positive
-move that quantizes weights is **lever 3 (NVFP4 projections)** - and that is W4A16 on the DENSE linears
-(where it cuts weight-read BW), not on the experts.
-
-## (4) HONEST VERDICT + recommended build
-
-**VERDICT: We can essentially match vLLM on MoE decode (~94% bit-exact, ~96-97% with the projection
-quant, parity-or-better physically in reach), but NOT by doing "what vLLM does" in the sense the question
-implies. A Marlin/W4A16 grouped MoE GEMM is the wrong lever - the MoE GEMM is already a llama win and a
-W4A16 port would regress it. The 15% is bf16 dense-projection bandwidth + recurrence-gather plumbing +
-graph/overlap overhead + a 2% act-quant tax + router glue. Every piece is closeable on llama's existing
-native-FP4 expert path, mostly bit-exactly.**
-
-**Recommended build (ship order, none of it a new MoE GEMM):**
-1. **`k_get_rows` SSM-state-gather fusion** - bit-exact, ~+5 ms, biggest single-kernel win, no rebaseline. **Do first.**
-2. **Extend CUDA-graph coverage + stream overlap** beyond 0025 - bit-exact, ~+7 ms combined, partly banked.
-3. **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** - bit-exact, +3.3 ms, erases the act-quant tax (the only thing W4A16 would have bought) without W4A16.
-4. **NVFP4-quantize the bf16 GDN/attn projections + lm_head** - +6.5 ms (biggest bucket), bit-changing
-   (re-gate md5; precision-UPGRADE, the same NVFP4-dense-quant move vLLM makes). Ship as default after
-   re-gating, or as an opt-in if the md5 rebaseline is undesirable.
-
-**Do NOT build:** the W4A16/Marlin grouped MoE GEMM (`paged/kernel/w4a16/` scaffold is dense-only and not
-reusable). Neither default nor opt-in: the ~+2% ceiling is no more than the already-banked bit-exact +1.9%, likely a net
-regression on the 27% GEMM bucket, multi-week high-risk, and it rebaselines every gate. The dense
-`w4a16-marlin` STOP transfers to MoE, and MORE strongly (the tiny-M decode shape is purely BW-bound, so
-the FP4-vs-bf16 tier is a wash that the weight-read floor erases - leaving only the half-rate downside).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
----
-
-# LEVER 4 (scope) - NVFP4-quantize the still-bf16 MoE GDN/attn projections (+lm_head), the +6.5 ms bucket
-
-Label `L4-scope`, READ-ONLY (no GPU). This scopes lever 4 - the single biggest *bucket* in the table
-above (**Dense projections +glue, +6.5 ms**) and the only remaining MoE lever with a real, measurable
-gain after levers 2 and 3 both came back FLAT measurement-STOPs (no patch, no commit - see
-`LEVER2_GRAPH_COVERAGE_RESULTS.md`, `LEVER3_ACTQUANT_FUSION_RESULTS.md`, `LEVERS_23_PROGRESS.md`). Lever 4
-is **bit-changing** (re-gates md5; gate on KL-to-f32, not bit-exact md5). Below: the root cause, the
-path, effort, the precision/KL story, the expected gain, and the default-vs-opt-in recommendation.
-
-## Root cause: the MoE GGUF's projections are bf16 only because of its quant PROVENANCE
-
-The "still-bf16 GDN/attn projections" are **MoE-specific, and they are an accident of how the MoE
-checkpoint was quantized - not a llama limitation.** The two GGUFs have different quant lineages:
-
-- **Dense `q36-27b-nvfp4` (unsloth, native-Blackwell FP4, 304 NVFP4 tensors):** the GDN/attn projections
-  ARE already NVFP4. Proven directly - `DECODE_PARITY_EXPLORE.md:594` shows the dense `ssm_out`
-  (GDN out-projection) running as an **FP4 GEMV/MMQ** (`mul_mat_vec_q`/`mul_mat_q<NVFP4>`), and the
-  in_proj runs FP4 MMQ at M=128. This is exactly why the **dense decode is already at 96.6% of vLLM** -
-  it has essentially no bf16-projection bucket left.
-- **MoE `q36-35b-a3b-nvfp4` (nvidia modelopt, 241 NVFP4 tensors):** modelopt quantized the **256-expert
-  FFN** tensors to NVFP4 (the 241 count is dominated by the packed grouped-expert tensors) but **left the
-  GDN/attn linear projections in BF16** - `in_proj_qkvz`, `in_proj_ba`, the GDN `out_proj`/`ssm_out`, and
-  the full-attn `attn_q/k/v/output`. Those are exactly the **bf16 nvjet/cutlass projection GEMMs** seen in
-  the MoE decode top-kernel list (8.2 `nvjet 192x136` + 4.5 `cutlass::Kernel2` + 4.1 `nvjet 128x64`)
-  plus the 2.9 ms `convert_unary` bf16<->f32 glue = the **20.3 ms projection bucket** vs vLLM's 13.8 ms
-  (vLLM runs the same projections, and on this modelopt checkpoint even its lm_head, as NVFP4-Marlin -
-  see its `2.8 marlin dense (lm_head NVFP4)` kernel).
-
-**=> Lever 4 is overwhelmingly a MoE-GGUF move:** bring the MoE GGUF's GDN/attn projections to the SAME
-NVFP4 the DENSE GGUF already ships and that vLLM already runs on the identical weights. It is not a new
-capability - the dense GGUF is the existence proof that llama runs and ships these projections in NVFP4.
-
-## (1) THE PATH + EFFORT
-
-Two ways to get the projection weights into NVFP4:
-
-- **PATH A - offline re-quantize to a NEW GGUF variant (RECOMMENDED, = exactly what vLLM does).** Re-run
-  `llama-quantize` on the MoE source with the `--tensor-type` selector EXPANDED to also capture the
-  GDN/attn projection tensor-name patterns that the modelopt checkpoint left bf16 (the GDN `in_proj_*` /
-  `out_proj`/`ssm_out` and full-attn `attn_q/k/v/output` weights), producing e.g.
-  `q36-35b-a3b-nvfp4-projq.gguf`. **ZERO kernel/runtime code:** NVFP4 weights already flow end-to-end -
-  the loader auto-creates the per-tensor NVFP4 sidecar scales when `type == GGML_TYPE_NVFP4`
-  (`llama-model.cpp:1459`), and the projection GEMMs then route to the already-tuned `mul_mat_q<NVFP4>`
-  (patch 0017) instead of cublas/nvjet. The dense GGUF is the live proof this path works and gates clean.
-  **Effort: LOW-MEDIUM** - the only "build" is the quantize recipe + a KL gate harness + a gallery/index
-  entry + a RELEASE note. Risk items: (i) confirm the exact bf16 tensor list with a CPU `gguf_dump`
-  (metadata-only, no GPU); (ii) NVFP4 needs the contraction dim divisible by the 16-elt block - any
-  projection whose row dim is not a multiple of 16 stays bf16 (or needs padding), which is the most
-  likely reason a given tensor was left bf16 and must be checked per-tensor; (iii) the lm_head decision
-  (below).
-- **PATH B - runtime quantize bf16->NVFP4 at load.** Convert the bf16 projection weights in-memory at
-  model load (one-time ue4m3 per-block scale-search), GGUF unchanged. **Worse choice:** needs new
-  load-time quant code (MEDIUM), and it *silently* changes the output of an existing GGUF for current
-  users (an implicit, non-opt-in precision change) - strictly inferior to an explicit new artifact.
-  Only attractive if shipping a new GGUF is somehow impossible; it is not.
-
-## (2) PRECISION / KL story (honest)
-
-Quantizing the projection WEIGHTS bf16 -> NVFP4 (e2m1 + per-16 ue4m3 scale) is a per-weight precision
-**downgrade vs the current bf16** on those specific tensors (it adds ~4-bit weight-quant error), and -
-because they route to the W4A4 MMQ path - it also FP4-quantizes those projections' activations. It is
-NOT a precision upgrade over bf16; it is the **same W4A4/W4A16-class move vLLM already makes on these
-same projections**, so at matched precision it is apples-to-apples with vLLM. Non-bit-exact => **re-gate
-on KL-to-f32, not md5.**
-
-**KL estimate: should PASS with margin.** Three independent reasons: (a) the dense GGUF ALREADY ships
-these GDN/attn projections in NVFP4 and passes its greedy gate (`5951a5b4...`), so the move is
-empirically proven shippable on this architecture; (b) the 256 experts already run W4A4 NVFP4 and pass
-(test-backend-ops MUL_MAT_ID 806/806, greedy md5 stable) - the GDN/attn projections are the same class of
-linear op and arguably less sensitive than the expert FFN; (c) this is a per-step, **non-accumulating**
-weight-quant error - structurally unlike the bf16-GDN-*state* experiment (`BF16_SSM_STATE_RESULTS.md`)
-that FAILED the KL gate (KLD 0.06-0.17, ~10% argmax flips) because that error *accumulated* through the
-recurrence. Expect KLD-to-f32 well under that failed-state threshold and PPL delta sub-percent (the
-broader NVFP4-dense ~+4.8% PPL-vs-Q4_K figure is for full-model NVFP4; here only a minority of residual
-projection tensors move). **The one genuinely risky tensor is lm_head** (logit-direct;
-`OTHER_PATHS_INVESTIGATION.md` flags that NVFP4 lm_head can flip the greedy argmax). For the MoE, quantizing lm_head is
-*fair* (vLLM's modelopt checkpoint already runs lm_head NVFP4), so include it but gate it explicitly on
-argmax-agreement; if it flips the greedy probe, keep lm_head bf16 and bank only the GDN/attn portion.
-Recommended gate: **KLD-to-f32 < the bf16-state failure floor (~0.06) AND PPL delta < ~1% vs the current
-bf16-projection GGUF AND zero greedy-argmax flips on the -n 48 probe.**
-
-## (3) EXPECTED MoE GAIN
-
-Closing the +6.5 ms projection bucket = bringing llama's 20.3 ms projection bucket down to vLLM's
-~13.8 ms (NVFP4 cuts the projection weight-read ~4x - 2.37 GB-class bf16 -> ~0.56 B/wt - and the W4A4
-MMQ path stays in the quantized domain, **erasing the 2.9 ms `convert_unary` bf16<->f32 glue**). llama's
-native FP4-MMA is faster per-FLOP than vLLM's W4A16-Marlin and these projections are BW-bound, so llama
-lands at parity-or-slightly-better, same as the expert GEMM (where W4A4 beat Marlin by 1.7 ms).
-
- With **lm_head also NVFP4** (fair on this modelopt MoE, vLLM did it): full ~**+6.5 ms** =>
-  step 169.8 -> ~163.3 ms => ~785 t/s.
- With **lm_head kept bf16** (conservative): ~**+4 to +5 ms** (the GDN/attn projections + the convert
-  glue; lm_head's bf16 GEMM stays) => step 169.8 -> ~165-166 ms => ~768-775 t/s.
-
-In MOE_GAP frame (vLLM 142.0 ms / 901 t/s-equiv): **MoE moves from 86.3% (post-lever-1 / 0028) toward
-~89-91% of vLLM** (full bucket) or ~88% (lm_head bf16). This is the **largest single banked MoE gain
-available** - lever 1 (gather) shipped, levers 2 and 3 banked nothing, and the MoE GEMM is already a
-llama win - so after lever 4 the residual is just router/glue + the structural cross-stream-overlap and
-the ~4.2 ms host bubble (reachable only via a paged-attn host-pipeline edit, not a quant or graph knob).
-
-## (4) RECOMMENDATION: ship as a SEPARATE OPT-IN gallery GGUF variant (KL-gated), not a re-gated default
-
-**Ship lever 4 as a distinct, opt-in gallery variant** (e.g. `q36-35b-a3b-nvfp4-projq` / `-w4a4full`),
-**not** as a silent replacement of the default MoE GGUF. Rationale:
-
-1. The current default MoE GGUF is **md5-bit-exact-gated** (`07db32c2...` shipped); making lever 4 the default
-   forces a permanent md5 rebaseline of every gate - the hard line this whole track has held (levers 2+3
-   STOPPED rather than cross it). A new artifact sidesteps that for users who chose the f32-lineage GGUF.
-2. Path A produces a **new GGUF anyway** (offline re-quant), so a separate gallery entry costs nothing
-   extra and makes the throughput<->precision choice explicit and reversible.
-3. The gain (~+4-6.5 ms, ~86% -> ~88-91% of vLLM) is real but modest - not worth forcing a precision
-   change on default-path users.
-4. **Promotion path:** because lever 4 only brings the MoE GGUF to the SAME NVFP4 the dense GGUF already
-   ships *as its default* and that vLLM already runs, a clean KL gate (KLD << 0.06, PPL delta < ~0.5%,
-   zero argmax flips) is a strong case to PROMOTE the variant to the default MoE GGUF in a later release.
-   Ship opt-in first to preserve the bit-exact default and avoid a forced rebaseline; promote if the
-   gate is clean and lm_head NVFP4 holds.
-
-**Effort summary:** LOW-MEDIUM, dominated by the KL gate + gallery wiring, NOT code (zero new kernel; the
-NVFP4 weight path - loader sidecar scales + tuned `mul_mat_q<NVFP4>` - is already in tree and proven by
-the dense GGUF). Highest-ROI remaining MoE lever. **Do first among remaining MoE work**, ahead of any
-non-bit-exact recurrence-plumbing or the rejected W4A16/Marlin GEMM.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
-> **SUPERSEDED:** the lever-4 scope above was optimistic and PRE-GATE. The L4 KL gate FAILED
-> (+6.15-6.51% PPL, see `LEVER4_PROJNVFP4_RESULTS.md`) and the premise was wrong (vLLM keeps these
-> projections BF16 too). Lever 4 is REJECTED - do NOT ship. See the FINAL section below.
-
---
-
-# RESIDUAL-ASSESS (FINAL, concludes the hunt) - convert-glue + bf16-GEMM verdicts, the bit-exact MoE ceiling
-
-Label `residual-assess`, DGX GB10 (sm_121). After lever 1 shipped (0028, MoE 86.3% of vLLM @npl128,
-bit-exact), levers 2+3 flat, lever 4 REJECTED (KL-gate FAIL, AND vLLM keeps the same projections bf16),
-and lever 5 flat for MoE (host-side, off the compute-bound critical path; dense gets +0.41%), this is the
-final honest assessment of the two remaining sub-levers inside the 20.3-vs-13.8 ms projection bucket.
-Both are **bit-CHANGING or at-the-BW-floor.** The hunt is DONE.
-
-## CORRECTION that reframes the projection bucket
-
-The body above assumed **vLLM runs the GDN/attn projections as NVFP4-Marlin.** FALSE (confirmed by the L4
-gate). vLLM runs the **same nvidia-modelopt checkpoint** as the GGUF, which keeps `in_proj_qkvz`,
-`in_proj_ba`, `out_proj`, `attn_gate`, and full-attn `attn_q/k/v/output` in **BF16**. llama and vLLM run
-these projections at the **same precision (bf16).** The +6.5 ms projection-bucket delta is therefore NOT
-a precision/quant gap - it is (a) llama's f32-residual-stream convert tax and (b) bf16-GEMM kernel /
-round-trip efficiency, both at matched bf16 precision.
-
-## (1) convert-glue verdict (3.24 ms/step measured): NOT bit-exact eliminable
-
-Empirical split (`moe_dec` nsys, per-step over 43 decode steps):
- `convert_unary<float,bf16>` (input, f32 act -> bf16): **1.73 ms/step**, 186 calls/step
- `convert_unary<bf16,float>` (output, bf16 -> f32): **1.52 ms/step**, 186 calls/step (equal count = every
-  bf16 projection round-trips)
-
-Source root cause (`ggml/src/ggml-cuda/ggml-cuda.cu:1663-1690`, the `src0->type == BF16` cuBLAS path):
-ggml converts f32 activations to bf16, runs `cublasGemmEx` bf16xbf16 with **CUBLAS_COMPUTE_32F** but
-writes the result to a **bf16** buffer (`dst_bf16`, `CUDA_R_16BF`), then widens bf16 -> f32. The f32
-accumulator is **rounded to bf16 and then widened back** - it drops 16 mantissa bits (23 -> 7), and that
-bf16-rounded value feeds the f32 residual stream.
-
- The **output round-trip is load-bearing for the shipped numerics.** The fp16-fp32-compute path 40 lines
-  down (`:1729`, `dst CUDA_R_32F`) proves cuBLAS CAN write the f32 accumulator directly - so the bf16
-  output write+convert is a removable ggml inefficiency. BUT removing it (f32-direct output) changes the
-  value from "bf16-rounded" to "full-f32" => greedy md5 (`07db32c2`) re-baselines. It is a **precision
-  boundary (here an upgrade), just as lever 4 was one (a downgrade).** NOT bit-exact.
- The **input convert is intrinsic** to a bf16 GEMM (cuBLAS needs bf16 inputs; ggml's residual stream is
-  f32). The only bit-exact move is to fuse the f32->bf16 cast into the producing op's epilogue (same RNE
-  rounding, one fewer launch) - but that is per-site ggml graph surgery for a sub-1.7 ms launch ceiling,
-  and it is **subsumed by the (rejected) lever-4 move**: NVFP4-quantizing the weights routes the
-  projection to `mul_mat_q<NVFP4>` (W4A4) and deletes the entire bf16 cuBLAS path - input convert, GEMM,
-  output convert - in one shot.
- vLLM pays ~0 here because it runs an **end-to-end bf16 residual stream** (no f32 intermediate). Matching
-  that = converting llama's residual stream to bf16 = a global precision change, md5 rebaseline. Also not
-  bit-exact.
-
-**Verdict: bit-exact-eliminable = NO.** The f32<->bf16 round-trip is load-bearing for the current md5 (the
-bf16-rounded output IS the shipped value). Every way to remove it (f32-direct GEMM output, bf16 residual
-stream, or NVFP4 weights) is bit-changing. The one bit-exact sliver (fuse the input cast into the
-producer) is ~1.7 ms ceiling, high per-site effort, and redundant with lever 4. (Aside: the f32-direct
-GEMM output is a genuine upstreamable ggml win - faster AND more precise - but it rebaselines md5, so it
-is off the bit-exact table for this hunt.)
-
-## (2) bf16 projection GEMM verdict (17.27 ms/step measured): BW-bound at the floor, no kernel lever
-
-Per-step bf16-projection GEMM (nvjet cuBLASLt + cutlass bf16, `moe_dec` nsys): **17.27 ms/step, 225
-calls/step.** Roofline at the M=128 decode shape:
- Arithmetic intensity ~= 2*M FLOP / 2 bytes-per-weight = **M = 128 FLOP/byte** (the weight read
-  dominates; activations/output negligible at M=128).
- GB10: LPDDR5x unified BW ~= **273 GB/s**; bf16 tensor-core peak >= ~250 TFLOPS => ridge point ~=
-  250e12 / 273e9 ~= **>900 FLOP/byte.** 128 << 900 => **memory-bandwidth-bound by ~7x.**
- Achieved: 17.27 ms at 273 GB/s = **~4.7 GB of bf16 projection weights streamed per step** - i.e. the
-  GEMM moves the weight bytes at ~full LPDDR5x bandwidth. **It is at the BW floor.**
-
-The nvjet kernels are `tmaAB` (TMA-streamed on both operands) - the optimal Blackwell weight-streaming
-access pattern; vLLM's cutlass does the same and reads the **same bf16 bytes.** A cutlass swap cannot beat
-the byte floor. The only way faster is **fewer weight bytes = quantize** (lever 4, ~4x fewer bytes) -
-bit-changing AND rejected on quality (+6% PPL) AND not even a vLLM-parity gap. The residual ~3.5 ms of the
-llama-vs-vLLM GEMM-bucket delta traces to llama's extra `dst_bf16` write+read round-trip traffic (the
-convert glue of verdict 1), not a worse GEMM kernel.
-
-**Verdict: at the bandwidth floor; no bit-exact (nor even same-precision) kernel lever exists.** nvjet
-already streams the weights near-optimally.
-
-## (3) The bit-exact MoE ceiling, and the irreducible residual
-
-| MoE lever | status | bit-exact? | MoE gain |
-|-----------|--------|:----------:|----------|
-| 1 - recurrent-state gather fusion (0028) | **SHIPPED** | yes | banked -> 86.3% of vLLM |
-| 2 - graph coverage / overlap | flat | yes | ~0 |
-| 3 - act-quant fusion | flat | yes | ~0 |
-| 5 - block-table within-step cache | flat for MoE | yes | ~0 (host off compute-bound path; dense +0.41%) |
-| 4 - NVFP4 projections | REJECTED | no | +6% PPL, not a vLLM gap |
-| convert-glue elimination | this assess | **no** (precision boundary) | bit-changing only |
-| bf16-GEMM kernel | this assess | **no** (BW floor) | none |
-
-**Realistic bit-exact MoE ceiling = ~86-88% of vLLM @npl128. The shipped state (lever 1, 86.3%) is
-essentially AT it.** Lever 5 adds nothing to MoE. No clean bit-exact MoE lever remains.
-
-**The irreducible ~12-14% residual to vLLM is structural, not a missing optimization:**
-1. **f32-residual-stream convert tax (~3.2 ms/step)** - ggml runs an f32 graph and casts per bf16
-   projection; vLLM runs bf16 end-to-end. Removing it is a precision change.
-2. **bf16-GEMM BW floor + round-trip traffic (~3.5 ms/step)** - both engines at the LPDDR5x byte floor on
-   bf16 weights; the delta is the round-trip traffic (= item 1, bit-changing).
-3. **Recurrence-plumbing remainder** - mostly banked by lever 1; the core SSM kernel is already a llama
-   win.
-4. **Between-replay host loop + graph/overlap bubble** - sampling needs logits between graph replays;
-   irreducible at this batch shape.
-
-## CONCLUSION: the MoE-parity hunt is DONE
-
-The MoE is at its bit-exact ceiling. The two heaviest MoE compute kernels (the gated-DeltaNet SSM core and
-the NVFP4 expert grouped GEMM) are **already llama wins**, so there is no arithmetic gap to close. The
-remaining 12-14% is the f32-vs-bf16 graph-precision tax, the bf16-weight BW floor, and the irreducible
-host loop - none of which is a clean bit-exact lever, and the one bit-changing option (quantize the
-projections) is rejected on quality and is not even a vLLM-parity gap. **No one-more-lever for MoE.** The
-only clean win left in the whole track is DENSE (+0.41% from lever 5), gated behind first resolving the
-pre-existing paged-MoE baseline md5 drift (paged `8cb0ce23` vs canonical `07db32c2`) the L5 finish flagged.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/MOE_GROUPED_GEMM_SCOPE.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_GROUPED_GEMM_SCOPE.md
@@ -1,220 +0,0 @@
-# Durable scope: grouped FP4-MMA MoE GEMM for ggml CUDA on GB10 (sm_121)
-
-Build-ready plan. **Not implemented in this workflow** (large kernel work). This
-document scopes the durable path to match or beat vLLM MoE grouped-GEMM efficiency
-on GB10 for the Qwen3-30B-A3B-class mxfp4 MoE, and records the single honest
-finding that re-shapes the whole effort.
-
-Hardware: NVIDIA GB10 (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
-LPDDR5X ~273 GB/s. Model: Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
-(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`,
-HEAD at patch 0013), `build-cuda` sm_121.
-
-## TL;DR (the honest reframe)
-
-**The grouped GEMM the mission scoped to build from scratch already exists in
-upstream ggml, and it already runs on GB10 for mxfp4.** For mxfp4 experts on
-sm_121 `ggml_cuda_should_use_mmq()` returns true (`turing_mma_available`), so
-MUL_MAT_ID takes the **grouped mmq path**, which already contains both vLLM
-building blocks:
-
-1. a moe_align / token-sort-by-expert (`mmid.cu` `mm_ids_helper`:
-   count -> warp-scan/cumsum -> scatter into expert-sorted contiguous buffers),
-2. a **single persistent stream-k grouped FP4-MMA GEMM** (one `mul_mat_q` launch;
-   grid flattened into kbc-continuous space over expert x col-tile x row-tile x
-   k-block; native FP4 MMA via `block_fp4_mmq` under `BLACKWELL_MMA_AVAILABLE`).
-
-The per-expert host-side row-gather loop in `ggml-cuda.cu`
-`ggml_cuda_mul_mat_id()` (~L2632-2790) - the path the mission's root-cause
-analysis describes as "the cliff" - is a **fallback only reached when
-`should_use_mmq()==false`** (f16/bf16 experts, non-Blackwell). It is **never the
-GB10 mxfp4 path.**
-
-Consequence: the "npl128 MoE cliff" does not exist on the current dev HEAD.
-Re-measured batched-bench decode (`S_TG` t/s) on the mxfp4 MoE rises monotonically
-`85 / 278 / 637 / 950 / 1306 / 1771` at npl `1 / 8 / 32 / 64 / 128 / 256`. The
-original `253/505/830/620` cliff was a real high-batch regression that has since
-been **fixed upstream** (FP4-native grouped mmq + MoE stream-k balancing), not a
-batched-bench artifact.
-
-**Therefore the durable work is NOT "port moe_align + a grouped GEMM."** It is a
-**surgical fix to the one place ggml diverges from vLLM: the M-tile (token-tile)
-sizing heuristic.** This document scopes that delta, plus the optional
-block-padded align, plus the parity gate and phased plan. It also records what is
-intentionally NOT built and why (the W4A16 occupancy wall).
-
-## The one structural gap: M-tile sizing
-
-`mul_mat_q_case` / `launch_mul_mat_q` pick `mmq_x` (the token/M tile) by
-**minimizing** `ntiles_x = ceil(ncols_max / mmq_x)` over the **aggregate** token
-count (`ncols_max = ne12`). On Blackwell `get_mmq_x_max = 128`, so the heuristic
-always selects the **largest** `mmq_x` that fits shared memory. vLLM's
-CUTLASS/Triton fused_moe does the **opposite**: a small tuned `BLOCK_SIZE_M`
-(typ. 16/32/64), padded **per expert**.
-
-ggml then applies its over-large `mmq_x` **per expert**. In MoE decode the token count
-per expert is tiny - Qwen3-30B-A3B top-8 of 128: at npl64 ~512 assignments over
-~126 activated experts ~= 4 tok/expert; at npl128 ~1024 over ~128 ~= 8 tok/expert.
-So each expert's single M-tile of width 128 is **3-6% filled** -> ragged tiny-M
-tiles run a dense-GEMM-tuned config, wasting MMA M-throughput, and (with
-`need_check`) every expert runs as a masked partial tail.
-
-The FP4 MMA N-fragment (`tile_C::J`) is 8, so the **ideal M-tile ~= tokens/expert
-(~8)**, 16x smaller than the 128 ggml picks. This mismatch is the durable gap.
-
-Critically for GB10: at tokens/expert <= 8 there is exactly **one col-tile per
-expert**, so a smaller `mmq_x` causes **no extra weight re-read** (weight rows are
-re-read only across multiple col-tiles, of which there is one) while it **lowers
-shared-mem footprint and raises occupancy** - strictly aligned with the GB10
-occupancy lessons.
-
-## What already exists (reuse, do NOT rebuild)
-
-Engine files on DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`:
-
- **[A] moe_align / scatter** = `mmid.cu` `mm_ids_helper`. One CUDA block per
-  expert (`gridDim.x = n_experts`); warp counts tokens routed to this expert,
-  warp-scan for the compaction index, scatters into `ids_src1` (column gather
-  permutation, expert-sorted contiguous), `ids_dst` (output scatter), and writes
-  `expert_bounds[expert] = prefix start`, `expert_bounds[n_experts] = total`.
-  This **is** count -> cumsum -> permute; `expert_bounds` is the analogue of
-  vLLM's `num_tokens_post_padded` boundaries. No `-1` pad today because segments
-  are exact (not block-padded).
- **[B] persistent grouped FP4 GEMM** = `mmq.cuh` `mul_mat_q` stream-k
-  (kernel ~L3542, `process_tile` ~L3447, launch ~L3943, case-select ~L4055).
-  Single launch, fixed grid (`nsm` CTAs, or `ntiles` when >=90% tile efficiency).
-  Each CTA walks a contiguous `kbc` slice of (expert `zt` via `expert_bounds`,
-  col-tile `jt`, row-tile `it`, k-block) space; the weight row-tile (`mmq_y=128`
-  x K) is loaded once per col-tile in the `process_tile` k-loop; empty col-tiles
-  past `col_diff` are SKIPPED by advancing `kbc += blocks_per_ne00`; a
-  `stream_k_fixup` pass recombines split tiles.
- **[C] native FP4-MMA expert weights** = `block_fp4_mmq` + `MMQ_MMA_TILE_X_K_FP4`
-  (== Q8_1 tile, skew-pad +4) under `BLACKWELL_MMA_AVAILABLE`;
-  `quantize_mmq_fp4_cuda` quantizes activations to the q8-style y-layout **with
-  the `ids_src1` gather fused** (one pass, no separate row-copy).
-
-Dispatch seam: `ggml-cuda.cu` `ggml_cuda_mul_mat_id()` (~L2632-2790). For mxfp4
-with `ne2`(tokens) > 7, `should_use_mmq()` -> true -> `ggml_cuda_mul_mat_q()`
-(`mmq.cu` id-branch ~L162-225) -> `mm_ids_helper` then ONE
-`mul_mat_q_switch_type`. The per-expert host loop below it is the gated fallback.
-
-(Below npl8, MXFP4 mmid routes through `mmvq` - `MMVQ_MAX_BATCH_SIZE=8`, mmid max
-7 for turing_plus - which is fine for thin batch and out of scope here.)
-
-## What to add (the durable delta, priority order)
-
-### [1] Expert-aware M-tile selection (host-side only, zero new kernel)
-
-In `mul_mat_q_case` / `launch_mul_mat_q`, when `ids != null`, choose `mmq_x` from
-**per-expert density** (~`ne_get_rows / n_active_experts`, derivable cheaply, or
-capped via env) instead of minimizing `ntiles` over aggregate `ncols_max`.
-
- `mmq_x` is a **compile-time template** (switch 8..128 step 8), so this is a pure
-  host-side SELECTION change - it picks a different already-compiled instantiation.
-  **Zero new kernel. Very low risk, high leverage.** Matches vLLM `BLOCK_SIZE_M`.
- Doubles as near-term lever-1: env-gated `LLAMA_MOE_MMQ_X` cap at the knee.
- GB10-aligned: smaller `mmq_x` -> smaller shared mem -> higher occupancy, and at
-  tokens/expert <= 8 (one col-tile/expert) it costs no extra weight read.
-
-This is the single highest-leverage change and the seed of the durable port.
-
-### [2] Block-padded moe_align (the moe_align_block_size port proper)
-
-Extend `mm_ids_helper` to pad each expert segment up to a multiple of the chosen
-block: write a sentinel (`-1`) `ids_dst` for pad lanes, put `expert_bounds` on
-block boundaries. Then every col-tile is **full**, which:
-
- drops the `need_check` masking + per-expert partial-tail MMA,
- makes the stream-k `kbc` space exact (no skipped tiles, cleaner persistent
-  schedule), removing the `col_diff` skip branch.
-
-Medium risk: touches the scatter, the `col_diff`/`need_check` logic, and the
-`write_back` masking (pad rows must not write output). This is the proper
-`moe_align_block_size` analogue and the durable second step.
-
-### [3] Bespoke masked-grouped FP4 kernel - ONLY if [1]+[2] insufficient
-
-A CUTLASS/DeepGEMM-style masked-grouped FP4 kernel. **Largest risk, likely
-unnecessary** given [B] is already a persistent stream-k grouped GEMM. Listed for
-completeness; do not start without [1]+[2] measured as insufficient.
-
-## Integration into ggml_mul_mat_id (dispatch seam + gated fallback)
-
- The seam is unchanged: `ggml_cuda_mul_mat_id()` -> `should_use_mmq()` ->
-  `ggml_cuda_mul_mat_q()`. [1] and [2] live entirely inside the mmq id-branch
-  (`mmq.cu` ~L162-225) and its callees (`mmq.cuh` selection/launch, `mmid.cu`
-  scatter). No change to the host dispatch decision.
- **Gated fallback preserved**: the existing per-expert host loop
-  (`should_use_mmq()==false` path) stays as-is for f16/bf16 experts and
-  non-Blackwell GPUs. The new selection only fires on the grouped path.
- **Env gates** (off = exact current behavior):
-  - `LLAMA_MOE_MMQ_X=<8..128>` - cap/override the token tile for the id-path
-    (lever-1 + [1] manual knob).
-  - `LLAMA_MOE_BLOCK_ALIGN=0|1` - enable block-padded scatter ([2]).
-  Default both off until parity + throughput proven, then flip [1]'s
-  auto-selection on by default.
-
-## Correctness / parity gate
-
-Primary: `tests/test-backend-ops.cpp` `test_mul_mat_id` (~L4181). The CPU
-reference is **deterministic** - the op test must be **bit-exact**.
-
- Sweep `type_a` in {`MXFP4`, `NVFP4`}, `type_b = F32`, `n_mats = 128`,
-  `n_expert_used = 8`, `n_tokens` in {8, 32, 64, 128} (the decode-density band).
- **Add ragged small-M shapes** to the harness if absent (n_tokens not a multiple
-  of mmq_x; experts with 0/1/2 tokens) - these are exactly where [1]/[2] change
-  tile geometry and where block-pad masking can leak.
- Pass criterion: new `mmq_x` selection and padded-align produce dst **identical**
-  to current op-test output (op test is exact; the GB10 CUDA greedy-decode
-  non-determinism band applies only to end-to-end, never to the op test).
- End-to-end sanity: `llama-batched-bench` on `~/bench/qwen3coder-mxfp4.gguf`,
-  `-fa on -npp 128 -ntg 128`, npl 8/32/64/128/256; confirm `S_TG` stays monotonic
-  and `S_PP` flat ~3050-3090. Verify greedy-decode output within the documented
-  CUDA batch-shape non-determinism band (CPU is the deterministic oracle).
-
-Bench/parity scripts stay **dev-tree-only** (`~/llama-paged-dev/benches/`).
-
-## Phased plan, expected payoff, risk per phase
-
-| Phase | Work | Expected payoff | Risk |
-|-------|------|-----------------|------|
-| **P0** harness | Add ragged small-M + MXFP4/NVFP4 mmid shapes to `test_mul_mat_id`; capture current bit-exact baseline + the monotonic batched-bench curve as the reference. | None (gate). Locks correctness + the 85->1771 t/s baseline so any regression is caught. | Low. |
-| **P1** sort op | Confirm `mm_ids_helper` is the moe_align; if [2] is pursued, prototype the block-pad scatter behind `LLAMA_MOE_BLOCK_ALIGN`. | Enables exact stream-k schedule; removes `need_check` masking (P3 payoff). | Medium (scatter + write-back masking). |
-| **P2** grouped GEMM ([1]) | Expert-aware `mmq_x` selection in `mul_mat_q_case`/launch, `LLAMA_MOE_MMQ_X` gate. | The headline: reclaim the 3-6% M-tile fill waste at npl64-128. Modeled as removing wasted MMA M-throughput on every activated expert; net throughput up at high batch with no extra weight read. | **Low** (host-side template selection, no new kernel). |
-| **P3** tune ([2] + fixup) | Land block-padded align; tune `mmq_x` per density, profile stream-k `fixup` overhead and `mmq_x`/`mmq_y` tile choice with nsys on the grouped `mul_mat_q<MXFP4>` kernel. | Remove per-expert partial-tail MMA; tighten the persistent schedule. Diminishing vs P2; this is pure micro-efficiency toward/past vLLM's saturated grouped-GEMM. | Medium-high (kernel masking paths). |
-
-**Honest payoff framing:** the npl128 "cliff" is already gone on HEAD, so there is
-no broken path to unlock. The durable win is **matching vLLM's saturated
-grouped-GEMM M-tiling** (small per-expert block) and erasing the dense-GEMM-tuned
-M-tile mismatch - a micro-efficiency gain at large effective batch, not a
-step-change. vLLM 0.23.0 cannot even serve this model on GB10 (bf16 MoE-warmup
-hang + hard reboot; GGUF loader can't map fused qwen3moe experts), and llama
-already uses the same sorted-grouped-GEMM algorithm, so structural parity is
-**already met**; this closes the residual kernel micro-gap.
-
-## The biggest risk: the GB10 W4A16 occupancy wall
-
-The dominant risk is **repeating the W4A16 dead-end** that hit only ~9 TFLOPS /
-178 t/s on GB10. GB10 is **occupancy-dominated**: deep `cp.async` pipelines and
-XOR-swizzle shared layouts **collapse occupancy** there. Any P3 kernel work MUST:
-
- keep **small shared mem + high occupancy** (do NOT add deep `cp.async` stages
-  or XOR-swizzle - they are exactly what killed W4A16);
- preserve the **skew-pad (+4)** tile layout already in `MMQ_MMA_TILE_X_K_FP4`;
- stay on the **FP4-MMA path** (`block_fp4_mmq`), the only path that hits Blackwell
-  FP4 = 2x INT8/BF16 rate;
- respect the ~273 GB/s LPDDR5X weight-read floor (dense decode is already at it;
-  MoE wins come from occupancy/tile fit, not bandwidth).
-
-Smaller `mmq_x` ([1]) is **strictly consistent** with these lessons: it reduces
-shared-mem footprint, raises occupancy, and at tokens/expert <= 8 adds no weight
-re-read. So the low-risk lever ([1]) is also the one most aligned with what GB10
-rewards - which is why it leads the plan and [3] is gated behind it.
-
-## Commit / hygiene
-
-Scope doc only (this file). No engine change committed in this workflow. Bench and
-parity scripts are dev-tree-only. Commit with `git commit -s`, trailer
-`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes.
-Do not push (human pushes). When [1]/[2] are implemented they mirror to
-`backend/cpp/llama-cpp/patches/paged/0014-*` (next free slot).
--- a/backend/cpp/llama-cpp/patches/paged/MOE_QUANT_DEDUP_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_QUANT_DEDUP_RESULTS.md
@@ -1,71 +0,0 @@
-# MOE_QUANT_DEDUP_RESULTS.md - patch 0023 (qwen35moe NVFP4 activation-quantize de-dup)
-
-Bit-exact MoE decode/prefill lever. Built + measured on DGX GB10 (sm_121a) on top of HEAD
-8a3229f (patch 0022). Companion analysis: NONRECURRENCE_BITEXACT.md (section "nonrec-build").
-
-## What
-
-ggml `mul_mat_id` quantizes the EXPERT-GATHERED activation rows: it allocates
-`ne11_flat = ne12 * n_expert_used` rows and quantizes each via `quantize_mmq_nvfp4(..., ids_src1)`.
-For the broadcast up/gate projections the activation is the per-token hidden state, the SAME for
-every expert that token routes to (`ne11 == 1`). So the stock path re-quantizes each token
-`n_expert_used` times (4x for q36-35b-a3b).
-
-`quantize_mmq_nvfp4` computes each `block_fp4_mmq` as a pure per-thread function of its 16
-consecutive inputs (per-thread amax, the +/-2 ue4m3 search, the e2m1 packing - NO cross-thread
-shfl/reduction). So the quantized block for a given token is byte-identical no matter which
-expert slot it lands in.
-
-## Lever
-
-When `ne11 == 1` (broadcast up/gate):
-1. Quantize the `ne12` UNIQUE token activations once into a compact buffer
-   (`quantize_mmq_fp4_cuda(src1_d, nullptr, ..., ne12, 1, 1)`, row stride `s12`).
-2. Gather the `block_fp4_mmq` rows into the expert-gathered layout keyed by `ids_src1`
-   (`gather_mmq_fp4`): `block_fp4_mmq == 9 * uint4 == 144 B`, copied with a coalesced uint4
-   kernel whose output is written fully contiguously (`gathered[t] = unique[ib_u*9 + w]`).
-
-Pure byte copy of identical blocks => the gathered buffer is byte-for-byte identical to
-re-quantizing each gathered row. The MMQ GEMM is UNTOUCHED. `down_proj`
-(`ne11 == n_expert_used`, distinct per expert) keeps the stock re-quantize path.
-
-The first gather draft (one thread copies one 144 B struct, scattered) was uncoalesced and cost
-478 ms - it ate 84% of the quantize saving and decode stayed flat. The shipped coalesced-uint4
-gather costs 32 ms.
-
-## Measurements (MoE = q36-35b-a3b-nvfp4, dense = q36-27b-nvfp4; -fa on, -npp 128 -ntg 128)
-
-nsys decode-isolated (`--cuda-graph-trace=node`, npp8 ntg128 npl128), per-run kernel sums:
-| kernel                | dedup off | dedup on |
-|-----------------------|-----------|----------|
-| quantize_mmq_nvfp4    | 868 ms    | 457 ms   |
-| gather_mmq_fp4        | -         | 32 ms    |
-| net quantize path     | 868 ms    | 489 ms (-379 ms decode GPU-time) |
-| gated_delta_net (50%) | unchanged | unchanged |
-| mul_mat_q<NVFP4>      | unchanged | unchanged |
-
-Decode S_TG (t/s), back-to-back same-build A/B (default-on vs GGML_CUDA_MOE_QUANT_DEDUP=0):
-| model           | npl32 off->on    | npl128 off->on        |
-|-----------------|------------------|-----------------------|
-| MoE q36-35b-a3b | 440.3 -> 442.8 (+0.6%) | 745.2 -> 758.1 (+1.73%) |
-| dense q36-27b   | 207.4 -> 206.9 (flat)  | 373.28 -> 373.24 (byte-flat) |
-
-Prefill: MoE T_PP 7.69 -> 7.38 s (~ -4% time). Dense unaffected (no `mul_mat_id`).
-
-## Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022)
-
-| model             | md5 (default on)                 | == 0022 |
-|-------------------|----------------------------------|---------|
-| q36-27b-nvfp4     | 5951a5b4d624ce891e22ab5fca9bc439 | yes (dense untouched) |
-| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | yes (on == off == 0022) |
-
-test-backend-ops: MUL_MAT 1115/1115, MUL_MAT_ID 805/805 (default on).
-
-## Knob
-
-On by default. `GGML_CUDA_MOE_QUANT_DEDUP=0` restores the stock per-expert re-quantize path
-(byte-identical output, used as the A/B baseline).
-
-Commits: DGX dev tree f7409c2; worktree patch `0023-qwen35moe-nvfp4-quant-dedup.patch`.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
@@ -1,99 +0,0 @@
-# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)
-
-Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
-`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
-Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
-(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
-`build-cuda` sm_121.
-
-## Headline (honest): there is no npl128 cliff to erase on this build
-
-The mission premise was a 25% decode drop at npl128 (batched-bench 253/505/830/620
-@ npl 8/32/64/128). It does **not** reproduce. Stock decode is monotonic:
-
-```
-llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
-  npl        1     8    32    64   128   256
-  stock     85   282   629   935  1295  1779     <- monotonic, no knee
-```
-
-The old cliff was a real high-batch regression since fixed upstream: mxfp4 MoE
-decode on GB10 already takes the sorted grouped FP4-MMA GEMM (MUL_MAT_ID ->
-`ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper` moe_align/scatter + one
-persistent stream-k `mul_mat_q`), i.e. vLLM's algorithm. See
-`MOE_GROUPED_GEMM_SCOPE.md`.
-
-## What the knob does
-
-`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
-(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
-column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
-(top-8/128 => ~1/16 of `ne12`), so each expert's `mmq_x`-wide col-tile is only
-~6% filled: the MMA accumulator tile is `mmq_x`-wide at compile time and wastes
-throughput on the padding columns, and the larger y-tile lowers occupancy.
-
-`LLAMA_MOE_MMQ_X=<n>` caps `mmq_x` on the MUL_MAT_ID path only
-(`expert_bounds != nullptr`). It only lowers the selection-loop upper bound and
-still chooses from the same granularity/shared-memory-validated `mmq_x` set stock
-already uses for smaller batches - no new kernel configuration. Default
-(unset/<=0) = disabled => byte-identical to stock.
-
-## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)
-
-Decode throughput, S_TG t/s:
-
-```
-  npl     stock   cap16   cap32   cap64
-   1       85      85      85      85
-   8      282     280     282     282
-  32      629     623     629     628
-  64      935     915     949     934
- 128     1295    1204    1344    1357     <- cap64 +4.8% (cap16 -7%)
- 256     1779    1370    1723    1820     <- cap64 +2.3% (cap16 -23%)
-```
-
-Prefill throughput, S_PP t/s (the cost):
-
-```
-  npl     stock   cap16   cap32   cap64
- 128     3083    1817    2559    3038
- 256     3084    1818    2560    3046
-                 -41%    -17%    -1.3%
-```
-
-Reproducibility (interleaved off/cap64, two reps each):
-
-```
-  npl    off rep1/rep2   cap64 rep1/rep2
-  128    1300 / 1290     1357.5 / 1357.0
-  256    1786 / 1782     1826.3 / 1824.5
-```
-
-cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
-
-## Why 64 is the only value with a net benefit
-
-A 512-token prefill ubatch routes ~32 tokens/expert. cap16/cap32 force those into
-16/32-wide tiles, overflowing into extra col-tiles + weight re-reads -> prefill
-craters (-41% / -17%). cap64 still holds the prefill density in one tile (32 < 64)
-so prefill is near-neutral (-1.3%), while decode (~8 tokens/expert at npl128) gets
-the fuller, higher-occupancy tile.
-
-## Verdict
-
- Real but **modest** high-effective-batch DECODE micro-optimization
-  (+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
-  continuous batching already scales). Shipped as an opt-in, default-off knob;
-  recommended value 64 for decode-heavy high-concurrency deployments.
- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
-  for single-stream generation and stays coherent; thousands of capped MoE
-  matmuls at npl128/256 ran with no CUDA error / NaN.
-
-## Durable follow-up (scoped, not implemented)
-
-Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
-from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
-small tile while prefill keeps its large tile automatically (removes the ~1.3%
-prefill cost). Plus the block-padded `moe_align` in `mm_ids_helper`. See
-`MOE_GROUPED_GEMM_SCOPE.md`.
--- a/backend/cpp/llama-cpp/patches/paged/NONRECURRENCE_BITEXACT.md
+++ b/backend/cpp/llama-cpp/patches/paged/NONRECURRENCE_BITEXACT.md
@@ -1,323 +0,0 @@
-# NONRECURRENCE_BITEXACT.md - bit-exact non-recurrence decode levers (label nonrec-design, READ-ONLY, no GPU)
-
-Post-0022 the gated-DeltaNet recurrence is at 84.6% BW = 102.6% of vLLM (3.488 ms/call), past parity.
-The remaining ~5% to vLLM lives in the non-recurrence path. Per the node-level decode trace (nsys
-`--cuda-graph-trace=node`, clean build, q36-27b-nvfp4 dense, npl128) the decode step is ONE replayed
-CUDA graph, ALL kernels on a SINGLE stream (stream 14), strictly serial, 99.94% GPU-busy, 0.06% idle.
-That single-stream-99.94%-busy fact is load-bearing for everything below: there is NO overlap, so any
-kernel GPU-time genuinely removed (or any kernel folded away) cuts wall-clock 1:1; and conversely, if a
-"faster kernel" leaves wall-clock flat, then the kernel did NOT actually get faster at the decode shape.
-
-Post-recurrence-fix kernel mix of the ~367 ms decode step (was 380.4 pre-0022; recurrence now smaller):
- `mul_mat_q` FP4 GEMM (496 calls/step) ~24% (the biggest non-recurrence bucket)
- `quantize_mmq_nvfp4` (496/step) ~4.5%
- `nvjet` lm_head GEMM ~3.1%
- `flash_attn_ext_f16` (16 attn layers) ~3.1%
- elementwise glue: k_bin_bcast (gate mul+add) ~1.7%, unary_gated silu/sigmoid ~1.4%, rms_norm ~0.9%,
-  l2_norm ~0.2%, plus conv-state concat_cont/cpy (Lever-1 territory, not in this scope).
-
-Files read on the DGX 0022 tree (HEAD 8a3229f): `mmq.cuh`, `mmq.cu`, `quantize.cu`, `gated_delta_net.cu`,
-`fattn.cu`, `fattn-common.cuh`.
-
---
-
-## RESOLUTION of the P2a puzzle (load-bearing) - mmq_y=64 / minblocks: bit-exact but FLAT on decode
-
-The existing P2a machinery is two NVFP4-gated, default-stock flags in `mmq.cuh`:
- `GGML_CUDA_FP4_MMQ_Y` (L143-163): overrides the weight-row N-tile `mmq_y` 128 -> 64/96 for NVFP4 on
-  Blackwell. mmq_y tiles N (output rows); each weight row lives in exactly one row-tile, so total weight
-  traffic is unchanged. **Bit-exact**: the per-output K-reduction is the `for frag` loop in
-  `vec_dot_fp4_fp4_mma` (L1097-1108, `sum[...] += C.x[l]`), whose order is independent of mmq_y. md5-
-  verified in prior runs (1115/805 gate, byte-identical).
- `GGML_CUDA_FP4_MINBLOCKS` (L205-216): raises the `__launch_bounds__` min-blocks operand (L3579-3585)
-  for NVFP4 so >1 CTA co-resides per SM. **Bit-exact**: register allocation / occupancy cannot change
-  results.
-
-The paradox restated: P2a cut a standalone `mul_mat_q<NVFP4,m=128>` by 24.7% (bit-exact), yet
-decode was FLAT (335->336 post-0020). The trace says decode is 99.94% single-stream busy and mul_mat_q
-is ~24% of it, so a -24.7% cut should give ~+6%. RESOLUTION (airtight, from the single-stream fact):
-
-> On a 99.94%-busy single stream, freed kernel GPU-time MUST lower the wall 1:1. Decode is flat =>
-> mmq_y=64 did NOT free per-call GPU-time at the DECODE shapes => the -24.7% was measured at a
-> NON-decode shape (a single large-N or prefill-M GEMM that runs enough waves to reach asymptotic
-> throughput). There is no contradiction; the two measurements are at different GEMM shapes.
-
-Mechanism (grounded in the launch path, `launch_mul_mat_q` L3989-4088): decode runs ONE `mul_mat_q` per
-weight with mmq_x=128 fused tokens => ntx=1, and the grid is `nty = N / mmq_y` CTAs (xy-tiling, or
-stream-k at nsm=48 when `tiles_efficiency_percent < 90`, L4044-4047). The 496 decode GEMMs have small N:
- FFN up/gate N=17408 -> nty=136 CTAs (mmq_y=128) = ceil(136/48)=3 waves, last wave 40/48=83% full
- FFN down / qkv / o-proj N~5120-6144 -> nty=40-48 CTAs = 1 wave (and eff<90 => stream-k at 48 CTAs)
-
-So EVERY decode GEMM is a 1-3 wave, 40-136 CTA kernel: it is **ramp + tail (wave-quantization) bound**,
-dominated by the first-wave weight-load latency before any MMA can start plus the fractional last wave -
-NOT by steady-state occupancy. mmq_y=64 doubles the grid (272 CTAs, 6 waves for the fat FFN) which only
-helps the ASYMPTOTIC achieved-BW the microbench measures; at 1-3 waves there is no steady state for it
-to act over, and each CTA now carries half the arithmetic-per-weight-load so the ramp is relatively MORE
-exposed. minblocks=2 is worse: the FP4 MMA is register-bound at ~255 regs/thread (the `(256,1)` bound),
-so forcing 2 CTAs/SM register-caps to ~128 regs => heavy spill => net-negative. Both are the in-wave
-occupancy lever, and the decode GEMM has no in-wave occupancy problem - it has a too-few-waves problem.
-
-VERDICT: re-test P2a (mmq_y=64, and 96) and minblocks=2 ON TOP of 0022 because it is a FREE one-build
-re-test (flags already exist, default stock). **Design prediction: still ~flat (maybe +1-2% from the
-one fat-FFN N=17408 GEMM that has 3->6 waves of room; ~0% from the 1-wave thin GEMMs).** The decisive
-measurement for the reprofile agent is NOT a standalone microbench - it is the PER-CALL `mul_mat_q`
-GPU-time at the REAL decode shapes (the 496 calls), flag on vs off, summed. If per-call decode time
-drops, it ships (free bit-exact win). If per-call decode time is ~unchanged (predicted), the -24.7%
-was a large-N artifact and the GEMM has no bit-exact occupancy lever - confirming the structural wall.
-
-WHY the decode GEMM has no high-value bit-exact lever: its bottleneck is wave-quantization at a small
-grid. The only knobs that change the grid are (a) mmq_y-down [bit-exact, flat per above], (b) mmq_x-down
-[FORBIDDEN: re-reads the 18 GB weights ntiles_x times, strictly worse, and breaks the one-read invariant], (c) the
-stream-k-vs-tiling threshold [FORBIDDEN for bit-exactness: stream-k splits each output tile's K-sum
-across CTAs and re-adds via the fixup kernel - a DIFFERENT K-accumulation order than one-CTA-full-K
-tiling, so flipping the L4047 threshold changes which path a GEMM takes and breaks md5 vs the 0022
-baseline]. So at the bandwidth/wave-quant floor for these tiny grids, 3% FP4 efficiency is structural;
-no order-preserving change moves it.
-
---
-
-## RANKED bit-exact non-recurrence levers
-
-Ranked by expected bit-exact decode gain. "Bit-exact-safe" = keeps the exact reduction/FMA order; the
-gate is md5-identity to llama 0022 f32 output on both models (dense + MoE), greedy temp0.
-
-### 1. Quantize producer-fold (Track A) - bit-exact-safe - ceiling 4.5%, realistic ~2-2.5%
-Fold `quantize_mmq_nvfp4` (4.5%, ~17 ms, 496/step) into the PRODUCER epilogue (the rms_norm / silu that
-emits each GEMM's activation), so the f32 activation is quantized to `block_fp4_mmq` directly from the
-producer's registers instead of being written to HBM as f32 and re-read by a standalone quantize kernel.
- **Bit-exactness: SAFE, and unusually clean.** `quantize_mmq_nvfp4` (quantize.cu:78-171) computes
-  `amax_raw` PER-THREAD over the thread's own QK_NVFP4_SUB=16 values (L108-118) with NO cross-thread
-  shfl/reduction (unlike `quantize_mmq_q8_1` which does a warp shfl_xor). Each thread independently runs
-  the +/-2 ue4m3 scale search (L120-150) and `ggml_cuda_float_to_fp4_e2m1` packing (L155-166). So the
-  output block is a pure per-thread function of its 16 inputs. Copy that arithmetic VERBATIM into the
-  producer epilogue and the `block_fp4_mmq` bytes are identical => md5-safe. The only requirement is the
-  producer thread-layout owns contiguous 16-element K-sub-blocks (feasible for an rms_norm/silu epilogue).
- **Expected gain:** the win is removing the standalone kernel's f32 activation READ (the producer already
-  holds the f32); the quant compute + fp4 write still happen (now folded). So ~the read-half of the 17 ms,
-  ~2-2.5% of the step, and it is REAL because the step is single-stream 99.94% busy (no overlap to hide
-  the removed kernel).
- **Trap / caveat:** the SPENT "Lever-2" was a DIFFERENT fusion (quantize -> GEMM *consumer* prologue,
-  measured net-zero because the GEMM still reads the same activation bytes). Track A is the *producer*
-  fold and removes a true f32 round-trip, so it is not subject to that flatness - but it needs real
-  producer-kernel surgery + the frozen `block_fp4_mmq` ABI (mmq.cuh:53), more plumbing than the others.
- Ranked #1: largest cleanly-bit-exact non-GEMM bucket, no reduction trap (per-thread quant).
-
-### 2. Activation / op fold - POINTWISE subset only - bit-exact-safe - realistic ~1.5-2.5%
-Fold the pure pointwise glue off the single-stream chain into the adjacent kernel's epilogue/prologue:
-the GDN residual ADDs and gate MULs (`k_bin_bcast`, ~1.7%), the `silu`/`sigmoid` (`unary_gated`, ~1.4%,
-the part that is the output gate, not FFN), and the post-GDN gate MUL after the output rms_norm.
- **Bit-exactness: SAFE for the pointwise ops only.** Add/mul/silu/sigmoid are elementwise fp32 with the
-  same formula and the same op order whether standalone or folded => byte-identical. This is the bit-exact
-  half of the prior Lever-3 design.
- **THE TRAP (FORBIDDEN half):** the `rms_norm`/`l2_norm` REDUCTIONS must NOT be re-folded with a
-  different reduction tree. The standalone `l2_norm_f32<32>`/`rms_norm_f32` use a specific warp/block
-  reduction; folding the norm into a kernel with a different `warp_reduce_sum` width or eps placement
-  (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) changes the last ULP => breaks md5. Fold the MUL that
-  FOLLOWS the norm (pointwise, safe); do NOT fold the norm's reduction. (This is the direct analog of the
-  f32x4 lane-remap trap that blocked the recurrence's vectorized state loads: any change to a reduction's
-  grouping is forbidden.)
- **Expected gain:** ceiling ~3.3% (the Lever-3 slice), realistic ~1.5-2.5% once the norm reductions are
-  excluded. Real (single-stream, no overlap), bounded, lower plumbing than #1 (no new ABI).
- Ranked #2: smaller than #1 and the high-value pieces (norms) are off-limits.
-
-### 3. mul_mat_q occupancy retune (existing P2a: mmq_y=64/96, minblocks=2) - bit-exact-safe - ~FLAT
-See the P2a resolution above. Bit-exact-safe (N-tiling / register-cap preserve the K-reduction order;
-md5-verified). Design prediction FLAT on decode (decode GEMMs are 40-136 CTA, 1-3 wave, ramp/tail-bound;
-the -24.7% was an asymptotic large-N number). **Worth the one-build re-test only because it is free**
-(flags exist, default stock). Possible marginal +1-2% from the single N=17408 fat-FFN GEMM (3->6 waves).
-Measure PER-CALL decode-shape `mul_mat_q` time, not a microbench. Ranked #3: zero plumbing, but low/zero
-expected gain - it is the diagnostic that confirms the GEMM wall is structural, not a shippable lever.
-
-### 4. Attention occupancy (flash_attn_ext_f16) - NO bit-exact lever - NO-GO
-`flash_attn_ext_f16` is ~3.1% (11.67 ms, 16 attn layers), grid 48 CTAs = exactly ONE full wave on 48
-SMs (trace). There is no occupancy headroom (already 1 wave, perfectly filled, no tail) and no in-wave
-under-occupancy to fix. The only knobs that change the attention grid are split-KV / parallel_blocks /
-a different KV-tile (the `ncols1`/`ncols2`/`cols_per_block` selection in `fattn.cu`), and EVERY one of
-them changes the online-softmax running-max/sum RESCALING ORDER across KV blocks => NOT bit-exact
-(forbidden, the softmax-rescale analog of the reduction-tree trap). At 3.1% with one full wave the
-attention is effectively at floor. Ranked last: no bit-exact lever exists; do not pursue.
-
---
-
-## FORBIDDEN levers (require a precision or accumulation-order change - excluded by the gate)
- Stream-k vs plain-tiling threshold flip for the GEMM wave-quant tail: splits + re-adds the K-sum across
-  CTAs => different f32 accumulation order than one-CTA-full-K tiling => breaks md5.
- Vectorized / lane-remapped tile loads in the GEMM (`load_tiles_nvfp4_nvfp4` / `load_ldmatrix`): any
-  remap of which lane holds which K-element changes the MMA fragment->accumulator mapping => can change
-  the per-output sum grouping => forbidden (the f32x4 lane-remap trap, same class that blocked the
-  recurrence's vectorized state loads).
- mmq_x-down at dense decode: re-reads the 18 GB weights `ntiles_x` times. Order-preserving but strictly
-  slower and breaks the one-read invariant; not a lever.
- Folding rms_norm / l2_norm with a different reduction tree or eps placement: last-ULP change => md5 break.
- flash-attn split-KV / KV-retile: changes the online-softmax rescale order => not bit-exact.
- bf16 state / bf16 anything: precision change, SHELVED, forbidden by the gate.
-
---
-
-## One-line summary for the parent
-The remaining non-recurrence decode gap has NO single big bit-exact lever. The largest cleanly bit-exact
-win is the **quantize producer-fold (Track A, ~2-2.5%, the per-16 NVFP4 quant has no cross-thread
-reduction so it copies verbatim into the rms_norm/silu epilogue)**; second is the **pointwise activation
-fold (~1.5-2.5%, fold the residual adds / gate muls / silu but NOT the norm reductions)**; the
-**mul_mat_q occupancy retune (P2a mmq_y/minblocks) is bit-exact but predicted FLAT** (decode GEMMs are
-small-grid wave-quant/ramp-bound, so the -24.7% asymptotic number does not apply per-call - confirmed by
-the airtight single-stream-99.94%-busy logic, re-test only because the flag is free); and **attention has
-NO bit-exact lever** (already one full wave; every grid knob changes the softmax rescale order). The
-P2a puzzle is resolved: not a contradiction - the -24.7% and the flat decode are simply at different GEMM
-shapes (large-N asymptotic vs 1-3-wave decode per-call).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-# EMPIRICAL P2a RE-TEST ON 0022 (label reprofile-puzzle, GPU agent) - measured, build + bench + nsys
-
-The design section above PREDICTED P2a flat from the single-stream logic. This section is the GPU
-measurement that CONFIRMS it byte-for-byte, plus one load-bearing correction: an early "+11% decode"
-A/B was a STALE-BASELINE artifact, not the flag. Box: DGX GB10 (sm_121a), HEAD 8a3229f (patch 0022),
-SM+MEM clock pinned 2190 MHz (verified via `nvidia-smi dmon`, identical base vs flag - NOT a clock story).
-
-## (1) Fresh node-level decode decomposition (nsys --cuda-graph-trace=node, dense q36-27b-nvfp4, npl128)
-Per-instance trace windowed to one steady decode step (103 steady steps, step = 48 GDN-layer boundaries):
-
-  Committed-default build (build-cuda-base, 336 t/s @128) -- step span 383.1 ms, kernel-busy 99.24-99.30%:
-    gated_delta_net (SSM recurrence)   193.97 ms/step   51.0%   <- BINDING KERNEL
-    mul_mat_q<NVFP4,m=128,nc=0>         93.64 ms/step   24.6%   <- the P2a target
-    quantize_mmq_nvfp4                  16.77 ms/step    4.4%
-    nvjet (cublas lm_head GEMM)         12.07 ms/step    3.2%
-    flash_attn_ext_f16                  11.69 ms/step    3.1%
-    concat_cont 8.14 / cpy_scalar 7.49 / k_get_rows 7.29 / ssm_conv 6.55 / silu 5.32 / k_bin_bcast 4.67
-    mul_mat_q_stream_k_fixup 3.95 / rms_norm 3.56 / ... ; SUM 380.1 ms = 99.24% of the 383.1 ms wall.
-
-  conv-inplace + GDN(16,8) build (the 374 t/s state) -- step span 345.3 ms, kernel-busy 99.0%:
-    gated_delta_net 167.99 (49.2%), mul_mat_q<NVFP4,128,0> 93.79 (27.5%), quantize 17.66 (5.2%),
-    nvjet 12.05 (3.5%), flash_attn 11.66 (3.4%), ssm_conv(fused update) 8.44 (2.5%), k_get_rows 7.32 (2.1%).
-
-  BINDING KERNEL = gated_delta_net (~49-51% of the step) in BOTH; mul_mat_q<NVFP4,m=128> is #2 (~25-27.5%).
-  Decode is ~99.0-99.3% GPU-busy single-stream (in line with the earlier 99.94% figure; ~0 idle, strictly serial).
-
-## (2) P2a A/B - the -DGGML_CUDA_FP4_MMQ_Y=64 nwarps-remap, re-applied + built + bit-exact-gated on 0022
-The committed 0022 machinery was PARTIAL (patch 0017 templated get_mmq_y_device<type> but left
-mmq_get_nwarps_device() stock -> mmq_y=64 + nwarps=8 fails static_assert nwarps*tile_C::I==mmq_y at
-mmq.cuh:3280). Re-derived the full threading: templated mmq_get_nwarps_device<type>() -> mmq_y/16 (=4)
-for NVFP4+flag; type-aware mmq_get_nwarps_host(...,type); threaded <type> through the NVFP4 loader (998),
-write_back_mma (3266), process_tile (3500), mul_mat_q launch_bounds (3579/83/85) + body (3602),
-stream_k_fixup launch_bounds (3832) + body (3843), 2 host launch sites (3994/4172). Reverted after.
-
-  cuobjdump proof the flag took effect: mul_mat_q<NVFP4,m=128,nc=0> STACK 112 -> 56 (256-thr/8-warp CTA
-  -> 128-thr/4-warp CTA => 1 -> 2 resident CTAs/SM). REG 255 (HW-capped), no new spill.
-  BIT-EXACT GATE (HELD): test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805; greedy md5 base==flag
-  IDENTICAL = 5951a5b4d624ce891e22ab5fca9bc439 (matches the prior P2a gate hash). Byte-identical output.
-
-  CLEAN A/B (same build dir, ONLY mmq.cuh toggled => non-mmq .o byte-identical; back-to-back, pinned clocks)
-  S_TG t/s, llama-batched-bench -fa on -npp128 -ntg128:
-    DENSE q36-27b:   npl 32  208.02 -> 207.51 (-0.2%)   npl 128  374.30 -> 373.19 (-0.3%)   FLAT
-    MoE  q36-35b-a3b: npl 32  438.83 -> 439.30 (+0.1%)   npl 128  745.71 -> 745.07 (-0.1%)   FLAT
-  Prefill S_PP also flat at 0022 (npp128 1056->1050; npp2048/npl1 1028.85->1024.19).
-
-## (3) RESOLUTION - why FLAT, where the GEMM time goes, and a correction to the prior "-24.7%->+6%" logic
-Decode-isolated per-kernel A/B (node trace, same-source toggle, identical non-mmq code):
-    gated_delta_net          167.99 -> 167.89 ms/step  (IDENTICAL - it is byte-identical code, untouched)
-    mul_mat_q<NVFP4,128,0>    93.79 ->  92.74 ms/step  (-1.1%, FLAT)            <- the P2a target, decode shape
-    mul_mat_q_stream_k_fixup   3.96 ->   5.65 ms/step  (+1.7ms, REGRESSES at nwarps/2=2)
-  => the decode mmq FAMILY is flat-to-slightly-WORSE; the flag delivers ~nothing at the m=128 decode shape.
-
-The "-24.7%" is REAL but it is a PREFILL-shape number. Full-run aggregate (npp128 ntg128, prefill+decode)
-mul_mat_q<NVFP4,128>: 19630 -> 17569 ms = -10.5%; subtracting the flat decode portion (93.8x128 vs
-92.7x128) leaves the prefill-shape portion at 7625 -> 5699 ms = -25.3% (matches the prior -24.7%). So the
-occupancy lever genuinely cuts the COMPUTE/occupancy-bound prefill-shape GEMM ~25%, and ~0 of the
-BANDWIDTH-bound m=128 decode-shape GEMM (it reads the full NVFP4 weight matrix from 273 GB/s LPDDR5x; the
-mmq_y knob is deliberately bandwidth-neutral - every weight row still read once - so it cannot move a
-bandwidth-bound wall). Confirmed at the SOURCE-of-decode level, not inferred.
-
-Reconciling with "99.94% busy single stream => a -24.7% cut should give ~+6%": the PREMISE is false. The
-flag does NOT cut the decode mul_mat_q by 24.7% (it cuts it 1.1%). There is therefore NO freed time on the
-99% busy stream - so the "where does the freed time go (idle gaps?)" question is moot: no time is freed at
-the decode shape. The contradiction dissolves: mul_mat_q IS on the critical path AND single-stream-busy, but
-the lever simply doesn't accelerate the decode-shape invocation. (Net it slightly hurts via stream_k_fixup.)
-
-CORRECTION to an earlier in-session A/B (recorded so the parent does not chase it): a first pass showed
-build-cuda-base 334.6 -> "flag" 372 (+11%). That was a STALE-BASELINE artifact, NOT the flag. build-cuda-base
-(binaries 18:46) was compiled from a pre-0021 source - it has NO ssm_conv_update_f32 (cuobjdump symbol count
-0 vs 4 in the 0022 build) and the un-retuned GDN default (gated_delta_net 194 vs 168 ms/step). Those ~40 ms
-of non-mmq differences (conv fuse ~14 ms + GDN ~26 ms) are the entire 334.6->373 gap. With a correct
-same-source baseline (toggle ONLY mmq.cuh in one build dir) the flag is flat (373.19 vs 374.30). Lesson:
-the only valid P2a A/B holds every non-mmq .o byte-identical; comparing two independently-built trees mixes
-in whatever other flag/patch state each was built from.
-
-## VERDICT
-P2a (mmq_y=64 nwarps-remap) is BIT-EXACT (md5-identical, 1115/805) and a genuine ~25% PREFILL-shape FP4-GEMM
-kernel win, but it is FLAT on decode (dense and MoE, npl 32 and 128) on 0022, AND flat on end-to-end prefill
-S_PP at 0022 (prefill is GDN/other-bound at these sizes, not mmq-bound). It is NOT a decode-parity lever and
-the decode commit-gate (lift decode_agg) is NOT met -> do NOT ship for decode. The binding decode kernel is
-gated_delta_net (~50%); the only decode levers left are the bit-exact folds in the design section above
-(quantize producer-fold ~2-2.5%, pointwise activation fold ~1.5-2.5%) and the GDN-region launch/fusion that
-vLLM already has. The mmq P2a machinery was reverted; the 0022 tree is left git-clean.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-# nonrec-build (GPU agent) - built + measured. Lever shipped: MoE NVFP4 quantize de-dup (patch 0023)
-
-Box: DGX GB10 (sm_121a), baseline = clean rebuild of HEAD 8a3229f (patch 0022) in build-cuda
-(verified: mmq.cu.o rebuilt from clean source; the A/B-left binary was stale). md5 references
-locked: q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd.
-Baseline decode S_TG: dense 208.7/373.6, MoE 441/746 (npl 32/128). ncu unavailable (no
-GPU-counter permission, no sudo) -> all verdicts are nsys + back-to-back same-build A/B.
-
-## Levers EVALUATED
-
-### A. quantize_mmq_nvfp4 occupancy retune (token-packing) - BIT-EXACT, FLAT -> not shipped
-The decode quantize at the K=2048 shape is grid (128,1,1) = 128 CTAs = ~2.67 waves on 48 SMs.
-Unlike mul_mat_q (bandwidth-bound on LPDDR5x, so P2a was flat), quantize moves trivial memory,
-so I tried packing TPB token-rows per CTA (blockDim.y) to cut wave-quant - each thread still
-quantizes its own 16 consecutive values, so byte-identical (md5 5951a5b4/07db32c2 held at TPB
-1/2/4, after fixing the output ib index to use the token i1 not blockIdx.x). Result: DENSE npl128
-DEAD-FLAT 373.25 across TPB 1/2/4; npl32 and MoE flat-to-slightly-WORSE at TPB>1. The decode
-quantize is at its best config already (TPB=1 = max CTA parallelism = best latency hiding;
-fewer/bigger CTAs hurt). Second bit-exact occupancy lever (after P2a) proven flat. Reverted.
-
-### B. skip-ALL-quantize probe (NON-bit-exact, diagnostic) - the +30% MoE number is an ARTIFACT
-Skipping quantize_mmq_fp4_cuda entirely (garbage buffer, FP4-MMA timing data-independent) showed
-DENSE +2.7%/+3.7% (npl128/32) but MoE +29.9%/+43.9%. The MoE figure is NOT a valid ceiling: the
-garbage activation also corrupts the router (ffn_gate_inp) quantize -> degenerate topk expert
-selection -> less / better-localized expert work -> artificially fast. The authoritative
-decode decomposition (nsys --cuda-graph-trace=node, npp8 ntg128 npl128) shows quantize is only
-3.7% of MoE decode GPU-time, not 23%. Dense +2.7% IS real (rms_norm-fold territory, see D).
-
-### C. SHIPPED - MoE NVFP4 activation-quantize de-dup (patch 0023) - BIT-EXACT, lifts decode+prefill
-ggml mul_mat_id quantizes the gathered rows ne11_flat = ne12*n_expert_used. For the broadcast
-up/gate proj (ne11==1) every expert of a token sees the SAME token activation, so stock
-re-quantizes each token n_expert_used (=4 here) times. quantize_mmq_nvfp4 has NO cross-thread
-reduction (per-16-element per-thread), so the gathered blocks are byte-identical across experts.
-Lever: quantize the ne12 unique tokens once, then gather the block_fp4_mmq rows into the
-expert-gathered layout with a coalesced uint4 copy (block_fp4_mmq = 9 uint4 = 144 B). GEMM
-untouched; down_proj (ne11==n_expert_used, distinct) keeps stock.
- Gather v1 (per-thread 144 B struct copy) was UNCOALESCED: gather 478 ms ate 84% of the 570 ms
-  quantize saving -> flat. Gather v2 (coalesced uint4, output written contiguously) = 32 ms.
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), gather +32 ms, net -379 ms.
- DECODE S_TG: MoE npl128 745.2 -> 758.1 (+1.73%), npl32 +0.6%. PREFILL T_PP -4%. DENSE byte-flat.
- BIT-EXACT GATE (default on): q36-27b 5951a5b4 (unchanged), q36-35b-a3b 07db32c2 (on==off==0022);
-  test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0
-  restores stock. Committed: DGX f7409c2 + worktree patch 0023.
-
-### D. NOT built - dense quantize producer-fold (rms_norm -> fp4) - real but ~2.7%, needs graph fusion
-Dense decode quantize is ~2.7% (skip B, real). Folding it into the rms_norm+mul producer is
-bit-exact-feasible (keep the strided sumsq reduction byte-identical, re-partition only the
-writeback to 16-consecutive-per-thread + the verbatim per-thread quant) but requires a 3-op
-{RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion hoisting the GEMM into the producer node and a
-mul_mat_q pre-quantized-src1 path (the scratch is a per-call pool buffer). High plumbing for
-~2.7% dense only; left for a follow-up. mul_mat_q (bandwidth wall), flash_attn (softmax rescale
-order), lm_head (cublas) have NO bit-exact lever.
-
-## Verdict
-The non-recurrence path has ONE shippable bit-exact decode lever found and built: the MoE
-quantize de-dup (0023, +1.73% MoE npl128 decode, ~4% faster prefill, dense untouched, byte-identical).
-It is the only redundant-work bucket; the rest of the non-recurrence kernels are at their
-bit-exact floor (mul_mat_q bandwidth-bound, quantize occupancy-flat, attention softmax-locked).
-The remaining bit-exact headroom is the dense rms_norm->fp4 producer-fold (~2.7% dense, graph-
-fusion surgery, not built) and then bf16 state (precision change, shelved) - no other bit-exact
-lever moves the LPDDR5x-bandwidth-bound, recurrence-dominated (~50%, past vLLM parity) decode wall.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md
@@ -1,119 +0,0 @@
-# OCCUPANCY_RETUNE_RESULTS.md - CRUX SETTLED: vLLM recurrence state is FLOAT32 (805 MB/call)
-
-Phase: vllm-f32-confirm (GPU agent). DGX GB10, peak DRAM BW = 273 GB/s.
-Checkpoint: ~/bench/q36-27b-nvfp4-vllm (vLLM 0.23.0), ~/bench/q36-27b-nvfp4.gguf (llama HEAD 58426b5, conv-fusion 0021).
-NOTE: ncu HW perf-counters are perm-blocked on this node (RmProfilingAdminOnly:1, no passwordless sudo, ERR_NVGPUCTRPERM).
-Settled WITHOUT counters: (a) empirical tensor dtype at the kernel boundary, (b) nsys/CUPTI kernel timing (counter-free), (c) source+config chain.
-
-## VERDICT: f32. The close-check is RIGHT. The byte-gate (402 MB/bf16) is WRONG. BUILD THE BIT-EXACT OCCUPANCY RETUNE.
-
-vLLM carries the gated-DeltaNet TEMPORAL/recurrent state in FLOAT32 and moves 805.3 MB/call, NOT 402 MB bf16.
-Both engines move the SAME ~805 MB f32 recurrent state per call. The gap is pure BANDWIDTH EFFICIENCY on equal f32 bytes.
-
-## vLLM (kernel: fused_recurrent_gated_delta_rule_packed_decode)
- EMPIRICAL tensor at kernel boundary (initial_state = self.kv_cache[1], qwen_gdn_linear_attn.py:1316/1492):
-    dtype=torch.float32  elem_bytes=4  shape=(1553, 48, 128, 128)  per-slot state = 786432 elems = 3.000 MiB (f32)
- MB/call (B=128, Read+Write) = 128 * 48*128*128 * 4 bytes * 2 = 805,306,368 B = 805.3 MB  (bf16 would be 402.7 MB)
- Runtime engine config: cache_config.mamba_ssm_cache_dtype = float32  (mamba_cache_dtype=auto/bf16 for conv)
- Source chain: config.json text_config.mamba_ssm_dtype=float32 -> Qwen3_5ForConditionalGenerationConfig.verify_and_update_config
-    sets cache_config.mamba_ssm_cache_dtype="float32" -> MambaStateDtypeCalculator._mamba_state_dtype else-branch
-    -> temporal_state_dtype = torch.float32 (conv state = bf16; temporal/SSM state = f32).
- Kernel timing (CUDA events, eager B=128, 432 steady-decode calls): median 3.578 ms/call, min 3.499, mean 3.593, p90 3.635
-    BW @ median = 805.3MB / 3.578ms = 225.1 GB/s = 82.4% of 273 peak  (min 84.3%, p90 81.1%)
-
-## llama (kernel: gated_delta_net_cuda<128, 0, 0>)
- Kernel signature: all operands const float* (q,k,v,g,beta,curr_state) + float* state_dst => recurrent state is f32. Source-confirmed.
- Identical state geometry (48 value-heads x 128 head_v x 128 head_k, B=128) => MB/call (R+W) = 805.3 MB f32 (same as vLLM).
- Fresh nsys (--cuda-graph-trace=node, build-cuda-base, -npp128 -ntg24 -npl128, q36-27b-nvfp4.gguf):
-    gated_delta_net = 25.4% of GPU time (#2 kernel after nvfp4 mul_mat_q).
-    Decode cluster isolated = exactly n=1152 calls (= 24 ntg x 48 GDN layers), B=128 steady state:
-      median 4.0211 ms/call, mean 4.0315 => 200.3 GB/s = 73.4% of 273 peak.
-    (Consistent with prior GAP_PROGRESS 4.08ms/~70% and context 3.98ms/202GB/s/74%.)
-
-## THE GAP (equal f32 bytes, different efficiency)
-  llama   805.3 MB / 4.021 ms = 200.3 GB/s = 73.4% peak
-  vLLM    805.3 MB / 3.578 ms = 225.1 GB/s = 82.4% peak
-  => vLLM is ~11% faster per recurrence call at IDENTICAL byte volume => ~9 pts more DRAM BW efficiency.
-  Retune target: 73.4% -> ~82% peak, recurrence 4.02 -> ~3.58 ms/call, KEEPING exact per-column f32
-  reduction/FMA order (md5-gateable bit-identical). bf16 plan stays SHELVED (optional over-clock only).
-
---
-
-# retune-build (BUILD AGENT) — patch 0022 SHIPPED
-
-vLLM verdict re-checked first: **f32, 805 MB/call** (the close-check is right, the byte-gate's 402 MB/bf16
-is wrong). The bf16-state plan stays SHELVED. Built the bit-exact occupancy/coalescing retune.
-
-## The change — bit-exact column folding (Lever A + B + D)
-
-`ggml/src/ggml-cuda/gated_delta_net.cu` `gated_delta_net_cuda`: new template params
-`NUM_WARPS` (default 4) and `COLS_PER_WARP` (default 1), plus `MIN_BLOCKS`. Each warp now owns
-`COLS_PER_WARP` columns of the 128x128 recurrent state instead of 1, looping the existing per-column
-body over `col, col+NUM_WARPS, ...` inside a per-block column tile of `NUM_WARPS*COLS_PER_WARP` columns;
-`grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)`.
-
-Why it is bit-exact: the S_v rows of every column stay sharded across the lanes by the SAME strided
-mapping `i = r*warp_size + lane`, and every column's per-lane FMA accumulation and
-`warp_reduce_sum<warp_size>` XOR-butterfly are byte-for-byte unchanged. Only the
-`(warp,block)->column` assignment and the order a warp visits its columns differ, and a column's f32
-value provably does not depend on either (columns are fully independent — column c reads only its own
-S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). The forbidden `float4`
-state load (Lever E) — which would repartition a lane to 4 contiguous rows and change the reduction
-grouping — was NOT done; this keeps the md5 invariant. Every global access stays identically coalesced
-(32 consecutive lanes -> one 128B sector), so this is a latency-coverage / scheduling win (higher
-per-warp memory-level parallelism: COLS_PER_WARP independent state-load bursts issued before any
-reduction + the independent butterfly reductions interleave to hide each other's shfl latency), NOT a
-coalescing change. The S_v=128 tile is env-selectable via `GDN_NW`/`GDN_CPW` for one-build re-tuning;
-default is the measured GB10 winner **(NUM_WARPS=16, COLS_PER_WARP=8)**.
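The independence argument above can be illustrated with a small host-side model. This is a sketch under stated assumptions, not the CUDA kernel: `column_dot` stands in for the per-column body (a plain dot product instead of the real gated-DeltaNet update), Python floats replace f32, and `run` and its parameters are hypothetical names. It shows only that changing the `(warp, block) -> column` assignment leaves each column's lane mapping and reduction order untouched.

```python
import random

WARP_SIZE = 32
S_V = 128  # rows and columns of one head's recurrent state


def column_dot(state_col, k):
    """Per-column body: lane-strided partial sums, then an XOR-butterfly reduce."""
    partial = [0.0] * WARP_SIZE
    for r in range(S_V // WARP_SIZE):
        for lane in range(WARP_SIZE):
            i = r * WARP_SIZE + lane  # same strided row mapping for every tile shape
            partial[lane] += state_col[i] * k[i]
    offset = WARP_SIZE // 2
    while offset:
        # Every lane adds the value held by its XOR partner, as a shuffle-xor would.
        partial = [partial[l] + partial[l ^ offset] for l in range(WARP_SIZE)]
        offset //= 2
    return partial[0]


def run(state, k, num_warps, cols_per_warp):
    """Visit columns the way a (num_warps, cols_per_warp) launch would."""
    tile = num_warps * cols_per_warp
    assert S_V % tile == 0
    out = [None] * S_V
    for block_z in range(S_V // tile):  # grid.z = S_v / (NUM_WARPS * COLS_PER_WARP)
        for warp in range(num_warps):
            for c in range(cols_per_warp):
                col = block_z * tile + warp + c * num_warps  # col, col+NUM_WARPS, ...
                assert out[col] is None  # each column is owned exactly once
                out[col] = column_dot(state[col], k)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    k = [rng.uniform(-1, 1) for _ in range(S_V)]
    state = [[rng.uniform(-1, 1) for _ in range(S_V)] for _ in range(S_V)]
    base = run(state, k, 4, 1)
    print(all(run(state, k, nw, cpw) == base for nw, cpw in ((8, 4), (16, 8), (32, 4))))
```

A `float4`-style load would change the `i = r * WARP_SIZE + lane` line itself, which is why that lever falls outside the bit-exact set while the tile shape does not.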
-
-## %peak sweep — GB10, CUDA 13, sm_121 (nsys CUPTI timing; HW counters perm-blocked)
-
-Metric: median of the 1152 (=ntg24 x 48 layers) B=128 decode calls, each moving 805.3 MB f32 (R+W),
-isolated by the [2.5ms,6ms] band; %peak vs 273 GB/s. Baseline re-isolation reproduced the confirm
-agent's 4.021 ms / 73.4% exactly (n=1152).
-
-| NUM_WARPS x COLS_PER_WARP | ms/call | GB/s | %peak |
-|---------------------------|---------|------|-------|
-| base (0021)               | 4.021   | 200.3| 73.4  |
-| 4 x 1 (control == base)   | 4.034   | 199.7| 73.1  |
-| 4 x 2                     | 3.887   | 207.2| 75.9  |
-| 4 x 4                     | 3.775   | 213.3| 78.1  |
-| 8 x 1                     | 3.837   | 209.9| 76.9  |
-| 8 x 2                     | 3.749   | 214.8| 78.7  |
-| 8 x 4                     | 3.699   | 217.7| 79.9  |
-| 8 x 8                     | 3.586   | 224.6| 82.3  |
-| 16 x 2                    | 3.665   | 219.8| 80.5  |
-| 16 x 4                    | 3.585   | 224.7| 82.3  |
-| **16 x 8  (WINNER/default)** | **3.488** | **230.9** | **84.6** |
-| 32 x 4                    | 3.489   | 230.8| 84.6  |
-
-The two grid.z=1 tiles, (16,8) and (32,4), plateau at ~84.5%; (16,8) is picked as the default because
-its 512-thread block has no spill and avoids the 1024-thread `.minnctapersm` warning. **84.6% > vLLM 82.4%.**
-
-## Gates (both PASS, non-negotiable)
-
- **md5 BYTE-IDENTICAL to the 0021 baseline**, greedy `--temp 0 --seed 1 -n 48`, both models, winner
-  (16,8 default) AND (4,1 control):
-  - q36-27b-nvfp4 (dense): `5951a5b4d624ce891e22ab5fca9bc439` (baseline == winner == control)
-  - q36-35b-a3b-nvfp4 (MoE): `07db32c2bcb78d17a43ed18bc22705cd` (baseline == winner == control)
- **test-backend-ops -o GATED_DELTA_NET: 36/36 PASS** (covers head_size=128, kda=0/1, prefill K>1).
-
-## Decode throughput — base vs flag(16,8), llama-batched-bench -npp128 -ntg128 -fa on
-
-| model | npl | base S_TG t/s | flag S_TG t/s | gain |
-|-------|-----|---------------|---------------|------|
-| dense 27b | 32  | 199.2 | 207.6 | +4.2% |
-| dense 27b | 128 | 335.9 | 373.2 | +11.1% |
-| MoE 35b-a3b | 32  | 420.6 | 440.0 | +4.6% |
-| MoE 35b-a3b | 128 | 688.4 | 745.7 | +8.3% |
-
-Prefill S_PP unchanged (dense ~930, MoE ~2185 t/s) — no regression. Stable across 3 samples.
-
-## Parity vs vLLM (recurrence kernel)
-
-Recurrence kernel BW: before 200.3 GB/s = 89.0% of vLLM's 225.1; **after 230.9 GB/s = 102.6% of vLLM**
-(3.488 ms/call < vLLM 3.578 ms/call). The recurrence bandwidth gap that this workflow set out to close
-is closed and slightly exceeded; the remaining decode-parity delta lives in the non-recurrence path
-(matmul/attn), not in gated-DeltaNet.
-
-Shipped: patch 0022, committed on the DGX dev tree and the LocalAI worktree. No push.
--- a/backend/cpp/llama-cpp/patches/paged/OTHER_PATHS_INVESTIGATION.md
+++ b/backend/cpp/llama-cpp/patches/paged/OTHER_PATHS_INVESTIGATION.md
@@ -1,511 +0,0 @@
-# OTHER_PATHS_INVESTIGATION.md
-
-Read-only investigation of the four post-0023 paths (A MoE grouped-GEMM, B lm_head GEMM,
-C TTFT/paged-pool burst, D dense CUDA-graph). One section per agent. No GPU except the
-moe-gpu-profile agent.
-
---
-
-## A. MoE grouped-GEMM gap (label: moe-gemm-source, READ-ONLY, no GPU)
-
-### The decisive finding: vLLM's MoE on GB10 is MARLIN W4A16, not a native-FP4 grouped GEMM
-
-Engine-log ground truth (`VLLM_DECODE_GROUNDING.md`, from `~/bench/h2h_moe_vllm.log`):
-`"Using 'MARLIN' NvFp4 MoE backend ... Your GPU does not have native support for FP4
-computation ... Weight-only FP4 compression will be used leveraging the Marlin kernel"`.
-vLLM does NOT take its native-FP4 cutlass/trtllm MoE path on sm_121 (it whitelists only
-sm_100/103 datacenter Blackwell for FP4-MMA MoE). So on this box vLLM's MoE is:
-
- `moe_align_block_size` (BLOCK-PADDED token-sort; `num_tokens_post_padded`, sentinel pad rows),
- **2 grouped `moe_wna16_marlin_gemm` launches per MoE block** (gate_up, then SiLU+mul, then down),
-  each ONE launch over ALL experts, `use_fp32_reduce=True`,
- **W4A16: activations stay bf16, NEVER quantized**; FP4 weights dequantized in-kernel to bf16,
-  bf16 MMA,
- the whole decode step under a FULL CUDA graph.
-
-llama's MUL_MAT_ID on GB10 (mmq.cu id-branch + mmid.cu + mmq.cuh stream-k) is:
-
- `mm_ids_helper` token-sort/scatter, **NO block padding** (exact segments, `expert_bounds`),
- **activation FP4 quantize** (`quantize_mmq_fp4`) of the expert-gathered rows = W4A4,
- **1 persistent stream-k `mul_mat_q<NVFP4>` launch per projection**, native Blackwell FP4-MMA
-  (`block_fp4_mmq`), fp32 accumulate + `stream_k_fixup`,
- per-expert-density `mmq_x` (M-tile) select (patches 0014/0015, default tile 64 @ density<=8),
- NOT under a CUDA graph.
-
-### So the "missing fused grouped GEMM" does not exist - llama already HAS it
-
-llama's grouped FP4-MMA stream-k IS the same sorted-grouped-GEMM algorithm vLLM uses, and on
-GB10 llama's MoE GEMM is at a HIGHER-precision/native-FP4 tier than vLLM's W4A16 Marlin. The
-MoE decode gap (77-83% of vLLM vs dense 90-117%) is therefore NOT a grouped-GEMM-architecture
-deficit. The MoE-specific EXTRA gap (the ~10-15pt that MoE is worse than dense) decomposes as:
-
-1. **W4A4 activation-quantize tax (llama-only, the biggest MoE-specific discrete cost).**
-   llama quantizes activations to FP4 for the MoE GEMM; vLLM (W4A16) keeps them bf16 and pays
-   ZERO activation quantize. At MoE decode npl128 that is 1024 up/gate rows (patch 0023 dedup'd
-   the broadcast ones to 128 unique + a coalesced block gather) PLUS 1024 down_proj rows
-   (distinct per expert, CANNOT be dedup'd). nsys decode-isolated (`MOE_QUANT_DEDUP_RESULTS.md`):
-   `quantize_mmq_nvfp4` is still **457 ms** of decode GPU-time after the 0023 up/gate dedup; the
-   remaining bulk is the down_proj per-expert re-quantize. vLLM's W4A16 choice is actually SMART
-   for MoE decode on a bandwidth-bound box: keeping activations bf16 adds negligible activation
-   bandwidth at M~8/expert but ELIMINATES the entire quantize pass.
-
-2. **Un-graphed extra MoE nodes' launch bubbles.** Per MoE layer llama runs mm_ids_helper +
-   quantize + gather + 2 grouped GEMMs + SiLU/mul + down-quantize + moe_sum as separate
-   host-launched ggml nodes, none under a CUDA graph; vLLM runs moe_align + 2 grouped launches
-   under a full decode graph. This is the SAME launch-bubble root cause `CRITICALPATH_GAP_ANALYSIS.md`
-   pins for the GDN region (57 ms/step dense = 100% bubble), amplified for MoE by the extra
-   quantize/gather/scatter nodes - consistent with MoE being relatively worse than dense.
-
-3. **Ragged tiny-M tile + `need_check` partial-tail MMA** in the grouped stream-k. Already
-   addressed by 0014/0015 and measured **NEUTRAL** on q36-35b-a3b: that model is bandwidth/
-   SSM-recurrence-bound, not col-tile-occupancy-bound (the `LLAMA_MOE_DECODE_TILE` sweep shows 64
-   is the only non-negative width and it is within noise). So the M-tile lever has nothing to
-   bite on for THIS model; it banks +4.8% only on col-tile-bound MoE (Qwen3-Coder-30B).
-
-### Bit-exact llama MoE-GEMM levers (ranked)
-
- **M1 (bit-exact, modest): down_proj activation-quantize kernel retune.** The remaining ~457 ms
-  is dominated by the down_proj per-expert FP4 re-quantize (`ne11==n_expert_used`, no dedup
-  possible). The per-block quantize is a pure per-thread function of 16 consecutive inputs (the
-  property 0023 exploited to make its gather bit-exact), so the launch GEOMETRY can be retuned
-  (occupancy/coalescing, like 0022 did for the recurrence and like 0023's coalesced-uint4 gather
-  fix) while the quantized bytes stay BYTE-IDENTICAL. Also worth checking whether the down gather
-  (`ids_src1`) is redundant when the SwiGLU intermediate is already expert-contiguous. Scope:
-  nsys the down-branch `quantize_mmq_fp4` on GB10, retune block/grid, gate on test-backend-ops
-  MUL_MAT_ID exact + greedy md5 == 0023. Expected: low single-digit % at npl128 (bounded - it is
-  a fraction of a fraction of the step), but it is the only clean quantize-axis lever left after
-  0023 and it is strictly bit-exact.
-
- **M2 (bit-exact, the structurally-correct big one, SHARED with path D/A.2): CUDA-graph the MoE
-  decode step.** Graph replay does not change numerics => bit-exact. The MoE-specific extra node
-  count (quantize+gather+scatter+2 GEMM+silu+sum/layer, none graphed) makes the launch-bubble tax
-  larger for MoE than dense, which is exactly why MoE sits at 77-83% while dense is 90-117%.
-  Capturing the decode forward removes those bubbles. This is the same lever the GDN/A.2 work
-  scoped; it helps MoE MORE than dense. Highest-leverage bit-exact MoE win, but it is a
-  decode-graph-capture project, not a MoE-GEMM kernel edit.
-
- **M0 (already shipped): 0017 `GGML_CUDA_FP4_MINBLOCKS` (min-resident-CTAs register-cap) and
-  0014/0015 (`mmq_x` density auto-tile) already cover the FP4-MMA occupancy + M-tile axes of the
-  SHARED `mul_mat_q<NVFP4>` kernel.** 0017 is bit-exact (register allocation cannot change
-  results) and was tuned on dense; a MoE-targeted min-blocks re-sweep (grouped per-expert M-tiles
-  have different occupancy than the dense M=128 GEMM) is a cheap bit-exact follow-up, but
-  MOE_DENSITY_AUTO_TILE already found this model is bandwidth-bound, so headroom is likely small.
-
-### NOT recommended (explicitly out of scope)
-
- **W4A16 bf16-activation MoE GEMM (matching vLLM's Marlin choice).** This is the single biggest
-  MoE-specific structural difference and would erase the activation-quantize tax entirely, but it
-  (a) is NOT bit-exact (bf16 activations vs llama's FP4), and (b) is the W4A16 occupancy-wall
-  dead-end the docs flag (only ~9 TFLOP/178 t/s on GB10). Do not pursue.
-
-### Verdict / ranking of path A
-
-Path A is NOT a missing-kernel opportunity - llama already runs the sorted-grouped-FP4-MMA GEMM,
-at a higher native-FP4 tier than vLLM's GB10 W4A16 Marlin fallback. The MoE-specific extra gap is
-(1) the W4A4 activation-quantize tax vLLM structurally avoids by choosing W4A16, and (2) the same
-un-graphed launch-bubble tax as the GDN region, amplified by MoE's extra nodes. The only purely
-bit-exact, MoE-GEMM-local lever left is M1 (down_proj quantize retune, modest). The real MoE
-bit-exact win is M2 (CUDA-graph the decode step), which is the SAME lever as path A.2/D and helps
-MoE more than dense - so A's best lever collapses into the decode-graph effort rather than
-standing alone. Recommend ranking A's standalone kernel value BELOW the decode-graph (M2/D) and
-the lm_head (B) levers; fold A into the decode-graph build, and keep M1 as a cheap bit-exact
-bank-shot.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## B. lm_head GEMM (label: cublas-lmhead, READ-ONLY, no GPU)
-
-### The decisive fact: lm_head is BF16, not NVFP4 - so it CANNOT take the FP4 MMQ path
-
-`output.weight` (the LM head) in q36-35b-a3b-nvfp4 is **type 30 = GGML_TYPE_BF16, NOT quantized**
-(verified in `DECODE_PARITY_EXPLORE.md:298`: "2425 MB = 2.37 GB, read in full each step", 16% of
-weight traffic). This is by construction: the model was quantized with
-`--tensor-type attn/ffn=nvfp4`, which converts the attn+ffn tensors to NVFP4 and **leaves `output.weight` (and `tok_embd`)
-at base BF16** - the standard recipe, because the final projection is the most logit-sensitive
-tensor. The NVFP4 sidecar scales (`output_s`, `output_in_s`) are only created when
-`output->type == GGML_TYPE_NVFP4` (`llama-model.cpp:1459`), so for the BF16 head `model.output_s`
-is null.
-
-### Why it runs cublas/nvjet and not MMQ (exact routing trace)
-
-Graph: `qwen35moe.cpp:244` `cur = build_lora_mm(model.output, cur, model.output_s)` ->
-`llama-graph.cpp:1093` is just `ggml_mul_mat(ctx0, w, cur)` (the null `w_s` skips the scale `ggml_mul`).
-Then `ggml_cuda_mul_mat` (`ggml-cuda.cu:2540`) decides the kernel:
-
- `use_mul_mat_q` / `use_mul_mat_vec_q` BOTH require `ggml_is_quantized(src0->type)`. BF16 is NOT
-  quantized (`is_quantized=false` for F16/BF16/F32; NVFP4 IS `is_quantized=true`, `ggml.c:748`).
-  => **both MMQ paths are ineligible for the BF16 head.** (If the head were NVFP4 it WOULD route to
-  the tuned FP4 `mul_mat_q` - this is exactly the difference.)
- At decode npl128 the activation `src1->ne[1] = 128` columns: `use_mul_mat_vec_f` is gated off by
-  the mmvf batch cap; `use_mul_mat_f` (the MMF bf16 tensor-core GEMM) is gated off by
-  `ggml_cuda_should_use_mmf` for the wide `151936-row x 128-col` shape.
- `use_batched_cublas_bf16` is true, but the batched-cublas branch additionally requires
-  `src1->ne[2]*src1->ne[3] > 1` (a 3D/4D multi-batch GEMM). The decode lm_head is 2D
-  (`ne[2]*ne[3] == 1`) => **batched-cublas branch is skipped.**
- => falls through to the final `else`: `ggml_cuda_op_mul_mat_cublas`. With `src0` BF16 +
-  bf16-MMA hardware it takes the BF16 branch (`ggml-cuda.cu:1663`): `cublasGemmEx(CUDA_R_16BF,
-  CUDA_R_16BF -> CUBLAS_COMPUTE_32F, TENSOR_OP)`. **That cublasLt kernel is `nvjet_sm121`.**
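The routing trace above can be read as a cascade of eligibility checks. The sketch below is a simplified Python model of that trace only, not a port of `ggml_cuda_mul_mat`: the `MatMul` fields and the `route` function are hypothetical names, the mmvf batch cap and the `ggml_cuda_should_use_mmf` decision are reduced to boolean inputs, and the order of checks in the real code may differ.

```python
from dataclasses import dataclass


@dataclass
class MatMul:
    weight_is_quantized: bool    # stands in for ggml_is_quantized(src0->type)
    weight_is_bf16: bool
    within_mmvf_batch_cap: bool  # assumed input: src1->ne[1] small enough for mmvf
    mmf_accepts_shape: bool      # assumed input: outcome of ggml_cuda_should_use_mmf
    outer_batch: int             # src1->ne[2] * src1->ne[3]


def route(op: MatMul) -> str:
    """Return the kernel family the trace above would pick for this matmul."""
    if op.weight_is_quantized:
        return "mmq/mmvq"        # only quantized weights are eligible
    if op.within_mmvf_batch_cap:
        return "mul_mat_vec_f"
    if op.mmf_accepts_shape:
        return "mul_mat_f"
    if op.weight_is_bf16 and op.outer_batch > 1:
        return "batched_cublas"  # needs a 3D/4D multi-batch GEMM
    return "cublas"              # final fall-through


if __name__ == "__main__":
    bf16_head = MatMul(False, True, False, False, 1)
    nvfp4_head = MatMul(True, False, False, False, 1)
    print(route(bf16_head), route(nvfp4_head))
```

In this model the only input that moves the head off the cublas fall-through at decode is the weight dtype, which is the point the trace makes.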
-
-Cost (both models): dense `nvjet` lm_head = **12.17 ms = 3.66% of the 332.8 ms dense step**
-(`F16_DENSE_RESIDUAL_PROBE.md:65`); MoE = **11.91 ms = 3.1%** (`CRITICALPATH_GAP_ANALYSIS.md:398`).
-
-### CRITICAL correction: the current head is NOT "f32-lm_head" - it is already BF16-rounded
-
-The task brief calls the baseline "f32-lm_head"; it is not. The cublas BF16 branch **downcasts the
-F32 activation to BF16**, does BF16xBF16 with F32 accumulate, **writes the result as BF16** (dst is
-`CUDA_R_16BF`), then upcasts BF16->F32. So today's "bit-exact reference" logits are already
-**BF16-precision**, not f32. Two consequences:
-1. Any NVFP4/FP8 head swap is measured against a BF16 baseline, not f32 - the precision delta vs
-   the *true* f32 head is partly already paid.
-2. A *different* BF16 GEMM kernel that also F32-accumulates and BF16-rounds the output is
-   **bit-identical for the vast majority of logits** (differs only at rare BF16 rounding ties) -
-   this is what makes option (c) below "essentially bit-exact".
-
-### The options, and which break bit-exactness
-
- **(a) NVFP4-quantize the head -> tuned FP4 MMQ. BIGGEST win, BREAKS bit-exactness.** Weight
-  2.37 GB BF16 -> ~0.67 GB NVFP4 (0.5625 B/wt vs 2 B/wt = ~3.6x fewer bytes) AND it then hits the already-tuned
-  `mul_mat_q<NVFP4>` (0017) instead of cublas. Memory-bound floor drops ~3.6x => save ~8-9 ms =
-  ~2.5% of the dense step. But NVFP4 < BF16 precision => **different logit bits, can flip the greedy
-  argmax** = NOT bit-exact; and it is **UNFAIR vs vLLM**, which keeps its LM head BF16
-  (`DECODE_PARITY_EXPLORE.md:358`: "fp8 LM head ... only matters if vLLM also quantizes it"). This
-  is the same opt-in, non-bit-exact bucket as the f16-glue probe (already concluded SKIP).
- **(b) FP8 / Q8_0 head.** Smaller error than NVFP4 but still != BF16 bits => still NOT bit-exact,
-  and it is not even on the tuned FP4 MMQ path, so it buys less speed than (a). No reason to prefer.
- **(c) Keep BF16 weight, swap the kernel (custom skinny wide-vocab streaming GEMM, or a cublasLt
-  algo heuristic tuned for the thin-M / huge-N memory-bound shape).** The ONLY essentially-bit-exact
-  option (F32 accumulate + BF16 round = identical except rounding ties, per the correction above).
-
-### Realistic lever + scope: there is NO good bit-exact lever here
-
-Bandwidth math kills option (c): `nvjet` moves 2.37 GB in ~11.9-12.2 ms = **~195-199 GB/s = ~72% of
-the GB10's 273 GB/s peak**. The lm_head GEMM is therefore **already one of the MOST
-bandwidth-efficient kernels in the step** - the overall decode step runs at only 40% util /
-110 GB/s (`DECODE_PARITY_EXPLORE.md`). The bit-exact ceiling is tiny: even a perfect
-DRAM-saturating kernel (199 -> 273 GB/s) takes 11.9 -> ~8.7 ms = **save ~3 ms = ~0.9% of the dense
-step**, and beating cublas's own tuned nvjet on a pure weight-stream shape is NOT guaranteed (it may
-already be near-optimal). High kernel-writing effort, uncertain sub-1% payoff.
-(`F16_DENSE_RESIDUAL_PROBE.md:97` independently estimates a bf16-glue nvjet recovery of only ~5 ms and
-flags it "uncertain - may already run TF32" - consistent with little headroom.)
-
-The structural reason: the head must read the **entire 2.37 GB weight for just 128 output columns**
-(inherently memory-bound), and **you cannot cut those weight bytes without changing the dtype** -
-i.e. bit-exactness and the only real speedup (fewer weight bytes) are **mutually exclusive** here.
-
-### Verdict / ranking of path B
-
-The lm_head cublas/nvjet GEMM is a **dead end for a bit-exact win**: it is already at ~72% of peak DRAM
-bandwidth (the step's most efficient major kernel), so a bit-exact kernel swap caps at <1% with real risk and
-no guarantee of beating cublas. The only large win - NVFP4-quantizing the head (~2.5%) - is
-explicitly non-bit-exact AND unfair vs vLLM (which keeps BF16), so it lands in the same opt-in
-non-bit-exact bucket as f16-glue that was already shelved. Rank B's bit-exact value **at the bottom**
-of the four paths. The one worthwhile note for the team is the correction that the head is already
-BF16 (not f32), which slightly narrows what "bit-exact" even protects here; if the project ever
-opens a *non*-bit-exact opt-in track, NVFP4-head (option a) is a clean ~2.5% dense lever that rides
-the existing tuned FP4 MMQ - but it must be gated as opt-in and excluded from any vLLM-parity claim.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## A.2 / D. GPU-measured MoE decode decomposition + dense-graph stability (label: moe-gpu-profile, THE GPU AGENT)
-
-nsys `--cuda-graph-trace=node` on a steady MoE decode at npl128 (q36-35b-a3b-nvfp4, HEAD f7409c2,
-clean 0023 build-cuda). The measurement was decode-isolated: the run has a prefill phase (16384 tok,
-the big-GEMM region) followed by 64 steady decode steps; I segmented the timeline by GPU-idle gaps,
-dropped the prefill window, and aggregated per-kernel time over the 64-step decode window only
-(`moe_decode_npl128.{nsys-rep,trace.csv}` on the DGX; extractor `decfull.py`/`grid.py`).
-
-### MoE decode window: 98.3% GPU-bound, ~165 ms/step. Per-kernel share of decode GPU-time:
-
-```
- 41.9%  gated_delta_net_cuda            REC (SHARED with dense, already tuned 0018-0022)
- 26.9%  mul_mat_q<NVFP4, M-tile=64>     MOE expert grouped GEMM (MUL_MAT_ID) <-- biggest MoE-specific kernel
-  7.7%  nvjet_sm121 (cublas bf16)       attn/gdn bf16 projections + the BF16 lm_head (path B)
-  2.7%  cutlass_80 bf16 s16816 relu     bf16 GEMM (shared-expert / gate)
-  2.7%  k_bin_bcast (mul/add)           expert-combine + routing-weight scale + glue
-  2.6%  k_get_rows_float                REC recurrent-state gather
-  2.4%  flash_attn_ext_f16              attention
-  2.3%  mul_mat_q<NVFP4, M-tile=128>    router / non-grouped FP4 GEMM
-  2.1%  ssm_conv(+update)               REC
-  2.0%  quantize_mmq_nvfp4              MOE W4A4 activation-quantize tax (3.25 ms/step)
-  1.8%  convert_unary bf16<->f32        glue around the bf16 projections
-  1.5%  cpy_scalar                      glue
-  0.9%  rms_norm
-  0.5%  REC gating act | 0.5% streamk_fixup | 0.3% mm_ids_helper | 0.3% argsort |
-  0.2%  l2norm | 0.2% set_rows | 0.1% gather_mmq_fp4 | <0.1% topk/softmax/reduce (routing)
-```
-
-Bucketed: **Recurrence (shared, tuned) ~= 47.5%** (gdn 41.9 + get_rows 2.6 + ssm_conv 2.1 + gating
-0.5 + l2norm 0.2 + set_rows 0.2). **MoE FFN+routing block ~= 31%** (grouped GEMM 26.9 + activation
-quant 2.0 + streamk 0.5 + mm_ids_helper/argsort/gather/softmax/topk/reduce ~1.3 + the expert-combine
-share of bin_bcast). **cublas/cutlass bf16 projections ~= 10.4%** (nvjet 7.7 + cutlass 2.7).
-Attention ~2.4%. The recurrence is the single biggest term but it is shared with dense and already
-the subject of 0018-0022, so it is NOT a MoE lever.
-
-### The biggest MoE-specific kernel (the lever): mul_mat_q<NVFP4, M-tile=64> grouped GEMM
-
-26.9% of decode = ~43.5 ms/step, avg **373 us/call**, grids of **2048 and 8192** 64-wide tiles
-(blk=32 = 1 warp/block). Compare the dense FFN GEMM in the same family at npl128: `mul_mat_q<NVFP4,
-M-tile=128>` avg **31 us/call**, grid 48. The grouped per-expert GEMM is ~12x the per-call cost and
-launches ~43-171x more tiles (2048/48 to 8192/48) because each of 128 experts is a separate tiny-M sub-GEMM (128 tokens x
-top-k / 128 experts ~= a handful of rows per expert) padded into 64-wide tiles. This is exactly the
-ragged-tiny-M / col-tile-occupancy axis section A's 0014/0015 `mmq_x` density auto-tile already
-covers and measured NEUTRAL on this bandwidth-bound a3b model. MMQ FP4 is integer/FP4-exact
-independent of tile geometry, so this kernel IS bit-exact to retune (occupancy/min-blocks/M-tile),
-but the headroom on THIS model is small (it is bandwidth-bound, not tile-occupancy-bound).
-
-### Confirmations / quantifications of section A (from live GPU, not source-reading):
-
-1. **Un-graphed at npl128: CONFIRMED in source, but NOT the npl128 bottleneck.** NVFP4 on sm121
-   (turing_plus path) has `mmvq_mmid_max = 8` (`mmvq.cu:145`); MoE decode batch ne[2]=128 > 8, so
-   `[TAG_MUL_MAT_ID_CUDA_GRAPHS]` (`ggml-cuda.cu:3273`) disables CUDA graphs for the WHOLE step and
-   the MMQ grouped path (not MMVQ) is taken. HOWEVER the measured decode window is **98.3% GPU-util
-   with ~7.8 us inter-step host gaps** - at npl128 the kernels are large enough to fully hide the
-   per-op launch latency, so the un-graphed launch-bubble tax is negligible HERE. The un-graphed
-   penalty is a SMALL-npl problem; at npl128 the MoE gap is in-kernel (grouped GEMM + quantize),
-   not host bubbles. This refines A's M2: graphing the decode step helps small-npl MoE much more
-   than npl128 MoE.
-2. **W4A4 activation-quantize tax: CONFIRMED present but only 2.0% at npl128.** `quantize_mmq_nvfp4`
-   = 3.25 ms/step in the decode-isolated window (A's 457 ms figure is a whole-run/different-window
-   total). Real, and vLLM-W4A16 avoids it, but it is a small-single-digit term, not dominant.
-3. **lm_head/projection cublas (path B): CONFIRMED ~12.4 ms/step** of nvjet in MoE decode (matches
-   B's 11.91 ms), but that 7.7% bundle is mostly per-layer attn/gdn bf16 projections, not just the
-   one lm_head.
-
-### D. Dense CUDA-graph stability: f32 dense is STABLE, the bimodality was a BF16-only artifact
-
-Dense (q36-27b-nvfp4) has no MUL_MAT_ID, so it stays fully CUDA-graphed. Measured S_TG @npl128:
-
-```
-intra-process (1 load, 6x npl=128, npp8/ntg48, N_KV=7168): 376.2 376.2 375.7 375.1 375.3 374.9  (spread <0.4%)
-inter-process (6 separate procs, fresh graph capture each):373.6 377.0 376.8 376.6 376.2 375.7  (spread ~0.9%)
-committed heavy config (npl128 ntg128, N_KV=32768):        333.3 / 334.8 / 335.9                 (spread ~0.8%)
-```
-
-No bimodality in either replay (intra-process) or capture (inter-process). The custom graph state
-machine (`ggml-cuda.cu:4484`: warmup_complete requires 2 property-stable calls; the one-time capture
-cost lands in T_PP, not S_TG) absorbs capture into prefill, which is the only "hint" (the first
-in-process measurement has a slightly higher T_PP and a marginally lower S_TG, fully bounded). The
-287/336/487/498 bimodality in the brief was the shelved BF16 SSM-state path (BF16_SSM_STATE.diff,
-never applied), not the shipped f32 path. There is NO graphs-off env in this fork (graph enable is
-compile-time USE_CUDA_GRAPH + the warmup machine), so a graph-disable A/B would need a rebuild; given
-the f32 path is already stable to <1%, path D is a non-issue and not worth the rebuild.
-
-### Verdict (GPU agent)
-
- The MoE decode gap vs vLLM at npl128 is **in-kernel, not host-overhead**: 98.3% GPU-util rules
-  out the un-graphed launch-bubble story AT npl128. The single biggest MoE-specific kernel is the
-  `mul_mat_q<NVFP4, M-tile=64>` grouped GEMM (26.9%, 43.5 ms/step); it is bit-exact to retune but
-  bandwidth-bound on this a3b model (A's auto-tile already measured neutral), so the standalone
-  bit-exact MoE-GEMM lever is REAL but BOUNDED. The recurrence (47.5%) is shared and already tuned.
- **Path D (dense graph instability) is closed: the shipped f32 dense path is stable (<1%, no
-  bimodality).** No latent fragility, no rebuild warranted.
- Net ranking from the GPU side agrees with A/B: the MoE-GEMM and lm_head levers are both bounded
-  and partly non-bit-exact; the only structurally large bit-exact MoE win (A's M2, graph the decode
-  step) pays off mostly at SMALL npl, not at the npl128 where the benchmark gap is reported.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## C. TTFT / paged-pool burst degradation (label: ttft-burst-rootcause, READ-ONLY, source + committed traces)
-
-Files read: `paged/paged_kv_manager.{h,cpp}`; patches `0004` (on-demand alloc), `0007` (persistent
-manager + ref-counted prefix), `0008` (server cross-request share), `0013`/`0016` (prefill budget);
-docs `QWEN36_NVFP4_BENCH.md`, `BENCHMARK_PROGRESS.md`, `CHUNKED_PREFILL_PLAN.md`,
-`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`, `P1_DYNAMIC_BUDGET_RESULTS.md`, `FUTURE_LEVERS.md`.
-
-### Part 1 - the static decode-first budget: why a 128-way burst hits 903 s dense / 213 s MoE TTFT
-
-How the budget schedules (patch 0016, `server-context.cpp::update_slots`): each step builds ONE
-mixed batch. Phase 1 appends every GENERATING slot's single sampled token UNCONDITIONALLY (no budget
-gate), so after Phase 1 `batch.n_tokens == D` (the live decode load). Phase 2 then fills prompt
-tokens, bounded by three predicates: the hard `batch.n_tokens < n_batch` (2048) ceiling, a per-step
-`prefill_budget_step`, and a per-slot `prefill_cap_per_slot`. **Decode is structurally claimed first
-and never capped; only prefill is throttled.**
-
-At the shipped config (`LLAMA_MAX_BATCH_TOKENS=512`, i.e. T=512=n_ubatch) the dynamic terms
-degenerate to constants:
- `prefill_budget_step = max(n_ubatch, T - D) = max(512, 512-D) = 512` for all D in [0,128] - the
-  floor binds, the `T-D` adaptivity NEVER bites (exactly the "structural note" in
-  `P1_DYNAMIC_BUDGET_RESULTS.md`).
- `prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) = min(512, 5243) = 512` - the `T` clamp binds, so the 4%-of-context term never applies.
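The two budget terms can be evaluated directly to see why they degenerate at the shipped config. This is a minimal sketch of the formulas quoted above, not the `update_slots` code; the function names mirror the quoted variables, and `n_ctx = 131072` is an assumption chosen because it reproduces the 5243 figure.

```python
import math


def prefill_budget_step(T: int, n_ubatch: int, D: int) -> int:
    """Per-step prefill budget: leftover after decode, floored at one ubatch."""
    return max(n_ubatch, T - D)


def prefill_cap_per_slot(T: int, n_ctx: int) -> int:
    """Per-slot prefill cap: 4% of the context, clamped to T."""
    return min(T, math.ceil(0.04 * n_ctx))


if __name__ == "__main__":
    N_CTX = 131072  # assumed context size, see the note above
    # Shipped config, T == n_ubatch == 512: the floor binds for every decode load D.
    print({prefill_budget_step(512, 512, D) for D in range(129)})
    print(prefill_cap_per_slot(512, N_CTX))
    # With T raised to n_batch the leftover term becomes live and shrinks with D.
    print([prefill_budget_step(2048, 512, D) for D in (0, 64, 128)])
```

The first set has a single element, so the budget is a constant at T=512; only once T exceeds `n_ubatch + D` does the decode load start to change how much prefill a step admits.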
-
-So each step admits at most 512 prefill tokens TOTAL and up to 512 per single slot. Each benchmark
-prompt is exactly 512 tokens and there is NO round-robin (0016 drains slots in index order):
-**the first waiting slot consumes the entire 512-token step budget with its whole prompt; the 128
-prompts prefill strictly SERIALLY, one prompt per step.** Slot k's first token appears after ~k
-prefill steps and each step co-batches the accumulating decode load, so step time grows. Mean TTFT
-~= (half the prompts) x step_time ~= **903 s dense** (each step reads the full 28B NVFP4 weights) /
-**213 s MoE** (3B active = cheaper steps). Decode_agg stays high (384/726 t/s) because Phase 1 seats
-every decode token every step. This is the **deliberate decode-first tradeoff**: T=512 was chosen
-for decode throughput + memory; TTFT was the sacrificed axis. The 903 s is partly self-inflicted by
-the floor budget + lack of fairness, not a kernel limit (dense `prefill_tps` collapses to ~70 t/s
-under the throttle vs vLLM's flat ~1420).
-
-The fix (chunked-interleave / fair dynamic budget = P2 of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`,
-NOT implemented), three pieces in `update_slots` Phase 2, zero libllama change:
-1. Raise T toward `n_batch` (2048) so the per-step total budget is large; keep decode-first via the
-   REAL `prefill_budget_step = T - D` (leftover auto-shrinks as D rises, so the step never inflates
-   past T even at npl128).
-2. A per-slot chunk cap MUCH smaller than the budget (the `long_prefill_token_threshold` analogue),
-   e.g. 128-256 tokens, so one prompt cannot monopolize the step.
-3. A round-robin start offset over PROCESSING_PROMPT slots so leftover budget spreads across MANY
-   waiting prompts per step.
-
-Net: instead of "one full 512-prompt per step" (serial, last prompt waits 128 steps), each step
-admits small chunks from ~T/cap prompts at once, so all 128 advance in lockstep and TTFT collapses
-from O(k*step) to O(constant) - the vLLM 6-18 s regime. 0016's per-slot-cap variable already exists
-but is inert at the shipped config and lacks the round-robin spreader. Honest boundary (already in
-the docs): this closes TTFT, it does NOT lift the ~161/333 decode ceiling (a separate lever).
-
-### Part 2 - the burst-degradation BUG: later lower-npl prefill collapses 507 -> 65 t/s, decode fine, restart cures it
-
-The signature - prefill-only collapse, decode untouched, persists in-process, a server restart fully
-cures it (the benchmark's documented "restart per npl" workaround) - points to persistent paged-pool
-host state never restored short of `clear()`/teardown. Two compounding mechanisms, both confirmable
-from the patch source:
-
-**(1) RECLAMATION GAP - blocks are returned ONLY on a FULL-range wipe.** `paged_alloc` returns a
-sequence's blocks to the pool in exactly two places (patch 0004, kept in 0007): `clear()` ->
-`release_all`, and `seq_rm(seq, p0, p1)` ONLY when `p0 == 0 && p1 == MAX`. But llama-server's normal
-slot lifecycle issues PARTIAL truncations: slot reuse with a retained common/BOS prefix calls
-`seq_rm(slot.id, n_past, -1)` with `n_past > 0` (patch 0008 itself calls
-`common_context_seq_rm(ctx, slot.id, n_past, -1)`); context-shift / partial rewinds likewise. None
-satisfy `p0 == 0`, so the release hook never fires: the kv-cache frees those CELLS but the manager
-still believes the sequence owns those BLOCKS. The two desync and the manager's effective free pool
-shrinks every time. Patch 0008's own comment is the smoking gun - it added the `n_past < 16` gate
-because a mismatched full-prompt reservation vs suffix-only submission "never leaves stale blocks
-(which otherwise fragment the paged pool ... and crashed the server under high fan-out)". 0008 only
-closed that hole for the narrow `share()` path; the general partial-`seq_rm` path stays unhooked, so
-over a high-fan-out burst leaked blocks accumulate and never return.
-
-**(2) FRAGMENTATION / NO COMPACTION - the free queue is permuted by the burst and never rebuilt.**
-Even for cleanly freed blocks, `BlockPool::free_blocks` just `prepend_n`/`append_n`s them in free
-order; no compaction, no pristine reset. After a high-fan-out burst (many interleaved alloc/free
-across many seqs in the unified pool, or reversed-order frees in a per-stream pool) the free queue is
-a scrambled permutation of physical block ids. A subsequent LOW-npl prefill then `popleft`s
-physically SCATTERED blocks, so its 512-token KV scatter-WRITE plus the in-kernel paged-attention
-GATHER lose locality across the KV span -> prefill throughput collapses. Decode is a single-token
-append per step with a gather amortized over tiny per-step work, so it barely notices - exactly the
-observed "prefill collapses, decode robust". The scramble + leak persist for the process lifetime
-(only `clear()`/restart rebuilds a contiguous free queue) - precisely why restart-per-npl restores
-507 t/s. Contributing factor: slots used in the burst but not reassigned next run are never released
-(release fires only on next-task divergence), so a low-npl run sees a reduced, fragmented pool and
-falls back to the stock contiguous allocator more often (the `place()->false->res.idxs.clear()`
-fallback in find_slot), scanning a littered cell array - another prefill-only slowdown.
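Mechanism (2) can be illustrated with a toy model. This is a minimal Python sketch, not the patch's C++: a `deque` stands in for the `BlockPool` free queue, the helper names (`alloc`, `free`, `scatter`, `burst_then_low_npl`) are hypothetical, and it assumes freed blocks are simply appended in free order with no compaction.

```python
from collections import deque

def alloc(free_q, n):
    """Pop n block ids from the front of the free queue (the popleft path)."""
    return [free_q.popleft() for _ in range(n)]

def free(free_q, blocks):
    """Return blocks in the order they were freed; no compaction, no sorting."""
    free_q.extend(blocks)

def scatter(blocks):
    """Count steps between consecutive blocks that are not physically adjacent."""
    return sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)

def burst_then_low_npl(n_blocks=64, n_seqs=8, blocks_per_seq=8):
    free_q = deque(range(n_blocks))            # pristine, contiguous free queue
    owned = [[] for _ in range(n_seqs)]
    # Burst: sequences grow one block at a time, interleaved across sequences.
    for _ in range(blocks_per_seq):
        for s in range(n_seqs):
            owned[s] += alloc(free_q, 1)
    # Drain: each sequence returns its (strided) blocks when it finishes.
    for s in range(n_seqs):
        free(free_q, owned[s])
    after_burst = alloc(free_q, blocks_per_seq)     # the next low-npl prefill
    free(free_q, after_burst)
    rebuilt = deque(sorted(free_q))                 # Fix-2 style rebuild to pristine order
    after_rebuild = alloc(rebuilt, blocks_per_seq)
    return after_burst, after_rebuild
```

In this model the post-burst allocation comes back strided across the whole pool, while the same allocation after a rebuild is contiguous again. It says nothing about the actual throughput cost, which only the GPU repro can measure.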
-
-Fix scope (all gated behind `LLAMA_KV_PAGED`, default-off byte-identical, no libllama API change):
- **Fix-1 (core, ~30-50 lines): close the reclamation gap.** Add
-  `paged::PagedKVManager::truncate(seq, n_keep)` that frees a request's trailing blocks from block
-  index `ceil(n_keep/bs)` onward, `bs` being the block size (ref-counted, mirroring vLLM's free of the
-  truncated block suffix), expose `paged_alloc::truncate(cache, stream, seq, n_keep)`, and call it from
-  `llama_kv_cache::seq_rm` for the `p1 == MAX && p0 > 0` case (ideally any `[p0,p1)`). Manager
-  accounting then tracks the kv-cache exactly; the leak stops.
- **Fix-2 (small): defrag on empty.** When a stream's cells reach `get_used() == 0`, rebuild that
-  manager's free queue to pristine contiguous order (or recreate the manager) so a reused pool
-  starts unfragmented.
- **Fix-3 (small): release on slot completion.** Add a paged release at server `slot.release()` so
-  finished-but-idle sequences return blocks promptly and a later low-npl run sees a full, compact
-  pool.
- **Fix-4 (optional hardening): best-fit / contiguous-run preference** in `get_new_blocks` + a
-  defrag pass before the find_slot stock-fallback fires.
-
-Validation repro (GPU-bound, for a later profiling pass): npl64 burst then npl8 on ONE server;
-assert npl8 `prefill_tps` within ~10% of a fresh-server npl8, and that `paged_alloc::num_free`
-returns to the fresh value after the burst drains.
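The reclamation gap and the Fix-1 idea can be modelled in a few lines. This is a hedged Python sketch under stated assumptions, not the patch's implementation: `ToyPagedManager` and its methods are hypothetical stand-ins, ref-counting and prefix sharing are omitted, and `p1 == -1` is used to mean "to the end", as in the `seq_rm(slot.id, n_past, -1)` call quoted above.

```python
import math

class ToyPagedManager:
    """Toy model of per-sequence block ownership (no ref-counting, no sharing)."""

    def __init__(self, n_blocks, block_size):
        self.bs = block_size
        self.free = list(range(n_blocks))
        self.owned = {}                       # seq id -> list of block ids

    def num_free(self):
        return len(self.free)

    def reserve(self, seq, n_tokens):
        """Grow seq until it owns enough blocks to hold n_tokens."""
        blocks = self.owned.setdefault(seq, [])
        need = math.ceil(n_tokens / self.bs) - len(blocks)
        for _ in range(max(0, need)):
            blocks.append(self.free.pop(0))

    def seq_rm_full_only(self, seq, p0, p1):
        """Pre-fix behaviour: blocks return only on a full-range wipe."""
        if p0 == 0 and p1 == -1:
            self.free.extend(self.owned.pop(seq, []))

    def truncate(self, seq, n_keep):
        """Fix-1 sketch: free every block from index ceil(n_keep/bs) onward."""
        blocks = self.owned.get(seq, [])
        keep = math.ceil(n_keep / self.bs)
        self.free.extend(blocks[keep:])
        del blocks[keep:]

def held_after_partial_rm(use_truncate):
    m = ToyPagedManager(n_blocks=64, block_size=16)
    m.reserve(seq=0, n_tokens=512)                  # 32 blocks
    if use_truncate:
        m.truncate(seq=0, n_keep=32)                # keep a 32-token prefix (2 blocks)
    else:
        m.seq_rm_full_only(seq=0, p0=32, p1=-1)     # partial range: release never fires
    return 64 - m.num_free()                        # blocks the manager still counts as owned
```

Without the truncate hook the manager still holds all 32 blocks after the partial removal; with it, only the 2 prefix blocks remain, which is the accounting the `num_free` assertion in the repro is meant to check.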
-
-### Verdict / ranking of path C
-
-Two distinct things: a **BUG** (Part 2) and a **tuning tradeoff** (Part 1). Rank the BUG first - it
-is a true correctness/hygiene defect, not a tradeoff: a long-lived production server silently
-degrades under ordinary mixed load and currently REQUIRES the "restart per npl" crutch, unacceptable
-in real serving. Fix scope is small and localized to the paged-alloc unit + one `seq_rm` call site,
-default-off byte-identical, with a crisp pass/fail repro. The chunked-interleave scheduler (Part 1)
-is the bigger HEADLINE (the weakest benchmark number, 903 s/213 s burst TTFT vs vLLM 6-18 s) but a
-larger effort with a deliberate TTFT-vs-decode-ITL tradeoff to navigate. The two are complementary:
-the scheduler reduces how punishing each burst is; the bug fix ensures the pool survives the burst
-so the NEXT request is not poisoned.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## SYNTHESIS - ranking and the first build target (label: orchestrator)
-
-The brief framed two tracks: **BIT-EXACT** levers (help the shipped f32 parity DEFAULT, included in
-the vLLM-parity claim) and **SERVING** levers (gated behind `LLAMA_KV_PAGED`, default-off
-byte-identical, outside the parity claim). The decisive cross-cutting finding from all four agents:
-**there is no compelling first build target on the bit-exact decode-default track** - A is bounded,
-B is a sub-1% dead end, D is closed - **while the SERVING track has one clear, high-ROI, tractable,
-low-risk, byte-identical-default first target: the paged-pool burst-degradation bug.**
-
-### Per-path scorecard
-
-| Path | Expected gain | Tractability | Bit-exactness | Net |
-|------|---------------|--------------|---------------|-----|
-| **A** MoE grouped-GEMM | Standalone kernel: **bounded, low single-digit %** at npl128 (model is bandwidth-bound; 0014/0015 M-tile auto-tile already NEUTRAL here). The big MoE win = M2 = graph-the-decode-step, which is SHARED with D and pays off mostly at SMALL npl, not the npl128 benchmark point. | M1 (down_proj quantize retune) cheap; M2 a decode-graph-capture project (large). | M1 strictly bit-exact (byte-identical quantized output); M2 bit-exact (replay). Helps the DEFAULT. | Real but **bounded**; no clean standalone kernel win. Keep M1 as a cheap bank-shot; fold M2 into a decode-graph effort. |
-| **B** cublas lm_head (nvjet) | Bit-exact ceiling **<1%** (~3 ms; nvjet already ~72% of peak HBM, the step's most efficient major kernel). The only big win (NVFP4 head ~2.5%) is non-bit-exact AND unfair vs vLLM (which keeps BF16). | Custom skinny-GEMM = high effort, uncertain it beats cublas. | Bit-exact option caps <1%; the 2.5% option is a logits change (opt-in only). | **Dead end** for the default. Rank LAST. |
-| **C** TTFT / paged-pool burst | **Part 2 bug:** restores prefill from collapsed 65 -> ~507 t/s after a burst (removes the "restart per npl" crutch). **Part 1 scheduler:** the headline - 903 s/213 s burst TTFT -> vLLM 6-18 s regime. | **Part 2: small + localized** (paged-alloc unit + 1 seq_rm call site). Part 1: larger (fairness + admission + tuning). | Both gated behind `LLAMA_KV_PAGED`, **default-off byte-identical**. SERVING track (doesn't touch the parity-default numerics). | **Highest ROI x tractability.** Part 2 is a true correctness defect with a crisp repro. |
-| **D** dense CUDA-graph instability | **Zero** - f32 dense measured STABLE (<1% spread, no bimodality). The 287/336/487/498 bimodality was the SHELVED BF16 SSM path, not the shipped f32 path. | n/a (would need a rebuild for a graphs-off A/B). | n/a | **CLOSED.** Not worth any work. |
-
-### Ranked order (ROI x tractability x bit-exactness)
-
-1. **C-Part2 - paged-pool burst-degradation bug fix.** Small, localized, default-off byte-identical,
-   crisp pass/fail repro, removes a real production-serving defect + the benchmark's restart crutch.
-2. **C-Part1 - chunked-interleave / fair dynamic budget.** The public-facing TTFT headline closer,
-   but a larger effort and a deliberate TTFT-vs-ITL tradeoff. Do it AFTER the bug fix (the scheduler
-   reduces burst pain; the bug fix keeps the pool alive across bursts).
-3. **A-M1 - down_proj activation-quantize kernel retune** (cheap bit-exact bank-shot for the default;
-   bounded payoff on this bandwidth-bound model). Optionally folded with a future decode-graph build
-   (A-M2 / the shared MoE+GDN decode-graph capture), which is the only structurally large bit-exact
-   MoE lever but a big project that helps small-npl more than npl128.
-4. **B - lm_head kernel swap.** Bit-exact ceiling <1% with real risk. Skip unless a non-bit-exact
-   opt-in track opens (then NVFP4-head ~2.5% dense, gated, excluded from parity claims).
-5. **D - dense graph instability.** Closed, no work.
-
-### THE FIRST BUILD TARGET: paged-pool burst-degradation bug fix (C-Part2)
-
-**Why this one:** it is the only candidate that is simultaneously (a) high ROI - fixes a real
-correctness defect that forces the "restart per npl" crutch in long-lived serving, (b) tractable -
-small and localized to the paged-alloc unit plus one `seq_rm` call site, (c) safe for the parity
-claim - gated behind `LLAMA_KV_PAGED`, default-off byte-identical, and (d) verifiable with a crisp
-pass/fail repro. Every bit-exact-default alternative is bounded (A), a dead end (B), or closed (D).
-
-**Implementation plan (incremental, each step independently shippable):**
-1. **Fix-1 (core):** add `paged::PagedKVManager::truncate(seq, n_keep)` that ref-count-frees the
-   trailing blocks beyond block index `ceil(n_keep/bs)`; expose
-   `paged_alloc::truncate(cache, stream, seq, n_keep)`; call it from `llama_kv_cache::seq_rm` for the
-   `p1 == MAX && p0 > 0` case (ideally any `[p0,p1)`). Closes the reclamation gap so manager
-   accounting tracks the kv-cache exactly.
-2. **Fix-2:** defrag-on-empty - when a stream reaches `get_used() == 0`, rebuild its free queue to
-   pristine contiguous order.
-3. **Fix-3:** paged release at server `slot.release()` so finished-idle sequences return blocks
-   promptly.
-4. **Fix-4 (optional):** best-fit / contiguous-run preference in `get_new_blocks` + a defrag pass
-   before the find_slot stock fallback.
-
-**Confirming measurement (the explicit repro, GPU-bound):** on ONE long-lived server, run an npl64
-burst, let it drain, then run npl8. PASS if (i) npl8 `prefill_tps` is within ~10% of a fresh-server
-npl8 (vs the ~65 vs ~507 collapse today), and (ii) `paged_alloc::num_free` returns to the
-fresh-start value after the burst drains (proves no leaked blocks). Decode t/s must be unchanged.
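The pass/fail rule above could be encoded as a small check in a bench script. This is only a sketch: the function name, the dict keys, and the decode tolerance are assumptions (the text says decode must be "unchanged" without giving a tolerance), and "within ~10%" is read here as a lower bound on post-burst prefill throughput.

```python
def burst_repro_passes(fresh, post_burst, prefill_tol=0.10, decode_tol=0.02):
    """Compare a post-burst npl8 run against a fresh-server npl8 run.

    Both arguments are dicts with 'prefill_tps', 'decode_tps' and 'num_free'.
    decode_tol is an assumed noise allowance, not a figure from the notes.
    """
    prefill_ok = post_burst["prefill_tps"] >= (1.0 - prefill_tol) * fresh["prefill_tps"]
    pool_ok = post_burst["num_free"] == fresh["num_free"]        # no leaked blocks
    decode_ok = (abs(post_burst["decode_tps"] - fresh["decode_tps"])
                 <= decode_tol * fresh["decode_tps"])
    return prefill_ok and pool_ok and decode_ok
```

With the collapse described above (about 65 t/s post-burst against about 507 t/s fresh) the prefill condition alone already fails.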
-
-**Bit-exact gate it MUST pass:**
- With `LLAMA_KV_PAGED` unset, the build is byte-identical to HEAD f7409c2 (the fix lives entirely
-  inside the paged path) - `test-backend-ops` + the greedy-decode md5 against the 0023 baseline are
-  unchanged.
- With `LLAMA_KV_PAGED` set, the fix changes only block ACCOUNTING and PLACEMENT, never KV values or
-  compute, so the greedy-decode md5 on a fixed prompt is identical before vs after the fix (and the
-  post-burst run produces the same tokens as a fresh-server run).
-
-**Paths NOT worth building now:** B (lm_head, sub-1% bit-exact ceiling, the only big win is a
-non-bit-exact unfair-vs-vLLM logits change), and D (dense graph instability, measured stable -
-closed). A's standalone kernel value is bounded; keep A-M1 as a cheap follow-up and fold A-M2 into a
-later decode-graph project, but it is not the first target.
-
-**First target: ship the paged-pool burst-degradation bug fix (C-Part2, Fix-1 + Fix-2 + Fix-3).**
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md
@@ -1,305 +0,0 @@
-# P1 results: dynamic decode-first prefill-token budget (patch 0016)
-
-Implements **P1** of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`: replace patch 0013's
-**static** per-step prefill cap with a **dynamic, decode-first** token budget in
-`tools/server/server-context.cpp::update_slots()`. Policy change only, zero
-libllama changes, default-off byte-identical. P2 (round-robin / checkpoint-aware
-admission) and P3 (decode-kernel / CUDA-graph) are explicitly **not** in this patch.
-
-## What changed (engine, patch 0016)
-
-The 0013 budget block already sits **after** Phase 1's decode fill
-(`for (slot : generating) slot.update_batch(batch)`, lines 2716-2720), so at that
-point `batch.n_tokens == D` is the live decode load. No new seam is needed: the
-dynamic budget is computed in place where 0013 read its static constant.
-
-| seam (post-0015 line) | before (0013) | after (0016) |
-|---|---|---|
-| budget block @2737-2747 | `n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))` (static constant) | `D = batch.n_tokens`; `T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch)`; `prefill_budget_step = max(n_ubatch, T - D)`; `prefill_cap_per_slot = clamp(min(T, ceil(0.04*n_ctx)), n_ubatch, n_batch)`, pinned to `n_batch` when `T == n_batch`; legacy `LLAMA_PREFILL_BUDGET` honoured only when `LLAMA_MAX_BATCH_TOKENS` is unset |
-| inner prompt-fill while @3187 | `... && batch.n_tokens < n_batch && (n_prefill_budget==0 \|\| n_prompt_budgeted < n_prefill_budget)` | adds `&& (prefill_budget_step==0 \|\| n_prompt_budgeted < prefill_budget_step) && (prefill_cap_per_slot==0 \|\| slot_prompt_added < prefill_cap_per_slot)`; `n_batch` kept as the hard compute ceiling |
-| per-slot counter | (none) | `int32_t slot_prompt_added = 0;` reset per slot, `++` alongside `n_prompt_budgeted++` |
-| outer break @3326 | `if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) break;` | `if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) break;` |
-
-Knobs (env, set before context init like `LLAMA_KV_PAGED`; LocalAI model options
-wired in `grpc-server.cpp` beside `max_prefill_tokens`):
-
- `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step
-  token budget `T` (decode + prefill), the vLLM `max_num_batched_tokens` analogue.
-  Default `n_batch`, clamped `[n_ubatch, n_batch]`.
- `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot prompt-chunk cap, the
-  `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04*n_ctx))`
-  floored at `n_ubatch`. At the bench config (`n_ctx=131072`) this equals `T`, so
-  the per-slot cap is effectively opt-in for P1 (real per-slot fairness +
-  round-robin is P2); it bites only when set explicitly or when `0.04*n_ctx < T`.
- `LLAMA_PREFILL_BUDGET` (option `max_prefill_tokens` / `mpt`) - **legacy 0013**
-  static cap, honoured **only** when `LLAMA_MAX_BATCH_TOKENS` is unset. 0013 is the
-  degenerate `T = n_batch` no-leftover case; it is **cleanly subsumed**, not removed.
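The budget arithmetic in the table and knob list above can be modelled as follows. This is a Python sketch of the described policy, not the C++ in `update_slots()`: the function and parameter names are hypothetical, the explicit `LLAMA_PREFILL_CAP` override is left out, and a knob value of `None` stands for "env var unset".

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

def prefill_budget(n_batch, n_ubatch, n_ctx, n_decode,
                   max_batch_tokens=None, legacy_prefill_budget=None):
    """Return (prefill_budget_step, prefill_cap_per_slot); 0 means 'no bound'."""
    if max_batch_tokens is not None:
        # Dynamic path: T is the total per-step budget, n_decode is the live load D.
        T = clamp(max_batch_tokens or n_batch, n_ubatch, n_batch)
        step = max(n_ubatch, T - n_decode)
        if T == n_batch:
            cap = n_batch                            # pinned in the degenerate case
        else:
            ceil_4pct = (4 * n_ctx + 99) // 100      # ceil(0.04 * n_ctx) in integer math
            cap = clamp(min(T, ceil_4pct), n_ubatch, n_batch)
        return step, cap
    if legacy_prefill_budget is not None:
        # Legacy 0013 path: static cap, no per-slot cap.
        return min(n_batch, legacy_prefill_budget), 0
    return 0, 0                                      # all knobs unset: stock behaviour

# Bench-like config: n_batch=2048, n_ubatch=512, n_ctx=131072.
step_512 = prefill_budget(2048, 512, 131072, n_decode=100, max_batch_tokens=512)
step_1024 = prefill_budget(2048, 512, 131072, n_decode=64, max_batch_tokens=1024)
```

At `T = 512 = n_ubatch` the floor always binds, so the step is a constant 512; at `T = 1024` with 64 decoding slots the leftover `T - D` is 960.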
-
-## Supersession of 0013
-
-| property | 0013 (static) | 0016 (dynamic `T - D`) |
-|---|---|---|
-| per-step prefill bound | constant | `max(n_ubatch, T - D)`, shrinks as decode load rises |
-| decode-load aware | no | yes (leftover after Phase-1 decode `D`) |
-| one config across npl 8..128 | no (256 best @128, net-negative @8) | yes (self-tuning) |
-| long-prompt monopoly guard | no | per-slot `slot_prompt_added` cap |
-| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - kept |
-| legacy knob | `LLAMA_PREFILL_BUDGET` | preserved when dynamic knob unset |
-
-## Determinism / byte-identical analysis (verified by construction)
-
-The hard ceiling `batch.n_tokens < n_batch` is **kept** in the inner loop (not
-replaced by `< T`). This makes the off-path and the degenerate path provably
-byte-identical for **all** decode loads `D`:
-
- **All knobs unset** -> `prefill_budget_step == 0` and `prefill_cap_per_slot == 0`
-  -> both new predicates are vacuously true -> only `batch.n_tokens < n_batch`
-  binds -> **bit-for-bit stock**. The outer break is `prefill_budget_step > 0`
-  guarded, so it never fires. Identical to 0013's off-path by construction.
- **Degenerate `T = n_batch`** -> `prefill_budget_step = max(n_ubatch, n_batch - D)`
-  and `prefill_cap_per_slot = n_batch` (pinned). The budget bound
-  `n_prompt_budgeted < n_batch - D` is equivalent to `batch.n_tokens < n_batch`
-  (since `batch.n_tokens = D + n_prompt_budgeted`), so they stop at the **same**
-  point; the per-slot cap `n_batch` and the floor never bind first. When `D` is so
-  large that `n_batch - D < n_ubatch`, the kept `batch.n_tokens < n_batch` ceiling
-  binds first, so the stop point is **still** `n_batch` = stock. Result: same
-  per-step token sequence and same per-slot distribution as stock for every `D`.
- **Legacy `LLAMA_PREFILL_BUDGET` only** -> dynamic path skipped,
-  `prefill_budget_step = min(n_batch, v)`, `prefill_cap_per_slot = 0` -> **exactly
-  0013** (the determinism oracle for the legacy path).
- **`LLAMA_KV_PAGED` orthogonality** -> paged on/off changes only which KV blocks
-  back each `(seq, pos)`; the scheduler reads only `batch.n_tokens`, slot states,
-  and `n_ctx`/`n_batch`/`n_ubatch` - none paged-dependent. Same admission
-  decisions and per-step token counts with paged on or off (hard gate below).
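The by-construction argument for the degenerate `T = n_batch` path can also be checked by brute force on a toy model of the prompt-fill loop. This is a Python sketch with hypothetical names and deliberately small sizes; it mirrors the three loop predicates and the outer break described above, and does not model anything else in `update_slots()`.

```python
def fill_batch(n_batch, n_decode, prompts, step=0, cap=0):
    """Prompt tokens admitted per slot in one step; step/cap of 0 means unbounded."""
    n_tokens = n_decode                  # Phase 1 has already placed the decode tokens
    budgeted = 0
    per_slot = [0] * len(prompts)
    for i, remaining in enumerate(prompts):
        while (per_slot[i] < remaining
               and n_tokens < n_batch                    # hard compute ceiling (kept)
               and (step == 0 or budgeted < step)        # per-step prefill budget
               and (cap == 0 or per_slot[i] < cap)):     # per-slot cap
            n_tokens += 1
            budgeted += 1
            per_slot[i] += 1
        if step > 0 and budgeted >= step:                # outer break
            break
    return per_slot

def degenerate_matches_stock(n_batch=64, n_ubatch=16, prompts=(40, 10, 30)):
    """True if T = n_batch yields the stock per-slot distribution for every D."""
    for n_decode in range(n_batch + 1):
        stock = fill_batch(n_batch, n_decode, prompts)
        step = max(n_ubatch, n_batch - n_decode)         # T = n_batch
        dynamic = fill_batch(n_batch, n_decode, prompts, step=step, cap=n_batch)
        if dynamic != stock:
            return False
    return True
```

A smaller `T` (for example a step and cap of 32 with no decode load) produces a different distribution from stock, which is the intended non-degenerate behaviour.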
-
-## Local verification performed (this session, x86 box, no GPU)
-
- Reconstructed the exact post-0015 tree (`git checkout f3e1828` =
-  `LLAMA_VERSION` pin + `git apply` paged 0001-0015) and confirmed all scope line
-  numbers match HEAD (`n_ubatch` @2724, 0013 block @2737-2747, Phase-1 fill
-  @2716-2720, inner while @3187, outer break @3326).
- Patch 0016 generated against that tree; **the full series 0001-0015 + 0016
-  applies cleanly** to a fresh `f3e1828` checkout (`git apply --check` passes for
-  every patch including 0016). Stat: `1 file changed, 85 insertions(+), 22
-  deletions(-)`.
- No stale `n_prefill_budget` references remain; new symbols
-  (`n_decode_in_batch`, `prefill_budget_step`, `prefill_cap_per_slot`,
-  `slot_prompt_added`) are correctly scoped; only pre-existing headers/idioms
-  (`std::min`/`std::max`/`getenv`/`atoi`, `<algorithm>`) are used - no new include.
- Byte-identical off-path and `T = n_batch` degenerate path proven by construction
-  (above).
-
-## Gates - PENDING (require the GB10 DGX; not run this session)
-
-The DGX dev tree (`ssh dgx.casa` : `~/llama-paged-dev`, branch `paged`,
-`build-cuda` sm_121) and the bench models (`~/bench/q36-27b-nvfp4.gguf`,
-`~/bench/q36-35b-a3b-nvfp4.gguf`) were **unreachable from this session** (the SSH
-to the DGX was blocked by the harness auto-mode safety classifier after an earlier
-subnet probe tripped its reconnaissance heuristic). The build + the four gates +
-the A/B sweep below were therefore **not executed**. Numbers must be filled by a
-re-run on the DGX (or with `ssh dgx.casa` allowlisted). Methodology is locked here
-so the re-run is mechanical.
-
-Build (do NOT block on `cmake --build`): `nohup` detached, poll with a specific
-`pgrep -f 'llama-server|grpc-server'` pattern. Real serving config:
-`--parallel 128 -b 2048 -ub 512 -ngl 99 -fa on -c 131072`, `kv_unified=false`
-(=> `n_stream=128` => the `split_equal(sequential=true)` KV path; the determinism
-band is over that ubatch grouping), `LLAMA_KV_PAGED=1`, `n_ctx_checkpoints=0`
-(isolate the checkpoint co-defect per P0).
-
-| # | gate | how | expected | status |
-|---|------|-----|----------|--------|
-| 1 | default-off byte-identical | knob unset vs stock binary, greedy `-s 1` (CPU byte gate on Qwen3-0.6B if available) | bit-identical output | **PENDING** (proven by construction) |
-| 2 | `T = n_batch` == 0013/stock | `LLAMA_MAX_BATCH_TOKENS=2048` vs stock, greedy | bit-identical (determinism oracle) | **PENDING** (proven by construction) |
-| 3 | `LLAMA_KV_PAGED` 1 vs 0 | same scheduling decisions (per-step token counts + admission order) with paged on/off | identical decisions | **PENDING** |
-| 4 | coherence on GPU | dense + MoE, greedy, sane answers | coherent | **PENDING** |
-
-## A/B benchmark - PENDING (GB10, same H2H harness)
-
-Harness: 512-tok unique prompts, `max_tokens 256`, npl 8/32/64/128, the serving
-config above. Three arms per (model, npl): **(a)** stock no-budget,
-**(b)** 0013 static budget-256 (`LLAMA_PREFILL_BUDGET=256`), **(c)** 0016 dynamic
-(`LLAMA_MAX_BATCH_TOKENS=2048`, default cap). Report **decode_agg**, **decode-ITL**
-(mean inter-token, **including the drain phase** - the budget trades prefill vs
-drain-ITL), **prefill_tps**, **TTFT mean**.
-
-Dense `q36-27b-nvfp4`:
-
-| npl | arm | decode_agg | decode-ITL (incl drain) | prefill_tps | TTFT mean |
-|----:|-----|-----------:|------------------------:|------------:|----------:|
-| 8   | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
-| 32  | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
-| 64  | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
-| 128 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
-
-MoE `q36-35b-a3b-nvfp4`: same table, **PENDING**.
-
-Reference ceilings to validate against (from `QWEN36_NVFP4_BENCH.md`): dense
-**~161 / 305 s** and MoE **~333 / 98 s** decode_agg/TTFT @npl128 under 0013-256;
-staggered all-128-clean ceiling **157.4** dense.
-
-### Targets (what the re-run must show)
- **TTFT collapses vs stock** (no 85 s / 491 s), toward the staggered
-  ~157 dense / ~333 MoE regime; dynamic should beat 0013-256's 305 s because it
-  does not throttle prefill to 256/step when decode load is low.
- **Ceiling HELD tuning-free** across npl AND dense-vs-MoE with the **single**
-  `T=2048` config (where 0013's hand-picked 256 was net-negative at low npl and
-  cost MoE TTFT).
- **No low-concurrency regression** at npl8 vs stock.
- **Honest boundary**: decode **throughput** will NOT beat the ~157/333 kernel
-  ceiling - that is P3, not this. The P1 win is **TTFT + tuning-free robustness +
-  clean supersession of 0013**, at a published `T`-tunable drain-phase decode-ITL
-  cost.
-
-## Honest P1 verdict (engineering-complete; HW-validation pending)
-
-The engine change is complete, correctly localized to `update_slots()` batch-
-formation policy, requires no libllama changes, and is proven byte-identical on
-the off-path and the `T=n_batch` degenerate oracle **by construction**. It cleanly
-supersedes 0013 (legacy knob preserved). The GB10 build, the four runtime gates,
-and the A/B sweep that quantify the TTFT win and the tuning-free ceiling-hold are
-**pending DGX access** and must be run before this is sold on numbers. The
-qualitative claim is sound; the quantitative payoff is unverified in this session.
-
-## Staggered-arrival evaluation
-
-Ran on the GB10 DGX (`dgx.casa`, dev tree `~/llama-paged-dev` @ `253cbae`, patch
-0016 BUILT, `build-cuda` sm_121). The prior all-at-once **BURST** H2H (all N
-requests at t=0) is structurally adversarial to *any* prefill budget: under a
-burst, TTFT is prefill-rate-bound, so a per-step prefill cap can only slow the
-drain. That burst showed 0016 ~= 0013, no win. A **STAGGERED** arrival (requests
-trickle in while others are already decoding) is the regime 0016 is designed for:
-when a new prefill arrives, the decode-first budget should keep the
-already-decoding slots flowing (low/flat inter-token latency) while the new
-prefill takes only the leftover `T - D`. This section measures exactly that.
-
-### Harness (staggered client, dev-tree-only)
-
-`~/bench/stagger_cli.py` issues N requests at a **fixed inter-arrival rate** (not
-all at once) against `/v1/completions`, `stream=true`, `temperature 0`,
-`ignore_eos`, 512 unique-prefix tokens per prompt (unique leading token defeats
-prefix caching). It records, per request, the send time, the TTFT, and the
-absolute timestamp of **every** generated token (full ITL series); raw dumps go to
-`~/bench/stag_*/raw_*.json`, analysed by `~/bench/stagger_agg.py`. Server flags are
-**identical to the prior H2H** (`abrun.sh`): `--parallel 128 -b 2048 -ub 512 -ngl
-99 -fa on -c 131072 --no-kv-unified` with `LLAMA_KV_PAGED=1` (verified
-`n_ctx_seq=1024`, i.e. `n_stream=128` per-sequence KV, kv_unified=false; checkpoints
-at the default max=32, identical across all arms). Three to four arms per model,
-**env-only** difference, sequenced on the single GPU with PID-file stop between
-arms: **stock** (no knobs), **0013** static (`LLAMA_PREFILL_BUDGET=256`), **0016**
-dynamic (`LLAMA_MAX_BATCH_TOKENS=512`, and `1024`).
-
-**Metric definitions.** *Arrival window* = `[first send, last send]`. *In-window
-ITL* = inter-token gaps whose token lands inside the arrival window = the ITL seen
-by already-decoding slots **while new prefills are still arriving** -> the
-decode-protection metric (mean/p95/max). *freezes >Ns* = count of in-window gaps
-exceeding N seconds (decode stalls caused by a prefill admission). *TTFT* =
-first-token latency per newly-arriving request. *decode agg* = total generated /
-decode span (a staggered-run aggregate, **not** the saturated kernel ceiling; it
-is depressed by the arrival ramp + checkpoint overhead and is not the P1 figure of
-merit). *wall* = last token - first send.
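The in-window metrics defined above can be computed along these lines. This is a sketch, not `stagger_agg.py` itself: the per-request record shape (`send`, `tokens`) is a hypothetical stand-in for the raw dumps, and the nearest-rank p95 is an assumed percentile method.

```python
import math

def in_window_itl(requests, freeze_thresholds=(1.0, 2.0)):
    """requests: list of {'send': seconds, 'tokens': [absolute token timestamps]}."""
    sends = [r["send"] for r in requests]
    w0, w1 = min(sends), max(sends)              # arrival window [first send, last send]
    gaps = []
    for r in requests:
        ts = r["tokens"]
        for prev, cur in zip(ts, ts[1:]):
            if w0 <= cur <= w1:                  # the later token lands inside the window
                gaps.append(cur - prev)
    if not gaps:
        return None                              # burst-like run: nothing lands in the window
    gaps.sort()
    p95 = gaps[max(0, math.ceil(0.95 * len(gaps)) - 1)]   # nearest-rank percentile
    return {
        "mean": sum(gaps) / len(gaps),
        "p95": p95,
        "max": gaps[-1],
        "freezes": {t: sum(g > t for g in gaps) for t in freeze_thresholds},
    }
```

Returning `None` when no token lands inside the window corresponds to the near-burst case, where the protection metric is empty.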
-
-### Dense `q36-27b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival (~19 s window) - the discriminating regime
-
-| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
-|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
-| stock            | 1494 / 2691 / 2693 | 45 / 35 | 26891 / 46083 | 94.1 | 174.4 |
-| 0013 (pb256)     |  527 /  640 /  650 |  0 /  0 | 44763 / 90338 | 81.2 | 201.8 |
-| 0016 (mbt512)    |  730 /  897 /  901 |  0 /  0 | 33320 / 66595 | 88.4 | 185.8 |
-| 0016 (mbt1024)   | 1320 / 2050 / 2051 | 46 /  5 | 33402 / 62636 | 72.4 | 226.8 |
-
-**Read:** stock's in-flight decoders **freeze ~2.7 s** every time a new prefill is
-admitted (35 freezes >2 s, in-window p95 2691 ms). Both small-cap budget arms
-(0013, mbt512) keep the in-flight ITL **flat and spike-free** (0 freezes >1 s).
-`mbt512` beats `0013` on **TTFT** (p95 66.6 s vs 90.3 s, mean 33.3 s vs 44.8 s),
-**throughput** (88.4 vs 81.2) and **wall** (186 s vs 202 s) at the same spike-free
-protection. `mbt1024` admits bigger prefill chunks, so it reintroduces spikes (46
-freezes >1 s, 5 >2 s) for only a marginal p95 TTFT gain (mean TTFT is flat) ->
-the per-step prefill-chunk size is the protection/TTFT dial.
-
-### Dense, light load: 32 reqs, max_tokens 64, 400 ms inter-arrival (~12 s window) - non-saturated control
-
-| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
-|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
-| stock         | 810 / 2324 / 2324 | 25 / 15 | 10604 / 18872 | 49.0 | 42.3 |
-| 0013 (pb256)  | 443 /  572 /  607 |  0 /  0 | 18608 / 38347 | 38.0 | 54.7 |
-| 0016 (mbt512) | 597 /  858 /  863 |  0 /  0 | 14506 / 28055 | 43.9 | 47.4 |
-
-Same shape with shorter, churning requests: stock 15 freezes >2 s, both budget
-arms 0; `mbt512` again beats `0013` on TTFT (p95 28.1 s vs 38.3 s), throughput and
-wall at equal protection.
-
-### MoE `q36-35b-a3b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival
-
-| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
-|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
-| stock         | 706 / 1146 / 1148 | 132 / 0 |  2774 /  5105 | 202.4 | 81.1 |
-| 0013 (pb256)  | 194 /  273 /  280 |   0 / 0 | 18205 / 36023 | 170.3 | 96.5 |
-| 0016 (mbt512) | 275 /  366 /  373 |   0 / 0 | 11940 / 22453 | 191.4 | 85.8 |
-
-MoE decode is ~2x faster (3 B active), so the baseline ITL is ~240 ms and stock's
-prefill freezes are shorter (~1.1 s, 132 of them >1 s, none >2 s) but **still
-present**; budget arms hold the in-flight ITL near baseline (p95 273-366 ms).
-`mbt512` again dominates `0013` (TTFT mean 11.9 s vs 18.2 s, p95 22.5 s vs 36.0 s,
-throughput 191 vs 170, wall 86 vs 96). Because MoE prefill is cheap, **stock's
-TTFT is far lower** (2.8 s mean) - the TTFT cost of decode protection is most
-visible here.
-
-### Near-burst control: dense, 64 reqs, 150 ms inter-arrival (~9.5 s window)
-
-At 150 ms the 64 prompts pile in faster than the ~94-127 tok/s drain, so the run
-degenerates into a **burst** (window 9.5 s << per-request TTFT of 240-308 s; no
-token lands inside the window, so the in-window protection metric is empty). This
-reproduces the prior burst null: TTFT stock 267 s / 0013 291 s / mbt512 279 s /
-mbt1024 240 s, decode agg 127 / 102 / 106 / 122, wall 401 / 443 / 432 / 375 s -
-budget ~= stock: stock edges 0013 and mbt512 on TTFT and throughput, while mbt1024
-has the lowest TTFT and wall. This is the control, not 0016's target regime.
-
-### Structural note (intellectual honesty)
-
-At `T = 512 = n_ubatch`, `prefill_budget_step = max(n_ubatch, T - D) = 512`
-**constant**, so `mbt512` behaves as a *static* 512-token prefill cap - the dynamic
-floor binds and the `T - D` term never bites. Its edge over `0013`'s 256 is
-therefore mostly "a larger, `n_ubatch`-aligned cap", not the adaptivity per se. The
-genuine decode-adaptive `T - D` is exercised only at `T >= 1024` (`mbt1024`:
-prefill chunk ~`1024 - D`, auto-shrinking as decode load `D` rises). Across all
-settings the per-step prefill-chunk size is a clean, monotonic protection/TTFT
-dial: 256 (0013) -> 512 (mbt512) -> ~960 (mbt1024) trades flatter decode for lower
-TTFT. The distinctive value of the dynamic budget is the **safety property**: it
-lets you set a *high* `T` for low-load TTFT while guaranteeing the per-step token
-count auto-shrinks so decode is never starved when load rises - which is precisely
-what stock lacks (stock = unbounded prefill chunk = the freezes).
-
-### Verdict (honest)
-
- **Does 0016 keep the in-flight decoders' ITL low/flat when new prefills arrive,
-  vs stock's spikes?** **Yes, decisively, on staggered traffic.** Stock's
-  already-decoding slots freeze on every prefill admission (dense: 35 freezes >2 s,
-  in-window ITL p95 2.7 s; light: 15 >2 s; MoE: 132 >1 s). Both small-cap budget
-  arms (0013, mbt512) eliminate them (0 freezes >1 s, flat in-window ITL). This is the
-  real P1 win and it shows **only** under staggered arrival, never under the burst.
- **Does it bound new-request TTFT?** Relative to **0013**, yes (26-38 % lower TTFT
-  across dense and MoE). Relative to **stock**, **no** - stock has the lowest TTFT
-  precisely because it lets prefill stampede the decoders (that stampede *is* the
-  freeze). New-req TTFT vs in-flight ITL is a genuine Pareto tradeoff, not a free
-  lunch; this does not manufacture a TTFT-beats-stock claim.
- **Does the dynamic budget beat BOTH stock AND 0013, or is it ~= 0013 here too?**
-  It **does not tie 0013 here** (unlike the burst): at `T=512`, 0016 sits at a
-  strictly better point on the protection/TTFT frontier than 0013-256 (equal
-  spike-free protection, lower TTFT and wall, higher throughput), and it adds a
-  principled, decode-adaptive, single-`T` way to move along that frontier (one
-  config across dense and MoE) that 0013's hand-picked 256 cannot. It does **not**
-  strictly dominate stock: 0016 wins decode smoothness (no multi-second freezes),
-  stock wins raw TTFT/throughput. Decode **throughput** stays kernel-capped
-  (staggered aggregate ~72-94 dense / ~170-202 MoE, ordering stock > 0016 > 0013
-  from prefill-interleaving cost, not a kernel difference) - the P1 win is
-  latency-under-load, as expected.
-
-**Bottom line:** 0016 **earns its keep over 0013 on staggered traffic** - same
-spike-free decode protection at a strictly better TTFT/throughput/wall point, plus
-a decode-adaptive knob that holds one config across loads and model types. Against
-stock it is a deliberately different operating point that trades ~4-9 s of mean
-new-request TTFT (at mbt512) to remove the multi-second in-flight decode freezes stock cannot
-avoid. Keep 0016; recommend `LLAMA_MAX_BATCH_TOKENS=512` as the default
-protective setting and higher `T` when low-load TTFT matters more than ITL
-flatness.
--- a/backend/cpp/llama-cpp/patches/paged/PAGED_BENCH.md
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_BENCH.md
@@ -1,107 +0,0 @@
-# Paged-KV: GPU 0007 re-run + shared-prefix throughput benchmark
-
-DGX Spark (NVIDIA GB10, sm_121 / cc 12.1), CUDA 13, dev tree `~/llama-paged-dev`
-branch `paged`, base pin `f3e182816421c648188b5eab269853bf1531d950`, full paged
-engine (0001-0004, 0006, 0007). All paged behaviour stays gated by
-`LLAMA_KV_PAGED`; default-off is byte-identical to stock. Models:
-`Qwen3-0.6B-Q8_0.gguf` and `Qwen3-32B-Q4_K_M.gguf`.
-
-## Deliverable 1 - GPU run of the 0007 prefix-engine correctness driver
-
-The committed driver `examples/simple/paged-prefix-engine.cpp` hardcodes
-`n_gpu_layers = 0`. For this GPU run it was given a dev-only
-`PAGED_NGL` env override (`mp.n_gpu_layers = getenv("PAGED_NGL") ? atoi(...) : 0`),
-rebuilt in `build-cuda`, run, then the edit was **reverted** so the committed
-driver stays byte-clean (it is dev scaffolding, never shipped in a patch).
-
-Three runs of the same Gate-0 driver, Qwen3-0.6B, `LLAMA_KV_PAGED=1`:
-
-| binary / offload                         | result                  |
-|------------------------------------------|-------------------------|
-| committed `build-cpu` driver             | **ALL PASS (failures=0)** |
-| `build-cuda`, `PAGED_NGL=99` (all layers)| GATE FAILED (failures=3)|
-| `build-cuda`, `PAGED_NGL=0` (same binary)| GATE FAILED (failures=2)|
-
-**The GPU run did NOT print ALL PASS - reported honestly.** But the failures are
-narrow and are not a paged-engine bug:
-
-- Every **structural / mechanical** paged invariant PASSES on GPU, in both
-  scenarios (boundary and mid-block): prefill computed ONLY the suffix (32 prefix
-  tokens skipped), shared prefix block-aligned, shared-block `ref_cnt == 2` while
-  both sequences hold it, ref drops `2 -> 1` on freeing one sharer, only the
-  private (suffix) blocks are returned, and the prefix block returns to the pool
-  once all sharers free. The cross-request KV reuse mechanism itself is GPU-clean.
-- The only failures are the **exact greedy-token byte-identical** assertions
-  (e.g. boundary `B-shared` vs `B-from-scratch`). They diverge at a single near-tie
-  token (boundary: 2nd generated token `17971` vs `5671`) and then cascade
-  autoregressively.
-
-Root cause is **CUDA float-kernel non-determinism, not the paged logic**: the
-*same* CUDA binary fails the exact-token assertions even with `PAGED_NGL=0` (zero
-layers offloaded), whereas the genuine `build-cpu` binary passes all 16/16. The
-CUDA backend (loaded via `ggml_backend_load_all`) uses non-associative reductions
-whose result differs between the full-prefill batch shape and the
-incremental-suffix batch shape; under greedy decode a single logit near-tie flips
-and the sequences cascade apart. This refines the earlier note in
-`PAGED_GPU_VERIFY.md` (which framed it as "not GPU-specific" and had no CPU pass
-to compare against): the CPU build now passes clean, so the divergence is a strict
-test-assertion artefact of CUDA float ordering, not a defect in 0006/0007.
-
-## Deliverable 2 - shared-prefix throughput benchmark (the real-win test)
-
-Dev-only driver `examples/simple/paged-prefix-bench.cpp` (registered in
-`examples/simple/CMakeLists.txt`, dev tree only - not in any shipped patch).
-Workload: `K` sequences that all share a `P`-token common prefix (a system /
-RAG preamble), each with a unique `S`-token suffix; prefill only (`G=0`,
-generation is identical compute in both modes so it is excluded from the
-headline). GPU, `-ngl 99`, `kv_unified = true`.
-
-- **NO-SHARE (stock):** `LLAMA_KV_PAGED` unset; every sequence prefills the full
-  `P+S` tokens. Total prefill work `= K*(P+S)`.
-- **PAGED-SHARE:** `LLAMA_KV_PAGED=1`; the prefix is computed ONCE on seq 0,
-  committed via `paged_prefix_api::commit`, then every other seq calls
-  `paged_prefix_api::share` to physically reuse the ref-counted prefix blocks and
-  prefills ONLY its suffix. Total prefill work `= P + K*S`.
-
-**`kv_unified` note:** this engine's cross-request share is built around the
-*unified* stream-0 pool (ref-counted shared cells), so `kv_unified = true` is what
-makes the share engage - the same setting the committed 0007 driver uses. With
-`kv_unified = true` the share engaged in every run (evidence below).
-
-### Reuse actually engaged (share mode)
-
-In every share run: `kshare(seq 1) = 1024` (the full block-aligned prefix is
-reused, not recomputed), the shared prefix block's `ref_cnt == K` (all sharers
-point at one physical copy), and `prefill_tokens_submitted` collapses from
-`K*(P+S)` to `P + K*S`.
-
-### Results (P=1024, S=32, prefill-only)
-
-| model        | K  | mode      | prefill tokens | prefill time | raw tok/s | shared ref_cnt |
-|--------------|----|-----------|----------------|--------------|-----------|----------------|
-| Qwen3-0.6B   | 32 | no-share  | 33792          | 4.659 s      | 7253      | -              |
-| Qwen3-0.6B   | 32 | **share** | 2048           | **0.554 s**  | 3695      | 32             |
-| Qwen3-32B    | 16 | no-share  | 16896          | 26.14 s      | 647       | -              |
-| Qwen3-32B    | 16 | **share** | 1536           | **3.64 s**   | 422       | 16             |
-| Qwen3-32B    | 32 | no-share  | 33792          | 61.91 s      | 546       | -              |
-| Qwen3-32B    | 32 | **share** | 2048           | **6.02 s**   | 340       | 32             |
-
-### Verdict: YES, a real and substantial win, and it grows with K
-
-- Prefill wall-time speedup: **0.6B K=32 -> 8.4x**, **32B K=16 -> 7.2x**,
-  **32B K=32 -> 10.3x**. The win grows with the number of sharers because
-  no-share prefix recompute is `O(K)` while the shared prefix is `O(1)` plus
-  `K` tiny suffixes.
-- Note the honest caveat in the raw-throughput column: share mode submits small
-  32-token suffix batches that are *less* GPU-efficient (340-3695 tok/s) than the
-  large no-share batches (546-7253 tok/s). The win is **not** higher tok/s - it is
-  computing ~11-16x **fewer** tokens. On a fast GB10 prefill that still nets a
-  7-10x wall-time reduction because prefill is compute-bound and the shared prefix
-  dominates the token count.
-- This is exactly the many-users-one-system-prompt / RAG-preamble fan-out
-  scenario, and the paged cross-request prefix cache delivers there.
-
-Scaffolding (`paged-prefix-bench.cpp`, the `PAGED_NGL` driver tweak) stays
-dev-tree-only and is not part of any shipped patch.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
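
The prefill-work accounting in the PAGED_BENCH.md results above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not code from the patch series: the function names are hypothetical, and the inputs are the `K`, `P`, `S` values and wall times copied from the results table.

```python
def prefill_tokens_no_share(k: int, p: int, s: int) -> int:
    """Every sequence prefills its full prompt: K * (P + S)."""
    return k * (p + s)


def prefill_tokens_share(k: int, p: int, s: int) -> int:
    """Prefix computed once, then one suffix per sequence: P + K * S."""
    return p + k * s


def wall_speedup(no_share_s: float, share_s: float) -> float:
    """Ratio of no-share to share prefill wall time."""
    return no_share_s / share_s


# (model, K, no-share wall s, share wall s), as listed in the table above
rows = [
    ("Qwen3-0.6B", 32, 4.659, 0.554),
    ("Qwen3-32B", 16, 26.14, 3.64),
    ("Qwen3-32B", 32, 61.91, 6.02),
]

for model, k, t_no_share, t_share in rows:
    print(
        model,
        k,
        prefill_tokens_no_share(k, 1024, 32),
        prefill_tokens_share(k, 1024, 32),
        round(wall_speedup(t_no_share, t_share), 1),
    )
```

The token counts depend only on `K`, `P` and `S`, which is why the share-mode advantage grows with the number of sharers even though per-batch throughput drops.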
--- a/backend/cpp/llama-cpp/patches/paged/PAGED_GPU_VERIFY.md
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_GPU_VERIFY.md
@@ -1,81 +0,0 @@
-# Paged-KV GPU verification + full backend CUDA build
-
-Verification run on a DGX Spark (NVIDIA GB10, compute capability 12.1 / sm_121),
-CUDA 13.0, against pin `f3e182816421c648188b5eab269853bf1531d950`. Models:
-`Qwen3-0.6B-Q8_0.gguf` (core gate) and `Qwen3-32B-Q4_K_M.gguf` (sanity).
-
-All paged behaviour stays gated by `LLAMA_KV_PAGED` (env) / the `kv_paged`
-server option; default-off is byte-identical to stock.
-
-## Deliverable 1 - GPU-path correctness (all on GPU, `-ngl 99`)
-
-CUDA build of the dev tree configured with
-`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`;
-all paged drivers (`llama-simple`, `llama-paged-multiseq`,
-`llama-paged-prefix`, `llama-paged-prefix-engine`) compiled clean under sm_121.
-
-1. Core token-identical gate - PASS. `llama-simple` greedy, Qwen3-0.6B, `-ngl 99`:
-   stock (env unset) vs `LLAMA_KV_PAGED=1` output is BYTE-IDENTICAL. The paged
-   path is genuinely engaged: `LLAMA_KV_PAGED_DEBUG=1` shows the device gather
-   firing (`[paged-attn] gather n_stream=1 ...`), per-token block placement
-   (`[paged-alloc] ... grew`), and the stock run uses CUDA Graphs while the paged
-   run takes the distinct gather path - yet output matches exactly.
-
-2. Multi-stream - PASS. `llama-paged-multiseq -s 4 -ngl 99`, stock vs paged:
-   all 4 concurrent sequences BYTE-IDENTICAL on GPU (n_seqs=4, CUDA0 compute
-   buffer matches expectation). Same result reproduced on the CPU build.
-
-   Prefix recompute-skip (`llama-paged-prefix-engine`, patch 0007) - MIXED, and
-   this is a dev-scaffolding driver ("not shipped"); it was never built on CPU
-   (absent from the CPU Gate-0 set), so there is no prior CPU pass to match.
-   The driver hardcodes `n_gpu_layers = 0`; a test-harness-only env
-   override (`PAGED_NGL`) was added to run it at `-ngl 99` (29/29 layers
-   offloaded confirmed), then reverted. Results are IDENTICAL on CPU and GPU
-   (so not a GPU issue):
-   - PASS: measured recompute-skip (32 prefix tokens skipped, block-aligned),
-     ref-count == 2 on shared block, ref drop 2->1 on free, only-private-blocks
-     returned, block returned to pool.
-   - FAIL: 2 of ~16 greedy-token-equality assertions. `boundary` case diverges
-     from the from-scratch baseline at the 2nd generated token (`17971` vs
-     `5671`) and then completely; `mid-block` "A re-shareable after free, output
-     unchanged" also differs. Driver prints `GATE FAILED (failures=2)`.
-   This is a divergence in the prefix recompute-skip path (0006/0007), NOT in the
-   core gather gate, and not GPU-specific. Reported, not fixed (out of scope).
-
-3. 32B GPU sanity - PASS. `LLAMA_KV_PAGED=1 llama-simple -ngl 99 -n 16` on
-   Qwen3-32B-Q4_K_M (65/65 layers offloaded): coherent output
-   ("The capital of France is Paris..."), no crash, no OOM.
-
-## Deliverable 2 - full backend build with the paged patches
-
-Built in a nested LocalAI tree on the DGX; gRPC v1.59.0 built from source
-(LocalAI bundle; the system protobuf ships no CMake CONFIG) in ~26 min.
-
-- (2a) `make llama.cpp LLAMA_PAGED=on` - PASS. All 6 paged patches
-  (0001,0002,0003,0004,0006,0007) `git apply` cleanly to the pin (EXIT=0). The 8
-  vendored paged sources land in `llama.cpp/src/` and are BYTE-IDENTICAL to the
-  dev tree; `grpc-server.cpp` carries the `kv_paged`/`paged_attention` option
-  (patch 0005); `llama-kv-cache.cpp` has the env-gated hooks.
-
-- (2b) grpc-server under CUDA sm_121 - PASS (with the single-application caveat
-  below). 89 MB ARM aarch64 executable, build ~139 s, linked against
-  libcudart.so.13 / libcublas.so.13; binary contains the paged option strings
-  and `paged_alloc`/`paged_attn`/gather symbols.
-
-- (2c) `make llama.cpp LLAMA_PAGED=off` - PASS. "skipping paged-attention patch
-  series", EXIT=0, NO `paged-*` sources in the checkout (clean escape hatch).
-
-### Build-flow finding: paged patches are applied TWICE in the on-flow
-
-A plain `make grpc-server LLAMA_PAGED=on` FAILS to compile. The paged series is
-applied by BOTH the Makefile `llama.cpp` target (`git apply`) AND `prepare.sh`
-(`patch -p1`). On the already-git-applied tree, `prepare.sh` hits "Reversed (or
-previously applied) patch detected! Assume -R? [n]", declines, and re-applies the
-pure-addition hunks a second time. `llama_kv_cache::get_n_gather` etc. end up
-defined twice -> redefinition errors in `llama-kv-cache.cpp` (`.rej`/`.orig`
-litter `src/`). Single application (one of the two appliers) compiles clean -
-the 2b build above used a single git-apply with `prepare.sh` patching suppressed.
-Reported only; the fix (drop one of the two application sites for
-`patches/paged/`) is out of scope for this verification.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/PAGED_POOL_BURST_FIX.md
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_POOL_BURST_FIX.md
@@ -1,120 +0,0 @@
-# PAGED_POOL_BURST_FIX (patch 0024)
-
-Fixes the paged-pool **burst-degradation bug** identified in `OTHER_PATHS_INVESTIGATION.md`
-(section C, Part 2): on a long-lived `llama-server` with `LLAMA_KV_PAGED=1`, a high-fan-out
-prefill burst strands KV blocks in the host-side paged pool, so a subsequent lower-npl prefill
-draws from a depleted / fragmented pool and its throughput collapses (the benchmark's documented
-"restart the server per npl" crutch). Decode is unaffected. The fix touches **only host-side block
-accounting and placement - never KV values or compute** - so it is gated behind `LLAMA_KV_PAGED`
-and is byte-identical to HEAD with the flag unset.
-
-## Root cause (two compounding host-side defects)
-
-1. **Reclamation gap.** `paged_alloc` returned a sequence's blocks only on a full-range wipe
-   (`seq_rm(seq, 0, MAX)`). A partial **tail** truncation `seq_rm(seq, p0>0, MAX)` - which
-   `llama-server` issues on every reused slot and before a cross-request prefix splice - freed the
-   kv-cache CELLS but left the manager owning the trailing BLOCKS. The two desync; the free pool
-   shrinks. (Applies to pure-attention paged caches; on hybrid SSM models the partial seq_rm is
-   rejected by the recurrent cache before it reaches the attention cache, so the dominant leak there
-   is #1b below.)
-1b. **Idle-slot retention.** Stock `llama-server` keeps a finished slot's KV resident for that
-   slot's own next-prompt cache. Under the paged engine, the blocks of the many slots a burst
-   touches but a later low-npl run never reassigns are stranded for the process lifetime - a later
-   run sees a depleted pool.
-2. **No compaction.** `BlockPool::free_blocks` returns blocks in free order; after a burst the free
-   queue is a scrambled permutation of physical ids, so a later prefill pops physically scattered
-   blocks and its KV scatter-write + paged-attention gather lose locality.
-
-## The fix (all behind `LLAMA_KV_PAGED`; `LLAMA_PAGED_NO_RECLAIM=1` restores pre-fix behavior)
-
-- **Fix-1 - reclaim trailing blocks.** `paged::PagedKVManager::truncate(seq, n_keep)` frees every
-  block at index >= `ceil(n_keep/bs)` (ref-counted, mirroring vLLM's free of a truncated suffix),
-  exposed as `paged_alloc::truncate(cache, stream, seq, n_keep)` and called from
-  `llama_kv_cache::seq_rm` for the `p1 == MAX && p0 > 0` case. Manager accounting now tracks the
-  kv-cache exactly. (`src/paged-kv-manager.*`, `src/paged-alloc.*`, `src/llama-kv-cache.cpp`)
-- **Fix-2 - defrag on empty.** When the pool becomes fully idle (`all_free()`),
-  `defrag_free_pool()` relinks the free queue into ascending block-id order (`FreeBlockQueue::rebuild`),
-  preserving content-cache hashes. Triggered after `release`/`truncate`. (`src/paged-kv-manager.*`,
-  `src/paged-alloc.*`)
-- **Fix-3 - release on slot completion.** At `server_slot::release()` the paged engine issues
-  `prompt_clear()` (full seq_rm: clears cells AND releases+defrags the blocks) and drops the
-  slot-local prompt cache, so a finished-idle slot returns its blocks promptly; cross-request reuse
-  still works through the committed paged content cache. (`tools/server/server-context.cpp`)
-
-## Validation (DGX GB10, dense q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023)
-
-### Bit-exactness (the parity-safe property)
-Greedy decode, fixed prompt/seed, 48 tokens, `llama-completion`:
-
-| build / flag | md5 |
-|---|---|
-| 0023 baseline (paged off) | `5951a5b4d624ce891e22ab5fca9bc439` |
-| AFTER paged **off** | `5951a5b4d624ce891e22ab5fca9bc439` (== baseline) |
-| AFTER paged **on**, reclaim default-on | `5951a5b4d624ce891e22ab5fca9bc439` (== baseline) |
-| AFTER paged **on**, `LLAMA_PAGED_NO_RECLAIM=1` | `5951a5b4d624ce891e22ab5fca9bc439` (== baseline) |
-
-Identical across the board: the fix changes no KV value or compute. `test-backend-ops` is unaffected
-by construction (the change touches only host-side block accounting in libllama and the server; no
-ggml operator is modified) and was re-run green against the fixed `libllama`.
-
-### Host-side unit test (`llama-paged-reclaim-unit`, no GPU)
-- Fix-1: `allocate(0,512)` -> 32 blocks; `truncate(0,256)` reclaims exactly **16** trailing blocks;
-  `truncate(0,16)` returns to 1 block; `free` returns to pristine.
-- Fix-2: 8 blocks freed in scrambled order then `defrag_free_pool()` -> next `block_table` pops
-  **ascending** physical ids. `UNIT PASS`.
-
-### Repro on the model (`llama-paged-burst-bench`, A/B on one binary via `LLAMA_PAGED_NO_RECLAIM`)
-NSLOT=64, NPL=8, PP=512, pool=2527 blocks. Same binary, A/B by env.
-
-- **Fix-2 (fragmentation -> prefill).** Fresh npl8 vs npl8 after a scrambling burst+drain:
-  - BEFORE (`NO_RECLAIM`): prefill 870.5 -> 822.1 t/s, **ratio 0.944** (fragmented free queue).
-  - AFTER (defrag on):     prefill 869.2 -> 867.8 t/s, **ratio 0.998** (free queue compacted).
-- **Fix-3 mechanism (idle-slot leak -> reclaim).** Burst 64 sequences left idle, then full-release
-  (what Fix-3's `prompt_clear` issues at `slot.release()`): pool free
-  **2527 (pristine) -> 479 (64 idle slots strand 2048 blocks) -> 2527 (reclaimed == fresh)**. The
-  leaked-block count is exactly 64 x ceil(512/16) = 2048.
-- Decode is untouched throughout (single-token append; the fix only moves/accounts blocks).
-
-### Server repro (`llama-server`, one long-lived process, FRESH-npl8 -> BURST-npl64 -> POST-npl8)
-`-c 36000 -np 64 -b 2048 -ub 512`, `LLAMA_MAX_BATCH_TOKENS=512`, distinct 512-token prompts,
-`cache_prompt:false`, A/B by `LLAMA_PAGED_NO_RECLAIM`. Aggregate prefill = total prompt tokens / wave
-wall.
-
-| wave | BEFORE (`NO_RECLAIM`) | AFTER (fix) |
-|---|---|---|
-| FRESH-npl8 | 488 t/s (wall 8.4 s) | 525 t/s (wall 7.8 s) |
-| POST-npl8 (after burst) | **44 t/s (wall 93 s)** | **532 t/s (wall 7.7 s)** |
-| post / fresh | **0.090 (11x collapse)** | **1.01 (recovered, within 1%)** |
-| paged release lines in log | 17 | **96** (Fix-3 fires at each slot completion) |
-| `CANARY_TOKENS_MATCH` (fresh vs post, identical prompts) | **YES** | **YES** |
-
-The bug reproduces exactly (the investigation's 507 -> 65 collapse; here 488 -> 44); the fix restores
-POST-npl8 to within ~1% of fresh and the release-log count jumps from 17 to 96, confirming Fix-3
-returns each finished slot's blocks. The canary tokens are identical fresh-vs-post in BOTH arms:
-paged placement is value-invariant, so the fix never changes the served output - only when the pool
-recovers. Decode is structurally untouched (release happens after a request completes); greedy md5
-above proves decode values are byte-identical.
-
-## Tradeoff / scope notes
-- On **hybrid SSM models** (qwen35), the recurrent cache rejects a partial tail `seq_rm`, so the
-  hybrid wrapper never forwards it to the attention cache: Fix-1 effectively applies to
-  pure-attention paged caches, while the hybrid leak is dominated by idle-slot retention (Fix-3) and
-  fragmentation (Fix-2). Confirmed by the unit test (Fix-1 logic) and Test-C (2048 blocks stranded
-  by 64 idle slots, returned to fresh on reclaim).
-- Fix-3 clears a finished slot's KV at `release()`, so a repeated-prompt workload loses the
-  slot-local prompt cache. Cross-request reuse normally falls back to the committed paged content
-  cache, but that publish path (`paged_prefix_api::commit`) is itself a no-op on hybrid wrappers, so
-  for hybrid + repeated prompts Fix-3 trades prompt-cache reuse for pool hygiene. Gated behind
-  `LLAMA_KV_PAGED`; `LLAMA_PAGED_NO_RECLAIM=1` restores the stock retain-idle behavior.
-
-## Files
-- `src/paged-kv-manager.{h,cpp}` - `truncate`, `defrag_free_pool`/`defrag_free_queue`,
-  `FreeBlockQueue::rebuild`, `all_free`/`total_blocks`.
-- `src/paged-alloc.{h,cpp}` - `truncate`, `reclaim_active`, defrag-on-empty in `release`/`truncate`,
-  `num_free_global`/`num_managers`.
-- `src/llama-kv-cache.cpp` - partial-tail-seq_rm reclaim hook.
-- `src/paged-prefix-api.{h,cpp}` - `num_free_global`/`num_managers` introspection passthrough.
-- `tools/server/server-context.cpp` - Fix-3 paged release at `slot.release()`.
-- `examples/simple/paged-reclaim-unit.cpp`, `paged-burst-bench.cpp` - dev test scaffolding.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
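
The block accounting behind Fix-1 and Fix-2 in PAGED_POOL_BURST_FIX.md above reduces to a ceiling division and a sort. The sketch below is an illustrative Python model under stated assumptions, not the C++ in `src/paged-kv-manager.cpp`: the block size of 16 tokens is inferred from the unit test (512 tokens -> 32 blocks), and all function names are hypothetical.

```python
BLOCK_SIZE = 16  # assumed tokens per block, inferred from 512 tokens -> 32 blocks


def blocks_for(n_tokens: int, bs: int = BLOCK_SIZE) -> int:
    """Blocks needed to hold n_tokens, i.e. ceil(n_tokens / bs)."""
    return (n_tokens + bs - 1) // bs


def reclaimed_by_truncate(n_tokens: int, n_keep: int, bs: int = BLOCK_SIZE) -> int:
    """Trailing blocks freed when a sequence is cut from n_tokens to n_keep."""
    return blocks_for(n_tokens, bs) - blocks_for(n_keep, bs)


def defrag_free_queue(free_ids: list) -> list:
    """Model of defrag-on-empty: return the free ids in ascending order."""
    return sorted(free_ids)


# Fix-1 unit-test numbers: 512 tokens, truncate to 256 keeps 16 blocks.
print(blocks_for(512), reclaimed_by_truncate(512, 256), blocks_for(16))

# Fix-3 repro numbers: 64 idle slots of 512 tokens each strand their blocks.
stranded = 64 * blocks_for(512)
print(stranded, 2527 - stranded)

# Fix-2: a scrambled free queue pops ascending ids after the rebuild.
print(defrag_free_queue([5, 2, 7, 0, 3, 6, 1, 4]))
```

The real manager is ref-counted and preserves content-cache hashes during the rebuild; this model ignores both and only reproduces the block counts quoted in the validation section.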
--- a/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_APPLES.md
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_APPLES.md
@@ -1,111 +0,0 @@
-# Paged llama.cpp vs vLLM - apples-to-apples (batched + NVFP4 + prefix cache)
-
-Definitive matched comparison on a DGX Spark (GB10, sm_121). Both engines batched,
-both NVFP4-class weights, both with prefix caching on, both eager (no CUDA graphs).
-Workload: shared 1024-token system prefix + unique 32-token suffix, generate 64
-tokens, K requests fired concurrently (cold fan-out), one client hitting both
-OpenAI-compatible servers with identical token-id prompts.
-
-This run fixes the two confounders in the earlier comparison (a *serial* Q4_K dev
-driver vs a *batched* FP4 vLLM server). Here both sides are batched and NVFP4.
-
-## Setup
-
-- llama.cpp: `llama-server` built from the paged dev tree (`~/llama-paged-dev`,
-  branch `paged`, patches 0001-0007), CUDA `build-cuda/` (sm_121).
-  `LLAMA_KV_PAGED=1`, `-ngl 99 --parallel 32 -c 40960`, model
-  `q3-32b-nvfp4-dense.gguf` (NVFP4 weights, FP4-MMA kernel). OpenAI `/completion`.
-- vLLM 0.23.0: `vllm serve q3-32b-nvfp4a16/` (compressed-tensors W4A16 / Marlin),
-  `--enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.9
-  --max-num-seqs 64`, APC on (default). OpenAI `/v1/completions`.
-
-## Finding 1 - the paged cross-request prefix cache does NOT engage in llama-server
-
-This is itself a key result. The paged engine has two distinct mechanisms:
-
-1. Physical paged block placement (patches 0002/0004) - runs inside
-   `llama_kv_cache::find_slot`, gated only by `LLAMA_KV_PAGED`. This DOES engage in
-   the server: with `LLAMA_KV_PAGED_DEBUG=1`, 2 concurrent shared-prefix requests
-   produced 14 `[paged-alloc] ... grew` lines, one stream per `seq`.
-
-2. Cross-request prefix recompute-skip (patch 0007) - the actual fan-out win
-   (`shares N prefix blocks ... prefix NOT recomputed`, ref-counted block sharing).
-   This is reachable ONLY through `paged_prefix_api::share/commit`
-   (`src/paged-prefix-api.cpp`), which only the standalone driver calls.
-
-Evidence it does not reach the server:
-- Static: `grep -rn "paged_prefix\|share_prefix\|LLAMA_KV_PAGED" tools/server/`
-  returns nothing; `nm` on the binary finds no `paged_prefix` symbol use from the
-  server path. Nothing in `llama_decode` or the server calls `share`/`commit`.
-- Runtime: the 2-request verify run logged **0** `shares prefix blocks` /
-  `NOT recomputed` lines. Both `seq=0` and `seq=1` independently grew to 65 blocks,
-  each allocating and recomputing the full ~972-token prefix separately - no
-  cross-slot KV block sharing, no `ref_cnt>1`.
-
-So the 0007 recompute-skip, proven in the driver, does **not** yet reach the
-server. Closing it needs server-side wiring: when admitting a slot whose prompt
-shares a prefix with another live/committed slot, the server would have to call
-the `paged_prefix_api::share` / `commit` seam. That is a future patch.
-
-Note: llama-server has its OWN native prefix reuse (the slot prompt cache /
-"context checkpoints"). In the K=32 wave the server reused the prefix cached by the
-earlier wave, so prefill was only the 32-token suffix (`prompt eval ... / 32
-tokens`). But that is a separate mechanism, it only helps prefill, and prefill is
-not the bottleneck here (see below), so it does not change the verdict.
-
-## Finding 2 - the matched comparison
-
-Both batched, both NVFP4, both prefix-cache on, both eager. Cold concurrent fan-out,
-identical token-id prompts via one client.
-
-| K  | engine   | wall (s) | aggregate gen tok/s | req/s | vLLM speedup |
-|----|----------|----------|---------------------|-------|--------------|
-| 16 | llama.cpp| 50.7     | 18.9                | 0.30  | -            |
-| 16 | vLLM     | 8.57     | 119.5               | 1.87  | ~5.9x        |
-| 32 | llama.cpp| 58.3     | 34.0                | 0.53  | -            |
-| 32 | vLLM     | 8.86     | 231.1               | 3.61  | ~6.6x        |
-
-vLLM APC confirmed engaged: prefix cache hit rate 90.9% (K=16), 95.5% (K=32),
-enforce_eager (CUDA graphs disabled), `enable_prefix_caching=True`.
-
-### Verdict: not competitive - vLLM ~6x faster, and prefix caching is not why
-
-With every confounder removed (both batched, both NVFP4, both eager, both with
-prefix caching on), vLLM is still ~6x faster end-to-end. The gap is decode-bound,
-not prefill/cache-bound:
-
-- The G=64 workload is dominated by decode. In the llama K=32 run, decode was
-  52.98s of the 58.3s wall; prefill was ~3.5s (and only the 32-token suffix, since
-  the server's native prompt cache already reused the prefix). So even perfect
-  prefix sharing - paged or native - cannot move the total much.
-- llama.cpp batched decode: **~828 ms per decode step** at batch 32
-  (1.21 tok/s per sequence).
-- vLLM batched decode: ~170 tok/s aggregate gen at 32 running reqs ->
-  **~185 ms per step**, roughly **4-5x faster per decode step**.
-- CUDA graphs are NOT the differentiator: both sides are eager (llama
-  `graphs reused = 0`, vLLM `--enforce-eager`). The win is vLLM's batched-decode
-  efficiency: PagedAttention + fused W4A16 (Marlin) GEMMs + chunked-prefill
-  scheduler, versus llama.cpp's per-step eager graph and NVFP4-GGUF decode path on
-  this Blackwell-class part.
-
-Because decode dominates, wiring the paged 0007 recompute-skip into the server
-(Finding 1) would mainly remove redundant prefill across slots - a real saving for
-short-generation / long-prefix RAG fan-out, but at G=64 it is a few seconds against
-a decode floor that is already ~6x slower than vLLM. The fan-out win does not, on
-its own, make llama.cpp competitive here; the decode kernel/batching gap is the
-load-bearing factor.
-
-## Caveats
-
-- NVFP4-GGUF is double-quant and is speed-representative (it routes onto the
-  FP4-MMA kernel); output quality is not the subject of this run.
-- vLLM side is NVFP4A16 (W4A16 / Marlin) - 4-bit weights, 16-bit activations;
-  llama side is NVFP4 weights on FP4-MMA. Both are NVFP4-weight class.
-- One llama request per run hit an intermittent HTTP 500 ("output does not match
-  the expected Content-only format" - a Qwen3 thinking-output quirk on
-  `/completion`), so llama counts were 15/16 and 31/32. The failed request returns
-  early and reduces batch contention for the rest, so a clean 16/16 / 32/32 llama
-  run would be marginally slower - i.e. the ~6x gap reported here is conservative
-  (favorable to llama.cpp).
-- Both servers cold-started; numbers are end-to-end wall from the concurrent
-  client. Disk healthy (~325 GB free), GPU otherwise idle.
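
The per-step decode figures quoted in the PAGED_VLLM_APPLES.md verdict above follow from simple division. This is a small Python sketch of that arithmetic, assuming one decode step per generated token (G=64) and one token per running sequence per step; the helper names are hypothetical.

```python
def step_ms_from_wall(decode_wall_s: float, n_steps: int) -> float:
    """Average milliseconds per decode step, given total decode wall time."""
    return 1000.0 * decode_wall_s / n_steps


def step_ms_from_aggregate(aggregate_tok_s: float, batch: int) -> float:
    """Milliseconds per step when each step emits one token per sequence."""
    return 1000.0 * batch / aggregate_tok_s


def per_sequence_tok_s(step_ms: float) -> float:
    """Tokens per second seen by one sequence in the batch."""
    return 1000.0 / step_ms


llama_step = step_ms_from_wall(52.98, 64)     # llama.cpp K=32: 52.98 s decode, 64 steps
vllm_step = step_ms_from_aggregate(170.0, 32)  # vLLM: ~170 tok/s at 32 running requests

print(round(llama_step), round(per_sequence_tok_s(llama_step), 2))
print(round(vllm_step), round(llama_step / vllm_step, 1))
```

Under these assumptions the vLLM step comes out near 188 ms, consistent with the rounded ~185 ms quoted above, and the per-step ratio lands inside the stated 4-5x range.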
--- a/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_COMPARE.md
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_COMPARE.md
@@ -1,165 +0,0 @@
-# Paged-attention closing measurements: stock GPU determinism + vLLM comparison
-
-Two closing measurements for the paged-attention series, run on a DGX Spark
-(NVIDIA GB10, compute capability 12.1 / sm_121), CUDA 13. Dev tree
-`~/llama-paged-dev` branch `paged`, paged engine gated by env `LLAMA_KV_PAGED`
-(default-off = stock). Models: `Qwen3-0.6B-Q8_0.gguf` and
-`Qwen3-32B-Q4_K_M.gguf` (llama.cpp), `Qwen3-32B` nvfp4a16 / W4A16 HF safetensors
-(vLLM 0.23.0). All dev drivers are dev-tree-only and not shipped.
-
-## Deliverable 1: stock GPU determinism across batch shapes (no paging)
-
-Question: is the patch-0007 GPU byte-identity "failure" (a near-tie greedy token
-flips on CUDA, e.g. 17971 vs 5671) caused by paging, or is it inherent stock
-CUDA non-determinism from running the same tokens in a different batch shape?
-
-Method: a new dev-only driver `llama-paged-batchshape` (paging explicitly OFF:
-`unsetenv("LLAMA_KV_PAGED")`). For a prompt `[P+S]` it greedy-decodes two ways,
-both stock contiguous KV:
-
-- (a) `full`  - prefill the whole `[P+S]` in ONE `llama_decode`.
-- (b) `split` - prefill `P` in one `llama_decode`, then `S` in a second.
-
-The two paths write byte-identical token ids; the only difference is the
-batch shape submitted to the kernels (full prefill vs P-then-S), which changes
-the float reduction order in the GEMMs and therefore the KV values by tiny
-amounts. 5 distinct prompts, suffix S=16.
-
-### Single next token (the literal T_full vs T_split)
-
-Both CPU and CUDA returned the SAME greedy next token for all 5 prompts
-(0/5 flips). BUT the top-2 logit gap measurably changes with the batch shape on
-CUDA, proving the float order does differ:
-
-```
-CUDA, S=8:  prompt 1  T_full=1896 (gap 0.07072)   T_split=1896 (gap 0.17986)
-CUDA, S=8:  prompt 4  T_full=49584 (gap 0.93304)  T_split=49584 (gap 0.85785)
-```
-
-The argmax simply did not flip on the immediate next token for these prompts -
-the gaps, while shifting, stayed wide enough.
-
-### Generated stream (what 0007 actually byte-asserts)
-
-0007 asserts byte-identity over a *generated* token stream, where the tiny
-prefill-shape KV perturbation accumulates and eventually crosses a near-tie.
-Generating G tokens greedily from `full` vs `split` and reporting first
-divergence:
-
-| gen length | CPU diverged | CUDA diverged |
-|-----------|--------------|---------------|
-| G=24 (0007 default) | 1/5 (prompt 0 @ step 5) | 2/5 (prompt 1 @ step 3, prompt 4 @ step 6) |
-| G=64 | 2/5 (steps 5, 42) | 3/5 (steps 3, 6, 30) |
-
-Example CUDA divergence, pure stock, zero paging:
-`prompt 1: DIVERGES at gen step 3: full=1260 split=576`.
-
-### Verdict (Deliverable 1): HYPOTHESIS HELD
-
-The 0007 GPU byte-identity failure is **stock batch-shape non-determinism, not a
-paged bug**. With paging entirely OFF, stock llama.cpp produces a different
-greedy token stream when the same prompt is processed in a full-prefill batch vs
-a split (prefix-then-suffix) batch - exactly the shape difference that 0007's
-prefix-share path introduces (full B-from-scratch vs prefix-cached + suffix-only).
-
-Refinement (reported honestly): it is **not strictly CUDA-only**. CPU exhibits
-the same divergence, just less often and later (1/5 vs 2/5 at G=24, and CPU's
-flips land at later generation steps). This is exactly why 0007's small, short
-CPU scenarios happened to pass 16/16 while the CUDA run flipped: CUDA's larger
-parallel reductions reorder more aggressively, so a near-tie crosses earlier and
-more frequently. The phenomenon is floating-point GEMM-batching non-determinism,
-inherent to both backends; paging is not the cause.
-
-## Deliverable 2: vLLM vs llama.cpp+paged on a shared-prefix fan-out
-
-Workload: K requests share a 1024-token system prefix, each with a unique
-32-token suffix, then generate 64 tokens. Both engines cache the shared prefix
-(vLLM automatic prefix caching ON by default; llama.cpp via the paged
-cross-request prefix cache, `LLAMA_KV_PAGED=1`).
-
-Quantization differs between the two engines (a realistic apples-to-oranges
-comparison, reported honestly):
-- llama.cpp: Qwen3-32B **Q4_K_M** (GGUF), `-ngl 99`, CUDA dequant kernels.
-- vLLM: Qwen3-32B **nvfp4a16 (W4A16)**, served via the **Marlin FP4
-  weight-only** kernel because GB10 (sm_121) has **no native FP4 compute** -
-  i.e. vLLM is on a slower-than-ideal kernel path here. vLLM also ran
-  `enforce_eager=True` (no CUDA graphs / torch.compile; the env lacked a working
-  inductor/ninja toolchain), so the vLLM numbers are if anything **conservative**.
-
-### vLLM (automatic prefix caching), end-to-end
-
-APC hits confirmed in the engine log: **"Prefix cache hit rate: 97.0%"**,
-`prefix_cache_hits 33040/34848` (K=16) and `99344/102432` (K=32).
-
-| K | APC | prefill wall (G=1) | total wall (G=64) | throughput |
-|---|-----|--------------------|--------------------|-----------|
-| 16 | ON  | 0.749 s | 6.63 s | 2.41 req/s |
-| 16 | OFF | 20.19 s | 27.21 s | 0.59 req/s |
-| 32 | ON  | 1.13 s  | 7.56 s | 4.23 req/s |
-| 32 | OFF | 40.19 s | 48.71 s | 0.66 req/s |
-
-vLLM's APC cuts the fan-out prefill ~27x (K=16) to ~36x (K=32) vs APC-off; the
-huge ratio reflects how slow the FP4-emulation prefill is when forced to
-recompute all K prefixes.
-
-### llama.cpp + paged prefix cache (prefill phase)
-
-The paged shared-prefix bench (`llama-paged-prefix-bench`, `BENCH_GEN=0`,
-`PAGED_NGL=99`). Reuse confirmed: `kshare(seq1)=1024`, shared-block
-`ref_cnt = K` (all sequences hold the one prefix), 15360 / 31744 prefix tokens
-skipped.
-
-| K | mode | prefill tokens submitted | prefill wall | vs no-share |
-|---|------|--------------------------|--------------|-------------|
-| 16 | PAGED-SHARE | 1536  | 3.66 s  | 7.15x |
-| 16 | NO-SHARE    | 16896 | 26.17 s | 1.0x  |
-| 32 | PAGED-SHARE | 2048  | 6.04 s  | 10.3x |
-| 32 | NO-SHARE    | 33792 | 62.17 s | 1.0x  |
-
-The paged prefix cache delivers the expected **7.15x (K=16) / 10.3x (K=32)**
-prefill wall-time reduction - the headline cross-request prefix-skip win, on a
-real 32B model on GPU.
-
-### Head-to-head, both engines caching the shared prefix
-
-Prefill of the cached fan-out (vLLM G=1, ~prefill; llama.cpp G=0, pure prefill):
-
-| K | llama.cpp+paged prefill | vLLM APC prefill | vLLM faster by |
-|---|-------------------------|------------------|----------------|
-| 16 | 3.66 s | 0.749 s | ~4.9x |
-| 32 | 6.04 s | 1.13 s  | ~5.3x |
-
-### Verdict (Deliverable 2): competitive in kind, behind in absolute terms
-
-With both engines caching the shared prefix, **llama.cpp+paged is qualitatively
-competitive but absolutely behind vLLM on this GB10 box**:
-
-- **Same optimization, same order of magnitude.** llama.cpp's paged prefix cache
-  reproduces exactly the win vLLM's APC gives - skip the shared-prefix recompute
-  - and yields a 7-10x prefill reduction vs its own no-share baseline. On the
-  RAG/system-prompt fan-out the algorithmic gap is closed: llama.cpp no longer
-  pays K x prefix.
-
-- **vLLM still wins head-to-head by ~5x on the cached prefill** (0.75s vs 3.66s
-  at K=16; 1.13s vs 6.04s at K=32), and by more end-to-end because it does
-  **continuous batched decode** (all K sequences decoded in one fused step)
-  while the llama.cpp paged *dev driver* decodes each sequence serially. That
-  decode-batching gap is a property of the serving stack, not of the paged
-  prefix cache. Notably vLLM wins here while handicapped (eager mode, FP4
-  weight-only emulation with no native FP4 on GB10); a tuned vLLM would lead by
-  more.
-
- **Honest caveats / blockers.** (1) Quant differs (Q4_K_M vs nvfp4a16). (2) The
-  comparison is prefill-vs-prefill plus vLLM end-to-end; a clean llama.cpp
-  end-to-end on this driver is blocked because its generation phase has a
-  stale-logits bug (`get_logits_ith` reads seq 0's prefill index after later
-  sequences' prefills overwrote the logits buffer -> segfault), and even fixed
-  its decode is serial, so it would not be apples-to-apples vs vLLM's batched
-  decode. The fair end-to-end llama.cpp number needs the grpc / llama-server
-  continuous-batching path, not this dev scaffold. (3) vLLM ran eager + FP4
-  emulation, making its numbers conservative.
-
-Bottom line: paged gives llama.cpp the cross-request prefix-skip that vLLM's APC
-provides, which is the categorical win and removes the K x prefix penalty on
-RAG/system-prompt fan-out. On absolute wall-time on this hardware vLLM retains a
-~5x prefill lead and a larger end-to-end lead from continuous batched decode and
-a more optimized serving stack.
--- a/backend/cpp/llama-cpp/patches/paged/PIN_BUMP_APPLY_CHECK.md
+++ b/backend/cpp/llama-cpp/patches/paged/PIN_BUMP_APPLY_CHECK.md
@@ -1,107 +0,0 @@
-# Pin-bump apply-feasibility check: paged patch series vs latest llama.cpp tip
-
-Date: 2026-06-27. Scope: textual `git apply` feasibility ONLY. No compile, no
-bit-exact gate (those require the DGX GPU and the manual PIN_SYNC process). This
-report answers one question: if we bumped the pin to the latest upstream tip,
-would the vendored paged patch series still apply?
-
-## Pins
-
-| | commit | subject |
-|---|---|---|
-| Current shipped pin | `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1` | model : Add label for LFM2.5-230M (#25008) |
-| Latest master tip   | `c299a92c38b6de6a1139617652b66081828648db` | binaries : Improve rpc-server and export-graph-ops names (#25045) |
-
-Gap: the pin is **23 commits behind** the latest master tip (`ahead_by: 23`,
-GitHub compare API). The upstream range touched many files across the tree
-(modifications plus at least one rename).
-
-## Method
-
-Two fresh shallow clones of `ggml-org/llama.cpp` (the current pin as a baseline,
-and the latest master tip as the target). The series
-`backend/cpp/llama-cpp/patches/paged/0*.patch` (28 files: 0001-0030, gaps at
-0005 and 0027) was applied IN ORDER to each tree.
-
-Each patch was classified two ways:
-
- **`git apply --check -p1`** - this is the BUILD's real apply method
-  (`backend/cpp/llama-cpp/Makefile`'s `llama.cpp` target does
-  `git apply --verbose "$p" || exit 1`). This is the only signal that decides
-  whether a bumped build succeeds. `git apply` natively tolerates `@@`
-  line-number offsets but NOT context-line changes.
- **GNU `patch -p1` dry-run** - the `prepare.sh` fallback method, used here as a
-  recovery probe to tell a fixable offset/fuzz from a genuine conflict.
-
-Running against BOTH pins isolates bump-induced failures from pre-existing,
-pin-independent quirks of the shipped series.
-
-## Result: the bump is CLEAN / offset-tolerant. Zero re-exports needed for the bump.
-
-The series behaves **identically** under `git apply` on the latest tip and on
-the current pin.
-
- **27 / 28 patches apply CLEAN under `git apply`** on the latest tip (same 27
-  as on the current pin).
- **1 / 28 fails `git apply` (0019) - and it fails identically on the current
-  pin too**, for a reason that has nothing to do with the bump (see below). Its
-  code applies fine.
- **No new conflicts.** Not a single patch that applied on the current pin fails
-  on the latest tip.
- **Zero context-fuzz anywhere.** Every recovery the GNU-patch probe reported is
-  a pure line-number offset, which `git apply` absorbs natively.
-
-### What the 23-commit jump actually changed
-
-Only which patches `git apply` has to place at a line offset (context drift from
-the 23 upstream commits). All still apply CLEAN; none needs re-export.
-
- Offset-placed on the current pin (6): 0009, 0017, 0018, 0020, 0021, 0024.
- Offset-placed on the latest tip (10): 0009, 0015, 0017, 0018, 0020, 0021,
-  0024, 0025, 0026, 0028.
- New offsets introduced by the bump (4): **0015, 0025, 0026, 0028** - all
-  remain CLEAN under `git apply` (line offset only, no fuzz, no conflict).
-
-### The single `git apply` failure (0019) is pre-existing, not a bump regression
-
-`0019-qwen35-ssm-decode-fused-gather.patch` fails `git apply` on BOTH pins. The
-sole cause is its first hunk, a *modify* hunk against `SSM_DECODE_FIX_RESULTS.md`
- a dev-only doc that exists on the DGX dev tree (from an unshipped docs commit)
-but is absent from any clean upstream checkout:
-
-```
-error: SSM_DECODE_FIX_RESULTS.md: No such file or directory
-```
-
-`git apply` is atomic, so that one stray hunk rejects the whole patch. 0019's 8
-real code files (ggml.h, ggml-cpu/ops.cpp, ggml-cuda/gated_delta_net.cu, ggml.c,
-delta-net-base.cpp, models.h, qwen35.cpp, qwen35moe.cpp) all apply cleanly (the
-GNU-patch probe applies them with only line offsets and reports 0 failed code
-hunks). This is exactly the pre-existing finding documented in
-`PIN_SYNC_9d5d882d.md` ("Pre-existing finding ... NOT introduced by this
-pin-sync, NOT fixed here ... a separate cleanup, out of scope"). It is identical
-at both pins, so it is NOT introduced by a bump. Stripping the stray dev-doc
-hunk from 0019 (and the analogous 0021 *create* hunk for
-`CONV_STATE_FUSION_RESULTS.md`, which happens to apply fine) is a cleanup that
-should happen regardless of any pin bump.
-
-## Verdict
-
-A pin bump from `9d5d882d` to the latest tip `c299a92c` is **textually clean**:
-the full paged series applies via the build's `git apply` with only benign
-line-number offsets and zero conflicts - no patch needs re-export for the bump.
-The lone `git apply` failure (0019) is a pre-existing shipped-series defect (a
-stray dev-doc hunk), present identically on the current pin, and unrelated to the
-bump.
-
-## Caveats (why this does NOT authorise shipping a bump)
-
-This is a textual apply check only. It does NOT verify that the patches are still
-SEMANTICALLY correct against upstream's 23 refactor commits, that the result
-compiles, or that it stays bit-exact. The 23 upstream commits touched many files;
-a clean text-apply can still hide a semantic break (e.g. a function the kernel
-patches call was refactored). The manual PIN_SYNC process on the DGX GPU
-(rebuild + `test-backend-ops` + the greedy-md5 bit-exact gate + a decode bench)
-remains the gate before any pin is advanced. This report only establishes that
-the bump's textual conflict surface is empty, so that pin-sync would start from a
-clean apply.
--- a/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
+++ b/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
@@ -1,301 +0,0 @@
-# Pin-sync: paged patch-stack -> llama.cpp 9d5d882d
-
-Status: COMPLETE. The paged patch-stack (0001-0024) was rebased onto llama.cpp
-`9d5d882d`, rebuilt clean (CUDA sm_121), and the bit-exact gate is GREEN on both
-the dense and MoE NVFP4 baselines. The LocalAI-side `.patch` files were then
-re-exported from the rebased commits; **4 patch files changed** and are updated
-in this commit. A quick decode bench confirms the patchset performs the same on
-the new tip.
-
-## Early-warning canary: when to run the NEXT pin-sync
-
-The shipped pin (this file's tip, mirrored in
-`backend/cpp/llama-cpp-localai-paged/Makefile`) is advanced ONLY by this manual,
-GPU-verified PIN_SYNC. Because the paged backend is excluded from the nightly
-auto-bumper (`.github/workflows/bump_deps.yaml`), nothing nightly tells you when
-upstream has drifted past the patches. That signal comes from a dedicated
-scheduled canary:
-
- **Workflow:** `.github/workflows/llama-cpp-paged-canary.yml` (weekly, plus
-  `workflow_dispatch`). It resolves the latest `ggml-org/llama.cpp` master tip,
-  then in two jobs (a) APPLIES the full series to that tip with the build's own
-  `git apply` method via `.github/scripts/paged-canary-apply.sh`, and (b)
-  COMPILES the paged backend (cublas) against it using the same base-grpc-cuda-12
-  toolchain + `make grpc-server` target the shipped build uses.
- **Green** = the series still applies and compiles on upstream HEAD; nothing to
-  do.
- **Red** = upstream moved out from under the patches. **Canary red -> run a
-  PIN_SYNC** (rebase the patches onto the new tip, pass the bit-exact gate on the
-  GPU, re-export the `.patch` files, then advance the pin). The canary is
-  signal-only: it opens no PR and never moves the pin, so the shipped build and
-  the dep-bump PRs stay green regardless.
- **0019 handling:** the canary apply helper excludes ONLY the stray
-  `SSM_DECODE_FIX_RESULTS.md` dev-doc hunk (the pre-existing quirk documented in
-  the "Pre-existing finding" section below and in `PIN_BUMP_APPLY_CHECK.md`),
-  applying 0019's real code hunks atomically. So that benign quirk never
-  false-positives the canary, but a genuine code break in 0019 still turns it
-  red.
-
-## Upstream jump
-
- OLD LocalAI pin: `8be759e6`
- NEW LocalAI pin (target): `9d5d882d` ("model : Add label for LFM2.5-230M (#25008)")
- Upstream jump `8be759e6..9d5d882d` = **17 commits**.
-
-### Note on the dev-tree base (important)
-The DGX dev tree's `paged` branch was NOT based on the old pin `8be759e6`. Its
-real base (merge-base of `paged` with both pins) is `f3e1828`
-("mtmd: llava_uhd should no longer use batch dim (#24732)"), an ancestor 92
-commits behind `8be759e6`. So the rebase traversed `f3e1828..9d5d882d` = 92 + 17 =
-**109 upstream commits**, a strictly larger surface than the 17-commit pin bump.
-The end state (paged patches on `9d5d882d`) is identical either way; the larger
-traverse only means the conflict surface was the worst case, and it still came
-through bit-exact.
-
-## Rebase
-
- Command: `git rebase --onto 9d5d882d f3e1828 paged` (merge.conflictStyle=diff3).
- 26 commits replayed (23 shipped patch-commits + the 2 dev-scaffolding "Gate-0/
-  FA-gate driver" commits and 1 docs commit; the scaffolding/docs commits are not
-  shipped as `.patch` files).
- Backup ref before rebase: `paged-prerebase-backup` = `a8a9d12` (old patch 0024).
- New rebased range: `9d5d882d..paged`, HEAD = `2ee65c2` (patch 0024).
-
-### Conflicts during rebase (3 commits, ALL in `tools/server/server-context.cpp`)
-
-Every rebase conflict was in the llama-server continuous-batch scheduler wiring,
-all of which is gated behind env (`LLAMA_KV_PAGED` / `LLAMA_PREFILL_BUDGET` /
-`LLAMA_MAX_BATCH_TOKENS`) and therefore a strict no-op for the gate (the gate
-uses `llama-completion`, not the server, with no env set). The root cause was a
-single upstream refactor of `update_slots()`:
-
- the outer slot loop became `iterate(slots, [&](server_slot & slot){...})`,
-  replacing bottom-of-loop `break` with a top-of-lambda
-  `if (!add_ok || batch.size() >= n_batch) return;` (the `add_ok` flag is set
-  false on `batch.add()` failure);
- the embedding/rerank early-exits changed `continue;` -> `return;`;
- the `server_batch` token count accessor was renamed `batch.n_tokens` ->
-  `batch.size()` (`server_batch` has a `.size()` method and **no** `.n_tokens`
-  member; the raw `llama_batch` in `send_embedding`/`send_rerank` keeps `.n_tokens`).
-
-**patch 0008** (`240758e`, cross-request prefix share) - 1 conflict.
-Hunk 3 (the prefix-commit block) collided with the `continue`->`return` refactor.
-Hunks 1 (namespace shim) and 2 (the share block) applied cleanly. Resolved by
-keeping HEAD's refactored structure and re-inserting the `[paged 0008]`
-`paged_prefix_api::commit(...)` block verbatim after `slot.state = SLOT_STATE_GENERATING;`
-and before `if (slot.can_speculate())`, re-indented to the new (de-nested) level,
-with the identical `paged_kv_commit && cache_prompt && !has_mtmd` guard. Semantics
-unchanged.
-
-**patch 0013** (`6d37431`, static `LLAMA_PREFILL_BUDGET`) - 3 conflicts.
- C1: inserted the `n_prefill_budget` / `n_prompt_budgeted` var block before
-  HEAD's new `auto & alora_scale = batch.alora_scale;` references (upstream moved
-  alora_scale/disabled_id into the `server_batch` struct).
- C2: merged the budget gate into HEAD's `while (... batch.size() < n_batch ...)`
-  (took upstream's `batch.size()` rename, kept the budget condition).
- C3: the original outer `break` was translated to the new idiom `add_ok = false;`
-  (exact semantic equivalent of "stop admitting prompts to remaining slots"); the
-  upstream-removed `if (batch.n_tokens >= n_batch) break;` was dropped (now handled
-  by the top-of-lambda check).
-
-**patch 0016** (`02fa047`, dynamic decode-first budget, supersedes 0013) - 2
-conflicts + 1 clean-hunk fix.
- The big budget-block rewrite hunk applied cleanly (its expected parent == the
-  faithfully-resolved 0013 block).
- Clean-hunk fix: the clean-applied line `const int32_t n_decode_in_batch = batch.n_tokens;`
-  referenced the `server_batch` member, which has no `.n_tokens` -> changed to
-  `batch.size()` (== D, the Phase-1 decode load; identical value).
- C-A: while-condition -> took THEIRS (dynamic `prefill_budget_step` +
-  `prefill_cap_per_slot`), adopted `batch.size()`.
- C-B: admission break -> 0016 dynamic budget check with `break` -> `add_ok = false`,
-  dropped the upstream-removed `batch.n_tokens >= n_batch` break.
-
-OFF-path invariant verified by construction in all three: with the env knobs
-unset (`prefill_budget_step == prefill_cap_per_slot == 0`, `paged_kv_* == false`)
-the added conditions never fire, so the scheduler is byte-identical to stock HEAD.
-
-### Kernel patches: ZERO rebase conflicts
-Patches 0017-0024 - which touch the bit-exact compute paths
-(`gated_delta_net.cu` +330, `mmq.cu`/`mmq.cuh` +209, `ssm-conv.cu` +112,
-`quantize.cu`, `fattn.cu`, `src/models/qwen35.cpp`/`qwen35moe.cpp`/`qwen3next.cpp`,
-`src/llama-kv-cache.*`, `src/paged-*`, `tests/test-backend-ops.cpp` +79) - all
-applied **cleanly** during the rebase (3-way). No math, reduction order, or kernel
-context was touched during conflict resolution.
-
-## Clean rebuild
-`cmake --build build-cuda --target clean && cmake --build build-cuda -j20`,
-preserving the existing CMakeCache (CMAKE_CUDA_ARCHITECTURES=121, GGML_CUDA=ON,
-GGML_CUDA_FA=ON, GGML_CUDA_GRAPHS=ON, GGML_CUDA_NCCL=ON). Result: BUILD_EXIT=0,
-all targets at 100%. (The only log "error" is a benign webui `dist.tar.gz`
-download miss, unrelated to the gate binaries.)
-
-## GATE: ALL GREEN
-
-(a) `test-backend-ops` (Backend CUDA0):
-| op | result |
-|----|--------|
-| GATED_DELTA_NET | 36/36 OK |
-| SSM_CONV        | 45/45 OK |
-| MUL_MAT         | 1146/1146 OK |
-| MUL_MAT_ID      | 806/806 OK |
-
-(b) greedy md5 (`llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`):
-| model | md5 | baseline | verdict |
-|-------|-----|----------|---------|
-| dense `q36-27b-nvfp4`     | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
-| MoE `q36-35b-a3b-nvfp4`   | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
-
-Bit-exactness preserved across the upstream jump.
-
-## Decode bench sanity (rebased build, post-pin-sync)
-
-`llama-batched-bench -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`,
-S_TG (decode) tok/s at npl128, patch defaults on:
-| model | npl128 S_TG (new tip) | post-0023 reference | delta |
-|-------|----------------------|---------------------|-------|
-| dense `q36-27b-nvfp4`   | **366.41** | 373.2 | -1.8% |
-| MoE `q36-35b-a3b-nvfp4` | **751.11** | 745.7 | +0.7% |
-
-Both within the +/-3% noise band -> the patchset performs the same on `9d5d882d`.
-(npl32 also matches: dense 205.83 vs 207.6; MoE 438.29 vs 440.0.)
-
-## Export phase: re-export `.patch` files and pick the ones that changed
-
-The committed `.patch` files were generated against the old base. Each shipped
-patch was re-exported from its rebased commit (`git format-patch -1 <commit>`) and
-compared body-to-body against the committed file (ignoring the volatile `From`
-commit-hash line and the `index` blob-hash lines). Classification:
-
- **CONTENT (real hunk-body change -> MUST update):** `0008`, `0013`, `0015`, `0016`.
- **LINENUM only (hunk bodies byte-identical, only `@@` line-numbers shifted ->
-  still apply cleanly, left as-is):** `0009`, `0017`, `0018`, `0019`, `0020`,
-  `0021`, `0024`.
- **IDENTICAL (no change at all):** `0001`, `0002`, `0003`, `0004`, `0006`,
-  `0007`, `0010`, `0011`, `0012`, `0014`, `0022`, `0023`.
-
-An independent isolated `git apply --check` sweep (each shipped patch vs the
-rebased pre-state tree) agreed exactly: the same 4 (`0008`/`0013`/`0015`/`0016`)
-are the only ones that no longer `git apply` to `9d5d882d`. The build applies the
-series with plain `git apply` (Makefile) which tolerates `@@` line-number offsets,
-so the 7 LINENUM patches still apply (verified) and are intentionally not churned.
-
-### 0015 was a 4th change beyond the 3 rebase conflicts
-The rebase reported only 3 conflicts (`0008`/`0013`/`0016`). `0015`
-(expert-density MoE token-tile auto-select) rebased *cleanly* via 3-way merge, but
-its committed `.patch` no longer applies to `9d5d882d` via plain `git apply`:
-upstream inserted a new test case
-(`test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)`)
-in `tests/test-backend-ops.cpp` right at `0015`'s insertion anchor, so the hunk's
-context lines shifted. `0015`'s own inserted lines are unchanged - it is a pure
-context re-anchor, no behavioral change. This is exactly why a per-patch
-re-export/apply-check was run instead of trusting the 3-conflict count.
-
-### What changed in each updated patch (From/index hash noise aside)
- `0008`: same `[paged 0008]` commit block (identical env-guard + `paged_prefix_api::commit`
-  call), re-indented to the refactored `update_slots` lambda level and re-anchored
-  after `slot.state = SLOT_STATE_GENERATING;`; `@@` headers updated.
- `0013`: budget var-block / while-gate / admission-break re-expressed against the
-  refactored loop (`batch.size()`, `add_ok=false`); `@@` headers updated.
- `0015`: hunk context re-anchored around the new upstream test case; inserted
-  lines identical; `@@` header updated.
- `0016`: dynamic budget block + `n_decode_in_batch = batch.size()` + admission
-  `add_ok=false` against the refactored loop; `@@` headers updated.
-
-## Equivalence proof (the updated series == the gate-green tree)
-
-The 4 updated files are byte-faithful `git format-patch -1` exports of the
-gate-green rebased commits (`240758e`, `6d37431`, `5349f82`, `02fa047`). Applying
-the full corrected series (the 19 unchanged committed patches + the 4 re-exports)
-in order to a fresh bare `9d5d882d` checkout with plain `git apply` succeeds for
-all 23 patches, and the resulting tree is **byte-identical to the gate-green
-`paged` tip (`2ee65c2`) for every code file** (`git diff` over all paths except
-`*.md` and the unshipped `examples/simple/*` scaffold drivers is empty). So the
-shipped `.patch` series reproduces exactly the tree that passed test-backend-ops,
-the md5 bit-exact gate, and the bench.
-
-## Shipped-build bug FIXED: stray dev-doc hunks stripped from the patch series
-
-The pin-sync export captured dev-only result/progress docs that live in the DGX
-dev tree (`~/llama-paged-dev`) but are ABSENT from a clean `ggml-org/llama.cpp`
-checkout. The shipped build applies the paged series with **strict `git apply`**
-(the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
-`git apply --verbose "$p" || { echo "paged patch failed"; exit 1; }`), which is
-atomic: a single hunk against a missing file REJECTS the entire patch and the
-`exit 1` fails the build. (`prepare.sh` uses tolerant `patch -pN -N ... || true`,
-but it is guarded by the `src/paged-kv-manager.cpp` sentinel and skipped at build
-time once the Makefile has applied the series, so the strict `git apply` is the
-real shipped path.)
-
-Root failure was patch `0019`'s *modify* hunk against `SSM_DECODE_FIX_RESULTS.md`
-(`index 2e7c8c2..77879e4 100644`): on a clean tree `git apply` cannot find the
-file to modify ("No such file or directory") and rejects all of `0019`, which
-then cascades to `0021`/`0022`/`0026`/`0028` (they build on `0019`'s code). The
-build therefore only succeeded on the DGX (where the doc exists) and FAILED on CI
-/ any clean checkout.
-
-Fixed by stripping every stray non-source hunk so the patches contain ONLY
-llama.cpp source changes. Stripped hunks (dev docs absent from a clean
-`9d5d882d` checkout):
-
-| patch | stripped dev-doc hunk(s) | hunk kind |
-|-------|--------------------------|-----------|
-| `0019` | `SSM_DECODE_FIX_RESULTS.md` | modify (the root reject) |
-| `0020` | `LEVER1_OPROJ_MMQ_RESULTS.md` | create |
-| `0021` | `CONV_STATE_FUSION_RESULTS.md` | create |
-| `0028` | `LEVER1_GATHER_PROGRESS.md`, `LEVER1_GATHER_RESULTS.md` | create |
-
-(The `create` hunks did not reject on their own - `git apply` will create a new
-file even on a clean tree - but they polluted the build tree with stray dev docs
-and violated the source-only invariant, so they were stripped too.) For each
-patch the `diff --git a/<devdoc> ...` section was removed along with its diffstat
-per-file line and any `create mode` trailer, and the `N files changed, ...` summary
-was corrected; **every llama.cpp SOURCE hunk is byte-identical** (verified by
-sha256 of each patch's source-diff tail before vs after the strip).
-
-Verified on a fresh `git clone` of `ggml-org/llama.cpp` at this pin `9d5d882d`:
- BEFORE the strip, strict `git apply` of the series: OK through `0018`, then
-  `0019` FAILS ("SSM_DECODE_FIX_RESULTS.md: No such file or directory") -> the
-  Makefile `exit 1`s; continue-mode shows the full cascade `0019` `0021` `0022`
-  `0026` `0028` failing.
- AFTER the strip, strict `git apply` of the full series `0001..0030` reaches
-  **exit 0** (every patch OK, sentinel `src/paged-kv-manager.cpp` created, zero
-  stray `*_RESULTS.md`/`*_PROGRESS.md` in the tree). The tolerant `patch -p1`
-  path (prepare.sh fallback) also applies with zero rejects.
-
-## Durable fix: keep patch exports SOURCE-ONLY
-
-The pin-sync / re-export step MUST NOT capture dev-only artifacts into the shipped
-`.patch` files. A clean `ggml-org/llama.cpp` checkout contains its own real docs
-(`README.md`, `docs/`, `AGENTS.md`, ...) but NOT LocalAI dev notes - anything
-matching `*_RESULTS.md`, `*_PROGRESS.md`, `*.diff`, `final_benchmark.csv`,
-`LEVER*`, `BENCH*`, `paged-*-bench.cpp`, or any path that does not exist at the
-pin is a dev artifact and must be excluded. Concretely, when re-exporting:
-
- prefer `git format-patch -1 <commit> -- ':!*.md' ':!*.diff' ':!*.csv'` (or an
-  explicit pathspec of the llama.cpp source dirs `src/ ggml/ common/ include/
-  tools/ tests/ cmake/`) so dev docs never enter the patch body;
- keep the dev-notes commits SEPARATE from the code commits on the dev branch, so
-  a per-commit export is naturally source-only;
- after export, gate on a clean clone of the pin: apply the full series with strict
-  `git apply` (no `--exclude`, no `|| true`) - it MUST reach exit 0. The weekly
-  canary (`.github/workflows/llama-cpp-paged-canary.yml`) does this against
-  upstream HEAD; now that the patches are source-only its `0019`
-  `SSM_DECODE_FIX_RESULTS.md` `--exclude` workaround
-  (`.github/scripts/paged-canary-apply.sh`) is no longer needed and can be removed
-  on the next canary touch.
-
-The upcoming `c299a92c` pin-bump re-export MUST follow this: produce source-only
-patches and pass the strict-`git apply` gate on a clean checkout before advancing
-the pin.
-
-## Historical note (pre-strip)
-Before this cleanup, `0019` carried the `SSM_DECODE_FIX_RESULTS.md` modify hunk
-identically in the old and new exports (LINENUM class) and was left untouched
-during the pin-sync to keep the rebase faithful; `0021`'s
-`CONV_STATE_FUSION_RESULTS.md` was a create hunk that applied but still leaked a
-dev doc. Both are now removed by the source-only strip above.
-
-## Source of truth
-The rebased branch on the DGX dev tree (`~/llama-paged-dev`, branch `paged`, HEAD
-`2ee65c2`) is the source of truth; `paged-prerebase-backup` (`a8a9d12`) retains
-the pre-rebase state.
--- a/backend/cpp/llama-cpp/patches/paged/QUANT_GENERALITY.md
+++ b/backend/cpp/llama-cpp/patches/paged/QUANT_GENERALITY.md
@@ -1,286 +0,0 @@
-# QUANT_GENERALITY - are the paged decode opts NVFP4-specific or quant-agnostic?
-
-Source-verified classification of the paged decode optimizations (patches 0013-0029)
-as either QUANT-AGNOSTIC (operate on the gated-DeltaNet f32/bf16 recurrent state, the
-paged serving host path, or the matmul ROUTING - independent of the model's weight
-quantization, so they help a Q4_K / Q8_0 / bf16 Qwen3.6 as much as an NVFP4 one) or
-NVFP4-SPECIFIC (only fire for / only help GGML_TYPE_NVFP4 weights on a Blackwell GPU).
-
-READ-ONLY, NO GPU. Every classification below is taken from the patch body source,
-not from the prose claims. Hardware referenced for the empirical plan only.
-
---
-
-## 1. THE GROUND TRUTH GATE: what makes anything NVFP4-specific
-
-There is exactly ONE runtime gate in the whole ggml-cuda matmul stack that means
-"NVFP4 on Blackwell":
-
-    mmq.cu:  const bool use_native_fp4 = blackwell_mma_available(cc)
-                                         && (src0->type == GGML_TYPE_NVFP4 ...);
-
-(confirmed in ARCH_GENERALITY_AUDIT.md section gguf-targeting-1 and in patch 0023's
-own diff context). A patch is NVFP4-specific iff the code it changes lives INSIDE a
-`use_native_fp4` / `type == GGML_TYPE_NVFP4` / `blackwell_mma_available(cc)` branch.
-Everything else - the gated-DeltaNet recurrence, the conv update, the SSM/conv state
-caches, the MMQ-vs-MMVQ dispatch, the CUDA-graph guard, the host scheduler and paged
-pool - is dtype-independent.
-
-The recurrent state is the decisive fact: in this hybrid model the gated-DeltaNet
-temporal state, the conv ring state, q/k/v/g/beta and the SSM scratch are ALL
-GGML_TYPE_F32 (asserted explicitly in every new op builder: see 0018 ggml.c
-`GGML_ASSERT(state->type == GGML_TYPE_F32)`, 0019 same, 0021/0028 conv asserts
-`conv_states->type == GGML_TYPE_F32`). The weight quantization type never enters the
-recurrence or conv kernels. So any patch that only touches those is quant-agnostic by
-construction.
-
---
-
-## 2. PER-PATCH CLASSIFICATION (with source evidence)
-
-| patch | what it changes | classification | source evidence |
-|-------|-----------------|----------------|-----------------|
-| 0013 | static per-step prefill-token budget (LLAMA_PREFILL_BUDGET) | QUANT-AGNOSTIC | tools/server/server-context.cpp only; a host scheduler loop bound on prompt-token COUNT; no dtype anywhere; default-off byte-identical |
-| 0014 | manual MoE token-tile (mmq_x) cap | QUANT-AGNOSTIC | mmq.cuh `mul_mat_q_case`; cap applies on `args.expert_bounds != nullptr` (the MUL_MAT_ID grouped path) for ANY templated `<type>`; no NVFP4 branch |
-| 0015 | density-aware MoE token-tile auto-select | QUANT-AGNOSTIC | mmq.cuh; gate is `expert_bounds != nullptr` + per-expert density only, NEVER on src0 type. PROVEN on a non-NVFP4 model: the measured +4.8% win was Qwen3-Coder-30B (128 larger experts), test gate covers MXFP4 AND NVFP4 |
-| 0016 | dynamic decode-first prefill budget (supersedes 0013) | QUANT-AGNOSTIC | update_slots() policy only; "identical decisions paged on or off", zero libllama/dtype touch; default-off |
-| 0017 | FP4 GEMM decode mmq_y / minblocks tile tune | NVFP4-SPECIFIC, but DEFAULT-OFF / INERT | mmq.cuh `get_mmq_y_host`: fires only `type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)`. BUT the patch is a recorded NO-BUILD: every occupancy probe REGRESSED (kill-gate tripped), so nothing is enabled by default. Default build is byte-identical to stock; it changes no behavior |
-| 0018 | in-place SSM recurrent-state write-back | QUANT-AGNOSTIC | gated_delta_net.cu + ggml.c; operates on the f32 recurrent state cache (`state->type == GGML_TYPE_F32`); removes a D2D f32 state copy. Weights never read by this op |
-| 0019 | fused recurrent-state gather (ids read, no get_rows) | QUANT-AGNOSTIC | reads the f32 state cache via ids; builder asserts F32 on q/k/v/g/beta/state/state_dst; mirrors ggml_ssm_scan. No weight dtype involved |
-| 0020 | gated-DeltaNet o_proj MMVQ->MMQ reshape | QUANT-AGNOSTIC (routing) | qwen35.cpp/qwen35moe.cpp/qwen3next.cpp: a 2D-vs-3D RESHAPE of the f32 activation so `src1->ne[1]=128` routes to MMQ instead of batch-1 MMVQ. The MMVQ(ne[1]<=8)-vs-MMQ dispatch is a generic ggml-cuda decision present for EVERY quantized type. See section 3 |
-| 0021 | in-place conv-state fusion (conv+silu+ring write) | QUANT-AGNOSTIC | ssm-conv.cu + ggml.c new op asserts `conv_states/conv_kernel/x_cur/conv_state_dst == GGML_TYPE_F32`; pure f32 conv-state work |
-| 0022 | gated_delta_net_cuda occupancy/coalescing retune | QUANT-AGNOSTIC | gated_delta_net.cu kernel: q/k/v/g/beta/state are all f32; the COLS_PER_WARP/NUM_WARPS fold is a scheduling change on the f32 recurrence. Never touches a weight tensor |
-| 0023 | MoE NVFP4 activation-quantize de-dup | NVFP4-SPECIFIC | mmq.cu: the `gather_mmq_fp4` de-dup is INSIDE `if (use_native_fp4) { ... }`. Gathers `block_fp4_mmq`. The non-FP4 path (`quantize_mmq_q8_1_cuda`) is untouched. Confirmed NVFP4-only |
-| 0024 | paged-pool burst reclaim (truncate/defrag/release) | QUANT-AGNOSTIC | paged-alloc / paged-kv-manager / llama-kv-cache host accounting; "never KV values or compute, no ggml op touched"; gated behind LLAMA_KV_PAGED |
-| 0025 | MoE-decode CUDA-graph re-graph (graph-safe id path) | QUANT-AGNOSTIC (corrects hypothesis) | ggml-cuda.cu: relaxes the MUL_MAT_ID graph guard when `ggml_is_quantized(src0) && ggml_cuda_should_use_mmq(...)`. Gated on the GENERIC quantized-MMQ grouped path, NOT on NVFP4. See section 4 |
-| 0026 | hybrid per-head f32/bf16 SSM state (--cache-type-ssm / tau) | QUANT-AGNOSTIC, default-off (and precision-changing) | common/arg.cpp + cparams type_s/type_r + tau; changes the RECURRENT-STATE cache dtype (f32 default, bf16 opt-in). Independent of the weight quant; default tau=0 keeps bit-exact f32 |
-| 0028 | residual conv-tap gather fusion (ids read) | QUANT-AGNOSTIC | ssm-conv.cu new SSM_CONV_UPDATE_IDS op reads the f32 conv cache via ids; eliminates the last k_get_rows in the GDN decode path. f32 throughout |
-| 0029 | block-table within-step host cache | QUANT-AGNOSTIC | llama-kv-cache.cpp / paged-attn.cpp: memcpy-reuse of an int32 block table across full-attn layers of a step; pure host pipeline, bit-exact |
-
-(There is no patch 0027.)
-
-### Summary count
- QUANT-AGNOSTIC (helps any weight quant): 0013, 0014, 0015, 0016, 0018, 0019, 0020,
-  0021, 0022, 0024, 0025, 0026, 0028, 0029 - 14 of 16 landed patches.
- NVFP4-SPECIFIC: 0023 (the only landed NVFP4-only optimization) + 0017 (NVFP4-only but
-  default-off / inert, no measured win).
-
---
-
-## 3. 0020 IN DETAIL - MMQ-over-MMVQ at batched decode is a win for ANY quantized type
-
-The hypothesis is CONFIRMED. 0020 is not an FP4 trick:
-
- The gated-DeltaNet op left its output in 3D SSM layout `[value_dim, n_seq_tokens=1,
-  n_seqs=128]`, so the ssm_out matmul saw `src1->ne[1] = 1` with the 128 sequences
-  stuck in `ne[2]`.
- ggml-cuda dispatches `ne[1] <= 8` to MMVQ (the batch<=8 GEMV) and larger to MMQ
-  (the tensor-core GEMM). This `ne[1]`-threshold dispatch is type-INDEPENDENT: it is
-  the same routing for Q4_K, Q8_0, Q6_K, MXFP4, NVFP4 - every k-/legacy-quant has BOTH
-  an MMVQ (mmvq.cu vec_dot) AND an MMQ (mmq.cuh) path.
-- The fix is a `ggml_reshape_2d` to `[value_dim, n_seq_tokens*n_seqs] = [6144, 128]` so
-  `src1->ne[1] = 128` routes to the M=128 MMQ GEMM that amortizes the ssm_out weight
-  read across all 128 sequences. Same contiguous data, bit-identical.
-
-Why it generalizes: at batched decode (npl 32-128) the weight read of ssm_out is the
-cost, and MMVQ at the degenerate batch-1 shape re-reads the weight instead of amortizing
-it, whatever the weight dtype. MMQ at M=128 reads each weight tile once for all 128
-tokens. That amortization is a pure bandwidth win that exists for every quantized
-weight type, not just NVFP4. A Q4_K or Q8_0 Qwen3.6 has the exact same 3D-SSM-output ->
-batch-1-MMVQ pathology and gets the same MMQ amortization from the reshape. (The patch
-already routes the in-projection through MMQ; only the output was stuck in 3D.)
-
-The same logic underwrites 0014/0015 (the MoE `mmq_x` token-tile is a generic grouped-
-MMQ knob; the win was measured on a non-NVFP4 Qwen3-Coder-30B) and 0025 (section 4).
-
----
-
-## 4. 0025 CORRECTS THE HYPOTHESIS - it is quant-agnostic, not NVFP4-specific
-
-The hypothesis listed "the act-quant / quantize_mmq_nvfp4 portions of 0025" as
-NVFP4-specific. That is a patch-number mismatch. The ACTUAL patch 0025
-(0025-qwen35moe-nvfp4-moe-decode-regraph.patch) does NOT contain any act-quant /
-quantize_mmq_nvfp4 code. Its entire diff is one hunk in ggml-cuda.cu:
-
-    bool mmid_needs_sync = !ggml_is_quantized(src0->type) || node->ne[2] > mmvq_mmid_max;
-    if (mmid_needs_sync && ggml_is_quantized(src0->type) &&
-        getenv("LLAMA_MOE_FORCE_GRAPHS") &&
-        ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[2], src0->ne[2])) {
-        mmid_needs_sync = false;   // keep CUDA graphs on for the grouped-MMQ id path
-    }
-
-The relax condition is `ggml_is_quantized(src0->type) && ggml_cuda_should_use_mmq(...)`
-- the GENERIC quantized grouped-MMQ id-path, NOT NVFP4. `should_use_mmq()` returns true
-for Q4_K / Q8_0 / etc. at large enough batch just as for NVFP4. So a Q4_K or Q8_0 MoE
-Qwen3.6 whose MUL_MAT_ID takes the grouped MMQ path also keeps CUDA graphs across the
-MoE decode step under LLAMA_MOE_FORCE_GRAPHS. 0025 is quant-agnostic.
-
-LEVER2_GRAPH_COVERAGE_RESULTS.md confirms this is the role of 0025 ("0025's
-[TAG_MUL_MAT_ID_CUDA_GRAPHS] env-gate keeps the grouped MMQ id-path graph-safe").
-
-Where the hypothesis's "act-quant / quantize_mmq_nvfp4" actually lives: that is
-LEVER 3 (LEVER3_ACTQUANT_FUSION_RESULTS.md - fuse W4A4 act-quant into RMSNorm/SiLU),
-which is genuinely NVFP4-specific, BUT it was a measurement STOP and NEVER LANDED (no
-patch 0030, no commit). Likewise LEVER 4 (NVFP4-quantize the still-bf16 GDN/attn projections,
-LEVER4_PROJNVFP4_RESULTS.md) is NVFP4-specific but FAILED its KL gate (~6% PPL) and was
-NOT shipped. So the only NVFP4-specific code that actually landed is 0023 (+ inert 0017).
-
-### Net correction to the hypothesis
-- 0018/0019, 0021, 0022, 0028, 0026, 0013/0016, 0029, 0020: CONFIRMED quant-agnostic.
-- 0023: CONFIRMED NVFP4-specific.
-- 0025: WRONG in the hypothesis -> it is QUANT-AGNOSTIC (CUDA-graph guard on the generic
-  quantized grouped-MMQ path). The NVFP4-specific "act-quant" work the hypothesis was
-  thinking of is LEVER 3, which is unshipped (STOP), not patch 0025.
-- Bonus: 0014/0015 (not in the hypothesis) are quant-agnostic, and 0017 is
-  NVFP4-specific but default-off/inert.
-
----
-
-## 5. RELATIVE-IMPACT BY WEIGHT-QUANT SIZE
-
-Decode is bandwidth-bound on the weight read. The quant-agnostic opts target work whose
-absolute cost does NOT depend on the weight quant: the f32 recurrence, the f32 conv state, the
-host pipeline. The weight-read buckets (MoE expert GEMM + dense projections) scale
-~linearly with bits-per-weight. So the quant-agnostic opts deliver the same ABSOLUTE
-millisecond saving at every quant, but the RELATIVE % shrinks as the weight grows.
-
-Anchor: the measured MoE q36-35b-a3b NVFP4 decode step (MOE_GAP_VS_VLLM.md, step =
-169.8 ms, GPU-busy 97.5%), split into quant-agnostic vs weight-quant-scaling buckets:
-
-| bucket | ms/step @ NVFP4 | scales with weight bits? | which opts touch it |
-|--------|-----------------|--------------------------|---------------------|
-| Recurrence core (gated_delta_net) | 70.0 | NO (f32 state) | 0022 |
-| Recurrent-state + conv gather/plumbing (k_get_rows 5.2 + ssm_conv 3.4) | ~8.6 | NO (f32) | 0018/0019/0021/0028 |
-| Host bubble (sample+batch+block-table) | 4.2 | NO (host) | 0013/0016/0024/0029 |
-| Router / norms / glue | ~5.4 | mostly NO | 0014/0015 partial |
-| MoE expert GEMM | 47.3 | YES (4-bit now) | (weight read) |
-| Dense GDN/attn projections + convert glue | 20.3 | YES | (weight read) |
-| W4A4 act-quant tax (quantize_mmq_nvfp4) | 3.3 | (FP4 only) | 0023 |
-
-Quant-agnostic, weight-size-fixed total: ~70.0 + 8.6 + 4.2 + 5.4 = ~88 ms (~52% of the
-NVFP4 step). Weight-read buckets: 47.3 + 20.3 = ~67.6 ms (~40%).
-
-Model the weight-read buckets as scaling with bytes-per-weight relative to NVFP4 (4-bit
-= 1x): Q8_0 ~ 2x, bf16 ~ 4x. Hold the ~88 ms fixed (the recurrence f32 byte stream and
-host time do not change with the weight quant), and recompute the recurrence/host
-fraction of the step:
-
-| weight quant | weight-read buckets (ms, est.) | fixed quant-agnostic (ms) | step (ms, est.) | recurrence+host % of step |
-|--------------|--------------------------------|---------------------------|-----------------|---------------------------|
-| NVFP4 (4-bit) | ~68  (1x) | ~88 | ~159 (incl. act-quant ~3) | ~52% (measured ~50%) |
-| Q8_0 (8-bit)  | ~136 (2x) | ~88 | ~224 | ~39% |
-| bf16 (16-bit) | ~272 (4x) | ~88 | ~360 | ~24% |
-
-Reading this:
-- The quant-agnostic SSM/serving opts deliver the SAME ~ms savings at Q8/bf16 as at
-  NVFP4 (they remove fixed f32/host work). The headline % speedups quoted in the patch
-  bodies (e.g. 0019 dense npl128 +37.8%, 0020 +31.7%, 0022 +11.1%) are the LARGEST at
-  NVFP4 precisely because the fixed recurrence is the biggest fraction of the smallest
-  (4-bit weight) step. The same absolute removal is a smaller % of a Q8 step and a much
-  smaller % of a bf16 step, because the weight-read denominator grows.
-- This MATCHES the brief's decomposition framing (recurrence ~40-50%, GEMM ~26-28% at
-  NVFP4): at NVFP4 the recurrence dominates, so the recurrence-targeting opts are where
-  the win is; as the weight quant grows the GEMM dominates and the recurrence opts
-  matter relatively less (but never zero, and never negative).
-- Corollary: the ONE NVFP4-specific landed lever, 0023, only addresses the ~3.3 ms FP4
-  act-quant tax (and only the broadcast up/gate share of it) - the smallest bucket, and
-  its measured win is +1.7%. The big bit-exact wins are all quant-agnostic.
-
-So the optimization set is overwhelmingly general: a Q4_K / Q8_0 / bf16 Qwen3.6 gets the
-full recurrence + conv + serving + MMQ-routing benefit; only the small FP4 act-quant
-de-dup (0023) does nothing for it (and the inert 0017 was never enabled).
-
----
-
-## 6. EMPIRICAL CONFIRMATION PLAN (specify only - DO NOT run; the GPU is busy)
-
-Goal: prove on hardware that the quant-agnostic opts FIRE and LIFT a non-NVFP4 Qwen3.6,
-isolating them from the one NVFP4-specific lever.
-
-### 6.1 Hardware
-GB10 / DGX Spark (sm_121), when free. The DGX has live deployments; this plan is
-read-only until then. (Any Blackwell or non-Blackwell CUDA host also works to prove
-quant-GENERALITY - the recurrence/serving opts are not Blackwell-gated; only the NVFP4
-FP4-MMA tier is. Running on a non-Blackwell card would ALSO demonstrate the opts help
-where there is no use_native_fp4 path at all - a strong second proof.)
-
-### 6.2 Build the non-NVFP4 control GGUF first (prerequisite)
-The same Qwen3.6 architecture, re-quantized so the weights are NOT NVFP4 but the
-gated-DeltaNet/conv recurrence is still f32:
-
-  - Source: the existing q36-27b (dense) and/or q36-35b-a3b (MoE) f16/bf16 GGUF already
-    on the DGX (~/work/darwin_36b_opus/f16.gguf is the MoE f16 used as the LEVER4 KL
-    base; an equivalent dense f16 exists).
-  - Produce: `llama-quantize f16.gguf q36-27b-Q4_K_M.gguf Q4_K_M` (primary control) and
-    optionally `... Q8_0` and keep the f16/bf16 as the 16-bit control. Q4_K_M is the
-    cleanest contrast: 4-bit like NVFP4 but a totally different (k-quant, non-FP4-MMA)
-    weight path, so any shared win is provably from the f32 recurrence / routing, not
-    from FP4.
-  - Note: this requantize is free (no retrain) and must be done before any A/B.
-
-### 6.3 Bit-exact gate per path (same method as the patch bodies)
-For the bit-EXACT quant-agnostic opts (0018/0019/0020/0021/0022/0028/0029 and the
-host 0013/0016/0024 default-off), the gate is: greedy `llama-completion --temp 0
---seed 1 --ignore-eos -n 256`, md5 of the output, patches-ON == patches-OFF on the
-Q4_K_M control. Per path:
-  - non-paged Q4_K vs paged Q4_K (expect the same benign paged-reduction FP-order
-    delta noted in PAGED_BITEXACT_NOTE.md / 0029, gate with KLD/PPL not md5 across the
-    paged boundary, md5-exact within a fixed paged/non-paged setting).
-  - patches-on vs patches-off (see toggles 6.4) on the Q4_K control: byte-identical md5.
-  - 0026 (bf16 SSM state) is precision-CHANGING -> gate with KLD-to-f16 + PPL, not md5,
-    exactly like LEVER4 did; default tau=0 stays md5-exact.
-  - test-backend-ops on the build: GATED_DELTA_NET, SSM_CONV, SSM_CONV_UPDATE,
-    SSM_CONV_UPDATE_IDS, MUL_MAT, MUL_MAT_ID, GET_ROWS all green (these op tests are
-    dtype-parametrized and already include non-FP4 types).
-
-### 6.4 The clean A/B (decode_agg, llama-batched-bench)
-Two arms, SAME Q4_K_M control GGUF, `-fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`,
-report S_TG (decode aggregate), median of 5 reps:
-
-  - Arm A (patches-OFF baseline): the cleanest is two builds - the pre-0018 paged commit
-    (the SSM opts not yet present) vs HEAD. If a rebuild is not wanted, approximate
-    OFF on the single HEAD binary by setting every disabling toggle at once:
-      fused GDN off (cparams.fused_gdn_ar/ch path disabled - the "fusion off" mode the
-      patch docs A/B against), `GDN_NW=4 GDN_CPW=1` (0022 pre-retune), `LLAMA_MOE_AUTO_TILE=0`
-      (0015), no `LLAMA_MOE_FORCE_GRAPHS` (0025 off), `LLAMA_PAGED_NO_BT_CACHE=1` (0029),
-      `LLAMA_PAGED_NO_RECLAIM=1` (0024), `LLAMA_PREFILL_BUDGET`/`LLAMA_MAX_BATCH_TOKENS`
-      unset (0013/0016), tau=0 / ctssm f32 (0026). The two-build form is preferred for a
-      publishable number; the env form is a fast same-binary sanity A/B.
-  - Arm B (patches-ON default): stock defaults (fusion on, 16x8, auto-tile on,
-    FORCE_GRAPHS on for the MoE graph arm, bt-cache on, reclaim on).
-
-### 6.5 What result confirms quant-generality
-  1. The quant-agnostic opts FIRE on Q4_K: nsys on Arm B (Q4_K) shows the same kernel
-     deltas the NVFP4 runs showed - `k_get_rows_float` bucket collapses (0019/0028),
-     `concat_cont` + decode `cpy_scalar` gone and `ssm_conv_update` present (0021), the
-     o_proj `mul_mat_vec_q m=1` bucket gone and absorbed into `mul_mat_q m=128`
-     (0020 - now a Q4_K MMQ kernel, proving the routing win is not FP4-bound),
-     `get_block_table` host time down ~90% (0029).
-  2. The opts LIFT the non-NVFP4 model: Arm B S_TG > Arm A S_TG on the Q4_K control at
-     npl 32 and 128, with the recurrence/routing opts contributing the bulk (expect a
-     smaller % than the NVFP4 runs per section 5, but clearly positive and of the same
-     absolute ms order).
-  3. The NVFP4-specific lever does NOTHING on Q4_K: toggling 0023
-     (`GGML_CUDA_MOE_QUANT_DEDUP=0` vs default) shows ZERO delta on the Q4_K MoE control
-     (it never enters the `use_native_fp4` branch) - the negative control that isolates
-     the one NVFP4-only optimization from the general ones.
-
-A clean pass = Arm B beats Arm A on Q4_K with the SSM/conv/routing/host kernel deltas
-present and 0023 inert. That proves the decode wins are quant-general; NVFP4 is just the
-weight quant where they show the largest PERCENTAGE because its weight read is smallest.
-
----
-
-## 7. ONE-LINE VERDICT
-
-14 of the 16 landed paged decode patches (0013-0029) are quant-agnostic: they act on the
-f32 gated-DeltaNet/conv recurrent state, the paged serving host path, or the generic
-MMQ-vs-MMVQ / CUDA-graph routing, none of which read the weight tensor's quant type. Only
-0023 is genuinely NVFP4-specific (and 0017 is NVFP4-only but default-off/inert). The
-hypothesis was right except for 0025, which is quant-agnostic (a generic
-`ggml_is_quantized && should_use_mmq` CUDA-graph guard); the NVFP4-specific "act-quant"
-work it was conflated with is LEVER 3, which never shipped. The opts deliver fixed
-absolute ms savings at any weight quant; the % is largest at NVFP4 only because its
-4-bit weight read makes the fixed recurrence the biggest slice of the step.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
+++ b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
@@ -1,464 +0,0 @@
-# Qwen3.6 NVFP4-vs-NVFP4: llama.cpp vs vLLM on GB10 (DGX Spark)
-
-Apples-to-apples benchmark. Both engines run the **same NVFP4 weights** on the **same box**
-(GB10, sm_121, LPDDR5x unified memory ~273 GB/s). The question is not "who wins the HW
-lottery" but "at matched NVFP4, on one bandwidth-limited box, does our paged llama.cpp
-(patch 0015, expert-density-aware MoE token-tile auto-select, default-on) sit at par with /
-ahead of / behind vLLM?"
-
----
-
-# FINAL shipping benchmark (patch 0023, f32 bit-exact build) — 2026-06-26
-
-This is the **publishable, plot-ready** apples-to-apples result. Both engines at their **best
-realistic config** (no handicapping either side), matched NVFP4 weights, one clean GB10 box
-(LocalAI service containers stopped for the duration, restored after). Raw rows in
-[`final_benchmark.csv`](final_benchmark.csv); per-row checkpoint log in
-[`BENCHMARK_PROGRESS.md`](BENCHMARK_PROGRESS.md).
-
-## Build under test (the clean shipping result)
-
-- **llama.cpp** = patch **0023**, dev tree `~/llama-paged-dev` HEAD **`f7409c2`**, git-clean
-  (the shelved bf16-GDN-state work was reverted; `git diff` empty at HEAD before the
-  `build-cuda` rebuild). Greedy gate confirmed canonical f32 output on both models. The bf16
-  GDN-state path is **shelved** (it fails the f32 KL gate); the shipped plateau is the
-  **95%-bit-exact f32** stack (patches 0018-0023). dense greedy md5 `5951a5b4…`, MoE
-  `07db32c2…` are the 0023 references (the *transcript* md5 also encodes llama-cli UI chrome,
-  which has since changed, so the build was verified instead via the clean git tree + full
-  rebuild + the greedy numerical gate).
-
-## Config (both engines at BEST realistic config)
-
-- **llama-server**: `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
-  `LLAMA_KV_PAGED=1`, **CUDA graphs ON** (`USE_GRAPHS=1`, default), and the QoS prefill budget
-  **`LLAMA_MAX_BATCH_TOKENS=512`** (patch 0016 decode-first dynamic budget). 512 is the
-  `n_ubatch` floor and is the best of the swept budgets: at npl32 it gives 133 s TTFT vs
-  **394 s for stock** (no budget) — lower budget = stronger decode-first = better burst TTFT,
-  and decode throughput is budget-independent.
-- **vLLM 0.23.0**: its strongest honest decode config, **CUDA graphs ON** (NOT
-  `--enforce-eager`; `cudagraph_mode=FULL_AND_PIECEWISE`), `--gpu-memory-utilization 0.85
-  --max-model-len 4096 --max-num-seqs 256 -tp 1`, chunked prefill on, prefix caching off.
-- **Client** (`h2h_cli3.py`, identical async harness both sides): 512-token **unique-nonce**
-  prompt (fresh full prefill every request, defeats all prefix caching), `max_tokens=256`,
-  `temperature=0`, `ignore_eos=True`, streaming with usage; concurrency npl 8/32/64/128.
-- **Precision asymmetry (in llama's disfavour, yet llama still competes)**: llama runs
-  **f32 GDN recurrent state + q8 activations**; vLLM runs **bf16 GDN state + w4a4**. The
-  numbers below are llama at *higher* precision.
-
-## DENSE — Qwen3.6-27B NVFP4 (`q36-27b-nvfp4`)
-
-| npl | engine | decode_agg t/s | decode_perseq t/s | prefill t/s | ttft_mean ms | peak_gb | engine_gb |
-|----:|--------|---------------:|------------------:|------------:|-------------:|--------:|----------:|
-|   8 | llama  | **82.5**  | 9.57 | 507  | 6 038    | 53.5  | 50.2  |
-|   8 | vLLM   | 70.4      | 8.76 | 2096 | 1 861    | 110.9 | 107.6 |
-|  32 | llama  | **192.6** | 4.79 | 115  | 133 552  | 69.6  | 66.3  |
-|  32 | vLLM   | 211.8     | 6.28 | 2183 | 5 353    | 110.9 | 107.6 |
-|  64 | llama  | **277.8** | 3.09 | 96   | 321 619  | 84.0  | 80.6  |
-|  64 | vLLM   | 309.1     | 4.38 | 2089 | 9 512    | 110.9 | 107.6 |
-| 128 | llama  | **384.6** | 1.86 | 70   | 902 763  | 93.8  | 90.5  |
-| 128 | vLLM   | 418.8     | 2.79 | 1929 | 18 450   | 111.0 | 107.6 |
-
-**llama decode as % of vLLM (dense):** npl8 **117%**, npl32 **91%**, npl64 **90%**, npl128 **92%**.
-
-## MoE — Qwen3.6-35B-A3B NVFP4 (`q36-35b-a3b-nvfp4`)
-
-| npl | engine | decode_agg t/s | decode_perseq t/s | prefill t/s | ttft_mean ms | peak_gb | engine_gb |
-|----:|--------|---------------:|------------------:|------------:|-------------:|--------:|----------:|
-|   8 | llama  | 211.8 | 24.45 | 1236 | 2 477   | 39.7  | 36.1  |
-|   8 | vLLM   | 256.5 | 31.84 | 5187 | 769     | 109.6 | 106.3 |
-|  32 | llama  | 393.0 | 10.02 | 1214 | 8 225   | 47.1  | 43.8  |
-|  32 | vLLM   | 500.8 | 14.90 | 6223 | 1 830   | 109.6 | 106.4 |
-|  64 | llama  | 527.0 | 6.15  | 1152 | 15 850  | 57.1  | 53.8  |
-|  64 | vLLM   | 686.1 | 9.83  | 5927 | 3 224   | 109.6 | 106.4 |
-| 128 | llama  | 726.4 | 3.73  | 277  | 213 017 | 61.5  | 58.2  |
-| 128 | vLLM   | 882.2 | 6.05  | 5301 | 6 488   | 109.6 | 106.4 |
-
-**llama decode as % of vLLM (MoE):** npl8 **83%**, npl32 **78%**, npl64 **77%**, npl128 **82%**.
-
-## Plots (decode throughput vs concurrency)
-
-Generated from [`final_benchmark.csv`](final_benchmark.csv) (matplotlib); the per-point label is
-llama as a share of vLLM decode at that concurrency.
-
-![dense decode vs npl](qwen36_dense_decode_vs_npl.png)
-
-![MoE decode vs npl](qwen36_moe_decode_vs_npl.png)
-
-## The honest public story (let the numbers speak)
-
-1. **Decode throughput — the headline.** On the dense 27B, paged llama.cpp **matches/beats
-   vLLM**: 117% of vLLM at npl8 and a steady **90-92%** across npl32-128 — at *higher*
-   precision (f32 GDN state + q8 act vs vLLM bf16 + w4a4). On the MoE 35B-A3B llama lands at
-   **77-83%** of vLLM decode — close, but vLLM's fused grouped-GEMM MoE keeps a clear edge.
-2. **Memory — a decisive llama win.** vLLM's pre-reserved pool is a **flat ~107 GB** at every
-   concurrency (the `--gpu-memory-utilization 0.85` design). llama's **on-demand paged KV**
-   uses **50-90 GB (dense)** and **36-58 GB (MoE)**, growing with load: at the operating point
-   most people actually run (npl≤32) llama uses **~1.5-3× less unified memory**, and even at
-   npl128 it stays below vLLM. This is the "fits where vLLM OOMs" axis.
-3. **TTFT — vLLM's win, llama's disclosed tradeoff.** vLLM's chunked prefill absorbs a
-   128-way simultaneous burst gracefully (6-18 s). llama's decode-first QoS budget protects
-   decode throughput by throttling burst-prefill, so TTFT climbs at high concurrency
-   (dense npl128 **903 s**, MoE npl128 **213 s**). It is *bounded relative to no-budget*
-   (stock is worse) but high in absolute terms under a synchronized burst. Under realistic
-   staggered arrival this is far milder; for a synchronized-burst benchmark it is the cost of
-   the decode-first scheduler. **Decode and memory are unaffected.**
-
-**Bottom line for the GB10 / DGX Spark page:** with matched NVFP4 weights, paged llama.cpp
-delivers **90-117% of vLLM dense decode** and **77-83% of vLLM MoE decode** at **equal-or-higher
-precision** and **1.5-3× lower memory** (on-demand paged KV vs a fixed 107 GB pool). The
-remaining gap is MoE-decode and burst-TTFT, not dense-decode or memory.
-
-## Anomalies / methodology notes (rigour)
-
-- **Paged-pool burst degradation (real, worked around).** After a high-npl burst, a llama
-  server's *subsequent lower-npl* prefill collapses (npl8 fresh = 507 t/s / 6 s TTFT; the same
-  npl8 *after* an npl64 burst = 65 t/s / 64 s TTFT). Decode is unaffected. To measure clean
-  per-config prefill/TTFT, **the llama server is restarted per npl** (cheap vs the prefill
-  cost). vLLM has no such degradation — verified by an end-of-sweep npl8 re-check that matched
-  the opening npl8 (dense 70.4→73.4, MoE 256.5→226.4) — so vLLM uses one server per combo.
-- **Fresh-prefill discipline.** Every measured request uses a unique nonce so prefill is always
-  a full fresh compute (the task's "defeat prefix caching" intent); vLLM ran with
-  `enable_prefix_caching=False`, llama with `cache_prompt:false`. Apples-to-apples.
-- **No bimodality observed.** With per-npl restart + a cheap (ptok=8) graph warmup, the early
-  two-pass checks matched within <0.5% (npl8 486/484 t/s), so the headline uses one stable
-  measured pass per (model,engine,npl).
-- **Clean environment.** The benchmark's peak (dense ~94 GB) plus the idle LocalAI worker's
-  ~30 GB resident model OOM-cycled the service containers on the first attempt and corrupted
-  one run; the `local-ai`/`local-ai-worker` containers were stopped for the measurement
-  (baseline ~3.3 GB, ~120 GB free) and **restarted afterwards** to return the host.
-- **peak_gb** is absolute unified-memory used (`MemTotal-MemAvailable`) peak; `engine_gb` =
-  peak − the ~3.3 GB OS baseline (the per-config engine footprint).
-- **Internal-consistency check (decode_agg vs perseq×npl).** `decode_agg_tps` is the steady-state
-  aggregate over the decode window; `decode_perseq_tps` is each sequence's lifetime rate (output
-  tokens ÷ total request latency, so it *includes* the TTFT queue wait). They coincide when
-  TTFT ≪ decode-window (vLLM npl8: 70.4 vs 70.1, +0.5%) and diverge exactly as TTFT grows, on
-  **both** engines (the agg−perseq×npl gap rises monotonically with `ttft_mean`: vLLM 0.5%→17%,
-  llama 8%→62% across npl8→128, mirroring its 6 s→903 s TTFT). The relationship is governed by
-  TTFT, not a measurement artifact, and the FINAL rows are distinct from the historical patch-0015
-  table (no stale-baseline carry-over).
-
----
-
-## Setup (historical — patch 0015 run; FINAL section above is the shipping 0023 result)
-
-- **Box**: GB10 / DGX Spark, sm_121, unified LPDDR5x (~273 GB/s). Memory figures are
-  unified-memory used GB (`MemTotal-MemAvailable`), so they cover weights + KV + runtime.
-- **llama.cpp**: dev tree `~/llama-paged-dev` branch `paged` HEAD `151343b` (patch 0015),
-  `build-cuda` sm_121, `LLAMA_KV_PAGED=1`, `llama-server -c 131072 --parallel 128 -b 2048
-  -ub 512 -ngl 99 -fa on`. **NOTE: run WITHOUT `max_prefill_tokens` (patch 0013) - see the
-  TTFT caveat in the verdict.**
-- **vLLM**: 0.23.0, `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096
-  --max-num-seqs 256 -tp 1`.
-- **Client**: identical async client for both engines. Per request: 512-token unique prompt
-  (unique leading tokens defeat cross-request prefix caching), `max_tokens=256`,
-  `temperature=0`, `ignore_eos=True`, streaming with usage. Concurrency (npl) swept 8/32/64/128.
-- **Metrics** (localmaxxing.com schema): `decode_agg_tps` (aggregate decode tok/s across all
-  live seqs), `decode_perseq_tps` (mean per-sequence decode), `prefill_tps`, `ttft_mean_ms`,
-  `PEAK_GB` (unified-memory peak).
-
-## The 4 models (NVFP4, matched weights)
-
-| Model | llama.cpp GGUF | vLLM checkpoint | Match |
-|-------|----------------|-----------------|-------|
-| DENSE Qwen3.6-27B (28B dense) | `q36-27b-nvfp4.gguf` (native Blackwell FP4) | `q36-27b-nvfp4-vllm/` (unsloth TRUE W4A4) | clean W4A4 both sides |
-| MoE Qwen3.6-35B-A3B (36B total, ~3B active) | `q36-35b-a3b-nvfp4.gguf` (241 NVFP4 tensors, nvidia weights) | `q36-35b-a3b-nvfp4-vllm/` (nvidia modelopt; vLLM picks Marlin NvFp4 MoE + FA2) | NVFP4 weight-only, identical nvidia weights |
-
----
-
-## Results (decode aggregate tok/s, per-seq, prefill, TTFT, peak GB)
-
-### MoE Qwen3.6-35B-A3B (~3B active)
-
-| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
-|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
-| 8   | llama  | 170.2 | 20.27 | 2813 | 855     | 38.98 |
-| 8   | vLLM   | 202.0 | 24.92 | 4648 | 799     | 111.49 |
-| 32  | llama  | 235.4 | 6.77  | 2005 | 4970    | 43.06 |
-| 32  | vLLM   | 462.0 | 13.59 | 4755 | 2308    | 111.26 |
-| 64  | llama  | 271.7 | 3.88  | 2389 | 7205    | 52.53 |
-| 64  | vLLM   | 624.5 | 8.90  | 4784 | 4072    | 111.46 |
-| 128 | llama  | 292.2 | 2.05  | 657  | 84800   | 61.42 |
-| 128 | vLLM   | 811.1 | 5.46  | 4263 | 7980    | 111.61 |
-
-llama decode as % of vLLM: **84 / 51 / 44 / 36** at npl 8/32/64/128.
-
-### DENSE Qwen3.6-27B
-
-| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
-|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
-| 8   | llama  | 63.8  | 7.60 | 1117 | 2029    | 51.72 |
-| 8   | vLLM   | 64.3  | 7.98 | 1514 | 2593    | 112.07 |
-| 32  | llama  | 108.9 | 3.08 | 752  | 13212   | 61.48 |
-| 32  | vLLM   | 189.8 | 5.57 | 1555 | 7477    | 112.09 |
-| 64  | llama  | 126.2 | 1.78 | 465  | 53818   | 74.90 |
-| 64  | vLLM   | 284.2 | 3.92 | 1526 | 12942   | 112.11 |
-| 128 | llama  | 134.6 | 0.93 | 125  | 491195  | 94.03 |
-| 128 | vLLM   | 390.7 | 2.50 | 1420 | 24806   | 112.12 |
-
-llama decode as % of vLLM: **99 / 57 / 44 / 34** at npl 8/32/64/128.
-
----
-
-## Verdict
-
-**At matched NVFP4 on one GB10 box: llama.cpp is at parity only at low concurrency; vLLM
-scales substantially better as concurrency rises.**
-
-1. **npl=8 (low concurrency): near parity.** Dense 99%, MoE 84% of vLLM decode. The MoE's
-   ~3B active shows: per-seq decode 20-25 tok/s (MoE) vs 8 tok/s (dense) on both engines.
-
-2. **npl>=32 (high concurrency): vLLM pulls decisively ahead** - decode ~1.7-2x (npl32) rising to
-   ~2.8-2.9x (npl128) on both models. vLLM scales monotonically (dense 64->391, MoE 202->811);
-   llama plateaus (dense 64->135, MoE 170->292).
-
-3. **TTFT is the clearest gap, and it is largely self-inflicted here.** llama's TTFT explodes
-   at high concurrency (dense **491 s**, MoE **85 s** at npl128) while vLLM stays bounded (25 s,
-   8 s). **This run used llama WITHOUT `max_prefill_tokens` (patch 0013)** - so 128 concurrent
-   512-token prefills starve each other and the decode. Crucially, that starvation also drags
-   `decode_agg` down: while many slots are stuck prefilling, fewer are actually decoding, so the
-   measured aggregate understates llama's steady-state decode. A re-run with `max_prefill_tokens`
-   (the QoS budget this PR already ships) is expected to bound TTFT AND raise high-concurrency
-   decode by keeping all slots live.
-
-4. **Memory: llama wins on efficiency.** vLLM pre-reserves the whole pool (~112 GB at
-   gpu-mem-util 0.85); llama grows on demand (MoE 39->61 GB, dense 52->94 GB). The paged
-   on-demand KV is materially more memory-efficient / multi-tenant-friendly.
-
-5. **vs the localmaxxing reference (259.5 MoE / 254.8 dense top-speed):** those are single-stream
-   on fast datacenter HW. GB10 per-seq decode tops out far lower (MoE ~25, dense ~8 tok/s at
-   npl8) - the LPDDR5x ~273 GB/s bandwidth floor, as expected. The reference is a ceiling, not a
-   GB10 target.
-
-### Honest bottom line
-
-The "par-or-beat vLLM" goal is **met at low concurrency but NOT at high concurrency** on these
-NVFP4 models. vLLM's continuous-batched decode + bounded prefill scheduling scale better on a
-bandwidth-limited box. Two of the three gap drivers are addressable on our side: (a) **prefill
-starvation** - re-run with `max_prefill_tokens` (patch 0013), which this PR ships; (b) **decode
-batching efficiency at high concurrency** - the runtime/scheduler lever (the small/unsaturated
-regime). The kernel itself is at parity (npl8). Next step: a fair re-run with the prefill budget
-on, plus decode-batch tuning, to get llama's true high-concurrency numbers before concluding the
-absolute gap.
-
----
-
-## Fair re-run (max_prefill_tokens on)
-
-The prior tables ran llama-server **without** the QoS prefill budget (patch 0013). This section
-re-runs the same A/B with `LLAMA_PREFILL_BUDGET` set, sweeping the per-step prompt-token cap over
-**256 / 512 / 1024**. Everything else is byte-identical to the prior run: dev-tree llama-server
-(branch paged, HEAD `151343b`), `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
-`LLAMA_KV_PAGED=1`, same workload (512-token unique prompt, `max_tokens=256`, `temperature=0`,
-`ignore_eos`), same harness (`h2h_moe_sweep.sh` -> `h2h_cli.py`). vLLM numbers are unchanged
-(carried over from the committed dense table, not re-run).
-
-### DENSE Qwen3.6-27B - budget sweep (decode agg tok/s | TTFT mean ms | peak GB)
-
-| npl | metric | stock (no budget) | budget 256 | budget 512 | budget 1024 | vLLM |
-|----:|--------|------------------:|-----------:|-----------:|------------:|-----:|
-| 8   | decode agg | 63.8  | 63.5   | 63.8   | 63.5   | 64.3  |
-| 8   | TTFT ms    | 2029  | 4255   | 3756   | 2653   | 2593  |
-| 32  | decode agg | 108.9 | 105.7  | 107.7  | 108.8  | 189.8 |
-| 32  | TTFT ms    | 13212 | 23114  | 18934  | 13912  | 7477  |
-| 64  | decode agg | 126.2 | 132.0  | 131.2  | 118.2  | 284.2 |
-| 64  | TTFT ms    | 53818 | 109455 | 74272  | 92450  | 12942 |
-| 128 | decode agg | 134.6 | **161.2** | 146.9 | 128.3 | 390.7 |
-| 128 | TTFT ms    | 491195| **305423**| 543448| 424058| 24806 |
-
-Peak host GB is budget-independent (on-demand paged KV grows with concurrency): ~51.5 (npl8) ->
-~61.5 (npl32) -> ~74.7 (npl64) -> ~93.5 (npl128) for every budget, vs vLLM's flat ~112.1.
-
-### Best budget = 256 (only the saturated npl128 regime benefits)
-
-At the fully-saturated point (npl128), **budget 256 is the clear winner on both axes**:
-
-- **decode_agg: 134.6 -> 161.2 tok/s (+19.8%)** vs the starved stock run.
-- **TTFT mean: 491.2 s -> 305.4 s (-37.8%, -186 s)** vs stock.
-- llama decode as % of vLLM at npl128: **34.5% -> 41.3%**. TTFT still ~12x vLLM's 24.8 s.
-
-Larger budgets help less at npl128 (512 -> 146.9 tok/s; 1024 -> 128.3, i.e. ~stock) because a
-looser cap lets a long prefill grab a bigger slice per step and re-introduce decode jitter. So
-the tightest cap (256) protects in-flight decode the most when the box is saturated.
-
-### Honest caveat: this bursty workload is the worst case for TTFT
-
-At npl 8 / 32 / 64 the budget **raised** TTFT (e.g. npl8 2029 -> 4255 ms at budget 256) and left
-decode_agg roughly flat. Reason: the harness fires all N requests simultaneously, so at t=0 there
-is **no in-flight decode to protect** - capping prefill purely defers first tokens. The budget
-only pays off once enough slots are decoding that an unbounded prefill would starve them, which on
-this box happens only at npl128. Budget 1024 tracks stock closely at light load (npl8 TTFT 2653 ~
-stock 2029) because a 512-token prompt fits in one <=1024 step. In a steadier (staggered) arrival
-pattern the budget would protect decode jitter without the burst-TTFT penalty; that regime is not
-exercised here.
-
-### Bottom line (dense)
-
-The prefill budget is a **real but narrow** lever on this workload: at maximum saturation
-(npl128) budget=256 lifts decode_agg ~20% and cuts TTFT ~38% vs the starved run, moving llama
-from 34.5% to 41.3% of vLLM decode. It does **not** close the gap - vLLM still decodes ~2.4x
-faster and keeps TTFT ~12x lower at npl128, and scales monotonically where llama plateaus. At
-light/moderate concurrency the budget is net-negative for TTFT in this all-at-once workload, so it
-should be applied selectively (high-concurrency serving), not as an unconditional default.
-
-## MoE 35B-A3B fair re-run (max_prefill_tokens on)
-
-Same build (HEAD 151343b, P0+P1 patch 0015), same flags (`-c 131072 --parallel 128 -b 2048
--ub 512 -ngl 99 -fa on`, `LLAMA_KV_PAGED=1`), same all-at-once harness (512-tok unique prompt,
-gen 256, temp 0, ignore_eos). Swept the dense winner budget 256 plus neighbor 512.
-
-### Primary table - budget 256 (decode_agg tok/s | TTFT mean ms | peak host GB)
-
-| npl | stock (no budget) | budget 256 (best) | budget 512 | vLLM |
-|----:|------------------:|------------------:|-----------:|-----:|
-| 8   | 170.2 / 855   / -    | 169.3 / 1655  / 38.95 | 172.1 / 1488  / 38.82 | 202.0 / 799  |
-| 32  | 235.4 / 4970  / -    | 239.0 / 9034  / 42.93 | 234.7 / 7260  / 42.72 | 462.0 / 2308 |
-| 64  | 271.7 / 7205  / -    | 277.0 / 16249 / 51.96 | 274.5 / 13660 / 52.53 | 624.5 / 4072 |
-| 128 | 292.2 / 84800 / -    | **333.5 / 98106 / 61.42** | 300.8 / 92470 / 61.45 | 811.1 / 7980 |
-
-Peak host GB (paged KV, budget-independent): ~38.9 (npl8) -> ~42.8 (npl32) -> ~52 (npl64) ->
-~61.4 (npl128). Far below the dense run (94 GB @npl128) - only ~3B params are active, so the KV
-plus activations footprint stays light even fully saturated.
-
-### MoE inverts the dense story: the budget buys decode, NOT TTFT
-
-Unlike the dense 27B (where the stock run was prefill-starved to 491 s TTFT @npl128 and the budget
-cut it 38%), the MoE stock run was **never prefill-starved**: 3B active params make prefill cheap,
-so stock TTFT @npl128 was already only 84.8 s. Capping prefill therefore cannot rescue TTFT - it
-can only **defer first tokens to free decode steps**. Result at npl128 with budget 256:
-
-- **decode_agg: 292.2 -> 333.5 tok/s (+14.1%)** vs the starved stock run.
-- **TTFT mean: 84.8 s -> 98.1 s (+15.7%, WORSE)** - the budget costs latency here.
-- llama decode as % of vLLM @npl128: **36.0% -> 41.1%**. TTFT now ~12.3x vLLM's 7.98 s.
-
-Budget 512 is the milder trade (decode +3.0% to 300.8, TTFT +9.0% to 92.5 s @npl128). Budget 256
-maximizes decode throughput; 512 if you want to bleed less TTFT. At npl 8/32/64 both budgets are
-net-negative or flat on decode and clearly raise TTFT (e.g. npl64 7.2 s -> 16.2 s @b256), the same
-all-at-once burst artifact seen in the dense run.
-
-### Does the ~3B-active decode scale better now? Yes - the plateau is gone
-
-The headline win is the **decode scaling curve**, not any single point:
-
-| npl step | stock decode_agg | budget-256 decode_agg |
-|---------:|-----------------:|----------------------:|
-| 8 -> 32  | 170 -> 235 (+38%) | 169 -> 239 (+41%) |
-| 32 -> 64 | 235 -> 272 (+16%) | 239 -> 277 (+16%) |
-| 64 -> 128| 272 -> 292 (**+7.4%**, plateauing) | 277 -> 333.5 (**+20.4%**, still climbing) |
-
-Stock MoE decode **plateaus** at saturation (+7.4% over the last doubling) because unbounded
-prefills keep stealing steps from the many ready decode slots. Budget 256 removes that ceiling -
-decode keeps climbing +20% into npl128, so more of the 128 slots actually decode concurrently.
-This is the cleanest evidence that patch 0013 protects in-flight decode once enough slots are live.
-
-### Bottom line (MoE)
-
-For the A3B MoE the prefill budget is a **decode-throughput lever, paid for in TTFT** - the mirror
-image of the dense case. Budget 256 lifts decode_agg +14% @npl128 and, more importantly, restores
-monotonic decode scaling (kills the stock plateau), moving llama from 36.0% to 41.1% of vLLM
-decode - the same ~41% ceiling the dense run hit. It does **not** close the gap: vLLM still decodes
-~2.4x faster (811 vs 333.5) and holds TTFT ~12x lower (8.0 s vs 98.1 s) @npl128, and scales
-monotonically and steeply where llama only partially recovers. Net: apply the budget to saturated
-MoE serving when decode throughput is the objective and some extra TTFT is acceptable; for
-latency-sensitive MoE serving leave it off (stock TTFT was already not the bottleneck here).
-
----
-
-## Fair re-run verdict
-
-This is the synthesis after patch 0013 (`max_prefill_tokens` / `LLAMA_PREFILL_BUDGET`) was turned
-on for both models. It answers three questions: how much of the apparent gap was prefill
-starvation, what genuine gap to vLLM remains after that artifact is removed, and where that leaves
-the "par-or-beat vLLM" goal.
-
-### 1. How much did patch 0013 close the gap?
-
-The original (stock) tables blamed two things on llama: an exploding TTFT and a flat decode curve
-at high concurrency. The budget re-run shows these were **two different problems with two
-different root causes**, and only one was prefill starvation.
-
-**Dense 27B - was genuinely prefill-starved.** Dense prefill is expensive (full 28B weights per
-token), so 128 simultaneous 512-token prefills truly starved both first-tokens and decode. Budget
-256 @npl128:
-
-| metric @npl128 | stock | budget 256 | vLLM | what closed |
-|----------------|------:|-----------:|-----:|-------------|
-| TTFT mean | 491.2 s | **305.4 s** (-37.8%) | 24.8 s | starvation real; -186 s recovered |
-| decode_agg | 134.6 | **161.2** (+19.8%) | 390.7 | freed slots now decode |
-| llama as % of vLLM decode | 34.5% | **41.3%** | 100% | +6.8 pts |
-
-Dense llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **99 / 56 / 46 / 41** (was 99/57/44/34).
-The fix moved only the saturated tail; npl 8/32 were never starved and are unchanged.
-
-**MoE 35B-A3B - was NOT prefill-starved (the inversion).** Only ~3B active params, so prefill was
-already cheap and stock TTFT @npl128 was 84.8 s, not dense's 491 s. There was no starvation to
-rescue, so the budget could not cut TTFT - it instead converted deferred prefill into decode
-steps. Budget 256 @npl128:
-
-| metric @npl128 | stock | budget 256 | vLLM | direction |
-|----------------|------:|-----------:|-----:|-----------|
-| TTFT mean | 84.8 s | 98.1 s (+15.7%, WORSE) | 7.98 s | budget costs latency here |
-| decode_agg | 292.2 | **333.5** (+14.1%) | 811.1 | plateau removed |
-| llama as % of vLLM decode | 36.0% | **41.1%** | 100% | +5.1 pts |
-
-MoE llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **84 / 52 / 44 / 41** (was 84/51/44/36).
-The decisive MoE finding is the scaling curve, not the point: stock decode plateaued over the last
-doubling (64->128 = +7.4%); budget 256 restored monotonic scaling (+20.4%), proving the stock flat
-curve was unbounded prefill stealing steps from ready decode slots, not a kernel ceiling.
-
-**Combined takeaway.** Both models converge to the **same ~41% of vLLM decode at npl128** after the
-fix. That convergence is the signal: once prefill starvation is removed, dense and a 12x-cheaper-
-prefill MoE land on the identical ceiling, which means the remaining gap is **not** about prefill
-at all - it is the decode scheduler.
-
-### 2. The honest remaining gap to vLLM
-
-After patch 0013, the residual gap is the **continuous-batched-decode efficiency** lever, and it is
-real, not an artifact:
-
-- vLLM still decodes **~2.4x faster** at npl128 on both models (390.7 vs 161.2 dense; 811.1 vs
-  333.5 MoE).
-- vLLM holds TTFT **~12x lower** at npl128 (24.8 vs 305.4 s dense; 8.0 vs 98.1 s MoE) - and does so
-  while decoding faster, i.e. no latency/throughput trade.
-- **vLLM scales monotonically and steeply** (dense 64->391, MoE 202->811 across npl 8->128); llama,
-  even with the budget, only **partially** recovers its scaling (dense 64->161, MoE 170->334).
-
-The mechanism: vLLM's scheduler interleaves prefill and decode at token granularity (chunked
-prefill + paged continuous batching) every step, keeping the GPU saturated with a near-optimal mix.
-Patch 0013 is a coarser tool - a static per-step prefill **cap** - which protects in-flight decode
-but does not actively schedule the prefill/decode mix, and on the bursty all-at-once harness it
-defers first tokens (the TTFT penalty at npl 8/32/64, and the MoE TTFT regression @npl128). The gap
-that remains is the **quality of the step-by-step batching decision**, not raw kernel speed: at
-npl8 the kernels are at parity (dense 99%, MoE 84%), so the per-token math is competitive - what
-vLLM does better is keeping more sequences productively in-flight every step as concurrency rises.
-
-### 3. Where this leaves "par-or-beat vLLM", and the last lever
-
-**Where llama is competitive today (NVFP4, GB10):**
-
-- **Low concurrency (npl<=8): at parity.** Dense 99%, MoE 84% of vLLM decode, comparable TTFT.
-  For single-user / few-stream local serving - LocalAI's dominant mode - llama.cpp is already
-  there on matched NVFP4.
-- **Memory efficiency: llama wins outright at every concurrency.** On-demand paged KV (dense
-  52->94 GB, MoE 39->61 GB) vs vLLM's flat ~112 GB pre-reservation. On a 128 GB unified box this is
-  the difference between multi-tenant headroom and OOM - a genuine product advantage, not a
-  consolation.
-
-**Where llama is not competitive:** high-concurrency decode throughput (npl>=32), where vLLM is
-~2-2.4x ahead and the budget only narrows it to ~41%.
-
-**The last lever** is therefore *not* another prefill knob (0013 has extracted what a static cap
-can give) and *not* the kernel (at parity @npl8). It is **token-granular continuous-batch
-scheduling**: actively interleaving chunked prefill with decode every step rather than capping
-prefill, so all live slots decode while new prefills trickle in - exactly what closes vLLM's
-monotonic-scaling advantage. A staggered (non-burst) arrival pattern would also let 0013 protect
-decode jitter without the burst-TTFT penalty seen here, narrowing the practical gap for real
-serving traffic that does not arrive all-at-once.
-
-### Bottom line
-
-Patch 0013 is validated and worth shipping as a **selective, high-concurrency QoS lever**: it
-recovers dense TTFT 38% and lifts saturated decode +14-20%, converging both models to ~41% of
-vLLM. But it is honestly **not a gap-closer**. The "par-or-beat vLLM" goal is **met at low
-concurrency and on memory efficiency, and not met at high-concurrency decode throughput.** The
-remaining ~2.4x is a continuous-batched-decode scheduling gap, not a prefill-starvation or kernel
-gap - and that is the next (harder) lever, distinct from anything 0013 can touch.
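The npl128 headline figures in the verdict above can be recomputed from the table rows. The following is a minimal Python sketch, not part of the original doc: the `RUNS` dict and the `summarize` helper are hypothetical names introduced here, and the inputs are copied from the npl128 rows (decode_agg in tok/s, TTFT mean in seconds).

```python
# Each entry is (decode_agg tok/s, TTFT mean s) at npl128, copied from the tables above.
RUNS = {
    "dense": {"stock": (134.6, 491.2), "budget256": (161.2, 305.4), "vllm": (390.7, 24.8)},
    "moe": {"stock": (292.2, 84.8), "budget256": (333.5, 98.1), "vllm": (811.1, 7.98)},
}


def summarize(run):
    """Derive the headline ratios for one model from its stock / budget-256 / vLLM rows."""
    s_dec, s_ttft = run["stock"]
    b_dec, b_ttft = run["budget256"]
    v_dec, v_ttft = run["vllm"]
    return {
        # Decode throughput gain of budget 256 over the stock run, in percent.
        "decode_gain_pct": round(100 * (b_dec / s_dec - 1), 1),
        # TTFT change of budget 256 vs stock, in percent (negative = faster first token).
        "ttft_change_pct": round(100 * (b_ttft / s_ttft - 1), 1),
        # Budget-256 decode throughput as a percentage of vLLM's.
        "pct_of_vllm_decode": round(100 * b_dec / v_dec, 1),
        # How many times higher the budget-256 TTFT is than vLLM's.
        "ttft_vs_vllm_x": round(b_ttft / v_ttft, 1),
    }


if __name__ == "__main__":
    for model, run in RUNS.items():
        print(model, summarize(run))
```

Both models land at roughly 41% of vLLM decode and roughly 12x vLLM's TTFT, while the sign of the TTFT change differs between dense and MoE, which is the inversion the verdict describes.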
--- a/backend/cpp/llama-cpp/patches/paged/README.md
+++ b/backend/cpp/llama-cpp/patches/paged/README.md
@@ -0,0 +1,317 @@
+# LocalAI paged-attention llama.cpp patch series
+
+This directory holds the vendored patch series that turns stock llama.cpp into
+LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The patches are
+applied on top of a pinned upstream llama.cpp at build time; nothing here is a
+fork - it is a source-only `*.patch` stack plus this single canonical doc.
+
+> One-file rule: this README is the canonical reference for the patch series. The
+> only other docs kept in this directory are operational and linked below:
+> - [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
+> - [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
+> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.
+
+---
+
+## 1. What it is
+
+`llama-cpp-localai-paged` is the LocalAI paged-attention llama.cpp backend: a
+vendored patch series over upstream llama.cpp that adds
+
+- a **paged KV cache** (vLLM-style block manager: on-demand fixed-size blocks,
+  free pool, ref-counted blocks) with a **block-table flash-attention** read so
+  the attention kernels index physical cells instead of a contiguous buffer;
+- **cross-request prefix sharing** - concurrent requests that share a long
+  prefix physically reuse one committed copy of the prefix blocks and prefill
+  only their divergent suffix;
+- a **decode-first prefill scheduler** - a dynamic per-step prefill-token budget
+  decoupled from `n_batch`, so a long prefill never freezes co-batched decode;
+- **GB10 / Blackwell NVFP4 decode optimizations** for the Qwen3.6 hybrid
+  gated-DeltaNet (SSM) models, where the recurrent-state plumbing - not the FP4
+  GEMM - dominates the decode step.
+
+It is **pinned to llama.cpp `c299a92c`** ("binaries : Improve rpc-server and
+export-graph-ops names", #25045) and advanced only by a manual, bit-exact-gated
+[pin-sync process](PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
+(see section 7).
+
+The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is
+enabled per-model at runtime via the gallery `options:` knobs (`paged_kv:true`,
+`max_batch_tokens:`, `kv_unified:false`, ...). Against unpatched llama.cpp the
+runtime hooks are inert, so a single `grpc-server.cpp` is shared between the
+clean and the paged build.
+
+---
+
+## 2. Architecture
+
+The decode step on these models breaks into three cost centers; the patch series
+attacks each one.
+
+**Paged KV manager + block-table flash-attn.** A host-side `PagedKVManager`
+(`FreeBlockQueue` / `BlockPool` / chained-hash content cache) hands out
+fixed-size KV blocks on demand and reclaims them per-sequence (ref-counted, with
+copy-on-write for shared prefixes). The attention path reads through a **block
+table** - an `I32 [n_view, n_stream]` position-ordered physical-cell index passed
+as `src[5]` of `ggml_flash_attn_ext` - so the CUDA fattn vec/tile kernels and the
+CPU reference map logical KV index `j` to physical cell `block_table[seq*ne11+j]`
+and read K/V in place. Token-position ordering keeps the flash-attn online-softmax
+reduction order identical to stock. A null block table is the stock contiguous
+read, byte-identical.
+
+**The gated-DeltaNet (GDN / SSM) decode path.** The Qwen3.6 hybrid models are 48
+gated-DeltaNet (linear-attention / SSM) layers + 16 full-attention layers. On
+GB10 the recurrent-state plumbing, not the weight GEMM, is the dominant decode
+cost. The series fuses that plumbing to mirror vLLM's
+`fused_recurrent_gated_delta_rule`: the recurrent state is read from and written
+to its cache slot in place (no copy-back, no `get_rows` materialization), the
+conv state is updated in place, the output projection is reshaped to route to the
+tensor-core MMQ GEMM, and the recurrence kernel is occupancy-retuned - all
+bit-exact (md5-gateable) against the f32 baseline.
+
+**NVFP4 native FP4-MMA on Blackwell.** The NVFP4 dense/expert weight GEMM uses
+Blackwell's native FP4-MMA. The series removes a redundant activation-requantize
+in the MoE broadcast projections (0023, a bit-exact byte copy of identical
+blocks; the only NVFP4-specific patch) and keeps CUDA graphs on for the
+grouped-MMQ MoE decode step (0025). On non-Blackwell hardware the FP4 path falls
+back to dequant.
+
+**The prefill/decode scheduler.** `update_slots()` already emits one unified
+mixed prefill+decode batch per step. The scheduler patches change only the *count*
+of prefill tokens admitted per step: decode tokens are claimed first
+(decode-first), then a dynamic budget `max(n_ubatch, T - D)` (where `D` is the
+live decode load and `T` is `LLAMA_MAX_BATCH_TOKENS`) admits prefill, auto-
+shrinking as decode load rises. Pure scheduler policy, byte-identical when off,
+orthogonal to the paged allocator.
+
+---
+
+## 3. Patch series (0001-0030)
+
+28 patches (0005 and 0027 are intentionally unused). "Bit-exact" = greedy md5 /
+`test-backend-ops` byte-identical to the relevant baseline; the gate methodology
+is in section 5.
+
+### Paged-KV core (0001-0012)
+
+| # | What it does | Bit-exact |
+|---|---|---|
+| 0001 | Vendor the host-side paged KV block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache). Pure C++17, nothing uses it yet. | n/a (no behavior) |
+| 0002 | Place each sequence at permuted, non-contiguous block positions in `find_slot` (proves attention is invariant to physical KV placement). | yes (token-identical) |
+| 0003 | Gather K/V/mask down to each stream's non-empty cells before `build_attn_mha`, position-sorted so the FA reduction order matches stock. | yes |
+| 0004 | Drive paged placement through the vendored manager: blocks popped on demand, returned on seq end. Core kv-cache struct untouched. | yes (stock path byte-identical) |
+| 0006 | Host-side cross-request prefix caching: hash prefix blocks, reuse matching physical blocks (ref-count++), COW-privatise before a divergent write. | yes (default off) |
+| 0007 | Wire the prefix cache into the engine so a new sequence physically shares cached prefix blocks and skips recomputing the shared prefix. | yes (verified byte-identical) |
+| 0008 | Wire cross-request prefix share into the llama-server continuous-batch loop so concurrent shared-prefix requests prefill only the suffix (36x fewer prefill tokens at K=32). | within CUDA batch-shape non-determinism band |
+| 0009 | Replace the per-step gather with an **in-kernel paged read** (block table as `src[5]`); the K/V `get_rows` is gone. Decode step at batch32 691->696ms (was 1279ms gathered). | yes on CPU/batch1; GPU batch>1 within vec-vs-mma band |
+| 0010 | Graft the block-table read into the tile kernel; add a dispatch guard so a present block table routes ONLY to vec/tile (never the mma/wmma kernels that ignore it). | yes (CPU byte-identical; vec route) |
+| 0011 | Route the GQA-grouped F16 decode to the **tile kernel** (native head-group reuse) by default; vec for everything else. Paged decode to within 1.8% of stock. | vs stock-mma: different-kernel rounding; bit-exact vs vec |
+| 0012 | Defensive `GGML_ASSERT(n_view % 64 == 0)` so a future pad/tile change can't silently reintroduce a past-end KV leak on the tile route. | yes (additive assert) |
+
+### Decode-first scheduler (0013, 0016)
+
+| # | What it does | Bit-exact |
+|---|---|---|
+| 0013 | `LLAMA_PREFILL_BUDGET`: a static per-step prefill-token budget decoupled from `n_batch` (vLLM `--max-num-batched-tokens` analogue). Flattens the decode ITL spike a long prefill inflicts (8.5x smaller worst freeze). | yes (off/short = byte-identical; == `-b` chunking) |
+| 0016 | Supersede 0013 with a **dynamic decode-first** budget: `max(n_ubatch, T-D)`, auto-shrinking as decode load `D` rises. Policy-only inside `update_slots()`, zero libllama changes. | yes (default-off byte-identical) |
+
+(0014/0015 are the MoE token-tile levers: 0014 adds `LLAMA_MOE_MMQ_X` (opt-in
+high-batch decode micro-opt, +4.8% on Qwen3-Coder-30B), 0015 makes it a
+default-on, density-aware auto-select that is prefill-safe by construction. Both
+bit-exact. 0017 is the dense FP4-GEMM occupancy-tune track: bit-exact gate green,
+but every cheap occupancy lever regressed on GB10, so nothing is enabled - it
+ships as the parity gate + default-off instrumentation only.)
+
+### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)
+
+These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact.
+
+| # | What it does | Effect (dense q36-27b / MoE q36-35b-a3b @npl128) |
+|---|---|---|
+| 0018 | **In-place SSM state write-back** - the recurrence writes its final state directly into the cache slot, removing the ~225MB/copy D2D memcpy (18.9% of decode time). | dense +23.5% / MoE +18.9% |
+| 0019 | **Fused recurrent-state gather** - the op reads each sequence's prior state directly from `cache[ids[seq]]` (no `get_rows` materialization); race-free in-place + ids read. | dense +37.8% / MoE +35.3% |
+| 0020 | **o_proj MMVQ->MMQ reshape** - collapse the GDN output to 2D so the output projection routes to the M=128 tensor-core MMQ GEMM (was a batch<=8 MMVQ GEMV). The single biggest decode-parity lever. | dense +31.7% (->85.9% of vLLM) / MoE +23.3% |
+| 0021 | **Conv-state in-place fusion** - one `ggml_ssm_conv_update_inplace` op replaces the 4-op conv chain (transpose+concat+conv+silu+ring-cpy), writing the shifted ring state in place. | dense +3.2% / MoE +3.5% |
+| 0022 | **GDN recurrence occupancy/coalescing retune** - column-folding (NUM_WARPS/COLS_PER_WARP) raises memory-level parallelism on the bandwidth-bound B=128 recurrence kernel; per-column f32 FMA order unchanged. 73.4%->84.6% of GB10 peak BW. | dense +11.1% / MoE +8.3% |
+| 0028 | **Recurrent conv-tap gather fusion** - the last `k_get_rows` in the GDN decode path (the conv-state tap gather) becomes an indexed in-kernel read. | dense ~377 t/s / MoE ~784 t/s |
+
+### MoE NVFP4 quant (0023, 0025)
+
+| # | What it does | Bit-exact |
+|---|---|---|
+| 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) |
+| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) |
+
+### Pool reclaim, block-table cache, backend gate, opt-in bf16-SSM
+
+| # | What it does | Bit-exact |
+|---|---|---|
+| 0024 | **Paged-pool burst-reclaim** - truncate trailing blocks on partial-tail `seq_rm`, defrag the free queue when idle, release blocks on slot completion. Fixes the long-server burst-degradation bug (post-burst prefill collapse 488->44 t/s, restored to 532). Host-side accounting only. | yes |
+| 0029 | **Block-table within-step host cache** - the block table is fixed for the whole step; cache it on first build and memcpy it for the other full-attention layers (get_block_table -87%/-91%). | yes, per path (paged-MoE ref `8cb0ce23`) |
+| 0030 | **Fused-op backend gate** - the fused GDN / discriminated SSM_CONV ops are CUDA-family + CPU only; force them off on any non-CUDA compute backend so a Vulkan/SYCL/Metal build can't silently run the wrong plain-conv kernel. | yes on CUDA (byte-identical pre-0030); safety gate elsewhere |
+| 0026 | **Hybrid per-head bf16 SSM state (opt-in)** - `--ssm-bf16-tau` / option `ssm_bf16_tau`: fast-decaying GDN heads (memory length below the tau threshold) persist state as bf16, halving that head's decode byte stream (~+12% decode). | default tau=0 = f32 = **bit-exact**; the bf16 mode is **NOT** bit-exact (~91% same-top-p) |
+
+---
+
+## 4. Benchmarks
+
+Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
+**Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
+S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
+serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](qwen36_dense_decode_vs_npl.png),
+[`qwen36_moe_decode_vs_npl.png`](qwen36_moe_decode_vs_npl.png); raw data
+[`final_benchmark.csv`](final_benchmark.csv).
+
+### (a) + (b) Patched vs stock vs vLLM
+
+The **stock** and **patched** columns are the same binary, env-toggled, on the
+**same harness** (`llama-batched-bench`) - so "x over stock" is an exact
+apples-to-apples measure of the patch series' contribution. The **vLLM** column
+is a **different harness** (vLLM server + client continuous batching), so the
+cross-engine "% of vLLM" is **indicative, not apples-to-apples**.
+
+**Dense Qwen3.6-27B-NVFP4** (t/s):
+
+| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
+|----:|------:|--------:|-----:|------------------:|---------------------:|
+| 8   |  65.7 |   84.0 |  71.1 | 118% | 1.28x |
+| 32  | 113.7 |  204.0 | 207.6 |  98% | 1.79x |
+| 64  | 134.3 |  294.9 | 309.7 |  95% | 2.20x |
+| 128 | 143.5 |  371.2 | 422.4 |  88% | 2.59x |
+
+**MoE Qwen3.6-35B-A3B-NVFP4** (t/s):
+
+| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
+|----:|------:|--------:|------:|-----------------:|---------------------:|
+| 8   | 181.4 |  227.4 |  315.1 | 72% | 1.25x |
+| 32  | 260.8 |  455.7 |  681.9 | 67% | 1.75x |
+| 64  | 306.8 |  612.3 |  765.5 | 80% | 2.00x |
+| 128 | 331.3 |  772.6 | 1011.7 | 76% | 2.33x |
+
+**Caveat on the vLLM column.** Besides the different harness, the vLLM MoE
+@npl128 number here (1011.7 t/s at npp/ntg 128/128) runs *hotter* than the 901 t/s
+reference config (512/256), so the MoE "% of vLLM" reads **76% here vs ~86% at
+the reference config**. Memory: llama uses **1.5-3x less** memory than vLLM.
+
+**Takeaway.** The patch series gives up to **2.59x (dense) / 2.33x (MoE)** over
+stock on the same harness. Dense is **parity-to-ahead of vLLM**; MoE trails - the
+remaining gap is structural (see section 5).
+
+### (c) Apple M4 (16GB) - for curiosity only
+
+No t/s figures: the 24GB NVFP4 GGUF did not finish downloading and would not fit
+in 16GB RAM anyway (it would page to SSD). Architectural findings:
+
+- Metal `supports_op` **excludes NVFP4** from `MUL_MAT` / `MUL_MAT_ID` /
+  `GET_ROWS`, so the FP4 matmuls **fall back to CPU** - there is no Apple
+  FP4-MMA.
+- `GATED_DELTA_NET` and `SSM_CONV` / `SSM_SCAN` **do** have Metal kernels.
+
+Verdict: NVFP4 Qwen3.6 needs **Blackwell FP4-MMA + >24GB RAM**; a 16GB M4 is not
+a fit. A Metal-native Q4_K Qwen3.6 would be a different artifact.
+
+---
+
+## 5. Dev notes - what we learned
+
+**Bit-exact methodology.** Every bit-exact patch is gated two ways: (1) a greedy
+md5 gate - `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France
+is" -n 48 --temp 0 --seed 1 | md5sum`, paged paths prefixed with
+`LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged MoE), on the default
+chat-template path; and (2) `test-backend-ops` (CUDA0 vs CPU oracle) for every
+touched op (`SSM_CONV*`, `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
+
+**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md)).
+Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5
+(`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this
+is a benign FP-accumulation-order difference of the paged attention reduction,
+**KL-validated** against the f16 reference: KLD(paged||f16) 0.13600 <=
+KLD(nonpaged||f16) 0.13660, PPL within +/-0.29, ~zero probability bias - two
+equivalent FP-reorderings of the same quantized model, not a regression. Future
+paged-MoE regressions therefore compare to `8cb0ce23`, not `07db32c2`.
+
+**MoE-parity conclusion** (the residual gap is structural). The two heaviest MoE
+decode kernels - the GDN-SSM recurrence and the NVFP4-expert GEMM - are llama
+**wins** after this series (the recurrence runs at 102.6% of vLLM's bandwidth;
+the GEMM ties vLLM at the LPDDR5x BW floor). The residual gap is **bf16-projection
+bandwidth + the host scheduling loop**, both at the LPDDR5x floor - not a kernel
+llama is losing. The MoE GEMM kernel is *not* where the gap lives.
+
+**Rejected / flat levers** (recorded so they are not re-tried):
+
+- **Lever 2 - graph/stream coverage: FLAT.** Bit-exact graph coverage was
+  exhausted by 0025; more graph/stream overlap is a no-op or small regression on
+  this model.
+- **Lever 3 - act-quant fusion: FLAT.** The W4A4 act-quant tax is removable only
+  by W4A16 (a precision change, rejected) or a structural kernel rewrite; no
+  further bit-exact lever clears it. 0023 already banks the de-dup.
+- **Lever 4 - NVFP4-quantize the bf16 GDN/attn projections: REJECTED (KL-gate fail).**
+  Quantizing the projections to NVFP4 costs ~+6% PPL; vLLM deliberately keeps the
+  same bf16 projections. No-ship.
+- **W4A16-Marlin MoE GEMM: REJECTED.** It would buy a precision upgrade nobody
+  needs at the cost of a ~5% slower kernel; both kernels are already at the BW floor.
+  (The "the win was NVFP4-dense-quant, not the Marlin kernel" dense verdict
+  carries over to MoE.)
+
+**Opt-in bf16-SSM fast mode** (patch 0026, `ssm_bf16_tau`). The design premise -
+that bf16 KL error concentrates in long-memory heads and can be removed by
+keeping them f32 - is **empirically refuted**: the error scales with the bf16
+head *count* and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any useful
+byte saving, and the carry is byte-exact (genuine bf16 rounding, not a bug). The
+byte-saving (and ~+12% decode) is real but cannot meet a strict KL bar, so it
+ships **default-off (f32, bit-exact)** and opt-in only. Do not put a hybrid tau
+in a recommended/gallery config.
+
+---
+
+## 6. Architecture and quant generality
+
+(From the arch-generality and quant-generality audits.)
+
+- **15 of 16 optimizations are quant-AGNOSTIC.** Only **0023** (NVFP4
+  activation-quantize de-dup) is NVFP4-specific. The SSM/paged/MMQ optimizations
+  help **any quant** of these models (the GDN recurrence, conv, gather and
+  o_proj-MMQ levers operate on the f32 recurrent state and the routing layout,
+  not on the weight dtype).
+- **Arch-safe to build everywhere.** NVFP4 use is Blackwell-gated and falls back
+  to dequant on other hardware; the GB10-tuned occupancy params (0022) are
+  perf-only and env-selectable (`GDN_NW` / `GDN_CPW`), so they never change
+  correctness on other GPUs. Patch 0030 makes the fused-op emission CUDA-family +
+  CPU only, so a non-CUDA paged build routes to the safe upstream non-fused path.
+
+---
+
+## 7. Pin + maintenance policy
+
+- **Pinned to llama.cpp `c299a92c`.** The pin is advanced **only** by the manual
+  [`PIN_SYNC`](PIN_SYNC_c299a92c.md) process: rebase the source-only patch series
+  onto the new tip, rebuild on GPU, and pass the bit-exact gate on every path
+  (dense + MoE, paged + non-paged) plus `test-backend-ops`. The `9d5d882d ->
+  c299a92c` jump (23 upstream commits) needed zero patch changes and did not
+  change decode output.
+- **Decoupled from the nightly auto-bumper.** There is deliberately **no**
+  `bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could
+  silently shift the tree out from under the patches.
+- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../../../.github/workflows/llama-cpp-paged-canary.yml)
+  (via [`.github/scripts/paged-canary-apply.sh`](../../../../../.github/scripts/paged-canary-apply.sh))
+  tries the patch series against the latest upstream tip with the build's own
+  strict `git apply`. **Red = upstream drifted past the series -> run a
+  PIN_SYNC** (do not bump the pin blindly). The canary references
+  [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md).
+
+---
+
+## 8. Models
+
+The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery:
+
+| Gallery entry | Weights (HuggingFace) | Notes |
+|---|---|---|
+| `qwen3.6-27b-nvfp4-paged` | [`mudler/Qwen3.6-27B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF) | Dense, native Blackwell NVFP4 (FP4-MMA). |
+| `qwen3.6-35b-a3b-nvfp4-paged` | [`mudler/Qwen3.6-35B-A3B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF) | MoE (256 experts, top-8), `file_type MOSTLY_NVFP4`. |
+
+Both gallery entries set `backend: llama-cpp-localai-paged` and the paged serving config
+(`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`,
+`flash_attention:on`, `context_size`). They intentionally stay bit-exact (no
+`ssm_bf16_tau`). The full backend-split + gallery plan is in
+[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md).
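The decode-first budget rule from section 2 of the README above, `max(n_ubatch, T - D)`, can be sketched in a few lines of Python. This is an illustrative sketch only, not the code inside `update_slots()`: `prefill_budget` and `plan_step` are hypothetical names, and the first-come split of the budget across waiting slots is an assumption that the README does not specify.

```python
from typing import List, Tuple


def prefill_budget(n_ubatch: int, max_batch_tokens: int, n_decode: int) -> int:
    """Prefill tokens admitted this step: max(n_ubatch, T - D).

    T is the per-step token target (LLAMA_MAX_BATCH_TOKENS in the README) and
    D is the live decode load. The n_ubatch floor keeps prefill from being
    starved completely when decode load approaches T.
    """
    return max(n_ubatch, max_batch_tokens - n_decode)


def plan_step(
    n_ubatch: int,
    max_batch_tokens: int,
    n_decode: int,
    pending_prefill: List[int],
) -> Tuple[int, List[int]]:
    """Return (decode tokens, prefill tokens admitted per waiting slot).

    Decode tokens are claimed first; the remaining budget is then handed to
    waiting prompts in arrival order (an assumption made for this sketch).
    """
    budget = prefill_budget(n_ubatch, max_batch_tokens, n_decode)
    admitted = []
    for need in pending_prefill:  # remaining prompt tokens for each waiting slot
        take = min(need, budget)
        admitted.append(take)
        budget -= take
    return n_decode, admitted


if __name__ == "__main__":
    # With T=2048 and 128 decoding slots, 1920 prefill tokens fit in the step.
    print(plan_step(512, 2048, 128, [1000, 1000, 1000]))
```

As the decode load `D` grows, the admitted prefill shrinks until it reaches the `n_ubatch` floor, which is the auto-shrinking behavior the README attributes to patch 0016.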
--- a/backend/cpp/llama-cpp/patches/paged/RMSNORM_FP4_FOLD.md
+++ b/backend/cpp/llama-cpp/patches/paged/RMSNORM_FP4_FOLD.md
@@ -1,400 +0,0 @@
-# RMSNORM_FP4_FOLD.md - ceiling-critic verdict (label ceiling-critic, READ-ONLY, no GPU)
-
-Completeness audit of the post-0022/0023 bit-exact decode surface: is the rms_norm -> fp4
-producer-fold the BEST remaining bit-exact decode lever, or is something better being missed?
-Source: all paged/*.md verdicts + the 0019/0021/0023 patch diffs (local, read-only). No GPU touched.
-
-## Starting line (post-0023)
-- Dense q36-27b-nvfp4: 373.2 t/s @ npl128 = 95.4% of vLLM 391. Dense is UNTOUCHED by 0023.
-- MoE q36-35b-a3b: 758 t/s @ npl128 (0023 +1.73%).
-- Decode = ONE replayed CUDA graph, single stream, 99.94% GPU-busy, 0.06% idle. Removed/folded
-  kernel GPU-time cuts wall 1:1, and DISJOINT folds STACK 1:1 (each removes a distinct kernel).
-- gated_delta_net recurrence = ~50% of the step, at 84.6% peak BW (past vLLM's 82.4%), PLATEAUED.
-
-## TIER 0 - confirmed NO bit-exact lever (dead, do not pursue)
-
-(a) GDN recurrence past 84.6% - NO. The 0022 sweep is MONOTONIC toward grid.z=1: 8x4 (grid.z=4,
-    32 cols/block) = 79.9%, 16x4/8x8 (grid.z=2) = 82.3%, 16x8/32x4 (grid.z=1, all 128 cols in one
-    block = max in-flight independent state-loads per warp) = 84.6%. grid.z>1 is the WRONG direction
-    (fewer cols/block = less memory-level parallelism = lower BW), already measured worse. The only
-    thing past 84.6% is the float4/vectorized load or a different row-partition, BOTH of which
-    repartition which rows a lane sums into the warp-butterfly = a different reduction grouping =
-    breaks md5 (the exact f32x4 trap that was explicitly avoided). 84.6% (230.9 of 273 GB/s) is at
-    the practical LPDDR5x DRAM ceiling AND past vLLM. No bit-exact decomposition exists. FLOOR.
-(b) flash_attn_ext_f16 (3.1%) - NO. 48 CTAs = exactly one full wave, no occupancy headroom, no tail.
-    Every grid knob (split-KV / parallel_blocks / ncols / cols_per_block / KV-retile) changes the
-    online-softmax running-max/sum RESCALE ORDER across KV blocks = forbidden. FLOOR.
-(c) lm_head (nvjet/cublas, 3.1%) - NO. cublas-internal; any algo/kernel swap changes the K-accum
-    order vs the current f32 reference = breaks md5. Already tuned. No knob. NO lever.
-(d) mul_mat_q FP4 GEMM (~24-27%, the biggest bucket) - NO decode lever. P2a (mmq_y=64 / minblocks=2)
-    is bit-exact (1115/805, md5-identical) but MEASURED FLAT on decode (decode mmq -1.1%, stream_k
-    fixup +1.7ms = net worse). The -24.7% is a PREFILL large-N asymptotic number; the m=128 decode
-    GEMM is LPDDR5x-bandwidth-bound and mmq_y is deliberately bandwidth-neutral. FLOOR.
-
-=> Of the four largest buckets (recurrence 50% + GEMM 25% + lm_head 3% + attn 3% = ~81% of the
-   step), NONE has any bit-exact lever left. All remaining headroom lives in the ~12% of small,
-   foldable glue/quantize/gather buckets below.
-
-## TIER 1 - the bit-exact-feasible folds, RANKED by ROI (gain / plumbing+risk)
-
-Confirmed bit-exact-foldable buckets from the post-0021/0022 node trace:
- quantize_mmq_nvfp4 ........ 4.5% (dense-foldable ~2.7% ceiling; fold captures ~2-2.5%)
- k_get_rows_float .......... 1.9-2.1% (STILL LIVE post-0021; pure gather)
- pointwise glue ............ ~3.1% (k_bin_bcast 1.7% + silu/sigmoid output-gate 1.4%; ~1.5-2.5% net)
-
-Rank 1 - POINTWISE ACTIVATION FOLD (~1.5-2.5%, MEDIUM plumbing, NO new ABI). Best ROI/risk of the
-  three. Fold k_bin_bcast residual-adds + gate-muls and the silu/sigmoid output gate into adjacent
-  kernel epilogues/prologues. Pure elementwise f32, same formula+order standalone or folded =
-  byte-identical. STRICT EXCLUSION: do NOT re-fold the rms_norm/l2_norm REDUCTIONS (reduction-tree /
-  eps-placement trap). No frozen ABI, no GEMM surgery. Well-scoped already (NONRECURRENCE Lever #2).
-
-Rank 2 - rms_norm -> fp4 PRODUCER-FOLD (the proposed lever) (~2-2.5% realistic dense, HIGHEST
-  plumbing). LARGEST single clean dense bucket and HIGHEST-confidence ROI (skip-B measured dense
-  +2.7% for the whole quantize; the fold removes the f32 round-trip, keeps the quant compute, so
-  ~2-2.5%). BIT-EXACT VERDICT: SOUND, and NOT the f32x4-trap class. The trap changed a REDUCTION
-  grouping; this fold touches only (i) the sumsq block-reduce, kept BYTE-IDENTICAL, and (ii) the
-  writeback, where the post-norm normalize-MUL is pointwise (order-independent, identical out_i for
-  any thread partition) and the NVFP4 quant is per-16-consecutive PER-THREAD with NO cross-thread
-  shfl (verified in quantize.cu; 0023 already shipped on exactly this property and held the byte
-  gate). Re-partitioning the writeback to 16-consecutive-per-thread therefore changes only WHO
-  writes/quantizes each element, not the VALUES or the reduction. md5-safe. BUT it carries the worst
-  plumbing-to-ROI ratio: 3-op {RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion + a mul_mat_q
-  prequantized-src1 path + the frozen block_fp4_mmq ABI + a per-call scratch pool. This is the
-  LAST-MILE lever, not the first.
-
-Rank 3 - GET_ROWS / STATE-GATHER FOLD (~up to 2%, LOW-MEDIUM plumbing, ZERO reduction risk -
-  but UNDER-SCOPED). k_get_rows_float is STILL 7.29-7.32 ms = ~2.1% of the step post-0021/0022; the
-  0021 author KEPT the build_rs conv-tap + recurrent-state gathers, explicitly deferring them
-  ("tiny; not one of the eliminated buckets"), NOT proving them unfoldable. A gather is a pure copy
-  with NO reduction = the SAFEST possible bit-exact fold (the exact property the 0023 dedup
-  exploited). Folding the residual build_rs gathers into their consuming kernel (read from cache via
-  ids/block-table instead of a pre-gathered f32 scratch, mirroring 0019's gather-free recurrence) is
-  bit-exact by construction. Ranked 3 only because the FOLDABLE FRACTION needs a one-pass source
-  scoping (some of the 2% may be the "tiny" conv-tap part already); the ROI is lower-confidence than
-  Rank 1/2, but the RISK is the lowest of all. THIS IS THE "SOMETHING BEING MISSED": it is a live
-  ~2% bit-exact bucket that the current plan does not address.
-
-## IS THE fp4 FOLD THE RIGHT NEXT BUILD?
-
-DEFENSIBLE, but NOT unambiguously the best by ROI. It is the largest single well-understood
-bit-exact dense bucket and the verdict is sound (no trap). HOWEVER its plumbing is the highest of
-the three folds, and the POINTWISE fold matches its realistic gain (~1.5-2.5%) at MEDIUM plumbing
-with no new ABI, while the GET_ROWS fold offers ~2% at the lowest risk (pure copy). The fp4 fold has
-the worst gain/plumbing ratio of the candidates.
-
-Recommended build order (all bit-exact, all stack 1:1 on the serial single stream):
-  1. POINTWISE activation fold first (cheapest, no ABI, ~1.5-2.5%).
-  2. GET_ROWS gather fold second, after a short source-scoping pass (~up to 2%, lowest risk).
-  3. rms_norm -> fp4 producer-fold LAST (the high-plumbing last mile, ~2-2.5% dense), built only if
-     the remaining gap to the chosen target still justifies the ABI/graph-fusion surgery.
-If the workflow insists on a SINGLE decisive lever and accepts the plumbing, the fp4 fold is the
-biggest one and a legitimate choice - but it should be sequenced after the cheap folds, not before.
-
-## HONEST BIT-EXACT CEILING
-
-The three folds remove DISJOINT kernels on a 99.94%-busy serial stream, so they STACK:
-  ~2-2.5% (fp4) + ~1.5-2.5% (pointwise) + ~2% (get_rows) = ~5.5-7% gross on dense.
-  373 t/s + ~6% = ~393-399 t/s = ~100-102% of vLLM 391.
-=> The bit-exact dense ceiling is vLLM PARITY-to-slightly-ahead (~100%), NOT 95%. Declaring the
-   ceiling at ~95% would leave ~4-5% of identified, bit-exact-FEASIBLE fold headroom unbuilt.
-   Realistic SHIPPABLE ceiling (fold inefficiency + the realistic-vs-ceiling haircut + some buckets
-   resisting clean folding): ~98-100% of vLLM dense. The recurrence (50%) is already past vLLM and
-   at the DRAM floor; attention/lm_head/mul_mat_q have no bit-exact lever; everything left is the
-   ~6% of small folds above. There is no fourth large bit-exact lever hiding anywhere.
-
-Caveat that frames the whole result: vLLM 391 is a LOWER-precision reference (w4a4/w4a16 acts vs
-llama's q8_1; the recurrence is algebraically reassociated). Bit-exact-vs-vLLM is IMPOSSIBLE; the
-only meaningful cross-engine bar is throughput + top-1/KL, and llama at 373 (95%) bit-exact f32 is
-already doing strictly MORE precise arithmetic at near-equal throughput. Closing the last ~5% with
-the folds reaches throughput parity at higher precision - a strong result, but each fold is a
-diminishing 1.5-2.5% at rising plumbing. The bf16-state over-clock (shelved) is the only thing that
-goes materially AHEAD, and it is non-bit-exact (KL-gated), out of scope for this gate.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
-====================================================================================================
-
-# RMS_NORM -> NVFP4 PRODUCER-FOLD - PRECISE IMPLEMENTATION DESIGN (label fold-design, READ-ONLY, no GPU)
-
-Design-only, no GPU. Reads: DGX `~/llama-paged-dev` HEAD f7409c2 (patch 0023) + `git stash@{0}`
-(trackA1-prequant-nvfp4-fused-rmsnorm) + norm.cu/quantize.cu/mmq.cu/mmq.cuh/ggml-cuda.cu/qwen35.cpp.
-
-## 0. One-line verdict
-The fold is bit-exact-FEASIBLE, BUT the Lever-2 stash that exists as the starting point is
-(a) almost certainly bit-INEXACT and (b) was measured FLAT. The single mandatory fix is the
-reduction block_size dispatch; the single thing that makes it not-flat is de-dup-across-siblings
-+ skipping the dead f32 write at the FFN boundary. Build the FFN boundary first, gate on a measured
-per-call producer-vs-removed-quantize win before extending. Honest expectation: ~1.5-2.5% dense
-best case, real risk of flat (Lever-2 precedent). Lower-risk alternative in Section 7.
-
-## 1. Which graph nodes fuse
-Both boundaries already collapse rms_norm+gain into ONE `rms_norm_f32<bs, do_multiply=true>` kernel
-(existing fuse, ggml-cuda.cu:3675). That kernel's f32 output is the byte-exact target.
-
- FFN (STRONGEST), qwen35.cpp:188-192 + build_layer_ffn:478-487:
-  `attn_post_norm = build_norm(cur, RMS)` feeds EXACTLY `ffn_up` + `ffn_gate` (both NVFP4 MMQ at
-  m=128). NO non-NVFP4 consumer (residual = pre-norm `cur`; ffn_down eats silu(gate)*up). => the
-  f32 normed tensor is DEAD once both GEMMs read fp4 -> producer can skip the f32 write. An existing
-  `{MUL_MAT, MUL_MAT, GLU}` fuse (ggml-cuda.cu:3631) already groups up+gate+GLU -> the natural seam.
- GDN/attn (weaker), qwen35.cpp:161 + build_qkvz:228-243:
-  `attn_norm = build_norm(inpL, RMS)` feeds `wqkv` + `wqkv_gate` (NVFP4 MMQ, share src1) AND
-  `ssm_beta` + `ssm_alpha` (small N=n_v_heads -> MMVQ, READ THE f32). => f32 still live, producer
-  MUST write f32 -> smaller win.
- MoE FFN (qwen35moe.cpp) goes via mul_mat_id, already 0023-deduped -> out of scope. Fold = dense only.
-
-## 2. Byte-exact target (norm.cu rms_norm_f32<bs,true>)
-Dispatch (norm.cu:304-380): `bs = (ncols < 1024) ? 256 : 1024`, shmem 32*float.
-```
-for (int col = tid; col < ncols; col += bs) tmp += x[col]*x[col];            // (R1) strided sumsq grouping
-tmp = block_reduce<SUM, bs>(tmp, s_sum);                                      // (R2) tree width depends on bs
-mean = tmp/ncols; scale = rsqrtf(mean+eps);                                   // (R3) exact eps/div
-for (int col = tid; col < ncols; col += bs) dst[col] = scale*x[col]*mul[col]; // (W) per-channel gain, mul_col==col
-```
-(W) is per-column independent (scale block-uniform) -> writeback may be re-partitioned. (R1/R2/R3)
-are the ONLY order-sensitive parts and must stay byte-identical.
-
-## 3. Fused producer kernel (quantize.cu) - deltas vs the stash
-Start from stash `rms_norm_mul_quantize_nvfp4_kernel` + the factored `quantize_nvfp4_write_subblock`
-(verbatim per-thread NVFP4 quant). Required changes:
-1. TEMPLATE on block_size + launch `bs=(ncols<1024)?256:1024` (NOT the stash's hardcoded 256). MANDATORY.
-2. Reduction pass VERBATIM (R1/R2/R3): scalar strided sumsq, `block_reduce<SUM,bs>`, `mean=tmp/ncols`,
-   `scale=rsqrtf(mean+eps)`. Byte-identical once bs matches.
-3. Writeback re-partitioned to 16-consecutive-per-thread: `for s=tid; s<n_sub; s+=bs`, col0=s*16,
-   `v=scale*xr[col]*mul[col]` (col<ncols else 0), amax=max|v|, `quantize_nvfp4_write_subblock(vals,
-   amax, sub, y+ib)`, `ib=k_block*ne11+row`, n_sub=ncols_padded/16. x is re-read (canonical does too).
-4. `template<bool write_f32>`: FALSE at FFN (skip `dr[col]=v` -> drop the producer's f32 store),
-   TRUE at GDN (beta/alpha read it). THIS is what turns re-bucketing into a real traffic cut.
-Buffer ABI frozen: block_fp4_mmq = {uint32_t d4[4]; int8_t qs[128]} = 144B = 9 uint4 = 4*block_q8_1
-(mmq.cuh:53). Same layout quantize_mmq_fp4_cuda emits; GEMM stride
-s12=ne11*ne10_padded*sizeof(block_fp4_mmq)/(QK_K*sizeof(int)).
-
-## 4. mul_mat_q prequantized-src1 plumbing (mmq.cu/mmq.cuh)
-Re-add the stash hook on top of 0023: `ggml_cuda_mul_mat_q(..., const char* src1_prequantized=nullptr)`.
-In the NON-ids branch: if non-null, skip quantize_mmq_fp4_cuda + the local pool alloc, point mmq_args
-src1_q8_1 at it. GEMM byte-UNTOUCHED (the bit-exactness firewall). 0023 ids-branch untouched (orthogonal).
-Sharing across non-adjacent siblings:
- FFN (preferred): extend `{MUL_MAT,MUL_MAT,GLU}` to `{RMS_NORM,MUL,MUL_MAT,MUL_MAT,GLU}` super-fuse;
-  one producer (write_f32=false) + one pool buf spanning both GEMMs + GLU, all in one handler. Clean.
- GDN/general: a scratch cache keyed by the normed tensor ptr (graph-eval lifetime); defer until FFN wins.
-The stash folds only ONE consumer with a stack-scoped qbuf -> the sibling still standalone-quantizes
-(a key reason it was flat; nsys showed quantize 12896->10816, not ->0).
-
-## 5. Bit-exactness argument
-(1) NVFP4 quant of each 16-elem sub-block = PURE per-thread function, NO cross-thread shfl/reduction
-    (quantize.cu; the exact property 0023 shipped on). => writeback re-partition cannot change a byte.
-(2) v=scale*x[col]*mul[col] byte-identical iff scale identical (R1/R2/R3 preserved via bs dispatch)
-    AND expression verbatim (left-assoc, scalar). Per-column independent -> partition-invariant.
-=> produced block_fp4_mmq bytes == standalone == 0022/0023 baseline; GEMM untouched -> md5 held.
-Gate: BATCHED (ne[1]>8) md5 == 5951a5b4 dense + 1115/1115 - NOT just batch=1 (the gate Lever-2 skipped).
-
-## 6. THE TRAP
- block_size trap (the stash's latent bug): canonical = `ncols<1024?256:1024`; qwen35 n_embd is
-  1024/2560/4096 (qwen35.cpp:30-31) -> canonical is rms_norm_f32<1024> (LEVER2 nsys confirms). Stash
-  hardcodes 256 -> different strided grouping {tid,tid+256,...} vs {tid,tid+1024,...} AND 8-warp vs
-  32-warp reduce -> different f32 order -> md5 break. FIX = template+dispatch matching bs.
- f32x4 vectorize trap (recurrence class): do NOT vectorize the sumsq load or align the reduction
-  partition to the 16-consecutive writeback. Keep sumsq scalar + strided-by-bs.
- eps/assoc: `rsqrtf(mean+eps)`, `mean=tmp/ncols`, `(scale*x)*mul`. Never reassociate.
- GEMM K-reduction / stream-k / tile loads: forbidden (NONRECURRENCE FORBIDDEN list). Fold only
-  changes WHO writes src1.
-
-## 7. Contrast with Lever-2 + lower-risk alternative
-Lever-2 (stash) was FLAT (+0.3% dense) and NET-ADDED GPU-time (+2.3% fused vs -1.1% quantize -0.9%
-rms_norm) because it (a) folded only 1 of 2 siblings, (b) always wrote f32, (c) bs=256 (wrong AND
-non-canonical). It md5'd only batch=1 (fuse off) -> bit-inexactness never caught. The new fold beats
-it ONLY with de-dup-both-siblings + skip-dead-f32-at-FFN; without BOTH, expect flat again.
-LOWER-RISK alt (recommend evaluating first): dense quantize DE-DUP, no fold - keep the efficient
-standalone quantize, quantize the shared normed activation ONCE, reuse for wqkv+wqkv_gate /
-ffn_up+ffn_gate (CSE keyed by src1 ptr, the dense analog of 0023). ZERO reduction risk (rms_norm
-untouched), much less plumbing; ceiling ~<=1% (redundant half only), which the fold's de-dup half
-captures anyway. The fold's only incremental value is the f32 round-trip read, which Lever-2 showed
-is easily eaten by the fused kernel's added work.
-
-## 8. Scope + build order (the gate)
-Scope dense qwen35: quantize.cu/.cuh (templated kernel + bs dispatch), mmq.cu/.cuh (src1_prequantized
-on non-ids path), ggml-cuda.cu (FFN super-fuse, gate: NVFP4 src0 + Blackwell + ne[1]>MMVQ_MAX_BATCH_SIZE
-+ ne2==ne3==1 + per-channel gain; flag LLAMA_FUSE_NVFP4_QUANT).
-Build order: (1) FFN super-fuse only (write_f32=false + de-dup); measure per-call producer GPU-time
-vs the two removed quantizes (nsys node trace, same-build flag toggle); SHIP only if decode_agg
-actually lifts AND batched md5==0022/1115. (2) Only if (1) lifts, add the GDN boundary (write_f32=true,
-keyed scratch). Realistic: ~1.5-2.5% dense FFN best case; ceiling +2.7% (skip-ALL) is unreachable
-(fold keeps quant compute+write). If step 1 is flat, dense quantize is at its bit-exact floor -> stop.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
-====================================================================================================
-
-# RE-PROFILE TARGET MEASUREMENT (label reprofile-target, THE GPU agent) - post-0023, HEAD f7409c2
-
-Fresh node-level nsys re-profile of the DENSE decode to confirm the fold target size, foldable
-fraction, critical-path status, and the realistic recoverable ceiling, BEFORE BuildFold commits.
-
-## Build-dir correction (acted on)
-The orchestrator framed `build-cuda-base` as the clean 0023 build. It is NOT: empirically
-`build-cuda-base` = stale pre-0021 (336.71 t/s), the real post-0023 build is `build-cuda` (371.81 t/s,
-git-clean tree, no mmq.cuh P2a remap). All numbers below are from `build-cuda`. (Dense profiling is
-unaffected by the 0023 MoE de-dup knob - dense has no MoE.)
-
-## Confirmed baseline
- llama-batched-bench dense q36-27b-nvfp4 npl128 ntg128: 371.81 t/s, 344 ms/decode-step. CONFIRMS the
-  ~343 ms / ~373 t/s target. (build-cuda-base stale = 336.71 t/s.)
- nsys --cuda-graph-trace=node, 103 steady windowed steps: step span 345.0 ms, mean kernel-busy 99.0%,
-  sum-of-kernels/span = 98.9% (< 100% => no NET overlap; serial single stream, ~1.1% idle).
-
-## Dense decode decomposition (ms/step)
-gated_delta_net 168.06 (49.2%) BINDING | mul_mat_q<NVFP4,128> 93.57 (27.4%) |
-**quantize_mmq_nvfp4 17.55 (5.1%)** | nvjet 12.02 (3.5%) | flash_attn_ext 11.64 (3.4%) |
-ssm_conv 8.56 (2.5%) | k_get_rows_float 7.32 (2.1%) | silu 5.36 | k_bin_bcast(mul) 4.64 |
-stream_k_fixup 3.95 | rms_norm 3.53 (1.0%). TOTAL kernel 341.25.
-
-## quantize_mmq_nvfp4 at the dense decode shape (the answer)
- TOTAL: 17.55 ms/step = 5.1% of kernel time = 5.08% of the 345 ms wall. 496 quant calls/step (1 per
-  NVFP4 GEMM src1). CONFIRMS the verdict's 17.66 ms / ~4.5-5% (the stray "3.7%" reading was wrong).
- Decomposes EXACTLY by input dim K (graph-verified in qwen35.cpp; 64 layers = 48 GDN + 16 attn):
-  - K=5120 (368/step) FOLDABLE: GDN {wqkv, wqkv_gate, beta, alpha} + attn {wq,wk,wv} + both {ffn_up,
-    ffn_gate}. All fed by a plain rms_norm+mul (attn_norm or attn_post_norm). beta/alpha CONFIRMED
-    foldable: they read the same `cur` as wqkv (qwen35.cpp:359/366).
-  - K=6144 (64/step) UNAVOIDABLE: ssm_out (gated-norm = rms_norm + mul(ssm_norm) + mul(silu(gate)),
-    two muls break the chain) + wo (attn-gated producer).
-  - K=17408 (64/step) UNAVOIDABLE: ffn_down (silu(gate)*up producer).
-
-## Foldable portion (measured) - LARGER than the byte-model 2.7%
-The quant kernel is NOT byte-proportional: ffn_down (K=17408) measures 3.62 ms but a byte-model
-predicts 5.75 ms. Small-K quants are launch/overhead-bound (flat ~21.7 us floor, K=5120 vs 6144
-indistinguishable), so the byte model UNDER-counts the numerous small-K (foldable) calls.
- byte-model FOLDABLE  = 9.73 ms = 2.82% of step
- flat-split FOLDABLE  = 11.90 ms = 3.45% of step  (368 small-K quants, the physically correct one)
- => true FOLDABLE raw GPU-time = 9.7 - 11.9 ms = 2.8% - 3.4% of step. UNAVOIDABLE = ssm_out+wo
-  ~2.1 ms + ffn_down 3.62 ms = ~5.7 ms (1.6%).
- Sub-split for the build order: the FFN boundary alone (ffn_up+ffn_gate, f32 DEAD -> cleanest fold)
-  = 128 quants/step ~4.1 ms; the input-norm boundary (wqkv/wqkv_gate/wq/wk/wv, +beta/alpha keep f32)
-  = ~7.8 ms raw but lower net efficiency.
-
-## Critical path: YES (1:1)
-98.9% kern/span, 99.0% busy, single serial stream, no net overlap. The quant kernels are inline on the
-serial decode chain; removing their GPU-time cuts the wall ~1:1. Not a gap-fill (there are no gaps).
-
-## Realistic recoverable - and the honest haircut
-RAW foldable removed = 9.7-11.9 ms. NET recoverable is LESS, for reasons the fold-design + ceiling-critic
-already flagged and this profile does not overturn:
- the fused producer KEEPS the quant compute + the fp4 write (only the f32 round-trip read is saved,
-  and the f32 write is droppable ONLY at the FFN boundary where it is dead);
- Lever-2 precedent: the existing stash fold measured FLAT (+0.3% dense) because it folded 1 of 2
-  siblings, always wrote f32, and used a non-canonical bs=256 reduction;
- TENSION TO FLAG: the critic cites a skip-B probe of only ~+2.7% for the WHOLE quantize, yet the whole
-  quantize is 5.1% on a 98.9%-serial stream (which predicts ~5.1% if cleanly 1:1). Either these small
-  kernels are not perfectly 1:1, or the skip probe is unreliable (same class as the NONREC
-  garbage-routing skip artifact). This caps the realistic NET nearer the conservative end.
-=> Realistic NET recoverable: ~1.5 - 2.5% dense (consistent with fold-design Section 8), real risk of
-   FLAT. Optimistic ceiling if the f32 round-trips fully convert: up to ~3% (371.8 -> ~383 t/s); do not
-   bank above ~2.5%.
-
-## VERDICT (GPU-measurement view)
- The target is REAL: foldable raw GPU-time 9.7-11.9 ms (2.8-3.4%, slightly LARGER than the prior 2.7%
-  byte-model floor), squarely on the single-stream critical path (1:1), bit-exact-FEASIBLE (no precision
-  change), and the largest single clean dense bucket left after the plateaued recurrence.
- BUT the NET recoverable is the contested ~1.5-2.5% with a documented FLAT risk, and this fold has the
-  HIGHEST plumbing of the three identified folds. Worst gain/plumbing ratio of the candidates.
- RECOMMENDATION: build is DEFENSIBLE but should be SEQUENCED AFTER the cheaper pointwise + get_rows
-  folds (per ceiling-critic). If built as the single decisive lever, do the FFN boundary FIRST (cleanest
-  ~4.1 ms, f32 dead), gate per-call producer-GPU-time vs the two removed quantizes, and SHIP only if
-  decode_agg actually lifts AND batched md5 == 5951a5b4 (1115/1115). Kill-switch: if the only bit-exact
-  construction forces re-partitioning the sumsq reduction (changing accumulation order), abort - not
-  bit-exact.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
-====================================================================================================
-
-# BUILD VERDICT (label fold-build, THE GPU agent) - post-0023, HEAD f7409c2 = patch 0023
-
-DECISION: NO BUILD. The bit-exact decode ceiling is effectively reached for any lever that justifies
-its plumbing. The proposed rms_norm -> fp4 producer-fold is NOT built (it was already built once and
-measured FLAT), and the recommended lower-risk alternative (dense quantize de-dup) does NOT have a
-clean, contained construction for the portion that matters. Tree left clean at 0023; nothing committed
-to the code; this verdict appended only.
-
-I extended the read-only agents' analysis with the two things they could not verify from the .md
-verdicts alone: (1) the prior EMPIRICAL fold attempt, and (2) the actual graph/dispatch structure in
-the source. Both kill the build.
-
-## 1. The fp4 producer-fold was ALREADY BUILT and measured FLAT (decisive)
-LEVER2_PROGRESS.md + stash@{0} (trackA1-prequant-nvfp4-fused-rmsnorm) is exactly this fold. Measured:
-  - dense q36-27b-nvfp4 npl128: 333.32 -> 334.44 t/s (+0.3%), npl32 -0.5%
-  - MoE   q36-35b-a3b   npl128: 690.23 -> 690.89 (+0.1%), npl32 -0.3%
-nsys A/B (fusion fires): quantize_mmq_nvfp4 -2080 inst (-1.1%), rms_norm_f32<1024> -2080 (-0.9%),
-NEW rms_norm_mul_quantize_nvfp4 +2080 (+2.3%). NET GPU-time = +0.3%. The fused producer ADDS BACK
-the GPU-time it removes - it RELOCATES work, it does not remove it. The +0.3% wall is exactly
-consistent with strict 1:1 wall scaling on the single serial stream (reprofile's own model). So the
-fold is not a victim of a bad implementation that a rewrite fixes - it is structurally flat: the
-producer must still read x, compute sumsq, normalize, quantize and WRITE the fp4 blocks; the only
-recoverable traffic is the f32 round-trip, which the fused kernel's extra work eats (Lever-2 proved
-this empirically; fold-design Section 7 and reprofile both predicted it). The design's two "fixes"
-(de-dup both siblings + skip dead f32 at FFN) do not change this: the skip-f32 saves one f32 write at
-the FFN boundary only (~0.5% of step), and the de-dup-both-siblings is item 2 below.
-
-## 2. The dense quantize de-dup is NOT a clean analog of 0023 (the meaningful part is infeasible)
-This is the critical finding the read-only agents missed. 0023's MoE de-dup lifted +1.73% because the
-redundancy is INTRA-CALL: inside ONE mul_mat_id, the broadcast (ne11==1) up/gate quantize repeats the
-SAME token n_expert_used times, all within a single ggml_cuda_mul_mat_q call, so de-dup is a contained
-quantize-once + gather with a stack-scoped buffer. NO precedent issue, NO cross-node lifetime.
-The DENSE redundancy is INTER-NODE and that is a different, much harder problem:
-  - The shared-src1 GEMMs are SEPARATE graph nodes. build_qkvz (qwen35.cpp:228-243) emits wqkv MM,
-    reshape, wqkv_gate MM; then ssm_beta MM, reshape, sigmoid; ssm_alpha MM, reshape, add, softplus,
-    mul. The four src1-sharing MMs are INTERSPERSED with reshape/sigmoid/softplus/add/mul - they are
-    NOT consecutive graph nodes, so ggml's consecutive-op fusion framework cannot match them. A
-    contained, single-handler de-dup (the only kind with safe buffer lifetime, like 0023) is impossible
-    for the qkvz bucket.
-  - De-duping them therefore requires graph-level CSE: recognize 2-4 non-adjacent MUL_MAT nodes share
-    src1, quantize once, and keep that pool buffer alive across the intervening nodes until the last
-    sibling GEMM consumes it - under CUDA-graph CAPTURE (buffer addresses baked at capture, the pool
-    must not recycle the buffer between siblings). This is the SAME high-plumbing scratch-pool +
-    src1_prequantized path the fold needs, with real implementation risk (graph-capture
-    non-determinism / crashes), and NO precedent in the tree. fold-design's "much less plumbing"
-    framing for this alternative is optimistic - the hard part (inter-node buffer sharing under graphs)
-    is common to both.
-  - The qkvz bucket (the big one, ~192 redundant quants/step ~= 1.4%) is exactly the inter-node case.
-  - The ONLY contained, tractable dense de-dup is the FFN {MUL_MAT,MUL_MAT,GLU} (consecutive; build_ffn
-    LLM_FFN_PAR). But that existing fusion executes ONLY via ggml_cuda_mul_mat_vec (gated on batch<=8;
-    ggml_cuda_should_fuse_mul_mat_vec_q). At npl128 (m=128) it falls through to two separate MMQ nodes.
-    Adding an MMQ-path branch to quantize src1 once captures only the FFN redundancy = ~64 quants/step
-    ~= 0.5% of the step - below the +-0.3-0.5% bench noise the runs already show, not worth a new
-    fusion code path + the risk to the byte gate.
-
-## 3. The pointwise + get_rows folds are not clean wins either
- Pointwise: the cheap ops are ALREADY fused in the tree - {RMS_NORM,MUL(,ADD)} -> rms_norm_fused
-  (ggml-cuda.cu:4194/4199), {SSM_CONV,(ADD),SILU} -> ssm_conv (4204/4209), {UNARY(silu/sigmoid/
-  softplus),MUL} -> unary_mul (4216). The residual silu 5.36 + k_bin_bcast 4.64 ms is the un-fusable
-  remainder inside the GDN gating chain feeding the 50% binding gated_delta_net kernel; GAP_PROGRESS
-  measured the whole gating-glue ceiling at only 3.35% and folding further means surgery on the binding
-  kernel. Lower-confidence, needs a GPU node-scoping pass - not a clean lever.
- get_rows: 0019 already folded the main recurrent-state gathers; the residual ~2% is an unquantified
-  mix of the conv-tap (already deferred as "tiny") and leftovers - under-scoped, not a confirmed win.
-
-## 4. Tree state / gates
- Dev tree clean at HEAD f7409c2 (git diff empty; mmq.cuh/mmq.cu/quantize.cu no uncommitted diff -
-  no P2a remap to revert). build-cuda = the clean 0023 build (371.81 t/s dense @npl128, per reprofile).
- No code change made -> no md5 gate needed (baseline 27b = 5951a5b4, 35b = 07db32c2 unchanged).
- No GPU build/bench launched (no buildable candidate clears the ROI bar; re-confirming the baseline
-  the reprofile already measured would waste the GPU window).
-
-## 5. FINAL BIT-EXACT CEILING
-Dense q36-27b-nvfp4: 371.81 t/s @npl128 = 95.1% of vLLM 391. MoE q36-35b-a3b: 758.1 @npl128 (0023).
-This is the bit-exact f32 decode plateau and there is no single decisive bit-exact lever left:
-  - gated_delta_net recurrence (~50%) is at 84.6% peak LPDDR5x BW, PAST vLLM (82.4%) - DRAM floor.
-  - mul_mat_q NVFP4 GEMM (~27%), flash_attn (~3.4%), lm_head nvjet (~3.5%) have NO bit-exact lever
-    (any knob changes a K-/softmax-reduction order vs the f32 reference).
-  - The remaining ~5% of small foldable buckets is real GPU-time on the critical path, but the largest
-    piece (the fp4 fold, ~1.5-2.5%) is EMPIRICALLY FLAT, the next (dense qkvz quant de-dup, ~1.4%) has
-    no clean inter-node construction and shares the fold's flat-risk, and the contained remainder is
-    each <=0.5% (FFN de-dup) or entangled in the binding kernel (pointwise) - none clears the
-    plumbing/risk bar for a 1:1 single-stream gain that the bench noise floor (~0.3-0.5%) can swallow.
-FRAME: vLLM 391 is a LOWER-precision (w4a4) reference; bit-exact-vs-vLLM is impossible. llama at 371.81
-bit-exact f32 is doing strictly MORE precise arithmetic at ~95% of vLLM's throughput. The only thing
-that goes materially further is bf16 state (precision change, KL-gated, out of scope, shelved).
-RECOMMENDATION: ship the 0023 plateau as the bit-exact decode result. Do not build the fp4 fold (flat).
-If a future agent insists on the dense qkvz de-dup, it must first build the inter-node graph-CSE
-scratch-pool/CUDA-graph-lifetime plumbing and prove on a same-build flag toggle that decode_agg lifts
-above the +-0.5% noise AND batched md5 == 5951a5b4 - and should expect the Lever-2 flat outcome.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/SERVER_SWEEP.md
+++ b/backend/cpp/llama-cpp/patches/paged/SERVER_SWEEP.md
@@ -1,138 +0,0 @@
-# GB10 same-day head-to-head server sweep: llama-server (paged) vs vLLM
-
-Date: 2026-06-23. Hardware: GB10 / DGX Spark (sm_121, 128 GB LPDDR5x unified, ~273 GB/s
-weight-read floor). GPU otherwise idle (sibling vLLM had exited; LocalAI docker workers
-stopped for the run).
-
-This sweep **replaces** the stale carried "~75-80% of vLLM" figure (commit 07985ba4,
-pre-co-batching, single-point). It measures *real serving* steady-state aggregate decode
-throughput across the full concurrency curve, for three model classes, with one identical
-client driving both engines.
-
-## Method
-
- **llama**: `llama-server` from the paged dev tree (`~/llama-paged-dev/build-cuda`, HEAD =
-  patch 0013 / commit 17d97cb), `LLAMA_KV_PAGED=1`, `-fa on -ngl 999 --parallel 128 -c 65536`.
- **vLLM**: 0.23.0, `vllm serve --enforce-eager --enable-prefix-caching --max-num-seqs >=128
-  --max-model-len 4096` (APC on, eager per the GB10 no-CUDA-graphs edge).
- **Client** (`sweep_client2.py`): N concurrent **non-streaming** `/v1/completions`, short
-  shared prompt, `max_tokens=min_tokens=256`, `ignore_eos=true`. Aggregate decode tok/s =
-  total generated tokens / wall. Non-streaming keeps the Python client off the critical path
-  (one JSON parse per request, not per token), so the **server** is the bottleneck. Validated:
-  vLLM pushed 4227 tok/s through the exact same client where llama topped out at 2087, so the
-  client is not the cap. Both engines use the identical client + prompt -> apples-to-apples.
- npl (concurrency) sweep: 8 / 32 / 64 / 128.
-
-Quant parity:
- Dense: llama **NVFP4-dense GGUF** (weight-only FP4, 16-bit compute) vs vLLM **NVFP4A16**
-  (weight FP4, 16-bit activation) -> matched precision class.
- Small: llama **Q8_0** vs vLLM **bf16** (closest loadable form).
- MoE: llama **mxfp4** GGUF. **vLLM could not serve this MoE on GB10 at all** (see below), so
-  there is no vLLM MoE column.
-
-## Results: aggregate decode tok/s (higher is better)
-
-### Dense 32B  (llama NVFP4-dense  vs  vLLM NVFP4A16)
-
-| npl | llama (NVFP4) | vLLM (NVFP4A16) | llama % of vLLM |
-|----:|--------------:|----------------:|----------------:|
-|   8 |          83.2 |            85.9 |          **96.9%** |
-|  32 |         228.9 |           301.3 |          76.0%  |
-|  64 |         367.1 |           507.8 |          72.3%  |
-| 128 |         520.6 |           604.0 |          86.2%  |
-
-Plateau: neither has plateaued at 128 (both still climbing, weight-read bound). llama is at
-**parity at batch-8** (97%), dips to ~72% mid-curve (npl 32-64), recovers to 86% at 128.
-
-### Small  Qwen3-0.6B  (llama Q8_0  vs  vLLM bf16)
-
-| npl | llama (Q8_0) | vLLM (bf16) | llama % of vLLM |
-|----:|-------------:|------------:|----------------:|
-|   8 |        911.3 |       923.0 |        **98.7%** |
-|  32 |       1701.6 |      2531.4 |        67.2%  |
-|  64 |       1911.7 |      3497.1 |        54.7%  |
-| 128 |       2087.6 |      4227.6 |        49.4%  |
-
-Plateau: **llama plateaus hard** at ~2.0-2.1k by npl 64-128 (+9% from 64->128). vLLM keeps
-scaling (3497 -> 4227). For a tiny runtime-bound model, vLLM's scheduler/batching amortizes
-better; llama-server's per-token host cost (sampling, detok, slot mgmt) caps it. This is the
-worst llama-vs-vLLM ratio in the sweep (down to 49%).
-
-### MoE  Qwen3-Coder-30B-A3B  (llama mxfp4; vLLM = NOT SERVABLE on GB10)
-
-| npl | llama (mxfp4) | vLLM |
-|----:|--------------:|-----:|
-|   8 |         290.0 |  n/a |
-|  32 |         582.5 |  n/a |
-|  64 |         931.8 |  n/a |
-| 128 |        1041.3 |  n/a |
-
-llama-server scales cleanly to **1041 tok/s** at npl 128 with **no npl-128 expert-activation
-cliff** (unlike the prior `llama-batched-bench` MoE numbers 253/505/830/620 that peaked at 64;
-short-prompt continuous batching in the server avoids it).
-
-**vLLM could not serve this MoE on GB10 (two independent failures):**
-1. **bf16** (`Qwen/Qwen3-Coder-30B-A3B-Instruct`, the only HF form on the box): loads the
-   56.9 GB of weights, then **hangs at the MoE warmup** (`Using MoEPrepareAndFinalize
-   NoDPEPModular` -> `Model loading took ...`), GPU 0% util, and **takes the whole box down
-   (hard reboot)**. Reproduced twice. With tight `--gpu-memory-utilization` it still hangs at
-   the same step before the API server ever comes up.
-2. **mxfp4 GGUF** (same weights llama uses): vLLM 0.23.0's GGUF loader **cannot map the fused
-   qwen3moe expert tensors** (`RuntimeError: Failed to map GGUF parameters (48):
-   ['model.layers.N.mlp.experts.gate_up_proj', ...]`). Engine init fails outright.
-
-So on GB10, llama.cpp is the *only* engine of the two that serves this 30B-A3B MoE at all -
-an availability win, independent of throughput.
-
-## Batch-8 anomaly triage (dense NVFP4) -- RESOLVED
-
-The prior mixed-load run reported llama batch-8 steady decode at **471 ms/step (~19% of vLLM
-aggregate, ~17 tok/s)**. This sweep does **not** reproduce it. Clean isolated batch-8 decode:
-
- `llama-server` batch-8 dense paged = **83.2 tok/s** aggregate = ~96 ms/step = **96.9% of
-  vLLM's 85.9** (parity, both at the LPDDR5x weight-read floor).
-
-`llama-batched-bench` cross-check, dense NVFP4, `-npp 16 -ntg 128 -npl 1,8`, the three
-hypotheses isolated (S_TG = decode tok/s aggregate at batch 8):
-
-| config                | batch-8 S_TG t/s | ms/decode-step |
-|-----------------------|-----------------:|---------------:|
-| paged,  ctx 65536     |            90.32 |          88.6  |
-| stock,  ctx 65536     |            88.39 |          90.5  |
-| paged,  ctx 163840    |            89.33 |          89.6  |
-| stock,  ctx 163840    |            87.72 |          91.2  |
-
-Conclusion: clean batch-8 dense decode is **~88-90 tok/s (~89 ms/step) regardless of all three
-suspects**:
- **Paged overhead?** No -- paged is within 2% of stock, and at ctx 65k paged is *faster*
-  (90.3 vs 88.4). The decode path is not paying a paged penalty at batch-8.
- **The 163840-token ctx allocation?** No -- ctx 163840 == ctx 65536 within 1% (89.3 vs 90.3).
-  The large allocation does not slow steady-state decode.
- **NVFP4 decode cost?** This *is* the cost -- ~89 ms/step is the GB10 weight-read floor for a
-  32B at batch-8 (it matches vLLM's 86 tok/s server and exceeds it at the kernel level: 90 vs
-  86). It is the hardware ceiling, not a bug.
-
-The 471 ms/step is ~5.3x slower than this clean floor and is explained by none of the three.
-It was a **mixed-load artifact**: the 8 decoders were time-sharing the GPU with a concurrent
-prefill (a large prompt / chunked prefill landing on the same steps). That decode-vs-prefill
-contention is exactly the stall **patch 0013 (`LLAMA_PREFILL_BUDGET`)** bounds. In steady-state
-isolated decode, batch-8 dense is at **parity with vLLM (97%)**, not 19%.
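
The tok/s and ms/step figures in this triage are two views of the same quantity. A minimal sketch of the conversion, assuming each decode step emits exactly one token per concurrent sequence (the helper names are illustrative and not part of any harness in this tree):

```python
def ms_per_step(aggregate_tok_s: float, npl: int) -> float:
    """Step time in ms, given aggregate decode tok/s at concurrency npl.

    Assumes one token per sequence per decode step, so tokens/step == npl.
    """
    return 1000.0 * npl / aggregate_tok_s


def aggregate_tok_s(step_ms: float, npl: int) -> float:
    """Inverse of ms_per_step: aggregate decode tok/s from the step time."""
    return 1000.0 * npl / step_ms


# Figures quoted in the triage above (batch 8, dense NVFP4).
server_step_ms = ms_per_step(83.2, 8)       # clean llama-server run
bench_step_ms = ms_per_step(90.32, 8)       # llama-batched-bench, paged, ctx 65536
mixed_load_tok_s = aggregate_tok_s(471.0, 8)  # the anomalous mixed-load reading
slowdown = 471.0 / bench_step_ms            # anomaly relative to the clean floor
```

Under that assumption the 471 ms/step reading works out to about 17 tok/s at batch 8, and roughly 5.3x the clean `llama-batched-bench` step time.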
-
-## Aggregate map (replaces the carried 75-80%)
-
-llama-server (paged) as a fraction of vLLM, by regime:
-
- **Low concurrency (batch-8): parity, 97-99%** on both measurable classes. Both engines sit on
-  the LPDDR5x weight-read floor; there is nothing to win.
- **Dense 32B, mid-to-high concurrency: 72-86%.** Dips to ~72% at npl 32-64, recovers to 86% at
-  128. Both still climbing (weight-bound), neither plateaus by 128.
- **Small 0.6B, mid-to-high concurrency: 49-67%.** llama plateaus ~2.0k; vLLM scales to 4.2k.
-  Runtime/scheduler-bound regime -- vLLM's batching wins; this is llama's weakest ratio.
- **MoE 30B-A3B: llama-only.** vLLM cannot serve it on GB10 (bf16 reboots the box at MoE
-  warmup; GGUF expert tensors unmappable). llama serves it at 290 -> 1041 tok/s, scaling
-  cleanly with no npl-128 cliff.
-
-Net: the single "75-80%" number is replaced by a curve. It is *roughly* right only for the
-dense mid-band; it is too optimistic for the small model at high concurrency (49%) and moot for
-MoE (where llama is the only option). The headline is parity at low concurrency and a hardware
-(not engine) ceiling on dense decode.
--- a/backend/cpp/llama-cpp/patches/paged/SPEEDUP_HUNT.md
+++ b/backend/cpp/llama-cpp/patches/paged/SPEEDUP_HUNT.md
@@ -1,567 +0,0 @@
-# SPEEDUP_HUNT.md - the post-0023 vLLM decode close/beat hunt
-
-Accumulator for the four-lever speedup hunt on the clean pin-synced base (llama.cpp
-9d5d882d, bit-exact md5 == 0023 baseline). Levers (current-brief labels):
-A = hybrid per-head SSM precision, B = MoE grouped-GEMM, C = structural dense residual
-(lm_head + graph/launch), D = f16 glue.
-
---
-
-## D - f16 GLUE: confirm lower-priority (label: D-f16-confirm, READ-ONLY no GPU)
-
-Re-read `F16_DENSE_RESIDUAL_PROBE.md` (the lever-D doc) plus `BF16_SSM_STATE_RESULTS.md`
-(lever A's parent work) and `OTHER_PATHS_INVESTIGATION.md` (the B/lm_head + graph
-analysis). Verdict: **D is correctly deprioritized. Dominated by both A and B. Build
-later behind an opt-in flag only if the last ~4% dense is ever chased; do NOT build now.**
-
-### The numbers that pin D below A and B
-
- D's reachable mass is TINY. The dense decode gap to vLLM is ~27 ms/step (llama 332.8 ms
-  vs vLLM 305.7 ms @npl128). 83.2% of the step (recurrence 49.3% + FP4 GEMM 27.4% + FP4
-  act-quant/fixup 6.4%) is ALREADY precision-matched f32/W4A4 on both engines - f16 cannot
-  touch it. The f16-able glue is only **8.4% of the step** (Budget A = 28.74 ms: norms +
-  elementwise + activations + flash_attn + rope + copies).
- f16 does not zero the glue, it halves the bytes of the memory-bound part. Realistic
-  recovery from the probe: ~11 ms (glue only) to ~16 ms (+ the uncertain nvjet GEMM) =
-  **40-60% of the 27 ms residual**. That moves dense parity 91.8% -> ~95-96%, NOT a close.
- The single largest f16-able line (flash_attn 11.9 ms) is the LEAST recoverable (KV is
-  ALREADY f16, the KQ/softmax accumulate stays forced f32 = vLLM does the same). The cleanly
-  recoverable band is just the norms+elementwise+activations (~16.7 ms -> ~8.4 ms saved).
-
-### Dominated by A (parity-and-beyond) and B (the bigger gap) - confirmed
-
- **A dominates on the same dense axis.** A targets the recurrence, which is 49.3% of the
-  dense step - i.e. ~6x the mass D can touch. The bf16-SSM measurement already proved the
-  recurrence kernel halves (-49%/call) and clean dense bf16 hit ~490 t/s = **125% of vLLM**
-  (`BF16_SSM_STATE_RESULTS.md` sec 2). A's hybrid per-head variant keeps the long-memory
-  heads f32 to pass the KL gate that plain bf16 failed (drift FAIL ~10% argmax flips @>=1024
-  ctx) while banking most of that +25-31%. So A is the parity-AND-BEYOND lever on dense;
-  D's ceiling is ~96% parity. A wins outright.
- **B is the bigger gap.** MoE sits at ~82% (726 vs 882) vs dense ~92%; the MoE-specific
-  kernel (mul_mat_q<NVFP4,M-tile=64> grouped GEMM, 26.9% of MoE decode = ~43.5 ms/step) and
-  the W4A4 act-quant tax are real MoE deltas. D is a DENSE-only lever (the MoE step is
-  recurrence + FP4-GEMM + bf16-projection dominated; the f16 glue band is even smaller
-  there) - it does nothing for the larger MoE gap. B addresses where the bench is worst.
- **C overlaps and out-prioritizes D's residual.** The probe's own conclusion: the
-  remaining ~3-4% after f16 is structural (non-FP4 cublas/nvjet GEMM efficiency +
-  graph/launch scheduling), and those help the BIT-EXACT default too, unlike D which is
-  opt-in non-bit-exact. C's graph/launch work is the better long-term dense target.
-
-### Is there a cheap subset of D worth folding into a later build?
-
-**No cheap subset that pays.** The probe maps D to three escalating options:
-
- A flag: does not exist and cannot exist - the F32 stream is STRUCTURAL
-  (`ggml_mul_mat` hardcodes an F32 result, so the residual stream snaps back to F32 after
-  every projection; rms_norm/l2_norm/silu/add/mul/flash_attn/ssm_conv all emit F32).
- **Option 1 (the "cheap" one: per-op f16 on ops that already have f16 paths - silu/sigmoid/
-  softplus/add/mul/rope): NET NEAR-ZERO OR NEGATIVE.** Because the residual stream stays F32,
-  each op must be wrapped cast(F16)->op->cast(F32) = 2 extra `cpy` ops. At decode these ops
-  are tiny and memory-bound, so the cast traffic ~= the op traffic and the win is eaten unless
-  the cast is FUSED into producer/consumer. Crucially Option 1 CANNOT reach the norms - the
-  largest glue item. So the only "cheap" subset is the one that does not actually help.
- Option 2 (the real lever): carry the residual stream in F16 across the layer, which needs
-  NEW F16 template instantiations in norm.cu (rms_norm / l2_norm / fused rms+mul / rms+mul+add,
-  today hard-`GGML_ASSERT(type==F32)`) keeping the f32 reduction, an f16 projection-output
-  path, plus graph-dtype plumbing in qwen35.cpp/llama-graph.cpp. Multi-file, recovers ~11 ms,
-  and is **non-bit-exact** (same gate-failing category as the shelved bf16-SSM state). Not cheap.
-
-There is no fold-in-for-free subset: the only no-new-kernel piece (Option 1) is net-zero, and
-the only piece that captures real mass (Option 2 norm.cu f16 kernels) is a multi-file build.
-
-### THE D PRIORITY CALL
-
-D is correctly deprioritized, below A, B, and C:
- **Reachable mass:** D 8.4% of the dense step vs A's 49.3% recurrence; D is dense-only and
-  does nothing for the bigger MoE (B) gap.
- **Ceiling:** D tops out ~95-96% dense parity; A is already parity-AND-BEYOND (125% clean,
-  hybrid keeps most of it inside the KL gate).
- **Bit-exactness:** D is opt-in NON-bit-exact (same bucket as shelved bf16-SSM and the
-  NVFP4-head); it cannot improve the shipped f32 bit-exact default, whereas C's structural
-  graph/launch work does help the default.
-
-### RECOMMENDATION: build LATER (opt-in only), not now; no cheap subset to fold in
-
-Do NOT build the f16 glue path now. Ship the bit-exact f32 plateau (~95% of vLLM, patches 0018-0023)
-as the default. If the last ~4% dense is ever chased, the ONLY worthwhile piece is Option 2's
-norm.cu f16 kernels + f16 residual stream (recovers the norm/elementwise band, ~11 ms); gate
-it behind an explicit opt-in flag and validate it against the SAME KL threshold that failed
-plain bf16-SSM before shipping. Skip Option 1 entirely (cast overhead eats the win). Prefer
-the structural ~3-4% (non-FP4 cublas GEMM efficiency + graph/launch scheduling, lever C) over
-D, because that helps the bit-exact default too. D stays the lowest-priority of the four levers.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## A - HYBRID PER-HEAD f32/bf16 SSM STATE (label: A-hybrid-design, READ-ONLY no GPU)
-
-Goal: capture most of the whole-bf16 SSM-state win (recurrence -49%/call; dense ~490 t/s = 125% of
-vLLM; MoE +25%) WITHOUT the KL failure (whole-bf16 MeanKLD 0.05-0.17, Same-top-p ~90%, ~10% argmax
-flips @>=1024 ctx). Keep f32 on the long-memory heads (where bf16 rounding does NOT contract and the
-KL error concentrates); bf16 only the fast-decaying heads. Stays at-or-above vLLM precision (vLLM
-keeps ALL temporal state f32) while landing ABOVE vLLM throughput.
-
-### Why the error concentrates in long-memory heads (the physics)
-qwen35/qwen35moe take the NON-KDA path: per (head h, token t) the decay is ONE scalar
-(gated_delta_net.cu `g_val = expf(g[h,t])`, `S <- g_val*S + k(x)delta`). The gate (qwen35.cpp):
-`g[h,t] = ssm_a[h] * softplus(alpha[h,t] + ssm_dt[h])`, with `ssm_a[h] = -exp(A_log[h]) <= 0` =>
-decay = exp(g) in (0,1]. Two STATIC per-head weights set the timescale: ssm_a[h] (tensor
-SSM_A_NOSCAN, [n_v_heads]) = decay-rate SCALE (|ssm_a| small => structurally long-memory); ssm_dt[h]
-(SSM_DT "bias", [n_v_heads]) = softplus operating point. bf16 carry-error per step is contracting,
-bounded ~`eps*tau_h`, eps~2^-8~3.9e-3, head memory length `tau_h ~ 1/(|ssm_a[h]|*softplus(ssm_dt[h]))`
-tokens. Error scales LINEARLY with tau_h => long-memory heads blow up the KL (matches the measured
-plateau-but-large failure). Keep those f32.
-
-### Classification: per-head STATIC, at model load (NOT per-token)
-g is per-token but the long-vs-fast PROPERTY is per-head static (dominated by ssm_a/ssm_dt). A cache
-row's dtype must be stable across the sequence => a per-token threshold is impossible; classify ONCE
-at load into a per-(layer,head) dtype mask.
- TIER 1 (default, zero-cost, deterministic): pure-weights. `tau_h = 1/(|ssm_a[il][h]|*
-  softplus(ssm_dt[il][h]))`; keep f32 if tau_h > T_thresh, else bf16. T_thresh is THE knob (start
-  32-64; sweep on GateBench). eps*tau_h => a single T_thresh sets a uniform per-head error ceiling.
- TIER 2 (optional): short calibration pass measures per-head time-mean of actual exp(g[h,t]); write
-  mask to a model-hash sidecar (paid once). Use only if Tier 1 lands just above the gate.
-cparam `ssm_hybrid_tau_thresh` / `--ssm-bf16-tau`: inf => all-f32 (today's bit-exact default); 0 =>
-all-bf16 (the shelved mode); the hybrid band is in between.
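
The Tier-1 rule can be sketched in a few lines. This is only an illustration of the formula as written above, with made-up per-head values; `classify_heads` and its arguments are hypothetical names, not functions in the tree. Note that under this rule a threshold of 0 keeps every head f32, so the sketch does not model the `inf`/`0` endpoint convention quoted for the cparam.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def classify_heads(ssm_a, ssm_dt, tau_thresh):
    """Return (is_bf16 mask, tau per head) for one layer.

    tau_h = 1 / (|ssm_a[h]| * softplus(ssm_dt[h])) is the head's memory
    length in tokens. Heads with tau_h > tau_thresh stay f32 (long memory);
    the rest are marked bf16.
    """
    mask, taus = [], []
    for a, dt in zip(ssm_a, ssm_dt):
        rate = abs(a) * softplus(dt)
        tau = math.inf if rate == 0.0 else 1.0 / rate
        taus.append(tau)
        mask.append(tau <= tau_thresh)  # True => bf16, False => keep f32
    return mask, taus


# Hypothetical two-head layer: one slow-decaying head, one fast-decaying head.
mask, taus = classify_heads(ssm_a=[-0.01, -1.0], ssm_dt=[0.0, 0.0], tau_thresh=32.0)
```

Because the inputs are static weights, the mask is computed once at load and never changes during a sequence, which is what keeps each cache row's dtype stable.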
-
-### Mixed-dtype cache layout: two homogeneous partitions per slot (packed)
-Split persisted s_l ([S_v,S_v,H,slots] f32, n_embd_s=S_v*S_v*H) into TWO dtype-homogeneous sub-caches
-sized by head COUNT (this is what saves bytes): `s_l_f32 [S_v*S_v*n_f32, slots]` f32 +
-`s_l_bf16 [S_v*S_v*n_bf16, slots]` bf16. Static map `head_slot[h]={is_bf16, local_idx}`. q/k/v/g/beta
-KEEP natural head order (no activation permute). Block h_idx -> head_slot -> base + local_idx*S_v*S_v.
-Recurrence R+W bytes scale by `f_bytes = (n_f32 + n_bf16/2)/H = 1 - 0.5*(n_bf16/H)`. In-place/ids
-identity stays race-free (each head writes its own partition; read==write slot, registers before
-store). (Cheaper coarse fallback = per-LAYER dtype, near-zero layout code, but long-memory heads span
-most layers => too coarse; per-head is the right granularity.)
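
A host-side sketch of the packed two-partition layout described above, assuming the per-head mask is already known. The names `build_head_slots`, `state_offset` and `f_bytes` are illustrative only; the actual cache code lives in C++ and is not reproduced here.

```python
def build_head_slots(head_is_bf16):
    """Map each head (natural order) to (is_bf16, local_idx) in its partition.

    Returns the map plus the head counts that size the two sub-caches.
    """
    slots, n_f32, n_bf16 = [], 0, 0
    for is_bf16 in head_is_bf16:
        if is_bf16:
            slots.append((True, n_bf16))
            n_bf16 += 1
        else:
            slots.append((False, n_f32))
            n_f32 += 1
    return slots, n_f32, n_bf16


def state_offset(slot, s_v):
    """Element offset of a head's S_v x S_v state inside its own partition."""
    _, local_idx = slot
    return local_idx * s_v * s_v


def f_bytes(n_f32, n_bf16):
    """Recurrence read+write bytes relative to an all-f32 cache."""
    return (n_f32 + n_bf16 / 2) / (n_f32 + n_bf16)


# Hypothetical 4-head layer: heads 1 and 2 are fast-decaying (bf16).
slots, n_f32, n_bf16 = build_head_slots([False, True, True, False])
```

Only the state is repacked; q/k/v/g/beta keep their natural head order, so the map is consulted once per block to pick a base pointer and an offset.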
-
-### Kernel: single launch, runtime per-head branch (on top of BF16_SSM_STATE.diff)
-Reuse the existing bf16 plumbing (gdn_state_t alias, __bfloat162float load / __float2bfloat16 store,
-gather template, dtype-detect dispatcher). Hybrid change: pass BOTH bases (`const float* s_f32_base`,
-`const nv_bfloat16* s_bf16_base`, + the two state_dst views) + device `head_slot[]`; branch load/store
-on `head_slot[h_idx].is_bf16` (UNIFORM per block => no warp divergence). Recurrence math byte-for-byte
-untouched (f32 registers). keep_rs_t snapshots stay f32 (op-output scratch). gdn_gather_nonident
-becomes per-head dtype-aware (still disjoint-scratch race-free). ONE op call + ONE launch.
-
-### KL-gate plan + estimated pass / f32 fraction / speedup
-KLD contribution ~ (eps*tau_h)^2 => dominated by the top-tau heads; removing the top ~25-40% by tau
-cuts MeanKLD 1-2 orders. Honest estimate: ~30-40% f32 PASSES Same-top-p>=99.5% and brings MeanKLD to
-1e-3..1e-2; strict <1e-3 may need ~40-50% f32. Find the exact fraction by sweeping T_thresh on the
-EXISTING GateBench harness (noise floor -> 256-tok gate -> drift sweep 256/1024/2048/4096, both
-models). Hybrid is STRICTLY safer than vLLM (vLLM = all-f32 temporal; we f32 exactly the unsafe
-heads). Long-memory heads are the minority (~20-40%) => design band f in [0.30, 0.50].
-Speedup (dense, bandwidth-bound recurrence, graphs-off): f32 3.38 ms/call, whole-bf16 1.73 (-49%);
-hybrid ~ f_bytes*3.38 => f=0.30 -> 2.20 ms (-35%, ~70% of bf16 win); f=0.50 -> 2.54 ms (-25%, ~50%).
-Throughput (dense f32 ~371-384=95% vLLM; whole-bf16 ~490=125%; vLLM ref 419): f=0.30 -> ~454 t/s
-(~108% vLLM, gate-likely); f=0.50 -> ~430 t/s (~103% vLLM, most robust). MoE: smaller absolute
-recurrence (31 GDN layers, H_v=32) + MUL_MAT_ID-bound step (lever B) => hybrid keeps the +13-25%
-recurrence share KL-passing but does not alone close the MoE GEMM gap. Joint gate: nsys per-call bytes
-down AND KL<1e-3 both models.
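
The per-call estimates above follow from treating the recurrence as purely bandwidth-bound, so time scales with state bytes. A small sketch of that arithmetic, using the measured 3.38 ms (f32) and 1.73 ms (whole-bf16) per-call figures quoted above; the function names are illustrative:

```python
F32_MS = 3.38   # measured f32 recurrence, ms/call (from the text above)
BF16_MS = 1.73  # measured whole-bf16 recurrence, ms/call


def hybrid_ms(f32_fraction: float) -> float:
    """Estimated ms/call when a fraction f of heads stay f32.

    bf16 heads move half the bytes, so relative bytes = f + (1 - f) / 2.
    This is a bandwidth-only model and ignores any fixed per-call cost.
    """
    rel_bytes = f32_fraction + (1.0 - f32_fraction) / 2.0
    return rel_bytes * F32_MS


def share_of_bf16_win(f32_fraction: float) -> float:
    """How much of the measured f32 -> bf16 saving the hybrid retains."""
    return (F32_MS - hybrid_ms(f32_fraction)) / (F32_MS - BF16_MS)


est_30 = hybrid_ms(0.30)
est_50 = hybrid_ms(0.50)
```

The model reproduces the quoted ~2.20 ms and ~2.54 ms points; the throughput figures are separate extrapolations and are not derived by this sketch.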
-
-### Scope on top of BF16_SSM_STATE.diff
-Reuse verbatim: gdn_state_t alias, templated load/store, gather template, dispatcher dtype-detect,
-type_s/type_r cparams, CPU mirror, back-compat row convert, bf16 fill, test-backend-ops bf16 cases.
-NEW: (1) classifier ~80-150 LOC (host fn over ssm_a/ssm_dt -> head_is_bf16[layer][head] + counts +
-T_thresh cparam/CLI; optional Tier-2 calib+sidecar). (2) split cache layout ~150-250 LOC (BIGGEST:
-llama-memory-recurrent.cpp alloc s_l_f32+s_l_bf16 by per-layer counts; build_rs builds two views +
-passes head_slot; n_embd_s split). (3) kernel ~120-200 LOC (two bases + device map, runtime per-head
-branch at load/in-place-store/gather/dispatch; math untouched; STATE_BF16 template stays as the
-all-bf16 case). (4) ids/in-place per-head (state_dst two partition views; per-head gather; identity
-unchanged). (5) CPU mirror per-head branch. (6) test-backend-ops MIXED-dtype-state case (decode +
-multi-token prefill + keep_rs_t = the R2 corruption net). (7) gate: sweep T_thresh for min-f32 passing
-KL<1e-3 + Same-top-p>=99.5% + drift both models; nsys per-call confirms f_bytes; md5 that T_thresh=inf
-reproduces the f32 baseline (bit-exact opt-out preserved).
-
-Net: principled path ABOVE vLLM throughput (dense ~430-454 vs vLLM 419) at-or-above vLLM precision,
-KL-gated. Biggest new item = the split-tensor cache layout; classifier + kernel bounded; gate is a
-threshold sweep on the existing harness.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## B - MoE GROUPED-GEMM + RE-GRAPH (label: B-moe-profile-design, THE GPU AGENT)
-
-GPU-measured on DGX GB10 (sm_121), dev tree `~/llama-paged-dev` HEAD `2ee65c2` (patch 0024; the
-decode kernels are byte-identical to 0023/f7409c2 - 0024 is the serving-only burst-reclaim).
-`build-cuda`, model `q36-35b-a3b-nvfp4`, `llama-batched-bench -fa on -npp 128 -ntg 128`,
-`LLAMA_KV_PAGED=1`. `decode_agg = S_TG t/s`. Batched-bench is the clean-kernel measure (no server
-scheduler overhead), so its npl128 = ~743 t/s sits ABOVE the server final_benchmark 726 t/s; the
-re-graph % gain below transfers to both paths (same kernels, same graph-disable).
-
-### 1. MoE decode decomposition @npl128 - RE-CONFIRMED on the current HEAD
-
-Fresh nsys `--cuda-graph-trace=node`, decode-isolated steady window, % of summed kernel GPU-time
-(reproduces the 0023 profile in `OTHER_PATHS_INVESTIGATION.md` A.2/D within noise; window is
-95.4% kernels-only busy / 96.8% with memcpy = GPU-compute-bound):
-
-```
- 42.3%  gated_delta_net_cuda            REC  (shared w/ dense; ALREADY tuned past vLLM, 0018-0022: 84.6% vs 82.4% peak BW)
-~29.5%  mul_mat_q<NVFP4>                MoE FP4 GEMM = grouped M-tile=64 (~27%, biggest MoE-specific) + router M-tile=128 (~2.3%)
-~10.5%  nvjet_sm121 + cutlass (bf16)    attn/gdn bf16 projections + the BF16 lm_head (path B)
-  3.1%  k_get_rows_float                REC state gather
-  2.7%  k_bin_bcast                     expert-combine + routing-weight scale + glue
-  2.1%  ssm_conv_update_f32             REC
-  2.0%  quantize_mmq_nvfp4              W4A4 activation-quant tax (3.25 ms/step; vLLM-W4A16 avoids it)
-  1.8%  convert_unary bf16<->f32        glue around the bf16 projections
-  1.4%  MEMCPY-DtoD                     (SSM state copy fused away by 0018-0019; now small)
-  0.5%  mul_mat_q_stream_k_fixup
-  0.32% mm_ids_helper
-  0.19% argsort
-  0.14% gather_mmq_fp4                  (0023 dedup)
-  0.3%  flash_attn
-```
-
-Bucketed: **Recurrence/SSM ~48% (shared, tuned past vLLM, NOT a MoE lever)**; **MoE FP4 GEMM+routing
-~33%**; **bf16 projections ~10.5%**; act-quant tax ~2%; attention ~0.3%.
-
-### 2. RE-GRAPH the MoE decode step - TESTED + MEASURED (the headline finding)
-
-**Un-graphed status CONFIRMED, and the disable is purely conservative.** NVFP4 on sm_121 has
-`get_mmvq_mmid_max_batch_turing_plus(NVFP4)=8` (`mmvq.cu:139-148`). At MoE decode `ne[2]=npl > 8`,
-so every MUL_MAT_ID node trips the disable in `ggml_cuda_graph_check_compability`
-(`ggml-cuda.cu:3278`: `node->ne[2] > mmvq_mmid_max => use_cuda_graph=false` for the WHOLE step).
-BUT the path actually taken at `ne[2]>8` on Blackwell NVFP4 is `ggml_cuda_should_use_mmq()==true`
-(`ggml-cuda.cu:2664`) -> the **grouped stream-k `mul_mat_q` id-branch**, launched on one stream with
-**NO host sync** (verified: zero `cudaStreamSynchronize`/`Memcpy` in `mmq.cu`/`mmid.cu`). The stream
-sync the disable guards against lives ONLY in the per-expert host-loop fallback, which is never
-reached when `should_use_mmq` is true. So graphs are SAFE for the grouped path; the disable is a
-conservative over-guard (upstream TODO + ggml-org/llama.cpp#18958).
-
-**The lever (env-gated, bit-exact, built+measured here).** Relax the disable when the node takes
-the grouped MMQ path. Patch (one function, one TU, 9 s incremental build):
-
-```c
-// ggml-cuda.cu  ggml_cuda_graph_check_compability(), [TAG_MUL_MAT_ID_CUDA_GRAPHS]
-bool mmid_needs_sync = !ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max;
-if (mmid_needs_sync && ggml_is_quantized(node->src[0]->type) &&
-    getenv("LLAMA_MOE_FORCE_GRAPHS") != nullptr &&
-    ggml_cuda_should_use_mmq(node->src[0]->type, cc, node->src[1]->ne[2], node->src[0]->ne[2])) {
-    mmid_needs_sync = false;   // grouped stream-k id-path is sync-free => graph-safe
-}
-if (mmid_needs_sync) { use_cuda_graph = false; ... }
-```
-
-**Measured A/B (2 reps each, rock-solid; OFF=stock graphs-disabled, ON=LLAMA_MOE_FORCE_GRAPHS=1):**
-
-| npl | OFF decode_agg | ON decode_agg | gain | OFF %vLLM | ON %vLLM |
-|----:|---------------:|--------------:|-----:|----------:|---------:|
-|   8 | 226.0 | 226.4 | +0.2% (noise) | 88% | 88% |
-|  32 | 433.8 | 452.7 | **+4.4%** | 86.6% | **90.4%** |
-|  64 | 589.0 | 605.9 | **+2.9%** | 85.9% | **88.3%** |
-| 128 | 743.1 | 757.1 | **+1.9%** | 84.2% | **85.8%** |
-
-(npl 8: ne2=8<=mmid_max, so the MMVQ path already graphs and FORCE is inert. vLLM ref
-256.5 / 500.8 / 686.1 / 882.2.) The win is largest at small batch (more host-launch
-overhead relative to kernel work) and shrinks as kernels dominate at npl128 - exactly the ~1.7%
-within-step launch-idle the prior agent measured at 98.3% GPU-busy. This REFINES the prior "graphs
-won't help npl128" verdict: it DOES help (+1.9%, above noise), and helps npl32-64 substantially
-(+3-4%). **Bit-exact by construction** (graph replay re-issues the identical kernel sequence with
-identical args; FORCE only flips `use_cuda_graph`; the shipped f32 dense path already runs graphed).
-**Bit-exact gate - both PASS (measured):** `test-backend-ops -o MUL_MAT_ID -b CUDA0` = **806/806,
-CUDA0 OK** (the grouped FP4 kernel is untouched - the edit is host-only graph-compat logic); and a
-**parallel-greedy np16** run (ne2=16>8, i.e. the grouped MMQ path under graphs ON vs eager OFF) gives
-**byte-identical generated content ON==OFF** (md5 `04c4761...` both, 16/16 completions, diff empty).
-**SHIP CANDIDATE -> patch 0025** (default-off env now; safe to flip to `should_use_mmq`-gated
-default-ON since it is a pure, gated, bit-exact win).
-
-### 3. Grouped-GEMM occupancy headroom - EXHAUSTED on this model (cheap levers), one structural lever left
-
- The FP4-MMA `mul_mat_q<NVFP4>` is **register-bound to 1 CTA/SM** (`__launch_bounds__(256,1)`,
-  ~255 regs/thread = ~12.5% thread occupancy). Grouped grids: ~2048 and ~8192 64-wide tiles.
- **M-tile (col-tile) axis NEUTRAL** (runtime `LLAMA_MOE_DECODE_TILE`, npl128): TILE32 742.4 /
-  TILE64 744.2 / TILE96 747.1 - all within 0.6%. Re-confirms patch 0015: this 256-tiny-expert model
-  is **bandwidth/SSM-bound, not col-tile-occupancy-bound**, so the M-tile lever has nothing to bite.
- **Cheap occupancy lever already measured (patch 0017):** compile-time `GGML_CUDA_FP4_MINBLOCKS=2`
-  on MoE @npl128 = **+0.4% (noise)**, and nsys showed it makes the dense FP4 GEMM **+8.7% SLOWER**
-  (register-cap spills, occupancy did not usefully rise). So the cheap register-cap lever is spent.
- **Only untested grouped-GEMM lever = the structural `mmq_y`-down (nwarps=4 warp-remap)** - the
-  0017-deferred P2. `mmq_y` tiles N (weight rows), not M, so shrinking it does NOT re-read weights
-  (BW-neutral) and raises resident CTAs. Bit-exact (warp/fragment remap, same FP4-MMA math), but a
-  real kernel change (the `nwarps x tile_C::I == mmq_y` static_assert coupling), and predicted
-  BOUNDED on this BW-bound model. Not a cheap toggle; do only if the re-graph + M1 banks are
-  insufficient.
-
-### 4. W4A16 option (skip the act-quant, vLLM's Marlin choice) - NOT recommended
-
-vLLM on GB10 runs **MARLIN W4A16** MoE (engine-log confirmed: "Your GPU does not have native FP4 ...
-Marlin kernel"): bf16 activations NEVER quantized, FP4 weights dequant-in-kernel to bf16, **bf16
-MMA**, under a full CUDA graph. It does this because CUTLASS's native-FP4 grouped GEMM is broken on
-consumer sm_121 (whitelists only sm_100/103 datacenter Blackwell). llama instead runs **native
-Blackwell FP4-MMA W4A4** grouped stream-k - a HIGHER arithmetic tier (GB10 FP4 = 2x INT8/BF16 rate).
-The W4A4 act-quant tax llama pays (`quantize_mmq_nvfp4`) is **only ~2.0% of MoE decode** (3.25 ms/step
-after the 0023 up/gate dedup). Adopting W4A16 to erase it would: (a) be **NOT bit-exact** (bf16 acts
-!= FP4 acts -> different logits); (b) **descend to BF16-class MMA** (concede GB10's 2x FP4 rate - the
-grouped GEMM, ~27% of the step, would run at HALF the MMA rate); (c) re-enter the **W4A16 occupancy
-wall** (the prior GB10 W4A16 effort plateaued ~9 TFLOP/178 t/s). The BW saving is a sliver (acts are
-tiny vs the ~weight read at M~4/expert), so it trades a bit-exact 2% for a non-bit-exact, slower,
-occupancy-hostile path. **Reject.** The act-quant tax is better attacked bit-exactly via the down_proj
-quantize retune (M1).
-
-### 5. RANKED MoE levers (expected gain, bit-exactness, tractability)
-
-1. **RE-GRAPH the MoE decode (this patch, -> 0025): MEASURED +4.4% npl32 / +2.9% npl64 / +1.9% npl128.**
-   Bit-exact, tiny (one function, one TU), low-risk, built+measured. **The clear #1.** Helps the
-   server path AND small-npl most (where llama was weakest: npl32 86.6%->90.4% of vLLM).
-2. **down_proj act-quant retune (M1): bit-exact, bounded (act-quant is ~2%).** Cheap bank-shot;
-   retune `quantize_mmq_nvfp4` block/grid (byte-identical output, like 0023's gather). Low single-%.
-3. **Grouped-GEMM `mmq_y`-down warp-remap: bit-exact, BW-neutral, the 0017-deferred P2.** Speculative,
-   predicted bounded on this BW-bound model; real kernel work. Only if 1+2 insufficient.
-4. **M-tile / MINBLOCKS occupancy: EXHAUSTED** (measured neutral-to-negative). Do not pursue.
-5. **W4A16: REJECT** (non-bit-exact, slower BF16 arithmetic, occupancy wall). Not even a clean opt-in.
-
-**Net:** the bit-exact MoE-GEMM-region headroom from 1+2(+3) is ~3-6% at npl128 (MoE ~84% -> ~88-90%
-of vLLM) and ~4-5% at npl32-64. Full MoE parity is NOT reachable from the GEMM/launch track alone:
-the remaining gap is the grouped GEMM (~27%, FP4-MMA at the LPDDR5x BW floor - hardest regime, vLLM
-ships purpose-built Marlin-NvFp4) + the bf16 projections (~10.5%). The recurrence (~48%) is already
-PAST vLLM. The single highest-ROI, ship-now item is the re-graph patch (0025).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## C - STRUCTURAL DENSE RESIDUAL: lm_head + scheduling (label: C-structural-design, READ-ONLY no GPU)
-
-Source-confirmed on DGX `~/llama-paged-dev` @ HEAD `2ee65c2` plus committed traces
-(`CRITICALPATH_GAP_ANALYSIS.md`, `A2_CUDAGRAPH_DECODE.md`, `F16_DENSE_RESIDUAL_PROBE.md`,
-`OTHER_PATHS_INVESTIGATION.md` sec B). Numbers are dense q36-27b-nvfp4 @npl128: step ~333 ms
-(384 t/s), gap to vLLM (419 t/s = 305 ms) is ~27-28 ms/step. **Verdict: lever C is a near
-dead-end for a bit-exact dense win; rank it LAST of A/B/C/D for the bit-exact default.**
-
-### How the lm_head is stored, and why it routes to cublas/nvjet (not the tuned FP4 MMQ)
-
-`output.weight` is **GGML_TYPE_BF16** (NOT quantized): the `--tensor-type attn/ffn=nvfp4`
-recipe converts only attn+ffn, leaving the logit-sensitive final projection (and tok_embd)
-at base BF16. Confirmed: `llama-model.cpp:1460` creates the NVFP4 scale `output_s` ONLY
-`if (output->type == GGML_TYPE_NVFP4)`, so for the BF16 head `model.output_s` is null, and
-`build_lora_mm` (`llama-graph.cpp:1087`) collapses to a plain `ggml_mul_mat`. In
-`ggml_cuda_mul_mat` dispatch (`ggml-cuda.cu:2599-2629`): `use_mul_mat_q`/`use_mul_mat_vec_q`
-both require `ggml_is_quantized(src0)` (BF16 fails => the tuned FP4 path is INELIGIBLE);
-MMF is gated off for the wide `vocab x 128` shape; `use_batched_cublas_bf16` is true but the
-batched branch additionally needs `src1->ne[2]*ne[3] > 1` (the 2D decode lm_head fails it).
-Falls through to `ggml_cuda_op_mul_mat_cublas` BF16 branch (`:1662`): downcast F32 act ->
-BF16, `cublasGemmEx(16BF x 16BF -> COMPUTE_32F)` = **nvjet_sm121**, output rounded BF16 ->
-upcast F32. Shape M=vocab(151936) x N=128 x K=5120: a tall-skinny output GEMM reading the
-ENTIRE BF16 head weight for 128 columns = inherently **memory-bound**. On the dense model
-this is the ONLY non-FP4 cublas GEMM in decode. Cost: nvjet = 11.91 ms = 3.1-3.6% of step.
-
-**CRITICAL CORRECTION the team must carry:** the baseline is NOT "f32 lm_head". The cublas
-BF16 branch downcasts the activation F32->BF16 AND rounds the output to BF16. Today's
-"bit-exact reference" logits are ALREADY BF16-precision on both input and output. So
-"bit-exact" for lever C only protects BF16-rounded logits, which is exactly why option (c)
-is "essentially bit-exact" and why any meaningful lm_head speedup requires changing the dtype.
-
-### lm_head bit-exact lever + gain - bandwidth math kills it
-
-nvjet moves the full BF16 head weight in 11.9-12.2 ms = ~195-199 GB/s = ~72% of GB10's
-273 GB/s peak: it is ALREADY one of the most bandwidth-efficient kernels in the step (the
-overall decode step runs at only ~40% util / ~110 GB/s). The bit-exact ceiling is the
-remaining bandwidth headroom only:
- **(c) keep BF16 weight, swap the kernel** (custom skinny wide-vocab streaming GEMM, or a
-  hand-picked cublasLt algo/workspace heuristic for the thin-N/huge-M shape). The ONLY
-  essentially-bit-exact option. Perfect memory-bandwidth saturation 199 -> 273 GB/s = 11.9 -> ~8.7 ms =
-  **save ~3 ms = ~0.9-1.0% of step = ~11% of the 27 ms gap.** REALISTIC gain: 0 to 3 ms,
-  leaning toward 0 - cublasLt already selected nvjet as its best algo, so beating it on a
-  pure weight-stream is not guaranteed, and it is high kernel-writing effort. (F16 probe
-  independently estimates the same nvjet recovery as "~5 ms, uncertain - may already run TF32".)
-
-Structural reason it is near-zero: the head must read the entire BF16 weight for 128 columns;
-you CANNOT cut those weight bytes without changing the dtype. Bit-exactness and the only real
-speedup (fewer weight bytes) are mutually exclusive here.
-
-### lm_head NON-bit-exact options (excluded from any vLLM-parity claim)
-
- **(a) NVFP4-quantize the head -> tuned FP4 MMQ.** Biggest win, BREAKS bit-exactness.
-  Weight ~4x fewer bytes (BF16 ~1.5-2.4 GB -> NVFP4 ~0.4-0.6 GB) AND rides the already-tuned
-  `mul_mat_q<NVFP4>` (patch 0017): memory floor drops ~4x = **save ~8-9 ms = ~2.5% of step**.
-  BUT NVFP4 < BF16 precision => different logit bits, can flip greedy argmax, AND it is
-  **UNFAIR vs vLLM** (which keeps its LM head BF16). Same opt-in non-bit-exact bucket as the
-  shelved bf16-SSM / f16-glue; exclude from parity claims.
- (b) FP8 / Q8_0 head: smaller error than NVFP4 but still != BF16 bits AND not on the tuned
-  FP4 MMQ path, so it buys less speed than (a). No reason to prefer.
- (existing knob) `GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` (`ggml-cuda.cu:1610`): 16-bit accumulate
-  on this exact GEMM, faster but NON-bit-exact (16F vs 32F accumulate). Non-bit-exact track only.
-
-### Scheduling / launch bit-exact lever + gain - ~0.05%
-
-The decode step is GPU-bound at 99.94% (node-level trace, single stream, graphId replayed).
-CUDA graphs ALREADY collapse within-step launch latency: exposed idle = 0.225 ms/step = 0.06%,
-zero gaps > 5 us, graph ON vs OFF = +0.13% @npl128 (noise). Graphs are NOT a pending dense
-lever - they are already in effect. The ONLY graph-non-covered overhead is the BETWEEN-step
-host gap: ggml rebuilds the cgraph each step with a NEW `cgraph->uid`, so the uid fast-path in
-`ggml_cuda_graph_update_required` never fires and the host re-dispatches ~3100 launches between
-graph launches. MEASURED exposed cost: ~0.2 ms/step = ~0.05% (most of the ~2 ms host loop
-overlaps GPU compute). **Bit-exact lever:** make the cgraph PERSISTENT/reused across decode
-steps so the uid fast-path fires (replay-only => bit-exact). GAIN ~0.2 ms/step = ~0.05%, medium
-effort (touches ggml graph lifetime), second-order. No other per-step host overhead is exposed
-(the host loop is HIDDEN under GPU compute until the kernels get fast enough to drop GPU-busy
-below host time).
-
-### Quantified realistic bit-exact total for lever C
-
-lm_head kernel swap 0 to ~3 ms (upper ~0.9%, realistically ~0) + persistent cgraph ~0.2 ms
-(~0.05%) = **combined bit-exact ceiling ~3.2 ms = ~0.95% of the 333 ms step = ~12% of the
-27 ms gap.** Moves dense parity 91.8% -> at most ~92.7%, realistically <0.5% net (<1.5 ms).
-The "~3-4%" in the brief is the lm_head's TOTAL cost, NOT what is bit-exactly recoverable: only
-the bandwidth headroom (~3 ms) and host gap (~0.2 ms) are recoverable; the other ~9 ms is the
-irreducible BF16 weight stream BOTH engines pay (vLLM keeps a BF16 head too). **Rank C LAST for
-the bit-exact default.** Its one durable note for the team: the lm_head logits are ALREADY
-BF16-rounded (not f32), which both narrows what option (c) must preserve and is exactly why the
-only meaningful lm_head speedup requires a dtype change (= non-bit-exact + unfair vs vLLM).
-
-Source (DGX @2ee65c2): `llama-model.cpp:1460`, `llama-graph.cpp:1087`, `qwen35.cpp:222` /
-`qwen35moe.cpp:246`, `ggml-cuda.cu:2599-2629` / `:1662-1690` / `:1610`.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-# RANK + PLAN - the final synthesis (build order, A handoff, B/C/D queue)
-
-This is the decision section: all four levers measured/designed, ranked by gain x tractability
-x gate, the concrete A build plan, and the ordered B/C/D queue with each one's trigger. Base:
-clean pin-synced llama.cpp 9d5d882d, bit-exact md5 == 0023. Dense gap to vLLM ~27 ms/step (384
-vs 419 t/s @npl128); MoE ~82% (726 vs 882). Recurrence already PAST vLLM (84.6% vs 82.4% peak BW).
-
-## (1) Per-lever scorecard: gain (dense + MoE), tractability, gate
-
-| Lever | Dense decode gain | MoE decode gain | Tractability | Quality gate | Bit-exact? |
-|-------|-------------------|-----------------|--------------|--------------|------------|
-| **B re-graph (patch 0025)** | ~0 (dense already graphed) | **MEASURED +4.4% npl32 / +2.9% npl64 / +1.9% npl128** (MoE 84%->86% .. 90% of vLLM) | **VERY HIGH - already built+measured**, 1 fn / 1 TU / 9 s build | md5 byte-identical: **PASSED** (MUL_MAT_ID 806/806 + parallel-greedy md5 identical) | YES |
-| **A hybrid per-head SSM** | **+25% to +35%/call recurrence -> ~430-454 t/s = 103-108% of vLLM** (ABOVE vLLM) | keeps the +13-25% recurrence share KL-passing; does NOT alone close the MoE GEMM floor | MEDIUM-HIGH - builds on `BF16_SSM_STATE.diff`; biggest new piece = split-dtype cache layout (~150-250 LOC) | **KL<1e-3 + Same-top-p>=99.5% + drift sweep 256/1024/2048/4096 both models**; md5 that T_thresh=inf == f32 baseline | f32 default YES; hybrid is at-or-above vLLM precision, KL-gated |
-| **B M1 down_proj retune** | ~0 | bit-exact, bounded (act-quant is ~2% of MoE step) - low single-% | HIGH - block/grid retune of `quantize_mmq_nvfp4`, byte-identical output | md5 byte-identical | YES |
-| **B mmq_y-down warp-remap** | small (shared FP4 GEMM) | bit-exact, BW-neutral, predicted BOUNDED on this BW-bound model | LOW-MEDIUM - real kernel change (nwarps x tile_C coupling) | test-backend-ops MUL_MAT_ID + md5 | YES |
-| **C lm_head kernel swap** | 0 to ~3 ms (~0.9%, realistically ~0; uncertain it beats nvjet) | ~0 | LOW payoff - high kernel-writing effort, not guaranteed to beat cublasLt | md5 (BF16-rounded logits) | YES (essentially) |
-| **C persistent cgraph** | ~0.2 ms (~0.05%) | ~0 (B's re-graph already covers MoE host gap) | MEDIUM - touches ggml graph lifetime, for 0.05% | replay-only = bit-exact, md5 | YES |
-| **D f16 glue (Option 2)** | ~11-16 ms = 40-60% of residual -> 91.8% -> ~95-96% (NOT a close) | ~0 (dense-only lever) | LOW-MEDIUM - new norm.cu f16 kernels, multi-file | **NON-bit-exact, must pass the SAME KL<1e-3 that plain bf16-SSM FAILED** | NO - opt-in only |
-
-Notes that decide the ranking:
- **B's re-graph helps ONLY MoE** (dense decode is already graphed; the disable is the MoE
-  MUL_MAT_ID `ne[2]>8` over-guard). It is the single highest-ROI item because it is already
-  built, measured, and gated - zero remaining build risk, just a default flip.
- **A is the only lever that moves dense ABOVE vLLM** (103-108%) and it does it at-or-above
-  vLLM precision (vLLM keeps ALL temporal state f32; A keeps f32 on exactly the unsafe heads).
-  It reaches the largest mass (recurrence = 49.3% dense / ~48% MoE = ~6x what D can touch).
- **C and D are dead-or-tiny for the bit-exact default.** C's bit-exact ceiling is <1% with
-  real risk; D is non-bit-exact, dense-only, and tops out at ~96% parity (not a close).
-
-## (2) Ranked build order (gain x tractability x gate) - A confirmed as the build lead
-
-1. **B re-graph (patch 0025) - LAND NOW.** Already built + measured + both gates PASSED. The
-   only remaining decision is flipping the default from env-gated (`LLAMA_MOE_FORCE_GRAPHS`) to
-   `should_use_mmq`-gated default-ON. Zero new build, measured +1.9-4.4% MoE, bit-exact. This
-   is not a "build" so much as a "ship"; it precedes A because it is free and de-risked.
-2. **A hybrid per-head SSM - THE BUILD LEAD (user-greenlit, CONFIRMED by evidence).** The only
-   lever that takes dense ABOVE vLLM and the only principled fix for the bf16-SSM KL failure.
-   Largest reachable mass, bounded build on an existing diff, KL-gated. Build plan in (3).
-3. **B M1 down_proj act-quant retune** - cheap bit-exact bank-shot, run after A while the GPU
-   is warm. Bounded (~2% act-quant tax), byte-identical-output retune.
-4. **B mmq_y-down warp-remap** - only if 1+2+3 leave MoE short of target; real kernel work,
-   predicted bounded on this BW-bound model.
-5. **C persistent cgraph** - a bit-exact ~0.05% micro-win for the default; build only if a
-   broad graph-lifetime refactor is happening anyway (not worth a standalone effort).
-6. **C lm_head BF16 kernel swap** - near-zero, uncertain, high effort. Effectively shelved.
-7. **D f16 glue (Option 2 norm.cu kernels)** - LAST, opt-in only, non-bit-exact, dense-only,
-   gated by the same KL threshold bf16-SSM failed. Build only if the last ~4% dense is chased
-   AFTER A lands and is shown insufficient. Skip Option 1 entirely (cast overhead eats the win).
-
-**Why A over B as the lead, despite B's re-graph being measured:** B's re-graph is already
-DONE - it is a ship, not a build. For the NEW build effort, A is correctly the lead: it is the
-only lever with a path ABOVE vLLM on dense, it attacks the largest mass (recurrence, shared by
-both models), and it converts the already-proven whole-bf16 win (490 t/s = ~117% of vLLM's 419, but KL
-FAIL) into a KL-passing form. B's remaining items (M1, mmq_y) are bounded single-% bank-shots
-that cannot reach parity on their own (the residual MoE gap is the FP4 grouped GEMM at the
-LPDDR5x BW floor + bf16 projections, both structural). So: ship 0025, then build A, then bank B.
-
-## (3) CONCRETE A BUILD PLAN (hand to the build agent)
-
-**Objective:** a per-head mixed-dtype SSM state cache - f32 on long-memory heads, bf16 on
-fast-decaying heads - that captures 50-70% of the whole-bf16 recurrence win (-25% to -35%/call)
-while PASSING KL<1e-3. Builds directly on the existing `BF16_SSM_STATE.diff` (untracked backup
-on DGX `~/llama-paged-dev`). Target dense ~430-454 t/s (103-108% of vLLM 419), MoE +13-25%
-recurrence share KL-passing. f32 default stays bit-exact (md5 == 0023 baseline).
-
-**Reuse VERBATIM from BF16_SSM_STATE.diff** (do NOT rewrite): `gdn_state_t<STATE_BF16>` alias,
-templated `__bfloat162float` load / `__float2bfloat16` store, the gather template, the dtype-
-detect dispatcher, `type_s`/`type_r` cparam wiring, the CPU mirror, the back-compat row convert,
-the bf16 fill path, and the test-backend-ops bf16 cases.
-
-**NEW work items (in build order):**
-
-1. **Head classifier (~80-150 LOC, do first, no GPU).** Host function over `ssm_a` (tensor
-   `SSM_A_NOSCAN`, `[n_v_heads]`, = `-exp(A_log)`) and `ssm_dt` (tensor `SSM_DT`, `[n_v_heads]`):
-   for each (layer il, head h) compute `tau_h = 1 / (|ssm_a[il][h]| * softplus(ssm_dt[il][h]))`;
-   set `head_is_bf16[il][h] = (tau_h <= T_thresh)`. Emit per-layer `n_f32`/`n_bf16` counts +
-   the `head_slot[il][h] = {is_bf16, local_idx}` map. Add cparam `ssm_hybrid_tau_thresh` / CLI
-   `--ssm-bf16-tau` (inf => all-f32 bit-exact default; 0 => all-bf16; hybrid band in between).
-   Runs in microseconds at load, no data, no GPU. (Optional Tier-2: a short calibration pass
-   measuring per-head time-mean of actual `exp(g[h,t])` -> model-hash sidecar; only if Tier 1
-   lands just above the gate.)
-2. **Split-dtype cache layout (~150-250 LOC - THE BIGGEST piece).** In
-   `llama-memory-recurrent.cpp`: replace the single `s_l` ([S_v,S_v,H,slots] f32) with two
-   dtype-homogeneous sub-caches sized by per-layer head COUNT (this is what saves the bytes):
-   `s_l_f32 [S_v*S_v*n_f32, slots]` f32 + `s_l_bf16 [S_v*S_v*n_bf16, slots]` bf16. In
-   `build_rs` (`delta-net-base.cpp`): build the two views + pass the `head_slot` map; split the
-   `n_embd_s` accessors. q/k/v/g/beta KEEP natural head order (no activation permute - they come
-   from the projection GEMMs). Coarser per-LAYER fallback is REJECTED (long-memory heads span
-   most layers => too coarse; per-head is the right granularity).
-3. **Recurrence kernel: single launch, runtime per-head branch (~120-200 LOC).** Pass BOTH
-   bases (`const float* s_f32_base`, `const nv_bfloat16* s_bf16_base`) + the two `state_dst`
-   partition views + the device `head_slot[]` map. Branch on `head_slot[h_idx].is_bf16` at the
-   load site, the in-place store site, the gather, and the dispatcher. The branch is UNIFORM
-   within a block (all threads share `h_idx` = `blockIdx.x`) => **NO warp divergence**. The
-   recurrence math (the ~140-260 region) stays byte-for-byte f32-register, untouched. `keep_rs_t`
-   snapshots stay f32 (op-output scratch). The `STATE_BF16` template stays as the all-bf16
-   special case.
-4. **ids / in-place per-head.** `state_dst` becomes two partition views; `gdn_gather_nonident`
-   becomes per-head dtype-aware (copies each head's `S_v*S_v` block from the right partition of
-   `cache[ids[s]]`; still disjoint-scratch race-free). Each head writes its own partition slot
-   (read==write slot, loaded to registers before store) => the identity / in-place property is
-   preserved.
-5. **CPU mirror (ops.cpp)** per-head dtype branch for CI / CPU-offload parity.
-6. **test-backend-ops: a MIXED-dtype-state GATED_DELTA_NET case** (some heads f32, some bf16)
-   vs the CPU ref, covering decode + multi-token prefill + `keep_rs_t` (this is the R2
-   silent-corruption net - do NOT skip it).
-7. **Gate (GPU, GateBench harness, already built).** Sweep `T_thresh` to find the MINIMUM f32
-   fraction that passes: noise floor first, then the 256-tok KL gate, then the long-context
-   drift sweep 256/1024/2048/4096, BOTH models (dense q36-27b + MoE q36-35b-a3b). Pass bar =
-   **KL<1e-3 AND Same-top-p>=99.5% AND drift bounded**. nsys per-call confirms `f_bytes` =
-   `(n_f32 + n_bf16/2)/H` dropped. md5 that `T_thresh=inf` reproduces the f32 baseline (the
-   bit-exact opt-out MUST be preserved).
-
-**Expected result (from the physics + the whole-bf16 measurement):** KLD contribution per head
-~ `(eps*tau_h)^2` (eps~2^-8~3.9e-3) is dominated by the top-tau heads, so removing the top
-~25-40% by tau cuts MeanKLD by 1-2 orders. Design band **f32 fraction f in [0.30, 0.50]**:
- f=0.30 (n_bf16/H=0.70): `f_bytes`=0.65 -> ~2.20 ms/call (-35%), captures ~70% of the bf16
-  win -> dense **~454 t/s = ~108% of vLLM** (gate-likely, MeanKLD ~1e-3..1e-2).
- f=0.50: `f_bytes`=0.75 -> ~2.54 ms/call (-25%), captures ~50% -> dense **~430 t/s = ~103% of
-  vLLM** (most robust pass; strict KL<1e-3 may need this fraction).
-
-The exact f is found by the T_thresh sweep. **MoE:** A keeps the +13-25% recurrence share
-KL-passing but does NOT by itself close the MoE GEMM gap (that is B). Joint ship gate = nsys
-per-call bytes down AND KL<1e-3 for BOTH models; neither alone ships. Hybrid is STRICTLY safer
-than vLLM (we keep f32 exactly where bf16 is unsafe; vLLM keeps all-f32 everywhere).
-
-## (4) Ordered B / C / D queue with build triggers
-
- **B-1 re-graph default flip (patch 0025): trigger = NOW / immediate.** Already built, measured
-  (+1.9-4.4% MoE), both gates PASSED. Flip env-gated -> `should_use_mmq`-gated default-ON. No
-  dependency on A. Ship first.
- **B-2 down_proj act-quant retune (M1): trigger = after A's kernel work lands** (reuse the warm
-  GPU window). Bit-exact block/grid retune of `quantize_mmq_nvfp4`, byte-identical output.
-  Bounded ~1% (act-quant is ~2% of the MoE step). Run it; it is cheap.
- **B-3 mmq_y-down warp-remap: trigger = ONLY if B-1 + B-2 + A leave MoE below the target.**
-  Real kernel change, BW-neutral, predicted bounded on this BW-bound model. Speculative; gate by
-  test-backend-ops MUL_MAT_ID + md5.
- **C-1 persistent cgraph: trigger = ONLY if a broader ggml graph-lifetime refactor is already
-  in flight.** Standalone it is ~0.05%, not worth the graph-lifetime touch. Bit-exact (replay).
- **C-2 lm_head BF16 kernel swap: trigger = effectively NEVER for the default** (0 to ~3 ms,
-  uncertain it beats nvjet, high effort). Documented; not queued.
- **D Option 2 f16-glue norm.cu kernels: trigger = ONLY if dense parity is still wanted AFTER A
-  lands AND A is shown insufficient, AND an opt-in non-bit-exact mode is acceptable.** Multi-file,
-  recovers ~11 ms (norm/elementwise band), gated by the SAME KL<1e-3 that plain bf16-SSM failed.
-  Skip Option 1 (net-zero cast overhead). Lowest priority of all.
-
-**Bottom line:** ship 0025 now (free, measured MoE +1.9-4.4%), then build A (the only path
-ABOVE vLLM on dense, KL-gated, ~430-454 t/s = 103-108% of vLLM), then bank B-2/B-3 on MoE. C is
-last for the bit-exact default (<1%, dead-end); D is opt-in-only and dense-only, behind the KL
-gate, only if the last ~4% is ever chased. The recurrence is already PAST vLLM; A converts that
-proven win into a KL-passing form, and the MoE GEMM floor (the structural residual) is the one
-piece no bit-exact lever fully closes - vLLM ships purpose-built Marlin-NvFp4 there.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/SSM_DECODE_FIX_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/SSM_DECODE_FIX_RESULTS.md
@@ -1,184 +0,0 @@
-# SSM decode fix - qwen35 gated-DeltaNet in-place recurrent-state write-back (patch 0018)
-
-Follow-up to `A2_CUDAGRAPH_DECODE.md`. That analysis located the real decode lever
-on the Qwen3.6 hybrid-SSM models (arch `qwen35`, 48 gated-DeltaNet linear-attn
-layers : 16 full-attn layers) and ruled out the FP4 GEMM, CUDA graphs, the host
-loop, and attention. The corrected per-kernel + per-memcpy decode decomposition
-attributed ~64% of decode GPU time to SSM-state plumbing:
-
-    gated_delta_net 23.4% | get_rows state-gather 21.9% | D2D state-copy 18.9% (= ~64%)
-    FP4 matmul ~28% | full attention 0.4%
-
-Root cause: per SSM layer per step the fused `gated_delta_net` op wrote its new
-recurrent state into graph scratch, then a **separate `ggml_cpy` persisted the
-full ~225 MB state into the recurrent-state cache** (1584 D2D ops, 356 GB, 18.9%
-of decode over the profile window). vLLM's `fused_recurrent_gated_delta_rule`
-keeps the state in place (no copy).
-
-## STEP 1 (this patch): kill the per-layer D2D state copy-back
-
-`ggml_gated_delta_net_inplace` (new builder, `src[6] = state_dst`) makes the op
-write its final recurrent state **directly into the active sequences' contiguous
-cache slot** (at `kv_head`), eliminating the copy-back. The op output then carries
-only the attention output. SSM arithmetic is unchanged - only the destination
-pointer of the final-state write moved.
-
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new `ggml_gated_delta_net_inplace` op
-  builder. `dst` retains the same `[attn | state]` layout so the attention-output
-  view is identical; the state region is left unused.
- `ggml/src/ggml-cuda/gated_delta_net.cu`: kernel/launch/op-handler thread an
-  optional `state_dst`; final-state (`!keep_rs`) write targets it when present.
- `ggml/src/ggml-cpu/ops.cpp`: K==1 path operates in place on the `state_dst`
-  cache view (kept CPU-correct for non-CUDA runs / CI).
- `src/models/delta-net-base.cpp`: `build_recurrent_attn` uses the in-place op on
-  the fused decode/prefill path and drops the `ggml_cpy`. The rollback path
-  (`n_rs_seq > 0`) is unchanged. The get_rows state gather is unchanged (STEP 2).
-
-### Correctness gate
-
- **Bit-identical**: greedy (`--temp 0 --seed 1`) `llama-completion` output on
-  `q36-27b-nvfp4` is byte-for-byte identical between the copy-back baseline and the
-  in-place build (`diff` -> IDENTICAL).
- **Coherent**: dense + MoE multi-paragraph greedy generations are on-topic and
-  correct (Rayleigh scattering; Roman Empire 27 BCE / Actium 31 BCE; primes;
-  additive vs subtractive color).
- Gated to the `qwen35` / gated-DeltaNet fused path; rollback and all non-SSM
-  archs untouched (they never construct the in-place op).
-
-### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off)
-
-Dense `q36-27b-nvfp4`:
-
-| npl | baseline | in-place | delta   | % of vLLM (391 @128) |
-|-----|----------|----------|---------|----------------------|
-| 32  | 113.74   | 136.39   | +19.9%  | -                    |
-| 128 | 146.23   | 180.53   | +23.5%  | 37.4% -> 46.2%       |
-
-The npl-128 result lands on the predicted copy-removal ceiling (~180 t/s).
-
-MoE `q36-35b-a3b-nvfp4`:
-
-| npl | baseline | in-place | delta   |
-|-----|----------|----------|---------|
-| 32  | 246.79   | 279.41   | +13.2%  |
-| 128 | 313.36   | 372.62   | +18.9%  |
-
-### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
-
-The D2D state-copy bucket collapsed:
-
-| bucket     | before                    | after                     |
-|------------|---------------------------|---------------------------|
-| MEMCPY D2D | 18.9% / 356 GB / 1584 ops | 0.23% / 2.93 GB / 734 ops |
-
-The ~225 MB/copy recurrent-state copy-back is gone (122x fewer D2D bytes); the
-residual D2D is the small conv-state copies. With it removed, the remaining decode
-buckets are `gated_delta_net` 26.0%, FP4 matmul ~37.5%, and `get_rows` state
-gather 18.8%.
-
-## STEP 2 (not in this patch): fuse the get_rows state gather
-
-The state gather is now the largest single non-GEMM bucket (18.8%). It is a pure
-materialization: `build_rs` calls `ggml_get_rows(cache, s_copy_main)` to copy each
-sequence's previous state into a contiguous scratch tensor before the op reads it.
-`ggml_ssm_scan` already avoids this by taking the `ids` tensor (`src[6]`) and
-reading the per-seq state directly from the full cache. The same fusion applies
-here: give `ggml_gated_delta_net` an `ids` source, read `curr_state` from
-`cache + ids[seq]*D` in the kernel, and pass the full cache via the `build_rs`
-`get_state_rows` lambda (mirroring `mamba-base.cpp`). Predicted ceiling with both
-steps: ~247 t/s (~63% of vLLM dense @128), GEMM untouched.
-
-## Verdict on the path to parity
-
-STEP 1 removes ~half of the SSM plumbing overhead and is the dominant, lowest-risk
-lever; it is bit-exact and shipped here. STEP 2 (gather fusion) has a proven ggml
-precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLLM
-after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel
-track. No paged/graph/block-table change can move decode on this model (full
-attention is 0.4% of decode).
-
-## STEP 2 (patch 0019): fuse the recurrent-state gather into the op
-
-After Step 1 the largest non-GEMM decode bucket was the recurrent-state
-`get_rows` gather (18.8% of decode GPU time): `build_rs` materialized each
-sequence's prior state into a contiguous scratch via `ggml_get_rows` before the
-gated-DeltaNet op read it. Step 2 eliminates that materialization, mirroring
-`ggml_ssm_scan`'s `ids` source.
-
-`ggml_gated_delta_net_inplace_ids` takes the FULL recurrent-state cache plus the
-`s_copy` ids (`src[5]` = full cache `[S_v, S_v, H, n_rs_slots]`, `src[7]` = ids,
-`op_param[1]` = `rs_head`) and reads each sequence's prior state directly from
-`cache[ids[seq]]`. Combined with Step 1's in-place write the op now reads AND
-writes the cache directly: no recurrent-state materialization at all. The
-`build_recurrent_attn` fused path feeds the full cache and ids through the
-`build_rs` `get_state_rows` lambda exactly like `mamba-base.cpp`, keeping the
-`rs_zero` clear and the extra-states copy around the op.
-
-### Race-free by construction (CUDA)
-
-In-place write plus an ids read of the same cache is only safe when the read slot
-equals the write slot. `s_copy(s) = cells[s + head].src0`, which is identity
-(`rs_head + s`) for stable continuing sequences (the entire AR decode path) but
-can remap on sequence reorder or `rs_zero` (e.g. multiple new sequences in one
-prefill ubatch). The kernel handles both per (seq, head) block on device:
-
- identity sequences read `s0` in place from the destination slot `state_dst`
-  (the kernel loads all of `s0` into registers before it writes the new state,
-  so reading and writing the same slot is race-free) -- no materialization;
- non-identity sequences read from a disjoint scratch that a small
-  `gdn_gather_nonident_kernel` copies from `cache[ids[seq]]` first, so the
-  recurrence never reads a slot another block writes.
-
-`ids` stays a device pointer (dereferenced only in the kernels; the input is
-device-resident at op-execute time, so a host-side read would segfault). The CPU op
-mirrors the same logic (host identity check + a serial gather in the dispatcher
-for the non-identity case). The math is unchanged, so the result is bit-identical
-to the `get_rows` path in every case.
-
-Gated to the `qwen35` / `qwen35moe` fused decode/prefill path; `qwen3next`,
-`kimi-linear`, the non-fused path and the rollback (`n_rs_seq > 0`) path are
-untouched (they keep the materialized-state overload).
-
-### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off)
-
-Dense `q36-27b-nvfp4`:
-
-| npl | Step 1 (baseline) | Step 2   | delta   | % of vLLM (391 @128) |
-|-----|-------------------|----------|---------|----------------------|
-| 32  | 137.64            | 170.68   | +24.0%  | -                    |
-| 128 | 186.25            | 256.57   | +37.8%  | 47.6% -> 65.6%       |
-
-The npl-128 result (256.57 t/s) beats the predicted ~247 t/s Step-2 ceiling.
-
-MoE `q36-35b-a3b-nvfp4`:
-
-| npl | Step 1 (baseline) | Step 2   | delta   |
-|-----|-------------------|----------|---------|
-| 32  | 299.68            | 366.69   | +22.4%  |
-| 128 | 409.30            | 553.63   | +35.3%  |
-
-(Step-1 baselines re-measured in the same session; the brief's reference figures
-were 136 / 180 dense and 279 / 373 MoE.)
-
-### Bit-exact gate
-
-Greedy (`--temp 0 --seed 1`) `llama-completion` output (256 tokens, paged on,
-fusion off) vs the Step-1 build:
-
- dense `q36-27b-nvfp4`: model text byte-identical (md5 match);
- MoE `q36-35b-a3b-nvfp4`: byte-identical;
- Step-2 dense run1 == run2 (deterministic, no race).
-
-### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
-
-The recurrent-state gather bucket collapsed:
-
-| kernel                 | Step 1 | Step 2                                   |
-|------------------------|--------|------------------------------------------|
-| `k_get_rows_float`     | 18.8%  | 0.7% (residual: embeddings / conv-state) |
-| `gdn_gather_nonident`  | -      | 1.7% (no-op at decode, median ~1.2 us)   |
-| `gated_delta_net_cuda` | 26.0%  | 22.5%                                    |
-| FP4 GEMM family        | ~37.5% | ~48% (now the dominant residual)         |
-
-The SSM state gather is effectively eliminated. The residual decode gap to vLLM
-is now the FP4 GEMM (~48% of decode), a separate kernel track.
--- a/backend/cpp/llama-cpp/patches/paged/THROUGHPUT_B_P1_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/THROUGHPUT_B_P1_RESULTS.md
@@ -1,126 +0,0 @@
-# Track B P0 + P1 results: the FP4-MMA decode-GEMM occupancy tune (GB10, sm_121)
-
-Measured on the DGX (GB10 / DGX Spark, sm_121, `~/llama-paged-dev`, branch `paged`). Implements
-`FP4_GEMM_SCOPE_B.md` P0 (baseline + bit-exact gate) and P1 (the cheap host/occupancy tile tune).
-Dev-tree commit: **089f78d** (`feat(paged): FP4 decode GEMM track-B P0 gate + default-off occupancy
-instrumentation`). Patch artifact: `0017-fp4-gemm-decode-tile-tune.patch`.
-
-**Headline verdict: the P1 occupancy kill-gate TRIPPED.** None of the cheap host/occupancy levers
-lift dense or MoE decode_agg on GB10; every dense probe regresses and the nsys evidence shows the
-FP4 GEMM kernel gets *slower* under register-capping. Nothing is enabled by default (the levers are
-compile-time/env gated and the default build is byte-identical to stock). The one untested lever is
-the structural `mmq_y`-down, which is **not** a host switch: it is coupled to `nwarps` by the
-`nwarps*tile_C::I == mmq_y` static_assert, so it requires an `nwarps=4` warp-remap (P2 kernel work).
-
-All benches: `llama-batched-bench -fa on -c 32768 -ngl 99 -npp 128 -ntg 128 -npl 32,128`.
-`decode_agg = S_TG` (aggregate decode tok/s). 3 reps dense, 2 reps MoE; medians below.
-
-## P0 baseline (mmq_y=128, minblocks=1 — stock)
-
-### Bit-exact parity gate (CPU oracle vs CUDA, deterministic)
- `test-backend-ops -o MUL_MAT -b CUDA0`: **1115/1115** (1103 stock + 12 new NVFP4/MXFP4 dense
-  decode-shape cases), NVFP4 0 fail.
- `test-backend-ops -o MUL_MAT_ID -b CUDA0`: **805/805**, NVFP4 0 fail.
- New P0 cases exercise the weight-row (`mmq_y`) tiling boundary: `type_a ∈ {NVFP4, MXFP4}`,
-  `m ∈ {2048 (exact at mmq_y 64/128), 1600 (ragged vs 128), 2050 (ragged vs both 64 & 128 →
-  need_check last row-tile)}`, `n ∈ {32, 128}` (decode M), `k = 2048`. They make the oracle cover
-  the `mmq_y`/min-blocks changes and stay bit-exact with every lever on.
-
-### Decode throughput (decode_agg = S_TG)
-| model | npl32 | npl128 |
-|---|---:|---:|
-| DENSE q36-27b-nvfp4 | 117.3 | **149.5** |
-| MoE q36-35b-a3b-nvfp4 (stock mmq_x=128/expert) | 262.6 | **336.3** |
-
-(For reference the scope §6 cites dense 161 / MoE 333 from a server harness; this is the cleaner
-batched-bench A/B baseline. The relative P0→P1 deltas below are what the kill-gate turns on.)
-
-### nsys FP4 GEMM efficiency (dense, `-npp 64 -ntg 48 -npl 128`)
-The decode FP4 weight GEMM kernel = `mul_mat_q<NVFP4(40), mmq_x=128, need_check=0>`:
- **33.2 %** of GPU kernel time, total **2.782 s** / 4576 inst, **avg 608 µs/launch**.
- Plus `quantize_mmq_nvfp4` 9.1 % (the act-quant bucket — track A's target), `mul_mat_q<…,16,…>`
-  5.8 % (prefill ubatch tiling), stream-k fixups ~0.5 %.
-
-This is the locked baseline; P1 must lower the GEMM kernel time (raise FP4-eff) to pass.
-
-## P1 — the cheap occupancy levers (all default-off, byte-identical when off)
-
-Three bit-exact, gated levers were added (`mmq.cuh`):
- `GGML_CUDA_FP4_MMQ_Y` (default 128): type-aware `get_mmq_y_host/device` plumbing for an NVFP4
-  weight-row tile override. **Inert** — see "the mmq_y wall" below.
- `GGML_CUDA_FP4_MINBLOCKS` (default 1): NVFP4-only `__launch_bounds__` min-resident-CTAs lever
-  (register-caps the FP4-MMA kernel so >1 CTA co-resides). The bounded occupancy probe.
- `GGML_CUDA_FP4_DENSE_MMQ_X` (env, default off): dense col-tile re-read occupancy diagnostic
-  (the §4.1 A/B: does eating a 2× weight re-read at a smaller `mmq_x` buy net occupancy?).
-
-P1 parity: with `MINBLOCKS=2` the gate stays **MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0
-fail** — register allocation is result-neutral, so bit-exactness holds.
-
-### DENSE decode_agg @ npl128 — every occupancy probe REGRESSES
-| config | npl32 | npl128 | Δ vs P0 @npl128 |
-|---|---:|---:|---:|
-| P0 stock (mmq_y=128, minblocks=1) | 117.3 | **149.5** | — |
-| MINBLOCKS=2 (2 resident CTAs via reg-cap) | 115.7 | 147.9 | **−1.1 %** |
-| DENSE_MMQ_X=64 (2 col-tiles, 2× weight re-read) | 115.3 | 144.3 | **−3.5 %** |
-| DENSE_MMQ_X=32 (4 col-tiles, 4× weight re-read) | 115.4 | 141.7 | **−5.2 %** |
-
-### MoE decode_agg @ npl128 — mmq_x-down regresses; min-blocks neutral
-| config | npl32 | npl128 | Δ vs stock @npl128 |
-|---|---:|---:|---:|
-| stock (mmq_x=128/expert) | 262.6 | **336.3** | — |
-| TILE32 | 262.1 | 336.0 | −0.1 % |
-| TILE16 | 261.1 | 324.0 | **−3.7 %** |
-| TILE8 | 260.8 | 316.6 | **−5.9 %** |
-| MINBLOCKS=2 | 260.0 | 337.7 | +0.4 % (noise) |
-
-The MoE result reproduces patch 0015 exactly: q36-35b-a3b (256 tiny experts, GDN linear attention)
-decode is GDN/bandwidth-bound, **not** col-tile-occupancy-bound, so tightening `mmq_x` below 64
-(the brief's "8–16 ideal") monotonically *loses*. 64 ≈ 32 ≈ stock is the floor.
-
-### nsys kill-gate evidence (the decisive datum)
-`mul_mat_q<NVFP4,128,0>` under MINBLOCKS=2: **2.782 s → 3.025 s**, avg **608 µs → 661 µs
-(+8.7 % SLOWER)**. The FP4-MMA kernel needs >128 regs/thread; forcing 2 CTAs/SM register-caps it,
-which **spills to local memory**, so the GEMM does *more* work per launch — occupancy did not
-usefully rise, it inverted. FP4-eff went **down**, not up. Kill-gate tripped, with hard evidence.
-
-## Why P1 can't lift it (and why mmq_y-down is P2, not P1)
-
-The two orthogonal occupancy probes both regress: register-capping (minblocks↑) spills, and
-col-tile-shrinking (mmq_x↓) re-reads the 18 GB weight set. This says the **dense M=128 tile is
-already weight-read / one-read-optimal at mmq_x=128** — it is not occupancy-starved in a way the
-cheap levers can fix. This contradicts the scope's central "self-inflicted occupancy, recover it by
-raising resident CTAs" hypothesis *for the cheap levers*.
-
-The only lever that raises resident CTAs **without** spilling and **without** extra weight reads is
-the structural `mmq_y`-down (smaller weight-row tile → smaller shared + smaller accumulator → more
-CTAs, weights still read once). But `mmq_y` is **rigidly** `nwarps * tile_C::I = 8 * 16 = 128`
-(the `mmq.cuh:3258` static_assert; `tile_C::I=16` is the fixed `m16n8k64` MMA shape). So
-`mmq_y=64` requires **`nwarps=4`** — a warp-remap, not a host switch. That remap threads `nwarps`
-through ~13 NVFP4-reachable sites including the **shared** `vec_dot_fp4_fp4_mma` (used by both NVFP4
-and MXFP4) and the loader/kernel nwarps lockstep, with real risk of a silent shared-mem/thread-block
-mismatch. It was scoped but **deferred to P2** (the scope's own phase table also places `mmq_y`-down
-at P2, after the P1 host-only knobs). The `get_mmq_y` host/device plumbing is committed and inert so
-P2 only has to add the `nwarps` half.
-
-## Honest verdict vs the scope targets
-
- **DENSE:** P1 (host knobs + min-blocks + re-read diagnostic) does **not** move decode_agg toward
-  the 391 target — it slightly *regresses* (149.5 → 147.9, 38 % of vLLM). The scope's P1 row
-  (~177, "honest: small") was optimistic; on GB10 the cheap levers are net-negative. The remaining
-  upside lives entirely in the P2 `mmq_y`-down (nwarps=4) kernel remap **plus** track A. Whether
-  that clears the floor is now an *open, unproven* question — the cheap-lever evidence here leans
-  *against* large occupancy upside (the tile already looks one-read-optimal), so the P2 ceiling is
-  plausibly lower than the scope's 316–328.
- **MoE:** the mmq_x-down lever (the brief's MoE P1) is a **confirmed dead-end on this model**
-  (regresses; GDN/BW-bound, reproduces patch 0015). min-blocks is neutral. No host-level MoE win.
-
-**Kill-gate: TRIPPED on both arms.** Per the brief this is *not* forced into a default-on change.
-Committed: the P0 bit-exact gate + the default-off instrumentation + this honest record. Not pushed.
-
-## Reproduce
-```
-# default (byte-identical stock): build-cuda as-is -> MUL_MAT 1115/1115, MUL_MAT_ID 805/805
-# occupancy probe:  cmake build with -DGGML_CUDA_FP4_MINBLOCKS=2  (or flip the macro default)
-# dense re-read A/B: GGML_CUDA_FP4_DENSE_MMQ_X=64 ./llama-batched-bench -m q36-27b-nvfp4.gguf ...
-# nsys: nsys profile --trace cuda ... ; nsys stats --report cuda_gpu_kern_sum  (watch mul_mat_q<40,128,0>)
-```
--- a/backend/cpp/llama-cpp/patches/paged/VLLM_DECODE_GROUNDING.md
+++ b/backend/cpp/llama-cpp/patches/paged/VLLM_DECODE_GROUNDING.md
@@ -1,315 +0,0 @@
-# vLLM 0.23.0 eager-decode grounding: where the ~2.4x decode gap to llama.cpp comes from
-
-Source-reading + grounding only (no GPU, no benchmarking, no llama code changes). This
-decomposes vLLM 0.23.0's per-decode-step work in `enforce_eager` mode and attributes the
-measured ~2.4x decode-throughput gap on GB10 (DGX Spark, sm_121) to its parts, so the
-throughput thread can decide what llama.cpp would actually need (CUDA-graphed decode vs new
-kernels) before anyone touches a kernel.
-
-Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
-LPDDR5x ~273 GB/s. vLLM install read: `/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm/`
-(on `dgx.casa`, read-only). Evidence: engine logs `~/bench/h2h_dense_vllm.log`,
-`~/bench/h2h_moe_vllm.log`; nsys decode trace `~/bench/decode_study/srv_decode2.sqlite`
-(reproduced here via `cat2.py`); committed `QWEN36_NVFP4_BENCH.md`, `DECODE_GAP_STUDY.md`,
-`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`.
-
-## TL;DR (the evidence-based answer)
-
-At batch ~128, ~1024 ctx, NVFP4, `enforce_eager` (no CUDA graphs on either side), vLLM decodes
-~2.4x faster than llama.cpp. Decomposed:
-
-1. **The gap is dominantly a KERNEL-efficiency gap, not a host-overhead gap.** The strongest
-   single datum: during steady llama decode the GPU is **~94.6% busy** (nvidia-smi, real run) /
-   85.5% in the nsys window (`DECODE_GAP_STUDY.md`; nsys adds gaps). A GPU that is already ~95%
-   busy has at most ~5% exposed host bubble, so a CUDA graph (which only removes host/launch
-   overhead) can recover at most that bubble. **CUDA-graphing llama's decode is therefore a
-   minority lever: on the order of ~5-15% of the step, i.e. roughly ~10-20% of the 2.4x.** The
-   remaining ~80-90% is the GPU spending its busy time in kernels that are simply slower per unit
-   work than vLLM's.
-
-2. **vLLM's eager decode step is cheap on the host by construction**, so its host time is small
-   to begin with and hides behind the async CUDA stream: persistent pre-allocated input buffers
-   updated with vectorized numpy (no per-token Python), attention metadata built once per step and
-   shared across all layers, no GPU->CPU sync in the hot path, and a fixed small kernel-launch
-   sequence per layer (2 ops per Linear, 2 grouped Marlin launches for *all* MoE experts).
-   `async_scheduling` was **off** in this run (absent from both engine logs; default resolves to
-   the synchronous `Scheduler`, `config/scheduler.py:168-176`), so vLLM achieved the 2.4x with
-   *synchronous* per-step scheduling. The host advantage is structural, not pipelining.
-
-3. **Where vLLM's kernels win:** (a) attention reads paged KV **in-kernel** via a block table in
-   one batched `flash_attn_varlen_func` launch, with **no gather/copy** (vLLM never pays llama's
-   paged `get_rows` + `cpy` tax, which is ~36% of llama's *paged* step); (b) the dense NVFP4 GEMM
-   is a **native FP4-MMA cutlass** kernel with the activation-quant **fused** into the preceding
-   RMSNorm/SiLU (no standalone `quantize_mmq` requant pass); (c) the MoE experts are **one grouped
-   Marlin kernel per projection for all experts** (W4A16, in-kernel dequant); (d) on these Qwen3.6
-   models a fraction of layers are **GDN linear-attention** whose decode is an **O(1)-in-context
-   recurrent state update**, not an O(ctx) KV read.
-
-4. **Sampling is not the gap** on either side: vLLM samples all ~128 sequences with a handful of
-   batched on-GPU kernels (FlashInfer), greedy and a heavy sampler chain cost the same; this
-   mirrors llama's own finding (`DECODE_GAP_STUDY.md`: greedy 1343 ms == 5-sampler 1346 ms).
-
-## The measured gap (apples-to-apples, both eager)
-
-From `QWEN36_NVFP4_BENCH.md` (matched NVFP4 weights, one GB10 box, vLLM 0.23.0
-`--enforce-eager`, llama patch 0015 + budget-256), decode aggregate tok/s at npl128:
-
-| model | llama (best) | vLLM | ratio | per-step (128 tok) llama -> vLLM |
-|-------|-------------:|-----:|------:|----------------------------------|
-| DENSE Qwen3.6-27B | 161.2 | 390.7 | **2.42x** | ~795 ms -> ~328 ms |
-| MoE Qwen3.6-35B-A3B | 333.5 | 811.1 | **2.43x** | ~384 ms -> ~158 ms |
-
-Both models converge to ~41% of vLLM at npl128 after llama's prefill-starvation is removed
-(patch 0013), and at npl8 the kernels are at parity (dense 99%, MoE 84%). So the residual ~2.4x
-is a steady-state decode property at high batch, not a prefill or scheduler artifact (the
-scheduler was separately proven not to be the lever: a clean all-128-decoding run still tops out
-at 157-161 dense / 333 MoE - `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`).
-
-## Confirmed configuration (both sides eager, no CUDA graphs)
-
-vLLM, both models (engine logs):
- `enforce_eager=True`, `CompilationMode.NONE`, `cudagraph_mode=<CUDAGraphMode.NONE>`:
-  `"Enforce eager set, disabling torch.compile and CUDAGraphs ... -cc.mode=none
-  -cc.cudagraph_mode=none"`, `"Cudagraph is disabled under eager mode"`. So no torch.compile, no
-  inductor, no graph capture: the model runs as pure eager dispatch of custom ops.
- Attention: `"Using FLASH_ATTN attention backend out of ['FLASH_ATTN','FLASHINFER','TRITON_ATTN',
-  'FLEX_ATTENTION']"`, `"Using FlashAttention version 2"`.
- Dense weight GEMM: `"Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM"` (native W4A4
-  cutlass FP4-MMA), `"Enabled custom fusions: norm_quant, act_quant"`, FlashInfer autotuned the
-  `fp4_gemm` (16 configs) at startup.
- MoE weight GEMM: `"Using 'MARLIN' NvFp4 MoE backend out of ['FLASHINFER_TRTLLM',...,'MARLIN',
-  'EMULATION']"` with `"Your GPU does not have native support for FP4 computation ... Weight-only
-  FP4 compression will be used leveraging the Marlin kernel"` (so MoE experts = W4A16 weight-only
-  Marlin: in-kernel dequant + bf16 MMA), plus `"FlashInferFP8ScaledMM"` for the FP8 attention
-  linears.
- Both models are **hybrid GDN**: `"Using Triton/FLA GDN prefill kernel"` and `"Setting attention
-  block size to 784/1056 tokens to ensure attention page size >= mamba page size"` (dense 784, MoE
-  1056). A decode-time `fused_recurrent_gated_delta_rule_packed_decode_kernel` is JIT-compiled.
- Sampling: `"Using FlashInfer for top-p & top-k sampling."`
- `async_scheduling` not present in either log -> synchronous `Scheduler`.
-
-llama side (the brief's premise, corroborated by `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` review):
-`-fa on`, paged KV, eager (no engaged CUDA graphs at batched decode). The `DECODE_GAP_STUDY.md`
-nsys run explicitly set `GGML_CUDA_DISABLE_GRAPHS=1` to match.
-
-## Decomposition of vLLM's eager decode step
-
-All file paths below are under
-`/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm/`. The driver is
-`v1/worker/gpu_model_runner.py::execute_model` (line 4005): host preprocess under
-`synchronize_input_prep()`, then `_model_forward` under `set_forward_context`, then `compute_logits`;
-sampling is a separate `sample_tokens` (line 4357). Under eager, `_determine_batch_execution_and_padding`
-(line 3768) dispatches `CUDAGraphMode.NONE`, and `_model_forward` (line 3718) just calls
-`self.model(...)` directly: no capture, no replay, same code every step.
-
-### (a) Attention - one batched in-kernel paged-decode launch + O(1) GDN layers
-
- **Full-attention layers (FA2):** `v1/attention/backends/flash_attn.py`. `FlashAttentionImpl.forward`
-  (667-848) issues **one** `flash_attn_varlen_func` (796-818) over all ~128 decode tokens, passing
-  `key_cache`/`value_cache` (the raw paged block pools, **not gathered**), `cu_seqlens_q`,
-  `seqused_k`, and **`block_table=attn_metadata.block_table`**. The kernel walks the block table to
-  fetch each sequence's KV pages directly. In-kernel paged read confirmed: there is **no gather/copy**
-  in the Python layer; the only KV write is `reshape_and_cache_flash` (a scatter of the new token via
-  `slot_mapping`). FA2 disables vLLM's AOT host scheduler (`aot_schedule = (fa_version==3)` is False,
-  333), so `schedule()` returns `None` (445-469): the per-step metadata `build()` (388-575) is **pure
-  reference/scalar assembly**, no Python loop over the 128 sequences, no host scheduling, no sync.
- **Built once per step, reused across layers:** `supports_update_block_table=True` (300); the first
-  full-attn layer calls `build()`, every later layer reuses it via `update_block_table()` (577-586,
-  a `copy.copy`). So `build()` runs **once per decode step** for the whole KV group, not per layer.
- **GDN linear-attention layers (the hybrid half):** `model_executor/layers/mamba/gdn/
-  qwen_gdn_linear_attn.py`, kernels in `model_executor/layers/fla/ops/fused_recurrent.py`. Pure decode
-  takes `_forward_core_decode_non_spec` (1644-1696): two state-update kernels only -
-  `causal_conv1d_update` + `fused_recurrent_gated_delta_rule_packed_decode` (Triton kernel 255-336,
-  grid `(NV, B*HV)` = one batched launch over all 128 rows). Each program updates a **fixed-size
-  [K,V] recurrent state** (`b_h *= exp(g); b_h += (beta*(v - h.k)) outer k; o = h.q`) - **no loop over
-  the 1024 past tokens, no KV read.** This is **O(1) in context length**, while FA2 streams ~ctx KV
-  per head per row. On these Qwen3.6 models the GDN layers make a chunk of the decode cost flat in
-  ctx, a structural cheapness llama only gets if its GGUF implements GDN the same way (see caveat).
-
-### (b) Weight GEMM - native FP4-MMA (dense) / grouped Marlin (MoE), M-batched, fused quant
-
- **Dense NVFP4 linear:** `model_executor/layers/quantization/modelopt.py::ModelOptNvFp4LinearMethod.apply`
-  (1226-1232) -> `model_executor/kernels/linear/nvfp4/flashinfer.py::apply_weights` (56-89): exactly
-  two GPU ops - `scaled_fp4_quant` (activation -> packed FP4 + blockscale) then
-  `flashinfer_scaled_fp4_mm` (the autotuned `fp4_gemm`, a **native W4A4 cutlass FP4-MMA** whose
-  **dequant is fused into the MMA epilogue** via the precomputed `alpha = in_gscale*w_gscale`). The
-  activation-quant is itself folded away: `compilation/passes/fusion/rms_quant_fusion.py:98`
-  (`norm_quant`: RMSNorm -> `scaled_fp4_quant` fused) and `act_quant_fusion.py:40,128`
-  (`act_quant`: SiLU+mul -> FP4 fused). **There is no standalone full-tensor requantize pass** like
-  llama's `quantize_mmq`, and the weight is never dequantized to a temp buffer.
- **MoE experts (Marlin W4A16):** `model_executor/layers/fused_moe/experts/marlin_moe.py`.
-  `fused_marlin_moe` (227) does **one** `moe_align_block_size` token-sort then `_fused_marlin_moe`
-  (59) issues **exactly two grouped kernels** - `moe_wna16_marlin_gemm` for gate_up (137) and for
-  down (194) - **each a single launch covering ALL experts** (it walks `expert_ids`/`sorted_token_ids`
-  internally; no Python loop over experts), with a `silu_and_mul` between and a `moe_sum` reduce
-  after. W4A16 means weights are dequantized in-kernel and activations stay bf16 (never requantized).
- **Decode-M batching (the key throughput property):** the dense GEMM reshapes activations to (M, K)
-  with M = total decode tokens (~128) and reads each FP4 weight **once for all 128 tokens**; the MoE
-  grouped GEMM reads each routed expert's weight **once** for the ~M*topk/E tokens routed to it. At
-  M~128 with FP4 weights these are weight-read / memory-bound (correct: the GB10 LPDDR5x ~273 GB/s
-  is the floor), but the bytes are amortized over the whole batch. This is the ideal case and it is
-  the same regime llama is in - so the GEMM gap is kernel efficiency (fused quant + native FP4 MMA),
-  not a batching defect.
- **Host cost per layer (eager):** each `Linear.apply()` dispatches at most 2 `torch.ops` kernels; a
-  dense layer's GEMM+norm/act portion is ~7-11 launches, a MoE expert block is ~5-6 launches **for all
-  experts combined** (expert count does not multiply launches). Fixed, small, no per-tile/per-expert
-  Python.
-
-### (c) Sampling - fully batched on-GPU, negligible
-
-`v1/sample/sampler.py::Sampler.forward` (72) operates on the whole `[num_seqs, vocab]` logits
-tensor: batched `argmax` (greedy, 240) or temperature `div_` + one FlashInfer
-`top_k_top_p_sampling_from_logits` (`v1/sample/ops/topk_topp_sampler.py:493`) + `torch.where`
-(296-301). **No per-sequence Python loop** in the hot path. Per-seq params live as pre-staged GPU
-tensors `temperature/top_p/top_k[num_seqs]` (`v1/worker/gpu_input_batch.py:184-205`), copied once via
-non-blocking H2D and rebuilt only on batch change (`refresh_metadata`, 815-829). Greedy and the full
-chain are the same batched-op class. Sampled-token D2H is async (CUDA-event gated, 243-313);
-detokenization runs on CPU in the async output processor (`v1/engine/output_processor.py`). Sampling
-is a negligible tail and does not stall the GPU loop - exactly as on the llama side.
-
-### (d) Host / Python per-step loop - cheap by construction, hidden behind the async stream
-
-`execute_model` host prep, all incremental on persistent buffers (`_prepare_inputs`, 1872+):
- `block_table.commit_block_table` started **first** to overlap its copy with following CPU work
-  (1890); each step appends only newly-allocated block ids (`append_row`), usually <=1 at decode.
- positions / token gather are **vectorized numpy + a single `torch.index_select`** into the
-  pre-allocated `input_ids.cpu` (1928-1939); `query_start_loc`/`seq_lens` set by slice ops
-  (1979-1990). `slot_mapping` is one Triton kernel (`v1/worker/block_table.py`). **No per-token, no
-  per-request Python loop** in the steady decode path.
- `CommonAttentionMetadata` assembled once (2287-2305), then the attention builder runs once per KV
-  group (see (a)).
- The forward runs under `set_forward_context(...)` with `cudagraph_runtime_mode=NONE`; `_model_forward`
-  is a direct `self.model(...)`.
- **No GPU->CPU sync in the hot path:** the sampled-token copy is `non_blocking` + event-gated;
-  `execute_model` returns after launching the forward, and the cheap host prep for the next step
-  overlaps the GPU executing the current step on the async CUDA stream (CUDA launches are
-  non-blocking). `async_scheduling` was off, so this overlap is just ordinary CUDA async, not
-  pipelined scheduling - yet it is enough because the host work is so small.
-
-What llama-server's per-step C++ loop pays that vLLM does not (host side, graph-addressable):
-ggml rebuilds/reallocates the compute graph each decode step and dispatches ~1k kernel launches from
-the loop on the weak Grace ARM cores (`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` review). vLLM's persistent
-buffers + build-once-reuse metadata + fixed launch sequence are exactly the things that keep its eager
-step host-cheap; llama could borrow these (persistent device KV/block metadata, build the ggml graph
-once and reuse it, zero per-step host sync) to shrink the bubble **without** a full CUDA graph.
-
-## The llama side, for the split (nsys, reproduced)
-
-`~/bench/decode_study/cat2.py` over `srv_decode2.sqlite` (Qwen3-32B dense, pure full-attention, 64
-layers, batch 32, 1024 ctx, paged, eager), reproduced now:
-
-```
-window_span_s 24.960  sum_kernel_s 21.348  gpu_busy_pct 85.5
-ATTENTION (flash_attn_ext_f16) 10.177 s  47.7%
-kv_copy_cast (cpy_*)            3.903 s  18.3%
-embed_gather_rows (get/set)    3.803 s  17.8%   <- the PAGED gather tax
-GEMM_weight (mul_mat)          3.173 s  14.9%
-GEMM_act_quant (quantize_mmq)  0.172 s   0.8%
-rmsnorm/silu/rope/add          ~0.12 s   ~0.6%
-```
-
-So on llama's paged decode step: ~84% is KV/attention (attention 47.7% + KV copy 18.3% + paged
-gather 17.8%), ~16% is weight GEMM, and the host loop is **hidden** (GPU 85-94% busy; greedy ==
-heavy-sampler step time). Mapping each bucket to vLLM:
-
-| llama bucket (paged) | nsys % | vLLM equivalent | vLLM avoids it? |
-|----------------------|------:|-----------------|-----------------|
-| paged KV gather (`get_rows`) | 17.8% | block table read **in-kernel** | **Yes, entirely** (no such op) |
-| KV copy/cast (`cpy_*`) | 18.3% | KV written once into block pool, read in place | Mostly |
-| decode attention (`flash_attn_ext_f16`) | 47.7% | FA2 paged-decode varlen (+ O(1) GDN layers) | Same op, faster kernel; GDN is cheaper still |
-| weight GEMM + act quant | 15.7% | fused native-FP4 / grouped Marlin, no separate requant | Faster + removes the requant kernel |
-| host serving loop / sampling | ~0 (hidden) | cheap persistent-buffer prep, batched GPU sampling | Both hidden; vLLM also cheap |
-
-Note: the nsys decomposition is on **Qwen3-32B (pure attention)**; the 2.4x throughput numbers are on
-**Qwen3.6 hybrid GDN** models. The bucket *shares* differ between the two (GDN shifts work off
-attention), but the lesson - llama's step is GPU-bound on attention + the paged gather + FP4 GEMM,
-with the host hidden - transfers.
-
-## The split of the 2.4x: kernel vs host (graph-addressable)
-
-Anchored on the measured **~94.6% GPU busy** during steady llama decode (nvidia-smi,
-`DECODE_GAP_STUDY.md`):
-
- **Host / CUDA-graph-addressable: the minority, ~5-15% of the llama step (=> ~10-20% of the 2.4x).**
-  A GPU that is ~95% busy exposes at most ~5% host idle; a CUDA graph (capture-once, replay) removes
-  per-step launch latency + ggml graph rebuild/realloc and can tighten inter-kernel gaps, plausibly
-  recovering ~5-15% of the step in the best case. On llama's ~795 ms dense step that is ~40-120 ms of
-  the ~467 ms gap. **A CUDA graph cannot close a 2.4x gap**, because the gap is mostly the GPU's busy
-  time, not idle. (The fraction shrinks further at batch 128 vs the nsys batch 32: the per-step launch
-  count is fixed while per-kernel work grows, so host overhead is a smaller share at higher batch.)
- **Kernel efficiency: the majority, ~80-90% of the 2.4x.** The GPU's busy time goes into kernels that
-  are slower per unit work than vLLM's, decomposed:
-  - **the paged gather regression (~36% of llama's *paged* step; `get_rows`+`cpy`)** - vLLM never pays
-    it because it reads paged KV in-kernel. This is the single biggest discrete, llama-specific,
-    addressable chunk, but removing it only restores llama's own *stock* path; stock is still ~2x off
-    vLLM (`DECODE_GAP_STUDY.md`).
-  - **long-context decode-attention** (the largest residual; attention is ~48% of the step and grows
-    with ctx) - llama's `flash_attn_ext_f16` decode is slower than vLLM's FA2 paged-decode on sm_121,
-    and slower still than the O(1) GDN layers on these models.
-  - **the FP4 weight GEMM floor** (~15-30%) - vLLM fuses the activation-quant into the norm/SiLU and
-    uses native FP4-MMA / grouped Marlin; llama runs `mul_mat_q` + a separate `quantize_mmq` requant.
-
-## Ranked list: what llama would need to close the 2.4x, and how much each buys
-
-1. **Do not pay the paged gather at decode. [largest discrete, llama-addressable; ~36% of the paged
-   step]** Either disable paged KV for decode-latency workloads, or read paged blocks **in-kernel via
-   a block table** like vLLM (no `get_rows`/`cpy`). This is a kernel change (a real in-kernel
-   paged-decode read), not a graph change. Caveat: it only brings the paged path back to llama-stock;
-   stock is still ~2x off vLLM, so this is necessary but not sufficient.
-2. **Faster long-context decode-attention kernel. [biggest residual; partly structural]** A proper
-   flash-decoding / split-K-over-KV, GQA-grouped, in-kernel-paged decode kernel for sm_121 (this also
-   subsumes lever 1). Deep CUDA work, gated by kernel maturity on Blackwell-class parts. This is where
-   the context-scaling gap lives and where most of the 2.4x is.
-3. **Fused FP4 weight GEMM. [bounded; ~15-30%]** Fold the activation-quant into the preceding norm/SiLU
-   (vLLM's `norm_quant`/`act_quant`) and into the GEMM epilogue; use native FP4-MMA where the part
-   supports it. Removes the separate `quantize_mmq` pass. Bounded below by weight-read bandwidth
-   (~19 GB/step over 273 GB/s).
-4. **CUDA-graph the steady-state pure-decode step. [smallest, cheapest; ~10-20% of the gap]** Capture
-   the all-128-decoding step once and replay (it is already fixed-shape at steady decode - the
-   scheduler does not need to change to enable this, per `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` P3).
-   Recovers the ~5% GPU-idle bubble + ggml per-step graph rebuild/realloc + launch latency on the weak
-   Grace cores. A real, independent, low-risk win, but bounded by the ~95%-busy measurement: it does
-   **not** close the kernel gap. Cheaper host-side half-measures that need no graph: persistent device
-   KV/block metadata, build the ggml graph once and reuse it, and remove any per-step host sync (mirror
-   vLLM's persistent-buffer + build-once-reuse + non-blocking-D2H pattern).
-5. **Verify llama's GDN/linear-attention decode path. [architectural, model-specific]** On these
-   Qwen3.6 hybrids vLLM runs the linear-attention layers as an O(1)-in-ctx recurrent state update. If
-   llama's GGUF runs those layers as full attention (O(ctx)) rather than a recurrent state, that is a
-   per-layer decode cost vLLM structurally avoids on exactly these models - check before attributing
-   the whole residual to the full-attention kernel.
-
-## Honest bottom line
-
-The ~2.4x eager decode gap is **dominantly a kernel-efficiency gap (~80-90%), not a host-overhead
-gap.** The decisive evidence is that llama's GPU is already ~94.6% busy during steady decode, so the
-CUDA-graph-addressable host slice is a minority (~10-20% of the gap), recoverable but bounded. The
-bulk of vLLM's advantage is concrete kernel work: an in-kernel paged-decode read that eliminates
-llama's gather/copy tax (~36% of the paged step), a faster long-context decode-attention kernel, a
-fused native-FP4 GEMM, and (on these specific models) O(1)-in-ctx GDN linear-attention layers. vLLM's
-host loop is cheap by construction (persistent buffers, build-once-reuse metadata, no hot-path sync,
-fixed small launch sequence) and it achieved the 2.4x with *synchronous* scheduling and *no* CUDA
-graphs - so the host is not where vLLM's lead comes from, and a CUDA graph is the cheapest but
-smallest of llama's available levers, not the silver bullet. The throughput effort should be scoped
-as kernel work (in-kernel paged-decode read + flash-decoding attention + fused FP4 GEMM) with a
-CUDA-graphed steady-state decode as a separate, bounded, lower-risk add-on.
-
-## Key source citations (on dgx.casa, read-only)
-
- Eager driver / host loop: `v1/worker/gpu_model_runner.py` execute_model 4005, _model_forward 3718,
-  _prepare_inputs 1872, _determine_batch_execution_and_padding 3768, sample_tokens 4357,
-  synchronize_input_prep 3704; `v1/worker/block_table.py`; `v1/worker/gpu_input_batch.py:184-205`.
- Attention: `v1/attention/backends/flash_attn.py` (forward 667-848, varlen call 796-818, builder
-  388-575, update_block_table 577-586); `model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`
-  (decode 1644-1696); `model_executor/layers/fla/ops/fused_recurrent.py` (kernel 255-336).
- GEMM: `model_executor/kernels/linear/nvfp4/flashinfer.py:56-89`;
-  `model_executor/layers/quantization/modelopt.py` (NvFp4 LinearMethod 1103-1232, MoE 1381-1666);
-  `model_executor/layers/fused_moe/experts/marlin_moe.py` (59-225, 227-360, 732-895);
-  `compilation/passes/fusion/rms_quant_fusion.py:98`, `act_quant_fusion.py:40,128`.
- Sampling: `v1/sample/sampler.py:72-302`; `v1/sample/ops/topk_topp_sampler.py:55,460-497`;
-  `v1/sample/metadata.py`; `v1/engine/output_processor.py`.
- Config: `config/scheduler.py:146,168-176` (async_scheduling default -> sync Scheduler).
- Evidence: `~/bench/h2h_dense_vllm.log`, `~/bench/h2h_moe_vllm.log`, `~/bench/decode_study/cat2.py`
-  over `srv_decode2.sqlite`; this worktree `QWEN36_NVFP4_BENCH.md`, `DECODE_GAP_STUDY.md`,
-  `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`.
-</content>
-</invoke>
--- a/docs/content/features/backends.md
+++ b/docs/content/features/backends.md
@@ -125,7 +125,7 @@ For getting started, see the available backends in LocalAI here: https://github.
 LocalAI supports various types of backends:

 - **LLM Backends**: For running language models (e.g., llama.cpp, vLLM, SGLang, transformers, MLX)
-  - **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:<tokens>]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md` for the quality/throughput profile).
+  - **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:<tokens>]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp/patches/paged/README.md` for the quality/throughput profile).
 - **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo)
 - **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS)
 - **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step)
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -20,7 +20,7 @@
 # persist their state as bf16 (LLAMA_SSM_BF16_TAU), halving that head's decode byte
 # stream. Default off (0) = every head f32 = bit-exact; when enabled the mode is NOT
 # bit-exact (~91% same-top-p, beats vLLM dense) - see
-# backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md for the quality profile.
+# backend/cpp/llama-cpp/patches/paged/README.md for the quality profile.
 # The two NVFP4 entries below intentionally stay bit-exact (no ssm_bf16_tau).
 # =============================================================================
 - name: "qwen3.6-27b-nvfp4-paged"