docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target

Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM, B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph):

- A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is bounded on this bandwidth-bound a3b model. Keep down_proj quantize retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a later shared GDN+MoE decode-graph project.
- B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair vs vLLM. Dead end. Rank last.
- C: paged-pool burst-degradation BUG (Part 2) is a true correctness defect (prefill collapses 507->65 t/s after a burst, restart cures it): reclamation gap on partial seq_rm + free-queue fragmentation. Plus the static decode-first budget (Part 1) explains 903s/213s burst TTFT and the chunked-interleave fix.
- D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's bimodality was the shelved BF16 SSM path. Closed.

First build target: the paged-pool burst-degradation bug fix (Fix-1 truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot-completion). Small, localized, default-off byte-identical, crisp repro (npl64 burst then npl8: prefill within 10% of fresh + num_free restored).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 09:26:55 -04:00 · 2026-06-26 09:42:55 +00:00
parent 89e62fc74f
commit b061e4aef0
1 changed files with 511 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/OTHER_PATHS_INVESTIGATION.md
+++ b/backend/cpp/llama-cpp/patches/paged/OTHER_PATHS_INVESTIGATION.md
@@ -0,0 +1,511 @@
+# OTHER_PATHS_INVESTIGATION.md
+
+Read-only investigation of the four post-0023 paths (A MoE grouped-GEMM, B lm_head GEMM,
+C TTFT/paged-pool burst, D dense CUDA-graph). One section per agent. No GPU except the
+moe-gpu-profile agent.
+
+---
+
+## A. MoE grouped-GEMM gap (label: moe-gemm-source, READ-ONLY, no GPU)
+
+### The decisive finding: vLLM's MoE on GB10 is MARLIN W4A16, not a native-FP4 grouped GEMM
+
+Engine-log ground truth (`VLLM_DECODE_GROUNDING.md`, from `~/bench/h2h_moe_vllm.log`):
+`"Using 'MARLIN' NvFp4 MoE backend ... Your GPU does not have native support for FP4
+computation ... Weight-only FP4 compression will be used leveraging the Marlin kernel"`.
+vLLM does NOT take its native-FP4 cutlass/trtllm MoE path on sm_121 (it whitelists only
+sm_100/103 datacenter Blackwell for FP4-MMA MoE). So on this box vLLM's MoE is:
+
+- `moe_align_block_size` (BLOCK-PADDED token-sort; `num_tokens_post_padded`, sentinel pad rows),
+- **2 grouped `moe_wna16_marlin_gemm` launches per MoE block** (gate_up, then SiLU+mul, then down),
+  each ONE launch over ALL experts, `use_fp32_reduce=True`,
+- **W4A16: activations stay bf16, NEVER quantized**; FP4 weights dequantized in-kernel to bf16,
+  bf16 MMA,
+- the whole decode step under a FULL CUDA graph.
+
+llama's MUL_MAT_ID on GB10 (mmq.cu id-branch + mmid.cu + mmq.cuh stream-k) is:
+
+- `mm_ids_helper` token-sort/scatter, **NO block padding** (exact segments, `expert_bounds`),
+- **activation FP4 quantize** (`quantize_mmq_fp4`) of the expert-gathered rows = W4A4,
+- **1 persistent stream-k `mul_mat_q<NVFP4>` launch per projection**, native Blackwell FP4-MMA
+  (`block_fp4_mmq`), fp32 accumulate + `stream_k_fixup`,
+- per-expert-density `mmq_x` (M-tile) select (patches 0014/0015, default tile 64 @ density<=8),
+- NOT under a CUDA graph.
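The difference between the block-padded token sort and the exact-segment sort described in the two lists above can be made concrete with a small Python sketch. This is an illustration only: the helper names, the toy routing, and the block size of 4 are assumptions, and neither function is vLLM's or llama's actual code.

```python
from collections import Counter

def exact_expert_bounds(expert_ids, n_expert):
    """Prefix sums of per-expert row counts (exact segments, no padding)."""
    counts = Counter(expert_ids)
    bounds = [0]
    for e in range(n_expert):
        bounds.append(bounds[-1] + counts.get(e, 0))
    return bounds

def block_padded_rows(expert_ids, n_expert, block):
    """Total rows when every non-empty expert segment is rounded up to a multiple of `block`."""
    counts = Counter(expert_ids)
    return sum(-(-counts.get(e, 0) // block) * block for e in range(n_expert))

# Toy routing: 6 (token, expert) rows spread over 4 experts; expert 3 gets nothing.
routing = [0, 0, 0, 1, 2, 2]
bounds = exact_expert_bounds(routing, n_expert=4)
padded = block_padded_rows(routing, n_expert=4, block=4)
print("exact bounds:", bounds, "rows:", bounds[-1])
print("block-padded rows:", padded)
```

In this toy case the exact layout carries 6 rows while the padded layout carries 12, which is the kind of sentinel-row overhead the block-padded scheme accepts in exchange for uniform tiles.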
+
+### So the "missing fused grouped GEMM" does not exist - llama already HAS it
+
+llama's grouped FP4-MMA stream-k IS the same sorted-grouped-GEMM algorithm vLLM uses, and on
+GB10 llama's MoE GEMM runs on a HIGHER hardware tier (native FP4-MMA) than vLLM's W4A16 Marlin fallback. The
+MoE decode gap (77-83% of vLLM vs dense 90-117%) is therefore NOT a grouped-GEMM-architecture
+deficit. The MoE-specific EXTRA gap (the ~10-15pt that MoE is worse than dense) decomposes as:
+
+1. **W4A4 activation-quantize tax (llama-only, the biggest MoE-specific discrete cost).**
+   llama quantizes activations to FP4 for the MoE GEMM; vLLM (W4A16) keeps them bf16 and pays
+   ZERO activation quantize. At MoE decode npl128 that is 1024 up/gate rows (patch 0023 dedup'd
+   the broadcast ones to 128 unique + a coalesced block gather) PLUS 1024 down_proj rows
+   (distinct per expert, CANNOT be dedup'd). nsys decode-isolated (`MOE_QUANT_DEDUP_RESULTS.md`):
+   `quantize_mmq_nvfp4` is still **457 ms** of decode GPU-time after the 0023 up/gate dedup; the
+   remaining bulk is the down_proj per-expert re-quantize. vLLM's W4A16 choice is actually SMART
+   for MoE decode on a bandwidth-bound box: keeping activations bf16 adds negligible activation
+   bandwidth at M~8/expert but ELIMINATES the entire quantize pass.
+
+2. **Un-graphed extra MoE nodes' launch bubbles.** Per MoE layer llama runs mm_ids_helper +
+   quantize + gather + 2 grouped GEMMs + SiLU/mul + down-quantize + moe_sum as separate
+   host-launched ggml nodes, none under a CUDA graph; vLLM runs moe_align + 2 grouped launches
+   under a full decode graph. This is the SAME launch-bubble root cause `CRITICALPATH_GAP_ANALYSIS.md`
+   pins for the GDN region (57 ms/step dense = 100% bubble), amplified for MoE by the extra
+   quantize/gather/scatter nodes - consistent with MoE being relatively worse than dense.
+
+3. **Ragged tiny-M tile + `need_check` partial-tail MMA** in the grouped stream-k. Already
+   addressed by 0014/0015 and measured **NEUTRAL** on q36-35b-a3b: that model is bandwidth/
+   SSM-recurrence-bound, not col-tile-occupancy-bound (the `LLAMA_MOE_DECODE_TILE` sweep shows 64
+   is the only non-negative width and it is within noise). So the M-tile lever has nothing to
+   bite on for THIS model; it banks +4.8% only on col-tile-bound MoE (Qwen3-Coder-30B).
+
+### Bit-exact llama MoE-GEMM levers (ranked)
+
+- **M1 (bit-exact, modest): down_proj activation-quantize kernel retune.** The remaining ~457 ms
+  is dominated by the down_proj per-expert FP4 re-quantize (`ne11==n_expert_used`, no dedup
+  possible). The per-block quantize is a pure per-thread function of 16 consecutive inputs (the
+  property 0023 exploited to make its gather bit-exact), so the launch GEOMETRY can be retuned
+  (occupancy/coalescing, like 0022 did for the recurrence and like 0023's coalesced-uint4 gather
+  fix) while the quantized bytes stay BYTE-IDENTICAL. Also worth checking whether the down gather
+  (`ids_src1`) is redundant when the SwiGLU intermediate is already expert-contiguous. Scope:
+  nsys the down-branch `quantize_mmq_fp4` on GB10, retune block/grid, gate on test-backend-ops
+  MUL_MAT_ID exact + greedy md5 == 0023. Expected: low single-digit % at npl128 (bounded - it is
+  a fraction of a fraction of the step), but it is the only clean quantize-axis lever left after
+  0023 and it is strictly bit-exact.
+
+- **M2 (bit-exact, the structurally-correct big one, SHARED with path D/A.2): CUDA-graph the MoE
+  decode step.** Graph replay does not change numerics => bit-exact. The MoE-specific extra node
+  count (quantize+gather+scatter+2 GEMM+silu+sum/layer, none graphed) makes the launch-bubble tax
+  larger for MoE than dense, which is exactly why MoE sits at 77-83% while dense is 90-117%.
+  Capturing the decode forward removes those bubbles. This is the same lever the GDN/A.2 work
+  scoped; it helps MoE MORE than dense. Highest-leverage bit-exact MoE win, but it is a
+  decode-graph-capture project, not a MoE-GEMM kernel edit.
+
+- **M0 (already shipped): 0017 `GGML_CUDA_FP4_MINBLOCKS` (min-resident-CTAs register-cap) and
+  0014/0015 (`mmq_x` density auto-tile) already cover the FP4-MMA occupancy + M-tile axes of the
+  SHARED `mul_mat_q<NVFP4>` kernel.** 0017 is bit-exact (register allocation cannot change
+  results) and was tuned on dense; a MoE-targeted min-blocks re-sweep (grouped per-expert M-tiles
+  have different occupancy than the dense M=128 GEMM) is a cheap bit-exact follow-up, but
+  MOE_DENSITY_AUTO_TILE already found this model is bandwidth-bound, so headroom is likely small.
+
+### NOT recommended (explicitly out of scope)
+
+- **W4A16 bf16-activation MoE GEMM (matching vLLM's Marlin choice).** This is the single biggest
+  MoE-specific structural difference and would erase the activation-quantize tax entirely, but it
+  (a) is NOT bit-exact (bf16 activations vs llama's FP4), and (b) is the W4A16 occupancy-wall
+  dead-end the docs flag (only ~9 TFLOP/178 t/s on GB10). Do not pursue.
+
+### Verdict / ranking of path A
+
+Path A is NOT a missing-kernel opportunity - llama already runs the sorted-grouped-FP4-MMA GEMM,
+at a higher native-FP4 tier than vLLM's GB10 W4A16 Marlin fallback. The MoE-specific extra gap is
+(1) the W4A4 activation-quantize tax vLLM structurally avoids by choosing W4A16, and (2) the same
+un-graphed launch-bubble tax as the GDN region, amplified by MoE's extra nodes. The only purely
+bit-exact, MoE-GEMM-local lever left is M1 (down_proj quantize retune, modest). The real MoE
+bit-exact win is M2 (CUDA-graph the decode step), which is the SAME lever as path A.2/D and helps
+MoE more than dense - so A's best lever collapses into the decode-graph effort rather than
+standing alone. Recommend ranking A's standalone kernel value BELOW the decode-graph (M2/D) and
+the lm_head (B) levers; fold A into the decode-graph build, and keep M1 as a cheap bit-exact
+bank-shot.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+## B. lm_head GEMM (label: cublas-lmhead, READ-ONLY, no GPU)
+
+### The decisive fact: lm_head is BF16, not NVFP4 - so it CANNOT take the FP4 MMQ path
+
+`output.weight` (the LM head) in q36-35b-a3b-nvfp4 is **type 30 = GGML_TYPE_BF16, NOT quantized**
+(verified in `DECODE_PARITY_EXPLORE.md:298`: "2425 MB = 2.37 GB, read in full each step", 16% of
+weight traffic). This is by construction: the model was quantized with
+`--tensor-type attn/ffn=nvfp4`, which converts the attn+ffn tensors to NVFP4 and **leaves `output.weight` (and `tok_embd`)
+at base BF16** - the standard recipe, because the final projection is the most logit-sensitive
+tensor. The NVFP4 sidecar scales (`output_s`, `output_in_s`) are only created when
+`output->type == GGML_TYPE_NVFP4` (`llama-model.cpp:1459`), so for the BF16 head `model.output_s`
+is null.
+
+### Why it runs cublas/nvjet and not MMQ (exact routing trace)
+
+Graph: `qwen35moe.cpp:244` `cur = build_lora_mm(model.output, cur, model.output_s)` ->
+`llama-graph.cpp:1093` is just `ggml_mul_mat(ctx0, w, cur)` (the null `w_s` skips the scale `ggml_mul`).
+Then `ggml_cuda_mul_mat` (`ggml-cuda.cu:2540`) decides the kernel:
+
+- `use_mul_mat_q` / `use_mul_mat_vec_q` BOTH require `ggml_is_quantized(src0->type)`. BF16 is NOT
+  quantized (`is_quantized=false` for F16/BF16/F32; NVFP4 IS `is_quantized=true`, `ggml.c:748`).
+  => **both MMQ paths are ineligible for the BF16 head.** (If the head were NVFP4 it WOULD route to
+  the tuned FP4 `mul_mat_q` - this is exactly the difference.)
+- At decode npl128 the activation `src1->ne[1] = 128` columns: `use_mul_mat_vec_f` is gated off by
+  the mmvf batch cap; `use_mul_mat_f` (the MMF bf16 tensor-core GEMM) is gated off by
+  `ggml_cuda_should_use_mmf` for the wide `151936-row x 128-col` shape.
+- `use_batched_cublas_bf16` is true, but the batched-cublas branch additionally requires
+  `src1->ne[2]*src1->ne[3] > 1` (a 3D/4D multi-batch GEMM). The decode lm_head is 2D
+  (`ne[2]*ne[3] == 1`) => **batched-cublas branch is skipped.**
+- => falls through to the final `else`: `ggml_cuda_op_mul_mat_cublas`. With `src0` BF16 +
+  bf16-MMA hardware it takes the BF16 branch (`ggml-cuda.cu:1663`): `cublasGemmEx(CUDA_R_16BF,
+  CUDA_R_16BF -> CUBLAS_COMPUTE_32F, TENSOR_OP)`. **That cublasLt kernel is `nvjet_sm121`.**
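The routing trace above reads as a decision cascade. The following Python sketch restates it under simplifying assumptions: the function name, the flag names, and the returned labels are hypothetical, the eligibility flags are passed in rather than derived from tensor shapes, and the exact precedence inside `ggml_cuda_mul_mat` is not reproduced.

```python
def pick_mul_mat_path(src0_is_quantized, mmvf_ok, mmf_ok, batched_bf16_ok, src1_batch):
    """Simplified restatement of the routing described in the text (not the real code)."""
    if src0_is_quantized:
        return "mmq_or_mmvq"          # only quantized weights are eligible for MMQ/MMVQ
    if mmvf_ok:
        return "mmvf"                 # gated off at npl128 by the mmvf batch cap
    if mmf_ok:
        return "mmf"                  # gated off for the wide lm_head shape
    if batched_bf16_ok and src1_batch > 1:
        return "batched_cublas"       # needs ne[2]*ne[3] > 1
    return "op_mul_mat_cublas"        # final else: the cublas BF16 branch

# The BF16 lm_head at decode npl128, as described above: 2D, unquantized weight.
lm_head_path = pick_mul_mat_path(
    src0_is_quantized=False, mmvf_ok=False, mmf_ok=False,
    batched_bf16_ok=True, src1_batch=1,
)
print(lm_head_path)
```

Flipping only `src0_is_quantized` to `True` moves the same call onto the MMQ branch, which is the difference the first bullet points out.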
+
+Cost (both models): dense `nvjet` lm_head = **12.17 ms = 3.66% of the 332.8 ms dense step**
+(`F16_DENSE_RESIDUAL_PROBE.md:65`); MoE = **11.91 ms = 3.1%** (`CRITICALPATH_GAP_ANALYSIS.md:398`).
+
+### CRITICAL correction: the current head is NOT "f32-lm_head" - it is already BF16-rounded
+
+The task brief calls the baseline "f32-lm_head"; it is not. The cublas BF16 branch **downcasts the
+F32 activation to BF16**, does BF16xBF16 with F32 accumulate, **writes the result as BF16** (dst is
+`CUDA_R_16BF`), then upcasts BF16->F32. So today's "bit-exact reference" logits are already
+**BF16-precision**, not f32. Two consequences:
+1. Any NVFP4/FP8 head swap is measured against a BF16 baseline, not f32 - the precision delta vs
+   the *true* f32 head is partly already paid.
+2. A *different* BF16 GEMM kernel that also F32-accumulates and BF16-rounds the output is
+   **bit-identical for the vast majority of logits** (differs only at rare BF16 rounding ties) -
+   this is what makes option (c) below "essentially bit-exact".
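To illustrate what "already BF16-rounded" means for the logits, here is a minimal Python sketch of an F32 to BF16 round trip. It assumes round-to-nearest-even on the low 16 mantissa bits and skips NaN/Inf handling; whether the cublas kernel uses exactly this rounding mode is an assumption, and the helper names are made up for the example.

```python
import struct

def f32_bits(x):
    """Bit pattern of x after conversion to IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    """float32 value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def round_to_bf16(x):
    """Round a float32 value to BF16 precision (nearest-even), returned as a float."""
    bits = f32_bits(x)
    bias = 0x7FFF + ((bits >> 16) & 1)   # ties go to the even BF16 mantissa
    return bits_f32((bits + bias) & 0xFFFF0000)

# Two distinct f32 logits that land on the same BF16 value.
a = round_to_bf16(12.3456)
b = round_to_bf16(12.36)
print(a, b, a == b)
```

In the range [8, 16) BF16 values are spaced 0.0625 apart, so logits closer than that can collapse to the same stored value, which is why a second BF16-output kernel can differ from the first only at rounding ties.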
+
+### The options, and which break bit-exactness
+
+- **(a) NVFP4-quantize the head -> tuned FP4 MMQ. BIGGEST win, BREAKS bit-exactness.** Weight
+  2.37 GB BF16 -> ~0.67 GB NVFP4 (0.5625 B/wt vs 2 B/wt = ~3.6x fewer bytes) AND it then hits the already-tuned
+  `mul_mat_q<NVFP4>` (0017) instead of cublas. Memory-bound floor drops ~3.6x => save ~8-9 ms =
+  ~2.5% of the dense step. But NVFP4 < BF16 precision => **different logit bits, can flip the greedy
+  argmax** = NOT bit-exact; and it is **UNFAIR vs vLLM**, which keeps its LM head BF16
+  (`DECODE_PARITY_EXPLORE.md:358`: "fp8 LM head ... only matters if vLLM also quantizes it"). This
+  is the same opt-in, non-bit-exact bucket as the f16-glue probe (already concluded SKIP).
+- **(b) FP8 / Q8_0 head.** Smaller error than NVFP4 but still != BF16 bits => still NOT bit-exact,
+  and it is not even on the tuned FP4 MMQ path, so it buys less speed than (a). No reason to prefer.
+- **(c) Keep BF16 weight, swap the kernel (custom skinny wide-vocab streaming GEMM, or a cublasLt
+  algo heuristic tuned for the thin-M / huge-N memory-bound shape).** The ONLY essentially-bit-exact
+  option (F32 accumulate + BF16 round = identical except rounding ties, per the correction above).
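The byte and time estimate for option (a) can be checked with a few lines of Python. The sketch assumes the head is purely memory-bound, so kernel time scales linearly with weight bytes; the input figures are the ones quoted in this section, and the variable names are chosen for the example.

```python
BF16_BYTES_PER_WEIGHT = 2.0
NVFP4_BYTES_PER_WEIGHT = 0.5625   # figure quoted for NVFP4 in the text
HEAD_GB_BF16 = 2.37               # BF16 output.weight size
NVJET_MS = 12.17                  # measured dense lm_head time
DENSE_STEP_MS = 332.8             # measured dense decode step

byte_ratio = BF16_BYTES_PER_WEIGHT / NVFP4_BYTES_PER_WEIGHT
head_gb_nvfp4 = HEAD_GB_BF16 / byte_ratio
# Assumption: time is proportional to bytes streamed (memory-bound shape).
saved_ms = NVJET_MS * (1.0 - 1.0 / byte_ratio)
saved_share_pct = 100.0 * saved_ms / DENSE_STEP_MS

print(f"byte ratio      : {byte_ratio:.2f}x")
print(f"NVFP4 head size : {head_gb_nvfp4:.2f} GB")
print(f"time saved      : {saved_ms:.2f} ms ({saved_share_pct:.2f}% of the dense step)")
```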
+
+### Realistic lever + scope: there is NO good bit-exact lever here
+
+Bandwidth math kills option (c): `nvjet` moves 2.37 GB in ~11.9-12.2 ms = **~195-199 GB/s = ~72% of
+the GB10's 273 GB/s peak**. The lm_head GEMM is therefore **already one of the MOST
+bandwidth-efficient kernels in the step** - the overall decode step runs at only 40% util /
+110 GB/s (`DECODE_PARITY_EXPLORE.md`). The bit-exact ceiling is tiny: even a perfect
+HBM-saturating kernel (199 -> 273 GB/s) takes 11.9 -> ~8.7 ms = **save ~3 ms = ~0.9% of the dense
+step**, and beating cublas's own tuned nvjet on a pure weight-stream shape is NOT guaranteed (it may
+already be near-optimal). High kernel-writing effort, uncertain sub-1% payoff.
+(`F16_DENSE_RESIDUAL_PROBE.md:97` independently estimates a bf16-glue nvjet recovery of only ~5 ms and flags it
+"uncertain - may already run TF32" - consistent with little headroom.)
+
+The structural reason: the head must read the **entire 2.37 GB weight for just 128 output columns**
+(inherently memory-bound), and **you cannot cut those weight bytes without changing the dtype** -
+i.e. bit-exactness and the only real speedup (fewer weight bytes) are **mutually exclusive** here.
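The bandwidth arithmetic behind the "<1% ceiling" can be reproduced directly. This Python sketch uses only the numbers quoted in this section (2.37 GB head, 11.91 ms and 12.17 ms kernel times, 273 GB/s peak, 332.8 ms dense step); treating GB as 1e9 bytes and a perfect kernel as one that runs exactly at peak bandwidth are simplifying assumptions.

```python
HEAD_GB = 2.37
PEAK_GBPS = 273.0
DENSE_STEP_MS = 332.8

def achieved_gbps(gigabytes, millis):
    """Effective bandwidth for streaming `gigabytes` in `millis` milliseconds."""
    return gigabytes / (millis / 1000.0)

bw_moe = achieved_gbps(HEAD_GB, 11.91)     # MoE lm_head timing
bw_dense = achieved_gbps(HEAD_GB, 12.17)   # dense lm_head timing
ideal_ms = 1000.0 * HEAD_GB / PEAK_GBPS    # time if the kernel saturated peak bandwidth
saved_ms = 11.91 - ideal_ms
saved_share_pct = 100.0 * saved_ms / DENSE_STEP_MS

print(f"achieved: {bw_dense:.0f}-{bw_moe:.0f} GB/s "
      f"({100 * bw_dense / PEAK_GBPS:.0f}-{100 * bw_moe / PEAK_GBPS:.0f}% of peak)")
print(f"ideal {ideal_ms:.1f} ms, saves {saved_ms:.1f} ms = {saved_share_pct:.2f}% of the dense step")
```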
+
+### Verdict / ranking of path B
+
+The lm_head cublas/nvjet GEMM is a **dead end for a bit-exact win**: it is already ~72% of peak HBM
+(the step's most efficient major kernel), so a bit-exact kernel swap caps at <1% with real risk and
+no guarantee of beating cublas. The only large win - NVFP4-quantizing the head (~2.5%) - is
+explicitly non-bit-exact AND unfair vs vLLM (which keeps BF16), so it lands in the same opt-in
+non-bit-exact bucket as f16-glue that was already shelved. Rank B's bit-exact value **at the bottom**
+of the four paths. The one worthwhile note for the team is the correction that the head is already
+BF16 (not f32), which slightly narrows what "bit-exact" even protects here; if the project ever
+opens a *non*-bit-exact opt-in track, NVFP4-head (option a) is a clean ~2.5% dense lever that rides
+the existing tuned FP4 MMQ - but it must be gated as opt-in and excluded from any vLLM-parity claim.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+## A.2 / D. GPU-measured MoE decode decomposition + dense-graph stability (label: moe-gpu-profile, THE GPU AGENT)
+
+nsys `--cuda-graph-trace=node` on a steady MoE decode at npl128 (q36-35b-a3b-nvfp4, HEAD f7409c2,
+clean 0023 build-cuda). The measurement was decode-isolated: the run has a prefill phase (16384 tok,
+the big-GEMM region) followed by 64 steady decode steps; I segmented the timeline by GPU-idle gaps,
+dropped the prefill window, and aggregated per-kernel time over the 64-step decode window only
+(`moe_decode_npl128.{nsys-rep,trace.csv}` on the DGX; extractor `decfull.py`/`grid.py`).
+
+### MoE decode window: 98.3% GPU-bound, ~165 ms/step. Per-kernel share of decode GPU-time:
+
+```
+ 41.9%  gated_delta_net_cuda            REC (SHARED with dense, already tuned 0018-0022)
+ 26.9%  mul_mat_q<NVFP4, M-tile=64>     MOE expert grouped GEMM (MUL_MAT_ID) <-- biggest MoE-specific kernel
+  7.7%  nvjet_sm121 (cublas bf16)       attn/gdn bf16 projections + the BF16 lm_head (path B)
+  2.7%  cutlass_80 bf16 s16816 relu     bf16 GEMM (shared-expert / gate)
+  2.7%  k_bin_bcast (mul/add)           expert-combine + routing-weight scale + glue
+  2.6%  k_get_rows_float                REC recurrent-state gather
+  2.4%  flash_attn_ext_f16              attention
+  2.3%  mul_mat_q<NVFP4, M-tile=128>    router / non-grouped FP4 GEMM
+  2.1%  ssm_conv(+update)               REC
+  2.0%  quantize_mmq_nvfp4              MOE W4A4 activation-quantize tax (3.25 ms/step)
+  1.8%  convert_unary bf16<->f32        glue around the bf16 projections
+  1.5%  cpy_scalar                      glue
+  0.9%  rms_norm
+  0.5%  REC gating act | 0.5% streamk_fixup | 0.3% mm_ids_helper | 0.3% argsort |
+  0.2%  l2norm | 0.2% set_rows | 0.1% gather_mmq_fp4 | <0.1% topk/softmax/reduce (routing)
+```
+
+Bucketed: **Recurrence (shared, tuned) ~= 47.5%** (gdn 41.9 + get_rows 2.6 + ssm_conv 2.1 + gating
+0.5 + l2norm 0.2 + set_rows 0.2). **MoE FFN+routing block ~= 31%** (grouped GEMM 26.9 + activation
+quant 2.0 + streamk 0.5 + mm_ids_helper/argsort/gather/softmax/topk/reduce ~1.3 + the expert-combine
+share of bin_bcast). **cublas/cutlass bf16 projections ~= 10.4%** (nvjet 7.7 + cutlass 2.7).
+Attention ~2.4%. The recurrence is the single biggest term but it is shared with dense and already
+the subject of 0018-0022, so it is NOT a MoE lever.
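The bucket totals above can be re-derived from the per-kernel table. The Python sketch below copies the shares from the table and uses the rounded ~165 ms/step figure to convert a share into milliseconds, so the millisecond values are approximate; the dictionary keys are shortened labels chosen for the example.

```python
STEP_MS = 165.0  # rounded decode step time from the heading

shares = {
    "gated_delta_net": 41.9, "get_rows": 2.6, "ssm_conv": 2.1,
    "gating_act": 0.5, "l2norm": 0.2, "set_rows": 0.2,
    "grouped_gemm": 26.9, "quantize_mmq_nvfp4": 2.0,
    "nvjet": 7.7, "cutlass_bf16": 2.7,
}

def bucket(names):
    """Sum the decode GPU-time share (percent) of the named kernels."""
    return sum(shares[n] for n in names)

def to_ms(share_pct):
    """Convert a percentage share into approximate ms per decode step."""
    return STEP_MS * share_pct / 100.0

recurrence = bucket(["gated_delta_net", "get_rows", "ssm_conv",
                     "gating_act", "l2norm", "set_rows"])
bf16_proj = bucket(["nvjet", "cutlass_bf16"])

print(f"recurrence       : {recurrence:.1f}% (~{to_ms(recurrence):.0f} ms/step)")
print(f"bf16 projections : {bf16_proj:.1f}%")
print(f"activation quant : ~{to_ms(shares['quantize_mmq_nvfp4']):.1f} ms/step")
```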
+
+### The biggest MoE-specific kernel (the lever): mul_mat_q<NVFP4, M-tile=64> grouped GEMM
+
+26.9% of decode = ~43.5 ms/step, avg **373 us/call**, grids of **2048 and 8192** 64-wide tiles
+(blk=32 = 1 warp/block). Compare the dense FFN GEMM in the same family at npl128: `mul_mat_q<NVFP4,
+M-tile=128>` avg **31 us/call**, grid 48. The grouped per-expert GEMM is ~12x the per-call cost and
+launches ~43-170x more tiles (2048/48 and 8192/48) because each of 128 experts is a separate tiny-M sub-GEMM (128 tokens x
+top-k / 128 experts ~= a handful of rows per expert) padded into 64-wide tiles. This is exactly the
+ragged-tiny-M / col-tile-occupancy axis section A's 0014/0015 `mmq_x` density auto-tile already
+covers and measured NEUTRAL on this bandwidth-bound a3b model. MMQ FP4 is integer/FP4-exact
+independent of tile geometry, so this kernel IS bit-exact to retune (occupancy/min-blocks/M-tile),
+but the headroom on THIS model is small (it is bandwidth-bound, not tile-occupancy-bound).
+
+### Confirmations / quantifications of section A (from live GPU, not source-reading):
+
+1. **Un-graphed at npl128: CONFIRMED in source, but NOT the npl128 bottleneck.** NVFP4 on sm121
+   (turing_plus path) has `mmvq_mmid_max = 8` (`mmvq.cu:145`); MoE decode batch ne[2]=128 > 8, so
+   `[TAG_MUL_MAT_ID_CUDA_GRAPHS]` (`ggml-cuda.cu:3273`) disables CUDA graphs for the WHOLE step and
+   the MMQ grouped path (not MMVQ) is taken. HOWEVER the measured decode window is **98.3% GPU-util
+   with ~7.8 us inter-step host gaps** - at npl128 the kernels are large enough to fully hide the
+   per-op launch latency, so the un-graphed launch-bubble tax is negligible HERE. The un-graphed
+   penalty is a SMALL-npl problem; at npl128 the MoE gap is in-kernel (grouped GEMM + quantize),
+   not host bubbles. This refines A's M2: graphing the decode step helps small-npl MoE much more
+   than npl128 MoE.
+2. **W4A4 activation-quantize tax: CONFIRMED present but only 2.0% at npl128.** `quantize_mmq_nvfp4`
+   = 3.25 ms/step in the decode-isolated window (A's 457 ms figure is a whole-run/different-window
+   total). Real, and vLLM-W4A16 avoids it, but it is a small-single-digit term, not dominant.
+3. **lm_head/projection cublas (path B): CONFIRMED ~12.4 ms/step** of nvjet in MoE decode (matches
+   B's 11.91 ms), but that 7.7% bundle is mostly per-layer attn/gdn bf16 projections, not just the
+   one lm_head.
+
+### D. Dense CUDA-graph stability: f32 dense is STABLE, the bimodality was a BF16-only artifact
+
+Dense (q36-27b-nvfp4) has no MUL_MAT_ID, so it stays fully CUDA-graphed. Measured S_TG @npl128:
+
+```
+intra-process (1 load, 6x npl=128, npp8/ntg48, N_KV=7168): 376.2 376.2 375.7 375.1 375.3 374.9  (spread <0.4%)
+inter-process (6 separate procs, fresh graph capture each):373.6 377.0 376.8 376.6 376.2 375.7  (spread ~0.9%)
+committed heavy config (npl128 ntg128, N_KV=32768):        333.3 / 334.8 / 335.9                 (spread ~0.8%)
+```
+
+No bimodality in either replay (intra-process) or capture (inter-process). The custom graph state
+machine (`ggml-cuda.cu:4484`: warmup_complete requires 2 property-stable calls; the one-time capture
+cost lands in T_PP, not S_TG) absorbs capture into prefill, which is the only "hint" (the first
+in-process measurement has a slightly higher T_PP and a marginally lower S_TG, fully bounded). The
+287/336/487/498 bimodality in the brief was the shelved BF16 SSM-state path (BF16_SSM_STATE.diff,
+never applied), not the shipped f32 path. There is NO graphs-off env in this fork (graph enable is
+compile-time USE_CUDA_GRAPH + the warmup machine), so a graph-disable A/B would need a rebuild; given
+the f32 path is already stable to <1%, path D is a non-issue and not worth the rebuild.
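The spread figures quoted next to the S_TG samples can be recomputed from the raw numbers. The sketch assumes "spread" means peak-to-peak range divided by the mean, which the document does not define explicitly.

```python
def spread_pct(samples):
    """Peak-to-peak spread as a percentage of the mean (assumed definition)."""
    mean = sum(samples) / len(samples)
    return 100.0 * (max(samples) - min(samples)) / mean

runs = {
    "intra-process": [376.2, 376.2, 375.7, 375.1, 375.3, 374.9],
    "inter-process": [373.6, 377.0, 376.8, 376.6, 376.2, 375.7],
    "heavy config": [333.3, 334.8, 335.9],
}

for name, samples in runs.items():
    print(f"{name:14s} spread = {spread_pct(samples):.2f}%")
```

A bimodal distribution like the 287/336/487/498 set mentioned above would show a spread of tens of percent under the same definition, so the sub-1% values are a clear separation rather than a borderline call.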
+
+### Verdict (GPU agent)
+
+- The MoE decode gap vs vLLM at npl128 is **in-kernel, not host-overhead**: 98.3% GPU-util rules
+  out the un-graphed launch-bubble story AT npl128. The single biggest MoE-specific kernel is the
+  `mul_mat_q<NVFP4, M-tile=64>` grouped GEMM (26.9%, 43.5 ms/step); it is bit-exact to retune but
+  bandwidth-bound on this a3b model (A's auto-tile already measured neutral), so the standalone
+  bit-exact MoE-GEMM lever is REAL but BOUNDED. The recurrence (47.5%) is shared and already tuned.
+- **Path D (dense graph instability) is closed: the shipped f32 dense path is stable (<1%, no
+  bimodality).** No latent fragility, no rebuild warranted.
+- Net ranking from the GPU side agrees with A/B: the MoE-GEMM and lm_head levers are both bounded
+  and partly non-bit-exact; the only structurally large bit-exact MoE win (A's M2, graph the decode
+  step) pays off mostly at SMALL npl, not at the npl128 where the benchmark gap is reported.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+## C. TTFT / paged-pool burst degradation (label: ttft-burst-rootcause, READ-ONLY, source + committed traces)
+
+Files read: `paged/paged_kv_manager.{h,cpp}`; patches `0004` (on-demand alloc), `0007` (persistent
+manager + ref-counted prefix), `0008` (server cross-request share), `0013`/`0016` (prefill budget);
+docs `QWEN36_NVFP4_BENCH.md`, `BENCHMARK_PROGRESS.md`, `CHUNKED_PREFILL_PLAN.md`,
+`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`, `P1_DYNAMIC_BUDGET_RESULTS.md`, `FUTURE_LEVERS.md`.
+
+### Part 1 - the static decode-first budget: why a 128-way burst hits 903 s dense / 213 s MoE TTFT
+
+How the budget schedules (patch 0016, `server-context.cpp::update_slots`): each step builds ONE
+mixed batch. Phase 1 appends every GENERATING slot's single sampled token UNCONDITIONALLY (no budget
+gate), so after Phase 1 `batch.n_tokens == D` (the live decode load). Phase 2 then fills prompt
+tokens, bounded by three predicates: the hard `batch.n_tokens < n_batch` (2048) ceiling, a per-step
+`prefill_budget_step`, and a per-slot `prefill_cap_per_slot`. **Decode is structurally claimed first
+and never capped; only prefill is throttled.**
+
+At the shipped config (`LLAMA_MAX_BATCH_TOKENS=512`, i.e. T=512=n_ubatch) the dynamic terms
+degenerate to constants:
+- `prefill_budget_step = max(n_ubatch, T - D) = max(512, 512-D) = 512` for all D in [0,128] - the
+  floor binds, the `T-D` adaptivity NEVER bites (exactly the "structural note" in
+  `P1_DYNAMIC_BUDGET_RESULTS.md`).
+- `prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) = min(512, 5243) = 512`, clamped to 512.
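The degenerate behaviour of the two budget terms can be verified with a short Python sketch. The formulas are the ones written in the bullets above; `n_ctx = 131072` is an assumption back-solved from the quoted `ceil(0.04*n_ctx) = 5243`, and the function names simply mirror the variable names.

```python
import math

def prefill_budget_step(T, D, n_ubatch):
    """Per-step prefill budget with the n_ubatch floor, as written above."""
    return max(n_ubatch, T - D)

def prefill_cap_per_slot(T, n_ctx, frac=0.04):
    """Per-slot prefill cap, clamped to T."""
    return min(T, math.ceil(frac * n_ctx))

T = 512          # LLAMA_MAX_BATCH_TOKENS at the shipped config
N_UBATCH = 512
N_CTX = 131072   # assumption: chosen so that ceil(0.04 * n_ctx) == 5243

budgets = {prefill_budget_step(T, D, N_UBATCH) for D in range(0, 129)}
cap = prefill_cap_per_slot(T, N_CTX)
print("distinct budgets for D in [0,128]:", budgets)
print("per-slot cap:", cap)
```

Because the floor equals T, the set of budgets contains a single value, which is the "adaptivity never bites" observation in code form.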
+
+So each step admits at most 512 prefill tokens TOTAL and up to 512 per single slot. Each benchmark
+prompt is exactly 512 tokens and there is NO round-robin (0016 drains slots in index order):
+**the first waiting slot consumes the entire 512-token step budget with its whole prompt; the 128
+prompts prefill strictly SERIALLY, one prompt per step.** Slot k's first token appears after ~k
+prefill steps and each step co-batches the accumulating decode load, so step time grows. Mean TTFT
+~= (half the prompts) x step_time ~= **903 s dense** (each step reads the full 28B NVFP4 weights) /
+**213 s MoE** (3B active = cheaper steps). Decode_agg stays high (384/726 t/s) because Phase 1 seats
+every decode token every step. This is the **deliberate decode-first tradeoff**: T=512 was chosen
+for decode throughput + memory; TTFT was the sacrificed axis. The 903 s is partly self-inflicted by
+the floor budget + lack of fairness, not a kernel limit (dense `prefill_tps` collapses to ~70 t/s
+under the throttle vs vLLM's flat ~1420).
+
+The fix (chunked-interleave / fair dynamic budget = P2 of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`,
+NOT implemented), three pieces in `update_slots` Phase 2, zero libllama change:
+1. Raise T toward `n_batch` (2048) so the per-step total budget is large; keep decode-first via the
+   REAL `prefill_budget_step = T - D` (leftover auto-shrinks as D rises, so the step never inflates
+   past T even at npl128).
+2. A per-slot chunk cap MUCH smaller than the budget (the `long_prefill_token_threshold` analogue),
+   e.g. 128-256 tokens, so one prompt cannot monopolize the step.
+3. A round-robin start offset over PROCESSING_PROMPT slots so leftover budget spreads across MANY
+   waiting prompts per step.
+
+Net: instead of "one full 512-prompt per step" (serial, last prompt waits 128 steps), each step
+admits small chunks from ~T/cap prompts at once, so all 128 advance in lockstep and TTFT collapses
+from O(k*step) to O(constant) - the vLLM 6-18 s regime. 0016's per-slot-cap variable already exists
+but is inert at the shipped config and lacks the round-robin spreader. Honest boundary (already in
+the docs): this closes TTFT, it does NOT lift the ~161/333 decode ceiling (a separate lever).
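The Phase 2 admission logic, both as shipped and with the three proposed pieces, can be sketched as a single toy function. This is a hedged model, not the code in `update_slots`: the function name, its parameters, and the list-of-remaining-tokens representation are assumptions made for the example.

```python
def admit_prefill(pending, decode_tokens, n_batch, budget, cap, start=0):
    """Return {slot: tokens admitted this step}.

    pending       : remaining prompt tokens per slot
    decode_tokens : D, already seated by Phase 1 (never capped)
    budget        : per-step prefill budget
    cap           : per-slot chunk cap
    start         : round-robin start offset (0 = plain index order)
    """
    admitted = {}
    batch_tokens = decode_tokens
    n = len(pending)
    for i in range(n):
        slot = (start + i) % n
        take = min(budget, n_batch - batch_tokens, cap, pending[slot])
        if take <= 0:
            if budget <= 0 or batch_tokens >= n_batch:
                break            # step is full
            continue             # this slot has nothing left to prefill
        admitted[slot] = take
        budget -= take
        batch_tokens += take
    return admitted

prompts = [512] * 128
# Shipped config: T = 512, cap = 512, index order.
shipped = admit_prefill(prompts, decode_tokens=0, n_batch=2048, budget=512, cap=512)
# Proposed: T = 2048, budget = T - D, small chunk cap, rotating start offset.
proposed = admit_prefill(prompts, decode_tokens=128, n_batch=2048,
                         budget=2048 - 128, cap=128, start=5)
print("shipped :", shipped)
print("proposed:", len(proposed), "slots,", sum(proposed.values()), "tokens")
```

Under the shipped parameters a single slot takes the whole step, while the proposed parameters spread the same step across many waiting prompts; the sketch says nothing about step duration, which depends on the measured kernels.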
+
+### Part 2 - the burst-degradation BUG: later lower-npl prefill collapses 507 -> 65 t/s, decode fine, restart cures it
+
+The signature - prefill-only collapse, decode untouched, persists in-process, a server restart fully
+cures it (the benchmark's documented "restart per npl" workaround) - points to persistent paged-pool
+host state never restored short of `clear()`/teardown. Two compounding mechanisms, both confirmable
+from the patch source:
+
+**(1) RECLAMATION GAP - blocks are returned ONLY on a FULL-range wipe.** `paged_alloc` returns a
+sequence's blocks to the pool in exactly two places (patch 0004, kept in 0007): `clear()` ->
+`release_all`, and `seq_rm(seq, p0, p1)` ONLY when `p0 == 0 && p1 == MAX`. But llama-server's normal
+slot lifecycle issues PARTIAL truncations: slot reuse with a retained common/BOS prefix calls
+`seq_rm(slot.id, n_past, -1)` with `n_past > 0` (patch 0008 itself calls
+`common_context_seq_rm(ctx, slot.id, n_past, -1)`); context-shift / partial rewinds likewise. None
+satisfy `p0 == 0`, so the release hook never fires: the kv-cache frees those CELLS but the manager
+still believes the sequence owns those BLOCKS. The two desync and the manager's effective free pool
+shrinks every time. Patch 0008's own comment is the smoking gun - it added the `n_past < 16` gate
+so that a mismatched full-prompt reservation vs suffix-only submission "never leaves stale blocks
+(which otherwise fragment the paged pool ... and crashed the server under high fan-out)". 0008 only
+closed that hole for the narrow `share()` path; the general partial-`seq_rm` path stays unhooked, so
+over a high-fan-out burst leaked blocks accumulate and never return.
+
+**(2) FRAGMENTATION / NO COMPACTION - the free queue is permuted by the burst and never rebuilt.**
+Even for cleanly freed blocks, `BlockPool::free_blocks` just `prepend_n`/`append_n`s them in free
+order; no compaction, no pristine reset. After a high-fan-out burst (many interleaved alloc/free
+across many seqs in the unified pool, or reversed-order frees in a per-stream pool) the free queue is
+a scrambled permutation of physical block ids. A subsequent LOW-npl prefill then `popleft`s
+physically SCATTERED blocks, so its 512-token KV scatter-WRITE plus the in-kernel paged-attention
+GATHER lose locality across the KV span -> prefill throughput collapses. Decode is a single-token
+append per step with a gather amortized over tiny per-step work, so it barely notices - exactly the
+observed "prefill collapses, decode robust". The scramble + leak persist for the process lifetime
+(only `clear()`/restart rebuilds a contiguous free queue) - precisely why restart-per-npl restores
+507 t/s. Contributing factor: slots used in the burst but not reassigned next run are never released
+(release fires only on next-task divergence), so a low-npl run sees a reduced, fragmented pool and
+falls back to the stock contiguous allocator more often (the `place()->false->res.idxs.clear()`
+fallback in find_slot), scanning a littered cell array - another prefill-only slowdown.
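Both mechanisms can be reproduced in a toy model of the block pool. The Python sketch below is not the `paged_kv_manager` code: the class, its methods, and the 8-block pool are assumptions, and it models only the two behaviours described above (release on full-range wipe only, and a free queue that keeps whatever order blocks were freed in).

```python
from collections import deque

class ToyPool:
    def __init__(self, n_blocks):
        self.free = deque(range(n_blocks))   # pristine: contiguous block ids
        self.owned = {}                      # seq -> list of block ids

    def alloc(self, seq, n_blocks=1):
        blocks = self.owned.setdefault(seq, [])
        for _ in range(n_blocks):
            blocks.append(self.free.popleft())

    def seq_rm(self, seq, p0, p1):
        # Described behaviour: blocks return only on a full-range wipe.
        if p0 == 0 and p1 == -1:
            self.free.extend(self.owned.pop(seq, []))

pool = ToyPool(8)
for _ in range(2):                 # interleaved burst: 4 sequences grow in turn
    for seq in range(4):
        pool.alloc(seq)

pool.seq_rm(0, 16, -1)             # partial truncation: nothing is released
leaked_free = len(pool.free)

for seq in (3, 1, 2, 0):           # full wipes, in arbitrary completion order
    pool.seq_rm(seq, 0, -1)
scrambled = list(pool.free)
print("free after partial seq_rm:", leaked_free)
print("free queue after burst   :", scrambled)
```

After the burst every block is free again, yet the queue order is no longer contiguous, so the next allocation pops physically scattered block ids; the partial `seq_rm` in the middle shows the accounting gap on its own.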
+
+Fix scope (all gated behind `LLAMA_KV_PAGED`, default-off byte-identical, no libllama API change):
+- **Fix-1 (core, ~30-50 lines): close the reclamation gap.** Add
+  `paged::PagedKVManager::truncate(seq, n_keep)` that frees the trailing blocks of a request beyond
+  block index `ceil(n_keep/bs)` (ref-counted, mirroring vLLM's free of the truncated block suffix),
+  expose `paged_alloc::truncate(cache, stream, seq, n_keep)`, and call it from
+  `llama_kv_cache::seq_rm` for the `p1 == MAX && p0 > 0` case (ideally any `[p0,p1)`). Manager
+  accounting then tracks the kv-cache exactly; the leak stops.
+- **Fix-2 (small): defrag on empty.** When a stream's cells reach `get_used() == 0`, rebuild that
+  manager's free queue to pristine contiguous order (or recreate the manager) so a reused pool
+  starts unfragmented.
+- **Fix-3 (small): release on slot completion.** Add a paged release at server `slot.release()` so
+  finished-but-idle sequences return blocks promptly and a later low-npl run sees a full, compact
+  pool.
+- **Fix-4 (optional hardening): best-fit / contiguous-run preference** in `get_new_blocks` + a
+  defrag pass before the find_slot stock-fallback fires.
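As a rough illustration of Fix-1 and Fix-2, here is a toy Python manager with a `truncate` and a defrag-on-empty step. It is a sketch under stated assumptions, not the proposed `paged::PagedKVManager` implementation: the class and method names other than `truncate` are made up, the block size of 16 is arbitrary, and ref-counting of shared prefix blocks is left out entirely.

```python
from collections import deque

class ToyPagedManager:
    def __init__(self, n_blocks, block_size):
        self.n_blocks = n_blocks
        self.block_size = block_size
        self.free = deque(range(n_blocks))
        self.owned = {}                      # seq -> list of block ids

    def alloc(self, seq, n_blocks):
        blocks = self.owned.setdefault(seq, [])
        for _ in range(n_blocks):
            blocks.append(self.free.popleft())

    def truncate(self, seq, n_keep):
        """Fix-1 idea: free the blocks beyond index ceil(n_keep / block_size)."""
        keep = -(-n_keep // self.block_size)
        blocks = self.owned.get(seq, [])
        self.free.extend(blocks[keep:])
        del blocks[keep:]
        if not blocks:
            self.owned.pop(seq, None)

    def defrag_if_empty(self):
        """Fix-2 idea: once nothing is owned, restore the contiguous free order."""
        if not self.owned:
            self.free = deque(range(self.n_blocks))

mgr = ToyPagedManager(n_blocks=8, block_size=16)
mgr.alloc(0, 4)                    # seq 0 owns blocks 0..3
mgr.alloc(1, 4)                    # seq 1 owns blocks 4..7
mgr.truncate(0, n_keep=20)         # keeps ceil(20/16) = 2 blocks, frees 2
after_partial = list(mgr.free)
mgr.truncate(1, n_keep=0)
mgr.truncate(0, n_keep=0)
before_defrag = list(mgr.free)
mgr.defrag_if_empty()
print(after_partial, before_defrag, list(mgr.free))
```

Calling `truncate(seq, 0)` doubles as the full release in this toy, which is one way Fix-3 (release on slot completion) could reuse the same entry point.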
+
+Validation repro (GPU-bound, for a later profiling pass): npl64 burst then npl8 on ONE server;
+assert npl8 `prefill_tps` within ~10% of a fresh-server npl8, and that `paged_alloc::num_free`
+returns to the fresh value after the burst drains.
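The pass/fail criterion of the repro can be written down as a small check. The Python sketch below is a hypothetical helper, not part of any benchmark script: the function name and the free-block counts are placeholders, and "within ~10%" is interpreted as a symmetric relative tolerance.

```python
def burst_repro_passes(fresh_tps, post_burst_tps, fresh_free, post_burst_free, tol=0.10):
    """True if post-burst npl8 prefill is within `tol` of fresh and num_free is restored."""
    tps_ok = abs(post_burst_tps - fresh_tps) <= tol * fresh_tps
    pool_ok = post_burst_free == fresh_free
    return tps_ok and pool_ok

# Observed bug signature from this section: 507 t/s fresh vs 65 t/s after a burst.
# The free-block counts (1000) are placeholder values for illustration.
print(burst_repro_passes(507.0, 65.0, 1000, 1000))
print(burst_repro_passes(507.0, 480.0, 1000, 1000))
```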
+
+### Verdict / ranking of path C
+
+Two distinct things: a **BUG** (Part 2) and a **tuning tradeoff** (Part 1). Rank the BUG first - it
+is a true correctness/hygiene defect, not a tradeoff: a long-lived production server silently
+degrades under ordinary mixed load and currently REQUIRES the "restart per npl" crutch, which is
+unacceptable in real serving. Fix scope is small and localized to the paged-alloc unit + one
+`seq_rm` call site,
+default-off byte-identical, with a crisp pass/fail repro. The chunked-interleave scheduler (Part 1)
+is the bigger HEADLINE (the weakest benchmark number, 903 s/213 s burst TTFT vs vLLM 6-18 s) but a
+larger effort with a deliberate TTFT-vs-decode-ITL tradeoff to navigate. The two are complementary:
+the scheduler reduces how punishing each burst is; the bug fix ensures the pool survives the burst
+so the NEXT request is not poisoned.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+---
+
+## SYNTHESIS - ranking and the first build target (label: orchestrator)
+
+The brief framed two tracks: **BIT-EXACT** levers (help the shipped f32 parity DEFAULT, included in
+the vLLM-parity claim) and **SERVING** levers (gated behind `LLAMA_KV_PAGED`, default-off
+byte-identical, outside the parity claim). The decisive cross-cutting finding from all four agents:
+**there is no compelling first build target on the bit-exact decode-default track** - A is bounded,
+B is a sub-1% dead end, D is closed - **while the SERVING track has one clear, high-ROI, tractable,
+low-risk, byte-identical-default first target: the paged-pool burst-degradation bug.**
+
+### Per-path scorecard
+
+| Path | Expected gain | Tractability | Bit-exactness | Net |
+|------|---------------|--------------|---------------|-----|
+| **A** MoE grouped-GEMM | Standalone kernel: **bounded, low single-digit %** at npl128 (model is bandwidth-bound; 0014/0015 M-tile auto-tile already NEUTRAL here). The big MoE win = M2 = graph-the-decode-step, which is SHARED with D and pays off mostly at SMALL npl, not the npl128 benchmark point. | M1 (down_proj quantize retune) cheap; M2 a decode-graph-capture project (large). | M1 strictly bit-exact (byte-identical quantized output); M2 bit-exact (replay). Helps the DEFAULT. | Real but **bounded**; no clean standalone kernel win. Keep M1 as a cheap bank-shot; fold M2 into a decode-graph effort. |
+| **B** cublas lm_head (nvjet) | Bit-exact ceiling **<1%** (~3 ms; nvjet already ~72% of peak HBM, the step's most efficient major kernel). The only big win (NVFP4 head ~2.5%) is non-bit-exact AND unfair vs vLLM (which keeps BF16). | Custom skinny-GEMM = high effort, uncertain it beats cublas. | Bit-exact option caps <1%; the 2.5% option is a logits change (opt-in only). | **Dead end** for the default. Rank LAST. |
+| **C** TTFT / paged-pool burst | **Part 2 bug:** restores prefill from collapsed 65 -> ~507 t/s after a burst (removes the "restart per npl" crutch). **Part 1 scheduler:** the headline - 903 s/213 s burst TTFT -> vLLM 6-18 s regime. | **Part 2: small + localized** (paged-alloc unit + 1 seq_rm call site). Part 1: larger (fairness + admission + tuning). | Both gated behind `LLAMA_KV_PAGED`, **default-off byte-identical**. SERVING track (doesn't touch the parity-default numerics). | **Highest ROI x tractability.** Part 2 is a true correctness defect with a crisp repro. |
+| **D** dense CUDA-graph instability | **Zero** - f32 dense measured STABLE (<1% spread, no bimodality). The 287/336/487/498 bimodality was the SHELVED BF16 SSM path, not the shipped f32 path. | n/a (would need a rebuild for a graphs-off A/B). | n/a | **CLOSED.** Not worth any work. |
+
+### Ranked order (ROI x tractability x bit-exactness)
+
+1. **C-Part2 - paged-pool burst-degradation bug fix.** Small, localized, default-off byte-identical,
+   crisp pass/fail repro, removes a real production-serving defect + the benchmark's restart crutch.
+2. **C-Part1 - chunked-interleave / fair dynamic budget.** The public-facing TTFT headline closer,
+   but a larger effort and a deliberate TTFT-vs-ITL tradeoff. Do it AFTER the bug fix (the scheduler
+   reduces burst pain; the bug fix keeps the pool alive across bursts).
+3. **A-M1 - down_proj activation-quantize kernel retune** (cheap bit-exact bank-shot for the default;
+   bounded payoff on this bandwidth-bound model). It can optionally be folded into a future
+   decode-graph build (A-M2 / the shared MoE+GDN decode-graph capture), which is the only
+   structurally large bit-exact MoE lever but a big project that helps small-npl more than npl128.
+4. **B - lm_head kernel swap.** Bit-exact ceiling <1% with real risk. Skip unless a non-bit-exact
+   opt-in track opens (then NVFP4-head ~2.5% dense, gated, excluded from parity claims).
+5. **D - dense graph instability.** Closed, no work.
+
+### THE FIRST BUILD TARGET: paged-pool burst-degradation bug fix (C-Part2)
+
+**Why this one:** it is the only candidate that is simultaneously (a) high ROI - fixes a real
+correctness defect that forces the "restart per npl" crutch in long-lived serving, (b) tractable -
+small and localized to the paged-alloc unit plus one `seq_rm` call site, (c) safe for the parity
+claim - gated behind `LLAMA_KV_PAGED`, default-off byte-identical, and (d) verifiable with a crisp
+pass/fail repro. Every bit-exact-default alternative is bounded (A), a dead end (B), or closed (D).
+
+**Implementation plan (incremental, each step independently shippable):**
+1. **Fix-1 (core):** add `paged::PagedKVManager::truncate(seq, n_keep)` that keeps the first
+   `ceil(n_keep/bs)` blocks (`bs` = block size in tokens) and ref-count-frees the trailing blocks
+   after them; expose `paged_alloc::truncate(cache, stream, seq, n_keep)`; call it from
+   `llama_kv_cache::seq_rm` for the `p1 == MAX && p0 > 0` case (ideally any `[p0,p1)`). Closes the
+   reclamation gap so manager accounting tracks the kv-cache exactly.
+2. **Fix-2:** defrag-on-empty - when a stream reaches `get_used() == 0`, rebuild its free queue to
+   pristine contiguous order.
+3. **Fix-3:** paged release at server `slot.release()` so finished-idle sequences return blocks
+   promptly.
+4. **Fix-4 (optional):** best-fit / contiguous-run preference in `get_new_blocks` + a defrag pass
+   before the `find_slot` stock fallback.
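+
+Fix-2 (defrag-on-empty) amounts to resetting the free queue to ascending block order once nothing
+is in use. A minimal sketch, assuming the free queue is a `std::deque` of block ids;
+`defrag_if_empty` and its parameters are hypothetical names, and the guard that refuses to rebuild
+while some blocks are missing from the queue is a conservative choice of this sketch, not something
+the investigation specifies.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <cstdint>
+#include <deque>
+#include <numeric>
+
+// Hypothetical helper: when a stream has no used cells and every block is back in the
+// free queue, rewrite the queue as 0, 1, ..., n_blocks-1 so the next allocation is contiguous.
+// Returns true if the queue was rebuilt.
+inline bool defrag_if_empty(std::deque<uint32_t> & free_queue, size_t n_blocks, size_t n_used_cells) {
+    assert(free_queue.size() <= n_blocks);
+    if (n_used_cells != 0) return false;               // stream still holds live KV cells
+    if (free_queue.size() != n_blocks) return false;   // some block is still referenced
+    std::iota(free_queue.begin(), free_queue.end(), 0u);
+    return true;
+}
+```
+
+Calling this at the point where a stream reaches `get_used() == 0` keeps the cost off the hot
+allocation path: the rebuild is linear in the number of blocks and runs only when the pool is idle.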
+
+**Confirming measurement (the explicit repro, GPU-bound):** on ONE long-lived server, run an npl64
+burst, let it drain, then run npl8. PASS if (i) npl8 `prefill_tps` is within ~10% of a fresh-server
+npl8 (today it collapses from ~507 to ~65 t/s), and (ii) `paged_alloc::num_free` returns to the
+fresh-start value after the burst drains (proves no leaked blocks). Decode t/s must be unchanged.
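+
+The pass/fail criteria above can be written down as a small predicate. A sketch only: `RunStats`
+and `burst_repro_passes` are hypothetical names, the two inputs would come from the server's
+reported `prefill_tps` and from `paged_alloc::num_free`, and "within ~10%" is read here as "no more
+than 10% slower than fresh".
+
+```cpp
+#include <cassert>
+#include <cstddef>
+
+// Hypothetical container for the two numbers the repro compares.
+struct RunStats {
+    double prefill_tps;   // prefill throughput of the npl8 run, tokens/s
+    size_t num_free;      // free paged blocks once the run has drained
+};
+
+// PASS when the post-burst npl8 run is at most `tol` slower than a fresh-server npl8 run
+// and the free-block count is back at the fresh-start value (no leaked blocks).
+inline bool burst_repro_passes(const RunStats & fresh, const RunStats & post_burst, double tol = 0.10) {
+    assert(fresh.prefill_tps > 0.0);
+    const bool prefill_ok = post_burst.prefill_tps >= (1.0 - tol) * fresh.prefill_tps;
+    const bool pool_ok    = post_burst.num_free == fresh.num_free;
+    return prefill_ok && pool_ok;
+}
+```
+
+With the figures quoted in this document, a post-burst run at ~65 t/s against a fresh ~507 t/s
+fails the first condition regardless of the block count; the decode t/s check is a separate
+comparison not modeled here.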
+
+**Bit-exact gate it MUST pass:**
+- With `LLAMA_KV_PAGED` unset, the build is byte-identical to HEAD f7409c2 (the fix lives entirely
+  inside the paged path) - `test-backend-ops` + the greedy-decode md5 against the 0023 baseline are
+  unchanged.
+- With `LLAMA_KV_PAGED` set, the fix changes only block ACCOUNTING and PLACEMENT, never KV values or
+  compute, so the greedy-decode md5 on a fixed prompt is identical before vs after the fix (and the
+  post-burst run produces the same tokens as a fresh-server run).
+
+**Paths NOT worth building now:** B (lm_head, sub-1% bit-exact ceiling, the only big win is a
+non-bit-exact unfair-vs-vLLM logits change), and D (dense graph instability, measured stable -
+closed). A's standalone kernel value is bounded; keep A-M1 as a cheap follow-up and fold A-M2 into a
+later decode-graph project, but neither is the first target.
+
+**First target: ship the paged-pool burst-degradation bug fix (C-Part2, Fix-1 + Fix-2 + Fix-3).**
+
+Assisted-by: Claude:opus-4.8 [Claude Code]