docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target

Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM,
B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph):

- A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than
  vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is
  bounded on this bandwidth-bound a3b model. Keep down_proj quantize
  retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a
  later shared GDN+MoE decode-graph project.
- B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact
  ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair
  vs vLLM. Dead end. Rank last.
- C: paged-pool burst-degradation BUG (Part 2) is a true correctness
  defect (prefill collapses 507->65 t/s after a burst, restart cures it):
  reclamation gap on partial seq_rm + free-queue fragmentation. Plus the
  static decode-first budget (Part 1) explains 903s/213s burst TTFT and
  the chunked-interleave fix.
- D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's
  bimodality was the shelved BF16 SSM path. Closed.

First build target: the paged-pool burst-degradation bug fix (Fix-1
truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot-
completion). Small, localized, default-off byte-identical, crisp repro
(npl64 burst then npl8: prefill within 10% of fresh + num_free restored).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-26 09:42:55 +00:00
parent 89e62fc74f
commit b061e4aef0

View File

@@ -0,0 +1,511 @@
# OTHER_PATHS_INVESTIGATION.md
Read-only investigation of the four post-0023 paths (A MoE grouped-GEMM, B lm_head GEMM,
C TTFT/paged-pool burst, D dense CUDA-graph). One section per agent. No GPU except the
moe-gpu-profile agent.
---
## A. MoE grouped-GEMM gap (label: moe-gemm-source, READ-ONLY, no GPU)
### The decisive finding: vLLM's MoE on GB10 is MARLIN W4A16, not a native-FP4 grouped GEMM
Engine-log ground truth (`VLLM_DECODE_GROUNDING.md`, from `~/bench/h2h_moe_vllm.log`):
`"Using 'MARLIN' NvFp4 MoE backend ... Your GPU does not have native support for FP4
computation ... Weight-only FP4 compression will be used leveraging the Marlin kernel"`.
vLLM does NOT take its native-FP4 cutlass/trtllm MoE path on sm_121 (it whitelists only
sm_100/103 datacenter Blackwell for FP4-MMA MoE). So on this box vLLM's MoE is:
- `moe_align_block_size` (BLOCK-PADDED token-sort; `num_tokens_post_padded`, sentinel pad rows),
- **2 grouped `moe_wna16_marlin_gemm` launches per MoE block** (gate_up, then SiLU+mul, then down),
each ONE launch over ALL experts, `use_fp32_reduce=True`,
- **W4A16: activations stay bf16, NEVER quantized**; FP4 weights dequantized in-kernel to bf16,
bf16 MMA,
- the whole decode step under a FULL CUDA graph.
llama's MUL_MAT_ID on GB10 (mmq.cu id-branch + mmid.cu + mmq.cuh stream-k) is:
- `mm_ids_helper` token-sort/scatter, **NO block padding** (exact segments, `expert_bounds`),
- **activation FP4 quantize** (`quantize_mmq_fp4`) of the expert-gathered rows = W4A4,
- **1 persistent stream-k `mul_mat_q<NVFP4>` launch per projection**, native Blackwell FP4-MMA
(`block_fp4_mmq`), fp32 accumulate + `stream_k_fixup`,
- per-expert-density `mmq_x` (M-tile) select (patches 0014/0015, default tile 64 @ density<=8),
- NOT under a CUDA graph.
### So the "missing fused grouped GEMM" does not exist - llama already HAS it
llama's grouped FP4-MMA stream-k IS the same sorted-grouped-GEMM algorithm vLLM uses, and on
GB10 llama's MoE GEMM is at a HIGHER-precision/native-FP4 tier than vLLM's W4A16 Marlin. The
MoE decode gap (77-83% of vLLM vs dense 90-117%) is therefore NOT a grouped-GEMM-architecture
deficit. The MoE-specific EXTRA gap (the ~10-15pt that MoE is worse than dense) decomposes as:
1. **W4A4 activation-quantize tax (llama-only, the biggest MoE-specific discrete cost).**
llama quantizes activations to FP4 for the MoE GEMM; vLLM (W4A16) keeps them bf16 and pays
ZERO activation quantize. At MoE decode npl128 that is 1024 up/gate rows (patch 0023 dedup'd
the broadcast ones to 128 unique + a coalesced block gather) PLUS 1024 down_proj rows
(distinct per expert, CANNOT be dedup'd). nsys decode-isolated (`MOE_QUANT_DEDUP_RESULTS.md`):
`quantize_mmq_nvfp4` is still **457 ms** of decode GPU-time after the 0023 up/gate dedup; the
remaining bulk is the down_proj per-expert re-quantize. vLLM's W4A16 choice is actually SMART
for MoE decode on a bandwidth-bound box: keeping activations bf16 adds negligible activation
bandwidth at M~8/expert but ELIMINATES the entire quantize pass.
2. **Un-graphed extra MoE nodes' launch bubbles.** Per MoE layer llama runs mm_ids_helper +
quantize + gather + 2 grouped GEMMs + SiLU/mul + down-quantize + moe_sum as separate
host-launched ggml nodes, none under a CUDA graph; vLLM runs moe_align + 2 grouped launches
under a full decode graph. This is the SAME launch-bubble root cause `CRITICALPATH_GAP_ANALYSIS.md`
pins for the GDN region (57 ms/step dense = 100% bubble), amplified for MoE by the extra
quantize/gather/scatter nodes - consistent with MoE being relatively worse than dense.
3. **Ragged tiny-M tile + `need_check` partial-tail MMA** in the grouped stream-k. Already
addressed by 0014/0015 and measured **NEUTRAL** on q36-35b-a3b: that model is bandwidth/
SSM-recurrence-bound, not col-tile-occupancy-bound (the `LLAMA_MOE_DECODE_TILE` sweep shows 64
is the only non-negative width and it is within noise). So the M-tile lever has nothing to
bite on for THIS model; it banks +4.8% only on col-tile-bound MoE (Qwen3-Coder-30B).
### Bit-exact llama MoE-GEMM levers (ranked)
- **M1 (bit-exact, modest): down_proj activation-quantize kernel retune.** The remaining ~457 ms
is dominated by the down_proj per-expert FP4 re-quantize (`ne11==n_expert_used`, no dedup
possible). The per-block quantize is a pure per-thread function of 16 consecutive inputs (the
property 0023 exploited to make its gather bit-exact), so the launch GEOMETRY can be retuned
(occupancy/coalescing, like 0022 did for the recurrence and like 0023's coalesced-uint4 gather
fix) while the quantized bytes stay BYTE-IDENTICAL. Also worth checking whether the down gather
(`ids_src1`) is redundant when the SwiGLU intermediate is already expert-contiguous. Scope:
nsys the down-branch `quantize_mmq_fp4` on GB10, retune block/grid, gate on test-backend-ops
MUL_MAT_ID exact + greedy md5 == 0023. Expected: low single-digit % at npl128 (bounded - it is
a fraction of a fraction of the step), but it is the only clean quantize-axis lever left after
0023 and it is strictly bit-exact.
- **M2 (bit-exact, the structurally-correct big one, SHARED with path D/A.2): CUDA-graph the MoE
decode step.** Graph replay does not change numerics => bit-exact. The MoE-specific extra node
count (quantize+gather+scatter+2 GEMM+silu+sum/layer, none graphed) makes the launch-bubble tax
larger for MoE than dense, which is exactly why MoE sits at 77-83% while dense is 90-117%.
Capturing the decode forward removes those bubbles. This is the same lever the GDN/A.2 work
scoped; it helps MoE MORE than dense. Highest-leverage bit-exact MoE win, but it is a
decode-graph-capture project, not a MoE-GEMM kernel edit.
- **M0 (already shipped): 0017 `GGML_CUDA_FP4_MINBLOCKS` (min-resident-CTAs register-cap) and
0014/0015 (`mmq_x` density auto-tile) already cover the FP4-MMA occupancy + M-tile axes of the
SHARED `mul_mat_q<NVFP4>` kernel.** 0017 is bit-exact (register allocation cannot change
results) and was tuned on dense; a MoE-targeted min-blocks re-sweep (grouped per-expert M-tiles
have different occupancy than the dense M=128 GEMM) is a cheap bit-exact follow-up, but
MOE_DENSITY_AUTO_TILE already found this model is bandwidth-bound, so headroom is likely small.
### NOT recommended (explicitly out of scope)
- **W4A16 bf16-activation MoE GEMM (matching vLLM's Marlin choice).** This is the single biggest
MoE-specific structural difference and would erase the activation-quantize tax entirely, but it
(a) is NOT bit-exact (bf16 activations vs llama's FP4), and (b) is the W4A16 occupancy-wall
dead-end the docs flag (only ~9 TFLOP/178 t/s on GB10). Do not pursue.
### Verdict / ranking of path A
Path A is NOT a missing-kernel opportunity - llama already runs the sorted-grouped-FP4-MMA GEMM,
at a higher native-FP4 tier than vLLM's GB10 W4A16 Marlin fallback. The MoE-specific extra gap is
(1) the W4A4 activation-quantize tax vLLM structurally avoids by choosing W4A16, and (2) the same
un-graphed launch-bubble tax as the GDN region, amplified by MoE's extra nodes. The only purely
bit-exact, MoE-GEMM-local lever left is M1 (down_proj quantize retune, modest). The real MoE
bit-exact win is M2 (CUDA-graph the decode step), which is the SAME lever as path A.2/D and helps
MoE more than dense - so A's best lever collapses into the decode-graph effort rather than
standing alone. Recommend ranking A's standalone kernel value BELOW the decode-graph (M2/D) and
the lm_head (B) levers; fold A into the decode-graph build, and keep M1 as a cheap bit-exact
bank-shot.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## B. lm_head GEMM (label: cublas-lmhead, READ-ONLY, no GPU)
### The decisive fact: lm_head is BF16, not NVFP4 - so it CANNOT take the FP4 MMQ path
`output.weight` (the LM head) in q36-35b-a3b-nvfp4 is **type 30 = GGML_TYPE_BF16, NOT quantized**
(verified in `DECODE_PARITY_EXPLORE.md:298`: "2425 MB = 2.37 GB, read in full each step", 16% of
weight traffic). This is by construction: the model was quantized with `--tensor-type attn/ffn=
nvfp4`, which converts the attn+ffn tensors to NVFP4 and **leaves `output.weight` (and `tok_embd`)
at base BF16** - the standard recipe, because the final projection is the most logit-sensitive
tensor. The NVFP4 sidecar scales (`output_s`, `output_in_s`) are only created when
`output->type == GGML_TYPE_NVFP4` (`llama-model.cpp:1459`), so for the BF16 head `model.output_s`
is null.
### Why it runs cublas/nvjet and not MMQ (exact routing trace)
Graph: `qwen35moe.cpp:244` `cur = build_lora_mm(model.output, cur, model.output_s)` ->
`llama-graph.cpp:1093` is just `ggml_mul_mat(ctx0, w, cur)` (the null `w_s` skips the scale `ggml_mul`).
Then `ggml_cuda_mul_mat` (`ggml-cuda.cu:2540`) decides the kernel:
- `use_mul_mat_q` / `use_mul_mat_vec_q` BOTH require `ggml_is_quantized(src0->type)`. BF16 is NOT
quantized (`is_quantized=false` for F16/BF16/F32; NVFP4 IS `is_quantized=true`, `ggml.c:748`).
=> **both MMQ paths are ineligible for the BF16 head.** (If the head were NVFP4 it WOULD route to
the tuned FP4 `mul_mat_q` - this is exactly the difference.)
- At decode npl128 the activation `src1->ne[1] = 128` columns: `use_mul_mat_vec_f` is gated off by
the mmvf batch cap; `use_mul_mat_f` (the MMF bf16 tensor-core GEMM) is gated off by
`ggml_cuda_should_use_mmf` for the wide `151936-row x 128-col` shape.
- `use_batched_cublas_bf16` is true, but the batched-cublas branch additionally requires
`src1->ne[2]*src1->ne[3] > 1` (a 3D/4D multi-batch GEMM). The decode lm_head is 2D
(`ne[2]*ne[3] == 1`) => **batched-cublas branch is skipped.**
- => falls through to the final `else`: `ggml_cuda_op_mul_mat_cublas`. With `src0` BF16 +
bf16-MMA hardware it takes the BF16 branch (`ggml-cuda.cu:1663`): `cublasGemmEx(CUDA_R_16BF,
CUDA_R_16BF -> CUBLAS_COMPUTE_32F, TENSOR_OP)`. **That cublasLt kernel is `nvjet_sm121`.**
Cost (both models): dense `nvjet` lm_head = **12.17 ms = 3.66% of the 332.8 ms dense step**
(`F16_DENSE_RESIDUAL_PROBE.md:65`); MoE = **11.91 ms = 3.1%** (`CRITICALPATH_GAP_ANALYSIS.md:398`).
### CRITICAL correction: the current head is NOT "f32-lm_head" - it is already BF16-rounded
The task brief calls the baseline "f32-lm_head"; it is not. The cublas BF16 branch **downcasts the
F32 activation to BF16**, does BF16xBF16 with F32 accumulate, **writes the result as BF16** (dst is
`CUDA_R_16BF`), then upcasts BF16->F32. So today's "bit-exact reference" logits are already
**BF16-precision**, not f32. Two consequences:
1. Any NVFP4/FP8 head swap is measured against a BF16 baseline, not f32 - the precision delta vs
the *true* f32 head is partly already paid.
2. A *different* BF16 GEMM kernel that also F32-accumulates and BF16-rounds the output is
**bit-identical for the vast majority of logits** (differs only at rare BF16 rounding ties) -
this is what makes option (c) below "essentially bit-exact".
### The options, and which break bit-exactness
- **(a) NVFP4-quantize the head -> tuned FP4 MMQ. BIGGEST win, BREAKS bit-exactness.** Weight
2.37 GB BF16 -> ~0.6 GB NVFP4 (0.5625 B/wt = 4x fewer bytes) AND it then hits the already-tuned
`mul_mat_q<NVFP4>` (0017) instead of cublas. Memory-bound floor drops ~4x => save ~8-9 ms =
~2.5% of the dense step. But NVFP4 < BF16 precision => **different logit bits, can flip the greedy
argmax** = NOT bit-exact; and it is **UNFAIR vs vLLM**, which keeps its LM head BF16
(`DECODE_PARITY_EXPLORE.md:358`: "fp8 LM head ... only matters if vLLM also quantizes it"). This
is the same opt-in, non-bit-exact bucket as the f16-glue probe (already concluded SKIP).
- **(b) FP8 / Q8_0 head.** Smaller error than NVFP4 but still != BF16 bits => still NOT bit-exact,
and it is not even on the tuned FP4 MMQ path, so it buys less speed than (a). No reason to prefer.
- **(c) Keep BF16 weight, swap the kernel (custom skinny wide-vocab streaming GEMM, or a cublasLt
algo heuristic tuned for the thin-M / huge-N memory-bound shape).** The ONLY essentially-bit-exact
option (F32 accumulate + BF16 round = identical except rounding ties, per the correction above).
### Realistic lever + scope: there is NO good bit-exact lever here
Bandwidth math kills option (c): `nvjet` moves 2.37 GB in ~11.9-12.2 ms = **~195-199 GB/s = ~72% of
the GB10's 273 GB/s peak**. The lm_head GEMM is therefore **already one of the MOST
bandwidth-efficient kernels in the step** - the overall decode step runs at only 40% util /
110 GB/s (`DECODE_PARITY_EXPLORE.md`). The bit-exact ceiling is tiny: even a perfect
HBM-saturating kernel (199 -> 273 GB/s) takes 11.9 -> ~8.7 ms = **save ~3 ms = ~0.9% of the dense
step**, and beating cublas's own tuned nvjet on a pure weight-stream shape is NOT guaranteed (it may
already be near-optimal). High kernel-writing effort, uncertain sub-1% payoff. (`F16_DENSE_RESIDUAL_
PROBE.md:97` independently estimates a bf16-glue nvjet recovery of only ~5 ms and flags it
"uncertain - may already run TF32" - consistent with little headroom.)
The structural reason: the head must read the **entire 2.37 GB weight for just 128 output columns**
(inherently memory-bound), and **you cannot cut those weight bytes without changing the dtype** -
i.e. bit-exactness and the only real speedup (fewer weight bytes) are **mutually exclusive** here.
### Verdict / ranking of path B
The lm_head cublas/nvjet GEMM is a **dead end for a bit-exact win**: it is already ~72% of peak HBM
(the step's most efficient major kernel), so a bit-exact kernel swap caps at <1% with real risk and
no guarantee of beating cublas. The only large win - NVFP4-quantizing the head (~2.5%) - is
explicitly non-bit-exact AND unfair vs vLLM (which keeps BF16), so it lands in the same opt-in
non-bit-exact bucket as f16-glue that was already shelved. Rank B's bit-exact value **at the bottom**
of the four paths. The one worthwhile note for the team is the correction that the head is already
BF16 (not f32), which slightly narrows what "bit-exact" even protects here; if the project ever
opens a *non*-bit-exact opt-in track, NVFP4-head (option a) is a clean ~2.5% dense lever that rides
the existing tuned FP4 MMQ - but it must be gated as opt-in and excluded from any vLLM-parity claim.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## A.2 / D. GPU-measured MoE decode decomposition + dense-graph stability (label: moe-gpu-profile, THE GPU AGENT)
nsys `--cuda-graph-trace=node` on a steady MoE decode at npl128 (q36-35b-a3b-nvfp4, HEAD f7409c2,
clean 0023 build-cuda). The measurement was decode-isolated: the run has a prefill phase (16384 tok,
the big-GEMM region) followed by 64 steady decode steps; I segmented the timeline by GPU-idle gaps,
dropped the prefill window, and aggregated per-kernel time over the 64-step decode window only
(`moe_decode_npl128.{nsys-rep,trace.csv}` on the DGX; extractor `decfull.py`/`grid.py`).
### MoE decode window: 98.3% GPU-bound, ~165 ms/step. Per-kernel share of decode GPU-time:
```
41.9% gated_delta_net_cuda REC (SHARED with dense, already tuned 0018-0022)
26.9% mul_mat_q<NVFP4, M-tile=64> MOE expert grouped GEMM (MUL_MAT_ID) <-- biggest MoE-specific kernel
7.7% nvjet_sm121 (cublas bf16) attn/gdn bf16 projections + the BF16 lm_head (path B)
2.7% cutlass_80 bf16 s16816 relu bf16 GEMM (shared-expert / gate)
2.7% k_bin_bcast (mul/add) expert-combine + routing-weight scale + glue
2.6% k_get_rows_float REC recurrent-state gather
2.4% flash_attn_ext_f16 attention
2.3% mul_mat_q<NVFP4, M-tile=128> router / non-grouped FP4 GEMM
2.1% ssm_conv(+update) REC
2.0% quantize_mmq_nvfp4 MOE W4A4 activation-quantize tax (3.25 ms/step)
1.8% convert_unary bf16<->f32 glue around the bf16 projections
1.5% cpy_scalar glue
0.9% rms_norm
0.5% REC gating act | 0.5% streamk_fixup | 0.3% mm_ids_helper | 0.3% argsort |
0.2% l2norm | 0.2% set_rows | 0.1% gather_mmq_fp4 | <0.1% topk/softmax/reduce (routing)
```
Bucketed: **Recurrence (shared, tuned) ~= 47.5%** (gdn 41.9 + get_rows 2.6 + ssm_conv 2.1 + gating
0.5 + l2norm 0.2 + set_rows 0.2). **MoE FFN+routing block ~= 31%** (grouped GEMM 26.9 + activation
quant 2.0 + streamk 0.5 + mm_ids_helper/argsort/gather/softmax/topk/reduce ~1.3 + the expert-combine
share of bin_bcast). **cublas/cutlass bf16 projections ~= 10.4%** (nvjet 7.7 + cutlass 2.7).
Attention ~2.4%. The recurrence is the single biggest term but it is shared with dense and already
the subject of 0018-0022, so it is NOT a MoE lever.
### The biggest MoE-specific kernel (the lever): mul_mat_q<NVFP4, M-tile=64> grouped GEMM
26.9% of decode = ~43.5 ms/step, avg **373 us/call**, grids of **2048 and 8192** 64-wide tiles
(blk=32 = 1 warp/block). Compare the dense FFN GEMM in the same family at npl128: `mul_mat_q<NVFP4,
M-tile=128>` avg **31 us/call**, grid 48. The grouped per-expert GEMM is ~12x the per-call cost and
launches 100-200x more tiles because each of 128 experts is a separate tiny-M sub-GEMM (128 tokens x
top-k / 128 experts ~= a handful of rows per expert) padded into 64-wide tiles. This is exactly the
ragged-tiny-M / col-tile-occupancy axis section A's 0014/0015 `mmq_x` density auto-tile already
covers and measured NEUTRAL on this bandwidth-bound a3b model. MMQ FP4 is integer/FP4-exact
independent of tile geometry, so this kernel IS bit-exact to retune (occupancy/min-blocks/M-tile),
but the headroom on THIS model is small (it is bandwidth-bound, not tile-occupancy-bound).
### Confirmations / quantifications of section A (from live GPU, not source-reading):
1. **Un-graphed at npl128: CONFIRMED in source, but NOT the npl128 bottleneck.** NVFP4 on sm121
(turing_plus path) has `mmvq_mmid_max = 8` (`mmvq.cu:145`); MoE decode batch ne[2]=128 > 8, so
`[TAG_MUL_MAT_ID_CUDA_GRAPHS]` (`ggml-cuda.cu:3273`) disables CUDA graphs for the WHOLE step and
the MMQ grouped path (not MMVQ) is taken. HOWEVER the measured decode window is **98.3% GPU-util
with ~7.8 us inter-step host gaps** - at npl128 the kernels are large enough to fully hide the
per-op launch latency, so the un-graphed launch-bubble tax is negligible HERE. The un-graphed
penalty is a SMALL-npl problem; at npl128 the MoE gap is in-kernel (grouped GEMM + quantize),
not host bubbles. This refines A's M2: graphing the decode step helps small-npl MoE much more
than npl128 MoE.
2. **W4A4 activation-quantize tax: CONFIRMED present but only 2.0% at npl128.** `quantize_mmq_nvfp4`
= 3.25 ms/step in the decode-isolated window (A's 457 ms figure is a whole-run/different-window
total). Real, and vLLM-W4A16 avoids it, but it is a small-single-digit term, not dominant.
3. **lm_head/projection cublas (path B): CONFIRMED ~12.4 ms/step** of nvjet in MoE decode (matches
B's 11.91 ms), but that 7.7% bundle is mostly per-layer attn/gdn bf16 projections, not just the
one lm_head.
### D. Dense CUDA-graph stability: f32 dense is STABLE, the bimodality was a BF16-only artifact
Dense (q36-27b-nvfp4) has no MUL_MAT_ID, so it stays fully CUDA-graphed. Measured S_TG @npl128:
```
intra-process (1 load, 6x npl=128, npp8/ntg48, N_KV=7168): 376.2 376.2 375.7 375.1 375.3 374.9 (spread <0.4%)
inter-process (6 separate procs, fresh graph capture each):373.6 377.0 376.8 376.6 376.2 375.7 (spread ~0.9%)
committed heavy config (npl128 ntg128, N_KV=32768): 333.3 / 334.8 / 335.9 (spread ~0.8%)
```
No bimodality in either replay (intra-process) or capture (inter-process). The custom graph state
machine (`ggml-cuda.cu:4484`: warmup_complete requires 2 property-stable calls; the one-time capture
cost lands in T_PP, not S_TG) absorbs capture into prefill, which is the only "hint" (the first
in-process measurement has a slightly higher T_PP and a marginally lower S_TG, fully bounded). The
287/336/487/498 bimodality in the brief was the shelved BF16 SSM-state path (BF16_SSM_STATE.diff,
never applied), not the shipped f32 path. There is NO graphs-off env in this fork (graph enable is
compile-time USE_CUDA_GRAPH + the warmup machine), so a graph-disable A/B would need a rebuild; given
the f32 path is already stable to <1%, path D is a non-issue and not worth the rebuild.
### Verdict (GPU agent)
- The MoE decode gap vs vLLM at npl128 is **in-kernel, not host-overhead**: 98.3% GPU-util rules
out the un-graphed launch-bubble story AT npl128. The single biggest MoE-specific kernel is the
`mul_mat_q<NVFP4, M-tile=64>` grouped GEMM (26.9%, 43.5 ms/step); it is bit-exact to retune but
bandwidth-bound on this a3b model (A's auto-tile already measured neutral), so the standalone
bit-exact MoE-GEMM lever is REAL but BOUNDED. The recurrence (47.5%) is shared and already tuned.
- **Path D (dense graph instability) is closed: the shipped f32 dense path is stable (<1%, no
bimodality).** No latent fragility, no rebuild warranted.
- Net ranking from the GPU side agrees with A/B: the MoE-GEMM and lm_head levers are both bounded
and partly non-bit-exact; the only structurally large bit-exact MoE win (A's M2, graph the decode
step) pays off mostly at SMALL npl, not at the npl128 where the benchmark gap is reported.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## C. TTFT / paged-pool burst degradation (label: ttft-burst-rootcause, READ-ONLY, source + committed traces)
Files read: `paged/paged_kv_manager.{h,cpp}`; patches `0004` (on-demand alloc), `0007` (persistent
manager + ref-counted prefix), `0008` (server cross-request share), `0013`/`0016` (prefill budget);
docs `QWEN36_NVFP4_BENCH.md`, `BENCHMARK_PROGRESS.md`, `CHUNKED_PREFILL_PLAN.md`,
`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`, `P1_DYNAMIC_BUDGET_RESULTS.md`, `FUTURE_LEVERS.md`.
### Part 1 - the static decode-first budget: why a 128-way burst hits 903 s dense / 213 s MoE TTFT
How the budget schedules (patch 0016, `server-context.cpp::update_slots`): each step builds ONE
mixed batch. Phase 1 appends every GENERATING slot's single sampled token UNCONDITIONALLY (no budget
gate), so after Phase 1 `batch.n_tokens == D` (the live decode load). Phase 2 then fills prompt
tokens, bounded by three predicates: the hard `batch.n_tokens < n_batch` (2048) ceiling, a per-step
`prefill_budget_step`, and a per-slot `prefill_cap_per_slot`. **Decode is structurally claimed first
and never capped; only prefill is throttled.**
At the shipped config (`LLAMA_MAX_BATCH_TOKENS=512`, i.e. T=512=n_ubatch) the dynamic terms
degenerate to constants:
- `prefill_budget_step = max(n_ubatch, T - D) = max(512, 512-D) = 512` for all D in [0,128] - the
floor binds, the `T-D` adaptivity NEVER bites (exactly the "structural note" in
`P1_DYNAMIC_BUDGET_RESULTS.md`).
- `prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) = min(512, 5243) = 512`, clamped to 512.
So each step admits at most 512 prefill tokens TOTAL and up to 512 per single slot. Each benchmark
prompt is exactly 512 tokens and there is NO round-robin (0016 drains slots in index order):
**the first waiting slot consumes the entire 512-token step budget with its whole prompt; the 128
prompts prefill strictly SERIALLY, one prompt per step.** Slot k's first token appears after ~k
prefill steps and each step co-batches the accumulating decode load, so step time grows. Mean TTFT
~= (half the prompts) x step_time ~= **903 s dense** (each step reads the full 28B NVFP4 weights) /
**213 s MoE** (3B active = cheaper steps). Decode_agg stays high (384/726 t/s) because Phase 1 seats
every decode token every step. This is the **deliberate decode-first tradeoff**: T=512 was chosen
for decode throughput + memory; TTFT was the sacrificed axis. The 903 s is partly self-inflicted by
the floor budget + lack of fairness, not a kernel limit (dense `prefill_tps` collapses to ~70 t/s
under the throttle vs vLLM's flat ~1420).
The fix (chunked-interleave / fair dynamic budget = P2 of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`,
NOT implemented), three pieces in `update_slots` Phase 2, zero libllama change:
1. Raise T toward `n_batch` (2048) so the per-step total budget is large; keep decode-first via the
REAL `prefill_budget_step = T - D` (leftover auto-shrinks as D rises, so the step never inflates
past T even at npl128).
2. A per-slot chunk cap MUCH smaller than the budget (the `long_prefill_token_threshold` analogue),
e.g. 128-256 tokens, so one prompt cannot monopolize the step.
3. A round-robin start offset over PROCESSING_PROMPT slots so leftover budget spreads across MANY
waiting prompts per step.
Net: instead of "one full 512-prompt per step" (serial, last prompt waits 128 steps), each step
admits small chunks from ~T/cap prompts at once, so all 128 advance in lockstep and TTFT collapses
from O(k*step) to O(constant) - the vLLM 6-18 s regime. 0016's per-slot-cap variable already exists
but is inert at the shipped config and lacks the round-robin spreader. Honest boundary (already in
the docs): this closes TTFT, it does NOT lift the ~161/333 decode ceiling (a separate lever).
### Part 2 - the burst-degradation BUG: later lower-npl prefill collapses 507 -> 65 t/s, decode fine, restart cures it
The signature - prefill-only collapse, decode untouched, persists in-process, a server restart fully
cures it (the benchmark's documented "restart per npl" workaround) - points to persistent paged-pool
host state never restored short of `clear()`/teardown. Two compounding mechanisms, both confirmable
from the patch source:
**(1) RECLAMATION GAP - blocks are returned ONLY on a FULL-range wipe.** `paged_alloc` returns a
sequence's blocks to the pool in exactly two places (patch 0004, kept in 0007): `clear()` ->
`release_all`, and `seq_rm(seq, p0, p1)` ONLY when `p0 == 0 && p1 == MAX`. But llama-server's normal
slot lifecycle issues PARTIAL truncations: slot reuse with a retained common/BOS prefix calls
`seq_rm(slot.id, n_past, -1)` with `n_past > 0` (patch 0008 itself calls
`common_context_seq_rm(ctx, slot.id, n_past, -1)`); context-shift / partial rewinds likewise. None
satisfy `p0 == 0`, so the release hook never fires: the kv-cache frees those CELLS but the manager
still believes the sequence owns those BLOCKS. The two desync and the manager's effective free pool
shrinks every time. Patch 0008's own comment is the smoking gun - it added the `n_past < 16` gate
because a mismatched full-prompt reservation vs suffix-only submission "never leaves stale blocks
(which otherwise fragment the paged pool ... and crashed the server under high fan-out)". 0008 only
closed that hole for the narrow `share()` path; the general partial-`seq_rm` path stays unhooked, so
over a high-fan-out burst leaked blocks accumulate and never return.
**(2) FRAGMENTATION / NO COMPACTION - the free queue is permuted by the burst and never rebuilt.**
Even for cleanly freed blocks, `BlockPool::free_blocks` just `prepend_n`/`append_n`s them in free
order; no compaction, no pristine reset. After a high-fan-out burst (many interleaved alloc/free
across many seqs in the unified pool, or reversed-order frees in a per-stream pool) the free queue is
a scrambled permutation of physical block ids. A subsequent LOW-npl prefill then `popleft`s
physically SCATTERED blocks, so its 512-token KV scatter-WRITE plus the in-kernel paged-attention
GATHER lose locality across the KV span -> prefill throughput collapses. Decode is a single-token
append per step with a gather amortized over tiny per-step work, so it barely notices - exactly the
observed "prefill collapses, decode robust". The scramble + leak persist for the process lifetime
(only `clear()`/restart rebuilds a contiguous free queue) - precisely why restart-per-npl restores
507 t/s. Contributing factor: slots used in the burst but not reassigned next run are never released
(release fires only on next-task divergence), so a low-npl run sees a reduced, fragmented pool and
falls back to the stock contiguous allocator more often (the `place()->false->res.idxs.clear()`
fallback in find_slot), scanning a littered cell array - another prefill-only slowdown.
Fix scope (all gated behind `LLAMA_KV_PAGED`, default-off byte-identical, no libllama API change):
- **Fix-1 (core, ~30-50 lines): close the reclamation gap.** Add
`paged::PagedKVManager::truncate(seq, n_keep)` that frees the trailing blocks of a request beyond
block index `ceil(n_keep/bs)` (ref-counted, mirroring vLLM's free of the truncated block suffix),
expose `paged_alloc::truncate(cache, stream, seq, n_keep)`, and call it from
`llama_kv_cache::seq_rm` for the `p1 == MAX && p0 > 0` case (ideally any `[p0,p1)`). Manager
accounting then tracks the kv-cache exactly; the leak stops.
- **Fix-2 (small): defrag on empty.** When a stream's cells reach `get_used() == 0`, rebuild that
manager's free queue to pristine contiguous order (or recreate the manager) so a reused pool
starts unfragmented.
- **Fix-3 (small): release on slot completion.** Add a paged release at server `slot.release()` so
finished-but-idle sequences return blocks promptly and a later low-npl run sees a full, compact
pool.
- **Fix-4 (optional hardening): best-fit / contiguous-run preference** in `get_new_blocks` + a
defrag pass before the find_slot stock-fallback fires.
Validation repro (GPU-bound, for a later profiling pass): npl64 burst then npl8 on ONE server;
assert npl8 `prefill_tps` within ~10% of a fresh-server npl8, and that `paged_alloc::num_free`
returns to the fresh value after the burst drains.
### Verdict / ranking of path C
Two distinct things: a **BUG** (Part 2) and a **tuning tradeoff** (Part 1). Rank the BUG first - it
is a true correctness/hygiene defect, not a tradeoff: a long-lived production server silently
degrades under ordinary mixed load and currently REQUIRES the "restart per npl" crutch, unacceptable
in real serving. Fix scope is small and localized to the paged-alloc unit + one `seq_rm` call site,
default-off byte-identical, with a crisp pass/fail repro. The chunked-interleave scheduler (Part 1)
is the bigger HEADLINE (the weakest benchmark number, 903 s/213 s burst TTFT vs vLLM 6-18 s) but a
larger effort with a deliberate TTFT-vs-decode-ITL tradeoff to navigate. The two are complementary:
the scheduler reduces how punishing each burst is; the bug fix ensures the pool survives the burst
so the NEXT request is not poisoned.
Assisted-by: Claude:opus-4.8 [Claude Code]
---
## SYNTHESIS - ranking and the first build target (label: orchestrator)
The brief framed two tracks: **BIT-EXACT** levers (help the shipped f32 parity DEFAULT, included in
the vLLM-parity claim) and **SERVING** levers (gated behind `LLAMA_KV_PAGED`, default-off
byte-identical, outside the parity claim). The decisive cross-cutting finding from all four agents:
**there is no compelling first build target on the bit-exact decode-default track** - A is bounded,
B is a sub-1% dead end, D is closed - **while the SERVING track has one clear, high-ROI, tractable,
low-risk, byte-identical-default first target: the paged-pool burst-degradation bug.**
### Per-path scorecard
| Path | Expected gain | Tractability | Bit-exactness | Net |
|------|---------------|--------------|---------------|-----|
| **A** MoE grouped-GEMM | Standalone kernel: **bounded, low single-digit %** at npl128 (model is bandwidth-bound; 0014/0015 M-tile auto-tile already NEUTRAL here). The big MoE win = M2 = graph-the-decode-step, which is SHARED with D and pays off mostly at SMALL npl, not the npl128 benchmark point. | M1 (down_proj quantize retune) cheap; M2 a decode-graph-capture project (large). | M1 strictly bit-exact (byte-identical quantized output); M2 bit-exact (replay). Helps the DEFAULT. | Real but **bounded**; no clean standalone kernel win. Keep M1 as a cheap bank-shot; fold M2 into a decode-graph effort. |
| **B** cublas lm_head (nvjet) | Bit-exact ceiling **<1%** (~3 ms; nvjet already ~72% of peak HBM, the step's most efficient major kernel). The only big win (NVFP4 head ~2.5%) is non-bit-exact AND unfair vs vLLM (which keeps BF16). | Custom skinny-GEMM = high effort, uncertain it beats cublas. | Bit-exact option caps <1%; the 2.5% option is a logits change (opt-in only). | **Dead end** for the default. Rank LAST. |
| **C** TTFT / paged-pool burst | **Part 2 bug:** restores prefill from collapsed 65 -> ~507 t/s after a burst (removes the "restart per npl" crutch). **Part 1 scheduler:** the headline - 903 s/213 s burst TTFT -> vLLM 6-18 s regime. | **Part 2: small + localized** (paged-alloc unit + 1 seq_rm call site). Part 1: larger (fairness + admission + tuning). | Both gated behind `LLAMA_KV_PAGED`, **default-off byte-identical**. SERVING track (doesn't touch the parity-default numerics). | **Highest ROI x tractability.** Part 2 is a true correctness defect with a crisp repro. |
| **D** dense CUDA-graph instability | **Zero** - f32 dense measured STABLE (<1% spread, no bimodality). The 287/336/487/498 bimodality was the SHELVED BF16 SSM path, not the shipped f32 path. | n/a (would need a rebuild for a graphs-off A/B). | n/a | **CLOSED.** Not worth any work. |
### Ranked order (ROI x tractability x bit-exactness)
1. **C-Part2 - paged-pool burst-degradation bug fix.** Small, localized, default-off byte-identical,
crisp pass/fail repro, removes a real production-serving defect + the benchmark's restart crutch.
2. **C-Part1 - chunked-interleave / fair dynamic budget.** The public-facing TTFT headline closer,
but a larger effort and a deliberate TTFT-vs-ITL tradeoff. Do it AFTER the bug fix (the scheduler
reduces burst pain; the bug fix keeps the pool alive across bursts).
3. **A-M1 - down_proj activation-quantize kernel retune** (cheap bit-exact bank-shot for the default;
bounded payoff on this bandwidth-bound model). Optionally folded with a future decode-graph build
(A-M2 / the shared MoE+GDN decode-graph capture), which is the only structurally large bit-exact
MoE lever but a big project that helps small-npl more than npl128.
4. **B - lm_head kernel swap.** Bit-exact ceiling <1% with real risk. Skip unless a non-bit-exact
opt-in track opens (then NVFP4-head ~2.5% dense, gated, excluded from parity claims).
5. **D - dense graph instability.** Closed, no work.
### THE FIRST BUILD TARGET: paged-pool burst-degradation bug fix (C-Part2)
**Why this one:** it is the only candidate that is simultaneously (a) high ROI - fixes a real
correctness defect that forces the "restart per npl" crutch in long-lived serving, (b) tractable -
small and localized to the paged-alloc unit plus one `seq_rm` call site, (c) safe for the parity
claim - gated behind `LLAMA_KV_PAGED`, default-off byte-identical, and (d) verifiable with a crisp
pass/fail repro. Every bit-exact-default alternative is bounded (A), a dead end (B), or closed (D).
**Implementation plan (incremental, each step independently shippable):**
1. **Fix-1 (core):** add `paged::PagedKVManager::truncate(seq, n_keep)` that ref-count-frees the
trailing blocks beyond block index `ceil(n_keep/bs)`; expose
`paged_alloc::truncate(cache, stream, seq, n_keep)`; call it from `llama_kv_cache::seq_rm` for the
`p1 == MAX && p0 > 0` case (ideally any `[p0,p1)`). Closes the reclamation gap so manager
accounting tracks the kv-cache exactly.
2. **Fix-2:** defrag-on-empty - when a stream reaches `get_used() == 0`, rebuild its free queue to
pristine contiguous order.
3. **Fix-3:** paged release at server `slot.release()` so finished-idle sequences return blocks
promptly.
4. **Fix-4 (optional):** best-fit / contiguous-run preference in `get_new_blocks` + a defrag pass
before the find_slot stock fallback.
**Confirming measurement (the explicit repro, GPU-bound):** on ONE long-lived server, run an npl64
burst, let it drain, then run npl8. PASS if (i) npl8 `prefill_tps` is within ~10% of a fresh-server
npl8 (vs the ~65 vs ~507 collapse today), and (ii) `paged_alloc::num_free` returns to the
fresh-start value after the burst drains (proves no leaked blocks). Decode t/s must be unchanged.
**Bit-exact gate it MUST pass:**
- With `LLAMA_KV_PAGED` unset, the build is byte-identical to HEAD f7409c2 (the fix lives entirely
inside the paged path) - `test-backend-ops` + the greedy-decode md5 against the 0023 baseline are
unchanged.
- With `LLAMA_KV_PAGED` set, the fix changes only block ACCOUNTING and PLACEMENT, never KV values or
compute, so the greedy-decode md5 on a fixed prompt is identical before vs after the fix (and the
post-burst run produces the same tokens as a fresh-server run).
**Paths NOT worth building now:** B (lm_head, sub-1% bit-exact ceiling, the only big win is a
non-bit-exact unfair-vs-vLLM logits change), and D (dense graph instability, measured stable -
closed). A's standalone kernel value is bounded; keep A-M1 as a cheap follow-up and fold A-M2 into a
later decode-graph project, but it is not the first target.
**First target: ship the paged-pool burst-degradation bug fix (C-Part2, Fix-1 + Fix-2 + Fix-3).**
Assisted-by: Claude:opus-4.8 [Claude Code]