diff --git a/backend/cpp/llama-cpp/patches/paged/QUANT_GENERALITY.md b/backend/cpp/llama-cpp/patches/paged/QUANT_GENERALITY.md new file mode 100644 index 000000000..ff1650fb6 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/QUANT_GENERALITY.md @@ -0,0 +1,286 @@ +# QUANT_GENERALITY - are the paged decode opts NVFP4-specific or quant-agnostic? + +Source-verified classification of the paged decode optimizations (patches 0013-0029) +as either QUANT-AGNOSTIC (operate on the gated-DeltaNet f32/bf16 recurrent state, the +paged serving host path, or the matmul ROUTING - independent of the model's weight +quantization, so they help a Q4_K / Q8_0 / bf16 Qwen3.6 as much as an NVFP4 one) or +NVFP4-SPECIFIC (only fire for / only help GGML_TYPE_NVFP4 weights on a Blackwell GPU). + +READ-ONLY, NO GPU. Every classification below is taken from the patch body source, +not from the prose claims. Hardware referenced for the empirical plan only. + +--- + +## 1. THE GROUND TRUTH GATE: what makes anything NVFP4-specific + +There is exactly ONE runtime gate in the whole ggml-cuda matmul stack that means +"NVFP4 on Blackwell": + + mmq.cu: const bool use_native_fp4 = blackwell_mma_available(cc) + && (src0->type == GGML_TYPE_NVFP4 ...); + +(confirmed in ARCH_GENERALITY_AUDIT.md section gguf-targeting-1 and in patch 0023's +own diff context). A patch is NVFP4-specific iff the code it changes lives INSIDE a +`use_native_fp4` / `type == GGML_TYPE_NVFP4` / `blackwell_mma_available(cc)` branch. +Everything else - the gated-DeltaNet recurrence, the conv update, the SSM/conv state +caches, the MMQ-vs-MMVQ dispatch, the CUDA-graph guard, the host scheduler and paged +pool - is dtype-independent. + +The recurrent state is the decisive fact: in this hybrid model the gated-DeltaNet +temporal state, the conv ring state, q/k/v/g/beta and the SSM scratch are ALL +GGML_TYPE_F32 (asserted explicitly in every new op builder: see 0018 ggml.c +`GGML_ASSERT(state->type == GGML_TYPE_F32)`, 0019 same, 0021/0028 conv asserts +`conv_states->type == GGML_TYPE_F32`). The weight quantization type never enters the +recurrence or conv kernels. So any patch that only touches those is quant-agnostic by +construction. + +--- + +## 2. PER-PATCH CLASSIFICATION (with source evidence) + +| patch | what it changes | classification | source evidence | +|-------|-----------------|----------------|-----------------| +| 0013 | static per-step prefill-token budget (LLAMA_PREFILL_BUDGET) | QUANT-AGNOSTIC | tools/server/server-context.cpp only; a host scheduler loop bound on prompt-token COUNT; no dtype anywhere; default-off byte-identical | +| 0014 | manual MoE token-tile (mmq_x) cap | QUANT-AGNOSTIC | mmq.cuh `mul_mat_q_case`; cap applies on `args.expert_bounds != nullptr` (the MUL_MAT_ID grouped path) for ANY templated ``; no NVFP4 branch | +| 0015 | density-aware MoE token-tile auto-select | QUANT-AGNOSTIC | mmq.cuh; gate is `expert_bounds != nullptr` + per-expert density only, NEVER on src0 type. PROVEN on a non-NVFP4 model: the measured +4.8% win was Qwen3-Coder-30B (128 larger experts), test gate covers MXFP4 AND NVFP4 | +| 0016 | dynamic decode-first prefill budget (supersedes 0013) | QUANT-AGNOSTIC | update_slots() policy only; "identical decisions paged on or off", zero libllama/dtype touch; default-off | +| 0017 | FP4 GEMM decode mmq_y / minblocks tile tune | NVFP4-SPECIFIC, but DEFAULT-OFF / INERT | mmq.cuh `get_mmq_y_host`: fires only `type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)`. BUT the patch is a recorded NO-BUILD: every occupancy probe REGRESSED (kill-gate tripped), so nothing is enabled by default. Default build is byte-identical to stock; it changes no behavior | +| 0018 | in-place SSM recurrent-state write-back | QUANT-AGNOSTIC | gated_delta_net.cu + ggml.c; operates on the f32 recurrent state cache (`state->type == GGML_TYPE_F32`); removes a D2D f32 state copy. Weights never read by this op | +| 0019 | fused recurrent-state gather (ids read, no get_rows) | QUANT-AGNOSTIC | reads the f32 state cache via ids; builder asserts F32 on q/k/v/g/beta/state/state_dst; mirrors ggml_ssm_scan. No weight dtype involved | +| 0020 | gated-DeltaNet o_proj MMVQ->MMQ reshape | QUANT-AGNOSTIC (routing) | qwen35.cpp/qwen35moe.cpp/qwen3next.cpp: a 2D-vs-3D RESHAPE of the f32 activation so `src1->ne[1]=128` routes to MMQ instead of batch-1 MMVQ. The MMVQ(ne[1]<=8)-vs-MMQ dispatch is a generic ggml-cuda decision present for EVERY quantized type. See section 3 | +| 0021 | in-place conv-state fusion (conv+silu+ring write) | QUANT-AGNOSTIC | ssm-conv.cu + ggml.c new op asserts `conv_states/conv_kernel/x_cur/conv_state_dst == GGML_TYPE_F32`; pure f32 conv-state work | +| 0022 | gated_delta_net_cuda occupancy/coalescing retune | QUANT-AGNOSTIC | gated_delta_net.cu kernel: q/k/v/g/beta/state are all f32; the COLS_PER_WARP/NUM_WARPS fold is a scheduling change on the f32 recurrence. Never touches a weight tensor | +| 0023 | MoE NVFP4 activation-quantize de-dup | NVFP4-SPECIFIC | mmq.cu: the `gather_mmq_fp4` de-dup is INSIDE `if (use_native_fp4) { ... }`. Gathers `block_fp4_mmq`. The non-FP4 path (`quantize_mmq_q8_1_cuda`) is untouched. Confirmed NVFP4-only | +| 0024 | paged-pool burst reclaim (truncate/defrag/release) | QUANT-AGNOSTIC | paged-alloc / paged-kv-manager / llama-kv-cache host accounting; "never KV values or compute, no ggml op touched"; gated behind LLAMA_KV_PAGED | +| 0025 | MoE-decode CUDA-graph re-graph (graph-safe id path) | QUANT-AGNOSTIC (corrects hypothesis) | ggml-cuda.cu: relaxes the MUL_MAT_ID graph guard when `ggml_is_quantized(src0) && ggml_cuda_should_use_mmq(...)`. Gated on the GENERIC quantized-MMQ grouped path, NOT on NVFP4. See section 4 | +| 0026 | hybrid per-head f32/bf16 SSM state (--cache-type-ssm / tau) | QUANT-AGNOSTIC, default-off (and precision-changing) | common/arg.cpp + cparams type_s/type_r + tau; changes the RECURRENT-STATE cache dtype (f32 default, bf16 opt-in). Independent of the weight quant; default tau=0 keeps bit-exact f32 | +| 0028 | residual conv-tap gather fusion (ids read) | QUANT-AGNOSTIC | ssm-conv.cu new SSM_CONV_UPDATE_IDS op reads the f32 conv cache via ids; eliminates the last k_get_rows in the GDN decode path. f32 throughout | +| 0029 | block-table within-step host cache | QUANT-AGNOSTIC | llama-kv-cache.cpp / paged-attn.cpp: memcpy-reuse of an int32 block table across full-attn layers of a step; pure host pipeline, bit-exact | + +(There is no patch 0027.) + +### Summary count +- QUANT-AGNOSTIC (helps any weight quant): 0013, 0014, 0015, 0016, 0018, 0019, 0020, + 0021, 0022, 0024, 0025, 0026, 0028, 0029 - 14 of 16 landed patches. +- NVFP4-SPECIFIC: 0023 (the only landed NVFP4-only optimization) + 0017 (NVFP4-only but + default-off / inert, no measured win). + +--- + +## 3. 0020 IN DETAIL - MMQ-over-MMVQ at batched decode is a win for ANY quantized type + +The hypothesis is CONFIRMED. 0020 is not an FP4 trick: + +- The gated-DeltaNet op left its output in 3D SSM layout `[value_dim, n_seq_tokens=1, + n_seqs=128]`, so the ssm_out matmul saw `src1->ne[1] = 1` with the 128 sequences + stuck in `ne[2]`. +- ggml-cuda dispatches `ne[1] <= 8` to MMVQ (the batch<=8 GEMV) and larger to MMQ + (the tensor-core GEMM). This `ne[1]`-threshold dispatch is type-INDEPENDENT: it is + the same routing for Q4_K, Q8_0, Q6_K, MXFP4, NVFP4 - every k-/legacy-quant has BOTH + an MMVQ (mmvq.cu vec_dot) AND an MMQ (mmq.cuh) path. +- The fix is a `ggml_reshape_2d` to `[value_dim, n_seq_tokens*n_seqs] = [6144, 128]` so + `src1->ne[1] = 128` routes to the M=128 MMQ GEMM that amortizes the ssm_out weight + read across all 128 sequences. Same contiguous data, bit-identical. + +Why it generalizes: at batched decode (npl 32-128) the weight read of ssm_out is the +cost, and MMVQ at the degenerate batch-1 shape re-reads / fails to amortize the weight +for whatever dtype the weight is. MMQ at M=128 reads each weight tile once for all 128 +tokens. That amortization is a pure bandwidth win that exists for every quantized +weight type, not just NVFP4. A Q4_K or Q8_0 Qwen3.6 has the exact same 3D-SSM-output -> +batch-1-MMVQ pathology and gets the same MMQ amortization from the reshape. (The patch +already routes the in-projection through MMQ; only the output was stuck in 3D.) + +The same logic underwrites 0014/0015 (the MoE `mmq_x` token-tile is a generic grouped- +MMQ knob; the win was measured on a non-NVFP4 Qwen3-Coder-30B) and 0025 (section 4). + +--- + +## 4. 0025 CORRECTS THE HYPOTHESIS - it is quant-agnostic, not NVFP4-specific + +The hypothesis listed "the act-quant / quantize_mmq_nvfp4 portions of 0025" as +NVFP4-specific. That is a patch-number mismatch. The ACTUAL patch 0025 +(0025-qwen35moe-nvfp4-moe-decode-regraph.patch) does NOT contain any act-quant / +quantize_mmq_nvfp4 code. Its entire diff is one hunk in ggml-cuda.cu: + + bool mmid_needs_sync = !ggml_is_quantized(src0->type) || node->ne[2] > mmvq_mmid_max; + if (mmid_needs_sync && ggml_is_quantized(src0->type) && + getenv("LLAMA_MOE_FORCE_GRAPHS") && + ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[2], src0->ne[2])) { + mmid_needs_sync = false; // keep CUDA graphs on for the grouped-MMQ id path + } + +The relax condition is `ggml_is_quantized(src0->type) && ggml_cuda_should_use_mmq(...)` +- the GENERIC quantized grouped-MMQ id-path, NOT NVFP4. `should_use_mmq()` returns true +for Q4_K / Q8_0 / etc. at large enough batch just as for NVFP4. So a Q4_K or Q8_0 MoE +Qwen3.6 whose MUL_MAT_ID takes the grouped MMQ path also keeps CUDA graphs across the +MoE decode step under LLAMA_MOE_FORCE_GRAPHS. 0025 is quant-agnostic. + +LEVER2_GRAPH_COVERAGE_RESULTS.md confirms this is the role of 0025 ("0025's +[TAG_MUL_MAT_ID_CUDA_GRAPHS] env-gate keeps the grouped MMQ id-path graph-safe"). + +Where the hypothesis's "act-quant / quantize_mmq_nvfp4" actually lives: that is +LEVER 3 (LEVER3_ACTQUANT_FUSION_RESULTS.md - fuse W4A4 act-quant into RMSNorm/SiLU), +which is genuinely NVFP4-specific, BUT it was a measurement STOP and NEVER LANDED (no +patch 0030, no commit). Likewise LEVER 4 (NVFP4 the still-bf16 GDN/attn projections, +LEVER4_PROJNVFP4_RESULTS.md) is NVFP4-specific but FAILED its KL gate (~6% PPL) and was +NOT shipped. So the only NVFP4-specific code that actually landed is 0023 (+ inert 0017). + +### Net correction to the hypothesis +- 0018/0019, 0021, 0022, 0028, 0026, 0013/0016, 0029, 0020: CONFIRMED quant-agnostic. +- 0023: CONFIRMED NVFP4-specific. +- 0025: WRONG in the hypothesis -> it is QUANT-AGNOSTIC (CUDA-graph guard on the generic + quantized grouped-MMQ path). The NVFP4-specific "act-quant" work the hypothesis was + thinking of is LEVER 3, which is unshipped (STOP), not patch 0025. +- Bonus: 0014/0015 (not in the hypothesis) are quant-agnostic, and 0017 is + NVFP4-specific but default-off/inert. + +--- + +## 5. RELATIVE-IMPACT BY WEIGHT-QUANT SIZE + +Decode is bandwidth-bound on the weight read. The quant-agnostic opts target work whose +absolute cost is FIXED in the weight quant: the f32 recurrence, the f32 conv state, the +host pipeline. The weight-read buckets (MoE expert GEMM + dense projections) scale +~linearly with bits-per-weight. So the quant-agnostic opts deliver the same ABSOLUTE +millisecond saving at every quant, but the RELATIVE % shrinks as the weight grows. + +Anchor: the measured MoE q36-35b-a3b NVFP4 decode step (MOE_GAP_VS_VLLM.md, step = +169.8 ms, GPU-busy 97.5%), split into quant-agnostic vs weight-quant-scaling buckets: + +| bucket | ms/step @ NVFP4 | scales with weight bits? | which opts touch it | +|--------|-----------------|--------------------------|---------------------| +| Recurrence core (gated_delta_net) | 70.0 | NO (f32 state) | 0022 | +| Recurrent-state + conv gather/plumbing (k_get_rows 5.2 + ssm_conv 3.4) | ~8.6 | NO (f32) | 0018/0019/0021/0028 | +| Host bubble (sample+batch+block-table) | 4.2 | NO (host) | 0013/0016/0024/0029 | +| Router / norms / glue | ~5.4 | mostly NO | 0014/0015 partial | +| MoE expert GEMM | 47.3 | YES (4-bit now) | (weight read) | +| Dense GDN/attn projections + convert glue | 20.3 | YES | (weight read) | +| W4A4 act-quant tax (quantize_mmq_nvfp4) | 3.3 | (FP4 only) | 0023 | + +Quant-agnostic, weight-size-fixed total: ~70.0 + 8.6 + 4.2 + 5.4 = ~88 ms (~52% of the +NVFP4 step). Weight-read buckets: 47.3 + 20.3 = ~67.6 ms (~40%). + +Model the weight-read buckets as scaling with bytes-per-weight relative to NVFP4 (4-bit += 1x): Q8_0 ~ 2x, bf16 ~ 4x. Hold the ~88 ms fixed (the recurrence f32 byte stream and +host time do not change with the weight quant), and recompute the recurrence/host +fraction of the step: + +| weight quant | weight-read buckets (ms, est.) | fixed quant-agnostic (ms) | step (ms, est.) | recurrence+host % of step | +|--------------|--------------------------------|---------------------------|-----------------|---------------------------| +| NVFP4 (4-bit) | ~68 (1x) | ~88 | ~159 (+act-quant ~3) | ~52% (measured ~50%) | +| Q8_0 (8-bit) | ~136 (2x) | ~88 | ~224 | ~39% | +| bf16 (16-bit) | ~272 (4x) | ~88 | ~360 | ~24% | + +Reading this: +- The quant-agnostic SSM/serving opts deliver the SAME ~ms savings at Q8/bf16 as at + NVFP4 (they remove fixed f32/host work). The headline % speedups quoted in the patch + bodies (e.g. 0019 dense npl128 +37.8%, 0020 +31.7%, 0022 +11.1%) are the LARGEST at + NVFP4 precisely because the fixed recurrence is the biggest fraction of the smallest + (4-bit weight) step. The same absolute removal is a smaller % of a Q8 step and a much + smaller % of a bf16 step, because the weight-read denominator grows. +- This MATCHES the brief's decomposition framing (recurrence ~40-50%, GEMM ~26-28% at + NVFP4): at NVFP4 the recurrence dominates, so the recurrence-targeting opts are where + the win is; as the weight quant grows the GEMM dominates and the recurrence opts + matter relatively less (but never zero, and never negative). +- Corollary: the ONE NVFP4-specific landed lever, 0023, only addresses the ~3.3 ms FP4 + act-quant tax (and only the broadcast up/gate share of it) - the smallest bucket and + its measured win is +1.7%. The big bit-exact wins are all quant-agnostic. + +So the optimization set is overwhelmingly general: a Q4_K / Q8_0 / bf16 Qwen3.6 gets the +full recurrence + conv + serving + MMQ-routing benefit; only the small FP4 act-quant +de-dup (0023) does nothing for it (and the inert 0017 was never enabled). + +--- + +## 6. EMPIRICAL CONFIRMATION PLAN (specify only - DO NOT run; the GPU is busy) + +Goal: prove on hardware that the quant-agnostic opts FIRE and LIFT a non-NVFP4 Qwen3.6, +isolating them from the one NVFP4-specific lever. + +### 6.1 Hardware +GB10 / DGX Spark (sm_121), when free. The DGX has live deployments; this plan is +read-only until then. (Any Blackwell or non-Blackwell CUDA host also works to prove +quant-GENERALITY - the recurrence/serving opts are not Blackwell-gated; only the NVFP4 +FP4-MMA tier is. Running on a non-Blackwell card would ALSO demonstrate the opts help +where there is no use_native_fp4 path at all - a strong second proof.) + +### 6.2 Build the non-NVFP4 control GGUF first (prerequisite) +The same Qwen3.6 architecture, re-quantized so the weights are NOT NVFP4 but the +gated-DeltaNet/conv recurrence is still f32: + + - Source: the existing q36-27b (dense) and/or q36-35b-a3b (MoE) f16/bf16 GGUF already + on the DGX (~/work/darwin_36b_opus/f16.gguf is the MoE f16 used as the LEVER4 KL + base; an equivalent dense f16 exists). + - Produce: `llama-quantize f16.gguf q36-27b-Q4_K_M.gguf Q4_K_M` (primary control) and + optionally `... Q8_0` and keep the f16/bf16 as the 16-bit control. Q4_K_M is the + cleanest contrast: 4-bit like NVFP4 but a totally different (k-quant, non-FP4-MMA) + weight path, so any shared win is provably from the f32 recurrence / routing, not + from FP4. + - Note: this requantize is free (no retrain) and must be done before any A/B. + +### 6.3 Bit-exact gate per path (same method as the patch bodies) +For the bit-EXACT quant-agnostic opts (0018/0019/0020/0021/0022/0028/0029 and the +host 0013/0016/0024 default-off), the gate is: greedy `llama-completion --temp 0 +--seed 1 --ignore-eos -n 256`, md5 of the output, patches-ON == patches-OFF on the +Q4_K_M control. Per path: + - non-paged Q4_K vs paged Q4_K (expect the same benign paged-reduction FP-order + delta noted in PAGED_BITEXACT_NOTE.md / 0029, gate with KLD/PPL not md5 across the + paged boundary, md5-exact within a fixed paged/non-paged setting). + - patches-on vs patches-off (see toggles 6.4) on the Q4_K control: byte-identical md5. + - 0026 (bf16 SSM state) is precision-CHANGING -> gate with KLD-to-f16 + PPL, not md5, + exactly like LEVER4 did; default tau=0 stays md5-exact. + - test-backend-ops on the build: GATED_DELTA_NET, SSM_CONV, SSM_CONV_UPDATE, + SSM_CONV_UPDATE_IDS, MUL_MAT, MUL_MAT_ID, GET_ROWS all green (these op tests are + dtype-parametrized and already include non-FP4 types). + +### 6.4 The clean A/B (decode_agg, llama-batched-bench) +Two arms, SAME Q4_K_M control GGUF, `-fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`, +report S_TG (decode aggregate), median of 5 reps: + + - Arm A (patches-OFF baseline): the cleanest is two builds - the pre-0018 paged commit + (the SSM opts not yet present) vs HEAD. If a rebuild is not wanted, approximate + OFF on the single HEAD binary by setting every disabling toggle at once: + fused GDN off (cparams.fused_gdn_ar/ch path disabled - the "fusion off" mode the + patch docs A/B against), `GDN_NW=4 GDN_CPW=1` (0022 pre-retune), `LLAMA_MOE_AUTO_TILE=0` + (0015), no `LLAMA_MOE_FORCE_GRAPHS` (0025 off), `LLAMA_PAGED_NO_BT_CACHE=1` (0029), + `LLAMA_PAGED_NO_RECLAIM=1` (0024), `LLAMA_PREFILL_BUDGET`/`LLAMA_MAX_BATCH_TOKENS` + unset (0013/0016), tau=0 / ctssm f32 (0026). The two-build form is preferred for a + publishable number; the env form is a fast same-binary sanity A/B. + - Arm B (patches-ON default): stock defaults (fusion on, 16x8, auto-tile on, + FORCE_GRAPHS on for the MoE graph arm, bt-cache on, reclaim on). + +### 6.5 What result confirms quant-generality + 1. The quant-agnostic opts FIRE on Q4_K: nsys on Arm B (Q4_K) shows the same kernel + deltas the NVFP4 runs showed - `k_get_rows_float` bucket collapses (0019/0028), + `concat_cont` + decode `cpy_scalar` gone and `ssm_conv_update` present (0021), the + o_proj `mul_mat_vec_q m=1` bucket gone and absorbed into `mul_mat_q m=128` + (0020 - now a Q4_K MMQ kernel, proving the routing win is not FP4-bound), + `get_block_table` host time down ~90% (0029). + 2. The opts LIFT the non-NVFP4 model: Arm B S_TG > Arm A S_TG on the Q4_K control at + npl 32 and 128, with the recurrence/routing opts contributing the bulk (expect a + smaller % than the NVFP4 runs per section 5, but clearly positive and of the same + absolute ms order). + 3. The NVFP4-specific lever does NOTHING on Q4_K: toggling 0023 + (`GGML_CUDA_MOE_QUANT_DEDUP=0` vs default) shows ZERO delta on the Q4_K MoE control + (it never enters the `use_native_fp4` branch) - the negative control that isolates + the one NVFP4-only optimization from the general ones. + +A clean pass = Arm B beats Arm A on Q4_K with the SSM/conv/routing/host kernel deltas +present and 0023 inert. That proves the decode wins are quant-general; NVFP4 is just the +weight quant where they show the largest PERCENTAGE because its weight read is smallest. + +--- + +## 7. ONE-LINE VERDICT + +14 of the 16 landed paged decode patches (0013-0029) are quant-agnostic: they act on the +f32 gated-DeltaNet/conv recurrent state, the paged serving host path, or the generic +MMQ-vs-MMVQ / CUDA-graph routing, none of which read the weight tensor's quant type. Only +0023 is genuinely NVFP4-specific (and 0017 is NVFP4-only but default-off/inert). The +hypothesis was right except for 0025, which is quant-agnostic (a generic +`ggml_is_quantized && should_use_mmq` CUDA-graph guard); the NVFP4-specific "act-quant" +work it was conflated with is LEVER 3, which never shipped. The opts deliver fixed +absolute ms savings at any weight quant; the % is largest at NVFP4 only because its +4-bit weight read makes the fixed recurrence the biggest slice of the step. + +Assisted-by: Claude:opus-4.8 [Claude Code]