From e3f8149f3b665f4d61070a10b3aa743cd09bb5b5 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Fri, 26 Jun 2026 23:36:38 +0000
Subject: [PATCH] docs(paged): lever-4 KL-gate FAIL - NVFP4 MoE projections cost ~6% PPL, no-ship

Re-quantizing the MoE GGUF's bf16 GDN/attn projections to NVFP4 (the
lever-4 scope hypothesis) fails the KL gate on every axis vs the
shipping NVFP4 baseline: PPL +6.51% (FULL) / +6.15% (CONS) against a
<1% gate, mean KLD-to-f16 0.164/0.172 vs baseline 0.137, top-1 argmax
agreement down ~2.2-2.6 points. Both projq variants rejected;
in_proj_ba being kept bf16 (CONS) recovered almost nothing, so the
damage is in the bulk attn/GDN projections.

Root cause: the bf16 projections are a deliberate modelopt precision
choice, not a provenance accident. vLLM runs the same modelopt
checkpoint, so it keeps these projections bf16 too - the baseline GGUF
already matches vLLM. The ~20.3ms projection-GEMM bucket is the price
of high-precision projections that vLLM also pays; it is not the
llama-vs-vLLM lever it appeared to be. The speed win is only
purchasable with a 6% PPL regression. MoE stays at 86.3% of vLLM @
npl128.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../patches/paged/LEVER4_PROJNVFP4_RESULTS.md | 83 +++++++++++++++++++
 1 file changed, 83 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/LEVER4_PROJNVFP4_RESULTS.md

diff --git a/backend/cpp/llama-cpp/patches/paged/LEVER4_PROJNVFP4_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/LEVER4_PROJNVFP4_RESULTS.md
new file mode 100644
index 000000000..a161465ed
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER4_PROJNVFP4_RESULTS.md
@@ -0,0 +1,83 @@
+# LEVER 4 - NVFP4 the bf16 MoE GDN/attn projections: KL-GATE FAIL, no-ship
+
+GPU agent (L4-gatebench), DGX GB10 (sm_121, BLACKWELL_NATIVE_FP4=1). Build at 0028 (HEAD fafe878,
+branch `paged`). Lever 4 hypothesis (from `MOE_GAP_VS_VLLM.md` + the lever-4 scope): the MoE GGUF's
+GDN/attn projections (in_proj_qkvz=attn_qkv, in_proj_ba=ssm_alpha/ssm_beta, out_proj=ssm_out,
+attn_gate, full-attn attn_q/k/v/output) are left in BF16 by nvidia modelopt while the dense
+q36-27b-nvfp4 (unsloth) already ships them NVFP4. The scope called this a "quant-provenance accident"
+and proposed re-quantizing them to NVFP4 to recover the ~20.3->13.8ms projection-GEMM bucket.
+
+**Verdict: KL-GATE FAIL on every axis, for both variants. STOP, do NOT ship. No 0029 GGUF, no
+gallery entry, no bench, no nsys** (per spec: KL fails first -> report, do not bench/ship). The bf16
+projections are a **deliberate precision choice, not an accident** - re-quantizing them costs ~6% PPL.
+
+## Gate setup (all bit-changing -> KLD gate per spec)
+
+- Reference (the "f32" of the gate): `~/work/darwin_36b_opus/f16.gguf` - the full-precision f16 GGUF
+  of the same Qwen3.6-35B-A3B model (qwen35moe, 41 blocks, vocab 248320, embd 2048). Verified it
+  matches the NVFP4 baseline shape; its own PPL = 7.376 self-consistent with the KLD base.
+- KL base: `llama-perplexity --kl-divergence-base` over `wiki.test.raw`, c512, 16 chunks (8192 tok),
+  -ngl 99, seed 1. Base file `~/bench/l4gate/klbase_moe.dat` (2.0 GB). f16 PPL(base) = 7.3734.
+- Candidates scored with `--kl-divergence` against that base, identical c512/16-chunks/seed.
+- Current "bf16-projection GGUF" baseline = `~/bench/q36-35b-a3b-nvfp4.gguf` (the shipping NVFP4:
+  experts NVFP4, GDN/attn projections BF16). It is the reference for the PPL-delta and argmax gates.
+
+## Measurements (16 chunks, c512, 8192 tokens, wiki.test.raw)
+
+| model | PPL(Q) | PPL delta vs baseline | Mean KLD-to-f16 | Same-top-p (argmax agree vs f16) | RMS dp |
+|-------|--------|-----------------------|-----------------|----------------------------------|--------|
+| baseline NVFP4 (proj BF16, shipping) | 7.3896 | - (reference) | 0.1366 | 84.31% | 9.20% |
+| **projq FULL** (190 proj -> NVFP4, incl. in_proj_ba) | 7.8705 | **+6.51%** | 0.1638 | 81.72% | 10.47% |
+| **projq CONS** (130 proj -> NVFP4, in_proj_ba kept BF16) | 7.8440 | **+6.15%** | 0.1716 | 82.16% | 10.82% |
+
+Baseline vs f16: PPL ratio 1.0022 (+0.22%), i.e. the shipping NVFP4 is already near-f16 - because
+modelopt put the quant-sensitive GDN/attn projections in BF16 and only the experts (designed for FP4)
+in NVFP4. projq pushes the projections to NVFP4 and PPL ratio jumps to 1.067 (FULL) / 1.064 (CONS).
+
+## Gate verdict (all three conditions FAIL)
+
+1. **PPL delta < ~1% vs the bf16-projection GGUF -> FAIL.** FULL +6.51%, CONS +6.15%. Off by ~6x.
+2. **KLD-to-f32 < 0.06 -> FAIL.** The shipping baseline NVFP4 itself sits at 0.137 mean KLD vs f16
+   (per-token KLD is naturally high at 248K vocab), and projq raises it to 0.164 (FULL) / 0.172 (CONS).
+   Whatever the intended reference granularity, projq is strictly worse than the baseline, not < 0.06.
+3. **Zero greedy-argmax flips -> FAIL.** Per-token top-1 agreement vs f16 drops from 84.31% (baseline)
+   to 81.72% (FULL) / 82.16% (CONS): the requant flips the argmax on ~2.2-2.6% MORE tokens than the
+   shipping model. (A direct `llama-cli --temp 0 -n 48` greedy diff was attempted but the paged
+   llama-cli build segfaults at teardown on ALL models incl. baseline - not projq-specific - so the
+   8192-token Same-top-p above is the argmax measure used; it is strictly stronger than a 48-tok probe.)
+
+CONSERVATIVE (keeping the most quant-sensitive in_proj_ba=ssm_alpha/ssm_beta in BF16) recovered almost
+nothing: 7.844 vs 7.871. The damage is in the BULK attn/GDN projections (attn_qkv, ssm_out, attn_gate,
+attn_q/k/v/output), not the tiny in_proj_ba. An attn_gate-excluded third variant would, at best, shave
+a fraction of a percent off a 6% miss - not worth a GPU pass. lm_head was already NVFP4 in the baseline
+(and in vLLM's checkpoint), so it is not a variable here and was never the issue.
+
+## Why the premise was wrong (root cause of the failure)
+
+The scope assumed vLLM runs these projections in NVFP4. It does not. vLLM runs the **nvidia modelopt
+checkpoint** (`~/bench/q36-35b-a3b-nvfp4-vllm`), which is the SAME provenance that left these exact
+projections in BF16. So:
+
+- The baseline GGUF's bf16 projections **match vLLM** already. They are not a llama-vs-vLLM gap.
+- modelopt left in_proj_qkvz/in_proj_ba/out_proj/attn_q/k/v/output in BF16 **because they are
+  quant-sensitive in this hybrid gated-DeltaNet + attention model** - the gate confirms this empirically
+  at ~6% PPL. The dense q36-27b-nvfp4 (unsloth) tolerating NVFP4 projections does not transfer: it is a
+  different (non-MoE, different-provenance) model and a different sensitivity profile.
+- Re-quantizing them is therefore not "matching vLLM" - it is going BEYOND vLLM's precision and paying
+  for it in quality. The ~20.3ms projection-GEMM bucket is the price of running these projections in
+  high precision; vLLM pays the same precision cost (its nvjet/cutlass bf16 GEMMs), so the bucket is NOT
+  the lever it looked like. The L4 speed win is real but only purchasable with a 6% PPL regression -
+  rejected by the gate.
+
+## Disposition / artifacts
+
+- Both projq GGUFs exist on DGX but are **dead** (do not publish): `~/bench/q36-35b-a3b-nvfp4-projq.gguf`
+  (FULL, md5 1bd32114..., sha256 88b7e812...), `~/bench/q36-35b-a3b-nvfp4-projq-cons.gguf` (CONS, md5
+  6847ebe3..., sha256 ca035111...). The L4-requant pin files (`~/bench/pins_projq_{full,cons}.txt`) and
+  `/tmp/gen_pins.py` remain if a future, kernel-side (not precision-side) approach is ever revisited.
+- Gate logs: `~/bench/l4gate/` - `f16base.log`, `kld.{baseline,projqFULL,projqCONS}.log`,
+  `klbase_moe.dat`.
+- No code change, no patch, no commit to the DGX `llama-paged-dev` tree. No `-paged` gallery entry.
+- MoE remains at 86.3% of vLLM @ npl128; this lever does not move it within the quality budget.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
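
The gate arithmetic reported in the results file can be re-derived from the Measurements table alone. The following is a minimal sketch, not part of the patch: the dictionary layout and the helper names (`ppl_delta_pct`, `gate_report`) are hypothetical, the thresholds are the ones quoted in the gate verdict (PPL delta < ~1%, mean KLD < 0.06), and the third check uses "top-1 agreement vs f16 does not drop below the baseline" as an assumed stand-in for the zero-argmax-flip condition, mirroring the Same-top-p proxy used in the write-up.

```python
# Values copied from the Measurements table (PPL, mean KLD-to-f16, top-1 agreement in %).
F16_PPL = 7.3734  # f16 PPL(base) from the gate setup
BASELINE = {"ppl": 7.3896, "kld": 0.1366, "top1": 84.31}
CANDIDATES = {
    "projq FULL": {"ppl": 7.8705, "kld": 0.1638, "top1": 81.72},
    "projq CONS": {"ppl": 7.8440, "kld": 0.1716, "top1": 82.16},
}

# Thresholds as quoted in the gate verdict section.
PPL_GATE_PCT = 1.0
KLD_GATE = 0.06


def ppl_delta_pct(ppl: float, ref_ppl: float) -> float:
    """Relative PPL change of a candidate versus a reference, in percent."""
    return (ppl / ref_ppl - 1.0) * 100.0


def gate_report(name: str, m: dict) -> dict:
    """Evaluate the three gate conditions for one candidate (hypothetical helper)."""
    delta = ppl_delta_pct(m["ppl"], BASELINE["ppl"])
    return {
        "name": name,
        "ppl_delta_pct": round(delta, 2),
        "ppl_ratio_vs_f16": round(m["ppl"] / F16_PPL, 3),
        "top1_drop_points": round(BASELINE["top1"] - m["top1"], 2),
        "ppl_pass": delta < PPL_GATE_PCT,
        "kld_pass": m["kld"] < KLD_GATE,
        # Assumed proxy for "zero greedy-argmax flips": no loss of top-1 agreement.
        "top1_pass": m["top1"] >= BASELINE["top1"],
    }


REPORTS = {name: gate_report(name, m) for name, m in CANDIDATES.items()}

if __name__ == "__main__":
    for report in REPORTS.values():
        print(report)
```

Running the script prints one dictionary per candidate; all three `*_pass` fields come out `False` for both variants, consistent with the no-ship verdict.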