docs(paged): lever-4 KL-gate FAIL - NVFP4 MoE projections cost ~6% PPL, no-ship

Re-quantizing the MoE GGUF's bf16 GDN/attn projections to NVFP4 (the lever-4
scope hypothesis) fails the KL gate on every axis vs the shipping NVFP4 baseline:
PPL +6.51% (FULL) / +6.15% (CONS) against a <1% gate, mean KLD-to-f16 0.164/0.172
vs baseline 0.137, top-1 argmax agreement down ~2.2-2.6 points. Both projq
variants rejected; in_proj_ba being kept bf16 (CONS) recovered almost nothing, so
the damage is in the bulk attn/GDN projections.

Root cause: the bf16 projections are a deliberate modelopt precision choice, not a
provenance accident. vLLM runs the same modelopt checkpoint, so it keeps these
projections bf16 too - the baseline GGUF already matches vLLM. The ~20.3ms
projection-GEMM bucket is the price of high-precision projections that vLLM also
pays; it is not the llama-vs-vLLM lever it appeared to be. The speed win is only
purchasable with a 6% PPL regression. MoE stays at 86.3% of vLLM @ npl128.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-26 23:36:38 +00:00
parent 9a1be79f04
commit e3f8149f3b

View File

@@ -0,0 +1,83 @@
# LEVER 4 - NVFP4 the bf16 MoE GDN/attn projections: KL-GATE FAIL, no-ship
GPU agent (L4-gatebench), DGX GB10 (sm_121, BLACKWELL_NATIVE_FP4=1). Build at 0028 (HEAD fafe878,
branch `paged`). Lever 4 hypothesis (from `MOE_GAP_VS_VLLM.md` + the lever-4 scope): the MoE GGUF's
GDN/attn projections (in_proj_qkvz=attn_qkv, in_proj_ba=ssm_alpha/ssm_beta, out_proj=ssm_out,
attn_gate, full-attn attn_q/k/v/output) are left in BF16 by nvidia modelopt while the dense
q36-27b-nvfp4 (unsloth) already ships them NVFP4. The scope called this a "quant-provenance accident"
and proposed re-quantizing them to NVFP4 to recover the ~20.3->13.8ms projection-GEMM bucket.
**Verdict: KL-GATE FAIL on every axis, for both variants. STOP, do NOT ship. No 0029 GGUF, no
gallery entry, no bench, no nsys** (per spec: KL fails first -> report, do not bench/ship). The bf16
projections are a **deliberate precision choice, not an accident** - re-quantizing them costs ~6% PPL.
## Gate setup (all bit-changing -> KLD gate per spec)
- Reference (the "f32" of the gate): `~/work/darwin_36b_opus/f16.gguf` - the full-precision f16 GGUF
of the same Qwen3.6-35B-A3B model (qwen35moe, 41 blocks, vocab 248320, embd 2048). Verified it
matches the NVFP4 baseline shape; its own PPL = 7.376 self-consistent with the KLD base.
- KL base: `llama-perplexity --kl-divergence-base` over `wiki.test.raw`, c512, 16 chunks (8192 tok),
-ngl 99, seed 1. Base file `~/bench/l4gate/klbase_moe.dat` (2.0 GB). f16 PPL(base) = 7.3734.
- Candidates scored with `--kl-divergence` against that base, identical c512/16-chunks/seed.
- Current "bf16-projection GGUF" baseline = `~/bench/q36-35b-a3b-nvfp4.gguf` (the shipping NVFP4:
experts NVFP4, GDN/attn projections BF16). It is the reference for the PPL-delta and argmax gates.
## Measurements (16 chunks, c512, 8192 tokens, wiki.test.raw)
| model | PPL(Q) | PPL delta vs baseline | Mean KLD-to-f16 | Same-top-p (argmax agree vs f16) | RMS dp |
|-------|--------|-----------------------|-----------------|----------------------------------|--------|
| baseline NVFP4 (proj BF16, shipping) | 7.3896 | - (reference) | 0.1366 | 84.31% | 9.20% |
| **projq FULL** (190 proj -> NVFP4, incl. in_proj_ba) | 7.8705 | **+6.51%** | 0.1638 | 81.72% | 10.47% |
| **projq CONS** (130 proj -> NVFP4, in_proj_ba kept BF16) | 7.8440 | **+6.15%** | 0.1716 | 82.16% | 10.82% |
Baseline vs f16: PPL ratio 1.0022 (+0.22%), i.e. the shipping NVFP4 is already near-f16 - because
modelopt put the quant-sensitive GDN/attn projections in BF16 and only the experts (designed for FP4)
in NVFP4. projq pushes the projections to NVFP4 and PPL ratio jumps to 1.067 (FULL) / 1.064 (CONS).
## Gate verdict (all three conditions FAIL)
1. **PPL delta < ~1% vs the bf16-projection GGUF -> FAIL.** FULL +6.51%, CONS +6.15%. Off by ~6x.
2. **KLD-to-f32 < 0.06 -> FAIL.** The shipping baseline NVFP4 itself sits at 0.137 mean KLD vs f16
(per-token KLD is naturally high at 248K vocab), and projq raises it to 0.164 (FULL) / 0.172 (CONS).
Whatever the intended reference granularity, projq is strictly worse than the baseline, not < 0.06.
3. **Zero greedy-argmax flips -> FAIL.** Per-token top-1 agreement vs f16 drops from 84.31% (baseline)
to 81.72% (FULL) / 82.16% (CONS): the requant flips the argmax on ~2.2-2.6% MORE tokens than the
shipping model. (A direct `llama-cli --temp 0 -n 48` greedy diff was attempted but the paged
llama-cli build segfaults at teardown on ALL models incl. baseline - not projq-specific - so the
8192-token Same-top-p above is the argmax measure used; it is strictly stronger than a 48-tok probe.)
CONSERVATIVE (keeping the most quant-sensitive in_proj_ba=ssm_alpha/ssm_beta in BF16) recovered almost
nothing: 7.844 vs 7.871. The damage is in the BULK attn/GDN projections (attn_qkv, ssm_out, attn_gate,
attn_q/k/v/output), not the tiny in_proj_ba. An attn_gate-excluded third variant would, at best, shave
a fraction of a percent off a 6% miss - not worth a GPU pass. lm_head was already NVFP4 in the baseline
(and in vLLM's checkpoint), so it is not a variable here and was never the issue.
## Why the premise was wrong (root cause of the failure)
The scope assumed vLLM runs these projections in NVFP4. It does not. vLLM runs the **nvidia modelopt
checkpoint** (`~/bench/q36-35b-a3b-nvfp4-vllm`), which is the SAME provenance that left these exact
projections in BF16. So:
- The baseline GGUF's bf16 projections **match vLLM** already. They are not a llama-vs-vLLM gap.
- modelopt left in_proj_qkvz/in_proj_ba/out_proj/attn_q/k/v/output in BF16 **because they are
quant-sensitive in this hybrid gated-DeltaNet + attention model** - the gate confirms this empirically
at ~6% PPL. The dense q36-27b-nvfp4 (unsloth) tolerating NVFP4 projections does not transfer: it is a
different (non-MoE, different-provenance) model and a different sensitivity profile.
- Re-quantizing them is therefore not "matching vLLM" - it is going BEYOND vLLM's precision and paying
for it in quality. The ~20.3ms projection-GEMM bucket is the price of running these projections in
high precision; vLLM pays the same precision cost (its nvjet/cutlass bf16 GEMMs), so the bucket is NOT
the lever it looked like. The L4 speed win is real but only purchasable with a 6% PPL regression -
rejected by the gate.
## Disposition / artifacts
- Both projq GGUFs exist on DGX but are **dead** (do not publish): `~/bench/q36-35b-a3b-nvfp4-projq.gguf`
(FULL, md5 1bd32114..., sha256 88b7e812...), `~/bench/q36-35b-a3b-nvfp4-projq-cons.gguf` (CONS, md5
6847ebe3..., sha256 ca035111...). The L4-requant pin files (`~/bench/pins_projq_{full,cons}.txt`) and
`/tmp/gen_pins.py` remain if a future, kernel-side (not precision-side) approach is ever revisited.
- Gate logs: `~/bench/l4gate/` - `f16base.log`, `kld.{baseline,projqFULL,projqCONS}.log`,
`klbase_moe.dat`.
- No code change, no patch, no commit to the DGX `llama-paged-dev` tree. No `-paged` gallery entry.
- MoE remains at 86.3% of vLLM @ npl128; this lever does not move it within the quality budget.
Assisted-by: Claude:opus-4.8 [Claude Code]