# PREFILL_GEMM_RESULTS - option (a) dequant->bf16 cuBLAS, measured on GB10

Companion to `PREFILL_GEMM_SCOPE.md`. This records the GPU A/B for the #1
prefill lever (route large-M NVFP4 dense GEMMs off FP4-MMQ onto dequant->bf16
cuBLAS / nvjet). Shipped as patch `0033`, **default-off** because the measured
result is a regression on this hardware.

Hardware: NVIDIA GB10 (sm_121), CUDA 13.0. Backend pin `9d5d882d`.
Models: `q36-27b-nvfp4.gguf` (dense), `q36-35b-a3b-nvfp4.gguf` (MoE).
Binary: `build-cuda/bin/llama-batched-bench -fa on -ngl 99`, `LLAMA_KV_PAGED=1`.
A/B is a single build toggled by `LLAMA_FP4_PREFILL_M` (0 = MMQ baseline, >0 =
route prefill M>threshold to bf16 cuBLAS), so it isolates exactly this lever.

## 1. Bit-exact / numeric gate (PASS - divergence benign)

| Gate | Result |
|---|---|
| `test-backend-ops -o MUL_MAT` (default, threshold off) | 1146/1146 pass |
| `test-backend-ops -o MUL_MAT_ID` (default) | 806/806 pass (MoE untouched) |
| `test-backend-ops -o MUL_MAT`, path FORCED (`LLAMA_FP4_PREFILL_M=64`) | NVFP4 large-M cases (m=2048/1600/2050, n=128, k=2048) green CUDA-vs-CPU |
| greedy md5, short prefill (< threshold), lever vs base | identical: `5951a5b4d624ce891e22ab5fca9bc439` (== documented dense reference; decode byte-untouched) |
| greedy md5, long prefill (> threshold, exercises bf16 path), lever vs base | identical: `5f3967df5781445feeb25762abb9eae7` (the new FP path flips no greedy argmax) |

The new path (NVFP4->bf16 round, bf16 tensor cores, f32 accumulate) is a
different FP path from fused FP4xQ8_1 MMQ, but it is precision-neutral-to-better:
keeping activations in bf16 instead of Q8_1 is strictly more precise, and the
greedy output is byte-identical. This matches the scope's prediction
(KLD(dequant-bf16 || f16) <= KLD(FP4-MMQ || f16)).

## 2. Performance (REGRESSION - the lever loses on GB10)

S_PP (prefill tokens/s), q36-27b dense, A/B `LLAMA_FP4_PREFILL_M` off vs on:

| prefill ubatch M | npl | base S_PP (MMQ) | lever S_PP (bf16 cuBLAS) | delta |
|---|---|---|---|---|
| 512  | 32 | 958.99  | 486.65 | -49% |
| 1024 | 8  | 1013.65 | 587.27 | -42% |
| 2048 | 8  | 918.46  | 649.42 | -29% |

Default-off control (no env): S_PP 966.98 == base (within noise) -> the patch is
inert by default.

## 3. Why it loses (the scope premise was wrong for GB10)

The scope assumed FP4-MMQ is register-bound to ~3% of FP4 peak at large M, so a
vendor large-M kernel would win. **Measured, FP4-MMQ at M=512..2048 beats
dequant->bf16 cuBLAS by 29-49%.** Two compounding reasons:

1. **bf16 tensor-core peak is ~half FP4 peak on GB10.** Even a perfect bf16 GEMM
   caps at ~half the throughput the FP4-MMA path can reach.
2. **The dequant tax is an un-amortized memory pass.** Per prefill step the new
   path reads FP4 weights (~0.5 B/elt), writes bf16 (2 B/elt), then the GEMM
   reads bf16 (2 B/elt) = ~8x the weight byte traffic of the FP4-MMQ read
   (~0.5 B/elt). The dequant write is M-independent, so it only amortizes as M
   grows: the gap shrinks 49% -> 42% -> 29% from M=512 -> 2048 but never crosses
   even at M=2048 (above the default n_ubatch).

This is also consistent with the README decode finding that the dense path was
already ~96-97% of vLLM - the dense GEMM was never the bottleneck the way the
prefill ground-truth (measured on the MoE decision model) implied.

## 4. Status of the phases

- **Phase 1 (dense): REJECTED on GB10**, landed default-off as a validated,
  env-gated scaffold (mechanism + bit-exact gate reusable by option (b) and by
  non-GB10 hardware where bf16 may fare differently).
- **Phase 2 (MoE grouped large-M): NOT implemented.** It inherits the same
  bf16-peak < FP4-peak ceiling plus a per-expert dequant, so a grouped
  bf16-cuBLAS would regress for the same reason; the MoE id-path also has the
  graph-safety catch (a false `should_use_mmq` falls to the host-sync sorted
  loop, not CUDA-graph-safe). Not worth the multi-day grouped-cuBLAS + graph
  work on a path the dense A/B already shows loses.
- **The only route to a real prefill GEMM win is option (b)** - a native
  Blackwell FP4-MMA large-M kernel (multi-week), to greenlight only if the
  prefill regime is funded. The committed scaffold gives option (b) its
  M-threshold routing and its bit-exact gate for free.