LocalAI/backend/cpp/llama-cpp/patches/paged
Ettore Di Giacinto 39e16cc2c4 docs(paged): adversarial review of track-B FP4-GEMM parity go/no-go
Append section 9 (skeptical staff-CUDA-engineer review) to FP4_GEMM_SCOPE_B.md,
stress-testing the dense/MoE parity verdict against the committed grounding.

Key findings:
- Not the W4A16 wall: the npl-sweep (dense 99/56/46/41% of vLLM at npl 8/32/64/128)
  shows llama's FP4-MMA kernel HITS the weight-read floor at M=8 and FALLS OFF it as
  M grows, while vLLM HOLDS it. So this is a tune of a working path backed by two
  existence proofs (llama at M=8, vLLM at M=128), not a greenfield build. The binding
  constraint is still the same as W4A16: hide LPDDR5x latency at the larger tile on
  an occupancy-dominated part.
- The dense gap is ~82-87% GEMM and ~13-18% non-GEMM (467 ms total = 383-405 GEMM +
  62-84 non-GEMM). B alone caps at ~80%; track A is what tips dense over the parity line.
- Sharpest omission: vLLM reaches its M=128 floor via cutlass TMA + a deep pipeline,
  the very technique the doc forbids on GB10. TMA != manual cp.async (TMA has the lower
  occupancy cost); it must be an in-scope P2 fallback, not categorically banned.
- Honest landing: dense ~80-90% (parity is the optimistic tail, contingent on B+A+floor),
  MoE ~55-65% (parity is not reachable from B). Low-regret: even a tripped P2 kill-gate
  lands B+A at ~89%, more than doubling today's 41%.
- Sequencing fix: land A first (defines B's interface + baseline + kill-gate), then
  run B's P2 against the post-A number.
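
The percentage split and the "doubling" claim in the findings above can be re-derived
from the quoted figures. A minimal sketch, assuming the 467 ms is the total being split
and that the GEMM/non-GEMM ranges pair up low-with-high so each pair sums to it (the
names below are illustrative, not from the repo):

```python
# Figures quoted in the key findings (ms for the split, % of vLLM for the ratio).
TOTAL_MS = 467
GEMM_MS = (383, 405)        # low and high GEMM estimates
NON_GEMM_MS = (84, 62)      # assumed pairing: each pair sums to TOTAL_MS
TODAY_PCT = 41              # dense, % of vLLM at npl 128 today
FALLBACK_PCT = 89           # quoted landing if the P2 kill-gate trips


def gap_shares(total_ms, gemm_ms, non_gemm_ms):
    """Return (gemm_pct, non_gemm_pct) for one estimate of the split."""
    if gemm_ms + non_gemm_ms != total_ms:
        raise ValueError("GEMM and non-GEMM parts must sum to the total")
    return 100.0 * gemm_ms / total_ms, 100.0 * non_gemm_ms / total_ms


shares = [gap_shares(TOTAL_MS, g, n) for g, n in zip(GEMM_MS, NON_GEMM_MS)]
for gemm_ms, (gemm_pct, non_gemm_pct) in zip(GEMM_MS, shares):
    print(f"GEMM {gemm_ms} ms -> {gemm_pct:.1f}% GEMM, {non_gemm_pct:.1f}% non-GEMM")

ratio = FALLBACK_PCT / TODAY_PCT
print(f"fallback vs today: {ratio:.2f}x")
```

Under those assumptions the split comes out at roughly 82-87% GEMM and 13-18% non-GEMM,
and the fallback landing is a little over 2x today's number.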

Verdict: DENSE conditional GO (scope as GEMM-gap-closing, not true parity; A-first,
gate at P2, add TMA); MoE NO-GO for parity from B (take the cheap mmq_x-down win as a
1.7-1.85x gain, not parity).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 14:31:35 +00:00