Conclude the MoE-parity hunt. The two remaining sub-levers in the
20.3-vs-13.8 ms projection bucket are either bit-changing or at the BW floor:
- convert-glue (3.24 ms/step, measured: 1.73 input f32->bf16 + 1.52 output
bf16->f32): NOT bit-exact eliminable. ggml-cuda.cu:1663-1690 rounds the f32
GEMM accumulator to bf16 (CUDA_R_16BF dst) then widens to f32; that
bf16-rounded value is load-bearing for the shipped md5. Every way of removing
the round-trip (f32-direct output, bf16 residual stream, or NVFP4 weights)
rebaselines the md5. A precision boundary, like lever 4.
- bf16 projection GEMM (17.27 ms/step): BW-bound at the LPDDR5x floor
(~4.7 GB/step at 273 GB/s; M=128 -> 128 FLOP/byte vs >900 ridge). nvjet
already TMA-streams the weights; cutlass reads the same bytes. No kernel
lever; only fewer bytes (quantize) helps - rejected on quality.
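Illustrative sketch of the two bullets above, not the ggml-cuda code path.
It emulates the f32 -> bf16 -> f32 round-trip in plain Python (assuming
round-to-nearest-even and finite inputs) and redoes the BW-floor arithmetic.
The helper names and the K=N=4096 shape are hypothetical; the 4.7 GB/step,
273 GB/s, M=128 and >900 ridge figures are the ones quoted above.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Round a finite f32 value to bf16 (nearest-even); return the 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Half of the dropped low 16 bits, plus the kept lsb to break ties to even.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF

def bf16_bits_to_f32(b: int) -> float:
    """Widen bf16 bits back to f32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

def round_trip(x: float) -> float:
    """f32 accumulator -> bf16 dst -> f32, as in the convert-glue bullet."""
    return bf16_bits_to_f32(f32_to_bf16_bits(x))

def stream_time_ms(bytes_per_step: float, bw_bytes_per_s: float) -> float:
    """Lower bound on step time when every weight byte is read once."""
    return bytes_per_step / bw_bytes_per_s * 1e3

def intensity_flop_per_byte(m: int, k: int, n: int) -> float:
    """bf16 GEMM intensity counting weight bytes only (assumption:
    activation traffic is negligible next to the K*N weight matrix)."""
    flops = 2 * m * k * n        # one multiply + one add per MAC
    weight_bytes = 2 * k * n     # 2 bytes per bf16 weight
    return flops / weight_bytes

acc = 1.0 + 2.0 ** -10           # exactly representable in f32, not in bf16
print(acc, round_trip(acc))      # the two values differ: rounding is visible
print(stream_time_ms(4.7e9, 273e9))
print(intensity_flop_per_byte(128, 4096, 4096))
```

Any output path that skips the rounding step yields the unrounded value,
which is why it cannot reproduce the shipped md5. The streaming bound lands
next to the measured 17.27 ms/step, and an intensity of 128 FLOP/byte sits
well below the >900 ridge, so the GEMM stays BW-bound.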
Corrects the body premise that vLLM runs these projections as NVFP4-Marlin:
vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the
projection bucket is a matched-precision gap, not a quant gap.
Realistic bit-exact MoE ceiling ~86-88% of vLLM; shipped lever 1 (86.3%) is
at it. No one-more-lever for MoE. Only clean win left is DENSE (+0.41% lever 5),
gated behind resolving the paged-MoE baseline md5 drift.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>