mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
docs(paged): residual-assess FINAL - MoE at bit-exact ceiling, hunt DONE
Conclude the MoE-parity hunt. The two remaining sub-levers in the 20.3-vs-13.8 ms projection bucket are both bit-changing or at the BW floor: - convert-glue (3.24 ms/step, measured: 1.73 input f32->bf16 + 1.52 output bf16->f32): NOT bit-exact eliminable. ggml-cuda.cu:1663-1690 rounds the f32 GEMM accumulator to bf16 (CUDA_R_16BF dst) then widens to f32; that bf16-rounded value is load-bearing for the shipped md5. Removing the round-trip (f32-direct output, bf16 residual stream, or NVFP4 weights) all rebaseline md5. A precision boundary, like lever 4. - bf16 projection GEMM (17.27 ms/step): BW-bound at the LPDDR5x floor (~4.7 GB/step at 273 GB/s; M=128 -> 128 FLOP/byte vs >900 ridge). nvjet already TMA-streams the weights; cutlass reads the same bytes. No kernel lever; only fewer bytes (quantize) helps - rejected on quality. Corrects the body premise that vLLM runs these projections as NVFP4-Marlin: vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the projection bucket is a matched-precision gap, not a quant gap. Realistic bit-exact MoE ceiling ~86-88% of vLLM; shipped lever 1 (86.3%) is at it. No one-more-lever for MoE. Only clean win left is DENSE (+0.41% lever 5), gated behind resolving the paged-MoE baseline md5 drift. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -1,5 +1,11 @@
|
||||
# MOE_GAP_VS_VLLM.md - ground-truth both-engine MoE decode decomposition (where vLLM's ~15% lives)
|
||||
|
||||
> **READ THE FINAL SECTION FIRST ("RESIDUAL-ASSESS (FINAL)" at the bottom).** It concludes the hunt and
|
||||
> CORRECTS one premise used throughout the body below: this doc assumes vLLM runs the GDN/attn projections
|
||||
> as NVFP4-Marlin. It does NOT. vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the
|
||||
> projection bucket is a matched-precision (bf16) gap, not a quant gap. Lever 4 (NVFP4 the projections) is
|
||||
> REJECTED (+6% PPL, and not even a vLLM gap). The MoE is at its bit-exact ceiling (~86-88% of vLLM).
|
||||
|
||||
THE GPU AGENT (label `moe-gap-groundtruth`), DGX GB10 (sm_121). First **side-by-side, both-engine,
|
||||
per-kernel ms/step** decomposition of the MoE decode gap. All prior B work decomposed llama ONLY; this
|
||||
profiles vLLM's decode step too and computes the per-bucket `llama - vLLM` delta to pinpoint the gap.
|
||||
@@ -360,3 +366,119 @@ the dense GGUF). Highest-ROI remaining MoE lever. **Do first among remaining MoE
|
||||
non-bit-exact recurrence-plumbing or the rejected W4A16/Marlin GEMM.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
> **SUPERSEDED:** the lever-4 scope above was optimistic and PRE-GATE. The L4 KL gate FAILED
|
||||
> (+6.15-6.51% PPL, see `LEVER4_PROJNVFP4_RESULTS.md`) and the premise was wrong (vLLM keeps these
|
||||
> projections BF16 too). Lever 4 is REJECTED - do NOT ship. See the FINAL section below.
|
||||
|
||||
---
|
||||
|
||||
# RESIDUAL-ASSESS (FINAL, concludes the hunt) - convert-glue + bf16-GEMM verdicts, the bit-exact MoE ceiling
|
||||
|
||||
Label `residual-assess`, DGX GB10 (sm_121). After lever 1 shipped (0028, MoE 86.3% of vLLM @npl128,
|
||||
bit-exact), levers 2+3 flat, lever 4 REJECTED (KL-gate FAIL, AND vLLM keeps the same projections bf16),
|
||||
and lever 5 flat for MoE (host-side, off the compute-bound critical path; dense gets +0.41%), this is the
|
||||
final honest assessment of the two remaining sub-levers inside the 20.3-vs-13.8 ms projection bucket.
|
||||
Both are **bit-CHANGING or at-the-BW-floor.** The hunt is DONE.
|
||||
|
||||
## CORRECTION that reframes the projection bucket
|
||||
|
||||
The body above assumed **vLLM runs the GDN/attn projections as NVFP4-Marlin.** FALSE (confirmed by the L4
|
||||
gate). vLLM runs the **same nvidia-modelopt checkpoint** as the GGUF, which keeps `in_proj_qkvz`,
|
||||
`in_proj_ba`, `out_proj`, `attn_gate`, and full-attn `attn_q/k/v/output` in **BF16**. llama and vLLM run
|
||||
these projections at the **same precision (bf16).** The +6.5 ms projection-bucket delta is therefore NOT
|
||||
a precision/quant gap - it is (a) llama's f32-residual-stream convert tax and (b) bf16-GEMM kernel /
|
||||
round-trip efficiency, both at matched bf16 precision.
|
||||
|
||||
## (1) convert-glue verdict (3.24 ms/step measured): NOT bit-exact eliminable
|
||||
|
||||
Empirical split (`moe_dec` nsys, per-step over 43 decode steps):
|
||||
- `convert_unary<float,bf16>` (input, f32 act -> bf16): **1.73 ms/step**, 186 calls/step
|
||||
- `convert_unary<bf16,float>` (output, bf16 -> f32): **1.52 ms/step**, 186 calls/step (equal count = every
|
||||
bf16 projection round-trips)
|
||||
|
||||
Source root cause (`ggml/src/ggml-cuda/ggml-cuda.cu:1663-1690`, the `src0->type == BF16` cuBLAS path):
|
||||
ggml converts f32 activations to bf16, runs `cublasGemmEx` bf16xbf16 with **CUBLAS_COMPUTE_32F** but
|
||||
writes the result to a **bf16** buffer (`dst_bf16`, `CUDA_R_16BF`), then widens bf16 -> f32. The f32
|
||||
accumulator is **rounded to bf16 and then widened back** - it drops ~15 mantissa bits, and that
|
||||
bf16-rounded value feeds the f32 residual stream.
|
||||
|
||||
- The **output round-trip is load-bearing for the shipped numerics.** The fp16-fp32-compute path 40 lines
|
||||
down (`:1729`, `dst CUDA_R_32F`) proves cuBLAS CAN write the f32 accumulator directly - so the bf16
|
||||
output write+convert is a removable ggml inefficiency. BUT removing it (f32-direct output) changes the
|
||||
value from "bf16-rounded" to "full-f32" => greedy md5 (`07db32c2`) re-baselines. It is a **precision
|
||||
boundary (an upgrade), exactly like lever 4.** NOT bit-exact.
|
||||
- The **input convert is intrinsic** to a bf16 GEMM (cuBLAS needs bf16 inputs; ggml's residual stream is
|
||||
f32). The only bit-exact move is to fuse the f32->bf16 cast into the producing op's epilogue (same RNE
|
||||
rounding, one fewer launch) - but that is per-site ggml graph surgery for a sub-1.7 ms launch ceiling,
|
||||
and it is **subsumed by the (rejected) lever-4 move**: NVFP4-quantizing the weights routes the
|
||||
projection to `mul_mat_q<NVFP4>` (W4A4) and deletes the entire bf16 cuBLAS path - input convert, GEMM,
|
||||
output convert - in one shot.
|
||||
- vLLM pays ~0 here because it runs an **end-to-end bf16 residual stream** (no f32 intermediate). Matching
|
||||
that = converting llama's residual stream to bf16 = a global precision change, md5 rebaseline. Also not
|
||||
bit-exact.
|
||||
|
||||
**Verdict: bit-exact-eliminable = NO.** The f32<->bf16 round-trip is load-bearing for the current md5 (the
|
||||
bf16-rounded output IS the shipped value). Every way to remove it (f32-direct GEMM output, bf16 residual
|
||||
stream, or NVFP4 weights) is bit-changing. The one bit-exact sliver (fuse the input cast into the
|
||||
producer) is ~1.7 ms ceiling, high per-site effort, and redundant with lever 4. (Aside: the f32-direct
|
||||
GEMM output is a genuine upstreamable ggml win - faster AND more precise - but it rebaselines md5, so it
|
||||
is off the bit-exact table for this hunt.)
|
||||
|
||||
## (2) bf16 projection GEMM verdict (17.27 ms/step measured): BW-bound at the floor, no kernel lever
|
||||
|
||||
Per-step bf16-projection GEMM (nvjet cuBLASLt + cutlass bf16, `moe_dec` nsys): **17.27 ms/step, 225
|
||||
calls/step.** Roofline at the M=128 decode shape:
|
||||
- Arithmetic intensity ~= 2*M FLOP / 2 bytes-per-weight = **M = 128 FLOP/byte** (the weight read
|
||||
dominates; activations/output negligible at M=128).
|
||||
- GB10: LPDDR5x unified BW ~= **273 GB/s**; bf16 tensor-core peak >= ~250 TFLOPS => ridge point ~=
|
||||
250e12 / 273e9 ~= **>900 FLOP/byte.** 128 << 900 => **memory-bandwidth-bound by ~7x.**
|
||||
- Achieved: 17.27 ms at 273 GB/s = **~4.7 GB of bf16 projection weights streamed per step** - i.e. the
|
||||
GEMM moves the weight bytes at ~full LPDDR5x bandwidth. **It is at the BW floor.**
|
||||
|
||||
The nvjet kernels are `tmaAB` (TMA-streamed on both operands) - the optimal Blackwell weight-streaming
|
||||
access pattern; vLLM's cutlass does the same and reads the **same bf16 bytes.** A cutlass swap cannot beat
|
||||
the byte floor. The only way faster is **fewer weight bytes = quantize** (lever 4, ~4x fewer bytes) -
|
||||
bit-changing AND rejected on quality (+6% PPL) AND not even a vLLM-parity gap. The residual ~3.5 ms of the
|
||||
llama-vs-vLLM GEMM-bucket delta traces to llama's extra `dst_bf16` write+read round-trip traffic (the
|
||||
convert glue of verdict 1), not a worse GEMM kernel.
|
||||
|
||||
**Verdict: at the bandwidth floor; no bit-exact (nor even same-precision) kernel lever exists.** nvjet
|
||||
already streams the weights near-optimally.
|
||||
|
||||
## (3) The bit-exact MoE ceiling, and the irreducible residual
|
||||
|
||||
| MoE lever | status | bit-exact? | MoE gain |
|
||||
|-----------|--------|:----------:|----------|
|
||||
| 1 - recurrent-state gather fusion (0028) | **SHIPPED** | yes | banked -> 86.3% of vLLM |
|
||||
| 2 - graph coverage / overlap | flat | yes | ~0 |
|
||||
| 3 - act-quant fusion | flat | yes | ~0 |
|
||||
| 5 - block-table within-step cache | flat for MoE | yes | ~0 (host off compute-bound path; dense +0.41%) |
|
||||
| 4 - NVFP4 projections | REJECTED | no | +6% PPL, not a vLLM gap |
|
||||
| convert-glue elimination | this assess | **no** (precision boundary) | bit-changing only |
|
||||
| bf16-GEMM kernel | this assess | **no** (BW floor) | none |
|
||||
|
||||
**Realistic bit-exact MoE ceiling = ~86-88% of vLLM @npl128. The shipped state (lever 1, 86.3%) is
|
||||
essentially AT it.** Lever 5 adds nothing to MoE. No clean bit-exact MoE lever remains.
|
||||
|
||||
**The irreducible ~12-14% residual to vLLM is structural, not a missing optimization:**
|
||||
1. **f32-residual-stream convert tax (~3.2 ms/step)** - ggml runs an f32 graph and casts per bf16
|
||||
projection; vLLM runs bf16 end-to-end. Removing it is a precision change.
|
||||
2. **bf16-GEMM BW floor + round-trip traffic (~3.5 ms/step)** - both engines at the LPDDR5x byte floor on
|
||||
bf16 weights; the delta is the round-trip traffic (= item 1, bit-changing).
|
||||
3. **Recurrence-plumbing remainder** - mostly banked by lever 1; the core SSM kernel is already a llama
|
||||
win.
|
||||
4. **Between-replay host loop + graph/overlap bubble** - sampling needs logits between graph replays;
|
||||
irreducible at this batch shape.
|
||||
|
||||
## CONCLUSION: the MoE-parity hunt is DONE
|
||||
|
||||
The MoE is at its bit-exact ceiling. The two heaviest MoE compute kernels (the gated-DeltaNet SSM core and
|
||||
the NVFP4 expert grouped GEMM) are **already llama wins**, so there is no arithmetic gap to close. The
|
||||
remaining 12-14% is the f32-vs-bf16 graph-precision tax, the bf16-weight BW floor, and the irreducible
|
||||
host loop - none of which is a clean bit-exact lever, and the one bit-changing option (quantize the
|
||||
projections) is rejected on quality and is not even a vLLM-parity gap. **No one-more-lever for MoE.** The
|
||||
only clean win left in the whole track is DENSE (+0.41% from lever 5), gated behind first resolving the
|
||||
pre-existing paged-MoE baseline md5 drift (paged `8cb0ce23` vs canonical `07db32c2`) the L5 finish flagged.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
Reference in New Issue
Block a user