mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 17:37:07 -04:00
docs(paged): MoE-vs-vLLM DECIDE synthesis - reject W4A16 Marlin, the GEMM is a llama win
Cross-agent synthesis on top of the both-engine nsys decomposition (3b5957157):
settle the user's "can we do what vLLM does on MoE?" question with the three
converging investigations (groundtruth measurement + vllm-marlin source-read +
marlin-port feasibility).
Verdict: vLLM's ~15% MoE-decode lead is NOT the Marlin GEMM (that bucket is a
-1.7 ms llama WIN: native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 at the ragged
tiny-M decode shape, both at the LPDDR5x BW floor). The gap is bf16
dense-projection bandwidth (+6.5), recurrence state-gather plumbing (+6.6, led
by k_get_rows 5.2), graph/stream-overlap overhead (~+7), W4A4 act-quant tax
(+3.3), and router/glue (+5.4).
A W4A16/Marlin grouped MoE GEMM is REJECTED (default and opt-in): it would
regress the 27% GEMM bucket to half-rate bf16 MMA, re-enter the GB10 occupancy
wall the dense scaffold already STOPPED at, and its entire intrinsic upside is
the ~2% act-quant tax - smaller than the bit-exact +1.9% the 0025 re-graph
already banked, and closeable bit-exactly by fusing the act-quant.
Recommended build (none a new MoE GEMM): (1) fuse the k_get_rows SSM-state
gather (bit-exact, ~+5, biggest single-kernel win); (2) extend CUDA-graph
coverage + stream overlap (bit-exact, ~+7); (3) fuse the W4A4 act-quant into
RMSNorm/SiLU (bit-exact, +3.3); (4) NVFP4-quantize the still-bf16 GDN/attn
projections + lm_head (bit-changing, +6.5, the same NVFP4-dense-quant move vLLM
makes). Bit-exact levers alone reach ~94% of vLLM; with the projection quant
~96-97%, parity-or-better physically in reach since both heaviest kernels
(SSM core, MoE GEMM) are already llama wins.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -117,4 +117,119 @@ The MoE grouped GEMM (the brief's hypothesis) is a **-1.7 ms llama win**, so it
|
||||
**(c) fuller CUDA-graph coverage + stream overlap** (~+7, bit-exact); **(d) fuse the W4A4 act-quant
|
||||
into the preceding op** (+3.3, bit-exact). None of these is a new MoE GEMM.
|
||||
|
||||
---
|
||||
|
||||
# FINAL DECISION (cross-agent synthesis) - "can we do what vLLM does on MoE?"
|
||||
|
||||
Three agents converged on the same verdict from independent angles: `moe-gap-groundtruth`
|
||||
(the measured both-engine nsys decomposition above), `vllm-marlin-study` (source-read of vLLM's
|
||||
`moe_wna16_marlin_gemm` / `moe_align_block_size` / `prepare_nvfp4_moe_layer_for_marlin` on the DGX),
|
||||
and `marlin-port-feasibility` (read-only assessment of the dense W4A16 scaffold + prior STOP). All
|
||||
three agree, and the measurement is the arbiter. Below is the decision the user asked for.
|
||||
|
||||
## (1) WHERE the 15% lives - decisive
|
||||
|
||||
The gap is **27.8 ms/step (llama at 83.6% of vLLM)** and it is **NOT one kernel - it is a sum of small
|
||||
deltas, and the MoE grouped GEMM is on llama's side of the ledger.** Ranked:
|
||||
|
||||
| rank | lever | Δ ms/step | bit-exact? | this is... |
|
||||
|-----:|-------|----------:|:----------:|------------|
|
||||
| 1 | Graph coverage + cross-stream overlap | ~+7.0 | **yes** | scheduler/runtime (idle +3.8, overlap +3.2) |
|
||||
| 2 | Recurrence state-gather/conv plumbing (`k_get_rows_float` 5.2 + conv) | +6.6 | **yes** | llama-only kernels; vLLM updates state in-place |
|
||||
| 3 | Dense GDN/attn projections + lm_head (bf16 vs NVFP4) + convert glue | +6.5 | **no** | the NVFP4-dense-quant lever, on the projections |
|
||||
| 4 | Router GEMM + norms/combine/memcpy glue | +5.4 | mostly yes | llama router = full FP4 GEMM; vLLM fuses topk/align |
|
||||
| 5 | W4A4 act-quant tax (`quantize_mmq_nvfp4`) | +3.3 | **yes** | vLLM's W4A16 makes this structurally 0 |
|
||||
| - | **MoE-expert grouped GEMM** | **-1.7** | - | **llama WIN** - native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 |
|
||||
|
||||
**The Marlin GEMM is explicitly ruled out as the source of the gap.** Both engines read the same ~22 GB
|
||||
of ~4-bit expert weights once per step and are LPDDR5x-bandwidth-bound; on that weight stream they tie,
|
||||
and llama's 2x-rate FP4-MMA edges Marlin's half-rate bf16 MMA. It is **not the projections-vs-Marlin
|
||||
distinction in the experts, it is the projections in the DENSE path, the recurrence plumbing, and the
|
||||
runtime/graph** that cost llama the 15%. Not distributed, not the expert GEMM, not routing alone.
|
||||
|
||||
## (2) Can llama MATCH it - and HOW
|
||||
|
||||
**Yes - to within a few percent, and NOT with a Marlin/W4A16 MoE GEMM.** The two biggest *compute*
|
||||
kernels (the gated-DeltaNet SSM core 70.0 vs 71.1, and the MoE grouped GEMM 47.3 vs 50.0) are **already
|
||||
llama wins.** The gap is overhead/scheduling/precision-of-the-other-tensors, all of which llama can
|
||||
attack on its existing W4A4 FP4-MMA expert path. The four levers, in recommended build order:
|
||||
|
||||
| order | build | gain | bit-exact / gate | effort |
|
||||
|------:|-------|-----:|------------------|--------|
|
||||
| 1st | **Fuse away the recurrent-state gather `k_get_rows_float`** (update SSM state in-place in the GDN decode path, fold `ssm_conv_update`) | ~+5 ms (~3% of step) - biggest single-kernel win | **bit-exact** (no md5 rebaseline) | medium - CUDA, the GDN decode kernel |
|
||||
| 2nd | **Fuller CUDA-graph coverage + stream overlap** (extend the 0025 re-graph to the remaining MoE/projection nodes, overlap independent streams) | ~+7 ms combined; 0025 already banked ~+1.9% | **bit-exact** | medium - scheduler, partly done |
|
||||
| 3rd | **NVFP4-quantize the still-bf16 GDN/attn projections + lm_head** (the same move vLLM makes on its dense path; 4-bit weight read ~4x less BW, kills the 2.9 ms bf16<->f32 convert) | ~+6.5 ms - biggest *bucket* | **bit-changing** (re-baselines md5 gates; precision-UPGRADE, see below) | medium-high - new NVFP4 weight path for non-expert linears |
|
||||
| 4th | **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** (as vLLM fuses act-quant) | +3.3 ms | **bit-exact** | low-medium |
|
||||
|
||||
**Reach:** the three bit-exact levers (1+2+4 ~= +15.3 ms) alone close the gap to ~154.5 ms/step
|
||||
=> ~830 t/s = **~94% of vLLM, with zero precision change and zero md5 rebaseline.** Adding the
|
||||
NVFP4-projection lever (3, +6.5) reaches ~148 ms => ~865 t/s = **~96-97% of vLLM**, with the residual
|
||||
being router/glue and the irreducible cross-stream-overlap that is structural to how ggml schedules
|
||||
host-launched nodes vs vLLM's single fused graph. Because llama's two heaviest kernels are already
|
||||
ahead, **parity-or-better is physically reachable** once the plumbing/overhead is removed; vLLM has no
|
||||
arithmetic advantage on this hardware (its W4A16 is half-rate FP4 - it only wins on overhead and on the
|
||||
dense-path weight-read BW).
|
||||
|
||||
## (3) The leading lever, in full - and the Marlin question, settled
|
||||
|
||||
**The user's specific hypothesis - "do what vLLM does = a Marlin-style W4A16 grouped MoE GEMM" - is
|
||||
REJECTED, by measurement and by feasibility.**
|
||||
|
||||
- **It is not where the gap is.** The MoE GEMM is a **-1.7 ms llama win.** A W4A16 Marlin MoE GEMM would
|
||||
make that bucket SLOWER (half-rate bf16 MMA on the ~27% GEMM bucket), not faster.
|
||||
- **Its entire intrinsic upside is the ~2% act-quant tax** (W4A16 has no activation quantize). That
|
||||
+2% ceiling is **smaller than the +1.9% the bit-exact 0025 re-graph already banked**, at vastly higher
|
||||
effort and with a precision change. And the act-quant tax is independently closeable bit-exactly by
|
||||
lever 4 (fuse it into the preceding op) without touching the GEMM.
|
||||
- **The scaffold does not help.** `paged/kernel/w4a16/marlin-w4a16.cu` is dense-only, Q4_0/Q4_K, with no
|
||||
grouped/MUL_MAT_ID path and no NVFP4 dequant. A real MoE Marlin is effectively a from-scratch port of
|
||||
`moe_wna16_marlin_gemm` (per-expert M-tiles, block-padded `moe_align` token-sort, stream-K over ragged
|
||||
segments, NVFP4->bf16 in-kernel dequant). vLLM only reaches the BW floor via cutlass-SM120 TMA +
|
||||
warp-specialized pipelining; the GB10 occupancy-only route the dense scaffold tried **plateaued at
|
||||
~9 TFLOPS / 178 t/s (~5x under MMQ)** and STOPPED at the occupancy wall (XOR-swizzle + deep cp.async
|
||||
collapse GB10 occupancy). Realistic outcome of an MoE port: **a net REGRESSION** on the 27% GEMM
|
||||
bucket. Multi-week, high-risk, DGX-only, no `ncu`, for a +2% ceiling. **Do not build it.**
|
||||
|
||||
**Why vLLM runs W4A16 at all:** not because it is better - because sm_121 (consumer Blackwell / GB10)
|
||||
has no working cutlass FP4 MoE cubins (vLLM whitelists only sm_100/103 datacenter Blackwell for native
|
||||
FP4 MoE; the engine literally warns it is falling back to "Weight-only FP4 ... Marlin kernel"). On GB10,
|
||||
W4A16 is HALF the FP4-MMA rate. **llama's native W4A4 FP4-MMA is the higher hardware tier; matching vLLM
|
||||
does NOT mean copying its W4A16 fallback.**
|
||||
|
||||
**Precision / gate (the brief's key nuance, assessed honestly):** the observation that W4A16 (bf16 acts)
|
||||
is a strict activation-precision UPGRADE over W4A4 (FP4 acts), with better KL-to-f32, is **correct but
|
||||
unmonetizable here.** (a) The current W4A4 MoE default is **already bit-exact to the f32 reference**
|
||||
(test-backend-ops MUL_MAT_ID 806/806, greedy md5 stable on both models) - you get no quality credit for
|
||||
being more precise than a default that already passes, and the precision-sensitive site is the
|
||||
gated-DeltaNet SSM *state* (a different op, addressed by the separate 0026 bf16-SSM opt-in), not the MoE
|
||||
GEMM. (b) W4A16 is **non-bit-exact vs the W4A4 default, so adopting it re-baselines every shipped md5
|
||||
gate** - a real cost for a +2% throughput ceiling that is itself likely negative. So the precision angle
|
||||
does not flip the verdict: it would be a precision upgrade nobody needs, bought with a slower,
|
||||
occupancy-hostile, gate-rebaselining kernel. The one genuinely precision-positive AND throughput-positive
|
||||
move that quantizes weights is **lever 3 (NVFP4 projections)** - and that is W4A16 on the DENSE linears
|
||||
(where it cuts weight-read BW), not on the experts.
|
||||
|
||||
## (4) HONEST VERDICT + recommended build
|
||||
|
||||
**VERDICT: We can essentially match vLLM on MoE decode (~94% bit-exact, ~96-97% with the projection
|
||||
quant, parity-or-better physically in reach), but NOT by doing "what vLLM does" in the sense the question
|
||||
implies. A Marlin/W4A16 grouped MoE GEMM is the wrong lever - the MoE GEMM is already a llama win and a
|
||||
W4A16 port would regress it. The 15% is bf16 dense-projection bandwidth + recurrence-gather plumbing +
|
||||
graph/overlap overhead + a 2% act-quant tax + router glue. Every piece is closeable on llama's existing
|
||||
native-FP4 expert path, mostly bit-exactly.**
|
||||
|
||||
**Recommended build (ship order, none of it a new MoE GEMM):**
|
||||
1. **`k_get_rows` SSM-state-gather fusion** - bit-exact, ~+5 ms, biggest single-kernel win, no rebaseline. **Do first.**
|
||||
2. **Extend CUDA-graph coverage + stream overlap** beyond 0025 - bit-exact, ~+7 ms combined, partly banked.
|
||||
3. **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** - bit-exact, +3.3 ms, erases the act-quant tax (the only thing W4A16 would have bought) without W4A16.
|
||||
4. **NVFP4-quantize the bf16 GDN/attn projections + lm_head** - +6.5 ms (biggest bucket), bit-changing
|
||||
(re-gate md5; precision-UPGRADE, the same NVFP4-dense-quant move vLLM makes). Ship as default after
|
||||
re-gating, or as an opt-in if the md5 rebaseline is undesirable.
|
||||
|
||||
**Do NOT build:** the W4A16/Marlin grouped MoE GEMM (`paged/kernel/w4a16/` scaffold is dense-only and not
|
||||
reusable). Neither default nor opt-in: +2% ceiling < the already-banked bit-exact +1.9%, likely a net
|
||||
regression on the 27% GEMM bucket, multi-week high-risk, and it rebaselines every gate. The dense
|
||||
`w4a16-marlin` STOP transfers to MoE, and MORE strongly (the tiny-M decode shape is purely BW-bound, so
|
||||
the FP4-vs-bf16 tier is a wash that the weight-read floor erases - leaving only the half-rate downside).
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
Reference in New Issue
Block a user