From 3cf7fa1715d2c69a9be6f4e824041cd38d7a02f3 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Tue, 30 Jun 2026 23:41:38 +0000
Subject: [PATCH] docs(paged): reject swiglu down fusion candidate

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        | 42 +++++++++++++++
 .../docs/VLLM_PARITY_LEVER_MAP.md             |  9 +++-
 .../plans/2026-06-30-serving-source-phase7.md | 54 +++++++++++++++++--
 3 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index ea26edf29..cf48a35f5 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -578,3 +578,45 @@ Fresh DGX gates from `/home/mudler/bench/phase7_source_scope/`:
 
 The new gate covers the merged MoE gate_up -> SWIGLU -> down-projection graph
 shape needed before attempting a batched NVFP4 down-input quantization fusion.
+
+## Phase 7 SWIGLU-Down Fusion Candidate Rejected
+
+Attempted candidate: fuse `GGML_OP_GLU(SWIGLU)` into the NVFP4 activation
+quantization feeding the MoE down-projection `MUL_MAT_ID`, while keeping the
+existing grouped-MMQ kernel. The patch was kept behind
+`GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1` during validation.
+
+DGX artifacts:
+
+- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_swiglu_down_optin.txt`
+- `/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_after_optin.txt`
+- `/home/mudler/bench/phase7_source_scope/default_gates_after_optin/`
+- `/home/mudler/bench/phase7_source_scope/optin_gates/`
+- `/home/mudler/bench/phase7_source_scope/serving_ab/`
+
+Correctness and inference gates:
+
+- Forced fusion `MOE_SWIGLU_DOWN`: `7/7`.
+- Broad default `MUL_MAT_ID`: `806/806`.
+- Default md5 after opt-in gating stayed canonical:
+  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
+- Opt-in fusion md5:
+  - MoE `07db32c2bcb78d17a43ed18bc22705cd`.
+  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
+
+Serving A/B (`n=128`, `ptok=128`, `gen=64`, `/v1/completions`, `--no-cache`):
+
+| path | decode tok/s/seq | decode agg tok/s | prefill tok/s | verdict |
+|------|------------------|------------------|---------------|---------|
+| default | 3.92 | 657.1 | 1456.0 | baseline |
+| `GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1` | 3.88 | 667.4 | 1462.9 | reject; md5 drift and flat A/B |
+
+Result:
+
+- Rejected as a production patch. The opt-in path changes the paged-MoE md5
+  into the non-paged namespace and does not materially improve serving.
+- Root-cause note for future attempts: the first fused-op gate failed because
+  the fused quantizer used compact GLU-output strides to read split `gate`/`up`
+  views. Split views stride over the merged gate/up tensor; using source-view
+  strides fixed the op gate but not the end-to-end md5 drift.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index f921dd294..b7fbc8eb0 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -240,6 +240,14 @@ Organized by where the verified gap actually is. For each: mechanism / expected
 - **Gate:** bit-exact if SiLU + accumulation order preserved → greedy md5 (else KL-gate).
 - **Risk:** HIGH (fused FP4 FFN kernel is complex; register pressure on sm_121a).
 - **Effort/reward: HIGH / MED-HIGH.** Strong but expensive; sequence after A1.
+- **Phase 7 shortcut rejected:** fusing only SWIGLU into the NVFP4
+  down-input quantization while reusing grouped-MMQ passed the focused op gate
+  (`MOE_SWIGLU_DOWN 7/7`) but changed paged-MoE md5 under opt-in
+  (`07db32c2...` vs canonical `8cb0ce23...`) and was flat in serving A/B
+  (`decode_agg_tps 657.1 → 667.4`, `decode_perseq_tps 3.92 → 3.88`).
+  Do not retry that partial fusion without a KL gate and a stronger profile
+  bucket. A real A4 remains a different, larger register/shared-resident FFN
+  kernel.
 
 ### A5. Activation-quant fusion into the 0042 residual/RMSNorm epilogue (prefill)
 - **Mechanism:** the README's "act-quant fusion FLAT" verdict was *decode-only*. For prefill the W4A4 activation-quantize pass is a bigger tensor. 0042 already fuses residual-add+RMSNorm+mul; extend its epilogue to emit the FP4-quantized activation the next GEMM consumes, removing a dedicated act-quant read+write.
@@ -490,4 +498,3 @@ Two profile surprises that reshape the directions: (a) vLLM on sm_121 is NOT nat
 Cross-cutting: the prefill levers (#101 GDN, D2 MoE GEMM) double as serving-decode levers because continuous batching interleaves ~25-55% prefill work into the serving step. GDN edges MoE-GEMM as the top prefill pick (bigger gap, cleaner math mechanism, 2.6x proven headroom, lower in-backend risk, dual payoff). All numbers from the both-engine nsys profile (cuda_gpu_kern_sum buckets, bucketer dgx:/home/mudler/bench/bucket2.py, reports dgx:/home/mudler/bench/profgap/); caveats: no NVTX (kernel-name regex buckets); shared elementwise straddles resid/MoE-fanin/GDN-glue; vLLM decode is offline 128-wide, not staggered-server.
 
 Relevant repo paths (absolute): /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{TENSORCORE_GDN_SCOPE.md,TENSORCORE_GDN_BUILD_PLAN.md,VLLM_PARITY_LEVER_MAP.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,DECODE_SERVING_SCOPE.md,PAGED_BITEXACT_NOTE.md,final_benchmark.csv}; patches dir .../patches/paged/ (existing 0031 chunked-GDN serial, 0033 dequant->cuBLAS rejected, 0034 native FP4-MMA, 0040/0041 S1/S3 decode-graph, 0042 fused residual+RMSNorm); methodology /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/.agents/vllm-parity-methodology.md.
-
diff --git a/docs/superpowers/plans/2026-06-30-serving-source-phase7.md b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md
index 2d0845133..034e09d6e 100644
--- a/docs/superpowers/plans/2026-06-30-serving-source-phase7.md
+++ b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md
@@ -1,6 +1,7 @@
 # Phase 7: Serving Source Candidate Scope
 
-**Status:** Test-gate patch landed. Production CUDA fusion not started.
+**Status:** Test-gate patch landed. First production CUDA fusion candidate
+rejected after DGX gates and serving A/B.
 
 **Goal:** Select one maintainable source candidate for the remaining GB10 MoE
 serving gap, then implement only if it can be gated for inference correctness and
@@ -142,8 +143,16 @@ to implementation when all are true:
 - [x] Run md5/op gates before serving A/B.
   - `MOE_SWIGLU_DOWN`: `7/7` on CUDA0.
   - Serving A/B is not applicable to this test-only patch.
-- [ ] Keep only if the serving bucket and h2h result improve materially.
-- [ ] Regenerate LocalAI patch stack and update docs if kept.
+- [x] Keep only if the serving bucket and h2h result improve materially.
+  - Rejected candidate: opt-in SWIGLU-down NVFP4 quantization fusion.
+  - Default path was protected behind `GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1`.
+  - Default md5 gates stayed canonical, but the opt-in paged-MoE md5 changed
+    to the non-paged namespace (`07db32c2bcb78d17a43ed18bc22705cd`).
+  - Serving A/B was flat: default `decode_agg_tps=657.1`,
+    `decode_perseq_tps=3.92`, `prefill_tps=1456.0`; opt-in
+    `decode_agg_tps=667.4`, `decode_perseq_tps=3.88`, `prefill_tps=1462.9`.
+- [x] Regenerate LocalAI patch stack and update docs if kept.
+  - No production patch kept; only docs updated for the rejected candidate.
 
 ## Required Tests Before Track A Source Patch
 
@@ -183,6 +192,45 @@ DGX result after the adjustment:
   and tree-matches fork head `cd56cf037`.
 - Mirrored tree hash: `623b7cb008a929455ca3d9deae35494c02622fef`.
 
+## Rejected Production Candidate: SWIGLU-Down MMQ Fusion
+
+Attempted a fork-first CUDA patch that fused `GGML_OP_GLU(SWIGLU)` into the
+NVFP4 activation quantization feeding the down-projection `MUL_MAT_ID`. The
+patch kept the existing grouped-MMQ kernel and only replaced the separate f32
+SWIGLU write/read plus down-input quantize pass.
+
+Root-cause note from the first failed op gate: the fused quantizer initially used
+the compact GLU output strides to read the split `gate`/`up` views. Those views
+stride over the original merged gate/up tensor, so the NVFP4 cases read wrong
+rows and failed at roughly `2.0` NMSE. Switching the fused quantizer to the
+source-view strides fixed the focused op gate.
+
+Final DGX artifacts live under `/home/mudler/bench/phase7_source_scope/`:
+
+- Forced fusion op gate:
+  `GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1 test-backend-ops test -b CUDA0 -o MOE_SWIGLU_DOWN -j 1`
+  -> `7/7`.
+- Broad default op gate:
+  `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` -> `806/806`.
+- Default inference md5 after protecting the fusion behind
+  `GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1`:
+  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Opt-in fusion inference md5:
+  - MoE: `07db32c2bcb78d17a43ed18bc22705cd` (not the canonical paged-MoE md5).
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Serving A/B, `n=128`, `ptok=128`, `gen=64`, `/v1/completions`,
+  `--no-cache`:
+  - default: `decode_agg_tps=657.1`, `decode_perseq_tps=3.92`,
+    `prefill_tps=1456.0`.
+  - opt-in: `decode_agg_tps=667.4`, `decode_perseq_tps=3.88`,
+    `prefill_tps=1462.9`.
+
+Verdict: reject the production patch. The opt-in path is not md5-safe for
+paged-MoE and the bounded serving A/B is effectively flat. Do not spend more
+time on this exact activation-quant fusion unless a future KL gate explicitly
+allows a new paged-MoE md5 namespace and a profile shows a material bucket win.
+
 ## Required Tests Before Track B Source Patch
 
 - Establish fixed-seed baseline output md5 and token-id parity for a