mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): reject swiglu down fusion candidate
Assisted-by: Codex:gpt-5
@@ -578,3 +578,45 @@ Fresh DGX gates from `/home/mudler/bench/phase7_source_scope/`:

The new gate covers the merged MoE gate_up -> SWIGLU -> down-projection graph
shape needed before attempting a batched NVFP4 down-input quantization fusion.

## Phase 7 SWIGLU-Down Fusion Candidate Rejected

Attempted candidate: fuse `GGML_OP_GLU(SWIGLU)` into the NVFP4 activation
quantization feeding the MoE down-projection `MUL_MAT_ID`, while keeping the
existing grouped-MMQ kernel. The patch was kept behind
`GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1` during validation.

DGX artifacts:

- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_swiglu_down_optin.txt`
- `/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_after_optin.txt`
- `/home/mudler/bench/phase7_source_scope/default_gates_after_optin/`
- `/home/mudler/bench/phase7_source_scope/optin_gates/`
- `/home/mudler/bench/phase7_source_scope/serving_ab/`

Correctness and inference gates:

- Forced fusion `MOE_SWIGLU_DOWN`: `7/7`.
- Broad default `MUL_MAT_ID`: `806/806`.
- Default md5 after opt-in gating stayed canonical:
  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
- Opt-in fusion md5:
  - MoE `07db32c2bcb78d17a43ed18bc22705cd`.
  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.

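The md5 gates above reduce to comparing a digest of the generated output against a canonical value. A minimal sketch of that comparison, assuming the gate hashes the raw output bytes; the helper names `md5_hex` and `gate` are hypothetical, and the actual gate script is not shown in this document:

```python
import hashlib

# Canonical digests quoted in the gate list above.
CANONICAL = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_hex(data: bytes) -> str:
    """Hex md5 of raw output bytes (assumed to be what the gate hashes)."""
    return hashlib.md5(data).hexdigest()


def gate(model: str, observed_md5: str) -> bool:
    """True when the observed digest matches the canonical one."""
    return observed_md5 == CANONICAL[model]


# Digests reported for the opt-in run.
optin = {
    "moe": "07db32c2bcb78d17a43ed18bc22705cd",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}
results = {model: gate(model, digest) for model, digest in optin.items()}
print(results)
```

With the reported digests, the dense model passes and the MoE model fails, which is the drift recorded in the verdict.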
Serving A/B (`n=128`, `ptok=128`, `gen=64`, `/v1/completions`, `--no-cache`):

| path | decode tok/s/seq | decode agg tok/s | prefill tok/s | verdict |
|------|------------------|------------------|---------------|---------|
| default | 3.92 | 657.1 | 1456.0 | baseline |
| `GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1` | 3.88 | 667.4 | 1462.9 | reject; md5 drift and flat A/B |

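To put the "flat" verdict in numbers, the relative change of each column can be computed directly from the table. A small sketch; the function name `rel_change_pct` and the dictionary keys are illustrative, not names from the benchmark tooling:

```python
def rel_change_pct(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (baseline, opt-in) pairs copied from the serving A/B table.
rows = {
    "decode_perseq_tps": (3.92, 3.88),
    "decode_agg_tps": (657.1, 667.4),
    "prefill_tps": (1456.0, 1462.9),
}

deltas = {name: round(rel_change_pct(b, c), 2) for name, (b, c) in rows.items()}
for name, pct in deltas.items():
    print(f"{name}: {pct:+.2f}%")
```

All three changes stay under 2% in magnitude, and the two decode metrics move in opposite directions.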
Result:

- Rejected as a production patch. The opt-in path changes the paged-MoE md5
  into the non-paged namespace and does not materially improve serving.
- Root-cause note for future attempts: the first fused-op gate failed because
  the fused quantizer used compact GLU-output strides to read split `gate`/`up`
  views. Split views stride over the merged gate/up tensor; using source-view
  strides fixed the op gate but not the end-to-end md5 drift.

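The stride issue in the root-cause note can be reproduced with a flat buffer. The sketch below assumes a merged layout in which each row stores `n_ff` gate values followed by `n_ff` up values; that layout, the sizes, and the helper `read_view` are assumptions for illustration, not taken from the patch:

```python
def read_view(buf, offset, row_stride, n_rows, n_cols):
    """Read an n_rows x n_cols view from a flat buffer (strides in elements)."""
    return [
        [buf[offset + r * row_stride + c] for c in range(n_cols)]
        for r in range(n_rows)
    ]


n_ff, n_rows = 4, 2
# Assumed merged layout: row r = [gate_0..gate_3 | up_0..up_3].
merged = list(range(n_rows * 2 * n_ff))

# Source-view strides: each split view steps over the full merged row.
gate_ok = read_view(merged, 0, 2 * n_ff, n_rows, n_ff)
up_ok = read_view(merged, n_ff, 2 * n_ff, n_rows, n_ff)

# Compact GLU-output strides: correct for the output tensor, wrong for the views.
gate_bad = read_view(merged, 0, n_ff, n_rows, n_ff)
up_bad = read_view(merged, n_ff, n_ff, n_rows, n_ff)

print(gate_ok, up_ok)
print(gate_bad, up_bad)
```

In this sketch row 0 is identical under both strides and only later rows diverge, so a check with a single row would not expose the mismatch.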
@@ -240,6 +240,14 @@ Organized by where the verified gap actually is. For each: mechanism / expected
- **Gate:** bit-exact if SiLU + accumulation order preserved → greedy md5 (else KL-gate).
- **Risk:** HIGH (fused FP4 FFN kernel is complex; register pressure on sm_121a).
- **Effort/reward: HIGH / MED-HIGH.** Strong but expensive; sequence after A1.
- **Phase 7 shortcut rejected:** fusing only SWIGLU into the NVFP4
  down-input quantization while reusing grouped-MMQ passed the focused op gate
  (`MOE_SWIGLU_DOWN 7/7`) but changed paged-MoE md5 under opt-in
  (`07db32c2...` vs canonical `8cb0ce23...`) and was flat in serving A/B
  (`decode_agg_tps 657.1 → 667.4`, `decode_perseq_tps 3.92 → 3.88`).
  Do not retry that partial fusion without a KL gate and a stronger profile
  bucket. A real A4 remains a different, larger register/shared-resident FFN
  kernel.

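A KL gate accepts a candidate whose logits are no longer bit-exact as long as its next-token distributions stay close to the reference. A minimal sketch, assuming per-step logits are available from both runs; the function names and the `1e-3` threshold are hypothetical and not taken from the repository's methodology:

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def kl_divergence(ref_logits, cand_logits):
    """KL(ref || cand) between the two softmax distributions, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(cand_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def kl_gate(ref_steps, cand_steps, max_mean_kl=1e-3):
    """Pass when the mean per-step KL stays at or below the threshold."""
    kls = [kl_divergence(r, c) for r, c in zip(ref_steps, cand_steps)]
    mean_kl = sum(kls) / len(kls)
    return mean_kl <= max_mean_kl, mean_kl


reference = [[2.0, 1.0, 0.0]]
same_ok, same_kl = kl_gate(reference, [[2.0, 1.0, 0.0]])
diff_ok, diff_kl = kl_gate(reference, [[0.0, 1.0, 2.0]])
print(same_ok, same_kl, diff_ok, diff_kl)
```

Unlike the md5 gate, this check tolerates small numerical differences, so the threshold has to be chosen and justified separately.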
### A5. Activation-quant fusion into the 0042 residual/RMSNorm epilogue (prefill)
- **Mechanism:** the README's "act-quant fusion FLAT" verdict was *decode-only*. For prefill the W4A4 activation-quantize pass is a bigger tensor. 0042 already fuses residual-add+RMSNorm+mul; extend its epilogue to emit the FP4-quantized activation the next GEMM consumes, removing a dedicated act-quant read+write.
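As a reference for what such an epilogue would compute, here is a simplified sketch: residual-add, RMSNorm, and weight multiply, followed by block-wise quantization to the FP4 E2M1 value grid in the same pass. It is an illustration under stated assumptions, not the 0042 kernel: the block scale is kept as a plain float, nothing is bit-packed, and all function names are hypothetical.

```python
import math

# Magnitudes representable in FP4 E2M1.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements per scale block (assumed NVFP4-style blocking)


def residual_rmsnorm_mul(x, residual, weight, eps=1e-6):
    """Residual add, then RMSNorm, then elementwise weight multiply."""
    h = [a + b for a, b in zip(x, residual)]
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in h) / len(h) + eps)
    return h, [v * inv_rms * w for v, w in zip(h, weight)]


def quantize_block(block):
    """Scale so the block's max magnitude maps to 6.0, then snap to the grid."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0.0 else 1.0
    codes = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    return scale, codes


def fused_epilogue(x, residual, weight):
    """Return the new residual stream plus quantized blocks for the next GEMM."""
    h, y = residual_rmsnorm_mul(x, residual, weight)
    blocks = [quantize_block(y[i:i + BLOCK]) for i in range(0, len(y), BLOCK)]
    return h, blocks


x = [float(i) for i in range(-8, 8)]
h, blocks = fused_epilogue(x, [0.0] * 16, [1.0] * 16)
scale, codes = blocks[0]
print(scale, codes)
```

In this sketch the normalized activation `y` never leaves `fused_epilogue`, which mirrors the intent of removing the separate activation-quantize read and write; the arithmetic itself is unchanged.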
@@ -490,4 +498,3 @@ Two profile surprises that reshape the directions: (a) vLLM on sm_121 is NOT nat
Cross-cutting: the prefill levers (#101 GDN, D2 MoE GEMM) double as serving-decode levers because continuous batching interleaves ~25-55% prefill work into the serving step. GDN edges MoE-GEMM as the top prefill pick (bigger gap, cleaner math mechanism, 2.6x proven headroom, lower in-backend risk, dual payoff).

All numbers from the both-engine nsys profile (cuda_gpu_kern_sum buckets, bucketer dgx:/home/mudler/bench/bucket2.py, reports dgx:/home/mudler/bench/profgap/); caveats: no NVTX (kernel-name regex buckets); shared elementwise straddles resid/MoE-fanin/GDN-glue; vLLM decode is offline 128-wide, not staggered-server. Relevant repo paths (absolute): /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{TENSORCORE_GDN_SCOPE.md,TENSORCORE_GDN_BUILD_PLAN.md,VLLM_PARITY_LEVER_MAP.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,DECODE_SERVING_SCOPE.md,PAGED_BITEXACT_NOTE.md,final_benchmark.csv}; patches dir .../patches/paged/ (existing 0031 chunked-GDN serial, 0033 dequant->cuBLAS rejected, 0034 native FP4-MMA, 0040/0041 S1/S3 decode-graph, 0042 fused residual+RMSNorm); methodology /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/.agents/vllm-parity-methodology.md.