mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 12:57:02 -04:00
P5 ported the full six-kernel vLLM-FLA chunk_gated_delta_rule_fwd pipeline (cumsum, chunk_scaled_dot_kkt, blocked merge_16x16_to_64x64 solve_tril, recompute_w_u, register/smem-resident chunk_gated_delta_rule_fwd_h with the chunk loop in-kernel, chunk_fwd_o) to CUDA tf32 mma. Each kernel was validated vs host fp64 (o NMSE 2.2e-7, final-state 1.2e-7). The pipeline was integrated behind LLAMA_GDN_FLA_CHUNK=1 (default-off) and A/B'd in-backend vs the shipped M5 chunked scan.

P0 perf kill-gate FAILED decisively. Measured with nsys --cuda-graph-trace=node on MoE q36-35b-a3b:
- npp2048: M5 56.31 vs FLA 119.46 us/tok (FLA 2.12x SLOWER, gdn_delta_pct -112.1)
- npp512: FLA 2.29x slower
- end-to-end S_PP: -13.33%/-13.12%
GO required FLA >10% faster at npp2048.

Novel decomposition: the blocked solve_tril is only ~2.8% of the FLA bucket (55.6 ms); fwd_h 46.2% (903 ms) + fwd_o 31.5% (617 ms) dominate. The cost is the state-recurrence GEMMs plus per-chunk h-state materialization to global LPDDR5x that FLA's split-kernel structure forces. The fused M5 single kernel keeps the 128x128 state resident in smem and never materializes per-chunk h, so it is 2.1x faster on GB10's low-bandwidth memory. So the GDN prefill bucket (+59.2, the single largest prefill lever) is a confirmed shared-hardware / memory-bandwidth floor, NOT recoverable by the blocked-solve algorithm. Extends Phase74 (standalone blocked-inverse 0.59x) and bf16-C64 (-18.75%).

Gates:
- SMEM PASS (max 96KB < 99KB cap).
- KL band GREEN (FLA KLD 0.137028 vs control 0.136563, delta +0.000465 < 0.01; same-top-p 84.61%).
- DEFAULT path untouched (canonical md5 GREEN both models default-off AND FLA-on; MoE 8cb0ce23, dense 5951a5b4; GATED_DELTA_NET DEFAULT 46/46).

Nothing landed; series stays at 46 patches. WIP on DGX fork branch p5-fla-gdn (2d64c37f0), NOT pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
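
For readers who want to re-derive the gate numbers quoted above, here is a minimal Python sketch of the arithmetic. It is not the project's gate script: the function names (`gdn_delta_pct_of`, `perf_gate_go`, `kl_band_green`) are hypothetical, and the formula for gdn_delta_pct is an assumption inferred from the fact that it reproduces the reported -112.1 from the two us/tok figures. The 10% GO threshold and the 0.01 KL band are taken from the message itself.

```python
# Hypothetical helpers; names and formulas are assumptions, not LocalAI code.

def gdn_delta_pct_of(baseline_us_per_tok: float, candidate_us_per_tok: float) -> float:
    """Percent improvement of the candidate over the baseline.

    Positive means the candidate is faster, negative means slower.
    Assumed definition: (baseline - candidate) / baseline * 100.
    """
    return (baseline_us_per_tok - candidate_us_per_tok) / baseline_us_per_tok * 100.0


def perf_gate_go(baseline_us_per_tok: float, candidate_us_per_tok: float,
                 min_gain_pct: float = 10.0) -> bool:
    """GO only if the candidate is more than min_gain_pct faster."""
    return gdn_delta_pct_of(baseline_us_per_tok, candidate_us_per_tok) > min_gain_pct


def kl_band_green(control_kld: float, candidate_kld: float, band: float = 0.01) -> bool:
    """GREEN if the signed KLD increase over control stays below the band."""
    return (candidate_kld - control_kld) < band


# Figures quoted in the message (npp2048, M5 baseline vs FLA candidate).
M5_US_PER_TOK = 56.31
FLA_US_PER_TOK = 119.46

delta_pct = gdn_delta_pct_of(M5_US_PER_TOK, FLA_US_PER_TOK)
slowdown = FLA_US_PER_TOK / M5_US_PER_TOK

print(f"gdn_delta_pct = {delta_pct:.1f}")   # matches the reported -112.1
print(f"slowdown      = {slowdown:.2f}x")   # matches the reported 2.12x
print("perf gate GO  =", perf_gate_go(M5_US_PER_TOK, FLA_US_PER_TOK))
print("KL band GREEN =", kl_band_green(0.136563, 0.137028))
```

Under these assumptions the perf gate evaluates to NO-GO while the KL band stays GREEN, which agrees with the outcome recorded above.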