LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 12:57:02 -04:00

Files

Ettore Di Giacinto 5b8b33a302 docs(paged): record P5 FLA GDN NO-GO - GDN prefill bucket is a confirmed shared-hardware floor

P5 ported the full six-kernel vLLM-FLA chunk_gated_delta_rule_fwd pipeline
(cumsum, chunk_scaled_dot_kkt, blocked merge_16x16_to_64x64 solve_tril,
recompute_w_u, register/smem-resident chunk_gated_delta_rule_fwd_h with the
chunk loop in-kernel, chunk_fwd_o) to CUDA tf32 mma, per-kernel validated vs
host fp64 (o NMSE 2.2e-7, final-state 1.2e-7), integrated behind
LLAMA_GDN_FLA_CHUNK=1 (default-off), and A/B'd in-backend vs the shipped M5
chunked scan.

P0 perf kill-gate FAILED decisively: nsys --cuda-graph-trace=node, MoE
q36-35b-a3b, npp2048 M5 56.31 vs FLA 119.46 us/tok (FLA 2.12x SLOWER,
gdn_delta_pct -112.1); npp512 2.29x slower; end-to-end S_PP -13.33%/-13.12%.
GO required FLA >10% faster at npp2048.
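
The gate arithmetic above can be reproduced with a short sketch. The formula for gdn_delta_pct is an assumption inferred from the quoted numbers (percent improvement relative to the M5 baseline, negative meaning slower). The function names and the reading of ">10% faster" as a delta above +10 are hypothetical, not taken from the benchmark harness.

```python
def gdn_delta_pct(baseline_us_per_tok, candidate_us_per_tok):
    """Percent improvement of candidate over baseline (assumed definition).

    Positive means the candidate is faster, negative means it is slower.
    """
    return (baseline_us_per_tok - candidate_us_per_tok) / baseline_us_per_tok * 100.0


def passes_go_gate(baseline_us_per_tok, candidate_us_per_tok, required_pct=10.0):
    """Hypothetical GO check: candidate must be more than required_pct faster."""
    return gdn_delta_pct(baseline_us_per_tok, candidate_us_per_tok) > required_pct


# npp2048 figures quoted in the commit message (us/tok).
m5, fla = 56.31, 119.46

slowdown = fla / m5              # how many times slower FLA is than M5
delta = gdn_delta_pct(m5, fla)   # negative: FLA is slower than M5

print(f"slowdown {slowdown:.2f}x, gdn_delta_pct {delta:.1f}, GO={passes_go_gate(m5, fla)}")
```

Under these assumptions the quoted 2.12x slowdown and the -112.1 delta are the same measurement expressed two ways, and the gate fails by a wide margin.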

Novel decomposition: the blocked solve_tril is only ~2.8% of the FLA bucket
(55.6 ms); fwd_h 46.2% (903 ms) + fwd_o 31.5% (617 ms) dominate. The cost is
the state-recurrence GEMMs plus per-chunk h-state materialization to global
LPDDR5x that FLA's split-kernel structure forces; the fused M5 single kernel
keeps the 128x128 state resident in smem and never materializes per-chunk h,
so it is 2.1x faster on GB10's low-bandwidth memory. So the GDN prefill bucket
(+59.2, the single largest prefill lever) is a confirmed shared-hardware /
memory-bandwidth floor, NOT recoverable by the blocked-solve algorithm.
Extends Phase74 (standalone blocked-inverse 0.59x) and bf16-C64 (-18.75%).
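
As a consistency check on the decomposition, the quoted share and time of each kernel should imply roughly the same total for the FLA bucket. The sketch below uses only the three (percent, ms) pairs given above. The dictionary layout and helper name are illustrative, and attributing the remaining share to the other three kernels is an inference, not a measured figure.

```python
# (share of the FLA bucket in percent, time in ms), as quoted in the commit message.
kernels = {
    "solve_tril": (2.8, 55.6),   # share quoted as approximate (~2.8%)
    "fwd_h": (46.2, 903.0),
    "fwd_o": (31.5, 617.0),
}


def implied_total_ms(share_pct, time_ms):
    """Total bucket time implied by one kernel's share and absolute time."""
    return time_ms / (share_pct / 100.0)


implied = {name: implied_total_ms(p, t) for name, (p, t) in kernels.items()}
listed_share = sum(p for p, _ in kernels.values())
dominant_share = kernels["fwd_h"][0] + kernels["fwd_o"][0]
remaining_share = 100.0 - listed_share  # inferred: the kernels not itemized above

for name, total in implied.items():
    print(f"{name}: implies a bucket of about {total:.0f} ms")
print(f"fwd_h + fwd_o: {dominant_share:.1f}% of the bucket; not itemized: {remaining_share:.1f}%")
```

All three pairs imply a bucket of a little under 2 s, and fwd_h plus fwd_o account for about 78% of it, which matches the claim that the blocked solve is not where the time goes.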

Gates: SMEM PASS (max 96KB < 99KB cap). KL band GREEN (FLA KLD 0.137028 vs
control 0.136563, delta +0.000465 < 0.01; same-top-p 84.61%). DEFAULT path
untouched (canonical md5 GREEN both models default-off AND FLA-on; MoE
8cb0ce23, dense 5951a5b4; GATED_DELTA_NET DEFAULT 46/46). Nothing landed;
series stays at 46 patches. WIP on DGX fork branch p5-fla-gdn (2d64c37f0),
NOT pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-07-02 21:04:37 +00:00

ACCELERATOR_PORTING_SCOPE.md

docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm)

2026-06-28 08:34:32 +00:00

BENCHMARK.md

docs(paged): record phases 112-140 + series trim decision

2026-07-02 10:16:53 +00:00

DECODE_SERVING_SCOPE.md

docs(paged): record padded/fixed-slot decode shape as tested-and-rejected

2026-06-28 20:47:43 +00:00

EXECUTION_REARCH_SCOPE.md

docs(paged): record P5 FLA GDN NO-GO - GDN prefill bucket is a confirmed shared-hardware floor