mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-30 03:17:01 -04:00
Scopes the follow-up recorded by patch 0031 + README section 5: replace the serial per-thread reductions of the chunked gated-DeltaNet prefill scan with mma.sync tensor-core matmuls and lift the 1-block/SM occupancy ceiling, the path that would beat the tuned sequential scan and close the GDN prefill bucket toward vLLM's ~2.5x-cheaper chunked scan. Confirmed (not assumed) the GB10/sm_121a tensor-core reality: consumer Blackwell (SM12x) has NO wgmma (Hopper-only) and NO tcgen05/TMEM (sm_100a data-center only); the usable path is the extended mma.sync family. So the kernel is a warp-synchronous mma.sync + cp.async design (reusing ggml's mma.cuh tiles), not a wgmma/TMA/tcgen05 design - patch 0031's 'mma/wgmma' shorthand reads as mma only on this part. Design: register-resident state frees the 64KB that forced C=16, admitting C=64 under the 99KB shared opt-in; tf32 inputs / f32 accumulate with a 3xtf32 precision ladder; decays/gamma/beta stay f32 outside the mma to preserve the bounded de-gating; A-inverse via blocked forward substitution (FLA UT transform) with mma off-diagonal coupling. Mechanism: chunking cuts state-BW ~Cx, mma absorbs the O(C^2) intra-chunk flops the serial 0031 could not. Honest: multi-week, high risk, no vendor kernel to route to on sm_121; gains beat the sequential scan and close most of the bucket but not full sm_100-class parity. KL-gate binding (NMSE likely fails at reduced precision). Phased: re-profile -> two-product PoC -> full intra-chunk + C=64 + reg-state -> occupancy/cp.async; opt-in default-OFF until A/B-proven. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>