Closes lever 5 of VLLM_DECODE_GROUNDING.md.

GGUF metadata and source reading on the paged dev tree, plus nsys decode traces on Qwen3.6-27B NVFP4 (GB10, sm_121), confirm that the Gated-Delta-Net (GDN) linear-attention layers decode as a single fused CUDA kernel (gated_delta_net.cu) that updates a fixed-size cached recurrent state: no context-length parameter, no KV re-scan.

A matched-batch context-scaling control (npl4, pure decode) shows the GDN kernel flat (10.3 -> 8.0 us/launch) across 4x context, while full attention grows 3.1x (27 -> 85 us). GDN is a small, context-flat share (~0.4-10% by batch); the FP4 weight GEMM dominates (~67%).

Verdict: GDN decode is efficient, so it is not the cheap model-specific fix; the 2.4x is the general GEMM + full-attention kernel work, as the grounding concluded.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
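
As a quick sanity check on the scaling figures quoted in this message, the growth factors can be recomputed from the per-launch times. This is a minimal sketch: the variable and function names are hypothetical, and the only inputs are the four microsecond values and the 4x context factor stated above.

```python
# Per-launch kernel times in microseconds, copied from the message above.
# Each pair is (shorter context, 4x longer context).
gdn_us = (10.3, 8.0)     # Gated-Delta-Net kernel
attn_us = (27.0, 85.0)   # full-attention kernel
context_growth = 4.0     # context-length ratio between the two points


def growth(pair):
    """Return the ratio of the long-context time to the short-context time."""
    short, long_ = pair
    return long_ / short


gdn_growth = growth(gdn_us)
attn_growth = growth(attn_us)

print(
    f"GDN: {gdn_growth:.2f}x, full attention: {attn_growth:.2f}x "
    f"over {context_growth:.0f}x context"
)
```

A ratio below 1 for the GDN kernel means it did not get slower as context grew, whereas the full-attention ratio lands near the 3.1x quoted above.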