LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-24 08:38:51 -04:00

Files

Ettore Di Giacinto e4c63179e0 docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.4x lever

Closes lever 5 of VLLM_DECODE_GROUNDING.md. GGUF metadata + source reading on
the paged dev tree plus nsys decode traces on Qwen3.6-27B NVFP4 (GB10 sm_121)
confirm the Gated-Delta-Net linear-attention layers decode as a fused single
CUDA kernel (gated_delta_net.cu) updating a fixed-size cached recurrent state:
no context-length parameter, no KV re-scan. Matched-batch context-scaling
control (npl4, pure decode) shows the GDN kernel flat (10.3 -> 8.0 us/launch)
across 4x context while full-attention grows 3.1x (27 -> 85 us). GDN is a small,
context-flat share (~0.4-10%% by batch); the FP4 weight GEMM dominates (~67%).
Verdict: GDN decode is efficient, not the cheap model-specific fix; the 2.4x is
the general GEMM + full-attention kernel work, as the grounding concluded.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-24 11:21:44 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378 )

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847 )

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395 )

2026-06-19 00:03:37 +02:00

llama-cpp

docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.4x lever