LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 01:16:58 -04:00

Files

Ettore Di Giacinto e597a8ac78 docs(paged): vLLM GDN decode = 2 fused kernels under CUDA graph vs llama ~8 ops

Read-only source comparison of the gated-DeltaNet decode region. vLLM folds
conv-silu, q/k l2norm, scale, softplus+A_log gate, sigmoid-beta, the delta-rule
recurrence and the SSM state write-back into ONE Triton kernel
(fused_recurrent_gated_delta_rule_packed_decode), with the output gate fused into
a gated rms_norm, and captures the whole decode forward in a full CUDA graph
(GDNAttentionMetadata UNIFORM_BATCH, decode-only full cudagraph). llama runs the
same region as ~8 separate host-launched, serially-dependent ggml nodes. That
launch/bubble delta - not GEMM throughput - is the candidate 62%-vs-40% busy gap.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-25 14:43:01 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378 )

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847 )

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395 )

2026-06-19 00:03:37 +02:00

llama-cpp

docs(paged): vLLM GDN decode = 2 fused kernels under CUDA graph vs llama ~8 ops