mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 01:16:58 -04:00
Read-only source comparison of the gated-DeltaNet decode region.

vLLM folds conv-silu, q/k l2norm, scale, the softplus+A_log gate, sigmoid-beta, the delta-rule recurrence and the SSM state write-back into ONE Triton kernel (fused_recurrent_gated_delta_rule_packed_decode), with the output gate fused into a gated rms_norm. It also captures the whole decode forward in a full CUDA graph (GDNAttentionMetadata UNIFORM_BATCH, decode-only full cudagraph).

llama runs the same region as ~8 separate host-launched, serially-dependent ggml nodes. That launch/bubble delta, not GEMM throughput, is the candidate explanation for the 62%-vs-40% busy gap.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
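
For reference, the per-token work in the decode region can be sketched as one unfused step for a single head. This is a minimal illustration, not code from vLLM or llama: the function name gdn_decode_step, its argument names, the dt_bias term and the exact gate form g = -exp(A_log) * softplus(a + dt_bias) are assumptions based on the common gated delta-rule formulation. The conv-silu input stage and the gated rms_norm output stage are left out. Each commented stage corresponds roughly to one of the operations that a fused kernel would combine and that an unfused graph would launch separately.

```python
import math

def l2norm(x, eps=1e-6):
    # Normalize a vector to unit length (eps avoids division by zero).
    n = math.sqrt(sum(v * v for v in x) + eps)
    return [v / n for v in x]

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gdn_decode_step(S, q, k, v, a, b, A_log, dt_bias=0.0):
    """One decode step for one head (hypothetical reference, assumed gate form).

    S is the recurrent state with shape [d_k][d_v]; q and k have length d_k,
    v has length d_v; a and b are the raw gate and beta pre-activations.
    """
    d_k, d_v = len(k), len(v)
    # 1) q/k l2norm
    q, k = l2norm(q), l2norm(k)
    # 2) scale the query
    scale = d_k ** -0.5
    q = [x * scale for x in q]
    # 3) softplus + A_log gate (assumed form), giving a decay in (0, 1)
    decay = math.exp(-math.exp(A_log) * softplus(a + dt_bias))
    # 4) sigmoid-beta write strength
    beta = sigmoid(b)
    # 5) decay the old state
    S = [[decay * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 6) delta rule: correct what the state currently predicts for key k
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    delta = [beta * (v[j] - pred[j]) for j in range(d_v)]
    # 7) state write-back (rank-1 update k * delta^T)
    S = [[S[i][j] + k[i] * delta[j] for j in range(d_v)] for i in range(d_k)]
    # 8) read out with the scaled query
    out = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return out, S
```

Run as written, every numbered stage is a separate pass over q, k, v or S. That is the structure the comparison attributes to the ggml graph, whereas a single fused kernel keeps the intermediates on chip and pays one launch.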