mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 09:26:55 -04:00

Ettore Di Giacinto e4c63179e0 docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.4x lever

Closes lever 5 of VLLM_DECODE_GROUNDING.md. GGUF metadata + source reading on
the paged dev tree plus nsys decode traces on Qwen3.6-27B NVFP4 (GB10 sm_121)
confirm the Gated-Delta-Net linear-attention layers decode as a fused single
CUDA kernel (gated_delta_net.cu) updating a fixed-size cached recurrent state:
no context-length parameter, no KV re-scan. Matched-batch context-scaling
control (npl4, pure decode) shows the GDN kernel flat (10.3 -> 8.0 us/launch)
across 4x context while full-attention grows 3.1x (27 -> 85 us). GDN is a small,
context-flat share (~0.4-10% by batch); the FP4 weight GEMM dominates (~67%).
Verdict: GDN decode is efficient, not the cheap model-specific fix; the 2.4x is
the general GEMM + full-attention kernel work, as the grounding concluded.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-24 11:21:44 +00:00

GDN decode verify: is llama.cpp's Gated-Delta-Net decode O(1) or an O(ctx) re-scan?

Verdict-first, then the evidence. This closes lever 5 of VLLM_DECODE_GROUNDING.md ("Verify llama's GDN/linear-attention decode path"): on the Qwen3.6 hybrid models, is llama re-scanning the context (O(ctx)) in the linear-attention layers, or keeping vLLM's O(1)-in-context recurrent state?

Method: GGUF-metadata + source reading on the paged dev tree (~/llama-paged-dev, build-cuda sm_121) on dgx.casa, plus nsys CUDA-kernel decode traces on ~/bench/q36-27b-nvfp4.gguf (GB10 / DGX Spark, GGML_CUDA_DISABLE_GRAPHS=1, paged KV, -fa on). Models: ~/bench/q36-27b-nvfp4.gguf (dense, arch qwen35), ~/bench/q36-35b-a3b-nvfp4.gguf (MoE, arch qwen35moe).

TL;DR verdict

llama.cpp's GDN decode is EFFICIENT: it is O(1)-in-context, a single fused CUDA kernel that reads + updates a fixed-size cached recurrent state, structurally identical to vLLM's fused_recurrent_gated_delta_rule. It is NOT a re-scan, NOT a context-scaling blowup, and NOT a major contributor to the ~2.4x eager-decode gap. There is no GDN-specific bottleneck to fix, so the cheap model-specific lever this probe was hunting for does not exist. The 2.4x is the general kernel work (the FP4 weight GEMM, which dominates the step, plus the O(ctx) full-attention decode kernel in the minority of full-attention layers), exactly as VLLM_DECODE_GROUNDING.md concluded.

The decisive datum: at matched batch (npl4), pure decode, 4x more context, the GDN kernel time is flat while the full-attention kernel grows ~3.1x:

| kernel | ctx 1024 | ctx 4096 | ratio | meaning |
| --- | --- | --- | --- | --- |
| `gated_delta_net_cuda` (GDN linear-attn) | 10.3 us/launch | 8.0 us/launch | 0.78x (flat) | O(1) in ctx |
| `flash_attn_tile` (full-attn layers) | 27.1 us/launch | 85.0 us/launch | 3.1x | O(ctx), as expected |
| total ms / decode step | 84.9 | 86.0 | 1.01x | GEMM-bound, ctx-independent |

Identical decode-step counts in both windows (~190 steps, ~9134 GDN launches), so this is a per-step like-for-like comparison: the GDN layers do not get more expensive as context grows.

1. Architecture (confirmed from GGUF metadata + tensor names)

Both Qwen3.6 models are hybrid: a full_attention_interval of 4 means every 4th layer is standard full attention and the other 3/4 are Gated-Delta-Net (GDN) linear attention with a recurrent state.

Dense Qwen3.6-27B (general.architecture = qwen35):

block_count = 64, full_attention_interval = 4 -> 16 full-attention layers + 48 GDN layers.
Full-attn: head_count = 24, head_count_kv = 4 (GQA), key_length = value_length = 256, rope freq_base = 1e7, mrope sections [11,11,10,0].
GDN/SSM: ssm.state_size = 128, ssm.conv_kernel = 4, ssm.group_count = 16, ssm.time_step_rank = 48, ssm.inner_size = 6144. So the recurrent state per GDN layer is [S_v=128, S_v=128, H_v=48] per sequence (H_v = inner_size/state_size = 6144/128 = 48 value heads), i.e. a 128x128 state matrix per head, ~3.1 MB (F32) per sequence per layer.

MoE Qwen3.6-35B-A3B (general.architecture = qwen35moe):

block_count = 41, full_attention_interval = 4 (~10 full-attn + ~31 GDN layers).
head_count = 16, head_count_kv = 2, key_length = value_length = 256, expert_count = 256, expert_used_count = 8, expert_feed_forward_length = 512.
Same SSM dims: state_size = 128, conv_kernel = 4, group_count = 16, inner_size = 4096 -> H_v = 32 value heads.
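The head counts and state sizes quoted above follow directly from the metadata. A minimal sketch of that arithmetic, assuming an F32 (4-byte) state element as stated for the dense model; the helper name `gdn_state` is hypothetical:

```python
# Per-layer GDN recurrent-state size, derived from the GGUF metadata quoted above.
# Assumption of this sketch: the state is stored as F32 (4 bytes per element).

def gdn_state(inner_size, state_size, bytes_per_elem=4):
    """Return (value_heads, elements, bytes) for one sequence in one GDN layer."""
    heads = inner_size // state_size           # H_v = inner_size / state_size
    elems = state_size * state_size * heads    # one S_v x S_v matrix per value head
    return heads, elems, elems * bytes_per_elem

dense = gdn_state(inner_size=6144, state_size=128)   # Qwen3.6-27B
moe = gdn_state(inner_size=4096, state_size=128)     # Qwen3.6-35B-A3B

for name, (heads, elems, nbytes) in (("dense", dense), ("moe", moe)):
    print(f"{name}: H_v={heads}, elements={elems}, {nbytes / 1e6:.1f} MB per sequence per layer")
```

None of these quantities depends on the number of tokens already processed, which is the property the rest of the note verifies in code and in traces.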

Tensor names confirm the op split (27B, per-layer dump):

GDN layers (e.g. blk.0.*): ssm_alpha, ssm_beta, ssm_conv1d, ssm_a, ssm_dt.bias, ssm_norm, ssm_out, plus attn_qkv / attn_gate (the in/out projections of the linear-attn block). No attn_k/v/output, no per-head q/k norm.
Full-attn layers (e.g. blk.3.*, every 4th): attn_q, attn_k, attn_v, attn_output, attn_q_norm, attn_k_norm. No ssm_*.
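The layer split can be reproduced from `block_count` and `full_attention_interval` alone. The sketch below assumes that layer `i` is full attention when `(i + 1) % interval == 0`; that rule is inferred from the tensor dump (blk.0 is GDN, blk.3 is full attention) rather than taken from the llama.cpp source, and `layer_kinds` is a hypothetical helper:

```python
# Classify layers as full-attention or GDN from the two metadata fields.
# Assumption: layer i is full attention when (i + 1) % interval == 0,
# which matches blk.3 being full-attention and blk.0 being GDN in the dump.

def layer_kinds(block_count, interval):
    return ["full" if (i + 1) % interval == 0 else "gdn" for i in range(block_count)]

for name, blocks in (("27B dense", 64), ("35B-A3B MoE", 41)):
    kinds = layer_kinds(blocks, 4)
    print(name, kinds.count("full"), "full-attn +", kinds.count("gdn"), "GDN")
```

Under that assumption the counts come out as 16 + 48 for the dense model and 10 + 31 for the MoE, in line with the figures above.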

llama loads the GDN layers through the recurrent memory (llama-memory-recurrent), not the KV cache: the conv state and the SSM state live in conv_states_all / ssm_states_all and are read and written every step. Only the full-attention layers (16 in the dense model, ~10 in the MoE) use the (paged) KV cache. This is the SSM-style recurrent path, not standard attention.

2. llama.cpp GDN decode implementation: O(1) recurrent-state update (code-proven)

Graph build (shared by both models): src/models/delta-net-base.cpp, dispatched from src/models/qwen35.cpp and src/models/qwen35moe.cpp (the MoE class inherits llm_build_delta_net_base and calls the same build_recurrent_attn, qwen35moe.cpp:472).

Decode dispatch (build_delta_net, delta-net-base.cpp:425-447): when n_seq_tokens == 1 (decode), it takes build_delta_net_fused if cparams.fused_gdn_ar (the default, see below), else build_delta_net_autoregressive. Both are O(1):

build_delta_net_autoregressive (delta-net-base.cpp:289-371) is the explicit rank-1 recurrence on the fixed-size state s shaped [S_v, S_v, H_v, n_seqs]: s *= exp(g) (decay), sk = sum_rows(s * k), d = (v - sk^T) * beta, s += k (x) d^T (rank-1 update), o = sum_rows(s * q). No loop over past tokens, no KV read - it touches only the state and the single new token's q/k/v/g/beta. GGML_ASSERT(n_tokens == 1).
build_delta_net_fused (delta-net-base.cpp:373-423) collapses the same recurrence into one op, ggml_gated_delta_net(q, k, v, g, b, s, K=1).
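The recurrence listed above can be written out for a single head in a few lines. This is an illustrative pure-Python sketch, not the ggml graph: the `S[i][j]` layout (i over the key dimension, j over the value dimension) is an assumption, any q/k normalisation or scaling is left out, and `gdn_decode_step` is a hypothetical name.

```python
import math

def gdn_decode_step(S, q, k, v, g, beta):
    """One decode step for one head; updates S in place and returns the output.

    S[i][j]: i indexes the key dimension, j the value dimension (sketch layout).
    Only the fixed-size state and the new token's q/k/v/g/beta are touched.
    """
    n = len(S)
    decay = math.exp(g)
    for i in range(n):                      # s *= exp(g)  (decay)
        for j in range(n):
            S[i][j] *= decay
    # sk = sum_rows(s * k): what the state currently predicts for key k
    sk = [sum(S[i][j] * k[i] for i in range(n)) for j in range(n)]
    # d = (v - sk) * beta: gated correction towards the new value
    d = [(v[j] - sk[j]) * beta for j in range(n)]
    for i in range(n):                      # s += k (x) d^T  (rank-1 update)
        for j in range(n):
            S[i][j] += k[i] * d[j]
    # o = sum_rows(s * q): readout with the query
    return [sum(S[i][j] * q[i] for i in range(n)) for j in range(n)]

# Usage: with no decay (g = 0), beta = 1 and a unit-norm key,
# reading back with q = k returns the value that was just written.
S = [[0.0] * 4 for _ in range(4)]
k = [1.0, 0.0, 0.0, 0.0]
v = [0.5, -1.0, 2.0, 3.0]
out = gdn_decode_step(S, q=k, k=k, v=v, g=0.0, beta=1.0)
print(out)
```

The cost of a step is fixed by the state dimensions: nothing in the function iterates over past tokens, and `S` keeps the same shape however many steps are applied.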

State is cached across steps, not rebuilt (build_recurrent_attn, delta-net-base.cpp:527-606): the input state s is read from ssm_states_all via build_rs, and the new state is copied back with ggml_cpy(new_state, view(ssm_states_all, ... kv_head ...)) (lines 555-558). The causal-conv state is handled the same way in build_conv_state (449-525): the previous conv_kernel-1 = 3 samples are read from conv_states_all, the new token is appended, and the last 3 are written back. So both pieces of GDN state persist in the recurrent cache exactly like a KV cache persists tokens - this is the recurrent analogue, fixed size, independent of context length.
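The conv-state handling described above amounts to a rolling window of conv_kernel - 1 = 3 samples. A minimal single-channel sketch, assuming the window is stored oldest-first and using made-up kernel weights; `conv_step` is a hypothetical helper and the real op works per channel on tensors:

```python
def conv_step(conv_state, x_new, weights):
    """Causal conv for one new sample.

    conv_state: the previous len(weights) - 1 samples, oldest first (assumed order).
    Returns the conv output and the state to write back for the next step.
    """
    window = conv_state + [x_new]             # append the new token's sample
    assert len(window) == len(weights)
    y = sum(w * x for w, x in zip(weights, window))
    return y, window[1:]                      # keep only the last 3 samples

weights = [0.1, 0.2, 0.3, 0.4]                # illustrative values, conv_kernel = 4
state = [0.0, 0.0, 0.0]                       # conv_kernel - 1 = 3 cached samples
for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    y, state = conv_step(state, x, weights)
print(y, state)
```

The cached state stays at three samples regardless of how many tokens have been fed, mirroring the fixed-size conv_states_all slot per sequence.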

Defaults (src/llama-context.cpp:200-201): cparams.fused_gdn_ar = true and fused_gdn_ch = true. They are only auto-disabled if the fused op cannot be scheduled on the same device as the layer (device_gdn != device_kv, lines 540-595); on a single GB10 with -ngl 99 that does not happen, so the fused single-kernel path is what runs.

The CUDA kernel (ggml/src/ggml-cuda/gated_delta_net.cu) is the crux, and it is unambiguously O(1) in context:

Launch grid dim3(H, n_seqs, ceil(S_v/4)) and block (min(warp,S_v), 4, 1) (lines 184-185): the grid spans heads x sequences x state-columns. There is no context-length dimension and no context-length argument anywhere in the kernel signature (q/k/v/g/beta are the new token(s) [S_v, H, n_tokens, n_seqs]; curr_state is the fixed [S_v, S_v, H, n_seqs]).
Each warp loads its shard of the fixed-size state into registers once (lines 57-61), then loops for (t = 0; t < n_tokens; t++) (line 63). At decode n_tokens == 1, so it is a single iteration: read the one new token, do the rank-1 update s_shard[r] = g * s_shard[r] + k[i] * delta_col and the readout attn = S^T q (lines 84-141), then write the updated state back (lines 161-167). No second loop, no read of any past KV.
Work per decode step is therefore proportional to S_v * S_v * H * n_seqs (the state size x batch) and constant in context length. This is precisely vLLM's fused_recurrent_gated_delta_rule_packed_decode_kernel (one batched launch updating a fixed-size [K,V] state) cited in the grounding doc.
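The launch geometry quoted above can be restated as a small calculation. The sketch assumes a warp width of 32 (the source only says `warp`), and the function names are hypothetical; the point is that context length never appears as an input:

```python
import math

WARP_SIZE = 32  # assumption: NVIDIA warp width; the source only says "warp"

def gdn_launch(H, n_seqs, S_v):
    """Grid and block dimensions as described for gated_delta_net.cu."""
    grid = (H, n_seqs, math.ceil(S_v / 4))    # heads x sequences x state-column groups
    block = (min(WARP_SIZE, S_v), 4, 1)
    return grid, block

def state_elems(H, n_seqs, S_v):
    """State elements touched per decode step: S_v * S_v * H * n_seqs."""
    return S_v * S_v * H * n_seqs

grid, block = gdn_launch(H=48, n_seqs=4, S_v=128)   # dense 27B at npl4
print(grid, block, state_elems(48, 4, 128))
```

Doubling `n_seqs` doubles the grid and the state traffic; changing the context length changes neither, because it is not a parameter of the launch.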

A chunked GPU kernel for prefill is a TODO (delta-net-base.cpp:181 //TODO: Add chunked kernel); the chunked CPU/graph path (build_delta_net_chunking) only runs for multi-token ubatches (prefill), never at decode.

3. nsys decode traces: kernel-time shares and context scaling (measured)

Setup: Qwen3.6-27B NVFP4, sm_121, GGML_CUDA_DISABLE_GRAPHS=1, paged KV, -fa on, with llama-server driven to steady decode by a looping completion client. Kernel time is bucketed by kernel name (full classifier and sqlites under ~/bench/gdn_study/).

(a) Share at the headline batch (npl128, ctx 1024), GPU 92.7% busy:

| bucket | % of busy | us/launch |
| --- | --- | --- |
| GEMM_weight (`mul_mat_q`/`mul_mat_vec_q`) | 59.2 | - |
| GDN_recurrent (`gated_delta_net_cuda`) | 8.9 | 369 |
| GEMM_act_quant (`quantize_mmq_nvfp4`) | 8.2 | - |
| elementwise / act_glu / norm / rope | ~13.5 | - |
| embed_gather (`get_rows`) | 2.9 | - |
| ATTENTION_full (`flash_attn`, 16 layers) | 1.8 | 107 |
| copy_cast (`cpy`) | 1.8 | - |
| GDN_conv (`ssm_conv`) | 1.5 | - |

The whole GDN path (recurrent 8.9% + conv 1.5%) is ~10% of the step; full attention is ~2%; the weight GEMM dominates at ~67% (59.2% GEMM + 8.2% act-quant requant). This is the dense model, where the grounding predicted the GEMM would be the lever.

(b) Share at low batch (npl32, ctx 1024), weight-bandwidth (GEMV) regime, GPU ~100%: GEMM_weight 88.7%, GDN_recurrent 0.8%, ATTENTION_full 0.7%, GDN_conv 0.3%. At low batch the weight-read GEMV swamps everything and GDN is negligible; the GDN share tracks the batch, not the context.

(c) Context-scaling control (the decisive test): matched batch npl4, pure decode, ctx 1024 vs 4096. Small batch -> fast prefill -> a clean pure-decode capture (verified: GEMM is the M=1 mul_mat_vec_q decode GEMV, and the client completed decode rounds inside the window). Identical decode-step counts (~190 steps, gated_delta_net launched 9141 vs 9134 times), so per-launch time is a true per-step comparison:

| kernel / bucket | ctx 1024 | ctx 4096 | ratio |
| --- | --- | --- | --- |
| `gated_delta_net_cuda` us/launch | 10.3 | 8.0 | 0.78x (flat) |
| GDN_recurrent share | 0.6% | 0.4% | flat/down |
| `ssm_conv` (GDN_conv) us/launch | 5.2 | 5.2 | 1.00x |
| `flash_attn_tile` us/launch | 27.1 | 85.0 | 3.14x |
| ATTENTION_full share | 0.6% | 1.8% | 3.0x up |
| total ms / decode step | 84.9 | 86.0 | 1.01x |

The GDN kernel time is flat (even a hair faster) across a 4x context increase, while the full-attention kernel grows ~3x, exactly the O(1)-vs-O(ctx) signature. The total step time barely moves because at this batch the (context-independent) FP4 weight GEMM is 88% of the step. This is the empirical confirmation of the code analysis: llama's GDN decode does not re-scan the context.
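The ratios and the step count in the context-scaling control can be rechecked from the reported per-launch numbers. A short sketch using only figures quoted in this note (48 GDN layers for the dense model, 9134 GDN launches in the ctx 4096 window):

```python
# Per-launch timings (ctx 1024, ctx 4096) as reported in the control run.
rows = {
    "gated_delta_net_cuda us/launch": (10.3, 8.0),
    "ssm_conv us/launch": (5.2, 5.2),
    "flash_attn_tile us/launch": (27.1, 85.0),
    "total ms / decode step": (84.9, 86.0),
}

ratios = {name: round(c4096 / c1024, 2) for name, (c1024, c4096) in rows.items()}
for name, r in ratios.items():
    print(f"{name}: {r}x over 4x context")

# One GDN launch per GDN layer per decode step, so launches / 48 gives the step count.
steps = 9134 / 48
print(f"decode steps in window: ~{steps:.0f}")
```

A kernel whose cost scaled linearly with context would show a ratio near 4x here; only the full-attention kernel moves in that direction.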

(An earlier npl32 ctx4096 attempt was discarded: with 32 parallel slots each independently prefilling ~4100 tokens, the nsys window caught prefill, not steady decode - the mul_mat_q(M=128) + flash_attn_ext_f16(ctx4096) signature gave it away. The npl4 runs above avoid this by keeping prefill short.)

4. Verdict and fix scope

Efficient, not a bottleneck. llama.cpp runs the Qwen3.6 GDN/linear-attention layers as a fused, single-CUDA-kernel, O(1)-in-context recurrent-state update, with the conv and SSM state cached in the recurrent memory across decode steps. It is algorithmically the same as vLLM's O(1) fused_recurrent decode. The probe's worst case (llama re-scanning context => GDN layers ballooning with context and concurrency) is falsified: the GDN kernel is flat across 4x context, and the op carries no context-length parameter at all.

So the GDN path is not the cheap model-specific lever. It is a small-to-moderate, context-flat share of the step (~0.4-0.8% at low batch, ~10% including conv at batch 128), and removing it would not dent the 2.4x. The gap is the general kernel work, confirming VLLM_DECODE_GROUNDING.md:

the FP4 weight GEMM is the dominant bucket (~59% GEMM + ~8% quantize_mmq_nvfp4 requant that vLLM fuses away via native FP4-MMA / grouped Marlin); this is the biggest, hardest lever.
the full-attention decode kernel is the O(ctx) residual (the only thing that grows with context, ~3x per-launch over 4x ctx), in the minority of full-attention layers.

If anything on the GDN side is ever worth touching, it is a bounded micro-optimization, not a complexity fix: the kernel is memory-bound on the F32 recurrent state (state read+write is S_v^2 * H * batch = ~0.79 GB/step over 273 GB/s at batch 128, hence the ~8.9% share), and this traffic is intrinsic to the architecture - vLLM pays the identical state I/O, so it is not a llama-specific inefficiency. A future win could keep the recurrent state in bf16 or fuse the ssm_conv + gated-norm into the delta-net kernel to shave that ~10%, but the ceiling is small and it does not close the 2.4x. The throughput effort stays where the grounding put it: the FP4 GEMM (fused act-quant + native FP4-MMA) and the full-attention decode kernel, with a CUDA-graphed steady-state step as the bounded host-side add-on.

Reproduce

Metadata: python3 gguf-py/gguf/scripts/gguf_dump.py --no-tensors ~/bench/q36-27b-nvfp4.gguf.
Code: src/models/delta-net-base.cpp (build_delta_net 425, autoregressive 289, fused 373, build_recurrent_attn 527, build_conv_state 449); src/llama-context.cpp:200-201,540-595 (fused_gdn defaults/guard); ggml/src/ggml-cuda/gated_delta_net.cu (kernel 4-168, launch grid 184-185, dispatch 226-312).
Profiles: ~/bench/gdn_study/drv.sh <label> <P> <K> <ctx> <delay> <dur> runs llama-server under nsys and drives clientloop.py; catgdn.py <sqlite> buckets kernels. Sqlites: gdn_npl128_ctx1024, gdn_npl32_ctx1024, gdn_npl4_ctx1024, gdn_npl4_ctx4096.
