mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-26 09:26:55 -04:00
docs(paged): cudagraph-coverage - GDN serial chain IS graph-covered at B=128
Determine whether the ggml CUDA graph covers the gated-DeltaNet serial chain at batch=128.

It does: nothing in the GDN region forces graph-disable (check_compability lists only split-buffers and large-batch MUL_MAT_ID), and the recurrent head is constant for a steady 128-seq batch so the inplace_ids state_dst offset + rs_head op_param + SSM input shapes are stable across steps. The fused op does no host-sync / capture-time cudaMalloc. The only re-warm is the per-256-token full-attention block-table cadence (not a GDN op).

The ~40% util is bandwidth-roofline (SSM state traffic 66% of step bytes), not launch-gap idle - so no GDN graph-safe lever; the only non-covered idle is the ~0.4% between-step host cgraph rebuild.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -98,3 +98,258 @@ compute - is the 62%-vs-40% busy gap. A timeline gap analysis on the existing ns
trace should show idle GPU between the GDN sub-ops (conv -> gate -> recurrence ->
gather) per layer; if it does, Lever 3 (gating-into-recurrence fusion) and/or
decode CUDA-graph capture are the levers that will move the wall, unlike P2a/Lever 2.

---

## roofline-decode (READ-ONLY, no GPU) - decode-step roofline + bubble budget + parity target

Goal: bound the q36-27b-nvfp4 decode step by the bandwidth floor and the compute floor,
compare to measured llama 384 ms/step vs vLLM 327 ms/step, size the unexplained "bubble
budget", and state the step-time target for parity. Cross-checks vllm-gdn-compare above.

### Inputs (measured / GGUF metadata, no new GPU work)

- DGX GB10 (sm_121): LPDDR5x **273 GB/s**, dense NVFP4 MMA peak ~**500 TFLOP/s** (sparse ~1 PFLOP/s).
  Both numbers are shared identically by llama and vLLM (same HW, same weights).
- q36-27b-nvfp4 GGUF (arch qwen35): block_count **64** (full_attention_interval 4 ->
  **16 full-attn + 48 GDN** layers), d_model 5120, FFN 17408, attn 24 heads / 4 kv-heads,
  head_dim 256, ssm conv_kernel 4 / state_size 128 / group_count 16 / inner_size 6144.
  Weight file = **18.804 GB** (NVFP4 + FP8 block-scales + f16 norms/embd), fully GPU-resident.
- Measured llama decode (dense_base.out, -fa -npp128 -ntg128 -npl128, B=128, 128 TG steps):
  T_TG 49.154 s / 128 = **384.0 ms/step** (S_TG 333.3 tok/s; matches STATE "~381 ms").
- vLLM dense ref **391 tok/s @128** => 128/391 = **327.4 ms/step**.

### 1. Bandwidth floor (bytes that MUST cross LPDDR5x per step / 273 GB/s)

| term | bytes/step | basis |
|------|-----------|-------|
| Weights (batched, read ONCE/step, reused across all 128 seqs) | ~18.4 GB | file 18.804 minus ~0.4 GB sparse input-embd lookup; lm_head fully read |
| SSM state R+W (48 GDN layers x 128 seqs) | ~19 GB (bracket 10-38) | ~1.5 MB/layer/seq R+W, kernel-grounded: gated_delta_net 1.47 ms/call -> ~400 MB/call @273 GB/s; theoretical d_inner*d_state f32 doubles it |
| KV cache read (16 attn layers, avg 192 ctx, 128 seqs, f16) | ~1.6 GB | 64 KiB/tok over 16 layers; max-ctx 256 -> 2.1 GB |
| Activation/quantize/gate intermediates R+W | ~3 GB | quantize_mmq_nvfp4 + k_bin_bcast + silu + rms tensors @ batch 128 |
| **TOTAL** | **~42 GB/step** | bracket 32-61 GB |

**Bandwidth floor = 42/273 = ~154 ms/step** (central; bracket ~117-224 ms).
Weight-only HARD sub-floor (unavoidable, both engines pay it): 18.4/273 = **67 ms/step**.

KEY: even at batch 128 the FP4 GEMM is STILL memory-bound, not MMA-bound. AI = 2*128/0.53 B
= ~483 FLOP/byte << GB10 ridge 500e12/273e9 = 1832 FLOP/byte. The 42% `mul_mat_q<NVFP4,m=128>`
GPU time is weight-DRAM streaming, not tensor cores -> first-principles reason P2a (-26% MMA
occupancy) and Lever-2 were FLAT on decode.

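The floor and ridge arithmetic above can be reproduced with a few lines of host code. This is a minimal sketch, not part of the tree: the function and constant names are made up for illustration, and the inputs are simply the figures quoted in this section (42 GB/step, 18.4 GB of weights, 273 GB/s, 500 TFLOP/s, 0.53 bytes per weight).

```cpp
#include <cassert>

// Time (ms) needed to move `gigabytes` GB over a link sustaining `gb_per_s` GB/s.
constexpr double bandwidth_floor_ms(double gigabytes, double gb_per_s) {
    return gigabytes / gb_per_s * 1e3;
}

// Arithmetic intensity of a batched GEMM in FLOP per weight byte:
// every weight does one multiply-add (2 FLOP) per sequence in the batch.
constexpr double gemm_flop_per_byte(int batch, double bytes_per_weight) {
    return 2.0 * batch / bytes_per_weight;
}

// Ridge point: the intensity above which a kernel becomes compute-bound.
constexpr double ridge_flop_per_byte(double peak_flops, double bytes_per_s) {
    return peak_flops / bytes_per_s;
}

// Inputs copied from the section above (hypothetical constant names).
constexpr double kTotalGB        = 42.0;   // central bytes/step estimate
constexpr double kWeightGB       = 18.4;   // weights streamed once per step
constexpr double kLinkGBps       = 273.0;  // LPDDR5x bandwidth
constexpr double kPeakFlops      = 500e12; // dense NVFP4 peak
constexpr double kBytesPerWeight = 0.53;   // bytes per weight used in the text

// Compile-time checks: ~154 ms total floor, ~67 ms weight-only floor,
// and a GEMM intensity that sits below the ridge (memory-bound).
static_assert(bandwidth_floor_ms(kTotalGB, kLinkGBps) > 153.0 &&
              bandwidth_floor_ms(kTotalGB, kLinkGBps) < 155.0, "total floor");
static_assert(bandwidth_floor_ms(kWeightGB, kLinkGBps) > 67.0 &&
              bandwidth_floor_ms(kWeightGB, kLinkGBps) < 68.0, "weight floor");
static_assert(gemm_flop_per_byte(128, kBytesPerWeight) <
              ridge_flop_per_byte(kPeakFlops, kLinkGBps * 1e9), "memory-bound");
```

Swapping in the bracket ends (32 and 61 GB) gives the lower and upper floors; the memory-bound conclusion holds for any batch whose intensity stays under the ridge.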
### 2. Compute floor (FLOPs / ~500 TFLOP/s dense FP4)

| term | FLOPs/step | floor |
|------|-----------|-------|
| FP4 GEMM (all dense matmuls): 2 * ~26e9 params * 128 seqs | 6.66 TFLOP | / 500e12 = **13.3 ms** (6.7 ms @ sparse 1 PFLOP) |
| GDN recurrence (state update + read-out, 48 layers x 128 seqs) | ~0.04 TFLOP | < 0.1 ms (state-bound, not FLOP-bound) |
| **TOTAL** | ~6.7 TFLOP | **~13 ms/step (~4% of the step)** |

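The GEMM row of the table follows directly from parameter count, batch size and peak throughput. A minimal sketch of that arithmetic, with hypothetical helper names and the ~26e9 parameter count taken from the table:

```cpp
#include <cassert>

// FLOPs of one decode step through all dense matmuls, in TFLOP:
// 2 FLOP (multiply + add) per parameter per sequence.
constexpr double gemm_tflop(double params, int batch) {
    return 2.0 * params * batch / 1e12;
}

// Time (ms) to execute `tflop` TFLOP at `peak_tflops` TFLOP/s.
constexpr double compute_floor_ms(double tflop, double peak_tflops) {
    return tflop / peak_tflops * 1e3;
}

// 2 * 26e9 * 128 is about 6.66 TFLOP, which is about 13.3 ms at 500 TFLOP/s.
static_assert(gemm_tflop(26e9, 128) > 6.65 && gemm_tflop(26e9, 128) < 6.67,
              "GEMM FLOPs per step");
static_assert(compute_floor_ms(gemm_tflop(26e9, 128), 500.0) > 13.3 &&
              compute_floor_ms(gemm_tflop(26e9, 128), 500.0) < 13.4,
              "dense compute floor");
```

Passing 1000.0 as the peak reproduces the sparse-rate figure in the same row.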
### 3. Verdict / bubble budget / parity target

```
                    compute floor   bandwidth floor     MEASURED step   x above bw-floor
GB10 dense-FP4      ~13 ms          ~154 ms (117-224)
vLLM dense @128                                         327 ms          ~2.1x (1.5-2.8x)
llama dense @128                                        384 ms          ~2.5x (1.7-3.3x)
```

- **Binding floor = bandwidth (~154 ms central, bracket 117-224), NOT compute (~13 ms).** Compute floor is ~25x
  below the wall -> FP4-MMA throughput is irrelevant; matches P2a/Lever-2 flatness exactly.
- **Both engines run ~2.1-2.5x ABOVE the central bandwidth floor.** vLLM itself reaches only ~40-47%
  LPDDR5x efficiency -> even the reference is LATENCY/occupancy bound, not byte-bound.
  Confirms prior "decode is 2.5x above its bandwidth floor" work.
- **Bubble budget** (wall - bandwidth floor, central 154): vLLM ~**173 ms**, llama ~**230 ms**.
  = kernel-launch latency + occupancy gaps + serial data-dependency stalls.
- **The llama-vs-vLLM gap = 384 - 327 = 57 ms/step (14.8% of the step) is 100% BUBBLE.**
  Both engines share IDENTICAL bandwidth AND compute floors (same 18.8 GB NVFP4 weights, same
  SSM state, same KV, same GB10 273 GB/s + 500 TFLOP/s). Bytes and FLOPs are identical,
  so the entire 57 ms differential lives in critical-path bubble - NOT bandwidth, NOT compute.

**Parity target: 327 ms/step (391 tok/s @128). llama must shave 57 ms/step = 14.8% off 384 ms.**
Neither floor can move (both already shared with vLLM), so the 57 ms can ONLY come from
collapsing critical-path bubbles -> structurally-correct case for Lever 3 (fuse the serial GDN
gating chain) and/or decode CUDA-graph capture, exactly the two wins vllm-gdn-compare found vLLM
already has. P2a/Lever-2 were flat because they freed OVERLAPPED GPU-busy time BELOW the floor.

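The bubble budget and parity target are plain subtractions on the measured step times. A small sketch with hypothetical helper names, using the rounded figures from this section (384 ms, 327 ms, central floor 154 ms):

```cpp
#include <cassert>

// Step time (ms) implied by an aggregate decode rate at a given batch size.
constexpr double step_ms(int batch, double tok_per_s) {
    return batch / tok_per_s * 1e3;
}

// Wall time not explained by the bandwidth floor.
constexpr double bubble_ms(double wall_ms, double floor_ms) {
    return wall_ms - floor_ms;
}

// Share of the slower engine's step that must be removed to reach parity, in percent.
constexpr double gap_percent(double slow_ms, double fast_ms) {
    return (slow_ms - fast_ms) / slow_ms * 100.0;
}

static_assert(bubble_ms(327.0, 154.0) == 173.0, "vLLM bubble budget");
static_assert(bubble_ms(384.0, 154.0) == 230.0, "llama bubble budget");
static_assert(bubble_ms(384.0, 327.0) == 57.0,  "llama-vs-vLLM gap");
```

Because both floors are shared, the 57 ms gap is exactly the difference of the two bubble budgets (230 - 173).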
### Cross-check / sizing for the gap-analysis (timeline) agent

- GPU-busy sum from nsysab_new (ntg24 window, /24 steps): FP4 GEMM ~243 + gated_delta_net ~76 +
  GDN glue (k_bin_bcast mul ~49, silu ~34, concat ~19, gdn_gather ~21, ssm_conv ~12, l2_norm ~6,
  op_add ~10) ~152 + quantize ~62 = **~555 ms GPU-busy vs 384 ms wall** -> sum >> wall by ~1.45x,
  so heavy overlap is real and GPU-busy% buckets ARE misleading. Do NOT sum kernel times; the
  wall is the critical path.
- Concrete budget: if the inter-kernel IDLE gaps + non-overlapped launch latency along the serial
  GDN chain (ssm_conv -> gated_delta_net -> gating elementwise -> gdn_gather, x48 layers x N steps)
  sum to **>= 57 ms/step**, Lever 3 is justified AND sized. If those critical-path gaps total
  < 57 ms, parity is NOT reachable via GDN-gate fusion alone and the gap is elsewhere (GDN core
  kernel slower than vLLM fused_recurrent, or scheduler/H2D).
- Structural corroboration (agrees with vllm-gdn-compare): vLLM runs the GDN region as 2 fused
  Triton kernels under a full CUDA graph; llama splits it into ssm_conv + gated_delta_net +
  gdn_gather + ~6 serially data-dependent elementwise gate kernels. ~384 host-launched nodes/step
  on a chain that cannot overlap is precisely the mechanism that produces llama's extra ~57 ms.

Floors are engine-independent lower bounds; the timeline agent owns proving the 57 ms is
recoverable on the critical path. Roofline says: target 327 ms, shave 57 ms, and it can ONLY
come from bubble (not bytes, not FLOPs).

Assisted-by: Claude:opus-4.8 [Claude Code]

## lever3-design (READ-ONLY, no GPU) - concrete fusion of the serial GDN gate chain into the recurrence kernel

### What actually feeds/consumes the recurrence kernel today (qwen35 decode, fused_gdn_ar)

Traced in `src/models/qwen35.cpp::build_layer_attn_linear` ->
`src/models/delta-net-base.cpp::build_recurrent_attn` (fused !keep branch) ->
`ggml/src/ggml-cuda/gated_delta_net.cu`. The model is GDA (g->ne[0]==1, scalar
gate per head; kda=false in the kernel). S_v = ssm_d_state = 128, so the kernel
runs the `<128>` template: warp_size==S_v==128, num_warps=4, rows_per_lane==1,
grid (H, n_seqs, S_v/4=32 z-tiles). Each warp owns one output column `col`; the
128 lanes hold the full head-vector (one element per lane).

Serial pre-GDN gate chain (each a standalone host-launched ggml node, all on the
critical path between the in-proj GEMMs and the recurrence):

1. `beta = ggml_sigmoid(ssm_beta @ cur)` -> kernel reads `beta_val = *beta_t`
2. `alpha = ssm_alpha @ cur`
3. `ggml_add(alpha, ssm_dt)` (k_bin_bcast op_add)
4. `ggml_softplus(...)` (unary_op<softplus>, 1248 inst)
5. `ggml_mul(softplus, ssm_a)` (k_bin_bcast op_mul; ssm_a = -exp(A_log), baked) -> g; kernel does `expf(g_t)`
6. `ssm_conv` then `ggml_silu` (conv path; may already hit the upstream SSM_CONV+SILU fuse) -> v_conv, and the q/k slices
7. `ggml_l2_norm(q_conv)`, `ggml_l2_norm(k_conv)` (l2_norm_f32<32>, 2496 inst = 1248x2) -> kernel reads q_reg/k_reg

Post-GDN gate (consumes kernel output):

8. `build_norm_gated(output, ssm_norm, z)` = rms_norm(output)*ssm_norm (RMS_NORM+MUL fused) then `silu(z)*.` (unary_gated_op<silu>, the 5.9% bucket)

### The fusion: fold steps 1,3,4,5,7 INTO gated_delta_net_cuda (a "fused-gate" mode)

These five steps are exactly the per-head scalar gates (sigmoid beta; softplus+dt+ssm_a
-> g) and the per-head-vector L2 norms of q/k - and the kernel ALREADY loads every
operand it needs:

- It reads `beta_val` (scalar) -> pass RAW beta, do `beta_val = 1.f/(1.f+expf(-raw))` in-kernel. Removes node 1.
- It reads `g_t` (scalar, GDA) and does `expf(g_t)` -> pass RAW alpha + per-head `ssm_dt[h]` + per-head `ssm_a[h]`, compute `g = ssm_a[h]*op_softplus(alpha + ssm_dt[h])` in-kernel, keep the existing `expf(g)`. `op_softplus(x) = (x>20)?x:logf(1+expf(x))` (copy `ggml_compute_softplus_f32` verbatim). Removes nodes 3,4,5.
- It loads the full q/k head-vector into `q_reg[r]`/`k_reg[r]` (one element per lane at S_v==128). L2-normalize in registers: `float qss = warp_reduce_sum<128>(q_reg[0]*q_reg[0]); q_reg[0] *= rsqrtf(qss + eps* ... )` matching the l2_norm formula, same for k. Each warp redundantly recomputes the (identical) norm for its column - cheap, no shared mem, no extra launch. Removes nodes 7 (x2). `eps` (= f_norm_rms_eps) passed as a kernel float param.

That collapses the pre-GDN serial chain to just: in-proj GEMMs -> build_conv_state(concat) -> ssm_conv(+silu) -> [single fused gated_delta_net kernel]. 6 gate kernels (steps 1, 3, 4, 5, plus the two l2_norm launches of step 7) removed per SSM layer per decode step.

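The scalar part of the fold (steps 1, 3, 4, 5) is small enough to write out as a host-side reference. The sketch below is illustrative only: `GateScalars` and `fused_gate_ref` are hypothetical names, not functions in the tree, and the code is plain C++ rather than the CUDA kernel. It uses the sigmoid and softplus formulas quoted above.

```cpp
#include <cassert>
#include <cmath>

// Softplus with the large-input shortcut quoted in the text:
// (x > 20) ? x : log(1 + exp(x)).
inline float softplus_ref(float x) {
    return (x > 20.0f) ? x : std::log(1.0f + std::exp(x));
}

// Logistic sigmoid, 1 / (1 + exp(-x)).
inline float sigmoid_ref(float x) {
    return 1.0f / (1.0f + std::exp(-x));
}

// Per-head scalars the recurrence consumes (hypothetical struct).
struct GateScalars {
    float beta;   // sigmoid of the raw beta projection (step 1)
    float g;      // ssm_a[h] * softplus(alpha + ssm_dt[h]) (steps 3-5)
    float decay;  // exp(g), the factor the kernel already computes
};

// Reference for what a fused-gate mode would compute per (head, seq)
// from the raw projections and the per-head constants.
inline GateScalars fused_gate_ref(float beta_raw, float alpha_raw,
                                  float ssm_dt_h, float ssm_a_h) {
    GateScalars out;
    out.beta  = sigmoid_ref(beta_raw);
    out.g     = ssm_a_h * softplus_ref(alpha_raw + ssm_dt_h);
    out.decay = std::exp(out.g);
    return out;
}
```

With `ssm_a_h` negative (the text notes ssm_a = -exp(A_log)), `g` is non-positive and the decay stays in (0, 1]. A reference like this could serve as the comparison target for a backend-ops style test of the fused path.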
### Why the OUTPUT gate (step 8) is NOT folded into this kernel

The output gated-rmsnorm reduces over the full head_v_dim (S_v=128) per (head,seq).
In this kernel those 128 elements are produced by 128 DIFFERENT (warp x z-tile)
blocks (4 warps x 32 z-tiles), so an in-kernel head-wide reduction would need a
grid-global sync - not feasible without a grid redesign. Leave step 8 as the
existing RMS_NORM+MUL + unary_gated<silu> fusion (already 2 launches, not in scope).
The conv-silu (step 6) is a convolution, structurally separate; rely on the
existing upstream SSM_CONV(+ADD)+SILU fuse rather than pulling it into the
recurrence kernel.

### Implementation scope

- `ggml/include/ggml.h`: new builder `ggml_gated_delta_net_inplace_ids_fused_gate(ctx, q_raw, k_raw, v, alpha_raw, beta_raw, cache4d, state_dst, ids, ssm_a, ssm_dt, rs_head, eps)` (or an op-param flag GDN_FUSE_GATE on the existing builder + 2 extra srcs). src budget: current op uses src[0..7]; add ssm_a -> src[8], ssm_dt -> src[9]. GGML_MAX_SRC==10, so it fits EXACTLY (zero headroom - note for review).
- `ggml/src/ggml.c`: builder + a new op-param i32 flag (e.g. params[2]=fuse_gate) + f32 param for eps; assert shapes (ssm_a/ssm_dt are [num_v_heads]).
- `ggml/src/ggml-cuda/gated_delta_net.cu`: in `gated_delta_net_cuda`, gate the in-kernel sigmoid/softplus-gate/l2norm behind a `bool FUSE_GATE` template param (4th template bool, keeps the non-fused path byte-identical and avoids register bloat when off). Read ssm_a[h_idx], ssm_dt[h_idx]; compute g per head; sigmoid raw beta; warp-reduce q_reg/k_reg sumsq -> rsqrtf scale. Plumb the 2 new src pointers + eps through `launch_gated_delta_net` and `ggml_cuda_op_gated_delta_net` (read src[8],src[9], op_param eps/flag). The `gdn_gather_nonident` path is unaffected (it gathers state, not q/k/g/beta).
- `ggml/src/ggml-cpu/ops.cpp`: mirror in `ggml_compute_forward_gated_delta_net_one_chunk` (host sigmoid/softplus/l2norm before the per-token math) for CPU parity / test-backend-ops.
- `src/models/delta-net-base.cpp::build_recurrent_attn` (the fused !keep + ids branch, and the inplace non-ids branch): call the fused-gate builder, pass raw alpha/beta/q/k + ssm_a + ssm_dt + eps.
- `src/models/qwen35.cpp` / `qwen35moe.cpp` / `qwen3next.cpp` `build_layer_attn_linear`: when the fuse flag is on, DROP `ggml_sigmoid(beta)`, `ggml_add(alpha,dt)`, `ggml_softplus`, `ggml_mul(.,ssm_a)`, and the two `ggml_l2_norm` nodes; hand the raw tensors + `model.layers[il].ssm_a`, `ssm_dt` to build_recurrent_attn. The conv-silu and z/output-gate path are unchanged.
- Guard the whole thing behind `cparams.fused_gdn_gate` / env `LLAMA_FUSE_GDN_GATE` (default OFF) so it A/Bs against the clean Lever-1 build exactly like P2a/Lever-2, and only the recurrent (GDA) qwen35 family path is touched.

### Numeric considerations / bit-exactness

- sigmoid(beta), softplus(alpha+dt), and the `g = ssm_a*softplus` mul/add are pointwise fp32 with the SAME formula/order as the standalone ggml ops -> these can be **bit-exact** (no reduction). softplus must copy `(x>20)?x:logf(1+expf(x))` exactly.
- q/k l2norm is the ONE op with a reduction: the standalone `l2_norm_f32<32>` does its own warp/block reduction; the in-kernel `warp_reduce_sum<128>` tree may differ in the last ULP, and the eps placement (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) must match the ggml l2_norm exactly. Expect **near-bit-exact, not guaranteed byte-identical** greedy output. So unlike Levers 1/2, gate this on a **PPL/KL tolerance** (KL logit delta < ~1e-3, PPL delta within noise) rather than md5 identity. If byte-identity is required, exclude l2norm from the fold (keep nodes 7) and fuse only sigmoid/softplus/gate - but that drops the value to ~0.3% and is probably not worth it.

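Both l2norm hazards named above (reduction order and eps placement) can be demonstrated on the host. The following is a rough sketch with hypothetical names; it does not reproduce the actual ggml or CUDA reduction, it only contrasts a left-to-right sum with a pairwise tree and the two eps placements quoted in the text.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Sum of squares accumulated left to right.
inline float sumsq_sequential(const float* x, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += x[i] * x[i];
    return acc;
}

// Sum of squares via a pairwise tree (halving strides), the general shape of a
// butterfly reduction. n must be a power of two and at most 128.
inline float sumsq_tree(const float* x, std::size_t n) {
    assert(n > 0 && n <= 128 && (n & (n - 1)) == 0);
    float buf[128];
    for (std::size_t i = 0; i < n; ++i) buf[i] = x[i] * x[i];
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i) buf[i] += buf[i + stride];
    return buf[0];
}

// The two eps placements named in the text.
enum class EpsPlacement { InsideRsqrt, ClampNorm };

// Scale factor applied to every element of the vector.
inline float l2_scale(float sumsq, float eps, EpsPlacement p) {
    if (p == EpsPlacement::InsideRsqrt) {
        return 1.0f / std::sqrt(sumsq + eps);          // x * rsqrt(sumsq + eps)
    }
    return 1.0f / std::fmax(std::sqrt(sumsq), eps);    // x / max(sqrt(sumsq), eps)
}
```

For well-scaled inputs the two sums agree to within rounding and the two placements give almost the same scale; for a near-zero vector the placements diverge by orders of magnitude, which is why the fused kernel has to copy whichever form the standalone op uses.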
### Estimated kernels-removed-per-layer and the honest ceiling

- Removed per SSM decode layer-step: sigmoid(beta) + add(dt) + softplus + mul(ssm_a) + l2norm(q) + l2norm(k) = **6 host-launched kernels -> 0**, collapsing 7 nodes (incl. recurrence) to 1. Across 48 SSM layers = **~288 launches/step removed** (matches the instance deltas: l2_norm 2496, softplus 1248, sigmoid 1248, plus the alpha-add/ssm_a-mul share of op_add/op_mul).
- GPU-BUSY ceiling of the removed ops is small: l2_norm 1.0% + softplus ~0% + sigmoid 0.3% + (dt add + ssm_a mul share of op_add 1.7% / op_mul 8.5%). The point of Lever 3 is NOT the freed busy-time (P2a/Lever-2 proved freeing overlapped busy-time is flat) - it is removing ~288 LAUNCH BUBBLES/step that sit on the serial conv->gate->recurrence dependency where the GPU is otherwise idle. The win is wall-clock only if those specific bubbles are on the critical path.

### RISK (must be settled before building)

1. **Same trap as P2a/Lever-2 if the bubbles overlap.** If the scheduler already
   overlaps these pre-GDN gate kernels with an adjacent layer's 42% mul_mat_q GEMM,
   Lever 3 is FLAT. **Precondition: the timeline gap analysis must show idle GPU
   between ssm_conv -> (sigmoid/softplus/l2norm) -> gated_delta_net per layer** at
   batch=128 BEFORE building. If the trace shows the gate ops back-to-back with no
   gap (overlapped), do NOT build op-fusion; go to lever (2) below.
2. **The bigger bubbles may be elsewhere on the chain.** The large buckets are op_mul
   8.5% and unary_gated<silu> 5.9% - much of which is the POST-GDN output gate and
   FFN, which this fusion does NOT touch. If the gap analysis pins the dominant idle
   to the post-GDN region or to inter-layer launch latency generally, the
   higher-leverage Lever 3 is **decode CUDA-graph capture** (removes host launch
   latency for ALL ~384 nodes/step at once, exactly what vLLM does), not per-op
   fusion. CUDA-graph is the strictly larger hammer here; op-fusion only helps the
   pre-GDN slice. Recommend measuring the per-sub-op gap first and preferring the
   CUDA-graph lever if the bubbles are spread across the step rather than concentrated
   in the pre-GDN gate slice.
3. **src-slot exhaustion** (src[8],src[9] use the last 2 of GGML_MAX_SRC=10) - any
   later op needing more srcs on this node has zero headroom; flag for review.

## cudagraph-coverage (READ-ONLY, no GPU) - does the CUDA graph cover the GDN serial chain at B=128?

### Verdict: YES, the graph covers GDN at batch=128 (dense model). No GDN op forces graph-disable or per-step re-instantiation.

Source: `ggml/src/ggml-cuda/ggml-cuda.cu` (graph state machine), `gated_delta_net.cu`
(fused op), `src/models/delta-net-base.cpp` (graph build), `src/llama-memory-recurrent.cpp`
(recurrent head), all on dev tree `~/llama-paged-dev` (HEAD df1cc97, Lever-1). Cross-checked
against the committed A2_CUDAGRAPH_DECODE.md + DECODE_PARITY_EXPLORE.md measurements.

### How graph-disable / re-instantiation are decided (this fork's state machine)

- `ggml_cuda_graph_check_compability` (ggml-cuda.cu:3251) disables the graph for ONLY two
  reasons: (a) a split-buffer src, (b) `GGML_OP_MUL_MAT_ID` with non-quantized weights OR
  `node->ne[2] > get_mmvq_mmid_max(...)` [TAG_MUL_MAT_ID_CUDA_GRAPHS]. GATED_DELTA_NET,
  SSM_CONV, SSM_SCAN, GET_ROWS, CONCAT, the gating elementwise ops are NOT in the disable
  list. So no GDN op forces graph-disable.
- `ggml_cuda_graph_update_required` (3297) memcmps, per node, the full `ggml_tensor` struct
  (incl. `op_params` and `data`) + each src's `data` ptr / `ne` / `nb`. ANY delta -> the
  warmup state machine (ggml_backend_cuda_graph_compute:4464) resets `warmup_complete` and the
  WHOLE graph (one key = `cgraph->nodes[0]`) runs eager that step until stable again. Buffer
  CONTENTS are NOT compared - a contents-only change (e.g. ids values) is graph-safe.

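The distinction between node PROPERTIES (compared) and buffer CONTENTS (not compared) can be shown with a toy model. This sketch is hypothetical: `NodeProps` is a trimmed stand-in for the fields named above, not the real `ggml_tensor`, and it compares field by field instead of using memcmp so that struct padding cannot affect the result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Trimmed, hypothetical stand-in for the per-node properties that are compared.
struct NodeProps {
    const void*  data;          // buffer address (the pointer, not the contents)
    std::int64_t ne[4];         // shape
    std::size_t  nb[4];         // strides in bytes
    std::int32_t op_params[4];  // e.g. an rs_head-like integer parameter
};

// True when two snapshots of one node describe the same launch.
inline bool same_props(const NodeProps& a, const NodeProps& b) {
    if (a.data != b.data) return false;
    for (int i = 0; i < 4; ++i) {
        if (a.ne[i] != b.ne[i] || a.nb[i] != b.nb[i] ||
            a.op_params[i] != b.op_params[i]) {
            return false;
        }
    }
    return true;
}

// True when any node changed between the previous and the current step,
// i.e. the captured graph can no longer be replayed as-is.
inline bool update_required(const NodeProps* prev, const NodeProps* cur,
                            std::size_t n_nodes) {
    for (std::size_t i = 0; i < n_nodes; ++i) {
        if (!same_props(prev[i], cur[i])) return true;
    }
    return false;
}
```

Overwriting the values stored in a buffer leaves `data`, `ne`, `nb` and `op_params` untouched, so a check of this shape reports no update, which is the property the `ids` tensor relies on.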
### Why the GDN region's properties are STABLE across steady decode steps

The fused decode path is `ggml_gated_delta_net_inplace_ids` (delta-net-base.cpp:558-560):

```
state_dst = ggml_view_2d(ctx, ssm_states_all, n_embd_s, n_seqs, nb1,
                         kv_head * n_embd_s * elsize);   // offset = kv_head
ggml_gated_delta_net_inplace_ids(..., cache4d, state_dst, ids, /*rs_head=*/(int)kv_head);
```

Both the `state_dst` view byte-offset and the `rs_head` op_param (read back as
`ggml_get_op_params_i32(dst,1)` in gated_delta_net.cu:330) derive from
`kv_head = mctx_cur->get_head()`. In `llama_memory_recurrent::find_slot`
(llama-memory-recurrent.cpp:610-689) the n_seqs used cells are SWAPPED into the contiguous
range `[min .. min+n_seqs-1]` and `head = min`. The recurrent cache does NOT grow per token
(one state cell per sequence, unlike the KV cache). For a steady 128-seq continuous batch the
same sequences own the same tails every step, so `min`/`head` are constant (=0) -> state_dst
offset constant, rs_head op_param constant. The GDN inputs (q,k,v,g,b, cache4d, ids) are
fixed-shape (n_seqs=128, n_rs slots), so ne/nb are stable, and ggml-alloc hands out the same
compute-buffer offsets each same-topology ubatch -> data ptrs stable. The `ids` (s_copy)
tensor's CONTENTS change per step but its address/ne/nb do not -> graph-safe.

### The fused GDN op is capture-safe (no host-sync, no capture-time cudaMalloc)

`gated_delta_net.cu`: the op launches `gdn_gather_nonident_kernel` + `gated_delta_net_cuda`
on `ctx.stream()` with NO `cudaStreamSynchronize` / host `cudaMemcpy` / `cudaMalloc`. The
gather scratch is `ggml_cuda_pool_alloc` (VMM pool, served from reserved memory after warmup,
no real cudaMalloc during capture). `gdn_gather_nonident` early-returns for identity sequences
(`ids[s]==rs_head+s`), which is the steady-decode case, so its 3.7% is a
launched-but-mostly-noop kernel - still captured into the graph like any other. Capture succeeds
(the build runs, graphs engage), confirming none of these break stream capture.

### The only re-instantiation is NOT GDN-driven

A2 already measured the re-warm cadence: the graph re-instantiates every ~256 tokens because
the FULL-ATTENTION block-table input `idx` has `ne[0]=GGML_PAD(n_kv,256)` (and kq_mask in
lockstep) - those step at 256-token boundaries (paged-attn.cpp:199-213). ~97% of decode steps
replay the captured graph. This is a full-attention-layer input, not a GDN op. (The unpadded
`LLAMA_KV_PAGED_GATHER` fallback grows `ne[0]` every step and runs pure-eager, but that is not
the default decode path and is not the GDN/SSM path.)

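The 256-token cadence follows from the padding arithmetic alone. A minimal sketch, assuming the usual power-of-two round-up used by ggml's `GGML_PAD` macro; `pad_up` and `shape_changes` are hypothetical helper names, and the warm-up steps needed after each change are not modelled here.

```cpp
#include <cassert>
#include <cstdint>

// Round x up to a multiple of n, where n is a power of two
// (same arithmetic as the GGML_PAD macro).
constexpr std::int64_t pad_up(std::int64_t x, std::int64_t n) {
    return (x + n - 1) & ~(n - 1);
}

// Number of decode steps in (n_kv_first, n_kv_last] whose padded length differs
// from the previous step's, i.e. steps where ne[0] of the padded input changes.
inline int shape_changes(std::int64_t n_kv_first, std::int64_t n_kv_last,
                         std::int64_t pad) {
    int changes = 0;
    for (std::int64_t n = n_kv_first + 1; n <= n_kv_last; ++n) {
        if (pad_up(n, pad) != pad_up(n - 1, pad)) ++changes;
    }
    return changes;
}

static_assert(pad_up(1, 256) == 256,   "first block");
static_assert(pad_up(256, 256) == 256, "block boundary is inclusive");
static_assert(pad_up(257, 256) == 512, "next block starts at 257");
```

Growing n_kv from 1 to 1024 changes the padded length only three times (at 257, 513 and 769); an unpadded input would change on every one of those steps.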
### Reconciliation with the "~40% util / 60% idle bubbles" premise (it is refuted for GDN)

The committed nsys sweeps (A2_CUDAGRAPH_DECODE.md, DECODE_PARITY_EXPLORE.md) show the steady
decode is ~99.4-99.5% GPU-BUSY with graphs ON (measured with `--cuda-graph-trace=node`; a
graphs-ON trace WITHOUT that flag under-counts GPU rows and falsely reports idle - Trap #2).
Total exposed idle is ~0.65% of the step; the within-step launch fraction graphs remove is
0.34% (0.37%->0.11%) and is ALREADY collapsed - the GDN sub-op launch gaps are inside the
captured region. The "40% utilization" in the STATE is BANDWIDTH-roofline util, not idle SMs:
decode moves ~55.5 GB/step at 2.48x the 273 GB/s floor, SSM state r+w = 66% of step bytes. The
GDN recurrence is memory-bandwidth-bound at low occupancy (~12-16%), not launch-gap-bound. So
"60% idle bubbles on the serial GDN chain" is not supported by the traces; the gap to vLLM is
SSM-state memory traffic, consistent with P2a/Lever-2 being flat (freeing GPU-busy time, not
wall-clock).

### Graph-safe lever for GDN: none new

- GDN is already graph-covered; there is no "make the GDN ops graph-safe" lever to build - they
  are already safe and captured.
- The only genuinely graph-NON-covered idle is the BETWEEN-step host gap (~2 ms/step, ~0.4%):
  ggml rebuilds/reallocs the cgraph each step with a new `cgraph->uid`, so the uid fast-path in
  ggml_cuda_graph_update_required never fires and the host re-dispatches ~3100 launches on the
  Grace cores between graph launches (vLLM builds its graph once + persistent device metadata).
  A persistent/reused cgraph across decode steps would let the uid fast-path fire and shrink the
  host gap - but at 0.4% of the step it is second-order to the SSM bandwidth floor.
- CAVEAT (MoE, qwen35moe): MUL_MAT_ID at B=128 can trip [TAG_MUL_MAT_ID_CUDA_GRAPHS]
  (`ne[2] > mmvq_mmid_max`), disabling the WHOLE MoE-decode graph (GDN included) into eager.
  That is a MUL_MAT_ID disable, not a GDN break, and does not touch the dense 335 tok/s headline;
  worth a separate confirm for the MoE model.
