docs(paged): profile-both-engines post-SSM ground-truth decode decomposition

Fresh post-SSM nsys of llama (build-cuda-base, patch 0019) AND vLLM 0.23.0 at
npl128 decode. Reproduces the 391 reference (vLLM 394 t/s eager / 420 graphs,
graphs +6% only) and confirms llama 245 t/s. Both ~98% GPU-busy; the gap is
GPU kernel-time, not idle/host/graphs. GDN compute comparable (llama 4.03 vs
vLLM 3.62 ms/call, +11%). bytes/step: llama not higher (131 vs 85 MB memcpy;
SSM-fix 18GB/step DtoD removal confirmed in-trace). Single biggest llama-specific
overage = FP4 matmul path 236 vs 117 ms/step (+119 ms = 64% of the gap),
dominated by mul_mat_vec_q (FP4 GEMV at batch 128, 132 ms/step, 26%, one per
GDN layer). Track B optimized the wrong FP4 kernel (mul_mat_q, not the GEMV).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-25 08:56:37 +00:00
parent 6f0792c3be
commit ee13fd18ce


@@ -0,0 +1,578 @@
# Decode parity exploration (post-SSM-fix) - per-agent findings
Post-SSM-fix decode (patches 0018 in-place state write-back + 0019 fused gather):
dense q36-27b-nvfp4 decode_agg = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact.
The remaining +54% to parity is the question each section below probes. All numbers
DGX GB10 (sm_121), fusion OFF baseline, `decode_agg` = `S_TG t/s`.
---
## Section: per-token-latency (critical path / host-loop) - READ-ONLY
**Verdict: the per-step critical path and host loop are NOT the residual lever.
Post-SSM the GPU is still ~99% busy at npl128; the entire exposed-idle budget is
~0.65% of the step (~2.4 ms), of which graphs already remove the within-step half
(0.34%) and the between-step host gap is ~2 ms/step (~0.4% post-SSM). The 64-layer
sequential chain does NOT under-fill the GPU at batch 128 - every kernel's grid
saturates the SMs on its own. The +54% to parity is GPU kernel work (FP4 GEMM
efficiency + LPDDR5x weight bandwidth), not serialization or host overhead.**
### 1. Measured exposed-idle structure (a2_nsys pre-SSM rep, read-only sqlite sweep)
`paged_off_npl128.sqlite`, steady window 40-97% of trace (14.78 s, ~16.5 decode
steps at the pre-SSM ~896 ms/step). Overlap-correct interval-union sweep:
| activity set | busy % | exposed idle |
|-------------------------|---------|--------------|
| kernels only | 80.25% | 19.74% |
| kernels + memcpy (all) | 99.35% | **0.65%** |
- The 19.74% kernels-only gap = **841 big gaps (median 3.35 ms, ~51/step)** that are
filled by D2D memcpy. These ARE the per-layer gated-DeltaNet recurrent-state copies
(the `gated_delta_net -> ggml_cpy(state->cache) -> next layer reads state` chain).
They were a real critical-path serialization, and **patches 0018/0019 removed exactly
these** (D2D bucket 18.9% -> 0.23%; get_rows gather 18.8% -> 0.7%). Decode rose
+37.8% (186 -> 256 t/s), ~matching the work removed -> the kernels reflowed
back-to-back, so post-SSM these big gaps are CLOSED, not re-exposed (inferred from
the throughput scaling; the post-SSM nsys was not re-profiled by this read-only agent).
- The TRUE exposed idle (kernel+memcpy union) is **0.65%**: 18 host gaps >=0.5 ms,
**median 2.06 ms, max 2.85 ms, ~1.1/step**. This is the single between-step host gap
(sample-128 + `update_slots` + next-batch build) that does NOT overlap GPU compute.
- Within-step launch gaps: 24,190 micro-gaps, median 2.14 us, summing to 50.6 ms =
**0.34%** of the window - the pure launch overhead that CUDA graphs collapse
(measured 0.37% -> 0.11% in A2_CUDAGRAPH_DECODE; graphs already engage on the
default paged decode with a 256-token reset cadence).
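A minimal sketch of the overlap-correct interval-union sweep behind the busy% table, assuming the kernel and memcpy rows have already been pulled out of the sqlite as `(start_ns, end_ns)` tuples. The helper names `union_busy` and `busy_fraction` and the toy intervals are illustrative, not the agent's actual script.

```python
def union_busy(intervals):
    """Total time covered by at least one interval; overlaps are counted once."""
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the running interval: flush it and start a new one.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps (or touches) the running interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy


def busy_fraction(intervals, window_start, window_end):
    """Busy share of a steady window, with intervals clipped to the window."""
    clipped = [(max(s, window_start), min(e, window_end))
               for s, e in intervals
               if e > window_start and s < window_end]
    return union_busy(clipped) / (window_end - window_start)


# Toy data: a memcpy fills the gap between two kernels, as the D2D copies do.
kernels = [(0, 40), (60, 100)]
memcpys = [(35, 62)]
print(busy_fraction(kernels, 0, 100))            # kernels only
print(busy_fraction(kernels + memcpys, 0, 100))  # kernels + memcpy union
```

Summing kernel and memcpy durations separately would double-count the overlapped 35-40 and 60-62 spans; the union does not, which is why the two table rows can be compared directly.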
### 2. Post-SSM scaling of the FIXED host gap
The ~2 ms/step between-step host gap is FIXED work (independent of GPU kernel time).
As decode accelerated it grew only as a fraction of a shrinking step:
| build | step ms @npl128 | host gap | host gap % of step |
|---------------|-----------------|----------|--------------------|
| pre-SSM (146) | ~877 | ~2 ms | 0.24% |
| post-SSM (256)| ~499 | ~2 ms | **~0.40%** |
| vLLM (391) | ~328 | (n/a) | (would be ~0.6%) |
Even fully removing it (perfect overlap) buys ~0.4%. It is a second-order floor, not
the lever - it only becomes material once the kernels are fast enough to drop GPU-busy
below the host time, which is not the case at 65% of parity.
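The table arithmetic can be sketched as follows, assuming a step is 128 tokens (one per sequence) and taking the ~2 ms gap as fixed. `step_ms` and `host_gap_share` are hypothetical helper names; results land within rounding of the table because the gap itself is only known to ~0.1 ms.

```python
def step_ms(tokens_per_s, n_seqs=128):
    """Wall time of one decode step that emits one token per sequence."""
    return 1000.0 * n_seqs / tokens_per_s


def host_gap_share(tokens_per_s, gap_ms=2.0, n_seqs=128):
    """Fraction of the step spent in the fixed between-step host gap."""
    return gap_ms / step_ms(tokens_per_s, n_seqs)


for label, tps in [("pre-SSM", 146.23), ("post-SSM", 256.6), ("vLLM", 391.0)]:
    print(f"{label}: {step_ms(tps):.0f} ms/step, "
          f"host gap {100 * host_gap_share(tps):.2f}% of step")
```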
### 3. The 64-layer chain does NOT under-fill the GPU at batch 128
The decode is an intrinsically sequential depth-64 chain (autoregressive: layer N
needs layer N-1; cannot be parallelized across layers). The question is whether each
individual kernel fills the SMs at batch 128. It does:
- **GDN kernel** (`gated_delta_net.cu`): launch grid `dim3(H, n_seqs, ceil(S_v/4))`
= `48 x 128 x 32 = 196,608 blocks` (dense, H=48 value heads, S_v=128). Block
`(warp_size, 4, 1)`. Massively oversubscribes the GB10 SMs. Each warp loads its
state shard into registers once and runs a single `n_tokens==1` iteration - O(1) in
context (confirmed flat across 4x ctx in GDN_DECODE_VERIFY).
- **FP4 GEMM** (`mul_mat_q`, mmq_x=128): M=128 token tile, well into the M-batched
regime, full SM occupancy (and Track B P2a already showed it goes 2 CTA/SM).
- The 99.35% kernel+memcpy busy reading IS the direct proof there is no under-fill at
npl128: if the chain under-filled, busy% would be well below 99%.
Under-fill only appears at LOW batch (npl32/npl4), where it manifests as the
weight-bandwidth/GEMV regime (npl32 = 170 t/s vs npl128 = 256): fewer tokens amortize
the same per-step weight read, NOT idle SMs. That is a bandwidth floor, not a
host/scheduler problem.
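The grid arithmetic for the GDN launch can be checked with a short sketch. The grid shape and the 2 blocks/SM figure come from the text; the SM count of 48 is an assumption not stated in this document, so treat the wave count as illustrative only.

```python
import math


def gdn_grid(n_heads, n_seqs, s_v, cols_per_block=4):
    """Launch grid dim3(H, n_seqs, ceil(S_v / 4)) as described above."""
    return (n_heads, n_seqs, math.ceil(s_v / cols_per_block))


grid = gdn_grid(n_heads=48, n_seqs=128, s_v=128)
blocks = grid[0] * grid[1] * grid[2]

SM_COUNT = 48        # ASSUMPTION: GB10 SM count, not taken from this document
BLOCKS_PER_SM = 2    # from the launch_bounds(128, 2) discussion

waves = blocks / (SM_COUNT * BLOCKS_PER_SM)
print(grid, blocks, f"~{waves:.0f} waves of resident blocks")
```

Any wave count well above 1 supports the same conclusion: the grid oversubscribes the SMs at batch 128.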
### 4. What the host actually does per step (eager rep runtime API)
Steady-window `CUPTI_ACTIVITY_KIND_RUNTIME` totals (host-thread wall, overlaps GPU):
| API | n | total | avg |
|---------------------------|-------|---------|---------|
| cudaStreamSynchronize | 1723 | 7775 ms | 4513 us |
| cudaLaunchKernelExC | 30983 | 4045 ms | 131 us |
| cudaLaunchKernel | 20385 | 2694 ms | 132 us |
| cudaMemcpyAsync | 2085 | 96 ms | 46 us |
~104 stream-syncs/step and ~3100 kernel launches/step in eager mode (collapsed by
graphs to ~900 launches/step). The 7.8 s of sync is the host BLOCKING on the busy
GPU (it overlaps GPU compute, it is NOT exposed idle) - the GPU stays 99.4% busy. The
sampled-token path is `cudaMemcpyAsync` (96 ms total, negligible, non-blocking). The
only NON-overlapped residue is the ~2 ms/step between-step gap in section 1.
### 5. vLLM host-loop comparison (per VLLM_DECODE_GROUNDING.md)
vLLM's eager decode is host-cheap BY CONSTRUCTION and hides the host fully behind the
async CUDA stream WITHOUT pipelined scheduling (`async_scheduling` was OFF; it won the
2.4x with synchronous scheduling): persistent pre-allocated input buffers updated by
vectorized numpy (no per-token Python), attention metadata `build()` once per step
reused across all layers, no GPU->CPU sync in the hot path, sampled-token D2H
non-blocking + event-gated, and a fixed small launch sequence (~2 ops/Linear). The
next-step host prep overlaps the current-step GPU compute on the async stream. The key
asymmetry vs llama: vLLM builds its graph ONCE and reuses persistent device
KV/block metadata; ggml rebuilds/reallocates the cgraph each decode step (new
`cgraph->uid`) and re-dispatches ~3100 launches from the loop on the weak Grace cores.
But this asymmetry is hidden under GPU compute on BOTH sides at npl128: llama's host
loop is a 0.4% exposed gap, not a 2x lever. vLLM's host cheapness is why ITS step is
328 ms host-free, but llama's 499 ms is also ~99% GPU - the 171 ms difference is GPU
kernel time (FP4 GEMM), not host.
### 6. Is any host/serialization lever CUDA-graph or scheduler addressable?
- **Within-step launch idle (0.34%)**: CUDA-graph addressable, ALREADY captured by
default (0.37 -> 0.11%). Worth ~0% of decode_agg (measured +0.1-0.8%, noise).
Nothing left to win here.
- **Between-step host gap (~2 ms, ~0.4%)**: NOT removed by a graph (the graph replays
the forward; the host still samples + runs `update_slots` + rebuilds the batch
between replays). It is SCHEDULER addressable - overlap step N+1's host prep with
step N's GPU compute, mirroring vLLM's persistent-buffer + build-once-reuse +
non-blocking-D2H pattern (and ideally reuse the ggml cgraph across steps instead of
rebuilding it every ubatch). But the ceiling is ~0.4% of the step, so it is a
cleanup, not a parity lever.
- **The +54% to parity is none of the above.** It is GPU kernel work: post-SSM the FP4
GEMM family is ~48% of decode (the dominant residual), GDN recurrence ~22.5%, and the
decode is weight-bandwidth/latency-bound on LPDDR5x (Track B P2a: a -24.7% FP4-GEMM
kernel left decode_agg FLAT, the freed compute became idle gaps -> decode is not
GEMM-compute-bound but bandwidth/latency-bound). The lever lives in cutting DRAM
traffic per step (fused act-quant to drop the separate `quantize_mmq` pass, native
FP4-MMA, and/or NVFP4-dense weight quant), NOT in the host loop or CUDA graphs.
### Evidence
- Read-only sqlite sweeps on `~/bench/a2_nsys/paged_off_npl128.sqlite` (this agent).
- `gated_delta_net.cu` launch grid (DGX `~/llama-paged-dev`).
- A2_CUDAGRAPH_DECODE.md, SSM_DECODE_FIX_RESULTS.md, GDN_DECODE_VERIFY.md,
VLLM_DECODE_GROUNDING.md, THROUGHPUT_B_P2a_POSTSSM_RESULTS.md.
# Decode-Parity Exploration
## Section: gdn-source-compare (llama gated_delta_net.cu vs vLLM fused_recurrent_gated_delta_rule)
### Model config (Qwen3.5-27B dense, from vLLM config.json)
- linear_key_head_dim K = 128, linear_value_head_dim V = 128
- linear_num_key_heads = 16, linear_num_value_heads = 48 (GVA 3:1), conv_kernel = 4
- 64 layers, full_attention_interval 4 -> 48 linear (GDN) : 16 full-attn
- Recurrent state per (seq, v-head) = V*K = 128*128 = 16384 f32 = 64 KiB.
Per layer per seq = 48 * 64 KiB = 3 MiB. Both engines store state in f32.
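The state-size figures follow directly from the config values listed above; a small sketch (variable names are illustrative):

```python
K = 128            # linear_key_head_dim
V = 128            # linear_value_head_dim
V_HEADS = 48       # linear_num_value_heads
F32_BYTES = 4      # both engines keep the recurrent state in f32

state_elems = V * K                             # per (seq, v-head)
state_bytes = state_elems * F32_BYTES
per_layer_per_seq = V_HEADS * state_bytes

print(state_elems, "f32 elements")
print(state_bytes // 1024, "KiB per (seq, v-head)")
print(per_layer_per_seq // (1024 * 1024), "MiB per layer per seq")
```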
### Which kernels run at decode
- llama: ggml_gated_delta_net_inplace_ids -> gated_delta_net_cuda<S_v=128, KDA=false, keep_rs_t=false>.
Gate is SCALAR per head (graph reshapes gate/beta to ne[0]=1), so the cheaper !KDA branch runs (one expf per token, not per-channel).
- vLLM: enable_packed_recurrent_decode -> fused_recurrent_gated_delta_rule_packed_decode_kernel
(the dedicated single-token decode kernel, NOT the generic varlen fwd kernel).
### The state HBM traffic is IDENTICAL - it is NOT the lever
Per (seq, v-head) per decode token both engines read 64 KiB state + write 64 KiB state, f32, coalesced.
The dominant memory term is equal. llama is NOT moving more state bytes than vLLM.
=> The 1.46 ms/call is llama achieving LOWER effective bandwidth on the SAME bytes,
plus extra non-state work, NOT a fundamental HBM-traffic deficit. Hence closable.
### Algorithmic / parallelization delta (the real differences)
1) Reduction strategy (biggest structural difference)
- llama: WARP-PER-OUTPUT-COLUMN. State stored transposed M[col][i]=S[i][col]. Each warp owns
one V-column; the contraction over the 128 K-rows is a cross-lane warp_reduce_sum.
TWO warp_reduce_sum per token (one for kv = S^T@k, one for attn = S^T@q) = ~10 shuffle
rounds on the critical path, with n_tokens=1 they are NOT amortized.
- vLLM: THREAD-PER-OUTPUT-ROW. b_h is a [BV,BK]=[32,128] tile; each thread owns a FULL K-row
of state. sum(b_h*b_k, axis=K) and sum(b_h*b_q) are THREAD-LOCAL 128-wide reductions -
ZERO cross-thread shuffles. Outer-product update b_h += b_v*b_k is also thread-local.
Same FLOPs, but vLLM has no shuffle-reduction latency in the recurrence.
2) Occupancy / launch geometry (likely the dominant bandwidth gap)
- llama: block = (32 lanes, 4 warps) = 128 threads; grid = (H=48, n_seqs, ceil(128/4)=32).
Per (head,seq) it launches 32 blocks * 128 threads = 4096 threads to touch a 16384-elem state
(only 4 state elems/thread). launch_bounds(128, 2) budgets registers for >=2 blocks/SM; with
s_shard[4]+k_reg[4]+q_reg[4]+addressing the register pressure caps it near ~2 blocks = 8 warps/SM
(~12-16% occupancy on GB10). A memory-bound kernel at ~8 warps/SM cannot generate enough in-flight
loads to saturate 273 GB/s -> low achieved bandwidth on the state read/write.
- vLLM: 1 warp/program (num_warps=1), grid (NV=4, B*HV), small register footprint, num_stages=3
software-pipelines (prefetches) the state load. Far higher memory-level parallelism per SM.
3) Redundant non-state traffic in llama
- q,k re-loaded by EVERY column-warp: 128 column-warps/head each reload the same 128-float q and k
=> ~128x amplified L2 loads of q/k per head/token (vLLM reloads ~4x, once per NV program).
Small (L2-resident) but adds load-issue + L2 pressure competing with the state stream.
- Output store: llama writes attn_data[col] from lane 0 only (31/32 lanes idle), scattered
single-float stores; vLLM stores a contiguous BV=32 vector (coalesced).
4) Fusion delta (per-layer kernel-launch / HBM round-trip count)
- vLLM packed_decode FUSES into ONE kernel: q/k l2norm + q*scale + softplus(a+dt_bias) +
(-exp(A_log)) gate + sigmoid(beta) + the recurrence + state write-back.
- llama computes these as SEPARATE ggml ops/kernels in the graph before the GDN op:
ggml_l2_norm(q), ggml_l2_norm(k), ggml_add(+dt), ggml_softplus, ggml_mul(gate),
ggml_sigmoid(beta) (+ conv/silu), each a launch + small HBM round-trip. Plus a separate
gdn_gather_nonident_kernel launch per layer (a no-op at steady-state decode: every block
early-returns on the identity check, but still a grid launch of n_seqs blocks).
Across 48 linear layers this is ~6-10 extra small kernels/layer (~300-480 extra launches/token).
Whether this dominates depends on CUDA-graph capture (see A2_CUDAGRAPH_DECODE.md); if captured,
launch latency is hidden and the cost reverts to the per-op HBM round-trips + dependency gaps.
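To make delta (1) concrete, here is a plain-Python simulation of the two reduction strategies for a single output column. It is a sketch, not either engine's kernel: the xor-butterfly pattern and the strided lane-to-row mapping are assumptions about how a 32-lane `warp_reduce_sum` over 128 K-rows is typically laid out, and integer data is used so both paths compare exactly.

```python
WARP = 32
K = 128


def warp_reduce_sum(lane_vals):
    """Simulate an xor-butterfly shuffle reduction; returns (sum, rounds)."""
    vals = list(lane_vals)
    rounds = 0
    offset = len(vals) // 2
    while offset > 0:
        # Each lane adds the value held by its xor-partner (one shuffle round).
        vals = [vals[i] + vals[i ^ offset] for i in range(len(vals))]
        rounds += 1
        offset //= 2
    return vals[0], rounds


# Toy integer data for one state column and the k vector.
s_col = [(i % 5) - 2 for i in range(K)]
k_vec = [(i % 7) - 3 for i in range(K)]

# Warp-per-column: each lane holds 4 of the 128 rows, then a cross-lane reduce.
partials = [sum(s_col[i] * k_vec[i] for i in range(lane, K, WARP))
            for lane in range(WARP)]
warp_total, rounds = warp_reduce_sum(partials)

# Thread-per-row: one thread walks all 128 rows, no cross-thread traffic.
row_local = sum(s * k for s, k in zip(s_col, k_vec))

assert warp_total == row_local
print(f"{rounds} shuffle rounds per reduction, {2 * rounds} per token (kv + attn)")
```

Both paths do the same multiply-adds; the difference is the serial shuffle rounds that sit on the critical path when `n_tokens == 1` gives nothing to amortize them over.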
### What a faster llama GDN decode kernel would need (optimization scope)
- A. Re-parallelize like vLLM: thread/lane owns a full K-row (or K-shard) so the kv and attn
contractions become register-local FMAs, eliminating the two warp_reduce_sum per token.
- B. Raise occupancy for the memory-bound regime: drop/raise the launch_bounds minBlocks hint
(the `,2)` is too low), shrink the block, cut registers, and add a software-prefetch of the next
state shard so more state loads are in flight per SM. This directly lifts achieved bandwidth on
the equal state bytes - the single highest-leverage change.
- C. Load q,k ONCE per (head,seq) into shared memory instead of 128x per-column reload; coalesce
the output store across the warp.
- D. Fuse the gate/l2norm/scale (softplus, exp(A_log), sigmoid, l2norm) INTO the recurrence kernel,
reading raw a/b/A_log/dt_bias from registers, removing ~6 elementwise passes + their HBM round-trips
per layer (matches vLLM's packed_decode). Drop the gather no-op kernel at steady-state decode
(or fold the identity check into the recurrence prologue, which it already partly does).
- E. (Longer term) bf16 state would HALVE the dominant traffic, but vLLM keeps f32 too, so this is a
divergence-from-reference not a parity lever.
### Bottom line
llama's GDN decode kernel is NOT moving more state HBM bytes than vLLM (the dominant term is equal),
so the 1.46 ms/call is an EFFICIENCY gap, not a traffic floor: (1) cross-lane warp-shuffle reductions on
the n_tokens=1 critical path, (2) low occupancy (~8 warps/SM from launch_bounds + register pressure)
starving memory-level parallelism so the equal state bytes move at lower effective bandwidth, plus
(3) 128x redundant q/k L2 loads and (4) ~6-10 unfused gate/norm elementwise kernels per layer that
vLLM folds into one packed-decode kernel. Highest-leverage fixes: raise occupancy + prefetch (B) and
row-local reductions (A); secondary: gate/norm fusion (D) and q/k shared-mem reuse (C).
---
## Section: validate-findings (adversarial re-derivation from raw DGX data) - READ-ONLY
Re-queried `CUPTI_ACTIVITY_KIND_KERNEL` + `CUPTI_ACTIVITY_KIND_MEMCPY` directly (kernel and
memcpy summed separately so D2D is never lumped into compute), not from summary text.
### CLAIM 1 - decode decomposition
PRE-FIX (`a2_nsys/paged_off_npl128.sqlite`, last 17s) vs `decode_decomp.txt`, match <=0.1pp:
gated_delta_net 23.40% (doc 23.43), k_get_rows 21.99% (21.88), MEMCPY-DtoD 18.89% / 382 GB /
1583 ops (18.90 / 356 GB / 1584), mul_mat_vec_q 15.53% (15.51), mul_mat_q 10.48% (10.37).
=> CONFIRMED exactly. gated_delta_net = largest single non-GEMM kernel; FP4-GEMM group ~28%;
full attention 0.37%.
D2D collapse: only on-box post-fix decomp is `ssm_decomp/after.sqlite`; MEMCPY-DtoD there =
526 ops / 0.9 ms / 0.05 GB = 0.008% of busy (from 382 GB / 18.89%). => CONFIRMED, stronger than
the doc's "0.23%" (382 GB state copy-back gone; exact "0.23%/2.93GB/734ops" not reproducible -
my DtoD 0.05 GB, the 2.16 GB is DtoH).
FLAG (refutes part of the Step-2 decomp): `after.sqlite` is a Step-1 build (patch 0018 only),
NOT Step-2. It still shows k_get_rows_float 28.44% (gated_delta_net 28.96%, FP4-GEMM group ~33%),
no `gdn_gather_nonident` kernel, profiled S_TG=164 (~Step-1 180, not Step-2 256); mtime 00:31
predates the 08:48 rebuild that carried patch 0019. The Step-2 split in `SSM_DECODE_FIX_RESULTS`
("get_rows 18.8%->0.7%, FP4-GEMM ->48%, GDN 22.5%") has NO surviving sqlite, and the script meant
to produce it (`ssm_decomp.sh`) CRASHED (Python SyntaxError, see `ssm_decomp_after.out`). So
"FP4-GEMM ~48%" is UNVERIFIED against raw Step-2 data: measured ~33% on Step-1; removing the 28%
get_rows bucket lifts it to ~46% arithmetically, so ~48% is plausible but not directly measured.
Section 1 above and SSM_DECODE_FIX_RESULTS both inherit this unverified Step-2 split.
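The "lifts it to ~46% arithmetically" step is a renormalization of the Step-1 share after the get_rows bucket is removed. A sketch, assuming all other buckets keep their absolute time (`renormalized_share` is a hypothetical helper name):

```python
def renormalized_share(share_pct, removed_pct):
    """Share of a bucket after another bucket is removed from the total."""
    return 100.0 * share_pct / (100.0 - removed_pct)


fp4_step1 = 33.0        # FP4-GEMM group share measured on the Step-1 trace
get_rows_step1 = 28.44  # k_get_rows_float share on the same trace

print(f"{renormalized_share(fp4_step1, get_rows_step1):.1f}%")
```

This gives an estimate only; it is not a substitute for a measured Step-2 decomposition.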
### CLAIM 2 - 146 -> ~257 ("+66%")
146.23 baseline CONFIRMED (`ssm_decode_baseline.out`); final 256.57 / 252.50 / 254.02 across
SSM_DECODE_FIX_RESULTS + THROUGHPUT_B_P2a, within ~1.6%. Magnitude CONFIRMED. TRAP: 146->257 is
+76% (146->254 = +74%), NOT +66%. "66%" is the % of vLLM (257/391 = 65.7%), not the speedup.
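The trap is two different ratios sharing the number 66. A short sketch of both, using the figures quoted above (helper names are illustrative):

```python
def speedup_pct(before, after):
    """Relative gain of `after` over `before`, in percent."""
    return 100.0 * (after / before - 1.0)


def share_of_ref_pct(value, ref):
    """`value` expressed as a percentage of a reference throughput."""
    return 100.0 * value / ref


baseline, final, vllm = 146.23, 256.57, 391.0

print(f"speedup over baseline: +{speedup_pct(baseline, final):.1f}%")
print(f"share of vLLM:          {share_of_ref_pct(final, vllm):.1f}%")
```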
### CLAIM 3 - P2a GEMM-remap FLAT on decode
THROUGHPUT_B_P2a: dense npl128 252.50->254.02 (+0.6% noise), npl32 -0.4%, MoE flat; FP4 GEMM
kernel itself -24.7%, PREFILL +12.7%. Pre-SSM corroborated by THROUGHPUT_B_P1. => CONFIRMED.
### CLAIM 4 - 65% of vLLM (254 vs 391)
254/391 = 64.96%, 256.57/391 = 65.6%; vLLM 391 = enforce_eager apples ref. => CONFIRMED.
### Traps checked
GGML_CUDA_DISABLE_GRAPHS set `=1` explicitly (not the empty-value trap); graphs ON/OFF within
noise. memcpy-in-compute lumping AVOIDED (separate table sums). Decomp reps are ntg24-under-nsys
(S_TG 149/164) - valid for SHARES only; throughput correctly from unprofiled ntg128 logs.
### Net verdict
1 pre-fix decomp CONFIRMED exact; D2D collapse CONFIRMED (stronger); Step-2 0.7%/48% split
UNVERIFIED (producer script crashed, only post-fix sqlite is Step-1). 2 magnitude CONFIRMED,
"+66%" label REFUTED (true +76%; 66% = % of vLLM). 3 CONFIRMED. 4 CONFIRMED.
---
## Section: weight-bandwidth (whole-step DRAM budget, READ-ONLY math)
Agent label: weight-bandwidth. Method: exact GGUF tensor accounting (q36-27b-nvfp4,
arch qwen35, 64 layers) + activation-state math + existing nsys/decode_decomp; no GPU started.
Config = the production decode number: llama-batched-bench -fa on -npp128 -ntg128 -npl 128
(B = n_parallel = 128 sequences, S_TG = 254 t/s post-0019). GB10 LPDDR5x peak ~273 GB/s.
### Exact per-step DRAM byte budget at B=128 (ctx avg ~192 over the ntg128 window)
NVFP4 type-40 = 0.5625 B/weight (4-bit data + e4m3 per-16 micro-scale; verified: 5120*48*0.5625=138240).
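The 0.5625 B/weight figure can be derived from the layout described above (4-bit data plus one 8-bit e4m3 scale per 16 weights). This sketch ignores the per-tensor scale scalar, which is a few bytes per tensor:

```python
def nvfp4_bytes_per_weight(data_bits=4, scale_bits=8, group=16):
    """Average bytes per weight: payload bits plus the amortized micro-scale."""
    return (data_bits + scale_bits / group) / 8


bpw = nvfp4_bytes_per_weight()
tensor_bytes = 5120 * 48 * bpw   # the verification tensor quoted above

print(bpw, int(tensor_bytes))
```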
WEIGHTS (read ONCE per step, shared across all 128 seqs):
- NVFP4 layer weights (type40, 64 layers): 13,062.7 MB = 12.76 GB
(per SSM layer 205.6 MB x48 = 9867.7 MB ; per full-attn layer 199.7 MB x16 = 3195.0 MB)
- LM head output.weight: type 30 = **bf16, NOT quantized** = 2425 MB = 2.37 GB (read in full each step)
- per-layer norms/conv1d/ssm_a/dt_bias (type0 f32): 10.1 MB
- token_embd: EXCLUDED (get_rows gathers only 128 rows, negligible)
=> WEIGHTS TOTAL = 15.14 GB / step
PER-SEQUENCE STATE (x128 seqs, read + write every step):
- SSM recurrent state: inner_size(6144) x state_size(128) x 4B(f32) = 3.0 MB / layer / seq
x 48 SSM layers x 128 seq = 18.43 GB read + 18.43 GB write = **36.86 GB / step**
- conv state: conv_k(4) x conv_dim(10240) x 4B = 160 KB / layer / seq
x 48 x 128 = 0.96 GB read + 0.96 GB write = 1.92 GB / step
- KV cache (16 full-attn layers, GQA n_kv_head=4, k+v_len=512, f16):
4096 B/tok/layer x 16 x ~192 ctx x 128 seq = ~1.6 GB read / step
TOTAL ~= 15.14 (W) + 36.86 (SSM state) + 1.92 (conv) + 1.6 (KV) = **~55.5 GB / step**
### Floor vs measured -- decode is NOT at the bandwidth floor
Bandwidth floor = 55.5 GB / 273 GB/s = **203 ms/step**
Measured llama = 128 tok / 254 t/s = **504 ms/step** => **2.48x the floor** (eff BW 110 GB/s = 40% of peak)
vLLM 391 t/s = 128 / 391 = **327 ms/step** => 1.61x the floor (eff BW 170 GB/s = 62% of peak)
The SAME 55.5 GB/step floor applies to vLLM: identical NVFP4 weights, and its
fused_recurrent_gated_delta_rule reads+writes the identical f32 recurrent state. So both engines
face the same DRAM wall; vLLM simply moves those bytes at 62% of peak vs llama's 40%. The 62/40 =
1.55x utilization gap equals the 254->391 (1.54x) throughput gap up to rounding, as it must when both engines move the same bytes. => Decode-parity is a
bandwidth-UTILIZATION / launch-serialization problem, NOT a DRAM-traffic-volume problem. Bandwidth
is not the binding constraint (we sit 2.5x above the floor); confirms the GDN-kernel section above.
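The floor and utilization numbers can be reproduced from the section's own inputs (55.5 GB/step, 273 GB/s peak, 128 tokens per step). A sketch with hypothetical helper names; it takes the byte budget as given rather than re-deriving it:

```python
PEAK_GBPS = 273.0   # GB10 LPDDR5x peak quoted above
STEP_GB = 55.5      # per-step DRAM budget from the accounting above
N_SEQS = 128


def floor_ms(step_gb=STEP_GB, peak_gbps=PEAK_GBPS):
    """Minimum step time if the bytes moved at peak bandwidth."""
    return 1000.0 * step_gb / peak_gbps


def utilization(tokens_per_s, step_gb=STEP_GB, peak_gbps=PEAK_GBPS):
    """Return (ms/step, multiple of the floor, fraction of peak bandwidth)."""
    step = 1000.0 * N_SEQS / tokens_per_s
    eff_gbps = step_gb / (step / 1000.0)
    return step, step / floor_ms(step_gb, peak_gbps), eff_gbps / peak_gbps


for name, tps in [("llama", 254.0), ("vLLM", 391.0)]:
    step, ratio, util = utilization(tps)
    print(f"{name}: {step:.0f} ms/step = {ratio:.2f}x floor, {100 * util:.0f}% of peak")
```

Because both engines are assigned the same `STEP_GB`, the ratio of their utilizations is the ratio of their throughputs by construction.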
### Traffic composition is STATE-dominated at B=128 (qualifies the "weight-quant" verdict)
SSM state r+w = 66% of step traffic; weights = 27%; conv = 3.5%; KV = 3%.
At B=128 weights are a minority of traffic, and we are 2.5x above the floor anyway -> NVFP4-dense
weight quant (the QWEN36_NVFP4 verdict's lever) cannot move batch-128 decode much. Weight-quant
helps PREFILL (compute/weight-bound, already +12.7% from the GEMM remap) and LOW-batch decode.
Cross-check at B=32: traffic ~25.2 GB/step (weights now 60%), floor 92 ms, measured 189 ms = 2.05x
floor. The sublinear scaling 32->128 (4x batch, only 1.5x throughput: 169->254) is fully explained
by per-seq state traffic growing with B while weights stay amortized -> at B=128 the step has become
state-traffic-heavy but is STILL 2.5x off the floor, i.e. latency/overlap-bound, not byte-bound.
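The B=32 cross-check follows from a two-term traffic model: a fixed weight read plus per-sequence state that scales with B. A sketch using the B=128 component totals from the budget above, assuming the same ~192 average context at both batch sizes (names are illustrative):

```python
WEIGHTS_GB = 15.14
PER_SEQ_GB_AT_128 = 36.86 + 1.92 + 1.6   # SSM state + conv state + KV at B=128
PEAK_GBPS = 273.0


def traffic_gb(batch):
    """Per-step DRAM traffic: weights read once, per-seq state scales with B."""
    return WEIGHTS_GB + PER_SEQ_GB_AT_128 * batch / 128


def summary(batch, tokens_per_s):
    total = traffic_gb(batch)
    floor = 1000.0 * total / PEAK_GBPS
    measured = 1000.0 * batch / tokens_per_s
    return total, WEIGHTS_GB / total, floor, measured / floor


for batch, tps in [(32, 169.0), (128, 254.0)]:
    total, w_share, floor, ratio = summary(batch, tps)
    print(f"B={batch}: {total:.1f} GB/step, weights {100 * w_share:.0f}%, "
          f"floor {floor:.0f} ms, measured {ratio:.2f}x floor")
```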
### Redundant traffic llama reads that vLLM avoids (cut list, by impact)
1. (HISTORICAL, FIXED by 0018) Redundant DtoD recurrent-state copy = +18.4 GB/step EXTRA
(pre-fix decode_decomp: MEMCPY-DtoD 18.9%, 80 copies/step ~230 MB each = 18.4 GB; nsys window
356 GB/19.8 steps). This doubled state traffic and was the dominant pre-fix waste. Verified gone
post-fix: the THROUGHPUT_B_P2a A/B kernel sum (npp128 ntg24 npl128) lists gated_delta_net /
mul_mat_q / quantize but NO MEMCPY-DtoD term. (The committed ~/bench/a2_nsys sqlites are all
PRE-fix S_TG~149 traces; re-profiling deferred to the designated profiler.) This single removal
(18.4 GB/273 ~= 67 ms/step of bytes plus the killed overlap stalls) is the bulk of 146->254.
2. conv state as a SEPARATE ssm_conv kernel + separate buffer: 1.92 GB r+w/step AND 48 extra kernel
launches/step. vLLM folds the causal conv into its recurrence kernel. Cut ~= 7 ms bytes + 48
launches/step of serialization.
3. Residual get_rows gather post-0019 (~0.7%, decode_decomp pre-fix k_get_rows was 21.9% / ~96
ops/step = 2/SSM-layer): vLLM indexes the per-seq state in-kernel; llama still does a small
gather/scatter. ~0.13 GB. 0019 already folds most of it; fold the identity check fully into the
recurrence prologue.
4. quantize_mmq_nvfp4: 448 ops/step re-quantizing activations to NVFP4 before each FP4 matmul.
Activation BYTES are negligible, but it is 448 extra kernel launches/step that vLLM fuses into
the GEMM prologue -> pure launch latency, not traffic.
5. NOT redundant: weight bytes (identical NVFP4 to vLLM), SSM-state r+w (inherent, vLLM pays it),
NVFP4 scale scalars (8 B/tensor). Note the LM head is bf16 not quantized (2.37 GB/step, 16% of
weight traffic) -- fp8 LM head would save ~1.2 GB/step but only matters if vLLM also quantizes it.
### Bottom line (weight-bandwidth)
At B=128, decode moves ~55.5 GB/step and runs at 2.48x the 273 GB/s floor (40% util) vs vLLM's 1.61x
(62% util). Same bytes, same floor for both engines -> decode is bandwidth-UTILIZATION-bound, not
traffic-bound. There is NO large redundant-byte stream left to cut post-0018/0019 (the 18.4 GB/step
DtoD redundancy is already gone); the remaining 254->391 is recovered by raising achieved bandwidth
(occupancy + prefetch on the GDN state loads, conv fusion to drop 48 launches/step) so the EXISTING
55.5 GB/step moves at vLLM's 62% instead of 40%. Weight-quant (NVFP4-dense) is a PREFILL / low-batch
lever, largely orthogonal to the batch-128 decode-parity gap.
---
## Section: explore-other-levers (broad sweep for OTHER llama-specific decode inefficiencies) - READ-ONLY, no GPU
Scope handoff: GDN-kernel internals -> `gdn-source-compare`; host loop / graphs / gaps ->
`per-token-latency`; weight-byte / utilization -> `weight-bandwidth` section above (which already
covers the BF16 lm_head and the "same bytes, 40% vs 62% util" framing - I concur, no need to repeat).
This section covers the levers NONE of those own: the FP4 act-quant fusion, the M=128-vs-M=1 ggml
fusion gate, TMA scoping, and the conv-state residual.
**Terminology fix that matters for the whole doc:** in this repo's benches **"fusion OFF" means
`LLAMA_FUSE_NVFP4_QUANT=0`** (Track A's NVFP4 act-quant producer), confirmed in
`a2_nsys.sh`/`a2_4cell.sh`/`trackA_clean.sh`. It does NOT set `GGML_CUDA_DISABLE_FUSION`, so the
**standard ggml-cuda elementwise/GLU/rope fusion is ON** in every result. The header's "fusion OFF
baseline" is only about the act-quant producer.
**Framing (consistent with the sections above, sharpened):** the binder is bandwidth-UTILIZATION /
the kernel-dependency chain, not traffic or per-kernel compute (P2a -24.7% GEMM and graphs both
flat). The thing that raises utilization AND shortens the chain is the same: **fewer, fused kernels
per step** - removing whole passes vLLM doesn't run. So rank by "whole pass eliminated", not "us
shaved".
### L1. Re-test Track A act-quant fusion (`LLAMA_FUSE_NVFP4_QUANT=1`) POST-SSM. [impact ~8-11%, tractability HIGH - code exists, owned by tasks 38-41]
`quantize_mmq_nvfp4` is a standalone full-activation requantize run once per NVFP4 GEMM at M=128
(the weight-bandwidth section counts 448 such launches/step). vLLM has **zero** equivalent:
`rms_quant_fusion.py:98` folds it into RMSNorm, `act_quant_fusion.py:40,128` into SiLU+mul - the
activation never hits a temp buffer. Track A built exactly this fused producer (tasks 38-40 DONE),
but `LLAMA_FUSE_NVFP4_QUANT=1` regressed, and EVERY post-SSM bench ran with it OFF. **The regression
is likely stale:** pre-SSM the GPU was 99% busy on the state-copy chain, so folding act-quant into
the norm only relocated busy work into a lower-occupancy kernel with no idle to reclaim. Post-SSM the
chain has real idle and removing 448 launches/step both shortens the dependency chain and lifts
utilization - exactly the post-0018/0019 bind. Highest-value CLEAN removal; needs only a re-bench
(re-run `trackA_clean.sh` on the post-0019 build), not new code. Do not treat the prior regression
as final.
### L2. M=128 norm->matmul prologue fusion - the ggml fusion gate that does NOT fire at decode batch. [impact ~5-15% aggregate, tractability MEDIUM]
ggml-cuda's built-in `rms_norm+mul+mul_mat_vec_q` fusion (`ggml_cuda_should_fuse_mul_mat_vec_q`,
ggml-cuda.cu:2502) is gated to `dst->ne[1]==1` - it ONLY fires at **npl1** (M=1). At npl128
(`mul_mat_q`, M=128) it does NOT fire, so the per-layer RMSNorm stays a separate kernel feeding the
GEMM and the act path is unfused (L1). vLLM fuses both into the GEMM prologue at all M. This is the
M=128 generalization of the existing M=1 fusion + L1; largest aggregate surface but real kernel work.
Implies a regime split worth stating loudly: **npl1 single-stream latency already gets this fusion;
the npl128 throughput number does not** - tune the two separately.
### L3. TMA weight feed: a PREFILL / npl1-latency lever, NOT an npl128-decode lever.
Answering the brief's question (GEMM idle = FEED problem TMA fixes, or off-critical-path TMA can't?):
P2a cut GEMM compute and the freed time became IDLE, so at npl128 the GEMM finishes early and the
stall is BETWEEN kernels, not inside the GEMM waiting on weight tiles. TMA accelerates a
*feed-stalled* GEMM; at npl128 the GEMM is not the binder, so TMA won't move npl128 S_TG. It pays on
(a) **prefill** (compute/feed-bound; the remap already gave +12.7%) and (b) **npl1 decode**, a pure
weight-feed GEMV (full model / 273 GB/s ~ 19-20 tok/s ceiling). Scope TMA to prefill + low-batch
latency; do not bank it for batch-128 decode parity. (Consistent with the weight-bandwidth section's
"NVFP4-dense is a prefill/low-batch lever".)
### L4. In-place / `ids` conv-state - apply the 0018/0019 pattern to `ssm_conv`. [impact ~1-3%, tractability HIGH, proven pattern, bit-exact-able]
After the SSM fix the residual D2D is the conv-state copy (`build_conv_state`,
delta-net-base.cpp:449-525: `build_rs` reads 3 prior samples, `ggml_concat` the new token, writes
the last 3 back), plus `ssm_conv` (~0.8-1.5%) and a per-GDN-layer `concat_cont` (48/step). The exact
in-place + `ids`-read treatment from 0018/0019 applies to the conv state, and `ssm_conv`+`concat`
can fold into the GDN kernel prologue (it already has `ids` plumbing). Small ceiling but bit-exact,
low-risk, and removes ~48 launches/step from the chain - this is the "conv fusion to drop 48
launches/step" the weight-bandwidth section calls for, made concrete via the proven patch pattern.
### Deferred (covered by other sections, I concur)
- GDN occupancy / row-local reductions / gate-norm fusion -> `gdn-source-compare`. Add only: bf16
state halves the dominant traffic but vLLM keeps f32, so it is a divergence-from-reference, not a
parity lever - last priority, quality-risk.
- BF16 lm_head / weight-byte / 40%-vs-62% utilization -> `weight-bandwidth` section. lm_head NVFP4 is
an absolute ~1-2% trim, not a vLLM-relative gap (vLLM likely keeps it bf16 too).
- Full-attention KV path (16 attn layers, 0.4-1.8%, O(ctx) but tiny) -> CLOSED, not a lever.
### Bottom line (this section's net-new)
Ranked by "whole pass vLLM eliminated": **L1 (re-test act-quant fusion post-SSM - clean removable
pass, code already written, just needs a post-0019 re-bench)** > **L2 (M=128 norm/act prologue
fusion - biggest aggregate surface, real work)** > **L4 (conv-state in-place - cheap, proven 0018/0019
pattern, -48 launches/step)**. **L3 (TMA) is mis-scoped if aimed at npl128 decode** - it is a prefill
/ npl1-latency lever, same bucket as NVFP4-dense weight quant. Caveat inherited from
`validate-findings`: the post-SSM act-quant absolute share (L1) is on an unverified Step-2 decomp
(only clean post-fix sqlite is Step-1); re-measure on a clean Step-2 nsys when the profiler runs.
---
## Section: profile-both-engines (GROUND-TRUTH post-SSM nsys of llama AND vLLM at npl128) - THE GPU PROFILER
Agent label: profile-both-engines (the only GPU agent). Fresh post-SSM nsys traces of
BOTH engines at the same shape (128-seq decode, 128-token prompts), q36-27b-nvfp4 dense.
llama = `build-cuda-base` (no FP4 flag, byte-identical to stock, HEAD 46d7dd8 = patch 0019
SSM fix), `llama-batched-bench -npp128 -ntg32 -npl128 -fa on`, eager (DISABLE_GRAPHS=1) for
a clean per-kernel trace. vLLM = 0.23.0 in-process offline (`VLLM_ENABLE_V1_MULTIPROCESSING=0`
so cudaProfilerApi controls the worker), enforce_eager, max_num_seqs 256, 128 prompts.
Decode-only windows (prefill excluded), overlap-correct interval-union busy, GPU-accurate
per-call kernel durations. This is the post-SSM **Step-2** trace `validate-findings` flagged
as having no surviving sqlite - it now exists: `~/bench/postssm_decomp/`.
### 0. THROUGHPUT GROUND TRUTH (un-profiled, prefill-subtracted) - resolves the 391 reference
The vLLM 391 reference is real and reproduced. Prefill-subtracted decode step (two-length
w16/w64 timing, in-process, batch 128):
| engine / mode | ms/step | decode tok/s | notes |
|--------------------------|---------|--------------|--------------------------------|
| llama post-SSM (graphs) | ~510-522| **245-251** | S_TG @npl128 ntg32 (this run) |
| vLLM enforce_eager | 324.9 | **394.0** | == the ~391 ref (h2h log 371-384)|
| vLLM cuda-graphs | 304.9 | **419.8** | graphs buy only +6% |
- **CUDA graphs are NOT the parity lever**: vLLM already reaches 394 t/s EAGER; graphs add only +6%
  (394->420). llama-batched-bench already runs WITH graphs and still reads 245, so the gap is
  eager-vs-eager kernel work, not launch overhead, confirming `per-token-latency` and `A2_CUDAGRAPH_DECODE`.
- TRAP I hit and corrected: the FIRST vLLM nsys window (0.35-0.99) read 468 ms/step / 273 t/s -
  WRONG, contaminated by prefill chunked-GDN kernels AND eager-nsys host overhead. The tight
  decode-only window (0.62-0.98) reads **326.5 ms/step**, within 0.5% of the un-profiled 324.9 ms
  -> the tight window is faithful; per-kernel numbers below use it.
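
The ms/step and tok/s columns above are tied together by the batch size. A minimal sketch of that conversion, assuming every decode step emits one token per sequence at batch 128 (the function names are illustrative, not taken from `vllm_tps.py`):

```python
def decode_tok_per_s(ms_per_step: float, batch: int = 128) -> float:
    """Aggregate decode throughput: `batch` tokens are produced per step."""
    return batch * 1000.0 / ms_per_step


def relative_speedup(slow_ms: float, fast_ms: float) -> float:
    """Fractional throughput gain when a step shrinks from slow_ms to fast_ms."""
    return slow_ms / fast_ms - 1.0


if __name__ == "__main__":
    # ms/step values copied from the throughput table.
    for label, ms in [("vLLM eager", 324.9), ("vLLM graphs", 304.9),
                      ("llama (fast end)", 510.0), ("llama (slow end)", 522.0)]:
        print(f"{label:18s} {ms:6.1f} ms/step -> {decode_tok_per_s(ms):6.1f} tok/s")
    print(f"graphs gain: {relative_speedup(324.9, 304.9) * 100:.1f}%")
```
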
### 1. POST-SSM per-step decode decomposition, SIDE BY SIDE (GPU-accurate, prefill-free)
Both at batch 128. llama 510 ms/step (98.7% GPU-busy), vLLM 326 ms/step (97.9% GPU-busy).
ms/step = on-device kernel time per real decode step (nsys host overhead does not inflate GPU
kernel duration; per-step = GPU-ms / real-step-count from the decode-only GDN call count).
| component (per step)        | llama ms/step  | llama % | vLLM ms/step     | vLLM %  |
|-----------------------------|----------------|---------|------------------|---------|
| GDN linear-attn recurrence  | 193 (48x4.03)  | 38%     | 174 (48x3.62)    | 53%     |
| FP4 matmul + act-quant      | **236**        | **46%** | **117**          | **36%** |
| - mul_mat_vec_q (GEMV)      | 132 (48x2.75)  | 26%     | -                | -       |
| - mul_mat_q (GEMM)          | 88 (448 calls) | 17%     | cutlass 61       | 19%     |
| - quantize_mmq_nvfp4        | 16 (448 calls) | 3%      | nvjet 53 + cvt 2 | 17%     |
| full attention (16 layers)  | 6.6 (16 calls) | 1.3%    | 6.2 (16 calls)   | 1.9%    |
| SSM conv + glue/elementwise | ~45            | 9%      | ~22              | 7%      |
| MEMCPY (D2D+H2D)            | 2.5 (131 MB)   | 0.5%    | 0.36 (85 MB)     | 0.1%    |
| **TOTAL**                   | **~510**       | 100%    | **~326**         | 100%    |
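
The per-step reconstruction behind the table can be sketched as follows. This is an assumed shape of the calculation, not the actual `decode_decomp2.py`: the step count is inferred from the number of GDN kernel calls in the decode-only window (48 per step, one per GDN layer), and the window totals in the usage example are hypothetical inputs chosen to land on the table's GDN row.

```python
GDN_LAYERS = 48  # one GDN kernel call per GDN layer per decode step


def decode_steps(gdn_calls_in_window: int, gdn_layers: int = GDN_LAYERS) -> float:
    """Real decode steps in the window, inferred from the GDN call count."""
    return gdn_calls_in_window / gdn_layers


def per_step_ms(total_kernel_ms: float, gdn_calls_in_window: int) -> float:
    """Per-step GPU time for one kernel family: window total / step count."""
    return total_kernel_ms / decode_steps(gdn_calls_in_window)


def component_ms(calls_per_step: int, ms_per_call: float) -> float:
    """Same quantity from the other direction: calls/step x ms/call."""
    return calls_per_step * ms_per_call


if __name__ == "__main__":
    # Hypothetical window: 480 GDN calls (10 steps), 1934.4 ms of GDN kernel time.
    print(per_step_ms(1934.4, 480))   # llama GDN ms/step for that window
    print(component_ms(48, 4.03))     # llama GDN row: 48 x 4.03
    print(component_ms(48, 3.62))     # vLLM GDN row: 48 x 3.62
    print(component_ms(48, 2.75))     # llama mul_mat_vec_q row: 48 x 2.75
    print(132 + 88 + 16)              # llama FP4 sub-rows add up to the 236 row
```
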
### 2. The three load-bearing comparisons (the brief)
**(1) GDN compute: llama vs vLLM = NOT the gap.** Per-call GPU duration:
llama `gated_delta_net_cuda<128>` = **4.03 ms/call**, vLLM
`fused_recurrent_gated_delta_rule_packed_decode` = **3.62 ms/call**. llama is only **+11%**
slower per call (+19 ms/step). GDN is comparable; it is the largest single kernel on BOTH sides
(38% llama, 53% vLLM) but it explains only ~19 ms of the 185 ms gap (~10%). This REFUTES the
framing that the GDN kernel is the dominant residual lever - it is a minor overage post-0018/0019.
(The `gdn-source-compare` occupancy/shuffle deltas are real but worth ~19 ms/step, not 1.5x.)
**(2) DRAM bytes/step: llama does NOT read more.** Explicit memcpy: llama **131 MB/step** vs
vLLM **85 MB/step** - llama copies ~54% more, but both are <0.5% of the step. The big
per-layer state copies are GONE (pre-SSM 18 GB/step DtoD -> post-SSM 131 MB/step) - **the SSM fix
(0018/0019) is confirmed working in this trace.** Weight DRAM (read inside the GEMM/GEMV kernels,
not memcpy) is the SAME ~15 GB NVFP4 for both engines; at 273 GB/s that is a ~55 ms floor, and
BOTH engines sit far above it (326-510 ms), so BOTH are compute/kernel-bound, NOT
weight-bandwidth-bound, and llama reads no extra bytes. The 245-vs-394 gap is NOT a byte-volume
deficit - it is effective-bandwidth/compute-efficiency in the FP4 matmul kernels (see section 3).
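
The weight-read floor argument is a one-line division. A rough sketch using the rounded figures quoted above (~15 GB of NVFP4 weights, 273 GB/s); with these rounded inputs the floor lands in the low-to-mid 50s of ms, and both measured steps sit several times above it:

```python
def weight_read_floor_ms(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Minimum step time if every weight byte is read exactly once at peak bandwidth."""
    return weight_gb / bandwidth_gb_s * 1000.0


def multiple_of_floor(step_ms: float, floor_ms: float) -> float:
    """How many times slower than the pure weight-streaming floor a step is."""
    return step_ms / floor_ms


if __name__ == "__main__":
    floor = weight_read_floor_ms(15.0, 273.0)  # rounded inputs from the text
    print(f"floor ~{floor:.0f} ms")
    print(f"vLLM  326 ms = {multiple_of_floor(326.0, floor):.1f}x floor")
    print(f"llama 510 ms = {multiple_of_floor(510.0, floor):.1f}x floor")
```
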
**(3) GPU-busy% / idle structure: identical, both ~98% busy.** llama 98.7% busy (1.3% idle),
vLLM 97.9% busy (2.1% idle). Neither engine is idle/gap/host-bound at npl128. The entire gap is
the GPU doing MORE kernel-time per step on llama: llama's non-GDN GPU work = ~310 ms/step vs
vLLM's ~146 ms/step. That 164 ms delta is concentrated in the FP4 matmul path.
### 3. THE single biggest llama-specific overage: the FP4 matmul path (+119 ms/step = 64% of the gap)
llama spends **236 ms/step** on FP4 matmul+quant; vLLM does ALL its matmul (cutlass FP4 GEMM +
cublas nvjet + act cvt) in **117 ms/step** - even though vLLM ALSO carries ~18 ms/step of extra
PyTorch eager elementwise glue that llama's fused ggml kernels avoid. llama is **2.0x slower on
FP4 matmul**, and that +119 ms is **64% of the entire 185 ms/step gap**.
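
The gap attribution can be reproduced from the side-by-side table. A minimal sketch: the dictionary keys are my own labels, and because the table rows are rounded and do not sum exactly to the ~510 / ~326 totals, part of the 185 ms gap stays unattributed here.

```python
# Per-step kernel time (ms), copied from the side-by-side decomposition table.
LLAMA_MS = {"gdn": 193.0, "fp4_matmul_quant": 236.0, "full_attention": 6.6,
            "conv_glue": 45.0, "memcpy": 2.5}
VLLM_MS = {"gdn": 174.0, "fp4_matmul_quant": 117.0, "full_attention": 6.2,
           "conv_glue": 22.0, "memcpy": 0.36}
STEP_GAP_MS = 185.0  # ~510 - ~326, as quoted in the text


def gap_shares(llama: dict, vllm: dict, gap_ms: float) -> dict:
    """Share of the total step gap explained by each component's overage."""
    return {name: (llama[name] - vllm[name]) / gap_ms for name in llama}


shares = gap_shares(LLAMA_MS, VLLM_MS, STEP_GAP_MS)
unattributed_ms = STEP_GAP_MS - sum(LLAMA_MS[n] - VLLM_MS[n] for n in LLAMA_MS)

if __name__ == "__main__":
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:18s} {share * 100:5.1f}% of the gap")
    print(f"unattributed       {unattributed_ms:5.1f} ms")
```
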
Inside llama's FP4 path the dominant, untouched cost is **`mul_mat_vec_q` = 132 ms/step (26% of
decode), 48 calls/step (exactly one per GDN layer), 2.75 ms/call, grid 5120x128**. This is the
**FP4 GEMV ("vec_q") kernel running at decode batch 128** for the gated-DeltaNet in-projections -
a non-tensor-core, memory-bound-style kernel doing M=128 work without GEMM-grade weight-read
amortization. vLLM runs the equivalent projections through cutlass batched FP4 GEMM (tensor-core,
weight read amortized across the 128-row batch) at a fraction of the cost. **There is no
GEMV-at-batch-128 on the vLLM side at all.**
Key cross-check with Track B P2a: P2a optimized `mul_mat_q` (the 17%/88 ms tensor-core GEMM, cut
by 24.7%) and decode stayed FLAT - because the BIG FP4 cost is `mul_mat_vec_q` (26%/132 ms),
which P2a never touched. **Track B optimized the wrong FP4 kernel.** The lever is to route the
GDN in-projection at M=128 through a tensor-core GEMM (mul_mat_q / MMQ) instead of the vec_q path,
and to fuse the act-quant (L1) + the norm prologue (L2) so the 448 `quantize_mmq_nvfp4` launches
fold away - exactly what `explore-other-levers` L1/L2 propose. My measurement RANKS them: the
mul_mat_vec_q->GEMM routing is the single highest-value target (132 ms), then act-quant fusion
(16 ms + 448 launches), then the GDN +19 ms.
### 4. Reconciling with the `weight-bandwidth` section (unification, not contradiction)
weight-bandwidth concluded "same 55.5 GB/step, llama 40% util vs vLLM 62% util -> utilization-bound."
My per-kernel data LOCALIZES that utilization gap: it lives in the **FP4 matmul kernels** (which
do the bulk of the ~15 GB weight read), NOT in the GDN state traffic. GDN moves its (equal) state
bytes at comparable rate on both engines (4.03 vs 3.62 ms/call). So the "40% vs 62%" is the
`mul_mat_vec_q`/`mul_mat_q` weight-read efficiency vs cutlass FP4 GEMM. Raising decode parity =
raise the FP4-matmul achieved bandwidth (tensor-core GEMM routing + act/norm prologue fusion),
not the GDN kernel and not byte-cutting.
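
The "40% vs 62%" utilization figures follow from the byte volume and the measured step times. A small sketch, assuming the 55.5 GB/step from `weight-bandwidth` and the 273 GB/s peak used earlier in this section (function names are illustrative):

```python
def achieved_bandwidth_gb_s(bytes_gb_per_step: float, step_ms: float) -> float:
    """Average DRAM bandwidth implied by moving bytes_gb_per_step in step_ms."""
    return bytes_gb_per_step / (step_ms / 1000.0)


def utilization(bytes_gb_per_step: float, step_ms: float, peak_gb_s: float) -> float:
    """Achieved bandwidth as a fraction of peak."""
    return achieved_bandwidth_gb_s(bytes_gb_per_step, step_ms) / peak_gb_s


if __name__ == "__main__":
    BYTES_GB = 55.5   # same byte volume per step for both engines
    PEAK = 273.0      # GB/s
    for engine, step_ms in [("llama", 510.0), ("vLLM", 326.0)]:
        bw = achieved_bandwidth_gb_s(BYTES_GB, step_ms)
        util = utilization(BYTES_GB, step_ms, PEAK)
        print(f"{engine:5s} {bw:6.1f} GB/s = {util * 100:4.1f}% of peak")
```

With equal bytes per step, utilization is inversely proportional to step time, so any kernel-time saving in the FP4 matmul path shows up directly as higher achieved bandwidth.
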
### Verdict (profiler)
- Reproduced both engines at their true operating points: llama 245 / vLLM 394 eager / 420 graphs.
Graphs are not the lever (+6%). Both engines ~98% GPU-busy; gap is GPU kernel-time, not idle/host.
- GDN compute is comparable (llama +11%/call, +19 ms/step) - NOT the dominant residual.
- bytes/step: llama does not read more (131 vs 85 MB memcpy; identical weight bytes); SSM fix's
18 GB/step DtoD removal CONFIRMED in-trace.
- **The single biggest llama-specific overage is the FP4 matmul path: 236 vs 117 ms/step (+119 ms
= 64% of the 185 ms gap), dominated by `mul_mat_vec_q` (FP4 GEMV at batch 128, 132 ms/step, 26%,
one per GDN layer).** Highest-value lever = route the GDN in-projection through a tensor-core FP4
GEMM at M=128 + fuse act-quant/norm prologue (L1/L2). Track B optimized the wrong FP4 kernel.
### Evidence (DGX, this agent)
- `~/bench/postssm_decomp/postssm_base.{nsys-rep,sqlite,gpu_trace.csv,run.log}` (llama post-SSM).
- `~/bench/postssm_decomp/vllm_decode.{nsys-rep,gpu_trace.csv}` (vLLM eager decode trace).
- `~/bench/postssm_decomp/vllm_decode_g1.*` (vLLM graphs run), `~/bench/vllm_tps.py` (throughput).
- Scripts: `~/bench/postssm_llama_decomp.sh`, `~/bench/vllm_nsys_run.sh`, `~/bench/decode_decomp2.py`
(decode-only windowed, overlap-correct, MB-memcpy, per-step reconstruction).
Assisted-by: Claude:opus-4.8 [Claude Code]