diff --git a/.github/scripts/paged-canary-apply.sh b/.github/scripts/paged-canary-apply.sh index c8ec8c8a3..548e29249 100755 --- a/.github/scripts/paged-canary-apply.sh +++ b/.github/scripts/paged-canary-apply.sh @@ -28,7 +28,7 @@ # build on 0019's code, the rejection cascades to them too. This is a # PRE-EXISTING shipped-series defect, present identically on every pin, NOT an # upstream break (see backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md -# and PIN_BUMP_APPLY_CHECK.md). We exclude ONLY that dev-doc path and still +# and README.md). We exclude ONLY that dev-doc path and still # apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019 # still fails the canary. prepare.sh tolerates the same hunk via # `patch ... || true`; this mirrors that tolerance precisely. diff --git a/backend/cpp/llama-cpp/grpc-server.cpp b/backend/cpp/llama-cpp/grpc-server.cpp index 3c45302e5..05b5ea0b7 100644 --- a/backend/cpp/llama-cpp/grpc-server.cpp +++ b/backend/cpp/llama-cpp/grpc-server.cpp @@ -850,7 +850,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt // common_context_params_to_llama (patch 0026) only when the --ssm-bf16-tau CLI flag is unset. // Unset / non-positive => env untouched, so stock stays byte-identical and bit-exact (an // externally exported LLAMA_SSM_BF16_TAU still works as an escape hatch). NOTE: this mode is - // NOT bit-exact (~91% same-top-p ceiling); see patches/paged/A_HYBRID_SSM_RESULTS.md. + // NOT bit-exact (~91% same-top-p ceiling); see patches/paged/README.md (Dev notes). 
} else if (!strcmp(optname, "ssm_bf16_tau") || !strcmp(optname, "ssm_hybrid_tau")) { if (optval != NULL) { try { diff --git a/backend/cpp/llama-cpp/patches/README.md b/backend/cpp/llama-cpp/patches/README.md index 99fa0b69a..3748e9dd2 100644 --- a/backend/cpp/llama-cpp/patches/README.md +++ b/backend/cpp/llama-cpp/patches/README.md @@ -57,7 +57,7 @@ All variants (avx/avx2/avx512/cuda/…) copy the patched `llama.cpp/` tree, so t - **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing. - **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form - (`ADDITIVE_DESIGN.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index + (see `paged/README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B, diff --git a/backend/cpp/llama-cpp/patches/paged/0003-gather-read-plan.md b/backend/cpp/llama-cpp/patches/paged/0003-gather-read-plan.md deleted file mode 100644 index a4356fa4a..000000000 --- a/backend/cpp/llama-cpp/patches/paged/0003-gather-read-plan.md +++ /dev/null @@ -1,102 +0,0 @@ -# Patch 0003 — paged gather-read: exact implementation plan - -**Goal:** a sequence attends only its own (compacted) cells via `ggml_get_rows`, instead of the scattered -`[0,n_kv)` window. Token-identical (attention is permutation-invariant over the KV set). **Gated**: stock -path stays byte-identical (no new ops unless `LLAMA_KV_PAGED`). - -**Base:** applies on top of 0001+0002 at the pin. Dev tree: `backend/cpp/llama-cpp-paged-dev` (branch `paged`). 
-
-## Design
-
-The gather is keyed off one runtime index list (the sequence's used cells, in a fixed order), exposed as a
-graph input (mirroring `k_idxs`). In `build_attn`, gather K, V **and the kq_mask** by that same index, so all
-three stay aligned. `n_gathered` replaces `n_kv` for the attention. Only active when the cache is in paged
-mode (a new `is_paged()` flag set when `LLAMA_KV_PAGED`/find_slot used permuted placement).
-
-ggml note: `ggml_get_rows(a,b)` gathers `a`'s **ne1** by `b` (I32). Raw K is `[n_embd_k_gqa, kv_size, n_stream]`
-→ ne1 = cells → direct. The mask is `[n_kv, n_tokens, 1, n_stream]` → n_kv is **ne0**, so gather as
-`transpose → get_rows → transpose`.
-
-### KEY CORRECTIONS (found while implementing — these change the edits)
-
-1. **Gather index = ALL used (non-empty) cells in `[0,n_kv)`, NOT `sinfo.idxs`.** `sinfo.idxs` is only the
-   *current ubatch's write slots*; attention reads the *full history*. The query set per token is masked by
-   `kq_mask`, so gathering the union of all used cells + gathering the mask the same way is token-identical
-   and drops exactly the empty (already-masked) cells. So: `gather = { i in [0,n_kv) : !cells.is_empty(i) }`.
-
-2. **Static-graph size is fine because llama.cpp rebuilds the graph every ubatch.** `n_gather` (used-cell
-   count) is therefore a build-time constant for that ubatch — `build_input_gather_idxs` sizes the I32
-   tensor to `get_n_gather()` computed at build, `set_input_gather_idxs` fills the identical cell list. They
-   MUST use the same loop (`for i in [0,n_kv): if !is_empty(i) push i`) so build-order == fill-order.
-
-3. **K/V gather can live entirely in `build_attn`, no cache get_k change.** The `get_k` 4d view is contiguous
-   in `[ne0,ne1,ne2]` from cell 0 (nb2 == n_embd_head*n_head_kv*elemsz), so for **single stream (ns==1)**:
-   `reshape_3d(k, n_embd_head*n_head_kv, n_kv, 1) → get_rows(., gi) → reshape_4d(., n_embd_head, n_head_kv, n_gather, 1)`.
-   Multi-stream (ns>1) breaks contiguity (nb3 uses kv_size) → gate to ns==1 first, multi-stream follow-up.
-
-4. So the ONLY cache additions are `is_paged()`, `get_n_gather(n_kv)`, `build/set_input_gather_idxs(n_kv)`;
-   everything else (K/V/mask gather) is in `build_attn`. `set_input_kq_mask` is **unchanged** (built over
-   n_kv, then gathered). Smaller than the 7-edit estimate above.
-
-## Edits
-
-### 1. `src/llama-kv-cache.h` — declare gather infra (in `llama_kv_cache`)
-```cpp
-    bool is_paged() const { return paged_active; } // near get_size()
-    ggml_tensor * build_input_gather_idxs(ggml_context * ctx, const slot_info & sinfo) const;
-    void set_input_gather_idxs (ggml_tensor * dst, const slot_info & sinfo) const;
-    uint32_t get_n_gather(const slot_info & sinfo) const; // == sum of used cells gathered
-```
-Add member `mutable bool paged_active = false;` and in `llama_kv_cache_context` forward the three (like
-`build_input_k_idxs`/`get_n_kv`).
-
-### 2. `src/llama-kv-cache.cpp`
-- In `find_slot`, in the paged branch (0002), set `paged_active = true;` on success.
-- `get_n_gather(sinfo)` = `sinfo.idxs[0].size()` summed over streams (the count actually placed).
-- `build_input_gather_idxs`: `ggml_new_tensor_1d(ctx, GGML_TYPE_I32, get_n_gather(sinfo)); ggml_set_input(...)`.
-- `set_input_gather_idxs`: fill `data[k++] = strm_off + sinfo.idxs[s][i]` for every placed cell (same order
-  the mask/k/v will see). This is the canonical gather order.
-
-### 3. `src/llama-graph.h` — `llm_graph_input_attn_kv`
-Add `ggml_tensor * gather_idxs = nullptr;` + `ggml_tensor * get_gather_idxs() const { return gather_idxs; }`.
-
-### 4. `src/llama-graph.cpp`
-- `llm_graph_input_attn_kv::set_input`: if `mctx->is_paged()` → `mctx->set_input_gather_idxs(gather_idxs, ...)`.
-- `build_attn_inp_kv` (creates the input): if `mctx_cur->is_paged()` → `inp->gather_idxs =
-  mctx_cur->build_input_gather_idxs(ctx0, ...)`.
-- `build_attn` (the kv overload, ~2356): after `k`,`v`,`kq_mask`:
-```cpp
-if (ggml_tensor * gi = inp->get_gather_idxs()) {
-    k = ggml_get_rows(ctx0, k, gi); // [d, n_gather, ...] (reshape view ok)
-    v = v_trans ? /* gather columns */ : ggml_get_rows(ctx0, v, gi);
-    ggml_tensor * m = ggml_cont(ctx0, ggml_transpose(ctx0, kq_mask)); // [n_tokens, n_kv]
-    m = ggml_get_rows(ctx0, m, gi); // [n_tokens, n_gather]
-    kq_mask = ggml_cont(ctx0, ggml_transpose(ctx0, m)); // [n_gather, n_tokens]
-}
-ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
-```
-Note: `get_k` returns the reshaped 4d view; gather must run on a cell-major shape. Simplest: add a paged
-variant `get_k(ctx,il)` that returns `ggml_get_rows` of the **raw** `layers[ikv].k` then reshapes to
-`[n_embd_head, n_head_kv, n_gather, ns]`. Do the gather in the cache, not the graph, for K/V; keep only the
-mask gather in the graph. (Cleaner — revisit during impl.)
-
-### 5. V-transposed path
-When `!flash_attn`, V is stored transposed `[kv_size, n_embd_v_gqa]`; gather its **rows** (ne1 = n_embd) won't
-work — gather columns via the same idx on the non-transposed store, OR force `is_paged()` to require
-flash-attn for the first cut (`GGML_ASSERT`) and handle v_trans in a follow-up.
-
-## Verification (the gate)
-```sh
-cmake --build build-cpu --target llama-simple -j
-M=Qwen3-0.6B.Q4_K_M.gguf ; P=""
-build-cpu/bin/llama-simple -m $M -n 64 "$P" > a.txt # stock
-LLAMA_KV_PAGED=1 build-cpu/bin/llama-simple -m $M -n 64 "$P" > b.txt # paged gather-read
-diff a.txt b.txt # MUST be identical
-```
-Also assert (debug) that `n_gather < n_kv` on a multi-chunk sequence (proves compaction, not identity).
-Export only when identical: `git format-patch HEAD~1 -o patches/ --start-number 3 -N`.
-
-## Risks
-- Mask transpose/layout: if `b.txt` diverges, dump the gathered mask vs expected for token 0; off-by-order
-  means the `set_input_gather_idxs` order ≠ the get_k gather order — they MUST use the identical loop.
-- flash-attn vs not: do flash-attn first (simpler mask), then v_trans.
diff --git a/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md b/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md
deleted file mode 100644
index a0fd5cb5c..000000000
--- a/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md
+++ /dev/null
@@ -1,347 +0,0 @@
-# A.2 - CUDA-graphing the paged decode: measured lever + gap diagnosis
-
-Phase 1 (measure, do not punt). DGX GB10 (sm_121), dev tree `~/llama-paged-dev`
-HEAD 089f78d (patch 0017), `build-cuda`. Model `q36-27b-nvfp4.gguf` (dense),
-harness `llama-batched-bench`, fusion held OFF (`LLAMA_FUSE_NVFP4_QUANT=0`) for a
-clean stock-kernel baseline. `decode_agg` = the `S_TG t/s` column.
-
-## TL;DR verdict
-
-CUDA-graphing the paged decode is **NOT a real throughput lever** (ceiling well
-under 1%). The steady decode step is **GPU-compute-bound: 99.4-99.5% GPU-busy**.
-Total GPU idle is ~0.5-0.6% of the step, split into within-step launch gaps
-(0.37%, the only thing CUDA graphs remove) and a between-step host-loop gap
-(0.24%, one ~2 ms gap per step). Graphs already engage on the default paged
-decode and do collapse the launch gaps (0.37% -> 0.11%), but the GPU stays
-99.4-99.5% busy either way, so decode_agg is unchanged. The 2.6x gap to vLLM
-(148 vs 391) lives in the per-step GPU **kernel work** (FP4 GEMM + attention at
-batch 128), not in launch overhead or the host loop.
-
-The premise that "the paged decode runs eager (graphs reused=0)" did not survive
-measurement: at the benchmarked context the default paged decode captures and
-replays graphs exactly like stock non-paged. Two measurement traps (below)
-explain the earlier "reused=0 / gap-bound" reading.
-
-## Method note: a graph-enable trap that was corrected
-
-`GGML_CUDA_DISABLE_GRAPHS` is read with `getenv(...) != nullptr`
-(`ggml/src/ggml-cuda/common.cuh:1234`), so setting it to an **empty** string
-still disables graphs. A first 4-cell pass that used
-`GGML_CUDA_DISABLE_GRAPHS=""` for the "graphs ON" cells therefore ran graphs OFF
-in all four cells (an OFF-vs-OFF comparison). The numbers below ("v2") unset the
-variable with `env -u` for the ON cells. The `-lv 99` probe is unaffected (it
-never set the variable).
-
-## Step 1 - the 4-cell decode_agg table (corrected, graphs genuinely enabled)
-
-npp 128, ntg 128, npl 32 and 128, c 40960, b/ub 2048, fa on. `S_TG t/s`:
-
-| cell             | npl 32  | npl 128 |
-|------------------|---------|---------|
-| stock_graphon    | 116.47  | 148.41  |
-| stock_graphoff   | 115.17  | 148.21  |
-| paged_graphon    | 116.21  | 148.60  |
-| paged_graphoff   | 114.62  | 147.65  |
-
-ON vs OFF (the graph win):
-
-| config | npl 32 | npl 128 |
-|--------|--------|---------|
-| stock  | +1.13% | +0.13%  |
-| paged  | +1.39% | +0.64%  |
-
-- (a) Does STOCK get a graph win? Essentially no: +0.13% at npl 128, +1.13% at
-  npl 32 (small-batch, where per-kernel launch overhead is relatively larger).
-  All within run-to-run noise (~1% at npl 32, ~0.2% at npl 128).
-- (b) Does PAGED get a graph win? Same picture: +0.64% / +1.39%. Paged is NOT
-  eager at this config (see Step 2); it captures graphs like stock.
-- (c) LEVER SIZE (proxy = stock graph win, now genuinely measured): +0.13% at
-  npl 128, +1.1% at npl 32. Negligible vs the 2.6x (=+164%) gap to vLLM.
-
-All four cells sit at ~148 (npl 128) / ~115 (npl 32) within ~1%. The ~148 wall is
-shared by stock and paged; it is not paged-specific. Calibration cross-check
-(paged ON, ntg 64): 147.64, matching the reference 148-149.
-
-## Step 2 - why the "eager" premise is wrong, and what actually mutates
-
-CUDA-graph state machine (`ggml_backend_cuda_graph_compute` in
-`ggml/src/ggml-cuda/ggml-cuda.cu`): warmup completes after a step whose node
-properties did not change vs the previous step; any later change logs
-`CUDA graph warmup reset` and reverts to eager until stable again.
-`ggml_cuda_graph_update_required` memcmps every node's full `ggml_tensor` plus
-each src's `data` ptr / `ne` / `nb`.
-
-`-lv 99` probe, short context (npp 64, ntg 32, ctx <= 96):
-- stock: `warmup complete` x2, `warmup reset` x0.
-- paged: `warmup complete` x2, `warmup reset` x0.
-Both capture and then replay silently. The `CUDA Graph id N reused` line stays 0
-for both because llama rebuilds the cgraph each ubatch (new `cgraph->uid`), so
-the uid fast-path never fires; the graph is still replayed via the
-`instance != nullptr` path, which logs nothing. **"reused=0" is a false negative,
-not evidence of eager execution.** (Trap #1.)
-
-Cadence probe (npp 200, ntg 320, npl 4, ctx 200->520, crosses the 256 and 512
-token boundaries), counts over ~320 decode steps:
-
-| path                          | complete | reset | interpretation                |
-|-------------------------------|----------|-------|-------------------------------|
-| paged in-kernel (default)     | 10       | 8     | resets only at 256-boundaries |
-| paged gather (KV_PAGED_GATHER)| 0        | 0     | never captures -> pure eager  |
-| stock non-paged               | 10       | 8     | identical 256-cadence         |
-
-The 8 resets cluster at the two boundary crossings (timestamps ~9.9 s and ~34 s),
-not per-step. The default paged decode is therefore captured for ~97% of steps,
-re-warming only every ~256 tokens, with the **same cadence as stock**.
-
-What mutates (the block-table / gather input):
-- in-kernel decode (default): the block-table graph input
-  `idx = ggml_new_tensor_2d(ctx0, I32, n_view, n_stream)` with
-  `n_view = GGML_PAD(n_gather, 256)` (`src/paged-attn.cpp:199,213`).
Its `ne[0]`
-  steps 256 -> 512 -> 768 only when the context crosses a 256-token boundary. The
-  kq_mask input (ne0 = n_kv, also padded to 256) steps in lockstep. So the
-  property change is per-256-tokens, not per-step.
-- gather fallback (`LLAMA_KV_PAGED_GATHER=1`, transposed-V, or prefill): the
-  index input `idx = ggml_new_tensor_2d(ctx0, I32, n_gather, n_stream)`
-  (`src/paged-attn.cpp:106`) has `ne[0] = n_gather` (UNPADDED), which grows every
-  step (the unit's own comment, `src/paged-attn.cpp:28-30`: "n_gather grows every
-  step"). That changes a node property every step, warmup never completes, and
-  the path runs pure eager. This is the only "graphs reused=0" path, and it is
-  not the default decode path.
-
-`LLAMA_KV_PAGED_DEBUG` dump at ctx 201 (first 2 decode calls, identical across
-the pair): `in-kernel decode n_stream=4 n_kv=256 n_gather=201` -> block-table
-`ne[0] = GGML_PAD(201,256) = 256`, stable until n_gather crosses 256.
-
-## Step 3 - where the step time goes (nsys), and a second trap
-
-npl 128, ntg 24, ctx 56 (< 256, so the ON run stays captured after warmup).
-Idle split by gap size: within-step launch gaps < 1 ms, between-step host gaps
->= 1 ms. Steady window = 40%-97% of the trace span (excludes model load / graph
-reserve / prefill one-offs).
-
-Trap #2: `nsys --trace=cuda` does NOT emit the kernels INSIDE a replayed CUDA
-graph into `cuda_gpu_trace` by default. The graphs-ON trace had only 15,279 GPU
-rows vs 84,946 for the identical OFF workload and reported a bogus 0.3% GPU-busy.
-Re-profiling the ON case with `--cuda-graph-trace=node` restored all 84,946 rows
-and 99.5% busy. **Any "decode is idle/gap-bound" reading taken from a graphs-ON
-nsys trace without `--cuda-graph-trace=node` is artifactually idle-inflated** -
-the likely source of the earlier "freed GPU time became idle gaps" conclusion.
-
-Reliable steady-state numbers:
-
-| trace                          | GPU rows | busy   | within-step idle | between-step idle | host gap/step |
-|--------------------------------|----------|--------|------------------|-------------------|---------------|
-| OFF (eager)                    | 84,946   | 99.4%  | 0.37%            | 0.24%             | ~2.0 ms       |
-| ON (captured, node-trace)      | 84,946   | 99.5%  | 0.11%            | 0.38%             | ~1.9 ms       |
-
-- CUDA graphs replay (cudaGraphLaunch=46) and collapse the launch path: ON has
-  ~15k kernel launches/run vs OFF ~80k (cudaLaunchKernel 6,024 vs 31,764, plus
-  ExC 9,049 vs 48,165). That cuts within-step launch idle from 0.37% to 0.11%.
-- But the GPU is 99.4-99.5% busy in both, so decode_agg is unchanged.
-- Between-step host idle is one ~2 ms gap per decode step (the 128-way sample +
-  update_slots + batch build), 0.24-0.38% of the ~896 ms step.
-
-Per-step decomposition at npl 128: ~896 ms/step, of which ~890 ms is GPU kernel
-compute, ~2 ms host-loop gap, ~3 ms (eager) / ~1 ms (captured) launch gaps.
-
-## The load-bearing question, answered
-
-Within-step or between-step? **Neither is large.** The steady decode is 99.4%
-GPU-busy; the entire idle budget is ~0.6% of the step. CUDA graphs already remove
-the within-step launch fraction (0.37% -> 0.11%), and the between-step host gap is
-~2 ms/step (0.24%). There is no large gap for a host-loop rewrite to reclaim
-either; the host loop is currently **hidden under GPU compute** (the GPU stays
-busy while the host syncs/schedules). It would only become a lever once the
-kernels are fast enough to drop GPU-busy below the host time, i.e. it is a
-second-order floor, not the present bottleneck.
-
-## Verdict
-
-1. CUDA-graphing the paged decode is not the lever. Graphs already engage on the
-   default decode; capturing reduces within-step launch idle from 0.37% to 0.11%
-   but leaves the GPU 99.4-99.5% busy, so decode_agg moves by ~0% (measured
-   +0.1% to +0.6% at npl 128, +1.1% to +1.4% at npl 32, all within noise).
-2.
The between-step host loop is not the present lever either (0.24%, ~2 ms/step,
-   hidden under GPU compute). It is the candidate floor only after the kernels
-   speed up.
-3. The decode is GPU-compute-bound at this NVFP4 fusion-OFF baseline. The 2.6x
-   gap to vLLM is in the per-step GPU kernel work (FP4 GEMM + attention at batch
-   128). That, not graphs and not the host loop, is the throughput lever.
-4. Corrected premises: paged is not perpetually eager (it captures with a
-   256-token reset cadence identical to stock); "graphs reused=0" was a uid
-   fast-path false negative; and a graphs-ON nsys trace under-counts GPU-busy
-   unless `--cuda-graph-trace=node` is set.
-
-No code patch in Phase 1 (graphs are not the lever, so there is no paged
-graph-capture patch to land). Evidence: `~/bench/a2_4cell_v2/`, `~/bench/a2_probe`,
-`~/bench/a2_probe2`, `~/bench/a2_nsys/*.nsys-rep` on the DGX.
-
-# Phase 2 - the real decode lever, located (per-kernel decomposition)
-
-Phase 1 ended on "decode is GPU-compute-bound; the 2.6x gap to vLLM lives in the
-per-step GPU kernel work (FP4 GEMM + attention at batch 128)." Phase 2 measured
-that per-step GPU work directly - per kernel and per memcpy, on the Phase-1 nsys
-`.sqlite` reps - and the "FP4 GEMM + attention" attribution does not survive the
-measurement. Two corrections, then the lever.
-
-The conditional Phase 2 fix (make the paged decode graph-capturable) is moot:
-Phase 1 already showed the default paged decode captures, and the fresh re-check
-below reconfirms the graph win is noise. Neither Phase 2 branch (within-step graph
-fix / between-step host loop) is the lever; the lever is a third thing, measured
-here.
-
-## Fresh re-confirmation: graphs are not the lever
-
-Independent run (npl128, ntg32, paged, fusion off), not reusing Phase 1's table:
-
-| paged decode  | S_TG t/s | vs vLLM 391 |
-|---------------|----------|-------------|
-| graphs ON     | 146.03   | 37.3%       |
-| graphs OFF    | 144.90   | 37.1%       |
-
-+0.78%, within noise - same verdict as Phase 1's 4-cell. The ON nsys rep is also
-99.5% busy with the same ~3267 ms of memcpy as OFF: graphs capture the memcpy
-nodes too, so they cannot remove either the copies or the compute.
-
-## Correction 1: the model is a hybrid SSM, not a plain transformer
-
-`q36-27b-nvfp4.gguf` has `general.architecture = qwen35` with
-`qwen35.ssm.{conv_kernel,state_size,group_count,time_step_rank,inner_size}`. The
-decode-window kernel cadence (per step, ~19.8 steps in the window) is 48
-`gated_delta_net_cuda` + 48 `ssm_conv_f32` vs 16 `flash_attn_tile`, i.e. **48
-gated-DeltaNet linear-attention layers : 16 full-attention layers** (a 3:1
-hybrid, Qwen3-Next family). Paged attention only touches the 16 full-attention
-layers.
-
-## Correction 2: the 99.4% "busy" is ~19% D2D memcpy, not compute
-
-Interval-union sweep over the steady decode window (last 17 s of the npl128/ntg24
-OFF rep; single CUDA stream; running-max-end so it is overlap-correct):
-
-| activity set           | GPU busy | idle  |
-|------------------------|----------|-------|
-| kernels only           | 80.2%    | 19.8% |
-| kernels + memcpy (all) | 99.4%    | 0.6%  |
-
-The 969 inter-kernel gaps (>=1 ms, ~48/step) that drop kernels-only to 80% are
-filled by **D2D memcpy: 1584 copies/run (~80/step), ~230 MB each, ~2 ms each,
-356 GB moved in 17 s**. At batch 128 a ~230 MB block is the gated-DeltaNet
-recurrent state; these are the per-SSM-layer state copies. (HtoD copies = the
-paged block-table/index upload: 731/run but only 3 ms total, negligible; DtoH
-47 ms.)
Phase 1's `cuda_gpu_trace`-based 99.4% counted these memcpys as "busy"
-and lumped them into "GPU kernel compute" - they are memory movement, and they
-are addressable.
-
-## Decode GPU-time decomposition (% of kernel+memcpy busy)
-
-OFF/eager rep, steady window. `/step` = instances per decode step.
-
-| share | activity                          | /step | role                          |
-|-------|-----------------------------------|-------|-------------------------------|
-| 23.4% | gated_delta_net_cuda              | 48    | linear-attn recurrence        |
-| 21.9% | k_get_rows_float                  | 97    | SSM state / conv-state gather |
-| 18.9% | MEMCPY DtoD                       | 80    | SSM recurrent-state copy      |
-| 15.5% | mul_mat_vec_q (nvfp4, ncols=1)    | 48    | FP4 GEMV                      |
-| 10.4% | mul_mat_q (nvfp4)                 | 352   | FP4 GEMM                      |
-| 1.9%  | quantize_mmq_nvfp4                | 448   | act requant for MMQ           |
-| 1.0%  | concat_cont                       | 48    | SSM state glue                |
-| 0.8%  | ssm_conv_f32                      | 48    | SSM short conv                |
-| 0.7%  | unary_gated_op silu               | 112   | SSM gating                    |
-| 0.4%  | flash_attn_tile/_ext              | 16    | FULL attention (paged)        |
-
-Grouped:
-- gated-DeltaNet / SSM machinery (recurrence + get_rows gather + DtoD state copy
-  + conv + gating glue): **~67% of decode**.
-- FP4 matmul (GEMV + GEMM + requant + stream-k fixup): **~28%**.
-- Full attention - everything paged attention optimizes: **~0.4%**.
-
-## Verdict and scope of the real lever
-
-1. CUDA graphs: not the lever (Phase 1, re-confirmed: +0.78%, noise). They capture
-   the memcpy too, so they cannot touch the copies or the compute.
-2. Host loop: not the lever (true host idle in the union is 0.24%, ~41 ms/17 s).
-3. FP4 GEMM: secondary, ~28%. Consistent with Track B P2a (making the FP4 GEMM 26%
-   faster left decode_agg flat) - it was never the long pole.
-4. Paged / full attention: ~0.4% of decode. **No paged-attention change (graphs,
-   block-table stabilization, gather rewrite) can move decode_agg on this model**
-   - it optimizes under half a percent of the step.
This is the structural reason
-   A.2, and the paged-decode track generally, cannot close the vLLM gap on
-   q36-27b: the model barely uses the path being optimized.
-
-The throughput lever is the ggml **qwen35 gated-DeltaNet decode**. Per SSM layer
-per step it re-materializes and D2D-copies the full recurrent state (~230 MB at
-batch 128; ~80 copies/step, ~18 GB/step) and feeds the recurrence through ~2
-`get_rows` gathers, so ~61% of decode (state copy + state gather + recurrence) is
-SSM state plumbing. vLLM's gated-DeltaNet decode (the flash-linear-attention
-`fused_recurrent_gated_delta_rule` path) keeps the state in place and fuses the
-gather into the scan, avoiding both the per-layer D2D copy and the gathers.
-
-Next-step scope (the real lever, to be done in the ggml/llama qwen35 SSM path -
-not paged-attn, not a graph capture, not a block-table tweak):
-1. Eliminate the per-layer recurrent-state D2D copy: update the state tensor
-   in place (or double-buffer / write-back), so the recurrence consumes and
-   produces the persistent state without a full-state copy each layer each step.
-2. Fuse the `get_rows` state / conv-state gather into the recurrent kernel.
-
-Ceiling from this rep (upper bound; assumes the work is fully removed, not just
-overlapped):
-- remove the DtoD state copy: reclaim 18.9% -> ~146 to ~180 t/s.
-- remove copy + gather: reclaim ~41% -> ~146 to ~247 t/s, which puts llama within
-  ~1.6x of vLLM 391 with the FP4 GEMM still untouched.
-
-No code patch in Phase 2 either: the lever is a gated-DeltaNet decode rewrite in
-the SSM path, too large for this measurement pass and orthogonal to paged
-attention. `patches/paged/0018` stays free. Evidence on the DGX:
-`~/bench/a2_decompose/decode_decomp.txt` (per-kernel table + reproducing SQL in
-its header), `~/bench/a2_decompose/SUMMARY.txt`, and the Phase-1 reps
-`~/bench/a2_nsys/paged_off_npl128.sqlite` / `paged_on_npl128_node.sqlite`.
-
-# A.2 final synthesis - the four-point verdict
-
-All numbers measured on the DGX (GB10, sm_121, q36-27b-nvfp4 dense, fusion OFF,
-`decode_agg` = `S_TG t/s`), npl 128 unless noted.
-
-**1. CUDA-graph lever size (measured, not guessed).** +0.13% (4-cell, stock
-ON-vs-OFF) to +0.78% (fresh paged re-check) at npl 128; +1.1% to +1.4% at npl 32.
-All inside run-to-run noise. The earlier grounding GUESSED ~10-20% from a
-94.6%-busy reading; direct measurement puts the steady decode at 99.4-99.5% busy,
-so the real graph ceiling is < 1%, not 10-20%. The guess was wrong because the
-busy-fraction it rested on was under-read (a graphs-ON nsys trace under-counts
-GPU-busy unless `--cuda-graph-trace=node` is set - trap #2).
-
-**2. Was "paged decode runs eager" fixed, and what is the decode_agg win?**
-There was nothing to fix: the premise was false. At the benchmarked context the
-DEFAULT in-kernel paged decode already captures and replays graphs, with a
-256-token reset cadence identical to stock non-paged (10 complete / 8 reset over
-~320 steps, resets clustered only at the 256/512 token boundaries). "graphs
-reused=0" was a uid fast-path false negative, not eager execution (trap #1). The
-only genuinely-eager path is the `LLAMA_KV_PAGED_GATHER=1` fallback (unpadded
-index grows every step), which is not the default decode. Because graphs were
-already engaged, the decode_agg win from "enabling" them is ~0 (+0.1% to +0.8%).
-Graphs DID collapse within-step launch idle (0.37% -> 0.11%, ~80k -> ~15k
-launches/run), but the GPU stays 99.4-99.5% busy, so throughput is unchanged.
-
-**3. New llama %-of-vLLM @npl128.** Unchanged by A.2: 146-148.6 t/s vs vLLM 391 =
-**37.3-38.0%**. Graphs ON vs OFF both land here (146.03 / 144.90 in the fresh
-re-check; 148.41 / 148.21 in the 4-cell). A.2 did not move the percentage.
-
-**4. Honest verdict - did A.2 move toward parity; residual + next lever.** No.
-A.2 closed zero of the 2.6x gap, and it provably cannot on this model: paged /
-full attention is ~0.4% of decode (16 full-attention layers vs 48 gated-DeltaNet
-layers, a 3:1 hybrid SSM), so no graph / block-table / gather change to the paged
-path can move decode_agg. The residual gap is structural and lives elsewhere:
-~67% of decode is gated-DeltaNet / SSM state plumbing (23.4% recurrence + 21.9%
-get_rows state gather + 18.9% D2D recurrent-state copy of ~230 MB per SSM layer
-per step, ~18 GB/step), and ~28% is FP4 matmul (already shown secondary by Track
-B: a 26%-faster GEMM left decode_agg flat). The within-step launch loop is solved
-(graphs) and the between-step host loop is a 0.24% second-order floor hidden under
-GPU compute - neither is the residual.
-
-The next lever is NOT in this track. It is the ggml qwen35 gated-DeltaNet decode:
-(1) eliminate the per-layer recurrent-state D2D copy (in-place / double-buffer
-write-back), and (2) fuse the get_rows gather into the recurrent kernel - mirroring
-vLLM's `fused_recurrent_gated_delta_rule`, which keeps the state in place and
-fuses the gather. Measured ceiling on this rep: remove the copy -> ~146 to ~180
-t/s; remove copy + gather -> ~146 to ~247 t/s (within ~1.6x of vLLM with FP4 GEMM
-still untouched). That work is orthogonal to paged attention; `patches/paged/0018`
-stays free.
diff --git a/backend/cpp/llama-cpp/patches/paged/ADDITIVE_DESIGN.md b/backend/cpp/llama-cpp/patches/paged/ADDITIVE_DESIGN.md
deleted file mode 100644
index c74e63c05..000000000
--- a/backend/cpp/llama-cpp/patches/paged/ADDITIVE_DESIGN.md
+++ /dev/null
@@ -1,107 +0,0 @@
-# Additive layout for the paged-KV patch series - "hook, don't edit"
-
-Goal: ship paged KV as a vendored patch series that **survives llama.cpp pin bumps with
-minimal rebase pain**. PR #22569 (the upstream draft) was rejected by maintainers as
-"slop" and is far too invasive to vendor - it rewrites core attention.
Our series must be
-the opposite: **additive**. This document is the design rule and the per-patch core-touch
-budget.
-
-## The rule
-
-> Every change is either (a) **new code in a new vendored file** under `src/`, or (b) a
-> **single, env-gated hook** at one call site in a core file that delegates to the new
-> file. No logic lives in a core file. No core struct/signature is edited.
-
-Why it works: a hook is a 1-3 line diff against a core file. When upstream churns that file,
-`git apply` either still lands the hook (context unchanged) or fails *only on that tiny
-hunk*, which is trivial to re-place. Logic embedded inside a core function (the PR #22569 /
-old-0003 approach) conflicts on every bump and must be re-understood each time.
-
-This is enforceable as a **core-touch budget**: each patch declares the core files it
-touches and the line count; review rejects anything that grows logic in core.
-
-## Why it's achievable here (grounded in the pinned source)
-
-The two seams paged KV needs are both already abstract in llama.cpp at the pin
-(`LLAMA_VERSION=f3e1828`), so new behavior plugs in without editing core types:
-
-- **KV placement** - `llama_kv_cache::find_slot` already returns a `slot_info` of physical
-  cell indices. Paged placement is just *different indices*. 0002 already does this as one
-  gated block (`if (paged_mode) { ... continue; }`, 41 lines, one file). Ideal.
-- **Graph inputs** - `llm_graph_input_i` is a pure-virtual base (`set_input()`), and
-  `llm_graph_result::add_input(llm_graph_input_ptr)` lets *any* code register a new input
-  subclass. So a paged graph input (the gather index) can be **a new class in a new file**,
-  added from a one-line hook - no edit to `llm_graph_input_attn_kv` or `llama-graph.h`.
-
-## Per-patch core-touch budget
-
-| # | Patch | New files (additive) | Core hooks (gated, minimal) | Core lines |
-|---|-------|----------------------|------------------------------|-----------:|
-| 0001 | vendor manager | `paged-kv-manager.{h,cpp}` | `CMakeLists.txt` +1 | 1 |
-| 0002 | block placement | - | one `if(paged_mode){...continue;}` in `find_slot` | ~41 |
-| 0003 | gather-read | `paged-attn.{h,cpp}` | `CMakeLists.txt` +1; **one** hook in `build_attn`; 2 tiny accessors on `llama_kv_cache_context` | ~8 |
-| 0004 | on-demand alloc | (uses 0001 manager) | one branch in `find_slot` calling the manager | ~10 |
-| 0005 | continuous batching | - | **LocalAI `grpc-server.cpp`** (already a LocalAI override, not a core patch) | 0 core |
-| 0006 | prefix caching | (uses 0001 manager) | one hash-lookup hook in the 0004 alloc branch | ~6 |
-
-Net core surface for the *entire* engine: `find_slot` (placement/alloc - where physical
-cells are already chosen) + **one** line in `build_attn` + two accessors. Everything else
-is new files or the LocalAI-side server loop.
-
-## 0003 redesigned to the rule (replaces the 4-file-surgery plan)
-
-The old `0003-gather-read-plan.md` edited `llama-kv-cache.{h,cpp}` + `llama-graph.{h,cpp}`
-(including a field added to `llm_graph_input_attn_kv` and fill logic in its `set_input`).
-The additive form removes the core-struct and core-`set_input` edits entirely:
-
-**New file `src/paged-attn.{h,cpp}`** holds *all* logic:
-- `class llm_graph_input_paged_gather : public llm_graph_input_i` - owns the `I32 [n_gather]`
-  gather-index tensor and a `const llama_kv_cache_context * mctx`. Its `set_input()` fills
-  the index with the sequence's used cells (`{ i in [0,n_kv) : !cells.is_empty(i) }`, the
-  same set the `kq_mask` keeps), in the canonical order.
-- `paged_attn::gather(ctx0, res, mctx, v_trans, &k, &v, &kq_mask)` - when paged is active, - constructs that input via `res->add_input(...)`, and applies `ggml_get_rows` to `k`, `v`, - and the transposed `kq_mask` by the shared index (mask: `transpose -> get_rows -> - transpose`). When not active it returns immediately -> **stock path byte-identical**. - -**Core hooks (the whole core diff for 0003):** -1. `src/llama-graph.cpp`, in `build_attn` right before `build_attn_mha` (~line 2357): - ```cpp - paged_attn::gather(ctx0, res, mctx_cur, v_trans, &k, &v, &kq_mask); // no-op unless LLAMA_KV_PAGED - ``` - One line. No new field on `llm_graph_input_attn_kv`; the gather input is a *separate* - registered input, so `llama-graph.h` is untouched. -2. `src/llama-kv-cache.{h,cpp}`: two thin accessors on `llama_kv_cache_context` so the new - file can read the used-cell set without reaching into internals - - `uint32_t get_n_gather() const;` and `void get_gather_idxs(int32_t * dst) const;` - (delegate to `kv`/`sinfos[i_cur]`, mirroring the existing `get_n_kv` / `set_input_k_idxs` - pattern). ~8 lines total, no signature changes to existing methods. -3. `src/CMakeLists.txt`: `+ paged-attn.cpp`. - -First cut: gate to **flash-attn + single-stream** (`GGML_ASSERT` otherwise) - the V-transposed -(non-FA) and multi-stream gathers are a localized follow-up entirely inside `paged-attn.cpp`, -no new core touch. Gate 0 stays the same: `diff` of greedy `llama-simple` output, stock vs -`LLAMA_KV_PAGED=1`, must be identical (attention is permutation-invariant over the gathered -KV set; `n_gather < n_kv` proves compaction, not identity). - -## Anti-drift practices (already in `README.md`, restated as policy) - -- **Stacking patches, one concern each**, exported 1:1 from a dev branch via - `git format-patch`. On a pin bump, rebase the branch; only the conflicting small patch - needs a touch, and the failure names the exact step. 
-- **Default-off (`LLAMA_KV_PAGED`)** until each gate is green, so a partial series never - changes stock behavior - and the hooks compile to a no-op branch when the env is unset. -- **Dev tree:** `git worktree add ` off any checkout that has the pin - (e.g. the existing llama.cpp clone), `git apply` the series, develop the next patch as one - commit, re-export. (Set up and verified for this pin during this work.) - -## Status / next step - -- 0001, 0002: done, additive, verified token-identical. -- 0003: **redesigned to the additive form above** (this doc). Dev tree at the pin with - 0001+0002 applied is ready (`paged` branch). Remaining work is the focused - implement-and-verify block for `paged-attn.{h,cpp}` + the one `build_attn` hook, driven to - the token-identical Gate 0. That is a numerical-correctness task (mask/gather alignment, - FA-first), not a structural one - the structure is settled here. -- 0004-0006: follow the budget above; 0005 lands in LocalAI's `grpc-server.cpp` (no core - patch at all). diff --git a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md b/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md deleted file mode 100644 index ab72bf9a4..000000000 --- a/backend/cpp/llama-cpp/patches/paged/ARCH_GENERALITY_AUDIT.md +++ /dev/null @@ -1,669 +0,0 @@ -# ARCH_GENERALITY_AUDIT - llama-cpp-localai-paged backend - -Source/build/gallery audit (no GPU, no hardware). Maps how arch-general the -paged backend's BUILD targeting is, and whether non-Blackwell / Metal / CPU -hosts get a working build. - -## Section: backend-build-matrix (build targeting) - -### 1. CUDA arch list: NOT Blackwell-only - it is the FULL upstream ggml default - -There is NO explicit CUDA arch list anywhere in the paged build path: - -- `.docker/llama-cpp-localai-paged-compile.sh` only injects - `-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}` *when* `CUDA_DOCKER_ARCH` is - non-empty (`if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]`). 
-- NO `backend-matrix.yml` row for `llama-cpp-localai-paged` sets - `CUDA_DOCKER_ARCH` (nor does any stock `llama-cpp` row). It is empty. -- `backend/cpp/llama-cpp/Makefile` (reused verbatim by the paged wrapper) sets - only `-DGGML_CUDA=ON` (+ `-DGGML_NATIVE=OFF`). It never sets - `CMAKE_CUDA_ARCHITECTURES` / `CUDA_DOCKER_ARCH`. - -=> The compiled arch fan is whatever upstream llama.cpp / ggml-cuda picks by -default with `GGML_NATIVE=OFF` (the full multi-arch default, which includes -Blackwell sm_120 alongside the older archs ggml ships). This is BIT-IDENTICAL to -how the stock llama-cpp backend is targeted - the paged wrapper copies and reuses -the exact same Makefile + CMakeLists + prepare.sh, only forcing `LLAMA_PAGED=on`. - -Consequence for NVFP4: the FP4-MMA kernel is compile-time gated *inside* the -ggml-cuda TU by `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer, sm_100 -datacenter). Because the build emits the full arch fan (not a Blackwell-only -list), the sm_120 NVFP4-MMA path is present for Blackwell AND the older archs get -their own kernels (NVFP4 runs the non-MMA fallback at runtime on -Ampere/Ada/Hopper). NOTHING in the build pins this to GB10/sm_121. The binary is -arch-portable; only the NVFP4 *speedup* is Blackwell-specific, by kernel gating, -not by build targeting. - -### 2. 
Variants built: CUDA + ROCm + SYCL + Vulkan + CPU (NOT CUDA-only) - -`backend-matrix.yml` `include:` (Linux) ships, for `llama-cpp-localai-paged`, -one row per stock-llama-cpp Linux row (10 rows, lines 4889-5046): - -- cublas CUDA 12.8 (linux/amd64) -- cublas CUDA 13.0 (linux/amd64) -- cublas CUDA 13.0 arm64 l4t (jetson) -- cublas CUDA 12.0 arm64 l4t (jetson) -- hipblas / ROCm 7.2.1 (linux/amd64) - AMDGPU_TARGETS = gfx908..gfx1201 -- sycl_f32 (Intel oneAPI) -- sycl_f16 (Intel oneAPI) -- vulkan (linux/amd64) -- vulkan (linux/arm64) -- CPU (linux/amd64) + CPU (linux/arm64), built via the ggml - `CPU_ALL_VARIANTS` single-build (dlopen libggml-cpu-*.so by host CPU feature; - arm64 uses gcc-14 for SME). - -So it is NOT CUDA-only. Per image, `compile.sh` builds: the accelerator variant -(or CPU_ALL_VARIANTS when BUILD_TYPE empty) + grpc-server + rpc-server. - -### 3. THE GAP vs stock llama-cpp: NO Metal / Darwin build - -This is the single build-targeting divergence: - -- stock `llama-cpp` HAS a Darwin row in `includeDarwin` - (`-metal-darwin-arm64-llama-cpp`, line 5071) and a `metal:` capability key - (`metal: "metal-llama-cpp"`, index.yaml line 25). -- `llama-cpp-localai-paged` has ZERO entries in `includeDarwin` (grep past line - 5048 = none) and NO `metal:` capability key in its meta-backend. -- There is NO `metal-*-llama-cpp-localai-paged` tag anywhere in - backend-matrix.yml or backend/index.yaml. - -`scripts/changed-backends.js` already anticipates a future darwin paged row -(lines 78-81 map `backend === "llama-cpp-localai-paged"` to the C++ source dir), -but no such matrix row exists, so it is currently dead/forward-looking code. - -Everything else (CUDA arch fan, ROCm gfx list, SYCL, Vulkan, CPU) matches stock -llama-cpp exactly. - -### 4. Does a non-Blackwell / Metal / CPU host get a working build of THIS backend? 
- -Meta-backend capabilities map (index.yaml lines 101-111): -default(cpu), nvidia(cuda12), intel(sycl-f16), amd(rocm), vulkan, nvidia-l4t, -nvidia-cuda-13, nvidia-cuda-12, nvidia-l4t-cuda-12/13. NO `metal:` key. - -- Non-Blackwell NVIDIA (Ampere sm_80-86 / Ada sm_89 / Hopper sm_90 / datacenter - Blackwell sm_100): selects the SAME cuda12 / cuda13 image. That image is - compiled for the full arch fan, so it RUNS. NVFP4 falls back to the non-MMA - path on pre-Blackwell; on sm_100 it gets FP4-MMA but is compute-bound (HBM3e), - not the LPDDR5x-bound GB10 regime the patches were tuned for. WORKS, just - without the GB10-specific bandwidth win. -- AMD / Intel / Vulkan / CPU (amd64 + arm64) Linux hosts: each has its own - matching variant in the map + matrix. WORKS. -- Metal / macOS Apple Silicon: NO `metal:` key and NO darwin build. Capability - resolution falls back to `default` = `cpu-llama-cpp-localai-paged`, which is a - Linux (amd64/arm64) image, NOT a macOS-native build, so it will NOT run on - macOS. And because this is a SEPARATE meta-backend, it does NOT fall through to - the stock `llama-cpp` backend - a Mac user who explicitly selects - llama-cpp-localai-paged gets a non-running selection and must manually pick the - stock llama-cpp backend instead. DOES NOT WORK on Metal/macOS; no auto-fallback - to stock. - -## Verdict (build-targeting) - -- Arch-general on Linux: YES. The build is NOT Blackwell-only; it targets the - exact same full CUDA arch fan + the same ROCm/SYCL/Vulkan/CPU variant set as - stock llama-cpp. Any Linux host that can run stock llama-cpp can run THIS - backend; the NVFP4 speedup is the only Blackwell-gated piece, and that gating - is inside the kernel, not in the build matrix. -- Single gap: NO Metal/Darwin variant and NO `metal:` capability key. macOS / - Apple Silicon hosts have no working build of this backend and do not auto-fall - to stock llama-cpp. 
To close the gap, add an `includeDarwin` row - (`-metal-darwin-arm64-llama-cpp-localai-paged`, mirroring the stock llama-cpp - darwin row + the C++ source build path that changed-backends.js already - anticipates) and a `metal:` key to the paged meta-backend. (Note: NVFP4 has no - Metal MMA path, so a Metal build would deliver paged-KV behaviour only, no - NVFP4 acceleration - still a correctness/availability win over the current - broken selection.) - -## Section: gguf-gallery-targeting (NVFP4 portability + hardware gating) - -### 1. NVFP4 GGUFs LOAD + RUN on non-Blackwell - runs-via-dequant, NOT FP4-MMA-required - -The published GGUFs use `file_type` MOSTLY_NVFP4 / `GGML_TYPE_NVFP4` (type id 40). -This is a standard ggml block-quant type with FULL software dequant + matmul -coverage across every backend, NOT a Blackwell-only format. Verified against the -paged backend's pinned ggml source (pin 0a2677c6, same upstream as stock -llama-cpp): - -- CPU (any arch, amd64 + arm64): full support, no special hardware. - - `ggml/src/ggml-cpu/quants.c`: `quantize_row_nvfp4` (from_float) + - `ggml_vec_dot_nvfp4_q8_0_generic` (the matmul dot product), dequant via the - `kvalues_mxfp4` lookup table. Registered in the CPU type-traits table - (`ggml-cpu.c` line 283: `[GGML_TYPE_NVFP4] = { .from_float=..., .vec_dot=... }`). - - NVFP4 handled in all the CPU op switches (`ops.cpp` lines 674, 1125, 1255, - 4424, 4701, 4925, 5651). LOADS + RUNS correctly on a pure-CPU host, just slow. -- CUDA, NON-Blackwell (Pascal/Volta/Turing/Ampere sm_80-86 / Ada sm_89 / - Hopper sm_90): RUNS correctly via the integer-quantized matmul paths, no - FP4-MMA needed. - - `convert.cu` registers `dequantize_row_nvfp4_cuda` as both the to_float and - to_fp16 dequant kernel (lines 759, 814) - the generic dequant->GEMM path. - - `mmvq.cu`: `vec_dot_nvfp4_q8_1` (DP4A integer dot, works on any GPU with - dp4a, i.e. Pascal sm_61+). This is the decode (gemv) path. 
- - `mmq.cuh`: NVFP4 has a `MMQ_DP4A_TXS_Q8_0_16` DP4A tile AND a separate - `MMQ_MMA_TILE_X_K_NVFP4` tile explicitly commented "NVFP4 Generic" (line - 222), DISTINCT from `MMQ_MMA_TILE_X_K_FP4` "MXFP4 and NVFP4 Blackwell" (line - 221). So there are three tiers: DP4A (oldest), generic-MMA (Turing+), and - Blackwell-native FP4-MMA. - - The Blackwell path is a runtime FLAG, not a requirement: - `mmq.cu` line 125 `const bool use_native_fp4 = blackwell_mma_available(cc) - && (... NVFP4)`. When false (non-Blackwell), it falls through to the generic - quantized kernel. Grep for any abort/unsupported on NVFP4+blackwell = NONE. - No `GGML_ABORT`, no garbage - just the non-MMA kernel. -- Vulkan: has `dequant_nvfp4.comp` + NVFP4 in `ggml-vulkan.cpp` / dequant_funcs - - LOADS + RUNS on Vulkan hosts (AMD/Intel/NVIDIA) via dequant. -- Metal: NVFP4 referenced only in `ggml-metal-device.m` (type registration / - size), NO Metal NVFP4 compute kernel. On Apple Silicon NVFP4 tensors would - fall back to the CPU backend op-by-op (correct but slow) IF a Metal build - existed - which for THIS backend it does not (see build-targeting Section 3). - -Bottom line: the NVFP4 GGUFs are PORTABLE. A Hopper/Ada/Ampere/CPU/Vulkan host -loads and runs them correctly (bit-faithful dequant), just WITHOUT the FP4-MMA -speedup. FP4-MMA is a Blackwell-only performance tier layered on top of a -fully-general software path, NOT a load/run gate. Off-Blackwell = runs-via-dequant, -correct-but-slow; never fail/garbage. - -### 2. Gallery hardware-targeting GAP: nothing stops a non-Blackwell user - -The 6 -paged entries declare NO machine-readable hardware targeting. The only -Blackwell signal is free prose in `description:` ("native Blackwell NVFP4 -(FP4-MMA)", "Benchmarked on GB10 / DGX Spark") and a `nvfp4` string in `tags:`. - -How LocalAI's gallery CAN express hardware gating (what exists): -- `tags:` are FREE-TEXT, search-only. 
`core/gallery/gallery.go` line 89 just does - `strings.Contains(lower(join(tags)), term)` for the search box + line 128 - collects them for filter chips. There is NO tag that gates install or warns; - the `nvfp4` tag is purely discoverability. -- The model `ModelConfig` struct (`core/gallery/models.go`) has only - Description/Icon/License/URLs/Name/ConfigFile/Files/PromptTemplates. There is - NO capabilities / requirements / hardware field at the MODEL level. (Signing - `verification:` is the only structured gate, unrelated to hardware.) -- The `capabilities:` map (default/nvidia/intel/amd/metal/vulkan/...) is a - BACKEND-level concept in `backend/index.yaml` (paged entry lines 100-111). It - selects the backend IMAGE by detected accelerator FAMILY (nvidia vs amd vs - metal vs cpu). Crucially it does NOT and CANNOT distinguish Blackwell sm_120/121 - from older NVIDIA - `nvidia: cuda12-llama-cpp-localai-paged` is served to ANY - NVIDIA GPU. There is no sub-nvidia (microarch) gating mechanism in the gallery - or the backend capability resolver. - -So the gating gap is real: a non-Blackwell user browsing the gallery is offered -the NVFP4 entries with no machine-readable signal that they will run far below -the advertised "90-117% of vLLM" numbers (those numbers are GB10/LPDDR5x-bound -specific). It will install and run correctly, just slowly, and the bench claims -in the description will not hold. - -### 3. How to express Blackwell-targeting (recommendation) - -Given there is no microarch-gating primitive, the honest options are, in order: - -a. DESCRIPTION + TAG (only thing available today, zero code): the entries already - say "native Blackwell NVFP4 (FP4-MMA)" - tighten it to a leading one-line - "Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other - NVIDIA/CPU via NVFP4 dequant but WITHOUT the FP4-MMA speedup and below the - quoted GB10 throughput." Add a `blackwell` tag alongside `nvfp4` for the - filter chip. 
This is the existing convention (other entries use free prose + - `nvidia` tag, e.g. line 2395; quant trade-offs are described in prose, e.g. - the Gemma "Mobile-optimized" notes lines 1312/1366). No other gallery entry - today encodes a GPU-microarch requirement, so prose is the de-facto standard. -b. If a structured signal is wanted, it would need a NEW field (e.g. a - `recommended_hardware` / `requires` note surfaced by the React UI import - dialog) - that is a feature, not a config tweak, and does not exist yet. -c. The `nvfp4` tag should at minimum be present on ALL six entries - the four - Qwopus/Qwen-MTP entries at lines 819/854/890 have only `[llm, gguf]` tags and - omit `nvfp4`, so they are not even discoverable/filterable as NVFP4, despite - being NVFP4 GGUFs. Inconsistent tagging is a secondary gap. - -Verdict (gallery-targeting): NVFP4 GGUFs are safe to ship broadly (they run -everywhere via dequant, never fail), so the risk is PERFORMANCE-EXPECTATION, not -correctness. LocalAI has no microarch gating primitive; the only lever is the -description + tags. Recommend a one-line Blackwell-recommended hardware note + -consistent `nvfp4`/`blackwell` tags on all six, and tempering the GB10 bench -claims with the "runs slower off-Blackwell" caveat. - -## Section: optimization-generality (patches 0013/0016 + 0017-0029) - -Classifies each optimization as arch-GENERAL (ship everywhere, helps any arch), -GB10-TUNED (needs per-arch retuning of the magnitude/constants), or -Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists). Read from the -patch commit bodies + the diffs they touch; bit-exactness verdicts are the -patches' own md5/test-backend-ops gates. - -Arch axis used: NVFP4 FP4-MMA needs `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer -+ sm_100 datacenter); Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 have none; -Metal/CPU/AMD/Intel have no NVFP4-MMA. 
Datacenter Blackwell sm_100 has FP4-MMA but -HBM3e (~8 TB/s) so it is COMPUTE-bound, not bandwidth-bound: every GB10 -"bandwidth-bound" verdict inverts there. The FP4-MMA kernel itself is UPSTREAM -ggml-cuda gated by `BLACKWELL_MMA_AVAILABLE`; none of these patches add it - they -reshape/route/dedup around it (0017/0020/0023/0025) or are precision-agnostic. - -### A. ARCH-GENERAL (ship everywhere; pure win or provably neutral) - -Graph-shape, host-side, or gather/copy-elimination changes. No FP4, no -bandwidth-floor assumption. Bit-exact. Help or are neutral on any arch that runs -the code path. - -- 0013 decoupled prefill-token budget - pure `update_slots()` scheduler policy, - zero libllama/ggml change, orthogonal to LLAMA_KV_PAGED, default-off - byte-identical. Latency/fairness lever (flattens decode-ITL spike from a - co-batched long prefill). No arch assumption. -- 0016 dynamic decode-first prefill budget - supersedes 0013; still pure - `update_slots()` policy, default-off byte-identical, T==n_batch degenerate case - == stock. Arch-neutral, identical paged on/off. -- 0024 paged-pool burst-reclaim - host-side block accounting + defrag + slot - release; never touches KV values or compute. Gated behind LLAMA_KV_PAGED. Fixes - a real fragmentation/throughput-collapse bug on long-lived servers. - Arch-independent host bookkeeping. -- 0029 block-table within-step host cache - memcpy-reuse of the host block table - across full-attention layers in one step; bit-exact, LLAMA_PAGED_NO_BT_CACHE=1 - off. Helps host-bound decode (dense +2.7% on GB10), neutral when compute-bound - (MoE flat). The faster the GPU (e.g. sm_100), the MORE host-bound decode is, so - the BIGGER this win elsewhere. -- 0028 recurrent-state (conv-tap) gather fusion - eliminates a k_get_rows by - reading cache[ids[s]] in-kernel; bit-identical. Deleting a gather vLLM has no - equivalent of is a win on ANY arch running the GDN path; not FP4, not - bandwidth-floor specific. 
-- 0018 in-place SSM-state write-back + 0019 fused SSM-state gather + 0021 - conv-state in-place fusion - remove a D2D state copy-back (0018), a state - get_rows (0019), and the 4-op conv chain + ring-state copy (0021), mirroring - vLLM's in-place recurrent update. Arithmetic byte-identical; what is removed is - plumbing, so wins on ANY arch running the gated-DeltaNet recurrence. -- Paged KV core (0001-0012) - paged KV manager, on-demand alloc, prefix caching, - in-kernel paged read. No precision or bandwidth-floor assumption; the most - portable part of the work, helps capacity/serving anywhere it compiles. - -NOTE: 0018/0019/0021/0028 + the base GDN op have CUDA + CPU kernels ONLY (every -gate is "CUDA0 vs CPU"). General within {CUDA, HIP/ROCm (hipified ggml-cuda), CPU}; -NOT covered on Metal/SYCL/Vulkan - see SAFETY #1. - -### B. GENERAL-IN-DIRECTION but the MAGNITUDE was measured on the GB10 floor - -Correct + beneficial everywhere, but the specific %/constants are GB10-bound. - -- 0020 o_proj GDN MMVQ->MMQ reshape - collapses the GDN output to 2D so the - ssm_out matmul sees src1->ne[1]=128 and routes to MMQ (M=128 GEMM that amortizes - the weight read across 128 tokens) instead of MMVQ (built for batch<=8 with the - 128 sequences stuck in ne[2]). Zero-cost view change, bit-identical, gated to the - gated-DeltaNet path. UNIVERSAL: MMVQ (mul_mat_vec_q) is structurally a batch<=8 - GEMV and cannot amortize the weight read at a real M=128; MMQ does, on ALL CUDA - archs (dp4a pre-tensor-core still amortizes) and on HIP. So MMQ > MMVQ at M=128 - is NOT GB10-specific - pure win wherever MMQ exists. RE-TUNE: the +31.7% - magnitude is on the GB10 BW floor; smaller % on sm_100 but still correct. - REGRESSION RISK: only at a genuinely tiny real M (single-stream decode n_seqs<=8) - could forcing MMQ be slower than MMVQ - see SAFETY #2. On Metal/Vulkan/SYCL the - MMVQ/MMQ split differs; the reshape is harmless (a view) but yields no benefit. 
-- 0023 MoE NVFP4 activation-quantize de-dup - for broadcast up/gate proj (ne11==1) - quantize the unique token activations once and gather the identical FP4 blocks - instead of re-quantizing per expert; bit-exact, ..._DEDUP=0 off. - DIRECTION-GENERAL (de-duplicating identical work is always good) but - NVFP4-block-layout specific (uint4 copy of block_fp4_mmq) and only matters where - activation-quant is a measurable decode bucket - on a compute-bound arch the - saved quant time may be off the critical path (even on GB10 the MoE TG win is - only +1.7%). - -### C. GB10-TUNED (constants are GB10 winners; re-sweep per arch) - -- 0022 GDN recurrence occupancy/coalescing retune - column-folding template params - NUM_WARPS/COLS_PER_WARP, default (16,8), env-selectable GDN_NW/GDN_CPW. The - reduction/FMA order is byte-identical (md5-gateable); only the warp/block->column - assignment changes to raise memory-level parallelism on a BANDWIDTH-BOUND kernel. - (16,8) is explicitly "the measured GB10 winner". Textbook per-arch tune: optimal - values depend on DRAM latency / L2 / SM count / occupancy. SAFE everywhere - (bit-exact, env-overridable, no forbidden float4 load) but unlikely optimal off - GB10; on a compute-bound arch (sm_100) the kernel may not even be the - bottleneck. Needs a per-arch GDN_NW/CPW sweep. -- 0017 FP4 dense-GEMM decode tile tune - shipped as a P0 bit-exact gate + DEFAULT- - OFF occupancy levers (GGML_CUDA_FP4_MMQ_Y / ..._MINBLOCKS / ..._DENSE_MMQ_X). - Honest GB10 verdict was a KILL-GATE: every cheap occupancy probe REGRESSED on - sm_121 (the M=128 tile is already weight-read optimal). Nothing on by default => - byte-identical to stock everywhere. On a DIFFERENT FP4 arch (sm_100) the - kill-gate could flip; the levers are in place and inert, ready to re-sweep. - -### D. 
Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists) - -- 0025 MoE NVFP4 MoE-decode re-graph - keeps CUDA graphs on for the grouped - stream-k mul_mat_q id-path; env-gated LLAMA_MOE_FORCE_GRAPHS, default-off - byte-identical. The CUDA-graph mechanism is general, but the specific guard - condition (mmvq_mmid_max==8 for NVFP4 on sm_121) and the "graphs are safe here" - reasoning are tied to the NVFP4 grouped path on Blackwell. On a non-FP4 arch the - node would not take that branch -> inert. -- 0026 hybrid per-head SSM-state precision (bf16 SSM/conv cache) - adds - --cache-type-ssm/-conv + --ssm-bf16-tau (per-head f32-vs-bf16 by memory length). - Default f32 = bit-exact. PRECISION/bandwidth lever: bf16 halves the dominant GDN - decode byte stream, which only pays off on a BANDWIDTH-bound arch (GB10). On - sm_100 HBM3e it buys little. Value is bandwidth-floor specific; correctness is - precision-specific (opt-in, default-safe). -- NVFP4 GGUFs + the 6 gallery -paged rows - inherently Blackwell-precision-specific - for the FAST path: NVFP4 weights only get FP4-MMA on sm_120/121/100. Elsewhere - they run-via-dequant (correct, slow) per the gallery-targeting section above. - -### Per-arch expected story - -- Consumer Blackwell sm_120/121 (GB10 / dGPU): the validated target. dGPU sm_120 - (GDDR7 ~1 TB/s) is less BW-starved than GB10 LPDDR5x 273 GB/s, so the - bandwidth-floor wins (0018/0019/0022/0026) shrink in % while the host-pipeline + - graph wins (0029/0025) and the MMQ reshape (0020) hold. -- Datacenter Blackwell sm_100 (HBM3e ~8 TB/s): FP4-MMA WORKS so NVFP4 stays fast - (precision bucket + 0025 carry over), BUT the BW floor is GONE -> compute-bound. - Re-tune: 0022 GDN_NW/CPW sweep; 0017 kill-gate may flip (levers ready). The - bandwidth-motivated wins (0018/0019, 0026 bf16-state) shrink toward neutral; the - host-pipeline/graph/MMQ-reshape general wins (0029/0025/0020) still help. 
Net: - works, faster GPU, needs a re-tune pass, do NOT assume the GB10 constants. -- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86: NO FP4-MMA. NVFP4 GGUFs + the FP4 - levers (0017/0023/0025) are out of scope -> use a DIFFERENT quant (Q4_K/AWQ/GPTQ - etc). BUT the precision-agnostic infra still helps: paged KV core, scheduler - (0013/0016), burst-reclaim (0024), block-table cache (0029), SSM/conv - plumbing-removal (0018/0019/0021/0028); 0020 still routes the o_proj - MMVQ->MMQ in whatever quant it uses. Story: "no FP4 -> another quant, but paged + - SSM + scheduler infra is a pure win". -- Metal / CPU / AMD-ROCm / Intel-SYCL / Vulkan (all built by the matrix): no - NVFP4-MMA; paged KV + scheduler infra is the portable value. CPU has reference - kernels for every fused op (the bit-exact gate is CUDA0-vs-CPU). ROCm/HIP reuses - ggml-cuda (hipify) so it inherits the fused-op kernels. Metal/SYCL/Vulkan do NOT - get the new fused-op kernels (SAFETY #1). - -### SAFETY / regression risks - -1. Fused GDN/conv ops are CUDA + CPU only; emission is NOT backend-gated. - 0018/0019/0021/0028 add new ops (ggml_gated_delta_net_inplace[_ids], - ggml_ssm_conv_update_inplace[_ids]) implemented for CUDA + CPU only. They are - emitted by the graph builder whenever cparams.fused_gdn_ar/fused_gdn_ch is set - (constructor sets fused_gdn_ch=true, auto_fgdn=true), with NO check that the - active compute backend is CUDA/HIP. Fine on CUDA/HIP/CPU. On Metal/SYCL/Vulkan - two failure modes: (a) the new GATED_DELTA_NET op has no kernel -> loud - supports_op/assert (but the base GDN op may already be CUDA/CPU-only upstream, - so a qwen35 model likely cannot run there regardless); (b) the fused conv - variant REUSES GGML_OP_SSM_CONV discriminated by a non-null src[3]/src[4] - a - backend that supports plain SSM_CONV but ignores the discriminator would compute - the WRONG plain conv -> SILENT corruption. That is the one genuine - silent-correctness risk. 
RECOMMENDATION: gate fused-op emission on the compute - backend being CUDA/HIP (or add a supports_op guard rejecting the discriminated - SSM_CONV where the fused handling is absent). -2. 0020 MMVQ->MMQ at tiny real M. MMQ is right at decode M=128 (the gallery - batched-serving regime); it would be wrong only at a genuine M<=8 (single-stream - decode, n_seqs=1). Bit-identical either way - only a potential perf regression - at tiny batch on non-GB10 archs (never the measured GB10 case). Worth confirming - the reshape still picks the better kernel at n_seqs=1 elsewhere. -3. 0022 default (16,8) off GB10: safe (bit-exact, env-overridable) but non-optimal; - do not ship it as the default for sm_100/Hopper/Ada without a GDN_NW/CPW sweep. - No correctness risk. -4. Gallery rows do not state a GPU-arch requirement (covered in the - gallery-targeting section): add a Blackwell (sm_120/121/100) recommended note. - -### One-line verdict - -The PORTABLE core (paged KV 0001-0012, scheduler 0013/0016, burst-reclaim 0024, -block-table cache 0029, the SSM/conv plumbing-removal 0018/0019/0021/0028, the -o_proj MMQ reshape 0020) is arch-general and ships everywhere it compiles - -bit-exact, mostly default-safe, pure win or neutral. The FP4/NVFP4 levers -(0017/0023/0025, the GGUFs, the gallery) are Blackwell-precision-specific. The -occupancy/precision tunes (0017 levers, 0022, 0026) are GB10-bandwidth-floor-tuned -and need a per-arch re-sweep (especially on sm_100 where the BW floor is gone and -the regime flips to compute-bound). The single real SAFETY gap: the new fused -GDN/conv ops are CUDA+CPU-only with backend-ungated emission, so a Vulkan/SYCL/Metal -paged build of a gated-DeltaNet model could assert (GDN op) or silently miscompute -(discriminated SSM_CONV) - it should be compute-backend-gated. 
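The two mitigations recommended under SAFETY #1 can be sketched as follows. This is an illustration under stated assumptions, not the backend's code: `node`, `op_kind`, `backend_kind` and every function name are hypothetical stand-ins, and the set of backends treated as having fused kernels (CUDA, HIP, CPU) is taken from the audit text above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration; not ggml's real definitions.
enum class op_kind      { SSM_CONV, OTHER };
enum class backend_kind { CUDA, HIP, CPU, VULKAN, SYCL, METAL };

constexpr int MAX_SRC = 5;

struct node {
    op_kind      op;
    const void * src[MAX_SRC];
};

inline bool has_src(const node & n, int i) {
    assert(i >= 0 && i < MAX_SRC);
    return n.src[i] != nullptr;
}

// In this sketch the fused variant is marked by a non-null src[3] or src[4],
// mirroring the discriminator described in the audit.
inline bool is_fused_ssm_conv(const node & n) {
    return n.op == op_kind::SSM_CONV && (has_src(n, 3) || has_src(n, 4));
}

// Backends the audit describes as carrying the fused kernels.
inline bool backend_has_fused_kernels(backend_kind b) {
    return b == backend_kind::CUDA || b == backend_kind::HIP || b == backend_kind::CPU;
}

// Option A: decide at graph-build time whether to emit the fused ops at all.
inline bool should_emit_fused(backend_kind b, bool fused_requested) {
    return fused_requested && backend_has_fused_kernels(b);
}

// Option B: a supports_op-style guard that rejects the discriminated SSM_CONV
// on backends without fused handling, turning a silent miscompute into a loud
// "unsupported op". Other ops are accepted here only to keep the sketch short.
inline bool supports_op_sketch(backend_kind b, const node & n) {
    if (is_fused_ssm_conv(n)) {
        return backend_has_fused_kernels(b);
    }
    return true;
}
```

Option A keeps unsupported nodes out of the graph entirely; Option B is a backstop that still holds if some other builder emits the discriminated node. The two are complementary rather than alternatives.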
- -## Section: patch-arch-safety (build-break / miscompile classification, 0018-0029) - -This section is the narrow safety read: for EACH patch, does it (a) compile and -behave correctly on every build target, (b) compile only under -BLACKWELL_MMA_AVAILABLE with a fallback elsewhere, or (c) RISK a build break / -miscompile / crash on a non-Blackwell arch. Class letters here are -build-safety classes, distinct from the perf-generality buckets above. Note 0027 -does not exist (numbering gap). The dispositive build facts: the backend is built -for CUDA 12/13, L4T arm64, ROCm/hipblas, SYCL f32/f16, CPU (amd64+arm64), Vulkan - -and NOT for darwin/Metal (no includeDarwin row), and the CUDA build emits the full -multi-arch fan (CUDA_DOCKER_ARCH unset; Dockerfile documents e.g. `75;86;89;120`), -so every .cu TU MUST already compile for non-Blackwell SASS. - -Method: grepped every added line in 0017-0029 for arch-specific tokens -(BLACKWELL/__CUDA_ARCH__/sm_NNN/cp.async/ldmatrix/mma./asm volatile/cc gates). -The ONLY hits are in 0017 (all correctly `#if`-gated) and free-text comments. No -SSM/conv/GDN kernel in the series uses a Blackwell intrinsic or a hardcoded -sm_12x launch geometry. - -| patch | class | build-safety note | -|-------|-------|-------------------| -| 0017 fp4-gemm-decode-tile-tune | (b) GATED | only Blackwell-specific patch; NVFP4 mmq_y/min-blocks levers behind `#if defined(BLACKWELL_MMA_AVAILABLE)` + `blackwell_mma_available(cc)` + `type==GGML_TYPE_NVFP4`, ALL default-off => default build byte-identical to stock on every arch. `get_mmq_y_device()` templating has a default arg keeping stock behaviour for non-NVFP4. Builds on sm_80-90 (body stripped). | -| 0018 ssm-decode-inplace-state | (a) general | plain in-place GDN state write-back, no intrinsics; CPU mirror in ggml-cpu/ops.cpp. | -| 0019 ssm-decode-fused-gather | (a) general | `gdn_gather_nonident_kernel` = plain `<<<...>>>` launch; CPU mirror added. 
| -| 0020 gdn-oproj-mmq-reshape | (a) general | host-side reshape_2d in qwen35*/qwen3next.cpp, no device code. | -| 0021 conv-state-inplace-fusion | (a) general | new op reuses GGML_OP_SSM_CONV (4th src discriminator), no new enum => no ggml-cpu.c switch needed; `ssm_conv_update_f32` plain portable CUDA (threads=128, templated d_conv); CPU mirror + test case. | -| 0022 gdn-recurrence-occupancy-retune | (a) general | template NUM_WARPS/COLS_PER_WARP/MIN_BLOCKS; new default (16,8) = 512 thr/block, MIN_BLOCKS=2, within the 1024 limit on sm_70..120 and AMD; bit-exact for any (NW,CPW). NOT Blackwell-gated and NOT a break - just a GB10-tuned default applied everywhere (see risk 3 below). | -| 0023 moe-nvfp4-quant-dedup | (a) general | `gather_mmq_fp4` = plain uint4 byte-copy kernel; reached ONLY inside the pre-existing `if (use_native_fp4)` branch (Blackwell-only at runtime) and uses `block_fp4_mmq`, a type that already compiles for the full arch fan pre-0023. Adds no new arch surface. | -| 0024 paged-pool-burst-reclaim | (a) general | pure host C++. | -| 0025 moe-nvfp4-decode-regraph | (a) general | host-side ggml-cuda.cu graph-guard relaxation, env-gated `LLAMA_MOE_FORCE_GRAPHS` default-off => byte-identical; predicate is runtime cc-generic. | -| 0026 hybrid-perhead-ssm-state | (a) general | mostly host plumbing; GDN kernel = same portable column-folded code; fill.cu instantiates `fill_kernel` (bf16 STORAGE-only, fine on all targeted arches; bf16-compute SSM plan is SHELVED so STATE_T stays f32 on the hot path). LOW-RISK verify item: confirm no bf16-arithmetic GDN instantiation reaches sm_75 if sm_75 ships. | -| 0028 recurrent-state-gather-fusion | (a) general | new op reuses GGML_OP_SSM_CONV (ids src + rs_head); `ssm_conv_gather_nonident_kernel` plain portable CUDA; CPU mirror + test cases. | -| 0029 blocktable-within-step-cache | (a) general | pure host C++ + host-timing instrumentation. 
| - -### Specific lines that carry the only conditional/risk surface - -- 0017 the ONLY correctly-gated arch surface: - - `get_mmq_y_host`: `if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc))` - - `get_mmq_y_device()` / `mmq_get_min_blocks_device()`: bodies inside `#if defined(BLACKWELL_MMA_AVAILABLE)`. - All default to the stock value, so a default build is byte-identical everywhere. -- 0023 the gather kernel default-on (GGML_CUDA_MOE_QUANT_DEDUP=1) but the call site - is `if (moe_quant_dedup && ne11 == 1)` strictly inside `if (use_native_fp4)`; on - non-Blackwell `use_native_fp4` is false so the dedup never executes. -- 0022 the GB10-tuned launch geometry is `GDN_DEFAULT_NW 16` / `GDN_DEFAULT_CPW 8` - (=> 512 threads, MIN_BLOCKS=2). This is the closest thing to a "hardcoded for - GB10" launch config, but it is a correct, within-limits, bit-exact default for - ANY arch, runtime-overridable via GDN_NW/GDN_CPW. Not a break. - -### THE ONE silent-correctness risk (cross-ref SAFETY #1 above) - -0021/0028 (and 0018/0019 for the GDN op) implement their new ops for CUDA + CPU -ONLY, and the fused conv variants REUSE GGML_OP_SSM_CONV discriminated by a -non-null src[3]/src[4]. Emission is NOT gated on the active compute backend. A -backend that supports plain SSM_CONV but ignores the discriminator would run the -WRONG plain conv => SILENT corruption (not a build break). In practice the model -that emits these (qwen35 hybrid) also needs the fork-custom GDN op, which is -CUDA/CPU-only, so on Vulkan/SYCL the GDN node asserts/falls back FIRST and the -model cannot run there regardless; and Metal is not a build target. So the risk is -latent rather than live, but it should still be closed by gating fused-op emission -on a CUDA/HIP compute backend (or a supports_op guard rejecting the discriminated -SSM_CONV where fused handling is absent). 
This is the single item that could ever -miscompute silently; everything else is either build-safe or loud. - -### Build-safety verdict per target (would it COMPILE / RUN) - -- CUDA sm_80 / 86 / 89 / 90 (Ampere/Ada/Hopper): BUILDS (0017 Blackwell code - `#if`-stripped + default-off; all other device code portable CUDA). qwen35 hybrid - models RUN (GDN + ssm_conv_update + gather have non-Blackwell kernels). NVFP4 - GGUFs run via the stock non-FP4-MMA dequant/DP4A path; the FP4 levers are inert, - not broken. No patch in 0018-0029 breaks this build. -- CUDA sm_100 (datacenter Blackwell, HBM3e): BUILDS + every lever active - (BLACKWELL_MMA_AVAILABLE defined). Bit-exact. GB10-tuned launch defaults are - correct but tuned for the LPDDR5x BW floor; on HBM3e the regime is compute-bound, - so safe-but-not-necessarily-optimal (re-sweep 0022/0017 levers). No build/correctness risk. -- Metal: NOT a build target (no darwin row), so missing Metal kernels for the new - SSM_CONV/GDN ops cannot break a build or a run here. (The GDN op has no Metal - kernel regardless.) -- CPU (amd64 + arm64): BUILDS + RUNS - every new op ships a CPU mirror under the - reused enums; host patches are portable C++. -- ROCm/HIP, Intel SYCL, Vulkan: BUILD ok. The .cu additions hipify cleanly (no - Blackwell intrinsic outside the `#if`; 0022's 512-thread launch within AMD limits). - SYCL/Vulkan are separate backends that don't compile the .cu files and lack the - GDN op, so qwen35 hybrid models fall back/assert there rather than run; classic - (non-qwen35) models are unaffected because SSM_CONV semantics only change when the - qwen35 graph emits the discriminator src. The latent silent-SSM_CONV risk above - applies only if a backend both supports SSM_CONV and ignores the discriminator. - -Verdict: of 0018-0029, none would break a non-Blackwell CUDA build, the CPU build, -or the ROCm/SYCL/Vulkan builds; 0017 is the only Blackwell-gated patch and is -default-off and `#if`-guarded. 
The sole non-build hazard is the latent -discriminated-SSM_CONV silent-miscompute on a hypothetical Vulkan/SYCL/Metal GDN -run, which should be closed by compute-backend-gating the fused-op emission. - -## Section: CROSS-ARCH SYNTHESIS (final verdict) - -Consolidates the four audit sections above into a single ship decision. The arch -axis: NVFP4 FP4-MMA requires `BLACKWELL_MMA_AVAILABLE` = sm_120/121 (consumer -Blackwell, GB10/RTX-50) + sm_100 (datacenter Blackwell). sm_90 Hopper / sm_89 Ada -/ sm_80-86 Ampere = NO FP4-MMA. Metal/CPU/AMD/Intel = no NVFP4-MMA. GB10's wins -are dominated by the LPDDR5x ~273 GB/s bandwidth floor; sm_100 has FP4-MMA but -HBM3e ~8 TB/s so it is COMPUTE-bound and every "bandwidth-bound" GB10 verdict -inverts there. - -### 1. BUILD SAFETY: does it build + run WITHOUT CRASHING off-Blackwell? - -YES on every target it builds for, with ONE latent silent-correctness hazard -(not a crash) to close before claiming non-Blackwell support. The build is NOT -GB10-pinned: there is no explicit CUDA arch list anywhere in the paged path -(`CUDA_DOCKER_ARCH` empty in every matrix row, identical to stock llama-cpp), so -the CUDA TUs compile the full upstream ggml arch fan and the NVFP4 FP4-MMA path -is gated INSIDE the kernel by `BLACKWELL_MMA_AVAILABLE`, never by the matrix. - -| target | builds? | runs? | notes | -|--------|---------|-------|-------| -| CUDA sm_80/86/89/90 (Ampere/Ada/Hopper) | YES | YES | 0017 Blackwell code `#if`-stripped + default-off; all other device code portable. qwen35 hybrid models run (GDN + ssm_conv_update + gather have non-Blackwell kernels). NVFP4 GGUFs run via dequant/DP4A; FP4 levers inert, not broken. | -| CUDA sm_100 (datacenter Blackwell, HBM3e) | YES | YES | every lever active + bit-exact; GB10-tuned launch defaults are correct but compute-bound regime => safe-but-suboptimal (re-sweep, do not assume GB10 constants). 
| -| CPU (amd64 + arm64) | YES | YES | every new op ships a CPU mirror under the reused enums; host patches portable C++. | -| ROCm/HIP, Intel SYCL, Vulkan | YES | partial | .cu hipifies cleanly (no Blackwell intrinsic outside `#if`; 0022's 512-thread launch within AMD limits). SYCL/Vulkan don't compile the .cu and lack the GDN op, so qwen35 hybrid models assert/fall back rather than run; classic non-qwen35 models unaffected. | -| Metal / macOS | NOT BUILT | N/A | no `includeDarwin` row, no `metal:` capability key. Mac selection of this backend falls back to `default`=cpu (a Linux image) and does NOT run; no auto-fallthrough to stock llama-cpp. | - -No patch in 0017-0029 breaks a non-Blackwell CUDA build, the CPU build, or the -ROCm/SYCL/Vulkan builds. The only thing that is not merely "suboptimal" is the -fused-conv silent-miscompute hazard (item RISKY-1 below), and even that is latent -because the co-emitted GDN op asserts first on the backends that lack it. - -### 2. EVERY patch/opt, four buckets - -SAFE-EVERYWHERE (ship as-is; bit-exact or default-off byte-identical; pure win or -neutral on any arch that runs the path): -- 0001-0012 paged KV core (manager, on-demand alloc, prefix caching, in-kernel paged read) -- 0013 / 0016 prefill-token budget scheduler (pure `update_slots()` policy, default-off byte-identical) -- 0018 in-place SSM-state write-back (CUDA+CPU; see RISKY-1 for backend coverage) -- 0019 fused SSM-state gather (CUDA+CPU) -- 0021 conv-state in-place fusion (CUDA+CPU) -- 0028 recurrent-state (conv-tap) gather fusion (CUDA+CPU) -- 0020 o_proj GDN MMVQ->MMQ reshape (zero-cost view, bit-identical; MMQ>MMVQ at M=128 is universal; magnitude GB10-bound, perf-only caveat at tiny real M=1, see RISKY-2) -- 0024 paged-pool burst-reclaim (pure host C++; fixes a real long-server fragmentation collapse) -- 0029 block-table within-step host cache (host memcpy reuse, bit-exact; bigger win the FASTER the GPU, i.e. 
MORE host-bound decode elsewhere) - -BLACKWELL-ONLY, CLEAN FALLBACK (only meaningful where FP4-MMA exists; provably -inert/byte-identical elsewhere, never a break): -- 0017 FP4 dense-GEMM decode tile tune - levers `#if BLACKWELL_MMA_AVAILABLE` + `blackwell_mma_available(cc)` + `type==NVFP4`, ALL default-off => default build byte-identical to stock on every arch -- 0023 MoE NVFP4 activation-quant de-dup - plain uint4 copy kernel reached ONLY inside the pre-existing `if (use_native_fp4)` branch (false off-Blackwell); never executes there -- 0025 MoE NVFP4 decode re-graph - host-side CUDA-graph guard, env-gated `LLAMA_MOE_FORCE_GRAPHS` default-off; the NVFP4-grouped guard predicate is inert on non-FP4 -- NVFP4 GGUFs + 6 gallery rows - FAST path is sm_120/121/100 only; elsewhere run-via-dequant (correct, slow), never a load/run gate - -GB10-TUNED (works + safe everywhere, but the constants/magnitude are GB10 -bandwidth-floor winners; re-sweep per arch, no correctness risk): -- 0022 GDN recurrence occupancy retune - column-fold default (16,8)=512thr/MIN_BLOCKS=2, bit-exact, env-overridable GDN_NW/GDN_CPW; within the 1024-thread limit on sm_70..120 + AMD. Optimal values depend on DRAM latency/L2/SM-count; on a compute-bound arch the kernel may not be the bottleneck. -- 0026 bf16 per-head SSM/conv cache - default f32 bit-exact (opt-in `--cache-type-ssm/-conv`); bf16 only pays off on a bandwidth-bound arch, buys little on sm_100 HBM3e. bf16 is STORAGE-only (fill_kernel), the bf16-compute SSM plan is shelved so STATE_T stays f32 on the hot path. -- 0017 / 0023 magnitudes (the % wins, not the gating) are also GB10-floor-bound. - -RISKY (fix before claiming non-Blackwell ship; neither is a crash, one is silent): -- RISKY-1 (the one real gap) fused GDN/conv ops are CUDA+CPU-only with - backend-UNGATED, DEFAULT-ON emission. 
Confirmed: `cparams.fused_gdn_ch = true` - and `auto_fgdn = true` in the `llama_context` constructor; emission fires on - `(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar)` with NO compute-backend - check. The fused conv variant REUSES `GGML_OP_SSM_CONV` discriminated by a - non-null `src[3]` (verified: CUDA `if (dst->src[3] != nullptr)` branch at the - top of `ggml_cuda_op_ssm_conv`, CPU mirror in ops.cpp, NO supports_op guard). A - backend that supports plain SSM_CONV but ignores `src[3]` would compute the - WRONG plain conv => SILENT corruption. Latent today only because the co-emitted - fork-custom GDN op is CUDA/CPU-only, so on Vulkan/SYCL the GDN node asserts - first and the qwen35 hybrid model cannot run there anyway, and Metal is not - built. FIX: gate fused-op emission on a CUDA/HIP compute backend, OR add a - supports_op guard that rejects the discriminated SSM_CONV where fused handling - is absent. This is the single thing that could miscompute silently; close it - before a Vulkan/SYCL/Metal paged build of a gated-DeltaNet model is ever shipped. -- RISKY-2 (perf-only, not correctness) 0020 forces MMQ; at a genuine single-stream - decode M<=8 (n_seqs=1) MMQ could be slower than MMVQ off the GB10 batched - regime. Bit-identical either way. Confirm the reshape still picks the better - kernel at n_seqs=1 on non-GB10 archs. - -### 3. NVFP4-GGUF + gallery targeting recommendation - -Do NOT hardware-gate the entries (and you cannot: LocalAI has no microarch-gating -primitive - `tags:` are free-text/search-only, `ModelConfig` has no -hardware/requirements field, and backend `capabilities:` resolves by accelerator -FAMILY only, serving `nvidia: cuda12-...-paged` to ANY NVIDIA GPU with no -sub-nvidia resolution). The GGUFs run correctly everywhere via dequant, so the -risk is PERFORMANCE-EXPECTATION, not correctness; a hard gate would wrongly block -valid (slow) use. Recommended, in order: -1. 
(zero-code, do now) Lead each of the 6 descriptions with one honest line: - "Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other - NVIDIA/CPU via NVFP4 dequant but WITHOUT FP4-MMA and below the quoted GB10 - throughput." Temper the "90-117% of vLLM" claims with that caveat (those are - LPDDR5x-bandwidth-bound specific). -2. (zero-code) Tag all six consistently with `nvfp4` + a new `blackwell` chip. The - four Qwopus/Qwen-MTP entries currently carry only `[llm, gguf]` and are not even - discoverable as NVFP4 despite being NVFP4 GGUFs - secondary correctness-of-metadata gap. -3. (feature, later) A structured `recommended_hardware` field surfaced by the React - import dialog is the only way to express this machine-readably; it does not exist. - -### 4. Per-arch roadmap (ranked by value / effort) - -- sm_100 datacenter Blackwell - HIGH value, MEDIUM effort. FP4-MMA works so NVFP4 - stays fast and the precision bucket (0017/0023/0025) carries over, but the BW - floor is gone => compute-bound. Needs: re-sweep 0022 GDN_NW/CPW; re-evaluate the - 0017 kill-gate (levers ready, may flip); expect 0018/0019/0026 bandwidth wins to - shrink toward neutral while 0029/0025/0020 host/graph/MMQ wins still help. No - code change to be SAFE; a tuning pass to be OPTIMAL. -- Metal / macOS - MEDIUM value, MEDIUM effort. Add the `includeDarwin` - `-metal-darwin-arm64-llama-cpp-localai-paged` row + a `metal:` capability key - (changed-backends.js already anticipates the source path). Delivers paged-KV + - scheduler value only (no NVFP4-MMA on Metal); still a strict win over today's - broken Mac selection. MUST also land RISKY-1 first (Metal would otherwise hit the - discriminated-SSM_CONV path if it ever gained an SSM_CONV kernel without the - discriminator). -- CPU - LOW effort, already works. Reference kernels exist for every fused op; - paged KV + scheduler + reclaim are the portable value. Nothing to do. 
-- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 - MEDIUM value, LOW effort (no FP4 - work). No FP4-MMA => pair the precision-agnostic infra (paged KV, 0013/0016, - 0024, 0029, 0018/0019/0021/0028, 0020) with a DIFFERENT quant (Q4_K/AWQ/GPTQ). - Messaging: "no NVFP4 here, use another quant, but paged + SSM + scheduler infra - is a pure win". The GGUFs/gallery rows are out of scope for these. - -### 5. What MUST be empirically verified (and on what hardware) - -- GB10 (sm_121, user has it): the validated target; already measured. Re-confirm - bit-exactness gates after RISKY-1 fix. -- M4 Mac (user has it): (a) once an `includeDarwin` paged row exists, verify the - Metal build compiles + a NON-qwen35 model runs (paged KV path); (b) verify a - qwen35 hybrid model on Metal EITHER asserts loudly OR is correct - it must NOT - silently miscompute the discriminated SSM_CONV. This is the direct test of - RISKY-1 on real Metal. Do this BEFORE shipping a Metal paged build. Also verify - CPU correctness of every fused op on the Mac (arm64 CPU mirror). -- non-Blackwell NVIDIA (sm_80/86/89/90 - user would need to ACQUIRE, e.g. cloud - A100/L4/L40S/H100): verify (a) the cuda12/cuda13 paged image runs a qwen35 - hybrid model correctly (GDN + ssm_conv_update + gather non-Blackwell kernels), - (b) NVFP4 GGUFs load + produce correct output via dequant/DP4A (not garbage), - (c) RISKY-2: that 0020's forced MMQ does not regress single-stream (n_seqs=1) - decode latency vs MMVQ. This is the only bucket needing hardware acquisition; - everything else is covered by the GB10 + M4 the user already has. -- sm_100 (datacenter Blackwell - cloud B200 if a tuning pass is wanted): only - needed to make sm_100 OPTIMAL, not to make it SAFE. Defer unless targeting it. - -### 6. SHIP DECISION - -SAFE TO SHIP TODAY as a Blackwell-targeted backend on Linux. 
The build is -arch-general (same arch fan + variant set as stock llama-cpp), every targeted -Linux variant builds and runs, and all Blackwell-specific code is default-off + -`#if`-guarded so a non-Blackwell build is byte-identical to stock on the FP4 path. -The NVFP4 GGUFs run everywhere via dequant (correct, slower), so broad gallery -exposure is a performance-expectation issue, not a correctness one. - -MINIMUM to not break / mislead other archs: -1. (correctness, before ANY Vulkan/SYCL/Metal paged build of a gated-DeltaNet - model) Close RISKY-1: compute-backend-gate the fused GDN/conv op emission, or - add a supports_op guard rejecting the discriminated SSM_CONV. This is the only - hard requirement; it is latent on the current Linux targets but becomes live - the moment a Metal/Vulkan/SYCL paged build of qwen35 exists. -2. (availability, zero-risk) Add the `includeDarwin` paged row + `metal:` key so - Mac users get a working (paged-KV-only) build instead of a non-running - default=cpu selection with no fallthrough to stock. -3. (expectation, zero-code) Add the Blackwell-recommended hardware note + the - "runs slower off-Blackwell via dequant" caveat to the 6 gallery descriptions - and tag all six `nvfp4` + `blackwell`. -4. (perf, verify don't block) Confirm 0020 does not regress n_seqs=1 decode on - non-GB10 NVIDIA; if it does, gate the MMVQ->MMQ reshape on a real-M threshold. - -Items 2-4 do not block a Linux Blackwell ship. Item 1 blocks only a future -non-CUDA paged build of a gated-DeltaNet model; on the current build targets the -hazard is latent (the GDN op asserts first). Net: ship for Blackwell/Linux now; -land item 1 before extending paged to Metal/Vulkan/SYCL. 
- -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/A_HYBRID_PROGRESS.md b/backend/cpp/llama-cpp/patches/paged/A_HYBRID_PROGRESS.md deleted file mode 100644 index f852c6f76..000000000 --- a/backend/cpp/llama-cpp/patches/paged/A_HYBRID_PROGRESS.md +++ /dev/null @@ -1,56 +0,0 @@ -# A-build: hybrid per-head f32/bf16 SSM state - BUILD PROGRESS - -Label: A-build (GPU agent). Base: DGX `~/llama-paged-dev` branch `paged` HEAD 2f4f5ab (patch 0025), -plus `BF16_SSM_STATE.diff` applied as the bf16 plumbing base. Goal: per-head mixed-dtype SSM state -(f32 long-memory heads, bf16 fast heads); default `ssm_hybrid_tau_thresh=0.0` (all-f32, bit-exact). - -## Design recap (from SPEEDUP_HUNT.md A-hybrid-design) -- Classifier (host, model-load): tau_h = 1/(|ssm_a[il][h]| * softplus(ssm_dt[il][h])); f32 if tau_h>T. - ssm_a = SSM_A_NOSCAN = -exp(A_log) (verified qwen35.cpp:376). ssm_dt = SSM_DT bias. -- Split cache: per GDN layer, s_l (f32, n_f32 heads) + s_l_bf16 (bf16, n_bf16 heads). head_slot map. -- Kernel: ONE kernel templated +HYBRID; per-block (h_idx) branch on head_slot (uniform, no divergence). - Recurrence math byte-for-byte f32-register, untouched. Homogeneous (HYBRID=false) path bit-exact. -- Op: extra src[8]=state_bf16, src[9]=head_slot; backend detects hybrid = (src[9]!=null). -- CPU mirror: per-head partition read. -- test-backend-ops: MIXED case (some heads f32, some bf16) output-append, decode+prefill+keep_rs_t. - -## DE-RISK GATE (must pass before sweep) -1. test-backend-ops GATED_DELTA_NET mixed PASS (CUDA mixed vs CPU mixed). -2. T_thresh=0 (all-f32) greedy md5 == 0023 baseline: dense 5951a5b4d624ce891e22ab5fca9bc439, - MoE 07db32c2bcb78d17a43ed18bc22705cd. - -## KNOB SEMANTICS (IMPORTANT - brief endpoint wording corrected) -Rule (brief verbatim + physics + "start 32-64" guidance all agree): a head is kept f32 iff -tau_h > T_thresh, else bf16. tau_h = 1/(|ssm_a|*softplus(ssm_dt)) in tokens.
Long-memory (large tau) -heads stay f32 (bf16 rounding does not contract there -> KL); fast (small tau) heads -> bf16. -- ssm_hybrid_tau_thresh DEFAULT = 0.0 => every tau>0 -> ALL F32 (bit-exact opt-out; the gate runs here). -- ssm_hybrid_tau_thresh -> +inf => ALL BF16 (shelved mode). -- sweep: raise T (16/32/64/128 tokens) to bf16 progressively more (longer-memory) heads = more speed. -NOTE: the brief's "inf=>all-f32, 0=>all-bf16" sentence is INVERTED vs the operative rule it states -("keep f32 if tau>T") and vs "start 32-64" + the physics. Correct endpoints: 0=all-f32, inf=all-bf16. -Implemented the physically-correct rule; default 0.0 = bit-exact all-f32. - -## STATUS -- [x] ggml.h/ggml.c hybrid op builders -- [x] gated_delta_net.cu hybrid kernel + dispatch (one kernel, +HYBRID template, uniform per-block branch) -- [x] ops.cpp CPU hybrid read mirror (output-append; ids in-place is GPU-only, asserted) -- [x] test-backend-ops mixed case (32 cases: hc 4/8 x hs 64/128 x decode/prefill/keep_rs_t x kda) -- [x] de-risk gate 1: test-backend-ops GATED_DELTA_NET = 84/84 PASS (incl 32 hybrid mixed CUDA-vs-CPU) -- [x] cparam/CLI ssm_hybrid_tau_thresh plumbing (default 0.0; threaded context->cparams->memory->ctors) -- [x] memory-recurrent split cache + classifier (validated: real tau split, correct 2-partition layout) -- [x] delta-net-base hybrid op build (fused ids decode + bf16 rs_zero/extra mirror) -- [x] full build clean (sm_121; llama-completion/batched-bench/perplexity/test-backend-ops) -- [x] de-risk gate 2 (default/all-f32 md5 == 0023 both models, re-verified post-build) -- [x] hybrid-ON decode FIXED: the incoherence was head_slot being zeroed by clear(data=true) (whole-RS - buffer clear) after warm-up, never re-uploaded => every head -> f32-local-0 => split collapse. - Fix = persist head_slot_host + re-upload via upload_head_slots() after every buffer clear. Hybrid - decode now coherent; cross-op carry verified BYTE-EXACT (write==read both partitions). 
-- [x] A-gatesweep DONE: KL sweep T in {0.25..128} both models, single-seq c1024 (clean carry), drift. - SHIP GATE FAILS - no T passes MeanKLD<1e-3 AND top-p>=99.5% with meaningful speedup. Premise - (error concentrates in long-memory heads) REFUTED: KL scales with bf16 COUNT and saturates - ~0.06/~91% (MoE saturates at the minimal split). Carry byte-exact => genuine bf16 sensitivity, - not a bug. Throughput lever real: dense +12.4% / MoE +11.5% decode @npl128 at T=128. -- [x] Shipped default-off (f32, bit-exact). De-risk gates re-verified on the clean build (84/84; - md5 == baseline both models). See A_HYBRID_SSM_RESULTS.md for the full tables. - -Committed: DGX paged 33e7c65 (amended); worktree patch 0026 + A_HYBRID_SSM_RESULTS.md + this doc. diff --git a/backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md deleted file mode 100644 index 7634e5d3b..000000000 --- a/backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md +++ /dev/null @@ -1,119 +0,0 @@ -# A - HYBRID PER-HEAD f32/bf16 SSM STATE - BUILD + DE-RISK + GATE-SWEEP RESULTS - -Label: A-build + A-gatesweep. Lands as patch 0026 on top of 0025 (DGX HEAD 2f4f5ab), -incorporating the bf16-SSM-state plumbing as the base. Built into `~/llama-paged-dev/build-cuda` -(sm_121); committed on the DGX `paged` branch (33e7c65, amended) and as -`patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch` + this doc in the worktree. - -## VERDICT - -The hybrid machinery is **CORRECT and complete** (both de-risk gates PASS; the carry is byte-exact; -the previously-open decode-incoherence bug is FIXED). The **ship gate FAILS**: no T_thresh reaches -`MeanKLD < 1e-3 AND Same-top-p >= 99.5%` for both models with any meaningful speedup. 
The design -premise - that the bf16 KL error concentrates in long-memory heads and is removed by keeping them -f32 at f32-fraction 0.30-0.50 - is **empirically refuted** on q36-27b and q36-35b-a3b-nvfp4: the KL -error scales with the bf16 head COUNT and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any -useful byte-saving. The bf16 byte-saving (and the decode speedup it buys) is real but cannot meet the -strict KL bar. **Shipped default-off (f32, bit-exact opt-out); the hybrid is opt-in only.** - -## THE FIX (was: hybrid-ON decode incoherent) - -Root cause: `llama_memory_recurrent::clear(data=true)` zeroes the WHOLE recurrent backend buffer with -`ggml_backend_buffer_clear`, which includes the per-layer `head_slot` maps. Those maps were uploaded -only once in the constructor. llama.cpp calls `clear(true)` to reset state after the warm-up run (and -on context resets), so by the time real prefill/decode runs, every `head_slot[h] == 0`. The kernel -decodes `head_slot==0` as "f32 head, local index 0", so EVERY head reads/writes f32-partition slot 0: -the split collapses (the bf16 partition is never written, every head collides on one f32 slot) and the -output is garbage. Warm-up showed correct values precisely because it ran before the clear. - -Fix: persist the host-side maps (`head_slot_host`) and re-upload them after every buffer clear via a -new `upload_head_slots()` (called both at construction and at the end of `clear(true)`). 22 lines in -`src/llama-memory-recurrent.cpp` + 7 in the header. After the fix: -- head_slot reads back correct in every forward (e.g. 
`0 1 -1 -2`), in both llama-completion and - llama-perplexity; -- the bf16 partition is written (non-zero) every step; -- the cross-op state carry is **byte-exact**: at a continuation forward the op reads back EXACTLY what - the prior op wrote, element-for-element, in BOTH partitions (f32 `[0]=0.00303 [1]=-0.00074 - [16384]=0.00054`, bf16 `[0]=-0.00023 [1]=0.00008 [16384]=0.00269` write == read), confirming there - is no addressing/scramble/corruption bug. The only residual difference from f32 is the bf16 rounding - of the bf16-partition heads. - -## DE-RISK GATES - both PASS (re-verified on the final clean build) - -1. **test-backend-ops GATED_DELTA_NET = 84/84 PASS, CUDA0 OK** (incl. the 32 mixed-dtype hybrid cases - vs CPU: head_count {4,8} x head_size {64,128} x {decode, prefill 33/64/100, keep_rs_t K=4} x kda). -2. **T=0 (default, all-f32) greedy md5 == 0023 baseline, both models**, NO `--ssm-bf16-tau`: - - dense q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == baseline - - MoE q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == baseline - The bit-exact opt-out is preserved byte-for-byte. - -## SHIP GATE - the KL/throughput sweep (FAILS) - -KL harness = the bf16-work GateBench: `llama-perplexity --kl-divergence` on wikitext-2-raw, -`-ngl 99 -fa on --seed 1`, base = T=0 (f32). The clean carry config is single-sequence -`-b 1024 -ub 512 -c 1024 --chunks 8` (one cross-ubatch bf16 round-trip; f32-vs-f32 floor = 100.000% -same-top-p, MeanKLD ~ -1.2e-5). Gate: `MeanKLD < 1e-3 AND Same-top-p >= 99.5% AND bounded drift`. 
- -### Dense q36-27b-nvfp4 (H_v=48), c1024 single-seq - -| T_thresh | bf16 heads | f32-frac | f_bytes | MeanKLD | Same-top-p | -|---------:|-----------:|--------:|--------:|---------:|-----------:| -| 0.25 | 14 | 0.964 | 0.982 | 0.00270 | 98.92% | -| 0.5 | 48 | 0.963 | 0.982 | 0.01439 | 96.18% | -| 1 | 118 | 0.935 | 0.968 | 0.06357 | 91.59% | -| 8 | ~610 | 0.735 | 0.868 | 0.05669 | 91.59% | -| 32 | ~1113 | 0.517 | 0.759 | 0.05724 | 90.97% | -| 64 | ~1304 | 0.434 | 0.717 | 0.06183 | 91.85% | -| 128 | ~1460 | 0.366 | 0.683 | 0.05980 | 91.56% | - -Monotonic below the knee (T<=1), then a flat plateau. Best meaningful point T=0.25 (only ~1.8% byte -saving) already FAILS both criteria (KLD 0.0027 > 1e-3; top-p 98.92% < 99.5%). To pass the gate the -bf16 count must be < ~14 heads (f_bytes > 0.98) => no speedup. - -### MoE q36-35b-a3b-nvfp4 (H_v=32), c1024 single-seq - -| T_thresh | bf16 heads | f32-frac | f_bytes | MeanKLD | Same-top-p | -|---------:|-----------:|--------:|--------:|---------:|-----------:| -| 0.25 | 23 | 0.940 | 0.970 | 0.03907 | 91.61% | -| 0.5 | 53 | 0.928 | 0.964 | 0.04620 | 90.31% | -| 1 | 78 | 0.910 | 0.955 | 0.04425 | 89.82% | -| 32 | 585 | 0.391 | 0.695 | 0.04552 | 90.09% | - -MoE has NO low-KL regime: even the minimal split (23 bf16 heads, ~3% byte saving) is already at the -~0.045 / ~91% plateau. Fails the gate everywhere by a wide margin. - -### Why it fails (the refutation) - -The carry is byte-exact, so this is genuine bf16 rounding of the recurrent state, not a bug. The -gated-DeltaNet logit is extremely sensitive to ANY perturbation of the temporal state: even rounding a -handful of small-magnitude heads to bf16 flips ~9% of hard-wikitext argmaxes, and adding more bf16 -heads does not flip materially more (saturation - the flips concentrate in an inherently-marginal -token pool). This matches the prior whole-bf16 finding (MeanKLD 0.05-0.17, top-p ~90%, "bounded but -LARGE, plateaus with context"). 
The error is NOT concentrated by tau, so f32-ing the long-memory heads -(or, tested, the fast heads - inverted classifier gives the same plateau) does not recover the gate. - -## THROUGHPUT - the byte-saving lever IS real (but KL-gated out) - -`llama-batched-bench -fa on -npp 128 -ntg 128 -npl 128`, `LLAMA_KV_PAGED=1`, decode_agg = S_TG t/s: - -| model | T=0 (f32) | T=128 (f_bytes ~0.68) | gain | -|-------|----------:|----------------------:|-------:| -| dense | 529.0 | 594.4 | +12.4% | -| MoE | 1110.7 | 1238.1 | +11.5% | - -So the split delivers the predicted recurrence-bandwidth win (~+12% decode at f_bytes ~0.68), but only -at T values whose KL is ~0.06 / ~91% top-p. There is no operating point with both a real speedup and a -passing KL. - -## RECOMMENDATION - -- Ship 0026 as-is: **default `ssm_hybrid_tau_thresh = 0.0` (f32, bit-exact)**; the hybrid is opt-in via - `--ssm-bf16-tau` for callers who explicitly accept reduced precision for ~+12% decode. Do NOT put a - hybrid T in the gallery/recommended config - it does not pass the KL bar. -- Lever A is closed as a KL-passing speedup: the GDN recurrent state does not tolerate bf16 on a - head-subset basis. Speed beyond the f32 recurrence must come from elsewhere (the MoE FP4 GEMM / - re-graph levers, or NVFP4-dense quant), not from bf16-ing the SSM state. -- If a product later accepts a looser bar (e.g. top-p >= 95%), dense T=0.5 (96.18%, f_bytes 0.982) is - the only near-miss and buys ~2% - still not worth it; MoE has nothing. 
- -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/BENCHMARK_PROGRESS.md b/backend/cpp/llama-cpp/patches/paged/BENCHMARK_PROGRESS.md deleted file mode 100644 index 1e6893fa3..000000000 --- a/backend/cpp/llama-cpp/patches/paged/BENCHMARK_PROGRESS.md +++ /dev/null @@ -1,92 +0,0 @@ -# FINAL apples-to-apples NVFP4 benchmark (GB10 / DGX Spark) - CLEAN env, containers stopped -# llama 0023 clean f7409c2 | LLAMA_KV_PAGED=1, LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget; beats stock 394s->142s TTFT@npl32), CUDA graphs ON, -c 131072 --parallel 128 -b 2048 -ub 512 -fa on -# vLLM 0.23.0 | CUDA graphs ON (no enforce-eager), util 0.85, max-model-len 4096, max-num-seqs 256, tp1 -# client h2h_cli3.py: 512-tok UNIQUE-nonce prompt (fresh full prefill, defeats prefix caching), max_tokens=256, temp0, ignore_eos, stream+usage -# llama restarts server PER NPL (paged-pool degrades after high-npl bursts); vllm one server/combo + npl8 re-check. 1 measured pass/npl + ptok8 graph warmup. peak_gb engine = PEAK-PRE. -# started Fri Jun 26 04:43:38 AM CEST 2026 baseline=3.29 GB - -[2026-06-26 04:43:38] [dense_llama] ==== START dense_llama (llama) baseline_mem=3.29 ==== -[2026-06-26 04:43:38] [dense_llama] NPL=8 launching server PRE_GB=3.29 -[2026-06-26 04:43:48] [dense_llama] NPL=8 ready LOADED_GB=47.06 -[2026-06-26 04:43:55] [dense_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1. **Analyze User Input:** The user says "The capital of France is". This is a straightforward factual question with a clear answer.\n2. **Identify Key Entity:** France (country)\n3. 
**Identify Question Type:** Capit -[2026-06-26 04:44:30] [dense_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2040, "prompt_tok_total": 4195, "gen_per_req": 255.0, "agg_tps": 61.8, "decode_agg_tps": 82.5, "decode_perseq_tps": 9.57, "prefill_tps": 507.3, "ttft_mean_ms": 6038.1, "ttft_max_ms": 8270.0, "wall_s": 32.999} -[2026-06-26 04:44:30] [dense_llama] NPL=8 PEAK_GB=53.51 -[2026-06-26 04:44:35] [dense_llama] NPL=8 server stopped mem=3.31 -[2026-06-26 04:44:35] [dense_llama] NPL=32 launching server PRE_GB=3.31 -[2026-06-26 04:44:40] [dense_llama] NPL=32 ready LOADED_GB=46.96 -[2026-06-26 04:47:55] [dense_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8180, "prompt_tok_total": 16900, "gen_per_req": 255.6, "agg_tps": 43.2, "decode_agg_tps": 192.6, "decode_perseq_tps": 4.79, "prefill_tps": 115.0, "ttft_mean_ms": 133551.7, "ttft_max_ms": 147007.0, "wall_s": 189.49} -[2026-06-26 04:47:55] [dense_llama] NPL=32 PEAK_GB=69.63 -[2026-06-26 04:48:01] [dense_llama] NPL=32 server stopped mem=3.32 -[2026-06-26 04:48:01] [dense_llama] NPL=64 launching server PRE_GB=3.32 -[2026-06-26 04:48:11] [dense_llama] NPL=64 ready LOADED_GB=46.97 -[2026-06-26 04:55:10] [dense_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16382, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 39.8, "decode_agg_tps": 277.8, "decode_perseq_tps": 3.09, "prefill_tps": 95.9, "ttft_mean_ms": 321618.8, "ttft_max_ms": 352633.6, "wall_s": 411.603} -[2026-06-26 04:55:10] [dense_llama] NPL=64 PEAK_GB=83.96 -[2026-06-26 04:55:16] [dense_llama] NPL=64 server stopped mem=3.30 -[2026-06-26 04:55:16] [dense_llama] NPL=128 launching server PRE_GB=3.30 -[2026-06-26 04:55:21] [dense_llama] NPL=128 ready LOADED_GB=47.09 -[2026-06-26 05:13:18] [dense_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32767, "prompt_tok_total": 67969, "gen_per_req": 256.0, "agg_tps": 30.9, "decode_agg_tps": 384.6, "decode_perseq_tps": 1.86, "prefill_tps": 69.7, "ttft_mean_ms": 902762.7, "ttft_max_ms": 
975832.6, "wall_s": 1061.031} -[2026-06-26 05:13:18] [dense_llama] NPL=128 PEAK_GB=93.82 -[2026-06-26 05:13:25] [dense_llama] NPL=128 server stopped mem=3.31 -[2026-06-26 05:13:25] [dense_llama] ==== DONE dense_llama POST_GB=3.31 ==== -[2026-06-26 05:13:25] [dense_vllm] ==== START dense_vllm (vllm) baseline_mem=3.31 ==== -[2026-06-26 05:13:25] [dense_vllm] launching vllm PRE_GB=3.31 -[2026-06-26 05:21:15] [dense_vllm] vllm ready LOADED_GB=110.48 -[2026-06-26 05:21:27] [dense_vllm] GATE='Here\'s a thinking process:\n\n1. **Analyze User Input:** The user says "The capital of France is"\n2. **Identify Key Entity/Question:** The question is asking for the capital city of France.\n3. **Retrieve Knowledge:** I know from general knowledge that t -[2026-06-26 05:21:59] [dense_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1959, "prompt_tok_total": 4195, "gen_per_req": 244.9, "agg_tps": 65.6, "decode_agg_tps": 70.4, "decode_perseq_tps": 8.76, "prefill_tps": 2096.2, "ttft_mean_ms": 1861.1, "ttft_max_ms": 2000.6, "wall_s": 29.843} -[2026-06-26 05:21:59] [dense_vllm] NPL=8 PEAK_GB=110.92 -[2026-06-26 05:22:47] [dense_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8165, "prompt_tok_total": 16900, "gen_per_req": 255.2, "agg_tps": 176.3, "decode_agg_tps": 211.8, "decode_perseq_tps": 6.28, "prefill_tps": 2182.6, "ttft_mean_ms": 5353.2, "ttft_max_ms": 7741.4, "wall_s": 46.302} -[2026-06-26 05:22:47] [dense_vllm] NPL=32 PEAK_GB=110.87 -[2026-06-26 05:23:59] [dense_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16314, "prompt_tok_total": 33828, "gen_per_req": 254.9, "agg_tps": 236.5, "decode_agg_tps": 309.1, "decode_perseq_tps": 4.38, "prefill_tps": 2088.9, "ttft_mean_ms": 9512.4, "ttft_max_ms": 16191.0, "wall_s": 68.976} -[2026-06-26 05:23:59] [dense_vllm] NPL=64 PEAK_GB=110.88 -[2026-06-26 05:25:57] [dense_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32640, "prompt_tok_total": 67969, "gen_per_req": 255.0, "agg_tps": 288.4, "decode_agg_tps": 418.8, 
"decode_perseq_tps": 2.79, "prefill_tps": 1929.1, "ttft_mean_ms": 18449.5, "ttft_max_ms": 35227.7, "wall_s": 113.162} -[2026-06-26 05:25:57] [dense_vllm] NPL=128 PEAK_GB=110.95 -[2026-06-26 05:26:27] [dense_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 2044, "prompt_tok_total": 4187, "gen_per_req": 255.5, "agg_tps": 68.1, "decode_agg_tps": 73.4, "decode_perseq_tps": 9.07, "prefill_tps": 1921.9, "ttft_mean_ms": 1877.6, "ttft_max_ms": 2178.1, "wall_s": 30.018} -[2026-06-26 05:26:35] [dense_vllm] ==== DONE dense_vllm POST_GB=3.53 ==== -[2026-06-26 05:26:35] [moe_llama] ==== START moe_llama (llama) baseline_mem=3.53 ==== -[2026-06-26 05:26:35] [moe_llama] NPL=8 launching server PRE_GB=3.53 -[2026-06-26 05:26:50] [moe_llama] NPL=8 ready LOADED_GB=36.42 -[2026-06-26 05:26:52] [moe_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1. **Analyze User Input:**\n - User says: "The capital of France is"\n - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2. 
**Identify Key Information:* -[2026-06-26 05:27:06] [moe_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2048, "prompt_tok_total": 4195, "gen_per_req": 256.0, "agg_tps": 156.8, "decode_agg_tps": 211.8, "decode_perseq_tps": 24.45, "prefill_tps": 1236.4, "ttft_mean_ms": 2477.1, "ttft_max_ms": 3392.9, "wall_s": 13.061} -[2026-06-26 05:27:06] [moe_llama] NPL=8 PEAK_GB=39.66 -[2026-06-26 05:27:11] [moe_llama] NPL=8 server stopped mem=3.34 -[2026-06-26 05:27:11] [moe_llama] NPL=32 launching server PRE_GB=3.34 -[2026-06-26 05:27:16] [moe_llama] NPL=32 ready LOADED_GB=36.54 -[2026-06-26 05:27:54] [moe_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8192, "prompt_tok_total": 16900, "gen_per_req": 256.0, "agg_tps": 235.6, "decode_agg_tps": 393.0, "decode_perseq_tps": 10.02, "prefill_tps": 1213.9, "ttft_mean_ms": 8225.2, "ttft_max_ms": 13921.9, "wall_s": 34.768} -[2026-06-26 05:27:54] [moe_llama] NPL=32 PEAK_GB=47.11 -[2026-06-26 05:28:00] [moe_llama] NPL=32 server stopped mem=3.30 -[2026-06-26 05:28:00] [moe_llama] NPL=64 launching server PRE_GB=3.30 -[2026-06-26 05:28:05] [moe_llama] NPL=64 ready LOADED_GB=36.39 -[2026-06-26 05:29:10] [moe_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16384, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 271.0, "decode_agg_tps": 527.0, "decode_perseq_tps": 6.15, "prefill_tps": 1152.3, "ttft_mean_ms": 15849.5, "ttft_max_ms": 29356.9, "wall_s": 60.449} -[2026-06-26 05:29:10] [moe_llama] NPL=64 PEAK_GB=57.13 -[2026-06-26 05:29:16] [moe_llama] NPL=64 server stopped mem=3.28 -[2026-06-26 05:29:16] [moe_llama] NPL=128 launching server PRE_GB=3.28 -[2026-06-26 05:29:21] [moe_llama] NPL=128 ready LOADED_GB=36.48 -[2026-06-26 05:34:19] [moe_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32760, "prompt_tok_total": 67969, "gen_per_req": 255.9, "agg_tps": 112.7, "decode_agg_tps": 726.4, "decode_perseq_tps": 3.73, "prefill_tps": 276.8, "ttft_mean_ms": 213017.2, "ttft_max_ms": 245528.7, "wall_s": 290.634} 
-[2026-06-26 05:34:19] [moe_llama] NPL=128 PEAK_GB=61.51 -[2026-06-26 05:34:25] [moe_llama] NPL=128 server stopped mem=3.28 -[2026-06-26 05:34:25] [moe_llama] ==== DONE moe_llama POST_GB=3.28 ==== -[2026-06-26 05:34:25] [moe_vllm] ==== START moe_vllm (vllm) baseline_mem=3.28 ==== -[2026-06-26 05:34:25] [moe_vllm] launching vllm PRE_GB=3.28 -[2026-06-26 05:39:35] [moe_vllm] vllm ready LOADED_GB=109.46 -[2026-06-26 05:39:38] [moe_vllm] GATE='Here\'s a thinking process:\n\n1. **Analyze User Input:**\n - User says: "The capital of France is"\n - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2. **Identify Key Information:**\n - C -[2026-06-26 05:39:47] [moe_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1900, "prompt_tok_total": 4195, "gen_per_req": 237.5, "agg_tps": 231.2, "decode_agg_tps": 256.5, "decode_perseq_tps": 31.84, "prefill_tps": 5186.5, "ttft_mean_ms": 768.8, "ttft_max_ms": 808.2, "wall_s": 8.217} -[2026-06-26 05:39:47] [moe_vllm] NPL=8 PEAK_GB=109.62 -[2026-06-26 05:40:07] [moe_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 7794, "prompt_tok_total": 16900, "gen_per_req": 243.6, "agg_tps": 426.4, "decode_agg_tps": 500.8, "decode_perseq_tps": 14.9, "prefill_tps": 6223.4, "ttft_mean_ms": 1830.4, "ttft_max_ms": 2714.2, "wall_s": 18.28} -[2026-06-26 05:40:07] [moe_vllm] NPL=32 PEAK_GB=109.63 -[2026-06-26 05:40:37] [moe_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 15927, "prompt_tok_total": 33828, "gen_per_req": 248.9, "agg_tps": 550.7, "decode_agg_tps": 686.1, "decode_perseq_tps": 9.83, "prefill_tps": 5926.5, "ttft_mean_ms": 3224.4, "ttft_max_ms": 5704.9, "wall_s": 28.92} -[2026-06-26 05:40:37] [moe_vllm] NPL=64 PEAK_GB=109.63 -[2026-06-26 05:41:27] [moe_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 31795, "prompt_tok_total": 67969, "gen_per_req": 248.4, "agg_tps": 650.7, "decode_agg_tps": 882.2, "decode_perseq_tps": 6.05, "prefill_tps": 5300.5, "ttft_mean_ms": 
6487.7, "ttft_max_ms": 12817.8, "wall_s": 48.863} -[2026-06-26 05:41:27] [moe_vllm] NPL=128 PEAK_GB=109.64 -[2026-06-26 05:41:36] [moe_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 1702, "prompt_tok_total": 4187, "gen_per_req": 212.8, "agg_tps": 207.2, "decode_agg_tps": 226.4, "decode_perseq_tps": 28.06, "prefill_tps": 6021.3, "ttft_mean_ms": 642.7, "ttft_max_ms": 694.8, "wall_s": 8.213} -[2026-06-26 05:41:44] [moe_vllm] ==== DONE moe_vllm POST_GB=3.31 ==== - -==== ALL 16 ROWS COLLECTED (2 models x 2 engines x 4 npl) ==== -decode_agg t/s (llama | vLLM | llama%vLLM): - DENSE q36-27b-nvfp4: npl8 82.5|70.4|117% npl32 192.6|211.8|91% npl64 277.8|309.1|90% npl128 384.6|418.8|92% - MoE q36-35b-a3b: npl8 211.8|256.5|83% npl32 393.0|500.8|78% npl64 527.0|686.1|77% npl128 726.4|882.2|82% -peak_gb (llama on-demand grows | vLLM fixed ~107 pool): - DENSE llama 53.5->93.8 ; vLLM ~110.9 flat - MoE llama 39.7->61.5 ; vLLM ~109.6 flat -Final CSV: final_benchmark.csv ; analysis: QWEN36_NVFP4_BENCH.md (FINAL section). -Cleanup: no leftover server/bench PIDs; GPU free (memnow 3.28 GB); local-ai + local-ai-worker -containers restarted (host returned). DONE. diff --git a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PLAN.md b/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PLAN.md deleted file mode 100644 index 311e3631e..000000000 --- a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PLAN.md +++ /dev/null @@ -1,628 +0,0 @@ -# bf16 SSM-state cache: BUILD PLAN (PART C synthesis - hand this to the build agent) - -Status: READ-ONLY design. Lands ON TOP of patch 0021 (conv-state in-place fusion, building -concurrently on the GPU). DEFAULT = bf16 SSM recurrent state, f32 opt-out. This PART C is the -executive build brief: ordered edits, acceptance gate, bench targets, semantics/back-compat/risk -register, and the de-risk-first item. 
PART A (cparams wiring), PART B (kernel/op plumbing) and the -Appendix (upstream precedent + numeric safety) below are the detailed reference each step points into. - -The decision (settled by GDN_RECURRENCE_BYTE_GATE.md): the gated-DeltaNet recurrence is the dominant -decode kernel (51.6% of the step, 805 MB f32 state R+W/call at 74% of GB10 peak BW) and is ALREADY -single-pass (measured re-stream ~1.0x, hard-capped <=1.33x). The whole ~2x DRAM gap vs vLLM is purely -f32(llama) vs bf16(vLLM) state-cache WIDTH, not extra passes. Narrowing the persisted SSM state to -bf16 (load->f32, recurrence math in f32 UNCHANGED, store->bf16) halves the dominant term and reaches -vLLM parity-to-ahead. vLLM's own GDN state cache is bf16, so this is a fair equal-precision change. - -## C.0 Synthesis decisions that OVERRIDE the per-part text - -1. v1 ships `type_s` = BF16 (SSM recurrent state, the 805 MB lever) and KEEPS `type_r` = F32 (conv - state). Reason: `ggml_concat` at prefill (`build_conv_state`, delta-net-base.cpp:472) requires - same-type operands; a bf16 conv cache breaks the f32 `qkv_mixed` concat. Conv state is ~12.6 MB - (launch-bound, ~0 ms byte benefit), so keeping it f32 costs nothing. This OVERRIDES PART A §3a/§3b, - which set BOTH defaults to BF16: in v1 set the `type_r` / `cache_type_conv` DEFAULT to - `GGML_TYPE_F32`. `type_r`=bf16 is a v2 follow-up (needs an f32 staging view before the prefill - concat - PART B §B.6). -2. Keep ALL transient/scratch tensors f32: the GDN op OUTPUT scratch (ggml.c:6327), the 0019 gather - scratch, and the keep_rs_t prefill snapshot. ONLY the PERSISTED cache rows narrow to bf16 (the - src[5] read view and the src[6] in-place write view). -3. The gate REPLACES the bit-exact md5 gate for the bf16 default: bf16 is intentionally non-bit-exact - vs llama f32 (it is equal precision to vLLM's bf16). 
The 0018/0019 md5 gate STILL applies to (a) - patch 0021's conv fusion and (b) verifying the f32 opt-out path is byte-identical to the pre-bf16 - f32 baseline. - -## C.1 Ordered file-by-file edit list (build order, on top of 0021) - -Order is dependency- and de-risk-driven: prove the kernel dtype-correct in ISOLATION before flipping -any default. Section refs point into PART A / PART B below. - -STEP 1 - kernel + op made dtype-generic (the load/store conversion), validated standalone: -- 1a `ggml/src/ggml.c` - relax the F32-only state asserts to {F32,BF16} in the 3 GDN builders: - `ggml_gated_delta_net` (~6308), `_inplace` (~6370), `_inplace_ids` (~6430), on `state` and - `src_state_dst`. KEEP the op OUTPUT scratch F32 (6327). [PART B §B.2] -- 1b `ggml/src/ggml-cuda/ggml-cuda.cu` - `supports_op` `GGML_OP_GATED_DELTA_NET` (~3096): permit a - BF16 `src[5]`/`src[6]`. [PART B §B.3] -- 1c `ggml/src/ggml-cuda/gated_delta_net.cu` - template kernel+gather+launch on `bool STATE_BF16`; - `#include <cuda_bf16.h>`. LOAD `__bfloat162float` (~102), STORE `__float2bfloat16` (~207), GATHER - bf16->f32 scratch (~20). Cast `src_state`/`src_state_dst` pointers to `nv_bfloat16` on bf16; relax - dispatcher asserts (309-311) `sizeof(float)` -> `ggml_type_size(type)`. Keep gather scratch + - keep_rs_t snapshot f32. ALL recurrence math (106-200) UNCHANGED in f32 registers. [PART B §B.4,§B.8] -- 1d `ggml/src/ggml-cpu/ops.cpp` - matching bf16 load/store branch in the GDN reference (10726/10744/ - 10891 load via `GGML_BF16_TO_FP32`, 10758-10762 store via `GGML_FP32_TO_BF16`); relax `nb[]` asserts - to `ggml_type_size(type)`. [PART B §B.5] -- 1e `tests/test-backend-ops.cpp` - add a BF16-state `GATED_DELTA_NET` case covering BOTH `n_tokens==1` - decode AND a multi-token (prefill/chunk) + `keep_rs_t==true` path, CUDA bf16 vs CPU bf16 reference. - THIS IS THE DE-RISK GATE for Step 1 (see C.5). Build + pass before Step 2. 
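
The load/store contract in STEP 1c/1d (load bf16 -> f32, math in f32, store f32 -> bf16) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the kernel: `f32_to_bf16_bits`, `bf16_bits_to_f32`, `to_f32` and `decay_step` are hypothetical helper names, the rounding shown is plain round-to-nearest-even on finite values (no NaN handling), and the `g * s + u` update is only a stand-in for the real gated-DeltaNet recurrence.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Round a finite value to bf16 (round-to-nearest-even); return the 16-bit pattern."""
    (u,) = struct.unpack("<I", struct.pack("<f", x))  # f32 bit pattern of x
    # Add 0x7FFF plus the lowest kept bit so that dropping the low 16 bits
    # rounds to nearest, with ties going to the even bf16 mantissa.
    u += 0x7FFF + ((u >> 16) & 1)
    return (u >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Widen a bf16 bit pattern to f32 (exact: bf16 is the top half of an f32)."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def to_f32(x: float) -> float:
    """Round a Python float (f64) to f32 so the arithmetic mimics f32 registers."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def decay_step(cache_bits, g, update):
    """One step over a bf16-persisted state: LOAD -> f32 math -> STORE.

    ASSUMPTION: `g * s + u` is a placeholder recurrence, not the GDN update.
    """
    out = []
    for b, u in zip(cache_bits, update):
        s = bf16_bits_to_f32(b)          # LOAD: bf16 cache row -> f32
        s = to_f32(g * s + u)            # math stays in f32
        out.append(f32_to_bf16_bits(s))  # STORE: f32 -> bf16 cache row
    return out
```

Only the value written back to the cache is rounded; the intermediate `s` never leaves f32, which is the property the BF16-state test case in 1e is meant to pin down.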
- -STEP 2 - cparams selection wiring (llama.cpp core): -- 2a `include/llama.h` (after :366) - add `enum ggml_type type_s;` and `type_r;` adjacent to - `type_k`/`type_v`, marked `[EXPERIMENTAL]`. [PART A §3a] -- 2b `src/llama-context.cpp:3468` (`llama_context_default_params`) - add `/*.type_s =*/ GGML_TYPE_BF16,` - and `/*.type_r =*/ GGML_TYPE_F32,`. THIS IS THE DEFAULT CHANGE (type_r stays F32 per C.0). [PART A §3a] -- 2c `src/llama-memory.h:19` (`struct llama_memory_params`) - add `ggml_type type_r;` and `type_s;`. - [PART A §3a] -- 2d `src/llama-context.cpp:325` (`params_mem` init) - pass `params.type_r` / `params.type_s`. [PART A §3a] -- 2e `src/llama-model.cpp` - replace the 3 hardcoded `GGML_TYPE_F32` pairs (2056-57 recurrent, 2098-99 - hybrid_iswa, 2117-18 hybrid = the qwen35/qwen35moe path) with `params.type_r` / `params.type_s`. - [PART A §2/§3a] - -STEP 3 - back-compat for saved recurrent state (REQUIRED, the default flips): -- 3a `src/llama-memory-recurrent.cpp` `state_read_data` - on `s_type_i_ref != live type` with both in - {F32,BF16}, CONVERT row-by-row during load instead of returning false (same for `r`). Bump the - recurrent state-file version. [PART A §5, option A] - -STEP 4 - CLI / llama-server surface (needed by the gate harness): -- 4a `common/common.h:566` region - `cache_type_ssm = GGML_TYPE_BF16;` and - `cache_type_conv = GGML_TYPE_F32;` (conv default F32 per C.0). [PART A §3b] -- 4b `common/common.cpp:1589` region - `cparams.type_s = params.cache_type_ssm;` and - `cparams.type_r = params.cache_type_conv;`. [PART A §3b] -- 4c `common/arg.cpp` (after :2074) - add `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv` - via the existing `kv_cache_type_from_str` (arg.cpp:402); confirm `bf16` -> `GGML_TYPE_BF16`. The C.2 - harness depends on `--cache-type-ssm {f32,bf16}`. 
[PART A §3b] - -STEP 5 - LocalAI gRPC / YAML (force f32 from model config): -- 5a `backend/backend.proto` - `string CacheTypeSSM` / `CacheTypeConv` (next free tags after 64); - regen proto. [PART A §3c] -- 5b `backend/cpp/llama-cpp/grpc-server.cpp:504` region - `params.cache_type_ssm = - kv_cache_type_from_str(request->cachetypessm());` + conv. [PART A §3c] -- 5c `core/config/model_config.go:935` - `CacheTypeSSM`/`CacheTypeConv` yaml fields. [PART A §3c] -- 5d `core/backend/options.go:247` - map into the request. [PART A §3c] -- 5e `core/config/meta/registry.go` + `build_test.go` - register `cache_type_ssm`/`cache_type_conv` - as static fields (gate). [PART A §3c] - -STEP 6 - capability fallback (heterogeneous / CPU-offload safety): -- 6a `src/llama-context.cpp:518-595` - an `auto_fgdn`-style device-match probe: if a participating - device lacks the bf16 GDN load/store specialization (CPU-offloaded GDN layer, non-GB10 backend), - demote `type_s` to F32 BEFORE alloc and log once. [PART A §4] - -## C.2 Acceptance gate (REPLACES the bit-exact md5 gate) - -bf16 is intentionally non-bit-exact, so the 0018/0019 md5 byte-equality gate does NOT apply to the -bf16 default. The gate is teacher-forced KL-divergence + PPL-delta + greedy coherence + a -long-context drift sweep, vs the SAME model run f32. All commands on `dgx.casa` (DO NOT run during -this design - GPU busy). Binaries `~/llama-paged-dev/build*/bin`; models `~/bench/q36-27b-nvfp4.gguf` -(dense) and `~/bench/q36-35b-a3b-nvfp4.gguf` (MoE); scratch `~/bench/klgate`. - -Why teacher-forced (not self-greedy): a self-greedy decode lets each precision pick its own argmax, -so after the first divergence the contexts differ and per-token logits are no longer comparable (you -measure trajectory divergence, not numeric error). 
`llama-perplexity --kl-divergence` feeds both -precisions the IDENTICAL token stream and compares output distributions position-by-position; the -greedy trajectory is validated SEPARATELY by the Same-top-p metric + a coherence read. - -Corpus (one-time): wikitext-2 raw test (~280k tokens) into `~/bench/klgate`. KL mode needs ->= 2*n_ctx tokens; any fixed >=8k-token UTF-8 file works as long as base AND test share it. - -256-token headline gate (per model; shown for dense): -``` -M=~/bench/q36-27b-nvfp4.gguf; F=~/bench/klgate/wikitext-2-raw/wiki.test.raw; D=~/bench/klgate -COMMON="-m $M -f $F -c 256 -b 256 -ngl 99 -fa on --seed 1 --chunks 32" -# (a) f32 BASE: reference logits + f32 PPL -llama-perplexity $COMMON --cache-type-ssm f32 --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32.c256.base.log -# (b) bf16 TEST: KL(bf16||f32) + bf16 PPL + Same-top-p -llama-perplexity $COMMON --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.bf16.c256.kl.log -``` -Noise floor (run FIRST, mandatory - GPU reductions are not bit-deterministic, so KLD has a non-zero -floor; bf16 is judged against BOTH the absolute threshold AND this floor): -``` -llama-perplexity $COMMON --cache-type-ssm f32 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32f32.floor.log -``` -Record `Mean KLD_floor` and `Same-top-p_floor` (expect KLD ~1e-6..1e-5, top-p ~100%). - -Coherence spot-check (greedy trajectory, reuses the 0018/0019 `--temp 0 --seed 1` convention): -``` -P="Explain how a transformer language model generates text, step by step." 
-for T in f32 bf16; do llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm $T 2>/dev/null > $D/q27.greedy.$T.txt; done -diff $D/q27.greedy.f32.txt $D/q27.greedy.bf16.txt && echo "GREEDY BYTE-IDENTICAL" -``` -Long-context drift sweep (verifies the g<1 decay bound: bf16 state-rounding error must stay FLAT, not -accumulate, as context grows - the GDN state spans the whole window): -``` -for C in 256 1024 2048 4096; do - CMN="-m $M -f $F -c $C -b $C -ngl 99 -fa on --seed 1 --chunks 8" - llama-perplexity $CMN --cache-type-ssm f32 --kl-divergence-base $D/q27.f32.c$C.kld >/dev/null - llama-perplexity $CMN --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c$C.kld | tee $D/q27.bf16.c$C.kl.log -done -``` -f32 opt-out verification (the safety valve must actually select f32 and reproduce the committed f32 -greedy md5 from 0018/0019 - the bf16 default must NOT change the f32-path output): -``` -llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm f32 2>/dev/null | md5sum # == 0018/0019 f32 baseline md5 -``` -Repeat the WHOLE gate verbatim for the MoE model (`M=~/bench/q36-35b-a3b-nvfp4.gguf`). 
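
What the teacher-forced gate measures can be sketched as follows: both precisions see the same token stream, and at each position the two output distributions are compared. This is a minimal sketch, not `llama-perplexity`'s implementation. The function names are hypothetical, the KL direction (f32 reference first) is an assumption, and "same top" is assumed to mean that the two argmax tokens agree. The thresholds in `passes_headline` are the Mean KLD and Same-top-p rows of this section's PASS/FAIL table.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single position (direction is an assumption)."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def gate_metrics(ref_positions, test_positions):
    """Return (mean KLD in nats, percent of positions whose top token matches)."""
    klds = [kl_divergence(r, t) for r, t in zip(ref_positions, test_positions)]
    same = sum(argmax(r) == argmax(t) for r, t in zip(ref_positions, test_positions))
    n = len(klds)
    return sum(klds) / n, 100.0 * same / n


def passes_headline(mean_kld, same_top_pct):
    """Headline thresholds from the plan: Mean KLD < 1e-3 nats, Same-top-p >= 99.5%."""
    return mean_kld < 1e-3 and same_top_pct >= 99.5
```

Because both runs are scored on identical contexts, a non-zero KLD here reflects numeric error at that position rather than two trajectories having drifted apart.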
- -PASS/FAIL (bf16 ships as DEFAULT only if ALL rows pass for BOTH dense and MoE): - -| metric | source | PASS threshold | -|---|---|---| -| Mean KLD | 256-gate (b) | **< 1e-3 nats** (hard, the brief) | -| Mean KLD vs floor | (b) vs floor | <= ~5x `Mean KLD_floor` (bounded signal, not pure noise) | -| Same top p | (b) | **>= 99.5%** (100% => greedy byte-identical to f32) | -| PPL-delta `ln(PPL_bf16/PPL_f32)` | (a)+(b) | **abs < 0.005** (PPL within +-0.5%) | -| Max / 99.9% KLD | (b) | report; flag if Max > 0.05 (tail outliers) | -| Coherence | greedy | fluent + on-topic; byte-identical if Same-top-p=100% | -| Long-context drift | sweep | MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0% | - -If any row fails for a model: keep THAT model on f32 (gallery YAML `cache_type_ssm: f32`) while the -global default stays bf16; the cparams f32 fallback is the safety valve. MoE has fewer GDN layers -(31 vs 48) and smaller per-head state (H_v=32 vs 48), so expected KLD <= dense; same thresholds. -Same-top-p is the bridge to the old md5 harness: at 100% the bf16 greedy output is byte-identical to -f32 and the 0018/0019 md5 gate would still pass - the strongest possible non-bit-exact result. - -## C.3 Bench targets + nsys confirmation - -Dense q36-27b-nvfp4 (48 GDN layers, S_v=128, H_v=48), npl128, GB10/sm_121, graphs-OFF -apples-to-apples (the measured baseline): -- Recurrence per call: 3.98 ms (f32, 805 MB R+W, 74% peak) -> **~2.0-3.0 ms** (bf16, ~413 MB R+W). - 2.0 ms = 74% peak retained; 3.0 ms = conservative 50% peak on the smaller footprint. -- Recurrence per step: 191 ms -> ~96-143 ms (save ~48-95 ms). -- Step time: 384 ms -> **289-339 ms**. -- Decode throughput: ~335 -> **360-443 tok/s** = parity-to-ahead of vLLM (327 ms / 391 tok/s). - -MoE q36-35b-a3b-nvfp4 (31 GDN layers, H_v=32): state per (seq,layer) = 128*128*32*4 = 2.0 MiB f32 -> -per-call R+W ~537 MB f32 -> ~268 MB bf16. 
Fewer layers + smaller state => smaller ABSOLUTE recurrence -savings, and MoE decode is more GEMM-bound (the `MUL_MAT_ID` expert path), so the bf16-state win is a -smaller FRACTION of the MoE step. Target: a measurable per-call halving of the GDN recurrence time -with the C.2 KL gate passing; no absolute MoE step target is asserted here (the MoE step is -MUL_MAT_ID-dominated, a separate lever from this one). - -nsys confirmation (the measurement that proves the lever landed): -``` -GGML_CUDA_DISABLE_GRAPHS=1 nsys profile -o ssmbf16 --force-overwrite true \ - llama-batched-bench -m $M -npp 8 -ntg 12 -npl 128 -ub 2048 -nsys stats --report cuda_gpu_kern_sum ssmbf16.nsys-rep | grep -i gated_delta_net -``` -Confirm: `gated_delta_net_cuda` mean duration/call drops 3.98 -> 2.0-3.0 ms; step time + tok/s land in -the 289-339 ms / 360-443 tok/s band; the f32 opt-out reproduces the 3.98 ms f32 call. The gate is the -JOINT condition: per-call speed in band AND KL<1e-3 - neither alone ships bf16. - -## C.4 Default / opt-out semantics, back-compat, risk register - -Semantics: -- DEFAULT `type_s` = `GGML_TYPE_BF16` (SSM recurrent state). `type_r` = `GGML_TYPE_F32` in v1 (conv - state; bf16 is v2). This is the INVERSE of KV (KV is opt-IN to compression at F16 default; SSM is - opt-OUT to f32). -- Opt-out: `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML) -> bit-exact f32 - recurrence. Per-model opt-out lives in gallery YAML if a model fails the gate; the global default - stays bf16. -- Silent capability fallback: the C.1 STEP 6 device-match probe demotes `type_s` to F32 before alloc - on devices lacking the bf16 GDN specialization (CPU offload / non-GB10) and logs once. - -Back-compat (the ONE real breakage): `llama-memory-recurrent.cpp` serializes the per-layer state -dtype and HARD-matches on restore (mismatch -> `"mismatched s type"` -> returns false). The f32->bf16 -default flip makes OLD f32-saved sessions fail to restore against a bf16 build. 
Fix = STEP 3a: convert -row-by-row on mismatch (both in {F32,BF16}) + bump the recurrent state-file version. KV never hit this -because `type_k`/`type_v` were EXPERIMENTAL and never default-changed; the SSM default FLIP is what -forces the convert/version work. - -Risk register: -- **R1 numeric drift (KL gate fails).** Likelihood LOW: g<1 geometric decay contracts per-step bf16 - rounding to a bounded series (~`eps/(1-exp(g_mean))`), f32 registers confine rounding to one - per-step cache write, and vLLM ships this exact config in production. Mitigation: C.2 gate + - per-model f32 opt-out + global f32 fallback. -- **R2 prefill / keep_rs_t / gather state path (the silent-corruption landmine).** The conversion - points are documented for DECODE; the SAME kernel also runs the chunked prefill path, the keep_rs_t - snapshot (writes to f32 scratch while the cache is bf16), and the 0019 gather (reads bf16 cache -> - f32 scratch). A dtype mistake on any of these corrupts the state at the prefill->decode handoff and - surfaces ONLY as long-context drift, which a decode-only 256-token gate can mask. Mitigation: STEP - 1e test-backend-ops MUST cover the multi-token prefill + keep_rs_t==true path, not just decode; the - C.2 long-context sweep is the second net. (This is C.5, the single biggest risk.) -- **R3 MoE MUL_MAT_ID path.** The GDN recurrence op is IDENTICAL for dense and MoE; the MoE expert - GEMM (`MUL_MAT_ID`) does NOT touch the SSM state, so bf16-state is orthogonal to the expert path. - Residual risk: `qwen35moe` `build_recurrent_attn` must route the same bf16 state view (it shares - delta-net-base.cpp). Mitigation: run the full C.2 gate on the MoE model; the test-backend-ops case - is arch-agnostic. -- **R4 conv-state coupling with patch 0021.** Flipping `type_r` to bf16 breaks `ggml_concat` at - prefill (different types). Mitigation: v1 keeps `type_r`=F32 (C.0); `type_r`=bf16 deferred to v2 - with an f32 staging view (PART B §B.6). 
-- **R5 back-compat restore failure.** Mitigation: STEP 3a convert + version bump (above). - -## C.5 Single biggest risk + how the build agent de-risks it FIRST - -Single biggest risk: **R2 - silent state corruption on the NON-decode state paths** (chunked prefill, -the keep_rs_t snapshot, the 0019 gather). The 805 MB measurement and every conversion-point in the -cheat-sheet describe the STEADY decode path (`n_tokens==1`, `!keep_rs_t`). But the bf16 cache is ALSO -read/written by the multi-token prefill path and the prefill/rollback snapshot (which targets f32 -scratch while the cache is bf16). A dtype bug there does not crash and barely moves the 256-token -decode md5; it corrupts the recurrent state at the prefill->decode boundary and shows up ONLY as -long-context drift - exactly the failure a quick gate misses. - -De-risk FIRST (before ANY default flip or wiring): implement STEP 1 (kernel + op dtype-generic) and -STEP 1e (test-backend-ops) ONLY, then prove the kernel is dtype-correct in ISOLATION by forcing a -bf16 state allocation behind a temporary debug flag and running test-backend-ops with a case that -exercises (a) single-token decode, (b) a multi-token prefill chunk, and (c) `keep_rs_t==true`, -comparing CUDA bf16 against the CPU bf16 reference AND against the f32 path within tolerance. Only -after that case is GREEN does the build agent proceed to STEP 2 (flip the default) and the C.2 -model-level gate. This decouples kernel dtype-correctness from the cparams wiring, so a Step-1 bug is -caught by a deterministic op test in minutes instead of as a fuzzy long-context regression after the -full stack is wired. - ---- - -# bf16 SSM state cache — cparams wiring (DEFAULT bf16 + f32 opt-out) - -Label: cparams-default-fallback (READ-ONLY design). Mirrors the KV-cache `type_k`/`type_v` -precision plumbing exactly. Designed against HEAD-after-patch-0021 (conv-state in-place fusion). 
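
The per-call byte figures this plan relies on follow from simple arithmetic, sketched here as a sanity check. Assumptions: the per-head state is `S_k x S_v` with `S_k = S_v = 128`, each call reads and writes every sequence's state exactly once (single pass), and `state_rw_bytes` is a hypothetical helper, not a function in the tree.

```python
def state_rw_bytes(s_k, s_v, h_v, n_seq, bytes_per_elem):
    """Bytes read + written per recurrence call for one layer.

    ASSUMPTION: one read and one write of the full state per sequence,
    counting only the persisted state rows (no q/k/v or scratch traffic).
    """
    per_seq_layer = s_k * s_v * h_v * bytes_per_elem
    return 2 * n_seq * per_seq_layer


# Dense q36-27b (H_v=48) and MoE q36-35b-a3b (H_v=32) at npl128, f32 state.
dense_f32 = state_rw_bytes(128, 128, 48, 128, 4)
moe_f32 = state_rw_bytes(128, 128, 32, 128, 4)
# Same MoE shape with a 2-byte (bf16) persisted state.
moe_bf16 = state_rw_bytes(128, 128, 32, 128, 2)

print(round(dense_f32 / 1e6), round(moe_f32 / 1e6), round(moe_bf16 / 1e6))
```

The sketch counts state rows only, so it says nothing about how close a given kernel gets to peak bandwidth on those bytes.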
- -This is lever (2) of GDN_RECURRENCE_BYTE_GATE.md: the recurrent SSM state cache is the dominant -decode byte stream (805 MB R+W/call, 51.6% of step, single-pass f32 = at the BW floor). The whole -~2x DRAM gap vs vLLM is f32(llama) vs bf16(vLLM) state width. Storing the persisted state in bf16 -(load→f32, recurrence math in f32 UNCHANGED, store→bf16) halves the dominant term. vLLM's GDN state -cache is bf16, so bf16-default is the fair equal-precision comparison → make it the DEFAULT. - ---- - -## 1. The KV-cache template we mirror (exact chain for type_k / type_v) - -``` -CLI common/arg.cpp:2052 -ctk/--cache-type-k TYPE → params.cache_type_k - (common_params, common/common.h:566, default GGML_TYPE_F16) - ↓ -glue common/common.cpp:1589 cparams.type_k = params.cache_type_k (cparams = llama_context_params) - ↓ -API include/llama.h:365 llama_context_params.type_k // [EXPERIMENTAL] - llama-context.cpp:3468 default in llama_context_default_params() = GGML_TYPE_F16 - ↓ -mem llama-context.cpp:326 llama_memory_params params_mem.type_k = params.type_k - llama-memory.h:19 struct llama_memory_params { ggml_type type_k; type_v; ... } - ↓ -alloc llama-model.cpp:2030 create_memory(params_mem, cparams) → KV cache uses params.type_k -``` - -Key facts: -- `type_k`/`type_v` are NOT stored in `struct llama_cparams` (src/llama-cparams.h). They ride in - `llama_context_params` → `llama_memory_params` and are consumed directly at cache-alloc time. - We mirror that: NO new `llama_cparams` field is needed. -- KV default is opt-IN to compression (F16 default, pass `-ctk q8_0` to shrink). SSM is the INVERSE: - bf16 DEFAULT, pass an explicit `f32` to opt out / restore bit-exactness. - -## 2. 
Where the SSM state type is currently hardcoded (the targets) - -The recurrent cache constructor already accepts the types — only the model hardcodes F32: - -- `src/llama-memory-recurrent.cpp:22-23` ctor params `ggml_type type_r, type_s` - - `r_l` (line 100, `n_embd_r`) = short conv state → `type_r` (TINY: conv_width-1 taps × conv_dim) - - `s_l` (line 101, `n_embd_s`) = SSM recurrent state → `type_s` (THE 805 MB/call dominant) -- `src/llama-memory-hybrid.h:32-33` ctor params `type_r, type_s` (qwen35 / qwen35moe path) -- Hardcoded `GGML_TYPE_F32` call sites in `src/llama-model.cpp::create_memory`: - - 2056-2057 `llama_memory_recurrent(...)` (pure recurrent arches) - - 2098-2099 `llama_memory_hybrid_iswa(...)` recurrent_type_r / recurrent_type_s - - 2117-2118 `llama_memory_hybrid(...)` recurrent_type_k / recurrent_type_v (mislabeled; they are r/s) - -Note: `qwen35` / `qwen35moe` are HYBRID (filter_attn/filter_recr, no SWA) → they take the -`llama_memory_hybrid` branch (2108-2118). That is the call site that matters for the parity push. - -## 3. New plumbing (parallel chain `type_s` / `type_r`) - -### 3a. Public API + cparams glue (llama.cpp side) - -| File | Change | -|------|--------| -| `include/llama.h` (after :366) | Add `enum ggml_type type_s; // data type for recurrent SSM state cache [EXPERIMENTAL]` and `enum ggml_type type_r; // data type for recurrent conv state cache [EXPERIMENTAL]`. Place adjacent to `type_k`/`type_v`. | -| `src/llama-context.cpp:3468` (default params) | Add `/*.type_s =*/ GGML_TYPE_BF16,` and `/*.type_r =*/ GGML_TYPE_BF16,`. **This is the DEFAULT change.** | -| `src/llama-memory.h:19` (`struct llama_memory_params`) | Add `ggml_type type_r;` and `ggml_type type_s;` next to `type_k`/`type_v`. | -| `src/llama-context.cpp:325` (`params_mem` init) | Add `/*.type_r =*/ params.type_r,` and `/*.type_s =*/ params.type_s,`. 
| -| `src/llama-model.cpp` 2056-57 / 2098-99 / 2117-18 | Replace the 3 hardcoded `GGML_TYPE_F32` pairs with `params.type_r` / `params.type_s`. | - -### 3b. CLI / llama-server (common side) - -| File | Change | -|------|--------| -| `common/common.h:566` region | Add `ggml_type cache_type_ssm = GGML_TYPE_BF16;` and `ggml_type cache_type_conv = GGML_TYPE_BF16;` (mirror `cache_type_k/v`; note the DEFAULT is BF16, not F16). | -| `common/common.cpp:1589` region | Add `cparams.type_s = params.cache_type_ssm;` and `cparams.type_r = params.cache_type_conv;`. | -| `common/arg.cpp` (after :2074) | Add `--cache-type-ssm TYPE` (`-ctssm`) → `params.cache_type_ssm = kv_cache_type_from_str(value)`, and `--cache-type-conv TYPE` (`-ctconv`). Reuse the existing `kv_cache_type_from_str` (arg.cpp:402). Help text: "recurrent SSM state cache type (default bf16; pass f32 for bit-exact recurrence)". | - -`kv_cache_type_from_str` already accepts `f32`/`bf16`/`f16` — no change needed; just confirm `bf16` -maps to `GGML_TYPE_BF16` (add the case if absent). - -### 3c. LocalAI gRPC backend (so users can force f32 from model YAML) - -Mirror `CacheTypeKey` exactly: - -| File | Change | -|------|--------| -| `backend/backend.proto:419` region | Add `string CacheTypeSSM = NN;` and `string CacheTypeConv = NN;` (next free field tags). Regenerate proto. | -| `backend/cpp/llama-cpp/grpc-server.cpp:504` region | `if (!request->cachetypessm().empty()) params.cache_type_ssm = kv_cache_type_from_str(request->cachetypessm());` and the conv equivalent. (grpc-server already has its own `kv_cache_type_from_str`; ensure it knows `bf16`.) | -| `core/config/model_config.go:935` region | Add `CacheTypeSSM string yaml:"cache_type_ssm,omitempty"` and `CacheTypeConv string yaml:"cache_type_conv,omitempty"`. | -| `core/backend/options.go:247` region | Add `CacheTypeSSM: c.CacheTypeSSM,` and `CacheTypeConv: c.CacheTypeConv,` to the request build. 
| -| `core/config/meta/registry.go:161` + `core/config/meta/build_test.go:140` | Register `cache_type_ssm` / `cache_type_conv` as static fields (the `staticFields` slice + registry map) so the meta-config gate passes. | - -LocalAI semantics: leaving `cache_type_ssm` UNSET in YAML → empty gRPC string → backend keeps its -BF16 default. Setting `cache_type_ssm: f32` → forces the f32 opt-out (bit-exact recurrence). - -## 4. Default / fallback semantics - -- **DEFAULT = `GGML_TYPE_BF16`** for both SSM state (`type_s`) and conv state (`type_r`). - - SSM state (`type_s`) is the lever: f32→bf16 halves 805→413 MB/call → ~3.98→~2.0-3.0 ms/call. - - Conv state (`type_r`) is negligible bytes; default it bf16 too for consistency, but it can stay - f32 with zero perf cost if patch-0021's in-place conv path assumes f32 — see §6. -- **Opt-out = `GGML_TYPE_F32`** via `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML). - Restores bit-exact recurrence; use when the KL gate (<1e-3 / PPL-delta over 256-tok greedy) fails - for a given model, or for deterministic regression baselines. -- **Silent capability fallback**: gate the bf16 path behind a device-match probe modeled on - `auto_fgdn` (llama-context.cpp:518-595). If the GDN recurrence kernel's bf16 load/store - specialization is unavailable on a participating device (e.g. a CPU-offloaded GDN layer with no - bf16 op, or a non-GB10 backend), fall back to `GGML_TYPE_F32` for `type_s` BEFORE cache alloc and - log it once. This keeps "bf16 default" from breaking heterogeneous/CPU setups. -- The kernel contract is unchanged-math: load bf16→f32 into `s_shard` (registers stay f32), all - recurrence arithmetic in f32, store f32→bf16. Only the persisted cache is rounded per step; - geometric decay (g<1) bounds the rounding (does not accumulate unboundedly). - -## 5. 
Back-compat (the one real breakage — saved sessions / state files) - -`src/llama-memory-recurrent.cpp` SERIALIZES the per-layer state tensor dtype and does a HARD match -on restore: -- write: `state_write_data` writes `s_type_i = (int32_t)s_l[il]->type` (line ~900) and the r type. -- read: `state_read_data` reads `s_type_i_ref`, compares to current `s_l[il]->type`, and on - mismatch logs `"mismatched s type (%d != %d, layer %d)"` and **returns false** (restore FAILS). - Same for `r` type. - -Consequence of the default flip f32→bf16: -- Sessions SAVED by an old f32-default build will FAIL to RESTORE against a new bf16-default build - (and vice versa), because the serialized `s_type_i_ref` (F32) ≠ the new cache type (BF16). - -Required handling (pick one, recommend A): -- **A (convert on mismatch, recommended)**: in `state_read_data`, when `s_type_i_ref != current` - and both ∈ {F32, BF16}, convert row-by-row during load (`ggml_fp32_to_bf16` / `bf16→fp32`) instead - of returning false. Same for `r`. Bump the recurrent state-file version so older readers reject - cleanly. This makes old f32 sessions loadable into bf16 caches and round-trips safely. -- **B (pin precision to the saved file)**: if a session is being restored, read `s_type_i_ref` - first and set `type_s`/`type_r` from it, overriding the default for that context. Keeps restore - working but silently disables the bf16 win for resumed sessions. -- **C (document-only)**: keep the hard match; document that bf16-default invalidates cross-version - saved recurrent states. Lowest effort, worst UX. Not recommended given parity is the goal. - -KV-cache parallel: `type_k`/`type_v` were always EXPERIMENTAL and non-default-changing, so the KV -path never had to solve this. The SSM default-FLIP is what forces the convert/version work — call it -out as the single most load-bearing back-compat item. - -## 6. Coupling notes / sequencing - -- Land ON TOP of patch 0021 (conv-state in-place fusion). 
If 0021's fused conv write assumes an f32 - conv-state tensor, either (a) extend it to the cache tensor's dtype, or (b) keep `type_r` = F32 by - default and make ONLY `type_s` bf16 (conv bytes are negligible, so this loses nothing perf-wise and - de-risks 0021). Decision: ship `type_s`=BF16 first; make `type_r`=BF16 a follow-up gated on 0021's - conv path being dtype-generic. -- Kernel side (separate patch, not this wiring): `ggml/src/ggml-cuda/gated_delta_net.cu` currently - takes `const float * curr_state` / `float * state_dst` and does `s_shard[r] = read_state[i]` - (line 102) — hardcoded f32. The bf16 build needs the dispatch to read `s0->type` and route a - bf16 load/store specialization; the gather kernel `gdn_gather_nonident_kernel` (line 7, `const - float * cache`) likewise needs a bf16 variant. The cparams wiring here only selects the cache - dtype; the kernel patch consumes it. Patches 0018 (in-place) / 0019 (gather) state asserts must be - relaxed from f32-only to {f32,bf16}. -- CPU mirror `ggml-cpu/ops.cpp` GDN path needs the same bf16 load/store for CI parity / fallback. - -## 7. Validation gate - -- KL < 1e-3 and PPL-delta within tolerance vs the f32-state build over a 256-token greedy run, per - model (dense q36-27b-nvfp4, MoE q36-35b-a3b-nvfp4). If a model fails, that model sets - `cache_type_ssm: f32` in its gallery YAML (per-model opt-out) — the global default stays bf16. -- Add a `test-backend-ops` case for the GDN recurrence with bf16 state (mirror the 0021 harness: - dense text md5 + MoE byte check) to lock the load→f32→store→bf16 contract. - ---- - -# Appendix - label `upstream-bf16-precedent` (READ-ONLY research) - -Precedent + numeric-safety justification for the §1-7 wiring above. Sources: paged dev tree -(`dgx.casa:~/llama-paged-dev`, branch `paged`) and the vLLM checkout -(`~/vllm-bench/.../site-packages/vllm`). 
- -## A.1 Upstream llama.cpp: recurrent-cache f32 is HARDCODED (no f16/bf16 path), not a documented numeric guard - -The asymmetry to override: the attention KV cache type is user-tunable; the recurrent state cache is not. - -- KV cache: `llama_context_params.type_k/type_v` default `GGML_TYPE_F16` - (`src/llama-context.cpp:3468-3469`), `[EXPERIMENTAL]` in `include/llama.h:365-366`, plumbed from - user params (`attn_type_k = params.type_k`). -- Recurrent/SSM cache: `llama_memory_recurrent(... type_r, type_s ...)` and the hybrid wrappers take - the recurrent types as ctor args, but EVERY call site in `src/llama-model.cpp` passes the literal - `GGML_TYPE_F32` (2056-2057 pure-recurrent; 2098-2099 hybrid-iswa `recurrent_type_r/s`; - 2117-2118 hybrid `recurrent_type_k/v`). No cparams field feeds these - compile-time constants. - So mamba/mamba2/rwkv/falcon-h1/nemotron-h/qwen3.5 ALL get f32 recurrent + conv state unconditionally. -- Alloc: `r = ggml_new_tensor_2d(ctx, type_r, ...)`, `s = ggml_new_tensor_2d(ctx, type_s, ...)` - (`src/llama-memory-recurrent.cpp:100-101`). No f16 branch anywhere. - -Is f32 a deliberate numeric constraint? Structural, not documented: -- `ggml_ssm_conv` / `ggml_ssm_conv_update_inplace` HARD-ASSERT f32 on conv state/kernel/x_cur/dst - plus `nb[0]==sizeof(float)` (`ggml/src/ggml.c:5581-5584,5589,5597`). Conv path is f32-locked at the - builder. -- `ggml_ssm_scan` does NOT assert input state `s` dtype, but hardcodes its OUTPUT as - `GGML_TYPE_F32` (`ggml/src/ggml.c:5662`); scan kernels read `s` as `float *`. -- `ggml/src/ggml-cuda/gated_delta_net.cu` takes `const float * curr_state`, `float * state`, - `float * state_dst`; the per-(seq,head) shard `float s_shard[rows_per_lane]` is loaded/stored as raw - float (34-102). Same in `ggml-cpu/ops.cpp`. -- NO code comment anywhere justifies "f32 for precision". The constraint is that the ops were written - float-only. 
=> recurrent-cache-f32 is a hardcoded implementation default to override deliberately: - the 3 literal `GGML_TYPE_F32` call-site pairs (gate behind `type_s`/`type_r` per §3), the - gated_delta_net.cu load/store convert, and KEEP conv f32 unless its asserts are extended (conv bytes - are negligible - only the temporal `type_s` state needs bf16). - -## A.2 vLLM: GDN temporal state cache is bf16 BY DEFAULT, fp32-accumulated in-kernel (the exact design) - -- Dtype: `qwen3_next.py:780-787` -> `MambaStateDtypeCalculator.gated_delta_net_state_dtype` -> - `_mamba_state_dtype` (`mamba_utils.py:84-96`): - `conv_state_dtype = get_kv_cache_torch_dtype(mamba_cache_dtype, model_dtype)`; - `if mamba_ssm_cache_dtype == "auto": temporal_state_dtype = conv_state_dtype`. - With both knobs default `"auto"`, `get_kv_cache_torch_dtype("auto", model_dtype)` returns - `model_dtype` (`torch_utils.py:293-297`) = bf16 for Qwen3-Next => BOTH conv and temporal state are - bf16 by default. Explicit opt-out: `--mamba-ssm-cache-dtype float32` (mirror of our f32 fallback). -- In-kernel numerics (decode), `fla/ops/fused_recurrent.py`: - `b_h = tl.load(p_h0).to(tl.float32)` (303) load bf16->fp32; q/k/v/g/beta `.to(tl.float32)` (309-318); - recurrence in fp32 `b_h*=exp(g); b_v-=sum(b_h*b_k); b_v*=beta; b_h+=b_v*b_k; b_o=sum(b_h*b_q)` - (327-331); `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))` (337) store fp32->bf16. Prefill chunk path - identical (`b_h=tl.zeros(...,tl.float32)`, `+= load().to(fp32)`, 102/120). - => byte-for-byte the proposed llama lever: load bf16->f32, math in f32 (UNCHANGED order, matches - gated_delta_net.cu's v-g*kv -> *beta -> S-update -> S^T q), store f32->bf16; only the persisted cache - crosses the bf16 boundary, once per step. -- vLLM numeric guards: NONE beyond fp32 accumulation - no per-step renorm, no clamp, no Kahan. Optional - `use_qk_l2norm_in_kernel` normalizes q,k (keeps k unit-norm) but does not touch the state. 
-- KDA nuance: `kda_state_dtype` returns `(state_dtype, torch.float32)` - Kimi Delta Attention keeps a - fp32 secondary component. qwen3.5 is `gated_delta_net` (fully-bf16 temporal state), but this shows - vLLM judged a fp32 component necessary for one delta variant -> reinforces keeping the f32 toggle. - -Verdict: vLLM's own GDN state cache is bf16, so bf16-state in llama is a FAIR equal-precision target, -not a regression vs the competitor. bf16 brings llama TO vLLM's precision. - -## A.3 Numeric-safety assessment for bf16 gated-DeltaNet state - -Update: `S <- S*diag(exp(g)) + beta * k (x) (v - S k)`, with -`g = -exp(A_log)*softplus(a+dt_bias) <= 0` so `exp(g) in (0,1]` (strict geometric decay) and -`beta = sigmoid(.) in (0,1)`. - -- Decay bounds error accumulation. bf16 = 8 mantissa bits -> per-element rel rounding - `eps ~= 2^-8 ~= 3.9e-3`. An error injected at step t is multiplied by `exp(g)<1` every later step -> - carry-error is a CONTRACTING geometric series bounded by ~`eps/(1-exp(g_mean))`, a small constant - multiple of one step's eps, NOT linear/unbounded. The recurrence is a contraction map - no - divergence. (The "per-step renorm" framing is not a literal renorm op in either codebase; the bound - IS the `g<1` contraction + `beta in (0,1)` + unit-norm k from the l2norm capping `||k (x) delta||`.) -- fp32 register accumulation is the minimal-error placement: load bf16->f32, do `S k`, `v-g*kv`, - `*beta`, the outer-product accumulate and `S^T q` ALL in fp32 (UNCHANGED math), store f32->bf16 once. - Identical to vLLM, which ships this as the Qwen3-Next default with no reported quality regression - - the strongest empirical safety evidence. -- Dominant risk is small KL/PPL drift, not instability. Gate KL<1e-3 + PPL-delta over 256-tok greedy - vs the f32 build; fall back to f32 via the §3c toggle if it fails. Keep conv state f32 (ssm_conv* is - f32-locked, conv bytes negligible) - no reason to risk it. 
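A toy scalar sketch of the contraction argument in A.3, not the GDN kernel. It runs `s <- decay*s + input` twice, once with the state kept in f32 and once with the state rounded to bf16 after every step (arithmetic in f32 both times), and reports the largest gap between the two trajectories. `round_to_bf16` and `max_state_gap` are hypothetical helper names; the scalar model and the constant input are assumptions chosen only to make the `1/(1-exp(g))` dependence visible.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Round an f32 value to bf16 (round-to-nearest-even) and widen it back to f32.
// Assumes a finite input; NaN/Inf handling is omitted in this sketch.
static float round_to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);  // nearest-even on the 16 dropped bits
    bits &= 0xFFFF0000u;                    // keep sign, exponent, 7 mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof y);
    return y;
}

// Run s <- decay * s + input for `steps` steps with an f32 state and with a
// bf16-rounded state. Returns the largest absolute gap seen between the two.
static float max_state_gap(float decay, float input, int steps) {
    float s_f32  = 0.0f;
    float s_bf16 = 0.0f;
    float gap    = 0.0f;
    for (int t = 0; t < steps; ++t) {
        s_f32  = decay * s_f32 + input;                  // reference: state stays f32
        s_bf16 = round_to_bf16(decay * s_bf16 + input);  // f32 math, bf16 persisted state
        gap    = std::fmax(gap, std::fabs(s_f32 - s_bf16));
    }
    return gap;
}
```

In this toy model a fast-decaying state (`decay = 0.5`) keeps the gap on the order of one bf16 rounding step, while a slow-decaying state (`decay` close to 1) lets it grow with the `1/(1-decay)` factor. That is why the bound is only as tight as the slowest-decaying head, and why the KL gate rather than the bound should be the deciding check.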
- -Bottom line: (1) upstream recurrent-cache f32 is a hardcoded implementation default (conv asserts f32; -scan/gdn kernels float-only; no numeric-rationale comments) - override via §3's `type_s`/`type_r` -plumbing, bf16-default + f32 opt-out, touching only the temporal state. (2) vLLM's GDN temporal state -is bf16 by default (auto->model_dtype), fp32-accumulated, with `--mamba-ssm-cache-dtype float32` -opt-out - a fair equal-precision target. (3) bf16 GDN state is numerically safe: g<1 decay contracts -rounding to a bounded geometric series, fp32 registers confine bf16 rounding to one per-step cache -write, and vLLM ships this exact config in production. KL<1e-3 / PPL gate + f32 fallback is the right -safety net. - ---- - -# PART B - label `bf16-kernel-plumbing` (the kernel/op edits §6 defers) - -Part A wires the cache DTYPE selection (cparams -> memory_params -> `s_l`/`r_l` alloc). Part B is the -consuming half: every kernel/op that reads or writes those caches, and the exact -load->f32->compute(f32, UNCHANGED)->store->bf16 conversion points. Traced against HEAD-after-0021 on -`dgx.casa:~/llama-paged-dev` (branch `paged`). - -## B.1 Complete set of state-cache READERS/WRITERS (one op family only) -`s_l` (ssm_states_all) reaches compute through exactly ONE op family - the gated-DeltaNet recurrence - -via a strided VIEW from `build_rs` (graph base) that carries the cache dtype. The cache-touching srcs: -- `src[5]` `src_state` - the s0 read view (the cache, or the 0019 gather scratch). -- `src[6]` `src_state_dst` - the 0018 in-place write-back target (a view INTO the cache). -- `src[7]` `ids` - I32 seq map for the 0019 gather (no dtype concern). -No other op reads `s_l`. `build_rs` only re-strides (dtype rides through); the 0019 -`gdn_gather_nonident_kernel` is the only other reader. So bf16 awareness localizes to: the 3 ggml.c -builders (asserts), cuda `supports_op`, `gated_delta_net.cu`, and the CPU mirror in `ops.cpp`. 
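The load->f32->compute->store contract that Part B threads through every cache reader/writer can be sketched on the host as follows. This is a minimal illustration under stated assumptions: `bf16_t`, `load_state`, `store_state` and `recurrence_step` are hypothetical names (not `ggml_bf16_t`, `nv_bfloat16` or any ggml function), and the update `s <- decay*s + update` is a stand-in for the real gated-DeltaNet math.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 16-bit storage element holding the top half of an f32.
struct bf16_t { uint16_t bits; };

static float bf16_to_f32(bf16_t h) {
    uint32_t bits = static_cast<uint32_t>(h.bits) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Round-to-nearest-even narrowing; finite inputs only in this sketch.
static bf16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return bf16_t{static_cast<uint16_t>(bits >> 16)};
}

// The only dtype-aware code: one load and one store per cache element type.
static float load_state(const float  * p, size_t i) { return p[i]; }
static float load_state(const bf16_t * p, size_t i) { return bf16_to_f32(p[i]); }
static void  store_state(float  * p, size_t i, float v) { p[i] = v; }
static void  store_state(bf16_t * p, size_t i, float v) { p[i] = f32_to_bf16(v); }

// One toy recurrence step over a state row. The arithmetic is identical for
// both cache types; only the load and the store differ.
template <typename StateT>
static void recurrence_step(StateT * cache, const float * update, float decay, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float s = load_state(cache, i);  // cache dtype -> f32 register
        s = decay * s + update[i];       // f32 math, unchanged
        store_state(cache, i, s);        // f32 register -> cache dtype
    }
}
```

Keeping the conversion inside the two overloaded helpers means the f32 instantiation is the original code path, which is the property the plan relies on for the f32 opt-out.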
- -## B.2 ggml.c builder asserts (relax F32-only -> {F32,BF16}) -File `ggml/src/ggml.c`: -- `ggml_gated_delta_net` (6287): line 6308 `GGML_ASSERT(state->type == GGML_TYPE_F32)` -> - `... == GGML_TYPE_F32 || ... == GGML_TYPE_BF16`. -- `ggml_gated_delta_net_inplace` (6349): same `state` assert (~6366-6370) + any `src_state_dst` - type assert -> allow BF16. -- `ggml_gated_delta_net_inplace_ids` (6417): same `state` + `src_state_dst` relax. -- KEEP the op OUTPUT scratch f32: line 6327 `ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne)` stays. The - `[attn_scores | new_states]` output is a TRANSIENT graph tensor; the bf16 persisted write goes - through `src_state_dst`/`state` (in-place). The non-in-place fallback `cpy`s scratch->cache and - `ggml_cpy` already type-converts f32->bf16. - -## B.3 CUDA supports_op -`ggml/src/ggml-cuda/ggml-cuda.cu`, `supports_op` case `GGML_OP_GATED_DELTA_NET` (3096): allow a BF16 -`src[5]`/`src[6]` (add BF16 to the permitted state-src types). - -## B.4 CUDA recurrence kernel `ggml/src/ggml-cuda/gated_delta_net.cu` -Template the kernel + gather + launch on the CACHE-pointer dtype (`bool STATE_BF16`); keep f32 valid so -the f32 opt-out is the SAME kernel. Include `<cuda_bf16.h>`; convert with `__bfloat162float` / -`__float2bfloat16`. ALL recurrence math (lines 106-200) stays in f32 registers, byte-for-byte UNCHANGED. -- Signatures: line 39 `const float * curr_state` -> `const STATE_T * curr_state`; line 57 - `float * state_dst` -> `STATE_T * state_dst`; `read_state` (85-88) -> `const STATE_T * read_state`. -- LOAD (s0 -> f32 regs), lines 100-103: - `if constexpr (STATE_BF16) s_shard[r]=__bfloat162float(read_state[i]); else s_shard[r]=read_state[i];` - `s_shard` stays `float`. -- STORE-BACK (f32 regs -> bf16 cache): - - non-keep final write (203-208): `state[col*S_v+i] = STATE_BF16 ? 
__float2bfloat16(s_shard[r]) : s_shard[r];` - - keep_rs_t snapshot (191-200) targets `dst + attn_score_elems` = the f32 OUTPUT scratch (kept f32 - per B.2); this is the prefill/rollback path (n_rs_seq>0), decode is `!keep_rs_t`. KEEP it f32. - Only the CACHE pointers (`curr_state` src[5], `state_dst` src[6]) are STATE_T. -- 0019 gather `gdn_gather_nonident_kernel` (7-30): `const float * cache` -> `const STATE_T * cache`; - `dst[i] = STATE_BF16 ? __bfloat162float(src[i]) : src[i];`. Keep `scratch` OUTPUT f32 (pool alloc - 326-333 stays `ggml_cuda_pool_alloc`) so the non-identity read path feeds f32; the identity - in-place path reads bf16 directly. `read_state`'s dtype follows the branch that selected it. -- Dispatcher (270-353): - - casts 299/323 `(const float *)src_state->data`, 312 `(float *)src_state_dst->data` -> - `(const nv_bfloat16 *)` / `(nv_bfloat16 *)` when `type == GGML_TYPE_BF16`; branch launch on type. - - asserts 309-311: `src_state_dst->type == GGML_TYPE_F32` -> allow BF16; `nb[0] == sizeof(float)` -> - `== ggml_type_size(type)`; `nb[1] == S_v*S_v*H*sizeof(float)` -> `... * ggml_type_size(type)`. - - q/k/v/g/beta strides (348-353) are ACTIVATION (f32) strides - UNCHANGED. Kernel indexes state by - ELEMENT (`col*S_v+i`), so the typed pointer halves the byte stride implicitly. - - `launch_gated_delta_net` (212-) + S_v switch (230-260): thread `STATE_BF16` into the - `gated_delta_net_cuda` instantiations. - -## B.5 CPU reference `ggml/src/ggml-cpu/ops.cpp` (parity / CI / CPU-offload fallback) -`ggml_compute_forward_gated_delta_net_one_chunk` (10662) + `_f32` (10847), dispatch (10915): -- LOAD: 10726 `const float * state_in_base = (const float *)src_state->data`, the rs_head/gather read - 10744-10745, and 10891 `const float * cache = (const float *)src_state->data` -> when - `src_state->type == GGML_TYPE_BF16`, read `GGML_BF16_TO_FP32(((const ggml_bf16_t*)..)[..])`. 
-- STORE: 10758-10762 `inplace_state_base = (float *)src_state_dst->data` -> store - `((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`; relax asserts `nb[0]`/`nb[1]` - to `ggml_type_size(type)`. Keep ONE impl, branch load/store on `src_state->type`. - -## B.6 Conv state (`r_l`) -> bf16 : DEFER (optional, low-value, prefill snag) -Conv state ~12.6 MB total, LAUNCH-bound (0021 removed concat/cpy); bf16 saves ~0 ms, adds complexity: -- DECODE (0021 fused) `ggml_ssm_conv_update_inplace` (ggml.c:5566) asserts 5581-5584 - `conv_states/conv_state_dst->type == F32`; CUDA `ssm_conv_update_f32` (ssm-conv.cu:131) + CPU - `ggml_compute_forward_ssm_conv_update_f32` (ops.cpp:9471) read/write f32. To bf16: relax the 2 - asserts, template tap LOAD (`__bfloat162float`) + ring write-back STORE (`__float2bfloat16`), cast - `conv_states`/`conv_state_dst` ptrs in both dispatchers. -- PREFILL (non-fused) `build_conv_state` (delta-net-base.cpp:449-524): `conv_states=build_rs(...)` - (bf16 view) then `ggml_concat(conv_states, qkv_mixed, 0)` (472). **`ggml_concat` requires same type** - - qkv_mixed is f32 -> bf16 conv cache BREAKS the prefill concat (needs an f32 staging view of the - taps first; the ring write-back `ggml_cpy` at 496/520 already converts; concat is the blocker). -RECOMMENDATION: keep `type_r` = F32 in v1 (matches Part A §6). Ship `type_s`=BF16 first; `type_r`=BF16 -is a follow-up that adds the f32 staging view. - -## B.7 Confirm UNTOUCHED: full-attn KV-cache (16 layers) + FP4 weights -- KV-cache: the `llama_kv_cache` half of `llama_memory_hybrid`, alloc with `params.type_k/type_v` - (llama-model.cpp 2030-2031 / 2089-2090 / 2108-2109). Part A changes ONLY the recurrent half's - `type_s`; `attn_type_k`/`attn_type_v` untouched. Paged-KV gather (0003-0011), flash-attn, - `type_k()/type_v()` accessors (kv-cache.h 161-162/381-382) unaffected. 
-- FP4 weights (nvfp4 dense + MoE): model weights, separate from runtime state caches; recurrence/conv - kernels read STATE not weights. FP4 GEMM (0017/0020) untouched. -- Activations (q/k/v/g/beta, attn-out, z) stay f32 (<1% of bytes). Only persisted `s_l` rows narrow. - -## B.8 Conversion-point cheat-sheet (the ONLY numeric-precision boundaries) -1. CUDA load `gated_delta_net.cu` ~102: `s_shard[r] = __bfloat162float(read_state[i])`. -2. CUDA store ~207: `state[col*S_v+i] = __float2bfloat16(s_shard[r])`. -3. CUDA gather ~20: `dst[i] = __bfloat162float(src[i])` (bf16 cache -> f32 scratch). -4. CPU load `ops.cpp` ~10726/10744/10891: `GGML_BF16_TO_FP32(((ggml_bf16_t*)src_state->data)[..])`. -5. CPU store ~10762: `((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`. -Everything between (1)/(4) and (2)/(5) is f32-register math, identical to today's f32 kernel. Only the -persisted cache rounds to bf16 once per step; g<1 geometric decay bounds the rounding. - -## B.9 File-by-file edit table (Part B) -| File | Edit | -|---|---| -| `ggml/src/ggml.c` | relax `state`/`src_state_dst` F32 asserts -> allow BF16 in the 3 GDN builders (6308, ~6370, ~6430); keep output scratch F32 (6327) | -| `ggml/src/ggml-cuda/ggml-cuda.cu` | `supports_op` GATED_DELTA_NET (3096): allow BF16 state src | -| `ggml/src/ggml-cuda/gated_delta_net.cu` | template kernel+gather+launch on STATE_BF16; `__bfloat162float` load / `__float2bfloat16` store; cast src_state/src_state_dst ptrs; relax dispatcher asserts (309-311) to `ggml_type_size(type)`; keep gather scratch + keep_rs snapshot f32 | -| `ggml/src/ggml-cpu/ops.cpp` | bf16 load/store branch in GDN ref (10726/10744/10758-10762/10891); relax asserts | -| `tests/test-backend-ops.cpp` | add BF16-state GATED_DELTA_NET case (CUDA bf16 vs CPU bf16) | -| (deferred) conv: `ggml.c:5581-84`, `ssm-conv.cu:131`, `ops.cpp:9471`, `delta-net-base.cpp:472` | v2 only - f32 staging before prefill concat | - -Assisted-by: Claude:opus-4.8 [Claude Code] 
diff --git a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PROGRESS.md b/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PROGRESS.md deleted file mode 100644 index 97adbc55a..000000000 --- a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PROGRESS.md +++ /dev/null @@ -1,37 +0,0 @@ -# bf16 SSM state - build/de-risk progress - -DECISION (user override of plan): f32 DEFAULT + bf16 OPT-IN. type_s default = GGML_TYPE_F32. -Conv state (type_r) stays F32. Recurrence math stays f32 (load->f32, store->cache dtype). - -## STEP 1 (dtype-generic kernel + op) - DONE + DE-RISK GATE 1 PASSED -Files (DGX ~/llama-paged-dev): -- ggml/src/ggml.c: 3 GDN builder asserts F32 -> {F32,BF16}; state_dst nb[0] -> ggml_type_size. -- ggml/src/ggml-cuda/gated_delta_net.cu: gdn_state_t alias; gather + recurrence kernel + - launchers templated on STATE_BF16; __bfloat162float load / __float2bfloat16 store; gather scratch - shares cache dtype (uniform read); dispatcher detects src_state->type, GDN_DISPATCH macro 8-way. -- ggml/src/ggml-cpu/ops.cpp: byte-based read base + read_bf16 load conversion; bf16 in-place - convert-store after token loop; bf16 gather widening; relaxed asserts to ggml_type_size. -- ggml/src/ggml-cpu/ggml-cpu.c: work-size +S_v*S_v for bf16 in-place. -- tests/test-backend-ops.cpp: state_type field on test_gated_delta_net; 16 bf16 cases (hs 64+128 x - decode/prefill/keep_rs x kda). -GATE 1: build clean (EXIT=0); test-backend-ops -o GATED_DELTA_NET = 52/52 OK (CUDA bf16 vs CPU bf16). - -## STEP 2/3/4 (cparams opt-in wiring) - IN PROGRESS -f32 DEFAULT everywhere; --cache-type-ssm bf16 opts in. - -## STEP 2/3/4 (cparams opt-in) - DONE -- llama.h/llama-context.cpp/llama-memory.h/llama-model.cpp: type_r/type_s plumbed, DEFAULT F32. -- common.h/common.cpp/arg.cpp: cache_type_ssm/conv (F32 default) + --cache-type-ssm/-conv CLI. -- llama-memory-recurrent.cpp: convert-on-mismatch f32<->bf16 (r and s) via ggml_*_row API. 
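A minimal sketch of the convert-on-mismatch restore described in the last bullet, assuming a saved row is a flat byte buffer tagged with its element type. `StateType`, `restore_row` and the two bit helpers are hypothetical; the notes say the real change goes through the `ggml_*_row` conversion API, which this sketch does not call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical type tags; these are not ggml_type values.
enum class StateType : int32_t { F32 = 0, BF16 = 1 };

static size_t elem_size(StateType t) { return t == StateType::F32 ? 4 : 2; }

static float bf16_bits_to_f32(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Round-to-nearest-even narrowing; finite inputs only in this sketch.
static uint16_t f32_to_bf16_bits(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>(bits >> 16);
}

// Restore `n` elements saved as `saved` into a row of type `current`.
// A size mismatch still fails; an F32/BF16 type mismatch is converted instead.
static bool restore_row(const std::vector<uint8_t> & src, StateType saved,
                        std::vector<uint8_t> & dst, StateType current, size_t n) {
    if (src.size() != n * elem_size(saved)) {
        return false;
    }
    dst.resize(n * elem_size(current));
    if (n == 0) {
        return true;
    }
    if (saved == current) {
        std::memcpy(dst.data(), src.data(), src.size());
        return true;
    }
    for (size_t i = 0; i < n; ++i) {
        if (saved == StateType::F32) {  // f32 file -> bf16 cache (rounds)
            float f;
            std::memcpy(&f, src.data() + 4 * i, 4);
            const uint16_t h = f32_to_bf16_bits(f);
            std::memcpy(dst.data() + 2 * i, &h, 2);
        } else {                        // bf16 file -> f32 cache (exact widening)
            uint16_t h;
            std::memcpy(&h, src.data() + 2 * i, 2);
            const float f = bf16_bits_to_f32(h);
            std::memcpy(dst.data() + 4 * i, &f, 4);
        }
    }
    return true;
}
```

Widening bf16 to f32 is exact, so a bf16-saved session loses nothing on restore; narrowing f32 to bf16 rounds once, the same rounding a bf16 cache would have applied on its next store.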
- -## EXTRA FIX (plan B.1 miss): build_rs rs_zero clear used ggml_scale (f32-only) -> bf16 abort. -- llama-graph.cpp: f32 keeps ggml_scale_inplace (bit-exact); non-f32 uses ggml_fill_inplace. -- fill.cu + ops.cpp + ggml.c: added BF16 to ggml_fill. get_rows/cpy already bf16-capable. - -## DE-RISK GATE - ALL PASS -- build clean EXIT=0 (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench). -- test-backend-ops -o GATED_DELTA_NET = 52/52 (16 bf16 cases: decode/prefill/keep_rs x kda x hs64/128). -- f32 default md5: dense 5951a5b4... MoE 07db32c2... == 0023 (non-invasive; also --cache-type-ssm f32 matches). -- bf16 opt-in: coherent "Paris", no crash; byte-identical to f32 on 48-tok sample (Same-top-p 100%). -- diff backup: ~/llama-paged-dev/BF16_SSM_STATE.diff (1003 lines, 15 files). NOT committed/pushed. -READY FOR C.2 KL GATE (GateBench). diff --git a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_RESULTS.md deleted file mode 100644 index eb1473108..000000000 --- a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_RESULTS.md +++ /dev/null @@ -1,203 +0,0 @@ -# bf16 SSM-state cache - BUILD + DE-RISK RESULTS - -Label: bf16-build-derisk (the GPU build agent). Lands on top of patch 0023 (HEAD f7409c2) on the DGX -dev tree `~/llama-paged-dev` (branch `paged`). Status: **DE-RISK GATE PASSED, READY FOR THE C.2 KL -GATE (GateBench).** Work is built into `build-cuda` and saved as `~/llama-paged-dev/BF16_SSM_STATE.diff` -(uncommitted on the dev tree; the 0024 ship/shelve decision is gated on GateBench's KL results). - -## DECISION applied (user override of the plan): f32 DEFAULT + bf16 OPT-IN -The plan defaulted bf16; the user wants f32 to stay the bit-exact DEFAULT and bf16 to be opt-IN via -`--cache-type-ssm bf16`. So `type_s` default = `GGML_TYPE_F32`, `type_r` default = `GGML_TYPE_F32` -(conv stays f32 always, per plan C.0). 
Only the persisted RECURRENT (temporal) state narrows to bf16 -when opted in; recurrence math stays f32 (load->f32, compute f32, store->cache dtype). The opt-in is -non-invasive: with no flag the output is byte-identical to 0023. - -## Files changed (15; full diff = ~/llama-paged-dev/BF16_SSM_STATE.diff, 1003 lines) - -STEP 1 - dtype-generic kernel + op (the de-risk core): -- `ggml/src/ggml.c` - 3 GDN builder `state`/`state_dst` asserts F32 -> {F32,BF16}; `state_dst->nb[0]` - `sizeof(float)` -> `ggml_type_size(state_dst->type)`. Also relaxed the `ggml_fill` builder assert to - allow BF16 (needed by the rs_zero clear; see below). -- `ggml/src/ggml-cuda/gated_delta_net.cu` - `gdn_state_t` alias (`nv_bfloat16`/`float`); - recurrence kernel + gather kernel + both launchers + the dispatcher templated on `STATE_BF16`. - LOAD `__bfloat162float`, STORE `__float2bfloat16`; the gather scratch is allocated at the CACHE - dtype so `read_state` is a single uniform dtype (no mixed-dtype read path - eliminates the plan-R2 - landmine). The keep_rs snapshot + the non-in-place final write stay f32 (op output scratch); the - bf16 store happens ONLY on the in-place cache path. `supports_op` already returned `true` - unconditionally for GATED_DELTA_NET, so no change there. -- `ggml/src/ggml-cpu/ops.cpp` - byte-based prior-state read base + `read_bf16` load conversion - (`GGML_BF16_TO_FP32`); bf16 in-place convert-store after the per-(head,seq) token loop - (`GGML_FP32_TO_BF16`); bf16-widening non-identity gather; relaxed `nb[]` asserts to - `ggml_type_size`. Added a `ggml_compute_forward_fill_bf16` + dispatch case. -- `ggml/src/ggml-cuda/fill.cu` - BF16 case in the fill kernel switch. -- `ggml/src/ggml-cpu/ggml-cpu.c` - GDN work-size adds the extra `S_v*S_v` f32 buffer when the cache is - bf16 in-place (mirror of `need_work` in ops.cpp). 
-- `tests/test-backend-ops.cpp` - `state_type` field on `test_gated_delta_net`; 16 bf16-state cases - (head_size 64 + 128 x {decode, multi-token prefill 33/64/100, keep_rs_t K=4} x kda 0/1, n_seqs 1/2). - -STEP 2/3/4 - cparams opt-in wiring (f32 DEFAULT): -- `include/llama.h` - `type_r`/`type_s` in `llama_context_params` (adjacent to type_k/type_v). -- `src/llama-context.cpp` - default-params `type_r = type_s = GGML_TYPE_F32`; `params_mem` passes them. -- `src/llama-memory.h` - `type_r`/`type_s` in `llama_memory_params`. -- `src/llama-model.cpp` - the 3 hardcoded `GGML_TYPE_F32` recurrent ctor pairs (recurrent / - hybrid_iswa / hybrid = the qwen35/qwen35moe path) now pass `params.type_r` / `params.type_s`. -- `src/llama-memory-recurrent.cpp` - back-compat: `state_read_data` converts f32<->bf16 on type - mismatch (helper `recurrent_read_convert_rows` via the public `ggml_bf16_to_fp32_row` / - `ggml_fp32_to_bf16_row`) instead of failing, for both r and s; lets an f32-saved session restore - into a bf16 cache and vice versa. -- `src/llama-graph.cpp` - `build_rs` rs_zero clear: f32 keeps the exact `ggml_scale_inplace(.,0)` op - (bit-exactness); bf16 (and any non-f32) state uses `ggml_fill_inplace(.,0)` (CUDA scale is f32-only; - this was the one extra state-touching op the plan's "one op family" claim missed). get_rows + cpy - on the extra-states path already support bf16, so no change needed there. -- `common/common.h` / `common/common.cpp` / `common/arg.cpp` - `cache_type_ssm` / `cache_type_conv` - (default F32) + `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv` CLI (reusing the - existing `kv_cache_type_from_str`, which already maps `f32`/`bf16`). - -## DE-RISK GATE - ALL PASS - -1. **Build clean** (build-cuda, CUDA arch 121): EXIT=0 for ggml/ggml-cuda/ggml-cpu/llama/llama-common - and the binaries (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench). -2. 
**test-backend-ops -o GATED_DELTA_NET = 52/52 PASS** (CUDA backend vs CPU reference). Includes all - 16 new bf16-state cases (CUDA bf16 vs CPU bf16) covering decode (n_tokens==1), multi-token - prefill/chunk (33/64/100), and keep_rs_t (K=4), with kda on/off and head_size 64 + 128 (production - S_v). The bf16 op test is the deterministic R2 de-risk for the load/compute/store contract. -3. **f32-default md5 == 0023 baseline (opt-in is non-invasive):** - - dense q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == 0023 (no flag AND `--cache-type-ssm f32`) - - MoE q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == 0023 - Command: `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`. -4. **bf16 opt-in coherence + engaged (dense, `--cache-type-ssm bf16`):** no crash; coherent + on-topic. - - 48-tok greedy ("The capital of France is"): "**Paris**." - byte-identical to f32 (md5 5951a5b4...), - i.e. Same-top-p = 100% over that short sample (the g<1 decay bounds the per-step rounding so the - argmax trajectory is unchanged at short length). - - 256-tok greedy ("Explain how a transformer LM generates text, step by step"): fluent, well-structured - step-by-step explanation, and the bf16 md5 (`fc82b4cd44f8ec999c3b6843eb3f8c61`) **DIVERGES** from - f32 (`554cc667a2e62a47c34a999a127ac7e5`) - definitive proof that bf16 is genuinely ENGAGED (not a - silent f32 fallback) and behaves as expected (non-bit-exact, coherent). The per-token divergence - is exactly what the C.2 teacher-forced KL gate quantifies. - - Independent proof bf16 is allocated: BEFORE the build_rs fill fix, decode aborted in - `ggml_cuda_op_scale` on the recurrent-state tensor - an f32 cache would never have reached that - bf16-only failure, so the opt-in demonstrably allocates bf16. Wiring is also directly traceable: - `--cache-type-ssm bf16` -> cache_type_ssm -> cparams.type_s -> params_mem.type_s -> the - llama_memory_hybrid recurrent `s_l` alloc. 
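The wiring chain traced in item 4 can be illustrated with plain structs. Everything here is a hypothetical stand-in: `CacheType` replaces `ggml_type`, the three structs only mirror the field names mentioned above, and `cache_type_from_str` is a reduced parser that accepts just `f32` and `bf16` (the notes say the real CLI reuses the existing `kv_cache_type_from_str`).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class CacheType { F32, BF16 };  // hypothetical stand-in for ggml_type

// Reduced parser for this sketch only; rejects anything but f32/bf16.
static CacheType cache_type_from_str(const std::string & s) {
    if (s == "f32")  return CacheType::F32;
    if (s == "bf16") return CacheType::BF16;
    throw std::invalid_argument("unsupported cache type: " + s);
}

// Each layer defaults to F32, matching the f32-default / bf16-opt-in decision.
struct CommonParams  { CacheType cache_type_ssm = CacheType::F32; };  // CLI layer
struct ContextParams { CacheType type_s         = CacheType::F32; };  // cparams
struct MemoryParams  { CacheType type_s         = CacheType::F32; };  // params_mem

// An empty string models "flag not given": the F32 default flows through untouched.
static MemoryParams resolve_ssm_type(const std::string & flag_value) {
    CommonParams common;
    if (!flag_value.empty()) {
        common.cache_type_ssm = cache_type_from_str(flag_value);
    }
    ContextParams cparams;
    cparams.type_s = common.cache_type_ssm;  // common -> cparams
    MemoryParams mem;
    mem.type_s = cparams.type_s;             // cparams -> params_mem
    return mem;
}
```

Because every layer carries the same default, omitting the flag and passing `f32` explicitly resolve to the same allocation type, which is the behaviour the md5 check in item 3 relies on.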
- -CONFIRM: ready for the C.2 KL-divergence + PPL-delta + long-context drift gate (GateBench). - -## A landmine fixed beyond the plan (record for the gate/ship agents) -The plan B.1 asserted `s_l` reaches compute through ONLY the gated-DeltaNet op. It also flows through -`build_rs`: (a) the rs_zero restart-slot clear used `ggml_scale_inplace(.,0)`, and `ggml_cuda_op_scale` -hard-asserts f32 -> the first bf16 decode aborted in scale. Fixed by routing the bf16 clear through -`ggml_fill` (with a new bf16 fill path). (b) the extra-states `ggml_get_rows` + `ggml_cpy` already -support bf16 (verified) - no change. This is exactly the kind of non-decode state path the de-risk -was meant to surface; it is now covered end-to-end (the bf16 coherence run exercises rs_zero on the -fresh-sequence prompt). - -## NOT done in this phase (next agents) -- STEP 5 LocalAI gRPC/YAML (`CacheTypeSSM`/`CacheTypeConv` proto + grpc-server + model_config + - options + meta registry) - needed to force f32/bf16 from a gallery YAML; not on the de-risk gate. -- STEP 6 capability fallback (device-match probe to demote bf16->f32 before alloc on a device lacking - the bf16 GDN/fill path, e.g. CPU-offloaded GDN). The CPU reference DOES implement bf16 (load/store/ - gather/fill) so a CPU fallback is numerically correct today, but the probe is the clean guard. -- The C.2 KL/PPL/long-context gate + the C.3 nsys per-call bench - GateBench (GPU gate agent, runs - sequentially after this build phase; binaries are pre-built in build-cuda). - -Assisted-by: Claude:opus-4.8 [Claude Code] - ---- - -# C.2/C.3 ACCEPTANCE GATE + PARITY BENCH RESULTS (label bf16-gate-bench) - -Status: **GATE FAILS -> NO-SHIP. KEEP SHELVED. patch 0024 NOT created; nothing committed.** -All runs on `dgx.casa` build-cuda binaries, wikitext-2-raw test, `-ngl 99 -fa on --seed 1`. -Corpus: `~/bench/klgate/wikitext-2-raw/wiki.test.raw` (symlink to `~/wikitext-2-raw`, ~280k tokens). - -## 1. 
KL acceptance gate - -### Noise floor (f32-vs-f32, c256 chunks32) - the non-determinism floor -| model | Mean KLD | Max KLD | Same-top-p | ln(PPL(Q)/PPL(base)) | -|---|---|---|---|---| -| dense q27 | -1.3e-5 | 1e-6 | 100.000% | +0.001256 | -| MoE q35 | ~0 (-3e-7) | 5.9e-5 | 100.000% | +0.000607 | - -### Headline 256-token gate (bf16-vs-f32, c256 chunks32) - PASSES, but vacuously -bf16 c256 is **byte-identical to the floor** for both models (Mean KLD -1.3e-5 dense / ~0 MoE, -Same-top-p 100%, identical PPL). Reason: a single 256-token window is processed in ONE ubatch -(ub512 > 256), so the recurrent state is written to the bf16 cache only ONCE at the chunk end and is -NEVER read back to produce that window's logits. The 256-token gate therefore does NOT exercise the -bf16 round-trip at all - it is blind to the actual cost. - -### Long-context drift sweep (bf16-vs-f32, chunks8) - FAILS HARD for BOTH models -| model | ctx | Mean KLD | Same-top-p | Max KLD | 99.9% KLD | -|---|---|---|---|---|---| -| dense | 256 | -1.3e-5 | 100.000% | 1e-6 | 0 | -| dense | 1024 | 0.0647 | 91.54% | 20.17 | 7.69 | -| dense | 2048 | 0.1739 | 90.65% | 24.89 | 18.18 | -| dense | 4096 | 0.1258 | 90.40% | 26.03 | 17.22 | -| MoE | 256 | ~0 | 100.000% | 5.6e-5 | 4.9e-5 | -| MoE | 1024 | 0.0472 | 90.04% | 5.13 | 0.95 | -| MoE | 2048 | 0.0442 | 90.84% | 1.85 | 1.11 | -| MoE | 4096 | 0.0422 | 89.97% | 2.76 | 0.83 | - -Gate thresholds: Mean KLD < 1e-3; Same-top-p >= 99.5%; |ln(PPL ratio)| < 0.005; -drift MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0%. -Result: 256-tok PASS (vacuous); **drift FAIL by ~50-170x on Mean KLD and ~9 pts on Same-top-p** -(top-p ~90% = roughly 1 token in 10 flips its argmax at >=1024 ctx). FAIL for both dense and MoE. - -### Discrimination (is it a bug or genuine bf16?) - dense c1024 chunks8 -- **f32 long-context floor c1024**: Mean KLD -1.2e-5, Same-top-p 100% -> the bf16 divergence is REAL - signal, not a long-context measurement artifact. 
-- **bf16 KLD is invariant to ubatch-boundary count** (= the cross-ubatch state read-back frequency): - ub1024 (0 internal boundaries) 0.0642 / 91.19%; ub512 (1) 0.0647 / 91.54%; ub256 (3) 0.0639 / - 91.17%; ub64 (15) 0.0682 / 90.97%. Flat. -> The error is INTRINSIC to bf16 over the long - recurrence INSIDE a single op call, NOT a chunked-prefill/keep_rs/gather handoff bug (R2 ruled out; - test-backend-ops 52/52 already passed). The error PLATEAUS with context (contraction), i.e. it is - bounded but LARGE: the gated-DeltaNet has long-memory heads (exp(g) ~ 1), so the g<1 decay does NOT - tightly contract the per-step bf16 rounding the way the plan's A.3 optimistically assumed. - -Note (CORRECTED): this is NOT vLLM's precision. vLLM keeps the GDN **temporal state in f32** (proven -three ways in BITEXACT_VS_VLLM.md: empirical kernel-boundary tensor dtype, the config chain, and the -bandwidth regime; only vLLM's tiny conv state is bf16). So bf16 temporal here is a step BELOW vLLM's -recurrent precision, not a match. (An earlier byte-gate draft mislabeled vLLM as bf16-state; that was -refuted.) At equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak BW). - -## 2. Parity bench - the perf lever IS real - -### nsys recurrence per-call (graphs OFF, npp4 ntg32 npl128) - gated_delta_net_cuda Avg -| model | f32 ms/call | bf16 ms/call | delta | -|---|---|---|---| -| dense q27 | 3.381 | 1.726 | **-49.0%** | -| MoE q35 | 2.245 | 1.153 | **-48.6%** | - -The predicted 3.49 -> 2-3 ms/call lever LANDED (even beat it). Total GPU time dropped too (dense -~12.05 -> ~9.05 s graphs-off). bf16 halving the persisted SSM-state bytes halves the dominant decode -kernel, exactly as designed. 
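As a quick arithmetic check on the nsys table above, this small Python sketch recomputes the delta column from the two ms/call columns. The function name `pct_delta` is illustrative and not part of any tool used here; the inputs are copied from the table.

```python
def pct_delta(f32_ms: float, bf16_ms: float) -> float:
    """Relative change of the bf16 per-call time versus f32, in percent."""
    return (bf16_ms - f32_ms) / f32_ms * 100.0


# (f32 ms/call, bf16 ms/call) as reported in the nsys table.
rows = {
    "dense q27": (3.381, 1.726),
    "MoE q35": (2.245, 1.153),
}

deltas = {model: pct_delta(f32_ms, bf16_ms) for model, (f32_ms, bf16_ms) in rows.items()}

for model, delta in deltas.items():
    print(f"{model}: {delta:+.1f}%")
```

Both rows land just under a 2x speedup, which is what halving the persisted state bytes would predict for a bandwidth-bound kernel.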
- -### End-to-end decode throughput (S_TG aggregate, npp128 ntg128, graphs ON unless noted) -| model | npl | f32 t/s | bf16 t/s | note | -|---|---|---|---|---| -| dense | 32 | 212 | 239 | +12.8% | -| dense | 128 | 371-376 (stable) | 287 / 336 / 487 / 498 (BIMODAL) | clean ~490 = +31%; bad runs from a CUDA-graph instability on the dense path | -| dense | 128 | 371.67 (graphsOFF) | 486.68 (graphsOFF) | clean +31% | -| MoE | 32 | 449 | 509 | +13.4% | -| MoE | 128 | 767 | 958 | +24.9% (clean, nsys-corroborated) | - -% of vLLM (391 t/s dense reference): f32 default = 95-96% (bit-exact, higher precision than vLLM); -bf16 clean ~490 = **125%** (but unstable on dense + fails the numeric gate). MoE bf16 +25% is clean. - -## 3. DECISION: NO-SHIP / KEEP SHELVED -- The KL gate **fails** the long-context drift criterion for both models: bf16 SSM state changes - ~10% of tokens at >=1024 ctx vs our f32 (Same-top-p ~90%, Mean KLD 0.04-0.17). It is therefore NOT - a quality-neutral opt-in and cannot honor the project's "f32 bit-exact default" promise. -- Per the task rule (gate FAIL -> do not ship as usable): **patch 0024 was NOT created and nothing was - committed** (DGX tree stays uncommitted; backup `~/llama-paged-dev/BF16_SSM_STATE.diff`). -- The perf lever is genuinely real (recurrence halves; dense ~490 t/s = 125% of vLLM when clean; MoE - +25%), but bf16 temporal is BELOW vLLM's precision (vLLM keeps temporal f32), so it remains a valid - FUTURE option only if shipped as an explicitly-labeled "reduced-precision, NON-bit-exact, below-vLLM" - mode (never quality-neutral), AND the dense CUDA-graph throughput instability (bimodal 287..498) is - fixed first. The principled path is hybrid per-head precision (f32 long-memory heads + bf16 fast - heads) - keeps precision at-or-above vLLM while capturing most of the speedup. -- Recommendation: keep the shipped default as f32 bit-exact (95% of vLLM at higher precision). Shelve - bf16. 
Optional follow-up lever if precision must be cut: bf16 only on the SHORT-memory heads (those - with exp(g) well below 1), keeping long-memory heads f32 - a mixed-precision state that could pass - the gate while still cutting bytes; not implemented/measured here. - -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/BITEXACT_VS_VLLM.md b/backend/cpp/llama-cpp/patches/paged/BITEXACT_VS_VLLM.md deleted file mode 100644 index 879f801ad..000000000 --- a/backend/cpp/llama-cpp/patches/paged/BITEXACT_VS_VLLM.md +++ /dev/null @@ -1,339 +0,0 @@ -# Bit-exact vs vLLM, and the f32-preserving-parity hunt (Qwen3.5 gated-DeltaNet) - -Label: crossengine-bitexact (READ-ONLY, no GPU). Adversarial source+numerics study. -Model: q36-27b-nvfp4 (dense, `Qwen3_5ForConditionalGeneration`) / q36-35b-a3b-nvfp4 -(MoE, `Qwen3_5MoeForConditionalGeneration`). Engines: llama dev `~/llama-paged-dev`, -vLLM 0.23.0 `~/vllm-bench`. Decode B=128, enforce-eager / graphs-off, GB10 (~273 GB/s). - -> **CORRECTION NOTICE (supersedes the earlier draft of this file).** A prior pass concluded -> "vLLM's GDN state cache is bf16, so the 2x recurrence-DRAM gap is f32(llama)-vs-bf16(vLLM) -> width" (old B2/B3). **That is wrong.** It read `gated_delta_net_state_dtype(..., mamba_ssm_cache_dtype="auto")` -> as auto->model-dtype=bf16, but it did **not** trace the Qwen3.5-specific config override that -> reassigns `mamba_ssm_cache_dtype` from `"auto"` to `"float32"` *before* the state dtype is -> resolved. **vLLM stores this model's gated-DeltaNet temporal state in float32**, the same width -> as llama. Proof chain in Part B. Everything in Part C is re-derived from the corrected dtype. -> -> **INDEPENDENT RE-VERIFICATION (this pass, live DGX source).** Two separate sub-agents reached -> *opposite* dtype readings (one f32, two bf16). The contradiction was resolved by reading every -> link of the chain directly, not by majority vote. 
All eight links confirm **float32 temporal -> state**: `config.json text_config.mamba_ssm_dtype = "float32"` (both served models); -> `config/cache.py:129` default `mamba_ssm_cache_dtype = "auto"`; the bench scripts -> (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`, `serve_nvfp4.sh`) pass **only** -> `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`, -> no `--dtype`); `config/vllm.py:847 __post_init__` -> `:856 try_verify_and_update_config()` (runs at -> finalize, before any state-dtype resolution); `MODELS_CONFIG_MAP` (`models/config.py:622-623`) maps -> both `Qwen3_5ForConditionalGeneration` and `Qwen3_5MoeForConditionalGeneration` -> -> `Qwen3_5ForConditionalGenerationConfig`; its override body (`config.py:546-549`) -> `if mamba_ssm_cache_dtype=="auto": cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype` **fires** -> (value "float32"); `mamba_utils.py:91-94` then takes the `!= "auto"` branch -> -> `temporal = STR_DTYPE_TO_TORCH_DTYPE["float32"] = torch.float32` (conv stays bf16); -> `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **temporal** (2nd) -> tuple element and allocates the cache (`:1136`) at f32; `ssm_state = self_kv_cache[1]` (`:1316/1596/1664`). -> The two bf16 sub-agent readings are **refuted** - they stopped at the `cache.py` default "auto" and -> never traced the `__post_init__` override. **Numeric corroboration:** at the measured vLLM duration -> 3.62 ms/call, bf16 (402 MB) would imply 111 GB/s = 41% peak (implausibly low for a tuned BW-bound -> Triton kernel); f32 (805 MB) implies 222 GB/s = 81% peak (the expected regime). f32 is the only -> reading consistent with both source *and* the measured time. - -## Headline (two answers) - -1. **Bit-exact-vs-vLLM (identical logits / probabilities) is IMPOSSIBLE - for this model and for any - two distinct engines.** B4 = CONFIRMED. 
The sharpest proof is the GDN recurrence itself: the two - kernels evaluate an *algebraically reassociated* expression (`g.Sigma` vs `Sigma.g`) on *different - reduction trees*, so they diverge **even if both ran pure f32 with identical inputs**. On top of - that the FP4 GEMM uses different operand precision (8-bit vs 4/16-bit activations) and different - accumulation - a >>ULP divergence in every projection and the LM head. - -2. **bf16 SSM state is NOT the only way to reach vLLM decode throughput, and an f32-preserving lever - was missed.** vLLM reaches its throughput **with an f32 GDN state** (proven). Both engines move the - same ~805 MB f32/recurrence-call; the ~10% per-call gap is a bandwidth-**efficiency** gap on equal - bytes (llama ~74% of peak, vLLM ~81%), i.e. an occupancy/grid/coalescing lever that is **bit-exact - vs llama's own f32**. bf16 state is an *optional over-clock* (goes AHEAD of vLLM on the recurrence), - not a parity requirement. B2/B3 (as "bf16 width is the lever") = REFUTED. - ---- - -# The five questions, answered (synthesis) - -**Q1. Can llama be BIT-EXACT with vLLM? NO.** Two *binding* (>>ULP) divergence sources make -bit-identical logits impossible on their own: **(A1)** the FP4 GEMM - llama MMQ quantizes the -activation to **q8_1 (8-bit)** while vLLM runs cutlass **w4a4 (4-bit acts)** or marlin **w4a16 -(16-bit acts)**; different operand precision + accumulation order -> ~1e-2 relative error in *every* -projection and the LM head; **(A2)** the GDN recurrence - llama computes `g*(Sigma round(S*k))` -(scalar decay *after* the reduction) while vLLM computes `Sigma round(round(g*h)*k)` (decay rounded -into each element *before* the reduction): an IEEE-754 reassociation on *different reduction trees* -(warp butterfly vs Triton `tl.sum`) that diverges **even with identical pure-f32 state and inputs**. -A dozen further ops (L2/RMSNorm, MRoPE, gate `exp`, flash-attn softmax) add close-but-not-equal -rounding. 
Cross-engine bit-exactness is impossible *in general* (FP non-associativity across distinct -GEMM/recurrence/norm kernel stacks); the determinism literature only buys run-to-run determinism -*within* one engine. **Weaker form reachable:** greedy **top-1 token agreement** is the right gate -(top-1 / KL / PPL-delta, never md5). It is probabilistic (flips at low-margin steps), **compounds** -with length (once one token differs the SSM/KV states fork), and is *weaker here* than a -same-precision run because of the A8-vs-A4 GEMM gap. - -**Q2. Is bf16 SSM state the only path to vLLM decode throughput? NO - an f32-preserving lever exists -and bf16 is not even required for parity.** vLLM carries the **same f32 temporal state** (proven + -re-verified), so the recurrence gap is **bandwidth EFFICIENCY on equal f32 bytes** (llama 74% vs vLLM -81% of GB10 peak), ~10% per call, *not* a 2x width gap. The lever: **retune `gated_delta_net_cuda` -74% -> ~81%** - it launches 196608 tiny one-column blocks (butterfly-reduce per token); fold toward -fewer/larger `BV x BK` tiles + vectorized `f32x4` loads + better row coalescing, **keeping the -per-column reduction order -> BIT-EXACT vs llama's own f32** (md5-gateable). **Cost vs bf16:** zero -precision risk and bit-exact, but it can only **match** vLLM's recurrence BW (81%), never beat it; -worth ~+5% (~335 -> ~351 tok/s, ~90% of vLLM), and it caps below 100% unless stacked with the other -bit-exact levers (conv fusion 0021, activation fold, oproj MMQ 0020). The adversarial sweep of every -other f32 avenue (lossless sub-f32, delta/low-rank/sparse, recompute+checkpoint, 2nd-stream/overlap, -chunked recurrence) **FAILS** to beat it; recompute is bit-exact but only **ties** the irreducible -one-full-state-READ floor and is now moot (vLLM also writes f32, so you match its achieved BW, you -don't need to eliminate the write). 
bf16 remains the **only** lever that goes *ahead* of vLLM on the -recurrence (~440 tok/s) - an **over-clock**, not a requirement. - -**Q3. Does bf16 state MATCH vLLM's precision or DEGRADE below it? It DEGRADES below vLLM.** (This -corrects the `precision-ground-truth` sub-agent's "matching, not degrading" claim, which rested on -the refuted bf16 reading.) vLLM keeps the **temporal/recurrent** state in **f32**; only its small -**conv** state is bf16 (llama keeps conv f32, so llama is *more* precise there). So bf16 **temporal** -state in llama (~8 mantissa bits) sits **below vLLM's f32 temporal** (~24 bits) - it is a deliberate -precision-for-speed trade, KL/PPL-gated vs llama's own f32 *and* a step under vLLM's recurrent-state -precision. A genuine "match vLLM's envelope" change would be f32 temporal (as today) + bf16 conv - -which costs llama precision only on a tiny stream and buys almost no BW. - -**Q4. What can "parity" mean here? Throughput at equal precision + a distributional quality bar - -never identical bits.** Bit-identical logits are impossible cross-engine, so "parity" = **(a)** -throughput (tok/s in the harness) at **(b)** a quality bar measured by **top-1 greedy agreement, -KL(llama||vLLM)/step, and PPL-delta**, never md5. Both engines already run the recurrence math in f32 -registers; at **equal** precision (llama f32 temporal == vLLM f32 temporal) the *only* open variable -is throughput, and that gap is closable **bit-exactly** (Q2). If llama adopts bf16 temporal, "parity" -must be restated as "throughput >= vLLM at KL/PPL within gate vs llama's own f32" and reported as the -precision-for-speed trade it is. - -**Q5. Did the prior analysis get B1-B4 right? B1 mostly; B2/B3 REFUTED; B4 CONFIRMED. 
Overturn the -"bf16 is required" framing - keep the bit-exact levers.** -- **B1 TRUE** (single-pass f32, load-once/store-once, 74% peak) - but its sub-claim "more efficient - than vLLM (41%)" is **REFUTED** (41% was the bf16 artifact; vLLM is ~81%, *more* efficient). -- **B2 REFUTED** - not a f32-vs-bf16 width gap; equal f32 bytes both sides, ~10% efficiency gap. -- **B3 REFUTED** as written - vLLM reaches its throughput **with f32 state**; a bit-exact f32 - occupancy retune reaches vLLM's recurrence BW. bf16 is optional. -- **B4 CONFIRMED** - impossible, on two independent grounds (structural A1+A2; general FP - non-associativity across distinct kernel stacks). -- **Plan disposition:** do **not** overturn the conv-fusion (0021) bit-exact lever - keep it. - **Re-prioritize the bit-exact f32 occupancy/coalescing retune of `gated_delta_net_cuda` as the - parity path.** Treat bf16 temporal state as an explicitly-gated **over-clock for going beyond - vLLM**, reported as a precision-for-speed trade (below vLLM's f32 recurrent precision), NOT as a - parity-matching change. - ---- - -# PART A - Divergence inventory (per source: bit-identical vs close) - -Per decode layer the two engines run *different kernels* for: FP4 GEMMs (proj + LM head), depthwise -conv+SiLU, q/k L2-norm, the GDN recurrence, gated RMSNorm; and on the hybrid's full-attention layers: -RMSNorm q/k-norm, MRoPE, flash attention, a sigmoid gate. - -## A1. NVFP4 dequant + FP4 GEMM -- NOT bit-identical (diverges >> ULP) - -- **llama**: MMQ (`mmq.cuh` `block_fp4_mmq`, nvfp4 block=16, 4x ue4m3 sub-scales). Host path - (`ggml-cuda.cu` ~1955-2014) **quantizes the activation (src1) to q8_1** (`block_q8_1_mmq`, **8-bit**, - block 32) and accumulates over K in the MMQ tile (DP4A / Blackwell FP4-MMA); tile order set by - `mmq_y`/`mmq_x` + the warp-MMA fragment layout. 
-- **vLLM**: `compressed_tensors_w4a4_nvfp4` -> cutlass FP4 GEMM on Blackwell (**4-bit** activations, - w4a4, per-group act-quant, e4m3 block scale x global FP8 tensor scale) or marlin fp4 fallback - (**16-bit** activations, w4a16, dequant->bf16 then bf16 GEMM). `apply_weights` -> `self.kernel`. -- **Verdict: not close.** (a) *Operand precision differs*: llama 8-bit acts vs vLLM 4-bit (cutlass) or - 16-bit (marlin) - per-GEMM outputs differ at ~1e-2 relative, not ULP. (b) Scale-application order - differs. (c) Accumulation tiling/order differs (MMQ fragment vs cutlass/marlin). This is the largest - divergence and is present in every projection + the LM head, so logits differ materially on its own. - -## A2. gated-DeltaNet recurrence -- NOT bit-identical, AND provably so even in pure f32 - -Both single-pass over an **f32** state (Part B). llama: `gated_delta_net.cu` -`gated_delta_net_cuda<128,KDA=false>`; vLLM: `fused_recurrent.py` -`fused_recurrent_gated_delta_rule_packed_decode_kernel`. Scalar-gate (GDA) path, `g.ne0==1`. -With S[k][v] (llama, transposed) == h[v][k] (vLLM): - -``` -llama: kv[v] = Sigma_k S_old[k][v]*k[k] # OLD state; g applied AFTER the sum - delta = (v[v] - g*kv[v])*beta; S_new = g*S_old + k(x)delta; o[v]=Sigma_k S_new[k][v]*q[k] -vLLM: h' = g*h_old # decay rounded into EVERY element first - kv[v]=Sigma_k h'[v][k]*k[k]=Sigma_k round(g*h_old)*k; b_v=(v[v]-kv[v])*beta - h_new = h' + b_v(x)k; o[v]=Sigma_k h_new[v][k]*q[k] -``` - -Algebraically identical (g scalar). **Numerically not**, for two structural reasons that survive even -with identical f32 state and identical inputs (the first survives even with an identical reduction tree): -- **Reassociation:** llama forms `g*(Sigma round(S*k))` (scalar multiply *after* the reduction); - vLLM forms `Sigma round(round(g*h)*k)` (decay rounded into each element *before* the reduction). - Distributing a multiply across a sum is exact in R, not in IEEE-754. This is not a precision knob.
-- **Different reduction trees:** llama `warp_reduce_sum<32>` (4 sequential per-lane FMAs + 5-step - butterfly) vs vLLM `tl.sum(...,1)` (Triton tree over the 128-wide BK axis). -**Verdict: not bit-identical; cannot be made so without rewriting one kernel to the other's op order.** - -## A3. Depthwise conv1d (width 4) + SiLU -- NOT bit-identical -llama `ggml_ssm_conv` (ascending-j f32 FMA) + `ggml_silu`, conv state cached **f32**. vLLM -`causal_conv1d_update` (Triton) + SiLU, conv state cached **bf16** (`conv_state_dtype = bf16`; only the -*temporal* SSM state is forced f32 - Part B). Different kernel + different conv-state width + FMA order. -(Patch 0021 fuses llama's chain bit-exactly vs *llama's own* f32 path, not vs vLLM.) - -## A4. q/k L2-norm + RMSNorm/RMSNormGated -- NOT bit-identical (close, ~1e-6) -L2-norm: llama standalone `ggml_l2_norm` (f32 tree) vs vLLM `l2norm_fwd`/in-kernel fold -(`USE_QK_L2NORM_IN_KERNEL`). RMSNorm: llama `ggml_rms_norm` vs vLLM `vllm_c` fused kernel (run log: -`rms_norm=['vllm_c','native']`); gated out-norm `build_norm_gated`=RMS*SiLU(z) vs `RMSNormGated`. -Different variance reduction tree / eps placement / fusion boundary. - -## A5. MRoPE + gate scalar pipeline -- NOT bit-identical (close) -MRoPE: `ggml_rope_multi` (ggml sin/cos) vs vLLM rotary cos/sin cache (different theta eval + apply -order). Gate: vLLM computes `-exp(A_log)*softplus(a+dt)` then `exp` **in-kernel**; llama computes -`softplus(alpha+ssm_dt)*ssm_a` as split graph ops with `ssm_a` baking `-exp(A_log)` at GGUF-convert -time (rounded once), writes/reloads the intermediate, `expf` in-kernel. Same algebra, different -rounding points + convert-time vs runtime `exp(A_log)`. - -## A6. Flash attention (full-attn layers) -- NOT bit-identical (close) -llama `ggml_flash_attn_ext` -> `fattn-mma-f16`/`fattn-vec` (online softmax, F16/F32 PV accum per -`GGML_PREC`) vs vLLM FlashInfer/FA2. Different tiling => different running max/sum order => different -rounding. - -## A7. 
SiLU/sigmoid primitives + fusion -- equivalent IF inputs matched (they never do) -Both ultimately use the same hardware `expf`/`__nv_expf`; the primitives could match given identical -inputs, but every upstream value has diverged, and vLLM fuses act+quant / norm+quant differently than -llama's separate ops (run log `fuse_act_quant=True`), moving the rounding points. - -### Inventory summary - -| Source | bit-identical? | divergence size | -|---|---|---| -| FP4 GEMM (proj/LM head): MMQ q8_1(A8) vs cutlass w4a4(A4)/marlin w4a16 | **NO** | **>>ULP (~1e-2)** | -| GDN recurrence: hand-CUDA warp-reduce vs Triton tl.sum | **NO (provable even in f32)** | reassoc + tree | -| conv1d+SiLU: f32 conv-state vs bf16 conv-state | NO | dtype + order | -| L2-norm / RMSNorm | NO | ~1e-6 (tree) | -| MRoPE | NO | ~ULP-1e-6 | -| gate softplus/exp | NO | rounding points | -| flash attention | NO | softmax tiling | -| silu/sigmoid primitive | identical IFF inputs equal | inputs never equal | - -Any single NO makes the logits differ. A1 and A2 differ by far more than ULP -> the logit vectors are -not close-to-equal at the bit level; they agree only to a few significant digits. - ---- - -# PART B - The decisive f32-state correction (proof from source) - -The byte-gate inferred vLLM's GDN temporal state is **bf16** (402 MB/call, 41% peak) and built the -"bf16-width is the lever" case on it. The byte count was *inferred from the dtype*; ncu byte counters -were blocked, so only the **duration** (3.62 ms/call) was measured. The dtype inference is falsified: - -1. `config.json`: `architectures=["Qwen3_5ForConditionalGeneration"]`, `text_config.dtype=bfloat16`, - and **`text_config.mamba_ssm_dtype = "float32"`**. -2. `models/config.py:590 MODELS_CONFIG_MAP` maps `"Qwen3_5ForConditionalGeneration"` (line 622) and - `"Qwen3_5MoeForConditionalGeneration"` (623) to `Qwen3_5ForConditionalGenerationConfig`. -3. 
`Qwen3_5ForConditionalGenerationConfig.verify_and_update_config` (config.py:536-562): - `mamba_ssm_dtype = getattr(hf_text_config,"mamba_ssm_dtype")` (="float32"); if - `cache_config.mamba_ssm_cache_dtype == "auto"` (the default) it executes - **`cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype`** -> sets it to **"float32"**. -4. This override runs at config finalization: `config/vllm.py:856` -> `try_verify_and_update_config()` - (vllm.py:1880-1900) looks up the arch in `MODELS_CONFIG_MAP` and calls `verify_and_update_config`. - It runs **before** any layer/model state-dtype resolution. -5. The bench left it default: `h2h_dense_vllm.sh` = `vllm serve .../q36-27b-nvfp4-vllm --enforce-eager - --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`; `dl-logs/vllm_dense.log` - non-default args confirm none). So the override fires and the value is "float32". -6. State dtype resolution reads the **already-overridden** value: - - `gdn/base.py:53-57` `get_state_dtype()` -> `gated_delta_net_state_dtype(model_dtype=bf16, - cache_config.mamba_cache_dtype="auto", cache_config.mamba_ssm_cache_dtype="float32")`. - - `qwen3_5.py:678 get_mamba_state_dtype_from_config` likewise passes - `vllm_config.cache_config.mamba_ssm_cache_dtype` (= "float32", post-override) - **not** a raw "auto". - - `mamba_utils.py _mamba_state_dtype`: conv_state = `get_kv_cache_torch_dtype("auto", bf16)` = **bf16**; - temporal_state, since `mamba_ssm_cache_dtype != "auto"`, = `STR_DTYPE_TO_TORCH_DTYPE["float32"]` - = **torch.float32** (key verified: `torch_utils.py:33 "float32": torch.float32`). -7. `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **second** tuple - element (temporal) = **float32**, allocates the cache `dtype=state_dtype`. The packed_decode kernel - round-trips f32: `b_h = tl.load(p_h0).to(f32)` ... `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))` - with `p_ht.dtype == initial_state.dtype == float32`. 
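The resolution order in steps 1-7 can be sketched as a toy model. This is not vLLM's code: `resolve_state_dtypes`, `implied_bw_gbs`, and their parameters are hypothetical names that only mirror the branch structure described above. The byte and bandwidth figures reuse the geometry (128 sequences, 48 heads, 128x128 state) and the measured 3.62 ms/call quoted in this document.

```python
ELEM_BYTES = {"float32": 4, "bfloat16": 2}


def resolve_state_dtypes(model_dtype, conv_cache_dtype, ssm_cache_dtype,
                         hf_mamba_ssm_dtype=None):
    """Toy model (hypothetical names) of conv/temporal state dtype resolution."""
    # Model-specific override (steps 3-4): runs before state-dtype resolution
    # and only fires when the cache dtype was left at its "auto" default.
    if hf_mamba_ssm_dtype is not None and ssm_cache_dtype == "auto":
        ssm_cache_dtype = hf_mamba_ssm_dtype
    # Resolution (step 6): "auto" falls back to the model dtype.
    conv = model_dtype if conv_cache_dtype == "auto" else conv_cache_dtype
    temporal = model_dtype if ssm_cache_dtype == "auto" else ssm_cache_dtype
    return conv, temporal


def implied_bw_gbs(temporal_dtype, ms_per_call=3.62, n_seqs=128, heads=48, s_v=128):
    """Effective GB/s if the state is read once and written once per call."""
    state_bytes = n_seqs * heads * s_v * s_v * ELEM_BYTES[temporal_dtype] * 2
    return state_bytes / (ms_per_call * 1e-3) / 1e9


with_override = resolve_state_dtypes("bfloat16", "auto", "auto",
                                     hf_mamba_ssm_dtype="float32")
without_override = resolve_state_dtypes("bfloat16", "auto", "auto")

for label, (conv, temporal) in [("with override", with_override),
                                ("without override", without_override)]:
    bw = implied_bw_gbs(temporal)
    print(f"{label}: conv={conv} temporal={temporal} "
          f"-> {bw:.0f} GB/s ({bw / 273 * 100:.0f}% of 273 GB/s peak)")
```

Skipping the override in this model reproduces the refuted bf16 reading (about 111 GB/s, roughly 41% of peak), while applying it gives the f32 reading (about 222 GB/s, roughly 81% of peak) at the same measured duration.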
- -**=> vLLM's gated-DeltaNet temporal (recurrent) state cache for this model is float32, identical width -to llama's f32 state.** The earlier "bf16" reading hardcoded the third arg as `"auto"` and missed the -override at step 3-4. Only the small *conv* state is bf16 in vLLM (f32 in llama: divergence A3, tiny -byte stream). - -## Re-derived efficiency table (measured duration + PROVEN f32 byte volume) - -| kernel | state dtype (PROVEN) | bytes R+W/call | duration/call | eff. BW | % of 273 peak | -|---|---|---|---|---|---| -| llama `gated_delta_net_cuda` | f32 | 805 MB | 3.98 ms | 202 GB/s | **74%** | -| vLLM `..._packed_decode` | **f32 (not bf16)** | **805 MB (not 402)** | 3.62 ms | **222 GB/s** | **~81%** | - -- **B1 (single-pass f32 byte floor): TRUE** (load-once/store-once `s_shard`, coalesced). *Sub-claim - "more BW-efficient than vLLM (41%)" REFUTED* - 41% was the bf16 artifact; at the correct f32 byte - count vLLM is at ~81%, i.e. **more** efficient than llama. -- **B2 ("the gap is f32-vs-bf16 width"): REFUTED.** Equal f32 bytes both sides; the ~10% per-call gap - is bandwidth **efficiency** on equal bytes, not width. -- **B3 ("vLLM throughput REQUIRES bf16 state"): REFUTED.** vLLM reaches it *with f32 state*. - ---- - -# PART C - The f32-preserving lever, and where recompute/bf16 land - -Since vLLM hits ~81% on the **same f32 byte volume** llama runs at ~74%, the missed lever is **raising -llama's `gated_delta_net_cuda` achieved BW 74% -> ~81%**, bit-exact, NOT dtype width: -- llama grid `(H=48, n_seqs=128, ceil(S_v/4)=32) = 196608` blocks/128 thr, each warp owns ONE state - column + warp-reduces over 128 rows. vLLM grid `(NV=4, B*HV=6144) = 24576` programs (num_warps=1), - each owns a BV=32 x BK=128 tile. llama's far-finer blocking (8x more blocks, one column of work each, - a butterfly reduce/token) is the likely ~7-point deficit. 
Retune toward fewer/larger blocks (more - columns/block, vectorized f32x4 loads, better row coalescing) - changes thread/tile mapping + load - width only, **keeps the per-column reduction order -> bit-exact vs llama's own f32**. -- Upper bound: 74%->81% on ~50% of the step ~= 17 ms/step saved (384 -> ~367), ~+5% -> ~351 tok/s (~90% of - vLLM 391), stacking with the landed bit-exact levers (oproj MMQ 0020 @86%, conv fusion 0021). - -**Other f32-preserving avenues (adversarial sweep) - none beats the simple bf16 over-clock, but the -occupancy tune above is the real bit-exact win:** -- *Lossless sub-f32 state:* generic float compression is data-dependent (1.1-1.5x, never a guaranteed - 2x) and breaks the 128-consecutive-f32 coalescing a BW-bound kernel depends on. The state is dense, - full-rank, non-symmetric (sum of `k(x)delta`, k!=delta) -> no low-rank/half-storage. FAILS. -- *Recompute (checkpoint every N + rank-1 replay):* eliminates the per-step WRITE; the per-step full - dense f32 READ (the `S^T k` / `S^T q` matvecs need every element; the checkpoint is itself a full - read) is irreducible. Optimal N~=11 -> ~473 MB/step (0.587x), realistically ~0.65-0.75x after - replay/latency overhead. A genuine bit-exact path but it only reaches - never beats - the read floor, - at large kernel/graph complexity. **Note: this was over-weighted before because vLLM was assumed - bf16; now that vLLM is f32 too and runs at 81%, you do NOT need to cut the write to match vLLM - you - need to match vLLM's achieved BW on the same f32 bytes.** Recompute is dominated. -- *2nd stream / overlap / pipelining:* DRAM BW (273 GB/s) is one shared resource; the whole decode step is - uniformly BW-bound (state traffic + ~13.5 GB/step dense NVFP4 weight traffic both hit 273 GB/s), so - overlapping two BW-bound phases sums to ~0. FAILS. -- *Equivalent recurrence with less decode traffic:* chunked gated-delta-rule is a prefill lever (C=1 at - decode); attention/materialization-free form is O(t) over the prefix.
FAILS. - -**bf16 SSM state is therefore an OPTIONAL over-clock**, the only lever that goes *ahead* of vLLM on the -recurrence (halving the 805 MB/call state traffic, ~440 tok/s) - but it takes llama below both its own f32 and vLLM's f32 -precision, so it must be **KL/PPL-gated vs llama's own f32**, never md5. f32-only parity-class -throughput is plausible from the SUM of bit-exact levers (recurrence occupancy + conv fusion + oproj -MMQ + activation fold); none requires bf16. - ---- - -# PART D - Verdict on B4 + the meaningful weaker form - -## Bit-exact-vs-vLLM: IMPOSSIBLE (B4 CONFIRMED) - two independent grounds - -1. **Structural (this model):** A1 (FP4 GEMM operand precision + accumulation) and A2 (recurrence - `g.Sigma` vs `Sigma.g` + different reduction trees) make per-layer outputs differ by >>ULP, so logits - cannot be bit-identical. A2 shows it is not a precision knob: the kernels evaluate a *reassociated - expression*, differing **even given identical f32 state and inputs**. -2. **General (any two engines):** IEEE-754 add/mul are non-associative; two engines that tile, reduce, - fuse, and quantize differently cannot produce bit-identical results for a non-trivial transformer. - Field determinism work (batch-invariant / fixed-reduction kernels, "defeating nondeterminism in LLM - inference") delivers **run-to-run determinism WITHIN one engine**; it does **not** and cannot deliver - **cross-engine** bit-exactness (that needs identical kernel+tiling+reduction-order+dtype for *every* - op). Cross-engine bit-exactness is essentially never achieved in practice. Bit-exactness is only a - meaningful gate **within** an engine (how llama patches 0018-0021 are validated by md5). - -## Greedy-token match (argmax robustness) - the right weaker form, but probabilistic -Because logits differ mostly in low-order bits (A4-A7) plus a few-significant-digit GEMM/recurrence gap -(A1-A2), the **argmax** frequently coincides whenever the top-1/top-2 logit margin exceeds the -cross-engine noise.
This is the only meaningful cross-engine "equivalence"; gate on **top-1 agreement / -KL / PPL-delta**, never md5. Caveats: not guaranteed per-token (low-margin steps can flip); it -**compounds** - once one greedy token differs the sequences fork and the KV/SSM states diverge, so -agreement degrades with length (high on short continuations, drift on long ones); and the FP4 A4-vs-A8 -gap (A1) makes the per-step divergence *larger* here than a same-precision bf16-vs-bf16 comparison, -weakening greedy agreement for this model specifically. - -**Bottom line:** target near-vLLM via KL/PPL/top-1-agreement, not bit-exactness. Reserve bit-exact -gating for intra-llama validation (the f32 recurrence-occupancy lever and the conv fusion qualify; -bf16 state does not and must be KL/PPL-gated vs llama's own f32). - -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md b/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md deleted file mode 100644 index 6a68cc504..000000000 --- a/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md +++ /dev/null @@ -1,53 +0,0 @@ -# GDN Recurrence Byte-Gate - Progress (agent: ncu-byte-gate) - -## Hard blocker on direct DRAM counters -- ncu HW perf counters: ERR_NVGPUCTRPERM (NVreg_RestrictProfilingToAdminUsers=restricted, root-only). -- nsys --gpu-metrics-devices: same ERR_NVGPUCTRPERM. -- No passwordless sudo on dgx.casa. DRAM byte counters UNOBTAINABLE without root. -- FALLBACK (decisive, no perfcounters needed): CUPTI kernel TIMING (allowed) + exact byte - geometry from kernel source => implied effective BW + a hard mathematical cap on re-stream factor. - -## Byte geometry (exact, from gated_delta_net.cu + GGUF) -- Qwen3.5 dense q36-27b-nvfp4: 48 GDN layers, H=48 v-heads, S_v=128 (square state 128x128/head). -- State per (seq,head) = 128*128 f32 = 64 KiB. Per seq = 48*64KiB = 3.0 MiB. 
-- Kernel is SINGLE-PASS by construction: loads s_shard[] ONCE into regs, recurrence in-register, - writes state ONCE (read_state coalesced 128 consecutive f32/warp; writeback coalesced). - l2norm/sigmoid/softplus/gate act on small q/k/g/beta (NOT the 805MB state); gather no-ops at - steady decode (identity seqs). => NO multi-pass state re-streaming exists to fuse away. -- Minimal bytes/call (B=128): state R+W = 128*48*16384*4*2 = 805.3 MB; +q/k/v/out ~10 MB = ~816 MB. -- Floor time @273 GB/s = 816MB/273 = 2.99 ms/call. - -## Measured (clean nsys CUDA timing, graphs OFF, npp8 ntg12 npl128, build-cuda-base df1cc97) -- llama gated_delta_net_cuda steady decode: 480 calls, grid(48,128,32), avg 3.98 ms/call - (min 3.90, max 4.33; very tight => bandwidth-bound). 48 layers => 191 ms/step (50% of 384 ms). -- Implied effective BW @1.0x bytes = 816MB/3.98ms = 205 GB/s = 75% of 273 peak. -- HARD CAP: max bytes movable in 3.98ms @273 peak = 1.087 GB = 1.33x minimal. - => re-stream factor in [1.0x, 1.33x]. 2x re-streaming PHYSICALLY IMPOSSIBLE. - Source proves single-pass+coalesced => ~1.0x, kernel at ~75% peak. - -## Conv-path (same trace, steady-decode region kernels, per-call): -- ssm_conv_f32: 672 calls whole-trace avg 135.9us (incl prefill); decode-region TBD -- concat_cont: 576 calls avg 169.6us ; concat_non_cont 96 calls (prefill big) -- cpy_scalar: 896 calls avg 123.7us ; gdn_gather_nonident 672 calls avg 153.9us (mostly no-op) - -## vLLM (apples-to-apples: NSEQ=128, enforce_eager=True; postssm_decomp/vllm_decode.sqlite) -- vLLM state dtype = model_dtype = BF16 (_mamba_state_dtype default "auto"; config dtype=bfloat16). - Geometry identical to llama (H=48, k/v head_dim 128, S_v 128). -- vLLM fused_recurrent_gated_delta_rule_packed_decode steady: 3.62 ms/call (grid 4x6144x1), - bf16 state R+W = 402.6 MB => 111 GB/s = 41% peak. SINGLE-PASS (load p_h0 once -> f32 regs -> - store bf16 once). -- llama 3.98 ms/call, f32 805.3 MB => 202 GB/s = 74% peak. 
llama kernel is MORE BW-efficient. - -## Conv-path (llama steady decode, per call x48 layers) -- concat_cont 169.6us (8.14 ms/step) + cpy_scalar 120.1us (5.76) + ssm_conv_f32 115.9us (5.56) - = ~19.5 ms/step. Conv state ~12.6 MB (tiny) => LAUNCH-bound, not byte-bound => fusion lever (~5%). -- l2_norm 6.8us, gdn_gather 1.21us (no-op identity seqs => gather does NOT re-stream state). - -## FINAL VERDICT (DONE) -- llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x physically impossible @273 peak). -- NO-BUILD fused single-pass recurrence: already single-pass, coalesced, 74% peak (> vLLM 41%); - gate ops touch tiny q/k/g/beta, not the 805MB state => recovers ~0 state bytes. -- BUILD bf16 SSM state (design lever (2)): the 2x gap vs vLLM is 100% f32-vs-bf16 cache width. - 805->413 MB => ~45-95 ms/step => step 384 -> 289-339 ms = parity-to-ahead of vLLM 327. - Non-bit-exact vs llama f32 but equal to vLLM's own bf16 precision. -- Findings written: GDN_RECURRENCE_BYTE_GATE.md (MEASUREMENT + VERDICT section appended). diff --git a/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md b/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md deleted file mode 100644 index 4a8beba10..000000000 --- a/backend/cpp/llama-cpp/patches/paged/B_MOE_PROGRESS.md +++ /dev/null @@ -1,57 +0,0 @@ -# B_MOE_PROGRESS.md - B-3 (mmq_y-down warp-remap, patch 0028) checkpoint - -Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` -`2f4f5ab`, branch `b-work`), independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`. - -## Prior: B-2 (act-quant retune) = NEGATIVE (no lift, no patch 0027). MoE ~85% of vLLM @npl128. -B-2 proved the act-quant tax (~2%) is already optimally tiled; the structural MoE residual is the -grouped FP4 `mul_mat_q` GEMM (~27%, LPDDR5x BW floor) + bf16 projections (~10.5%). => try B-3. 
- -## The lever (B-3 / SPEEDUP_HUNT B rank #3) -mmq_y-down warp-remap of the NVFP4 FP4-MMA grouped GEMM `mul_mat_q` in `ggml/.../mmq.cuh`. -mmq_y tiles the weight-row (N) dimension; lowering 128->64 raises resident CTAs (smaller per-CTA -shared + accumulator + 128 vs 256 threads/CTA => ~2x blocks/SM) to hide LPDDR5x weight-load latency, -WITHOUT re-reading weights (each weight row lives in exactly one row-tile => BW-neutral). The MoE -GEMM runs at ~35% of peak BW (occupancy-limited, NOT BW-saturated), so more resident CTAs is the -right mechanism - and it is the ONE untested occupancy lever (M-tile = NEUTRAL 0015, MINBLOCKS = -+8.7% slower 0017). - -## The coupling that makes it a real kernel change (not the 0017 knob alone) -The FP4-MMA path has `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; tile_C::I==16 for the -m16n8k64 block-scaled FP4 MMA). nwarps is global `256/warp_size = 8`, so mmq_y is pinned at 128. The -0017 `GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP this assert at mmq_y=64. B-3 makes nwarps TYPE-AWARE: -`mmq_get_nwarps_device()` returns mmq_y/16 = 4 for NVFP4-reduced (else stock 8), keeping the -coupling. 2 new overloads (device template + host 3-arg) + 9 call-site swaps to ``. Default -GGML_CUDA_FP4_MMQ_Y==128 returns stock nwarps for EVERY type => default build byte-identical to stock. - -## Bit-exactness note (the real risk) -The per-output K-reduction order is mmq_y-INVARIANT (each output row owned by one thread). BUT mmq_y=64 -DOUBLES nty (row-tiles), changing the stream-k kbc partition => an output tile's K-range may be split -across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different grouping => -FP non-associativity CAN perturb the last logit bits => greedy argmax COULD flip. So B-3 is NOT -bit-exact-by-construction in the md5 sense; the md5 gate is EMPIRICAL. md5 fail => not bit-exact => STOP. - -## Status: COMPLETE - BIT-EXACT but FLAT. No patch 0028. Full result + assessment in B_MOE_RESULTS.md. 
-- [x] Source-read mmq.cuh: nwarps/mmq_y coupling, FP4 MMA vec_dot, kernel+fixup+launch+case sites. -- [x] Edited mmq.cuh: 2 nwarps overloads + 9 `` swaps. git diff clean (37+/11-). -- [x] BEFORE baseline (stock-0025 binaries, same session): dense md5 5951a5b4==ref, moe 07db32c2==ref; - MoE S_TG npl32=441.98, npl128=756.47. -- [x] BUILD build-cuda @mmq_y=64 (full cuda rebuild): EXIT=0 - compiles (static_assert holds at 4*16=64). -- [x] md5 GATE PASS both models @64; test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS. -- [x] Clean back-to-back A/B (build-cuda-base @128 vs build-cuda @64), 3 reps: npl32 +0.29%, - npl128 +0.40% - within the ~0.4% noise band. FLAT. -- [x] nsys A/B: grouped GEMM kernel mmq_y=64 -1.3% FASTER, BUT stream_k_fixup +42% costlier + SSM (40%) - dominant & untouched => end-to-end inert. BW-bound confirmed (same as 0015/0017/B-2). -- [x] DECIDED: FLAT -> no patch 0028. Dev tree reverted to pristine 0025 (no ggml diff), build-cuda - reconfigured to default + rebuilt. Bit-exact MoE ceiling = ~85% @npl128 / ~87.5% @npl32 of vLLM. -- [x] ASSESS + RECOMMEND (in B_MOE_RESULTS.md): residual = structural Marlin-NvFp4 grouped-GEMM gap, - uncloseable bit-exactly; fall back to 0026 bf16-SSM opt-in (default-off, fails MoE KL gate, ~95%). - -## Gate references -- dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439 -- MoE q36-35b-a3b-nvfp4 md5 == 07db32c2bcb78d17a43ed18bc22705cd -- gate cmd: `llama-completion -m M -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1` -- bench: `llama-batched-bench -m M -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128` (S_TG=decode_agg) -- vLLM ref decode_agg @npl128 = 882.2 t/s (npl32 ref 500.8). 
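
The nwarps/mmq_y coupling recorded in this checkpoint can be sanity-checked with a few lines of arithmetic. The following is a minimal Python model, not the CUDA code: `nwarps_for`, `invariant_holds` and `threads_per_cta` are hypothetical helper names, and the model assumes only what the notes state (`tile_C::I == 16`, warp size 32, stock `nwarps = 256/warp_size = 8`).

```python
# Toy model of the FP4-MMA invariant nwarps * tile_C::I == mmq_y described in the notes.
# Every name below is illustrative; none of them exists in mmq.cuh.
WARP_SIZE = 32
TILE_C_I = 16                      # rows per warp for the m16n8k64 block-scaled FP4 MMA
STOCK_NWARPS = 256 // WARP_SIZE    # the global default: 8 warps per CTA


def nwarps_for(mmq_y: int, type_aware: bool) -> int:
    """Warps per CTA: derived from mmq_y when type-aware, else the stock constant."""
    return mmq_y // TILE_C_I if type_aware else STOCK_NWARPS


def invariant_holds(mmq_y: int, type_aware: bool) -> bool:
    """Mirror of the static_assert: nwarps * tile_C::I must equal mmq_y."""
    return nwarps_for(mmq_y, type_aware) * TILE_C_I == mmq_y


def threads_per_cta(mmq_y: int, type_aware: bool) -> int:
    """Threads per CTA implied by the warp count."""
    return nwarps_for(mmq_y, type_aware) * WARP_SIZE


if __name__ == "__main__":
    for mmq_y in (128, 64):
        for type_aware in (False, True):
            print(f"mmq_y={mmq_y} type_aware={type_aware} "
                  f"invariant={invariant_holds(mmq_y, type_aware)} "
                  f"threads={threads_per_cta(mmq_y, type_aware)}")
```

In this model the stock warp count breaks the invariant at `mmq_y=64`, while the type-aware variant keeps it and halves the CTA to 128 threads. At `mmq_y=128` both variants return 8 warps, which is consistent with the claim that a default build is unchanged.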
- -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md deleted file mode 100644 index 5929939df..000000000 --- a/backend/cpp/llama-cpp/patches/paged/B_MOE_RESULTS.md +++ /dev/null @@ -1,232 +0,0 @@ -# B_MOE_RESULTS.md - B-2 (down_proj act-quant retune / M1) RESULT: NEGATIVE (no headroom) - -Agent: B2-build (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`, -branch `b-work`), independent of the held hybrid 0026 (`33e7c65`). Lever: SPEEDUP_HUNT.md section B, -rank #2 ("down_proj act-quant retune (M1): bit-exact, bounded - act-quant is ~2% of MoE step"). - -## VERDICT -**The existing `blockDim.x = 128` is ALREADY the kernel-level optimum for `quantize_mmq_nvfp4` on -GB10 sm_121. B-2 has zero headroom: there is nothing to bake (128 is the current default), and it -does NOT lift MoE decode (end-to-end flat within 0.4% noise across all block sizes). No patch 0027.** -MoE stays ~85% of vLLM @npl128 / ~87% @npl32, well below vLLM => the remaining MoE lever is B-3. - -## The change that was built+measured (bit-exact, then REVERTED - did not lift) -`ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch. Replaced the hardcoded -`constexpr int nvfp4_block_size = 128` with a `static const int` selected once from env -`LLAMA_MOE_QUANT_BLOCK` (default 128), `block_num_y` recomputed from the SAME `blockDim.x`. ~20 LOC. - -### Why ANY block size is provably byte-identical (the bit-exact invariant) -`quantize_mmq_nvfp4` maps thread -> column purely via the global linear index -`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread -communication (no shared memory, no warp reduction) and every thread writing its OWN disjoint output -sub-block (its own `sub` slot in `block_fp4_mmq`: `yqs[2*sub+0/1]`, `d4[sub]`). 
The per-thread quant -body (amax, the 5-offset fp8-code search, the q0/q1 nibble packing, the writeback) is untouched. So -the (thread)->output-byte map - and the produced bytes - are invariant to `blockDim.x`. Confirmed -empirically: md5 identical at block 64, 128, AND 256, both models. - -## GATE (bit-exact) - BOTH MODELS PASS at default AND at non-128 blocks -greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`: - -| block | dense q36-27b-nvfp4 md5 | MoE q36-35b-a3b-nvfp4 md5 | -|------:|-------------------------|---------------------------| -| 128 (default) | 5951a5b4d624ce891e22ab5fca9bc439 == ref | 07db32c2bcb78d17a43ed18bc22705cd == ref | -| 64 | 5951a5b4...439 == ref | 07db32c2...5cd == ref | -| 256 | 5951a5b4...439 == ref | 07db32c2...5cd == ref | - -test-backend-ops (CUDA0): **MUL_MAT 1146/1146 PASS**, **MUL_MAT_ID 806/806 PASS**. - -## MEASUREMENT 1 - end-to-end MoE decode_agg (S_TG t/s), the actual throughput -`llama-batched-bench -m q36-35b-a3b-nvfp4.gguf -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`, -1 rep/block (run-to-run noise ~0.3-0.5%): - -| block | npl=32 S_TG | npl=128 S_TG | -|------:|------------:|-------------:| -| 32 | 437.54 | 750.41 | -| 64 | 437.82 | 751.68 | -| 96 | 437.69 | 749.46 | -| **128 (base/default)** | **438.14** | **751.76** | -| 160 | 436.38 | 750.99 | -| 192 | 436.81 | 751.61 | -| 256 | 437.77 | 750.14 | - -Spread: npl32 = 1.76 t/s (0.4%), npl128 = 2.3 t/s (0.3%) - all within noise. **No block size lifts -end-to-end decode.** Expected: the act-quant is ~2% of the MoE step, so even a perfect (0 ns) quantize -kernel caps the end-to-end win at ~2%, and 128 is already optimal => measured 0%. 
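
The "even a perfect quantize kernel caps the win at ~2%" argument is Amdahl's law applied to one kernel. The sketch below restates it and recomputes the spread from the MEASUREMENT 1 table; the function names are illustrative, and the 0.02 fraction is the approximate share quoted in the notes, not a new measurement.

```python
# Amdahl-style bound for speeding up one kernel, plus the run-to-run spread
# of the S_TG values copied from the MEASUREMENT 1 table above.
NPL32 = [437.54, 437.82, 437.69, 438.14, 436.38, 436.81, 437.77]
NPL128 = [750.41, 751.68, 749.46, 751.76, 750.99, 751.61, 750.14]


def end_to_end_speedup(fraction: float, kernel_speedup: float) -> float:
    """Whole-step speedup when a kernel taking `fraction` of the step gets
    `kernel_speedup` times faster (float('inf') models a zero-cost kernel)."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def spread_percent(samples) -> float:
    """(max - min) as a percentage of the max sample."""
    lo, hi = min(samples), max(samples)
    return (hi - lo) / hi * 100.0


if __name__ == "__main__":
    cap = end_to_end_speedup(0.02, float("inf"))
    print(f"upper bound from a free act-quant kernel: {(cap - 1.0) * 100.0:.2f}%")
    print(f"npl32 spread:  {spread_percent(NPL32):.2f}%")
    print(f"npl128 spread: {spread_percent(NPL128):.2f}%")
```

Because the bound is about 2% and the observed spread across block sizes is a few tenths of a percent, any real gain from retuning the block size would be hard to distinguish from noise with a single repetition per block.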
- -## MEASUREMENT 2 - nsys kernel-level delta on quantize_mmq_nvfp4 (the meaningful B-2 metric) -`nsys --report cuda_gpu_kern_sum`, MoE, `GGML_CUDA_DISABLE_GRAPHS=1 -npp 4 -ntg 32 -npl 128`, -8,193 kernel invocations (the kernel is 2.0-2.2% of GPU time in this decode-heavy window): - -| block | total ns | avg ns | median ns | vs 128 (total) | -|------:|---------:|-------:|----------:|---------------:| -| 64 | 127,523,328 | 15,564.9 | 12,256 | +8.7% slower | -| **128 (default)** | **117,371,424** | **14,325.8** | **11,488** | baseline (fastest) | -| 192 | 128,970,464 | 15,741.5 | 12,032 | +9.9% slower | -| 256 | 125,422,048 | 15,308.4 | 11,936 | +6.9% slower | - -**128 is a clean local minimum** (faster than the 64 below and the 192/256 above; 96 and 160 are its -immediate neighbors, end-to-end-neutral, nsys-stats flaked on the re-runs but cannot beat a bracketed -local min). The 7-10% kernel-level regression of the alternatives at 0% end-to-end change is exactly -why end-to-end is flat: this BW-bound, 256-tiny-expert model has no col-tile/occupancy headroom in -the act-quant - the same conclusion patch 0015 reached for the M-tile and patch 0017 for MINBLOCKS. - -## WHERE MoE STANDS (decode_agg, this base = 0025 with the re-graph) -vLLM ref @npl128 = 882.2, @npl32 = 500.8. -- npl128: 751.8 / 882.2 = **85.2% of vLLM** -- npl32: 438.1 / 500.8 = **87.5% of vLLM** - -B-2 adds 0 (within noise). MoE is **still well below vLLM** => **TRY B-3** (the mmq_y-down warp-remap -on the grouped `mul_mat_q` GEMM, ~27% of the MoE step - the only untested MoE GEMM -lever; SPEEDUP_HUNT B rank #3, real kernel change, bit-exact, predicted bounded on this BW-bound -model). The structural MoE residual is the FP4 grouped GEMM at the LPDDR5x BW floor + the bf16 -projections (~10.5%); the act-quant tax (~2%) is NOT where the gap lives and is already optimally -tiled. Recurrence (~48%) is already past vLLM (0018-0022). 
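
As a cross-check on the MEASUREMENT 2 table and the "WHERE MoE STANDS" percentages, the relative deltas can be recomputed from the raw totals. This is a small Python sketch using only numbers quoted above; the helper names are illustrative, and the recomputed kernel deltas should agree with the table only to within rounding.

```python
# Raw nsys totals (ns) for quantize_mmq_nvfp4, keyed by block size, from the table above.
NSYS_TOTAL_NS = {
    64: 127_523_328,
    128: 117_371_424,
    192: 128_970_464,
    256: 125_422_048,
}


def slowdown_vs_baseline(totals: dict, baseline: int = 128) -> dict:
    """Percent change in total kernel time relative to the baseline block size."""
    base = totals[baseline]
    return {block: (ns / base - 1.0) * 100.0 for block, ns in totals.items()}


def percent_of_reference(measured: float, reference: float) -> float:
    """Measured throughput as a percentage of a reference throughput."""
    return measured / reference * 100.0


if __name__ == "__main__":
    for block, delta in sorted(slowdown_vs_baseline(NSYS_TOTAL_NS).items()):
        print(f"block {block:>3}: {delta:+.2f}% vs 128")
    print(f"npl128: {percent_of_reference(751.8, 882.2):.1f}% of vLLM")
    print(f"npl32:  {percent_of_reference(438.1, 500.8):.1f}% of vLLM")
```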
- -## DECISION -No patch 0027 (B-2 does not lift; dev tree reverted to pristine 0025). The `LLAMA_MOE_QUANT_BLOCK` -hook + this measurement confirm 128 is the GB10 optimum, should other hardware ever want re-tuning. -Hand off to B-3 (patch 0028) as the next MoE GEMM lever. - -Assisted-by: Claude:opus-4.8 [Claude Code] - ---- - -# B-3 (mmq_y-down warp-remap of the NVFP4 grouped GEMM) RESULT: BIT-EXACT but FLAT (no patch 0028) - -Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`, -branch `b-work`), independent of the held hybrid 0026. Lever: SPEEDUP_HUNT.md section B rank #3 - the -0017-deferred structural `mmq_y`-down warp-remap on the grouped FP4-MMA `mul_mat_q` (the ~26-27% -MoE-specific GEMM), the only untested MoE GEMM occupancy lever. - -## VERDICT -**Bit-exact (md5 PASS both models + test-backend-ops PASS), but end-to-end FLAT: npl128 +0.3-0.4% -(consistent direction, kernel-backed) and npl32 +0.1-0.3%, both inside the ~0.4% run-to-run band. The -warp-remap makes the grouped GEMM kernel ITSELF ~1.3% faster (occupancy DID rise) but the step is -BW/SSM-bound, so it does NOT lift MoE decode. No patch 0028.** MoE stays ~85% of vLLM @npl128. - -## The change that was built+measured (bit-exact, then REVERTED) -`ggml/src/ggml-cuda/mmq.cuh`. The FP4-MMA path couples the weight-row tile to the warp count via the -invariant `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; `tile_C::I==16` for the m16n8k64 -block-scaled FP4 MMA). `nwarps` is global `256/warp_size = 8`, pinning `mmq_y=128`; the 0017 -`GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP that assert at 64. B-3 makes nwarps TYPE-AWARE: a new -`mmq_get_nwarps_device()` (+ 3-arg host overload) returns `mmq_y/16 = 4` for NVFP4-reduced (else -the stock 8), so `mmq_y=64 -> nwarps=4 -> 128 threads/CTA` (vs 256) -> ~2x resident CTAs. 2 overloads + -9 `` call-site swaps (kernel, process_tile, write_back_mma, stream_k_fixup, nvfp4 loader, 2 host). 
-Built with `-DGGML_CUDA_FP4_MMQ_Y=64`; the compile SUCCEEDS (the static_assert now holds at 4*16=64). -**Default `GGML_CUDA_FP4_MMQ_Y==128` returns stock nwarps for every type => a default build is -byte-identical to stock** (the bit-exact opt-out, proven by the md5 below at 128). - -### Bit-exactness is EMPIRICAL here (not by-construction) -The per-output K-reduction order is mmq_y-invariant (each output row owned by one thread), but mmq_y=64 -DOUBLES `nty` (row-tiles), changing the stream-k `kbc` partition => an output tile's K-range can be -split across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different -grouping => FP non-associativity COULD perturb the last logit bits and flip a greedy argmax. It did NOT -for the gate prompt (md5 matched), but B-3 is therefore NOT bit-exact-by-construction - a default-ON -ship would be a (small) precision risk. This is a second reason not to ship it for a 0% gain. - -## GATE (bit-exact) - BOTH MODELS PASS -greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`: -- dense q36-27b-nvfp4 = 5951a5b4d624ce891e22ab5fca9bc439 == ref -- MoE q36-35b-a3b-nvfp4 = 07db32c2bcb78d17a43ed18bc22705cd == ref -- test-backend-ops CUDA0: **MUL_MAT 1146/1146 PASS, MUL_MAT_ID 806/806 PASS.** - -## MEASUREMENT 1 - end-to-end MoE decode_agg, clean BACK-TO-BACK A/B (build-cuda-base @128 vs build-cuda @64) -`llama-batched-bench -m q36-35b-a3b-nvfp4 -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`, S_TG t/s, -3 reps alternating (no concurrent load): - -| npl | mmq_y=128 (base) mean | mmq_y=64 (B-3) mean | delta | -|----:|----------------------:|--------------------:|------:| -| 32 | 437.6 (437.3-437.7) | 438.8 (438.4-439.1) | +0.29% | -| 128 | 750.1 (748.9-751.1) | 753.1 (753.0-753.4) | +0.40% | - -Every B-3 rep edges the base by +0.3-0.4% @npl128 (consistent, kernel-backed), but the per-build spread -(base 748.9-751.1) OVERLAPS - it is at the edge of noise, NOT a meaningful 
lift. Caps the end-to-end win -at well under 1%, nowhere near the gap to vLLM (882). - -## MEASUREMENT 2 - nsys kernel-level A/B (the meaningful B-3 evidence), clean, no concurrent load -`GGML_CUDA_DISABLE_GRAPHS=1 nsys ... -npp 4 -ntg 32 -npl 128`, decode-isolated window, `cuda_gpu_kern_sum`: - -| kernel (% of window) | mmq_y=128 total ns | mmq_y=64 total ns | delta | -|---------------------------------|-------------------:|------------------:|-------:| -| gated_delta_net (SSM, ~40%) | 2,335,951,709 | 2,334,847,390 | 0.0% (untouched, DOMINANT) | -| **mul_mat_q** (MoE GEMM, ~26%) | **1,502,548,958** | **1,483,685,630** | **-1.26% (kernel faster)** | -| mul_mat_q (router, ~3.7%) | 224,532,704 | 210,885,920 | -6.1% | -| quantize_mmq_nvfp4 (act-quant, ~2%) | 119,118,624 | 118,718,496 | -0.3% | -| **mul_mat_q_stream_k_fixup<128>** (~0.6%) | **26,848,479** | **38,117,532** | **+42% (fixup COSTLIER)** | - -The warp-remap DOES what it claims at the kernel level: the grouped GEMM is **-1.3%** (more resident -CTAs hide a sliver of weight-load latency). But (a) it is only ~26% of the step, (b) halving mmq_y -DOUBLES the row-tiles so the stream-k fixup recombination grows **+42%** (+11.3M ns), eating ~60% of the -GEMM's 18.9M-ns saving, and (c) the step is dominated by the gated_delta_net SSM (~40%, untouched, and -already PAST vLLM's BW efficiency per 0018-0022) with the GEMM itself at the LPDDR5x BW floor. Net -mul_mat region saving ~7.6M ns on a ~5.8B-ns window = ~0.13%; end-to-end +0.3-0.4% (within noise). -**This is the definitive BW-bound proof: even a real occupancy win on the target kernel does not move -end-to-end** - the same outcome as patch 0015 (M-tile NEUTRAL), 0017 (MINBLOCKS +8.7% slower), and B-2 -(act-quant FLAT). The MoE grouped GEMM is bandwidth-limited, not occupancy-limited, at the kernel exit. 
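
The kernel-level accounting in the paragraph above (GEMM saving, fixup penalty, net effect on the window) can be reproduced directly from the table. The sketch below is plain Python arithmetic on the quoted totals; `WINDOW_NS` is the approximate "~5.8B-ns" figure from the notes, so the final percentage is only as precise as that estimate.

```python
# Totals (ns) copied from the nsys A/B table above.
GEMM_128, GEMM_64 = 1_502_548_958, 1_483_685_630    # grouped MoE GEMM
FIXUP_128, FIXUP_64 = 26_848_479, 38_117_532        # mul_mat_q_stream_k_fixup
WINDOW_NS = 5.8e9                                   # approximate window length from the notes

gemm_saving = GEMM_128 - GEMM_64       # ns saved by the faster GEMM at mmq_y=64
fixup_cost = FIXUP_64 - FIXUP_128      # extra ns spent in the fixup at mmq_y=64
net_saving = gemm_saving - fixup_cost  # what is left for the mul_mat region

if __name__ == "__main__":
    print(f"GEMM saving : {gemm_saving:,} ns ({gemm_saving / GEMM_128 * 100:.2f}%)")
    print(f"fixup cost  : {fixup_cost:,} ns (+{fixup_cost / FIXUP_128 * 100:.0f}%)")
    print(f"fixup eats  : {fixup_cost / gemm_saving * 100:.0f}% of the GEMM saving")
    print(f"net saving  : {net_saving:,} ns "
          f"({net_saving / WINDOW_NS * 100:.2f}% of the window)")
```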
- -## DECISION -No patch 0028 (B-3 does not lift end-to-end; bit-exactness is empirical, not by-construction; the fixup -penalty + BW floor swamp the +1.3% kernel win). Dev tree reverted to pristine 0025 (no ggml diff), -build-cuda reconfigured to default (no flag) and rebuilt. The `mmq_get_nwarps_device()` remap is a -correct, reusable warp-remap should occupancy-bound FP4 hardware ever appear; it is inert on GB10. - ---- - -# FINAL ASSESSMENT - the honest bit-exact MoE ceiling, and the recommendation - -## The bit-exact MoE GEMM/launch track is now EXHAUSTED -| MoE lever (bit-exact) | result | MoE decode_agg @npl128 | -|-----------------------|--------|------------------------| -| 0025 re-graph (B-1, LANDED) | the ONLY bit-exact MoE win | ~82% -> **~85%** of vLLM | -| B-2 act-quant retune (no patch) | FLAT (128 already optimal) | +0% | -| B-3 mmq_y-down warp-remap (no patch) | FLAT (kernel -1.3%, e2e +0.3% noise) | +0% | - -**Honest bit-exact MoE ceiling on GB10 = ~85% of vLLM @npl128 (753 / 882.2), ~87.5% @npl32 (439 / 500.8).** -B-1 (re-graph, in 0025) banked the move from ~82% to ~85%; B-2 and B-3 each add 0. The grouped-GEMM/ -launch track has no remaining bit-exact headroom. - -## Is the residual the structural Marlin-MoE gap? YES. -The remaining ~15% is structural and uncloseable bit-exactly, decomposed from the nsys: -- **Grouped FP4 GEMM (~26%) is at the LPDDR5x BW floor.** B-3 proved an occupancy win there is - end-to-end-inert. vLLM ships a purpose-built **Marlin-NvFp4** grouped GEMM (a different, more - bandwidth-efficient schedule); llama runs native FP4-MMA W4A4 (a HIGHER arithmetic tier, but the - decode shape is BW-bound so the tier does not help). This is THE structural gap and matches - FP4_GEMM_SCOPE_B.md's "MoE ceiling ~76% from the GEMM track alone." -- **The SSM recurrence (~40%) is already PAST vLLM** (84.6% vs 82.4% peak BW, 0018-0022) - not a lever. -- **bf16 projections (~10.5%)** - both engines pay similar; not a bit-exact lever. 
- -No bit-exact lever closes the structural grouped-GEMM gap. ~85% is the honest bit-exact MoE plateau. - -## RECOMMENDATION: ship the bit-exact ~85% as DEFAULT; expose 0026 bf16-SSM as a documented opt-in for the last ~10% on MoE (NOT default, NOT in the recommended config) - -Per the user's decision rule ("pursue B first; if it cannot reach/beat vLLM on MoE, fall back to the -held hybrid/bf16 opt-in"): **B (bit-exact) cannot reach vLLM on MoE (~85%), so the fallback applies - -but with a hard caveat the team must carry.** - -1. **DEFAULT = the bit-exact plateau (0025 with the re-graph), MoE ~85% of vLLM.** This is the honest, - precision-safe ship: the recurrence already BEATS vLLM's BW efficiency, the GEMM is the same FP4 - arithmetic class, and the output is byte-identical to the f32 reference. Do not claim MoE *parity* - bit-exactly - claim ~85% with a precision profile at-or-above vLLM. - -2. **FALLBACK (opt-in only) = 0026 hybrid bf16-SSM.** It is the ONLY remaining MoE lever (it speeds the - ~40% recurrence, the part B does not touch): measured **+11.5% MoE decode** (1110.7 -> 1238.1 t/s in - the 0026 harness) -> would lift MoE ~85% -> **~95% of vLLM**. BUT: (a) it is **non-bit-exact**; (b) it - **FAILS the MoE KL ship-gate by a wide margin** (MeanKLD ~0.045 / Same-top-p ~91% vs the 1e-3 / 99.5% - bar - the gated-DeltaNet state is hypersensitive to bf16; A_HYBRID_SSM_RESULTS.md: "MoE has NO low-KL - regime ... Do NOT put a hybrid T in the gallery/recommended config"); and (c) even then it reaches - **~95%, not a clean beat** of vLLM, while conceding precision vLLM keeps (all-f32 SSM state). - - => Ship 0026 default-OFF (`ssm_hybrid_tau_thresh = 0` / no `--ssm-bf16-tau`); expose the bf16-SSM as - an EXPLICIT opt-in flag for callers who knowingly accept a real MoE precision regression for ~+11.5% - decode (~95% of vLLM). Keep it OUT of the gallery/recommended MoE config. 
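
To make the KL ship-gate concrete, here is a minimal Python sketch of a gate with the two bars quoted above (mean KLD at most 1e-3, agreement at least 99.5%). It is an illustration under stated assumptions, not the harness used for these results: it treats "Same-top-p" as top-1 token agreement, takes per-position logits as plain lists, and every function name is hypothetical.

```python
import math

MAX_MEAN_KLD = 1e-3   # ship-gate bar quoted in the notes
MIN_SAME_TOP = 0.995  # 99.5% agreement bar quoted in the notes


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def measure(ref_logits, test_logits):
    """Return (mean KLD, fraction of positions with the same argmax)."""
    klds, same = [], 0
    for ref, test in zip(ref_logits, test_logits):
        klds.append(kl_divergence(softmax(ref), softmax(test)))
        same += int(ref.index(max(ref)) == test.index(max(test)))
    return sum(klds) / len(klds), same / len(klds)


def passes_gate(mean_kld, same_top):
    """Both bars must hold for the gate to pass."""
    return mean_kld <= MAX_MEAN_KLD and same_top >= MIN_SAME_TOP


if __name__ == "__main__":
    # The MoE figures quoted above (~0.045 mean KLD, ~91% agreement) fail both bars.
    print(passes_gate(0.045, 0.91))
```

Requiring both conditions is deliberate in this sketch: a low mean KLD can hide a small number of argmax flips, and those are what fork a greedy sequence.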
- -**Bottom line for the parent:** bit-exact MoE on GB10 plateaus at **~85% of vLLM** and the residual is -the structural Marlin-NvFp4 grouped-GEMM gap that NO bit-exact lever closes (B-1 banked the re-graph; -B-2 and B-3 are 0). Bit-exact does NOT reach/beat vLLM on MoE. The only lever that closes more (to ~95%) -is the held 0026 bf16-SSM, which is **non-bit-exact AND fails the MoE KL gate** - so it ships **opt-in, -default-off, not in the recommended config**, not as the default. Recommend shipping the honest ~85% -bit-exact default and documenting the opt-in for users who accept the precision tradeoff. Do not market -MoE parity; the bit-exact default is ~85% with a precision profile at-or-above vLLM, which is the -defensible claim. - -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md b/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md deleted file mode 100644 index d20f0c5ac..000000000 --- a/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md +++ /dev/null @@ -1,499 +0,0 @@ -# Durable scope: token-granular continuous-batch scheduler for llama-server on GB10 - -Build-ready plan. **Not implemented in this workflow** (serving-loop rewrite). This -document scopes the durable path to give llama-server's `update_slots()` a vLLM-v1-style -token-granular continuous-batch scheduler, and records the single honest finding that -re-shapes what the change can and cannot buy. - -Hardware: NVIDIA GB10 / DGX Spark (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified -LPDDR5x ~273 GB/s. Models: dense Qwen3.6-27B NVFP4 (`~/bench/q36-27b-nvfp4.gguf`), -MoE Qwen3.6-35B-A3B NVFP4 (`~/bench/q36-35b-a3b-nvfp4.gguf`). Dev tree `~/llama-paged-dev` -(branch `paged`, HEAD `151343b`, patch 0015), `build-cuda` sm_121, `LLAMA_KV_PAGED=1`. 
-Scheduler code: `tools/server/server-context.cpp::update_slots()` (LocalAI override that -`#include`s it: `backend/cpp/llama-cpp/grpc-server.cpp`). - -## TL;DR (the honest reframe) - -Three findings, read directly from the source at HEAD `151343b` and from the committed -NVFP4 re-run (`QWEN36_NVFP4_BENCH.md`), collapse the apparent size of this work and reset -what it is allowed to claim: - -1. **The unified mixed batch already exists.** `update_slots()` already builds ONE - `llama_batch` per step = {every ready decode token} **+** {a bounded chunk of prefill - tokens}, in a fixed two-phase order: Phase 1 (lines 2604-2719) appends every - `SLOT_STATE_GENERATING` slot's sampled token **unconditionally** (no budget gate), then - Phase 2 (lines 2753-3330) fills the remaining batch capacity with prompt tokens. Decode - is therefore **already claimed first and never dropped or capped** - the exact property - vLLM's "RUNNING-before-WAITING" pass works to guarantee is **free** here by construction. - -2. **The chunked-prefill slot state already exists and already persists across steps.** A - slot in `SLOT_STATE_PROCESSING_PROMPT` with `slot.prompt.n_tokens() < slot.task->n_tokens()` - is a partial prefill; it stays in that state and resumes next step until its prompt is - fully ingested, at which point it flips to `SLOT_STATE_DONE_PROMPT` -> `GENERATING` - (line 3252, then 3502). Multiple slots can be `PROCESSING_PROMPT` and `GENERATING` - simultaneously; there is **no global "one prefill at a time" gate**. So the mission's - "allow a slot to be mid-prefill while others decode in the same step" is **not a state - machine to build - it is already the behaviour.** This is the single biggest de-risking - fact in this document. - -3. **What is genuinely missing is the budget POLICY, and it is small.** Patch 0013 - (`LLAMA_PREFILL_BUDGET`) is a single **static** per-step prefill cap, consumed greedily by - slots in iteration order. 
It is not decode-load-aware (does not subtract the live decode - count `D`), not adaptive (one constant across npl 8..128), and not fair (the first - `PROCESSING_PROMPT` slot can eat the whole budget). The durable delta is to convert that - static cap into vLLM's **dynamic, decode-first, per-slot-fair token budget**: one total - per-step budget `T`, decode claims its `D` tokens first, prefill gets the **leftover** - `T - D` distributed across waiting prompts with a per-slot cap. That is ~the only - behavioural change. **No new slot states, no batch-formation rewrite.** - -### The honest ceiling (this is load-bearing for how the work is scoped and sold) - -The committed re-run and a dedicated profiling pass (`QWEN36_NVFP4_BENCH.md`, plus -`~/bench/stag_128.json`) establish that **the residual ~2.4x high-concurrency decode gap is a -decode-KERNEL batch-scaling ceiling, not a scheduler defect**: - -- At npl8 the kernels are **at parity** (dense 99%, MoE 84% of vLLM decode). -- A clean staggered full-batch-128 run, with **all 128 slots cleanly decoding and zero - prefill starvation**, still tops out at **decode_agg 157.4 tok/s** (dense) - the same - ~157-161 ceiling that four independent measurements converge on. vLLM does **390.7** at the - same effective batch. With a *perfect* scheduler the kernel still gives ~157. **The - scheduler cannot lift this.** -- Patch 0013 budget-256 **already reaches ~161** (the ceiling) at npl128. So a token-granular - scheduler buys **little additional steady-state decode_agg** over 0013 on the all-at-once - workload. - -Therefore this scheduler's deliverable is **NOT "match vLLM's 391/811 decode."** It is: - -- **Close the 12x TTFT gap** (dense 305 s @ 0013 / 491 s stock -> vLLM's ~25 s, and ~2 s on - staggered arrival) - the genuine, large win. 
-- **Robustly HOLD the decode ceiling** (~161 dense / ~333 MoE @npl128) **without - per-workload budget tuning** - 0013 needs a hand-picked constant (256 for dense, costs MoE - TTFT, net-negative at low npl); the dynamic `T - D` budget is self-tuning across the whole - npl range and across dense vs MoE. -- **Burst-robustness**: bounded TTFT for *all* concurrently-arriving prompts (kill the - burst-TTFT spread), and no admission collapse under sustained load. - -Closing the residual 2.4x decode-throughput gap is a **separate, named lever**: the -paged-attention **decode-kernel** batch-scaling work (patches 0009-0011 territory) and/or -CUDA-graphed decode. It is called out explicitly in P3 and is **out of this scope's -scheduler mandate**. We must measure and sell this work on **TTFT + burst-robustness + -self-tuning hold of the ceiling**, never on a decode_agg number the kernel forbids. - -## The gap, precisely localized (recap of the committed bench) - -At matched NVFP4 on one GB10 box (`QWEN36_NVFP4_BENCH.md`), llama (patch 0015) vs vLLM 0.23.0, -decode_agg tok/s | TTFT mean, npl swept 8/32/64/128: - -| npl | dense llama (0013 b256) | dense vLLM | MoE llama (0013 b256) | MoE vLLM | -|----:|------------------------:|-----------:|----------------------:|---------:| -| 8 | 63.5 / 4.3 s | 64.3 / 2.6 s | 169.3 / 1.7 s | 202.0 / 0.8 s | -| 32 | 105.7 / 23.1 s | 189.8 / 7.5 s | 239.0 / 9.0 s | 462.0 / 2.3 s | -| 64 | 132.0 / 109 s | 284.2 / 13 s | 277.0 / 16.2 s | 624.5 / 4.1 s | -| 128 | **161.2 / 305 s** | 390.7 / 24.8 s | **333.5 / 98 s** | 811.1 / 8.0 s | - -Both models converge to the **same ~41% of vLLM decode at npl128** after 0013. 
That -convergence is the signal: once prefill starvation is removed, a dense model and a -12x-cheaper-prefill MoE land on the **identical** ceiling -> the residual is **not prefill** -and **not the kernel-at-parity-@npl8** - it is the **quality of the per-step batching -decision** (TTFT/robustness) plus the **kernel decode ceiling** (the throughput residual). -This scope addresses the first; it names the second as the separate lever. - -## What already exists (reuse, do NOT rebuild) - -All line numbers verified at `tools/server/server-context.cpp` HEAD `151343b`. - -- **[A] decode-first co-batch** - Phase 1, lines 2604-2719. Iterates `slots`; every - `SLOT_STATE_GENERATING` slot (gated only by `can_batch_with`, line 2611) is pushed to - `generating[]`; line 2715-2719 `for (slot : generating) slot.update_batch(batch)` appends - its sampled token (+ draft tokens) via `common_batch_add`. After this loop, - `batch.n_tokens == D` (the decode-token count). **No budget gate** - decode always goes in. -- **[B] chunked-prefill state per slot** - the pair `slot.prompt.n_tokens()` (= - `num_computed_tokens`) vs `slot.task->n_tokens()` (= `num_tokens`). A `PROCESSING_PROMPT` - slot with `prompt.n_tokens() < task->n_tokens()` resumes next step (Phase 2 re-enters it). - Transition to `DONE_PROMPT` at line 3252 when the prompt is exhausted; to `GENERATING` at - line 3502. **This is exactly vLLM's "leave the request in `running`, advance - `num_computed_tokens` next step" - already implemented.** -- **[C] single shared batch + compute chunking** - one `llama_batch` holds decode+prefill; - the compute loop (lines ~3366-3378) `for (i=0; i all decode claimed before -any prefill is sized); Pass 2 admits `waiting` (new prompts) only with leftover budget, each -chunked by `min(remaining_prompt, long_prefill_token_threshold, leftover_budget)`. 
Caps: -`max_num_seqs` (concurrent sequences), `long_prefill_token_threshold` (~4% of max_model_len, -per-request prompt-chunk cap so one giant prompt cannot monopolize a step). Net: decode batch -maximal every step (-> the GEMM-batching throughput vLLM gets), prefill always makes bounded -progress (-> low, flat TTFT), one `model.forward()` per step. - -The mapping to llama is clean because [A]+[B] already give us "running visited first" and -"prefiller resumes next step." We are missing only: **one total budget `T`, leftover `T - D` -sizing, and the per-request chunk cap with fair distribution.** - -## The unified per-step batch-formation algorithm (the design) - -New knobs (all default to current behaviour; env set before context init like `LLAMA_KV_PAGED`): - -- `T` = `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step token - budget (decode + prefill), the analogue of `max_num_batched_tokens`. Default `n_batch` - (2048). Clamped `T = min(T, n_batch)` so the existing single-`llama_decode` chunking is - unchanged. -- `PREFILL_CAP` = `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot max prompt tokens per - step, the `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04 * n_ctx))`, - floored at `n_ubatch` (512) so a single prompt still makes a full ubatch of progress. -- Back-compat: if only the legacy `LLAMA_PREFILL_BUDGET` is set (new knobs unset), behave - exactly as 0013 (static cap) - 0013 is the degenerate `T = n_batch`, no-leftover case. - -Pseudocode, mapping to real variables and seams (the `>>` lines are the change vs today): - -``` -common_batch_clear(batch); // line 2594 - -// PASS 1 - DECODE FIRST (unchanged: lines 2604-2719) -for (slot : slots) if (slot.state == GENERATING && can_batch_with) generating.push(slot); -... speculative draft ... 
-for (slot : generating) slot.update_batch(batch); // appends decode (+draft) tokens - ->> D = batch.n_tokens; // NEW seam: decode load is now final (after 2719) ->> T = min(LLAMA_MAX_BATCH_TOKENS ? : n_batch, n_batch); ->> prefill_budget_step = max(0, T - D); // DYNAMIC leftover, auto-shrinks with D ->> prefill_cap_per_slot = PREFILL_CAP; // long_prefill_token_threshold analogue ->> n_prompt_budgeted = 0; // total prompt tokens added this step (subsumes 0013) - -// PASS 2 - PREFILL FILLS THE LEFTOVER (lines 2753-3330, budget made dynamic + per-slot fair) -if (cont_batching || batch.n_tokens == 0) { ->> for (k = 0; k < n_slots; ++k) { // round-robin start offset (fairness, see P2) ->> slot = slots[(rr_start + k) % n_slots]; - if (!slot.is_processing() || !can_batch_with) continue; - if (slot.state == STARTED) slot.state = PROCESSING_PROMPT; // line 2782 (unchanged) ->> slot_prompt_added = 0; // NEW: per-slot chunk counter (reset each slot) - // inner prompt-fill (lines 3187-3239), guard now triple-bounded: - while (slot.prompt.n_tokens() < slot.task->n_tokens() ->> && batch.n_tokens < T // was: < n_batch ->> && n_prompt_budgeted < prefill_budget_step // was: 0013 static n_prefill_budget ->> && slot_prompt_added < prefill_cap_per_slot) {// NEW: per-slot cap -> fair distribution - common_batch_add(batch, cur_tok, pos_next, {slot.id}, need_embd); - slot.prompt.tokens.push_back(cur_tok); - slot.n_prompt_tokens_processed++; - n_prompt_budgeted++; slot_prompt_added++; - ... checkpoint-boundary breaks (unchanged) ... - } - if (slot.prompt.n_tokens() == slot.task->n_tokens()) slot.state = DONE_PROMPT; // line 3252 - ... checkpoint creation (unchanged) ... ->> if (batch.n_tokens >= T) break; // was: >= n_batch (line 3320) ->> if (n_prompt_budgeted >= prefill_budget_step) break; // was: 0013 break (line 3326) - } -} - -for (i=0; i -bounds step compute time -> decode steps fire at a steady high rate (high decode-steps/sec). 
-As decode load `D` rises, `prefill_budget_step = T - D` auto-shrinks, so prefill never inflates -the step beyond `T` even at npl128. This is the mechanism by which 0013's hand-tuned 256 -reaches 161; here it is reached **automatically across the npl range** because the budget is -`T - D`, not a constant. **Why this closes TTFT.** Prefill always gets a non-zero leftover -(`prefill_budget_step >= 0`, and `T` is sized so leftover > 0 until the box is fully decode- -saturated), distributed across waiting prompts by `prefill_cap_per_slot`, so every prompt makes -bounded progress every step instead of waiting for a dedicated prefill burst. - -## Slot state machine changes (minimal - this is the headline de-risk) - -**No new states. No state-transition rewrite.** The existing 6-state machine -(`IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT / GENERATING`, lines 67-72) -already encodes everything: - -- "mid-prefill while others decode" = a `PROCESSING_PROMPT` slot coexisting with `GENERATING` - slots in the same step. **Already happens** (Phase 1 and Phase 2 populate one batch). -- "chunked-prefill state per slot" = `(state == PROCESSING_PROMPT) && (prompt.n_tokens() < - task->n_tokens())`. **Already persisted** across `update_slots()` calls; Phase 2 re-enters - the slot and resumes from `prompt.n_tokens()`. - -The only **additions** are per-step scheduler scratch, not slot lifecycle state: - -1. `slot_prompt_added` - a per-slot, per-step counter (local to the Phase-2 loop body), for - the per-slot chunk cap. Not stored on the slot across steps. -2. A `rr_start` round-robin offset (one `size_t` on the server, advanced each step) so the - leftover budget is distributed fairly across `PROCESSING_PROMPT` slots rather than always - draining the lowest-index slot first (this is what kills the burst-TTFT *spread* - without - it, slot 0's prompt finishes first every time and the last slots starve). -3. 
Optional, P2: a per-step admission cap `K` on how many `STARTED -> PROCESSING_PROMPT` - transitions begin in one step. This falls out of the budget arithmetic already (a bounded - `prefill_budget_step` with a per-slot floor admits only `~budget/floor` prompts/step), so it - may need no explicit code; if made explicit it is the `max_num_seqs`-style "don't admit a - new prefill if the step is full" guard, mapped onto the pre-allocated `n_parallel` slots. - -That is the entire state-machine footprint: two pieces of per-step scratch and an optional cap. -The mission's feared "slot-state rewrite" does not materialize. - -## How it supersedes / subsumes patch 0013 - -| property | 0013 (static cap) | this scheduler (dynamic `T - D`) | -|----------|-------------------|----------------------------------| -| per-step prefill bound | constant `n_prefill_budget` | `T - D`, shrinks as decode load rises | -| decode-load aware | no (ignores `D`) | yes (leftover after decode) | -| works across npl 8..128 with one config | no (256 best @128, net-negative @8) | yes (self-tuning) | -| fair across multiple waiting prompts | no (greedy, slot 0 wins) | yes (`prefill_cap_per_slot` + round-robin) | -| TTFT on bursty arrival | raises it (defers first tokens) | bounded for all prompts | -| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - **kept** | - -0013 is the **degenerate case** `T = n_batch` with `prefill_budget_step` pinned to a constant -and no per-slot cap. The patch keeps `LLAMA_PREFILL_BUDGET` working for back-compat (when the -new knobs are unset). When `LLAMA_MAX_BATCH_TOKENS` is set, the static path is replaced by the -dynamic one. **Default (all knobs unset) = byte-identical stock**, exactly like 0013. - -## Correctness - -- **KV cache during chunked prefill** - unchanged from today. 
A `PROCESSING_PROMPT` slot already - advances `slot.prompt.tokens` / `pos_next()` chunk by chunk across steps; we only change the - chunk SIZE per step, not how positions or sequence ids are assigned. `common_batch_add` - receives the same `(tok, pos, {slot.id})` tuples in the same order. No new KV state. -- **Determinism** - greedy (temp 0) output can differ from a single-`n_batch`-chunk run only by - the **intrinsic flash-attn chunk-size FP grouping** that 0013 already documented and bounded: - pure stock `-b256` diverges from `-b2048` the same way with this patch inactive; output stays - coherent and answers correctly. The op-level math per token is position-determined and - unchanged; only the FA reduction grouping over a step's token mix shifts. The deterministic - oracle is the CPU backend / the op test (bit-exact); the GB10 CUDA greedy-decode band applies - to end-to-end only, never to the op test. -- **Paged KV (patches 0001-0011)** - **orthogonal**. Paged on-demand block allocation is keyed - by sequence position and slot/stream, which this change does not touch; it changes only which - tokens are in a given `llama_decode`. The in-kernel paged decode read (0009-0011) operates - per-token via the block tables regardless of what prefill tokens are co-batched. Required gate: - run the full P0-P2 suite with `LLAMA_KV_PAGED=1` **and** `=0` and confirm **identical - scheduling decisions** (same per-step token counts, same admission order) - paged must be a - no-op on the scheduler. -- **`can_batch_with` constraint** (line 302) - a batch admits only slots with the same - `task->type` and equal LoRA. Homogeneous-completion serving (the benchmark and the dominant - LocalAI case) satisfies it, so the mixed decode+prefill batch forms freely. Mixed task types / - per-request LoRA fall back to separate batches - a pre-existing bound, not a regression; note - it, do not try to lift it here. 
-- **Checkpoint interaction (a real, orthogonal serving defect to account for)** - each slot that - reaches `DONE_PROMPT` may call `create_checkpoint` (line 2147), ~149 MiB per checkpoint on the - dense 27B, gated by `n_ctx_checkpoints > 0` (line 3133). Profiling found that under sustained - heavy load the checkpoint subsystem **thrashes**: admission collapsed to one slot every ~13 s, - zero decoding for 290 s, while `/slots` itself serialized behind a 13 s `update_slots` step. - This is **independent** of the decode/prefill mix but it **masks** the scheduler's win if left - on. **P0 must isolate it** (run with `n_ctx_checkpoints=0`), and **P2's admission decision - should be checkpoint-cost-aware** on the 128 GB unified box (do not admit a fresh prefill whose - checkpoint would thrash the pool). Treat as a named co-defect, not part of the core batching - change. - -## Phased plan P0 -> P3 (work, files, payoff, risk) - -| Phase | Work | Files | Expected payoff (dense / MoE @npl128 unless noted) | Risk | -|-------|------|-------|-----------------------------------------------------|------| -| **P0** baseline + metrics harness | Per-step effective-decode-batch poller (`/slots`), TTFT percentiles (p50/p90/p99/max), `decode_agg` over the fully-overlapped window, decode-ITL (worst freeze / median), **step-time histogram**, admission rate (slots/s reaching GENERATING), checkpoint-event log. Lock the staggered-arrival ceiling (**157.4** dense, all-128 clean) and the all-at-once burst pathology as the two reference traces. Isolate checkpoints (`n_ctx_checkpoints=0`). | dev-tree only: `~/bench/` (reuse `stag.py`, `slot_poll.py`, `h2h_cli.py`, `h2h_moe_sweep.sh`; `stag_128.json`, `h2h_real128b.json`) | **None** (gate). Locks correctness + the 157/333 ceiling so any regression is caught.
| Low | -| **P1** unified mixed-batch formation | Replace the static budget read (2737-2747) with the **dynamic `T - D`** computed at the new seam after line 2719; bound the inner/outer Phase-2 loops by `T` (3188, 3320) and `prefill_budget_step` (3326) instead of `n_batch` and the static cap. No per-slot cap, no round-robin yet (that is P2). | `tools/server/server-context.cpp` (seam @2719, knob read, 3188, 3320, 3326); mirror to `0016-paged-continuous-batch-scheduler.patch` | **TTFT**: removes the burst penalty 0013 inflicts - staggered TTFT ~2 s, burst TTFT collapses toward vLLM's ~25 s / 8 s. **Decode**: holds the ceiling **(~161 / ~333)** *without per-workload tuning* (0013 needed 256 hand-picked). No new throughput beyond the ceiling - by design. | Low-Med (loop-bound edits in a hot path; default-off gate makes it byte-identical stock) | -| **P2** scheduling policy / fairness | Add `slot_prompt_added` + `prefill_cap_per_slot` (the `long_prefill_token_threshold` analogue) and the **round-robin start offset**; optional explicit per-step admission cap `K` + checkpoint-cost-aware admission. Tune `T`, `PREFILL_CAP` on GB10 (dense vs MoE, npl 8/32/64/128). | `server-context.cpp` (Phase-2 loop body @2753-3330, server-level `rr_start`); `grpc-server.cpp` (options `max_batch_tokens`/`mbt`, `prefill_cap` @781-791) | **TTFT spread**: bounds first-token latency for **all** concurrently-arriving prompts (kills the burst-TTFT spread, e.g. dense max 305 s -> single-digit-s on staggered, bounded on burst). **Robustness**: no admission collapse under sustained load; decode batch stays maximal so the *time-averaged* decode_agg on real (non-burst) traffic rises toward the staggered 157/333 because slots reach GENERATING fast. | Med (fairness + admission logic; e2e coherence + A/B vs 0013 required) | -| **P3** residual decode throughput | **Honest boundary: this is the decode-KERNEL lever, NOT the scheduler.** The scheduler has delivered TTFT + robustness + ceiling-hold. 
Closing the residual 2.4x (161 -> 391 dense, 333 -> 811 MoE) requires paged-attention **decode-kernel** batch-scaling (patches 0009-0011 territory) and/or **CUDA-graphed decode** (the now-uniform decode-only step is graph-capturable). Scope/track separately. | (separate scope) `ggml/src/ggml-cuda/` decode-read kernels; optional CUDA-graph capture seam in `update_slots` | This is **where 391/811 would come from**; it is **out of this scope's mandate** and must not be charged against the scheduler. The scheduler makes the decode step uniform (a precondition that *helps* a future graph capture). | High (kernel work; the GB10 occupancy wall, see below) | - -**Per-phase payoff vs the mission targets (TTFT 25 s / 8 s, decode 391 / 811 @npl128):** - -- **TTFT 25 s / 8 s** - reached by **P1 + P2** (the 12x gap is the scheduler's to close; on - staggered arrival it goes below the vLLM burst figure to ~2 s). -- **Decode 391 / 811** - **NOT a P1/P2 deliverable.** P1/P2 hold **161 / 333** (= ~41% of vLLM, - the kernel ceiling) robustly and tuning-free. The remaining ~2.4x is **P3 kernel**, a separate - lever. Pre-registering this split is the point: the scheduler is judged on TTFT + holding the - ceiling, the kernel on the throughput residual. - -## GB10 considerations - -- **Bandwidth floor ~273 GB/s** is the *cause* of the decode ceiling (NVFP4 weight-read + - paged-KV gather per step). The scheduler cannot lift a bandwidth/kernel floor - it can only - keep the batch *at* the ceiling. Size `T ~= n_batch` (2048) so the compute step stays a single - `llama_decode`; `n_ubatch` (512) governs the internal split. -- **`T` is the ITL/TTFT trade knob** (vLLM's `max_num_batched_tokens`): larger `T` = more - prefill/step = faster TTFT but bigger per-step ITL spike; smaller `T` = smoother ITL, slower - TTFT. Because the budget is `T - D`, the spike is bounded at `T` regardless of decode load. - Default `T = n_batch`; expect to tune down toward ~1024 for ITL-sensitive serving. 
-- **Checkpoint ~149 MiB/slot thrash** on the 128 GB unified box - admission must be - checkpoint-cost-aware (P2); P0 measures with checkpoints off to isolate the batching win. -- **Memory**: paged on-demand KV (dense 52->94 GB, MoE 39->61 GB across npl) vs vLLM's flat - ~112 GB pre-reservation - llama's standing multi-tenant advantage, unaffected by this change. -- **Eager mode** both engines today; **CUDA-graphed decode** is the P3 kernel lever, and the - scheduler's uniform decode-only step is a precondition that *helps* a future capture. - -## Biggest risks and how to de-risk - -1. **"Slot-state rewrite" (the feared big risk) = actually LOW.** The mid-prefill-while-others- - decode state and the chunked-prefill resume already exist ([B]); we add only per-step scratch - (`slot_prompt_added`, `rr_start`), not lifecycle states. **De-risk**: keep all 6 states - untouched; gate every change behind the new knobs; default-off = byte-identical 0013/stock, - verified by an A/B diff of per-step token counts. -2. **Correctness regression in the mixed batch = the FA chunk-grouping nondeterminism.** Already - documented and bounded by 0013 (stock `-b256` vs `-b2048` diverge identically). **De-risk**: - op-test bit-exact where deterministic; greedy-coherence e2e on both models; A/B vs 0013 with - the new knobs set to reproduce 0013 (`T = n_batch`, no leftover) and confirm **byte-identical** - to 0013. -3. **Paged-KV interaction = LOW (orthogonal positions).** **De-risk**: run the whole P0-P2 suite - with `LLAMA_KV_PAGED=1` and `=0`; assert identical scheduling decisions (paged must be a - no-op on batch formation). This is a hard gate, not a spot check. -4. **Checkpoint thrash masks the win = MEDIUM.** A real serving defect that can swamp the - scheduler's signal. 
**De-risk**: P0 isolates it (`n_ctx_checkpoints=0`); P2 makes admission - checkpoint-cost-aware; report the scheduler metrics both with and without checkpoints so the - batching win is legible independent of the checkpoint co-defect. -5. **Honest-payoff risk = the decode_agg number barely moves over 0013 (kernel ceiling), so the - work can be mis-judged as "no win."** This is the most important risk to manage. **De-risk**: - frame and measure on **TTFT percentiles, burst-TTFT spread, step-time histogram, admission - rate, and tuning-free ceiling-hold across npl/dense/MoE** - the axes the scheduler actually - moves - and **pre-register the decode-kernel as the separate residual-closer** (P3) so the - scheduler is never charged with the 391/811 number the kernel forbids. - -## Commit / hygiene - -Scope doc only (this file). **No engine change committed in this workflow.** Bench and parity -scripts stay dev-tree-only (`~/bench/`, `~/llama-paged-dev/benches/`). When P1/P2 are -implemented they mirror to `backend/cpp/llama-cpp/patches/paged/0016-paged-continuous-batch- -scheduler.patch` (next free slot after 0015) and the LocalAI option lands in `grpc-server.cpp` -beside `max_prefill_tokens`. Commit with `git commit -s`, trailer -`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes. Do not push -(human pushes). - ---- - -## Review / risk (adversarial, source-verified) - -Skeptical staff review against the actual source at HEAD `151343b` (server-context.cpp, -llama-batch.cpp, llama-kv-cache.cpp, paged-*.cpp), grpc-server.cpp in this worktree, and the -committed `QWEN36_NVFP4_BENCH.md` plus the vLLM H2H serve logs/scripts on the box. - -### Verdict: the scope is SOUND. GO on P0 -> P1, CONDITIONAL P2, separate-track P3. - -The central de-risking claims check out against the code, and the load-bearing honesty (decode -residual is a kernel ceiling, not a scheduler defect) is correct and now further corroborated. 
-Two calibration fixes are required before P1 (below), neither changes the go decision. - -### (1) Tractability - CONFIRMED bounded; zero libllama changes. What enables/blocks it, concretely: - -- **Enables (already-exercised path, not new surface).** A mixed prefill+decode ubatch with - per-seq different `n_past` is the *existing* behaviour. `llama_batch` carries per-token `pos` - and `seq_id` (`common_batch_add(batch, tok, pos_next(), {slot.id}, ...)`); `llama_kv_cache` + - `paged_alloc::place()` place each `(seq, pos)` independently; `llama_kv_cache::init_batch` - (line 742) already splits the mixed batch into ubatches. **The server emits exactly this mixed - decode+prefill batch today** - patch 0013 ships it and produces coherent output - so the new - scheduler changes only the *count* of prefill tokens, never the batch *structure*. There is no - `llama_decode`/ubatch/KV rewrite in scope. -- **Blocks: nothing in libllama.** The only constraints are pre-existing and orthogonal to the - target workload: (i) `can_batch_with` (same task type + equal LoRA per batch); (ii) - `split_equal(sequential=true)` errors on *coupled* sequences (shared-prompt parallel sampling), - forcing `-kvu`. Neither is introduced by this change. -- **Correction to fold in:** the scope's [C] and the pseudocode imply contiguous `split_simple` - chunking. The real serving/benchmark config (`--parallel 128`, `kv_unified` default = `false` - -> `n_stream = n_seq_max = 128`) takes the **`split_equal(n_ubatch, sequential=true)`** path - (llama-kv-cache.cpp:742), which balances per-sequence rather than slicing contiguously. This - does not break anything (0013 already hits it) but it means the actual scheduled object is a - split_equal ubatch set; P0 must characterize that ubatch shape (not assume contiguous 512-chunks) - and the determinism band is over split_equal groupings. Lock the split path (unified vs not) in - the A/B so the byte-identical-to-0013 gate is meaningful. 
grpc seam [E] verified at - grpc-server.cpp:761-786 (`kv_paged`, `max_prefill_tokens`/`mpt`); new `mbt`/`prefill_cap` knobs - hang off it identically. - -### (2) Does it close the gap - the 2.4x is NOT CUDA graphs, and the TTFT root is quantified. - -- **CUDA graphs ruled out (verified).** Both NVFP4 H2H vLLM servers ran `--enforce-eager` - (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`; engine logs show `enforce_eager=True`, - `cudagraph_mode=NONE`, `CompilationMode.NONE`). So the npl128 2.4x decode gap is a genuine - **eager-mode kernel + per-step host-overhead** gap (ggml graph rebuild/realloc + ~1k kernel - launches per step on the weak Grace cores, paged-KV gather, MoE expert gather). The scheduler - cannot touch it; the staggered all-128-decoding 157.4 tok/s ceiling is solid. Scope is right to - refuse the 391/811 number. (CUDA graphs are a future *both-sides* lever, not the current cause.) -- **The TTFT gap has a measured root the scope under-uses: prefill_tps collapse.** From the bench, - llama `prefill_tps` falls 1117 -> 752 -> 465 -> **125** (dense, npl 8/32/64/128) while vLLM holds - **flat ~1420** (MoE: 2813 -> 657 vs vLLM flat ~4263). That collapse - not a separate "scheduling - quality" abstraction - is the direct cause of the 491 s / 85 s TTFT, and it is exactly what the - dynamic `T - D` budget attacks: when decode load `D` is low (early in a burst) the leftover - `T - D` lets prefill take ~`n_batch` per step, and because llama's *larger per-step chunk* - compensates for its ~2.4x slower steps, a `T = 2048` budget can sustain prefill_tps at or above - vLLM's ~1420 during the drain. **So burst-TTFT parity is mechanically plausible, not just - "toward"** - the static budget-256 throttles prefill to 256/step (hence its weak 305 s) where the - dynamic budget would not. This strengthens P1's case beyond what the doc claims. -- **Mandatory calibration fix:** that TTFT win **couples to a decode-ITL knob**. 
Spending the full - `T - D` on prefill during the drain makes those steps full `T`-token (mixed) computes, so - co-batched decoders get 1 token per slow step (ITL spike) *during the drain* - precisely vLLM's - tradeoff, navigated by `T`. The 157/333 ceiling is the **post-drain steady state**, not the - drain phase. Therefore the scope must **co-report drain-phase decode-ITL alongside TTFT** and - treat `T` as the published trade knob; reporting TTFT alone would hide the cost and reporting - decode_agg alone would hide the win (it is averaged across drain + steady state, which is why it - "barely moves"). Soften "P1+P2 reach 25 s / 8 s": the defensible claim is *staggered/realistic - arrival ~2 s, and all-at-once burst approaching vLLM with a tunable decode-ITL cost*. - -### (3) Correctness - paged orthogonality confirmed at source; the real risks are config, not code. - -- **Paged-KV is the same `llama_kv_cache` class** with `paged_alloc::` hooks inside the existing - find_slot/placement (llama-kv-cache.cpp:1043-1083), driven by per-slot `(seq, pos)` - which this - change does not touch. `init_batch`/split is paged-agnostic. The scope's "orthogonal" claim is - verified, not asserted. Keep the hard `LLAMA_KV_PAGED=1` vs `=0` identical-decisions gate. -- **Determinism**: the FA grouping nondeterminism is over **split_equal** ubatches in the real - config; the `T = n_batch` A/B-must-be-byte-identical-to-0013 gate is the right oracle and is - sound (default-off path is untouched). -- **Low-concurrency regression**: gated to byte-identical when knobs unset; the only live vector is - a **mis-tuned `T`** spiking ITL at low npl (the scope already flags `T` defaults). Config hygiene, - not a code risk. Add a guard/floor so `T` cannot be set below `n_ubatch`. - -### (4) Smaller higher-ROI step - yes, and the scope already contains it (P1). 
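A minimal Python model of the budget arithmetic under review, offered as a sketch rather than the server code. It assumes each waiting slot is represented only by its remaining prompt-token count; `plan_prefill_chunks` and its parameter names are hypothetical, chosen to mirror `T`, `D`, `PREFILL_CAP` and `rr_start` from the design section.

```python
def plan_prefill_chunks(n_decode, remaining, t_budget, n_batch, prefill_cap, rr_start=0):
    """Return the prompt-token chunk granted to each slot for one step.

    n_decode    -- D: decode (+draft) tokens already placed in the batch
    remaining   -- per-slot count of prompt tokens still to prefill (assumed model)
    t_budget    -- T: total per-step token budget
    n_batch     -- hard upper bound on T (single llama_decode chunking)
    prefill_cap -- per-slot cap on prompt tokens per step
    rr_start    -- round-robin start offset, advanced by the caller each step
    """
    t = min(t_budget, n_batch)             # clamp T to n_batch
    leftover = max(0, t - n_decode)        # dynamic prefill budget: T - D
    n_slots = len(remaining)
    chunks = [0] * n_slots
    for k in range(n_slots):
        if leftover == 0:                  # step is full, stop admitting prefill
            break
        i = (rr_start + k) % n_slots       # fairness: rotate which slot goes first
        take = min(remaining[i], prefill_cap, leftover)
        chunks[i] = take
        leftover -= take
    return chunks


# Light decode load: every waiting prompt makes capped progress.
print(plan_prefill_chunks(100, [1000, 50, 0, 3000], 2048, 2048, 512))
# Heavy decode load: the leftover shrinks and the rotated slot gets it.
print(plan_prefill_chunks(1800, [1000, 50, 0, 3000], 2048, 2048, 512, rr_start=3))
```

Because the grant is bounded by `T - D`, the sum of decode and prefill tokens never exceeds `T` in this model; setting `prefill_cap` to a value at or above `T` reduces it to the P1-only behaviour (no per-slot cap).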
- -The minimal high-ROI change is **P1 alone**: replace the static read (server-context.cpp:2737-2747) -with `prefill_budget_step = max(floor, T - batch.n_tokens)` computed after the decode-fill at line -2719, and bound the Phase-2 loops by `T` / that budget (3188, 3320, 3326). That is a handful of -line edits at named seams, default-off, and it captures the self-tuning + the bulk of the TTFT win. -The even-smaller validation spike: a one-line `n_prefill_budget = max(floor, T - batch.n_tokens)` -to confirm the prefill_tps/TTFT mechanism before writing the full P1. **P2** (round-robin + -`prefill_cap_per_slot` + checkpoint-aware admission) is genuinely higher-effort and lower-marginal -(it buys TTFT *spread*/tail and burst robustness, not the median); **gate P2 on P1's measured -burst-TTFT-spread and drain-ITL**, do not commit to it up front. There is no smaller step that also -fixes the static budget's npl-dependence - tuning 0013's constant cannot (256 is net-negative at -npl8 and costs MoE TTFT), so P1 is the floor. - -### Realistic effort / payoff and sequencing - -- **P0** ~0.5-1 wk (harness largely exists in `~/bench/`): add drain-phase decode-ITL to the metric - set, lock the split path, isolate checkpoints (`n_ctx_checkpoints=0`). Gate only. -- **P1** ~2-4 days: small diff + the A/B-vs-0013 byte-identical gate + the npl/dense/MoE sweep. - Payoff: self-tuning hold of 161/333 with no hand-picked constant; burst-TTFT 3-10x better than - 0013 (plausibly approaching vLLM on the burst, parity on staggered), at a published `T`-tunable - decode-ITL cost. **This is the high-ROI core and the clean supersession of 0013.** -- **P2** ~1-2 wk, conditional: fairness/admission + checkpoint-cost-awareness + tuning. Payoff: TTFT - tail/spread + no admission collapse under sustained load. Worth it only if P1 metrics show a - residual spread/robustness problem. 
-- **P3** separate track, high effort: the *only* path to 391/811 is the eager-kernel + per-step - host-overhead residual. Highest-value probe is a **CUDA-graph capture of the steady-state - pure-decode step** - but note this works *independent of the scheduler* (the all-128-decoding - step is already fixed-shape today); the scheduler neither blocks nor specially enables it, so do - not credit graphs to the scheduler. The scope's "uniform decode step is a precondition" is a mild - over-claim; correct it to "graphs apply to the pure-decode steady state, which the scheduler does - not change." - -### Bottom line - -GO. The work is correctly localized to `update_slots()` batch-formation policy, requires no -libllama changes (the mixed per-seq batch is the existing, shipping path), and supersedes 0013 -cleanly. The honest ceiling is real and well-stated; the two fixes are (a) co-report drain-phase -decode-ITL with TTFT and stop selling/charging the decode_agg number, and (b) acknowledge the -`split_equal`/`n_stream=128` path in the determinism and ubatch-shape analysis. Sequence -P0 -> P1, measure, then decide P2; keep P3 (kernel/CUDA-graph) on its own track as the sole owner -of the 2.4x throughput residual. diff --git a/backend/cpp/llama-cpp/patches/paged/CONV_STATE_FUSION_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/CONV_STATE_FUSION_RESULTS.md deleted file mode 100644 index f59b6e532..000000000 --- a/backend/cpp/llama-cpp/patches/paged/CONV_STATE_FUSION_RESULTS.md +++ /dev/null @@ -1,106 +0,0 @@ -# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact) - -The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3). -After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at -the f32 byte floor), the conv path was the only remaining bit-exact decode lever. 
- -## What changed - -A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a -non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain: - - qkv_mixed transpose -> ggml_concat (build width-K window) [concat_cont 8.14 ms/step] - -> ggml_ssm_conv (depthwise conv) [ssm_conv_f32 ~8.6 ms/step] - -> ggml_silu [folded into ssm_conv on CUDA] - -> ggml_cpy of the shifted ring state into the conv cache [cpy_scalar 5.76 ms/step] - -with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from -the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME -ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes -the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot -the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is -vLLM's `causal_conv1d_update`. - -Files: -- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace` - (src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs], - src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu). -- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32` (one thread per - (channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of - `ggml_cuda_op_ssm_conv`. -- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over - channels) + branch in `ggml_compute_forward_ssm_conv`. -- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs - conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write - target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed. 
-- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path - (`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/ - rollback keep the existing concat+ssm_conv+silu+cpy chain. -- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output - vs the CPU reference across backends. - -## Gate: test-backend-ops (CUDA0 vs CPU reference) - -- SSM_CONV: 45/45 OK (unchanged path intact) -- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128) -- SSM_CONV_BIAS_SILU: 90/90 OK - -## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on) - -Byte-identical to the clean Lever-1 (0019/0020) baseline, both models: - -| model | baseline md5 | fused md5 | result | -|--------------------|----------------------------------|----------------------------------|-----------------| -| q36-27b-nvfp4 | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL | -| q36-35b-a3b-nvfp4 | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL | - -## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after - -Dense q36-27b-nvfp4: - -| mode | npl | baseline | fused | delta | -|-----------|-----|----------|--------|---------| -| CUDA-graph| 32 | 199.76 | 202.99 | +1.6% | -| CUDA-graph| 128 | 336.35 | 347.14 | +3.2% | -| eager | 32 | 196.07 | 197.61 | +0.8% | -| eager | 128 | 333.62 | 342.97 | +2.8% | - -MoE q36-35b-a3b-nvfp4: - -| mode | npl | baseline | fused | delta | -|-----------|-----|----------|--------|---------| -| CUDA-graph| 32 | 421.72 | 432.39 | +2.5% | -| CUDA-graph| 128 | 689.74 | 713.54 | +3.5% | -| eager | 32 | 421.05 | 432.46 | +2.7% | -| eager | 128 | 689.15 | 713.87 | +3.6% | - -Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at -**88.8% of vLLM 391** (up from 86.0%). 
The lift holds in BOTH graph and eager modes. - -## Step time + nsys kernel delta - -Per-step decode time (dense npl128, T_TG / ntg=128): -- baseline 48.711 s / 128 = 380.6 ms/step -- fused 47.197 s / 128 = 368.7 ms/step -> **-11.9 ms/step** (matches the predicted 12-14 ms/step saving) -- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step) - -nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels: - -| kernel | baseline calls | fused calls | per-step (eager) | -|---------------------|----------------|-------------|------------------| -| concat_cont (decode)| 1152 | 0 (GONE) | 7.95 -> 0 ms | -| cpy_scalar (decode) | 1152 of 3648 | 0 (GONE) | 4.29 -> 0 ms | -| ssm_conv_f32 (decode)| 1152 of 2736 | 0 (prefill-only) | 8.65 -> 0 ms | -| ssm_conv_update | 0 | 1152 | 0 -> 7.56 ms | - -Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont -and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel. -Prefill keeps the original chain (concat_non_cont 1584, ssm_conv_f32 1584 unchanged). - -## Verdict - -Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM -391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free; -de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches.
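For readers who want the fused update in executable form, here is a scalar Python reference for one (channel, sequence) pair. It is a sketch under stated assumptions, not the CUDA kernel: the oldest-first tap ordering, the list-based state, and the function names are all hypothetical. It shows why fusing can stay bit-exact: the window contents and the ascending-tap accumulation order are the same as in the unfused chain.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def conv_update_unfused(state, weights, x_cur):
    """Baseline chain: concat -> depthwise conv -> silu -> copy shifted state."""
    window = state + [x_cur]               # concat: K-1 cached taps + current token
    acc = 0.0
    for tap, w in zip(window, weights):    # ascending-tap multiply-accumulate
        acc += tap * w
    return silu(acc), window[1:]           # shifted ring state returned as a copy


def conv_update_fused(state, weights, x_cur, fuse_silu=True):
    """Fused form: same math, but the shifted state is written back in place."""
    window = state + [x_cur]               # assembled "in registers" (assumed layout)
    acc = 0.0
    for tap, w in zip(window, weights):    # same accumulation order as the baseline
        acc += tap * w
    state[:] = window[1:]                  # in-place 1-token shift of the cache slot
    return silu(acc) if fuse_silu else acc


cache = [1.0, 2.0, 3.0]                    # K-1 = 3 cached taps, K = 4
ref_out, ref_state = conv_update_unfused(list(cache), [0.5, 0.25, 0.125, 1.0], 4.0)
out = conv_update_fused(cache, [0.5, 0.25, 0.125, 1.0], 4.0)
print(out == ref_out, cache == ref_state)
```

The real op processes every channel and sequence in parallel and reads the taps from a gathered scratch buffer; this reference only models the per-element arithmetic and the state shift.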
- -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md b/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md deleted file mode 100644 index 6a97923fc..000000000 --- a/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md +++ /dev/null @@ -1,639 +0,0 @@ -# Critical-Path Gap Analysis - GDN decode region - -## vllm-gdn-compare (READ-ONLY, no GPU) - vLLM decode GDN kernel inventory vs llama - -### Source ground truth -- Local checkout `/home/mudler/_git/vllm` and the DGX's benchmarked venv - `/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm` are STRUCTURALLY - IDENTICAL (same file `qwen_gdn_linear_attn.py`, byte-for-byte same line numbers - 1287/1344/1457/1644/1684). So the analysis below is faithful to what was actually - benchmarked on the GB10. Both are a recent dev build (`__version__ = "dev"`), same - era as the "0.23.0" reference; the GDN path is the refactored - `vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`. 
- -### The headline: vLLM runs the entire GDN region at decode as 2 Triton kernels + 3 GEMMs, ALL fused -Per Qwen3.5 gated-DeltaNet (linear-attn) layer, vLLM decode launches: - -| # | Kernel | What is folded in | -|---|--------|-------------------| -| 1 | `in_proj_qkvz` GEMM | (quantized matmul - shared with llama) | -| 2 | `in_proj_ba` GEMM | (quantized matmul - shared with llama) | -| 3 | `_causal_conv1d_update_kernel` (causal_conv1d.py:1193) | conv1d **+ silu activation fused in** (the `activation` arg) | -| 4 | `fused_recurrent_gated_delta_rule_packed_decode_kernel` (fused_recurrent.py:256-336) | **l2norm(q), l2norm(k), scale, softplus gate, A_log decay exp(g), sigmoid(beta), the delta-rule recurrence (b_h*=exp(g); delta update), the output b_o=sum(b_h*b_q), AND the SSM state write-back** - all in one kernel | -| 5 | `RMSNormGated` (gated rms_norm) | **output gate silu/sigmoid * z fused into the rms_norm**; the comment notes the norm+quant is further fusable by the compilation pass (`fuse_norm_quant`) | -| 6 | `out_proj` GEMM | (quantized matmul - shared with llama) | - -So the GDN-region "glue" elementwise op count in vLLM is effectively ZERO separate -launches. Everything llama runs as standalone ggml nodes - conv-silu, gate -sigmoid, softplus, l2norm, scale, decay mul, residual add, gather - is absorbed -into kernels #3, #4, and #5. - -Verified kernel bodies: -- `fused_recurrent_gated_delta_rule_packed_decode_kernel` lines 313-336: - `b_q/sqrt(sum(b_q^2)+eps)`, `b_k/sqrt(...)` (l2norm), `b_q*scale`, - `softplus_x=where(x<=thr, log(1+exp(x)), x)`, `g_val=-exp(A_log)*softplus_x`, - `beta_val=sigmoid(b)`, `b_h*=exp(g_val)`, `b_v-=sum(b_h*b_k)`, `b_v*=beta_val`, - `b_h+=b_v*b_k`, `b_o=sum(b_h*b_q)`, `tl.store(p_o,...)`, `tl.store(p_ht,...)`. - ONE kernel = recurrence + ALL gating + l2norm + state writeback. 
-- The non-packed variant `fused_sigmoid_gating_delta_rule_update_kernel` - (fused_sigmoid_gating.py:24-179) is the same fusion (used for the spec-decode / - mixed-batch path); both fold gate+l2norm+recurrence+writeback into one launch. -- Decode dispatch: `_forward_core` (line 1286-1298) routes pure non-spec decode to - `_forward_core_decode_non_spec` (line 1644), which calls exactly - `causal_conv1d_update` (#3) then `fused_recurrent_gated_delta_rule_packed_decode` - (#4). `_output_projection` (line 851) does `self.norm(core_attn_out, z)` (#5, - gated rmsnorm) then `out_proj` (#6). - -### vLLM ALSO captures decode in a FULL CUDA graph - the launch bubbles are gone entirely -`vllm/v1/attention/backends/gdn_attn.py`: -- `_cudagraph_support = AttentionCGSupport.UNIFORM_BATCH` (line 82) -- `use_full_cuda_graph = cudagraph_mode.has_full_cudagraphs()` (line 113) -- `build_for_cudagraph_capture` (line 509): "only decode is supported for full - cudagraphs with Mamba" / "GDN only supports decode-only full CUDAGraph capture". - -So at decode vLLM captures the WHOLE forward (all 48 layers: GDN linear-attn layers -+ the 1-in-4 full-attn layers + projections + conv + recurrence + gated rmsnorm) -into a single replayed CUDA graph. Per-kernel host launch latency and the -data-dependent inter-op gaps are eliminated at replay time. Even the 2 Triton -kernels per GDN layer incur no host-side launch bubble during graph replay. - -### Why this is the 62%-vs-40% explanation (not GEMM throughput) -- llama runs the GDN region as ~7-9 separate ggml nodes per layer at decode - (`ssm_conv`, `gated_delta_net` recurrence, `gdn_gather`, `k_bin_bcast` mul, - `silu`, `sigmoid`, `l2_norm`, `op_add`, `concat`), each a host-launched kernel, - serially data-dependent (conv -> gate -> recurrence -> gather), with the gating - elementwise wedged between recurrence steps. Each launch + the dependency stall - is a bubble ON the critical path. x48 layers x ~8 ops = ~384 launch bubbles/step. 
-- vLLM has 2 fused Triton kernels per GDN layer AND wraps them in a CUDA graph, so - the GDN-region inter-op bubble count at decode is ~0. The recurrence kernel - itself is already near-parity in llama (gated_delta_net 1.47 ms/call vs vLLM). - The gap is the surrounding launch/sync overhead, which is exactly the 60% idle - measured (llama ~40% busy vs vLLM 62%). -- This matches why P2a and Lever 2 were FLAT: they shrink GPU-busy kernels that are - already overlapped with the 42% mul_mat_q GEMM. The real wall-clock lever is the - SERIAL GDN gating chain's launch bubbles, which vLLM removed by (a) fusion into - the recurrence kernel and (b) CUDA-graph capture. - -### What llama would need to match vLLM (two independent wins, either helps) -1. **Op fusion (Lever 3).** Collapse the GDN per-layer gating chain into the - recurrence kernel: fold conv-silu, q/k l2norm, scale, softplus+A_log gate, - sigmoid-beta, the exp-decay mul, the residual add, and the SSM-state write-back - INTO the `gated_delta_net` CUDA kernel (and fuse the output gate silu*z into the - final rms_norm). Target: from ~8 GDN nodes/layer down to ~2 (conv-fused + - recurrence-fused), mirroring vLLM's `fused_recurrent_gated_delta_rule_packed_decode`. - The conv silu fold and the l2norm/scale/gate fold are the high-value pieces - - they are pure elementwise prologues sitting ON the serial chain between conv and - recurrence. -2. **CUDA-graph the decode step.** Even without fusion, capturing the decode forward - in a CUDA graph removes the per-node host launch latency for all ~384 nodes/step. - (Prior A.2 work flagged ggml-cuda graph capture as the orthogonal lever; the - measured GDN structure here is exactly why it should move the wall.) vLLM gets - BOTH; llama gets neither today. 
- -### Bottom line for the gap-analysis agent -The candidate explanation is confirmed at the source level: vLLM's GDN decode region -is 2 fused Triton kernels under a full CUDA graph vs llama's ~8 separate -host-launched, serially-dependent ggml nodes. That launch/bubble delta - not GEMM -compute - is the 62%-vs-40% busy gap. A timeline gap analysis on the existing nsys -trace should show idle GPU between the GDN sub-ops (conv -> gate -> recurrence -> -gather) per layer; if it does, Lever 3 (gating-into-recurrence fusion) and/or -decode CUDA-graph capture are the levers that will move the wall, unlike P2a/Lever 2. - ---- - -## roofline-decode (READ-ONLY, no GPU) - decode-step roofline + bubble budget + parity target - -Goal: bound the q36-27b-nvfp4 decode step by the bandwidth floor and the compute floor, -compare to measured llama 384 ms/step vs vLLM 327 ms/step, size the unexplained "bubble -budget", and state the step-time target for parity. Cross-checks vllm-gdn-compare above. - -### Inputs (measured / GGUF metadata, no new GPU work) -- DGX GB10 (sm_121): LPDDR5x **273 GB/s**, dense NVFP4 MMA peak ~**500 TFLOP/s** (sparse ~1 PFLOP/s). - Both numbers are shared identically by llama and vLLM (same HW, same weights). -- q36-27b-nvfp4 GGUF (arch qwen35): block_count **64** (full_attention_interval 4 -> - **16 full-attn + 48 GDN** layers), d_model 5120, FFN 17408, attn 24 heads / 4 kv-heads, - head_dim 256, ssm conv_kernel 4 / state_size 128 / group_count 16 / inner_size 6144. - Weight file = **18.804 GB** (NVFP4 + FP8 block-scales + f16 norms/embd), fully GPU-resident. -- Measured llama decode (dense_base.out, -fa -npp128 -ntg128 -npl128, B=128, 128 TG steps): - T_TG 49.154 s / 128 = **384.0 ms/step** (S_TG 333.3 tok/s; matches STATE "~381 ms"). -- vLLM dense ref **391 tok/s @128** => 128/391 = **327.4 ms/step**. - -### 1. 
Bandwidth floor (bytes that MUST cross LPDDR5x per step / 273 GB/s) -| term | bytes/step | basis | -|------|-----------|-------| -| Weights (batched, read ONCE/step, reused across all 128 seqs) | ~18.4 GB | file 18.804 minus ~0.4 GB sparse input-embd lookup; lm_head fully read | -| SSM state R+W (48 GDN layers x 128 seqs) | ~19 GB (bracket 10-38) | ~1.5 MB/layer/seq R+W, kernel-grounded: gated_delta_net 1.47 ms/call -> ~400 MB/call @273 GB/s; theoretical d_inner*d_state f32 doubles it | -| KV cache read (16 attn layers, avg 192 ctx, 128 seqs, f16) | ~1.6 GB | 64 KiB/tok over 16 layers; max-ctx 256 -> 2.1 GB | -| Activation/quantize/gate intermediates R+W | ~3 GB | quantize_mmq_nvfp4 + k_bin_bcast + silu + rms tensors @ batch 128 | -| **TOTAL** | **~42 GB/step** | bracket 32-61 GB | - -**Bandwidth floor = 42/273 = ~154 ms/step** (central; bracket ~117-224 ms). -Weight-only HARD sub-floor (unavoidable, both engines pay it): 18.4/273 = **67 ms/step**. - -KEY: even at batch 128 the FP4 GEMM is STILL memory-bound, not MMA-bound. AI = 2*128/0.53 B -= ~483 FLOP/byte << GB10 ridge 500e12/273e9 = 1832 FLOP/byte. The 42% `mul_mat_q` -GPU time is weight-DRAM streaming, not tensor cores -> first-principles reason P2a (-26% MMA -occupancy) and Lever-2 were FLAT on decode. - -### 2. Compute floor (FLOPs / ~500 TFLOP/s dense FP4) -| term | FLOPs/step | floor | -|------|-----------|-------| -| FP4 GEMM (all dense matmuls): 2 * ~26e9 params * 128 seqs | 6.66 TFLOP | / 500e12 = **13.3 ms** (6.7 ms @ sparse 1 PFLOP) | -| GDN recurrence (state update + read-out, 48 layers x 128 seqs) | ~0.04 TFLOP | < 0.1 ms (state-bound, not FLOP-bound) | -| **TOTAL** | ~6.7 TFLOP | **~13 ms/step (~4% of the step)** | - -### 3. 
Verdict / bubble budget / parity target -``` - compute floor bandwidth floor MEASURED step x above bw-floor -GB10 dense-FP4 ~13 ms ~154 ms (117-224) -vLLM dense @128 327 ms ~2.1x (1.5-2.8x) -llama dense @128 384 ms ~2.5x (1.7-3.3x) -``` -- **Binding floor = bandwidth (~130-155 ms), NOT compute (~13 ms).** Compute floor is ~25x - below the wall -> FP4-MMA throughput is irrelevant; matches P2a/Lever-2 flatness exactly. -- **Both engines run ~2-2.8x ABOVE the bandwidth floor.** vLLM itself reaches only ~40-47% - LPDDR5x efficiency -> even the reference is LATENCY/occupancy bound, not byte-bound. - Confirms prior "decode is 2.5x above its bandwidth floor" work. -- **Bubble budget** (wall - bandwidth floor, central 154): vLLM ~**173 ms**, llama ~**230 ms**. - = kernel-launch latency + occupancy gaps + serial data-dependency stalls. -- **The llama-vs-vLLM gap = 384 - 327 = 57 ms/step (14.8% of the step) is 100% BUBBLE.** - Both engines share IDENTICAL bandwidth AND compute floors (same 18.8 GB NVFP4 weights, same - SSM state, same KV, same GB10 273 GB/s + 500 TFLOP). Bytes and FLOPs are byte-for-byte equal, - so the entire 57 ms differential lives in critical-path bubble - NOT bandwidth, NOT compute. - -**Parity target: 327 ms/step (391 tok/s @128). llama must shave 57 ms/step = 14.8% off 384 ms.** -Neither floor can move (both already shared with vLLM), so the 57 ms can ONLY come from -collapsing critical-path bubbles -> structurally-correct case for Lever 3 (fuse the serial GDN -gating chain) and/or decode CUDA-graph capture, exactly the two wins vllm-gdn-compare found vLLM -already has. P2a/Lever-2 were flat because they freed OVERLAPPED GPU-busy time BELOW the floor. 
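The floors and the wall-minus-floor budget can be reproduced from the central estimates in the tables. This is a rough sketch of that arithmetic only: the byte and FLOP inputs are the note's own estimates, and the variable names are illustrative.

```python
BANDWIDTH_BPS = 273e9   # GB10 LPDDR5x, bytes/s
PEAK_FLOPS = 500e12     # approximate dense NVFP4 MMA peak, FLOP/s

# Central per-step byte estimates from the bandwidth-floor table.
bytes_per_step = {
    "weights": 18.4e9,
    "ssm_state_rw": 19e9,
    "kv_read": 1.6e9,
    "activations": 3e9,
}


def floor_ms(amount: float, rate: float) -> float:
    """Lower bound on step time if `amount` is processed at `rate`."""
    return amount / rate * 1e3


total_bytes = sum(bytes_per_step.values())
bw_floor = floor_ms(total_bytes, BANDWIDTH_BPS)
gemm_flops = 2 * 26e9 * 128             # 2 FLOP per parameter per sequence
compute_floor = floor_ms(gemm_flops, PEAK_FLOPS)
ridge = PEAK_FLOPS / BANDWIDTH_BPS      # FLOP/byte where the two floors meet

llama_step = 384.0                      # measured ms/step
vllm_step = 128 / 391 * 1e3             # 391 tok/s at 128 sequences

print(f"bandwidth floor {bw_floor:.0f} ms, compute floor {compute_floor:.1f} ms, "
      f"ridge {ridge:.0f} FLOP/byte")
for name, step in (("vLLM", vllm_step), ("llama", llama_step)):
    print(f"{name}: {step:.1f} ms/step = {step / bw_floor:.2f}x the bandwidth floor, "
          f"{step - bw_floor:.0f} ms above it")
print(f"llama - vLLM = {llama_step - vllm_step:.1f} ms/step")
```

The script only restates the budget; it says nothing about whether the time above the floor is idle gaps or in-kernel time, which needs a timeline measurement.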
- -### Cross-check / sizing for the gap-analysis (timeline) agent -- GPU-busy sum from nsysab_new (ntg24 window, /24 steps): FP4 GEMM ~243 + gated_delta_net ~76 + - GDN glue (k_bin_bcast mul ~49, silu ~34, concat ~19, gdn_gather ~21, ssm_conv ~12, l2_norm ~6, - op_add ~10) ~152 + quantize ~62 = **~555 ms GPU-busy vs 384 ms wall** -> sum >> wall by ~1.45x, - so heavy overlap is real and GPU-busy% buckets ARE misleading. Do NOT sum kernel times; the - wall is the critical path. -- Concrete budget: if the inter-kernel IDLE gaps + non-overlapped launch latency along the serial - GDN chain (ssm_conv -> gated_delta_net -> gating elementwise -> gdn_gather, x48 layers x N steps) - sum to **>= 57 ms/step**, Lever 3 is justified AND sized. If those critical-path gaps total - < 57 ms, parity is NOT reachable via GDN-gate fusion alone and the gap is elsewhere (GDN core - kernel slower than vLLM fused_recurrent, or scheduler/H2D). -- Structural corroboration (agrees with vllm-gdn-compare): vLLM runs the GDN region as 2 fused - Triton kernels under a full CUDA graph; llama splits it into ssm_conv + gated_delta_net + - gdn_gather + ~6 serially data-dependent elementwise gate kernels. ~384 host-launched nodes/step - on a chain that cannot overlap is precisely the mechanism that produces llama's extra ~57 ms. - -Floors are engine-independent lower bounds; the timeline agent owns proving the 57 ms is -recoverable on the critical path. Roofline says: target 327 ms, shave 57 ms, and it can ONLY -come from bubble (not bytes, not FLOPs). - -Assisted-by: Claude:opus-4.8 [Claude Code] - -## lever3-design (READ-ONLY, no GPU) - concrete fusion of the serial GDN gate chain into the recurrence kernel - -### What actually feeds/consumes the recurrence kernel today (qwen35 decode, fused_gdn_ar) -Traced in `src/models/qwen35.cpp::build_layer_attn_linear` -> -`src/models/delta-net-base.cpp::build_recurrent_attn` (fused !keep branch) -> -`ggml/src/ggml-cuda/gated_delta_net.cu`. 
The model is GDA (g->ne[0]==1, scalar -gate per head; kda=false in the kernel). S_v = ssm_d_state = 128, so the kernel -runs the `<128>` template: warp_size==S_v==128, num_warps=4, rows_per_lane==1, -grid (H, n_seqs, S_v/4=32 z-tiles). Each warp owns one output column `col`; the -128 lanes hold the full head-vector (one element per lane). - -Serial pre-GDN gate chain (each a standalone host-launched ggml node, all on the -critical path between the in-proj GEMMs and the recurrence): -1. `beta = ggml_sigmoid(ssm_beta @ cur)` -> kernel reads `beta_val = *beta_t` -2. `alpha = ssm_alpha @ cur` -3. `ggml_add(alpha, ssm_dt)` (k_bin_bcast op_add) -4. `ggml_softplus(...)` (unary_op, 1248 inst) -5. `ggml_mul(softplus, ssm_a)` (k_bin_bcast op_mul; ssm_a = -exp(A_log), baked) -> g; kernel does `expf(g_t)` -6. `ssm_conv` then `ggml_silu` (conv path; may already hit the upstream SSM_CONV+SILU fuse) -> v_conv, and the q/k slices -7. `ggml_l2_norm(q_conv)`, `ggml_l2_norm(k_conv)` (l2_norm_f32<32>, 2496 inst = 1248x2) -> kernel reads q_reg/k_reg - -Post-GDN gate (consumes kernel output): -8. `build_norm_gated(output, ssm_norm, z)` = rms_norm(output)*ssm_norm (RMS_NORM+MUL fused) then `silu(z)*.` (unary_gated_op, the 5.9% bucket) - -### The fusion: fold steps 1,3,4,5,7 INTO gated_delta_net_cuda (a "fused-gate" mode) -These five are exactly the per-(head) scalar gates (sigmoid beta; softplus+dt+ssm_a --> g) and the per-head-vector L2 norms of q/k - and the kernel ALREADY loads every -operand it needs: -- It reads `beta_val` (scalar) -> pass RAW beta, do `beta_val = 1.f/(1.f+expf(-raw))` in-kernel. Removes node 1. -- It reads `g_t` (scalar, GDA) and does `expf(g_t)` -> pass RAW alpha + per-head `ssm_dt[h]` + per-head `ssm_a[h]`, compute `g = ssm_a[h]*op_softplus(alpha + ssm_dt[h])` in-kernel, keep the existing `expf(g)`. `op_softplus(x) = (x>20)?x:logf(1+expf(x))` (copy `ggml_compute_softplus_f32` verbatim). Removes nodes 3,4,5. 
-- It loads the full q/k head-vector into `q_reg[r]`/`k_reg[r]` (one element per lane at S_v==128). L2-normalize in registers: `float qss = warp_reduce_sum<128>(q_reg[0]*q_reg[0]); q_reg[0] *= rsqrtf(qss + eps* ... )` matching the l2_norm formula, same for k. Each warp redundantly recomputes the (identical) norm for its column - cheap, no shared mem, no extra launch. Removes node 7 (x2). `eps` (= f_norm_rms_eps) passed as a kernel float param. - -That collapses the pre-GDN serial chain to just: in-proj GEMMs -> build_conv_state(concat) -> ssm_conv(+silu) -> [single fused gated_delta_net kernel]. 6 gate kernels removed per SSM layer per decode step (steps 1, 3, 4, 5, plus the two l2_norm launches of step 7). - -### Why the OUTPUT gate (step 8) is NOT folded into this kernel -The output gated-rmsnorm reduces over the full head_v_dim (S_v=128) per (head,seq). -In this kernel those 128 elements are produced by 128 DIFFERENT (warp x z-tile) -blocks (4 warps x 32 z-tiles), so an in-kernel head-wide reduction would need a -grid-global sync - not feasible without a grid redesign. Leave step 8 as the -existing RMS_NORM+MUL + unary_gated fusion (already 2 launches, not in scope). -The conv-silu (step 6) is a convolution, structurally separate; rely on the -existing upstream SSM_CONV(+ADD)+SILU fuse rather than pulling it into the -recurrence kernel. - -### Implementation scope -- `ggml/include/ggml.h`: new builder `ggml_gated_delta_net_inplace_ids_fused_gate(ctx, q_raw, k_raw, v, alpha_raw, beta_raw, cache4d, state_dst, ids, ssm_a, ssm_dt, rs_head, eps)` (or an op-param flag GDN_FUSE_GATE on the existing builder + 2 extra srcs). src budget: current op uses src[0..7]; add ssm_a -> src[8], ssm_dt -> src[9]. GGML_MAX_SRC==10, so it fits EXACTLY (zero headroom - note for review). -- `ggml/src/ggml.c`: builder + a new op-param i32 flag (e.g. params[2]=fuse_gate) + f32 param for eps; assert shapes (ssm_a/ssm_dt are [num_v_heads]).
-- `ggml/src/ggml-cuda/gated_delta_net.cu`: in `gated_delta_net_cuda`, gate the in-kernel sigmoid/softplus-gate/l2norm behind a `bool FUSE_GATE` template param (4th template bool, keeps the non-fused path byte-identical and avoids register bloat when off). Read ssm_a[h_idx], ssm_dt[h_idx]; compute g per head; sigmoid raw beta; warp-reduce q_reg/k_reg sumsq -> rsqrtf scale. Plumb the 2 new src pointers + eps through `launch_gated_delta_net` and `ggml_cuda_op_gated_delta_net` (read src[8],src[9], op_param eps/flag). The `gdn_gather_nonident` path is unaffected (it gathers state, not q/k/g/beta). -- `ggml/src/ggml-cpu/ops.cpp`: mirror in `ggml_compute_forward_gated_delta_net_one_chunk` (host sigmoid/softplus/l2norm before the per-token math) for CPU parity / test-backend-ops. -- `src/models/delta-net-base.cpp::build_recurrent_attn` (the fused !keep + ids branch, and the inplace non-ids branch): call the fused-gate builder, pass raw alpha/beta/q/k + ssm_a + ssm_dt + eps. -- `src/models/qwen35.cpp` / `qwen35moe.cpp` / `qwen3next.cpp` `build_layer_attn_linear`: when the fuse flag is on, DROP `ggml_sigmoid(beta)`, `ggml_add(alpha,dt)`, `ggml_softplus`, `ggml_mul(.,ssm_a)`, and the two `ggml_l2_norm` nodes; hand the raw tensors + `model.layers[il].ssm_a`, `ssm_dt` to build_recurrent_attn. The conv-silu and z/output-gate path are unchanged. -- Guard the whole thing behind `cparams.fused_gdn_gate` / env `LLAMA_FUSE_GDN_GATE` (default OFF) so it A/Bs against the clean Lever-1 build exactly like P2a/Lever-2, and only the recurrent (GDA) qwen35 family path is touched. - -### Numeric considerations / bit-exactness -- sigmoid(beta), softplus(alpha+dt), and the `g = ssm_a*softplus` mul/add are pointwise fp32 with the SAME formula/order as the standalone ggml ops -> these can be **bit-exact** (no reduction). softplus must copy `(x>20)?x:logf(1+expf(x))` exactly. 
-- q/k l2norm is the ONE op with a reduction: the standalone `l2_norm_f32<32>` does its own warp/block reduction; the in-kernel `warp_reduce_sum<128>` tree may differ in the last ULP, and the eps placement (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) must match the ggml l2_norm exactly. Expect **near-bit-exact, not guaranteed byte-identical** greedy output. So unlike Levers 1/2, gate this on a **PPL/KL tolerance** (KL logit delta < ~1e-3, PPL delta within noise) rather than md5 identity. If byte-identity is required, exclude l2norm from the fold (keep nodes 7) and fuse only sigmoid/softplus/gate - but that drops the value to ~0.3% and is probably not worth it. - -### Estimated kernels-removed-per-layer and the honest ceiling -- Removed per SSM decode layer-step: sigmoid(beta) + add(dt) + softplus + mul(ssm_a) + l2norm(q) + l2norm(k) = **6 host-launched kernels -> 0**, collapsing 7 nodes (incl. recurrence) to 1. Across 48 SSM layers = **~288 launches/step removed** (matches the instance deltas: l2_norm 2496, softplus 1248, sigmoid 1248, plus the alpha-add/ssm_a-mul share of op_add/op_mul). -- GPU-BUSY ceiling of the removed ops is small: l2_norm 1.0% + softplus ~0% + sigmoid 0.3% + (dt add + ssm_a mul share of op_add 1.7% / op_mul 8.5%). The point of Lever 3 is NOT the freed busy-time (P2a/Lever-2 proved freeing overlapped busy-time is flat) - it is removing ~288 LAUNCH BUBBLES/step that sit on the serial conv->gate->recurrence dependency where the GPU is otherwise idle. The win is wall-clock only if those specific bubbles are on the critical path. - -### RISK (must be settled before building) -1. **Same trap as P2a/Lever-2 if the bubbles overlap.** If the scheduler already - overlaps these pre-GDN gate kernels with an adjacent layer's 42% mul_mat_q GEMM, - Lever 3 is FLAT. **Precondition: the timeline gap analysis must show idle GPU - between ssm_conv -> (sigmoid/softplus/l2norm) -> gated_delta_net per layer** at - batch=128 BEFORE building. 
If the trace shows the gate ops back-to-back with no - gap (overlapped), do NOT build op-fusion; go to lever (2) below. -2. **The bigger bubbles may be elsewhere on the chain.** The large buckets are op_mul - 8.5% and unary_gated 5.9% - much of which is the POST-GDN output gate and - FFN, which this fusion does NOT touch. If the gap analysis pins the dominant idle - to the post-GDN region or to inter-layer launch latency generally, the - higher-leverage Lever 3 is **decode CUDA-graph capture** (removes host launch - latency for ALL ~384 nodes/step at once, exactly what vLLM does), not per-op - fusion. CUDA-graph is the strictly larger hammer here; op-fusion only helps the - pre-GDN slice. Recommend measuring the per-sub-op gap first and preferring the - CUDA-graph lever if the bubbles are spread across the step rather than concentrated - in the pre-GDN gate slice. -3. **src-slot exhaustion** (src[8],src[9] use the last 2 of GGML_MAX_SRC=10) - any - later op needing more srcs on this node has zero headroom; flag for review. - -## cudagraph-coverage (READ-ONLY, no GPU) - does the CUDA graph cover the GDN serial chain at B=128? - -### Verdict: YES, the graph covers GDN at batch=128 (dense model). No GDN op forces graph-disable or per-step re-instantiation. - -Source: `ggml/src/ggml-cuda/ggml-cuda.cu` (graph state machine), `gated_delta_net.cu` -(fused op), `src/models/delta-net-base.cpp` (graph build), `src/llama-memory-recurrent.cpp` -(recurrent head), all on dev tree `~/llama-paged-dev` (HEAD df1cc97, Lever-1). Cross-checked -against the committed A2_CUDAGRAPH_DECODE.md + DECODE_PARITY_EXPLORE.md measurements. - -### How graph-disable / re-instantiation are decided (this fork's state machine) -- `ggml_cuda_graph_check_compability` (ggml-cuda.cu:3251) disables the graph for ONLY two - reasons: (a) a split-buffer src, (b) `GGML_OP_MUL_MAT_ID` with non-quantized weights OR - `node->ne[2] > get_mmvq_mmid_max(...)` [TAG_MUL_MAT_ID_CUDA_GRAPHS]. 
GATED_DELTA_NET, - SSM_CONV, SSM_SCAN, GET_ROWS, CONCAT, the gating elementwise ops are NOT in the disable - list. So no GDN op forces graph-disable. -- `ggml_cuda_graph_update_required` (3297) memcmps, per node, the full `ggml_tensor` struct - (incl. `op_params` and `data`) + each src's `data` ptr / `ne` / `nb`. ANY delta -> the - warmup state machine (ggml_backend_cuda_graph_compute:4464) resets `warmup_complete` and the - WHOLE graph (one key = `cgraph->nodes[0]`) runs eager that step until stable again. Buffer - CONTENTS are NOT compared - a contents-only change (e.g. ids values) is graph-safe. - -### Why the GDN region's properties are STABLE across steady decode steps -The fused decode path is `ggml_gated_delta_net_inplace_ids` (delta-net-base.cpp:558-560): -``` -state_dst = ggml_view_2d(ctx, ssm_states_all, n_embd_s, n_seqs, nb1, - kv_head * n_embd_s * elsize); // offset = kv_head -ggml_gated_delta_net_inplace_ids(..., cache4d, state_dst, ids, /*rs_head=*/(int)kv_head); -``` -Both the `state_dst` view byte-offset and the `rs_head` op_param (read back as -`ggml_get_op_params_i32(dst,1)` in gated_delta_net.cu:330) derive from -`kv_head = mctx_cur->get_head()`. In `llama_memory_recurrent::find_slot` -(llama-memory-recurrent.cpp:610-689) the n_seqs used cells are SWAPPED into the contiguous -range `[min .. min+n_seqs-1]` and `head = min`. The recurrent cache does NOT grow per token -(one state cell per sequence, unlike the KV cache). For a steady 128-seq continuous batch the -same sequences own the same tails every step, so `min`/`head` are constant (=0) -> state_dst -offset constant, rs_head op_param constant. The GDN inputs (q,k,v,g,b, cache4d, ids) are -fixed-shape (n_seqs=128, n_rs slots), so ne/nb are stable, and ggml-alloc hands out the same -compute-buffer offsets each same-topology ubatch -> data ptrs stable. The `ids` (s_copy) -tensor's CONTENTS change per step but its address/ne/nb do not -> graph-safe. 
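The stability argument above can be illustrated with a toy model of the per-node comparison. This is a simplified sketch, not the ggml implementation: `NodeProps`, `update_required`, and `gdn_node` are hypothetical names, the addresses and shapes are placeholders, and only the properties the text says are compared (op params, data pointers, shapes) are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProps:
    """Hypothetical stand-in for the per-node fields the check compares."""
    op: str
    op_params: tuple    # e.g. (rs_head,) for the fused GDN op
    data_ptr: int       # destination address
    src_ptrs: tuple     # source addresses
    src_shapes: tuple   # ne per source


def update_required(prev: list, cur: list) -> bool:
    """True if any compared property differs. Buffer contents are never inspected."""
    return len(prev) != len(cur) or any(a != b for a, b in zip(prev, cur))


def gdn_node(kv_head: int, n_seqs: int, state_base: int = 0x10000000,
             n_embd_s: int = 1 << 20, elsize: int = 4) -> NodeProps:
    # state_dst is a view at byte offset kv_head * n_embd_s * elsize, and
    # rs_head is the same kv_head stored as an op param.
    state_dst = state_base + kv_head * n_embd_s * elsize
    return NodeProps("GATED_DELTA_NET", (kv_head,), state_dst,
                     (0x2000, 0x3000), ((128, n_seqs),))


steady_a = [gdn_node(kv_head=0, n_seqs=128)]
steady_b = [gdn_node(kv_head=0, n_seqs=128)]   # ids contents may differ; not modelled
shifted = [gdn_node(kv_head=1, n_seqs=128)]

print(update_required(steady_a, steady_b))  # head unchanged
print(update_required(steady_a, shifted))   # head moved: offset and op param both change
```

In this model a steady batch compares equal from step to step, while any change in `kv_head` alters both the view offset and the op param and would force a re-capture.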
- -### The fused GDN op is capture-safe (no host-sync, no capture-time cudaMalloc) -`gated_delta_net.cu`: the op launches `gdn_gather_nonident_kernel` + `gated_delta_net_cuda` -on `ctx.stream()` with NO `cudaStreamSynchronize` / host `cudaMemcpy` / `cudaMalloc`. The -gather scratch is `ggml_cuda_pool_alloc` (VMM pool, served from reserved memory after warmup, -no real cudaMalloc during capture). `gdn_gather_nonident` early-returns for identity sequences -(`ids[s]==rs_head+s`), which is the steady-decode case, so its 3.7% is a launched-but-mostly- -noop kernel - still captured into the graph like any other. Capture succeeds (the build runs, -graphs engage), confirming none of these break stream capture. - -### The only re-instantiation is NOT GDN-driven -A2 already measured the re-warm cadence: the graph re-instantiates every ~256 tokens because -the FULL-ATTENTION block-table input `idx` has `ne[0]=GGML_PAD(n_kv,256)` (and kq_mask in -lockstep) - those step at 256-token boundaries (paged-attn.cpp:199-213). ~97% of decode steps -replay the captured graph. This is a full-attention-layer input, not a GDN op. (The unpadded -`LLAMA_KV_PAGED_GATHER` fallback grows `ne[0]` every step and runs pure-eager, but that is not -the default decode path and is not the GDN/SSM path.) - -### Reconciliation with the "~40% util / 60% idle bubbles" premise (it is refuted for GDN) -The committed nsys sweeps (A2_CUDAGRAPH_DECODE.md, DECODE_PARITY_EXPLORE.md) show the steady -decode is ~99.4-99.5% GPU-BUSY with graphs ON (measured with `--cuda-graph-trace=node`; a -graphs-ON trace WITHOUT that flag under-counts GPU rows and falsely reports idle - Trap #2). -Total exposed idle is ~0.65% of the step; the within-step launch fraction graphs remove is -0.34% (0.37%->0.11%) and is ALREADY collapsed - the GDN sub-op launch gaps are inside the -captured region. 
The "40% utilization" in the STATE is BANDWIDTH-roofline util, not idle SMs: -decode moves ~55.5 GB/step at 2.48x the 273 GB/s floor, SSM state r+w = 66% of step bytes. The -GDN recurrence is memory-bandwidth-bound at low occupancy (~12-16%), not launch-gap-bound. So -"60% idle bubbles on the serial GDN chain" is not supported by the traces; the gap to vLLM is -SSM-state memory traffic, consistent with P2a/Lever-2 being flat (freeing GPU-busy time, not -wall-clock). - -### Graph-safe lever for GDN: none new -- GDN is already graph-covered; there is no "make the GDN ops graph-safe" lever to build - they - are already safe and captured. -- The only genuinely graph-NON-covered idle is the BETWEEN-step host gap (~2 ms/step, ~0.4%): - ggml rebuilds/reallocs the cgraph each step with a new `cgraph->uid`, so the uid fast-path in - ggml_cuda_graph_update_required never fires and the host re-dispatches ~3100 launches on the - Grace cores between graph launches (vLLM builds its graph once + persistent device metadata). - A persistent/reused cgraph across decode steps would let the uid fast-path fire and shrink the - host gap - but at 0.4% of the step it is second-order to the SSM bandwidth floor. -- CAVEAT (MoE, qwen35moe): MUL_MAT_ID at B=128 can trip [TAG_MUL_MAT_ID_CUDA_GRAPHS] - (`ne[2] > mmvq_mmid_max`), disabling the WHOLE MoE-decode graph (GDN included) into eager. - That is a MUL_MAT_ID disable, not a GDN break, and does not touch the dense 335 tok/s headline; - worth a separate confirm for the MoE model. - -## decode-timeline-gap (GPU, label gap-analysis) - the decisive fresh node-level measurement - -This is the new GPU run the analysis was waiting on. It arbitrates between the -roofline/vllm-gdn-compare theory ("57 ms = 100% bubble, Lever 3 closes it") and the -cudagraph-coverage source verdict ("~99.4% busy, bandwidth-bound, bubbles refuted"). -The measurement confirms the latter and refutes the former, with per-kernel numbers. 
- -### Capture (the trap the prior `--trace=cuda` fell into is now avoided) -`nsys profile --trace=cuda --cuda-graph-trace=node` on build-cuda-base (clean -Lever-1, HEAD df1cc97, git-clean mmq.cuh), q36-27b-nvfp4 dense, `-fa on -npp 128 --ntg 24 -npl 128 -c 33000`. Artifacts on DGX: `~/llama-paged-dev/nsysgap.{nsys-rep, -sqlite}`. The decode step is a single CUDA graph (graphId=11, 23 replays = steps -2-24; graphId=1 x8 = prefill). Plain `--trace=cuda` recorded each step as ONE opaque -~380 ms block, so the widely-cited `nsysab_new.kern.txt` breakdown (mul_mat_q 42%, -gated_delta_net 13%) is PREFILL + the single eager capture step, NOT decode. With -node-level trace the graph expands: 168201 kernels = 91499 graph-internal + 76702 -eager prefill. **All graph kernels on stream 14 (single stream) -> strictly serial, -no overlap, so any inter-kernel gap is pure GPU idle.** - -### One steady decode step (window between decode launches 22413.26 / 22796.74 ms, width 383.48 ms) -Exactly 48 `gated_delta_net` + 16 `flash_attn` = one clean step (48 GDN + 16 attn). -2965 kernels. - -| classification | ms/step | % of step | -|---|---|---| -| (a) inter-kernel LAUNCH gaps + (b) SERIAL-DEPENDENCY stalls (LAG sum, single stream) | **0.225** | **0.06%** | -| (c) within-kernel time (GPU running) | 380.4 | 99.94% | - -Zero gaps > 5 us. Largest single gap 2.40 us. 1260 sub-1us gaps + 1700 back-to-back. -**The decode step is 99.94% GPU-busy. There are no bubbles.** This independently -confirms cudagraph-coverage's ~99.4% and **refutes** roofline-decode's "57 ms = 100% -bubble" and vllm-gdn-compare's "~384 launch bubbles/step on the critical path". -nvidia-smi's "40% util" = low SM/compute efficiency WITHIN kernels (c) (memory-latency- -bound, ~12-16% achieved occupancy), not wall-clock idle. 
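The idle figure in the table is a sum of gaps between consecutive kernels on one stream. A minimal sketch of that computation follows, assuming the kernel records have already been exported as `(start_ms, end_ms)` pairs for a single stream; the sample intervals are made up for illustration and are not taken from the trace.

```python
def idle_gaps(kernels):
    """Gaps between consecutive kernels on ONE stream, in the input's time unit."""
    ordered = sorted(kernels)
    gaps = []
    for (_, prev_end), (start, _) in zip(ordered, ordered[1:]):
        # On a single stream kernels do not overlap, so a negative value
        # would only come from timestamp jitter; clamp it to zero.
        gaps.append(max(0.0, start - prev_end))
    return gaps


def busy_fraction(kernels):
    """Share of the first-start to last-end window spent inside kernels."""
    ordered = sorted(kernels)
    window = ordered[-1][1] - ordered[0][0]
    return 1.0 - sum(idle_gaps(ordered)) / window


# Made-up example: three kernels, one 0.5 ms gap before the last one.
sample = [(0.0, 10.0), (10.0, 20.0), (20.5, 30.0)]
gaps = idle_gaps(sample)
print(f"idle {sum(gaps):.3f} ms, largest gap {max(gaps):.3f} ms, "
      f"busy {busy_fraction(sample) * 100:.2f}%")
```

This only holds for a single-stream step; with multiple streams, kernel intervals would have to be merged before measuring gaps.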
- -### Real decode kernel mix (% of the 380.4 ms step) - corrects the prefill-contaminated kern_sum -| kernel | n/step | ms | % | grid CTAs | waves/48SM | -|---|---|---|---|---|---| -| gated_delta_net_cuda | 48 | **196.37** | **51.6** | 48x128x32 = 196608 | 4096 | -| mul_mat_q (FP4 in/out/qkv/o proj) | 496 | 92.90 | 24.4 | 136 | 1.5 | -| quantize_mmq_nvfp4 | 496 | 17.13 | 4.5 | 483 | 10 | -| nvjet GEMM (lm_head) | 1 | 11.91 | 3.1 | 1944 | 40 | -| flash_attn_ext_f16 (16 attn layers) | 16 | 11.67 | 3.1 | 48 | 1.0 | -| concat_cont (conv-state) | 48 | 8.01 | 2.1 | 20480 | 427 | -| cpy_scalar | 64 | 7.62 | 2.0 | 49152 | 1024 | -| k_get_rows_float | 49 | 7.08 | 1.9 | 15098 | 315 | -| k_bin_bcast (gate mul + add) | 720 | 6.59 | 1.7 | 3169 | 66 | -| ssm_conv_f32 | 48 | 5.64 | 1.5 | 10240 | 213 | -| unary_gated (silu/sigmoid) | 128 | 5.36 | 1.4 | 5888 | 123 | -| mul_mat_q_stream_k_fixup | 304 | 3.94 | 1.0 | 192 | 4 | -| rms_norm_f32 | 209 | 3.52 | 0.9 | 1764 | 37 | -| l2_norm_f32 | 96 | 0.64 | 0.2 | | | -| gdn_gather_nonident | 48 | **0.061** | 0.016 | | | - -- `gated_delta_net` is **51.6% of the step**, the single dominant term. The - previously-cited "1.47 ms/call near-vLLM" was the EAGER average over 1248 calls - (range 0.046-4.42 ms = prefill warmups + capture); true steady decode is - **4.08-4.11 ms/call** (gridY=128 = the 128 seqs). 2.8x higher than believed. -- It launches 196608 CTAs / 4096 waves = NOT occupancy-starved; the cost is - bandwidth-bound state traffic (~384 MB read + ~384 MB write per layer for the - 48-head x 128-seq x [state 128 x head_v 128] recurrent state, ~190 GB/s effective). -- The Lever-3 narrow target (gating glue) = k_bin_bcast 6.59 + silu/sigmoid 5.36 + - l2_norm 0.64 + softplus 0.13 = **12.76 ms = 3.35%** of the step. `gdn_gather` is - **0.06 ms** (negligible - it early-returns on identity ids as predicted). - -### The three answers (with numbers) -1. **Bubbles on the serial GDN critical path?** NO. 
0.225 ms idle/step = 0.06%, - zero gaps > 5 us. CUDA graphs eliminated launch overhead; serial dependencies do - not produce idle (each kernel starts < 1 us after the previous). The premise is - refuted by direct measurement. -2. **Would Lever 3 (fuse the gating chain) shrink the step or overlap away?** It - shrinks it, but only by its hard ceiling **12.76 ms = 3.35%** (380 -> 367 ms, 336 - -> ~348 tok/s, 86% -> 89% of vLLM). It does NOT close the 14% / 53-57 ms gap. - IMPORTANT mechanism correction: the step is single-stream and 99.94% busy, so - there is NO overlap to absorb freed time (the lever3-design RISK #1 "same trap as - P2a if overlapped" does NOT apply - nothing overlaps). So removing those kernels' - GPU-time DOES cut wall-clock - but the win is removing their HBM byte traffic, NOT - launch bubbles (there are none). And the value is the measured ~12.76 ms, not the - "~288 launch bubbles" framing (those launches cost ~0 inside the graph). This also - explains P2a/Lever-2 flatness correctly: NOT "overlapped busy-time" (no overlap), - but P2a tuned the prefill large-M GEMM (decode GEMMs are 136-CTA tail-bound, untouched) - and Lever-2 relocated mandatory quantize work into the GEMM prologue (net zero). -3. **Do CUDA graphs cover the GDN region at B=128?** YES, fully. Whole step = one - graph, 23 replays, ~0.2 ms host gap between steps. `gdn_gather_nonident` and the - in-place state ops are graph-internal nodes (graphNodeId != 0); no fragmentation. - Confirms cudagraph-coverage. Note: lever #2 from vllm-gdn-compare ("CUDA-graph the - decode step") is ALREADY IN EFFECT in this build and did not close the gap - so it - is spent, not pending. - -### Verdict against roofline-decode's own sizing test -roofline-decode stated: "if critical-path gaps total < 57 ms, parity is NOT reachable -via GDN-gate fusion alone and the gap is elsewhere (GDN core kernel slower than vLLM -fused_recurrent)." 
**Measured gaps = 0.225 ms << 57 ms.** Therefore, by that test, the -53-57 ms / 14% gap is NOT bubble and NOT closable by gating fusion. It lives in -**kernel GPU-time**, dominated by the `gated_delta_net` recurrence (51.6%, bandwidth- -bound) and secondarily the FP4 GEMM + quantize stack (29%). The "57 ms = 100% bubble" -roofline conclusion was an inference from the prefill-contaminated GPU-busy sum -(~555 ms vs 384 ms "implies overlap"); the node-level decode-only measurement shows -per-step GPU-busy = wall (no overlap), so that inference does not hold. - -### Recommendation (resized) -- The real lever is the `gated_delta_net` recurrence kernel itself (196 ms, 51.6%): - match vLLM's `fused_recurrent_gated_delta_rule_packed_decode` (vllm-gdn-compare - kernel #4) which folds l2norm + gate + decay + recurrence + state-writeback into a - SINGLE pass over the state, reducing HBM round-trips of the state. The win is byte - reduction in a memory-bound single-stream step, not bubble removal. -- The lever3-design fusion is still worth doing as a component of that (it removes - ~12.76 ms = 3.35% of real byte traffic, and unlike its own RISK section feared, it - will NOT be flat because there is no overlap), but on its own it is a ~3% lever, not - the gap-closer. Build it folded into a single-pass recurrence kernel, not as an - isolated gate fold. -- Next decisive measurement (future GPU-agent run): profile vLLM's decode step at - npl128 with the same node-level method and compare per-region GPU-time (GDN - recurrence vs GEMM vs attention) to localize exactly where vLLM spends its 53-57 ms - less. Both engines move near-identical bytes only if vLLM's fused recurrence does - not re-stream state; the per-kernel A/B will show whether the gap is the recurrence - pass or the GEMM/quantize stack. 
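The state-traffic estimate behind this recommendation can be checked with a few lines of arithmetic. The sketch below assumes the shapes quoted earlier in this section (48 heads, 128 sequences, a 128 x 128 f32 state per head and sequence) and exactly one read plus one write of the state per call; `gdn_state_traffic` is a hypothetical helper, and the result is an estimate, not a measured DRAM counter.

```python
def gdn_state_traffic(
    heads: int = 48,
    seqs: int = 128,
    state_k: int = 128,
    state_v: int = 128,
    bytes_per_elem: int = 4,           # f32 state
    ms_per_call: float = 196.37 / 48,  # steady decode time per GDN call
):
    """Return (state size in MiB, effective GB/s for one read + one write)."""
    state_bytes = heads * seqs * state_k * state_v * bytes_per_elem
    state_mib = state_bytes / 2**20
    # Assumption: the kernel streams the state once in and once out per call.
    moved_bytes = 2 * state_bytes
    effective_gbps = moved_bytes / (ms_per_call * 1e-3) / 1e9
    return state_mib, effective_gbps


state_mib, effective_gbps = gdn_state_traffic()
print(f"state per layer: {state_mib:.0f} MiB, effective: {effective_gbps:.0f} GB/s")
```

The state size comes out at 384 MiB per layer, and the effective rate lands in the 185-200 GB/s range depending on whether MB or MiB is used, consistent with the "~190 GB/s effective" figure. Only a measured byte count (the `ncu` run proposed in these notes) can say whether the real kernel moves more than this minimum.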
- -Assisted-by: Claude:opus-4.8 [Claude Code] - ---- - -## SYNTHESIS (final) - the validated decode-parity picture, ranked plan, and verdict - -Reconciles all six investigation sections above plus the three adversarial verdicts -(Verify A/B/C). One sentence: **the "~60% idle" never existed; the decode step is -99.94% GPU-busy single-stream, so the 14% gap to vLLM is kernel GPU-time, dominated by -the bandwidth-bound `gated_delta_net` recurrence (51.6%), and the only gap-closing levers -are byte-reduction inside that kernel - NOT launch-bubble removal.** - -### 1. The proven critical-path decomposition of the decode step - -Decisive node-level trace (`nsys --cuda-graph-trace=node`, clean Lever-1 build df1cc97, -q36-27b-nvfp4 dense, npl128, GB10/48SM/sm_121, commit a7238525, nsysgap.sqlite). One -steady step = single replayed CUDA graph (graphId=11, 23 replays), all 2965 kernels on -ONE stream (stream 14, strictly serial -> every inter-kernel gap is pure idle). Window -383.48 ms. - -BUBBLE CLASSIFICATION (the "where is the ~60% idle" answer - it is NOT idle): - -| bucket | ms/step | % step | note | -|---|---|---|---| -| (a) inter-kernel launch bubbles | ~0 | ~0 | graph replay collapses host launch latency | -| (b) serial-dependency stalls (GDN chain) | included in 0.225 | 0.06 | each kernel starts < 1 us after prev; zero gaps > 5 us, max 2.40 us | -| (a)+(b) total exposed idle (LAG sum) | **0.225** | **0.06%** | 1700 kernels back-to-back | -| (d) between-step HOST gap (cgraph rebuild, new uid) | ~0.2 | ~0.05 | the ONLY graph-non-covered idle; ~0.4% in older eager-tail traces | -| (c) within-kernel GPU-busy | **380.4** | **99.94%** | this is the whole step | - -The nvidia-smi "40%" is within-kernel SM/bandwidth efficiency (~12-16% achieved -occupancy on memory-latency-bound kernels), NOT wall-clock idle. 
- -KERNEL GPU-TIME DECOMPOSITION of the 380.4 ms busy step (this is where the gap lives): - -| kernel | ms | % step | regime | -|---|---|---|---| -| `gated_delta_net_cuda<128>` (48x, 4.08 ms/call) | **196.37** | **51.6** | bandwidth-bound f32 recurrent-state R+W (~384 MB R + 384 MB W/layer) | -| `mul_mat_q` FP4 GEMM (496x) | 92.90 | 24.4 | memory-bound weight stream, 136-CTA tail-bound at decode | -| `quantize_mmq_nvfp4` (496x) | 17.13 | 4.5 | mandatory act-quant (Lever-2 only relocated it) | -| `nvjet` lm_head GEMM | 11.91 | 3.1 | | -| `flash_attn_ext_f16` (16 attn layers) | 11.67 | 3.1 | | -| `concat_cont` (conv-state splice) | 8.01 | 2.1 | Lever-1 target | -| `cpy_scalar` (conv-state writeback + dup) | 7.62 | 2.0 | Lever-1 target (the conv-state share) | -| `k_get_rows_float` | 7.08 | 1.9 | | -| `k_bin_bcast` (gate mul + add) | 6.59 | 1.7 | Lever-3 gate-fold target (partial - rest is residual adds) | -| `ssm_conv_f32` | 5.64 | 1.5 | folds into Lever-1 | -| `unary_gated` (silu/sigmoid) | 5.36 | 1.4 | mostly FFN + output-gate (Lever 3 does NOT touch) | -| `mul_mat_q_stream_k_fixup` | 3.94 | 1.0 | | -| `rms_norm_f32` | 3.52 | 0.9 | | -| `l2_norm_f32` | 0.64 | 0.2 | Lever-3 gate-fold target | -| `gdn_gather_nonident` | 0.061 | 0.016 | negligible (early-returns on identity ids) | - -GDN region (recurrence + conv + concat + cpy + gather + l2norm) >= 210 ms = 55%+ of the step. -The widely-cited "gated_delta_net 13%, 1.47 ms/call near-vLLM" from nsysab_new.kern.txt was -PREFILL + the single eager capture step contaminating the average over 1248 calls (range -0.046-4.42 ms); true steady decode is 4.08 ms/call, 2.8x higher, 51.6% of the step. - -### 2. Claims A / B / C: which HOLD, which are REFUTED, and the residual uncertainty - -**CLAIM A** ("the ~60% decode GPU-idle is inter-op launch bubbles ON the serial GDN -chain"): **REFUTED.** Measured idle = 0.225 ms = 0.06%, not the ~53-57 ms the claim -requires (two-plus orders of magnitude short). 
Zero gaps > 5 us; CUDA-graph replay -already collapsed launch latency; serial data-dependency does NOT equal idle when the -graph dispatches nodes back-to-back. The "40%" was a misread of within-kernel SM -efficiency; the "555 ms busy-sum > 384 ms wall implies overlap" was a prefill-contaminated -`--trace=cuda` artifact (each step recorded as one opaque ~380 ms block). - -**CLAIM B** ("Lever 3 - gate fusion - moves the wall, unlike P2a/Lever-2, by removing -serial launch bubbles"): **REFUTED on mechanism.** (i) There are no bubbles to remove -(0.06%). (ii) The contrast is fictional: the step is single-stream with ZERO overlap -anywhere, so P2a/Lever-2 were NOT flat because they "optimized overlapped work" - P2a -tuned the prefill large-M GEMM (decode GEMMs are a different 136-CTA tail regime) and -Lever-2 merely relocated mandatory quantize work into the GEMM prologue (net zero). -(iii) Where the claim is trivially true (any kernel removal cuts wall in a 99.94%-busy -single-stream step), the slice Lever 3 actually fuses ceilings at **12.76 ms = 3.35%** -(k_bin_bcast 6.59 + silu/sigmoid 5.36 + l2_norm 0.64 + softplus 0.13 - and even that -over-counts, since silu is mostly untouched FFN/output-gate). So the wall DOES move, but -only ~3% (380 -> ~367 ms, 86% -> ~89% of vLLM), and NOT for the claimed reason. Lever 3 -is a component, not the gap-closer. - -**CLAIM C** ("the residual gap is software-closable LATENCY, not a GB10 hardware floor"): -**REFUTED as worded** (no latency, no idle to close - same data as A). The "not a hardware -floor" half is **UNSETTLED, not proven.** vLLM hits 327 ms on the same silicon, so it is -not an absolute hard floor - but whether the dominant 51.6% `gated_delta_net` term is -software-closable in BIT-EXACT form turns on one unmeasured quantity (below). 
- -RESIDUAL UNCERTAINTY (the single open question that decides everything): -- **The DRAM byte-traffic ratio of llama's recurrence vs vLLM's.** Every section above - ESTIMATED the GDN state bytes (~190 GB/s effective, ~70% of 273 GB/s peak); none MEASURED - it. If llama's `gated_delta_net_cuda<128>` moves ~2x the minimal (s0-read + s1-write) - bytes because the un-fused gate/l2norm/writeback/gather ops re-stream state through HBM, - then the 51.6% is software-closable by a single-pass fused recurrence (Claim C spirit - HOLDS). If llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same, - the recurrence is at the GB10 LPDDR5x floor for this state size -> the gap is a - hardware/architecture floor and is NOT closable in bit-exact form (Claim C REFUTED on - both halves). This is the one measurement that converts the verdict from "refuted as - worded" to a definitive yes/no. -- **The MoE model (qwen35moe) is untested.** At B=128 MUL_MAT_ID can trip - [TAG_MUL_MAT_ID_CUDA_GRAPHS] (`ne[2] > mmvq_mmid_max`) and disable the WHOLE MoE-decode - graph into eager, where the ~3100 per-step launches re-dispatch serially on the Grace - cores and inter-op bubbles WOULD reappear. For MoE only, Claim A could partially hold. - The dense 335 tok/s headline is fully settled. - -### 3. Ranked implementation plan for the remaining ~14% (57 ms/step, 384 -> 327) - -Every win must come from kernel GPU-time (bytes), because bubbles = 0 and both engines -share identical bandwidth/compute floors. Ranked by expected recovery. 
- -| # | Lever | ms/step recovered | -> % of vLLM | bit-exact | tractability | gate | -|---|---|---|---|---|---|---| -| **1** | **Single-pass fused GDN recurrence** (fold l2norm+gate+decay+recurrence+state-writeback+gather into ONE pass over state, mirroring vLLM `fused_recurrent_gated_delta_rule_packed_decode`) - cuts state HBM round-trips | **0 to ~40** (= the byte-delta; UNKNOWN until ncu) | 86% -> up to ~98% | near (l2norm reduction; KL < ~1e-3) | HIGH (kernel rewrite) | **ncu byte-ratio test FIRST** | -| 2 | **Conv-state concat -> ssm_conv fusion** (Lever 1): pass conv-state + new token as separate srcs, update conv state in place (vLLM `causal_conv1d_update`); removes concat_cont + the conv-state cpy | **~8-12** (concat 8.01 + cpy share of 7.62) | +2-3% | YES | MEDIUM | no-regret, build regardless | -| 3 | **Gate-chain fold** (Lever 3 as designed): sigmoid-beta + softplus+dt+ssm_a gate + q/k l2norm into the recurrence kernel | **~12.76 ceiling** (3.35%) - but SUBSUMED by #1 | +3% | near (l2norm) | MEDIUM | build as a COMPONENT of #1, not standalone | -| 4 | **bf16 recurrent + conv state** (Lever 5): halve the 196 ms recurrence + conv traffic; keep f32 in-register accumulation | **~70-90** (if floor-bound) | could reach/exceed parity | NO (parity-tolerance decision; must match vLLM stored dtype) | HIGH (rewrite + parity validation) | the ONLY lever that moves the floor kernel; separate precision track | -| 5 | gdn_gather skip-launch at steady decode | ~0.06 | ~0 | YES | trivial | not worth it (micro) | -| 6 | GDN occupancy split | 0 | 0 | - | - | NOT a lever: 196608 CTAs / 4096 waves, already saturated, bandwidth-bound | -| 7 | quantize_mmq attack (Lever 2) | 0 | 0 | - | - | SPENT - relocated mandatory work, proven flat | -| 8 | decode CUDA-graph capture | 0 | 0 | - | - | SPENT - ALREADY in effect (graphId=11), did not close gap | -| 9 | persistent cgraph (uid fast-path) | ~0.2 (0.05-0.4%) | ~0 | YES | MEDIUM | second-order to the SSM floor | - -Levers 1, 3, 
and the gather of #5 are the SAME kernel rewrite: build them together as a -single-pass recurrence. Levers 6/7/8 are dead (at-floor or already-shipped). Lever 4 is a -distinct, bit-exactness-breaking precision track. - -### 4. The honest verdict and the single highest-value next step - -**Is true (bit-exact) decode parity reachable?** UNCERTAIN, and it hinges entirely on the -unmeasured byte ratio: -- If llama's recurrence re-streams state (~2x bytes from un-fused ops): YES - a single-pass - fused recurrence (Lever 1) plus conv fusion (Lever 2) plausibly recover ~20-40 ms, taking - llama to ~345-365 ms = ~90-95% of vLLM, near-bit-exact (gate on KL tolerance). -- If llama is already at the GB10 bandwidth floor for f32 state: NO in bit-exact form - the - 57 ms is a hardware floor, and only bf16 state (Lever 4, non-bit-exact) closes it. - -Either way, the gating-fold-alone path tops out at ~89% of vLLM, so the project should NOT -ship the isolated gate fold as "the parity lever." - -**SINGLE highest-value next IMPLEMENTATION step:** build the **single-pass fused GDN -recurrence kernel** (Lever 1 = fold gate + l2norm + state-writeback + gather into one pass -over the recurrent state) - BUT gate the build on one cheap measurement first, because it -is a HIGH-effort kernel rewrite that is worthless if the recurrence is already byte-minimal. - -**The measurement that confirms it before over-investing (one short GPU run, gap-analysis -agent only):** `ncu` on `gated_delta_net_cuda<128>` at B=128 vs vLLM's -`fused_recurrent_gated_delta_rule_packed_decode_kernel` for identical layer dims, two -counters: -- `dram__bytes.sum` (actual DRAM bytes/call) -- `dram__throughput.avg.pct_of_peak_sustained_elapsed` (achieved % of 273 GB/s) - -Decision rule: -- llama moves ~2x minimal bytes OR vLLM moves materially fewer for the same math -> redundant - un-fused state round-trips -> BUILD the single-pass fused recurrence; predicted recovery - scales with the byte delta (up to ~40 ms). 
This is the gap-closer. -- llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same -> the - recurrence is at the GB10 hardware floor -> do NOT build the fusion for throughput (only - the ~3% gate-fold ceiling remains); the sole remaining lever is bf16 state (Lever 4, - accept non-bit-exact), and bit-exact parity is NOT reachable. - -**No-regret parallel work** (build regardless of the ncu outcome, bit-exact, medium effort): -the conv-state concat -> ssm_conv in-place fusion (Lever 2, ~8-12 ms = +2-3% toward parity), -which removes concat_cont (8.01 ms) and the conv-state writeback cpy off a bandwidth-bound, -single-stream step where their full GPU-time is wall-clock. - -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/DECODE_GAP_STUDY.md b/backend/cpp/llama-cpp/patches/paged/DECODE_GAP_STUDY.md deleted file mode 100644 index 34b271dc7..000000000 --- a/backend/cpp/llama-cpp/patches/paged/DECODE_GAP_STUDY.md +++ /dev/null @@ -1,185 +0,0 @@ -# llama-server vs vLLM: decode-step gap decomposition (DGX Spark, GB10 / sm_121) - -Profiling study (no engine changes). Question: matched apples-to-apples (both -batched servers, NVFP4-class weights, prefix caching on, both eager), why is -`llama-server` ~4-6x slower **per decode step** than vLLM on Qwen3-32B at a -1024-token shared-prefix / batch-32 fan-out, and what is closable vs structural. - -Hardware: NVIDIA GB10 (sm_121), unified LPDDR5X. Model: Qwen3-32B, 64 layers. -llama side: `~/llama-paged-dev/build-cuda/bin/llama-server`, `q3-32b-nvfp4-dense.gguf` -(NVFP4 weights, type-40 FP4-MMA path), `-ngl 99 --parallel 32 -c 40960 -fa on`, -`GGML_CUDA_DISABLE_GRAPHS=1` (eager). vLLM 0.23.0 NVFP4A16 (W4A16/Marlin), -`--enforce-eager`. Workload: 1024-token shared prefix + unique 32-token suffix, -K=32 concurrent, generate 64. 
All profiling scripts are dev-tree only -(`~/bench/decode_study/`); minimal in-code timers were not needed (server already -reports per-slot `eval time`, which excludes prompt-eval = pure decode). - -## TL;DR - -1. **The real-server decode is GPU-BOUND, not host-bound.** During steady decode - the GPU is **~94.6% utilized** (nvidia-smi, real run) / 85-95% busy (nsys). - Per-slot CPU sampling, detokenize, and `update_slots` are fully hidden: a 5-stage - sampler chain gives the *identical* step time as greedy (1346 vs 1343 ms). The - "GPU stalls on the CPU serving loop" hypothesis is **refuted** for this workload. -2. **At 1024 context the decode step is ~84% KV/attention, ~16% weight GEMM** - the - opposite of the thin-batch-GEMM story. Attention scaling with context length, not - the matmul, is the load-bearing cost. -3. **The worktree's paged KV engine is a decode REGRESSION: ~1.85x slower than - stock** at 1024 ctx (paged 1279-1343 ms/step vs stock 650-729 ms/step). It - gathers K/V/mask into a contiguous buffer (`ggml_get_rows`) every layer every - step, then runs a dense FA kernel - paying a full extra KV read+copy that vLLM's - in-kernel PagedAttention never pays. Paging helps prefix-prefill memory; it hurts - decode latency. -4. Even **stock** llama-server (~650-729 ms/step) is **~4-5x slower than vLLM** - (~120-185 ms/step). The residual gap is the **long-context decode-attention - kernel** and, secondarily, the **thin-batch FP4 weight GEMM** - both kernel-maturity - gaps vs vLLM's FlashInfer/FA paged-decode + Marlin, not serving-loop gaps. - -## The measured numbers (batch 32, server-reported pure-decode step time) - -`server_decode_step_ms` = max / mean-of-top-8 of per-slot `eval time ms-per-token` -(the most-contended, full-batch-32 slots; excludes prompt eval). 
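As a minimal sketch of that metric, assuming the per-slot `eval time` ms-per-token values have already been parsed out of the server log into a list (`decode_step_ms` and its input are hypothetical, not part of the dev-tree scripts):

```python
from typing import List, Tuple


def decode_step_ms(per_slot_ms: List[float], top_k: int = 8) -> Tuple[float, float]:
    """Return (max, mean of the top_k slowest slots) of per-slot ms-per-token."""
    ranked = sorted(per_slot_ms, reverse=True)
    top = ranked[:top_k]
    return ranked[0], sum(top) / len(top)
```

Taking the slowest slots rather than the mean over all 32 keeps the figure tied to slots that decoded while the full batch was active.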
- -| config | decode step ms (max / top8) | client wall ms/step | -|------------------------------------------|-----------------------------|---------------------| -| paged, ctx 1024, greedy | 1343 / 1279 | 1468 | -| paged, ctx 1024, **heavy 5-sampler** | 1346 / 1280 | 1470 | -| **stock** (no paging), ctx 1024, greedy | **729 / 650** | 768 | -| paged, **ctx 64** (short), greedy | **215 / 215** | 253 | -| vLLM NVFP4A16, ctx 1024 (K=32) | **~120-185** (270 tok/s) | - | - -The brief's reference ~828 ms/step sits between the stock (650-729) and paged -(1279-1343) numbers measured here; the decomposition below is what is robust. Our -fan-out shares no prefix across the 32 slots (each slot independently prefills 1056 -tokens - confirmed in the log), so the 32 sequences are genuinely concurrent and the -"max" slot is maximally contended, which is why our paged max runs a little above 828. - -### Context sweep - decode step is attention-scaling, not fixed overhead - -Pure-decode step vs shared-prefix length (paged, batch 32): - -| prefix ctx | decode step ms | -|-----------|----------------| -| 64 | 215 | -| 128 | ~290 | -| 256 | ~410 | -| 512 | ~660 | -| 1024 | ~1280 | - -Roughly linear in context length: ~1 ms of added step time per added context token. -The **215 ms at ctx 64 is the fixed floor** (weight GEMM + activations + norm/rope + -loop + sampling, attention negligible). Everything above it scales with KV length = -attention + KV plumbing. At 1024 ctx the fixed floor is only ~16% of the step. - -## Where the ~1280 ms paged decode step goes (nsys, pure-decode window) - -`nsys profile --delay=70 --duration=25 --trace=cuda` windowed onto steady 32-way -decode (`srv_decode2.nsys-rep`; an earlier 25-60s window was discarded because nsys's -own slowdown stretched the 32 prefills into it, inflating GEMM to a misleading 58%). -GPU busy in-window 85.5% (nsys adds gaps; the real run is ~94.6% by nvidia-smi). 
- -| bucket | % GPU time | abs (of ~1280 ms) | what it is | -|--------------------------------|-----------:|------------------:|------------| -| `flash_attn_ext_f16` ATTENTION | **47.7%** | ~610 ms | decode attention over the 1056-cell KV | -| `cpy_scalar` KV copy/cast | 18.3% | ~234 ms | KV write + f32->f16 casts | -| `get_rows/set_rows` KV gather | 17.8% | ~228 ms | **paged** gather of K/V/mask to contiguous | -| `mul_mat_q` + `quantize_mmq` | 15.7% | ~201 ms | NVFP4 weight GEMM (+ activation requant) | -| rmsnorm / silu / rope / add | ~0.6% | ~8 ms | elementwise | - -Cross-check: the GEMM bucket (~201 ms) matches the ctx-64 floor (215 ms) - i.e. the -weight matmul is ~the entire short-context step, and is context-independent, as -expected. KV/attention buckets (47.7+18.3+17.8 = **83.8%**) match the context-sweep -finding that ~84% of the step scales with context. - -Power signature: ~33-36 W at 94% "utilization" (GB10 can pull far more). High util% -+ low power = the kernels are **memory/latency-bound, not compute-saturated** - the -classic decode signature (stream 19 GB of NVFP4 weights + a growing KV every step). - -### Stock vs paged decomposition - -- **Stock** (~650 ms): ~215 ms GEMM floor + ~435 ms attention/KV (contiguous KV read - directly by the FA kernel, **no gather**). -- **Paged** (~1280 ms): same ~215 ms floor + ~610 ms attention + **~455 ms paged - gather/copy overhead** (the `get_rows` of K/V/mask plus the extra KV copy that - feeds the dense FA kernel). That ~455 ms (~36% of the step) is the paged engine's - self-inflicted cost and is the entire ~1.85x stock->paged regression. - -## vLLM decode architecture mapped onto each llama bucket - -vLLM at ~120-185 ms/step is faster on **every** bucket: - -| llama bucket (paged) | ms | vLLM equivalent | does vLLM avoid it? 
| -|-----------------------------|-------|-----------------|---------------------| -| paged KV gather (get_rows) | ~228 | PagedAttention reads blocks **in-kernel** via a block table | **Yes - entirely.** No gather op exists. | -| KV copy/cast | ~234 | KV written once into block pool; FA reads it in place | Mostly - no per-step recopy | -| decode attention | ~610 | FlashInfer / FA paged-decode GQA kernel, split over KV | Same op, far faster kernel on sm_121 | -| weight GEMM + act quant | ~201 | fused Marlin/Machete W4A16 dequant+MMA, no separate quant pass | Faster + removes the requant kernel | -| CPU sampling / loop | ~0 (hidden) | on-GPU batched sampling | N/A here - already hidden on llama side too | - -vLLM's whole-step (~150 ms) is **less than llama's GEMM floor alone (~215 ms)**, so -vLLM is ahead on the matmul *and* the attention *and* avoids the gather. The gap is a -stack of kernel-efficiency wins, not one silver bullet. - -## Ranked levers - closable vs structural - -1. **Remove the paged gather regression. [Tractable, ~455 ms / ~36% on the paged - path; net-zero risk - it is a regression]** The worktree's paged engine makes - decode 1.85x slower than stock by gathering K/V/mask to contiguous every layer - every step (patch 0003 `ggml_get_rows`). For latency-bound decode, **do not enable - paged KV** - it only ever helps prefix-prefill *memory*, never decode latency. - Fully recovering this *and* keeping paging requires reading paged blocks - in-kernel like vLLM (a from-scratch paged-attention CUDA kernel) - see lever 2. - -2. **Long-context decode-attention kernel. [Biggest real lever, ~435 ms of stock / - ~610 ms of paged; partly structural]** Even stock is attention-bound at 1024 ctx. - llama.cpp's `flash_attn_ext_f16` decode path is ~4-5x slower than vLLM's - FlashInfer/FA paged-decode GQA kernel on this Blackwell-class part. This is the - cost that *grows with context* - exactly the regime the brief targets. 
Tractable in - principle (a proper flash-decoding / split-K-over-KV kernel, and a true in-kernel - paged read that also kills lever 1's gather), but it is deep CUDA work on a new - arch and partly gated by kernel maturity on sm_121. **Highest-impact, hardest.** - -3. **Thin-batch FP4 weight GEMM floor. [Tractable, ~201-215 ms / 15-30%; bounded]** - The NVFP4 `mul_mat_q` + separate `quantize_mmq` activation pass is memory-bound and - less efficient than vLLM's fused Marlin/Machete W4A16. Fusing dequant into the MMA - and folding the activation quant into the GEMM is tractable kernel work. Bounded - impact: the floor cannot drop below weight-read-bound (~19 GB / HBM BW per step). - -4. **Host serving loop / per-slot sampling. [NOT a lever]** Measured zero: greedy == - heavy-sampler step time; GPU 94.6% busy. On-GPU/batched sampling buys nothing until - the kernels (levers 1-3) get fast enough to expose host overhead. Refutes the - "host-bound serving loop" hypothesis for this decode-bound workload. - -5. **Continuous-batch scheduler. [NOT the gap / structural elsewhere]** llama-server - already fuses all 32 slots into one decode step (one set of kernels per step over - batch 32 - confirmed in the trace). vLLM's continuous/chunked-prefill batching wins - on *mixed* prefill+decode overlap, but the steady decode-step gap measured here is - kernel-bound, not scheduler-bound. - -## Honest bottom line - -The ~4-6x per-step gap is **GPU-kernel-bound**, and it decomposes as: - -- ~36% of the *paged* step is a **self-inflicted gather regression** - remove it - (don't run paged for decode-latency workloads). -- The remaining ~4-5x vs vLLM (true even for stock) is **kernel efficiency**: - llama.cpp's long-context decode-attention and thin-batch FP4 GEMM are slower than - vLLM's PagedAttention + Marlin on GB10. That is a **kernel project** (in-kernel - paged attention + flash-decoding + fused W4A16 GEMM), not a serving-loop project. 
-- Sampling, detokenize, `update_slots`, and the continuous-batch scheduler are **not** - the gap; the GPU is ~95% busy on memory-bound kernels the whole step. - -What is closable: lever 1 (immediately, by not paging), lever 3 (bounded, with kernel -work). What is structural / hard: lever 2 (the decode-attention kernel + a real -in-kernel paged read), which is where the context-scaling gap actually lives and where -any serious effort to approach vLLM on GB10 must go. - -## Reproduction (dev-tree only, `~/bench/decode_study/`) - -- `launch_srv.sh` / `runcfg.sh` - launch llama-server (paged on/off) and a config. -- `client.py` - K=32 token-id fan-out (1024 prefix + 32 suffix), `SAMP=greedy|heavy`. -- `d2drv.sh` - nsys pure-decode window (delay 70s past prefill) -> `srv_decode2.nsys-rep`. -- `cat2.py` - kernel-time categorization from the sqlite export. -- vLLM side: `~/bench/run_vllm.sh` + `vllm_prefix.py` (K=32, ~270 tok/s). - - diff --git a/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md b/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md deleted file mode 100644 index 086f022e6..000000000 --- a/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md +++ /dev/null @@ -1,756 +0,0 @@ -# Decode parity exploration (post-SSM-fix) - per-agent findings - -Post-SSM-fix decode (patches 0018 in-place state write-back + 0019 fused gather): -dense q36-27b-nvfp4 decode_agg = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact. -The remaining +54% to parity is the question each section below probes. All numbers -DGX GB10 (sm_121), fusion OFF baseline, `decode_agg` = `S_TG t/s`. - ---- - -## Section: per-token-latency (critical path / host-loop) - READ-ONLY - -**Verdict: the per-step critical path and host loop are NOT the residual lever. 
-Post-SSM the GPU is still ~99% busy at npl128; the entire exposed-idle budget is -~0.65% of the step (~2.4 ms), of which graphs already remove the within-step half -(0.34%) and the between-step host gap is ~2 ms/step (~0.4% post-SSM). The 64-layer -sequential chain does NOT under-fill the GPU at batch 128 - every kernel's grid -saturates the SMs on its own. The +54% to parity is GPU kernel work (FP4 GEMM -efficiency + LPDDR5x weight bandwidth), not serialization or host overhead.** - -### 1. Measured exposed-idle structure (a2_nsys pre-SSM rep, read-only sqlite sweep) - -`paged_off_npl128.sqlite`, steady window 40-97% of trace (14.78 s, ~16.5 decode -steps at the pre-SSM ~896 ms/step). Overlap-correct interval-union sweep: - -| activity set | busy % | exposed idle | -|-------------------------|---------|--------------| -| kernels only | 80.25% | 19.74% | -| kernels + memcpy (all) | 99.35% | **0.65%** | - -- The 19.4% kernels-only gap = **841 big gaps (median 3.35 ms, ~51/step)** that are - filled by D2D memcpy. These ARE the per-layer gated-DeltaNet recurrent-state copies - (the `gated_delta_net -> ggml_cpy(state->cache) -> next layer reads state` chain). - They were a real critical-path serialization, and **patches 0018/0019 removed exactly - these** (D2D bucket 18.9% -> 0.23%; get_rows gather 18.8% -> 0.7%). Decode rose - +37.8% (186 -> 256 t/s), ~matching the work removed -> the kernels reflowed - back-to-back, so post-SSM these big gaps are CLOSED, not re-exposed (inferred from - the throughput scaling; the post-SSM nsys was not re-profiled by this read-only agent). -- The TRUE exposed idle (kernel+memcpy union) is **0.65%**: 18 host gaps >=0.5 ms, - **median 2.06 ms, max 2.85 ms, ~1.1/step**. This is the single between-step host gap - (sample-128 + `update_slots` + next-batch build) that does NOT overlap GPU compute. 
-- Within-step launch gaps: 24,190 micro-gaps, median 2.14 us, summing to 50.6 ms = - **0.34%** of the window - the pure launch overhead that CUDA graphs collapse - (measured 0.37% -> 0.11% in A2_CUDAGRAPH_DECODE; graphs already engage on the - default paged decode with a 256-token reset cadence). - -### 2. Post-SSM scaling of the FIXED host gap - -The ~2 ms/step between-step host gap is FIXED work (independent of GPU kernel time). -As decode accelerated it grew only as a fraction of a shrinking step: - -| build | step ms @npl128 | host gap | host gap % of step | -|---------------|-----------------|----------|--------------------| -| pre-SSM (146) | ~877 | ~2 ms | 0.24% | -| post-SSM (256)| ~499 | ~2 ms | **~0.40%** | -| vLLM (391) | ~328 | (n/a) | (would be ~0.6%) | - -Even fully removing it (perfect overlap) buys ~0.4%. It is a second-order floor, not -the lever - it only becomes material once the kernels are fast enough to drop GPU-busy -below the host time, which is not the case at 65% of parity. - -### 3. The 64-layer chain does NOT under-fill the GPU at batch 128 - -The decode is an intrinsically sequential depth-64 chain (autoregressive: layer N -needs layer N-1; cannot be parallelized across layers). The question is whether each -individual kernel fills the SMs at batch 128. It does: - -- **GDN kernel** (`gated_delta_net.cu`): launch grid `dim3(H, n_seqs, ceil(S_v/4))` - = `48 x 128 x 32 = 196,608 blocks` (dense, H=48 value heads, S_v=128). Block - `(warp_size, 4, 1)`. Massively oversubscribes the GB10 SMs. Each warp loads its - state shard into registers once and runs a single `n_tokens==1` iteration - O(1) in - context (confirmed flat across 4x ctx in GDN_DECODE_VERIFY). -- **FP4 GEMM** (`mul_mat_q`, mmq_x=128): M=128 token tile, well into the M-batched - regime, full SM occupancy (and Track B P2a already showed it goes 2 CTA/SM). 
-- The 99.35% kernel+memcpy busy reading IS the direct proof there is no under-fill at - npl128: if the chain under-filled, busy% would be well below 99%. - -Under-fill only appears at LOW batch (npl32/npl4), where it manifests as the -weight-bandwidth/GEMV regime (npl32 = 170 t/s vs npl128 = 256): fewer tokens amortize -the same per-step weight read, NOT idle SMs. That is a bandwidth floor, not a -host/scheduler problem. - -### 4. What the host actually does per step (eager rep runtime API) - -Steady-window `CUPTI_ACTIVITY_KIND_RUNTIME` totals (host-thread wall, overlaps GPU): - -| API | n | total | avg | -|---------------------------|-------|---------|---------| -| cudaStreamSynchronize | 1723 | 7775 ms | 4513 us | -| cudaLaunchKernelExC | 30983 | 4045 ms | 131 us | -| cudaLaunchKernel | 20385 | 2694 ms | 132 us | -| cudaMemcpyAsync | 2085 | 96 ms | 46 us | - -~104 stream-syncs/step and ~3100 kernel launches/step in eager mode (collapsed by -graphs to ~900 launches/step). The 7.8 s of sync is the host BLOCKING on the busy -GPU (it overlaps GPU compute, it is NOT exposed idle) - the GPU stays 99.4% busy. The -sampled-token path is `cudaMemcpyAsync` (96 ms total, negligible, non-blocking). The -only NON-overlapped residue is the ~2 ms/step between-step gap in section 1. - -### 5. vLLM host-loop comparison (per VLLM_DECODE_GROUNDING.md) - -vLLM's eager decode is host-cheap BY CONSTRUCTION and hides the host fully behind the -async CUDA stream WITHOUT pipelined scheduling (`async_scheduling` was OFF; it won the -2.4x with synchronous scheduling): persistent pre-allocated input buffers updated by -vectorized numpy (no per-token Python), attention metadata `build()` once per step -reused across all layers, no GPU->CPU sync in the hot path, sampled-token D2H -non-blocking + event-gated, and a fixed small launch sequence (~2 ops/Linear). The -next-step host prep overlaps the current-step GPU compute on the async stream. 
The key -asymmetry vs llama: vLLM builds its graph ONCE and reuses persistent device -KV/block metadata; ggml rebuilds/reallocates the cgraph each decode step (new -`cgraph->uid`) and re-dispatches ~3100 launches from the loop on the weak Grace cores. - -But this asymmetry is hidden under GPU compute on BOTH sides at npl128: llama's host -loop is a 0.4% exposed gap, not a 2x lever. vLLM's host cheapness is why ITS step is -328 ms host-free, but llama's 499 ms is also ~99% GPU - the 171 ms difference is GPU -kernel time (FP4 GEMM), not host. - -### 6. Is any host/serialization lever CUDA-graph or scheduler addressable? - -- **Within-step launch idle (0.34%)**: CUDA-graph addressable, ALREADY captured by - default (0.37 -> 0.11%). Worth ~0% of decode_agg (measured +0.1-0.8%, noise). - Nothing left to win here. -- **Between-step host gap (~2 ms, ~0.4%)**: NOT removed by a graph (the graph replays - the forward; the host still samples + runs `update_slots` + rebuilds the batch - between replays). It is SCHEDULER addressable - overlap step N+1's host prep with - step N's GPU compute, mirroring vLLM's persistent-buffer + build-once-reuse + - non-blocking-D2H pattern (and ideally reuse the ggml cgraph across steps instead of - rebuilding it every ubatch). But the ceiling is ~0.4% of the step, so it is a - cleanup, not a parity lever. -- **The +54% to parity is none of the above.** It is GPU kernel work: post-SSM the FP4 - GEMM family is ~48% of decode (the dominant residual), GDN recurrence ~22.5%, and the - decode is weight-bandwidth/latency-bound on LPDDR5x (Track B P2a: a -24.7% FP4-GEMM - kernel left decode_agg FLAT, the freed compute became idle gaps -> decode is not - GEMM-compute-bound but bandwidth/latency-bound). The lever lives in cutting DRAM - traffic per step (fused act-quant to drop the separate `quantize_mmq` pass, native - FP4-MMA, and/or NVFP4-dense weight quant), NOT in the host loop or CUDA graphs. 
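The ceiling argument in section 6 can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the ~499 ms post-SSM step and ~2 ms host gap from the section 2 table and the 254 vs 391 t/s figures quoted above; the function and constant names are hypothetical and not part of any bench script.

```python
# Illustrative arithmetic only: inputs are the approximate figures quoted in
# these notes, not fresh measurements.

def max_speedup_from_removing(step_ms: float, removable_ms: float) -> float:
    """Best-case speedup if `removable_ms` of exposed time vanished from a step."""
    return step_ms / (step_ms - removable_ms)

STEP_MS = 499.0      # post-SSM step time at npl128 (section 2 table)
HOST_GAP_MS = 2.0    # fixed between-step host gap (section 1)
LLAMA_TPS = 254.0    # post-SSM decode throughput
VLLM_TPS = 391.0     # vLLM enforce_eager reference

host_gap_ceiling = max_speedup_from_removing(STEP_MS, HOST_GAP_MS)
parity_needed = VLLM_TPS / LLAMA_TPS

print(f"host-gap removal ceiling: {host_gap_ceiling:.4f}x")
print(f"speedup needed for parity: {parity_needed:.3f}x")
```

Even with perfect overlap the host gap buys well under 1%, against the ~1.54x needed for parity, which is why the notes treat it as cleanup rather than a lever.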
- -### Evidence -- Read-only sqlite sweeps on `~/bench/a2_nsys/paged_off_npl128.sqlite` (this agent). -- `gated_delta_net.cu` launch grid (DGX `~/llama-paged-dev`). -- A2_CUDAGRAPH_DECODE.md, SSM_DECODE_FIX_RESULTS.md, GDN_DECODE_VERIFY.md, - VLLM_DECODE_GROUNDING.md, THROUGHPUT_B_P2a_POSTSSM_RESULTS.md. -# Decode-Parity Exploration - -## Section: gdn-source-compare (llama gated_delta_net.cu vs vLLM fused_recurrent_gated_delta_rule) - -### Model config (Qwen3.5-27B dense, from vLLM config.json) -- linear_key_head_dim K = 128, linear_value_head_dim V = 128 -- linear_num_key_heads = 16, linear_num_value_heads = 48 (GVA 3:1), conv_kernel = 4 -- 64 layers, full_attention_interval 4 -> 48 linear (GDN) : 16 full-attn -- Recurrent state per (seq, v-head) = V*K = 128*128 = 16384 f32 = 64 KiB. - Per layer per seq = 48 * 64 KiB = 3 MiB. Both engines store state in f32. - -### Which kernels run at decode -- llama: ggml_gated_delta_net_inplace_ids -> gated_delta_net_cuda. - Gate is SCALAR per head (graph reshapes gate/beta to ne[0]=1), so the cheaper !KDA branch runs (one expf per token, not per-channel). -- vLLM: enable_packed_recurrent_decode -> fused_recurrent_gated_delta_rule_packed_decode_kernel - (the dedicated single-token decode kernel, NOT the generic varlen fwd kernel). - -### The state HBM traffic is IDENTICAL - it is NOT the lever -Per (seq, v-head) per decode token both engines read 64 KiB state + write 64 KiB state, f32, coalesced. -The dominant memory term is equal. llama is NOT moving more state bytes than vLLM. -=> The 1.46 ms/call is llama achieving LOWER effective bandwidth on the SAME bytes, - plus extra non-state work, NOT a fundamental HBM-traffic deficit. Hence closable. - -### Algorithmic / parallelization delta (the real differences) - -1) Reduction strategy (biggest structural difference) - - llama: WARP-PER-OUTPUT-COLUMN. State stored transposed M[col][i]=S[i][col]. 
Each warp owns - one V-column; the contraction over the 128 K-rows is a cross-lane warp_reduce_sum. - TWO warp_reduce_sum per token (one for kv = S^T@k, one for attn = S^T@q) = ~10 shuffle - rounds on the critical path, with n_tokens=1 they are NOT amortized. - - vLLM: THREAD-PER-OUTPUT-ROW. b_h is a [BV,BK]=[32,128] tile; each thread owns a FULL K-row - of state. sum(b_h*b_k, axis=K) and sum(b_h*b_q) are THREAD-LOCAL 128-wide reductions - - ZERO cross-thread shuffles. Outer-product update b_h += b_v*b_k is also thread-local. - Same FLOPs, but vLLM has no shuffle-reduction latency in the recurrence. - -2) Occupancy / launch geometry (likely the dominant bandwidth gap) - - llama: block = (32 lanes, 4 warps) = 128 threads; grid = (H=48, n_seqs, ceil(128/4)=32). - Per (head,seq) it launches 32 blocks * 128 threads = 4096 threads to touch a 16384-elem state - (only 4 state elems/thread). launch_bounds(128, 2) budgets registers for >=2 blocks/SM; with - s_shard[4]+k_reg[4]+q_reg[4]+addressing the register pressure caps it near ~2 blocks = 8 warps/SM - (~12-16% occupancy on GB10). A memory-bound kernel at ~8 warps/SM cannot generate enough in-flight - loads to saturate 273 GB/s -> low achieved bandwidth on the state read/write. - - vLLM: 1 warp/program (num_warps=1), grid (NV=4, B*HV), small register footprint, num_stages=3 - software-pipelines (prefetches) the state load. Far higher memory-level parallelism per SM. - -3) Redundant non-state traffic in llama - - q,k re-loaded by EVERY column-warp: 128 column-warps/head each reload the same 128-float q and k - => ~128x amplified L2 loads of q/k per head/token (vLLM reloads ~4x, once per NV program). - Small (L2-resident) but adds load-issue + L2 pressure competing with the state stream. - - Output store: llama writes attn_data[col] from lane 0 only (31/32 lanes idle), scattered - single-float stores; vLLM stores a contiguous BV=32 vector (coalesced). 
- -4) Fusion delta (per-layer kernel-launch / HBM round-trip count) - - vLLM packed_decode FUSES into ONE kernel: q/k l2norm + q*scale + softplus(a+dt_bias) + - (-exp(A_log)) gate + sigmoid(beta) + the recurrence + state write-back. - - llama computes these as SEPARATE ggml ops/kernels in the graph before the GDN op: - ggml_l2_norm(q), ggml_l2_norm(k), ggml_add(+dt), ggml_softplus, ggml_mul(gate), - ggml_sigmoid(beta) (+ conv/silu), each a launch + small HBM round-trip. Plus a separate - gdn_gather_nonident_kernel launch per layer (a no-op at steady-state decode: every block - early-returns on the identity check, but still a grid launch of n_seqs blocks). - Across 48 linear layers this is ~6-10 extra small kernels/layer (~300-480 extra launches/token). - Whether this dominates depends on CUDA-graph capture (see A2_CUDAGRAPH_DECODE.md); if captured, - launch latency is hidden and the cost reverts to the per-op HBM round-trips + dependency gaps. - -### What a faster llama GDN decode kernel would need (optimization scope) -- A. Re-parallelize like vLLM: thread/lane owns a full K-row (or K-shard) so the kv and attn - contractions become register-local FMAs, eliminating the two warp_reduce_sum per token. -- B. Raise occupancy for the memory-bound regime: drop/raise the launch_bounds minBlocks hint - (the `,2)` is too low), shrink the block, cut registers, and add a software-prefetch of the next - state shard so more state loads are in flight per SM. This directly lifts achieved bandwidth on - the equal state bytes - the single highest-leverage change. -- C. Load q,k ONCE per (head,seq) into shared memory instead of 128x per-column reload; coalesce - the output store across the warp. -- D. Fuse the gate/l2norm/scale (softplus, exp(A_log), sigmoid, l2norm) INTO the recurrence kernel, - reading raw a/b/A_log/dt_bias from registers, removing ~6 elementwise passes + their HBM round-trips - per layer (matches vLLM's packed_decode). 
Drop the gather no-op kernel at steady-state decode - (or fold the identity check into the recurrence prologue, which it already partly does). -- E. (Longer term) bf16 state would HALVE the dominant traffic, but vLLM keeps f32 too, so this is a - divergence-from-reference not a parity lever. - -### Bottom line -llama's GDN decode kernel is NOT moving more state HBM bytes than vLLM (the dominant term is equal), -so the 1.46 ms/call is an EFFICIENCY gap, not a traffic floor: (1) cross-warp shuffle reductions on -the n_tokens=1 critical path, (2) low occupancy (~8 warps/SM from launch_bounds + register pressure) -starving memory-level parallelism so the equal state bytes move at lower effective bandwidth, plus -(3) 128x redundant q/k L2 loads and (4) ~6-10 unfused gate/norm elementwise kernels per layer that -vLLM folds into one packed-decode kernel. Highest-leverage fixes: raise occupancy + prefetch (B) and -row-local reductions (A); secondary: gate/norm fusion (D) and q/k shared-mem reuse (C). - ---- - -## Section: validate-findings (adversarial re-derivation from raw DGX data) - READ-ONLY - -Re-queried `CUPTI_ACTIVITY_KIND_KERNEL` + `CUPTI_ACTIVITY_KIND_MEMCPY` directly (kernel and -memcpy summed separately so D2D is never lumped into compute), not from summary text. - -### CLAIM 1 - decode decomposition -PRE-FIX (`a2_nsys/paged_off_npl128.sqlite`, last 17s) vs `decode_decomp.txt`, match <=0.1pp: -gated_delta_net 23.40% (doc 23.43), k_get_rows 21.99% (21.88), MEMCPY-DtoD 18.89% / 382 GB / -1583 ops (18.90 / 356 GB / 1584), mul_mat_vec_q 15.53% (15.51), mul_mat_q 10.48% (10.37). -=> CONFIRMED exactly. gated_delta_net = largest single non-GEMM kernel; FP4-GEMM group ~28%; -full attention 0.37%. - -D2D collapse: only on-box post-fix decomp is `ssm_decomp/after.sqlite`; MEMCPY-DtoD there = -526 ops / 0.9 ms / 0.05 GB = 0.008% of busy (from 382 GB / 18.89%). 
=> CONFIRMED, stronger than -the doc's "0.23%" (382 GB state copy-back gone; exact "0.23%/2.93GB/734ops" not reproducible - -my DtoD 0.05 GB, the 2.16 GB is DtoH). - -FLAG (refutes part of the Step-2 decomp): `after.sqlite` is a Step-1 build (patch 0018 only), -NOT Step-2. It still shows k_get_rows_float 28.44% (gated_delta_net 28.96%, FP4-GEMM group ~33%), -no `gdn_gather_nonident` kernel, profiled S_TG=164 (~Step-1 180, not Step-2 256); mtime 00:31 -predates the 08:48 rebuild that carried patch 0019. The Step-2 split in `SSM_DECODE_FIX_RESULTS` -("get_rows 18.8%->0.7%, FP4-GEMM ->48%, GDN 22.5%") has NO surviving sqlite, and the script meant -to produce it (`ssm_decomp.sh`) CRASHED (Python SyntaxError, see `ssm_decomp_after.out`). So -"FP4-GEMM ~48%" is UNVERIFIED against raw Step-2 data: measured ~33% on Step-1; removing the 28% -get_rows bucket lifts it to ~46% arithmetically, so ~48% is plausible but not directly measured. -Section 1 above and SSM_DECODE_FIX_RESULTS both inherit this unverified Step-2 split. - -### CLAIM 2 - 146 -> ~257 ("+66%") -146.23 baseline CONFIRMED (`ssm_decode_baseline.out`); final 256.57 / 252.50 / 254.02 across -SSM_DECODE_FIX_RESULTS + THROUGHPUT_B_P2a, within ~1.6%. Magnitude CONFIRMED. TRAP: 146->257 is -+76% (146->254 = +74%), NOT +66%. "66%" is the % of vLLM (257/391 = 65.7%), not the speedup. - -### CLAIM 3 - P2a GEMM-remap FLAT on decode -THROUGHPUT_B_P2a: dense npl128 252.50->254.02 (+0.6% noise), npl32 -0.4%, MoE flat; FP4 GEMM -kernel itself -24.7%, PREFILL +12.7%. Pre-SSM corroborated by THROUGHPUT_B_P1. => CONFIRMED. - -### CLAIM 4 - 65% of vLLM (254 vs 391) -254/391 = 64.96%, 256.57/391 = 65.6%; vLLM 391 = enforce_eager apples ref. => CONFIRMED. - -### Traps checked -GGML_CUDA_DISABLE_GRAPHS set `=1` explicitly (not the empty-value trap); graphs ON/OFF within -noise. memcpy-in-compute lumping AVOIDED (separate table sums). 
Decomp reps are ntg24-under-nsys -(S_TG 149/164) - valid for SHARES only; throughput correctly from unprofiled ntg128 logs. - -### Net verdict -1 pre-fix decomp CONFIRMED exact; D2D collapse CONFIRMED (stronger); Step-2 0.7%/48% split -UNVERIFIED (producer script crashed, only post-fix sqlite is Step-1). 2 magnitude CONFIRMED, -"+66%" label REFUTED (true +76%; 66% = % of vLLM). 3 CONFIRMED. 4 CONFIRMED. - ---- - -## Section: weight-bandwidth (whole-step DRAM budget, READ-ONLY math) - -Agent label: weight-bandwidth. Method: exact GGUF tensor accounting (q36-27b-nvfp4, -arch qwen35, 64 layers) + activation-state math + existing nsys/decode_decomp; no GPU started. -Config = the production decode number: llama-batched-bench -fa on -npp128 -ntg128 -npl 128 -(B = n_parallel = 128 sequences, S_TG = 254 t/s post-0019). GB10 LPDDR5x peak ~273 GB/s. - -### Exact per-step DRAM byte budget at B=128 (ctx avg ~192 over the ntg128 window) - -NVFP4 type-40 = 0.5625 B/weight (4-bit data + e4m3 per-16 micro-scale; verified: 5120*48*0.5625=138240). 
- -WEIGHTS (read ONCE per step, shared across all 128 seqs): - - NVFP4 layer weights (type40, 64 layers): 13,062.7 MB = 12.76 GB - (per SSM layer 215.6 MB x48 = 9867.7 MB ; per full-attn layer 199.7 MB x16 = 3195.0 MB) - - LM head output.weight: type 30 = **bf16, NOT quantized** = 2425 MB = 2.37 GB (read in full each step) - - per-layer norms/conv1d/ssm_a/dt_bias (type0 f32): 10.1 MB - - token_embd: EXCLUDED (get_rows gathers only 128 rows, negligible) - => WEIGHTS TOTAL = 15.14 GB / step - -PER-SEQUENCE STATE (x128 seqs, read + write every step): - - SSM recurrent state: inner_size(6144) x state_size(128) x 4B(f32) = 3.0 MB / layer / seq - x 48 SSM layers x 128 seq = 18.43 GB read + 18.43 GB write = **36.86 GB / step** - - conv state: conv_k(4) x conv_dim(10240) x 4B = 160 KB / layer / seq - x 48 x 128 = 0.96 GB read + 0.96 GB write = 1.92 GB / step - - KV cache (16 full-attn layers, GQA n_kv_head=4, k+v_len=512, f16): - 4096 B/tok/layer x 16 x ~192 ctx x 128 seq = ~1.6 GB read / step - - TOTAL ~= 15.14 (W) + 36.86 (SSM state) + 1.92 (conv) + 1.6 (KV) = **~55.5 GB / step** - -### Floor vs measured -- decode is NOT at the bandwidth floor - - Bandwidth floor = 55.5 GB / 273 GB/s = **203 ms/step** - Measured llama = 128 tok / 254 t/s = **504 ms/step** => **2.48x the floor** (eff BW 110 GB/s = 40% of peak) - vLLM 391 t/s = 128 / 391 = **327 ms/step** => 1.61x the floor (eff BW 170 GB/s = 62% of peak) - - The SAME 55.5 GB/step floor applies to vLLM: identical NVFP4 weights, and its - fused_recurrent_gated_delta_rule reads+writes the identical f32 recurrent state. So both engines - face the same DRAM wall; vLLM simply moves those bytes at 62% of peak vs llama's 40%. The 62/40 = - 1.55x utilization gap is EXACTLY the 254->391 (1.54x) throughput gap. => Decode-parity is a - bandwidth-UTILIZATION / launch-serialization problem, NOT a DRAM-traffic-volume problem. Bandwidth - is not the binding constraint (we sit 2.5x above the floor); confirms the GDN-kernel section above. 
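The floor-vs-measured figures above follow directly from the stated byte budget. The sketch below reproduces them, assuming the per-step totals exactly as written in this section (15.14 + 36.86 + 1.92 + 1.6 GB) and the ~273 GB/s peak; all names are hypothetical and nothing here reads a trace or a GGUF file.

```python
# Illustrative arithmetic only: reproduces the floor/utilization numbers from
# the per-step byte totals stated in these notes.

def nvfp4_bytes_per_weight(block: int = 16) -> float:
    """4-bit payload plus one 8-bit (e4m3) scale shared by `block` weights."""
    return (4 + 8 / block) / 8

PEAK_GBPS = 273.0  # GB10 LPDDR5x peak, as assumed in the notes
BATCH = 128        # sequences decoded per step

# Per-step traffic in GB, taken from the budget above.
traffic_gb = {"weights": 15.14, "ssm_state": 36.86, "conv_state": 1.92, "kv": 1.6}
total_gb = sum(traffic_gb.values())

def step_ms(tokens_per_s: float) -> float:
    """Wall time of one decode step that emits BATCH tokens."""
    return BATCH / tokens_per_s * 1000.0

def utilization(tokens_per_s: float) -> float:
    """Achieved bandwidth as a fraction of peak, assuming total_gb moves per step."""
    return (total_gb / (step_ms(tokens_per_s) / 1000.0)) / PEAK_GBPS

floor_ms = total_gb / PEAK_GBPS * 1000.0

for name, tps in (("llama", 254.0), ("vLLM", 391.0)):
    print(f"{name}: {step_ms(tps):.0f} ms/step, "
          f"{step_ms(tps) / floor_ms:.2f}x floor, {utilization(tps):.0%} of peak")
```

Because both engines are assumed to move the same bytes, the utilization ratio equals the throughput ratio by construction; that is the "62/40 matches 254->391" observation above restated.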
- -### Traffic composition is STATE-dominated at B=128 (qualifies the "weight-quant" verdict) - - SSM state r+w = 66% of step traffic; weights = 27%; conv = 3.5%; KV = 3%. - At B=128 weights are a minority of traffic, and we are 2.5x above the floor anyway -> NVFP4-dense - weight quant (the QWEN36_NVFP4 verdict's lever) cannot move batch-128 decode much. Weight-quant - helps PREFILL (compute/weight-bound, already +12.7% from the GEMM remap) and LOW-batch decode. - Cross-check at B=32: traffic ~25.2 GB/step (weights now 60%), floor 92 ms, measured 189 ms = 2.05x - floor. The sublinear scaling 32->128 (4x batch, only 1.5x throughput: 169->254) is fully explained - by per-seq state traffic growing with B while weights stay amortized -> at B=128 the step has become - state-traffic-heavy but is STILL 2.5x off the floor, i.e. latency/overlap-bound, not byte-bound. - -### Redundant traffic llama reads that vLLM avoids (cut list, by impact) - - 1. (HISTORICAL, FIXED by 0018) Redundant DtoD recurrent-state copy = +18.4 GB/step EXTRA - (pre-fix decode_decomp: MEMCPY-DtoD 18.9%, 80 copies/step ~230 MB each = 18.4 GB; nsys window - 356 GB/19.8 steps). This doubled state traffic and was the dominant pre-fix waste. Verified gone - post-fix: the THROUGHPUT_B_P2a A/B kernel sum (npp128 ntg24 npl128) lists gated_delta_net / - mul_mat_q / quantize but NO MEMCPY-DtoD term. (The committed ~/bench/a2_nsys sqlites are all - PRE-fix S_TG~149 traces; re-profiling deferred to the designated profiler.) This single removal - (18.4 GB/273 ~= 67 ms/step of bytes plus the killed overlap stalls) is the bulk of 146->254. - 2. conv state as a SEPARATE ssm_conv kernel + separate buffer: 1.92 GB r+w/step AND 48 extra kernel - launches/step. vLLM folds the causal conv into its recurrence kernel. Cut ~= 7 ms bytes + 48 - launches/step of serialization. - 3. 
Residual get_rows gather post-0019 (~0.7%, decode_decomp pre-fix k_get_rows was 21.9% / ~96 - ops/step = 2/SSM-layer): vLLM indexes the per-seq state in-kernel; llama still does a small - gather/scatter. ~0.13 GB. 0019 already folds most of it; fold the identity check fully into the - recurrence prologue. - 4. quantize_mmq_nvfp4: 448 ops/step re-quantizing activations to NVFP4 before each FP4 matmul. - Activation BYTES are negligible, but it is 448 extra kernel launches/step that vLLM fuses into - the GEMM prologue -> pure launch latency, not traffic. - 5. NOT redundant: weight bytes (identical NVFP4 to vLLM), SSM-state r+w (inherent, vLLM pays it), - NVFP4 scale scalars (8 B/tensor). Note the LM head is bf16 not quantized (2.37 GB/step, 16% of - weight traffic) -- fp8 LM head would save ~1.2 GB/step but only matters if vLLM also quantizes it. - -### Bottom line (weight-bandwidth) -At B=128, decode moves ~55.5 GB/step and runs at 2.48x the 273 GB/s floor (40% util) vs vLLM's 1.61x -(62% util). Same bytes, same floor for both engines -> decode is bandwidth-UTILIZATION-bound, not -traffic-bound. There is NO large redundant-byte stream left to cut post-0018/0019 (the 18.4 GB/step -DtoD redundancy is already gone); the remaining 254->391 is recovered by raising achieved bandwidth -(occupancy + prefetch on the GDN state loads, conv fusion to drop 48 launches/step) so the EXISTING -55.5 GB/step moves at vLLM's 62% instead of 40%. Weight-quant (NVFP4-dense) is a PREFILL / low-batch -lever, largely orthogonal to the batch-128 decode-parity gap. - ---- - -## Section: explore-other-levers (broad sweep for OTHER llama-specific decode inefficiencies) - READ-ONLY, no GPU - -Scope handoff: GDN-kernel internals -> `gdn-source-compare`; host loop / graphs / gaps -> -`per-token-latency`; weight-byte / utilization -> `weight-bandwidth` section above (which already -covers the BF16 lm_head and the "same bytes, 40% vs 62% util" framing - I concur, no need to repeat). 
-This section covers the levers NONE of those own: the FP4 act-quant fusion, the M=128-vs-M=1 ggml -fusion gate, TMA scoping, and the conv-state residual. - -**Terminology fix that matters for the whole doc:** in this repo's benches **"fusion OFF" means -`LLAMA_FUSE_NVFP4_QUANT=0`** (Track A's NVFP4 act-quant producer), confirmed in -`a2_nsys.sh`/`a2_4cell.sh`/`trackA_clean.sh`. It does NOT set `GGML_CUDA_DISABLE_FUSION`, so the -**standard ggml-cuda elementwise/GLU/rope fusion is ON** in every result. The header's "fusion OFF -baseline" is only about the act-quant producer. - -**Framing (consistent with the sections above, sharpened):** the binder is bandwidth-UTILIZATION / -the kernel-dependency chain, not traffic or per-kernel compute (P2a -24.7% GEMM and graphs both -flat). The thing that raises utilization AND shortens the chain is the same: **fewer, fused kernels -per step** - removing whole passes vLLM doesn't run. So rank by "whole pass eliminated", not "us -shaved". - -### L1. Re-test Track A act-quant fusion (`LLAMA_FUSE_NVFP4_QUANT=1`) POST-SSM. [impact ~8-11%, tractability HIGH - code exists, owned by tasks 38-41] -`quantize_mmq_nvfp4` is a standalone full-activation requantize run once per NVFP4 GEMM at M=128 -(the weight-bandwidth section counts 448 such launches/step). vLLM has **zero** equivalent: -`rms_quant_fusion.py:98` folds it into RMSNorm, `act_quant_fusion.py:40,128` into SiLU+mul - the -activation never hits a temp buffer. Track A built exactly this fused producer (tasks 38-40 DONE), -but `LLAMA_FUSE_NVFP4_QUANT=1` regressed, and EVERY post-SSM bench ran with it OFF. **The regression -is likely stale:** pre-SSM the GPU was 99% busy on the state-copy chain, so folding act-quant into -the norm only relocated busy work into a lower-occupancy kernel with no idle to reclaim. Post-SSM the -chain has real idle and removing 448 launches/step both shortens the dependency chain and lifts -utilization - exactly the post-0018/0019 bind. 
Highest-value CLEAN removal; needs only a re-bench -(re-run `trackA_clean.sh` on the post-0019 build), not new code. Do not treat the prior regression -as final. - -### L2. M=128 norm->matmul prologue fusion - the ggml fusion gate that does NOT fire at decode batch. [impact ~5-15% aggregate, tractability MEDIUM] -ggml-cuda's built-in `rms_norm+mul+mul_mat_vec_q` fusion (`ggml_cuda_should_fuse_mul_mat_vec_q`, -ggml-cuda.cu:2502) is gated to `dst->ne[1]==1` - it ONLY fires at **npl1** (M=1). At npl128 -(`mul_mat_q`, M=128) it does NOT fire, so the per-layer RMSNorm stays a separate kernel feeding the -GEMM and the act path is unfused (L1). vLLM fuses both into the GEMM prologue at all M. This is the -M=128 generalization of the existing M=1 fusion + L1; largest aggregate surface but real kernel work. -Implies a regime split worth stating loudly: **npl1 single-stream latency already gets this fusion; -the npl128 throughput number does not** - tune the two separately. - -### L3. TMA weight feed: a PREFILL / npl1-latency lever, NOT an npl128-decode lever. -Answering the brief's question (GEMM idle = FEED problem TMA fixes, or off-critical-path TMA can't?): -P2a cut GEMM compute and the freed time became IDLE, so at npl128 the GEMM finishes early and the -stall is BETWEEN kernels, not inside the GEMM waiting on weight tiles. TMA accelerates a -*feed-stalled* GEMM; at npl128 the GEMM is not the binder, so TMA won't move npl128 S_TG. It pays on -(a) **prefill** (compute/feed-bound; the remap already gave +12.7%) and (b) **npl1 decode**, a pure -weight-feed GEMV (full model / 273 GB/s ~ 19-20 tok/s ceiling). Scope TMA to prefill + low-batch -latency; do not bank it for batch-128 decode parity. (Consistent with the weight-bandwidth section's -"NVFP4-dense is a prefill/low-batch lever".) - -### L4. In-place / `ids` conv-state - apply the 0018/0019 pattern to `ssm_conv`. 
[impact ~1-3%, tractability HIGH, proven pattern, bit-exact-able] -After the SSM fix the residual D2D is the conv-state copy (`build_conv_state`, -delta-net-base.cpp:449-525: `build_rs` reads 3 prior samples, `ggml_concat` the new token, writes -the last 3 back), plus `ssm_conv` (~0.8-1.5%) and a per-GDN-layer `concat_cont` (48/step). The exact -in-place + `ids`-read treatment from 0018/0019 applies to the conv state, and `ssm_conv`+`concat` -can fold into the GDN kernel prologue (it already has `ids` plumbing). Small ceiling but bit-exact, -low-risk, and removes ~48 launches/step from the chain - this is the "conv fusion to drop 48 -launches/step" the weight-bandwidth section calls for, made concrete via the proven patch pattern. - -### Deferred (covered by other sections, I concur) -- GDN occupancy / row-local reductions / gate-norm fusion -> `gdn-source-compare`. Add only: bf16 - state halves the dominant traffic but vLLM keeps f32, so it is a divergence-from-reference, not a - parity lever - last priority, quality-risk. -- BF16 lm_head / weight-byte / 40%-vs-62% utilization -> `weight-bandwidth` section. lm_head NVFP4 is - an absolute ~1-2% trim, not a vLLM-relative gap (vLLM likely keeps it bf16 too). -- Full-attention KV path (16 attn layers, 0.4-1.8%, O(ctx) but tiny) -> CLOSED, not a lever. - -### Bottom line (this section's net-new) -Ranked by "whole pass vLLM eliminated": **L1 (re-test act-quant fusion post-SSM - clean removable -pass, code already written, just needs a post-0019 re-bench)** > **L2 (M=128 norm/act prologue -fusion - biggest aggregate surface, real work)** > **L4 (conv-state in-place - cheap, proven 0018/0019 -pattern, -48 launches/step)**. **L3 (TMA) is mis-scoped if aimed at npl128 decode** - it is a prefill -/ npl1-latency lever, same bucket as NVFP4-dense weight quant. 
Caveat inherited from -`validate-findings`: the post-SSM act-quant absolute share (L1) is on an unverified Step-2 decomp -(only clean post-fix sqlite is Step-1); re-measure on a clean Step-2 nsys when the profiler runs. - -Assisted-by: Claude:opus-4.8 [Claude Code] - ---- - -## Section: profile-both-engines (GROUND-TRUTH post-SSM nsys of llama AND vLLM at npl128) - THE GPU PROFILER - -Agent label: profile-both-engines (the only GPU agent). Fresh post-SSM nsys traces of -BOTH engines at the same shape (128-seq decode, 128-token prompts), q36-27b-nvfp4 dense. -llama = `build-cuda-base` (no FP4 flag, byte-identical to stock, HEAD 46d7dd8 = patch 0019 -SSM fix), `llama-batched-bench -npp128 -ntg32 -npl128 -fa on`, eager (DISABLE_GRAPHS=1) for -a clean per-kernel trace. vLLM = 0.23.0 in-process offline (`VLLM_ENABLE_V1_MULTIPROCESSING=0` -so cudaProfilerApi controls the worker), enforce_eager, max_num_seqs 256, 128 prompts. -Decode-only windows (prefill excluded), overlap-correct interval-union busy, GPU-accurate -per-call kernel durations. This is the post-SSM **Step-2** trace `validate-findings` flagged -as having no surviving sqlite - it now exists: `~/bench/postssm_decomp/`. - -### 0. THROUGHPUT GROUND TRUTH (un-profiled, prefill-subtracted) - resolves the 391 reference - -The vLLM 391 reference is real and reproduced. Prefill-subtracted decode step (two-length -w16/w64 timing, in-process, batch 128): - -| engine / mode | ms/step | decode tok/s | notes | -|--------------------------|---------|--------------|--------------------------------| -| llama post-SSM (graphs) | ~510-522| **245-251** | S_TG @npl128 ntg32 (this run) | -| vLLM enforce_eager | 324.9 | **394.0** | == the ~391 ref (h2h log 371-384)| -| vLLM cuda-graphs | 304.9 | **419.8** | graphs buy only +6% | - -- **CUDA graphs are NOT the parity lever**: vLLM is already 394 t/s EAGER; graphs add +6% - (394->420). llama-batched-bench already runs WITH graphs at 245. 
So the gap is eager-vs-eager - kernel work, confirming `per-token-latency` and `A2_CUDAGRAPH_DECODE`. -- TRAP I hit and corrected: the FIRST vLLM nsys window (0.35-0.99) read 468 ms/step / 273 t/s - - WRONG, contaminated by prefill chunked-GDN kernels AND eager-nsys host overhead. The tight - decode-only window (0.62-0.98) reads **326.5 ms/step**, matching the un-profiled 324.9 ms - exactly -> the tight window is faithful; per-kernel numbers below use it. - -### 1. POST-SSM per-step decode decomposition, SIDE BY SIDE (GPU-accurate, prefill-free) - -Both at batch 128. llama 510 ms/step (98.7% GPU-busy), vLLM 326 ms/step (97.9% GPU-busy). -ms/step = on-device kernel time per real decode step (nsys host overhead does not inflate GPU -kernel duration; per-step = GPU-ms / real-step-count from the decode-only GDN call count). - -| component (per step) | llama ms/step | llama % | vLLM ms/step | vLLM % | -|-----------------------------|---------------|---------|--------------|--------| -| GDN linear-attn recurrence | 193 (48x4.03) | 38% | 174 (48x3.62)| 53% | -| FP4 matmul + act-quant | **236** | **46%** | **117** | **36%**| -| - mul_mat_vec_q (GEMV) | 132 (48x2.75) | 26% | - | - | -| - mul_mat_q (GEMM) | 88 (448 calls)| 17% | cutlass 61 | 19% | -| - quantize_mmq_nvfp4 | 16 (448) | 3% | nvjet 53+cvt2| 17% | -| full attention (16 layers) | 6.6 (16) | 1.3% | 6.2 (16) | 1.9% | -| SSM conv + glue/elementwise | ~45 | 9% | ~22 | 7% | -| MEMCPY (D2D+H2D) | 2.5 (131 MB) | 0.5% | 0.36 (85 MB) | 0.1% | -| **TOTAL** | **~510** | 100% | **~326** | 100% | - -### 2. The three load-bearing comparisons (the brief) - -**(1) GDN compute: llama vs vLLM = NOT the gap.** Per-call GPU duration: -llama `gated_delta_net_cuda<128>` = **4.03 ms/call**, vLLM -`fused_recurrent_gated_delta_rule_packed_decode` = **3.62 ms/call**. llama is only **+11%** -slower per call (+19 ms/step). 
GDN is comparable; it is the largest single kernel on BOTH sides -(38% llama, 53% vLLM) but it explains only ~19 ms of the 185 ms gap (~10%). This REFUTES the -framing that the GDN kernel is the dominant residual lever - it is a minor overage post-0018/0019. -(The `gdn-source-compare` occupancy/shuffle deltas are real but worth ~19 ms/step, not 1.5x.) - -**(2) DRAM bytes/step: llama does NOT read more.** Explicit memcpy: llama **131 MB/step** vs -vLLM **85 MB/step** - llama moves a hair more in copies but both are <0.5% of the step. The big -per-layer state copies are GONE (pre-SSM 18 GB/step DtoD -> post-SSM 131 MB/step) - **the SSM fix -(0018/0019) is confirmed working in this trace.** Weight DRAM (read inside the GEMM/GEMV kernels, -not memcpy) is the SAME ~15 GB NVFP4 for both engines; at 273 GB/s that is a ~52 ms floor, and -BOTH engines sit far above it (326-510 ms), so BOTH are compute/kernel-bound, NOT -weight-bandwidth-bound, and llama reads no extra bytes. The 254-vs-391 gap is NOT a byte-volume -deficit - it is effective-bandwidth/compute-efficiency in the FP4 matmul kernels (see 3). - -**(3) GPU-busy% / idle structure: identical, both ~98% busy.** llama 98.7% busy (1.3% idle), -vLLM 97.9% busy (2.1% idle). Neither engine is idle/gap/host-bound at npl128. The entire gap is -the GPU doing MORE kernel-time per step on llama: llama's non-GDN GPU work = ~310 ms/step vs -vLLM's ~146 ms/step. That 164 ms delta is concentrated in the FP4 matmul path. - -### 3. THE single biggest llama-specific overage: the FP4 matmul path (+119 ms/step = 64% of the gap) - -llama spends **236 ms/step** on FP4 matmul+quant; vLLM does ALL its matmul (cutlass FP4 GEMM + -cublas nvjet + act cvt) in **117 ms/step** - even though vLLM ALSO carries ~18 ms/step of extra -PyTorch eager elementwise glue that llama's fused ggml kernels avoid. llama is **2.0x slower on -FP4 matmul**, and that +119 ms is **64% of the entire 185 ms/step gap**. 
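The gap attribution can be reproduced from the side-by-side table in section 1. This is a minimal sketch, assuming the rounded per-step values from that table and the 185 ms/step gap as quoted in the text; the dictionary keys and function name are hypothetical, and because the table rows are rounded the component deltas do not sum exactly to the gap.

```python
# Illustrative arithmetic only: per-step milliseconds copied from the section 1
# table (rounded), used to attribute the llama-vs-vLLM gap by component.

GAP_MS = 185.0  # llama vs vLLM per-step gap, as quoted in these notes

# component -> (llama ms/step, vLLM ms/step)
per_step_ms = {
    "gdn_recurrence": (193.0, 174.0),
    "fp4_matmul_and_act_quant": (236.0, 117.0),
    "full_attention": (6.6, 6.2),
    "ssm_conv_and_glue": (45.0, 22.0),
    "memcpy": (2.5, 0.36),
}

def gap_share(component: str) -> float:
    """Fraction of the quoted gap explained by one component's overage."""
    llama_ms, vllm_ms = per_step_ms[component]
    return (llama_ms - vllm_ms) / GAP_MS

for name, (llama_ms, vllm_ms) in per_step_ms.items():
    print(f"{name:26s} +{llama_ms - vllm_ms:6.1f} ms  {gap_share(name):5.1%} of gap")
```

Sorting by overage rather than by absolute share is the point of the comparison: GDN is the largest kernel on both sides, yet the FP4 matmul path carries most of the difference.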
- -Inside llama's FP4 path the dominant, untouched cost is **`mul_mat_vec_q` = 132 ms/step (26% of -decode), 48 calls/step (exactly one per GDN layer), 2.75 ms/call, grid 5120x128**. This is the -**FP4 GEMV ("vec_q") kernel running at decode batch 128** for the gated-DeltaNet in-projections - -a non-tensor-core, memory-bound-style kernel doing M=128 work without GEMM-grade weight-read -amortization. vLLM runs the equivalent projections through cutlass batched FP4 GEMM (tensor-core, -weight read amortized across the 128-row batch) at a fraction of the cost. **There is no -GEMV-at-batch-128 on the vLLM side at all.** - -Key cross-check with Track B P2a: P2a optimized `mul_mat_q` (the 17%/88 ms tensor-core GEMM, made -it -24.7%) and decode stayed FLAT - because the BIG FP4 cost is `mul_mat_vec_q` (26%/132 ms), -which P2a never touched. **Track B optimized the wrong FP4 kernel.** The lever is to route the -GDN in-projection at M=128 through a tensor-core GEMM (mul_mat_q / MMQ) instead of the vec_q path, -and to fuse the act-quant (L1) + the norm prologue (L2) so the 448 `quantize_mmq_nvfp4` launches -fold away - exactly what `explore-other-levers` L1/L2 propose. My measurement RANKS them: the -mul_mat_vec_q->GEMM routing is the single highest-value target (132 ms), then act-quant fusion -(16 ms + 448 launches), then the GDN +19 ms. - -### 4. Reconciling with the `weight-bandwidth` section (unification, not contradiction) - -weight-bandwidth concluded "same 55.5 GB/step, llama 40% util vs vLLM 62% util -> utilization-bound." -My per-kernel data LOCALIZES that utilization gap: it lives in the **FP4 matmul kernels** (which -do the bulk of the ~15 GB weight read), NOT in the GDN state traffic. GDN moves its (equal) state -bytes at comparable rate on both engines (4.03 vs 3.62 ms/call). So the "40% vs 62%" is the -`mul_mat_vec_q`/`mul_mat_q` weight-read efficiency vs cutlass FP4 GEMM. 
Raising decode parity = -raise the FP4-matmul achieved bandwidth (tensor-core GEMM routing + act/norm prologue fusion), -not the GDN kernel and not byte-cutting. - -### Verdict (profiler) -- Reproduced both engines at their true operating points: llama 245 / vLLM 394 eager / 420 graphs. - Graphs are not the lever (+6%). Both engines ~98% GPU-busy; gap is GPU kernel-time, not idle/host. -- GDN compute is comparable (llama +11%/call, +19 ms/step) - NOT the dominant residual. -- bytes/step: llama does not read more (131 vs 85 MB memcpy; identical weight bytes); SSM fix's - 18 GB/step DtoD removal CONFIRMED in-trace. -- **The single biggest llama-specific overage is the FP4 matmul path: 236 vs 117 ms/step (+119 ms - = 64% of the 185 ms gap), dominated by `mul_mat_vec_q` (FP4 GEMV at batch 128, 132 ms/step, 26%, - one per GDN layer).** Highest-value lever = route the GDN in-projection through a tensor-core FP4 - GEMM at M=128 + fuse act-quant/norm prologue (L1/L2). Track B optimized the wrong FP4 kernel. - -### Evidence (DGX, this agent) -- `~/bench/postssm_decomp/postssm_base.{nsys-rep,sqlite,gpu_trace.csv,run.log}` (llama post-SSM). -- `~/bench/postssm_decomp/vllm_decode.{nsys-rep,gpu_trace.csv}` (vLLM eager decode trace). -- `~/bench/postssm_decomp/vllm_decode_g1.*` (vLLM graphs run), `~/bench/vllm_tps.py` (throughput). -- Scripts: `~/bench/postssm_llama_decomp.sh`, `~/bench/vllm_nsys_run.sh`, `~/bench/decode_decomp2.py` - (decode-only windowed, overlap-correct, MB-memcpy, per-step reconstruction). - -Assisted-by: Claude:opus-4.8 [Claude Code] - ---- - -## Section: SYNTHESIS (cross-check + ground-truth + ranked levers + verdict) - FINALIZED - -Agent label: synthesize. Read-only (no GPU). Cross-checks all sections above against the -fresh `profile-both-engines` ground-truth, then mechanism-confirms the dominant lever by -reading the model graph + ggml-cuda dispatch source on the DGX (`~/llama-paged-dev`, HEAD -46d7dd8 = patch 0019). 
All throughput vs the vLLM 391 t/s eager apples-to-apples reference. - -### 0. Headline - -Post-SSM dense decode = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact. The residual is -NOT a hardware/architecture floor and NOT the GDN recurrence kernel, the host loop, CUDA -graphs, or DRAM byte-volume. It is ONE concrete, llama-specific kernel-routing defect: -**the gated-DeltaNet output projection (`ssm_out`) runs as an FP4 GEMV (`mul_mat_vec_q`) -at decode batch 128 instead of a tensor-core FP4 GEMM (MMQ), costing 132 ms/step = 26% of -decode = the single biggest overage vs vLLM (which packs the same projection into a cutlass -M=128 GEMM).** The fix is a ~2-line reshape, bit-exact, and is the highest-value next step. - -### 1. Cross-check: which prior findings HELD, were REFUTED, or are SUPERSEDED - -HELD (confirmed by both the adversarial re-derivation and the fresh profile): -- Pre-fix decomposition (gated_delta_net 23.4%, k_get_rows 21.9%, MEMCPY-DtoD 18.9% / 382 GB, - mul_mat_vec_q 15.5%, mul_mat_q 10.5%): reproduced to <=0.1pp (validate-findings). -- SSM-fix D2D collapse: the 18.4 GB/step redundant recurrent-state copy is GONE. Confirmed - three ways: validate (18.9% -> 0.008% on the post-fix sqlite), weight-bandwidth (A/B kernel - sum lists no DtoD term), and IN-TRACE by the profiler (18 GB/step DtoD -> 131 MB/step). The - SSM fix (0018/0019) is the real breakthrough and is working. -- P2a FP4-GEMM occupancy remap FLAT on decode (+0.6% noise) while the `mul_mat_q` kernel itself - shrank -24.7% and prefill rose +12.7%: confirmed. Decode is not GEMM-occupancy-bound. -- 65% of vLLM (254/391 = 64.96%, 256.6/391 = 65.6%): confirmed. -- Decode is NOT at the bandwidth floor: 55.5 GB/step moved at 2.48x the 273 GB/s floor (40% util) - vs vLLM 1.61x (62% util) on the SAME bytes. Confirmed + LOCALIZED below. -- Host loop / 64-layer serialization is NOT the lever: both engines ~98% GPU-busy at npl128 - (llama 98.7%, vLLM 97.9%); the entire exposed-idle budget is ~0.65%. 
Confirmed by the profiler. -- CUDA graphs are NOT the lever: vLLM is 394 t/s EAGER, graphs add only +6% (420); llama already - runs with graphs. Confirmed by the profiler. - -REFUTED / CORRECTED: -- "GDN recurrence kernel is the dominant residual lever" (the STATE brief's "gated_delta_net - 1.46 ms/call, the largest single kernel" and the gdn-source-compare framing): REFUTED. The - profiler's fresh side-by-side per-call duration is llama 4.03 ms vs vLLM 3.62 ms = only +11% / - +19 ms/step = ~10% of the 184 ms gap. It IS the largest single kernel on both sides (38% llama, - 53% vLLM) but the largest GAP is elsewhere. (The brief's "1.46 ms/call" is a stale/narrower - window; the authoritative post-SSM per-call is 4.03 ms.) gdn-source-compare's occupancy/shuffle/ - fusion anatomy is correct but addresses a SECONDARY +19 ms target, not parity. -- "+66% SSM-fix gain" label: REFUTED. 146 -> 254-257 is +74 to +76%; "66%" is the percent-of-vLLM, - not the speedup (validate-findings). - -SUPERSEDED (the gap validate-findings flagged, now filled by real data): -- The "FP4-GEMM ~48% / get_rows 0.7% / GDN 22.5%" Step-2 split had NO surviving sqlite (the - producer script crashed; only a Step-1 build was on the box). The profiler's fresh Step-2 trace - replaces it with a FINER, load-bearing breakdown: the ~46% "FP4 matmul" bucket is NOT one GEMM - family - it splits into `mul_mat_vec_q` 26% (the o_proj GEMV, the real culprit), `mul_mat_q` 17% - (the tensor-core GEMM P2a already optimized), and `quantize_mmq_nvfp4` 3%. Lumping them as - "48% FP4-GEMM" hid that Track B P2a optimized the 17% MMQ while the 26% MMVQ was the bind. This - is why P2a was flat on decode: **it optimized the wrong FP4 kernel.** - -### 2. 
Ground-truth per-step decode decomposition + the single biggest overage - -From the profiler's fresh post-SSM eager nsys, both at batch 128, prefill-free, GPU-accurate: - -| component (per decode step) | llama ms | llama% | vLLM ms | vLLM% | gap (llama-vLLM) | -|-----------------------------|----------|--------|---------|-------|------------------| -| GDN recurrence kernel | 193 | 38% | 174 | 53% | **+19** | -| FP4 matmul + act-quant | 236 | 46% | 117 | 36% | **+119** | -| - mul_mat_vec_q (o_proj GEMV) | **132** | **26%** | 0 | - | **+132** | -| - mul_mat_q (MMQ GEMM) | 88 | 17% | 61 (cutlass) | 19% | +27 | -| - quantize_mmq_nvfp4 | 16 | 3% | 55 (nvjet+cvt)| 17% | -39 | -| full attention (16 layers) | 6.6 | 1.3% | 6.2 | 1.9% | +0.4 | -| SSM conv + glue/elementwise | 45 | 9% | 22 | 7% | +23 | -| MEMCPY | 2.5 | 0.5% | 0.36 | 0.1% | +2 | -| **TOTAL** | **~510** | 100% | **~326**| 100% | **+184** | - -The +119 ms FP4-matmul gap is ENTIRELY the `mul_mat_vec_q` o_proj GEMV (+132), partly offset -by llama being -39 on activation-quant (16 vs vLLM's heavier eager 55) and +27 on the MMQ. So -the one lever that matters is the +132 ms/step o_proj GEMV; everything else nets to ~+52 ms. - -**MECHANISM (confirmed by source read, not inferred).** In the dense Qwen3.5-27B GDN block -(`src/models/qwen3next.cpp` `build_recurrent`), the recurrent core keeps the SSM layout -`[feat, n_seq_tokens, n_seqs]`. At decode `n_seq_tokens=1, n_seqs=128`. The output projection is: - -```cpp -// current code (qwen3next.cpp, end of the GDN block) -ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, - head_v_dim * num_v_heads, n_seq_tokens, n_seqs); // [6144, 1, 128] -cur = build_lora_mm(model.layers[il].ssm_out, final_output); // <-- the matmul -cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs); // collapse AFTER -``` - -`final_output` is 3D `[6144, n_seq_tokens=1, n_seqs=128]`, so `src1->ne[1] = 1`. 
The ggml-cuda
-dispatch (`ggml-cuda.cu:2553`) picks MMVQ when `src1->ne[1] <= MMVQ_MAX_BATCH_SIZE (8)`, with the
-128 sequences carried in `ne[2]`. Result: a per-sequence FP4 GEMV, output rows 5120 x 128 seqs =
-**`mul_mat_vec_q`, grid 5120x128, 48 calls/step (one per GDN layer)** - matching the profiler's
-trace exactly. MMVQ does NOT amortize the `ssm_out` weight read into shared memory across the 128
-sequences (it is built for batch <=8), so each of the 128 sequences re-streams the weight tiles -
-the "40% vs 62% utilization" the weight-bandwidth section measured lives HERE, in this kernel, not
-in the GDN state traffic. vLLM packs all 128 decode tokens into one cutlass M=128 GEMM (its GDN
-kernel is literally `..._PACKED_decode`), so it has NO GEMV-at-batch-128 at all.
-
-This also pins WHY it is decode-specific: at prefill the tokens are in `ne[1]` (n_seq_tokens=prompt
-len), so `ne[1] >> 8` -> MMQ already; only the decode layout (128 seqs x 1 token, batched in ne[2])
-trips the GEMV path. The in-projection (`wqkv`) is unaffected: its input is the 2D residual stream
-`[n_embd, 128]` (reshaped to 3D only AFTER the matmul), so `ne[1]=128` -> MMQ today. The o_proj is
-the unique 3D-input matmul, which is exactly why the profiler counted one MMVQ per GDN layer.
-
-### 3. Ranked remaining decode levers (impact x tractability, cumulative ceiling toward 391)
-
-Anchored on llama 256.6 t/s (499 ms/step) -> vLLM 391 (327 ms/step), gap 172 ms/step. Recover
-figures past Lever 1 are ESTIMATES (the profiler measured the costs, not the post-fix kernels);
-each needs a confirming re-profile. Ceilings are cumulative.
-
-| # | lever | targets (ms/step) | est. recover | cumulative decode_agg | % of vLLM | tractability |
-|---|-------|-------------------|--------------|-----------------------|-----------|--------------|
-| 1 | **o_proj MMVQ -> MMQ** (collapse final_output to 2D before `ssm_out`) | vec_q 132 | ~100-110 | ~320-330 | **~82-85%** | **VERY HIGH** (2-line reshape, bit-exact, MMQ already proven on NVFP4 at M=128 by the in_proj) |
-| 2 | act-quant + norm prologue fusion (explore L1 `LLAMA_FUSE_NVFP4_QUANT=1` re-bench + L2 M=128 gate) | quant 16 + 448 launches/step | ~15-25 | ~345-360 | ~88-92% | MED-HIGH (producer code exists, tasks 38-40; needs post-0019 re-bench, the pre-SSM regression is stale) |
-| 3 | GDN-area fusion + occupancy (gdn A-D: row-local reduction, raise launch_bounds occupancy, fold gate/l2norm/softplus into the recurrence) | GDN +19 + glue +23 | ~25-40 | ~375-388 | ~96-99% | MED-LOW (real kernel rewrite + numeric re-validation) |
-| 4 | conv-state in-place + conv fuse (explore L4, the proven 0018/0019 pattern on `ssm_conv`/concat) | part of glue, 48 launches/step | ~5-10 | ~388-395 | ~99-101% | HIGH (bit-exact, proven pattern) |
-| - | between-step host gap / cgraph reuse | ~2 ms/step | ~2 | +~0.4% | n/a | LOW value (cleanup, not a parity lever) |
-| x | CUDA graphs | - | 0 | already on | n/a | NOT a lever (+6% even for vLLM) |
-| x | TMA weight-feed / NVFP4-dense weight-quant | prefill / npl1 | 0 at npl128 | n/a | n/a | MIS-SCOPED for batch-128 decode (prefill / low-batch levers; prefill already +12.7%) |
-
-Note on Lever 1+2 coupling: routing the o_proj to MMQ ADDS one activation-quant (q8_1/NVFP4) per
-o_proj, so Lever 2 (fusing that quant into the preceding `build_norm_gated`) compounds Lever 1
-rather than overlapping it. Lever 3's "glue +23 ms" and Lever 1's quant are the same elementwise
-passes vLLM folds into its packed kernel, so 2+3 share surface - treat the estimates as a band,
-not a sum.
-
-### 4. Verdict: is true decode parity reachable?
- -**Yes, parity is reachable in software, and the residual is NOT a hardware/architecture floor.** -Proof of "not a floor": both engines read identical NVFP4 weights and read+write identical f32 -recurrent state = identical 55.5 GB/step DRAM floor (203 ms) on the identical GB10 LPDDR5x; vLLM -achieves 62% bandwidth utilization (327 ms/step) where llama achieves 40% (499 ms/step). The 1.54x -throughput gap equals the 1.55x utilization gap, and that utilization gap is now LOCALIZED to -specific llama kernels - chiefly the o_proj MMVQ - every one of which is closable in software. The -GDN recurrence (the supposed floor) is only +11%/call between the two engines. - -How far each tier reaches: -- The first ~84% of parity (256 -> ~325) is nearly FREE: Lever 1 is a 2-line reshape that moves - the GDN output projection from a per-sequence FP4 GEMV to a tensor-core M=128 FP4 GEMM, bit-exact, - no new kernel (MMQ already runs the in-projection at this exact shape and type). -- ~84% -> ~92% (Levers 1+2) is low-effort: the fused act-quant producer already exists (tasks - 38-40), it just needs a post-0019 re-bench because its pre-SSM regression was measured when the - GPU was 99% busy on the now-removed state-copy chain (no idle to reclaim then; real idle now). -- ~92% -> ~100% (Levers 3+4) is the diminishing-returns tail and the only genuinely HARD work: - matching vLLM's fully-fused `packed_decode` GDN kernel (row-local reductions, higher occupancy, - folding the gate/l2norm/softplus elementwise passes into the recurrence). This last ~8% is "hard - but not floored" - it is kernel engineering, not a hardware wall. 
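The "not a floor" arithmetic above can be re-derived from the quoted figures alone. The sketch below is only a worked check, not part of either engine: the helper names (`step_ms`, `floor_ms`, `utilization`) are made up for illustration, and the inputs are the numbers stated in this section (128 tokens per step, 55.5 GB/step, 273 GB/s spec bandwidth, 256.6 and 391 t/s).

```cpp
#include <cassert>
#include <cmath>

// Wall time of one decode step in ms, when each step emits `npl` tokens
// (one per sequence) and the aggregate throughput is `tps` tokens/s.
inline double step_ms(double npl, double tps) {
    return 1000.0 * npl / tps;
}

// Time the DRAM traffic alone would take at spec bandwidth:
// GB / (GB/s) gives seconds, scaled to ms.
inline double floor_ms(double gb_per_step, double bw_gb_per_s) {
    return 1000.0 * gb_per_step / bw_gb_per_s;
}

// Fraction of the spec bandwidth actually achieved by a step of `step` ms
// whose traffic would take `floor` ms at full bandwidth.
inline double utilization(double floor, double step) {
    return floor / step;
}
```

With these inputs the floor comes out at roughly 203 ms, the llama step at roughly 499 ms (about 41% utilization) and the vLLM step at roughly 327 ms (about 62%), which matches the figures quoted in the verdict to within rounding.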
- -**Single highest-value next step (do this first):** apply Lever 1 - collapse `final_output` to 2D -`[head_v_dim*num_v_heads, n_seq_tokens*n_seqs]` BEFORE the `ssm_out` matmul (drop the now-redundant -post-matmul `reshape_2d`): - -```cpp -// route the GDN output projection through tensor-core MMQ at decode: -// M = n_seq_tokens*n_seqs (=128 at decode) instead of ne[1]=1 -> MMVQ GEMV. Free, bit-exact. -ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, - head_v_dim * num_v_heads, n_seq_tokens * n_seqs); -cur = build_lora_mm(model.layers[il].ssm_out, final_output); // now [n_embd, n_tokens], M=128 MMQ -``` - -Then the profiler re-measures the realized o_proj-as-MMQ cost on a clean post-0019 nsys (the one -number this synthesis estimates rather than measures) and confirms the 256 -> ~320-330 lift. The -same 3D-input-matmul pattern almost certainly affects the MoE checkpoint (q36-35b-a3b) decode and -any other matmul that consumes a tensor still in the `[feat, 1, n_seqs]` SSM layout - grep those -and apply the same collapse. Levers 2-4 follow in priority order; none requires a model or accuracy -compromise, so bit-exactness is preserved throughout. - -### Evidence (this section) -- Source read (DGX `~/llama-paged-dev`, read-only): `src/models/qwen3next.cpp` (GDN in/out proj - layout, lines ~286-305 and ~518-528), `ggml/src/ggml-cuda/ggml-cuda.cu:2553` (MMVQ dispatch on - `ne[1]<=8`), `ggml/src/ggml-cuda/mmvq.cuh:3` (`MMVQ_MAX_BATCH_SIZE 8`), `mmq.cu:267` (NVFP4 is - MMQ-supported). -- All five prior sections of this doc + the profiler's `~/bench/postssm_decomp/` traces. 
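The routing rule this section relies on can be modeled in a few lines. This is a deliberately simplified, hypothetical model and not ggml's dispatch code: the real selection in `ggml-cuda.cu` also depends on tensor types, compute capability and other conditions not reproduced here. `Route`, `pick_route`, `collapse_to_2d` and `kMmvqMaxBatchSize` are illustrative names; only the `ne[1] <= 8` threshold and the shapes come from the text above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed to mirror the MMVQ_MAX_BATCH_SIZE value (8) cited from mmvq.cuh.
constexpr int64_t kMmvqMaxBatchSize = 8;

enum class Route { MMVQ, MMQ };

using Shape = std::array<int64_t, 4>;  // ne[0..3], ggml-style: ne[0] = features

// Simplified rule: only ne[1] of the activation operand is inspected, so
// sequences carried in ne[2] do not count toward the batch size.
inline Route pick_route(const Shape & src1_ne) {
    return src1_ne[1] <= kMmvqMaxBatchSize ? Route::MMVQ : Route::MMQ;
}

// The Lever 1 reshape: fold n_seqs (ne[2]) into the token dimension (ne[1]).
inline Shape collapse_to_2d(const Shape & ne) {
    return Shape{ne[0], ne[1] * ne[2], 1, 1};
}
```

Under this model the decode layout `[6144, 1, 128]` selects the GEMV path, the collapsed `[6144, 128]` selects MMQ, and a prefill layout with a long `ne[1]` already selects MMQ, which is the behavior the section describes.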
- -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md b/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md deleted file mode 100644 index 2cd3af3e3..000000000 --- a/backend/cpp/llama-cpp/patches/paged/F16_DENSE_RESIDUAL_PROBE.md +++ /dev/null @@ -1,184 +0,0 @@ -# F16/BF16 Glue Probe - the dense decode residual to vLLM - -Question: dense decode parity sits at llama 384.6 vs vLLM 418.8 t/s @ npl128 = 91.8%. -The 49% SSM recurrence (f32 BOTH engines) and the 27% NVFP4 GEMM (W4A4 BOTH) are -precision-matched. The residual ~8% may be partly that llama runs the NON-recurrence -GLUE (attention, norms, activations, elementwise, residual stream) in F32 while vLLM -runs the model in BF16. This probe settles, empirically on q36-27b-nvfp4 @npl128, how -much of that residual is realistically f16/bf16-closable. - -Model: Qwen3.5-27B NVFP4 (dense). 64 layers = 16 attention + 48 gated-DeltaNet -(SSM) recurrent. Build b104-f7409c2 (patch 0023), verified git-clean and coherent. -The bf16 SSM work was never applied to the tree (only saved as a diff backup); -ggml-cuda needed no recompile on rebuild, so the binary is bit-identical to clean 0023. - -## (1) Current KV / state dtype (SETTLED) - -From the `-v` init log: - -- ATTENTION KV cache (16 of 64 layers): - `K (f16): 1280 MiB, V (f16): 1280 MiB` => **DEFAULT IS ALREADY F16.** -- RECURRENT cache (48 gated-DeltaNet layers): - `R (f32): 180 MiB` (conv state), `S (f32): 4608 MiB` (SSM state) => **f32.** - -Consequence: the attention KV is ALREADY at vLLM's 16-bit bit-width. `--cache-type f16` -is a literal no-op; the cheap KV lever is spent. The f32 lives in (a) the recurrent -SSM/conv state (matched to vLLM, the bf16 version is shelved for failing the f32 KL -gate) and (b) the intermediate-activation glue (norms, residual stream, attention -compute, activations) - that glue is where llama still pays f32 vs vLLM bf16. 
- -## (2) Decode kernel budget (nsys --cuda-graph-trace=node, npl128, 39 steady steps) - -step span 342.0 ms ; sum-of-kernels 338.8 ms ; **kern/span 99.0%** - the decode is -GPU-bound, kernels back-to-back, nsys overhead negligible. The measured bench step -(128 tok / 373.5 t/s = 342.8 ms) equals the nsys span, so the %-of-step figures below -ARE wall-time fractions. - -OUT of scope - already precision-matched (83.2% of the step): - -| kernel | ms/step | % | -|---|---:|---:| -| gated_delta_net (SSM recurrence, f32 BOTH) | 167.1 | 49.3 | -| mul_mat_q NVFP4 (W4A4 GEMM, BOTH) | 93.0 | 27.4 | -| quantize_mmq_nvfp4 (FP4 act-quant) | 17.6 | 5.2 | -| mul_mat_q stream_k fixup (FP4 reduction) | 4.1 | 1.2 | - -F16-ABLE GLUE - f32 in llama, bf16 in vLLM: - -Budget A (clean compute glue, decoupled from the f32 state): - -| kernel | ms/step | -|---|---:| -| flash_attn_ext | 11.94 | -| unary_gated_op (silu) | 5.16 | -| k_bin_bcast (mul) | 4.72 | -| rms_norm | 3.58 | -| k_bin_bcast (add, residual)| 1.67 | -| l2_norm | 0.65 | -| cpy_scalar | 0.37 | -| rope | 0.26 | -| sigmoid | 0.22 | -| softplus | 0.09 | -| flash_attn fixups | 0.08 | -| **Budget A total** | **28.74 ms = 8.4% of step** | - -Budget B (+ the non-FP4 cublas GEMM): + nvjet 12.17 ms => **40.91 ms = 12.0%**. - -Recurrence-coupled data movement (NOT bit-safe f16-able - needs the f32 state to go -bf16, which is the shelved work that fails the f32 KL gate): -ssm_conv 8.37 + k_get_rows_float 6.98 + k_set_rows 0.66 + gdn_gather 0.06 = 16.08 ms = 4.7%. - -## (3) Cache-type A/B (decode_agg S_TG t/s, dense) - -| npl | DEFAULT | F16-explicit | Q8_0 | -|---:|---:|---:|---:| -| 32 | 209.05 | 208.75 | 208.63 | -| 128 | 373.46 | 373.56 | 374.71 | - -- F16-explicit == DEFAULT (0.03% delta) => proves the default KV is already f16; the - flag is a no-op. -- Q8_0 (8-bit, half the f16 KV bytes) is within noise at every npl => the attention KV - bandwidth is NOT a decode bottleneck (it is 16/64 layers; flash_attn is 3.5% of the - step). 
The KV-cache dtype is not a decode lever for this model. -- Coherence (48-tok greedy, "The capital of France is"): default and q8_0 both fully - coherent; q8_0 only causes minor greedy-path divergence, no quality break. But since - q8_0 buys zero speed and is not bit-exact, it is pointless here. - -## Read: how much of the ~8% dense residual is f16-closable - -The gap is ~27 ms/step (llama 332.8 ms vs vLLM 305.7 ms at npl128). - -f16 does not zero the glue, it speeds it up. Realistic recovery: -- Memory-bound glue (norms + elementwise + activations + copies + rope = 16.7 ms): - f16 halves the bytes => ~50% => ~8.4 ms. -- flash_attn_ext (12.0 ms): KV is ALREADY f16 and the accumulation must stay f32 - (vLLM also f32-accumulates), so only the Q/projection side helps => ~25% => ~3.0 ms. -- Budget A realistic recovery ~= **11.4 ms**. -- nvjet non-FP4 GEMM (12.2 ms): bf16 tensor cores vs f32 ~= ~40-50% => ~5 ms, but - uncertain (may already run TF32) => +nvjet recovery ~= **16 ms**. - -So f16/bf16 glue realistically recovers **~11 ms (glue only) to ~16 ms (+GEMM) of the -~27 ms gap = roughly 40-60% of the dense residual.** That moves parity 91.8% -> -~95-96%, NOT a full close. The remaining ~3-4% is structural: cublas GEMM efficiency -on the non-FP4 paths, graph/launch scheduling vs vLLM, and the irreducible f32 -accumulation in attention and the recurrence. - -Caveats for a build decision: -1. The single largest f16-able line (flash_attn 11.9 ms) is the LEAST recoverable - (KV already f16, accumulate stays f32). The cleanly recoverable mass is the - norms+elementwise+activations (~16.7 ms). -2. The recurrence-coupled 4.7% (ssm_conv + state gather) is only f16-able by taking the - SSM/conv state to bf16 = the already-built, already-shelved work that fails the f32 - KL gate. It is OUT of a bit-safe f16 build. -3. f16 glue is NON-bit-exact (same category as the shelved bf16 SSM state). It would be - an OPT-IN fast path, not the bit-exact default. 
Realistic ceiling ~95-96% parity for - a meaningful (norms/elementwise/activations + optionally nvjet) f16 conversion, at - the cost of leaving the 95%-bit-exact f32 plateau. - -## (4) What it costs to capture it: NOT a flag (source map, read-only) - -The asymmetry confirmed at the source level (DGX `~/llama-paged-dev` @ f7409c2, tree -git-clean; vLLM ref from BITEXACT_VS_VLLM.md): -- vLLM `text_config.dtype = bfloat16` => the ENTIRE non-quantized compute (residual - stream, RMSNorm I/O with f32-internal reduction, FlashAttention out, SiLU, gating, - conv state) runs in BF16. Only the gated-DeltaNet temporal SSM state is f32 - (`mamba_ssm_dtype="float32"`, matched to llama). -- llama's intermediate activations are F32 **by construction, everywhere**: - `ggml_mul_mat` hardcodes an F32 result (ggml.c:3250), so the stream snaps back to F32 - after EVERY projection (Q/K/V/O, wqkv, ssm in/out, ffn up/gate/down, eh_proj, lm_head). - `ggml_rms_norm`/`ggml_l2_norm`/`ggml_silu`/`ggml_add`/`ggml_mul`/`flash_attn_ext`/ - `ggml_ssm_conv` all preserve/emit F32. There is no point where the stream is f16. - -There is **no vLLM-style global model-compute-dtype knob** in ggml/llama. You cannot flip -one model-load flag. Three escalating options, all opt-in / non-bit-exact: - -- A flag: does not exist and cannot exist as-is - the F32 is structural, not a default. -- Option 1 (targeted per-op f16, no new kernels): silu/sigmoid/softplus (unary.cu), - add/mul (binbcast.cu), rope already have f16 paths. But the residual stream stays F32, - so each op must be wrapped cast(F16)->op->cast(F32), adding 2 `cpy` ops per op. At - decode these ops are tiny and memory-bound; the cast traffic ~= the op traffic, so the - net win is near-zero or negative unless the cast is FUSED into the producer/consumer. - Crucially this CANNOT capture the norms - the largest glue item. 
-- Option 2 (the real lever, multi-file code change): carry the residual stream in F16 - across the layer, cast to F32 only at the quantize boundary. Requires (a) f16 projection - output (patch `ggml_mul_mat` to honor a dst-type, or a cpy->F16 after each proj), - (b) **NEW F16 template instantiations in norm.cu** for rms_norm / l2_norm / fused - rms+mul / fused rms+mul+add (today hard-`GGML_ASSERT(type==F32)` at norm.cu:441-442, - 465-466, 525-527, 601-604) keeping the f32 reduction, (c) optionally an F16 ssm-conv.cu, - plus graph-dtype plumbing in qwen35.cpp / llama-graph.cpp to thread F16 through - inpL/cur/the residual adds. The single biggest code item is the norm.cu f16 kernels - - the exact band vLLM runs in bf16 that Option 1 cannot reach. - -Must-stay-f32 regardless (vLLM does the same): RMSNorm/L2Norm sum-of-squares reduction; -FlashAttention KQ/softmax accumulation (forced `GGML_PREC_F32`, llama-graph.cpp:2117); -the gated-DeltaNet recurrent SSM temporal state (f32 BOTH engines, out of scope); the -src1->q8_1/nvfp4 activation quantization reads F32, so the stream must be F32 at every -projection boundary no matter what. - -## Verdict: probe-further-then-decide, leaning not-worth-it for the default - -f16 does NOT meaningfully close the dense residual on its own, and what it can close is a -multi-file non-bit-exact build, not a flag. - -- Precision is NOT the dominant cause of the 8% gap. 83.2% of the decode step (recurrence - 49.3% + FP4 GEMM 27.4% + FP4 act-quant/fixup 6.4%) is already precision-matched f32/W4A4 - on both engines. The f16-able glue is only 8.4% of the step (Budget A); of the ~27 ms - gap, f16 realistically recovers ~11 ms (glue) to ~16 ms (+ the uncertain nvjet GEMM) = - 40-60% of the residual. The remaining ~3-4% is kernel/scheduling efficiency (non-FP4 - cublas GEMM, graph-launch overhead, irreducible f32 accumulation) that f16 cannot touch. 
-- The recoverable mass is the norm+elementwise+activation band, which is precisely the - part that needs NEW f16 norm kernels (Option 2). The no-new-kernel ops (Option 1) are - too small and their cast overhead likely eats the win. -- Any version is opt-in / non-bit-exact, the same gate-failing category as the already - shelved bf16-SSM-state work. It cannot be the bit-exact f32 default; it is a second, - separately-maintained fast path with a ~95-96% ceiling. - -Recommendation: do NOT build the f16 glue path now. Ship the 95%-bit-exact f32 plateau -(patches 0018-0023) as the default. If chasing the last 4% later, the only lever worth a -build is Option 2's norm.cu f16 kernels + f16 residual stream (recovers the norm/elementwise -band, ~11 ms); gate it behind an explicit opt-in flag and validate it against the same KL -threshold as bf16-SSM before shipping. The non-FP4 cublas GEMM efficiency and graph-launch -scheduling - the structural ~3-4% - are a better long-term target than precision, because -they help the bit-exact default too. - -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md b/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md deleted file mode 100644 index cf1c24ea8..000000000 --- a/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md +++ /dev/null @@ -1,532 +0,0 @@ -# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — build-ready scope + honest go/no-go - -Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** Track B is the -residual-kernel track after track A (fuse the standalone `quantize_mmq_fp4` activation-requant, the -8.2% decode bucket — tasks 38-41, the fused `rms_norm+mul+nvfp4-quant` producer + prequantized-MMQ -consumer) is handled separately. Track B owns the **weight GEMM**, the ~59% bucket. 
- -**The load-bearing question, restated:** at the decode batch shape (M≈128 tokens fused into one -ubatch, NVFP4 weights), is the weight GEMM **compute-bound** (FP4-MMA throughput is the lever → -parity reachable with a better kernel) or **bandwidth-bound** (273 GB/s weight-read is a hard floor → -parity capped)? And given the GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach -vLLM's **391 (dense) / 811 (MoE)** decode-agg tok/s @npl128, or only partway? - -Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x. -Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the -committed nsys decomposition + measured GB10 specs + a source read of the FP4-MMA kernel; **no new GPU -benchmarks were run** (track A is on the box). - -## 0. Grounded inputs (measured, committed) - -| quantity | value | source | -|---|---|---| -| LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` | -| LPDDR5x bandwidth (achieved, batch-1 weight read) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 study | -| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 | -| BF16 / INT8 peak | ~213 TFLOP/s / ~215 TOPS (INT8 == BF16 on GB10) | same §2 | -| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); **~3% dense / ~35%-of-BW MoE at decode** | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` | -| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file; ~18 GB matmul tensors | `du` on DGX | -| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step @npl128 (~98% experts hit) | `du` on DGX | -| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a | -| 
Measured per-step @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` | -| Aggregate decode @npl128 (the parity scoreboard) | dense **161** (llama) vs **391** (vLLM); MoE **333** vs **811** | `QWEN36_NVFP4_BENCH.md` | - -`decode_agg = npl / step_s = 128 / step_s`. Crossover formula throughout: -`M* = b · peak / (2 · BW)`, `b` = bytes per weight element. Below `M*` bandwidth-bound, above it -compute-bound. - ---- - -## 1. The kernel-approach decision: TUNE the existing FP4-MMA `mul_mat_q`, do NOT write a cutlass kernel - -This is the first thing track B must settle, and the evidence settles it decisively. - -| option | verdict | why | -|---|---|---| -| **(A) Tune the existing `mul_mat_q` FP4-MMA path** | **CHOSEN — the tractable spine** | The kernel already exists, is **bit-exact** (`test-backend-ops MUL_MAT` 1103/1103), is genuine **W4A4** (below), and already **beats vLLM at batch-1 prefill** (MXFP4 1153 t/s vs vLLM's 800 W4A16 — vLLM has no FP4 cubins on sm_121). The deficit is **decode-shape scheduling**, not the math op. Host-side selection + a bounded occupancy tune respects the GB10 lessons and is build-ready against known files/lines. | -| **(B) New cutlass-style SM120 FP4 collective** | **REJECTED** | Repeats the **proven GB10 dead-end**: the from-scratch W4A16 BF16 GEMM hit only ~9–15 TFLOP/s (¼ of MMQ) and was **STOPPED** (`W4A16_MARLIN_KERNEL_PLAN.md`) because deep `cp.async` + XOR-swizzle **collapse GB10 occupancy**. Worse, **CUTLASS's own SM120 grouped block-scaled FP4 GEMM is broken on consumer Blackwell** (garbage/init-fail — CUTLASS #3096/#2800) — it is the exact reason vLLM falls back to **BF16 Marlin** for its MoE on sm_121. "Port cutlass" is not even a working option for the MoE arm. 
| -| **(C) Marlin-style W4A16 (FP4→BF16 dequant + BF16 HMMA)** | **REJECTED for the win, noted for context** | This is what **vLLM's MoE actually runs** on sm_121 (W4A16, BF16 activations, dequant-in-mainloop). On GB10 **INT8 == BF16 == ½ FP4 rate**, so a BF16-HMMA path concedes the 2× FP4 advantage llama already has. We do not want to *descend* to vLLM's slower arithmetic class; we want to keep the FP4-MMA class and schedule it better. | - -**Decision: track B = tune `mul_mat_q` (dense, `mmq.cu`/`mmq.cuh`) + the grouped `mul_mat_q` -id-branch (MoE, `mmid.cu` + the same `mmq.cuh`).** No new kernel, no rewrite, no descent to BF16. -The win is kernel *engineering around an FP4-MMA llama already possesses*, so there is **no -hardware-instruction wall** — but it is gated by whether MMQ's occupancy-bound design can be pushed -to the bandwidth floor at the thin decode M-tile. - -### What "the existing path" actually is (source-read, DGX `ggml/src/ggml-cuda/`) - -Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens fused into one -ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128× M=1). The NVFP4 path: -`mmq.cu` `use_native_fp4` gate (L125) → `quantize_mmq_fp4_cuda` act-quant (L138 dense / L200 id; -**track A's fuse target**) → `mul_mat_q` → `vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) → -`mma_block_scaled_fp4` (`mma.cuh:1126`). - -**Confirmed W4A4 (this corrects an earlier "A is 8-bit-class" framing):** `block_fp4_mmq` -(`mmq.cuh:53`) is `uint32_t d4[4]` (four `ue4m3` block scales) + `int8_t qs[4*32]` = **256 FP4 (e2m1) -values packed 2-per-byte**. `quantize_mmq_fp4_cuda` (`quantize.cu:422`) emits FP4 via -`ggml_cuda_float_to_fp4_e2m1`. The MMA is -`mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3` -(`mma.cuh:1145`) — **both operands e2m1, ue4m3 block scales**. So llama's dense FP4-MMA path is -already the *same arithmetic class as vLLM's cutlass W4A4 dense*. 
The `sizeof(block_fp4_mmq) == -sizeof(block_q8_1_mmq)` static_assert is a shared-tile-footprint convention, **not** an 8-bit -activation. **Consequence: there is no "make activations 4-bit" work to do and no activation-traffic -halving to win — that is already banked. The entire dense deficit is scheduling/occupancy.** - -Geometry (`vec_dot_fp4_fp4_mma`): `MMQ_NWARPS=8`, `iter_k=MMQ_ITER_K_FP4=512`, tiles -`tile_A<16,8,int>` (weights, 16 N-rows × 64 FP4-in-K), `tile_B<8,8,int>` (acts, 8 M-cols × 64 -FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags = MMQ_TILE_NE_K/tile_A::J`. The M loop -is `for (j0=0; j0 **`mmq_x` tiles M (tokens / output columns) — shrinking it RE-READS the weights `ntiles_x` times. -> `mmq_y` tiles N (weight rows / output rows) — shrinking it does NOT re-read weights (each weight row -> lives in exactly one row-tile); it only lowers shared footprint and raises occupancy.** The two -> regimes pick opposite knobs: - -| | dense decode (M=128, no `expert_bounds`) | MoE decode (per-expert M≈4) | -|---|---|---| -| selection picks | `mmq_x=128` → `ntiles_x=1` → **weights read ONCE** (the one-read optimum) | `mmq_x=128` applied **per expert** → tile ~3% filled | -| shrink `mmq_x`? 
| **NO — re-reads 18 GB ×`ntiles_x`**, fatal in the BW-bound regime | **YES, FREE** — 1 col-tile/expert regardless, no re-read → strictly occupancy-positive | -| FP4-MMA M-frag fill | **full** (128/`tile_C::J`=16 frag-groups, all live) → no fragment waste | **wasted** (~1 of 8/16 frag-groups live, rest masked tails) | -| BW-neutral occupancy lever | **`mmq_y`↓** (more resident CTAs, weights still read once) — kernel-structure change | **`mmq_x`↓** (toward density ≈8) — host-side template switch | -| dominant loss | **occupancy** at the heavy 128×128 tile (exposed weight-load latency) | **tile-fill** (dense-tuned M-tile applied to ragged M≈4) | - -This asymmetry is the spine of the plan: **MoE's lever is host-only `mmq_x`↓ (already landed as patch -0015 auto-cap→64; ideal ≈8–16); dense's lever is `mmq_y`↓ + occupancy, a bounded kernel change.** - -The five inefficiencies, ranked: - -1. **Separate activation-quant pass (track A's bucket, 8.2%).** `quantize_mmq_fp4_cuda` writes the - whole activation tensor to `block_fp4_mmq` in a standalone kernel; vLLM fuses `scaled_fp4_quant` - into the preceding RMSNorm/SiLU epilogue. **Handoff (track A → B):** B must consume A's prequantized - `block_fp4_mmq` y-tile in place of calling `quantize_mmq_fp4_cuda`, so the fusion saves the - activation round-trip, not just the launch (see §4.4). - -2. **No weight-load software pipeline → exposed latency at thin M (the #1 dense kernel lever).** - `load_tiles_nvfp4_nvfp4` (`mmq.cuh:946`) does plain global→shared stores → `__syncthreads` → - `vec_dot_fp4_fp4_mma` (`load_ldmatrix` of A + MMA): a **load→sync→compute→repeat** cadence with **no - `cp.async` double-buffering** overlapping the next k-block weight load with the current MMA. At - M=128 the per-tile MMA work is small, so serialized weight-load latency dominates → 2.9% (dense) / - 35%-of-BW (MoE). **Caveat (the GB10 wall):** a *deep* pipeline + XOR-swizzle collapses GB10 - occupancy (`W4A16_MARLIN_KERNEL_PLAN.md`). 
The fix is **occupancy-first** (raise resident CTAs to - hide latency via CTA-parallelism), **shallow 2-stage prefetch second**, never Marlin's 4-stage. - -3. **`mmq_x` maximized for dense = occupancy-heavy, but pinned by the one-read constraint.** At dense - decode the 128×128 tile (8 warps, large shared) is low-occupancy on the occupancy-dominated GB10 — - but you cannot shrink `mmq_x` without doubling the 18 GB weight read. So the dense occupancy fix is - **`mmq_y`↓** (BW-neutral), not `mmq_x`↓. - -4. **MoE per-expert M-tile waste (the structural MoE gap).** The 128-wide (or patch-0015 64-wide) - tile is applied per expert at density ≈4, so the accumulator is ~3–6% filled and ~1 `tile_C` frag- - group is live, the rest masked `need_check` tails. Ideal `mmq_x` ≈ tokens/expert ≈ 8 (= `tile_C::J`). - At ≤1 col-tile/expert this costs **no** extra weight read → strictly occupancy-positive. (This is - the MoE arm of inefficiency 3; scoped in `MOE_GROUPED_GEMM_SCOPE.md`.) - -5. **`iter_k=512` (FP4) couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → larger - shared footprint → adverse in the occupancy-bound regime. A P2 tuning knob. - -**Ruled out (do not chase):** redundant weight reads on the *current* selection (none — dense -`ntiles_x=1`, MoE ≤1 col-tile/expert); stream-K fixup (it *helps* fill the small GB10 grid at thin M); -raw FP4-MMA peak rate (already beats Q4-MMQ and is BW-bound at batch 1 — latency-hiding binds first). - ---- - -## 4. The specific build-ready changes - -All against DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`. Every change is gated and defaults to exact -stock behavior until proven. - -### 4.1 Dense M-tile / occupancy (the make-or-break) - -- **Keep `mmq_x=128` at dense decode** (the one-weight-read optimum; do **not** shrink it — that - re-reads 18 GB). Lock this as an invariant in P0. -- **Make `mmq_y` decode-selectable** (`get_mmq_y_host`/`get_mmq_y_device`, L143/L157). Today pinned - 128; try **64** (and 96) at decode. 
`mmq_y` is coupled to `nwarps × tile_C::I` via the MMQ - static_assert, so this is a **warp/fragment remap** (bounded kernel change), not a pure host switch: - fewer N-frags per warp or fewer warps → smaller per-CTA shared → **more resident CTAs → latency - hidden by CTA-parallelism**, with **weights still read once** (BW-neutral). This is the primary - dense occupancy lever and respects every GB10 rule. -- **Host-only knobs first (P1, zero kernel):** the `mmq_get_granularity_host` choice (L274 — sets - `rows_per_warp=2·granularity`, `ntx`), and the stream-k-vs-xy-tiling threshold (`launch_mul_mat_q` - ~L3954, `tiles_efficiency_percent` L4001). Plus one **empirical A/B**: does eating a 2× weight - re-read at `mmq_x=64` buy enough occupancy to net positive? (Diagnostic: if yes, occupancy is badly - broken and P2 `mmq_y`↓ has large upside; if no, the tile is already BW-saturated and P2's ceiling is - lower.) All behind `GGML_CUDA_FP4_MMQ_Y` / `GGML_CUDA_FP4_GRAN` / `GGML_CUDA_FP4_FORCE_STREAMK`. - -### 4.2 FP4-MMA fragment usage - -- Fragments stay `tile_A<16,8,int>` / `tile_B<8,8,int>` / `tile_C<16,8,float>` — these match the - `m16n8k64` block-scaled FP4 MMA and must not change (they are the instruction shape). At dense M=128 - all 16 `tile_C::J`-groups are live → **no dense fragment work needed**. The lever is *how many of - these tiles are resident per SM* (occupancy), set by `mmq_y`/`nwarps`/granularity, not the fragment - shape. -- MoE: shrink `mmq_x` toward `tile_C::J`=8 so the live frag-group count matches density (§4.3). - -### 4.3 MoE M-tile (`MOE_GROUPED_GEMM_SCOPE.md`, partly landed) - -- **Patch 0015 already auto-caps `mmq_x`→64 at decode** via per-expert density in `mul_mat_q_case` - (the `expert_bounds != nullptr` block, L4118-4165; env `LLAMA_MOE_DECODE_TILE`, - `LLAMA_MOE_DENSITY_MAX`). Tighten the decode tile toward **8–16** (= density) and sweep. 
-- **Optional [2]: block-padded `mm_ids_helper`** (`mmid.cu`) — pad each expert segment to a multiple - of the tile, removing `need_check` masked tails and tightening the stream-k schedule. Medium risk - (scatter + write-back masking); behind `LLAMA_MOE_BLOCK_ALIGN`. - -### 4.4 Scale handling + the act-quant fusion handoff (the track A → B ABI contract) - -- **Weight scales** (`ue4m3`, one per 16 weights) load in `load_tiles_nvfp4_nvfp4` into `x_sc` - (`x_u32 + 64 + kbx`), consumed as `scaleA` in `vec_dot_fp4_fp4_mma` and passed as the block-scale - operand to `mma_block_scaled_fp4`. **No change** — already a first-class MMA scale operand. -- **Activation scales** (`ue4m3`) live in the `block_fp4_mmq` y-tile `d4[4]`, consumed as `scaleB`. -- **The handoff contract:** track B must hold the **`block_fp4_mmq` y-tile layout invariant** - (`uint32_t d4[4]` ue4m3 scales + `int8_t qs[128]` = 256 packed FP4, `mmq.cuh:53`). Track A's fused - `rms_norm+mul+nvfp4-quant` producer (task 39) writes exactly this struct; track B's "prequantized - MMQ consumer" (task 40) makes `mul_mat_q` accept a prebuilt `src1_q8_1` buffer and **skip the - `quantize_mmq_fp4_cuda` call** (`mmq.cu:138`/`200`). The numerics must be **bit-identical** to the - unfused path (same `e2m1` rounding, same `ue4m3` block scale per 16) so the parity gate stays green - with the fusion on or off. B owns the consumer seam; A owns the producer kernel; the `block_fp4_mmq` - struct is the frozen interface between them. - -### 4.5 GB10-fit rules (binding constraints on every kernel change) - -- **Small shared mem + high occupancy.** Do **not** add deep `cp.async` stages or XOR-swizzle shared - layouts — they are exactly what collapsed W4A16 on GB10 (`W4A16_MARLIN_KERNEL_PLAN.md`: a 16 KB - XOR-swizzle dropped q4_K from 6.63→2.84 TFLOPS). -- **Preserve the skew-pad** (`MMQ_MMA_TILE_X_K_FP4 = 2·MMQ_TILE_NE_K + 8 + 4`, the `% 8 == 4` - padding, `mmq.cuh:221/233`) — conflict-free `ldmatrix` at ~zero shared cost. 
-- **Stay on the FP4-MMA path** (`block_fp4_mmq` / `mma_block_scaled_fp4`) — the only path at GB10's - FP4 = 2× INT8/BF16 rate. Never descend to BF16/INT8 (1:1 on GB10). -- **Occupancy beats a conflict-free-but-wide layout.** Buy latency-hiding with *more resident CTAs* - (smaller `mmq_y`, smaller shared), not a deeper pipeline. -- Tuning is **empirical** — `nsys` (throughput) is available, **`ncu` is not** on the DGX (no driver - perms). Sweep configs, measure decode_agg, bracket thermals (same-session cold A/B only). - ---- - -## 5. Correctness / parity gate (every phase) - -- **Primary, bit-exact:** `test-backend-ops test -o MUL_MAT -b CUDA0` and - `test-backend-ops test -o MUL_MAT_ID -b CUDA0` must stay **1103/1103** with the flag set **and** - unset, and **byte-identical** when unset. The CPU reference is the deterministic oracle; the op test - is exact (the GB10 greedy-decode non-determinism band applies only to end-to-end, never to the op - test). -- **Add decode-shape cases if absent:** `type_a ∈ {NVFP4, MXFP4}`, `type_b = F32`, dense **n=128** at - the real FFN K/N; for `_ID`, `n_mats=128, n_expert_used=8, n_tokens ∈ {8,32,64,128}` **plus ragged - small-M** (experts with 0/1/2 tokens, `n_tokens` not a multiple of `mmq_x`) — exactly where `mmq_x`/ - `mmq_y` changes and block-pad masking can leak. -- **Fusion-handoff parity (P3):** with track A's fused producer on, the prequantized-consumer path - must produce dst **identical** to the unfused `quantize_mmq_fp4_cuda` path (same `e2m1`/`ue4m3` - rounding). -- **End-to-end:** `llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf` - (dense) and `q36-35b-a3b-nvfp4.gguf` (MoE); confirm decode_agg climbs per §6 and output stays within - the documented CUDA batch-shape non-determinism band vs the CPU oracle. All scripts **dev-tree-only**. - ---- - -## 6. 
Phased plan, with expected decode_agg at each phase - -Per-step model used (ms @npl128): **dense 795** = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162; -**MoE 384** = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81. `decode_agg = 128 / step_s`. - -### DENSE (parity target 391) - -| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 391** | risk | -|---|---|---:|---:|---:|---:|---| -| **P0** harness | Lock baseline: 1103/1103, decode n=128 perf, nsys window, the 471 ms / 2.9% eff datum. Pin `mmq_x=128` one-read invariant. | 471 | 795 | **161** | 41% | low | -| **P1** host-only tile/grid + re-read A/B | granularity + stream-k threshold sweep; the `mmq_x=64` re-read-vs-occupancy diagnostic. **Honest: small** — `mmq_x` is pinned, so this mostly de-risks P2. | ~400 | ~724 | **~177** | ~45% | low | -| **P2** `mmq_y`↓ + occupancy/shallow-prefetch | The make-or-break: raise resident CTAs (`mmq_y` 128→64, granularity, shallow 2-stage weight prefetch, skew-pad), push GEMM toward the **66–81 ms BW floor (17–21% FP4 eff)**. **KILL-GATE: if eff plateaus <15% (GEMM >110 ms) → dense parity OFF, report partial.** | **66–81** | 390–405 | **316–328** | **81–84%** | **med-high** | -| **P3** co-land track A | Consume A's prequantized `block_fp4_mmq` y-tile; the 65 ms act bucket folds away. | 66–81 | **325–340** | **376–394** | **96–101%** | low | - -Dense climb: **161 → ~177 → 316–328 → 376–394** tok/s = **41% → 45% → 81–84% → 96–101% of vLLM 391.** -Robust to the 273-vs-216 GB/s uncertainty (@216 GB/s P3 → ~359 tok/s = 92%). **Parity within error, -contingent on P2 clearing the kill-gate and on A landing.** - -### MoE (parity target 811) - -| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 811** | risk | -|---|---|---:|---:|---:|---:|---| -| **P0** harness | Lock 1103/1103 + the monotonic `85→1771` batched-bench curve + 227 ms / 35%-BW datum. 
| 227 | 384 | **333** | 41% | low | -| **P1/P4** MoE `mmq_x`↓ (patch 0015 → tighten to 8–16) | Free per-expert tile shrink (no re-read); reclaim the 3–6% fill waste, raise occupancy. | ~140 | ~297 | **~431** | ~53% | low | -| **P2** block-pad align + occupancy | Remove `need_check` tails, tighten stream-k; push toward the 80 ms floor. | ~100 | ~257 | **~498** | ~61% | med | -| **P3** co-land track A | act bucket (31 ms) folds away; GEMM at the ~80 ms floor. | 80 | **207** | **618** | **76% — CEILING** | low | - -MoE climb: **333 → ~431 → ~498 → 618** tok/s = **41% → 53% → 61% → 76% of vLLM 811.** **The 76% is the -hard ceiling from the GEMM track:** even a *perfect* weight-read-floor grouped GEMM leaves llama's -non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) at **1.6× vLLM's whole ~78 ms non-GEMM**, so the step -cannot drop below ~207 ms. The remaining ~49 ms to vLLM's 158 ms step is elementwise + host-loop -(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md`), **outside track B.** - -### Explicitly NOT in scope (and why) - -- A from-scratch W4A16 / CUTLASS SM120 collective — repeats the STOPPED occupancy dead-end and - CUTLASS's grouped FP4 is broken on sm_121. -- Deep multi-stage `cp.async` / XOR-swizzle — proven to collapse GB10 occupancy. -- "Make activations 4-bit" — already W4A4; no work, no win there. -- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity but - **separate tracks**; B owns the GEMM only. - ---- - -## 7. The honest ceiling — does B reach TRUE PARITY? - -- **DENSE: TRUE PARITY is PLAUSIBLY REACHABLE, conditional, no margin.** The entire 2.42× gap is the - GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is **bandwidth-bound, not - hardware-capped**. 
**B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s = 90–103% of vLLM 391.** - The catch: it needs **~17–21% FP4-MMA efficiency at decode M=128**, and GB10 has only demonstrated - ~17% — and that at the *easier* prefill M=512 tile. It is a **reach, not a lock**, gated by the P2 - occupancy kill-gate and contingent on track A. **GO (conditional).** - -- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~76% of vLLM (618 vs 811)** - even with a perfect weight-read-floor grouped GEMM, because (1) the MoE floor is the hardest - grouped-GEMM regime (M≈4/expert, vLLM ships purpose-built Marlin-NvFp4) and (2) ~24% of the step is - non-GEMM outside this track. Worth doing (333 → ~618, a 1.85× and a real win), but it **cannot - deliver 811 alone.** **PARTIAL / NO-GO for parity-from-B.** - -- **The 273 GB/s is not the ceiling — the GB10 FP4-MMA occupancy efficiency is.** Decode M=128 is a - *different* regime from the dead W4A16 path: bandwidth/occupancy-bound (saturate LPDDR5x at a thin - M-tile via resident CTAs), not compute-throughput-bound (pack MMAs). The existing path is already at - the BW floor at batch 1 (88 ms), so the work is **keeping it bandwidth-bound as M grows to 128** - (occupancy via `mmq_y`↓ + shallow prefetch), a **tune of a working path**, not the greenfield - rewrite. The binding risk is whether that occupancy can be bought without tripping the GB10 wall — - which is exactly what the P2 kill-gate measures. - -**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity with -vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the top of the demonstrated efficiency envelope -with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track alone (ceiling ~76%); -MoE parity is a B-plus-non-GEMM program. **Verdict: GO for dense (conditional, B+A, kill-gated), -PARTIAL for MoE.** - ---- - -## 8. 
One-paragraph summary - -The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read -floors 4–6× above vLLM, so **273 GB/s is not the wall** — but llama's FP4-MMA kernel runs at ~3% of -FP4 peak, in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). The path is already -**W4A4** and already **beats vLLM at batch-1 prefill**, so the fix is **tuning the existing -`mul_mat_q`**, not a cutlass rewrite (a proven GB10 dead-end, and broken on sm_121 anyway). The -M-tile asymmetry sets the levers: **dense** is pinned at `mmq_x=128` (one weight read) so its occupancy -win is **`mmq_y`↓ + shallow prefetch** (BW-neutral), while **MoE**'s win is the free per-expert -**`mmq_x`↓** (patch 0015). **Track B (GEMM → BW floor) + track A (fuse act-quant)** plausibly reaches -**90–103% of vLLM dense (391)** — TRUE PARITY on the table for dense, but only at the **top of the -demonstrated GB10 FP4-efficiency envelope (~17–21%)**, with **no margin**, gated by the P2 occupancy -kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor -sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense -(conditional, B+A), PARTIAL for MoE.** - ---- - -## 9. Adversarial review (skeptical staff CUDA engineer, post-W4A16): the parity go / no-go - -Reviewer stance: I lived through the W4A16 GB10 effort that plateaued at ~9-15 TFLOP/s (~21% of the -BF16 ceiling) after multi-week work and was STOPPED at the occupancy wall. I read this scope and the -grounding (`QWEN36_NVFP4_BENCH`, `VLLM_DECODE_GROUNDING`, `GDN_DECODE_VERIFY`, `DECODE_GAP_STUDY`, -`BLACKWELL_KERNEL_GAPS`, `W4A16_MARLIN_KERNEL_PLAN`) and stress-tested the verdict against them. Net: -the plan is **directionally right and tractably scoped**, the kernel-approach decision (tune, do not -rewrite) is correct, but the **"GO for dense, TRUE PARITY 96-103%" headline outruns its own caveats**. 
-The honest landing is **dense ~80-90% (parity is the optimistic tail), MoE ~55-65% (parity not -reachable from B)**. The decision to commit to B is nonetheless sound, for a reason the doc under-sells -(low regret), and there is **one technical gap (TMA) and one sequencing error (A last) that must be -fixed**. - -### 9.1 Is this the W4A16 wall again? No - and the batch-scaling signature proves why - -The decisive evidence the doc has but does not fully exploit is the **npl-sweep** (`QWEN36_NVFP4_BENCH`): -dense llama-as-%-of-vLLM = **99 / 56 / 46 / 41** at npl 8 / 32 / 64 / 128. At **npl8 the kernels are at -parity** (99%); the gap **opens monotonically as M grows**. Decompose this: - -- At M=8 the dense GEMM is weight-read-bound at the floor (~88 ms, same as batch-1). llama == vLLM there, - so **llama's FP4-MMA kernel demonstrably HITS the weight-read floor at small M.** This is the existence - proof the W4A16 path never had: it is a *working, floor-reaching* FP4-MMA kernel, not a greenfield - build stuck at 1/4 of MMQ. -- At M=128 vLLM's GEMM **stays at ~88 ms** (flat: it amortizes the one weight read over 128 tokens and - hides the MMA behind the load), while **llama's balloons to 471 ms** (5.4x). llama **falls off the - floor** as M grows; vLLM **holds it**. - -So the problem is **not** "build a fast 4-bit GEMM from scratch on an occupancy-hostile part" (the dead -W4A16 problem). It is **"keep a working FP4-MMA kernel on the bandwidth floor as the M-tile grows from 8 -to 128"** - a tune of a working path. **Verdict: this is NOT the W4A16 wall** (different regime, working -path, dual existence proof at M=8 and from vLLM at M=128). **But it shares W4A16's one binding -constraint:** holding the floor as M grows requires hiding LPDDR5x weight-load latency at the larger -tile, which is the same occupancy / latency-hiding game GB10 historically loses. 
The doc is right that -it is a different and more tractable regime; it under-states that the *binding risk is identical*. - -### 9.2 Why is vLLM 2.4x faster if both share 273 GB/s? Compute-side scheduling, and the gap is ~82% (not 100%) GEMM - -The load-bearing question, settled by 9.1: at M=128 the gap is **not** that vLLM beats the shared -bandwidth floor - it is that **llama falls off the floor into self-inflicted compute/occupancy-bound -territory while vLLM stays on it.** The lever is therefore latency-hiding at the M=128 tile -(compute-side scheduling: occupancy, prefetch, tile shape), with the 273 GB/s weight-read floor as the -hard target both engines share. This confirms the doc's roofline and its central claim that the kernel, -not the hardware, is the limiter. - -**But the doc's "the entire 2.42x dense gap is the GEMM" is an ~82% truth, not a 100% one.** Decompose -the dense step (numbers from the doc's own inputs): - -``` -llama step @npl128 795 ms (decode_agg 161) -vLLM step @npl128 328 ms (decode_agg 391) -total gap 467 ms - -llama GEMM 471 ms -vLLM GEMM (at the floor) ~66-88 ms (66 @273 GB/s spec, 88 @216 GB/s achieved) -=> GEMM gap 383-405 ms = 82-87% of the 467 ms total gap -=> non-GEMM gap 62-84 ms = 13-18% of the total gap -``` - -So **B alone (GEMM -> floor) caps near ~80-84%** (step 412-390 ms = 311-328 t/s), **not parity.** Parity -needs the non-GEMM 62-84 ms too: ~65 ms of it is track A's act-quant bucket, the residual ~0-19 ms is -elementwise + host outside both A and B. This is the crux of the sequencing answer (9.6): **B is -necessary but on its own lands ~80%; it is track A that tips dense over the parity line, not B.** The -parity story is *entirely* contingent on A, which the P3 framing buries. - -### 9.3 The sharpest risk the doc misses: vLLM's existence proof uses the technique the doc forbids (TMA) - -vLLM holds the M=128 floor with **cutlass SM120 = TMA + a warp-specialized deep async producer/consumer -pipeline** (Research 1). 
That deep pipeline is **exactly what the doc forbids on GB10** (rule 4.5: "do -not add deep cp.async stages ... they collapsed W4A16"). So **B's chosen GB10-friendly route (`mmq_y`-down -occupancy + a shallow 2-stage prefetch) is a different bet from the one that produced the existence -proof.** Reaching the same floor by a friendlier route is plausible but **unproven**, and if the -occupancy-only route plateaus short of the floor, B underperforms its target with no fallback in scope. - -The doc conflates two different things under "deep pipeline": -- **manual `cp.async` + XOR-swizzle** - register/shared-hungry, **collapsed W4A16 occupancy on GB10** - (correctly banned). -- **TMA (tensor-memory-accelerator) bulk async copy** - a single descriptor drives the copy, **far lower - register/occupancy cost**, and it is precisely how cutlass gets pipeline depth **without** the - occupancy hit (Research 1 says this explicitly). TMA is available on sm_120/121. - -**Recommendation (binding):** B must put a **TMA-driven weight feed in scope as a first-class P2 option**, -not categorically forbid pipeline depth. The occupancy-only route is the right *first* experiment -(cheapest, respects the W4A16 lesson), but if P2 plateaus below the floor, **TMA is the demonstrated way -to get depth without the occupancy collapse** and is what the vLLM existence proof actually uses. -Declaring the floor "unreachable" without trying TMA would repeat the W4A16 mistake in reverse: -abandoning the path that works because the *manual* version of it failed. - -### 9.4 Tractability: bounded tune, confirmed - with the TMA caveat - -The proposed changes are genuinely **bounded and build-ready**, not a greenfield kernel: -- **MoE arm = DEMONSTRATED tractable.** Patch 0015 already auto-caps `mmq_x` per-expert and is committed - and measured. Tightening to 8-16 + block-pad is the same lever, lower risk. 
This is real, banked - evidence that the "tune `mul_mat_q`" approach works on this exact kernel family. -- **Dense arm = plausibly bounded.** `mmq_y`-down is a warp/fragment remap that touches the - `nwarps x tile_C::I == mmq_y` static_assert coupling, so it is a contained *kernel* edit (not a pure - host switch, as the doc itself notes). The host-only P1 knobs are zero-risk. The **prefetch piece is - where the residual occupancy risk lives** - and per 9.3, TMA belongs here. -- **Rejecting (B) cutlass-rewrite and (C) BF16-Marlin-descent is correct.** Cutlass grouped FP4 is broken - on sm_121 (the reason vLLM itself falls to Marlin for MoE); BF16 Marlin concedes GB10's 2x FP4 edge. - -**Verdict: tractable, not greenfield.** The MoE arm is proven; the dense arm is a contained edit with a -real but bounded occupancy risk, gated by the P2 kill-gate. The one scope gap is TMA (9.3). - -### 9.5 Honest expected outcome (the numbers I would defend) - -| | B alone | B + A (median) | B + A (optimistic, spec BW) | parity? | -|---|---:|---:|---:|---| -| **DENSE** (target 391) | ~80-84% (311-328 t/s) | **~92-95% (360-372 t/s)** | ~101% (394 t/s) | **optimistic tail only** | -| **MoE** (target 811) | ~53-61% (431-498 t/s) | **~70-76% (570-618 t/s)** | 76% (618 t/s, CEILING) | **no** | - -Reconciliation with the doc: the doc's B+A = "96-103%" uses the **spec-BW (66 ms floor)** end. At the -**achieved 216 GB/s (88 ms floor)** the same arithmetic gives **~94%**, and that still assumes B hits the -floor. So the honest dense median is **~92-95%, with TRUE PARITY as the upside, not the expectation**, -contingent on a conjunction of three things: (a) P2 clears the occupancy kill-gate to the floor, (b) the -GB10-friendly *or* TMA feed actually reaches the cutlass floor (9.3), and (c) track A lands. Three ANDs = -tail, not median. - -**The low-regret point the doc under-sells (and the real reason to commit):** even the *kill-gate-tripped* -outcome is a large win. 
At the doc's own 15%-FP4-eff kill threshold (GEMM ~110 ms), B+A still lands -**~89%** (step 369 ms); at a merely-partial occupancy win (eff 3% -> 5%, GEMM ~276 ms) B+A still lands -**~61%**. Since the M=8 parity proof guarantees the floor is reachable in principle and patch 0015 proves -the tune works, **getting *some* improvement at M=128 is high-probability; the only open question is how -close to the floor.** So the outcome distribution is heavily positive (very likely 60-90%, possibly -parity) with a bounded downside - B is **low-regret**, which matters more for the go decision than whether -the parity tail hits. - -### 9.6 Sequencing vs track A: land A FIRST (the doc has this backwards) - -The doc runs A as a parallel track merging at **P3 (last)**. That is backwards for de-risking, for three -reasons: -1. **A defines B's interface.** B's "prequantized-MMQ consumer" consumes A's fused `block_fp4_mmq` - producer (the frozen struct in 4.4). Building B against a not-yet-landed producer means B's consumer - seam is speculative until P3. -2. **A defines B's baseline and the kill-gate threshold.** A alone (act-fuse, folding the 65 ms /8.2% - bucket, plus any of the elementwise/host it captures) plausibly moves dense **41% -> ~50-55%** before - B touches a kernel. B's *true residual is the GEMM after A removed the act round-trip*, not the raw - 59%. Running B's P2 against the stock 41% baseline mis-sizes the required GEMM speedup and the - <15%-eff kill-gate. -3. **A is lower-risk and independently shippable.** It is the safe win; it should not wait behind the - risky kernel tune. - -**Recommendation:** land A (tasks 38-41) first, **re-measure** the decode_agg and the GEMM share -post-A, **then** run B's P2 and recompute the kill-gate against the post-A number. This makes the -make-or-break decision cheaper, better-informed, and bankable-either-way. 
- -### 9.7 Verdict (go / no-go) - -- **DENSE: CONDITIONAL GO - commit to B, but scope and message it as "close most of the GEMM gap" - (expected ~80-90%, parity the upside), NOT "true parity."** Justified because: the approach is - bounded/tractable (9.4), it is a working-path tune with a dual existence proof (9.1), and the outcome - is low-regret (9.5) - even a tripped kill-gate roughly doubles today's 41%. Conditions: (i) **land A - first** (9.6); (ii) **gate hard at P2** (eff < 15% -> stop chasing parity, but keep the partial win); - (iii) **put TMA in scope** as the floor-reaching fallback before declaring the floor unreachable (9.3). - -- **MoE: NO-GO for parity from B (confirmed).** The doc's ~76% ceiling is honest, arguably optimistic - (it assumes the ragged M~4/expert grouped GEMM hits its 80 ms floor, the hardest regime, where vLLM - ships purpose-built Marlin). Realistic B+A landing **~70-76%**, B alone ~55-61%. Still worth doing - - the `mmq_x`-down / block-pad work is cheap and partly landed (patch 0015) - but it must be sold as a - **1.7-1.85x win, not parity**; MoE parity is a **B-plus-non-GEMM** program (elementwise fusion, host - CUDA-graph, GDN bf16 state). - -- **One line for the parent:** GB10 can plausibly reach **dense** decode parity with vLLM only at the - **top of its FP4 envelope and only as B + A together** (B alone caps ~80%; A is what tips it over), - and **cannot** reach **MoE** parity from the GEMM track alone (ceiling ~76%). **Commit to B** as a - high-value, low-regret, bounded GEMM-gap-closing tune (honest expected landing **dense ~80-90%, MoE - ~55-65%**), **sequence track A first**, **gate at P2**, and **add a TMA weight-feed option** so the - occupancy-only route is not the only shot at the floor that vLLM's TMA pipeline demonstrably reaches. 
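
The step arithmetic behind the tables in section 6 and the kill-gate figure in 9.5 can be re-derived with a short script. This is a minimal sketch, not code from the tree: the bucket values are the per-step model quoted in section 6, while `step_ms`, `decode_agg`, and the dictionary keys are illustrative names. It assumes, as the tables do, that the non-GEMM buckets stay fixed when the GEMM bucket changes and that track A removes the whole act-quant bucket.

```python
from typing import Dict, Optional

NPL = 128          # parallel sequences; one token per sequence per decode step
VLLM_DENSE = 391   # vLLM dense decode_agg target (tok/s), from section 6
VLLM_MOE = 811     # vLLM MoE decode_agg target (tok/s), from section 6

# Per-step buckets in ms @npl128, as listed in section 6.
# Note: the MoE buckets as listed sum to 385 ms, while the text quotes a 384 ms step.
DENSE = {"gemm": 471, "act": 65, "gdn": 83, "attn": 14, "rest": 162}
MOE = {"gemm": 227, "act": 31, "gdn": 38, "attn": 8, "rest": 81}


def step_ms(buckets: Dict[str, float],
            gemm_ms: Optional[float] = None,
            fuse_act: bool = False) -> float:
    """Step time after optionally replacing the GEMM bucket (track B)
    and dropping the act-quant bucket (track A)."""
    b = dict(buckets)
    if gemm_ms is not None:
        b["gemm"] = gemm_ms
    if fuse_act:
        b["act"] = 0
    return sum(b.values())


def decode_agg(step: float) -> float:
    """Aggregate decode throughput in tok/s for a step time in ms."""
    return NPL * 1000.0 / step


if __name__ == "__main__":
    scenarios = [
        ("dense P0 baseline", DENSE, None, False, VLLM_DENSE),
        ("dense P2 (GEMM 66 ms)", DENSE, 66, False, VLLM_DENSE),
        ("dense P2 (GEMM 81 ms)", DENSE, 81, False, VLLM_DENSE),
        ("dense P3 (GEMM 66 ms + A)", DENSE, 66, True, VLLM_DENSE),
        ("dense kill-gate (110 ms + A)", DENSE, 110, True, VLLM_DENSE),
        ("MoE P3 (GEMM 80 ms + A)", MOE, 80, True, VLLM_MOE),
    ]
    for name, buckets, gemm, fuse, target in scenarios:
        s = step_ms(buckets, gemm, fuse)
        t = decode_agg(s)
        print(f"{name:<30} step {s:>4} ms  {t:6.1f} tok/s  {100 * t / target:5.1f}% of vLLM")
```

Swapping in post-A measurements for the bucket values is the intended use: the same two functions give the GEMM time a kill-gate threshold implies for any re-measured baseline.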
diff --git a/backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md deleted file mode 100644 index 27bf10829..000000000 --- a/backend/cpp/llama-cpp/patches/paged/FUSED_OP_BACKEND_GATE_RESULTS.md +++ /dev/null @@ -1,96 +0,0 @@ -# Patch 0030 - fused-op backend gate (audit RISKY-1 fix) - RESULTS - -Closes the single latent silent-miscompute hazard from `ARCH_GENERALITY_AUDIT.md` -(RISKY-1): the fused GDN / discriminated-SSM_CONV decode ops are CUDA+CPU-only but -were emitted DEFAULT-ON with no backend guard. - -## The hazard - -- `cparams.fused_gdn_ar = fused_gdn_ch = auto_fgdn = true` are set unconditionally - in the `llama_context` constructor (`src/llama-context.cpp`). -- Patches 0018/0019/0026 add `ggml_gated_delta_net_inplace[_ids][_hybrid]` - (reuse `GGML_OP_GATED_DELTA_NET` with extra src slots). -- Patches 0021/0028 add `ggml_ssm_conv_update_inplace[_ids]` which **reuse - `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]`/`src[4]`** (ring/ids). -- Both families have CUDA + CPU kernels only. No `supports_op` change was made for - the discriminated variants. -- A backend that supports **plain** `SSM_CONV` but ignores the discriminator - (Vulkan/SYCL/Metal) returns `supports_op==true` for the node; the scheduler - assigns the discriminated conv to it; it runs the **wrong plain conv** => - SILENT corruption (not a crash). -- The upstream `auto_fgdn` resolution only inspects `GATED_DELTA_NET` nodes, so the - discriminated-`SSM_CONV` safety was only **incidentally** covered (GDN-op and - discriminated-conv happened to share backend coverage). It goes live the moment a - non-CUDA paged build of a gated-DeltaNet model exists. - -## The fix (emission gate, not supports_op) - -Chosen route: **gate the emission on the active compute backend type.** The -`supports_op` route would require editing every other backend's per-device -`supports_op` (Vulkan/SYCL/Metal/...) 
to reject the discriminated `SSM_CONV` - -invasive, fragile, and not centrally exposed by the ggml backend interface. The -emission gate is self-contained in the fork's own code. - -`src/llama-context.cpp`, in `llama_context::sched_reserve()`, immediately before -the existing `if (cparams.auto_fgdn)` resolution block: if any **non-CPU** compute -backend has a reg name other than `"CUDA"` / `"ROCm"` (HIP) / `"MUSA"` (the three -`GGML_CUDA_NAME` values - all the same hipified ggml-cuda TU that carries the -discriminated-op handling), force -`fused_gdn_ar = fused_gdn_ch = auto_fgdn = false`. - -Every emission site keys off these flags: -`conv_decode_fused = (n_seq_tokens==1) && (n_rs_seq==0) && fused_gdn_ar` -(qwen35/qwen35moe/qwen3next + `build_conv_state_fused`) and -`fused = (n_seq_tokens==1) ? fused_gdn_ar : fused_gdn_ch` (delta-net-base). With -the flags false the graph takes the upstream non-fused branch: a **plain -`ggml_ssm_conv` (no discriminator) + `ggml_silu`**, which every backend handles -correctly. - -## CUDA byte-identical invariant - -On a CUDA backend the reg name is `"CUDA"`, so `fgdn_backend_ok` stays true, the -flags are left untouched, and the emitted decode graph is unchanged. The fix only -changes behavior on non-CUDA/non-CPU backends. CUDA decode graph is byte-identical -to pre-0030 **by construction** (no flag flips on CUDA), so the existing greedy -md5 gates are unaffected on the validated GB10 target. - -## Verification - -- COMPILE (GPU-free, done on a CPU box): reconstructed the exact source state - (upstream pin `9d5d882d` + paged patches `0001-0029`, .md docs stripped) and - applied 0030. CPU-only build (`-DGGML_CUDA=OFF`) of `llama` + `test-backend-ops` - links `libllama.so` and the test binary with **0 errors**; the edited - `llama-context.cpp` compiles clean (uses only the already-included `` - and the backend-reg API already used in this TU: - `ggml_backend_dev_backend_reg` / `ggml_backend_reg_name` / - `ggml_backend_dev_type`). 
-- 0030 applies cleanly on a fresh pin+0001-0029 tree via both `git apply --check` - (Makefile path) and `patch -p1 -N` (prepare.sh path). -- test-backend-ops correctness is a **CUDA0-vs-CPU** comparison; a CPU-only run - skips CPU-vs-CPU by design ("Skipping CPU backend"). The relevant test cases are - registered and will be exercised by the DGX CUDA run: - `test_ssm_conv` / `test_ssm_conv_update` (SSM_CONV_UPDATE) / - `test_ssm_conv_update_ids` (SSM_CONV_UPDATE_IDS) / - `test_gated_delta_net` (+ `_hybrid`). - -## Pending on the DGX (GPU) - -The CUDA-side confirmation could not be run from the CPU box (the DGX cloudflared -tunnel `jp-6.prem.io` was returning `websocket: bad handshake` for the whole -session - origin offline). To run on the DGX `~/llama-paged-dev` (branch `paged`) -once reachable, then commit 0030 there too: - -``` -test-backend-ops test -o SSM_CONV -test-backend-ops test -o SSM_CONV_UPDATE -test-backend-ops test -o SSM_CONV_UPDATE_IDS -test-backend-ops test -o GATED_DELTA_NET # expect: 2/2 backends passed, OK -``` - -Greedy md5 (only if >40GB VRAM free; must equal the established baselines): -`q36-27b-nvfp4 == 5951a5b4d624ce891e22ab5fca9bc439`, -`q36-35b-a3b-nvfp4 == 07db32c2bcb78d17a43ed18bc22705cd`. Since 0030 does not flip -any flag on CUDA, the md5 is unchanged by code-path argument; the run is a -belt-and-suspenders confirmation, not a correctness dependency. 
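
For reference, the emission gate described under "The fix" can be modeled in a few lines. This is a hedged Python model of the decision only, not the C++ that patch 0030 adds to `llama_context::sched_reserve()`: `Backend` and `CParams` are hypothetical stand-ins for the ggml backend device/registry queries and for `cparams`, and only the flag names, the `fgdn_backend_ok` name, and the three accepted reg names come from the text above.

```python
from dataclasses import dataclass
from typing import Iterable

# Reg names the note lists as carrying the discriminated-op handling.
FUSED_OK_REG_NAMES = frozenset({"CUDA", "ROCm", "MUSA"})


@dataclass
class Backend:
    """Hypothetical stand-in for one compute backend seen by the scheduler."""
    reg_name: str
    is_cpu: bool = False


@dataclass
class CParams:
    """Hypothetical stand-in holding only the three flags the gate touches."""
    fused_gdn_ar: bool = True
    fused_gdn_ch: bool = True
    auto_fgdn: bool = True


def fgdn_backend_ok(backends: Iterable[Backend]) -> bool:
    """True when every non-CPU backend is one of the accepted reg names."""
    return all(b.is_cpu or b.reg_name in FUSED_OK_REG_NAMES for b in backends)


def apply_emission_gate(cparams: CParams, backends: Iterable[Backend]) -> CParams:
    """Force the fused flags off if any non-CPU backend falls outside the accepted set.
    On an accepted set the flags are left untouched, so the emitted graph is unchanged."""
    if not fgdn_backend_ok(list(backends)):
        cparams.fused_gdn_ar = False
        cparams.fused_gdn_ch = False
        cparams.auto_fgdn = False
    return cparams


if __name__ == "__main__":
    cuda = apply_emission_gate(CParams(), [Backend("CPU", is_cpu=True), Backend("CUDA")])
    vulkan = apply_emission_gate(CParams(), [Backend("CPU", is_cpu=True), Backend("Vulkan")])
    print("CUDA  :", cuda)
    print("Vulkan:", vulkan)
```

Keeping the predicate separate from the flag mutation mirrors the note's argument: the CUDA path is unchanged because the predicate never fires there, which is what the byte-identical claim rests on.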
- -Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/backend/cpp/llama-cpp/patches/paged/FUTURE_LEVERS.md b/backend/cpp/llama-cpp/patches/paged/FUTURE_LEVERS.md deleted file mode 100644 index e7d4b2ea5..000000000 --- a/backend/cpp/llama-cpp/patches/paged/FUTURE_LEVERS.md +++ /dev/null @@ -1,89 +0,0 @@ -# Decode-Parity: Parked Levers (future exploration) - -**Context.** The bit-exact decode-parity effort shipped patches **0018-0023**: dense decode -38% -> **95% of vLLM** @npl128 on GB10 / DGX Spark (LPDDR5x ~273 GB/s), every patch -**byte-identical to llama's own f32 output** (md5-gated). The gated-DeltaNet recurrence (the -dominant ~50% kernel) now runs at **84.6% of peak BW = past vLLM's 82.4%**, at the DRAM floor. -bf16 SSM state was fully built and **shelved** (real +25-31% lever but fails the f32 KL gate). - -The remaining non-recurrence kernels (FP4 GEMM, attention, lm_head) are at their bit-exact -floor: any knob changes a reduction order vs the f32 reference. So further *bit-exact* decode -gains are marginal; the levers below are the honest pick-up points, ranked by promise. - ---- - -## 1. Hybrid-precision SSM state (the most promising) - -The bf16 build (`BF16_SSM_STATE_RESULTS.md`) proved the throughput lever is large - -recurrence **-49%/call** (dense 3.38 -> 1.73 ms), dense decode ~**490 t/s = 125% of vLLM** (clean -runs), MoE @128 **+24.9%** - but bf16 fails the f32 KL gate (KLD 0.06-0.17 at >=1024 ctx, -~10% argmax flips). The discrimination showed the error is **intrinsic to bf16 over the -long-memory heads** (exp(g) ~ 1, where the per-step decay does not contract the rounding); -short/fast-decaying heads are fine. - -**Lever:** a per-head (or per-channel) precision split - keep the long-memory heads (g near 1) -in f32, store the fast-decaying heads (g well below 1, where rounding contracts) in bf16. Could -capture most of the speedup while passing the KL gate. 
Needs a g-magnitude classifier at graph -build + a mixed-dtype recurrent-state cache. **HIGH promise, moderate effort.** The bf16 kernel -plumbing already exists (DGX `~/llama-paged-dev/BF16_SSM_STATE.diff`); this adds the per-head -dtype selection on top. - -*Note (precision, corrected):* plain bf16 (no split) is a legitimate **opt-in for precision-tolerant -deployments**, but it is *below* vLLM's recurrent precision, NOT equal to it. vLLM keeps the -gated-DeltaNet **temporal state in f32** (proven three ways in `BITEXACT_VS_VLLM.md`; only its tiny -conv state is bf16, and llama keeps even that f32). So bf16 here trades *below-vLLM* precision for -*above-vLLM* throughput. We declined it as the default because both llama's f32 AND vLLM's f32 are a -higher bar - and at equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak BW), -so we do not need bf16 to match vLLM's recurrence. - -## 2. Dense CUDA-graph instability - -The bf16 dense decode was **bimodal** across runs (287 / 336 / 487 / 498 t/s) - a dense-path -CUDA-graph capture/replay instability (good runs hit ~490). The f32 dense path measured stable -(371-376) but the bimodality is a latent fragility worth root-causing; a robust graph capture on -the dense path could stabilize and possibly lift dense decode. **Moderate promise**, diagnostic. - -## 3. Dense rms_norm -> fp4 producer-fold (~1.5-2.5%, parked as flat-risk) - -The last bit-exact bucket (`RMSNORM_FP4_FOLD.md`). Folding the standalone `quantize_mmq_nvfp4` -into the rms_norm+mul producer at the FFN boundary (f32 output dead -> droppable) could recover -~1.5-2.5% dense. Parked because: the Lever-2 precedent measured **flat**, it has the worst -gain/plumbing ratio (3-op `{RMS_NORM,MUL,MUL_MAT(NVFP4)}` graph fusion + a pre-quantized-src1 -GEMM path + scratch-pool / CUDA-graph-lifetime plumbing), and the gain risks being swallowed by -the ~0.3-0.5% bench noise floor. 
Revisit only with the inter-node graph-CSE plumbing built and -proven on a same-build flag toggle (decode_agg lift above noise AND md5 == 0023). **LOW promise.** - -## 4. Datacenter Blackwell (sm_100) - -This effort targeted **consumer** Blackwell sm_12x (sm_120 RTX 50-series, sm_121 GB10). Datacenter -Blackwell (B100/B200/GB200, sm_100 / cc 10.0) has HBM3e (much higher BW) and different MMA -characteristics - the LPDDR5x bandwidth floor that dominates GB10 decode does **not** apply, so the -whole calculus changes (likely compute-bound, not BW-bound; the recurrence would not be the binding -kernel). A separate investigation if datacenter Blackwell becomes a target. - -## 5. Prefill / TTFT scheduler + paged-pool burst degradation (HIGH priority - the weakest benchmark number) - -The final benchmark (`QWEN36_NVFP4_BENCH.md`) exposed TTFT as the clear weak spot vs vLLM. Two distinct -issues: -- **Static decode-first budget tradeoff:** the QoS budget (patches 0013/0016, `LLAMA_MAX_BATCH_TOKENS=512`) - maximizes decode tok/s + memory but throttles burst-prefill, so under a synchronized 128-way burst TTFT - climbs to **903 s dense / 213 s MoE @npl128** vs vLLM's chunked-prefill 6-18 s. A dynamic/adaptive budget - (by concurrency + queue depth), or matching vLLM's chunked-prefill interleave, would rebalance. -- **Paged-pool burst-degradation BUG (concrete, found in the benchmark):** after a high-npl burst, a - server's *subsequent lower-npl* prefill collapses (fresh npl8 = 507 t/s / 6 s TTFT; npl8 after an npl64 - burst = 65 t/s / 64 s). Decode stays robust; only prefill degrades -> root-cause the paged-pool state - that persists across the burst. - -**HIGH promise** for the serving experience: decode (dense 90-117%, MoE 77-83% of vLLM) and memory (1.5-3x -lower) are already strong; TTFT is the one number holding back a clean public win. - -## 6. 
MoE-specific recurrence tuning - -The occupancy retune (0022) was tuned on the dense path; it lifted MoE +8.3% as a side effect. The -MoE path (`MUL_MAT_ID` grouped GEMM + the shared GDN recurrence, expert routing changes the GEMM -shapes) may have MoE-specific occupancy headroom. Worth a MoE-targeted reprofile. - ---- - -*All shelved per the host handover - experiments parked. Pick up from the linked result docs in this -directory.* diff --git a/backend/cpp/llama-cpp/patches/paged/GDN_DECODE_VERIFY.md b/backend/cpp/llama-cpp/patches/paged/GDN_DECODE_VERIFY.md deleted file mode 100644 index 933593cea..000000000 --- a/backend/cpp/llama-cpp/patches/paged/GDN_DECODE_VERIFY.md +++ /dev/null @@ -1,208 +0,0 @@ -# GDN decode verify: is llama.cpp's Gated-Delta-Net decode O(1) or an O(ctx) re-scan? - -Verdict-first, then the evidence. This closes lever 5 of `VLLM_DECODE_GROUNDING.md` ("Verify -llama's GDN/linear-attention decode path"): on the Qwen3.6 hybrid models, is llama re-scanning the -context (O(ctx)) in the linear-attention layers, or keeping vLLM's O(1)-in-context recurrent state? - -Method: GGUF-metadata + source reading on the `paged` dev tree (`~/llama-paged-dev`, build-cuda -sm_121) on `dgx.casa`, plus nsys CUDA-kernel decode traces on `~/bench/q36-27b-nvfp4.gguf` -(GB10 / DGX Spark, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`). Models: -`~/bench/q36-27b-nvfp4.gguf` (dense, arch `qwen35`), `~/bench/q36-35b-a3b-nvfp4.gguf` -(MoE, arch `qwen35moe`). - -## TL;DR verdict - -**llama.cpp's GDN decode is EFFICIENT: it is O(1)-in-context, a single fused CUDA kernel that -reads + updates a fixed-size cached recurrent state, structurally identical to vLLM's -`fused_recurrent_gated_delta_rule`. It is NOT a re-scan, NOT a context-scaling blowup, and NOT a -major contributor to the ~2.4x eager-decode gap.** There is no GDN-specific bottleneck to fix, so -the cheap model-specific lever this probe was hunting for does not exist. 
The 2.4x is the general -kernel work (the FP4 weight GEMM, which dominates the step, plus the O(ctx) full-attention decode -kernel in the minority of full-attention layers), exactly as `VLLM_DECODE_GROUNDING.md` concluded. - -The decisive datum: at matched batch (npl4), pure decode, 4x more context, the GDN kernel time is -**flat** while the full-attention kernel grows ~3.1x: - -| kernel | ctx 1024 | ctx 4096 | ratio | meaning | -|--------|---------:|---------:|------:|---------| -| `gated_delta_net_cuda` (GDN linear-attn) | 10.3 us/launch | 8.0 us/launch | **0.78x (flat)** | **O(1) in ctx** | -| `flash_attn_tile` (full-attn layers) | 27.1 us/launch | 85.0 us/launch | **3.1x** | O(ctx), as expected | -| total ms / decode step | 84.9 | 86.0 | 1.01x | GEMM-bound, ctx-independent | - -Identical decode-step counts in both windows (~190 steps, ~9134 GDN launches), so this is a -per-step like-for-like comparison: the GDN layers do **not** get more expensive as context grows. - -## 1. Architecture (confirmed from GGUF metadata + tensor names) - -Both Qwen3.6 models are hybrid: a `full_attention_interval` of 4 means every 4th layer is standard -full attention and the other 3/4 are Gated-Delta-Net (GDN) linear attention with a recurrent state. - -**Dense Qwen3.6-27B (`general.architecture = qwen35`):** -- `block_count = 64`, `full_attention_interval = 4` -> **16 full-attention layers + 48 GDN layers**. -- Full-attn: `head_count = 24`, `head_count_kv = 4` (GQA), `key_length = value_length = 256`, - rope `freq_base = 1e7`, mrope sections `[11,11,10,0]`. -- GDN/SSM: `ssm.state_size = 128`, `ssm.conv_kernel = 4`, `ssm.group_count = 16`, - `ssm.time_step_rank = 48`, `ssm.inner_size = 6144`. So the recurrent state per GDN layer is - `[S_v=128, S_v=128, H_v=48]` per sequence (`H_v = inner_size/state_size = 6144/128 = 48` value - heads), i.e. a 128x128 state matrix per head, ~3.1 MB (F32) per sequence per layer.
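The dense-model state size quoted above can be checked with a few lines of arithmetic. This is a minimal sketch: the dimensions are the GGUF values listed in the bullets, while the function name `recurrent_state_bytes` is hypothetical and F32 storage (4 bytes per element) is taken from the text.

```python
BYTES_F32 = 4  # the text states the recurrent state is kept in F32


def recurrent_state_bytes(state_size, inner_size, bytes_per_elem=BYTES_F32):
    """Bytes of GDN recurrent state per sequence per layer.

    The state is [S_v, S_v, H_v] with H_v = inner_size / state_size value heads.
    """
    h_v = inner_size // state_size
    return state_size * state_size * h_v * bytes_per_elem


# Dense Qwen3.6-27B: ssm.state_size = 128, ssm.inner_size = 6144
dense_bytes = recurrent_state_bytes(128, 6144)
print(6144 // 128, dense_bytes, dense_bytes / 1e6)  # heads, bytes, decimal MB
```

The result is 3,145,728 bytes, which is the "~3.1 MB" figure in decimal megabytes. It depends only on the model dimensions, not on how many tokens the sequence has seen.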
- -**MoE Qwen3.6-35B-A3B (`general.architecture = qwen35moe`):** -- `block_count = 41`, `full_attention_interval = 4` (~10 full-attn + ~31 GDN layers). -- `head_count = 16`, `head_count_kv = 2`, `key_length = value_length = 256`, - `expert_count = 256`, `expert_used_count = 8`, `expert_feed_forward_length = 512`. -- Same SSM dims: `state_size = 128`, `conv_kernel = 4`, `group_count = 16`, - `inner_size = 4096` -> `H_v = 32` value heads. - -**Tensor names confirm the op split (27B, per-layer dump):** -- GDN layers (e.g. `blk.0.*`): `ssm_alpha`, `ssm_beta`, `ssm_conv1d`, `ssm_a`, `ssm_dt.bias`, - `ssm_norm`, `ssm_out`, plus `attn_qkv` / `attn_gate` (the in/out projections of the linear-attn - block). No `attn_k/v/output`, no per-head q/k norm. -- Full-attn layers (e.g. `blk.3.*`, every 4th): `attn_q`, `attn_k`, `attn_v`, `attn_output`, - `attn_q_norm`, `attn_k_norm`. No `ssm_*`. - -llama loads the GDN layers through the **recurrent memory** (`llama-memory-recurrent`), not the KV -cache: the conv state and the SSM state live in `conv_states_all` / `ssm_states_all` and are read -and written every step. Only the 16/10 full-attention layers use the (paged) KV cache. This is the -SSM-style recurrent path, not standard attention. - -## 2. llama.cpp GDN decode implementation: O(1) recurrent-state update (code-proven) - -Graph build (shared by both models): `src/models/delta-net-base.cpp`, dispatched from -`src/models/qwen35.cpp` and `src/models/qwen35moe.cpp` (the MoE class inherits -`llm_build_delta_net_base` and calls the same `build_recurrent_attn`, qwen35moe.cpp:472). - -**Decode dispatch (`build_delta_net`, delta-net-base.cpp:425-447):** when `n_seq_tokens == 1` -(decode), it takes `build_delta_net_fused` if `cparams.fused_gdn_ar` (the default, see below), else -`build_delta_net_autoregressive`. 
Both are O(1): - -- `build_delta_net_autoregressive` (delta-net-base.cpp:289-371) is the explicit rank-1 recurrence on - the fixed-size state `s` shaped `[S_v, S_v, H_v, n_seqs]`: `s *= exp(g)` (decay), - `sk = sum_rows(s * k)`, `d = (v - sk^T) * beta`, `s += k (x) d^T` (rank-1 update), - `o = sum_rows(s * q)`. **No loop over past tokens, no KV read** - it touches only the state and - the single new token's q/k/v/g/beta. `GGML_ASSERT(n_tokens == 1)`. -- `build_delta_net_fused` (delta-net-base.cpp:373-423) collapses the same recurrence into one op, - `ggml_gated_delta_net(q, k, v, g, b, s, K=1)`. - -**State is cached across steps, not rebuilt (`build_recurrent_attn`, delta-net-base.cpp:527-606):** -the input state `s` is read from `ssm_states_all` via `build_rs`, and the new state is copied back -with `ggml_cpy(new_state, view(ssm_states_all, ... kv_head ...))` (lines 555-558). The causal-conv -state is handled the same way in `build_conv_state` (449-525): the previous `conv_kernel-1 = 3` -samples are read from `conv_states_all`, the new token is appended, and the last 3 are written back. -So both pieces of GDN state persist in the recurrent cache exactly like a KV cache persists tokens - -this is the recurrent analogue, fixed size, independent of context length. - -**Defaults (`src/llama-context.cpp:200-201`):** `cparams.fused_gdn_ar = true` and -`fused_gdn_ch = true`. They are only auto-disabled if the fused op cannot be scheduled on the same -device as the layer (`device_gdn != device_kv`, lines 540-595); on a single GB10 with `-ngl 99` -that does not happen, so the **fused single-kernel path is what runs**. - -**The CUDA kernel (`ggml/src/ggml-cuda/gated_delta_net.cu`) is the crux, and it is unambiguously -O(1) in context:** -- Launch grid `dim3(H, n_seqs, ceil(S_v/4))` and block `(min(warp,S_v), 4, 1)` (lines 184-185): - the grid spans heads x sequences x state-columns. 
**There is no context-length dimension and no - context-length argument anywhere in the kernel signature** (q/k/v/g/beta are the new token(s) - `[S_v, H, n_tokens, n_seqs]`; `curr_state` is the fixed `[S_v, S_v, H, n_seqs]`). -- Each warp loads its shard of the fixed-size state into registers **once** (lines 57-61), then - loops `for (t = 0; t < n_tokens; t++)` (line 63). At decode `n_tokens == 1`, so it is a single - iteration: read the one new token, do the rank-1 update - `s_shard[r] = g * s_shard[r] + k[i] * delta_col` and the readout `attn = S^T q` (lines 84-141), - then write the updated state back (lines 161-167). No second loop, no read of any past KV. -- Work per decode step is therefore proportional to `S_v * S_v * H * n_seqs` (the state size x - batch) and **constant in context length**. This is precisely vLLM's - `fused_recurrent_gated_delta_rule_packed_decode_kernel` (one batched launch updating a - fixed-size `[K,V]` state) cited in the grounding doc. - -A chunked GPU kernel for prefill is a TODO (delta-net-base.cpp:181 `//TODO: Add chunked kernel`); -the chunked CPU/graph path (`build_delta_net_chunking`) only runs for multi-token ubatches -(prefill), never at decode. - -## 3. nsys decode profiling: GDN is a small share and does not scale with context - -Qwen3.6-27B NVFP4, sm_121, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`, `llama-server` driven -to steady decode by a looping completion client. Kernel time bucketed by name (full classifier and -sqlites under `~/bench/gdn_study/`). 
- -**(a) Share at the headline batch (npl128, ctx 1024), GPU 92.7% busy:** - -| bucket | % of busy | us/launch | -|--------|----------:|----------:| -| GEMM_weight (`mul_mat_q`/`mul_mat_vec_q`) | 59.2 | - | -| **GDN_recurrent (`gated_delta_net_cuda`)** | **8.9** | 369 | -| GEMM_act_quant (`quantize_mmq_nvfp4`) | 8.2 | - | -| elementwise / act_glu / norm / rope | ~13.5 | - | -| embed_gather (`get_rows`) | 2.9 | - | -| **ATTENTION_full (`flash_attn`, 16 layers)** | **1.8** | 107 | -| copy_cast (`cpy`) | 1.8 | - | -| **GDN_conv (`ssm_conv`)** | **1.5** | - | - -The whole GDN path (recurrent 8.9% + conv 1.5%) is ~10% of the step; full attention is ~2%; the -**weight GEMM dominates at ~67% (59.2% GEMM + 8.2% act-quant requant)**. This is the dense model, -where the grounding predicted the GEMM would be the lever. - -**(b) Share at low batch (npl32, ctx 1024), weight-bandwidth (GEMV) regime, GPU ~100%:** -GEMM_weight 88.7%, GDN_recurrent 0.8%, ATTENTION_full 0.7%, GDN_conv 0.3%. At low batch the -weight-read GEMV swamps everything and GDN is negligible; the GDN share tracks the batch, not the -context. - -**(c) Context-scaling control (the decisive test): matched batch npl4, pure decode, ctx 1024 vs -4096.** Small batch -> fast prefill -> a clean pure-decode capture (verified: GEMM is the M=1 -`mul_mat_vec_q` decode GEMV, and the client completed decode rounds inside the window). 
Identical -decode-step counts (~190 steps, gated_delta_net launched 9141 vs 9134 times), so per-launch time is -a true per-step comparison: - -| kernel / bucket | ctx 1024 | ctx 4096 | ratio | -|-----------------|---------:|---------:|------:| -| `gated_delta_net_cuda` us/launch | 10.3 | **8.0** | **0.78x (flat)** | -| GDN_recurrent share | 0.6% | 0.4% | flat/down | -| `ssm_conv` (GDN_conv) us/launch | 5.2 | 5.2 | 1.00x | -| `flash_attn_tile` us/launch | 27.1 | **85.0** | **3.14x** | -| ATTENTION_full share | 0.6% | 1.8% | 3.0x up | -| total ms / decode step | 84.9 | 86.0 | 1.01x | - -The GDN kernel time is flat (even a hair faster) across a 4x context increase, while the -full-attention kernel grows ~3x, exactly the O(1)-vs-O(ctx) signature. The total step time barely -moves because at this batch the (context-independent) FP4 weight GEMM is 88% of the step. This is -the empirical confirmation of the code analysis: **llama's GDN decode does not re-scan the context.** - -(An earlier npl32 ctx4096 attempt was discarded: with 32 parallel slots each independently -prefilling ~4100 tokens, the nsys window caught prefill, not steady decode - the `mul_mat_q(M=128)` -+ `flash_attn_ext_f16(ctx4096)` signature gave it away. The npl4 runs above avoid this by keeping -prefill short.) - -## 4. Verdict and fix scope - -**Efficient, not a bottleneck.** llama.cpp runs the Qwen3.6 GDN/linear-attention layers as a fused, -single-CUDA-kernel, O(1)-in-context recurrent-state update, with the conv and SSM state cached in -the recurrent memory across decode steps. It is algorithmically the same as vLLM's O(1) -`fused_recurrent` decode. The probe's worst case (llama re-scanning context => GDN layers ballooning -with context and concurrency) is **falsified**: the GDN kernel is flat across 4x context, and the -op carries no context-length parameter at all. 
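The O(1) claim follows directly from the shape of the recurrence listed in section 2. The following is a minimal single-head sketch in plain Python, written only to illustrate that one decode step touches the fixed-size state and the new token, nothing else. It is not the ggml graph or the fused CUDA kernel: the `[key_dim][value_dim]` index layout and the name `gdn_decode_step` are assumptions for illustration.

```python
import math


def gdn_decode_step(state, q, k, v, g, beta):
    """One gated delta-rule decode step for a single head.

    state: S_v x S_v list of lists, indexed [key_dim][value_dim] (assumed layout).
    q, k, v: length-S_v lists for the one new token; g, beta: scalars.
    Updates `state` in place and returns the output vector.
    """
    n = len(state)
    decay = math.exp(g)
    for i in range(n):  # s *= exp(g)  (decay)
        for j in range(n):
            state[i][j] *= decay
    # sk[j] = sum_i s[i][j] * k[i]
    sk = [sum(state[i][j] * k[i] for i in range(n)) for j in range(n)]
    # d = (v - sk) * beta
    d = [(v[j] - sk[j]) * beta for j in range(n)]
    for i in range(n):  # s += k (x) d^T  (rank-1 update)
        for j in range(n):
            state[i][j] += k[i] * d[j]
    # o[j] = sum_i s[i][j] * q[i]
    return [sum(state[i][j] * q[i] for i in range(n)) for j in range(n)]


# Two consecutive steps on a 2x2 toy state: the state never grows.
toy_state = [[0.0, 0.0], [0.0, 0.0]]
out1 = gdn_decode_step(toy_state, [1.0, 0.0], [1.0, 0.0], [1.0, 2.0], 0.0, 1.0)
out2 = gdn_decode_step(toy_state, [1.0, 0.0], [1.0, 0.0], [1.0, 2.0], 0.0, 1.0)
print(out1, out2, toy_state)
```

Every loop runs over the state dimensions only, so the cost per step is proportional to `S_v * S_v` per head regardless of how many tokens came before, which is the behaviour the nsys traces show.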
- -**So the GDN path is not the cheap model-specific lever.** It is a small-to-moderate, context-flat -share of the step (~0.4-0.8% at low batch, ~10% including conv at batch 128), and removing it would -not dent the 2.4x. The gap is the general kernel work, confirming `VLLM_DECODE_GROUNDING.md`: -1. the **FP4 weight GEMM** is the dominant bucket (~59% GEMM + ~8% `quantize_mmq_nvfp4` requant that - vLLM fuses away via native FP4-MMA / grouped Marlin); this is the biggest, hardest lever. -2. the **full-attention decode kernel** is the O(ctx) residual (the only thing that grows with - context, ~3x per-launch over 4x ctx), in the minority of full-attention layers. - -If anything on the GDN side is ever worth touching, it is a bounded micro-optimization, not a -complexity fix: the kernel is memory-bound on the F32 recurrent state (state read+write is -`S_v^2 * H * batch` = ~0.79 GB/step over 273 GB/s at batch 128, hence the ~8.9% share), and this -traffic is **intrinsic to the architecture - vLLM pays the identical state I/O**, so it is not a -llama-specific inefficiency. A future win could keep the recurrent state in bf16 or fuse the -`ssm_conv` + gated-norm into the delta-net kernel to shave that ~10%, but the ceiling is small and -it does not close the 2.4x. The throughput effort stays where the grounding put it: the FP4 GEMM -(fused act-quant + native FP4-MMA) and the full-attention decode kernel, with a CUDA-graphed -steady-state step as the bounded host-side add-on. - -## Reproduce - -- Metadata: `python3 gguf-py/gguf/scripts/gguf_dump.py --no-tensors ~/bench/q36-27b-nvfp4.gguf`. -- Code: `src/models/delta-net-base.cpp` (build_delta_net 425, autoregressive 289, fused 373, - build_recurrent_attn 527, build_conv_state 449); `src/llama-context.cpp:200-201,540-595` - (fused_gdn defaults/guard); `ggml/src/ggml-cuda/gated_delta_net.cu` (kernel 4-168, launch grid - 184-185, dispatch 226-312). -- Profiles: `~/bench/gdn_study/drv.sh