mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
docs(paged): consolidate the dev-trail docs into one canonical README
The paged-attention patch directory had accumulated ~55 scattered dev docs (results, progress, scope, lever, and gap-analysis notes). Consolidate the durable content of all of them into one canonical backend/cpp/llama-cpp/patches/paged/README.md covering: what the patchset is, the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM decode path, NVFP4 FP4-MMA, the decode-first scheduler), the full 0001-0030 patch series table with bit-exact status, the GB10 benchmarks (patched-vs-stock-vs-vLLM + the Apple M4 architectural note), the dev notes (bit-exact methodology, the per-path gate, the MoE-parity conclusion, the rejected/flat levers, the opt-in bf16-SSM mode), arch+quant generality, the pin + canary maintenance policy, and the published NVFP4 gallery models.

Delete the consolidated-away dev trail. Keep the three operational docs the README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md (per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the ship-as-own-backend design-of-record), plus the benchmark plots + csv. The .patch files and the unit/bench .cpp are untouched.

Repoint every external reference to a deleted doc at the new README: grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a patch-internal path matcher, not a repo-doc link) is left intact.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2
.github/scripts/paged-canary-apply.sh
vendored
@@ -28,7 +28,7 @@
|
||||
# build on 0019's code, the rejection cascades to them too. This is a
|
||||
# PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
|
||||
# upstream break (see backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
|
||||
-# and PIN_BUMP_APPLY_CHECK.md). We exclude ONLY that dev-doc path and still
+# and README.md). We exclude ONLY that dev-doc path and still
|
||||
# apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
|
||||
# still fails the canary. prepare.sh tolerates the same hunk via
|
||||
# `patch ... || true`; this mirrors that tolerance precisely.
|
||||
|
||||
@@ -850,7 +850,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
|
||||
// common_context_params_to_llama (patch 0026) only when the --ssm-bf16-tau CLI flag is unset.
|
||||
// Unset / non-positive => env untouched, so stock stays byte-identical and bit-exact (an
|
||||
// externally exported LLAMA_SSM_BF16_TAU still works as an escape hatch). NOTE: this mode is
|
||||
-// NOT bit-exact (~91% same-top-p ceiling); see patches/paged/A_HYBRID_SSM_RESULTS.md.
+// NOT bit-exact (~91% same-top-p ceiling); see patches/paged/README.md (Dev notes).
|
||||
} else if (!strcmp(optname, "ssm_bf16_tau") || !strcmp(optname, "ssm_hybrid_tau")) {
|
||||
if (optval != NULL) {
|
||||
try {
|
||||
|
||||
@@ -57,7 +57,7 @@ All variants (avx/avx2/avx512/cuda/…) copy the patched `llama.cpp/` tree, so t
|
||||
- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
|
||||
**token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
|
||||
- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
|
||||
-(`ADDITIVE_DESIGN.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
+(see `paged/README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
|
||||
subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
|
||||
`llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
|
||||
`llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
|
||||
|
||||
@@ -1,102 +0,0 @@
|
||||
# Patch 0003 — paged gather-read: exact implementation plan
|
||||
|
||||
**Goal:** a sequence attends only its own (compacted) cells via `ggml_get_rows`, instead of the scattered
|
||||
`[0,n_kv)` window. Token-identical (attention is permutation-invariant over the KV set). **Gated**: stock
|
||||
path stays byte-identical (no new ops unless `LLAMA_KV_PAGED`).
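
The token-identical claim rests on a simple property of masked softmax: a cell whose score is `-inf` contributes a weight of exactly zero, so dropping it does not change the result. A minimal scalar sketch of that property follows. It is not the ggml kernel; `attend` and `gather` are hypothetical helpers, and values are scalars instead of head vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Single-query softmax attention over scalar values (illustrative only).
// scores[i] is the q.k_i logit plus the mask term (-inf for a masked cell).
// Precondition: at least one score is finite.
double attend(const std::vector<double> & scores, const std::vector<double> & values) {
    assert(scores.size() == values.size());
    double m = -std::numeric_limits<double>::infinity();
    for (double s : scores) m = std::max(m, s);
    double num = 0.0;
    double den = 0.0;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        const double w = std::exp(scores[i] - m); // exp(-inf) == 0 for masked cells
        num += w * values[i];
        den += w;
    }
    return num / den;
}

// Keep only the entries listed in idx, preserving their order (the "gather").
std::vector<double> gather(const std::vector<double> & src, const std::vector<int> & idx) {
    std::vector<double> out;
    out.reserve(idx.size());
    for (int i : idx) out.push_back(src[static_cast<std::size_t>(i)]);
    return out;
}
```

Gathering scores and values with the same index list keeps them aligned, which is the scalar analogue of gathering K, V and the mask by one shared index.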
|
||||
|
||||
**Base:** applies on top of 0001+0002 at the pin. Dev tree: `backend/cpp/llama-cpp-paged-dev` (branch `paged`).
|
||||
|
||||
## Design
|
||||
|
||||
The gather is keyed off one runtime index list (the sequence's used cells, in a fixed order), exposed as a
|
||||
graph input (mirroring `k_idxs`). In `build_attn`, gather K, V **and the kq_mask** by that same index, so all
|
||||
three stay aligned. `n_gathered` replaces `n_kv` for the attention. Only active when the cache is in paged
|
||||
mode (a new `is_paged()` flag set when `LLAMA_KV_PAGED`/find_slot used permuted placement).
|
||||
|
||||
ggml note: `ggml_get_rows(a,b)` gathers `a`'s **ne1** by `b` (I32). Raw K is `[n_embd_k_gqa, kv_size, n_stream]`
|
||||
→ ne1 = cells → direct. The mask is `[n_kv, n_tokens, 1, n_stream]` → n_kv is **ne0**, so gather as
|
||||
`transpose → get_rows → transpose`.
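
The `transpose → get_rows → transpose` trick can be checked on a plain row-major matrix. The sketch below is an illustration only: `Mat`, `transpose`, `get_rows` and `gather_cols` are hypothetical stand-ins, not ggml calls, with `cols` playing the role of the fastest dimension (ne0 = n_kv) and `rows` the role of ne1 (n_tokens).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix; cols is the fastest-varying dimension.
struct Mat {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> d;
    float at(std::size_t r, std::size_t c) const { return d[r * cols + c]; }
};

Mat transpose(const Mat & m) {
    Mat t{m.cols, m.rows, std::vector<float>(m.d.size())};
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            t.d[c * t.cols + r] = m.at(r, c);
    return t;
}

// Select whole rows by index, in index order.
Mat get_rows(const Mat & m, const std::vector<int> & idx) {
    Mat out{idx.size(), m.cols, {}};
    out.d.reserve(idx.size() * m.cols);
    for (int i : idx) {
        assert(i >= 0 && static_cast<std::size_t>(i) < m.rows);
        for (std::size_t c = 0; c < m.cols; ++c)
            out.d.push_back(m.at(static_cast<std::size_t>(i), c));
    }
    return out;
}

// Column gather expressed only through a row gather.
Mat gather_cols(const Mat & m, const std::vector<int> & idx) {
    return transpose(get_rows(transpose(m), idx));
}
```

For a mask of 2 tokens by 4 cells and indices `{0, 2, 3}`, `gather_cols` returns a 2 by 3 matrix holding cells 0, 2 and 3 of each token row.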
|
||||
|
||||
### KEY CORRECTIONS (found while implementing — these change the edits)
|
||||
|
||||
1. **Gather index = ALL used (non-empty) cells in `[0,n_kv)`, NOT `sinfo.idxs`.** `sinfo.idxs` is only the
|
||||
*current ubatch's write slots*; attention reads the *full history*. The query set per token is masked by
|
||||
`kq_mask`, so gathering the union of all used cells + gathering the mask the same way is token-identical
|
||||
and drops exactly the empty (already-masked) cells. So: `gather = { i in [0,n_kv) : !cells.is_empty(i) }`.
|
||||
|
||||
2. **Static-graph size is fine because llama.cpp rebuilds the graph every ubatch.** `n_gather` (used-cell
|
||||
count) is therefore a build-time constant for that ubatch — `build_input_gather_idxs` sizes the I32
|
||||
tensor to `get_n_gather()` computed at build, `set_input_gather_idxs` fills the identical cell list. They
|
||||
MUST use the same loop (`for i in [0,n_kv): if !is_empty(i) push i`) so build-order == fill-order.
|
||||
|
||||
3. **K/V gather can live entirely in `build_attn`, no cache get_k change.** The `get_k` 4d view is contiguous
|
||||
in `[ne0,ne1,ne2]` from cell 0 (nb2 == n_embd_head*n_head_kv*elemsz), so for **single stream (ns==1)**:
|
||||
`reshape_3d(k, n_embd_head*n_head_kv, n_kv, 1) → get_rows(., gi) → reshape_4d(., n_embd_head, n_head_kv, n_gather, 1)`.
|
||||
Multi-stream (ns>1) breaks contiguity (nb3 uses kv_size) → gate to ns==1 first, multi-stream follow-up.
|
||||
|
||||
4. So the ONLY cache additions are `is_paged()`, `get_n_gather(n_kv)`, `build/set_input_gather_idxs(n_kv)`;
|
||||
everything else (K/V/mask gather) is in `build_attn`. `set_input_kq_mask` is **unchanged** (built over
|
||||
n_kv, then gathered). Smaller than the 7-edit estimate above.
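
Corrections 1 and 2 together mean the build-time count and the fill-time index list must come from one traversal. A minimal sketch of that invariant, assuming a hypothetical `Cells` occupancy type (not the llama.cpp cell structure) and hypothetical free functions named after the accessors above:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for per-cell occupancy; not the llama.cpp type.
struct Cells {
    std::vector<bool> empty;
    bool is_empty(uint32_t i) const { return empty[i]; }
};

// The single shared loop: visit every used cell in [0, n_kv) in ascending order.
template <typename F>
void for_each_used(const Cells & cells, uint32_t n_kv, F && f) {
    assert(n_kv <= cells.empty.size());
    for (uint32_t i = 0; i < n_kv; ++i) {
        if (!cells.is_empty(i)) f(i);
    }
}

// Build-time size of the I32 index tensor.
uint32_t get_n_gather(const Cells & cells, uint32_t n_kv) {
    uint32_t n = 0;
    for_each_used(cells, n_kv, [&](uint32_t) { ++n; });
    return n;
}

// Fill-time contents; same traversal, so size and order always match the build.
std::vector<int32_t> fill_gather_idxs(const Cells & cells, uint32_t n_kv) {
    std::vector<int32_t> idx;
    idx.reserve(get_n_gather(cells, n_kv));
    for_each_used(cells, n_kv, [&](uint32_t i) { idx.push_back(static_cast<int32_t>(i)); });
    return idx;
}
```

Routing both functions through `for_each_used` makes "build-order == fill-order" true by construction instead of by convention.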
|
||||
|
||||
## Edits
|
||||
|
||||
### 1. `src/llama-kv-cache.h` — declare gather infra (in `llama_kv_cache`)
|
||||
```cpp
bool is_paged() const { return paged_active; }  // near get_size()
ggml_tensor * build_input_gather_idxs(ggml_context * ctx, const slot_info & sinfo) const;
void set_input_gather_idxs (ggml_tensor * dst, const slot_info & sinfo) const;
uint32_t get_n_gather(const slot_info & sinfo) const;  // == sum of used cells gathered
```
|
||||
Add member `mutable bool paged_active = false;` and in `llama_kv_cache_context` forward the three (like
|
||||
`build_input_k_idxs`/`get_n_kv`).
|
||||
|
||||
### 2. `src/llama-kv-cache.cpp`
|
||||
- In `find_slot`, in the paged branch (0002), set `paged_active = true;` on success.
|
||||
- `get_n_gather(sinfo)` = `sinfo.idxs[0].size()` summed over streams (the count actually placed).
|
||||
- `build_input_gather_idxs`: `ggml_new_tensor_1d(ctx, GGML_TYPE_I32, get_n_gather(sinfo)); ggml_set_input(...)`.
|
||||
- `set_input_gather_idxs`: fill `data[k++] = strm_off + sinfo.idxs[s][i]` for every placed cell (same order
|
||||
the mask/k/v will see). This is the canonical gather order.
|
||||
|
||||
### 3. `src/llama-graph.h` — `llm_graph_input_attn_kv`
|
||||
Add `ggml_tensor * gather_idxs = nullptr;` + `ggml_tensor * get_gather_idxs() const { return gather_idxs; }`.
|
||||
|
||||
### 4. `src/llama-graph.cpp`
|
||||
- `llm_graph_input_attn_kv::set_input`: if `mctx->is_paged()` → `mctx->set_input_gather_idxs(gather_idxs, ...)`.
|
||||
- `build_attn_inp_kv` (creates the input): if `mctx_cur->is_paged()` → `inp->gather_idxs =
|
||||
mctx_cur->build_input_gather_idxs(ctx0, ...)`.
|
||||
- `build_attn` (the kv overload, ~2356): after `k`,`v`,`kq_mask`:
|
||||
```cpp
if (ggml_tensor * gi = inp->get_gather_idxs()) {
    k = ggml_get_rows(ctx0, k, gi);                                    // [d, n_gather, ...] (reshape view ok)
    v = v_trans ? /* gather columns */ : ggml_get_rows(ctx0, v, gi);
    ggml_tensor * m = ggml_cont(ctx0, ggml_transpose(ctx0, kq_mask)); // [n_tokens, n_kv]
    m = ggml_get_rows(ctx0, m, gi);                                    // [n_tokens, n_gather]
    kq_mask = ggml_cont(ctx0, ggml_transpose(ctx0, m));               // [n_gather, n_tokens]
}
ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
```
|
||||
Note: `get_k` returns the reshaped 4d view; gather must run on a cell-major shape. Simplest: add a paged
|
||||
variant `get_k(ctx,il)` that returns `ggml_get_rows` of the **raw** `layers[ikv].k` then reshapes to
|
||||
`[n_embd_head, n_head_kv, n_gather, ns]`. Do the gather in the cache, not the graph, for K/V; keep only the
|
||||
mask gather in the graph. (Cleaner — revisit during impl.)
|
||||
|
||||
### 5. V-transposed path
|
||||
When `!flash_attn`, V is stored transposed `[kv_size, n_embd_v_gqa]`, so gathering its **rows** (ne1 = n_embd)
won't work. Either gather columns via the same idx on the non-transposed store, or force `is_paged()` to
require flash-attn for the first cut (`GGML_ASSERT`) and handle v_trans in a follow-up.
|
||||
|
||||
## Verification (the gate)
|
||||
```sh
cmake --build build-cpu --target llama-simple -j
M=Qwen3-0.6B.Q4_K_M.gguf ; P="<the 0002 prompt>"
build-cpu/bin/llama-simple -m $M -n 64 "$P" > a.txt                   # stock
LLAMA_KV_PAGED=1 build-cpu/bin/llama-simple -m $M -n 64 "$P" > b.txt  # paged gather-read
diff a.txt b.txt   # MUST be identical
```
|
||||
Also assert (debug) that `n_gather < n_kv` on a multi-chunk sequence (proves compaction, not identity).
|
||||
Export only when identical: `git format-patch HEAD~1 -o patches/ --start-number 3 -N`.
|
||||
|
||||
## Risks
|
||||
- Mask transpose/layout: if `b.txt` diverges, dump the gathered mask vs expected for token 0; off-by-order
|
||||
means the `set_input_gather_idxs` order ≠ the get_k gather order — they MUST use the identical loop.
|
||||
- flash-attn vs not: do flash-attn first (simpler mask), then v_trans.
|
||||
@@ -1,347 +0,0 @@
|
||||
# A.2 - CUDA-graphing the paged decode: measured lever + gap diagnosis
|
||||
|
||||
Phase 1 (measure, do not punt). DGX GB10 (sm_121), dev tree `~/llama-paged-dev`
|
||||
HEAD 089f78d (patch 0017), `build-cuda`. Model `q36-27b-nvfp4.gguf` (dense),
|
||||
harness `llama-batched-bench`, fusion held OFF (`LLAMA_FUSE_NVFP4_QUANT=0`) for a
|
||||
clean stock-kernel baseline. `decode_agg` = the `S_TG t/s` column.
|
||||
|
||||
## TL;DR verdict
|
||||
|
||||
CUDA-graphing the paged decode is **NOT a real throughput lever** (ceiling well
|
||||
under 1%). The steady decode step is **GPU-compute-bound: 99.4-99.5% GPU-busy**.
|
||||
Total GPU idle is ~0.5-0.6% of the step, split into within-step launch gaps
|
||||
(0.37%, the only thing CUDA graphs remove) and a between-step host-loop gap
|
||||
(0.24%, one ~2 ms gap per step). Graphs already engage on the default paged
|
||||
decode and do collapse the launch gaps (0.37% -> 0.11%), but the GPU stays
|
||||
99.4-99.5% busy either way, so decode_agg is unchanged. The 2.6x gap to vLLM
|
||||
(148 vs 391) lives in the per-step GPU **kernel work** (FP4 GEMM + attention at
|
||||
batch 128), not in launch overhead or the host loop.
|
||||
|
||||
The premise that "the paged decode runs eager (graphs reused=0)" did not survive
|
||||
measurement: at the benchmarked context the default paged decode captures and
|
||||
replays graphs exactly like stock non-paged. Two measurement traps (below)
|
||||
explain the earlier "reused=0 / gap-bound" reading.
|
||||
|
||||
## Method note: a graph-enable trap that was corrected
|
||||
|
||||
`GGML_CUDA_DISABLE_GRAPHS` is read with `getenv(...) != nullptr`
|
||||
(`ggml/src/ggml-cuda/common.cuh:1234`), so setting it to an **empty** string
|
||||
still disables graphs. A first 4-cell pass that used
|
||||
`GGML_CUDA_DISABLE_GRAPHS=""` for the "graphs ON" cells therefore ran graphs OFF
|
||||
in all four cells (an OFF-vs-OFF comparison). The numbers below ("v2") unset the
|
||||
variable with `env -u` for the ON cells. The `-lv 99` probe is unaffected (it
|
||||
never set the variable).
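
The trap is the difference between a presence test and a value test on an environment variable. A small sketch, assuming a POSIX environment for `setenv`/`unsetenv`; `is_set` mirrors the `getenv(...) != nullptr` pattern described above, while `is_set_nonempty` is a hypothetical stricter check shown only for contrast (it is not what ggml does):

```cpp
#include <cassert>
#include <cstdlib>
#include <stdlib.h> // setenv / unsetenv (POSIX), used when exercising the helpers

// Presence test: true even when the variable is set to the empty string.
bool is_set(const char * name) {
    assert(name != nullptr);
    return std::getenv(name) != nullptr;
}

// Value test (hypothetical alternative): only a non-empty value counts as set.
bool is_set_nonempty(const char * name) {
    assert(name != nullptr);
    const char * v = std::getenv(name);
    return v != nullptr && v[0] != '\0';
}
```

With the variable exported as an empty string, `is_set` still returns true, which is why the "graphs ON" cells have to remove the variable (`env -u`) instead of blanking it.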
|
||||
|
||||
## Step 1 - the 4-cell decode_agg table (corrected, graphs genuinely enabled)
|
||||
|
||||
npp 128, ntg 128, npl 32 and 128, c 40960, b/ub 2048, fa on. `S_TG t/s`:
|
||||
|
||||
| cell             | npl 32  | npl 128 |
|------------------|---------|---------|
| stock_graphon    | 116.47  | 148.41  |
| stock_graphoff   | 115.17  | 148.21  |
| paged_graphon    | 116.21  | 148.60  |
| paged_graphoff   | 114.62  | 147.65  |
|
||||
|
||||
ON vs OFF (the graph win):
|
||||
|
||||
| config | npl 32 | npl 128 |
|--------|--------|---------|
| stock  | +1.13% | +0.13%  |
| paged  | +1.39% | +0.64%  |
|
||||
|
||||
- (a) Does STOCK get a graph win? Essentially no: +0.13% at npl 128, +1.13% at
|
||||
npl 32 (small-batch, where per-kernel launch overhead is relatively larger).
|
||||
All within run-to-run noise (~1% at npl 32, ~0.2% at npl 128).
|
||||
- (b) Does PAGED get a graph win? Same picture: +0.64% / +1.39%. Paged is NOT
|
||||
eager at this config (see Step 2); it captures graphs like stock.
|
||||
- (c) LEVER SIZE (proxy = stock graph win, now genuinely measured): +0.13% at
|
||||
npl 128, +1.1% at npl 32. Negligible vs the 2.6x (=+164%) gap to vLLM.
|
||||
|
||||
All four cells sit at ~148 (npl 128) / ~115 (npl 32) within ~1%. The ~148 wall is
|
||||
shared by stock and paged; it is not paged-specific. Calibration cross-check
|
||||
(paged ON, ntg 64): 147.64, matching the reference 148-149.
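
The ON-vs-OFF percentages follow directly from the 4-cell table as `(on / off - 1) * 100`. A short worked sketch using only the table's numbers; `pct_gain` and `close_to` are hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain of `on` over `off`, in percent.
double pct_gain(double on, double off) {
    assert(off > 0.0);
    return (on / off - 1.0) * 100.0;
}

// Tolerance comparison for rounded table values.
bool close_to(double a, double b, double tol) {
    return std::fabs(a - b) < tol;
}
```

For example, `pct_gain(148.41, 148.21)` is about 0.13 (stock, npl 128) and `pct_gain(116.21, 114.62)` is about 1.39 (paged, npl 32), while the gap to vLLM, `pct_gain(391, 148)`, is about 164.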
|
||||
|
||||
## Step 2 - why the "eager" premise is wrong, and what actually mutates
|
||||
|
||||
CUDA-graph state machine (`ggml_backend_cuda_graph_compute` in
|
||||
`ggml/src/ggml-cuda/ggml-cuda.cu`): warmup completes after a step whose node
|
||||
properties did not change vs the previous step; any later change logs
|
||||
`CUDA graph warmup reset` and reverts to eager until stable again.
|
||||
`ggml_cuda_graph_update_required` memcmps every node's full `ggml_tensor` plus
|
||||
each src's `data` ptr / `ne` / `nb`.
|
||||
|
||||
`-lv 99` probe, short context (npp 64, ntg 32, ctx <= 96):
|
||||
- stock: `warmup complete` x2, `warmup reset` x0.
|
||||
- paged: `warmup complete` x2, `warmup reset` x0.
|
||||
Both capture and then replay silently. The `CUDA Graph id N reused` line stays 0
|
||||
for both because llama rebuilds the cgraph each ubatch (new `cgraph->uid`), so
|
||||
the uid fast-path never fires; the graph is still replayed via the
|
||||
`instance != nullptr` path, which logs nothing. **"reused=0" is a false negative,
|
||||
not evidence of eager execution.** (Trap #1.)
|
||||
|
||||
Cadence probe (npp 200, ntg 320, npl 4, ctx 200->520, crosses the 256 and 512
|
||||
token boundaries), counts over ~320 decode steps:
|
||||
|
||||
| path                          | complete | reset | interpretation                |
|-------------------------------|----------|-------|-------------------------------|
| paged in-kernel (default)     | 10       | 8     | resets only at 256-boundaries |
| paged gather (KV_PAGED_GATHER)| 0        | 0     | never captures -> pure eager  |
| stock non-paged               | 10       | 8     | identical 256-cadence         |
|
||||
|
||||
The 8 resets cluster at the two boundary crossings (timestamps ~9.9 s and ~34 s),
|
||||
not per-step. The default paged decode is therefore captured for ~97% of steps,
|
||||
re-warming only every ~256 tokens, with the **same cadence as stock**.
|
||||
|
||||
What mutates (the block-table / gather input):
|
||||
- in-kernel decode (default): the block-table graph input
|
||||
`idx = ggml_new_tensor_2d(ctx0, I32, n_view, n_stream)` with
|
||||
`n_view = GGML_PAD(n_gather, 256)` (`src/paged-attn.cpp:199,213`). Its `ne[0]`
|
||||
steps 256 -> 512 -> 768 only when the context crosses a 256-token boundary. The
|
||||
kq_mask input (ne0 = n_kv, also padded to 256) steps in lockstep. So the
|
||||
property change is per-256-tokens, not per-step.
|
||||
- gather fallback (`LLAMA_KV_PAGED_GATHER=1`, transposed-V, or prefill): the
|
||||
index input `idx = ggml_new_tensor_2d(ctx0, I32, n_gather, n_stream)`
|
||||
(`src/paged-attn.cpp:106`) has `ne[0] = n_gather` (UNPADDED), which grows every
|
||||
step (the unit's own comment, `src/paged-attn.cpp:28-30`: "n_gather grows every
|
||||
step"). That changes a node property every step, warmup never completes, and
|
||||
the path runs pure eager. This is the only "graphs reused=0" path, and it is
|
||||
not the default decode path.
|
||||
|
||||
`LLAMA_KV_PAGED_DEBUG` dump at ctx 201 (first 2 decode calls, identical across
|
||||
the pair): `in-kernel decode n_stream=4 n_kv=256 n_gather=201` -> block-table
|
||||
`ne[0] = GGML_PAD(201,256) = 256`, stable until n_gather crosses 256.
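
The padded-vs-unpadded difference can be counted directly. The sketch below models only the index tensor's `ne[0]` as the context grows by one token per step; it does not model the CUDA-graph warmup logic, so it counts shape changes, not the logged resets. `pad_to` uses the same round-up arithmetic as `GGML_PAD`; `count_shape_changes` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Round x up to a multiple of n (n must be a power of two).
constexpr uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

// Count how often ne[0] changes while n_gather grows by one per decode step
// from `first` to `last`. pad == 0 models the unpadded gather fallback.
uint32_t count_shape_changes(uint32_t first, uint32_t last, uint32_t pad) {
    assert(first <= last);
    auto ne0 = [pad](uint32_t n) { return pad ? pad_to(n, pad) : n; };
    uint32_t changes = 0;
    for (uint32_t n = first; n < last; ++n) {
        if (ne0(n + 1) != ne0(n)) ++changes;
    }
    return changes;
}
```

Over the cadence probe's range (200 to 520 tokens) the padded size changes twice, at the 256 and 512 crossings, while the unpadded size changes on every one of the 320 steps.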
|
||||
|
||||
## Step 3 - where the step time goes (nsys), and a second trap
|
||||
|
||||
npl 128, ntg 24, ctx 56 (< 256, so the ON run stays captured after warmup).
|
||||
Idle split by gap size: within-step launch gaps < 1 ms, between-step host gaps
|
||||
>= 1 ms. Steady window = 40%-97% of the trace span (excludes model load / graph
|
||||
reserve / prefill one-offs).
|
||||
|
||||
Trap #2: `nsys --trace=cuda` does NOT emit the kernels INSIDE a replayed CUDA
|
||||
graph into `cuda_gpu_trace` by default. The graphs-ON trace had only 15,279 GPU
|
||||
rows vs 84,946 for the identical OFF workload and reported a bogus 0.3% GPU-busy.
|
||||
Re-profiling the ON case with `--cuda-graph-trace=node` restored all 84,946 rows
|
||||
and 99.5% busy. **Any "decode is idle/gap-bound" reading taken from a graphs-ON
|
||||
nsys trace without `--cuda-graph-trace=node` is artifactually idle-inflated** -
|
||||
the likely source of the earlier "freed GPU time became idle gaps" conclusion.
|
||||
|
||||
Reliable steady-state numbers:
|
||||
|
||||
| trace                          | GPU rows | busy   | within-step idle | between-step idle | host gap/step |
|--------------------------------|----------|--------|------------------|-------------------|---------------|
| OFF (eager)                    | 84,946   | 99.4%  | 0.37%            | 0.24%             | ~2.0 ms       |
| ON (captured, node-trace)      | 84,946   | 99.5%  | 0.11%            | 0.38%             | ~1.9 ms       |
|
||||
|
||||
- CUDA graphs replay (cudaGraphLaunch=46) and collapse the launch path: ON has
|
||||
~15k kernel launches/run vs OFF ~80k (cudaLaunchKernel 6,024 vs 31,764, plus
|
||||
ExC 9,049 vs 48,165). That cuts within-step launch idle from 0.37% to 0.11%.
|
||||
- But the GPU is 99.4-99.5% busy in both, so decode_agg is unchanged.
|
||||
- Between-step host idle is one ~2 ms gap per decode step (the 128-way sample +
|
||||
update_slots + batch build), 0.24-0.38% of the ~896 ms step.
|
||||
|
||||
Per-step decomposition at npl 128: ~896 ms/step, of which ~890 ms is GPU kernel
|
||||
compute, ~2 ms host-loop gap, ~3 ms (eager) / ~1 ms (captured) launch gaps.
|
||||
|
||||
## The load-bearing question, answered
|
||||
|
||||
Within-step or between-step? **Neither is large.** The steady decode is 99.4%
|
||||
GPU-busy; the entire idle budget is ~0.6% of the step. CUDA graphs already remove
|
||||
the within-step launch fraction (0.37% -> 0.11%), and the between-step host gap is
|
||||
~2 ms/step (0.24%). There is no large gap for a host-loop rewrite to reclaim
|
||||
either; the host loop is currently **hidden under GPU compute** (the GPU stays
|
||||
busy while the host syncs/schedules). It would only become a lever once the
|
||||
kernels are fast enough to drop GPU-busy below the host time, i.e. it is a
|
||||
second-order floor, not the present bottleneck.
|
||||
|
||||
## Verdict
|
||||
|
||||
1. CUDA-graphing the paged decode is not the lever. Graphs already engage on the
|
||||
default decode; capturing reduces within-step launch idle from 0.37% to 0.11%
|
||||
but leaves the GPU 99.4-99.5% busy, so decode_agg moves by ~0% (measured
|
||||
+0.1% to +0.6% at npl 128, +1.1% to +1.4% at npl 32, all within noise).
|
||||
2. The between-step host loop is not the present lever either (0.24%, ~2 ms/step,
|
||||
hidden under GPU compute). It is the candidate floor only after the kernels
|
||||
speed up.
|
||||
3. The decode is GPU-compute-bound at this NVFP4 fusion-OFF baseline. The 2.6x
|
||||
gap to vLLM is in the per-step GPU kernel work (FP4 GEMM + attention at batch
|
||||
128). That, not graphs and not the host loop, is the throughput lever.
|
||||
4. Corrected premises: paged is not perpetually eager (it captures with a
|
||||
256-token reset cadence identical to stock); "graphs reused=0" was a uid
|
||||
fast-path false negative; and a graphs-ON nsys trace under-counts GPU-busy
|
||||
unless `--cuda-graph-trace=node` is set.
|
||||
|
||||
No code patch in Phase 1 (graphs are not the lever, so there is no paged
|
||||
graph-capture patch to land). Evidence: `~/bench/a2_4cell_v2/`, `~/bench/a2_probe`,
|
||||
`~/bench/a2_probe2`, `~/bench/a2_nsys/*.nsys-rep` on the DGX.
|
||||
|
||||
# Phase 2 - the real decode lever, located (per-kernel decomposition)
|
||||
|
||||
Phase 1 ended on "decode is GPU-compute-bound; the 2.6x gap to vLLM lives in the
|
||||
per-step GPU kernel work (FP4 GEMM + attention at batch 128)." Phase 2 measured
|
||||
that per-step GPU work directly - per kernel and per memcpy, on the Phase-1 nsys
|
||||
`.sqlite` reps - and the "FP4 GEMM + attention" attribution does not survive the
|
||||
measurement. Two corrections, then the lever.
|
||||
|
||||
The conditional Phase 2 fix (make the paged decode graph-capturable) is moot:
|
||||
Phase 1 already showed the default paged decode captures, and the fresh re-check
|
||||
below reconfirms the graph win is noise. Neither Phase 2 branch (within-step graph
|
||||
fix / between-step host loop) is the lever; the lever is a third thing, measured
|
||||
here.
|
||||
|
||||
## Fresh re-confirmation: graphs are not the lever
|
||||
|
||||
Independent run (npl128, ntg32, paged, fusion off), not reusing Phase 1's table:
|
||||
|
||||
| paged decode  | S_TG t/s | vs vLLM 391 |
|---------------|----------|-------------|
| graphs ON     | 146.03   | 37.3%       |
| graphs OFF    | 144.90   | 37.1%       |
|
||||
|
||||
+0.78%, within noise - same verdict as Phase 1's 4-cell. The ON nsys rep is also
|
||||
99.5% busy with the same ~3267 ms of memcpy as OFF: graphs capture the memcpy
|
||||
nodes too, so they cannot remove either the copies or the compute.
|
||||
|
||||
## Correction 1: the model is a hybrid SSM, not a plain transformer
|
||||
|
||||
`q36-27b-nvfp4.gguf` has `general.architecture = qwen35` with
|
||||
`qwen35.ssm.{conv_kernel,state_size,group_count,time_step_rank,inner_size}`. The
|
||||
decode-window kernel cadence (per step, ~19.8 steps in the window) is 48
|
||||
`gated_delta_net_cuda` + 48 `ssm_conv_f32` vs 16 `flash_attn_tile`, i.e. **48
|
||||
gated-DeltaNet linear-attention layers : 16 full-attention layers** (a 3:1
|
||||
hybrid, Qwen3-Next family). Paged attention only touches the 16 full-attention
|
||||
layers.
|
||||
|
||||
## Correction 2: the 99.4% "busy" is ~19% D2D memcpy, not compute
|
||||
|
||||
Interval-union sweep over the steady decode window (last 17 s of the npl128/ntg24
|
||||
OFF rep; single CUDA stream; running-max-end so it is overlap-correct):
|
||||
|
||||
| activity set           | GPU busy | idle  |
|------------------------|----------|-------|
| kernels only           | 80.2%    | 19.8% |
| kernels + memcpy (all) | 99.4%    | 0.6%  |
|
||||
|
||||
The 969 inter-kernel gaps (>=1 ms, ~48/step) that drop kernels-only to 80% are
|
||||
filled by **D2D memcpy: 1584 copies/run (~80/step), ~230 MB each, ~2 ms each,
|
||||
356 GB moved in 17 s**. At batch 128 a ~230 MB block is the gated-DeltaNet
|
||||
recurrent state; these are the per-SSM-layer state copies. (HtoD copies = the
|
||||
paged block-table/index upload: 731/run but only 3 ms total, negligible; DtoH
|
||||
47 ms.) Phase 1's `cuda_gpu_trace`-based 99.4% counted these memcpys as "busy"
|
||||
and lumped them into "GPU kernel compute" - they are memory movement, and they
|
||||
are addressable.
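
The interval-union sweep is a standard way to compute busy time without double counting overlapping activities. A minimal sketch, assuming each activity is a half-open `[start, end)` interval in some integer time unit; `Interval`, `busy_time` and `busy_fraction` are hypothetical names, not the SQL used on the DGX:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Interval {
    int64_t start; // inclusive, e.g. nanoseconds
    int64_t end;   // exclusive
};

// Total covered time of possibly overlapping intervals (running-max-end sweep).
int64_t busy_time(std::vector<Interval> iv) {
    std::sort(iv.begin(), iv.end(),
              [](const Interval & a, const Interval & b) { return a.start < b.start; });
    int64_t busy = 0;
    int64_t max_end = 0;
    bool have_prev = false;
    for (const Interval & x : iv) {
        assert(x.start <= x.end);
        if (!have_prev || x.start > max_end) {
            busy += x.end - x.start;  // disjoint from everything seen so far
            max_end = x.end;
            have_prev = true;
        } else if (x.end > max_end) {
            busy += x.end - max_end;  // count only the part past the running max end
            max_end = x.end;
        }
    }
    return busy;
}

// Busy share of a window of the given length.
double busy_fraction(const std::vector<Interval> & iv, int64_t window) {
    assert(window > 0);
    return static_cast<double>(busy_time(iv)) / static_cast<double>(window);
}
```

Running the sweep once over kernels only and once over kernels plus memcpys gives the two rows of the table; the difference between them is the time that only copies were keeping the GPU occupied.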
|
||||
|
||||
## Decode GPU-time decomposition (% of kernel+memcpy busy)
|
||||
|
||||
OFF/eager rep, steady window. `/step` = instances per decode step.
|
||||
|
||||
| share | activity                          | /step | role                          |
|-------|-----------------------------------|-------|-------------------------------|
| 23.4% | gated_delta_net_cuda              | 48    | linear-attn recurrence        |
| 21.9% | k_get_rows_float                  | 97    | SSM state / conv-state gather |
| 18.9% | MEMCPY DtoD                       | 80    | SSM recurrent-state copy      |
| 15.5% | mul_mat_vec_q (nvfp4, ncols=1)    | 48    | FP4 GEMV                      |
| 10.4% | mul_mat_q (nvfp4)                 | 352   | FP4 GEMM                      |
| 1.9%  | quantize_mmq_nvfp4                | 448   | act requant for MMQ           |
| 1.0%  | concat_cont                       | 48    | SSM state glue                |
| 0.8%  | ssm_conv_f32                      | 48    | SSM short conv                |
| 0.7%  | unary_gated_op silu               | 112   | SSM gating                    |
| 0.4%  | flash_attn_tile/_ext              | 16    | FULL attention (paged)        |
|
||||
|
||||
Grouped:
|
||||
- gated-DeltaNet / SSM machinery (recurrence + get_rows gather + DtoD state copy
|
||||
+ conv + gating glue): **~67% of decode**.
|
||||
- FP4 matmul (GEMV + GEMM + requant + stream-k fixup): **~28%**.
|
||||
- Full attention - everything paged attention optimizes: **~0.4%**.
|
||||
|
||||
## Verdict and scope of the real lever
|
||||
|
||||
1. CUDA graphs: not the lever (Phase 1, re-confirmed: +0.78%, noise). They capture
|
||||
the memcpy too, so they cannot touch the copies or the compute.
|
||||
2. Host loop: not the lever (true host idle in the union is 0.24%, ~41 ms/17 s).
|
||||
3. FP4 GEMM: secondary, ~28%. Consistent with Track B P2a (making the FP4 GEMM 26%
|
||||
faster left decode_agg flat) - it was never the long pole.
|
||||
4. Paged / full attention: ~0.4% of decode. **No paged-attention change (graphs,
|
||||
block-table stabilization, gather rewrite) can move decode_agg on this model**
|
||||
- it optimizes under half a percent of the step. This is the structural reason
|
||||
A.2, and the paged-decode track generally, cannot close the vLLM gap on
|
||||
q36-27b: the model barely uses the path being optimized.
|
||||
|
||||
The throughput lever is the ggml **qwen35 gated-DeltaNet decode**. Per SSM layer
|
||||
per step it re-materializes and D2D-copies the full recurrent state (~230 MB at
|
||||
batch 128; ~80 copies/step, ~18 GB/step) and feeds the recurrence through ~2
|
||||
`get_rows` gathers, so ~61% of decode (state copy + state gather + recurrence) is
|
||||
SSM state plumbing. vLLM's gated-DeltaNet decode (the flash-linear-attention
|
||||
`fused_recurrent_gated_delta_rule` path) keeps the state in place and fuses the
|
||||
gather into the scan, avoiding both the per-layer D2D copy and the gathers.
|
||||
|
||||
Next-step scope (the real lever, to be done in the ggml/llama qwen35 SSM path -
|
||||
not paged-attn, not a graph capture, not a block-table tweak):
|
||||
1. Eliminate the per-layer recurrent-state D2D copy: update the state tensor
|
||||
in place (or double-buffer / write-back), so the recurrence consumes and
|
||||
produces the persistent state without a full-state copy each layer each step.
|
||||
2. Fuse the `get_rows` state / conv-state gather into the recurrent kernel.
|
||||
|
||||
Ceiling from this rep (upper bound; assumes the work is fully removed, not just
|
||||
overlapped):
|
||||
- remove the DtoD state copy: reclaim 18.9% -> ~146 to ~180 t/s.
|
||||
- remove copy + gather: reclaim ~41% -> ~146 to ~247 t/s, which puts llama within
|
||||
~1.6x of vLLM 391 with the FP4 GEMM still untouched.
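
The ceiling figures are the usual "remove a fraction of step time" bound: if a share `f` of the per-step GPU time disappears and the rest is unchanged, throughput scales by `1 / (1 - f)`. A worked sketch with the numbers from this rep (146 t/s baseline, 18.9% copy, 21.9% gather); `projected_tps` and `close_to` are hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Upper-bound throughput after removing `removed_fraction` of the step time.
double projected_tps(double base_tps, double removed_fraction) {
    assert(removed_fraction >= 0.0 && removed_fraction < 1.0);
    return base_tps / (1.0 - removed_fraction);
}

// Tolerance comparison for rounded figures.
bool close_to(double a, double b, double tol) {
    return std::fabs(a - b) < tol;
}
```

`projected_tps(146, 0.189)` is about 180 and `projected_tps(146, 0.189 + 0.219)` is about 247, which leaves a ratio of roughly 1.6 to vLLM's 391.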
|
||||
|
||||
No code patch in Phase 2 either: the lever is a gated-DeltaNet decode rewrite in
|
||||
the SSM path, too large for this measurement pass and orthogonal to paged
|
||||
attention. `patches/paged/0018` stays free. Evidence on the DGX:
|
||||
`~/bench/a2_decompose/decode_decomp.txt` (per-kernel table + reproducing SQL in
|
||||
its header), `~/bench/a2_decompose/SUMMARY.txt`, and the Phase-1 reps
|
||||
`~/bench/a2_nsys/paged_off_npl128.sqlite` / `paged_on_npl128_node.sqlite`.
|
||||
|
||||
# A.2 final synthesis - the four-point verdict
|
||||
|
||||
All numbers measured on the DGX (GB10, sm_121, q36-27b-nvfp4 dense, fusion OFF,
|
||||
`decode_agg` = `S_TG t/s`), npl 128 unless noted.
|
||||
|
||||
**1. CUDA-graph lever size (measured, not guessed).** +0.13% (4-cell, stock
|
||||
ON-vs-OFF) to +0.78% (fresh paged re-check) at npl 128; +1.1% to +1.4% at npl 32.
|
||||
All inside run-to-run noise. The earlier grounding GUESSED ~10-20% from a
|
||||
94.6%-busy reading; direct measurement puts the steady decode at 99.4-99.5% busy,
|
||||
so the real graph ceiling is < 1%, not 10-20%. The guess was wrong because the
|
||||
busy-fraction it rested on was under-read (a graphs-ON nsys trace under-counts
|
||||
GPU-busy unless `--cuda-graph-trace=node` is set - trap #2).
|
||||
|
||||
**2. Was "paged decode runs eager" fixed, and what is the decode_agg win?**
|
||||
There was nothing to fix: the premise was false. At the benchmarked context the
|
||||
DEFAULT in-kernel paged decode already captures and replays graphs, with a
|
||||
256-token reset cadence identical to stock non-paged (10 complete / 8 reset over
|
||||
~320 steps, resets clustered only at the 256/512 token boundaries). "graphs
|
||||
reused=0" was a uid fast-path false negative, not eager execution (trap #1). The
|
||||
only genuinely-eager path is the `LLAMA_KV_PAGED_GATHER=1` fallback (unpadded
|
||||
index grows every step), which is not the default decode. Because graphs were
|
||||
already engaged, the decode_agg win from "enabling" them is ~0 (+0.1% to +0.8%).
|
||||
Graphs DID collapse within-step launch idle (0.37% -> 0.11%, ~80k -> ~15k
|
||||
launches/run), but the GPU stays 99.4-99.5% busy, so throughput is unchanged.
|
||||
|
||||
**3. New llama %-of-vLLM @npl128.** Unchanged by A.2: 146-148.6 t/s vs vLLM 391 =
|
||||
**37.3-38.0%**. Graphs ON vs OFF both land here (146.03 / 144.90 in the fresh
|
||||
re-check; 148.41 / 148.21 in the 4-cell). A.2 did not move the percentage.
|
||||
|
||||
**4. Honest verdict - did A.2 move toward parity; residual + next lever.** No.
A.2 closed zero of the 2.6x gap, and it provably cannot on this model: paged /
full attention is ~0.4% of decode (16 full-attention layers vs 48 gated-DeltaNet
layers, a 3:1 hybrid SSM), so no graph / block-table / gather change to the paged
path can move decode_agg. The residual gap is structural and lives elsewhere:
~67% of decode is gated-DeltaNet / SSM state plumbing (23.4% recurrence + 21.9%
get_rows state gather + 18.9% D2D recurrent-state copy of ~230 MB per SSM layer
per step, ~18 GB/step), and ~28% is FP4 matmul (already shown secondary by Track
B: a 26%-faster GEMM left decode_agg flat). The within-step launch loop is solved
(graphs) and the between-step host loop is a 0.24% second-order floor hidden under
GPU compute - neither is the residual.

The next lever is NOT in this track. It is the ggml qwen35 gated-DeltaNet decode:
(1) eliminate the per-layer recurrent-state D2D copy (in-place / double-buffer
write-back), and (2) fuse the get_rows gather into the recurrent kernel - mirroring
vLLM's `fused_recurrent_gated_delta_rule`, which keeps the state in place and
fuses the gather. Measured ceiling on this rep: remove the copy -> ~146 to ~180
t/s; remove copy + gather -> ~146 to ~247 t/s (within ~1.6x of vLLM with FP4 GEMM
still untouched). That work is orthogonal to paged attention; `patches/paged/0018`
stays free.

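The quoted ceilings are consistent with a simple Amdahl-style estimate: if a bucket that takes fraction f of decode time disappears and nothing else changes, throughput scales by 1 / (1 - f). The sketch below uses the 18.9% copy and 21.9% gather shares from the verdict above; whether the original numbers were derived this way is an assumption, and the function name is illustrative.

```python
# Amdahl-style ceiling: if a bucket taking fraction f of decode time is
# removed entirely and nothing else changes, throughput scales by 1 / (1 - f).
# The fractions and the 146 t/s baseline are the figures quoted in the text;
# the function name is illustrative only.

BASELINE_TPS = 146.0
COPY_FRACTION = 0.189     # D2D recurrent-state copy share of decode
GATHER_FRACTION = 0.219   # get_rows state gather share of decode
VLLM_TPS = 391.0


def ceiling_tps(baseline_tps: float, removed_fraction: float) -> float:
    """Upper bound on throughput after removing `removed_fraction` of step time."""
    if not 0.0 <= removed_fraction < 1.0:
        raise ValueError("removed_fraction must be in [0, 1)")
    return baseline_tps / (1.0 - removed_fraction)


no_copy = ceiling_tps(BASELINE_TPS, COPY_FRACTION)
no_copy_no_gather = ceiling_tps(BASELINE_TPS, COPY_FRACTION + GATHER_FRACTION)

print(f"remove copy:          ~{no_copy:.0f} t/s")
print(f"remove copy + gather: ~{no_copy_no_gather:.0f} t/s")
print(f"residual vs vLLM:     ~{VLLM_TPS / no_copy_no_gather:.1f}x")
```

This is an upper bound only: it assumes the removed work overlaps with nothing and that the remaining buckets keep their absolute cost.
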
@@ -1,107 +0,0 @@
|
||||
# Additive layout for the paged-KV patch series - "hook, don't edit"

Goal: ship paged KV as a vendored patch series that **survives llama.cpp pin bumps with
minimal rebase pain**. PR #22569 (the upstream draft) was rejected by maintainers as
"slop" and is far too invasive to vendor - it rewrites core attention. Our series must be
the opposite: **additive**. This document is the design rule and the per-patch core-touch
budget.

## The rule

> Every change is either (a) **new code in a new vendored file** under `src/`, or (b) a
> **single, env-gated hook** at one call site in a core file that delegates to the new
> file. No logic lives in a core file. No core struct/signature is edited.

Why it works: a hook is a 1-3 line diff against a core file. When upstream churns that file,
`git apply` either still lands the hook (context unchanged) or fails *only on that tiny
hunk*, which is trivial to re-place. Logic embedded inside a core function (the PR #22569 /
old-0003 approach) conflicts on every bump and must be re-understood each time.

This is enforceable as a **core-touch budget**: each patch declares the core files it
touches and the line count; review rejects anything that grows logic in core.

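The shape of the rule can be sketched in a few lines. This is a toy model, not llama.cpp code: `core_build_attn` and `paged_gather` are hypothetical stand-ins, and treating any non-empty `LLAMA_KV_PAGED` value as "enabled" is an assumption about how the gate is read.

```python
# Sketch of the "hook, don't edit" rule: the core function gains one call,
# and all new logic lives in a separate module. Names here (core_build_attn,
# paged_gather) are hypothetical stand-ins, not llama.cpp symbols.
import os


def paged_gather(cells: list[int], used: list[bool]) -> list[int]:
    """New-file logic: keep only used cells when LLAMA_KV_PAGED is set."""
    if not os.environ.get("LLAMA_KV_PAGED"):
        return cells  # env unset -> stock behaviour, input returned untouched
    return [c for c, u in zip(cells, used) if u]


def core_build_attn(cells: list[int], used: list[bool]) -> list[int]:
    """Stand-in for a core function; the hook is the single added line."""
    cells = paged_gather(cells, used)  # the one-line hook
    return cells  # the rest of the core function is unchanged
```

With the variable unset the core function behaves exactly as before, which is what keeps the core diff to one line and lets a partial series ship default-off.
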
## Why it's achievable here (grounded in the pinned source)

The two seams paged KV needs are both already abstract in llama.cpp at the pin
(`LLAMA_VERSION=f3e1828`), so new behavior plugs in without editing core types:

- **KV placement** - `llama_kv_cache::find_slot` already returns a `slot_info` of physical
  cell indices. Paged placement is just *different indices*. 0002 already does this as one
  gated block (`if (paged_mode) { ... continue; }`, 41 lines, one file). Ideal.
- **Graph inputs** - `llm_graph_input_i` is a pure-virtual base (`set_input()`), and
  `llm_graph_result::add_input(llm_graph_input_ptr)` lets *any* code register a new input
  subclass. So a paged graph input (the gather index) can be **a new class in a new file**,
  added from a one-line hook - no edit to `llm_graph_input_attn_kv` or `llama-graph.h`.

## Per-patch core-touch budget

| # | Patch | New files (additive) | Core hooks (gated, minimal) | Core lines |
|---|-------|----------------------|------------------------------|-----------:|
| 0001 | vendor manager | `paged-kv-manager.{h,cpp}` | `CMakeLists.txt` +1 | 1 |
| 0002 | block placement | - | one `if(paged_mode){...continue;}` in `find_slot` | ~41 |
| 0003 | gather-read | `paged-attn.{h,cpp}` | `CMakeLists.txt` +1; **one** hook in `build_attn`; 2 tiny accessors on `llama_kv_cache_context` | ~8 |
| 0004 | on-demand alloc | (uses 0001 manager) | one branch in `find_slot` calling the manager | ~10 |
| 0005 | continuous batching | - | **LocalAI `grpc-server.cpp`** (already a LocalAI override, not a core patch) | 0 core |
| 0006 | prefix caching | (uses 0001 manager) | one hash-lookup hook in the 0004 alloc branch | ~6 |

Net core surface for the *entire* engine: `find_slot` (placement/alloc - where physical
cells are already chosen) + **one** line in `build_attn` + two accessors. Everything else
is new files or the LocalAI-side server loop.

## 0003 redesigned to the rule (replaces the 4-file-surgery plan)

The old `0003-gather-read-plan.md` edited `llama-kv-cache.{h,cpp}` + `llama-graph.{h,cpp}`
(including a field added to `llm_graph_input_attn_kv` and fill logic in its `set_input`).
The additive form removes the core-struct and core-`set_input` edits entirely:

**New file `src/paged-attn.{h,cpp}`** holds *all* logic:
- `class llm_graph_input_paged_gather : public llm_graph_input_i` - owns the `I32 [n_gather]`
  gather-index tensor and a `const llama_kv_cache_context * mctx`. Its `set_input()` fills
  the index with the sequence's used cells (`{ i in [0,n_kv) : !cells.is_empty(i) }`, the
  same set the `kq_mask` keeps), in the canonical order.
- `paged_attn::gather(ctx0, res, mctx, v_trans, &k, &v, &kq_mask)` - when paged is active,
  constructs that input via `res->add_input(...)`, and applies `ggml_get_rows` to `k`, `v`,
  and the transposed `kq_mask` by the shared index (mask: `transpose -> get_rows -> transpose`).
  When not active it returns immediately -> **stock path byte-identical**.

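The gather can be modelled on plain lists to make the index and the mask handling concrete. This is a sketch under assumptions: `get_rows` only mimics row selection, "canonical order" is taken to mean ascending cell index, and none of the function names are the real `paged-attn.cpp` API.

```python
# Pure-Python model of the gather described above. `get_rows` mimics the
# row-selection semantics of a row gather on a list-of-rows matrix; all
# function names are illustrative, not the real paged-attn.cpp API.


def gather_index(is_empty: list[bool]) -> list[int]:
    """Indices of used cells in [0, n_kv), in ascending order (assumed canonical)."""
    return [i for i, empty in enumerate(is_empty) if not empty]


def get_rows(matrix: list[list[float]], idx: list[int]) -> list[list[float]]:
    """Select rows of `matrix` by index, like a row gather."""
    return [matrix[i] for i in idx]


def transpose(matrix: list[list[float]]) -> list[list[float]]:
    """Swap rows and columns of a rectangular list-of-rows matrix."""
    return [list(col) for col in zip(*matrix)]


def gather_kv_and_mask(k, v, kq_mask, is_empty):
    """Compact K, V (rows = cells) and the mask (columns = cells) by one index."""
    idx = gather_index(is_empty)
    k_g = get_rows(k, idx)
    v_g = get_rows(v, idx)
    # The mask has one column per KV cell, so select columns via
    # transpose -> get_rows -> transpose.
    mask_g = transpose(get_rows(transpose(kq_mask), idx))
    return k_g, v_g, mask_g, idx
```

Because K, V, and the mask are all compacted by the same index, row i of the gathered K still lines up with column i of the gathered mask.
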
**Core hooks (the whole core diff for 0003):**
1. `src/llama-graph.cpp`, in `build_attn` right before `build_attn_mha` (~line 2357):
   ```cpp
   paged_attn::gather(ctx0, res, mctx_cur, v_trans, &k, &v, &kq_mask); // no-op unless LLAMA_KV_PAGED
   ```
   One line. No new field on `llm_graph_input_attn_kv`; the gather input is a *separate*
   registered input, so `llama-graph.h` is untouched.
2. `src/llama-kv-cache.{h,cpp}`: two thin accessors on `llama_kv_cache_context` so the new
   file can read the used-cell set without reaching into internals -
   `uint32_t get_n_gather() const;` and `void get_gather_idxs(int32_t * dst) const;`
   (delegate to `kv`/`sinfos[i_cur]`, mirroring the existing `get_n_kv` / `set_input_k_idxs`
   pattern). ~8 lines total, no signature changes to existing methods.
3. `src/CMakeLists.txt`: `+ paged-attn.cpp`.

First cut: gate to **flash-attn + single-stream** (`GGML_ASSERT` otherwise) - the V-transposed
(non-FA) and multi-stream gathers are a localized follow-up entirely inside `paged-attn.cpp`,
no new core touch. Gate 0 stays the same: `diff` of greedy `llama-simple` output, stock vs
`LLAMA_KV_PAGED=1`, must be identical (attention is permutation-invariant over the gathered
KV set; `n_gather < n_kv` proves compaction, not identity).

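The reason Gate 0 can hold is easy to check numerically: masked-out cells contribute exp(-inf) = 0 to the softmax, so attending over only the gathered cells matches attending over all `n_kv` cells with a mask. The sketch below is a single-query, single-head illustration with made-up values, not the flash-attn kernel. It keeps the used cells in their original order; in floating point a genuine reordering can change rounding, which is why the gate in the text compares greedy token output.

```python
# Numerical illustration of why Gate 0 can hold: masked-out cells contribute
# exp(-inf) = 0 to the softmax, so attending over only the gathered (used)
# cells gives the same output as attending over all n_kv cells with a mask.
# Single query, single head; a sketch, not the flash-attn kernel.
import math


def attention(q, keys, values, mask):
    """softmax(q.k + mask) weighted sum of values for one query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) + m for k, m in zip(keys, mask)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mask = [0.0, -math.inf, 0.0, -math.inf]   # cells 1 and 3 are empty

full = attention(q, keys, values, mask)

used = [i for i, m in enumerate(mask) if m != -math.inf]   # gather index
compact = attention(q, [keys[i] for i in used], [values[i] for i in used], [0.0] * len(used))

print(len(used), "of", len(keys), "cells gathered")
```

Here `len(used) < len(keys)` plays the role of `n_gather < n_kv`: the compacted call really does see fewer cells, yet the output matches.
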
## Anti-drift practices (already in `README.md`, restated as policy)

- **Stacking patches, one concern each**, exported 1:1 from a dev branch via
  `git format-patch`. On a pin bump, rebase the branch; only the conflicting small patch
  needs a touch, and the failure names the exact step.
- **Default-off (`LLAMA_KV_PAGED`)** until each gate is green, so a partial series never
  changes stock behavior - and the hooks compile to a no-op branch when the env is unset.
- **Dev tree:** `git worktree add <dev> <LLAMA_VERSION>` off any checkout that has the pin
  (e.g. the existing llama.cpp clone), `git apply` the series, develop the next patch as one
  commit, re-export. (Set up and verified for this pin during this work.)

## Status / next step

- 0001, 0002: done, additive, verified token-identical.
- 0003: **redesigned to the additive form above** (this doc). Dev tree at the pin with
  0001+0002 applied is ready (`paged` branch). Remaining work is the focused
  implement-and-verify block for `paged-attn.{h,cpp}` + the one `build_attn` hook, driven to
  the token-identical Gate 0. That is a numerical-correctness task (mask/gather alignment,
  FA-first), not a structural one - the structure is settled here.
- 0004-0006: follow the budget above; 0005 lands in LocalAI's `grpc-server.cpp` (no core
  patch at all).
@@ -1,669 +0,0 @@
|
||||
# ARCH_GENERALITY_AUDIT - llama-cpp-localai-paged backend

Source/build/gallery audit (no GPU, no hardware). Maps how arch-general the
paged backend's BUILD targeting is, and whether non-Blackwell / Metal / CPU
hosts get a working build.

## Section: backend-build-matrix (build targeting)

### 1. CUDA arch list: NOT Blackwell-only - it is the FULL upstream ggml default

There is NO explicit CUDA arch list anywhere in the paged build path:

- `.docker/llama-cpp-localai-paged-compile.sh` only injects
  `-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}` *when* `CUDA_DOCKER_ARCH` is
  non-empty (`if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]`).
- NO `backend-matrix.yml` row for `llama-cpp-localai-paged` sets
  `CUDA_DOCKER_ARCH` (nor does any stock `llama-cpp` row). It is empty.
- `backend/cpp/llama-cpp/Makefile` (reused verbatim by the paged wrapper) sets
  only `-DGGML_CUDA=ON` (+ `-DGGML_NATIVE=OFF`). It never sets
  `CMAKE_CUDA_ARCHITECTURES` / `CUDA_DOCKER_ARCH`.

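The effect of that conditional can be modelled in a few lines. This is a simplified sketch: `cmake_cuda_args` is a hypothetical helper that folds the Makefile's fixed flags and the compile script's shell test into one function, which the real build does not do.

```python
# Model of the conditional described above: the arch flag is appended only
# when CUDA_DOCKER_ARCH is non-empty. `cmake_cuda_args` is a hypothetical
# helper; the real logic is a shell test in the compile script.


def cmake_cuda_args(env: dict[str, str]) -> list[str]:
    """Return the CUDA-related CMake flags for a given environment."""
    args = ["-DGGML_CUDA=ON", "-DGGML_NATIVE=OFF"]
    arch = env.get("CUDA_DOCKER_ARCH", "")
    if arch:  # mirrors: if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]
        args.append(f"-DCMAKE_CUDA_ARCHITECTURES={arch}")
    return args


print(cmake_cuda_args({}))
print(cmake_cuda_args({"CUDA_DOCKER_ARCH": "120"}))
```

With the variable unset or empty, no arch flag is passed at all, so the arch list falls through to whatever the upstream CMake default is.
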
=> The compiled arch fan is whatever upstream llama.cpp / ggml-cuda picks by
default with `GGML_NATIVE=OFF` (the full multi-arch default, which includes
Blackwell sm_120 alongside the older archs ggml ships). This is BIT-IDENTICAL to
how the stock llama-cpp backend is targeted - the paged wrapper copies and reuses
the exact same Makefile + CMakeLists + prepare.sh, only forcing `LLAMA_PAGED=on`.

Consequence for NVFP4: the FP4-MMA kernel is compile-time gated *inside* the
ggml-cuda TU by `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer, sm_100
datacenter). Because the build emits the full arch fan (not a Blackwell-only
list), the sm_120 NVFP4-MMA path is present for Blackwell AND the older archs get
their own kernels (NVFP4 runs the non-MMA fallback at runtime on
Ampere/Ada/Hopper). NOTHING in the build pins this to GB10/sm_121. The binary is
arch-portable; only the NVFP4 *speedup* is Blackwell-specific, by kernel gating,
not by build targeting.

### 2. Variants built: CUDA + ROCm + SYCL + Vulkan + CPU (NOT CUDA-only)

`backend-matrix.yml` `include:` (Linux) ships, for `llama-cpp-localai-paged`,
one row per stock-llama-cpp Linux row (10 rows, lines 4889-5046):

- cublas CUDA 12.8 (linux/amd64)
- cublas CUDA 13.0 (linux/amd64)
- cublas CUDA 13.0 arm64 l4t (jetson)
- cublas CUDA 12.0 arm64 l4t (jetson)
- hipblas / ROCm 7.2.1 (linux/amd64) - AMDGPU_TARGETS = gfx908..gfx1201
- sycl_f32 (Intel oneAPI)
- sycl_f16 (Intel oneAPI)
- vulkan (linux/amd64)
- vulkan (linux/arm64)
- CPU (linux/amd64) + CPU (linux/arm64), built via the ggml
  `CPU_ALL_VARIANTS` single-build (dlopen libggml-cpu-*.so by host CPU feature;
  arm64 uses gcc-14 for SME).

So it is NOT CUDA-only. Per image, `compile.sh` builds: the accelerator variant
(or CPU_ALL_VARIANTS when BUILD_TYPE empty) + grpc-server + rpc-server.

### 3. THE GAP vs stock llama-cpp: NO Metal / Darwin build

This is the single build-targeting divergence:

- stock `llama-cpp` HAS a Darwin row in `includeDarwin`
  (`-metal-darwin-arm64-llama-cpp`, line 5071) and a `metal:` capability key
  (`metal: "metal-llama-cpp"`, index.yaml line 25).
- `llama-cpp-localai-paged` has ZERO entries in `includeDarwin` (grep past line
  5048 = none) and NO `metal:` capability key in its meta-backend.
- There is NO `metal-*-llama-cpp-localai-paged` tag anywhere in
  backend-matrix.yml or backend/index.yaml.

`scripts/changed-backends.js` already anticipates a future darwin paged row
(lines 78-81 map `backend === "llama-cpp-localai-paged"` to the C++ source dir),
but no such matrix row exists, so it is currently dead/forward-looking code.

Everything else (CUDA arch fan, ROCm gfx list, SYCL, Vulkan, CPU) matches stock
llama-cpp exactly.

### 4. Does a non-Blackwell / Metal / CPU host get a working build of THIS backend?

Meta-backend capabilities map (index.yaml lines 101-111):
default(cpu), nvidia(cuda12), intel(sycl-f16), amd(rocm), vulkan, nvidia-l4t,
nvidia-cuda-13, nvidia-cuda-12, nvidia-l4t-cuda-12/13. NO `metal:` key.

- Non-Blackwell NVIDIA (Ampere sm_80-86 / Ada sm_89 / Hopper sm_90 / datacenter
  Blackwell sm_100): selects the SAME cuda12 / cuda13 image. That image is
  compiled for the full arch fan, so it RUNS. NVFP4 falls back to the non-MMA
  path on pre-Blackwell; on sm_100 it gets FP4-MMA but is compute-bound (HBM3e),
  not the LPDDR5x-bound GB10 regime the patches were tuned for. WORKS, just
  without the GB10-specific bandwidth win.
- AMD / Intel / Vulkan / CPU (amd64 + arm64) Linux hosts: each has its own
  matching variant in the map + matrix. WORKS.
- Metal / macOS Apple Silicon: NO `metal:` key and NO darwin build. Capability
  resolution falls back to `default` = `cpu-llama-cpp-localai-paged`, which is a
  Linux (amd64/arm64) image, NOT a macOS-native build, so it will NOT run on
  macOS. And because this is a SEPARATE meta-backend, it does NOT fall through to
  the stock `llama-cpp` backend - a Mac user who explicitly selects
  llama-cpp-localai-paged gets a non-running selection and must manually pick the
  stock llama-cpp backend instead. DOES NOT WORK on Metal/macOS; no auto-fallback
  to stock.

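The macOS failure mode follows from how a key-with-default lookup behaves. The sketch below is a simplified model: only the `default` and `nvidia` image names appear in the text, while the resolver function, the `LINUX_ONLY_IMAGES` set, and the host-OS check are assumptions made for illustration and are not LocalAI's resolver code.

```python
# Simplified model of meta-backend capability resolution as described above:
# look up the detected capability, else fall back to `default`. Only the
# `default` and `nvidia` image names appear in the text; the resolver
# function and the host-OS check are assumptions made for illustration.

PAGED_CAPABILITIES = {
    "default": "cpu-llama-cpp-localai-paged",
    "nvidia": "cuda12-llama-cpp-localai-paged",
    # amd / intel / vulkan / nvidia-l4t keys omitted; there is no "metal" key
}

LINUX_ONLY_IMAGES = {"cpu-llama-cpp-localai-paged", "cuda12-llama-cpp-localai-paged"}


def resolve_image(capabilities: dict[str, str], detected: str) -> str:
    """Pick the image for the detected accelerator, falling back to default."""
    return capabilities.get(detected, capabilities["default"])


def runs_on(image: str, host_os: str) -> bool:
    """A Linux image cannot run natively on a macOS host."""
    return not (image in LINUX_ONLY_IMAGES and host_os == "darwin")


image = resolve_image(PAGED_CAPABILITIES, "metal")
print(image, "runs on darwin:", runs_on(image, "darwin"))
```

Resolution itself succeeds for a Metal host; it simply lands on a Linux image, which is why the selection looks valid but cannot run.
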
## Verdict (build-targeting)

- Arch-general on Linux: YES. The build is NOT Blackwell-only; it targets the
  exact same full CUDA arch fan + the same ROCm/SYCL/Vulkan/CPU variant set as
  stock llama-cpp. Any Linux host that can run stock llama-cpp can run THIS
  backend; the NVFP4 speedup is the only Blackwell-gated piece, and that gating
  is inside the kernel, not in the build matrix.
- Single gap: NO Metal/Darwin variant and NO `metal:` capability key. macOS /
  Apple Silicon hosts have no working build of this backend and do not auto-fall
  to stock llama-cpp. To close the gap, add an `includeDarwin` row
  (`-metal-darwin-arm64-llama-cpp-localai-paged`, mirroring the stock llama-cpp
  darwin row + the C++ source build path that changed-backends.js already
  anticipates) and a `metal:` key to the paged meta-backend. (Note: NVFP4 has no
  Metal MMA path, so a Metal build would deliver paged-KV behaviour only, no
  NVFP4 acceleration - still a correctness/availability win over the current
  broken selection.)

## Section: gguf-gallery-targeting (NVFP4 portability + hardware gating)

### 1. NVFP4 GGUFs LOAD + RUN on non-Blackwell - runs-via-dequant, NOT FP4-MMA-required

The published GGUFs use `file_type` MOSTLY_NVFP4 / `GGML_TYPE_NVFP4` (type id 40).
This is a standard ggml block-quant type with FULL software dequant + matmul
coverage across every backend, NOT a Blackwell-only format. Verified against the
paged backend's pinned ggml source (pin 0a2677c6, same upstream as stock
llama-cpp):

- CPU (any arch, amd64 + arm64): full support, no special hardware.
  - `ggml/src/ggml-cpu/quants.c`: `quantize_row_nvfp4` (from_float) +
    `ggml_vec_dot_nvfp4_q8_0_generic` (the matmul dot product), dequant via the
    `kvalues_mxfp4` lookup table. Registered in the CPU type-traits table
    (`ggml-cpu.c` line 283: `[GGML_TYPE_NVFP4] = { .from_float=..., .vec_dot=... }`).
  - NVFP4 handled in all the CPU op switches (`ops.cpp` lines 674, 1125, 1255,
    4424, 4701, 4925, 5651). LOADS + RUNS correctly on a pure-CPU host, just slow.
- CUDA, NON-Blackwell (Pascal/Volta/Turing/Ampere sm_80-86 / Ada sm_89 /
  Hopper sm_90): RUNS correctly via the integer-quantized matmul paths, no
  FP4-MMA needed.
  - `convert.cu` registers `dequantize_row_nvfp4_cuda` as both the to_float and
    to_fp16 dequant kernel (lines 759, 814) - the generic dequant->GEMM path.
  - `mmvq.cu`: `vec_dot_nvfp4_q8_1` (DP4A integer dot, works on any GPU with
    dp4a, i.e. Pascal sm_61+). This is the decode (gemv) path.
  - `mmq.cuh`: NVFP4 has a `MMQ_DP4A_TXS_Q8_0_16` DP4A tile AND a separate
    `MMQ_MMA_TILE_X_K_NVFP4` tile explicitly commented "NVFP4 Generic" (line
    222), DISTINCT from `MMQ_MMA_TILE_X_K_FP4` "MXFP4 and NVFP4 Blackwell" (line
    221). So there are three tiers: DP4A (oldest), generic-MMA (Turing+), and
    Blackwell-native FP4-MMA.
  - The Blackwell path is a runtime FLAG, not a requirement:
    `mmq.cu` line 125 `const bool use_native_fp4 = blackwell_mma_available(cc) && (... NVFP4)`.
    When false (non-Blackwell), it falls through to the generic
    quantized kernel. Grep for any abort/unsupported on NVFP4+blackwell = NONE.
    No `GGML_ABORT`, no garbage - just the non-MMA kernel.
- Vulkan: has `dequant_nvfp4.comp` + NVFP4 in `ggml-vulkan.cpp` / dequant_funcs -
  LOADS + RUNS on Vulkan hosts (AMD/Intel/NVIDIA) via dequant.
- Metal: NVFP4 referenced only in `ggml-metal-device.m` (type registration /
  size), NO Metal NVFP4 compute kernel. On Apple Silicon NVFP4 tensors would
  fall back to the CPU backend op-by-op (correct but slow) IF a Metal build
  existed - which for THIS backend it does not (see build-targeting Section 3).

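The three CUDA tiers can be summarised as a lookup on compute capability. This is a toy model built only from the thresholds stated above (DP4A from sm_61, generic MMA from Turing sm_75, native FP4-MMA on the Blackwell parts named in the text); `nvfp4_tier` is a hypothetical name, the final dequant-only branch for older GPUs is an assumption, and the real dispatch in ggml-cuda considers more than the capability number.

```python
# Toy model of the three NVFP4 matmul tiers described above, keyed by CUDA
# compute capability (e.g. 89 for sm_89). The thresholds follow the text
# (DP4A from sm_61, generic MMA from Turing, native FP4-MMA on Blackwell);
# `nvfp4_tier` is a hypothetical name, not the ggml dispatch code.

BLACKWELL_CCS = {100, 120, 121}  # sm_100 datacenter, sm_120/121 consumer
TURING_CC = 75
DP4A_MIN_CC = 61


def nvfp4_tier(cc: int) -> str:
    """Return the fastest NVFP4 matmul tier available at compute capability cc."""
    if cc in BLACKWELL_CCS:
        return "native-fp4-mma"
    if cc >= TURING_CC:
        return "generic-mma"
    if cc >= DP4A_MIN_CC:
        return "dp4a"
    return "dequant-gemm"  # assumed: generic dequantize-then-GEMM path only


for cc in (61, 75, 89, 90, 100, 121):
    print(f"sm_{cc}: {nvfp4_tier(cc)}")
```

Every branch returns a working path; the tiers differ in speed, not in whether the model loads.
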
Bottom line: the NVFP4 GGUFs are PORTABLE. A Hopper/Ada/Ampere/CPU/Vulkan host
loads and runs them correctly (bit-faithful dequant), just WITHOUT the FP4-MMA
speedup. FP4-MMA is a Blackwell-only performance tier layered on top of a
fully-general software path, NOT a load/run gate. Off-Blackwell = runs-via-dequant,
correct-but-slow; never fail/garbage.

### 2. Gallery hardware-targeting GAP: nothing stops a non-Blackwell user

The 6 -paged entries declare NO machine-readable hardware targeting. The only
Blackwell signal is free prose in `description:` ("native Blackwell NVFP4
(FP4-MMA)", "Benchmarked on GB10 / DGX Spark") and a `nvfp4` string in `tags:`.

How LocalAI's gallery CAN express hardware gating (what exists):
- `tags:` are FREE-TEXT, search-only. `core/gallery/gallery.go` line 89 just does
  `strings.Contains(lower(join(tags)), term)` for the search box + line 128
  collects them for filter chips. There is NO tag that gates install or warns;
  the `nvfp4` tag is purely discoverability.
- The model `ModelConfig` struct (`core/gallery/models.go`) has only
  Description/Icon/License/URLs/Name/ConfigFile/Files/PromptTemplates. There is
  NO capabilities / requirements / hardware field at the MODEL level. (Signing
  `verification:` is the only structured gate, unrelated to hardware.)
- The `capabilities:` map (default/nvidia/intel/amd/metal/vulkan/...) is a
  BACKEND-level concept in `backend/index.yaml` (paged entry lines 100-111). It
  selects the backend IMAGE by detected accelerator FAMILY (nvidia vs amd vs
  metal vs cpu). Crucially it does NOT and CANNOT distinguish Blackwell sm_120/121
  from older NVIDIA - `nvidia: cuda12-llama-cpp-localai-paged` is served to ANY
  NVIDIA GPU. There is no sub-nvidia (microarch) gating mechanism in the gallery
  or the backend capability resolver.

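The search-only nature of tags is visible in a small model of the substring match. This is a sketch, not the Go code in `gallery.go`: the entry names are made up, and joining the tags with a space is an assumption about the separator.

```python
# Model of the free-text tag search described above: lower-case the joined
# tags and test for a substring. The entries are made-up examples; the point
# is that tags only affect discoverability and never gate an install.


def matches(tags: list[str], term: str) -> bool:
    """True if `term` occurs anywhere in the lower-cased, joined tag string."""
    return term.lower() in " ".join(tags).lower()


entries = {
    "example-a-paged": ["llm", "gguf", "nvfp4"],
    "example-b-paged": ["llm", "gguf"],
}

found = [name for name, tags in entries.items() if matches(tags, "nvfp4")]
print(found)
```

An entry without the tag is simply absent from the search result; nothing in this path can block or warn on install.
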
So the gating gap is real: a non-Blackwell user browsing the gallery is offered
the NVFP4 entries with no machine-readable signal that they will run far below
the advertised "90-117% of vLLM" numbers (those numbers are GB10/LPDDR5x-bound
specific). It will install and run correctly, just slowly, and the bench claims
in the description will not hold.

### 3. How to express Blackwell-targeting (recommendation)

Given there is no microarch-gating primitive, the honest options are, in order:

a. DESCRIPTION + TAG (only thing available today, zero code): the entries already
   say "native Blackwell NVFP4 (FP4-MMA)" - tighten it to a leading one-line
   "Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other
   NVIDIA/CPU via NVFP4 dequant but WITHOUT the FP4-MMA speedup and below the
   quoted GB10 throughput." Add a `blackwell` tag alongside `nvfp4` for the
   filter chip. This is the existing convention (other entries use free prose +
   `nvidia` tag, e.g. line 2395; quant trade-offs are described in prose, e.g.
   the Gemma "Mobile-optimized" notes lines 1312/1366). No other gallery entry
   today encodes a GPU-microarch requirement, so prose is the de-facto standard.
b. If a structured signal is wanted, it would need a NEW field (e.g. a
   `recommended_hardware` / `requires` note surfaced by the React UI import
   dialog) - that is a feature, not a config tweak, and does not exist yet.
c. The `nvfp4` tag should at minimum be present on ALL six entries - the four
   Qwopus/Qwen-MTP entries at lines 819/854/890 have only `[llm, gguf]` tags and
   omit `nvfp4`, so they are not even discoverable/filterable as NVFP4, despite
   being NVFP4 GGUFs. Inconsistent tagging is a secondary gap.

Verdict (gallery-targeting): NVFP4 GGUFs are safe to ship broadly (they run
everywhere via dequant, never fail), so the risk is PERFORMANCE-EXPECTATION, not
correctness. LocalAI has no microarch gating primitive; the only lever is the
description + tags. Recommend a one-line Blackwell-recommended hardware note +
consistent `nvfp4`/`blackwell` tags on all six, and tempering the GB10 bench
claims with the "runs slower off-Blackwell" caveat.

## Section: optimization-generality (patches 0013/0016 + 0017-0029)

Classifies each optimization as arch-GENERAL (ship everywhere, helps any arch),
GB10-TUNED (needs per-arch retuning of the magnitude/constants), or
Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists). Read from the
patch commit bodies + the diffs they touch; bit-exactness verdicts are the
patches' own md5/test-backend-ops gates.

Arch axis used: NVFP4 FP4-MMA needs `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer
+ sm_100 datacenter); Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 have none;
Metal/CPU/AMD/Intel have no NVFP4-MMA. Datacenter Blackwell sm_100 has FP4-MMA but
HBM3e (~8 TB/s) so it is COMPUTE-bound, not bandwidth-bound: every GB10
"bandwidth-bound" verdict inverts there. The FP4-MMA kernel itself is UPSTREAM
ggml-cuda gated by `BLACKWELL_MMA_AVAILABLE`; none of these patches add it - they
reshape/route/dedup around it (0017/0020/0023/0025) or are precision-agnostic.

### A. ARCH-GENERAL (ship everywhere; pure win or provably neutral)

Graph-shape, host-side, or gather/copy-elimination changes. No FP4, no
bandwidth-floor assumption. Bit-exact. Help or are neutral on any arch that runs
the code path.

- 0013 decoupled prefill-token budget - pure `update_slots()` scheduler policy,
  zero libllama/ggml change, orthogonal to LLAMA_KV_PAGED, default-off
  byte-identical. Latency/fairness lever (flattens decode-ITL spike from a
  co-batched long prefill). No arch assumption.
- 0016 dynamic decode-first prefill budget - supersedes 0013; still pure
  `update_slots()` policy, default-off byte-identical, T==n_batch degenerate case
  == stock. Arch-neutral, identical paged on/off.
- 0024 paged-pool burst-reclaim - host-side block accounting + defrag + slot
  release; never touches KV values or compute. Gated behind LLAMA_KV_PAGED. Fixes
  a real fragmentation/throughput-collapse bug on long-lived servers.
  Arch-independent host bookkeeping.
- 0029 block-table within-step host cache - memcpy-reuse of the host block table
  across full-attention layers in one step; bit-exact, LLAMA_PAGED_NO_BT_CACHE=1
  off. Helps host-bound decode (dense +2.7% on GB10), neutral when compute-bound
  (MoE flat). The faster the GPU (e.g. sm_100), the MORE host-bound decode is, so
  the BIGGER this win elsewhere.
- 0028 recurrent-state (conv-tap) gather fusion - eliminates a k_get_rows by
  reading cache[ids[s]] in-kernel; bit-identical. Deleting a gather vLLM has no
  equivalent of is a win on ANY arch running the GDN path; not FP4, not
  bandwidth-floor specific.
- 0018 in-place SSM-state write-back + 0019 fused SSM-state gather + 0021
  conv-state in-place fusion - remove a D2D state copy-back (0018), a state
  get_rows (0019), and the 4-op conv chain + ring-state copy (0021), mirroring
  vLLM's in-place recurrent update. Arithmetic byte-identical; what is removed is
  plumbing, so wins on ANY arch running the gated-DeltaNet recurrence.
- Paged KV core (0001-0012) - paged KV manager, on-demand alloc, prefix caching,
  in-kernel paged read. No precision or bandwidth-floor assumption; the most
  portable part of the work, helps capacity/serving anywhere it compiles.

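The 0029 idea, building the host block table once per decode step and reusing it for the remaining full-attention layers, can be sketched as a step-keyed cache. Everything here is hypothetical: the class, its method names, and the use of a step id as the invalidation key are illustration only and say nothing about how the patch actually detects a new step.

```python
# Sketch of a within-step host cache: the block table is built once for the
# first full-attention layer of a decode step and reused (copied) for the
# remaining layers of the same step. Class and method names are hypothetical.


class BlockTableCache:
    def __init__(self, build_fn):
        self._build_fn = build_fn      # expensive host-side table builder
        self._step = None              # step id the cached table belongs to
        self._table = None
        self.builds = 0                # how many times the builder ran

    def get(self, step: int) -> list[int]:
        """Return the block table for `step`, rebuilding only on a new step."""
        if step != self._step:         # new decode step -> rebuild once
            self._table = self._build_fn(step)
            self._step = step
            self.builds += 1
        return list(self._table)       # copy, standing in for the memcpy reuse


cache = BlockTableCache(lambda step: [step, step + 1])
tables = [cache.get(7) for _ in range(16)]   # e.g. 16 full-attention layers
print("builds for one step:", cache.builds)
```

Each layer still receives its own copy, so the values seen by compute are unchanged; only the repeated host-side construction is skipped.
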
NOTE: 0018/0019/0021/0028 + the base GDN op have CUDA + CPU kernels ONLY (every
gate is "CUDA0 vs CPU"). General within {CUDA, HIP/ROCm (hipified ggml-cuda), CPU};
NOT covered on Metal/SYCL/Vulkan - see SAFETY #1.

### B. GENERAL-IN-DIRECTION but the MAGNITUDE was measured on the GB10 floor

Correct + beneficial everywhere, but the specific %/constants are GB10-bound.

- 0020 o_proj GDN MMVQ->MMQ reshape - collapses the GDN output to 2D so the
  ssm_out matmul sees src1->ne[1]=128 and routes to MMQ (M=128 GEMM that amortizes
  the weight read across 128 tokens) instead of MMVQ (built for batch<=8 with the
  128 sequences stuck in ne[2]). Zero-cost view change, bit-identical, gated to the
  gated-DeltaNet path. UNIVERSAL: MMVQ (mul_mat_vec_q) is structurally a batch<=8
  GEMV and cannot amortize the weight read at a real M=128; MMQ does, on ALL CUDA
  archs (dp4a pre-tensor-core still amortizes) and on HIP. So MMQ > MMVQ at M=128
  is NOT GB10-specific - pure win wherever MMQ exists. RE-TUNE: the +31.7%
  magnitude is on the GB10 BW floor; smaller % on sm_100 but still correct.
  REGRESSION RISK: only at a genuinely tiny real M (single-stream decode n_seqs<=8)
  could forcing MMQ be slower than MMVQ - see SAFETY #2. On Metal/Vulkan/SYCL the
  MMVQ/MMQ split differs; the reshape is harmless (a view) but yields no benefit.
- 0023 MoE NVFP4 activation-quantize de-dup - for broadcast up/gate proj (ne11==1)
  quantize the unique token activations once and gather the identical FP4 blocks
  instead of re-quantizing per expert; bit-exact, ..._DEDUP=0 off.
  DIRECTION-GENERAL (de-duplicating identical work is always good) but
  NVFP4-block-layout specific (uint4 copy of block_fp4_mmq) and only matters where
  activation-quant is a measurable decode bucket - on a compute-bound arch the
  saved quant time may be off the critical path (even on GB10 the MoE TG win is
  only +1.7%).

### C. GB10-TUNED (constants are GB10 winners; re-sweep per arch)

- 0022 GDN recurrence occupancy/coalescing retune - column-folding template params
  NUM_WARPS/COLS_PER_WARP, default (16,8), env-selectable GDN_NW/GDN_CPW. The
  reduction/FMA order is byte-identical (md5-gateable); only the warp/block->column
  assignment changes to raise memory-level parallelism on a BANDWIDTH-BOUND kernel.
  (16,8) is explicitly "the measured GB10 winner". Textbook per-arch tune: optimal
  values depend on DRAM latency / L2 / SM count / occupancy. SAFE everywhere
  (bit-exact, env-overridable, no forbidden float4 load) but unlikely optimal off
  GB10; on a compute-bound arch (sm_100) the kernel may not even be the
  bottleneck. Needs a per-arch GDN_NW/CPW sweep.
- 0017 FP4 dense-GEMM decode tile tune - shipped as a P0 bit-exact gate +
  DEFAULT-OFF occupancy levers (GGML_CUDA_FP4_MMQ_Y / ..._MINBLOCKS / ..._DENSE_MMQ_X).
  Honest GB10 verdict was a KILL-GATE: every cheap occupancy probe REGRESSED on
  sm_121 (the M=128 tile is already weight-read optimal). Nothing on by default =>
  byte-identical to stock everywhere. On a DIFFERENT FP4 arch (sm_100) the
  kill-gate could flip; the levers are in place and inert, ready to re-sweep.

### D. Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists)

- 0025 MoE NVFP4 MoE-decode re-graph - keeps CUDA graphs on for the grouped
  stream-k mul_mat_q id-path; env-gated LLAMA_MOE_FORCE_GRAPHS, default-off
  byte-identical. The CUDA-graph mechanism is general, but the specific guard
  condition (mmvq_mmid_max==8 for NVFP4 on sm_121) and the "graphs are safe here"
  reasoning are tied to the NVFP4 grouped path on Blackwell. On a non-FP4 arch the
  node would not take that branch -> inert.
- 0026 hybrid per-head SSM-state precision (bf16 SSM/conv cache) - adds
  --cache-type-ssm/-conv + --ssm-bf16-tau (per-head f32-vs-bf16 by memory length).
  Default f32 = bit-exact. PRECISION/bandwidth lever: bf16 halves the dominant GDN
  decode byte stream, which only pays off on a BANDWIDTH-bound arch (GB10). On
  sm_100 HBM3e it buys little. Value is bandwidth-floor specific; correctness is
  precision-specific (opt-in, default-safe).
- NVFP4 GGUFs + the 6 gallery -paged rows - inherently Blackwell-precision-specific
  for the FAST path: NVFP4 weights only get FP4-MMA on sm_120/121/100. Elsewhere
  they run-via-dequant (correct, slow) per the gallery-targeting section above.

### Per-arch expected story

- Consumer Blackwell sm_120/121 (GB10 / dGPU): the validated target. dGPU sm_120
  (GDDR7 ~1 TB/s) is less BW-starved than GB10 LPDDR5x 273 GB/s, so the
  bandwidth-floor wins (0018/0019/0022/0026) shrink in % while the host-pipeline +
  graph wins (0029/0025) and the MMQ reshape (0020) hold.
- Datacenter Blackwell sm_100 (HBM3e ~8 TB/s): FP4-MMA WORKS so NVFP4 stays fast
  (precision bucket + 0025 carry over), BUT the BW floor is GONE -> compute-bound.
  Re-tune: 0022 GDN_NW/CPW sweep; 0017 kill-gate may flip (levers ready). The
  bandwidth-motivated wins (0018/0019, 0026 bf16-state) shrink toward neutral; the
  host-pipeline/graph/MMQ-reshape general wins (0029/0025/0020) still help. Net:
  works, faster GPU, needs a re-tune pass, do NOT assume the GB10 constants.
- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86: NO FP4-MMA. NVFP4 GGUFs + the FP4
  levers (0017/0023/0025) are out of scope -> use a DIFFERENT quant (Q4_K/AWQ/GPTQ
  etc). BUT the precision-agnostic infra still helps: paged KV core, scheduler
  (0013/0016), burst-reclaim (0024), block-table cache (0029), SSM/conv
  plumbing-removal (0018/0019/0021/0028); 0020 still routes the o_proj
  MMVQ->MMQ in whatever quant it uses. Story: "no FP4 -> another quant, but paged +
  SSM + scheduler infra is a pure win".
- Metal / CPU / AMD-ROCm / Intel-SYCL / Vulkan (all built by the matrix): no
  NVFP4-MMA; paged KV + scheduler infra is the portable value. CPU has reference
  kernels for every fused op (the bit-exact gate is CUDA0-vs-CPU). ROCm/HIP reuses
  ggml-cuda (hipify) so it inherits the fused-op kernels. Metal/SYCL/Vulkan do NOT
  get the new fused-op kernels (SAFETY #1).

### SAFETY / regression risks

1. Fused GDN/conv ops are CUDA + CPU only; emission is NOT backend-gated.
   0018/0019/0021/0028 add new ops (ggml_gated_delta_net_inplace[_ids],
   ggml_ssm_conv_update_inplace[_ids]) implemented for CUDA + CPU only. They are
   emitted by the graph builder whenever cparams.fused_gdn_ar/fused_gdn_ch is set
   (constructor sets fused_gdn_ch=true, auto_fgdn=true), with NO check that the
   active compute backend is CUDA/HIP. Fine on CUDA/HIP/CPU. On Metal/SYCL/Vulkan
   two failure modes: (a) the new GATED_DELTA_NET op has no kernel -> loud
   supports_op/assert (but the base GDN op may already be CUDA/CPU-only upstream,
   so a qwen35 model likely cannot run there regardless); (b) the fused conv
   variant REUSES GGML_OP_SSM_CONV discriminated by a non-null src[3]/src[4] - a
   backend that supports plain SSM_CONV but ignores the discriminator would compute
   the WRONG plain conv -> SILENT corruption. That is the one genuine
   silent-correctness risk. RECOMMENDATION: gate fused-op emission on the compute
   backend being CUDA/HIP (or add a supports_op guard rejecting the discriminated
   SSM_CONV where the fused handling is absent).
2. 0020 MMVQ->MMQ at tiny real M. MMQ is right at decode M=128 (the gallery
   batched-serving regime); it would be wrong only at a genuine M<=8 (single-stream
   decode, n_seqs=1). Bit-identical either way - only a potential perf regression
   at tiny batch on non-GB10 archs (never the measured GB10 case). Worth confirming
   the reshape still picks the better kernel at n_seqs=1 elsewhere.
3. 0022 default (16,8) off GB10: safe (bit-exact, env-overridable) but non-optimal;
   do not ship it as the default for sm_100/Hopper/Ada without a GDN_NW/CPW sweep.
   No correctness risk.
4. Gallery rows do not state a GPU-arch requirement (covered in the
   gallery-targeting section): add a Blackwell (sm_120/121/100) recommended note.

### One-line verdict

The PORTABLE core (paged KV 0001-0012, scheduler 0013/0016, burst-reclaim 0024,
block-table cache 0029, the SSM/conv plumbing-removal 0018/0019/0021/0028, the
o_proj MMQ reshape 0020) is arch-general and ships everywhere it compiles -
bit-exact, mostly default-safe, pure win or neutral. The FP4/NVFP4 levers
(0017/0023/0025, the GGUFs, the gallery) are Blackwell-precision-specific. The
occupancy/precision tunes (0017 levers, 0022, 0026) are GB10-bandwidth-floor-tuned
and need a per-arch re-sweep (especially on sm_100 where the BW floor is gone and
the regime flips to compute-bound). The single real SAFETY gap: the new fused
GDN/conv ops are CUDA+CPU-only with backend-ungated emission, so a Vulkan/SYCL/Metal
paged build of a gated-DeltaNet model could assert (GDN op) or silently miscompute
(discriminated SSM_CONV) - it should be compute-backend-gated.

## Section: patch-arch-safety (build-break / miscompile classification, 0018-0029)

This section is the narrow safety read: for EACH patch, does it (a) compile and
behave correctly on every build target, (b) compile only under
BLACKWELL_MMA_AVAILABLE with a fallback elsewhere, or (c) RISK a build break /
miscompile / crash on a non-Blackwell arch. Class letters here are
build-safety classes, distinct from the perf-generality buckets above. Note 0027
does not exist (numbering gap). The dispositive build facts: the backend is built
for CUDA 12/13, L4T arm64, ROCm/hipblas, SYCL f32/f16, CPU (amd64+arm64), Vulkan -
and NOT for darwin/Metal (no includeDarwin row), and the CUDA build emits the full
multi-arch fan (CUDA_DOCKER_ARCH unset; Dockerfile documents e.g. `75;86;89;120`),
so every .cu TU MUST already compile for non-Blackwell SASS.

Method: grepped every added line in 0017-0029 for arch-specific tokens
(BLACKWELL/__CUDA_ARCH__/sm_NNN/cp.async/ldmatrix/mma./asm volatile/cc gates).
The ONLY hits are in 0017 (all correctly `#if`-gated) and free-text comments. No
SSM/conv/GDN kernel in the series uses a Blackwell intrinsic or a hardcoded
sm_12x launch geometry.

| patch | class | build-safety note |
|-------|-------|-------------------|
| 0017 fp4-gemm-decode-tile-tune | (b) GATED | only Blackwell-specific patch; NVFP4 mmq_y/min-blocks levers behind `#if defined(BLACKWELL_MMA_AVAILABLE)` + `blackwell_mma_available(cc)` + `type==GGML_TYPE_NVFP4`, ALL default-off => default build byte-identical to stock on every arch. `get_mmq_y_device<type>()` templating has a default arg keeping stock behaviour for non-NVFP4. Builds on sm_80-90 (body stripped). |
| 0018 ssm-decode-inplace-state | (a) general | plain in-place GDN state write-back, no intrinsics; CPU mirror in ggml-cpu/ops.cpp. |
| 0019 ssm-decode-fused-gather | (a) general | `gdn_gather_nonident_kernel` = plain `<<<n_seqs,256>>>`; CPU mirror added. |
| 0020 gdn-oproj-mmq-reshape | (a) general | host-side reshape_2d in qwen35*/qwen3next.cpp, no device code. |
| 0021 conv-state-inplace-fusion | (a) general | new op reuses GGML_OP_SSM_CONV (4th src discriminator), no new enum => no ggml-cpu.c switch needed; `ssm_conv_update_f32` plain portable CUDA (threads=128, templated d_conv); CPU mirror + test case. |
| 0022 gdn-recurrence-occupancy-retune | (a) general | template NUM_WARPS/COLS_PER_WARP/MIN_BLOCKS; new default (16,8) = 512 thr/block, MIN_BLOCKS=2, within the 1024 limit on sm_70..120 and AMD; bit-exact for any (NW,CPW). NOT Blackwell-gated and NOT a break - just a GB10-tuned default applied everywhere (see risk 3 below). |
| 0023 moe-nvfp4-quant-dedup | (a) general | `gather_mmq_fp4` = plain uint4 byte-copy kernel; reached ONLY inside the pre-existing `if (use_native_fp4)` branch (Blackwell-only at runtime) and uses `block_fp4_mmq`, a type that already compiles for the full arch fan pre-0023. Adds no new arch surface. |
| 0024 paged-pool-burst-reclaim | (a) general | pure host C++. |
| 0025 moe-nvfp4-decode-regraph | (a) general | host-side ggml-cuda.cu graph-guard relaxation, env-gated `LLAMA_MOE_FORCE_GRAPHS` default-off => byte-identical; predicate is runtime cc-generic. |
| 0026 hybrid-perhead-ssm-state | (a) general | mostly host plumbing; GDN kernel = same portable column-folded code; fill.cu instantiates `fill_kernel<nv_bfloat16>` (bf16 STORAGE-only, fine on all targeted arches; bf16-compute SSM plan is SHELVED so STATE_T stays f32 on the hot path). LOW-RISK verify item: confirm no bf16-arithmetic GDN instantiation reaches sm_75 if sm_75 ships. |
| 0028 recurrent-state-gather-fusion | (a) general | new op reuses GGML_OP_SSM_CONV (ids src + rs_head); `ssm_conv_gather_nonident_kernel` plain portable CUDA; CPU mirror + test cases. |
| 0029 blocktable-within-step-cache | (a) general | pure host C++ + host-timing instrumentation. |

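Sketch (illustrative, not the patch code): the host-side shape of the 0017 gate from the table row above, reduced to a standalone function. `pick_mmq_y`, `QuantType`, `blackwell_mma_available_sketch` and the plain sm-number encoding of `cc` are hypothetical stand-ins, and 128 as the stock tile height is an assumption inferred from the `!= 128` check quoted in these notes. The device-side `#if defined(BLACKWELL_MMA_AVAILABLE)` layer is left out because host-only code cannot exercise it.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the real ggml-cuda types.
enum class QuantType { Q4_K, NVFP4 };

// Assumed stock tile height (the lever's "off" value).
constexpr int kStockMmqY = 128;

// Assumed runtime capability check. In this sketch cc is simply the sm number
// (100, 120, 121); the real encoding in ggml-cuda may differ.
inline bool blackwell_mma_available_sketch(int cc) {
    return cc == 100 || cc == 120 || cc == 121;
}

// Returns the tile height to use. The tuned value is honoured only when all
// three conditions hold: the lever is set away from stock, the tensor type is
// NVFP4, and the device has Blackwell MMA. In every other case the stock value
// comes back unchanged, which is why a default build behaves like stock.
inline int pick_mmq_y(QuantType type, int cc, int tuned_mmq_y = kStockMmqY) {
    assert(tuned_mmq_y > 0);
    if (tuned_mmq_y != kStockMmqY && type == QuantType::NVFP4 &&
        blackwell_mma_available_sketch(cc)) {
        return tuned_mmq_y;
    }
    return kStockMmqY;
}
```

With the lever left at its default, every (type, cc) pair returns 128; setting it to another value changes the result only for NVFP4 on sm_100/120/121.
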
### Specific lines that carry the only conditional/risk surface

- 0017 the ONLY correctly-gated arch surface:
  - `get_mmq_y_host`: `if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc))`
  - `get_mmq_y_device<type>()` / `mmq_get_min_blocks_device<type>()`: bodies inside `#if defined(BLACKWELL_MMA_AVAILABLE)`.

  All default to the stock value, so a default build is byte-identical everywhere.
- 0023 the gather kernel default-on (GGML_CUDA_MOE_QUANT_DEDUP=1) but the call site
  is `if (moe_quant_dedup && ne11 == 1)` strictly inside `if (use_native_fp4)`; on
  non-Blackwell `use_native_fp4` is false so the dedup never executes.
- 0022 the GB10-tuned launch geometry is `GDN_DEFAULT_NW 16` / `GDN_DEFAULT_CPW 8`
  (=> 512 threads, MIN_BLOCKS=2). This is the closest thing to a "hardcoded for
  GB10" launch config, but it is a correct, within-limits, bit-exact default for
  ANY arch, runtime-overridable via GDN_NW/GDN_CPW. Not a break.

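Sketch (illustrative, not the patch code): the arithmetic behind the 0022 "512 threads, within limits" claim, plus one way an env override could be range-checked. It assumes threads per block = NUM_WARPS x 32, which matches the documented 512 for NW=16. `GdnGeometry`, `env_int_or` and `resolve_geometry` are hypothetical names; only the `GDN_NW` / `GDN_CPW` variable names and the (16,8) default come from the notes. How the real patch maps a runtime override onto templated kernel instantiations is not modelled here.

```cpp
#include <cassert>
#include <cstdlib>

struct GdnGeometry {
    int num_warps;      // NW
    int cols_per_warp;  // CPW
};

constexpr int kWarpSize = 32;             // CUDA warp width (AMD wavefront sizes are not modelled)
constexpr int kMaxThreadsPerBlock = 1024; // the block limit cited in the notes

constexpr int threads_per_block(GdnGeometry g) { return g.num_warps * kWarpSize; }

constexpr bool within_block_limit(GdnGeometry g) {
    return g.num_warps > 0 && g.cols_per_warp > 0 &&
           threads_per_block(g) <= kMaxThreadsPerBlock;
}

// Read a positive integer from the environment, or fall back to the default.
inline int env_int_or(const char* name, int fallback) {
    const char* s = std::getenv(name);
    if (s == nullptr) return fallback;
    const int v = std::atoi(s);
    return v > 0 ? v : fallback;
}

// Resolve the launch geometry; an out-of-range override falls back to (16,8).
inline GdnGeometry resolve_geometry() {
    const GdnGeometry g{env_int_or("GDN_NW", 16), env_int_or("GDN_CPW", 8)};
    return within_block_limit(g) ? g : GdnGeometry{16, 8};
}

static_assert(threads_per_block({16, 8}) == 512, "documented default is 512 threads");
```

A per-arch sweep would then only need to vary the two environment variables and re-run the bit-exact gate, since the notes state the result is bit-exact for any (NW,CPW).
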
### THE ONE silent-correctness risk (cross-ref SAFETY #1 above)

0021/0028 (and 0018/0019 for the GDN op) implement their new ops for CUDA + CPU
ONLY, and the fused conv variants REUSE GGML_OP_SSM_CONV discriminated by a
non-null src[3]/src[4]. Emission is NOT gated on the active compute backend. A
backend that supports plain SSM_CONV but ignores the discriminator would run the
WRONG plain conv => SILENT corruption (not a build break). In practice the model
that emits these (qwen35 hybrid) also needs the fork-custom GDN op, which is
CUDA/CPU-only, so on Vulkan/SYCL the GDN node asserts/falls back FIRST and the
model cannot run there regardless; and Metal is not a build target. So the risk is
latent rather than live, but it should still be closed by gating fused-op emission
on a CUDA/HIP compute backend (or a supports_op guard rejecting the discriminated
SSM_CONV where fused handling is absent). This is the single item that could ever
miscompute silently; everything else is either build-safe or loud.

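Sketch (illustrative only): the shape of the supports_op guard recommended above, using a toy node type instead of the real ggml structures. `Node`, `Op`, `is_fused_ssm_conv` and `supports_op_sketch` are hypothetical; the only facts taken from the notes are that the fused variant reuses the SSM_CONV op and is marked by a non-null `src[3]`/`src[4]`.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-ins; not ggml_tensor / ggml_op.
enum class Op { SSM_CONV, OTHER };

struct Node {
    Op op;
    const Node* src[5];  // src[3] or src[4] non-null marks the fused variant
};

inline bool is_fused_ssm_conv(const Node& n) {
    return n.op == Op::SSM_CONV && (n.src[3] != nullptr || n.src[4] != nullptr);
}

// A backend without fused handling refuses the discriminated node, so the
// caller can fall back or fail loudly instead of running the plain conv on
// inputs that were built for the fused one.
inline bool supports_op_sketch(const Node& n, bool backend_has_fused_conv) {
    if (is_fused_ssm_conv(n)) {
        return backend_has_fused_conv;
    }
    return true;  // plain SSM_CONV and all other ops: unchanged in this sketch
}
```

The point of the guard is to convert the silent failure mode into a loud one: a backend that only knows plain SSM_CONV reports the fused node as unsupported rather than computing it incorrectly.
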
### Build-safety verdict per target (would it COMPILE / RUN)

- CUDA sm_80 / 86 / 89 / 90 (Ampere/Ada/Hopper): BUILDS (0017 Blackwell code
  `#if`-stripped + default-off; all other device code portable CUDA). qwen35 hybrid
  models RUN (GDN + ssm_conv_update + gather have non-Blackwell kernels). NVFP4
  GGUFs run via the stock non-FP4-MMA dequant/DP4A path; the FP4 levers are inert,
  not broken. No patch in 0018-0029 breaks this build.
- CUDA sm_100 (datacenter Blackwell, HBM3e): BUILDS + every lever active
  (BLACKWELL_MMA_AVAILABLE defined). Bit-exact. GB10-tuned launch defaults are
  correct but tuned for the LPDDR5x BW floor; on HBM3e the regime is compute-bound,
  so safe-but-not-necessarily-optimal (re-sweep 0022/0017 levers). No build/correctness risk.
- Metal: NOT a build target (no darwin row), so missing Metal kernels for the new
  SSM_CONV/GDN ops cannot break a build or a run here. (The GDN op has no Metal
  kernel regardless.)
- CPU (amd64 + arm64): BUILDS + RUNS - every new op ships a CPU mirror under the
  reused enums; host patches are portable C++.
- ROCm/HIP, Intel SYCL, Vulkan: BUILD ok. The .cu additions hipify cleanly (no
  Blackwell intrinsic outside the `#if`; 0022's 512-thread launch within AMD limits).
  SYCL/Vulkan are separate backends that don't compile the .cu files and lack the
  GDN op, so qwen35 hybrid models fall back/assert there rather than run; classic
  (non-qwen35) models are unaffected because SSM_CONV semantics only change when the
  qwen35 graph emits the discriminator src. The latent silent-SSM_CONV risk above
  applies only if a backend both supports SSM_CONV and ignores the discriminator.

Verdict: of 0018-0029, none would break a non-Blackwell CUDA build, the CPU build,
or the ROCm/SYCL/Vulkan builds; 0017 is the only Blackwell-gated patch and is
default-off and `#if`-guarded. The sole non-build hazard is the latent
discriminated-SSM_CONV silent-miscompute on a hypothetical Vulkan/SYCL/Metal GDN
run, which should be closed by compute-backend-gating the fused-op emission.

## Section: CROSS-ARCH SYNTHESIS (final verdict)

Consolidates the four audit sections above into a single ship decision. The arch
axis: NVFP4 FP4-MMA requires `BLACKWELL_MMA_AVAILABLE` = sm_120/121 (consumer
Blackwell, GB10/RTX-50) + sm_100 (datacenter Blackwell). sm_90 Hopper / sm_89 Ada
/ sm_80-86 Ampere = NO FP4-MMA. Metal/CPU/AMD/Intel = no NVFP4-MMA. GB10's wins
are dominated by the LPDDR5x ~273 GB/s bandwidth floor; sm_100 has FP4-MMA but
HBM3e ~8 TB/s so it is COMPUTE-bound and every "bandwidth-bound" GB10 verdict
inverts there.

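Sketch (back-of-envelope, not a measurement): why a byte-halving lever such as the bf16 state matters on GB10 and barely registers on sm_100. It models only the memory floor, time >= bytes / bandwidth, and ignores compute, latency and caching. The function names are hypothetical; the 273 GB/s and ~8 TB/s figures are the ones quoted above, and any byte count fed in is a made-up illustration, not a measured stream size.

```cpp
#include <cassert>
#include <cmath>

// Lower bound (in milliseconds) on the time to stream `bytes` from DRAM at
// `bw_gb_per_s` GB/s. A roofline-style floor, nothing more.
constexpr double memory_floor_ms(double bytes, double bw_gb_per_s) {
    return bytes / (bw_gb_per_s * 1e9) * 1e3;
}

// Time removed from the floor when a lever halves the byte stream.
constexpr double saved_ms_by_halving(double bytes, double bw_gb_per_s) {
    return memory_floor_ms(bytes, bw_gb_per_s) - memory_floor_ms(bytes / 2.0, bw_gb_per_s);
}

constexpr double kGb10BandwidthGBs  = 273.0;   // LPDDR5x, from the notes
constexpr double kSm100BandwidthGBs = 8000.0;  // HBM3e ~8 TB/s, from the notes
```

For the same byte stream the absolute saving scales with 1/bandwidth, so it is roughly 29x smaller at 8 TB/s than at 273 GB/s (8000 / 273), which is consistent with the "shrinks toward neutral" expectation for the bandwidth-motivated patches.
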
### 1. BUILD SAFETY: does it build + run WITHOUT CRASHING off-Blackwell?

YES on every target it builds for, with ONE latent silent-correctness hazard
(not a crash) to close before claiming non-Blackwell support. The build is NOT
GB10-pinned: there is no explicit CUDA arch list anywhere in the paged path
(`CUDA_DOCKER_ARCH` empty in every matrix row, identical to stock llama-cpp), so
the CUDA TUs compile the full upstream ggml arch fan and the NVFP4 FP4-MMA path
is gated INSIDE the kernel by `BLACKWELL_MMA_AVAILABLE`, never by the matrix.

| target | builds? | runs? | notes |
|--------|---------|-------|-------|
| CUDA sm_80/86/89/90 (Ampere/Ada/Hopper) | YES | YES | 0017 Blackwell code `#if`-stripped + default-off; all other device code portable. qwen35 hybrid models run (GDN + ssm_conv_update + gather have non-Blackwell kernels). NVFP4 GGUFs run via dequant/DP4A; FP4 levers inert, not broken. |
| CUDA sm_100 (datacenter Blackwell, HBM3e) | YES | YES | every lever active + bit-exact; GB10-tuned launch defaults are correct but compute-bound regime => safe-but-suboptimal (re-sweep, do not assume GB10 constants). |
| CPU (amd64 + arm64) | YES | YES | every new op ships a CPU mirror under the reused enums; host patches portable C++. |
| ROCm/HIP, Intel SYCL, Vulkan | YES | partial | .cu hipifies cleanly (no Blackwell intrinsic outside `#if`; 0022's 512-thread launch within AMD limits). SYCL/Vulkan don't compile the .cu and lack the GDN op, so qwen35 hybrid models assert/fall back rather than run; classic non-qwen35 models unaffected. |
| Metal / macOS | NOT BUILT | N/A | no `includeDarwin` row, no `metal:` capability key. Mac selection of this backend falls back to `default`=cpu (a Linux image) and does NOT run; no auto-fallthrough to stock llama-cpp. |

No patch in 0017-0029 breaks a non-Blackwell CUDA build, the CPU build, or the
ROCm/SYCL/Vulkan builds. The only thing that is not merely "suboptimal" is the
fused-conv silent-miscompute hazard (item RISKY-1 below), and even that is latent
because the co-emitted GDN op asserts first on the backends that lack it.

### 2. EVERY patch/opt, four buckets

SAFE-EVERYWHERE (ship as-is; bit-exact or default-off byte-identical; pure win or
neutral on any arch that runs the path):
- 0001-0012 paged KV core (manager, on-demand alloc, prefix caching, in-kernel paged read)
- 0013 / 0016 prefill-token budget scheduler (pure `update_slots()` policy, default-off byte-identical)
- 0018 in-place SSM-state write-back (CUDA+CPU; see RISKY-1 for backend coverage)
- 0019 fused SSM-state gather (CUDA+CPU)
- 0021 conv-state in-place fusion (CUDA+CPU)
- 0028 recurrent-state (conv-tap) gather fusion (CUDA+CPU)
- 0020 o_proj GDN MMVQ->MMQ reshape (zero-cost view, bit-identical; MMQ>MMVQ at M=128 is universal; magnitude GB10-bound, perf-only caveat at tiny real M=1, see RISKY-2)
- 0024 paged-pool burst-reclaim (pure host C++; fixes a real long-server fragmentation collapse)
- 0029 block-table within-step host cache (host memcpy reuse, bit-exact; bigger win the FASTER the GPU, i.e. MORE host-bound decode elsewhere)

BLACKWELL-ONLY, CLEAN FALLBACK (only meaningful where FP4-MMA exists; provably
inert/byte-identical elsewhere, never a break):
- 0017 FP4 dense-GEMM decode tile tune - levers `#if BLACKWELL_MMA_AVAILABLE` + `blackwell_mma_available(cc)` + `type==NVFP4`, ALL default-off => default build byte-identical to stock on every arch
- 0023 MoE NVFP4 activation-quant de-dup - plain uint4 copy kernel reached ONLY inside the pre-existing `if (use_native_fp4)` branch (false off-Blackwell); never executes there
- 0025 MoE NVFP4 decode re-graph - host-side CUDA-graph guard, env-gated `LLAMA_MOE_FORCE_GRAPHS` default-off; the NVFP4-grouped guard predicate is inert on non-FP4
- NVFP4 GGUFs + 6 gallery rows - FAST path is sm_120/121/100 only; elsewhere run-via-dequant (correct, slow), never a load/run gate

GB10-TUNED (works + safe everywhere, but the constants/magnitude are GB10
bandwidth-floor winners; re-sweep per arch, no correctness risk):
- 0022 GDN recurrence occupancy retune - column-fold default (16,8)=512thr/MIN_BLOCKS=2, bit-exact, env-overridable GDN_NW/GDN_CPW; within the 1024-thread limit on sm_70..120 + AMD. Optimal values depend on DRAM latency/L2/SM-count; on a compute-bound arch the kernel may not be the bottleneck.
- 0026 bf16 per-head SSM/conv cache - default f32 bit-exact (opt-in `--cache-type-ssm/-conv`); bf16 only pays off on a bandwidth-bound arch, buys little on sm_100 HBM3e. bf16 is STORAGE-only (fill_kernel<nv_bfloat16>), the bf16-compute SSM plan is shelved so STATE_T stays f32 on the hot path.
- 0017 / 0023 magnitudes (the % wins, not the gating) are also GB10-floor-bound.

RISKY (fix before claiming non-Blackwell ship; neither is a crash, one is silent):
- RISKY-1 (the one real gap) fused GDN/conv ops are CUDA+CPU-only with
  backend-UNGATED, DEFAULT-ON emission. Confirmed: `cparams.fused_gdn_ch = true`
  and `auto_fgdn = true` in the `llama_context` constructor; emission fires on
  `(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar)` with NO compute-backend
  check. The fused conv variant REUSES `GGML_OP_SSM_CONV` discriminated by a
  non-null `src[3]` (verified: CUDA `if (dst->src[3] != nullptr)` branch at the
  top of `ggml_cuda_op_ssm_conv`, CPU mirror in ops.cpp, NO supports_op guard). A
  backend that supports plain SSM_CONV but ignores `src[3]` would compute the
  WRONG plain conv => SILENT corruption. Latent today only because the co-emitted
  fork-custom GDN op is CUDA/CPU-only, so on Vulkan/SYCL the GDN node asserts
  first and the qwen35 hybrid model cannot run there anyway, and Metal is not
  built. FIX: gate fused-op emission on a CUDA/HIP compute backend, OR add a
  supports_op guard that rejects the discriminated SSM_CONV where fused handling
  is absent. This is the single thing that could miscompute silently; close it
  before a Vulkan/SYCL/Metal paged build of a gated-DeltaNet model is ever shipped.
- RISKY-2 (perf-only, not correctness) 0020 forces MMQ; at a genuine single-stream
  decode M<=8 (n_seqs=1) MMQ could be slower than MMVQ off the GB10 batched
  regime. Bit-identical either way. Confirm the reshape still picks the better
  kernel at n_seqs=1 on non-GB10 archs.

### 3. NVFP4-GGUF + gallery targeting recommendation

Do NOT hardware-gate the entries (and you cannot: LocalAI has no microarch-gating
primitive - `tags:` are free-text/search-only, `ModelConfig` has no
hardware/requirements field, and backend `capabilities:` resolves by accelerator
FAMILY only, serving `nvidia: cuda12-...-paged` to ANY NVIDIA GPU with no
sub-nvidia resolution). The GGUFs run correctly everywhere via dequant, so the
risk is PERFORMANCE-EXPECTATION, not correctness; a hard gate would wrongly block
valid (slow) use. Recommended, in order:
1. (zero-code, do now) Lead each of the 6 descriptions with one honest line:
   "Hardware: Blackwell (RTX 50-series / GB10 / B200) recommended; runs on other
   NVIDIA/CPU via NVFP4 dequant but WITHOUT FP4-MMA and below the quoted GB10
   throughput." Temper the "90-117% of vLLM" claims with that caveat (those are
   LPDDR5x-bandwidth-bound specific).
2. (zero-code) Tag all six consistently with `nvfp4` + a new `blackwell` chip. The
   four Qwopus/Qwen-MTP entries currently carry only `[llm, gguf]` and are not even
   discoverable as NVFP4 despite being NVFP4 GGUFs - secondary correctness-of-metadata gap.
3. (feature, later) A structured `recommended_hardware` field surfaced by the React
   import dialog is the only way to express this machine-readably; it does not exist.

### 4. Per-arch roadmap (ranked by value / effort)

- sm_100 datacenter Blackwell - HIGH value, MEDIUM effort. FP4-MMA works so NVFP4
  stays fast and the precision bucket (0017/0023/0025) carries over, but the BW
  floor is gone => compute-bound. Needs: re-sweep 0022 GDN_NW/CPW; re-evaluate the
  0017 kill-gate (levers ready, may flip); expect 0018/0019/0026 bandwidth wins to
  shrink toward neutral while 0029/0025/0020 host/graph/MMQ wins still help. No
  code change to be SAFE; a tuning pass to be OPTIMAL.
- Metal / macOS - MEDIUM value, MEDIUM effort. Add the `includeDarwin`
  `-metal-darwin-arm64-llama-cpp-localai-paged` row + a `metal:` capability key
  (changed-backends.js already anticipates the source path). Delivers paged-KV +
  scheduler value only (no NVFP4-MMA on Metal); still a strict win over today's
  broken Mac selection. MUST also land RISKY-1 first (Metal would otherwise hit the
  discriminated-SSM_CONV path if it ever gained an SSM_CONV kernel without the
  discriminator).
- CPU - LOW effort, already works. Reference kernels exist for every fused op;
  paged KV + scheduler + reclaim are the portable value. Nothing to do.
- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 - MEDIUM value, LOW effort (no FP4
  work). No FP4-MMA => pair the precision-agnostic infra (paged KV, 0013/0016,
  0024, 0029, 0018/0019/0021/0028, 0020) with a DIFFERENT quant (Q4_K/AWQ/GPTQ).
  Messaging: "no NVFP4 here, use another quant, but paged + SSM + scheduler infra
  is a pure win". The GGUFs/gallery rows are out of scope for these.

### 5. What MUST be empirically verified (and on what hardware)

- GB10 (sm_121, user has it): the validated target; already measured. Re-confirm
  bit-exactness gates after RISKY-1 fix.
- M4 Mac (user has it): (a) once an `includeDarwin` paged row exists, verify the
  Metal build compiles + a NON-qwen35 model runs (paged KV path); (b) verify a
  qwen35 hybrid model on Metal EITHER asserts loudly OR is correct - it must NOT
  silently miscompute the discriminated SSM_CONV. This is the direct test of
  RISKY-1 on real Metal. Do this BEFORE shipping a Metal paged build. Also verify
  CPU correctness of every fused op on the Mac (arm64 CPU mirror).
- non-Blackwell NVIDIA (sm_80/86/89/90 - user would need to ACQUIRE, e.g. cloud
  A100/L4/L40S/H100): verify (a) the cuda12/cuda13 paged image runs a qwen35
  hybrid model correctly (GDN + ssm_conv_update + gather non-Blackwell kernels),
  (b) NVFP4 GGUFs load + produce correct output via dequant/DP4A (not garbage),
  (c) RISKY-2: that 0020's forced MMQ does not regress single-stream (n_seqs=1)
  decode latency vs MMVQ. This is the only bucket needing hardware acquisition;
  everything else is covered by the GB10 + M4 the user already has.
- sm_100 (datacenter Blackwell - cloud B200 if a tuning pass is wanted): only
  needed to make sm_100 OPTIMAL, not to make it SAFE. Defer unless targeting it.

### 6. SHIP DECISION

SAFE TO SHIP TODAY as a Blackwell-targeted backend on Linux. The build is
arch-general (same arch fan + variant set as stock llama-cpp), every targeted
Linux variant builds and runs, and all Blackwell-specific code is default-off +
`#if`-guarded so a non-Blackwell build is byte-identical to stock on the FP4 path.
The NVFP4 GGUFs run everywhere via dequant (correct, slower), so broad gallery
exposure is a performance-expectation issue, not a correctness one.

MINIMUM to not break / mislead other archs:
1. (correctness, before ANY Vulkan/SYCL/Metal paged build of a gated-DeltaNet
   model) Close RISKY-1: compute-backend-gate the fused GDN/conv op emission, or
   add a supports_op guard rejecting the discriminated SSM_CONV. This is the only
   hard requirement; it is latent on the current Linux targets but becomes live
   the moment a Metal/Vulkan/SYCL paged build of qwen35 exists.
2. (availability, zero-risk) Add the `includeDarwin` paged row + `metal:` key so
   Mac users get a working (paged-KV-only) build instead of a non-running
   default=cpu selection with no fallthrough to stock.
3. (expectation, zero-code) Add the Blackwell-recommended hardware note + the
   "runs slower off-Blackwell via dequant" caveat to the 6 gallery descriptions
   and tag all six `nvfp4` + `blackwell`.
4. (perf, verify don't block) Confirm 0020 does not regress n_seqs=1 decode on
   non-GB10 NVIDIA; if it does, gate the MMVQ->MMQ reshape on a real-M threshold.

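Sketch (illustrative only): what the "real-M threshold" fallback in item 4 could look like if the n_seqs=1 regression is ever confirmed. `pick_oproj_kernel` and `kSmallMThreshold` are hypothetical; the value 8 is borrowed from the "M<=8" wording in these notes and would have to be measured per arch before being relied on.

```cpp
#include <cassert>

enum class MatmulKernel { MMVQ, MMQ };

// Assumed cut-over point, taken from the "genuine M<=8" note; not a tuned value.
constexpr int kSmallMThreshold = 8;

// Choose the o_proj kernel from the real number of rows in the batch.
// Small real M keeps the vector kernel; batched decode takes the MMQ reshape.
inline MatmulKernel pick_oproj_kernel(int real_m) {
    assert(real_m >= 1);
    return real_m <= kSmallMThreshold ? MatmulKernel::MMVQ : MatmulKernel::MMQ;
}
```

Because the notes state both kernels are bit-identical here, a gate like this would change performance only, never output.
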
Items 2-4 do not block a Linux Blackwell ship. Item 1 blocks only a future
non-CUDA paged build of a gated-DeltaNet model; on the current build targets the
hazard is latent (the GDN op asserts first). Net: ship for Blackwell/Linux now;
land item 1 before extending paged to Metal/Vulkan/SYCL.

Assisted-by: Claude:opus-4.8 [Claude Code]
@@ -1,56 +0,0 @@
# A-build: hybrid per-head f32/bf16 SSM state - BUILD PROGRESS

Label: A-build (GPU agent). Base: DGX `~/llama-paged-dev` branch `paged` HEAD 2f4f5ab (patch 0025),
plus `BF16_SSM_STATE.diff` applied as the bf16 plumbing base. Goal: per-head mixed-dtype SSM state
(f32 long-memory heads, bf16 fast heads); default `ssm_hybrid_tau_thresh=0.0` (all-f32, bit-exact;
see KNOB SEMANTICS below for why the brief's `inf` wording is not used).

## Design recap (from SPEEDUP_HUNT.md A-hybrid-design)
- Classifier (host, model-load): tau_h = 1/(|ssm_a[il][h]| * softplus(ssm_dt[il][h])); f32 if tau_h>T.
  ssm_a = SSM_A_NOSCAN = -exp(A_log) (verified qwen35.cpp:376). ssm_dt = SSM_DT bias.
- Split cache: per GDN layer, s_l (f32, n_f32 heads) + s_l_bf16 (bf16, n_bf16 heads). head_slot map.
- Kernel: ONE kernel templated +HYBRID; per-block (h_idx) branch on head_slot (uniform, no divergence).
  Recurrence math byte-for-byte f32-register, untouched. Homogeneous (HYBRID=false) path bit-exact.
- Op: extra src[8]=state_bf16, src[9]=head_slot; backend detects hybrid = (src[9]!=null).
- CPU mirror: per-head partition read.
- test-backend-ops: MIXED case (some heads f32, some bf16) output-append, decode+prefill+keep_rs_t.

## DE-RISK GATE (must pass before sweep)
1. test-backend-ops GATED_DELTA_NET mixed PASS (CUDA mixed vs CPU mixed).
2. T_thresh=0.0 (default, all-f32) greedy md5 == 0023 baseline: dense 5951a5b4d624ce891e22ab5fca9bc439,
   MoE 07db32c2bcb78d17a43ed18bc22705cd.

## KNOB SEMANTICS (IMPORTANT - brief endpoint wording corrected)
Rule (brief verbatim + physics + "start 32-64" guidance all agree): a head is kept f32 iff
tau_h > T_thresh, else bf16. tau_h = 1/(|ssm_a|*softplus(ssm_dt)) in tokens. Long-memory (large tau)
heads stay f32 (bf16 rounding does not contract there -> KL); fast (small tau) heads -> bf16.
- ssm_hybrid_tau_thresh DEFAULT = 0.0 => every tau>0 -> ALL F32 (bit-exact opt-out; the gate runs here).
- ssm_hybrid_tau_thresh -> +inf => ALL BF16 (shelved mode).
- sweep: raise T (16/32/64/128 tokens) to bf16 progressively more (longer-memory) heads = more speed.

NOTE: the brief's "inf=>all-f32, 0=>all-bf16" sentence is INVERTED vs the operative rule it states
("keep f32 if tau>T") and vs "start 32-64" + the physics. Correct endpoints: 0=all-f32, inf=all-bf16.
Implemented the physically-correct rule; default 0.0 = bit-exact all-f32.

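Sketch (illustrative, not the patch code): the classifier rule stated above, as a standalone host function. `head_tau`, `classify_heads`, `kAllF32` and `kAllBf16` are hypothetical names, and plain `double` vectors stand in for the per-layer `ssm_a` / `ssm_dt` tensors; the formula and the "f32 iff tau_h > T" rule are the ones from the notes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Threshold endpoints under the corrected semantics.
constexpr double kAllF32  = 0.0;                                      // default: every tau > 0 stays f32
constexpr double kAllBf16 = std::numeric_limits<double>::infinity();  // no tau exceeds +inf

inline double softplus(double x) { return std::log1p(std::exp(x)); }

// tau_h = 1 / (|ssm_a| * softplus(ssm_dt)), in tokens.
inline double head_tau(double ssm_a, double ssm_dt) {
    return 1.0 / (std::fabs(ssm_a) * softplus(ssm_dt));
}

// Per head: true => keep the state in f32, false => store it as bf16.
inline std::vector<bool> classify_heads(const std::vector<double>& ssm_a,
                                        const std::vector<double>& ssm_dt,
                                        double t_thresh) {
    assert(ssm_a.size() == ssm_dt.size());
    std::vector<bool> keep_f32(ssm_a.size());
    for (std::size_t h = 0; h < ssm_a.size(); ++h) {
        keep_f32[h] = head_tau(ssm_a[h], ssm_dt[h]) > t_thresh;
    }
    return keep_f32;
}
```

For example, with `ssm_dt = 0` (softplus = ln 2) a head with `ssm_a = -0.01` has tau of roughly 144 tokens and a head with `ssm_a = -1.0` has tau of roughly 1.4 tokens, so a threshold of 32 keeps the first in f32 and moves the second to bf16.
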
## STATUS
- [x] ggml.h/ggml.c hybrid op builders
- [x] gated_delta_net.cu hybrid kernel + dispatch (one kernel, +HYBRID template, uniform per-block branch)
- [x] ops.cpp CPU hybrid read mirror (output-append; ids in-place is GPU-only, asserted)
- [x] test-backend-ops mixed case (32 cases: hc 4/8 x hs 64/128 x decode/prefill/keep_rs_t x kda)
- [x] de-risk gate 1: test-backend-ops GATED_DELTA_NET = 84/84 PASS (incl 32 hybrid mixed CUDA-vs-CPU)
- [x] cparam/CLI ssm_hybrid_tau_thresh plumbing (default 0.0; threaded context->cparams->memory->ctors)
- [x] memory-recurrent split cache + classifier (validated: real tau split, correct 2-partition layout)
- [x] delta-net-base hybrid op build (fused ids decode + bf16 rs_zero/extra mirror)
- [x] full build clean (sm_121; llama-completion/batched-bench/perplexity/test-backend-ops)
- [x] de-risk gate 2 (default/all-f32 md5 == 0023 both models, re-verified post-build)
- [x] hybrid-ON decode FIXED: the incoherence was head_slot being zeroed by clear(data=true) (whole-RS
  buffer clear) after warm-up, never re-uploaded => every head -> f32-local-0 => split collapse.
  Fix = persist head_slot_host + re-upload via upload_head_slots() after every buffer clear. Hybrid
  decode now coherent; cross-op carry verified BYTE-EXACT (write==read both partitions).
- [x] A-gatesweep DONE: KL sweep T in {0.25..128} both models, single-seq c1024 (clean carry), drift.
  SHIP GATE FAILS - no T passes MeanKLD<1e-3 AND top-p>=99.5% with meaningful speedup. Premise
  (error concentrates in long-memory heads) REFUTED: KL scales with bf16 COUNT and saturates
  ~0.06/~91% (MoE saturates at the minimal split). Carry byte-exact => genuine bf16 sensitivity,
  not a bug. Throughput lever real: dense +12.4% / MoE +11.5% decode @npl128 at T=128.
- [x] Shipped default-off (f32, bit-exact). De-risk gates re-verified on the clean build (84/84;
  md5 == baseline both models). See A_HYBRID_SSM_RESULTS.md for the full tables.

Committed: DGX paged 33e7c65 (amended); worktree patch 0026 + A_HYBRID_SSM_RESULTS.md + this doc.
@@ -1,119 +0,0 @@
# A - HYBRID PER-HEAD f32/bf16 SSM STATE - BUILD + DE-RISK + GATE-SWEEP RESULTS

Label: A-build + A-gatesweep. Lands as patch 0026 on top of 0025 (DGX HEAD 2f4f5ab),
incorporating the bf16-SSM-state plumbing as the base. Built into `~/llama-paged-dev/build-cuda`
(sm_121); committed on the DGX `paged` branch (33e7c65, amended) and as
`patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch` + this doc in the worktree.

## VERDICT

The hybrid machinery is **CORRECT and complete** (both de-risk gates PASS; the carry is byte-exact;
the previously-open decode-incoherence bug is FIXED). The **ship gate FAILS**: no T_thresh reaches
`MeanKLD < 1e-3 AND Same-top-p >= 99.5%` for both models with any meaningful speedup. The design
premise - that the bf16 KL error concentrates in long-memory heads and is removed by keeping them
f32 at f32-fraction 0.30-0.50 - is **empirically refuted** on q36-27b and q36-35b-a3b-nvfp4: the KL
error scales with the bf16 head COUNT and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any
useful byte-saving. The bf16 byte-saving (and the decode speedup it buys) is real but cannot meet the
strict KL bar. **Shipped default-off (f32, bit-exact opt-out); the hybrid is opt-in only.**

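Sketch (illustrative only): the ship gate quoted above as a predicate, to make the two-sided bar explicit. `KlResult` and `passes_ship_gate` are hypothetical names; the thresholds are the ones in the verdict, and the "must hold for both models" part is left to the caller.

```cpp
#include <cassert>

// One KL-sweep measurement for one model at one T_thresh.
struct KlResult {
    double mean_kld;        // MeanKLD vs the f32 reference
    double same_top_p_pct;  // Same-top-p, in percent
};

// Both conditions must hold: MeanKLD < 1e-3 AND Same-top-p >= 99.5%.
constexpr bool passes_ship_gate(KlResult r) {
    return r.mean_kld < 1e-3 && r.same_top_p_pct >= 99.5;
}
```

Plugging in the saturation point reported above (about 0.06 MeanKLD, about 91% same-top-p) fails both conditions by a wide margin, which is why no threshold in the sweep could ship as a default.
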
## THE FIX (was: hybrid-ON decode incoherent)
|
||||
|
||||
Root cause: `llama_memory_recurrent::clear(data=true)` zeroes the WHOLE recurrent backend buffer with
|
||||
`ggml_backend_buffer_clear`, which includes the per-layer `head_slot` maps. Those maps were uploaded
|
||||
only once in the constructor. llama.cpp calls `clear(true)` to reset state after the warm-up run (and
|
||||
on context resets), so by the time real prefill/decode runs, every `head_slot[h] == 0`. The kernel
|
||||
decodes `head_slot==0` as "f32 head, local index 0", so EVERY head reads/writes f32-partition slot 0:
|
||||
the split collapses (the bf16 partition is never written, every head collides on one f32 slot) and the
|
||||
output is garbage. Warm-up showed correct values precisely because it ran before the clear.
|
||||
|
||||
Fix: persist the host-side maps (`head_slot_host`) and re-upload them after every buffer clear via a
|
||||
new `upload_head_slots()` (called both at construction and at the end of `clear(true)`). 22 lines in
|
||||
`src/llama-memory-recurrent.cpp` + 7 in the header. After the fix:
|
||||
- head_slot reads back correct in every forward (e.g. `0 1 -1 -2`), in both llama-completion and
|
||||
llama-perplexity;
|
||||
- the bf16 partition is written (non-zero) every step;
|
||||
- the cross-op state carry is **byte-exact**: at a continuation forward the op reads back EXACTLY what
|
||||
the prior op wrote, element-for-element, in BOTH partitions (f32 `[0]=0.00303 [1]=-0.00074
|
||||
[16384]=0.00054`, bf16 `[0]=-0.00023 [1]=0.00008 [16384]=0.00269` write == read), confirming there
|
||||
is no addressing/scramble/corruption bug. The only residual difference from f32 is the bf16 rounding
|
||||
of the bf16-partition heads.
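
The failure mode and the fix can be reproduced in a toy model. This is a minimal Python sketch, not the patch code: the `RecurrentBuffer` class, its fields, and the slot encoding (non-negative = f32 local index, negative `-(i+1)` = bf16 local index, inferred from the `0 1 -1 -2` readback above) are assumptions for illustration. Only the names `head_slot_host` and `upload_head_slots` come from the notes.

```python
class RecurrentBuffer:
    """Toy stand-in for the recurrent backend buffer (hypothetical, illustration only)."""

    def __init__(self, head_slot_host, reupload_after_clear):
        self.head_slot_host = list(head_slot_host)       # persisted host-side copy
        self.reupload_after_clear = reupload_after_clear
        self.head_slot_dev = [0] * len(head_slot_host)   # map stored inside the "device" buffer
        self.upload_head_slots()                         # upload at construction

    def upload_head_slots(self):
        self.head_slot_dev = list(self.head_slot_host)

    def clear(self):
        # Zeroing the whole buffer also wipes the maps that live in it.
        self.head_slot_dev = [0] * len(self.head_slot_dev)
        if self.reupload_after_clear:
            self.upload_head_slots()                     # the fix: re-upload after every clear


def decode_slot(slot):
    # ASSUMED encoding: slot >= 0 -> f32 partition, slot < 0 -> bf16 partition.
    if slot >= 0:
        return ("f32", slot)
    return ("bf16", -slot - 1)


host_map = [0, 1, -1, -2]
buggy = RecurrentBuffer(host_map, reupload_after_clear=False)
fixed = RecurrentBuffer(host_map, reupload_after_clear=True)
buggy.clear()   # models the clear(true) after warm-up
fixed.clear()
buggy_targets = [decode_slot(s) for s in buggy.head_slot_dev]
fixed_targets = [decode_slot(s) for s in fixed.head_slot_dev]
```

In the buggy variant every head resolves to f32 slot 0 after the clear, which is the collision described above; with the re-upload the split survives the clear.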
|
||||
|
||||
## DE-RISK GATES - both PASS (re-verified on the final clean build)
|
||||
|
||||
1. **test-backend-ops GATED_DELTA_NET = 84/84 PASS, CUDA0 OK** (incl. the 32 mixed-dtype hybrid cases
|
||||
vs CPU: head_count {4,8} x head_size {64,128} x {decode, prefill 33/64/100, keep_rs_t K=4} x kda).
|
||||
2. **T=0 (default, all-f32) greedy md5 == 0023 baseline, both models**, NO `--ssm-bf16-tau`:
|
||||
- dense q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == baseline
|
||||
- MoE q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == baseline
|
||||
The bit-exact opt-out is preserved byte-for-byte.
|
||||
|
||||
## SHIP GATE - the KL/throughput sweep (FAILS)
|
||||
|
||||
KL harness = the bf16-work GateBench: `llama-perplexity --kl-divergence` on wikitext-2-raw,
|
||||
`-ngl 99 -fa on --seed 1`, base = T=0 (f32). The clean carry config is single-sequence
|
||||
`-b 1024 -ub 512 -c 1024 --chunks 8` (one cross-ubatch bf16 round-trip; f32-vs-f32 floor = 100.000%
|
||||
same-top-p, MeanKLD ~ -1.2e-5). Gate: `MeanKLD < 1e-3 AND Same-top-p >= 99.5% AND bounded drift`.
|
||||
|
||||
### Dense q36-27b-nvfp4 (H_v=48), c1024 single-seq
|
||||
|
||||
| T_thresh | bf16 heads | f32-frac | f_bytes | MeanKLD | Same-top-p |
|---------:|-----------:|---------:|--------:|--------:|-----------:|
| 0.25 | 14 | 0.964 | 0.982 | 0.00270 | 98.92% |
| 0.5 | 48 | 0.963 | 0.982 | 0.01439 | 96.18% |
| 1 | 118 | 0.935 | 0.968 | 0.06357 | 91.59% |
| 8 | ~610 | 0.735 | 0.868 | 0.05669 | 91.59% |
| 32 | ~1113 | 0.517 | 0.759 | 0.05724 | 90.97% |
| 64 | ~1304 | 0.434 | 0.717 | 0.06183 | 91.85% |
| 128 | ~1460 | 0.366 | 0.683 | 0.05980 | 91.56% |
|
||||
|
||||
Monotonic below the knee (T<=1), then a flat plateau. Best meaningful point T=0.25 (only ~1.8% byte
|
||||
saving) already FAILS both criteria (KLD 0.0027 > 1e-3; top-p 98.92% < 99.5%). To pass the gate the
|
||||
bf16 count must be < ~14 heads (f_bytes > 0.98) => no speedup.
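
The gate check over the dense sweep can be written down directly. This is a sketch under stated assumptions: the rows are copied from the table above, `passes_ship_gate` covers only the two numeric criteria (the "bounded drift" criterion is omitted), and the relation `f_bytes = f32_frac + (1 - f32_frac) / 2` is inferred from the table columns rather than stated in the notes.

```python
# (T_thresh, f32_frac, f_bytes, mean_kld, same_top_p_percent), from the dense table.
DENSE_ROWS = [
    (0.25, 0.964, 0.982, 0.00270, 98.92),
    (0.5, 0.963, 0.982, 0.01439, 96.18),
    (1, 0.935, 0.968, 0.06357, 91.59),
    (8, 0.735, 0.868, 0.05669, 91.59),
    (32, 0.517, 0.759, 0.05724, 90.97),
    (64, 0.434, 0.717, 0.06183, 91.85),
    (128, 0.366, 0.683, 0.05980, 91.56),
]


def f_bytes_from_f32_frac(f32_frac):
    # ASSUMED relation: f32 heads keep full width, bf16 heads cost half the bytes.
    return f32_frac + 0.5 * (1.0 - f32_frac)


def passes_ship_gate(mean_kld, same_top_p):
    # The two numeric ship-gate criteria; drift is not modelled here.
    return mean_kld < 1e-3 and same_top_p >= 99.5


passing = [row[0] for row in DENSE_ROWS if passes_ship_gate(row[3], row[4])]
f_bytes_gap = [abs(f_bytes_from_f32_frac(r[1]) - r[2]) for r in DENSE_ROWS]
```

`passing` comes out empty, matching the verdict that no T_thresh clears the bar.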
|
||||
|
||||
### MoE q36-35b-a3b-nvfp4 (H_v=32), c1024 single-seq
|
||||
|
||||
| T_thresh | bf16 heads | f32-frac | f_bytes | MeanKLD | Same-top-p |
|---------:|-----------:|---------:|--------:|--------:|-----------:|
| 0.25 | 23 | 0.940 | 0.970 | 0.03907 | 91.61% |
| 0.5 | 53 | 0.928 | 0.964 | 0.04620 | 90.31% |
| 1 | 78 | 0.910 | 0.955 | 0.04425 | 89.82% |
| 32 | 585 | 0.391 | 0.695 | 0.04552 | 90.09% |
|
||||
|
||||
MoE has NO low-KL regime: even the minimal split (23 bf16 heads, ~3% byte saving) is already at the
|
||||
~0.045 / ~91% plateau. Fails the gate everywhere by a wide margin.
|
||||
|
||||
### Why it fails (the refutation)
|
||||
|
||||
The carry is byte-exact, so this is genuine bf16 rounding of the recurrent state, not a bug. The
|
||||
gated-DeltaNet logit is extremely sensitive to ANY perturbation of the temporal state: even rounding a
|
||||
handful of small-magnitude heads to bf16 flips ~9% of hard-wikitext argmaxes, and adding more bf16
|
||||
heads does not flip materially more (saturation - the flips concentrate in an inherently-marginal
|
||||
token pool). This matches the prior whole-bf16 finding (MeanKLD 0.05-0.17, top-p ~90%, "bounded but
|
||||
LARGE, plateaus with context"). The error is NOT concentrated by tau, so f32-ing the long-memory heads
|
||||
(or, tested, the fast heads - inverted classifier gives the same plateau) does not recover the gate.
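
To give a feel for the size of the perturbation involved, the sketch below rounds f32 values to bf16 in pure Python. It is an illustration, not the kernel code: `to_f32` and `bf16_round` are hypothetical helper names, the rounding is round-to-nearest-even on the top 16 bits of the f32 pattern (NaN handling is ignored), and the sample values are the state elements quoted in the byte-exact carry check above.

```python
import struct


def to_f32(x):
    # Round a Python float (double) to the nearest f32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bf16_round(x):
    # Keep the top 16 bits of the f32 pattern, rounding to nearest-even.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


samples = [0.00303, -0.00074, 0.00054, -0.00023, 0.00008, 0.00269]
rel_errors = [abs(bf16_round(v) - to_f32(v)) / abs(to_f32(v)) for v in samples]
worst_rel_error = max(rel_errors)
```

With 8 significand bits, the relative rounding error per stored element is at most 2^-8 (about 0.4%). Per the results above, a perturbation of that size on the recurrent state is enough to move the logits.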
|
||||
|
||||
## THROUGHPUT - the byte-saving lever IS real (but KL-gated out)
|
||||
|
||||
`llama-batched-bench -fa on -npp 128 -ntg 128 -npl 128`, `LLAMA_KV_PAGED=1`, decode_agg = S_TG t/s:
|
||||
|
||||
| model | T=0 (f32) | T=128 (f_bytes ~0.68) | gain |
|-------|----------:|----------------------:|-------:|
| dense | 529.0 | 594.4 | +12.4% |
| MoE | 1110.7 | 1238.1 | +11.5% |
|
||||
|
||||
So the split delivers the predicted recurrence-bandwidth win (~+12% decode at f_bytes ~0.68), but only
|
||||
at T values whose KL is ~0.06 / ~91% top-p. There is no operating point with both a real speedup and a
|
||||
passing KL.
|
||||
|
||||
## RECOMMENDATION
|
||||
|
||||
- Ship 0026 as-is: **default `ssm_hybrid_tau_thresh = 0.0` (f32, bit-exact)**; the hybrid is opt-in via
|
||||
`--ssm-bf16-tau` for callers who explicitly accept reduced precision for ~+12% decode. Do NOT put a
|
||||
hybrid T in the gallery/recommended config - it does not pass the KL bar.
|
||||
- Lever A is closed as a KL-passing speedup: the GDN recurrent state does not tolerate bf16 on a
|
||||
head-subset basis. Speed beyond the f32 recurrence must come from elsewhere (the MoE FP4 GEMM /
|
||||
re-graph levers, or NVFP4-dense quant), not from bf16-ing the SSM state.
|
||||
- If a product later accepts a looser bar (e.g. top-p >= 95%), dense T=0.25 (98.92%) and T=0.5 (96.18%),
  both at f_bytes 0.982, are the only near-misses and buy ~2% - still not worth it; MoE has nothing.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,92 +0,0 @@
|
||||
# FINAL apples-to-apples NVFP4 benchmark (GB10 / DGX Spark) - CLEAN env, containers stopped
|
||||
# llama 0023 clean f7409c2 | LLAMA_KV_PAGED=1, LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget; beats stock 394s->142s TTFT@npl32), CUDA graphs ON, -c 131072 --parallel 128 -b 2048 -ub 512 -fa on
|
||||
# vLLM 0.23.0 | CUDA graphs ON (no enforce-eager), util 0.85, max-model-len 4096, max-num-seqs 256, tp1
|
||||
# client h2h_cli3.py: 512-tok UNIQUE-nonce prompt (fresh full prefill, defeats prefix caching), max_tokens=256, temp0, ignore_eos, stream+usage
|
||||
# llama restarts server PER NPL (paged-pool degrades after high-npl bursts); vllm one server/combo + npl8 re-check. 1 measured pass/npl + ptok8 graph warmup. peak_gb engine = PEAK-PRE.
|
||||
# started Fri Jun 26 04:43:38 AM CEST 2026 baseline=3.29 GB
|
||||
|
||||
[2026-06-26 04:43:38] [dense_llama] ==== START dense_llama (llama) baseline_mem=3.29 ====
|
||||
[2026-06-26 04:43:38] [dense_llama] NPL=8 launching server PRE_GB=3.29
|
||||
[2026-06-26 04:43:48] [dense_llama] NPL=8 ready LOADED_GB=47.06
|
||||
[2026-06-26 04:43:55] [dense_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1. **Analyze User Input:** The user says "The capital of France is". This is a straightforward factual question with a clear answer.\n2. **Identify Key Entity:** France (country)\n3. **Identify Question Type:** Capit
|
||||
[2026-06-26 04:44:30] [dense_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2040, "prompt_tok_total": 4195, "gen_per_req": 255.0, "agg_tps": 61.8, "decode_agg_tps": 82.5, "decode_perseq_tps": 9.57, "prefill_tps": 507.3, "ttft_mean_ms": 6038.1, "ttft_max_ms": 8270.0, "wall_s": 32.999}
|
||||
[2026-06-26 04:44:30] [dense_llama] NPL=8 PEAK_GB=53.51
|
||||
[2026-06-26 04:44:35] [dense_llama] NPL=8 server stopped mem=3.31
|
||||
[2026-06-26 04:44:35] [dense_llama] NPL=32 launching server PRE_GB=3.31
|
||||
[2026-06-26 04:44:40] [dense_llama] NPL=32 ready LOADED_GB=46.96
|
||||
[2026-06-26 04:47:55] [dense_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8180, "prompt_tok_total": 16900, "gen_per_req": 255.6, "agg_tps": 43.2, "decode_agg_tps": 192.6, "decode_perseq_tps": 4.79, "prefill_tps": 115.0, "ttft_mean_ms": 133551.7, "ttft_max_ms": 147007.0, "wall_s": 189.49}
|
||||
[2026-06-26 04:47:55] [dense_llama] NPL=32 PEAK_GB=69.63
|
||||
[2026-06-26 04:48:01] [dense_llama] NPL=32 server stopped mem=3.32
|
||||
[2026-06-26 04:48:01] [dense_llama] NPL=64 launching server PRE_GB=3.32
|
||||
[2026-06-26 04:48:11] [dense_llama] NPL=64 ready LOADED_GB=46.97
|
||||
[2026-06-26 04:55:10] [dense_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16382, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 39.8, "decode_agg_tps": 277.8, "decode_perseq_tps": 3.09, "prefill_tps": 95.9, "ttft_mean_ms": 321618.8, "ttft_max_ms": 352633.6, "wall_s": 411.603}
|
||||
[2026-06-26 04:55:10] [dense_llama] NPL=64 PEAK_GB=83.96
|
||||
[2026-06-26 04:55:16] [dense_llama] NPL=64 server stopped mem=3.30
|
||||
[2026-06-26 04:55:16] [dense_llama] NPL=128 launching server PRE_GB=3.30
|
||||
[2026-06-26 04:55:21] [dense_llama] NPL=128 ready LOADED_GB=47.09
|
||||
[2026-06-26 05:13:18] [dense_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32767, "prompt_tok_total": 67969, "gen_per_req": 256.0, "agg_tps": 30.9, "decode_agg_tps": 384.6, "decode_perseq_tps": 1.86, "prefill_tps": 69.7, "ttft_mean_ms": 902762.7, "ttft_max_ms": 975832.6, "wall_s": 1061.031}
|
||||
[2026-06-26 05:13:18] [dense_llama] NPL=128 PEAK_GB=93.82
|
||||
[2026-06-26 05:13:25] [dense_llama] NPL=128 server stopped mem=3.31
|
||||
[2026-06-26 05:13:25] [dense_llama] ==== DONE dense_llama POST_GB=3.31 ====
|
||||
[2026-06-26 05:13:25] [dense_vllm] ==== START dense_vllm (vllm) baseline_mem=3.31 ====
|
||||
[2026-06-26 05:13:25] [dense_vllm] launching vllm PRE_GB=3.31
|
||||
[2026-06-26 05:21:15] [dense_vllm] vllm ready LOADED_GB=110.48
|
||||
[2026-06-26 05:21:27] [dense_vllm] GATE='Here\'s a thinking process:\n\n1. **Analyze User Input:** The user says "The capital of France is"\n2. **Identify Key Entity/Question:** The question is asking for the capital city of France.\n3. **Retrieve Knowledge:** I know from general knowledge that t
|
||||
[2026-06-26 05:21:59] [dense_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1959, "prompt_tok_total": 4195, "gen_per_req": 244.9, "agg_tps": 65.6, "decode_agg_tps": 70.4, "decode_perseq_tps": 8.76, "prefill_tps": 2096.2, "ttft_mean_ms": 1861.1, "ttft_max_ms": 2000.6, "wall_s": 29.843}
|
||||
[2026-06-26 05:21:59] [dense_vllm] NPL=8 PEAK_GB=110.92
|
||||
[2026-06-26 05:22:47] [dense_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8165, "prompt_tok_total": 16900, "gen_per_req": 255.2, "agg_tps": 176.3, "decode_agg_tps": 211.8, "decode_perseq_tps": 6.28, "prefill_tps": 2182.6, "ttft_mean_ms": 5353.2, "ttft_max_ms": 7741.4, "wall_s": 46.302}
|
||||
[2026-06-26 05:22:47] [dense_vllm] NPL=32 PEAK_GB=110.87
|
||||
[2026-06-26 05:23:59] [dense_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16314, "prompt_tok_total": 33828, "gen_per_req": 254.9, "agg_tps": 236.5, "decode_agg_tps": 309.1, "decode_perseq_tps": 4.38, "prefill_tps": 2088.9, "ttft_mean_ms": 9512.4, "ttft_max_ms": 16191.0, "wall_s": 68.976}
|
||||
[2026-06-26 05:23:59] [dense_vllm] NPL=64 PEAK_GB=110.88
|
||||
[2026-06-26 05:25:57] [dense_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32640, "prompt_tok_total": 67969, "gen_per_req": 255.0, "agg_tps": 288.4, "decode_agg_tps": 418.8, "decode_perseq_tps": 2.79, "prefill_tps": 1929.1, "ttft_mean_ms": 18449.5, "ttft_max_ms": 35227.7, "wall_s": 113.162}
|
||||
[2026-06-26 05:25:57] [dense_vllm] NPL=128 PEAK_GB=110.95
|
||||
[2026-06-26 05:26:27] [dense_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 2044, "prompt_tok_total": 4187, "gen_per_req": 255.5, "agg_tps": 68.1, "decode_agg_tps": 73.4, "decode_perseq_tps": 9.07, "prefill_tps": 1921.9, "ttft_mean_ms": 1877.6, "ttft_max_ms": 2178.1, "wall_s": 30.018}
|
||||
[2026-06-26 05:26:35] [dense_vllm] ==== DONE dense_vllm POST_GB=3.53 ====
|
||||
[2026-06-26 05:26:35] [moe_llama] ==== START moe_llama (llama) baseline_mem=3.53 ====
|
||||
[2026-06-26 05:26:35] [moe_llama] NPL=8 launching server PRE_GB=3.53
|
||||
[2026-06-26 05:26:50] [moe_llama] NPL=8 ready LOADED_GB=36.42
|
||||
[2026-06-26 05:26:52] [moe_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1. **Analyze User Input:**\n - User says: "The capital of France is"\n - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2. **Identify Key Information:*
|
||||
[2026-06-26 05:27:06] [moe_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2048, "prompt_tok_total": 4195, "gen_per_req": 256.0, "agg_tps": 156.8, "decode_agg_tps": 211.8, "decode_perseq_tps": 24.45, "prefill_tps": 1236.4, "ttft_mean_ms": 2477.1, "ttft_max_ms": 3392.9, "wall_s": 13.061}
|
||||
[2026-06-26 05:27:06] [moe_llama] NPL=8 PEAK_GB=39.66
|
||||
[2026-06-26 05:27:11] [moe_llama] NPL=8 server stopped mem=3.34
|
||||
[2026-06-26 05:27:11] [moe_llama] NPL=32 launching server PRE_GB=3.34
|
||||
[2026-06-26 05:27:16] [moe_llama] NPL=32 ready LOADED_GB=36.54
|
||||
[2026-06-26 05:27:54] [moe_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8192, "prompt_tok_total": 16900, "gen_per_req": 256.0, "agg_tps": 235.6, "decode_agg_tps": 393.0, "decode_perseq_tps": 10.02, "prefill_tps": 1213.9, "ttft_mean_ms": 8225.2, "ttft_max_ms": 13921.9, "wall_s": 34.768}
|
||||
[2026-06-26 05:27:54] [moe_llama] NPL=32 PEAK_GB=47.11
|
||||
[2026-06-26 05:28:00] [moe_llama] NPL=32 server stopped mem=3.30
|
||||
[2026-06-26 05:28:00] [moe_llama] NPL=64 launching server PRE_GB=3.30
|
||||
[2026-06-26 05:28:05] [moe_llama] NPL=64 ready LOADED_GB=36.39
|
||||
[2026-06-26 05:29:10] [moe_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16384, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 271.0, "decode_agg_tps": 527.0, "decode_perseq_tps": 6.15, "prefill_tps": 1152.3, "ttft_mean_ms": 15849.5, "ttft_max_ms": 29356.9, "wall_s": 60.449}
|
||||
[2026-06-26 05:29:10] [moe_llama] NPL=64 PEAK_GB=57.13
|
||||
[2026-06-26 05:29:16] [moe_llama] NPL=64 server stopped mem=3.28
|
||||
[2026-06-26 05:29:16] [moe_llama] NPL=128 launching server PRE_GB=3.28
|
||||
[2026-06-26 05:29:21] [moe_llama] NPL=128 ready LOADED_GB=36.48
|
||||
[2026-06-26 05:34:19] [moe_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32760, "prompt_tok_total": 67969, "gen_per_req": 255.9, "agg_tps": 112.7, "decode_agg_tps": 726.4, "decode_perseq_tps": 3.73, "prefill_tps": 276.8, "ttft_mean_ms": 213017.2, "ttft_max_ms": 245528.7, "wall_s": 290.634}
|
||||
[2026-06-26 05:34:19] [moe_llama] NPL=128 PEAK_GB=61.51
|
||||
[2026-06-26 05:34:25] [moe_llama] NPL=128 server stopped mem=3.28
|
||||
[2026-06-26 05:34:25] [moe_llama] ==== DONE moe_llama POST_GB=3.28 ====
|
||||
[2026-06-26 05:34:25] [moe_vllm] ==== START moe_vllm (vllm) baseline_mem=3.28 ====
|
||||
[2026-06-26 05:34:25] [moe_vllm] launching vllm PRE_GB=3.28
|
||||
[2026-06-26 05:39:35] [moe_vllm] vllm ready LOADED_GB=109.46
|
||||
[2026-06-26 05:39:38] [moe_vllm] GATE='Here\'s a thinking process:\n\n1. **Analyze User Input:**\n - User says: "The capital of France is"\n - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2. **Identify Key Information:**\n - C
|
||||
[2026-06-26 05:39:47] [moe_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1900, "prompt_tok_total": 4195, "gen_per_req": 237.5, "agg_tps": 231.2, "decode_agg_tps": 256.5, "decode_perseq_tps": 31.84, "prefill_tps": 5186.5, "ttft_mean_ms": 768.8, "ttft_max_ms": 808.2, "wall_s": 8.217}
|
||||
[2026-06-26 05:39:47] [moe_vllm] NPL=8 PEAK_GB=109.62
|
||||
[2026-06-26 05:40:07] [moe_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 7794, "prompt_tok_total": 16900, "gen_per_req": 243.6, "agg_tps": 426.4, "decode_agg_tps": 500.8, "decode_perseq_tps": 14.9, "prefill_tps": 6223.4, "ttft_mean_ms": 1830.4, "ttft_max_ms": 2714.2, "wall_s": 18.28}
|
||||
[2026-06-26 05:40:07] [moe_vllm] NPL=32 PEAK_GB=109.63
|
||||
[2026-06-26 05:40:37] [moe_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 15927, "prompt_tok_total": 33828, "gen_per_req": 248.9, "agg_tps": 550.7, "decode_agg_tps": 686.1, "decode_perseq_tps": 9.83, "prefill_tps": 5926.5, "ttft_mean_ms": 3224.4, "ttft_max_ms": 5704.9, "wall_s": 28.92}
|
||||
[2026-06-26 05:40:37] [moe_vllm] NPL=64 PEAK_GB=109.63
|
||||
[2026-06-26 05:41:27] [moe_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 31795, "prompt_tok_total": 67969, "gen_per_req": 248.4, "agg_tps": 650.7, "decode_agg_tps": 882.2, "decode_perseq_tps": 6.05, "prefill_tps": 5300.5, "ttft_mean_ms": 6487.7, "ttft_max_ms": 12817.8, "wall_s": 48.863}
|
||||
[2026-06-26 05:41:27] [moe_vllm] NPL=128 PEAK_GB=109.64
|
||||
[2026-06-26 05:41:36] [moe_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 1702, "prompt_tok_total": 4187, "gen_per_req": 212.8, "agg_tps": 207.2, "decode_agg_tps": 226.4, "decode_perseq_tps": 28.06, "prefill_tps": 6021.3, "ttft_mean_ms": 642.7, "ttft_max_ms": 694.8, "wall_s": 8.213}
|
||||
[2026-06-26 05:41:44] [moe_vllm] ==== DONE moe_vllm POST_GB=3.31 ====
|
||||
|
||||
==== ALL 16 ROWS COLLECTED (2 models x 2 engines x 4 npl) ====
|
||||
decode_agg t/s (llama | vLLM | llama%vLLM):
|
||||
DENSE q36-27b-nvfp4: npl8 82.5|70.4|117% npl32 192.6|211.8|91% npl64 277.8|309.1|90% npl128 384.6|418.8|92%
|
||||
MoE q36-35b-a3b: npl8 211.8|256.5|83% npl32 393.0|500.8|78% npl64 527.0|686.1|77% npl128 726.4|882.2|82%
|
||||
peak_gb (llama on-demand grows | vLLM fixed ~107 pool):
|
||||
DENSE llama 53.5->93.8 ; vLLM ~110.9 flat
|
||||
MoE llama 39.7->61.5 ; vLLM ~109.6 flat
|
||||
Final CSV: final_benchmark.csv ; analysis: QWEN36_NVFP4_BENCH.md (FINAL section).
|
||||
Cleanup: no leftover server/bench PIDs; GPU free (memnow 3.28 GB); local-ai + local-ai-worker
|
||||
containers restarted (host returned). DONE.
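
The `llama%vLLM` column in the summary can be recomputed from the `PASS=1` lines. A minimal sketch, assuming the line layout shown in this log (`[timestamp] [tag] NPL=<n> PASS=1 <json>`); the regex and helper name are illustrative and are not part of the original harness.

```python
import json
import re

LOG_LINES = [
    '[2026-06-26 04:44:30] [dense_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2040, "prompt_tok_total": 4195, "gen_per_req": 255.0, "agg_tps": 61.8, "decode_agg_tps": 82.5, "decode_perseq_tps": 9.57, "prefill_tps": 507.3, "ttft_mean_ms": 6038.1, "ttft_max_ms": 8270.0, "wall_s": 32.999}',
    '[2026-06-26 05:21:59] [dense_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1959, "prompt_tok_total": 4195, "gen_per_req": 244.9, "agg_tps": 65.6, "decode_agg_tps": 70.4, "decode_perseq_tps": 8.76, "prefill_tps": 2096.2, "ttft_mean_ms": 1861.1, "ttft_max_ms": 2000.6, "wall_s": 29.843}',
]

PASS_RE = re.compile(r"\[(\w+)\] NPL=(\d+) PASS=1 (\{.*\})$")


def decode_agg_by_run(lines):
    # Map (tag, npl) -> decode_agg_tps for every PASS=1 line.
    out = {}
    for line in lines:
        m = PASS_RE.search(line)
        if m:
            out[(m.group(1), int(m.group(2)))] = json.loads(m.group(3))["decode_agg_tps"]
    return out


runs = decode_agg_by_run(LOG_LINES)
dense_npl8_pct = round(100.0 * runs[("dense_llama", 8)] / runs[("dense_vllm", 8)])
```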
|
||||
@@ -1,628 +0,0 @@
|
||||
# bf16 SSM-state cache: BUILD PLAN (PART C synthesis - hand this to the build agent)
|
||||
|
||||
Status: READ-ONLY design. Lands ON TOP of patch 0021 (conv-state in-place fusion, building
|
||||
concurrently on the GPU). DEFAULT = bf16 SSM recurrent state, f32 opt-out. This PART C is the
|
||||
executive build brief: ordered edits, acceptance gate, bench targets, semantics/back-compat/risk
|
||||
register, and the de-risk-first item. PART A (cparams wiring), PART B (kernel/op plumbing) and the
|
||||
Appendix (upstream precedent + numeric safety) below are the detailed reference each step points into.
|
||||
|
||||
The decision (settled by GDN_RECURRENCE_BYTE_GATE.md): the gated-DeltaNet recurrence is the dominant
|
||||
decode kernel (51.6% of the step, 805 MB f32 state R+W/call at 74% of GB10 peak BW) and is ALREADY
|
||||
single-pass (measured re-stream ~1.0x, hard-capped <=1.33x). The whole ~2x DRAM gap vs vLLM is purely
|
||||
f32(llama) vs bf16(vLLM) state-cache WIDTH, not extra passes. Narrowing the persisted SSM state to
|
||||
bf16 (load->f32, recurrence math in f32 UNCHANGED, store->bf16) halves the dominant term and reaches
|
||||
vLLM parity-to-ahead. vLLM's own GDN state cache is bf16, so this is a fair equal-precision change.
|
||||
|
||||
## C.0 Synthesis decisions that OVERRIDE the per-part text
|
||||
|
||||
1. v1 ships `type_s` = BF16 (SSM recurrent state, the 805 MB lever) and KEEPS `type_r` = F32 (conv
|
||||
state). Reason: `ggml_concat` at prefill (`build_conv_state`, delta-net-base.cpp:472) requires
|
||||
same-type operands; a bf16 conv cache breaks the f32 `qkv_mixed` concat. Conv state is ~12.6 MB
|
||||
(launch-bound, ~0 ms byte benefit), so keeping it f32 costs nothing. This OVERRIDES PART A §3a/§3b,
|
||||
which set BOTH defaults to BF16: in v1 set the `type_r` / `cache_type_conv` DEFAULT to
|
||||
`GGML_TYPE_F32`. `type_r`=bf16 is a v2 follow-up (needs an f32 staging view before the prefill
|
||||
concat - PART B §B.6).
|
||||
2. Keep ALL transient/scratch tensors f32: the GDN op OUTPUT scratch (ggml.c:6327), the 0019 gather
|
||||
scratch, and the keep_rs_t prefill snapshot. ONLY the PERSISTED cache rows narrow to bf16 (the
|
||||
src[5] read view and the src[6] in-place write view).
|
||||
3. The gate REPLACES the bit-exact md5 gate for the bf16 default: bf16 is intentionally non-bit-exact
|
||||
vs llama f32 (it is equal precision to vLLM's bf16). The 0018/0019 md5 gate STILL applies to (a)
|
||||
patch 0021's conv fusion and (b) verifying the f32 opt-out path is byte-identical to the pre-bf16
|
||||
f32 baseline.
|
||||
|
||||
## C.1 Ordered file-by-file edit list (build order, on top of 0021)
|
||||
|
||||
Order is dependency- and de-risk-driven: prove the kernel dtype-correct in ISOLATION before flipping
|
||||
any default. Section refs point into PART A / PART B below.
|
||||
|
||||
STEP 1 - kernel + op made dtype-generic (the load/store conversion), validated standalone:
|
||||
- 1a `ggml/src/ggml.c` - relax the F32-only state asserts to {F32,BF16} in the 3 GDN builders:
|
||||
`ggml_gated_delta_net` (~6308), `_inplace` (~6370), `_inplace_ids` (~6430), on `state` and
|
||||
`src_state_dst`. KEEP the op OUTPUT scratch F32 (6327). [PART B §B.2]
|
||||
- 1b `ggml/src/ggml-cuda/ggml-cuda.cu` - `supports_op` `GGML_OP_GATED_DELTA_NET` (~3096): permit a
|
||||
BF16 `src[5]`/`src[6]`. [PART B §B.3]
|
||||
- 1c `ggml/src/ggml-cuda/gated_delta_net.cu` - template kernel+gather+launch on `bool STATE_BF16`;
|
||||
`#include <cuda_bf16.h>`. LOAD `__bfloat162float` (~102), STORE `__float2bfloat16` (~207), GATHER
|
||||
bf16->f32 scratch (~20). Cast `src_state`/`src_state_dst` pointers to `nv_bfloat16` on bf16; relax
|
||||
dispatcher asserts (309-311) `sizeof(float)` -> `ggml_type_size(type)`. Keep gather scratch +
|
||||
keep_rs_t snapshot f32. ALL recurrence math (106-200) UNCHANGED in f32 registers. [PART B §B.4,§B.8]
|
||||
- 1d `ggml/src/ggml-cpu/ops.cpp` - matching bf16 load/store branch in the GDN reference (10726/10744/
|
||||
10891 load via `GGML_BF16_TO_FP32`, 10758-10762 store via `GGML_FP32_TO_BF16`); relax `nb[]` asserts
|
||||
to `ggml_type_size(type)`. [PART B §B.5]
|
||||
- 1e `tests/test-backend-ops.cpp` - add a BF16-state `GATED_DELTA_NET` case covering BOTH `n_tokens==1`
|
||||
decode AND a multi-token (prefill/chunk) + `keep_rs_t==true` path, CUDA bf16 vs CPU bf16 reference.
|
||||
THIS IS THE DE-RISK GATE for Step 1 (see C.5). Build + pass before Step 2.
|
||||
|
||||
STEP 2 - cparams selection wiring (llama.cpp core):
|
||||
- 2a `include/llama.h` (after :366) - add `enum ggml_type type_s;` and `type_r;` adjacent to
|
||||
`type_k`/`type_v`, marked `[EXPERIMENTAL]`. [PART A §3a]
|
||||
- 2b `src/llama-context.cpp:3468` (`llama_context_default_params`) - add `/*.type_s =*/ GGML_TYPE_BF16,`
|
||||
and `/*.type_r =*/ GGML_TYPE_F32,`. THIS IS THE DEFAULT CHANGE (type_r stays F32 per C.0). [PART A §3a]
|
||||
- 2c `src/llama-memory.h:19` (`struct llama_memory_params`) - add `ggml_type type_r;` and `type_s;`.
|
||||
[PART A §3a]
|
||||
- 2d `src/llama-context.cpp:325` (`params_mem` init) - pass `params.type_r` / `params.type_s`. [PART A §3a]
|
||||
- 2e `src/llama-model.cpp` - replace the 3 hardcoded `GGML_TYPE_F32` pairs (2056-57 recurrent, 2098-99
|
||||
hybrid_iswa, 2117-18 hybrid = the qwen35/qwen35moe path) with `params.type_r` / `params.type_s`.
|
||||
[PART A §2/§3a]
|
||||
|
||||
STEP 3 - back-compat for saved recurrent state (REQUIRED, the default flips):
|
||||
- 3a `src/llama-memory-recurrent.cpp` `state_read_data` - on `s_type_i_ref != live type` with both in
|
||||
{F32,BF16}, CONVERT row-by-row during load instead of returning false (same for `r`). Bump the
|
||||
recurrent state-file version. [PART A §5, option A]
|
||||
|
||||
STEP 4 - CLI / llama-server surface (needed by the gate harness):
|
||||
- 4a `common/common.h:566` region - `cache_type_ssm = GGML_TYPE_BF16;` and
|
||||
`cache_type_conv = GGML_TYPE_F32;` (conv default F32 per C.0). [PART A §3b]
|
||||
- 4b `common/common.cpp:1589` region - `cparams.type_s = params.cache_type_ssm;` and
|
||||
`cparams.type_r = params.cache_type_conv;`. [PART A §3b]
|
||||
- 4c `common/arg.cpp` (after :2074) - add `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv`
|
||||
via the existing `kv_cache_type_from_str` (arg.cpp:402); confirm `bf16` -> `GGML_TYPE_BF16`. The C.2
|
||||
harness depends on `--cache-type-ssm {f32,bf16}`. [PART A §3b]
|
||||
|
||||
STEP 5 - LocalAI gRPC / YAML (force f32 from model config):
|
||||
- 5a `backend/backend.proto` - `string CacheTypeSSM` / `CacheTypeConv` (next free tags after 64);
|
||||
regen proto. [PART A §3c]
|
||||
- 5b `backend/cpp/llama-cpp/grpc-server.cpp:504` region - `params.cache_type_ssm =
|
||||
kv_cache_type_from_str(request->cachetypessm());` + conv. [PART A §3c]
|
||||
- 5c `core/config/model_config.go:935` - `CacheTypeSSM`/`CacheTypeConv` yaml fields. [PART A §3c]
|
||||
- 5d `core/backend/options.go:247` - map into the request. [PART A §3c]
|
||||
- 5e `core/config/meta/registry.go` + `build_test.go` - register `cache_type_ssm`/`cache_type_conv`
|
||||
as static fields (gate). [PART A §3c]
|
||||
|
||||
STEP 6 - capability fallback (heterogeneous / CPU-offload safety):
|
||||
- 6a `src/llama-context.cpp:518-595` - an `auto_fgdn`-style device-match probe: if a participating
|
||||
device lacks the bf16 GDN load/store specialization (CPU-offloaded GDN layer, non-GB10 backend),
|
||||
demote `type_s` to F32 BEFORE alloc and log once. [PART A §4]
|
||||
|
||||
## C.2 Acceptance gate (REPLACES the bit-exact md5 gate)
|
||||
|
||||
bf16 is intentionally non-bit-exact, so the 0018/0019 md5 byte-equality gate does NOT apply to the
|
||||
bf16 default. The gate is teacher-forced KL-divergence + PPL-delta + greedy coherence + a
|
||||
long-context drift sweep, vs the SAME model run f32. All commands on `dgx.casa` (DO NOT run during
|
||||
this design - GPU busy). Binaries `~/llama-paged-dev/build*/bin`; models `~/bench/q36-27b-nvfp4.gguf`
|
||||
(dense) and `~/bench/q36-35b-a3b-nvfp4.gguf` (MoE); scratch `~/bench/klgate`.
|
||||
|
||||
Why teacher-forced (not self-greedy): a self-greedy decode lets each precision pick its own argmax,
|
||||
so after the first divergence the contexts differ and per-token logits are no longer comparable (you
|
||||
measure trajectory divergence, not numeric error). `llama-perplexity --kl-divergence` feeds both
|
||||
precisions the IDENTICAL token stream and compares output distributions position-by-position; the
|
||||
greedy trajectory is validated SEPARATELY by the Same-top-p metric + a coherence read.
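
The position-by-position comparison can be sketched as follows. This is an illustration of the metric, not the `llama-perplexity` implementation: the function names are hypothetical, the logits are made-up toy values, "same top" is taken here to mean that both distributions agree on the argmax token, and which run plays the reference role in the KL is an assumption.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    # KL(p || q) in nats; p is treated as the reference distribution.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def compare_positions(base_logits, test_logits):
    # Both runs saw the SAME token stream, so position i is directly comparable.
    klds, same_top = [], 0
    for b, t in zip(base_logits, test_logits):
        p, q = softmax(b), softmax(t)
        klds.append(kl_divergence(p, q))
        same_top += int(p.index(max(p)) == q.index(max(q)))
    n = len(klds)
    return sum(klds) / n, 100.0 * same_top / n


base = [[2.0, 1.0, 0.1], [0.5, 0.6, 0.0]]   # toy "f32" logits
test = [[2.0, 1.0, 0.1], [0.6, 0.5, 0.0]]   # toy "bf16" logits, argmax flips at position 1
mean_kld, same_top_pct = compare_positions(base, test)
```

Because the inputs are teacher-forced, a flipped argmax at one position does not contaminate the comparison at later positions.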
|
||||
|
||||
Corpus (one-time): wikitext-2 raw test (~280k tokens) into `~/bench/klgate`. KL mode needs
|
||||
>= 2*n_ctx tokens; any fixed >=8k-token UTF-8 file works as long as base AND test share it.
|
||||
|
||||
256-token headline gate (per model; shown for dense):
|
||||
```
M=~/bench/q36-27b-nvfp4.gguf; F=~/bench/klgate/wikitext-2-raw/wiki.test.raw; D=~/bench/klgate
COMMON="-m $M -f $F -c 256 -b 256 -ngl 99 -fa on --seed 1 --chunks 32"
# (a) f32 BASE: reference logits + f32 PPL
llama-perplexity $COMMON --cache-type-ssm f32 --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32.c256.base.log
# (b) bf16 TEST: KL(bf16||f32) + bf16 PPL + Same-top-p
llama-perplexity $COMMON --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.bf16.c256.kl.log
```
|
||||
Noise floor (run FIRST, mandatory - GPU reductions are not bit-deterministic, so KLD has a non-zero
|
||||
floor; bf16 is judged against BOTH the absolute threshold AND this floor):
|
||||
```
llama-perplexity $COMMON --cache-type-ssm f32 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32f32.floor.log
```
|
||||
Record `Mean KLD_floor` and `Same-top-p_floor` (expect KLD ~1e-6..1e-5, top-p ~100%).
|
||||
|
||||
Coherence spot-check (greedy trajectory, reuses the 0018/0019 `--temp 0 --seed 1` convention):
|
||||
```
P="Explain how a transformer language model generates text, step by step."
for T in f32 bf16; do llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm $T 2>/dev/null > $D/q27.greedy.$T.txt; done
diff $D/q27.greedy.f32.txt $D/q27.greedy.bf16.txt && echo "GREEDY BYTE-IDENTICAL"
```
|
||||
Long-context drift sweep (verifies the g<1 decay bound: bf16 state-rounding error must stay FLAT, not
|
||||
accumulate, as context grows - the GDN state spans the whole window):
|
||||
```
for C in 256 1024 2048 4096; do
  CMN="-m $M -f $F -c $C -b $C -ngl 99 -fa on --seed 1 --chunks 8"
  llama-perplexity $CMN --cache-type-ssm f32 --kl-divergence-base $D/q27.f32.c$C.kld >/dev/null
  llama-perplexity $CMN --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c$C.kld | tee $D/q27.bf16.c$C.kl.log
done
```
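
The decay bound that the sweep is meant to verify can be illustrated on a scalar toy recurrence. This is a sketch under strong assumptions, not the GDN op: the real state is matrix-valued with per-head gates, whereas here a single value follows `s = g*s + u` with a fixed `g = 0.9` and a synthetic input `u = sin(i)`. The bf16 path does the math at full precision and rounds only on store, mirroring the load->f32 / store->bf16 design. `bf16_round` is a hypothetical helper.

```python
import math
import struct


def bf16_round(x):
    # Round to bf16 (nearest-even on the top 16 bits of the f32 pattern).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def max_state_error(steps, g=0.9):
    s_ref, s_bf16, worst = 0.0, 0.0, 0.0
    for i in range(steps):
        u = math.sin(i)
        s_ref = g * s_ref + u                  # full-precision reference state
        s_bf16 = bf16_round(g * s_bf16 + u)    # math in full precision, store as bf16
        worst = max(worst, abs(s_bf16 - s_ref))
    return worst


def analytic_bound(g=0.9, eps=2 ** -8):
    # Per-step rounding error <= eps * |state|; old error decays by g each step,
    # so the accumulated error is a geometric series bounded by this value.
    q = (1 + eps) / (1 - g * (1 + eps))        # bound on |bf16 state| for |u| <= 1
    return eps * (g * q + 1) / (1 - g)


errors = {n: max_state_error(n) for n in (256, 1024, 4096)}
bound = analytic_bound()
```

Because `g < 1`, the error at 4096 steps stays under the same bound as at 256 steps instead of growing with context length, which is the behaviour the sweep's "flat, not accumulating" criterion looks for.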
|
||||
f32 opt-out verification (the safety valve must actually select f32 and reproduce the committed f32
|
||||
greedy md5 from 0018/0019 - the bf16 default must NOT change the f32-path output):
|
||||
```
llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm f32 2>/dev/null | md5sum   # == 0018/0019 f32 baseline md5
```
|
||||
Repeat the WHOLE gate verbatim for the MoE model (`M=~/bench/q36-35b-a3b-nvfp4.gguf`).
|
||||
|
||||
PASS/FAIL (bf16 ships as DEFAULT only if ALL rows pass for BOTH dense and MoE):
|
||||
|
||||
| metric | source | PASS threshold |
|---|---|---|
| Mean KLD | 256-gate (b) | **< 1e-3 nats** (hard, the brief) |
| Mean KLD vs floor | (b) vs floor | <= ~5x `Mean KLD_floor` (bounded signal, not pure noise) |
| Same top p | (b) | **>= 99.5%** (100% => greedy byte-identical to f32) |
| PPL-delta `ln(PPL_bf16/PPL_f32)` | (a)+(b) | **abs < 0.005** (PPL within +-0.5%) |
| Max / 99.9% KLD | (b) | report; flag if Max > 0.05 (tail outliers) |
| Coherence | greedy | fluent + on-topic; byte-identical if Same-top-p=100% |
| Long-context drift | sweep | MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0% |
|
||||
|
||||
If any row fails for a model: keep THAT model on f32 (gallery YAML `cache_type_ssm: f32`) while the
|
||||
global default stays bf16; the cparams f32 fallback is the safety valve. MoE has fewer GDN layers
|
||||
(31 vs 48) and fewer heads per layer (H_v=32 vs 48, i.e. a smaller per-layer state), so expected KLD <= dense; same thresholds.
|
||||
Same-top-p is the bridge to the old md5 harness: at 100% the bf16 greedy output is byte-identical to
|
||||
f32 and the 0018/0019 md5 gate would still pass - the strongest possible non-bit-exact result.
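
The numeric rows of the PASS/FAIL table can be combined into one predicate. This is a hedged sketch: `acceptance_gate` and its parameter names are hypothetical, the floor-relative and Max-KLD rows and the manual coherence read are left out, and the PPL values in the example calls are placeholders chosen to straddle the 0.005 threshold, not measurements.

```python
import math


def acceptance_gate(mean_kld, same_top_p, ppl_bf16, ppl_f32,
                    mean_kld_4096, same_top_p_4096):
    # Each entry mirrors one hard-threshold row of the table.
    checks = {
        "mean_kld": mean_kld < 1e-3,
        "same_top_p": same_top_p >= 99.5,
        "ppl_delta": abs(math.log(ppl_bf16 / ppl_f32)) < 0.005,
        "drift": mean_kld_4096 <= 1.5 * mean_kld and same_top_p_4096 >= 99.0,
    }
    return all(checks.values()), checks


# Placeholder inputs (NOT measured): PPL ratio just inside / just outside +-0.5%.
ok, ok_checks = acceptance_gate(5e-4, 99.7, 7.03, 7.00, 6e-4, 99.4)
bad, bad_checks = acceptance_gate(5e-4, 99.7, 7.04, 7.00, 6e-4, 99.4)
```

Returning the per-row dictionary alongside the verdict makes it easy to see which row sends a model back to the f32 opt-out.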
|
||||
|
||||
## C.3 Bench targets + nsys confirmation
|
||||
|
||||
Dense q36-27b-nvfp4 (48 GDN layers, S_v=128, H_v=48), npl128, GB10/sm_121, graphs-OFF
|
||||
apples-to-apples (the measured baseline):
|
||||
- Recurrence per call: 3.98 ms (f32, 805 MB R+W, 74% peak) -> **~2.0-3.0 ms** (bf16, ~413 MB R+W).
|
||||
2.0 ms = 74% peak retained; 3.0 ms = conservative 50% peak on the smaller footprint.
|
||||
- Recurrence per step: 191 ms -> ~96-143 ms (save ~48-95 ms).
|
||||
- Step time: 384 ms -> **289-339 ms**.
|
||||
- Decode throughput: ~335 -> **360-443 tok/s** = parity-to-ahead of vLLM (327 ms / 391 tok/s).
|
||||
|
||||
MoE q36-35b-a3b-nvfp4 (31 GDN layers, H_v=32): state per (seq,layer) = 128*128*32*4 = 2.0 MiB f32 ->
|
||||
per-call R+W ~537 MB f32 -> ~268 MB bf16. Fewer layers + smaller state => smaller ABSOLUTE recurrence
|
||||
savings, and MoE decode is more GEMM-bound (the `MUL_MAT_ID` expert path), so the bf16-state win is a
|
||||
smaller FRACTION of the MoE step. Target: a measurable per-call halving of the GDN recurrence time
|
||||
with the C.2 KL gate passing; no absolute MoE step target is asserted here (the MoE step is
|
||||
MUL_MAT_ID-dominated, a separate lever from this one).
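
The byte figures above follow from the state shape alone. A small sketch of that arithmetic, assuming each recurrence call reads and writes every sequence's state exactly once and `npl128` means 128 sequences; the function names are illustrative. It covers only the state term, not any activation traffic.

```cpp
#include <cstdint>

// Bytes of recurrent state for one (sequence, layer): an S_v x S_v matrix per head.
constexpr std::uint64_t state_bytes(std::uint64_t s_v, std::uint64_t h_v,
                                    std::uint64_t elem_size) {
    return s_v * s_v * h_v * elem_size;
}

// Bytes moved by one recurrence call: each sequence's state is read once and written once.
constexpr std::uint64_t rw_bytes_per_call(std::uint64_t s_v, std::uint64_t h_v,
                                          std::uint64_t elem_size, std::uint64_t n_seq) {
    return 2 * state_bytes(s_v, h_v, elem_size) * n_seq;
}

// Dense: S_v=128, H_v=48, 128 sequences, f32 (4 bytes) -> ~805 MB per call.
static_assert(rw_bytes_per_call(128, 48, 4, 128) == 805306368ULL, "dense f32");
// MoE: S_v=128, H_v=32 -> 2 MiB per (seq, layer), ~537 MB per call in f32.
static_assert(state_bytes(128, 32, 4) == 2ULL * 1024 * 1024, "MoE state");
static_assert(rw_bytes_per_call(128, 32, 4, 128) == 536870912ULL, "MoE f32");
// bf16 (2 bytes) halves the state term: ~268 MB per call.
static_assert(rw_bytes_per_call(128, 32, 2, 128) == 268435456ULL, "MoE bf16");
```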
|
||||
|
||||
nsys confirmation (the measurement that proves the lever landed):
|
||||
```
GGML_CUDA_DISABLE_GRAPHS=1 nsys profile -o ssmbf16 --force-overwrite true \
    llama-batched-bench -m $M -npp 8 -ntg 12 -npl 128 -ub 2048
nsys stats --report cuda_gpu_kern_sum ssmbf16.nsys-rep | grep -i gated_delta_net
```
|
||||
Confirm: `gated_delta_net_cuda` mean duration/call drops 3.98 -> 2.0-3.0 ms; step time + tok/s land in
|
||||
the 289-339 ms / 360-443 tok/s band; the f32 opt-out reproduces the 3.98 ms f32 call. The gate is the
|
||||
JOINT condition: per-call speed in band AND KL<1e-3 - neither alone ships bf16.
|
||||
|
||||
## C.4 Default / opt-out semantics, back-compat, risk register
|
||||
|
||||
Semantics:
|
||||
- DEFAULT `type_s` = `GGML_TYPE_BF16` (SSM recurrent state). `type_r` = `GGML_TYPE_F32` in v1 (conv
|
||||
state; bf16 is v2). This is the INVERSE of KV (KV is opt-IN to compression at F16 default; SSM is
|
||||
opt-OUT to f32).
|
||||
- Opt-out: `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML) -> bit-exact f32
|
||||
recurrence. Per-model opt-out lives in gallery YAML if a model fails the gate; the global default
|
||||
stays bf16.
|
||||
- Silent capability fallback: the C.1 STEP 6 device-match probe demotes `type_s` to F32 before alloc
|
||||
on devices lacking the bf16 GDN specialization (CPU offload / non-GB10) and logs once.
|
||||
|
||||
Back-compat (the ONE real breakage): `llama-memory-recurrent.cpp` serializes the per-layer state
|
||||
dtype and HARD-matches on restore (mismatch -> `"mismatched s type"` -> returns false). The f32->bf16
|
||||
default flip makes OLD f32-saved sessions fail to restore against a bf16 build. Fix = STEP 3a: convert
|
||||
row-by-row on mismatch (both in {F32,BF16}) + bump the recurrent state-file version. KV never hit this
|
||||
because `type_k`/`type_v` were EXPERIMENTAL and never default-changed; the SSM default FLIP is what
|
||||
forces the convert/version work.
|
||||
|
||||
Risk register:
|
||||
- **R1 numeric drift (KL gate fails).** Likelihood LOW: the `exp(g)<1` geometric decay contracts per-step bf16
rounding to a bounded series (~`eps/(1-exp(g_mean))`), f32 registers confine rounding to one
per-step cache write, and vLLM ships this exact config in production. Mitigation: C.2 gate +
per-model f32 opt-out + global f32 fallback.
|
||||
- **R2 prefill / keep_rs_t / gather state path (the silent-corruption landmine).** The conversion
|
||||
points are documented for DECODE; the SAME kernel also runs the chunked prefill path, the keep_rs_t
|
||||
snapshot (writes to f32 scratch while the cache is bf16), and the 0019 gather (reads bf16 cache ->
|
||||
f32 scratch). A dtype mistake on any of these corrupts the state at the prefill->decode handoff and
|
||||
surfaces ONLY as long-context drift, which a decode-only 256-token gate can mask. Mitigation: STEP
|
||||
1e test-backend-ops MUST cover the multi-token prefill + keep_rs_t==true path, not just decode; the
|
||||
C.2 long-context sweep is the second net. (This is C.5, the single biggest risk.)
|
||||
- **R3 MoE MUL_MAT_ID path.** The GDN recurrence op is IDENTICAL for dense and MoE; the MoE expert
|
||||
GEMM (`MUL_MAT_ID`) does NOT touch the SSM state, so bf16-state is orthogonal to the expert path.
|
||||
Residual risk: `qwen35moe` `build_recurrent_attn` must route the same bf16 state view (it shares
|
||||
delta-net-base.cpp). Mitigation: run the full C.2 gate on the MoE model; the test-backend-ops case
|
||||
is arch-agnostic.
|
||||
- **R4 conv-state coupling with patch 0021.** Flipping `type_r` to bf16 breaks `ggml_concat` at
|
||||
prefill (different types). Mitigation: v1 keeps `type_r`=F32 (C.0); `type_r`=bf16 deferred to v2
|
||||
with an f32 staging view (PART B §B.6).
|
||||
- **R5 back-compat restore failure.** Mitigation: STEP 3a convert + version bump (above).
|
||||
|
||||
## C.5 Single biggest risk + how the build agent de-risks it FIRST
|
||||
|
||||
Single biggest risk: **R2 - silent state corruption on the NON-decode state paths** (chunked prefill,
|
||||
the keep_rs_t snapshot, the 0019 gather). The 805 MB measurement and every conversion-point in the
|
||||
cheat-sheet describe the STEADY decode path (`n_tokens==1`, `!keep_rs_t`). But the bf16 cache is ALSO
|
||||
read/written by the multi-token prefill path and the prefill/rollback snapshot (which targets f32
|
||||
scratch while the cache is bf16). A dtype bug there does not crash and barely moves the 256-token
|
||||
decode md5; it corrupts the recurrent state at the prefill->decode boundary and shows up ONLY as
|
||||
long-context drift - exactly the failure a quick gate misses.
|
||||
|
||||
De-risk FIRST (before ANY default flip or wiring): implement STEP 1 (kernel + op dtype-generic) and
|
||||
STEP 1e (test-backend-ops) ONLY, then prove the kernel is dtype-correct in ISOLATION by forcing a
|
||||
bf16 state allocation behind a temporary debug flag and running test-backend-ops with a case that
|
||||
exercises (a) single-token decode, (b) a multi-token prefill chunk, and (c) `keep_rs_t==true`,
|
||||
comparing CUDA bf16 against the CPU bf16 reference AND against the f32 path within tolerance. Only
|
||||
after that case is GREEN does the build agent proceed to STEP 2 (flip the default) and the C.2
|
||||
model-level gate. This decouples kernel dtype-correctness from the cparams wiring, so a Step-1 bug is
|
||||
caught by a deterministic op test in minutes instead of as a fuzzy long-context regression after the
|
||||
full stack is wired.
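
The isolation test comes down to comparing two output buffers within a tolerance. A minimal sketch of such a comparison using a normalized mean squared error; the function name and the idea of feeding it the bf16-state output against the f32-state output are illustrative assumptions, not the actual test-backend-ops code, and the tolerance is left to the caller.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Normalized mean squared error between a candidate and a reference buffer:
// sum((a-b)^2) / sum(b^2). Identical buffers give 0; mismatched sizes give +inf.
double nmse(const std::vector<float> & candidate, const std::vector<float> & reference) {
    if (candidate.size() != reference.size()) {
        return std::numeric_limits<double>::infinity();
    }
    double err = 0.0;
    double ref = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double d = (double) candidate[i] - (double) reference[i];
        err += d * d;
        ref += (double) reference[i] * (double) reference[i];
    }
    // An all-zero reference cannot be normalized; fall back to the raw error.
    return ref > 0.0 ? err / ref : err;
}
```

The same check would run three times, once per case named above: (a) single-token decode, (b) a multi-token prefill chunk, and (c) `keep_rs_t==true`.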
|
||||
|
||||
---
|
||||
|
||||
# bf16 SSM state cache — cparams wiring (DEFAULT bf16 + f32 opt-out)
|
||||
|
||||
Label: cparams-default-fallback (READ-ONLY design). Mirrors the KV-cache `type_k`/`type_v`
|
||||
precision plumbing exactly. Designed against HEAD-after-patch-0021 (conv-state in-place fusion).
|
||||
|
||||
This is lever (2) of GDN_RECURRENCE_BYTE_GATE.md: the recurrent SSM state cache is the dominant
|
||||
decode byte stream (805 MB R+W/call, 51.6% of step, single-pass f32 = at the BW floor). The whole
|
||||
~2x DRAM gap vs vLLM is f32(llama) vs bf16(vLLM) state width. Storing the persisted state in bf16
|
||||
(load→f32, recurrence math in f32 UNCHANGED, store→bf16) halves the dominant term. vLLM's GDN state
|
||||
cache is bf16, so bf16-default is the fair equal-precision comparison → make it the DEFAULT.
|
||||
|
||||
---
|
||||
|
||||
## 1. The KV-cache template we mirror (exact chain for type_k / type_v)
|
||||
|
||||
```
CLI    common/arg.cpp:2052      -ctk/--cache-type-k TYPE  →  params.cache_type_k
                                (common_params, common/common.h:566, default GGML_TYPE_F16)
         ↓
glue   common/common.cpp:1589   cparams.type_k = params.cache_type_k   (cparams = llama_context_params)
         ↓
API    include/llama.h:365      llama_context_params.type_k            // [EXPERIMENTAL]
       llama-context.cpp:3468   default in llama_context_default_params() = GGML_TYPE_F16
         ↓
mem    llama-context.cpp:326    llama_memory_params params_mem.type_k = params.type_k
       llama-memory.h:19        struct llama_memory_params { ggml_type type_k; type_v; ... }
         ↓
alloc  llama-model.cpp:2030     create_memory(params_mem, cparams)  →  KV cache uses params.type_k
```
|
||||
|
||||
Key facts:
|
||||
- `type_k`/`type_v` are NOT stored in `struct llama_cparams` (src/llama-cparams.h). They ride in
|
||||
`llama_context_params` → `llama_memory_params` and are consumed directly at cache-alloc time.
|
||||
We mirror that: NO new `llama_cparams` field is needed.
|
||||
- KV default is opt-IN to compression (F16 default, pass `-ctk q8_0` to shrink). SSM is the INVERSE:
|
||||
bf16 DEFAULT, pass an explicit `f32` to opt out / restore bit-exactness.
|
||||
|
||||
## 2. Where the SSM state type is currently hardcoded (the targets)
|
||||
|
||||
The recurrent cache constructor already accepts the types — only the model hardcodes F32:
|
||||
|
||||
- `src/llama-memory-recurrent.cpp:22-23` ctor params `ggml_type type_r, type_s`
|
||||
- `r_l` (line 100, `n_embd_r`) = short conv state → `type_r` (TINY: conv_width-1 taps × conv_dim)
|
||||
- `s_l` (line 101, `n_embd_s`) = SSM recurrent state → `type_s` (THE 805 MB/call dominant)
|
||||
- `src/llama-memory-hybrid.h:32-33` ctor params `type_r, type_s` (qwen35 / qwen35moe path)
|
||||
- Hardcoded `GGML_TYPE_F32` call sites in `src/llama-model.cpp::create_memory`:
|
||||
- 2056-2057 `llama_memory_recurrent(...)` (pure recurrent arches)
|
||||
- 2098-2099 `llama_memory_hybrid_iswa(...)` recurrent_type_r / recurrent_type_s
|
||||
- 2117-2118 `llama_memory_hybrid(...)` recurrent_type_k / recurrent_type_v (mislabeled; they are r/s)
|
||||
|
||||
Note: `qwen35` / `qwen35moe` are HYBRID (filter_attn/filter_recr, no SWA) → they take the
|
||||
`llama_memory_hybrid` branch (2108-2118). That is the call site that matters for the parity push.
|
||||
|
||||
## 3. New plumbing (parallel chain `type_s` / `type_r`)
|
||||
|
||||
### 3a. Public API + cparams glue (llama.cpp side)
|
||||
|
||||
| File | Change |
|------|--------|
| `include/llama.h` (after :366) | Add `enum ggml_type type_s; // data type for recurrent SSM state cache [EXPERIMENTAL]` and `enum ggml_type type_r; // data type for recurrent conv state cache [EXPERIMENTAL]`. Place adjacent to `type_k`/`type_v`. |
| `src/llama-context.cpp:3468` (default params) | Add `/*.type_s =*/ GGML_TYPE_BF16,` and `/*.type_r =*/ GGML_TYPE_BF16,`. **This is the DEFAULT change.** |
| `src/llama-memory.h:19` (`struct llama_memory_params`) | Add `ggml_type type_r;` and `ggml_type type_s;` next to `type_k`/`type_v`. |
| `src/llama-context.cpp:325` (`params_mem` init) | Add `/*.type_r =*/ params.type_r,` and `/*.type_s =*/ params.type_s,`. |
| `src/llama-model.cpp` 2056-57 / 2098-99 / 2117-18 | Replace the 3 hardcoded `GGML_TYPE_F32` pairs with `params.type_r` / `params.type_s`. |
|
||||
|
||||
### 3b. CLI / llama-server (common side)
|
||||
|
||||
| File | Change |
|------|--------|
| `common/common.h:566` region | Add `ggml_type cache_type_ssm = GGML_TYPE_BF16;` and `ggml_type cache_type_conv = GGML_TYPE_BF16;` (mirror `cache_type_k/v`; note the DEFAULT is BF16, not F16). |
| `common/common.cpp:1589` region | Add `cparams.type_s = params.cache_type_ssm;` and `cparams.type_r = params.cache_type_conv;`. |
| `common/arg.cpp` (after :2074) | Add `--cache-type-ssm TYPE` (`-ctssm`) → `params.cache_type_ssm = kv_cache_type_from_str(value)`, and `--cache-type-conv TYPE` (`-ctconv`). Reuse the existing `kv_cache_type_from_str` (arg.cpp:402). Help text: "recurrent SSM state cache type (default bf16; pass f32 for bit-exact recurrence)". |
|
||||
|
||||
`kv_cache_type_from_str` already accepts `f32`/`bf16`/`f16` — no change needed; just confirm `bf16`
|
||||
maps to `GGML_TYPE_BF16` (add the case if absent).
|
||||
|
||||
### 3c. LocalAI gRPC backend (so users can force f32 from model YAML)
|
||||
|
||||
Mirror `CacheTypeKey` exactly:
|
||||
|
||||
| File | Change |
|------|--------|
| `backend/backend.proto:419` region | Add `string CacheTypeSSM = NN;` and `string CacheTypeConv = NN;` (next free field tags). Regenerate proto. |
| `backend/cpp/llama-cpp/grpc-server.cpp:504` region | `if (!request->cachetypessm().empty()) params.cache_type_ssm = kv_cache_type_from_str(request->cachetypessm());` and the conv equivalent. (grpc-server already has its own `kv_cache_type_from_str`; ensure it knows `bf16`.) |
| `core/config/model_config.go:935` region | Add `CacheTypeSSM string yaml:"cache_type_ssm,omitempty"` and `CacheTypeConv string yaml:"cache_type_conv,omitempty"`. |
| `core/backend/options.go:247` region | Add `CacheTypeSSM: c.CacheTypeSSM,` and `CacheTypeConv: c.CacheTypeConv,` to the request build. |
| `core/config/meta/registry.go:161` + `core/config/meta/build_test.go:140` | Register `cache_type_ssm` / `cache_type_conv` as static fields (the `staticFields` slice + registry map) so the meta-config gate passes. |
|
||||
|
||||
LocalAI semantics: leaving `cache_type_ssm` UNSET in YAML → empty gRPC string → backend keeps its
|
||||
BF16 default. Setting `cache_type_ssm: f32` → forces the f32 opt-out (bit-exact recurrence).
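
The unset-keeps-default rule can be sketched in a few lines. This is an illustration only: `StateType` and `resolve_ssm_cache_type` are hypothetical stand-ins, and the real backend would go through `kv_cache_type_from_str` and `ggml_type` as described in the table.

```cpp
#include <stdexcept>
#include <string>

// Stand-in for the two ggml types relevant here (the real code uses ggml_type).
enum class StateType { F32, BF16 };

// An empty string (YAML key unset) keeps the BF16 default; "f32" opts out.
// Anything else is rejected rather than silently ignored.
StateType resolve_ssm_cache_type(const std::string & requested) {
    if (requested.empty() || requested == "bf16") {
        return StateType::BF16;
    }
    if (requested == "f32") {
        return StateType::F32;
    }
    throw std::invalid_argument("unsupported SSM cache type: " + requested);
}
```

Rejecting unknown strings loudly is a design choice of this sketch; it keeps a typo in a gallery YAML from quietly running the default precision.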
|
||||
|
||||
## 4. Default / fallback semantics
|
||||
|
||||
- **DEFAULT = `GGML_TYPE_BF16`** for both SSM state (`type_s`) and conv state (`type_r`).
|
||||
- SSM state (`type_s`) is the lever: f32→bf16 halves 805→413 MB/call → ~3.98→~2.0-3.0 ms/call.
|
||||
- Conv state (`type_r`) is negligible bytes; default it bf16 too for consistency, but it can stay
|
||||
f32 with zero perf cost if patch-0021's in-place conv path assumes f32 — see §6.
|
||||
- **Opt-out = `GGML_TYPE_F32`** via `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML).
|
||||
Restores bit-exact recurrence; use when the KL gate (<1e-3 / PPL-delta over 256-tok greedy) fails
|
||||
for a given model, or for deterministic regression baselines.
|
||||
- **Silent capability fallback**: gate the bf16 path behind a device-match probe modeled on
|
||||
`auto_fgdn` (llama-context.cpp:518-595). If the GDN recurrence kernel's bf16 load/store
|
||||
specialization is unavailable on a participating device (e.g. a CPU-offloaded GDN layer with no
|
||||
bf16 op, or a non-GB10 backend), fall back to `GGML_TYPE_F32` for `type_s` BEFORE cache alloc and
|
||||
log it once. This keeps "bf16 default" from breaking heterogeneous/CPU setups.
|
||||
- The kernel contract is unchanged-math: load bf16→f32 into `s_shard` (registers stay f32), all
recurrence arithmetic in f32, store f32→bf16. Only the persisted cache is rounded per step;
the geometric decay (`exp(g)<1`) bounds the rounding error, so it does not accumulate unboundedly.
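
The silent capability fallback is a pre-allocation decision over the participating devices. A minimal sketch, assuming a per-device probe result is already available; `DeviceCaps`, `pick_state_type` and the `demoted` flag are hypothetical names, not the `auto_fgdn` code itself.

```cpp
#include <vector>

// Stand-in for the two ggml types relevant here (the real code uses ggml_type).
enum class StateType { F32, BF16 };

// Hypothetical probe result: can this device run the GDN recurrence
// with a bf16 state cache?
struct DeviceCaps {
    bool gdn_bf16_state;
};

// Decide the state type BEFORE the cache is allocated: bf16 survives only if
// every participating device supports it. `demoted` tells the caller to log once.
StateType pick_state_type(StateType requested, const std::vector<DeviceCaps> & devices,
                          bool & demoted) {
    demoted = false;
    if (requested != StateType::BF16) {
        return requested;  // an explicit f32 opt-out is never touched
    }
    for (const DeviceCaps & d : devices) {
        if (!d.gdn_bf16_state) {
            demoted = true;
            return StateType::F32;
        }
    }
    return StateType::BF16;
}
```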
|
||||
|
||||
## 5. Back-compat (the one real breakage — saved sessions / state files)
|
||||
|
||||
`src/llama-memory-recurrent.cpp` SERIALIZES the per-layer state tensor dtype and does a HARD match
|
||||
on restore:
|
||||
- write: `state_write_data` writes `s_type_i = (int32_t)s_l[il]->type` (line ~900) and the r type.
|
||||
- read: `state_read_data` reads `s_type_i_ref`, compares to current `s_l[il]->type`, and on
|
||||
mismatch logs `"mismatched s type (%d != %d, layer %d)"` and **returns false** (restore FAILS).
|
||||
Same for `r` type.
|
||||
|
||||
Consequence of the default flip f32→bf16:
|
||||
- Sessions SAVED by an old f32-default build will FAIL to RESTORE against a new bf16-default build
|
||||
(and vice versa), because the serialized `s_type_i_ref` (F32) ≠ the new cache type (BF16).
|
||||
|
||||
Required handling (pick one, recommend A):
|
||||
- **A (convert on mismatch, recommended)**: in `state_read_data`, when `s_type_i_ref != current`
and both ∈ {F32, BF16}, convert row-by-row during load (`ggml_fp32_to_bf16` / `ggml_bf16_to_fp32`) instead
of returning false. Same for `r`. Bump the recurrent state-file version so older readers reject
cleanly. This makes old f32 sessions loadable into bf16 caches and round-trips safely.
|
||||
- **B (pin precision to the saved file)**: if a session is being restored, read `s_type_i_ref`
|
||||
first and set `type_s`/`type_r` from it, overriding the default for that context. Keeps restore
|
||||
working but silently disables the bf16 win for resumed sessions.
|
||||
- **C (document-only)**: keep the hard match; document that bf16-default invalidates cross-version
|
||||
saved recurrent states. Lowest effort, worst UX. Not recommended given parity is the goal.
|
||||
|
||||
KV-cache parallel: `type_k`/`type_v` were always EXPERIMENTAL and non-default-changing, so the KV
|
||||
path never had to solve this. The SSM default-FLIP is what forces the convert/version work — call it
|
||||
out as the single most load-bearing back-compat item.
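
Option A amounts to a per-row copy that converts when the saved and live dtypes differ. A self-contained sketch under stated assumptions: bf16 is modelled as a raw 16-bit word with round-to-nearest-even narrowing, and `restore_row` plus `StateType` are hypothetical names. The real change would live in `state_read_data` and use ggml's own conversion helpers.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// f32 -> bf16 with round-to-nearest-even; NaNs are kept as quiet NaNs.
inline std::uint16_t f32_to_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffffu) > 0x7f800000u) {
        return (std::uint16_t) ((u >> 16) | 0x40u);
    }
    return (std::uint16_t) ((u + 0x7fffu + ((u >> 16) & 1u)) >> 16);
}

// bf16 -> f32 is exact: the 16 bits become the top half of the f32 word.
inline float bf16_to_f32(std::uint16_t h) {
    const std::uint32_t u = (std::uint32_t) h << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

enum class StateType { F32, BF16 };

// Copy one serialized row of n elements into the live cache row, converting
// if the saved dtype differs from the cache dtype.
void restore_row(const void * saved, StateType saved_type,
                 void * cache, StateType cache_type, std::size_t n) {
    if (saved_type == cache_type) {
        const std::size_t elem = saved_type == StateType::F32 ? sizeof(float)
                                                              : sizeof(std::uint16_t);
        std::memcpy(cache, saved, n * elem);
    } else if (saved_type == StateType::F32) {
        const float * src = (const float *) saved;
        std::uint16_t * dst = (std::uint16_t *) cache;
        for (std::size_t i = 0; i < n; ++i) dst[i] = f32_to_bf16(src[i]);
    } else {
        const std::uint16_t * src = (const std::uint16_t *) saved;
        float * dst = (float *) cache;
        for (std::size_t i = 0; i < n; ++i) dst[i] = bf16_to_f32(src[i]);
    }
}
```

Loading an f32 session into a bf16 cache rounds once at restore time; loading a bf16 session into an f32 cache is lossless.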
|
||||
|
||||
## 6. Coupling notes / sequencing
|
||||
|
||||
- Land ON TOP of patch 0021 (conv-state in-place fusion). If 0021's fused conv write assumes an f32
|
||||
conv-state tensor, either (a) extend it to the cache tensor's dtype, or (b) keep `type_r` = F32 by
|
||||
default and make ONLY `type_s` bf16 (conv bytes are negligible, so this loses nothing perf-wise and
|
||||
de-risks 0021). Decision: ship `type_s`=BF16 first; make `type_r`=BF16 a follow-up gated on 0021's
|
||||
conv path being dtype-generic.
|
||||
- Kernel side (separate patch, not this wiring): `ggml/src/ggml-cuda/gated_delta_net.cu` currently
|
||||
takes `const float * curr_state` / `float * state_dst` and does `s_shard[r] = read_state[i]`
|
||||
(line 102) — hardcoded f32. The bf16 build needs the dispatch to read `s0->type` and route a
|
||||
bf16 load/store specialization; the gather kernel `gdn_gather_nonident_kernel` (line 7, `const
|
||||
float * cache`) likewise needs a bf16 variant. The cparams wiring here only selects the cache
|
||||
dtype; the kernel patch consumes it. Patches 0018 (in-place) / 0019 (gather) state asserts must be
|
||||
relaxed from f32-only to {f32,bf16}.
|
||||
- CPU mirror `ggml-cpu/ops.cpp` GDN path needs the same bf16 load/store for CI parity / fallback.
|
||||
|
||||
## 7. Validation gate
|
||||
|
||||
- KL < 1e-3 and PPL-delta within tolerance vs the f32-state build over a 256-token greedy run, per
|
||||
model (dense q36-27b-nvfp4, MoE q36-35b-a3b-nvfp4). If a model fails, that model sets
|
||||
`cache_type_ssm: f32` in its gallery YAML (per-model opt-out) — the global default stays bf16.
|
||||
- Add a `test-backend-ops` case for the GDN recurrence with bf16 state (mirror the 0021 harness:
|
||||
dense text md5 + MoE byte check) to lock the load→f32→store→bf16 contract.
|
||||
|
||||
---
|
||||
|
||||
# Appendix - label `upstream-bf16-precedent` (READ-ONLY research)
|
||||
|
||||
Precedent + numeric-safety justification for the §1-7 wiring above. Sources: paged dev tree
|
||||
(`dgx.casa:~/llama-paged-dev`, branch `paged`) and the vLLM checkout
|
||||
(`~/vllm-bench/.../site-packages/vllm`).
|
||||
|
||||
## A.1 Upstream llama.cpp: recurrent-cache f32 is HARDCODED (no f16/bf16 path), not a documented numeric guard
|
||||
|
||||
The asymmetry to override: the attention KV cache type is user-tunable; the recurrent state cache is not.
|
||||
|
||||
- KV cache: `llama_context_params.type_k/type_v` default `GGML_TYPE_F16`
|
||||
(`src/llama-context.cpp:3468-3469`), `[EXPERIMENTAL]` in `include/llama.h:365-366`, plumbed from
|
||||
user params (`attn_type_k = params.type_k`).
|
||||
- Recurrent/SSM cache: `llama_memory_recurrent(... type_r, type_s ...)` and the hybrid wrappers take
|
||||
the recurrent types as ctor args, but EVERY call site in `src/llama-model.cpp` passes the literal
|
||||
`GGML_TYPE_F32` (2056-2057 pure-recurrent; 2098-2099 hybrid-iswa `recurrent_type_r/s`;
|
||||
2117-2118 hybrid `recurrent_type_k/v`). No cparams field feeds these - compile-time constants.
|
||||
So mamba/mamba2/rwkv/falcon-h1/nemotron-h/qwen3.5 ALL get f32 recurrent + conv state unconditionally.
|
||||
- Alloc: `r = ggml_new_tensor_2d(ctx, type_r, ...)`, `s = ggml_new_tensor_2d(ctx, type_s, ...)`
|
||||
(`src/llama-memory-recurrent.cpp:100-101`). No f16 branch anywhere.
|
||||
|
||||
Is f32 a deliberate numeric constraint? Structural, not documented:
|
||||
- `ggml_ssm_conv` / `ggml_ssm_conv_update_inplace` HARD-ASSERT f32 on conv state/kernel/x_cur/dst
|
||||
plus `nb[0]==sizeof(float)` (`ggml/src/ggml.c:5581-5584,5589,5597`). Conv path is f32-locked at the
|
||||
builder.
|
||||
- `ggml_ssm_scan` does NOT assert input state `s` dtype, but hardcodes its OUTPUT as
|
||||
`GGML_TYPE_F32` (`ggml/src/ggml.c:5662`); scan kernels read `s` as `float *`.
|
||||
- `ggml/src/ggml-cuda/gated_delta_net.cu` takes `const float * curr_state`, `float * state`,
|
||||
`float * state_dst`; the per-(seq,head) shard `float s_shard[rows_per_lane]` is loaded/stored as raw
|
||||
float (34-102). Same in `ggml-cpu/ops.cpp`.
|
||||
- NO code comment anywhere justifies "f32 for precision". The constraint is that the ops were written
|
||||
float-only. => recurrent-cache-f32 is a hardcoded implementation default to override deliberately:
|
||||
the 3 literal `GGML_TYPE_F32` call-site pairs (gate behind `type_s`/`type_r` per §3), the
|
||||
gated_delta_net.cu load/store convert, and KEEP conv f32 unless its asserts are extended (conv bytes
|
||||
are negligible - only the temporal `type_s` state needs bf16).
|
||||
|
||||
## A.2 vLLM: GDN temporal state cache is bf16 BY DEFAULT, fp32-accumulated in-kernel (the exact design)
|
||||
|
||||
- Dtype: `qwen3_next.py:780-787` -> `MambaStateDtypeCalculator.gated_delta_net_state_dtype` ->
|
||||
`_mamba_state_dtype` (`mamba_utils.py:84-96`):
|
||||
`conv_state_dtype = get_kv_cache_torch_dtype(mamba_cache_dtype, model_dtype)`;
|
||||
`if mamba_ssm_cache_dtype == "auto": temporal_state_dtype = conv_state_dtype`.
|
||||
With both knobs default `"auto"`, `get_kv_cache_torch_dtype("auto", model_dtype)` returns
|
||||
`model_dtype` (`torch_utils.py:293-297`) = bf16 for Qwen3-Next => BOTH conv and temporal state are
|
||||
bf16 by default. Explicit opt-out: `--mamba-ssm-cache-dtype float32` (mirror of our f32 fallback).
|
||||
- In-kernel numerics (decode), `fla/ops/fused_recurrent.py`:
|
||||
`b_h = tl.load(p_h0).to(tl.float32)` (303) load bf16->fp32; q/k/v/g/beta `.to(tl.float32)` (309-318);
|
||||
recurrence in fp32 `b_h*=exp(g); b_v-=sum(b_h*b_k); b_v*=beta; b_h+=b_v*b_k; b_o=sum(b_h*b_q)`
|
||||
(327-331); `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))` (337) store fp32->bf16. Prefill chunk path
|
||||
identical (`b_h=tl.zeros(...,tl.float32)`, `+= load().to(fp32)`, 102/120).
|
||||
=> byte-for-byte the proposed llama lever: load bf16->f32, math in f32 (UNCHANGED order, matches
|
||||
gated_delta_net.cu's v-g*kv -> *beta -> S-update -> S^T q), store f32->bf16; only the persisted cache
|
||||
crosses the bf16 boundary, once per step.
|
||||
- vLLM numeric guards: NONE beyond fp32 accumulation - no per-step renorm, no clamp, no Kahan. Optional
|
||||
`use_qk_l2norm_in_kernel` normalizes q,k (keeps k unit-norm) but does not touch the state.
|
||||
- KDA nuance: `kda_state_dtype` returns `(state_dtype, torch.float32)` - Kimi Delta Attention keeps a
|
||||
fp32 secondary component. qwen3.5 is `gated_delta_net` (fully-bf16 temporal state), but this shows
|
||||
vLLM judged a fp32 component necessary for one delta variant -> reinforces keeping the f32 toggle.
|
||||
|
||||
Verdict: vLLM's own GDN state cache is bf16, so bf16-state in llama is a FAIR equal-precision target,
|
||||
not a regression vs the competitor. bf16 brings llama TO vLLM's precision.
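
For reference, the fp32 operation order listed above can be written out for a single head. A plain C++ sketch, not either codebase's kernel: the row-major `K x V` state layout, the function name and the argument shapes are assumptions made for illustration, and the bf16 load/store boundary is deliberately left out so that only the unchanged f32 math is shown.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One decode step of the gated delta rule for a single head, all in f32.
// S is a K x V state matrix stored row-major (S[i*V + j]); q and k have K
// entries, v and the returned output have V entries. g is the log-decay and
// beta the write strength.
std::vector<float> gdn_step(std::vector<float> & S, std::size_t K, std::size_t V,
                            const std::vector<float> & q, const std::vector<float> & k,
                            std::vector<float> v, float g, float beta) {
    const float decay = std::exp(g);
    for (float & s : S) s *= decay;                  // b_h *= exp(g)
    for (std::size_t j = 0; j < V; ++j) {
        float kv = 0.0f;
        for (std::size_t i = 0; i < K; ++i) kv += S[i * V + j] * k[i];
        v[j] = (v[j] - kv) * beta;                   // b_v -= sum(b_h*b_k); b_v *= beta
    }
    std::vector<float> out(V, 0.0f);
    for (std::size_t j = 0; j < V; ++j) {
        for (std::size_t i = 0; i < K; ++i) {
            S[i * V + j] += k[i] * v[j];             // b_h += b_v * b_k
            out[j] += S[i * V + j] * q[i];           // b_o = sum(b_h * b_q)
        }
    }
    return out;
}
```

In the bf16-state design, `S` would be widened from the cache before this function and narrowed back after it, with nothing inside the function changing.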
|
||||
|
||||
## A.3 Numeric-safety assessment for bf16 gated-DeltaNet state
|
||||
|
||||
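Update: `S <- S' + beta * k (x) (v - S' k)` where `S' = S*diag(exp(g))` is the decayed state (the decay is
applied before the `S k` readout, matching the A.2 kernel order), with
`g = -exp(A_log)*softplus(a+dt_bias) < 0` so `exp(g) in (0,1)` (strict geometric decay) and
`beta = sigmoid(.) in (0,1)`.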
|
||||
|
||||
- Decay bounds error accumulation. bf16 = 8 significand bits (7 stored + 1 implicit) -> per-element rel rounding
`eps ~= 2^-8 ~= 3.9e-3`. An error injected at step t is multiplied by `exp(g)<1` every later step ->
carry-error is a CONTRACTING geometric series bounded by ~`eps/(1-exp(g_mean))`, a small constant
multiple of one step's eps, NOT linear/unbounded. The recurrence is a contraction map - no
divergence. (The "per-step renorm" framing is not a literal renorm op in either codebase; the bound
IS the `exp(g)<1` contraction + `beta in (0,1)` + unit-norm k from the l2norm capping `||k (x) delta||`.)
|
||||
- fp32 register accumulation is the minimal-error placement: load bf16->f32, do `S k`, `v-g*kv`,
|
||||
`*beta`, the outer-product accumulate and `S^T q` ALL in fp32 (UNCHANGED math), store f32->bf16 once.
|
||||
Identical to vLLM, which ships this as the Qwen3-Next default with no reported quality regression -
|
||||
the strongest empirical safety evidence.
|
||||
- Dominant risk is small KL/PPL drift, not instability. Gate KL<1e-3 + PPL-delta over 256-tok greedy
|
||||
vs the f32 build; fall back to f32 via the §3c toggle if it fails. Keep conv state f32 (ssm_conv* is
|
||||
f32-locked, conv bytes negligible) - no reason to risk it.
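
The bounded-error argument can be checked on a scalar toy model. The sketch below is not the GDN matrix recurrence: it runs `s <- a*s + x` with `a` standing in for `exp(g)`, once with the stored state kept in f32 and once with the stored state rounded to bf16 after every step, and tracks the largest gap. The helper names are illustrative, and NaN handling is omitted from the rounding helper for brevity.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Round an f32 value to the nearest bf16 (ties to even) and widen it back.
// NaN inputs are not handled in this sketch.
inline float round_to_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    u = (u + 0x7fffu + ((u >> 16) & 1u)) & 0xffff0000u;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Runs s <- a*s + x for `steps` steps twice: once keeping s in f32, once
// rounding the stored s to bf16 after every step (the math itself stays f32).
// Returns the largest absolute gap seen between the two trajectories.
float max_state_gap(float a, float x, int steps) {
    float s_f32 = 0.0f;
    float s_bf16 = 0.0f;
    float gap = 0.0f;
    for (int t = 0; t < steps; ++t) {
        s_f32 = a * s_f32 + x;
        s_bf16 = round_to_bf16(a * s_bf16 + x);
        gap = std::fmax(gap, std::fabs(s_f32 - s_bf16));
    }
    return gap;
}
```

With `a = 0.9` and `x = 1` the state settles near `x/(1-a) = 10`, so the contraction argument predicts a gap on the order of `eps*10/(1-a)`, roughly 0.4, however many steps are run. The gap should plateau rather than grow with `steps`.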
|
||||
|
||||
Bottom line: (1) upstream recurrent-cache f32 is a hardcoded implementation default (conv asserts f32;
|
||||
scan/gdn kernels float-only; no numeric-rationale comments) - override via §3's `type_s`/`type_r`
|
||||
plumbing, bf16-default + f32 opt-out, touching only the temporal state. (2) vLLM's GDN temporal state
|
||||
is bf16 by default (auto->model_dtype), fp32-accumulated, with `--mamba-ssm-cache-dtype float32`
|
||||
opt-out - a fair equal-precision target. (3) bf16 GDN state is numerically safe: `exp(g)<1` decay contracts
|
||||
rounding to a bounded geometric series, fp32 registers confine bf16 rounding to one per-step cache
|
||||
write, and vLLM ships this exact config in production. KL<1e-3 / PPL gate + f32 fallback is the right
|
||||
safety net.
|
||||
|
||||
---
|
||||
|
||||
# PART B - label `bf16-kernel-plumbing` (the kernel/op edits §6 defers)
|
||||
|
||||
Part A wires the cache DTYPE selection (cparams -> memory_params -> `s_l`/`r_l` alloc). Part B is the
|
||||
consuming half: every kernel/op that reads or writes those caches, and the exact
|
||||
load->f32->compute(f32, UNCHANGED)->store->bf16 conversion points. Traced against HEAD-after-0021 on
|
||||
`dgx.casa:~/llama-paged-dev` (branch `paged`).
|
||||
|
||||
## B.1 Complete set of state-cache READERS/WRITERS (one op family only)
|
||||
`s_l` (ssm_states_all) reaches compute through exactly ONE op family - the gated-DeltaNet recurrence -
|
||||
via a strided VIEW from `build_rs` (graph base) that carries the cache dtype. The cache-touching srcs:
|
||||
- `src[5]` `src_state` - the s0 read view (the cache, or the 0019 gather scratch).
|
||||
- `src[6]` `src_state_dst` - the 0018 in-place write-back target (a view INTO the cache).
|
||||
- `src[7]` `ids` - I32 seq map for the 0019 gather (no dtype concern).
|
||||
No other op reads `s_l`. `build_rs` only re-strides (dtype rides through); the 0019
|
||||
`gdn_gather_nonident_kernel` is the only other reader. So bf16 awareness localizes to: the 3 ggml.c
|
||||
builders (asserts), cuda `supports_op`, `gated_delta_net.cu`, and the CPU mirror in `ops.cpp`.
|
||||
|
||||
## B.2 ggml.c builder asserts (relax F32-only -> {F32,BF16})
|
||||
File `ggml/src/ggml.c`:
|
||||
- `ggml_gated_delta_net` (6287): line 6308 `GGML_ASSERT(state->type == GGML_TYPE_F32)` ->
|
||||
`... == GGML_TYPE_F32 || ... == GGML_TYPE_BF16`.
|
||||
- `ggml_gated_delta_net_inplace` (6349): same `state` assert (~6366-6370) + any `src_state_dst`
|
||||
type assert -> allow BF16.
|
||||
- `ggml_gated_delta_net_inplace_ids` (6417): same `state` + `src_state_dst` relax.
|
||||
- KEEP the op OUTPUT scratch f32: line 6327 `ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne)` stays. The
|
||||
`[attn_scores | new_states]` output is a TRANSIENT graph tensor; the bf16 persisted write goes
|
||||
through `src_state_dst`/`state` (in-place). The non-in-place fallback `cpy`s scratch->cache and
|
||||
`ggml_cpy` already type-converts f32->bf16.
|
||||
|
||||
## B.3 CUDA supports_op
|
||||
`ggml/src/ggml-cuda/ggml-cuda.cu`, `supports_op` case `GGML_OP_GATED_DELTA_NET` (3096): allow a BF16
|
||||
`src[5]`/`src[6]` (add BF16 to the permitted state-src types).
|
||||
|
||||
## B.4 CUDA recurrence kernel `ggml/src/ggml-cuda/gated_delta_net.cu`
|
||||
Template the kernel + gather + launch on the CACHE-pointer dtype (`bool STATE_BF16`); keep f32 valid so
|
||||
the f32 opt-out is the SAME kernel. Include `<cuda_bf16.h>`; convert with `__bfloat162float` /
|
||||
`__float2bfloat16`. ALL recurrence math (lines 106-200) stays in f32 registers, byte-for-byte UNCHANGED.
|
||||
- Signatures: line 39 `const float * curr_state` -> `const STATE_T * curr_state`; line 57
|
||||
`float * state_dst` -> `STATE_T * state_dst`; `read_state` (85-88) -> `const STATE_T * read_state`.
|
||||
- LOAD (s0 -> f32 regs), lines 100-103:
|
||||
`if constexpr (STATE_BF16) s_shard[r]=__bfloat162float(read_state[i]); else s_shard[r]=read_state[i];`
|
||||
`s_shard` stays `float`.
|
||||
- STORE-BACK (f32 regs -> bf16 cache):
|
||||
- non-keep final write (203-208): `state[col*S_v+i] = STATE_BF16 ? __float2bfloat16(s_shard[r]) : s_shard[r];`
|
||||
- keep_rs_t snapshot (191-200) targets `dst + attn_score_elems` = the f32 OUTPUT scratch (kept f32
|
||||
per B.2); this is the prefill/rollback path (n_rs_seq>0), decode is `!keep_rs_t`. KEEP it f32.
|
||||
Only the CACHE pointers (`curr_state` src[5], `state_dst` src[6]) are STATE_T.
|
||||
- 0019 gather `gdn_gather_nonident_kernel` (7-30): `const float * cache` -> `const STATE_T * cache`;
|
||||
`dst[i] = STATE_BF16 ? __bfloat162float(src[i]) : src[i];`. Keep `scratch` OUTPUT f32 (pool alloc
|
||||
326-333 stays `ggml_cuda_pool_alloc<float>`) so the non-identity read path feeds f32; the identity
|
||||
in-place path reads bf16 directly. `read_state`'s dtype follows the branch that selected it.
|
||||
- Dispatcher (270-353):
|
||||
- casts 299/323 `(const float *)src_state->data`, 312 `(float *)src_state_dst->data` ->
|
||||
`(const nv_bfloat16 *)` / `(nv_bfloat16 *)` when `type == GGML_TYPE_BF16`; branch launch on type.
|
||||
- asserts 309-311: `src_state_dst->type == GGML_TYPE_F32` -> allow BF16; `nb[0] == sizeof(float)` ->
|
||||
`== ggml_type_size(type)`; `nb[1] == S_v*S_v*H*sizeof(float)` -> `... * ggml_type_size(type)`.
|
||||
- q/k/v/g/beta strides (348-353) are ACTIVATION (f32) strides - UNCHANGED. Kernel indexes state by
|
||||
ELEMENT (`col*S_v+i`), so the typed pointer halves the byte stride implicitly.
|
||||
- `launch_gated_delta_net` (212-) + S_v switch (230-260): thread `STATE_BF16` into the
|
||||
`gated_delta_net_cuda<S_v, KDA, keep_rs_t, STATE_BF16>` instantiations.
|
||||
|
||||
## B.5 CPU reference `ggml/src/ggml-cpu/ops.cpp` (parity / CI / CPU-offload fallback)
|
||||
`ggml_compute_forward_gated_delta_net_one_chunk` (10662) + `_f32` (10847), dispatch (10915):
|
||||
- LOAD: 10726 `const float * state_in_base = (const float *)src_state->data`, the rs_head/gather read
|
||||
10744-10745, and 10891 `const float * cache = (const float *)src_state->data` -> when
|
||||
`src_state->type == GGML_TYPE_BF16`, read `GGML_BF16_TO_FP32(((const ggml_bf16_t*)..)[..])`.
|
||||
- STORE: 10758-10762 `inplace_state_base = (float *)src_state_dst->data` -> store
|
||||
`((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`; relax asserts `nb[0]`/`nb[1]`
|
||||
to `ggml_type_size(type)`. Keep ONE impl, branch load/store on `src_state->type`.
|
||||
|
||||
## B.6 Conv state (`r_l`) -> bf16 : DEFER (optional, low-value, prefill snag)
|
||||
Conv state ~12.6 MB total, LAUNCH-bound (0021 removed concat/cpy); bf16 saves ~0 ms, adds complexity:
|
||||
- DECODE (0021 fused) `ggml_ssm_conv_update_inplace` (ggml.c:5566) asserts 5581-5584
|
||||
`conv_states/conv_state_dst->type == F32`; CUDA `ssm_conv_update_f32` (ssm-conv.cu:131) + CPU
|
||||
`ggml_compute_forward_ssm_conv_update_f32` (ops.cpp:9471) read/write f32. To bf16: relax the 2
|
||||
asserts, template tap LOAD (`__bfloat162float`) + ring write-back STORE (`__float2bfloat16`), cast
|
||||
`conv_states`/`conv_state_dst` ptrs in both dispatchers.
|
||||
- PREFILL (non-fused) `build_conv_state` (delta-net-base.cpp:449-524): `conv_states=build_rs(...)`
(bf16 view) then `ggml_concat(conv_states, qkv_mixed, 0)` (472). **`ggml_concat` requires same type**,
and qkv_mixed is f32, so a bf16 conv cache BREAKS the prefill concat (needs an f32 staging view of the
taps first; the ring write-back `ggml_cpy` at 496/520 already converts; concat is the blocker).
|
||||
RECOMMENDATION: keep `type_r` = F32 in v1 (matches Part A §6). Ship `type_s`=BF16 first; `type_r`=BF16
|
||||
is a follow-up that adds the f32 staging view.
|
||||
|
||||
## B.7 Confirm UNTOUCHED: full-attn KV-cache (16 layers) + FP4 weights
|
||||
- KV-cache: the `llama_kv_cache` half of `llama_memory_hybrid`, alloc with `params.type_k/type_v`
|
||||
(llama-model.cpp 2030-2031 / 2089-2090 / 2108-2109). Part A changes ONLY the recurrent half's
|
||||
`type_s`; `attn_type_k`/`attn_type_v` untouched. Paged-KV gather (0003-0011), flash-attn,
|
||||
`type_k()/type_v()` accessors (kv-cache.h 161-162/381-382) unaffected.
|
||||
- FP4 weights (nvfp4 dense + MoE): model weights, separate from runtime state caches; recurrence/conv
|
||||
kernels read STATE not weights. FP4 GEMM (0017/0020) untouched.
|
||||
- Activations (q/k/v/g/beta, attn-out, z) stay f32 (<1% of bytes). Only persisted `s_l` rows narrow.
|
||||
|
||||
## B.8 Conversion-point cheat-sheet (the ONLY numeric-precision boundaries)
|
||||
1. CUDA load `gated_delta_net.cu` ~102: `s_shard[r] = __bfloat162float(read_state[i])`.
|
||||
2. CUDA store ~207: `state[col*S_v+i] = __float2bfloat16(s_shard[r])`.
|
||||
3. CUDA gather ~20: `dst[i] = __bfloat162float(src[i])` (bf16 cache -> f32 scratch).
|
||||
4. CPU load `ops.cpp` ~10726/10744/10891: `GGML_BF16_TO_FP32(((ggml_bf16_t*)src_state->data)[..])`.
|
||||
5. CPU store ~10762: `((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`.
|
||||
Everything between (1)/(4) and (2)/(5) is f32-register math, identical to today's f32 kernel. Only the
|
||||
persisted cache rounds to bf16 once per step; g<1 geometric decay bounds the rounding.
|
||||
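The five conversion points above all reduce to one exact widening (bf16 -> f32) and one rounding narrowing (f32 -> bf16). The following is a minimal Python sketch of that contract, assuming round-to-nearest-even on the store; the helper names (`bf16_store`, `bf16_load`, `bf16_round`) are illustrative and are not ggml or CUDA APIs, and NaN handling is omitted.

```python
import struct

def f32_to_bits(x):
    """Bit pattern of x after rounding it to IEEE-754 binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(bits):
    """Reinterpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bf16_store(x):
    """STORE: narrow f32 -> 16-bit bf16 pattern, round-to-nearest-even (NaN not handled)."""
    bits = f32_to_bits(x)
    # Add half of the dropped range, plus 1 if the kept LSB is odd (ties go to even).
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

def bf16_load(h):
    """LOAD: widen bf16 -> f32 exactly (bf16 is the top 16 bits of an f32)."""
    return bits_to_f32(h << 16)

def bf16_round(x):
    """One store followed by one load: the only rounding a persisted state row sees per step."""
    return bf16_load(bf16_store(x))

if __name__ == "__main__":
    for v in (1.0, 1.0 + 2**-8, 1.0 + 2**-7, 3.14159):
        print(v, "->", bf16_round(v))
```

The load is lossless, so every bit of precision loss sits in the store; that is why only the two store sites (CUDA and CPU) need to agree on the rounding mode for the CUDA-vs-CPU bf16 test to be meaningful.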
|
||||
## B.9 File-by-file edit table (Part B)
|
||||
| File | Edit |
|---|---|
| `ggml/src/ggml.c` | relax `state`/`src_state_dst` F32 asserts -> allow BF16 in the 3 GDN builders (6308, ~6370, ~6430); keep output scratch F32 (6327) |
| `ggml/src/ggml-cuda/ggml-cuda.cu` | `supports_op` GATED_DELTA_NET (3096): allow BF16 state src |
| `ggml/src/ggml-cuda/gated_delta_net.cu` | template kernel+gather+launch on STATE_BF16; `__bfloat162float` load / `__float2bfloat16` store; cast src_state/src_state_dst ptrs; relax dispatcher asserts (309-311) to `ggml_type_size(type)`; keep gather scratch + keep_rs snapshot f32 |
| `ggml/src/ggml-cpu/ops.cpp` | bf16 load/store branch in GDN ref (10726/10744/10758-10762/10891); relax asserts |
| `tests/test-backend-ops.cpp` | add BF16-state GATED_DELTA_NET case (CUDA bf16 vs CPU bf16) |
| (deferred) conv: `ggml.c:5581-84`, `ssm-conv.cu:131`, `ops.cpp:9471`, `delta-net-base.cpp:472` | v2 only - f32 staging before prefill concat |
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,37 +0,0 @@
|
||||
# bf16 SSM state - build/de-risk progress
|
||||
|
||||
DECISION (user override of plan): f32 DEFAULT + bf16 OPT-IN. type_s default = GGML_TYPE_F32.
|
||||
Conv state (type_r) stays F32. Recurrence math stays f32 (load->f32, store->cache dtype).
|
||||
|
||||
## STEP 1 (dtype-generic kernel + op) - DONE + DE-RISK GATE 1 PASSED
|
||||
Files (DGX ~/llama-paged-dev):
|
||||
- ggml/src/ggml.c: 3 GDN builder asserts F32 -> {F32,BF16}; state_dst nb[0] -> ggml_type_size.
|
||||
- ggml/src/ggml-cuda/gated_delta_net.cu: gdn_state_t<STATE_BF16> alias; gather + recurrence kernel +
|
||||
launchers templated on STATE_BF16; __bfloat162float load / __float2bfloat16 store; gather scratch
|
||||
shares cache dtype (uniform read); dispatcher detects src_state->type, GDN_DISPATCH macro 8-way.
|
||||
- ggml/src/ggml-cpu/ops.cpp: byte-based read base + read_bf16 load conversion; bf16 in-place
|
||||
convert-store after token loop; bf16 gather widening; relaxed asserts to ggml_type_size.
|
||||
- ggml/src/ggml-cpu/ggml-cpu.c: work-size +S_v*S_v for bf16 in-place.
|
||||
- tests/test-backend-ops.cpp: state_type field on test_gated_delta_net; 16 bf16 cases (hs 64+128 x
|
||||
decode/prefill/keep_rs x kda).
|
||||
GATE 1: build clean (EXIT=0); test-backend-ops -o GATED_DELTA_NET = 52/52 OK (CUDA bf16 vs CPU bf16).
|
||||
|
||||
## STEP 2/3/4 (cparams opt-in wiring) - IN PROGRESS
|
||||
f32 DEFAULT everywhere; --cache-type-ssm bf16 opts in.
|
||||
|
||||
## STEP 2/3/4 (cparams opt-in) - DONE
|
||||
- llama.h/llama-context.cpp/llama-memory.h/llama-model.cpp: type_r/type_s plumbed, DEFAULT F32.
|
||||
- common.h/common.cpp/arg.cpp: cache_type_ssm/conv (F32 default) + --cache-type-ssm/-conv CLI.
|
||||
- llama-memory-recurrent.cpp: convert-on-mismatch f32<->bf16 (r and s) via ggml_*_row API.
|
||||
|
||||
## EXTRA FIX (plan B.1 miss): build_rs rs_zero clear used ggml_scale (f32-only) -> bf16 abort.
|
||||
- llama-graph.cpp: f32 keeps ggml_scale_inplace (bit-exact); non-f32 uses ggml_fill_inplace.
|
||||
- fill.cu + ops.cpp + ggml.c: added BF16 to ggml_fill. get_rows/cpy already bf16-capable.
|
||||
|
||||
## DE-RISK GATE - ALL PASS
|
||||
- build clean EXIT=0 (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench).
|
||||
- test-backend-ops -o GATED_DELTA_NET = 52/52 (16 bf16 cases: decode/prefill/keep_rs x kda x hs64/128).
|
||||
- f32 default md5: dense 5951a5b4... MoE 07db32c2... == 0023 (non-invasive; also --cache-type-ssm f32 matches).
|
||||
- bf16 opt-in: coherent "Paris", no crash; byte-identical to f32 on 48-tok sample (Same-top-p 100%).
|
||||
- diff backup: ~/llama-paged-dev/BF16_SSM_STATE.diff (1003 lines, 15 files). NOT committed/pushed.
|
||||
READY FOR C.2 KL GATE (GateBench).
|
||||
@@ -1,203 +0,0 @@
|
||||
# bf16 SSM-state cache - BUILD + DE-RISK RESULTS
|
||||
|
||||
Label: bf16-build-derisk (the GPU build agent). Lands on top of patch 0023 (HEAD f7409c2) on the DGX
|
||||
dev tree `~/llama-paged-dev` (branch `paged`). Status: **DE-RISK GATE PASSED, READY FOR THE C.2 KL
|
||||
GATE (GateBench).** Work is built into `build-cuda` and saved as `~/llama-paged-dev/BF16_SSM_STATE.diff`
|
||||
(uncommitted on the dev tree; the 0024 ship/shelve decision is gated on GateBench's KL results).
|
||||
|
||||
## DECISION applied (user override of the plan): f32 DEFAULT + bf16 OPT-IN
|
||||
The plan defaulted bf16; the user wants f32 to stay the bit-exact DEFAULT and bf16 to be opt-IN via
|
||||
`--cache-type-ssm bf16`. So `type_s` default = `GGML_TYPE_F32`, `type_r` default = `GGML_TYPE_F32`
|
||||
(conv stays f32 always, per plan C.0). Only the persisted RECURRENT (temporal) state narrows to bf16
|
||||
when opted in; recurrence math stays f32 (load->f32, compute f32, store->cache dtype). The opt-in is
|
||||
non-invasive: with no flag the output is byte-identical to 0023.
|
||||
|
||||
## Files changed (15; full diff = ~/llama-paged-dev/BF16_SSM_STATE.diff, 1003 lines)
|
||||
|
||||
STEP 1 - dtype-generic kernel + op (the de-risk core):
|
||||
- `ggml/src/ggml.c` - 3 GDN builder `state`/`state_dst` asserts F32 -> {F32,BF16}; `state_dst->nb[0]`
|
||||
`sizeof(float)` -> `ggml_type_size(state_dst->type)`. Also relaxed the `ggml_fill` builder assert to
|
||||
allow BF16 (needed by the rs_zero clear; see below).
|
||||
- `ggml/src/ggml-cuda/gated_delta_net.cu` - `gdn_state_t<STATE_BF16>` alias (`nv_bfloat16`/`float`);
|
||||
recurrence kernel + gather kernel + both launchers + the dispatcher templated on `STATE_BF16`.
|
||||
LOAD `__bfloat162float`, STORE `__float2bfloat16`; the gather scratch is allocated at the CACHE
|
||||
dtype so `read_state` is a single uniform dtype (no mixed-dtype read path - eliminates the plan-R2
|
||||
landmine). The keep_rs snapshot + the non-in-place final write stay f32 (op output scratch); the
|
||||
bf16 store happens ONLY on the in-place cache path. `supports_op` already returned `true`
|
||||
unconditionally for GATED_DELTA_NET, so no change there.
|
||||
- `ggml/src/ggml-cpu/ops.cpp` - byte-based prior-state read base + `read_bf16` load conversion
|
||||
(`GGML_BF16_TO_FP32`); bf16 in-place convert-store after the per-(head,seq) token loop
|
||||
(`GGML_FP32_TO_BF16`); bf16-widening non-identity gather; relaxed `nb[]` asserts to
|
||||
`ggml_type_size`. Added a `ggml_compute_forward_fill_bf16` + dispatch case.
|
||||
- `ggml/src/ggml-cuda/fill.cu` - BF16 case in the fill kernel switch.
|
||||
- `ggml/src/ggml-cpu/ggml-cpu.c` - GDN work-size adds the extra `S_v*S_v` f32 buffer when the cache is
|
||||
bf16 in-place (mirror of `need_work` in ops.cpp).
|
||||
- `tests/test-backend-ops.cpp` - `state_type` field on `test_gated_delta_net`; 16 bf16-state cases
|
||||
(head_size 64 + 128 x {decode, multi-token prefill 33/64/100, keep_rs_t K=4} x kda 0/1, n_seqs 1/2).
|
||||
|
||||
STEP 2/3/4 - cparams opt-in wiring (f32 DEFAULT):
|
||||
- `include/llama.h` - `type_r`/`type_s` in `llama_context_params` (adjacent to type_k/type_v).
|
||||
- `src/llama-context.cpp` - default-params `type_r = type_s = GGML_TYPE_F32`; `params_mem` passes them.
|
||||
- `src/llama-memory.h` - `type_r`/`type_s` in `llama_memory_params`.
|
||||
- `src/llama-model.cpp` - the 3 hardcoded `GGML_TYPE_F32` recurrent ctor pairs (recurrent /
|
||||
hybrid_iswa / hybrid = the qwen35/qwen35moe path) now pass `params.type_r` / `params.type_s`.
|
||||
- `src/llama-memory-recurrent.cpp` - back-compat: `state_read_data` converts f32<->bf16 on type
|
||||
mismatch (helper `recurrent_read_convert_rows` via the public `ggml_bf16_to_fp32_row` /
|
||||
`ggml_fp32_to_bf16_row`) instead of failing, for both r and s; lets an f32-saved session restore
|
||||
into a bf16 cache and vice versa.
|
||||
- `src/llama-graph.cpp` - `build_rs` rs_zero clear: f32 keeps the exact `ggml_scale_inplace(.,0)` op
|
||||
(bit-exactness); bf16 (and any non-f32) state uses `ggml_fill_inplace(.,0)` (CUDA scale is f32-only;
|
||||
this was the one extra state-touching op the plan's "one op family" claim missed). get_rows + cpy
|
||||
on the extra-states path already support bf16, so no change needed there.
|
||||
- `common/common.h` / `common/common.cpp` / `common/arg.cpp` - `cache_type_ssm` / `cache_type_conv`
|
||||
(default F32) + `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv` CLI (reusing the
|
||||
existing `kv_cache_type_from_str`, which already maps `f32`/`bf16`).
|
||||
|
||||
## DE-RISK GATE - ALL PASS
|
||||
|
||||
1. **Build clean** (build-cuda, CUDA arch 121): EXIT=0 for ggml/ggml-cuda/ggml-cpu/llama/llama-common
|
||||
and the binaries (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench).
|
||||
2. **test-backend-ops -o GATED_DELTA_NET = 52/52 PASS** (CUDA backend vs CPU reference). Includes all
|
||||
16 new bf16-state cases (CUDA bf16 vs CPU bf16) covering decode (n_tokens==1), multi-token
|
||||
prefill/chunk (33/64/100), and keep_rs_t (K=4), with kda on/off and head_size 64 + 128 (production
|
||||
S_v). The bf16 op test is the deterministic R2 de-risk for the load/compute/store contract.
|
||||
3. **f32-default md5 == 0023 baseline (opt-in is non-invasive):**
|
||||
- dense q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == 0023 (no flag AND `--cache-type-ssm f32`)
|
||||
- MoE q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == 0023
|
||||
Command: `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`.
|
||||
4. **bf16 opt-in coherence + engaged (dense, `--cache-type-ssm bf16`):** no crash; coherent + on-topic.
|
||||
- 48-tok greedy ("The capital of France is"): "**Paris**." - byte-identical to f32 (md5 5951a5b4...),
|
||||
i.e. Same-top-p = 100% over that short sample (the g<1 decay bounds the per-step rounding so the
|
||||
argmax trajectory is unchanged at short length).
|
||||
- 256-tok greedy ("Explain how a transformer LM generates text, step by step"): fluent, well-structured
|
||||
step-by-step explanation, and the bf16 md5 (`fc82b4cd44f8ec999c3b6843eb3f8c61`) **DIVERGES** from
|
||||
f32 (`554cc667a2e62a47c34a999a127ac7e5`) - definitive proof that bf16 is genuinely ENGAGED (not a
|
||||
silent f32 fallback) and behaves as expected (non-bit-exact, coherent). The per-token divergence
|
||||
is exactly what the C.2 teacher-forced KL gate quantifies.
|
||||
- Independent proof bf16 is allocated: BEFORE the build_rs fill fix, decode aborted in
|
||||
`ggml_cuda_op_scale` on the recurrent-state tensor - an f32 cache would never have reached that
|
||||
bf16-only failure, so the opt-in demonstrably allocates bf16. Wiring is also directly traceable:
|
||||
`--cache-type-ssm bf16` -> cache_type_ssm -> cparams.type_s -> params_mem.type_s -> the
|
||||
llama_memory_hybrid recurrent `s_l` alloc.
|
||||
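The md5 checks in items 3 and 4 above use one comparison in two directions: equality with the baseline proves the f32 default is non-invasive, and inequality on a long sample proves bf16 is actually engaged. A minimal sketch of that comparison, assuming the completion output has already been captured as bytes; the function names are illustrative and not part of any llama.cpp tool.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Hex md5 of a captured completion output."""
    return hashlib.md5(data).hexdigest()

def is_bit_exact(output: bytes, baseline_md5: str) -> bool:
    """True when the output hashes to the recorded baseline (e.g. the 0023 md5)."""
    return md5_hex(output) == baseline_md5.lower()

def is_engaged(f32_output: bytes, bf16_output: bytes) -> bool:
    """True when the opt-in run differs from f32, i.e. it is not a silent f32 fallback."""
    return md5_hex(f32_output) != md5_hex(bf16_output)

if __name__ == "__main__":
    baseline = md5_hex(b"The capital of France is Paris.")
    print(is_bit_exact(b"The capital of France is Paris.", baseline))
```

Note that a matching md5 on a short sample does not show bf16 is inactive; as item 4 records, the 48-token run matched while the 256-token run diverged.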
|
||||
CONFIRM: ready for the C.2 KL-divergence + PPL-delta + long-context drift gate (GateBench).
|
||||
|
||||
## A landmine fixed beyond the plan (record for the gate/ship agents)
|
||||
The plan B.1 asserted `s_l` reaches compute through ONLY the gated-DeltaNet op. It also flows through
|
||||
`build_rs`: (a) the rs_zero restart-slot clear used `ggml_scale_inplace(.,0)`, and `ggml_cuda_op_scale`
|
||||
hard-asserts f32 -> the first bf16 decode aborted in scale. Fixed by routing the bf16 clear through
|
||||
`ggml_fill` (with a new bf16 fill path). (b) the extra-states `ggml_get_rows` + `ggml_cpy` already
|
||||
support bf16 (verified) - no change. This is exactly the kind of non-decode state path the de-risk
|
||||
was meant to surface; it is now covered end-to-end (the bf16 coherence run exercises rs_zero on the
|
||||
fresh-sequence prompt).
|
||||
|
||||
## NOT done in this phase (next agents)
|
||||
- STEP 5 LocalAI gRPC/YAML (`CacheTypeSSM`/`CacheTypeConv` proto + grpc-server + model_config +
|
||||
options + meta registry) - needed to force f32/bf16 from a gallery YAML; not on the de-risk gate.
|
||||
- STEP 6 capability fallback (device-match probe to demote bf16->f32 before alloc on a device lacking
|
||||
the bf16 GDN/fill path, e.g. CPU-offloaded GDN). The CPU reference DOES implement bf16 (load/store/
|
||||
gather/fill) so a CPU fallback is numerically correct today, but the probe is the clean guard.
|
||||
- The C.2 KL/PPL/long-context gate + the C.3 nsys per-call bench - GateBench (GPU gate agent, runs
|
||||
sequentially after this build phase; binaries are pre-built in build-cuda).
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
# C.2/C.3 ACCEPTANCE GATE + PARITY BENCH RESULTS (label bf16-gate-bench)
|
||||
|
||||
Status: **GATE FAILS -> NO-SHIP. KEEP SHELVED. patch 0024 NOT created; nothing committed.**
|
||||
All runs on `dgx.casa` build-cuda binaries, wikitext-2-raw test, `-ngl 99 -fa on --seed 1`.
|
||||
Corpus: `~/bench/klgate/wikitext-2-raw/wiki.test.raw` (symlink to `~/wikitext-2-raw`, ~280k tokens).
|
||||
|
||||
## 1. KL acceptance gate
|
||||
|
||||
### Noise floor (f32-vs-f32, c256 chunks32) - the non-determinism floor
|
||||
| model | Mean KLD | Max KLD | Same-top-p | ln(PPL(Q)/PPL(base)) |
|---|---|---|---|---|
| dense q27 | -1.3e-5 | 1e-6 | 100.000% | +0.001256 |
| MoE q35 | ~0 (-3e-7) | 5.9e-5 | 100.000% | +0.000607 |
|
||||
|
||||
### Headline 256-token gate (bf16-vs-f32, c256 chunks32) - PASSES, but vacuously
|
||||
bf16 c256 is **byte-identical to the floor** for both models (Mean KLD -1.3e-5 dense / ~0 MoE,
|
||||
Same-top-p 100%, identical PPL). Reason: a single 256-token window is processed in ONE ubatch
|
||||
(ub512 > 256), so the recurrent state is written to the bf16 cache only ONCE at the chunk end and is
|
||||
NEVER read back to produce that window's logits. The 256-token gate therefore does NOT exercise the
|
||||
bf16 round-trip at all - it is blind to the actual cost.
|
||||
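The argument above turns on how many ubatch boundaries fall inside one evaluation window, since each boundary is a point where the persisted state is written and then read back. A small sketch of that count, assuming the window is split into consecutive ubatches of at most `n_ubatch` tokens; the function name is illustrative.

```python
def internal_boundaries(n_ctx: int, n_ubatch: int) -> int:
    """Number of ubatch boundaries strictly inside a window of n_ctx tokens."""
    if n_ctx <= 0 or n_ubatch <= 0:
        raise ValueError("n_ctx and n_ubatch must be positive")
    n_ubatches = -(-n_ctx // n_ubatch)  # integer ceil(n_ctx / n_ubatch)
    return n_ubatches - 1

if __name__ == "__main__":
    # c256 with the default ub512: one ubatch, so no cross-ubatch read-back.
    print(internal_boundaries(256, 512))
    # c1024 at the ubatch sizes used in the discrimination runs.
    for ub in (1024, 512, 256, 64):
        print(ub, internal_boundaries(1024, ub))
```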
|
||||
### Long-context drift sweep (bf16-vs-f32, chunks8) - FAILS HARD for BOTH models
|
||||
| model | ctx | Mean KLD | Same-top-p | Max KLD | 99.9% KLD |
|---|---|---|---|---|---|
| dense | 256 | -1.3e-5 | 100.000% | 1e-6 | 0 |
| dense | 1024 | 0.0647 | 91.54% | 20.17 | 7.69 |
| dense | 2048 | 0.1739 | 90.65% | 24.89 | 18.18 |
| dense | 4096 | 0.1258 | 90.40% | 26.03 | 17.22 |
| MoE | 256 | ~0 | 100.000% | 5.6e-5 | 4.9e-5 |
| MoE | 1024 | 0.0472 | 90.04% | 5.13 | 0.95 |
| MoE | 2048 | 0.0442 | 90.84% | 1.85 | 1.11 |
| MoE | 4096 | 0.0422 | 89.97% | 2.76 | 0.83 |
|
||||
|
||||
Gate thresholds: Mean KLD < 1e-3; Same-top-p >= 99.5%; |ln(PPL ratio)| < 0.005;
|
||||
drift MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0%.
|
||||
Result: 256-tok PASS (vacuous); **drift FAIL by ~50-170x on Mean KLD and ~9 pts on Same-top-p**
|
||||
(top-p ~90% = roughly 1 token in 10 flips its argmax at >=1024 ctx). FAIL for both dense and MoE.
|
||||
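The thresholds above can be applied mechanically to the drift-sweep rows. The sketch below does that in Python under two stated assumptions: the drift baseline is clamped to a small positive `kld_floor` (a hypothetical guard, because the measured 256-token Mean KLD is slightly negative estimator noise), and the PPL check is skipped where the sweep reports no PPL ratio. `KlRun` and both functions are illustrative names, not part of the GateBench harness.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KlRun:
    mean_kld: float                       # Mean KLD vs the f32 base
    same_top_p: float                     # Same-top-p, in percent
    ln_ppl_ratio: Optional[float] = None  # ln(PPL(Q)/PPL(base)); None if not reported

def headline_pass(run: KlRun) -> bool:
    """Mean KLD < 1e-3, Same-top-p >= 99.5%, |ln(PPL ratio)| < 0.005 (when reported)."""
    ok = run.mean_kld < 1e-3 and run.same_top_p >= 99.5
    if run.ln_ppl_ratio is not None:
        ok = ok and abs(run.ln_ppl_ratio) < 0.005
    return ok

def drift_pass(short: KlRun, long: KlRun, kld_floor: float = 1e-6) -> bool:
    """MeanKLD(long) <= 1.5x MeanKLD(short) and Same-top-p(long) >= 99.0%."""
    baseline = max(short.mean_kld, kld_floor)  # assumption: clamp a noisy/negative baseline
    return long.mean_kld <= 1.5 * baseline and long.same_top_p >= 99.0

# Rows copied from the drift sweep; MoE 256 "~0" is entered as 0.0.
DENSE = {256: KlRun(-1.3e-5, 100.0), 1024: KlRun(0.0647, 91.54),
         2048: KlRun(0.1739, 90.65), 4096: KlRun(0.1258, 90.40)}
MOE = {256: KlRun(0.0, 100.0), 1024: KlRun(0.0472, 90.04),
       2048: KlRun(0.0442, 90.84), 4096: KlRun(0.0422, 89.97)}

if __name__ == "__main__":
    for name, runs in (("dense", DENSE), ("MoE", MOE)):
        for ctx, run in runs.items():
            print(name, ctx, "headline", headline_pass(run))
        print(name, "drift 256->4096", drift_pass(runs[256], runs[4096]))
```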
|
||||
### Discrimination (is it a bug or genuine bf16?) - dense c1024 chunks8
|
||||
- **f32 long-context floor c1024**: Mean KLD -1.2e-5, Same-top-p 100% -> the bf16 divergence is REAL
|
||||
signal, not a long-context measurement artifact.
|
||||
- **bf16 KLD is invariant to ubatch-boundary count** (= the cross-ubatch state read-back frequency):
|
||||
ub1024 (0 internal boundaries) 0.0642 / 91.19%; ub512 (1) 0.0647 / 91.54%; ub256 (3) 0.0639 /
|
||||
91.17%; ub64 (15) 0.0682 / 90.97%. Flat. -> The error is INTRINSIC to bf16 over the long
|
||||
recurrence INSIDE a single op call, NOT a chunked-prefill/keep_rs/gather handoff bug (R2 ruled out;
|
||||
test-backend-ops 52/52 already passed). The error PLATEAUS with context (contraction), i.e. it is
|
||||
bounded but LARGE: the gated-DeltaNet has long-memory heads (exp(g) ~ 1), so the g<1 decay does NOT
|
||||
tightly contract the per-step bf16 rounding the way the plan's A.3 optimistically assumed.
|
||||
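Why a decay close to 1 fails to contract the rounding can be seen in a scalar toy. The sketch below is not the matrix-valued GDN state: it runs `s <- g*s + u` with f32 math and compares persisting `s` in f32 against persisting it in bf16 once per step. The constants (`g`, the constant input `u = 1`, the step counts) are chosen for illustration only, and the helper names are hypothetical.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def bf16(x):
    """Round to bf16 (round-to-nearest-even) and widen back to f32. NaN not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def max_state_drift(g, inputs):
    """Largest |s_bf16 - s_f32| seen while iterating s <- g*s + u over the inputs."""
    g = f32(g)
    s_ref = 0.0   # state persisted in f32
    s_low = 0.0   # state persisted in bf16, math still in f32
    worst = 0.0
    for u in inputs:
        s_ref = f32(f32(g * s_ref) + u)
        s_low = bf16(f32(f32(g * s_low) + u))
        worst = max(worst, abs(s_low - s_ref))
    return worst

if __name__ == "__main__":
    print("fast-decay head  g=0.5  :", max_state_drift(0.5, [1.0] * 64))
    print("long-memory head g=0.999:", max_state_drift(0.999, [1.0] * 5000))
```

With `g = 0.5` the state settles near 2 and the bf16 copy stays within one rounding step of the f32 copy. With `g = 0.999` the f32 state keeps growing toward 1000, while the bf16 copy stalls once the per-step update falls below half a bf16 ulp of the accumulated state, so the error is bounded but large.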
|
||||
Note (CORRECTED): this is NOT vLLM's precision. vLLM keeps the GDN **temporal state in f32** (proven
|
||||
three ways in BITEXACT_VS_VLLM.md: empirical kernel-boundary tensor dtype, the config chain, and the
|
||||
bandwidth regime; only vLLM's tiny conv state is bf16). So bf16 temporal here is a step BELOW vLLM's
|
||||
recurrent precision, not a match. (An earlier byte-gate draft mislabeled vLLM as bf16-state; that was
|
||||
refuted.) At equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak BW).
|
||||
|
||||
## 2. Parity bench - the perf lever IS real
|
||||
|
||||
### nsys recurrence per-call (graphs OFF, npp4 ntg32 npl128) - gated_delta_net_cuda Avg
|
||||
| model | f32 ms/call | bf16 ms/call | delta |
|---|---|---|---|
| dense q27 | 3.381 | 1.726 | **-49.0%** |
| MoE q35 | 2.245 | 1.153 | **-48.6%** |
|
||||
|
||||
The predicted 3.49 -> 2-3 ms/call lever LANDED (even beat it). Total GPU time dropped too (dense
|
||||
~12.05 -> ~9.05 s graphs-off). bf16 halving the persisted SSM-state bytes halves the dominant decode
|
||||
kernel, exactly as designed.
|
||||
|
||||
### End-to-end decode throughput (S_TG aggregate, npp128 ntg128, graphs ON unless noted)
|
||||
| model | npl | f32 t/s | bf16 t/s | note |
|---|---|---|---|---|
| dense | 32 | 212 | 239 | +12.8% |
| dense | 128 | 371-376 (stable) | 287 / 336 / 487 / 498 (BIMODAL) | clean ~490 = +31%; bad runs from a CUDA-graph instability on the dense path |
| dense | 128 | 371.67 (graphsOFF) | 486.68 (graphsOFF) | clean +31% |
| MoE | 32 | 449 | 509 | +13.4% |
| MoE | 128 | 767 | 958 | +24.9% (clean, nsys-corroborated) |
|
||||
|
||||
% of vLLM (391 t/s dense reference): f32 default = 95-96% (bit-exact, higher precision than vLLM);
|
||||
bf16 clean ~490 = **125%** (but unstable on dense + fails the numeric gate). MoE bf16 +25% is clean.
|
||||
|
||||
## 3. DECISION: NO-SHIP / KEEP SHELVED
|
||||
- The KL gate **fails** the long-context drift criterion for both models: bf16 SSM state changes
|
||||
~10% of tokens at >=1024 ctx vs our f32 (Same-top-p ~90%, Mean KLD 0.04-0.17). It is therefore NOT
|
||||
a quality-neutral opt-in and cannot honor the project's "f32 bit-exact default" promise.
|
||||
- Per the task rule (gate FAIL -> do not ship as usable): **patch 0024 was NOT created and nothing was
|
||||
committed** (DGX tree stays uncommitted; backup `~/llama-paged-dev/BF16_SSM_STATE.diff`).
|
||||
- The perf lever is genuinely real (recurrence halves; dense ~490 t/s = 125% of vLLM when clean; MoE
|
||||
+25%), but bf16 temporal is BELOW vLLM's precision (vLLM keeps temporal f32), so it remains a valid
|
||||
FUTURE option only if shipped as an explicitly-labeled "reduced-precision, NON-bit-exact, below-vLLM"
|
||||
mode (never quality-neutral), AND the dense CUDA-graph throughput instability (bimodal 287..498) is
|
||||
fixed first. The principled path is hybrid per-head precision (f32 long-memory heads + bf16 fast
|
||||
heads) - keeps precision at-or-above vLLM while capturing most of the speedup.
|
||||
- Recommendation: keep the shipped default as f32 bit-exact (95% of vLLM at higher precision). Shelve
|
||||
bf16. Optional follow-up lever if precision must be cut: bf16 only on the SHORT-memory heads (those
|
||||
with exp(g) well below 1), keeping long-memory heads f32 - a mixed-precision state that could pass
|
||||
the gate while still cutting bytes; not implemented/measured here.
|
||||
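The mixed-precision follow-up named in the recommendation was not implemented or measured, so the following is only a sketch of its bookkeeping: pick a state dtype per head from its decay, then count the persisted bytes. The `threshold` value, the function names, and the example decays are all hypothetical; the per-head state size `S_v * S_v` with `S_v = 128` follows the sizes quoted elsewhere in these notes.

```python
def plan_state_dtypes(decay_per_head, threshold=0.9):
    """Assign bf16 to short-memory heads (exp(g) below threshold), f32 to the rest."""
    return ["bf16" if d < threshold else "f32" for d in decay_per_head]

def state_bytes(dtypes, s_v=128):
    """Persisted bytes for one layer: one S_v x S_v state matrix per head."""
    elem_size = {"f32": 4, "bf16": 2}
    return sum(s_v * s_v * elem_size[t] for t in dtypes)

if __name__ == "__main__":
    decays = [0.50, 0.80, 0.995, 0.999]      # made-up exp(g) values, one per head
    plan = plan_state_dtypes(decays)
    all_f32 = state_bytes(["f32"] * len(decays))
    print(plan, state_bytes(plan), "of", all_f32, "bytes")
```

Whether any threshold passes the KL gate is an open question; the byte saving scales with the fraction of heads classified as short-memory.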
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,339 +0,0 @@
|
||||
# Bit-exact vs vLLM, and the f32-preserving-parity hunt (Qwen3.5 gated-DeltaNet)
|
||||
|
||||
Label: crossengine-bitexact (READ-ONLY, no GPU). Adversarial source+numerics study.
|
||||
Model: q36-27b-nvfp4 (dense, `Qwen3_5ForConditionalGeneration`) / q36-35b-a3b-nvfp4
|
||||
(MoE, `Qwen3_5MoeForConditionalGeneration`). Engines: llama dev `~/llama-paged-dev`,
|
||||
vLLM 0.23.0 `~/vllm-bench`. Decode B=128, enforce-eager / graphs-off, GB10 (~273 GB/s).
|
||||
|
||||
> **CORRECTION NOTICE (supersedes the earlier draft of this file).** A prior pass concluded
|
||||
> "vLLM's GDN state cache is bf16, so the 2x recurrence-DRAM gap is f32(llama)-vs-bf16(vLLM)
|
||||
> width" (old B2/B3). **That is wrong.** It read `gated_delta_net_state_dtype(..., mamba_ssm_cache_dtype="auto")`
|
||||
> as auto->model-dtype=bf16, but it did **not** trace the Qwen3.5-specific config override that
|
||||
> reassigns `mamba_ssm_cache_dtype` from `"auto"` to `"float32"` *before* the state dtype is
|
||||
> resolved. **vLLM stores this model's gated-DeltaNet temporal state in float32**, the same width
|
||||
> as llama. Proof chain in Part B. Everything in Part C is re-derived from the corrected dtype.
|
||||
>
|
||||
> **INDEPENDENT RE-VERIFICATION (this pass, live DGX source).** Separate sub-agents reached
> *conflicting* dtype readings (one f32, two bf16). The contradiction was resolved by reading every
|
||||
> link of the chain directly, not by majority vote. All eight links confirm **float32 temporal
|
||||
> state**: `config.json text_config.mamba_ssm_dtype = "float32"` (both served models);
|
||||
> `config/cache.py:129` default `mamba_ssm_cache_dtype = "auto"`; the bench scripts
|
||||
> (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`, `serve_nvfp4.sh`) pass **only**
|
||||
> `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`,
|
||||
> no `--dtype`); `config/vllm.py:847 __post_init__` -> `:856 try_verify_and_update_config()` (runs at
|
||||
> finalize, before any state-dtype resolution); `MODELS_CONFIG_MAP` (`models/config.py:622-623`) maps
|
||||
> both `Qwen3_5ForConditionalGeneration` and `Qwen3_5MoeForConditionalGeneration` ->
|
||||
> `Qwen3_5ForConditionalGenerationConfig`; its override body (`config.py:546-549`)
|
||||
> `if mamba_ssm_cache_dtype=="auto": cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype` **fires**
|
||||
> (value "float32"); `mamba_utils.py:91-94` then takes the `!= "auto"` branch ->
|
||||
> `temporal = STR_DTYPE_TO_TORCH_DTYPE["float32"] = torch.float32` (conv stays bf16);
|
||||
> `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **temporal** (2nd)
|
||||
> tuple element and allocates the cache (`:1136`) at f32; `ssm_state = self_kv_cache[1]` (`:1316/1596/1664`).
|
||||
> The two bf16 sub-agent readings are **refuted** - they stopped at the `cache.py` default "auto" and
|
||||
> never traced the `__post_init__` override. **Numeric corroboration:** at the measured vLLM duration
|
||||
> 3.62 ms/call, bf16 (402 MB) would imply 111 GB/s = 41% peak (implausibly low for a tuned BW-bound
|
||||
> Triton kernel); f32 (805 MB) implies 222 GB/s = 81% peak (the expected regime). f32 is the only
|
||||
> reading consistent with both source *and* the measured time.
|
||||
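The numeric corroboration in the notice is a two-line bandwidth calculation. A sketch of it in Python, using only figures quoted in this document (402 MB and 805 MB per recurrence call, 3.62 ms per call, ~273 GB/s GB10 peak); the function names are illustrative.

```python
GB10_PEAK_GBPS = 273.0  # approximate peak DRAM bandwidth quoted for GB10

def achieved_gbps(megabytes_per_call: float, ms_per_call: float) -> float:
    """Achieved bandwidth; MB per ms is numerically equal to GB per s."""
    return megabytes_per_call / ms_per_call

def pct_of_peak(gbps: float, peak: float = GB10_PEAK_GBPS) -> float:
    return 100.0 * gbps / peak

if __name__ == "__main__":
    for label, mb in (("bf16 state", 402.0), ("f32 state", 805.0)):
        bw = achieved_gbps(mb, 3.62)
        print(f"{label}: {bw:.0f} GB/s = {pct_of_peak(bw):.0f}% of peak")
```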
|
||||
## Headline (two answers)
|
||||
|
||||
1. **Bit-exact-vs-vLLM (identical logits / probabilities) is IMPOSSIBLE - for this model and for any
|
||||
two distinct engines.** B4 = CONFIRMED. The sharpest proof is the GDN recurrence itself: the two
|
||||
kernels evaluate an *algebraically reassociated* expression (`g.Sigma` vs `Sigma.g`) on *different
|
||||
reduction trees*, so they diverge **even if both ran pure f32 with identical inputs**. On top of
|
||||
that the FP4 GEMM uses different operand precision (8-bit vs 4/16-bit activations) and different
|
||||
accumulation - a >>ULP divergence in every projection and the LM head.
|
||||
|
||||
2. **bf16 SSM state is NOT the only way to reach vLLM decode throughput, and an f32-preserving lever
|
||||
was missed.** vLLM reaches its throughput **with an f32 GDN state** (proven). Both engines move the
|
||||
same ~805 MB f32/recurrence-call; the ~10% per-call gap is a bandwidth-**efficiency** gap on equal
|
||||
bytes (llama ~74% of peak, vLLM ~81%), i.e. an occupancy/grid/coalescing lever that is **bit-exact
|
||||
vs llama's own f32**. bf16 state is an *optional over-clock* (goes AHEAD of vLLM on the recurrence),
|
||||
not a parity requirement. B2/B3 (as "bf16 width is the lever") = REFUTED.
|
||||
|
||||
---
|
||||
|
||||
# The five questions, answered (synthesis)
|
||||
|
||||
**Q1. Can llama be BIT-EXACT with vLLM? NO.** Two *binding* (>>ULP) divergence sources make
|
||||
bit-identical logits impossible on their own: **(A1)** the FP4 GEMM - llama MMQ quantizes the
|
||||
activation to **q8_1 (8-bit)** while vLLM runs cutlass **w4a4 (4-bit acts)** or marlin **w4a16
|
||||
(16-bit acts)**; different operand precision + accumulation order -> ~1e-2 relative error in *every*
|
||||
projection and the LM head; **(A2)** the GDN recurrence - llama computes `g*(Sigma round(S*k))`
|
||||
(scalar decay *after* the reduction) while vLLM computes `Sigma round(round(g*h)*k)` (decay rounded
|
||||
into each element *before* the reduction): an IEEE-754 reassociation on *different reduction trees*
|
||||
(warp butterfly vs Triton `tl.sum`) that diverges **even with identical pure-f32 state and inputs**.
|
||||
A dozen further ops (L2/RMSNorm, MRoPE, gate `exp`, flash-attn softmax) add close-but-not-equal
|
||||
rounding. Cross-engine bit-exactness is impossible *in general* (FP non-associativity across distinct
|
||||
GEMM/recurrence/norm kernel stacks); the determinism literature only buys run-to-run determinism
|
||||
*within* one engine. **Weaker form reachable:** greedy **top-1 token agreement** is the right gate
|
||||
(top-1 / KL / PPL-delta, never md5). It is probabilistic (flips at low-margin steps), **compounds**
|
||||
with length (once one token differs the SSM/KV states fork), and is *weaker here* than a
|
||||
same-precision run because of the A8-vs-A4 GEMM gap.
|
||||
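The A2 argument rests on floating-point addition not being associative, so two reduction trees over the same f32 values can return different results. The sketch below emulates f32 in Python and contrasts a sequential sum with a stride-halving (butterfly-style) sum, then shows the two decay placements from the text as plain functions. It does not reproduce either engine's kernel; all names are illustrative, and the random trial count is reported rather than predicted.

```python
import random
import struct

def f32(x):
    """Round to binary32. For a single + or *, rounding the f64 result matches a native f32 op."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_sequential(xs):
    """Left-to-right f32 accumulation."""
    acc = 0.0
    for x in xs:
        acc = f32(acc + x)
    return acc

def sum_butterfly(xs):
    """Stride-halving tree reduction; len(xs) must be a power of two."""
    xs = list(xs)
    stride = len(xs) // 2
    while stride >= 1:
        xs = [f32(xs[i] + xs[i + stride]) for i in range(stride)]
        stride //= 2
    return xs[0]

def decay_after(g, S, k):
    """g * (Sigma round(S*k)): scalar decay applied after the reduction."""
    return f32(g * sum_butterfly([f32(s * kk) for s, kk in zip(S, k)]))

def decay_before(g, S, k):
    """Sigma round(round(g*S)*k): decay rounded into each element before the reduction."""
    return sum_sequential([f32(f32(g * s) * kk) for s, kk in zip(S, k)])

def count_mismatches(trials=200, n=128, seed=0):
    """How often the two forms differ bit-for-bit on random f32 inputs."""
    rng = random.Random(seed)
    g = f32(0.97)
    bad = 0
    for _ in range(trials):
        S = [f32(rng.uniform(-1, 1)) for _ in range(n)]
        k = [f32(rng.uniform(-1, 1)) for _ in range(n)]
        bad += decay_after(g, S, k) != decay_before(g, S, k)
    return bad

if __name__ == "__main__":
    xs = [1e8, 1.0, -1e8, 0.0]
    print(sum_sequential(xs), sum_butterfly(xs))
    print("bitwise mismatches:", count_mismatches(), "of 200")
```

In the four-element example the sequential sum absorbs the `1.0` into `1e8` and loses it, while the butterfly pairing cancels the two large terms first and keeps it. Neither tree is wrong; they are different roundings of the same algebraic sum.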
|
||||
**Q2. Is bf16 SSM state the only path to vLLM decode throughput? NO - an f32-preserving lever exists
|
||||
and bf16 is not even required for parity.** vLLM carries the **same f32 temporal state** (proven +
|
||||
re-verified), so the recurrence gap is **bandwidth EFFICIENCY on equal f32 bytes** (llama 74% vs vLLM
|
||||
81% of GB10 peak), ~10% per call, *not* a 2x width gap. The lever: **retune `gated_delta_net_cuda`
|
||||
74% -> ~81%** - it launches 196608 tiny one-column blocks (butterfly-reduce per token); fold toward
|
||||
fewer/larger `BV x BK` tiles + vectorized `f32x4` loads + better row coalescing, **keeping the
|
||||
per-column reduction order -> BIT-EXACT vs llama's own f32** (md5-gateable). **Cost vs bf16:** zero
|
||||
precision risk and bit-exact, but it can only **match** vLLM's recurrence BW (81%), never beat it;
|
||||
worth ~+5% (~335 -> ~351 tok/s, ~90% of vLLM), and it caps below 100% unless stacked with the other
|
||||
bit-exact levers (conv fusion 0021, activation fold, oproj MMQ 0020). The adversarial sweep of every
|
||||
other f32 avenue (lossless sub-f32, delta/low-rank/sparse, recompute+checkpoint, 2nd-stream/overlap,
|
||||
chunked recurrence) **FAILS** to beat it; recompute is bit-exact but only **ties** the irreducible
|
||||
one-full-state-READ floor and is now moot (vLLM also writes f32, so you match its achieved BW, you
|
||||
don't need to eliminate the write). bf16 remains the **only** lever that goes *ahead* of vLLM on the
|
||||
recurrence (~440 tok/s) - an **over-clock**, not a requirement.
|
||||
|
||||
**Q3. Does bf16 state MATCH vLLM's precision or DEGRADE below it? It DEGRADES below vLLM.** (This
|
||||
corrects the `precision-ground-truth` sub-agent's "matching, not degrading" claim, which rested on
|
||||
the refuted bf16 reading.) vLLM keeps the **temporal/recurrent** state in **f32**; only its small
|
||||
**conv** state is bf16 (llama keeps conv f32, so llama is *more* precise there). So bf16 **temporal**
|
||||
state in llama (~8 mantissa bits) sits **below vLLM's f32 temporal** (~24 bits) - it is a deliberate
|
||||
precision-for-speed trade, KL/PPL-gated vs llama's own f32 *and* a step under vLLM's recurrent-state
|
||||
precision. A genuine "match vLLM's envelope" change would be f32 temporal (as today) + bf16 conv -
|
||||
which costs llama precision only on a tiny stream and buys almost no BW.
|
||||
|
||||
**Q4. What can "parity" mean here? Throughput at equal precision + a distributional quality bar -
|
||||
never identical bits.** Bit-identical logits are impossible cross-engine, so "parity" = **(a)**
|
||||
throughput (tok/s in the harness) at **(b)** a quality bar measured by **top-1 greedy agreement,
|
||||
KL(llama||vLLM)/step, and PPL-delta**, never md5. Both engines already run the recurrence math in f32
|
||||
registers; at **equal** precision (llama f32 temporal == vLLM f32 temporal) the *only* open variable
|
||||
is throughput, and that gap is closable **bit-exactly** (Q2). If llama adopts bf16 temporal, "parity"
|
||||
must be restated as "throughput >= vLLM at KL/PPL within gate vs llama's own f32" and reported as the
|
||||
precision-for-speed trade it is.
|
||||
|
||||
**Q5. Did the prior analysis get B1-B4 right? B1 mostly; B2/B3 REFUTED; B4 CONFIRMED. Overturn the
|
||||
"bf16 is required" framing - keep the bit-exact levers.**
|
||||
- **B1 TRUE** (single-pass f32, load-once/store-once, 74% peak) - but its sub-claim "more efficient
|
||||
than vLLM (41%)" is **REFUTED** (41% was the bf16 artifact; vLLM is ~81%, *more* efficient).
|
||||
- **B2 REFUTED** - not a f32-vs-bf16 width gap; equal f32 bytes both sides, ~10% efficiency gap.
|
||||
- **B3 REFUTED** as written - vLLM reaches its throughput **with f32 state**; a bit-exact f32
|
||||
occupancy retune reaches vLLM's recurrence BW. bf16 is optional.
|
||||
- **B4 CONFIRMED** - impossible, on two independent grounds (structural A1+A2; general FP
|
||||
non-associativity across distinct kernel stacks).
|
||||
- **Plan disposition:** do **not** overturn the conv-fusion (0021) bit-exact lever - keep it.
|
||||
**Re-prioritize the bit-exact f32 occupancy/coalescing retune of `gated_delta_net_cuda` as the
|
||||
parity path.** Treat bf16 temporal state as an explicitly-gated **over-clock for going beyond
|
||||
vLLM**, reported as a precision-for-speed trade (below vLLM's f32 recurrent precision), NOT as a
|
||||
parity-matching change.
|
||||
|
||||
---
|
||||
|
||||
# PART A - Divergence inventory (per source: bit-identical vs close)
|
||||
|
||||
Per decode layer the two engines run *different kernels* for: FP4 GEMMs (proj + LM head), depthwise
|
||||
conv+SiLU, q/k L2-norm, the GDN recurrence, gated RMSNorm; and on the hybrid's full-attention layers:
|
||||
RMSNorm q/k-norm, MRoPE, flash attention, a sigmoid gate.
|
||||
|
||||
## A1. NVFP4 dequant + FP4 GEMM -- NOT bit-identical (diverges >> ULP)
|
||||
|
||||
- **llama**: MMQ (`mmq.cuh` `block_fp4_mmq`, nvfp4 block=16, 4x ue4m3 sub-scales). Host path
|
||||
(`ggml-cuda.cu` ~1955-2014) **quantizes the activation (src1) to q8_1** (`block_q8_1_mmq`, **8-bit**,
|
||||
block 32) and accumulates over K in the MMQ tile (DP4A / Blackwell FP4-MMA); tile order set by
|
||||
`mmq_y`/`mmq_x` + the warp-MMA fragment layout.
|
||||
- **vLLM**: `compressed_tensors_w4a4_nvfp4` -> cutlass FP4 GEMM on Blackwell (**4-bit** activations,
|
||||
w4a4, per-group act-quant, e4m3 block scale x global FP8 tensor scale) or marlin fp4 fallback
|
||||
(**16-bit** activations, w4a16, dequant->bf16 then bf16 GEMM). `apply_weights` -> `self.kernel`.
|
||||
- **Verdict: not close.** (a) *Operand precision differs*: llama 8-bit acts vs vLLM 4-bit (cutlass) or
|
||||
16-bit (marlin) - per-GEMM outputs differ at ~1e-2 relative, not ULP. (b) Scale-application order
|
||||
differs. (c) Accumulation tiling/order differs (MMQ fragment vs cutlass/marlin). This is the largest
|
||||
divergence and is present in every projection + the LM head, so logits differ materially on its own.
|
||||
|
||||
## A2. gated-DeltaNet recurrence -- NOT bit-identical, AND provably so even in pure f32
|
||||
|
||||
Both engines run a single pass over an **f32** state (Part B). llama: `gated_delta_net.cu`
|
||||
`gated_delta_net_cuda<128,KDA=false>`; vLLM: `fused_recurrent.py`
|
||||
`fused_recurrent_gated_delta_rule_packed_decode_kernel`. Scalar-gate (GDA) path, `g.ne0==1`.
|
||||
With S[k][v] (llama, transposed) == h[v][k] (vLLM):
|
||||
|
||||
```
llama: kv[v] = Sigma_k S_old[k][v]*k[k]   # OLD state; g applied AFTER the sum
       delta = (v[v] - g*kv[v])*beta; S_new = g*S_old + k(x)delta; o[v]=Sigma_k S_new[k][v]*q[k]
vLLM:  h' = g*h_old                       # decay rounded into EVERY element first
       kv[v]=Sigma_k h'[v][k]*k[k]=Sigma_k round(g*h_old)*k; b_v=(v[v]-kv[v])*beta
       h_new = h' + b_v(x)k; o[v]=Sigma_k h_new[v][k]*q[k]
```
|
||||
|
||||
Algebraically identical (g scalar). **Numerically not**, for two structural reasons that survive even
|
||||
with identical f32 state, identical inputs, and identical reduction tree:
|
||||
- **Reassociation:** llama forms `g*(Sigma round(S*k))` (scalar multiply *after* the reduction);
|
||||
vLLM forms `Sigma round(round(g*h)*k)` (decay rounded into each element *before* the reduction).
|
||||
Distributing a multiply across a sum is exact in R, not in IEEE-754. This is not a precision knob.
|
||||
- **Different reduction trees:** llama `warp_reduce_sum<32>` (4 sequential per-lane FMAs + 5-step
|
||||
butterfly) vs vLLM `tl.sum(...,1)` (Triton tree over the 128-wide BK axis).
|
||||
**Verdict: not bit-identical; cannot be made so without rewriting one kernel to the other's op order.**
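
The reassociation point can be reproduced on the CPU. The sketch below is an illustration only: it emulates f32 with NumPy scalars, uses a plain sequential reduction in both variants (no FMA, no warp butterfly, no Triton tree), and so isolates just the placement of the scalar decay `g`. The function names are hypothetical and are not taken from either kernel.

```python
import numpy as np

f32 = np.float32

def kv_decay_after_sum(state_col, k, g):
    """llama-style order: reduce over the OLD state, then apply g once."""
    acc = f32(0.0)
    for s, kk in zip(state_col, k):
        acc = f32(acc + f32(s * kk))
    return f32(g * acc)

def kv_decay_before_sum(state_col, k, g):
    """vLLM-style order: round g into every element, then reduce."""
    acc = f32(0.0)
    for s, kk in zip(state_col, k):
        acc = f32(acc + f32(f32(g * s) * kk))
    return acc

rng = np.random.default_rng(0)
trials, mismatches, worst_rel = 256, 0, 0.0
for _ in range(trials):
    s = rng.standard_normal(128).astype(f32)   # one 128-row state column
    k = rng.standard_normal(128).astype(f32)
    g = f32(rng.uniform(0.5, 1.0))             # scalar gate for this head
    a = kv_decay_after_sum(s, k, g)
    b = kv_decay_before_sum(s, k, g)
    if a.tobytes() != b.tobytes():             # bitwise comparison, not ==
        mismatches += 1
    scale = float(np.abs(s * k).sum())
    worst_rel = max(worst_rel, abs(float(a) - float(b)) / scale)

print(f"{mismatches}/{trials} columns differ in bits; worst relative gap {worst_rel:.2e}")
```

Both functions compute the same real-valued quantity; the outputs stay numerically close but are not guaranteed to share a bit pattern, which is the sense in which the gap is structural and not a precision setting.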
|
||||
|
||||
## A3. Depthwise conv1d (width 4) + SiLU -- NOT bit-identical
|
||||
llama `ggml_ssm_conv` (ascending-j f32 FMA) + `ggml_silu`, conv state cached **f32**. vLLM
|
||||
`causal_conv1d_update` (Triton) + SiLU, conv state cached **bf16** (`conv_state_dtype = bf16`; only the
|
||||
*temporal* SSM state is forced f32 - Part B). Different kernel + different conv-state width + FMA order.
|
||||
(Patch 0021 fuses llama's chain bit-exactly vs *llama's own* f32 path, not vs vLLM.)
|
||||
|
||||
## A4. q/k L2-norm + RMSNorm/RMSNormGated -- NOT bit-identical (close, ~1e-6)
|
||||
L2-norm: llama standalone `ggml_l2_norm` (f32 tree) vs vLLM `l2norm_fwd`/in-kernel fold
|
||||
(`USE_QK_L2NORM_IN_KERNEL`). RMSNorm: llama `ggml_rms_norm` vs vLLM `vllm_c` fused kernel (run log:
|
||||
`rms_norm=['vllm_c','native']`); gated out-norm `build_norm_gated`=RMS*SiLU(z) vs `RMSNormGated`.
|
||||
Different variance reduction tree / eps placement / fusion boundary.
|
||||
|
||||
## A5. MRoPE + gate scalar pipeline -- NOT bit-identical (close)
|
||||
MRoPE: `ggml_rope_multi` (ggml sin/cos) vs vLLM rotary cos/sin cache (different theta eval + apply
|
||||
order). Gate: vLLM computes `-exp(A_log)*softplus(a+dt)` then `exp` **in-kernel**; llama computes
|
||||
`softplus(alpha+ssm_dt)*ssm_a` as split graph ops with `ssm_a` baking `-exp(A_log)` at GGUF-convert
|
||||
time (rounded once), writes/reloads the intermediate, `expf` in-kernel. Same algebra, different
|
||||
rounding points + convert-time vs runtime `exp(A_log)`.
|
||||
|
||||
## A6. Flash attention (full-attn layers) -- NOT bit-identical (close)
|
||||
llama `ggml_flash_attn_ext` -> `fattn-mma-f16`/`fattn-vec` (online softmax, F16/F32 PV accum per
|
||||
`GGML_PREC`) vs vLLM FlashInfer/FA2. Different tiling => different running max/sum order => different
|
||||
rounding.
|
||||
|
||||
## A7. SiLU/sigmoid primitives + fusion -- equivalent IF inputs matched (they never do)
|
||||
Both ultimately use the same hardware `expf`/`__nv_expf`; the primitives could match given identical
|
||||
inputs, but every upstream value has diverged, and vLLM fuses act+quant / norm+quant differently than
|
||||
llama's separate ops (run log `fuse_act_quant=True`), moving the rounding points.
|
||||
|
||||
### Inventory summary
|
||||
|
||||
| Source | bit-identical? | divergence size |
|---|---|---|
| FP4 GEMM (proj/LM head): MMQ q8_1(A8) vs cutlass w4a4(A4)/marlin w4a16 | **NO** | **>>ULP (~1e-2)** |
| GDN recurrence: hand-CUDA warp-reduce vs Triton tl.sum | **NO (provable even in f32)** | reassoc + tree |
| conv1d+SiLU: f32 conv-state vs bf16 conv-state | NO | dtype + order |
| L2-norm / RMSNorm | NO | ~1e-6 (tree) |
| MRoPE | NO | ~ULP-1e-6 |
| gate softplus/exp | NO | rounding points |
| flash attention | NO | softmax tiling |
| silu/sigmoid primitive | identical IFF inputs equal | inputs never equal |
|
||||
|
||||
Any single NO makes the logits differ. A1 and A2 differ by far more than ULP -> the logit vectors are
|
||||
not close-to-equal at the bit level; they agree only to a few significant digits.
|
||||
|
||||
---
|
||||
|
||||
# PART B - The decisive f32-state correction (proof from source)
|
||||
|
||||
The byte-gate analysis inferred that vLLM's GDN temporal state is **bf16** (402 MB/call, 41% peak) and built the
|
||||
"bf16-width is the lever" case on it. The byte count was *inferred from the dtype*; ncu byte counters
|
||||
were blocked, so only the **duration** (3.62 ms/call) was measured. The dtype inference is falsified:
|
||||
|
||||
1. `config.json`: `architectures=["Qwen3_5ForConditionalGeneration"]`, `text_config.dtype=bfloat16`,
|
||||
and **`text_config.mamba_ssm_dtype = "float32"`**.
|
||||
2. `models/config.py:590 MODELS_CONFIG_MAP` maps `"Qwen3_5ForConditionalGeneration"` (line 622) and
|
||||
`"Qwen3_5MoeForConditionalGeneration"` (623) to `Qwen3_5ForConditionalGenerationConfig`.
|
||||
3. `Qwen3_5ForConditionalGenerationConfig.verify_and_update_config` (config.py:536-562):
|
||||
`mamba_ssm_dtype = getattr(hf_text_config,"mamba_ssm_dtype")` (="float32"); if
|
||||
`cache_config.mamba_ssm_cache_dtype == "auto"` (the default) it executes
|
||||
**`cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype`** -> sets it to **"float32"**.
|
||||
4. This override runs at config finalization: `config/vllm.py:856` -> `try_verify_and_update_config()`
|
||||
(vllm.py:1880-1900) looks up the arch in `MODELS_CONFIG_MAP` and calls `verify_and_update_config`.
|
||||
It runs **before** any layer/model state-dtype resolution.
|
||||
5. The bench left it default: `h2h_dense_vllm.sh` = `vllm serve .../q36-27b-nvfp4-vllm --enforce-eager
|
||||
--gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`; `dl-logs/vllm_dense.log`
|
||||
non-default args confirm none). So the override fires and the value is "float32".
|
||||
6. State dtype resolution reads the **already-overridden** value:
|
||||
- `gdn/base.py:53-57` `get_state_dtype()` -> `gated_delta_net_state_dtype(model_dtype=bf16,
|
||||
cache_config.mamba_cache_dtype="auto", cache_config.mamba_ssm_cache_dtype="float32")`.
|
||||
- `qwen3_5.py:678 get_mamba_state_dtype_from_config` likewise passes
|
||||
`vllm_config.cache_config.mamba_ssm_cache_dtype` (= "float32", post-override) - **not** a raw "auto".
|
||||
- `mamba_utils.py _mamba_state_dtype`: conv_state = `get_kv_cache_torch_dtype("auto", bf16)` = **bf16**;
|
||||
temporal_state, since `mamba_ssm_cache_dtype != "auto"`, = `STR_DTYPE_TO_TORCH_DTYPE["float32"]`
|
||||
= **torch.float32** (key verified: `torch_utils.py:33 "float32": torch.float32`).
|
||||
7. `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **second** tuple
|
||||
element (temporal) = **float32**, allocates the cache `dtype=state_dtype`. The packed_decode kernel
|
||||
round-trips f32: `b_h = tl.load(p_h0).to(f32)` ... `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))`
|
||||
with `p_ht.dtype == initial_state.dtype == float32`.
|
||||
|
||||
**=> vLLM's gated-DeltaNet temporal (recurrent) state cache for this model is float32, identical width
|
||||
to llama's f32 state.** The earlier "bf16" reading hardcoded the third arg as `"auto"` and missed the
|
||||
override at steps 3-4. Only the small *conv* state is bf16 in vLLM (f32 in llama: divergence A3, tiny
|
||||
byte stream).
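
The ordering argument in steps 1-7 can be modelled in a few lines. This is a hypothetical stand-in, not vLLM code: `CacheConfig`, `apply_model_override` and `resolve_state_dtypes` are illustrative names that only mirror the order of operations described above, and the "auto falls back to the model dtype" branch is assumed from the earlier bf16 reading.

```python
from dataclasses import dataclass

@dataclass
class CacheConfig:
    # Both default to "auto", as in the bench run (no dtype flag passed).
    mamba_cache_dtype: str = "auto"
    mamba_ssm_cache_dtype: str = "auto"

def apply_model_override(cache_cfg, hf_mamba_ssm_dtype):
    """Steps 3-4: an 'auto' SSM cache dtype is replaced by the checkpoint's value."""
    if cache_cfg.mamba_ssm_cache_dtype == "auto":
        cache_cfg.mamba_ssm_cache_dtype = hf_mamba_ssm_dtype
    return cache_cfg

def resolve_state_dtypes(model_dtype, cache_cfg):
    """Step 6: returns (conv_state_dtype, temporal_state_dtype)."""
    if cache_cfg.mamba_cache_dtype == "auto":
        conv = model_dtype
    else:
        conv = cache_cfg.mamba_cache_dtype
    if cache_cfg.mamba_ssm_cache_dtype == "auto":
        temporal = conv  # assumed fallback when nothing overrides "auto"
    else:
        temporal = cache_cfg.mamba_ssm_cache_dtype
    return conv, temporal

# Earlier (mistaken) reading: resolve with a raw "auto", skipping the override.
print(resolve_state_dtypes("bfloat16", CacheConfig()))
# ('bfloat16', 'bfloat16')

# Order described in steps 3-6: override first, then resolve.
print(resolve_state_dtypes("bfloat16", apply_model_override(CacheConfig(), "float32")))
# ('bfloat16', 'float32')
```

The only difference between the two calls is whether the override ran before resolution, which is exactly the step the earlier analysis skipped.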
|
||||
|
||||
## Re-derived efficiency table (measured duration + PROVEN f32 byte volume)
|
||||
|
||||
| kernel | state dtype (PROVEN) | bytes R+W/call | duration/call | eff. BW | % of 273 GB/s peak |
|---|---|---|---|---|---|
| llama `gated_delta_net_cuda` | f32 | 805 MB | 3.98 ms | 202 GB/s | **74%** |
| vLLM `..._packed_decode` | **f32 (not bf16)** | **805 MB (not 402)** | 3.62 ms | **222 GB/s** | **~81%** |
|
||||
|
||||
- **B1 (single-pass f32 byte floor): TRUE** (load-once/store-once `s_shard`, coalesced). *Sub-claim
|
||||
"more BW-efficient than vLLM (41%)" REFUTED* - 41% was the bf16 artifact; at the correct f32 byte
|
||||
count vLLM is at ~81%, i.e. **more** efficient than llama.
|
||||
- **B2 ("the gap is f32-vs-bf16 width"): REFUTED.** Equal f32 bytes both sides; the ~10% per-call gap
|
||||
is bandwidth **efficiency** on equal bytes, not width.
|
||||
- **B3 ("vLLM throughput REQUIRES bf16 state"): REFUTED.** vLLM reaches it *with f32 state*.
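
The table's arithmetic can be re-derived from the geometry and the two measured durations. A minimal sketch, assuming the 273 GB/s peak and the B=128, H=48, S_v=128 shapes quoted in these notes; the helper names are illustrative.

```python
PEAK_GBPS = 273.0  # peak DRAM bandwidth used throughout these notes, GB/s

def state_bytes_per_call(n_seqs=128, heads=48, s_v=128, elem_bytes=4):
    """Read + write of one s_v x s_v state per (sequence, head)."""
    return n_seqs * heads * s_v * s_v * elem_bytes * 2

def effective_bw(bytes_moved, duration_ms):
    """Returns (GB/s, fraction of peak) for one kernel call."""
    gbps = bytes_moved / (duration_ms * 1e-3) / 1e9
    return gbps, gbps / PEAK_GBPS

f32_bytes = state_bytes_per_call(elem_bytes=4)
for name, ms in (("llama gated_delta_net_cuda", 3.98), ("vLLM packed_decode", 3.62)):
    gbps, frac = effective_bw(f32_bytes, ms)
    print(f"{name}: {f32_bytes / 1e6:.1f} MB, {gbps:.0f} GB/s, {frac:.0%} of peak")

# The falsified bf16 assumption halves the byte count and so halves vLLM's
# apparent efficiency for the same measured duration.
bf16_frac = effective_bw(state_bytes_per_call(elem_bytes=2), 3.62)[1]
print(f"bf16 artifact: {bf16_frac:.0%} of peak")
```

Only the durations are measured; the byte volume comes from the dtype, which is why a wrong dtype assumption moved vLLM from ~81% to 41% without any change in the trace.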
|
||||
|
||||
---
|
||||
|
||||
# PART C - The f32-preserving lever, and where recompute/bf16 land
|
||||
|
||||
Since vLLM hits ~81% on the **same f32 byte volume** llama runs at ~74%, the missed lever is **raising
|
||||
llama's `gated_delta_net_cuda` achieved BW 74% -> ~81%**, bit-exact, NOT dtype width:
|
||||
- llama grid `(H=48, n_seqs=128, ceil(S_v/4)=32) = 196608` blocks of 128 threads; each warp owns ONE state
|
||||
column + warp-reduces over 128 rows. vLLM grid `(NV=4, B*HV=6144) = 24576` programs (num_warps=1),
|
||||
each owns a BV=32 x BK=128 tile. llama's far-finer blocking (8x more blocks, one column of work each,
|
||||
a butterfly reduce/token) is the likely cause of the ~7-point deficit. Retune toward fewer/larger blocks (more
|
||||
columns/block, vectorized f32x4 loads, better row coalescing) - changes thread/tile mapping + load
|
||||
width only, **keeps the per-column reduction order -> bit-exact vs llama's own f32**.
|
||||
- Upper bound: 74%->81% on ~50% of the step ~= -17 ms/step (384 -> ~367), ~+5% -> ~351 tok/s (~90% of
|
||||
vLLM 391), stacking with the landed bit-exact levers (oproj MMQ 0020 @86%, conv fusion 0021).
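
The upper bound can be checked with a short projection. This sketch assumes that only the recurrence share of the step (191 ms of 384 ms, from the byte-gate measurement) scales with achieved bandwidth, and that aggregate throughput is `n_seqs / step_time`, which is consistent with vLLM's 391 tok/s at a 327 ms step. `project_step` is an illustrative helper.

```python
def project_step(step_ms, recurrence_ms, bw_now, bw_target, n_seqs=128):
    """Scale only the bandwidth-bound recurrence share by bw_now / bw_target."""
    new_recurrence = recurrence_ms * bw_now / bw_target
    new_step = step_ms - recurrence_ms + new_recurrence
    tok_per_s = n_seqs / (new_step * 1e-3)
    return new_step, tok_per_s

base_tok_s = 128 / 0.384
new_step, new_tok_s = project_step(384.0, 191.0, 0.74, 0.81)
print(f"step {new_step:.1f} ms, {new_tok_s:.0f} tok/s (baseline {base_tok_s:.0f})")
```

Under these assumptions the projection lands within about 1% of the rounded figures in the bullet above; it is a ceiling, since it assumes the retune recovers the full 7 points.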
|
||||
|
||||
**Other f32-preserving avenues (adversarial sweep) - none beats the simple bf16 over-clock, but the
|
||||
occupancy tune above is the real bit-exact win:**
|
||||
- *Lossless sub-f32 state:* generic float compression is data-dependent (1.1-1.5x, never a guaranteed
|
||||
2x) and breaks the 128-consecutive-f32 coalescing a BW-bound kernel depends on. The state is dense,
|
||||
full-rank, non-symmetric (sum of `k(x)delta`, k!=delta) -> no low-rank/half-storage. FAILS.
|
||||
- *Recompute (checkpoint every N + rank-1 replay):* eliminates the per-step WRITE; the per-step full
|
||||
dense f32 READ (the `S^T k` / `S^T q` matvecs need every element; the checkpoint is itself a full
|
||||
read) is irreducible. Optimal N~=11 -> ~473 MB/step (0.587x), realistically ~0.65-0.75x after
|
||||
replay/latency overhead. A genuine bit-exact path but it only reaches - never beats - the read floor,
|
||||
at large kernel/graph complexity. **Note: this was over-weighted before because vLLM was assumed
|
||||
bf16; now that vLLM is f32 too and runs at 81%, you do NOT need to cut the write to match vLLM - you
|
||||
need to match vLLM's achieved BW on the same f32 bytes.** Recompute is dominated.
|
||||
- *2nd stream / overlap / pipelining:* DRAM BW (273) is one shared resource; the whole decode step is
|
||||
uniformly BW-bound (state traffic + ~13.5 GB/step dense NVFP4 weight traffic both hit 273), so
|
||||
overlapping two BW-bound phases sums to ~0. FAILS.
|
||||
- *Equivalent recurrence with less decode traffic:* chunked gated-delta-rule is a prefill lever (C=1 at
|
||||
decode); attention/materialization-free form is O(t) over the prefix. FAILS.
|
||||
|
||||
**bf16 SSM state is therefore an OPTIONAL over-clock**, the only lever that goes *ahead* of vLLM on the
|
||||
recurrence (halving the 805 MB/call state traffic -> ~440 tok/s) - but it takes llama below both its own f32 and vLLM's f32
|
||||
precision, so it must be **KL/PPL-gated vs llama's own f32**, never md5. f32-only parity-class
|
||||
throughput is plausible from the SUM of bit-exact levers (recurrence occupancy + conv fusion + oproj
|
||||
MMQ + activation fold); none require bf16.
|
||||
|
||||
---
|
||||
|
||||
# PART D - Verdict on B4 + the meaningful weaker form
|
||||
|
||||
## Bit-exact-vs-vLLM: IMPOSSIBLE (B4 CONFIRMED) - two independent grounds
|
||||
|
||||
1. **Structural (this model):** A1 (FP4 GEMM operand precision + accumulation) and A2 (recurrence
|
||||
`g.Sigma` vs `Sigma.g` + different reduction trees) make per-layer outputs differ by >>ULP, so logits
|
||||
cannot be bit-identical. A2 shows it is not a precision knob: the kernels evaluate a *reassociated
|
||||
expression*, differing **even given identical f32 state and inputs**.
|
||||
2. **General (any two engines):** IEEE-754 add/mul are non-associative; two engines that tile, reduce,
|
||||
fuse, and quantize differently cannot produce bit-identical results for a non-trivial transformer.
|
||||
Field determinism work (batch-invariant / fixed-reduction kernels, "defeating nondeterminism in LLM
|
||||
inference") delivers **run-to-run determinism WITHIN one engine**; it does **not** and cannot deliver
|
||||
**cross-engine** bit-exactness (that needs identical kernel+tiling+reduction-order+dtype for *every*
|
||||
op). Cross-engine bit-exactness is essentially never achieved in practice. Bit-exactness is only a
|
||||
meaningful gate **within** an engine (how llama patches 0018-0021 are validated by md5).
|
||||
|
||||
## Greedy-token match (argmax robustness) - the right weaker form, but probabilistic
|
||||
Because logits differ mostly in low-order bits (A4-A7) plus a few-significant-digit GEMM/recurrence gap
|
||||
(A1-A2), the **argmax** frequently coincides whenever the top-1/top-2 logit margin exceeds the
|
||||
cross-engine noise. This is the only meaningful cross-engine "equivalence"; gate on **top-1 agreement /
|
||||
KL / PPL-delta**, never md5. Caveats: not guaranteed per-token (low-margin steps can flip); it
|
||||
**compounds** - once one greedy token differs the sequences fork and the KV/SSM states diverge, so
|
||||
agreement degrades with length (high on short continuations, drift on long ones); and the FP4 A4-vs-A8
|
||||
gap (A1) makes the per-step divergence *larger* here than a same-precision bf16-vs-bf16 comparison,
|
||||
weakening greedy agreement for this model specifically.
|
||||
|
||||
**Bottom line:** target near-vLLM via KL/PPL/top-1-agreement, not bit-exactness. Reserve bit-exact
|
||||
gating for intra-llama validation (the f32 recurrence-occupancy lever and the conv fusion qualify;
|
||||
bf16 state does not and must be KL/PPL-gated vs llama's own f32).
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,53 +0,0 @@
|
||||
# GDN Recurrence Byte-Gate - Progress (agent: ncu-byte-gate)
|
||||
|
||||
## Hard blocker on direct DRAM counters
|
||||
- ncu HW perf counters: ERR_NVGPUCTRPERM (NVreg_RestrictProfilingToAdminUsers=restricted, root-only).
|
||||
- nsys --gpu-metrics-devices: same ERR_NVGPUCTRPERM.
|
||||
- No passwordless sudo on dgx.casa. DRAM byte counters UNOBTAINABLE without root.
|
||||
- FALLBACK (decisive, no perfcounters needed): CUPTI kernel TIMING (allowed) + exact byte
|
||||
geometry from kernel source => implied effective BW + a hard mathematical cap on re-stream factor.
|
||||
|
||||
## Byte geometry (exact, from gated_delta_net.cu + GGUF)
|
||||
- Qwen3.5 dense q36-27b-nvfp4: 48 GDN layers, H=48 v-heads, S_v=128 (square state 128x128/head).
|
||||
- State per (seq,head) = 128*128 f32 = 64 KiB. Per seq = 48*64KiB = 3.0 MiB.
|
||||
- Kernel is SINGLE-PASS by construction: loads s_shard[] ONCE into regs, recurrence in-register,
|
||||
writes state ONCE (read_state coalesced 128 consecutive f32/warp; writeback coalesced).
|
||||
l2norm/sigmoid/softplus/gate act on small q/k/g/beta (NOT the 805MB state); gather no-ops at
|
||||
steady decode (identity seqs). => NO multi-pass state re-streaming exists to fuse away.
|
||||
- Minimal bytes/call (B=128): state R+W = 128*48*16384*4*2 = 805.3 MB; +q/k/v/out ~10 MB = ~816 MB.
|
||||
- Floor time @273 GB/s = 816MB/273 = 2.99 ms/call.
|
||||
|
||||
## Measured (clean nsys CUDA timing, graphs OFF, npp8 ntg12 npl128, build-cuda-base df1cc97)
|
||||
- llama gated_delta_net_cuda steady decode: 480 calls, grid(48,128,32), avg 3.98 ms/call
|
||||
(min 3.90, max 4.33; very tight => bandwidth-bound). 48 layers => 191 ms/step (50% of 384 ms).
|
||||
- Implied effective BW @1.0x bytes = 816MB/3.98ms = 205 GB/s = 75% of 273 peak.
|
||||
- HARD CAP: max bytes movable in 3.98ms @273 peak = 1.087 GB = 1.33x minimal.
|
||||
=> re-stream factor in [1.0x, 1.33x]. 2x re-streaming PHYSICALLY IMPOSSIBLE.
|
||||
Source proves single-pass+coalesced => ~1.0x, kernel at ~75% peak.
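
The floor time and the hard cap follow directly from the peak bandwidth and the measured duration. A minimal sketch, assuming the ~816 MB minimal byte count and the 273 GB/s peak stated above; the helper names are illustrative.

```python
PEAK_GBPS = 273.0  # peak DRAM bandwidth, GB/s

def floor_ms(min_bytes):
    """Fastest possible call if every byte moves exactly once at peak bandwidth."""
    return min_bytes / (PEAK_GBPS * 1e9) * 1e3

def restream_cap(min_bytes, measured_ms):
    """Largest multiple of min_bytes that could physically move in measured_ms."""
    max_bytes = PEAK_GBPS * 1e9 * measured_ms * 1e-3
    return max_bytes / min_bytes

min_bytes = 816e6  # state R+W (805.3 MB) plus ~10 MB of q/k/v/out
print(f"floor {floor_ms(min_bytes):.2f} ms/call, cap {restream_cap(min_bytes, 3.98):.2f}x")
```

Because the cap depends only on timing and peak bandwidth, it holds without access to the DRAM byte counters that were blocked.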
|
||||
|
||||
## Conv-path (same trace, steady-decode region kernels, per-call):
|
||||
- ssm_conv_f32: 672 calls whole-trace avg 135.9us (incl prefill); decode-region TBD
|
||||
- concat_cont: 576 calls avg 169.6us ; concat_non_cont 96 calls (prefill big)
|
||||
- cpy_scalar: 896 calls avg 123.7us ; gdn_gather_nonident 672 calls avg 153.9us (mostly no-op)
|
||||
|
||||
## vLLM (apples-to-apples: NSEQ=128, enforce_eager=True; postssm_decomp/vllm_decode.sqlite)
|
||||
- vLLM state dtype = model_dtype = BF16 (_mamba_state_dtype default "auto"; config dtype=bfloat16).
|
||||
Geometry identical to llama (H=48, k/v head_dim 128, S_v 128).
|
||||
- vLLM fused_recurrent_gated_delta_rule_packed_decode steady: 3.62 ms/call (grid 4x6144x1),
|
||||
bf16 state R+W = 402.6 MB => 111 GB/s = 41% peak. SINGLE-PASS (load p_h0 once -> f32 regs ->
|
||||
store bf16 once).
|
||||
- llama 3.98 ms/call, f32 805.3 MB => 202 GB/s = 74% peak. llama kernel is MORE BW-efficient.
|
||||
|
||||
## Conv-path (llama steady decode, per call x48 layers)
|
||||
- concat_cont 169.6us (8.14 ms/step) + cpy_scalar 120.1us (5.76) + ssm_conv_f32 115.9us (5.56)
|
||||
= ~19.5 ms/step. Conv state ~12.6 MB (tiny) => LAUNCH-bound, not byte-bound => fusion lever (~5%).
|
||||
- l2_norm 6.8us, gdn_gather 1.21us (no-op identity seqs => gather does NOT re-stream state).
|
||||
|
||||
## FINAL VERDICT (DONE)
|
||||
- llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x physically impossible @273 peak).
|
||||
- NO-BUILD fused single-pass recurrence: already single-pass, coalesced, 74% peak (> vLLM 41%);
|
||||
gate ops touch tiny q/k/g/beta, not the 805MB state => recovers ~0 state bytes.
|
||||
- BUILD bf16 SSM state (design lever (2)): the 2x gap vs vLLM is 100% f32-vs-bf16 cache width.
|
||||
805->413 MB => ~45-95 ms/step => step 384 -> 289-339 ms = parity-to-ahead of vLLM 327.
|
||||
Non-bit-exact vs llama f32 but equal to vLLM's own bf16 precision.
|
||||
- Findings written: GDN_RECURRENCE_BYTE_GATE.md (MEASUREMENT + VERDICT section appended).
|
||||
@@ -1,57 +0,0 @@
|
||||
# B_MOE_PROGRESS.md - B-3 (mmq_y-down warp-remap, patch 0028) checkpoint
|
||||
|
||||
Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev`
|
||||
`2f4f5ab`, branch `b-work`), independent of the held hybrid 0026. Worktree: `.../feat+paged-attention`.
|
||||
|
||||
## Prior: B-2 (act-quant retune) = NEGATIVE (no lift, no patch 0027). MoE ~85% of vLLM @npl128.
|
||||
B-2 proved the act-quant tax (~2%) is already optimally tiled; the structural MoE residual is the
|
||||
grouped FP4 `mul_mat_q<NVFP4>` GEMM (~27%, LPDDR5x BW floor) + bf16 projections (~10.5%). => try B-3.
|
||||
|
||||
## The lever (B-3 / SPEEDUP_HUNT B rank #3)
|
||||
mmq_y-down warp-remap of the NVFP4 FP4-MMA grouped GEMM `mul_mat_q<NVFP4>` in `ggml/.../mmq.cuh`.
|
||||
mmq_y tiles the weight-row (N) dimension; lowering 128->64 raises resident CTAs (smaller per-CTA
|
||||
shared + accumulator + 128 vs 256 threads/CTA => ~2x blocks/SM) to hide LPDDR5x weight-load latency,
|
||||
WITHOUT re-reading weights (each weight row lives in exactly one row-tile => BW-neutral). The MoE
|
||||
GEMM runs at ~35% of peak BW (occupancy-limited, NOT BW-saturated), so more resident CTAs is the
|
||||
right mechanism - and it is the ONE untested occupancy lever (M-tile = NEUTRAL 0015, MINBLOCKS =
|
||||
+8.7% slower 0017).
|
||||
|
||||
## The coupling that makes it a real kernel change (not the 0017 knob alone)
|
||||
The FP4-MMA path has `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; tile_C::I==16 for the
|
||||
m16n8k64 block-scaled FP4 MMA). nwarps is global `256/warp_size = 8`, so mmq_y is pinned at 128. The
|
||||
0017 `GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP this assert at mmq_y=64. B-3 makes nwarps TYPE-AWARE:
|
||||
`mmq_get_nwarps_device<type>()` returns mmq_y/16 = 4 for NVFP4-reduced (else stock 8), keeping the
|
||||
coupling. 2 new overloads (device template + host 3-arg) + 9 call-site swaps to `<type>`. Default
|
||||
GGML_CUDA_FP4_MMQ_Y==128 returns stock nwarps for EVERY type => default build byte-identical to stock.
|
||||
|
||||
## Bit-exactness note (the real risk)
|
||||
The per-output K-reduction order is mmq_y-INVARIANT (each output row owned by one thread). BUT mmq_y=64
|
||||
DOUBLES nty (row-tiles), changing the stream-k kbc partition => an output tile's K-range may be split
|
||||
across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different grouping =>
|
||||
FP non-associativity CAN perturb the last logit bits => greedy argmax COULD flip. So B-3 is NOT
|
||||
bit-exact-by-construction in the md5 sense; the md5 gate is EMPIRICAL. md5 fail => not bit-exact => STOP.
|
||||
|
||||
## Status: COMPLETE - BIT-EXACT but FLAT. No patch 0028. Full result + assessment in B_MOE_RESULTS.md.
|
||||
- [x] Source-read mmq.cuh: nwarps/mmq_y coupling, FP4 MMA vec_dot, kernel+fixup+launch+case sites.
|
||||
- [x] Edited mmq.cuh: 2 nwarps overloads + 9 `<type>` swaps. git diff clean (37+/11-).
|
||||
- [x] BEFORE baseline (stock-0025 binaries, same session): dense md5 5951a5b4==ref, moe 07db32c2==ref;
|
||||
MoE S_TG npl32=441.98, npl128=756.47.
|
||||
- [x] BUILD build-cuda @mmq_y=64 (full cuda rebuild): EXIT=0 - compiles (static_assert holds at 4*16=64).
|
||||
- [x] md5 GATE PASS both models @64; test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.
|
||||
- [x] Clean back-to-back A/B (build-cuda-base @128 vs build-cuda @64), 3 reps: npl32 +0.29%,
|
||||
npl128 +0.40% - within the ~0.4% noise band. FLAT.
|
||||
- [x] nsys A/B: grouped GEMM kernel mmq_y=64 ~1.3% FASTER, BUT stream_k_fixup +42% costlier + SSM (40%)
|
||||
dominant & untouched => end-to-end inert. BW-bound confirmed (same as 0015/0017/B-2).
|
||||
- [x] DECIDED: FLAT -> no patch 0028. Dev tree reverted to pristine 0025 (no ggml diff), build-cuda
|
||||
reconfigured to default + rebuilt. Bit-exact MoE ceiling = ~85% @npl128 / ~87.5% @npl32 of vLLM.
|
||||
- [x] ASSESS + RECOMMEND (in B_MOE_RESULTS.md): residual = structural Marlin-NvFp4 grouped-GEMM gap,
|
||||
uncloseable bit-exactly; fall back to 0026 bf16-SSM opt-in (default-off, fails MoE KL gate, ~95%).
|
||||
|
||||
## Gate references
|
||||
- dense q36-27b-nvfp4 md5 == 5951a5b4d624ce891e22ab5fca9bc439
|
||||
- MoE q36-35b-a3b-nvfp4 md5 == 07db32c2bcb78d17a43ed18bc22705cd
|
||||
- gate cmd: `llama-completion -m M -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`
|
||||
- bench: `llama-batched-bench -m M -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128` (S_TG=decode_agg)
|
||||
- vLLM ref decode_agg @npl128 = 882.2 t/s (npl32 ref 500.8).
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,232 +0,0 @@
|
||||
# B_MOE_RESULTS.md - B-2 (down_proj act-quant retune / M1) RESULT: NEGATIVE (no headroom)
|
||||
|
||||
Agent: B2-build (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`,
|
||||
branch `b-work`), independent of the held hybrid 0026 (`33e7c65`). Lever: SPEEDUP_HUNT.md section B,
|
||||
rank #2 ("down_proj act-quant retune (M1): bit-exact, bounded - act-quant is ~2% of MoE step").
|
||||
|
||||
## VERDICT
|
||||
**The existing `blockDim.x = 128` is ALREADY the kernel-level optimum for `quantize_mmq_nvfp4` on
|
||||
GB10 sm_121. B-2 has zero headroom: there is nothing to bake (128 is the current default), and it
|
||||
does NOT lift MoE decode (end-to-end flat within 0.4% noise across all block sizes). No patch 0027.**
|
||||
MoE stays ~85% of vLLM @npl128 / ~87% @npl32, well below vLLM => the remaining MoE lever is B-3.
|
||||
|
||||
## The change that was built+measured (bit-exact, then REVERTED - did not lift)
|
||||
`ggml/src/ggml-cuda/quantize.cu`, `quantize_mmq_fp4_cuda` NVFP4 branch. Replaced the hardcoded
|
||||
`constexpr int nvfp4_block_size = 128` with a `static const int` selected once from env
|
||||
`LLAMA_MOE_QUANT_BLOCK` (default 128), `block_num_y` recomputed from the SAME `blockDim.x`. ~20 LOC.
|
||||
|
||||
### Why ANY block size is provably byte-identical (the bit-exact invariant)
|
||||
`quantize_mmq_nvfp4` maps thread -> column purely via the global linear index
|
||||
`gy = blockDim.x*blockIdx.y + threadIdx.x` -> `i0_base = gy*QK_NVFP4_SUB`, with NO cross-thread
|
||||
communication (no shared memory, no warp reduction) and every thread writing its OWN disjoint output
|
||||
sub-block (its own `sub` slot in `block_fp4_mmq`: `yqs[2*sub+0/1]`, `d4[sub]`). The per-thread quant
|
||||
body (amax, the 5-offset fp8-code search, the q0/q1 nibble packing, the writeback) is untouched. So
|
||||
the (thread)->output-byte map - and the produced bytes - are invariant to `blockDim.x`. Confirmed
|
||||
empirically: md5 identical at block 64, 128, AND 256, both models.
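
The invariant can be illustrated by enumerating the index map on the host. This sketch assumes a ceil-div `block_num_y` and an in-range guard on `gy`; both are assumptions about the launch code, and `columns_covered` is a hypothetical helper, not a function from `quantize.cu`.

```python
def columns_covered(n_cols, block_dim_x):
    """Enumerate gy = blockDim.x*blockIdx.y + threadIdx.x for one launch.

    Returns a map: column index -> (blockIdx.y, threadIdx.x) that writes it.
    """
    block_num_y = (n_cols + block_dim_x - 1) // block_dim_x  # recomputed from blockDim.x
    owner = {}
    for block_idx_y in range(block_num_y):
        for thread_idx_x in range(block_dim_x):
            gy = block_dim_x * block_idx_y + thread_idx_x
            if gy < n_cols:                # assumed bounds guard
                assert gy not in owner     # each column has exactly one writer
                owner[gy] = (block_idx_y, thread_idx_x)
    return owner

n_cols = 1000
for block_dim_x in (64, 128, 256):
    covered = columns_covered(n_cols, block_dim_x)
    # Same set of columns, each written once, whatever the block size.
    assert sorted(covered) == list(range(n_cols))
print("column coverage is identical for block sizes 64, 128, 256")
```

Which (block, thread) pair owns a column changes with the block size, but the set of columns and the per-column work do not, so the written bytes cannot depend on `blockDim.x` as long as threads never communicate.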
|
||||
|
||||
## GATE (bit-exact) - BOTH MODELS PASS at default AND at non-128 blocks
|
||||
greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`:
|
||||
|
||||
| block | dense q36-27b-nvfp4 md5 | MoE q36-35b-a3b-nvfp4 md5 |
|------:|-------------------------|---------------------------|
| 128 (default) | 5951a5b4d624ce891e22ab5fca9bc439 == ref | 07db32c2bcb78d17a43ed18bc22705cd == ref |
| 64 | 5951a5b4...439 == ref | 07db32c2...5cd == ref |
| 256 | 5951a5b4...439 == ref | 07db32c2...5cd == ref |
|
||||
|
||||
test-backend-ops (CUDA0): **MUL_MAT 1146/1146 PASS**, **MUL_MAT_ID 806/806 PASS**.
|
||||
|
||||
## MEASUREMENT 1 - end-to-end MoE decode_agg (S_TG t/s), the actual throughput
|
||||
`llama-batched-bench -m q36-35b-a3b-nvfp4.gguf -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`,
|
||||
1 rep/block (run-to-run noise ~0.3-0.5%):
|
||||
|
||||
| block | npl=32 S_TG | npl=128 S_TG |
|------:|------------:|-------------:|
| 32 | 437.54 | 750.41 |
| 64 | 437.82 | 751.68 |
| 96 | 437.69 | 749.46 |
| **128 (base/default)** | **438.14** | **751.76** |
| 160 | 436.38 | 750.99 |
| 192 | 436.81 | 751.61 |
| 256 | 437.77 | 750.14 |
|
||||
|
||||
Spread: npl32 = 1.76 t/s (0.4%), npl128 = 2.3 t/s (0.3%) - all within noise. **No block size lifts
|
||||
end-to-end decode.** Expected: the act-quant is ~2% of the MoE step, so even a perfect (0 ns) quantize
|
||||
kernel caps the end-to-end win at ~2%, and 128 is already optimal => measured 0%.
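
The ~2% ceiling is an Amdahl-style bound. A minimal sketch, assuming the act-quant share of the step is the ~2% quoted above; `max_end_to_end_gain` is an illustrative helper.

```python
def max_end_to_end_gain(fraction, kernel_speedup=float("inf")):
    """Amdahl bound: relative throughput gain when only `fraction` of the step speeds up."""
    remaining = (1.0 - fraction) + fraction / kernel_speedup
    return 1.0 / remaining - 1.0

# A zero-cost quantize kernel (infinite speedup) on a 2% share of the step.
print(f"{max_end_to_end_gain(0.02):.2%}")
```

Any realistic kernel speedup gives less than this bound, which is already close to the run-to-run noise band.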
|
||||
|
||||
## MEASUREMENT 2 - nsys kernel-level delta on quantize_mmq_nvfp4 (the meaningful B-2 metric)
|
||||
`nsys --report cuda_gpu_kern_sum`, MoE, `GGML_CUDA_DISABLE_GRAPHS=1 -npp 4 -ntg 32 -npl 128`,
|
||||
8,193 kernel invocations (the kernel is 2.0-2.2% of GPU time in this decode-heavy window):
|
||||
|
||||
| block | total ns | avg ns | median ns | vs 128 (total) |
|------:|---------:|-------:|----------:|---------------:|
| 64 | 127,523,328 | 15,564.9 | 12,256 | +8.7% slower |
| **128 (default)** | **117,371,424** | **14,325.8** | **11,488** | baseline (fastest) |
| 192 | 128,970,464 | 15,741.5 | 12,032 | +9.9% slower |
| 256 | 125,422,048 | 15,308.4 | 11,936 | +6.9% slower |
|
||||
|
||||
**128 is a clean local minimum** (faster than 64 below and 192/256 above). Its immediate neighbors 96
and 160 are end-to-end-neutral; nsys-stats flaked on their re-runs, so they have no kernel-level
numbers, but they cannot beat a bracketed local min. The 7-10% kernel-level regression of the
alternatives at 0% end-to-end change is exactly
|
||||
why end-to-end is flat: this BW-bound, 256-tiny-expert model has no col-tile/occupancy headroom in
|
||||
the act-quant - the same conclusion patch 0015 reached for the M-tile and patch 0017 for MINBLOCKS.
|
||||
|
||||
## WHERE MoE STANDS (decode_agg, this base = 0025 with the re-graph)
|
||||
vLLM ref @npl128 = 882.2, @npl32 = 500.8.
|
||||
- npl128: 751.8 / 882.2 = **85.2% of vLLM**
|
||||
- npl32: 438.1 / 500.8 = **87.5% of vLLM**
|
||||
|
||||
B-2 adds 0 (within noise). MoE is **still well below vLLM** => **TRY B-3** (the mmq_y-down warp-remap
|
||||
on the grouped `mul_mat_q<NVFP4,M-tile=64>` GEMM, ~27% of the MoE step - the only untested MoE GEMM
|
||||
lever; SPEEDUP_HUNT B rank #3, real kernel change, bit-exact, predicted bounded on this BW-bound
|
||||
model). The structural MoE residual is the FP4 grouped GEMM at the LPDDR5x BW floor + the bf16
|
||||
projections (~10.5%); the act-quant tax (~2%) is NOT where the gap lives and is already optimally
|
||||
tiled. Recurrence (~48%) is already past vLLM (0018-0022).
|
||||
|
||||
## DECISION
|
||||
No patch 0027 (B-2 does not lift; dev tree reverted to pristine 0025). The `LLAMA_MOE_QUANT_BLOCK`
|
||||
hook + this measurement confirm 128 is the GB10 optimum, should other hardware ever want re-tuning.
|
||||
Hand off to B-3 (patch 0028) as the next MoE GEMM lever.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
# B-3 (mmq_y-down warp-remap of the NVFP4 grouped GEMM) RESULT: BIT-EXACT but FLAT (no patch 0028)

Agent: B3-or-assess (GPU agent, DGX GB10 sm_121). Base: clean 0025 tip (`~/llama-paged-dev` `2f4f5ab`,
branch `b-work`), independent of the held hybrid 0026. Lever: SPEEDUP_HUNT.md section B rank #3 - the
0017-deferred structural `mmq_y`-down warp-remap on the grouped FP4-MMA `mul_mat_q<NVFP4>` (the ~26-27%
MoE-specific GEMM), the only untested MoE GEMM occupancy lever.

## VERDICT
**Bit-exact (md5 PASS both models + test-backend-ops PASS), but end-to-end FLAT: npl128 +0.3-0.4%
(consistent direction, kernel-backed) and npl32 +0.1-0.3%, both inside the ~0.4% run-to-run band. The
warp-remap makes the grouped GEMM kernel ITSELF ~1.3% faster (occupancy DID rise) but the step is
BW/SSM-bound, so it does NOT lift MoE decode. No patch 0028.** MoE stays ~85% of vLLM @npl128.

## The change that was built+measured (bit-exact, then REVERTED)
`ggml/src/ggml-cuda/mmq.cuh`. The FP4-MMA path couples the weight-row tile to the warp count via the
invariant `static_assert(nwarps*tile_C::I == mmq_y)` (mmq.cuh:3280; `tile_C::I==16` for the m16n8k64
block-scaled FP4 MMA). `nwarps` is global `256/warp_size = 8`, pinning `mmq_y=128`; the 0017
`GGML_CUDA_FP4_MMQ_Y` knob alone would TRIP that assert at 64. B-3 makes nwarps TYPE-AWARE: a new
`mmq_get_nwarps_device<type>()` (+ 3-arg host overload) returns `mmq_y/16 = 4` for NVFP4-reduced (else
the stock 8), so `mmq_y=64 -> nwarps=4 -> 128 threads/CTA` (vs 256) -> ~2x resident CTAs. 2 overloads +
9 `<type>` call-site swaps (kernel, process_tile, write_back_mma, stream_k_fixup, nvfp4 loader, 2 host).
Built with `-DGGML_CUDA_FP4_MMQ_Y=64`; the compile SUCCEEDS (the static_assert now holds at 4*16=64).
**Default `GGML_CUDA_FP4_MMQ_Y==128` returns stock nwarps for every type => a default build is
byte-identical to stock** (the bit-exact opt-out, proven by the md5 below at 128).

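
The tile/warp coupling described above can be sketched in isolation. This is a minimal illustration and not the `mmq.cuh` code: `quant_type`, `tile_rows`, `nwarps_for` and `threads_per_cta` are hypothetical stand-ins, and only the invariant `nwarps * 16 == mmq_y` and the 8 -> 4 warp change are taken from the text.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real ggml types; only the arithmetic mirrors the text.
enum class quant_type { NVFP4, OTHER };

constexpr int warp_size   = 32;
constexpr int tile_rows   = 16;               // tile_C::I for the m16n8k64 block-scaled FP4 MMA
constexpr int stock_warps = 256 / warp_size;  // global default: 8 warps per CTA

// Type-aware warp count (assumed shape): NVFP4 with a reduced row tile gets
// mmq_y / 16 warps, every other case keeps the stock value.
constexpr int nwarps_for(quant_type t, int fp4_mmq_y) {
    return (t == quant_type::NVFP4 && fp4_mmq_y < stock_warps * tile_rows)
               ? fp4_mmq_y / tile_rows
               : stock_warps;
}

constexpr int threads_per_cta(quant_type t, int fp4_mmq_y) {
    return nwarps_for(t, fp4_mmq_y) * warp_size;
}

// The invariant the FP4-MMA path asserts: nwarps * tile_rows == mmq_y.
static_assert(nwarps_for(quant_type::NVFP4, 64) * tile_rows == 64, "reduced tile holds");
static_assert(nwarps_for(quant_type::NVFP4, 128) * tile_rows == 128, "default tile holds");
static_assert(nwarps_for(quant_type::OTHER, 64) == stock_warps, "other types untouched");
```

With the default tile of 128 the helper returns the stock 8 warps for every type, which is the property the text relies on for the byte-identical default build.
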
### Bit-exactness is EMPIRICAL here (not by-construction)
The per-output K-reduction order is mmq_y-invariant (each output row owned by one thread), but mmq_y=64
DOUBLES `nty` (row-tiles), changing the stream-k `kbc` partition => an output tile's K-range can be
split across CTAs at different points and recombined by `mul_mat_q_stream_k_fixup` in a different
grouping => FP non-associativity COULD perturb the last logit bits and flip a greedy argmax. It did NOT
for the gate prompt (md5 matched), but B-3 is therefore NOT bit-exact-by-construction - a default-ON
ship would be a (small) precision risk. This is a second reason not to ship it for a 0% gain.

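
The non-associativity risk named above is easy to reproduce on the host. The sketch below is not the stream-k fixup; `sum_with_split` is a hypothetical helper that only mimics the idea of one reduction being cut at different points and the partials recombined afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum v[0, split) and v[split, n) as two separate float partials, then add them.
// Moving `split` changes the grouping of the additions, not the set of operands.
inline float sum_with_split(const std::vector<float> & v, std::size_t split) {
    float lo = 0.0f;
    float hi = 0.0f;
    for (std::size_t i = 0; i < split; ++i) {
        lo += v[i];
    }
    for (std::size_t i = split; i < v.size(); ++i) {
        hi += v[i];
    }
    return lo + hi;
}
```

For the operands `{1e8f, -1e8f, 1.0f}`, a split after the first element absorbs the `1.0f` into `-1e8f` (float spacing near 1e8 is 8), while a split after the second element keeps it. The operands are identical and the results differ, which is why a changed `kbc` partition cannot be assumed bit-exact without measuring.
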
## GATE (bit-exact) - BOTH MODELS PASS
greedy `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`:
- dense q36-27b-nvfp4 = 5951a5b4d624ce891e22ab5fca9bc439 == ref
- MoE q36-35b-a3b-nvfp4 = 07db32c2bcb78d17a43ed18bc22705cd == ref
- test-backend-ops CUDA0: **MUL_MAT 1146/1146 PASS, MUL_MAT_ID 806/806 PASS.**

## MEASUREMENT 1 - end-to-end MoE decode_agg, clean BACK-TO-BACK A/B (build-cuda-base @128 vs build-cuda @64)
`llama-batched-bench -m q36-35b-a3b-nvfp4 -c 32768 -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128`, S_TG t/s,
3 reps alternating (no concurrent load):

| npl | mmq_y=128 (base) mean | mmq_y=64 (B-3) mean | delta |
|----:|----------------------:|--------------------:|------:|
| 32 | 437.6 (437.3-437.7) | 438.8 (438.4-439.1) | +0.29% |
| 128 | 750.1 (748.9-751.1) | 753.1 (753.0-753.4) | +0.40% |

Every B-3 rep edges the base by +0.3-0.4% @npl128 (consistent, kernel-backed), but the delta (~3 t/s)
is about the size of the base's own per-build spread (748.9-751.1) - it is at the edge of noise, NOT a
meaningful lift. This caps the end-to-end win at well under 1%, nowhere near the gap to vLLM (882).

## MEASUREMENT 2 - nsys kernel-level A/B (the meaningful B-3 evidence), clean, no concurrent load
`GGML_CUDA_DISABLE_GRAPHS=1 nsys ... -npp 4 -ntg 32 -npl 128`, decode-isolated window, `cuda_gpu_kern_sum`:

| kernel (% of window) | mmq_y=128 total ns | mmq_y=64 total ns | delta |
|---------------------------------|-------------------:|------------------:|-------:|
| gated_delta_net (SSM, ~40%) | 2,335,951,709 | 2,334,847,390 | 0.0% (untouched, DOMINANT) |
| **mul_mat_q<NVFP4, M-tile 64>** (MoE GEMM, ~26%) | **1,502,548,958** | **1,483,685,630** | **-1.26% (kernel faster)** |
| mul_mat_q<NVFP4, M-tile 128> (router, ~3.7%) | 224,532,704 | 210,885,920 | -6.1% |
| quantize_mmq_nvfp4 (act-quant, ~2%) | 119,118,624 | 118,718,496 | -0.3% |
| **mul_mat_q_stream_k_fixup<128>** (~0.6%) | **26,848,479** | **38,117,532** | **+42% (fixup COSTLIER)** |

The warp-remap DOES what it claims at the kernel level: the grouped GEMM is **-1.3%** (more resident
CTAs hide a sliver of weight-load latency). But (a) it is only ~26% of the step, (b) halving mmq_y
DOUBLES the row-tiles so the stream-k fixup recombination grows **+42%** (+11.3M ns), eating ~60% of the
GEMM's 18.9M-ns saving, and (c) the step is dominated by the gated_delta_net SSM (~40%, untouched, and
already PAST vLLM's BW efficiency per 0018-0022) with the GEMM itself at the LPDDR5x BW floor. Net
mul_mat region saving ~7.6M ns on a ~5.8B-ns window = ~0.13%; end-to-end +0.3-0.4% (within noise).
**This is the definitive BW-bound proof: even a real occupancy win on the target kernel does not move
end-to-end** - the same outcome as patch 0015 (M-tile NEUTRAL), 0017 (MINBLOCKS +8.7% slower), and B-2
(act-quant FLAT). The MoE grouped GEMM is bandwidth-limited, not occupancy-limited, at the kernel exit.

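
The net figure quoted above can be re-derived from the table. As a worked check for the grouped GEMM and its fixup only (using the table's totals and the text's ~5.8B-ns window, which is itself approximate):

$$
\begin{aligned}
\Delta_{\text{GEMM}} &= 1{,}502{,}548{,}958 - 1{,}483{,}685{,}630 = 18{,}863{,}328\ \text{ns} \\
\Delta_{\text{fixup}} &= 38{,}117{,}532 - 26{,}848{,}479 = 11{,}269{,}053\ \text{ns} \\
\frac{\Delta_{\text{fixup}}}{\Delta_{\text{GEMM}}} &\approx 0.60 \\
\Delta_{\text{net}} &= 18{,}863{,}328 - 11{,}269{,}053 = 7{,}594{,}275\ \text{ns} \\
\frac{\Delta_{\text{net}}}{5.8 \times 10^{9}\ \text{ns}} &\approx 0.13\%
\end{aligned}
$$
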
## DECISION
No patch 0028 (B-3 does not lift end-to-end; bit-exactness is empirical, not by-construction; the fixup
penalty + BW floor swamp the +1.3% kernel win). Dev tree reverted to pristine 0025 (no ggml diff),
build-cuda reconfigured to default (no flag) and rebuilt. The `mmq_get_nwarps_device<type>()` remap is a
correct, reusable warp-remap should occupancy-bound FP4 hardware ever appear; it is inert on GB10.

---

# FINAL ASSESSMENT - the honest bit-exact MoE ceiling, and the recommendation

## The bit-exact MoE GEMM/launch track is now EXHAUSTED
| MoE lever (bit-exact) | result | MoE decode_agg @npl128 |
|-----------------------|--------|------------------------|
| 0025 re-graph (B-1, LANDED) | the ONLY bit-exact MoE win | ~82% -> **~85%** of vLLM |
| B-2 act-quant retune (no patch) | FLAT (128 already optimal) | +0% |
| B-3 mmq_y-down warp-remap (no patch) | FLAT (kernel -1.3%, e2e +0.3% noise) | +0% |

**Honest bit-exact MoE ceiling on GB10 = ~85% of vLLM @npl128 (753 / 882.2), ~87.5% @npl32 (439 / 500.8).**
B-1 (re-graph, in 0025) banked the move from ~82% to ~85%; B-2 and B-3 each add 0. The grouped-GEMM/
launch track has no remaining bit-exact headroom.

## Is the residual the structural Marlin-MoE gap? YES.
The remaining ~15% is structural and uncloseable bit-exactly, decomposed from the nsys:
- **Grouped FP4 GEMM (~26%) is at the LPDDR5x BW floor.** B-3 proved an occupancy win there is
  end-to-end-inert. vLLM ships a purpose-built **Marlin-NvFp4** grouped GEMM (a different, more
  bandwidth-efficient schedule); llama runs native FP4-MMA W4A4 (a HIGHER arithmetic tier, but the
  decode shape is BW-bound so the tier does not help). This is THE structural gap and matches
  FP4_GEMM_SCOPE_B.md's "MoE ceiling ~76% from the GEMM track alone."
- **The SSM recurrence (~40%) is already PAST vLLM** (84.6% vs 82.4% peak BW, 0018-0022) - not a lever.
- **bf16 projections (~10.5%)** - both engines pay similar; not a bit-exact lever.

No bit-exact lever closes the structural grouped-GEMM gap. ~85% is the honest bit-exact MoE plateau.

## RECOMMENDATION: ship the bit-exact ~85% as DEFAULT; expose 0026 bf16-SSM as a documented opt-in for the last ~10% on MoE (NOT default, NOT in the recommended config)

Per the user's decision rule ("pursue B first; if it cannot reach/beat vLLM on MoE, fall back to the
held hybrid/bf16 opt-in"): **B (bit-exact) cannot reach vLLM on MoE (~85%), so the fallback applies -
but with a hard caveat the team must carry.**

1. **DEFAULT = the bit-exact plateau (0025 with the re-graph), MoE ~85% of vLLM.** This is the honest,
   precision-safe ship: the recurrence already BEATS vLLM's BW efficiency, the GEMM is the same FP4
   arithmetic class, and the output is byte-identical to the f32 reference. Do not claim MoE *parity*
   bit-exactly - claim ~85% with a precision profile at-or-above vLLM.

2. **FALLBACK (opt-in only) = 0026 hybrid bf16-SSM.** It is the ONLY remaining MoE lever (it speeds the
   ~40% recurrence, the part B does not touch): measured **+11.5% MoE decode** (1110.7 -> 1238.1 t/s in
   the 0026 harness) -> would lift MoE ~85% -> **~95% of vLLM**. BUT: (a) it is **non-bit-exact**; (b) it
   **FAILS the MoE KL ship-gate by a wide margin** (MeanKLD ~0.045 / Same-top-p ~91% vs the 1e-3 / 99.5%
   bar - the gated-DeltaNet state is hypersensitive to bf16; A_HYBRID_SSM_RESULTS.md: "MoE has NO low-KL
   regime ... Do NOT put a hybrid T in the gallery/recommended config"); and (c) even then it reaches
   **~95%, not a clean beat** of vLLM, while conceding precision vLLM keeps (all-f32 SSM state).

=> Ship 0026 default-OFF (`ssm_hybrid_tau_thresh = 0` / no `--ssm-bf16-tau`); expose the bf16-SSM as
an EXPLICIT opt-in flag for callers who knowingly accept a real MoE precision regression for ~+11.5%
decode (~95% of vLLM). Keep it OUT of the gallery/recommended MoE config.

**Bottom line for the parent:** bit-exact MoE on GB10 plateaus at **~85% of vLLM** and the residual is
the structural Marlin-NvFp4 grouped-GEMM gap that NO bit-exact lever closes (B-1 banked the re-graph;
B-2 and B-3 are 0). Bit-exact does NOT reach/beat vLLM on MoE. The only lever that closes more (to ~95%)
is the held 0026 bf16-SSM, which is **non-bit-exact AND fails the MoE KL gate** - so it ships **opt-in,
default-off, not in the recommended config**, not as the default. Recommend shipping the honest ~85%
bit-exact default and documenting the opt-in for users who accept the precision tradeoff. Do not market
MoE parity; the bit-exact default is ~85% with a precision profile at-or-above vLLM, which is the
defensible claim.

Assisted-by: Claude:opus-4.8 [Claude Code]
@@ -1,499 +0,0 @@
# Durable scope: token-granular continuous-batch scheduler for llama-server on GB10

Build-ready plan. **Not implemented in this workflow** (serving-loop rewrite). This
document scopes the durable path to give llama-server's `update_slots()` a vLLM-v1-style
token-granular continuous-batch scheduler, and records the single honest finding that
re-shapes what the change can and cannot buy.

Hardware: NVIDIA GB10 / DGX Spark (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
LPDDR5x ~273 GB/s. Models: dense Qwen3.6-27B NVFP4 (`~/bench/q36-27b-nvfp4.gguf`),
MoE Qwen3.6-35B-A3B NVFP4 (`~/bench/q36-35b-a3b-nvfp4.gguf`). Dev tree `~/llama-paged-dev`
(branch `paged`, HEAD `151343b`, patch 0015), `build-cuda` sm_121, `LLAMA_KV_PAGED=1`.
Scheduler code: `tools/server/server-context.cpp::update_slots()` (LocalAI override that
`#include`s it: `backend/cpp/llama-cpp/grpc-server.cpp`).

## TL;DR (the honest reframe)

Three findings, read directly from the source at HEAD `151343b` and from the committed
NVFP4 re-run (`QWEN36_NVFP4_BENCH.md`), collapse the apparent size of this work and reset
what it is allowed to claim:

1. **The unified mixed batch already exists.** `update_slots()` already builds ONE
   `llama_batch` per step = {every ready decode token} **+** {a bounded chunk of prefill
   tokens}, in a fixed two-phase order: Phase 1 (lines 2604-2719) appends every
   `SLOT_STATE_GENERATING` slot's sampled token **unconditionally** (no budget gate), then
   Phase 2 (lines 2753-3330) fills the remaining batch capacity with prompt tokens. Decode
   is therefore **already claimed first and never dropped or capped** - the exact property
   vLLM's "RUNNING-before-WAITING" pass works to guarantee is **free** here by construction.

2. **The chunked-prefill slot state already exists and already persists across steps.** A
   slot in `SLOT_STATE_PROCESSING_PROMPT` with `slot.prompt.n_tokens() < slot.task->n_tokens()`
   is a partial prefill; it stays in that state and resumes next step until its prompt is
   fully ingested, at which point it flips to `SLOT_STATE_DONE_PROMPT` -> `GENERATING`
   (line 3252, then 3502). Multiple slots can be `PROCESSING_PROMPT` and `GENERATING`
   simultaneously; there is **no global "one prefill at a time" gate**. So the mission's
   "allow a slot to be mid-prefill while others decode in the same step" is **not a state
   machine to build - it is already the behaviour.** This is the single biggest de-risking
   fact in this document.

3. **What is genuinely missing is the budget POLICY, and it is small.** Patch 0013
   (`LLAMA_PREFILL_BUDGET`) is a single **static** per-step prefill cap, consumed greedily by
   slots in iteration order. It is not decode-load-aware (does not subtract the live decode
   count `D`), not adaptive (one constant across npl 8..128), and not fair (the first
   `PROCESSING_PROMPT` slot can eat the whole budget). The durable delta is to convert that
   static cap into vLLM's **dynamic, decode-first, per-slot-fair token budget**: one total
   per-step budget `T`, decode claims its `D` tokens first, prefill gets the **leftover**
   `T - D` distributed across waiting prompts with a per-slot cap. That is ~the only
   behavioural change. **No new slot states, no batch-formation rewrite.**

### The honest ceiling (this is load-bearing for how the work is scoped and sold)

The committed re-run and a dedicated profiling pass (`QWEN36_NVFP4_BENCH.md`, plus
`~/bench/stag_128.json`) establish that **the residual ~2.4x high-concurrency decode gap is a
decode-KERNEL batch-scaling ceiling, not a scheduler defect**:

- At npl8 the kernels are **at parity** (dense 99%, MoE 84% of vLLM decode).
- A clean staggered full-batch-128 run, with **all 128 slots cleanly decoding and zero
  prefill starvation**, still tops out at **decode_agg 157.4 tok/s** (dense) - the same
  ~157-161 ceiling that four independent measurements converge on. vLLM does **390.7** at the
  same effective batch. With a *perfect* scheduler the kernel still gives ~157. **The
  scheduler cannot lift this.**
- Patch 0013 budget-256 **already reaches ~161** (the ceiling) at npl128. So a token-granular
  scheduler buys **little additional steady-state decode_agg** over 0013 on the all-at-once
  workload.

Therefore this scheduler's deliverable is **NOT "match vLLM's 391/811 decode."** It is:

- **Close the 12x TTFT gap** (dense 305 s @ 0013 / 491 s stock -> vLLM's ~25 s, and ~2 s on
  staggered arrival) - the genuine, large win.
- **Robustly HOLD the decode ceiling** (~161 dense / ~333 MoE @npl128) **without
  per-workload budget tuning** - 0013 needs a hand-picked constant (256 for dense, costs MoE
  TTFT, net-negative at low npl); the dynamic `T - D` budget is self-tuning across the whole
  npl range and across dense vs MoE.
- **Burst-robustness**: bounded TTFT for *all* concurrently-arriving prompts (kill the
  burst-TTFT spread), and no admission collapse under sustained load.

Closing the residual 2.4x decode-throughput gap is a **separate, named lever**: the
paged-attention **decode-kernel** batch-scaling work (patches 0009-0011 territory) and/or
CUDA-graphed decode. It is called out explicitly in P3 and is **out of this scope's
scheduler mandate**. We must measure and sell this work on **TTFT + burst-robustness +
self-tuning hold of the ceiling**, never on a decode_agg number the kernel forbids.

## The gap, precisely localized (recap of the committed bench)

At matched NVFP4 on one GB10 box (`QWEN36_NVFP4_BENCH.md`), llama (patch 0015) vs vLLM 0.23.0.
Each cell is decode_agg tok/s / TTFT mean, npl swept 8/32/64/128:

| npl | dense llama (0013 b256) | dense vLLM | MoE llama (0013 b256) | MoE vLLM |
|----:|------------------------:|-----------:|----------------------:|---------:|
| 8 | 63.5 / 4.3 s | 64.3 / 2.6 s | 169.3 / 1.7 s | 202.0 / 0.8 s |
| 32 | 105.7 / 23.1 s | 189.8 / 7.5 s | 239.0 / 9.0 s | 462.0 / 2.3 s |
| 64 | 132.0 / 109 s | 284.2 / 13 s | 277.0 / 16.2 s | 624.5 / 4.1 s |
| 128 | **161.2 / 305 s** | 390.7 / 24.8 s | **333.5 / 98 s** | 811.1 / 8.0 s |

Both models converge to the **same ~41% of vLLM decode at npl128** after 0013. That
convergence is the signal: once prefill starvation is removed, a dense model and a
12x-cheaper-prefill MoE land on the **identical** ceiling -> the residual is **not prefill**
and **not the kernel-at-parity-@npl8** - it is the **quality of the per-step batching
decision** (TTFT/robustness) plus the **kernel decode ceiling** (the throughput residual).
This scope addresses the first; it names the second as the separate lever.

## What already exists (reuse, do NOT rebuild)

All line numbers verified at `tools/server/server-context.cpp` HEAD `151343b`.

- **[A] decode-first co-batch** - Phase 1, lines 2604-2719. Iterates `slots`; every
  `SLOT_STATE_GENERATING` slot (gated only by `can_batch_with`, line 2611) is pushed to
  `generating[]`; line 2715-2719 `for (slot : generating) slot.update_batch(batch)` appends
  its sampled token (+ draft tokens) via `common_batch_add`. After this loop,
  `batch.n_tokens == D` (the decode-token count). **No budget gate** - decode always goes in.
- **[B] chunked-prefill state per slot** - the pair `slot.prompt.n_tokens()` (=
  `num_computed_tokens`) vs `slot.task->n_tokens()` (= `num_tokens`). A `PROCESSING_PROMPT`
  slot with `prompt.n_tokens() < task->n_tokens()` resumes next step (Phase 2 re-enters it).
  Transition to `DONE_PROMPT` at line 3252 when the prompt is exhausted; to `GENERATING` at
  line 3502. **This is exactly vLLM's "leave the request in `running`, advance
  `num_computed_tokens` next step" - already implemented.**
- **[C] single shared batch + compute chunking** - one `llama_batch` holds decode+prefill;
  the compute loop (lines ~3366-3378) `for (i=0; i<batch.n_tokens; i+=n_tokens){ n_tokens =
  min(n_batch, batch.n_tokens-i); llama_decode(batch_view); }` runs it as one `llama_decode`
  when `batch.n_tokens <= n_batch`; `n_ubatch` (512) splitting happens inside `llama_decode`.
- **[D] patch 0013 static prefill budget** - the thing to supersede. Read once at lines
  2737-2747 (`n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))`, a CONSTANT for
  the run); enforced as an extra `while` predicate at line 3188 (`n_prompt_budgeted <
  n_prefill_budget`), counter at 3214, outer break at 3326. `0` = disabled = byte-identical
  stock.
- **[E] productization seam** - `backend/cpp/llama-cpp/grpc-server.cpp` lines 781-791 parse
  the model option `max_prefill_tokens` / `mpt` / `prefill_budget` and `setenv`
  `LLAMA_PREFILL_BUDGET` before context init (same pattern as `kv_paged`). New knobs hang off
  this seam identically.
- **[F] paged KV (patches 0001-0011)** - on-demand block allocation keyed by sequence
  position. Batch formation only changes **which** tokens are in a step; paged alloc is
  driven by the per-slot sequence positions, which are unchanged. Orthogonal (see Correctness).

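
The compute chunking in item [C] can be illustrated on its own. This is a minimal sketch, not the server code: `chunk_sizes` is a hypothetical helper that returns the size of each view the loop would hand to `llama_decode`, and it assumes `n_batch > 0`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Split a step's batch of n_tokens_total tokens into views of at most n_batch tokens,
// in the same order the [C] loop walks them. Assumes n_batch > 0.
inline std::vector<int> chunk_sizes(int n_tokens_total, int n_batch) {
    std::vector<int> sizes;
    for (int i = 0; i < n_tokens_total; ) {
        const int n = std::min(n_batch, n_tokens_total - i);
        sizes.push_back(n);  // one llama_decode call per view in the real loop
        i += n;
    }
    return sizes;
}
```

A batch that fits in `n_batch` yields a single view, which is the one-`llama_decode`-per-step case the item describes.
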
## vLLM v1 reference algorithm (the target, for fidelity)

From `vllm/v1/core/sched/scheduler.py::schedule()` (0.23.0, on the box). The unifying idea:
there is no prefill phase vs decode phase. Every request advances `num_computed_tokens`
toward `num_tokens` by up to N this step; for a decoder N=1, for a prefiller N=remaining
prompt. One per-step `token_budget = max_num_batched_tokens` bounds the TOTAL (decode +
prefill). Pass 1 visits `running` first (decoders cost 1 each -> all decode claimed before
any prefill is sized); Pass 2 admits `waiting` (new prompts) only with leftover budget, each
chunked by `min(remaining_prompt, long_prefill_token_threshold, leftover_budget)`. Caps:
`max_num_seqs` (concurrent sequences), `long_prefill_token_threshold` (~4% of max_model_len,
per-request prompt-chunk cap so one giant prompt cannot monopolize a step). Net: decode batch
maximal every step (-> the GEMM-batching throughput vLLM gets), prefill always makes bounded
progress (-> low, flat TTFT), one `model.forward()` per step.

The mapping to llama is clean because [A]+[B] already give us "running visited first" and
"prefiller resumes next step." We are missing only: **one total budget `T`, leftover `T - D`
sizing, and the per-request chunk cap with fair distribution.**

## The unified per-step batch-formation algorithm (the design)

New knobs (all default to current behaviour; env set before context init like `LLAMA_KV_PAGED`):

- `T` = `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step token
  budget (decode + prefill), the analogue of `max_num_batched_tokens`. Default `n_batch`
  (2048). Clamped `T = min(T, n_batch)` so the existing single-`llama_decode` chunking is
  unchanged.
- `PREFILL_CAP` = `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot max prompt tokens per
  step, the `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04 * n_ctx))`,
  floored at `n_ubatch` (512) so a single prompt still makes a full ubatch of progress.
- Back-compat: if only the legacy `LLAMA_PREFILL_BUDGET` is set (new knobs unset), behave
  exactly as 0013 (static cap) - 0013 is the degenerate `T = n_batch`, no-leftover case.

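
The knob defaults above reduce to a few lines of integer arithmetic. The sketch below is an illustration under stated assumptions, not the patch: `sched_knobs` and `resolve_knobs` are hypothetical names, a value `<= 0` is assumed to mean "unset", and an explicitly set `prefill_cap` is assumed to be used as given.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the two resolved knobs.
struct sched_knobs {
    int T;            // total per-step token budget (decode + prefill)
    int prefill_cap;  // per-slot prompt-token cap per step
};

// max_batch_tokens <= 0: unset, fall back to n_batch.
// prefill_cap      <= 0: unset, derive min(T, ceil(0.04 * n_ctx)) floored at n_ubatch.
inline sched_knobs resolve_knobs(int max_batch_tokens, int prefill_cap,
                                 int n_batch, int n_ubatch, int n_ctx) {
    const int T = std::min(max_batch_tokens > 0 ? max_batch_tokens : n_batch, n_batch);
    int cap = prefill_cap;
    if (cap <= 0) {
        const int four_pct = (4 * n_ctx + 99) / 100;  // ceil(0.04 * n_ctx) in integers
        cap = std::max(n_ubatch, std::min(T, four_pct));
    }
    return { T, cap };
}
```

With `n_batch = 2048`, `n_ubatch = 512` and `n_ctx = 32768`, both knobs unset resolve to `T = 2048` and a per-slot cap of 1311; at `n_ctx = 8192` the 4% term (328) falls below the ubatch floor, so the cap is 512.
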
Pseudocode, mapping to real variables and seams (the `>>` lines are the change vs today):

```
common_batch_clear(batch);                                  // line 2594

// PASS 1 - DECODE FIRST (unchanged: lines 2604-2719)
for (slot : slots) if (slot.state == GENERATING && can_batch_with) generating.push(slot);
... speculative draft ...
for (slot : generating) slot.update_batch(batch);           // appends decode (+draft) tokens

>> D = batch.n_tokens;                      // NEW seam: decode load is now final (after 2719)
>> T = min(LLAMA_MAX_BATCH_TOKENS ? : n_batch, n_batch);
>> prefill_budget_step  = max(0, T - D);    // DYNAMIC leftover, auto-shrinks with D
>> prefill_cap_per_slot = PREFILL_CAP;      // long_prefill_token_threshold analogue
>> n_prompt_budgeted    = 0;                // total prompt tokens added this step (subsumes 0013)

// PASS 2 - PREFILL FILLS THE LEFTOVER (lines 2753-3330, budget made dynamic + per-slot fair)
if (cont_batching || batch.n_tokens == 0) {
>>  for (k = 0; k < n_slots; ++k) {         // round-robin start offset (fairness, see P2)
>>      slot = slots[(rr_start + k) % n_slots];
        if (!slot.is_processing() || !can_batch_with) continue;
        if (slot.state == STARTED) slot.state = PROCESSING_PROMPT;   // line 2782 (unchanged)
>>      slot_prompt_added = 0;              // NEW: per-slot chunk counter (reset each slot)
        // inner prompt-fill (lines 3187-3239), guard now triple-bounded:
        while (slot.prompt.n_tokens() < slot.task->n_tokens()
>>             && batch.n_tokens < T                              // was: < n_batch
>>             && n_prompt_budgeted < prefill_budget_step         // was: 0013 static n_prefill_budget
>>             && slot_prompt_added < prefill_cap_per_slot) {     // NEW: per-slot cap -> fair distribution
            common_batch_add(batch, cur_tok, pos_next, {slot.id}, need_embd);
            slot.prompt.tokens.push_back(cur_tok);
            slot.n_prompt_tokens_processed++;
            n_prompt_budgeted++; slot_prompt_added++;
            ... checkpoint-boundary breaks (unchanged) ...
        }
        if (slot.prompt.n_tokens() == slot.task->n_tokens()) slot.state = DONE_PROMPT;   // line 3252
        ... checkpoint creation (unchanged) ...
>>      if (batch.n_tokens >= T) break;                           // was: >= n_batch (line 3320)
>>      if (n_prompt_budgeted >= prefill_budget_step) break;      // was: 0013 break (line 3326)
    }
}

for (i=0; i<batch.n_tokens; i+=n) { n=min(n_batch,batch.n_tokens-i); llama_decode(view); }   // unchanged
```

The whole change is: (a) compute `prefill_budget_step = T - D` at the new seam after line
2719 instead of reading a static env constant at 2737; (b) bound the inner/outer loops by `T`
and the dynamic budget instead of `n_batch` and the static budget; (c) add `slot_prompt_added`
with `prefill_cap_per_slot` for per-slot fairness; (d) a round-robin start offset so the same
early slots do not always win the leftover.

**Why this holds the decode ceiling without tuning.** `T` bounds total tokens per step ->
bounds step compute time -> decode steps fire at a steady high rate (high decode-steps/sec).
As decode load `D` rises, `prefill_budget_step = T - D` auto-shrinks, so prefill never inflates
the step beyond `T` even at npl128. This is the mechanism by which 0013's hand-tuned 256
reaches 161; here it is reached **automatically across the npl range** because the budget is
`T - D`, not a constant.

**Why this closes TTFT.** The leftover is never negative (`prefill_budget_step >= 0`) and is
strictly positive whenever `D < T`; `T` is sized so that holds until the box is fully
decode-saturated. The leftover is distributed across waiting prompts by `prefill_cap_per_slot`,
so every prompt makes bounded progress every step instead of waiting for a dedicated prefill burst.

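
The budget arithmetic above can be exercised outside the server. The following is a self-contained sketch of one step's prefill planning, not the patch itself: `plan_prefill` is a hypothetical function, each slot is reduced to its count of remaining prompt tokens, and checkpoint breaks and `can_batch_with` are deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// remaining[i] = prompt tokens slot i still has to ingest (0 = not prefilling).
// Returns the number of prompt tokens granted to each slot this step.
inline std::vector<int> plan_prefill(const std::vector<int> & remaining,
                                     int T, int D, int cap_per_slot,
                                     std::size_t rr_start) {
    std::vector<int> grant(remaining.size(), 0);
    int budget = std::max(0, T - D);  // dynamic leftover after decode claims D tokens
    const std::size_t n = remaining.size();
    for (std::size_t k = 0; k < n && budget > 0; ++k) {
        const std::size_t i = (rr_start + k) % n;  // round-robin start offset
        const int take = std::min({remaining[i], cap_per_slot, budget});
        grant[i] = take;   // bounded by the prompt, the per-slot cap and the leftover
        budget  -= take;
    }
    return grant;
}
```

With `T = 2048` and `D = 1248` the leftover is 800 tokens; starting the walk at slot 2 with a per-slot cap of 512, the slots `{1000, 0, 700, 50}` are granted `{238, 0, 512, 50}`. When `D >= T` the leftover is 0 and no prompt tokens are added, so the step never exceeds `T`.
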
## Slot state machine changes (minimal - this is the headline de-risk)

**No new states. No state-transition rewrite.** The existing 6-state machine
(`IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT / GENERATING`, lines 67-72)
already encodes everything:

- "mid-prefill while others decode" = a `PROCESSING_PROMPT` slot coexisting with `GENERATING`
  slots in the same step. **Already happens** (Phase 1 and Phase 2 populate one batch).
- "chunked-prefill state per slot" = `(state == PROCESSING_PROMPT) && (prompt.n_tokens() <
  task->n_tokens())`. **Already persisted** across `update_slots()` calls; Phase 2 re-enters
  the slot and resumes from `prompt.n_tokens()`.

The only **additions** are per-step scheduler scratch, not slot lifecycle state:

1. `slot_prompt_added` - a per-slot, per-step counter (local to the Phase-2 loop body), for
   the per-slot chunk cap. Not stored on the slot across steps.
2. A `rr_start` round-robin offset (one `size_t` on the server, advanced each step) so the
   leftover budget is distributed fairly across `PROCESSING_PROMPT` slots rather than always
   draining the lowest-index slot first (this is what kills the burst-TTFT *spread* - without
   it, slot 0's prompt finishes first every time and the last slots starve).
3. Optional, P2: a per-step admission cap `K` on how many `STARTED -> PROCESSING_PROMPT`
   transitions begin in one step. This falls out of the budget arithmetic already (a bounded
   `prefill_budget_step` with a per-slot floor admits only `~budget/floor` prompts/step), so it
   may need no explicit code; if made explicit it is the `max_num_seqs`-style "don't admit a
   new prefill if the step is full" guard, mapped onto the pre-allocated `n_parallel` slots.

That is the entire state-machine footprint: two pieces of per-step scratch and an optional cap.
The mission's feared "slot-state rewrite" does not materialize.

## How it supersedes / subsumes patch 0013

| property | 0013 (static cap) | this scheduler (dynamic `T - D`) |
|----------|-------------------|----------------------------------|
| per-step prefill bound | constant `n_prefill_budget` | `T - D`, shrinks as decode load rises |
| decode-load aware | no (ignores `D`) | yes (leftover after decode) |
| works across npl 8..128 with one config | no (256 best @128, net-negative @8) | yes (self-tuning) |
| fair across multiple waiting prompts | no (greedy, slot 0 wins) | yes (`prefill_cap_per_slot` + round-robin) |
| TTFT on bursty arrival | raises it (defers first tokens) | bounded for all prompts |
| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - **kept** |

0013 is the **degenerate case** `T = n_batch` with `prefill_budget_step` pinned to a constant
and no per-slot cap. The patch keeps `LLAMA_PREFILL_BUDGET` working for back-compat (when the
new knobs are unset). When `LLAMA_MAX_BATCH_TOKENS` is set, the static path is replaced by the
dynamic one. **Default (all knobs unset) = byte-identical stock**, exactly like 0013.

## Correctness

- **KV cache during chunked prefill** - unchanged from today. A `PROCESSING_PROMPT` slot already
  advances `slot.prompt.tokens` / `pos_next()` chunk by chunk across steps; we only change the
  chunk SIZE per step, not how positions or sequence ids are assigned. `common_batch_add`
  receives the same `(tok, pos, {slot.id})` tuples in the same order. No new KV state.
- **Determinism** - greedy (temp 0) output can differ from a single-`n_batch`-chunk run only by
  the **intrinsic flash-attn chunk-size FP grouping** that 0013 already documented and bounded:
  pure stock `-b256` diverges from `-b2048` the same way with this patch inactive; output stays
  coherent and answers correctly. The op-level math per token is position-determined and
  unchanged; only the FA reduction grouping over a step's token mix shifts. The deterministic
  oracle is the CPU backend / the op test (bit-exact); the GB10 CUDA greedy-decode band applies
  to end-to-end only, never to the op test.
- **Paged KV (patches 0001-0011)** - **orthogonal**. Paged on-demand block allocation is keyed
  by sequence position and slot/stream, which this change does not touch; it changes only which
  tokens are in a given `llama_decode`. The in-kernel paged decode read (0009-0011) operates
  per-token via the block tables regardless of what prefill tokens are co-batched. Required gate:
  run the full P0-P2 suite with `LLAMA_KV_PAGED=1` **and** `=0` and confirm **identical
  scheduling decisions** (same per-step token counts, same admission order) - paged must be a
  no-op on the scheduler.
- **`can_batch_with` constraint** (line 302) - a batch admits only slots with the same
  `task->type` and equal LoRA. Homogeneous-completion serving (the benchmark and the dominant
  LocalAI case) satisfies it, so the mixed decode+prefill batch forms freely. Mixed task types /
  per-request LoRA fall back to separate batches - a pre-existing bound, not a regression; note
  it, do not try to lift it here.
- **Checkpoint interaction (a real, orthogonal serving defect to account for)** - each slot that
  reaches `DONE_PROMPT` may call `create_checkpoint` (line 2147), ~149 MiB per checkpoint on the
  dense 27B, gated by `n_ctx_checkpoints > 0` (line 3133). Profiling found that under sustained
  heavy load the checkpoint subsystem **thrashes**: admission collapsed to one slot every ~13 s,
  zero decoding for 290 s, while `/slots` itself serialized behind a 13 s `update_slots` step.
  This is **independent** of the decode/prefill mix but it **masks** the scheduler's win if left
  on. **P0 must isolate it** (run with `n_ctx_checkpoints=0`), and **P2's admission decision
  should be checkpoint-cost-aware** on the 128 GB unified box (do not admit a fresh prefill whose
  checkpoint would thrash the pool). Treat as a named co-defect, not part of the core batching
  change.

## Phased plan P0 -> P3 (work, payoff, files, risk)
|
||||
|
||||
| Phase | Work | Files | Expected payoff (dense / MoE @npl128 unless noted) | Risk |
|-------|------|-------|-----------------------------------------------------|------|
| **P0** baseline + metrics harness | Per-step effective-decode-batch poller (`/slots`), TTFT percentiles (p50/p90/p99/max), `decode_agg` over the fully-overlapped window, decode-ITL (worst freeze / median), **step-time histogram**, admission rate (slots/s reaching GENERATING), checkpoint-event log. Lock the staggered-arrival ceiling (**157.4** dense, all-128 clean) and the all-at-once burst pathology as the two reference traces. Isolate checkpoints (`n_ctx_checkpoints=0`). | dev-tree only: `~/bench/` (reuse `stag.py`, `slot_poll.py`, `h2h_cli.py`, `h2h_moe_sweep.sh`; `stag_128.json`, `h2h_real128b.json`) | **None** (gate). Locks correctness + the 157/333 ceiling so any regression is caught. | Low |
| **P1** unified mixed-batch formation | Replace the static budget read (2737-2747) with the **dynamic `T - D`** computed at the new seam after line 2719; bound the inner/outer Phase-2 loops by `T` (3188, 3320) and `prefill_budget_step` (3326) instead of `n_batch` and the static cap. No per-slot cap, no round-robin yet (that is P2). | `tools/server/server-context.cpp` (seam @2719, knob read, 3188, 3320, 3326); mirror to `0016-paged-continuous-batch-scheduler.patch` | **TTFT**: removes the burst penalty 0013 inflicts - staggered TTFT ~2 s, burst TTFT collapses toward vLLM's ~25 s / 8 s. **Decode**: holds the ceiling **(~161 / ~333)** *without per-workload tuning* (0013 needed 256 hand-picked). No new throughput beyond the ceiling - by design. | Low-Med (loop-bound edits in a hot path; default-off gate makes it byte-identical stock) |
| **P2** scheduling policy / fairness | Add `slot_prompt_added` + `prefill_cap_per_slot` (the `long_prefill_token_threshold` analogue) and the **round-robin start offset**; optional explicit per-step admission cap `K` + checkpoint-cost-aware admission. Tune `T`, `PREFILL_CAP` on GB10 (dense vs MoE, npl 8/32/64/128). | `server-context.cpp` (Phase-2 loop body @2753-3330, server-level `rr_start`); `grpc-server.cpp` (options `max_batch_tokens`/`mbt`, `prefill_cap` @781-791) | **TTFT spread**: bounds first-token latency for **all** concurrently-arriving prompts (kills the burst-TTFT spread, e.g. dense max 305 s -> single-digit-s on staggered, bounded on burst). **Robustness**: no admission collapse under sustained load; decode batch stays maximal so the *time-averaged* decode_agg on real (non-burst) traffic rises toward the staggered 157/333 because slots reach GENERATING fast. | Med (fairness + admission logic; e2e coherence + A/B vs 0013 required) |
| **P3** residual decode throughput | **Honest boundary: this is the decode-KERNEL lever, NOT the scheduler.** The scheduler has delivered TTFT + robustness + ceiling-hold. Closing the residual 2.4x (161 -> 391 dense, 333 -> 811 MoE) requires paged-attention **decode-kernel** batch-scaling (patches 0009-0011 territory) and/or **CUDA-graphed decode** (the now-uniform decode-only step is graph-capturable). Scope/track separately. | (separate scope) `ggml/src/ggml-cuda/` decode-read kernels; optional CUDA-graph capture seam in `update_slots` | This is **where 391/811 would come from**; it is **out of this scope's mandate** and must not be charged against the scheduler. The scheduler makes the decode step uniform (a precondition that *helps* a future graph capture). | High (kernel work; the GB10 occupancy wall, see below) |
|
||||
|
||||
**Per-phase payoff vs the mission targets (TTFT 25 s / 8 s, decode 391 / 811 @npl128):**
|
||||
|
||||
- **TTFT 25 s / 8 s** - reached by **P1 + P2** (the 12x gap is the scheduler's to close; on
|
||||
staggered arrival it goes below the vLLM burst figure to ~2 s).
|
||||
- **Decode 391 / 811** - **NOT a P1/P2 deliverable.** P1/P2 hold **161 / 333** (= ~41% of vLLM,
|
||||
the kernel ceiling) robustly and tuning-free. The remaining ~2.4x is **P3 kernel**, a separate
|
||||
lever. Pre-registering this split is the point: the scheduler is judged on TTFT + holding the
|
||||
ceiling, the kernel on the throughput residual.
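
The P2 policy in the phase table (a per-slot prefill cap plus a rotating start offset) could look roughly like the sketch below. It is a Python model of the policy only, not the `update_slots()` code: the function name `plan_prefill`, the list-of-tuples `pending` input, and the exact rotation rule are assumptions made for illustration; only the names `prefill_cap_per_slot` and `rr_start` come from the plan.

```python
def plan_prefill(pending, budget, prefill_cap_per_slot, rr_start):
    """Split one step's prefill budget across slots that still have prompt tokens.

    pending: list of (slot_id, remaining_prompt_tokens) in slot order (assumed shape).
    budget: prefill tokens available this step (the T - D leftover).
    Returns (grants, next_rr_start); grants maps slot_id -> tokens granted this step.
    """
    grants = {}
    n = len(pending)
    if n == 0 or budget <= 0:
        return grants, rr_start
    left = budget
    for i in range(n):
        slot_id, remaining = pending[(rr_start + i) % n]
        # No slot may take more than the per-slot cap or what is left of the budget.
        take = min(remaining, prefill_cap_per_slot, left)
        if take > 0:
            grants[slot_id] = take
            left -= take
        if left == 0:
            break
    # Rotate the starting slot so no slot is always served first.
    return grants, (rr_start + 1) % n
```

With three pending prompts, a budget of 600 and a cap of 256, the first step grants 256/256/88 starting at slot 0; the next step starts at slot 1, so the short grant moves to a different slot each step.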
|
||||
|
||||
## GB10 considerations
|
||||
|
||||
- **Bandwidth floor ~273 GB/s** is the *cause* of the decode ceiling (NVFP4 weight-read +
|
||||
paged-KV gather per step). The scheduler cannot lift a bandwidth/kernel floor - it can only
|
||||
keep the batch *at* the ceiling. Size `T ~= n_batch` (2048) so the compute step stays a single
|
||||
`llama_decode`; `n_ubatch` (512) governs the internal split.
|
||||
- **`T` is the ITL/TTFT trade knob** (vLLM's `max_num_batched_tokens`): larger `T` = more
|
||||
prefill/step = faster TTFT but bigger per-step ITL spike; smaller `T` = smoother ITL, slower
|
||||
TTFT. Because the budget is `T - D`, the spike is bounded at `T` regardless of decode load.
|
||||
Default `T = n_batch`; expect to tune down toward ~1024 for ITL-sensitive serving.
|
||||
- **Checkpoint ~149 MiB/slot thrash** on the 128 GB unified box - admission must be
|
||||
checkpoint-cost-aware (P2); P0 measures with checkpoints off to isolate the batching win.
|
||||
- **Memory**: paged on-demand KV (dense 52->94 GB, MoE 39->61 GB across npl) vs vLLM's flat
|
||||
~112 GB pre-reservation - llama's standing multi-tenant advantage, unaffected by this change.
|
||||
- **Eager mode** both engines today; **CUDA-graphed decode** is the P3 kernel lever, and the
|
||||
scheduler's uniform decode-only step is a precondition that *helps* a future capture.
|
||||
|
||||
## Biggest risks and how to de-risk
|
||||
|
||||
1. **"Slot-state rewrite" (the feared big risk) = actually LOW.** The
mid-prefill-while-others-decode state and the chunked-prefill resume already exist ([B]); we add only per-step scratch
|
||||
(`slot_prompt_added`, `rr_start`), not lifecycle states. **De-risk**: keep all 6 states
|
||||
untouched; gate every change behind the new knobs; default-off = byte-identical 0013/stock,
|
||||
verified by an A/B diff of per-step token counts.
|
||||
2. **Correctness regression in the mixed batch = the FA chunk-grouping nondeterminism.** Already
|
||||
documented and bounded by 0013 (stock `-b256` vs `-b2048` diverge identically). **De-risk**:
|
||||
op-test bit-exact where deterministic; greedy-coherence e2e on both models; A/B vs 0013 with
|
||||
the new knobs set to reproduce 0013 (`T = n_batch`, no leftover) and confirm **byte-identical**
|
||||
to 0013.
|
||||
3. **Paged-KV interaction = LOW (orthogonal positions).** **De-risk**: run the whole P0-P2 suite
|
||||
with `LLAMA_KV_PAGED=1` and `=0`; assert identical scheduling decisions (paged must be a
|
||||
no-op on batch formation). This is a hard gate, not a spot check.
|
||||
4. **Checkpoint thrash masks the win = MEDIUM.** A real serving defect that can swamp the
|
||||
scheduler's signal. **De-risk**: P0 isolates it (`n_ctx_checkpoints=0`); P2 makes admission
|
||||
checkpoint-cost-aware; report the scheduler metrics both with and without checkpoints so the
|
||||
batching win is legible independent of the checkpoint co-defect.
|
||||
5. **Honest-payoff risk = the decode_agg number barely moves over 0013 (kernel ceiling), so the
|
||||
work can be mis-judged as "no win."** This is the most important risk to manage. **De-risk**:
|
||||
frame and measure on **TTFT percentiles, burst-TTFT spread, step-time histogram, admission
|
||||
rate, and tuning-free ceiling-hold across npl/dense/MoE** - the axes the scheduler actually
|
||||
moves - and **pre-register the decode-kernel as the separate residual-closer** (P3) so the
|
||||
scheduler is never charged with the 391/811 number the kernel forbids.
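
The default-off A/B (risk 1) and the paged-vs-unpaged hard gate (risk 3) both reduce to comparing per-step scheduling traces from two runs. A minimal sketch, assuming each run has been logged as a list of per-step records; the record shape `(decode_tokens, prefill_tokens, admitted_slot_ids)` and the function name are hypothetical, not an existing harness format:

```python
def first_divergence(trace_a, trace_b):
    """Return the index of the first step whose scheduling record differs.

    Returns None when both traces are identical, which is what the gate requires.
    """
    for step, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return step
    if len(trace_a) != len(trace_b):
        # One run scheduled extra steps: report the first unmatched step.
        return min(len(trace_a), len(trace_b))
    return None
```

Reporting the first diverging step, rather than a plain pass/fail, makes it easier to tell an admission-order change from a token-count change.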
|
||||
|
||||
## Commit / hygiene
|
||||
|
||||
Scope doc only (this file). **No engine change committed in this workflow.** Bench and parity
|
||||
scripts stay dev-tree-only (`~/bench/`, `~/llama-paged-dev/benches/`). When P1/P2 are
|
||||
implemented they mirror to
`backend/cpp/llama-cpp/patches/paged/0016-paged-continuous-batch-scheduler.patch` (next free slot
after 0015) and the LocalAI option lands in `grpc-server.cpp`
|
||||
beside `max_prefill_tokens`. Commit with `git commit -s`, trailer
|
||||
`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes. Do not push
|
||||
(human pushes).
|
||||
|
||||
---
|
||||
|
||||
## Review / risk (adversarial, source-verified)
|
||||
|
||||
Skeptical staff review against the actual source at HEAD `151343b` (server-context.cpp,
|
||||
llama-batch.cpp, llama-kv-cache.cpp, paged-*.cpp), grpc-server.cpp in this worktree, and the
|
||||
committed `QWEN36_NVFP4_BENCH.md` plus the vLLM H2H serve logs/scripts on the box.
|
||||
|
||||
### Verdict: the scope is SOUND. GO on P0 -> P1, CONDITIONAL P2, separate-track P3.
|
||||
|
||||
The central de-risking claims check out against the code, and the load-bearing honesty (decode
|
||||
residual is a kernel ceiling, not a scheduler defect) is correct and now further corroborated.
|
||||
Two calibration fixes are required before P1 (below), neither changes the go decision.
|
||||
|
||||
### (1) Tractability - CONFIRMED bounded; zero libllama changes. What enables/blocks it, concretely:
|
||||
|
||||
- **Enables (already-exercised path, not new surface).** A mixed prefill+decode ubatch with
|
||||
per-seq different `n_past` is the *existing* behaviour. `llama_batch` carries per-token `pos`
|
||||
and `seq_id` (`common_batch_add(batch, tok, pos_next(), {slot.id}, ...)`); `llama_kv_cache` +
|
||||
`paged_alloc::place()` place each `(seq, pos)` independently; `llama_kv_cache::init_batch`
|
||||
(line 742) already splits the mixed batch into ubatches. **The server emits exactly this mixed
|
||||
decode+prefill batch today** - patch 0013 ships it and produces coherent output - so the new
|
||||
scheduler changes only the *count* of prefill tokens, never the batch *structure*. There is no
|
||||
`llama_decode`/ubatch/KV rewrite in scope.
|
||||
- **Blocks: nothing in libllama.** The only constraints are pre-existing and orthogonal to the
|
||||
target workload: (i) `can_batch_with` (same task type + equal LoRA per batch); (ii)
|
||||
`split_equal(sequential=true)` errors on *coupled* sequences (shared-prompt parallel sampling),
|
||||
forcing `-kvu`. Neither is introduced by this change.
|
||||
- **Correction to fold in:** the scope's [C] and the pseudocode imply contiguous `split_simple`
|
||||
chunking. The real serving/benchmark config (`--parallel 128`, `kv_unified` default = `false`
|
||||
-> `n_stream = n_seq_max = 128`) takes the **`split_equal(n_ubatch, sequential=true)`** path
|
||||
(llama-kv-cache.cpp:742), which balances per-sequence rather than slicing contiguously. This
|
||||
does not break anything (0013 already hits it) but it means the actual scheduled object is a
|
||||
split_equal ubatch set; P0 must characterize that ubatch shape (not assume contiguous 512-chunks)
|
||||
and the determinism band is over split_equal groupings. Lock the split path (unified vs not) in
|
||||
the A/B so the byte-identical-to-0013 gate is meaningful. grpc seam [E] verified at
|
||||
grpc-server.cpp:761-786 (`kv_paged`, `max_prefill_tokens`/`mpt`); new `mbt`/`prefill_cap` knobs
|
||||
hang off it identically.
|
||||
|
||||
### (2) Does it close the gap - the 2.4x is NOT CUDA graphs, and the TTFT root is quantified.
|
||||
|
||||
- **CUDA graphs ruled out (verified).** Both NVFP4 H2H vLLM servers ran `--enforce-eager`
|
||||
(`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`; engine logs show `enforce_eager=True`,
|
||||
`cudagraph_mode=NONE`, `CompilationMode.NONE`). So the npl128 2.4x decode gap is a genuine
|
||||
**eager-mode kernel + per-step host-overhead** gap (ggml graph rebuild/realloc + ~1k kernel
|
||||
launches per step on the weak Grace cores, paged-KV gather, MoE expert gather). The scheduler
|
||||
cannot touch it; the staggered all-128-decoding 157.4 tok/s ceiling is solid. Scope is right to
|
||||
refuse the 391/811 number. (CUDA graphs are a future *both-sides* lever, not the current cause.)
|
||||
- **The TTFT gap has a measured root the scope under-uses: prefill_tps collapse.** From the bench,
|
||||
llama `prefill_tps` falls 1117 -> 752 -> 465 -> **125** (dense, npl 8/32/64/128) while vLLM holds
|
||||
**flat ~1420** (MoE: 2813 -> 657 vs vLLM flat ~4263). That collapse - not a separate "scheduling
|
||||
quality" abstraction - is the direct cause of the 491 s / 85 s TTFT, and it is exactly what the
|
||||
dynamic `T - D` budget attacks: when decode load `D` is low (early in a burst) the leftover
|
||||
`T - D` lets prefill take ~`n_batch` per step, and because llama's *larger per-step chunk*
|
||||
compensates for its ~2.4x slower steps, a `T = 2048` budget can sustain prefill_tps at or above
|
||||
vLLM's ~1420 during the drain. **So burst-TTFT parity is mechanically plausible, not just
|
||||
"toward"** - the static budget-256 throttles prefill to 256/step (hence its weak 305 s) where the
|
||||
dynamic budget would not. This strengthens P1's case beyond what the doc claims.
|
||||
- **Mandatory calibration fix:** that TTFT win **couples to a decode-ITL knob**. Spending the full
|
||||
`T - D` on prefill during the drain makes those steps full `T`-token (mixed) computes, so
|
||||
co-batched decoders get 1 token per slow step (ITL spike) *during the drain* - precisely vLLM's
|
||||
tradeoff, navigated by `T`. The 157/333 ceiling is the **post-drain steady state**, not the
|
||||
drain phase. Therefore the scope must **co-report drain-phase decode-ITL alongside TTFT** and
|
||||
treat `T` as the published trade knob; reporting TTFT alone would hide the cost and reporting
|
||||
decode_agg alone would hide the win (it is averaged across drain + steady state, which is why it
|
||||
"barely moves"). Soften "P1+P2 reach 25 s / 8 s": the defensible claim is *staggered/realistic
|
||||
arrival ~2 s, and all-at-once burst approaching vLLM with a tunable decode-ITL cost*.
|
||||
|
||||
### (3) Correctness - paged orthogonality confirmed at source; the real risks are config, not code.
|
||||
|
||||
- **Paged-KV is the same `llama_kv_cache` class** with `paged_alloc::` hooks inside the existing
|
||||
find_slot/placement (llama-kv-cache.cpp:1043-1083), driven by per-slot `(seq, pos)` - which this
|
||||
change does not touch. `init_batch`/split is paged-agnostic. The scope's "orthogonal" claim is
|
||||
verified, not asserted. Keep the hard `LLAMA_KV_PAGED=1` vs `=0` identical-decisions gate.
|
||||
- **Determinism**: the FA grouping nondeterminism is over **split_equal** ubatches in the real
|
||||
config; the `T = n_batch` A/B-must-be-byte-identical-to-0013 gate is the right oracle and is
|
||||
sound (default-off path is untouched).
|
||||
- **Low-concurrency regression**: gated to byte-identical when knobs unset; the only live vector is
|
||||
a **mis-tuned `T`** spiking ITL at low npl (the scope already flags `T` defaults). Config hygiene,
|
||||
not a code risk. Add a guard/floor so `T` cannot be set below `n_ubatch`.
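
The guard suggested in the last bullet could be as small as a clamp at option-parse time. A sketch, assuming the knob arrives as an integer where a non-positive value means "unset"; the function name and that convention are illustrative, and the upper clamp to `n_batch` is an added assumption based on the `T ~= n_batch` sizing guidance rather than something the review asks for:

```python
def resolve_max_batch_tokens(requested, n_batch, n_ubatch):
    """Clamp the per-step token budget T into [n_ubatch, n_batch].

    Returns None when the knob is unset, so the caller stays on the
    default-off (byte-identical) path.
    """
    if requested <= 0:
        return None
    return max(n_ubatch, min(requested, n_batch))
```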
|
||||
|
||||
### (4) Smaller higher-ROI step - yes, and the scope already contains it (P1).
|
||||
|
||||
The minimal high-ROI change is **P1 alone**: replace the static read (server-context.cpp:2737-2747)
|
||||
with `prefill_budget_step = max(floor, T - batch.n_tokens)` computed after the decode-fill at line
|
||||
2719, and bound the Phase-2 loops by `T` / that budget (3188, 3320, 3326). That is a handful of
|
||||
line edits at named seams, default-off, and it captures the self-tuning + the bulk of the TTFT win.
|
||||
The even-smaller validation spike: a one-line `n_prefill_budget = max(floor, T - batch.n_tokens)`
|
||||
to confirm the prefill_tps/TTFT mechanism before writing the full P1. **P2** (round-robin +
|
||||
`prefill_cap_per_slot` + checkpoint-aware admission) is genuinely higher-effort and lower-marginal
|
||||
(it buys TTFT *spread*/tail and burst robustness, not the median); **gate P2 on P1's measured
|
||||
burst-TTFT-spread and drain-ITL**, do not commit to it up front. There is no smaller step that also
|
||||
fixes the static budget's npl-dependence - tuning 0013's constant cannot (256 is net-negative at
|
||||
npl8 and costs MoE TTFT), so P1 is the floor.
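
The budget rule in this section is simple enough to model directly. The sketch below is a Python model of the arithmetic only, not the C++ at the named seams; `floor` defaults to 0 here as an assumption (the review names a floor but not its value), and `D` stands for `batch.n_tokens` after the decode fill.

```python
def prefill_budget_step(T, D, floor=0):
    """Prefill tokens allowed this step: what is left of T after D decode tokens."""
    return max(floor, T - D)


def static_budget_step(cap=256):
    """The static cap as described for patch 0013: fixed, independent of decode load."""
    return cap
```

With `T = 2048`, an idle server (`D = 0`) can spend the whole step on prefill, while 128 decoding slots leave 1920; the static cap stays at 256 in both cases. Note that a non-zero `floor` lets a step exceed `T` whenever `D > T - floor`, so the floor value is part of the ITL trade.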
|
||||
|
||||
### Realistic effort / payoff and sequencing
|
||||
|
||||
- **P0** ~0.5-1 wk (harness largely exists in `~/bench/`): add drain-phase decode-ITL to the metric
|
||||
set, lock the split path, isolate checkpoints (`n_ctx_checkpoints=0`). Gate only.
|
||||
- **P1** ~2-4 days: small diff + the A/B-vs-0013 byte-identical gate + the npl/dense/MoE sweep.
|
||||
Payoff: self-tuning hold of 161/333 with no hand-picked constant; burst-TTFT 3-10x better than
|
||||
0013 (plausibly approaching vLLM on the burst, parity on staggered), at a published `T`-tunable
|
||||
decode-ITL cost. **This is the high-ROI core and the clean supersession of 0013.**
|
||||
- **P2** ~1-2 wk, conditional: fairness/admission + checkpoint-cost-awareness + tuning. Payoff: TTFT
|
||||
tail/spread + no admission collapse under sustained load. Worth it only if P1 metrics show a
|
||||
residual spread/robustness problem.
|
||||
- **P3** separate track, high effort: the *only* path to 391/811 is the eager-kernel + per-step
|
||||
host-overhead residual. Highest-value probe is a **CUDA-graph capture of the steady-state
|
||||
pure-decode step** - but note this works *independent of the scheduler* (the all-128-decoding
|
||||
step is already fixed-shape today); the scheduler neither blocks nor specially enables it, so do
|
||||
not credit graphs to the scheduler. The scope's "uniform decode step is a precondition" is a mild
|
||||
over-claim; correct it to "graphs apply to the pure-decode steady state, which the scheduler does
|
||||
not change."
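
For the P0 metric set, TTFT percentiles and decode-ITL can be derived from per-request timestamps. A sketch using the nearest-rank percentile; the function names and input shapes are assumptions, and the existing `~/bench/` scripts may compute these differently:

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_summary(ttfts_s):
    """p50/p90/p99/max of time-to-first-token samples, in the unit they were given."""
    return {
        "p50": percentile(ttfts_s, 50),
        "p90": percentile(ttfts_s, 90),
        "p99": percentile(ttfts_s, 99),
        "max": max(ttfts_s),
    }


def inter_token_latencies(token_times_s):
    """Gaps between consecutive token timestamps for one request."""
    return [b - a for a, b in zip(token_times_s, token_times_s[1:])]
```

Splitting the inter-token gaps into drain-phase and steady-state windows before summarizing them is what makes the `T` trade visible; a single average over both hides it.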
|
||||
|
||||
### Bottom line
|
||||
|
||||
GO. The work is correctly localized to `update_slots()` batch-formation policy, requires no
|
||||
libllama changes (the mixed per-seq batch is the existing, shipping path), and supersedes 0013
|
||||
cleanly. The honest ceiling is real and well-stated; the two fixes are (a) co-report drain-phase
|
||||
decode-ITL with TTFT and stop selling/charging the decode_agg number, and (b) acknowledge the
|
||||
`split_equal`/`n_stream=128` path in the determinism and ubatch-shape analysis. Sequence
|
||||
P0 -> P1, measure, then decide P2; keep P3 (kernel/CUDA-graph) on its own track as the sole owner
|
||||
of the 2.4x throughput residual.
|
||||
@@ -1,106 +0,0 @@
|
||||
# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact)
|
||||
|
||||
The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3).
|
||||
After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at
|
||||
the f32 byte floor), the conv path was the only remaining bit-exact decode lever.
|
||||
|
||||
## What changed
|
||||
|
||||
A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a
|
||||
non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain:
|
||||
|
||||
    qkv_mixed transpose -> ggml_concat (build width-K window)         [concat_cont 8.14 ms/step]
      -> ggml_ssm_conv (depthwise conv)                               [ssm_conv_f32 ~8.6 ms/step]
      -> ggml_silu                                                    [folded into ssm_conv on CUDA]
      -> ggml_cpy of the shifted ring state into the conv cache       [cpy_scalar 5.76 ms/step]
|
||||
|
||||
with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from
|
||||
the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME
|
||||
ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes
|
||||
the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot
|
||||
the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is
|
||||
vLLM's `causal_conv1d_update`.
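
To make the fused op's contract concrete, here is a scalar Python reference for one (channel, sequence) pair. It is a sketch under assumptions, not the CUDA kernel: the state is modeled as a plain list of the K-1 cached taps in oldest-first order, and the function names are illustrative. It shows the three effects described above: window assembly, ascending-tap accumulation with optional silu, and the 1-token-shifted state written back in place.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def conv_update_inplace(state, weights, x_cur, fuse_silu=True):
    """Fused single-token conv update for one (channel, sequence) pair.

    state: K-1 cached taps, oldest first (assumed layout); mutated in place.
    weights: K depthwise kernel taps. x_cur: the current token's value.
    Returns the conv output for this token.
    """
    if len(state) != len(weights) - 1:
        raise ValueError("state must hold K-1 taps")
    window = state + [x_cur]              # width-K window
    acc = 0.0
    for tap in range(len(weights)):       # ascending tap order
        acc += window[tap] * weights[tap]
    state[:] = window[1:]                 # drop the oldest tap, keep the newest K-1
    return silu(acc) if fuse_silu else acc


def conv_chain_baseline(state, weights, x_cur):
    """The unfused chain: concat -> conv -> silu -> copy of the shifted state."""
    window = state + [x_cur]
    out = silu(sum(w * v for w, v in zip(weights, window)))
    return out, window[1:]
```

Because both paths accumulate the taps in the same ascending order, the outputs compare equal with `==` rather than within a tolerance, which is the property the bit-exact gate relies on.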
|
||||
|
||||
Files:
|
||||
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace`
|
||||
(src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs],
|
||||
src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu).
|
||||
- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32<apply_silu,d_conv>` (one thread per
|
||||
(channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of
|
||||
`ggml_cuda_op_ssm_conv`.
|
||||
- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over
|
||||
channels) + branch in `ggml_compute_forward_ssm_conv`.
|
||||
- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs
|
||||
conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write
|
||||
target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed.
|
||||
- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path
|
||||
(`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/
|
||||
rollback keep the existing concat+ssm_conv+silu+cpy chain.
|
||||
- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output
|
||||
vs the CPU reference across backends.
|
||||
|
||||
## Gate: test-backend-ops (CUDA0 vs CPU reference)
|
||||
|
||||
- SSM_CONV: 45/45 OK (unchanged path intact)
|
||||
- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
|
||||
- SSM_CONV_BIAS_SILU: 90/90 OK
|
||||
|
||||
## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on)
|
||||
|
||||
Byte-identical to the clean Lever-1 (0019/0020) baseline, both models:
|
||||
|
||||
| model             | baseline md5                     | fused md5                        | result         |
|-------------------|----------------------------------|----------------------------------|----------------|
| q36-27b-nvfp4     | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL |
| q36-35b-a3b-nvfp4 | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL |
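
The byte-identical check is a digest comparison of the two generations. A minimal sketch, assuming the greedy outputs have already been captured as bytes; the helper name is illustrative, and the capture step (which binary, which output file) is left out because this note does not specify it:

```python
import hashlib


def byte_identical(baseline: bytes, candidate: bytes) -> bool:
    """True when both outputs hash to the same md5 digest."""
    return hashlib.md5(baseline).hexdigest() == hashlib.md5(candidate).hexdigest()
```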
|
||||
|
||||
## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after
|
||||
|
||||
Dense q36-27b-nvfp4:
|
||||
|
||||
| mode       | npl | baseline | fused  | delta |
|------------|-----|----------|--------|-------|
| CUDA-graph | 32  | 199.76   | 202.99 | +1.6% |
| CUDA-graph | 128 | 336.35   | 347.14 | +3.2% |
| eager      | 32  | 196.07   | 197.61 | +0.8% |
| eager      | 128 | 333.62   | 342.97 | +2.8% |
|
||||
|
||||
MoE q36-35b-a3b-nvfp4:
|
||||
|
||||
| mode       | npl | baseline | fused  | delta |
|------------|-----|----------|--------|-------|
| CUDA-graph | 32  | 421.72   | 432.39 | +2.5% |
| CUDA-graph | 128 | 689.74   | 713.54 | +3.5% |
| eager      | 32  | 421.05   | 432.46 | +2.7% |
| eager      | 128 | 689.15   | 713.87 | +3.6% |
|
||||
|
||||
Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at
|
||||
**88.8% of vLLM 391** (up from 86.0%). The lift holds in BOTH graph and eager modes.
|
||||
|
||||
## Step time + nsys kernel delta
|
||||
|
||||
Per-step decode time (dense npl128, T_TG / ntg=128):
|
||||
- baseline 48.711 s / 128 = 380.6 ms/step
|
||||
- fused 47.197 s / 128 = 368.7 ms/step -> **-11.9 ms/step** (in line with the predicted 12-14 ms saving)
|
||||
- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step)
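
The per-step figures follow directly from the batched-bench totals; the arithmetic is reproduced below as a check. The helper names are illustrative; the numbers are the ones quoted in this note.

```python
def ms_per_step(t_tg_seconds, n_steps):
    """Average decode step time in milliseconds from total generation time."""
    return t_tg_seconds / n_steps * 1000.0


def pct_of(value, reference):
    """value as a percentage of reference."""
    return value / reference * 100.0


baseline_ms = round(ms_per_step(48.711, 128), 1)   # 380.6
fused_ms = round(ms_per_step(47.197, 128), 1)      # 368.7
share_of_vllm = round(pct_of(347.14, 391.0), 1)    # 88.8
```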
|
||||
|
||||
nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels:
|
||||
|
||||
| kernel                | baseline calls | fused calls      | per-step (eager) |
|-----------------------|----------------|------------------|------------------|
| concat_cont (decode)  | 1152           | 0 (GONE)         | 7.95 -> 0 ms     |
| cpy_scalar (decode)   | 1152 of 3648   | 0 (GONE)         | 4.29 -> 0 ms     |
| ssm_conv_f32 (decode) | 1152 of 2736   | 0 (prefill-only) | 8.65 -> 0 ms     |
| ssm_conv_update       | 0              | 1152             | 0 -> 7.56 ms     |
|
||||
|
||||
Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont
|
||||
and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel.
|
||||
prefill keeps the original chain (concat_non_cont 1584, ssm_conv_f32 1584 unchanged).
|
||||
|
||||
## Verdict
|
||||
|
||||
Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM
|
||||
391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free;
|
||||
de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,639 +0,0 @@
|
||||
# Critical-Path Gap Analysis - GDN decode region
|
||||
|
||||
## vllm-gdn-compare (READ-ONLY, no GPU) - vLLM decode GDN kernel inventory vs llama
|
||||
|
||||
### Source ground truth
|
||||
- Local checkout `/home/mudler/_git/vllm` and the DGX's benchmarked venv
|
||||
`/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm` are STRUCTURALLY
|
||||
IDENTICAL (same file `qwen_gdn_linear_attn.py`, byte-for-byte same line numbers
|
||||
1287/1344/1457/1644/1684). So the analysis below is faithful to what was actually
|
||||
benchmarked on the GB10. Both are a recent dev build (`__version__ = "dev"`), same
|
||||
era as the "0.23.0" reference; the GDN path is the refactored
|
||||
`vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`.
|
||||
|
||||
### The headline: vLLM runs the entire GDN region at decode as 2 Triton kernels + a gated RMSNorm + 3 GEMMs, ALL fused
|
||||
Per Qwen3.5 gated-DeltaNet (linear-attn) layer, vLLM decode launches:
|
||||
|
||||
| # | Kernel | What is folded in |
|---|--------|-------------------|
| 1 | `in_proj_qkvz` GEMM | (quantized matmul - shared with llama) |
| 2 | `in_proj_ba` GEMM | (quantized matmul - shared with llama) |
| 3 | `_causal_conv1d_update_kernel` (causal_conv1d.py:1193) | conv1d **+ silu activation fused in** (the `activation` arg) |
| 4 | `fused_recurrent_gated_delta_rule_packed_decode_kernel` (fused_recurrent.py:256-336) | **l2norm(q), l2norm(k), scale, softplus gate, A_log decay exp(g), sigmoid(beta), the delta-rule recurrence (b_h*=exp(g); delta update), the output b_o=sum(b_h*b_q), AND the SSM state write-back** - all in one kernel |
| 5 | `RMSNormGated` (gated rms_norm) | **output gate silu/sigmoid * z fused into the rms_norm**; the comment notes the norm+quant is further fusable by the compilation pass (`fuse_norm_quant`) |
| 6 | `out_proj` GEMM | (quantized matmul - shared with llama) |
|
||||
|
||||
So the GDN-region "glue" elementwise op count in vLLM is effectively ZERO separate
|
||||
launches. Everything llama runs as standalone ggml nodes - conv-silu, gate
|
||||
sigmoid, softplus, l2norm, scale, decay mul, residual add, gather - is absorbed
|
||||
into kernels #3, #4, and #5.
|
||||
|
||||
Verified kernel bodies:
|
||||
- `fused_recurrent_gated_delta_rule_packed_decode_kernel` lines 313-336:
|
||||
`b_q/sqrt(sum(b_q^2)+eps)`, `b_k/sqrt(...)` (l2norm), `b_q*scale`,
|
||||
`softplus_x=where(x<=thr, log(1+exp(x)), x)`, `g_val=-exp(A_log)*softplus_x`,
|
||||
`beta_val=sigmoid(b)`, `b_h*=exp(g_val)`, `b_v-=sum(b_h*b_k)`, `b_v*=beta_val`,
|
||||
`b_h+=b_v*b_k`, `b_o=sum(b_h*b_q)`, `tl.store(p_o,...)`, `tl.store(p_ht,...)`.
|
||||
ONE kernel = recurrence + ALL gating + l2norm + state writeback.
|
||||
- The non-packed variant `fused_sigmoid_gating_delta_rule_update_kernel`
|
||||
(fused_sigmoid_gating.py:24-179) is the same fusion (used for the spec-decode /
|
||||
mixed-batch path); both fold gate+l2norm+recurrence+writeback into one launch.
|
||||
- Decode dispatch: `_forward_core` (line 1286-1298) routes pure non-spec decode to
|
||||
`_forward_core_decode_non_spec` (line 1644), which calls exactly
|
||||
`causal_conv1d_update` (#3) then `fused_recurrent_gated_delta_rule_packed_decode`
|
||||
(#4). `_output_projection` (line 851) does `self.norm(core_attn_out, z)` (#5,
|
||||
gated rmsnorm) then `out_proj` (#6).
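
Read as straight-line math, the packed decode kernel's per-head update is short. The sketch below restates the operations listed above in plain Python for a single head and a single token. It is not vLLM's Triton code: the state layout (`h[i][j]` with `i` over the key dimension and `j` over the value dimension), the `eps` and softplus-threshold defaults, and the function name are assumptions made for illustration.

```python
import math


def gdn_decode_step(h, q, k, v, x, b, A_log, scale, eps=1e-6, thr=20.0):
    """One gated delta-rule decode step for one head; h is updated in place.

    h: dk x dv state (list of lists). q, k: length dk. v: length dv.
    x: pre-softplus gate input. b: pre-sigmoid beta input.
    Returns the output vector of length dv.
    """
    qn = math.sqrt(sum(t * t for t in q) + eps)
    kn = math.sqrt(sum(t * t for t in k) + eps)
    q = [t / qn * scale for t in q]           # l2norm(q), then scale
    k = [t / kn for t in k]                   # l2norm(k)
    softplus_x = math.log1p(math.exp(x)) if x <= thr else x
    g = -math.exp(A_log) * softplus_x         # log-decay, never positive
    beta = 1.0 / (1.0 + math.exp(-b))         # sigmoid(b)
    decay = math.exp(g)
    dk, dv = len(k), len(v)
    for i in range(dk):
        for j in range(dv):
            h[i][j] *= decay                  # b_h *= exp(g)
    # Delta rule: subtract what the state already predicts for k, then gate by beta.
    v = [(v[j] - sum(h[i][j] * k[i] for i in range(dk))) * beta for j in range(dv)]
    for i in range(dk):
        for j in range(dv):
            h[i][j] += k[i] * v[j]            # b_h += b_v * b_k
    return [sum(h[i][j] * q[i] for i in range(dk)) for j in range(dv)]  # b_o
```

Everything between the l2norm and the state write happens on values already in registers, which is why the same work costs several separate launches when each line is its own graph node.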
|
||||
|
||||
### vLLM ALSO captures decode in a FULL CUDA graph - the launch bubbles are gone entirely
|
||||
`vllm/v1/attention/backends/gdn_attn.py`:
|
||||
- `_cudagraph_support = AttentionCGSupport.UNIFORM_BATCH` (line 82)
|
||||
- `use_full_cuda_graph = cudagraph_mode.has_full_cudagraphs()` (line 113)
|
||||
- `build_for_cudagraph_capture` (line 509): "only decode is supported for full
|
||||
cudagraphs with Mamba" / "GDN only supports decode-only full CUDAGraph capture".
|
||||
|
||||
So at decode vLLM captures the WHOLE forward (all 48 layers: GDN linear-attn layers
|
||||
+ the 1-in-4 full-attn layers + projections + conv + recurrence + gated rmsnorm)
|
||||
into a single replayed CUDA graph. Per-kernel host launch latency and the
|
||||
data-dependent inter-op gaps are eliminated at replay time. Even the 2 Triton
|
||||
kernels per GDN layer incur no host-side launch bubble during graph replay.
|
||||
|
||||
### Why this is the 62%-vs-40% explanation (not GEMM throughput)
|
||||
- llama runs the GDN region as ~7-9 separate ggml nodes per layer at decode
|
||||
(`ssm_conv`, `gated_delta_net` recurrence, `gdn_gather`, `k_bin_bcast` mul,
|
||||
`silu`, `sigmoid`, `l2_norm`, `op_add`, `concat`), each a host-launched kernel,
|
||||
serially data-dependent (conv -> gate -> recurrence -> gather), with the gating
|
||||
elementwise wedged between recurrence steps. Each launch + the dependency stall
|
||||
is a bubble ON the critical path. x48 layers x ~8 ops = ~384 launch bubbles/step.
|
||||
- vLLM has 2 fused Triton kernels per GDN layer AND wraps them in a CUDA graph, so
|
||||
the GDN-region inter-op bubble count at decode is ~0. The recurrence kernel
|
||||
itself is already near-parity in llama (gated_delta_net 1.47 ms/call vs vLLM).
|
||||
The gap is the surrounding launch/sync overhead, which is exactly the 60% idle
|
||||
measured (llama ~40% busy vs vLLM 62%).
|
||||
- This matches why P2a and Lever 2 were FLAT: they shrink GPU-busy kernels that are
|
||||
already overlapped with the 42% mul_mat_q GEMM. The real wall-clock lever is the
|
||||
SERIAL GDN gating chain's launch bubbles, which vLLM removed by (a) fusion into
|
||||
the recurrence kernel and (b) CUDA-graph capture.
|
||||
|
||||
### What llama would need to match vLLM (two independent wins, either helps)
1. **Op fusion (Lever 3).** Collapse the GDN per-layer gating chain into the
   recurrence kernel: fold conv-silu, q/k l2norm, scale, softplus+A_log gate,
   sigmoid-beta, the exp-decay mul, the residual add, and the SSM-state write-back
   INTO the `gated_delta_net` CUDA kernel (and fuse the output gate silu*z into the
   final rms_norm). Target: from ~8 GDN nodes/layer down to ~2 (conv-fused +
   recurrence-fused), mirroring vLLM's `fused_recurrent_gated_delta_rule_packed_decode`.
   The conv silu fold and the l2norm/scale/gate fold are the high-value pieces -
   they are pure elementwise prologues sitting ON the serial chain between conv and
   recurrence.
2. **CUDA-graph the decode step.** Even without fusion, capturing the decode forward
   in a CUDA graph removes the per-node host launch latency for all ~384 nodes/step.
   (Prior A.2 work flagged ggml-cuda graph capture as the orthogonal lever; the
   measured GDN structure here is exactly why it should move the wall.) vLLM gets
   BOTH; llama gets neither today.

### Bottom line for the gap-analysis agent
The candidate explanation is confirmed at the source level: vLLM's GDN decode region
is 2 fused Triton kernels under a full CUDA graph vs llama's ~8 separate
host-launched, serially-dependent ggml nodes. That launch/bubble delta - not GEMM
compute - is the 62%-vs-40% busy gap. A timeline gap analysis on the existing nsys
trace should show idle GPU between the GDN sub-ops (conv -> gate -> recurrence ->
gather) per layer; if it does, Lever 3 (gating-into-recurrence fusion) and/or
decode CUDA-graph capture are the levers that will move the wall, unlike P2a/Lever 2.

---

## roofline-decode (READ-ONLY, no GPU) - decode-step roofline + bubble budget + parity target

Goal: bound the q36-27b-nvfp4 decode step by the bandwidth floor and the compute floor,
compare to measured llama 384 ms/step vs vLLM 327 ms/step, size the unexplained "bubble
budget", and state the step-time target for parity. Cross-checks vllm-gdn-compare above.

### Inputs (measured / GGUF metadata, no new GPU work)
- DGX GB10 (sm_121): LPDDR5x **273 GB/s**, dense NVFP4 MMA peak ~**500 TFLOP/s** (sparse ~1 PFLOP/s).
  Both numbers are shared identically by llama and vLLM (same HW, same weights).
- q36-27b-nvfp4 GGUF (arch qwen35): block_count **64** (full_attention_interval 4 ->
  **16 full-attn + 48 GDN** layers), d_model 5120, FFN 17408, attn 24 heads / 4 kv-heads,
  head_dim 256, ssm conv_kernel 4 / state_size 128 / group_count 16 / inner_size 6144.
  Weight file = **18.804 GB** (NVFP4 + FP8 block-scales + f16 norms/embd), fully GPU-resident.
- Measured llama decode (dense_base.out, -fa -npp128 -ntg128 -npl128, B=128, 128 TG steps):
  T_TG 49.154 s / 128 = **384.0 ms/step** (S_TG 333.3 tok/s; matches STATE "~381 ms").
- vLLM dense ref **391 tok/s @128** => 128/391 = **327.4 ms/step**.

### 1. Bandwidth floor (bytes that MUST cross LPDDR5x per step / 273 GB/s)
| term | bytes/step | basis |
|------|-----------|-------|
| Weights (batched, read ONCE/step, reused across all 128 seqs) | ~18.4 GB | file 18.804 minus ~0.4 GB sparse input-embd lookup; lm_head fully read |
| SSM state R+W (48 GDN layers x 128 seqs) | ~19 GB (bracket 10-38) | ~1.5 MB/layer/seq R+W, kernel-grounded: gated_delta_net 1.47 ms/call -> ~400 MB/call @273 GB/s; theoretical d_inner*d_state f32 doubles it |
| KV cache read (16 attn layers, avg 192 ctx, 128 seqs, f16) | ~1.6 GB | 64 KiB/tok over 16 layers; max-ctx 256 -> 2.1 GB |
| Activation/quantize/gate intermediates R+W | ~3 GB | quantize_mmq_nvfp4 + k_bin_bcast + silu + rms tensors @ batch 128 |
| **TOTAL** | **~42 GB/step** | bracket 32-61 GB |

**Bandwidth floor = 42/273 = ~154 ms/step** (central; bracket ~117-224 ms).
Weight-only HARD sub-floor (unavoidable, both engines pay it): 18.4/273 = **67 ms/step**.

KEY: even at batch 128 the FP4 GEMM is STILL memory-bound, not MMA-bound. AI = 2*128/0.53 B
= ~483 FLOP/byte << GB10 ridge 500e12/273e9 = 1832 FLOP/byte. The 42% `mul_mat_q<NVFP4,m=128>`
GPU time is weight-DRAM streaming, not tensor cores -> first-principles reason P2a (-26% MMA
occupancy) and Lever-2 were FLAT on decode.

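
The floor and arithmetic-intensity arithmetic in this subsection can be reproduced with a few lines of host C++. This is only a sketch for checking the numbers. The function names are made up for illustration, and the 0.53 bytes/weight figure is taken as given from the note rather than derived.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Figures copied from the roofline inputs above (GB10, q36-27b-nvfp4, batch 128).
constexpr double kBandwidthBytesPerSec = 273e9;   // LPDDR5x
constexpr double kDenseFp4FlopsPerSec  = 500e12;  // dense NVFP4 MMA peak

// Lower bound on step time if `bytes` must cross DRAM once per step.
double bandwidth_floor_ms(double bytes) {
    return bytes / kBandwidthBytesPerSec * 1e3;
}

// FLOP per weight byte for a batched GEMM: each weight contributes one
// multiply-add (2 FLOP) per sequence and is streamed once per step.
double arithmetic_intensity(int batch, double bytes_per_weight) {
    return 2.0 * batch / bytes_per_weight;
}

// Intensity above which the GEMM would become compute-bound on this machine.
double ridge_point() {
    return kDenseFp4FlopsPerSec / kBandwidthBytesPerSec;
}

// Prints the central estimates using the four table rows as inputs.
void print_roofline() {
    const double total_bytes = (18.4 + 19.0 + 1.6 + 3.0) * 1e9;
    std::printf("bandwidth floor  : %.0f ms\n", bandwidth_floor_ms(total_bytes));
    std::printf("weight-only floor: %.0f ms\n", bandwidth_floor_ms(18.4e9));
    std::printf("AI %.0f vs ridge %.0f FLOP/byte\n",
                arithmetic_intensity(128, 0.53), ridge_point());
}
```

Calling `print_roofline()` from a `main` gives the central values; swapping in the bracket ends (32 GB and 61 GB) reproduces the quoted bracket.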
### 2. Compute floor (FLOPs / ~500 TFLOP/s dense FP4)
| term | FLOPs/step | floor |
|------|-----------|-------|
| FP4 GEMM (all dense matmuls): 2 * ~26e9 params * 128 seqs | 6.66 TFLOP | / 500e12 = **13.3 ms** (6.7 ms @ sparse 1 PFLOP) |
| GDN recurrence (state update + read-out, 48 layers x 128 seqs) | ~0.04 TFLOP | < 0.1 ms (state-bound, not FLOP-bound) |
| **TOTAL** | ~6.7 TFLOP | **~13 ms/step (~4% of the step)** |

### 3. Verdict / bubble budget / parity target
```
                   compute floor   bandwidth floor      MEASURED step   x above bw-floor
GB10 dense-FP4     ~13 ms          ~154 ms (117-224)
vLLM dense @128                                         327 ms          ~2.1x (1.5-2.8x)
llama dense @128                                        384 ms          ~2.5x (1.7-3.3x)
```
- **Binding floor = bandwidth (~154 ms central, bracket ~117-224 ms), NOT compute (~13 ms).** Compute floor is ~25x
  below the wall -> FP4-MMA throughput is irrelevant; matches P2a/Lever-2 flatness exactly.
- **Both engines run ~2-2.8x ABOVE the bandwidth floor.** vLLM itself reaches only ~40-47%
  LPDDR5x efficiency -> even the reference is LATENCY/occupancy bound, not byte-bound.
  Confirms prior "decode is 2.5x above its bandwidth floor" work.
- **Bubble budget** (wall - bandwidth floor, central 154): vLLM ~**173 ms**, llama ~**230 ms**.
  = kernel-launch latency + occupancy gaps + serial data-dependency stalls.
- **The llama-vs-vLLM gap = 384 - 327 = 57 ms/step (14.8% of the step) is 100% BUBBLE.**
  Both engines share IDENTICAL bandwidth AND compute floors (same 18.8 GB NVFP4 weights, same
  SSM state, same KV, same GB10 273 GB/s + 500 TFLOP). Bytes and FLOPs are byte-for-byte equal,
  so the entire 57 ms differential lives in critical-path bubble - NOT bandwidth, NOT compute.

**Parity target: 327 ms/step (391 tok/s @128). llama must shave 57 ms/step = 14.8% off 384 ms.**
Neither floor can move (both already shared with vLLM), so the 57 ms can ONLY come from
collapsing critical-path bubbles -> structurally-correct case for Lever 3 (fuse the serial GDN
gating chain) and/or decode CUDA-graph capture, exactly the two wins vllm-gdn-compare found vLLM
already has. P2a/Lever-2 were flat because they freed OVERLAPPED GPU-busy time BELOW the floor.

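
The budget and parity figures above are plain step-time arithmetic. A minimal sketch for re-deriving them when the inputs change (helper names are illustrative, not from the codebase):

```cpp
#include <cassert>
#include <cmath>

// Step time in ms when `batch` sequences each emit one token per step.
double step_ms(int batch, double tok_per_s) { return batch / tok_per_s * 1e3; }

// Time spent above the shared bandwidth floor (what this section calls the budget).
double budget_above_floor_ms(double wall_ms, double floor_ms) { return wall_ms - floor_ms; }

struct Gap {
    double ms;           // absolute step-time difference
    double pct_of_step;  // difference as a percentage of the slower step
};

// Step-time gap between the slower engine and the reference.
Gap parity_gap(double slower_ms, double reference_ms) {
    const double d = slower_ms - reference_ms;
    return { d, d / slower_ms * 100.0 };
}
```

With the measured inputs, `step_ms(128, 391)` gives the 327.4 ms vLLM step and `parity_gap(384.0, 327.4)` gives the ~57 ms / ~14.8% target. The arithmetic only sizes the gap; it says nothing about where in the step the time is spent.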
### Cross-check / sizing for the gap-analysis (timeline) agent
- GPU-busy sum from nsysab_new (ntg24 window, /24 steps): FP4 GEMM ~243 + gated_delta_net ~76 +
  GDN glue (k_bin_bcast mul ~49, silu ~34, concat ~19, gdn_gather ~21, ssm_conv ~12, l2_norm ~6,
  op_add ~10) ~152 + quantize ~62 = **~555 ms GPU-busy vs 384 ms wall** -> sum >> wall by ~1.45x,
  so heavy overlap is real and GPU-busy% buckets ARE misleading. Do NOT sum kernel times; the
  wall is the critical path.
- Concrete budget: if the inter-kernel IDLE gaps + non-overlapped launch latency along the serial
  GDN chain (ssm_conv -> gated_delta_net -> gating elementwise -> gdn_gather, x48 layers x N steps)
  sum to **>= 57 ms/step**, Lever 3 is justified AND sized. If those critical-path gaps total
  < 57 ms, parity is NOT reachable via GDN-gate fusion alone and the gap is elsewhere (GDN core
  kernel slower than vLLM fused_recurrent, or scheduler/H2D).
- Structural corroboration (agrees with vllm-gdn-compare): vLLM runs the GDN region as 2 fused
  Triton kernels under a full CUDA graph; llama splits it into ssm_conv + gated_delta_net +
  gdn_gather + ~6 serially data-dependent elementwise gate kernels. ~384 host-launched nodes/step
  on a chain that cannot overlap is precisely the mechanism that produces llama's extra ~57 ms.

Floors are engine-independent lower bounds; the timeline agent owns proving the 57 ms is
recoverable on the critical path. Roofline says: target 327 ms, shave 57 ms, and it can ONLY
come from bubble (not bytes, not FLOPs).

Assisted-by: Claude:opus-4.8 [Claude Code]

## lever3-design (READ-ONLY, no GPU) - concrete fusion of the serial GDN gate chain into the recurrence kernel

### What actually feeds/consumes the recurrence kernel today (qwen35 decode, fused_gdn_ar)
Traced in `src/models/qwen35.cpp::build_layer_attn_linear` ->
`src/models/delta-net-base.cpp::build_recurrent_attn` (fused !keep branch) ->
`ggml/src/ggml-cuda/gated_delta_net.cu`. The model is GDA (g->ne[0]==1, scalar
gate per head; kda=false in the kernel). S_v = ssm_d_state = 128, so the kernel
runs the `<128>` template: warp_size==S_v==128, num_warps=4, rows_per_lane==1,
grid (H, n_seqs, S_v/4=32 z-tiles). Each warp owns one output column `col`; the
128 lanes hold the full head-vector (one element per lane).

Serial pre-GDN gate chain (each a standalone host-launched ggml node, all on the
critical path between the in-proj GEMMs and the recurrence):
1. `beta = ggml_sigmoid(ssm_beta @ cur)` -> kernel reads `beta_val = *beta_t`
2. `alpha = ssm_alpha @ cur`
3. `ggml_add(alpha, ssm_dt)` (k_bin_bcast op_add)
4. `ggml_softplus(...)` (unary_op<softplus>, 1248 inst)
5. `ggml_mul(softplus, ssm_a)` (k_bin_bcast op_mul; ssm_a = -exp(A_log), baked) -> g; kernel does `expf(g_t)`
6. `ssm_conv` then `ggml_silu` (conv path; may already hit the upstream SSM_CONV+SILU fuse) -> v_conv, and the q/k slices
7. `ggml_l2_norm(q_conv)`, `ggml_l2_norm(k_conv)` (l2_norm_f32<32>, 2496 inst = 1248x2) -> kernel reads q_reg/k_reg

Post-GDN gate (consumes kernel output):
8. `build_norm_gated(output, ssm_norm, z)` = rms_norm(output)*ssm_norm (RMS_NORM+MUL fused) then `silu(z)*.` (unary_gated_op<silu>, the 5.9% bucket)

### The fusion: fold steps 1,3,4,5,7 INTO gated_delta_net_cuda (a "fused-gate" mode)
These five are exactly the per-(head) scalar gates (sigmoid beta; softplus+dt+ssm_a
-> g) and the per-head-vector L2 norms of q/k - and the kernel ALREADY loads every
operand it needs:
- It reads `beta_val` (scalar) -> pass RAW beta, do `beta_val = 1.f/(1.f+expf(-raw))` in-kernel. Removes node 1.
- It reads `g_t` (scalar, GDA) and does `expf(g_t)` -> pass RAW alpha + per-head `ssm_dt[h]` + per-head `ssm_a[h]`, compute `g = ssm_a[h]*op_softplus(alpha + ssm_dt[h])` in-kernel, keep the existing `expf(g)`. `op_softplus(x) = (x>20)?x:logf(1+expf(x))` (copy `ggml_compute_softplus_f32` verbatim). Removes nodes 3,4,5.
- It loads the full q/k head-vector into `q_reg[r]`/`k_reg[r]` (one element per lane at S_v==128). L2-normalize in registers: `float qss = warp_reduce_sum<128>(q_reg[0]*q_reg[0]); q_reg[0] *= rsqrtf(qss + eps* ... )` matching the l2_norm formula, same for k. Each warp redundantly recomputes the (identical) norm for its column - cheap, no shared mem, no extra launch. Removes nodes 7 (x2). `eps` (= f_norm_rms_eps) passed as a kernel float param.

That collapses the pre-GDN serial chain to just: in-proj GEMMs -> build_conv_state(concat) -> ssm_conv(+silu) -> [single fused gated_delta_net kernel]. The five folded steps are 6 host-launched gate kernels (step 7 runs once for q and once for k), all removed per SSM layer per decode step.

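
As a host-side reference for the folded gate math, the sketch below spells out the scalar formulas the fused kernel would evaluate per head, plus an L2 normalization of one head vector. It is a plain C++ model for checking values, not the CUDA kernel. The struct and function names are hypothetical, and the eps placement in the norm is an assumption because the note leaves that part of the formula elided.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Thresholded softplus, in the form the note says must be copied exactly.
inline float softplus_ref(float x) {
    return (x > 20.0f) ? x : std::log(1.0f + std::exp(x));
}

inline float sigmoid_ref(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Per-head scalars derived from the raw (un-gated) inputs.
struct GateScalars {
    float beta;   // sigmoid(raw beta)            -> replaces the sigmoid node
    float g;      // ssm_a * softplus(alpha + dt) -> replaces add, softplus, mul
    float decay;  // exp(g), which the kernel already computes today
};

GateScalars fold_gate(float beta_raw, float alpha_raw, float ssm_dt_h, float ssm_a_h) {
    GateScalars s;
    s.beta  = sigmoid_ref(beta_raw);
    s.g     = ssm_a_h * softplus_ref(alpha_raw + ssm_dt_h);
    s.decay = std::exp(s.g);
    return s;
}

// L2-normalize one head vector in place.
// ASSUMPTION: eps enters as x * 1/sqrt(sum_sq + eps); the real ggml l2_norm
// placement has to be checked before relying on this.
void l2_normalize(std::vector<float>& x, float eps) {
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float scale = 1.0f / std::sqrt(sum_sq + eps);
    for (float& v : x) v *= scale;
}
```

A reference like this is mainly useful in a CPU-side test that compares the fused path against the unfused node chain on the same raw inputs.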
### Why the OUTPUT gate (step 8) is NOT folded into this kernel
The output gated-rmsnorm reduces over the full head_v_dim (S_v=128) per (head,seq).
In this kernel those 128 elements are produced by 128 DIFFERENT (warp x z-tile)
blocks (4 warps x 32 z-tiles), so an in-kernel head-wide reduction would need a
grid-global sync - not feasible without a grid redesign. Leave step 8 as the
existing RMS_NORM+MUL + unary_gated<silu> fusion (already 2 launches, not in scope).
The conv-silu (step 6) is a convolution, structurally separate; rely on the
existing upstream SSM_CONV(+ADD)+SILU fuse rather than pulling it into the
recurrence kernel.

### Implementation scope
- `ggml/include/ggml.h`: new builder `ggml_gated_delta_net_inplace_ids_fused_gate(ctx, q_raw, k_raw, v, alpha_raw, beta_raw, cache4d, state_dst, ids, ssm_a, ssm_dt, rs_head, eps)` (or an op-param flag GDN_FUSE_GATE on the existing builder + 2 extra srcs). src budget: current op uses src[0..7]; add ssm_a -> src[8], ssm_dt -> src[9]. GGML_MAX_SRC==10, so it fits EXACTLY (zero headroom - note for review).
- `ggml/src/ggml.c`: builder + a new op-param i32 flag (e.g. params[2]=fuse_gate) + f32 param for eps; assert shapes (ssm_a/ssm_dt are [num_v_heads]).
- `ggml/src/ggml-cuda/gated_delta_net.cu`: in `gated_delta_net_cuda`, gate the in-kernel sigmoid/softplus-gate/l2norm behind a `bool FUSE_GATE` template param (4th template bool, keeps the non-fused path byte-identical and avoids register bloat when off). Read ssm_a[h_idx], ssm_dt[h_idx]; compute g per head; sigmoid raw beta; warp-reduce q_reg/k_reg sumsq -> rsqrtf scale. Plumb the 2 new src pointers + eps through `launch_gated_delta_net` and `ggml_cuda_op_gated_delta_net` (read src[8],src[9], op_param eps/flag). The `gdn_gather_nonident` path is unaffected (it gathers state, not q/k/g/beta).
- `ggml/src/ggml-cpu/ops.cpp`: mirror in `ggml_compute_forward_gated_delta_net_one_chunk` (host sigmoid/softplus/l2norm before the per-token math) for CPU parity / test-backend-ops.
- `src/models/delta-net-base.cpp::build_recurrent_attn` (the fused !keep + ids branch, and the inplace non-ids branch): call the fused-gate builder, pass raw alpha/beta/q/k + ssm_a + ssm_dt + eps.
- `src/models/qwen35.cpp` / `qwen35moe.cpp` / `qwen3next.cpp` `build_layer_attn_linear`: when the fuse flag is on, DROP `ggml_sigmoid(beta)`, `ggml_add(alpha,dt)`, `ggml_softplus`, `ggml_mul(.,ssm_a)`, and the two `ggml_l2_norm` nodes; hand the raw tensors + `model.layers[il].ssm_a`, `ssm_dt` to build_recurrent_attn. The conv-silu and z/output-gate path are unchanged.
- Guard the whole thing behind `cparams.fused_gdn_gate` / env `LLAMA_FUSE_GDN_GATE` (default OFF) so it A/Bs against the clean Lever-1 build exactly like P2a/Lever-2, and only the recurrent (GDA) qwen35 family path is touched.

### Numeric considerations / bit-exactness
- sigmoid(beta), softplus(alpha+dt), and the `g = ssm_a*softplus` mul/add are pointwise fp32 with the SAME formula/order as the standalone ggml ops -> these can be **bit-exact** (no reduction). softplus must copy `(x>20)?x:logf(1+expf(x))` exactly.
- q/k l2norm is the ONE op with a reduction: the standalone `l2_norm_f32<32>` does its own warp/block reduction; the in-kernel `warp_reduce_sum<128>` tree may differ in the last ULP, and the eps placement (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) must match the ggml l2_norm exactly. Expect **near-bit-exact, not guaranteed byte-identical** greedy output. So unlike Levers 1/2, gate this on a **PPL/KL tolerance** (KL logit delta < ~1e-3, PPL delta within noise) rather than md5 identity. If byte-identity is required, exclude l2norm from the fold (keep nodes 7) and fuse only sigmoid/softplus/gate - but that drops the value to ~0.3% and is probably not worth it.

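
One way to express the KL tolerance gate is to compare the logits of the fused and unfused builds position by position. The sketch below assumes both logit rows are available as equal-length float vectors. The function names and the per-position pass/fail shape are illustrative choices, not an existing test harness.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Numerically stable log-softmax of one logit row.
std::vector<double> log_softmax(const std::vector<float>& logits) {
    const double mx = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (float l : logits) sum += std::exp(l - mx);
    const double log_z = mx + std::log(sum);
    std::vector<double> out;
    out.reserve(logits.size());
    for (float l : logits) out.push_back(l - log_z);
    return out;
}

// KL(P_ref || P_test) for one token position; both rows must have equal length.
double kl_divergence(const std::vector<float>& ref, const std::vector<float>& test) {
    const std::vector<double> lp = log_softmax(ref);
    const std::vector<double> lq = log_softmax(test);
    double kl = 0.0;
    for (size_t i = 0; i < lp.size(); ++i) kl += std::exp(lp[i]) * (lp[i] - lq[i]);
    return kl;
}

// Pass/fail against the ~1e-3 tolerance suggested in the note.
bool within_kl_tolerance(const std::vector<float>& ref,
                         const std::vector<float>& test,
                         double tol = 1e-3) {
    return kl_divergence(ref, test) < tol;
}
```

A last-ULP difference in the l2norm reduction should leave the KL far below the tolerance, while a wrong eps placement or a dropped gate would be expected to exceed it.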
### Estimated kernels-removed-per-layer and the honest ceiling
- Removed per SSM decode layer-step: sigmoid(beta) + add(dt) + softplus + mul(ssm_a) + l2norm(q) + l2norm(k) = **6 host-launched kernels -> 0**, collapsing 7 nodes (incl. recurrence) to 1. Across 48 SSM layers = **~288 launches/step removed** (matches the instance deltas: l2_norm 2496, softplus 1248, sigmoid 1248, plus the alpha-add/ssm_a-mul share of op_add/op_mul).
- GPU-BUSY ceiling of the removed ops is small: l2_norm 1.0% + softplus ~0% + sigmoid 0.3% + (dt add + ssm_a mul share of op_add 1.7% / op_mul 8.5%). The point of Lever 3 is NOT the freed busy-time (P2a/Lever-2 proved freeing overlapped busy-time is flat) - it is removing ~288 LAUNCH BUBBLES/step that sit on the serial conv->gate->recurrence dependency where the GPU is otherwise idle. The win is wall-clock only if those specific bubbles are on the critical path.

### RISK (must be settled before building)
1. **Same trap as P2a/Lever-2 if the bubbles overlap.** If the scheduler already
   overlaps these pre-GDN gate kernels with an adjacent layer's 42% mul_mat_q GEMM,
   Lever 3 is FLAT. **Precondition: the timeline gap analysis must show idle GPU
   between ssm_conv -> (sigmoid/softplus/l2norm) -> gated_delta_net per layer** at
   batch=128 BEFORE building. If the trace shows the gate ops back-to-back with no
   gap (overlapped), do NOT build op-fusion; go to lever (2) below.
2. **The bigger bubbles may be elsewhere on the chain.** The large buckets are op_mul
   8.5% and unary_gated<silu> 5.9% - much of which is the POST-GDN output gate and
   FFN, which this fusion does NOT touch. If the gap analysis pins the dominant idle
   to the post-GDN region or to inter-layer launch latency generally, the
   higher-leverage Lever 3 is **decode CUDA-graph capture** (removes host launch
   latency for ALL ~384 nodes/step at once, exactly what vLLM does), not per-op
   fusion. CUDA-graph is the strictly larger hammer here; op-fusion only helps the
   pre-GDN slice. Recommend measuring the per-sub-op gap first and preferring the
   CUDA-graph lever if the bubbles are spread across the step rather than concentrated
   in the pre-GDN gate slice.
3. **src-slot exhaustion** (src[8],src[9] use the last 2 of GGML_MAX_SRC=10) - any
   later op needing more srcs on this node has zero headroom; flag for review.

## cudagraph-coverage (READ-ONLY, no GPU) - does the CUDA graph cover the GDN serial chain at B=128?

### Verdict: YES, the graph covers GDN at batch=128 (dense model). No GDN op forces graph-disable or per-step re-instantiation.

Source: `ggml/src/ggml-cuda/ggml-cuda.cu` (graph state machine), `gated_delta_net.cu`
(fused op), `src/models/delta-net-base.cpp` (graph build), `src/llama-memory-recurrent.cpp`
(recurrent head), all on dev tree `~/llama-paged-dev` (HEAD df1cc97, Lever-1). Cross-checked
against the committed A2_CUDAGRAPH_DECODE.md + DECODE_PARITY_EXPLORE.md measurements.

### How graph-disable / re-instantiation are decided (this fork's state machine)
- `ggml_cuda_graph_check_compability` (ggml-cuda.cu:3251) disables the graph for ONLY two
  reasons: (a) a split-buffer src, (b) `GGML_OP_MUL_MAT_ID` with non-quantized weights OR
  `node->ne[2] > get_mmvq_mmid_max(...)` [TAG_MUL_MAT_ID_CUDA_GRAPHS]. GATED_DELTA_NET,
  SSM_CONV, SSM_SCAN, GET_ROWS, CONCAT, the gating elementwise ops are NOT in the disable
  list. So no GDN op forces graph-disable.
- `ggml_cuda_graph_update_required` (3297) memcmps, per node, the full `ggml_tensor` struct
  (incl. `op_params` and `data`) + each src's `data` ptr / `ne` / `nb`. ANY delta -> the
  warmup state machine (ggml_backend_cuda_graph_compute:4464) resets `warmup_complete` and the
  WHOLE graph (one key = `cgraph->nodes[0]`) runs eager that step until stable again. Buffer
  CONTENTS are NOT compared - a contents-only change (e.g. ids values) is graph-safe.

### Why the GDN region's properties are STABLE across steady decode steps
The fused decode path is `ggml_gated_delta_net_inplace_ids` (delta-net-base.cpp:558-560):
```
state_dst = ggml_view_2d(ctx, ssm_states_all, n_embd_s, n_seqs, nb1,
                         kv_head * n_embd_s * elsize); // offset = kv_head
ggml_gated_delta_net_inplace_ids(..., cache4d, state_dst, ids, /*rs_head=*/(int)kv_head);
```
Both the `state_dst` view byte-offset and the `rs_head` op_param (read back as
`ggml_get_op_params_i32(dst,1)` in gated_delta_net.cu:330) derive from
`kv_head = mctx_cur->get_head()`. In `llama_memory_recurrent::find_slot`
(llama-memory-recurrent.cpp:610-689) the n_seqs used cells are SWAPPED into the contiguous
range `[min .. min+n_seqs-1]` and `head = min`. The recurrent cache does NOT grow per token
(one state cell per sequence, unlike the KV cache). For a steady 128-seq continuous batch the
same sequences own the same tails every step, so `min`/`head` are constant (=0) -> state_dst
offset constant, rs_head op_param constant. The GDN inputs (q,k,v,g,b, cache4d, ids) are
fixed-shape (n_seqs=128, n_rs slots), so ne/nb are stable, and ggml-alloc hands out the same
compute-buffer offsets each same-topology ubatch -> data ptrs stable. The `ids` (s_copy)
tensor's CONTENTS change per step but its address/ne/nb do not -> graph-safe.

### The fused GDN op is capture-safe (no host-sync, no capture-time cudaMalloc)
`gated_delta_net.cu`: the op launches `gdn_gather_nonident_kernel` + `gated_delta_net_cuda`
on `ctx.stream()` with NO `cudaStreamSynchronize` / host `cudaMemcpy` / `cudaMalloc`. The
gather scratch is `ggml_cuda_pool_alloc` (VMM pool, served from reserved memory after warmup,
no real cudaMalloc during capture). `gdn_gather_nonident` early-returns for identity sequences
(`ids[s]==rs_head+s`), which is the steady-decode case, so its 3.7% is a
launched-but-mostly-noop kernel - still captured into the graph like any other. Capture
succeeds (the build runs, graphs engage), confirming none of these break stream capture.

### The only re-instantiation is NOT GDN-driven
A2 already measured the re-warm cadence: the graph re-instantiates every ~256 tokens because
the FULL-ATTENTION block-table input `idx` has `ne[0]=GGML_PAD(n_kv,256)` (and kq_mask in
lockstep) - those step at 256-token boundaries (paged-attn.cpp:199-213). ~97% of decode steps
replay the captured graph. This is a full-attention-layer input, not a GDN op. (The unpadded
`LLAMA_KV_PAGED_GATHER` fallback grows `ne[0]` every step and runs pure-eager, but that is not
the default decode path and is not the GDN/SSM path.)

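
The 256-token cadence follows directly from the padding arithmetic. A minimal sketch, using the same round-up-to-a-power-of-two expression as ggml's `GGML_PAD` macro; the helper names are illustrative, and the model ignores the warm-up steps that follow each shape change:

```cpp
#include <cassert>
#include <cstdint>

// Round x up to a multiple of n (n must be a power of two).
constexpr int64_t pad_to(int64_t x, int64_t n) { return (x + n - 1) & ~(n - 1); }

// Count decode steps in (n_kv_begin, n_kv_end) whose padded block-table width
// differs from the previous step's width. Each such step changes ne[0] and so
// invalidates the captured graph.
int count_shape_changes(int64_t n_kv_begin, int64_t n_kv_end, int64_t pad = 256) {
    int changes = 0;
    for (int64_t n_kv = n_kv_begin + 1; n_kv < n_kv_end; ++n_kv) {
        if (pad_to(n_kv, pad) != pad_to(n_kv - 1, pad)) ++changes;
    }
    return changes;
}
```

Every other step keeps the same padded width, which is why the shape only steps at 256-token boundaries while the unpadded fallback would change it on every token.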
### Reconciliation with the "~40% util / 60% idle bubbles" premise (it is refuted for GDN)
The committed nsys sweeps (A2_CUDAGRAPH_DECODE.md, DECODE_PARITY_EXPLORE.md) show the steady
decode is ~99.4-99.5% GPU-BUSY with graphs ON (measured with `--cuda-graph-trace=node`; a
graphs-ON trace WITHOUT that flag under-counts GPU rows and falsely reports idle - Trap #2).
Total exposed idle is ~0.65% of the step; the within-step launch fraction graphs remove is
0.34% (0.37%->0.11%) and is ALREADY collapsed - the GDN sub-op launch gaps are inside the
captured region. The "40% utilization" in the STATE is BANDWIDTH-roofline util, not idle SMs:
decode moves ~55.5 GB/step at 2.48x the 273 GB/s floor, SSM state r+w = 66% of step bytes. The
GDN recurrence is memory-bandwidth-bound at low occupancy (~12-16%), not launch-gap-bound. So
"60% idle bubbles on the serial GDN chain" is not supported by the traces; the gap to vLLM is
SSM-state memory traffic, consistent with P2a/Lever-2 being flat (freeing GPU-busy time, not
wall-clock).

### Graph-safe lever for GDN: none new
- GDN is already graph-covered; there is no "make the GDN ops graph-safe" lever to build - they
  are already safe and captured.
- The only genuinely graph-NON-covered idle is the BETWEEN-step host gap (~2 ms/step, ~0.4%):
  ggml rebuilds/reallocs the cgraph each step with a new `cgraph->uid`, so the uid fast-path in
  ggml_cuda_graph_update_required never fires and the host re-dispatches ~3100 launches on the
  Grace cores between graph launches (vLLM builds its graph once + persistent device metadata).
  A persistent/reused cgraph across decode steps would let the uid fast-path fire and shrink the
  host gap - but at 0.4% of the step it is second-order to the SSM bandwidth floor.
- CAVEAT (MoE, qwen35moe): MUL_MAT_ID at B=128 can trip [TAG_MUL_MAT_ID_CUDA_GRAPHS]
  (`ne[2] > mmvq_mmid_max`), disabling the WHOLE MoE-decode graph (GDN included) into eager.
  That is a MUL_MAT_ID disable, not a GDN break, and does not touch the dense 335 tok/s headline;
  worth a separate confirm for the MoE model.

## decode-timeline-gap (GPU, label gap-analysis) - the decisive fresh node-level measurement

This is the new GPU run the analysis was waiting on. It arbitrates between the
roofline/vllm-gdn-compare theory ("57 ms = 100% bubble, Lever 3 closes it") and the
cudagraph-coverage source verdict ("~99.4% busy, bandwidth-bound, bubbles refuted").
The measurement confirms the latter and refutes the former, with per-kernel numbers.

### Capture (the trap the prior `--trace=cuda` fell into is now avoided)
`nsys profile --trace=cuda --cuda-graph-trace=node` on build-cuda-base (clean
Lever-1, HEAD df1cc97, git-clean mmq.cuh), q36-27b-nvfp4 dense,
`-fa on -npp 128 -ntg 24 -npl 128 -c 33000`. Artifacts on DGX:
`~/llama-paged-dev/nsysgap.{nsys-rep,sqlite}`. The decode step is a single CUDA graph
(graphId=11, 23 replays = steps 2-24; graphId=1 x8 = prefill). Plain `--trace=cuda`
recorded each step as ONE opaque ~380 ms block, so the widely-cited `nsysab_new.kern.txt`
breakdown (mul_mat_q 42%, gated_delta_net 13%) is PREFILL + the single eager capture step,
NOT decode. With node-level trace the graph expands: 168201 kernels = 91499 graph-internal
+ 76702 eager prefill. **All graph kernels on stream 14 (single stream) -> strictly serial,
no overlap, so any inter-kernel gap is pure GPU idle.**

### One steady decode step (window between decode launches 22413.26 / 22796.74 ms, width 383.48 ms)
Exactly 48 `gated_delta_net` + 16 `flash_attn` = one clean step (48 GDN + 16 attn).
2965 kernels.

| classification | ms/step | % of step |
|---|---|---|
| (a) inter-kernel LAUNCH gaps + (b) SERIAL-DEPENDENCY stalls (LAG sum, single stream) | **0.225** | **0.06%** |
| (c) within-kernel time (GPU running) | 380.4 | 99.94% |

Zero gaps > 5 us. Largest single gap 2.40 us. 1260 sub-1us gaps + 1700 back-to-back.
**The decode step is 99.94% GPU-busy. There are no bubbles.** This independently
confirms cudagraph-coverage's ~99.4% and **refutes** roofline-decode's "57 ms = 100%
bubble" and vllm-gdn-compare's "~384 launch bubbles/step on the critical path".
nvidia-smi's "40% util" = low SM/compute efficiency WITHIN kernels (c)
(memory-latency-bound, ~12-16% achieved occupancy), not wall-clock idle.

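
The "LAG sum" classification amounts to walking the kernels of one stream in start order and summing the distance from each kernel's start to its predecessor's end. A sketch of that computation on already-extracted timestamps follows. The struct names are illustrative, and reading the rows out of the nsys sqlite export is out of scope here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct KernelSpan {
    int64_t start_ns;
    int64_t end_ns;
};

struct GapStats {
    int64_t idle_ns;     // sum of gaps between consecutive kernels
    int64_t max_gap_ns;  // largest single gap
    int64_t busy_ns;     // sum of kernel durations
};

// Assumes every span comes from the SAME stream, so kernels cannot overlap and
// any gap between neighbours is idle GPU time. Spans are sorted by start here.
GapStats measure_gaps(std::vector<KernelSpan> k) {
    std::sort(k.begin(), k.end(),
              [](const KernelSpan& a, const KernelSpan& b) { return a.start_ns < b.start_ns; });
    GapStats s{0, 0, 0};
    for (size_t i = 0; i < k.size(); ++i) {
        s.busy_ns += k[i].end_ns - k[i].start_ns;
        if (i == 0) continue;
        const int64_t gap = std::max<int64_t>(0, k[i].start_ns - k[i - 1].end_ns);
        s.idle_ns += gap;
        s.max_gap_ns = std::max(s.max_gap_ns, gap);
    }
    return s;
}
```

The single-stream assumption is what makes the sum meaningful. With kernels spread over several streams, a gap on one stream could be covered by work on another and would not be idle time.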
### Real decode kernel mix (% of the 380.4 ms step) - corrects the prefill-contaminated kern_sum
| kernel | n/step | ms | % | grid CTAs | waves/48SM |
|---|---|---|---|---|---|
| gated_delta_net_cuda | 48 | **196.37** | **51.6** | 48x128x32 = 196608 | 4096 |
| mul_mat_q (FP4 in/out/qkv/o proj) | 496 | 92.90 | 24.4 | 136 | 1.5 |
| quantize_mmq_nvfp4 | 496 | 17.13 | 4.5 | 483 | 10 |
| nvjet GEMM (lm_head) | 1 | 11.91 | 3.1 | 1944 | 40 |
| flash_attn_ext_f16 (16 attn layers) | 16 | 11.67 | 3.1 | 48 | 1.0 |
| concat_cont (conv-state) | 48 | 8.01 | 2.1 | 20480 | 427 |
| cpy_scalar | 64 | 7.62 | 2.0 | 49152 | 1024 |
| k_get_rows_float | 49 | 7.08 | 1.9 | 15098 | 315 |
| k_bin_bcast (gate mul + add) | 720 | 6.59 | 1.7 | 3169 | 66 |
| ssm_conv_f32 | 48 | 5.64 | 1.5 | 10240 | 213 |
| unary_gated (silu/sigmoid) | 128 | 5.36 | 1.4 | 5888 | 123 |
| mul_mat_q_stream_k_fixup | 304 | 3.94 | 1.0 | 192 | 4 |
| rms_norm_f32 | 209 | 3.52 | 0.9 | 1764 | 37 |
| l2_norm_f32 | 96 | 0.64 | 0.2 | | |
| gdn_gather_nonident | 48 | **0.061** | 0.016 | | |

- `gated_delta_net` is **51.6% of the step**, the single dominant term. The
  previously-cited "1.47 ms/call near-vLLM" was the EAGER average over 1248 calls
  (range 0.046-4.42 ms = prefill warmups + capture); true steady decode is
  **4.08-4.11 ms/call** (gridY=128 = the 128 seqs). 2.8x higher than believed.
- It launches 196608 CTAs / 4096 waves = NOT occupancy-starved; the cost is
  bandwidth-bound state traffic (~384 MB read + ~384 MB write per layer for the
  48-head x 128-seq x [state 128 x head_v 128] recurrent state, ~190 GB/s effective).
- The Lever-3 narrow target (gating glue) = k_bin_bcast 6.59 + silu/sigmoid 5.36 +
  l2_norm 0.64 + softplus 0.13 = **12.76 ms = 3.35%** of the step. `gdn_gather` is
  **0.06 ms** (negligible - it early-returns on identity ids as predicted).

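
The state-traffic and wave figures quoted for `gated_delta_net` can be re-derived from the shapes alone. The sketch assumes an f32 state (4 bytes per element), which is what the ~384 MB figure implies, and one resident CTA per SM when counting waves. Both are assumptions of this example, not measured properties.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of recurrent state touched in ONE direction by one GDN layer.
constexpr int64_t state_bytes_per_layer(int64_t heads, int64_t seqs, int64_t d_state,
                                        int64_t head_v, int64_t elem_bytes) {
    return heads * seqs * d_state * head_v * elem_bytes;
}

// Effective bandwidth in GB/s if the state is read once and written once
// during a kernel that takes `kernel_ms`.
double effective_gbps(int64_t bytes_one_way, double kernel_ms) {
    return 2.0 * static_cast<double>(bytes_one_way) / (kernel_ms * 1e-3) / 1e9;
}

// Waves needed for `ctas` thread blocks, assuming one CTA per SM at a time.
constexpr int64_t waves(int64_t ctas, int64_t sms) { return (ctas + sms - 1) / sms; }
```

With 48 heads, 128 sequences and a 128 x 128 state, `state_bytes_per_layer` returns 402653184 bytes (384 MiB). At about 4.09 ms per call that works out to a little under 200 GB/s, in line with the ~190 GB/s quoted above.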
### The three answers (with numbers)
1. **Bubbles on the serial GDN critical path?** NO. 0.225 ms idle/step = 0.06%,
   zero gaps > 5 us. CUDA graphs eliminated launch overhead; serial dependencies do
   not produce idle (each kernel starts < 1 us after the previous). The premise is
   refuted by direct measurement.
2. **Would Lever 3 (fuse the gating chain) shrink the step or overlap away?** It
   shrinks it, but only by its hard ceiling **12.76 ms = 3.35%** (380 -> 367 ms, 336
   -> ~348 tok/s, 86% -> 89% of vLLM). It does NOT close the 14% / 53-57 ms gap.
   IMPORTANT mechanism correction: the step is single-stream and 99.94% busy, so
   there is NO overlap to absorb freed time (the lever3-design RISK #1 "same trap as
   P2a if overlapped" does NOT apply - nothing overlaps). So removing those kernels'
   GPU-time DOES cut wall-clock - but the win is removing their DRAM (LPDDR5x) byte
   traffic, NOT launch bubbles (there are none). And the value is the measured ~12.76 ms,
   not the "~288 launch bubbles" framing (those launches cost ~0 inside the graph). This also
   explains P2a/Lever-2 flatness correctly: NOT "overlapped busy-time" (no overlap),
   but P2a tuned the prefill large-M GEMM (decode GEMMs are 136-CTA tail-bound, untouched)
   and Lever-2 relocated mandatory quantize work into the GEMM prologue (net zero).
3. **Do CUDA graphs cover the GDN region at B=128?** YES, fully. Whole step = one
   graph, 23 replays, ~0.2 ms host gap between steps. `gdn_gather_nonident` and the
   in-place state ops are graph-internal nodes (graphNodeId != 0); no fragmentation.
   Confirms cudagraph-coverage. Note: lever #2 from vllm-gdn-compare ("CUDA-graph the
   decode step") is ALREADY IN EFFECT in this build and did not close the gap - so it
   is spent, not pending.

### Verdict against roofline-decode's own sizing test
roofline-decode stated: "if critical-path gaps total < 57 ms, parity is NOT reachable
via GDN-gate fusion alone and the gap is elsewhere (GDN core kernel slower than vLLM
fused_recurrent)." **Measured gaps = 0.225 ms << 57 ms.** Therefore, by that test, the
53-57 ms / 14% gap is NOT bubble and NOT closable by gating fusion. It lives in
**kernel GPU-time**, dominated by the `gated_delta_net` recurrence (51.6%,
bandwidth-bound) and secondarily the FP4 GEMM + quantize stack (29%). The "57 ms = 100%
bubble" roofline conclusion was an inference from the prefill-contaminated GPU-busy sum
(~555 ms vs 384 ms "implies overlap"); the node-level decode-only measurement shows
per-step GPU-busy = wall (no overlap), so that inference does not hold.

### Recommendation (resized)
|
||||
- The real lever is the `gated_delta_net` recurrence kernel itself (196 ms, 51.6%):
|
||||
match vLLM's `fused_recurrent_gated_delta_rule_packed_decode` (vllm-gdn-compare
|
||||
kernel #4) which folds l2norm + gate + decay + recurrence + state-writeback into a
|
||||
SINGLE pass over the state, reducing DRAM round-trips of the state. The win is byte
|
||||
reduction in a memory-bound single-stream step, not bubble removal.
|
||||
- The lever3-design fusion is still worth doing as a component of that (it removes
|
||||
~12.76 ms = 3.35% of real byte traffic, and unlike its own RISK section feared, it
|
||||
will NOT be flat because there is no overlap), but on its own it is a ~3% lever, not
|
||||
the gap-closer. Build it folded into a single-pass recurrence kernel, not as an
|
||||
isolated gate fold.
|
||||
- Next decisive measurement (future GPU-agent run): profile vLLM's decode step at
|
||||
npl128 with the same node-level method and compare per-region GPU-time (GDN
|
||||
recurrence vs GEMM vs attention) to localize exactly where vLLM spends its 53-57 ms
|
||||
less. Both engines move near-identical bytes only if vLLM's fused recurrence does
|
||||
not re-stream state; the per-kernel A/B will show whether the gap is the recurrence
|
||||
pass or the GEMM/quantize stack.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
## SYNTHESIS (final) - the validated decode-parity picture, ranked plan, and verdict
|
||||
|
||||
Reconciles all six investigation sections above plus the three adversarial verdicts
|
||||
(Verify A/B/C). One sentence: **the "~60% idle" never existed; the decode step is
|
||||
99.94% GPU-busy single-stream, so the 14% gap to vLLM is kernel GPU-time, dominated by
|
||||
the bandwidth-bound `gated_delta_net` recurrence (51.6%), and the only gap-closing levers
|
||||
are byte-reduction inside that kernel - NOT launch-bubble removal.**
|
||||
|
||||
### 1. The proven critical-path decomposition of the decode step
|
||||
|
||||
Decisive node-level trace (`nsys --cuda-graph-trace=node`, clean Lever-1 build df1cc97,
|
||||
q36-27b-nvfp4 dense, npl128, GB10/48SM/sm_121, commit a7238525, nsysgap.sqlite). One
|
||||
steady step = single replayed CUDA graph (graphId=11, 23 replays), all 2965 kernels on
|
||||
ONE stream (stream 14, strictly serial -> every inter-kernel gap is pure idle). Window
|
||||
383.48 ms.
|
||||
|
||||
BUBBLE CLASSIFICATION (the "where is the ~60% idle" answer - it is NOT idle):
|
||||
|
||||
| bucket | ms/step | % step | note |
|---|---|---|---|
| (a) inter-kernel launch bubbles | ~0 | ~0 | graph replay collapses host launch latency |
| (b) serial-dependency stalls (GDN chain) | included in 0.225 | 0.06 | each kernel starts < 1 us after prev; zero gaps > 5 us, max 2.40 us |
| (a)+(b) total exposed idle (LAG sum) | **0.225** | **0.06%** | 1700 kernels back-to-back |
| (d) between-step HOST gap (cgraph rebuild, new uid) | ~0.2 | ~0.05 | the ONLY graph-non-covered idle; ~0.4% in older eager-tail traces |
| (c) within-kernel GPU-busy | **380.4** | **99.94%** | this is the whole step |
|
||||
The nvidia-smi "40%" is within-kernel SM/bandwidth efficiency (~12-16% achieved
|
||||
occupancy on memory-latency-bound kernels), NOT wall-clock idle.
|
||||
|
||||
KERNEL GPU-TIME DECOMPOSITION of the 380.4 ms busy step (this is where the gap lives):
|
||||
|
||||
| kernel | ms | % step | regime |
|---|---|---|---|
| `gated_delta_net_cuda<128>` (48x, 4.08 ms/call) | **196.37** | **51.6** | bandwidth-bound f32 recurrent-state R+W (~384 MB R + 384 MB W/layer) |
| `mul_mat_q` FP4 GEMM (496x) | 92.90 | 24.4 | memory-bound weight stream, 136-CTA tail-bound at decode |
| `quantize_mmq_nvfp4` (496x) | 17.13 | 4.5 | mandatory act-quant (Lever-2 only relocated it) |
| `nvjet` lm_head GEMM | 11.91 | 3.1 | |
| `flash_attn_ext_f16` (16 attn layers) | 11.67 | 3.1 | |
| `concat_cont` (conv-state splice) | 8.01 | 2.1 | Lever-1 target |
| `cpy_scalar` (conv-state writeback + dup) | 7.62 | 2.0 | Lever-1 target (the conv-state share) |
| `k_get_rows_float` | 7.08 | 1.9 | |
| `k_bin_bcast` (gate mul + add) | 6.59 | 1.7 | Lever-3 gate-fold target (partial - rest is residual adds) |
| `ssm_conv_f32` | 5.64 | 1.5 | folds into Lever-1 |
| `unary_gated` (silu/sigmoid) | 5.36 | 1.4 | mostly FFN + output-gate (Lever 3 does NOT touch) |
| `mul_mat_q_stream_k_fixup` | 3.94 | 1.0 | |
| `rms_norm_f32` | 3.52 | 0.9 | |
| `l2_norm_f32` | 0.64 | 0.2 | Lever-3 gate-fold target |
| `gdn_gather_nonident` | 0.061 | 0.016 | negligible (early-returns on identity ids) |
|
||||
GDN region (recurrence + conv + concat + cpy + gather + l2norm) >= 210 ms = 55%+ of the step.
|
||||
The widely-cited "gated_delta_net 13%, 1.47 ms/call near-vLLM" from nsysab_new.kern.txt was
|
||||
PREFILL + the single eager capture step contaminating the average over 1248 calls (range
|
||||
0.046-4.42 ms); true steady decode is 4.08 ms/call, 2.8x higher, 51.6% of the step.
|
||||
|
||||
### 2. Claims A / B / C: which HOLD, which are REFUTED, and the residual uncertainty
|
||||
|
||||
**CLAIM A** ("the ~60% decode GPU-idle is inter-op launch bubbles ON the serial GDN
|
||||
chain"): **REFUTED.** Measured idle = 0.225 ms = 0.06%, not the ~53-57 ms the claim
|
||||
requires (two-plus orders of magnitude short). Zero gaps > 5 us; CUDA-graph replay
|
||||
already collapsed launch latency; serial data-dependency does NOT equal idle when the
|
||||
graph dispatches nodes back-to-back. The "40%" was a misread of within-kernel SM
|
||||
efficiency; the "555 ms busy-sum > 384 ms wall implies overlap" was a prefill-contaminated
|
||||
`--trace=cuda` artifact (each step recorded as one opaque ~380 ms block).
|
||||
|
||||
**CLAIM B** ("Lever 3 - gate fusion - moves the wall, unlike P2a/Lever-2, by removing
|
||||
serial launch bubbles"): **REFUTED on mechanism.** (i) There are no bubbles to remove
|
||||
(0.06%). (ii) The contrast is fictional: the step is single-stream with ZERO overlap
|
||||
anywhere, so P2a/Lever-2 were NOT flat because they "optimized overlapped work" - P2a
|
||||
tuned the prefill large-M GEMM (decode GEMMs are a different 136-CTA tail regime) and
|
||||
Lever-2 merely relocated mandatory quantize work into the GEMM prologue (net zero).
|
||||
(iii) Where the claim is trivially true (any kernel removal cuts wall in a 99.94%-busy
|
||||
single-stream step), the slice Lever 3 actually fuses ceilings at **12.76 ms = 3.35%**
|
||||
(k_bin_bcast 6.59 + silu/sigmoid 5.36 + l2_norm 0.64 + softplus 0.13 - and even that
|
||||
over-counts, since silu is mostly untouched FFN/output-gate). So the wall DOES move, but
|
||||
only ~3% (380 -> ~367 ms, 86% -> ~89% of vLLM), and NOT for the claimed reason. Lever 3
|
||||
is a component, not the gap-closer.
|
||||
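The Lever-3 ceiling arithmetic in Claim B can be replayed in a few lines. This is a sketch that only re-uses figures quoted in this section (380.4 ms busy step, 12.76 ms fusable slice, vLLM at 327 ms); the variable names are illustrative and it predicts nothing beyond those inputs.

```python
# Figures quoted in this section (not re-measured here).
BUSY_STEP_MS = 380.4     # within-kernel GPU-busy time of one decode step
VLLM_STEP_MS = 327.0     # vLLM step time on the same silicon
LEVER3_SLICE_MS = 12.76  # ceiling of what the gate-chain fold can remove

# In a single-stream, ~100%-busy step, removed kernel time is removed wall time.
step_after_ms = BUSY_STEP_MS - LEVER3_SLICE_MS

share_of_step_pct = 100.0 * LEVER3_SLICE_MS / BUSY_STEP_MS
parity_before_pct = 100.0 * VLLM_STEP_MS / BUSY_STEP_MS
parity_after_pct = 100.0 * VLLM_STEP_MS / step_after_ms

print(f"Lever 3 slice: {share_of_step_pct:.2f}% of the step")
print(f"step: {BUSY_STEP_MS:.1f} -> {step_after_ms:.1f} ms")
print(f"parity vs vLLM: {parity_before_pct:.0f}% -> {parity_after_pct:.0f}%")
```

Even with the whole slice removed, the step stays well above 327 ms, which is why the fold is treated as a component rather than the gap-closer.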
|
||||
**CLAIM C** ("the residual gap is software-closable LATENCY, not a GB10 hardware floor"):
|
||||
**REFUTED as worded** (no latency, no idle to close - same data as A). The "not a hardware
|
||||
floor" half is **UNSETTLED, not proven.** vLLM hits 327 ms on the same silicon, so it is
|
||||
not an absolute hard floor - but whether the dominant 51.6% `gated_delta_net` term is
|
||||
software-closable in BIT-EXACT form turns on one unmeasured quantity (below).
|
||||
|
||||
RESIDUAL UNCERTAINTY (the single open question that decides everything):
|
||||
- **The DRAM byte-traffic ratio of llama's recurrence vs vLLM's.** Every section above
|
||||
ESTIMATED the GDN state bytes (~190 GB/s effective, ~70% of 273 GB/s peak); none MEASURED
|
||||
it. If llama's `gated_delta_net_cuda<128>` moves ~2x the minimal (s0-read + s1-write)
|
||||
bytes because the un-fused gate/l2norm/writeback/gather ops re-stream state through DRAM,
|
||||
then the 51.6% is software-closable by a single-pass fused recurrence (Claim C spirit
|
||||
HOLDS). If llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same,
|
||||
the recurrence is at the GB10 LPDDR5x floor for this state size -> the gap is a
|
||||
hardware/architecture floor and is NOT closable in bit-exact form (Claim C REFUTED on
|
||||
both halves). This is the one measurement that converts the verdict from "refuted as
|
||||
worded" to a definitive yes/no.
|
||||
- **The MoE model (qwen35moe) is untested.** At B=128 MUL_MAT_ID can trip
|
||||
[TAG_MUL_MAT_ID_CUDA_GRAPHS] (`ne[2] > mmvq_mmid_max`) and disable the WHOLE MoE-decode
|
||||
graph into eager, where the ~3100 per-step launches re-dispatch serially on the Grace
|
||||
cores and inter-op bubbles WOULD reappear. For MoE only, Claim A could partially hold.
|
||||
The dense 335 tok/s headline is fully settled.
|
||||
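The ESTIMATED state-byte figure behind the first bullet can be reconstructed as below. This is a back-of-envelope sketch, not the missing `ncu` measurement: H=48 and S_v=128 are taken from the GDN launch grid quoted elsewhere in these notes, while the square S_k=128 state shape is an assumption made here for illustration.

```python
# Assumed recurrent-state shape per layer (S_K = 128 is an assumption).
H = 48            # value heads
S_V = 128         # value head dim
S_K = 128         # key head dim (assumed equal to S_V)
N_SEQS = 128      # B=128 / npl128
BYTES_F32 = 4

PEAK_GBPS = 273.0   # GB10 peak bandwidth quoted above
MS_PER_CALL = 4.08  # steady-decode gated_delta_net time per call


def state_bytes(h=H, s_v=S_V, s_k=S_K, n_seqs=N_SEQS, elem=BYTES_F32):
    """Bytes of f32 recurrent state for one layer across all sequences."""
    return h * s_v * s_k * n_seqs * elem


def effective_gbps(bytes_moved, ms):
    """Effective bandwidth if exactly bytes_moved were streamed in ms."""
    return bytes_moved / (ms * 1e-3) / 1e9


one_way = state_bytes()          # s0 read OR s1 write
minimal = 2 * one_way            # minimal traffic: read once + write once
gbps = effective_gbps(minimal, MS_PER_CALL)
frac_of_peak = gbps / PEAK_GBPS

print(f"state per layer: {one_way / 2**20:.0f} MiB each way")
print(f"effective: {gbps:.0f} GB/s = {100 * frac_of_peak:.0f}% of peak")
```

Under these assumptions the minimal-traffic estimate sits near the ~190 GB/s / ~70% figure, which is exactly why only a measured `dram__bytes.sum` can tell whether the kernel moves more than the minimum.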
|
||||
### 3. Ranked implementation plan for the remaining ~14% (57 ms/step, 384 -> 327)
|
||||
|
||||
Every win must come from kernel GPU-time (bytes), because bubbles = 0 and both engines
|
||||
share identical bandwidth/compute floors. Ranked by expected recovery.
|
||||
|
||||
| # | Lever | ms/step recovered | -> % of vLLM | bit-exact | tractability | gate |
|---|---|---|---|---|---|---|
| **1** | **Single-pass fused GDN recurrence** (fold l2norm+gate+decay+recurrence+state-writeback+gather into ONE pass over state, mirroring vLLM `fused_recurrent_gated_delta_rule_packed_decode`) - cuts state HBM round-trips | **0 to ~40** (= the byte-delta; UNKNOWN until ncu) | 86% -> up to ~98% | near (l2norm reduction; KL < ~1e-3) | HIGH (kernel rewrite) | **ncu byte-ratio test FIRST** |
| 2 | **Conv-state concat -> ssm_conv fusion** (Lever 1): pass conv-state + new token as separate srcs, update conv state in place (vLLM `causal_conv1d_update`); removes concat_cont + the conv-state cpy | **~8-12** (concat 8.01 + cpy share of 7.62) | +2-3% | YES | MEDIUM | no-regret, build regardless |
| 3 | **Gate-chain fold** (Lever 3 as designed): sigmoid-beta + softplus+dt+ssm_a gate + q/k l2norm into the recurrence kernel | **~12.76 ceiling** (3.35%) - but SUBSUMED by #1 | +3% | near (l2norm) | MEDIUM | build as a COMPONENT of #1, not standalone |
| 4 | **bf16 recurrent + conv state** (Lever 5): halve the 196 ms recurrence + conv traffic; keep f32 in-register accumulation | **~70-90** (if floor-bound) | could reach/exceed parity | NO (parity-tolerance decision; must match vLLM stored dtype) | HIGH (rewrite + parity validation) | the ONLY lever that moves the floor kernel; separate precision track |
| 5 | gdn_gather skip-launch at steady decode | ~0.06 | ~0 | YES | trivial | not worth it (micro) |
| 6 | GDN occupancy split | 0 | 0 | - | - | NOT a lever: 196608 CTAs / 4096 waves, already saturated, bandwidth-bound |
| 7 | quantize_mmq attack (Lever 2) | 0 | 0 | - | - | SPENT - relocated mandatory work, proven flat |
| 8 | decode CUDA-graph capture | 0 | 0 | - | - | SPENT - ALREADY in effect (graphId=11), did not close gap |
| 9 | persistent cgraph (uid fast-path) | ~0.2 (0.05-0.4%) | ~0 | YES | MEDIUM | second-order to the SSM floor |
|
||||
Levers 1, 3, and the gather of #5 are the SAME kernel rewrite: build them together as a
|
||||
single-pass recurrence. Levers 6/7/8 are dead (at-floor or already-shipped). Lever 4 is a
|
||||
distinct, bit-exactness-breaking precision track.
|
||||
|
||||
### 4. The honest verdict and the single highest-value next step
|
||||
|
||||
**Is true (bit-exact) decode parity reachable?** UNCERTAIN, and it hinges entirely on the
|
||||
unmeasured byte ratio:
|
||||
- If llama's recurrence re-streams state (~2x bytes from un-fused ops): YES - a single-pass
|
||||
fused recurrence (Lever 1) plus conv fusion (Lever 2) plausibly recover ~20-40 ms, taking
|
||||
llama to ~345-365 ms = ~90-95% of vLLM, near-bit-exact (gate on KL tolerance).
|
||||
- If llama is already at the GB10 bandwidth floor for f32 state: NO in bit-exact form - the
|
||||
57 ms is a hardware floor, and only bf16 state (Lever 4, non-bit-exact) closes it.
|
||||
|
||||
Either way, the gating-fold-alone path tops out at ~89% of vLLM, so the project should NOT
|
||||
ship the isolated gate fold as "the parity lever."
|
||||
|
||||
**SINGLE highest-value next IMPLEMENTATION step:** build the **single-pass fused GDN
|
||||
recurrence kernel** (Lever 1 = fold gate + l2norm + state-writeback + gather into one pass
|
||||
over the recurrent state) - BUT gate the build on one cheap measurement first, because it
|
||||
is a HIGH-effort kernel rewrite that is worthless if the recurrence is already byte-minimal.
|
||||
|
||||
**The measurement that confirms it before over-investing (one short GPU run, gap-analysis
|
||||
agent only):** `ncu` on `gated_delta_net_cuda<128>` at B=128 vs vLLM's
|
||||
`fused_recurrent_gated_delta_rule_packed_decode_kernel` for identical layer dims, two
|
||||
counters:
|
||||
- `dram__bytes.sum` (actual DRAM bytes/call)
|
||||
- `dram__throughput.avg.pct_of_peak_sustained_elapsed` (achieved % of 273 GB/s)
|
||||
|
||||
Decision rule:
|
||||
- llama moves ~2x minimal bytes OR vLLM moves materially fewer for the same math -> redundant
|
||||
un-fused state round-trips -> BUILD the single-pass fused recurrence; predicted recovery
|
||||
scales with the byte delta (up to ~40 ms). This is the gap-closer.
|
||||
- llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same -> the
|
||||
recurrence is at the GB10 hardware floor -> do NOT build the fusion for throughput (only
|
||||
the ~3% gate-fold ceiling remains); the sole remaining lever is bf16 state (Lever 4,
|
||||
accept non-bit-exact), and bit-exact parity is NOT reachable.
|
||||
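The decision rule above could be encoded as a small helper once the two `ncu` counters are in hand. This is a sketch only: the function name and the three thresholds (`redundant_ratio`, `materially_fewer`, `floor_pct`) are hypothetical stand-ins for "~2x minimal", "materially fewer" and "> 85% of peak", and the inputs are whatever `dram__bytes.sum` and the throughput counter report per call.

```python
def recurrence_verdict(llama_bytes, vllm_bytes, minimal_bytes, llama_pct_peak,
                       redundant_ratio=1.5, materially_fewer=0.8, floor_pct=85.0):
    """Classify the recurrence kernel from measured DRAM bytes per call.

    All thresholds are illustrative assumptions, not values from the study.
    """
    # Branch 1: llama re-streams state, or vLLM does the same math in fewer bytes.
    restreams = llama_bytes >= redundant_ratio * minimal_bytes
    vllm_moves_fewer = vllm_bytes <= materially_fewer * llama_bytes
    if restreams or vllm_moves_fewer:
        return "build-fused-recurrence"
    # Branch 2: near-minimal bytes at high achieved bandwidth -> hardware floor.
    if llama_pct_peak > floor_pct:
        return "hardware-floor"
    # Neither branch fires cleanly: the measurement does not decide.
    return "inconclusive"


if __name__ == "__main__":
    minimal = 2 * 384 * 2**20  # read + write of the estimated f32 state
    print(recurrence_verdict(2 * minimal, minimal, minimal, 70.0))
    print(recurrence_verdict(minimal, minimal, minimal, 90.0))
```

The third outcome is deliberate: near-minimal bytes at low achieved bandwidth matches neither branch of the rule and would need a closer look at the kernel itself.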
|
||||
**No-regret parallel work** (build regardless of the ncu outcome, bit-exact, medium effort):
|
||||
the conv-state concat -> ssm_conv in-place fusion (Lever 2, ~8-12 ms = +2-3% toward parity),
|
||||
which removes concat_cont (8.01 ms) and the conv-state writeback cpy off a bandwidth-bound,
|
||||
single-stream step where their full GPU-time is wall-clock.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,185 +0,0 @@
|
||||
# llama-server vs vLLM: decode-step gap decomposition (DGX Spark, GB10 / sm_121)
|
||||
|
||||
Profiling study (no engine changes). Question: matched apples-to-apples (both
|
||||
batched servers, NVFP4-class weights, prefix caching on, both eager), why is
|
||||
`llama-server` ~4-6x slower **per decode step** than vLLM on Qwen3-32B at a
|
||||
1024-token shared-prefix / batch-32 fan-out, and what is closable vs structural.
|
||||
|
||||
Hardware: NVIDIA GB10 (sm_121), unified LPDDR5X. Model: Qwen3-32B, 64 layers.
|
||||
llama side: `~/llama-paged-dev/build-cuda/bin/llama-server`, `q3-32b-nvfp4-dense.gguf`
|
||||
(NVFP4 weights, type-40 FP4-MMA path), `-ngl 99 --parallel 32 -c 40960 -fa on`,
|
||||
`GGML_CUDA_DISABLE_GRAPHS=1` (eager). vLLM 0.23.0 NVFP4A16 (W4A16/Marlin),
|
||||
`--enforce-eager`. Workload: 1024-token shared prefix + unique 32-token suffix,
|
||||
K=32 concurrent, generate 64. All profiling scripts are dev-tree only
|
||||
(`~/bench/decode_study/`); minimal in-code timers were not needed (server already
|
||||
reports per-slot `eval time`, which excludes prompt-eval = pure decode).
|
||||
|
||||
## TL;DR
|
||||
|
||||
1. **The real-server decode is GPU-BOUND, not host-bound.** During steady decode
   the GPU is **~94.6% utilized** (nvidia-smi, real run) / 85-95% busy (nsys).
   Per-slot CPU sampling, detokenize, and `update_slots` are fully hidden: a 5-stage
   sampler chain gives the *identical* step time as greedy (1346 vs 1343 ms). The
   "GPU stalls on the CPU serving loop" hypothesis is **refuted** for this workload.
2. **At 1024 context the decode step is ~84% KV/attention, ~16% weight GEMM** - the
   opposite of the thin-batch-GEMM story. Attention scaling with context length, not
   the matmul, is the load-bearing cost.
3. **The worktree's paged KV engine is a decode REGRESSION: ~1.85x slower than
   stock** at 1024 ctx (paged 1279-1343 ms/step vs stock 650-729 ms/step). It
   gathers K/V/mask into a contiguous buffer (`ggml_get_rows`) every layer every
   step, then runs a dense FA kernel - paying a full extra KV read+copy that vLLM's
   in-kernel PagedAttention never pays. Paging helps prefix-prefill memory; it hurts
   decode latency.
4. Even **stock** llama-server (~650-729 ms/step) is **~4-5x slower than vLLM**
   (~120-185 ms/step). The residual gap is the **long-context decode-attention
   kernel** and, secondarily, the **thin-batch FP4 weight GEMM** - both kernel-maturity
   gaps vs vLLM's FlashInfer/FA paged-decode + Marlin, not serving-loop gaps.
|
||||
## The measured numbers (batch 32, server-reported pure-decode step time)
|
||||
|
||||
`server_decode_step_ms` = max / mean-of-top-8 of per-slot `eval time ms-per-token`
|
||||
(the most-contended, full-batch-32 slots; excludes prompt eval).
|
||||
|
||||
| config | decode step ms (max / top8) | client wall ms/step |
|---|---|---|
| paged, ctx 1024, greedy | 1343 / 1279 | 1468 |
| paged, ctx 1024, **heavy 5-sampler** | 1346 / 1280 | 1470 |
| **stock** (no paging), ctx 1024, greedy | **729 / 650** | 768 |
| paged, **ctx 64** (short), greedy | **215 / 215** | 253 |
| vLLM NVFP4A16, ctx 1024 (K=32) | **~120-185** (270 tok/s) | - |
|
||||
The brief's reference ~828 ms/step sits between the stock (650-729) and paged
|
||||
(1279-1343) numbers measured here; the decomposition below is what is robust. Our
|
||||
fan-out shares no prefix across the 32 slots (each slot independently prefills 1056
|
||||
tokens - confirmed in the log), so the 32 sequences are genuinely concurrent and the
|
||||
"max" slot is maximally contended, which is why our paged max runs a little above 828.
|
||||
|
||||
### Context sweep - decode step is attention-scaling, not fixed overhead
|
||||
|
||||
Pure-decode step vs shared-prefix length (paged, batch 32):
|
||||
|
||||
| prefix ctx | decode step ms |
|---|---|
| 64 | 215 |
| 128 | ~290 |
| 256 | ~410 |
| 512 | ~660 |
| 1024 | ~1280 |
|
||||
Roughly linear in context length: ~1 ms of added step time per added context token.
|
||||
The **215 ms at ctx 64 is the fixed floor** (weight GEMM + activations + norm/rope +
|
||||
loop + sampling, attention negligible). Everything above it scales with KV length =
|
||||
attention + KV plumbing. At 1024 ctx the fixed floor is only ~16% of the step.
|
||||
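The "roughly linear, ~1 ms per context token" reading can be checked with an ordinary least-squares fit over the five sweep points. This is a minimal stdlib sketch; the data are the (approximate) table values above and the helper name is illustrative.

```python
# (prefix ctx, decode step ms) from the context sweep table; values marked "~" are approximate.
SWEEP = [(64, 215.0), (128, 290.0), (256, 410.0), (512, 660.0), (1024, 1280.0)]


def fit_line(points):
    """Ordinary least squares y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope_ms_per_token, intercept_ms = fit_line(SWEEP)

# Share of the ctx-1024 step that is the ctx-64 "fixed floor".
floor_share_pct = 100.0 * SWEEP[0][1] / SWEEP[-1][1]

print(f"slope: {slope_ms_per_token:.2f} ms per context token")
print(f"fixed-floor share at ctx 1024: {floor_share_pct:.1f}%")
```

The fitted intercept comes out below the 215 ms measured at ctx 64, since ctx 64 already includes some attention cost; the document's "fixed floor" is the measured ctx-64 step, not the extrapolated zero-context value.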
|
||||
## Where the ~1280 ms paged decode step goes (nsys, pure-decode window)
|
||||
|
||||
`nsys profile --delay=70 --duration=25 --trace=cuda` windowed onto steady 32-way
|
||||
decode (`srv_decode2.nsys-rep`; an earlier 25-60s window was discarded because nsys's
|
||||
own slowdown stretched the 32 prefills into it, inflating GEMM to a misleading 58%).
|
||||
GPU busy in-window 85.5% (nsys adds gaps; the real run is ~94.6% by nvidia-smi).
|
||||
|
||||
| bucket | % GPU time | abs (of ~1280 ms) | what it is |
|---|---:|---:|---|
| `flash_attn_ext_f16` ATTENTION | **47.7%** | ~610 ms | decode attention over the 1056-cell KV |
| `cpy_scalar` KV copy/cast | 18.3% | ~234 ms | KV write + f32->f16 casts |
| `get_rows/set_rows` KV gather | 17.8% | ~228 ms | **paged** gather of K/V/mask to contiguous |
| `mul_mat_q` + `quantize_mmq` | 15.7% | ~201 ms | NVFP4 weight GEMM (+ activation requant) |
| rmsnorm / silu / rope / add | ~0.6% | ~8 ms | elementwise |
|
||||
Cross-check: the GEMM bucket (~201 ms) matches the ctx-64 floor (215 ms) - i.e. the
|
||||
weight matmul is ~the entire short-context step, and is context-independent, as
|
||||
expected. KV/attention buckets (47.7+18.3+17.8 = **83.8%**) match the context-sweep
|
||||
finding that ~84% of the step scales with context.
|
||||
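The cross-check is plain arithmetic on the bucket table. The sketch below (bucket keys are shorthand chosen here, percentages are the table values) converts each share into milliseconds of the ~1280 ms step and sums the context-scaling buckets.

```python
STEP_MS = 1280.0  # paged decode step at ctx 1024

# % of GPU time per bucket, from the nsys table above.
BUCKET_PCT = {
    "attention": 47.7,
    "kv_copy_cast": 18.3,
    "kv_gather": 17.8,
    "gemm_plus_quant": 15.7,
    "elementwise": 0.6,
}

# Buckets whose cost grows with KV length.
KV_SCALING = ("attention", "kv_copy_cast", "kv_gather")

bucket_ms = {name: STEP_MS * pct / 100.0 for name, pct in BUCKET_PCT.items()}
kv_share_pct = round(sum(BUCKET_PCT[name] for name in KV_SCALING), 1)

for name, ms in bucket_ms.items():
    print(f"{name:16s} {ms:6.0f} ms")
print(f"KV/attention share: {kv_share_pct}%")
```

The GEMM bucket lands at roughly 201 ms, in line with the 215 ms ctx-64 floor, and the three KV-scaling buckets sum to the 83.8% quoted above.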
|
||||
Power signature: ~33-36 W at 94% "utilization" (GB10 can pull far more). High util%
|
||||
+ low power = the kernels are **memory/latency-bound, not compute-saturated** - the
|
||||
classic decode signature (stream 19 GB of NVFP4 weights + a growing KV every step).
|
||||
|
||||
### Stock vs paged decomposition
|
||||
|
||||
- **Stock** (~650 ms): ~215 ms GEMM floor + ~435 ms attention/KV (contiguous KV read
|
||||
directly by the FA kernel, **no gather**).
|
||||
- **Paged** (~1280 ms): same ~215 ms floor + ~610 ms attention + **~455 ms paged
|
||||
gather/copy overhead** (the `get_rows` of K/V/mask plus the extra KV copy that
|
||||
feeds the dense FA kernel). That ~455 ms (~36% of the step) is the paged engine's
|
||||
self-inflicted cost and is the entire ~1.85x stock->paged regression.
|
||||
|
||||
## vLLM decode architecture mapped onto each llama bucket
|
||||
|
||||
vLLM at ~120-185 ms/step is faster on **every** bucket:
|
||||
|
||||
| llama bucket (paged) | ms | vLLM equivalent | does vLLM avoid it? |
|---|---|---|---|
| paged KV gather (get_rows) | ~228 | PagedAttention reads blocks **in-kernel** via a block table | **Yes - entirely.** No gather op exists. |
| KV copy/cast | ~234 | KV written once into block pool; FA reads it in place | Mostly - no per-step recopy |
| decode attention | ~610 | FlashInfer / FA paged-decode GQA kernel, split over KV | Same op, far faster kernel on sm_121 |
| weight GEMM + act quant | ~201 | fused Marlin/Machete W4A16 dequant+MMA, no separate quant pass | Faster + removes the requant kernel |
| CPU sampling / loop | ~0 (hidden) | on-GPU batched sampling | N/A here - already hidden on llama side too |
|
||||
vLLM's whole-step (~150 ms) is **less than llama's GEMM floor alone (~215 ms)**, so
|
||||
vLLM is ahead on the matmul *and* the attention *and* avoids the gather. The gap is a
|
||||
stack of kernel-efficiency wins, not one silver bullet.
|
||||
|
||||
## Ranked levers - closable vs structural
|
||||
|
||||
1. **Remove the paged gather regression. [Tractable, ~455 ms / ~36% on the paged
|
||||
path; net-zero risk - it is a regression]** The worktree's paged engine makes
|
||||
decode 1.85x slower than stock by gathering K/V/mask to contiguous every layer
|
||||
every step (patch 0003 `ggml_get_rows`). For latency-bound decode, **do not enable
|
||||
paged KV** - it only ever helps prefix-prefill *memory*, never decode latency.
|
||||
Fully recovering this *and* keeping paging requires reading paged blocks
|
||||
in-kernel like vLLM (a from-scratch paged-attention CUDA kernel) - see lever 2.
|
||||
|
||||
2. **Long-context decode-attention kernel. [Biggest real lever, ~435 ms of stock /
|
||||
~610 ms of paged; partly structural]** Even stock is attention-bound at 1024 ctx.
|
||||
llama.cpp's `flash_attn_ext_f16` decode path is ~4-5x slower than vLLM's
|
||||
FlashInfer/FA paged-decode GQA kernel on this Blackwell-class part. This is the
|
||||
cost that *grows with context* - exactly the regime the brief targets. Tractable in
|
||||
principle (a proper flash-decoding / split-K-over-KV kernel, and a true in-kernel
|
||||
paged read that also kills lever 1's gather), but it is deep CUDA work on a new
|
||||
arch and partly gated by kernel maturity on sm_121. **Highest-impact, hardest.**
|
||||
|
||||
3. **Thin-batch FP4 weight GEMM floor. [Tractable, ~201-215 ms / 15-30%; bounded]**
|
||||
The NVFP4 `mul_mat_q` + separate `quantize_mmq` activation pass is memory-bound and
|
||||
less efficient than vLLM's fused Marlin/Machete W4A16. Fusing dequant into the MMA
|
||||
and folding the activation quant into the GEMM is tractable kernel work. Bounded
|
||||
impact: the floor cannot drop below the weight-read bound (~19 GB / LPDDR5X bandwidth per step).
|
||||
|
||||
4. **Host serving loop / per-slot sampling. [NOT a lever]** Measured zero: greedy ==
|
||||
heavy-sampler step time; GPU 94.6% busy. On-GPU/batched sampling buys nothing until
|
||||
the kernels (levers 1-3) get fast enough to expose host overhead. Refutes the
|
||||
"host-bound serving loop" hypothesis for this decode-bound workload.
|
||||
|
||||
5. **Continuous-batch scheduler. [NOT the gap / structural elsewhere]** llama-server
|
||||
already fuses all 32 slots into one decode step (one set of kernels per step over
|
||||
batch 32 - confirmed in the trace). vLLM's continuous/chunked-prefill batching wins
|
||||
on *mixed* prefill+decode overlap, but the steady decode-step gap measured here is
|
||||
kernel-bound, not scheduler-bound.
|
||||
|
||||
## Honest bottom line
|
||||
|
||||
The ~4-6x per-step gap is **GPU-kernel-bound**, and it decomposes as:
|
||||
|
||||
- ~36% of the *paged* step is a **self-inflicted gather regression** - remove it
|
||||
(don't run paged for decode-latency workloads).
|
||||
- The remaining ~4-5x vs vLLM (true even for stock) is **kernel efficiency**:
|
||||
llama.cpp's long-context decode-attention and thin-batch FP4 GEMM are slower than
|
||||
vLLM's PagedAttention + Marlin on GB10. That is a **kernel project** (in-kernel
|
||||
paged attention + flash-decoding + fused W4A16 GEMM), not a serving-loop project.
|
||||
- Sampling, detokenize, `update_slots`, and the continuous-batch scheduler are **not**
|
||||
the gap; the GPU is ~95% busy on memory-bound kernels the whole step.
|
||||
|
||||
What is closable: lever 1 (immediately, by not paging), lever 3 (bounded, with kernel
|
||||
work). What is structural / hard: lever 2 (the decode-attention kernel + a real
|
||||
in-kernel paged read), which is where the context-scaling gap actually lives and where
|
||||
any serious effort to approach vLLM on GB10 must go.
|
||||
|
||||
## Reproduction (dev-tree only, `~/bench/decode_study/`)
|
||||
|
||||
- `launch_srv.sh` / `runcfg.sh` - launch llama-server (paged on/off) and a config.
|
||||
- `client.py` - K=32 token-id fan-out (1024 prefix + 32 suffix), `SAMP=greedy|heavy`.
|
||||
- `d2drv.sh` - nsys pure-decode window (delay 70s past prefill) -> `srv_decode2.nsys-rep`.
|
||||
- `cat2.py` - kernel-time categorization from the sqlite export.
|
||||
- vLLM side: `~/bench/run_vllm.sh` + `vllm_prefix.py` (K=32, ~270 tok/s).
|
||||
@@ -1,756 +0,0 @@
|
||||
# Decode parity exploration (post-SSM-fix) - per-agent findings
|
||||
|
||||
Post-SSM-fix decode (patches 0018 in-place state write-back + 0019 fused gather):
|
||||
dense q36-27b-nvfp4 decode_agg = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact.
|
||||
The remaining +54% to parity is the question each section below probes. All numbers
|
||||
DGX GB10 (sm_121), fusion OFF baseline, `decode_agg` = `S_TG t/s`.
|
||||
|
||||
---
|
||||
|
||||
## Section: per-token-latency (critical path / host-loop) - READ-ONLY
|
||||
|
||||
**Verdict: the per-step critical path and host loop are NOT the residual lever.
Post-SSM the GPU is still ~99% busy at npl128; the entire exposed-idle budget is
~0.65% of the step (~2.4 ms), of which graphs already remove the within-step half
(0.34%) and the between-step host gap is ~2 ms/step (~0.4% post-SSM). The 64-layer
sequential chain does NOT under-fill the GPU at batch 128 - every kernel's grid
saturates the SMs on its own. The +54% to parity is GPU kernel work (FP4 GEMM
efficiency + LPDDR5x weight bandwidth), not serialization or host overhead.**
|
||||
### 1. Measured exposed-idle structure (a2_nsys pre-SSM rep, read-only sqlite sweep)
|
||||
|
||||
`paged_off_npl128.sqlite`, steady window 40-97% of trace (14.78 s, ~16.5 decode
|
||||
steps at the pre-SSM ~896 ms/step). Overlap-correct interval-union sweep:
|
||||
|
||||
| activity set | busy % | exposed idle |
|---|---|---|
| kernels only | 80.25% | 19.74% |
| kernels + memcpy (all) | 99.35% | **0.65%** |
|
||||
- The 19.4% kernels-only gap = **841 big gaps (median 3.35 ms, ~51/step)** that are
|
||||
filled by D2D memcpy. These ARE the per-layer gated-DeltaNet recurrent-state copies
|
||||
(the `gated_delta_net -> ggml_cpy(state->cache) -> next layer reads state` chain).
|
||||
They were a real critical-path serialization, and **patches 0018/0019 removed exactly
|
||||
these** (D2D bucket 18.9% -> 0.23%; get_rows gather 18.8% -> 0.7%). Decode rose
|
||||
+37.8% (186 -> 256 t/s), ~matching the work removed -> the kernels reflowed
|
||||
back-to-back, so post-SSM these big gaps are CLOSED, not re-exposed (inferred from
|
||||
the throughput scaling; the post-SSM nsys was not re-profiled by this read-only agent).
|
||||
- The TRUE exposed idle (kernel+memcpy union) is **0.65%**: 18 host gaps >=0.5 ms,
|
||||
**median 2.06 ms, max 2.85 ms, ~1.1/step**. This is the single between-step host gap
|
||||
(sample-128 + `update_slots` + next-batch build) that does NOT overlap GPU compute.
|
||||
- Within-step launch gaps: 24,190 micro-gaps, median 2.14 us, summing to 50.6 ms =
|
||||
**0.34%** of the window - the pure launch overhead that CUDA graphs collapse
|
||||
(measured 0.37% -> 0.11% in A2_CUDAGRAPH_DECODE; graphs already engage on the
|
||||
default paged decode with a 256-token reset cadence).
|
||||
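The "overlap-correct interval-union sweep" used for the busy/idle table can be sketched as follows. This is an illustration of the technique, not the script that produced the numbers: the function names are hypothetical, and the toy intervals stand in for kernel and memcpy (start, end) timestamps taken from the sqlite export.

```python
def union_length(intervals):
    """Total length covered by possibly-overlapping (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the running interval: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the running interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def busy_fraction(intervals, window_start, window_end):
    """Fraction of the window covered by at least one activity."""
    clipped = [(max(s, window_start), min(e, window_end))
               for s, e in intervals
               if e > window_start and s < window_end]
    return union_length(clipped) / (window_end - window_start)


# Toy data: the memcpy fills the gap between two kernels.
kernel_iv = [(0, 40), (60, 100)]
memcpy_iv = [(35, 62)]

kernels_only = busy_fraction(kernel_iv, 0, 100)
kernels_plus_memcpy = busy_fraction(kernel_iv + memcpy_iv, 0, 100)
print(kernels_only, kernels_plus_memcpy)
```

Summing durations per activity would double-count the overlapped stretches; taking the union first is what makes the "kernels only" and "kernels + memcpy" rows comparable.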
|
||||
### 2. Post-SSM scaling of the FIXED host gap
|
||||
|
||||
The ~2 ms/step between-step host gap is FIXED work (independent of GPU kernel time).
|
||||
As decode accelerated it grew only as a fraction of a shrinking step:
|
||||
|
||||
| build | step ms @npl128 | host gap | host gap % of step |
|---|---|---|---|
| pre-SSM (146) | ~877 | ~2 ms | 0.24% |
| post-SSM (256) | ~499 | ~2 ms | **~0.40%** |
| vLLM (391) | ~328 | (n/a) | (would be ~0.6%) |
|
||||
Even fully removing it (perfect overlap) buys ~0.4%. It is a second-order floor, not
|
||||
the lever - it only becomes material once the kernels are fast enough to drop GPU-busy
|
||||
below the host time, which is not the case at 65% of parity.
|
||||
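The step times and host-gap shares in the table follow from the aggregate decode rate at 128 parallel sequences. A small sketch, assuming one token per sequence per step and a flat 2 ms host gap (helper names are illustrative; the inputs are the quoted t/s figures):

```python
N_PARALLEL = 128   # npl128: one token per sequence per decode step
HOST_GAP_MS = 2.0  # approximate fixed between-step host gap


def step_ms(tokens_per_s, n_parallel=N_PARALLEL):
    """Decode step time implied by an aggregate tokens/s figure."""
    return n_parallel / tokens_per_s * 1000.0


def host_gap_pct(tokens_per_s, gap_ms=HOST_GAP_MS):
    """Share of the step taken by a fixed host gap."""
    return 100.0 * gap_ms / step_ms(tokens_per_s)


for label, tps in (("pre-SSM", 146.0), ("post-SSM", 256.6), ("vLLM", 391.0)):
    print(f"{label:9s} step {step_ms(tps):6.1f} ms, host gap {host_gap_pct(tps):.2f}%")
```

Because the gap is fixed while the step shrinks, its share grows only from roughly a quarter of a percent to about 0.6% even at vLLM's rate, which is why it stays second-order.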
|
||||
### 3. The 64-layer chain does NOT under-fill the GPU at batch 128
|
||||
|
||||
The decode is an intrinsically sequential depth-64 chain (autoregressive: layer N
|
||||
needs layer N-1; cannot be parallelized across layers). The question is whether each
|
||||
individual kernel fills the SMs at batch 128. It does:
|
||||
|
||||
- **GDN kernel** (`gated_delta_net.cu`): launch grid `dim3(H, n_seqs, ceil(S_v/4))`
|
||||
= `48 x 128 x 32 = 196,608 blocks` (dense, H=48 value heads, S_v=128). Block
|
||||
`(warp_size, 4, 1)`. Massively oversubscribes the GB10 SMs. Each warp loads its
|
||||
state shard into registers once and runs a single `n_tokens==1` iteration - O(1) in
|
||||
context (confirmed flat across 4x ctx in GDN_DECODE_VERIFY).
|
||||
- **FP4 GEMM** (`mul_mat_q`, mmq_x=128): M=128 token tile, well into the M-batched
|
||||
regime, full SM occupancy (and Track B P2a already showed it goes 2 CTA/SM).
|
||||
- The 99.35% kernel+memcpy busy reading IS the direct proof there is no under-fill at
|
||||
npl128: if the chain under-filled, busy% would be well below 99%.
|
||||
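The GDN launch-grid arithmetic from the first bullet can be reproduced directly. A sketch under stated assumptions: the grid formula is the one quoted above, the 48-SM count is the GB10 figure cited in the synthesis section, and "waves" is computed as blocks per SM, which assumes one resident block per SM and is a simplification.

```python
import math


def gdn_grid(n_heads, n_seqs, s_v, rows_per_block=4):
    """Grid dims for dim3(H, n_seqs, ceil(S_v / 4)) as quoted above."""
    return (n_heads, n_seqs, math.ceil(s_v / rows_per_block))


def total_blocks(grid):
    return grid[0] * grid[1] * grid[2]


N_SM = 48  # GB10 SM count cited in the synthesis section

grid = gdn_grid(n_heads=48, n_seqs=128, s_v=128)
blocks = total_blocks(grid)
waves = blocks / N_SM  # simplification: one resident block per SM

print(grid, blocks, waves)
```

With thousands of blocks queued per SM, splitting the grid further cannot raise occupancy, consistent with treating the kernel as bandwidth-bound rather than under-filled.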
|
||||
Under-fill only appears at LOW batch (npl32/npl4), where it manifests as the
|
||||
weight-bandwidth/GEMV regime (npl32 = 170 t/s vs npl128 = 256): fewer tokens amortize
|
||||
the same per-step weight read, NOT idle SMs. That is a bandwidth floor, not a
|
||||
host/scheduler problem.
|
||||
|
||||
### 4. What the host actually does per step (eager rep runtime API)
|
||||
|
||||
Steady-window `CUPTI_ACTIVITY_KIND_RUNTIME` totals (host-thread wall, overlaps GPU):
|
||||
|
||||
| API | n | total | avg |
|---|---|---|---|
| cudaStreamSynchronize | 1723 | 7775 ms | 4513 us |
| cudaLaunchKernelExC | 30983 | 4045 ms | 131 us |
| cudaLaunchKernel | 20385 | 2694 ms | 132 us |
| cudaMemcpyAsync | 2085 | 96 ms | 46 us |
|
||||
~104 stream-syncs/step and ~3100 kernel launches/step in eager mode (collapsed by
|
||||
graphs to ~900 launches/step). The 7.8 s of sync is the host BLOCKING on the busy
|
||||
GPU (it overlaps GPU compute, it is NOT exposed idle) - the GPU stays 99.4% busy. The
|
||||
sampled-token path is `cudaMemcpyAsync` (96 ms total, negligible, non-blocking). The
|
||||
only NON-overlapped residue is the ~2 ms/step between-step gap in section 1.
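
As a sanity check on the table above, the `avg` column and the launches-per-sync ratio follow directly from the `n` and `total` columns. This is a small sketch using only the table's figures; the helper name `avg_us` is hypothetical.

```python
# (calls, total host time in ms) as listed in the runtime-API table.
runtime_api = {
    "cudaStreamSynchronize": (1723, 7775),
    "cudaLaunchKernelExC": (30983, 4045),
    "cudaLaunchKernel": (20385, 2694),
    "cudaMemcpyAsync": (2085, 96),
}

def avg_us(calls, total_ms):
    """Average host time per call, in microseconds."""
    return total_ms * 1000.0 / calls

averages = {name: round(avg_us(n, ms)) for name, (n, ms) in runtime_api.items()}

# Kernel launches issued per stream sync: how much work the host queues
# between the points where it blocks on the GPU.
launches = runtime_api["cudaLaunchKernelExC"][0] + runtime_api["cudaLaunchKernel"][0]
launches_per_sync = launches / runtime_api["cudaStreamSynchronize"][0]

print(averages, round(launches_per_sync, 1))
```

The ratio of roughly 30 launches per sync is consistent with the ~3100 launches/step and ~104 syncs/step quoted above.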
|
||||
|
||||
### 5. vLLM host-loop comparison (per VLLM_DECODE_GROUNDING.md)
|
||||
|
||||
vLLM's eager decode is host-cheap BY CONSTRUCTION and hides the host fully behind the
|
||||
async CUDA stream WITHOUT pipelined scheduling (`async_scheduling` was OFF; it won the
|
||||
2.4x with synchronous scheduling): persistent pre-allocated input buffers updated by
|
||||
vectorized numpy (no per-token Python), attention metadata `build()` once per step
|
||||
reused across all layers, no GPU->CPU sync in the hot path, sampled-token D2H
|
||||
non-blocking + event-gated, and a fixed small launch sequence (~2 ops/Linear). The
|
||||
next-step host prep overlaps the current-step GPU compute on the async stream. The key
|
||||
asymmetry vs llama: vLLM builds its graph ONCE and reuses persistent device
|
||||
KV/block metadata; ggml rebuilds/reallocates the cgraph each decode step (new
|
||||
`cgraph->uid`) and re-dispatches ~3100 launches from the loop on the weak Grace cores.
|
||||
|
||||
But this asymmetry is hidden under GPU compute on BOTH sides at npl128: llama's host
|
||||
loop is a 0.4% exposed gap, not a 2x lever. vLLM's host cheapness is why ITS step is
|
||||
328 ms host-free, but llama's 499 ms is also ~99% GPU - the 171 ms difference is GPU
|
||||
kernel time (FP4 GEMM), not host.
|
||||
|
||||
### 6. Is any host/serialization lever CUDA-graph or scheduler addressable?
|
||||
|
||||
- **Within-step launch idle (0.34%)**: CUDA-graph addressable, ALREADY captured by
|
||||
default (0.37 -> 0.11%). Worth ~0% of decode_agg (measured +0.1-0.8%, noise).
|
||||
Nothing left to win here.
|
||||
- **Between-step host gap (~2 ms, ~0.4%)**: NOT removed by a graph (the graph replays
|
||||
the forward; the host still samples + runs `update_slots` + rebuilds the batch
|
||||
between replays). It is SCHEDULER addressable - overlap step N+1's host prep with
|
||||
step N's GPU compute, mirroring vLLM's persistent-buffer + build-once-reuse +
|
||||
non-blocking-D2H pattern (and ideally reuse the ggml cgraph across steps instead of
|
||||
rebuilding it every ubatch). But the ceiling is ~0.4% of the step, so it is a
|
||||
cleanup, not a parity lever.
|
||||
- **The +54% to parity is none of the above.** It is GPU kernel work: post-SSM the FP4
|
||||
GEMM family is ~48% of decode (the dominant residual), GDN recurrence ~22.5%, and the
|
||||
decode is weight-bandwidth/latency-bound on LPDDR5x (Track B P2a: a -24.7% FP4-GEMM
|
||||
kernel left decode_agg FLAT, the freed compute became idle gaps -> decode is not
|
||||
GEMM-compute-bound but bandwidth/latency-bound). The lever lives in cutting DRAM
|
||||
traffic per step (fused act-quant to drop the separate `quantize_mmq` pass, native
|
||||
FP4-MMA, and/or NVFP4-dense weight quant), NOT in the host loop or CUDA graphs.
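
The "cleanup, not a parity lever" argument in the bullets above is an Amdahl-style bound: removing a fraction f of the step can speed it up by at most 1/(1-f). A minimal sketch, using the idle fractions quoted above and the 254 vs 391 t/s figures used elsewhere in this document; the function name is hypothetical.

```python
def max_speedup(fraction_removed):
    """Upper bound on throughput gain if this fraction of step time vanished."""
    return 1.0 / (1.0 - fraction_removed)

within_step_gain_pct = (max_speedup(0.0034) - 1.0) * 100.0   # 0.34% launch idle
between_step_gain_pct = (max_speedup(0.004) - 1.0) * 100.0   # ~0.4% host gap

# Gain needed to reach the vLLM reference (391 t/s) from llama (254 t/s).
needed_gain_pct = (391.0 / 254.0 - 1.0) * 100.0

print(round(within_step_gain_pct, 2), round(between_step_gain_pct, 2),
      round(needed_gain_pct))
```

Even if both host-side gaps were eliminated completely, the combined ceiling stays below 1%, two orders of magnitude short of the gain needed for parity.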
|
||||
|
||||
### Evidence
|
||||
- Read-only sqlite sweeps on `~/bench/a2_nsys/paged_off_npl128.sqlite` (this agent).
|
||||
- `gated_delta_net.cu` launch grid (DGX `~/llama-paged-dev`).
|
||||
- A2_CUDAGRAPH_DECODE.md, SSM_DECODE_FIX_RESULTS.md, GDN_DECODE_VERIFY.md,
|
||||
VLLM_DECODE_GROUNDING.md, THROUGHPUT_B_P2a_POSTSSM_RESULTS.md.
|
||||
# Decode-Parity Exploration
|
||||
|
||||
## Section: gdn-source-compare (llama gated_delta_net.cu vs vLLM fused_recurrent_gated_delta_rule)
|
||||
|
||||
### Model config (Qwen3.5-27B dense, from vLLM config.json)
|
||||
- linear_key_head_dim K = 128, linear_value_head_dim V = 128
|
||||
- linear_num_key_heads = 16, linear_num_value_heads = 48 (GVA 3:1), conv_kernel = 4
|
||||
- 64 layers, full_attention_interval 4 -> 48 linear (GDN) : 16 full-attn
|
||||
- Recurrent state per (seq, v-head) = V*K = 128*128 = 16384 f32 = 64 KiB.
|
||||
Per layer per seq = 48 * 64 KiB = 3 MiB. Both engines store state in f32.
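
The layer split and state sizes above follow from the config values alone. A short sketch of that arithmetic, using only the numbers listed; the variable names are illustrative.

```python
K = 128                 # linear_key_head_dim
V = 128                 # linear_value_head_dim
num_v_heads = 48        # linear_num_value_heads
num_layers = 64
full_attn_interval = 4  # every 4th layer is full attention
BYTES_F32 = 4

full_attn_layers = num_layers // full_attn_interval
gdn_layers = num_layers - full_attn_layers

# One recurrent state per (sequence, value head), stored in f32.
state_bytes_per_head = V * K * BYTES_F32
state_bytes_per_layer_seq = num_v_heads * state_bytes_per_head

print(gdn_layers, full_attn_layers)
print(state_bytes_per_head // 1024, "KiB per (seq, v-head)")
print(state_bytes_per_layer_seq // 2**20, "MiB per layer per seq")
```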
|
||||
|
||||
### Which kernels run at decode
|
||||
- llama: ggml_gated_delta_net_inplace_ids -> gated_delta_net_cuda<S_v=128, KDA=false, keep_rs_t=false>.
|
||||
Gate is SCALAR per head (graph reshapes gate/beta to ne[0]=1), so the cheaper !KDA branch runs (one expf per token, not per-channel).
|
||||
- vLLM: enable_packed_recurrent_decode -> fused_recurrent_gated_delta_rule_packed_decode_kernel
|
||||
(the dedicated single-token decode kernel, NOT the generic varlen fwd kernel).
|
||||
|
||||
### The state HBM traffic is IDENTICAL - it is NOT the lever
|
||||
Per (seq, v-head) per decode token both engines read 64 KiB state + write 64 KiB state, f32, coalesced.
|
||||
The dominant memory term is equal. llama is NOT moving more state bytes than vLLM.
|
||||
=> The 1.46 ms/call is llama achieving LOWER effective bandwidth on the SAME bytes,
|
||||
plus extra non-state work, NOT a fundamental HBM-traffic deficit. Hence closable.
|
||||
|
||||
### Algorithmic / parallelization delta (the real differences)
|
||||
|
||||
1) Reduction strategy (biggest structural difference)
|
||||
- llama: WARP-PER-OUTPUT-COLUMN. State stored transposed M[col][i]=S[i][col]. Each warp owns
|
||||
one V-column; the contraction over the 128 K-rows is a cross-lane warp_reduce_sum.
|
||||
TWO warp_reduce_sum per token (one for kv = S^T@k, one for attn = S^T@q) = ~10 shuffle
|
||||
rounds on the critical path, with n_tokens=1 they are NOT amortized.
|
||||
- vLLM: THREAD-PER-OUTPUT-ROW. b_h is a [BV,BK]=[32,128] tile; each thread owns a FULL K-row
|
||||
of state. sum(b_h*b_k, axis=K) and sum(b_h*b_q) are THREAD-LOCAL 128-wide reductions -
|
||||
ZERO cross-thread shuffles. Outer-product update b_h += b_v*b_k is also thread-local.
|
||||
Same FLOPs, but vLLM has no shuffle-reduction latency in the recurrence.
|
||||
|
||||
2) Occupancy / launch geometry (likely the dominant bandwidth gap)
|
||||
- llama: block = (32 lanes, 4 warps) = 128 threads; grid = (H=48, n_seqs, ceil(128/4)=32).
|
||||
Per (head,seq) it launches 32 blocks * 128 threads = 4096 threads to touch a 16384-elem state
|
||||
(only 4 state elems/thread). launch_bounds(128, 2) budgets registers for >=2 blocks/SM; with
|
||||
s_shard[4]+k_reg[4]+q_reg[4]+addressing the register pressure caps it near ~2 blocks = 8 warps/SM
|
||||
(~12-16% occupancy on GB10). A memory-bound kernel at ~8 warps/SM cannot generate enough in-flight
|
||||
loads to saturate 273 GB/s -> low achieved bandwidth on the state read/write.
|
||||
- vLLM: 1 warp/program (num_warps=1), grid (NV=4, B*HV), small register footprint, num_stages=3
|
||||
software-pipelines (prefetches) the state load. Far higher memory-level parallelism per SM.
|
||||
|
||||
3) Redundant non-state traffic in llama
|
||||
- q,k re-loaded by EVERY column-warp: 128 column-warps/head each reload the same 128-float q and k
|
||||
=> ~128x amplified L2 loads of q/k per head/token (vLLM reloads ~4x, once per NV program).
|
||||
Small (L2-resident) but adds load-issue + L2 pressure competing with the state stream.
|
||||
- Output store: llama writes attn_data[col] from lane 0 only (31/32 lanes idle), scattered
|
||||
single-float stores; vLLM stores a contiguous BV=32 vector (coalesced).
|
||||
|
||||
4) Fusion delta (per-layer kernel-launch / HBM round-trip count)
|
||||
- vLLM packed_decode FUSES into ONE kernel: q/k l2norm + q*scale + softplus(a+dt_bias) +
|
||||
(-exp(A_log)) gate + sigmoid(beta) + the recurrence + state write-back.
|
||||
- llama computes these as SEPARATE ggml ops/kernels in the graph before the GDN op:
|
||||
ggml_l2_norm(q), ggml_l2_norm(k), ggml_add(+dt), ggml_softplus, ggml_mul(gate),
|
||||
ggml_sigmoid(beta) (+ conv/silu), each a launch + small HBM round-trip. Plus a separate
|
||||
gdn_gather_nonident_kernel launch per layer (a no-op at steady-state decode: every block
|
||||
early-returns on the identity check, but still a grid launch of n_seqs blocks).
|
||||
Across 48 linear layers this is ~6-10 extra small kernels/layer (~300-480 extra launches/token).
|
||||
Whether this dominates depends on CUDA-graph capture (see A2_CUDAGRAPH_DECODE.md); if captured,
|
||||
launch latency is hidden and the cost reverts to the per-op HBM round-trips + dependency gaps.
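
Item 1 above contrasts two ways of organizing the same contraction. The sketch below is a toy, pure-Python model of one decode token of a gated delta-rule update, written once with a cross-lane butterfly reduction per output column (the shape of the warp-per-column layout) and once with row-local accumulation over a transposed state (the shape of the thread-per-row tile). The update order (decay, `kv = S^T k`, delta scaled by beta, rank-1 update, `o = S^T q`) is an assumption inferred from the contractions named above, not a transcription of either kernel, and all function names are hypothetical.

```python
import math

def warp_reduce_sum(partials):
    """Butterfly reduction over per-lane partials; returns (sum, shuffle rounds).
    Assumes the lane count is a power of two."""
    vals = list(partials)
    rounds = 0
    offset = len(vals) // 2
    while offset >= 1:
        vals = [vals[i] + vals[i ^ offset] for i in range(len(vals))]
        offset //= 2
        rounds += 1
    return vals[0], rounds

def step_warp_per_column(S, q, k, v, g, beta):
    """S[i][j] is a K x V state. Each column j is handled by one 'warp' whose
    lanes hold one K element each, so both contractions need a reduction."""
    K, V = len(S), len(S[0])
    decay = math.exp(g)  # scalar gate per head
    S = [[decay * S[i][j] for j in range(V)] for i in range(K)]
    out, rounds = [], 0
    for j in range(V):
        kv, r1 = warp_reduce_sum([S[i][j] * k[i] for i in range(K)])
        delta = beta * (v[j] - kv)
        for i in range(K):
            S[i][j] += k[i] * delta          # rank-1 state update
        o, r2 = warp_reduce_sum([S[i][j] * q[i] for i in range(K)])
        out.append(o)
        rounds = r1 + r2                     # shuffle rounds on the critical path
    return S, out, rounds

def step_thread_per_row(St, q, k, v, g, beta):
    """St[j][i] is the transposed V x K state. Each 'thread' j holds a full
    K-row, so both contractions are local loops with no cross-thread step."""
    decay = math.exp(g)
    new_St, out = [], []
    for j, row in enumerate(St):
        row = [decay * s for s in row]
        kv = sum(s * ki for s, ki in zip(row, k))
        delta = beta * (v[j] - kv)
        row = [s + ki * delta for s, ki in zip(row, k)]
        out.append(sum(s * qi for s, qi in zip(row, q)))
        new_St.append(row)
    return new_St, out

# Tiny example with K=4, V=2 (the real kernels use K=V=128).
S0 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
St0 = [list(col) for col in zip(*S0)]
q, k, v = [1.0, 0.5, -1.0, 0.25], [0.2, -0.3, 0.4, 0.1], [0.7, -0.1]

S1, out_col, rounds_per_token = step_warp_per_column(S0, q, k, v, g=-0.1, beta=0.5)
St1, out_row = step_thread_per_row(St0, q, k, v, g=-0.1, beta=0.5)
print(out_col, out_row, rounds_per_token)
```

Both layouts produce the same output and the same updated state up to floating-point summation order. With 32 lanes the butterfly takes 5 rounds, so two reductions per token give the ~10 shuffle rounds mentioned in item 1, while the row-local form has none.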
|
||||
|
||||
### What a faster llama GDN decode kernel would need (optimization scope)
|
||||
- A. Re-parallelize like vLLM: thread/lane owns a full K-row (or K-shard) so the kv and attn
|
||||
contractions become register-local FMAs, eliminating the two warp_reduce_sum per token.
|
||||
- B. Raise occupancy for the memory-bound regime: drop/raise the launch_bounds minBlocks hint
|
||||
(the `,2)` is too low), shrink the block, cut registers, and add a software-prefetch of the next
|
||||
state shard so more state loads are in flight per SM. This directly lifts achieved bandwidth on
|
||||
the equal state bytes - the single highest-leverage change.
|
||||
- C. Load q,k ONCE per (head,seq) into shared memory instead of 128x per-column reload; coalesce
|
||||
the output store across the warp.
|
||||
- D. Fuse the gate/l2norm/scale (softplus, exp(A_log), sigmoid, l2norm) INTO the recurrence kernel,
|
||||
reading raw a/b/A_log/dt_bias from registers, removing ~6 elementwise passes + their HBM round-trips
|
||||
per layer (matches vLLM's packed_decode). Drop the gather no-op kernel at steady-state decode
|
||||
(or fold the identity check into the recurrence prologue, which it already partly does).
|
||||
- E. (Longer term) bf16 state would HALVE the dominant traffic, but vLLM keeps f32 too, so this is a
|
||||
divergence-from-reference not a parity lever.
|
||||
|
||||
### Bottom line
|
||||
llama's GDN decode kernel is NOT moving more state HBM bytes than vLLM (the dominant term is equal),
|
||||
so the 1.46 ms/call is an EFFICIENCY gap, not a traffic floor: (1) cross-lane warp shuffle reductions on
|
||||
the n_tokens=1 critical path, (2) low occupancy (~8 warps/SM from launch_bounds + register pressure)
|
||||
starving memory-level parallelism so the equal state bytes move at lower effective bandwidth, plus
|
||||
(3) 128x redundant q/k L2 loads and (4) ~6-10 unfused gate/norm elementwise kernels per layer that
|
||||
vLLM folds into one packed-decode kernel. Highest-leverage fixes: raise occupancy + prefetch (B) and
|
||||
row-local reductions (A); secondary: gate/norm fusion (D) and q/k shared-mem reuse (C).
|
||||
|
||||
---
|
||||
|
||||
## Section: validate-findings (adversarial re-derivation from raw DGX data) - READ-ONLY
|
||||
|
||||
Re-queried `CUPTI_ACTIVITY_KIND_KERNEL` + `CUPTI_ACTIVITY_KIND_MEMCPY` directly (kernel and
|
||||
memcpy summed separately so D2D is never lumped into compute), not from summary text.
|
||||
|
||||
### CLAIM 1 - decode decomposition
|
||||
PRE-FIX (`a2_nsys/paged_off_npl128.sqlite`, last 17s) vs `decode_decomp.txt`; shares match within ~0.1pp:
gated_delta_net 23.40% (doc 23.43), k_get_rows 21.99% (21.88), MEMCPY-DtoD 18.89% / 382 GB /
1583 ops (18.90 / 356 GB / 1584), mul_mat_vec_q 15.53% (15.51), mul_mat_q 10.48% (10.37).
=> CONFIRMED (the 382 vs 356 GB byte totals differ by ~7%, which would be consistent with a
GB-vs-GiB unit difference). gated_delta_net = largest single non-GEMM kernel; FP4-GEMM group ~28%;
full attention 0.37%.
|
||||
|
||||
D2D collapse: only on-box post-fix decomp is `ssm_decomp/after.sqlite`; MEMCPY-DtoD there =
|
||||
526 ops / 0.9 ms / 0.05 GB = 0.008% of busy (from 382 GB / 18.89%). => CONFIRMED, stronger than
|
||||
the doc's "0.23%" (382 GB state copy-back gone; exact "0.23%/2.93GB/734ops" not reproducible -
|
||||
my DtoD 0.05 GB, the 2.16 GB is DtoH).
|
||||
|
||||
FLAG (refutes part of the Step-2 decomp): `after.sqlite` is a Step-1 build (patch 0018 only),
|
||||
NOT Step-2. It still shows k_get_rows_float 28.44% (gated_delta_net 28.96%, FP4-GEMM group ~33%),
|
||||
no `gdn_gather_nonident` kernel, profiled S_TG=164 (~Step-1 180, not Step-2 256); mtime 00:31
|
||||
predates the 08:48 rebuild that carried patch 0019. The Step-2 split in `SSM_DECODE_FIX_RESULTS`
|
||||
("get_rows 18.8%->0.7%, FP4-GEMM ->48%, GDN 22.5%") has NO surviving sqlite, and the script meant
|
||||
to produce it (`ssm_decomp.sh`) CRASHED (Python SyntaxError, see `ssm_decomp_after.out`). So
|
||||
"FP4-GEMM ~48%" is UNVERIFIED against raw Step-2 data: measured ~33% on Step-1; removing the 28%
|
||||
get_rows bucket lifts it to ~46% arithmetically, so ~48% is plausible but not directly measured.
|
||||
Section 1 above and SSM_DECODE_FIX_RESULTS both inherit this unverified Step-2 split.
|
||||
|
||||
### CLAIM 2 - 146 -> ~257 ("+66%")
|
||||
146.23 baseline CONFIRMED (`ssm_decode_baseline.out`); final 256.57 / 252.50 / 254.02 across
|
||||
SSM_DECODE_FIX_RESULTS + THROUGHPUT_B_P2a, within ~1.6%. Magnitude CONFIRMED. TRAP: 146->257 is
|
||||
+76% (146->254 = +74%), NOT +66%. "66%" is the % of vLLM (257/391 = 65.7%), not the speedup.
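
The "+66%" trap above comes from mixing two different ratios. A short sketch of both, using the throughput figures quoted in this claim; the variable names are illustrative.

```python
baseline_tps = 146.23   # pre-fix llama decode, t/s
final_tps = 257.0       # post-fix llama decode (approx.), t/s
vllm_tps = 391.0        # vLLM enforce_eager reference, t/s

# Relative gain over the baseline: this is the speedup.
speedup_pct = (final_tps / baseline_tps - 1.0) * 100.0

# Fraction of the reference engine's throughput: this is NOT a speedup.
share_of_vllm_pct = final_tps / vllm_tps * 100.0

print(round(speedup_pct), round(share_of_vllm_pct))
```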
|
||||
|
||||
### CLAIM 3 - P2a GEMM-remap FLAT on decode
|
||||
THROUGHPUT_B_P2a: dense npl128 252.50->254.02 (+0.6% noise), npl32 -0.4%, MoE flat; FP4 GEMM
|
||||
kernel itself -24.7%, PREFILL +12.7%. Pre-SSM corroborated by THROUGHPUT_B_P1. => CONFIRMED.
|
||||
|
||||
### CLAIM 4 - 65% of vLLM (254 vs 391)
|
||||
254/391 = 64.96%, 256.57/391 = 65.6%; vLLM 391 = enforce_eager apples ref. => CONFIRMED.
|
||||
|
||||
### Traps checked
|
||||
GGML_CUDA_DISABLE_GRAPHS set `=1` explicitly (not the empty-value trap); graphs ON/OFF within
|
||||
noise. memcpy-in-compute lumping AVOIDED (separate table sums). Decomp reps are ntg24-under-nsys
|
||||
(S_TG 149/164) - valid for SHARES only; throughput correctly from unprofiled ntg128 logs.
|
||||
|
||||
### Net verdict
|
||||
- Claim 1: pre-fix decomp CONFIRMED exact; D2D collapse CONFIRMED (stronger); Step-2 0.7%/48% split
UNVERIFIED (producer script crashed, only post-fix sqlite is Step-1).
- Claim 2: magnitude CONFIRMED, "+66%" label REFUTED (true +76%; 66% = % of vLLM).
- Claim 3: CONFIRMED.
- Claim 4: CONFIRMED.
|
||||
|
||||
---
|
||||
|
||||
## Section: weight-bandwidth (whole-step DRAM budget, READ-ONLY math)
|
||||
|
||||
Agent label: weight-bandwidth. Method: exact GGUF tensor accounting (q36-27b-nvfp4,
|
||||
arch qwen35, 64 layers) + activation-state math + existing nsys/decode_decomp; no GPU started.
|
||||
Config = the production decode number: llama-batched-bench -fa on -npp128 -ntg128 -npl 128
|
||||
(B = n_parallel = 128 sequences, S_TG = 254 t/s post-0019). GB10 LPDDR5x peak ~273 GB/s.
|
||||
|
||||
### Exact per-step DRAM byte budget at B=128 (ctx avg ~192 over the ntg128 window)
|
||||
|
||||
NVFP4 type-40 = 0.5625 B/weight (4-bit data + e4m3 per-16 micro-scale; verified: 5120*48*0.5625=138240).
|
||||
|
||||
WEIGHTS (read ONCE per step, shared across all 128 seqs):
|
||||
- NVFP4 layer weights (type40, 64 layers): 13,062.7 MB = 12.76 GB
|
||||
(per SSM layer ~205.6 MB x48 = 9867.7 MB ; per full-attn layer 199.7 MB x16 = 3195.0 MB)
|
||||
- LM head output.weight: type 30 = **bf16, NOT quantized** = 2425 MB = 2.37 GB (read in full each step)
|
||||
- per-layer norms/conv1d/ssm_a/dt_bias (type0 f32): 10.1 MB
|
||||
- token_embd: EXCLUDED (get_rows gathers only 128 rows, negligible)
|
||||
=> WEIGHTS TOTAL = 15.14 GB / step
|
||||
|
||||
PER-SEQUENCE STATE (x128 seqs, read + write every step):
|
||||
- SSM recurrent state: inner_size(6144) x state_size(128) x 4B(f32) = 3.0 MB / layer / seq
|
||||
x 48 SSM layers x 128 seq = 18.43 GB read + 18.43 GB write = **36.86 GB / step**
|
||||
- conv state: conv_k(4) x conv_dim(10240) x 4B = 160 KB / layer / seq
|
||||
x 48 x 128 = 0.96 GB read + 0.96 GB write = 1.92 GB / step
|
||||
- KV cache (16 full-attn layers, GQA n_kv_head=4, k+v_len=512, f16):
|
||||
4096 B/tok/layer x 16 x ~192 ctx x 128 seq = ~1.6 GB read / step
|
||||
|
||||
TOTAL ~= 15.14 (W) + 36.86 (SSM state) + 1.92 (conv) + 1.6 (KV) = **~55.5 GB / step**
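
The per-step budget above can be recomputed from the listed shapes. This is a sketch under stated assumptions: the weight figures are taken as given from the list above, and the state figures are reproduced by dividing MiB totals by 1000, which is an observation about the quoted arithmetic rather than something the source states. The total is therefore approximate to a few percent.

```python
MIB = 2**20
B = 128            # sequences in the batch
SSM_LAYERS = 48
ATTN_LAYERS = 16
CTX = 192          # average context over the ntg128 window

# Weights, read once per step (figures from the list above, in GB).
weights_gb = 12.76 + 2.37 + 0.01   # NVFP4 layers + bf16 LM head + small f32 tensors

# Per-sequence state sizes.
ssm_state_mib = 6144 * 128 * 4 / MIB    # inner_size x state_size x f32
conv_state_mib = 4 * 10240 * 4 / MIB    # conv_k x conv_dim x f32
kv_bytes_per_tok_layer = 4 * 512 * 2    # n_kv_head x (k+v len) x f16

# Read + write for recurrent and conv state; KV is read only.
ssm_gb = 2 * ssm_state_mib * SSM_LAYERS * B / 1000
conv_gb = 2 * conv_state_mib * SSM_LAYERS * B / 1000
kv_gb = kv_bytes_per_tok_layer * ATTN_LAYERS * CTX * B / 1e9

total_gb = weights_gb + ssm_gb + conv_gb + kv_gb
print(round(ssm_gb, 2), round(conv_gb, 2), round(kv_gb, 2), round(total_gb, 1))
```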
|
||||
|
||||
### Floor vs measured -- decode is NOT at the bandwidth floor
|
||||
|
||||
Bandwidth floor = 55.5 GB / 273 GB/s = **203 ms/step**
|
||||
Measured llama = 128 tok / 254 t/s = **504 ms/step** => **2.48x the floor** (eff BW 110 GB/s = 40% of peak)
|
||||
vLLM 391 t/s = 128 / 391 = **327 ms/step** => 1.61x the floor (eff BW 170 GB/s = 62% of peak)
|
||||
|
||||
The SAME 55.5 GB/step floor applies to vLLM: identical NVFP4 weights, and its
|
||||
fused_recurrent_gated_delta_rule reads+writes the identical f32 recurrent state. So both engines
|
||||
face the same DRAM wall; vLLM simply moves those bytes at 62% of peak vs llama's 40%. The 62/40 =
|
||||
1.55x utilization gap is EXACTLY the 254->391 (1.54x) throughput gap. => Decode-parity is a
|
||||
bandwidth-UTILIZATION / launch-serialization problem, NOT a DRAM-traffic-volume problem. Bandwidth
|
||||
is not the binding constraint (we sit 2.5x above the floor); confirms the GDN-kernel section above.
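
The floor, the multiples of the floor, and the effective bandwidths above all derive from three inputs: bytes per step, peak bandwidth, and measured throughput. A minimal sketch with those inputs taken from the text; the helper name `step_stats` is hypothetical.

```python
PEAK_GBPS = 273.0    # GB10 LPDDR5x peak quoted above
TRAFFIC_GB = 55.5    # per-step DRAM budget at B=128
BATCH = 128

def step_stats(tokens_per_s):
    """Step time, distance from the bandwidth floor, and achieved bandwidth."""
    step_s = BATCH / tokens_per_s
    floor_s = TRAFFIC_GB / PEAK_GBPS
    eff_gbps = TRAFFIC_GB / step_s
    return {
        "step_ms": step_s * 1000.0,
        "x_floor": step_s / floor_s,
        "eff_gbps": eff_gbps,
        "util": eff_gbps / PEAK_GBPS,
    }

llama = step_stats(254.0)
vllm = step_stats(391.0)
print(llama)
print(vllm)
```

Because both engines are assigned the same byte budget, the utilization ratio equals the throughput ratio by construction, so the 1.55x vs 1.54x agreement is a consistency check on the rounding rather than independent evidence.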
|
||||
|
||||
### Traffic composition is STATE-dominated at B=128 (qualifies the "weight-quant" verdict)
|
||||
|
||||
SSM state r+w = 66% of step traffic; weights = 27%; conv = 3.5%; KV = 3%.
|
||||
At B=128 weights are a minority of traffic, and we are 2.5x above the floor anyway -> NVFP4-dense
|
||||
weight quant (the QWEN36_NVFP4 verdict's lever) cannot move batch-128 decode much. Weight-quant
|
||||
helps PREFILL (compute/weight-bound, already +12.7% from the GEMM remap) and LOW-batch decode.
|
||||
Cross-check at B=32: traffic ~25.2 GB/step (weights now 60%), floor 92 ms, measured 189 ms = 2.05x
|
||||
floor. The sublinear scaling 32->128 (4x batch, only 1.5x throughput: 169->254) is fully explained
|
||||
by per-seq state traffic growing with B while weights stay amortized -> at B=128 the step has become
|
||||
state-traffic-heavy but is STILL 2.5x off the floor, i.e. latency/overlap-bound, not byte-bound.
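
The B=32 cross-check follows from a simple traffic model: weights are read once per step regardless of batch, while state, conv, and KV traffic scale with the number of sequences. A sketch under that assumption (including that average context is the same at both batch sizes), using the per-step figures above; the function names are hypothetical.

```python
PEAK_GBPS = 273.0
WEIGHTS_GB = 15.14
PER_SEQ_GB_AT_128 = 36.86 + 1.92 + 1.6   # SSM state + conv + KV at B=128

def traffic_gb(batch):
    """Per-step DRAM traffic: fixed weights plus batch-proportional state."""
    return WEIGHTS_GB + PER_SEQ_GB_AT_128 * batch / 128

def x_floor(batch, tokens_per_s):
    """Measured step time as a multiple of the bandwidth floor."""
    floor_s = traffic_gb(batch) / PEAK_GBPS
    return (batch / tokens_per_s) / floor_s

weights_share_32 = WEIGHTS_GB / traffic_gb(32)
print(round(traffic_gb(32), 1), round(weights_share_32 * 100),
      round(x_floor(32, 169.0), 2), round(x_floor(128, 254.0), 2))
```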
|
||||
|
||||
### Redundant traffic llama reads that vLLM avoids (cut list, by impact)
|
||||
|
||||
1. (HISTORICAL, FIXED by 0018) Redundant DtoD recurrent-state copy = +18.4 GB/step EXTRA
|
||||
(pre-fix decode_decomp: MEMCPY-DtoD 18.9%, 80 copies/step ~230 MB each = 18.4 GB; nsys window
|
||||
356 GB/19.8 steps). This doubled state traffic and was the dominant pre-fix waste. Verified gone
|
||||
post-fix: the THROUGHPUT_B_P2a A/B kernel sum (npp128 ntg24 npl128) lists gated_delta_net /
|
||||
mul_mat_q / quantize but NO MEMCPY-DtoD term. (The committed ~/bench/a2_nsys sqlites are all
|
||||
PRE-fix S_TG~149 traces; re-profiling deferred to the designated profiler.) This single removal
|
||||
(18.4 GB/273 ~= 67 ms/step of bytes plus the killed overlap stalls) is the bulk of 146->254.
|
||||
2. conv state as a SEPARATE ssm_conv kernel + separate buffer: 1.92 GB r+w/step AND 48 extra kernel
|
||||
launches/step. vLLM folds the causal conv into its recurrence kernel. Cut ~= 7 ms bytes + 48
|
||||
launches/step of serialization.
|
||||
3. Residual get_rows gather post-0019 (~0.7%, decode_decomp pre-fix k_get_rows was 21.9% / ~96
|
||||
ops/step = 2/SSM-layer): vLLM indexes the per-seq state in-kernel; llama still does a small
|
||||
gather/scatter. ~0.13 GB. 0019 already folds most of it; fold the identity check fully into the
|
||||
recurrence prologue.
|
||||
4. quantize_mmq_nvfp4: 448 ops/step re-quantizing activations to NVFP4 before each FP4 matmul.
|
||||
Activation BYTES are negligible, but it is 448 extra kernel launches/step that vLLM fuses into
|
||||
the GEMM prologue -> pure launch latency, not traffic.
|
||||
5. NOT redundant: weight bytes (identical NVFP4 to vLLM), SSM-state r+w (inherent, vLLM pays it),
|
||||
NVFP4 scale scalars (8 B/tensor). Note the LM head is bf16 not quantized (2.37 GB/step, 16% of
|
||||
weight traffic) -- fp8 LM head would save ~1.2 GB/step but only matters if vLLM also quantizes it.
|
||||
|
||||
### Bottom line (weight-bandwidth)
|
||||
At B=128, decode moves ~55.5 GB/step and runs at 2.48x the 273 GB/s floor (40% util) vs vLLM's 1.61x
|
||||
(62% util). Same bytes, same floor for both engines -> decode is bandwidth-UTILIZATION-bound, not
|
||||
traffic-bound. There is NO large redundant-byte stream left to cut post-0018/0019 (the 18.4 GB/step
|
||||
DtoD redundancy is already gone); the remaining 254->391 is recovered by raising achieved bandwidth
|
||||
(occupancy + prefetch on the GDN state loads, conv fusion to drop 48 launches/step) so the EXISTING
|
||||
55.5 GB/step moves at vLLM's 62% instead of 40%. Weight-quant (NVFP4-dense) is a PREFILL / low-batch
|
||||
lever, largely orthogonal to the batch-128 decode-parity gap.
|
||||
|
||||
---
|
||||
|
||||
## Section: explore-other-levers (broad sweep for OTHER llama-specific decode inefficiencies) - READ-ONLY, no GPU
|
||||
|
||||
Scope handoff: GDN-kernel internals -> `gdn-source-compare`; host loop / graphs / gaps ->
|
||||
`per-token-latency`; weight-byte / utilization -> `weight-bandwidth` section above (which already
|
||||
covers the BF16 lm_head and the "same bytes, 40% vs 62% util" framing - I concur, no need to repeat).
|
||||
This section covers the levers NONE of those own: the FP4 act-quant fusion, the M=128-vs-M=1 ggml
|
||||
fusion gate, TMA scoping, and the conv-state residual.
|
||||
|
||||
**Terminology fix that matters for the whole doc:** in this repo's benches **"fusion OFF" means
|
||||
`LLAMA_FUSE_NVFP4_QUANT=0`** (Track A's NVFP4 act-quant producer), confirmed in
|
||||
`a2_nsys.sh`/`a2_4cell.sh`/`trackA_clean.sh`. It does NOT set `GGML_CUDA_DISABLE_FUSION`, so the
|
||||
**standard ggml-cuda elementwise/GLU/rope fusion is ON** in every result. The header's "fusion OFF
|
||||
baseline" is only about the act-quant producer.
|
||||
|
||||
**Framing (consistent with the sections above, sharpened):** the binder is bandwidth-UTILIZATION /
|
||||
the kernel-dependency chain, not traffic or per-kernel compute (P2a -24.7% GEMM and graphs both
|
||||
flat). The thing that raises utilization AND shortens the chain is the same: **fewer, fused kernels
|
||||
per step** - removing whole passes vLLM doesn't run. So rank by "whole pass eliminated", not "us
|
||||
shaved".
|
||||
|
||||
### L1. Re-test Track A act-quant fusion (`LLAMA_FUSE_NVFP4_QUANT=1`) POST-SSM. [impact ~8-11%, tractability HIGH - code exists, owned by tasks 38-41]
|
||||
`quantize_mmq_nvfp4` is a standalone full-activation requantize run once per NVFP4 GEMM at M=128
|
||||
(the weight-bandwidth section counts 448 such launches/step). vLLM has **zero** equivalent:
|
||||
`rms_quant_fusion.py:98` folds it into RMSNorm, `act_quant_fusion.py:40,128` into SiLU+mul - the
|
||||
activation never hits a temp buffer. Track A built exactly this fused producer (tasks 38-40 DONE),
|
||||
but `LLAMA_FUSE_NVFP4_QUANT=1` regressed, and EVERY post-SSM bench ran with it OFF. **The regression
|
||||
is likely stale:** pre-SSM the GPU was 99% busy on the state-copy chain, so folding act-quant into
|
||||
the norm only relocated busy work into a lower-occupancy kernel with no idle to reclaim. Post-SSM the
|
||||
chain has real idle and removing 448 launches/step both shortens the dependency chain and lifts
|
||||
utilization - exactly the post-0018/0019 bind. Highest-value CLEAN removal; needs only a re-bench
|
||||
(re-run `trackA_clean.sh` on the post-0019 build), not new code. Do not treat the prior regression
|
||||
as final.
|
||||
|
||||
### L2. M=128 norm->matmul prologue fusion - the ggml fusion gate that does NOT fire at decode batch. [impact ~5-15% aggregate, tractability MEDIUM]
|
||||
ggml-cuda's built-in `rms_norm+mul+mul_mat_vec_q` fusion (`ggml_cuda_should_fuse_mul_mat_vec_q`,
|
||||
ggml-cuda.cu:2502) is gated to `dst->ne[1]==1` - it ONLY fires at **npl1** (M=1). At npl128
|
||||
(`mul_mat_q`, M=128) it does NOT fire, so the per-layer RMSNorm stays a separate kernel feeding the
|
||||
GEMM and the act path is unfused (L1). vLLM fuses both into the GEMM prologue at all M. This is the
|
||||
M=128 generalization of the existing M=1 fusion + L1; largest aggregate surface but real kernel work.
|
||||
Implies a regime split worth stating loudly: **npl1 single-stream latency already gets this fusion;
|
||||
the npl128 throughput number does not** - tune the two separately.
|
||||
|
||||
### L3. TMA weight feed: a PREFILL / npl1-latency lever, NOT an npl128-decode lever.
|
||||
Answering the brief's question (is the GEMM idle a FEED problem that TMA fixes, or is the GEMM off the critical path, where TMA cannot help?):
|
||||
P2a cut GEMM compute and the freed time became IDLE, so at npl128 the GEMM finishes early and the
|
||||
stall is BETWEEN kernels, not inside the GEMM waiting on weight tiles. TMA accelerates a
|
||||
*feed-stalled* GEMM; at npl128 the GEMM is not the binder, so TMA won't move npl128 S_TG. It pays on
|
||||
(a) **prefill** (compute/feed-bound; the remap already gave +12.7%) and (b) **npl1 decode**, a pure
|
||||
weight-feed GEMV (full model / 273 GB/s ~ 19-20 tok/s ceiling). Scope TMA to prefill + low-batch
|
||||
latency; do not bank it for batch-128 decode parity. (Consistent with the weight-bandwidth section's
|
||||
"NVFP4-dense is a prefill/low-batch lever".)
|
||||
|
||||
### L4. In-place / `ids` conv-state - apply the 0018/0019 pattern to `ssm_conv`. [impact ~1-3%, tractability HIGH, proven pattern, bit-exact-able]
|
||||
After the SSM fix the residual D2D is the conv-state copy (`build_conv_state`,
|
||||
delta-net-base.cpp:449-525: `build_rs` reads 3 prior samples, `ggml_concat` the new token, writes
|
||||
the last 3 back), plus `ssm_conv` (~0.8-1.5%) and a per-GDN-layer `concat_cont` (48/step). The exact
|
||||
in-place + `ids`-read treatment from 0018/0019 applies to the conv state, and `ssm_conv`+`concat`
|
||||
can fold into the GDN kernel prologue (it already has `ids` plumbing). Small ceiling but bit-exact,
|
||||
low-risk, and removes ~48 launches/step from the chain - this is the "conv fusion to drop 48
|
||||
launches/step" the weight-bandwidth section calls for, made concrete via the proven patch pattern.
|
||||
|
||||
### Deferred (covered by other sections, I concur)
|
||||
- GDN occupancy / row-local reductions / gate-norm fusion -> `gdn-source-compare`. Add only: bf16
|
||||
state halves the dominant traffic but vLLM keeps f32, so it is a divergence-from-reference, not a
|
||||
parity lever - last priority, quality-risk.
|
||||
- BF16 lm_head / weight-byte / 40%-vs-62% utilization -> `weight-bandwidth` section. lm_head NVFP4 is
|
||||
an absolute ~1-2% trim, not a vLLM-relative gap (vLLM likely keeps it bf16 too).
|
||||
- Full-attention KV path (16 attn layers, 0.4-1.8%, O(ctx) but tiny) -> CLOSED, not a lever.
|
||||
|
||||
### Bottom line (this section's net-new)
|
||||
Ranked by "whole pass vLLM eliminated": **L1 (re-test act-quant fusion post-SSM - clean removable
|
||||
pass, code already written, just needs a post-0019 re-bench)** > **L2 (M=128 norm/act prologue
|
||||
fusion - biggest aggregate surface, real work)** > **L4 (conv-state in-place - cheap, proven 0018/0019
|
||||
pattern, -48 launches/step)**. **L3 (TMA) is mis-scoped if aimed at npl128 decode** - it is a prefill
|
||||
/ npl1-latency lever, same bucket as NVFP4-dense weight quant. Caveat inherited from
|
||||
`validate-findings`: the post-SSM act-quant absolute share (L1) is on an unverified Step-2 decomp
|
||||
(only clean post-fix sqlite is Step-1); re-measure on a clean Step-2 nsys when the profiler runs.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
## Section: profile-both-engines (GROUND-TRUTH post-SSM nsys of llama AND vLLM at npl128) - THE GPU PROFILER
|
||||
|
||||
Agent label: profile-both-engines (the only GPU agent). Fresh post-SSM nsys traces of
|
||||
BOTH engines at the same shape (128-seq decode, 128-token prompts), q36-27b-nvfp4 dense.
|
||||
llama = `build-cuda-base` (no FP4 flag, byte-identical to stock, HEAD 46d7dd8 = patch 0019
|
||||
SSM fix), `llama-batched-bench -npp128 -ntg32 -npl128 -fa on`, eager (DISABLE_GRAPHS=1) for
|
||||
a clean per-kernel trace. vLLM = 0.23.0 in-process offline (`VLLM_ENABLE_V1_MULTIPROCESSING=0`
|
||||
so cudaProfilerApi controls the worker), enforce_eager, max_num_seqs 256, 128 prompts.
|
||||
Decode-only windows (prefill excluded), overlap-correct interval-union busy, GPU-accurate
|
||||
per-call kernel durations. This is the post-SSM **Step-2** trace `validate-findings` flagged
|
||||
as having no surviving sqlite - it now exists: `~/bench/postssm_decomp/`.
|
||||
|
||||
### 0. THROUGHPUT GROUND TRUTH (un-profiled, prefill-subtracted) - resolves the 391 reference
|
||||
|
||||
The vLLM 391 reference is real and reproduced. Prefill-subtracted decode step (two-length
|
||||
w16/w64 timing, in-process, batch 128):
|
||||
|
||||
| engine / mode           | ms/step  | decode tok/s | notes                             |
|-------------------------|----------|--------------|-----------------------------------|
| llama post-SSM (graphs) | ~510-522 | **245-251**  | S_TG @npl128 ntg32 (this run)     |
| vLLM enforce_eager      | 324.9    | **394.0**    | == the ~391 ref (h2h log 371-384) |
| vLLM cuda-graphs        | 304.9    | **419.8**    | graphs buy only +6%               |
|
||||
|
||||
- **CUDA graphs are NOT the parity lever**: vLLM is already 394 t/s EAGER; graphs add +6%
|
||||
(394->420). llama-batched-bench already runs WITH graphs at 245. So the gap is eager-vs-eager
|
||||
kernel work, confirming `per-token-latency` and `A2_CUDAGRAPH_DECODE`.
|
||||
- TRAP I hit and corrected: the FIRST vLLM nsys window (0.35-0.99) read 468 ms/step / 273 t/s -
WRONG, contaminated by prefill chunked-GDN kernels AND eager-nsys host overhead. The tight
decode-only window (0.62-0.98) reads **326.5 ms/step**, within 0.5% of the un-profiled 324.9 ms
-> the tight window is faithful; per-kernel numbers below use it.
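
The ms/step and tok/s columns in the throughput table are two views of the same measurement at batch 128. A small sketch of the conversion and of the gain that CUDA graphs give vLLM, using only the table's figures; the function name is hypothetical.

```python
BATCH = 128  # sequences decoded per step

def tok_per_s(ms_per_step):
    """Aggregate decode throughput for one token per sequence per step."""
    return BATCH * 1000.0 / ms_per_step

vllm_eager = tok_per_s(324.9)
vllm_graphs = tok_per_s(304.9)
graph_gain_pct = (vllm_graphs / vllm_eager - 1.0) * 100.0

# The contaminated first nsys window, for comparison.
contaminated = tok_per_s(468.0)

print(round(vllm_eager, 1), round(vllm_graphs, 1),
      round(graph_gain_pct, 1), round(contaminated))
```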
|
||||
|
||||
### 1. POST-SSM per-step decode decomposition, SIDE BY SIDE (GPU-accurate, prefill-free)
|
||||
|
||||
Both at batch 128. llama 510 ms/step (98.7% GPU-busy), vLLM 326 ms/step (97.9% GPU-busy).
|
||||
ms/step = on-device kernel time per real decode step (nsys host overhead does not inflate GPU
|
||||
kernel duration; per-step = GPU-ms / real-step-count from the decode-only GDN call count).
|
||||
|
||||
| component (per step)        | llama ms/step  | llama % | vLLM ms/step     | vLLM %  |
|-----------------------------|----------------|---------|------------------|---------|
| GDN linear-attn recurrence  | 193 (48x4.03)  | 38%     | 174 (48x3.62)    | 53%     |
| FP4 matmul + act-quant      | **236**        | **46%** | **117**          | **36%** |
| - mul_mat_vec_q (GEMV)      | 132 (48x2.75)  | 26%     | -                | -       |
| - mul_mat_q (GEMM)          | 88 (448 calls) | 17%     | cutlass 61       | 19%     |
| - quantize_mmq_nvfp4        | 16 (448)       | 3%      | nvjet 53 + cvt 2 | 17%     |
| full attention (16 layers)  | 6.6 (16)       | 1.3%    | 6.2 (16)         | 1.9%    |
| SSM conv + glue/elementwise | ~45            | 9%      | ~22              | 7%      |
| MEMCPY (D2D+H2D)            | 2.5 (131 MB)   | 0.5%    | 0.36 (85 MB)     | 0.1%    |
| **TOTAL**                   | **~510**       | 100%    | **~326**         | 100%    |
|
||||
|
||||
### 2. The three load-bearing comparisons (the brief)
|
||||
|
||||
**(1) GDN compute: llama vs vLLM = NOT the gap.** Per-call GPU duration:
|
||||
llama `gated_delta_net_cuda<128>` = **4.03 ms/call**, vLLM
|
||||
`fused_recurrent_gated_delta_rule_packed_decode` = **3.62 ms/call**. llama is only **+11%**
|
||||
slower per call (+19 ms/step). GDN is comparable; it is the largest single kernel on BOTH sides
|
||||
(38% llama, 53% vLLM) but it explains only ~19 ms of the 185 ms gap (~10%). This REFUTES the
|
||||
framing that the GDN kernel is the dominant residual lever - it is a minor overage post-0018/0019.
|
||||
(The `gdn-source-compare` occupancy/shuffle deltas are real but worth ~19 ms/step, not 1.5x.)
|
||||
|
||||
**(2) DRAM bytes/step: llama does NOT read more.** Explicit memcpy: llama **131 MB/step** vs
|
||||
vLLM **85 MB/step** - llama moves a hair more in copies but both are <0.5% of the step. The big
|
||||
per-layer state copies are GONE (pre-SSM 18 GB/step DtoD -> post-SSM 131 MB/step) - **the SSM fix
|
||||
(0018/0019) is confirmed working in this trace.** Weight DRAM (read inside the GEMM/GEMV kernels,
|
||||
not memcpy) is the SAME ~15 GB NVFP4 for both engines; at 273 GB/s that is a ~52 ms floor, and
|
||||
BOTH engines sit far above it (326-510 ms), so BOTH are compute/kernel-bound, NOT
|
||||
weight-bandwidth-bound, and llama reads no extra bytes. The 254-vs-391 gap is NOT a byte-volume
|
||||
deficit - it is effective-bandwidth/compute-efficiency in the FP4 matmul kernels (see 3).
|
||||
|
||||
**(3) GPU-busy% / idle structure: identical, both ~98% busy.** llama 98.7% busy (1.3% idle),
|
||||
vLLM 97.9% busy (2.1% idle). Neither engine is idle/gap/host-bound at npl128. The entire gap is
|
||||
the GPU doing MORE kernel-time per step on llama: llama's non-GDN GPU work = ~310 ms/step vs
|
||||
vLLM's ~146 ms/step. That 164 ms delta is concentrated in the FP4 matmul path.
|
||||
|
||||
### 3. THE single biggest llama-specific overage: the FP4 matmul path (+119 ms/step = 64% of the gap)
|
||||
|
||||
llama spends **236 ms/step** on FP4 matmul+quant; vLLM does ALL its matmul (cutlass FP4 GEMM +
|
||||
cublas nvjet + act cvt) in **117 ms/step** - even though vLLM ALSO carries ~18 ms/step of extra
|
||||
PyTorch eager elementwise glue that llama's fused ggml kernels avoid. llama is **2.0x slower on
|
||||
FP4 matmul**, and that +119 ms is **64% of the entire 185 ms/step gap**.
|
||||
|
||||
Inside llama's FP4 path the dominant, untouched cost is **`mul_mat_vec_q` = 132 ms/step (26% of
|
||||
decode), 48 calls/step (exactly one per GDN layer), 2.75 ms/call, grid 5120x128**. This is the
|
||||
**FP4 GEMV ("vec_q") kernel running at decode batch 128** for the gated-DeltaNet in-projections -
|
||||
a non-tensor-core, memory-bound-style kernel doing M=128 work without GEMM-grade weight-read
|
||||
amortization. vLLM runs the equivalent projections through cutlass batched FP4 GEMM (tensor-core,
|
||||
weight read amortized across the 128-row batch) at a fraction of the cost. **There is no
|
||||
GEMV-at-batch-128 on the vLLM side at all.**
|
||||
|
||||
Key cross-check with Track B P2a: P2a optimized `mul_mat_q` (the 17%/88 ms tensor-core GEMM, made
|
||||
it -24.7%) and decode stayed FLAT - because the BIG FP4 cost is `mul_mat_vec_q` (26%/132 ms),
|
||||
which P2a never touched. **Track B optimized the wrong FP4 kernel.** The lever is to route the
|
||||
GDN in-projection at M=128 through a tensor-core GEMM (mul_mat_q / MMQ) instead of the vec_q path,
|
||||
and to fuse the act-quant (L1) + the norm prologue (L2) so the 448 `quantize_mmq_nvfp4` launches
|
||||
fold away - exactly what `explore-other-levers` L1/L2 propose. My measurement RANKS them: the
|
||||
mul_mat_vec_q->GEMM routing is the single highest-value target (132 ms), then act-quant fusion
|
||||
(16 ms + 448 launches), then the GDN +19 ms.
|
||||
|
||||
### 4. Reconciling with the `weight-bandwidth` section (unification, not contradiction)
|
||||
|
||||
weight-bandwidth concluded "same 55.5 GB/step, llama 40% util vs vLLM 62% util -> utilization-bound."
|
||||
My per-kernel data LOCALIZES that utilization gap: it lives in the **FP4 matmul kernels** (which
|
||||
do the bulk of the ~15 GB weight read), NOT in the GDN state traffic. GDN moves its (equal) state
|
||||
bytes at comparable rate on both engines (4.03 vs 3.62 ms/call). So the "40% vs 62%" is the
|
||||
`mul_mat_vec_q`/`mul_mat_q` weight-read efficiency vs cutlass FP4 GEMM. Raising decode parity =
|
||||
raise the FP4-matmul achieved bandwidth (tensor-core GEMM routing + act/norm prologue fusion),
|
||||
not the GDN kernel and not byte-cutting.
|
||||
|
||||
### Verdict (profiler)
|
||||
- Reproduced both engines at their true operating points: llama 245 / vLLM 394 eager / 420 graphs.
|
||||
Graphs are not the lever (+6%). Both engines ~98% GPU-busy; gap is GPU kernel-time, not idle/host.
|
||||
- GDN compute is comparable (llama +11%/call, +19 ms/step) - NOT the dominant residual.
|
||||
- bytes/step: llama does not read more (131 vs 85 MB memcpy; identical weight bytes); SSM fix's
|
||||
18 GB/step DtoD removal CONFIRMED in-trace.
|
||||
- **The single biggest llama-specific overage is the FP4 matmul path: 236 vs 117 ms/step (+119 ms
|
||||
= 64% of the 185 ms gap), dominated by `mul_mat_vec_q` (FP4 GEMV at batch 128, 132 ms/step, 26%,
|
||||
one per GDN layer).** Highest-value lever = route the GDN in-projection through a tensor-core FP4
|
||||
GEMM at M=128 + fuse act-quant/norm prologue (L1/L2). Track B optimized the wrong FP4 kernel.
|
||||
|
||||
### Evidence (DGX, this agent)
|
||||
- `~/bench/postssm_decomp/postssm_base.{nsys-rep,sqlite,gpu_trace.csv,run.log}` (llama post-SSM).
|
||||
- `~/bench/postssm_decomp/vllm_decode.{nsys-rep,gpu_trace.csv}` (vLLM eager decode trace).
|
||||
- `~/bench/postssm_decomp/vllm_decode_g1.*` (vLLM graphs run), `~/bench/vllm_tps.py` (throughput).
|
||||
- Scripts: `~/bench/postssm_llama_decomp.sh`, `~/bench/vllm_nsys_run.sh`, `~/bench/decode_decomp2.py`
|
||||
(decode-only windowed, overlap-correct, MB-memcpy, per-step reconstruction).
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
## Section: SYNTHESIS (cross-check + ground-truth + ranked levers + verdict) - FINALIZED
|
||||
|
||||
Agent label: synthesize. Read-only (no GPU). Cross-checks all sections above against the
|
||||
fresh `profile-both-engines` ground-truth, then mechanism-confirms the dominant lever by
|
||||
reading the model graph + ggml-cuda dispatch source on the DGX (`~/llama-paged-dev`, HEAD
|
||||
46d7dd8 = patch 0019). All throughput vs the vLLM 391 t/s eager apples-to-apples reference.
|
||||
|
||||
### 0. Headline
|
||||
|
||||
Post-SSM dense decode = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact. The residual is
|
||||
NOT a hardware/architecture floor and NOT the GDN recurrence kernel, the host loop, CUDA
|
||||
graphs, or DRAM byte-volume. It is ONE concrete, llama-specific kernel-routing defect:
|
||||
**the gated-DeltaNet output projection (`ssm_out`) runs as an FP4 GEMV (`mul_mat_vec_q`)
|
||||
at decode batch 128 instead of a tensor-core FP4 GEMM (MMQ), costing 132 ms/step = 26% of
|
||||
decode = the single biggest overage vs vLLM (which packs the same projection into a cutlass
|
||||
M=128 GEMM).** The fix is a ~2-line reshape, bit-exact, and is the highest-value next step.
|
||||
|
||||
### 1. Cross-check: which prior findings HELD, were REFUTED, or are SUPERSEDED
|
||||
|
||||
HELD (confirmed by both the adversarial re-derivation and the fresh profile):
|
||||
- Pre-fix decomposition (gated_delta_net 23.4%, k_get_rows 21.9%, MEMCPY-DtoD 18.9% / 382 GB,
|
||||
mul_mat_vec_q 15.5%, mul_mat_q 10.5%): reproduced to <=0.1pp (validate-findings).
|
||||
- SSM-fix D2D collapse: the 18.4 GB/step redundant recurrent-state copy is GONE. Confirmed
|
||||
three ways: validate (18.9% -> 0.008% on the post-fix sqlite), weight-bandwidth (A/B kernel
|
||||
sum lists no DtoD term), and IN-TRACE by the profiler (18 GB/step DtoD -> 131 MB/step). The
|
||||
SSM fix (0018/0019) is the real breakthrough and is working.
|
||||
- P2a FP4-GEMM occupancy remap FLAT on decode (+0.6% noise) while the `mul_mat_q` kernel itself
|
||||
shrank -24.7% and prefill rose +12.7%: confirmed. Decode is not GEMM-occupancy-bound.
|
||||
- 65% of vLLM (254/391 = 64.96%, 256.6/391 = 65.6%): confirmed.
|
||||
- Decode is NOT at the bandwidth floor: 55.5 GB/step moved at 2.48x the 273 GB/s floor (40% util)
|
||||
vs vLLM 1.61x (62% util) on the SAME bytes. Confirmed + LOCALIZED below.
|
||||
- Host loop / 64-layer serialization is NOT the lever: both engines ~98% GPU-busy at npl128
|
||||
(llama 98.7%, vLLM 97.9%); the entire exposed-idle budget is ~0.65%. Confirmed by the profiler.
|
||||
- CUDA graphs are NOT the lever: vLLM is 394 t/s EAGER, graphs add only +6% (420); llama already
|
||||
runs with graphs. Confirmed by the profiler.
|
||||
|
||||
REFUTED / CORRECTED:
|
||||
- "GDN recurrence kernel is the dominant residual lever" (the STATE brief's "gated_delta_net
|
||||
1.46 ms/call, the largest single kernel" and the gdn-source-compare framing): REFUTED. The
|
||||
profiler's fresh side-by-side per-call duration is llama 4.03 ms vs vLLM 3.62 ms = only +11% /
|
||||
+19 ms/step = ~10% of the 184 ms gap. It IS the largest single kernel on both sides (38% llama,
|
||||
53% vLLM) but the largest GAP is elsewhere. (The brief's "1.46 ms/call" is a stale/narrower
|
||||
window; the authoritative post-SSM per-call is 4.03 ms.) gdn-source-compare's occupancy/shuffle/
|
||||
fusion anatomy is correct but addresses a SECONDARY +19 ms target, not parity.
|
||||
- "+66% SSM-fix gain" label: REFUTED. 146 -> 254-257 is +74 to +76%; "66%" is the percent-of-vLLM,
|
||||
not the speedup (validate-findings).
|
||||
|
||||
SUPERSEDED (the gap validate-findings flagged, now filled by real data):
|
||||
- The "FP4-GEMM ~48% / get_rows 0.7% / GDN 22.5%" Step-2 split had NO surviving sqlite (the
|
||||
producer script crashed; only a Step-1 build was on the box). The profiler's fresh Step-2 trace
|
||||
replaces it with a FINER, load-bearing breakdown: the ~46% "FP4 matmul" bucket is NOT one GEMM
|
||||
family - it splits into `mul_mat_vec_q` 26% (the o_proj GEMV, the real culprit), `mul_mat_q` 17%
|
||||
(the tensor-core GEMM P2a already optimized), and `quantize_mmq_nvfp4` 3%. Lumping them as
|
||||
"48% FP4-GEMM" hid that Track B P2a optimized the 17% MMQ while the 26% MMVQ was the bind. This
|
||||
is why P2a was flat on decode: **it optimized the wrong FP4 kernel.**
|
||||
|
||||
### 2. Ground-truth per-step decode decomposition + the single biggest overage
|
||||
|
||||
From the profiler's fresh post-SSM eager nsys, both at batch 128, prefill-free, GPU-accurate:
|
||||
|
||||
| component (per decode step) | llama ms | llama% | vLLM ms | vLLM% | gap (llama-vLLM) |
|-----------------------------|----------|--------|---------|-------|------------------|
| GDN recurrence kernel | 193 | 38% | 174 | 53% | **+19** |
| FP4 matmul + act-quant | 236 | 46% | 117 | 36% | **+119** |
| - mul_mat_vec_q (o_proj GEMV) | **132** | **26%** | 0 | - | **+132** |
| - mul_mat_q (MMQ GEMM) | 88 | 17% | 61 (cutlass) | 19% | +27 |
| - quantize_mmq_nvfp4 | 16 | 3% | 55 (nvjet+cvt) | 17% | -39 |
| full attention (16 layers) | 6.6 | 1.3% | 6.2 | 1.9% | +0.4 |
| SSM conv + glue/elementwise | 45 | 9% | 22 | 7% | +23 |
| MEMCPY | 2.5 | 0.5% | 0.36 | 0.1% | +2 |
| **TOTAL** | **~510** | 100% | **~326** | 100% | **+184** |
|
||||
|
||||
The +119 ms FP4-matmul gap is ENTIRELY the `mul_mat_vec_q` o_proj GEMV (+132), partly offset
|
||||
by llama being -39 on activation-quant (16 vs vLLM's heavier eager 55) and +27 on the MMQ. So
|
||||
the one lever that matters is the +132 ms/step o_proj GEMV; everything else nets to ~+52 ms.
|
||||
|
||||
**MECHANISM (confirmed by source read, not inferred).** In the dense Qwen3.5-27B GDN block
|
||||
(`src/models/qwen3next.cpp` `build_recurrent`), the recurrent core keeps the SSM layout
|
||||
`[feat, n_seq_tokens, n_seqs]`. At decode `n_seq_tokens=1, n_seqs=128`. The output projection is:
|
||||
|
||||
```cpp
// current code (qwen3next.cpp, end of the GDN block)
ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm,
        head_v_dim * num_v_heads, n_seq_tokens, n_seqs);            // [6144, 1, 128]
cur = build_lora_mm(model.layers[il].ssm_out, final_output);       // <-- the matmul
cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);   // collapse AFTER
```
|
||||
|
||||
`final_output` is 3D `[6144, n_seq_tokens=1, n_seqs=128]`, so `src1->ne[1] = 1`. The ggml-cuda
|
||||
dispatch (`ggml-cuda.cu:2553`) picks MMVQ when `src1->ne[1] <= MMVQ_MAX_BATCH_SIZE (8)`, with the
|
||||
128 sequences carried in `ne[2]`. Result: a per-sequence FP4 GEMV, output rows 5120 x 128 seqs =
|
||||
**`mul_mat_vec_q`, grid 5120x128, 48 calls/step (one per GDN layer)** - matching the profiler's
|
||||
trace exactly. MMVQ does NOT amortize the `ssm_out` weight read into shared memory across the 128
|
||||
sequences (it is built for batch <=8), so each of the 128 sequences re-streams the weight tiles -
|
||||
the "40% vs 62% utilization" the weight-bandwidth section measured lives HERE, in this kernel, not
|
||||
in the GDN state traffic. vLLM packs all 128 decode tokens into one cutlass M=128 GEMM (its GDN
|
||||
kernel is literally `..._PACKED_decode`), so it has NO GEMV-at-batch-128 at all.
|
||||
|
||||
This also pins WHY it is decode-specific: at prefill the tokens are in `ne[1]` (n_seq_tokens=prompt
|
||||
len), so `ne[1] >> 8` -> MMQ already; only the decode layout (128 seqs x 1 token, batched in ne[2])
|
||||
trips the GEMV path. The in-projection (`wqkv`) is unaffected: its input is the 2D residual stream
|
||||
`[n_embd, 128]` (reshaped to 3D only AFTER the matmul), so `ne[1]=128` -> MMQ today. The o_proj is
|
||||
the unique 3D-input matmul, which is exactly why the profiler counted one MMVQ per GDN layer.
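
The dispatch rule described above can be sketched as a toy model. This is a simplified stand-in, not the real ggml-cuda dispatch: `pick_kernel`, `collapse_to_2d`, and `kMmvqMaxBatch` are hypothetical names, and the only facts taken from the text are the `ne[1] <= 8` threshold and the `[6144, 1, 128]` decode shape.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical constant mirroring MMVQ_MAX_BATCH_SIZE (8) as cited in the text.
constexpr int64_t kMmvqMaxBatch = 8;

enum class Kernel { MMVQ, MMQ };

// ne is the activation (src1) shape in ggml order:
// ne[0] = features, ne[1] = rows seen by the matmul, ne[2] = outer batch.
// Toy rule: only ne[1] decides the kernel family; ne[2] is ignored.
Kernel pick_kernel(const std::array<int64_t, 3> & ne) {
    return ne[1] <= kMmvqMaxBatch ? Kernel::MMVQ : Kernel::MMQ;
}

// Collapse [feat, n_seq_tokens, n_seqs] into [feat, n_seq_tokens * n_seqs, 1],
// i.e. move the sequences out of ne[2] and into the row dimension.
std::array<int64_t, 3> collapse_to_2d(const std::array<int64_t, 3> & ne) {
    assert(ne[1] > 0 && ne[2] > 0);
    return {ne[0], ne[1] * ne[2], 1};
}
```

Under this model the decode shape `{6144, 1, 128}` selects the GEMV family, while the collapsed `{6144, 128, 1}` and any prefill shape with more than 8 tokens in `ne[1]` select the GEMM family, which is the asymmetry the paragraph above describes.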
|
||||
|
||||
### 3. Ranked remaining decode levers (impact x tractability, cumulative ceiling toward 391)
|
||||
|
||||
Anchored on llama 256.6 t/s (499 ms/step) -> vLLM 391 (327 ms/step), gap 172 ms/step. Recovery
|
||||
figures past Lever 1 are ESTIMATES (the profiler measured the costs, not the post-fix kernels);
|
||||
each needs a confirming re-profile. Ceilings are cumulative.
|
||||
|
||||
| # | lever | targets (ms/step) | est. recover | cumulative decode_agg | % of vLLM | tractability |
|---|-------|-------------------|--------------|-----------------------|-----------|--------------|
| 1 | **o_proj MMVQ -> MMQ** (collapse final_output to 2D before `ssm_out`) | vec_q 132 | ~100-110 | ~320-330 | **~82-85%** | **VERY HIGH** (2-line reshape, bit-exact, MMQ already proven on NVFP4 at M=128 by the in_proj) |
| 2 | act-quant + norm prologue fusion (explore L1 `LLAMA_FUSE_NVFP4_QUANT=1` re-bench + L2 M=128 gate) | quant 16 + 448 launches/step | ~15-25 | ~345-360 | ~88-92% | MED-HIGH (producer code exists, tasks 38-40; needs post-0019 re-bench, the pre-SSM regression is stale) |
| 3 | GDN-area fusion + occupancy (gdn A-D: row-local reduction, raise launch_bounds occupancy, fold gate/l2norm/softplus into the recurrence) | GDN +19 + glue +23 | ~25-40 | ~375-388 | ~96-99% | MED-LOW (real kernel rewrite + numeric re-validation) |
| 4 | conv-state in-place + conv fuse (explore L4, the proven 0018/0019 pattern on `ssm_conv`/concat) | part of glue, 48 launches/step | ~5-10 | ~388-395 | ~99-101% | HIGH (bit-exact, proven pattern) |
| - | between-step host gap / cgraph reuse | ~2 ms/step | ~2 | +~0.4% | n/a | LOW value (cleanup, not a parity lever) |
| x | CUDA graphs | - | 0 | already on | n/a | NOT a lever (+6% even for vLLM) |
| x | TMA weight-feed / NVFP4-dense weight-quant | prefill / npl1 | 0 at npl128 | n/a | n/a | MIS-SCOPED for batch-128 decode (prefill / low-batch levers; prefill already +12.7%) |
|
||||
|
||||
Note on Lever 1+2 coupling: routing the o_proj to MMQ ADDS one activation-quant (q8_1/NVFP4) per
|
||||
o_proj, so Lever 2 (fusing that quant into the preceding `build_norm_gated`) compounds Lever 1
|
||||
rather than overlapping it. Lever 3's "glue +23 ms" and Lever 1's quant are the same elementwise
|
||||
passes vLLM folds into its packed kernel, so 2+3 share surface - treat the estimates as a band,
|
||||
not a sum.
|
||||
|
||||
### 4. Verdict: is true decode parity reachable?
|
||||
|
||||
**Yes, parity is reachable in software, and the residual is NOT a hardware/architecture floor.**
|
||||
Proof of "not a floor": both engines read identical NVFP4 weights and read+write identical f32
|
||||
recurrent state = identical 55.5 GB/step DRAM floor (203 ms) on the identical GB10 LPDDR5x; vLLM
|
||||
achieves 62% bandwidth utilization (327 ms/step) where llama achieves 40% (499 ms/step). The 1.54x
|
||||
throughput gap equals the 1.55x utilization gap, and that utilization gap is now LOCALIZED to
|
||||
specific llama kernels - chiefly the o_proj MMVQ - every one of which is closable in software. The
|
||||
GDN recurrence (the supposed floor) is only +11%/call between the two engines.
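
The floor and utilization figures in this proof are plain arithmetic, which a minimal sketch can reproduce. The function names are hypothetical; the inputs (55.5 GB/step, 273 GB/s, npl 128, 256.6 and 391 t/s) are the ones quoted in the text, and the model assumes the whole 55.5 GB is streamed once per step at the spec bandwidth.

```cpp
#include <cassert>

// Time in ms to move `gigabytes` at `gb_per_s`: the DRAM floor for one decode step.
double dram_floor_ms(double gigabytes, double gb_per_s) {
    assert(gb_per_s > 0.0);
    return 1000.0 * gigabytes / gb_per_s;
}

// Per-step time in ms implied by aggregate decode throughput: step = npl / tps.
double step_ms(double npl, double tokens_per_s) {
    assert(tokens_per_s > 0.0);
    return 1000.0 * npl / tokens_per_s;
}

// Fraction of the measured step that the bandwidth floor accounts for.
double utilization(double floor_ms, double measured_step_ms) {
    assert(measured_step_ms > 0.0);
    return floor_ms / measured_step_ms;
}
```

With the quoted inputs, `dram_floor_ms(55.5, 273.0)` is about 203 ms, and dividing it by `step_ms(128, 256.6)` and `step_ms(128, 391)` gives roughly 0.41 and 0.62, in line with the 40% and 62% utilization figures above.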
|
||||
|
||||
How far each tier reaches:
|
||||
- The first ~84% of parity (256 -> ~325) is nearly FREE: Lever 1 is a 2-line reshape that moves
|
||||
the GDN output projection from a per-sequence FP4 GEMV to a tensor-core M=128 FP4 GEMM, bit-exact,
|
||||
no new kernel (MMQ already runs the in-projection at this exact shape and type).
|
||||
- ~84% -> ~92% (Levers 1+2) is low-effort: the fused act-quant producer already exists (tasks
|
||||
38-40), it just needs a post-0019 re-bench because its pre-SSM regression was measured when the
|
||||
GPU was 99% busy on the now-removed state-copy chain (no idle to reclaim then; real idle now).
|
||||
- ~92% -> ~100% (Levers 3+4) is the diminishing-returns tail and the only genuinely HARD work:
|
||||
matching vLLM's fully-fused `packed_decode` GDN kernel (row-local reductions, higher occupancy,
|
||||
folding the gate/l2norm/softplus elementwise passes into the recurrence). This last ~8% is "hard
|
||||
but not floored" - it is kernel engineering, not a hardware wall.
|
||||
|
||||
**Single highest-value next step (do this first):** apply Lever 1 - collapse `final_output` to 2D
|
||||
`[head_v_dim*num_v_heads, n_seq_tokens*n_seqs]` BEFORE the `ssm_out` matmul (drop the now-redundant
|
||||
post-matmul `reshape_2d`):
|
||||
|
||||
```cpp
// route the GDN output projection through tensor-core MMQ at decode:
// M = n_seq_tokens*n_seqs (=128 at decode) instead of ne[1]=1 -> MMVQ GEMV. Free, bit-exact.
ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm,
        head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cur = build_lora_mm(model.layers[il].ssm_out, final_output); // now [n_embd, n_tokens], M=128 MMQ
```
|
||||
|
||||
Then the profiler re-measures the realized o_proj-as-MMQ cost on a clean post-0019 nsys (the one
|
||||
number this synthesis estimates rather than measures) and confirms the 256 -> ~320-330 lift. The
|
||||
same 3D-input-matmul pattern almost certainly affects the MoE checkpoint (q36-35b-a3b) decode and
|
||||
any other matmul that consumes a tensor still in the `[feat, 1, n_seqs]` SSM layout - grep those
|
||||
and apply the same collapse. Levers 2-4 follow in priority order; none requires a model or accuracy
|
||||
compromise, so bit-exactness is preserved throughout.
|
||||
|
||||
### Evidence (this section)
|
||||
- Source read (DGX `~/llama-paged-dev`, read-only): `src/models/qwen3next.cpp` (GDN in/out proj
|
||||
layout, lines ~286-305 and ~518-528), `ggml/src/ggml-cuda/ggml-cuda.cu:2553` (MMVQ dispatch on
|
||||
`ne[1]<=8`), `ggml/src/ggml-cuda/mmvq.cuh:3` (`MMVQ_MAX_BATCH_SIZE 8`), `mmq.cu:267` (NVFP4 is
|
||||
MMQ-supported).
|
||||
- All five prior sections of this doc + the profiler's `~/bench/postssm_decomp/` traces.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,184 +0,0 @@
|
||||
# F16/BF16 Glue Probe - the dense decode residual to vLLM
|
||||
|
||||
Question: dense decode parity sits at llama 384.6 vs vLLM 418.8 t/s @ npl128 = 91.8%.
|
||||
The 49% SSM recurrence (f32 BOTH engines) and the 27% NVFP4 GEMM (W4A4 BOTH) are
|
||||
precision-matched. The residual ~8% may be partly that llama runs the NON-recurrence
|
||||
GLUE (attention, norms, activations, elementwise, residual stream) in F32 while vLLM
|
||||
runs the model in BF16. This probe settles, empirically on q36-27b-nvfp4 @npl128, how
|
||||
much of that residual is realistically f16/bf16-closable.
|
||||
|
||||
Model: Qwen3.5-27B NVFP4 (dense). 64 layers = 16 attention + 48 gated-DeltaNet
|
||||
(SSM) recurrent. Build b104-f7409c2 (patch 0023), verified git-clean and coherent.
|
||||
The bf16 SSM work was never applied to the tree (only saved as a diff backup);
|
||||
ggml-cuda needed no recompile on rebuild, so the binary is bit-identical to clean 0023.
|
||||
|
||||
## (1) Current KV / state dtype (SETTLED)
|
||||
|
||||
From the `-v` init log:
|
||||
|
||||
- ATTENTION KV cache (16 of 64 layers):
|
||||
`K (f16): 1280 MiB, V (f16): 1280 MiB` => **DEFAULT IS ALREADY F16.**
|
||||
- RECURRENT cache (48 gated-DeltaNet layers):
|
||||
`R (f32): 180 MiB` (conv state), `S (f32): 4608 MiB` (SSM state) => **f32.**
|
||||
|
||||
Consequence: the attention KV is ALREADY at vLLM's 16-bit bit-width. `--cache-type f16`
|
||||
is a literal no-op; the cheap KV lever is spent. The f32 lives in (a) the recurrent
|
||||
SSM/conv state (matched to vLLM, the bf16 version is shelved for failing the f32 KL
|
||||
gate) and (b) the intermediate-activation glue (norms, residual stream, attention
|
||||
compute, activations) - that glue is where llama still pays f32 vs vLLM bf16.
|
||||
|
||||
## (2) Decode kernel budget (nsys --cuda-graph-trace=node, npl128, 39 steady steps)
|
||||
|
||||
step span 342.0 ms ; sum-of-kernels 338.8 ms ; **kern/span 99.0%** - the decode is
|
||||
GPU-bound, kernels back-to-back, nsys overhead negligible. The measured bench step
|
||||
(128 tok / 373.5 t/s = 342.8 ms) equals the nsys span, so the %-of-step figures below
|
||||
ARE wall-time fractions.
|
||||
|
||||
OUT of scope - already precision-matched (83.2% of the step):
|
||||
|
||||
| kernel | ms/step | % |
|---|---:|---:|
| gated_delta_net (SSM recurrence, f32 BOTH) | 167.1 | 49.3 |
| mul_mat_q NVFP4 (W4A4 GEMM, BOTH) | 93.0 | 27.4 |
| quantize_mmq_nvfp4 (FP4 act-quant) | 17.6 | 5.2 |
| mul_mat_q stream_k fixup (FP4 reduction) | 4.1 | 1.2 |
|
||||
|
||||
F16-ABLE GLUE - f32 in llama, bf16 in vLLM:
|
||||
|
||||
Budget A (clean compute glue, decoupled from the f32 state):
|
||||
|
||||
| kernel | ms/step |
|---|---:|
| flash_attn_ext | 11.94 |
| unary_gated_op (silu) | 5.16 |
| k_bin_bcast (mul) | 4.72 |
| rms_norm | 3.58 |
| k_bin_bcast (add, residual) | 1.67 |
| l2_norm | 0.65 |
| cpy_scalar | 0.37 |
| rope | 0.26 |
| sigmoid | 0.22 |
| softplus | 0.09 |
| flash_attn fixups | 0.08 |
| **Budget A total** | **28.74 ms = 8.4% of step** |
|
||||
|
||||
Budget B (+ the non-FP4 cublas GEMM): + nvjet 12.17 ms => **40.91 ms = 12.0%**.
|
||||
|
||||
Recurrence-coupled data movement (NOT bit-safe f16-able - needs the f32 state to go
|
||||
bf16, which is the shelved work that fails the f32 KL gate):
|
||||
ssm_conv 8.37 + k_get_rows_float 6.98 + k_set_rows 0.66 + gdn_gather 0.06 = 16.08 ms = 4.7%.
|
||||
|
||||
## (3) Cache-type A/B (decode_agg S_TG t/s, dense)
|
||||
|
||||
| npl | DEFAULT | F16-explicit | Q8_0 |
|---:|---:|---:|---:|
| 32 | 209.05 | 208.75 | 208.63 |
| 128 | 373.46 | 373.56 | 374.71 |
|
||||
|
||||
- F16-explicit == DEFAULT (0.03% delta) => proves the default KV is already f16; the
|
||||
flag is a no-op.
|
||||
- Q8_0 (8-bit, half the f16 KV bytes) is within noise at every npl => the attention KV
|
||||
bandwidth is NOT a decode bottleneck (it is 16/64 layers; flash_attn is 3.5% of the
|
||||
step). The KV-cache dtype is not a decode lever for this model.
|
||||
- Coherence (48-tok greedy, "The capital of France is"): default and q8_0 both fully
|
||||
coherent; q8_0 only causes minor greedy-path divergence, no quality break. But since
|
||||
q8_0 buys zero speed and is not bit-exact, it is pointless here.
|
||||
|
||||
## Read: how much of the ~8% dense residual is f16-closable
|
||||
|
||||
The gap is ~27 ms/step (llama 332.8 ms vs vLLM 305.7 ms at npl128).
|
||||
|
||||
f16 does not zero the glue, it speeds it up. Realistic recovery:
|
||||
- Memory-bound glue (norms + elementwise + activations + copies + rope = 16.7 ms):
|
||||
f16 halves the bytes => ~50% => ~8.4 ms.
|
||||
- flash_attn_ext (12.0 ms): KV is ALREADY f16 and the accumulation must stay f32
|
||||
(vLLM also f32-accumulates), so only the Q/projection side helps => ~25% => ~3.0 ms.
|
||||
- Budget A realistic recovery ~= **11.4 ms**.
|
||||
- nvjet non-FP4 GEMM (12.2 ms): bf16 tensor cores vs f32 ~= 40-50% => ~5 ms, but
uncertain (may already run TF32) => Budget A + nvjet recovery ~= **16 ms**.
|
||||
|
||||
So f16/bf16 glue realistically recovers **~11 ms (glue only) to ~16 ms (+GEMM) of the
|
||||
~27 ms gap = roughly 40-60% of the dense residual.** That moves parity 91.8% ->
|
||||
~95-96%, NOT a full close. The remaining ~3-4% is structural: cublas GEMM efficiency
|
||||
on the non-FP4 paths, graph/launch scheduling vs vLLM, and the irreducible f32
|
||||
accumulation in attention and the recurrence.
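
The recovery estimate above can be reproduced with a short sketch. The struct and function names are hypothetical; the budgets (16.7, 12.0, 12.2 ms), the speed-up fractions (50%, 25%, 40-50%) and the step times (332.8 vs 305.7 ms) are the figures quoted in this section, and the fractions remain the document's estimates rather than measurements.

```cpp
#include <cassert>

struct Recovery {
    double glue_ms;       // Budget A only: memory-bound glue + flash_attn
    double with_gemm_ms;  // Budget A plus the non-FP4 (nvjet) GEMM
};

// Each bucket recovers (bucket time) x (estimated fraction saved by f16/bf16).
Recovery estimate_recovery(double mem_glue_ms, double flash_attn_ms, double gemm_ms,
                           double glue_frac, double attn_frac, double gemm_frac) {
    const double glue = mem_glue_ms * glue_frac + flash_attn_ms * attn_frac;
    return {glue, glue + gemm_ms * gemm_frac};
}

// Parity vs the reference engine after shaving `recovered_ms` off the step.
double parity_after(double step_ms, double ref_step_ms, double recovered_ms) {
    assert(step_ms > recovered_ms);
    return ref_step_ms / (step_ms - recovered_ms);
}
```

For example, `estimate_recovery(16.7, 12.0, 12.2, 0.5, 0.25, 0.4)` yields about 11.4 ms for the glue alone and about 16.2 ms with the GEMM; feeding those into `parity_after(332.8, 305.7, ...)` lands between roughly 0.95 and 0.97, consistent with the "not a full close" reading above.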
|
||||
|
||||
Caveats for a build decision:
|
||||
1. The single largest f16-able line (flash_attn 11.9 ms) is the LEAST recoverable
|
||||
(KV already f16, accumulate stays f32). The cleanly recoverable mass is the
|
||||
norms+elementwise+activations (~16.7 ms).
|
||||
2. The recurrence-coupled 4.7% (ssm_conv + state gather) is only f16-able by taking the
|
||||
SSM/conv state to bf16 = the already-built, already-shelved work that fails the f32
|
||||
KL gate. It is OUT of a bit-safe f16 build.
|
||||
3. f16 glue is NON-bit-exact (same category as the shelved bf16 SSM state). It would be
|
||||
an OPT-IN fast path, not the bit-exact default. Realistic ceiling ~95-96% parity for
|
||||
a meaningful (norms/elementwise/activations + optionally nvjet) f16 conversion, at
|
||||
the cost of leaving the 95%-bit-exact f32 plateau.
|
||||
|
||||
## (4) What it costs to capture it: NOT a flag (source map, read-only)
|
||||
|
||||
The asymmetry confirmed at the source level (DGX `~/llama-paged-dev` @ f7409c2, tree
|
||||
git-clean; vLLM ref from BITEXACT_VS_VLLM.md):
|
||||
- vLLM `text_config.dtype = bfloat16` => the ENTIRE non-quantized compute (residual
|
||||
stream, RMSNorm I/O with f32-internal reduction, FlashAttention out, SiLU, gating,
|
||||
conv state) runs in BF16. Only the gated-DeltaNet temporal SSM state is f32
|
||||
(`mamba_ssm_dtype="float32"`, matched to llama).
|
||||
- llama's intermediate activations are F32 **by construction, everywhere**:
|
||||
`ggml_mul_mat` hardcodes an F32 result (ggml.c:3250), so the stream snaps back to F32
|
||||
after EVERY projection (Q/K/V/O, wqkv, ssm in/out, ffn up/gate/down, eh_proj, lm_head).
|
||||
`ggml_rms_norm`/`ggml_l2_norm`/`ggml_silu`/`ggml_add`/`ggml_mul`/`flash_attn_ext`/
|
||||
`ggml_ssm_conv` all preserve/emit F32. There is no point where the stream is f16.
|
||||
|
||||
There is **no vLLM-style global model-compute-dtype knob** in ggml/llama. You cannot flip
|
||||
one model-load flag. Three escalating options, all opt-in / non-bit-exact:
|
||||
|
||||
- A flag: does not exist and cannot exist as-is - the F32 is structural, not a default.
|
||||
- Option 1 (targeted per-op f16, no new kernels): silu/sigmoid/softplus (unary.cu),
|
||||
add/mul (binbcast.cu), rope already have f16 paths. But the residual stream stays F32,
|
||||
so each op must be wrapped cast(F16)->op->cast(F32), adding 2 `cpy` ops per op. At
|
||||
decode these ops are tiny and memory-bound; the cast traffic ~= the op traffic, so the
|
||||
net win is near-zero or negative unless the cast is FUSED into the producer/consumer.
|
||||
Crucially this CANNOT capture the norms - the largest glue item.
|
||||
- Option 2 (the real lever, multi-file code change): carry the residual stream in F16
|
||||
across the layer, cast to F32 only at the quantize boundary. Requires (a) f16 projection
|
||||
output (patch `ggml_mul_mat` to honor a dst-type, or a cpy->F16 after each proj),
|
||||
(b) **NEW F16 template instantiations in norm.cu** for rms_norm / l2_norm / fused
|
||||
rms+mul / fused rms+mul+add (today hard-`GGML_ASSERT(type==F32)` at norm.cu:441-442,
|
||||
465-466, 525-527, 601-604) keeping the f32 reduction, (c) optionally an F16 ssm-conv.cu,
|
||||
plus graph-dtype plumbing in qwen35.cpp / llama-graph.cpp to thread F16 through
|
||||
inpL/cur/the residual adds. The single biggest code item is the norm.cu f16 kernels -
|
||||
the exact band vLLM runs in bf16 that Option 1 cannot reach.
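
To make Option 1's cast overhead concrete, here is a naive byte-count sketch for one unary elementwise op. It is an assumption-laden model, not a measurement: it counts only DRAM reads and writes, assumes nothing is fused or cached, and uses hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by a unary elementwise op over n elements: one read plus one write.
std::size_t unary_bytes(std::size_t n, std::size_t elem_size) {
    return n * elem_size * 2;
}

// Bytes moved by a dtype cast over n elements: read the source, write the destination.
std::size_t cast_bytes(std::size_t n, std::size_t src_size, std::size_t dst_size) {
    return n * (src_size + dst_size);
}

// Unfused cast(F16) -> op -> cast(F32) wrapped around an F32 residual stream.
std::size_t wrapped_f16_bytes(std::size_t n) {
    return cast_bytes(n, 4, 2) + unary_bytes(n, 2) + cast_bytes(n, 2, 4);
}
```

In this model the plain F32 op moves 8 bytes per element while the unfused wrapper moves 16, so the f16 op itself is cheaper but the two extra `cpy` passes more than cancel it. That sits at the "negative" end of the range Option 1 describes and is why the win depends on fusing the casts into the producer or consumer.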
|
||||
|
||||
Must-stay-f32 regardless (vLLM does the same): RMSNorm/L2Norm sum-of-squares reduction;
|
||||
FlashAttention KQ/softmax accumulation (forced `GGML_PREC_F32`, llama-graph.cpp:2117);
|
||||
the gated-DeltaNet recurrent SSM temporal state (f32 BOTH engines, out of scope); the
|
||||
src1->q8_1/nvfp4 activation quantization reads F32, so the stream must be F32 at every
|
||||
projection boundary no matter what.
|
||||
|
||||
## Verdict: probe-further-then-decide, leaning not-worth-it for the default
|
||||
|
||||
f16 does NOT meaningfully close the dense residual on its own, and what it can close is a
|
||||
multi-file non-bit-exact build, not a flag.
|
||||
|
||||
- Precision is NOT the dominant cause of the 8% gap. 83.2% of the decode step (recurrence
|
||||
49.3% + FP4 GEMM 27.4% + FP4 act-quant/fixup 6.4%) is already precision-matched f32/W4A4
|
||||
on both engines. The f16-able glue is only 8.4% of the step (Budget A); of the ~27 ms
|
||||
gap, f16 realistically recovers ~11 ms (glue) to ~16 ms (+ the uncertain nvjet GEMM) =
|
||||
40-60% of the residual. The remaining ~3-4% is kernel/scheduling efficiency (non-FP4
|
||||
cublas GEMM, graph-launch overhead, irreducible f32 accumulation) that f16 cannot touch.
|
||||
- The recoverable mass is the norm+elementwise+activation band, which is precisely the
|
||||
part that needs NEW f16 norm kernels (Option 2). The no-new-kernel ops (Option 1) are
|
||||
too small and their cast overhead likely eats the win.
|
||||
- Any version is opt-in / non-bit-exact, the same gate-failing category as the already
|
||||
shelved bf16-SSM-state work. It cannot be the bit-exact f32 default; it is a second,
|
||||
separately-maintained fast path with a ~95-96% ceiling.
|
||||
|
||||
Recommendation: do NOT build the f16 glue path now. Ship the 95%-bit-exact f32 plateau
|
||||
(patches 0018-0023) as the default. If chasing the last 4% later, the only lever worth a
|
||||
build is Option 2's norm.cu f16 kernels + f16 residual stream (recovers the norm/elementwise
|
||||
band, ~11 ms); gate it behind an explicit opt-in flag and validate it against the same KL
|
||||
threshold as bf16-SSM before shipping. The non-FP4 cublas GEMM efficiency and graph-launch
|
||||
scheduling - the structural ~3-4% - are a better long-term target than precision, because
|
||||
they help the bit-exact default too.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,532 +0,0 @@
|
||||
# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — build-ready scope + honest go/no-go
|
||||
|
||||
Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** Track B is the
|
||||
residual-kernel track after track A (fuse the standalone `quantize_mmq_fp4` activation-requant, the
|
||||
8.2% decode bucket — tasks 38-41, the fused `rms_norm+mul+nvfp4-quant` producer + prequantized-MMQ
|
||||
consumer) is handled separately. Track B owns the **weight GEMM**, the ~59% bucket.
|
||||
|
||||
**The load-bearing question, restated:** at the decode batch shape (M≈128 tokens fused into one
|
||||
ubatch, NVFP4 weights), is the weight GEMM **compute-bound** (FP4-MMA throughput is the lever →
|
||||
parity reachable with a better kernel) or **bandwidth-bound** (273 GB/s weight-read is a hard floor →
|
||||
parity capped)? And given the GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach
|
||||
vLLM's **391 (dense) / 811 (MoE)** decode-agg tok/s @npl128, or only partway?
|
||||
|
||||
Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x.
|
||||
Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the
|
||||
committed nsys decomposition + measured GB10 specs + a source read of the FP4-MMA kernel; **no new GPU
|
||||
benchmarks were run** (track A is on the box).
|
||||
|
||||
## 0. Grounded inputs (measured, committed)
|
||||
|
||||
| quantity | value | source |
|---|---|---|
| LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` |
| LPDDR5x bandwidth (achieved, batch-1 weight read) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 study |
| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 |
| BF16 / INT8 peak | ~213 TFLOP/s / ~215 TOPS (INT8 == BF16 on GB10) | same §2 |
| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); **~3% dense / ~35%-of-BW MoE at decode** | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file; ~18 GB matmul tensors | `du` on DGX |
| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step @npl128 (~98% experts hit) | `du` on DGX |
| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
| Measured per-step @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
| Aggregate decode @npl128 (the parity scoreboard) | dense **161** (llama) vs **391** (vLLM); MoE **333** vs **811** | `QWEN36_NVFP4_BENCH.md` |
|
||||
|
||||
`decode_agg = npl / step_s = 128 / step_s`. Crossover formula throughout:
|
||||
`M* = b · peak / (2 · BW)`, `b` = bytes per weight element. Below `M*` bandwidth-bound, above it
|
||||
compute-bound.
|
||||
|
||||
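The two formulas above can be checked with a minimal Python sketch. The function names are illustrative (they do not exist in the llama.cpp tree), and the inputs assume the §0 values: ~18 GB of matmul weights for 27e9 parameters, the upper 500 TFLOP/s FP4 peak, and the 273 GB/s spec bandwidth.

```python
def crossover_m(bytes_per_weight: float, peak_flops: float, bandwidth: float) -> float:
    """Batch size M* = b * peak / (2 * BW); below it the GEMM is bandwidth-bound."""
    return bytes_per_weight * peak_flops / (2.0 * bandwidth)


def decode_agg(npl: int, step_s: float) -> float:
    """Aggregate decode throughput in tok/s for npl parallel sequences."""
    return npl / step_s


# Dense Qwen3.6-27B NVFP4: ~18 GB of matmul weights over 27e9 parameters.
b_dense = 18e9 / 27e9
m_star = crossover_m(b_dense, peak_flops=500e12, bandwidth=273e9)

print(f"M* at FP4 peak   : {m_star:.0f}")              # about 611, well above M=128
print(f"llama decode_agg : {decode_agg(128, 0.795):.0f} tok/s")
print(f"vLLM  decode_agg : {decode_agg(128, 0.328):.0f} tok/s")
```

With the lower 427 TFLOP/s peak the crossover drops to roughly 520, which is still far above the decode M of 128, so the bandwidth-bound conclusion does not depend on which peak figure is used.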
---
|
||||
|
||||
## 1. The kernel-approach decision: TUNE the existing FP4-MMA `mul_mat_q`, do NOT write a cutlass kernel
|
||||
|
||||
This is the first thing track B must settle, and the evidence settles it decisively.
|
||||
|
||||
| option | verdict | why |
|---|---|---|
| **(A) Tune the existing `mul_mat_q<NVFP4>` FP4-MMA path** | **CHOSEN - the tractable spine** | The kernel already exists, is **bit-exact** (`test-backend-ops MUL_MAT` 1103/1103), is genuine **W4A4** (below), and already **beats vLLM at batch-1 prefill** (MXFP4 1153 t/s vs vLLM's 800 W4A16; vLLM has no FP4 cubins on sm_121). The deficit is **decode-shape scheduling**, not the math op. Host-side selection + a bounded occupancy tune respects the GB10 lessons and is build-ready against known files/lines. |
| **(B) New cutlass-style SM120 FP4 collective** | **REJECTED** | Repeats the **proven GB10 dead-end**: the from-scratch W4A16 BF16 GEMM hit only ~9–15 TFLOP/s (¼ of MMQ) and was **STOPPED** (`W4A16_MARLIN_KERNEL_PLAN.md`) because deep `cp.async` + XOR-swizzle **collapse GB10 occupancy**. Worse, **CUTLASS's own SM120 grouped block-scaled FP4 GEMM is broken on consumer Blackwell** (garbage/init-fail, CUTLASS #3096/#2800); it is the exact reason vLLM falls back to **BF16 Marlin** for its MoE on sm_121. "Port cutlass" is not even a working option for the MoE arm. |
| **(C) Marlin-style W4A16 (FP4→BF16 dequant + BF16 HMMA)** | **REJECTED for the win, noted for context** | This is what **vLLM's MoE actually runs** on sm_121 (W4A16, BF16 activations, dequant-in-mainloop). On GB10 **INT8 == BF16 == ½ FP4 rate**, so a BF16-HMMA path concedes the 2× FP4 advantage llama already has. We do not want to *descend* to vLLM's slower arithmetic class; we want to keep the FP4-MMA class and schedule it better. |
|
||||
|
||||
**Decision: track B = tune `mul_mat_q<NVFP4>` (dense, `mmq.cu`/`mmq.cuh`) + the grouped `mul_mat_q`
|
||||
id-branch (MoE, `mmid.cu` + the same `mmq.cuh`).** No new kernel, no rewrite, no descent to BF16.
|
||||
The win is kernel *engineering around an FP4-MMA llama already possesses*, so there is **no
|
||||
hardware-instruction wall** — but it is gated by whether MMQ's occupancy-bound design can be pushed
|
||||
to the bandwidth floor at the thin decode M-tile.
|
||||
|
||||
### What "the existing path" actually is (source-read, DGX `ggml/src/ggml-cuda/`)
|
||||
|
||||
Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens fused into one
|
||||
ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128× M=1). The NVFP4 path:
|
||||
`mmq.cu` `use_native_fp4` gate (L125) → `quantize_mmq_fp4_cuda` act-quant (L138 dense / L200 id;
|
||||
**track A's fuse target**) → `mul_mat_q` → `vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) →
|
||||
`mma_block_scaled_fp4` (`mma.cuh:1126`).
|
||||
|
||||
**Confirmed W4A4 (this corrects an earlier "A is 8-bit-class" framing):** `block_fp4_mmq`
|
||||
(`mmq.cuh:53`) is `uint32_t d4[4]` (four `ue4m3` block scales) + `int8_t qs[4*32]` = **256 FP4 (e2m1)
|
||||
values packed 2-per-byte**. `quantize_mmq_fp4_cuda` (`quantize.cu:422`) emits FP4 via
|
||||
`ggml_cuda_float_to_fp4_e2m1`. The MMA is
|
||||
`mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`
|
||||
(`mma.cuh:1145`) — **both operands e2m1, ue4m3 block scales**. So llama's dense FP4-MMA path is
|
||||
already the *same arithmetic class as vLLM's cutlass W4A4 dense*. The `sizeof(block_fp4_mmq) ==
|
||||
sizeof(block_q8_1_mmq)` static_assert is a shared-tile-footprint convention, **not** an 8-bit
|
||||
activation. **Consequence: there is no "make activations 4-bit" work to do and no activation-traffic
|
||||
halving to win — that is already banked. The entire dense deficit is scheduling/occupancy.**
|
||||
|
||||
Geometry (`vec_dot_fp4_fp4_mma`): `MMQ_NWARPS=8`, `iter_k=MMQ_ITER_K_FP4=512`, tiles
|
||||
`tile_A<16,8,int>` (weights, 16 N-rows × 64 FP4-in-K), `tile_B<8,8,int>` (acts, 8 M-cols × 64
|
||||
FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags = MMQ_TILE_NE_K/tile_A::J`. The M loop
|
||||
is `for (j0=0; j0<mmq_x; j0 += ntx*tile_C::J)` — M tiled in steps of `tile_C::J=8`.
|
||||
|
||||
---
|
||||
|
||||
## 2. The roofline — answering the load-bearing question
|
||||
|
||||
**Answer: BANDWIDTH-bound on the hardware roofline, but COMPUTE-bound in practice by the kernel's own
|
||||
under-occupancy. The 273 GB/s is NOT the wall at the parity target.**
|
||||
|
||||
### 2a. DENSE Qwen3.6-27B, M=128
|
||||
|
||||
`b = 18e9/27e9 = 0.667 B/param`; FLOPs/step `= 2·128·27e9 = 6.91 TFLOP`.
|
||||
|
||||
- **Weight-read floor** (18 GB read ONCE for all 128 tokens): @273 GB/s = **65.9 ms → 1,942 tok/s**;
|
||||
@216 GB/s = 83 ms → 1,542 tok/s.
|
||||
- **Crossover** at FP4 peak: `M* = 0.667·500e12/(2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM
|
||||
at decode is BANDWIDTH-bound.** At the kernel's *achieved* ~3% efficiency the effective peak
|
||||
collapses and drags M* to ≈18 (`0.667·14.7e12/(2·273e9)`), putting the *current* kernel in self-inflicted compute-bound
|
||||
territory.
|
||||
- **Where llama sits:** GEMM = 59.2% × 795 ms = **471 ms = 14.7 TFLOP/s = 2.9% of FP4 peak = 7.1×
|
||||
slower than the 66 ms weight-read floor.** Not a bandwidth wall — a kernel running deep in
|
||||
compute-bound territory at single-digit efficiency.
|
||||
- **Where vLLM sits:** step 328 ms ≈ llama's step *minus* its GEMM bucket (795 − 471 = 324 ms), so the
  467 ms step gap is about the size of the GEMM bucket itself. The **entire 2.42× gap is the GEMM.**
|
||||
|
||||
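The dense bullets above reduce to a few lines of arithmetic. The sketch below reproduces them from the §0 inputs; the variable names are illustrative, and the 500 TFLOP/s peak is the upper end of the quoted FP4 range.

```python
M = 128                 # tokens fused into one decode ubatch
PARAMS = 27e9           # dense parameter count
WEIGHT_BYTES = 18e9     # matmul weights read once per step
FP4_PEAK = 500e12       # FLOP/s, upper end of the quoted range
BW_SPEC = 273e9         # bytes/s, LPDDR5x spec
STEP_MS = 795.0         # measured llama step @npl128
GEMM_SHARE = 0.592      # nsys share of the step spent in the weight GEMM

flops_per_step = 2 * M * PARAMS                      # ~6.91e12 FLOP
floor_ms = WEIGHT_BYTES / BW_SPEC * 1e3              # weight-read floor
floor_tok_s = M / (floor_ms / 1e3)
gemm_ms = GEMM_SHARE * STEP_MS                       # llama's GEMM bucket
achieved_flops = flops_per_step / (gemm_ms / 1e3)
efficiency = achieved_flops / FP4_PEAK
slowdown_vs_floor = gemm_ms / floor_ms

print(f"floor      : {floor_ms:.1f} ms -> {floor_tok_s:.0f} tok/s")
print(f"llama GEMM : {gemm_ms:.0f} ms = {achieved_flops / 1e12:.1f} TFLOP/s")
print(f"efficiency : {efficiency:.1%} of FP4 peak, {slowdown_vs_floor:.1f}x the floor")
```

The same script with `BW_SPEC = 216e9` gives the achieved-bandwidth floor of about 83 ms.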
### 2b. MoE Qwen3.6-35B-A3B, M=128
|
||||
|
||||
@npl128, 128 tok × top-8 / 256 experts ⇒ ~98% experts read ⇒ ~22 GB/step (the full weight set), per-
|
||||
expert M ≈ **4 tokens**.
|
||||
|
||||
- **Weight-read floor:** 22/273 = **80.6 ms → 1,588 tok/s** (@216: 102 ms → 1,255).
|
||||
- **Compute floor:** only ~3B active params ⇒ 0.77 TFLOP ⇒ 1.5 ms @peak — **trivial. MoE decode is
|
||||
purely bandwidth/occupancy-bound, never compute-bound.** The hard part is saturating 273 GB/s while
|
||||
feeding ragged M≈4 tiles.
|
||||
- **Where llama sits:** GEMM = 59% × 384 = **227 ms = 97 GB/s = 35% of peak BW** (occupancy/tile-fill
|
||||
loss, not compute).
|
||||
- **Where vLLM sits:** step 158 ms ≈ grouped Marlin-NvFp4 at the ~80 ms floor + ~78 ms non-GEMM —
|
||||
already pushing the MoE BW floor.
|
||||
|
||||
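The "~98% experts read" and "M ≈ 4 per expert" figures follow from a simple routing model. The sketch below assumes uniform, independent top-8 routing per token, which is an idealization (real routers are not uniform), and uses illustrative variable names.

```python
N_TOKENS = 128      # decode ubatch size
TOP_K = 8           # experts selected per token
N_EXPERTS = 256     # experts per MoE layer
BW_SPEC = 273e9     # bytes/s
WEIGHT_BYTES = 22e9 # weights read per step at npl128
STEP_MS = 384.0     # measured llama MoE step
GEMM_SHARE = 0.59   # GEMM share of the step

# Probability that one expert is missed by every token in the ubatch.
p_miss = (1 - TOP_K / N_EXPERTS) ** N_TOKENS
hit_fraction = 1 - p_miss
tokens_per_expert = N_TOKENS * TOP_K / N_EXPERTS

floor_ms = WEIGHT_BYTES / BW_SPEC * 1e3
gemm_ms = GEMM_SHARE * STEP_MS
achieved_bw = WEIGHT_BYTES / (gemm_ms / 1e3)
bw_fraction = achieved_bw / BW_SPEC

print(f"experts hit       : {hit_fraction:.1%}")
print(f"tokens per expert : {tokens_per_expert:.1f}")
print(f"weight-read floor : {floor_ms:.1f} ms")
print(f"llama GEMM        : {gemm_ms:.0f} ms = {achieved_bw / 1e9:.0f} GB/s ({bw_fraction:.0%} of spec)")
```

Because nearly every expert is touched at npl128, the grouped GEMM reads essentially the full weight set per step while each expert sees only about four tokens, which is why tile fill rather than arithmetic throughput is the limiter.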
**Both weight-read floors (dense ~1,940, MoE ~1,590 tok/s) sit well ABOVE vLLM's 391/811 (~5× for dense,
~2× for MoE). Bandwidth is not the wall; the GB10 FP4-MMA occupancy efficiency is.**
|
||||
|
||||
---
|
||||
|
||||
## 3. The code-level inefficiencies, and the M-tile asymmetry that drives the whole plan
|
||||
|
||||
The selection is `mul_mat_q_case` (`mmq.cuh:4108`): it loops `mmq_x = 8..mmq_x_max(=128) step 8` and
|
||||
keeps the `mmq_x` that **minimizes `ntiles_x = ceil(ncols_max/mmq_x)`**, stopping at `ntiles_x==1`.
|
||||
`mmq_y` (the weight-row tile) is pinned at **128** by `get_mmq_y_host` (L143). This produces the
|
||||
single most important structural fact for track B:
|
||||
|
||||
> **`mmq_x` tiles M (tokens / output columns) — shrinking it RE-READS the weights `ntiles_x` times.
|
||||
> `mmq_y` tiles N (weight rows / output rows) — shrinking it does NOT re-read weights (each weight row
|
||||
> lives in exactly one row-tile); it only lowers shared footprint and raises occupancy.** The two
|
||||
> regimes pick opposite knobs:
|
||||
|
||||
| | dense decode (M=128, no `expert_bounds`) | MoE decode (per-expert M≈4) |
|---|---|---|
| selection picks | `mmq_x=128` → `ntiles_x=1` → **weights read ONCE** (the one-read optimum) | `mmq_x=128` applied **per expert** → tile ~3% filled |
| shrink `mmq_x`? | **NO - re-reads 18 GB ×`ntiles_x`**, fatal in the BW-bound regime | **YES, FREE** - 1 col-tile/expert regardless, no re-read → strictly occupancy-positive |
| FP4-MMA M-frag fill | **full** (128/`tile_C::J`=16 frag-groups, all live) → no fragment waste | **wasted** (~1 of 8/16 frag-groups live, rest masked tails) |
| BW-neutral occupancy lever | **`mmq_y`↓** (more resident CTAs, weights still read once) - kernel-structure change | **`mmq_x`↓** (toward density ≈8) - host-side template switch |
| dominant loss | **occupancy** at the heavy 128×128 tile (exposed weight-load latency) | **tile-fill** (dense-tuned M-tile applied to ragged M≈4) |
|
||||
|
||||
This asymmetry is the spine of the plan: **MoE's lever is host-only `mmq_x`↓ (already landed as patch
|
||||
0015 auto-cap→64; ideal ≈8–16); dense's lever is `mmq_y`↓ + occupancy, a bounded kernel change.**
|
||||
|
||||
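The selection rule and its re-read consequence can be sketched in a few lines of Python. This is a simplified model of the loop as described above, not the real `mul_mat_q_case`: it ignores any granularity and shared-memory filters the actual code may apply, and `pick_mmq_x` and `weight_bytes_read` are hypothetical names.

```python
import math


def pick_mmq_x(ncols_max: int, mmq_x_max: int = 128, step: int = 8):
    """Keep the mmq_x that minimizes ntiles_x, stopping once ntiles_x == 1."""
    best_x, best_tiles = 0, math.inf
    for mmq_x in range(step, mmq_x_max + 1, step):
        if best_tiles <= 1:
            break
        ntiles_x = math.ceil(ncols_max / mmq_x)
        if ntiles_x < best_tiles:
            best_x, best_tiles = mmq_x, ntiles_x
    return best_x, best_tiles


def weight_bytes_read(weight_bytes: float, ncols: int, mmq_x: int) -> float:
    """Each M column-tile streams the full weight matrix once."""
    return weight_bytes * math.ceil(ncols / mmq_x)


print(pick_mmq_x(128))                               # (128, 1): one weight read
print(weight_bytes_read(18e9, 128, 128) / 1e9)       # 18.0 GB
print(weight_bytes_read(18e9, 128, 64) / 1e9)        # 36.0 GB: the 2x re-read
```

Shrinking `mmq_y` has no counterpart in `weight_bytes_read`, which is the asymmetry in code form: only the M-tile knob multiplies weight traffic.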
The five inefficiencies, ranked:
|
||||
|
||||
1. **Separate activation-quant pass (track A's bucket, 8.2%).** `quantize_mmq_fp4_cuda` writes the
|
||||
whole activation tensor to `block_fp4_mmq` in a standalone kernel; vLLM fuses `scaled_fp4_quant`
|
||||
into the preceding RMSNorm/SiLU epilogue. **Handoff (track A → B):** B must consume A's prequantized
|
||||
`block_fp4_mmq` y-tile in place of calling `quantize_mmq_fp4_cuda`, so the fusion saves the
|
||||
activation round-trip, not just the launch (see §4.4).
|
||||
|
||||
2. **No weight-load software pipeline → exposed latency at thin M (the #1 dense kernel lever).**
|
||||
`load_tiles_nvfp4_nvfp4` (`mmq.cuh:946`) does plain global→shared stores → `__syncthreads` →
|
||||
`vec_dot_fp4_fp4_mma` (`load_ldmatrix` of A + MMA): a **load→sync→compute→repeat** cadence with **no
|
||||
`cp.async` double-buffering** overlapping the next k-block weight load with the current MMA. At
|
||||
M=128 the per-tile MMA work is small, so serialized weight-load latency dominates → 2.9% (dense) /
|
||||
35%-of-BW (MoE). **Caveat (the GB10 wall):** a *deep* pipeline + XOR-swizzle collapses GB10
|
||||
occupancy (`W4A16_MARLIN_KERNEL_PLAN.md`). The fix is **occupancy-first** (raise resident CTAs to
|
||||
hide latency via CTA-parallelism), **shallow 2-stage prefetch second**, never Marlin's 4-stage.
|
||||
|
||||
3. **`mmq_x` maximized for dense = occupancy-heavy, but pinned by the one-read constraint.** At dense
|
||||
decode the 128×128 tile (8 warps, large shared) is low-occupancy on the occupancy-dominated GB10 —
|
||||
but you cannot shrink `mmq_x` without doubling the 18 GB weight read. So the dense occupancy fix is
|
||||
**`mmq_y`↓** (BW-neutral), not `mmq_x`↓.
|
||||
|
||||
4. **MoE per-expert M-tile waste (the structural MoE gap).** The 128-wide (or patch-0015 64-wide)
|
||||
tile is applied per expert at density ≈4, so the accumulator is ~3–6% filled and ~1 `tile_C` frag-
|
||||
group is live, the rest masked `need_check` tails. Ideal `mmq_x` ≈ tokens/expert ≈ 8 (= `tile_C::J`).
|
||||
At ≤1 col-tile/expert this costs **no** extra weight read → strictly occupancy-positive. (This is
|
||||
the MoE arm of inefficiency 3; scoped in `MOE_GROUPED_GEMM_SCOPE.md`.)
|
||||
|
||||
5. **`iter_k=512` (FP4) couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → larger
|
||||
shared footprint → adverse in the occupancy-bound regime. A P2 tuning knob.
|
||||
|
||||
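Inefficiency 4 is easy to quantify. The sketch below computes accumulator fill and live `tile_C` fragment groups for a per-expert density of 4 tokens; the helper names are illustrative, and `tile_c_j=8` is the `tile_C::J` value quoted in §1.

```python
import math


def tile_fill(tokens_per_expert: float, mmq_x: int) -> float:
    """Fraction of the M-tile columns that carry real tokens."""
    return min(tokens_per_expert, mmq_x) / mmq_x


def live_frag_groups(tokens_per_expert: float, mmq_x: int, tile_c_j: int = 8):
    """(live, total) tile_C column groups inside one M-tile."""
    total = mmq_x // tile_c_j
    live = min(total, math.ceil(tokens_per_expert / tile_c_j))
    return live, total


for mmq_x in (128, 64, 16, 8):
    live, total = live_frag_groups(4, mmq_x)
    print(f"mmq_x={mmq_x:3d}: fill {tile_fill(4, mmq_x):6.1%}, {live}/{total} frag-groups live")
```

At density 4 the fill rises from about 3% (`mmq_x=128`) and 6% (`mmq_x=64`) to 50% at `mmq_x=8`, with no extra weight read because each expert still needs only one column tile.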
**Ruled out (do not chase):** redundant weight reads on the *current* selection (none — dense
|
||||
`ntiles_x=1`, MoE ≤1 col-tile/expert); stream-K fixup (it *helps* fill the small GB10 grid at thin M);
|
||||
raw FP4-MMA peak rate (already beats Q4-MMQ and is BW-bound at batch 1 — latency-hiding binds first).
|
||||
|
||||
---
|
||||
|
||||
## 4. The specific build-ready changes
|
||||
|
||||
All against DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`. Every change is gated and defaults to exact
|
||||
stock behavior until proven.
|
||||
|
||||
### 4.1 Dense M-tile / occupancy (the make-or-break)
|
||||
|
||||
- **Keep `mmq_x=128` at dense decode** (the one-weight-read optimum; do **not** shrink it — that
|
||||
re-reads 18 GB). Lock this as an invariant in P0.
|
||||
- **Make `mmq_y` decode-selectable** (`get_mmq_y_host`/`get_mmq_y_device`, L143/L157). Today pinned
|
||||
128; try **64** (and 96) at decode. `mmq_y` is coupled to `nwarps × tile_C::I` via the MMQ
|
||||
static_assert, so this is a **warp/fragment remap** (bounded kernel change), not a pure host switch:
|
||||
fewer N-frags per warp or fewer warps → smaller per-CTA shared → **more resident CTAs → latency
|
||||
hidden by CTA-parallelism**, with **weights still read once** (BW-neutral). This is the primary
|
||||
dense occupancy lever and respects every GB10 rule.
|
||||
- **Host-only knobs first (P1, zero kernel):** the `mmq_get_granularity_host` choice (L274 — sets
|
||||
`rows_per_warp=2·granularity`, `ntx`), and the stream-k-vs-xy-tiling threshold (`launch_mul_mat_q`
|
||||
~L3954, `tiles_efficiency_percent` L4001). Plus one **empirical A/B**: does eating a 2× weight
|
||||
re-read at `mmq_x=64` buy enough occupancy to net positive? (Diagnostic: if yes, occupancy is badly
|
||||
broken and P2 `mmq_y`↓ has large upside; if no, the tile is already BW-saturated and P2's ceiling is
|
||||
lower.) All behind `GGML_CUDA_FP4_MMQ_Y` / `GGML_CUDA_FP4_GRAN` / `GGML_CUDA_FP4_FORCE_STREAMK`.
|
||||
|
||||
### 4.2 FP4-MMA fragment usage
|
||||
|
||||
- Fragments stay `tile_A<16,8,int>` / `tile_B<8,8,int>` / `tile_C<16,8,float>` — these match the
|
||||
`m16n8k64` block-scaled FP4 MMA and must not change (they are the instruction shape). At dense M=128
|
||||
all 16 `tile_C::J`-groups are live → **no dense fragment work needed**. The lever is *how many of
|
||||
these tiles are resident per SM* (occupancy), set by `mmq_y`/`nwarps`/granularity, not the fragment
|
||||
shape.
|
||||
- MoE: shrink `mmq_x` toward `tile_C::J`=8 so the live frag-group count matches density (§4.3).
|
||||
|
||||
### 4.3 MoE M-tile (`MOE_GROUPED_GEMM_SCOPE.md`, partly landed)
|
||||
|
||||
- **Patch 0015 already auto-caps `mmq_x`→64 at decode** via per-expert density in `mul_mat_q_case`
|
||||
(the `expert_bounds != nullptr` block, L4118-4165; env `LLAMA_MOE_DECODE_TILE`,
|
||||
`LLAMA_MOE_DENSITY_MAX`). Tighten the decode tile toward **8–16** (= density) and sweep.
|
||||
- **Optional [2]: block-padded `mm_ids_helper`** (`mmid.cu`) — pad each expert segment to a multiple
|
||||
of the tile, removing `need_check` masked tails and tightening the stream-k schedule. Medium risk
|
||||
(scatter + write-back masking); behind `LLAMA_MOE_BLOCK_ALIGN`.
|
||||
|
||||
### 4.4 Scale handling + the act-quant fusion handoff (the track A → B ABI contract)
|
||||
|
||||
- **Weight scales** (`ue4m3`, one per 16 weights) load in `load_tiles_nvfp4_nvfp4` into `x_sc`
|
||||
(`x_u32 + 64 + kbx`), consumed as `scaleA` in `vec_dot_fp4_fp4_mma` and passed as the block-scale
|
||||
operand to `mma_block_scaled_fp4`. **No change** — already a first-class MMA scale operand.
|
||||
- **Activation scales** (`ue4m3`) live in the `block_fp4_mmq` y-tile `d4[4]`, consumed as `scaleB`.
|
||||
- **The handoff contract:** track B must hold the **`block_fp4_mmq` y-tile layout invariant**
|
||||
(`uint32_t d4[4]` ue4m3 scales + `int8_t qs[128]` = 256 packed FP4, `mmq.cuh:53`). Track A's fused
|
||||
`rms_norm+mul+nvfp4-quant` producer (task 39) writes exactly this struct; track B's "prequantized
|
||||
MMQ consumer" (task 40) makes `mul_mat_q` accept a prebuilt `src1_q8_1` buffer and **skip the
|
||||
`quantize_mmq_fp4_cuda` call** (`mmq.cu:138`/`200`). The numerics must be **bit-identical** to the
|
||||
unfused path (same `e2m1` rounding, same `ue4m3` block scale per 16) so the parity gate stays green
|
||||
with the fusion on or off. B owns the consumer seam; A owns the producer kernel; the `block_fp4_mmq`
|
||||
struct is the frozen interface between them.
|
||||
|
||||
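To make the frozen interface concrete, here is a byte-level Python model of the y-tile as this section describes it: 16 bytes of scales (`uint32_t d4[4]`) followed by 128 bytes holding 256 packed FP4 codes. The nibble order (low nibble first) is an assumption made for illustration, and `pack_block`, `unpack_codes` and `decode_e2m1` are hypothetical helpers, not functions from the tree. Only the e2m1 value table is standard.

```python
# Magnitudes representable by a 4-bit e2m1 code (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit e2m1 code; bit 3 is the sign."""
    mag = E2M1_MAGNITUDES[code & 0x7]
    return -mag if code & 0x8 else mag


def pack_block(scales: bytes, codes: list) -> bytes:
    """16 scale bytes + 256 FP4 codes packed two per byte (low nibble first: assumed)."""
    assert len(scales) == 16 and len(codes) == 256
    qs = bytes((codes[2 * i] & 0xF) | ((codes[2 * i + 1] & 0xF) << 4) for i in range(128))
    return scales + qs


def unpack_codes(block: bytes) -> list:
    """Inverse of pack_block for the quantized payload."""
    out = []
    for byte in block[16:]:
        out += [byte & 0xF, byte >> 4]
    return out


codes = [i % 16 for i in range(256)]
block = pack_block(bytes(16), codes)
print(len(block))                      # 144 bytes per block
print(unpack_codes(block) == codes)    # round-trip check
```

If there is one 1-byte `ue4m3` scale per 16 values, 256 values need exactly 16 scale bytes, which matches the size of `d4[4]`. A producer and consumer that agree on this layout and on the rounding can be compared byte for byte, which is what the bit-identical requirement amounts to.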
### 4.5 GB10-fit rules (binding constraints on every kernel change)
|
||||
|
||||
- **Small shared mem + high occupancy.** Do **not** add deep `cp.async` stages or XOR-swizzle shared
|
||||
layouts — they are exactly what collapsed W4A16 on GB10 (`W4A16_MARLIN_KERNEL_PLAN.md`: a 16 KB
|
||||
XOR-swizzle dropped q4_K from 6.63→2.84 TFLOPS).
|
||||
- **Preserve the skew-pad** (`MMQ_MMA_TILE_X_K_FP4 = 2·MMQ_TILE_NE_K + 8 + 4`, the `% 8 == 4`
|
||||
padding, `mmq.cuh:221/233`) — conflict-free `ldmatrix` at ~zero shared cost.
|
||||
- **Stay on the FP4-MMA path** (`block_fp4_mmq` / `mma_block_scaled_fp4`) — the only path at GB10's
|
||||
FP4 = 2× INT8/BF16 rate. Never descend to BF16/INT8 (1:1 on GB10).
|
||||
- **Occupancy beats a conflict-free-but-wide layout.** Buy latency-hiding with *more resident CTAs*
|
||||
(smaller `mmq_y`, smaller shared), not a deeper pipeline.
|
||||
- Tuning is **empirical** — `nsys` (throughput) is available, **`ncu` is not** on the DGX (no driver
|
||||
perms). Sweep configs, measure decode_agg, bracket thermals (same-session cold A/B only).
|
||||
|
||||
---
|
||||
|
||||
## 5. Correctness / parity gate (every phase)
|
||||
|
||||
- **Primary, bit-exact:** `test-backend-ops test -o MUL_MAT -b CUDA0` and
|
||||
`test-backend-ops test -o MUL_MAT_ID -b CUDA0` must stay **1103/1103** with the flag set **and**
|
||||
unset, and **byte-identical** when unset. The CPU reference is the deterministic oracle; the op test
|
||||
is exact (the GB10 greedy-decode non-determinism band applies only to end-to-end, never to the op
|
||||
test).
|
||||
- **Add decode-shape cases if absent:** `type_a ∈ {NVFP4, MXFP4}`, `type_b = F32`, dense **n=128** at
|
||||
the real FFN K/N; for `_ID`, `n_mats=128, n_expert_used=8, n_tokens ∈ {8,32,64,128}` **plus ragged
|
||||
small-M** (experts with 0/1/2 tokens, `n_tokens` not a multiple of `mmq_x`) — exactly where `mmq_x`/
|
||||
`mmq_y` changes and block-pad masking can leak.
|
||||
- **Fusion-handoff parity (P3):** with track A's fused producer on, the prequantized-consumer path
|
||||
must produce dst **identical** to the unfused `quantize_mmq_fp4_cuda` path (same `e2m1`/`ue4m3`
|
||||
rounding).
|
||||
- **End-to-end:** `llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`
|
||||
(dense) and `q36-35b-a3b-nvfp4.gguf` (MoE); confirm decode_agg climbs per §6 and output stays within
|
||||
the documented CUDA batch-shape non-determinism band vs the CPU oracle. All scripts **dev-tree-only**.
|
||||
|
||||
---
|
||||
|
||||
## 6. Phased plan, with expected decode_agg at each phase
|
||||
|
||||
Per-step model used (ms @npl128): **dense 795** = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162;
|
||||
**MoE 384** = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81. `decode_agg = 128 / step_s`.
|
||||
|
||||
### DENSE (parity target 391)
|
||||
|
||||
| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 391** | risk |
|---|---|---:|---:|---:|---:|---|
| **P0** harness | Lock baseline: 1103/1103, decode n=128 perf, nsys window, the 471 ms / 2.9% eff datum. Pin `mmq_x=128` one-read invariant. | 471 | 795 | **161** | 41% | low |
| **P1** host-only tile/grid + re-read A/B | granularity + stream-k threshold sweep; the `mmq_x=64` re-read-vs-occupancy diagnostic. **Honest: small** - `mmq_x` is pinned, so this mostly de-risks P2. | ~400 | ~724 | **~177** | ~45% | low |
| **P2** `mmq_y`↓ + occupancy/shallow-prefetch | The make-or-break: raise resident CTAs (`mmq_y` 128→64, granularity, shallow 2-stage weight prefetch, skew-pad), push GEMM toward the **66–81 ms BW floor (17–21% FP4 eff)**. **KILL-GATE: if eff plateaus <15% (GEMM >110 ms) → dense parity OFF, report partial.** | **66–81** | 390–405 | **316–328** | **81–84%** | **med-high** |
| **P3** co-land track A | Consume A's prequantized `block_fp4_mmq` y-tile; the 65 ms act bucket folds away. | 66–81 | **325–340** | **376–394** | **96–101%** | low |
|
||||
|
||||
Dense climb: **161 → ~177 → 316–328 → 376–394** tok/s = **41% → 45% → 81–84% → 96–101% of vLLM 391.**
|
||||
Robust to the 273-vs-216 GB/s uncertainty (@216 GB/s P3 → ~359 tok/s = 92%). **Parity within error,
|
||||
contingent on P2 clearing the kill-gate and on A landing.**
|
||||
|
||||
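The dense table is a direct application of the per-step model at the top of this section. A minimal Python sketch, with illustrative names and the bucket values copied from that model, reproduces each row:

```python
# Dense non-GEMM buckets in ms @npl128 (from the per-step model above).
DENSE_NON_GEMM = {"act": 65, "gdn": 83, "attn": 14, "rest": 162}
VLLM_DENSE = 391  # tok/s parity target


def phase(gemm_ms: float, fused_act: bool = False, npl: int = 128):
    """Return (step_ms, decode_agg) for a given GEMM time; track A removes 'act'."""
    other = sum(v for k, v in DENSE_NON_GEMM.items() if not (fused_act and k == "act"))
    step_ms = gemm_ms + other
    return step_ms, npl / (step_ms / 1e3)


rows = [
    ("P0", 471, False),
    ("P1", 400, False),
    ("P2 best", 66, False),
    ("P2 worst", 81, False),
    ("P3 best", 66, True),
    ("P3 worst", 81, True),
]
for name, gemm_ms, fused in rows:
    step_ms, agg = phase(gemm_ms, fused)
    print(f"{name:9s} step {step_ms:4.0f} ms  {agg:5.0f} tok/s  {agg / VLLM_DENSE:5.0%} of vLLM")
```

The model is purely additive: it assumes the non-GEMM buckets do not shrink or grow when the GEMM gets faster, which is the same assumption the table makes.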
### MoE (parity target 811)
|
||||
|
||||
| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 811** | risk |
|---|---|---:|---:|---:|---:|---|
| **P0** harness | Lock 1103/1103 + the monotonic `85→1771` batched-bench curve + 227 ms / 35%-BW datum. | 227 | 384 | **333** | 41% | low |
| **P1/P4** MoE `mmq_x`↓ (patch 0015 → tighten to 8–16) | Free per-expert tile shrink (no re-read); reclaim the 3–6% fill waste, raise occupancy. | ~140 | ~297 | **~431** | ~53% | low |
| **P2** block-pad align + occupancy | Remove `need_check` tails, tighten stream-k; push toward the 80 ms floor. | ~100 | ~257 | **~498** | ~61% | med |
| **P3** co-land track A | act bucket (31 ms) folds away; GEMM at the ~80 ms floor. | 80 | **207** | **618** | **76% - CEILING** | low |
|
||||
|
||||
MoE climb: **333 → ~431 → ~498 → 618** tok/s = **41% → 53% → 61% → 76% of vLLM 811.** **The 76% is the
|
||||
hard ceiling from the GEMM track:** even a *perfect* weight-read-floor grouped GEMM leaves llama's
|
||||
non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) at **1.6× vLLM's whole ~78 ms non-GEMM**, so the step
|
||||
cannot drop below ~207 ms. The remaining ~49 ms to vLLM's 158 ms step is elementwise + host-loop
|
||||
(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md`), **outside track B.**
|
||||
|
||||
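The 76% ceiling is a consequence of the non-GEMM residual alone, which a short Python check makes explicit. The names are illustrative; the bucket values are the MoE per-step model from the top of this section, and the 78 ms vLLM non-GEMM figure is the estimate from §2b.

```python
NPL = 128
GEMM_FLOOR_MS = 80                      # ~22 GB at 273 GB/s
MOE_NON_GEMM_AFTER_A = 38 + 8 + 81      # GDN + attn + rest, act-quant already fused
VLLM_STEP_MS = 158
VLLM_NON_GEMM_MS = 78
VLLM_AGG = 811

best_step_ms = GEMM_FLOOR_MS + MOE_NON_GEMM_AFTER_A
ceiling_tok_s = NPL / (best_step_ms / 1e3)
ceiling_fraction = ceiling_tok_s / VLLM_AGG
residual_gap_ms = best_step_ms - VLLM_STEP_MS
non_gemm_ratio = MOE_NON_GEMM_AFTER_A / VLLM_NON_GEMM_MS

print(f"best step     : {best_step_ms} ms")
print(f"ceiling       : {ceiling_tok_s:.0f} tok/s = {ceiling_fraction:.0%} of vLLM")
print(f"residual gap  : {residual_gap_ms} ms, non-GEMM ratio {non_gemm_ratio:.1f}x")
```

Even with `GEMM_FLOOR_MS` set to zero the step would still be 127 ms, so closing the rest of the gap has to come from the non-GEMM tracks.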
### Explicitly NOT in scope (and why)
|
||||
|
||||
- A from-scratch W4A16 / CUTLASS SM120 collective — repeats the STOPPED occupancy dead-end and
|
||||
CUTLASS's grouped FP4 is broken on sm_121.
|
||||
- Deep multi-stage `cp.async` / XOR-swizzle — proven to collapse GB10 occupancy.
|
||||
- "Make activations 4-bit" — already W4A4; no work, no win there.
|
||||
- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity but
|
||||
**separate tracks**; B owns the GEMM only.
|
||||
|
||||
---
|
||||
|
||||
## 7. The honest ceiling — does B reach TRUE PARITY?
|
||||
|
||||
- **DENSE: TRUE PARITY is PLAUSIBLY REACHABLE, conditional, no margin.** The entire 2.42× gap is the
|
||||
GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is **bandwidth-bound, not
|
||||
hardware-capped**. **B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s = 96–101% of vLLM 391 (~92% at the achieved 216 GB/s).**
|
||||
The catch: it needs **~17–21% FP4-MMA efficiency at decode M=128**, and GB10 has only demonstrated
|
||||
~17% — and that at the *easier* prefill M=512 tile. It is a **reach, not a lock**, gated by the P2
|
||||
occupancy kill-gate and contingent on track A. **GO (conditional).**
|
||||
|
||||
- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~76% of vLLM (618 vs 811)**
|
||||
even with a perfect weight-read-floor grouped GEMM, because (1) the MoE floor is the hardest
|
||||
grouped-GEMM regime (M≈4/expert, vLLM ships purpose-built Marlin-NvFp4) and (2) ~24% of the step is
|
||||
non-GEMM outside this track. Worth doing (333 → ~618, a 1.85× and a real win), but it **cannot
|
||||
deliver 811 alone.** **PARTIAL / NO-GO for parity-from-B.**
|
||||
|
||||
- **The 273 GB/s is not the ceiling — the GB10 FP4-MMA occupancy efficiency is.** Decode M=128 is a
|
||||
*different* regime from the dead W4A16 path: bandwidth/occupancy-bound (saturate LPDDR5x at a thin
|
||||
M-tile via resident CTAs), not compute-throughput-bound (pack MMAs). The existing path is already at
|
||||
the BW floor at batch 1 (88 ms), so the work is **keeping it bandwidth-bound as M grows to 128**
|
||||
(occupancy via `mmq_y`↓ + shallow prefetch), a **tune of a working path**, not the greenfield
|
||||
rewrite. The binding risk is whether that occupancy can be bought without tripping the GB10 wall —
|
||||
which is exactly what the P2 kill-gate measures.
|
||||
|
||||
**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity with
|
||||
vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the top of the demonstrated efficiency envelope
|
||||
with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track alone (ceiling ~76%);
|
||||
MoE parity is a B-plus-non-GEMM program. **Verdict: GO for dense (conditional, B+A, kill-gated),
|
||||
PARTIAL for MoE.**
|
||||
|
||||
---
|
||||
|
||||
## 8. One-paragraph summary
|
||||
|
||||
The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read
|
||||
floors ~2–5× above vLLM (MoE ~2×, dense ~5×), so **273 GB/s is not the wall**, but llama's FP4-MMA kernel runs at ~3% of
|
||||
FP4 peak, in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). The path is already
|
||||
**W4A4** and already **beats vLLM at batch-1 prefill**, so the fix is **tuning the existing
|
||||
`mul_mat_q<NVFP4>`**, not a cutlass rewrite (a proven GB10 dead-end, and broken on sm_121 anyway). The
|
||||
M-tile asymmetry sets the levers: **dense** is pinned at `mmq_x=128` (one weight read) so its occupancy
|
||||
win is **`mmq_y`↓ + shallow prefetch** (BW-neutral), while **MoE**'s win is the free per-expert
|
||||
**`mmq_x`↓** (patch 0015). **Track B (GEMM → BW floor) + track A (fuse act-quant)** plausibly reaches
|
||||
**90–103% of vLLM dense (391)** — TRUE PARITY on the table for dense, but only at the **top of the
|
||||
demonstrated GB10 FP4-efficiency envelope (~17–21%)**, with **no margin**, gated by the P2 occupancy
|
||||
kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor
|
||||
sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense
|
||||
(conditional, B+A), PARTIAL for MoE.**
|
||||
|
||||
---
|
||||
|
||||
## 9. Adversarial review (skeptical staff CUDA engineer, post-W4A16): the parity go / no-go
|
||||
|
||||
Reviewer stance: I lived through the W4A16 GB10 effort that plateaued at ~9-15 TFLOP/s (~21% of the
|
||||
BF16 ceiling) after multi-week work and was STOPPED at the occupancy wall. I read this scope and the
|
||||
grounding (`QWEN36_NVFP4_BENCH`, `VLLM_DECODE_GROUNDING`, `GDN_DECODE_VERIFY`, `DECODE_GAP_STUDY`,
|
||||
`BLACKWELL_KERNEL_GAPS`, `W4A16_MARLIN_KERNEL_PLAN`) and stress-tested the verdict against them. Net:
|
||||
the plan is **directionally right and tractably scoped**, the kernel-approach decision (tune, do not
|
||||
rewrite) is correct, but the **"GO for dense, TRUE PARITY 96-103%" headline outruns its own caveats**.
|
||||
The honest landing is **dense ~80-90% (parity is the optimistic tail), MoE ~55-65% (parity not
|
||||
reachable from B)**. The decision to commit to B is nonetheless sound, for a reason the doc under-sells
|
||||
(low regret), and there is **one technical gap (TMA) and one sequencing error (A last) that must be
|
||||
fixed**.
|
||||
|
||||
### 9.1 Is this the W4A16 wall again? No - and the batch-scaling signature proves why
|
||||
|
||||
The decisive evidence the doc has but does not fully exploit is the **npl-sweep** (`QWEN36_NVFP4_BENCH`):
|
||||
dense llama-as-%-of-vLLM = **99 / 56 / 46 / 41** at npl 8 / 32 / 64 / 128. At **npl8 the kernels are at
|
||||
parity** (99%); the gap **opens monotonically as M grows**. Decompose this:
|
||||
|
||||
- At M=8 the dense GEMM is weight-read-bound at the floor (~88 ms, same as batch-1). llama == vLLM there,
|
||||
so **llama's FP4-MMA kernel demonstrably HITS the weight-read floor at small M.** This is the existence
|
||||
proof the W4A16 path never had: it is a *working, floor-reaching* FP4-MMA kernel, not a greenfield
|
||||
build stuck at 1/4 of MMQ.
|
||||
- At M=128 vLLM's GEMM **stays at ~88 ms** (flat: it amortizes the one weight read over 128 tokens and
|
||||
hides the MMA behind the load), while **llama's balloons to 471 ms** (5.4x). llama **falls off the
|
||||
floor** as M grows; vLLM **holds it**.
|
||||
|
||||
So the problem is **not** "build a fast 4-bit GEMM from scratch on an occupancy-hostile part" (the dead
|
||||
W4A16 problem). It is **"keep a working FP4-MMA kernel on the bandwidth floor as the M-tile grows from 8
|
||||
to 128"** - a tune of a working path. **Verdict: this is NOT the W4A16 wall** (different regime, working
|
||||
path, dual existence proof at M=8 and from vLLM at M=128). **But it shares W4A16's one binding
|
||||
constraint:** holding the floor as M grows requires hiding LPDDR5x weight-load latency at the larger
|
||||
tile, which is the same occupancy / latency-hiding game GB10 historically loses. The doc is right that
|
||||
it is a different and more tractable regime; it under-states that the *binding risk is identical*.
|
||||
|
||||
### 9.2 Why is vLLM 2.4x faster if both share 273 GB/s? Compute-side scheduling, and the gap is ~82% (not 100%) GEMM
|
||||
|
||||
The load-bearing question, settled by 9.1: at M=128 the gap is **not** that vLLM beats the shared
|
||||
bandwidth floor - it is that **llama falls off the floor into self-inflicted compute/occupancy-bound
|
||||
territory while vLLM stays on it.** The lever is therefore latency-hiding at the M=128 tile
|
||||
(compute-side scheduling: occupancy, prefetch, tile shape), with the 273 GB/s weight-read floor as the
|
||||
hard target both engines share. This confirms the doc's roofline and its central claim that the kernel,
|
||||
not the hardware, is the limiter.
|
||||
|
||||
**But the doc's "the entire 2.42x dense gap is the GEMM" is an ~82% truth, not a 100% one.** Decompose
|
||||
the dense step (numbers from the doc's own inputs):
|
||||
|
||||
```
llama step @npl128           795 ms      (decode_agg 161)
vLLM  step @npl128           328 ms      (decode_agg 391)
total gap                    467 ms

llama GEMM                   471 ms
vLLM  GEMM (at the floor)    ~66-88 ms   (66 @273 GB/s spec, 88 @216 GB/s achieved)
  => GEMM gap                383-405 ms  = 82-87% of the 467 ms total gap
  => non-GEMM gap            62-84 ms    = 13-18% of the total gap
```
|
||||
|
||||
So **B alone (GEMM -> floor) caps near ~80-84%** (step 412-390 ms = 311-328 t/s), **not parity.** Parity
|
||||
needs the non-GEMM 62-84 ms too: ~65 ms of it is track A's act-quant bucket, the residual ~0-19 ms is
|
||||
elementwise + host outside both A and B. This is the crux of the sequencing answer (9.6): **B is
|
||||
necessary but on its own lands ~80%; it is track A that tips dense over the parity line, not B.** The
|
||||
parity story is *entirely* contingent on A, which the P3 framing buries.
|
||||
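The decomposition above can be re-derived from its four measured inputs. A minimal Python sketch, assuming a 128-token decode step (npl128, one token per sequence per step, which is what makes 795 ms correspond to ~161 t/s); the function name `b_alone` is illustrative and not part of any tool in this tree:

```python
# Re-derive the 9.2 dense decomposition from the doc's own inputs.
LLAMA_STEP_MS = 795.0
VLLM_STEP_MS = 328.0
LLAMA_GEMM_MS = 471.0
VLLM_TPS = 391.0
TOKENS_PER_STEP = 128  # assumption: npl128, one token per sequence per step


def b_alone(floor_gemm_ms):
    """Outcome if track B alone brings llama's GEMM down to floor_gemm_ms."""
    total_gap = LLAMA_STEP_MS - VLLM_STEP_MS        # 467 ms
    gemm_gap = LLAMA_GEMM_MS - floor_gemm_ms        # part of the gap B can close
    step_ms = LLAMA_STEP_MS - gemm_gap              # step time with GEMM at the floor
    tps = TOKENS_PER_STEP / (step_ms / 1000.0)
    return {
        "gemm_share_of_gap": gemm_gap / total_gap,
        "non_gemm_gap_ms": total_gap - gemm_gap,
        "step_ms": step_ms,
        "tps": tps,
        "frac_of_vllm": tps / VLLM_TPS,
    }


if __name__ == "__main__":
    for label, floor in (("spec 273 GB/s", 66.0), ("achieved 216 GB/s", 88.0)):
        r = b_alone(floor)
        print(f"{label}: GEMM share {r['gemm_share_of_gap']:.0%}, "
              f"step {r['step_ms']:.0f} ms, {r['tps']:.0f} t/s, "
              f"{r['frac_of_vllm']:.0%} of vLLM")
```

Both floor assumptions leave a non-GEMM residual of 62-84 ms, which is why the GEMM track alone stops short of parity.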
|
||||
### 9.3 The sharpest risk the doc misses: vLLM's existence proof uses the technique the doc forbids (TMA)
|
||||
|
||||
vLLM holds the M=128 floor with **cutlass SM120 = TMA + a warp-specialized deep async producer/consumer
|
||||
pipeline** (Research 1). That deep pipeline is **exactly what the doc forbids on GB10** (rule 4.5: "do
|
||||
not add deep cp.async stages ... they collapsed W4A16"). So **B's chosen GB10-friendly route (`mmq_y`-down
|
||||
occupancy + a shallow 2-stage prefetch) is a different bet from the one that produced the existence
|
||||
proof.** Reaching the same floor by a friendlier route is plausible but **unproven**, and if the
|
||||
occupancy-only route plateaus short of the floor, B underperforms its target with no fallback in scope.
|
||||
|
||||
The doc conflates two different things under "deep pipeline":
|
||||
- **manual `cp.async` + XOR-swizzle** - register/shared-hungry, **collapsed W4A16 occupancy on GB10**
|
||||
(correctly banned).
|
||||
- **TMA (tensor-memory-accelerator) bulk async copy** - a single descriptor drives the copy, **far lower
|
||||
register/occupancy cost**, and it is precisely how cutlass gets pipeline depth **without** the
|
||||
occupancy hit (Research 1 says this explicitly). TMA is available on sm_120/121.
|
||||
|
||||
**Recommendation (binding):** B must put a **TMA-driven weight feed in scope as a first-class P2 option**,
|
||||
not categorically forbid pipeline depth. The occupancy-only route is the right *first* experiment
|
||||
(cheapest, respects the W4A16 lesson), but if P2 plateaus below the floor, **TMA is the demonstrated way
|
||||
to get depth without the occupancy collapse** and is what the vLLM existence proof actually uses.
|
||||
Declaring the floor "unreachable" without trying TMA would repeat the W4A16 mistake in reverse:
|
||||
abandoning the path that works because the *manual* version of it failed.
|
||||
|
||||
### 9.4 Tractability: bounded tune, confirmed - with the TMA caveat
|
||||
|
||||
The proposed changes are genuinely **bounded and build-ready**, not a greenfield kernel:
|
||||
- **MoE arm = DEMONSTRATED tractable.** Patch 0015 already auto-caps `mmq_x` per-expert and is committed
|
||||
and measured. Tightening to 8-16 + block-pad is the same lever, lower risk. This is real, banked
|
||||
evidence that the "tune `mul_mat_q`" approach works on this exact kernel family.
|
||||
- **Dense arm = plausibly bounded.** `mmq_y`-down is a warp/fragment remap that touches the
|
||||
`nwarps x tile_C::I == mmq_y` static_assert coupling, so it is a contained *kernel* edit (not a pure
|
||||
host switch, as the doc itself notes). The host-only P1 knobs are zero-risk. The **prefetch piece is
|
||||
where the residual occupancy risk lives** - and per 9.3, TMA belongs here.
|
||||
- **Rejecting (B) cutlass-rewrite and (C) BF16-Marlin-descent is correct.** Cutlass grouped FP4 is broken
|
||||
on sm_121 (the reason vLLM itself falls to Marlin for MoE); BF16 Marlin concedes GB10's 2x FP4 edge.
|
||||
|
||||
**Verdict: tractable, not greenfield.** The MoE arm is proven; the dense arm is a contained edit with a
|
||||
real but bounded occupancy risk, gated by the P2 kill-gate. The one scope gap is TMA (9.3).
|
||||
|
||||
### 9.5 Honest expected outcome (the numbers I would defend)
|
||||
|
||||
| model | B alone | B + A (median) | B + A (optimistic, spec BW) | parity? |
|---|---:|---:|---:|---|
| **DENSE** (target 391) | ~80-84% (311-328 t/s) | **~92-95% (360-372 t/s)** | ~101% (394 t/s) | **optimistic tail only** |
| **MoE** (target 811) | ~53-61% (431-498 t/s) | **~70-76% (570-618 t/s)** | 76% (618 t/s, CEILING) | **no** |
|
||||
|
||||
Reconciliation with the doc: the doc's B+A = "96-103%" uses the **spec-BW (66 ms floor)** end. At the
|
||||
**achieved 216 GB/s (88 ms floor)** the same arithmetic gives **~94%**, and that still assumes B hits the
|
||||
floor. So the honest dense median is **~92-95%, with TRUE PARITY as the upside, not the expectation**,
|
||||
contingent on a conjunction of three things: (a) P2 clears the occupancy kill-gate to the floor, (b) the
|
||||
GB10-friendly *or* TMA feed actually reaches the cutlass floor (9.3), and (c) track A lands. Three ANDs =
|
||||
tail, not median.
|
||||
|
||||
**The low-regret point the doc under-sells (and the real reason to commit):** even the *kill-gate-tripped*
|
||||
outcome is a large win. At the doc's own 15%-FP4-eff kill threshold (GEMM ~110 ms), B+A still lands
|
||||
**~89%** (step 369 ms); at a merely-partial occupancy win (eff 3% -> 5%, GEMM ~276 ms) B+A still lands
|
||||
**~61%**. Since the M=8 parity proof guarantees the floor is reachable in principle and patch 0015 proves
|
||||
the tune works, **getting *some* improvement at M=128 is high-probability; the only open question is how
|
||||
close to the floor.** So the outcome distribution is heavily positive (very likely 60-90%, possibly
|
||||
parity) with a bounded downside - B is **low-regret**, which matters more for the go decision than whether
|
||||
the parity tail hits.
|
||||
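The scenario numbers in this section all come from one formula: swap llama's 471 ms GEMM for the scenario's GEMM time and drop track A's 65 ms act-quant bucket. A hedged Python sketch, assuming A removes exactly that 65 ms bucket and nothing else, and a 128-token step; `b_plus_a` is an illustrative name:

```python
# Project the dense B + A outcome for a given post-tune GEMM time.
LLAMA_STEP_MS = 795.0
LLAMA_GEMM_MS = 471.0
ACT_QUANT_MS = 65.0     # track A's bucket; assumed to be removed in full
VLLM_TPS = 391.0
TOKENS_PER_STEP = 128   # assumption: npl128, one token per sequence per step


def b_plus_a(gemm_ms):
    """Return (step_ms, fraction_of_vllm) when GEMM takes gemm_ms and A has landed."""
    step_ms = LLAMA_STEP_MS - LLAMA_GEMM_MS + gemm_ms - ACT_QUANT_MS
    tps = TOKENS_PER_STEP / (step_ms / 1000.0)
    return step_ms, tps / VLLM_TPS


if __name__ == "__main__":
    scenarios = (
        ("floor @273 GB/s spec", 66.0),
        ("floor @216 GB/s achieved", 88.0),
        ("kill-gate tripped (15% eff)", 110.0),
        ("partial occupancy win (3% -> 5% eff)", 276.0),
    )
    for label, gemm_ms in scenarios:
        step_ms, frac = b_plus_a(gemm_ms)
        print(f"{label}: step {step_ms:.0f} ms, {frac:.0%} of vLLM")
```

Any elementwise or host time that A also captures would push every row up, so these are conservative under the stated assumption.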
|
||||
### 9.6 Sequencing vs track A: land A FIRST (the doc has this backwards)
|
||||
|
||||
The doc runs A as a parallel track merging at **P3 (last)**. That is backwards for de-risking, for three
|
||||
reasons:
|
||||
1. **A defines B's interface.** B's "prequantized-MMQ consumer" consumes A's fused `block_fp4_mmq`
|
||||
producer (the frozen struct in 4.4). Building B against a not-yet-landed producer means B's consumer
|
||||
seam is speculative until P3.
|
||||
2. **A defines B's baseline and the kill-gate threshold.** A alone (act-fuse, folding the 65 ms /8.2%
|
||||
bucket, plus any of the elementwise/host it captures) plausibly moves dense **41% -> ~50-55%** before
|
||||
B touches a kernel. B's *true residual is the GEMM after A removed the act round-trip*, not the raw
|
||||
59%. Running B's P2 against the stock 41% baseline mis-sizes the required GEMM speedup and the
|
||||
<15%-eff kill-gate.
|
||||
3. **A is lower-risk and independently shippable.** It is the safe win; it should not wait behind the
|
||||
risky kernel tune.
|
||||
|
||||
**Recommendation:** land A (tasks 38-41) first, **re-measure** the decode_agg and the GEMM share
|
||||
post-A, **then** run B's P2 and recompute the kill-gate against the post-A number. This makes the
|
||||
make-or-break decision cheaper, better-informed, and bankable-either-way.
|
||||
|
||||
### 9.7 Verdict (go / no-go)
|
||||
|
||||
- **DENSE: CONDITIONAL GO - commit to B, but scope and message it as "close most of the GEMM gap"
|
||||
(expected ~80-90%, parity the upside), NOT "true parity."** Justified because: the approach is
|
||||
bounded/tractable (9.4), it is a working-path tune with a dual existence proof (9.1), and the outcome
|
||||
is low-regret (9.5) - even a tripped kill-gate roughly doubles today's 41%. Conditions: (i) **land A
|
||||
first** (9.6); (ii) **gate hard at P2** (eff < 15% -> stop chasing parity, but keep the partial win);
|
||||
(iii) **put TMA in scope** as the floor-reaching fallback before declaring the floor unreachable (9.3).
|
||||
|
||||
- **MoE: NO-GO for parity from B (confirmed).** The doc's ~76% ceiling is honest, arguably optimistic
|
||||
(it assumes the ragged M~4/expert grouped GEMM hits its 80 ms floor, the hardest regime, where vLLM
|
||||
ships purpose-built Marlin). Realistic B+A landing **~70-76%**, B alone ~55-61%. Still worth doing -
|
||||
the `mmq_x`-down / block-pad work is cheap and partly landed (patch 0015) - but it must be sold as a
|
||||
**1.7-1.85x win, not parity**; MoE parity is a **B-plus-non-GEMM** program (elementwise fusion, host
|
||||
CUDA-graph, GDN bf16 state).
|
||||
|
||||
- **One line for the parent:** GB10 can plausibly reach **dense** decode parity with vLLM only at the
|
||||
**top of its FP4 envelope and only as B + A together** (B alone caps ~80%; A is what tips it over),
|
||||
and **cannot** reach **MoE** parity from the GEMM track alone (ceiling ~76%). **Commit to B** as a
|
||||
high-value, low-regret, bounded GEMM-gap-closing tune (honest expected landing **dense ~80-90%, MoE
|
||||
~55-65%**), **sequence track A first**, **gate at P2**, and **add a TMA weight-feed option** so the
|
||||
occupancy-only route is not the only shot at the floor that vLLM's TMA pipeline demonstrably reaches.
|
||||
@@ -1,96 +0,0 @@
|
||||
# Patch 0030 - fused-op backend gate (audit RISKY-1 fix) - RESULTS
|
||||
|
||||
Closes the single latent silent-miscompute hazard from `ARCH_GENERALITY_AUDIT.md`
|
||||
(RISKY-1): the fused GDN / discriminated-SSM_CONV decode ops are CUDA+CPU-only but
|
||||
were emitted DEFAULT-ON with no backend guard.
|
||||
|
||||
## The hazard
|
||||
|
||||
- `cparams.fused_gdn_ar = fused_gdn_ch = auto_fgdn = true` are set unconditionally
|
||||
in the `llama_context` constructor (`src/llama-context.cpp`).
|
||||
- Patches 0018/0019/0026 add `ggml_gated_delta_net_inplace[_ids][_hybrid]`
|
||||
(reuse `GGML_OP_GATED_DELTA_NET` with extra src slots).
|
||||
- Patches 0021/0028 add `ggml_ssm_conv_update_inplace[_ids]` which **reuse
|
||||
`GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]`/`src[4]`** (ring/ids).
|
||||
- Both families have CUDA + CPU kernels only. No `supports_op` change was made for
|
||||
the discriminated variants.
|
||||
- A backend that supports **plain** `SSM_CONV` but ignores the discriminator
|
||||
(Vulkan/SYCL/Metal) returns `supports_op==true` for the node; the scheduler
|
||||
assigns the discriminated conv to it; it runs the **wrong plain conv** =>
|
||||
SILENT corruption (not a crash).
|
||||
- The upstream `auto_fgdn` resolution only inspects `GATED_DELTA_NET` nodes, so the
|
||||
discriminated-`SSM_CONV` safety was only **incidentally** covered (GDN-op and
|
||||
discriminated-conv happened to share backend coverage). It goes live the moment a
|
||||
non-CUDA paged build of a gated-DeltaNet model exists.
|
||||
|
||||
## The fix (emission gate, not supports_op)
|
||||
|
||||
Chosen route: **gate the emission on the active compute backend type.** The
|
||||
`supports_op` route would require editing every other backend's per-device
|
||||
`supports_op` (Vulkan/SYCL/Metal/...) to reject the discriminated `SSM_CONV` -
|
||||
invasive, fragile, and not centrally exposed by the ggml backend interface. The
|
||||
emission gate is self-contained in the fork's own code.
|
||||
|
||||
`src/llama-context.cpp`, in `llama_context::sched_reserve()`, immediately before
|
||||
the existing `if (cparams.auto_fgdn)` resolution block: if any **non-CPU** compute
|
||||
backend has a reg name other than `"CUDA"` / `"ROCm"` (HIP) / `"MUSA"` (the three
|
||||
`GGML_CUDA_NAME` values - all the same hipified ggml-cuda TU that carries the
|
||||
discriminated-op handling), force
|
||||
`fused_gdn_ar = fused_gdn_ch = auto_fgdn = false`.
|
||||
|
||||
Every emission site keys off these flags:
|
||||
`conv_decode_fused = (n_seq_tokens==1) && (n_rs_seq==0) && fused_gdn_ar`
|
||||
(qwen35/qwen35moe/qwen3next + `build_conv_state_fused`) and
|
||||
`fused = (n_seq_tokens==1) ? fused_gdn_ar : fused_gdn_ch` (delta-net-base). With
|
||||
the flags false the graph takes the upstream non-fused branch: a **plain
|
||||
`ggml_ssm_conv` (no discriminator) + `ggml_silu`**, which every backend handles
|
||||
correctly.
|
||||
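The gate and the emission-site conditions can be modelled as plain boolean logic. The following Python sketch is a model of the behaviour described above, not the C++ in `llama-context.cpp`: backends are represented as hypothetical `(is_cpu, reg_name)` pairs, and only `fgdn_backend_ok`, the three flags and the two emission expressions are taken from this document.

```python
# Boolean model of the patch-0030 emission gate (illustrative, not the real C++).
CUDA_FAMILY = {"CUDA", "ROCm", "MUSA"}  # the reg names this document lists


def fgdn_backend_ok(backends):
    """True if every non-CPU compute backend is in the ggml-cuda family.

    backends: iterable of (is_cpu, reg_name) pairs (hypothetical representation).
    """
    return all(is_cpu or name in CUDA_FAMILY for is_cpu, name in backends)


def resolve_flags(backends, fused_gdn_ar=True, fused_gdn_ch=True, auto_fgdn=True):
    """Force all three flags off when an unsupported backend is active."""
    if not fgdn_backend_ok(backends):
        return False, False, False
    return fused_gdn_ar, fused_gdn_ch, auto_fgdn


def conv_decode_fused(n_seq_tokens, n_rs_seq, fused_gdn_ar):
    """Emission condition for the discriminated SSM_CONV decode op."""
    return n_seq_tokens == 1 and n_rs_seq == 0 and fused_gdn_ar


def gdn_fused(n_seq_tokens, fused_gdn_ar, fused_gdn_ch):
    """Emission condition for the fused GDN op (decode vs chunked)."""
    return fused_gdn_ar if n_seq_tokens == 1 else fused_gdn_ch


if __name__ == "__main__":
    for backends in ([(False, "CUDA"), (True, "CPU")], [(False, "Vulkan"), (True, "CPU")]):
        ar, ch, _ = resolve_flags(backends)
        print(backends, "-> conv fused:", conv_decode_fused(1, 0, ar),
              "gdn fused:", gdn_fused(1, ar, ch))
```

Because every emission site reads the same flags, clearing them once in the gate is enough to route the whole graph onto the plain `ggml_ssm_conv` + `ggml_silu` branch.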
|
||||
## CUDA byte-identical invariant
|
||||
|
||||
On a CUDA backend the reg name is `"CUDA"`, so `fgdn_backend_ok` stays true, the
|
||||
flags are left untouched, and the emitted decode graph is unchanged. The fix only
|
||||
changes behavior on non-CUDA/non-CPU backends. CUDA decode graph is byte-identical
|
||||
to pre-0030 **by construction** (no flag flips on CUDA), so the existing greedy
|
||||
md5 gates are unaffected on the validated GB10 target.
|
||||
|
||||
## Verification
|
||||
|
||||
- COMPILE (GPU-free, done on a CPU box): reconstructed the exact source state
|
||||
(upstream pin `9d5d882d` + paged patches `0001-0029`, .md docs stripped) and
|
||||
applied 0030. CPU-only build (`-DGGML_CUDA=OFF`) of `llama` + `test-backend-ops`
|
||||
links `libllama.so` and the test binary with **0 errors**; the edited
|
||||
`llama-context.cpp` compiles clean (uses only the already-included `<cstring>`
|
||||
and the backend-reg API already used in this TU:
|
||||
`ggml_backend_dev_backend_reg` / `ggml_backend_reg_name` /
|
||||
`ggml_backend_dev_type`).
|
||||
- 0030 applies cleanly on a fresh pin+0001-0029 tree via both `git apply --check`
|
||||
(Makefile path) and `patch -p1 -N` (prepare.sh path).
|
||||
- test-backend-ops correctness is a **CUDA0-vs-CPU** comparison; a CPU-only run
|
||||
skips CPU-vs-CPU by design ("Skipping CPU backend"). The relevant test cases are
|
||||
registered and will be exercised by the DGX CUDA run:
|
||||
`test_ssm_conv` / `test_ssm_conv_update` (SSM_CONV_UPDATE) /
|
||||
`test_ssm_conv_update_ids` (SSM_CONV_UPDATE_IDS) /
|
||||
`test_gated_delta_net` (+ `_hybrid`).
|
||||
|
||||
## Pending on the DGX (GPU)
|
||||
|
||||
The CUDA-side confirmation could not be run from the CPU box (the DGX cloudflared
|
||||
tunnel `jp-6.prem.io` was returning `websocket: bad handshake` for the whole
|
||||
session - origin offline). To run on the DGX `~/llama-paged-dev` (branch `paged`)
|
||||
once reachable, then commit 0030 there too:
|
||||
|
||||
```
test-backend-ops test -o SSM_CONV
test-backend-ops test -o SSM_CONV_UPDATE
test-backend-ops test -o SSM_CONV_UPDATE_IDS
test-backend-ops test -o GATED_DELTA_NET   # expect: 2/2 backends passed, OK
```
|
||||
|
||||
Greedy md5 (only if >40GB VRAM free; must equal the established baselines):
|
||||
`q36-27b-nvfp4 == 5951a5b4d624ce891e22ab5fca9bc439`,
|
||||
`q36-35b-a3b-nvfp4 == 07db32c2bcb78d17a43ed18bc22705cd`. Since 0030 does not flip
|
||||
any flag on CUDA, the md5 is unchanged by code-path argument; the run is a
|
||||
belt-and-suspenders confirmation, not a correctness dependency.
|
||||
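A minimal sketch of the md5 comparison, assuming the greedy output has already been captured as bytes. How the output is produced and serialized is not specified here, so `md5_gate` and its input are illustrative; only the two baseline hashes come from this document.

```python
import hashlib

# Baseline greedy-output hashes quoted in this document.
BASELINES = {
    "q36-27b-nvfp4": "5951a5b4d624ce891e22ab5fca9bc439",
    "q36-35b-a3b-nvfp4": "07db32c2bcb78d17a43ed18bc22705cd",
}


def md5_gate(model, output_bytes):
    """Return True if the captured greedy output hashes to the model's baseline."""
    return hashlib.md5(output_bytes).hexdigest() == BASELINES[model]


if __name__ == "__main__":
    # Placeholder bytes; a real run would pass the captured greedy completion.
    print(md5_gate("q36-27b-nvfp4", b"placeholder output"))
```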
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,89 +0,0 @@
|
||||
# Decode-Parity: Parked Levers (future exploration)
|
||||
|
||||
**Context.** The bit-exact decode-parity effort shipped patches **0018-0023**: dense decode
|
||||
38% -> **95% of vLLM** @npl128 on GB10 / DGX Spark (LPDDR5x ~273 GB/s), every patch
|
||||
**byte-identical to llama's own f32 output** (md5-gated). The gated-DeltaNet recurrence (the
|
||||
dominant ~50% kernel) now runs at **84.6% of peak BW = past vLLM's 82.4%**, at the DRAM floor.
|
||||
bf16 SSM state was fully built and **shelved** (real +25-31% lever but fails the f32 KL gate).
|
||||
|
||||
The remaining non-recurrence kernels (FP4 GEMM, attention, lm_head) are at their bit-exact
|
||||
floor: any knob changes a reduction order vs the f32 reference. So further *bit-exact* decode
|
||||
gains are marginal; the levers below are the honest pick-up points, ranked by promise.
|
||||
|
||||
---
|
||||
|
||||
## 1. Hybrid-precision SSM state (the most promising)
|
||||
|
||||
The bf16 build (`BF16_SSM_STATE_RESULTS.md`) proved the throughput lever is large -
|
||||
recurrence **-49%/call** (dense 3.38 -> 1.73 ms), dense decode ~**490 t/s = 125% of vLLM** (clean
|
||||
runs), MoE @128 **+24.9%** - but bf16 fails the f32 KL gate (KLD 0.06-0.17 at >=1024 ctx,
|
||||
~10% argmax flips). The discrimination showed the error is **intrinsic to bf16 over the
|
||||
long-memory heads** (exp(g) ~ 1, where the per-step decay does not contract the rounding);
|
||||
short/fast-decaying heads are fine.
|
||||
|
||||
**Lever:** a per-head (or per-channel) precision split - keep the long-memory heads (g near 1)
|
||||
in f32, store the fast-decaying heads (g well below 1, where rounding contracts) in bf16. Could
|
||||
capture most of the speedup while passing the KL gate. Needs a g-magnitude classifier at graph
|
||||
build + a mixed-dtype recurrent-state cache. **HIGH promise, moderate effort.** The bf16 kernel
|
||||
plumbing already exists (DGX `~/llama-paged-dev/BF16_SSM_STATE.diff`); this adds the per-head
|
||||
dtype selection on top.
|
||||
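One way to picture the proposed split is a classifier that maps each head's decay to a storage dtype. This is a hypothetical sketch of an unbuilt lever: it assumes a single representative log-decay `g` per head is available at graph build, and the 0.99 threshold is an arbitrary placeholder, not a measured cut-off.

```python
import math


def split_heads_by_decay(g_per_head, decay_threshold=0.99):
    """Assign a state dtype per head from its representative log-decay g.

    exp(g) is the per-step state multiplier: heads with exp(g) near 1 keep
    rounding error alive across steps, so they stay in f32.
    decay_threshold is a hypothetical knob, not a value from this document.
    """
    plan = {}
    for head, g in enumerate(g_per_head):
        decay = math.exp(g)
        plan[head] = "f32" if decay >= decay_threshold else "bf16"
    return plan


if __name__ == "__main__":
    print(split_heads_by_decay([0.0, -0.001, -0.5, -2.0]))
```

Whether such a static classification holds up would need to be checked against the KL gate, since the decay applied in the recurrence may vary per token.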
|
||||
*Note (precision, corrected):* plain bf16 (no split) is a legitimate **opt-in for precision-tolerant
|
||||
deployments**, but it is *below* vLLM's recurrent precision, NOT equal to it. vLLM keeps the
|
||||
gated-DeltaNet **temporal state in f32** (proven three ways in `BITEXACT_VS_VLLM.md`; only its tiny
|
||||
conv state is bf16, and llama keeps even that f32). So bf16 here trades *below-vLLM* precision for
|
||||
*above-vLLM* throughput. We declined it as the default because both llama's f32 AND vLLM's f32 are a
|
||||
higher bar - and at equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak BW),
|
||||
so we do not need bf16 to match vLLM's recurrence.
|
||||
|
||||
## 2. Dense CUDA-graph instability
|
||||
|
||||
The bf16 dense decode was **bimodal** across runs (287 / 336 / 487 / 498 t/s) - a dense-path
|
||||
CUDA-graph capture/replay instability (good runs hit ~490). The f32 dense path measured stable
|
||||
(371-376) but the bimodality is a latent fragility worth root-causing; a robust graph capture on
|
||||
the dense path could stabilize and possibly lift dense decode. **Moderate promise**, diagnostic.
|
||||
|
||||
## 3. Dense rms_norm -> fp4 producer-fold (~1.5-2.5%, parked as flat-risk)
|
||||
|
||||
The last bit-exact bucket (`RMSNORM_FP4_FOLD.md`). Folding the standalone `quantize_mmq_nvfp4`
|
||||
into the rms_norm+mul producer at the FFN boundary (f32 output dead -> droppable) could recover
|
||||
~1.5-2.5% dense. Parked because: the Lever-2 precedent measured **flat**, it has the worst
|
||||
gain/plumbing ratio (3-op `{RMS_NORM,MUL,MUL_MAT(NVFP4)}` graph fusion + a pre-quantized-src1
|
||||
GEMM path + scratch-pool / CUDA-graph-lifetime plumbing), and the gain risks being swallowed by
|
||||
the ~0.3-0.5% bench noise floor. Revisit only with the inter-node graph-CSE plumbing built and
|
||||
proven on a same-build flag toggle (decode_agg lift above noise AND md5 == 0023). **LOW promise.**
|
||||
|
||||
## 4. Datacenter Blackwell (sm_100)
|
||||
|
||||
This effort targeted **consumer** Blackwell sm_12x (sm_120 RTX 50-series, sm_121 GB10). Datacenter
|
||||
Blackwell (B100/B200/GB200, sm_100 / cc 10.0) has HBM3e (much higher BW) and different MMA
|
||||
characteristics - the LPDDR5x bandwidth floor that dominates GB10 decode does **not** apply, so the
|
||||
whole calculus changes (likely compute-bound, not BW-bound; the recurrence would not be the binding
|
||||
kernel). A separate investigation if datacenter Blackwell becomes a target.
|
||||
|
||||
## 5. Prefill / TTFT scheduler + paged-pool burst degradation (HIGH priority - the weakest benchmark number)
|
||||
|
||||
The final benchmark (`QWEN36_NVFP4_BENCH.md`) exposed TTFT as the clear weak spot vs vLLM. Two distinct
|
||||
issues:
|
||||
- **Static decode-first budget tradeoff:** the QoS budget (patches 0013/0016, `LLAMA_MAX_BATCH_TOKENS=512`)
|
||||
maximizes decode tok/s + memory but throttles burst-prefill, so under a synchronized 128-way burst TTFT
|
||||
climbs to **903 s dense / 213 s MoE @npl128** vs vLLM's chunked-prefill 6-18 s. A dynamic/adaptive budget
|
||||
(by concurrency + queue depth), or matching vLLM's chunked-prefill interleave, would rebalance.
|
||||
- **Paged-pool burst-degradation BUG (concrete, found in the benchmark):** after a high-npl burst, a
|
||||
server's *subsequent lower-npl* prefill collapses (fresh npl8 = 507 t/s / 6 s TTFT; npl8 after an npl64
|
||||
burst = 65 t/s / 64 s). Decode stays robust; only prefill degrades -> root-cause the paged-pool state
|
||||
that persists across the burst.
|
||||
|
||||
**HIGH promise** for the serving experience: decode (dense 90-117%, MoE 77-83% of vLLM) and memory (1.5-3x
|
||||
lower) are already strong; TTFT is the one number holding back a clean public win.
|
||||
|
||||
## 6. MoE-specific recurrence tuning
|
||||
|
||||
The occupancy retune (0022) was tuned on the dense path; it lifted MoE +8.3% as a side effect. The
|
||||
MoE path (`MUL_MAT_ID` grouped GEMM + the shared GDN recurrence, expert routing changes the GEMM
|
||||
shapes) may have MoE-specific occupancy headroom. Worth a MoE-targeted reprofile.
|
||||
|
||||
---
|
||||
|
||||
*All shelved per the host handover - experiments parked. Pick up from the linked result docs in this
|
||||
directory.*
|
||||
@@ -1,208 +0,0 @@
|
||||
# GDN decode verify: is llama.cpp's Gated-Delta-Net decode O(1) or an O(ctx) re-scan?
|
||||
|
||||
Verdict-first, then the evidence. This closes lever 5 of `VLLM_DECODE_GROUNDING.md` ("Verify
|
||||
llama's GDN/linear-attention decode path"): on the Qwen3.6 hybrid models, is llama re-scanning the
|
||||
context (O(ctx)) in the linear-attention layers, or keeping vLLM's O(1)-in-context recurrent state?
|
||||
|
||||
Method: GGUF-metadata + source reading on the `paged` dev tree (`~/llama-paged-dev`, build-cuda
|
||||
sm_121) on `dgx.casa`, plus nsys CUDA-kernel decode traces on `~/bench/q36-27b-nvfp4.gguf`
|
||||
(GB10 / DGX Spark, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`). Models:
|
||||
`~/bench/q36-27b-nvfp4.gguf` (dense, arch `qwen35`), `~/bench/q36-35b-a3b-nvfp4.gguf`
|
||||
(MoE, arch `qwen35moe`).
|
||||
|
||||
## TL;DR verdict
|
||||
|
||||
**llama.cpp's GDN decode is EFFICIENT: it is O(1)-in-context, a single fused CUDA kernel that
|
||||
reads + updates a fixed-size cached recurrent state, structurally identical to vLLM's
|
||||
`fused_recurrent_gated_delta_rule`. It is NOT a re-scan, NOT a context-scaling blowup, and NOT a
|
||||
major contributor to the ~2.4x eager-decode gap.** There is no GDN-specific bottleneck to fix, so
|
||||
the cheap model-specific lever this probe was hunting for does not exist. The 2.4x is the general
|
||||
kernel work (the FP4 weight GEMM, which dominates the step, plus the O(ctx) full-attention decode
|
||||
kernel in the minority of full-attention layers), exactly as `VLLM_DECODE_GROUNDING.md` concluded.
|
||||
|
||||
The decisive datum: at matched batch (npl4), pure decode, 4x more context, the GDN kernel time is
|
||||
**flat** while the full-attention kernel grows ~3.1x:
|
||||
|
||||
| kernel | ctx 1024 | ctx 4096 | ratio | meaning |
|--------|---------:|---------:|------:|---------|
| `gated_delta_net_cuda` (GDN linear-attn) | 10.3 us/launch | 8.0 us/launch | **0.78x (flat)** | **O(1) in ctx** |
| `flash_attn_tile` (full-attn layers) | 27.1 us/launch | 85.0 us/launch | **3.1x** | O(ctx), as expected |
| total ms / decode step | 84.9 | 86.0 | 1.01x | GEMM-bound, ctx-independent |
|
||||
|
||||
Identical decode-step counts in both windows (~190 steps, ~9134 GDN launches), so this is a
|
||||
per-step like-for-like comparison: the GDN layers do **not** get more expensive as context grows.
|
||||
|
||||
## 1. Architecture (confirmed from GGUF metadata + tensor names)
|
||||
|
||||
Both Qwen3.6 models are hybrid: a `full_attention_interval` of 4 means every 4th layer is standard
|
||||
full attention and the other 3/4 are Gated-Delta-Net (GDN) linear attention with a recurrent state.
|
||||
|
||||
**Dense Qwen3.6-27B (`general.architecture = qwen35`):**
|
||||
- `block_count = 64`, `full_attention_interval = 4` -> **16 full-attention layers + 48 GDN layers**.
|
||||
- Full-attn: `head_count = 24`, `head_count_kv = 4` (GQA), `key_length = value_length = 256`,
|
||||
rope `freq_base = 1e7`, mrope sections `[11,11,10,0]`.
|
||||
- GDN/SSM: `ssm.state_size = 128`, `ssm.conv_kernel = 4`, `ssm.group_count = 16`,
|
||||
`ssm.time_step_rank = 48`, `ssm.inner_size = 6144`. So the recurrent state per GDN layer is
|
||||
`[S_v=128, S_v=128, H_v=48]` per sequence (`H_v = inner_size/state_size = 6144/128 = 48` value
|
||||
heads), i.e. a 128x128 state matrix per head, ~3.1 MB (F32) per sequence per layer.
|
||||
|
||||
**MoE Qwen3.6-35B-A3B (`general.architecture = qwen35moe`):**
|
||||
- `block_count = 41`, `full_attention_interval = 4` (~10 full-attn + ~31 GDN layers).
|
||||
- `head_count = 16`, `head_count_kv = 2`, `key_length = value_length = 256`,
|
||||
`expert_count = 256`, `expert_used_count = 8`, `expert_feed_forward_length = 512`.
|
||||
- Same SSM dims: `state_size = 128`, `conv_kernel = 4`, `group_count = 16`,
|
||||
`inner_size = 4096` -> `H_v = 32` value heads.
|
||||
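The per-layer state footprint follows directly from the metadata above. A short Python check, assuming F32 state (4 bytes per element); `gdn_state_bytes` is an illustrative helper name:

```python
def gdn_state_bytes(state_size, inner_size, bytes_per_elem=4):
    """Return (value_heads, bytes) of the recurrent state per sequence per GDN layer."""
    heads = inner_size // state_size                 # H_v = inner_size / state_size
    return heads, state_size * state_size * heads * bytes_per_elem


if __name__ == "__main__":
    for name, inner in (("dense 27B", 6144), ("MoE 35B-A3B", 4096)):
        heads, nbytes = gdn_state_bytes(128, inner)
        print(f"{name}: H_v={heads}, {nbytes} bytes (~{nbytes / 1e6:.1f} MB) per seq per layer")
```

The size depends only on `state_size` and `inner_size`; context length does not appear in it.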
|
||||
**Tensor names confirm the op split (27B, per-layer dump):**
|
||||
- GDN layers (e.g. `blk.0.*`): `ssm_alpha`, `ssm_beta`, `ssm_conv1d`, `ssm_a`, `ssm_dt.bias`,
|
||||
`ssm_norm`, `ssm_out`, plus `attn_qkv` / `attn_gate` (the in/out projections of the linear-attn
|
||||
block). No `attn_k/v/output`, no per-head q/k norm.
|
||||
- Full-attn layers (e.g. `blk.3.*`, every 4th): `attn_q`, `attn_k`, `attn_v`, `attn_output`,
|
||||
`attn_q_norm`, `attn_k_norm`. No `ssm_*`.
|
||||
|
||||
llama loads the GDN layers through the **recurrent memory** (`llama-memory-recurrent`), not the KV
|
||||
cache: the conv state and the SSM state live in `conv_states_all` / `ssm_states_all` and are read
|
||||
and written every step. Only the 16/10 full-attention layers use the (paged) KV cache. This is the
|
||||
SSM-style recurrent path, not standard attention.
|
||||
|
||||
## 2. llama.cpp GDN decode implementation: O(1) recurrent-state update (code-proven)
|
||||
|
||||
Graph build (shared by both models): `src/models/delta-net-base.cpp`, dispatched from
|
||||
`src/models/qwen35.cpp` and `src/models/qwen35moe.cpp` (the MoE class inherits
|
||||
`llm_build_delta_net_base` and calls the same `build_recurrent_attn`, qwen35moe.cpp:472).
|
||||
|
||||
**Decode dispatch (`build_delta_net`, delta-net-base.cpp:425-447):** when `n_seq_tokens == 1`
|
||||
(decode), it takes `build_delta_net_fused` if `cparams.fused_gdn_ar` (the default, see below), else
|
||||
`build_delta_net_autoregressive`. Both are O(1):
|
||||
|
||||
- `build_delta_net_autoregressive` (delta-net-base.cpp:289-371) is the explicit rank-1 recurrence on
|
||||
the fixed-size state `s` shaped `[S_v, S_v, H_v, n_seqs]`: `s *= exp(g)` (decay),
|
||||
`sk = sum_rows(s * k)`, `d = (v - sk^T) * beta`, `s += k (x) d^T` (rank-1 update),
|
||||
`o = sum_rows(s * q)`. **No loop over past tokens, no KV read** - it touches only the state and
|
||||
the single new token's q/k/v/g/beta. `GGML_ASSERT(n_tokens == 1)`.
|
||||
- `build_delta_net_fused` (delta-net-base.cpp:373-423) collapses the same recurrence into one op,
|
||||
`ggml_gated_delta_net(q, k, v, g, b, s, K=1)`.
|
||||
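The five-line recurrence above can be written out for a single head in plain Python. This is a reference sketch of the math only: the `state[i][j]` layout (i = key index, j = value index) is an assumption and may be transposed relative to ggml's tensor layout.

```python
import math


def gdn_decode_step(state, q, k, v, g, beta):
    """One gated-delta-rule decode step for one head; mutates state in place.

    state[i][j]: i indexes the key dimension, j the value dimension (assumed layout).
    Cost is proportional to len(k) * len(v), independent of how many tokens came before.
    """
    n_k, n_v = len(k), len(v)
    decay = math.exp(g)
    for i in range(n_k):                       # s *= exp(g)
        for j in range(n_v):
            state[i][j] *= decay
    # sk = sum_rows(s * k): what the state currently predicts for this key
    sk = [sum(state[i][j] * k[i] for i in range(n_k)) for j in range(n_v)]
    # d = (v - sk) * beta: the correction to write
    d = [(v[j] - sk[j]) * beta for j in range(n_v)]
    for i in range(n_k):                       # s += k (x) d^T, rank-1 update
        for j in range(n_v):
            state[i][j] += k[i] * d[j]
    # o = sum_rows(s * q): readout
    return [sum(state[i][j] * q[i] for i in range(n_k)) for j in range(n_v)]


if __name__ == "__main__":
    s = [[0.0, 0.0], [0.0, 0.0]]
    print(gdn_decode_step(s, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 5.0], g=0.0, beta=1.0))
    print(gdn_decode_step(s, q=[1.0, 0.0], k=[1.0, 0.0], v=[7.0, 11.0], g=0.0, beta=1.0))
```

With `g = 0` and `beta = 1`, writing the same key twice overwrites the stored value, which is the delta-rule behaviour; the only inputs are the state and the one new token.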
|
||||
**State is cached across steps, not rebuilt (`build_recurrent_attn`, delta-net-base.cpp:527-606):**
|
||||
the input state `s` is read from `ssm_states_all` via `build_rs`, and the new state is copied back
|
||||
with `ggml_cpy(new_state, view(ssm_states_all, ... kv_head ...))` (lines 555-558). The causal-conv
|
||||
state is handled the same way in `build_conv_state` (449-525): the previous `conv_kernel-1 = 3`
|
||||
samples are read from `conv_states_all`, the new token is appended, and the last 3 are written back.
|
||||
So both pieces of GDN state persist in the recurrent cache exactly like a KV cache persists tokens -
|
||||
this is the recurrent analogue, fixed size, independent of context length.
|
||||
|
||||
**Defaults (`src/llama-context.cpp:200-201`):** `cparams.fused_gdn_ar = true` and
|
||||
`fused_gdn_ch = true`. They are only auto-disabled if the fused op cannot be scheduled on the same
|
||||
device as the layer (`device_gdn != device_kv`, lines 540-595); on a single GB10 with `-ngl 99`
|
||||
that does not happen, so the **fused single-kernel path is what runs**.
|
||||
|
||||
**The CUDA kernel (`ggml/src/ggml-cuda/gated_delta_net.cu`) is the crux, and it is unambiguously
|
||||
O(1) in context:**
|
||||
- Launch grid `dim3(H, n_seqs, ceil(S_v/4))` and block `(min(warp,S_v), 4, 1)` (lines 184-185):
|
||||
the grid spans heads x sequences x state-columns. **There is no context-length dimension and no
|
||||
context-length argument anywhere in the kernel signature** (q/k/v/g/beta are the new token(s)
|
||||
`[S_v, H, n_tokens, n_seqs]`; `curr_state` is the fixed `[S_v, S_v, H, n_seqs]`).
|
||||
- Each warp loads its shard of the fixed-size state into registers **once** (lines 57-61), then
|
||||
loops `for (t = 0; t < n_tokens; t++)` (line 63). At decode `n_tokens == 1`, so it is a single
|
||||
iteration: read the one new token, do the rank-1 update
|
||||
`s_shard[r] = g * s_shard[r] + k[i] * delta_col` and the readout `attn = S^T q` (lines 84-141),
|
||||
then write the updated state back (lines 161-167). No second loop, no read of any past KV.
|
||||
- Work per decode step is therefore proportional to `S_v * S_v * H * n_seqs` (the state size x
|
||||
batch) and **constant in context length**. This is precisely vLLM's
|
||||
`fused_recurrent_gated_delta_rule_packed_decode_kernel` (one batched launch updating a
|
||||
fixed-size `[K,V]` state) cited in the grounding doc.
|
||||
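The launch geometry quoted above can be tabulated to make the missing context dimension explicit. A small Python sketch, assuming a warp size of 32; the helper names are illustrative:

```python
import math

WARP_SIZE = 32  # NVIDIA warp size


def gdn_launch_dims(n_heads, n_seqs, s_v):
    """Grid and block shape as described: heads x sequences x state-column groups."""
    grid = (n_heads, n_seqs, math.ceil(s_v / 4))
    block = (min(WARP_SIZE, s_v), 4, 1)
    return grid, block


def state_elems(n_heads, n_seqs, s_v):
    """Elements touched per decode step: S_v * S_v * H * n_seqs."""
    return s_v * s_v * n_heads * n_seqs


if __name__ == "__main__":
    for n_seqs in (4, 128):
        print(n_seqs, gdn_launch_dims(48, n_seqs, 128), state_elems(48, n_seqs, 128))
```

Neither function takes a context length, mirroring the kernel signature: the work scales with batch and state size only.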
|
||||
A chunked GPU kernel for prefill is a TODO (delta-net-base.cpp:181 `//TODO: Add chunked kernel`);
|
||||
the chunked CPU/graph path (`build_delta_net_chunking`) only runs for multi-token ubatches
|
||||
(prefill), never at decode.
|
||||
|
||||
## 3. nsys decode profiling: GDN is a small share and does not scale with context
|
||||
|
||||
Qwen3.6-27B NVFP4, sm_121, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`, `llama-server` driven
|
||||
to steady decode by a looping completion client. Kernel time bucketed by name (full classifier and
|
||||
sqlites under `~/bench/gdn_study/`).
|
||||
|
||||
**(a) Share at the headline batch (npl128, ctx 1024), GPU 92.7% busy:**
|
||||
|
||||
| bucket | % of busy | us/launch |
|--------|----------:|----------:|
| GEMM_weight (`mul_mat_q`/`mul_mat_vec_q`) | 59.2 | - |
| **GDN_recurrent (`gated_delta_net_cuda`)** | **8.9** | 369 |
| GEMM_act_quant (`quantize_mmq_nvfp4`) | 8.2 | - |
| elementwise / act_glu / norm / rope | ~13.5 | - |
| embed_gather (`get_rows`) | 2.9 | - |
| **ATTENTION_full (`flash_attn`, 16 layers)** | **1.8** | 107 |
| copy_cast (`cpy`) | 1.8 | - |
| **GDN_conv (`ssm_conv`)** | **1.5** | - |
|
||||
|
||||
The whole GDN path (recurrent 8.9% + conv 1.5%) is ~10% of the step; full attention is ~2%; the
|
||||
**weight GEMM dominates at ~67% (59.2% GEMM + 8.2% act-quant requant)**. This is the dense model,
|
||||
where the grounding predicted the GEMM would be the lever.
|
||||
|
||||
**(b) Share at low batch (npl32, ctx 1024), weight-bandwidth (GEMV) regime, GPU ~100%:**
|
||||
GEMM_weight 88.7%, GDN_recurrent 0.8%, ATTENTION_full 0.7%, GDN_conv 0.3%. At low batch the
|
||||
weight-read GEMV swamps everything and GDN is negligible; the GDN share tracks the batch, not the
|
||||
context.
|
||||
|
||||
**(c) Context-scaling control (the decisive test): matched batch npl4, pure decode, ctx 1024 vs
|
||||
4096.** Small batch -> fast prefill -> a clean pure-decode capture (verified: GEMM is the M=1
|
||||
`mul_mat_vec_q` decode GEMV, and the client completed decode rounds inside the window). Identical
|
||||
decode-step counts (~190 steps, gated_delta_net launched 9141 vs 9134 times), so per-launch time is
|
||||
a true per-step comparison:
|
||||
|
||||
| kernel / bucket | ctx 1024 | ctx 4096 | ratio |
|-----------------|---------:|---------:|------:|
| `gated_delta_net_cuda` us/launch | 10.3 | **8.0** | **0.78x (flat)** |
| GDN_recurrent share | 0.6% | 0.4% | flat/down |
| `ssm_conv` (GDN_conv) us/launch | 5.2 | 5.2 | 1.00x |
| `flash_attn_tile` us/launch | 27.1 | **85.0** | **3.14x** |
| ATTENTION_full share | 0.6% | 1.8% | 3.0x up |
| total ms / decode step | 84.9 | 86.0 | 1.01x |
|
||||
|
||||
The GDN kernel time is flat (even a hair faster) across a 4x context increase, while the
|
||||
full-attention kernel grows ~3x, exactly the O(1)-vs-O(ctx) signature. The total step time barely
|
||||
moves because at this batch the (context-independent) FP4 weight GEMM is 88% of the step. This is
|
||||
the empirical confirmation of the code analysis: **llama's GDN decode does not re-scan the context.**
|
||||
|
||||
(An earlier npl32 ctx4096 attempt was discarded: with 32 parallel slots each independently
|
||||
prefilling ~4100 tokens, the nsys window caught prefill, not steady decode - the `mul_mat_q(M=128)`
|
||||
+ `flash_attn_ext_f16(ctx4096)` signature gave it away. The npl4 runs above avoid this by keeping
|
||||
prefill short.)
|
||||
|
||||
## 4. Verdict and fix scope
|
||||
|
||||
**Efficient, not a bottleneck.** llama.cpp runs the Qwen3.6 GDN/linear-attention layers as a fused,
|
||||
single-CUDA-kernel, O(1)-in-context recurrent-state update, with the conv and SSM state cached in
|
||||
the recurrent memory across decode steps. It is algorithmically the same as vLLM's O(1)
|
||||
`fused_recurrent` decode. The probe's worst case (llama re-scanning context => GDN layers ballooning
|
||||
with context and concurrency) is **falsified**: the GDN kernel is flat across 4x context, and the
|
||||
op carries no context-length parameter at all.
|
||||
|
||||
**So the GDN path is not the cheap model-specific lever.** It is a small-to-moderate, context-flat
|
||||
share of the step (~0.4-0.8% at low batch, ~10% including conv at batch 128), and removing it would
|
||||
not dent the 2.4x. The gap is the general kernel work, confirming `VLLM_DECODE_GROUNDING.md`:
|
||||
1. the **FP4 weight GEMM** is the dominant bucket (~59% GEMM + ~8% `quantize_mmq_nvfp4` requant that
|
||||
vLLM fuses away via native FP4-MMA / grouped Marlin); this is the biggest, hardest lever.
|
||||
2. the **full-attention decode kernel** is the O(ctx) residual (the only thing that grows with
|
||||
context, ~3x per-launch over 4x ctx), in the minority of full-attention layers.
|
||||
|
||||
If anything on the GDN side is ever worth touching, it is a bounded micro-optimization, not a
|
||||
complexity fix: the kernel is memory-bound on the F32 recurrent state (state read+write is
|
||||
`S_v^2 * H * batch` = ~0.79 GB/step over 273 GB/s at batch 128, hence the ~8.9% share), and this
|
||||
traffic is **intrinsic to the architecture - vLLM pays the identical state I/O**, so it is not a
|
||||
llama-specific inefficiency. A future win could keep the recurrent state in bf16 or fuse the
|
||||
`ssm_conv` + gated-norm into the delta-net kernel to shave that ~10%, but the ceiling is small and
|
||||
it does not close the 2.4x. The throughput effort stays where the grounding put it: the FP4 GEMM
|
||||
(fused act-quant + native FP4-MMA) and the full-attention decode kernel, with a CUDA-graphed
|
||||
steady-state step as the bounded host-side add-on.
|
||||
|
||||
## Reproduce
|
||||
|
||||
- Metadata: `python3 gguf-py/gguf/scripts/gguf_dump.py --no-tensors ~/bench/q36-27b-nvfp4.gguf`.
|
||||
- Code: `src/models/delta-net-base.cpp` (build_delta_net 425, autoregressive 289, fused 373,
|
||||
build_recurrent_attn 527, build_conv_state 449); `src/llama-context.cpp:200-201,540-595`
|
||||
(fused_gdn defaults/guard); `ggml/src/ggml-cuda/gated_delta_net.cu` (kernel 4-168, launch grid
|
||||
184-185, dispatch 226-312).
|
||||
- Profiles: `~/bench/gdn_study/drv.sh <label> <P> <K> <ctx> <delay> <dur>` runs `llama-server` under
|
||||
nsys and drives `clientloop.py`; `catgdn.py <sqlite>` buckets kernels. Sqlites:
|
||||
`gdn_npl128_ctx1024`, `gdn_npl32_ctx1024`, `gdn_npl4_ctx1024`, `gdn_npl4_ctx4096`.
|
||||
@@ -1,344 +0,0 @@
|
||||
# GDN recurrence byte gate + fused single-pass kernel design
|
||||
|
||||
Label: llama-fused-recurrence-design (READ-ONLY, no GPU). Source-and-math design only;
|
||||
the byte-ratio measurement itself is produced by the `ncu-byte-gate` agent.
|
||||
|
||||
## TL;DR (the correction the workflow was set up to settle)
|
||||
|
||||
**The recurrence kernel is ALREADY single-pass on the f32 state.** `gated_delta_net_cuda<128>`
|
||||
(after patches 0018 in-place write + 0019 fused gather) loads the whole `s0` column into registers
|
||||
ONCE (`s_shard[rows_per_lane]`), runs the entire token loop in registers, and writes the new state
|
||||
back ONCE - directly into the persistent cache slot (0018) or scratch. For decode `n_tokens==1`,
|
||||
`keep_rs_t==false`: one register load, one register store, no re-read of state from DRAM.
|
||||
|
||||
The byte-gate's working hypothesis - "un-fused l2norm/gate/decay/recurrence/state-writeback/gather
|
||||
each touching the f32 state, so a fused pass halves DRAM bytes" - is **false for the state**. Only
|
||||
the recurrence kernel touches the 3 MB/seq state. The surrounding ops (`l2_norm`, `silu`, `sigmoid`,
|
||||
the `gate` exp/softplus, `ssm_conv`, `concat`, `cpy`) all operate on the **small activations**
|
||||
(q/k/v/g/beta), which are 100-800x smaller than the state. There is no 2x state re-streaming to
|
||||
recover; the recurrence kernel is byte-minimal on state by construction.
|
||||
|
||||
Therefore a fused single-pass kernel **cannot move the dominant 196 ms recurrence** - that cost is
|
||||
f32-state read+write bandwidth, already a single pass. The two real levers are decoupled:
|
||||
|
||||
1. **Fold the surrounding activation ops into the kernel** (MEDIUM effort): recovers the small
|
||||
per-op buckets (`ssm_conv` 1.5% + `silu`/`sigmoid` 1.4% + 2x `l2_norm` + `concat` 2.1% + conv
|
||||
`cpy` 2.0%, ~6-8% of the step) plus per-op launch overhead. Bit-exact. Ceiling ~93-96% of vLLM.
|
||||
2. **bf16 state cache** (HIGH effort, NON-bit-exact): halves the dominant byte stream. The only
|
||||
large lever on the 196 ms. Target KL < 1e-3 by keeping f32 register accumulation, storing only
|
||||
the persisted cache in bf16.
|
||||
|
||||
Which of (1)/(2) is worth building hinges on the `ncu-byte-gate` byte ratio (below).
|
||||
|
||||
## Byte arithmetic (dense q36-27b-nvfp4, decode, npl128, S_v=128, H_v=48, batch=128)
|
||||
|
||||
State per (seq, GDN layer) = S_v^2 * H_v = 128*128*48 = 786,432 f32 = **3.0 MiB**.
|
||||
|
||||
Per kernel call (one GDN layer, full 128-seq batch), single pass:
|
||||
- state read = 786,432 * 128 * 4 = 402.65 MB
|
||||
- state write = 402.65 MB
|
||||
- **state R+W = 805.3 MB/call** (768 MiB)
|
||||
- activations (q,k 1 MB each; v 3 MB; attn-out 3 MB; g,beta tiny) ~= 8 MB/call = **<1%**.
|
||||
|
||||
Measured 4.08 ms/call (node-level trace) -> effective **197.4 GB/s**.
|
||||
GB10 / DGX Spark LPDDR5X peak ~= **273 GB/s** -> **~72% of peak.**
|
||||
|
||||
48 GDN layers/step -> 38.7 GB of state traffic/step -> 196 ms = 51.6% of the 383.48 ms step. The ~8 MB of
|
||||
activation traffic is noise; state is 99% of the recurrence bytes.
|
||||
|
||||
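The byte arithmetic above can be re-derived in a few lines. This is a minimal sketch using only the constants quoted in this section (S_v, H_v, batch, the 4.08 ms/call trace figure, the ~273 GB/s peak); the variable names are illustrative and not taken from the llama.cpp source.

```python
# Re-derive the per-call state traffic and effective bandwidth quoted above.
# All constants come from this section; names are illustrative only.
S_V = 128              # state head dim
H_V = 48               # value heads
BATCH = 128            # sequences per decode step (npl128)
F32_BYTES = 4
GDN_LAYERS = 48        # kernel calls per decode step
MS_PER_CALL = 4.08     # measured, node-level trace
PEAK_BW_GBPS = 273.0   # GB10 LPDDR5X peak as quoted above

state_elems = S_V * S_V * H_V                     # f32 elements per (seq, layer)
state_mib = state_elems * F32_BYTES / 2**20       # per-sequence state size
read_bytes = state_elems * BATCH * F32_BYTES      # one full-batch state read
rw_bytes = 2 * read_bytes                         # single pass: one read + one write
eff_bw_gbps = rw_bytes / 1e9 / (MS_PER_CALL / 1e3)
pct_peak = 100.0 * eff_bw_gbps / PEAK_BW_GBPS
step_traffic_gb = rw_bytes * GDN_LAYERS / 1e9

print(f"state/seq/layer : {state_elems} f32 = {state_mib:.1f} MiB")
print(f"state R+W/call  : {rw_bytes / 1e6:.1f} MB")
print(f"effective BW    : {eff_bw_gbps:.1f} GB/s ({pct_peak:.0f}% of peak)")
print(f"state traffic   : {step_traffic_gb:.1f} GB/step")
```

Swapping `F32_BYTES` for 2 gives the bf16 figure that lever (2) targets.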
### What this means for the open question
|
||||
- The recurrence is single-pass, coalesced (transposed layout: lane reads `state[col*S_v + i]`,
|
||||
consecutive lanes -> consecutive `i`), running at ~72% of peak BW. It is NOT at the 85% hardware
|
||||
floor, but it is NOT re-streaming state either. The 72->85% headroom (~30 ms, bit-exact) is an
|
||||
occupancy/coalescing tune, NOT a fusion win.
|
||||
- vLLM `fused_recurrent_gated_delta_rule` does the SAME single-pass recurrence. If vLLM's recurrent
|
||||
state cache is bf16 (model dtype) while llama's is f32, vLLM moves HALF the bytes on the dominant
|
||||
stream - that alone is ~98 ms, i.e. essentially the whole residual decode gap. **This is the
|
||||
single most decision-relevant number for the `ncu-byte-gate` agent to confirm: the dtype/bytes of
|
||||
vLLM's GDN state cache vs llama's f32, plus llama's measured achieved-BW % on the recurrence
|
||||
kernel.** If vLLM is bf16-state -> build (2). If vLLM is also f32-state and at ~85% -> llama is
|
||||
at the floor, only (1) + coalescing remain and bit-exact parity tops out ~95%.
|
||||
|
||||
## The fused single-pass kernel design
|
||||
|
||||
Two deliverables, layered. Build (1) first (bit-exact, de-risks the graph), gate (2) on the byte
|
||||
verdict.
|
||||
|
||||
### (1) `ggml_gated_delta_net_decode_fused` - fold the activation ops into the kernel
|
||||
|
||||
Folds the pre-recurrence activation ops and the post-recurrence gated RMSNorm into the existing
|
||||
single-pass recurrence kernel, so q/k/v/g are produced and consumed in registers/shared and never
|
||||
make a separate DRAM round-trip, and the per-op launches collapse to one.
|
||||
|
||||
Current decode op chain in `build_layer_attn_linear` (qwen35.cpp 386-461), per GDN layer:
|
||||
|
||||
```
wqkv GEMM -> qkv_mixed                                            (keep: GEMM, separate)
wqkv_gate GEMM -> z                                               (keep: GEMM, separate)
ssm_beta GEMM -> beta -> sigmoid                                  [FOLD beta sigmoid]
ssm_alpha GEMM -> alpha -> +ssm_dt -> softplus -> *ssm_a (gate)   [FOLD softplus/mul -> per-head g]
build_conv_state: reshape, transpose qkv, CONCAT, cpy             [concat/cpy -> conv-state plumbing, see note]
ggml_ssm_conv(conv_input, conv_kernel)                            [FOLD depthwise conv, K=4]
ggml_silu(conv_output)                                            [FOLD silu]
views q_conv/k_conv/v_conv
ggml_l2_norm(q_conv); ggml_l2_norm(k_conv)                        [FOLD 2x l2norm]
[repeat_4d skipped on fused path]
ggml_gated_delta_net_inplace_ids(...)                             <-- THE recurrence kernel (196 ms)
build_norm_gated(output, ssm_norm, z): RMSNorm + silu(z) + mul    [FOLD post gated-RMSNorm]
ssm_out GEMM                                                      (keep: GEMM, separate)
```
|
||||
|
||||
Fold list (what moves INTO the kernel):
|
||||
- `beta` sigmoid: scalar per (head,seq); apply in-kernel when reading beta.
|
||||
- `gate` g = softplus(alpha+dt)*a (GDA, g->ne0==1): scalar per (head,seq); compute/exp in-kernel.
|
||||
The kernel already does `expf(*g_t)` (non-KDA path, line 85) - so feed RAW `alpha`+`dt` and the
|
||||
`a` scale and do softplus+mul+exp in-kernel; removes the `add`/`softplus`/`mul` launches.
|
||||
- `ssm_conv` (depthwise causal conv1d, kernel width 4) + `silu`: per channel a length-4 dot of the
|
||||
conv state with `ssm_conv1d` then silu. This is the prologue: each warp/thread, before loading
|
||||
state, computes its q/k/v channel by reading 3 cached conv-state taps + the current qkv_mixed
|
||||
token, dotting the 4-wide kernel, applying silu. The conv state (conv_kernel-1=3 taps x conv_dim)
|
||||
is tiny and already cached; fold its read here and its 1-token shift write into the epilogue
|
||||
(replaces the `concat`+`cpy` conv-state update).
|
||||
- `l2_norm` of q and k: a warp reduction over S_v of the per-head q/k vector. The recurrence kernel
|
||||
already does warp reductions over S_v (the kv/attn dot products) - the l2norm reuses the same
|
||||
warp-reduce primitive on q_reg/k_reg right after they are loaded, before the recurrence math.
|
||||
- Post: `build_norm_gated` = RMSNorm(output, ssm_norm) * silu(z). The kernel already holds the
|
||||
attn output `attn_col` per (head,seq,col) in registers at the end; fold an S_v warp-reduce RMS,
|
||||
multiply by `ssm_norm` weight and by `silu(z)` (z read once), and write the final gated output -
|
||||
removing the `rms_norm`+`silu`+`mul` launches and one activation round-trip.
|
||||
|
||||
State traffic UNCHANGED (still one read + one write). Activation traffic for conv/silu/l2norm/norm
|
||||
collapses into the kernel's register/shared path; ~6 separate launches become 0. Expected recovery:
|
||||
the ~6-8% surrounding-op buckets + launch overhead. **Bit-exact** if the numeric ordering is held
|
||||
(see Numeric notes). Conservative ceiling ~365-375 tok/s dense (~93-96% of vLLM 391).
|
||||
|
||||
Data flow (per (h_idx=head, sequence=seq) block, decode n_tokens=1, S_v=128, num_warps=4):
|
||||
1. PDL sync.
|
||||
2. Prologue (per channel/lane): read 3 conv-state taps + current `qkv_mixed[t]` for this channel,
|
||||
dot with `ssm_conv1d[0..3]`, add conv bias if any, `silu`. Produces this lane's q/k/v element.
|
||||
3. l2norm q,k: warp-reduce sum(q^2), sum(k^2) over the S_v dim; scale q_reg,k_reg by rsqrt(.+eps).
|
||||
4. Load `s0` column into `s_shard` (UNCHANGED single read).
|
||||
5. Recurrence (UNCHANGED math: g-decay, kv = S^T k, delta = (v - g*kv)*beta, S = g*S + k(x)delta,
|
||||
attn = S^T q * scale).
|
||||
6. Write `s_shard` back to cache slot ONCE (UNCHANGED single write). Write the 1-token-shifted conv
|
||||
state back to the conv cache (replaces concat+cpy).
|
||||
7. Epilogue gated-RMSNorm: warp-reduce sum(attn^2) over S_v -> RMS; multiply by `ssm_norm[col]` and
|
||||
by `silu(z[col])` (z loaded once); write final output element. ssm_out GEMM stays separate.
|
||||
|
||||
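Step 5 of the data flow (the unchanged recurrence math) can be sketched as a scalar reference for one head and one token. This is a hedged illustration, not the CUDA kernel: the `S[i][j]` row-major layout, the function name, and the use of Python doubles instead of f32 are assumptions made for readability; only the operation order (kv, delta, state update, then `S^T q`) is taken from the text.

```python
def gdn_decode_step(S, q, k, v, g, beta, scale):
    """One single-token gated delta-net update for one head (reference sketch).

    S[i][j]: i indexes the key dim, j the value dim (layout is an assumption).
    g is the already-exponentiated decay, beta is post-sigmoid.
    Returns (new_S, attn).
    """
    n_k, n_v = len(S), len(S[0])
    # kv = S^T k, computed on the un-decayed state
    kv = [sum(S[i][j] * k[i] for i in range(n_k)) for j in range(n_v)]
    # delta = (v - g*kv) * beta
    delta = [(v[j] - g * kv[j]) * beta for j in range(n_v)]
    # S = g*S + k (x) delta   (outer product)
    new_S = [[g * S[i][j] + k[i] * delta[j] for j in range(n_v)]
             for i in range(n_k)]
    # attn = S^T q * scale, read from the updated state
    attn = [scale * sum(new_S[i][j] * q[i] for i in range(n_k))
            for j in range(n_v)]
    return new_S, attn


# Two steps with the same unit key: with beta=1 the second value overwrites the first.
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, out1 = gdn_decode_step(S0, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 4.0],
                           g=0.5, beta=1.0, scale=1.0)
S2, out2 = gdn_decode_step(S1, q=[1.0, 0.0], k=[1.0, 0.0], v=[1.0, 3.0],
                           g=0.5, beta=1.0, scale=1.0)
print(out1, out2)
```

The state is read once (to form `kv`) and written once (`new_S`), which is the single-pass property the fold in (1) has to preserve.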
Inputs added to the op: `ssm_conv1d` weight, `ssm_norm` weight, `z`, conv-state cache view, raw
|
||||
`alpha`/`dt`/`a`, eps. This is a wider op signature (src[8..]) - acceptable; gate it behind a new
|
||||
`cparams.fused_gdn_decode` resolved exactly like `auto_fgdn` (graph_reserve + device-match probe,
|
||||
llama-context.cpp 518-595) so it silently falls back to the current op chain if any device lacks it.
|
||||
|
||||
### (2) bf16 recurrent-state cache - the dominant-term lever (NON-bit-exact)
|
||||
|
||||
Only build if `ncu-byte-gate` shows vLLM moves fewer state bytes (bf16) OR llama's f32 recurrence is
|
||||
already >=85% of peak (then f32 is at the floor and bf16 is the only way down).
|
||||
|
||||
- Store `ssm_states_all` (the recurrent-state cache) as bf16. Halves the dominant 805 MB/call -> at
|
||||
the same ~197 GB/s -> ~2.04 ms/call -> ~98 ms/step saved (196 -> ~98). Dense projected
|
||||
335 -> ~440+ tok/s (>= vLLM 391) if BW-bound holds; smaller dtype usually achieves a HIGHER % of
|
||||
peak, so likely better.
|
||||
- Kernel change: read state -> convert bf16->f32 into `s_shard` (registers stay f32); all recurrence
|
||||
arithmetic in f32 (UNCHANGED); on write, convert f32->bf16. Accumulation precision is preserved
|
||||
within a step; only the PERSISTED state is rounded to bf16 each step.
|
||||
- Numerics: the recurrent state decays geometrically (g<1), so per-step bf16 rounding does not
|
||||
accumulate unboundedly, but it is NOT bit-exact. Validate KL < 1e-3 vs the f32-state build over a
|
||||
256-token greedy run; if KL fails, fall back to f32 state (keep it a cparams toggle). This is the
|
||||
ONLY path to bit-near parity-or-better on the dominant term; bit-EXACT parity on the 196 ms is
|
||||
unreachable because the f32 state bytes are irreducible (single pass already).
|
||||
|
||||
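The KL gate named above can be sketched as follows. This is an assumption about how the check might be scripted, not the project's harness: the function names and the per-token logits inputs are hypothetical, and a real run would compare the f32-state and bf16-state builds' logits at every position of the 256-token greedy run.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) for one token position, in nats."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def passes_kl_gate(ref_run, test_run, threshold=1e-3):
    """Mean per-token KL over a run must stay below the threshold."""
    kls = [kl_divergence(r, t) for r, t in zip(ref_run, test_run)]
    return sum(kls) / len(kls) < threshold


# Toy example: a small perturbation of one logit stays well under the gate.
ref = [[1.0, 2.0, 3.0]]
test = [[1.0, 2.0, 3.01]]
print(kl_divergence(ref[0], test[0]), passes_kl_gate(ref, test))
```

Whether the gate should use the mean or the maximum per-token KL is a design choice the text does not settle; the mean is used here only for illustration.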
## Numeric / bit-exactness notes (for fold (1))
|
||||
- l2norm/RMS use f32 warp-reduce accumulation (matches `ggml_l2_norm`/`ggml_rms_norm` f32 sum).
|
||||
Order of summation across lanes differs from the standalone op's sequential sum -> floating
|
||||
reassociation. To stay bit-exact, replicate the standalone op's reduction order, OR accept a
|
||||
tiny reassociation delta and gate on KL<1e-3 (the workflow's near-bit-exact target). Recommend:
|
||||
ship fold (1) behind the cparams probe and assert greedy md5 match vs the current chain (0019
|
||||
already established the harness: dense text md5, MoE byte-identical).
|
||||
- Recurrence math, scale, g-exp order, beta apply: keep EXACTLY as in `gated_delta_net_cuda` /
|
||||
`ops.cpp` reference (lines 84-141 .cu, 10685-10730 ops.cpp). Do not reorder the
|
||||
v - g*kv -> *beta -> S update -> S^T q sequence.
|
||||
- conv: depthwise dot of width-4 kernel in f32, then silu - identical to `ggml_ssm_conv`+`ggml_silu`
|
||||
if done in the same order.
|
||||
- gate softplus: `softplus(x)=log1p(exp(x))`; match ggml's `ggml_softplus` (has the >20 fast path)
|
||||
to stay bit-exact.
|
||||
|
||||
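The gate path in the last note can be sketched as below. This is a double-precision illustration under stated assumptions: the threshold of 20 is taken from the note above, but the exact comparison and the f32 rounding of `ggml_softplus` are not reproduced here, and the negative sign of `a` in the example is an assumption chosen so the decay lands below 1.

```python
import math


def softplus(x, threshold=20.0):
    """softplus(x) = log1p(exp(x)), with a pass-through fast path for large x."""
    if x > threshold:
        return x
    return math.log1p(math.exp(x))


def gate_decay(alpha, dt, a):
    """g = softplus(alpha + dt) * a, then the kernel-side exp(g)."""
    g = softplus(alpha + dt) * a
    return math.exp(g)


# Example values are made up; `a` is assumed negative so the decay is in (0, 1).
print(softplus(0.0), softplus(25.0), gate_decay(alpha=0.3, dt=0.1, a=-1.0))
```

An in-kernel version would have to take the same branch at the same inputs in f32 to keep the md5 gate green.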
## Implementation scope
|
||||
- (1) `.cu`: extend `gated_delta_net_cuda` with a decode-fused template specialization (or a new
|
||||
kernel) that does conv+silu prologue, q/k l2norm, recurrence, conv-state shift write, gated-RMSNorm
|
||||
epilogue. Add `ggml_cuda_op` dispatch. CPU mirror in `ops.cpp` for parity/CI.
|
||||
- (1) `ggml.h`/`ggml.c`: new builder `ggml_gated_delta_net_decode_fused` (extra src: ssm_conv1d,
|
||||
ssm_norm, z, conv-cache view, alpha/dt/a, eps + op_params for eps).
|
||||
- (1) graph edits: `delta-net-base.cpp build_recurrent_attn` (add the decode-fused branch alongside
|
||||
the existing fused/ids branch); `qwen35.cpp` + `qwen35moe.cpp` `build_layer_attn_linear` (route
|
||||
the pre/post ops into the op when `cparams.fused_gdn_decode`); leave `qwen3next.cpp`,
|
||||
`kimi-linear.cpp`, the non-fused and rollback (n_rs_seq>0) paths unchanged.
|
||||
- (1) `llama-context.cpp`: `auto_fgdn`-style device-match probe to enable/disable the decode-fused
|
||||
op (silent fallback). `cparams.h`/`cparams.fused_gdn_decode`.
|
||||
- (2) bf16 state: cache dtype change in the recurrent-memory allocation + the kernel load/store
|
||||
convert + a `cparams` toggle + KL gate. Touches `gated_delta_net.cu` load/store, the inplace/ids
|
||||
builders' state asserts, and the recurrent cache type.
|
||||
|
||||
## Risk register
|
||||
- (1) is MEDIUM effort, bit-exact-targetable, but bounded upside (~6-8% + launches; ceiling ~95% of
|
||||
vLLM). Worth it only if the workflow wants >90% and accepts no bf16.
|
||||
- (2) is the only large lever on the dominant 196 ms but is NON-bit-exact (KL-gated). If vLLM is
|
||||
f32-state, (2) takes llama BELOW vLLM's precision, not toward parity - a product call, not a perf
|
||||
call.
|
||||
- The widened op signature (many srcs) raises maintenance cost and the device-match probe matters
|
||||
(CPU offload of a GDN layer must fall back cleanly).
|
||||
- Do NOT expect a fused recurrence to cut the 196 ms: it is already one read + one write of f32
|
||||
state. Re-confirm with the `ncu-byte-gate` achieved-BW number before committing HIGH effort.
|
||||
|
||||
---
|
||||
|
||||
# MEASUREMENT + VERDICT (label ncu-byte-gate, THE GPU agent) - GATE SETTLED
|
||||
|
||||
The design above predicted the answer; this is the decisive measurement that confirms it.
|
||||
|
||||
## VERDICT: NO-BUILD the fused single-pass recurrence. BUILD bf16 SSM state (design's lever (2)).
|
||||
|
||||
Deciding number: **llama re-stream factor = ~1.0x** (mathematically capped at <=1.33x; >=1.5x is
|
||||
physically impossible). llama's recurrence kernel is ALREADY single-pass, coalesced, and at
|
||||
**74% of GB10 peak BW** - MORE bandwidth-efficient than vLLM's fused triton kernel (41% of peak).
|
||||
The whole 2x DRAM gap vs vLLM is **f32 (llama) vs bf16 (vLLM) state-cache width**, not re-streaming.
|
||||
|
||||
## ncu HW counters were BLOCKED; timing + geometry gave the byte ratio anyway
|
||||
- `ncu dram__bytes` and `nsys --gpu-metrics-devices` both return `ERR_NVGPUCTRPERM`
|
||||
(`NVreg_RestrictProfilingToAdminUsers` restricted, root-only; no passwordless sudo on dgx.casa).
|
||||
DRAM byte counters are unobtainable on this box.
|
||||
- Decisive fallback (no perf counters): CUPTI kernel TIMING (allowed) + EXACT byte geometry from
|
||||
the kernel source. bytes_moved <= peak_BW x duration gives a HARD CAP on the re-stream factor;
|
||||
comparing implied effective BW between llama and vLLM (same model, same B, both eager) settles it.
|
||||
|
||||
## Measured (clean nsys CUDA timing; build-cuda-base df1cc97 Lever-1; both B=128, both without CUDA graphs: llama graphs disabled, vLLM eager)
|
||||
llama: `llama-batched-bench -npp 8 -ntg 12 -npl 128 -ub 2048`, GGML_CUDA_DISABLE_GRAPHS=1.
|
||||
vLLM: postssm_decomp/vllm_decode.sqlite, NSEQ=128, enforce_eager=True (apples-to-apples).
|
||||
|
||||
| kernel | state dtype | bytes R+W/call | duration/call (steady) | eff. BW | % of 273 peak | re-stream |
|---|---|---|---|---|---|---|
| llama gated_delta_net_cuda | f32 | 805.3 MB | **3.98 ms** (min 3.90 max 4.33, grid 48x128x32) | 202 GB/s | **74%** | ~1.0x |
| vLLM fused_recurrent...packed_decode | bf16 | 402.6 MB | **3.62 ms** (min 3.53 max 3.96, grid 4x6144x1) | 111 GB/s | **41%** | ~1.0x |
|
||||
|
||||
- llama recurrence/step = 3.98 x 48 = **191 ms** (50% of 384 ms step; matches STATE 196 ms).
|
||||
- vLLM recurrence/step = 3.62 x 48 = **174 ms**. Per-call gap llama-vs-vLLM is only +10%, NOT 2.8x.
|
||||
The old "1.47 ms near-vLLM" was prefill-contaminated; clean decode is 3.98 ms (confirms STATE).
|
||||
- Both kernels verified SINGLE-PASS in source (llama: s_shard load-once/store-once, 128 consecutive
|
||||
f32/warp = coalesced; vLLM packed_decode: `b_h += load(p_h0).to(f32)` once, `store(p_ht, b_h.to(bf16))`
|
||||
once). vLLM cache dtype = state_dtype = model_dtype = bf16 (`_mamba_state_dtype` default "auto" ->
|
||||
model dtype; config.json dtype=bfloat16). Geometry identical (H=48, k/v head_dim 128, S_v 128).
|
||||
|
||||
## Why re-stream ~1.0x (the gate number)
|
||||
Most bytes a 3.98 ms call could move at 273 GB/s peak = 1.087 GB = **1.33x the 816 MB minimal**.
|
||||
A 1.5x/2x re-stream would need >peak BW -> impossible. Source proves single-pass+coalesced -> the 1.0x end of that range:
|
||||
~816 MB at 202 GB/s = 74% peak. A fused single-pass rewrite recovers ~0 state bytes => NO-BUILD.
|
||||
|
||||
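The hard cap is plain arithmetic: bytes moved cannot exceed peak bandwidth times kernel duration. A minimal sketch using the numbers quoted above (the function name is illustrative):

```python
def restream_cap(duration_ms, minimal_mb, peak_gbps=273.0):
    """Upper bound on (bytes actually moved) / (minimal bytes) for one call.

    GB/s * ms = MB, so peak_gbps * duration_ms is the most MB the call can move.
    """
    max_mb = peak_gbps * duration_ms
    return max_mb / minimal_mb


cap = restream_cap(duration_ms=3.98, minimal_mb=816.0)
print(f"max movable: {273.0 * 3.98:.0f} MB, re-stream cap: {cap:.2f}x")
```

Any hypothesised re-stream factor above this cap would imply more than peak DRAM bandwidth, which is why the timing alone rules out 1.5x and 2x without hardware counters.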
## The lever: bf16 SSM state (design (2)) - confirmed, large, parity-to-ahead
|
||||
2x recurrence bytes vs vLLM = 100% f32-vs-bf16 cache. llama's kernel is the more efficient one
|
||||
(74% vs 41% peak), so bf16 state (cache + load/store bf16, f32 register compute, exactly as vLLM):
|
||||
- 805.3 -> ~413 MB => at 74% peak ~2.0 ms/call => 191 -> ~96 ms/step, save ~95 ms => step ~289 ms
|
||||
(~443 tok/s, AHEAD of vLLM's 327 ms / 391 tok/s). Conservative (50% peak on smaller footprint): ~3.0 ms/call =>
|
||||
save ~45 ms => step ~339 ms = vLLM parity. Range = parity-to-ahead.
|
||||
- NON-bit-exact vs llama's f32 reference, but EQUAL precision to vLLM (which is bf16). Gate on
|
||||
PPL/KL vs the f32 build, not md5. "Bit-exact parity with vLLM" was never on the table - vLLM is bf16.
|
||||
|
||||
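The projection above follows from the per-call time, the 48 calls per step, and batch 128. A rough sketch, assuming decode throughput is simply batch divided by step time (which matches the 327 ms / 391 tok/s vLLM figure); the function name is illustrative:

```python
def project_step(new_ms_per_call, old_ms_per_call=3.98, calls_per_step=48,
                 step_ms=384.0, batch=128):
    """Project step time and decode tok/s after a per-call speedup."""
    saved_ms = (old_ms_per_call - new_ms_per_call) * calls_per_step
    new_step_ms = step_ms - saved_ms
    tok_per_s = batch / (new_step_ms / 1e3)
    return new_step_ms, tok_per_s


for label, ms in (("optimistic", 2.0), ("conservative", 3.0)):
    step, tps = project_step(ms)
    print(f"{label}: {ms} ms/call -> step {step:.0f} ms, {tps:.0f} tok/s")
```

The conservative end lands a few ms away from the rounded ~339 ms quoted above, since the text rounds the saving to ~45 ms.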
## Conv-path (no-regret conv-fusion lever sizing), llama steady decode, per call x48
|
||||
concat_cont 169.6 us (8.14 ms/step) + cpy_scalar 120.1 us (5.76) + ssm_conv_f32 115.9 us (5.56)
|
||||
= ~19.5 ms/step (~5%). Conv STATE ~12.6 MB (tiny) -> this is LAUNCH/small-kernel overhead, not bytes
|
||||
-> a FUSION lever (design (1)), secondary to bf16 state. l2_norm 6.8 us, gdn_gather 1.21 us (no-op,
|
||||
identity seqs -> confirms gather does NOT re-stream state at steady decode).
|
||||
|
||||
## One-line answer
|
||||
llama: 805 MB/call, 74% peak, re-stream ~1.0x (<=1.33x). vLLM: 402 MB/call (bf16), 41% peak.
|
||||
conv-path: ~12.6 MB (launch-bound ~19.5 ms/step, not byte-bound).
|
||||
=> NO-BUILD fused recurrence (already single-pass, more efficient than vLLM); BUILD bf16 state
|
||||
(halves the dominant 805 MB, ~45-95 ms/step, parity-to-ahead). Deciding number: re-stream ~1.0x.
|
||||
|
||||
---
|
||||
|
||||
# FINAL DECISION (synthesis of all four agents) - the five points
|
||||
|
||||
This closes the workflow. Inputs: `ncu-byte-gate` (measured byte ratio), `vllm-fused-recurrence-study`
|
||||
(vLLM's single-pass boundary), `llama-fused-recurrence-design` (the fold/levers), `conv-fusion-design`
|
||||
(the no-regret conv in-place lever). They agree on every number; the decision is unambiguous.
|
||||
|
||||
## (1) Byte-ratio verdict - the decisive number
|
||||
|
||||
**llama is at the hardware bandwidth floor, NOT re-streaming.** Re-stream factor = **~1.0x**, hard
|
||||
capped at **<=1.33x** (the most bytes a 3.98 ms call can move at 273 GB/s peak is 1.087 GB = 1.33x
|
||||
the 816 MB minimal; >=1.5x is physically impossible). The recurrence kernel runs at **74% of GB10
|
||||
peak BW** (805.3 MB R+W / 3.98 ms = 202 GB/s) - MORE bandwidth-efficient than vLLM's fused triton
|
||||
`packed_decode` at **41% of peak** (402.6 MB / 3.62 ms = 111 GB/s). Source confirms both are
|
||||
single-pass and coalesced (llama `s_shard` load-once/store-once, 128 consecutive f32/warp; vLLM
|
||||
`b_h = load(p_h0)` once -> f32 regs -> `store(p_ht, b_h.to(bf16))` once). The entire 2x DRAM gap
|
||||
vs vLLM is **100% f32 (llama) vs bf16 (vLLM) state-cache WIDTH**, not extra passes.
|
||||
|
||||
## (2) Fused single-pass GDN recurrence: **NO-BUILD**
|
||||
|
||||
A fused single-pass rewrite recovers **~0 state bytes** because the kernel is already one read + one
|
||||
write of the f32 state, and the un-fused l2norm/sigmoid/softplus/gate ops act on the tiny
|
||||
q/k/g/beta projections (8 MB/call, <1%), not the 805 MB state. There is no second pass to fuse away.
|
||||
Expected ceiling if built anyway: unchanged 191 ms recurrence -> no movement on the dominant 50% of
|
||||
the step. **Do not build it.** This refutes the workflow's founding hypothesis with a measured cap.
|
||||
|
||||
## (3) Conv-state in-place fusion (`conv-fusion-design`): **GO - confirmed, bit-exact, no-regret**
|
||||
|
||||
This is independent of the recurrence verdict and holds regardless. Build a fused
|
||||
`ggml_ssm_conv_update_inplace` (mirrors the 0018/0019 in-place pattern) that, at decode
|
||||
(`n_seq_tokens==1 && !keep && fused-AR && n_rs_seq==0`), assembles the width-4 conv window in
|
||||
registers from the cached K-1=3 taps + the native `qkv_mixed` token, computes the depthwise conv,
|
||||
folds `silu`, and writes the 1-token-shifted ring state back in place.
|
||||
- Eliminates `concat_cont` (8.14 ms/step), `cpy_scalar` (5.76 ms/step), the transpose
|
||||
materialization, and the separate `ggml_silu`; replaces `ssm_conv` with a ~1.6x-byte fused kernel
|
||||
(5.56 -> ~9 ms). **Net ~12-14 ms/step = +3.1 to +3.7%** -> dense 335 -> ~346-349 tok/s @npl128
|
||||
(88.5-89.3% of vLLM 391).
|
||||
- **Bit-exact**: identical ascending-j width-4 FMA order as `ssm_conv_f32` at i==0, same `silu`
|
||||
primitive, same f32 state bytes written - only the producing node changes. Greedy output is
|
||||
bit-identical to the 0018/0019 baseline. LOW risk, additive to everything else.
|
||||
|
||||
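The per-channel work of the fused conv update can be sketched as below. This is a scalar reference under assumptions, not the proposed `ggml_ssm_conv_update_inplace` kernel: the function name, the argument layout, and the use of Python doubles are illustrative; what is taken from the text is the K-1=3 cached taps plus the current token, the ascending-j width-4 accumulation, the `silu`, and the 1-token shift of the state.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def conv_update(taps, x, weights):
    """One decode-step update for a single conv channel.

    taps:    the K-1 cached conv-state values, oldest first
    x:       the current token's value for this channel
    weights: the K depthwise conv weights for this channel
    Returns (silu(conv output), shifted taps for the next step).
    """
    window = list(taps) + [x]          # width-K window, assembled in registers
    acc = 0.0
    for j in range(len(weights)):      # ascending j, fixed accumulation order
        acc += weights[j] * window[j]
    new_taps = window[1:]              # drop the oldest tap, keep K-1
    return silu(acc), new_taps


out, taps = conv_update([1.0, 2.0, 3.0], 4.0, [0.0, 0.0, 0.0, 1.0])
print(out, taps)
```

Keeping the loop order fixed is what makes the bit-exact claim plausible: the same products are summed in the same sequence as the standalone op, only without the `concat` and `cpy` round-trips.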
## (4) Recurrence floor-mover: bf16 SSM state - **BUILD (gated product call)**, and the bit-exact question
|
||||
|
||||
Since the recurrence is at the f32 byte floor, the **only** lever on the dominant 191 ms (50% of the
|
||||
step) is narrowing the state-cache width to bf16, exactly as vLLM does.
|
||||
- Store `ssm_states_all` in bf16; load bf16->f32 into `s_shard`, run ALL recurrence arithmetic in
|
||||
f32 (UNCHANGED), store f32->bf16. 805.3 -> ~413 MB/call -> ~2.0-3.0 ms/call -> save **~45-95 ms/
|
||||
step** -> step 384 -> **289-339 ms** = parity-to-ahead of vLLM (327 ms / 391 tok/s; projected
|
||||
360-443 tok/s @npl128).
|
||||
- **Bit-exact parity is UNREACHABLE on this term, by construction.** The f32 state bytes are
|
||||
irreducible (single pass already), so matching vLLM's *speed* on the recurrence requires matching
|
||||
vLLM's *width* (bf16). bf16 state is non-bit-exact vs llama's own f32 reference, but it is **equal
|
||||
precision to vLLM** (vLLM's state cache is itself bf16). "Bit-exact parity with vLLM" was never on
|
||||
the table - vLLM is the less-precise reference here. Gate the build on **KL < 1e-3 / PPL-delta**
|
||||
over a 256-token greedy run, not on md5, with a `cparams` f32 fallback. The geometric state decay
|
||||
(g<1) bounds per-step bf16 rounding, so accumulation is well-behaved.
|
||||
- Bit-exact gains that ARE reachable (vs llama f32): the conv fusion (3) and the activation-fold
|
||||
lever (1) - together ~9-11% - but they top out near ~93-96% of vLLM and never touch the 50%
|
||||
recurrence term.
|
||||
|
||||
## (5) Ranked build order + the single highest-value next step
|
||||
|
||||
1. **Conv-state in-place fusion (BUILD NEXT - no-regret).** Bit-exact, LOW risk, +12-14 ms (~+3%),
|
||||
reuses the proven 0018/0019 in-place op pattern. Build this first because it is risk-free, purely
|
||||
additive, and de-risks the in-place conv-cache plumbing the bf16 work also touches.
|
||||
Confirming measurement: nsys decode trace shows `concat_cont` and `cpy_scalar` GONE, step
|
||||
384 -> ~370-372 ms, and greedy md5 IDENTICAL to the 0019 baseline (dense text md5, MoE
|
||||
byte-identical).
|
||||
2. **bf16 SSM state cache (HIGHEST-VALUE lever - gated product call).** The ONLY lever on the
|
||||
dominant 50% recurrence term: +45-95 ms/step, step -> 289-339 ms = parity-to-ahead of vLLM.
|
||||
Non-bit-exact vs llama f32, equal precision to vLLM. Confirming measurement: `gated_delta_net_cuda`
|
||||
duration/call drops 3.98 -> 2.0-3.0 ms in nsys; **KL < 1e-3 / PPL-delta vs the f32 build over
|
||||
256-token greedy** passes; step time and tok/s hit the 289-339 ms / 360-443 tok/s band; cparams
|
||||
f32 fallback verified.
|
||||
3. **Activation-op fold, design lever (1) (OPTIONAL, bit-exact, diminishing).** After build step #1 (the conv fusion) takes the
|
||||
conv/silu buckets, the residual fold (q/k l2norm + gate softplus/sigmoid + gated-RMSNorm epilogue
|
||||
+ launch overhead) is ~3-5%; bit-exact but bounded. Build only if the goal is >90% of vLLM with
|
||||
no bf16. Confirming measurement: per-op launch count for the GDN layer collapses to ~1; greedy
|
||||
md5 unchanged.
|
||||
|
||||
**Single highest-value next implementation step: bf16 SSM state cache (#2)** - it is the only change
|
||||
that moves the dominant 191 ms term and reaches vLLM parity-to-ahead. Its confirming measurement is
|
||||
the `gated_delta_net_cuda` per-call time dropping to ~2.0-3.0 ms AND the KL<1e-3 gate passing.
|
||||
**Recommended immediate build: the conv fusion (#1) first** (no-regret, bit-exact) so the bf16 work
|
||||
lands on an already-cleaned conv path; ship #2 as a `cparams`-gated, KL-validated product option.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,34 +0,0 @@
|
||||
# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
|
||||
|
||||
STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.
|
||||
|
||||
## What
|
||||
Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
|
||||
update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
|
||||
0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
|
||||
(read path gather -> indexed in-kernel read; values + reduction order unchanged).
|
||||
|
||||
## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
|
||||
- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
|
||||
MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
|
||||
- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
|
||||
GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
|
||||
|
||||
## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
|
||||
- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
|
||||
- MoE npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
|
||||
- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
|
||||
step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
|
||||
|
||||
## Artifacts
|
||||
- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
|
||||
- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
|
||||
- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
|
||||
|
||||
## gather-bench landed (worktree)
|
||||
|
||||
Rigorous same-session A/B (DGX GB10) validated patch 0028 bit-exact and lifting both models;
|
||||
results folded into LEVER1_GATHER_RESULTS.md and the regenerated 0028 patch. The bench files
|
||||
first landed in this worktree via concurrent merge c1f1d1e8e (origin/master sweep); this commit
|
||||
re-anchors them with sign-off attribution. DGX llama tree dedicated commit: fafe878 (code
|
||||
byte-identical to 944636c; docs-only amend). Both trees committed, not pushed.
|
||||
@@ -1,163 +0,0 @@
|
||||
# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact)
|
||||
|
||||
The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest
|
||||
kernel vLLM has no equivalent of (~5.2 ms/step MoE decode; also present in dense): vLLM updates its
|
||||
gated-DeltaNet recurrent state in-place inside the fused decode kernel, while llama ran a separate
|
||||
`ggml_get_rows` gather. Patch 0019 fused the SSM recurrent-state gather; patch 0021 fused the conv
|
||||
compute/write-back but KEPT a `build_rs` gather for the conv taps ("tiny; not one of the eliminated
|
||||
buckets"). This patch closes that residual.
|
||||
|
||||
## Which gather was still firing (nsys-located, DGX GB10 sm_121)
|
||||
|
||||
Profiled MoE `q36-35b-a3b-nvfp4` at batch-128 decode (`llama-batched-bench -npp128 -ntg24 -npl128
|
||||
-fa on`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`). The decode-window `k_get_rows_float<float,float>`
|
||||
distribution was bimodal: a BIG cluster of **~720 instances (= 30 GDN layers x 24 decode steps) at
|
||||
~115 us each** plus small embedding/router gathers.
|
||||
|
||||
The big gather's geometry (`grid=(ne10=128, block_num_y=96, 1)`) decodes to **128 rows (= n_seqs
|
||||
active sequences) of ne00 = 24576 floats**. With the model's real dims (`d_conv=4, d_inner=4096,
|
||||
n_group=16, d_state=128`):
|
||||
- `n_embd_r = (d_conv-1) * (d_inner + 2*n_group*d_state) = 3 * 8192 = 24576` -> `block_num_y=96` EXACT match.
|
||||
- `n_embd_s = d_state * d_inner = 524288` (the SSM state, gridY 2048 - already fused by 0019).
|
||||
|
||||
So the residual `k_get_rows` is the **conv-state tap gather** in `build_conv_state_fused`
|
||||
(`src/models/delta-net-base.cpp`), which called the plain 4-arg `build_rs` -> `ggml_get_rows` of the
|
||||
24576-float conv-state row x 128 sequences, once per GDN layer per decode step (~3.4 ms/step here,
|
||||
~5.2 ms/step at steady ntg=128). The SSM-state gather is already fused, so this conv gather is the
|
||||
last `k_get_rows` in the GDN decode path.
|
||||
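The geometry match above can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: the struct and function names are hypothetical, and the 256-elements-per-block figure is an assumption inferred from the observed grids (24576 / 96 and 524288 / 2048), not read from the kernel source.

```cpp
#include <cassert>
#include <cstdint>

// Model dimensions as quoted in the text (struct name is hypothetical).
struct GdnDims {
    int64_t d_conv;
    int64_t d_inner;
    int64_t n_group;
    int64_t d_state;
};

// Conv-state row: (d_conv - 1) taps over d_inner + 2 * n_group * d_state channels.
int64_t n_embd_r(const GdnDims & d) {
    return (d.d_conv - 1) * (d.d_inner + 2 * d.n_group * d.d_state);
}

// SSM-state row: d_state * d_inner floats.
int64_t n_embd_s(const GdnDims & d) {
    return d.d_state * d.d_inner;
}

// ASSUMPTION: each get_rows block covers 256 elements of a row (inferred from
// the profiled grid sizes, not taken from the kernel).
int64_t blocks_per_row(int64_t row_elems, int64_t elems_per_block = 256) {
    assert(elems_per_block > 0);
    return (row_elems + elems_per_block - 1) / elems_per_block;
}
```

With `d_conv=4, d_inner=4096, n_group=16, d_state=128` this gives 24576 and 524288 floats per row, and 96 and 2048 blocks respectively, which is how the `block_num_y=96` gather is attributed to the conv state rather than the SSM state.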
|
||||
## What changed (mirror of the 0019 SSM gather fusion; bit-exact by construction)
|
||||
|
||||
New op `ggml_ssm_conv_update_inplace_ids` (reuses `GGML_OP_SSM_CONV`, discriminated by a non-null
|
||||
`src[4]` = ids). Instead of a pre-gathered tap scratch, it takes the FULL conv-state cache (`src[0]`)
|
||||
plus the per-sequence `ids` (= the recurrent-state `s_copy`, `src[4]`; `op_params[1]=rs_head`) and
|
||||
reads each active sequence's prior K-1 taps directly from `cache[ids[s]]` in the kernel. This removes
|
||||
the separate `k_get_rows` launch.
|
||||
|
||||
Race-free, exactly mirroring 0019:
|
||||
- **Identity** sequences (`ids[s] == rs_head + s`, the whole AR-decode path) read the taps in place
|
||||
from the `conv_state_dst` write slot. The kernel loads the full conv window into registers before
|
||||
it writes the 1-token-shifted ring back, so read==write slot is race-free per (channel, seq) thread.
|
||||
- **Non-identity** sequences (reorder / `rs_zero` remap at a prefill->decode boundary) are gathered
|
||||
into a disjoint scratch by a small `ssm_conv_gather_nonident_kernel` first (no-op at steady decode),
|
||||
so the update kernel never reads a slot another block writes.
|
||||
|
||||
The read VALUES are unchanged (identity in-place taps == the gathered taps == `cache[ids[s]]`); only
|
||||
the read PATH changes from a `ggml_get_rows` materialization to an indexed in-kernel read. The conv
|
||||
math, ascending-tap FMA order, silu and the ring write-back are byte-identical to 0021.
|
||||
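The identity / non-identity split can be modelled on the CPU in a few dozen lines. This is a minimal sketch under stated assumptions, not the patch's kernel: the function names, the `[n_cells][channels][K-1]` flattening and the serial two-phase loop are all hypothetical stand-ins for the CUDA implementation. It shows that staging only the non-identity rows and reading identity rows in place yields the same outputs and the same cache contents as gathering every row first.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU model; names and layouts are illustrative, not the patch's.
//   cache : [n_cells][channels][K-1]  prior taps per recurrent-state slot
//   w     : [channels][K]             conv kernel
//   x     : [n_seqs][channels]        current-token input
//   ids   : [n_seqs]                  source slot for each sequence
// Sequence s writes its shifted window to slot rs_head + s.

// Shared per-row math: load the full window, accumulate, then write back.
static void update_row(const float * src, float * dst, const float * w, const float * x,
                       float * out, int channels, int K, bool fuse_silu) {
    const int taps = K - 1;
    std::vector<float> window(K);
    for (int c = 0; c < channels; ++c) {
        for (int k = 0; k < taps; ++k) window[k] = src[c * taps + k];
        window[taps] = x[c];
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) acc += w[c * K + k] * window[k];  // ascending taps
        out[c] = fuse_silu ? acc / (1.0f + std::exp(-acc)) : acc;
        // src may alias dst; the window is already in locals, so this is safe.
        for (int k = 0; k < taps; ++k) dst[c * taps + k] = window[k + 1];
    }
}

// ids-indexed path: only non-identity rows are staged in a scratch buffer.
std::vector<float> conv_update_ids(std::vector<float> & cache, const std::vector<float> & w,
                                   const std::vector<float> & x, const std::vector<int> & ids,
                                   int rs_head, int channels, int K, bool fuse_silu) {
    assert(K >= 2);
    const size_t n_seqs = ids.size();
    const size_t row    = (size_t) channels * (K - 1);
    std::vector<float> scratch(n_seqs * row);
    for (size_t s = 0; s < n_seqs; ++s) {  // phase 1: stage before any write
        if (ids[s] != rs_head + (int) s) {
            for (size_t i = 0; i < row; ++i) scratch[s * row + i] = cache[ids[s] * row + i];
        }
    }
    std::vector<float> out(n_seqs * channels);
    for (size_t s = 0; s < n_seqs; ++s) {  // phase 2: update and write back
        float * dst = &cache[(rs_head + s) * row];
        const float * src = ids[s] == rs_head + (int) s ? dst : &scratch[s * row];
        update_row(src, dst, w.data(), &x[s * channels], &out[s * channels], channels, K, fuse_silu);
    }
    return out;
}

// Reference path: gather every row first (get_rows style), then the same math.
std::vector<float> conv_update_gathered(std::vector<float> & cache, const std::vector<float> & w,
                                        const std::vector<float> & x, const std::vector<int> & ids,
                                        int rs_head, int channels, int K, bool fuse_silu) {
    assert(K >= 2);
    const size_t n_seqs = ids.size();
    const size_t row    = (size_t) channels * (K - 1);
    std::vector<float> gathered(n_seqs * row);
    for (size_t s = 0; s < n_seqs; ++s) {
        for (size_t i = 0; i < row; ++i) gathered[s * row + i] = cache[ids[s] * row + i];
    }
    std::vector<float> out(n_seqs * channels);
    for (size_t s = 0; s < n_seqs; ++s) {
        update_row(&gathered[s * row], &cache[(rs_head + s) * row], w.data(), &x[s * channels],
                   &out[s * channels], channels, K, fuse_silu);
    }
    return out;
}
```

A swapped pair of ids (sequence 1 reads slot 3 while sequence 3 reads slot 1) is the case that would break a naive in-place loop; the phase-1 staging is what keeps it equal to the gathered reference.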
|
||||
Files:
|
||||
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: `ggml_ssm_conv_update_inplace_ids` builder
|
||||
(src[0]=full cache [K-1,channels,n_cells], src[1]=conv_kernel, src[2]=x_cur, src[3]=conv_state_dst,
|
||||
src[4]=ids; op_params[0]=fuse_silu, op_params[1]=rs_head).
|
||||
- `ggml/src/ggml-cuda/ssm-conv.cu`: `ssm_conv_gather_nonident_kernel` + `ssm_conv_update_ids_f32`
|
||||
kernel + `ggml_cuda_op_ssm_conv_update_ids` + a `src[4]`-discriminated branch in `ggml_cuda_op_ssm_conv`.
|
||||
- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_ids_f32` (window copied to a
|
||||
local before the possibly-aliasing write) + dispatch branch.
|
||||
- `src/models/delta-net-base.cpp`: `build_conv_state_fused` now feeds the FULL cache + ids through the
|
||||
`build_rs` `get_state_rows` lambda (the rs_zero clear + extra-states copy still run around it),
|
||||
exactly like the 0019 recurrent-attn fusion. The `qwen35` / `qwen35moe` / `qwen3next` callers are
|
||||
unchanged (they already route the single-token decode path here).
|
||||
- `tests/test-backend-ops.cpp`: `test_ssm_conv_update_ids` (16 cases) - ids = a shuffled permutation
|
||||
with `rs_head=0`, so each case exercises BOTH the identity in-place read and the non-identity cache
|
||||
read; validates the conv+silu output vs the CPU reference.
|
||||
|
||||
## GATE: test-backend-ops (CUDA0 vs CPU, 2/2 backends)
|
||||
|
||||
- SSM_CONV_UPDATE_IDS: OK (NEW; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
|
||||
- SSM_CONV_UPDATE: OK (0021 path intact)
|
||||
- SSM_CONV: OK
|
||||
- GATED_DELTA_NET: OK
|
||||
- GET_ROWS: OK
|
||||
|
||||
## GATE: greedy bit-exactness (--temp 0 --seed 1 -n 48, -fa on) - BOTH models BYTE-IDENTICAL
|
||||
|
||||
| model             | baseline md5                     | 0028 md5                         | result         |
|-------------------|----------------------------------|----------------------------------|----------------|
| q36-27b-nvfp4     | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | BYTE-IDENTICAL |
| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | BYTE-IDENTICAL |
|
||||
|
||||
(Built on the `paged` branch at 0026, where the hybrid default is f32; the baseline was re-confirmed
on the same build before the edit.)
|
||||
|
||||
## nsys proof - the gather is eliminated (MoE decode, npp128 ntg24 npl128, same window)
|
||||
|
||||
| kernel                              | before                    | after                                    |
|-------------------------------------|---------------------------|------------------------------------------|
| `k_get_rows_float<float,float>` cnt | 10174                     | 9454 (720 fewer = 30 GDN x 24)           |
| `k_get_rows_float<float,float>` sum | 186.3 ms                  | 102.8 ms (-83.5 ms)                      |
| conv update kernel                  | `ssm_conv_update_f32` 720 | `ssm_conv_update_ids_f32` 720            |
| `ssm_conv_gather_nonident_kernel`   | -                         | 720 x ~1.1 us = 0.8 ms (no-op at decode) |
|
||||
|
||||
The 720 big ~115 us conv gathers are gone; the only added work is a ~1.1 us no-op gather kernel per
|
||||
layer-step (all sequences identity during steady AR decode). This matches 0019's "no-op at decode,
|
||||
median ~1.2 us" non-identity gather.
|
||||
|
||||
## Preliminary throughput (post-fusion, single point; rigorous A/B is the bench phase)
|
||||
|
||||
- MoE `q36-35b-a3b-nvfp4` npl128 (`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`): **783.9 t/s**, step
|
||||
163.3 ms/step (MOE_GAP @0025 was 752.3 t/s / 169.8 ms/step => -6.5 ms/step in this stack).
|
||||
- dense `q36-27b-nvfp4` npl128: **377.3 t/s** (~96% of vLLM 391; includes 0022/0026 base gains).
|
||||
- npl128 ran clean (EXIT=0) on both - the non-identity boundary path does not crash.
|
||||
|
||||
## Verdict
|
||||
|
||||
Bit-exact (both md5 gates byte-identical, all test-backend-ops pass), the residual `k_get_rows` conv
|
||||
gather is eliminated (nsys-confirmed), and decode throughput improves. Helps BOTH dense and MoE (the
|
||||
shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path (after 0019 SSM-state
|
||||
+ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
|
||||
|
||||
Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
|
||||
NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
|
||||
|
||||
### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
|
||||
|
||||
| model             | base (0026)                      | lever1 (0028)                    | recorded baseline                |
|-------------------|----------------------------------|----------------------------------|----------------------------------|
| q36-27b-nvfp4     | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
|
||||
|
||||
test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
|
||||
GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
|
||||
|
||||
### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
|
||||
|
||||
dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
|
||||
|
||||
| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 391    |
|-----|-----------|-------------|--------|----------------|
| 32  | 208.56    | 209.39      | +0.40% | -              |
| 128 | 369.95    | 377.83      | +2.13% | 94.6% -> 96.6% |
|
||||
|
||||
MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
|
||||
|
||||
| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 901    |
|-----|-----------|-------------|--------|----------------|
| 32  | 456.85    | 459.56      | +0.59% | -              |
| 128 | 763.47    | 777.95      | +1.90% | 84.7% -> 86.3% |
|
||||
|
||||
Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
|
||||
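The step times quoted above appear consistent with dividing the parallel-sequence count by the aggregate decode rate. A minimal sketch, assuming each decode step emits one token per sequence (the function name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Milliseconds per batched decode step for npl parallel sequences at an
// aggregate decode rate of s_tg tokens/s.
// ASSUMPTION: every step produces exactly one token per sequence.
double step_ms(int npl, double s_tg) {
    assert(s_tg > 0.0);
    return 1000.0 * npl / s_tg;
}
```

For example, `step_ms(128, 763.47)` and `step_ms(128, 777.95)` land on the MoE 167.7 and 164.5 ms figures to within rounding, and their difference reproduces the roughly 3.1 ms/step saving.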
|
||||
### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
|
||||
|
||||
| kernel                            | base (0026)            | lever1 (0028)                                |
|-----------------------------------|------------------------|----------------------------------------------|
| `k_get_rows_float<float,float>`   | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms                       |
| delta                             |                        | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
| `ssm_conv_update(_ids)_f32`       | 219.71 ms (update)     | 225.75 ms (update_ids, +6 ms)                |
| `ssm_conv_gather_nonident_kernel` | -                      | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
|
||||
|
||||
The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
|
||||
into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
|
||||
the -3.13 ms/step throughput delta at npl128.
|
||||
|
||||
### Verdict (gather-bench)
|
||||
|
||||
Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
|
||||
gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
|
||||
throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
|
||||
MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
|
||||
@@ -1,77 +0,0 @@
|
||||
# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
|
||||
|
||||
The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
|
||||
(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
|
||||
bit-exact tensor reshape that re-routes the per-layer SSM output projection
|
||||
from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
|
||||
|
||||
## The mechanism (profiled, both engines)
|
||||
|
||||
Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
|
||||
The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
|
||||
(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
|
||||
to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
|
||||
`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
|
||||
128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
|
||||
the ssm_out weight read across the 128 sequences. vLLM packs the same projection
|
||||
into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
|
||||
only the output projection was in 3D SSM layout.
|
||||
|
||||
## The fix
|
||||
|
||||
In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
|
||||
the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
|
||||
decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
|
||||
MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
|
||||
so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
|
||||
2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
|
||||
proven by the in-projection.
|
||||
|
||||
```
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
  ...
  cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
```
|
||||
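The effect of the reshape on kernel selection can be illustrated with a toy dispatch. The sketch below is not the ggml-cuda dispatch code: the enum and function names are hypothetical, and the only fact taken from the text is that `src1->ne[1] <= 8` selects MMVQ. It shows that the 3D and 2D views hold the same number of elements while landing on different paths.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Shape = std::array<int64_t, 4>;

enum class MatmulPath { MMVQ, MMQ };

// Toy dispatch: the threshold of 8 on ne[1] is the figure quoted in the text.
MatmulPath pick_path(const Shape & src1_ne) {
    return src1_ne[1] <= 8 ? MatmulPath::MMVQ : MatmulPath::MMQ;
}

// 3D SSM layout: [value_dim, n_seq_tokens, n_seqs].
Shape shape_3d(int64_t value_dim, int64_t n_seq_tokens, int64_t n_seqs) {
    return {value_dim, n_seq_tokens, n_seqs, 1};
}

// Collapsed 2D layout: [value_dim, n_seq_tokens * n_seqs].
Shape shape_2d(int64_t value_dim, int64_t n_seq_tokens, int64_t n_seqs) {
    return {value_dim, n_seq_tokens * n_seqs, 1, 1};
}

// Total element count; a reshape must leave this unchanged.
int64_t n_elements(const Shape & ne) {
    return ne[0] * ne[1] * ne[2] * ne[3];
}
```

At decode (`value_dim=6144, n_seq_tokens=1, n_seqs=128`) the 3D view has `ne[1] = 1` and picks the batch-1 path, while the 2D view has `ne[1] = 128` and picks the batched path, with the element count unchanged.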
|
||||
## Gates (all PASS)
|
||||
|
||||
- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
|
||||
post-SSM baseline build:
|
||||
- dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
|
||||
- MoE q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
|
||||
- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
|
||||
- Coherent dense + MoE output (greedy text inspected).
|
||||
|
||||
## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
|
||||
|
||||
S_TG t/s (decode aggregate):
|
||||
|
||||
| model           | npl | baseline | Lever 1 | delta  |
|-----------------|-----|----------|---------|--------|
| dense q36-27b   | 32  | 170.52   | 200.00  | +17.3% |
| dense q36-27b   | 128 | 254.92   | 335.80  | +31.7% |
| MoE q36-35b-a3b | 32  | 373.28   | 420.77  | +12.7% |
| MoE q36-35b-a3b | 128 | 560.66   | 691.24  | +23.3% |
|
||||
|
||||
Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
|
||||
up from 65% post-SSM).
|
||||
|
||||
## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
|
||||
|
||||
The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
|
||||
|
||||
| kernel                               | baseline            | Lever 1              |
|--------------------------------------|---------------------|----------------------|
| `mul_mat_vec_q<NVFP4, m=1>` (o_proj) | 132.8 ms / 48 inst  | 0 ms / 0 inst        |
| `mul_mat_q<NVFP4, m=128>`            | 5463 ms / 8800 inst | 5827 ms / 10000 inst |
|
||||
|
||||
The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
|
||||
(+1200 instances, +363 ms over the window), and its per-call average DROPS
|
||||
(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
|
||||
than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
|
||||
~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
|
||||
old GEMV: the amortized weight read is the win.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,83 +0,0 @@
|
||||
# LEVER 4 - NVFP4 the bf16 MoE GDN/attn projections: KL-GATE FAIL, no-ship
|
||||
|
||||
GPU agent (L4-gatebench), DGX GB10 (sm_121, BLACKWELL_NATIVE_FP4=1). Build at 0028 (HEAD fafe878,
|
||||
branch `paged`). Lever 4 hypothesis (from `MOE_GAP_VS_VLLM.md` + the lever-4 scope): the MoE GGUF's
|
||||
GDN/attn projections (in_proj_qkvz=attn_qkv, in_proj_ba=ssm_alpha/ssm_beta, out_proj=ssm_out,
|
||||
attn_gate, full-attn attn_q/k/v/output) are left in BF16 by nvidia modelopt while the dense
|
||||
q36-27b-nvfp4 (unsloth) already ships them NVFP4. The scope called this a "quant-provenance accident"
|
||||
and proposed re-quantizing them to NVFP4 to recover the ~20.3->13.8ms projection-GEMM bucket.
|
||||
|
||||
**Verdict: KL-GATE FAIL on every axis, for both variants. STOP, do NOT ship. No 0029 GGUF, no
|
||||
gallery entry, no bench, no nsys** (per spec: KL fails first -> report, do not bench/ship). The bf16
|
||||
projections are a **deliberate precision choice, not an accident** - re-quantizing them costs ~6% PPL.
|
||||
|
||||
## Gate setup (all bit-changing -> KLD gate per spec)
|
||||
|
||||
- Reference (the "f32" of the gate): `~/work/darwin_36b_opus/f16.gguf` - the full-precision f16 GGUF
|
||||
of the same Qwen3.6-35B-A3B model (qwen35moe, 41 blocks, vocab 248320, embd 2048). Verified it
|
||||
matches the NVFP4 baseline shape; its own PPL = 7.376 self-consistent with the KLD base.
|
||||
- KL base: `llama-perplexity --kl-divergence-base` over `wiki.test.raw`, c512, 16 chunks (8192 tok),
|
||||
-ngl 99, seed 1. Base file `~/bench/l4gate/klbase_moe.dat` (2.0 GB). f16 PPL(base) = 7.3734.
|
||||
- Candidates scored with `--kl-divergence` against that base, identical c512/16-chunks/seed.
|
||||
- Current "bf16-projection GGUF" baseline = `~/bench/q36-35b-a3b-nvfp4.gguf` (the shipping NVFP4:
|
||||
experts NVFP4, GDN/attn projections BF16). It is the reference for the PPL-delta and argmax gates.
|
||||
|
||||
## Measurements (16 chunks, c512, 8192 tokens, wiki.test.raw)
|
||||
|
||||
| model | PPL(Q) | PPL delta vs baseline | Mean KLD-to-f16 | Same-top-p (argmax agree vs f16) | RMS dp |
|-------|--------|-----------------------|-----------------|----------------------------------|--------|
| baseline NVFP4 (proj BF16, shipping) | 7.3896 | - (reference) | 0.1366 | 84.31% | 9.20% |
| **projq FULL** (190 proj -> NVFP4, incl. in_proj_ba) | 7.8705 | **+6.51%** | 0.1638 | 81.72% | 10.47% |
| **projq CONS** (130 proj -> NVFP4, in_proj_ba kept BF16) | 7.8440 | **+6.15%** | 0.1716 | 82.16% | 10.82% |
|
||||
|
||||
Baseline vs f16: PPL ratio 1.0022 (+0.22%), i.e. the shipping NVFP4 is already near-f16 - because
|
||||
modelopt put the quant-sensitive GDN/attn projections in BF16 and only the experts (designed for FP4)
|
||||
in NVFP4. projq pushes the projections to NVFP4 and PPL ratio jumps to 1.067 (FULL) / 1.064 (CONS).
|
||||
|
||||
## Gate verdict (all three conditions FAIL)
|
||||
|
||||
1. **PPL delta < ~1% vs the bf16-projection GGUF -> FAIL.** FULL +6.51%, CONS +6.15%. Off by ~6x.
|
||||
2. **KLD-to-f32 < 0.06 -> FAIL.** The shipping baseline NVFP4 itself sits at 0.137 mean KLD vs f16
|
||||
(per-token KLD is naturally high at 248K vocab), and projq raises it to 0.164 (FULL) / 0.172 (CONS).
|
||||
Whatever the intended reference granularity, projq is strictly worse than the baseline, not < 0.06.
|
||||
3. **Zero greedy-argmax flips -> FAIL.** Per-token top-1 agreement vs f16 drops from 84.31% (baseline)
|
||||
to 81.72% (FULL) / 82.16% (CONS): the requant flips the argmax on ~2.2-2.6% MORE tokens than the
|
||||
shipping model. (A direct `llama-cli --temp 0 -n 48` greedy diff was attempted but the paged
|
||||
llama-cli build segfaults at teardown on ALL models incl. baseline - not projq-specific - so the
|
||||
8192-token Same-top-p above is the argmax measure used; it is strictly stronger than a 48-tok probe.)
|
||||
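The three gate conditions can be written down as a small check. This is a sketch under stated assumptions rather than the harness that was run: the struct and function names are hypothetical, the 1% and 0.06 thresholds are the ones quoted in the list above, and "zero argmax flips" is approximated here as "no loss of top-1 agreement relative to the shipping baseline", which is an assumption about intent.

```cpp
#include <cassert>
#include <cmath>

// Inputs for one candidate, taken from the measurements table.
struct GateInput {
    double ppl_baseline;    // PPL of the shipping bf16-projection GGUF
    double ppl_candidate;   // PPL of the requantized candidate
    double mean_kld;        // candidate mean KLD to the f16 reference
    double top1_baseline;   // baseline argmax agreement vs f16, in percent
    double top1_candidate;  // candidate argmax agreement vs f16, in percent
};

struct GateResult {
    double ppl_delta_pct;
    bool   ppl_ok;
    bool   kld_ok;
    bool   argmax_ok;
    bool pass() const { return ppl_ok && kld_ok && argmax_ok; }
};

GateResult evaluate_gate(const GateInput & in) {
    GateResult r{};
    r.ppl_delta_pct = (in.ppl_candidate / in.ppl_baseline - 1.0) * 100.0;
    r.ppl_ok        = r.ppl_delta_pct < 1.0;  // condition 1: PPL delta < ~1%
    r.kld_ok        = in.mean_kld < 0.06;     // condition 2: KLD threshold
    // ASSUMPTION: condition 3 approximated as no drop in top-1 agreement.
    r.argmax_ok     = in.top1_candidate >= in.top1_baseline;
    return r;
}
```

Feeding in the FULL row (7.3896, 7.8705, 0.1638, 84.31, 81.72) gives a PPL delta of about +6.51% and fails all three conditions, matching the verdict.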
|
||||
CONSERVATIVE (keeping the most quant-sensitive in_proj_ba=ssm_alpha/ssm_beta in BF16) recovered almost
|
||||
nothing: 7.844 vs 7.871. The damage is in the BULK attn/GDN projections (attn_qkv, ssm_out, attn_gate,
|
||||
attn_q/k/v/output), not the tiny in_proj_ba. An attn_gate-excluded third variant would, at best, shave
|
||||
a fraction of a percent off a 6% miss - not worth a GPU pass. lm_head was already NVFP4 in the baseline
|
||||
(and in vLLM's checkpoint), so it is not a variable here and was never the issue.
|
||||
|
||||
## Why the premise was wrong (root cause of the failure)
|
||||
|
||||
The scope assumed vLLM runs these projections in NVFP4. It does not. vLLM runs the **nvidia modelopt
|
||||
checkpoint** (`~/bench/q36-35b-a3b-nvfp4-vllm`), which is the SAME provenance that left these exact
|
||||
projections in BF16. So:
|
||||
|
||||
- The baseline GGUF's bf16 projections **match vLLM** already. They are not a llama-vs-vLLM gap.
|
||||
- modelopt left in_proj_qkvz/in_proj_ba/out_proj/attn_q/k/v/output in BF16 **because they are
|
||||
quant-sensitive in this hybrid gated-DeltaNet + attention model** - the gate confirms this empirically
|
||||
at ~6% PPL. The dense q36-27b-nvfp4 (unsloth) tolerating NVFP4 projections does not transfer: it is a
|
||||
different (non-MoE, different-provenance) model and a different sensitivity profile.
|
||||
- Re-quantizing them is therefore not "matching vLLM" - it is going BEYOND vLLM's precision and paying
|
||||
for it in quality. The ~20.3ms projection-GEMM bucket is the price of running these projections in
|
||||
high precision; vLLM pays the same precision cost (its nvjet/cutlass bf16 GEMMs), so the bucket is NOT
|
||||
the lever it looked like. The L4 speed win is real but only purchasable with a 6% PPL regression -
|
||||
rejected by the gate.
|
||||
|
||||
## Disposition / artifacts
|
||||
|
||||
- Both projq GGUFs exist on DGX but are **dead** (do not publish): `~/bench/q36-35b-a3b-nvfp4-projq.gguf`
|
||||
(FULL, md5 1bd32114..., sha256 88b7e812...), `~/bench/q36-35b-a3b-nvfp4-projq-cons.gguf` (CONS, md5
|
||||
6847ebe3..., sha256 ca035111...). The L4-requant pin files (`~/bench/pins_projq_{full,cons}.txt`) and
|
||||
`/tmp/gen_pins.py` remain if a future, kernel-side (not precision-side) approach is ever revisited.
|
||||
- Gate logs: `~/bench/l4gate/` - `f16base.log`, `kld.{baseline,projqFULL,projqCONS}.log`,
|
||||
`klbase_moe.dat`.
|
||||
- No code change, no patch, no commit to the DGX `llama-paged-dev` tree. No `-paged` gallery entry.
|
||||
- MoE remains at 86.3% of vLLM @ npl128; this lever does not move it within the quality budget.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,73 +0,0 @@
|
||||
# Lever 5 - block-table within-step host cache (patch 0029)
|
||||
|
||||
## What
|
||||
|
||||
`get_block_table()` is called once per full-attention layer per decode step. The
|
||||
KV cell layout (and therefore the block table bytes) is fixed for the whole step;
|
||||
it only changes in `apply()` when the ubatch's slots are committed. The old path
|
||||
recomputed the full table on every full-attention layer of every step.
|
||||
|
||||
Patch 0029 builds the table once per step and reuses the bytes (`memcpy`) for the
|
||||
remaining full-attention layers, invalidating the cache in `apply()`. The reused
|
||||
bytes are identical to a fresh compute, so the change is bit-exact. Disable with
|
||||
`LLAMA_PAGED_NO_BT_CACHE=1`.
|
||||
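The build-once, copy-many pattern described above can be sketched as a small cache object. Everything here is hypothetical (class name, member names, the `int32_t` element type); it only illustrates the shape of the idea: compute on the first request of a step, `memcpy` for later layers, and invalidate when slots are committed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical within-step cache; names are illustrative, not the patch's.
class BlockTableCache {
public:
    // Fills dst with the block table, recomputing only on the first call
    // after an invalidation.
    template <typename ComputeFn>
    void get(std::vector<int32_t> & dst, ComputeFn && compute) {
        if (!valid_) {
            table_ = compute();
            valid_ = true;
            ++n_computes_;
        }
        dst.resize(table_.size());
        if (!table_.empty()) {
            std::memcpy(dst.data(), table_.data(), table_.size() * sizeof(int32_t));
        }
    }

    // Called when the ubatch's slots are committed and the layout may change.
    void invalidate() { valid_ = false; }

    int n_computes() const { return n_computes_; }

private:
    std::vector<int32_t> table_;
    bool valid_      = false;
    int  n_computes_ = 0;
};
```

Because the copied bytes are exactly what a fresh compute would have produced for the same step, a cache of this shape cannot change numerical results; it only removes repeated host work.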
|
||||
## Host-side get_block_table time (the lever)
|
||||
|
||||
`llama-batched-bench`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`,
|
||||
`-npp 128 -ntg 128 -npl 128 -ngl 99 -fa on`, measured with the in-tree
|
||||
`[L5INSTR]` host timers (aggregate over the full bench, n=2048 dense / 1280 MoE
|
||||
get_block_table calls):
|
||||
|
||||
| model                 | get_block_table host, cache OFF | cache ON | reduction |
|-----------------------|--------------------------------:|---------:|----------:|
| MoE q36-35b-a3b-nvfp4 |                       112.94 ms | 14.82 ms |      -87% |
| dense q36-27b-nvfp4   |                       193.78 ms | 16.90 ms |      -91% |
|
||||
|
||||
The MoE 112.94 -> 14.82 ms is the "110 -> 14 ms host" headline. `set_inputs`
|
||||
host time falls in lockstep (MoE 128.6 -> 32.0 ms; dense 220.2 -> 36.5 ms) and
|
||||
`process_ubatch` host (hostproc) drops MoE 498.8 -> 413.0 ms, dense 730.1 ->
|
||||
544.2 ms.
|
||||
|
||||
## Throughput effect
|
||||
|
||||
Same bench, TG (decode) tokens/s, cache OFF -> ON:
|
||||
|
||||
| model               | TG t/s OFF | TG t/s ON |        delta |      vs vLLM @npl128 |
|---------------------|-----------:|----------:|-------------:|---------------------:|
| dense q36-27b-nvfp4 |     364.81 |    374.72 |        +2.7% | 374.72 / 391 = 95.8% |
| MoE q36-35b-a3b     |     752.19 |    756.97 | +0.6% (flat) |                  n/a |
|
||||
|
||||
- Dense decode is partly host-bound, so removing ~90% of the get_block_table host
|
||||
time lifts dense TG by a few percent (run-to-run; ~0.4-2.7% across runs) and
|
||||
pushes it to ~96-97.5% of the vLLM 391 t/s @npl128 reference.
|
||||
- MoE decode is compute-bound (the FP4 GEMM dominates the step), so the ~98 ms of
|
||||
saved host time is hidden behind GPU compute and is off the critical path: MoE
|
||||
TG is flat. The deployment path (MoE) sees no regression and no win - the cache
|
||||
is a pure pipeline cleanup there.
|
||||
- npl=1 single-stream decode: get_block_table is tiny either way (MoE 0.64 ->
|
||||
0.22 ms over 128 steps); the lever only matters at batch.
|
||||
|
||||
## Bit-exactness
|
||||
|
||||
`llama-completion -p "The capital of France is" -n 48 --temp 0 --seed 1`,
|
||||
chat-template (conversation) path:
|
||||
|
||||
| path                                               | md5                              |
|----------------------------------------------------|----------------------------------|
| non-paged MoE                                      | 07db32c2bcb78d17a43ed18bc22705cd |
| paged MoE, cache ON                                | 8cb0ce23777bf55f92f63d0292c756b0 |
| paged MoE, cache OFF (`LLAMA_PAGED_NO_BT_CACHE=1`) | 8cb0ce23777bf55f92f63d0292c756b0 |
| dense non-paged                                    | 5951a5b4d624ce891e22ab5fca9bc439 |
| dense paged                                        | 5951a5b4d624ce891e22ab5fca9bc439 |
|
||||
|
||||
cache ON == cache OFF confirms the lever is numerically neutral. The paged-MoE
|
||||
md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a benign
|
||||
FP-accumulation-order difference of the paged attention reduction, KL-validated
|
||||
in PAGED_BITEXACT_NOTE.md (not introduced by this lever - it is present on the
|
||||
0028 baseline too).
|
||||
|
||||
## Verdict
|
||||
|
||||
Ship. Bit-exact per path, real host-pipe win on host-bound (dense) decode,
|
||||
neutral on the compute-bound MoE deployment path.
|
||||
@@ -420,8 +420,8 @@ may want a second lighter gallery variant (context_size 16384, parallel 4, drop
|
||||
================================================================================
|
||||
|
||||
The two GGUFs already exist on the DGX dev box (final_benchmark.csv references
|
||||
-q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; QWEN36_NVFP4_BENCH.md section "The 4 models"
-documents provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
+q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; README.md "Models" + "Benchmarks"
+document provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
|
||||
tensors from nvidia modelopt weights). To publish:
|
||||
|
||||
1. HF repos (suggest two, under the org that owns the gallery-referenced weights):
|
||||
@@ -434,7 +434,7 @@ tensors from nvidia modelopt weights). To publish:
|
||||
3. Model card metadata: base_model Qwen/Qwen3.6-*, library_name gguf, quantization NVFP4,
|
||||
pipeline_tag text-generation, license (confirm Qwen3.6 license terms - apache-2.0 vs
|
||||
Qwen community license), a note that it REQUIRES the llama-cpp-paged backend (NVFP4 +
|
||||
-paged), and the GB10 benchmark table (link QWEN36_NVFP4_BENCH.md numbers).
+paged), and the GB10 benchmark table (link README.md "Benchmarks" numbers).
|
||||
4. NVFP4 requires a llama.cpp new enough to read the NVFP4 GGUF type. Confirm the pinned
|
||||
LLAMA_VERSION in backend/cpp/llama-cpp/Makefile supports NVFP4 tensor types (the dev
|
||||
tree that produced the GGUFs did). If the current pin predates NVFP4 GGUF support, the
|
||||
|
||||
@@ -1,143 +0,0 @@
|
||||
# Patch 0015 findings: expert-density-aware MoE token-tile auto-select
|
||||
|
||||
The durable follow-up to patch 0014 (`MOE_TOKEN_TILE_CAP.md`): replace the blunt,
|
||||
opt-in `LLAMA_MOE_MMQ_X` global cap with a host-side, **default-on** density-aware
|
||||
`mmq_x` auto-select in `mul_mat_q_case`. Companion to
|
||||
`0015-paged-expert-density-aware-moe-token-tile-auto-select.patch`. Dev tree
|
||||
`~/llama-paged-dev` (branch `paged`), `build-cuda` sm_121.
|
||||
|
||||
Primary model: **Qwen3.6-35B-A3B NVFP4** (`~/bench/q36-35b-a3b-nvfp4.gguf`),
|
||||
**256 experts, top-8**, expert FFN 512, GDN linear attention (SSM inner 4096),
|
||||
41 layers. This is a different beast from 0014's Qwen3-Coder-30B-A3B (128 experts,
|
||||
larger expert FFN, standard attention).
|
||||
|
||||
## What it does (vs 0014)
|
||||
|
||||
`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max` (= `ne12`,
|
||||
the per-expert column upper bound = token count) in one column-tile, i.e. stock
|
||||
**maximizes** the tile (128 on Blackwell). Applied per expert at MoE decode, where
|
||||
per-expert density is tiny, that 128-wide tile is mostly padding.
|
||||
|
||||
Patch 0014 capped `mmq_x` globally on the ids path via `LLAMA_MOE_MMQ_X` (decode
|
||||
**and** prefill), which cost ~1.3% prefill. Patch 0015 instead estimates the
|
||||
per-expert density host-side, from args the ids path already passes:
|
||||
|
||||
```
ne_get_rows = ncols_dst   = ne12 * n_expert_used                  (token-expert assignments)
n_experts   = nchannels_x = ne02
density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))     (tokens/expert)
```
|
||||
|
||||
and caps to the small tile (default 64) **only when `density <= density_max`**, so
|
||||
the high-density prefill ubatch keeps the big 128 tile. Prefill-safe by construction.
|
||||
No new kernel: the selection only lowers the loop's upper bound to an
|
||||
already-compiled, granularity- and shared-memory-validated `mmq_x`.
|
||||
|
||||
## The threshold matters: `density_max = 8`, not `tile/4 = 16`
|
||||
|
||||
The cap must fire for decode but not for a prefill ubatch. Each has per-expert
|
||||
density `n_tokens * n_used / n_experts`. At the standard `n_ubatch=512`, `n_used=8`:
|
||||
|
||||
```
                        128 experts   256 experts
prefill ubatch (512)        32            16
decode npl128  (128)         8             4
```
|
||||
|
||||
`tile/4 = 16` (0014's first auto-select draft default) **equals the 256-expert
|
||||
prefill density** and caps prefill: measured -2.0% to -2.9% S_PP on q36-35b-a3b.
|
||||
`density_max = 8` sits at or above the npl128 decode density (`1024 / n_experts`) and strictly
below the `n_ubatch=512` prefill density (`4096 / n_experts`) for every `n_experts` in
`[128, 511]`, so it caps decode and leaves prefill on the big tile. This single
default change is what makes the patch prefill-safe on the 256-expert model.
|
||||
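The density estimate and the threshold argument can be checked with a short sketch. The function names are hypothetical and this is not the `mul_mat_q_case` code; it only encodes the formula quoted above plus the stated defaults (big tile 128, small tile 64, `density_max` 8).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Per-expert density estimate, following the formula quoted in the text.
int64_t moe_density(int64_t n_tokens, int64_t n_expert_used, int64_t n_experts) {
    const int64_t ne_get_rows = n_tokens * n_expert_used;  // token-expert assignments
    const int64_t active      = std::min(ne_get_rows, n_experts);
    assert(active > 0);
    return (ne_get_rows + active - 1) / active;  // integer ceil
}

// Hypothetical selector: returns the upper bound for mmq_x. The small tile is
// applied only when the density is at or below density_max.
int64_t select_mmq_x_cap(int64_t n_tokens, int64_t n_expert_used, int64_t n_experts,
                         int64_t big_tile = 128, int64_t small_tile = 64,
                         int64_t density_max = 8) {
    const int64_t density = moe_density(n_tokens, n_expert_used, n_experts);
    return density <= density_max ? std::min(big_tile, small_tile) : big_tile;
}
```

With top-8 routing this reproduces the table: a 512-token prefill ubatch has density 32 (128 experts) or 16 (256 experts) and keeps the 128 tile, while a 128-token decode batch has density 8 or 4 and is capped to 64. A 256-token decode batch on 128 experts also has density 16, so it is indistinguishable from 256-expert prefill and stays on the big tile.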
|
||||
## Measurements (default-on vs stock, median of 5 reps)
|
||||
|
||||
`llama-batched-bench`, q36-35b-a3b-nvfp4.gguf, `-fa on -npp 128 -ntg 128`, GB10
|
||||
sm_121. STOCK = `LLAMA_MOE_AUTO_TILE=0` (exact stock selection); 0015 = default.
|
||||
|
||||
```
npl   S_TG stock   S_TG 0015   dTG%     S_PP stock   S_PP 0015   dPP%
  8     183.59       183.18    -0.22%     1489.2       1500.1    +0.73%
 32     264.02       263.44    -0.22%     2034.5       2033.5    -0.05%
 64     311.76       310.41    -0.43%     2028.3       2027.6    -0.03%
128     336.10       337.32    +0.36%     2025.0       2027.7    +0.13%
```
|
||||
|
||||
Raw npl128 reps: S_TG 0015 `[337.3, 336.9, 336.4, 338.9, 338.1]` vs stock
|
||||
`[336.2, 336.1, 335.9, 336.9, 335.8]` (distributions overlap); S_PP 0015
|
||||
`[2028.6, 2023.0, 2024.9, 2028.0, 2027.7]` vs stock `[2024.9, 2025.0, 2023.2,
|
||||
2029.4, 2029.0]`.
|
||||
|
||||
### Honest read: neutral on this model
|
||||
|
||||
On q36-35b-a3b the decode delta is **within run-to-run noise** (npl128 +0.36%,
|
||||
npl<=64 slightly negative) and prefill is **neutral** (within +/-0.7%, well inside
|
||||
the 1% target). The `+5%` decode target from the localmaxxing reference does **not**
|
||||
materialize here. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
|
||||
256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
|
||||
lever has nothing to bite on.
|
||||
|
||||
### npl128 decode tile sweep confirms 64 is the only useful width
|
||||
|
||||
`density_max=8` fixed, varying `LLAMA_MOE_DECODE_TILE`, S_TG @ npl128 vs stock:
|
||||
|
||||
```
TILE8    TILE16   TILE32   TILE64   TILE96
-6.31%   -3.18%   -0.17%   +0.70%   -0.76%
```
|
||||
|
||||
Smaller tiles are **worse**, not better: more column-tiles per expert = more
|
||||
grid/scheduling overhead, and the FP4-MMA has a minimum efficient width. So matching
|
||||
the tile to the literal density (4) is counterproductive; 64 is the sweet spot,
|
||||
same as 0014.
|
||||
|
||||
## Why ship it default-on anyway
|
||||
|
||||
1. **Removes 0014's prefill cost by construction.** The cap is density-gated, not
|
||||
global, so prefill keeps its 128 tile (S_PP neutral above).
|
||||
2. **Banks the col-tile-bound gain for free.** At npl128 the auto-select picks
|
||||
`tile=64` for a 128-expert model (decode density 8 <= 8), i.e. exactly 0014's
|
||||
`cap64`, so it reproduces 0014's **+4.8% @npl128 on Qwen3-Coder-30B** without the
|
||||
-1.3% prefill cost. (That model was unavailable to re-bench here; the tile choice
|
||||
is identical by construction.)
|
||||
3. **Prefill-safe and decode-neutral on the SSM model**, so it is harmless where it
|
||||
does not help.
|
||||
4. **Correctness-gated** by the P0 harness (below).
|
||||
|
||||
## Conservative by design (known limitation)
|
||||
|
||||
A pure-density gate cannot separate two cases with the **same** per-expert density:
|
||||
Qwen3-Coder npl256 decode (density 16) and the 256-expert prefill ubatch (density
|
||||
16) are identical to the estimator. `density_max=8` therefore **forgoes 0014's
|
||||
+2.3% @npl256** on the 128-expert model to keep 256-expert prefill safe. Recovering
|
||||
it needs an `ne12`-aware (absolute token count) gate in addition to density; scoped
|
||||
as future work, not implemented.
|
||||
|
||||
## Knobs
|
||||
|
||||
- `LLAMA_MOE_AUTO_TILE=0` : disable the auto-select, exact stock `mmq_x` selection.
|
||||
- `LLAMA_MOE_MMQ_X=<n>` (patch 0014) : **kept** as a manual override; when > 0 it
|
||||
forces the old blunt global cap and bypasses the auto-select (explicit A/B knob).
|
||||
- `LLAMA_MOE_DECODE_TILE=<n>` : the small tile (default 64).
|
||||
- `LLAMA_MOE_DENSITY_MAX=<n>` : the density ceiling (default 8).
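How the four knobs interact can be sketched as below. This is an illustration of the selection rule as described in this doc, not the patch's CUDA-side code: the function name is hypothetical, density is assumed to be `n_tokens * top_k / n_experts`, "stock" is simplified to a fixed 128 tile, and the manual override is assumed to act as an upper bound.

```python
import os

STOCK_TILE = 128  # simplification: the doc says prefill keeps its 128 tile


def select_moe_tile(n_tokens, top_k, n_experts, env=None):
    """Pick the MoE column tile from the per-expert token density (sketch)."""
    env = os.environ if env is None else env

    # Manual override (patch 0014): bypasses the auto-select entirely.
    forced = int(env.get("LLAMA_MOE_MMQ_X", "0"))
    if forced > 0:
        return min(forced, STOCK_TILE)  # assumption: the knob is a cap

    # Kill switch: fall back to the stock selection.
    if env.get("LLAMA_MOE_AUTO_TILE", "1") == "0":
        return STOCK_TILE

    decode_tile = int(env.get("LLAMA_MOE_DECODE_TILE", "64"))
    density_max = int(env.get("LLAMA_MOE_DENSITY_MAX", "8"))

    # Average tokens routed to each expert in this batch.
    density = n_tokens * top_k / n_experts
    return decode_tile if density <= density_max else STOCK_TILE


print(select_moe_tile(128, 8, 128, env={}))  # 128-expert decode at npl128
print(select_moe_tile(512, 8, 256, env={}))  # 256-expert prefill ubatch
print(select_moe_tile(256, 8, 128, env={}))  # 128-expert decode at npl256
```

The last two calls both see density 16 and both keep the big tile, which is the collision described under "Conservative by design": the estimator cannot tell them apart.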
|
||||
|
||||
## P0 correctness gate
|
||||
|
||||
`tests/test-backend-ops` `test_mul_mat_id` is extended with a ragged small-M
|
||||
NVFP4/MXFP4 MoE decode-density block: 128 experts, top-8, m=768, k=2048, n in
|
||||
`{16,33,64,128,130,200,256,512}` spanning the cap boundary (n>=130 keeps the 128
|
||||
tile at `density_max=8`, n<=128 takes tile 64) and ragged token counts (experts with
|
||||
0/1/2 tokens, n not a multiple of the tile). All 16 shapes pass the CUDA-vs-CPU
|
||||
oracle on GB10 both default-on and with `LLAMA_MOE_AUTO_TILE=0`; full `MUL_MAT_ID`
|
||||
suite 2/2 backends OK. Off the ids path nothing changes (non-MoE `mul_mat`
|
||||
byte-identical to stock).
|
||||
|
||||
## Verdict
|
||||
|
||||
- Correct, prefill-safe, default-on density-aware tile select; the durable design
|
||||
0014's own doc scoped. Supersedes 0014's global cap as the default path; the
|
||||
`LLAMA_MOE_MMQ_X` knob is retained as a manual override.
|
||||
- **Net effect on q36-35b-a3b NVFP4: neutral** (decode within noise, prefill neutral)
|
||||
because the model is SSM/bandwidth-bound, not col-tile-bound. The lever's real win
|
||||
lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128), banked here at zero
|
||||
prefill cost.
|
||||
@@ -1,21 +0,0 @@
|
||||
# MOE_GAP_PROGRESS.md - moe-gap-groundtruth GPU agent checkpoint
|
||||
|
||||
Status: **DONE.** Both-engine MoE decode decomposition complete. Findings in `MOE_GAP_VS_VLLM.md`.
|
||||
|
||||
## Runs (DGX GB10 sm_121, GPU free, foreground)
|
||||
- llama: `build-cuda` 2f4f5ab (0025), `llama-batched-bench -npp128 -ntg128 -npl128 -c32768 -fa on`,
|
||||
`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`. S_TG=752.3 t/s, step 169.8 ms, busy 97.5%.
|
||||
Artifacts on DGX: `~/llama-paged-dev/moe_gap_llama.{nsys-rep,trace.csv}`.
|
||||
- vLLM 0.23.0 graphs-ON (FULL_AND_PIECEWISE, the 882-ref config): `~/bench/moe_gap_vllm.py` under
|
||||
`nsys --capture-range=cudaProfilerApi`. step 142.0 ms, busy 99.7%.
|
||||
Artifacts on DGX: `~/bench/moe_gap_vllm.{nsys-rep,trace.csv}`, script `~/bench/moe_gap_vllm.py`.
|
||||
- Extractor: `~/bench/decode_decomp2.py` (dual-engine, steps = GDN-kernel-count / 30; cross-checked vs
|
||||
flash/reshape_cache = 10x and vs throughput). Grouped-MoE GEMM isolated by per-call duration (LONG/SHORT).
|
||||
|
||||
## Result (1 line)
|
||||
Gap = 27.8 ms/step (llama 83.6% of vLLM). **MoE grouped GEMM is a llama WIN** (native FP4-MMA W4A4 47.3 ms
|
||||
vs Marlin W4A16 50.0 ms). The 15% is bf16-projections+convert (+6.5), recurrence state-gather plumbing
|
||||
(+6.6, led by k_get_rows 5.2 ms), graph/overlap (+7.0), W4A4 act-quant tax (+3.3), router/glue (+5.4).
|
||||
Marlin is NOT the lever; do not build a W4A16 MoE GEMM.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,484 +0,0 @@
|
||||
# MOE_GAP_VS_VLLM.md - ground-truth both-engine MoE decode decomposition (where vLLM's ~15% lives)
|
||||
|
||||
> **READ THE FINAL SECTION FIRST ("RESIDUAL-ASSESS (FINAL)" at the bottom).** It concludes the hunt and
|
||||
> CORRECTS one premise used throughout the body below: this doc assumes vLLM runs the GDN/attn projections
|
||||
> as NVFP4-Marlin. It does NOT. vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the
|
||||
> projection bucket is a matched-precision (bf16) gap, not a quant gap. Lever 4 (NVFP4 the projections) is
|
||||
> REJECTED (+6% PPL, and not even a vLLM gap). The MoE is at its bit-exact ceiling (~86-88% of vLLM).
|
||||
|
||||
THE GPU AGENT (label `moe-gap-groundtruth`), DGX GB10 (sm_121). First **side-by-side, both-engine,
|
||||
per-kernel ms/step** decomposition of the MoE decode gap. All prior B work decomposed llama ONLY; this
|
||||
profiles vLLM's decode step too and computes the per-bucket `llama - vLLM` delta to pinpoint the gap.
|
||||
|
||||
Model `q36-35b-a3b-nvfp4` (40 layers: 30 GDN linear-attn + 10 full-attn, 256 experts top-8, vocab 248320).
|
||||
Both engines profiled at **batch 128 decode** with `nsys --cuda-graph-trace=node`, steady-decode window,
|
||||
per-step normalized by GDN-kernel-count / 30 (cross-checked vs flash/reshape_cache counts and throughput).
|
||||
|
||||
- **llama**: `build-cuda` tip `2f4f5ab` (patch 0025), `llama-batched-bench -npp 128 -ntg 128 -npl 128
|
||||
-c 32768 -fa on`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1` (the re-graph ON = the 752 t/s ship point).
|
||||
Measured **S_TG = 752.3 t/s** => **step = 169.8 ms**, GPU-busy 97.5% (idle 2.5% = 4.2 ms/step).
|
||||
- **vLLM 0.23.0**: `q36-35b-a3b-nvfp4-vllm`, **CUDA graphs ON** (`cudagraph_mode=FULL_AND_PIECEWISE`,
|
||||
the 882-reference config, NOT enforce_eager), MARLIN NvFp4 MoE, 128 seqs x 128-tok prompt x 128 gen.
|
||||
Measured **step = 142.0 ms** (= 901 t/s-equiv), GPU-busy 99.7% (idle 0.3% = 0.4 ms/step).
|
||||
- Gap reproduced: **169.8 - 142.0 = 27.8 ms/step** (llama 83.6% of vLLM here; matches the ~85% server number).
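The gap arithmetic, as a small sketch. It assumes one decode step emits one token per sequence (128 tokens per step at `-npl 128`); the measured llama S_TG of 752.3 t/s is the authoritative throughput figure, the conversion here is only used for the vLLM "t/s-equiv".

```python
BATCH = 128  # sequences decoded in parallel; assumed one token each per step


def tokens_per_second(step_ms, batch=BATCH):
    """Throughput implied by a per-step wall time."""
    return batch / (step_ms / 1000.0)


llama_step_ms = 169.8
vllm_step_ms = 142.0

gap_ms = llama_step_ms - vllm_step_ms
llama_vs_vllm = vllm_step_ms / llama_step_ms  # speed ratio = inverse step-time ratio

print(f"gap           = {gap_ms:.1f} ms/step")
print(f"llama / vLLM  = {llama_vs_vllm * 100:.1f}%")
print(f"vLLM t/s-equiv = {tokens_per_second(vllm_step_ms):.0f}")
```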
|
||||
|
||||
## THE HEADLINE: the MoE grouped GEMM is NOT vLLM's advantage - it is a llama WIN
|
||||
|
||||
Grouped MoE-expert GEMM, isolated by per-call duration (LONG calls = the per-expert grouped GEMM):
|
||||
|
||||
| grouped MoE-expert GEMM | ms/step | what |
|-------------------------|--------:|------|
| **llama** `mul_mat_q<NVFP4,M-tile=64>` (+stream-k fixup + gather) | **48.3** | native Blackwell FP4-MMA **W4A4** |
| **vLLM** `marlin_moe_wna16::Marlin` | **50.0** | **W4A16** (FP4 weights dequant-in-kernel -> bf16 MMA) |
|
||||
|
||||
**llama's native-FP4 grouped GEMM is ~1.7 ms/step FASTER than vLLM's Marlin W4A16 at the ragged
|
||||
tiny-M (~4 rows/expert) decode shape** (pure GEMM core 47.3 vs 50.0). Both read the same ~4-bit weight
|
||||
bytes and are bandwidth-bound, so they tie to within a few %, and llama's 2x-rate FP4-MMA edges it.
|
||||
**=> Marlin is NOT faster here; a Marlin-style W4A16 MoE GEMM in llama would make the MoE GEMM SLOWER.**
|
||||
This directly answers the brief's load-bearing question #1/#2 and extends the prior `w4a16-marlin` DENSE
|
||||
conclusion ("the win was NVFP4-dense-quant, not the Marlin kernel") to MoE: **the MoE GEMM kernel is not
|
||||
the lever; llama already beats Marlin there.**
|
||||
|
||||
## Side-by-side per-step decomposition (ms/step, kernel-time attribution)
|
||||
|
||||
| bucket | llama ms | vLLM ms | Δ llama-vLLM | note |
|--------|---------:|--------:|-------------:|------|
| **Recurrence / SSM** | **79.3** | **72.7** | **+6.6** | core kernel is a llama WIN (70.0 vs 71.1); the gap is llama's state-gather/conv plumbing |
| **MoE-expert grouped GEMM** | 48.3 | 50.0 | **-1.7** | **llama FASTER** (native FP4-MMA W4A4 vs Marlin W4A16) |
| **Dense projections (+glue)** | **20.3** | **13.8** | **+6.5** | llama runs GDN/attn projections in BF16 cublas; vLLM runs them as compact NVFP4-Marlin; +2.9 ms is llama's bf16<->f32 `convert_unary` glue vLLM never pays |
| **Norms / glue / memcpy** | 9.6 | 6.0 | +3.6 | llama `k_bin_bcast` (expert-combine+residual) 4.3 + memcpy 2.4 heavier |
| **Act-quant (W4A4 tax)** | 3.3 | 0.0 | **+3.3** | `quantize_mmq_nvfp4`; vLLM W4A16 keeps acts bf16 => structurally ZERO |
| **Router / align** | 2.4 | 0.5 | +1.9 | llama computes router via a full FP4 GEMM (1.6) + argsort/scatter; vLLM fuses topk/align |
| **Attention (full-attn)** | 2.8 | 2.6 | +0.2 | parity |
| kernel-time subtotal | 166.1 | 145.7 | +20.4 | |
| **GPU idle (host bubble)** | 4.2 | 0.4 | **+3.8** | graph coverage: llama partially-graphed (0025) vs vLLM FULL_AND_PIECEWISE |
| cross-stream overlap (union<sumdur) | ~0.8 | ~4.0 | ~-3.2 (vLLM overlaps more) | vLLM runs more kernels concurrently |
| **STEP TOTAL (wall)** | **169.8** | **142.0** | **+27.8** | |
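A consistency check on the table: summing the per-bucket rows should land on the kernel-time subtotal, and sorting by delta gives the ranking used in the next sections. A minimal sketch using only the numbers in the table (sums differ from the printed subtotals by up to 0.1 ms because each row is already rounded):

```python
# (llama ms/step, vLLM ms/step) per bucket, copied from the table above.
buckets = {
    "Recurrence / SSM": (79.3, 72.7),
    "MoE-expert grouped GEMM": (48.3, 50.0),
    "Dense projections (+glue)": (20.3, 13.8),
    "Norms / glue / memcpy": (9.6, 6.0),
    "Act-quant (W4A4 tax)": (3.3, 0.0),
    "Router / align": (2.4, 0.5),
    "Attention (full-attn)": (2.8, 2.6),
}

llama_sum = sum(llama for llama, _ in buckets.values())
vllm_sum = sum(vllm for _, vllm in buckets.values())

# Largest llama-minus-vLLM delta first.
ranked = sorted(buckets.items(), key=lambda kv: kv[1][0] - kv[1][1], reverse=True)

print(f"llama subtotal = {llama_sum:.1f} ms, vLLM subtotal = {vllm_sum:.1f} ms")
print(f"kernel-time delta = {llama_sum - vllm_sum:+.1f} ms")
for name, (llama, vllm) in ranked:
    print(f"{llama - vllm:+5.1f}  {name}")
```

The only negative entry in the ranking is the MoE-expert grouped GEMM, which is the point of the headline.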
|
||||
|
||||
### Per-engine top kernels (ms/step)
|
||||
|
||||
```
llama (752 t/s, step 169.8 ms, 97.5% busy)         vLLM (901-equiv, step 142.0 ms, 99.7% busy)
70.0 gated_delta_net_cuda          REC core        71.1 fused_recurrent_gated_delta   REC core
47.3 mul_mat_q grouped MoE (M=64)  MoE GEMM        50.0 marlin_moe_wna16::Marlin      MoE GEMM
 8.2 nvjet 192x136 (bf16 proj)     PROJ             4.0 nvjet 128x72 (bf16 proj)      PROJ
 5.2 k_get_rows_float REC-GATHER   REC        <--   2.8 marlin dense (lm_head NVFP4)  PROJ
 4.5 cutlass::Kernel2 (bf16 GEMM)  PROJ             2.7 nvjet 128x64 (bf16 proj)      PROJ
 4.3 k_bin_bcast (combine+resid)   GLUE             2.5 flash_fwd_splitkv             ATTN
 4.1 nvjet 128x64 (bf16 proj)      PROJ             2.0 marlin dense small (NVFP4)    PROJ
 3.4 ssm_conv_update_f32           REC              1.6 causal_conv1d_update          REC
 3.3 quantize_mmq_nvfp4 W4A4 TAX   ACTQ       <--   1.4 std::enable_if (glue)         GLUE
 2.9 convert_unary bf16<->f32      PROJ-GLUE  <--   1.2 reduce_kernel                 GLUE
 2.8 flash_attn_tile               ATTN             1.0 cutlass::device (fp8 lin)     PROJ
 2.4 MEMCPY-Device (SSM state)     GLUE             0.8 nvjet 32x64                   PROJ
 1.6 mul_mat_q router (M=128)      ROUTER           0.4 act_and_mul (SwiGLU)          GLUE
 1.5 rms_norm_f32                  GLUE             0.2 topkGating / moe_align        ROUTE
 ...                                                0.1 reshape_and_cache_flash       ATTN

<-- vLLM has no equiv of these two (5.2+2.9 = 8 ms pure llama tax)
```
|
||||
|
||||
## WHERE THE 27.8 ms ACTUALLY IS (ranked) - and it is NOT the Marlin GEMM
|
||||
|
||||
1. **Dense projections + bf16<->f32 glue: +6.5 ms.** llama keeps the GDN/attn linear projections (and
|
||||
the lm_head) in **BF16** (cublas `nvjet`/`cutlass`, full-precision weight reads) and pays a 2.9 ms
|
||||
`convert_unary` bf16<->f32 tax around them; vLLM runs the same projections as **compact NVFP4-Marlin
|
||||
W4A16** (4-bit weight read, ~4x less BW) and stays bf16 end-to-end (no convert). This is the
|
||||
**`NVFP4-dense-quant` lever the prior `w4a16-marlin` project already identified - applied to the
|
||||
still-bf16 projections**, not the MoE GEMM.
|
||||
2. **Recurrence state-gather/conv plumbing: +6.6 ms.** The recurrence CORE kernel is a **llama win**
|
||||
(gated_delta_net 70.0 vs vLLM fused_recurrent 71.1, confirming "past vLLM on BW efficiency"). The gap
|
||||
is entirely the surrounding plumbing: **`k_get_rows_float` 5.2 ms (the recurrent-state gather)** +
|
||||
`ssm_conv_update` 3.4 vs vLLM's single `causal_conv1d_update` 1.6. vLLM has **no gather** - its
|
||||
recurrent state is updated in-place inside the fused decode kernel. `k_get_rows` is the single biggest
|
||||
llama-specific kernel vLLM has no equivalent of.
|
||||
3. **Graph coverage + stream overlap: ~+7.0 ms combined** (idle +3.8, cross-stream overlap ~+3.2). vLLM
|
||||
FULL_AND_PIECEWISE is 99.7% busy with more concurrent kernels; llama (partially graphed post-0025) is
|
||||
97.5% busy with thinner overlap.
|
||||
4. **W4A4 act-quant tax: +3.3 ms.** `quantize_mmq_nvfp4`; vLLM's W4A16 choice makes this structurally 0.
|
||||
Fusing the quant into the preceding op (as vLLM fuses act_quant into RMSNorm/SiLU) would erase it.
|
||||
5. **Router GEMM + norms/glue: +5.4 ms.** llama computes router logits via a full FP4 GEMM (1.6) and has
|
||||
heavier `k_bin_bcast` combine/residual + memcpy; vLLM fuses routing into tiny topk/align kernels.
|
||||
|
||||
## THE SINGLE BIGGEST vLLM-MoE ADVANTAGE
|
||||
|
||||
**Not the Marlin GEMM.** It is a near-tie between two ~6.5 ms buckets, both bf16-precision-related:
|
||||
- **Dense projections (+6.5 ms)** - vLLM runs the GDN/attn projections + lm_head as NVFP4-Marlin while
|
||||
llama runs them BF16 + a 2.9 ms convert tax. Single biggest *bucket* delta.
|
||||
- **Recurrent-state gather (+5.2 ms, kernel `k_get_rows_float`)** - the single biggest *kernel* vLLM
|
||||
avoids entirely (in-place fused state vs llama's separate gather). Plus +1.8 ms more REC plumbing.
|
||||
|
||||
The MoE grouped GEMM (the brief's hypothesis) is a **-1.7 ms llama win**, so it is explicitly ruled out.
|
||||
|
||||
## ANSWERS TO THE BRIEF
|
||||
|
||||
1. **WHERE is vLLM's 15%?** Spread across bf16-projection BW (+6.5) + recurrence state-gather plumbing
|
||||
(+6.6) + graph/overlap (+7.0) + act-quant tax (+3.3) + router/glue (+5.4). **NOT the MoE GEMM.**
|
||||
2. **Is Marlin faster at tiny-M decode?** **No.** llama native FP4-MMA W4A4 = 47.3 ms vs Marlin W4A16 =
|
||||
50.0 ms. Marlin is ~5% slower here; both are at the LPDDR5x BW floor.
|
||||
3. **Should llama implement a Marlin-style W4A16 MoE GEMM?** **No** - it would slow the MoE GEMM and is
|
||||
not where the gap lives. The `w4a16-marlin` DENSE verdict ("NVFP4-dense-quant, not the Marlin kernel")
|
||||
carries to MoE. The real, ordered levers are: **(a) NVFP4-quantize the still-bf16 GDN/attn projections
|
||||
+ lm_head** (close ~+6.5, the largest, bit-changing but the same class of move vLLM makes); **(b) fuse
|
||||
away the recurrent-state gather `k_get_rows`** (~+5, bit-exact, the biggest single-kernel win);
|
||||
**(c) fuller CUDA-graph coverage + stream overlap** (~+7, bit-exact); **(d) fuse the W4A4 act-quant
|
||||
into the preceding op** (+3.3, bit-exact). None of these is a new MoE GEMM.
|
||||
|
||||
---
|
||||
|
||||
# FINAL DECISION (cross-agent synthesis) - "can we do what vLLM does on MoE?"
|
||||
|
||||
Three agents converged on the same verdict from independent angles: `moe-gap-groundtruth`
|
||||
(the measured both-engine nsys decomposition above), `vllm-marlin-study` (source-read of vLLM's
|
||||
`moe_wna16_marlin_gemm` / `moe_align_block_size` / `prepare_nvfp4_moe_layer_for_marlin` on the DGX),
|
||||
and `marlin-port-feasibility` (read-only assessment of the dense W4A16 scaffold + prior STOP). All
|
||||
three agree, and the measurement is the arbiter. Below is the decision the user asked for.
|
||||
|
||||
## (1) WHERE the 15% lives - decisive
|
||||
|
||||
The gap is **27.8 ms/step (llama at 83.6% of vLLM)** and it is **NOT one kernel - it is a sum of small
|
||||
deltas, and the MoE grouped GEMM is on llama's side of the ledger.** Ranked:
|
||||
|
||||
| rank | lever | Δ ms/step | bit-exact? | this is... |
|-----:|-------|----------:|:----------:|------------|
| 1 | Graph coverage + cross-stream overlap | ~+7.0 | **yes** | scheduler/runtime (idle +3.8, overlap +3.2) |
| 2 | Recurrence state-gather/conv plumbing (`k_get_rows_float` 5.2 + conv) | +6.6 | **yes** | llama-only kernels; vLLM updates state in-place |
| 3 | Dense GDN/attn projections + lm_head (bf16 vs NVFP4) + convert glue | +6.5 | **no** | the NVFP4-dense-quant lever, on the projections |
| 4 | Router GEMM + norms/combine/memcpy glue | +5.4 | mostly yes | llama router = full FP4 GEMM; vLLM fuses topk/align |
| 5 | W4A4 act-quant tax (`quantize_mmq_nvfp4`) | +3.3 | **yes** | vLLM's W4A16 makes this structurally 0 |
| - | **MoE-expert grouped GEMM** | **-1.7** | - | **llama WIN** - native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 |
|
||||
|
||||
**The Marlin GEMM is explicitly ruled out as the source of the gap.** Both engines read the same ~22 GB
|
||||
of ~4-bit expert weights once per step and are LPDDR5x-bandwidth-bound; on that weight stream they tie,
|
||||
and llama's 2x-rate FP4-MMA edges Marlin's half-rate bf16 MMA. It is **not the projections-vs-Marlin
|
||||
distinction in the experts, it is the projections in the DENSE path, the recurrence plumbing, and the
|
||||
runtime/graph** that cost llama the 15%. Not distributed, not the expert GEMM, not routing alone.
|
||||
|
||||
## (2) Can llama MATCH it - and HOW
|
||||
|
||||
**Yes - to within a few percent, and NOT with a Marlin/W4A16 MoE GEMM.** The two biggest *compute*
|
||||
kernels (the gated-DeltaNet SSM core 70.0 vs 71.1, and the MoE grouped GEMM 47.3 vs 50.0) are **already
|
||||
llama wins.** The gap is overhead/scheduling/precision-of-the-other-tensors, all of which llama can
|
||||
attack on its existing W4A4 FP4-MMA expert path. The four levers, in recommended build order:
|
||||
|
||||
| order | build | gain | bit-exact / gate | effort |
|------:|-------|-----:|------------------|--------|
| 1st | **Fuse away the recurrent-state gather `k_get_rows_float`** (update SSM state in-place in the GDN decode path, fold `ssm_conv_update`) | ~+5 ms (~3% of step) - biggest single-kernel win | **bit-exact** (no md5 rebaseline) | medium - CUDA, the GDN decode kernel |
| 2nd | **Fuller CUDA-graph coverage + stream overlap** (extend the 0025 re-graph to the remaining MoE/projection nodes, overlap independent streams) | ~+7 ms combined; 0025 already banked ~+1.9% | **bit-exact** | medium - scheduler, partly done |
| 3rd | **NVFP4-quantize the still-bf16 GDN/attn projections + lm_head** (the same move vLLM makes on its dense path; 4-bit weight read ~4x less BW, kills the 2.9 ms bf16<->f32 convert) | ~+6.5 ms - biggest *bucket* | **bit-changing** (re-baselines md5 gates; precision-UPGRADE, see below) | medium-high - new NVFP4 weight path for non-expert linears |
| 4th | **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** (as vLLM fuses act-quant) | +3.3 ms | **bit-exact** | low-medium |
|
||||
|
||||
**Reach:** the three bit-exact levers (1+2+4 ~= +15.3 ms) alone close the gap to ~154.5 ms/step
|
||||
=> ~830 t/s = **~94% of vLLM, with zero precision change and zero md5 rebaseline.** Adding the
|
||||
NVFP4-projection lever (3, +6.5) reaches ~148 ms => ~865 t/s = **~96-97% of vLLM**, with the residual
|
||||
being router/glue and the irreducible cross-stream-overlap that is structural to how ggml schedules
|
||||
host-launched nodes vs vLLM's single fused graph. Because llama's two heaviest kernels are already
|
||||
ahead, **parity-or-better is physically reachable** once the plumbing/overhead is removed; vLLM has no
|
||||
arithmetic advantage on this hardware (its W4A16 path runs bf16 MMA at half the FP4-MMA rate - it only wins on overhead and on the
dense-path weight-read BW).
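The reach estimate is plain subtraction from the measured step. A sketch using the per-lever gains from the build-order table, assuming the gains add linearly (an optimistic assumption, since overlapping kernels do not always sum) and 128 tokens per step:

```python
BATCH = 128           # tokens emitted per decode step (assumed: one per sequence)
BASE_STEP_MS = 169.8  # measured llama step

# Gains in ms/step from the build-order table above.
bit_exact_levers = {
    "recurrent-state gather fusion": 5.0,
    "graph coverage + stream overlap": 7.0,
    "act-quant fusion": 3.3,
}
projection_quant_ms = 6.5  # the bit-changing NVFP4-projection lever


def tps(step_ms):
    """Throughput implied by a per-step wall time."""
    return BATCH / step_ms * 1000.0


step_bit_exact = BASE_STEP_MS - sum(bit_exact_levers.values())
step_all = step_bit_exact - projection_quant_ms

print(f"bit-exact levers only: {step_bit_exact:.1f} ms/step -> {tps(step_bit_exact):.0f} t/s")
print(f"plus NVFP4 projections: {step_all:.1f} ms/step -> {tps(step_all):.0f} t/s")
```

The percent-of-vLLM figure then depends on which vLLM reference the throughput is divided by, so it is left out of the sketch.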
|
||||
|
||||
## (3) The leading lever, in full - and the Marlin question, settled
|
||||
|
||||
**The user's specific hypothesis - "do what vLLM does = a Marlin-style W4A16 grouped MoE GEMM" - is
|
||||
REJECTED, by measurement and by feasibility.**
|
||||
|
||||
- **It is not where the gap is.** The MoE GEMM is a **-1.7 ms llama win.** A W4A16 Marlin MoE GEMM would
|
||||
make that bucket SLOWER (half-rate bf16 MMA on the ~27% GEMM bucket), not faster.
|
||||
- **Its entire intrinsic upside is the ~2% act-quant tax** (W4A16 has no activation quantize). That
|
||||
+2% ceiling is **smaller than the +1.9% the bit-exact 0025 re-graph already banked**, at vastly higher
|
||||
effort and with a precision change. And the act-quant tax is independently closeable bit-exactly by
|
||||
lever 4 (fuse it into the preceding op) without touching the GEMM.
|
||||
- **The scaffold does not help.** `paged/kernel/w4a16/marlin-w4a16.cu` is dense-only, Q4_0/Q4_K, with no
|
||||
grouped/MUL_MAT_ID path and no NVFP4 dequant. A real MoE Marlin is effectively a from-scratch port of
|
||||
`moe_wna16_marlin_gemm` (per-expert M-tiles, block-padded `moe_align` token-sort, stream-K over ragged
|
||||
segments, NVFP4->bf16 in-kernel dequant). vLLM only reaches the BW floor via cutlass-SM120 TMA +
|
||||
warp-specialized pipelining; the GB10 occupancy-only route the dense scaffold tried **plateaued at
|
||||
~9 TFLOPS / 178 t/s (~5x under MMQ)** and STOPPED at the occupancy wall (XOR-swizzle + deep cp.async
|
||||
collapse GB10 occupancy). Realistic outcome of an MoE port: **a net REGRESSION** on the 27% GEMM
|
||||
bucket. Multi-week, high-risk, DGX-only, no `ncu`, for a +2% ceiling. **Do not build it.**
|
||||
|
||||
**Why vLLM runs W4A16 at all:** not because it is better - because sm_121 (consumer Blackwell / GB10)
|
||||
has no working cutlass FP4 MoE cubins (vLLM whitelists only sm_100/103 datacenter Blackwell for native
|
||||
FP4 MoE; the engine literally warns it is falling back to "Weight-only FP4 ... Marlin kernel"). On GB10,
|
||||
W4A16 is HALF the FP4-MMA rate. **llama's native W4A4 FP4-MMA is the higher hardware tier; matching vLLM
|
||||
does NOT mean copying its W4A16 fallback.**
|
||||
|
||||
**Precision / gate (the brief's key nuance, assessed honestly):** the observation that W4A16 (bf16 acts)
|
||||
is a strict activation-precision UPGRADE over W4A4 (FP4 acts), with better KL-to-f32, is **correct but
|
||||
unmonetizable here.** (a) The current W4A4 MoE default is **already bit-exact to the f32 reference**
|
||||
(test-backend-ops MUL_MAT_ID 806/806, greedy md5 stable on both models) - you get no quality credit for
|
||||
being more precise than a default that already passes, and the precision-sensitive site is the
|
||||
gated-DeltaNet SSM *state* (a different op, addressed by the separate 0026 bf16-SSM opt-in), not the MoE
|
||||
GEMM. (b) W4A16 is **non-bit-exact vs the W4A4 default, so adopting it re-baselines every shipped md5
|
||||
gate** - a real cost for a +2% throughput ceiling that is itself likely negative. So the precision angle
|
||||
does not flip the verdict: it would be a precision upgrade nobody needs, bought with a slower,
|
||||
occupancy-hostile, gate-rebaselining kernel. The one genuinely precision-positive AND throughput-positive
|
||||
move that quantizes weights is **lever 3 (NVFP4 projections)** - and that is W4A16 on the DENSE linears
|
||||
(where it cuts weight-read BW), not on the experts.
|
||||
|
||||
## (4) HONEST VERDICT + recommended build
|
||||
|
||||
**VERDICT: We can essentially match vLLM on MoE decode (~94% bit-exact, ~96-97% with the projection
|
||||
quant, parity-or-better physically in reach), but NOT by doing "what vLLM does" in the sense the question
|
||||
implies. A Marlin/W4A16 grouped MoE GEMM is the wrong lever - the MoE GEMM is already a llama win and a
|
||||
W4A16 port would regress it. The 15% is bf16 dense-projection bandwidth + recurrence-gather plumbing +
|
||||
graph/overlap overhead + a 2% act-quant tax + router glue. Every piece is closeable on llama's existing
|
||||
native-FP4 expert path, mostly bit-exactly.**
|
||||
|
||||
**Recommended build (ship order, none of it a new MoE GEMM):**
|
||||
1. **`k_get_rows` SSM-state-gather fusion** - bit-exact, ~+5 ms, biggest single-kernel win, no rebaseline. **Do first.**
|
||||
2. **Extend CUDA-graph coverage + stream overlap** beyond 0025 - bit-exact, ~+7 ms combined, partly banked.
|
||||
3. **Fuse the W4A4 act-quant into the preceding RMSNorm/SiLU** - bit-exact, +3.3 ms, erases the act-quant tax (the only thing W4A16 would have bought) without W4A16.
|
||||
4. **NVFP4-quantize the bf16 GDN/attn projections + lm_head** - +6.5 ms (biggest bucket), bit-changing
|
||||
(re-gate md5; precision-UPGRADE, the same NVFP4-dense-quant move vLLM makes). Ship as default after
|
||||
re-gating, or as an opt-in if the md5 rebaseline is undesirable.
|
||||
|
||||
**Do NOT build:** the W4A16/Marlin grouped MoE GEMM (`paged/kernel/w4a16/` scaffold is dense-only and not
|
||||
reusable). Neither default nor opt-in: +2% ceiling < the already-banked bit-exact +1.9%, likely a net
|
||||
regression on the 27% GEMM bucket, multi-week high-risk, and it rebaselines every gate. The dense
|
||||
`w4a16-marlin` STOP transfers to MoE, and MORE strongly (the tiny-M decode shape is purely BW-bound, so
|
||||
the FP4-vs-bf16 tier is a wash that the weight-read floor erases - leaving only the half-rate downside).
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
# LEVER 4 (scope) - NVFP4-quantize the still-bf16 MoE GDN/attn projections (+lm_head), the +6.5 ms bucket
|
||||
|
||||
Label `L4-scope`, READ-ONLY (no GPU). This scopes lever 4 - the single biggest *bucket* in the table
|
||||
above (**Dense projections +glue, +6.5 ms**) and the only remaining MoE lever with a real, measurable
|
||||
gain after levers 2 and 3 both came back FLAT measurement-STOPs (no patch, no commit - see
|
||||
`LEVER2_GRAPH_COVERAGE_RESULTS.md`, `LEVER3_ACTQUANT_FUSION_RESULTS.md`, `LEVERS_23_PROGRESS.md`). Lever 4
|
||||
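is **bit-changing** (re-gates md5; gate on KL-to-f32, not bit-exact md5). Lever numbers here follow the
recommended ship order (3 = act-quant fusion, 4 = NVFP4 projections); the build-order table in section (2)
lists the same two levers in the opposite order. Below: the root cause, the
path, effort, the precision/KL story, the expected gain, and the default-vs-opt-in recommendation.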
|
||||
|
||||
## Root cause: the MoE GGUF's projections are bf16 only because of its quant PROVENANCE
|
||||
|
||||
The "still-bf16 GDN/attn projections" are **MoE-specific, and they are an accident of how the MoE
|
||||
checkpoint was quantized - not a llama limitation.** The two GGUFs have different quant lineages:
|
||||
|
||||
- **Dense `q36-27b-nvfp4` (unsloth, native-Blackwell FP4, 304 NVFP4 tensors):** the GDN/attn projections
|
||||
ARE already NVFP4. Proven directly - `DECODE_PARITY_EXPLORE.md:594` shows the dense `ssm_out`
|
||||
(GDN out-projection) running as an **FP4 GEMV/MMQ** (`mul_mat_vec_q`/`mul_mat_q<NVFP4>`), and the
|
||||
in_proj runs FP4 MMQ at M=128. This is exactly why the **dense decode is already at 96.6% of vLLM** -
|
||||
it has essentially no bf16-projection bucket left.
|
||||
- **MoE `q36-35b-a3b-nvfp4` (nvidia modelopt, 241 NVFP4 tensors):** modelopt quantized the **256-expert
|
||||
FFN** tensors to NVFP4 (the 241 count is dominated by the packed grouped-expert tensors) but **left the
|
||||
GDN/attn linear projections in BF16** - `in_proj_qkvz`, `in_proj_ba`, the GDN `out_proj`/`ssm_out`, and
|
||||
the full-attn `attn_q/k/v/output`. Those are exactly the **bf16 nvjet/cutlass projection GEMMs** seen in
|
||||
the MoE decode top-kernel list (8.2 `nvjet 192x136` + 4.5 `cutlass::Kernel2` + 4.1 `nvjet 128x64`)
|
||||
plus the 2.9 ms `convert_unary` bf16<->f32 glue = the **20.3 ms projection bucket** vs vLLM's 13.8 ms
|
||||
(vLLM runs the same projections, and on this modelopt checkpoint even its lm_head, as NVFP4-Marlin -
|
||||
see its `2.8 marlin dense (lm_head NVFP4)` kernel).
|
||||
|
||||
**=> Lever 4 is overwhelmingly a MoE-GGUF move:** bring the MoE GGUF's GDN/attn projections to the SAME
|
||||
NVFP4 the DENSE GGUF already ships and that vLLM already runs on the identical weights. It is not a new
|
||||
capability - the dense GGUF is the existence proof that llama runs and ships these projections in NVFP4.
|
||||
|
||||
## (1) THE PATH + EFFORT
|
||||
|
||||
Two ways to get the projection weights into NVFP4:
|
||||
|
||||
- **PATH A - offline re-quantize to a NEW GGUF variant (RECOMMENDED, = exactly what vLLM does).** Re-run
|
||||
`llama-quantize` on the MoE source with the `--tensor-type` selector EXPANDED to also capture the
|
||||
GDN/attn projection tensor-name patterns that the modelopt checkpoint left bf16 (the GDN `in_proj_*` /
|
||||
`out_proj`/`ssm_out` and full-attn `attn_q/k/v/output` weights), producing e.g.
|
||||
`q36-35b-a3b-nvfp4-projq.gguf`. **ZERO kernel/runtime code:** NVFP4 weights already flow end-to-end -
|
||||
the loader auto-creates the per-tensor NVFP4 sidecar scales when `type == GGML_TYPE_NVFP4`
|
||||
(`llama-model.cpp:1459`), and the projection GEMMs then route to the already-tuned `mul_mat_q<NVFP4>`
|
||||
(patch 0017) instead of cublas/nvjet. The dense GGUF is the live proof this path works and gates clean.
|
||||
**Effort: LOW-MEDIUM** - the only "build" is the quantize recipe + a KL gate harness + a gallery/index
|
||||
entry + a RELEASE note. Risk items: (i) confirm the exact bf16 tensor list with a CPU `gguf_dump`
|
||||
(metadata-only, no GPU); (ii) NVFP4 needs the contraction dim divisible by the 16-elt block - any
|
||||
projection whose row dim is not a multiple of 16 stays bf16 (or needs padding), which is the most
|
||||
likely reason a given tensor was left bf16 and must be checked per-tensor; (iii) the lm_head decision
|
||||
(below).
|
||||
- **PATH B - runtime quantize bf16->NVFP4 at load.** Convert the bf16 projection weights in-memory at
|
||||
model load (one-time ue4m3 per-block scale-search), GGUF unchanged. **Worse choice:** needs new
|
||||
load-time quant code (MEDIUM), and it *silently* changes the output of an existing GGUF for current
|
||||
users (an implicit, non-opt-in precision change) - strictly inferior to an explicit new artifact.
|
||||
Only attractive if shipping a new GGUF is somehow impossible; it is not.
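Risk item (ii) under PATH A is mechanical once the tensor list is known. A hedged sketch of the per-tensor check: the tensor names reuse the patterns mentioned above, but the contraction sizes are made-up placeholders, not values read from the real GGUF.

```python
NVFP4_BLOCK = 16  # elements per NVFP4 scale block, per the text above


def split_by_nvfp4_eligibility(contraction_dims):
    """Partition tensors into those whose contraction dim fits whole NVFP4 blocks.

    contraction_dims maps tensor name -> contraction (K) dimension.
    Returns (quantize, keep_bf16) name lists, in input order.
    """
    quantize, keep_bf16 = [], []
    for name, k in contraction_dims.items():
        target = quantize if k % NVFP4_BLOCK == 0 else keep_bf16
        target.append(name)
    return quantize, keep_bf16


# Hypothetical shapes for illustration only; take the real ones from a gguf dump.
candidates = {
    "in_proj_qkvz": 2048,
    "in_proj_ba": 2048,
    "ssm_out": 4096,
    "attn_q": 2048,
    "odd_projection": 2050,  # deliberately not a multiple of 16
}

quantize, keep_bf16 = split_by_nvfp4_eligibility(candidates)
print("re-quantize to NVFP4:", quantize)
print("stays bf16 (or needs padding):", keep_bf16)
```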
|
||||
|
||||
## (2) PRECISION / KL story (honest)
|
||||
|
||||
Quantizing the projection WEIGHTS bf16 -> NVFP4 (e2m1 + per-16 ue4m3 scale) is a per-weight precision
|
||||
**downgrade vs the current bf16** on those specific tensors (it adds ~4-bit weight-quant error), and -
|
||||
because they route to the W4A4 MMQ path - it also FP4-quantizes those projections' activations. It is
|
||||
NOT a precision upgrade over bf16; it is the **same W4A4/W4A16-class move vLLM already makes on these
|
||||
same projections**, so at matched precision it is apples-to-apples with vLLM. Non-bit-exact => **re-gate
|
||||
on KL-to-f32, not md5.**
|
||||
|
||||
**KL estimate: should PASS with margin.** Three independent reasons: (a) the dense GGUF ALREADY ships
|
||||
these GDN/attn projections in NVFP4 and passes its greedy gate (`5951a5b4...`), so the move is
|
||||
empirically proven shippable on this architecture; (b) the 256 experts already run W4A4 NVFP4 and pass
|
||||
(test-backend-ops MUL_MAT_ID 806/806, greedy md5 stable) - the GDN/attn projections are the same class of
|
||||
linear op and arguably less sensitive than the expert FFN; (c) this is a per-step, **non-accumulating**
|
||||
weight-quant error - structurally unlike the bf16-GDN-*state* experiment (`BF16_SSM_STATE_RESULTS.md`)
|
||||
that FAILED the KL gate (KLD 0.06-0.17, ~10% argmax flips) because that error *accumulated* through the
|
||||
recurrence. Expect KLD-to-f32 well under that failed-state threshold and a sub-percent PPL delta (the
broader NVFP4-dense ~+4.8% PPL-vs-Q4_K figure is for full-model NVFP4; here only a minority of residual
projection tensors move). **The one genuinely risky tensor is lm_head** (logit-direct;
`OTHER_PATHS_INVESTIGATION.md` flags that NVFP4-lm_head can flip the greedy argmax). For the MoE, quantizing lm_head is
|
||||
*fair* (vLLM's modelopt checkpoint already runs lm_head NVFP4), so include it but gate it explicitly on
|
||||
argmax-agreement; if it flips the greedy probe, keep lm_head bf16 and bank only the GDN/attn portion.
|
||||
Recommended gate: **KLD-to-f32 < the bf16-state failure floor (~0.06) AND PPL delta < ~1% vs the current
|
||||
bf16-projection GGUF AND zero greedy-argmax flips on the -n 48 probe.**
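
A minimal Python sketch of the recommended gate, under stated assumptions: `kl_divergence` and `passes_gate` are hypothetical helper names (not part of llama.cpp or of any gate script in this tree), the KLD is taken as KL(reference || candidate) over softmaxed logits, and the default thresholds are the ones quoted in the gate above.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one logit vector.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, test_logits):
    # KL(p_ref || p_test) for one token position (hypothetical helper).
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def passes_gate(mean_kld, ppl_ref, ppl_test, argmax_flips,
                kld_max=0.06, ppl_delta_max=0.01):
    # All three conditions must hold: KLD under the bf16-state failure floor,
    # relative PPL delta under ~1%, and zero greedy-argmax flips on the probe.
    ppl_delta = (ppl_test - ppl_ref) / ppl_ref
    return mean_kld < kld_max and ppl_delta < ppl_delta_max and argmax_flips == 0

if __name__ == "__main__":
    ref = [2.0, 1.0, 0.5, -1.0]
    test = [1.9, 1.1, 0.5, -1.0]
    print("KLD for one position:", kl_divergence(ref, test))
    print("gate:", passes_gate(0.01, 10.0, 10.05, 0))
```

The predicate is deliberately a conjunction: a clean KLD does not rescue a candidate that flips the greedy argmax.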
|
||||
|
||||
## (3) EXPECTED MoE GAIN
|
||||
|
||||
Closing the +6.5 ms projection bucket = bringing llama's 20.3 ms projection bucket down to vLLM's
|
||||
~13.8 ms (NVFP4 cuts the projection weight-read ~4x - 2.37 GB-class bf16 -> ~0.56 B/wt - and the W4A4
|
||||
MMQ path stays in the quantized domain, **erasing the 2.9 ms `convert_unary` bf16<->f32 glue**). llama's
|
||||
native FP4-MMA is faster per-FLOP than vLLM's W4A16-Marlin and these projections are BW-bound, so llama
|
||||
lands at parity-or-slightly-better, same as the expert GEMM (where W4A4 beat Marlin by 1.7 ms).
|
||||
|
||||
- With **lm_head also NVFP4** (fair on this modelopt MoE, vLLM did it): full ~**+6.5 ms** =>
|
||||
step 169.8 -> ~163.3 ms => ~785 t/s.
|
||||
- With **lm_head kept bf16** (conservative): ~**+4 to +5 ms** (the GDN/attn projections + the convert
|
||||
glue; lm_head's ~bf16 GEMM stays) => step 169.8 -> ~165-166 ms => ~768-775 t/s.
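
The step-time to t/s conversions can be reproduced with a short Python sketch. It assumes each of the 128 parallel sequences emits one token per decode step (an assumption, though it matches the 142.0 ms / 901 t/s-equiv pairing this note uses for vLLM); the 165.5 ms entry is simply the midpoint of the quoted 165-166 ms range.

```python
NPL = 128  # assumed: npl128, one token per sequence per decode step

def tokens_per_second(step_ms, npl=NPL):
    # Aggregate decode throughput for a given per-step latency.
    return npl / (step_ms / 1000.0)

if __name__ == "__main__":
    cases = [
        ("current", 169.8),
        ("lm_head NVFP4", 163.3),
        ("lm_head bf16 (midpoint)", 165.5),
        ("vLLM", 142.0),
    ]
    for label, step_ms in cases:
        print(f"{label:>24}: {step_ms:6.1f} ms/step -> {tokens_per_second(step_ms):6.1f} t/s")
```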
|
||||
|
||||
In MOE_GAP frame (vLLM 142.0 ms / 901 t/s-equiv): **MoE moves from 86.3% (post-lever-1 / 0028) toward
|
||||
~89-91% of vLLM** (full bucket) or ~88% (lm_head bf16). This is the **largest single banked MoE gain
|
||||
available** - lever 1 (gather) shipped, levers 2 and 3 banked nothing, and the MoE GEMM is already a
|
||||
llama win - so after lever 4 the residual is just router/glue + the structural cross-stream-overlap and
|
||||
the ~4.2 ms host bubble (reachable only via a paged-attn host-pipeline edit, not a quant or graph knob).
|
||||
|
||||
## (4) RECOMMENDATION: ship as a SEPARATE OPT-IN gallery GGUF variant (KL-gated), not a re-gated default
|
||||
|
||||
**Ship lever 4 as a distinct, opt-in gallery variant** (e.g. `q36-35b-a3b-nvfp4-projq` / `-w4a4full`),
|
||||
**not** as a silent replacement of the default MoE GGUF. Rationale:
|
||||
|
||||
1. The current default MoE GGUF is **md5-bit-exact-gated** (`07db32c2...` shipped); making lever 4 the default
forces a permanent md5 rebaseline of every gate - the hard line this whole track has held (levers 2+3
STOPPED rather than cross it). A new artifact sidesteps that for users who chose the f32-lineage GGUF.
|
||||
2. Path A produces a **new GGUF anyway** (offline re-quant), so a separate gallery entry costs nothing
|
||||
extra and makes the throughput<->precision choice explicit and reversible.
|
||||
3. The gain (~+4-6.5 ms, ~86% -> ~88-91% of vLLM) is real but modest - not worth forcing a precision
|
||||
change on default-path users.
|
||||
4. **Promotion path:** because lever 4 only brings the MoE GGUF to the SAME NVFP4 the dense GGUF already
|
||||
ships *as its default* and that vLLM already runs, a clean KL gate (KLD << 0.06, PPL delta < ~0.5%,
|
||||
zero argmax flips) is a strong case to PROMOTE the variant to the default MoE GGUF in a later release.
|
||||
Ship opt-in first to preserve the bit-exact default and avoid a forced rebaseline; promote if the
|
||||
gate is clean and lm_head NVFP4 holds.
|
||||
|
||||
**Effort summary:** LOW-MEDIUM, dominated by the KL gate + gallery wiring, NOT code (zero new kernel; the
|
||||
NVFP4 weight path - loader sidecar scales + tuned `mul_mat_q<NVFP4>` - is already in tree and proven by
|
||||
the dense GGUF). Highest-ROI remaining MoE lever. **Do first among remaining MoE work**, ahead of any
|
||||
non-bit-exact recurrence-plumbing or the rejected W4A16/Marlin GEMM.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
> **SUPERSEDED:** the lever-4 scope above was optimistic and PRE-GATE. The L4 KL gate FAILED
|
||||
> (+6.15-6.51% PPL, see `LEVER4_PROJNVFP4_RESULTS.md`) and the premise was wrong (vLLM keeps these
|
||||
> projections BF16 too). Lever 4 is REJECTED - do NOT ship. See the FINAL section below.
|
||||
|
||||
---
|
||||
|
||||
# RESIDUAL-ASSESS (FINAL, concludes the hunt) - convert-glue + bf16-GEMM verdicts, the bit-exact MoE ceiling
|
||||
|
||||
Label `residual-assess`, DGX GB10 (sm_121). After lever 1 shipped (0028, MoE 86.3% of vLLM @npl128,
|
||||
bit-exact), levers 2+3 flat, lever 4 REJECTED (KL-gate FAIL, AND vLLM keeps the same projections bf16),
|
||||
and lever 5 flat for MoE (host-side, off the compute-bound critical path; dense gets +0.41%), this is the
|
||||
final honest assessment of the two remaining sub-levers inside the 20.3-vs-13.8 ms projection bucket.
|
||||
Both are **bit-CHANGING or at-the-BW-floor.** The hunt is DONE.
|
||||
|
||||
## CORRECTION that reframes the projection bucket
|
||||
|
||||
The body above assumed **vLLM runs the GDN/attn projections as NVFP4-Marlin.** FALSE (confirmed by the L4
|
||||
gate). vLLM runs the **same nvidia-modelopt checkpoint** as the GGUF, which keeps `in_proj_qkvz`,
|
||||
`in_proj_ba`, `out_proj`, `attn_gate`, and full-attn `attn_q/k/v/output` in **BF16**. llama and vLLM run
|
||||
these projections at the **same precision (bf16).** The +6.5 ms projection-bucket delta is therefore NOT
|
||||
a precision/quant gap - it is (a) llama's f32-residual-stream convert tax and (b) bf16-GEMM kernel /
|
||||
round-trip efficiency, both at matched bf16 precision.
|
||||
|
||||
## (1) convert-glue verdict (3.24 ms/step measured): NOT bit-exact eliminable
|
||||
|
||||
Empirical split (`moe_dec` nsys, per-step over 43 decode steps):
|
||||
- `convert_unary<float,bf16>` (input, f32 act -> bf16): **1.73 ms/step**, 186 calls/step
|
||||
- `convert_unary<bf16,float>` (output, bf16 -> f32): **1.52 ms/step**, 186 calls/step (equal count = every
|
||||
bf16 projection round-trips)
|
||||
|
||||
Source root cause (`ggml/src/ggml-cuda/ggml-cuda.cu:1663-1690`, the `src0->type == BF16` cuBLAS path):
|
||||
ggml converts f32 activations to bf16, runs `cublasGemmEx` bf16xbf16 with **CUBLAS_COMPUTE_32F** but
|
||||
writes the result to a **bf16** buffer (`dst_bf16`, `CUDA_R_16BF`), then widens bf16 -> f32. The f32
|
||||
accumulator is **rounded to bf16 and then widened back** - it drops 16 of the 23 stored mantissa bits, and that
bf16-rounded value feeds the f32 residual stream.
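
To make the precision boundary concrete, here is a small Python sketch that emulates the f32 -> bf16 -> f32 round-trip on a single value. It is an illustration only, not the ggml or cuBLAS code: `bf16_round_trip` is a hypothetical helper, it assumes round-to-nearest-even, and it does not handle NaN or infinity.

```python
import struct

def f32_to_bits(x):
    # Reinterpret a Python float (rounded to f32) as its 32-bit pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]

def bf16_round_trip(x):
    # Round an f32 value to bf16 (RNE), then widen back to f32.
    # bf16 keeps the top 16 bits of the f32 pattern: sign, 8 exponent bits,
    # and 7 mantissa bits. The low 16 mantissa bits are rounded away.
    bits = f32_to_bits(x)
    lsb = (bits >> 16) & 1                      # parity of the kept part, for ties
    rounded = (bits + 0x7FFF + lsb) & 0xFFFFFFFF
    return bits_to_f32(rounded & 0xFFFF0000)

if __name__ == "__main__":
    acc = bits_to_f32(f32_to_bits(0.1))         # stand-in for an f32 accumulator
    out = bf16_round_trip(acc)
    print("f32 accumulator :", acc)
    print("after round-trip:", out)
    print("changed         :", out != acc)
```

Writing the accumulator to an f32 destination would keep `acc` instead of `out`, which is why that change moves the greedy md5 even though it is numerically an upgrade.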
|
||||
|
||||
- The **output round-trip is load-bearing for the shipped numerics.** The fp16-fp32-compute path 40 lines
|
||||
down (`:1729`, `dst CUDA_R_32F`) proves cuBLAS CAN write the f32 accumulator directly - so the bf16
|
||||
output write+convert is a removable ggml inefficiency. BUT removing it (f32-direct output) changes the
|
||||
value from "bf16-rounded" to "full-f32" => greedy md5 (`07db32c2`) re-baselines. It is a **precision
|
||||
boundary (an upgrade), exactly like lever 4.** NOT bit-exact.
|
||||
- The **input convert is intrinsic** to a bf16 GEMM (cuBLAS needs bf16 inputs; ggml's residual stream is
|
||||
f32). The only bit-exact move is to fuse the f32->bf16 cast into the producing op's epilogue (same RNE
|
||||
rounding, one fewer launch) - but that is per-site ggml graph surgery for a sub-1.7 ms launch ceiling,
|
||||
and it is **subsumed by the (rejected) lever-4 move**: NVFP4-quantizing the weights routes the
|
||||
projection to `mul_mat_q<NVFP4>` (W4A4) and deletes the entire bf16 cuBLAS path - input convert, GEMM,
|
||||
output convert - in one shot.
|
||||
- vLLM pays ~0 here because it runs an **end-to-end bf16 residual stream** (no f32 intermediate). Matching
|
||||
that = converting llama's residual stream to bf16 = a global precision change, md5 rebaseline. Also not
|
||||
bit-exact.
|
||||
|
||||
**Verdict: bit-exact-eliminable = NO.** The f32<->bf16 round-trip is load-bearing for the current md5 (the
|
||||
bf16-rounded output IS the shipped value). Every way to remove it (f32-direct GEMM output, bf16 residual
|
||||
stream, or NVFP4 weights) is bit-changing. The one bit-exact sliver (fuse the input cast into the
|
||||
producer) is ~1.7 ms ceiling, high per-site effort, and redundant with lever 4. (Aside: the f32-direct
|
||||
GEMM output is a genuine upstreamable ggml win - faster AND more precise - but it rebaselines md5, so it
|
||||
is off the bit-exact table for this hunt.)
|
||||
|
||||
## (2) bf16 projection GEMM verdict (17.27 ms/step measured): BW-bound at the floor, no kernel lever
|
||||
|
||||
Per-step bf16-projection GEMM (nvjet cuBLASLt + cutlass bf16, `moe_dec` nsys): **17.27 ms/step, 225
|
||||
calls/step.** Roofline at the M=128 decode shape:
|
||||
- Arithmetic intensity ~= 2*M FLOP / 2 bytes-per-weight = **M = 128 FLOP/byte** (the weight read
|
||||
dominates; activations/output negligible at M=128).
|
||||
- GB10: LPDDR5x unified BW ~= **273 GB/s**; bf16 tensor-core peak >= ~250 TFLOPS => ridge point ~=
|
||||
250e12 / 273e9 ~= **>900 FLOP/byte.** 128 << 900 => **memory-bandwidth-bound by ~7x.**
|
||||
- Achieved: 17.27 ms at 273 GB/s = **~4.7 GB of bf16 projection weights streamed per step** - i.e. the
|
||||
GEMM moves the weight bytes at ~full LPDDR5x bandwidth. **It is at the BW floor.**
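
The roofline arithmetic above can be checked with a few lines of Python. The inputs are the figures quoted in the bullets (M=128, 2 bytes per bf16 weight, ~273 GB/s, ~250 TFLOPS, 17.27 ms/step); the function names are illustrative only.

```python
M = 128                    # decode batch (tokens per GEMM)
BYTES_PER_WEIGHT = 2       # bf16
BANDWIDTH_BPS = 273e9      # quoted LPDDR5x unified bandwidth
PEAK_FLOPS = 250e12        # quoted lower bound on bf16 tensor-core peak
GEMM_TIME_S = 17.27e-3     # measured bf16-projection GEMM time per step

def arithmetic_intensity(m=M, bytes_per_weight=BYTES_PER_WEIGHT):
    # 2*M FLOP (multiply + add for M tokens) per weight read.
    return 2 * m / bytes_per_weight

def ridge_point(peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH_BPS):
    # Intensity at which compute and memory time are equal.
    return peak_flops / bandwidth

def bytes_streamed(time_s=GEMM_TIME_S, bandwidth=BANDWIDTH_BPS):
    # Bytes moved if the kernel ran at full bandwidth for the whole time.
    return time_s * bandwidth

if __name__ == "__main__":
    ai = arithmetic_intensity()
    ridge = ridge_point()
    print(f"arithmetic intensity: {ai:.0f} FLOP/byte")
    print(f"ridge point         : {ridge:.0f} FLOP/byte")
    print(f"bandwidth-bound by  : {ridge / ai:.1f}x")
    print(f"bytes per step      : {bytes_streamed() / 1e9:.2f} GB")
```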
|
||||
|
||||
The nvjet kernels are `tmaAB` (TMA-streamed on both operands) - the optimal Blackwell weight-streaming
|
||||
access pattern; vLLM's cutlass does the same and reads the **same bf16 bytes.** A cutlass swap cannot beat
|
||||
the byte floor. The only way faster is **fewer weight bytes = quantize** (lever 4, ~4x fewer bytes) -
|
||||
bit-changing AND rejected on quality (+6% PPL) AND not even a vLLM-parity gap. The residual ~3.5 ms of the
|
||||
llama-vs-vLLM GEMM-bucket delta traces to llama's extra `dst_bf16` write+read round-trip traffic (the
|
||||
convert glue of verdict 1), not a worse GEMM kernel.
|
||||
|
||||
**Verdict: at the bandwidth floor; no bit-exact (nor even same-precision) kernel lever exists.** nvjet
|
||||
already streams the weights near-optimally.
|
||||
|
||||
## (3) The bit-exact MoE ceiling, and the irreducible residual
|
||||
|
||||
| MoE lever | status | bit-exact? | MoE gain |
|-----------|--------|:----------:|----------|
| 1 - recurrent-state gather fusion (0028) | **SHIPPED** | yes | banked -> 86.3% of vLLM |
| 2 - graph coverage / overlap | flat | yes | ~0 |
| 3 - act-quant fusion | flat | yes | ~0 |
| 5 - block-table within-step cache | flat for MoE | yes | ~0 (host off compute-bound path; dense +0.41%) |
| 4 - NVFP4 projections | REJECTED | no | +6% PPL, not a vLLM gap |
| convert-glue elimination | this assess | **no** (precision boundary) | bit-changing only |
| bf16-GEMM kernel | this assess | **no** (BW floor) | none |
|
||||
|
||||
**Realistic bit-exact MoE ceiling = ~86-88% of vLLM @npl128. The shipped state (lever 1, 86.3%) is
|
||||
essentially AT it.** Lever 5 adds nothing to MoE. No clean bit-exact MoE lever remains.
|
||||
|
||||
**The irreducible ~12-14% residual to vLLM is structural, not a missing optimization:**
|
||||
1. **f32-residual-stream convert tax (~3.2 ms/step)** - ggml runs an f32 graph and casts per bf16
|
||||
projection; vLLM runs bf16 end-to-end. Removing it is a precision change.
|
||||
2. **bf16-GEMM BW floor + round-trip traffic (~3.5 ms/step)** - both engines at the LPDDR5x byte floor on
|
||||
bf16 weights; the delta is the round-trip traffic (= item 1, bit-changing).
|
||||
3. **Recurrence-plumbing remainder** - mostly banked by lever 1; the core SSM kernel is already a llama
|
||||
win.
|
||||
4. **Between-replay host loop + graph/overlap bubble** - sampling needs logits between graph replays;
|
||||
irreducible at this batch shape.
|
||||
|
||||
## CONCLUSION: the MoE-parity hunt is DONE
|
||||
|
||||
The MoE is at its bit-exact ceiling. The two heaviest MoE compute kernels (the gated-DeltaNet SSM core and
|
||||
the NVFP4 expert grouped GEMM) are **already llama wins**, so there is no arithmetic gap to close. The
|
||||
remaining 12-14% is the f32-vs-bf16 graph-precision tax, the bf16-weight BW floor, and the irreducible
|
||||
host loop - none of which is a clean bit-exact lever, and the one bit-changing option (quantize the
|
||||
projections) is rejected on quality and is not even a vLLM-parity gap. **No one-more-lever for MoE.** The
|
||||
only clean win left in the whole track is DENSE (+0.41% from lever 5), gated behind first resolving the
|
||||
pre-existing paged-MoE baseline md5 drift (paged `8cb0ce23` vs canonical `07db32c2`) the L5 finish flagged.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,220 +0,0 @@
|
||||
# Durable scope: grouped FP4-MMA MoE GEMM for ggml CUDA on GB10 (sm_121)
|
||||
|
||||
Build-ready plan. **Not implemented in this workflow** (large kernel work). This
|
||||
document scopes the durable path to match or beat vLLM MoE grouped-GEMM efficiency
|
||||
on GB10 for the Qwen3-30B-A3B-class mxfp4 MoE, and records the single honest
|
||||
finding that re-shapes the whole effort.
|
||||
|
||||
Hardware: NVIDIA GB10 (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
|
||||
LPDDR5X ~273 GB/s. Model: Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
|
||||
(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`,
|
||||
HEAD at patch 0013), `build-cuda` sm_121.
|
||||
|
||||
## TL;DR (the honest reframe)
|
||||
|
||||
**The grouped GEMM the mission scoped to build from scratch already exists in
|
||||
upstream ggml, and it already runs on GB10 for mxfp4.** For mxfp4 experts on
|
||||
sm_121 `ggml_cuda_should_use_mmq()` returns true (`turing_mma_available`), so
|
||||
MUL_MAT_ID takes the **grouped mmq path**, which already contains both vLLM
|
||||
building blocks:
|
||||
|
||||
1. a moe_align / token-sort-by-expert (`mmid.cu` `mm_ids_helper`:
|
||||
count -> warp-scan/cumsum -> scatter into expert-sorted contiguous buffers),
|
||||
2. a **single persistent stream-k grouped FP4-MMA GEMM** (one `mul_mat_q` launch;
|
||||
grid flattened into kbc-continuous space over expert x col-tile x row-tile x
|
||||
k-block; native FP4 MMA via `block_fp4_mmq` under `BLACKWELL_MMA_AVAILABLE`).
|
||||
|
||||
The per-expert host-side row-gather loop in `ggml-cuda.cu`
|
||||
`ggml_cuda_mul_mat_id()` (~L2632-2790) - the path the mission's root-cause
|
||||
analysis describes as "the cliff" - is a **fallback only reached when
|
||||
`should_use_mmq()==false`** (f16/bf16 experts, non-Blackwell). It is **never the
|
||||
GB10 mxfp4 path.**
|
||||
|
||||
Consequence: the "npl128 MoE cliff" does not exist on the current dev HEAD.
|
||||
Re-measured batched-bench decode (`S_TG` t/s) on the mxfp4 MoE rises monotonically
|
||||
`85 / 278 / 637 / 950 / 1306 / 1771` at npl `1 / 8 / 32 / 64 / 128 / 256`. The
|
||||
original `253/505/830/620` cliff was a real high-batch regression that has since
|
||||
been **fixed upstream** (FP4-native grouped mmq + MoE stream-k balancing), not a
|
||||
batched-bench artifact.
|
||||
|
||||
**Therefore the durable work is NOT "port moe_align + a grouped GEMM."** It is a
|
||||
**surgical fix to the one place ggml diverges from vLLM: the M-tile (token-tile)
|
||||
sizing heuristic.** This document scopes that delta, plus the optional
|
||||
block-padded align, plus the parity gate and phased plan. It also records what is
|
||||
intentionally NOT built and why (the W4A16 occupancy wall).
|
||||
|
||||
## The one structural gap: M-tile sizing
|
||||
|
||||
`mul_mat_q_case` / `launch_mul_mat_q` pick `mmq_x` (the token/M tile) by
|
||||
**minimizing** `ntiles_x = ceil(ncols_max / mmq_x)` over the **aggregate** token
|
||||
count (`ncols_max = ne12`). On Blackwell `get_mmq_x_max = 128`, so the heuristic
|
||||
always selects the **largest** `mmq_x` that fits shared memory. vLLM's
|
||||
CUTLASS/Triton fused_moe does the **opposite**: a small tuned `BLOCK_SIZE_M`
|
||||
(typ. 16/32/64), padded **per expert**.
|
||||
|
||||
ggml then applies its over-large `mmq_x` **per expert**. In MoE decode the token count
per expert is tiny - Qwen3-30B-A3B top-8 of 128: at npl64 ~512 assignments over
|
||||
~126 activated experts ~= 4 tok/expert; at npl128 ~1024 over ~128 ~= 8 tok/expert.
|
||||
So each expert's single M-tile of width 128 is **3-6% filled** -> ragged tiny-M
|
||||
tiles run a dense-GEMM-tuned config, wasting MMA M-throughput, and (with
|
||||
`need_check`) every expert runs as a masked partial tail.
|
||||
|
||||
The FP4 MMA N-fragment (`tile_C::J`) is 8, so the **ideal M-tile ~= tokens/expert
|
||||
(~8)**, 16x smaller than the 128 ggml picks. This mismatch is the durable gap.
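
The fill figures follow directly from the routing arithmetic. A short Python sketch, with hypothetical helper names and the activated-expert counts taken from the paragraph above:

```python
def tokens_per_expert(npl, top_k, n_active_experts):
    # Average number of token assignments landing on each activated expert.
    return npl * top_k / n_active_experts

def tile_fill(npl, top_k, n_active_experts, mmq_x=128):
    # Fraction of an mmq_x-wide token tile that carries real tokens.
    return min(1.0, tokens_per_expert(npl, top_k, n_active_experts) / mmq_x)

if __name__ == "__main__":
    for npl, n_active in [(64, 126), (128, 128)]:
        tpe = tokens_per_expert(npl, 8, n_active)
        for mmq_x in (128, 8):
            fill = tile_fill(npl, 8, n_active, mmq_x)
            print(f"npl{npl}: {tpe:.1f} tok/expert, mmq_x={mmq_x}: {fill:.1%} filled")
```

With `mmq_x=8` the same density gives a tile that is roughly half to completely full, which is the motivation for the smaller tile.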
|
||||
|
||||
Critically for GB10: at tokens/expert <= 8 there is exactly **one col-tile per
|
||||
expert**, so a smaller `mmq_x` causes **no extra weight re-read** (weight rows are
|
||||
re-read only across multiple col-tiles, of which there is one) while it **lowers
|
||||
shared-mem footprint and raises occupancy** - strictly aligned with the GB10
|
||||
occupancy lessons.
|
||||
|
||||
## What already exists (reuse, do NOT rebuild)
|
||||
|
||||
Engine files on DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`:
|
||||
|
||||
- **[A] moe_align / scatter** = `mmid.cu` `mm_ids_helper`. One CUDA block per
|
||||
expert (`gridDim.x = n_experts`); warp counts tokens routed to this expert,
|
||||
warp-scan for the compaction index, scatters into `ids_src1` (column gather
|
||||
permutation, expert-sorted contiguous), `ids_dst` (output scatter), and writes
|
||||
`expert_bounds[expert] = prefix start`, `expert_bounds[n_experts] = total`.
|
||||
This **is** count -> cumsum -> permute; `expert_bounds` is the analogue of
|
||||
vLLM's `num_tokens_post_padded` boundaries. No `-1` pad today because segments
|
||||
are exact (not block-padded).
|
||||
- **[B] persistent grouped FP4 GEMM** = `mmq.cuh` `mul_mat_q` stream-k
|
||||
(kernel ~L3542, `process_tile` ~L3447, launch ~L3943, case-select ~L4055).
|
||||
Single launch, fixed grid (`nsm` CTAs, or `ntiles` when >=90% tile efficiency).
|
||||
Each CTA walks a contiguous `kbc` slice of (expert `zt` via `expert_bounds`,
|
||||
col-tile `jt`, row-tile `it`, k-block) space; the weight row-tile (`mmq_y=128`
|
||||
x K) is loaded once per col-tile in the `process_tile` k-loop; empty col-tiles
|
||||
past `col_diff` are SKIPPED by advancing `kbc += blocks_per_ne00`; a
|
||||
`stream_k_fixup` pass recombines split tiles.
|
||||
- **[C] native FP4-MMA expert weights** = `block_fp4_mmq` + `MMQ_MMA_TILE_X_K_FP4`
|
||||
(== Q8_1 tile, skew-pad +4) under `BLACKWELL_MMA_AVAILABLE`;
|
||||
`quantize_mmq_fp4_cuda` quantizes activations to the q8-style y-layout **with
|
||||
the `ids_src1` gather fused** (one pass, no separate row-copy).
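
As a reading aid for [A], here is a CPU-side Python sketch of the count -> cumsum -> scatter idea. It is not the `mm_ids_helper` kernel: the function name, the input layout (`ids[t]` is the list of experts token `t` routes to), and the `(token, slot)` output pairs are assumptions chosen for illustration.

```python
def moe_align(ids, n_experts):
    # 1) count: how many (token, slot) assignments land on each expert
    counts = [0] * n_experts
    for row in ids:
        for expert in row:
            counts[expert] += 1

    # 2) cumsum: expert_bounds[e] is the start of expert e's segment,
    #    expert_bounds[n_experts] is the total number of assignments
    expert_bounds = [0] * (n_experts + 1)
    for e in range(n_experts):
        expert_bounds[e + 1] = expert_bounds[e] + counts[e]

    # 3) scatter: write each assignment into its expert's contiguous segment
    cursor = list(expert_bounds[:-1])
    sorted_assignments = [None] * expert_bounds[-1]
    for token, row in enumerate(ids):
        for slot, expert in enumerate(row):
            sorted_assignments[cursor[expert]] = (token, slot)
            cursor[expert] += 1
    return sorted_assignments, expert_bounds

if __name__ == "__main__":
    ids = [[0, 2], [2, 1], [0, 1]]  # 3 tokens, top-2 routing over 3 experts
    order, bounds = moe_align(ids, n_experts=3)
    print("expert_bounds:", bounds)
    print("expert-sorted assignments:", order)
```

Segments are exact here (no padding), matching the note that no `-1` pad is written today.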
|
||||
|
||||
Dispatch seam: `ggml-cuda.cu` `ggml_cuda_mul_mat_id()` (~L2632-2790). For mxfp4
|
||||
with `ne2`(tokens) > 7, `should_use_mmq()` -> true -> `ggml_cuda_mul_mat_q()`
|
||||
(`mmq.cu` id-branch ~L162-225) -> `mm_ids_helper` then ONE
|
||||
`mul_mat_q_switch_type`. The per-expert host loop below it is the gated fallback.
|
||||
|
||||
(Below npl8, MXFP4 mmid routes through `mmvq` - `MMVQ_MAX_BATCH_SIZE=8`, mmid max
|
||||
7 for turing_plus - which is fine for thin batch and out of scope here.)
|
||||
|
||||
## What to add (the durable delta, priority order)
|
||||
|
||||
### [1] Expert-aware M-tile selection (host-side only, zero new kernel)
|
||||
|
||||
In `mul_mat_q_case` / `launch_mul_mat_q`, when `ids != null`, choose `mmq_x` from
|
||||
**per-expert density** (~`ne_get_rows / n_active_experts`, derivable cheaply, or
|
||||
capped via env) instead of minimizing `ntiles` over aggregate `ncols_max`.
|
||||
|
||||
- `mmq_x` is a **compile-time template** (switch 8..128 step 8), so this is a pure
|
||||
host-side SELECTION change - it picks a different already-compiled instantiation.
|
||||
**Zero new kernel. Very low risk, high leverage.** Matches vLLM `BLOCK_SIZE_M`.
|
||||
- Doubles as near-term lever-1: env-gated `LLAMA_MOE_MMQ_X` cap at the knee.
|
||||
- GB10-aligned: smaller `mmq_x` -> smaller shared mem -> higher occupancy, and at
|
||||
tokens/expert <= 8 (one col-tile/expert) it costs no extra weight read.
|
||||
|
||||
This is the single highest-leverage change and the seed of the durable port.
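
A simplified Python model of the two selection policies, to show the difference in outcome. This is a sketch under assumptions, not the ggml code: it ignores the granularity and shared-memory validity checks that the real selection loop applies, and `expert_aware_mmq_x` is a hypothetical name for the proposed behaviour.

```python
import math

def stock_mmq_x(ncols_max, mmq_x_max=128, step=8):
    # Model of the current heuristic: the first tile width that minimizes
    # ntiles_x = ceil(ncols_max / mmq_x) over the aggregate token count.
    best_x, best_ntiles = None, None
    for mmq_x in range(step, mmq_x_max + 1, step):
        ntiles = math.ceil(ncols_max / mmq_x)
        if best_ntiles is None or ntiles < best_ntiles:
            best_x, best_ntiles = mmq_x, ntiles
    return best_x

def expert_aware_mmq_x(ne_get_rows, n_active_experts, mmq_x_max=128, step=8):
    # Proposed policy (sketch): size the tile to the per-expert density,
    # rounded up to the template step and clamped to the compiled range.
    density = ne_get_rows / max(1, n_active_experts)
    rounded = math.ceil(density / step) * step
    return max(step, min(mmq_x_max, rounded))

if __name__ == "__main__":
    print("stock, ncols_max=128          :", stock_mmq_x(128))
    print("expert-aware, 1024 rows / 128 :", expert_aware_mmq_x(1024, 128))
    print("expert-aware,  512 rows / 126 :", expert_aware_mmq_x(512, 126))
```

Both functions only ever return a multiple of 8 in 8..128, so the result always maps to an instantiation that is already compiled.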
|
||||
|
||||
### [2] Block-padded moe_align (the moe_align_block_size port proper)
|
||||
|
||||
Extend `mm_ids_helper` to pad each expert segment up to a multiple of the chosen
|
||||
block: write a sentinel (`-1`) `ids_dst` for pad lanes, put `expert_bounds` on
|
||||
block boundaries. Then every col-tile is **full**, which:
|
||||
|
||||
- drops the `need_check` masking + per-expert partial-tail MMA,
|
||||
- makes the stream-k `kbc` space exact (no skipped tiles, cleaner persistent
|
||||
schedule), removing the `col_diff` skip branch.
|
||||
|
||||
Medium risk: touches the scatter, the `col_diff`/`need_check` logic, and the
|
||||
`write_back` masking (pad rows must not write output). This is the proper
|
||||
`moe_align_block_size` analogue and the durable second step.
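
A Python sketch of the block-padding step, assuming an expert-sorted row list and exact `expert_bounds` as input. The helper name and the flat list layout are illustrative; only the `-1` sentinel and the block-aligned bounds come from the description above.

```python
PAD = -1  # sentinel for pad lanes; pad rows must never be written back

def pad_segments(sorted_rows, expert_bounds, block):
    # Pad each expert segment up to a multiple of `block` so every tile is full.
    padded, padded_bounds = [], [0]
    for e in range(len(expert_bounds) - 1):
        segment = sorted_rows[expert_bounds[e]:expert_bounds[e + 1]]
        n_pad = (-len(segment)) % block   # 0 when already aligned (or empty)
        padded.extend(segment)
        padded.extend([PAD] * n_pad)
        padded_bounds.append(len(padded))
    return padded, padded_bounds

if __name__ == "__main__":
    rows = [10, 11, 12, 20, 30, 31]       # destination rows, expert-sorted
    bounds = [0, 3, 4, 6]                 # 3 experts with 3, 1 and 2 rows
    padded, padded_bounds = pad_segments(rows, bounds, block=4)
    print("padded rows  :", padded)
    print("padded bounds:", padded_bounds)
```

An expert with zero rows gets zero pad lanes in this sketch, so it contributes no tile at all.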
|
||||
|
||||
### [3] Bespoke masked-grouped FP4 kernel - ONLY if [1]+[2] insufficient
|
||||
|
||||
A CUTLASS/DeepGEMM-style masked-grouped FP4 kernel. **Largest risk, likely
|
||||
unnecessary** given [B] is already a persistent stream-k grouped GEMM. Listed for
|
||||
completeness; do not start without [1]+[2] measured as insufficient.
|
||||
|
||||
## Integration into ggml_mul_mat_id (dispatch seam + gated fallback)
|
||||
|
||||
- The seam is unchanged: `ggml_cuda_mul_mat_id()` -> `should_use_mmq()` ->
|
||||
`ggml_cuda_mul_mat_q()`. [1] and [2] live entirely inside the mmq id-branch
|
||||
(`mmq.cu` ~L162-225) and its callees (`mmq.cuh` selection/launch, `mmid.cu`
|
||||
scatter). No change to the host dispatch decision.
|
||||
- **Gated fallback preserved**: the existing per-expert host loop
|
||||
(`should_use_mmq()==false` path) stays as-is for f16/bf16 experts and
|
||||
non-Blackwell GPUs. The new selection only fires on the grouped path.
|
||||
- **Env gates** (off = exact current behavior):
|
||||
- `LLAMA_MOE_MMQ_X=<8..128>` - cap/override the token tile for the id-path
|
||||
(lever-1 + [1] manual knob).
|
||||
- `LLAMA_MOE_BLOCK_ALIGN=0|1` - enable block-padded scatter ([2]).
|
||||
Default both off until parity + throughput proven, then flip [1]'s
|
||||
auto-selection on by default.
|
||||
|
||||
## Correctness / parity gate
|
||||
|
||||
Primary: `tests/test-backend-ops.cpp` `test_mul_mat_id` (~L4181). The CPU
|
||||
reference is **deterministic** - the op test must be **bit-exact**.
|
||||
|
||||
- Sweep `type_a` in {`MXFP4`, `NVFP4`}, `type_b = F32`, `n_mats = 128`,
|
||||
`n_expert_used = 8`, `n_tokens` in {8, 32, 64, 128} (the decode-density band).
|
||||
- **Add ragged small-M shapes** to the harness if absent (n_tokens not a multiple
|
||||
of mmq_x; experts with 0/1/2 tokens) - these are exactly where [1]/[2] change
|
||||
tile geometry and where block-pad masking can leak.
|
||||
- Pass criterion: new `mmq_x` selection and padded-align produce dst **identical**
|
||||
to current op-test output (op test is exact; the GB10 CUDA greedy-decode
|
||||
non-determinism band applies only to end-to-end, never to the op test).
|
||||
- End-to-end sanity: `llama-batched-bench` on `~/bench/qwen3coder-mxfp4.gguf`,
|
||||
`-fa on -npp 128 -ntg 128`, npl 8/32/64/128/256; confirm `S_TG` stays monotonic
|
||||
and `S_PP` flat ~3050-3090. Verify greedy-decode output within the documented
|
||||
CUDA batch-shape non-determinism band (CPU is the deterministic oracle).
|
||||
|
||||
Bench/parity scripts stay **dev-tree-only** (`~/llama-paged-dev/benches/`).
|
||||
|
||||
## Phased plan, expected payoff, risk per phase
|
||||
|
||||
| Phase | Work | Expected payoff | Risk |
|-------|------|-----------------|------|
| **P0** harness | Add ragged small-M + MXFP4/NVFP4 mmid shapes to `test_mul_mat_id`; capture current bit-exact baseline + the monotonic batched-bench curve as the reference. | None (gate). Locks correctness + the 85->1771 t/s baseline so any regression is caught. | Low. |
| **P1** sort op | Confirm `mm_ids_helper` is the moe_align; if [2] is pursued, prototype the block-pad scatter behind `LLAMA_MOE_BLOCK_ALIGN`. | Enables exact stream-k schedule; removes `need_check` masking (P3 payoff). | Medium (scatter + write-back masking). |
| **P2** grouped GEMM ([1]) | Expert-aware `mmq_x` selection in `mul_mat_q_case`/launch, `LLAMA_MOE_MMQ_X` gate. | The headline: reclaim the 3-6% M-tile fill waste at npl64-128. Modeled as removing wasted MMA M-throughput on every activated expert; net throughput up at high batch with no extra weight read. | **Low** (host-side template selection, no new kernel). |
| **P3** tune ([2] + fixup) | Land block-padded align; tune `mmq_x` per density, profile stream-k `fixup` overhead and `mmq_x`/`mmq_y` tile choice with nsys on the grouped `mul_mat_q<MXFP4>` kernel. | Remove per-expert partial-tail MMA; tighten the persistent schedule. Diminishing vs P2; this is pure micro-efficiency toward/past vLLM's saturated grouped-GEMM. | Medium-high (kernel masking paths). |
|
||||
|
||||
**Honest payoff framing:** the npl128 "cliff" is already gone on HEAD, so there is
|
||||
no broken path to unlock. The durable win is **matching vLLM's saturated
|
||||
grouped-GEMM M-tiling** (small per-expert block) and erasing the dense-GEMM-tuned
|
||||
M-tile mismatch - a micro-efficiency gain at large effective batch, not a
|
||||
step-change. vLLM 0.23.0 cannot even serve this model on GB10 (bf16 MoE-warmup
|
||||
hang + hard reboot; GGUF loader can't map fused qwen3moe experts), and llama
|
||||
already uses the same sorted-grouped-GEMM algorithm, so structural parity is
|
||||
**already met**; this closes the residual kernel micro-gap.
|
||||
|
||||
## The biggest risk: the GB10 W4A16 occupancy wall
|
||||
|
||||
The dominant risk is **repeating the W4A16 dead-end** that hit only ~9 TFLOPS /
|
||||
178 t/s on GB10. GB10 is **occupancy-dominated**: deep `cp.async` pipelines and
|
||||
XOR-swizzle shared layouts **collapse occupancy** there. Any P3 kernel work MUST:
|
||||
|
||||
- keep **small shared mem + high occupancy** (do NOT add deep `cp.async` stages
|
||||
or XOR-swizzle - they are exactly what killed W4A16);
|
||||
- preserve the **skew-pad (+4)** tile layout already in `MMQ_MMA_TILE_X_K_FP4`;
|
||||
- stay on the **FP4-MMA path** (`block_fp4_mmq`), the only path that hits Blackwell
|
||||
FP4 = 2x INT8/BF16 rate;
|
||||
- respect the ~273 GB/s LPDDR5X weight-read floor (dense decode is already at it;
|
||||
MoE wins come from occupancy/tile fit, not bandwidth).
|
||||
|
||||
Smaller `mmq_x` ([1]) is **strictly consistent** with these lessons: it reduces
|
||||
shared-mem footprint, raises occupancy, and at tokens/expert <= 8 adds no weight
|
||||
re-read. So the low-risk lever ([1]) is also the one most aligned with what GB10
|
||||
rewards - which is why it leads the plan and [3] is gated behind it.
|
||||
|
||||
## Commit / hygiene
|
||||
|
||||
Scope doc only (this file). No engine change committed in this workflow. Bench and
|
||||
parity scripts are dev-tree-only. Commit with `git commit -s`, trailer
|
||||
`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes.
|
||||
Do not push (human pushes). When [1]/[2] are implemented they mirror to
|
||||
`backend/cpp/llama-cpp/patches/paged/0014-*` (next free slot).
|
||||
@@ -1,71 +0,0 @@
|
||||
# MOE_QUANT_DEDUP_RESULTS.md - patch 0023 (qwen35moe NVFP4 activation-quantize de-dup)
|
||||
|
||||
Bit-exact MoE decode/prefill lever. Built + measured on DGX GB10 (sm_121a) on top of HEAD
|
||||
8a3229f (patch 0022). Companion analysis: NONRECURRENCE_BITEXACT.md (section "nonrec-build").
|
||||
|
||||
## What
|
||||
|
||||
ggml `mul_mat_id` quantizes the EXPERT-GATHERED activation rows: it allocates
|
||||
`ne11_flat = ne12 * n_expert_used` rows and quantizes each via `quantize_mmq_nvfp4(..., ids_src1)`.
|
||||
For the broadcast up/gate projections the activation is the per-token hidden state, the SAME for
|
||||
every expert that token routes to (`ne11 == 1`). So the stock path re-quantizes each token
|
||||
`n_expert_used` times (4x for q36-35b-a3b).
|
||||
|
||||
`quantize_mmq_nvfp4` computes each `block_fp4_mmq` as a pure per-thread function of its 16
|
||||
consecutive inputs (per-thread amax, the +/-2 ue4m3 search, the e2m1 packing - NO cross-thread
|
||||
shfl/reduction). So the quantized block for a given token is byte-identical no matter which
|
||||
expert slot it lands in.
|
||||
|
||||
## Lever
|
||||
|
||||
When `ne11 == 1` (broadcast up/gate):
|
||||
1. Quantize the `ne12` UNIQUE token activations once into a compact buffer
|
||||
(`quantize_mmq_fp4_cuda(src1_d, nullptr, ..., ne12, 1, 1)`, row stride `s12`).
|
||||
2. Gather the `block_fp4_mmq` rows into the expert-gathered layout keyed by `ids_src1`
|
||||
(`gather_mmq_fp4`): `block_fp4_mmq == 9 * uint4 == 144 B`, copied with a coalesced uint4
|
||||
kernel whose output is written fully contiguously (`gathered[t] = unique[ib_u*9 + w]`).
|
||||
|
||||
Pure byte copy of identical blocks => the gathered buffer is byte-for-byte identical to
|
||||
re-quantizing each gathered row. The MMQ GEMM is UNTOUCHED. `down_proj`
|
||||
(`ne11 == n_expert_used`, distinct per expert) keeps the stock re-quantize path.
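
The bit-exactness argument can be demonstrated with a toy model in Python. The quantizer below is a stand-in, not NVFP4: it only shares the one property the argument relies on, namely that each block is a pure function of its own 16 inputs. All names are hypothetical.

```python
BLOCK = 16

def quantize_block(values):
    # Toy per-block quantizer: output depends only on these 16 inputs.
    amax = max(abs(v) for v in values)
    scale = amax / 7.0 if amax > 0 else 1.0
    return (scale, tuple(round(v / scale) for v in values))

def quantize_row(row):
    return [quantize_block(row[i:i + BLOCK]) for i in range(0, len(row), BLOCK)]

def requantize_gathered(tokens, ids_src1):
    # Stock path: quantize every gathered row, once per expert slot.
    return [quantize_row(tokens[t]) for t in ids_src1]

def quantize_once_then_gather(tokens, ids_src1):
    # De-dup path: quantize each unique token once, then copy the blocks.
    unique = [quantize_row(row) for row in tokens]
    return [unique[t] for t in ids_src1]

if __name__ == "__main__":
    tokens = [[(i * 7 + t * 3) % 11 - 5 for i in range(32)] for t in range(3)]
    ids_src1 = [0, 2, 1, 0, 1, 2]  # each token is routed to two expert slots
    same = requantize_gathered(tokens, ids_src1) == quantize_once_then_gather(tokens, ids_src1)
    print("identical:", same)
    print("rows quantized: stock", len(ids_src1), "vs de-dup", len(tokens))
```

A quantizer with any cross-row or cross-thread reduction would break this equivalence, which is why the absence of shuffles or reductions matters.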
|
||||
|
||||
The first gather draft (one thread copies one 144 B struct, scattered) was uncoalesced and cost
|
||||
478 ms - it ate 84% of the quantize saving and decode stayed flat. The shipped coalesced-uint4
|
||||
gather costs 32 ms.
|
||||
|
||||
## Measurements (MoE=q36-35b-a3b-nvfp4, dense=q36-27b-nvfp4, -fa on, -npp 128 -ntg 128)
|
||||
|
||||
nsys decode-isolated (`--cuda-graph-trace=node`, npp8 ntg128 npl128), per-run kernel sums:
|
||||
| kernel | dedup off | dedup on | delta |
|-------------------------|-----------|-----------|-------------------------|
| `quantize_mmq_nvfp4` | 868 ms | 457 ms | |
| `gather_mmq_fp4` | - | 32 ms | |
| net quantize path | 868 ms | 489 ms | -379 ms decode GPU-time |
| `gated_delta_net` (50%) | unchanged | unchanged | |
| `mul_mat_q<NVFP4>` | unchanged | unchanged | |
|
||||
|
||||
Decode S_TG (t/s), back-to-back same-build A/B (default-on vs GGML_CUDA_MOE_QUANT_DEDUP=0):
|
||||
| model | npl32 off->on | npl128 off->on |
|-----------------|------------------------|------------------------------|
| MoE q36-35b-a3b | 440.3 -> 442.8 (+0.6%) | 745.2 -> 758.1 (+1.73%) |
| dense q36-27b | 207.4 -> 206.9 (flat) | 373.28 -> 373.24 (byte-flat) |
|
||||
|
||||
Prefill: MoE T_PP 7.69 -> 7.38 s (~ -4% time). Dense unaffected (no `mul_mat_id`).
|
||||
|
||||
## Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022)
|
||||
|
||||
| model | md5 (default on) | == 0022 |
|-------------------|----------------------------------|-------------------------|
| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | yes (dense untouched) |
| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | yes (on == off == 0022) |
|
||||
|
||||
test-backend-ops: MUL_MAT 1115/1115, MUL_MAT_ID 805/805 (default on).
|
||||
|
||||
## Knob
|
||||
|
||||
On by default. `GGML_CUDA_MOE_QUANT_DEDUP=0` restores the stock per-expert re-quantize path
|
||||
(byte-identical output, used as the A/B baseline).
|
||||
|
||||
Commits: DGX dev tree f7409c2; worktree patch `0023-qwen35moe-nvfp4-quant-dedup.patch`.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,99 +0,0 @@
|
||||
# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)
|
||||
|
||||
Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
|
||||
`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
|
||||
Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
|
||||
(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
|
||||
`build-cuda` sm_121.
|
||||
|
||||
## Headline (honest): there is no npl128 cliff to erase on this build
|
||||
|
||||
The mission premise was a 25% decode drop at npl128 (batched-bench 253/505/830/620
|
||||
@ npl 8/32/64/128). It does **not** reproduce. Stock decode is monotonic:
|
||||
|
||||
```
llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
npl      1    8   32   64   128   256
stock   85  282  629  935  1295  1779   <- monotonic, no knee
```
|
||||
|
||||
The old cliff was a real high-batch regression since fixed upstream: mxfp4 MoE
|
||||
decode on GB10 already takes the sorted grouped FP4-MMA GEMM (MUL_MAT_ID ->
|
||||
`ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper` moe_align/scatter + one
|
||||
persistent stream-k `mul_mat_q`), i.e. vLLM's algorithm. See
|
||||
`MOE_GROUPED_GEMM_SCOPE.md`.
|
||||
|
||||
## What the knob does
|
||||
|
||||
`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
|
||||
(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
|
||||
column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
|
||||
(top-8/128 => ~1/16 of `ne12`), so each expert's `mmq_x`-wide col-tile is only
|
||||
~6% filled: the MMA accumulator tile is `mmq_x`-wide at compile time and wastes
|
||||
throughput on the padding columns, and the larger y-tile lowers occupancy.
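
The fill estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes tokens are routed uniformly across experts; `expert_tile_fill` is a hypothetical helper written for this note, not a function in the patch.

```python
def expert_tile_fill(n_tokens: int, top_k: int, n_experts: int, mmq_x: int) -> float:
    """Average fraction of an mmq_x-wide column tile that holds real tokens.

    Assumes uniform routing, i.e. each expert sees ~n_tokens * top_k / n_experts
    columns (the ~ne12*k/n_experts density quoted in the text).
    """
    tokens_per_expert = n_tokens * top_k / n_experts
    return min(tokens_per_expert / mmq_x, 1.0)


# Decode at npl128 with 128 experts, top-8, and the stock 128-wide tile.
fill = expert_tile_fill(n_tokens=128, top_k=8, n_experts=128, mmq_x=128)
print(f"{fill:.2%}")  # 6.25%
```

Under the same assumption, capping the tile at 64 doubles the average fill to 12.5%, which is the direction the knob is meant to push.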
|
||||
|
||||
`LLAMA_MOE_MMQ_X=<n>` caps `mmq_x` on the MUL_MAT_ID path only
|
||||
(`expert_bounds != nullptr`). It only lowers the selection-loop upper bound and
|
||||
still chooses from the same granularity/shared-memory-validated `mmq_x` set stock
|
||||
already uses for smaller batches - no new kernel configuration. Default
|
||||
(unset/<=0) = disabled => byte-identical to stock.
|
||||
|
||||
## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)
|
||||
|
||||
Decode throughput, S_TG t/s:
|
||||
|
||||
```
npl   stock  cap16  cap32  cap64
  1      85     85     85     85
  8     282    280    282    282
 32     629    623    629    628
 64     935    915    949    934
128    1295   1204   1344   1357   <- cap64 +4.8% (cap16 -7%)
256    1779   1370   1723   1820   <- cap64 +2.3% (cap16 -23%)
```
|
||||
|
||||
Prefill throughput, S_PP t/s (the cost):
|
||||
|
||||
```
npl   stock  cap16  cap32  cap64
128    3083   1817   2559   3038
256    3084   1818   2560   3046
               -41%   -17%  -1.3%
```
|
||||
|
||||
Reproducibility (interleaved off/cap64, two reps each):
|
||||
|
||||
```
npl   off rep1/rep2   cap64 rep1/rep2
128   1300 / 1290     1357.5 / 1357.0
256   1786 / 1782     1826.3 / 1824.5
```
|
||||
|
||||
cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
|
||||
|
||||
## Why 64 is the only value that helps net
|
||||
|
||||
A 512-token prefill ubatch routes ~32 tokens/expert. cap16/cap32 force those into
|
||||
16/32-wide tiles, overflowing into extra col-tiles + weight re-reads -> prefill
|
||||
craters (-41% / -17%). cap64 still holds the prefill density in one tile (32 < 64)
|
||||
so prefill is near-neutral (-1.3%), while decode (~8 tokens/expert at npl128) gets
|
||||
the fuller, higher-occupancy tile.
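
The tile-count argument can be sketched as follows. The uniform per-expert average and the 1.5x "busy expert" figure are illustrative assumptions, not measurements; `col_tiles_per_expert` is a hypothetical helper.

```python
import math


def col_tiles_per_expert(tokens_per_expert: float, cap: int) -> int:
    """Number of cap-wide column tiles one expert needs for its tokens."""
    return max(1, math.ceil(tokens_per_expert / cap))


prefill_avg = 512 * 8 / 128   # 512-token ubatch, top-8 of 128 experts -> 32
decode_avg = 128 * 8 / 128    # npl128 decode step -> 8
busy_expert = 1.5 * prefill_avg  # assumed skew: one expert gets 48 tokens

for cap in (16, 32, 64):
    print(
        f"cap{cap}: prefill avg {col_tiles_per_expert(prefill_avg, cap)} tile(s), "
        f"busy expert {col_tiles_per_expert(busy_expert, cap)} tile(s), "
        f"decode {col_tiles_per_expert(decode_avg, cap)} tile(s)"
    )
```

With these assumptions cap16 already needs two tiles for the average prefill expert, cap32 fits the average but spills for any expert above it, and cap64 keeps even the 48-token expert in one tile. Decode fits one tile at every cap, so the decode benefit comes from the narrower tile rather than from fewer tiles.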
|
||||
|
||||
## Verdict
|
||||
|
||||
- Real but **modest** high-effective-batch DECODE micro-optimization
|
||||
(+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
|
||||
- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
|
||||
continuous batching already scales). Shipped as an opt-in, default-off knob;
|
||||
recommended value 64 for decode-heavy high-concurrency deployments.
|
||||
- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
|
||||
for single-stream generation and stays coherent; thousands of capped MoE
|
||||
matmuls at npl128/256 ran with no CUDA error / NaN.
|
||||
|
||||
## Durable follow-up (scoped, not implemented)
|
||||
|
||||
Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
|
||||
from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
|
||||
small tile while prefill keeps its large tile automatically (removes the ~1.3%
|
||||
prefill cost). Plus the block-padded `moe_align` in `mm_ids_helper`. See
|
||||
`MOE_GROUPED_GEMM_SCOPE.md`.
|
||||
@@ -1,323 +0,0 @@
|
||||
# NONRECURRENCE_BITEXACT.md - bit-exact non-recurrence decode levers (label nonrec-design, READ-ONLY, no GPU)
|
||||
|
||||
Post-0022 the gated-DeltaNet recurrence is at 84.6% BW = 102.6% of vLLM (3.488 ms/call), past parity.
|
||||
The remaining ~5% to vLLM lives in the non-recurrence path. Per the node-level decode trace (nsys
|
||||
`--cuda-graph-trace=node`, clean build, q36-27b-nvfp4 dense, npl128) the decode step is ONE replayed
|
||||
CUDA graph, ALL kernels on a SINGLE stream (stream 14), strictly serial, 99.94% GPU-busy, 0.06% idle.
|
||||
That single-stream-99.94%-busy fact is load-bearing for everything below: there is NO overlap, so any
|
||||
kernel GPU-time genuinely removed (or any kernel folded away) cuts wall-clock 1:1; and conversely, if a
|
||||
"faster kernel" leaves wall-clock flat, then the kernel did NOT actually get faster at the decode shape.
|
||||
|
||||
Post-recurrence-fix kernel mix of the ~367 ms decode step (was 380.4 pre-0022; recurrence now smaller):
|
||||
- `mul_mat_q` FP4 GEMM (496 calls/step) ~24% (the biggest non-recurrence bucket)
|
||||
- `quantize_mmq_nvfp4` (496/step) ~4.5%
|
||||
- `nvjet` lm_head GEMM ~3.1%
|
||||
- `flash_attn_ext_f16` (16 attn layers) ~3.1%
|
||||
- elementwise glue: k_bin_bcast (gate mul+add) ~1.7%, unary_gated silu/sigmoid ~1.4%, rms_norm ~0.9%,
|
||||
l2_norm ~0.2%, plus conv-state concat_cont/cpy (Lever-1 territory, not in this scope).
|
||||
|
||||
Files read on the DGX 0022 tree (HEAD 8a3229f): `mmq.cuh`, `mmq.cu`, `quantize.cu`, `gated_delta_net.cu`,
|
||||
`fattn.cu`, `fattn-common.cuh`.
|
||||
|
||||
---
|
||||
|
||||
## RESOLUTION of the P2a puzzle (load-bearing) - mmq_y=64 / minblocks: bit-exact but FLAT on decode
|
||||
|
||||
The existing P2a machinery is two NVFP4-gated, default-stock flags in `mmq.cuh`:
|
||||
- `GGML_CUDA_FP4_MMQ_Y` (L143-163): overrides the weight-row N-tile `mmq_y` 128 -> 64/96 for NVFP4 on
|
||||
Blackwell. mmq_y tiles N (output rows); each weight row lives in exactly one row-tile, so total weight
|
||||
traffic is unchanged. **Bit-exact**: the per-output K-reduction is the `for frag` loop in
|
||||
`vec_dot_fp4_fp4_mma` (L1097-1108, `sum[...] += C.x[l]`), whose order is independent of mmq_y. md5-
|
||||
verified in prior runs (1115/805 gate, byte-identical).
|
||||
- `GGML_CUDA_FP4_MINBLOCKS` (L205-216): raises the `__launch_bounds__` min-blocks operand (L3579-3585)
|
||||
for NVFP4 so >1 CTA co-resides per SM. **Bit-exact**: register allocation / occupancy cannot change
|
||||
results.
|
||||
|
||||
The paradox restated: P2a made a standalone `mul_mat_q<NVFP4,m=128>` -24.7% faster (bit-exact), yet
|
||||
decode was FLAT (335->336 post-0020). The trace says decode is 99.94% single-stream busy and mul_mat_q
|
||||
is ~24% of it, so a -24.7% cut should give ~+6%. RESOLUTION (airtight, from the single-stream fact):
|
||||
|
||||
> On a 99.94%-busy single stream, freed kernel GPU-time MUST lower the wall 1:1. Decode is flat =>
|
||||
> mmq_y=64 did NOT free per-call GPU-time at the DECODE shapes => the -24.7% was measured at a
|
||||
> NON-decode shape (a single large-N or prefill-M GEMM that runs enough waves to reach asymptotic
|
||||
> throughput). There is no contradiction; the two measurements are at different GEMM shapes.
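
The "~+6%" expectation follows from treating the decode step as a fully serial chain of kernels. A minimal sketch of that arithmetic, assuming the 24% share and the 24.7% per-kernel cut quoted above (the helper name is hypothetical):

```python
def serial_speedup(kernel_share: float, kernel_time_cut: float) -> float:
    """Throughput gain when one kernel on a fully serial stream gets faster.

    kernel_share:    fraction of the step spent in the kernel (0..1)
    kernel_time_cut: fractional reduction of that kernel's time (0..1)
    """
    new_step_time = 1.0 - kernel_share * kernel_time_cut
    return 1.0 / new_step_time - 1.0


# What decode would gain if the microbench cut applied at the decode shape.
print(f"{serial_speedup(0.24, 0.247):+.1%}")  # +6.3%
```

Because the stream has essentially no idle time, a flat decode result implies the per-call cut at the decode shape is close to zero, which is the argument the blockquote makes.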
|
||||
|
||||
Mechanism (grounded in the launch path, `launch_mul_mat_q` L3989-4088): decode runs ONE `mul_mat_q` per
|
||||
weight with mmq_x=128 fused tokens => ntx=1, and the grid is `nty = N / mmq_y` CTAs (xy-tiling, or
|
||||
stream-k at nsm=48 when `tiles_efficiency_percent < 90`, L4044-4047). The 496 decode GEMMs have small N:
|
||||
- FFN up/gate N=17408 -> nty=136 CTAs (mmq_y=128) = ceil(136/48)=3 waves, last wave 40/48=83% full
|
||||
- FFN down / qkv / o-proj N~5120-6144 -> nty=40-48 CTAs = 1 wave (and eff<90 => stream-k at 48 CTAs)
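
The wave counts in the list above can be checked with a small helper. This is a sketch of the plain xy-tiling case only, assuming one CTA per `mmq_y` rows and 48 SMs with one resident CTA each; it does not model the stream-k path, and `wave_profile` is a hypothetical name.

```python
import math


def wave_profile(n_rows: int, mmq_y: int, n_sm: int = 48):
    """Return (CTAs, waves, fill of the last wave) for an N-row GEMM."""
    ctas = math.ceil(n_rows / mmq_y)
    waves = math.ceil(ctas / n_sm)
    last_wave_ctas = ctas - (waves - 1) * n_sm
    return ctas, waves, last_wave_ctas / n_sm


for n_rows, mmq_y in ((17408, 128), (17408, 64), (5120, 128), (6144, 128)):
    ctas, waves, tail = wave_profile(n_rows, mmq_y)
    print(f"N={n_rows} mmq_y={mmq_y}: {ctas} CTAs, {waves} wave(s), last wave {tail:.0%} full")
```

For N=17408 this gives 136 CTAs in 3 waves at `mmq_y=128` and 272 CTAs in 6 waves at `mmq_y=64`; the thin GEMMs stay at a single wave either way.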
|
||||
|
||||
So EVERY decode GEMM is a 1-3 wave, 40-136 CTA kernel: it is **ramp + tail (wave-quantization) bound**,
|
||||
dominated by the first-wave weight-load latency before any MMA can start plus the fractional last wave -
|
||||
NOT by steady-state occupancy. mmq_y=64 doubles the grid (272 CTAs, 6 waves for the fat FFN) which only
|
||||
helps the ASYMPTOTIC achieved-BW the microbench measures; at 1-3 waves there is no steady state for it
|
||||
to act over, and each CTA now carries half the arithmetic-per-weight-load so the ramp is relatively MORE
|
||||
exposed. minblocks=2 is worse: the FP4 MMA is register-bound at ~255 regs/thread (the `(256,1)` bound),
|
||||
so forcing 2 CTAs/SM register-caps to ~128 regs => heavy spill => net-negative. Both are the in-wave
|
||||
occupancy lever, and the decode GEMM has no in-wave occupancy problem - it has a too-few-waves problem.
|
||||
|
||||
VERDICT: re-test P2a (mmq_y=64, and 96) and minblocks=2 ON TOP of 0022 because it is a FREE one-build
|
||||
re-test (flags already exist, default stock). **Design prediction: still ~flat (maybe +1-2% from the
|
||||
one fat-FFN N=17408 GEMM that has 3->6 waves of room; ~0% from the 1-wave thin GEMMs).** The decisive
|
||||
measurement for the reprofile agent is NOT a standalone microbench - it is the PER-CALL `mul_mat_q`
|
||||
GPU-time at the REAL decode shapes (the 496 calls), flag on vs off, summed. If per-call decode time
|
||||
drops, it ships (free bit-exact win). If per-call decode time is ~unchanged (predicted), the -24.7%
|
||||
was a large-N artifact and the GEMM has no bit-exact occupancy lever - confirming the structural wall.
|
||||
|
||||
WHY the decode GEMM has no high-value bit-exact lever: its bottleneck is wave-quantization at a small
|
||||
grid. The only knobs that change the grid are (a) mmq_y-down [bit-exact, flat per above], (b) mmq_x-down
|
||||
[FORBIDDEN: re-reads the 18 GB weights ntiles_x times, strictly worse, and pins one-read], (c) the
|
||||
stream-k-vs-tiling threshold [FORBIDDEN for bit-exactness: stream-k splits each output tile's K-sum
|
||||
across CTAs and re-adds via the fixup kernel - a DIFFERENT K-accumulation order than one-CTA-full-K
|
||||
tiling, so flipping the L4047 threshold changes which path a GEMM takes and breaks md5 vs the 0022
|
||||
baseline]. So at the bandwidth/wave-quant floor for these tiny grids, 3% FP4 efficiency is structural;
|
||||
no order-preserving change moves it.
|
||||
|
||||
---
|
||||
|
||||
## RANKED bit-exact non-recurrence levers
|
||||
|
||||
Ranked by expected bit-exact decode gain. "Bit-exact-safe" = keeps the exact reduction/FMA order; the
|
||||
gate is md5-identity to llama 0022 f32 output on both models (dense + MoE), greedy temp0.
|
||||
|
||||
### 1. Quantize producer-fold (Track A) - bit-exact-safe - ceiling 4.5%, realistic ~2-2.5%
|
||||
Fold `quantize_mmq_nvfp4` (4.5%, ~17 ms, 496/step) into the PRODUCER epilogue (the rms_norm / silu that
|
||||
emits each GEMM's activation), so the f32 activation is quantized to `block_fp4_mmq` directly from the
|
||||
producer's registers instead of being written to HBM as f32 and re-read by a standalone quantize kernel.
|
||||
- **Bit-exactness: SAFE, and unusually clean.** `quantize_mmq_nvfp4` (quantize.cu:78-171) computes
|
||||
`amax_raw` PER-THREAD over the thread's own QK_NVFP4_SUB=16 values (L108-118) with NO cross-thread
|
||||
shfl/reduction (unlike `quantize_mmq_q8_1` which does a warp shfl_xor). Each thread independently runs
|
||||
the +/-2 ue4m3 scale search (L120-150) and `ggml_cuda_float_to_fp4_e2m1` packing (L155-166). So the
|
||||
output block is a pure per-thread function of its 16 inputs. Copy that arithmetic VERBATIM into the
|
||||
producer epilogue and the `block_fp4_mmq` bytes are identical => md5-safe. The only requirement is the
|
||||
producer thread-layout owns contiguous 16-element K-sub-blocks (feasible for an rms_norm/silu epilogue).
|
||||
- **Expected gain:** the win is removing the standalone kernel's f32 activation READ (the producer already
|
||||
holds the f32); the quant compute + fp4 write still happen (now folded). So ~the read-half of the 17 ms,
|
||||
~2-2.5% of the step, and it is REAL because the step is single-stream 99.94% busy (no overlap to hide
|
||||
the removed kernel).
|
||||
- **Trap / caveat:** the SPENT "Lever-2" was a DIFFERENT fusion (quantize -> GEMM *consumer* prologue,
|
||||
measured net-zero because the GEMM still reads the same activation bytes). Track A is the *producer*
|
||||
fold and removes a true f32 round-trip, so it is not subject to that flatness - but it needs real
|
||||
producer-kernel surgery + the frozen `block_fp4_mmq` ABI (mmq.cuh:53), more plumbing than the others.
|
||||
- Ranked #1: largest cleanly-bit-exact non-GEMM bucket, no reduction trap (per-thread quant).
|
||||
|
||||
### 2. Activation / op fold - POINTWISE subset only - bit-exact-safe - realistic ~1.5-2.5%
|
||||
Fold the pure pointwise glue off the single-stream chain into the adjacent kernel's epilogue/prologue:
|
||||
the GDN residual ADDs and gate MULs (`k_bin_bcast`, ~1.7%), the `silu`/`sigmoid` (`unary_gated`, ~1.4%,
|
||||
the part that is the output gate, not FFN), and the post-GDN gate MUL after the output rms_norm.
|
||||
- **Bit-exactness: SAFE for the pointwise ops only.** Add/mul/silu/sigmoid are elementwise fp32 with the
|
||||
same formula and the same op order whether standalone or folded => byte-identical. This is the bit-exact
|
||||
half of the prior Lever-3 design.
|
||||
- **THE TRAP (FORBIDDEN half):** the `rms_norm`/`l2_norm` REDUCTIONS must NOT be re-folded with a
|
||||
different reduction tree. The standalone `l2_norm_f32<32>`/`rms_norm_f32` use a specific warp/block
|
||||
reduction; folding the norm into a kernel with a different `warp_reduce_sum` width or eps placement
|
||||
(`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) changes the last ULP => breaks md5. Fold the MUL that
|
||||
FOLLOWS the norm (pointwise, safe); do NOT fold the norm's reduction. (This is the direct analog of the
|
||||
f32x4 lane-remap trap that blocked the recurrence's vectorized state loads: any change to a reduction's
|
||||
grouping is forbidden.)
|
||||
- **Expected gain:** ceiling ~3.3% (the Lever-3 slice), realistic ~1.5-2.5% once the norm reductions are
|
||||
excluded. Real (single-stream, no overlap), bounded, lower plumbing than #1 (no new ABI).
|
||||
- Ranked #2: smaller than #1 and the high-value pieces (norms) are off-limits.
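
The reduction-tree trap in the list above comes down to two facts: floating-point addition is not associative, and the two eps placements are different formulas. A minimal illustration, using Python doubles rather than the f32 the kernels use (so it shows the class of problem, not the kernels' actual bit patterns):

```python
import math

# 1. Same three addends, two groupings: the sums differ in the last bit.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)
print(left_to_right == regrouped)  # False


# 2. The two eps placements named in the text are not the same function.
def scale_rsqrt(sumsq: float, eps: float) -> float:
    return 1.0 / math.sqrt(sumsq + eps)      # x * rsqrt(sumsq + eps)


def scale_max(sumsq: float, eps: float) -> float:
    return 1.0 / max(math.sqrt(sumsq), eps)  # x / max(sqrt(sumsq), eps)


print(scale_rsqrt(4.0, 1e-6) == scale_max(4.0, 1e-6))  # False
```

Either difference is enough to change an output byte somewhere in a long greedy generation, which is why the md5 gate treats any regrouped reduction as forbidden while leaving pointwise folds alone.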
|
||||
|
||||
### 3. mul_mat_q occupancy retune (existing P2a: mmq_y=64/96, minblocks=2) - bit-exact-safe - ~FLAT
|
||||
See the P2a resolution above. Bit-exact-safe (N-tiling / register-cap preserve the K-reduction order;
|
||||
md5-verified). Design prediction FLAT on decode (decode GEMMs are 40-136 CTA, 1-3 wave, ramp/tail-bound;
|
||||
the -24.7% was an asymptotic large-N number). **Worth the one-build re-test only because it is free**
|
||||
(flags exist, default stock). Possible marginal +1-2% from the single N=17408 fat-FFN GEMM (3->6 waves).
|
||||
Measure PER-CALL decode-shape `mul_mat_q` time, not a microbench. Ranked #3: zero plumbing, but low/zero
|
||||
expected gain - it is the diagnostic that confirms the GEMM wall is structural, not a shippable lever.
|
||||
|
||||
### 4. Attention occupancy (flash_attn_ext_f16) - NO bit-exact lever - NO-GO
|
||||
`flash_attn_ext_f16` is ~3.1% (11.67 ms, 16 attn layers), grid 48 CTAs = exactly ONE full wave on 48
|
||||
SMs (trace). There is no occupancy headroom (already 1 wave, perfectly filled, no tail) and no in-wave
|
||||
under-occupancy to fix. The only knobs that change the attention grid are split-KV / parallel_blocks /
|
||||
a different KV-tile (the `ncols1`/`ncols2`/`cols_per_block` selection in `fattn.cu`), and EVERY one of
|
||||
them changes the online-softmax running-max/sum RESCALING ORDER across KV blocks => NOT bit-exact
|
||||
(forbidden, the softmax-rescale analog of the reduction-tree trap). At 3.1% with one full wave the
|
||||
attention is effectively at floor. Ranked last: no bit-exact lever exists; do not pursue.
|
||||
|
||||
---
|
||||
|
||||
## FORBIDDEN levers (require a precision or accumulation-order change - excluded by the gate)
|
||||
- Stream-k vs plain-tiling threshold flip for the GEMM wave-quant tail: splits + re-adds the K-sum across
|
||||
CTAs => different f32 accumulation order than one-CTA-full-K tiling => breaks md5.
|
||||
- Vectorized / lane-remapped tile loads in the GEMM (`load_tiles_nvfp4_nvfp4` / `load_ldmatrix`): any
|
||||
remap of which lane holds which K-element changes the MMA fragment->accumulator mapping => can change
|
||||
the per-output sum grouping => forbidden (the f32x4 lane-remap trap, same class that blocked the
|
||||
recurrence's vectorized state loads).
|
||||
- mmq_x-down at dense decode: re-reads the 18 GB weights `ntiles_x` times. Order-preserving but strictly
|
||||
slower and breaks the one-read invariant; not a lever.
|
||||
- Folding rms_norm / l2_norm with a different reduction tree or eps placement: last-ULP change => md5 break.
|
||||
- flash-attn split-KV / KV-retile: changes the online-softmax rescale order => not bit-exact.
|
||||
- bf16 state / bf16 anything: precision change, SHELVED, forbidden by the gate.
|
||||
|
||||
---
|
||||
|
||||
## One-line summary for the parent
|
||||
The remaining non-recurrence decode gap has NO single big bit-exact lever. The largest cleanly bit-exact
|
||||
win is the **quantize producer-fold (Track A, ~2-2.5%, the per-16 NVFP4 quant has no cross-thread
|
||||
reduction so it copies verbatim into the rms_norm/silu epilogue)**; second is the **pointwise activation
|
||||
fold (~1.5-2.5%, fold the residual adds / gate muls / silu but NOT the norm reductions)**; the
|
||||
**mul_mat_q occupancy retune (P2a mmq_y/minblocks) is bit-exact but predicted FLAT** (decode GEMMs are
|
||||
small-grid wave-quant/ramp-bound, so the -24.7% asymptotic number does not apply per-call - confirmed by
|
||||
the airtight single-stream-99.94%-busy logic, re-test only because the flag is free); and **attention has
|
||||
NO bit-exact lever** (already one full wave; every grid knob changes the softmax rescale order). The
|
||||
P2a puzzle is resolved: not a contradiction - the -24.7% and the flat decode are simply at different GEMM
|
||||
shapes (large-N asymptotic vs 1-3-wave decode per-call).
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
# EMPIRICAL P2a RE-TEST ON 0022 (label reprofile-puzzle, GPU agent) - measured, build + bench + nsys
|
||||
|
||||
The design section above PREDICTED P2a flat from the single-stream logic. This section is the GPU
|
||||
measurement that CONFIRMS it byte-for-byte, plus one load-bearing correction: an early "+11% decode"
|
||||
A/B was a STALE-BASELINE artifact, not the flag. Box: DGX GB10 (sm_121a), HEAD 8a3229f (patch 0022),
|
||||
SM+MEM clock pinned 2190 MHz (verified via `nvidia-smi dmon`, identical base vs flag - NOT a clock story).
|
||||
|
||||
## (1) Fresh node-level decode decomposition (nsys --cuda-graph-trace=node, dense q36-27b-nvfp4, npl128)
|
||||
Per-instance trace windowed to one steady decode step (103 steady steps, step = 48 GDN-layer boundaries):
|
||||
|
||||
Committed-default build (build-cuda-base, 336 t/s @128) -- step span 383.1 ms, kernel-busy 99.24-99.30%:
|
||||
gated_delta_net (SSM recurrence) 193.97 ms/step 51.0% <- BINDING KERNEL
|
||||
mul_mat_q<NVFP4,m=128,nc=0> 93.64 ms/step 24.6% <- the P2a target
|
||||
quantize_mmq_nvfp4 16.77 ms/step 4.4%
|
||||
nvjet (cublas lm_head GEMM) 12.07 ms/step 3.2%
|
||||
flash_attn_ext_f16 11.69 ms/step 3.1%
|
||||
concat_cont 8.14 / cpy_scalar 7.49 / k_get_rows 7.29 / ssm_conv 6.55 / silu 5.32 / k_bin_bcast 4.67
|
||||
mul_mat_q_stream_k_fixup 3.95 / rms_norm 3.56 / ... ; SUM 380.1 ms = 99.24% of the 383.1 ms wall.
|
||||
|
||||
conv-inplace + GDN(16,8) build (the 374 t/s state) -- step span 345.3 ms, kernel-busy 99.0%:
|
||||
gated_delta_net 167.99 (49.2%), mul_mat_q<NVFP4,128,0> 93.79 (27.5%), quantize 17.66 (5.2%),
|
||||
nvjet 12.05 (3.5%), flash_attn 11.66 (3.4%), ssm_conv(fused update) 8.44 (2.5%), k_get_rows 7.32 (2.1%).
|
||||
|
||||
BINDING KERNEL = gated_delta_net (~49-51% of the step) in BOTH; mul_mat_q<NVFP4,m=128> is #2 (~25-27.5%).
|
||||
Decode is ~99.0-99.3% GPU-busy single-stream (consistent with the earlier 99.94% figure; ~0 idle, strictly serial).
|
||||
|
||||
## (2) P2a A/B - the -DGGML_CUDA_FP4_MMQ_Y=64 nwarps-remap, re-applied + built + bit-exact-gated on 0022
|
||||
The committed 0022 machinery was PARTIAL (patch 0017 templated get_mmq_y_device<type> but left
|
||||
mmq_get_nwarps_device() stock -> mmq_y=64 + nwarps=8 fails static_assert nwarps*tile_C::I==mmq_y at
|
||||
mmq.cuh:3280). Re-derived the full threading: templated mmq_get_nwarps_device<type>() -> mmq_y/16 (=4)
|
||||
for NVFP4+flag; type-aware mmq_get_nwarps_host(...,type); threaded <type> through the NVFP4 loader (998),
|
||||
write_back_mma (3266), process_tile (3500), mul_mat_q launch_bounds (3579/83/85) + body (3602),
|
||||
stream_k_fixup launch_bounds (3832) + body (3843), 2 host launch sites (3994/4172). Reverted after.
|
||||
|
||||
cuobjdump proof the flag took effect: mul_mat_q<NVFP4,m=128,nc=0> STACK 112 -> 56 (256-thr/8-warp CTA
|
||||
-> 128-thr/4-warp CTA => 1 -> 2 resident CTAs/SM). REG 255 (HW-capped), no new spill.
|
||||
BIT-EXACT GATE (HELD): test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805; greedy md5 base==flag
|
||||
IDENTICAL = 5951a5b4d624ce891e22ab5fca9bc439 (matches the prior P2a gate hash). Byte-identical output.
|
||||
|
||||
CLEAN A/B (same build dir, ONLY mmq.cuh toggled => non-mmq .o byte-identical; back-to-back, pinned clocks)
|
||||
S_TG t/s, llama-batched-bench -fa on -npp128 -ntg128:
|
||||
DENSE q36-27b: npl 32 208.02 -> 207.51 (-0.2%) npl 128 374.30 -> 373.19 (-0.3%) FLAT
|
||||
MoE q36-35b-a3b: npl 32 438.83 -> 439.30 (+0.1%) npl 128 745.71 -> 745.07 (-0.1%) FLAT
|
||||
Prefill S_PP also flat at 0022 (npp128 1056->1050; npp2048/npl1 1028.85->1024.19).
|
||||
|
||||
## (3) RESOLUTION - why FLAT, where the GEMM time goes, and a correction to the prior "-24.7%->+6%" logic
|
||||
Decode-isolated per-kernel A/B (node trace, same-source toggle, identical non-mmq code):
|
||||
gated_delta_net 167.99 -> 167.89 ms/step (IDENTICAL - it is byte-identical code, untouched)
|
||||
mul_mat_q<NVFP4,128,0> 93.79 -> 92.74 ms/step (-1.1%, FLAT) <- the P2a target, decode shape
|
||||
mul_mat_q_stream_k_fixup 3.96 -> 5.65 ms/step (+1.7ms, REGRESSES at nwarps/2=2)
|
||||
=> the decode mmq FAMILY is flat-to-slightly-WORSE; the flag delivers ~nothing at the m=128 decode shape.
|
||||
|
||||
The "-24.7%" is REAL but it is a PREFILL-shape number. Full-run aggregate (npp128 ntg128, prefill+decode)
|
||||
mul_mat_q<NVFP4,128>: 19630 -> 17569 ms = -10.5%; subtracting the flat decode portion (93.8x128 vs
|
||||
92.7x128) leaves the prefill-shape portion at 7625 -> 5699 ms = -25.3% (matches the prior -24.7%). So the
|
||||
occupancy lever genuinely cuts the COMPUTE/occupancy-bound prefill-shape GEMM ~25%, and ~0 of the
|
||||
BANDWIDTH-bound m=128 decode-shape GEMM (it reads the full NVFP4 weight matrix from 273 GB/s LPDDR5x; the
|
||||
mmq_y knob is deliberately bandwidth-neutral - every weight row still read once - so it cannot move a
|
||||
bandwidth-bound wall). Confirmed at the SOURCE-of-decode level, not inferred.
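
The prefill/decode split above is a subtraction over the full-run aggregate. A sketch of that arithmetic using the figures quoted in this section (per-step decode times of 93.79 and 92.74 ms over 128 decode steps); the helper name is hypothetical:

```python
def prefill_portion(total_ms: float, decode_ms_per_step: float, n_decode_steps: int) -> float:
    """Kernel time left for prefill-shape calls after removing the decode steps."""
    return total_ms - decode_ms_per_step * n_decode_steps


base = prefill_portion(19630, 93.79, 128)  # flag off
flag = prefill_portion(17569, 92.74, 128)  # flag on
change = flag / base - 1.0
print(round(base), round(flag), f"{change:+.1%}")
```

The result lands within a millisecond or two of the 7625 -> 5699 ms figures quoted above and reproduces the roughly -25% prefill-shape change, while the decode term itself moves by only about 1%.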
|
||||
|
||||
Reconciling with "99.94% busy single stream => a -24.7% cut should give ~+6%": the PREMISE is false. The
|
||||
flag does NOT cut the decode mul_mat_q by 24.7% (it cuts it 1.1%). There is therefore NO freed time on the
|
||||
99% busy stream - so the "where does the freed time go (idle gaps?)" question is moot: no time is freed at
|
||||
the decode shape. The contradiction dissolves: mul_mat_q IS on the critical path AND single-stream-busy, but
|
||||
the lever simply doesn't accelerate the decode-shape invocation. (Net it slightly hurts via stream_k_fixup.)
|
||||
|
||||
CORRECTION to an earlier in-session A/B (recorded so the parent does not chase it): a first pass showed
|
||||
build-cuda-base 334.6 -> "flag" 372 (+11%). That was a STALE-BASELINE artifact, NOT the flag. build-cuda-base
|
||||
(binaries 18:46) was compiled from a pre-0021 source - it has NO ssm_conv_update_f32 (cuobjdump symbol count
|
||||
0 vs 4 in the 0022 build) and the un-retuned GDN default (gated_delta_net 194 vs 168 ms/step). Those ~40 ms
|
||||
of non-mmq differences (conv fuse ~14 ms + GDN ~26 ms) are the entire 334.6->373 gap. With a correct
|
||||
same-source baseline (toggle ONLY mmq.cuh in one build dir) the flag is flat (373.19 vs 374.30). Lesson:
|
||||
the only valid P2a A/B holds every non-mmq .o byte-identical; comparing two independently-built trees mixes
|
||||
in whatever other flag/patch state each was built from.
|
||||
|
||||
## VERDICT
|
||||
P2a (mmq_y=64 nwarps-remap) is BIT-EXACT (md5-identical, 1115/805) and a genuine ~25% PREFILL-shape FP4-GEMM
|
||||
kernel win, but it is FLAT on decode (dense and MoE, npl 32 and 128) on 0022, AND flat on end-to-end prefill
|
||||
S_PP at 0022 (prefill is GDN/other-bound at these sizes, not mmq-bound). It is NOT a decode-parity lever and
|
||||
the decode commit-gate (lift decode_agg) is NOT met -> do NOT ship for decode. The binding decode kernel is
|
||||
gated_delta_net (~50%); the only decode levers left are the bit-exact folds in the design section above
|
||||
(quantize producer-fold ~2-2.5%, pointwise activation fold ~1.5-2.5%) and the GDN-region launch/fusion that
|
||||
vLLM already has. The mmq P2a machinery was reverted; the 0022 tree is left git-clean.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
# nonrec-build (GPU agent) - built + measured. Lever shipped: MoE NVFP4 quantize de-dup (patch 0023)
|
||||
|
||||
Box: DGX GB10 (sm_121a), baseline = clean rebuild of HEAD 8a3229f (patch 0022) in build-cuda
|
||||
(verified: mmq.cu.o rebuilt from clean source; the A/B-left binary was stale). md5 references
|
||||
locked: q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd.
|
||||
Baseline decode S_TG: dense 208.7/373.6, MoE 441/746 (npl 32/128). ncu unavailable (no
|
||||
GPU-counter permission, no sudo) -> all verdicts are nsys + back-to-back same-build A/B.
|
||||
|
||||
## Levers EVALUATED
|
||||
|
||||
### A. quantize_mmq_nvfp4 occupancy retune (token-packing) - BIT-EXACT, FLAT -> not shipped
|
||||
The decode quantize at the K=2048 shape is grid (128,1,1) = 128 CTAs = ~2.67 waves on 48 SMs.
|
||||
Unlike mul_mat_q (bandwidth-bound on LPDDR5x, so P2a was flat), quantize moves trivial memory,
|
||||
so I tried packing TPB token-rows per CTA (blockDim.y) to cut wave-quant - each thread still
|
||||
quantizes its own 16 consecutive values, so byte-identical (md5 5951a5b4/07db32c2 held at TPB
|
||||
1/2/4, after fixing the output ib index to use the token i1 not blockIdx.x). Result: DENSE npl128
|
||||
DEAD-FLAT 373.25 across TPB 1/2/4; npl32 and MoE flat-to-slightly-WORSE at TPB>1. The decode
|
||||
quantize is at its best config already (TPB=1 = max CTA parallelism = best latency hiding;
|
||||
fewer/bigger CTAs hurt). Second bit-exact occupancy lever (after P2a) proven flat. Reverted.
|
||||
|
||||
### B. skip-ALL-quantize probe (NON-bit-exact, diagnostic) - the +30% MoE number is an ARTIFACT
|
||||
Skipping quantize_mmq_fp4_cuda entirely (garbage buffer, FP4-MMA timing data-independent) showed
|
||||
DENSE +2.7%/+3.7% (npl128/32) but MoE +29.9%/+43.9%. The MoE figure is NOT a valid ceiling: the
|
||||
garbage activation also corrupts the router (ffn_gate_inp) quantize -> degenerate topk expert
|
||||
selection -> less / better-localized expert work -> artificially fast. The authoritative
|
||||
decode decomposition (nsys --cuda-graph-trace=node, npp8 ntg128 npl128) shows quantize is only
|
||||
3.7% of MoE decode GPU-time, not 23%. Dense +2.7% IS real (rms_norm-fold territory, see D).
|
||||
|
||||
### C. SHIPPED - MoE NVFP4 activation-quantize de-dup (patch 0023) - BIT-EXACT, lifts decode+prefill
|
||||
ggml mul_mat_id quantizes the gathered rows ne11_flat = ne12*n_expert_used. For the broadcast
|
||||
up/gate proj (ne11==1) every expert of a token sees the SAME token activation, so stock
|
||||
re-quantizes each token n_expert_used (=4 here) times. quantize_mmq_nvfp4 has NO cross-thread
|
||||
reduction (per-16-element per-thread), so the gathered blocks are byte-identical across experts.
|
||||
Lever: quantize the ne12 unique tokens once, then gather the block_fp4_mmq rows into the
|
||||
expert-gathered layout with a coalesced uint4 copy (block_fp4_mmq = 9 uint4 = 144 B). GEMM
|
||||
untouched; down_proj (ne11==n_expert_used, distinct) keeps stock.
|
||||
- Gather v1 (per-thread 144 B struct copy) was UNCOALESCED: gather 478 ms ate 84% of the 570 ms
|
||||
quantize saving -> flat. Gather v2 (coalesced uint4, output written contiguously) = 32 ms.
|
||||
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), gather +32 ms, net -379 ms.
|
||||
- DECODE S_TG: MoE npl128 745.2 -> 758.1 (+1.73%), npl32 +0.6%. PREFILL T_PP -4%. DENSE byte-flat.
|
||||
- BIT-EXACT GATE (default on): q36-27b 5951a5b4 (unchanged), q36-35b-a3b 07db32c2 (on==off==0022);
|
||||
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0
|
||||
restores stock. Committed: DGX f7409c2 + worktree patch 0023.
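
Why the de-dup is byte-identical can be shown with a toy model. The quantizer below is a stand-in, not the NVFP4 format: the only property it shares with `quantize_mmq_nvfp4` is the one the argument relies on, namely that each 16-value block is quantized as a pure function of its own inputs. All names are hypothetical, and the real gathered layout is simplified to "one copy per (token, expert) slot".

```python
def toy_quantize_block(values):
    """Toy per-block quantizer: output depends only on these 16 inputs."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    return tuple(round(v / scale) for v in values)


def quantize_row(row, block=16):
    return tuple(toy_quantize_block(row[i:i + block]) for i in range(0, len(row), block))


def stock_path(tokens, experts_per_token):
    # Gather first: every (token, expert) slot re-quantizes the same activation.
    return [quantize_row(tokens[t]) for t, ids in enumerate(experts_per_token) for _ in ids]


def dedup_path(tokens, experts_per_token):
    # Quantize each unique token once, then copy the quantized row per slot.
    quantized = [quantize_row(row) for row in tokens]
    return [quantized[t] for t, ids in enumerate(experts_per_token) for _ in ids]


tokens = [[float((i * j) % 7 - 3) for j in range(32)] for i in range(1, 4)]
experts_per_token = [[0, 1], [1, 2], [0, 2]]
print(stock_path(tokens, experts_per_token) == dedup_path(tokens, experts_per_token))  # True
```

The toy does one third fewer quantize calls here (3 instead of 6); in the patch the saving scales with `n_expert_used`, and the remaining cost is the gather copy.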
|
||||
|
||||
### D. NOT built - dense quantize producer-fold (rms_norm -> fp4) - real but ~2.7%, needs graph fusion
|
||||
Dense decode quantize is ~2.7% (skip B, real). Folding it into the rms_norm+mul producer is
|
||||
bit-exact-feasible (keep the strided sumsq reduction byte-identical, re-partition only the
|
||||
writeback to 16-consecutive-per-thread + the verbatim per-thread quant) but requires a 3-op
|
||||
{RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion hoisting the GEMM into the producer node and a
|
||||
mul_mat_q pre-quantized-src1 path (the scratch is a per-call pool buffer). High plumbing for
|
||||
~2.7% dense only; left for a follow-up. mul_mat_q (bandwidth wall), flash_attn (softmax rescale
|
||||
order), lm_head (cublas) have NO bit-exact lever.
|
||||
|
||||
## Verdict
|
||||
The non-recurrence path has ONE shippable bit-exact decode lever found and built: the MoE
|
||||
quantize de-dup (0023, +1.73% MoE npl128 decode, ~4% less prefill time, dense untouched, byte-identical).
|
||||
It is the only redundant-work bucket; the rest of the non-recurrence kernels are at their
|
||||
bit-exact floor (mul_mat_q bandwidth-bound, quantize occupancy-flat, attention softmax-locked).
|
||||
The remaining bit-exact headroom is the dense rms_norm->fp4 producer-fold (~2.7% dense, graph-
|
||||
fusion surgery, not built) and then bf16 state (precision change, shelved) - no other bit-exact
|
||||
lever moves the LPDDR5x-bandwidth-bound, recurrence-dominated (~50%, past vLLM parity) decode wall.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,119 +0,0 @@
|
||||
# OCCUPANCY_RETUNE_RESULTS.md - CRUX SETTLED: vLLM recurrence state is FLOAT32 (805 MB/call)
|
||||
|
||||
Phase: vllm-f32-confirm (GPU agent). DGX GB10, peak DRAM BW = 273 GB/s.
|
||||
Checkpoint: ~/bench/q36-27b-nvfp4-vllm (vLLM 0.23.0), ~/bench/q36-27b-nvfp4.gguf (llama HEAD 58426b5, conv-fusion 0021).
|
||||
NOTE: ncu HW perf-counters are perm-blocked on this node (RmProfilingAdminOnly:1, no passwordless sudo, ERR_NVGPUCTRPERM).
|
||||
Settled WITHOUT counters: (a) empirical tensor dtype at the kernel boundary, (b) nsys/CUPTI kernel timing (counter-free), (c) source+config chain.
|
||||
|
||||
## VERDICT: f32. The close-check is RIGHT. The byte-gate (402 MB/bf16) is WRONG. BUILD THE BIT-EXACT OCCUPANCY RETUNE.
|
||||
|
||||
vLLM carries the gated-DeltaNet TEMPORAL/recurrent state in FLOAT32 and moves 805.3 MB/call, NOT 402 MB bf16.
|
||||
Both engines move the SAME ~805 MB f32 recurrent state per call. The gap is pure BANDWIDTH EFFICIENCY on equal f32 bytes.
|
||||
|
||||
## vLLM (kernel: fused_recurrent_gated_delta_rule_packed_decode)
|
||||
- EMPIRICAL tensor at kernel boundary (initial_state = self.kv_cache[1], qwen_gdn_linear_attn.py:1316/1492):
|
||||
dtype=torch.float32 elem_bytes=4 shape=(1553, 48, 128, 128) per-slot state = 786432 elems = 3.000 MiB (f32)
|
||||
- MB/call (B=128, Read+Write) = 128 * 48*128*128 * 4 bytes * 2 = 805,306,368 B = 805.3 MB (bf16 would be 402.7 MB)
|
||||
- Runtime engine config: cache_config.mamba_ssm_cache_dtype = float32 (mamba_cache_dtype=auto/bf16 for conv)
|
||||
- Source chain: config.json text_config.mamba_ssm_dtype=float32 -> Qwen3_5ForConditionalGenerationConfig.verify_and_update_config
|
||||
sets cache_config.mamba_ssm_cache_dtype="float32" -> MambaStateDtypeCalculator._mamba_state_dtype else-branch
|
||||
-> temporal_state_dtype = torch.float32 (conv state = bf16; temporal/SSM state = f32).
|
||||
- Kernel timing (CUDA events, eager B=128, 432 steady-decode calls): median 3.578 ms/call, min 3.499, mean 3.593, p90 3.635
|
||||
BW @ median = 805.3MB / 3.578ms = 225.1 GB/s = 82.4% of 273 peak (min 84.3%, p90 81.1%)
|
||||
|
||||
## llama (kernel: gated_delta_net_cuda<128, 0, 0>)
|
||||
- Kernel signature: all operands const float* (q,k,v,g,beta,curr_state) + float* state_dst => recurrent state is f32. Source-confirmed.
|
||||
- Identical state geometry (48 value-heads x 128 head_v x 128 head_k, B=128) => MB/call (R+W) = 805.3 MB f32 (same as vLLM).
|
||||
- Fresh nsys (--cuda-graph-trace=node, build-cuda-base, -npp128 -ntg24 -npl128, q36-27b-nvfp4.gguf):
|
||||
gated_delta_net = 25.4% of GPU time (#2 kernel after nvfp4 mul_mat_q).
|
||||
Decode cluster isolated = exactly n=1152 calls (= 24 ntg x 48 GDN layers), B=128 steady state:
|
||||
median 4.0211 ms/call, mean 4.0315 => 200.3 GB/s = 73.4% of 273 peak.
|
||||
(Consistent with prior GAP_PROGRESS 4.08ms/~70% and context 3.98ms/202GB/s/74%.)
|
||||
|
||||
## THE GAP (equal f32 bytes, different efficiency)
|
||||
llama 805.3 MB / 4.021 ms = 200.3 GB/s = 73.4% peak
|
||||
vLLM 805.3 MB / 3.578 ms = 225.1 GB/s = 82.4% peak
|
||||
=> vLLM takes ~11% less time per recurrence call (4.021 -> 3.578 ms) at IDENTICAL byte volume => ~9 pts more DRAM BW efficiency.
|
||||
Retune target: 73.4% -> ~82% peak, recurrence 4.02 -> ~3.58 ms/call, KEEPING exact per-column f32
|
||||
reduction/FMA order (md5-gateable bit-identical). bf16 plan stays SHELVED (optional over-clock only).
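
The byte and bandwidth figures above can be re-derived with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in this note (state geometry, B=128, the two median timings, the 273 GB/s peak); the function names are illustrative and do not come from either engine.

```python
PEAK_GBPS = 273.0  # GB10 peak DRAM bandwidth quoted in this note


def state_bytes_per_call(batch=128, heads=48, head_v=128, head_k=128, elem_bytes=4):
    """Bytes moved per recurrence call: each sequence's state is read once and written once."""
    return batch * heads * head_v * head_k * elem_bytes * 2


def bandwidth(bytes_moved, ms_per_call):
    """Return (GB/s, fraction of peak) for one call of the given duration."""
    gbps = bytes_moved / (ms_per_call * 1e-3) / 1e9
    return gbps, gbps / PEAK_GBPS


f32_bytes = state_bytes_per_call()               # f32 state, as measured
bf16_bytes = state_bytes_per_call(elem_bytes=2)  # what the byte-gate assumed

for engine, ms in (("llama", 4.021), ("vLLM", 3.578)):
    gbps, frac = bandwidth(f32_bytes, ms)
    print(f"{engine}: {f32_bytes / 1e6:.1f} MB, {gbps:.1f} GB/s, {100 * frac:.1f}% of peak")
```

Because both engines move the same f32 byte count, the only variable left in the comparison is the time per call.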
|
||||
|
||||
---
|
||||
|
||||
# retune-build (BUILD AGENT) — patch 0022 SHIPPED
|
||||
|
||||
vLLM verdict re-checked first: **f32, 805 MB/call** (the close-check is right, the byte-gate's 402 MB/bf16
|
||||
is wrong). The bf16-state plan stays SHELVED. Built the bit-exact occupancy/coalescing retune.
|
||||
|
||||
## The change — bit-exact column folding (Lever A + B + D)
|
||||
|
||||
`ggml/src/ggml-cuda/gated_delta_net.cu` `gated_delta_net_cuda`: two new template params
|
||||
`NUM_WARPS` (default 4) and `COLS_PER_WARP` (default 1) plus `MIN_BLOCKS`. Each warp now owns
|
||||
`COLS_PER_WARP` columns of the 128x128 recurrent state instead of 1, looping the existing per-column
|
||||
body over `col, col+NUM_WARPS, ...` inside a per-block column tile of `NUM_WARPS*COLS_PER_WARP` columns;
|
||||
`grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)`.
|
||||
|
||||
Why it is bit-exact: the S_v rows of every column stay sharded across the lanes by the SAME strided
|
||||
mapping `i = r*warp_size + lane`, and every column's per-lane FMA accumulation and
|
||||
`warp_reduce_sum<warp_size>` XOR-butterfly are byte-for-byte unchanged. Only the
|
||||
`(warp,block)->column` assignment and the order a warp visits its columns differ, and a column's f32
|
||||
value provably does not depend on either (columns are fully independent — column c reads only its own
|
||||
S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). The forbidden `float4`
|
||||
state load (Lever E) — which would repartition a lane to 4 contiguous rows and change the reduction
|
||||
grouping — was NOT done; this keeps the md5 invariant. Every global access stays identically coalesced
|
||||
(32 consecutive lanes -> one 128B sector), so this is a latency-coverage / scheduling win (higher
|
||||
per-warp memory-level parallelism: COLS_PER_WARP independent state-load bursts issued before any
|
||||
reduction + the independent butterfly reductions interleave to hide each other's shfl latency), NOT a
|
||||
coalescing change. The S_v=128 tile is env-selectable via `GDN_NW`/`GDN_CPW` for one-build re-tuning;
|
||||
default is the measured GB10 winner **(NUM_WARPS=16, COLS_PER_WARP=8)**.
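
The bit-exactness argument can be illustrated with a small CPU emulation. This is a hedged sketch, not the kernel: a plain dot product stands in for the real per-column body, float32 rounding is emulated with `struct`, the multiply and add are rounded separately rather than fused, and `column_dot`, `column_schedule`, and `run` are hypothetical names. What it does show is the structure described above: the row-to-lane mapping and the butterfly are fixed per column, and only the column-to-warp assignment changes between tilings.

```python
import random
import struct

WARP_SIZE = 32
S_V = 128  # rows and columns of one head's recurrent state


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def column_dot(col_state, k):
    """Reduce one column: lane-strided partial sums, then an XOR butterfly."""
    partial = [0.0] * WARP_SIZE
    for r in range(S_V // WARP_SIZE):
        for lane in range(WARP_SIZE):
            i = r * WARP_SIZE + lane  # same strided row mapping for every tiling
            partial[lane] = f32(partial[lane] + f32(col_state[i] * k[i]))
    offset = WARP_SIZE // 2
    while offset:
        partial = [f32(partial[l] + partial[l ^ offset]) for l in range(WARP_SIZE)]
        offset //= 2
    return partial[0]


def column_schedule(num_warps, cols_per_warp):
    """Yield (block_z, warp, col) in the order each warp visits its columns."""
    tile = num_warps * cols_per_warp
    for block_z in range(S_V // tile):
        for warp in range(num_warps):
            for j in range(cols_per_warp):
                yield block_z, warp, block_z * tile + warp + j * num_warps


def run(state, k, num_warps, cols_per_warp):
    """Compute every column under one tiling; state[c] is column c's slice."""
    out = [None] * S_V
    for _, _, col in column_schedule(num_warps, cols_per_warp):
        out[col] = column_dot(state[col], k)
    return out


random.seed(0)
k = [f32(random.uniform(-1, 1)) for _ in range(S_V)]
state = [[f32(random.uniform(-1, 1)) for _ in range(S_V)] for _ in range(S_V)]
print(run(state, k, 4, 1) == run(state, k, 16, 8))
```

Each tiling visits every column exactly once and calls the same per-column routine, so the outputs compare equal value for value. A variant that changed the row-to-lane mapping inside `column_dot` (the float4 case) would alter the summation grouping and is deliberately not modeled here.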
|
||||
|
||||
## %peak sweep — GB10, CUDA 13, sm_121 (nsys CUPTI timing; HW counters perm-blocked)
|
||||
|
||||
Metric: median of the 1152 (=ntg24 x 48 layers) B=128 decode calls, each moving 805.3 MB f32 (R+W),
|
||||
isolated by the [2.5ms,6ms] band; %peak vs 273 GB/s. Baseline re-isolation reproduced the confirm
|
||||
agent's 4.021 ms / 73.4% exactly (n=1152).
|
||||
|
||||
| NUM_WARPS x COLS_PER_WARP | ms/call | GB/s | %peak |
|---------------------------|---------|------|-------|
| base (0021) | 4.021 | 200.3 | 73.4 |
| 4 x 1 (control == base) | 4.034 | 199.7 | 73.1 |
| 4 x 2 | 3.887 | 207.2 | 75.9 |
| 4 x 4 | 3.775 | 213.3 | 78.1 |
| 8 x 1 | 3.837 | 209.9 | 76.9 |
| 8 x 2 | 3.749 | 214.8 | 78.7 |
| 8 x 4 | 3.699 | 217.7 | 79.9 |
| 8 x 8 | 3.586 | 224.6 | 82.3 |
| 16 x 2 | 3.665 | 219.8 | 80.5 |
| 16 x 4 | 3.585 | 224.7 | 82.3 |
| **16 x 8 (WINNER/default)** | **3.488** | **230.9** | **84.6** |
| 32 x 4 | 3.489 | 230.8 | 84.6 |
|
||||
|
||||
Plateau ~84.5% at the grid.z=1 tiles; (16,8) picked as default (512-thread block, no spill, no
|
||||
1024-thread .minnctapersm warning). **84.6% > vLLM 82.4%.**
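
The launch geometry behind each sweep row follows from the two template parameters. A minimal sketch, assuming a 32-lane warp and one block per column tile as described above; `launch_geometry` is an illustrative helper, not a function in the patch.

```python
WARP_SIZE = 32
S_V = 128  # state columns per head

SWEEP = [(4, 1), (4, 2), (4, 4), (8, 1), (8, 2), (8, 4), (8, 8),
         (16, 2), (16, 4), (16, 8), (32, 4)]


def launch_geometry(num_warps, cols_per_warp):
    """Threads per block and grid.z implied by one (NUM_WARPS, COLS_PER_WARP) tile."""
    tile_cols = num_warps * cols_per_warp
    return {
        "threads_per_block": num_warps * WARP_SIZE,
        "tile_cols": tile_cols,
        "grid_z": S_V // tile_cols,  # grid.z = S_v / (NUM_WARPS * COLS_PER_WARP)
    }


for nw, cpw in SWEEP:
    g = launch_geometry(nw, cpw)
    print(f"{nw:2d} x {cpw}: {g['threads_per_block']:4d} threads/block, grid.z = {g['grid_z']}")
```

Only (16,8) and (32,4) give a 128-column tile and therefore `grid.z = 1`; of the two, (16,8) stays at 512 threads per block while (32,4) needs 1024.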
|
||||
|
||||
## Gates (both PASS, non-negotiable)
|
||||
|
||||
- **md5 BYTE-IDENTICAL to the 0021 baseline**, greedy `--temp 0 --seed 1 -n 48`, both models, winner
|
||||
(16,8 default) AND (4,1 control):
|
||||
- q36-27b-nvfp4 (dense): `5951a5b4d624ce891e22ab5fca9bc439` (baseline == winner == control)
|
||||
- q36-35b-a3b-nvfp4 (MoE): `07db32c2bcb78d17a43ed18bc22705cd` (baseline == winner == control)
|
||||
- **test-backend-ops -o GATED_DELTA_NET: 36/36 PASS** (covers head_size=128, kda=0/1, prefill K>1).
|
||||
|
||||
## Decode throughput — base vs flag(16,8), llama-batched-bench -npp128 -ntg128 -fa on
|
||||
|
||||
| model | npl | base S_TG t/s | flag S_TG t/s | gain |
|-------|-----|---------------|---------------|------|
| dense 27b | 32 | 199.2 | 207.6 | +4.2% |
| dense 27b | 128 | 335.9 | 373.2 | +11.1% |
| MoE 35b-a3b | 32 | 420.6 | 440.0 | +4.6% |
| MoE 35b-a3b | 128 | 688.4 | 745.7 | +8.3% |
|
||||
|
||||
Prefill S_PP unchanged (dense ~930, MoE ~2185 t/s) — no regression. Stable across 3 samples.
|
||||
|
||||
## Parity vs vLLM (recurrence kernel)
|
||||
|
||||
Recurrence kernel BW: before 200.3 GB/s = 89.0% of vLLM's 225.1; **after 230.9 GB/s = 102.6% of vLLM**
|
||||
(3.488 ms/call < vLLM 3.578 ms/call). The recurrence bandwidth gap that this workflow set out to close
|
||||
is closed and slightly exceeded; the remaining decode-parity delta lives in the non-recurrence path
|
||||
(matmul/attn), not in gated-DeltaNet.
|
||||
|
||||
Shipped: patch 0022, committed on the DGX dev tree and the LocalAI worktree. No push.
|
||||
@@ -1,511 +0,0 @@
|
||||
# OTHER_PATHS_INVESTIGATION.md
|
||||
|
||||
Read-only investigation of the four post-0023 paths (A MoE grouped-GEMM, B lm_head GEMM,
|
||||
C TTFT/paged-pool burst, D dense CUDA-graph). One section per agent. No GPU except the
|
||||
moe-gpu-profile agent.
|
||||
|
||||
---
|
||||
|
||||
## A. MoE grouped-GEMM gap (label: moe-gemm-source, READ-ONLY, no GPU)
|
||||
|
||||
### The decisive finding: vLLM's MoE on GB10 is MARLIN W4A16, not a native-FP4 grouped GEMM
|
||||
|
||||
Engine-log ground truth (`VLLM_DECODE_GROUNDING.md`, from `~/bench/h2h_moe_vllm.log`):
|
||||
`"Using 'MARLIN' NvFp4 MoE backend ... Your GPU does not have native support for FP4
|
||||
computation ... Weight-only FP4 compression will be used leveraging the Marlin kernel"`.
|
||||
vLLM does NOT take its native-FP4 cutlass/trtllm MoE path on sm_121 (it whitelists only
|
||||
sm_100/103 datacenter Blackwell for FP4-MMA MoE). So on this box vLLM's MoE is:
|
||||
|
||||
- `moe_align_block_size` (BLOCK-PADDED token-sort; `num_tokens_post_padded`, sentinel pad rows),
|
||||
- **2 grouped `moe_wna16_marlin_gemm` launches per MoE block** (gate_up, then SiLU+mul, then down),
|
||||
each ONE launch over ALL experts, `use_fp32_reduce=True`,
|
||||
- **W4A16: activations stay bf16, NEVER quantized**; FP4 weights dequantized in-kernel to bf16,
|
||||
bf16 MMA,
|
||||
- the whole decode step under a FULL CUDA graph.
|
||||
|
||||
llama's MUL_MAT_ID on GB10 (mmq.cu id-branch + mmid.cu + mmq.cuh stream-k) is:
|
||||
|
||||
- `mm_ids_helper` token-sort/scatter, **NO block padding** (exact segments, `expert_bounds`),
|
||||
- **activation FP4 quantize** (`quantize_mmq_fp4`) of the expert-gathered rows = W4A4,
|
||||
- **1 persistent stream-k `mul_mat_q<NVFP4>` launch per projection**, native Blackwell FP4-MMA
|
||||
(`block_fp4_mmq`), fp32 accumulate + `stream_k_fixup`,
|
||||
- per-expert-density `mmq_x` (M-tile) select (patches 0014/0015, default tile 64 @ density<=8),
|
||||
- NOT under a CUDA graph.
|
||||
|
||||
### So the "missing fused grouped GEMM" does not exist - llama already HAS it
|
||||
|
||||
llama's grouped FP4-MMA stream-k IS the same sorted-grouped-GEMM algorithm vLLM uses, and on
|
||||
GB10 llama's MoE GEMM is at a HIGHER-precision/native-FP4 tier than vLLM's W4A16 Marlin. The
|
||||
MoE decode gap (77-83% of vLLM vs dense 90-117%) is therefore NOT a grouped-GEMM-architecture
|
||||
deficit. The MoE-specific EXTRA gap (the ~10-15pt that MoE is worse than dense) decomposes as:
|
||||
|
||||
1. **W4A4 activation-quantize tax (llama-only, the biggest MoE-specific discrete cost).**
|
||||
llama quantizes activations to FP4 for the MoE GEMM; vLLM (W4A16) keeps them bf16 and pays
|
||||
ZERO activation quantize. At MoE decode npl128 that is 1024 up/gate rows (patch 0023 dedup'd
|
||||
the broadcast ones to 128 unique + a coalesced block gather) PLUS 1024 down_proj rows
|
||||
(distinct per expert, CANNOT be dedup'd). nsys decode-isolated (`MOE_QUANT_DEDUP_RESULTS.md`):
|
||||
`quantize_mmq_nvfp4` is still **457 ms** of decode GPU-time after the 0023 up/gate dedup; the
|
||||
remaining bulk is the down_proj per-expert re-quantize. vLLM's W4A16 choice is actually SMART
|
||||
for MoE decode on a bandwidth-bound box: keeping activations bf16 adds negligible activation
|
||||
bandwidth at M~8/expert but ELIMINATES the entire quantize pass.
|
||||
|
||||
2. **Un-graphed extra MoE nodes' launch bubbles.** Per MoE layer llama runs mm_ids_helper +
|
||||
quantize + gather + 2 grouped GEMMs + SiLU/mul + down-quantize + moe_sum as separate
|
||||
host-launched ggml nodes, none under a CUDA graph; vLLM runs moe_align + 2 grouped launches
|
||||
under a full decode graph. This is the SAME launch-bubble root cause `CRITICALPATH_GAP_ANALYSIS.md`
|
||||
pins for the GDN region (57 ms/step dense = 100% bubble), amplified for MoE by the extra
|
||||
quantize/gather/scatter nodes - consistent with MoE being relatively worse than dense.
|
||||
|
||||
3. **Ragged tiny-M tile + `need_check` partial-tail MMA** in the grouped stream-k. Already
|
||||
addressed by 0014/0015 and measured **NEUTRAL** on q36-35b-a3b: that model is
bandwidth/SSM-recurrence-bound, not col-tile-occupancy-bound (the `LLAMA_MOE_DECODE_TILE` sweep shows 64
|
||||
is the only non-negative width and it is within noise). So the M-tile lever has nothing to
|
||||
bite on for THIS model; it banks +4.8% only on col-tile-bound MoE (Qwen3-Coder-30B).
|
||||
|
||||
### Bit-exact llama MoE-GEMM levers (ranked)
|
||||
|
||||
- **M1 (bit-exact, modest): down_proj activation-quantize kernel retune.** The remaining ~457 ms
|
||||
is dominated by the down_proj per-expert FP4 re-quantize (`ne11==n_expert_used`, no dedup
|
||||
possible). The per-block quantize is a pure per-thread function of 16 consecutive inputs (the
|
||||
property 0023 exploited to make its gather bit-exact), so the launch GEOMETRY can be retuned
|
||||
(occupancy/coalescing, like 0022 did for the recurrence and like 0023's coalesced-uint4 gather
|
||||
fix) while the quantized bytes stay BYTE-IDENTICAL. Also worth checking whether the down gather
|
||||
(`ids_src1`) is redundant when the SwiGLU intermediate is already expert-contiguous. Scope:
|
||||
nsys the down-branch `quantize_mmq_fp4` on GB10, retune block/grid, gate on test-backend-ops
|
||||
MUL_MAT_ID exact + greedy md5 == 0023. Expected: low single-digit % at npl128 (bounded - it is
|
||||
a fraction of a fraction of the step), but it is the only clean quantize-axis lever left after
|
||||
0023 and it is strictly bit-exact.
|
||||
|
||||
- **M2 (bit-exact, the structurally-correct big one, SHARED with path D/A.2): CUDA-graph the MoE
|
||||
decode step.** Graph replay does not change numerics => bit-exact. The MoE-specific extra node
|
||||
count (quantize+gather+scatter+2 GEMM+silu+sum/layer, none graphed) makes the launch-bubble tax
|
||||
larger for MoE than dense, which is exactly why MoE sits at 77-83% while dense is 90-117%.
|
||||
Capturing the decode forward removes those bubbles. This is the same lever the GDN/A.2 work
|
||||
scoped; it helps MoE MORE than dense. Highest-leverage bit-exact MoE win, but it is a
|
||||
decode-graph-capture project, not a MoE-GEMM kernel edit.
|
||||
|
||||
- **M0 (already shipped): 0017 `GGML_CUDA_FP4_MINBLOCKS` (min-resident-CTAs register-cap) and
|
||||
0014/0015 (`mmq_x` density auto-tile) already cover the FP4-MMA occupancy + M-tile axes of the
|
||||
SHARED `mul_mat_q<NVFP4>` kernel.** 0017 is bit-exact (register allocation cannot change
|
||||
results) and was tuned on dense; a MoE-targeted min-blocks re-sweep (grouped per-expert M-tiles
|
||||
have different occupancy than the dense M=128 GEMM) is a cheap bit-exact follow-up, but
|
||||
MOE_DENSITY_AUTO_TILE already found this model is bandwidth-bound, so headroom is likely small.
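
The property M1 relies on (quantization is a pure function of each 16-value block, so where and when a block is quantized cannot change its bytes) can be shown with a toy model. This is a sketch under loud assumptions: `quantize_block` is a simplified stand-in with one scale per block and the eight E2M1 magnitudes, not the real NVFP4 encoding or the `quantize_mmq_fp4` kernel, and the tiny shapes are made up for illustration.

```python
import random

E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16


def quantize_block(values):
    """Toy quantizer: one scale per 16 values, nearest E2M1 magnitude per value."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax else 1.0
    codes = []
    for v in values:
        mag = abs(v) / scale
        idx = min(range(len(E2M1_LEVELS)), key=lambda i: abs(E2M1_LEVELS[i] - mag))
        codes.append(-idx if v < 0 else idx)
    return scale, tuple(codes)


def quantize_row(row):
    """Quantize a row block by block; each block depends only on its own 16 inputs."""
    return tuple(quantize_block(row[i:i + BLOCK]) for i in range(0, len(row), BLOCK))


def gather_then_quantize(unique_rows, ids):
    """Quantize every gathered row, including repeated ones."""
    return [quantize_row(unique_rows[i]) for i in ids]


def quantize_then_gather(unique_rows, ids):
    """Quantize each unique row once, then gather the quantized blocks."""
    quantized = [quantize_row(r) for r in unique_rows]
    return [quantized[i] for i in ids]


random.seed(1)
tokens, hidden, top_k = 4, 32, 2
rows = [[random.uniform(-3, 3) for _ in range(hidden)] for _ in range(tokens)]
ids = [t for t in range(tokens) for _ in range(top_k)]  # each token broadcast to top_k experts
print(gather_then_quantize(rows, ids) == quantize_then_gather(rows, ids))
```

The same reasoning is why a launch-geometry retune of the down_proj quantize can be gated on byte-identical output: changing which thread handles a block leaves the block's inputs, and so its quantized bytes, untouched.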
|
||||
|
||||
### NOT recommended (explicitly out of scope)
|
||||
|
||||
- **W4A16 bf16-activation MoE GEMM (matching vLLM's Marlin choice).** This is the single biggest
|
||||
MoE-specific structural difference and would erase the activation-quantize tax entirely, but it
|
||||
(a) is NOT bit-exact (bf16 activations vs llama's FP4), and (b) is the W4A16 occupancy-wall
|
||||
dead-end the docs flag (only ~9 TFLOP/178 t/s on GB10). Do not pursue.
|
||||
|
||||
### Verdict / ranking of path A
|
||||
|
||||
Path A is NOT a missing-kernel opportunity - llama already runs the sorted-grouped-FP4-MMA GEMM,
|
||||
at a higher native-FP4 tier than vLLM's GB10 W4A16 Marlin fallback. The MoE-specific extra gap is
|
||||
(1) the W4A4 activation-quantize tax vLLM structurally avoids by choosing W4A16, and (2) the same
|
||||
un-graphed launch-bubble tax as the GDN region, amplified by MoE's extra nodes. The only purely
|
||||
bit-exact, MoE-GEMM-local lever left is M1 (down_proj quantize retune, modest). The real MoE
|
||||
bit-exact win is M2 (CUDA-graph the decode step), which is the SAME lever as path A.2/D and helps
|
||||
MoE more than dense - so A's best lever collapses into the decode-graph effort rather than
|
||||
standing alone. Recommend ranking A's standalone kernel value BELOW the decode-graph (M2/D) and
|
||||
the lm_head (B) levers; fold A into the decode-graph build, and keep M1 as a cheap bit-exact
|
||||
bank-shot.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
## B. lm_head GEMM (label: cublas-lmhead, READ-ONLY, no GPU)
|
||||
|
||||
### The decisive fact: lm_head is BF16, not NVFP4 - so it CANNOT take the FP4 MMQ path
|
||||
|
||||
`output.weight` (the LM head) in q36-35b-a3b-nvfp4 is **type 30 = GGML_TYPE_BF16, NOT quantized**
|
||||
(verified in `DECODE_PARITY_EXPLORE.md:298`: "2425 MB = 2.37 GB, read in full each step", 16% of
|
||||
weight traffic). This is by construction: the model was quantized with `--tensor-type attn/ffn=nvfp4`,
which converts the attn+ffn tensors to NVFP4 and **leaves `output.weight` (and `tok_embd`)
|
||||
at base BF16** - the standard recipe, because the final projection is the most logit-sensitive
|
||||
tensor. The NVFP4 sidecar scales (`output_s`, `output_in_s`) are only created when
|
||||
`output->type == GGML_TYPE_NVFP4` (`llama-model.cpp:1459`), so for the BF16 head `model.output_s`
|
||||
is null.
|
||||
|
||||
### Why it runs cublas/nvjet and not MMQ (exact routing trace)
|
||||
|
||||
Graph: `qwen35moe.cpp:244` `cur = build_lora_mm(model.output, cur, model.output_s)` ->
|
||||
`llama-graph.cpp:1093` is just `ggml_mul_mat(ctx0, w, cur)` (the null `w_s` skips the scale `ggml_mul`).
|
||||
Then `ggml_cuda_mul_mat` (`ggml-cuda.cu:2540`) decides the kernel:
|
||||
|
||||
- `use_mul_mat_q` / `use_mul_mat_vec_q` BOTH require `ggml_is_quantized(src0->type)`. BF16 is NOT
|
||||
quantized (`is_quantized=false` for F16/BF16/F32; NVFP4 IS `is_quantized=true`, `ggml.c:748`).
|
||||
=> **both MMQ paths are ineligible for the BF16 head.** (If the head were NVFP4 it WOULD route to
|
||||
the tuned FP4 `mul_mat_q` - this is exactly the difference.)
|
||||
- At decode npl128 the activation `src1->ne[1] = 128` columns: `use_mul_mat_vec_f` is gated off by
|
||||
the mmvf batch cap; `use_mul_mat_f` (the MMF bf16 tensor-core GEMM) is gated off by
|
||||
`ggml_cuda_should_use_mmf` for the wide `151936-row x 128-col` shape.
|
||||
- `use_batched_cublas_bf16` is true, but the batched-cublas branch additionally requires
|
||||
`src1->ne[2]*src1->ne[3] > 1` (a 3D/4D multi-batch GEMM). The decode lm_head is 2D
|
||||
(`ne[2]*ne[3] == 1`) => **batched-cublas branch is skipped.**
|
||||
- => falls through to the final `else`: `ggml_cuda_op_mul_mat_cublas`. With `src0` BF16 +
bf16-MMA hardware it takes the BF16 branch (`ggml-cuda.cu:1663`):
`cublasGemmEx(CUDA_R_16BF, CUDA_R_16BF -> CUBLAS_COMPUTE_32F, TENSOR_OP)`. **That cublasLt kernel is `nvjet_sm121`.**
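
The routing trace above can be condensed into a small decision function. This is a simplified model of the predicates as this note describes them, not a transcription of `ggml_cuda_mul_mat`: the `MatMul` fields, the order of the checks, and the returned labels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MatMul:
    src0_quantized: bool  # ggml_is_quantized(src0->type); false for F16/BF16/F32
    mmvf_ok: bool         # src1 column count is within the mmvf batch cap
    mmf_ok: bool          # ggml_cuda_should_use_mmf accepts the shape
    batch_dims: int       # src1->ne[2] * src1->ne[3]


def pick_kernel(op):
    """Return the kernel family the note's routing trace would land on."""
    if op.src0_quantized:
        return "mul_mat_q / mul_mat_vec_q"
    if op.mmvf_ok:
        return "mul_mat_vec_f"
    if op.mmf_ok:
        return "mul_mat_f"
    if op.batch_dims > 1:
        return "batched cublas"
    return "ggml_cuda_op_mul_mat_cublas"


# Decode lm_head at npl128: BF16 weight, 128 columns, plain 2D GEMM.
bf16_head = MatMul(src0_quantized=False, mmvf_ok=False, mmf_ok=False, batch_dims=1)
# Hypothetical NVFP4 head with the same shape.
nvfp4_head = MatMul(src0_quantized=True, mmvf_ok=False, mmf_ok=False, batch_dims=1)

print(pick_kernel(bf16_head))
print(pick_kernel(nvfp4_head))
```

The two cases differ in a single predicate, the weight type, which is the point of the trace: the shape is the same, and only quantizing the head would move it onto the MMQ path.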
|
||||
|
||||
Cost (both models): dense `nvjet` lm_head = **12.17 ms = 3.66% of the 332.8 ms dense step**
|
||||
(`F16_DENSE_RESIDUAL_PROBE.md:65`); MoE = **11.91 ms = 3.1%** (`CRITICALPATH_GAP_ANALYSIS.md:398`).
|
||||
|
||||
### CRITICAL correction: the current head is NOT "f32-lm_head" - it is already BF16-rounded
|
||||
|
||||
The task brief calls the baseline "f32-lm_head"; it is not. The cublas BF16 branch **downcasts the
|
||||
F32 activation to BF16**, does BF16xBF16 with F32 accumulate, **writes the result as BF16** (dst is
|
||||
`CUDA_R_16BF`), then upcasts BF16->F32. So today's "bit-exact reference" logits are already
|
||||
**BF16-precision**, not f32. Two consequences:
|
||||
1. Any NVFP4/FP8 head swap is measured against a BF16 baseline, not f32 - the precision delta vs
|
||||
the *true* f32 head is partly already paid.
|
||||
2. A *different* BF16 GEMM kernel that also F32-accumulates and BF16-rounds the output is
|
||||
**bit-identical for the vast majority of logits** (differs only at rare BF16 rounding ties) -
|
||||
this is what makes option (c) below "essentially bit-exact".
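
To make the "already BF16-rounded" point concrete, here is a minimal emulation of the rounding chain described above: downcast the activation to BF16, multiply and accumulate in f32, round the result to BF16. It is a sketch, assuming round-to-nearest-even on the downcast and a simple left-to-right f32 accumulation; cuBLAS's actual accumulation order is not modeled, and `to_bf16` / `bf16_logit` are illustrative names.

```python
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite float32 value to bfloat16 (round-to-nearest-even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # bias so that ties go to the even mantissa
    bits &= 0xFFFF0000                   # keep sign, exponent, and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def bf16_logit(weights_bf16, activation_f32):
    """One output logit: BF16 x BF16 products, f32 accumulate, BF16-rounded result."""
    acc = 0.0
    for w, a in zip(weights_bf16, activation_f32):
        acc = f32(acc + w * to_bf16(a))  # a BF16 x BF16 product is exact in f32
    return to_bf16(acc)


print(to_bf16(1.0 + 2 ** -8))      # exact tie between two BF16 neighbours
print(to_bf16(1.0 + 3 * 2 ** -8))  # the next tie, one step up
print(bf16_logit([1.0, 0.5], [1.0 + 2 ** -8, 2.0]))
```

Two kernels that both accumulate in f32 and round the output to BF16 can only disagree when their f32 sums fall on opposite sides of a BF16 rounding boundary, which is the narrow case the "rounding ties" caveat refers to.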
|
||||
|
||||
### The options, and which break bit-exactness
|
||||
|
||||
- **(a) NVFP4-quantize the head -> tuned FP4 MMQ. BIGGEST win, BREAKS bit-exactness.** Weight
|
||||
2.37 GB BF16 -> ~0.67 GB NVFP4 (0.5625 B/wt vs 2 B/wt = ~3.6x fewer bytes) AND it then hits the already-tuned
`mul_mat_q<NVFP4>` (0017) instead of cublas. Memory-bound floor drops ~3.6x => save ~8-9 ms =
|
||||
~2.5% of the dense step. But NVFP4 < BF16 precision => **different logit bits, can flip the greedy
|
||||
argmax** = NOT bit-exact; and it is **UNFAIR vs vLLM**, which keeps its LM head BF16
|
||||
(`DECODE_PARITY_EXPLORE.md:358`: "fp8 LM head ... only matters if vLLM also quantizes it"). This
|
||||
is the same opt-in, non-bit-exact bucket as the f16-glue probe (already concluded SKIP).
|
||||
- **(b) FP8 / Q8_0 head.** Smaller error than NVFP4 but still != BF16 bits => still NOT bit-exact,
|
||||
and it is not even on the tuned FP4 MMQ path, so it buys less speed than (a). No reason to prefer.
|
||||
- **(c) Keep BF16 weight, swap the kernel (custom skinny wide-vocab streaming GEMM, or a cublasLt
|
||||
algo heuristic tuned for the thin-M / huge-N memory-bound shape).** The ONLY essentially-bit-exact
|
||||
option (F32 accumulate + BF16 round = identical except rounding ties, per the correction above).
|
||||
|
||||
### Realistic lever + scope: there is NO good bit-exact lever here
|
||||
|
||||
Bandwidth math kills option (c): `nvjet` moves 2.37 GB in ~11.9-12.2 ms = **~195-199 GB/s = ~72% of
|
||||
the GB10's 273 GB/s peak**. The lm_head GEMM is therefore **already one of the MOST
|
||||
bandwidth-efficient kernels in the step** - the overall decode step runs at only 40% util /
|
||||
110 GB/s (`DECODE_PARITY_EXPLORE.md`). The bit-exact ceiling is tiny: even a perfect
|
||||
DRAM-saturating kernel (199 -> 273 GB/s) takes 11.9 -> ~8.7 ms = **save ~3 ms = ~0.9% of the dense
|
||||
step**, and beating cublas's own tuned nvjet on a pure weight-stream shape is NOT guaranteed (it may
|
||||
already be near-optimal). High kernel-writing effort, uncertain sub-1% payoff.
(`F16_DENSE_RESIDUAL_PROBE.md:97` independently estimates a bf16-glue nvjet recovery of only ~5 ms and flags it
"uncertain - may already run TF32" - consistent with little headroom.)
|
||||
|
||||
The structural reason: the head must read the **entire 2.37 GB weight for just 128 output columns**
|
||||
(inherently memory-bound), and **you cannot cut those weight bytes without changing the dtype** -
|
||||
i.e. bit-exactness and the only real speedup (fewer weight bytes) are **mutually exclusive** here.
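
The ceiling argument is plain roofline arithmetic and can be checked directly. A minimal sketch using only figures quoted in this section (2.37 GB head, ~11.9 ms measured, 273 GB/s peak, 332.8 ms dense step); the helper names are illustrative.

```python
PEAK_GBPS = 273.0      # GB10 peak DRAM bandwidth
HEAD_GB = 2.37         # BF16 lm_head weight read in full each step
DENSE_STEP_MS = 332.8  # dense decode step time quoted above


def achieved_gbps(gigabytes, ms):
    """Bandwidth implied by streaming `gigabytes` in `ms` milliseconds."""
    return gigabytes / (ms * 1e-3)


def floor_ms(gigabytes, peak_gbps=PEAK_GBPS):
    """Fastest possible time if the kernel streamed at full peak bandwidth."""
    return gigabytes / peak_gbps * 1e3


measured_ms = 11.9
best_ms = floor_ms(HEAD_GB)
saving_ms = measured_ms - best_ms

print(f"achieved: {achieved_gbps(HEAD_GB, measured_ms):.0f} GB/s")
print(f"floor at peak: {best_ms:.2f} ms, max saving {saving_ms:.2f} ms "
      f"= {100 * saving_ms / DENSE_STEP_MS:.2f}% of the dense step")
```

The only input a kernel swap can change is `measured_ms`; the byte count is fixed by the BF16 dtype, so the saving is bounded by the gap to `floor_ms` no matter how the kernel is written.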
|
||||
|
||||
### Verdict / ranking of path B
|
||||
|
||||
The lm_head cublas/nvjet GEMM is a **dead end for a bit-exact win**: it is already ~72% of peak DRAM bandwidth
|
||||
(the step's most efficient major kernel), so a bit-exact kernel swap caps at <1% with real risk and
|
||||
no guarantee of beating cublas. The only large win - NVFP4-quantizing the head (~2.5%) - is
|
||||
explicitly non-bit-exact AND unfair vs vLLM (which keeps BF16), so it lands in the same opt-in
|
||||
non-bit-exact bucket as f16-glue that was already shelved. Rank B's bit-exact value **at the bottom**
|
||||
of the four paths. The one worthwhile note for the team is the correction that the head is already
|
||||
BF16 (not f32), which slightly narrows what "bit-exact" even protects here; if the project ever
|
||||
opens a *non*-bit-exact opt-in track, NVFP4-head (option a) is a clean ~2.5% dense lever that rides
|
||||
the existing tuned FP4 MMQ - but it must be gated as opt-in and excluded from any vLLM-parity claim.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
## A.2 / D. GPU-measured MoE decode decomposition + dense-graph stability (label: moe-gpu-profile, THE GPU AGENT)
|
||||
|
||||
nsys `--cuda-graph-trace=node` on a steady MoE decode at npl128 (q36-35b-a3b-nvfp4, HEAD f7409c2,
|
||||
clean 0023 build-cuda). The measurement was decode-isolated: the run has a prefill phase (16384 tok,
|
||||
the big-GEMM region) followed by 64 steady decode steps; I segmented the timeline by GPU-idle gaps,
|
||||
dropped the prefill window, and aggregated per-kernel time over the 64-step decode window only
|
||||
(`moe_decode_npl128.{nsys-rep,trace.csv}` on the DGX; extractor `decfull.py`/`grid.py`).
|
||||
|
||||
### MoE decode window: 98.3% GPU-bound, ~165 ms/step. Per-kernel share of decode GPU-time:
|
||||
|
||||
```
41.9%  gated_delta_net_cuda              REC  (SHARED with dense, already tuned 0018-0022)
26.9%  mul_mat_q<NVFP4, M-tile=64>       MOE  expert grouped GEMM (MUL_MAT_ID)  <-- biggest MoE-specific kernel
 7.7%  nvjet_sm121 (cublas bf16)              attn/gdn bf16 projections + the BF16 lm_head (path B)
 2.7%  cutlass_80 bf16 s16816 relu            bf16 GEMM (shared-expert / gate)
 2.7%  k_bin_bcast (mul/add)                  expert-combine + routing-weight scale + glue
 2.6%  k_get_rows_float                  REC  recurrent-state gather
 2.4%  flash_attn_ext_f16                     attention
 2.3%  mul_mat_q<NVFP4, M-tile=128>           router / non-grouped FP4 GEMM
 2.1%  ssm_conv(+update)                 REC
 2.0%  quantize_mmq_nvfp4                MOE  W4A4 activation-quantize tax (3.25 ms/step)
 1.8%  convert_unary                          bf16<->f32 glue around the bf16 projections
 1.5%  cpy_scalar                             glue
 0.9%  rms_norm
 0.5%  REC gating act | 0.5% streamk_fixup | 0.3% mm_ids_helper | 0.3% argsort |
 0.2%  l2norm | 0.2% set_rows | 0.1% gather_mmq_fp4 | <0.1% topk/softmax/reduce (routing)
```
|
||||
|
||||
Bucketed: **Recurrence (shared, tuned) ~= 47.5%** (gdn 41.9 + get_rows 2.6 + ssm_conv 2.1 + gating
|
||||
0.5 + l2norm 0.2 + set_rows 0.2). **MoE FFN+routing block ~= 31%** (grouped GEMM 26.9 + activation
|
||||
quant 2.0 + streamk 0.5 + mm_ids_helper/argsort/gather/softmax/topk/reduce ~1.3 + the expert-combine
|
||||
share of bin_bcast). **cublas/cutlass bf16 projections ~= 10.4%** (nvjet 7.7 + cutlass 2.7).
|
||||
Attention ~2.4%. The recurrence is the single biggest term but it is shared with dense and already
|
||||
the subject of 0018-0022, so it is NOT a MoE lever.
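
The bucket totals can be reproduced from the per-kernel shares. A minimal sketch: the shares are copied from the listing above, the bucket membership follows the "Bucketed" paragraph, and converting a share to ms/step by multiplying with the GPU-busy part of the ~165 ms step is an assumption about how the listing's ms figures were derived.

```python
STEP_MS = 165.0   # approximate MoE decode step time
GPU_BUSY = 0.983  # fraction of the decode window the GPU is busy

RECURRENCE = {
    "gated_delta_net_cuda": 41.9,
    "k_get_rows_float": 2.6,
    "ssm_conv(+update)": 2.1,
    "gating act": 0.5,
    "l2norm": 0.2,
    "set_rows": 0.2,
}
BF16_PROJECTIONS = {"nvjet_sm121": 7.7, "cutlass_80 bf16": 2.7}


def bucket_share(bucket):
    """Total share of decode GPU time for one bucket, in percent."""
    return round(sum(bucket.values()), 1)


def share_to_ms(percent):
    """Convert a share of decode GPU time into ms per step (assumed derivation)."""
    return STEP_MS * GPU_BUSY * percent / 100.0


print("recurrence:", bucket_share(RECURRENCE), "%")
print("bf16 projections:", bucket_share(BF16_PROJECTIONS), "%")
print("quantize_mmq_nvfp4:", round(share_to_ms(2.0), 2), "ms/step")
```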
|
||||
|
||||
### The biggest MoE-specific kernel (the lever): mul_mat_q<NVFP4, M-tile=64> grouped GEMM
|
||||
|
||||
26.9% of decode = ~43.5 ms/step, avg **373 us/call**, grids of **2048 and 8192** 64-wide tiles
|
||||
(blk=32 = 1 warp/block). Compare the dense FFN GEMM in the same family at npl128:
`mul_mat_q<NVFP4, M-tile=128>` avg **31 us/call**, grid 48. The grouped per-expert GEMM is ~12x the per-call cost and
launches ~43-171x more tiles (2048/48 and 8192/48) because each of 128 experts is a separate tiny-M sub-GEMM (128 tokens x
|
||||
top-k / 128 experts ~= a handful of rows per expert) padded into 64-wide tiles. This is exactly the
|
||||
ragged-tiny-M / col-tile-occupancy axis section A's 0014/0015 `mmq_x` density auto-tile already
|
||||
covers and measured NEUTRAL on this bandwidth-bound a3b model. MMQ FP4 is integer/FP4-exact
|
||||
independent of tile geometry, so this kernel IS bit-exact to retune (occupancy/min-blocks/M-tile),
|
||||
but the headroom on THIS model is small (it is bandwidth-bound, not tile-occupancy-bound).
|
||||
|
||||
### Confirmations / quantifications of section A (from live GPU, not source-reading):
|
||||
|
||||
1. **Un-graphed at npl128: CONFIRMED in source, but NOT the npl128 bottleneck.** NVFP4 on sm121
|
||||
(turing_plus path) has `mmvq_mmid_max = 8` (`mmvq.cu:145`); MoE decode batch ne[2]=128 > 8, so
|
||||
`[TAG_MUL_MAT_ID_CUDA_GRAPHS]` (`ggml-cuda.cu:3273`) disables CUDA graphs for the WHOLE step and
|
||||
the MMQ grouped path (not MMVQ) is taken. HOWEVER the measured decode window is **98.3% GPU-util
|
||||
with ~7.8 us inter-step host gaps** - at npl128 the kernels are large enough to fully hide the
|
||||
per-op launch latency, so the un-graphed launch-bubble tax is negligible HERE. The un-graphed
|
||||
penalty is a SMALL-npl problem; at npl128 the MoE gap is in-kernel (grouped GEMM + quantize),
|
||||
not host bubbles. This refines A's M2: graphing the decode step helps small-npl MoE much more
|
||||
than npl128 MoE.
|
||||
2. **W4A4 activation-quantize tax: CONFIRMED present but only 2.0% at npl128.** `quantize_mmq_nvfp4`
|
||||
= 3.25 ms/step in the decode-isolated window (A's 457 ms figure is a whole-run/different-window
|
||||
total). Real, and vLLM-W4A16 avoids it, but it is a small-single-digit term, not dominant.
|
||||
3. **lm_head/projection cublas (path B): CONFIRMED ~12.4 ms/step** of nvjet in MoE decode (matches
|
||||
B's 11.91 ms), but that 7.7% bundle is mostly per-layer attn/gdn bf16 projections, not just the
|
||||
one lm_head.
|
||||
|
||||
### D. Dense CUDA-graph stability: f32 dense is STABLE, the bimodality was a BF16-only artifact
|
||||
|
||||
Dense (q36-27b-nvfp4) has no MUL_MAT_ID, so it stays fully CUDA-graphed. Measured S_TG @npl128:
|
||||
|
||||
```
intra-process (1 load, 6x npl=128, npp8/ntg48, N_KV=7168):  376.2 376.2 375.7 375.1 375.3 374.9  (spread <0.4%)
inter-process (6 separate procs, fresh graph capture each): 373.6 377.0 376.8 376.6 376.2 375.7  (spread ~0.9%)
committed heavy config (npl128 ntg128, N_KV=32768):         333.3 / 334.8 / 335.9                (spread ~0.8%)
```
|
||||
|
||||
No bimodality in either replay (intra-process) or capture (inter-process). The custom graph state
|
||||
machine (`ggml-cuda.cu:4484`: warmup_complete requires 2 property-stable calls; the one-time capture
|
||||
cost lands in T_PP, not S_TG) absorbs capture into prefill, which is the only "hint" (the first
|
||||
in-process measurement has a slightly higher T_PP and a marginally lower S_TG, fully bounded). The
|
||||
287/336/487/498 bimodality in the brief was the shelved BF16 SSM-state path (BF16_SSM_STATE.diff,
|
||||
never applied), not the shipped f32 path. There is NO graphs-off env in this fork (graph enable is
|
||||
compile-time USE_CUDA_GRAPH + the warmup machine), so a graph-disable A/B would need a rebuild; given
|
||||
the f32 path is already stable to <1%, path D is a non-issue and not worth the rebuild.
|
||||
|
||||
### Verdict (GPU agent)
|
||||
|
||||
- The MoE decode gap vs vLLM at npl128 is **in-kernel, not host-overhead**: 98.3% GPU-util rules
|
||||
out the un-graphed launch-bubble story AT npl128. The single biggest MoE-specific kernel is the
|
||||
`mul_mat_q<NVFP4, M-tile=64>` grouped GEMM (26.9%, 43.5 ms/step); it is bit-exact to retune but
|
||||
bandwidth-bound on this a3b model (A's auto-tile already measured neutral), so the standalone
|
||||
bit-exact MoE-GEMM lever is REAL but BOUNDED. The recurrence (47.5%) is shared and already tuned.
|
||||
- **Path D (dense graph instability) is closed: the shipped f32 dense path is stable (<1%, no
|
||||
bimodality).** No latent fragility, no rebuild warranted.
|
||||
- Net ranking from the GPU side agrees with A/B: the MoE-GEMM and lm_head levers are both bounded
|
||||
and partly non-bit-exact; the only structurally large bit-exact MoE win (A's M2, graph the decode
|
||||
step) pays off mostly at SMALL npl, not at the npl128 where the benchmark gap is reported.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
## C. TTFT / paged-pool burst degradation (label: ttft-burst-rootcause, READ-ONLY, source + committed traces)
|
||||
|
||||
Files read: `paged/paged_kv_manager.{h,cpp}`; patches `0004` (on-demand alloc), `0007` (persistent
|
||||
manager + ref-counted prefix), `0008` (server cross-request share), `0013`/`0016` (prefill budget);
|
||||
docs `QWEN36_NVFP4_BENCH.md`, `BENCHMARK_PROGRESS.md`, `CHUNKED_PREFILL_PLAN.md`,
|
||||
`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`, `P1_DYNAMIC_BUDGET_RESULTS.md`, `FUTURE_LEVERS.md`.
|
||||
|
||||
### Part 1 - the static decode-first budget: why a 128-way burst hits 903 s dense / 213 s MoE TTFT
|
||||
|
||||
How the budget schedules (patch 0016, `server-context.cpp::update_slots`): each step builds ONE
|
||||
mixed batch. Phase 1 appends every GENERATING slot's single sampled token UNCONDITIONALLY (no budget
|
||||
gate), so after Phase 1 `batch.n_tokens == D` (the live decode load). Phase 2 then fills prompt
|
||||
tokens, bounded by three predicates: the hard `batch.n_tokens < n_batch` (2048) ceiling, a per-step
|
||||
`prefill_budget_step`, and a per-slot `prefill_cap_per_slot`. **Decode is structurally claimed first
|
||||
and never capped; only prefill is throttled.**
|
||||
|
||||
At the shipped config (`LLAMA_MAX_BATCH_TOKENS=512`, i.e. T=512=n_ubatch) the dynamic terms
|
||||
degenerate to constants:
|
||||
- `prefill_budget_step = max(n_ubatch, T - D) = max(512, 512-D) = 512` for all D in [0,128] - the
|
||||
floor binds, the `T-D` adaptivity NEVER bites (exactly the "structural note" in
|
||||
`P1_DYNAMIC_BUDGET_RESULTS.md`).
|
||||
- `prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) = min(512, 5243) = 512` (with `n_ctx=131072`); the `[n_ubatch, n_batch]` clamp leaves it at 512.
|
||||
|
||||
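As a hedged illustration of the two terms above, the following C++ sketch recomputes them for a given decode load. `BudgetTerms` and `compute_budget` are hypothetical names, not symbols from patch 0016, and the integer form of `ceil(0.04*n_ctx)` is an assumption about rounding.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical container for the two Phase 2 bounds described above.
struct BudgetTerms {
    int32_t prefill_budget_step;   // per-step prefill total
    int32_t prefill_cap_per_slot;  // per-slot prefill chunk cap
};

// Sketch of the formulas as stated in this section:
//   prefill_budget_step  = max(n_ubatch, T - D)
//   prefill_cap_per_slot = clamp(min(T, ceil(0.04*n_ctx)), n_ubatch, n_batch)
// T is the total per-step budget, D the decode tokens already in the batch.
inline BudgetTerms compute_budget(int32_t T, int32_t D, int32_t n_ctx,
                                  int32_t n_ubatch, int32_t n_batch) {
    assert(n_ubatch > 0 && n_ubatch <= n_batch);
    T = std::clamp(T, n_ubatch, n_batch);
    // ceil(0.04 * n_ctx) in integer arithmetic (assumed rounding).
    const int32_t ctx_cap =
        static_cast<int32_t>((static_cast<int64_t>(n_ctx) * 4 + 99) / 100);
    BudgetTerms out;
    out.prefill_budget_step  = std::max(n_ubatch, T - D);
    out.prefill_cap_per_slot = std::clamp(std::min(T, ctx_cap), n_ubatch, n_batch);
    return out;
}
```

With `n_ctx=131072`, `n_ubatch=512`, `n_batch=2048`, calling `compute_budget(512, D, ...)` for any `D` in [0,128] gives 512 for both terms, which is the degeneracy described above. With `T=2048` and `D=128` the step budget would instead be 1920.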
So each step admits at most 512 prefill tokens TOTAL and up to 512 per single slot. Each benchmark
|
||||
prompt is exactly 512 tokens and there is NO round-robin (0016 drains slots in index order):
|
||||
**the first waiting slot consumes the entire 512-token step budget with its whole prompt; the 128
|
||||
prompts prefill strictly SERIALLY, one prompt per step.** Slot k's first token appears after ~k
|
||||
prefill steps and each step co-batches the accumulating decode load, so step time grows. Mean TTFT
|
||||
~= (half the prompts, ~64 steps) x mean step_time ~= **903 s dense** (each step reads the full 28B NVFP4 weights) /
|
||||
**213 s MoE** (3B active = cheaper steps). `decode_agg` stays high (384/726 t/s) because Phase 1 seats
|
||||
every decode token every step. This is the **deliberate decode-first tradeoff**: T=512 was chosen
|
||||
for decode throughput + memory; TTFT was the sacrificed axis. The 903 s is partly self-inflicted by
|
||||
the floor budget + lack of fairness, not a kernel limit (dense `prefill_tps` collapses to ~70 t/s
|
||||
under the throttle vs vLLM's flat ~1420).
|
||||
|
||||
The fix (chunked-interleave / fair dynamic budget = P2 of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`,
|
||||
NOT implemented), three pieces in `update_slots` Phase 2, zero libllama change:
|
||||
1. Raise T toward `n_batch` (2048) so the per-step total budget is large; keep decode-first via the
|
||||
REAL `prefill_budget_step = T - D` (leftover auto-shrinks as D rises, so the step never inflates
|
||||
past T even at npl128).
|
||||
2. A per-slot chunk cap MUCH smaller than the budget (the `long_prefill_token_threshold` analogue),
|
||||
e.g. 128-256 tokens, so one prompt cannot monopolize the step.
|
||||
3. A round-robin start offset over PROCESSING_PROMPT slots so leftover budget spreads across MANY
|
||||
waiting prompts per step.
|
||||
|
||||
Net: instead of "one full 512-token prompt per step" (serial, last prompt waits 128 steps), each step
|
||||
admits small chunks from ~T/cap prompts at once, so all 128 advance in lockstep and TTFT collapses
|
||||
from O(k*step) to O(constant) - the vLLM 6-18 s regime. 0016's per-slot-cap variable already exists
|
||||
but is inert at the shipped config and lacks the round-robin spreader. Honest boundary (already in
|
||||
the docs): this closes TTFT, it does NOT lift the ~161/333 decode ceiling (a separate lever).
|
||||
|
||||
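A toy C++ model of a single Phase 2 pass may help make the difference between the shipped behaviour and the proposed chunk cap + round-robin offset concrete. This is a sketch under stated assumptions: `admit_step` and its parameters are hypothetical, it models only token admission (no step time, no decode load beyond the budget passed in), and it is not the code in `update_slots`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of one Phase 2 pass. remaining[i] is the number of prompt tokens
// slot i still has to prefill. Returns how many tokens each slot is granted.
//   budget   : per-step prefill total (e.g. T - D)
//   cap      : per-slot chunk cap
//   rr_start : slot index at which this step starts scanning (round-robin offset)
inline std::vector<int32_t> admit_step(const std::vector<int32_t> & remaining,
                                       int32_t budget, int32_t cap,
                                       std::size_t rr_start) {
    assert(budget >= 0 && cap > 0);
    const std::size_t n = remaining.size();
    std::vector<int32_t> grant(n, 0);
    int32_t left = budget;
    for (std::size_t k = 0; k < n && left > 0; ++k) {
        const std::size_t i = (rr_start + k) % n;
        // A slot never gets more than it needs, the cap, or the leftover budget.
        const int32_t take = std::min({remaining[i], cap, left});
        grant[i] = take;
        left -= take;
    }
    return grant;
}
```

With 128 slots of 512 tokens each, `admit_step(remaining, 512, 512, 0)` grants the whole budget to slot 0, which is the serial pattern described in Part 1. `admit_step(remaining, 2048, 128, 0)` spreads 128-token chunks across 16 slots, and advancing `rr_start` on the next step moves the window to the next group of waiting prompts.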
### Part 2 - the burst-degradation BUG: later lower-npl prefill collapses 507 -> 65 t/s, decode fine, restart cures it
|
||||
|
||||
The signature - prefill-only collapse, decode untouched, persists in-process, a server restart fully
|
||||
cures it (the benchmark's documented "restart per npl" workaround) - points to persistent paged-pool
|
||||
host state never restored short of `clear()`/teardown. Two compounding mechanisms, both confirmable
|
||||
from the patch source:
|
||||
|
||||
**(1) RECLAMATION GAP - blocks are returned ONLY on a FULL-range wipe.** `paged_alloc` returns a
|
||||
sequence's blocks to the pool in exactly two places (patch 0004, kept in 0007): `clear()` ->
|
||||
`release_all`, and `seq_rm(seq, p0, p1)` ONLY when `p0 == 0 && p1 == MAX`. But llama-server's normal
|
||||
slot lifecycle issues PARTIAL truncations: slot reuse with a retained common/BOS prefix calls
|
||||
`seq_rm(slot.id, n_past, -1)` with `n_past > 0` (patch 0008 itself calls
|
||||
`common_context_seq_rm(ctx, slot.id, n_past, -1)`); context-shift / partial rewinds likewise. None
|
||||
satisfy `p0 == 0`, so the release hook never fires: the kv-cache frees those CELLS but the manager
|
||||
still believes the sequence owns those BLOCKS. The two desync and the manager's effective free pool
|
||||
shrinks every time. Patch 0008's own comment is the smoking gun - it added the `n_past < 16` gate
|
||||
because a mismatched full-prompt reservation vs suffix-only submission "never leaves stale blocks
|
||||
(which otherwise fragment the paged pool ... and crashed the server under high fan-out)". 0008 only
|
||||
closed that hole for the narrow `share()` path; the general partial-`seq_rm` path stays unhooked, so
|
||||
over a high-fan-out burst leaked blocks accumulate and never return.
|
||||
|
||||
**(2) FRAGMENTATION / NO COMPACTION - the free queue is permuted by the burst and never rebuilt.**
|
||||
Even for cleanly freed blocks, `BlockPool::free_blocks` just `prepend_n`/`append_n`s them in free
|
||||
order; no compaction, no pristine reset. After a high-fan-out burst (many interleaved alloc/free
|
||||
across many seqs in the unified pool, or reversed-order frees in a per-stream pool) the free queue is
|
||||
a scrambled permutation of physical block ids. A subsequent LOW-npl prefill then `popleft`s
|
||||
physically SCATTERED blocks, so its 512-token KV scatter-WRITE plus the in-kernel paged-attention
|
||||
GATHER lose locality across the KV span -> prefill throughput collapses. Decode is a single-token
|
||||
append per step with a gather amortized over tiny per-step work, so it barely notices - exactly the
|
||||
observed "prefill collapses, decode robust". The scramble + leak persist for the process lifetime
|
||||
(only `clear()`/restart rebuilds a contiguous free queue) - precisely why restart-per-npl restores
|
||||
507 t/s. Contributing factor: slots used in the burst but not reassigned next run are never released
|
||||
(release fires only on next-task divergence), so a low-npl run sees a reduced, fragmented pool and
|
||||
falls back to the stock contiguous allocator more often (the `place()->false->res.idxs.clear()`
|
||||
fallback in find_slot), scanning a littered cell array - another prefill-only slowdown.
|
||||
|
||||
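The fragmentation mechanism in (2) can be reproduced with a toy free queue. The sketch below is illustrative only: `ToyBlockPool` is a hypothetical stand-in and does not reproduce the real `BlockPool` interface in `paged_kv_manager`; `rebuild()` models the "pristine contiguous order" reset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy free-list pool; NOT the real BlockPool from paged_kv_manager.
class ToyBlockPool {
public:
    explicit ToyBlockPool(int n_blocks) {
        for (int b = 0; b < n_blocks; ++b) free_.push_back(b);
    }
    // Pop n block ids from the front of the free queue.
    std::vector<int> alloc(int n) {
        assert(n >= 0 && static_cast<std::size_t>(n) <= free_.size());
        std::vector<int> out(free_.begin(), free_.begin() + n);
        free_.erase(free_.begin(), free_.begin() + n);
        return out;
    }
    // Return blocks in whatever order they are freed (no compaction).
    void release(const std::vector<int> & blocks) {
        for (int b : blocks) free_.push_back(b);
    }
    // Restore ascending physical order of the free queue.
    void rebuild() { std::sort(free_.begin(), free_.end()); }
    std::size_t num_free() const { return free_.size(); }
private:
    std::deque<int> free_;
};

// True when ids form one ascending contiguous run.
inline bool is_contiguous(const std::vector<int> & ids) {
    for (std::size_t i = 1; i < ids.size(); ++i) {
        if (ids[i] != ids[i - 1] + 1) return false;
    }
    return true;
}
```

In this model, two sequences that allocate contiguous runs and then free their blocks in interleaved order leave a scrambled queue, so the next `alloc` returns scattered ids even though `num_free()` is back to its initial value. Calling `rebuild()` once the pool is fully free restores contiguous allocation.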
Fix scope (all gated behind `LLAMA_KV_PAGED`, default-off byte-identical, no libllama API change):
|
||||
- **Fix-1 (core, ~30-50 lines): close the reclamation gap.** Add
|
||||
`paged::PagedKVManager::truncate(seq, n_keep)` that frees the trailing blocks of a request beyond
|
||||
block index `ceil(n_keep/bs)` (ref-counted, mirroring vLLM's free of the truncated block suffix),
|
||||
expose `paged_alloc::truncate(cache, stream, seq, n_keep)`, and call it from
|
||||
`llama_kv_cache::seq_rm` for the `p1 == MAX && p0 > 0` case (ideally any `[p0,p1)`). Manager
|
||||
accounting then tracks the kv-cache exactly; the leak stops.
|
||||
- **Fix-2 (small): defrag on empty.** When a stream's cells reach `get_used() == 0`, rebuild that
|
||||
manager's free queue to pristine contiguous order (or recreate the manager) so a reused pool
|
||||
starts unfragmented.
|
||||
- **Fix-3 (small): release on slot completion.** Add a paged release at server `slot.release()` so
|
||||
finished-but-idle sequences return blocks promptly and a later low-npl run sees a full, compact
|
||||
pool.
|
||||
- **Fix-4 (optional hardening): best-fit / contiguous-run preference** in `get_new_blocks` + a
|
||||
defrag pass before the find_slot stock-fallback fires.
|
||||
|
||||
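A minimal sketch of what Fix-1 could look like, assuming a per-sequence block list and a per-block reference count. `ToyPagedManager` is a hypothetical stand-in for `paged::PagedKVManager`; the real class layout, the `paged_alloc::truncate` wrapper and the `seq_rm` hook are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <unordered_map>
#include <vector>

// Toy ref-counted block manager (hypothetical, for illustration only).
class ToyPagedManager {
public:
    ToyPagedManager(int n_blocks, int block_size)
        : bs_(block_size), ref_(n_blocks, 0) {
        assert(n_blocks > 0 && block_size > 0);
        for (int b = 0; b < n_blocks; ++b) free_.push_back(b);
    }
    // Grow seq until it owns enough blocks to hold n_tokens.
    void reserve(int seq, int n_tokens) {
        auto & blocks = owned_[seq];
        const std::size_t need = blocks_for(n_tokens);
        while (blocks.size() < need) {
            assert(!free_.empty());
            const int b = free_.front();
            free_.pop_front();
            ref_[b] = 1;
            blocks.push_back(b);
        }
    }
    // Proposed Fix-1: drop the trailing blocks beyond ceil(n_keep / bs).
    void truncate(int seq, int n_keep) {
        auto it = owned_.find(seq);
        if (it == owned_.end()) return;
        const std::size_t keep = blocks_for(n_keep);
        while (it->second.size() > keep) {
            const int b = it->second.back();
            it->second.pop_back();
            // Only the last owner returns the block to the free queue.
            if (--ref_[b] == 0) free_.push_back(b);
        }
    }
    std::size_t num_free() const { return free_.size(); }
private:
    std::size_t blocks_for(int n_tokens) const {
        return static_cast<std::size_t>((n_tokens + bs_ - 1) / bs_);
    }
    int bs_;
    std::vector<int> ref_;
    std::deque<int> free_;
    std::unordered_map<int, std::vector<int>> owned_;
};
```

In this toy, a partial rewind such as `truncate(seq, n_past)` returns the suffix blocks immediately, so the free count tracks what the sequence still holds. The toy never shares blocks between sequences, so the reference count is always 1 here; the decrement-then-free pattern only marks where prefix sharing would keep a block alive.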
Validation repro (GPU-bound, for a later profiling pass): npl64 burst then npl8 on ONE server;
|
||||
assert npl8 `prefill_tps` within ~10% of a fresh-server npl8, and that `paged_alloc::num_free`
|
||||
returns to the fresh value after the burst drains.
|
||||
|
||||
### Verdict / ranking of path C
|
||||
|
||||
Two distinct things: a **BUG** (Part 2) and a **tuning tradeoff** (Part 1). Rank the BUG first - it
|
||||
is a true correctness/hygiene defect, not a tradeoff: a long-lived production server silently
|
||||
degrades under ordinary mixed load and currently REQUIRES the "restart per npl" crutch, unacceptable
|
||||
in real serving. Fix scope is small and localized to the paged-alloc unit + one `seq_rm` call site,
|
||||
default-off byte-identical, with a crisp pass/fail repro. The chunked-interleave scheduler (Part 1)
|
||||
is the bigger HEADLINE (the weakest benchmark number, 903 s/213 s burst TTFT vs vLLM 6-18 s) but a
|
||||
larger effort with a deliberate TTFT-vs-decode-ITL tradeoff to navigate. The two are complementary:
|
||||
the scheduler reduces how punishing each burst is; the bug fix ensures the pool survives the burst
|
||||
so the NEXT request is not poisoned.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
---
|
||||
|
||||
## SYNTHESIS - ranking and the first build target (label: orchestrator)
|
||||
|
||||
The brief framed two tracks: **BIT-EXACT** levers (help the shipped f32 parity DEFAULT, included in
|
||||
the vLLM-parity claim) and **SERVING** levers (gated behind `LLAMA_KV_PAGED`, default-off
|
||||
byte-identical, outside the parity claim). The decisive cross-cutting finding from all four agents:
|
||||
**there is no compelling first build target on the bit-exact decode-default track** - A is bounded,
|
||||
B is a sub-1% dead end, D is closed - **while the SERVING track has one clear, high-ROI, tractable,
|
||||
low-risk, byte-identical-default first target: the paged-pool burst-degradation bug.**
|
||||
|
||||
### Per-path scorecard
|
||||
|
||||
| Path | Expected gain | Tractability | Bit-exactness | Net |
|
||||
|------|---------------|--------------|---------------|-----|
|
||||
| **A** MoE grouped-GEMM | Standalone kernel: **bounded, low single-digit %** at npl128 (model is bandwidth-bound; 0014/0015 M-tile auto-tile already NEUTRAL here). The big MoE win = M2 = graph-the-decode-step, which is SHARED with D and pays off mostly at SMALL npl, not the npl128 benchmark point. | M1 (down_proj quantize retune) cheap; M2 a decode-graph-capture project (large). | M1 strictly bit-exact (byte-identical quantized output); M2 bit-exact (replay). Helps the DEFAULT. | Real but **bounded**; no clean standalone kernel win. Keep M1 as a cheap bank-shot; fold M2 into a decode-graph effort. |
|
||||
| **B** cublas lm_head (nvjet) | Bit-exact ceiling **<1%** (~3 ms; nvjet already ~72% of peak HBM, the step's most efficient major kernel). The only big win (NVFP4 head ~2.5%) is non-bit-exact AND unfair vs vLLM (which keeps BF16). | Custom skinny-GEMM = high effort, uncertain it beats cublas. | Bit-exact option caps <1%; the 2.5% option is a logits change (opt-in only). | **Dead end** for the default. Rank LAST. |
|
||||
| **C** TTFT / paged-pool burst | **Part 2 bug:** restores prefill from collapsed 65 -> ~507 t/s after a burst (removes the "restart per npl" crutch). **Part 1 scheduler:** the headline - 903 s/213 s burst TTFT -> vLLM 6-18 s regime. | **Part 2: small + localized** (paged-alloc unit + 1 seq_rm call site). Part 1: larger (fairness + admission + tuning). | Both gated behind `LLAMA_KV_PAGED`, **default-off byte-identical**. SERVING track (doesn't touch the parity-default numerics). | **Highest ROI x tractability.** Part 2 is a true correctness defect with a crisp repro. |
|
||||
| **D** dense CUDA-graph instability | **Zero** - f32 dense measured STABLE (<1% spread, no bimodality). The 287/336/487/498 bimodality was the SHELVED BF16 SSM path, not the shipped f32 path. | n/a (would need a rebuild for a graphs-off A/B). | n/a | **CLOSED.** Not worth any work. |
|
||||
|
||||
### Ranked order (ROI x tractability x bit-exactness)
|
||||
|
||||
1. **C-Part2 - paged-pool burst-degradation bug fix.** Small, localized, default-off byte-identical,
|
||||
crisp pass/fail repro, removes a real production-serving defect + the benchmark's restart crutch.
|
||||
2. **C-Part1 - chunked-interleave / fair dynamic budget.** The public-facing TTFT headline closer,
|
||||
but a larger effort and a deliberate TTFT-vs-ITL tradeoff. Do it AFTER the bug fix (the scheduler
|
||||
reduces burst pain; the bug fix keeps the pool alive across bursts).
|
||||
3. **A-M1 - down_proj activation-quantize kernel retune** (cheap bit-exact bank-shot for the default;
|
||||
bounded payoff on this bandwidth-bound model). Optionally folded with a future decode-graph build
|
||||
(A-M2 / the shared MoE+GDN decode-graph capture), which is the only structurally large bit-exact
|
||||
MoE lever but a big project that helps small-npl more than npl128.
|
||||
4. **B - lm_head kernel swap.** Bit-exact ceiling <1% with real risk. Skip unless a non-bit-exact
|
||||
opt-in track opens (then NVFP4-head ~2.5% dense, gated, excluded from parity claims).
|
||||
5. **D - dense graph instability.** Closed, no work.
|
||||
|
||||
### THE FIRST BUILD TARGET: paged-pool burst-degradation bug fix (C-Part2)
|
||||
|
||||
**Why this one:** it is the only candidate that is simultaneously (a) high ROI - fixes a real
|
||||
correctness defect that forces the "restart per npl" crutch in long-lived serving, (b) tractable -
|
||||
small and localized to the paged-alloc unit plus one `seq_rm` call site, (c) safe for the parity
|
||||
claim - gated behind `LLAMA_KV_PAGED`, default-off byte-identical, and (d) verifiable with a crisp
|
||||
pass/fail repro. Every bit-exact-default alternative is bounded (A), a dead end (B), or closed (D).
|
||||
|
||||
**Implementation plan (incremental, each step independently shippable):**
|
||||
1. **Fix-1 (core):** add `paged::PagedKVManager::truncate(seq, n_keep)` that ref-count-frees the
|
||||
trailing blocks beyond block index `ceil(n_keep/bs)`; expose
|
||||
`paged_alloc::truncate(cache, stream, seq, n_keep)`; call it from `llama_kv_cache::seq_rm` for the
|
||||
`p1 == MAX && p0 > 0` case (ideally any `[p0,p1)`). Closes the reclamation gap so manager
|
||||
accounting tracks the kv-cache exactly.
|
||||
2. **Fix-2:** defrag-on-empty - when a stream reaches `get_used() == 0`, rebuild its free queue to
|
||||
pristine contiguous order.
|
||||
3. **Fix-3:** paged release at server `slot.release()` so finished-idle sequences return blocks
|
||||
promptly.
|
||||
4. **Fix-4 (optional):** best-fit / contiguous-run preference in `get_new_blocks` + a defrag pass
|
||||
before the find_slot stock fallback.
|
||||
|
||||
**Confirming measurement (the explicit repro, GPU-bound):** on ONE long-lived server, run an npl64
|
||||
burst, let it drain, then run npl8. PASS if (i) npl8 `prefill_tps` is within ~10% of a fresh-server
|
||||
npl8 (vs the ~65 vs ~507 collapse today), and (ii) `paged_alloc::num_free` returns to the
|
||||
fresh-start value after the burst drains (proves no leaked blocks). Decode t/s must be unchanged.
|
||||
|
||||
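The pass/fail criterion can be written down as a small predicate. This is a sketch only: `RunStats` and `burst_repro_passes` are hypothetical names, and applying the same 10% tolerance to decode throughput is an assumption (the text only requires it to be unchanged).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical summary of one npl8 run; field names are illustrative.
struct RunStats {
    double      prefill_tps;  // prompt tokens per second
    double      decode_tps;   // generated tokens per second
    std::size_t num_free;     // free paged blocks after the run drains
};

// PASS when the post-burst run matches a fresh-server run:
//  (i)  prefill_tps within rel_tol of the fresh value,
//  (ii) free-block count back at the fresh value,
//  plus decode throughput within the same tolerance (assumed).
inline bool burst_repro_passes(const RunStats & fresh, const RunStats & post,
                               double rel_tol = 0.10) {
    assert(fresh.prefill_tps > 0.0 && fresh.decode_tps > 0.0);
    const bool prefill_ok =
        std::fabs(post.prefill_tps - fresh.prefill_tps) <= rel_tol * fresh.prefill_tps;
    const bool decode_ok =
        std::fabs(post.decode_tps - fresh.decode_tps) <= rel_tol * fresh.decode_tps;
    const bool pool_ok = post.num_free == fresh.num_free;
    return prefill_ok && decode_ok && pool_ok;
}
```

A post-burst run at ~65 t/s prefill against a fresh ~507 t/s fails criterion (i) regardless of the other fields, which is the collapse observed today.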
**Bit-exact gate it MUST pass:**
|
||||
- With `LLAMA_KV_PAGED` unset, the build is byte-identical to HEAD f7409c2 (the fix lives entirely
|
||||
inside the paged path) - `test-backend-ops` + the greedy-decode md5 against the 0023 baseline are
|
||||
unchanged.
|
||||
- With `LLAMA_KV_PAGED` set, the fix changes only block ACCOUNTING and PLACEMENT, never KV values or
|
||||
compute, so the greedy-decode md5 on a fixed prompt is identical before vs after the fix (and the
|
||||
post-burst run produces the same tokens as a fresh-server run).
|
||||
|
||||
**Paths NOT worth building now:** B (lm_head, sub-1% bit-exact ceiling, the only big win is a
|
||||
non-bit-exact unfair-vs-vLLM logits change), and D (dense graph instability, measured stable -
|
||||
closed). A's standalone kernel value is bounded; keep A-M1 as a cheap follow-up and fold A-M2 into a
|
||||
later decode-graph project, but it is not the first target.
|
||||
|
||||
**First target: ship the paged-pool burst-degradation bug fix (C-Part2, Fix-1 + Fix-2 + Fix-3).**
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,305 +0,0 @@
|
||||
# P1 results: dynamic decode-first prefill-token budget (patch 0016)
|
||||
|
||||
Implements **P1** of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`: replace patch 0013's
|
||||
**static** per-step prefill cap with a **dynamic, decode-first** token budget in
|
||||
`tools/server/server-context.cpp::update_slots()`. Policy change only, zero
|
||||
libllama changes, default-off byte-identical. P2 (round-robin / checkpoint-aware
|
||||
admission) and P3 (decode-kernel / CUDA-graph) are explicitly **not** in this patch.
|
||||
|
||||
## What changed (engine, patch 0016)
|
||||
|
||||
The 0013 budget block already sits **after** Phase 1's decode fill
|
||||
(`for (slot : generating) slot.update_batch(batch)`, lines 2716-2720), so at that
|
||||
point `batch.n_tokens == D` is the live decode load. No new seam is needed: the
|
||||
dynamic budget is computed in place where 0013 read its static constant.
|
||||
|
||||
| seam (post-0015 line) | before (0013) | after (0016) |
|
||||
|---|---|---|
|
||||
| budget block @2737-2747 | `n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))` (static constant) | `D = batch.n_tokens`; `T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch)`; `prefill_budget_step = max(n_ubatch, T - D)`; `prefill_cap_per_slot = clamp(min(T, ceil(0.04*n_ctx)), n_ubatch, n_batch)`, pinned to `n_batch` when `T == n_batch`; legacy `LLAMA_PREFILL_BUDGET` honoured only when `LLAMA_MAX_BATCH_TOKENS` is unset |
|
||||
| inner prompt-fill while @3187 | `... && batch.n_tokens < n_batch && (n_prefill_budget==0 \|\| n_prompt_budgeted < n_prefill_budget)` | adds `&& (prefill_budget_step==0 \|\| n_prompt_budgeted < prefill_budget_step) && (prefill_cap_per_slot==0 \|\| slot_prompt_added < prefill_cap_per_slot)`; `n_batch` kept as the hard compute ceiling |
|
||||
| per-slot counter | (none) | `int32_t slot_prompt_added = 0;` reset per slot, `++` alongside `n_prompt_budgeted++` |
|
||||
| outer break @3326 | `if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) break;` | `if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) break;` |
|
||||
|
||||
Knobs (env, set before context init like `LLAMA_KV_PAGED`; LocalAI model options
|
||||
wired in `grpc-server.cpp` beside `max_prefill_tokens`):
|
||||
|
||||
- `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step
|
||||
token budget `T` (decode + prefill), the vLLM `max_num_batched_tokens` analogue.
|
||||
Default `n_batch`, clamped `[n_ubatch, n_batch]`.
|
||||
- `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot prompt-chunk cap, the
|
||||
`long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04*n_ctx))`
|
||||
floored at `n_ubatch`. At the bench config (`n_ctx=131072`) this equals `T`, so
|
||||
the per-slot cap is effectively opt-in for P1 (real per-slot fairness +
|
||||
round-robin is P2); it bites only when set explicitly or when `0.04*n_ctx < T`.
|
||||
- `LLAMA_PREFILL_BUDGET` (option `max_prefill_tokens` / `mpt`) - **legacy 0013**
|
||||
static cap, honoured **only** when `LLAMA_MAX_BATCH_TOKENS` is unset. 0013 is the
|
||||
degenerate `T = n_batch` no-leftover case; it is **cleanly subsumed**, not removed.
|
||||
|
||||
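The knob precedence described above can be sketched as follows. This is an illustration under assumptions, not the patch code: `PrefillKnobs` and `resolve_knobs` are hypothetical, knob values are assumed to be positive integers, the explicit `LLAMA_PREFILL_CAP` override is deliberately left out because its exact clamping is not spelled out here, and the function takes the raw strings (`nullptr` when unset) instead of calling `getenv` so it can be exercised in isolation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Resolved scheduler knobs; 0 means "no bound" (stock behaviour).
struct PrefillKnobs {
    int32_t T;                     // total per-step budget, 0 if dynamic path off
    int32_t static_budget;         // legacy 0013 cap, 0 if unused
    int32_t prefill_cap_per_slot;  // default per-slot cap, 0 if unused
};

inline PrefillKnobs resolve_knobs(const char * max_batch_tokens,  // LLAMA_MAX_BATCH_TOKENS
                                  const char * prefill_budget,    // LLAMA_PREFILL_BUDGET
                                  int32_t n_ctx, int32_t n_ubatch, int32_t n_batch) {
    assert(n_ubatch > 0 && n_ubatch <= n_batch);
    PrefillKnobs k{0, 0, 0};
    if (max_batch_tokens != nullptr) {
        // Dynamic path: T clamped to [n_ubatch, n_batch].
        k.T = std::clamp<int32_t>(std::atoi(max_batch_tokens), n_ubatch, n_batch);
        // Default cap: min(T, ceil(0.04*n_ctx)), floored at n_ubatch.
        const int32_t ctx_cap =
            static_cast<int32_t>((static_cast<int64_t>(n_ctx) * 4 + 99) / 100);
        k.prefill_cap_per_slot = std::clamp(std::min(k.T, ctx_cap), n_ubatch, n_batch);
    } else if (prefill_budget != nullptr) {
        // Legacy 0013 static cap, honoured only when the dynamic knob is unset.
        k.static_budget = std::min<int32_t>(n_batch, std::atoi(prefill_budget));
    }
    return k;
}
```

Under these assumptions, setting both `LLAMA_MAX_BATCH_TOKENS=512` and `LLAMA_PREFILL_BUDGET=256` resolves to the dynamic path with the legacy value ignored, and leaving everything unset leaves all three fields at 0.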
## Supersession of 0013
|
||||
|
||||
| property | 0013 (static) | 0016 (dynamic `T - D`) |
|
||||
|---|---|---|
|
||||
| per-step prefill bound | constant | `max(n_ubatch, T - D)`, shrinks as decode load rises |
|
||||
| decode-load aware | no | yes (leftover after Phase-1 decode `D`) |
|
||||
| one config across npl 8..128 | no (256 best @128, net-negative @8) | yes (self-tuning) |
|
||||
| long-prompt monopoly guard | no | per-slot `slot_prompt_added` cap |
|
||||
| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - kept |
|
||||
| legacy knob | `LLAMA_PREFILL_BUDGET` | preserved when dynamic knob unset |
|
||||
|
||||
## Determinism / byte-identical analysis (verified by construction)
|
||||
|
||||
The hard ceiling `batch.n_tokens < n_batch` is **kept** in the inner loop (not
|
||||
replaced by `< T`). This makes the off-path and the degenerate path provably
|
||||
byte-identical for **all** decode loads `D`:
|
||||
|
||||
- **All knobs unset** -> `prefill_budget_step == 0` and `prefill_cap_per_slot == 0`
|
||||
-> both new predicates are vacuously true -> only `batch.n_tokens < n_batch`
|
||||
binds -> **bit-for-bit stock**. The outer break is `prefill_budget_step > 0`
|
||||
guarded, so it never fires. Identical to 0013's off-path by construction.
|
||||
- **Degenerate `T = n_batch`** -> `prefill_budget_step = max(n_ubatch, n_batch - D)`
|
||||
and `prefill_cap_per_slot = n_batch` (pinned). The budget bound
|
||||
`n_prompt_budgeted < n_batch - D` is equivalent to `batch.n_tokens < n_batch`
|
||||
(since `batch.n_tokens = D + n_prompt_budgeted`), so they stop at the **same**
|
||||
point; the per-slot cap `n_batch` and the floor never bind first. When `D` is so
|
||||
large that `n_batch - D < n_ubatch`, the kept `batch.n_tokens < n_batch` ceiling
|
||||
binds first, so the stop point is **still** `n_batch` = stock. Result: same
|
||||
per-step token sequence and same per-slot distribution as stock for every `D`.
|
||||
- **Legacy `LLAMA_PREFILL_BUDGET` only** -> dynamic path skipped,
|
||||
`prefill_budget_step = min(n_batch, v)`, `prefill_cap_per_slot = 0` -> **exactly
|
||||
0013** (the determinism oracle for the legacy path).
|
||||
- **`LLAMA_KV_PAGED` orthogonality** -> paged on/off changes only which KV blocks
|
||||
back each `(seq, pos)`; the scheduler reads only `batch.n_tokens`, slot states,
|
||||
and `n_ctx`/`n_batch`/`n_ubatch` - none paged-dependent. Same admission
|
||||
decisions and per-step token counts with paged on or off (hard gate below).
|
||||
|
||||
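The by-construction argument can also be checked mechanically on a toy replay of the prompt-fill loop. The sketch below is a simplified model using the variable names from this document; `fill_prompts` itself is hypothetical and omits everything in `update_slots()` other than the three predicates and the outer break.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy replay of the Phase 2 prompt-fill loop. A bound of 0 means "knob unset"
// (predicate vacuously true). Returns the prompt tokens each slot receives.
inline std::vector<int32_t> fill_prompts(const std::vector<int32_t> & prompt_len,
                                         int32_t D, int32_t n_batch,
                                         int32_t prefill_budget_step,
                                         int32_t prefill_cap_per_slot) {
    assert(D >= 0 && n_batch > 0);
    std::vector<int32_t> added(prompt_len.size(), 0);
    int32_t n_tokens = D;  // batch.n_tokens after Phase 1
    int32_t n_prompt_budgeted = 0;
    for (std::size_t i = 0; i < prompt_len.size(); ++i) {
        int32_t slot_prompt_added = 0;
        while (slot_prompt_added < prompt_len[i] &&
               n_tokens < n_batch &&  // hard ceiling, always kept
               (prefill_budget_step == 0 || n_prompt_budgeted < prefill_budget_step) &&
               (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
            ++n_tokens;
            ++n_prompt_budgeted;
            ++slot_prompt_added;
        }
        added[i] = slot_prompt_added;
        // Outer break, guarded so it never fires when the knob is unset.
        if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) break;
    }
    return added;
}
```

Comparing `fill_prompts(p, D, n_batch, 0, 0)` (all knobs unset) with `fill_prompts(p, D, n_batch, max(n_ubatch, n_batch - D), n_batch)` (degenerate `T = n_batch`) for every `D` in `[0, n_batch]` should give identical per-slot counts in this model, matching the second bullet above.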
## Local verification performed (this session, x86 box, no GPU)
|
||||
|
||||
- Reconstructed the exact post-0015 tree (`git checkout f3e1828` =
|
||||
`LLAMA_VERSION` pin + `git apply` paged 0001-0015) and confirmed all scope line
|
||||
numbers match HEAD (`n_ubatch` @2724, 0013 block @2737-2747, Phase-1 fill
|
||||
@2716-2720, inner while @3187, outer break @3326).
|
||||
- Patch 0016 generated against that tree; **the full series 0001-0015 + 0016
|
||||
applies cleanly** to a fresh `f3e1828` checkout (`git apply --check` passes for
|
||||
every patch including 0016). Stat: `1 file changed, 85 insertions(+), 22
|
||||
deletions(-)`.
|
||||
- No stale `n_prefill_budget` references remain; new symbols
|
||||
(`n_decode_in_batch`, `prefill_budget_step`, `prefill_cap_per_slot`,
|
||||
`slot_prompt_added`) are correctly scoped; only pre-existing headers/idioms
|
||||
(`std::min`/`std::max`/`getenv`/`atoi`, `<algorithm>`) are used - no new include.
|
||||
- Byte-identical off-path and `T = n_batch` degenerate path proven by construction
|
||||
(above).
|
||||
|
||||
## Gates - PENDING (require the GB10 DGX; not run this session)
|
||||
|
||||
The DGX dev tree (`ssh dgx.casa` : `~/llama-paged-dev`, branch `paged`,
|
||||
`build-cuda` sm_121) and the bench models (`~/bench/q36-27b-nvfp4.gguf`,
|
||||
`~/bench/q36-35b-a3b-nvfp4.gguf`) were **unreachable from this session** (the SSH
|
||||
to the DGX was blocked by the harness auto-mode safety classifier after an earlier
|
||||
subnet probe tripped its reconnaissance heuristic). The build + the four gates +
|
||||
the A/B sweep below were therefore **not executed**. Numbers must be filled by a
|
||||
re-run on the DGX (or with `ssh dgx.casa` allowlisted). Methodology is locked here
|
||||
so the re-run is mechanical.
|
||||
|
||||
Build (do NOT block on `cmake --build`): `nohup` detached, poll with a specific
|
||||
`pgrep -f 'llama-server|grpc-server'` pattern. Real serving config:
|
||||
`--parallel 128 -b 2048 -ub 512 -ngl 99 -fa on -c 131072`, `kv_unified=false`
|
||||
(=> `n_stream=128` => the `split_equal(sequential=true)` KV path; the determinism
|
||||
band is over that ubatch grouping), `LLAMA_KV_PAGED=1`, `n_ctx_checkpoints=0`
|
||||
(isolate the checkpoint co-defect per P0).
|
||||
|
||||
| # | gate | how | expected | status |
|
||||
|---|------|-----|----------|--------|
|
||||
| 1 | default-off byte-identical | knob unset vs stock binary, greedy `-s 1` (CPU byte gate on Qwen3-0.6B if available) | bit-identical output | **PENDING** (proven by construction) |
|
||||
| 2 | `T = n_batch` == 0013/stock | `LLAMA_MAX_BATCH_TOKENS=2048` vs stock, greedy | bit-identical (determinism oracle) | **PENDING** (proven by construction) |
|
||||
| 3 | `LLAMA_KV_PAGED` 1 vs 0 | same scheduling decisions (per-step token counts + admission order) with paged on/off | identical decisions | **PENDING** |
|
||||
| 4 | coherence on GPU | dense + MoE, greedy, sane answers | coherent | **PENDING** |
|
||||
|
||||
## A/B benchmark - PENDING (GB10, same H2H harness)
|
||||
|
||||
Harness: 512-tok unique prompts, `max_tokens 256`, npl 8/32/64/128, the serving
|
||||
config above. Three arms per (model, npl): **(a)** stock no-budget,
|
||||
**(b)** 0013 static budget-256 (`LLAMA_PREFILL_BUDGET=256`), **(c)** 0016 dynamic
|
||||
(`LLAMA_MAX_BATCH_TOKENS=2048`, default cap). Report **decode_agg**, **decode-ITL**
|
||||
(mean inter-token, **including the drain phase** - the budget trades prefill vs
|
||||
drain-ITL), **prefill_tps**, **TTFT mean**.
|
||||
|
||||
Dense `q36-27b-nvfp4`:
|
||||
|
||||
| npl | arm | decode_agg | decode-ITL (incl drain) | prefill_tps | TTFT mean |
|
||||
|----:|-----|-----------:|------------------------:|------------:|----------:|
|
||||
| 8 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
|
||||
| 32 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
|
||||
| 64 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
|
||||
| 128 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
|
||||
|
||||
MoE `q36-35b-a3b-nvfp4`: same table, **PENDING**.
|
||||
|
||||
Reference ceilings to validate against (from `QWEN36_NVFP4_BENCH.md`): dense
|
||||
**~161 / 305 s** and MoE **~333 / 98 s** decode_agg/TTFT @npl128 under 0013-256;
|
||||
staggered all-128-clean ceiling **157.4** dense.
|
||||
|
||||
### Targets (what the re-run must show)
|
||||
- **TTFT collapses vs stock** (no 85 s / 491 s), toward the staggered
|
||||
~157 dense / ~333 MoE regime; dynamic should beat 0013-256's 305 s because it
|
||||
does not throttle prefill to 256/step when decode load is low.
|
||||
- **Ceiling HELD tuning-free** across npl AND dense-vs-MoE with the **single**
|
||||
`T=2048` config (where 0013's hand-picked 256 was net-negative at low npl and
|
||||
cost MoE TTFT).
|
||||
- **No low-concurrency regression** at npl8 vs stock.
|
||||
- **Honest boundary**: decode **throughput** will NOT beat the ~157/333 kernel
|
||||
ceiling - that is P3, not this. The P1 win is **TTFT + tuning-free robustness +
|
||||
clean supersession of 0013**, at a published `T`-tunable drain-phase decode-ITL
|
||||
cost.
|
||||
|
||||
## Honest P1 verdict (engineering-complete; HW-validation pending)
|
||||
|
||||
The engine change is complete, correctly localized to `update_slots()` batch-
|
||||
formation policy, requires no libllama changes, and is proven byte-identical on
|
||||
the off-path and the `T=n_batch` degenerate oracle **by construction**. It cleanly
|
||||
supersedes 0013 (legacy knob preserved). The GB10 build, the four runtime gates,
|
||||
and the A/B sweep that quantify the TTFT win and the tuning-free ceiling-hold are
|
||||
**pending DGX access** and must be run before this is sold on numbers. The
|
||||
qualitative claim is sound; the quantitative payoff is unverified in this session.
|
||||
|
||||
## Staggered-arrival evaluation
|
||||
|
||||
Ran on the GB10 DGX (`dgx.casa`, dev tree `~/llama-paged-dev` @ `253cbae`, patch
|
||||
0016 BUILT, `build-cuda` sm_121). The prior all-at-once **BURST** H2H (all N
|
||||
requests at t=0) is structurally adversarial to *any* prefill budget: under a
|
||||
burst, TTFT is prefill-rate-bound, so a per-step prefill cap can only slow the
|
||||
drain. That burst showed 0016 ~= 0013, no win. A **STAGGERED** arrival (requests
|
||||
trickle in while others are already decoding) is the regime 0016 is designed for:
|
||||
when a new prefill arrives, the decode-first budget should keep the
|
||||
already-decoding slots flowing (low/flat inter-token latency) while the new
|
||||
prefill takes only the leftover `T - D`. This section measures exactly that.
|
||||
|
||||
### Harness (staggered client, dev-tree-only)
|
||||
|
||||
`~/bench/stagger_cli.py` issues N requests at a **fixed inter-arrival rate** (not
|
||||
all at once) against `/v1/completions`, `stream=true`, `temperature 0`,
|
||||
`ignore_eos`, 512 unique-prefix tokens per prompt (unique leading token defeats
|
||||
prefix caching). It records, per request, the send time, the TTFT, and the
|
||||
absolute timestamp of **every** generated token (full ITL series); raw dumps go to
|
||||
`~/bench/stag_*/raw_*.json`, analysed by `~/bench/stagger_agg.py`. Server flags are
|
||||
**identical to the prior H2H** (`abrun.sh`): `--parallel 128 -b 2048 -ub 512 -ngl
|
||||
99 -fa on -c 131072 --no-kv-unified` with `LLAMA_KV_PAGED=1` (verified
|
||||
`n_ctx_seq=1024`, i.e. `n_stream=128` per-sequence KV, kv_unified=false; checkpoints
|
||||
at the default max=32, identical across all arms). Three to four arms per model,
|
||||
**env-only** difference, sequenced on the single GPU with PID-file stop between
|
||||
arms: **stock** (no knobs), **0013** static (`LLAMA_PREFILL_BUDGET=256`), **0016**
|
||||
dynamic (`LLAMA_MAX_BATCH_TOKENS=512`, and `1024`).
|
||||
|
||||
**Metric definitions.** *Arrival window* = `[first send, last send]`. *In-window
|
||||
ITL* = inter-token gaps whose token lands inside the arrival window = the ITL seen
|
||||
by already-decoding slots **while new prefills are still arriving** -> the
|
||||
decode-protection metric (mean/p95/max). *freezes >Ns* = count of in-window gaps
|
||||
exceeding N seconds (decode stalls caused by a prefill admission). *TTFT* =
|
||||
first-token latency per newly-arriving request. *decode agg* = total generated /
|
||||
decode span (a staggered-run aggregate, **not** the saturated kernel ceiling; it
|
||||
is depressed by the arrival ramp + checkpoint overhead and is not the P1 figure of
|
||||
merit). *wall* = last token - first send.
|
||||
|
||||
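For readers who want to recompute these metrics from raw per-token timestamps, here is a hedged C++ sketch of the in-window ITL definition. It is not `stagger_agg.py`: `RequestTrace`, `InWindowItl` and `in_window_itl` are hypothetical names, times are assumed to be in seconds, and the nearest-rank p95 is an assumed percentile definition.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One request as recorded by a staggered client (times in seconds).
struct RequestTrace {
    double              send_time;
    std::vector<double> token_times;  // absolute timestamp of every token
};

struct InWindowItl {
    std::size_t n_gaps = 0;
    double mean = 0.0, p95 = 0.0, max = 0.0;
    std::size_t freezes = 0;  // gaps longer than freeze_threshold
};

inline InWindowItl in_window_itl(const std::vector<RequestTrace> & reqs,
                                 double freeze_threshold) {
    assert(!reqs.empty());
    // Arrival window = [first send, last send].
    double w0 = reqs.front().send_time, w1 = w0;
    for (const auto & r : reqs) {
        w0 = std::min(w0, r.send_time);
        w1 = std::max(w1, r.send_time);
    }
    // Keep inter-token gaps whose later token lands inside the window.
    std::vector<double> gaps;
    for (const auto & r : reqs) {
        for (std::size_t i = 1; i < r.token_times.size(); ++i) {
            const double t = r.token_times[i];
            if (t >= w0 && t <= w1) gaps.push_back(t - r.token_times[i - 1]);
        }
    }
    InWindowItl out;
    out.n_gaps = gaps.size();
    if (gaps.empty()) return out;  // e.g. the near-burst case
    std::sort(gaps.begin(), gaps.end());
    double sum = 0.0;
    for (double g : gaps) {
        sum += g;
        if (g > freeze_threshold) ++out.freezes;
    }
    out.mean = sum / static_cast<double>(gaps.size());
    out.max  = gaps.back();
    // Nearest-rank p95 (assumed definition).
    const std::size_t rank =
        static_cast<std::size_t>(std::ceil(0.95 * static_cast<double>(gaps.size())));
    out.p95 = gaps[rank - 1];
    return out;
}
```

Calling it once with `freeze_threshold = 1.0` and once with `2.0` would give the two columns of a "freezes >1s / >2s" style count. When no token lands inside the arrival window, `n_gaps` is 0 and the remaining fields stay at zero.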
### Dense `q36-27b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival (~19 s window) - the discriminating regime
|
||||
|
||||
| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|
||||
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
|
||||
| stock | 1494 / 2691 / 2693 | 45 / 35 | 26891 / 46083 | 94.1 | 174.4 |
|
||||
| 0013 (pb256) | 527 / 640 / 650 | 0 / 0 | 44763 / 90338 | 81.2 | 201.8 |
|
||||
| 0016 (mbt512) | 730 / 897 / 901 | 0 / 0 | 33320 / 66595 | 88.4 | 185.8 |
|
||||
| 0016 (mbt1024) | 1320 / 2050 / 2051 | 46 / 5 | 33402 / 62636 | 72.4 | 226.8 |
|
||||
|
||||
**Read:** stock's in-flight decoders **freeze ~2.7 s** every time a new prefill is
|
||||
admitted (35 freezes >2 s, in-window p95 2691 ms). Both small-cap budget arms
|
||||
(0013, mbt512) keep the in-flight ITL **flat and spike-free** (0 freezes >1 s).
|
||||
`mbt512` beats `0013` on **TTFT** (p95 66.6 s vs 90.3 s, mean 33.3 s vs 44.8 s),
|
||||
**throughput** (88.4 vs 81.2) and **wall** (186 s vs 202 s) at the same spike-free
|
||||
protection. `mbt1024` admits bigger prefill chunks, so it reintroduces spikes (5
|
||||
freezes >2 s) for a marginal TTFT gain -> the per-step prefill-chunk size is the
|
||||
protection/TTFT dial.
|
||||
|
||||
### Dense, light load: 32 reqs, max_tokens 64, 400 ms inter-arrival (~12 s window) - non-saturated control

| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 810 / 2324 / 2324 | 25 / 15 | 10604 / 18872 | 49.0 | 42.3 |
| 0013 (pb256) | 443 / 572 / 607 | 0 / 0 | 18608 / 38347 | 38.0 | 54.7 |
| 0016 (mbt512) | 597 / 858 / 863 | 0 / 0 | 14506 / 28055 | 43.9 | 47.4 |

Same shape with shorter, churning requests: stock 15 freezes >2 s, both budget
arms 0; `mbt512` again beats `0013` on TTFT (p95 28.1 s vs 38.3 s), throughput and
wall at equal protection.

### MoE `q36-35b-a3b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival

| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 706 / 1146 / 1148 | 132 / 0 | 2774 / 5105 | 202.4 | 81.1 |
| 0013 (pb256) | 194 / 273 / 280 | 0 / 0 | 18205 / 36023 | 170.3 | 96.5 |
| 0016 (mbt512) | 275 / 366 / 373 | 0 / 0 | 11940 / 22453 | 191.4 | 85.8 |

MoE decode is ~2x faster (3 B active), so the baseline ITL is ~240 ms and stock's
prefill freezes are shorter (~1.1 s, 132 of them >1 s, none >2 s) but **still
present**; budget arms hold the in-flight ITL near baseline (p95 273-366 ms).
`mbt512` again dominates `0013` (TTFT mean 11.9 s vs 18.2 s, p95 22.5 s vs 36.0 s,
throughput 191 vs 170, wall 86 vs 96). Because MoE prefill is cheap, **stock's
TTFT is far lower** (2.8 s mean) - the TTFT cost of decode protection is most
visible here.

### Near-burst control: dense, 64 reqs, 150 ms inter-arrival (~9.5 s window)

At 150 ms the 64 prompts pile in faster than the ~94-127 tok/s drain, so the run
degenerates into a **burst** (window 9.5 s << per-request TTFT of 240-308 s; no
token lands inside the window, so the in-window protection metric is empty). This
reproduces the prior burst null: TTFT stock 267 s / 0013 291 s / mbt512 279 s /
mbt1024 240 s, decode agg 127 / 102 / 106 / 122, wall 401 / 443 / 432 / 375 s -
budget ~= stock, stock marginally better on TTFT and throughput. This is the
control, not 0016's target regime.

### Structural note (intellectual honesty)

At `T = 512 = n_ubatch`, `prefill_budget_step = max(n_ubatch, T - D) = 512`
**constant**, so `mbt512` behaves as a *static* 512-token prefill cap - the dynamic
floor binds and the `T - D` term never bites. Its edge over `0013`'s 256 is
therefore mostly "a larger, `n_ubatch`-aligned cap", not the adaptivity per se. The
genuine decode-adaptive `T - D` is exercised only at `T >= 1024` (`mbt1024`:
prefill chunk ~`1024 - D`, auto-shrinking as decode load `D` rises). Across all
settings the per-step prefill-chunk size is a clean, monotonic protection/TTFT
dial: 256 (0013) -> 512 (mbt512) -> ~960 (mbt1024) trades flatter decode for lower
TTFT. The distinctive value of the dynamic budget is the **safety property**: it
lets you set a *high* `T` for low-load TTFT while guaranteeing the per-step token
count auto-shrinks so decode is never starved when load rises - which is precisely
what stock lacks (stock = unbounded prefill chunk = the freezes).

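The budget rule in the structural note can be sketched as a one-line function. This is a minimal illustration, assuming the rule is literally `max(n_ubatch, T - D)`; the function name, its signature and the decode load `D = 64` used for the `mbt1024` case are assumptions for the example, not code taken from patch 0016.

```cpp
#include <algorithm>

// Hypothetical helper mirroring the formula quoted in the note.
//   T        : per-step token budget (LLAMA_MAX_BATCH_TOKENS)
//   D        : tokens consumed by in-flight decoders this step (assumed one per slot)
//   n_ubatch : micro-batch size, used as the floor for the prefill chunk
constexpr int prefill_budget_step(int T, int D, int n_ubatch) {
    return std::max(n_ubatch, T - D);
}

// At T == n_ubatch the floor always binds: a static 512-token cap.
static_assert(prefill_budget_step(512, 64, 512) == 512, "mbt512 is a static cap");
// At T = 1024 the adaptive term applies and shrinks as D grows.
static_assert(prefill_budget_step(1024, 64, 512) == 960, "mbt1024 with D = 64");
// Under heavy decode load the floor takes over again.
static_assert(prefill_budget_step(1024, 600, 512) == 512, "floor binds when T - D < n_ubatch");
```

The second assertion is where the "~960" figure for `mbt1024` comes from under the assumed `D = 64`; with a different decode load the chunk size moves accordingly.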
### Verdict (honest)

- **Does 0016 keep the in-flight decoders' ITL low/flat when new prefills arrive,
  vs stock's spikes?** **Yes, decisively, on staggered traffic.** Stock's
  already-decoding slots freeze on every prefill admission (dense: 35 freezes >2 s,
  in-window ITL p95 2.7 s; light: 15 >2 s; MoE: 132 >1 s). Every budget arm
  (0013, mbt512) eliminates them (0 freezes >1 s, flat in-window ITL). This is the
  real P1 win and it shows **only** under staggered arrival, never under the burst.
- **Does it bound new-request TTFT?** Relative to **0013**, yes (26-38 % lower TTFT
  across dense and MoE). Relative to **stock**, **no** - stock has the lowest TTFT
  precisely because it lets prefill stampede the decoders (that stampede *is* the
  freeze). New-req TTFT vs in-flight ITL is a genuine Pareto tradeoff, not a free
  lunch; this does not manufacture a TTFT-beats-stock claim.
- **Does the dynamic budget beat BOTH stock AND 0013, or is it ~= 0013 here too?**
  It **does not tie 0013 here** (unlike the burst): at `T=512`, 0016 sits at a
  strictly better point on the protection/TTFT frontier than 0013-256 (equal
  spike-free protection, materially lower TTFT/throughput/wall), and it adds a
  principled, decode-adaptive, single-`T` way to move along that frontier (one
  config across dense and MoE) that 0013's hand-picked 256 cannot. It does **not**
  strictly dominate stock: 0016 wins decode smoothness (no multi-second freezes),
  stock wins raw TTFT/throughput. Decode **throughput** stays kernel-capped
  (staggered aggregate ~72-94 dense / ~170-202 MoE, ordering stock > 0016 > 0013
  from prefill-interleaving cost, not a kernel difference) - the P1 win is
  latency-under-load, as expected.

**Bottom line:** 0016 **earns its keep over 0013 on staggered traffic** - same
spike-free decode protection at a strictly better TTFT/throughput/wall point, plus
a decode-adaptive knob that holds one config across loads and model types. Against
stock it is a deliberately different operating point that trades a few seconds of
new-request TTFT to remove the multi-second in-flight decode freezes stock cannot
avoid. Keep 0016; recommend `LLAMA_MAX_BATCH_TOKENS=512` as the default
protective setting and higher `T` when low-load TTFT matters more than ITL
flatness.
@@ -1,107 +0,0 @@
# Paged-KV: GPU 0007 re-run + shared-prefix throughput benchmark

DGX Spark (NVIDIA GB10, sm_121 / cc 12.1), CUDA 13, dev tree `~/llama-paged-dev`
branch `paged`, base pin `f3e182816421c648188b5eab269853bf1531d950`, full paged
engine (0001-0004, 0006, 0007). All paged behaviour stays gated by
`LLAMA_KV_PAGED`; default-off is byte-identical to stock. Models:
`Qwen3-0.6B-Q8_0.gguf` and `Qwen3-32B-Q4_K_M.gguf`.

## Deliverable 1 - GPU run of the 0007 prefix-engine correctness driver

The committed driver `examples/simple/paged-prefix-engine.cpp` hardcodes
`n_gpu_layers = 0`. For this GPU run it was given a dev-only
`PAGED_NGL` env override (`mp.n_gpu_layers = getenv("PAGED_NGL") ? atoi(...) : 0`),
rebuilt in `build-cuda`, run, then the edit was **reverted** so the committed
driver stays byte-clean (it is dev scaffolding, never shipped in a patch).

Three runs of the same Gate-0 driver, Qwen3-0.6B, `LLAMA_KV_PAGED=1`:

| binary / offload                          | result                    |
|-------------------------------------------|---------------------------|
| committed `build-cpu` driver              | **ALL PASS (failures=0)** |
| `build-cuda`, `PAGED_NGL=99` (all layers) | GATE FAILED (failures=3)  |
| `build-cuda`, `PAGED_NGL=0` (same binary) | GATE FAILED (failures=2)  |

**The GPU run did NOT print ALL PASS - reported honestly.** But the failures are
narrow and are not a paged-engine bug:

- Every **structural / mechanical** paged invariant PASSES on GPU, in both
  scenarios (boundary and mid-block): prefill computed ONLY the suffix (32 prefix
  tokens skipped), shared prefix block-aligned, shared-block `ref_cnt == 2` while
  both sequences hold it, ref drops `2 -> 1` on freeing one sharer, only the
  private (suffix) blocks are returned, and the prefix block returns to the pool
  once all sharers free. The cross-request KV reuse mechanism itself is GPU-clean.
- The only failures are the **exact greedy-token byte-identical** assertions
  (e.g. boundary `B-shared` vs `B-from-scratch`). They diverge at a single near-tie
  token (boundary: 2nd generated token `17971` vs `5671`) and then cascade
  autoregressively.

Root cause is **CUDA float-kernel non-determinism, not the paged logic**: the
*same* CUDA binary fails the exact-token assertions even with `PAGED_NGL=0` (zero
layers offloaded), whereas the genuine `build-cpu` binary passes all 16/16. The
CUDA backend (loaded via `ggml_backend_load_all`) uses non-associative reductions
whose result differs between the full-prefill batch shape and the
incremental-suffix batch shape; under greedy decode a single logit near-tie flips
and the sequences cascade apart. This refines the earlier note in
`PAGED_GPU_VERIFY.md` (which framed it as "not GPU-specific" and had no CPU pass
to compare against): the CPU build now passes clean, so the divergence is a strict
test-assertion artefact of CUDA float ordering, not a defect in 0006/0007.

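The order-dependence behind such a near-tie flip can be illustrated on the host with three floats. This is only an analogy for reduction-order sensitivity, assuming IEEE-754 binary32 with round-to-nearest; it does not reproduce the CUDA kernels, and the constants are chosen purely for illustration.

```cpp
// Illustration only: float addition is not associative, so the same three
// addends summed in two different orders can give different results.
constexpr float kBig    =  1.0e8f;
constexpr float kNegBig = -1.0e8f;
constexpr float kSmall  =  1.0f;

// (1e8 - 1e8) + 1: the large terms cancel exactly first, so the 1 survives.
constexpr float sum_cancel_first = (kBig + kNegBig) + kSmall;

// 1e8 + (-1e8 + 1): floats near 1e8 are spaced 8 apart, so -99999999 rounds
// back to -1e8 and the 1 is lost before the cancellation happens.
constexpr float sum_absorb_first = kBig + (kNegBig + kSmall);

static_assert(sum_cancel_first == 1.0f, "cancellation first keeps the small addend");
static_assert(sum_absorb_first == 0.0f, "the small addend is absorbed at magnitude 1e8");
```

A batched reduction that groups partial sums differently for a full-prefill shape than for an incremental-suffix shape can land on either side of such a rounding boundary, which is enough to flip an argmax between two nearly equal logits.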
## Deliverable 2 - shared-prefix throughput benchmark (the real-win test)

Dev-only driver `examples/simple/paged-prefix-bench.cpp` (registered in
`examples/simple/CMakeLists.txt`, dev tree only - not in any shipped patch).
Workload: `K` sequences that all share a `P`-token common prefix (a system /
RAG preamble), each with a unique `S`-token suffix; prefill only (`G=0`,
generation is identical compute in both modes so it is excluded from the
headline). GPU, `-ngl 99`, `kv_unified = true`.

- **NO-SHARE (stock):** `LLAMA_KV_PAGED` unset; every sequence prefills the full
  `P+S` tokens. Total prefill work `= K*(P+S)`.
- **PAGED-SHARE:** `LLAMA_KV_PAGED=1`; the prefix is computed ONCE on seq 0,
  committed via `paged_prefix_api::commit`, then every other seq calls
  `paged_prefix_api::share` to physically reuse the ref-counted prefix blocks and
  prefills ONLY its suffix. Total prefill work `= P + K*S`.

**`kv_unified` note:** this engine's cross-request share is built around the
*unified* stream-0 pool (ref-counted shared cells), so `kv_unified = true` is what
makes the share engage - the same setting the committed 0007 driver uses. With
`kv_unified = true` the share engaged in every run (evidence below).

### Reuse actually engaged (share mode)

In every share run: `kshare(seq 1) = 1024` (the full block-aligned prefix is
reused, not recomputed), the shared prefix block's `ref_cnt == K` (all sharers
point at one physical copy), and `prefill_tokens_submitted` collapses from
`K*(P+S)` to `P + K*S`.

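The two prefill-work formulas are easy to check for the `P=1024`, `S=32` workload this benchmark uses. The sketch below is plain arithmetic; the function names are hypothetical and are not part of the bench driver.

```cpp
#include <cstdint>

// Hypothetical helpers: total prefill tokens submitted for K sequences that
// share a P-token prefix and each carry a unique S-token suffix.
constexpr std::int64_t no_share_tokens(std::int64_t K, std::int64_t P, std::int64_t S) {
    return K * (P + S);   // every sequence recomputes prefix + suffix
}

constexpr std::int64_t share_tokens(std::int64_t K, std::int64_t P, std::int64_t S) {
    return P + K * S;     // prefix computed once, then K suffixes
}

// K = 32 and K = 16 at P = 1024, S = 32.
static_assert(no_share_tokens(32, 1024, 32) == 33792, "K=32 no-share");
static_assert(share_tokens(32, 1024, 32) == 2048, "K=32 share");
static_assert(no_share_tokens(16, 1024, 32) == 16896, "K=16 no-share");
static_assert(share_tokens(16, 1024, 32) == 1536, "K=16 share");
```

The ratio between the two grows with `K` (11x at `K=16`, 16.5x at `K=32`), which is why the saving is a token-count effect rather than a throughput effect.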
### Results (P=1024, S=32, prefill-only)

| model        | K  | mode      | prefill tokens | prefill time | raw tok/s | shared ref_cnt |
|--------------|----|-----------|----------------|--------------|-----------|----------------|
| Qwen3-0.6B   | 32 | no-share  | 33792          | 4.659 s      | 7253      | -              |
| Qwen3-0.6B   | 32 | **share** | 2048           | **0.554 s**  | 3695      | 32             |
| Qwen3-32B    | 16 | no-share  | 16896          | 26.14 s      | 647       | -              |
| Qwen3-32B    | 16 | **share** | 1536           | **3.64 s**   | 422       | 16             |
| Qwen3-32B    | 32 | no-share  | 33792          | 61.91 s      | 546       | -              |
| Qwen3-32B    | 32 | **share** | 2048           | **6.02 s**   | 340       | 32             |

### Verdict: YES, a real and substantial win, and it grows with K

- Prefill wall-time speedup: **0.6B K=32 -> 8.4x**, **32B K=16 -> 7.2x**,
  **32B K=32 -> 10.3x**. The win grows with the number of sharers because
  no-share prefix recompute is `O(K)` while the shared prefix is `O(1)` plus
  `K` tiny suffixes.
- Note the honest caveat in the raw-throughput column: share mode submits small
  32-token suffix batches that are *less* GPU-efficient (340-422 tok/s on 32B,
  3695 on 0.6B) than the large no-share batches (546-647 tok/s on 32B, 7253 on
  0.6B). The win is **not** higher tok/s - it is computing ~11-16x **fewer**
  tokens. On a fast GB10 prefill that still nets a 7-10x wall-time reduction
  because prefill is compute-bound and the shared prefix dominates the token
  count.
- This is exactly the many-users-one-system-prompt / RAG-preamble fan-out
  scenario, and the paged cross-request prefix cache delivers there.

Scaffolding (`paged-prefix-bench.cpp`, the `PAGED_NGL` driver tweak) stays
dev-tree-only and is not part of any shipped patch.

Assisted-by: Claude:opus-4.8 [Claude Code]
@@ -1,81 +0,0 @@
# Paged-KV GPU verification + full backend CUDA build

Verification run on a DGX Spark (NVIDIA GB10, compute capability 12.1 / sm_121),
CUDA 13.0, against pin `f3e182816421c648188b5eab269853bf1531d950`. Models:
`Qwen3-0.6B-Q8_0.gguf` (core gate) and `Qwen3-32B-Q4_K_M.gguf` (sanity).

All paged behaviour stays gated by `LLAMA_KV_PAGED` (env) / the `kv_paged`
server option; default-off is byte-identical to stock.

## Deliverable 1 - GPU-path correctness (all on GPU, `-ngl 99`)

CUDA build of the dev tree configured with
`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`;
all paged drivers (`llama-simple`, `llama-paged-multiseq`,
`llama-paged-prefix`, `llama-paged-prefix-engine`) compiled clean under sm_121.

1. Core token-identical gate - PASS. `llama-simple` greedy, Qwen3-0.6B, `-ngl 99`:
   stock (env unset) vs `LLAMA_KV_PAGED=1` output is BYTE-IDENTICAL. The paged
   path is genuinely engaged: `LLAMA_KV_PAGED_DEBUG=1` shows the device gather
   firing (`[paged-attn] gather n_stream=1 ...`), per-token block placement
   (`[paged-alloc] ... grew`), and the stock run uses CUDA Graphs while the paged
   run takes the distinct gather path - yet output matches exactly.

2. Multi-stream - PASS. `llama-paged-multiseq -s 4 -ngl 99`, stock vs paged:
   all 4 concurrent sequences BYTE-IDENTICAL on GPU (n_seqs=4, CUDA0 compute
   buffer matches expectation). Same result reproduced on the CPU build.

Prefix recompute-skip (`llama-paged-prefix-engine`, patch 0007) - MIXED, and
this is a dev-scaffolding driver ("not shipped"); it was never built on CPU
(absent from the CPU Gate-0 set), so there is no prior CPU pass to match.
The driver hardcodes `n_gpu_layers = 0`; a reported test-harness-only env
override (`PAGED_NGL`) was added to run it at `-ngl 99` (29/29 layers
offloaded confirmed), then reverted. Results are IDENTICAL on CPU and GPU
(so not a GPU issue):

- PASS: measured recompute-skip (32 prefix tokens skipped, block-aligned),
  ref-count == 2 on shared block, ref drop 2->1 on free, only-private-blocks
  returned, block returned to pool.
- FAIL: 2 of ~16 greedy-token-equality assertions. `boundary` case diverges
  from the from-scratch baseline at the 2nd generated token (`17971` vs
  `5671`) and then completely; `mid-block` "A re-shareable after free, output
  unchanged" also differs. Driver prints `GATE FAILED (failures=2)`.

This is a divergence in the prefix recompute-skip path (0006/0007), NOT in the
core gather gate, and not GPU-specific. Reported, not fixed (out of scope).

3. 32B GPU sanity - PASS. `LLAMA_KV_PAGED=1 llama-simple -ngl 99 -n 16` on
   Qwen3-32B-Q4_K_M (65/65 layers offloaded): coherent output
   ("The capital of France is Paris..."), no crash, no OOM.

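The ref-count lifecycle that the prefix-engine gate checks (`ref_cnt == 2` while shared, `2 -> 1` on freeing one sharer, block back to the pool once the last sharer frees) can be modelled in a few lines. This is a toy sketch: `ToyBlock`, `share` and `release` are hypothetical names, not the paged engine's types or API.

```cpp
// Toy model of one ref-counted prefix block (hypothetical, illustration only).
struct ToyBlock {
    int ref_cnt = 0;
};

// A sequence starts using the block; returns the new reference count.
constexpr int share(ToyBlock & b) {
    return ++b.ref_cnt;
}

// A sequence stops using the block; returns true when the block has no
// remaining owners and can therefore go back to the free pool.
constexpr bool release(ToyBlock & b) {
    return --b.ref_cnt == 0;
}

// Replays the sequence of checks listed in the PASS bullet above.
constexpr bool lifecycle_matches_gate() {
    ToyBlock prefix{};
    share(prefix);                           // sequence A computes the prefix
    if (share(prefix) != 2) return false;    // sequence B shares it: ref_cnt == 2
    if (release(prefix))    return false;    // free one sharer: block must stay
    if (prefix.ref_cnt != 1) return false;   // ref drop 2 -> 1
    return release(prefix);                  // last sharer frees: back to pool
}

static_assert(lifecycle_matches_gate(), "2 -> 1 -> returned to pool");
```

Only the shared block is modelled here; in the gate, each sequence's private suffix blocks are returned independently when that sequence frees.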
## Deliverable 2 - full backend build with the paged patches

Built in a nested LocalAI tree on the DGX; gRPC v1.59.0 built from source
(LocalAI bundle; the system protobuf ships no CMake CONFIG) in ~26 min.

- (2a) `make llama.cpp LLAMA_PAGED=on` - PASS. All 6 paged patches
  (0001,0002,0003,0004,0006,0007) `git apply` cleanly to the pin (EXIT=0). The 8
  vendored paged sources land in `llama.cpp/src/` and are BYTE-IDENTICAL to the
  dev tree; `grpc-server.cpp` carries the `kv_paged`/`paged_attention` option
  (patch 0005); `llama-kv-cache.cpp` has the env-gated hooks.

- (2b) grpc-server under CUDA sm_121 - PASS (with the single-application caveat
  below). 89 MB ARM aarch64 executable, build ~139 s, linked against
  libcudart.so.13 / libcublas.so.13; binary contains the paged option strings
  and `paged_alloc`/`paged_attn`/gather symbols.

- (2c) `make llama.cpp LLAMA_PAGED=off` - PASS. "skipping paged-attention patch
  series", EXIT=0, NO `paged-*` sources in the checkout (clean escape hatch).

### Build-flow finding: paged patches are applied TWICE in the on-flow

A plain `make grpc-server LLAMA_PAGED=on` FAILS to compile. The paged series is
applied by BOTH the Makefile `llama.cpp` target (`git apply`) AND `prepare.sh`
(`patch -p1`). On the already-git-applied tree, `prepare.sh` hits "Reversed (or
previously applied) patch detected! Assume -R? [n]", declines, and re-applies the
pure-addition hunks a second time. `llama_kv_cache::get_n_gather` etc. end up
defined twice -> redefinition errors in `llama-kv-cache.cpp` (`.rej`/`.orig`
litter `src/`). Single application (one of the two appliers) compiles clean -
the 2b build above used a single git-apply with `prepare.sh` patching suppressed.
Reported only; the fix (drop one of the two application sites for
`patches/paged/`) is out of scope for this verification.

Assisted-by: Claude:opus-4.8 [Claude Code]
@@ -1,120 +0,0 @@
# PAGED_POOL_BURST_FIX (patch 0024)

Fixes the paged-pool **burst-degradation bug** identified in `OTHER_PATHS_INVESTIGATION.md`
(section C, Part 2): on a long-lived `llama-server` with `LLAMA_KV_PAGED=1`, a high-fan-out
prefill burst strands KV blocks in the host-side paged pool, so a subsequent lower-npl prefill
draws from a depleted / fragmented pool and its throughput collapses (the benchmark's documented
"restart the server per npl" crutch). Decode is unaffected. The fix touches **only host-side block
accounting and placement - never KV values or compute** - so it is gated behind `LLAMA_KV_PAGED`
and is byte-identical to HEAD with the flag unset.

## Root cause (two compounding host-side defects)

1. **Reclamation gap.** `paged_alloc` returned a sequence's blocks only on a full-range wipe
   (`seq_rm(seq, 0, MAX)`). A partial **tail** truncation `seq_rm(seq, p0>0, MAX)` - which
   `llama-server` issues on every reused slot and before a cross-request prefix splice - freed the
   kv-cache CELLS but left the manager owning the trailing BLOCKS. The two desync; the free pool
   shrinks. (Applies to pure-attention paged caches; on hybrid SSM models the partial seq_rm is
   rejected by the recurrent cache before it reaches the attention cache, so the dominant leak there
   is #1b below.)

   1b. **Idle-slot retention.** Stock `llama-server` keeps a finished slot's KV resident for that
   slot's own next-prompt cache. Under the paged engine, the blocks of the many slots a burst
   touches but a later low-npl run never reassigns are stranded for the process lifetime - a later
   run sees a depleted pool.

2. **No compaction.** `BlockPool::free_blocks` returns blocks in free order; after a burst the free
   queue is a scrambled permutation of physical ids, so a later prefill pops physically scattered
   blocks and its KV scatter-write + paged-attention gather lose locality.

## The fix (all behind `LLAMA_KV_PAGED`; `LLAMA_PAGED_NO_RECLAIM=1` restores pre-fix behavior)

- **Fix-1 - reclaim trailing blocks.** `paged::PagedKVManager::truncate(seq, n_keep)` frees every
  block at index >= `ceil(n_keep/bs)` (ref-counted, mirroring vLLM's free of a truncated suffix),
  exposed as `paged_alloc::truncate(cache, stream, seq, n_keep)` and called from
  `llama_kv_cache::seq_rm` for the `p1 == MAX && p0 > 0` case. Manager accounting now tracks the
  kv-cache exactly. (`src/paged-kv-manager.*`, `src/paged-alloc.*`, `src/llama-kv-cache.cpp`)
- **Fix-2 - defrag on empty.** When the pool becomes fully idle (`all_free()`),
  `defrag_free_pool()` relinks the free queue into ascending block-id order (`FreeBlockQueue::rebuild`),
  preserving content-cache hashes. Triggered after `release`/`truncate`. (`src/paged-kv-manager.*`,
  `src/paged-alloc.*`)
- **Fix-3 - release on slot completion.** At `server_slot::release()` the paged engine issues
  `prompt_clear()` (full seq_rm: clears cells AND releases+defrags the blocks) and drops the
  slot-local prompt cache, so a finished-idle slot returns its blocks promptly; cross-request reuse
  still works through the committed paged content cache. (`tools/server/server-context.cpp`)

## Validation (DGX GB10, dense q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023)

### Bit-exactness (the parity-safe property)

Greedy decode, fixed prompt/seed, 48 tokens, `llama-completion`:

| build / flag | md5 |
|---|---|
| 0023 baseline (paged off) | `5951a5b4d624ce891e22ab5fca9bc439` |
| AFTER paged **off** | `5951a5b4d624ce891e22ab5fca9bc439` (== baseline) |
| AFTER paged **on**, reclaim default-on | `5951a5b4d624ce891e22ab5fca9bc439` (== baseline) |
| AFTER paged **on**, `LLAMA_PAGED_NO_RECLAIM=1` | `5951a5b4d624ce891e22ab5fca9bc439` (== baseline) |

Identical across the board: the fix changes no KV value or compute. `test-backend-ops` is unaffected
by construction (the change touches only host-side block accounting in libllama and the server; no
ggml operator is modified) and was re-run green against the fixed `libllama`.

### Host-side unit test (`llama-paged-reclaim-unit`, no GPU)

- Fix-1: `allocate(0,512)` -> 32 blocks; `truncate(0,256)` reclaims exactly **16** trailing blocks;
  `truncate(0,16)` returns to 1 block; `free` returns to pristine.
- Fix-2: 8 blocks freed in scrambled order then `defrag_free_pool()` -> next `block_table` pops
  **ascending** physical ids. `UNIT PASS`.

### Repro on the model (`llama-paged-burst-bench`, A/B on one binary via `LLAMA_PAGED_NO_RECLAIM`)

NSLOT=64, NPL=8, PP=512, pool=2527 blocks. Same binary, A/B by env.

- **Fix-2 (fragmentation -> prefill).** Fresh npl8 vs npl8 after a scrambling burst+drain:
  - BEFORE (`NO_RECLAIM`): prefill 870.5 -> 822.1 t/s, **ratio 0.944** (fragmented free queue).
  - AFTER (defrag on): prefill 869.2 -> 867.8 t/s, **ratio 0.998** (free queue compacted).
- **Fix-3 mechanism (idle-slot leak -> reclaim).** Burst 64 sequences left idle, then full-release
  (what Fix-3's `prompt_clear` issues at `slot.release()`): pool free
  **2527 (pristine) -> 479 (64 idle slots strand 2048 blocks) -> 2527 (reclaimed == fresh)**. The
  leaked-block count is exactly 64 x ceil(512/16) = 2048.
- Decode is untouched throughout (single-token append; the fix only moves/accounts blocks).

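The block counts quoted in the unit test and in the idle-slot repro all follow from one ceiling division with a 16-token block. The sketch below reproduces that arithmetic; the helper names are hypothetical, and it models only the counting, not `PagedKVManager` itself.

```cpp
#include <cstdint>

// Tokens per block, taken from the ceil(512/16) expression in the text.
constexpr std::int64_t kBlockSize = 16;

// Number of blocks needed to hold n_tokens: ceil(n_tokens / bs).
constexpr std::int64_t blocks_for(std::int64_t n_tokens, std::int64_t bs = kBlockSize) {
    return (n_tokens + bs - 1) / bs;
}

// Blocks freed when a sequence holding n_have tokens is truncated to n_keep
// tokens (assumes n_keep <= n_have): every block at index >= ceil(n_keep/bs).
constexpr std::int64_t reclaimed_by_truncate(std::int64_t n_have, std::int64_t n_keep,
                                             std::int64_t bs = kBlockSize) {
    return blocks_for(n_have, bs) - blocks_for(n_keep, bs);
}

// Unit-test figures: allocate 512 tokens, truncate to 256, then to 16.
static_assert(blocks_for(512) == 32, "allocate(0,512) -> 32 blocks");
static_assert(reclaimed_by_truncate(512, 256) == 16, "truncate(0,256) reclaims 16");
static_assert(blocks_for(16) == 1, "truncate(0,16) leaves 1 block");

// Idle-slot repro figures: 64 idle slots of 512 tokens in a 2527-block pool.
constexpr std::int64_t kPoolBlocks = 2527;
constexpr std::int64_t kIdleSlots  = 64;
static_assert(kIdleSlots * blocks_for(512) == 2048, "stranded blocks");
static_assert(kPoolBlocks - kIdleSlots * blocks_for(512) == 479, "free blocks while stranded");
```

A keep length that is not a multiple of the block size rounds up, so for example truncating to 17 tokens would keep 2 blocks rather than 1.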
### Server repro (`llama-server`, one long-lived process, FRESH-npl8 -> BURST-npl64 -> POST-npl8)

`-c 36000 -np 64 -b 2048 -ub 512`, `LLAMA_MAX_BATCH_TOKENS=512`, distinct 512-token prompts,
`cache_prompt:false`, A/B by `LLAMA_PAGED_NO_RECLAIM`. Aggregate prefill = total prompt tokens / wave
wall.

| wave | BEFORE (`NO_RECLAIM`) | AFTER (fix) |
|---|---|---|
| FRESH-npl8 | 488 t/s (wall 8.4 s) | 525 t/s (wall 7.8 s) |
| POST-npl8 (after burst) | **44 t/s (wall 93 s)** | **532 t/s (wall 7.7 s)** |
| post / fresh | **0.090 (11x collapse)** | **1.01 (recovered, within 1%)** |
| paged release lines in log | 17 | **96** (Fix-3 fires at each slot completion) |
| `CANARY_TOKENS_MATCH` (fresh vs post, identical prompts) | **YES** | **YES** |

The bug reproduces exactly (the investigation's 507 -> 65 collapse; here 488 -> 44); the fix restores
POST-npl8 to within ~1% of fresh and the release-log count jumps from 17 to 96, confirming Fix-3
returns each finished slot's blocks. The canary tokens are identical fresh-vs-post in BOTH arms:
paged placement is value-invariant, so the fix never changes the served output - only when the pool
recovers. Decode is structurally untouched (release happens after a request completes); greedy md5
above proves decode values are byte-identical.

## Tradeoff / scope notes

- On **hybrid SSM models** (qwen35), the recurrent cache rejects a partial tail `seq_rm`, so the
  hybrid wrapper never forwards it to the attention cache: Fix-1 effectively applies to
  pure-attention paged caches, while the hybrid leak is dominated by idle-slot retention (Fix-3) and
  fragmentation (Fix-2). Confirmed by the unit test (Fix-1 logic) and Test-C (2048 blocks stranded
  by 64 idle slots, returned to fresh on reclaim).
- Fix-3 clears a finished slot's KV at `release()`, so a repeated-prompt workload loses the
  slot-local prompt cache. Cross-request reuse normally falls back to the committed paged content
  cache, but that publish path (`paged_prefix_api::commit`) is itself a no-op on hybrid wrappers, so
  for hybrid + repeated prompts Fix-3 trades prompt-cache reuse for pool hygiene. Gated behind
  `LLAMA_KV_PAGED`; `LLAMA_PAGED_NO_RECLAIM=1` restores the stock retain-idle behavior.

## Files

- `src/paged-kv-manager.{h,cpp}` - `truncate`, `defrag_free_pool`/`defrag_free_queue`,
  `FreeBlockQueue::rebuild`, `all_free`/`total_blocks`.
- `src/paged-alloc.{h,cpp}` - `truncate`, `reclaim_active`, defrag-on-empty in `release`/`truncate`,
  `num_free_global`/`num_managers`.
- `src/llama-kv-cache.cpp` - partial-tail-seq_rm reclaim hook.
- `src/paged-prefix-api.{h,cpp}` - `num_free_global`/`num_managers` introspection passthrough.
- `tools/server/server-context.cpp` - Fix-3 paged release at `slot.release()`.
- `examples/simple/paged-reclaim-unit.cpp`, `paged-burst-bench.cpp` - dev test scaffolding.

Assisted-by: Claude:opus-4.8 [Claude Code]
@@ -1,111 +0,0 @@
# Paged llama.cpp vs vLLM - apples-to-apples (batched + NVFP4 + prefix cache)

Definitive matched comparison on a DGX Spark (GB10, sm_121). Both engines batched,
both NVFP4-class weights, both with prefix caching on, both eager (no CUDA graphs).
Workload: shared 1024-token system prefix + unique 32-token suffix, generate 64
tokens, K requests fired concurrently (cold fan-out), one client hitting both
OpenAI-compatible servers with identical token-id prompts.

This run fixes the two confounders in the earlier comparison (a *serial* Q4_K dev
driver vs a *batched* FP4 vLLM server). Here both sides are batched and NVFP4.

## Setup

- llama.cpp: `llama-server` built from the paged dev tree (`~/llama-paged-dev`,
  branch `paged`, patches 0001-0007), CUDA `build-cuda/` (sm_121).
  `LLAMA_KV_PAGED=1`, `-ngl 99 --parallel 32 -c 40960`, model
  `q3-32b-nvfp4-dense.gguf` (NVFP4 weights, FP4-MMA kernel). OpenAI `/completion`.
- vLLM 0.23.0: `vllm serve q3-32b-nvfp4a16/` (compressed-tensors W4A16 / Marlin),
  `--enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.9
  --max-num-seqs 64`, APC on (default). OpenAI `/v1/completions`.

## Finding 1 - the paged cross-request prefix cache does NOT engage in llama-server

This is itself a key result. The paged engine has two distinct mechanisms:

1. Physical paged block placement (patches 0002/0004) - runs inside
   `llama_kv_cache::find_slot`, gated only by `LLAMA_KV_PAGED`. This DOES engage in
   the server: with `LLAMA_KV_PAGED_DEBUG=1`, 2 concurrent shared-prefix requests
   produced 14 `[paged-alloc] ... grew` lines, one stream per `seq`.

2. Cross-request prefix recompute-skip (patch 0007) - the actual fan-out win
   (`shares N prefix blocks ... prefix NOT recomputed`, ref-counted block sharing).
   This is reachable ONLY through `paged_prefix_api::share/commit`
   (`src/paged-prefix-api.cpp`), which only the standalone driver calls.

Evidence it does not reach the server:

- Static: `grep -rn "paged_prefix\|share_prefix\|LLAMA_KV_PAGED" tools/server/`
  returns nothing; `nm` on the binary finds no `paged_prefix` symbol use from the
  server path. Nothing in `llama_decode` or the server calls `share`/`commit`.
- Runtime: the 2-request verify run logged **0** `shares prefix blocks` /
  `NOT recomputed` lines. Both `seq=0` and `seq=1` independently grew to 65 blocks,
  each allocating and recomputing the full ~972-token prefix separately - no
  cross-slot KV block sharing, no `ref_cnt>1`.

So the 0007 recompute-skip, proven in the driver, does **not** yet reach the
server. Closing it needs server-side wiring: when admitting a slot whose prompt
shares a prefix with another live/committed slot, the server would have to call
the `paged_prefix_api::share` / `commit` seam. That is a future patch.

Note: llama-server has its OWN native prefix reuse (the slot prompt cache /
"context checkpoints"). In the K=32 wave the server reused the prefix cached by the
earlier wave, so prefill was only the 32-token suffix (`prompt eval ... / 32
tokens`). But that is a separate mechanism, it only helps prefill, and prefill is
not the bottleneck here (see below), so it does not change the verdict.

## Finding 2 - the matched comparison

Both batched, both NVFP4, both prefix-cache on, both eager. Cold concurrent fan-out,
identical token-id prompts via one client.

| K  | engine    | wall (s) | aggregate gen tok/s | req/s | vLLM speedup |
|----|-----------|----------|---------------------|-------|--------------|
| 16 | llama.cpp | 50.7     | 18.9                | 0.30  | -            |
| 16 | vLLM      | 8.57     | 119.5               | 1.87  | ~5.9x        |
| 32 | llama.cpp | 58.3     | 34.0                | 0.53  | -            |
| 32 | vLLM      | 8.86     | 231.1               | 3.61  | ~6.6x        |

vLLM APC confirmed engaged: prefix cache hit rate 90.9% (K=16), 95.5% (K=32),
enforce_eager (CUDA graphs disabled), `enable_prefix_caching=True`.

### Verdict: not competitive - vLLM ~6x faster, and prefix caching is not why

With every confounder removed (both batched, both NVFP4, both eager, both with
prefix caching on), vLLM is still ~6x faster end-to-end. The gap is decode-bound,
not prefill/cache-bound:

- The G=64 workload is dominated by decode. In the llama K=32 run, decode was
  52.98s of the 58.3s wall; prefill was ~3.5s (and only the 32-token suffix, since
  the server's native prompt cache already reused the prefix). So even perfect
  prefix sharing - paged or native - cannot move the total much.
- llama.cpp batched decode: **~828 ms per decode step** at batch 32
  (1.21 tok/s per sequence).
- vLLM batched decode: ~170 tok/s aggregate gen at 32 running reqs ->
  **~185 ms per step**, roughly **4-5x faster per decode step**.
- CUDA graphs are NOT the differentiator: both sides are eager (llama
  `graphs reused = 0`, vLLM `--enforce-eager`). The win is vLLM's batched-decode
  efficiency: PagedAttention + fused W4A16 (Marlin) GEMMs + chunked-prefill
  scheduler, versus llama.cpp's per-step eager graph and NVFP4-GGUF decode path on
  this Blackwell-class part.

Because decode dominates, wiring the paged 0007 recompute-skip into the server
(Finding 1) would mainly remove redundant prefill across slots - a real saving for
short-generation / long-prefix RAG fan-out, but at G=64 it is a few seconds against
a decode floor that is already ~6x slower than vLLM. The fan-out win does not, on
its own, make llama.cpp competitive here; the decode kernel/batching gap is the
load-bearing factor.

## Caveats
|
||||
|
||||
- NVFP4-GGUF is double-quant and is speed-representative (it routes onto the
|
||||
FP4-MMA kernel); output quality is not the subject of this run.
|
||||
- vLLM side is NVFP4A16 (W4A16 / Marlin) - 4-bit weights, 16-bit activations;
|
||||
llama side is NVFP4 weights on FP4-MMA. Both are NVFP4-weight class.
|
||||
- One llama request per run hit an intermittent HTTP 500 ("output does not match
|
||||
the expected Content-only format" - a Qwen3 thinking-output quirk on
|
||||
`/completion`), so llama counts were 15/16 and 31/32. The failed request returns
|
||||
early and reduces batch contention for the rest, so a clean 16/16 / 32/32 llama
|
||||
run would be marginally slower - i.e. the ~6x gap reported here is conservative
|
||||
(favorable to llama.cpp).
|
||||
- Both servers cold-started; numbers are end-to-end wall from the concurrent
|
||||
client. Disk healthy (~325 GB free), GPU otherwise idle.
|
||||
@@ -1,165 +0,0 @@
|
||||
# Paged-attention closing measurements: stock GPU determinism + vLLM comparison
|
||||
|
||||
Two closing measurements for the paged-attention series, run on a DGX Spark
|
||||
(NVIDIA GB10, compute capability 12.1 / sm_121), CUDA 13. Dev tree
|
||||
`~/llama-paged-dev` branch `paged`, paged engine gated by env `LLAMA_KV_PAGED`
|
||||
(default-off = stock). Models: `Qwen3-0.6B-Q8_0.gguf` and
|
||||
`Qwen3-32B-Q4_K_M.gguf` (llama.cpp), `Qwen3-32B` nvfp4a16 / W4A16 HF safetensors
|
||||
(vLLM 0.23.0). All dev drivers are dev-tree-only and not shipped.
|
||||
|
||||
## Deliverable 1: stock GPU determinism across batch shapes (no paging)
|
||||
|
||||
Question: is the patch-0007 GPU byte-identity "failure" (a near-tie greedy token
|
||||
flips on CUDA, e.g. 17971 vs 5671) caused by paging, or is it inherent stock
|
||||
CUDA non-determinism from running the same tokens in a different batch shape?
|
||||
|
||||
Method: a new dev-only driver `llama-paged-batchshape` (paging explicitly OFF:
|
||||
`unsetenv("LLAMA_KV_PAGED")`). For a prompt `[P+S]` it greedy-decodes two ways,
|
||||
both stock contiguous KV:
|
||||
|
||||
- (a) `full` - prefill the whole `[P+S]` in ONE `llama_decode`.
|
||||
- (b) `split` - prefill `P` in one `llama_decode`, then `S` in a second.
|
||||
|
||||
The two paths write byte-for-byte identical token ids; the only difference is the
|
||||
batch shape submitted to the kernels (full prefill vs P-then-S), which changes
|
||||
the float reduction order in the GEMMs and therefore the KV values by tiny
|
||||
amounts. 5 distinct prompts, suffix S=16.
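
Why reduction order matters at all can be shown without a GPU. The sketch below uses plain Python doubles, not the CUDA GEMMs under test, and only illustrates that floating-point addition is not associative:

```python
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one reduction order
right_to_left = a + (b + c)   # same operands, different grouping

# The two sums differ in the last bit even though the inputs are identical.
print(left_to_right == right_to_left)
print(abs(left_to_right - right_to_left))
```

A batched GEMM that splits the same dot product differently (full prefill vs P-then-S) is regrouping sums in the same way, at much larger scale.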
|
||||
|
||||
### Single next token (the literal T_full vs T_split)
|
||||
|
||||
Both CPU and CUDA returned the SAME greedy next token for all 5 prompts
|
||||
(0/5 flips). BUT the top-2 logit gap measurably changes with the batch shape on
|
||||
CUDA, proving the float order does differ:
|
||||
|
||||
```
CUDA, S=8: prompt 1 T_full=1896 (gap 0.07072) T_split=1896 (gap 0.17986)
CUDA, S=8: prompt 4 T_full=49584 (gap 0.93304) T_split=49584 (gap 0.85785)
```
|
||||
|
||||
The argmax simply did not flip on the immediate next token for these prompts -
|
||||
the gaps, while shifting, stayed wide enough.
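
For reference, the "gap" reported above is the distance between the largest and second-largest logit. A minimal sketch of that measurement, with made-up logit values (the helper name and the numbers are illustrative, not taken from the driver):

```python
def top2_gap(logits):
    """Return (argmax index, gap between the largest and second-largest logit)."""
    if len(logits) < 2:
        raise ValueError("need at least two logits")
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    best, runner_up = order[0], order[1]
    return best, logits[best] - logits[runner_up]


# Hypothetical logits for the same position under two batch shapes.
full_logits = [0.10, 1.25, 1.18]
split_logits = [0.10, 1.30, 1.12]

print(top2_gap(full_logits))
print(top2_gap(split_logits))
```

Both calls pick the same argmax while the gap moves, which is the situation the two CUDA log lines describe.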
|
||||
|
||||
### Generated stream (what 0007 actually byte-asserts)
|
||||
|
||||
0007 asserts byte-identity over a *generated* token stream, where the tiny
|
||||
prefill-shape KV perturbation accumulates and eventually crosses a near-tie.
|
||||
Generating G tokens greedily from `full` vs `split` and reporting first
|
||||
divergence:
|
||||
|
||||
| gen length | CPU diverged | CUDA diverged |
|------------|--------------|---------------|
| G=24 (0007 default) | 1/5 (prompt 0 @ step 5) | 2/5 (prompt 1 @ step 3, prompt 4 @ step 6) |
| G=64 | 2/5 (steps 5, 42) | 3/5 (steps 3, 6, 30) |
|
||||
|
||||
Example CUDA divergence, pure stock, zero paging:
|
||||
`prompt 1: DIVERGES at gen step 3: full=1260 split=576`.
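
The "first divergence" reported here is the first generation step at which the two greedy streams disagree. A minimal sketch, assuming 0-based step indices (the driver's own indexing is not stated) and using made-up token ids apart from the 1260/576 pair quoted above:

```python
from itertools import zip_longest


def first_divergence(full, split):
    """Return the first step where the two token streams differ, or None."""
    for step, (a, b) in enumerate(zip_longest(full, split)):
        if a != b:  # a length mismatch also counts as a divergence
            return step
    return None


full_stream = [11, 22, 33, 1260, 7]   # illustrative ids, except step 3
split_stream = [11, 22, 33, 576, 9]

print(first_divergence(full_stream, split_stream))
```

Once a near-tie flips at one step, every later token is conditioned on a different history, so only the first divergence is meaningful.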
|
||||
|
||||
### Verdict (Deliverable 1): HYPOTHESIS HELD
|
||||
|
||||
The 0007 GPU byte-identity failure is **stock batch-shape non-determinism, not a
|
||||
paged bug**. With paging entirely OFF, stock llama.cpp produces a different
|
||||
greedy token stream when the same prompt is processed in a full-prefill batch vs
|
||||
a split (prefix-then-suffix) batch - exactly the shape difference that 0007's
|
||||
prefix-share path introduces (full B-from-scratch vs prefix-cached + suffix-only).
|
||||
|
||||
Refinement (reported honestly): it is **not strictly CUDA-only**. CPU exhibits
|
||||
the same divergence, just less often and later (1/5 vs 2/5 at G=24, and CPU's
|
||||
flips land at later generation steps). This is exactly why 0007's small, short
|
||||
CPU scenarios happened to pass 16/16 while the CUDA run flipped: CUDA's larger
|
||||
parallel reductions reorder more aggressively, so a near-tie crosses earlier and
|
||||
more frequently. The phenomenon is floating-point GEMM-batching non-determinism,
|
||||
inherent to both backends; paging is not the cause.
|
||||
|
||||
## Deliverable 2: vLLM vs llama.cpp+paged on a shared-prefix fan-out
|
||||
|
||||
Workload: K requests share a 1024-token system prefix, each with a unique
|
||||
32-token suffix, then generate 64 tokens. Both engines cache the shared prefix
|
||||
(vLLM automatic prefix caching ON by default; llama.cpp via the paged
|
||||
cross-request prefix cache, `LLAMA_KV_PAGED=1`).
|
||||
|
||||
The quantization pairing is a realistic apples-to-oranges comparison, reported honestly:
|
||||
- llama.cpp: Qwen3-32B **Q4_K_M** (GGUF), `-ngl 99`, CUDA dequant kernels.
|
||||
- vLLM: Qwen3-32B **nvfp4a16 (W4A16)**, served via the **Marlin FP4
|
||||
weight-only** kernel because GB10 (sm_121) has **no native FP4 compute** -
|
||||
i.e. vLLM is on a slower-than-ideal kernel path here. vLLM also ran
|
||||
`enforce_eager=True` (no CUDA graphs / torch.compile; the env lacked a working
|
||||
inductor/ninja toolchain), so the vLLM numbers are if anything **conservative**.
|
||||
|
||||
### vLLM (automatic prefix caching), end-to-end
|
||||
|
||||
APC hits confirmed in the engine log: **"Prefix cache hit rate: 97.0%"**,
|
||||
`prefix_cache_hits 33040/34848` (K=16) and `99344/102432` (K=32).
|
||||
|
||||
| K | APC | prefill wall (G=1) | total wall (G=64) | throughput |
|----|-----|--------------------|-------------------|------------|
| 16 | ON | 0.749 s | 6.63 s | 2.41 req/s |
| 16 | OFF | 20.19 s | 27.21 s | 0.59 req/s |
| 32 | ON | 1.13 s | 7.56 s | 4.23 req/s |
| 32 | OFF | 40.19 s | 48.71 s | 0.66 req/s |
|
||||
|
||||
vLLM's APC cuts the fan-out prefill ~27x (K=16) to ~36x (K=32) vs APC-off; the
|
||||
huge ratio reflects how slow the FP4-emulation prefill is when forced to
|
||||
recompute all K prefixes.
|
||||
|
||||
### llama.cpp + paged prefix cache (prefill phase)
|
||||
|
||||
The paged shared-prefix bench (`llama-paged-prefix-bench`, `BENCH_GEN=0`,
|
||||
`PAGED_NGL=99`). Reuse confirmed: `kshare(seq1)=1024`, shared-block
|
||||
`ref_cnt = K` (all sequences hold the one prefix), 15360 / 31744 prefix tokens
|
||||
skipped.
|
||||
|
||||
| K | mode | prefill tokens submitted | prefill wall | vs no-share |
|----|------|--------------------------|--------------|-------------|
| 16 | PAGED-SHARE | 1536 | 3.66 s | 7.15x |
| 16 | NO-SHARE | 16896 | 26.17 s | 1.0x |
| 32 | PAGED-SHARE | 2048 | 6.04 s | 10.3x |
| 32 | NO-SHARE | 33792 | 62.17 s | 1.0x |
|
||||
|
||||
The paged prefix cache delivers the expected **7.15x (K=16) / 10.3x (K=32)**
|
||||
prefill wall-time reduction - the headline cross-request prefix-skip win, on a
|
||||
real 32B model on GPU.
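
The "prefill tokens submitted" and "prefix tokens skipped" figures follow directly from the workload shape (1024-token shared prefix, 32-token unique suffix). A small sketch of that accounting; the function name is illustrative:

```python
def prefill_tokens(k: int, prefix: int, suffix: int):
    """Token counts for a K-way fan-out over one shared prefix."""
    shared = prefix + k * suffix      # prefix computed once, then K suffixes
    no_share = k * (prefix + suffix)  # every request recomputes the prefix
    skipped = (k - 1) * prefix        # prefix recomputes avoided by sharing
    return shared, no_share, skipped


for k in (16, 32):
    print(k, prefill_tokens(k, prefix=1024, suffix=32))
```

Note that the wall-time reduction (7.15x / 10.3x) is smaller than the token-count reduction (11x / 16.5x), so prefill cost is not purely proportional to tokens submitted on this run.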
|
||||
|
||||
### Head-to-head, both engines caching the shared prefix
|
||||
|
||||
Prefill of the cached fan-out (vLLM at G=1, which approximates prefill; llama.cpp at G=0, pure prefill):
|
||||
|
||||
| K | llama.cpp+paged prefill | vLLM APC prefill | vLLM faster by |
|----|-------------------------|------------------|----------------|
| 16 | 3.66 s | 0.749 s | ~4.9x |
| 32 | 6.04 s | 1.13 s | ~5.3x |
|
||||
|
||||
### Verdict (Deliverable 2): competitive in kind, behind in absolute terms
|
||||
|
||||
With both engines caching the shared prefix, **llama.cpp+paged is qualitatively
|
||||
competitive but absolutely behind vLLM on this GB10 box**:
|
||||
|
||||
- **Same optimization, same order of magnitude.** llama.cpp's paged prefix cache
reproduces exactly the win vLLM's APC gives (skipping the shared-prefix recompute)
and yields a 7-10x prefill reduction vs its own no-share baseline. On the
RAG/system-prompt fan-out the algorithmic gap is closed: llama.cpp no longer
pays K x prefix.
|
||||
|
||||
- **vLLM still wins head-to-head by ~5x on the cached prefill** (0.75s vs 3.66s
|
||||
at K=16; 1.13s vs 6.04s at K=32), and by more end-to-end because it does
|
||||
**continuous batched decode** (all K sequences decoded in one fused step)
|
||||
while the llama.cpp paged *dev driver* decodes each sequence serially. That
|
||||
decode-batching gap is a property of the serving stack, not of the paged
|
||||
prefix cache. Notably vLLM wins here while handicapped (eager mode, FP4
|
||||
weight-only emulation with no native FP4 on GB10); a tuned vLLM would lead by
|
||||
more.
|
||||
|
||||
- **Honest caveats / blockers.** (1) Quant differs (Q4_K_M vs nvfp4a16). (2) The
comparison is prefill-vs-prefill plus vLLM end-to-end; a clean llama.cpp
end-to-end on this driver is blocked because its generation phase has a
stale-logits bug (`get_logits_ith` reads seq 0's prefill index after later
sequences' prefills overwrote the logits buffer -> segfault). Even with that
fixed, its decode is serial, so it would not be apples-to-apples vs vLLM's batched
decode. The fair end-to-end llama.cpp number needs the grpc / llama-server
continuous-batching path, not this dev scaffold. (3) vLLM ran eager + FP4
emulation, making its numbers conservative.
|
||||
|
||||
Bottom line: paged gives llama.cpp the cross-request prefix-skip that vLLM's APC
|
||||
provides, which is the categorical win and removes the K x prefix penalty on
|
||||
RAG/system-prompt fan-out. On absolute wall-time on this hardware vLLM retains a
|
||||
~5x prefill lead and a larger end-to-end lead from continuous batched decode and
|
||||
a more optimized serving stack.
|
||||
@@ -1,107 +0,0 @@
|
||||
# Pin-bump apply-feasibility check: paged patch series vs latest llama.cpp tip
|
||||
|
||||
Date: 2026-06-27. Scope: textual `git apply` feasibility ONLY. No compile, no
|
||||
bit-exact gate (those require the DGX GPU and the manual PIN_SYNC process). This
|
||||
report answers one question: if we bumped the pin to the latest upstream tip,
|
||||
would the vendored paged patch series still apply?
|
||||
|
||||
## Pins
|
||||
|
||||
| | commit | subject |
|---|---|---|
| Current shipped pin | `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1` | model : Add label for LFM2.5-230M (#25008) |
| Latest master tip | `c299a92c38b6de6a1139617652b66081828648db` | binaries : Improve rpc-server and export-graph-ops names (#25045) |
|
||||
|
||||
Gap: the pin is **23 commits behind** the latest master tip (`ahead_by: 23`,
|
||||
GitHub compare API). The upstream range touched many files across the tree
|
||||
(modifications plus at least one rename).
|
||||
|
||||
## Method
|
||||
|
||||
Two fresh shallow clones of `ggml-org/llama.cpp` (the current pin as a baseline,
|
||||
and the latest master tip as the target). The series
|
||||
`backend/cpp/llama-cpp/patches/paged/0*.patch` (28 files: 0001-0030, gaps at
|
||||
0005 and 0027) was applied IN ORDER to each tree.
|
||||
|
||||
Each patch was classified two ways:
|
||||
|
||||
- **`git apply --check -p1`** - this is the BUILD's real apply method
|
||||
(`backend/cpp/llama-cpp/Makefile`'s `llama.cpp` target does
|
||||
`git apply --verbose "$p" || exit 1`). This is the only signal that decides
|
||||
whether a bumped build succeeds. `git apply` natively tolerates `@@`
|
||||
line-number offsets but NOT context-line changes.
|
||||
- **GNU `patch -p1` dry-run** - the `prepare.sh` fallback method, used here as a
|
||||
recovery probe to tell a fixable offset/fuzz from a genuine conflict.
|
||||
|
||||
Running against BOTH pins isolates bump-induced failures from pre-existing,
|
||||
pin-independent quirks of the shipped series.
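
The two-way classification can be sketched as below. This is a hedged illustration, not the script that produced this report: the labels `CLEAN` / `RECOVERABLE` / `CONFLICT` and the helper names are assumptions, and `probe` only checks one patch against the tree's current state (an in-order run would also have to apply each patch after checking it).

```python
import os
import subprocess


def classify(git_apply_ok: bool, gnu_patch_ok: bool) -> str:
    """Combine the two probes into one verdict for a single patch."""
    if git_apply_ok:
        return "CLEAN"        # the build's own method succeeds
    if gnu_patch_ok:
        return "RECOVERABLE"  # only the tolerant fallback succeeds
    return "CONFLICT"         # neither method can place the patch


def probe(tree: str, patch_file: str) -> str:
    """Dry-run one patch against `tree` with both methods (no files modified)."""
    patch_file = os.path.abspath(patch_file)  # both tools change directory first
    git = subprocess.run(
        ["git", "-C", tree, "apply", "--check", "-p1", patch_file],
        capture_output=True,
    )
    gnu = subprocess.run(
        ["patch", "-p1", "--dry-run", "-d", tree, "-i", patch_file],
        capture_output=True,
    )
    return classify(git.returncode == 0, gnu.returncode == 0)
```

Running `probe` on both the current pin and the latest tip, then comparing verdicts per patch, is what separates a bump-induced failure from a pre-existing one.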
|
||||
|
||||
## Result: the bump is CLEAN / offset-tolerant. Zero re-exports needed for the bump.
|
||||
|
||||
The series behaves **identically** under `git apply` on the latest tip and on
|
||||
the current pin.
|
||||
|
||||
- **27 / 28 patches apply CLEAN under `git apply`** on the latest tip (same 27
|
||||
as on the current pin).
|
||||
- **1 / 28 fails `git apply` (0019) - and it fails identically on the current
|
||||
pin too**, for a reason that has nothing to do with the bump (see below). Its
|
||||
code applies fine.
|
||||
- **No new conflicts.** Not a single patch that applied on the current pin fails
|
||||
on the latest tip.
|
||||
- **Zero context-fuzz anywhere.** Every recovery the GNU-patch probe reported is
|
||||
a pure line-number offset, which `git apply` absorbs natively.
|
||||
|
||||
### What the 23-commit jump actually changed
|
||||
|
||||
Only which patches `git apply` has to place at a line offset (context drift from
|
||||
the 23 upstream commits). All still apply CLEAN; none needs re-export.
|
||||
|
||||
- Offset-placed on the current pin (6): 0009, 0017, 0018, 0020, 0021, 0024.
|
||||
- Offset-placed on the latest tip (10): 0009, 0015, 0017, 0018, 0020, 0021,
|
||||
0024, 0025, 0026, 0028.
|
||||
- New offsets introduced by the bump (4): **0015, 0025, 0026, 0028** - all
|
||||
remain CLEAN under `git apply` (line offset only, no fuzz, no conflict).
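
The "new offsets" line is a plain set difference over the two lists above; a short check:

```python
offset_on_pin = {"0009", "0017", "0018", "0020", "0021", "0024"}
offset_on_tip = {
    "0009", "0015", "0017", "0018", "0020",
    "0021", "0024", "0025", "0026", "0028",
}

new_offsets = sorted(offset_on_tip - offset_on_pin)       # introduced by the bump
no_longer_offset = sorted(offset_on_pin - offset_on_tip)  # offsets that went away

print(new_offsets)
print(no_longer_offset)
```

The second list being empty confirms the bump only added offsets and removed none.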
|
||||
|
||||
### The single `git apply` failure (0019) is pre-existing, not a bump regression
|
||||
|
||||
`0019-qwen35-ssm-decode-fused-gather.patch` fails `git apply` on BOTH pins. The
sole cause is its first hunk, a *modify* hunk against `SSM_DECODE_FIX_RESULTS.md`,
a dev-only doc that exists on the DGX dev tree (from an unshipped docs commit)
but is absent from any clean upstream checkout:
|
||||
|
||||
```
error: SSM_DECODE_FIX_RESULTS.md: No such file or directory
```
|
||||
|
||||
`git apply` is atomic, so that one stray hunk rejects the whole patch. 0019's 8
|
||||
real code files (ggml.h, ggml-cpu/ops.cpp, ggml-cuda/gated_delta_net.cu, ggml.c,
|
||||
delta-net-base.cpp, models.h, qwen35.cpp, qwen35moe.cpp) all apply cleanly (the
|
||||
GNU-patch probe applies them with only line offsets and reports 0 failed code
|
||||
hunks). This is exactly the pre-existing finding documented in
|
||||
`PIN_SYNC_9d5d882d.md` ("Pre-existing finding ... NOT introduced by this
|
||||
pin-sync, NOT fixed here ... a separate cleanup, out of scope"). It is identical
|
||||
at both pins, so it is NOT introduced by a bump. Stripping the stray dev-doc
|
||||
hunk from 0019 (and the analogous 0021 *create* hunk for
|
||||
`CONV_STATE_FUSION_RESULTS.md`, which happens to apply fine) is a cleanup that
|
||||
should happen regardless of any pin bump.
|
||||
|
||||
## Verdict
|
||||
|
||||
A pin bump from `9d5d882d` to the latest tip `c299a92c` is **textually clean**:
|
||||
the full paged series applies via the build's `git apply` with only benign
|
||||
line-number offsets and zero conflicts - no patch needs re-export for the bump.
|
||||
The lone `git apply` failure (0019) is a pre-existing shipped-series defect (a
|
||||
stray dev-doc hunk), present identically on the current pin, and unrelated to the
|
||||
bump.
|
||||
|
||||
## Caveats (why this does NOT authorise shipping a bump)
|
||||
|
||||
This is a textual apply check only. It does NOT verify that the patches are still
|
||||
SEMANTICALLY correct against upstream's 23 refactor commits, that the result
|
||||
compiles, or that it stays bit-exact. The 23 upstream commits touched many files;
|
||||
a clean text-apply can still hide a semantic break (e.g. a function the kernel
|
||||
patches call was refactored). The manual PIN_SYNC process on the DGX GPU
|
||||
(rebuild + `test-backend-ops` + the greedy-md5 bit-exact gate + a decode bench)
|
||||
remains the gate before any pin is advanced. This report only establishes that
|
||||
the bump's textual conflict surface is empty, so that pin-sync would start from a
|
||||
clean apply.
|
||||
@@ -1,301 +0,0 @@
|
||||
# Pin-sync: paged patch-stack -> llama.cpp 9d5d882d
|
||||
|
||||
Status: COMPLETE. The paged patch-stack (0001-0024) was rebased onto llama.cpp
|
||||
`9d5d882d`, rebuilt clean (CUDA sm_121), and the bit-exact gate is GREEN on both
|
||||
the dense and MoE NVFP4 baselines. The LocalAI-side `.patch` files were then
|
||||
re-exported from the rebased commits; **4 patch files changed** and are updated
|
||||
in this commit. A quick decode bench confirms the patchset performs the same on
|
||||
the new tip.
|
||||
|
||||
## Early-warning canary: when to run the NEXT pin-sync
|
||||
|
||||
The shipped pin (this file's tip, mirrored in
|
||||
`backend/cpp/llama-cpp-localai-paged/Makefile`) is advanced ONLY by this manual,
|
||||
GPU-verified PIN_SYNC. Because the paged backend is excluded from the nightly
|
||||
auto-bumper (`.github/workflows/bump_deps.yaml`), nothing nightly tells you when
|
||||
upstream has drifted past the patches. That signal comes from a dedicated
|
||||
scheduled canary:
|
||||
|
||||
- **Workflow:** `.github/workflows/llama-cpp-paged-canary.yml` (weekly, plus
|
||||
`workflow_dispatch`). It resolves the latest `ggml-org/llama.cpp` master tip,
|
||||
then in two jobs (a) APPLIES the full series to that tip with the build's own
|
||||
`git apply` method via `.github/scripts/paged-canary-apply.sh`, and (b)
|
||||
COMPILES the paged backend (cublas) against it using the same base-grpc-cuda-12
|
||||
toolchain + `make grpc-server` target the shipped build uses.
|
||||
- **Green** = the series still applies and compiles on upstream HEAD; nothing to
|
||||
do.
|
||||
- **Red** = upstream moved out from under the patches. **Canary red -> run a
|
||||
PIN_SYNC** (rebase the patches onto the new tip, pass the bit-exact gate on the
|
||||
GPU, re-export the `.patch` files, then advance the pin). The canary is
|
||||
signal-only: it opens no PR and never moves the pin, so the shipped build and
|
||||
the dep-bump PRs stay green regardless.
|
||||
- **0019 handling:** the canary apply helper excludes ONLY the stray
|
||||
`SSM_DECODE_FIX_RESULTS.md` dev-doc hunk (the pre-existing quirk documented in
|
||||
the "Pre-existing finding" section below and in `PIN_BUMP_APPLY_CHECK.md`),
|
||||
applying 0019's real code hunks atomically. So that benign quirk never
|
||||
false-positives the canary, but a genuine code break in 0019 still turns it
|
||||
red.
|
||||
|
||||
## Upstream jump
|
||||
|
||||
- OLD LocalAI pin: `8be759e6`
|
||||
- NEW LocalAI pin (target): `9d5d882d` ("model : Add label for LFM2.5-230M (#25008)")
|
||||
- Upstream jump `8be759e6..9d5d882d` = **17 commits**.
|
||||
|
||||
### Note on the dev-tree base (important)
|
||||
The DGX dev tree's `paged` branch was NOT based on the old pin `8be759e6`. Its
|
||||
real base (merge-base of `paged` with both pins) is `f3e1828`
|
||||
("mtmd: llava_uhd should no longer use batch dim (#24732)"), which is an ancestor
|
||||
of `8be759e6` by 92 commits. So the rebase traversed `f3e1828..9d5d882d` =
|
||||
**109 upstream commits**, a strictly larger surface than the 17-commit pin bump.
|
||||
The end state (paged patches on `9d5d882d`) is identical either way; the larger
|
||||
traverse only means the conflict surface was the worst case, and it still came
|
||||
through bit-exact.
|
||||
|
||||
## Rebase
|
||||
|
||||
- Command: `git rebase --onto 9d5d882d f3e1828 paged` (merge.conflictStyle=diff3).
|
||||
- 26 commits replayed (23 shipped patch-commits + the 2 dev-scaffolding
"Gate-0/FA-gate driver" commits + 1 docs commit; the scaffolding/docs commits are
not shipped as `.patch` files).
|
||||
- Backup ref before rebase: `paged-prerebase-backup` = `a8a9d12` (old patch 0024).
|
||||
- New rebased range: `9d5d882d..paged`, HEAD = `2ee65c2` (patch 0024).
|
||||
|
||||
### Conflicts during rebase (3 commits, ALL in `tools/server/server-context.cpp`)
|
||||
|
||||
Every rebase conflict was in the llama-server continuous-batch scheduler wiring,
|
||||
all of which is gated behind env (`LLAMA_KV_PAGED` / `LLAMA_PREFILL_BUDGET` /
|
||||
`LLAMA_MAX_BATCH_TOKENS`) and therefore a strict no-op for the gate (the gate
|
||||
uses `llama-completion`, not the server, with no env set). The root cause was a
|
||||
single upstream refactor of `update_slots()`:
|
||||
|
||||
- the outer slot loop became `iterate(slots, [&](server_slot & slot){...})`,
|
||||
replacing bottom-of-loop `break` with a top-of-lambda
|
||||
`if (!add_ok || batch.size() >= n_batch) return;` (the `add_ok` flag is set
|
||||
false on `batch.add()` failure);
|
||||
- the embedding/rerank early-exits changed `continue;` -> `return;`;
|
||||
- the `server_batch` token count accessor was renamed `batch.n_tokens` ->
|
||||
`batch.size()` (`server_batch` has a `.size()` method and **no** `.n_tokens`
|
||||
member; the raw `llama_batch` in `send_embedding`/`send_rerank` keeps `.n_tokens`).
|
||||
|
||||
**patch 0008** (`240758e`, cross-request prefix share) - 1 conflict.
|
||||
Hunk 3 (the prefix-commit block) collided with the `continue`->`return` refactor.
|
||||
Hunks 1 (namespace shim) and 2 (the share block) applied cleanly. Resolved by
|
||||
keeping HEAD's refactored structure and re-inserting the `[paged 0008]`
|
||||
`paged_prefix_api::commit(...)` block verbatim after `slot.state = SLOT_STATE_GENERATING;`
|
||||
and before `if (slot.can_speculate())`, re-indented to the new (de-nested) level,
|
||||
with the identical `paged_kv_commit && cache_prompt && !has_mtmd` guard. Semantics
|
||||
unchanged.
|
||||
|
||||
**patch 0013** (`6d37431`, static `LLAMA_PREFILL_BUDGET`) - 3 conflicts.
|
||||
- C1: inserted the `n_prefill_budget` / `n_prompt_budgeted` var block before
|
||||
HEAD's new `auto & alora_scale = batch.alora_scale;` references (upstream moved
|
||||
alora_scale/disabled_id into the `server_batch` struct).
|
||||
- C2: merged the budget gate into HEAD's `while (... batch.size() < n_batch ...)`
|
||||
(took upstream's `batch.size()` rename, kept the budget condition).
|
||||
- C3: the original outer `break` was translated to the new idiom `add_ok = false;`
|
||||
(exact semantic equivalent of "stop admitting prompts to remaining slots"); the
|
||||
upstream-removed `if (batch.n_tokens >= n_batch) break;` was dropped (now handled
|
||||
by the top-of-lambda check).
|
||||
|
||||
**patch 0016** (`02fa047`, dynamic decode-first budget, supersedes 0013) - 2
|
||||
conflicts + 1 clean-hunk fix.
|
||||
- The big budget-block rewrite hunk applied cleanly (its expected parent == the
|
||||
faithfully-resolved 0013 block).
|
||||
- Clean-hunk fix: the clean-applied line `const int32_t n_decode_in_batch = batch.n_tokens;`
|
||||
referenced the `server_batch` member, which has no `.n_tokens` -> changed to
|
||||
`batch.size()` (== D, the Phase-1 decode load; identical value).
|
||||
- C-A: while-condition -> took THEIRS (dynamic `prefill_budget_step` +
|
||||
`prefill_cap_per_slot`), adopted `batch.size()`.
|
||||
- C-B: admission break -> 0016 dynamic budget check with `break` -> `add_ok = false`,
|
||||
dropped the upstream-removed `batch.n_tokens >= n_batch` break.
|
||||
|
||||
OFF-path invariant verified by construction in all three: with the env knobs
|
||||
unset (`prefill_budget_step == prefill_cap_per_slot == 0`, `paged_kv_* == false`)
|
||||
the added conditions never fire, so the scheduler is byte-identical to stock HEAD.
|
||||
|
||||
### Kernel patches: ZERO rebase conflicts
|
||||
Patches 0017-0024 - which touch the bit-exact compute paths
|
||||
(`gated_delta_net.cu` +330, `mmq.cu`/`mmq.cuh` +209, `ssm-conv.cu` +112,
|
||||
`quantize.cu`, `fattn.cu`, `src/models/qwen35.cpp`/`qwen35moe.cpp`/`qwen3next.cpp`,
|
||||
`src/llama-kv-cache.*`, `src/paged-*`, `tests/test-backend-ops.cpp` +79) - all
|
||||
applied **cleanly** during the rebase (3-way). No math, reduction order, or kernel
|
||||
context was touched during conflict resolution.
|
||||
|
||||
## Clean rebuild
|
||||
`cmake --build build-cuda --target clean && cmake --build build-cuda -j20`,
|
||||
preserving the existing CMakeCache (CMAKE_CUDA_ARCHITECTURES=121, GGML_CUDA=ON,
|
||||
GGML_CUDA_FA=ON, GGML_CUDA_GRAPHS=ON, GGML_CUDA_NCCL=ON). Result: BUILD_EXIT=0,
|
||||
all targets at 100%. (The only log "error" is a benign webui `dist.tar.gz`
|
||||
download miss, unrelated to the gate binaries.)
|
||||
|
||||
## GATE: ALL GREEN
|
||||
|
||||
(a) `test-backend-ops` (Backend CUDA0):
|
||||
| op | result |
|----|--------|
| GATED_DELTA_NET | 36/36 OK |
| SSM_CONV | 45/45 OK |
| MUL_MAT | 1146/1146 OK |
| MUL_MAT_ID | 806/806 OK |
|
||||
|
||||
(b) greedy md5 (`llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`):
|
||||
| model | md5 | baseline | verdict |
|-------|-----|----------|---------|
| dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
|
||||
|
||||
Bit-exactness preserved across the upstream jump.
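
The md5 gate reduces to hashing the captured greedy output and comparing it with the recorded baseline. A minimal sketch, assuming the hashed bytes are the captured output of the `llama-completion` run (exactly which stream is hashed is not stated here; the helper name is illustrative):

```python
import hashlib


def md5_gate(output: bytes, baseline_md5: str) -> bool:
    """True when the output hashes to the recorded baseline digest."""
    return hashlib.md5(output).hexdigest() == baseline_md5.lower()


# Self-check against the well-known digest of the empty byte string.
print(md5_gate(b"", "d41d8cd98f00b204e9800998ecf8427e"))
```

A digest comparison is all-or-nothing by design: a single flipped token anywhere in the 48 generated tokens fails the gate.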
|
||||
|
||||
## Decode bench sanity (rebased build, post-pin-sync)
|
||||
|
||||
`llama-batched-bench -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`,
|
||||
S_TG (decode) tok/s at npl128, patch defaults on:
|
||||
| model | npl128 S_TG (new tip) | post-0023 reference | delta |
|-------|-----------------------|---------------------|-------|
| dense `q36-27b-nvfp4` | **366.41** | 373.2 | -1.8% |
| MoE `q36-35b-a3b-nvfp4` | **751.11** | 745.7 | +0.7% |
|
||||
|
||||
Both within the +/-3% noise band -> the patchset performs the same on `9d5d882d`.
|
||||
(npl32 also matches: dense 205.83 vs 207.6; MoE 438.29 vs 440.0.)
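
The delta column and the noise-band verdict can be reproduced as follows; the helper names are illustrative and the 3% band is the one stated above:

```python
def delta_pct(new: float, ref: float) -> float:
    """Relative change of `new` against `ref`, in percent."""
    return 100.0 * (new - ref) / ref


def within_band(new: float, ref: float, band_pct: float = 3.0) -> bool:
    """True when the change stays inside the +/- noise band."""
    return abs(delta_pct(new, ref)) <= band_pct


runs = {
    "dense npl128": (366.41, 373.2),
    "MoE npl128": (751.11, 745.7),
    "dense npl32": (205.83, 207.6),
    "MoE npl32": (438.29, 440.0),
}
for name, (new, ref) in runs.items():
    print(f"{name}: {delta_pct(new, ref):+.1f}% within band: {within_band(new, ref)}")
```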
|
||||
|
||||
## Export phase: re-export `.patch` files and pick the ones that changed
|
||||
|
||||
The committed `.patch` files were generated against the old base. Each shipped
|
||||
patch was re-exported from its rebased commit (`git format-patch -1 <commit>`) and
|
||||
compared body-to-body against the committed file (ignoring the volatile `From`
|
||||
commit-hash line and the `index` blob-hash lines). Classification:
|
||||
|
||||
- **CONTENT (real hunk-body change -> MUST update):** `0008`, `0013`, `0015`, `0016`.
|
||||
- **LINENUM only (hunk bodies byte-identical, only `@@` line-numbers shifted ->
|
||||
still apply cleanly, left as-is):** `0009`, `0017`, `0018`, `0019`, `0020`,
|
||||
`0021`, `0024`.
|
||||
- **IDENTICAL (no change at all):** `0001`, `0002`, `0003`, `0004`, `0006`,
|
||||
`0007`, `0010`, `0011`, `0012`, `0014`, `0022`, `0023`.
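
The three-way classification above can be sketched as a text comparison. This is an assumption-laden illustration rather than the tool that was actually used: it drops the `From <hash>` and `index` lines, then additionally collapses whole `@@` header lines (which also discards the trailing function-context text) to detect line-number-only drift. It does not normalise the git version trailer.

```python
def _strip_volatile(text: str):
    """Drop the commit-hash `From ` line and the blob-hash `index ` lines."""
    return [ln for ln in text.splitlines()
            if not ln.startswith(("From ", "index "))]


def _strip_line_numbers(lines):
    """Collapse hunk headers so only hunk bodies are compared."""
    return ["@@" if ln.startswith("@@ ") else ln for ln in lines]


def classify_export(committed: str, reexported: str) -> str:
    old, new = _strip_volatile(committed), _strip_volatile(reexported)
    if old == new:
        return "IDENTICAL"
    if _strip_line_numbers(old) == _strip_line_numbers(new):
        return "LINENUM"
    return "CONTENT"
```

Only `CONTENT` results require updating the committed `.patch`; `LINENUM` results are the ones `git apply` absorbs as offsets.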
|
||||
|
||||
An independent isolated `git apply --check` sweep (each shipped patch vs the
|
||||
rebased pre-state tree) agreed exactly: the same 4 (`0008`/`0013`/`0015`/`0016`)
|
||||
are the only ones that no longer `git apply` to `9d5d882d`. The build applies the
|
||||
series with plain `git apply` (Makefile) which tolerates `@@` line-number offsets,
|
||||
so the 7 LINENUM patches still apply (verified) and are intentionally not churned.
|
||||
|
||||
### 0015 was a 4th change beyond the 3 rebase conflicts
|
||||
The rebase reported only 3 conflicts (`0008`/`0013`/`0016`). `0015`
|
||||
(expert-density MoE token-tile auto-select) rebased *cleanly* via 3-way merge, but
|
||||
its committed `.patch` no longer applies to `9d5d882d` via plain `git apply`:
|
||||
upstream inserted a new test case
|
||||
(`test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)`)
|
||||
in `tests/test-backend-ops.cpp` right at `0015`'s insertion anchor, so the hunk's
|
||||
context lines shifted. `0015`'s own inserted lines are unchanged - it is a pure
|
||||
context re-anchor, no behavioral change. This is exactly why a per-patch
|
||||
re-export/apply-check was run instead of trusting the 3-conflict count.
|
||||
|
||||
### What changed in each updated patch (From/index hash noise aside)
|
||||
- `0008`: same `[paged 0008]` commit block (identical env-guard + `paged_prefix_api::commit`
|
||||
call), re-indented to the refactored `update_slots` lambda level and re-anchored
|
||||
after `slot.state = SLOT_STATE_GENERATING;`; `@@` headers updated.
|
||||
- `0013`: budget var-block / while-gate / admission-break re-expressed against the
|
||||
refactored loop (`batch.size()`, `add_ok=false`); `@@` headers updated.
|
||||
- `0015`: hunk context re-anchored around the new upstream test case; inserted
|
||||
lines identical; `@@` header updated.
|
||||
- `0016`: dynamic budget block + `n_decode_in_batch = batch.size()` + admission
|
||||
`add_ok=false` against the refactored loop; `@@` headers updated.
|
||||
|
||||
## Equivalence proof (the updated series == the gate-green tree)
|
||||
|
||||
The 4 updated files are byte-faithful `git format-patch -1` exports of the
|
||||
gate-green rebased commits (`240758e`, `6d37431`, `5349f82`, `02fa047`). Applying
|
||||
the full corrected series (the 19 unchanged committed patches + the 4 re-exports)
|
||||
in order to a fresh bare `9d5d882d` checkout with plain `git apply` succeeds for
|
||||
all 23 patches, and the resulting tree is **byte-identical to the gate-green
|
||||
`paged` tip (`2ee65c2`) for every code file** (`git diff` over all paths except
|
||||
`*.md` and the unshipped `examples/simple/*` scaffold drivers is empty). So the
|
||||
shipped `.patch` series reproduces exactly the tree that passed test-backend-ops,
|
||||
the md5 bit-exact gate, and the bench.
|
||||
|
||||
## Shipped-build bug FIXED: stray dev-doc hunks stripped from the patch series
|
||||
|
||||
The pin-sync export captured dev-only result/progress docs that live in the DGX
|
||||
dev tree (`~/llama-paged-dev`) but are ABSENT from a clean `ggml-org/llama.cpp`
|
||||
checkout. The shipped build applies the paged series with **strict `git apply`**
|
||||
(the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
|
||||
`git apply --verbose "$p" || { echo "paged patch failed"; exit 1; }`), which is
|
||||
atomic: a single hunk against a missing file REJECTS the entire patch and the
|
||||
`exit 1` fails the build. (`prepare.sh` uses tolerant `patch -pN -N ... || true`,
|
||||
but it is guarded by the `src/paged-kv-manager.cpp` sentinel and skipped at build
|
||||
time once the Makefile has applied the series, so the strict `git apply` is the
|
||||
real shipped path.)
|
||||
|
||||
Root failure was patch `0019`'s *modify* hunk against `SSM_DECODE_FIX_RESULTS.md`
|
||||
(`index 2e7c8c2..77879e4 100644`): on a clean tree `git apply` cannot find the
|
||||
file to modify ("No such file or directory") and rejects all of `0019`, which
|
||||
then cascades to `0021`/`0022`/`0026`/`0028` (they build on `0019`'s code). The
|
||||
build therefore only succeeded on the DGX (where the doc exists) and FAILED on CI
|
||||
/ any clean checkout.
|
||||
|
||||
Fixed by stripping every stray non-source hunk so the patches contain ONLY
|
||||
llama.cpp source changes. Stripped hunks (dev docs absent from a clean
|
||||
`9d5d882d` checkout):
|
||||
|
||||
| patch | stripped dev-doc hunk(s) | hunk kind |
|-------|--------------------------|-----------|
| `0019` | `SSM_DECODE_FIX_RESULTS.md` | modify (the root reject) |
| `0020` | `LEVER1_OPROJ_MMQ_RESULTS.md` | create |
| `0021` | `CONV_STATE_FUSION_RESULTS.md` | create |
| `0028` | `LEVER1_GATHER_PROGRESS.md`, `LEVER1_GATHER_RESULTS.md` | create |
|
||||
|
||||
(The `create` hunks did not reject on their own - `git apply` will create a new
|
||||
file even on a clean tree - but they polluted the build tree with stray dev docs
|
||||
and violated the source-only invariant, so they were stripped too.) For each
patch, the `diff --git a/<devdoc> ...` section was removed along with its diffstat
per-file line and any `create mode` trailer, and the `N files changed, ...` summary
was corrected; **every llama.cpp SOURCE hunk is byte-identical** (verified by
sha256 of each patch's source-diff tail before vs after the strip).
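The strip described above can be sketched as a small filter over the patch text. This is a minimal illustration, not the script that was actually used: the function name `strip_dev_doc_sections` and the glob list are assumptions, and the sketch only drops the per-file `diff --git` sections. It does not rewrite the diffstat lines, the `create mode` trailers, or the `N files changed` summary, and if the dropped section is the last one in the file it also drops the `format-patch` signature trailer that follows it.

```python
import fnmatch
import re

# Hypothetical glob list; the text names *_RESULTS.md and *_PROGRESS.md.
DEV_DOC_GLOBS = ["*_RESULTS.md", "*_PROGRESS.md"]

DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def strip_dev_doc_sections(patch_text, globs=DEV_DOC_GLOBS):
    """Return (kept_text, dropped_paths).

    Walks the patch line by line. Each `diff --git` header starts a new
    per-file section; a section is dropped when the file name matches one
    of `globs`. Everything before the first header is kept unchanged.
    """
    kept = []
    dropped = []
    keep_current = True
    for line in patch_text.splitlines(keepends=True):
        match = DIFF_HEADER.match(line.rstrip("\n"))
        if match:
            path = match.group(2)
            name = path.rsplit("/", 1)[-1]
            keep_current = not any(fnmatch.fnmatchcase(name, g) for g in globs)
            if not keep_current:
                dropped.append(path)
        if keep_current:
            kept.append(line)
    return "".join(kept), dropped
```

Comparing a hash of the kept source sections before and after the filter would mirror the sha256 check mentioned above.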
|
||||
|
||||
Verified on a fresh `git clone` of `ggml-org/llama.cpp` at this pin `9d5d882d`:
|
||||
- BEFORE the strip, strict `git apply` of the series: OK through `0018`, then
|
||||
`0019` FAILS ("SSM_DECODE_FIX_RESULTS.md: No such file or directory") -> the
|
||||
Makefile `exit 1`s; continue-mode shows the full cascade `0019` `0021` `0022`
|
||||
`0026` `0028` failing.
|
||||
- AFTER the strip, strict `git apply` of the full series `0001..0030` reaches
|
||||
**exit 0** (every patch OK, sentinel `src/paged-kv-manager.cpp` created, zero
|
||||
stray `*_RESULTS.md`/`*_PROGRESS.md` in the tree). The tolerant `patch -p1`
|
||||
path (prepare.sh fallback) also applies with zero rejects.
|
||||
|
||||
## Durable fix: keep patch exports SOURCE-ONLY
|
||||
|
||||
The pin-sync / re-export step MUST NOT capture dev-only artifacts into the shipped
|
||||
`.patch` files. A clean `ggml-org/llama.cpp` checkout contains its own real docs
|
||||
(`README.md`, `docs/`, `AGENTS.md`, ...) but NOT LocalAI dev notes - anything
|
||||
matching `*_RESULTS.md`, `*_PROGRESS.md`, `*.diff`, `final_benchmark.csv`,
|
||||
`LEVER*`, `BENCH*`, `paged-*-bench.cpp`, or any path that does not exist at the
|
||||
pin is a dev artifact and must be excluded. Concretely, when re-exporting:
|
||||
|
||||
- prefer `git format-patch -1 <commit> -- ':!*.md' ':!*.diff' ':!*.csv'` (or an
|
||||
explicit pathspec of the llama.cpp source dirs `src/ ggml/ common/ include/
|
||||
tools/ tests/ cmake/`) so dev docs never enter the patch body;
|
||||
- keep the dev-notes commits SEPARATE from the code commits on the dev branch, so
|
||||
a per-commit export is naturally source-only;
|
||||
- after export, gate with: clone the pin and apply the full series with strict
`git apply` (no `--exclude`, no `|| true`) - it MUST reach exit 0. The weekly
canary (`.github/workflows/llama-cpp-paged-canary.yml`) does this against
upstream HEAD; now that the patches are source-only, its `0019`
`SSM_DECODE_FIX_RESULTS.md` `--exclude` workaround
(`.github/scripts/paged-canary-apply.sh`) is no longer needed and can be removed
on the next canary touch.
|
||||
|
||||
The upcoming `c299a92c` pin-bump re-export MUST follow this: produce source-only
|
||||
patches and pass the strict-`git apply` gate on a clean checkout before advancing
|
||||
the pin.
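A pre-export lint along these lines could catch a stray dev artifact before the strict `git apply` gate runs. It is a sketch under assumptions: the helper names are hypothetical, the allowlist is simply the source-directory list quoted above, and it does not check whether a path exists at the pin (that still needs the clean-clone gate). A patch that legitimately touches a top-level file outside these directories would be flagged and need a manual look.

```python
import re

# Source directories named in the text; anything else is reported.
ALLOWED_PREFIXES = (
    "src/", "ggml/", "common/", "include/", "tools/", "tests/", "cmake/",
)

DIFF_HEADER = re.compile(r"^diff --git a/\S+ b/(\S+)$", re.MULTILINE)


def patch_paths(patch_text):
    """List the b/ path of every per-file section in the patch."""
    return [m.group(1) for m in DIFF_HEADER.finditer(patch_text)]


def non_source_paths(patch_text, allowed=ALLOWED_PREFIXES):
    """Return the paths that fall outside the allowed source directories."""
    return [p for p in patch_paths(patch_text) if not p.startswith(allowed)]
```

An empty result from `non_source_paths` is necessary but not sufficient: the series still has to reach exit 0 under strict `git apply` on a clean checkout.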
|
||||
|
||||
## Historical note (pre-strip)
|
||||
Before this cleanup, `0019` carried the `SSM_DECODE_FIX_RESULTS.md` modify hunk
|
||||
identically in the old and new exports (LINENUM class) and was left untouched
|
||||
during the pin-sync to keep the rebase faithful; `0021`'s
|
||||
`CONV_STATE_FUSION_RESULTS.md` was a create hunk that applied but still leaked a
|
||||
dev doc. Both are now removed by the source-only strip above.
|
||||
|
||||
## Source of truth
|
||||
The rebased branch on the DGX dev tree (`~/llama-paged-dev`, branch `paged`, HEAD
|
||||
`2ee65c2`) is the source of truth; `paged-prerebase-backup` (`a8a9d12`) retains
|
||||
the pre-rebase state.
|
||||
@@ -1,286 +0,0 @@
|
||||
# QUANT_GENERALITY - are the paged decode opts NVFP4-specific or quant-agnostic?
|
||||
|
||||
Source-verified classification of the paged decode optimizations (patches 0013-0029)
|
||||
as either QUANT-AGNOSTIC (operate on the gated-DeltaNet f32/bf16 recurrent state, the
|
||||
paged serving host path, or the matmul ROUTING - independent of the model's weight
|
||||
quantization, so they help a Q4_K / Q8_0 / bf16 Qwen3.6 as much as an NVFP4 one) or
|
||||
NVFP4-SPECIFIC (only fire for / only help GGML_TYPE_NVFP4 weights on a Blackwell GPU).
|
||||
|
||||
READ-ONLY, NO GPU. Every classification below is taken from the patch body source,
|
||||
not from the prose claims. Hardware referenced for the empirical plan only.
|
||||
|
||||
---
|
||||
|
||||
## 1. THE GROUND TRUTH GATE: what makes anything NVFP4-specific
|
||||
|
||||
There is exactly ONE runtime gate in the whole ggml-cuda matmul stack that means
|
||||
"NVFP4 on Blackwell":
|
||||
|
||||
    mmq.cu:  const bool use_native_fp4 = blackwell_mma_available(cc)
                 && (src0->type == GGML_TYPE_NVFP4 ...);
|
||||
|
||||
(confirmed in ARCH_GENERALITY_AUDIT.md section gguf-targeting-1 and in patch 0023's
|
||||
own diff context). A patch is NVFP4-specific iff the code it changes lives INSIDE a
|
||||
`use_native_fp4` / `type == GGML_TYPE_NVFP4` / `blackwell_mma_available(cc)` branch.
|
||||
Everything else - the gated-DeltaNet recurrence, the conv update, the SSM/conv state
|
||||
caches, the MMQ-vs-MMVQ dispatch, the CUDA-graph guard, the host scheduler and paged
|
||||
pool - is dtype-independent.
|
||||
|
||||
The recurrent state is the decisive fact: in this hybrid model the gated-DeltaNet
|
||||
temporal state, the conv ring state, q/k/v/g/beta and the SSM scratch are ALL
|
||||
GGML_TYPE_F32 (asserted explicitly in every new op builder: see 0018 ggml.c
|
||||
`GGML_ASSERT(state->type == GGML_TYPE_F32)`, 0019 same, 0021/0028 conv asserts
|
||||
`conv_states->type == GGML_TYPE_F32`). The weight quantization type never enters the
|
||||
recurrence or conv kernels. So any patch that only touches those is quant-agnostic by
|
||||
construction.
|
||||
|
||||
---
|
||||
|
||||
## 2. PER-PATCH CLASSIFICATION (with source evidence)
|
||||
|
||||
| patch | what it changes | classification | source evidence |
|-------|-----------------|----------------|-----------------|
| 0013 | static per-step prefill-token budget (LLAMA_PREFILL_BUDGET) | QUANT-AGNOSTIC | tools/server/server-context.cpp only; a host scheduler loop bound on prompt-token COUNT; no dtype anywhere; default-off byte-identical |
| 0014 | manual MoE token-tile (mmq_x) cap | QUANT-AGNOSTIC | mmq.cuh `mul_mat_q_case`; cap applies on `args.expert_bounds != nullptr` (the MUL_MAT_ID grouped path) for ANY templated `<type>`; no NVFP4 branch |
| 0015 | density-aware MoE token-tile auto-select | QUANT-AGNOSTIC | mmq.cuh; gate is `expert_bounds != nullptr` + per-expert density only, NEVER on src0 type. PROVEN on a non-NVFP4 model: the measured +4.8% win was Qwen3-Coder-30B (128 larger experts), test gate covers MXFP4 AND NVFP4 |
| 0016 | dynamic decode-first prefill budget (supersedes 0013) | QUANT-AGNOSTIC | update_slots() policy only; "identical decisions paged on or off", zero libllama/dtype touch; default-off |
| 0017 | FP4 GEMM decode mmq_y / minblocks tile tune | NVFP4-SPECIFIC, but DEFAULT-OFF / INERT | mmq.cuh `get_mmq_y_host`: fires only `type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)`. BUT the patch is a recorded NO-BUILD: every occupancy probe REGRESSED (kill-gate tripped), so nothing is enabled by default. Default build is byte-identical to stock; it changes no behavior |
| 0018 | in-place SSM recurrent-state write-back | QUANT-AGNOSTIC | gated_delta_net.cu + ggml.c; operates on the f32 recurrent state cache (`state->type == GGML_TYPE_F32`); removes a D2D f32 state copy. Weights never read by this op |
| 0019 | fused recurrent-state gather (ids read, no get_rows) | QUANT-AGNOSTIC | reads the f32 state cache via ids; builder asserts F32 on q/k/v/g/beta/state/state_dst; mirrors ggml_ssm_scan. No weight dtype involved |
| 0020 | gated-DeltaNet o_proj MMVQ->MMQ reshape | QUANT-AGNOSTIC (routing) | qwen35.cpp/qwen35moe.cpp/qwen3next.cpp: a 2D-vs-3D RESHAPE of the f32 activation so `src1->ne[1]=128` routes to MMQ instead of batch-1 MMVQ. The MMVQ(ne[1]<=8)-vs-MMQ dispatch is a generic ggml-cuda decision present for EVERY quantized type. See section 3 |
| 0021 | in-place conv-state fusion (conv+silu+ring write) | QUANT-AGNOSTIC | ssm-conv.cu + ggml.c new op asserts `conv_states/conv_kernel/x_cur/conv_state_dst == GGML_TYPE_F32`; pure f32 conv-state work |
| 0022 | gated_delta_net_cuda occupancy/coalescing retune | QUANT-AGNOSTIC | gated_delta_net.cu kernel: q/k/v/g/beta/state are all f32; the COLS_PER_WARP/NUM_WARPS fold is a scheduling change on the f32 recurrence. Never touches a weight tensor |
| 0023 | MoE NVFP4 activation-quantize de-dup | NVFP4-SPECIFIC | mmq.cu: the `gather_mmq_fp4` de-dup is INSIDE `if (use_native_fp4) { ... }`. Gathers `block_fp4_mmq`. The non-FP4 path (`quantize_mmq_q8_1_cuda`) is untouched. Confirmed NVFP4-only |
| 0024 | paged-pool burst reclaim (truncate/defrag/release) | QUANT-AGNOSTIC | paged-alloc / paged-kv-manager / llama-kv-cache host accounting; "never KV values or compute, no ggml op touched"; gated behind LLAMA_KV_PAGED |
| 0025 | MoE-decode CUDA-graph re-graph (graph-safe id path) | QUANT-AGNOSTIC (corrects hypothesis) | ggml-cuda.cu: relaxes the MUL_MAT_ID graph guard when `ggml_is_quantized(src0) && ggml_cuda_should_use_mmq(...)`. Gated on the GENERIC quantized-MMQ grouped path, NOT on NVFP4. See section 4 |
| 0026 | hybrid per-head f32/bf16 SSM state (--cache-type-ssm / tau) | QUANT-AGNOSTIC, default-off (and precision-changing) | common/arg.cpp + cparams type_s/type_r + tau; changes the RECURRENT-STATE cache dtype (f32 default, bf16 opt-in). Independent of the weight quant; default tau=0 keeps bit-exact f32 |
| 0028 | residual conv-tap gather fusion (ids read) | QUANT-AGNOSTIC | ssm-conv.cu new SSM_CONV_UPDATE_IDS op reads the f32 conv cache via ids; eliminates the last k_get_rows in the GDN decode path. f32 throughout |
| 0029 | block-table within-step host cache | QUANT-AGNOSTIC | llama-kv-cache.cpp / paged-attn.cpp: memcpy-reuse of an int32 block table across full-attn layers of a step; pure host pipeline, bit-exact |
|
||||
|
||||
(There is no patch 0027.)
|
||||
|
||||
### Summary count
|
||||
- QUANT-AGNOSTIC (helps any weight quant): 0013, 0014, 0015, 0016, 0018, 0019, 0020,
|
||||
0021, 0022, 0024, 0025, 0026, 0028, 0029 - 14 of 16 landed patches.
|
||||
- NVFP4-SPECIFIC: 0023 (the only landed NVFP4-only optimization) + 0017 (NVFP4-only but
|
||||
default-off / inert, no measured win).
|
||||
|
||||
---
|
||||
|
||||
## 3. 0020 IN DETAIL - MMQ-over-MMVQ at batched decode is a win for ANY quantized type
|
||||
|
||||
The hypothesis is CONFIRMED. 0020 is not an FP4 trick:
|
||||
|
||||
- The gated-DeltaNet op left its output in 3D SSM layout `[value_dim, n_seq_tokens=1,
|
||||
n_seqs=128]`, so the ssm_out matmul saw `src1->ne[1] = 1` with the 128 sequences
|
||||
stuck in `ne[2]`.
|
||||
- ggml-cuda dispatches `ne[1] <= 8` to MMVQ (the batch<=8 GEMV) and larger to MMQ
|
||||
(the tensor-core GEMM). This `ne[1]`-threshold dispatch is type-INDEPENDENT: it is
|
||||
the same routing for Q4_K, Q8_0, Q6_K, MXFP4, NVFP4 - every k-/legacy-quant has BOTH
|
||||
an MMVQ (mmvq.cu vec_dot) AND an MMQ (mmq.cuh) path.
|
||||
- The fix is a `ggml_reshape_2d` to `[value_dim, n_seq_tokens*n_seqs] = [6144, 128]` so
|
||||
`src1->ne[1] = 128` routes to the M=128 MMQ GEMM that amortizes the ssm_out weight
|
||||
read across all 128 sequences. Same contiguous data, bit-identical.
|
||||
|
||||
Why it generalizes: at batched decode (npl 32-128) the weight read of ssm_out is the
|
||||
cost, and MMVQ at the degenerate batch-1 shape re-reads / fails to amortize the weight
|
||||
for whatever dtype the weight is. MMQ at M=128 reads each weight tile once for all 128
|
||||
tokens. That amortization is a pure bandwidth win that exists for every quantized
|
||||
weight type, not just NVFP4. A Q4_K or Q8_0 Qwen3.6 has the exact same 3D-SSM-output ->
|
||||
batch-1-MMVQ pathology and gets the same MMQ amortization from the reshape. (The patch
|
||||
already routes the in-projection through MMQ; only the output was stuck in 3D.)
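The routing argument can be illustrated with a toy model of the dispatch. This is not ggml code: `route` and `reshape_2d` are hypothetical stand-ins, and the only facts taken from the text are the `ne[1] <= 8` threshold, the 3D shape `[value_dim, 1, 128]`, and the 2D target `[6144, 128]`. Nothing in the model depends on the weight type, which is the point being made.

```python
# Threshold quoted in the text: ne[1] <= 8 goes to MMVQ, larger goes to MMQ.
MMVQ_MAX_BATCH = 8


def route(src1_ne):
    """Pick the matmul path from the activation's ne[1] only (toy model)."""
    return "MMVQ" if src1_ne[1] <= MMVQ_MAX_BATCH else "MMQ"


def reshape_2d(ne):
    """Fold [d0, d1, d2] into [d0, d1 * d2]; same contiguous data."""
    d0, d1, d2 = ne
    return (d0, d1 * d2)


# 3D SSM layout: [value_dim, n_seq_tokens=1, n_seqs=128]
before = (6144, 1, 128)
after = reshape_2d(before)
```

Here `route(before)` sees `ne[1] = 1` and `route(after)` sees `ne[1] = 128`, so the same data moves from the batch-1 GEMV path to the M=128 GEMM path.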
|
||||
|
||||
The same logic underwrites 0014/0015 (the MoE `mmq_x` token-tile is a generic grouped-
|
||||
MMQ knob; the win was measured on a non-NVFP4 Qwen3-Coder-30B) and 0025 (section 4).
|
||||
|
||||
---
|
||||
|
||||
## 4. 0025 CORRECTS THE HYPOTHESIS - it is quant-agnostic, not NVFP4-specific
|
||||
|
||||
The hypothesis listed "the act-quant / quantize_mmq_nvfp4 portions of 0025" as
|
||||
NVFP4-specific. That is a patch-number mismatch. The ACTUAL patch 0025
|
||||
(0025-qwen35moe-nvfp4-moe-decode-regraph.patch) does NOT contain any act-quant /
|
||||
quantize_mmq_nvfp4 code. Its entire diff is one hunk in ggml-cuda.cu:
|
||||
|
||||
    bool mmid_needs_sync = !ggml_is_quantized(src0->type) || node->ne[2] > mmvq_mmid_max;
    if (mmid_needs_sync && ggml_is_quantized(src0->type) &&
            getenv("LLAMA_MOE_FORCE_GRAPHS") &&
            ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[2], src0->ne[2])) {
        mmid_needs_sync = false; // keep CUDA graphs on for the grouped-MMQ id path
    }
|
||||
|
||||
The relax condition is `ggml_is_quantized(src0->type) && ggml_cuda_should_use_mmq(...)`
|
||||
- the GENERIC quantized grouped-MMQ id-path, NOT NVFP4. `should_use_mmq()` returns true
|
||||
for Q4_K / Q8_0 / etc. at large enough batch just as for NVFP4. So a Q4_K or Q8_0 MoE
|
||||
Qwen3.6 whose MUL_MAT_ID takes the grouped MMQ path also keeps CUDA graphs across the
|
||||
MoE decode step under LLAMA_MOE_FORCE_GRAPHS. 0025 is quant-agnostic.
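The guard's logic can be restated as a truth-table sketch to make the quant-agnostic claim easy to check. This is a Python transliteration of the hunk quoted above, for illustration only: the parameter names are assumptions, and `should_use_mmq` stands in for the result of `ggml_cuda_should_use_mmq(...)` rather than reimplementing it. Note that no argument identifies NVFP4.

```python
def mmid_needs_sync(is_quantized, node_ne2, mmvq_mmid_max,
                    force_graphs_env_set, should_use_mmq):
    """Mirror of the quoted guard: True means the graph-breaking sync stays."""
    needs_sync = (not is_quantized) or node_ne2 > mmvq_mmid_max
    if needs_sync and is_quantized and force_graphs_env_set and should_use_mmq:
        # Grouped-MMQ id path: keep CUDA graphs on.
        needs_sync = False
    return needs_sync
```

Any quantized type for which the MMQ predicate holds takes the relaxed branch once the environment variable is set; an unquantized `src0` never does.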
|
||||
|
||||
LEVER2_GRAPH_COVERAGE_RESULTS.md confirms this is the role of 0025 ("0025's
|
||||
[TAG_MUL_MAT_ID_CUDA_GRAPHS] env-gate keeps the grouped MMQ id-path graph-safe").
|
||||
|
||||
Where the hypothesis's "act-quant / quantize_mmq_nvfp4" actually lives: that is
|
||||
LEVER 3 (LEVER3_ACTQUANT_FUSION_RESULTS.md - fuse W4A4 act-quant into RMSNorm/SiLU),
|
||||
which is genuinely NVFP4-specific, BUT it was a measurement STOP and NEVER LANDED (no
|
||||
patch 0030, no commit). Likewise LEVER 4 (quantizing the still-bf16 GDN/attn projections
to NVFP4, LEVER4_PROJNVFP4_RESULTS.md) is NVFP4-specific but FAILED its KL gate (~6% PPL)
and was NOT shipped. So the only NVFP4-specific code that actually landed is 0023 (+ inert 0017).
|
||||
|
||||
### Net correction to the hypothesis
|
||||
- 0018/0019, 0021, 0022, 0028, 0026, 0013/0016, 0029, 0020: CONFIRMED quant-agnostic.
|
||||
- 0023: CONFIRMED NVFP4-specific.
|
||||
- 0025: WRONG in the hypothesis -> it is QUANT-AGNOSTIC (CUDA-graph guard on the generic
|
||||
quantized grouped-MMQ path). The NVFP4-specific "act-quant" work the hypothesis was
|
||||
thinking of is LEVER 3, which is unshipped (STOP), not patch 0025.
|
||||
- Bonus: 0014/0015 (not in the hypothesis) are quant-agnostic, and 0017 is
|
||||
NVFP4-specific but default-off/inert.
|
||||
|
||||
---
|
||||
|
||||
## 5. RELATIVE-IMPACT BY WEIGHT-QUANT SIZE
|
||||
|
||||
Decode is bandwidth-bound on the weight read. The quant-agnostic opts target work whose
absolute cost is INDEPENDENT of the weight quant: the f32 recurrence, the f32 conv state,
the host pipeline. The weight-read buckets (MoE expert GEMM + dense projections) scale
~linearly with bits-per-weight. So the quant-agnostic opts deliver the same ABSOLUTE
millisecond saving at every quant, but the RELATIVE % shrinks as the weight bit-width grows.
|
||||
|
||||
Anchor: the measured MoE q36-35b-a3b NVFP4 decode step (MOE_GAP_VS_VLLM.md, step =
|
||||
169.8 ms, GPU-busy 97.5%), split into quant-agnostic vs weight-quant-scaling buckets:
|
||||
|
||||
| bucket | ms/step @ NVFP4 | scales with weight bits? | which opts touch it |
|--------|-----------------|--------------------------|---------------------|
| Recurrence core (gated_delta_net) | 70.0 | NO (f32 state) | 0022 |
| Recurrent-state + conv gather/plumbing (k_get_rows 5.2 + ssm_conv 3.4) | ~8.6 | NO (f32) | 0018/0019/0021/0028 |
| Host bubble (sample+batch+block-table) | 4.2 | NO (host) | 0013/0016/0024/0029 |
| Router / norms / glue | ~5.4 | mostly NO | 0014/0015 partial |
| MoE expert GEMM | 47.3 | YES (4-bit now) | (weight read) |
| Dense GDN/attn projections + convert glue | 20.3 | YES | (weight read) |
| W4A4 act-quant tax (quantize_mmq_nvfp4) | 3.3 | (FP4 only) | 0023 |
|
||||
|
||||
Quant-agnostic, weight-size-fixed total: ~70.0 + 8.6 + 4.2 + 5.4 = ~88 ms (~52% of the
|
||||
NVFP4 step). Weight-read buckets: 47.3 + 20.3 = ~67.6 ms (~40%).
|
||||
|
||||
Model the weight-read buckets as scaling with bytes-per-weight relative to NVFP4 (4-bit
|
||||
= 1x): Q8_0 ~ 2x, bf16 ~ 4x. Hold the ~88 ms fixed (the recurrence f32 byte stream and
|
||||
host time do not change with the weight quant), and recompute the recurrence/host
|
||||
fraction of the step:
|
||||
|
||||
| weight quant | weight-read buckets (ms, est.) | fixed quant-agnostic (ms) | step (ms, est.) | recurrence+host % of step |
|--------------|--------------------------------|---------------------------|-----------------|---------------------------|
| NVFP4 (4-bit) | ~68 (1x) | ~88 | ~159 (+act-quant ~3) | ~52% (measured ~50%) |
| Q8_0 (8-bit) | ~136 (2x) | ~88 | ~224 | ~39% |
| bf16 (16-bit) | ~272 (4x) | ~88 | ~360 | ~24% |
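The scaling model behind the table is simple enough to recompute. The sketch below uses the table's rounded inputs (~68 ms weight read at NVFP4, ~88 ms fixed) and the stated multipliers; the function name is hypothetical. It deliberately leaves out the ~3.3 ms FP4-only act-quant bucket, so the NVFP4 row comes out at 156 ms and a ~56% fixed share, whereas the ~52% quoted in the table is the fixed ~88 ms taken against the measured 169.8 ms step.

```python
# Rounded inputs taken from the table above.
FIXED_MS = 88.0               # recurrence + conv + host + glue (quant-agnostic)
WEIGHT_READ_MS_NVFP4 = 68.0   # MoE expert GEMM + dense projections at 4-bit

# Bytes-per-weight multiplier relative to NVFP4, as stated in the text.
SCALE = {"NVFP4": 1, "Q8_0": 2, "bf16": 4}


def model_step(quant):
    """Return (weight_read_ms, step_ms, fixed_fraction) for one weight quant."""
    weight_ms = WEIGHT_READ_MS_NVFP4 * SCALE[quant]
    step_ms = weight_ms + FIXED_MS
    return weight_ms, step_ms, FIXED_MS / step_ms
```

Because `FIXED_MS` is held constant, the fixed fraction can only fall as the multiplier grows, which is the "same absolute saving, smaller percentage" argument in numeric form.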
|
||||
|
||||
Reading this:
|
||||
- The quant-agnostic SSM/serving opts deliver the SAME ~ms savings at Q8/bf16 as at
|
||||
NVFP4 (they remove fixed f32/host work). The headline % speedups quoted in the patch
|
||||
bodies (e.g. 0019 dense npl128 +37.8%, 0020 +31.7%, 0022 +11.1%) are the LARGEST at
|
||||
NVFP4 precisely because the fixed recurrence is the biggest fraction of the smallest
|
||||
(4-bit weight) step. The same absolute removal is a smaller % of a Q8 step and a much
|
||||
smaller % of a bf16 step, because the weight-read denominator grows.
|
||||
- This MATCHES the brief's decomposition framing (recurrence ~40-50%, GEMM ~26-28% at
|
||||
NVFP4): at NVFP4 the recurrence dominates, so the recurrence-targeting opts are where
|
||||
the win is; as the weight quant grows the GEMM dominates and the recurrence opts
|
||||
matter relatively less (but never zero, and never negative).
|
||||
- Corollary: the ONE NVFP4-specific landed lever, 0023, only addresses the ~3.3 ms FP4
act-quant tax (and only the broadcast up/gate share of it). That is the smallest bucket,
and its measured win is +1.7%. The big bit-exact wins are all quant-agnostic.
|
||||
|
||||
So the optimization set is overwhelmingly general: a Q4_K / Q8_0 / bf16 Qwen3.6 gets the
|
||||
full recurrence + conv + serving + MMQ-routing benefit; only the small FP4 act-quant
|
||||
de-dup (0023) does nothing for it (and the inert 0017 was never enabled).
|
||||
|
||||
---
|
||||
|
||||
## 6. EMPIRICAL CONFIRMATION PLAN (specify only - DO NOT run; the GPU is busy)
|
||||
|
||||
Goal: prove on hardware that the quant-agnostic opts FIRE and LIFT a non-NVFP4 Qwen3.6,
|
||||
isolating them from the one NVFP4-specific lever.
|
||||
|
||||
### 6.1 Hardware
|
||||
GB10 / DGX Spark (sm_121), when free. The DGX has live deployments; this plan is
|
||||
read-only until then. (Any Blackwell or non-Blackwell CUDA host also works to prove
|
||||
quant-GENERALITY - the recurrence/serving opts are not Blackwell-gated; only the NVFP4
|
||||
FP4-MMA tier is. Running on a non-Blackwell card would ALSO demonstrate the opts help
|
||||
where there is no use_native_fp4 path at all - a strong second proof.)
|
||||
|
||||
### 6.2 Build the non-NVFP4 control GGUF first (prerequisite)
|
||||
The same Qwen3.6 architecture, re-quantized so the weights are NOT NVFP4 but the
|
||||
gated-DeltaNet/conv recurrence is still f32:
|
||||
|
||||
- Source: the existing q36-27b (dense) and/or q36-35b-a3b (MoE) f16/bf16 GGUF already
|
||||
on the DGX (~/work/darwin_36b_opus/f16.gguf is the MoE f16 used as the LEVER4 KL
|
||||
base; an equivalent dense f16 exists).
|
||||
- Produce: `llama-quantize f16.gguf q36-27b-Q4_K_M.gguf Q4_K_M` (primary control) and
|
||||
optionally `... Q8_0` and keep the f16/bf16 as the 16-bit control. Q4_K_M is the
|
||||
cleanest contrast: 4-bit like NVFP4 but a totally different (k-quant, non-FP4-MMA)
|
||||
weight path, so any shared win is provably from the f32 recurrence / routing, not
|
||||
from FP4.
|
||||
- Note: this requantize is free (no retrain) and must be done before any A/B.
|
||||
|
||||
### 6.3 Bit-exact gate per path (same method as the patch bodies)
|
||||
For the bit-EXACT quant-agnostic opts (0018/0019/0020/0021/0022/0028/0029 and the
|
||||
host 0013/0016/0024 default-off), the gate is: greedy `llama-completion --temp 0
|
||||
--seed 1 --ignore-eos -n 256`, md5 of the output, patches-ON == patches-OFF on the
|
||||
Q4_K_M control. Per path:
|
||||
- non-paged Q4_K vs paged Q4_K (expect the same benign paged-reduction FP-order
|
||||
delta noted in PAGED_BITEXACT_NOTE.md / 0029, gate with KLD/PPL not md5 across the
|
||||
paged boundary, md5-exact within a fixed paged/non-paged setting).
|
||||
- patches-on vs patches-off (see toggles 6.4) on the Q4_K control: byte-identical md5.
|
||||
- 0026 (bf16 SSM state) is precision-CHANGING -> gate with KLD-to-f16 + PPL, not md5,
|
||||
exactly like LEVER4 did; default tau=0 stays md5-exact.
|
||||
- test-backend-ops on the build: GATED_DELTA_NET, SSM_CONV, SSM_CONV_UPDATE,
|
||||
SSM_CONV_UPDATE_IDS, MUL_MAT, MUL_MAT_ID, GET_ROWS all green (these op tests are
|
||||
dtype-parametrized and already include non-FP4 types).
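The md5 comparison at the core of the bit-exact gate can be sketched as follows. The helper names are hypothetical, and the sketch assumes the two greedy outputs have already been captured as bytes (for example from the `llama-completion` runs described above); it does not run the binaries. It applies only to the md5-exact cases, not to the paged-boundary or 0026 cases that the text gates with KLD/PPL.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex md5 digest of a captured completion."""
    return hashlib.md5(data).hexdigest()


def bit_exact(out_on: bytes, out_off: bytes) -> bool:
    """True when patches-ON and patches-OFF outputs hash identically."""
    return md5_hex(out_on) == md5_hex(out_off)
```

Hashing only the generated text, rather than a full transcript, avoids false mismatches from unrelated output such as UI chrome.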
|
||||
|
||||
### 6.4 The clean A/B (decode_agg, llama-batched-bench)
|
||||
Two arms, SAME Q4_K_M control GGUF, `-fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`,
|
||||
report S_TG (decode aggregate), median of 5 reps:
|
||||
|
||||
- Arm A (patches-OFF baseline): the cleanest is two builds - the pre-0018 paged commit
|
||||
(the SSM opts not yet present) vs HEAD. If a rebuild is not wanted, approximate
|
||||
OFF on the single HEAD binary by setting every disabling toggle at once:
|
||||
fused GDN off (cparams.fused_gdn_ar/ch path disabled - the "fusion off" mode the
|
||||
patch docs A/B against), `GDN_NW=4 GDN_CPW=1` (0022 pre-retune), `LLAMA_MOE_AUTO_TILE=0`
|
||||
(0015), no `LLAMA_MOE_FORCE_GRAPHS` (0025 off), `LLAMA_PAGED_NO_BT_CACHE=1` (0029),
|
||||
`LLAMA_PAGED_NO_RECLAIM=1` (0024), `LLAMA_PREFILL_BUDGET`/`LLAMA_MAX_BATCH_TOKENS`
|
||||
unset (0013/0016), tau=0 / ctssm f32 (0026). The two-build form is preferred for a
|
||||
publishable number; the env form is a fast same-binary sanity A/B.
|
||||
- Arm B (patches-ON default): stock defaults (fusion on, 16x8, auto-tile on,
|
||||
FORCE_GRAPHS on for the MoE graph arm, bt-cache on, reclaim on).
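For the same-binary form of the A/B, the environment for Arm A and the median-of-reps comparison could be assembled like this. The variable names and values are the ones listed above; the function names are hypothetical. The sketch covers only the environment toggles: the fused-GDN switch and the tau / SSM cache type are described above as cparams or CLI settings, so they are out of scope here.

```python
import statistics

# Toggles named in the text for approximating patches-OFF on one binary.
ARM_A_SET = {
    "GDN_NW": "4",
    "GDN_CPW": "1",
    "LLAMA_MOE_AUTO_TILE": "0",
    "LLAMA_PAGED_NO_BT_CACHE": "1",
    "LLAMA_PAGED_NO_RECLAIM": "1",
}
ARM_A_UNSET = (
    "LLAMA_MOE_FORCE_GRAPHS",
    "LLAMA_PREFILL_BUDGET",
    "LLAMA_MAX_BATCH_TOKENS",
)


def arm_a_env(base_env):
    """Copy base_env, drop the opt-in variables, and set the disabling ones."""
    env = {k: v for k, v in base_env.items() if k not in ARM_A_UNSET}
    env.update(ARM_A_SET)
    return env


def speedup_pct(arm_a_reps, arm_b_reps):
    """Percent gain of Arm B over Arm A using the median S_TG of each arm."""
    a = statistics.median(arm_a_reps)
    b = statistics.median(arm_b_reps)
    return (b / a - 1.0) * 100.0
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the benchmark; with five reps per arm the median is the third-ranked value, which keeps one outlier run from moving the result.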
|
||||
|
||||
### 6.5 What result confirms quant-generality
|
||||
1. The quant-agnostic opts FIRE on Q4_K: nsys on Arm B (Q4_K) shows the same kernel
|
||||
deltas the NVFP4 runs showed - `k_get_rows_float` bucket collapses (0019/0028),
|
||||
`concat_cont` + decode `cpy_scalar` gone and `ssm_conv_update` present (0021), the
|
||||
o_proj `mul_mat_vec_q m=1` bucket gone and absorbed into `mul_mat_q m=128`
|
||||
(0020 - now a Q4_K MMQ kernel, proving the routing win is not FP4-bound),
|
||||
`get_block_table` host time down ~90% (0029).
|
||||
2. The opts LIFT the non-NVFP4 model: Arm B S_TG > Arm A S_TG on the Q4_K control at
|
||||
npl 32 and 128, with the recurrence/routing opts contributing the bulk (expect a
|
||||
smaller % than the NVFP4 runs per section 5, but clearly positive and of the same
|
||||
absolute ms order).
|
||||
3. The NVFP4-specific lever does NOTHING on Q4_K: toggling 0023
|
||||
(`GGML_CUDA_MOE_QUANT_DEDUP=0` vs default) shows ZERO delta on the Q4_K MoE control
|
||||
(it never enters the `use_native_fp4` branch) - the negative control that isolates
|
||||
the one NVFP4-only optimization from the general ones.
|
||||
|
||||
A clean pass = Arm B beats Arm A on Q4_K with the SSM/conv/routing/host kernel deltas
|
||||
present and 0023 inert. That proves the decode wins are quant-general; NVFP4 is just the
|
||||
weight quant where they show the largest PERCENTAGE because its weight read is smallest.
|
||||
|
||||
---
|
||||
|
||||
## 7. ONE-LINE VERDICT
|
||||
|
||||
14 of the 16 landed paged decode patches (0013-0029) are quant-agnostic: they act on the
|
||||
f32 gated-DeltaNet/conv recurrent state, the paged serving host path, or the generic
|
||||
MMQ-vs-MMVQ / CUDA-graph routing, none of which read the weight tensor's quant type. Only
|
||||
0023 is genuinely NVFP4-specific (and 0017 is NVFP4-only but default-off/inert). The
|
||||
hypothesis was right except for 0025, which is quant-agnostic (a generic
|
||||
`ggml_is_quantized && should_use_mmq` CUDA-graph guard); the NVFP4-specific "act-quant"
|
||||
work it was conflated with is LEVER 3, which never shipped. The opts deliver fixed
|
||||
absolute ms savings at any weight quant; the % is largest at NVFP4 only because its
|
||||
4-bit weight read makes the fixed recurrence the biggest slice of the step.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,464 +0,0 @@
|
||||
# Qwen3.6 NVFP4-vs-NVFP4: llama.cpp vs vLLM on GB10 (DGX Spark)
|
||||
|
||||
Apples-to-apples benchmark. Both engines run the **same NVFP4 weights** on the **same box**
|
||||
(GB10, sm_121, LPDDR5x unified memory ~273 GB/s). The question is not "who wins the HW
|
||||
lottery" but "at matched NVFP4, on one bandwidth-limited box, does our paged llama.cpp
|
||||
(patch 0015, expert-density-aware MoE token-tile auto-select, default-on) sit at par with /
|
||||
ahead of / behind vLLM?"
|
||||
|
||||
---
|
||||
|
||||
# FINAL shipping benchmark (patch 0023, f32 bit-exact build) — 2026-06-26
|
||||
|
||||
This is the **publishable, plot-ready** apples-to-apples result. Both engines at their **best
|
||||
realistic config** (no handicapping either side), matched NVFP4 weights, one clean GB10 box
|
||||
(LocalAI service containers stopped for the duration, restored after). Raw rows in
|
||||
[`final_benchmark.csv`](final_benchmark.csv); per-row checkpoint log in
|
||||
[`BENCHMARK_PROGRESS.md`](BENCHMARK_PROGRESS.md).
|
||||
|
||||
## Build under test (the clean shipping result)
|
||||
|
||||
- **llama.cpp** = patch **0023**, dev tree `~/llama-paged-dev` HEAD **`f7409c2`**, git-clean
|
||||
(the shelved bf16-GDN-state work was reverted; `git diff` empty at HEAD before the
|
||||
`build-cuda` rebuild). Greedy gate confirmed canonical f32 output on both models. The bf16
|
||||
GDN-state path is **shelved** (it fails the f32 KL gate); the shipped plateau is the
|
||||
**95%-bit-exact f32** stack (patches 0018-0023). dense greedy md5 `5951a5b4…`, MoE
|
||||
`07db32c2…` are the 0023 references (the *transcript* md5 also encodes llama-cli UI chrome,
|
||||
which has since changed, so the build was verified instead via the clean git tree + full
|
||||
rebuild + the greedy numerical gate).
|
||||
|
||||
## Config (both engines at BEST realistic config)
|
||||
|
||||
- **llama-server**: `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
|
||||
`LLAMA_KV_PAGED=1`, **CUDA graphs ON** (`USE_GRAPHS=1`, default), and the QoS prefill budget
|
||||
**`LLAMA_MAX_BATCH_TOKENS=512`** (patch 0016 decode-first dynamic budget). 512 is the
|
||||
`n_ubatch` floor and is the best of the swept budgets: at npl32 it gives 133 s TTFT vs
|
||||
**394 s for stock** (no budget) — lower budget = stronger decode-first = better burst TTFT,
|
||||
and decode throughput is budget-independent.
|
||||
- **vLLM 0.23.0**: its strongest honest decode config — **CUDA graphs ON** (NOT
|
||||
`--enforce-eager`; `cudagraph_mode=FULL_AND_PIECEWISE`), `--gpu-memory-utilization 0.85
|
||||
--max-model-len 4096 --max-num-seqs 256 -tp 1`, chunked prefill on, prefix caching off.
|
||||
- **Client** (`h2h_cli3.py`, identical async harness both sides): 512-token **unique-nonce**
|
||||
prompt (fresh full prefill every request, defeats all prefix caching), `max_tokens=256`,
|
||||
`temperature=0`, `ignore_eos=True`, streaming with usage; concurrency npl 8/32/64/128.
|
||||
- **Precision asymmetry (in llama's disfavour, yet llama still competes)**: llama runs
|
||||
**f32 GDN recurrent state + q8 activations**; vLLM runs **bf16 GDN state + w4a4**. The
|
||||
numbers below are llama at *higher* precision.
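The unique-nonce discipline in the client can be sketched as below. This is not `h2h_cli3.py`: the function name, the `prompt` and `stream` field names, and the nonce format are assumptions for illustration, while `max_tokens`, `temperature` and `ignore_eos` are the settings listed above. The sketch does not pad the prompt to 512 tokens.

```python
import uuid


def make_request(base_prompt, max_tokens=256):
    """Build one request body whose prompt starts with a fresh nonce."""
    nonce = uuid.uuid4().hex
    return {
        "prompt": f"[nonce {nonce}] {base_prompt}",
        "max_tokens": max_tokens,
        "temperature": 0,
        "ignore_eos": True,
        "stream": True,
    }
```

Placing the nonce at the start of the prompt means two requests diverge from the first tokens, so no shared prefix is left for a cache to reuse.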
|
||||
|
||||
## DENSE — Qwen3.6-27B NVFP4 (`q36-27b-nvfp4`)
|
||||
|
||||
| npl | engine | decode_agg t/s | decode_perseq t/s | prefill t/s | ttft_mean ms | peak_gb | engine_gb |
|----:|--------|---------------:|------------------:|------------:|-------------:|--------:|----------:|
| 8 | llama | **82.5** | 9.57 | 507 | 6 038 | 53.5 | 50.2 |
| 8 | vLLM | 70.4 | 8.76 | 2096 | 1 861 | 110.9 | 107.6 |
| 32 | llama | **192.6** | 4.79 | 115 | 133 552 | 69.6 | 66.3 |
| 32 | vLLM | 211.8 | 6.28 | 2183 | 5 353 | 110.9 | 107.6 |
| 64 | llama | **277.8** | 3.09 | 96 | 321 619 | 84.0 | 80.6 |
| 64 | vLLM | 309.1 | 4.38 | 2089 | 9 512 | 110.9 | 107.6 |
| 128 | llama | **384.6** | 1.86 | 70 | 902 763 | 93.8 | 90.5 |
| 128 | vLLM | 418.8 | 2.79 | 1929 | 18 450 | 111.0 | 107.6 |
|
||||
|
||||
**llama decode as % of vLLM (dense):** npl8 **117%**, npl32 **91%**, npl64 **90%**, npl128 **92%**.
|
||||
|
||||
## MoE — Qwen3.6-35B-A3B NVFP4 (`q36-35b-a3b-nvfp4`)
|
||||
|
||||
| npl | engine | decode_agg t/s | decode_perseq t/s | prefill t/s | ttft_mean ms | peak_gb | engine_gb |
|----:|--------|---------------:|------------------:|------------:|-------------:|--------:|----------:|
| 8 | llama | 211.8 | 24.45 | 1236 | 2 477 | 39.7 | 36.1 |
| 8 | vLLM | 256.5 | 31.84 | 5187 | 769 | 109.6 | 106.3 |
| 32 | llama | 393.0 | 10.02 | 1214 | 8 225 | 47.1 | 43.8 |
| 32 | vLLM | 500.8 | 14.90 | 6223 | 1 830 | 109.6 | 106.4 |
| 64 | llama | 527.0 | 6.15 | 1152 | 15 850 | 57.1 | 53.8 |
| 64 | vLLM | 686.1 | 9.83 | 5927 | 3 224 | 109.6 | 106.4 |
| 128 | llama | 726.4 | 3.73 | 277 | 213 017 | 61.5 | 58.2 |
| 128 | vLLM | 882.2 | 6.05 | 5301 | 6 488 | 109.6 | 106.4 |
|
||||
|
||||
**llama decode as % of vLLM (MoE):** npl8 **83%**, npl32 **78%**, npl64 **77%**, npl128 **82%**.
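The percentage lines for both models follow directly from the `decode_agg` columns of the two tables. A short recomputation, with a hypothetical function name and the table values copied in by hand:

```python
# (llama, vLLM) decode_agg t/s per concurrency, copied from the tables above.
DECODE_AGG = {
    "dense": {8: (82.5, 70.4), 32: (192.6, 211.8),
              64: (277.8, 309.1), 128: (384.6, 418.8)},
    "moe": {8: (211.8, 256.5), 32: (393.0, 500.8),
            64: (527.0, 686.1), 128: (726.4, 882.2)},
}


def pct_of_vllm(model):
    """llama decode as a whole-number percentage of vLLM, per npl."""
    return {
        npl: round(100.0 * llama / vllm)
        for npl, (llama, vllm) in DECODE_AGG[model].items()
    }
```

`pct_of_vllm("dense")` gives 117, 91, 90 and 92 for npl 8, 32, 64 and 128, and `pct_of_vllm("moe")` gives 83, 78, 77 and 82, matching the quoted figures.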
|
||||
|
||||
## Plots (decode throughput vs concurrency)
|
||||
|
||||
Generated from [`final_benchmark.csv`](final_benchmark.csv) (matplotlib); the per-point label is
|
||||
llama as a share of vLLM decode at that concurrency.
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
## The honest public story (let the numbers speak)
|
||||
|
||||
1. **Decode throughput: the headline.** On the dense 27B, paged llama.cpp **matches/beats
vLLM**: 117% of vLLM at npl8 and a steady **90-92%** across npl32-128, at *higher*
precision (f32 GDN state + q8 act vs vLLM bf16 + w4a4). On the MoE 35B-A3B llama lands at
**77-83%** of vLLM decode: close, but vLLM's fused grouped-GEMM MoE keeps a clear edge.
2. **Memory: a decisive llama win.** vLLM's pre-reserved pool is a **flat ~107 GB** at every
concurrency (the `--gpu-memory-utilization 0.85` design). llama's **on-demand paged KV**
uses **50-90 GB (dense)** and **36-58 GB (MoE)**, growing with load: at the operating point
most people actually run (npl≤32) llama uses **~1.5-3× less unified memory**, and even at
npl128 it stays below vLLM. This is the "fits where vLLM OOMs" axis.
3. **TTFT: vLLM's win, llama's disclosed tradeoff.** vLLM's chunked prefill absorbs a
128-way simultaneous burst gracefully (6-18 s). llama's decode-first QoS budget protects
decode throughput by throttling burst-prefill, so TTFT climbs at high concurrency
(dense npl128 **903 s**, MoE npl128 **213 s**). It is *bounded relative to no-budget*
(stock is worse) but high in absolute terms under a synchronized burst. Under realistic
staggered arrival this is far milder; for a synchronized-burst benchmark it is the cost of
the decode-first scheduler. **Decode and memory are unaffected.**

**Bottom line for the GB10 / DGX Spark page:** with matched NVFP4 weights, paged llama.cpp
delivers **90-117% of vLLM dense decode** and **77-83% of vLLM MoE decode** at **equal-or-higher
precision** and **1.5-3× lower memory** (on-demand paged KV vs a fixed 107 GB pool). The
remaining gap is MoE-decode and burst-TTFT, not dense-decode or memory.
|
||||
|
||||
## Anomalies / methodology notes (rigour)
|
||||
|
||||
- **Paged-pool burst degradation (real, worked around).** After a high-npl burst, a llama
|
||||
server's *subsequent lower-npl* prefill collapses (npl8 fresh = 507 t/s / 6 s TTFT; the same
|
||||
npl8 *after* an npl64 burst = 65 t/s / 64 s TTFT). Decode is unaffected. To measure clean
|
||||
per-config prefill/TTFT, **the llama server is restarted per npl** (cheap vs the prefill
|
||||
cost). vLLM has no such degradation — verified by an end-of-sweep npl8 re-check that matched
|
||||
the opening npl8 (dense 70.4→73.4, MoE 256.5→226.4) — so vLLM uses one server per combo.
|
||||
- **Fresh-prefill discipline.** Every measured request uses a unique nonce so prefill is always
|
||||
a full fresh compute (the task's "defeat prefix caching" intent); vLLM ran with
|
||||
`enable_prefix_caching=False`, llama with `cache_prompt:false`. Apples-to-apples.
|
||||
- **No bimodality observed.** With per-npl restart + a cheap (ptok=8) graph warmup, the early
|
||||
two-pass checks matched within <0.5% (npl8 486/484 t/s), so the headline uses one stable
|
||||
measured pass per (model,engine,npl).
|
||||
- **Clean environment.** The benchmark's peak (dense ~94 GB) plus the idle LocalAI worker's
|
||||
~30 GB resident model OOM-cycled the service containers on the first attempt and corrupted
|
||||
one run; the `local-ai`/`local-ai-worker` containers were stopped for the measurement
|
||||
(baseline ~3.3 GB, ~120 GB free) and **restarted afterwards** to return the host.
|
||||
- **peak_gb** is absolute unified-memory used (`MemTotal-MemAvailable`) peak; `engine_gb` =
|
||||
peak − the ~3.3 GB OS baseline (the per-config engine footprint).
|
||||
- **Internal-consistency check (decode_agg vs perseq×npl).** `decode_agg_tps` is the steady-state
|
||||
aggregate over the decode window; `decode_perseq_tps` is each sequence's lifetime rate (output
|
||||
tokens ÷ total request latency, so it *includes* the TTFT queue wait). They coincide when
|
||||
TTFT ≪ decode-window (vLLM npl8: 70.4 vs 70.1, +0.5%) and diverge exactly as TTFT grows, on
|
||||
**both** engines (the agg−perseq×npl gap rises monotonically with `ttft_mean`: vLLM 0.5%→17%,
|
||||
llama 8%→62% across npl8→128, mirroring its 6 s→903 s TTFT). The relationship is governed by
|
||||
TTFT, not a measurement artifact, and the FINAL rows are distinct from the historical patch-0015
|
||||
table (no stale-baseline carry-over).
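
The internal-consistency check in the last bullet can be re-derived from the table rows. The sketch below does so for the MoE rows shown earlier, assuming the columns are `npl | engine | decode_agg | decode_perseq | prefill | ttft_ms | ...` and assuming the gap is normalised by the aggregate (the text does not state the normalisation, so the percentages need not match the dense figures quoted above):

```python
# (npl, decode_agg_tps, decode_perseq_tps, ttft_mean_ms), MoE rows only.
ROWS = {
    "llama": [
        (32, 393.0, 10.02, 8225),
        (64, 527.0, 6.15, 15850),
        (128, 726.4, 3.73, 213017),
    ],
    "vllm": [
        (8, 256.5, 31.84, 769),
        (32, 500.8, 14.90, 1830),
        (64, 686.1, 9.83, 3224),
        (128, 882.2, 6.05, 6488),
    ],
}


def agg_gap_pct(npl: int, agg: float, perseq: float) -> float:
    """Gap between aggregate decode and (per-seq rate x npl), as % of aggregate."""
    return 100.0 * (agg - perseq * npl) / agg


gaps = {
    engine: [round(agg_gap_pct(npl, agg, perseq), 1) for npl, agg, perseq, _ in rows]
    for engine, rows in ROWS.items()
}
for engine, values in gaps.items():
    print(engine, values)
```

On both engines the gap grows with concurrency, in step with `ttft_ms`, which is the behaviour the bullet describes.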
|
||||
|
||||
---
|
||||
|
||||
## Setup (historical — patch 0015 run; FINAL section above is the shipping 0023 result)
|
||||
|
||||
- **Box**: GB10 / DGX Spark, sm_121, unified LPDDR5x (~273 GB/s). Memory figures are
|
||||
unified-memory used GB (`MemTotal-MemAvailable`), so they cover weights + KV + runtime.
|
||||
- **llama.cpp**: dev tree `~/llama-paged-dev` branch `paged` HEAD `151343b` (patch 0015),
|
||||
`build-cuda` sm_121, `LLAMA_KV_PAGED=1`, `llama-server -c 131072 --parallel 128 -b 2048
|
||||
-ub 512 -ngl 99 -fa on`. **NOTE: run WITHOUT `max_prefill_tokens` (patch 0013) - see the
|
||||
TTFT caveat in the verdict.**
|
||||
- **vLLM**: 0.23.0, `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096
|
||||
--max-num-seqs 256 -tp 1`.
|
||||
- **Client**: identical async client for both engines. Per request: 512-token unique prompt
|
||||
(unique leading tokens defeat cross-request prefix caching), `max_tokens=256`,
|
||||
`temperature=0`, `ignore_eos=True`, streaming with usage. Concurrency (npl) swept 8/32/64/128.
|
||||
- **Metrics** (localmaxxing.com schema): `decode_agg_tps` (aggregate decode tok/s across all
|
||||
live seqs), `decode_perseq_tps` (mean per-sequence decode), `prefill_tps`, `ttft_mean_ms`,
|
||||
`PEAK_GB` (unified-memory peak).
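
A hedged sketch of how one request in this workload could be built. The sampling parameters come from the Client bullet above; the `build_request` helper, the nonce scheme, and the exact JSON field names (an OpenAI-style completions body with `stream_options.include_usage`) are assumptions for illustration, not the real `h2h_cli.py` code. A real client would also need to hold the prompt at exactly 512 tokens after the nonce is added.

```python
import uuid


def build_request(prompt_body: str) -> dict:
    """Build one benchmark request with a unique leading nonce (hypothetical helper)."""
    nonce = uuid.uuid4().hex  # unique leading text defeats cross-request prefix caching
    return {
        "prompt": f"{nonce} {prompt_body}",
        "max_tokens": 256,
        "temperature": 0,
        "ignore_eos": True,
        "stream": True,
        "stream_options": {"include_usage": True},  # assumed field for "streaming with usage"
    }


first = build_request("lorem ipsum")
second = build_request("lorem ipsum")
print(first["prompt"] != second["prompt"])  # True: every request starts differently
```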
|
||||
|
||||
## The 4 models (NVFP4, matched weights)
|
||||
|
||||
| Model | llama.cpp GGUF | vLLM checkpoint | Match |
|-------|----------------|-----------------|-------|
| DENSE Qwen3.6-27B (28B dense) | `q36-27b-nvfp4.gguf` (native Blackwell FP4) | `q36-27b-nvfp4-vllm/` (unsloth TRUE W4A4) | clean W4A4 both sides |
| MoE Qwen3.6-35B-A3B (36B total, ~3B active) | `q36-35b-a3b-nvfp4.gguf` (241 NVFP4 tensors, nvidia weights) | `q36-35b-a3b-nvfp4-vllm/` (nvidia modelopt; vLLM picks Marlin NvFp4 MoE + FA2) | NVFP4 weight-only, identical nvidia weights |
|
||||
|
||||
---
|
||||
|
||||
## Results (decode aggregate tok/s, per-seq, prefill, TTFT, peak GB)
|
||||
|
||||
### MoE Qwen3.6-35B-A3B (~3B active)
|
||||
|
||||
| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
| 8 | llama | 170.2 | 20.27 | 2813 | 855 | 38.98 |
| 8 | vLLM | 202.0 | 24.92 | 4648 | 799 | 111.49 |
| 32 | llama | 235.4 | 6.77 | 2005 | 4970 | 43.06 |
| 32 | vLLM | 462.0 | 13.59 | 4755 | 2308 | 111.26 |
| 64 | llama | 271.7 | 3.88 | 2389 | 7205 | 52.53 |
| 64 | vLLM | 624.5 | 8.90 | 4784 | 4072 | 111.46 |
| 128 | llama | 292.2 | 2.05 | 657 | 84800 | 61.42 |
| 128 | vLLM | 811.1 | 5.46 | 4263 | 7980 | 111.61 |
|
||||
|
||||
llama decode as % of vLLM: **84 / 51 / 44 / 36** at npl 8/32/64/128.
|
||||
|
||||
### DENSE Qwen3.6-27B
|
||||
|
||||
| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
| 8 | llama | 63.8 | 7.60 | 1117 | 2029 | 51.72 |
| 8 | vLLM | 64.3 | 7.98 | 1514 | 2593 | 112.07 |
| 32 | llama | 108.9 | 3.08 | 752 | 13212 | 61.48 |
| 32 | vLLM | 189.8 | 5.57 | 1555 | 7477 | 112.09 |
| 64 | llama | 126.2 | 1.78 | 465 | 53818 | 74.90 |
| 64 | vLLM | 284.2 | 3.92 | 1526 | 12942 | 112.11 |
| 128 | llama | 134.6 | 0.93 | 125 | 491195 | 94.03 |
| 128 | vLLM | 390.7 | 2.50 | 1420 | 24806 | 112.12 |
|
||||
|
||||
llama decode as % of vLLM: **99 / 57 / 44 / 34** at npl 8/32/64/128.
|
||||
|
||||
---
|
||||
|
||||
## Verdict
|
||||
|
||||
**At matched NVFP4 on one GB10 box: llama.cpp is at parity only at low concurrency; vLLM
|
||||
scales substantially better as concurrency rises.**
|
||||
|
||||
1. **npl=8 (low concurrency): near parity.** Dense 99%, MoE 84% of vLLM decode. The MoE's
|
||||
~3B active shows: per-seq decode 20-25 tok/s (MoE) vs 8 tok/s (dense) on both engines.
|
||||
|
||||
2. **npl>=32 (high concurrency): vLLM pulls decisively ahead** - decode ~1.7-2x (npl32) rising to
|
||||
~2.8-2.9x (npl128) on both models. vLLM scales monotonically (dense 64->391, MoE 202->811);
|
||||
llama plateaus (dense 64->135, MoE 170->292).
|
||||
|
||||
3. **TTFT is the clearest gap, and it is largely self-inflicted here.** llama's TTFT explodes
|
||||
at high concurrency (dense **491 s**, MoE **85 s** at npl128) while vLLM stays bounded (25 s,
|
||||
8 s). **This run used llama WITHOUT `max_prefill_tokens` (patch 0013)** - so 128 concurrent
|
||||
512-token prefills starve each other and the decode. Crucially, that starvation also drags
|
||||
`decode_agg` down: while many slots are stuck prefilling, fewer are actually decoding, so the
|
||||
measured aggregate understates llama's steady-state decode. A re-run with `max_prefill_tokens`
|
||||
(the QoS budget this PR already ships) is expected to bound TTFT AND raise high-concurrency
|
||||
decode by keeping all slots live.
|
||||
|
||||
4. **Memory: llama wins on efficiency.** vLLM pre-reserves the whole pool (~112 GB at
|
||||
gpu-mem-util 0.85); llama grows on demand (MoE 38->61 GB, dense 52->94 GB). The paged
|
||||
on-demand KV is materially more memory-efficient / multi-tenant-friendly.
|
||||
|
||||
5. **vs the localmaxxing reference (259.5 MoE / 254.8 dense top-speed):** those are single-stream
|
||||
on fast datacenter HW. GB10 per-seq decode tops out far lower (MoE ~25, dense ~8 tok/s at
|
||||
npl8) - the LPDDR5x ~273 GB/s bandwidth floor, as expected. The reference is a ceiling, not a
|
||||
GB10 target.
|
||||
|
||||
### Honest bottom line
|
||||
|
||||
The "par-or-beat vLLM" goal is **met at low concurrency but NOT at high concurrency** on these
|
||||
NVFP4 models. vLLM's continuous-batched decode + bounded prefill scheduling scale better on a
|
||||
bandwidth-limited box. Two of the three gap drivers are addressable on our side: (a) **prefill
|
||||
starvation** - re-run with `max_prefill_tokens` (patch 0013), which this PR ships; (b) **decode
|
||||
batching efficiency at high concurrency** - the runtime/scheduler lever (the small/unsaturated
|
||||
regime). The kernel itself is at parity (npl8). Next step: a fair re-run with the prefill budget
|
||||
on, plus decode-batch tuning, to get llama's true high-concurrency numbers before concluding the
|
||||
absolute gap.
|
||||
|
||||
---
|
||||
|
||||
## Fair re-run (max_prefill_tokens on)
|
||||
|
||||
The prior tables ran llama-server **without** the QoS prefill budget (patch 0013). This section
|
||||
re-runs the same A/B with `LLAMA_PREFILL_BUDGET` set, sweeping the per-step prompt-token cap over
|
||||
**256 / 512 / 1024**. Everything else is byte-identical to the prior run: dev-tree llama-server
|
||||
(branch paged, HEAD `151343b`), `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
|
||||
`LLAMA_KV_PAGED=1`, same workload (512-token unique prompt, `max_tokens=256`, `temperature=0`,
|
||||
`ignore_eos`), same harness (`h2h_moe_sweep.sh` -> `h2h_cli.py`). vLLM numbers are unchanged
|
||||
(carried over from the committed dense table, not re-run).
|
||||
|
||||
### DENSE Qwen3.6-27B - budget sweep (decode agg tok/s | TTFT mean ms | peak GB)
|
||||
|
||||
| npl | metric | stock (no budget) | budget 256 | budget 512 | budget 1024 | vLLM |
|----:|--------|------------------:|-----------:|-----------:|------------:|-----:|
| 8 | decode agg | 63.8 | 63.5 | 63.8 | 63.5 | 64.3 |
| 8 | TTFT ms | 2029 | 4255 | 3756 | 2653 | 2593 |
| 32 | decode agg | 108.9 | 105.7 | 107.7 | 108.8 | 189.8 |
| 32 | TTFT ms | 13212 | 23114 | 18934 | 13912 | 7477 |
| 64 | decode agg | 126.2 | 132.0 | 131.2 | 118.2 | 284.2 |
| 64 | TTFT ms | 53818 | 109455 | 74272 | 92450 | 12942 |
| 128 | decode agg | 134.6 | **161.2** | 146.9 | 128.3 | 390.7 |
| 128 | TTFT ms | 491195 | **305423** | 543448 | 424058 | 24806 |
|
||||
|
||||
Peak host GB is budget-independent (on-demand paged KV grows with concurrency): ~51.5 (npl8) ->
|
||||
~61.5 (npl32) -> ~74.7 (npl64) -> ~93.5 (npl128) for every budget, vs vLLM's flat ~112.1.
|
||||
|
||||
### Best budget = 256 (only the saturated npl128 regime benefits)
|
||||
|
||||
At the fully-saturated point (npl128), **budget 256 is the clear winner on both axes**:
|
||||
|
||||
- **decode_agg: 134.6 -> 161.2 tok/s (+19.8%)** vs the starved stock run.
|
||||
- **TTFT mean: 491.2 s -> 305.4 s (-37.8%, -186 s)** vs stock.
|
||||
- llama decode as % of vLLM at npl128: **34.5% -> 41.3%**. TTFT still ~12x vLLM's 24.8 s.
|
||||
|
||||
Larger budgets help less at npl128 (512 -> 146.9 tok/s; 1024 -> 128.3, i.e. ~stock) because a
|
||||
looser cap lets a long prefill grab a bigger slice per step and re-introduce decode jitter. So
|
||||
the tightest cap (256) protects in-flight decode the most when the box is saturated.
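
The npl128 deltas quoted above can be reproduced from the sweep table. A minimal sketch of that arithmetic (the `pct_change` helper is a hypothetical name used only here):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Dense npl128 values copied from the budget-sweep table.
stock = {"decode_agg": 134.6, "ttft_ms": 491195}
budget_256 = {"decode_agg": 161.2, "ttft_ms": 305423}
vllm_decode_agg = 390.7

decode_gain = round(pct_change(stock["decode_agg"], budget_256["decode_agg"]), 1)
ttft_change = round(pct_change(stock["ttft_ms"], budget_256["ttft_ms"]), 1)
share_stock = round(100.0 * stock["decode_agg"] / vllm_decode_agg, 1)
share_budget = round(100.0 * budget_256["decode_agg"] / vllm_decode_agg, 1)

print(decode_gain, ttft_change, share_stock, share_budget)  # 19.8 -37.8 34.5 41.3
```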
|
||||
|
||||
### Honest caveat: this bursty workload is the worst case for TTFT
|
||||
|
||||
At npl 8 / 32 / 64 the budget **raised** TTFT (e.g. npl8 2029 -> 4255 ms at budget 256) and left
|
||||
decode_agg roughly flat. Reason: the harness fires all N requests simultaneously, so at t=0 there
|
||||
is **no in-flight decode to protect** - capping prefill purely defers first tokens. The budget
|
||||
only pays off once enough slots are decoding that an unbounded prefill would starve them, which on
|
||||
this box happens only at npl128. Budget 1024 stays closest to stock at light load (npl8 TTFT 2653 ms
vs stock 2029 ms) because a 512-token prompt fits in one <=1024 step. In a steadier (staggered) arrival
|
||||
pattern the budget would protect decode jitter without the burst-TTFT penalty; that regime is not
|
||||
exercised here.
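
A back-of-envelope sketch of why a tighter cap defers first tokens in an all-at-once burst. It assumes the cap applies to the total prompt tokens admitted per step across all slots, treats the stock run as capped at `-b 2048`, and ignores decode tokens entirely; none of this is measured, it only counts steps:

```python
import math


def steps_to_drain_burst(npl: int, prompt_tokens: int, cap: int) -> int:
    """Scheduler steps needed to prefill `npl` simultaneous prompts under a per-step cap."""
    return math.ceil(npl * prompt_tokens / cap)


# npl8 burst of 512-token prompts: stock (-b 2048) vs the three swept budgets.
for cap in (2048, 1024, 512, 256):
    print(cap, steps_to_drain_burst(8, 512, cap))
```

Under these assumptions the npl8 burst needs 2, 4, 8 and 16 steps respectively, which is the same ordering as the npl8 TTFT column (stock < 1024 < 512 < 256).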
|
||||
|
||||
### Bottom line (dense)
|
||||
|
||||
The prefill budget is a **real but narrow** lever on this workload: at maximum saturation
|
||||
(npl128) budget=256 lifts decode_agg ~20% and cuts TTFT ~38% vs the starved run, moving llama
|
||||
from 34.5% to 41.3% of vLLM decode. It does **not** close the gap - vLLM still decodes ~2.4x
|
||||
faster and keeps TTFT ~12x lower at npl128, and scales monotonically where llama plateaus. At
|
||||
light/moderate concurrency the budget is net-negative for TTFT in this all-at-once workload, so it
|
||||
should be applied selectively (high-concurrency serving), not as an unconditional default.
|
||||
|
||||
## MoE 35B-A3B fair re-run (max_prefill_tokens on)
|
||||
|
||||
Same build (HEAD 151343b, P0+P1 patch 0015), same flags (`-c 131072 --parallel 128 -b 2048
|
||||
-ub 512 -ngl 99 -fa on`, `LLAMA_KV_PAGED=1`), same all-at-once harness (512-tok unique prompt,
|
||||
gen 256, temp 0, ignore_eos). Swept the dense winner budget 256 plus neighbor 512.
|
||||
|
||||
### Primary table - budget 256 (decode_agg tok/s | TTFT mean ms | peak host GB)
|
||||
|
||||
| npl | stock (no budget) | budget 256 (best) | budget 512 | vLLM |
|----:|------------------:|------------------:|-----------:|-----:|
| 8 | 170.2 / 855 / - | 169.3 / 1655 / 38.95 | 172.1 / 1488 / 38.82 | 202.0 / 799 |
| 32 | 235.4 / 4970 / - | 239.0 / 9034 / 42.93 | 234.7 / 7260 / 42.72 | 462.0 / 2308 |
| 64 | 271.7 / 7205 / - | 277.0 / 16249 / 51.96 | 274.5 / 13660 / 52.53 | 624.5 / 4072 |
| 128 | 292.2 / 84800 / - | **333.5 / 98106 / 61.42** | 300.8 / 92470 / 61.45 | 811.1 / 7980 |
|
||||
|
||||
Peak host GB (paged KV, budget-independent): ~38.9 (npl8) -> ~42.8 (npl32) -> ~52 (npl64) ->
|
||||
~61.4 (npl128). Far below the dense run (94 GB @npl128) - only ~3B params are active, so the KV
|
||||
plus activations footprint stays light even fully saturated.
|
||||
|
||||
### MoE inverts the dense story: the budget buys decode, NOT TTFT
|
||||
|
||||
Unlike the dense 27B (where the stock run was prefill-starved to 491 s TTFT @npl128 and the budget
|
||||
cut it 38%), the MoE stock run was **never prefill-starved**: 3B active params make prefill cheap,
|
||||
so stock TTFT @npl128 was already only 84.8 s. Capping prefill therefore cannot rescue TTFT - it
|
||||
can only **defer first tokens to free decode steps**. Result at npl128 with budget 256:
|
||||
|
||||
- **decode_agg: 292.2 -> 333.5 tok/s (+14.1%)** vs the stock run (which, unlike dense, was not prefill-starved).
|
||||
- **TTFT mean: 84.8 s -> 98.1 s (+15.7%, WORSE)** - the budget costs latency here.
|
||||
- llama decode as % of vLLM @npl128: **36.0% -> 41.1%**. TTFT now ~12.3x vLLM's 7.98 s.
|
||||
|
||||
Budget 512 is the milder trade (decode +3.0% to 300.8, TTFT +9.0% to 92.5 s @npl128). Budget 256
|
||||
maximizes decode throughput; 512 if you want to bleed less TTFT. At npl 8/32/64 both budgets are
|
||||
net-negative or flat on decode and clearly raise TTFT (e.g. npl64 7.2 s -> 16.2 s @b256), the same
|
||||
all-at-once burst artifact seen in the dense run.
|
||||
|
||||
### Does the ~3B-active decode scale better now? Yes - the plateau is gone
|
||||
|
||||
The headline win is the **decode scaling curve**, not any single point:
|
||||
|
||||
| npl step | stock decode_agg | budget-256 decode_agg |
|---------:|-----------------:|----------------------:|
| 8 -> 32 | 170 -> 235 (+38%) | 169 -> 239 (+41%) |
| 32 -> 64 | 235 -> 272 (+16%) | 239 -> 277 (+16%) |
| 64 -> 128 | 272 -> 292 (**+7.4%**, plateauing) | 277 -> 333.5 (**+20.4%**, still climbing) |
|
||||
|
||||
Stock MoE decode **plateaus** at saturation (+7.4% over the last doubling) because unbounded
|
||||
prefills keep stealing steps from the many ready decode slots. Budget 256 removes that ceiling -
|
||||
decode keeps climbing +20% into npl128, so more of the 128 slots actually decode concurrently.
|
||||
This is the cleanest evidence that patch 0013 protects in-flight decode once enough slots are live.
|
||||
|
||||
### Bottom line (MoE)
|
||||
|
||||
For the A3B MoE the prefill budget is a **decode-throughput lever, paid for in TTFT** - the mirror
|
||||
image of the dense case. Budget 256 lifts decode_agg +14% @npl128 and, more importantly, restores
|
||||
monotonic decode scaling (kills the stock plateau), moving llama from 36.0% to 41.1% of vLLM
|
||||
decode - the same ~41% ceiling the dense run hit. It does **not** close the gap: vLLM still decodes
|
||||
~2.4x faster (811 vs 333.5) and holds TTFT ~12x lower (8.0 s vs 98.1 s) @npl128, and scales
|
||||
monotonically and steeply where llama only partially recovers. Net: apply the budget to saturated
|
||||
MoE serving when decode throughput is the objective and some extra TTFT is acceptable; for
|
||||
latency-sensitive MoE serving leave it off (stock TTFT was already not the bottleneck here).
|
||||
|
||||
---
|
||||
|
||||
## Fair re-run verdict
|
||||
|
||||
This is the synthesis after patch 0013 (`max_prefill_tokens` / `LLAMA_PREFILL_BUDGET`) was turned
|
||||
on for both models. It answers three questions: how much of the apparent gap was prefill
|
||||
starvation, what genuine gap to vLLM remains after that artifact is removed, and where that leaves
|
||||
the "par-or-beat vLLM" goal.
|
||||
|
||||
### 1. How much did patch 0013 close the gap?
|
||||
|
||||
The original (stock) tables blamed two things on llama: an exploding TTFT and a flat decode curve
|
||||
at high concurrency. The budget re-run shows these were **two different problems with two
|
||||
different root causes**, and only one was prefill starvation.
|
||||
|
||||
**Dense 27B - was genuinely prefill-starved.** Dense prefill is expensive (full 28B weights per
|
||||
token), so 128 simultaneous 512-token prefills truly starved both first-tokens and decode. Budget
|
||||
256 @npl128:
|
||||
|
||||
| metric @npl128 | stock | budget 256 | vLLM | what closed |
|----------------|------:|-----------:|-----:|-------------|
| TTFT mean | 491.2 s | **305.4 s** (-37.8%) | 24.8 s | starvation real; -186 s recovered |
| decode_agg | 134.6 | **161.2** (+19.8%) | 390.7 | freed slots now decode |
| llama as % of vLLM decode | 34.5% | **41.3%** | 100% | +6.8 pts |
|
||||
|
||||
Dense llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **99 / 56 / 46 / 41** (was 99/57/44/34).
The fix moved mainly the saturated tail (+7 pts at npl128, +2 pts at npl64); npl 8/32 were never
starved and are unchanged to within a point.
|
||||
|
||||
**MoE 35B-A3B - was NOT prefill-starved (the inversion).** Only ~3B active params, so prefill was
|
||||
already cheap and stock TTFT @npl128 was 84.8 s, not dense's 491 s. There was no starvation to
|
||||
rescue, so the budget could not cut TTFT - it instead converted deferred prefill into decode
|
||||
steps. Budget 256 @npl128:
|
||||
|
||||
| metric @npl128 | stock | budget 256 | vLLM | direction |
|----------------|------:|-----------:|-----:|-----------|
| TTFT mean | 84.8 s | 98.1 s (+15.7%, WORSE) | 7.98 s | budget costs latency here |
| decode_agg | 292.2 | **333.5** (+14.1%) | 811.1 | plateau removed |
| llama as % of vLLM decode | 36.0% | **41.1%** | 100% | +5.1 pts |
|
||||
|
||||
MoE llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **84 / 52 / 44 / 41** (was 84/51/44/36).
|
||||
The decisive MoE finding is the scaling curve, not the point: stock decode plateaued over the last
|
||||
doubling (64->128 = +7.4%); budget 256 restored monotonic scaling (+20.4%), proving the stock flat
|
||||
curve was unbounded prefill stealing steps from ready decode slots, not a kernel ceiling.
|
||||
|
||||
**Combined takeaway.** Both models converge to the **same ~41% of vLLM decode at npl128** after the
fix. That convergence is the signal: once prefill starvation is removed, dense and a
12x-cheaper-prefill MoE land on the identical ceiling, which means the remaining gap is **not** about
prefill at all - it is the decode scheduler.
|
||||
|
||||
### 2. The honest remaining gap to vLLM
|
||||
|
||||
After patch 0013, the residual gap is the **continuous-batched-decode efficiency** lever, and it is
|
||||
real, not an artifact:
|
||||
|
||||
- vLLM still decodes **~2.4x faster** at npl128 on both models (390.7 vs 161.2 dense; 811.1 vs
|
||||
333.5 MoE).
|
||||
- vLLM holds TTFT **~12x lower** at npl128 (24.8 vs 305.4 s dense; 8.0 vs 98.1 s MoE) - and does so
|
||||
while decoding faster, i.e. no latency/throughput trade.
|
||||
- **vLLM scales monotonically and steeply** (dense 64->391, MoE 202->811 across npl 8->128); llama,
|
||||
even with the budget, only **partially** recovers its scaling (dense 64->161, MoE 170->334).
|
||||
|
||||
The mechanism: vLLM's scheduler interleaves prefill and decode at token granularity (chunked
|
||||
prefill + paged continuous batching) every step, keeping the GPU saturated with a near-optimal mix.
|
||||
Patch 0013 is a coarser tool - a static per-step prefill **cap** - which protects in-flight decode
|
||||
but does not actively schedule the prefill/decode mix, and on the bursty all-at-once harness it
|
||||
defers first tokens (the TTFT penalty at npl 8/32/64, and the MoE TTFT regression @npl128). The gap
|
||||
that remains is the **quality of the step-by-step batching decision**, not raw kernel speed: at
|
||||
npl8 the kernels are at parity (dense 99%, MoE 84%), so the per-token math is competitive - what
|
||||
vLLM does better is keeping more sequences productively in-flight every step as concurrency rises.
|
||||
|
||||
### 3. Where this leaves "par-or-beat vLLM", and the last lever
|
||||
|
||||
**Where llama is competitive today (NVFP4, GB10):**
|
||||
|
||||
- **Low concurrency (npl<=8): at parity.** Dense 99%, MoE 84% of vLLM decode, comparable TTFT.
|
||||
For single-user / few-stream local serving - LocalAI's dominant mode - llama.cpp is already
|
||||
there on matched NVFP4.
|
||||
- **Memory efficiency: llama wins outright at every concurrency.** On-demand paged KV (dense
|
||||
52->94 GB, MoE 39->61 GB) vs vLLM's flat ~112 GB pre-reservation. On a 128 GB unified box this is
|
||||
the difference between multi-tenant headroom and OOM - a genuine product advantage, not a
|
||||
consolation.
|
||||
|
||||
**Where llama is not competitive:** high-concurrency decode throughput (npl>=32), where vLLM is
~1.8-2.4x ahead; at npl128 the budget only lifts llama to ~41% of vLLM decode.
|
||||
|
||||
**The last lever** is therefore *not* another prefill knob (0013 has extracted what a static cap
|
||||
can give) and *not* the kernel (at parity @npl8). It is **token-granular continuous-batch
|
||||
scheduling**: actively interleaving chunked prefill with decode every step rather than capping
|
||||
prefill, so all live slots decode while new prefills trickle in - exactly what closes vLLM's
|
||||
monotonic-scaling advantage. A staggered (non-burst) arrival pattern would also let 0013 protect
|
||||
decode jitter without the burst-TTFT penalty seen here, narrowing the practical gap for real
|
||||
serving traffic that does not arrive all-at-once.
|
||||
|
||||
### Bottom line
|
||||
|
||||
Patch 0013 is validated and worth shipping as a **selective, high-concurrency QoS lever**: it
|
||||
cuts dense TTFT by 38% at npl128 (MoE TTFT rises 16%) and lifts saturated decode +14-20%, converging both models to ~41% of
|
||||
vLLM. But it is honestly **not a gap-closer**. The "par-or-beat vLLM" goal is **met at low
|
||||
concurrency and on memory efficiency, and not met at high-concurrency decode throughput.** The
|
||||
remaining ~2.4x is a continuous-batched-decode scheduling gap, not a prefill-starvation or kernel
|
||||
gap - and that is the next (harder) lever, distinct from anything 0013 can touch.
|
||||
317
backend/cpp/llama-cpp/patches/paged/README.md
Normal file
@@ -0,0 +1,317 @@
|
||||
# LocalAI paged-attention llama.cpp patch series
|
||||
|
||||
This directory holds the vendored patch series that turns stock llama.cpp into
|
||||
LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The patches are
|
||||
applied on top of a pinned upstream llama.cpp at build time; nothing here is a
|
||||
fork - it is a source-only `*.patch` stack plus this single canonical doc.
|
||||
|
||||
> One-file rule: this README is the canonical reference for the patch series. The
|
||||
> only other docs kept in this directory are operational and linked below:
|
||||
> - [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
|
||||
> - [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
|
||||
> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.
|
||||
|
||||
---
|
||||
|
||||
## 1. What it is
|
||||
|
||||
`llama-cpp-localai-paged` is the LocalAI paged-attention llama.cpp backend: a
|
||||
vendored patch series over upstream llama.cpp that adds
|
||||
|
||||
- a **paged KV cache** (vLLM-style block manager: on-demand fixed-size blocks,
|
||||
free pool, ref-counted blocks) with a **block-table flash-attention** read so
|
||||
the attention kernels index physical cells instead of a contiguous buffer;
|
||||
- **cross-request prefix sharing** - concurrent requests that share a long
|
||||
prefix physically reuse one committed copy of the prefix blocks and prefill
|
||||
only their divergent suffix;
|
||||
- a **decode-first prefill scheduler** - a dynamic per-step prefill-token budget
|
||||
decoupled from `n_batch`, so a long prefill never freezes co-batched decode;
|
||||
- **GB10 / Blackwell NVFP4 decode optimizations** for the Qwen3.6 hybrid
|
||||
gated-DeltaNet (SSM) models, where the recurrent-state plumbing - not the FP4
|
||||
GEMM - dominates the decode step.
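
To make the "free pool, ref-counted blocks" idea in the first bullet concrete, here is a toy Python model. It is an illustration only: the real manager is the vendored C++17 code, and `TinyBlockPool` with its `alloc` / `share` / `release` methods are hypothetical names, not the patch's API.

```python
from collections import deque


class TinyBlockPool:
    """Toy ref-counted pool of fixed-size KV blocks (illustrative, not the real manager)."""

    def __init__(self, n_blocks: int) -> None:
        self.free = deque(range(n_blocks))  # free pool of physical block ids
        self.refs: dict[int, int] = {}      # block id -> number of sequences using it

    def alloc(self) -> int:
        """Pop a block on demand for a sequence that needs more KV space."""
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.popleft()
        self.refs[block] = 1
        return block

    def share(self, block: int) -> int:
        """A second sequence reuses an already-committed prefix block."""
        self.refs[block] += 1
        return block

    def release(self, block: int) -> None:
        """Drop one reference; the block returns to the free pool at zero."""
        self.refs[block] -= 1
        if self.refs[block] == 0:
            del self.refs[block]
            self.free.append(block)


pool = TinyBlockPool(n_blocks=4)
prefix = pool.alloc()   # sequence A commits a prefix block
pool.share(prefix)      # sequence B shares it instead of recomputing
pool.release(prefix)    # A finishes; B still holds the block
still_held = prefix in pool.refs
pool.release(prefix)    # B finishes; the block is reclaimed
print(still_held, prefix in pool.free)  # True True
```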
|
||||
|
||||
It is **pinned to llama.cpp `c299a92c`** ("binaries : Improve rpc-server and
|
||||
export-graph-ops names", #25045) and advanced only by a manual, bit-exact-gated
|
||||
[pin-sync process](PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
|
||||
(see section 7).
|
||||
|
||||
The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is
|
||||
enabled per-model at runtime via the gallery `options:` knobs (`paged_kv:true`,
|
||||
`max_batch_tokens:`, `kv_unified:false`, ...). Against unpatched llama.cpp the
|
||||
runtime hooks are inert, so a single `grpc-server.cpp` is shared between the
|
||||
clean and the paged build.
|
||||
|
||||
---
|
||||
|
||||
## 2. Architecture
|
||||
|
||||
The decode step on these models breaks into three cost centers; the patch series
|
||||
attacks each one.
|
||||
|
||||
**Paged KV manager + block-table flash-attn.** A host-side `PagedKVManager`
|
||||
(`FreeBlockQueue` / `BlockPool` / chained-hash content cache) hands out
|
||||
fixed-size KV blocks on demand and reclaims them per-sequence (ref-counted, with
|
||||
copy-on-write for shared prefixes). The attention path reads through a **block
|
||||
table** - an `I32 [n_view, n_stream]` position-ordered physical-cell index passed
|
||||
as `src[5]` of `ggml_flash_attn_ext` - so the CUDA fattn vec/tile kernels and the
|
||||
CPU reference map logical KV index `j` to physical cell `block_table[seq*ne11+j]`
|
||||
and read K/V in place. Token-position ordering keeps the flash-attn online-softmax
|
||||
reduction order identical to stock. A null block table is the stock contiguous
|
||||
read, byte-identical.
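
The block-table indirection can be sketched in a few lines of Python. This models only the index arithmetic `block_table[seq*ne11+j]` on a flat list; it assumes `ne11` is the per-stream row length of the table and is not the CUDA kernel code. The function names are hypothetical.

```python
def physical_cell(block_table: list[int], seq: int, j: int, ne11: int) -> int:
    """Map logical KV index j of stream `seq` to its physical cell."""
    return block_table[seq * ne11 + j]


def read_in_position_order(kv_cells, block_table, seq, n_tokens, ne11):
    """Read K/V for tokens 0..n_tokens-1 in token-position order, wherever they live."""
    return [kv_cells[physical_cell(block_table, seq, j, ne11)] for j in range(n_tokens)]


# Two streams, four logical slots each, scattered over eight physical cells.
ne11 = 4
block_table = [5, 2, 7, 0,   # stream 0
               1, 6, 3, 4]   # stream 1
kv_cells = [f"cell{i}" for i in range(8)]

print(read_in_position_order(kv_cells, block_table, seq=1, n_tokens=3, ne11=ne11))
# ['cell1', 'cell6', 'cell3']
```

Because the table is position-ordered, the reader visits tokens in the same order as a contiguous cache would, which is what keeps the reduction order unchanged.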
|
||||
|
||||
**The gated-DeltaNet (GDN / SSM) decode path.** The Qwen3.6 hybrid models are 48
|
||||
gated-DeltaNet (linear-attention / SSM) layers + 16 full-attention layers. On
|
||||
GB10 the recurrent-state plumbing, not the weight GEMM, is the dominant decode
|
||||
cost. The series fuses that plumbing to mirror vLLM's
|
||||
`fused_recurrent_gated_delta_rule`: the recurrent state is read from and written
|
||||
to its cache slot in place (no copy-back, no `get_rows` materialization), the
|
||||
conv state is updated in place, the output projection is reshaped to route to the
|
||||
tensor-core MMQ GEMM, and the recurrence kernel is occupancy-retuned - all
|
||||
bit-exact (md5-gateable) against the f32 baseline.
|
||||
|
||||
**NVFP4 native FP4-MMA on Blackwell.** The NVFP4 dense/expert weight GEMM uses
|
||||
Blackwell's native FP4-MMA. The series removes a redundant activation-requantize
|
||||
in the MoE broadcast projections (bit-exact byte copy of identical blocks) and
|
||||
keeps CUDA graphs on for the grouped-MMQ MoE decode step. These are the only
|
||||
NVFP4-specific optimizations; on non-Blackwell hardware the FP4 path falls back
|
||||
to dequant.
|
||||
|
||||
**The prefill/decode scheduler.** `update_slots()` already emits one unified
|
||||
mixed prefill+decode batch per step. The scheduler patches change only the *count*
|
||||
of prefill tokens admitted per step: decode tokens are claimed first
|
||||
(decode-first), then a dynamic budget `max(n_ubatch, T - D)` (where `D` is the
|
||||
live decode load and `T` is `LLAMA_MAX_BATCH_TOKENS`) admits prefill, auto-
|
||||
shrinking as decode load rises. Pure scheduler policy, byte-identical when off,
|
||||
orthogonal to the paged allocator.
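
The budget formula above is small enough to state as code. A minimal sketch, assuming `T` and `D` are plain token counts; the function and parameter names are illustrative, not the identifiers used inside `update_slots()`:

```python
def prefill_budget(n_ubatch: int, max_batch_tokens: int, decode_load: int) -> int:
    """Prefill tokens admitted this step: max(n_ubatch, T - D)."""
    return max(n_ubatch, max_batch_tokens - decode_load)


# With n_ubatch=512 and T=2048 the budget shrinks as decode load D rises,
# but never drops below one micro-batch.
for decode_load in (0, 128, 1800):
    print(decode_load, prefill_budget(512, 2048, decode_load))
```

The `n_ubatch` floor means prefill always makes some progress per step, even when decode alone would fill the token budget.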
|
||||
|
||||
---
|
||||
|
||||
## 3. Patch series (0001-0030)
|
||||
|
||||
28 patches (0005 and 0027 are intentionally unused). "Bit-exact" = greedy md5 /
|
||||
`test-backend-ops` byte-identical to the relevant baseline; the gate methodology
|
||||
is in section 5.
|
||||
|
||||
### Paged-KV core (0001-0012)
|
||||
|
||||
| # | What it does | Bit-exact |
|---|---|---|
| 0001 | Vendor the host-side paged KV block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache). Pure C++17, nothing uses it yet. | n/a (no behavior) |
| 0002 | Place each sequence at permuted, non-contiguous block positions in `find_slot` (proves attention is invariant to physical KV placement). | yes (token-identical) |
| 0003 | Gather K/V/mask down to each stream's non-empty cells before `build_attn_mha`, position-sorted so the FA reduction order matches stock. | yes |
| 0004 | Drive paged placement through the vendored manager: blocks popped on demand, returned on seq end. Core kv-cache struct untouched. | yes (stock path byte-identical) |
| 0006 | Host-side cross-request prefix caching: hash prefix blocks, reuse matching physical blocks (ref-count++), COW-privatise before a divergent write. | yes (default off) |
| 0007 | Wire the prefix cache into the engine so a new sequence physically shares cached prefix blocks and skips recomputing the shared prefix. | yes (verified byte-identical) |
| 0008 | Wire cross-request prefix share into the llama-server continuous-batch loop so concurrent shared-prefix requests prefill only the suffix (36x fewer prefill tokens at K=32). | within CUDA batch-shape non-determinism band |
| 0009 | Replace the per-step gather with an **in-kernel paged read** (block table as `src[5]`); the K/V `get_rows` is gone. Decode step at batch32 691->696ms (was 1279ms gathered). | yes on CPU/batch1; GPU batch>1 within vec-vs-mma band |
| 0010 | Graft the block-table read into the tile kernel; add a dispatch guard so a present block table routes ONLY to vec/tile (never the mma/wmma kernels that ignore it). | yes (CPU byte-identical; vec route) |
| 0011 | Route the GQA-grouped F16 decode to the **tile kernel** (native head-group reuse) by default; vec for everything else. Paged decode to within 1.8% of stock. | vs stock-mma: different-kernel rounding; bit-exact vs vec |
| 0012 | Defensive `GGML_ASSERT(n_view % 64 == 0)` so a future pad/tile change can't silently reintroduce a past-end KV leak on the tile route. | yes (additive assert) |
|
||||
|
||||
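
The chained-hash prefix reuse described for 0001/0006 can be sketched on the host as follows. This is a minimal illustration, not the vendored manager: `chain_hash`, `PrefixCacheSketch` and `acquire` are hypothetical names, FNV-1a is a stand-in hash, hash collisions are ignored, and the COW-privatise step and block release are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// FNV-1a over the parent hash followed by the block's token ids.
// Chaining the parent hash makes a block's key depend on the whole prefix.
inline uint64_t chain_hash(uint64_t parent, const int32_t * tok, size_t n) {
    uint64_t h = 1469598103934665603ull;            // FNV offset basis
    auto mix = [&h](uint64_t v) {
        for (int i = 0; i < 8; ++i) {
            h ^= (v >> (8 * i)) & 0xffu;
            h *= 1099511628211ull;                  // FNV prime
        }
    };
    mix(parent);
    for (size_t i = 0; i < n; ++i) mix(static_cast<uint32_t>(tok[i]));
    return h;
}

struct PrefixCacheSketch {
    struct Entry { int block; int refs; };
    std::unordered_map<uint64_t, Entry> by_hash;    // chained hash -> physical block
    int next_block = 0;

    // Map a token sequence to physical blocks, sharing any cached full-block prefix.
    // Only full blocks are hashed; a partial tail block is not cached here.
    std::vector<int> acquire(const std::vector<int32_t> & toks, size_t block_size, size_t * n_reused) {
        assert(block_size > 0 && n_reused != nullptr);
        std::vector<int> blocks;
        uint64_t parent = 0;
        *n_reused = 0;
        for (size_t off = 0; off + block_size <= toks.size(); off += block_size) {
            parent = chain_hash(parent, toks.data() + off, block_size);
            auto it = by_hash.find(parent);
            if (it != by_hash.end()) {
                it->second.refs++;                  // share the existing block
                (*n_reused)++;
            } else {
                it = by_hash.emplace(parent, Entry{next_block++, 1}).first;
            }
            blocks.push_back(it->second.block);
        }
        return blocks;
    }
};
```

Two requests that share their first block and then diverge end up with the same first physical block and different later ones, which is the behavior the 0006/0007 rows rely on.
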
### Decode-first scheduler (0013, 0016)

| # | What it does | Bit-exact |
|---|---|---|
| 0013 | `LLAMA_PREFILL_BUDGET`: a static per-step prefill-token budget decoupled from `n_batch` (vLLM `--max-num-batched-tokens` analogue). Flattens the decode ITL spike a long prefill inflicts (8.5x smaller worst freeze). | yes (off/short = byte-identical; == `-b` chunking) |
| 0016 | Supersede 0013 with a **dynamic decode-first** budget: `max(n_ubatch, T-D)`, auto-shrinking as decode load `D` rises. Policy-only inside `update_slots()`, zero libllama changes. | yes (default-off byte-identical) |

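
The dynamic budget in the 0016 row reduces to a one-line clamp. A minimal sketch, assuming `T` is the total per-step token budget and `D` the decode tokens already scheduled for the step; the function name is hypothetical and this is not the patch's code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: prefill tokens allowed in one scheduler step.
//   n_ubatch : micro-batch size, which acts as the floor of the budget
//   T        : total per-step token budget (assumed meaning of "T")
//   D        : decode tokens already scheduled this step (assumed meaning of "D")
inline int32_t decode_first_prefill_budget(int32_t n_ubatch, int32_t T, int32_t D) {
    assert(n_ubatch > 0 && T >= 0 && D >= 0);
    return std::max(n_ubatch, T - D);   // shrinks as D rises, never below n_ubatch
}
```
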
Patches 0014, 0015 and 0017 sit outside the tables. 0014/0015 are the MoE
token-tile levers: 0014 adds `LLAMA_MOE_MMQ_X` (opt-in high-batch decode
micro-opt, +4.8% on Qwen3-Coder-30B), and 0015 makes it a default-on,
density-aware auto-select that is prefill-safe by construction. Both are
bit-exact. 0017 is the dense FP4-GEMM occupancy-tune track: the bit-exact gate is
green, but every cheap occupancy lever regressed on GB10, so nothing is enabled -
it ships as the parity gate + default-off instrumentation only.

### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)

These are the dominant decode levers on the Qwen3.6 hybrid models. All are bit-exact.

| # | What it does | Effect (dense q36-27b / MoE q36-35b-a3b @npl128) |
|---|---|---|
| 0018 | **In-place SSM state write-back** - the recurrence writes its final state directly into the cache slot, removing the ~225MB/copy D2D memcpy (18.9% of decode time). | dense +23.5% / MoE +18.9% |
| 0019 | **Fused recurrent-state gather** - the op reads each sequence's prior state directly from `cache[ids[seq]]` (no `get_rows` materialization); race-free in-place + ids read. | dense +37.8% / MoE +35.3% |
| 0020 | **o_proj MMVQ->MMQ reshape** - collapse the GDN output to 2D so the output projection routes to the M=128 tensor-core MMQ GEMM (was a batch<=8 MMVQ GEMV). The single biggest decode-parity lever. | dense +31.7% (->85.9% of vLLM) / MoE +23.3% |
| 0021 | **Conv-state in-place fusion** - one `ggml_ssm_conv_update_inplace` op replaces the 4-op conv chain (transpose+concat+conv+silu+ring-cpy), writing the shifted ring state in place. | dense +3.2% / MoE +3.5% |
| 0022 | **GDN recurrence occupancy/coalescing retune** - column-folding (NUM_WARPS/COLS_PER_WARP) raises memory-level parallelism on the bandwidth-bound B=128 recurrence kernel; per-column f32 FMA order unchanged. 73.4%->84.6% of GB10 peak BW. | dense +11.1% / MoE +8.3% |
| 0028 | **Recurrent conv-tap gather fusion** - the last `k_get_rows` in the GDN decode path (the conv-state tap gather) becomes an indexed in-kernel read. | dense ~377 t/s / MoE ~784 t/s |

### MoE NVFP4 quant (0023, 0025)

| # | What it does | Bit-exact |
|---|---|---|
| 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) |
| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) |

### Pool reclaim, block-table cache, backend gate, opt-in bf16-SSM

| # | What it does | Bit-exact |
|---|---|---|
| 0024 | **Paged-pool burst-reclaim** - truncate trailing blocks on partial-tail `seq_rm`, defrag the free queue when idle, release blocks on slot completion. Fixes the long-server burst-degradation bug (post-burst prefill collapse 488->44 t/s, restored to 532). Host-side accounting only. | yes |
| 0029 | **Block-table within-step host cache** - the block table is fixed for the whole step; cache it on first build and memcpy it for the other full-attention layers (get_block_table -87%/-91%). | yes, per path (paged-MoE ref `8cb0ce23`) |
| 0030 | **Fused-op backend gate** - the fused GDN / discriminated SSM_CONV ops are CUDA-family + CPU only; force them off on any non-CUDA compute backend so a Vulkan/SYCL/Metal build can't silently run the wrong plain-conv kernel. | yes on CUDA (byte-identical pre-0030); safety gate elsewhere |
| 0026 | **Hybrid per-head bf16 SSM state (opt-in)** - `--ssm-bf16-tau` / option `ssm_bf16_tau`: fast-decaying GDN heads (memory length below the tau threshold) persist state as bf16, halving that head's decode byte stream (~+12% decode). | default tau=0 = f32 = **bit-exact**; the bf16 mode is **NOT** bit-exact (~91% same-top-p) |

---

## 4. Benchmarks

Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
**Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](qwen36_dense_decode_vs_npl.png),
[`qwen36_moe_decode_vs_npl.png`](qwen36_moe_decode_vs_npl.png); raw data
[`final_benchmark.csv`](final_benchmark.csv).

### (a) + (b) Patched vs stock vs vLLM

The **stock** and **patched** columns are the same binary, env-toggled, on the
**same harness** (`llama-batched-bench`) - so "x over stock" is an exact
apples-to-apples measure of the patch series' contribution. The **vLLM** column
is a **different harness** (vLLM server + client continuous batching), so the
cross-engine "% of vLLM" is **indicative, not apples-to-apples**.

**Dense Qwen3.6-27B-NVFP4** (t/s):

| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
|----:|------:|--------:|-----:|------------------:|---------------------:|
| 8 | 65.7 | 84.0 | 71.1 | 118% | 1.28x |
| 32 | 113.7 | 204.0 | 207.6 | 98% | 1.79x |
| 64 | 134.3 | 294.9 | 309.7 | 95% | 2.20x |
| 128 | 143.5 | 371.2 | 422.4 | 88% | 2.59x |

**MoE Qwen3.6-35B-A3B-NVFP4** (t/s):

| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
|----:|------:|--------:|------:|-----------------:|---------------------:|
| 8 | 181.4 | 227.4 | 315.1 | 72% | 1.25x |
| 32 | 260.8 | 455.7 | 681.9 | 67% | 1.75x |
| 64 | 306.8 | 612.3 | 765.5 | 80% | 2.00x |
| 128 | 331.3 | 772.6 | 1011.7 | 76% | 2.33x |

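
The two derived columns follow directly from the three measured ones. A small sketch of that arithmetic, with hypothetical helper names and the rounding the tables appear to use (whole percent, two decimals):

```cpp
#include <cassert>
#include <cmath>

// "patched % of vLLM", rounded to the nearest whole percent.
inline long pct_of_vllm(double patched_tps, double vllm_tps) {
    assert(vllm_tps > 0.0);
    return std::lround(100.0 * patched_tps / vllm_tps);
}

// "patched x over stock", rounded to two decimals.
inline double x_over_stock(double patched_tps, double stock_tps) {
    assert(stock_tps > 0.0);
    return std::round(100.0 * patched_tps / stock_tps) / 100.0;
}
```

For the npl 128 rows, `pct_of_vllm(371.2, 422.4)` and `x_over_stock(371.2, 143.5)` reproduce the dense entries, and `pct_of_vllm(772.6, 1011.7)` and `x_over_stock(772.6, 331.3)` the MoE ones.
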
**Caveat on the vLLM column.** Besides the different harness, the vLLM MoE
@npl128 number here (1011.7 at 128/128) runs *hotter* than the 901 t/s reference
config (512/256), so the MoE "% of vLLM" reads **76% here vs ~86% at the
reference config**. Memory: llama uses **1.5-3x less** memory than vLLM.

**Takeaway.** The patch series gives up to **2.59x (dense) / 2.33x (MoE)** over
stock on the same harness. Dense is **ahead of vLLM at npl 8 and within 2-12% of
it from npl 32 to 128**; MoE trails at every width (67-80%) - the remaining gap
is structural (see section 5).

### (c) Apple M4 (16GB) - for curiosity only

No t/s figure: the 24GB NVFP4 GGUF did not finish downloading and would not fit
in 16GB RAM anyway (it would page to SSD). Architectural findings:

- Metal `supports_op` **excludes NVFP4** from `MUL_MAT` / `MUL_MAT_ID` /
  `GET_ROWS`, so the FP4 matmuls **fall back to CPU** - there is no Apple
  FP4-MMA.
- `GATED_DELTA_NET` and `SSM_CONV` / `SSM_SCAN` **do** have Metal kernels.

Verdict: NVFP4 Qwen3.6 needs **Blackwell FP4-MMA + >24GB RAM**; a 16GB M4 is not
a fit. A Metal-native Q4_K Qwen3.6 would be a different artifact.

---

## 5. Dev notes - what we learned

**Bit-exact methodology.** Every bit-exact patch is gated two ways: (1) a greedy
md5 gate - `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France
is" -n 48 --temp 0 --seed 1 | md5sum`, paged paths prefixed with
`LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged MoE), on the default
chat-template path; and (2) `test-backend-ops` (CUDA0 vs CPU oracle) for every
touched op (`SSM_CONV*`, `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).

**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md)).
Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5
(`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this
is a benign FP-accumulation-order difference of the paged attention reduction,
**KL-validated** against the f16 reference: KLD(paged||f16) 0.13600 <=
KLD(nonpaged||f16) 0.13660, PPL within +/-0.29, ~zero probability bias - two
equivalent FP-reorderings of the same quantized model, not a regression. Future
paged-MoE regressions therefore compare to `8cb0ce23`, not `07db32c2`.

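
The KL validation above compares two per-token divergences against the same f16 reference. A minimal sketch of that comparison; the function names are hypothetical, the divergence is in nats, and which argument plays the reference role is an assumption (the real gate is produced by the project's own tooling, not by this code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// KL(p || q) for two probability vectors over the same vocabulary.
// Terms with p[i] == 0 contribute zero by convention.
inline double kl_divergence(const std::vector<double> & p, const std::vector<double> & q) {
    assert(p.size() == q.size());
    double kl = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        if (p[i] > 0.0) {
            assert(q[i] > 0.0);
            kl += p[i] * std::log(p[i] / q[i]);
        }
    }
    return kl;
}

// The acceptance rule quoted above: the paged path must be no further from
// the f16 reference than the non-paged path is.
inline bool paged_passes_kl_gate(double kld_paged, double kld_nonpaged) {
    return kld_paged <= kld_nonpaged;
}
```

With the figures quoted above, `paged_passes_kl_gate(0.13600, 0.13660)` holds.
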
**MoE-parity conclusion** (the residual gap is structural). The two heaviest MoE
decode kernels - the GDN-SSM recurrence and the NVFP4-expert GEMM - are llama
**wins** after this series (the recurrence runs at 102.6% of vLLM's bandwidth;
the GEMM ties vLLM at the LPDDR5x BW floor). The residual gap is **bf16-projection
bandwidth + the host scheduling loop**, both at the LPDDR5x floor - not a kernel
llama is losing. The MoE GEMM kernel is *not* where the gap lives.

**Rejected / flat levers** (recorded so they are not re-tried):

- **Lever 2 - graph/stream coverage: FLAT.** Bit-exact graph coverage was
  exhausted by 0025; more graph/stream overlap is a no-op or small regression on
  this model.
- **Lever 3 - act-quant fusion: FLAT.** The W4A4 act-quant tax is removable only
  by W4A16 (a precision change, rejected) or a structural kernel rewrite; no
  further bit-exact lever clears it. 0023 already banks the de-dup.
- **Lever 4 - NVFP4 the bf16 GDN/attn projections: REJECTED (KL-gate fail).**
  Quantizing the projections to NVFP4 costs ~+6% PPL; vLLM deliberately keeps the
  same bf16 projections. No-ship.
- **W4A16-Marlin MoE GEMM: REJECTED.** It would be a precision upgrade nobody
  needs bought with a ~5% slower kernel; both kernels are already at the BW floor.
  (The "the win was NVFP4-dense-quant, not the Marlin kernel" dense verdict
  carries over to MoE.)

**Opt-in bf16-SSM fast mode** (patch 0026, `ssm_bf16_tau`). The design premise -
that bf16 KL error concentrates in long-memory heads and can be removed by
keeping them f32 - is **empirically refuted**: the error scales with the bf16
head *count* and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any useful
byte saving, and the carry is byte-exact (genuine bf16 rounding, not a bug). The
byte-saving (and ~+12% decode) is real but cannot meet a strict KL bar, so it
ships **default-off (f32, bit-exact)** and opt-in only. Do not put a hybrid tau
in a recommended/gallery config.

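
Why persisting state as bf16 cannot stay bit-exact is visible in the rounding itself. A minimal host-side sketch of an f32 -> bf16 -> f32 round trip with round-to-nearest-even; the helper name is hypothetical and this is not the patch's kernel code. Values with mantissa bits below bf16's 8 significant bits (for example `0.1f`) come back changed, while values such as `1.0f` survive exactly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Round an f32 to bf16 precision (round-to-nearest-even) and widen it back.
// NaN inputs are excluded; this is an illustration, not a full converter.
inline float bf16_round_trip(float x) {
    assert(x == x);                             // precondition: not NaN
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const uint32_t lsb = (bits >> 16) & 1u;     // lowest mantissa bit that survives
    bits += 0x7fffu + lsb;                      // round half to even
    bits &= 0xffff0000u;                        // drop the low 16 bits
    float y;
    std::memcpy(&y, &bits, sizeof y);
    return y;
}
```
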
---

## 6. Architecture and quant generality

(From the arch-generality and quant-generality audits.)

- **15 of 16 optimizations are quant-AGNOSTIC.** Only **0023** (NVFP4
  activation-quantize de-dup) is NVFP4-specific. The SSM/paged/MMQ optimizations
  help **any quant** of these models (the GDN recurrence, conv, gather and
  o_proj-MMQ levers operate on the f32 recurrent state and the routing layout,
  not on the weight dtype).
- **Arch-safe to build everywhere.** NVFP4 use is Blackwell-gated and falls back
  to dequant on other hardware; the GB10-tuned occupancy params (0022) are
  perf-only and env-selectable (`GDN_NW` / `GDN_CPW`), so they never change
  correctness on other GPUs. Patch 0030 makes the fused-op emission CUDA-family +
  CPU only, so a non-CUDA paged build routes to the safe upstream non-fused path.

---

## 7. Pin + maintenance policy

- **Pinned to llama.cpp `c299a92c`.** The pin is advanced **only** by the manual
  [`PIN_SYNC`](PIN_SYNC_c299a92c.md) process: rebase the source-only patch series
  onto the new tip, rebuild on GPU, and pass the bit-exact gate on every path
  (dense + MoE, paged + non-paged) plus `test-backend-ops`. The `9d5d882d ->
  c299a92c` jump (23 upstream commits) needed zero patch changes and did not
  change decode output.
- **Decoupled from the nightly auto-bumper.** There is deliberately **no**
  `bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could
  silently shift the tree out from under the patches.
- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../../../.github/workflows/llama-cpp-paged-canary.yml)
  (via [`.github/scripts/paged-canary-apply.sh`](../../../../../.github/scripts/paged-canary-apply.sh))
  tries the patch series against the latest upstream tip with the build's own
  strict `git apply`. **Red = upstream drifted past the series -> run a
  PIN_SYNC** (do not bump the pin blindly). The canary references
  [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md).

---

## 8. Models

The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery:

| Gallery entry | Weights (HuggingFace) | Notes |
|---|---|---|
| `qwen3.6-27b-nvfp4-paged` | [`mudler/Qwen3.6-27B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF) | Dense, native Blackwell NVFP4 (FP4-MMA). |
| `qwen3.6-35b-a3b-nvfp4-paged` | [`mudler/Qwen3.6-35B-A3B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF) | MoE (256 experts, top-8), `file_type MOSTLY_NVFP4`. |

Both gallery entries set `backend: llama-cpp-localai-paged` and the paged serving config
(`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`,
`flash_attention:on`, `context_size`). They intentionally stay bit-exact (no
`ssm_bf16_tau`). The full backend-split + gallery plan is in
[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md).
@@ -1,400 +0,0 @@
# RMSNORM_FP4_FOLD.md - ceiling-critic verdict (label ceiling-critic, READ-ONLY, no GPU)

Completeness audit of the post-0022/0023 bit-exact decode surface: is the rms_norm -> fp4
producer-fold the BEST remaining bit-exact decode lever, or is something better being missed?
Source: all paged/*.md verdicts + the 0019/0021/0023 patch diffs (local, read-only). No GPU touched.

## Starting line (post-0023)

- Dense q36-27b-nvfp4: 373.2 t/s @ npl128 = 95.4% of vLLM 391. Dense is UNTOUCHED by 0023.
- MoE q36-35b-a3b: 758 t/s @ npl128 (0023 +1.73%).
- Decode = ONE replayed CUDA graph, single stream, 99.94% GPU-busy, 0.06% idle. Removed/folded
  kernel GPU-time cuts wall 1:1, and DISJOINT folds STACK 1:1 (each removes a distinct kernel).
- gated_delta_net recurrence = ~50% of the step, at 84.6% peak BW (past vLLM's 82.4%), PLATEAUED.

## TIER 0 - confirmed NO bit-exact lever (dead, do not pursue)

(a) GDN recurrence past 84.6% - NO. The 0022 sweep is MONOTONIC toward grid.z=1: 8x4 (grid.z=4,
32 cols/block) = 79.9%, 16x4/8x8 (grid.z=2) = 82.3%, 16x8/32x4 (grid.z=1, all 128 cols in one
block = max in-flight independent state-loads per warp) = 84.6%. grid.z>1 is the WRONG direction
(fewer cols/block = less memory-level parallelism = lower BW), already measured worse. The only
thing past 84.6% is the float4/vectorized load or a different row-partition, BOTH of which
repartition which rows a lane sums into the warp-butterfly = a different reduction grouping =
breaks md5 (the exact f32x4 trap that was explicitly avoided). 84.6% (230.9 of 273 GB/s) is at
the practical LPDDR5x DRAM ceiling AND past vLLM. No bit-exact decomposition exists. FLOOR.

(b) flash_attn_ext_f16 (3.1%) - NO. 48 CTAs = exactly one full wave, no occupancy headroom, no tail.
Every grid knob (split-KV / parallel_blocks / ncols / cols_per_block / KV-retile) changes the
online-softmax running-max/sum RESCALE ORDER across KV blocks = forbidden. FLOOR.

(c) lm_head (nvjet/cublas, 3.1%) - NO. cublas-internal; any algo/kernel swap changes the K-accum
order vs the current f32 reference = breaks md5. Already tuned. No knob. NO lever.

(d) mul_mat_q FP4 GEMM (~24-27%, the biggest bucket) - NO decode lever. P2a (mmq_y=64 / minblocks=2)
is bit-exact (1115/805, md5-identical) but MEASURED FLAT on decode (decode mmq -1.1%, stream_k
fixup +1.7ms = net worse). The -24.7% is a PREFILL large-N asymptotic number; the m=128 decode
GEMM is LPDDR5x-bandwidth-bound and mmq_y is deliberately bandwidth-neutral. FLOOR.

=> Of the four largest buckets (recurrence 50% + GEMM 25% + lm_head 3% + attn 3% = ~81% of the
step), NONE has any bit-exact lever left. All remaining headroom lives in the ~12% of small,
foldable glue/quantize/gather buckets below.

## TIER 1 - the bit-exact-feasible folds, RANKED by ROI (gain / plumbing+risk)

Confirmed bit-exact-foldable buckets from the post-0021/0022 node trace:

- quantize_mmq_nvfp4: 4.5% (dense-foldable ~2.7% ceiling; fold captures ~2-2.5%)
- k_get_rows_float: 1.9-2.1% (STILL LIVE post-0021; pure gather)
- pointwise glue: ~3.1% (k_bin_bcast 1.7% + silu/sigmoid output-gate 1.4%; ~1.5-2.5% net)

Rank 1 - POINTWISE ACTIVATION FOLD (~1.5-2.5%, MEDIUM plumbing, NO new ABI). Best ROI/risk of the
three. Fold k_bin_bcast residual-adds + gate-muls and the silu/sigmoid output gate into adjacent
kernel epilogues/prologues. Pure elementwise f32, same formula+order standalone or folded =
byte-identical. STRICT EXCLUSION: do NOT re-fold the rms_norm/l2_norm REDUCTIONS (reduction-tree /
eps-placement trap). No frozen ABI, no GEMM surgery. Well-scoped already (NONRECURRENCE Lever #2).

Rank 2 - rms_norm -> fp4 PRODUCER-FOLD (the proposed lever) (~2-2.5% realistic dense, HIGHEST
plumbing). LARGEST single clean dense bucket and HIGHEST-confidence ROI (skip-B measured dense
+2.7% for the whole quantize; the fold removes the f32 round-trip, keeps the quant compute, so
~2-2.5%). BIT-EXACT VERDICT: SOUND, and NOT the f32x4-trap class. The trap changed a REDUCTION
grouping; this fold touches only (i) the sumsq block-reduce, kept BYTE-IDENTICAL, and (ii) the
writeback, where the post-norm normalize-MUL is pointwise (order-independent, identical out_i for
any thread partition) and the NVFP4 quant is per-16-consecutive PER-THREAD with NO cross-thread
shfl (verified in quantize.cu; 0023 already shipped on exactly this property and held the byte
gate). Re-partitioning the writeback to 16-consecutive-per-thread therefore changes only WHO
writes/quantizes each element, not the VALUES or the reduction. md5-safe. BUT it carries the worst
plumbing-to-ROI ratio: 3-op {RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion + a mul_mat_q
prequantized-src1 path + the frozen block_fp4_mmq ABI + a per-call scratch pool. This is the
LAST-MILE lever, not the first.

Rank 3 - GET_ROWS / STATE-GATHER FOLD (~up to 2%, LOW-MEDIUM plumbing, ZERO reduction risk -
but UNDER-SCOPED). k_get_rows_float is STILL 7.29-7.32 ms = ~2.1% of the step post-0021/0022; the
0021 author KEPT the build_rs conv-tap + recurrent-state gathers, explicitly deferring them
("tiny; not one of the eliminated buckets"), NOT proving them unfoldable. A gather is a pure copy
with NO reduction = the SAFEST possible bit-exact fold (the exact property the 0023 dedup
exploited). Folding the residual build_rs gathers into their consuming kernel (read from cache via
ids/block-table instead of a pre-gathered f32 scratch, mirroring 0019's gather-free recurrence) is
bit-exact by construction. Ranked 3 only because the FOLDABLE FRACTION needs a one-pass source
scoping (some of the 2% may be the "tiny" conv-tap part already); the ROI is lower-confidence than
Rank 1/2, but the RISK is the lowest of all. THIS IS THE "SOMETHING BEING MISSED": it is a live
~2% bit-exact bucket that the current plan does not address.

## IS THE fp4 FOLD THE RIGHT NEXT BUILD?

DEFENSIBLE, but NOT unambiguously the best by ROI. It is the largest single well-understood
bit-exact dense bucket and the verdict is sound (no trap). HOWEVER its plumbing is the highest of
the three folds, and the POINTWISE fold matches its realistic gain (~1.5-2.5%) at MEDIUM plumbing
with no new ABI, while the GET_ROWS fold offers ~2% at the lowest risk (pure copy). The fp4 fold has
the worst gain/plumbing ratio of the candidates.

Recommended build order (all bit-exact, all stack 1:1 on the serial single stream):

1. POINTWISE activation fold first (cheapest, no ABI, ~1.5-2.5%).
2. GET_ROWS gather fold second, after a short source-scoping pass (~up to 2%, lowest risk).
3. rms_norm -> fp4 producer-fold LAST (the high-plumbing last mile, ~2-2.5% dense), built only if
   the remaining gap to the chosen target still justifies the ABI/graph-fusion surgery.

If the workflow insists on a SINGLE decisive lever and accepts the plumbing, the fp4 fold is the
biggest one and a legitimate choice - but it should be sequenced after the cheap folds, not before.

## HONEST BIT-EXACT CEILING

The three folds remove DISJOINT kernels on a 99.94%-busy serial stream, so they STACK:
~2-2.5% (fp4) + ~1.5-2.5% (pointwise) + ~2% (get_rows) = ~5.5-7% gross on dense.
373 t/s + ~6% = ~393-399 t/s = ~100-102% of vLLM 391.

=> The bit-exact dense ceiling is vLLM PARITY-to-slightly-ahead (~100%), NOT 95%. Declaring the
ceiling at ~95% would leave ~4-5% of identified, bit-exact-FEASIBLE fold headroom unbuilt.

Realistic SHIPPABLE ceiling (fold inefficiency + the realistic-vs-ceiling haircut + some buckets
resisting clean folding): ~98-100% of vLLM dense. The recurrence (50%) is already past vLLM and
at the DRAM floor; attention/lm_head/mul_mat_q have no bit-exact lever; everything left is the
~6% of small folds above. There is no fourth large bit-exact lever hiding anywhere.

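
The stacking estimate above is plain additive arithmetic. A small sketch, with a hypothetical helper name, that reproduces the ~393-399 t/s range from the quoted per-fold gains under the note's own assumption that disjoint kernel removals add up on a fully busy serial stream:

```cpp
#include <cassert>

// Throughput after stacking disjoint folds, using the note's additive
// approximation: fractional gains are summed, then applied to the baseline.
inline double stacked_tps(double base_tps, const double * gains, int n) {
    assert(base_tps > 0.0 && n >= 0);
    double total = 0.0;
    for (int i = 0; i < n; ++i) total += gains[i];
    return base_tps * (1.0 + total);
}
```

With gains `{0.02, 0.015, 0.02}` (low end) and `{0.025, 0.025, 0.02}` (high end) on a 373 t/s baseline, the result brackets the vLLM 391 t/s reference from just above parity.
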
Caveat that frames the whole result: vLLM 391 is a LOWER-precision reference (w4a4/w4a16 acts vs
llama's q8_1; the recurrence is algebraically reassociated). Bit-exact-vs-vLLM is IMPOSSIBLE; the
only meaningful cross-engine bar is throughput + top-1/KL, and llama at 373 (95%) bit-exact f32 is
already doing strictly MORE precise arithmetic at near-equal throughput. Closing the last ~5% with
the folds reaches throughput parity at higher precision - a strong result, but each fold is a
diminishing 1.5-2.5% at rising plumbing. The bf16-state over-clock (shelved) is the only thing that
goes materially AHEAD, and it is non-bit-exact (KL-gated), out of scope for this gate.

Assisted-by: Claude:opus-4.8 [Claude Code]

====================================================================================================

# RMS_NORM -> NVFP4 PRODUCER-FOLD - PRECISE IMPLEMENTATION DESIGN (label fold-design, READ-ONLY, no GPU)

Design-only, no GPU. Reads: DGX `~/llama-paged-dev` HEAD f7409c2 (patch 0023) + `git stash@{0}`
(trackA1-prequant-nvfp4-fused-rmsnorm) + norm.cu/quantize.cu/mmq.cu/mmq.cuh/ggml-cuda.cu/qwen35.cpp.

## 0. One-line verdict

The fold is bit-exact-FEASIBLE, BUT the Lever-2 stash that exists as the starting point is
(a) almost certainly bit-INEXACT and (b) was measured FLAT. The single mandatory fix is the
reduction block_size dispatch; the single thing that makes it not-flat is de-dup-across-siblings
+ skipping the dead f32 write at the FFN boundary. Build the FFN boundary first, gate on a measured
per-call producer-vs-removed-quantize win before extending. Honest expectation: ~1.5-2.5% dense
best case, real risk of flat (Lever-2 precedent). Lower-risk alternative in Section 7.

## 1. Which graph nodes fuse

Both boundaries already collapse rms_norm+gain into ONE `rms_norm_f32<bs, do_multiply=true>` kernel
(existing fuse, ggml-cuda.cu:3675). That kernel's f32 output is the byte-exact target.

- FFN (STRONGEST), qwen35.cpp:188-192 + build_layer_ffn:478-487:
  `attn_post_norm = build_norm(cur, RMS)` feeds EXACTLY `ffn_up` + `ffn_gate` (both NVFP4 MMQ at
  m=128). NO non-NVFP4 consumer (residual = pre-norm `cur`; ffn_down eats silu(gate)*up). => the
  f32 normed tensor is DEAD once both GEMMs read fp4 -> producer can skip the f32 write. An existing
  `{MUL_MAT, MUL_MAT, GLU}` fuse (ggml-cuda.cu:3631) already groups up+gate+GLU -> the natural seam.
- GDN/attn (weaker), qwen35.cpp:161 + build_qkvz:228-243:
  `attn_norm = build_norm(inpL, RMS)` feeds `wqkv` + `wqkv_gate` (NVFP4 MMQ, share src1) AND
  `ssm_beta` + `ssm_alpha` (small N=n_v_heads -> MMVQ, READ THE f32). => f32 still live, producer
  MUST write f32 -> smaller win.
- MoE FFN (qwen35moe.cpp) goes via mul_mat_id, already 0023-deduped -> out of scope. Fold = dense only.

## 2. Byte-exact target (norm.cu rms_norm_f32<bs,true>)

Dispatch (norm.cu:304-380): `bs = (ncols < 1024) ? 256 : 1024`, shmem 32*float.

```
for col=tid; col<ncols; col+=bs: tmp += x[col]*x[col];             // (R1) strided sumsq grouping
tmp = block_reduce<SUM, bs>(tmp, s_sum);                           // (R2) tree width depends on bs
mean = tmp/ncols; scale = rsqrtf(mean+eps);                        // (R3) exact eps/div
for col=tid; col<ncols; col+=bs: dst[col] = scale*x[col]*mul[col]; // (W) per-channel gain, mul_col==col
```

(W) is per-column independent (scale block-uniform) -> writeback may be re-partitioned. (R1/R2/R3)
are the ONLY order-sensitive parts and must stay byte-identical.

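
Why (R1)/(R2) are order-sensitive can be seen with a host emulation. The sketch below is an illustration only: `strided_sumsq` is a hypothetical name, and the pairwise tree is a simplified stand-in for the kernel's warp/block reduce. Each of `bs` lanes accumulates a strided partial sum of squares in f32 and the partials are then combined by a tree, so changing `bs` changes both which elements share a lane and the shape of the tree. f32 addition is not associative, so the two groupings may differ in the low bits even though both are close to the exact sum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host emulation of (R1)+(R2) for a given block size.
inline float strided_sumsq(const std::vector<float> & x, size_t bs) {
    assert(bs > 0 && (bs & (bs - 1)) == 0);        // power of two, like 256 / 1024
    std::vector<float> part(bs, 0.0f);
    for (size_t tid = 0; tid < bs; ++tid)
        for (size_t col = tid; col < x.size(); col += bs)
            part[tid] += x[col] * x[col];          // (R1) grouping depends on bs
    for (size_t w = bs / 2; w > 0; w /= 2)         // (R2) tree width depends on bs
        for (size_t i = 0; i < w; ++i)
            part[i] += part[i + w];
    return part[0];
}
```

Calling it twice with the same `bs` is deterministic; comparing `strided_sumsq(x, 256)` against `strided_sumsq(x, 1024)` on real activations is the quickest way to see whether a given row is affected.
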
## 3. Fused producer kernel (quantize.cu) - deltas vs the stash

Start from stash `rms_norm_mul_quantize_nvfp4_kernel` + the factored `quantize_nvfp4_write_subblock`
(verbatim per-thread NVFP4 quant). Required changes:

1. TEMPLATE on block_size + launch `bs=(ncols<1024)?256:1024` (NOT the stash's hardcoded 256). MANDATORY.
2. Reduction pass VERBATIM (R1/R2/R3): scalar strided sumsq, `block_reduce<SUM,bs>`, `mean=tmp/ncols`,
   `scale=rsqrtf(mean+eps)`. Byte-identical once bs matches.
3. Writeback re-partitioned to 16-consecutive-per-thread: `for s=tid; s<n_sub; s+=bs`, col0=s*16,
   `v=scale*xr[col]*mul[col]` (col<ncols else 0), amax=max|v|, `quantize_nvfp4_write_subblock(vals,
   amax, sub, y+ib)`, `ib=k_block*ne11+row`, n_sub=ncols_padded/16. x is re-read (canonical does too).
4. `template<bool write_f32>`: FALSE at FFN (skip `dr[col]=v` -> drop the producer's f32 store),
   TRUE at GDN (beta/alpha read it). THIS is what turns re-bucketing into a real traffic cut.

Buffer ABI frozen: block_fp4_mmq = {uint32_t d4[4]; int8_t qs[128]} = 144B = 9 uint4 = 4*block_q8_1
(mmq.cuh:53). Same layout quantize_mmq_fp4_cuda emits; GEMM stride
s12=ne11*ne10_padded*sizeof(block_fp4_mmq)/(QK_K*sizeof(int)).

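
The frozen buffer layout and the stride formula can be pinned down with compile-time checks. A minimal sketch that mirrors the layout quoted above; the `_sketch` struct and the helper are hypothetical names (the real definition lives in mmq.cuh), and `QK_K = 256` is an assumption about the build's super-block size:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mirror of the quoted layout {uint32_t d4[4]; int8_t qs[128]}.
struct block_fp4_mmq_sketch {
    uint32_t d4[4];
    int8_t   qs[128];
};
static_assert(sizeof(block_fp4_mmq_sketch) == 144, "frozen 144-byte block");
static_assert(sizeof(block_fp4_mmq_sketch) == 9 * 16, "9 uint4 (16-byte) words");

constexpr size_t QK_K_ASSUMED = 256;  // assumption, not taken from the text

// s12 transcribed from the formula above.
inline size_t fp4_mmq_stride_s12(size_t ne11, size_t ne10_padded) {
    assert(ne11 > 0 && ne10_padded > 0);
    return ne11 * ne10_padded * sizeof(block_fp4_mmq_sketch) / (QK_K_ASSUMED * sizeof(int));
}
```

Any producer that writes this buffer in place of the standalone quantize has to keep both static assertions true, which is what "ABI frozen" means in practice.
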
## 4. mul_mat_q prequantized-src1 plumbing (mmq.cu/mmq.cuh)

Re-add the stash hook on top of 0023: `ggml_cuda_mul_mat_q(..., const char* src1_prequantized=nullptr)`.
In the NON-ids branch: if non-null, skip quantize_mmq_fp4_cuda + the local pool alloc, point mmq_args
src1_q8_1 at it. GEMM byte-UNTOUCHED (the bit-exactness firewall). 0023 ids-branch untouched (orthogonal).

Sharing across non-adjacent siblings:

- FFN (preferred): extend `{MUL_MAT,MUL_MAT,GLU}` to `{RMS_NORM,MUL,MUL_MAT,MUL_MAT,GLU}` super-fuse;
  one producer (write_f32=false) + one pool buf spanning both GEMMs + GLU, all in one handler. Clean.
- GDN/general: a scratch cache keyed by the normed tensor ptr (graph-eval lifetime); defer until FFN wins.

The stash folds only ONE consumer with a stack-scoped qbuf -> the sibling still standalone-quantizes
(a key reason it was flat; nsys showed quantize 12896->10816, not ->0).

## 5. Bit-exactness argument

(1) NVFP4 quant of each 16-elem sub-block = PURE per-thread function, NO cross-thread shfl/reduction
(quantize.cu; the exact property 0023 shipped on). => writeback re-partition cannot change a byte.

(2) v=scale*x[col]*mul[col] byte-identical iff scale identical (R1/R2/R3 preserved via bs dispatch)
AND expression verbatim (left-assoc, scalar). Per-column independent -> partition-invariant.

=> produced block_fp4_mmq bytes == standalone == 0022/0023 baseline; GEMM untouched -> md5 held.

Gate: BATCHED (ne[1]>8) md5 == 5951a5b4 dense + 1115/1115 - NOT just batch=1 (the gate Lever-2 skipped).

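
Point (2) can be checked on the host: the writeback computes each output from its own column only, so any assignment of columns to lanes yields the same bytes. A minimal sketch with hypothetical function names, comparing the canonical strided partition with the 16-consecutive-per-lane partition; the zero padding of the last sub-block and the NVFP4 quant itself are left out:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// (W) with the canonical partition: lane `tid` handles tid, tid+bs, ...
inline std::vector<float> writeback_strided(const std::vector<float> & x, const std::vector<float> & mul,
                                            float scale, size_t bs) {
    assert(bs > 0 && x.size() == mul.size());
    std::vector<float> dst(x.size());
    for (size_t tid = 0; tid < bs; ++tid)
        for (size_t col = tid; col < x.size(); col += bs)
            dst[col] = scale * x[col] * mul[col];
    return dst;
}

// (W) re-partitioned: lane `tid` handles whole 16-element sub-blocks.
inline std::vector<float> writeback_subblocks(const std::vector<float> & x, const std::vector<float> & mul,
                                              float scale, size_t bs) {
    assert(bs > 0 && x.size() == mul.size());
    std::vector<float> dst(x.size());
    const size_t n_sub = (x.size() + 15) / 16;
    for (size_t tid = 0; tid < bs; ++tid)
        for (size_t s = tid; s < n_sub; s += bs)
            for (size_t col = s * 16; col < s * 16 + 16 && col < x.size(); ++col)
                dst[col] = scale * x[col] * mul[col];   // same expression, same association
    return dst;
}

// Byte-for-byte comparison, the host analogue of the md5 gate.
inline bool same_bytes(const std::vector<float> & a, const std::vector<float> & b) {
    return a.size() == b.size() && std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0;
}
```
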
## 6. THE TRAP

- block_size trap (the stash's latent bug): canonical = `ncols<1024?256:1024`; qwen35 n_embd is
  1024/2560/4096 (qwen35.cpp:30-31) -> canonical is rms_norm_f32<1024> (LEVER2 nsys confirms). Stash
  hardcodes 256 -> different strided grouping {tid,tid+256,...} vs {tid,tid+1024,...} AND 8-warp vs
  32-warp reduce -> different f32 order -> md5 break. FIX = template+dispatch matching bs.
- f32x4 vectorize trap (recurrence class): do NOT vectorize the sumsq load or align the reduction
  partition to the 16-consecutive writeback. Keep sumsq scalar + strided-by-bs.
- eps/assoc: `rsqrtf(mean+eps)`, `mean=tmp/ncols`, `(scale*x)*mul`. Never reassociate.
- GEMM K-reduction / stream-k / tile loads: forbidden (NONRECURRENCE FORBIDDEN list). Fold only
  changes WHO writes src1.

## 7. Contrast with Lever-2 + lower-risk alternative
|
||||
Lever-2 (stash) was FLAT (+0.3% dense) and NET-ADDED GPU-time (+2.3% fused vs -1.1% quantize -0.9%
|
||||
rms_norm) because it (a) folded only 1 of 2 siblings, (b) always wrote f32, (c) bs=256 (wrong AND
|
||||
non-canonical). It md5'd only batch=1 (fuse off) -> bit-inexactness never caught. The new fold beats
|
||||
it ONLY with de-dup-both-siblings + skip-dead-f32-at-FFN; without BOTH, expect flat again.
|
||||
LOWER-RISK alt (recommend evaluating first): dense quantize DE-DUP, no fold - keep the efficient
|
||||
standalone quantize, quantize the shared normed activation ONCE, reuse for wqkv+wqkv_gate /
|
||||
ffn_up+ffn_gate (CSE keyed by src1 ptr, the dense analog of 0023). ZERO reduction risk (rms_norm
|
||||
untouched), much less plumbing; ceiling ~<=1% (redundant half only), which the fold's de-dup half
|
||||
captures anyway. The fold's only incremental value is the f32 round-trip read, which Lever-2 showed
|
||||
is easily eaten by the fused kernel's added work.
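
A toy model of the de-dup alternative, assuming it reduces to "quantize each distinct src1 once and reuse it". The node list and the `src1` labels are made up for illustration (one GDN-style layer, using only GEMM names mentioned in these notes); `count_quantize_calls` is a hypothetical helper. It counts quantize launches only and ignores buffer lifetime and CUDA-graph capture entirely.

```python
def count_quantize_calls(gemm_nodes, dedup):
    """Count activation-quantize launches for a list of (gemm_name, src1_id) nodes."""
    seen = set()
    calls = 0
    for _name, src1_id in gemm_nodes:
        if dedup and src1_id in seen:
            continue  # reuse the already-quantized src1
        seen.add(src1_id)
        calls += 1
    return calls


# Hypothetical single layer: siblings share the normed activation they read.
layer = [
    ("wqkv", "attn_norm_out"),
    ("wqkv_gate", "attn_norm_out"),
    ("ssm_out", "gated_norm_out"),
    ("ffn_up", "post_norm_out"),
    ("ffn_gate", "post_norm_out"),
    ("ffn_down", "glu_out"),
]

calls_without_dedup = count_quantize_calls(layer, dedup=False)
calls_with_dedup = count_quantize_calls(layer, dedup=True)
print(calls_without_dedup, calls_with_dedup)
```

In this toy layer only the sibling pairs collapse; the single-consumer inputs (`ssm_out`, `ffn_down`) are untouched, which is why the ceiling is the redundant half only.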
|
||||
|
||||
## 8. Scope + build order (the gate)
|
||||
Scope dense qwen35: quantize.cu/.cuh (templated kernel + bs dispatch), mmq.cu/.cuh (src1_prequantized
|
||||
on non-ids path), ggml-cuda.cu (FFN super-fuse, gate: NVFP4 src0 + Blackwell + ne[1]>MMVQ_MAX_BATCH_SIZE
|
||||
+ ne2==ne3==1 + per-channel gain; flag LLAMA_FUSE_NVFP4_QUANT).
|
||||
Build order: (1) FFN super-fuse only (write_f32=false + de-dup); measure per-call producer GPU-time
|
||||
vs the two removed quantizes (nsys node trace, same-build flag toggle); SHIP only if decode_agg
|
||||
actually lifts AND batched md5==0022/1115. (2) Only if (1) lifts, add the GDN boundary (write_f32=true,
|
||||
keyed scratch). Realistic: ~1.5-2.5% dense FFN best case; ceiling +2.7% (skip-ALL) is unreachable
|
||||
(fold keeps quant compute+write). If step 1 is flat, dense quantize is at its bit-exact floor -> stop.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
====================================================================================================
|
||||
|
||||
# RE-PROFILE TARGET MEASUREMENT (label reprofile-target, THE GPU agent) - post-0023, HEAD f7409c2
|
||||
|
||||
Fresh node-level nsys re-profile of the DENSE decode to confirm the fold target size, foldable
|
||||
fraction, critical-path status, and the realistic recoverable ceiling, BEFORE BuildFold commits.
|
||||
|
||||
## Build-dir correction (acted on)
|
||||
The orchestrator framed `build-cuda-base` as the clean 0023 build. It is NOT: empirically
|
||||
`build-cuda-base` = stale pre-0021 (336.71 t/s), the real post-0023 build is `build-cuda` (371.81 t/s,
|
||||
git-clean tree, no mmq.cuh P2a remap). All numbers below are from `build-cuda`. (Dense profiling is
|
||||
unaffected by the 0023 MoE de-dup knob - dense has no MoE.)
|
||||
|
||||
## Confirmed baseline
|
||||
- llama-batched-bench dense q36-27b-nvfp4 npl128 ntg128: 371.81 t/s, 344 ms/decode-step. CONFIRMS the
|
||||
~343 ms / ~373 t/s target. (build-cuda-base stale = 336.71 t/s.)
|
||||
- nsys --cuda-graph-trace=node, 103 steady windowed steps: step span 345.0 ms, mean kernel-busy 99.0%,
|
||||
sum-of-kernels/span = 98.9% (< 100% => no NET overlap; serial single stream, ~1.1% idle).
|
||||
|
||||
## Dense decode decomposition (ms/step)
|
||||
- gated_delta_net 168.06 (49.2%) BINDING
- mul_mat_q<NVFP4,128> 93.57 (27.4%)
- **quantize_mmq_nvfp4 17.55 (5.1%)**
- nvjet 12.02 (3.5%)
- flash_attn_ext 11.64 (3.4%)
- ssm_conv 8.56 (2.5%)
- k_get_rows_float 7.32 (2.1%)
- silu 5.36
- k_bin_bcast(mul) 4.64
- stream_k_fixup 3.95
- rms_norm 3.53 (1.0%)

TOTAL kernel 341.25.
|
||||
|
||||
## quantize_mmq_nvfp4 at the dense decode shape (the answer)
|
||||
- TOTAL: 17.55 ms/step = 5.1% of kernel time = 5.08% of the 345 ms wall. 496 quant calls/step (1 per
|
||||
NVFP4 GEMM src1). CONFIRMS the verdict's 17.66 ms / ~4.5-5% (the stray "3.7%" reading was wrong).
|
||||
- Decomposes EXACTLY by input dim K (graph-verified in qwen35.cpp; 64 layers = 48 GDN + 16 attn):
|
||||
- K=5120 (368/step) FOLDABLE: GDN {wqkv, wqkv_gate, beta, alpha} + attn {wq,wk,wv} + both {ffn_up,
|
||||
ffn_gate}. All fed by a plain rms_norm+mul (attn_norm or attn_post_norm). beta/alpha CONFIRMED
|
||||
foldable: they read the same `cur` as wqkv (qwen35.cpp:359/366).
|
||||
- K=6144 (64/step) UNAVOIDABLE: ssm_out (gated-norm = rms_norm + mul(ssm_norm) + mul(silu(gate)),
|
||||
two muls break the chain) + wo (attn-gated producer).
|
||||
- K=17408 (64/step) UNAVOIDABLE: ffn_down (silu(gate)*up producer).
|
||||
|
||||
## Foldable portion (measured) - LARGER than the byte-model 2.7%
|
||||
The quant kernel is NOT byte-proportional: ffn_down (K=17408) measures 3.62 ms but a byte-model
|
||||
predicts 5.75 ms. Small-K quants are launch/overhead-bound (flat ~21.7 us floor, K=5120 vs 6144
|
||||
indistinguishable), so the byte model UNDER-counts the numerous small-K (foldable) calls.
|
||||
- byte-model FOLDABLE = 9.73 ms = 2.82% of step
|
||||
- flat-split FOLDABLE = 11.90 ms = 3.45% of step (368 small-K quants, the physically correct one)
|
||||
- => true FOLDABLE raw GPU-time = 9.7 - 11.9 ms = 2.8% - 3.4% of step. UNAVOIDABLE = ssm_out+wo
|
||||
~2.1 ms + ffn_down 3.62 ms = ~5.7 ms (1.6%).
|
||||
- Sub-split for the build order: the FFN boundary alone (ffn_up+ffn_gate, f32 DEAD -> cleanest fold)
|
||||
= 128 quants/step ~4.1 ms; the input-norm boundary (wqkv/wqkv_gate/wq/wk/wv, +beta/alpha keep f32)
|
||||
= ~7.8 ms raw but lower net efficiency.
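
The two estimates can be approximately reproduced from the numbers above. This is a reconstruction under assumptions, not the profiler script: the byte model is taken as "time proportional to calls * K", and the flat split as "keep ffn_down as measured, share the rest equally per small-K call". All variable names are hypothetical.

```python
TOTAL_QUANT_MS = 17.55  # quantize_mmq_nvfp4 per step (measured)
FFN_DOWN_MS = 3.62      # K=17408 bucket (measured)

# (calls per step, K) per bucket, from the decomposition by input dim.
BUCKETS = {
    "foldable": (368, 5120),
    "ssm_out_wo": (64, 6144),
    "ffn_down": (64, 17408),
}

# Byte model: each bucket's share of the total is its share of calls * K.
total_weight = sum(n * k for n, k in BUCKETS.values())
byte_model_ms = {
    name: TOTAL_QUANT_MS * n * k / total_weight
    for name, (n, k) in BUCKETS.items()
}

# Flat split: small-K calls are overhead-bound, so split the non-ffn_down time per call.
small_calls = BUCKETS["foldable"][0] + BUCKETS["ssm_out_wo"][0]
per_small_call_ms = (TOTAL_QUANT_MS - FFN_DOWN_MS) / small_calls
flat_split_ms = {
    "foldable": BUCKETS["foldable"][0] * per_small_call_ms,
    "ssm_out_wo": BUCKETS["ssm_out_wo"][0] * per_small_call_ms,
    "ffn_down": FFN_DOWN_MS,
}

for name in BUCKETS:
    print(f"{name}: byte-model {byte_model_ms[name]:.2f} ms, flat-split {flat_split_ms[name]:.2f} ms")
```

Under these assumptions the foldable bucket lands within a few hundredths of a millisecond of the quoted 9.73 ms and 11.90 ms, and the byte model over-predicts ffn_down relative to its measured 3.62 ms, which is the sense in which the quant kernel is not byte-proportional.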
|
||||
|
||||
## Critical path: YES (1:1)
|
||||
98.9% kern/span, 99.0% busy, single serial stream, no net overlap. The quant kernels are inline on the
|
||||
serial decode chain; removing their GPU-time cuts the wall ~1:1. Not a gap-fill (there are no gaps).
|
||||
|
||||
## Realistic recoverable - and the honest haircut
|
||||
RAW foldable removed = 9.7-11.9 ms. NET recoverable is LESS, for reasons the fold-design + ceiling-critic
|
||||
already flagged and this profile does not overturn:
|
||||
- the fused producer KEEPS the quant compute + the fp4 write (only the f32 round-trip read is saved,
|
||||
and the f32 write is droppable ONLY at the FFN boundary where it is dead);
|
||||
- Lever-2 precedent: the existing stash fold measured FLAT (+0.3% dense) because it folded 1 of 2
|
||||
siblings, always wrote f32, and used a non-canonical bs=256 reduction;
|
||||
- TENSION TO FLAG: the critic cites a skip-B probe of only ~+2.7% for the WHOLE quantize, yet the whole
|
||||
quantize is 5.1% on a 98.9%-serial stream (which predicts ~5.1% if cleanly 1:1). Either these small
|
||||
kernels are not perfectly 1:1, or the skip probe is unreliable (same class as the NONREC
|
||||
garbage-routing skip artifact). This caps the realistic NET nearer the conservative end.
|
||||
=> Realistic NET recoverable: ~1.5 - 2.5% dense (consistent with fold-design Section 8), real risk of
|
||||
FLAT. Optimistic ceiling if the f32 round-trips fully convert: up to ~3% (371.8 -> ~383 t/s); do not
|
||||
bank above ~2.5%.
|
||||
|
||||
## VERDICT (GPU-measurement view)
|
||||
- The target is REAL: foldable raw GPU-time 9.7-11.9 ms (2.8-3.4%, slightly LARGER than the prior 2.7%
|
||||
byte-model floor), squarely on the single-stream critical path (1:1), bit-exact-FEASIBLE (no precision
|
||||
change), and the largest single clean dense bucket left after the plateaued recurrence.
|
||||
- BUT the NET recoverable is the contested ~1.5-2.5% with a documented FLAT risk, and this fold has the
|
||||
HIGHEST plumbing of the three identified folds. Worst gain/plumbing ratio of the candidates.
|
||||
- RECOMMENDATION: build is DEFENSIBLE but should be SEQUENCED AFTER the cheaper pointwise + get_rows
|
||||
folds (per ceiling-critic). If built as the single decisive lever, do the FFN boundary FIRST (cleanest
|
||||
~4.1 ms, f32 dead), gate per-call producer-GPU-time vs the two removed quantizes, and SHIP only if
|
||||
decode_agg actually lifts AND batched md5 == 5951a5b4 (1115/1115). Kill-switch: if the only bit-exact
|
||||
construction forces re-partitioning the sumsq reduction (changing accumulation order), abort - not
|
||||
bit-exact.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
|
||||
====================================================================================================
|
||||
|
||||
# BUILD VERDICT (label fold-build, THE GPU agent) - post-0023, HEAD f7409c2 = patch 0023
|
||||
|
||||
DECISION: NO BUILD. The bit-exact decode ceiling is effectively reached for any lever that justifies
|
||||
its plumbing. The proposed rms_norm -> fp4 producer-fold is NOT built (it was already built once and
|
||||
measured FLAT), and the recommended lower-risk alternative (dense quantize de-dup) does NOT have a
|
||||
clean, contained construction for the portion that matters. Tree left clean at 0023; nothing committed
|
||||
to the code; this verdict appended only.
|
||||
|
||||
I extended the read-only agents' analysis with the two things they could not verify from the .md
|
||||
verdicts alone: (1) the prior EMPIRICAL fold attempt, and (2) the actual graph/dispatch structure in
|
||||
the source. Both kill the build.
|
||||
|
||||
## 1. The fp4 producer-fold was ALREADY BUILT and measured FLAT (decisive)
|
||||
LEVER2_PROGRESS.md + stash@{0} (trackA1-prequant-nvfp4-fused-rmsnorm) is exactly this fold. Measured:
|
||||
- dense q36-27b-nvfp4 npl128: 333.32 -> 334.44 t/s (+0.3%), npl32 -0.5%
|
||||
- MoE q36-35b-a3b npl128: 690.23 -> 690.89 (+0.1%), npl32 -0.3%
|
||||
nsys A/B (fusion fires): quantize_mmq_nvfp4 -2080 inst (-1.1%), rms_norm_f32<1024> -2080 (-0.9%),
|
||||
NEW rms_norm_mul_quantize_nvfp4 +2080 (+2.3%). NET GPU-time = +0.3%. The fused producer ADDS BACK
|
||||
the GPU-time it removes - it RELOCATES work, it does not remove it. The +0.3% wall is exactly
|
||||
consistent with strict 1:1 wall scaling on the single serial stream (reprofile's own model). So the
|
||||
fold is not a victim of a bad implementation that a rewrite fixes - it is structurally flat: the
|
||||
producer must still read x, compute sumsq, normalize, quantize and WRITE the fp4 blocks; the only
|
||||
recoverable traffic is the f32 round-trip, which the fused kernel's extra work eats (Lever-2 proved
|
||||
this empirically; fold-design Section 7 and reprofile both predicted it). The design's two "fixes"
|
||||
(de-dup both siblings + skip dead f32 at FFN) do not change this: the skip-f32 saves one f32 write at
|
||||
the FFN boundary only (~0.5% of step), and the de-dup-both-siblings is item 2 below.
|
||||
|
||||
## 2. The dense quantize de-dup is NOT a clean analog of 0023 (the meaningful part is infeasible)
|
||||
This is the critical finding the read-only agents missed. 0023's MoE de-dup lifted +1.73% because the
|
||||
redundancy is INTRA-CALL: inside ONE mul_mat_id, the broadcast (ne11==1) up/gate quantize repeats the
|
||||
SAME token n_expert_used times, all within a single ggml_cuda_mul_mat_q call, so de-dup is a contained
|
||||
quantize-once + gather with a stack-scoped buffer. NO precedent issue, NO cross-node lifetime.
|
||||
The DENSE redundancy is INTER-NODE and that is a different, much harder problem:
|
||||
- The shared-src1 GEMMs are SEPARATE graph nodes. build_qkvz (qwen35.cpp:228-243) emits wqkv MM,
|
||||
reshape, wqkv_gate MM; then ssm_beta MM, reshape, sigmoid; ssm_alpha MM, reshape, add, softplus,
|
||||
mul. The four src1-sharing MMs are INTERSPERSED with reshape/sigmoid/softplus/add/mul - they are
|
||||
NOT consecutive graph nodes, so ggml's consecutive-op fusion framework cannot match them. A
|
||||
contained, single-handler de-dup (the only kind with safe buffer lifetime, like 0023) is impossible
|
||||
for the qkvz bucket.
|
||||
- De-duping them therefore requires graph-level CSE: recognize 2-4 non-adjacent MUL_MAT nodes share
|
||||
src1, quantize once, and keep that pool buffer alive across the intervening nodes until the last
|
||||
sibling GEMM consumes it - under CUDA-graph CAPTURE (buffer addresses baked at capture, the pool
|
||||
must not recycle the buffer between siblings). This is the SAME high-plumbing scratch-pool +
|
||||
src1_prequantized path the fold needs, with real implementation risk (graph-capture
|
||||
non-determinism / crashes), and NO precedent in the tree. fold-design's "much less plumbing"
|
||||
framing for this alternative is optimistic - the hard part (inter-node buffer sharing under graphs)
|
||||
is common to both.
|
||||
- The qkvz bucket (the big one, ~192 redundant quants/step ~= 1.4%) is exactly the inter-node case.
|
||||
- The ONLY contained, tractable dense de-dup is the FFN {MUL_MAT,MUL_MAT,GLU} (consecutive; build_ffn
|
||||
LLM_FFN_PAR). But that existing fusion executes ONLY via ggml_cuda_mul_mat_vec (gated on batch<=8;
|
||||
ggml_cuda_should_fuse_mul_mat_vec_q). At npl128 (m=128) it falls through to two separate MMQ nodes.
|
||||
Adding an MMQ-path branch to quantize src1 once captures only the FFN redundancy = ~64 quants/step
|
||||
~= 0.5% of the step - below the +-0.3-0.5% bench noise the runs already show, not worth a new
|
||||
fusion code path + the risk to the byte gate.
|
||||
|
||||
## 3. The pointwise + get_rows folds are not clean wins either
|
||||
- Pointwise: the cheap ops are ALREADY fused in the tree - {RMS_NORM,MUL(,ADD)} -> rms_norm_fused
|
||||
(ggml-cuda.cu:4194/4199), {SSM_CONV,(ADD),SILU} -> ssm_conv (4204/4209), {UNARY(silu/sigmoid/
|
||||
softplus),MUL} -> unary_mul (4216). The residual silu 5.36 + k_bin_bcast 4.64 ms is the un-fusable
|
||||
remainder inside the GDN gating chain feeding the 50% binding gated_delta_net kernel; GAP_PROGRESS
|
||||
measured the whole gating-glue ceiling at only 3.35% and folding further means surgery on the binding
|
||||
kernel. Lower-confidence, needs a GPU node-scoping pass - not a clean lever.
|
||||
- get_rows: 0019 already folded the main recurrent-state gathers; the residual ~2% is an unquantified
|
||||
mix of the conv-tap (already deferred as "tiny") and leftovers - under-scoped, not a confirmed win.
|
||||
|
||||
## 4. Tree state / gates
|
||||
- Dev tree clean at HEAD f7409c2 (git diff empty; mmq.cuh/mmq.cu/quantize.cu no uncommitted diff -
|
||||
no P2a remap to revert). build-cuda = the clean 0023 build (371.81 t/s dense @npl128, per reprofile).
|
||||
- No code change made -> no md5 gate needed (baseline 27b = 5951a5b4, 35b = 07db32c2 unchanged).
|
||||
- No GPU build/bench launched (no buildable candidate clears the ROI bar; re-confirming the baseline
|
||||
the reprofile already measured would waste the GPU window).
|
||||
|
||||
## 5. FINAL BIT-EXACT CEILING
|
||||
Dense q36-27b-nvfp4: 371.81 t/s @npl128 = 95.0% of vLLM 391. MoE q36-35b-a3b: 758.1 @npl128 (0023).
|
||||
This is the bit-exact f32 decode plateau and there is no single decisive bit-exact lever left:
|
||||
- gated_delta_net recurrence (~50%) is at 84.6% peak LPDDR5x BW, PAST vLLM (82.4%) - DRAM floor.
|
||||
- mul_mat_q NVFP4 GEMM (~27%), flash_attn (~3.4%), lm_head nvjet (~3.5%) have NO bit-exact lever
|
||||
(any knob changes a K-/softmax-reduction order vs the f32 reference).
|
||||
- The remaining ~5% of small foldable buckets is real GPU-time on the critical path, but the largest
|
||||
piece (the fp4 fold, ~1.5-2.5%) is EMPIRICALLY FLAT, the next (dense qkvz quant de-dup, ~1.4%) has
|
||||
no clean inter-node construction and shares the fold's flat-risk, and the contained remainder is
|
||||
each <=0.5% (FFN de-dup) or entangled in the binding kernel (pointwise) - none clears the
|
||||
plumbing/risk bar for a 1:1 single-stream gain that the bench noise floor (~0.3-0.5%) can swallow.
|
||||
FRAME: vLLM 391 is a LOWER-precision (w4a4) reference; bit-exact-vs-vLLM is impossible. llama at 371.81
|
||||
bit-exact f32 is doing strictly MORE precise arithmetic at ~95% of vLLM's throughput. The only thing
|
||||
that goes materially further is bf16 state (precision change, KL-gated, out of scope, shelved).
|
||||
RECOMMENDATION: ship the 0023 plateau as the bit-exact decode result. Do not build the fp4 fold (flat).
|
||||
If a future agent insists on the dense qkvz de-dup, it must first build the inter-node graph-CSE
|
||||
scratch-pool/CUDA-graph-lifetime plumbing and prove on a same-build flag toggle that decode_agg lifts
|
||||
above the +-0.5% noise AND batched md5 == 5951a5b4 - and should expect the Lever-2 flat outcome.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,138 +0,0 @@
|
||||
# GB10 same-day head-to-head server sweep: llama-server (paged) vs vLLM
|
||||
|
||||
Date: 2026-06-23. Hardware: GB10 / DGX Spark (sm_121, 128 GB LPDDR5x unified, ~273 GB/s
|
||||
weight-read floor). GPU otherwise idle (sibling vLLM had exited; LocalAI docker workers
|
||||
stopped for the run).
|
||||
|
||||
This sweep **replaces** the stale carried "~75-80% of vLLM" figure (commit 07985ba4,
|
||||
pre-co-batching, single-point). It measures *real serving* steady-state aggregate decode
|
||||
throughput across the full concurrency curve, for three model classes, with one identical
|
||||
client driving both engines.
|
||||
|
||||
## Method
|
||||
|
||||
- **llama**: `llama-server` from the paged dev tree (`~/llama-paged-dev/build-cuda`, HEAD =
|
||||
patch 0013 / commit 17d97cb), `LLAMA_KV_PAGED=1`, `-fa on -ngl 999 --parallel 128 -c 65536`.
|
||||
- **vLLM**: 0.23.0, `vllm serve --enforce-eager --enable-prefix-caching --max-num-seqs >=128
|
||||
--max-model-len 4096` (APC on, eager per the GB10 no-CUDA-graphs edge).
|
||||
- **Client** (`sweep_client2.py`): N concurrent **non-streaming** `/v1/completions`, short
|
||||
shared prompt, `max_tokens=min_tokens=256`, `ignore_eos=true`. Aggregate decode tok/s =
|
||||
total generated tokens / wall. Non-streaming keeps the Python client off the critical path
|
||||
(one JSON parse per request, not per token), so the **server** is the bottleneck. Validated:
|
||||
vLLM pushed 4227 tok/s through the exact same client where llama topped out at 2087, so the
|
||||
client is not the cap. Both engines use the identical client + prompt -> apples-to-apples.
|
||||
- npl (concurrency) sweep: 8 / 32 / 64 / 128.
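
A minimal sketch of the measurement loop described above, assuming a thread-per-request client; `sweep_client2.py` itself is not reproduced here, and `make_sender` / `aggregate_decode_tps` are hypothetical names. The request fields are the ones listed in the Method; reading the token count from `usage.completion_tokens` assumes an OpenAI-compatible response shape, and some servers may also require a `model` field.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def make_sender(base_url, prompt, n_tokens=256):
    """Build a callable that issues one non-streaming completion and returns its token count."""
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": n_tokens,
        "min_tokens": n_tokens,
        "ignore_eos": True,
    }).encode()

    def send():
        req = urllib.request.Request(
            base_url + "/v1/completions",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            reply = json.load(resp)  # one JSON parse per request, not per token
        # Assumption: usage is reported in the OpenAI-compatible shape.
        return reply["usage"]["completion_tokens"]

    return send


def aggregate_decode_tps(send, npl):
    """Fire npl concurrent requests; return (total tokens, wall seconds, tokens per second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=npl) as pool:
        counts = list(pool.map(lambda _: send(), range(npl)))
    wall = time.perf_counter() - start
    total = sum(counts)
    return total, wall, total / wall
```

Passing the sender in as a callable keeps the timing loop identical for both engines; only the base URL changes between runs.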
|
||||
|
||||
Quant parity:
|
||||
- Dense: llama **NVFP4-dense GGUF** (weight-only FP4, 16-bit compute) vs vLLM **NVFP4A16**
|
||||
(weight FP4, 16-bit activation) -> matched precision class.
|
||||
- Small: llama **Q8_0** vs vLLM **bf16** (closest loadable form).
|
||||
- MoE: llama **mxfp4** GGUF. **vLLM could not serve this MoE on GB10 at all** (see below), so
|
||||
there is no vLLM MoE column.
|
||||
|
||||
## Results: aggregate decode tok/s (higher is better)
|
||||
|
||||
### Dense 32B (llama NVFP4-dense vs vLLM NVFP4A16)
|
||||
|
||||
| npl | llama (NVFP4) | vLLM (NVFP4A16) | llama % of vLLM |
|----:|--------------:|----------------:|----------------:|
| 8 | 83.2 | 85.9 | **96.9%** |
| 32 | 228.9 | 301.3 | 76.0% |
| 64 | 367.1 | 507.8 | 72.3% |
| 128 | 520.6 | 604.0 | 86.2% |
|
||||
|
||||
Plateau: neither has plateaued at 128 (both still climbing, weight-read bound). llama is at
|
||||
**parity at batch-8** (97%), dips to 72-76% mid-curve (npl 32-64), recovers to 86% at 128.
|
||||
|
||||
### Small Qwen3-0.6B (llama Q8_0 vs vLLM bf16)
|
||||
|
||||
| npl | llama (Q8_0) | vLLM (bf16) | llama % of vLLM |
|----:|-------------:|------------:|----------------:|
| 8 | 911.3 | 923.0 | **98.7%** |
| 32 | 1701.6 | 2531.4 | 67.2% |
| 64 | 1911.7 | 3497.1 | 54.7% |
| 128 | 2087.6 | 4227.6 | 49.4% |
|
||||
|
||||
Plateau: **llama plateaus hard** at ~1.9-2.1k by npl 64-128 (+9% from 64->128). vLLM keeps
|
||||
scaling (3497 -> 4227). For a tiny runtime-bound model, vLLM's scheduler/batching amortizes
|
||||
better; llama-server's per-token host cost (sampling, detok, slot mgmt) caps it. This is the
|
||||
worst llama-vs-vLLM ratio in the sweep (down to 49%).
|
||||
|
||||
### MoE Qwen3-Coder-30B-A3B (llama mxfp4; vLLM = NOT SERVABLE on GB10)
|
||||
|
||||
| npl | llama (mxfp4) | vLLM |
|----:|--------------:|-----:|
| 8 | 290.0 | n/a |
| 32 | 582.5 | n/a |
| 64 | 931.8 | n/a |
| 128 | 1041.3 | n/a |
|
||||
|
||||
llama-server scales cleanly to **1041 tok/s** at npl 128 with **no npl-128 expert-activation
|
||||
cliff** (unlike the prior `llama-batched-bench` MoE numbers 253/505/830/620 that peaked at 64;
|
||||
short-prompt continuous batching in the server avoids it).
|
||||
|
||||
**vLLM could not serve this MoE on GB10 (two independent failures):**
|
||||
1. **bf16** (`Qwen/Qwen3-Coder-30B-A3B-Instruct`, the only HF form on the box): loads the
|
||||
56.9 GB of weights, then **hangs at the MoE warmup** (`Using MoEPrepareAndFinalize
|
||||
NoDPEPModular` -> `Model loading took ...`), GPU 0% util, and **takes the whole box down
|
||||
(hard reboot)**. Reproduced twice. With tight `--gpu-memory-utilization` it still hangs at
|
||||
the same step before the API server ever comes up.
|
||||
2. **mxfp4 GGUF** (same weights llama uses): vLLM 0.23.0's GGUF loader **cannot map the fused
|
||||
qwen3moe expert tensors** (`RuntimeError: Failed to map GGUF parameters (48):
|
||||
['model.layers.N.mlp.experts.gate_up_proj', ...]`). Engine init fails outright.
|
||||
|
||||
So on GB10, llama.cpp is the *only* engine of the two that serves this 30B-A3B MoE at all -
|
||||
an availability win, independent of throughput.
|
||||
|
||||
## Batch-8 anomaly triage (dense NVFP4) -- RESOLVED
|
||||
|
||||
The prior mixed-load run reported llama batch-8 steady decode at **471 ms/step (~19% of vLLM
|
||||
aggregate, ~17 tok/s)**. This sweep does **not** reproduce it. Clean isolated batch-8 decode:
|
||||
|
||||
- `llama-server` batch-8 dense paged = **83.2 tok/s** aggregate = ~96 ms/step = **96.9% of
|
||||
vLLM's 85.9** (parity, both at the LPDDR5x weight-read floor).
|
||||
|
||||
`llama-batched-bench` cross-check, dense NVFP4, `-npp 16 -ntg 128 -npl 1,8`, the three
|
||||
hypotheses isolated (S_TG = decode tok/s aggregate at batch 8):
|
||||
|
||||
| config | batch-8 S_TG t/s | ms/decode-step |
|-----------------------|-----------------:|---------------:|
| paged, ctx 65536 | 90.32 | 88.6 |
| stock, ctx 65536 | 88.39 | 90.5 |
| paged, ctx 163840 | 89.33 | 89.6 |
| stock, ctx 163840 | 87.72 | 91.2 |
|
||||
|
||||
Conclusion: clean batch-8 dense decode is **~88-90 tok/s (~89 ms/step) regardless of all three
|
||||
suspects**:
|
||||
- **Paged overhead?** No -- paged is within ~2% of stock and is slightly *faster* at both context
sizes (90.3 vs 88.4 at ctx 65536, 89.3 vs 87.7 at ctx 163840). The decode path is not paying a
paged penalty at batch-8.
|
||||
- **The 163840-token ctx allocation?** No -- ctx 163840 == ctx 65536 within 1% (89.3 vs 90.3).
|
||||
The large allocation does not slow steady-state decode.
|
||||
- **NVFP4 decode cost?** This *is* the cost -- ~89 ms/step is the GB10 weight-read floor for a
|
||||
32B at batch-8 (it matches vLLM's 86 tok/s server and exceeds it at the kernel level: 90 vs
|
||||
86). It is the hardware ceiling, not a bug.
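
The tok/s and ms/step columns are two views of the same number if each decode step emits one token per active sequence (an assumption that holds for plain batched decode without speculation). A small helper, with the hypothetical name `ms_per_decode_step`, makes the conversion explicit using the cross-check figures above:

```python
def ms_per_decode_step(aggregate_tps, batch):
    """Milliseconds per decode step, assuming one token per sequence per step."""
    return 1000.0 * batch / aggregate_tps


# (config, batch-8 S_TG t/s) taken from the cross-check table.
cross_check = [
    ("paged, ctx 65536", 90.32),
    ("stock, ctx 65536", 88.39),
    ("paged, ctx 163840", 89.33),
    ("stock, ctx 163840", 87.72),
]
step_ms = {name: ms_per_decode_step(tps, batch=8) for name, tps in cross_check}

# The anomalous 471 ms/step reading, expressed as tok/s and as a slowdown factor.
anomaly_tps = 8 * 1000.0 / 471.0
anomaly_slowdown = 471.0 / step_ms["paged, ctx 65536"]

for name, ms in step_ms.items():
    print(f"{name}: {ms:.1f} ms/step")
print(f"anomaly: {anomaly_tps:.1f} tok/s, {anomaly_slowdown:.1f}x the clean step")
```

The same helper applied to the server number (83.2 tok/s at batch 8) gives roughly 96 ms/step, matching the figure quoted for `llama-server`.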
|
||||
|
||||
The 471 ms/step is ~5.3x slower than this clean floor and is explained by none of the three.
|
||||
It was a **mixed-load artifact**: the 8 decoders were time-sharing the GPU with a concurrent
|
||||
prefill (a large prompt / chunked prefill landing on the same steps). That decode-vs-prefill
|
||||
contention is exactly the stall **patch 0013 (`LLAMA_PREFILL_BUDGET`)** bounds. In steady-state
|
||||
isolated decode, batch-8 dense is at **parity with vLLM (97%)**, not 19%.
|
||||
|
||||
## Aggregate map (replaces the carried 75-80%)
|
||||
|
||||
llama-server (paged) as a fraction of vLLM, by regime:
|
||||
|
||||
- **Low concurrency (batch-8): parity, 97-99%** on both measurable classes. Both engines sit on
|
||||
the LPDDR5x weight-read floor; there is nothing to win.
|
||||
- **Dense 32B, mid-to-high concurrency: 72-86%.** Dips to 72-76% at npl 32-64, recovers to 86% at
|
||||
128. Both still climbing (weight-bound), neither plateaus by 128.
|
||||
- **Small 0.6B, mid-to-high concurrency: 49-67%.** llama plateaus ~2.0k; vLLM scales to 4.2k.
|
||||
Runtime/scheduler-bound regime -- vLLM's batching wins; this is llama's weakest ratio.
|
||||
- **MoE 30B-A3B: llama-only.** vLLM cannot serve it on GB10 (bf16 reboots the box at MoE
|
||||
warmup; GGUF expert tensors unmappable). llama serves it at 290 -> 1041 tok/s, scaling
|
||||
cleanly with no npl-128 cliff.
|
||||
|
||||
Net: the single "75-80%" number is replaced by a curve. It is *roughly* right only for the
|
||||
dense mid-band; it is too optimistic for the small model at high concurrency (49%) and moot for
|
||||
MoE (where llama is the only option). The headline is parity at low concurrency and a hardware
|
||||
(not engine) ceiling on dense decode.
|
||||
@@ -1,567 +0,0 @@
|
||||
# SPEEDUP_HUNT.md - the post-0023 vLLM decode close/beat hunt
|
||||
|
||||
Accumulator for the four-lever speedup hunt on the clean pin-synced base (llama.cpp
|
||||
9d5d882d, bit-exact md5 == 0023 baseline). Levers (current-brief labels):
|
||||
A = hybrid per-head SSM precision, B = MoE grouped-GEMM, C = structural dense residual
|
||||
(lm_head + graph/launch), D = f16 glue.
|
||||
|
||||
---
|
||||
|
||||
## D - f16 GLUE: confirm lower-priority (label: D-f16-confirm, READ-ONLY no GPU)
|
||||
|
||||
Re-read `F16_DENSE_RESIDUAL_PROBE.md` (the lever-D doc) plus `BF16_SSM_STATE_RESULTS.md`
|
||||
(lever A's parent work) and `OTHER_PATHS_INVESTIGATION.md` (the B/lm_head + graph
|
||||
analysis). Verdict: **D is correctly deprioritized. Dominated by both A and B. Build
|
||||
later behind an opt-in flag only if the last ~4% dense is ever chased; do NOT build now.**
|
||||
|
||||
### The numbers that pin D below A and B
|
||||
|
||||
- D's reachable mass is TINY. The dense decode gap to vLLM is ~27 ms/step (llama 332.8 ms
|
||||
vs vLLM 305.7 ms @npl128). 83.2% of the step (recurrence 49.3% + FP4 GEMM 27.4% + FP4
|
||||
act-quant/fixup 6.4%) is ALREADY precision-matched f32/W4A4 on both engines - f16 cannot
|
||||
touch it. The f16-able glue is only **8.4% of the step** (Budget A = 28.74 ms: norms +
|
||||
elementwise + activations + flash_attn + rope + copies).
|
||||
- f16 does not zero the glue, it halves the bytes of the memory-bound part. Realistic
|
||||
recovery from the probe: ~11 ms (glue only) to ~16 ms (+ the uncertain nvjet GEMM) =
|
||||
**40-60% of the 27 ms residual**. That moves dense parity 91.8% -> ~95-96%, NOT a close.
|
||||
- The single largest f16-able line (flash_attn 11.9 ms) is the LEAST recoverable (KV is
|
||||
ALREADY f16, the KQ/softmax accumulate stays forced f32 = vLLM does the same). The cleanly
|
||||
recoverable band is just the norms+elementwise+activations (~16.7 ms -> ~8.4 ms saved).
|
||||
|
||||
### Dominated by A (parity-and-beyond) and B (the bigger gap) - confirmed
|
||||
|
||||
- **A dominates on the same dense axis.** A targets the recurrence, which is 49.3% of the
|
||||
dense step - i.e. ~6x the mass D can touch. The bf16-SSM measurement already proved the
|
||||
recurrence kernel halves (-49%/call) and clean dense bf16 hit ~490 t/s = **125% of vLLM**
|
||||
(`BF16_SSM_STATE_RESULTS.md` sec 2). A's hybrid per-head variant keeps the long-memory
|
||||
heads f32 to pass the KL gate that plain bf16 failed (drift FAIL ~10% argmax flips @>=1024
|
||||
ctx) while banking most of that +25-31%. So A is the parity-AND-BEYOND lever on dense;
|
||||
D's ceiling is ~96% parity. A wins outright.
|
||||
- **B is the bigger gap.** MoE sits at ~82% (726 vs 882) vs dense ~92%; the MoE-specific
|
||||
kernel (mul_mat_q<NVFP4,M-tile=64> grouped GEMM, 26.9% of MoE decode = ~43.5 ms/step) and
|
||||
the W4A4 act-quant tax are real MoE deltas. D is a DENSE-only lever (the MoE step is
|
||||
recurrence + FP4-GEMM + bf16-projection dominated; the f16 glue band is even smaller
|
||||
there) - it does nothing for the larger MoE gap. B addresses where the bench is worst.
|
||||
- **C overlaps and out-prioritizes D's residual.** The probe's own conclusion: the
|
||||
remaining ~3-4% after f16 is structural (non-FP4 cublas/nvjet GEMM efficiency +
|
||||
graph/launch scheduling), and those help the BIT-EXACT default too, unlike D which is
|
||||
opt-in non-bit-exact. C's graph/launch work is the better long-term dense target.
|
||||
|
||||
### Is there a cheap subset of D worth folding into a later build?
|
||||
|
||||
**No cheap subset that pays.** The probe maps D to three escalating options:
|
||||
|
||||
- A flag: does not exist and cannot exist - the F32 stream is STRUCTURAL
|
||||
(`ggml_mul_mat` hardcodes an F32 result, so the residual stream snaps back to F32 after
|
||||
every projection; rms_norm/l2_norm/silu/add/mul/flash_attn/ssm_conv all emit F32).
|
||||
- **Option 1 (the "cheap" one: per-op f16 on ops that already have f16 paths - silu/sigmoid/
|
||||
softplus/add/mul/rope): NET NEAR-ZERO OR NEGATIVE.** Because the residual stream stays F32,
|
||||
each op must be wrapped cast(F16)->op->cast(F32) = 2 extra `cpy` ops. At decode these ops
|
||||
are tiny and memory-bound, so the cast traffic ~= the op traffic and the win is eaten unless
|
||||
the cast is FUSED into producer/consumer. Crucially Option 1 CANNOT reach the norms - the
|
||||
largest glue item. So the only "cheap" subset is the one that does not actually help.
|
||||
- Option 2 (the real lever): carry the residual stream in F16 across the layer, which needs
|
||||
NEW F16 template instantiations in norm.cu (rms_norm / l2_norm / fused rms+mul / rms+mul+add,
|
||||
today hard-`GGML_ASSERT(type==F32)`) keeping the f32 reduction, an f16 projection-output
|
||||
path, plus graph-dtype plumbing in qwen35.cpp/llama-graph.cpp. Multi-file, recovers ~11 ms,
|
||||
and is **non-bit-exact** (same gate-failing category as the shelved bf16-SSM state). Not cheap.
|
||||
|
||||
There is no fold-in-for-free subset: the only no-new-kernel piece (Option 1) is net-zero, and
|
||||
the only piece that captures real mass (Option 2 norm.cu f16 kernels) is a multi-file build.
|
||||
|
||||
### THE D PRIORITY CALL
|
||||
|
||||
D is correctly deprioritized, below A, B, and C:
|
||||
- **Reachable mass:** D 8.4% of the dense step vs A's 49.3% recurrence; D is dense-only and
|
||||
does nothing for the bigger MoE (B) gap.
|
||||
- **Ceiling:** D tops out ~95-96% dense parity; A is already parity-AND-BEYOND (125% clean,
|
||||
hybrid keeps most of it inside the KL gate).
|
||||
- **Bit-exactness:** D is opt-in NON-bit-exact (same bucket as shelved bf16-SSM and the
|
||||
NVFP4-head); it cannot improve the shipped f32 bit-exact default, whereas C's structural
|
||||
graph/launch work does help the default.
|
||||
|
||||
### RECOMMENDATION: build LATER (opt-in only), not now; no cheap subset to fold in

Do NOT build the f16 glue path now. Ship the 95%-bit-exact f32 plateau (patches 0018-0023)
as the default. If the last ~4% dense is ever chased, the ONLY worthwhile piece is Option 2's
norm.cu f16 kernels + f16 residual stream (recovers the norm/elementwise band, ~11 ms); gate
it behind an explicit opt-in flag and validate it against the SAME KL threshold that failed
plain bf16-SSM before shipping. Skip Option 1 entirely (cast overhead eats the win). Prefer
the structural ~3-4% (non-FP4 cublas GEMM efficiency + graph/launch scheduling, lever C) over
D, because that helps the bit-exact default too. D stays the lowest-priority of the four levers.

Assisted-by: Claude:opus-4.8 [Claude Code]

---

## A - HYBRID PER-HEAD f32/bf16 SSM STATE (label: A-hybrid-design, READ-ONLY no GPU)

Goal: capture most of the whole-bf16 SSM-state win (recurrence -49%/call; dense ~490 t/s = 125% of
vLLM; MoE +25%) WITHOUT the KL failure (whole-bf16 MeanKLD 0.05-0.17, Same-top-p ~90%, ~10% argmax
flips @>=1024 ctx). Keep f32 on the long-memory heads (where bf16 rounding does NOT contract and the
KL error concentrates); bf16 only the fast-decaying heads. Stays at-or-above vLLM precision (vLLM
keeps ALL temporal state f32) while landing ABOVE vLLM throughput.

### Why the error concentrates in long-memory heads (the physics)

qwen35/qwen35moe take the NON-KDA path: per (head h, token t) the decay is ONE scalar
(gated_delta_net.cu `g_val = expf(g[h,t])`, `S <- g_val*S + k(x)delta`). The gate (qwen35.cpp):
`g[h,t] = ssm_a[h] * softplus(alpha[h,t] + ssm_dt[h])`, with `ssm_a[h] = -exp(A_log[h]) <= 0` =>
decay = exp(g) in (0,1]. Two STATIC per-head weights set the timescale: ssm_a[h] (tensor
SSM_A_NOSCAN, [n_v_heads]) = decay-rate SCALE (|ssm_a| small => structurally long-memory); ssm_dt[h]
(SSM_DT "bias", [n_v_heads]) = softplus operating point. bf16 carry-error per step is contracting,
bounded ~`eps*tau_h`, eps~2^-8~3.9e-3, head memory length `tau_h ~ 1/(|ssm_a[h]|*softplus(ssm_dt[h]))`
tokens. Error scales LINEARLY with tau_h => long-memory heads blow up the KL (matches the measured
plateau-but-large failure). Keep those f32.
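The gate, decay and memory-length formulas above can be sketched on the host as follows. This is a minimal illustration under stated assumptions: the function names (`softplus`, `gate_log_decay`, `head_tau`, `bf16_error_ceiling`) are hypothetical and are not taken from qwen35.cpp or gated_delta_net.cu; only the formulas come from the text.

```cpp
#include <cmath>

// Numerically stable softplus: log(1 + exp(x)).
inline double softplus(double x) {
    return x > 0.0 ? x + std::log1p(std::exp(-x)) : std::log1p(std::exp(x));
}

// Per-(head, token) log-decay from the text:
//   g[h,t] = ssm_a[h] * softplus(alpha[h,t] + ssm_dt[h]),  with ssm_a[h] <= 0.
// The decay applied to the state is exp(g), which lies in (0, 1].
inline double gate_log_decay(double ssm_a, double alpha, double ssm_dt) {
    return ssm_a * softplus(alpha + ssm_dt);
}

// Static per-head memory-length estimate in tokens:
//   tau_h ~ 1 / (|ssm_a[h]| * softplus(ssm_dt[h])).
inline double head_tau(double ssm_a, double ssm_dt) {
    return 1.0 / (std::fabs(ssm_a) * softplus(ssm_dt));
}

// Rough bf16 carry-error ceiling eps * tau_h, with eps ~ 2^-8 as in the text.
inline double bf16_error_ceiling(double tau) {
    return std::ldexp(1.0, -8) * tau;
}
```

With illustrative weights `ssm_a = -0.1` and `ssm_dt = 0`, `head_tau` gives about 14.4 tokens; shrinking `|ssm_a|` by 10x lengthens the memory (and the error ceiling) by the same factor, which is the linear scaling the paragraph relies on.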
### Classification: per-head STATIC, at model load (NOT per-token)

g is per-token but the long-vs-fast PROPERTY is per-head static (dominated by ssm_a/ssm_dt). A cache
row's dtype must be stable across the sequence => a per-token threshold is impossible; classify ONCE
at load into a per-(layer,head) dtype mask.

- TIER 1 (default, zero-cost, deterministic): pure-weights.
  `tau_h = 1/(|ssm_a[il][h]|*softplus(ssm_dt[il][h]))`; keep f32 if tau_h > T_thresh, else bf16.
  T_thresh is THE knob (start 32-64; sweep on GateBench). eps*tau_h => a single T_thresh sets a
  uniform per-head error ceiling.
- TIER 2 (optional): short calibration pass measures per-head time-mean of actual exp(g[h,t]); write
  mask to a model-hash sidecar (paid once). Use only if Tier 1 lands just above the gate.

cparam `ssm_hybrid_tau_thresh` / `--ssm-bf16-tau`: inf => all-f32 (today's bit-exact default); 0 =>
all-bf16 (the shelved mode); the hybrid band is in between.
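A minimal sketch of the Tier-1 classifier for one layer, assuming the per-head weights are already available as host float arrays. `HeadMask` and `classify_heads` are hypothetical names for illustration, not functions from the patch series. The sketch applies the comparison exactly as worded in the Tier-1 bullet (`tau_h > T_thresh` keeps f32), so with this literal rule a threshold of 0 keeps every head f32 and an infinite threshold sends every head to bf16; the endpoint convention of the real flag should be checked against whichever comparison the implementation adopts.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct HeadMask {
    std::vector<bool> is_bf16;  // one entry per head of this layer
    std::size_t n_f32  = 0;     // heads kept in f32 (long memory)
    std::size_t n_bf16 = 0;     // heads stored in bf16 (fast decay)
};

inline double softplus(double x) {
    return x > 0.0 ? x + std::log1p(std::exp(-x)) : std::log1p(std::exp(x));
}

// Tier-1 rule: tau_h = 1 / (|ssm_a[h]| * softplus(ssm_dt[h]));
// keep f32 if tau_h > t_thresh, else bf16. Pure weights, no data, no GPU.
inline HeadMask classify_heads(const std::vector<float>& ssm_a,
                               const std::vector<float>& ssm_dt,
                               double t_thresh) {
    HeadMask m;
    m.is_bf16.resize(ssm_a.size());
    for (std::size_t h = 0; h < ssm_a.size(); ++h) {
        const double tau = 1.0 / (std::fabs(static_cast<double>(ssm_a[h])) *
                                  softplus(static_cast<double>(ssm_dt[h])));
        const bool keep_f32 = tau > t_thresh;
        m.is_bf16[h] = !keep_f32;
        if (keep_f32) {
            ++m.n_f32;
        } else {
            ++m.n_bf16;
        }
    }
    return m;
}
```

Because the mask depends only on static weights, it can be computed once per layer at load and reused for every sequence, which is what makes a stable per-row cache dtype possible.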
### Mixed-dtype cache layout: two homogeneous partitions per slot (packed)

Split persisted s_l ([S_v,S_v,H,slots] f32, n_embd_s=S_v*S_v*H) into TWO dtype-homogeneous sub-caches
sized by head COUNT (this is what saves bytes): `s_l_f32 [S_v*S_v*n_f32, slots]` f32 +
`s_l_bf16 [S_v*S_v*n_bf16, slots]` bf16. Static map `head_slot[h]={is_bf16, local_idx}`. q/k/v/g/beta
KEEP natural head order (no activation permute). Block h_idx -> head_slot -> base + local_idx*S_v*S_v.
Recurrence R+W bytes scale by `f_bytes = (n_f32 + n_bf16/2)/H = 1 - 0.5*(n_bf16/H)`. In-place/ids
identity stays race-free (each head writes its own partition; read==write slot, registers before
store). (Cheaper coarse fallback = per-LAYER dtype, near-zero layout code, but long-memory heads span
most layers => too coarse; per-head is the right granularity.)
### Kernel: single launch, runtime per-head branch (on top of BF16_SSM_STATE.diff)

Reuse the existing bf16 plumbing (gdn_state_t alias, __bfloat162float load / __float2bfloat16 store,
gather template, dtype-detect dispatcher). Hybrid change: pass BOTH bases (`const float* s_f32_base`,
`const nv_bfloat16* s_bf16_base`, + the two state_dst views) + device `head_slot[]`; branch load/store
on `head_slot[h_idx].is_bf16` (UNIFORM per block => no warp divergence). Recurrence math byte-for-byte
untouched (f32 registers). keep_rs_t snapshots stay f32 (op-output scratch). gdn_gather_nonident
becomes per-head dtype-aware (still disjoint-scratch race-free). ONE op call + ONE launch.
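The load/store branch can be emulated on the host to show that only the storage dtype differs while the arithmetic stays in f32. This is a sketch under assumptions: it is plain C++ rather than the CUDA kernel, bf16 is modelled as a raw 16-bit pattern with round-to-nearest-even and no NaN/Inf special-casing, and `load_state`, `store_state` and `decay_step` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// f32 -> bf16 bit pattern, round-to-nearest-even (finite inputs only).
inline std::uint16_t f32_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// bf16 bit pattern -> f32 (exact widening).
inline float bf16_to_f32(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

struct HeadSlot {
    bool        is_bf16;
    std::size_t local_idx;
};

// Branch on the head's partition; in the kernel this branch is uniform per block.
inline float load_state(const float* s_f32, const std::uint16_t* s_bf16,
                        HeadSlot hs, std::size_t n_per_head, std::size_t i) {
    const std::size_t idx = hs.local_idx * n_per_head + i;
    return hs.is_bf16 ? bf16_to_f32(s_bf16[idx]) : s_f32[idx];
}

inline void store_state(float* s_f32, std::uint16_t* s_bf16,
                        HeadSlot hs, std::size_t n_per_head, std::size_t i, float v) {
    const std::size_t idx = hs.local_idx * n_per_head + i;
    if (hs.is_bf16) {
        s_bf16[idx] = f32_to_bf16(v);
    } else {
        s_f32[idx] = v;
    }
}

// One in-place decode update of a single state element: S <- g_val*S + upd.
// The math runs in f32 regardless of the storage dtype.
inline void decay_step(float* s_f32, std::uint16_t* s_bf16, HeadSlot hs,
                       std::size_t n_per_head, std::size_t i, float g_val, float upd) {
    const float s = load_state(s_f32, s_bf16, hs, n_per_head, i);
    store_state(s_f32, s_bf16, hs, n_per_head, i, g_val * s + upd);
}
```

Storing 3.14159f through a bf16 head reads back as 3.140625f, while an f32 head returns the value unchanged; that rounding at each store is the per-step carry error the threshold is meant to bound.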
### KL-gate plan + estimated pass / f32 fraction / speedup

KLD contribution ~ (eps*tau_h)^2 => dominated by the top-tau heads; removing the top ~25-40% by tau
cuts MeanKLD 1-2 orders. Honest estimate: ~30-40% f32 PASSES Same-top-p>=99.5% and brings MeanKLD to
1e-3..1e-2; strict <1e-3 may need ~40-50% f32. Find the exact fraction by sweeping T_thresh on the
EXISTING GateBench harness (noise floor -> 256-tok gate -> drift sweep 256/1024/2048/4096, both
models). Hybrid is STRICTLY safer than vLLM (vLLM = all-f32 temporal; we f32 exactly the unsafe
heads). Long-memory heads are the minority (~20-40%) => design band f in [0.30, 0.50].

Speedup (dense, bandwidth-bound recurrence, graphs-off): f32 3.38 ms/call, whole-bf16 1.73 (-49%);
hybrid ~ f_bytes*3.38 => f=0.30 -> 2.20 ms (-35%, ~70% of bf16 win); f=0.50 -> 2.54 ms (-25%, ~50%).
Throughput (dense f32 ~371-384=95% vLLM; whole-bf16 ~490=125%; vLLM ref 419): f=0.30 -> ~454 t/s
(~108% vLLM, gate-likely); f=0.50 -> ~430 t/s (~103% vLLM, most robust). MoE: smaller absolute
recurrence (31 GDN layers, H_v=32) + MUL_MAT_ID-bound step (lever B) => hybrid keeps the +13-25%
recurrence share KL-passing but does not alone close the MoE GEMM gap. Joint gate: nsys per-call bytes
down AND KL<1e-3 both models.
### Scope on top of BF16_SSM_STATE.diff

Reuse verbatim: gdn_state_t alias, templated load/store, gather template, dispatcher dtype-detect,
type_s/type_r cparams, CPU mirror, back-compat row convert, bf16 fill, test-backend-ops bf16 cases.
NEW: (1) classifier ~80-150 LOC (host fn over ssm_a/ssm_dt -> head_is_bf16[layer][head] + counts +
T_thresh cparam/CLI; optional Tier-2 calib+sidecar). (2) split cache layout ~150-250 LOC (BIGGEST:
llama-memory-recurrent.cpp alloc s_l_f32+s_l_bf16 by per-layer counts; build_rs builds two views +
passes head_slot; n_embd_s split). (3) kernel ~120-200 LOC (two bases + device map, runtime per-head
branch at load/in-place-store/gather/dispatch; math untouched; STATE_BF16 template stays as the
all-bf16 case). (4) ids/in-place per-head (state_dst two partition views; per-head gather; identity
unchanged). (5) CPU mirror per-head branch. (6) test-backend-ops MIXED-dtype-state case (decode +
multi-token prefill + keep_rs_t = the R2 corruption net). (7) gate: sweep T_thresh for min-f32 passing
KL<1e-3 + Same-top-p>=99.5% + drift both models; nsys per-call confirms f_bytes; md5 that T_thresh=inf
reproduces the f32 baseline (bit-exact opt-out preserved).

Net: principled path ABOVE vLLM throughput (dense ~430-454 vs vLLM 419) at-or-above vLLM precision,
KL-gated. Biggest new item = the split-tensor cache layout; classifier + kernel bounded; gate is a
threshold sweep on the existing harness.

Assisted-by: Claude:opus-4.8 [Claude Code]

---
## B - MoE GROUPED-GEMM + RE-GRAPH (label: B-moe-profile-design, THE GPU AGENT)

GPU-measured on DGX GB10 (sm_121), dev tree `~/llama-paged-dev` HEAD `2ee65c2` (patch 0024; the
decode kernels are byte-identical to 0023/f7409c2 - 0024 is the serving-only burst-reclaim).
`build-cuda`, model `q36-35b-a3b-nvfp4`, `llama-batched-bench -fa on -npp 128 -ntg 128`,
`LLAMA_KV_PAGED=1`. `decode_agg = S_TG t/s`. Batched-bench is the clean-kernel measure (no server
scheduler overhead), so its npl128 = ~743 t/s sits ABOVE the server final_benchmark 726 t/s; the
re-graph % gain below transfers to both paths (same kernels, same graph-disable).

### 1. MoE decode decomposition @npl128 - RE-CONFIRMED on the current HEAD

Fresh nsys `--cuda-graph-trace=node`, decode-isolated steady window, % of summed kernel GPU-time
(reproduces the 0023 profile in `OTHER_PATHS_INVESTIGATION.md` A.2/D within noise; window is
95.4% kernels-only busy / 96.8% with memcpy = GPU-compute-bound):

```
 42.3%  gated_delta_net_cuda        REC (shared w/ dense; ALREADY tuned past vLLM, 0018-0022: 84.6% vs 82.4% peak BW)
~29.5%  mul_mat_q<NVFP4>            MoE FP4 GEMM = grouped M-tile=64 (~27%, biggest MoE-specific) + router M-tile=128 (~2.3%)
~10.5%  nvjet_sm121 + cutlass (bf16) attn/gdn bf16 projections + the BF16 lm_head (path B)
  3.1%  k_get_rows_float            REC state gather
  2.7%  k_bin_bcast                 expert-combine + routing-weight scale + glue
  2.1%  ssm_conv_update_f32         REC
  2.0%  quantize_mmq_nvfp4          W4A4 activation-quant tax (3.25 ms/step; vLLM-W4A16 avoids it)
  1.8%  convert_unary               bf16<->f32 glue around the bf16 projections
  1.4%  MEMCPY-DtoD                 (SSM state copy fused away by 0018-0019; now small)
  0.5%  mul_mat_q_stream_k_fixup | 0.32% mm_ids_helper | 0.19% argsort | 0.14% gather_mmq_fp4 (0023 dedup) | 0.3% flash_attn
```

Bucketed: **Recurrence/SSM ~48% (shared, tuned past vLLM, NOT a MoE lever)**; **MoE FP4 GEMM+routing
~33%**; **bf16 projections ~10.5%**; act-quant tax ~2%; attention ~0.3%.
### 2. RE-GRAPH the MoE decode step - TESTED + MEASURED (the headline finding)

**Un-graphed status CONFIRMED, and the disable is purely conservative.** NVFP4 on sm_121 has
`get_mmvq_mmid_max_batch_turing_plus(NVFP4)=8` (`mmvq.cu:139-148`). At MoE decode `ne[2]=npl > 8`,
so every MUL_MAT_ID node trips the disable in `ggml_cuda_graph_check_compability`
(`ggml-cuda.cu:3278`: `node->ne[2] > mmvq_mmid_max => use_cuda_graph=false` for the WHOLE step).
BUT the path actually taken at `ne[2]>8` on Blackwell NVFP4 is `ggml_cuda_should_use_mmq()==true`
(`ggml-cuda.cu:2664`) -> the **grouped stream-k `mul_mat_q` id-branch**, launched on one stream with
**NO host sync** (verified: zero `cudaStreamSynchronize`/`Memcpy` in `mmq.cu`/`mmid.cu`). The stream
sync the disable guards against lives ONLY in the per-expert host-loop fallback, which is never
reached when `should_use_mmq` is true. So graphs are SAFE for the grouped path; the disable is a
conservative over-guard (upstream TODO + ggml-org/llama.cpp#18958).

**The lever (env-gated, bit-exact, built+measured here).** Relax the disable when the node takes
the grouped MMQ path. Patch (one function, one TU, 9 s incremental build):

```c
// ggml-cuda.cu  ggml_cuda_graph_check_compability(), [TAG_MUL_MAT_ID_CUDA_GRAPHS]
bool mmid_needs_sync = !ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max;
if (mmid_needs_sync && ggml_is_quantized(node->src[0]->type) &&
    getenv("LLAMA_MOE_FORCE_GRAPHS") != nullptr &&
    ggml_cuda_should_use_mmq(node->src[0]->type, cc, node->src[1]->ne[2], node->src[0]->ne[2])) {
    mmid_needs_sync = false; // grouped stream-k id-path is sync-free => graph-safe
}
if (mmid_needs_sync) { use_cuda_graph = false; ... }
```
**Measured A/B (2 reps each, rock-solid; OFF=stock graphs-disabled, ON=LLAMA_MOE_FORCE_GRAPHS=1):**

| npl | OFF decode_agg | ON decode_agg | gain | OFF %vLLM | ON %vLLM |
|----:|---------------:|--------------:|-----:|----------:|---------:|
| 8 | 226.0 | 226.4 | +0.2% (noise; ne2=8<=mmid_max: MMVQ path already graphs, FORCE inert) | 88% | 88% |
| 32 | 433.8 | 452.7 | **+4.4%** | 86.6% | **90.4%** |
| 64 | 589.0 | 605.9 | **+2.9%** | 85.9% | **88.3%** |
| 128 | 743.1 | 757.1 | **+1.9%** | 84.2% | **85.8%** |

(vLLM ref 256.5 / 500.8 / 686.1 / 882.2.) The win is largest at small batch (more host-launch
overhead relative to kernel work) and shrinks as kernels dominate at npl128 - exactly the ~1.7%
within-step launch-idle the prior agent measured at 98.3% GPU-busy. This REFINES the prior "graphs
won't help npl128" verdict: it DOES help (+1.9%, above noise), and helps npl32-64 substantially
(+3-4%). **Bit-exact by construction** (graph replay re-issues the identical kernel sequence with
identical args; FORCE only flips `use_cuda_graph`; the shipped f32 dense path already runs graphed).
**Bit-exact gate - both PASS (measured):** `test-backend-ops -o MUL_MAT_ID -b CUDA0` = **806/806,
CUDA0 OK** (the grouped FP4 kernel is untouched - the edit is host-only graph-compat logic); and a
**parallel-greedy np16** run (ne2=16>8, i.e. the grouped MMQ path under graphs ON vs eager OFF) gives
**byte-identical generated content ON==OFF** (md5 `04c4761...` both, 16/16 completions, diff empty).
**SHIP CANDIDATE -> patch 0025** (default-off env now; safe to flip to `should_use_mmq`-gated
default-ON since it is a pure, gated, bit-exact win).
### 3. Grouped-GEMM occupancy headroom - EXHAUSTED on this model (cheap levers), one structural lever left

- The FP4-MMA `mul_mat_q<NVFP4>` is **register-bound to 1 CTA/SM** (`__launch_bounds__(256,1)`,
  ~255 regs/thread = ~12.5% thread occupancy). Grouped grids: ~2048 and ~8192 64-wide tiles.
- **M-tile (col-tile) axis NEUTRAL** (runtime `LLAMA_MOE_DECODE_TILE`, npl128): TILE32 742.4 /
  TILE64 744.2 / TILE96 747.1 - all within 0.6%. Re-confirms patch 0015: this 256-tiny-expert model
  is **bandwidth/SSM-bound, not col-tile-occupancy-bound**, so the M-tile lever has nothing to bite.
- **Cheap occupancy lever already measured (patch 0017):** compile-time `GGML_CUDA_FP4_MINBLOCKS=2`
  on MoE @npl128 = **+0.4% (noise)**, and nsys showed it makes the dense FP4 GEMM **+8.7% SLOWER**
  (register-cap spills, occupancy did not usefully rise). So the cheap register-cap lever is spent.
- **Only untested grouped-GEMM lever = the structural `mmq_y`-down (nwarps=4 warp-remap)** - the
  0017-deferred P2. `mmq_y` tiles N (weight rows), not M, so shrinking it does NOT re-read weights
  (BW-neutral) and raises resident CTAs. Bit-exact (warp/fragment remap, same FP4-MMA math), but a
  real kernel change (the `nwarps x tile_C::I == mmq_y` static_assert coupling), and predicted
  BOUNDED on this BW-bound model. Not a cheap toggle; do only if the re-graph + M1 banks are
  insufficient.
### 4. W4A16 option (skip the act-quant, vLLM's Marlin choice) - NOT recommended

vLLM on GB10 runs **MARLIN W4A16** MoE (engine-log confirmed: "Your GPU does not have native FP4 ...
Marlin kernel"): bf16 activations NEVER quantized, FP4 weights dequant-in-kernel to bf16, **bf16
MMA**, under a full CUDA graph. It does this because CUTLASS's native-FP4 grouped GEMM is broken on
consumer sm_121 (whitelists only sm_100/103 datacenter Blackwell). llama instead runs **native
Blackwell FP4-MMA W4A4** grouped stream-k - a HIGHER arithmetic tier (GB10 FP4 = 2x INT8/BF16 rate).
The W4A4 act-quant tax llama pays (`quantize_mmq_nvfp4`) is **only ~2.0% of MoE decode** (3.25 ms/step
after the 0023 up/gate dedup). Adopting W4A16 to erase it would: (a) be **NOT bit-exact** (bf16 acts
!= FP4 acts -> different logits); (b) **descend to BF16-class MMA** (concede GB10's 2x FP4 rate - the
grouped GEMM, ~27% of the step, would run at HALF the MMA rate); (c) re-enter the **W4A16 occupancy
wall** (the prior GB10 W4A16 effort plateaued ~9 TFLOP/178 t/s). The BW saving is a sliver (acts are
tiny vs the ~weight read at M~4/expert), so it trades a bit-exact 2% for a non-bit-exact, slower,
occupancy-hostile path. **Reject.** The act-quant tax is better attacked bit-exactly via the down_proj
quantize retune (M1).
### 5. RANKED MoE levers (expected gain, bit-exactness, tractability)

1. **RE-GRAPH the MoE decode (this patch, -> 0025): MEASURED +4.4% npl32 / +2.9% npl64 / +1.9% npl128.**
   Bit-exact, tiny (one function, one TU), low-risk, built+measured. **The clear #1.** Helps the
   server path AND small-npl most (where llama was weakest: npl32 86.6%->90.4% of vLLM).
2. **down_proj act-quant retune (M1): bit-exact, bounded (act-quant is ~2%).** Cheap bank-shot;
   retune `quantize_mmq_nvfp4` block/grid (byte-identical output, like 0023's gather). Low single-%.
3. **Grouped-GEMM `mmq_y`-down warp-remap: bit-exact, BW-neutral, the 0017-deferred P2.** Speculative,
   predicted bounded on this BW-bound model; real kernel work. Only if 1+2 insufficient.
4. **M-tile / MINBLOCKS occupancy: EXHAUSTED** (measured neutral-to-negative). Do not pursue.
5. **W4A16: REJECT** (non-bit-exact, slower BF16 arithmetic, occupancy wall). Not even a clean opt-in.

**Net:** the bit-exact MoE-GEMM-region headroom from 1+2(+3) is ~3-6% at npl128 (MoE ~84% -> ~88-90%
of vLLM) and ~4-5% at npl32-64. Full MoE parity is NOT reachable from the GEMM/launch track alone:
the remaining gap is the grouped GEMM (~27%, FP4-MMA at the LPDDR5x BW floor - hardest regime, vLLM
ships purpose-built Marlin-NvFp4) + the bf16 projections (~10.5%). The recurrence (~48%) is already
PAST vLLM. The single highest-ROI, ship-now item is the re-graph patch (0025).

Assisted-by: Claude:opus-4.8 [Claude Code]

---
## C - STRUCTURAL DENSE RESIDUAL: lm_head + scheduling (label: C-structural-design, READ-ONLY no GPU)

Source-confirmed on DGX `~/llama-paged-dev` @ HEAD `2ee65c2` plus committed traces
(`CRITICALPATH_GAP_ANALYSIS.md`, `A2_CUDAGRAPH_DECODE.md`, `F16_DENSE_RESIDUAL_PROBE.md`,
`OTHER_PATHS_INVESTIGATION.md` sec B). Numbers are dense q36-27b-nvfp4 @npl128: step ~333 ms
(384 t/s), gap to vLLM (419 t/s = 305 ms) is ~27-28 ms/step. **Verdict: lever C is a near
dead-end for a bit-exact dense win; rank it LAST of A/B/C/D for the bit-exact default.**

### How the lm_head is stored, and why it routes to cublas/nvjet (not the tuned FP4 MMQ)

`output.weight` is **GGML_TYPE_BF16** (NOT quantized): the `--tensor-type attn/ffn=nvfp4`
recipe converts only attn+ffn, leaving the logit-sensitive final projection (and tok_embd)
at base BF16. Confirmed: `llama-model.cpp:1460` creates the NVFP4 scale `output_s` ONLY
`if (output->type == GGML_TYPE_NVFP4)`, so for the BF16 head `model.output_s` is null, and
`build_lora_mm` (`llama-graph.cpp:1087`) collapses to a plain `ggml_mul_mat`. In
`ggml_cuda_mul_mat` dispatch (`ggml-cuda.cu:2599-2629`): `use_mul_mat_q`/`use_mul_mat_vec_q`
both require `ggml_is_quantized(src0)` (BF16 fails => the tuned FP4 path is INELIGIBLE);
MMF is gated off for the wide `vocab x 128` shape; `use_batched_cublas_bf16` is true but the
batched branch additionally needs `src1->ne[2]*ne[3] > 1` (the 2D decode lm_head fails it).
Falls through to `ggml_cuda_op_mul_mat_cublas` BF16 branch (`:1662`): downcast F32 act ->
BF16, `cublasGemmEx(16BF x 16BF -> COMPUTE_32F)` = **nvjet_sm121**, output rounded BF16 ->
upcast F32. Shape M=vocab(151936) x N=128 x K=5120: a tall-skinny output GEMM reading the
ENTIRE BF16 head weight for 128 columns = inherently **memory-bound**. On the dense model
this is the ONLY non-FP4 cublas GEMM in decode. Cost: nvjet = 11.91 ms = 3.1-3.6% of step.

**CRITICAL CORRECTION the team must carry:** the baseline is NOT "f32 lm_head". The cublas
BF16 branch downcasts the activation F32->BF16 AND rounds the output to BF16. Today's
"bit-exact reference" logits are ALREADY BF16-precision on both input and output. So
"bit-exact" for lever C only protects BF16-rounded logits, which is exactly why option (c)
is "essentially bit-exact" and why any meaningful lm_head speedup requires changing the dtype.
### lm_head bit-exact lever + gain - bandwidth math kills it

nvjet moves the full BF16 head weight in 11.9-12.2 ms = ~195-199 GB/s = ~72% of GB10's
273 GB/s peak: it is ALREADY one of the most bandwidth-efficient kernels in the step (the
overall decode step runs at only ~40% util / ~110 GB/s). The bit-exact ceiling is the
remaining bandwidth headroom only:

- **(c) keep BF16 weight, swap the kernel** (custom skinny wide-vocab streaming GEMM, or a
  hand-picked cublasLt algo/workspace heuristic for the thin-N/huge-M shape). The ONLY
  essentially-bit-exact option. Perfect memory-bandwidth saturation 199 -> 273 GB/s =
  11.9 -> ~8.7 ms = **save ~3 ms = ~0.9-1.0% of step = ~11% of the 27 ms gap.** REALISTIC
  gain: 0 to 3 ms, leaning toward 0 - cublasLt already selected nvjet as its best algo, so
  beating it on a pure weight-stream is not guaranteed, and it is high kernel-writing effort.
  (F16 probe independently estimates the same nvjet recovery as "~5 ms, uncertain - may
  already run TF32".)

Structural reason it is near-zero: the head must read the entire BF16 weight for 128 columns;
you CANNOT cut those weight bytes without changing the dtype. Bit-exactness and the only real
speedup (fewer weight bytes) are mutually exclusive here.
### lm_head NON-bit-exact options (excluded from any vLLM-parity claim)

- **(a) NVFP4-quantize the head -> tuned FP4 MMQ.** Biggest win, BREAKS bit-exactness.
  Weight ~4x fewer bytes (BF16 ~1.5-2.4 GB -> NVFP4 ~0.4-0.6 GB) AND rides the already-tuned
  `mul_mat_q<NVFP4>` (patch 0017): memory floor drops ~4x = **save ~8-9 ms = ~2.5% of step**.
  BUT NVFP4 < BF16 precision => different logit bits, can flip greedy argmax, AND it is
  **UNFAIR vs vLLM** (which keeps its LM head BF16). Same opt-in non-bit-exact bucket as the
  shelved bf16-SSM / f16-glue; exclude from parity claims.
- (b) FP8 / Q8_0 head: smaller error than NVFP4 but still != BF16 bits AND not on the tuned
  FP4 MMQ path, so it buys less speed than (a). No reason to prefer.
- (existing knob) `GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` (`ggml-cuda.cu:1610`): 16-bit accumulate
  on this exact GEMM, faster but NON-bit-exact (16F vs 32F accumulate). Non-bit-exact track only.

### Scheduling / launch bit-exact lever + gain - ~0.05%

The decode step is GPU-bound at 99.94% (node-level trace, single stream, graphId replayed).
CUDA graphs ALREADY collapse within-step launch latency: exposed idle = 0.225 ms/step = 0.06%,
zero gaps > 5 us, graph ON vs OFF = +0.13% @npl128 (noise). Graphs are NOT a pending dense
lever - they are already in effect. The ONLY graph-non-covered overhead is the BETWEEN-step
host gap: ggml rebuilds the cgraph each step with a NEW `cgraph->uid`, so the uid fast-path in
`ggml_cuda_graph_update_required` never fires and the host re-dispatches ~3100 launches between
graph launches. MEASURED exposed cost: ~0.2 ms/step = ~0.05% (most of the ~2 ms host loop
overlaps GPU compute). **Bit-exact lever:** make the cgraph PERSISTENT/reused across decode
steps so the uid fast-path fires (replay-only => bit-exact). GAIN ~0.2 ms/step = ~0.05%, medium
effort (touches ggml graph lifetime), second-order. No other per-step host overhead is exposed
(the host loop is HIDDEN under GPU compute until the kernels get fast enough to drop GPU-busy
below host time).
### Quantified realistic bit-exact total for lever C

lm_head kernel swap 0 to ~3 ms (upper ~0.9%, realistically ~0) + persistent cgraph ~0.2 ms
(~0.05%) = **combined bit-exact ceiling ~3.2 ms = ~0.95% of the 333 ms step = ~12% of the
27 ms gap.** Moves dense parity 91.8% -> at most ~92.7%, realistically <0.5% net (<1.5 ms).
The "~3-4%" in the brief is the lm_head's TOTAL cost, NOT what is bit-exactly recoverable: only
the bandwidth headroom (~3 ms) and host gap (~0.2 ms) are recoverable; the other ~9 ms is the
irreducible BF16 weight stream BOTH engines pay (vLLM keeps a BF16 head too). **Rank C LAST for
the bit-exact default.** Its one durable note for the team: the lm_head logits are ALREADY
BF16-rounded (not f32), which both narrows what option (c) must preserve and is exactly why the
only meaningful lm_head speedup requires a dtype change (= non-bit-exact + unfair vs vLLM).

Source (DGX @2ee65c2): `llama-model.cpp:1460`, `llama-graph.cpp:1087`, `qwen35.cpp:222` /
`qwen35moe.cpp:246`, `ggml-cuda.cu:2599-2629` / `:1662-1690` / `:1610`.

Assisted-by: Claude:opus-4.8 [Claude Code]

---
# RANK + PLAN - the final synthesis (build order, A handoff, B/C/D queue)

This is the decision section: all four levers measured/designed, ranked by gain x tractability
x gate, the concrete A build plan, and the ordered B/C/D queue with each one's trigger. Base:
clean pin-synced llama.cpp 9d5d882d, bit-exact md5 == 0023. Dense gap to vLLM ~27 ms/step (384
vs 419 t/s @npl128); MoE ~82% (726 vs 882). Recurrence already PAST vLLM (84.6% vs 82.4% peak BW).
## (1) Per-lever scorecard: gain (dense + MoE), tractability, gate

| Lever | Dense decode gain | MoE decode gain | Tractability | Quality gate | Bit-exact? |
|-------|-------------------|-----------------|--------------|--------------|------------|
| **B re-graph (patch 0025)** | ~0 (dense already graphed) | **MEASURED +4.4% npl32 / +2.9% npl64 / +1.9% npl128** (MoE 84%->86% .. 90% of vLLM) | **VERY HIGH - already built+measured**, 1 fn / 1 TU / 9 s build | md5 byte-identical: **PASSED** (MUL_MAT_ID 806/806 + parallel-greedy md5 identical) | YES |
| **A hybrid per-head SSM** | **+25% to +35%/call recurrence -> ~430-454 t/s = 103-108% of vLLM** (ABOVE vLLM) | keeps the +13-25% recurrence share KL-passing; does NOT alone close the MoE GEMM floor | MEDIUM-HIGH - builds on `BF16_SSM_STATE.diff`; biggest new piece = split-dtype cache layout (~150-250 LOC) | **KL<1e-3 + Same-top-p>=99.5% + drift sweep 256/1024/2048/4096 both models**; md5 that T_thresh=inf == f32 baseline | f32 default YES; hybrid is at-or-above vLLM precision, KL-gated |
| **B M1 down_proj retune** | ~0 | bit-exact, bounded (act-quant is ~2% of MoE step) - low single-% | HIGH - block/grid retune of `quantize_mmq_nvfp4`, byte-identical output | md5 byte-identical | YES |
| **B mmq_y-down warp-remap** | small (shared FP4 GEMM) | bit-exact, BW-neutral, predicted BOUNDED on this BW-bound model | LOW-MEDIUM - real kernel change (nwarps x tile_C coupling) | test-backend-ops MUL_MAT_ID + md5 | YES |
| **C lm_head kernel swap** | 0 to ~3 ms (~0.9%, realistically ~0; uncertain it beats nvjet) | ~0 | LOW payoff - high kernel-writing effort, not guaranteed to beat cublasLt | md5 (BF16-rounded logits) | YES (essentially) |
| **C persistent cgraph** | ~0.2 ms (~0.05%) | ~0 (B's re-graph already covers MoE host gap) | MEDIUM - touches ggml graph lifetime, for 0.05% | replay-only = bit-exact, md5 | YES |
| **D f16 glue (Option 2)** | ~11-16 ms = 40-60% of residual -> 91.8% -> ~95-96% (NOT a close) | ~0 (dense-only lever) | LOW-MEDIUM - new norm.cu f16 kernels, multi-file | **NON-bit-exact, must pass the SAME KL<1e-3 that plain bf16-SSM FAILED** | NO - opt-in only |

Notes that decide the ranking:

- **B's re-graph helps ONLY MoE** (dense decode is already graphed; the disable is the MoE
  MUL_MAT_ID `ne[2]>8` over-guard). It is the single highest-ROI item because it is already
  built, measured, and gated - zero remaining build risk, just a default flip.
- **A is the only lever that moves dense ABOVE vLLM** (103-108%) and it does it at-or-above
  vLLM precision (vLLM keeps ALL temporal state f32; A keeps f32 on exactly the unsafe heads).
  It reaches the largest mass (recurrence = 49.3% dense / ~48% MoE = ~6x what D can touch).
- **C and D are dead-or-tiny for the bit-exact default.** C's bit-exact ceiling is <1% with
  real risk; D is non-bit-exact, dense-only, and tops out at ~96% parity (not a close).
## (2) Ranked build order (gain x tractability x gate) - A confirmed as the build lead

1. **B re-graph (patch 0025) - LAND NOW.** Already built + measured + both gates PASSED. The
   only remaining decision is flipping the default from env-gated (`LLAMA_MOE_FORCE_GRAPHS`) to
   `should_use_mmq`-gated default-ON. Zero new build, measured +1.9-4.4% MoE, bit-exact. This
   is not a "build" so much as a "ship"; it precedes A because it is free and de-risked.
2. **A hybrid per-head SSM - THE BUILD LEAD (user-greenlit, CONFIRMED by evidence).** The only
   lever that takes dense ABOVE vLLM and the only principled fix for the bf16-SSM KL failure.
   Largest reachable mass, bounded build on an existing diff, KL-gated. Build plan in (3).
3. **B M1 down_proj act-quant retune** - cheap bit-exact bank-shot, run after A while the GPU
   is warm. Bounded (~2% act-quant tax), byte-identical-output retune.
4. **B mmq_y-down warp-remap** - only if 1+2+3 leave MoE short of target; real kernel work,
   predicted bounded on this BW-bound model.
5. **C persistent cgraph** - a bit-exact ~0.05% micro-win for the default; build only if a
   broad graph-lifetime refactor is happening anyway (not worth a standalone effort).
6. **C lm_head BF16 kernel swap** - near-zero, uncertain, high effort. Effectively shelved.
7. **D f16 glue (Option 2 norm.cu kernels)** - LAST, opt-in only, non-bit-exact, dense-only,
   gated by the same KL threshold bf16-SSM failed. Build only if the last ~4% dense is chased
   AFTER A lands and is shown insufficient. Skip Option 1 entirely (cast overhead eats the win).

**Why A over B as the lead, despite B's re-graph being measured:** B's re-graph is already
DONE - it is a ship, not a build. For the NEW build effort, A is correctly the lead: it is the
only lever with a path ABOVE vLLM on dense, it attacks the largest mass (recurrence, shared by
both models), and it converts the already-proven whole-bf16 win (490 t/s = 125% vLLM, but KL
FAIL) into a KL-passing form. B's remaining items (M1, mmq_y) are bounded single-% bank-shots
that cannot reach parity on their own (the residual MoE gap is the FP4 grouped GEMM at the
LPDDR5x BW floor + bf16 projections, both structural). So: ship 0025, then build A, then bank B.
## (3) CONCRETE A BUILD PLAN (hand to the build agent)

**Objective:** a per-head mixed-dtype SSM state cache - f32 on long-memory heads, bf16 on
fast-decaying heads - that captures 50-70% of the whole-bf16 recurrence win (-25% to -35%/call)
while PASSING KL<1e-3. Builds directly on the existing `BF16_SSM_STATE.diff` (untracked backup
on DGX `~/llama-paged-dev`). Target dense ~430-454 t/s (103-108% of vLLM 419), MoE +13-25%
recurrence share KL-passing. f32 default stays bit-exact (md5 == 0023 baseline).

**Reuse VERBATIM from BF16_SSM_STATE.diff** (do NOT rewrite): `gdn_state_t<STATE_BF16>` alias,
templated `__bfloat162float` load / `__float2bfloat16` store, the gather template, the
dtype-detect dispatcher, `type_s`/`type_r` cparam wiring, the CPU mirror, the back-compat row convert,
the bf16 fill path, and the test-backend-ops bf16 cases.

**NEW work items (in build order):**

1. **Head classifier (~80-150 LOC, do first, no GPU).** Host function over `ssm_a` (tensor
   `SSM_A_NOSCAN`, `[n_v_heads]`, = `-exp(A_log)`) and `ssm_dt` (tensor `SSM_DT`, `[n_v_heads]`):
   for each (layer il, head h) compute `tau_h = 1 / (|ssm_a[il][h]| * softplus(ssm_dt[il][h]))`;
   set `head_is_bf16[il][h] = (tau_h <= T_thresh)`. Emit per-layer `n_f32`/`n_bf16` counts +
   the `head_slot[il][h] = {is_bf16, local_idx}` map. Add cparam `ssm_hybrid_tau_thresh` / CLI
   `--ssm-bf16-tau` (inf => all-f32 bit-exact default; 0 => all-bf16; hybrid band in between).
   Runs in microseconds at load, no data, no GPU. (Optional Tier-2: a short calibration pass
   measuring per-head time-mean of actual `exp(g[h,t])` -> model-hash sidecar; only if Tier 1
   lands just above the gate.)
2. **Split-dtype cache layout (~150-250 LOC - THE BIGGEST piece).** In
   `llama-memory-recurrent.cpp`: replace the single `s_l` ([S_v,S_v,H,slots] f32) with two
   dtype-homogeneous sub-caches sized by per-layer head COUNT (this is what saves the bytes):
   `s_l_f32 [S_v*S_v*n_f32, slots]` f32 + `s_l_bf16 [S_v*S_v*n_bf16, slots]` bf16. In
   `build_rs` (`delta-net-base.cpp`): build the two views + pass the `head_slot` map; split the
   `n_embd_s` accessors. q/k/v/g/beta KEEP natural head order (no activation permute - they come
   from the projection GEMMs). Coarser per-LAYER fallback is REJECTED (long-memory heads span
   most layers => too coarse; per-head is the right granularity).
3. **Recurrence kernel: single launch, runtime per-head branch (~120-200 LOC).** Pass BOTH
   bases (`const float* s_f32_base`, `const nv_bfloat16* s_bf16_base`) + the two `state_dst`
   partition views + the device `head_slot[]` map. Branch on `head_slot[h_idx].is_bf16` at the
   load site, the in-place store site, the gather, and the dispatcher. The branch is UNIFORM
   within a block (all threads share `h_idx` = `blockIdx.x`) => **NO warp divergence**. The
   recurrence math (the ~140-260 region) stays byte-for-byte f32-register, untouched. `keep_rs_t`
   snapshots stay f32 (op-output scratch). The `STATE_BF16` template stays as the all-bf16
   special case.
4. **ids / in-place per-head.** `state_dst` becomes two partition views; `gdn_gather_nonident`
   becomes per-head dtype-aware (copies each head's `S_v*S_v` block from the right partition of
   `cache[ids[s]]`; still disjoint-scratch race-free). Each head writes its own partition slot
   (read==write slot, loaded to registers before store) => the identity / in-place property is
   preserved.
5. **CPU mirror (ops.cpp)** per-head dtype branch for CI / CPU-offload parity.
6. **test-backend-ops: a MIXED-dtype-state GATED_DELTA_NET case** (some heads f32, some bf16)
   vs the CPU ref, covering decode + multi-token prefill + `keep_rs_t` (this is the R2
   silent-corruption net - do NOT skip it).
7. **Gate (GPU, GateBench harness, already built).** Sweep `T_thresh` to find the MINIMUM f32
   fraction that passes: noise floor first, then the 256-tok KL gate, then the long-context
   drift sweep 256/1024/2048/4096, BOTH models (dense q36-27b + MoE q36-35b-a3b). Pass bar =
|
||||
**KL<1e-3 AND Same-top-p>=99.5% AND drift bounded**. nsys per-call confirms `f_bytes` =
|
||||
`(n_f32 + n_bf16/2)/H` dropped. md5 that `T_thresh=inf` reproduces the f32 baseline (the
|
||||
bit-exact opt-out MUST be preserved).
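
Work item 1 (the head classifier) can be sketched on the host as follows. This is a minimal Python illustration of the formula only, not the C++ the plan calls for: `classify_heads` and its argument names are hypothetical, the inputs stand for one layer's `ssm_a` and `ssm_dt` values, and the comparison is taken exactly as written (`tau_h <= T_thresh` selects bf16). How the endpoint values of `--ssm-bf16-tau` map onto that comparison is not modeled here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def classify_heads(ssm_a, ssm_dt, t_thresh):
    """Classify one layer's heads by memory timescale tau_h.

    Returns (head_slot, n_f32, n_bf16), where head_slot[h] is
    (is_bf16, local_idx) and local_idx indexes into that dtype's partition.
    Hypothetical helper: the names and return shape are assumptions.
    """
    head_slot = []
    n_f32 = 0
    n_bf16 = 0
    for a, dt in zip(ssm_a, ssm_dt):
        tau = 1.0 / (abs(a) * softplus(dt))
        if tau <= t_thresh:
            # Short-memory head: stored in the bf16 partition.
            head_slot.append((True, n_bf16))
            n_bf16 += 1
        else:
            # Long-memory head: stays in the f32 partition.
            head_slot.append((False, n_f32))
            n_f32 += 1
    return head_slot, n_f32, n_bf16


# Toy layer with three heads (made-up values, not taken from a model).
slots, n_f32, n_bf16 = classify_heads([-1.0, -0.01, -2.0], [0.0, 0.0, 0.0], 10.0)
print(slots, n_f32, n_bf16)
```

The per-layer counts are what would size the two sub-caches in item 2, and `head_slot` is the map the kernel in item 3 would branch on.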
|
||||
|
||||
**Expected result (from the physics + the whole-bf16 measurement):** the KLD contribution per head
~ `(eps*tau_h)^2` (eps ~ 2^-8 ~ 3.9e-3) is dominated by the top-tau heads, so keeping the top
~25-40% of heads by tau in f32 cuts MeanKLD by 1-2 orders of magnitude. Design band **f32 fraction f in [0.30, 0.50]**:
|
||||
- f=0.30 (n_bf16/H=0.70): `f_bytes`=0.65 -> ~2.20 ms/call (-35%), captures ~70% of the bf16
|
||||
win -> dense **~454 t/s = ~108% of vLLM** (gate-likely, MeanKLD ~1e-3..1e-2).
|
||||
- f=0.50: `f_bytes`=0.75 -> ~2.54 ms/call (-25%), captures ~50% -> dense **~430 t/s = ~103% of
|
||||
vLLM** (most robust pass; strict KL<1e-3 may need this fraction).
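
The `f_bytes` figures in the two bullets follow directly from the gate formula `(n_f32 + n_bf16/2)/H`. A small Python check, assuming only that a bf16 head costs half the bytes of an f32 head (the helper name is illustrative):

```python
def f_bytes(f32_fraction):
    """Relative recurrent-state bytes per call when this fraction of heads stays f32."""
    bf16_fraction = 1.0 - f32_fraction
    return f32_fraction + bf16_fraction / 2.0


for f in (0.30, 0.50):
    fb = f_bytes(f)
    print(f"f={f:.2f}: f_bytes={fb:.2f}, bytes saved={100 * (1 - fb):.0f}%")
```

The two design-band endpoints reproduce the 0.65 and 0.75 values quoted above; the ms/call and t/s figures additionally depend on measured baselines that this sketch does not model.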
|
||||
|
||||
The exact f is found by the T_thresh sweep. **MoE:** A keeps the +13-25% recurrence share
|
||||
KL-passing but does NOT by itself close the MoE GEMM gap (that is B). Joint ship gate = nsys
|
||||
per-call bytes down AND KL<1e-3 for BOTH models; neither alone ships. Hybrid is STRICTLY safer
|
||||
than vLLM (we keep f32 exactly where bf16 is unsafe; vLLM keeps all-f32 everywhere).
|
||||
|
||||
## (4) Ordered B / C / D queue with build triggers
|
||||
|
||||
- **B-1 re-graph default flip (patch 0025): trigger = NOW / immediate.** Already built, measured
|
||||
(+1.9-4.4% MoE), both gates PASSED. Flip env-gated -> `should_use_mmq`-gated default-ON. No
|
||||
dependency on A. Ship first.
|
||||
- **B-2 down_proj act-quant retune (M1): trigger = after A's kernel work lands** (reuse the warm
|
||||
GPU window). Bit-exact block/grid retune of `quantize_mmq_nvfp4`, byte-identical output.
|
||||
Bounded ~1% (act-quant is ~2% of the MoE step). Run it; it is cheap.
|
||||
- **B-3 mmq_y-down warp-remap: trigger = ONLY if B-1 + B-2 + A leave MoE below the target.**
|
||||
Real kernel change, BW-neutral, predicted bounded on this BW-bound model. Speculative; gate by
|
||||
test-backend-ops MUL_MAT_ID + md5.
|
||||
- **C-1 persistent cgraph: trigger = ONLY if a broader ggml graph-lifetime refactor is already
|
||||
in flight.** Standalone it is ~0.05%, not worth the graph-lifetime touch. Bit-exact (replay).
|
||||
- **C-2 lm_head BF16 kernel swap: trigger = effectively NEVER for the default** (0 to ~3 ms,
|
||||
uncertain it beats nvjet, high effort). Documented; not queued.
|
||||
- **D Option 2 f16-glue norm.cu kernels: trigger = ONLY if dense parity is still wanted AFTER A
|
||||
lands AND A is shown insufficient, AND an opt-in non-bit-exact mode is acceptable.** Multi-file,
|
||||
recovers ~11 ms (norm/elementwise band), gated by the SAME KL<1e-3 that plain bf16-SSM failed.
|
||||
Skip Option 1 (net-zero cast overhead). Lowest priority of all.
|
||||
|
||||
**Bottom line:** ship 0025 now (free, measured MoE +1.9-4.4%), then build A (the only path
|
||||
ABOVE vLLM on dense, KL-gated, ~430-454 t/s = 103-108% of vLLM), then bank B-2/B-3 on MoE. C is
|
||||
last for the bit-exact default (<1%, dead-end); D is opt-in-only and dense-only, behind the KL
|
||||
gate, only if the last ~4% is ever chased. The recurrence is already PAST vLLM; A converts that
|
||||
proven win into a KL-passing form, and the MoE GEMM floor (the structural residual) is the one
|
||||
piece no bit-exact lever fully closes - vLLM ships purpose-built Marlin-NvFp4 there.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
@@ -1,184 +0,0 @@
|
||||
# SSM decode fix - qwen35 gated-DeltaNet in-place recurrent-state write-back (patch 0018)
|
||||
|
||||
Follow-up to `A2_CUDAGRAPH_DECODE.md`. That analysis located the real decode lever
|
||||
on the Qwen3.6 hybrid-SSM models (arch `qwen35`, 48 gated-DeltaNet linear-attn
|
||||
layers : 16 full-attn layers) and ruled out the FP4 GEMM, CUDA graphs, the host
|
||||
loop, and attention. The corrected per-kernel + per-memcpy decode decomposition
|
||||
attributed ~67% of decode GPU time to SSM-state plumbing:
|
||||
|
||||
gated_delta_net 23.4% | get_rows state-gather 21.9% | D2D state-copy 18.9% (= ~67%)
|
||||
FP4 matmul ~28% | full attention 0.4%
|
||||
|
||||
Root cause: per SSM layer per step the fused `gated_delta_net` op wrote its new
|
||||
recurrent state into graph scratch, then a **separate `ggml_cpy` persisted the
|
||||
full ~225 MB state into the recurrent-state cache** (1584 D2D ops, 356 GB, 18.9%
|
||||
of decode over the profile window). vLLM's `fused_recurrent_gated_delta_rule`
|
||||
keeps the state in place (no copy).
|
||||
|
||||
## STEP 1 (this patch): kill the per-layer D2D state copy-back
|
||||
|
||||
`ggml_gated_delta_net_inplace` (new builder, `src[6] = state_dst`) makes the op
|
||||
write its final recurrent state **directly into the active sequences' contiguous
|
||||
cache slot** (at `kv_head`), eliminating the copy-back. The op output then carries
|
||||
only the attention scores. SSM arithmetic is unchanged - only the destination
|
||||
pointer of the final-state write moved.
|
||||
|
||||
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new `ggml_gated_delta_net_inplace` op
|
||||
builder. `dst` retains the same `[attn | state]` layout so the attention-output
|
||||
view is identical; the state region is left unused.
|
||||
- `ggml/src/ggml-cuda/gated_delta_net.cu`: kernel/launch/op-handler thread an
|
||||
optional `state_dst`; final-state (`!keep_rs`) write targets it when present.
|
||||
- `ggml/src/ggml-cpu/ops.cpp`: K==1 path operates in place on the `state_dst`
|
||||
cache view (kept CPU-correct for non-CUDA runs / CI).
|
||||
- `src/models/delta-net-base.cpp`: `build_recurrent_attn` uses the in-place op on
|
||||
the fused decode/prefill path and drops the `ggml_cpy`. The rollback path
|
||||
(`n_rs_seq > 0`) is unchanged. The get_rows state gather is unchanged (STEP 2).
|
||||
|
||||
### Correctness gate
|
||||
|
||||
- **Bit-identical**: greedy (`--temp 0 --seed 1`) `llama-completion` output on
|
||||
`q36-27b-nvfp4` is byte-for-byte identical between the copy-back baseline and the
|
||||
in-place build (`diff` -> IDENTICAL).
|
||||
- **Coherent**: dense + MoE multi-paragraph greedy generations are on-topic and
|
||||
correct (Rayleigh scattering; Roman Empire 27 BCE / Actium 31 BCE; primes;
|
||||
additive vs subtractive color).
|
||||
- Gated to the `qwen35` / gated-DeltaNet fused path; rollback and all non-SSM
|
||||
archs untouched (they never construct the in-place op).
|
||||
|
||||
### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off)
|
||||
|
||||
Dense `q36-27b-nvfp4`:
|
||||
|
||||
| npl | baseline | in-place | delta   | % of vLLM (391 @128) |
|-----|----------|----------|---------|----------------------|
| 32  | 113.74   | 136.39   | +19.9%  | -                    |
| 128 | 146.23   | 180.53   | +23.5%  | 37.4% -> 46.2%       |
|
||||
The npl-128 result lands on the predicted copy-removal ceiling (~180 t/s).
|
||||
|
||||
MoE `q36-35b-a3b-nvfp4`:
|
||||
|
||||
| npl | baseline | in-place | delta   |
|-----|----------|----------|---------|
| 32  | 246.79   | 279.41   | +13.2%  |
| 128 | 313.36   | 372.62   | +18.9%  |
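
The delta and "% of vLLM" columns in the two tables above are plain ratios of the measured `S_TG` values. A short Python check of that arithmetic (the helper names are illustrative; 391 t/s is the vLLM dense reference named in the table header):

```python
VLLM_DENSE_NPL128 = 391.0  # t/s, reference from the dense table header


def pct_delta(baseline, new):
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def pct_of(value, reference):
    """`value` as a percentage of `reference`."""
    return value / reference * 100.0


dense = {32: (113.74, 136.39), 128: (146.23, 180.53)}
moe = {32: (246.79, 279.41), 128: (313.36, 372.62)}

for name, rows in (("dense", dense), ("moe", moe)):
    for npl, (base, inplace) in rows.items():
        print(f"{name} npl={npl}: {pct_delta(base, inplace):+.1f}%")

base128, inplace128 = dense[128]
print(f"dense npl=128: {pct_of(base128, VLLM_DENSE_NPL128):.1f}% -> "
      f"{pct_of(inplace128, VLLM_DENSE_NPL128):.1f}% of vLLM")
```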
|
||||
|
||||
### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
|
||||
|
||||
The D2D state-copy bucket collapsed:
|
||||
|
||||
| bucket     | before                    | after                     |
|------------|---------------------------|---------------------------|
| MEMCPY D2D | 18.9% / 356 GB / 1584 ops | 0.23% / 2.93 GB / 734 ops |
|
||||
|
||||
The ~225 MB/copy recurrent-state copy-back is gone (122x fewer D2D bytes); the
|
||||
residual D2D is the small conv-state copies. With it removed, the remaining decode
|
||||
buckets are `gated_delta_net` 26.0%, FP4 matmul ~37.5%, and `get_rows` state
|
||||
gather 18.8%.
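
The "~225 MB/copy" and "122x" figures can be recovered from the nsys bucket totals. A quick Python check, assuming the reported totals are decimal gigabytes and averaging over every D2D op in the bucket (which also includes the small conv-state copies):

```python
GB = 1e9  # assumption: nsys totals are decimal gigabytes
MB = 1e6

before_bytes, before_ops = 356 * GB, 1584
after_bytes, after_ops = 2.93 * GB, 734

per_copy_before_mb = before_bytes / before_ops / MB  # average bytes per D2D op, before
per_copy_after_mb = after_bytes / after_ops / MB     # average bytes per D2D op, after
reduction = before_bytes / after_bytes               # how many times fewer D2D bytes

print(f"before: {per_copy_before_mb:.0f} MB/op, after: {per_copy_after_mb:.1f} MB/op")
print(f"D2D bytes reduced {reduction:.0f}x")
```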
|
||||
|
||||
## STEP 2 (not in this patch): fuse the get_rows state gather
|
||||
|
||||
The state gather is now the largest single non-GEMM bucket (18.8%). It is a pure
|
||||
materialization: `build_rs` calls `ggml_get_rows(cache, s_copy_main)` to copy each
|
||||
sequence's previous state into a contiguous scratch tensor before the op reads it.
|
||||
`ggml_ssm_scan` already avoids this by taking the `ids` tensor (`src[6]`) and
|
||||
reading the per-seq state directly from the full cache. The same fusion applies
|
||||
here: give `ggml_gated_delta_net` an `ids` source, read `curr_state` from
|
||||
`cache + ids[seq]*D` in the kernel, and pass the full cache via the `build_rs`
|
||||
`get_state_rows` lambda (mirroring `mamba-base.cpp`). Predicted ceiling with both
|
||||
steps: ~247 t/s (~63% of vLLM dense @128), GEMM untouched.
|
||||
|
||||
## Verdict on the path to parity
|
||||
|
||||
STEP 1 removes ~half of the SSM plumbing overhead and is the dominant, lowest-risk
|
||||
lever; it is bit-exact and shipped here. STEP 2 (gather fusion) has a proven ggml
|
||||
precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLLM
|
||||
after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel
|
||||
track. No paged/graph/block-table change can move decode on this model (full
|
||||
attention is 0.4% of decode).
|
||||
|
||||
## STEP 2 (patch 0019): fuse the recurrent-state gather into the op
|
||||
|
||||
After Step 1 the largest non-GEMM decode bucket was the recurrent-state
|
||||
`get_rows` gather (18.8% of decode GPU time): `build_rs` materialized each
|
||||
sequence's prior state into a contiguous scratch via `ggml_get_rows` before the
|
||||
gated-DeltaNet op read it. Step 2 eliminates that materialization, mirroring
|
||||
`ggml_ssm_scan`'s `ids` source.
|
||||
|
||||
`ggml_gated_delta_net_inplace_ids` takes the FULL recurrent-state cache plus the
|
||||
`s_copy` ids (`src[5]` = full cache `[S_v, S_v, H, n_rs_slots]`, `src[7]` = ids,
|
||||
`op_param[1]` = `rs_head`) and reads each sequence's prior state directly from
|
||||
`cache[ids[seq]]`. Combined with Step 1's in-place write the op now reads AND
|
||||
writes the cache directly: no recurrent-state materialization at all. The
|
||||
`build_recurrent_attn` fused path feeds the full cache and ids through the
|
||||
`build_rs` `get_state_rows` lambda exactly like `mamba-base.cpp`, keeping the
|
||||
`rs_zero` clear and the extra-states copy around the op.
|
||||
|
||||
### Race-free by construction (CUDA)
|
||||
|
||||
In-place write plus an ids read of the same cache is only safe when the read slot
|
||||
equals the write slot. `s_copy(s) = cells[s + head].src0`, which is identity
|
||||
(`rs_head + s`) for stable continuing sequences (the entire AR decode path) but
|
||||
can remap on sequence reorder or `rs_zero` (e.g. multiple new sequences in one
|
||||
prefill ubatch). The kernel handles both per (seq, head) block on device:
|
||||
|
||||
- identity sequences read `s0` in place from the destination slot `state_dst`
|
||||
(the kernel loads all of `s0` into registers before it writes the new state,
|
||||
so reading and writing the same slot is race-free) -- no materialization;
|
||||
- non-identity sequences read from a disjoint scratch that a small
|
||||
`gdn_gather_nonident_kernel` copies from `cache[ids[seq]]` first, so the
|
||||
recurrence never reads a slot another block writes.
|
||||
|
||||
`ids` stays a device pointer (dereferenced only in the kernels; the input is
|
||||
device-resident at op-execute time, so a host read segfaults). The CPU op
|
||||
mirrors the same logic (host identity check + a serial gather in the dispatcher
|
||||
for the non-identity case). The math is unchanged, so the result is bit-identical
|
||||
to the `get_rows` path in every case.
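
The identity / non-identity handling above can be illustrated with a toy host-side model. This is a Python sketch under stated assumptions, not the CUDA kernel: the cache is a list of per-slot states, `update` stands in for the recurrence, and `step_reference` / `step_fused` are hypothetical names. It only shows that copying the non-identity sources to scratch before any write gives the same result as materializing every prior state first.

```python
def step_reference(cache, ids, rs_head, update):
    """get_rows-style path: gather every prior state, then write the new ones."""
    gathered = [list(cache[src]) for src in ids]
    out = [list(row) for row in cache]
    for s, prior in enumerate(gathered):
        out[rs_head + s] = update(prior)
    return out


def step_fused(cache, ids, rs_head, update):
    """Fused path: in-place for identity sequences, scratch for the rest."""
    out = [list(row) for row in cache]  # stands in for the live cache
    # Pass 1: copy only non-identity sources into disjoint scratch,
    # before any destination slot is overwritten.
    scratch = {s: list(out[src]) for s, src in enumerate(ids) if src != rs_head + s}
    # Pass 2: each sequence loads its full prior state, then stores in place.
    for s in range(len(ids)):
        dst = rs_head + s
        prior = scratch[s] if s in scratch else list(out[dst])
        out[dst] = update(prior)
    return out


def bump(state):
    """Toy stand-in for the recurrence."""
    return [x + 10.0 for x in state]


cache = [[1.0], [2.0], [3.0], [4.0]]
for ids in ([1, 2], [2, 1], [1, 1]):  # identity, swapped, mixed
    assert step_fused(cache, ids, 1, bump) == step_reference(cache, ids, 1, bump)
print("fused path matches the materialized reference")
```

The mixed case (`ids = [1, 1]`) is the one that would go wrong without the scratch pass: the second sequence would otherwise read slot 1 after the first sequence had already overwritten it.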
|
||||
|
||||
Gated to the `qwen35` / `qwen35moe` fused decode/prefill path; `qwen3next`,
|
||||
`kimi-linear`, the non-fused path and the rollback (`n_rs_seq > 0`) path are
|
||||
untouched (they keep the materialized-state overload).
|
||||
|
||||
### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off)
|
||||
|
||||
Dense `q36-27b-nvfp4`:
|
||||
|
||||
| npl | Step 1 (baseline) | Step 2   | delta   | % of vLLM (391 @128) |
|-----|-------------------|----------|---------|----------------------|
| 32  | 137.64            | 170.68   | +24.0%  | -                    |
| 128 | 186.25            | 256.57   | +37.8%  | 47.6% -> 65.6%       |
|
||||
|
||||
The npl-128 result (256.57 t/s) beats the predicted ~247 t/s Step-2 ceiling.
|
||||
|
||||
MoE `q36-35b-a3b-nvfp4`:
|
||||
|
||||
| npl | Step 1 (baseline) | Step 2   | delta   |
|-----|-------------------|----------|---------|
| 32  | 299.68            | 366.69   | +22.4%  |
| 128 | 409.30            | 553.63   | +35.3%  |
|
||||
|
||||
(Step-1 baselines re-measured in the same session; the brief's reference figures
|
||||
were 136 / 180 dense and 279 / 373 MoE.)
|
||||
|
||||
### Bit-exact gate
|
||||
|
||||
Greedy (`--temp 0 --seed 1`) `llama-completion` output (256 tokens, paged on,
|
||||
fusion off) vs the Step-1 build:
|
||||
|
||||
- dense `q36-27b-nvfp4`: model text byte-identical (md5 match);
|
||||
- MoE `q36-35b-a3b-nvfp4`: byte-identical;
|
||||
- Step-2 dense run1 == run2 (deterministic, no race).
|
||||
|
||||
### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
|
||||
|
||||
The recurrent-state gather bucket collapsed:
|
||||
|
||||
| kernel                     | Step 1   | Step 2                                   |
|----------------------------|----------|------------------------------------------|
| `k_get_rows_float`         | 18.8%    | 0.7% (residual: embeddings / conv-state) |
| `gdn_gather_nonident`      | -        | 1.7% (no-op at decode, median ~1.2 us)   |
| `gated_delta_net_cuda`     | 26.0%    | 22.5%                                    |
| FP4 GEMM family            | ~37.5%   | ~48% (now the dominant residual)         |
|
||||
|
||||
The SSM state gather is effectively eliminated. The residual decode gap to vLLM
|
||||
is now the FP4 GEMM (~48% of decode), a separate kernel track.
|
||||
@@ -1,126 +0,0 @@
|
||||
# Track B P0 + P1 results: the FP4-MMA decode-GEMM occupancy tune (GB10, sm_121)
|
||||
|
||||
Measured on the DGX (GB10 / DGX Spark, sm_121, `~/llama-paged-dev`, branch `paged`). Implements
|
||||
`FP4_GEMM_SCOPE_B.md` P0 (baseline + bit-exact gate) and P1 (the cheap host/occupancy tile tune).
|
||||
Dev-tree commit: **089f78d** (`feat(paged): FP4 decode GEMM track-B P0 gate + default-off occupancy
|
||||
instrumentation`). Patch artifact: `0017-fp4-gemm-decode-tile-tune.patch`.
|
||||
|
||||
**Headline verdict: the P1 occupancy kill-gate TRIPPED.** None of the cheap host/occupancy levers
|
||||
lift dense or MoE decode_agg on GB10; every dense probe regresses and the nsys evidence shows the
|
||||
FP4 GEMM kernel gets *slower* under register-capping. Nothing is enabled by default (the levers are
|
||||
compile-time/env gated and the default build is byte-identical to stock). The one untested lever is
|
||||
the structural `mmq_y`-down, which is **not** a host switch: it is coupled to `nwarps` by the
|
||||
`nwarps*tile_C::I == mmq_y` static_assert, so it requires an `nwarps=4` warp-remap (P2 kernel work).
|
||||
|
||||
All benches: `llama-batched-bench -fa on -c 32768 -ngl 99 -npp 128 -ntg 128 -npl 32,128`.
|
||||
`decode_agg = S_TG` (aggregate decode tok/s). 3 reps dense, 2 reps MoE; medians below.
|
||||
|
||||
## P0 baseline (mmq_y=128, minblocks=1 — stock)
|
||||
|
||||
### Bit-exact parity gate (CPU oracle vs CUDA, deterministic)
|
||||
- `test-backend-ops -o MUL_MAT -b CUDA0`: **1115/1115** (1103 stock + 12 new NVFP4/MXFP4 dense
|
||||
decode-shape cases), NVFP4 0 fail.
|
||||
- `test-backend-ops -o MUL_MAT_ID -b CUDA0`: **805/805**, NVFP4 0 fail.
|
||||
- New P0 cases exercise the weight-row (`mmq_y`) tiling boundary: `type_a ∈ {NVFP4, MXFP4}`,
|
||||
`m ∈ {2048 (exact at mmq_y 64/128), 1600 (ragged vs 128), 2050 (ragged vs both 64 & 128 →
|
||||
need_check last row-tile)}`, `n ∈ {32, 128}` (decode M), `k = 2048`. They make the oracle cover
|
||||
the `mmq_y`/min-blocks changes and stay bit-exact with every lever on.
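
The choice of `m` values in the new P0 cases can be checked with simple tile arithmetic. A minimal Python sketch (the `tile_rows` helper is illustrative, not part of `test-backend-ops`): a row count is "ragged" against a tile height when the last row-tile is only partly filled.

```python
def tile_rows(m, mmq_y):
    """Return (number of row-tiles, whether the last tile is ragged)."""
    full, rem = divmod(m, mmq_y)
    return full + (1 if rem else 0), rem != 0


for m in (2048, 1600, 2050):
    for mmq_y in (64, 128):
        tiles, ragged = tile_rows(m, mmq_y)
        kind = "ragged" if ragged else "exact"
        print(f"m={m:5d} mmq_y={mmq_y:3d}: {tiles:3d} row-tiles, {kind}")
```

Under this arithmetic 2048 is exact at both tile heights, 1600 is ragged only against 128, and 2050 is ragged against both, which matches the coverage the three `m` values are meant to provide.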
|
||||
|
||||
### Decode throughput (decode_agg = S_TG)
|
||||
| model | npl32 | npl128 |
|---|---:|---:|
| DENSE q36-27b-nvfp4 | 117.3 | **149.5** |
| MoE q36-35b-a3b-nvfp4 (stock mmq_x=128/expert) | 262.6 | **336.3** |
|
||||
|
||||
(For reference the scope §6 cites dense 161 / MoE 333 from a server harness; this is the cleaner
|
||||
batched-bench A/B baseline. The relative P0→P1 deltas below are what the kill-gate turns on.)
|
||||
|
||||
### nsys FP4 GEMM efficiency (dense, `-npp 64 -ntg 48 -npl 128`)
|
||||
The decode FP4 weight GEMM kernel = `mul_mat_q<NVFP4(40), mmq_x=128, need_check=0>`:
|
||||
- **33.2 %** of GPU kernel time, total **2.782 s** / 4576 inst, **avg 608 µs/launch**.
|
||||
- Plus `quantize_mmq_nvfp4` 9.1 % (the act-quant bucket — track A's target), `mul_mat_q<…,16,…>`
|
||||
5.8 % (prefill ubatch tiling), stream-k fixups ~0.5 %.
|
||||
|
||||
This is the locked baseline; P1 must lower the GEMM kernel time (raise FP4-eff) to pass.
|
||||
|
||||
## P1 — the cheap occupancy levers (all default-off, byte-identical when off)
|
||||
|
||||
Three bit-exact, gated levers were added (`mmq.cuh`):
|
||||
- `GGML_CUDA_FP4_MMQ_Y` (default 128): type-aware `get_mmq_y_host/device` plumbing for an NVFP4
|
||||
weight-row tile override. **Inert** — see "the mmq_y wall" below.
|
||||
- `GGML_CUDA_FP4_MINBLOCKS` (default 1): NVFP4-only `__launch_bounds__` min-resident-CTAs lever
|
||||
(register-caps the FP4-MMA kernel so >1 CTA co-resides). The bounded occupancy probe.
|
||||
- `GGML_CUDA_FP4_DENSE_MMQ_X` (env, default off): dense col-tile re-read occupancy diagnostic
|
||||
(the §4.1 A/B: does eating a 2× weight re-read at a smaller `mmq_x` buy net occupancy?).
|
||||
|
||||
P1 parity: with `MINBLOCKS=2` the gate stays **MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0
|
||||
fail** — register allocation is result-neutral, so bit-exactness holds.
|
||||
|
||||
### DENSE decode_agg @ npl128 — every occupancy probe REGRESSES
|
||||
| config | npl32 | npl128 | Δ vs P0 @npl128 |
|---|---:|---:|---:|
| P0 stock (mmq_y=128, minblocks=1) | 117.3 | **149.5** | - |
| MINBLOCKS=2 (2 resident CTAs via reg-cap) | 115.7 | 147.9 | **−1.1 %** |
| DENSE_MMQ_X=64 (2 col-tiles, 2× weight re-read) | 115.3 | 144.3 | **−3.5 %** |
| DENSE_MMQ_X=32 (4 col-tiles, 4× weight re-read) | 115.4 | 141.7 | **−5.2 %** |
|
||||
|
||||
### MoE decode_agg @ npl128 — mmq_x-down regresses; min-blocks neutral
|
||||
| config | npl32 | npl128 | Δ vs stock @npl128 |
|---|---:|---:|---:|
| stock (mmq_x=128/expert) | 262.6 | **336.3** | - |
| TILE32 | 262.1 | 336.0 | −0.1 % |
| TILE16 | 261.1 | 324.0 | **−3.7 %** |
| TILE8 | 260.8 | 316.6 | **−5.9 %** |
| MINBLOCKS=2 | 260.0 | 337.7 | +0.4 % (noise) |
|
||||
|
||||
The MoE result reproduces patch 0015 exactly: q36-35b-a3b (256 tiny experts, GDN linear attention)
|
||||
decode is GDN/bandwidth-bound, **not** col-tile-occupancy-bound, so tightening `mmq_x` below 64
|
||||
(the brief's "8–16 ideal") monotonically *loses*. 64 ≈ 32 ≈ stock is the floor.
|
||||
|
||||
### nsys kill-gate evidence (the decisive datum)
|
||||
`mul_mat_q<NVFP4,128,0>` under MINBLOCKS=2: **2.782 s → 3.025 s**, avg **608 µs → 661 µs
|
||||
(+8.7 % SLOWER)**. The FP4-MMA kernel needs >128 regs/thread; forcing 2 CTAs/SM register-caps it,
|
||||
which **spills to local memory**, so the GEMM does *more* work per launch — occupancy did not
|
||||
usefully rise, it inverted. FP4-eff went **down**, not up. Kill-gate tripped, with hard evidence.
|
||||
|
||||
## Why P1 can't lift it (and why mmq_y-down is P2, not P1)
|
||||
|
||||
The two orthogonal occupancy probes both regress: register-capping (minblocks↑) spills, and
|
||||
col-tile-shrinking (mmq_x↓) re-reads the 18 GB weight set. This says the **dense M=128 tile is
|
||||
already weight-read / one-read-optimal at mmq_x=128** — it is not occupancy-starved in a way the
|
||||
cheap levers can fix. This contradicts the scope's central "self-inflicted occupancy, recover it by
|
||||
raising resident CTAs" hypothesis *for the cheap levers*.
|
||||
|
||||
The only lever that raises resident CTAs **without** spilling and **without** extra weight reads is
|
||||
the structural `mmq_y`-down (smaller weight-row tile → smaller shared + smaller accumulator → more
|
||||
CTAs, weights still read once). But `mmq_y` is **rigidly** `nwarps * tile_C::I = 8 * 16 = 128`
|
||||
(the `mmq.cuh:3258` static_assert; `tile_C::I=16` is the fixed `m16n8k64` MMA shape). So
|
||||
`mmq_y=64` requires **`nwarps=4`** — a warp-remap, not a host switch. That remap threads `nwarps`
|
||||
through ~13 NVFP4-reachable sites including the **shared** `vec_dot_fp4_fp4_mma` (used by both NVFP4
|
||||
and MXFP4) and the loader/kernel nwarps lockstep, with real risk of a silent shared-mem/thread-block
|
||||
mismatch. It was scoped but **deferred to P2** (the scope's own phase table also places `mmq_y`-down
|
||||
at P2, after the P1 host-only knobs). The `get_mmq_y` host/device plumbing is committed and inert so
|
||||
P2 only has to add the `nwarps` half.
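
The coupling that makes `mmq_y`-down a kernel change rather than a host switch is the single identity `mmq_y == nwarps * tile_C::I`. A minimal Python restatement, taking `tile_C::I = 16` from the text above (the helper name is illustrative):

```python
TILE_C_I = 16  # rows per MMA tile, the fixed m16n8k64 shape cited above


def nwarps_for(mmq_y, tile_i=TILE_C_I):
    """Warp count implied by a weight-row tile height under mmq_y == nwarps * tile_i."""
    if mmq_y % tile_i != 0:
        raise ValueError("mmq_y must be a multiple of the MMA tile height")
    return mmq_y // tile_i


for mmq_y in (128, 64):
    print(f"mmq_y={mmq_y}: nwarps={nwarps_for(mmq_y)}")
```

Halving the tile height therefore halves the warp count, which is why every site that assumes the stock warp count has to move in lockstep.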
|
||||
|
||||
## Honest verdict vs the scope targets
|
||||
|
||||
- **DENSE:** P1 (host knobs + min-blocks + re-read diagnostic) does **not** move decode_agg toward
|
||||
the 391 target — it slightly *regresses* (149.5 → 147.9, 38 % of vLLM). The scope's P1 row
|
||||
(~177, "honest: small") was optimistic; on GB10 the cheap levers are net-negative. The remaining
|
||||
upside lives entirely in the P2 `mmq_y`-down (nwarps=4) kernel remap **plus** track A. Whether
|
||||
that clears the floor is now an *open, unproven* question — the cheap-lever evidence here leans
|
||||
*against* large occupancy upside (the tile already looks one-read-optimal), so the P2 ceiling is
|
||||
plausibly lower than the scope's 316–328.
|
||||
- **MoE:** the mmq_x-down lever (the brief's MoE P1) is a **confirmed dead-end on this model**
|
||||
(regresses; GDN/BW-bound, reproduces patch 0015). min-blocks is neutral. No host-level MoE win.
|
||||
|
||||
**Kill-gate: TRIPPED on both arms.** Per the brief this is *not* forced into a default-on change.
|
||||
Committed: the P0 bit-exact gate + the default-off instrumentation + this honest record. Not pushed.
|
||||
|
||||
## Reproduce
|
||||
```
# default (byte-identical stock): build-cuda as-is -> MUL_MAT 1115/1115, MUL_MAT_ID 805/805
# occupancy probe: cmake build with -DGGML_CUDA_FP4_MINBLOCKS=2 (or flip the macro default)
# dense re-read A/B: GGML_CUDA_FP4_DENSE_MMQ_X=64 ./llama-batched-bench -m q36-27b-nvfp4.gguf ...
# nsys: nsys profile --trace cuda ... ; nsys stats --report cuda_gpu_kern_sum (watch mul_mat_q<40,128,0>)
```
|
||||
@@ -1,315 +0,0 @@
|
||||
# vLLM 0.23.0 eager-decode grounding: where the ~2.4x decode gap to llama.cpp comes from
|
||||
|
||||
Source-reading + grounding only (no GPU, no benchmarking, no llama code changes). This
|
||||
decomposes vLLM 0.23.0's per-decode-step work in `enforce_eager` mode and attributes the
|
||||
measured ~2.4x decode-throughput gap on GB10 (DGX Spark, sm_121) to its parts, so the
|
||||
throughput thread can decide what llama.cpp would actually need (CUDA-graphed decode vs new
|
||||
kernels) before anyone touches a kernel.
|
||||
|
||||
Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
|
||||
LPDDR5x ~273 GB/s. vLLM install read: `/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm/`
|
||||
(on `dgx.casa`, read-only). Evidence: engine logs `~/bench/h2h_dense_vllm.log`,
|
||||
`~/bench/h2h_moe_vllm.log`; nsys decode trace `~/bench/decode_study/srv_decode2.sqlite`
|
||||
(reproduced here via `cat2.py`); committed `QWEN36_NVFP4_BENCH.md`, `DECODE_GAP_STUDY.md`,
|
||||
`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`.
|
||||
|
||||
## TL;DR (the evidence-based answer)
|
||||
|
||||
At batch ~128, ~1024 ctx, NVFP4, `enforce_eager` (no CUDA graphs on either side), vLLM decodes
|
||||
~2.4x faster than llama.cpp. Decomposed:
|
||||
|
||||
1. **The gap is dominantly a KERNEL-efficiency gap, not a host-overhead gap.** The strongest
|
||||
single datum: during steady llama decode the GPU is **~94.6% busy** (nvidia-smi, real run) /
|
||||
85.5% in the nsys window (`DECODE_GAP_STUDY.md`; nsys adds gaps). A GPU that is already ~95%
|
||||
busy has at most ~5% exposed host bubble, so a CUDA graph (which only removes host/launch
|
||||
overhead) can recover at most that bubble. **CUDA-graphing llama's decode is therefore a
|
||||
minority lever: on the order of ~5-15% of the step, i.e. roughly ~10-20% of the 2.4x.** The
|
||||
remaining ~80-90% is the GPU spending its busy time in kernels that are simply slower per unit
|
||||
work than vLLM's.
|
||||
|
||||
2. **vLLM's eager decode step is cheap on the host by construction**, so its host time is small
|
||||
to begin with and hides behind the async CUDA stream: persistent pre-allocated input buffers
|
||||
updated with vectorized numpy (no per-token Python), attention metadata built once per step and
|
||||
shared across all layers, no GPU->CPU sync in the hot path, and a fixed small kernel-launch
|
||||
sequence per layer (2 ops per Linear, 2 grouped Marlin launches for *all* MoE experts).
|
||||
`async_scheduling` was **off** in this run (absent from both engine logs; default resolves to
|
||||
the synchronous `Scheduler`, `config/scheduler.py:168-176`), so vLLM achieved the 2.4x with
|
||||
*synchronous* per-step scheduling. The host advantage is structural, not pipelining.
|
||||
|
||||
3. **Where vLLM's kernels win:** (a) attention reads paged KV **in-kernel** via a block table in
|
||||
one batched `flash_attn_varlen_func` launch, with **no gather/copy** (vLLM never pays llama's
|
||||
paged `get_rows` + `cpy` tax, which is ~36% of llama's *paged* step); (b) the dense NVFP4 GEMM
|
||||
is a **native FP4-MMA cutlass** kernel with the activation-quant **fused** into the preceding
|
||||
RMSNorm/SiLU (no standalone `quantize_mmq` requant pass); (c) the MoE experts are **one grouped
|
||||
Marlin kernel per projection for all experts** (W4A16, in-kernel dequant); (d) on these Qwen3.6
|
||||
models a fraction of layers are **GDN linear-attention** whose decode is an **O(1)-in-context
|
||||
recurrent state update**, not an O(ctx) KV read.
|
||||
|
||||
4. **Sampling is not the gap** on either side: vLLM samples all ~128 sequences with a handful of
batched on-GPU kernels (FlashInfer), so greedy and a heavy sampler chain cost the same. This
mirrors llama's own finding (`DECODE_GAP_STUDY.md`: greedy 1343 ms == 5-sampler 1346 ms).
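
The bound in point 1 is an Amdahl-style argument and can be written down directly. A Python sketch, assuming that a CUDA graph can at best remove all GPU idle time and leaves kernel time unchanged (the function name is illustrative; the busy fractions are the two quoted above):

```python
def max_graph_speedup(gpu_busy_fraction):
    """Upper bound on step speedup if all GPU idle time disappears."""
    if not 0.0 < gpu_busy_fraction <= 1.0:
        raise ValueError("busy fraction must be in (0, 1]")
    return 1.0 / gpu_busy_fraction


for label, busy in (("nvidia-smi, real run", 0.946), ("nsys window", 0.855)):
    print(f"{label}: busy={busy:.1%}, best-case speedup={max_graph_speedup(busy):.3f}x")
```

Either bound is far short of the ~2.4x gap, which is the basis for treating CUDA graphs as a minority lever here.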
|
||||
|
||||
## The measured gap (apples-to-apples, both eager)
|
||||
|
||||
From `QWEN36_NVFP4_BENCH.md` (matched NVFP4 weights, one GB10 box, vLLM 0.23.0
|
||||
`--enforce-eager`, llama patch 0015 + budget-256), decode aggregate tok/s at npl128:
|
||||
|
||||
| model | llama (best) | vLLM | ratio | per-step (128 tok) llama -> vLLM |
|-------|-------------:|-----:|------:|----------------------------------|
| DENSE Qwen3.6-27B | 161.2 | 390.7 | **2.42x** | ~795 ms -> ~328 ms |
| MoE Qwen3.6-35B-A3B | 333.5 | 811.1 | **2.43x** | ~384 ms -> ~158 ms |
|
||||
|
||||
Both models converge to ~41% of vLLM at npl128 once llama's prefill starvation is removed
(patch 0013), while at npl8 the kernels are at or near parity (dense 99%, MoE 84%). The residual
~2.4x is therefore a steady-state decode property at high batch, not a prefill or scheduler
artifact. The scheduler was separately proven not to be the lever: a clean all-128-decoding run
still tops out at 157-161 dense / 333 MoE (`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`).
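
The ratio and per-step columns of the gap table derive from the aggregate tok/s alone. A short Python check, assuming one decode step emits one token for each of the 128 sequences (the names are illustrative):

```python
TOKENS_PER_STEP = 128  # assumption: one token per sequence at npl128

rows = {
    "dense": (161.2, 390.7),  # (llama best, vLLM) aggregate decode tok/s
    "moe": (333.5, 811.1),
}


def step_ms(tok_per_s):
    """Wall time of one 128-token decode step, in milliseconds."""
    return TOKENS_PER_STEP / tok_per_s * 1000.0


for name, (llama, vllm) in rows.items():
    print(f"{name}: ratio={vllm / llama:.2f}x, "
          f"step {step_ms(llama):.0f} ms -> {step_ms(vllm):.0f} ms, "
          f"llama at {llama / vllm:.0%} of vLLM")
```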
|
||||
|
||||
## Confirmed configuration (both sides eager, no CUDA graphs)
|
||||
|
||||
vLLM, both models (engine logs):
|
||||
- `enforce_eager=True`, `CompilationMode.NONE`, `cudagraph_mode=<CUDAGraphMode.NONE>`:
|
||||
`"Enforce eager set, disabling torch.compile and CUDAGraphs ... -cc.mode=none
|
||||
-cc.cudagraph_mode=none"`, `"Cudagraph is disabled under eager mode"`. So no torch.compile, no
|
||||
inductor, no graph capture: the model runs as pure eager dispatch of custom ops.
|
||||
- Attention: `"Using FLASH_ATTN attention backend out of ['FLASH_ATTN','FLASHINFER','TRITON_ATTN',
|
||||
'FLEX_ATTENTION']"`, `"Using FlashAttention version 2"`.
|
||||
- Dense weight GEMM: `"Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM"` (native W4A4
|
||||
cutlass FP4-MMA), `"Enabled custom fusions: norm_quant, act_quant"`, FlashInfer autotuned the
|
||||
`fp4_gemm` (16 configs) at startup.
|
||||
- MoE weight GEMM: `"Using 'MARLIN' NvFp4 MoE backend out of ['FLASHINFER_TRTLLM',...,'MARLIN',
|
||||
'EMULATION']"` with `"Your GPU does not have native support for FP4 computation ... Weight-only
|
||||
FP4 compression will be used leveraging the Marlin kernel"` (so MoE experts = W4A16 weight-only
|
||||
Marlin: in-kernel dequant + bf16 MMA), plus `"FlashInferFP8ScaledMM"` for the FP8 attention
|
||||
linears.
|
||||
- Both models are **hybrid GDN**: `"Using Triton/FLA GDN prefill kernel"` and `"Setting attention
|
||||
block size to 784/1056 tokens to ensure attention page size >= mamba page size"` (dense 784, MoE
|
||||
1056). A decode-time `fused_recurrent_gated_delta_rule_packed_decode_kernel` is JIT-compiled.
|
||||
- Sampling: `"Using FlashInfer for top-p & top-k sampling."`
|
||||
- `async_scheduling` not present in either log -> synchronous `Scheduler`.
|
||||
|
||||
llama side (the brief's premise, corroborated by `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` review):
|
||||
`-fa on`, paged KV, eager (no engaged CUDA graphs at batched decode). The `DECODE_GAP_STUDY.md`
|
||||
nsys run explicitly set `GGML_CUDA_DISABLE_GRAPHS=1` to match.
|
||||
|
||||
## Decomposition of vLLM's eager decode step

All file paths below are under
`/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm/`. The driver is
`v1/worker/gpu_model_runner.py::execute_model` (line 4005): host preprocess under
`synchronize_input_prep()`, then `_model_forward` under `set_forward_context`, then `compute_logits`;
sampling is a separate `sample_tokens` (line 4357). Under eager, `_determine_batch_execution_and_padding`
(line 3768) dispatches `CUDAGraphMode.NONE`, and `_model_forward` (line 3718) just calls
`self.model(...)` directly: no capture, no replay, same code every step.

### (a) Attention - one batched in-kernel paged-decode launch + O(1) GDN layers

- **Full-attention layers (FA2):** `v1/attention/backends/flash_attn.py`. `FlashAttentionImpl.forward`
  (667-848) issues **one** `flash_attn_varlen_func` (796-818) over all ~128 decode tokens, passing
  `key_cache`/`value_cache` (the raw paged block pools, **not gathered**), `cu_seqlens_q`,
  `seqused_k`, and **`block_table=attn_metadata.block_table`**. The kernel walks the block table to
  fetch each sequence's KV pages directly. In-kernel paged read confirmed: there is **no gather/copy**
  in the Python layer; the only KV write is `reshape_and_cache_flash` (a scatter of the new token via
  `slot_mapping`). FA2 disables vLLM's AOT host scheduler (`aot_schedule = (fa_version==3)` is False,
  333), so `schedule()` returns `None` (445-469): the per-step metadata `build()` (388-575) is **pure
  reference/scalar assembly**, no Python loop over the 128 sequences, no host scheduling, no sync.
- **Built once per step, reused across layers:** `supports_update_block_table=True` (300); the first
  full-attn layer calls `build()`, every later layer reuses it via `update_block_table()` (577-586,
  a `copy.copy`). So `build()` runs **once per decode step** for the whole KV group, not per layer.
- **GDN linear-attention layers (the hybrid half):**
  `model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`, kernels in
  `model_executor/layers/fla/ops/fused_recurrent.py`. Pure decode
  takes `_forward_core_decode_non_spec` (1644-1696): two state-update kernels only -
  `causal_conv1d_update` + `fused_recurrent_gated_delta_rule_packed_decode` (Triton kernel 255-336,
  grid `(NV, B*HV)` = one batched launch over all 128 rows). Each program updates a **fixed-size
  [K,V] recurrent state** (`b_h *= exp(g); b_h += (beta*(v - h.k)) outer k; o = h.q`) - **no loop over
  the 1024 past tokens, no KV read.** This is **O(1) in context length**, while FA2 streams ~ctx KV
  per head per row. On these Qwen3.6 models the GDN layers make a chunk of the decode cost flat in
  ctx, a structural cheapness llama only gets if its GGUF implements GDN the same way (see caveat).
||||
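The in-kernel paged read from the first bullet can be illustrated with a minimal pure-Python sketch. This is not vLLM's or FlashAttention's code: `paged_decode_attention`, `gathered_reference`, `key_pool` and `value_pool` are hypothetical names, and a single head with one query vector is assumed. The first function resolves every position through the block table and streams KV straight from the pool with an online softmax; the second first gathers KV into a contiguous copy, which is the extra pass the bullet says vLLM never performs.

```python
import math


def paged_decode_attention(q, key_pool, value_pool, block_table, seq_len, block_size):
    """Attend one query over KV that stays scattered in pool blocks (no gather)."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(value_pool[0][0])
    for pos in range(seq_len):
        block = block_table[pos // block_size]  # logical block index -> physical block id
        slot = pos % block_size                 # offset inside that block
        k = key_pool[block][slot]
        v = value_pool[block][slot]
        score = scale * sum(a * b for a, b in zip(q, k))
        # Online softmax: rescale the running sums whenever the maximum grows.
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def gathered_reference(q, key_pool, value_pool, block_table, seq_len, block_size):
    """Same result, but KV is first copied into contiguous lists (the gather pass)."""
    keys = [key_pool[block_table[p // block_size]][p % block_size] for p in range(seq_len)]
    values = [value_pool[block_table[p // block_size]][p % block_size] for p in range(seq_len)]
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]
```

Both functions return the same attention output; the only difference is that the reference materializes a contiguous KV copy before attending, which is the memory traffic the block-table walk avoids.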
|
||||
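The recurrent update quoted in the GDN bullet (`b_h *= exp(g); b_h += (beta*(v - h.k)) outer k; o = h.q`) can be written out as a short pure-Python sketch. The function name `gdn_decode_step` is hypothetical, a single head is assumed, and `g` is taken to be a scalar log-decay; the real Triton kernel batches this over `(NV, B*HV)` programs. The point of the sketch is that the work per token depends only on the `[K, V]` state size, never on how many tokens came before.

```python
import math


def gdn_decode_step(state, q, k, v, g, beta):
    """One gated delta-rule decode step on a fixed-size [K][V] state (updated in place)."""
    K, V = len(k), len(v)
    decay = math.exp(g)
    # b_h *= exp(g): decay the whole state.
    for i in range(K):
        for j in range(V):
            state[i][j] *= decay
    # h.k: what the decayed state currently predicts for this key.
    predicted = [sum(state[i][j] * k[i] for i in range(K)) for j in range(V)]
    # beta * (v - h.k): the delta-rule correction, written back as an outer product with k.
    delta = [beta * (v[j] - predicted[j]) for j in range(V)]
    for i in range(K):
        for j in range(V):
            state[i][j] += k[i] * delta[j]
    # o = h.q: read the updated state with the query.
    return [sum(state[i][j] * q[i] for i in range(K)) for j in range(V)]
```

Calling it repeatedly never grows `state`, which is the O(1)-in-context property the bullet contrasts with FA2 streaming roughly `ctx` KV entries per head.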
### (b) Weight GEMM - native FP4-MMA (dense) / grouped Marlin (MoE), M-batched, fused quant

- **Dense NVFP4 linear:** `model_executor/layers/quantization/modelopt.py::ModelOptNvFp4LinearMethod.apply`
  (1226-1232) -> `model_executor/kernels/linear/nvfp4/flashinfer.py::apply_weights` (56-89): exactly
  two GPU ops - `scaled_fp4_quant` (activation -> packed FP4 + blockscale) then
  `flashinfer_scaled_fp4_mm` (the autotuned `fp4_gemm`, a **native W4A4 cutlass FP4-MMA** whose
  **dequant is fused into the MMA epilogue** via the precomputed `alpha = in_gscale*w_gscale`). The
  activation-quant is itself folded away: `compilation/passes/fusion/rms_quant_fusion.py:98`
  (`norm_quant`: RMSNorm -> `scaled_fp4_quant` fused) and `act_quant_fusion.py:40,128`
  (`act_quant`: SiLU+mul -> FP4 fused). **There is no standalone full-tensor requantize pass** like
  llama's `quantize_mmq`, and the weight is never dequantized to a temp buffer.
- **MoE experts (Marlin W4A16):** `model_executor/layers/fused_moe/experts/marlin_moe.py`.
  `fused_marlin_moe` (227) does **one** `moe_align_block_size` token-sort then `_fused_marlin_moe`
  (59) issues **exactly two grouped kernels** - `moe_wna16_marlin_gemm` for gate_up (137) and for
  down (194) - **each a single launch covering ALL experts** (it walks `expert_ids`/`sorted_token_ids`
  internally; no Python loop over experts), with a `silu_and_mul` between and a `moe_sum` reduce
  after. W4A16 means weights are dequantized in-kernel and activations stay bf16 (never requantized).
- **Decode-M batching (the key throughput property):** the dense GEMM reshapes activations to (M, K)
  with M = total decode tokens (~128) and reads each FP4 weight **once for all 128 tokens**; the MoE
  grouped GEMM reads each routed expert's weight **once** for the ~M*topk/E tokens routed to it. At
  M~128 with FP4 weights these are weight-read / memory-bound (correct: the GB10 LPDDR5x ~273 GB/s
  is the floor), but the bytes are amortized over the whole batch. This is the ideal case and it is
  the same regime llama is in - so the GEMM gap is kernel efficiency (fused quant + native FP4 MMA),
  not a batching defect.
- **Host cost per layer (eager):** each `Linear.apply()` dispatches at most 2 `torch.ops` kernels; a
  dense layer's GEMM+norm/act portion is ~7-11 launches, a MoE expert block is ~5-6 launches **for all
  experts combined** (expert count does not multiply launches). Fixed, small, no per-tile/per-expert
  Python.
||||
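The "one token-sort, then a grouped GEMM that reads each routed expert's weight once" pattern from the MoE and decode-M bullets can be sketched in plain Python. This is an illustration under assumptions, not the Marlin kernel: `grouped_expert_gemm` and `matvec` are hypothetical names, weights are unquantized float lists, and the Python loop stands in for the walk over sorted tokens that the real kernel does internally within a single launch.

```python
def matvec(x, w):
    """x: [K], w: [K][N] -> [N]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def grouped_expert_gemm(tokens, expert_weights, topk_ids):
    """Route every (token, choice) pair to its expert after a single sort by expert id.

    Returns (out, weight_reads): out[t][c] is the expert output for token t's c-th
    choice, and weight_reads counts how many expert weight matrices were fetched.
    """
    # Flatten all (expert, token, choice) triples and sort once by expert id,
    # analogous to the one-off token-sort described above.
    pairs = [(expert, t, c)
             for t, choices in enumerate(topk_ids)
             for c, expert in enumerate(choices)]
    pairs.sort(key=lambda p: p[0])

    out = [[None] * len(choices) for choices in topk_ids]
    weight_reads = 0
    current = None
    for expert, t, c in pairs:
        if expert != current:
            # Tokens of one expert are contiguous, so its weight is fetched once per step.
            current = expert
            weight_reads += 1
        out[t][c] = matvec(tokens[t], expert_weights[expert])
    return out, weight_reads
```

With M tokens and top-k routing, `weight_reads` equals the number of distinct experts hit in the step rather than `M * topk`, which is the amortization the decode-M bullet describes.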
### (c) Sampling - fully batched on-GPU, negligible

`v1/sample/sampler.py::Sampler.forward` (72) operates on the whole `[num_seqs, vocab]` logits
tensor: batched `argmax` (greedy, 240) or temperature `div_` + one FlashInfer
`top_k_top_p_sampling_from_logits` (`v1/sample/ops/topk_topp_sampler.py:493`) + `torch.where`
(296-301). **No per-sequence Python loop** in the hot path. Per-seq params live as pre-staged GPU
tensors `temperature/top_p/top_k[num_seqs]` (`v1/worker/gpu_input_batch.py:184-205`), copied once via
non-blocking H2D and rebuilt only on batch change (`refresh_metadata`, 815-829). Greedy and the full
chain are the same batched-op class. Sampled-token D2H is async (CUDA-event gated, 243-313);
detokenization runs on CPU in the async output processor (`v1/engine/output_processor.py`). Sampling
is a negligible tail and does not stall the GPU loop - exactly as on the llama side.

### (d) Host / Python per-step loop - cheap by construction, hidden behind the async stream

`execute_model` host prep, all incremental on persistent buffers (`_prepare_inputs`, 1872+):
- `block_table.commit_block_table` started **first** to overlap its copy with following CPU work
  (1890); each step appends only newly-allocated block ids (`append_row`), usually <=1 at decode.
- positions / token gather are **vectorized numpy + a single `torch.index_select`** into the
  pre-allocated `input_ids.cpu` (1928-1939); `query_start_loc`/`seq_lens` set by slice ops
  (1979-1990). `slot_mapping` is one Triton kernel (`v1/worker/block_table.py`). **No per-token, no
  per-request Python loop** in the steady decode path.
- `CommonAttentionMetadata` assembled once (2287-2305), then the attention builder runs once per KV
  group (see (a)).
- The forward runs under `set_forward_context(...)` with `cudagraph_runtime_mode=NONE`; `_model_forward`
  is a direct `self.model(...)`.
- **No GPU->CPU sync in the hot path:** the sampled-token copy is `non_blocking` + event-gated;
  `execute_model` returns after launching the forward, and the cheap host prep for the next step
  overlaps the GPU executing the current step on the async CUDA stream (CUDA launches are
  non-blocking). `async_scheduling` was off, so this overlap is just ordinary CUDA async, not
  pipelined scheduling - yet it is enough because the host work is so small.
||||
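The incremental block-table append and the slot-mapping arithmetic mentioned in the list above can be sketched as follows. The function names `append_new_blocks` and `slot_mapping` and the free-list handling are hypothetical, and the formula `physical_block * block_size + offset` is the usual paged-KV convention, assumed here rather than quoted from `v1/worker/block_table.py`. Each loop iteration corresponds to work the real code does as one vectorized or in-kernel operation over the batch.

```python
def append_new_blocks(block_tables, positions, block_size, free_blocks):
    """Grow only the rows whose next token starts a new block; return how many grew."""
    appended = 0
    for table, pos in zip(block_tables, positions):
        if pos // block_size == len(table):  # the new token falls past the last block
            table.append(free_blocks.pop())
            appended += 1
    return appended


def slot_mapping(block_tables, positions, block_size):
    """Flat KV-pool slot where each request's new token will be written."""
    slots = []
    for table, pos in zip(block_tables, positions):
        physical_block = table[pos // block_size]
        slots.append(physical_block * block_size + pos % block_size)
    return slots


# Two decoding requests with block_size 4: the first writes position 5,
# the second writes position 4 and therefore needs one new block.
tables = [[7, 2], [5]]
new_blocks = append_new_blocks(tables, [5, 4], 4, free_blocks=[9])
slots = slot_mapping(tables, [5, 4], 4)
```

In steady decode a row only grows once every `block_size` tokens, which is why the per-step host update stays tiny.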
What llama-server's per-step C++ loop pays that vLLM does not (host side, graph-addressable):
ggml rebuilds/reallocates the compute graph each decode step and dispatches ~1k kernel launches from
the loop on the weak Grace ARM cores (`CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` review). vLLM's persistent
buffers + build-once-reuse metadata + fixed launch sequence are exactly the things that keep its eager
step host-cheap; llama could borrow these (persistent device KV/block metadata, build the ggml graph
once and reuse it, zero per-step host sync) to shrink the bubble **without** a full CUDA graph.

## The llama side, for the split (nsys, reproduced)

`~/bench/decode_study/cat2.py` over `srv_decode2.sqlite` (Qwen3-32B dense, pure full-attention, 64
layers, batch 32, 1024 ctx, paged, eager), reproduced now:

```
window_span_s 24.960   sum_kernel_s 21.348   gpu_busy_pct 85.5
ATTENTION (flash_attn_ext_f16)   10.177 s   47.7%
kv_copy_cast (cpy_*)              3.903 s   18.3%
embed_gather_rows (get/set)       3.803 s   17.8%   <- the PAGED gather tax
GEMM_weight (mul_mat)             3.173 s   14.9%
GEMM_act_quant (quantize_mmq)     0.172 s    0.8%
rmsnorm/silu/rope/add            ~0.12 s    ~0.6%
```

So on llama's paged decode step: ~84% is KV/attention (attention 47.7% + KV copy 18.3% + paged
gather 17.8%), ~16% is weight GEMM, and the host loop is **hidden** (GPU 85-94% busy; greedy ==
heavy-sampler step time). Mapping each bucket to vLLM:

| llama bucket (paged) | nsys % | vLLM equivalent | vLLM avoids it? |
|----------------------|------:|-----------------|-----------------|
| paged KV gather (`get_rows`) | 17.8% | block table read **in-kernel** | **Yes, entirely** (no such op) |
| KV copy/cast (`cpy_*`) | 18.3% | KV written once into block pool, read in place | Mostly |
| decode attention (`flash_attn_ext_f16`) | 47.7% | FA2 paged-decode varlen (+ O(1) GDN layers) | Same op, faster kernel; GDN is cheaper still |
| weight GEMM + act quant | 15.7% | fused native-FP4 / grouped Marlin, no separate requant | Faster + removes the requant kernel |
| host serving loop / sampling | ~0 (hidden) | cheap persistent-buffer prep, batched GPU sampling | Both hidden; vLLM also cheap |

Note: the nsys decomposition is on **Qwen3-32B (pure attention)**; the 2.4x throughput numbers are on
**Qwen3.6 hybrid GDN** models. The bucket *shares* differ between the two (GDN shifts work off
attention), but the lesson - llama's step is GPU-bound on attention + the paged gather + FP4 GEMM,
with the host hidden - transfers.
||||
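The shares quoted in this section can be recomputed from the raw kernel seconds in the nsys listing. The snippet below is only a worked check of that arithmetic; the dictionary keys are shorthand labels chosen here, and the approximate `~0.12 s` elementwise bucket is left out because it is not an exact figure.

```python
window_span_s = 24.960
sum_kernel_s = 21.348

kernel_seconds = {
    "attention": 10.177,        # flash_attn_ext_f16
    "kv_copy_cast": 3.903,      # cpy_*
    "paged_gather": 3.803,      # get/set rows
    "gemm_weight": 3.173,       # mul_mat
    "gemm_act_quant": 0.172,    # quantize_mmq
}


def pct(seconds):
    """Share of total kernel time, in percent."""
    return 100.0 * seconds / sum_kernel_s


shares = {name: pct(sec) for name, sec in kernel_seconds.items()}
gpu_busy_pct = 100.0 * sum_kernel_s / window_span_s
kv_attention_pct = shares["attention"] + shares["kv_copy_cast"] + shares["paged_gather"]
gather_tax_pct = shares["kv_copy_cast"] + shares["paged_gather"]
gemm_pct = shares["gemm_weight"] + shares["gemm_act_quant"]

for name, value in shares.items():
    print(f"{name:16s} {value:5.1f}%")
print(f"gpu busy {gpu_busy_pct:.1f}%  kv+attention {kv_attention_pct:.1f}%  "
      f"gather+copy {gather_tax_pct:.1f}%  gemm {gemm_pct:.1f}%")
```

The gather-plus-copy sum is the "~36% of the paged step" figure used later, and the GEMM sum is the 15.7% row of the table.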
## The split of the 2.4x: kernel vs host (graph-addressable)

Anchored on the measured **~94.6% GPU busy** during steady llama decode (nvidia-smi,
`DECODE_GAP_STUDY.md`):

- **Host / CUDA-graph-addressable: the minority, ~5-15% of the llama step (=> ~10-20% of the 2.4x).**
  A GPU that is ~95% busy exposes at most ~5% host idle; a CUDA graph (capture-once, replay) removes
  per-step launch latency + ggml graph rebuild/realloc and can tighten inter-kernel gaps, plausibly
  recovering ~5-15% of the step in the best case. On llama's ~795 ms dense step that is ~40-120 ms of
  the ~467 ms gap. **A CUDA graph cannot close a 2.4x gap**, because the gap is mostly the GPU's busy
  time, not idle. (The fraction shrinks further at batch 128 vs the nsys batch 32: the per-step launch
  count is fixed while per-kernel work grows, so host overhead is a smaller share at higher batch.)
- **Kernel efficiency: the majority, ~80-90% of the 2.4x.** The GPU's busy time goes into kernels that
  are slower per unit work than vLLM's, decomposed:
  - **the paged gather regression (~36% of llama's *paged* step; `get_rows`+`cpy`)** - vLLM never pays
    it because it reads paged KV in-kernel. This is the single biggest discrete, llama-specific,
    addressable chunk, but removing it only restores llama's own *stock* path; stock is still ~2x off
    vLLM (`DECODE_GAP_STUDY.md`).
  - **long-context decode-attention** (the largest residual; attention is ~48% of the step and grows
    with ctx) - llama's `flash_attn_ext_f16` decode is slower than vLLM's FA2 paged-decode on sm_121,
    and slower still than the O(1) GDN layers on these models.
  - **the FP4 weight GEMM floor** (~15-30%) - vLLM fuses the activation-quant into the norm/SiLU and
    uses native FP4-MMA / grouped Marlin; llama runs `mul_mat_q` + a separate `quantize_mmq` requant.
||||
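The bound on the host slice is simple arithmetic on the figures quoted above (a ~795 ms llama step, a ~467 ms gap, and 5-15% of the step recoverable). The sketch below only replays that arithmetic; `host_bound` is a hypothetical helper, and the vLLM step time is an assumption derived here as step minus gap rather than a separately measured value.

```python
def host_bound(step_ms, gap_ms, host_fraction):
    """Best case if the whole host-addressable slice of the step were removed.

    Returns (recovered_ms, share_of_gap, remaining_ratio), where remaining_ratio
    is the llama/vLLM step-time ratio that would still be left afterwards.
    """
    recovered_ms = step_ms * host_fraction
    vllm_step_ms = step_ms - gap_ms  # assumption: derived from the two quoted figures
    remaining_ratio = (step_ms - recovered_ms) / vllm_step_ms
    return recovered_ms, recovered_ms / gap_ms, remaining_ratio


for fraction in (0.05, 0.15):
    recovered, share, ratio = host_bound(795.0, 467.0, fraction)
    print(f"host slice {fraction:.0%}: recovers {recovered:.1f} ms "
          f"({share:.0%} of the gap), leaves a {ratio:.2f}x ratio")
```

Even at the optimistic 15% end the remaining ratio stays above 2x, which is the quantitative form of "a CUDA graph cannot close a 2.4x gap".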
## Ranked list: what llama would need to close the 2.4x, and how much each buys

1. **Do not pay the paged gather at decode. [largest discrete, llama-addressable; ~36% of the paged
   step]** Either disable paged KV for decode-latency workloads, or read paged blocks **in-kernel via
   a block table** like vLLM (no `get_rows`/`cpy`). This is a kernel change (a real in-kernel
   paged-decode read), not a graph change. Caveat: it only brings the paged path back to llama-stock;
   stock is still ~2x off vLLM, so this is necessary but not sufficient.
2. **Faster long-context decode-attention kernel. [biggest residual; partly structural]** A proper
   flash-decoding / split-K-over-KV, GQA-grouped, in-kernel-paged decode kernel for sm_121 (this also
   subsumes lever 1). Deep CUDA work, gated by kernel maturity on Blackwell-class parts. This is where
   the context-scaling gap lives and where most of the 2.4x is.
3. **Fused FP4 weight GEMM. [bounded; ~15-30%]** Fold the activation-quant into the preceding norm/SiLU
   (vLLM's `norm_quant`/`act_quant`) and into the GEMM epilogue; use native FP4-MMA where the part
   supports it. Removes the separate `quantize_mmq` pass. Bounded below by weight-read bandwidth
   (~19 GB/step over 273 GB/s).
4. **CUDA-graph the steady-state pure-decode step. [smallest, cheapest; ~10-20% of the gap]** Capture
   the all-128-decoding step once and replay (it is already fixed-shape at steady decode - the
   scheduler does not need to change to enable this, per `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md` P3).
   Recovers the ~5% GPU-idle bubble + ggml per-step graph rebuild/realloc + launch latency on the weak
   Grace cores. A real, independent, low-risk win, but bounded by the ~95%-busy measurement: it does
   **not** close the kernel gap. Cheaper host-side half-measures that need no graph: persistent device
   KV/block metadata, build the ggml graph once and reuse it, and remove any per-step host sync (mirror
   vLLM's persistent-buffer + build-once-reuse + non-blocking-D2H pattern).
5. **Verify llama's GDN/linear-attention decode path. [architectural, model-specific]** On these
   Qwen3.6 hybrids vLLM runs the linear-attention layers as an O(1)-in-ctx recurrent state update. If
   llama's GGUF runs those layers as full attention (O(ctx)) rather than a recurrent state, that is a
   per-layer decode cost vLLM structurally avoids on exactly these models - check before attributing
   the whole residual to the full-attention kernel.
||||
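The bandwidth floor cited in item 3 (~19 GB per step over 273 GB/s) works out as below. This is a back-of-the-envelope sketch: `weight_read_floor_ms` is a hypothetical helper, both figures are treated as decimal gigabytes, every weight byte is assumed to be read exactly once per step, and the batch of 128 is the steady-decode batch mentioned in item 4.

```python
def weight_read_floor_ms(bytes_per_step, bandwidth_bytes_per_s):
    """Lower bound on step time if the step only had to stream the weights once."""
    return 1000.0 * bytes_per_step / bandwidth_bytes_per_s


floor_ms = weight_read_floor_ms(19e9, 273e9)
per_token_ms = floor_ms / 128  # the same bytes amortized over 128 decoding rows

print(f"weight-read floor: {floor_ms:.1f} ms per step, "
      f"{per_token_ms:.2f} ms per token at batch 128")
```

No GEMM kernel improvement can push the step below this floor on the same memory system; the fused-quant and native-FP4 work only closes the distance to it.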
## Honest bottom line

The ~2.4x eager decode gap is **dominantly a kernel-efficiency gap (~80-90%), not a host-overhead
gap.** The decisive evidence is that llama's GPU is already ~94.6% busy during steady decode, so the
CUDA-graph-addressable host slice is a minority (~10-20% of the gap), recoverable but bounded. The
bulk of vLLM's advantage is concrete kernel work: an in-kernel paged-decode read that eliminates
llama's gather/copy tax (~36% of the paged step), a faster long-context decode-attention kernel, a
fused native-FP4 GEMM, and (on these specific models) O(1)-in-ctx GDN linear-attention layers. vLLM's
host loop is cheap by construction (persistent buffers, build-once-reuse metadata, no hot-path sync,
fixed small launch sequence) and it achieved the 2.4x with *synchronous* scheduling and *no* CUDA
graphs - so the host is not where vLLM's lead comes from, and a CUDA graph is the cheapest but
smallest of llama's available levers, not the silver bullet. The throughput effort should be scoped
as kernel work (in-kernel paged-decode read + flash-decoding attention + fused FP4 GEMM) with a
CUDA-graphed steady-state decode as a separate, bounded, lower-risk add-on.

## Key source citations (on dgx.casa, read-only)

- Eager driver / host loop: `v1/worker/gpu_model_runner.py` execute_model 4005, _model_forward 3718,
  _prepare_inputs 1872, _determine_batch_execution_and_padding 3768, sample_tokens 4357,
  synchronize_input_prep 3704; `v1/worker/block_table.py`; `v1/worker/gpu_input_batch.py:184-205`.
- Attention: `v1/attention/backends/flash_attn.py` (forward 667-848, varlen call 796-818, builder
  388-575, update_block_table 577-586); `model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`
  (decode 1644-1696); `model_executor/layers/fla/ops/fused_recurrent.py` (kernel 255-336).
- GEMM: `model_executor/kernels/linear/nvfp4/flashinfer.py:56-89`;
  `model_executor/layers/quantization/modelopt.py` (NvFp4 LinearMethod 1103-1232, MoE 1381-1666);
  `model_executor/layers/fused_moe/experts/marlin_moe.py` (59-225, 227-360, 732-895);
  `compilation/passes/fusion/rms_quant_fusion.py:98`, `act_quant_fusion.py:40,128`.
- Sampling: `v1/sample/sampler.py:72-302`; `v1/sample/ops/topk_topp_sampler.py:55,460-497`;
  `v1/sample/metadata.py`; `v1/engine/output_processor.py`.
- Config: `config/scheduler.py:146,168-176` (async_scheduling default -> sync Scheduler).
- Evidence: `~/bench/h2h_dense_vllm.log`, `~/bench/h2h_moe_vllm.log`, `~/bench/decode_study/cat2.py`
  over `srv_decode2.sqlite`; this worktree `QWEN36_NVFP4_BENCH.md`, `DECODE_GAP_STUDY.md`,
  `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`.
|
||||
</content>
|
||||
</invoke>
|
||||
@@ -125,7 +125,7 @@ For getting started, see the available backends in LocalAI here: https://github.
 LocalAI supports various types of backends:

 - **LLM Backends**: For running language models (e.g., llama.cpp, vLLM, SGLang, transformers, MLX)
-- **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:<tokens>]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md` for the quality/throughput profile).
+- **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:<tokens>]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp/patches/paged/README.md` for the quality/throughput profile).
 - **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo)
 - **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS)
 - **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step)

@@ -20,7 +20,7 @@
 # persist their state as bf16 (LLAMA_SSM_BF16_TAU), halving that head's decode byte
 # stream. Default off (0) = every head f32 = bit-exact; when enabled the mode is NOT
 # bit-exact (~91% same-top-p, beats vLLM dense) - see
-# backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md for the quality profile.
+# backend/cpp/llama-cpp/patches/paged/README.md for the quality profile.
 # The two NVFP4 entries below intentionally stay bit-exact (no ssm_bf16_tau).
 # =============================================================================
 - name: "qwen3.6-27b-nvfp4-paged"