docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau)

Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE
q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and
README section 4 carry a single consistent set of llama numbers with all three
configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin
  9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce
  stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact,
  ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both
models at all four widths (the prior CSV had no stock and no bf16-tau rows).
peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and
the bench does not print it, so per-run peak could not be captured this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in
bf16-tau adds a further +3% to +17% on top of patched (growing with width).
vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-27 22:05:59 +00:00
parent ed5eb705c7
commit 3466094c68
2 changed files with 95 additions and 45 deletions


@@ -157,45 +157,79 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
**Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png),
S_TG (t/s) from `llama-batched-bench`, `-fa on -ngl 99`, `npp 128 / ntg 128`,
swept over serving width `npl` in {8, 32, 64, 128}. Plots:
[`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png),
[`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data
[`final_benchmark.csv`](docs/final_benchmark.csv).
### (a) + (b) Patched vs stock vs vLLM
> **What was re-measured (2026-06-27).** The three llama columns - **stock**,
> **patched**, and **patched+bf16-tau** - were all re-measured this session on one
> consistent `llama-batched-bench` harness. The **vLLM** column is the
> **prior-session reference** (kept as-is, *not* re-run this session). Per-run peak
> VRAM was *not* re-captured: the GB10's unified Grace-Blackwell LPDDR5x reports
> `[N/A]` to `nvidia-smi --query-gpu=memory.used` and the bench does not print it
> (the memory-advantage note below is the prior-session finding).
The **stock** and **patched** columns are the same binary, env-toggled, on the
**same harness** (`llama-batched-bench`) - so "x over stock" is an exact
apples-to-apples measure of the patch series' contribution. The **vLLM** column
is a **different harness** (vLLM server + client continuous batching), so the
### (a) + (b) Patched vs stock vs patched+bf16-tau vs vLLM
The **stock** column is a separate, unpatched llama.cpp built at this backend's
**exact pin (`9d5d882d`)**; the **patched** and **patched+bf16-tau** columns are
the paged binary, env/flag-toggled (`LLAMA_KV_PAGED=1`, plus
`LLAMA_MOE_FORCE_GRAPHS=1` for MoE; bf16-tau adds `--ssm-bf16-tau 64`). All three
run on the **same harness**, so "x over stock" is an apples-to-apples measure of
the patch series. (Note: the patch series' dominant SSM decode fusions are
compiled in, not env-gated - toggling `LLAMA_KV_PAGED` alone on the *patched*
binary does **not** reproduce stock; only the separately-built unpatched
`9d5d882d` binary does.) The **vLLM** column is a **different harness** (vLLM
server + client continuous batching) and a **prior-session reference**, so the
cross-engine "% of vLLM" is **indicative, not apples-to-apples**.
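As a rough illustration of how the three llama configurations differ, the sketch below builds the command line and environment overrides for each config and width. The binary paths and the model path are placeholders (assumptions, not recorded in this repo), `--ssm-bf16-tau` exists only in the patched build, and the sketch assumes one bench process per `npl` value, which the source does not confirm.

```python
# Hypothetical binary locations: the source does not record where they live.
STOCK_BIN = "./stock-9d5d882d/llama-batched-bench"
PATCHED_BIN = "./patched/llama-batched-bench"

# Flags shared by every run, as stated in the benchmark description.
COMMON = ["-fa", "on", "-ngl", "99", "-npp", "128", "-ntg", "128"]
WIDTHS = [8, 32, 64, 128]


def build_runs(model_path, is_moe):
    """Return a list of (label, argv, env_overrides), one per config and width."""
    paged_env = {"LLAMA_KV_PAGED": "1"}
    if is_moe:
        # MoE runs additionally force graphs, per the column description.
        paged_env["LLAMA_MOE_FORCE_GRAPHS"] = "1"

    configs = [
        # stock is a separately built, unpatched binary: no env, no extra flags
        ("stock", STOCK_BIN, {}, []),
        ("patched", PATCHED_BIN, paged_env, []),
        ("patched+bf16-tau", PATCHED_BIN, paged_env, ["--ssm-bf16-tau", "64"]),
    ]

    runs = []
    for label, binary, env, extra in configs:
        for npl in WIDTHS:
            argv = [binary, "-m", model_path, *COMMON, "-npl", str(npl), *extra]
            runs.append((label, argv, dict(env)))
    return runs


if __name__ == "__main__":
    # "model.gguf" is a placeholder model path.
    for label, argv, env in build_runs("model.gguf", is_moe=True):
        prefix = " ".join(f"{k}={v}" for k, v in env.items())
        print(f"[{label}] {prefix} {' '.join(argv)}".strip())
```

Keeping stock as its own binary in the matrix, rather than an environment toggle, mirrors the point above that the SSM decode fusions are compiled in.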
**Dense Qwen3.6-27B-NVFP4** (t/s):
**Dense Qwen3.6-27B-NVFP4** (decode t/s):
| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
|----:|------:|--------:|-----:|------------------:|---------------------:|
| 8 | 65.7 | 84.0 | 71.1 | 118% | 1.28x |
| 32 | 113.7 | 204.0 | 207.6 | 98% | 1.79x |
| 64 | 134.3 | 294.9 | 309.7 | 95% | 2.20x |
| 128 | 143.5 | 371.2 | 422.4 | 88% | 2.59x |
| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
| 8 | 68.3 | 85.3 | 87.8 | 70.4 | 1.25x | +3% |
| 32 | 119.9 | 211.9 | 231.0 | 211.8 | 1.77x | +9% |
| 64 | 142.8 | 305.2 | 341.4 | 309.1 | 2.14x | +12% |
| 128 | 155.1 | 382.1 | 446.1 | 418.8 | 2.46x | +17% |
**MoE Qwen3.6-35B-A3B-NVFP4** (t/s):
Dense **patched** is ahead of vLLM at npl 8, at parity at npl 32-64, and behind at
npl 128 (121 / 100 / 99 / 91% of vLLM across the widths); **patched+bf16-tau** is
**ahead of vLLM at every width** (125 / 109 / 110 / 107%).
| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
|----:|------:|--------:|------:|-----------------:|---------------------:|
| 8 | 181.4 | 227.4 | 315.1 | 72% | 1.25x |
| 32 | 260.8 | 455.7 | 681.9 | 67% | 1.75x |
| 64 | 306.8 | 612.3 | 765.5 | 80% | 2.00x |
| 128 | 331.3 | 772.6 | 1011.7 | 76% | 2.33x |
**MoE Qwen3.6-35B-A3B-NVFP4** (decode t/s):
**Caveat on the vLLM column.** Besides the different harness, the vLLM MoE
@npl128 number here (1011.7 at 128/128) runs *hotter* than the 901 t/s reference
config (512/256), so the MoE "% of vLLM" reads **76% here vs ~86% at the
groundtruth config**. Memory: llama uses **1.5-3x lower** memory than vLLM.
| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
| 8 | 186.7 | 230.3 | 240.5 | 256.5 | 1.23x | +4% |
| 32 | 267.4 | 466.4 | 508.1 | 500.8 | 1.74x | +9% |
| 64 | 320.5 | 622.4 | 703.8 | 686.1 | 1.94x | +13% |
| 128 | 347.2 | 784.3 | 918.0 | 882.2 | 2.26x | +17% |
**Takeaway.** The patch series gives up to **2.59x (dense) / 2.33x (MoE)** over
stock on the same harness. Dense is **parity-to-ahead of vLLM**; MoE trails - the
remaining gap is structural (see section 5).
MoE **patched** is 90 / 93 / 91 / 89% of vLLM; **patched+bf16-tau** is 94 / 101 /
103 / 104%, i.e. parity-to-ahead at npl>=32.
**On bf16-tau.** The `patched+bf16-tau` column uses `--ssm-bf16-tau 64`. The tau
behind the previously documented "~+12%" was not recorded; the flag help suggests
32 or 64, and 64 stores more of the fast-decaying GDN heads as bf16. The flag is
**opt-in and NOT bit-exact** (~91% same-top-p; see section 5): it persists
fast-decaying GDN head state as bf16 to halve those heads' recurrence byte stream.
Measured decode gain over patched grows with serving width: **+3-4% at npl8, +9%
at npl32, +12-13% at npl64, +17% at npl128** (dense and MoE alike).
**Caveat on the vLLM column.** It is a **different harness** and a
**prior-session** measurement (not re-run this session), so the cross-engine "% of
vLLM" is **indicative, not apples-to-apples**. Memory (prior session): llama uses
**1.5-3x lower** memory than vLLM.
**Takeaway.** Re-measured this session, the patch series gives up to **2.46x
(dense) / 2.26x (MoE)** over true-stock `9d5d882d` on the same harness, slightly
below the prior 2.59x / 2.33x (the llama columns were re-measured; vLLM was kept).
Opt-in **bf16-tau adds a further +3% to +17%** on top of patched (growing with
width). Dense is **ahead of vLLM** with bf16-tau at every width; MoE **patched**
sits at ~89-93% of the prior-session vLLM and **bf16-tau reaches parity-to-ahead**
at npl>=32. The residual non-bf16 MoE gap is structural (see section 5).
### (c) Apple Silicon (M4, 16GB Metal) - does the patchset help here?


@@ -1,17 +1,33 @@
model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64
model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,32,119.9,885.2
q36-27b-nvfp4,llama-stock,64,142.8,885.1
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-27b-nvfp4,vllm,8,70.4,2096.2
q36-27b-nvfp4,vllm,32,211.8,2182.6
q36-27b-nvfp4,vllm,64,309.1,2088.9
q36-27b-nvfp4,vllm,128,418.8,1929.1
q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8
q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5
q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
q36-35b-a3b-nvfp4,vllm,128,882.2,5300.5
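The derived README columns ("x over stock", "bf16-tau over patched", "% of vLLM") can be recomputed from these rows. The following is a minimal sketch, not part of the repo: it embeds only the dense rows of the new CSV, and the helper names `load` and `derived` are illustrative. The vLLM-relative percentages remain indicative only, since those rows come from a different harness and a prior session.

```python
import csv
import io

# Dense rows copied from the new final_benchmark.csv.
CSV_TEXT = """model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,32,119.9,885.2
q36-27b-nvfp4,llama-stock,64,142.8,885.1
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-27b-nvfp4,vllm,8,70.4,2096.2
q36-27b-nvfp4,vllm,32,211.8,2182.6
q36-27b-nvfp4,vllm,64,309.1,2088.9
q36-27b-nvfp4,vllm,128,418.8,1929.1
"""


def load(text):
    """Map (engine, npl) -> aggregate decode throughput in t/s."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        table[(row["engine"], int(row["npl"]))] = float(row["decode_agg_tps"])
    return table


def derived(table, npl):
    """Recompute the derived README columns for one serving width."""
    stock = table[("llama-stock", npl)]
    patched = table[("llama-patched", npl)]
    bf16 = table[("llama-patched-bf16tau", npl)]
    vllm = table[("vllm", npl)]
    return {
        "x_over_stock": round(patched / stock, 2),
        "bf16_gain_pct": round((bf16 / patched - 1) * 100),
        "patched_pct_of_vllm": round(patched / vllm * 100),
        "bf16_pct_of_vllm": round(bf16 / vllm * 100),
    }


if __name__ == "__main__":
    table = load(CSV_TEXT)
    for npl in (8, 32, 64, 128):
        print(npl, derived(table, npl))
```

Swapping in the `q36-35b-a3b-nvfp4` rows should recompute the MoE columns the same way.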