docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau)

Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and README section 4 carry a single consistent set of llama numbers with all three configs:

- stock: separately built unpatched llama.cpp at this backend's exact pin 9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce stock: the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact, ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both models at all four widths (the prior CSV had no stock and no bf16-tau rows).

peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and the bench does not print it, so per-run peak could not be captured this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in bf16-tau adds a further +3% to +17% on top of patched (growing with width). vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 10:27:30 -04:00 · 2026-06-27 22:05:59 +00:00
parent ed5eb705c7
commit 3466094c68
2 changed files with 95 additions and 45 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -157,45 +157,79 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact

 Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
 **Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
-S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
-serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png),
+S_TG (t/s) from `llama-batched-bench`, `-fa on -ngl 99`, `npp 128 / ntg 128`,
+swept over serving width `npl` in {8, 32, 64, 128}. Plots:
+[`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png),
 [`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data
 [`final_benchmark.csv`](docs/final_benchmark.csv).

-### (a) + (b) Patched vs stock vs vLLM
+> **What was re-measured (2026-06-27).** The three llama columns - **stock**,
+> **patched**, and **patched+bf16-tau** - were all re-measured this session on one
+> consistent `llama-batched-bench` harness. The **vLLM** column is the
+> **prior-session reference** (kept as-is, *not* re-run this session). Per-run peak
+> VRAM was *not* re-captured: the GB10's unified Grace-Blackwell LPDDR5x reports
+> `[N/A]` to `nvidia-smi --query-gpu=memory.used` and the bench does not print it
+> (the memory-advantage note below is the prior-session finding).

-The **stock** and **patched** columns are the same binary, env-toggled, on the
-**same harness** (`llama-batched-bench`) - so "x over stock" is an exact
-apples-to-apples measure of the patch series' contribution. The **vLLM** column
-is a **different harness** (vLLM server + client continuous batching), so the
+### (a) + (b) Patched vs stock vs patched+bf16-tau vs vLLM
+
+The **stock** column is a separate, unpatched llama.cpp built at this backend's
+**exact pin (`9d5d882d`)**; the **patched** and **patched+bf16-tau** columns are
+the paged binary, env/flag-toggled (`LLAMA_KV_PAGED=1`, plus
+`LLAMA_MOE_FORCE_GRAPHS=1` for MoE; bf16-tau adds `--ssm-bf16-tau 64`). All three
+run on the **same harness**, so "x over stock" is an apples-to-apples measure of
+the patch series. (Note: the patch series' dominant SSM decode fusions are
+compiled in, not env-gated - toggling `LLAMA_KV_PAGED` alone on the *patched*
+binary does **not** reproduce stock; only the separately-built unpatched
+`9d5d882d` binary does.) The **vLLM** column is a **different harness** (vLLM
+server + client continuous batching) and a **prior-session reference**, so the
 cross-engine "% of vLLM" is **indicative, not apples-to-apples**.

-**Dense Qwen3.6-27B-NVFP4** (t/s):
+**Dense Qwen3.6-27B-NVFP4** (decode t/s):

-| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
-|----:|------:|--------:|-----:|------------------:|---------------------:|
-| 8   |  65.7 |   84.0 |  71.1 | 118% | 1.28x |
-| 32  | 113.7 |  204.0 | 207.6 |  98% | 1.79x |
-| 64  | 134.3 |  294.9 | 309.7 |  95% | 2.20x |
-| 128 | 143.5 |  371.2 | 422.4 |  88% | 2.59x |
+| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
+|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
+| 8   |  68.3 |   85.3 |             87.8 |         70.4 | 1.25x | +3%  |
+| 32  | 119.9 |  211.9 |            231.0 |        211.8 | 1.77x | +9%  |
+| 64  | 142.8 |  305.2 |            341.4 |        309.1 | 2.14x | +12% |
+| 128 | 155.1 |  382.1 |            446.1 |        418.8 | 2.46x | +17% |

-**MoE Qwen3.6-35B-A3B-NVFP4** (t/s):
+Dense **patched** is ahead of vLLM at npl 8, at parity at npl 32-64, and behind at
+npl 128 (121 / 100 / 99 / 91% of vLLM across the widths); **patched+bf16-tau** is
+**ahead of vLLM at every width** (125 / 109 / 110 / 107%).

-| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
-|----:|------:|--------:|------:|-----------------:|---------------------:|
-| 8   | 181.4 |  227.4 |  315.1 | 72% | 1.25x |
-| 32  | 260.8 |  455.7 |  681.9 | 67% | 1.75x |
-| 64  | 306.8 |  612.3 |  765.5 | 80% | 2.00x |
-| 128 | 331.3 |  772.6 | 1011.7 | 76% | 2.33x |
+**MoE Qwen3.6-35B-A3B-NVFP4** (decode t/s):

-**Caveat on the vLLM column.** Besides the different harness, the vLLM MoE
-@npl128 number here (1011.7 at 128/128) runs *hotter* than the 901 t/s reference
-config (512/256), so the MoE "% of vLLM" reads **76% here vs ~86% at the
-groundtruth config**. Memory: llama uses **1.5-3x lower** memory than vLLM.
+| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
+|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
+| 8   | 186.7 |  230.3 |            240.5 |        256.5 | 1.23x | +4%  |
+| 32  | 267.4 |  466.4 |            508.1 |        500.8 | 1.74x | +9%  |
+| 64  | 320.5 |  622.4 |            703.8 |        686.1 | 1.94x | +13% |
+| 128 | 347.2 |  784.3 |            918.0 |        882.2 | 2.26x | +17% |

-**Takeaway.** The patch series gives up to **2.59x (dense) / 2.33x (MoE)** over
-stock on the same harness. Dense is **parity-to-ahead of vLLM**; MoE trails - the
-remaining gap is structural (see section 5).
+MoE **patched** is 90 / 93 / 91 / 89% of vLLM; **patched+bf16-tau** is 94 / 101 /
+103 / 104% of vLLM, reaching parity-to-ahead at npl>=32.
+
+**On bf16-tau.** The `patched+bf16-tau` column uses `--ssm-bf16-tau 64` (no exact
+tau was recorded behind the documented "~+12%"; the flag help suggests 32/64, and
+64 stores more of the fast-decaying GDN heads as bf16). It is **opt-in and NOT
+bit-exact** (~91% same-top-p; see section 5): it persists fast-decaying GDN head
+state as bf16 to halve that head's recurrence byte stream. Measured decode gain
+over patched grows with serving width: **+3-4% at npl8, +9% at npl32, +12-13% at
+npl64, +17% at npl128** (dense and MoE alike).
+
+**Caveat on the vLLM column.** It is a **different harness** and a
+**prior-session** measurement (not re-run this session), so the cross-engine "% of
+vLLM" is **indicative, not apples-to-apples**. Memory (prior session): llama uses
+**1.5-3x lower** memory than vLLM.
+
+**Takeaway.** Re-measured this session, the patch series gives up to **2.46x
+(dense) / 2.26x (MoE)** over true-stock `9d5d882d` on the same harness, slightly
+below the prior 2.59x / 2.33x (llama columns re-measured; vLLM column kept).
+Opt-in **bf16-tau adds a further +3% to +17%** on top of patched (growing with
+width). Dense is **ahead of vLLM** with bf16-tau at every width; MoE **patched**
+sits at ~89-93% of the prior-session vLLM and **bf16-tau reaches parity-to-ahead**
+at npl>=32. The residual non-bf16 MoE gap is structural (see section 5).

 ### (c) Apple Silicon (M4, 16GB Metal) - does the patchset help here?

--- a/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
+++ b/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
@@ -1,17 +1,33 @@
-model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
-q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
-q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
-q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
-q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
-q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
-q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
-q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
-q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
-q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
-q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
-q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
-q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
-q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
-q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
-q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
-q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64
+model,engine,npl,decode_agg_tps,prefill_tps
+q36-27b-nvfp4,llama-stock,8,68.3,937.7
+q36-27b-nvfp4,llama-stock,32,119.9,885.2
+q36-27b-nvfp4,llama-stock,64,142.8,885.1
+q36-27b-nvfp4,llama-stock,128,155.1,887.2
+q36-27b-nvfp4,llama-patched,8,85.3,915.1
+q36-27b-nvfp4,llama-patched,32,211.9,919.0
+q36-27b-nvfp4,llama-patched,64,305.2,923.5
+q36-27b-nvfp4,llama-patched,128,382.1,922.9
+q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
+q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
+q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
+q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
+q36-27b-nvfp4,vllm,8,70.4,2096.2
+q36-27b-nvfp4,vllm,32,211.8,2182.6
+q36-27b-nvfp4,vllm,64,309.1,2088.9
+q36-27b-nvfp4,vllm,128,418.8,1929.1
+q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
+q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8
+q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5
+q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
+q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
+q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
+q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
+q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
+q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
+q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
+q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
+q36-35b-a3b-nvfp4,vllm,128,882.2,5300.5
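
The derived README columns ("patched x over stock", "bf16-tau over patched", "% of vLLM") follow directly from the `decode_agg_tps` values in `final_benchmark.csv`. The sketch below is one way to recompute them as a cross-check; it is not part of this commit. The helper names `load_decode` and `derived` are hypothetical, and the embedded rows are a subset (dense model, npl 8 and 128) copied from the CSV in this diff rather than read from disk.

```python
import csv
import io

# Subset of final_benchmark.csv copied from this diff (dense model only).
CSV_TEXT = """model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-27b-nvfp4,vllm,8,70.4,2096.2
q36-27b-nvfp4,vllm,128,418.8,1929.1
"""


def load_decode(text):
    """Return {(model, npl): {engine: decode_agg_tps}} from CSV text."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["model"], int(row["npl"]))
        table.setdefault(key, {})[row["engine"]] = float(row["decode_agg_tps"])
    return table


def derived(engines):
    """Recompute the README's derived columns for one (model, npl) cell."""
    stock = engines["llama-stock"]
    patched = engines["llama-patched"]
    bf16 = engines["llama-patched-bf16tau"]
    vllm = engines["vllm"]
    return {
        "patched_x_over_stock": round(patched / stock, 2),
        "bf16tau_over_patched_pct": round((bf16 / patched - 1) * 100),
        "patched_pct_of_vllm": round(patched / vllm * 100),
        "bf16tau_pct_of_vllm": round(bf16 / vllm * 100),
    }


if __name__ == "__main__":
    for (model, npl), engines in sorted(load_decode(CSV_TEXT).items()):
        print(model, npl, derived(engines))
```

To check the full matrix, the same functions could be pointed at the contents of `docs/final_benchmark.csv` instead of the embedded subset. Keep in mind that the vLLM rows come from a different harness and a prior session, so the "% of vLLM" figures remain indicative only.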