mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-28 10:27:30 -04:00
docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau)
Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and README section 4 carry a single consistent set of llama numbers with all three configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin 9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact, ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both models at all four widths (the prior CSV had no stock and no bf16-tau rows).

peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and the bench does not print it, so per-run peak could not be captured this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in bf16-tau adds a further +3% to +17% on top of patched (growing with width). vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
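The three configurations in the commit message differ only in which binary runs, which environment variables are set, and one extra flag. The sketch below spells that matrix out as a minimal illustration. The binary and model paths are hypothetical (the commit names neither), and passing all four widths in a single `-npl` list is an assumption about how the sweep was invoked. Any other flags the session may have used, such as context size, are not recorded in the commit and are left out.

```python
import shlex

# Hypothetical paths: the commit names neither the binaries nor the GGUF files.
STOCK_BIN = "./llama-stock/llama-batched-bench"      # unpatched build at pin 9d5d882d
PATCHED_BIN = "./llama-patched/llama-batched-bench"  # paged (patched) build
MODELS = {
    # model id -> (hypothetical path, is_moe)
    "q36-27b-nvfp4": ("models/q36-27b-nvfp4.gguf", False),
    "q36-35b-a3b-nvfp4": ("models/q36-35b-a3b-nvfp4.gguf", True),
}
CONFIGS = ("stock", "patched", "patched+bf16-tau")
# Flags quoted in the commit message; one -npl list per run is an assumption.
COMMON = ["-fa", "on", "-ngl", "99", "-npp", "128", "-ntg", "128",
          "-npl", "8,32,64,128"]


def plan(model, config):
    """Return (env, argv) for one cell of the benchmark matrix."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config}")
    path, is_moe = MODELS[model]
    binary = STOCK_BIN if config == "stock" else PATCHED_BIN
    argv = [binary, "-m", path, *COMMON]
    env = {}
    if config != "stock":
        env["LLAMA_KV_PAGED"] = "1"
        if is_moe:
            env["LLAMA_MOE_FORCE_GRAPHS"] = "1"
    if config == "patched+bf16-tau":
        argv += ["--ssm-bf16-tau", "64"]  # opt-in, not bit-exact
    return env, argv


for model in MODELS:
    for config in CONFIGS:
        env, argv = plan(model, config)
        parts = [f"{k}={v}" for k, v in env.items()] + [shlex.join(argv)]
        print(f"[{model} / {config}] " + " ".join(parts))
```

Stock is modelled as a separate binary rather than as the patched binary with the environment variable unset, because the commit states that the SSM decode fusions are compiled in and cannot be toggled off at run time.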
@@ -157,45 +157,79 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 
 Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
 **Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
-S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
-serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png),
+S_TG (t/s) from `llama-batched-bench`, `-fa on -ngl 99`, `npp 128 / ntg 128`,
+swept over serving width `npl` in {8, 32, 64, 128}. Plots:
+[`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png),
 [`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data
 [`final_benchmark.csv`](docs/final_benchmark.csv).
 
-### (a) + (b) Patched vs stock vs vLLM
+> **What was re-measured (2026-06-27).** The three llama columns - **stock**,
+> **patched**, and **patched+bf16-tau** - were all re-measured this session on one
+> consistent `llama-batched-bench` harness. The **vLLM** column is the
+> **prior-session reference** (kept as-is, *not* re-run this session). Per-run peak
+> VRAM was *not* re-captured: the GB10's unified Grace-Blackwell LPDDR5x reports
+> `[N/A]` to `nvidia-smi --query-gpu=memory.used` and the bench does not print it
+> (the memory-advantage note below is the prior-session finding).
 
-The **stock** and **patched** columns are the same binary, env-toggled, on the
-**same harness** (`llama-batched-bench`) - so "x over stock" is an exact
-apples-to-apples measure of the patch series' contribution. The **vLLM** column
-is a **different harness** (vLLM server + client continuous batching), so the
+### (a) + (b) Patched vs stock vs patched+bf16-tau vs vLLM
+
+The **stock** column is a separate, unpatched llama.cpp built at this backend's
+**exact pin (`9d5d882d`)**; the **patched** and **patched+bf16-tau** columns are
+the paged binary, env/flag-toggled (`LLAMA_KV_PAGED=1`, plus
+`LLAMA_MOE_FORCE_GRAPHS=1` for MoE; bf16-tau adds `--ssm-bf16-tau 64`). All three
+run on the **same harness**, so "x over stock" is an apples-to-apples measure of
+the patch series. (Note: the patch series' dominant SSM decode fusions are
+compiled in, not env-gated - toggling `LLAMA_KV_PAGED` alone on the *patched*
+binary does **not** reproduce stock; only the separately-built unpatched
+`9d5d882d` binary does.) The **vLLM** column is a **different harness** (vLLM
+server + client continuous batching) and a **prior-session reference**, so the
 cross-engine "% of vLLM" is **indicative, not apples-to-apples**.
 
-**Dense Qwen3.6-27B-NVFP4** (t/s):
+**Dense Qwen3.6-27B-NVFP4** (decode t/s):
 
-| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
-|----:|------:|--------:|-----:|------------------:|---------------------:|
-| 8 | 65.7 | 84.0 | 71.1 | 118% | 1.28x |
-| 32 | 113.7 | 204.0 | 207.6 | 98% | 1.79x |
-| 64 | 134.3 | 294.9 | 309.7 | 95% | 2.20x |
-| 128 | 143.5 | 371.2 | 422.4 | 88% | 2.59x |
+| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
+|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
+| 8 | 68.3 | 85.3 | 87.8 | 70.4 | 1.25x | +3% |
+| 32 | 119.9 | 211.9 | 231.0 | 211.8 | 1.77x | +9% |
+| 64 | 142.8 | 305.2 | 341.4 | 309.1 | 2.14x | +12% |
+| 128 | 155.1 | 382.1 | 446.1 | 418.8 | 2.46x | +17% |
 
-**MoE Qwen3.6-35B-A3B-NVFP4** (t/s):
+Dense **patched** is parity-to-ahead of vLLM (121 / 100 / 99 / 91% of vLLM across
+the widths); **patched+bf16-tau** is **ahead of vLLM at every width** (125 / 109 /
+110 / 107%).
 
-| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
-|----:|------:|--------:|------:|-----------------:|---------------------:|
-| 8 | 181.4 | 227.4 | 315.1 | 72% | 1.25x |
-| 32 | 260.8 | 455.7 | 681.9 | 67% | 1.75x |
-| 64 | 306.8 | 612.3 | 765.5 | 80% | 2.00x |
-| 128 | 331.3 | 772.6 | 1011.7 | 76% | 2.33x |
+**MoE Qwen3.6-35B-A3B-NVFP4** (decode t/s):
 
-**Caveat on the vLLM column.** Besides the different harness, the vLLM MoE
-@npl128 number here (1011.7 at 128/128) runs *hotter* than the 901 t/s reference
-config (512/256), so the MoE "% of vLLM" reads **76% here vs ~86% at the
-groundtruth config**. Memory: llama uses **1.5-3x lower** memory than vLLM.
+| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
+|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
+| 8 | 186.7 | 230.3 | 240.5 | 256.5 | 1.23x | +4% |
+| 32 | 267.4 | 466.4 | 508.1 | 500.8 | 1.74x | +9% |
+| 64 | 320.5 | 622.4 | 703.8 | 686.1 | 1.94x | +13% |
+| 128 | 347.2 | 784.3 | 918.0 | 882.2 | 2.26x | +17% |
 
-**Takeaway.** The patch series gives up to **2.59x (dense) / 2.33x (MoE)** over
-stock on the same harness. Dense is **parity-to-ahead of vLLM**; MoE trails - the
-remaining gap is structural (see section 5).
+MoE **patched** is 90 / 93 / 91 / 89% of vLLM; **patched+bf16-tau** reaches
+parity-to-ahead (94 / 101 / 103 / 104%) at npl>=32.
+
+**On bf16-tau.** The `patched+bf16-tau` column uses `--ssm-bf16-tau 64` (no exact
+tau was recorded behind the documented "~+12%"; the flag help suggests 32/64, and
+64 bf16's more of the fast-decaying GDN heads). It is **opt-in and NOT bit-exact**
+(~91% same-top-p; see section 5) - it persists fast-decaying GDN head state as
+bf16 to halve that head's recurrence byte stream. Measured decode gain over
+patched grows with serving width: **+3-4% at npl8, ~+12-13% at npl64, +17% at
+npl128** (dense and MoE alike).
+
+**Caveat on the vLLM column.** It is a **different harness** and a
+**prior-session** measurement (not re-run this session), so the cross-engine "% of
+vLLM" is **indicative, not apples-to-apples**. Memory (prior session): llama uses
+**1.5-3x lower** memory than vLLM.
+
+**Takeaway.** Re-measured this session, the patch series gives up to **2.46x
+(dense) / 2.26x (MoE)** over true-stock `9d5d882d` on the same harness (close to,
+slightly below, the prior 2.59x / 2.33x - llama was re-measured, vLLM kept).
+Opt-in **bf16-tau adds a further +3% to +17%** on top of patched (growing with
+width). Dense is **ahead of vLLM** with bf16-tau at every width; MoE **patched**
+sits at ~89-93% of the prior-session vLLM and **bf16-tau reaches parity-to-ahead**
+at npl>=32. The residual non-bf16 MoE gap is structural (see section 5).
 
 ### (c) Apple Silicon (M4, 16GB Metal) - does the patchset help here?
 
@@ -1,17 +1,33 @@
-model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
-q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
-q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
-q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
-q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
-q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
-q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
-q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
-q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
-q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
-q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
-q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
-q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
-q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
-q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
-q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
-q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64
+model,engine,npl,decode_agg_tps,prefill_tps
+q36-27b-nvfp4,llama-stock,8,68.3,937.7
+q36-27b-nvfp4,llama-stock,32,119.9,885.2
+q36-27b-nvfp4,llama-stock,64,142.8,885.1
+q36-27b-nvfp4,llama-stock,128,155.1,887.2
+q36-27b-nvfp4,llama-patched,8,85.3,915.1
+q36-27b-nvfp4,llama-patched,32,211.9,919.0
+q36-27b-nvfp4,llama-patched,64,305.2,923.5
+q36-27b-nvfp4,llama-patched,128,382.1,922.9
+q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
+q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
+q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
+q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
+q36-27b-nvfp4,vllm,8,70.4,2096.2
+q36-27b-nvfp4,vllm,32,211.8,2182.6
+q36-27b-nvfp4,vllm,64,309.1,2088.9
+q36-27b-nvfp4,vllm,128,418.8,1929.1
+q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
+q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8
+q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5
+q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
+q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
+q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
+q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
+q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
+q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
+q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
+q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
+q36-35b-a3b-nvfp4,vllm,128,882.2,5300.5
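The derived columns and percentages quoted in the README hunk ("x over stock", "bf16-tau over patched", "% of vLLM") follow directly from the `decode_agg_tps` values in the new CSV rows. The sketch below recomputes them as a cross-check. The `DECODE` dictionary and the `derive` helper are illustrative names, not part of the repository, and the numbers are copied by hand from the added rows above rather than read from `docs/final_benchmark.csv`.

```python
# decode_agg_tps per engine at npl = 8, 32, 64, 128, copied from the new CSV rows.
NPL = [8, 32, 64, 128]
DECODE = {
    "q36-27b-nvfp4": {
        "llama-stock": [68.3, 119.9, 142.8, 155.1],
        "llama-patched": [85.3, 211.9, 305.2, 382.1],
        "llama-patched-bf16tau": [87.8, 231.0, 341.4, 446.1],
        "vllm": [70.4, 211.8, 309.1, 418.8],
    },
    "q36-35b-a3b-nvfp4": {
        "llama-stock": [186.7, 267.4, 320.5, 347.2],
        "llama-patched": [230.3, 466.4, 622.4, 784.3],
        "llama-patched-bf16tau": [240.5, 508.1, 703.8, 918.0],
        "vllm": [256.5, 500.8, 686.1, 882.2],
    },
}


def derive(model):
    """Recompute the README's derived columns for one model."""
    d = DECODE[model]
    rows = []
    for i, npl in enumerate(NPL):
        stock = d["llama-stock"][i]
        patched = d["llama-patched"][i]
        tau = d["llama-patched-bf16tau"][i]
        vllm = d["vllm"][i]
        rows.append({
            "npl": npl,
            "x_over_stock": round(patched / stock, 2),
            "tau_gain_pct": round((tau / patched - 1) * 100),
            "patched_pct_vllm": round(patched / vllm * 100),
            "tau_pct_vllm": round(tau / vllm * 100),
        })
    return rows


for model in DECODE:
    print(model)
    for row in derive(model):
        print("  ", row)
    best = max(r["x_over_stock"] for r in derive(model))
    print(f"   max patched speedup over stock: {best:.2f}x")
```

Because the vLLM values are a prior-session reference on a different harness, the two "% of vLLM" fields are indicative only, as the README text itself notes.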