diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md
index 9259c3c77..245c30ad2 100644
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -157,45 +157,79 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 
 Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
 **Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
-S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
-serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png),
+S_TG (t/s) from `llama-batched-bench`, `-fa on -ngl 99`, `npp 128 / ntg 128`,
+swept over serving width `npl` in {8, 32, 64, 128}. Plots:
+[`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png),
 [`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data
 [`final_benchmark.csv`](docs/final_benchmark.csv).
 
-### (a) + (b) Patched vs stock vs vLLM
+> **What was re-measured (2026-06-27).** The three llama columns - **stock**,
+> **patched**, and **patched+bf16-tau** - were all re-measured this session on one
+> consistent `llama-batched-bench` harness. The **vLLM** column is the
+> **prior-session reference** (kept as-is, *not* re-run this session). Per-run peak
+> VRAM was *not* re-captured: the GB10's unified Grace-Blackwell LPDDR5x reports
+> `[N/A]` to `nvidia-smi --query-gpu=memory.used` and the bench does not print it
+> (the memory-advantage note below is the prior-session finding).
 
-The **stock** and **patched** columns are the same binary, env-toggled, on the
-**same harness** (`llama-batched-bench`) - so "x over stock" is an exact
-apples-to-apples measure of the patch series' contribution. The **vLLM** column
-is a **different harness** (vLLM server + client continuous batching), so the
+### (a) + (b) Patched vs stock vs patched+bf16-tau vs vLLM
+
+The **stock** column is a separate, unpatched llama.cpp built at this backend's
+**exact pin (`9d5d882d`)**; the **patched** and **patched+bf16-tau** columns are
+the paged binary, env/flag-toggled (`LLAMA_KV_PAGED=1`, plus
+`LLAMA_MOE_FORCE_GRAPHS=1` for MoE; bf16-tau adds `--ssm-bf16-tau 64`). All three
+run on the **same harness**, so "x over stock" is an apples-to-apples measure of
+the patch series. (Note: the patch series' dominant SSM decode fusions are
+compiled in, not env-gated - toggling `LLAMA_KV_PAGED` alone on the *patched*
+binary does **not** reproduce stock; only the separately-built unpatched
+`9d5d882d` binary does.) The **vLLM** column is a **different harness** (vLLM
+server + client continuous batching) and a **prior-session reference**, so the
 cross-engine "% of vLLM" is **indicative, not apples-to-apples**.
 
-**Dense Qwen3.6-27B-NVFP4** (t/s):
+**Dense Qwen3.6-27B-NVFP4** (decode t/s):
 
-| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
-|----:|------:|--------:|-----:|------------------:|---------------------:|
-| 8 | 65.7 | 84.0 | 71.1 | 118% | 1.28x |
-| 32 | 113.7 | 204.0 | 207.6 | 98% | 1.79x |
-| 64 | 134.3 | 294.9 | 309.7 | 95% | 2.20x |
-| 128 | 143.5 | 371.2 | 422.4 | 88% | 2.59x |
+| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
+|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
+| 8 | 68.3 | 85.3 | 87.8 | 70.4 | 1.25x | +3% |
+| 32 | 119.9 | 211.9 | 231.0 | 211.8 | 1.77x | +9% |
+| 64 | 142.8 | 305.2 | 341.4 | 309.1 | 2.14x | +12% |
+| 128 | 155.1 | 382.1 | 446.1 | 418.8 | 2.46x | +17% |
 
-**MoE Qwen3.6-35B-A3B-NVFP4** (t/s):
+Dense **patched** is parity-to-ahead of vLLM (121 / 100 / 99 / 91% of vLLM across
+the widths); **patched+bf16-tau** is **ahead of vLLM at every width** (125 / 109 /
+110 / 107%).
 
-| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
-|----:|------:|--------:|------:|-----------------:|---------------------:|
-| 8 | 181.4 | 227.4 | 315.1 | 72% | 1.25x |
-| 32 | 260.8 | 455.7 | 681.9 | 67% | 1.75x |
-| 64 | 306.8 | 612.3 | 765.5 | 80% | 2.00x |
-| 128 | 331.3 | 772.6 | 1011.7 | 76% | 2.33x |
+**MoE Qwen3.6-35B-A3B-NVFP4** (decode t/s):
 
-**Caveat on the vLLM column.** Besides the different harness, the vLLM MoE
-@npl128 number here (1011.7 at 128/128) runs *hotter* than the 901 t/s reference
-config (512/256), so the MoE "% of vLLM" reads **76% here vs ~86% at the
-groundtruth config**. Memory: llama uses **1.5-3x lower** memory than vLLM.
+| npl | stock | patched | patched+bf16-tau | vLLM (prior) | patched x over stock | bf16-tau over patched |
+|----:|------:|--------:|-----------------:|-------------:|---------------------:|----------------------:|
+| 8 | 186.7 | 230.3 | 240.5 | 256.5 | 1.23x | +4% |
+| 32 | 267.4 | 466.4 | 508.1 | 500.8 | 1.74x | +9% |
+| 64 | 320.5 | 622.4 | 703.8 | 686.1 | 1.94x | +13% |
+| 128 | 347.2 | 784.3 | 918.0 | 882.2 | 2.26x | +17% |
 
-**Takeaway.** The patch series gives up to **2.59x (dense) / 2.33x (MoE)** over
-stock on the same harness. Dense is **parity-to-ahead of vLLM**; MoE trails - the
-remaining gap is structural (see section 5).
+MoE **patched** is 90 / 93 / 91 / 89% of vLLM; **patched+bf16-tau** reaches
+parity-to-ahead (94 / 101 / 103 / 104%) at npl>=32.
+
+**On bf16-tau.** The `patched+bf16-tau` column uses `--ssm-bf16-tau 64` (no exact
+tau was recorded behind the documented "~+12%"; the flag help suggests 32/64, and
+64 bf16's more of the fast-decaying GDN heads). It is **opt-in and NOT bit-exact**
+(~91% same-top-p; see section 5) - it persists fast-decaying GDN head state as
+bf16 to halve that head's recurrence byte stream. Measured decode gain over
+patched grows with serving width: **+3-4% at npl8, ~+12-13% at npl64, +17% at
+npl128** (dense and MoE alike).
+
+**Caveat on the vLLM column.** It is a **different harness** and a
+**prior-session** measurement (not re-run this session), so the cross-engine "% of
+vLLM" is **indicative, not apples-to-apples**. Memory (prior session): llama uses
+**1.5-3x lower** memory than vLLM.
+
+**Takeaway.** Re-measured this session, the patch series gives up to **2.46x
+(dense) / 2.26x (MoE)** over true-stock `9d5d882d` on the same harness (close to,
+slightly below, the prior 2.59x / 2.33x - llama was re-measured, vLLM kept).
+Opt-in **bf16-tau adds a further +3% to +17%** on top of patched (growing with
+width). Dense is **ahead of vLLM** with bf16-tau at every width; MoE **patched**
+sits at ~89-93% of the prior-session vLLM and **bf16-tau reaches parity-to-ahead**
+at npl>=32. The residual non-bf16 MoE gap is structural (see section 5).
 
 ### (c) Apple Silicon (M4, 16GB Metal) - does the patchset help here?
 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv b/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
index 3b85165de..de9c24737 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
+++ b/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
@@ -1,17 +1,33 @@
-model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
-q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
-q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
-q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
-q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
-q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
-q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
-q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
-q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
-q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
-q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
-q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
-q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
-q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
-q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
-q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
-q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64
+model,engine,npl,decode_agg_tps,prefill_tps
+q36-27b-nvfp4,llama-stock,8,68.3,937.7
+q36-27b-nvfp4,llama-stock,32,119.9,885.2
+q36-27b-nvfp4,llama-stock,64,142.8,885.1
+q36-27b-nvfp4,llama-stock,128,155.1,887.2
+q36-27b-nvfp4,llama-patched,8,85.3,915.1
+q36-27b-nvfp4,llama-patched,32,211.9,919.0
+q36-27b-nvfp4,llama-patched,64,305.2,923.5
+q36-27b-nvfp4,llama-patched,128,382.1,922.9
+q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
+q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
+q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
+q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
+q36-27b-nvfp4,vllm,8,70.4,2096.2
+q36-27b-nvfp4,vllm,32,211.8,2182.6
+q36-27b-nvfp4,vllm,64,309.1,2088.9
+q36-27b-nvfp4,vllm,128,418.8,1929.1
+q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
+q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8
+q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5
+q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
+q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
+q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
+q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
+q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
+q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
+q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
+q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
+q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
+q36-35b-a3b-nvfp4,vllm,128,882.2,5300.5
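
Review note: the derived README columns ("patched x over stock", "bf16-tau over patched", and the "% of vLLM" figures quoted in the prose) can be recomputed from the `decode_agg_tps` values in the new `final_benchmark.csv`. The sketch below is a minimal cross-check, not part of the patch. The function name `derived_columns` and the dictionary keys are hypothetical, and only the four dense npl=128 rows are embedded for brevity. It assumes the README rounds ratios to two decimals and percentages to the nearest integer.

```python
import csv
import io

# Four rows copied verbatim from the new docs/final_benchmark.csv
# (dense model, npl=128 only; extend with the remaining rows as needed).
CSV_TEXT = """model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-27b-nvfp4,vllm,128,418.8,1929.1
"""


def derived_columns(csv_text):
    """Group decode t/s by (model, npl) and compute the README's derived ratios.

    Hypothetical helper for review purposes; assumes every (model, npl) group
    contains all four engines named in the CSV.
    """
    tps = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["model"], int(row["npl"]))
        tps.setdefault(key, {})[row["engine"]] = float(row["decode_agg_tps"])

    out = {}
    for key, e in tps.items():
        stock = e["llama-stock"]
        patched = e["llama-patched"]
        bf16 = e["llama-patched-bf16tau"]
        vllm = e["vllm"]
        out[key] = {
            # Speedup of the patched binary over the unpatched build.
            "patched_x_over_stock": round(patched / stock, 2),
            # Extra gain from bf16-tau, as a whole-number percentage.
            "bf16tau_over_patched_pct": round((bf16 / patched - 1) * 100),
            # Cross-engine figures (indicative only, per the README caveat).
            "patched_pct_of_vllm": round(patched / vllm * 100),
            "bf16tau_pct_of_vllm": round(bf16 / vllm * 100),
        }
    return out


if __name__ == "__main__":
    for key, cols in derived_columns(CSV_TEXT).items():
        print(key, cols)
```

For the dense npl=128 rows this reproduces the README values: 2.46x over stock, +17% for bf16-tau over patched, and 91% / 107% of the prior-session vLLM figure.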