Mirror of https://github.com/mudler/LocalAI.git (synced 2026-06-28 10:27:30 -04:00)
Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and README section 4 carry a single consistent set of llama numbers with all three configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin 9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce stock: the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact, ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both models at all four widths (the prior CSV had no stock and no bf16-tau rows).

peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and the bench does not print it, so per-run peak could not be captured this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in bf16-tau adds a further +3% to +17% on top of patched (growing with width).

vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
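The run matrix described in the commit message (two models, three llama configs, four batch widths) can be enumerated as in the sketch below. It only prints candidate command lines and executes nothing. The flags and environment variables are the ones quoted in the commit message. The binary paths, the GGUF file names, the `-m` model argument, and passing all four widths to one invocation as `-npl 8,32,64,128` are assumptions made for illustration, not details taken from the commit.

```python
import shlex

# Hypothetical locations: the commit names neither the binaries nor the GGUF files.
STOCK_BIN = "./llama.cpp-stock/build/bin/llama-batched-bench"
PATCHED_BIN = "./llama.cpp-patched/build/bin/llama-batched-bench"
MODELS = {
    "q36-27b": {"path": "models/q36-27b-nvfp4.gguf", "moe": False},
    "q36-35b-a3b": {"path": "models/q36-35b-a3b-nvfp4.gguf", "moe": True},
}

# Flags quoted in the commit message; the comma-separated -npl list is an
# assumption about how the four widths were passed.
COMMON_ARGS = ["-fa", "on", "-ngl", "99", "-npp", "128", "-ntg", "128",
               "-npl", "8,32,64,128"]


def plan_runs():
    """Return one entry per (model, config) with its environment and argv."""
    runs = []
    for name, model in MODELS.items():
        for config in ("stock", "patched", "patched+bf16-tau"):
            env = {}
            # Stock uses a separately built, unpatched binary.
            binary = STOCK_BIN if config == "stock" else PATCHED_BIN
            args = [binary, "-m", model["path"], *COMMON_ARGS]
            if config != "stock":
                env["LLAMA_KV_PAGED"] = "1"
                if model["moe"]:
                    env["LLAMA_MOE_FORCE_GRAPHS"] = "1"
            if config == "patched+bf16-tau":
                args += ["--ssm-bf16-tau", "64"]
            runs.append({"model": name, "config": config, "env": env, "args": args})
    return runs


if __name__ == "__main__":
    for run in plan_runs():
        prefix = " ".join(f"{k}={v}" for k, v in run["env"].items())
        print(f"# {run['model']} / {run['config']}")
        print((prefix + " " if prefix else "") + shlex.join(run["args"]))
```

Keeping the stock binary as a separate path reflects the commit's note that toggling `LLAMA_KV_PAGED` on the patched build does not reproduce stock behaviour.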
1.4 KiB
| model | engine | npl | decode_agg_tps | prefill_tps |
|---|---|---|---|---|
| q36-27b-nvfp4 | llama-stock | 8 | 68.3 | 937.7 |
| q36-27b-nvfp4 | llama-stock | 32 | 119.9 | 885.2 |
| q36-27b-nvfp4 | llama-stock | 64 | 142.8 | 885.1 |
| q36-27b-nvfp4 | llama-stock | 128 | 155.1 | 887.2 |
| q36-27b-nvfp4 | llama-patched | 8 | 85.3 | 915.1 |
| q36-27b-nvfp4 | llama-patched | 32 | 211.9 | 919.0 |
| q36-27b-nvfp4 | llama-patched | 64 | 305.2 | 923.5 |
| q36-27b-nvfp4 | llama-patched | 128 | 382.1 | 922.9 |
| q36-27b-nvfp4 | llama-patched-bf16tau | 8 | 87.8 | 919.2 |
| q36-27b-nvfp4 | llama-patched-bf16tau | 32 | 231.0 | 931.1 |
| q36-27b-nvfp4 | llama-patched-bf16tau | 64 | 341.4 | 930.7 |
| q36-27b-nvfp4 | llama-patched-bf16tau | 128 | 446.1 | 932.2 |
| q36-27b-nvfp4 | vllm | 8 | 70.4 | 2096.2 |
| q36-27b-nvfp4 | vllm | 32 | 211.8 | 2182.6 |
| q36-27b-nvfp4 | vllm | 64 | 309.1 | 2088.9 |
| q36-27b-nvfp4 | vllm | 128 | 418.8 | 1929.1 |
| q36-35b-a3b-nvfp4 | llama-stock | 8 | 186.7 | 1501.5 |
| q36-35b-a3b-nvfp4 | llama-stock | 32 | 267.4 | 1856.8 |
| q36-35b-a3b-nvfp4 | llama-stock | 64 | 320.5 | 1949.5 |
| q36-35b-a3b-nvfp4 | llama-stock | 128 | 347.2 | 1995.4 |
| q36-35b-a3b-nvfp4 | llama-patched | 8 | 230.3 | 1510.3 |
| q36-35b-a3b-nvfp4 | llama-patched | 32 | 466.4 | 1969.2 |
| q36-35b-a3b-nvfp4 | llama-patched | 64 | 622.4 | 2122.8 |
| q36-35b-a3b-nvfp4 | llama-patched | 128 | 784.3 | 2177.0 |
| q36-35b-a3b-nvfp4 | llama-patched-bf16tau | 8 | 240.5 | 1539.8 |
| q36-35b-a3b-nvfp4 | llama-patched-bf16tau | 32 | 508.1 | 2031.7 |
| q36-35b-a3b-nvfp4 | llama-patched-bf16tau | 64 | 703.8 | 2151.8 |
| q36-35b-a3b-nvfp4 | llama-patched-bf16tau | 128 | 918.0 | 2212.3 |
| q36-35b-a3b-nvfp4 | vllm | 8 | 256.5 | 5186.5 |
| q36-35b-a3b-nvfp4 | vllm | 32 | 500.8 | 6223.4 |
| q36-35b-a3b-nvfp4 | vllm | 64 | 686.1 | 5926.5 |
| q36-35b-a3b-nvfp4 | vllm | 128 | 882.2 | 5300.5 |
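
The headline ratios in the commit message can be rechecked from the decode column alone. The sketch below hard-codes the `decode_agg_tps` values from the table above (llama rows only) and computes, per model, the largest patched/stock ratio and the range of the bf16-tau uplift over patched. The dictionary layout and function names are illustrative and not part of the repository.

```python
# decode_agg_tps copied from the table above, in npl order 8, 32, 64, 128.
NPL = (8, 32, 64, 128)
DECODE_TPS = {
    "q36-27b-nvfp4": {
        "llama-stock": (68.3, 119.9, 142.8, 155.1),
        "llama-patched": (85.3, 211.9, 305.2, 382.1),
        "llama-patched-bf16tau": (87.8, 231.0, 341.4, 446.1),
    },
    "q36-35b-a3b-nvfp4": {
        "llama-stock": (186.7, 267.4, 320.5, 347.2),
        "llama-patched": (230.3, 466.4, 622.4, 784.3),
        "llama-patched-bf16tau": (240.5, 508.1, 703.8, 918.0),
    },
}


def ratios(model, numerator, denominator):
    """Per-width decode throughput ratio numerator/denominator for one model."""
    num = DECODE_TPS[model][numerator]
    den = DECODE_TPS[model][denominator]
    return {npl: n / d for npl, n, d in zip(NPL, num, den)}


def summarize():
    """Return {model: (max patched/stock, min bf16 uplift %, max bf16 uplift %)}."""
    out = {}
    for model in DECODE_TPS:
        speedup = ratios(model, "llama-patched", "llama-stock")
        uplift = ratios(model, "llama-patched-bf16tau", "llama-patched")
        pct = [(r - 1.0) * 100.0 for r in uplift.values()]
        out[model] = (max(speedup.values()), min(pct), max(pct))
    return out


if __name__ == "__main__":
    for model, (best, lo, hi) in summarize().items():
        print(f"{model}: patched/stock up to {best:.2f}x, "
              f"bf16-tau +{lo:.0f}% to +{hi:.0f}% over patched")
```

Both maxima fall at npl 128, and the bf16-tau uplift grows with width for both models, which agrees with the figures quoted in the commit message.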