mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-28 10:27:30 -04:00
Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and README section 4 carry a single consistent set of llama numbers with all three configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin 9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact, ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both models at all four widths (the prior CSV had no stock and no bf16-tau rows). peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and the bench does not print it, so per-run peak could not be captured this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in bf16-tau adds a further +3% to +17% on top of patched (growing with width). vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
34 lines
1.4 KiB
CSV
model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,32,119.9,885.2
q36-27b-nvfp4,llama-stock,64,142.8,885.1
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-27b-nvfp4,vllm,8,70.4,2096.2
q36-27b-nvfp4,vllm,32,211.8,2182.6
q36-27b-nvfp4,vllm,64,309.1,2088.9
q36-27b-nvfp4,vllm,128,418.8,1929.1
q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8
q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5
q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
q36-35b-a3b-nvfp4,vllm,128,882.2,5300.5
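
The headline ratios quoted in the commit message (patched over stock, and bf16-tau over patched) can be cross-checked against the decode_agg_tps column. The following is a minimal Python sketch, not part of the repository: the helper names `load_decode` and `ratios` are hypothetical, and the llama rows are pasted inline (vLLM rows omitted) so the snippet runs without the file on disk.

```python
import csv
import io

# llama rows copied from final_benchmark.csv; vLLM rows are left out
# because they are not needed for the stock/patched comparison.
CSV_TEXT = """model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,32,119.9,885.2
q36-27b-nvfp4,llama-stock,64,142.8,885.1
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8
q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5
q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
"""


def load_decode(text):
    """Return {(model, engine, npl): decode_agg_tps} parsed from CSV text."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["model"], row["engine"], int(row["npl"]))
        table[key] = float(row["decode_agg_tps"])
    return table


def ratios(table, model, num_engine, den_engine):
    """Decode-throughput ratio num_engine / den_engine at each npl width."""
    widths = sorted(n for (m, e, n) in table if m == model and e == num_engine)
    return {
        n: table[(model, num_engine, n)] / table[(model, den_engine, n)]
        for n in widths
    }


decode = load_decode(CSV_TEXT)
for model in ("q36-27b-nvfp4", "q36-35b-a3b-nvfp4"):
    speedup = ratios(decode, model, "llama-patched", "llama-stock")
    extra = ratios(decode, model, "llama-patched-bf16tau", "llama-patched")
    print(f"{model}: patched/stock max {max(speedup.values()):.2f}x")
    for npl, r in extra.items():
        print(f"  npl {npl}: bf16-tau over patched {100 * (r - 1):+.1f}%")
```

With the rows as listed, the largest patched/stock ratio falls at npl 128 for both models and rounds to 2.46x (dense) and 2.26x (MoE), and the bf16-tau gain over patched grows from roughly +3% at npl 8 to roughly +17% at npl 128, in line with the figures in the commit message.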