LocalAI/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
Ettore Di Giacinto 3466094c68 docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau)
Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE
q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and
README section 4 carry a single consistent set of llama numbers with all three
configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin
  9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce
  stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact,
  ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both
models at all four widths (the prior CSV had no stock and no bf16-tau rows).
peak_gb is dropped: nvidia-smi reports [N/A] for memory usage on the GB10's
unified LPDDR5x and the bench does not print it, so per-run peak could not be
captured this session.

The patch series gives up to 2.46x (dense) / 2.26x (MoE) the aggregate decode
throughput of true stock, both at npl 128; opt-in bf16-tau adds a further +3% to
+17% decode on top of patched (growing with width).
vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:05:59 +00:00


model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,32,119.9,885.2
q36-27b-nvfp4,llama-stock,64,142.8,885.1
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-27b-nvfp4,vllm,8,70.4,2096.2
q36-27b-nvfp4,vllm,32,211.8,2182.6
q36-27b-nvfp4,vllm,64,309.1,2088.9
q36-27b-nvfp4,vllm,128,418.8,1929.1
q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8
q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5
q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
q36-35b-a3b-nvfp4,vllm,128,882.2,5300.5
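
The headline ratios quoted in the commit message can be re-derived from the
decode_agg_tps column. The following is a minimal sketch, not part of the
benchmark harness: it embeds only the npl 8 and npl 128 llama rows copied from
the table above, and the helper names (load_decode, ratio) are hypothetical.

```python
import csv
import io

# Subset of final_benchmark.csv: npl 8 and npl 128 rows for the llama engines.
CSV_TEXT = """\
model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
"""


def load_decode(text):
    """Map (model, engine, npl) -> aggregate decode tokens/s."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["model"], row["engine"], int(row["npl"]))
        table[key] = float(row["decode_agg_tps"])
    return table


def ratio(table, model, num_engine, den_engine, npl):
    """Decode throughput of num_engine divided by den_engine at one width."""
    return table[(model, num_engine, npl)] / table[(model, den_engine, npl)]


decode = load_decode(CSV_TEXT)
for model in ("q36-27b-nvfp4", "q36-35b-a3b-nvfp4"):
    for npl in (8, 128):
        speedup = ratio(decode, model, "llama-patched", "llama-stock", npl)
        gain = ratio(decode, model, "llama-patched-bf16tau", "llama-patched", npl) - 1.0
        print(f"{model} npl={npl}: patched/stock={speedup:.2f}x, "
              f"bf16-tau over patched={gain:+.1%}")
```

Pointing load_decode at the full file contents instead of CSV_TEXT would cover
npl 32 and 64 as well, provided the loops are extended to those widths.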