LocalAI/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
Ettore Di Giacinto 3466094c68 docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau)
Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE
q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and
README section 4 carry a single consistent set of llama numbers with all three
configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin
  9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce
  stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact,
  ~91% same-top-p).
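
The three configurations above can be sketched as a small command builder. This is only an illustration under stated assumptions: the binary paths and the model path are hypothetical (the commit does not say where they live), and passing all four widths as one comma-separated `-npl` list is an assumption about how the matrix was driven. The flags and environment variables themselves are the ones named in this commit message.

```python
# Hypothetical build locations; the commit only says "stock" is a separately
# built, unpatched llama.cpp at the same pin.
STOCK_BENCH = "build-stock/bin/llama-batched-bench"
PATCHED_BENCH = "build-patched/bin/llama-batched-bench"

# Shared flags quoted in the commit message. Passing the widths as a single
# comma-separated list is an assumption.
COMMON_ARGS = ["-fa", "on", "-ngl", "99", "-npp", "128", "-ntg", "128",
               "-npl", "8,32,64,128"]


def bench_command(config, model_path, is_moe):
    """Return (argv, extra_env) for one benchmark configuration."""
    if config == "stock":
        # Stock needs its own binary: toggling LLAMA_KV_PAGED on the patched
        # binary does not reproduce it.
        binary, extra_args, env = STOCK_BENCH, [], {}
    elif config in ("patched", "patched-bf16tau"):
        binary = PATCHED_BENCH
        env = {"LLAMA_KV_PAGED": "1"}
        if is_moe:
            env["LLAMA_MOE_FORCE_GRAPHS"] = "1"
        # bf16-tau is "patched plus" one opt-in flag.
        extra_args = ["--ssm-bf16-tau", "64"] if config == "patched-bf16tau" else []
    else:
        raise ValueError(f"unknown config: {config}")
    argv = [binary, "-m", model_path] + COMMON_ARGS + extra_args
    return argv, env


if __name__ == "__main__":
    # "model.gguf" is a placeholder path.
    for cfg in ("stock", "patched", "patched-bf16tau"):
        argv, env = bench_command(cfg, "model.gguf", is_moe=True)
        prefix = " ".join(f"{k}={v}" for k, v in env.items())
        print(cfg, "->", (prefix + " " if prefix else "") + " ".join(argv))
```

The builder only assembles the argument list and environment; running it (for example via `subprocess.run(argv, env={**os.environ, **env})`) is left to the caller.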

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both
models at all four widths (the prior CSV had no stock and no bf16-tau rows).
peak_gb is dropped: nvidia-smi reports [N/A] for the GB10's unified LPDDR5x
memory and the bench does not print peak usage, so per-run peak could not be
captured this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in
bf16-tau adds a further +3% to +17% on top of patched (growing with width).
vLLM column is kept from the prior session (not re-run) and labeled as such.
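
The headline ratios can be checked directly against the CSV. The sketch below is a minimal example, not part of the benchmark harness: it inlines the dense-model rows at widths 8 and 128 from final_benchmark.csv and divides decode throughput between engines. The helper names `load_decode_tps` and `speedup` are made up for this illustration.

```python
import csv
import io

# Dense-model rows copied from final_benchmark.csv (npl 8 and 128 only).
CSV_TEXT = """model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
"""


def load_decode_tps(text):
    """Map (model, engine, npl) -> aggregate decode tokens/s."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["model"], row["engine"], int(row["npl"]))
        table[key] = float(row["decode_agg_tps"])
    return table


def speedup(table, model, engine, baseline, npl):
    """Decode throughput of `engine` relative to `baseline` at one width."""
    return table[(model, engine, npl)] / table[(model, baseline, npl)]


tps = load_decode_tps(CSV_TEXT)
model = "q36-27b-nvfp4"
for npl in (8, 128):
    vs_stock = speedup(tps, model, "llama-patched", "llama-stock", npl)
    vs_patched = speedup(tps, model, "llama-patched-bf16tau", "llama-patched", npl)
    print(f"npl={npl}: patched/stock={vs_stock:.2f}x, "
          f"bf16tau/patched=+{(vs_patched - 1) * 100:.0f}%")
```

At npl 128 this gives the 2.46x dense figure and the +17% bf16-tau gain quoted above; at npl 8 the bf16-tau gain is the +3% end of the range.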

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:05:59 +00:00

1.4 KiB

model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,32,119.9,885.2
q36-27b-nvfp4,llama-stock,64,142.8,885.1
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-27b-nvfp4,vllm,8,70.4,2096.2
q36-27b-nvfp4,vllm,32,211.8,2182.6
q36-27b-nvfp4,vllm,64,309.1,2088.9
q36-27b-nvfp4,vllm,128,418.8,1929.1
q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8
q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5
q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
q36-35b-a3b-nvfp4,vllm,128,882.2,5300.5