LocalAI/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv at 3466094c68b2393bd2ae3b4e66e5619b4dea5fb1

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-28 10:27:30 -04:00

Ettore Di Giacinto 3466094c68 docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau)

Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE
q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and
README section 4 carry a single consistent set of llama numbers with all three
configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin
  9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce
  stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact,
  ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both
models at all four widths (the prior CSV had no stock and no bf16-tau rows).
peak_gb is dropped: nvidia-smi reports [N/A] for the GB10's unified LPDDR5x
memory and the bench does not print it, so per-run peak could not be captured
this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in
bf16-tau adds a further +3% to +17% on top of patched (growing with width).
vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-27 22:05:59 +00:00

1.4 KiB

model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,32,119.9,885.2
q36-27b-nvfp4,llama-stock,64,142.8,885.1
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-27b-nvfp4,vllm,8,70.4,2096.2
q36-27b-nvfp4,vllm,32,211.8,2182.6
q36-27b-nvfp4,vllm,64,309.1,2088.9
q36-27b-nvfp4,vllm,128,418.8,1929.1
q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8
q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5
q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5
q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4
q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5
q36-35b-a3b-nvfp4,vllm,128,882.2,5300.5
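
The headline ratios in the commit message (up to 2.46x dense / 2.26x MoE over stock, and the bf16-tau gain over patched) can be re-derived from the decode_agg_tps column. The following is a minimal Python sketch, not part of the repository: the helper names `load_decode` and `ratio` are hypothetical, the rows are embedded inline in comma-separated form, and the vLLM rows are omitted for brevity since they were not re-run.

```python
import csv
import io

# llama rows of final_benchmark.csv, embedded so the sketch is self-contained.
CSV_TEXT = """\
model,engine,npl,decode_agg_tps,prefill_tps
q36-27b-nvfp4,llama-stock,8,68.3,937.7
q36-27b-nvfp4,llama-stock,32,119.9,885.2
q36-27b-nvfp4,llama-stock,64,142.8,885.1
q36-27b-nvfp4,llama-stock,128,155.1,887.2
q36-27b-nvfp4,llama-patched,8,85.3,915.1
q36-27b-nvfp4,llama-patched,32,211.9,919.0
q36-27b-nvfp4,llama-patched,64,305.2,923.5
q36-27b-nvfp4,llama-patched,128,382.1,922.9
q36-27b-nvfp4,llama-patched-bf16tau,8,87.8,919.2
q36-27b-nvfp4,llama-patched-bf16tau,32,231.0,931.1
q36-27b-nvfp4,llama-patched-bf16tau,64,341.4,930.7
q36-27b-nvfp4,llama-patched-bf16tau,128,446.1,932.2
q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5
q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8
q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5
q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3
q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2
q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,llama-patched-bf16tau,8,240.5,1539.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,32,508.1,2031.7
q36-35b-a3b-nvfp4,llama-patched-bf16tau,64,703.8,2151.8
q36-35b-a3b-nvfp4,llama-patched-bf16tau,128,918.0,2212.3
"""

MODELS = ("q36-27b-nvfp4", "q36-35b-a3b-nvfp4")
WIDTHS = (8, 32, 64, 128)


def load_decode(text):
    """Map (model, engine, npl) to aggregate decode tokens/s."""
    rows = csv.DictReader(io.StringIO(text))
    return {
        (r["model"], r["engine"], int(r["npl"])): float(r["decode_agg_tps"])
        for r in rows
    }


def ratio(table, model, engine, baseline, npl):
    """Decode throughput of `engine` divided by `baseline` at one width."""
    return table[(model, engine, npl)] / table[(model, baseline, npl)]


decode = load_decode(CSV_TEXT)

for model in MODELS:
    for npl in WIDTHS:
        speedup = ratio(decode, model, "llama-patched", "llama-stock", npl)
        gain = ratio(decode, model, "llama-patched-bf16tau", "llama-patched", npl) - 1
        print(f"{model} npl={npl}: patched/stock={speedup:.2f}x, "
              f"bf16tau over patched=+{gain:.1%}")
```

Both ratios peak at npl 128 for each model, which is where the 2.46x and 2.26x figures come from; the same `ratio` helper can be pointed at the vllm rows if those are added to `CSV_TEXT`.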
