LocalAI/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv at 2fa8ef8fc53ccb2d932a1a89472486dfc80e0b59

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-30 03:17:01 -04:00

Ettore Di Giacinto 4cd90bfae9 paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)

The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is
removed from the llama-cpp-localai-paged patch series. Clean re-measurement after
the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16
(tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs
780.0 t/s. The mode engages but adds zero speed because it is subsumed by the
fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau
was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and
extra CUDA template-instantiation compile cost with no offsetting benefit.
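
To put the "flat" claim in numbers, the relative difference between the two
throughput figures quoted above can be computed directly. This is a minimal
sketch: the variable names are illustrative, and which figure belongs to the
bf16 run is an assumption (the message does not say), though the magnitude is
the same either way.

```python
# Throughput figures quoted in the message (tokens/s).
# Assumption: 780.6 is the tau=100000 (all-bf16) run, 780.0 the baseline.
bf16_tps = 780.6
baseline_tps = 780.0

# Relative change of the bf16 run against the baseline.
rel_delta = (bf16_tps - baseline_tps) / baseline_tps
print(f"relative change: {rel_delta:+.3%}")
```

The difference is under 0.1%, far below the earlier "+12%" measurement.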

Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only
mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn,
which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025,
0028-0030) applies cleanly with git apply --check against the pin
0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob
(patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are
already absent).
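
Why a numbering gap is harmless under a glob can be sketched as follows. This
is an illustration only, not the Makefile's actual logic: the temporary
directory and the "-example" file names are hypothetical stand-ins for
patches/paged/, and only the 0*.patch pattern is taken from the text above.

```python
import glob
import os
import tempfile

# Hypothetical stand-in for patches/paged/: empty files numbered like the
# series, with 0005, 0026 and 0027 deliberately missing.
present = [n for n in range(1, 31) if n not in (5, 26, 27)]

with tempfile.TemporaryDirectory() as d:
    for n in present:
        open(os.path.join(d, f"{n:04d}-example.patch"), "w").close()

    # Same shape of pattern as the Makefile glob (0*.patch). Sorting yields
    # numeric order because the prefixes are zero-padded to equal width.
    series = sorted(glob.glob(os.path.join(d, "0*.patch")))
    numbers = [int(os.path.basename(p)[:4]) for p in series]

# A glob only enumerates files that exist, so gaps are simply skipped.
print(len(numbers), numbers[:6], numbers[-4:])
```

Nothing in the enumeration refers to a patch by number, so removing 0026
needs no renumbering of the later patches.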

Removed:
- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared
  grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, which the library no longer
  reads, so it had become a no-op)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows
  (README + final_benchmark.csv), the ssm_bf16_tau option text in backend
  index.yaml, the gallery NOTE block, and the docs/features/backends.md mention.

The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat)
in the backend README section 5, the paged-backend agent guide, and the
vLLM-parity methodology, so it is not re-tried.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-28 16:06:06 +00:00

1.0 KiB

model	engine	npl	decode_agg_tps	prefill_tps
q36-27b-nvfp4	llama-stock	8	68.3	937.7
q36-27b-nvfp4	llama-stock	32	119.9	885.2
q36-27b-nvfp4	llama-stock	64	142.8	885.1
q36-27b-nvfp4	llama-stock	128	155.1	887.2
q36-27b-nvfp4	llama-patched	8	85.3	915.1
q36-27b-nvfp4	llama-patched	32	211.9	919.0
q36-27b-nvfp4	llama-patched	64	305.2	923.5
q36-27b-nvfp4	llama-patched	128	382.1	922.9
q36-27b-nvfp4	vllm	8	70.4	2096.2
q36-27b-nvfp4	vllm	32	211.8	2182.6
q36-27b-nvfp4	vllm	64	309.1	2088.9
q36-27b-nvfp4	vllm	128	418.8	1929.1
q36-35b-a3b-nvfp4	llama-stock	8	186.7	1501.5
q36-35b-a3b-nvfp4	llama-stock	32	267.4	1856.8
q36-35b-a3b-nvfp4	llama-stock	64	320.5	1949.5
q36-35b-a3b-nvfp4	llama-stock	128	347.2	1995.4
q36-35b-a3b-nvfp4	llama-patched	8	230.3	1510.3
q36-35b-a3b-nvfp4	llama-patched	32	466.4	1969.2
q36-35b-a3b-nvfp4	llama-patched	64	622.4	2122.8
q36-35b-a3b-nvfp4	llama-patched	128	784.3	2177.0
q36-35b-a3b-nvfp4	vllm	8	256.5	5186.5
q36-35b-a3b-nvfp4	vllm	32	500.8	6223.4
q36-35b-a3b-nvfp4	vllm	64	686.1	5926.5
q36-35b-a3b-nvfp4	vllm	128	882.2	5300.5
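
The decode figures above are easier to compare as ratios. The following is a
minimal sketch that hard-codes the decode_agg_tps column from the table and
divides llama-patched by each of the other two engines. The `NPL`, `DECODE`
and `ratios` names are illustrative and not part of the repository.

```python
# decode_agg_tps from the table above, keyed by (model, engine), ordered by npl.
NPL = [8, 32, 64, 128]
DECODE = {
    ("q36-27b-nvfp4", "llama-stock"): [68.3, 119.9, 142.8, 155.1],
    ("q36-27b-nvfp4", "llama-patched"): [85.3, 211.9, 305.2, 382.1],
    ("q36-27b-nvfp4", "vllm"): [70.4, 211.8, 309.1, 418.8],
    ("q36-35b-a3b-nvfp4", "llama-stock"): [186.7, 267.4, 320.5, 347.2],
    ("q36-35b-a3b-nvfp4", "llama-patched"): [230.3, 466.4, 622.4, 784.3],
    ("q36-35b-a3b-nvfp4", "vllm"): [256.5, 500.8, 686.1, 882.2],
}


def ratios(model, engine, reference):
    """Per-npl decode throughput of `engine` divided by that of `reference`."""
    num = DECODE[(model, engine)]
    den = DECODE[(model, reference)]
    return [a / b for a, b in zip(num, den)]


for model in ("q36-27b-nvfp4", "q36-35b-a3b-nvfp4"):
    vs_stock = ratios(model, "llama-patched", "llama-stock")
    vs_vllm = ratios(model, "llama-patched", "vllm")
    for npl, s, v in zip(NPL, vs_stock, vs_vllm):
        print(f"{model:18s} npl={npl:<3d} "
              f"patched/stock={s:.2f}x patched/vllm={v:.2f}x")
```

On these numbers, llama-patched decodes at roughly 2.3x to 2.5x llama-stock at
npl 128 and lands about 9% to 11% below vLLM at the same concurrency. The
prefill column is not covered by this sketch.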
