mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-30 03:17:01 -04:00
The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is removed from the llama-cpp-localai-paged patch series.

Clean re-measurement after the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16 (tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs 780.0 t/s. The mode engages but adds zero speed because it is subsumed by the fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and extra CUDA template-instantiation compile cost with no offsetting benefit.

Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn, which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025, 0028-0030) applies clean with git apply --check against the pin 0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob (patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are already absent).

Removed:
- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no longer reads)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows (README + final_benchmark.csv), the ssm_bf16_tau option text in backend index.yaml, the gallery NOTE block, and the docs/features/backends.md mention.

The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat) in the backend README section 5, the paged-backend agent guide, and the vLLM-parity methodology, so it is not re-tried.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
1.0 KiB
| model | engine | npl | decode_agg_tps | prefill_tps |
|---|---|---|---|---|
| q36-27b-nvfp4 | llama-stock | 8 | 68.3 | 937.7 |
| q36-27b-nvfp4 | llama-stock | 32 | 119.9 | 885.2 |
| q36-27b-nvfp4 | llama-stock | 64 | 142.8 | 885.1 |
| q36-27b-nvfp4 | llama-stock | 128 | 155.1 | 887.2 |
| q36-27b-nvfp4 | llama-patched | 8 | 85.3 | 915.1 |
| q36-27b-nvfp4 | llama-patched | 32 | 211.9 | 919.0 |
| q36-27b-nvfp4 | llama-patched | 64 | 305.2 | 923.5 |
| q36-27b-nvfp4 | llama-patched | 128 | 382.1 | 922.9 |
| q36-27b-nvfp4 | vllm | 8 | 70.4 | 2096.2 |
| q36-27b-nvfp4 | vllm | 32 | 211.8 | 2182.6 |
| q36-27b-nvfp4 | vllm | 64 | 309.1 | 2088.9 |
| q36-27b-nvfp4 | vllm | 128 | 418.8 | 1929.1 |
| q36-35b-a3b-nvfp4 | llama-stock | 8 | 186.7 | 1501.5 |
| q36-35b-a3b-nvfp4 | llama-stock | 32 | 267.4 | 1856.8 |
| q36-35b-a3b-nvfp4 | llama-stock | 64 | 320.5 | 1949.5 |
| q36-35b-a3b-nvfp4 | llama-stock | 128 | 347.2 | 1995.4 |
| q36-35b-a3b-nvfp4 | llama-patched | 8 | 230.3 | 1510.3 |
| q36-35b-a3b-nvfp4 | llama-patched | 32 | 466.4 | 1969.2 |
| q36-35b-a3b-nvfp4 | llama-patched | 64 | 622.4 | 2122.8 |
| q36-35b-a3b-nvfp4 | llama-patched | 128 | 784.3 | 2177.0 |
| q36-35b-a3b-nvfp4 | vllm | 8 | 256.5 | 5186.5 |
| q36-35b-a3b-nvfp4 | vllm | 32 | 500.8 | 6223.4 |
| q36-35b-a3b-nvfp4 | vllm | 64 | 686.1 | 5926.5 |
| q36-35b-a3b-nvfp4 | vllm | 128 | 882.2 | 5300.5 |
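
One way to read the table is as decode throughput relative to llama-stock at the same model and npl. The following is a minimal sketch, not part of the repository: it assumes the CSV uses the column names shown in the header row, and the helper name `decode_ratios` is hypothetical. Only three rows are inlined to keep it self-contained.

```python
import csv
import io

# Three rows copied from the table above; the full data lives in final_benchmark.csv.
CSV_TEXT = """model,engine,npl,decode_agg_tps,prefill_tps
q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4
q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0
q36-35b-a3b-nvfp4,vllm,128,882.2,5300.5
"""


def decode_ratios(csv_text, baseline="llama-stock"):
    """Hypothetical helper: map (model, npl, engine) to decode_agg_tps
    divided by the baseline engine's decode_agg_tps for the same model and npl."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Index baseline decode throughput by (model, npl).
    base = {
        (r["model"], r["npl"]): float(r["decode_agg_tps"])
        for r in rows
        if r["engine"] == baseline
    }
    ratios = {}
    for r in rows:
        key = (r["model"], r["npl"])
        if key in base:  # skip rows that have no baseline to compare against
            ratios[(r["model"], int(r["npl"]), r["engine"])] = (
                float(r["decode_agg_tps"]) / base[key]
            )
    return ratios


ratios = decode_ratios(CSV_TEXT)
for (model, npl, engine), ratio in sorted(ratios.items()):
    print(f"{model} npl={npl} {engine}: {ratio:.2f}x decode vs llama-stock")
```

Passing `baseline="vllm"` instead would express the same rows relative to vLLM, which is closer to the vLLM-parity framing in the commit message.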