Files
LocalAI/backend/cpp/llama-cpp
Ettore Di Giacinto 5cec1a6a21 docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correction
Synthesize the cross-engine bit-exactness and f32-preserving-parity study.
Resolve the contradiction between sub-agents (one f32, two bf16) by reading
every link of vLLM's state-dtype chain on live source:

- config.json text_config.mamba_ssm_dtype = "float32" (both served models)
- cache.py default mamba_ssm_cache_dtype = "auto"; bench passes no override
- vllm.py __post_init__ -> try_verify_and_update_config (config finalize)
- Qwen3_5ForConditionalGenerationConfig override copies "float32" into
  mamba_ssm_cache_dtype before state-dtype resolution
- mamba_utils._mamba_state_dtype -> temporal = torch.float32 (conv = bf16)
- qwen_gdn_linear_attn allocates the temporal cache at f32

Verdicts: B1 TRUE (sub-claim 'more efficient than vLLM' refuted); B2 REFUTED
(equal f32 bytes both sides, ~10pct efficiency gap not 2x width); B3 REFUTED
(vLLM hits throughput with f32 state; a bit-exact occupancy/coalescing retune
of gated_delta_net_cuda 74->81pct peak is the f32-preserving parity lever);
B4 CONFIRMED (bit-exact-vs-vLLM impossible: A1 FP4 GEMM 8/4/16-bit operand
gap + A2 recurrence g.Sigma vs Sigma.g reassociation on different reduction
trees, plus general FP non-associativity). bf16 temporal state degrades BELOW
vLLM's f32 recurrent precision -> an over-clock, not a parity requirement.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 16:55:25 +00:00
..
2026-04-12 08:51:30 +02:00