LocalAI/backend/cpp
Ettore Di Giacinto 001d833426 docs(paged): f16/bf16 glue probe - dense decode residual ceiling
Empirical probe on q36-27b-nvfp4 @npl128 (build f7409c2, patch 0023):
- attention KV cache default is ALREADY f16 (K/V f16) -> --cache-type f16 is a
  no-op; q8_0 within noise -> KV dtype is not a decode lever
- nsys node-trace decode budget: f32-glue (norms/elementwise/activations/attn,
  excl. SSM recurrence + NVFP4 GEMM) = 28.7 ms = 8.4% of step (40.9 ms = 12%
  incl. the non-FP4 cublas GEMM)
- f16 realistically recovers ~11-16 ms of the ~27 ms/step gap = ~40-60% of the
  8.2% residual -> ~95-96% parity, not a full close; non-bit-exact opt-in only
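
The arithmetic behind the last two bullets can be checked with a short sketch. It uses only the figures quoted above; the helper names and the linear model (parity = 1 - remaining residual, with the residual shrinking in proportion to the milliseconds recovered) are assumptions made for illustration, not part of the probe itself.

```python
# Figures quoted in the commit message (milliseconds per decode step).
F32_GLUE_MS = 28.7        # f32 glue, excluding SSM recurrence and NVFP4 GEMM
F32_GLUE_SHARE = 0.084    # stated share of the decode step
GAP_MS = 27.0             # approximate per-step gap
RESIDUAL_SHARE = 0.082    # stated residual
RECOVER_MS = (11.0, 16.0) # estimated recovery range with f16 glue


def implied_step_ms(part_ms: float, share: float) -> float:
    """Total step time implied by one component and its stated share."""
    return part_ms / share


def parity_after(recovered_ms: float) -> float:
    """Parity after recovering part of the gap (assumed linear model)."""
    recovered_fraction = recovered_ms / GAP_MS
    return 1.0 - RESIDUAL_SHARE * (1.0 - recovered_fraction)


if __name__ == "__main__":
    step = implied_step_ms(F32_GLUE_MS, F32_GLUE_SHARE)
    print(f"implied decode step: {step:.1f} ms")
    for ms in RECOVER_MS:
        print(f"recover {ms:.0f} ms: {ms / GAP_MS:.0%} of gap, "
              f"parity {parity_after(ms):.1%}")
```

Under this model the two ends of the recovery range land in the mid-90s percent, consistent with the "not a full close" conclusion.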

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 09:11:21 +00:00