LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-27 18:06:58 -04:00

Files

Ettore Di Giacinto 87cfd1fadb docs(paged): quant-generality audit - SSM/serving opts are quant-agnostic (0013-0029)

Source-verify each paged decode optimization as quant-agnostic (operates on the
f32 gated-DeltaNet/conv recurrent state, the paged serving host path, or the
generic MMQ/CUDA-graph routing) vs NVFP4-specific (only fires inside the
use_native_fp4 / GGML_TYPE_NVFP4 branch).

Findings: 14 of 16 landed patches are quant-agnostic (0013/0014/0015/0016/0018/
0019/0020/0021/0022/0024/0025/0026/0028/0029). Only 0023 (MoE FP4 act-quant
de-dup, inside use_native_fp4) is NVFP4-specific; 0017 is NVFP4-only but
default-off/inert (kill-gate, no win).

Corrects the hypothesis on 0025: the actual patch is the MUL_MAT_ID CUDA-graph
guard relaxation gated on ggml_is_quantized + ggml_cuda_should_use_mmq (the
generic quantized grouped-MMQ path), NOT NVFP4. The NVFP4-specific act-quant /
quantize_mmq_nvfp4 work is LEVER 3, which was a measurement STOP and never
landed (no patch); LEVER 4 (NVFP4 projections) KL-failed and never shipped.

Adds the relative-impact-by-quant estimate (fixed f32-recurrence/host ms is the
largest step fraction at NVFP4, shrinks at Q8/bf16 as the weight read grows) and
the A/B plan to prove generality on a Q4_K_M requant of the same Qwen3.6 (build
the control first, md5/KLD bit-exact gate per path, decode_agg npl 32/128, with
0023 as the NVFP4-only negative control).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-27 07:06:14 +00:00

paged

feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection

2026-06-21 23:16:28 +00:00

patches

docs(paged): quant-generality audit - SSM/serving opts are quant-agnostic (0013-0029)

2026-06-27 07:06:14 +00:00

CMakeLists.txt

feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) (#10497 )

2026-06-25 15:47:03 +02:00

grpc-server.cpp

feat(paged): wire ssm_bf16_tau model option for hybrid SSM-state fast mode