LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-30 03:17:01 -04:00

Files

Ettore Di Giacinto 000705321f feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off)

Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight
GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet)
tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the
validated, bit-exact-gated mechanism and records the honest GB10 result: it is a
regression, so it ships default-off (== stock), mirroring the patch-0017
default-off discipline.

Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense
M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that
dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays
FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda
gains the NVFP4 case.

Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID
806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4
large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for
short prefill (5951a5b4, decode untouched) and for a >threshold prefill that
exercises the bf16 path (5f3967df - no greedy argmax flips).

Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99
-> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42
(-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here:
FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step
weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate,
only partially amortizing as M grows. Default-off keeps stock S_PP (966.98).

Phase 2 (MoE grouped large-M) not implemented: it inherits the same
bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS
would regress for the same reason; a real prefill GEMM win needs option (b), a
native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-28 17:42:15 +00:00

ACCELERATOR_PORTING_SCOPE.md

docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm)

2026-06-28 08:34:32 +00:00

DECODE_SERVING_SCOPE.md

docs(paged): scope the continuous-serving decode gap (host-bound, design-only)

2026-06-28 17:14:51 +00:00

final_benchmark.csv

paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)

2026-06-28 16:06:06 +00:00

LOCALAI_LLAMACPP_BACKEND_PLAN.md

chore(paged): keep patches/ patch-only; README to backend root, docs to docs/