LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-30 03:17:01 -04:00

Files

Ettore Di Giacinto 11128cb080 docs(paged): scope the large-M NVFP4 prefill GEMM lever (design only)

Design and plan for the #1 prefill lever: the NVFP4 weight GEMM at large M.
There, MMQ (tuned for decode/M<=128, 1 CTA/SM, 128-col tile cap) is ~3.4x slower
than vLLM's marlin/cutlass large-M path, which accounts for ~51% of the prefill gap.

Recommends (a) dequant->bf16 cuBLAS routed by an M-threshold (dense first,
MoE grouped-cuBLAS second); rejects (b) a from-scratch Marlin/FP4 kernel as a
multi-week project. Key enabling finding: NVFP4->bf16 dequant kernels already
exist, and NVFP4 is currently force-excluded from the tensor-core cuBLAS path
(it falls back to f32 Sgemm); relaxing that one guard is the pivot. Caveat:
bf16 cuBLAS recovers ~60-75% of the GEMM gap, not full 68us/tok parity, because
bf16 tensor-core peak is ~half of FP4.
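
The M-threshold routing recommended in (a) can be sketched as follows. This is
an illustration only, in Python rather than the backend's own language; the
commit ships no kernel. The names GemmPath, select_nvfp4_gemm_path and
moe_grouped_ready are hypothetical, and the default threshold of 128 is an
assumption borrowed from the "M<=128" range MMQ is described as tuned for.

```python
from enum import Enum


class GemmPath(Enum):
    """Hypothetical labels for the GEMM paths named in the commit message."""
    MMQ = "mmq"
    DEQUANT_BF16_CUBLAS = "dequant_bf16_cublas"
    DEQUANT_BF16_GROUPED_CUBLAS = "dequant_bf16_grouped_cublas"


# Assumption: reuse the "M<=128" MMQ tuning range as the switch point.
# A real threshold would have to be measured on the target GPU.
DEFAULT_M_THRESHOLD = 128


def select_nvfp4_gemm_path(m, is_moe=False, moe_grouped_ready=False,
                           m_threshold=DEFAULT_M_THRESHOLD):
    """Pick a GEMM path for an NVFP4 weight matrix given the batch dimension M.

    Small M (decode-like) stays on MMQ. Large M (prefill-like) is routed to
    dequant->bf16 cuBLAS. MoE layers use the grouped variant only once it is
    available ("dense first, MoE second"); until then they stay on MMQ.
    """
    if m <= 0:
        raise ValueError("M must be a positive integer")
    if m <= m_threshold:
        return GemmPath.MMQ
    if is_moe:
        if moe_grouped_ready:
            return GemmPath.DEQUANT_BF16_GROUPED_CUBLAS
        return GemmPath.MMQ
    return GemmPath.DEQUANT_BF16_CUBLAS
```

Keeping the threshold as a parameter reflects that the switch point is a tuning
knob rather than a fixed constant; the staged MoE flag mirrors the dense-first
ordering in the recommendation.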

Design only - no kernel, no GPU run.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

2026-06-28 16:42:23 +00:00

ACCELERATOR_PORTING_SCOPE.md

docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm)

2026-06-28 08:34:32 +00:00

final_benchmark.csv

paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)

2026-06-28 16:06:06 +00:00

LOCALAI_LLAMACPP_BACKEND_PLAN.md

chore(paged): keep patches/ patch-only; README to backend root, docs to docs/

2026-06-27 13:20:05 +00:00

PAGED_BITEXACT_NOTE.md

chore(paged): keep patches/ patch-only; README to backend root, docs to docs/