Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for
sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on
consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward
pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16
(4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-
apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would
be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer
#2577/#3294, cutlass #3096.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel;
no config/flag shortcut remains.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765,
vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at
B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision:
the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just
the MoE grouped GEMM (dense GEMM is the simpler first kernel to land).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37%
prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven).
Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu}
entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always
falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation
phases + parity harness + the dense-path follow-up (#28).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both
phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM
gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix).
Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the
engine's domain (0004 pool + 0005 continuous batching target it). Baseline for
the series to improve on.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by
one aligned index; n_kv compaction; gated so stock stays byte-identical) with
the token-identical gate and the known risks (mask transpose layout, v_trans).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy
generation is token-identical to stock (verified on Qwen3-0.6B at the pin),
branch confirmed firing. Default off. The placement substrate for the gather-read.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the
CPU-verified vLLM-parity block manager) + CMake entry. No behavior change.
Generated against the pinned LLAMA_VERSION; applies clean.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against
the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one
small, independently-buildable patch so the work rebases cleanly across llama.cpp
bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix
caching) + the regen workflow.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Build llama.cpp separately
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Start to try to attach some tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add git and small fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: correctly autoload external backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to run AIO tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Slightly update the Makefile helps
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Adapt auto-bumper
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to run linux test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add llama-cpp into build pipelines
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add default capability (for cpu)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop llama-cpp specific logic from the backend loader
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* drop grpc install in ci for tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Pass by backends path for tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Build protogen at start
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(tests): set backends path consistently
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Correctly configure the backends path
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to build for darwin
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Compile for metal on arm64/darwin
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to run build off from cross-arch
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add to the backend index nvidia-l4t and cpu's llama-cpp backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Build also darwin-x86 for llama-cpp
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Disable arm64 builds temporary
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Test backend build on PR
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixup build backend reusable workflow
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* pass by skip drivers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Use crane
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Skip drivers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* x86 darwin
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add packaging step for llama.cpp
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix leftover from bark-cpp extraction
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to fix hipblas build
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>