Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both
phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM
gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Prefill is 6-48x behind vLLM and does NOT scale with B (kernel-bound, so paging
can't fix it). Decode: we win at B=1 but are 2.5-3.7x behind at B>=8. THAT
concurrency gap is the engine's domain (0004 pool + 0005 continuous batching
target it). This is the baseline for the series to improve on.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by
one aligned index; n_kv compaction; gated so stock stays byte-identical) with
the token-identical gate and the known risks (mask transpose layout, v_trans).
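A toy sketch of the gather-read idea, not the patch itself: K, V and the mask
are gathered by one shared index so column j of each refers to the same cache
cell, and n_kv shrinks to the cells the sequence owns. The cache contents, the
cell list [5, 2, 7] and every name below are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def attention(q, keys, values, mask):
    # Single-query softmax attention; mask[j] is 0.0 (visible) or -inf (hidden).
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale + m
              for k, m in zip(keys, mask)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

# Toy cache: 8 cells, head dim 2. Values are arbitrary.
k_cache = [[0.1 * i, 1.0 - 0.1 * i] for i in range(8)]
v_cache = [[float(i), float(-i)] for i in range(8)]
q = [0.3, -0.7]

# Hypothetical placement: the sequence's 3 tokens live in cells 5, 2, 7.
gather_idx = [5, 2, 7]

# Unpaged-style read: attend over every cell, masking the ones not owned.
full_mask = [0.0 if c in gather_idx else NEG_INF for c in range(8)]
stock = attention(q, k_cache, v_cache, full_mask)

# Gather read: the same index selects K, V and mask columns; n_kv goes 8 -> 3.
k_g = [k_cache[c] for c in gather_idx]
v_g = [v_cache[c] for c in gather_idx]
m_g = [full_mask[c] for c in gather_idx]
paged = attention(q, k_g, v_g, m_g)
```

In this toy the two outputs agree up to floating-point summation order. A
misaligned index (mask gathered differently from K/V) would break that, which
is the kind of failure the token-identical gate is meant to catch.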
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy
generation is token-identical to stock (verified on Qwen3-0.6B at the pin),
branch confirmed firing. Default off. The placement substrate for the gather-read.
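A minimal sketch of why permuted placement can stay token-identical, assuming a
block table that maps logical blocks to physical ones. BLOCK_SIZE, place_tokens
and the shuffle are hypothetical and do not reflect the real find_slot code.

```python
import random

BLOCK_SIZE = 4  # tokens per block (assumed value)

def place_tokens(n_tokens, n_blocks, seed=0):
    """Map each logical token position to a physical cell via permuted blocks."""
    order = list(range(n_blocks))
    random.Random(seed).shuffle(order)            # permuted physical block order
    needed = (n_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
    block_table = order[:needed]                  # logical block -> physical block
    cells = [block_table[p // BLOCK_SIZE] * BLOCK_SIZE + p % BLOCK_SIZE
             for p in range(n_tokens)]
    return block_table, cells

block_table, cells = place_tokens(n_tokens=10, n_blocks=8)

# Write tokens to their physical cells, then read back in logical order.
cache = [None] * (8 * BLOCK_SIZE)
tokens = [f"t{p}" for p in range(10)]
for p, c in enumerate(cells):
    cache[c] = tokens[p]
read_back = [cache[c] for c in cells]
```

Because reads go through the same mapping as writes, the logical order is
recovered regardless of where the blocks physically land.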
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the
CPU-verified vLLM-parity block manager) + CMake entry. No behavior change.
Generated against the pinned LLAMA_VERSION; applies cleanly.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against
the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one
small, independently-buildable patch so the work rebases cleanly across llama.cpp
bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix
caching) + the regen workflow.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Build llama.cpp separately
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Start to try to attach some tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add git and small fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix: correctly autoload external backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to run AIO tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Slightly update the Makefile help
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Adapt auto-bumper
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to run linux test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add llama-cpp into build pipelines
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add default capability (for cpu)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Drop llama-cpp specific logic from the backend loader
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* drop grpc install in ci for tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Pass backends path for tests
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Build protogen at start
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(tests): set backends path consistently
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Correctly configure the backends path
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to build for darwin
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* WIP
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Compile for metal on arm64/darwin
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to run build off from cross-arch
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add to the backend index nvidia-l4t and cpu's llama-cpp backends
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Also build darwin-x86 for llama-cpp
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Disable arm64 builds temporarily
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Test backend build on PR
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixup build backend reusable workflow
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Pass skip drivers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Use crane
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Skip drivers
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* x86 darwin
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Add packaging step for llama.cpp
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Fix leftover from bark-cpp extraction
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* Try to fix hipblas build
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>