Commit Graph

32 Commits

Author SHA1 Message Date
Ettore Di Giacinto
4bc2b4a9b2 feat(paged): add patch 0013 decoupled per-step prefill-token budget
Mirror of the dev-tree paged scheduler patch into the llama.cpp backend's
vendored patch series. Adds LLAMA_PREFILL_BUDGET, a per-step prefill-token
budget for the inherited update_slots() scheduler, decoupled from n_batch
(the analogue of vLLM's --max-num-batched-tokens). It caps how many prompt
tokens a single update_slots() step ingests, splitting a long prefill across
more steps so co-batched decode keeps advancing instead of freezing for the
duration of one fat ~n_batch prefill chunk. Disabled by default (env unset or
<= 0), so stock behaviour is byte-identical; orthogonal to LLAMA_KV_PAGED.
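
The chunking arithmetic above can be sketched in a few lines of Python. This is an illustration, not the patch: the function name `prefill_chunks` and the n_batch value of 2048 are assumptions, and the sketch ignores the decode tokens that share each step. It only models how the budget caps the prompt tokens ingested per step.

```python
def prefill_chunks(n_prompt, n_batch, prefill_budget=0):
    """Split a prompt of n_prompt tokens into per-step chunk sizes.

    prefill_budget <= 0 models the disabled default: chunks are capped
    by n_batch only. A budget >= n_batch changes nothing either.
    (Hypothetical helper; the real logic lives in update_slots().)
    """
    cap = n_batch if prefill_budget <= 0 else min(n_batch, prefill_budget)
    chunks = []
    remaining = n_prompt
    while remaining > 0:
        take = min(cap, remaining)
        chunks.append(take)
        remaining -= take
    return chunks


# The 6000-token injected prefill from the measurement, n_batch assumed 2048.
stock = prefill_chunks(6000, n_batch=2048)
budgeted = prefill_chunks(6000, n_batch=2048, prefill_budget=256)
print(len(stock), "steps without a budget,", len(budgeted), "steps at budget=256")
```

Under these assumptions the long prompt is ingested in 3 steps without a budget and 24 steps at budget=256, which is why co-batched decode gets many more chances to advance.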

Measured on GB10 (dense Qwen3-32B-NVFP4, 8 steady decoders + one injected
6000-token prefill, same binary, only the env differs): worst decode freeze
3380 -> 482 ms (7.0x) and decode_stall 3285 -> 387 ms (8.5x) at budget=256,
for a +20% TTFT on the long request; budget=512 gives 4.8x at ~no TTFT cost.
This is a latency/fairness lever, not an aggregate-throughput lever (steady
decode is NVFP4 weight-read-bound on GB10, which the scheduler cannot lift).

Correctness: budget unset or >= n_batch is byte-identical to stock; budget=N
is byte-identical to stock -bN while preserving n_batch for decode width; the
only deviation on long prompts is intrinsic flash-attn chunk-size FP grouping
that pure stock -b exhibits too. Verified to apply cleanly on the pinned
llama.cpp f3e1828 after patch 0008.

Productisation follow-up: surface as a grpc-server.cpp options knob
(max_prefill_tokens) per CHUNKED_PREFILL_PLAN Phase B.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 09:55:32 +00:00
Ettore Di Giacinto
ba6bd94976 feat(paged): assert mask-pad invariant for the paged tile route (patch 0012)
Patch 0012 of the paged-attention series. Adds a defensive GGML_ASSERT in
src/paged-attn.cpp so the now-default paged decode route (GQA-grouped
fattn-tile kernel) cannot silently start leaking past-end KV rows.

The route stays correct only because the compacted mask/block-table length
n_view = GGML_PAD(n_gather, 256) is a whole number of flash-attn KV tiles
(nbatch_fa = 64 for head_dim 128 divides 256), so the last tile sits entirely
inside the -inf pad window. The assert (n_view % 64 == 0) pins that implicit
invariant: a future pad < 256 or tile > 256 that broke it now aborts instead
of leaking. Additive only, no behaviour change.
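
A small Python sketch of the invariant the assert pins. `ggml_pad` mirrors the round-up-to-multiple behaviour of the GGML_PAD macro; the helper `n_view_is_tile_aligned` and the alternative pad of 32 are hypothetical, used only to show what a broken invariant would look like.

```python
def ggml_pad(x, n):
    """Round x up to a multiple of n (n must be a power of two)."""
    return (x + n - 1) & ~(n - 1)


TILE = 64   # nbatch_fa for head_dim 128, as stated in the commit message
PAD = 256   # current pad of the compacted mask / block-table length


def n_view_is_tile_aligned(n_gather, pad=PAD, tile=TILE):
    """True when n_view = GGML_PAD(n_gather, pad) is a whole number of tiles."""
    return ggml_pad(n_gather, pad) % tile == 0


# With pad=256 every n_gather yields a tile-aligned n_view.
print(all(n_view_is_tile_aligned(n) for n in range(1, 4097)))
# A hypothetical smaller pad of 32 would break it for some lengths.
print(n_view_is_tile_aligned(71, pad=32))
```

The second call is the situation the new GGML_ASSERT turns into an abort: the last tile would extend past the padded window.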

Verified on the DGX dev tree: build-cpu compiles and the paged CPU byte gate
(LLAMA_KV_PAGED off vs on, Qwen3-0.6B-Q8_0, greedy) stays byte-identical with
the assert silent.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 09:13:08 +00:00
Ettore Di Giacinto
e983919516 feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011)
Increment 3 attention lever. In the paged in-kernel decode dispatch, route the
common grouped-query F16 case to the tile kernel and keep the inc-1 vec kernel
for everything else. Tile groups the q-heads that share a kv-head (ncols2) so
each K/V row is loaded once per group instead of once per q-head, and runs at
higher occupancy (108-128 regs vs vec 168 -> 25%). On GB10 (Qwen3-32B NVFP4,
F16 cache, gqa 8, batch 32, 1024 ctx, same build, env-toggled) this cuts the
decode step from 186.3 to 177.9 ms/step (-4.5%), within 1.8% of stock (174.8).
The win grows with context (tile vs vec decode step, npl=8): 1024 -2.3%, 4096
-3.3%, 8192 -4.1%, 16384 -6.1%, as attention takes a larger share of the step.

Routing guard: tile has no K/V type template (loads half2), so a non-F16 cache
would be converted to a contiguous F16 copy by launch_fattn, breaking the
in-kernel block-table read. So tile is correct only for an F16 cache, and the
grouping only helps at gqa>=2. tile is used only for {F16 K and V, gqa_ratio>=2};
everything else falls back to the inc-1 vec path, exactly as before this change.
LLAMA_KV_PAGED_VEC=1 forces vec for A/B. The inc-2 phys(j) tile read (patch 0010)
was already plumbed; this only adds the default route. (Paged decode currently
needs an F16 cache; quantized + paged is a pre-existing limitation unaffected by
this change: stock+q8_0 works, paged+q8_0 aborts both before and after.)
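
The routing guard reduces to one predicate. The Python sketch below models only that decision; the function name, the string type tags and the `force_vec` parameter (standing in for LLAMA_KV_PAGED_VEC=1) are assumptions for illustration, and it does not model the separate quantized-cache limitation noted above.

```python
def pick_paged_decode_kernel(k_type, v_type, n_head, n_head_kv, force_vec=False):
    """Choose the kernel for a paged decode step (hypothetical model).

    tile is chosen only for an F16 K and V cache with gqa_ratio >= 2;
    everything else, or a forced A/B run, stays on the inc-1 vec path.
    """
    gqa_ratio = n_head // n_head_kv
    if not force_vec and k_type == "f16" and v_type == "f16" and gqa_ratio >= 2:
        return "tile"
    return "vec"


# gqa 8 with an F16 cache, matching the 32B configuration in the message.
print(pick_paged_decode_kernel("f16", "f16", n_head=64, n_head_kv=8))
# Same shape, but forced to vec for an A/B comparison.
print(pick_paged_decode_kernel("f16", "f16", n_head=64, n_head_kv=8, force_vec=True))
```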

Split-K was ruled out: the vec decode grid is already block-saturated (~43 waves
over 144 resident on 48 SM), so more parallel_blocks adds no SM fill; the
under-saturation is intra-SM occupancy + 8x KV re-streaming, which GQA grouping
attacks directly.

Validated (greedy): CPU plumbing gate (0.6B, build-cpu, paged-on vs off)
byte-identical; GPU 0.6B gqa=2 tile token-coherent with the inc-1 vec path
(7/8 sequences identical, 8th in the same kernel-noise band where vec also
drifts from stock); 32B gqa=8 tile tracks stock at least as well as vec. Stock
(no block table) is byte-identical: the dispatch guard only diverts on src[5].
Full rationale and numbers in the patch header.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
2026-06-22 22:38:28 +00:00
Ettore Di Giacinto
2c5adda28c feat(paged): tile in-kernel decode read + dispatch guard (patch 0010)
Increment 2 (robustness): graft the patch-0009 phys(j) block-table read into
the CUDA tile kernel (mirror of fattn-vec.cuh) and add a dispatch guard so a
present block table (src[5]) routes ONLY to the vec or tile kernel, never to
mma/wmma (which ignore the table and would silently read the wrong physical
cells). Default route stays vec, the inc-1 byte-validated path.

Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B) PASS; GPU vec-paged ==
stock at -s 1 PASS; the real Qwen3-32B NVFP4 batch decode confirmed dispatching
to vec (Q ne=[128,1,64,N]). The tile graft is plumbed for the increment-3 GQA
head-group reuse but is EXPERIMENTAL/not byte-validated (LLAMA_KV_PAGED_TILE=1):
the GQA-grouped ncols2>1 tile path reads a full nbatch_fa tile unbounded while
the compacted paged mask is not padded to cover it. Bounding that path is
increment-3 work; the default vec route is unaffected.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 20:37:12 +00:00
Ettore Di Giacinto
ee13a94a8c paged: in-kernel decode read patch 0009 (kill the gather regression)
Mirror patch 0009 for the paged llama.cpp engine. It removes the patch-0003
per-layer per-step gather (ggml_get_rows of K/V to a contiguous buffer) on the
decode step and instead reads paged blocks in-kernel: build_attn passes the
physical K/V views plus a position-ordered block table (src[5] of
ggml_flash_attn_ext, padded to FATTN_KQ_STRIDE), and the CUDA fattn vec kernel
plus the CPU reference map each logical KV index to its physical cell and read
in place. KV_max / parallel_blocks / stream_k split-K are unchanged; a nullptr
block table is the stock contiguous read (byte-identical, gated by
LLAMA_KV_PAGED).
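
As a rough Python model of the in-place read: a position-ordered block table maps each logical KV index j to a physical cell, and a missing table means the stock contiguous read. The names `read_logical`, `kv_phys` and the sample table are hypothetical; the real read happens inside the CUDA vec kernel and the CPU reference.

```python
def read_logical(kv_phys, n_kv, block_table=None):
    """Return the n_kv logical KV rows in position order.

    block_table[j] is the physical cell that holds logical index j.
    block_table=None models the nullptr case: a contiguous read.
    """
    if block_table is None:
        return [kv_phys[j] for j in range(n_kv)]
    return [kv_phys[block_table[j]] for j in range(n_kv)]


# Eight physical cells; the sequence's four tokens live in cells 5, 6, 2, 3.
kv_phys = ["row@cell%d" % c for c in range(8)]
block_table = [5, 6, 2, 3]

print(read_logical(kv_phys, 4, block_table))
print(read_logical(kv_phys, 4))
```

No intermediate contiguous buffer is built here, which is the point of dropping the per-layer gather.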

Verified on GB10 (sm_121, Qwen3-32B NVFP4, batch 32 / 1024 ctx): the decode
step drops from 1279 ms (paged-gather) to 696 ms in-kernel (-46%), within ~8%
of stock (647 ms). CPU paged vs stock is bit-for-bit identical; GPU stays
within the documented batch-shape non-determinism band.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 18:04:09 +00:00
Ettore Di Giacinto
4dcbcfcf92 docs(paged): decode-step gap study vs vLLM on GB10
Profiling decomposition of the llama-server batch-32 / 1024-ctx decode step
vs vLLM on a DGX Spark (GB10, sm_121). Findings: decode is GPU-bound (~95%
busy, sampling/loop fully hidden); at 1024 ctx the step is ~84% KV/attention
and ~16% weight GEMM; the paged KV engine is a ~1.85x decode regression vs
stock (per-layer gather-to-contiguous); even stock is ~4-5x slower than vLLM,
gated by the long-context decode-attention and thin-batch FP4 GEMM kernels,
not by the serving loop. Ranked closable-vs-structural levers included.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 15:44:24 +00:00
Ettore Di Giacinto
80e0c1ac6b feat(paged): wire cross-request prefix share into llama-server (patch 0008)
Ship patch 0008 of the paged-attention series: wire the paged cross-request
prefix recompute-skip (patch 0007's paged_prefix_api::share/commit engine seam)
into the llama-server continuous-batching loop so CONCURRENT requests sharing a
long prefix reuse one committed copy of the prefix blocks and prefill ONLY their
divergent suffix. The server's native prompt cache only reuses a slot's own prior
prompt; it does not share across distinct concurrent slots. 0008 adds that
cross-slot share, fully gated behind LLAMA_KV_PAGED (stock byte-identical).

The hook lives in tools/server/server-context.cpp update_slots (the only place
with the slot prompt-processing loop; grpc-server.cpp includes it), ~50 gated
lines: a fresh-slot share() that advances n_past past the committed prefix, and a
commit() at the prefill->generation transition. The n_past<block gate guarantees
every positive share is adopted so the engine reservation matches the suffix-only
batch (no stale paged blocks).

Verified in-server (32B NVFP4, CUDA, --kv-unified) with a live prefix holder:
K=16/32 concurrent shared-prefix requests prefill only their ~27-token suffix
instead of the ~1003-token prefix (36x fewer prefill tokens; K=16 23.9s->1.5s,
K=32 57.9s->2.3s), engine logs 'shares ... prefix blocks - NOT recomputed'
(ref_cnt>1), greedy output within the documented CUDA batch-shape
non-determinism band.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 15:03:16 +00:00
Ettore Di Giacinto
52f0f7b8cf docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache)
Matched comparison on DGX Spark (GB10, sm_121): batched llama-server with NVFP4
GGUF and the paged engine vs batched vLLM 0.23.0 NVFP4A16 with APC, both eager,
both prefix-cache on. Two findings: (1) the paged cross-request prefix
recompute-skip (patch 0007) does NOT engage in llama-server - it is only reachable
via paged_prefix_api::share/commit, which the server never calls; the server
engages only physical paged block placement plus its own native prompt cache. (2)
With every confounder removed, vLLM is ~6x faster end-to-end (K=16: 8.6s vs 50.7s;
K=32: 8.9s vs 58.3s), decode-bound not prefill-bound: llama ~828ms/decode-step at
batch 32 vs vLLM ~185ms; CUDA graphs are not the differentiator (both eager).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 14:16:52 +00:00
Ettore Di Giacinto
f347f7ca1d docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison
Two closing measurements on DGX Spark (GB10, sm_121):

1. Stock GPU determinism (no paging): with LLAMA_KV_PAGED unset, stock
   llama.cpp produces a different greedy token stream when the same prompt
   is decoded in a full-prefill batch vs a split (prefix-then-suffix) batch.
   At G=24 the generated stream diverges 1/5 prompts on CPU and 2/5 on CUDA
   (and earlier on CUDA). This confirms the patch-0007 GPU byte-identity
   failure is stock floating-point batch-shape non-determinism, not a paged
   bug. CPU exhibits it too, just less often, which is why 0007's short CPU
   scenarios passed 16/16 while the CUDA run flipped.

2. vLLM vs llama.cpp+paged on a shared-prefix fan-out (K reqs share a
   1024-tok prefix + unique 32-tok suffix, gen 64). llama.cpp+paged prefix
   cache gives 7.15x (K=16) / 10.3x (K=32) prefill reduction vs its no-share
   baseline - the same cross-request prefix-skip vLLM's APC provides (97%
   hit rate confirmed). Head-to-head on cached prefill, vLLM is ~5x faster
   (Q4_K_M vs nvfp4a16 quant, vLLM on FP4 emulation + eager), and the gap
   widens end-to-end due to continuous batched decode. Competitive in kind,
   behind in absolute terms on this hardware.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 13:48:01 +00:00
Ettore Di Giacinto
0dd45f0da5 docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results
Record the belt-and-suspenders GPU run of the 0007 prefix-engine driver and a
shared-prefix throughput benchmark. The committed CPU driver reports ALL PASS;
the CUDA build fails only the strict greedy-token-equality assertions (the same
binary fails them at ngl=0 too), which is CUDA float-kernel non-determinism, not
a paged-logic defect - every structural KV-reuse invariant passes on GPU.

The shared-prefix benchmark shows a real, K-scaling win: prefill wall time drops
7.2x (32B K=16) to 10.3x (32B K=32) when the shared prefix is computed once and
reused via the paged cross-request prefix cache.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 12:59:09 +00:00
Ettore Di Giacinto
d1ba327843 docs(paged): record GPU correctness + CUDA backend-build verification
GPU (DGX Spark, GB10/sm_121, CUDA 13.0) verification of the paged-KV series:
core token-identical gate and 4-stream multiseq are byte-identical stock-vs-paged
at -ngl 99, the device gather is confirmed firing, and a 32B paged run is coherent.
Full backend: patches/paged apply clean to the pin and grpc-server compiles+links
under CUDA sm_121. Notes also flag a double patch-application in the LLAMA_PAGED=on
make flow (git apply + prepare.sh) and a token divergence in the unshipped
prefix-recompute-skip dev driver (same on CPU and GPU).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 11:50:01 +00:00
Ettore Di Giacinto
ecffd4b097 feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)
Mirror patch 0007 of the paged-attention series into the vendored llama.cpp
patch set. It wires the host-side cross-request prefix cache (0006) into the
engine so a new sequence physically shares the cached prefix blocks (ref-counted)
and decodes only the divergent suffix - the shared prefix KV is never recomputed.

paged-alloc becomes one persistent caching PagedKVManager per (kv-cache, stream)
keyed by the real seq_id (per-sequence ref-counted free); two gated
llama_kv_cache methods (paged_prefix_share / paged_prefix_commit) mark the shared
physical cells' seq-membership so the engine attention mask covers the
already-computed prefix; find_slot anchors placement on each sequence's ubatch.pos.
Existing-file core touch is llama-kv-cache.{cpp,h} (+71 -3); everything else is
additive vendored units. Gated behind LLAMA_KV_PAGED, default off, stock
byte-identical.

Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): greedy byte-identity vs decode
from scratch at a block boundary and mid-block, prefill computing only the suffix
(32 prefix tokens skipped), and ref-counted free safety (2->1 on one sharer's
removal, survivor intact and re-shareable, pool restored when all freed). The
0004 serving gate stays byte-identical stock vs paged in unified and non-unified
mode.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 10:47:10 +00:00
Ettore Di Giacinto
67c6208b3a feat(llama-cpp/paged): cross-request prefix caching patch 0006
Mirror patch 0006 of the paged-attention series into the vendored llama.cpp
patch set. Extends the vendored PagedKVManager (src/paged-kv-manager) with
host-side cross-request prefix sharing: place_with_prefix reuses cached
physical blocks for a new sequence's shared prefix (ref_cnt++) and allocates
only the divergent suffix; cow_block copy-on-writes a still-shared (ref>1)
block before a divergent write so co-owners stay byte-correct; ref-counted
free releases a shared block only at ref 0. Core kv-cache files untouched;
gated behind LLAMA_KV_PAGED, default off.
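
The ref-counting rules above can be illustrated with a toy pool in Python. This is a sketch under stated assumptions, not the vendored PagedKVManager: `TinyBlockPool` and its method names are hypothetical, blocks are plain integers, and the data copy that a real copy-on-write performs is omitted.

```python
class TinyBlockPool:
    """Toy ref-counted block pool (hypothetical, for illustration only)."""

    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))
        self.ref_cnt = {}

    def alloc(self):
        block = self.free.pop()
        self.ref_cnt[block] = 1
        return block

    def share(self, block):
        # A new sequence reuses a cached prefix block: ref_cnt++.
        self.ref_cnt[block] += 1
        return block

    def release(self, block):
        # A block returns to the free pool only when its ref count hits 0.
        self.ref_cnt[block] -= 1
        if self.ref_cnt[block] == 0:
            del self.ref_cnt[block]
            self.free.append(block)

    def cow(self, block):
        # Before a divergent write: keep an exclusively owned block, otherwise
        # drop one reference and hand back a fresh block (data copy omitted).
        if self.ref_cnt[block] == 1:
            return block
        self.ref_cnt[block] -= 1
        return self.alloc()


pool = TinyBlockPool(8)
seq_a = [pool.alloc() for _ in range(3)]                      # 3-block sequence
seq_b = [pool.share(b) for b in seq_a[:2]] + [pool.alloc()]   # shares 2 blocks
shared_refs = [pool.ref_cnt[b] for b in seq_a[:2]]
seq_b[1] = pool.cow(seq_b[1])                                 # divergent write
for b in seq_a + seq_b:
    pool.release(b)
print(shared_refs, len(pool.free))
```

After both sequences are released, every block is back in the free pool, which corresponds to the "pool restored" check described in the verification notes.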

Gate 0 verified on the dev tree (CPU, Qwen3-0.6B-Q8_0): shared-prefix
greedy tokens byte-identical to the unshared baseline at both a block boundary
and mid-block, measured 2-block reuse (ref_cnt==2, only the suffix allocated),
and copy-on-write + seq_rm ref-count safety with no use-after-free.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 10:14:27 +00:00
Ettore Di Giacinto
04e3d04ab8 build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)
Move the paged-attention patch series (0001-0004 + docs) into patches/paged/,
applied behind a new LLAMA_PAGED build flag (default on). The base patches/ dir is
now clean, so a dep-bump that breaks a paged hook can be unblocked with
LLAMA_PAGED=off (clean-against-upstream build) and the paged carry fixed
independently - decoupling the paged-KV maintenance from routine bumps without a
separate backend. Both apply paths wired (Makefile git-apply + prepare.sh re-apply,
flag passed through). Runtime stays gated by LLAMA_KV_PAGED env, so an on build is
byte-identical to stock until that env is set. Glob/flag logic verified in bash.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 09:22:36 +00:00
Ettore Di Giacinto
4968cd8a94 paged-attn 0004: on-demand KV block allocation
Wire the paged placement in find_slot through the vendored PagedKVManager
(0001) instead of a fixed full-pool permutation. Blocks are popped from a free
pool on demand as a sequence crosses block boundaries, and returned on sequence
end (full seq_rm / clear). One manager per (kv-cache, stream); all state lives
in a new src/paged-alloc unit keyed by a static registry, so the core kv-cache
struct is untouched (find_slot/clear/seq_rm gain only a gated call). Default
off; stock path byte-identical.

Gate 0 (CPU, Qwen3-0.6B-Q8_0), LLAMA_KV_PAGED=1 token-identical vs stock:
- single-stream llama-simple, 48 tok: identical
- multi-stream driver, 3 seqs x 40 tok: identical
Demand-driven confirmed via debug log: blocks grow 0->1->2->3->4 at logical
positions 16/32/48 (peak 4 blocks vs 16-block budget), per stream independently.
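
The growth pattern in the debug log follows from simple block arithmetic. The Python sketch below assumes a block size of 16 tokens, which is inferred from the 16/32/48 boundaries rather than stated in the message; the helper names are hypothetical.

```python
BLOCK_SIZE = 16  # assumption inferred from the 16/32/48 growth positions


def blocks_for_position(pos, block_size=BLOCK_SIZE):
    """Blocks a sequence must hold once it has written logical position pos."""
    return pos // block_size + 1


def growth_positions(last_pos, block_size=BLOCK_SIZE):
    """List (position, block_count) pairs where a new block is popped."""
    grown, have = [], 0
    for pos in range(last_pos + 1):
        need = blocks_for_position(pos, block_size)
        if need > have:
            grown.append((pos, need))
            have = need
    return grown


print(growth_positions(48))
```

Under that assumption the sequence takes its first block at position 0 and further blocks at positions 16, 32 and 48, peaking at 4 blocks.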

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 08:50:57 +00:00
Ettore Di Giacinto
37e0e1ef55 paged-attn 0003: lift gather-read to multi-stream
The 0003 gather-read was single-stream only (GGML_ASSERT k->ne[3]==1). Lift it
to N streams: one index column per stream over the unified batch, gathered with
a single ggml_get_rows along the stream axis. Each column is position-sorted
(preserving the flash-attn online-softmax reduction order that makes the read
byte-identical) and padded to the max non-empty count across streams with a
masked (empty) cell, which contributes exp(-inf)=0.
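
Why a masked pad cell is harmless can be checked with a plain softmax in Python. The sketch is a scalar reference, not the flash-attn kernel, and it assumes each column holds at least one real cell; the score values are made up.

```python
import math


def softmax(scores):
    """Reference softmax with an explicit left-to-right accumulation."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = 0.0
    for e in exps:
        total += e
    return [e / total for e in exps]


NEG_INF = float("-inf")
real = [1.0, 2.0, 0.5]               # scores of a stream's used cells
padded = real + [NEG_INF, NEG_INF]   # padded to the max count across streams

# Pad cells get weight exp(-inf) = 0 and leave the real weights unchanged.
print(softmax(padded)[:3] == softmax(real), softmax(padded)[3:])
```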

Core touch stays additive: the one-line build_attn hook is unchanged; only the
two kv-cache gather helpers (now per-stream) and src/paged-attn.cpp grow.

Gate 0 (CPU, Qwen3-0.6B-Q8_0): a multi-sequence greedy driver (non-unified KV,
k->ne[3]>1) is token-identical between stock (env unset) and LLAMA_KV_PAGED=1:
3 seqs x 40 tok, 2 seqs x 32 tok, 5 seqs x 32 tok all identical; single-stream
llama-simple unchanged. Debug log confirms n_stream=3 engaged the multi path.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 08:46:12 +00:00
Ettore Di Giacinto
d9d846e04b feat(paged): patch 0003 gather-read - Gate 0 green, token-identical, additive
Implements the paged-attention gather-read (the real engine compute): attention
reads ONLY a sequence's used cells by gathering K, V and the kq_mask by the
non-empty-cell index list before build_attn_mha. Verified token-identical to stock
greedy generation, 9/9 across 3 prompts x {32,96,128} tokens on Qwen3-0.6B, with
n_gather=71 < n_kv=256 confirming real compaction (not an identity no-op).

Built in the additive "hook, don't edit" form: all logic in new src/paged-attn.{h,cpp}
(an llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line
in build_attn + two thin accessors on llama_kv_cache_context + one CMake line. No edit
to llm_graph_input_attn_kv or llama-graph.h. 216 insertions; default-off behind
LLAMA_KV_PAGED so stock path stays byte-identical.

Key correctness finding: get_gather_idxs emits cells sorted by token position. CPU
flash-attn's online softmax reduces cells in physical-array order and is FP-order-
sensitive, so 0002's scattered placement alone (full-window read) diverges from stock
past the first block; the position-sorted gather reproduces stock's exact reduction
order -> bit-identical. So 0003 is what makes paged placement token-identical under
flash-attn.
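
The order sensitivity is ordinary floating-point non-associativity, which a few lines of Python can reproduce. This is only an analogy: a plain running sum stands in for the online-softmax reduction, and the cell values are made up.

```python
def reduce_in_order(cells):
    """Accumulate values in the order the cells are stored."""
    acc = 0.0
    for _pos, value in cells:
        acc += value
    return acc


# (token position, value) pairs.
stock = [(0, 0.1), (1, 0.2), (2, 0.3)]            # contiguous placement
scattered = [(1, 0.2), (2, 0.3), (0, 0.1)]        # paged placement, physical order
gathered = sorted(scattered, key=lambda c: c[0])  # position-sorted gather

print(reduce_in_order(scattered) == reduce_in_order(stock))
print(reduce_in_order(gathered) == reduce_in_order(stock))
```

Reducing the scattered layout in physical order gives a different float than stock, while sorting by position first restores the identical result.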

Verified on a dev tree at the pin (0001+0002+0003 on branch paged); not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 08:26:46 +00:00
Ettore Di Giacinto
84d59e659b docs(paged): additive "hook, don't edit" layout for the patch series
Maintainers rejected PR #22569 (the upstream paged draft) as "slop" - it rewrites
core attention and is unvendorable. Our own series must be additive so it survives
llama.cpp pin bumps. This documents the rule and the per-patch core-touch budget:
every change is either new code in a new vendored src/ file, or a single env-gated
hook at one call site that delegates to it - no logic in core files, no core struct
edits.

Grounds it in the pinned source: llm_graph_input_i is pure-virtual and
res->add_input() lets a new file register a graph input, so paged behavior plugs in
without editing core graph types. Redesigns 0003 (gather-read) from the old 4-file
surgery to one build_attn hook + a new paged-attn.{h,cpp} (a gather-input subclass)
+ two thin cache accessors (~8 core lines vs a core-struct rewrite). 0005 lands
entirely in LocalAI's grpc-server.cpp (no core patch).

Dev tree at the pin with 0001+0002 applied is set up; 0003 implementation is the
next focused token-identical Gate-0 block.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 07:28:44 +00:00
Ettore Di Giacinto
d2651c86d9 bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline
Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for
sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on
consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward
pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16
(4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-
apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would
be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer
#2577/#3294, cutlass #3096.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 06:59:50 +00:00
Ettore Di Giacinto
19742aee64 bench(dense): FORCE_CUBLAS no-op for dense too (720.8 vs 721.8) - every flag lever exhausted
Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel;
no config/flag shortcut remains.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 03:59:27 +00:00
Ettore Di Giacinto
37cbc089b0 bench(dense): Qwen3-32B dense parity - dense has the kernel gap too (PP 7.6-32x)
vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765,
vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at
B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision:
the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just
the MoE grouped GEMM (dense GEMM is the simpler first kernel to land).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 03:55:58 +00:00
Ettore Di Giacinto
b7b2e8291c kernel(fp4-grouped-moe): scaffold the FP4 grouped-GEMM MoE dispatch (Lever 3)
The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37% of
prefill and 54.6% of decode-B64 GPU time; paged attention can't touch it (proven).
Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu}
entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always
falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation
phases + parity harness + the dense-path follow-up (#28).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:44:31 +00:00
Ettore Di Giacinto
cb28deda6b bench(paged): decode profile overturns 'engine-addressable' - decode is 54.6% MoE GEMM too
Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both
phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM
gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:27:35 +00:00
Ettore Di Giacinto
2a500c371f bench(paged): fresh GB10 head-to-head vs vLLM - two distinct gaps
Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix).
Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the
engine's domain (0004 pool + 0005 continuous batching target it). Baseline for
the series to improve on.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:20:22 +00:00
Ettore Di Giacinto
48fbb9384f docs(paged): refine 0003 plan - used-cell gather, per-ubatch rebuild, single-stream first
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:14:25 +00:00
Ettore Di Giacinto
145e45b6f2 docs(paged): exact executable plan for 0003 gather-read
Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by
one aligned index; n_kv compaction; gated so stock stays byte-identical) with
the token-identical gate and the known risks (mask transpose layout, v_trans).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:12:18 +00:00
Ettore Di Giacinto
c4b4f3a3e4 docs(paged): series status 0001/0002 done+verified; honest parity note
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:05:14 +00:00
Ettore Di Giacinto
61ff738177 patch(paged) 0002: LLAMA_KV_PAGED block placement, Gate 0 token-identical
find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy
generation is token-identical to stock (verified on Qwen3-0.6B at the pin),
branch confirmed firing. Default off. The placement substrate for the gather-read.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 23:04:28 +00:00
Ettore Di Giacinto
ce48cc0751 patch(paged) 0001: vendor PagedKVManager into llama.cpp src
First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the
CPU-verified vLLM-parity block manager) + CMake entry. No behavior change.
Generated against the pinned LLAMA_VERSION; applies clean.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 22:55:22 +00:00
Ettore Di Giacinto
ba3fa5a633 build(paged): stacking patch-series scaffolding for llama.cpp paged attention
Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against
the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one
small, independently-buildable patch so the work rebases cleanly across llama.cpp
bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix
caching) + the regen workflow.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-19 22:53:20 +00:00
Ettore Di Giacinto
e3bcba5c45 chore: ⬆️ Update ggml-org/llama.cpp to 7f8ef50cce40e3e7e4526a3696cb45658190e69a (#7402)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-12-01 07:50:40 +01:00
Ettore Di Giacinto
294f7022f3 feat: do not bundle llama-cpp anymore (#5790)
* Build llama.cpp separately
* WIP
* WIP
* WIP
* Start to try to attach some tests
* Add git and small fixups
* fix: correctly autoload external backends
* Try to run AIO tests
* Slightly update the Makefile helps
* Adapt auto-bumper
* Try to run linux test
* Add llama-cpp into build pipelines
* Add default capability (for cpu)
* Drop llama-cpp specific logic from the backend loader
* drop grpc install in ci for tests
* fixups
* Pass by backends path for tests
* Build protogen at start
* fix(tests): set backends path consistently
* Correctly configure the backends path
* Try to build for darwin
* WIP
* Compile for metal on arm64/darwin
* Try to run build off from cross-arch
* Add to the backend index nvidia-l4t and cpu's llama-cpp backends
* Build also darwin-x86 for llama-cpp
* Disable arm64 builds temporarily
* Test backend build on PR
* Fixup build backend reusable workflow
* pass by skip drivers
* Use crane
* Skip drivers
* Fixups
* x86 darwin
* Add packaging step for llama.cpp
* fixups
* Fix leftover from bark-cpp extraction
* Try to fix hipblas build

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 13:24:12 +02:00