LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-25 00:59:28 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	da67fd87e2	docs(paged): A.2 CUDA-graph decode lever measurement and gap diagnosis Phase 1 measures the CUDA-graph lever on the paged decode (q36-27b-nvfp4 dense, GB10 sm_121, fusion off). The 4-cell decode_agg {stock,paged} x {graphs on,off} is flat within ~1%: the graphs-on win is +0.13% at npl128 and +1.1% at npl32 (both within run noise). The default paged decode is not eager: it captures and replays graphs with a 256-token reset cadence identical to stock non-paged (block-table ne0 = GGML_PAD(n_gather,256) only steps at 256-token boundaries); only the gather fallback grows n_gather every step and runs pure eager. 'graphs reused=0' was a uid fast-path false negative (llama rebuilds the cgraph each step, so the reuse log never fires while the graph still replays via the instance path). nsys (reliable eager trace, plus the captured trace re-run with --cuda-graph-trace=node to defeat nsys omitting graph-internal kernels, an artifact that otherwise reads 0.3% busy) shows the steady decode is 99.4-99.5% GPU-busy. Idle is ~0.6% of the step: 0.37% within-step launch gaps (the only thing graphs remove, cut to 0.11% when captured) plus a 0.24% between-step host gap (~2ms per step). Throughput is identical on/off. Verdict: CUDA-graphing the paged decode is not a throughput lever; the decode is GPU-compute-bound and the 2.6x gap to vLLM (148 vs 391) is in the per-step GPU kernel work (FP4 GEMM + attention at batch 128), not launch overhead or the host loop. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 21:26:16 +00:00
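Note on da67fd87e2: the idle fractions quoted in that message bound what graph capture can buy. The following is a minimal arithmetic sketch, not part of the commit. The helper name `speedup_bound` is hypothetical, and it assumes that removing an idle fraction f of the step shortens the step by exactly f.

```python
def speedup_bound(idle_removed: float) -> float:
    """Upper bound on throughput gain if this fraction of step time disappears."""
    return 1.0 / (1.0 - idle_removed)

# Fractions of the decode step, as quoted in the commit message.
launch_gap_eager = 0.0037     # within-step launch gaps, graphs off
launch_gap_captured = 0.0011  # within-step launch gaps, graphs on
host_gap = 0.0024             # between-step host gap (graphs do not remove it)

# What capture removes: the difference between the two launch-gap figures.
graphs_gain = speedup_bound(launch_gap_eager - launch_gap_captured)

# Hypothetical best case: every idle cycle removed, launch gaps and host gap.
all_idle_gain = speedup_bound(launch_gap_eager + host_gap)

# Gap to vLLM quoted in the message (tok/s).
vllm_gap = 391 / 148

print(f"graphs: {graphs_gain:.4f}x, all idle: {all_idle_gain:.4f}x, needed: {vllm_gap:.2f}x")
```

Under these assumptions even the best case stays well below 1.01x, which is why the message places the 2.6x gap in kernel work rather than launch overhead.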
Ettore Di Giacinto	40f019e761	docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate results (patch 0017) Mirror of llama.cpp dev-tree commit 089f78d. Track B P0 (bit-exact NVFP4 dense decode-shape MUL_MAT parity gate) + P1 (default-off occupancy levers) for the GB10 dense FP4 weight GEMM. P1 kill-gate TRIPPED: the cheap host/occupancy levers do not lift decode_agg on GB10 (sm_121). DENSE q36-27b-nvfp4 @npl128 149.5 -> minblocks2 147.9 (-1.1%) -> dense mmq_x=64 144.3 (-3.5%); MoE q36-35b-a3b mmq_x-down regresses (TILE16 -3.7%, TILE8 -5.9%, reproduces patch 0015). nsys: the FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s->3.025s (+8.7% slower) under register-capping (spilling). The dense M=128 tile is already weight-read/one-read-optimal; the only untested lever is the structural mmq_y-down (nwarps=4 warp-remap, blocked by nwarps*tile_C::I==mmq_y), deferred to P2. All levers default-off => default build byte-identical to stock. See THROUGHPUT_B_P1_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 17:58:00 +00:00
Ettore Di Giacinto	39e16cc2c4	docs(paged): adversarial review of track-B FP4-GEMM parity go/no-go Append section 9 (skeptical staff-CUDA-engineer review) to FP4_GEMM_SCOPE_B.md, stress-testing the dense/MoE parity verdict against the committed grounding. Key findings: - Not the W4A16 wall: the npl-sweep (dense 99/56/46/41% of vLLM at npl 8/32/64/128) shows llama's FP4-MMA kernel HITS the weight-read floor at M=8 and FALLS OFF it as M grows, while vLLM HOLDS it. Working-path tune, dual existence proof (M=8 + vLLM M=128), not a greenfield build. Same binding constraint as W4A16 though (hide LPDDR5x latency at the larger tile on an occupancy-dominated part). - The dense gap is ~82-87% GEMM, ~13-18% non-GEMM (467 ms total = 383-405 GEMM + 62-84 non-GEMM). B alone caps ~80%; track A is what tips dense over the parity line. - Sharpest omission: vLLM's M=128 floor is reached via cutlass TMA + deep pipeline - the technique the doc forbids on GB10. TMA != manual cp.async (lower occupancy cost); it must be an in-scope P2 fallback, not categorically banned. - Honest landing: dense ~80-90% (parity the optimistic tail, contingent on B+A+floor), MoE ~55-65% (parity not reachable from B). Low-regret: even a tripped P2 kill-gate lands B+A ~89%, doubling today's 41%. - Sequencing fix: land A first (defines B's interface + baseline + kill-gate), then run B's P2 against the post-A number. Verdict: DENSE conditional GO (scope as GEMM-gap-closing, not true parity; A-first, gate at P2, add TMA); MoE NO-GO for parity from B (do the cheap mmq_x-down win as a 1.7-1.85x, not parity). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 14:31:35 +00:00
Ettore Di Giacinto	7434d64c75	docs(paged): build-ready track-B FP4-GEMM scope - kernel decision + per-phase decode_agg Rewrite the track-B scope into the definitive build-ready plan for the NVFP4 FP4-MMA decode GEMM toward vLLM GB10 parity. Source-read of the mmq.cuh/mma.cuh/quantize.cu FP4 path on the dgx paged dev tree settles two load-bearing facts the prior draft got partly wrong: - llama's dense path is already TRUE W4A4 (block_fp4_mmq packs 256 e2m1 values + ue4m3 scales; the MMA is kind::mxf4nvf4 e2m1.e2m1...ue4m3), so there is no activation-bit-width work to do; the whole dense deficit is scheduling/occupancy. - the mmq_x selector minimizes ntiles_x, which PINS dense decode at mmq_x=128 (weights read once). Shrinking mmq_x re-reads the 18 GB weights, so the dense occupancy lever is mmq_y-down (BW-neutral), NOT mmq_x-down; MoE's free lever is the per-expert mmq_x-down (patch 0015). Adds the explicit kernel-approach decision (tune the existing FP4-MMA mul_mat_q; reject the cutlass-SM120 rewrite, dead on GB10 and broken on sm_121; reject the BF16-Marlin descent), the concrete build-ready changes (mmq_y/granularity/stream-k knobs, FP4-MMA fragment invariants, the ue4m3 scale path, and the block_fp4_mmq y-tile ABI contract for the track-A act-quant fusion handoff), the GB10-fit rules, the bit-exact test-backend-ops gate with decode-shape + ragged-M cases, and per-phase expected decode_agg tables. Verdict (honest, roofline-grounded): the decode GEMM is bandwidth-bound on the hardware roofline (M=128 << crossover 611; weight-read floors 4-6x above vLLM) but compute-bound in practice at ~3% FP4 eff, so 273 GB/s is not the wall. DENSE: GO (conditional) - B+A reaches 376-394 tok/s = 90-103% of vLLM 391, gated by a P2 occupancy kill-gate (<15% FP4 eff -> parity off). MoE: PARTIAL/NO-GO - ceiling ~76% of 811 (618) from the GEMM alone; full MoE parity needs the non-GEMM tracks too. 
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 14:21:48 +00:00
Ettore Di Giacinto	c1d7f336cb	docs(paged): enrich track-B scope with code-level FP4-GEMM inefficiencies Add the source-read kernel-mechanism map (no cp.async weight pipeline, mmq_x tile-maximizing selector vs GB10 occupancy, MoE per-expert M-tile waste, iter_k=512 coupling, ruled-out non-levers) and strip the stray trailing tags from the prior write. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 14:11:41 +00:00
Ettore Di Giacinto	ea634ee958	docs(paged): scope track B - FP4-MMA decode-GEMM roofline + parity go/no-go Roofline at the decode batch shape (M=128, NVFP4 weights) on GB10 (sm_121): the dense weight-read floor (~1,940 tok/s) and MoE floor (~1,590 tok/s) sit 4-6x above vLLM's 391/811, so 273 GB/s is NOT the wall. At FP4 peak the GEMM is bandwidth-bound (crossover M*~611 >> 128); at the kernel's ~3% achieved FP4 efficiency it is compute-bound by its own inefficiency (471 ms vs a 66 ms floor). Verdict: dense decode parity is plausibly reachable via a tuned FP4-MMA decode M-tile (track B) + fused act-quant (track A), landing 376-394 tok/s = 90-103% of vLLM 391, but only at the top of the demonstrated GB10 FP4 envelope (~17-21%) and with no margin (occupancy wall is the binding constraint, not bandwidth). MoE parity is NOT reachable from the GEMM alone (ceiling ~60-76% of 811): its floor is the hardest grouped-GEMM regime and ~24% of its step is non-GEMM work outside track B. GO (conditional) for dense, PARTIAL for MoE. Build-ready phased plan included; tune the existing block_fp4_mmq path, not a W4A16 rewrite. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 14:09:41 +00:00
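Note on ea634ee958: the dense weight-read floor can be reproduced from figures in this log. This is a rough sketch, assuming every decode step streams the full weight set exactly once and emits one token per sequence. The 18 GB weight size is taken from commit 7434d64c75; the variable names are illustrative only.

```python
WEIGHT_BYTES = 18e9       # dense NVFP4 weights (figure from commit 7434d64c75)
BANDWIDTH_B_PER_S = 273e9 # GB10 memory bandwidth quoted in the message
M = 128                   # decode batch shape (tokens produced per step)
VLLM_DENSE_TOK_S = 391    # vLLM dense decode throughput quoted in the message

# Time to read the weights once is the floor for one decode step.
step_floor_s = WEIGHT_BYTES / BANDWIDTH_B_PER_S

# One step yields M tokens, so the floor throughput is M / step time.
floor_tok_s = M / step_floor_s

# How far the bandwidth floor sits above vLLM's measured number.
headroom = floor_tok_s / VLLM_DENSE_TOK_S

print(f"step floor {step_floor_s * 1e3:.1f} ms, floor {floor_tok_s:.0f} tok/s, {headroom:.1f}x vLLM")
```

The result lands near the 66 ms and ~1,940 tok/s figures in the message, which is the basis for the claim that 273 GB/s is not the wall.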
Ettore Di Giacinto	e4c63179e0	docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.4x lever Closes lever 5 of VLLM_DECODE_GROUNDING.md. GGUF metadata + source reading on the paged dev tree plus nsys decode traces on Qwen3.6-27B NVFP4 (GB10 sm_121) confirm the Gated-Delta-Net linear-attention layers decode as a fused single CUDA kernel (gated_delta_net.cu) updating a fixed-size cached recurrent state: no context-length parameter, no KV re-scan. Matched-batch context-scaling control (npl4, pure decode) shows the GDN kernel flat (10.3 -> 8.0 us/launch) across 4x context while full-attention grows 3.1x (27 -> 85 us). GDN is a small, context-flat share (~0.4-10% by batch); the FP4 weight GEMM dominates (~67%). Verdict: GDN decode is efficient, not the cheap model-specific fix; the 2.4x is the general GEMM + full-attention kernel work, as the grounding concluded. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 11:21:44 +00:00
Ettore Di Giacinto	f7500df64e	docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget The prior all-at-once BURST H2H is adversarial to any prefill budget (TTFT is prefill-rate-bound, a cap only slows the drain) and showed 0016 ~= 0013. Run a STAGGERED-arrival benchmark on the GB10 DGX (patch 0016 built @253cbae): a steady-rate client that keeps a mix of in-flight decoders + newly-arriving prefills, capturing per-request TTFT and the full inter-token-latency series. Append the metrics (in-flight decode protection + new-request TTFT, per arm) and an honest verdict to P1_DYNAMIC_BUDGET_RESULTS.md. On staggered traffic stock's in-flight decoders freeze multi-second on every prefill admission while both budget arms keep ITL flat; 0016 (mbt512) sits at a strictly better point on the protection/TTFT frontier than 0013-256 (equal spike-free protection, materially lower TTFT/throughput/wall) and adds a decode-adaptive single-T knob. It does not strictly dominate stock (Pareto tradeoff: smoothness vs raw TTFT). Verdict: 0016 earns its keep over 0013 on staggered traffic; recommend LLAMA_MAX_BATCH_TOKENS=512. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 10:56:13 +00:00
Ettore Di Giacinto	24ce7d0823	feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 0016, continuous-batch P1) Mirror the P1 engine change of CONTINUOUS_BATCH_SCHEDULER_SCOPE.md into the vendored paged patch series and surface it as a LocalAI model option. - patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch: supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC, decode-first token budget in update_slots(). At the budget seam (already after Phase 1's decode fill, so batch.n_tokens == D is known) compute T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch), prefill_budget_step = max(n_ubatch, T - D), and a per-slot prompt-chunk cap prefill_cap_per_slot; bound the Phase-2 prompt-fill loop and outer admission break by these instead of 0013's constant. Policy-only change, no new slot states, no batch-formation rewrite, zero libllama changes. Decode is structurally claimed first (Phase 1) so the decode-first guarantee is free. As decode load D rises the leftover auto-shrinks, so the budget self-tunes across npl 8..128 and dense vs MoE and holds the GB10 decode ceiling tuning-free (vs 0013's hand-picked 256). The legacy LLAMA_PREFILL_BUDGET path is preserved (honoured only when the dynamic knob is unset), so 0013 is cleanly subsumed. DEFAULT-OFF byte-identical: all-knobs-unset and the degenerate T == n_batch case are bit-identical to stock by construction (the n_batch hard ceiling is kept and the dynamic bounds reach it at the same point for every D). Orthogonal to LLAMA_KV_PAGED. - grpc-server.cpp: wire the new knob as model options max_batch_tokens / mbt (-> LLAMA_MAX_BATCH_TOKENS) and prefill_cap (-> LLAMA_PREFILL_CAP), beside the existing max_prefill_tokens / mpt seam; default-off, takes precedence over the legacy static budget when set. 
- patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md: design, the byte-identical determinism analysis (verified by construction), the local patch-apply verification, and the gate + A/B bench methodology. Validation status: the patch applies cleanly on top of LLAMA_VERSION (f3e1828) + paged 0001-0015, and the off-path / T==n_batch determinism is proven by construction. The GB10 sm_121 build, the four runtime gates, and the dense+MoE A/B sweep are PENDING a DGX run (the dev box was unreachable this session) and are documented as such in P1_DYNAMIC_BUDGET_RESULTS.md; do not sell the quantitative TTFT payoff until that re-run lands. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 07:48:20 +00:00
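Note on 24ce7d0823: the two budget formulas in that message can be sketched as below. This models only `T` and `prefill_budget_step` as written there; the per-slot cap and the rest of `update_slots()` are not modelled. The function names and the `n_batch=2048`, `n_ubatch=512` values are illustrative assumptions, not taken from the patch.

```python
from typing import Optional

def clamp(x: int, lo: int, hi: int) -> int:
    return max(lo, min(x, hi))

def prefill_budget_step(n_batch: int, n_ubatch: int, decode_tokens: int,
                        max_batch_tokens: Optional[int] = None) -> int:
    """T = clamp(max_batch_tokens ?: n_batch, n_ubatch, n_batch);
    prefill budget = max(n_ubatch, T - D)."""
    # `or` mirrors the `?:` fallback: unset (None) or 0 falls back to n_batch.
    t = clamp(max_batch_tokens or n_batch, n_ubatch, n_batch)
    # Decode has already claimed D tokens; prefill gets the leftover,
    # but never less than one ubatch.
    return max(n_ubatch, t - decode_tokens)

# Illustrative values only.
for d in (8, 128, 600):
    print(d, prefill_budget_step(2048, 512, d, max_batch_tokens=1024))
```

With these illustrative numbers the leftover shrinks from 1016 to 896 as decode load rises from 8 to 128, then sits at the `n_ubatch` floor of 512 once `T - D` drops below it.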
Ettore Di Giacinto	fccbb4082d	docs(paged): ground vLLM 0.23.0 eager-decode architecture vs llama.cpp Decompose vLLM's enforce_eager decode step (attention / weight GEMM / sampling / host loop) on GB10 (DGX Spark, sm_121) and attribute the measured ~2.4x NVFP4 decode-throughput gap to its parts, from source reading plus the existing nsys decode trace and H2H bench logs. Key finding: the gap is dominantly a KERNEL-efficiency gap (~80-90%), not a host-overhead gap. llama's GPU is already ~94.6% busy during steady decode, so a CUDA-graphed decode is a minority lever (~10-20% of the gap, bounded by the GPU-idle bubble), not the silver bullet. vLLM's wins: in-kernel paged-decode read (no gather tax), faster long-context attention, fused native-FP4 / grouped-Marlin GEMM, and O(1)-in-ctx GDN linear-attention layers on these Qwen3.6 hybrids. vLLM achieved 2.4x with synchronous scheduling and no CUDA graphs. Evidence: vllm 0.23.0 source (gpu_model_runner, flash_attn/gdn backends, modelopt/marlin GEMM, v1/sample), reproduced nsys kernel categorization (cat2.py), and QWEN36_NVFP4_BENCH / DECODE_GAP_STUDY / CONTINUOUS_BATCH_SCHEDULER_SCOPE. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-24 07:44:07 +00:00
Ettore Di Giacinto	5a38dd3f09	docs(paged): adversarial review of the continuous-batch scheduler scope Append a source-verified Review / risk section to CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. Verdict: scope is sound, GO on P0 -> P1, conditional P2, separate-track P3. Key checks against HEAD 151343b: - Tractability: zero libllama changes. The mixed per-seq prefill+decode ubatch is the existing shipping path (common_batch_add per-token pos/seq, init_batch split, paged_alloc is hooks on the same llama_kv_cache class, not a new class). The new scheduler changes only the prefill token count, never the batch structure. - The real serving config is kv_unified=false (-> n_stream=n_seq_max=128), so the split path is split_equal(sequential=true), not the contiguous split_simple the pseudocode implies. Fold into P0 ubatch-shape and determinism analysis; lock the split path in the A/B. - CUDA graphs ruled out: both NVFP4 H2H vLLM servers ran --enforce-eager (cudagraph_mode=NONE), so the npl128 2.4x decode gap is genuine eager-kernel + per-step host overhead. Scheduler cannot close it; the 157/333 ceiling stands. - TTFT root quantified: prefill_tps collapses with concurrency for llama (dense 1117->125) while vLLM holds flat ~1420. The dynamic T-D budget attacks this directly and can sustain prefill_tps >= vLLM during the drain, so burst-TTFT parity is mechanically plausible, but it couples to a decode-ITL knob (T) that MUST be co-reported with TTFT. Two calibration fixes required before P1: co-report drain-phase decode-ITL with TTFT (stop charging/selling the steady-state decode_agg number), and acknowledge the split_equal/n_stream=128 path. Neither changes the go decision. P1 is the minimal high-ROI step (handful of line edits at named seams); gate P2 on P1 metrics; P3 (kernel/CUDA-graph) owns the 2.4x residual independent of the scheduler. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 22:48:31 +00:00
Ettore Di Giacinto	ed17fc804e	docs(paged): scope token-granular continuous-batch scheduler for llama-server Build-ready plan (not implemented) for a vLLM-v1-style token-granular continuous-batch scheduler in tools/server/server-context.cpp update_slots(), the last lever after patch 0013 on the GB10 NVFP4 llama-vs-vLLM gap. Key findings that shape the scope: - The unified mixed batch already exists: Phase 1 (2604-2719) claims every ready decode token unconditionally, Phase 2 (2753-3330) fills prefill into the same llama_batch. Decode-first is structural, not a thing to build. - The chunked-prefill slot state already persists across steps (a PROCESSING_PROMPT slot with prompt.n_tokens() < task->n_tokens() resumes). No slot-state rewrite is needed - the feared big risk does not materialize. - The only missing piece is the budget POLICY: convert 0013's static per-step prefill cap into a dynamic, decode-first, per-slot-fair token budget (one total T, decode claims D, prefill gets leftover T-D, capped per slot). - Honest ceiling: the residual ~2.4x decode gap is a decode-KERNEL batch scaling ceiling (~157-161 dense / ~333 MoE @npl128), NOT a scheduler defect. The scheduler closes the 12x TTFT gap and holds that ceiling tuning-free; the throughput residual is a separate, named decode-kernel lever (P3). Phased P0-P3 with per-phase payoff, files, risks, and GB10 considerations. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 22:36:15 +00:00
Ettore Di Giacinto	362eea90ff	docs(paged): fair re-run verdict - synthesize NVFP4 llama vs vLLM scorecard Phase 3 synthesis of the max_prefill_tokens (patch 0013) fair re-run: how much of the gap was prefill starvation, the genuine remaining gap to vLLM, and where par-or-beat stands per concurrency/model. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 21:39:22 +00:00
Ettore Di Giacinto	c7075fb796	docs(paged): MoE 35B-A3B NVFP4 fair re-run with max_prefill_tokens budget Budget 256/512 sweep on the A3B MoE under patch 0013. Mirror image of the dense case: stock MoE was never prefill-starved (3B active, TTFT 84.8s @npl128), so the budget is a decode-throughput lever paid for in TTFT, not a TTFT fix. Budget 256 lifts decode_agg +14% (292->333.5 tok/s) and restores monotonic decode scaling (kills the stock +7.4% plateau, now +20% into npl128), moving llama 36.0%->41.1% of vLLM decode. Gap not closed: vLLM still ~2.4x decode and ~12x lower TTFT @npl128. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 21:38:08 +00:00
Ettore Di Giacinto	c8b1f16507	docs(paged): dense NVFP4 fair re-run with max_prefill_tokens budget sweep Re-run the dense Qwen3.6-27B NVFP4 vs vLLM A/B with patch 0013's QoS prefill budget enabled (LLAMA_PREFILL_BUDGET swept over 256/512/1024), fixing the prior run that left prefill unbounded and let high-concurrency prefills starve each other. At the saturated npl128 point budget=256 is the best lever: decode_agg 134.6 -> 161.2 tok/s (+19.8%) and TTFT 491.2 s -> 305.4 s (-37.8%) vs the starved stock run, moving llama from 34.5% to 41.3% of vLLM decode. Larger budgets help less; at light/moderate concurrency the budget is net-negative for TTFT because this all-at-once workload has no in-flight decode to protect at t=0. Documented honestly: a real but narrow high-concurrency lever, not a gap-closer (vLLM still ~2.4x decode / ~12x lower TTFT at npl128). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 21:22:07 +00:00
Ettore Di Giacinto	2975a74fb4	docs(paged): Qwen3.6 NVFP4 apples-to-apples scorecard (llama vs vLLM, dense + MoE) Full 4-way sweep (npl 8/32/64/128): dense Qwen3.6-27B (clean W4A4) + MoE Qwen3.6-35B-A3B (vLLM Marlin NvFp4). Parity at npl8; vLLM scales ~2.8-2.9x ahead on decode at npl128. llama TTFT explodes at high concurrency - run WITHOUT max_prefill_tokens (0013), the prefill-starvation also drags decode_agg; fair re-run with the QoS budget pending. llama wins on on-demand memory (paged). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 20:21:50 +00:00
Ettore Di Giacinto	ee78ae4a11	docs(paged): Qwen3.6 NVFP4 h2h bench doc - MoE llama.cpp table First crash-resilient slab of the apples-to-apples NVFP4-vs-NVFP4 llama.cpp-vs-vLLM benchmark on GB10. MoE Qwen3.6-35B-A3B paged llama.cpp (patch 0015) decode/prefill/TTFT/VRAM at npl 8/32/64/128. vLLM and dense tables append as the sweeps land. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 19:43:55 +00:00
Ettore Di Giacinto	acb22a66ed	feat(paged): mirror MoE token-tile density-aware auto-select (patch 0015) Mirror of llama-paged-dev commit 151343b into the pinned paged patch series. The durable, default-on follow-up to patch 0014's opt-in LLAMA_MOE_MMQ_X global cap: a host-side density-aware mmq_x auto-select in mul_mat_q_case that caps the MUL_MAT_ID grouped FP4-MMA token-tile only at low per-expert density (decode) and keeps the 128 tile at high density (prefill), so it is prefill-safe by construction (removes 0014's ~1.3% prefill cost). No new kernel. density_max default = 8 (not tile/4 = 16): 16 equals the 256-expert prefill-ubatch density and regressed S_PP ~2% on Qwen3.6-35B-A3B NVFP4; 8 sits between decode and prefill density for n_experts in [128,511] at n_ubatch=512. Honest result on the mission's MoE target (Qwen3.6-35B-A3B NVFP4, 256 experts + GDN/SSM linear attention, GB10 sm_121, median of 5 reps): NEUTRAL. Decode S_TG is within run-to-run noise (npl128 +0.36%) and prefill S_PP neutral (within +/-0.7%). This model is bound by the SSM recurrence and 256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile lever has nothing to bite on; a npl128 tile sweep confirms 64 is the only useful width (TILE8 -6.3% ... TILE96 -0.8%). The lever's real win lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128 per patch 0014), which the auto-select reproduces at npl128 by construction at zero prefill cost. Shipped default-on because it is prefill-safe, decode-neutral here, and correctness-gated. LLAMA_MOE_MMQ_X (0014) kept as a manual override; LLAMA_MOE_AUTO_TILE=0 restores exact stock selection. P0 gate: test-backend-ops test_mul_mat_id ragged small-M NVFP4/MXFP4 MoE decode-density shapes pass CUDA-vs-CPU on GB10 both default-on and stock. Full rationale and tables in patches/paged/MOE_DENSITY_AUTO_TILE.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 19:04:55 +00:00
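Note on acb22a66ed: a hedged sketch of what a density-aware tile choice could look like. The density formula (routed tokens per expert = n_tokens * n_expert_used / n_experts), the value n_expert_used = 8, and the `<=` comparison are all assumptions inferred from the numbers in the message (density 16 at n_ubatch=512 with 256 experts; safe range n_experts in [128,511]); they are not read from the patch. `select_moe_mmq_x` is a hypothetical name.

```python
def select_moe_mmq_x(n_tokens: int, n_expert_used: int, n_experts: int,
                     density_max: int = 8, capped_tile: int = 64,
                     full_tile: int = 128) -> int:
    """Pick the MoE token-tile width from per-expert token density (assumed rule)."""
    # Compare density = n_tokens * n_expert_used / n_experts against density_max
    # without division, so the boundary case is exact.
    low_density = n_tokens * n_expert_used <= density_max * n_experts
    return capped_tile if low_density else full_tile

# Decode at npl128 on a 256-expert model: density 4, so the tile is capped.
print(select_moe_mmq_x(128, 8, 256))
# Prefill ubatch of 512 tokens on the same model: density 16, full tile kept.
print(select_moe_mmq_x(512, 8, 256))
```

Under these assumptions a 512-token prefill ubatch keeps the 128 tile for any expert count from 128 through 511, which matches the range the message gives for `density_max = 8`.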
Ettore Di Giacinto	010067d900	feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap Mirror of the dev-tree engine patch (ggml mmq.cuh) into the paged patch set, plus its measurement writeup. Adds LLAMA_MOE_MMQ_X, an opt-in env cap on the MoE grouped-GEMM token-tile (mmq_x) for the MUL_MAT_ID path; default-off = byte-identical to stock. Honest result of the MoE near-term lever: the npl128 decode cliff does NOT exist on current HEAD (stock decode is monotonic 85/282/629/935/1295/1779 t/s at npl 1/8/32/64/128/256; the old cliff was fixed upstream by the sorted grouped FP4-MMA GEMM + MoE stream-k). The cap is therefore not a cliff fix but a modest high-batch decode micro-optimization: cap64 gives +4.8% decode at npl128 and +2.3% at npl256 (reproducible, neutral at npl<=64) for a ~1.3% prefill cost; cap16/cap32 are net-negative (prefill -41% / -17%). Full tables in MOE_TOKEN_TILE_CAP.md; durable density-aware follow-up in MOE_GROUPED_GEMM_SCOPE.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 13:49:15 +00:00
Ettore Di Giacinto	8925c009b7	docs(paged): scope durable grouped FP4-MMA MoE GEMM port for GB10 Build-ready plan (not implemented) for matching/beating vLLM MoE grouped-GEMM efficiency on GB10 sm_121 for Qwen3-30B-A3B mxfp4. Honest reframe: the grouped GEMM the mission scoped to build already exists upstream and runs on GB10 for mxfp4 - should_use_mmq() routes MUL_MAT_ID to the grouped mmq path, which already contains both vLLM building blocks (mm_ids_helper moe_align/scatter + a persistent stream-k FP4-MMA grouped GEMM). The npl128 cliff was a since-fixed regression, not a batched-bench artifact; re-measured decode is monotonic 85->1771 t/s. The one structural gap is M-tile sizing: ggml maximizes mmq_x over the aggregate token count while vLLM uses a small per-expert BLOCK_SIZE_M, so each tiny per-expert M-tile is 3-6% filled at decode density. Scope is a surgical two-step delta (expert-aware mmq_x selection; block-padded moe_align), the parity gate (test_mul_mat_id bit-exact + ragged small-M), and a phased plan gated behind the GB10 W4A16 occupancy wall. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 13:17:03 +00:00
Ettore Di Giacinto	a3abd60ae0	docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM) Same-day steady-state aggregate-decode sweep at npl 8/32/64/128 for three model classes, replacing the stale ~75-80%-of-vLLM carried figure with a full concurrency curve. Findings: - Dense 32B (NVFP4 vs NVFP4A16): parity at batch-8 (97%), 72-86% mid/high. - Small 0.6B: parity at batch-8 (99%), 49-67% at high concurrency (llama plateaus ~2.0k, vLLM scales to 4.2k; runtime/scheduler-bound). - MoE 30B-A3B: llama-only at 290-1041 tok/s. vLLM cannot serve it on GB10 (bf16 hangs at MoE warmup and reboots the box, twice; mxfp4 GGUF expert tensors unmappable by vLLM 0.23.0). Batch-8 anomaly resolved: clean isolated dense batch-8 decode is ~88-90 tok/s (~89 ms/step) across paged-vs-stock (within 2%, paged slightly faster) and ctx 65536-vs-163840 (within 1%). The prior 471 ms/step was a mixed-load decode/prefill contention artifact, not paged overhead, ctx allocation, or NVFP4 cost - the case patch 0013 LLAMA_PREFILL_BUDGET bounds. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 12:22:15 +00:00
Ettore Di Giacinto	dd6a4425e0	feat(llama-cpp): per-model max_prefill_tokens option (chunked-prefill QoS budget) Surface patch 0013's decoupled per-step prefill-token budget as a per-model grpc-server option, mirroring the existing kv_paged option. When max_prefill_tokens (aliases: mpt, prefill_budget) is set to a positive integer, params_parse setenv's LLAMA_PREFILL_BUDGET before context creation so the vendored update_slots() scheduler latches it; unset or non-positive leaves the env untouched, preserving stock unbounded-prefill behaviour (an externally exported LLAMA_PREFILL_BUDGET still works as an escape hatch). This bounds the head-of-line decode stall a large prompt inflicts on the in-flight decoders co-batched with it, with no steady-state throughput cost. Verified on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8-slot continuous batching, one ~6k-token prefill injected mid-stream; same binary, only the budget differs:
budget   worst decode gap   prefill wall
unset    2.462 s            6.672 s
512      0.669 s (3.7x)     7.516 s
256      0.398 s (6.2x)     8.854 s
Monotonic: a smaller budget cuts the decode stall further at a modest TTFT cost, the classic chunked-prefill trade-off. grpc-server.cpp compiles cleanly against the paged build tree. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 11:25:44 +00:00
Ettore Di Giacinto	4bc2b4a9b2	feat(paged): add patch 0013 decoupled per-step prefill-token budget Mirror of the dev-tree paged scheduler patch into the llama.cpp backend's vendored patch series. Adds LLAMA_PREFILL_BUDGET, a per-step prefill-token budget for the inherited update_slots() scheduler, decoupled from n_batch (the analogue of vLLM's --max-num-batched-tokens). It caps how many prompt tokens a single update_slots() step ingests, splitting a long prefill across more steps so co-batched decode keeps advancing instead of freezing for the duration of one fat ~n_batch prefill chunk. Default (env unset or <= 0) = disabled, so stock behaviour is byte-identical; orthogonal to LLAMA_KV_PAGED. Measured on GB10 (dense Qwen3-32B-NVFP4, 8 steady decoders + one injected 6000-token prefill, same binary, only the env differs): worst decode freeze 3380 -> 482 ms (7.0x) and decode_stall 3285 -> 387 ms (8.5x) at budget=256, for a +20% TTFT on the long request; budget=512 gives 4.8x at ~no TTFT cost. This is a latency/fairness lever, not an aggregate-throughput lever (steady decode is NVFP4 weight-read-bound on GB10, which the scheduler cannot lift). Correctness: budget unset or >= n_batch is byte-identical to stock; budget=N is byte-identical to stock -bN while preserving n_batch for decode width; the only deviation on long prompts is intrinsic flash-attn chunk-size FP grouping that pure stock -b exhibits too. Verified applying on the pinned llama.cpp f3e1828 after patch 0008. Productisation follow-up: surface as a grpc-server.cpp options knob (max_prefill_tokens) per CHUNKED_PREFILL_PLAN Phase B. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 09:55:32 +00:00
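Note on 4bc2b4a9b2: the per-step cap splits one long prefill across several scheduler steps. The sketch below only illustrates that splitting for a single prompt, assuming the whole budget is available to it on every step. `prefill_chunks` is a hypothetical helper, not a function in the patch.

```python
def prefill_chunks(n_prompt: int, budget: int) -> list[int]:
    """Token counts ingested per step when each step takes at most `budget` tokens."""
    chunks = []
    remaining = n_prompt
    while remaining > 0:
        step = min(budget, remaining)  # never ingest more than the budget
        chunks.append(step)
        remaining -= step
    return chunks

# The 6000-token injected prefill from the measurement, at the two budgets tested.
for budget in (256, 512):
    chunks = prefill_chunks(6000, budget)
    print(budget, len(chunks), chunks[-1])
```

Co-batched decoders advance once per step, so more and smaller chunks mean shorter individual stalls at the cost of more steps before the long request produces its first token.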
Ettore Di Giacinto	ba6bd94976	feat(paged): assert mask-pad invariant for the paged tile route (patch 0012) Patch 0012 of the paged-attention series. Adds a defensive GGML_ASSERT in src/paged-attn.cpp so the now-default paged decode route (GQA-grouped fattn-tile kernel) cannot silently start leaking past-end KV rows. The route stays correct only because the compacted mask/block-table length n_view = GGML_PAD(n_gather, 256) is a whole number of flash-attn KV tiles (nbatch_fa = 64 for head_dim 128 divides 256), so the last tile sits entirely inside the -inf pad window. The assert (n_view % 64 == 0) pins that implicit invariant: a future pad < 256 or tile > 256 that broke it now aborts instead of leaking. Additive only, no behaviour change. Verified on the DGX dev tree: build-cpu compiles and the paged CPU byte gate (LLAMA_KV_PAGED off vs on, Qwen3-0.6B-Q8_0, greedy) stays byte-identical with the assert silent. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 09:13:08 +00:00
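Note on ba6bd94976: the invariant the assert pins can be checked numerically. This is a small sketch in Python rather than the patch's C++; `ggml_pad` mirrors the round-up-to-multiple arithmetic of ggml's `GGML_PAD` macro for power-of-two `n`, and `mask_pad_invariant_holds` is a hypothetical helper name.

```python
def ggml_pad(x: int, n: int) -> int:
    """Round x up to the next multiple of n (n must be a power of two)."""
    return (x + n - 1) & ~(n - 1)

def mask_pad_invariant_holds(n_gather: int, pad: int = 256, nbatch_fa: int = 64) -> bool:
    """True when n_view = pad-rounded n_gather is a whole number of KV tiles."""
    n_view = ggml_pad(n_gather, pad)
    return n_view % nbatch_fa == 0

# With pad 256 and tile 64 the invariant holds for every gather length.
print(all(mask_pad_invariant_holds(n) for n in range(1, 5000)))

# A hypothetical future pad below the tile size, or a tile above the pad, breaks it.
print(mask_pad_invariant_holds(1, pad=32))
print(mask_pad_invariant_holds(1, nbatch_fa=512))
```

The two failing cases correspond to the "pad < 256 or tile > 256" scenarios the message says would now abort instead of leaking.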
Ettore Di Giacinto	e983919516	feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011) Increment 3 attention lever. In the paged in-kernel decode dispatch, route the common grouped-query F16 case to the tile kernel and keep the inc-1 vec kernel for everything else. Tile groups the q-heads that share a kv-head (ncols2) so each K/V row is loaded once per group instead of once per q-head, and runs at higher occupancy (108-128 regs vs vec 168 -> 25%). On GB10 (Qwen3-32B NVFP4, F16 cache, gqa 8, batch 32, 1024 ctx, same build, env-toggled) this cuts the decode step from 186.3 to 177.9 ms/step (-4.5%), within 1.8% of stock (174.8). The win grows with context (tile vs vec decode step, npl=8): 1024 -2.3%, 4096 -3.3%, 8192 -4.1%, 16384 -6.1%, as attention takes a larger share of the step. Routing guard: tile has no K/V type template (loads half2), so a non-F16 cache would be converted to a contiguous F16 copy by launch_fattn, breaking the in-kernel block-table read. So tile is correct only for an F16 cache, and the grouping only helps at gqa>=2. tile is used only for {F16 K and V, gqa_ratio>=2}; everything else falls back to the inc-1 vec path, exactly as before this change. LLAMA_KV_PAGED_VEC=1 forces vec for A/B. The inc-2 phys(j) tile read (patch 0010) was already plumbed; this only adds the default route. (Paged decode currently needs an F16 cache; quantized + paged is a pre-existing limitation unaffected by this change: stock+q8_0 works, paged+q8_0 aborts both before and after.) Split-K was ruled out: the vec decode grid is already block-saturated (~43 waves over 144 resident on 48 SM), so more parallel_blocks adds no SM fill; the under-saturation is intra-SM occupancy + 8x KV re-streaming, which GQA grouping attacks directly. Validated (greedy): CPU plumbing gate (0.6B, build-cpu, paged-on vs off) byte-identical; GPU 0.6B gqa=2 tile token-coherent with the inc-1 vec path (7/8 sequences identical, 8th in the same kernel-noise band where vec also drifts from stock); 32B gqa=8 tile tracks stock at least as well as vec. Stock (no block table) is byte-identical: the dispatch guard only diverts on src[5]. Full rationale and numbers in the patch header. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code]	2026-06-22 22:38:28 +00:00
Ettore Di Giacinto	2c5adda28c	feat(paged): tile in-kernel decode read + dispatch guard (patch 0010) Increment 2 (robustness): graft the patch-0009 phys(j) block-table read into the CUDA tile kernel (mirror of fattn-vec.cuh) and add a dispatch guard so a present block table (src[5]) routes ONLY to the vec or tile kernel, never to mma/wmma (which ignore the table and would silently read the wrong physical cells). Default route stays vec, the inc-1 byte-validated path. Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B) PASS; GPU vec-paged == stock at -s 1 PASS; the real Qwen3-32B NVFP4 batch decode confirmed dispatching to vec (Q ne=[128,1,64,N]). The tile graft is plumbed for the increment-3 GQA head-group reuse but is EXPERIMENTAL/not byte-validated (LLAMA_KV_PAGED_TILE=1): the GQA-grouped ncols2>1 tile path reads a full nbatch_fa tile unbounded while the compacted paged mask is not padded to cover it. Bounding that path is increment-3 work; the default vec route is unaffected. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 20:37:12 +00:00
Ettore Di Giacinto	ee13a94a8c	paged: in-kernel decode read patch 0009 (kill the gather regression) Mirror patch 0009 for the paged llama.cpp engine. It removes the patch-0003 per-layer per-step gather (ggml_get_rows of K/V to a contiguous buffer) on the decode step and instead reads paged blocks in-kernel: build_attn passes the physical K/V views plus a position-ordered block table (src[5] of ggml_flash_attn_ext, padded to FATTN_KQ_STRIDE), and the CUDA fattn vec kernel plus the CPU reference map each logical KV index to its physical cell and read in place. KV_max / parallel_blocks / stream_k split-K are unchanged; a nullptr block table is the stock contiguous read (byte-identical, gated by LLAMA_KV_PAGED). Verified on GB10 (sm_121, Qwen3-32B NVFP4, batch 32 / 1024 ctx): the decode step drops from 1279 ms (paged-gather) to 696 ms in-kernel (-46%), reaching stock parity (647 ms). CPU paged vs stock is bit-for-bit identical; GPU stays within the documented batch-shape non-determinism band. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 18:04:09 +00:00
Ettore Di Giacinto	4dcbcfcf92	docs(paged): decode-step gap study vs vLLM on GB10 Profiling decomposition of the llama-server batch-32 / 1024-ctx decode step vs vLLM on a DGX Spark (GB10, sm_121). Findings: decode is GPU-bound (~95% busy, sampling/loop fully hidden); at 1024 ctx the step is ~84% KV/attention and ~16% weight GEMM; the paged KV engine is a ~1.85x decode regression vs stock (per-layer gather-to-contiguous); even stock is ~4-5x slower than vLLM, gated by the long-context decode-attention and thin-batch FP4 GEMM kernels, not by the serving loop. Ranked closable-vs-structural levers included. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 15:44:24 +00:00
Ettore Di Giacinto	80e0c1ac6b	feat(paged): wire cross-request prefix share into llama-server (patch 0008) Ship patch 0008 of the paged-attention series: wire the paged cross-request prefix recompute-skip (patch 0007's paged_prefix_api::share/commit engine seam) into the llama-server continuous-batching loop so CONCURRENT requests sharing a long prefix reuse one committed copy of the prefix blocks and prefill ONLY their divergent suffix. The server's native prompt cache only reuses a slot's own prior prompt; it does not share across distinct concurrent slots. 0008 adds that cross-slot share, fully gated behind LLAMA_KV_PAGED (stock byte-identical). The hook lives in tools/server/server-context.cpp update_slots (the only place with the slot prompt-processing loop; grpc-server.cpp includes it), ~50 gated lines: a fresh-slot share() that advances n_past past the committed prefix, and a commit() at the prefill->generation transition. The n_past<block gate guarantees every positive share is adopted so the engine reservation matches the suffix-only batch (no stale paged blocks). Verified in-server (32B NVFP4, CUDA, --kv-unified) with a live prefix holder: K=16/32 concurrent shared-prefix requests prefill only their ~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens; K=16 23.9s->1.5s, K=32 57.9s->2.3s), engine logs 'shares ... prefix blocks - NOT recomputed' (ref_cnt>1), greedy output within the documented CUDA batch-shape non-determinism band. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 15:03:16 +00:00
Ettore Di Giacinto	52f0f7b8cf	docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache) Matched comparison on DGX Spark (GB10, sm_121): batched llama-server with NVFP4 GGUF and the paged engine vs batched vLLM 0.23.0 NVFP4A16 with APC, both eager, both prefix-cache on. Two findings: (1) the paged cross-request prefix recompute-skip (patch 0007) does NOT engage in llama-server - it is only reachable via paged_prefix_api::share/commit, which the server never calls; the server engages only physical paged block placement plus its own native prompt cache. (2) With every confounder removed, vLLM is ~6x faster end-to-end (K=16: 8.6s vs 50.7s; K=32: 8.9s vs 58.3s), decode-bound not prefill-bound: llama ~828ms/decode-step at batch 32 vs vLLM ~185ms; CUDA graphs are not the differentiator (both eager). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 14:16:52 +00:00
Ettore Di Giacinto	f347f7ca1d	docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison Two closing measurements on DGX Spark (GB10, sm_121): 1. Stock GPU determinism (no paging): with LLAMA_KV_PAGED unset, stock llama.cpp produces a different greedy token stream when the same prompt is decoded in a full-prefill batch vs a split (prefix-then-suffix) batch. At G=24 the generated stream diverges 1/5 prompts on CPU and 2/5 on CUDA (and earlier on CUDA). This confirms the patch-0007 GPU byte-identity failure is stock floating-point batch-shape non-determinism, not a paged bug. CPU exhibits it too, just less often, which is why 0007's short CPU scenarios passed 16/16 while the CUDA run flipped. 2. vLLM vs llama.cpp+paged on a shared-prefix fan-out (K reqs share a 1024-tok prefix + unique 32-tok suffix, gen 64). llama.cpp+paged prefix cache gives 7.15x (K=16) / 10.3x (K=32) prefill reduction vs its no-share baseline - the same cross-request prefix-skip vLLM's APC provides (97% hit rate confirmed). Head-to-head on cached prefill vLLM is ~5x faster (Q4_K_M vs nvfp4a16 quant, vLLM on FP4 emulation + eager), and wider end-to-end due to continuous batched decode. Competitive in kind, behind in absolute terms on this hardware. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 13:48:01 +00:00
Ettore Di Giacinto	0dd45f0da5	docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results Record the belt-and-suspenders GPU run of the 0007 prefix-engine driver and a shared-prefix throughput benchmark. The committed CPU driver passes ALL PASS; the CUDA build fails only the strict greedy-token-equality assertions (the same binary fails them at ngl=0 too), which is CUDA float-kernel non-determinism, not a paged-logic defect - every structural KV-reuse invariant passes on GPU. The shared-prefix benchmark shows a real, K-scaling win: prefill wall time drops 7.2x (32B K=16) to 10.3x (32B K=32) when the shared prefix is computed once and reused via the paged cross-request prefix cache. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 12:59:09 +00:00
Ettore Di Giacinto	9537726649	fix(llama-cpp/paged): stop double-applying the paged patches in prepare.sh The Makefile llama.cpp target git-applies the paged series at checkout; prepare.sh then re-applied with patch, fuzzily duplicating hunks (redefinition errors -> the grpc-server CUDA build failed under LLAMA_PAGED=on). Guard prepare.sh's apply with a sentinel (skip when llama.cpp/src/paged-kv-manager.cpp already exists) + -N/-r flags, so it only does work against an unpatched checkout. Found by the GPU/full-build verification (PAGED_GPU_VERIFY.md). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 11:54:51 +00:00
Ettore Di Giacinto	d1ba327843	docs(paged): record GPU correctness + CUDA backend-build verification GPU (DGX Spark, GB10/sm_121, CUDA 13.0) verification of the paged-KV series: core token-identical gate and 4-stream multiseq are byte-identical stock-vs-paged at -ngl 99, the device gather is confirmed firing, and a 32B paged run is coherent. Full backend: patches/paged apply clean to the pin and grpc-server compiles+links under CUDA sm_121. Notes also flag a double patch-application in the LLAMA_PAGED=on make flow (git apply + prepare.sh) and a token divergence in the unshipped prefix-recompute-skip dev driver (same on CPU and GPU). Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 11:50:01 +00:00
Ettore Di Giacinto	ecffd4b097	feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007) Mirror patch 0007 of the paged-attention series into the vendored llama.cpp patch set. It wires the host-side cross-request prefix cache (0006) into the engine so a new sequence physically shares the cached prefix blocks (ref-counted) and decodes only the divergent suffix - the shared prefix KV is never recomputed. paged-alloc becomes one persistent caching PagedKVManager per (kv-cache, stream) keyed by the real seq_id (per-sequence ref-counted free); two gated llama_kv_cache methods (paged_prefix_share / paged_prefix_commit) mark the shared physical cells' seq-membership so the engine attention mask covers the already-computed prefix; find_slot anchors placement on each sequence's ubatch.pos. Existing-file core touch is llama-kv-cache.{cpp,h} (+71 -3); everything else is additive vendored units. Gated behind LLAMA_KV_PAGED, default off, stock byte-identical. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): greedy byte-identity vs decode from scratch at a block boundary and mid-block, prefill computing only the suffix (32 prefix tokens skipped), and ref-counted free safety (2->1 on one sharer's removal, survivor intact and re-shareable, pool restored when all freed). The 0004 serving gate stays byte-identical stock vs paged in unified and non-unified mode. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:47:10 +00:00
Ettore Di Giacinto	67c6208b3a	feat(llama-cpp/paged): cross-request prefix caching patch 0006 Mirror patch 0006 of the paged-attention series into the vendored llama.cpp patch set. Extends the vendored PagedKVManager (src/paged-kv-manager) with host-side cross-request prefix sharing: place_with_prefix reuses cached physical blocks for a new sequence shared prefix (ref_cnt++) and allocates only the divergent suffix; cow_block copy-on-writes a still-shared (ref>1) block before a divergent write so co-owners stay byte-correct; ref-counted free releases a shared block only at ref 0. Core kv-cache files untouched; gated behind LLAMA_KV_PAGED, default off. Gate 0 verified on the dev tree (CPU, Qwen3-0.6B-Q8_0): shared-prefix greedy tokens byte-identical to the unshared baseline at both a block boundary and mid-block, measured 2-block reuse (ref_cnt==2, only the suffix allocated), and copy-on-write + seq_rm ref-count safety with no use-after-free. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:14:27 +00:00
Ettore Di Giacinto	667a21c119	feat(llama-cpp): expose paged KV cache as a per-server option (patch 0005) Wire the continuous-batching serving path (update_slots) to the on-demand paged KV-cache engine (patches 0001-0004). update_slots already drives the engine transparently through the existing kv-cache seams: each slot's sequence allocates paged blocks on arrival (find_slot placement) and returns them on slot release (the seq_rm free seam). No serving-loop change is needed for correctness. This patch only exposes the enable cleanly: instead of forcing operators to export the process-wide LLAMA_KV_PAGED env, add `kv_paged` (aliases `paged_kv` / `paged_attention`) and `kv_paged_debug` model options that set the env before the model/context is created. Default off; when the option is absent nothing is touched, so an externally exported env still works and stock behaviour is unchanged. Verified on a dynamic continuous-batching harness (NP physical slots reused across M>NP queued prompts, single mixed llama_decode per step, greedy): 12 dynamically-arriving sequences over 4 slots are token-identical to the stock single-slot serial baseline under both the unified and per-sequence caches. The debug trace confirms per-slot [paged-alloc] grow on arrival and per-stream release on seq_rm. The per-slot allocate/free capacity benefit only materialises under a per-sequence cache (kv_unified:false), since paged block ownership is keyed by stream; the unified cache collapses every slot onto one stream and the run stays correct but degenerates to a single bounded, stock-recycled pool. We do not flip kv_unified here, to keep the default serving behaviour and idle-slot prompt cache unchanged. No core llama.cpp patch: no engine bug was found under dynamic slot churn. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 09:33:32 +00:00
Ettore Di Giacinto	04e3d04ab8	build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on) Move the paged-attention patch series (0001-0004 + docs) into patches/paged/, applied behind a new LLAMA_PAGED build flag (default on). The base patches/ dir is now clean, so a dep-bump that breaks a paged hook can be unblocked with LLAMA_PAGED=off (clean-against-upstream build) and the paged carry fixed independently - decoupling the paged-KV maintenance from routine bumps without a separate backend. Both apply paths wired (Makefile git-apply + prepare.sh re-apply, flag passed through). Runtime stays gated by LLAMA_KV_PAGED env, so an on build is byte-identical to stock until that env is set. Glob/flag logic verified in bash. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 09:22:36 +00:00
Ettore Di Giacinto	4968cd8a94	paged-attn 0004: on-demand KV block allocation Wire the paged placement in find_slot through the vendored PagedKVManager (0001) instead of a fixed full-pool permutation. Blocks are popped from a free pool on demand as a sequence crosses block boundaries, and returned on sequence end (full seq_rm / clear). One manager per (kv-cache, stream); all state lives in a new src/paged-alloc unit keyed by a static registry, so the core kv-cache struct is untouched (find_slot/clear/seq_rm gain only a gated call). Default off; stock path byte-identical. Gate 0 (CPU, Qwen3-0.6B-Q8_0), LLAMA_KV_PAGED=1 token-identical vs stock: - single-stream llama-simple, 48 tok: identical - multi-stream driver, 3 seqs x 40 tok: identical Demand-driven confirmed via debug log: blocks grow 0->1->2->3->4 at logical positions 16/32/48 (peak 4 blocks vs 16-block budget), per stream independently. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 08:50:57 +00:00
Ettore Di Giacinto	37e0e1ef55	paged-attn 0003: lift gather-read to multi-stream The 0003 gather-read was single-stream only (GGML_ASSERT k->ne[3]==1). Lift it to N streams: one index column per stream over the unified batch, gathered with a single ggml_get_rows along the stream axis. Each column is position-sorted (preserving the flash-attn online-softmax reduction order that makes the read byte-identical) and padded to the max non-empty count across streams with a masked (empty) cell, which contributes exp(-inf)=0. Core touch stays additive: the one-line build_attn hook is unchanged; only the two kv-cache gather helpers (now per-stream) and src/paged-attn.cpp grow. Gate 0 (CPU, Qwen3-0.6B-Q8_0): a multi-sequence greedy driver (non-unified KV, k->ne[3]>1) is token-identical between stock (env unset) and LLAMA_KV_PAGED=1: 3 seqs x 40 tok, 2 seqs x 32 tok, 5 seqs x 32 tok all identical; single-stream llama-simple unchanged. Debug log confirms n_stream=3 engaged the multi path. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 08:46:12 +00:00
Ettore Di Giacinto	d9d846e04b	feat(paged): patch 0003 gather-read - Gate 0 green, token-identical, additive Implements the paged-attention gather-read (the real engine compute): attention reads ONLY a sequence's used cells by gathering K, V and the kq_mask by the non-empty-cell index list before build_attn_mha. Verified token-identical to stock greedy generation, 9/9 across 3 prompts x {32,96,128} tokens on Qwen3-0.6B, with n_gather=71 < n_kv=256 confirming real compaction (not an identity no-op). Built in the additive "hook, don't edit" form: all logic in new src/paged-attn.{h,cpp} (an llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line in build_attn + two thin accessors on llama_kv_cache_context + one CMake line. No edit to llm_graph_input_attn_kv or llama-graph.h. 216 insertions; default-off behind LLAMA_KV_PAGED so stock path stays byte-identical. Key correctness finding: get_gather_idxs emits cells sorted by token position. CPU flash-attn's online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's scattered placement alone (full-window read) diverges from stock past the first block; the position-sorted gather reproduces stock's exact reduction order -> bit-identical. So 0003 is what makes paged placement token-identical under flash-attn. Verified on a dev tree at the pin (0001+0002+0003 on branch paged); not pushed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 08:26:46 +00:00
Ettore Di Giacinto	84d59e659b	docs(paged): additive "hook, don't edit" layout for the patch series Maintainers rejected PR #22569 (the upstream paged draft) as "slop" - it rewrites core attention and is unvendorable. Our own series must be additive so it survives llama.cpp pin bumps. This documents the rule and the per-patch core-touch budget: every change is either new code in a new vendored src/ file, or a single env-gated hook at one call site that delegates to it - no logic in core files, no core struct edits. Grounds it in the pinned source: llm_graph_input_i is pure-virtual and res->add_input() lets a new file register a graph input, so paged behavior plugs in without editing core graph types. Redesigns 0003 (gather-read) from the old 4-file surgery to one build_attn hook + a new paged-attn.{h,cpp} (a gather-input subclass) + two thin cache accessors (~8 core lines vs a core-struct rewrite). 0005 lands entirely in LocalAI's grpc-server.cpp (no core patch). Dev tree at the pin with 0001+0002 applied is set up; 0003 implementation is the next focused token-identical Gate-0 block. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 07:28:44 +00:00
Ettore Di Giacinto	931793aa24	feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection Deliverables for pushing paged KV toward the real target (2xH200), since GB10 is only the test box and its "no win" result is a low-bandwidth artifact: 1. Correctness verified. test-paged-kv-e2e is greedy-equivalent to the contiguous reference (top-5 argmax ref=paged=3743, overlap 5/5). Found + fixed the blocking bug: common_fit_paged_kv_blocks over-reports free VRAM on GB10's unified device and tried 245GB of KV on a 119GB box, OOM-aborting context creation. Patch in patches/0002; durable fix (clamp to free_vram, honor --fit off) noted. 2. paged-loadgen.cpp: a dynamic-load benchmark that actually exercises where paging wins - variable prompt/gen lengths, continuous arrival, shared prefix - and reports the capacity ratio (contiguous reserve / paged peak KV). The stock tools run fixed-length all-at-once load, which is why they never show a paged win. 3. Projection to 2xH200, grounded in measured GB10 plateaus. Decode is bandwidth-bound, so the ceiling (~16k t/s for 32B) needs ~3,800 concurrent seqs, but contiguous KV fits only ~490 in HBM at 2k ctx - so KV memory IS the binding constraint on the target (unlike GB10), and paged KV's ~5-10x capacity (no over-reservation + prefix sharing) is what reaches the ceiling. The thesis holds on the target; remaining work is hardening/finishing the paged op (PR22569 was 12-13% slower and lacks prefix sharing). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 23:16:28 +00:00
Ettore Di Giacinto	0337505dc8	docs(paged): measure paged KV at high concurrency (LLAMA_MAX_SEQ=2048) - no single-GB10 win Closes the open question from PR22569_EVAL: that eval was blocked by the 256-seq compile cap and used a compute-bound 32B. Recompiled LLAMA_MAX_SEQ=2048 and swept a bandwidth-bound model (Qwen3-1.7B) to npl=2048, both KV layouts. Result: aggregate decode plateaus at the hardware ceiling for BOTH layouts - 1.7B flattens ~3200-3700 t/s by npl=512 (contiguous and paged alike), 32B-dense ~540 by npl=128. Pushing concurrency past the plateau collapses per-seq tps (23->1.9) and explodes TTFT (0.6s->64s) with no aggregate gain. Paged KV is a memory-capacity / anti-fragmentation / prefix-sharing feature, not a single-node throughput lever; the 24k aggregate is a fleet-level (multi-GPU) result, unreachable on one GB10 regardless of KV layout. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 22:47:20 +00:00
Ettore Di Giacinto	faeb5b457c	analysis: NVFP4 closes the decode gap too (547->619, ~93% of vLLM) Measured npl=128 cold A/B: NVFP4 decode 619 vs Q4_K 547 (+13%), closing the gap to vLLM (667) from ~22% to ~7%. NVFP4's FP4-MMA kernel is more bandwidth-efficient at the thin n=128 decode shape than Q4_K int8-MMQ (which ran 2.1x above the floor), so it IS the better int4 decode GEMM the diagnosis called for - no multi-day Marlin-for-K-quants needed. With NVFP4, llama.cpp on GB10 is ahead on prefill (1209 vs 800) and within ~7% on decode. Remaining 7% = optional FP4 kernel tuning. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 21:42:17 +00:00
Ettore Di Giacinto	6e0b910210	analysis: decode gap is GPU/kernel-bound, NOT host overhead (corrects premise) Rigorous re-measurement on pr24423: concurrent decode is GPU-compute-bound (~96% util, sampled), CUDA graphs ARE enabled at npl=128 (94/98 calls replay a captured graph; n_kv padded to 256 keeps topology stable), and graphs ON vs OFF is only +1.5% at npl=128. The earlier '20% GPU util / 170ms host' read was a windowing error (whole-run nsys vs decode-windowed). So no host/graph patch helps. The real 547->667 gap is the quantized DECODE GEMM: mul_mat_q (Q4_K/Q6_K) is ~68% of decode GPU time and runs ~2.1x above the GB10 bandwidth floor (poorly tuned for the thin n=128 shape); vLLM's Marlin int4 runs closer. Lever = a Marlin-style int4 decode kernel for K-quants (or a Marlin-friendly int4 serving format), not host work. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 21:32:58 +00:00
Ettore Di Giacinto	aaf7b4112e	test(llama-cpp): NVFP4-dense FP4 quality+speed eval on GB10 NVFP4-dense is producible via --tensor-type attn=nvfp4 --tensor-type ffn=nvfp4 (GGML_TYPE_NVFP4 has a full quantize path; no top-level ftype needed). Clean-from-BF16 4B PPL: NVFP4 14.31 vs Q4_K 13.66 vs MXFP4 17.42 vs BF16 13.32 - Q4_K-class, not MXFP4-class. Prefill routes onto the FP4 MMA kernel (~1.29x Q4_K on 4B, within 5% of MXFP4). It is the quality-preserving FP4 win MXFP4 was not. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 18:44:57 +00:00
Ettore Di Giacinto	037ad82b7c	docs(paged): MXFP4-dense vs Q4_K quality gate on GB10 (do not recommend) Fair clean-source perplexity check on DGX Spark (GB10): quantize Qwen3-4B from one BF16 source to both Q4_K_M and MXFP4 (no imatrix, identical recipe). Q4_K_M is +2.6% PPL vs BF16; MXFP4-dense is +30.8% (+27.5% worse than Q4_K). The existing 32B MXFP4 was confirmed double-quant (Q4_K_M -> MXFP4 via --allow-requantize), but the clean 4B test shows the gap is intrinsic to the format, not the double-quant. Output stays coherent. Verdict: the ~1.58x prefill / ~1.2x decode win does not justify a Blackwell MXFP4-dense quality recommendation; keep Q4_K_M the dense default, pursue NVFP4 instead. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 17:25:14 +00:00
Ettore Di Giacinto	1887385b79	analysis: MXFP4-dense fails quality check (~27% worse PPL than Q4_K) - do not recommend Clean fair comparison (Qwen3-4B, all from same BF16 source, wikitext PPL): BF16 13.32, Q4_K_M 13.66 (+2.6%, near-lossless), MXFP4 17.42 (+30.8%). MXFP4 is ~27% worse than Q4_K even clean from BF16 (32B double-quant cross-check: 7.39 vs 8.46, +14.6%, same direction). MXFP4_MOE is built for MoE expert tensors; on dense attn/ffn it is far lossier than Q4_K's 6-bit superblock structure. The ~1.58x prefill is not worth ~27% PPL - Q4_K stays the dense default; FP4 only where the model is trained for it (MoE). Verdict: do NOT ship a Blackwell MXFP4-dense rec. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 17:24:24 +00:00
Ettore Di Giacinto	40ee9cdd13	docs(paged): evaluate llama.cpp PR #17004 (GPU/backend sampling) on GB10 PR #17004 is merged and already present in our pinned llama.cpp f3e1828. Measured on DGX Spark (GB10, sm_121, Qwen3-32B-Q4_K_M): - llama-batched-bench does no sampling (random tokens), so it cannot test the fix; its ~540 t/s plateau is not sampling-bound. - Real-sampling A/B via llama-batched (CPU vs -bs GPU sampler): +25% at np=32, +3% at np=64, GGML_ASSERT(obj_new) graph-alloc crash at np>=128. - nsys at np=64: GPU-busy time and kernel mix unchanged (392 vs 404 t/s); sampling kernels negligible. GPU utilization did not rise. Clean negative: the fix does not break the plateau toward the ~2700 ceiling or past vLLM 667, and is unusable at the multi-user parallelism in question. Adoption: code arrives via LLAMA_VERSION bump (prepare.sh vendors the modified upstream server-context.cpp), but grpc-server must set params.sampling.backend_sampling to enable it; grammar/tool-call/logprobs requests fall back to CPU. Defer adoption until #18547/#18550 stabilise it. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 15:44:21 +00:00
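
The mask-pad invariant that commit ba6bd94976 (patch 0012) pins with an assert can be illustrated with a small arithmetic sketch. This is Python for illustration only, not the C++ in src/paged-attn.cpp; the function names `ggml_pad` and `paged_view_len` are hypothetical, and the constants 256 (pad) and 64 (flash-attn KV tile for head_dim 128) are the values quoted in that commit message.

```python
PAD = 256   # compacted mask/block-table length is padded to this (per patch 0012)
TILE = 64   # nbatch_fa for head_dim 128, as quoted in the commit message


def ggml_pad(x: int, n: int) -> int:
    """Round x up to a multiple of n (n must be a power of two).

    Same bit trick as ggml's GGML_PAD macro; the Python name is illustrative.
    """
    return (x + n - 1) & ~(n - 1)


def paged_view_len(n_gather: int) -> int:
    """Hypothetical helper: padded view length plus the patch-0012 style check."""
    n_view = ggml_pad(n_gather, PAD)
    # The last KV tile must lie entirely inside the padded (-inf masked) window.
    assert n_view % TILE == 0, "pad is not a whole number of KV tiles"
    return n_view


# n_gather=71 is the compaction example from the patch-0003 commit message.
print(paged_view_len(71))           # 256
print(ggml_pad(71, 32) % TILE)      # 32: a hypothetical pad of 32 would break the invariant
```

Because 64 divides 256, every padded length passes the check; the second print shows why a smaller pad would leave the final tile reading past the padded window.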

6823 Commits