Commit Graph

35 Commits

Author SHA1 Message Date
Ettore Di Giacinto
be65438eac docs(paged): record MoE-prefill engine-gap decomposition + GEMM-port negatives (default-off)
nsys cross-engine decomposition: the MoE prefill 64% gap vs vLLM is engine plumbing, not the kernel (GPU 97% busy, 443 vs 197 us/tok). Three buckets: per-expert W4A4 M-fragmentation (58%), GDN scan (24%), f32<->bf16 casts (15%). Offline-repack (0045) and verbatim vLLM-marlin port both trail FP4-MMQ via wrapper overhead, kept default-off as recorded negatives.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 17:20:07 +00:00
Ettore Di Giacinto
7b38c6b2a3 feat(paged): GDN M5 tensor-core chunked-scan prefill, default-on under paged KV (patch 0044)
Land the tensor-core forms of the chunked gated-DeltaNet prefill scan (0031)
as a single GDN_TC-selected build and ship the M5 variant (full TC form-T
solve + state-update mma) default-ON when LLAMA_KV_PAGED is set.

The dispatch defaults GDN_TC=5 and GDN_CHUNK_MIN=64 under paged KV (both
env-overridable; OFF/INT_MAX when not paged, so stock/non-paged stays
regression-free). GDN_CHUNK_MIN is the per-call engage threshold and stays > 1
so decode (1 tok/call) keeps the sequential recurrence; 64 was tuned from a
{1,32,64,128,256} sweep (32/64/128 all win on prefill, 256 barely fires because
the MoE-prefill per-call count is < 256, 1 collapses decode S_TG ~25%).
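
Sketch of the default resolution described above, as a Python model rather than
the shipped C++/CUDA dispatch. Assumptions (not from the patch): any non-empty
LLAMA_KV_PAGED value counts as "set", OFF is modelled as 0, and the helper names
gdn_defaults / engages_chunked are hypothetical.

```python
INT_MAX = 2**31 - 1  # stand-in for C's INT_MAX


def gdn_defaults(env):
    """Resolve (gdn_tc, gdn_chunk_min) from an environment mapping."""
    # Assumption: any non-empty value means paged KV is active.
    paged = bool(env.get("LLAMA_KV_PAGED"))
    tc_default = 5 if paged else 0            # assumption: 0 models OFF
    chunk_default = 64 if paged else INT_MAX  # INT_MAX means "never engage"
    gdn_tc = int(env.get("GDN_TC", tc_default))
    gdn_chunk_min = int(env.get("GDN_CHUNK_MIN", chunk_default))
    return gdn_tc, gdn_chunk_min


def engages_chunked(n_tokens, env):
    """True when a call is large enough to take the chunked scan."""
    _, chunk_min = gdn_defaults(env)
    return n_tokens >= chunk_min
```

With a paged environment, a 1-token decode call stays on the sequential
recurrence and a 64-token prefill call engages the chunked scan; passing
os.environ as env would read the live process environment.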

Measured GB10, q36-35b-a3b-nvfp4, LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1,
llama-batched-bench -ngl 99 -fa on -ntg 4 -npl 32:
  -npp 512  S_PP 2208.96 -> 2286.5 t/s  (+3.5%, mean of 3 interleaved A/B)
  -npp 2048 S_PP 2021.5  -> 2379.8 t/s  (+17.7%)
Decode S_TG unchanged (~399 vs ~397 t/s, within noise).
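
The quoted gains can be re-derived from the S_PP pairs above; a minimal check
(pct_change is a helper name introduced here, the throughput values are the ones
listed in this message):

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# S_PP (t/s) before -> after, keyed by -npp
runs = {512: (2208.96, 2286.5), 2048: (2021.5, 2379.8)}

for npp, (before, after) in runs.items():
    print(f"-npp {npp}: {pct_change(before, after):+.1f}%")
```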

Bit-exactness (per-path greedy md5, n=48 --temp 0 --seed 1, paged): default-on
== M5-forced == canonical on the gate prompt - MoE 8cb0ce23, dense 5951a5b4.
test-backend-ops GATED_DELTA_NET 94/94 vs CPU with M5 forced (incl. multi-chunk
up to n_tokens=256). On a long MoE prompt the default (M5 fires at >=64 tokens)
and the sequential path agree word-for-word until one benign greedy token-flip;
dense is byte-identical. The chunked scan is a NEW per-path result (different FP
reduction order), NMSE-validated benign.

CUDA-only, gencode arch=compute_121a,code=sm_121a (GB10 / sm_121a). README
sections 3 (0044 row, 0031 superseded note) and 5 (dev-notes verdict) updated.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 06:42:11 +00:00
Ettore Di Giacinto
042deab40e docs(paged): vLLM-parity lever map + tensor-core GDN build plan (both-engine profile-validated)
Lever map records the full prefill/decode gap decomposition vs vLLM, the ranked levers, and the rejected dead ends. GDN build plan is the per-product mma mapping + A-inverse + occupancy design.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 06:15:10 +00:00
Ettore Di Giacinto
c4058eb4da feat(paged): tail-fusion (0042) + full-step decode CUDA graph default-on (0043); FP4-MMA W4A4 (0034) + Marlin W4A16 (0035) MoE-GEMM scaffolds default-off
0042 fuses the pre-norm residual add into RMSNorm (+0.5% prefill, bit-exact). 0043 makes the full-step MoE decode CUDA graph default-on (+2-4% decode, bit-exact; removes ~18x per-step host kernel re-issue, A/B-confirmed). 0034 (native FP4-MMA W4A4) and 0035 (Marlin-style W4A16 grouped MoE GEMM) are correct + bit-exact but regress vs the int8 FP4-MMQ in-backend on GB10 (bf16 MMA is ~half the int8 rate); shipped default-off as validated mechanisms and recorded negatives per the parity methodology.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 06:15:10 +00:00
Ettore Di Giacinto
f1c98ff0b9 fix(paged): revert S3 decode-stable scheduler to default-OFF (A/B regression)
Patch 0041 (LLAMA_PAGED_DECODE_STABLE) was made default-on-when-paged, but a
measured end-to-end A/B proved that is a serving mistake. S3 defers prefill
admission on the period-8 cadence, which delays prompt admission: 2.5x worse
TTFT (60s vs 24s at N=256) and 20-29% lower end-to-end throughput, with no
end-to-end win at any concurrency. Its apparent decode_agg gain was a metric
artifact (faster per-step decode bought by starving prefill).

Flip the s3_enabled default so an unset LLAMA_PAGED_DECODE_STABLE means OFF; the
mechanism stays available as an explicit opt-in (LLAMA_PAGED_DECODE_STABLE=1) for
decode-dominated, low-arrival traffic where TTFT is not a concern. The default now
prefers prompt prefill admission for good TTFT. S1 (patch 0040) keeps shipping
default-on; only S3's default changes.

Re-exports patch 0041 (change folded into its source commit) and updates the
README 0041 row plus the decode-serving narrative to record the A/B finding.

Greedy md5 gate unchanged (single-sequence llama-completion path, not
update_slots): paged MoE 8cb0ce23, dense 5951a5b4.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 05:00:11 +00:00
Ettore Di Giacinto
b028c81eda docs(paged): record padded/fixed-slot decode shape as tested-and-rejected
The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push
serving graph reuse from ~72% toward ~100%) was implemented in an isolated
worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10.
Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving
throughput at every concurrency and does not close the vLLM gap.

Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step
(n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so
n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a
release()-side guard keeps a finished slot warm under padding. Each dummy is its
own sequence (private recurrent state, per-stream paged attention, logits
discarded), so it cannot perturb a real stream.

Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The
literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA
decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF
diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed:
ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly
(MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing.
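
One way to read the inertness gate, as a sketch only. The exact metric behind
"prefix-agreement" is not spelled out here, so this assumes it is the shared
prefix length divided by the longer stream length, averaged over streams; the
function names and the >= comparison are likewise assumptions.

```python
def prefix_agreement(a, b):
    """Fraction of the longer token stream covered by the common prefix."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / longest


def mean_agreement(run_a, run_b):
    """Average prefix agreement over paired streams of two runs."""
    scores = [prefix_agreement(a, b) for a, b in zip(run_a, run_b)]
    return sum(scores) / len(scores)


def is_inert(on_vs_off, off_vs_off):
    """Gate: padding adds no divergence beyond the run-to-run noise floor."""
    return on_vs_off >= off_vs_off
```

Under this reading, the recorded 0.940/0.940 (MoE) and 0.812/0.812 (dense)
pairs both pass, since ON-vs-OFF matches the OFF-vs-OFF floor.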

Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16
/ PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered
aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD
(reuse 38%) 558.0.

Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound
- baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys
~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding
adds dummy-row compute proportional to pad_width - real_load, catastrophic at low
load; (3) in continuous serving padding cannot hold a constant width (perpetual
prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink
padding prevents is itself a throughput win in a compute-bound regime. The
residual burst gap is GPU-compute, which a host-side reuse lever cannot close.
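
Point (2) as arithmetic, a sketch under assumed numbers: the pad width of 128
below is illustrative, not a measured configuration, and the helper names are
hypothetical.

```python
def dummy_rows(pad_width, real_load):
    """Inert decode rows added to hold the batch at pad_width."""
    return max(pad_width - real_load, 0)


def wasted_fraction(pad_width, real_load):
    """Share of per-step decode rows that are padding."""
    return dummy_rows(pad_width, real_load) / pad_width


# Illustrative pad width; low real load wastes most of the step.
for load in (8, 64, 128):
    print(load, dummy_rows(128, load), f"{wasted_fraction(128, load):.2%}")
```

The wasted share shrinks to zero only when every slot carries a real stream,
which matches the observation that the cost is worst at low load.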

Patch series unchanged: this rejected lever is NOT added to patches/paged/.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 20:47:43 +00:00
Ettore Di Giacinto
2fa8ef8fc5 fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 on under paged KV
FIX A (patch 0031 compose break): the chunked GDN prefill patch carried
'#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but
those were introduced by the dropped bf16-tau patch 0026, so on the
bf16-tau-free 0001-0030 base only '#include <cstdlib>' is present and 'git
apply' failed. The same 0026 drop also shifted 0031's later hunks off their
context (the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template
params, and the GDN_LAUNCH_ARGS list). Regenerated 0031 against a fresh
pin(0ed235ea) + 0001-0030 tree: the chunked kernel now SELF-PROVIDES the
cuda_bf16.h / type_traits includes (adds them, plus the climits it needs for
INT_MAX) and the dispatch guard is the 2-param 'if constexpr (!KDA &&
!keep_rs_t)' form. Behaviour is unchanged: 0031 stays opt-in, default OFF
(GDN_CHUNK_MIN), a recorded negative. The full 0001-0042 series now applies
clean on 0ed235ea ('git apply --check' green for every patch).

FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted OFF.
Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set), still
overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side change in
update_slots(); re-exported from the dev tree, README 0041 row updated to match.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 19:37:05 +00:00
Ettore Di Giacinto
d706980c2b feat(paged): close the continuous-serving decode gap (S1+S3, patches 0040/0041)
Add the two decode-serving graph-reuse levers (validated on GB10) that close the
host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in real
continuous serving while tying it in static batched-bench).

- 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode
  llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml
  graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape
  can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5
  byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%.
- 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode
  steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse.
  Default-off policy on top of 0016; bit-exact (per-stream independent).
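
The bucketing idea behind 0040, sketched in Python. Which dimension the patch
buckets is not stated in this message, so the single integer "shape" below is an
assumption, as are the names bucket and can_reuse_bucketed; the real check lives
in the C++ can_reuse override.

```python
BUCKET = 256  # bucket width named in the 0040 description


def bucket(n, size=BUCKET):
    """Round n up to the next multiple of size."""
    return -(-n // size) * size


def can_reuse_bucketed(prev_shape, next_shape):
    """Allow graph reuse when both shapes land in the same bucket."""
    return bucket(prev_shape) == bucket(next_shape)
```

A shape that grows by one element per decode step then crosses a bucket
boundary only once every 256 steps, instead of invalidating the graph on every
step.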

S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph
reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq
median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%);
S3 is the multiplier. S2 (double-buffer set_inputs) dropped: Phase-0 put
set_inputs at ~0.05 ms/step, so it has nothing to recover. README patch table +
DECODE_SERVING_SCOPE.md updated with results and the padded/fixed-slot follow-up.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 18:04:28 +00:00
Ettore Di Giacinto
000705321f feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off)
Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight
GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet)
tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the
validated, bit-exact-gated mechanism and records the honest GB10 result: it is a
regression, so it ships default-off (== stock), mirroring the patch-0017
default-off discipline.

Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense
M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that
dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays
FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda
gains the NVFP4 case.
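
The routing predicate, modelled in Python as a sketch. keeps_mmq is a
hypothetical stand-in and does not mirror the real ggml_cuda_should_use_mmq
signature; it assumes an unset LLAMA_FP4_PREFILL_M (None here) means the route
never fires, and that the comparison is strictly M > threshold as written above.

```python
def keeps_mmq(weight_type, is_blackwell, is_dense, m, fp4_prefill_m=None):
    """True keeps FP4-MMQ; False routes the GEMM to dequant->bf16 cuBLAS."""
    if fp4_prefill_m is None:
        # Default-off: behave like stock and always keep MMQ.
        return True
    route_to_cublas = (
        weight_type == "NVFP4"
        and is_blackwell
        and is_dense
        and m > fp4_prefill_m
    )
    return not route_to_cublas
```

With fp4_prefill_m=64 (the forced setting used for the gate), a dense NVFP4
GEMM at M=512 leaves MMQ, while decode-sized M and MoE GEMMs stay on it.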

Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID
806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4
large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for
short prefill (5951a5b4, decode untouched) and for a >threshold prefill that
exercises the bf16 path (5f3967df - no greedy argmax flips).

Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99
-> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42
(-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here:
FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step
weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate,
only partially amortizing as M grows. Default-off keeps stock S_PP (966.98).

Phase 2 (MoE grouped large-M) not implemented: it inherits the same
bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS
would regress for the same reason; a real prefill GEMM win needs option (b), a
native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 17:42:15 +00:00
Ettore Di Giacinto
4bdd26a7f0 docs(paged): scope tensor-core (mma) chunked GDN prefill kernel
Scopes the follow-up recorded by patch 0031 + README section 5: replace the
serial per-thread reductions of the chunked gated-DeltaNet prefill scan with
mma.sync tensor-core matmuls and lift the 1-block/SM occupancy ceiling, the
path that would beat the tuned sequential scan and close the GDN prefill
bucket toward vLLM's ~2.5x-cheaper chunked scan.

Confirmed (not assumed) the GB10/sm_121a tensor-core reality: consumer
Blackwell (SM12x) has NO wgmma (Hopper-only) and NO tcgen05/TMEM (sm_100a
data-center only); the usable path is the extended mma.sync family. So the
kernel is a warp-synchronous mma.sync + cp.async design (reusing ggml's
mma.cuh tiles), not a wgmma/TMA/tcgen05 design - patch 0031's 'mma/wgmma'
shorthand reads as mma only on this part.

Design: register-resident state frees the 64KB that forced C=16, admitting
C=64 under the 99KB shared opt-in; tf32 inputs / f32 accumulate with a 3xtf32
precision ladder; decays/gamma/beta stay f32 outside the mma to preserve the
bounded de-gating; A-inverse via blocked forward substitution (FLA UT
transform) with mma off-diagonal coupling. Mechanism: chunking cuts state-BW
~Cx, mma absorbs the O(C^2) intra-chunk flops the serial 0031 could not.
Honest: multi-week, high risk, no vendor kernel to route to on sm_121; gains
beat the sequential scan and close most of the bucket but not full sm_100-class
parity. KL-gate binding (NMSE likely fails at reduced precision). Phased:
re-profile -> two-product PoC -> full intra-chunk + C=64 + reg-state ->
occupancy/cp.async; opt-in default-OFF until A/B-proven.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 17:23:51 +00:00
Ettore Di Giacinto
9a28f23134 docs(paged): scope the continuous-serving decode gap (host-bound, design-only)
Add DECODE_SERVING_SCOPE.md: the decode KERNEL is at parity in static
batched-bench (~6.1 tok/s/seq ~ vLLM ~5.9 at npl128) but continuous serving
through llama-server update_slots() drops to ~3.7 (-39%) while vLLM sustains
~5.9. Scope shows the gap is the scheduler/host loop, not the kernel.

Root-cause hypothesis from source: continuous batching's batch-shape + seq-set
churn breaks BOTH graph-reuse layers every step - llama-context can_reuse/
allow_reuse (n_tokens + seq-set must match) and the CUDA ggml_cuda_graph
update_required memcmp (ne/nb/data ptrs) - so the GPU idles while the host
rebuilds + re-captures the graph and runs un-graphed set_inputs. vLLM avoids
this with padded/bucketed decode shapes + piecewise CUDA graphs. Documents that
the shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill
freezing + burst collapse, NOT decode-step graph reuse, which is why the serving
gap survives them; notes the README s.5 'lever 2 graph coverage FLAT' verdict was
static-regime and is reopened here for serving only.

Ranks host-side, bit-exact-safe levers: S1 bucketed/padded decode-step shape for
graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable
scheduling (extend 0016). Specifies a Phase-0 profile to confirm host-bound
before any build, reusing the in-tree [L5INSTR] hostproc/set_inputs/
get_block_table timers, the 'graphs reused' perf counter, LLAMA_GRAPH_REUSE_DISABLE
and nsys GPU-busy%, with vLLM ground-truthed at the same concurrency. No kernel
code; no GPU run in this pass.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 17:14:51 +00:00
Ettore Di Giacinto
e610347367 feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031)
Adds patch 0031 to the paged llama.cpp series: an FLA-style chunked
parallel-scan prefill kernel for gated DeltaNet (the upstream
gated_delta_net.cu "Add chunked kernel for even faster pre-fill" TODO).
Scope: non-KDA scalar gate, f32 state, final-state-only, homogeneous.

Bit-exact-benign (NEW per-path): test-backend-ops GATED_DELTA_NET 91/91 within
the 1e-7 NMSE gate vs the CPU reference (patch adds 8 S_v=128 prefill cases:
exact-multiple / tail / multi-seq / GQA / permuted); numpy prototype confirms
f32 chunked-vs-sequential NMSE ~1e-13.

OPT-IN, default OFF: GB10's 99KB dynamic-smem opt-in forces C=16 (the 128x128
f32 state is 64KB of the all-shared layout), pinning the kernel to 1 block/SM
with serial dk-reductions. Measured ~761 t/s chunked vs ~971 t/s sequential
(~22% slower) on q36-27b-nvfp4 prefill, so it defaults OFF (enable with
GDN_CHUNK_MIN=<n>); the backend default is regression-free. Beating the
84.7%-of-peak sequential scan needs tensor-core matmuls / register-resident
state with larger chunks (recorded in README section 5).
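
The shared-memory budget and the slowdown above, re-derived as a quick check.
This assumes KB means 1024 bytes; the variable names are introduced here.

```python
S_V = 128        # state is S_v x S_v per head
BYTES_F32 = 4
SMEM_OPT_IN_KIB = 99

state_bytes = S_V * S_V * BYTES_F32       # f32 state footprint in bytes
state_kib = state_bytes / 1024            # the 64KB quoted above
left_kib = SMEM_OPT_IN_KIB - state_kib    # what remains for chunk buffers

# Throughput figures quoted above (t/s)
chunked, sequential = 761.0, 971.0
slowdown_pct = (1 - chunked / sequential) * 100

print(state_kib, left_kib, round(slowdown_pct, 1))
```

The state alone takes roughly two thirds of the opt-in budget, which is why the
chunk size is pinned small in the all-shared layout.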

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 17:09:38 +00:00
Ettore Di Giacinto
11128cb080 docs(paged): scope the large-M NVFP4 prefill GEMM lever (design only)
Design + plan for the #1 prefill lever: NVFP4 weight GEMM at large M, where
MMQ (decode/M<=128-tuned, 1 CTA/SM, 128-col tile cap) is ~3.4x slower than
vLLM's marlin/cutlass large-M path (~51% of the prefill gap).

Recommends (a) dequant->bf16 cuBLAS routed by an M-threshold (dense first,
MoE grouped-cuBLAS second); rejects (b) a from-scratch Marlin/FP4 kernel as a
multi-week project. Key enabling finding: NVFP4->bf16 dequant kernels already
exist, and NVFP4 is currently force-excluded from the tensor-core cuBLAS path
(falls to f32 Sgemm) - relaxing that one guard is the pivot. Honest: bf16-cuBLAS
banks ~60-75% of the GEMM gap, not full 68us/tok parity (bf16 TC peak ~half FP4).

Design only - no kernel, no GPU run.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
2026-06-28 16:42:23 +00:00
Ettore Di Giacinto
4cd90bfae9 paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)
The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is
removed from the llama-cpp-localai-paged patch series. Clean re-measurement after
the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16
(tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs
780.0 t/s. The mode engages but adds zero speed because it is subsumed by the
fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau
was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and
extra CUDA template-instantiation compile cost with no offsetting benefit.

Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only
mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn,
which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025,
0028-0030) applies clean with git apply --check against the pin
0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob
(patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are
already absent).
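
A small sketch of why the glob tolerates gaps: it only sees the files that
exist. The file names below are hypothetical placeholders (only the 4-digit
prefixes come from this message), and series_gaps is a helper introduced for
illustration.

```python
import re


def series_gaps(filenames):
    """Return (sorted patch numbers, missing numbers) for NNNN-*.patch names."""
    nums = sorted(
        int(m.group(1))
        for f in filenames
        if (m := re.match(r"(\d{4})-.*\.patch$", f))
    )
    present = set(nums)
    gaps = [n for n in range(nums[0], nums[-1] + 1) if n not in present]
    return nums, gaps


# Series after the drop: 0001-0025 and 0028-0030, with 0005 already absent.
names = [f"{n:04d}-example.patch" for n in range(1, 31) if n not in (5, 26, 27)]
nums, gaps = series_gaps(names)
print(len(nums), gaps)
```

Under these assumptions the series holds 27 patches with gaps at 0005, 0026 and
0027, and none of the gaps needs special handling.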

Removed:
- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared
  grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no
  longer reads)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows
  (README + final_benchmark.csv), the ssm_bf16_tau option text in backend
  index.yaml, the gallery NOTE block, and the docs/features/backends.md mention.

The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat)
in the backend README section 5, the paged-backend agent guide, and the
vLLM-parity methodology, so it is not re-tried.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 16:06:06 +00:00
Ettore Di Giacinto
2c59805267 fix(paged): rpc cmake target renamed rpc-server -> ggml-rpc-server at pin 0ed235ea
llama.cpp renamed the RPC tool target (tools/rpc/CMakeLists.txt: set(TARGET
ggml-rpc-server)) at the 0ed235ea pin. master already updated the stock
llama-cpp Makefile to match (--target ggml-rpc-server, cp bin/ggml-rpc-server);
the paged backend's separate Makefile copy was left stale and its -grpc (RPC)
variant failed with 'No rule to make target rpc-server' (grpc-server itself
built to 100%). Mirror the stock rename in the paged Makefile.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 11:10:07 +00:00
Ettore Di Giacinto
c51ff4cec9 docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm)
Add ACCELERATOR_PORTING_SCOPE.md, the umbrella scope for taking the paged
backend's accelerator-portable wins off the CUDA family. It builds on (does
not duplicate) UPSTREAM_LAYER2_SCOPE.md, which stays the GDN/SSM-fusion
detail (benefit #1), and adds:

- Benefit #2 (paged KV in-kernel block-table flash-attn read, 0009-0011):
  new per-backend feasibility from source analysis of the Metal/SYCL/Vulkan
  flash-attn kernels. SYCL EASY (near line-for-line CUDA mirror), Metal
  EASY-MEDIUM (decode already routes to the vec kernel), Vulkan MEDIUM (the
  fast coopmat2 NVIDIA decode path cannot do the indexed read; push-constants
  are full). Universal constraint: only the vec/scalar decode kernel admits
  the per-cell indexed read, so route block-table ops onto vec (as CUDA's
  0009-0010 dispatch guard already does) and leave the fast MM/coopmat2 path
  contiguous-only. This is the lever that flips paged KV from
  neutral-to-slightly-negative to non-negative off CUDA.
- Benefit #3 (decode-first scheduler, 0013/0016): confirmed a free portable
  win - host-side update_slots() policy, zero kernel work, runs on any
  accelerator as-is.
- Benefit #4 (NVFP4 FP4-MMA, 0017/0023/0025): out of scope (Blackwell only);
  flags the backend-agnostic analogues of the act-quant dedup and the
  graph-coverage lever without over-claiming a port.
- A ROCm note: ROCm rides the CUDA/HIP path (validate, don't re-port);
  FP4-MMA stays Blackwell-only.

Benefits #1 and #2 share the port shape and rank Metal->SYCL->Vulkan, so they
bundle into one per-backend PR behind a shared ops-first PR. Cross-link added
from UPSTREAM_LAYER2_SCOPE.md. All gates are test-backend-ops on-target (no
Metal/SYCL/Vulkan/ROCm hardware here).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 08:34:32 +00:00
Ettore Di Giacinto
ea72a56e2c Merge origin/master + pin-sync paged backend to 0ed235ea
master auto-bumped the stock llama-cpp pin 9d5d882d -> 0ed235ea and updated the
shared grpc-server.cpp. The paged backend's pin must track the stock pin (the
grpc-server.cpp is shared), so bump its LLAMA_VERSION to match. All 28 paged
patches apply clean on 0ed235ea (verified against a fresh upstream clone). The
bf16-tau state-serialization fix (patch 0026) is included. Bit-exact gate + full
grpc-server build verify on GPU/CI to follow.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 07:56:47 +00:00
Ettore Di Giacinto
1f3e5ba301 fix(paged): serialize both SSM partitions in hybrid bf16-tau state save/restore (patch 0026)
The opt-in ssm_bf16_tau hybrid mode splits a gated-DeltaNet layer's
recurrent SSM state into an f32 partition (s_l) and a bf16 partition
(s_l_bf16). The recurrent state serialization paths (state_write_data /
state_read_data) were never updated for the split: they read/wrote s_l
using the FULL hparams.n_embd_s() (S_v*S_v*H) row width, but a split
layer's s_l only holds S_v*S_v*n_f32, so the access overruns the smaller
tensor (a ggml_backend tensor read out of bounds), and the bf16
fast-head partition was never persisted at all.

This is what broke high-concurrency serving with --ssm-bf16-tau: the
server's context-checkpoint feature serializes per-sequence state via
state_seq_get_data. With a checkpoint enabled, even a single request
triggered the out-of-bounds read; at higher concurrency the cell range
starts at a higher base slot so the overrun reaches further (hard abort
in a debug build, silent state corruption then 1-token-then-EOS on
restore in a release build). The static batched-bench never exercises
save/restore so it did not catch it; the GDN decode kernel and per-head
partition offsets were already correct (decode with checkpoints disabled
is fine at N=8/16/32).

Fix: serialize the f32 partition and, when the layer is split, the bf16
partition right after it, each with its OWN row width (tensor ne[0]).
head_slot is rebuilt deterministically at load (same model + tau), so it
is not serialized. Non-split layers have ne[0] == n_embd_s() and no bf16
partition, so their on-disk format and behavior are byte-identical (the
default f32 path and the bit-exact gate are unaffected).
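
The width mismatch in numbers, as a sketch. It assumes the bf16 partition holds
the remaining H - n_f32 heads, and the head counts used below are hypothetical;
partition_row_widths is a name introduced here, not a function in the patch.

```python
def partition_row_widths(s_v, n_heads, n_f32):
    """Per-cell row widths (in elements) for the full, f32 and bf16 layouts."""
    full = s_v * s_v * n_heads                  # hparams.n_embd_s() row width
    f32_width = s_v * s_v * n_f32               # what a split s_l really holds
    bf16_width = s_v * s_v * (n_heads - n_f32)  # assumed: the remaining heads
    return full, f32_width, bf16_width


# Hypothetical split: 32 heads, 24 kept in f32.
full, f32_width, bf16_width = partition_row_widths(128, 32, 24)
overrun = full - f32_width  # elements read past s_l per row before the fix
print(full, f32_width, bf16_width, overrun)
```

Reading each partition with its own ne[0] makes the two widths sum to the full
row, and for a non-split layer the f32 width equals the full width, so the
format is unchanged there.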

Verified on GB10/DGX with Qwen3.6-35B-A3B-NVFP4 + --ssm-bf16-tau 64 via a
continuous-batching llama-server: with context checkpoints enabled, N=8,
N=16 and N=32 (slot reuse + restore) all now produce full coherent
128-token output and the server stays up; pre-fix the same config
aborted on the first checkpoint.

Assisted-by: Claude:claude-opus-4-8[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 07:47:17 +00:00
Ettore Di Giacinto
4da769c1ca paged headers: self-include <cstddef>/<cstdint> for size_t/uintN_t (fix amd64/non-arm64 build; compile-only)
Vendored paged headers used size_t / uintN_t without including <cstddef> /
<cstdint>. The arm64 DGX toolchain provides them transitively so the build
passed there, but amd64/older toolchains do not, failing the CI amd64 build one
header at a time ('size_t' does not name a type -> cascade).

paged-kv-manager.h was already fixed. This adds the missing includes to the
remaining vendored headers at the point each is created/rewritten in the patch
series so every src/paged*.h self-includes both:

  * paged-attn.h     (0003): add <cstddef> (had <cstdint>)
  * paged-alloc.h    (0007): add <cstddef> (had <cstdint>)
  * paged-prefix-api.h (0007): add <cstddef> + <cstdint> (had only llama.h)

The .cpp units include their own paged header, so they inherit the includes
transitively. Whole series still applies clean on the pinned llama.cpp.

Compile-only change: no runtime behavior change, bit-exactness unaffected.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 06:18:56 +00:00
Ettore Di Giacinto
23b11a5239 paged-kv-manager.h: add missing <cstddef> for size_t
Fixes cuda-13 amd64 / non-arm64 build where size_t was used without the
header (arm64 cuda-13 pulled it in transitively; amd64/cuda-12 toolchains
do not). Compile-only change, bit-exactness unaffected.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 04:09:16 +00:00
Ettore Di Giacinto
0b84fda496 docs(paged): add the bf16-tau opt-in line to the decode plots
Per request, the plots now show all four series: llama.cpp (standard), vLLM,
LocalAI's llama.cpp patches (bit-exact hero), and LocalAI's patches + bf16-tau
(opt-in ceiling, +3% to +17% over the patches, ahead of vLLM at every dense width
and MoE npl>=32). Subtitle flags bf16-tau as opt-in / not bit-exact.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:25:02 +00:00
Ettore Di Giacinto
1431f72b92 docs(paged): regenerate decode plots (3-way) from re-measured data + overview
Rebuild the two committed decode plots from the re-measured CSV and add a combined
overview. Three series per the comparison that matters: llama.cpp (standard) vs
vLLM vs LocalAI's llama.cpp patches; x-over-standard called out at npl128. bf16-tau
stays out of the plot (it remains in the CSV + the README table as the opt-in row).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:20:12 +00:00
Ettore Di Giacinto
3466094c68 docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau)
Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE
q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and
README section 4 carry a single consistent set of llama numbers with all three
configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin
  9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce
  stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact,
  ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both
models at all four widths (the prior CSV had no stock and no bf16-tau rows).
peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and
the bench does not print it, so per-run peak could not be captured this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in
bf16-tau adds a further +3% to +17% on top of patched (growing with width).
vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:05:59 +00:00
Ettore Di Giacinto
ed5eb705c7 docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7
The paged backend's llama.cpp pin was reverted from c299a92c back to
9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the
reverted sync) is dead weight. The pin-sync PROCESS stays documented in
the three live places: the Makefile comment, README section 7 (Pin +
maintenance policy), and .agents/llama-cpp-localai-paged-backend.md.

Delete the doc and repoint every reference to it (Makefile, README,
.agents, canary script + workflow) at README section 7. No functional
paths change: the canary's patches-dir glob (patches/paged/0*.patch)
is untouched.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 21:34:10 +00:00
Ettore Di Giacinto
53f66a6f03 fix(paged): revert pin to 9d5d882d (== stock); c299a92c broke grpc-server link
The c299a92c bump diverged 23 commits ahead of the stock llama-cpp pin.
grpc-server.cpp is SHARED with the stock backend and tracks the stock pin;
c299a92c's upstream server-API refactor pulled stream_* helpers into the headers
grpc-server.cpp includes, whose definitions the stock-aligned build does not
compile -> every paged variant failed to LINK (undefined reference to
stream_aware_should_stop / stream_pipe_producer::cleanup /
stream_session_attach_pipe). The bump was greedy-md5 bit-exact, but the bit-exact
gate never exercises the full grpc-server build, so it slipped through.

Revert LLAMA_VERSION to 9d5d882d (== stock pin, where the patches are bit-exact
AND grpc-server links - the original DGX-proven baseline). Document the hard
constraint in the Makefile, README, PIN_SYNC record, and the .agents guide: the
paged pin must track the stock pin, and a pin-sync must pass the full CI
grpc-server build, not only the bit-exact gate.
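
The "paged pin must track the stock pin" constraint can be sketched as a small check. This is only an illustration: LLAMA_VERSION is the variable named above, but the assignment syntax in the real Makefiles and the helper names read_pin and pins_in_sync are assumptions.

```python
import re

# Assumption: each Makefile assigns the pin on its own line, using one of
# the usual make assignment operators (=, ?=, :=).
PIN_RE = re.compile(r"^LLAMA_VERSION\s*[?:]?=\s*(\S+)", re.MULTILINE)


def read_pin(makefile_text):
    """Return the first LLAMA_VERSION value found in a Makefile's text."""
    match = PIN_RE.search(makefile_text)
    if match is None:
        raise ValueError("no LLAMA_VERSION assignment found")
    return match.group(1)


def pins_in_sync(stock_makefile_text, paged_makefile_text):
    """True when the paged pin equals the stock pin."""
    return read_pin(stock_makefile_text) == read_pin(paged_makefile_text)
```

A check like this only covers the pin equality; the full CI grpc-server build is still the real gate for a pin-sync.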

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 20:28:28 +00:00
Ettore Di Giacinto
08b754f910 chore(paged): keep patches/ patch-only; README to backend root, docs to docs/
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.

Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
  final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)

Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)

Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.

The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 13:20:05 +00:00
Ettore Di Giacinto
a4e730979d feat(paged): restrict llama-cpp-localai-paged to CUDA-only build targets
The paged backend previously built for cublas/cuda, cpu, vulkan, sycl,
hipblas and darwin/metal. On non-CUDA targets the patchset's wins are inert:
the GDN fusions are gated off (patch 0030) and NVFP4 falls back to dequant,
so the backend is neutral-to-negative there (README section 4c). The
darwin grpc-server link also fails on undefined upstream server symbols,
turning CI red. Off-CUDA the backend is both broken and pointless, so ship
CUDA-only.

- backend-matrix.yml: drop the hipblas, sycl f32/f16, cpu amd64/arm64,
  vulkan amd64/arm64 and metal-darwin rows for this backend; keep the
  four cublas rows (cuda-12, cuda-13, nvidia-l4t cuda-12 and cuda-13).
- index.yaml: meta-backend (and -development) capabilities are now
  CUDA-only with default pointing at cuda12 (mirrors faster-qwen3-tts);
  removed the orphaned cpu/rocm/sycl/vulkan/metal variant entries.
- Removed the now-unused darwin build script and its Makefile target /
  .NOTPARALLEL entry / backend_build_darwin.yml step.
- Documented the CUDA-only build coverage in the patch README and plan.

Non-CUDA users should use the stock llama-cpp backend.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 12:29:15 +00:00
Ettore Di Giacinto
9115c2c52c docs(paged): correct Vulkan/SYCL note (GDN op IS upstream) + CUDA-only rationale
The gated-DeltaNet + SSM_CONV ops have upstream Metal/Vulkan/SYCL kernels, so the
Qwen3.6 hybrids run there (non-fused) - the earlier 'no Vulkan kernel' note was
wrong. The patchset's fusions are gated off on non-CUDA backends, so the backend
ships CUDA-only; non-CUDA users use stock llama-cpp.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 12:18:11 +00:00
Ettore Di Giacinto
984c8fcbea docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Metal/Vulkan/SYCL)
Source-only analysis of what it would take to give the gated-DeltaNet decode
fusions (0018 in-place state write-back, 0019 fused recurrent-state gather,
0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on
the non-CUDA compute backends, so the patch-series decode win extends past
CUDA-family hardware.

Key findings:
- The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist
  upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is
  stale). The Qwen3.6 hybrids run on all three today via the non-fused path;
  Layer-2 is the decode SPEEDUP, not enabling the model to run.
- Per backend the new work is only the FUSION plumbing: redirect the GDN state
  write (in-place), add the ids read, write one new conv-update kernel + its
  ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan)
  pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and
  test-backend-ops cases are shared and already done.
- Bit-exactness is feasible per backend by construction (the fusions redirect
  addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU)
  is the gate.
- The 0030 name allow-list should become capability-driven (make supports_op
  authoritative for the discriminated src slots).
- Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup =
  simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to
  author), then Vulkan (widest hardware reach but the shader-gen + variant
  matrix + subgroup variance make it the capstone).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 12:11:24 +00:00
Ettore Di Giacinto
4a9a1dd247 docs(paged): Mac stock-vs-patched bench + Vulkan note + cross-backend learnings
Section 4(c): real Apple M4/Metal numbers (Qwen3-8B Q4_K_M, stock vs patched) -
patchset is neutral-to-slightly-negative on Metal (the in-kernel block-table read
is CUDA-only; NVFP4/GDN-fusions inert), so prefer stock llama-cpp on Apple Silicon.
Vulkan: same picture, worse (no upstream GDN op). Section 6: cross-backend learnings
+ upstream candidates (the GDN decode-plumbing fusions are the portable, bit-exact,
CPU-mirrored win worth upstreaming).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 11:05:37 +00:00
Ettore Di Giacinto
78fac9a28f refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series
Move ALL paged-attention content out of the stock backend/cpp/llama-cpp
backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is
pure upstream llama.cpp and the paged backend owns and applies its own vendored
patch series.

- Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/
  (kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen,
  its own 0001-0002 patches, dense-era design docs, tests). Zero references
  repo-wide.
- Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README
  + 3 operational docs, plus the kernel/ scaffold patch and the top-level paged
  README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock
  backend keeps no patches/ dir; it had no non-paged base patches.
- Purify the stock backend: remove the LLAMA_PAGED make variable, the
  patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh;
  remove the paged-series handling from prepare.sh. The stock llama.cpp target
  now only clones the pin and applies its own (currently empty) base patches/
  series. The runtime paged option hooks in the shared grpc-server.cpp are
  untouched (inert without the patches).
- The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto
  each freshly cloned tree via strict git apply (apply-paged-patches), after the
  copied stock infra clones the pin and applies base patches.
- Repoint every reference to the old patches/paged path: the upstream canary
  workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs,
  backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and
  the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on
  build-toggle from comments.

Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to
a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed
canary apply script resolves and applies the series end to end.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 11:01:22 +00:00
Ettore Di Giacinto
a5a5b2ad80 feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verified)
Advance the paged-attention backend's owned llama.cpp pin by 23 upstream
commits. The shipped source-only patch series (0001-0030, 28 patches) applies
strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export
needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121):

- md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense
  non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all
  match the established baselines.
- test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16,
  SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146,
  MUL_MAT_ID 806/806; all OK.

The 23-commit upstream jump did not change our decode output. The .patch files
are kept byte-identical (they already apply strict-clean at the new pin); only
the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change.
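
As a rough sketch of the greedy-md5 comparison described above, the check below hashes each path's decode output and compares the 8-character prefix against the baselines quoted in this message. What exactly the real gate hashes is not stated here, so the raw output bytes, the path labels and the helper names are assumptions.

```python
import hashlib

# Short md5 prefixes quoted in the commit message above. The dictionary
# keys are illustrative labels, not names used by the real gate.
BASELINES = {
    "dense-non-paged": "5951a5b4",
    "dense-paged": "5951a5b4",
    "moe-non-paged": "07db32c2",
    "moe-paged": "8cb0ce23",
}


def md5_prefix(output, length=8):
    """Return the first `length` hex digits of the md5 of `output` bytes."""
    return hashlib.md5(output).hexdigest()[:length]


def mismatches(outputs, baselines=BASELINES):
    """List the paths whose output is missing or hashes off-baseline."""
    bad = []
    for path, expected in baselines.items():
        got = outputs.get(path)
        if got is None or md5_prefix(got) != expected:
            bad.append(path)
    return bad
```

An empty list from mismatches would correspond to a GREEN gate on every listed path.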

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 08:57:33 +00:00
Ettore Di Giacinto
2bee7a5ab1 ci(paged): add early-warning canary for vendored llama.cpp paged patches
The paged backend (backend/cpp/llama-cpp-localai-paged) pins its own verified
llama.cpp tip and is excluded from the nightly auto-bumper so a naive bump can
never silently break the shipped build. That exclusion also removed the early
warning of upstream drift. This restores the signal without touching the pin.

Add .github/workflows/llama-cpp-paged-canary.yml (weekly + workflow_dispatch):

- apply-check job (ubuntu-latest, toolchain-free): resolve the latest
  ggml-org/llama.cpp master tip, shallow-checkout it, and apply the full paged
  series 0001-0030 in order with the build's own git-apply method via the new
  shared helper .github/scripts/paged-canary-apply.sh. Red on any apply break.
- compile job (needs apply-check): on the exact tip it validated, build the
  paged backend (cublas) inside the same base-grpc-cuda-12 toolchain and the
  same `make grpc-server` target the shipped build uses, so a red means upstream
  drift, not toolchain noise. nvcc compiles the kernels with no GPU present.

Red here = run a PIN_SYNC (rebase + bit-exact gate + re-export), then bump the
paged Makefile pin. The canary is signal-only: it opens no PR and never moves
the pin, so the shipped build and the dep-bump PRs stay green regardless. It is
fully separate from bump_deps.

The lone pre-existing quirk in the series is handled path-scoped. Patch 0019
carries a stray modify hunk against the dev-only doc SSM_DECODE_FIX_RESULTS.md,
which is absent from any clean upstream checkout; git apply is atomic, so it
rejects the whole patch and the failure cascades to 0021/0022/0026/0028. The
helper excludes only that dev-doc and still applies 0019's real code hunks
atomically, mirroring prepare.sh's tolerance, so the quirk never false-positives
the canary but a genuine code break in 0019 still turns it red.
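
A minimal sketch of the path-scoped exclusion, assuming the helper relies on git apply's --exclude option. The exclude pattern and the function names are hypothetical; the actual paged-canary-apply.sh may be structured differently.

```python
import fnmatch

# Hypothetical pattern for the dev-only doc named above; its real path
# inside patch 0019 is not given here.
EXCLUDED_DEV_DOC = "*SSM_DECODE_FIX_RESULTS.md"


def ordered_series(file_names):
    """Return the names matching 0*.patch, sorted into apply order."""
    return sorted(n for n in file_names if fnmatch.fnmatch(n, "0*.patch"))


def apply_command(patch_name, check_only=False):
    """Build one `git apply` invocation that skips only the dev doc.

    The rest of the patch is still applied atomically, so a real code
    break in any hunk makes git apply exit non-zero.
    """
    cmd = ["git", "apply", "--exclude=" + EXCLUDED_DEV_DOC]
    if check_only:
        cmd.append("--check")  # validate without touching the tree
    cmd.append(patch_name)
    return cmd
```

Each command list could be handed to subprocess.run(cmd, check=True) inside the upstream checkout, stopping at the first failure.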

Point the existing pin comments in backend/cpp/llama-cpp-localai-paged/Makefile
and .github/workflows/bump_deps.yaml at this canary as the drift signal, and
document it in the PIN_SYNC doc: canary red -> do a pin-sync.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 08:29:09 +00:00
Ettore Di Giacinto
e160041f05 chore(paged): decouple paged llama.cpp pin from the nightly auto-bumper
The llama-cpp-localai-paged backend reused backend/cpp/llama-cpp's LLAMA_VERSION,
which .github/workflows/bump_deps.yaml auto-bumps nightly to the latest
ggml-org/llama.cpp master tip. The stock backend is patch-free so that bump is
safe, but the paged backend applies a vendored patch series
(backend/cpp/llama-cpp/patches/paged/) hand-verified bit-exact against ONE
specific tip. A naive bump moves the tip out from under the patches and breaks
'git apply' at build time - a dep-bump PR would go red (or, worse, the break
surfaces later in a release build).

Mirror the turboquant precedent: give the paged wrapper its OWN LLAMA_VERSION
pin (the verified 9d5d882d) and force it into every copied build via
LLAMA_VERSION=$(LLAMA_VERSION), so the nightly stock bump no longer drags the
paged build to an unverified tip. Unlike turboquant (whose fork branch carries
the patches and is safe to auto-bump), the paged series is vendored, so it gets
NO bump_deps.yaml entry: it is advanced only by the manual PIN_SYNC process.
Add cross-referencing comments in both Makefiles and bump_deps.yaml.

Also add PIN_BUMP_APPLY_CHECK.md: an apply-feasibility report for the latest tip
(c299a92c, 23 commits ahead). The full series applies CLEAN under 'git apply'
with only benign line offsets and zero conflicts; the lone failure (0019) is a
pre-existing stray dev-doc hunk, identical on the current pin, not a bump
regression.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 08:02:37 +00:00
Ettore Di Giacinto
167768cac3 feat(backend): llama-cpp-localai-paged variant + NVFP4 Qwen3.6 gallery
New backend = stock llama-cpp grpc-server + the paged patchset (forces LLAMA_PAGED=on),
shipped as its own meta-backend (mirrors turboquant, simpler: no fork pin, no
grpc-server patching - the paged runtime hooks already exist in grpc-server.cpp).
Stock llama-cpp untouched (LLAMA_PAGED?=on retained; the de-risk flip deferred for
sign-off). Gallery: qwen3.6-27b-nvfp4 (dense) + qwen3.6-35b-a3b-nvfp4 (MoE) with the
benchmark run config (paged_kv, max_batch_tokens, parallel, flash_attention, f16),
mudler/ GGUF uris (sha256 TODO until publish). Importer dropdown entry + tests.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 12:58:56 +00:00