docs(agents): add paged-backend maintenance + vLLM-parity methodology skills

Two .agents guides (indexed in AGENTS.md):

- llama-cpp-localai-paged-backend.md: what the CUDA-only paged backend is, the patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, the CUDA-only / stock-stays-pure invariants, and the Metal/SYCL/Vulkan follow-up scope.
- vllm-parity-methodology.md: the decode-parity playbook (bit-exact gating, profile-don't-assume, both-engine ground-truth, per-lever A/B, recording rejected levers, multi-agent GPU orchestration).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:57:14 -04:00 · 2026-06-27 12:58:01 +00:00
parent a4e730979d
commit db14006fcd
3 changed files with 183 additions and 0 deletions
--- a/.agents/llama-cpp-localai-paged-backend.md
+++ b/.agents/llama-cpp-localai-paged-backend.md
@@ -0,0 +1,93 @@
+# llama-cpp-localai-paged Backend (paged attention + Blackwell NVFP4 decode)
+
+`llama-cpp-localai-paged` is LocalAI's **CUDA-only** paged-attention variant of the
+llama.cpp backend. It targets high-concurrency decode for the Qwen3.6 hybrid
+gated-DeltaNet (SSM) models on Blackwell (GB10 / DGX Spark). It reuses the stock
+`llama-cpp` backend's sources and applies a vendored patch series on top at build
+time. It is **not** a fork, just a source-only `*.patch` stack plus one canonical doc.
+
+**Canonical reference:** `backend/cpp/llama-cpp-localai-paged/patches/paged/README.md`
+(architecture, the patch series 0001-0030, benchmarks, dev notes, generality,
+pin/canary policy). Read it for any technical detail; this guide is the maintenance
+how-to.
+
+## Where things live
+
+- `backend/cpp/llama-cpp-localai-paged/Makefile` - the thin wrapper. It copies the
+  stock `backend/cpp/llama-cpp/` build infra into a build dir, clones llama.cpp at
+  this backend's **own** pin (`LLAMA_VERSION`), applies the paged series via the
+  `apply-paged-patches` define (strict `git apply`), then builds `grpc-server`.
+- `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch`
+  series (0001-0030) + the README + operational docs (`PIN_SYNC_*.md`,
+  `PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`).
+- `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh`
+  - the CUDA build entry points.
+- `backend/cpp/llama-cpp/` - the **stock** backend, pure upstream. It carries no
+  paged patches.
+
+## Invariants (do not break these)
+
+- **Stock stays pure.** The paged patches live ONLY in this backend. Never add a
+  `patches/paged/` dir or `LLAMA_PAGED` logic to `backend/cpp/llama-cpp/`.
+- **CUDA-only.** Ship cublas/cuda targets only. Off-CUDA the fusions are gated off
+  (patch 0030) and NVFP4 falls back to dequant, so the backend is
+  neutral-to-slightly-negative there; non-CUDA users should use the stock
+  `llama-cpp`. Do not add cpu/vulkan/sycl/metal rows for this backend in
+  `.github/backend-matrix.yml`. (Those builds also fail to link `grpc-server` on
+  darwin/arm64 against upstream `stream_*` server symbols - another reason to stay CUDA-only.)
+- **Source-only patches.** A `.patch` may touch only llama.cpp source - never a
+  dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A
+  stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.)
+- **Bit-exact by default.** Every shipped patch keeps output byte-identical to the
+  f32 baseline. The one opt-in precision trade (`ssm_bf16_tau`, patch 0026)
+  defaults off; never put it in a recommended/gallery config.
+
+## Maintaining the pin against new llama.cpp
+
+The pin (`LLAMA_VERSION` in the wrapper Makefile) is advanced ONLY by the manual
+pin-sync. It is deliberately **excluded from the nightly auto-bumper**
+(`bump_deps.yaml`): a naive bump would shift the tree out from under the patches
+and break `git apply` at build time.
+
+1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml`
+   runs weekly: it applies + builds the series against the latest upstream tip and
+   goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync.
+2. **The pin-sync** (recorded in `PIN_SYNC_*.md`): rebase the series onto the new
+   tip (resolve conflicts; re-export **source-only** with a pathspec like
+   `-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box,
+   pass the bit-exact gate on **every** path + `test-backend-ops`, then bump
+   `LLAMA_VERSION`. The 9d5d882d -> c299a92c bump (23 upstream commits) needed zero
+   patch changes; bumps are usually offset-tolerant (git apply absorbs offsets).
+
+## The bit-exact gate (run for every change)
+
+- greedy md5: `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1 </dev/null | md5sum`.
+  Prefix paged paths with `LLAMA_KV_PAGED=1` (plus `LLAMA_MOE_FORCE_GRAPHS=1` for
+  paged MoE). The md5 must match the recorded baseline. Redirect stdin from
+  `/dev/null`, or `llama-completion` hangs in conversation mode.
+- `test-backend-ops` (CUDA0 vs CPU oracle) for every touched op (`SSM_CONV*`,
+  `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
+- **The gate is per-path.** The paged-MoE md5 differs from the non-paged md5 - a
+  benign, KL-validated FP-accumulation-order difference (see `PAGED_BITEXACT_NOTE.md`).
+  Compare a paged-MoE change to the **paged** reference, not the non-paged one.
+
+## Encapsulating your work
+
+- When you change a patch, regenerate the `.patch` (source-only) and keep the dev
+  tree and this worktree byte-identical. Commit both with sign-off.
+- New optimization -> next patch number (gaps 0005/0027 are intentional). Update
+  the README's patch table and dev notes - keep the README the single doc; do not
+  scatter `*_RESULTS.md` files.
+- Record rejected/flat levers in the README too (they stop the next person from
+  re-running dead ends).
+
+## Follow-ups (Metal / SYCL / Vulkan)
+
+The decode fusions are implemented for **CUDA + CPU only**. The base
+gated-DeltaNet + SSM_CONV ops already exist upstream on Metal, SYCL, and Vulkan,
+so the models **run** there via the non-fused path - what is missing is the
+fusion speedup. Porting it (strictly mirroring the CUDA kernels, since we have no
+Metal/SYCL/Vulkan hardware to test on here) is scoped in `UPSTREAM_LAYER2_SCOPE.md`
+(recommended order: Metal, then SYCL, then Vulkan; ops-first upstream PR, then one
+PR per backend, each gated by `test-backend-ops` on the target hardware). The
+methodology for that work is in [.agents/vllm-parity-methodology.md](vllm-parity-methodology.md).
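Reviewer note (not part of the commit): the "Source-only patches" invariant in the guide above could be checked mechanically before a patch is committed. The following is a minimal Python sketch, assuming the allowed path prefixes are the ones listed in the pin-sync pathspec (`src/ ggml/ common/ include/ tools/ tests/ cmake/`). The function name `non_source_paths` is hypothetical, and the sketch only inspects file headers; strict `git apply` on a clean checkout remains the real gate.

```python
import re

# Prefixes copied from the pin-sync pathspec quoted in the guide. Treating them
# as an allow-list is an assumption of this sketch, not a rule stated in the commit.
ALLOWED_PREFIXES = ("src/", "ggml/", "common/", "include/", "tools/", "tests/", "cmake/")

# Matches the per-file header line that `git format-patch` / `git diff` emit.
# Paths containing spaces (which git quotes) are not handled here.
_FILE_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def non_source_paths(patch_text: str) -> list[str]:
    """Return the paths a patch touches that fall outside the source-only scope.

    A path is an offender if it is a Markdown file or if it does not start
    with one of ALLOWED_PREFIXES. An empty list means the patch looks clean.
    """
    offenders: list[str] = []
    for old_path, new_path in _FILE_HEADER.findall(patch_text):
        for path in (old_path, new_path):
            is_doc = path.lower().endswith(".md")
            in_scope = path.startswith(ALLOWED_PREFIXES)
            if (is_doc or not in_scope) and path not in offenders:
                offenders.append(path)
    return offenders
```

Run over each file in `patches/paged/`, a non-empty result would flag the kind of stray `*.md` hunk that the guide says once broke the CI build.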
--- a/.agents/vllm-parity-methodology.md
+++ b/.agents/vllm-parity-methodology.md
@@ -0,0 +1,87 @@
+# Methodology: Closing the vLLM Decode-Throughput Gap in llama.cpp
+
+This is the playbook that took the paged backend
+([.agents/llama-cpp-localai-paged-backend.md](llama-cpp-localai-paged-backend.md))
+from ~38% of vLLM decode throughput to **parity-to-ahead on dense** (and a
+proven, honest ceiling on MoE) on GB10. Use it for any "make llama.cpp match or
+beat engine X on accelerator Y" effort. The *levers* are model- and
+hardware-specific; the *discipline* below is not. The worked example, with all
+numbers, is the paged backend README.
+
+## The core loop
+
+1. **Establish a bit-exact baseline and gate FIRST.** Record the greedy md5 (per
+   path) and an f32 reference. Every optimization must stay byte-identical to it -
+   or ship as an explicit, default-off precision opt-in. This is what lets you
+   optimize aggressively without silently regressing quality. Gate two ways:
+   greedy md5, and `test-backend-ops` against the CPU oracle.
+
+2. **Profile - do not assume.** Profile the steady-state decode step with nsys,
+   broken down per *kernel* AND per *memcpy*. Find the dominant cost. "It's the
+   GEMM" was wrong here: on hybrid gated-DeltaNet models the bottleneck was the
+   recurrent-state **plumbing** (state memcpy + gathers, ~67% of the step), not
+   the weight GEMM. Also sanity-check GPU-busy %: an early "low utilization" reading
+   was a profiling-window artifact (decode was 96-99% GPU-busy), not real idle.
+
+3. **Ground-truth BOTH engines.** Decompose *your* decode step AND the
+   competitor's, side by side, per bucket, and compute the per-bucket delta. This
+   tells you WHERE the gap actually is - not where you would guess. It overturned
+   premises here: e.g. vLLM does NOT run the GDN/attn projections as NVFP4 (it
+   keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.
+
+4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
+   same-session A/B bench (patched-off vs patched-on, identical harness = an exact
+   measure). Bank only what lifts AND gates. **Record every rejected or flat lever
+   with the reason** - over time this is the most valuable part: it stops the next
+   person from re-running dead ends.
+
+5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
+   lever measured, not assumed). What remains is physical - the memory-bandwidth
+   floor, the irreducible serial-SSM host loop (sampling can't start until logits
+   land). Name it; do not claim more than you measured.
+
+## Hard rules learned
+
+- **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
+  (`llama-batched-bench`) is exact - lead with it. Cross-engine "% of vLLM"
+  (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
+  and config (context length alone moved the MoE figure between 76% and 86%).
+- **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%
+  but failed the f32 KL gate (vLLM keeps f32 too), so it ships default-off opt-in -
+  never in a recommended config.
+- **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the
+  critical path benches FLAT (the freed time becomes idle). Quantizing the bf16
+  projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason.
+  Always measure before believing; a plausible mechanism is not a result.
+- **The gate can be per-path.** Paged vs non-paged attention legitimately produces
+  different (equivalent) FP-reduction orders; validate the difference is benign
+  (KLD to f32) and then gate each path against its own reference.
+
+## Orchestration (multi-agent)
+
+- **One GPU profiler/bencher at a time** (the GPU-contention rule). Parallel
+  design/analysis/read agents are fine; concurrent GPU benches pollute each other's
+  numbers.
+- **Adversarial verify.** Before banking a finding, spawn skeptics prompted to
+  *refute* it; majority-refute kills it. Prevents plausible-but-wrong results.
+- **Anti-punt.** Use foreground, blocking ssh loops with short benches and a
+  progress-file checkpoint. Agents that background work and "wait for the monitor
+  event" stall - forbid that pattern.
+- **GPU coexistence.** On a shared host, stop the user's deployments for a clean
+  benchmark window (with their OK) and ALWAYS restore them (wrap the bench so a
+  failure cannot strand them).
+
+## What generalizes (and what doesn't)
+
+The *speedups* may be hardware-specific (here: CUDA/Blackwell - the SSM fusions,
+NVFP4 FP4-MMA, the occupancy tune), which is why other accelerators did not
+benefit. But the *findings* often generalize and are worth upstreaming: the
+"decode is plumbing-bound, not GEMM-bound" insight and the bit-exact, CPU-mirrored
+fusion ops help any backend running these models. Separate "ship our tuned backend"
+from "upstream the portable op" - they are different deliverables.
+
+## The closing record
+
+Write up the result HONESTLY: the shipped wins, the rejected levers (with reasons),
+the structural ceiling, and the cross-backend / cross-quant generality. Negative
+results are as valuable as wins. The paged backend README is the template.
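Reviewer note (not part of the commit): the "per-lever discipline" and "the gate can be per-path" rules in the methodology above can be sketched as a small decision function. This is an illustrative Python sketch only: the names `LeverResult`, `decide`, and `min_lift` are hypothetical, and the 1% noise floor is an assumed placeholder, not a figure from the guide. It banks a lever only if the greedy output md5 matches the recorded baseline for that lever's own path and the same-session A/B shows a lift, and it returns a reason in every case so rejected and flat levers can be recorded too.

```python
import hashlib
from dataclasses import dataclass


def greedy_md5(output: bytes) -> str:
    """Hex md5 of the raw greedy-decode output (the digest `md5sum` prints)."""
    return hashlib.md5(output).hexdigest()


@dataclass
class LeverResult:
    name: str        # label for the candidate optimization
    path: str        # which path was exercised, e.g. "paged-moe" (label is illustrative)
    output: bytes    # greedy output captured with the lever enabled
    tps_off: float   # same-session bench throughput, lever off
    tps_on: float    # same-session bench throughput, lever on


def decide(
    result: LeverResult,
    baselines: dict[str, str],
    min_lift: float = 0.01,  # assumed noise floor; tune to the harness
) -> tuple[bool, str]:
    """Return (bank, reason). The reason is meant to be recorded either way."""
    reference = baselines.get(result.path)
    if reference is None:
        return False, f"no recorded baseline for path {result.path!r}"
    # Gate against the reference for this path, never another path's md5.
    if greedy_md5(result.output) != reference:
        return False, "failed bit-exact gate"
    lift = result.tps_on / result.tps_off - 1.0
    if lift < min_lift:
        return False, f"flat: {lift:+.1%} lift"
    return True, f"banked: {lift:+.1%} lift"
```

Keying the baselines by path mirrors the guide's point that a paged-MoE change is compared to the paged reference, and returning the reason alongside the verdict keeps the rejected-lever record from depending on anyone's memory.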
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -23,6 +23,8 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
+| [.agents/llama-cpp-localai-paged-backend.md](.agents/llama-cpp-localai-paged-backend.md) | Working on the CUDA-only paged-attention llama.cpp variant (Qwen3.6 hybrid-SSM / Blackwell NVFP4 decode) - patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, CUDA-only invariants, stock-stays-pure, Metal/SYCL/Vulkan follow-up scope |
+| [.agents/vllm-parity-methodology.md](.agents/vllm-parity-methodology.md) | The methodology for closing the vLLM decode-throughput gap in llama.cpp - bit-exact gating, profile-don't-assume, both-engine ground-truth, per-lever A/B discipline, recording rejected levers, multi-agent GPU orchestration |
 | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
 | [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
 | [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
@@ -37,6 +39,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]

 - **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
 - **Logging**: Use `github.com/mudler/xlog` (same API as slog)
+- **Paged llama.cpp backend**: `llama-cpp-localai-paged` is a CUDA-only variant that owns its own patch series + its own pinned llama.cpp (manual pin-sync, weekly canary); the stock `llama-cpp` backend stays patch-free. Read [.agents/llama-cpp-localai-paged-backend.md](.agents/llama-cpp-localai-paged-backend.md) before touching either, and [.agents/vllm-parity-methodology.md](.agents/vllm-parity-methodology.md) for the decode-parity methodology behind it.
 - **Go style**: Prefer `any` over `interface{}`
 - **Comments**: Explain *why*, not *what*
 - **Docs**: Update `docs/content/` when adding features or changing config