Mirror of https://github.com/mudler/LocalAI.git (synced 2026-06-27 09:57:14 -04:00)
docs(agents): add paged-backend maintenance + vLLM-parity methodology skills
Two .agents guides (indexed in AGENTS.md):

- llama-cpp-localai-paged-backend.md: what the CUDA-only paged backend is, the patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, the CUDA-only / stock-stays-pure invariants, and the Metal/SYCL/Vulkan follow-up scope.
- vllm-parity-methodology.md: the decode-parity playbook (bit-exact gating, profile-don't-assume, both-engine ground-truth, per-lever A/B, recording rejected levers, multi-agent GPU orchestration).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
93
.agents/llama-cpp-localai-paged-backend.md
Normal file
@@ -0,0 +1,93 @@
# llama-cpp-localai-paged Backend (paged attention + Blackwell NVFP4 decode)

`llama-cpp-localai-paged` is LocalAI's **CUDA-only** paged-attention variant of the
llama.cpp backend. It targets high-concurrency decode for the Qwen3.6 hybrid
gated-DeltaNet (SSM) models on Blackwell (GB10 / DGX Spark). It reuses the stock
`llama-cpp` backend's sources and applies a vendored patch series on top at build
time. It is **not** a fork: a source-only `*.patch` stack plus one canonical doc.

**Canonical reference:** `backend/cpp/llama-cpp-localai-paged/patches/paged/README.md`
(architecture, the patch series 0001-0030, benchmarks, dev notes, generality,
pin/canary policy). Read it for any technical detail; this guide is the maintenance
how-to.

## Where things live

- `backend/cpp/llama-cpp-localai-paged/Makefile` - the thin wrapper. It copies the
  stock `backend/cpp/llama-cpp/` build infra into a build dir, clones llama.cpp at
  this backend's **own** pin (`LLAMA_VERSION`), applies the paged series via the
  `apply-paged-patches` define (strict `git apply`), then builds `grpc-server`.
- `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch`
  series (0001-0030) + the README + operational docs (`PIN_SYNC_*.md`,
  `PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`).
- `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh` -
  the CUDA build entry points.
- `backend/cpp/llama-cpp/` - the **stock** backend, pure upstream. It carries no
  paged patches.

## Invariants (do not break these)

- **Stock stays pure.** The paged patches live ONLY in this backend. Never add a
  `patches/paged/` dir or `LLAMA_PAGED` logic to `backend/cpp/llama-cpp/`.
- **CUDA-only.** Ship cublas/cuda targets only. Off-CUDA the fusions are gated off
  (patch 0030) and NVFP4 falls back to dequant, so the backend is
  neutral-to-slightly-negative there - non-CUDA users use the stock `llama-cpp`. Do not add
  cpu/vulkan/sycl/metal rows for this backend in `.github/backend-matrix.yml`.
  (Those builds also fail to link `grpc-server` on darwin/arm64 against upstream
  `stream_*` server symbols - another reason it is CUDA-only.)
- **Source-only patches.** A `.patch` may touch only llama.cpp source - never a
  dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A
  stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.)
- **Bit-exact by default.** Every shipped patch is byte-identical to the f32
  baseline. The one opt-in precision trade (`ssm_bf16_tau`, patch 0026) defaults
  off; never put it in a recommended/gallery config.

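The source-only invariant lends itself to a cheap pre-commit check. The following is a minimal sketch, not part of the repository: the function name `non_source_paths` is hypothetical, and the allowed prefixes are assumed to match the pathspec this guide uses when re-exporting the series (`src/ ggml/ common/ include/ tools/ tests/ cmake/`). It only inspects `diff --git` headers, so paths that git quotes (for example, paths containing spaces) are not handled.

```python
import re

# Assumed allow-list: the pathspec this guide uses when re-exporting the series.
ALLOWED_PREFIXES = ("src/", "ggml/", "common/", "include/", "tools/", "tests/", "cmake/")

# Matches the "diff --git a/<path> b/<path>" header that starts each file in a git patch.
_DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def non_source_paths(patch_text: str) -> list[str]:
    """Return every path in the patch that is a Markdown doc or lies outside the source dirs."""
    offenders = set()
    for match in _DIFF_HEADER.finditer(patch_text):
        for path in (match.group(1), match.group(2)):
            is_doc = path.lower().endswith(".md")
            in_source = path.startswith(ALLOWED_PREFIXES)
            if is_doc or not in_source:
                offenders.add(path)
    return sorted(offenders)
```

An empty result means the patch touches only the listed source directories. It does not replace the strict `git apply` on a clean checkout, which remains the real gate.
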
## Maintaining the pin against new llama.cpp

The pin (`LLAMA_VERSION` in the wrapper Makefile) is advanced ONLY by the manual
pin-sync. It is deliberately **excluded from the nightly auto-bumper**
(`bump_deps.yaml`): a naive bump would shift the tree out from under the patches
and break `git apply` at build time.

1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml`
   runs weekly: it applies + builds the series against the latest upstream tip and
   goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync.
2. **The pin-sync** (recorded in `PIN_SYNC_*.md`): rebase the series onto the new
   tip (resolve conflicts; re-export **source-only** with a pathspec like
   `-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box,
   pass the bit-exact gate on **every** path + `test-backend-ops`, then bump
   `LLAMA_VERSION`. The 9d5d882d -> c299a92c bump (23 upstream commits) needed zero
   patch changes; bumps are usually offset-tolerant (git apply absorbs offsets).

## The bit-exact gate (run for every change)

- greedy md5: `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1 </dev/null | md5sum`,
  paged paths prefixed `LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged
  MoE). Must match the recorded baseline. Redirect stdin from `/dev/null` or
  `llama-completion` hangs in conversation mode.
- `test-backend-ops` (CUDA0 vs CPU oracle) for every touched op (`SSM_CONV*`,
  `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
- **The gate is per-path.** The paged-MoE md5 differs from the non-paged md5 - a
  benign, KL-validated FP-accumulation-order difference (see `PAGED_BITEXACT_NOTE.md`).
  Compare a paged-MoE change to the **paged** reference, not the non-paged one.

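The per-path comparison can be sketched as below. This is an illustration under stated assumptions, not the project's harness: the path labels (`non-paged`, `paged`, `paged-moe`) and the helper names `gate_command` and `check_gate` are hypothetical, while the command-line arguments and environment variables are the ones quoted in the list above.

```python
import hashlib

# Environment prefixes per path, taken from the gate description in this guide.
# The path labels themselves are hypothetical names chosen for this sketch.
PATH_ENV = {
    "non-paged": {},
    "paged": {"LLAMA_KV_PAGED": "1"},
    "paged-moe": {"LLAMA_KV_PAGED": "1", "LLAMA_MOE_FORCE_GRAPHS": "1"},
}


def gate_command(model: str) -> list[str]:
    """Argument vector for the greedy run; the caller must redirect stdin from /dev/null."""
    return [
        "llama-completion", "-m", model, "-ngl", "99", "-fa", "on",
        "-p", "The capital of France is", "-n", "48", "--temp", "0", "--seed", "1",
    ]


def check_gate(path: str, output: bytes, baselines: dict[str, str]) -> bool:
    """Compare the md5 of one path's output with that path's OWN recorded baseline."""
    if path not in PATH_ENV:
        raise KeyError(f"unknown path: {path}")
    return hashlib.md5(output).hexdigest() == baselines[path]
```

A caller would typically run the command with `subprocess.run(gate_command(model), env={**os.environ, **PATH_ENV[path]}, stdin=subprocess.DEVNULL, capture_output=True)` and pass the captured `stdout` to `check_gate`. Keying the baselines by path is the point of the design: a paged-MoE run is never compared against the non-paged digest.
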
## Encapsulating your work

- When you change a patch, regenerate the `.patch` (source-only) and keep the dev
  tree and this worktree byte-identical. Commit both with sign-off.
- New optimization -> next patch number (gaps 0005/0027 are intentional). Update
  the README's patch table and dev notes - keep the README the single doc; do not
  scatter `*_RESULTS.md` files.
- Record rejected/flat levers in the README too (they stop the next person from
  re-running dead ends).

## Follow-ups (Metal / SYCL / Vulkan)

The decode fusions are implemented for **CUDA + CPU only**. The base
gated-DeltaNet + SSM_CONV ops already exist upstream on Metal, SYCL, and Vulkan,
so the models **run** there via the non-fused path - what is missing is the
fusion speedup. Porting it (strictly mirroring the CUDA kernels, since we have no
Metal/SYCL/Vulkan hardware to test on here) is scoped in `UPSTREAM_LAYER2_SCOPE.md`
(recommended order: Metal, then SYCL, then Vulkan; ops-first upstream PR, then one
PR per backend, each gated by `test-backend-ops` on the target hardware). The
methodology for that work is in [.agents/vllm-parity-methodology.md](vllm-parity-methodology.md).

87
.agents/vllm-parity-methodology.md
Normal file
@@ -0,0 +1,87 @@
# Methodology: Closing the vLLM Decode-Throughput Gap in llama.cpp

This is the playbook that took the paged backend
([.agents/llama-cpp-localai-paged-backend.md](llama-cpp-localai-paged-backend.md))
from ~38% of vLLM decode to **parity-to-ahead on dense** (and a proven, honest
ceiling on MoE) on GB10. Use it for any "make llama.cpp match or beat engine X on
accelerator Y" effort. The *levers* are model- and hardware-specific; the
*discipline* below is not. The worked example, with all numbers, is the paged
backend README.

## The core loop

1. **Establish a bit-exact baseline and gate FIRST.** Record the greedy md5 (per
   path) and an f32 reference. Every optimization must stay byte-identical to it -
   or ship as an explicit, default-off precision opt-in. This is what lets you
   optimize aggressively without silently regressing quality. Gate two ways:
   greedy md5, and `test-backend-ops` against the CPU oracle.

2. **Profile - do not assume.** Run nsys on the steady-state decode step and break
   it down per *kernel* AND per *memcpy*. Find the dominant cost. "It's the GEMM" was wrong
   here: on hybrid gated-DeltaNet models the bottleneck was the recurrent-state
   **plumbing** (state memcpy + gathers, ~67% of the step), not the weight GEMM.
   Also sanity-check GPU-busy %: an early "low utilization" reading was a profiling
   window artifact (decode was 96-99% GPU-busy), not real idle.

3. **Ground-truth BOTH engines.** Decompose *your* decode step AND the
   competitor's, side by side, per bucket, and compute the per-bucket delta. This
   tells you WHERE the gap actually is - not where you would guess. It overturned
   premises here: e.g. vLLM does NOT run the GDN/attn projections as NVFP4 (it
   keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.

4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
   same-session A/B bench (patched-off vs patched-on, identical harness = an exact
   measure). Bank only what lifts AND gates. **Record every rejected or flat lever
   with the reason** - over time this is the most valuable part: it stops the next
   person from re-running dead ends.

5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
   lever measured, not assumed). What remains is physical - the memory-bandwidth
   floor, the irreducible serial-SSM host loop (sampling can't start until logits
   land). Name it; do not claim more than you measured.

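Step 3 above reduces to simple arithmetic once both engines are profiled into the same buckets. The following is a minimal sketch; the function name `bucket_deltas`, the bucket names, and every number in it are placeholders for illustration, not measurements from the paged backend README.

```python
def bucket_deltas(ours_ms: dict[str, float], theirs_ms: dict[str, float]) -> list[tuple[str, float]]:
    """Per-bucket time delta (ours minus theirs), largest gap first."""
    buckets = set(ours_ms) | set(theirs_ms)
    # A bucket missing from one engine counts as 0 ms for that engine.
    deltas = [(b, ours_ms.get(b, 0.0) - theirs_ms.get(b, 0.0)) for b in buckets]
    # Sort by delta descending, then by name so ties are deterministic.
    return sorted(deltas, key=lambda item: (-item[1], item[0]))


# Placeholder per-step timings in milliseconds; NOT real profile data.
ours = {"weight_gemm": 4.0, "state_memcpy": 6.0, "attention": 1.0}
theirs = {"weight_gemm": 4.5, "state_memcpy": 1.0, "attention": 1.0}

for name, delta in bucket_deltas(ours, theirs):
    print(f"{name:>14}: {delta:+.1f} ms")
```

A positive delta marks a bucket where the competitor is faster and is therefore a candidate lever; a negative delta marks a bucket that is already a win and is not where the gap lives.
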
## Hard rules learned

- **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
  (`llama-batched-bench`) is exact - lead with it. Cross-engine "% of vLLM"
  (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
  and config (context length alone shifted the MoE figure 76% <-> 86%).
- **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%
  but failed the f32 KL gate (vLLM keeps f32 too), so it ships default-off opt-in -
  never in a recommended config.
- **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the
  critical path benches FLAT (the freed time becomes idle). Quantizing the bf16
  projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason.
  Always measure before believing; a plausible mechanism is not a result.
- **The gate can be per-path.** Paged vs non-paged attention legitimately produces
  different (equivalent) FP-reduction orders; validate the difference is benign
  (KLD to f32) and then gate each path against its own reference.

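The "KLD to f32" check in the last rule can be illustrated with a small, dependency-free sketch. It assumes you have already dumped per-position logits from the f32 reference and from the path under test; how those logits are obtained, the function names, and any acceptance threshold are assumptions of this example and are not specified by the guide.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax (subtract the max before exponentiating)."""
    peak = max(logits)
    log_total = math.log(sum(math.exp(x - peak) for x in logits))
    return [x - peak - log_total for x in logits]


def kl_divergence(ref_logits: list[float], test_logits: list[float]) -> float:
    """KL(ref || test) in nats for a single token position."""
    if len(ref_logits) != len(test_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(ref_rows: list[list[float]], test_rows: list[list[float]]) -> float:
    """Average KL over token positions; both inputs hold one logit vector per position."""
    if len(ref_rows) != len(test_rows) or not ref_rows:
        raise ValueError("need the same non-zero number of positions on both sides")
    return sum(kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)) / len(ref_rows)
```

Two paths that differ only in FP-reduction order should yield a mean KL close to zero against the f32 reference. What counts as "close enough" is a project decision; this sketch deliberately leaves the threshold to the caller.
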
## Orchestration (multi-agent)

- **One GPU profiler/bencher at a time** (the GPU-contention rule). Parallel
  design/analysis/read agents are fine; concurrent GPU benches pollute each other's
  numbers.
- **Adversarial verify.** Before banking a finding, spawn skeptics prompted to
  *refute* it; majority-refute kills it. Prevents plausible-but-wrong results.
- **Anti-punt.** Use foreground, blocking ssh loops with short benches and a
  progress-file checkpoint. Agents that background work and "wait for the monitor
  event" stall - forbid that pattern.
- **GPU coexistence.** On a shared host, stop the user's deployments for a clean
  benchmark window (with their OK) and ALWAYS restore them (wrap the bench so a
  failure cannot strand them).

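The "wrap the bench so a failure cannot strand them" rule maps onto a `try`/`finally` wrapper. The sketch below is one possible shape, assuming the caller supplies its own `stop` and `restore` callables (hypothetical here; the guide does not say how deployments are stopped or restarted) and that `restore` is safe to call even if `stop` only partly succeeded.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def clean_gpu_window(stop: Callable[[], None], restore: Callable[[], None]) -> Iterator[None]:
    """Stop co-tenant deployments, yield for the bench, and always restore them."""
    try:
        # stop() sits inside the try so a partial stop is still followed by restore().
        stop()
        yield
    finally:
        # Runs whether the bench succeeds, raises, or is interrupted.
        restore()
```

Usage is `with clean_gpu_window(stop_deployments, restore_deployments): run_bench()`, where all three names stand in for your own functions. The context manager does not swallow the bench's exception; it only guarantees the restore happens before the exception propagates.
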
## What generalizes (and what doesn't)

The *speedups* may be hardware-specific (here: CUDA/Blackwell - the SSM fusions,
NVFP4 FP4-MMA, the occupancy tune), which is why other accelerators did not
benefit. But the *findings* often generalize and are worth upstreaming: the
"decode is plumbing-bound, not GEMM-bound" insight and the bit-exact, CPU-mirrored
fusion ops help any backend running these models. Separate "ship our tuned backend"
from "upstream the portable op" - they are different deliverables.

## The closing record

Write up the result HONESTLY: the shipped wins, the rejected levers (with reasons),
the structural ceiling, and the cross-backend / cross-quant generality. Negative
results are as valuable as wins. The paged backend README is the template.

@@ -23,6 +23,8 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
| [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
| [.agents/llama-cpp-localai-paged-backend.md](.agents/llama-cpp-localai-paged-backend.md) | Working on the CUDA-only paged-attention llama.cpp variant (Qwen3.6 hybrid-SSM / Blackwell NVFP4 decode) - patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, CUDA-only invariants, stock-stays-pure, Metal/SYCL/Vulkan follow-up scope |
| [.agents/vllm-parity-methodology.md](.agents/vllm-parity-methodology.md) | The methodology for closing the vLLM decode-throughput gap in llama.cpp - bit-exact gating, profile-don't-assume, both-engine ground-truth, per-lever A/B discipline, recording rejected levers, multi-agent GPU orchestration |
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
| [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
@@ -37,6 +39,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
- **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
- **Logging**: Use `github.com/mudler/xlog` (same API as slog)
- **Paged llama.cpp backend**: `llama-cpp-localai-paged` is a CUDA-only variant that owns its own patch series + its own pinned llama.cpp (manual pin-sync, weekly canary); the stock `llama-cpp` backend stays patch-free. Read [.agents/llama-cpp-localai-paged-backend.md](.agents/llama-cpp-localai-paged-backend.md) before touching either, and [.agents/vllm-parity-methodology.md](.agents/vllm-parity-methodology.md) for the decode-parity methodology behind it.
- **Go style**: Prefer `any` over `interface{}`
- **Comments**: Explain *why*, not *what*
- **Docs**: Update `docs/content/` when adding features or changing config