diff --git a/.agents/llama-cpp-localai-paged-backend.md b/.agents/llama-cpp-localai-paged-backend.md
new file mode 100644
index 000000000..716f01ebe
--- /dev/null
+++ b/.agents/llama-cpp-localai-paged-backend.md
@@ -0,0 +1,93 @@
+# llama-cpp-localai-paged Backend (paged attention + Blackwell NVFP4 decode)
+
+`llama-cpp-localai-paged` is LocalAI's **CUDA-only** paged-attention variant of the
+llama.cpp backend. It targets high-concurrency decode for the Qwen3.6 hybrid
+gated-DeltaNet (SSM) models on Blackwell (GB10 / DGX Spark). It reuses the stock
+`llama-cpp` backend's sources and applies a vendored patch series on top at build
+time. It is **not** a fork: a source-only `*.patch` stack plus one canonical doc.
+
+**Canonical reference:** `backend/cpp/llama-cpp-localai-paged/patches/paged/README.md`
+(architecture, the patch series 0001-0030, benchmarks, dev notes, generality,
+pin/canary policy). Read it for any technical detail; this guide is the maintenance
+how-to.
+
+## Where things live
+
+- `backend/cpp/llama-cpp-localai-paged/Makefile` - the thin wrapper. It copies the
+  stock `backend/cpp/llama-cpp/` build infra into a build dir, clones llama.cpp at
+  this backend's **own** pin (`LLAMA_VERSION`), applies the paged series via the
+  `apply-paged-patches` define (strict `git apply`), then builds `grpc-server`.
+- `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch`
+  series (0001-0030) + the README + operational docs (`PIN_SYNC_*.md`,
+  `PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`).
+- `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh`
+  - the CUDA build entry points.
+- `backend/cpp/llama-cpp/` - the **stock** backend, pure upstream. It carries no
+  paged patches.
+
+## Invariants (do not break these)
+
+- **Stock stays pure.** The paged patches live ONLY in this backend. Never add a
+  `patches/paged/` dir or `LLAMA_PAGED` logic to `backend/cpp/llama-cpp/`.
+- **CUDA-only.** Ship cublas/cuda targets only. Off-CUDA the fusions are gated off
+  (patch 0030) and NVFP4 falls back to dequant, so the backend is neutral-to-
+  slightly-negative there - non-CUDA users use the stock `llama-cpp`. Do not add
+  cpu/vulkan/sycl/metal rows for this backend in `.github/backend-matrix.yml`.
+  (Those builds also fail to link `grpc-server` on darwin/arm64 against upstream
+  `stream_*` server symbols - another reason it is CUDA-only.)
+- **Source-only patches.** A `.patch` may touch only llama.cpp source - never a
+  dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A
+  stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.)
+- **Bit-exact by default.** Every shipped patch is byte-identical to the f32
+  baseline. The one opt-in precision trade (`ssm_bf16_tau`, patch 0026) defaults
+  off; never put it in a recommended/gallery config.
+
+## Maintaining the pin against new llama.cpp
+
+The pin (`LLAMA_VERSION` in the wrapper Makefile) is advanced ONLY by the manual
+pin-sync. It is deliberately **excluded from the nightly auto-bumper**
+(`bump_deps.yaml`): a naive bump would shift the tree out from under the patches
+and break `git apply` at build time.
+
+1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml`
+   runs weekly: it applies + builds the series against the latest upstream tip and
+   goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync.
+2. **The pin-sync** (recorded in `PIN_SYNC_*.md`): rebase the series onto the new
+   tip (resolve conflicts; re-export **source-only** with a pathspec like
+   `-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box,
+   pass the bit-exact gate on **every** path + `test-backend-ops`, then bump
+   `LLAMA_VERSION`. The 9d5d882d -> c299a92c bump (23 upstream commits) needed zero
+   patch changes; bumps are usually offset-tolerant (git apply absorbs offsets).
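Reviewer note (not part of the patch): the source-only invariant and the pin-sync pathspec above lend themselves to a mechanical pre-check before `git apply` runs. The following is a minimal sketch, assuming the allowed directories are exactly those in the example pathspec and that patch paths contain no spaces. The function name `offending_paths` and the sample file `src/example.cpp` are hypothetical and do not come from the repository.

```python
import re

# Assumption: these prefixes mirror the example pathspec quoted in the pin-sync step.
ALLOWED_PREFIXES = ("src/", "ggml/", "common/", "include/", "tools/", "tests/", "cmake/")

# Matches the per-file header line of a git patch (paths without spaces only).
_DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def offending_paths(patch_text: str) -> list[str]:
    """Return paths touched by the patch that are Markdown docs or outside the allowed dirs."""
    bad = []
    for line in patch_text.splitlines():
        match = _DIFF_HEADER.match(line)
        if match is None:
            continue
        path = match.group(2)  # the post-image ("b/") path
        if path.endswith(".md") or not path.startswith(ALLOWED_PREFIXES):
            bad.append(path)
    return bad


# Hypothetical patch text: one source hunk plus a stray results doc.
sample_patch = (
    "diff --git a/src/example.cpp b/src/example.cpp\n"
    "--- a/src/example.cpp\n"
    "+++ b/src/example.cpp\n"
    "diff --git a/SSM_DECODE_FIX_RESULTS.md b/SSM_DECODE_FIX_RESULTS.md\n"
)
print(offending_paths(sample_patch))
```

A check like this would only complement the strict `git apply` on a clean checkout; it catches a stray doc hunk early but says nothing about whether the hunks still apply.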
+
+## The bit-exact gate (run for every change)
+
+- greedy md5: `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1 next patch number (gaps 0005/0027 are intentional). Update
+  the README's patch table and dev notes - keep the README the single doc; do not
+  scatter `*_RESULTS.md` files.
+- Record rejected/flat levers in the README too (they stop the next person from
+  re-running dead ends).
+
+## Follow-ups (Metal / SYCL / Vulkan)
+
+The decode fusions are implemented for **CUDA + CPU only**. The base
+gated-DeltaNet + SSM_CONV ops already exist upstream on Metal, SYCL, and Vulkan,
+so the models **run** there via the non-fused path - what is missing is the
+fusion speedup. Porting it (strictly mirroring the CUDA kernels, since we have no
+Metal/SYCL/Vulkan hardware to test on here) is scoped in `UPSTREAM_LAYER2_SCOPE.md`
+(recommended order: Metal, then SYCL, then Vulkan; ops-first upstream PR, then one
+PR per backend, each gated by `test-backend-ops` on the target hardware). The
+methodology for that work is in [.agents/vllm-parity-methodology.md](vllm-parity-methodology.md).
diff --git a/.agents/vllm-parity-methodology.md b/.agents/vllm-parity-methodology.md
new file mode 100644
index 000000000..0ebc7f140
--- /dev/null
+++ b/.agents/vllm-parity-methodology.md
@@ -0,0 +1,87 @@
+# Methodology: Closing the vLLM Decode-Throughput Gap in llama.cpp
+
+This is the playbook that took the paged backend
+([.agents/llama-cpp-localai-paged-backend.md](llama-cpp-localai-paged-backend.md))
+from ~38% of vLLM decode to **parity-to-ahead on dense** (and a proven, honest
+ceiling on MoE) on GB10. Use it for any "make llama.cpp match or beat engine X on
+accelerator Y" effort. The *levers* are model- and hardware-specific; the
+*discipline* below is not. The worked example, with all numbers, is the paged
+backend README.
+
+## The core loop
+
+1. **Establish a bit-exact baseline and gate FIRST.** Record the greedy md5 (per
+   path) and an f32 reference. Every optimization must stay byte-identical to it -
+   or ship as an explicit, default-off precision opt-in. This is what lets you
+   optimize aggressively without silently regressing quality. Gate two ways:
+   greedy md5, and `test-backend-ops` against the CPU oracle.
+
+2. **Profile - do not assume.** nsys the steady-state decode step, broken down per
+   *kernel* AND per *memcpy*. Find the dominant cost. "It's the GEMM" was wrong
+   here: on hybrid gated-DeltaNet models the bottleneck was the recurrent-state
+   **plumbing** (state memcpy + gathers, ~67% of the step), not the weight GEMM.
+   Also sanity-check GPU-busy %: an early "low utilization" reading was a profiling
+   window artifact (decode was 96-99% GPU-busy), not real idle.
+
+3. **Ground-truth BOTH engines.** Decompose *your* decode step AND the
+   competitor's, side by side, per bucket, and compute the per-bucket delta. This
+   tells you WHERE the gap actually is - not where you would guess. It overturned
+   premises here: e.g. vLLM does NOT run the GDN/attn projections as NVFP4 (it
+   keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.
+
+4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
+   same-session A/B bench (patched-off vs patched-on, identical harness = an exact
+   measure). Bank only what lifts AND gates. **Record every rejected or flat lever
+   with the reason** - over time this is the most valuable part: it stops the next
+   person re-running dead ends.
+
+5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
+   lever measured, not assumed). What remains is physical - the memory-bandwidth
+   floor, the irreducible serial-SSM host loop (sampling can't start until logits
+   land). Name it; do not claim more than you measured.
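Reviewer note (not part of the patch): the greedy-md5 half of the gate in step 1 can be sketched as a per-path comparison against recorded digests. This is an illustration under assumptions, not the project's harness. The helper names `greedy_md5` and `failing_paths`, the path labels `paged` and `non-paged`, and the byte strings are all hypothetical; how the greedy output is captured is left out.

```python
import hashlib


def greedy_md5(output: bytes) -> str:
    """Hex md5 of one greedy-decode output, as stored in the baseline."""
    return hashlib.md5(output).hexdigest()


def failing_paths(outputs: dict[str, bytes], baseline: dict[str, str]) -> list[str]:
    """Paths whose output digest differs from its own baseline, or that were not exercised.

    Each path is compared only against its own recorded digest, never against
    another path's digest.
    """
    failed = {path for path, out in outputs.items() if baseline.get(path) != greedy_md5(out)}
    failed |= baseline.keys() - outputs.keys()  # a path that was skipped also fails the gate
    return sorted(failed)


# Hypothetical baseline: one digest per attention path.
baseline = {
    "paged": greedy_md5(b"reference output A"),
    "non-paged": greedy_md5(b"reference output B"),
}
candidate = {"paged": b"reference output A", "non-paged": b"changed output"}
print(failing_paths(candidate, baseline))
```

Treating a missing path as a failure is a design choice in this sketch, made so that a partial run cannot pass a gate that is meant to cover every path.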
+
+## Hard rules learned
+
+- **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
+  (`llama-batched-bench`) is exact - lead with it. Cross-engine "% of vLLM"
+  (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
+  and config (context length alone shifted the MoE figure 76% <-> 86%).
+- **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%
+  but failed the f32 KL gate (vLLM keeps f32 too), so it ships default-off opt-in -
+  never in a recommended config.
+- **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the
+  critical path benches FLAT (the freed time becomes idle). Quantizing the bf16
+  projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason.
+  Always measure before believing; a plausible mechanism is not a result.
+- **The gate can be per-path.** Paged vs non-paged attention legitimately produces
+  different (equivalent) FP-reduction orders; validate the difference is benign
+  (KLD to f32) and then gate each path against its own reference.
+
+## Orchestration (multi-agent)
+
+- **One GPU profiler/bencher at a time** (the GPU-contention rule). Parallel
+  design/analysis/read agents are fine; concurrent GPU benches pollute each other's
+  numbers.
+- **Adversarial verify.** Before banking a finding, spawn skeptics prompted to
+  *refute* it; majority-refute kills it. Prevents plausible-but-wrong results.
+- **Anti-punt.** Use foreground, blocking ssh loops with short benches and a
+  progress-file checkpoint. Agents that background work and "wait for the monitor
+  event" stall - forbid that pattern.
+- **GPU coexistence.** On a shared host, stop the user's deployments for a clean
+  benchmark window (with their OK) and ALWAYS restore them (wrap the bench so a
+  failure cannot strand them).
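Reviewer note (not part of the patch): the "wrap the bench so a failure cannot strand them" rule maps naturally onto a try/finally wrapper. Below is a minimal sketch, assuming the stop and restore actions are supplied as plain callables. The name `clean_benchmark_window` and the lambdas standing in for real deployment commands are hypothetical.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def clean_benchmark_window(stop: Callable[[], None], restore: Callable[[], None]) -> Iterator[None]:
    """Stop co-located deployments for the bench and restore them no matter what."""
    try:
        stop()  # inside the try so a partially completed stop is still restored
        yield
    finally:
        restore()


# Hypothetical usage: record the order of events when the bench itself fails.
events: list[str] = []
try:
    with clean_benchmark_window(lambda: events.append("stop"), lambda: events.append("restore")):
        raise RuntimeError("bench failed")
except RuntimeError:
    pass
print(events)
```

The `finally` clause covers exceptions raised by the bench; it does not cover the process being killed outright, so a real wrapper would likely need an additional safeguard for that case.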
+
+## What generalizes (and what doesn't)
+
+The *speedups* may be hardware-specific (here: CUDA/Blackwell - the SSM fusions,
+NVFP4 FP4-MMA, the occupancy tune), which is why other accelerators did not
+benefit. But the *findings* often generalize and are worth upstreaming: the
+"decode is plumbing-bound, not GEMM-bound" insight and the bit-exact, CPU-mirrored
+fusion ops help any backend running these models. Separate "ship our tuned backend"
+from "upstream the portable op" - they are different deliverables.
+
+## The closing record
+
+Write up the result HONESTLY: the shipped wins, the rejected levers (with reasons),
+the structural ceiling, and the cross-backend / cross-quant generality. Negative
+results are as valuable as wins. The paged backend README is the template.
diff --git a/AGENTS.md b/AGENTS.md
index 1095ef531..dd2d59f5d 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -23,6 +23,8 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
+| [.agents/llama-cpp-localai-paged-backend.md](.agents/llama-cpp-localai-paged-backend.md) | Working on the CUDA-only paged-attention llama.cpp variant (Qwen3.6 hybrid-SSM / Blackwell NVFP4 decode) - patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, CUDA-only invariants, stock-stays-pure, Metal/SYCL/Vulkan follow-up scope |
+| [.agents/vllm-parity-methodology.md](.agents/vllm-parity-methodology.md) | The methodology for closing the vLLM decode-throughput gap in llama.cpp - bit-exact gating, profile-don't-assume, both-engine ground-truth, per-lever A/B discipline, recording rejected levers, multi-agent GPU orchestration |
 | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
 | [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
 | [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
@@ -37,6 +39,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 - **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
 - **Logging**: Use `github.com/mudler/xlog` (same API as slog)
+- **Paged llama.cpp backend**: `llama-cpp-localai-paged` is a CUDA-only variant that owns its own patch series + its own pinned llama.cpp (manual pin-sync, weekly canary); the stock `llama-cpp` backend stays patch-free. Read [.agents/llama-cpp-localai-paged-backend.md](.agents/llama-cpp-localai-paged-backend.md) before touching either, and [.agents/vllm-parity-methodology.md](.agents/vllm-parity-methodology.md) for the decode-parity methodology behind it.
 - **Go style**: Prefer `any` over `interface{}`
 - **Comments**: Explain *why*, not *what*
 - **Docs**: Update `docs/content/` when adding features or changing config