diff --git a/.agents/llama-cpp-localai-paged-backend.md b/.agents/llama-cpp-localai-paged-backend.md index 716f01ebe..dc8361b7f 100644 --- a/.agents/llama-cpp-localai-paged-backend.md +++ b/.agents/llama-cpp-localai-paged-backend.md @@ -6,7 +6,7 @@ gated-DeltaNet (SSM) models on Blackwell (GB10 / DGX Spark). It reuses the stock `llama-cpp` backend's sources and applies a vendored patch series on top at build time. It is **not** a fork: a source-only `*.patch` stack plus one canonical doc. -**Canonical reference:** `backend/cpp/llama-cpp-localai-paged/patches/paged/README.md` +**Canonical reference:** `backend/cpp/llama-cpp-localai-paged/README.md` (architecture, the patch series 0001-0030, benchmarks, dev notes, generality, pin/canary policy). Read it for any technical detail; this guide is the maintenance how-to. @@ -18,8 +18,11 @@ how-to. this backend's **own** pin (`LLAMA_VERSION`), applies the paged series via the `apply-paged-patches` define (strict `git apply`), then builds `grpc-server`. - `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch` - series (0001-0030) + the README + operational docs (`PIN_SYNC_*.md`, - `PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`). + series (0001-0030), nothing else. +- `backend/cpp/llama-cpp-localai-paged/README.md` - the canonical doc. The + operational docs (`PIN_SYNC_*.md`, `PAGED_BITEXACT_NOTE.md`, + `UPSTREAM_LAYER2_SCOPE.md`) and dev artifacts live in + `backend/cpp/llama-cpp-localai-paged/docs/`. - `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh` - the CUDA build entry points. - `backend/cpp/llama-cpp/` - the **stock** backend, pure upstream. It carries no @@ -52,7 +55,7 @@ and break `git apply` at build time. 1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml` runs weekly: it applies + builds the series against the latest upstream tip and goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync. -2. **The pin-sync** (recorded in `PIN_SYNC_*.md`): rebase the series onto the new +2. **The pin-sync** (recorded in `docs/PIN_SYNC_*.md`): rebase the series onto the new tip (resolve conflicts; re-export **source-only** with a pathspec like `-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box, pass the bit-exact gate on **every** path + `test-backend-ops`, then bump @@ -68,7 +71,7 @@ and break `git apply` at build time. - `test-backend-ops` (CUDA0 vs CPU oracle) for every touched op (`SSM_CONV*`, `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`). - **The gate is per-path.** The paged-MoE md5 differs from the non-paged md5 - a - benign, KL-validated FP-accumulation-order difference (see `PAGED_BITEXACT_NOTE.md`). + benign, KL-validated FP-accumulation-order difference (see `docs/PAGED_BITEXACT_NOTE.md`). Compare a paged-MoE change to the **paged** reference, not the non-paged one. ## Encapsulating your work @@ -87,7 +90,7 @@ The decode fusions are implemented for **CUDA + CPU only**. The base gated-DeltaNet + SSM_CONV ops already exist upstream on Metal, SYCL, and Vulkan, so the models **run** there via the non-fused path - what is missing is the fusion speedup. Porting it (strictly mirroring the CUDA kernels, since we have no -Metal/SYCL/Vulkan hardware to test on here) is scoped in `UPSTREAM_LAYER2_SCOPE.md` +Metal/SYCL/Vulkan hardware to test on here) is scoped in `docs/UPSTREAM_LAYER2_SCOPE.md` (recommended order: Metal, then SYCL, then Vulkan; ops-first upstream PR, then one PR per backend, each gated by `test-backend-ops` on the target hardware). The methodology for that work is in [.agents/vllm-parity-methodology.md](vllm-parity-methodology.md). diff --git a/.github/scripts/paged-canary-apply.sh b/.github/scripts/paged-canary-apply.sh index 1de59a9df..dfcc88874 100755 --- a/.github/scripts/paged-canary-apply.sh +++ b/.github/scripts/paged-canary-apply.sh @@ -27,8 +27,8 @@ # missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028 # build on 0019's code, the rejection cascades to them too. This is a # PRE-EXISTING shipped-series defect, present identically on every pin, NOT an -# upstream break (see backend/cpp/llama-cpp-localai-paged/patches/paged/PIN_SYNC_c299a92c.md -# and README.md). We exclude ONLY that dev-doc path and still +# upstream break (see backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md +# and backend/cpp/llama-cpp-localai-paged/README.md). We exclude ONLY that dev-doc path and still # apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019 # still fails the canary. prepare.sh tolerates the same hunk via # `patch ... || true`; this mirrors that tolerance precisely. @@ -53,7 +53,7 @@ apply_one() { echo "paged-canary: applying $(basename "$p")" if ! git apply --verbose "$@" "$p"; then echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")" - echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp-localai-paged/patches/paged/PIN_SYNC_c299a92c.md), do NOT bump the pin blindly" + echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md), do NOT bump the pin blindly" exit 1 fi } diff --git a/.github/workflows/llama-cpp-paged-canary.yml b/.github/workflows/llama-cpp-paged-canary.yml index 46ed6940e..8220acd30 100644 --- a/.github/workflows/llama-cpp-paged-canary.yml +++ b/.github/workflows/llama-cpp-paged-canary.yml @@ -17,7 +17,7 @@ name: 'llama.cpp paged patches: upstream canary' # RED HERE means: time to run a PIN_SYNC (rebase the patches onto the new tip, # pass the bit-exact gate on the GPU, re-export the .patch files, THEN advance # the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See -# backend/cpp/llama-cpp-localai-paged/patches/paged/PIN_SYNC_c299a92c.md. +# backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md. # # SIGNAL-ONLY: this workflow moves no pinned version, ships nothing, and is fully # decoupled from bump_deps - so the main dep-bump PR stays green regardless. A diff --git a/backend/cpp/llama-cpp-localai-paged/Makefile b/backend/cpp/llama-cpp-localai-paged/Makefile index 13921220c..05f3b790e 100644 --- a/backend/cpp/llama-cpp-localai-paged/Makefile +++ b/backend/cpp/llama-cpp-localai-paged/Makefile @@ -9,7 +9,7 @@ # Pin handling (mirrors the turboquant wrapper, the precedent this is modelled # on): the paged patch series is hand-verified bit-exact against ONE specific # llama.cpp tip and re-exported by the manual PIN_SYNC process -# (patches/paged/PIN_SYNC_*.md). A naive pin bump would move the tip out from +# (docs/PIN_SYNC_*.md). A naive pin bump would move the tip out from # under the patches and break `git apply` at build time, so this backend OWNS # its pin (LLAMA_VERSION below) instead of inheriting the auto-bumped stock pin # from backend/cpp/llama-cpp/Makefile. The override is forced into every copied @@ -30,7 +30,7 @@ # the nightly llama.cpp bump cannot silently break the vendored paged patches. # Advance ONLY via the PIN_SYNC process (rebase patches + bit-exact gate + # re-export), then update this value. See: -# backend/cpp/llama-cpp-localai-paged/patches/paged/PIN_SYNC_*.md +# backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_*.md # # This pin = the manual, verified sync. The signal telling you WHEN to do the # next sync is the early-warning canary diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md similarity index 92% rename from backend/cpp/llama-cpp-localai-paged/patches/paged/README.md rename to backend/cpp/llama-cpp-localai-paged/README.md index bc42b0450..9b7d1eb80 100644 --- a/backend/cpp/llama-cpp-localai-paged/patches/paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -1,15 +1,15 @@ # LocalAI paged-attention llama.cpp patch series -This directory holds the vendored patch series that turns stock llama.cpp into -LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The patches are -applied on top of a pinned upstream llama.cpp at build time; nothing here is a -fork - it is a source-only `*.patch` stack plus this single canonical doc. +This backend vendors the patch series (in `patches/paged/`) that turns stock +llama.cpp into LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The +patches are applied on top of a pinned upstream llama.cpp at build time; nothing +here is a fork - it is a source-only `*.patch` stack plus this canonical doc. > One-file rule: this README is the canonical reference for the patch series. The -> only other docs kept in this directory are operational and linked below: -> - [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts). -> - [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference). -> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items. +> only other docs are operational, kept in `docs/`, and linked below: +> - [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts). +> - [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference). +> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items. --- @@ -32,7 +32,7 @@ vendored patch series over upstream llama.cpp that adds It is **pinned to llama.cpp `c299a92c`** ("binaries : Improve rpc-server and export-graph-ops names", #25045) and advanced only by a manual, bit-exact-gated -[pin-sync process](PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper +[pin-sync process](docs/PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper (see section 7). The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is @@ -158,9 +158,9 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense **Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg` S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over -serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](qwen36_dense_decode_vs_npl.png), -[`qwen36_moe_decode_vs_npl.png`](qwen36_moe_decode_vs_npl.png); raw data -[`final_benchmark.csv`](final_benchmark.csv). +serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png), +[`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data +[`final_benchmark.csv`](docs/final_benchmark.csv). ### (a) + (b) Patched vs stock vs vLLM @@ -231,7 +231,7 @@ all three via the non-fused path. The patchset's fusions are gated off there (0030), so the outcome is the same neutral-to-slightly-negative as Metal - not "won't run". This backend therefore ships **CUDA-only** (where the fusions are live + verified); non-CUDA users should use the stock `llama-cpp` backend. See -[`UPSTREAM_LAYER2_SCOPE.md`](UPSTREAM_LAYER2_SCOPE.md) for what native non-CUDA +[`UPSTREAM_LAYER2_SCOPE.md`](docs/UPSTREAM_LAYER2_SCOPE.md) for what native non-CUDA fused kernels would take. --- @@ -245,7 +245,7 @@ is" -n 48 --temp 0 --seed 1 | md5sum`, paged paths prefixed with chat-template path; and (2) `test-backend-ops` (CUDA0 vs CPU oracle) for every touched op (`SSM_CONV*`, `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`). -**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md)). +**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md)). Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5 (`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this is a benign FP-accumulation-order difference of the paged attention reduction, @@ -325,7 +325,7 @@ in a recommended/gallery config. ## 7. Pin + maintenance policy - **Pinned to llama.cpp `c299a92c`.** The pin is advanced **only** by the manual - [`PIN_SYNC`](PIN_SYNC_c299a92c.md) process: rebase the source-only patch series + [`PIN_SYNC`](docs/PIN_SYNC_c299a92c.md) process: rebase the source-only patch series onto the new tip, rebuild on GPU, and pass the bit-exact gate on every path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The `9d5d882d -> c299a92c` jump (23 upstream commits) needed zero patch changes and did not @@ -333,12 +333,12 @@ in a recommended/gallery config. - **Decoupled from the nightly auto-bumper.** There is deliberately **no** `bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could silently shift the tree out from under the patches. -- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../../../.github/workflows/llama-cpp-paged-canary.yml) - (via [`.github/scripts/paged-canary-apply.sh`](../../../../../.github/scripts/paged-canary-apply.sh)) +- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../.github/workflows/llama-cpp-paged-canary.yml) + (via [`.github/scripts/paged-canary-apply.sh`](../../../.github/scripts/paged-canary-apply.sh)) tries the patch series against the latest upstream tip with the build's own strict `git apply`. **Red = upstream drifted past the series -> run a PIN_SYNC** (do not bump the pin blindly). The canary references - [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md). + [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md). --- @@ -363,4 +363,4 @@ Both gallery entries set `backend: llama-cpp-localai-paged` and the paged servin (`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`, `flash_attention:on`, `context_size`). They intentionally stay bit-exact (no `ssm_bf16_tau`). The full backend-split + gallery plan is in -[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md). +[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md). diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md b/backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md similarity index 100% rename from backend/cpp/llama-cpp-localai-paged/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md rename to backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/PAGED_BITEXACT_NOTE.md b/backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md similarity index 100% rename from backend/cpp/llama-cpp-localai-paged/patches/paged/PAGED_BITEXACT_NOTE.md rename to backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md diff --git a/backend/cpp/llama-cpp-localai-paged/patches/README.md b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md similarity index 97% rename from backend/cpp/llama-cpp-localai-paged/patches/README.md rename to backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md index fa777ee44..13ce1194e 100644 --- a/backend/cpp/llama-cpp-localai-paged/patches/README.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md @@ -61,7 +61,7 @@ everywhere without ever touching the stock `llama-cpp` source tree. - **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing. - **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form - (see `paged/README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index + (see `../README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B, @@ -82,5 +82,5 @@ by itself reach vLLM throughput parity, because the measured prefill bottleneck (Lever 3: `mul_mat_q` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see -`paged/README.md`). So full vLLM parity = this series **AND** the +`../README.md`). So full vLLM parity = this series **AND** the kernel; neither alone suffices. diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/PIN_SYNC_c299a92c.md b/backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md similarity index 100% rename from backend/cpp/llama-cpp-localai-paged/patches/paged/PIN_SYNC_c299a92c.md rename to backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/UPSTREAM_LAYER2_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/UPSTREAM_LAYER2_SCOPE.md similarity index 100% rename from backend/cpp/llama-cpp-localai-paged/patches/paged/UPSTREAM_LAYER2_SCOPE.md rename to backend/cpp/llama-cpp-localai-paged/docs/UPSTREAM_LAYER2_SCOPE.md diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/final_benchmark.csv b/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv similarity index 100% rename from backend/cpp/llama-cpp-localai-paged/patches/paged/final_benchmark.csv rename to backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/paged-burst-bench.cpp b/backend/cpp/llama-cpp-localai-paged/docs/paged-burst-bench.cpp similarity index 100% rename from backend/cpp/llama-cpp-localai-paged/patches/paged/paged-burst-bench.cpp rename to backend/cpp/llama-cpp-localai-paged/docs/paged-burst-bench.cpp diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/paged-reclaim-unit.cpp b/backend/cpp/llama-cpp-localai-paged/docs/paged-reclaim-unit.cpp similarity index 100% rename from backend/cpp/llama-cpp-localai-paged/patches/paged/paged-reclaim-unit.cpp rename to backend/cpp/llama-cpp-localai-paged/docs/paged-reclaim-unit.cpp diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/qwen36_dense_decode_vs_npl.png b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png similarity index 100% rename from backend/cpp/llama-cpp-localai-paged/patches/paged/qwen36_dense_decode_vs_npl.png rename to backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/qwen36_moe_decode_vs_npl.png b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png similarity index 100% rename from backend/cpp/llama-cpp-localai-paged/patches/paged/qwen36_moe_decode_vs_npl.png rename to backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png diff --git a/backend/cpp/llama-cpp-localai-paged/patches/BENCHMARKS.md b/backend/cpp/llama-cpp-localai-paged/patches/BENCHMARKS.md deleted file mode 100644 index df5f88fe0..000000000 --- a/backend/cpp/llama-cpp-localai-paged/patches/BENCHMARKS.md +++ /dev/null @@ -1,106 +0,0 @@ -# Paged-attention / parity benchmarks (GB10 / DGX Spark) - -Goal of the series: vLLM parity. This records the measured gap so the parity claim is data-backed, not asserted. - -**Setup:** GB10 (sm_121, 119 GiB unified). Model Qwen3-Coder-30B-A3B. llama.cpp = pinned base + this series -(MXFP4_MOE, `-fa 1 -b 2048 -ub 2048`, `llama-batched-bench`, PP=512 TG=128). vLLM = 0.23.0 FP8 (recorded -prior run, same box/model). S_PP / S_TG are aggregate prefill / decode tok/s across B streams. - -## Fresh llama.cpp (this series, MXFP4) vs vLLM (FP8) - -| B | llama S_PP | vLLM S_PP | PP gap | llama S_TG | vLLM S_TG | TG gap | -|---|-----------|-----------|--------|-----------|-----------|--------| -| 1 | 1565 | 9644 | 6.2× | **83** | 48 | **llama wins** | -| 8 | 3648 | 33373 | 9.1× | 126 | 312 | 2.5× | -| 32 | 2074 | 99398 | 48× | 319 | 1171 | 3.7× | -| 64 | 3643 | 151990 | 42× | 771 | 2064 | 2.7× | - -## Verdict — two distinct gaps, only one is the engine's - -1. **Prefill (S_PP): 6–48× behind, and it does NOT scale with B** (plateaus ~3.6k). This is the **FP4 MoE - GEMM kernel** (`mul_mat_q` ~22 TFLOP/s), confirmed earlier. **Paged attention cannot close this** — - it's per-token compute. Needs the tcgen05/CUTLASS grouped-GEMM (Lever 3, multi-week, no upstream base). -2. **Decode at concurrency (S_TG): 2.5–3.7× behind for B≥8** (we *win* at B=1). This gap IS partly the - engine's domain — vLLM's block-paged KV + continuous batching pack more concurrent decode work per step. - **This is what patches 0003–0006 target.** The win here is realistic; the prefill win is not (kernel). - -## CORRECTION — decode-phase profile (B=64, decode-dominated nsys) - -The "decode gap is engine-addressable" read above was **wrong**. Profiling a decode-dominated B=64 run: - -| kernel | % GPU time | -|---|---| -| `mul_mat_q` (MoE GEMM) | **54.6** | -| `flash_attn_ext` (attention) | 19.8 | -| `mul_mat_q` (dense) | 10.9 | -| KV writes / quant / norms / rest | ~15 | - -**Decode at concurrency is ALSO dominated by the FP4 MoE GEMM (54.6%)** — the same Lever-3 kernel as prefill. -Attention (the only thing paging optimizes) is ~20%, and the gather-read reclaims only the *masked-cell* -fraction of that. So **the paged series (0003–0006) cannot close the vLLM gap in either phase** — both are -MoE-kernel-bound. vLLM's concurrency advantage is its MoE/attention *kernels*, not (mainly) its KV management. - -### What the paged series IS still good for (just not throughput parity) - -- **Capacity**: block-granular + on-demand allocation → fit more/longer concurrent sequences in fixed VRAM. -- **Prefix sharing**: cross-request block dedup → lower TTFT + memory on shared system prompts / RAG. - -These are real wins on *memory-pressured* and *shared-prefix* workloads — but they are not tok/s parity, and -batched-bench (fresh, non-fragmented, no shared prefix) won't show them. - -## DENSE model parity (Qwen3-32B) — does the kernel gap exist for dense too? YES. - -The MoE work above is about the grouped MoE GEMM. Dense models use a different (non-grouped) matmul path, -so we benchmarked a dense 32B head-to-head. - -**Headline comparison — vLLM NVFP4 W4A16 vs llama.cpp Q4_K_M.** This is the *correct apples-to-apples on -DGX Spark*: both are **4-bit weights / 16-bit activations** (same quant class). vLLM = `Qwen3-32B-NVFP4A16` -(FlashInfer Marlin W4A16 kernel); llama.cpp = `Qwen3-32B-Q4_K_M` (int8-MMQ compute). The only difference is -the compute kernel — which is exactly what we're measuring. (Full **W4A4** NVFP4 does not run on GB10 today; -root cause below — and it would *not* be a fair comparison even if it did, since Q4_K_M is also weight-only-4-bit.) - -| B | llama Q4_K_M PP | vLLM W4A16 PP | PP gap | llama decode | vLLM decode | TG gap | -|---|---|---|---|---|---|---| -| 1 | 708 | 5367 | 7.6× | 10.2 | 11.7 | ~parity | -| 8 | 761 | 14941 | 20× | 58 | 92 | 1.6× | -| 32 | 763 | 21952 | 29× | 205 | 330 | 1.6× | -| 64 | 765 | 24444 | 32× | 253 | 569 | 2.2× | - -**Findings:** -1. **Dense prefill has the SAME (larger) kernel gap.** llama dense prefill plateaus at ~765 t/s regardless of - B; vLLM scales to 24.4k (32×). Both read 4-bit weights — the gap is the compute kernel: vLLM's FP4 Marlin - tensor-core GEMM vs llama's int8-MMQ. (Note: on consumer Blackwell, W4A16 Marlin is also reported *faster* - than the experimental W4A4 path, so W4A16 isn't a handicapped stand-in — it's the fast path.) -2. **Decode is ~parity at B=1** (10.2 vs 11.7 — both weight-bandwidth-bound reading 4-bit weights), and the - gap grows with batch (compute starts to matter → the kernel gap reappears: 2.2× at B=64). -3. **Scope decision (the reason for this benchmark): the Lever-3 kernel track must also deliver a NON-grouped - block-scaled FP4 GEMM for dense**, not only the MoE grouped GEMM. The dense GEMM is the simpler of the two - (a plain CUTLASS dense GEMM), so it's a good first kernel to land — and it benefits every dense model. - - **No cheap lever:** `GGML_CUDA_FORCE_CUBLAS` is a **no-op for dense too** (Q4_K pp512: 720.8 vs 721.8) — - dequant→cuBLAS-BF16 doesn't engage / isn't faster than int8-MMQ on GB10. With ubatch (saturates) and - nwarps (static_assert) already ruled out for MoE, **every config/flag lever is now exhausted** for both - model classes. Parity is strictly the FP4 tensor-core kernel. -4. **Why full W4A4 NVFP4 hangs on GB10 (root cause, researched).** This is a *known consumer-Blackwell - limitation, not a misconfiguration*. **FlashInfer ships no FP4 cubins for sm_120/sm_121** — its precompiled - kernels are all datacenter `Sm100a/Sm103a` (B200/B300). So on GB10 the dense `mm_fp4` W4A4 GEMM has no - working kernel: the optimized path is gated off for sm_121 (heuristic checks `minor==0`; 12.1 fails), the - CUTLASS dense FP4 fallback is documented to silently return **all-zeros**, and TRT-LLM errors at capability - 120. Our exact symptom — loads weights, then stalls at the first profiling forward pass with - `enable_flashinfer_autotune=True` at 0–3% GPU — is the **FlashInfer FP4 autotuner/JIT spinning on an arch - with no FP4 cubins** (matches vllm #30163/#26381, flashinfer #2577/#3294). The "NVFP4 on DGX Spark" story - everyone cites is about *quantization + memory footprint + W4A16/MoE*, **not dense W4A4 inference**, which - isn't validated on sm_121 yet (where people patched it working, it was slower than W4A16 anyway). - **Therefore W4A16 vs Q4_K_M above is the right, reproducible apples-to-apples** for DGX Spark today. - Optional W4A4 retry (verify output isn't zeros first): `VLLM_SKIP_FLASHINFER_AUTOTUNE=1` + - `VLLM_NVFP4_GEMM_BACKEND=cutlass` + `--enforce-eager`, or NVIDIA's `vllm/vllm-openai:cu130-nightly` container. - -## So, honestly, where parity stands - -- **Decode single-stream: already at/above parity** (B=1: 83 vs 48). -- **Decode concurrency: a real, engine-addressable gap** the paged series can narrow (0004 on-demand pool + - 0005 continuous batching). Target: close the 2.5–3.7× at B≥8. -- **Prefill: kernel-bound, not engine-bound.** No amount of paging reaches vLLM here; that's a separate track. - -**Series status when measured:** 0001 (vendor) + 0002 (placement, token-identical) done; 0003 (gather-read) -turn-key-planned, not yet implemented. These numbers are the *baseline* the engine patches must improve on at -B≥8 decode — re-run this table after 0004/0005 to show the concurrency gap closing. diff --git a/backend/cpp/llama-cpp-localai-paged/patches/kernel/0001-fp4-grouped-moe-scaffold.patch b/backend/cpp/llama-cpp-localai-paged/patches/kernel/0001-fp4-grouped-moe-scaffold.patch deleted file mode 100644 index d1920560a..000000000 --- a/backend/cpp/llama-cpp-localai-paged/patches/kernel/0001-fp4-grouped-moe-scaffold.patch +++ /dev/null @@ -1,91 +0,0 @@ -diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cu b/ggml/src/ggml-cuda/fp4-grouped-moe.cu -new file mode 100644 -index 0000000..5f5a782 ---- /dev/null -+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cu -@@ -0,0 +1,46 @@ -+#include "fp4-grouped-moe.cuh" -+ -+#include -+#include -+ -+// SCAFFOLD for the FP4 grouped-GEMM MoE kernel (Lever 3). -+// -+// Why: on GB10 (sm_121) the MoE matmul runs mul_mat_q - a warp-level mma.sync grouped MMQ - -+// at ~22 effective TFLOP/s, ~27x behind vLLM prefill, and it also dominates decode at concurrency -+// (54.6% of GPU time at B=64). It is the single bottleneck to vLLM parity in BOTH phases; paged -+// attention cannot touch it (proven by profiling). The fix is a CUTLASS-3.x collective-mainloop -+// grouped GEMM over all experts, block-scaled e2m1 operands via tcgen05 tensor-memory MMA. -+// -+// This file is the integration seam. It is currently a no-op that always falls back to MMQ, so the -+// default build is byte-identical. The kernel is filled in over the phases in the design doc. -+ -+static bool fp4_grouped_enabled() { -+ static const bool en = (std::getenv("GGML_CUDA_FP4_GROUPED") != nullptr); -+ return en; -+} -+ -+bool ggml_cuda_fp4_grouped_moe( -+ ggml_backend_cuda_context & ctx, -+ const ggml_tensor * src0, -+ const ggml_tensor * src1, -+ const ggml_tensor * ids, -+ ggml_tensor * dst) { -+ GGML_UNUSED(ctx); GGML_UNUSED(src1); GGML_UNUSED(ids); GGML_UNUSED(dst); -+ -+ if (!fp4_grouped_enabled()) { -+ return false; // default: existing MMQ path -+ } -+ if (src0->type != GGML_TYPE_MXFP4 && src0->type != GGML_TYPE_NVFP4) { -+ return false; -+ } -+ -+ // TODO(kernel - see kernel design doc): CUTLASS 3.x GemmGrouped, sm_120a, block-scaled e2m1, -+ // tcgen05 MMA; per-expert problem offsets from `ids`; fused activation quant; numerical parity -+ // vs mul_mat_q before enabling by default. -+ static bool warned = false; -+ if (!warned) { -+ warned = true; -+ fprintf(stderr, "[fp4-grouped] GGML_CUDA_FP4_GROUPED set, kernel not yet implemented - using MMQ\n"); -+ } -+ return false; // scaffold: fall back until the kernel lands -+} -diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cuh b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh -new file mode 100644 -index 0000000..29e1b5a ---- /dev/null -+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh -@@ -0,0 +1,13 @@ -+#pragma once -+ -+#include "common.cuh" -+ -+// Entry point for the tcgen05/CUTLASS block-scaled FP4 (MXFP4/NVFP4) grouped-GEMM MoE kernel for -+// Blackwell consumer GPUs (sm_120/121). Returns true if it handled the op; false to fall back to -+// the existing warp-mma MMQ path. Gated behind GGML_CUDA_FP4_GROUPED until correct + faster. -+bool ggml_cuda_fp4_grouped_moe( -+ ggml_backend_cuda_context & ctx, -+ const ggml_tensor * src0, // expert weights, MXFP4/NVFP4 [n_embd, n_ff, n_expert] -+ const ggml_tensor * src1, // activations, F32 [n_embd, n_tokens, ...] -+ const ggml_tensor * ids, // expert routing, I32 -+ ggml_tensor * dst); // F32 output -diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu -index 8ea462a..104d131 100644 ---- a/ggml/src/ggml-cuda/ggml-cuda.cu -+++ b/ggml/src/ggml-cuda/ggml-cuda.cu -@@ -30,6 +30,7 @@ - #include "ggml-cuda/im2col.cuh" - #include "ggml-cuda/mmf.cuh" - #include "ggml-cuda/mmq.cuh" -+#include "ggml-cuda/fp4-grouped-moe.cuh" - #include "ggml-cuda/mmvf.cuh" - #include "ggml-cuda/mmvq.cuh" - #include "ggml-cuda/norm.cuh" -@@ -2701,6 +2702,7 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor * - } - - if (ggml_cuda_should_use_mmq(src0->type, cc, ne12, /*n_experts=*/ne02)) { -+ if (ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst)) { return; } - ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst); - return; - } diff --git a/backend/cpp/llama-cpp/grpc-server.cpp b/backend/cpp/llama-cpp/grpc-server.cpp index 05b5ea0b7..11db378d3 100644 --- a/backend/cpp/llama-cpp/grpc-server.cpp +++ b/backend/cpp/llama-cpp/grpc-server.cpp @@ -850,7 +850,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt // common_context_params_to_llama (patch 0026) only when the --ssm-bf16-tau CLI flag is unset. // Unset / non-positive => env untouched, so stock stays byte-identical and bit-exact (an // externally exported LLAMA_SSM_BF16_TAU still works as an escape hatch). NOTE: this mode is - // NOT bit-exact (~91% same-top-p ceiling); see patches/paged/README.md (Dev notes). + // NOT bit-exact (~91% same-top-p ceiling); see backend/cpp/llama-cpp-localai-paged/README.md (Dev notes). } else if (!strcmp(optname, "ssm_bf16_tau") || !strcmp(optname, "ssm_hybrid_tau")) { if (optval != NULL) { try { diff --git a/backend/index.yaml b/backend/index.yaml index 4970b9907..db29d621a 100644 --- a/backend/index.yaml +++ b/backend/index.yaml @@ -2354,7 +2354,7 @@ uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant" mirrors: - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant -## llama-cpp-localai-paged (CUDA-only; see backend/cpp/llama-cpp-localai-paged/patches/paged/README.md section 4c) +## llama-cpp-localai-paged (CUDA-only; see backend/cpp/llama-cpp-localai-paged/README.md section 4c) - !!merge <<: *llamacpplocalaipaged name: "cuda12-llama-cpp-localai-paged" uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-llama-cpp-localai-paged" diff --git a/docs/content/features/backends.md b/docs/content/features/backends.md index 95b94c102..22579a24f 100644 --- a/docs/content/features/backends.md +++ b/docs/content/features/backends.md @@ -125,7 +125,7 @@ For getting started, see the available backends in LocalAI here: https://github. LocalAI supports various types of backends: - **LLM Backends**: For running language models (e.g., llama.cpp, vLLM, SGLang, transformers, MLX) - - **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp-localai-paged/patches/paged/README.md` for the quality/throughput profile). + - **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving that head's decode byte stream. Default off (`0`) keeps every head f32 and is bit-exact; when enabled the mode is **not** bit-exact (~91% same-top-p ceiling - see `backend/cpp/llama-cpp-localai-paged/README.md` for the quality/throughput profile). - **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo) - **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS) - **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step) diff --git a/gallery/index.yaml b/gallery/index.yaml index 0fd5a3fe0..5c4df8703 100644 --- a/gallery/index.yaml +++ b/gallery/index.yaml @@ -2,7 +2,7 @@ # ============================================================================= # NVFP4 Qwen3.6 (dense + MoE) for the LocalAI paged-attention llama.cpp backend. # These reproduce the GB10 / DGX Spark benchmark serving config (see -# backend/cpp/llama-cpp-localai-paged/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md section 2). +# backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md section 2). # # PUBLISHED: the dense + MoE base NVFP4 GGUFs are live at huggingface.co/mudler/ # Qwen3.6-27B-NVFP4-GGUF and .../Qwen3.6-35B-A3B-NVFP4-GGUF (file_type MOSTLY_NVFP4); @@ -20,7 +20,7 @@ # persist their state as bf16 (LLAMA_SSM_BF16_TAU), halving that head's decode byte # stream. Default off (0) = every head f32 = bit-exact; when enabled the mode is NOT # bit-exact (~91% same-top-p, beats vLLM dense) - see -# backend/cpp/llama-cpp-localai-paged/patches/paged/README.md for the quality profile. +# backend/cpp/llama-cpp-localai-paged/README.md for the quality profile. # The two NVFP4 entries below intentionally stay bit-exact (no ssm_bf16_tau). # ============================================================================= - name: "qwen3.6-27b-nvfp4-paged"