diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md index 087c20a34..1d29cc1f3 100644 --- a/backend/cpp/llama-cpp-localai-paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -30,11 +30,11 @@ vendored patch series over upstream llama.cpp that adds gated-DeltaNet (SSM) models, where the recurrent-state plumbing - not the FP4 GEMM - dominates the decode step. -It is **pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` backend's +It is **pinned to llama.cpp `0ed235ea2c17a19fc8238668653946721ed136fd`** (kept == the stock `llama-cpp` backend's pin) and advanced only by a manual, bit-exact-gated pin-sync process (see section 7, "Pin + maintenance policy"), decoupled from the nightly auto-bumper. The pin must stay aligned with the stock pin because `grpc-server.cpp` is shared; an earlier bump to `c299a92c` was bit-exact but broke -the grpc-server link and was reverted. +the grpc-server link and was reverted to the then-current stock pin. The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is enabled per-model at runtime via the gallery `options:` knobs (`paged_kv:true`, @@ -497,7 +497,7 @@ targeted is already recovered by the gather-fusion + block-table cache. per commit) from that branch, which is the pin commit plus the paged patch commits in order, so there is no more hand-export drift between the dev tree and the shipped series. -- **Pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` pin). The pin +- **Pinned to llama.cpp `0ed235ea2c17a19fc8238668653946721ed136fd`** (kept == the stock `llama-cpp` pin). The pin is advanced **only** by the manual pin-sync process (this section): rebase the source-only patch series onto the new tip, rebuild on GPU, pass the bit-exact gate on every path (dense + MoE, paged + non-paged) plus @@ -507,7 +507,7 @@ targeted is already recovered by the gather-fusion + block-table cache. 
server-API refactor breaks the grpc-server LINK even when the patches are bit-exact. A bump to `c299a92c` (23 commits ahead of stock) was greedy-md5 bit-exact but failed to link (undefined `stream_*` server helpers introduced by - the refactor), and was reverted to `9d5d882d`. The bit-exact gate alone does not + the refactor), and was reverted to the then-current stock pin. The bit-exact gate alone does not catch this; only the full CI grpc-server build does. - **Decoupled from the nightly auto-bumper.** There is deliberately **no** `bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 0084fb4f0..23ab4ce18 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -87,7 +87,7 @@ Because the dir now permanently contains an `owner` file, **release with `rm -rf A separate 0-byte `~/bench/gpu.lock` is legacy/unrelated - ignore. -**Always gate on BOTH** `nvidia-smi --query-compute-apps=pid` count == 0 **and** `owner` FREE before benching. Concurrent jobs share this GPU: an offline-repack Marlin workflow, an `~/.cache/autoresearch-quant/` quant pipeline (this is the `llama-imatrix` class of job), and finetune trees. The canonical harnesses poll for GPU-idle up to 2h. +**Always gate on ALL THREE** before benching or building on DGX: `nvidia-smi --query-compute-apps=pid` count == 0, `owner` FREE, and `docker ps` shows no running containers. In particular, do not start work while a `local-ai-worker` container is running. Concurrent jobs share this GPU: an offline-repack Marlin workflow, an `~/.cache/autoresearch-quant/` quant pipeline (this is the `llama-imatrix` class of job), finetune trees, and LocalAI worker containers. The canonical harnesses poll for GPU-idle up to 2h. 
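The three-way gate above can be sketched as a small pre-flight check. This is a minimal illustration, not the canonical harness: the helper names (`count_lines`, `gpu_gate_open`, `probe`) are hypothetical, the `owner` file path is left to the caller, and it assumes the `owner` file holds the literal text `FREE` when the lock is released. The decision logic is kept separate from the command calls so it can be checked without a GPU.

```python
import subprocess
from pathlib import Path


def count_lines(text: str) -> int:
    """Count non-empty lines in a command's output."""
    return sum(1 for line in text.splitlines() if line.strip())


def gpu_gate_open(compute_pids: str, owner_text: str, docker_ps: str) -> bool:
    """Return True only when all three gate conditions hold."""
    # 1. nvidia-smi reports no compute processes.
    no_compute_apps = count_lines(compute_pids) == 0
    # 2. ASSUMPTION: the owner file contains exactly "FREE" when unlocked.
    owner_free = owner_text.strip() == "FREE"
    # 3. docker ps lists no running containers (e.g. no local-ai-worker).
    no_containers = count_lines(docker_ps) == 0
    return no_compute_apps and owner_free and no_containers


def probe(owner_file: Path) -> bool:
    """Run the real commands. owner_file is a hypothetical, caller-supplied path."""
    pids = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    # `docker ps -q` prints one container ID per running container.
    containers = subprocess.run(
        ["docker", "ps", "-q"],
        capture_output=True, text=True, check=True,
    ).stdout
    return gpu_gate_open(pids, owner_file.read_text(), containers)
```

A caller would poll `probe(...)` with the real `owner` path until it returns `True`, then take the lock before benching or building. Any single failing condition keeps the gate closed.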
### 3.2 Build (long; run detached + poll) - **Mainline / canonical grpc-server + binaries: CUDA arch `121`** (`-DCMAKE_CUDA_ARCHITECTURES=121`). Runtime banner shows `ARCHS = 1210 | BLACKWELL_NATIVE_FP4 = 1`. @@ -268,16 +268,16 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. -### Unpushed doc commits (in this worktree, not on origin) +### Recent context commits - `6edbb56b0` "docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED)" - adds `VLLM_PARITY_FINAL.md`. - `baf102524` "docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced)" - the ~56% -> ~86% correction. - `bd100dd20` "fix(paged): repair the patch series, sync to the fork branch" - dropped dev-tree 0044/0045, added f32-only M5 as 0047. - `b028c81ed` "docs(paged): record padded/fixed-slot decode shape as tested-and-rejected". ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels) -1. **Pin mismatch.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` (authoritative, what builds; recent `ea72a56e2` / `2c5980526` pin-synced to it) vs README section 7 prose `9d5d882d` and `VLLM_PARITY_FINAL.md` "backend pin 9d5d882d" (STALE). Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile; fix the prose. +1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. 
Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building. 2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the fork HEAD (`51168c5ee`, patch 0044) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers. -3. **Worktree patch 0044 is committed on the fork but untracked here** (`patches/paged/0044-*.patch` shows `??`). +3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`. 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch). 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. 
The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md index 1f1342348..28ee15268 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md @@ -33,7 +33,9 @@ Source key (every number below cites one of these): Two models: the MoE **Qwen3.6-35B-A3B-NVFP4** (decision model, 256 experts top-8, 30 GDN + 10 full-attn layers + a dense shared expert per layer) and the dense **Qwen3.6-27B-NVFP4** (48 GDN + 16 full-attn). All numbers GB10 / CUDA 13 / -sm_121, backend pin `9d5d882d`. +sm_121. The current backend pin is `0ed235ea2c17a19fc8238668653946721ed136fd`; +the CDEF benchmark artifact itself records the dev-tree commit that produced +those binaries. ### 1a. Prefill (S_PP, prefill tokens/s)