docs(paged): refresh parity handoff state

Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight.

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-06-30 15:27:44 +00:00
parent 1b9176c2c8
commit de34cd5954
3 changed files with 11 additions and 9 deletions
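
The pin reconciliation described in the commit message could be checked mechanically rather than by eye. The following is a minimal sketch, not part of this commit: it reads the `LLAMA_VERSION` assignment out of Makefile text and flags doc prose that names a different commit. The helper names (`read_pin`, `same_commit`, `stale_prose_pins`) and the prose-matching regex are assumptions made for illustration; the regex only looks for a backticked hex id shortly after the word "pin" or "pinned" on the same line.

```python
import re

# Matches a Makefile assignment such as `LLAMA_VERSION?=<sha>` or `LLAMA_VERSION=<sha>`.
_PIN_RE = re.compile(r"^LLAMA_VERSION\s*\??=\s*([0-9a-f]{7,40})\s*$", re.MULTILINE)

# ASSUMPTION: doc prose states the pin as a backticked hex id within 40
# characters after the word "pin" or "pinned", on the same line.
_PROSE_RE = re.compile(r"pin(?:ned)?\b[^`\n]{0,40}`([0-9a-f]{7,40})`", re.IGNORECASE)


def read_pin(makefile_text: str) -> str:
    """Return the commit id assigned to LLAMA_VERSION in the given Makefile text."""
    match = _PIN_RE.search(makefile_text)
    if match is None:
        raise ValueError("no LLAMA_VERSION assignment found")
    return match.group(1)


def same_commit(a: str, b: str) -> bool:
    """True when two hex commit ids agree; either may be an abbreviation of the other."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return False
    return a.startswith(b) or b.startswith(a)


def stale_prose_pins(doc_text: str, pin: str) -> list[str]:
    """Return every pin-like id in the prose that does not match the Makefile pin."""
    return [found for found in _PROSE_RE.findall(doc_text) if not same_commit(found, pin)]
```

Run against the paged backend's Makefile and each doc touched below, an empty list from `stale_prose_pins` would mean the prose agrees with what actually builds. Applying `read_pin` to both the paged and the stock `llama-cpp` Makefiles and comparing the results with `same_commit` would cover the "paged pin must equal the stock pin" rule. The location of the stock Makefile is not shown in this diff.
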
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -30,11 +30,11 @@ vendored patch series over upstream llama.cpp that adds
  gated-DeltaNet (SSM) models, where the recurrent-state plumbing - not the FP4
  GEMM - dominates the decode step.

-It is **pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` backend's
+It is **pinned to llama.cpp `0ed235ea2c17a19fc8238668653946721ed136fd`** (kept == the stock `llama-cpp` backend's
 pin) and advanced only by a manual, bit-exact-gated pin-sync process (see
 section 7, "Pin + maintenance policy"), decoupled from the nightly auto-bumper. The pin must stay aligned with the stock pin because
 `grpc-server.cpp` is shared; an earlier bump to `c299a92c` was bit-exact but broke
-the grpc-server link and was reverted.
+the grpc-server link and was reverted to the then-current stock pin.

 The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is
 enabled per-model at runtime via the gallery `options:` knobs (`paged_kv:true`,
@@ -497,7 +497,7 @@ targeted is already recovered by the gather-fusion + block-table cache.
  per commit) from that branch, which is the pin commit plus the paged patch
  commits in order, so there is no more hand-export drift between the dev tree and
  the shipped series.
- **Pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` pin). The pin
+- **Pinned to llama.cpp `0ed235ea2c17a19fc8238668653946721ed136fd`** (kept == the stock `llama-cpp` pin). The pin
  is advanced **only** by the manual pin-sync process (this section):
  rebase the source-only patch series onto the new tip, rebuild on GPU, pass the
  bit-exact gate on every path (dense + MoE, paged + non-paged) plus
@@ -507,7 +507,7 @@ targeted is already recovered by the gather-fusion + block-table cache.
  server-API refactor breaks the grpc-server LINK even when the patches are
  bit-exact. A bump to `c299a92c` (23 commits ahead of stock) was greedy-md5
  bit-exact but failed to link (undefined `stream_*` server helpers introduced by
-  the refactor), and was reverted to `9d5d882d`. The bit-exact gate alone does not
+  the refactor), and was reverted to the then-current stock pin. The bit-exact gate alone does not
  catch this; only the full CI grpc-server build does.
 - **Decoupled from the nightly auto-bumper.** There is deliberately **no**
  `bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -87,7 +87,7 @@ Because the dir now permanently contains an `owner` file, **release with `rm -rf

 A separate 0-byte `~/bench/gpu.lock` is legacy/unrelated - ignore.

-**Always gate on BOTH** `nvidia-smi --query-compute-apps=pid` count == 0 **and** `owner` FREE before benching. Concurrent jobs share this GPU: an offline-repack Marlin workflow, an `~/.cache/autoresearch-quant/` quant pipeline (this is the `llama-imatrix` class of job), and finetune trees. The canonical harnesses poll for GPU-idle up to 2h.
+**Always gate on ALL THREE** before benching or building on DGX: `nvidia-smi --query-compute-apps=pid` count == 0, `owner` FREE, and `docker ps` shows no running containers. In particular, do not start work while a `local-ai-worker` container is running. Concurrent jobs share this GPU: an offline-repack Marlin workflow, an `~/.cache/autoresearch-quant/` quant pipeline (this is the `llama-imatrix` class of job), finetune trees, and LocalAI worker containers. The canonical harnesses poll for GPU-idle up to 2h.

 ### 3.2 Build (long; run detached + poll)
 - **Mainline / canonical grpc-server + binaries: CUDA arch `121`** (`-DCMAKE_CUDA_ARCHITECTURES=121`). Runtime banner shows `ARCHS = 1210 | BLACKWELL_NATIVE_FP4 = 1`.
@@ -268,16 +268,16 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

-### Unpushed doc commits (in this worktree, not on origin)
+### Recent context commits
 - `6edbb56b0` "docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED)" - adds `VLLM_PARITY_FINAL.md`.
 - `baf102524` "docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced)" - the ~56% -> ~86% correction.
 - `bd100dd20` "fix(paged): repair the patch series, sync to the fork branch" - dropped dev-tree 0044/0045, added f32-only M5 as 0047.
 - `b028c81ed` "docs(paged): record padded/fixed-slot decode shape as tested-and-rejected".

 ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
-1. **Pin mismatch.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` (authoritative, what builds; recent `ea72a56e2` / `2c5980526` pin-synced to it) vs README section 7 prose `9d5d882d` and `VLLM_PARITY_FINAL.md` "backend pin 9d5d882d" (STALE). Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile; fix the prose.
+1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
 2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the fork HEAD (`51168c5ee`, patch 0044) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
-3. **Worktree patch 0044 is committed on the fork but untracked here** (`patches/paged/0044-*.patch` shows `??`).
+3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`.
 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
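
The three-part gate added to section 3.1 in this hunk (no compute apps, `owner` FREE, no running containers) can be sketched as one check that reports every blocker at once. This is an illustrative sketch, not the canonical harness: the function names are hypothetical, the `owner` file is assumed to contain the literal word `FREE` when unclaimed, and its path is left to the caller because the diff does not show it. The two commands used are `nvidia-smi --query-compute-apps=pid --format=csv,noheader` and `docker ps --format '{{.Names}}'`.

```python
import subprocess
from pathlib import Path


def _nonblank(text: str) -> list[str]:
    """Return the non-empty, stripped lines of a command's output."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def gpu_is_idle(compute_apps_csv: str) -> bool:
    """One PID per line is expected; no lines means no compute processes."""
    return not _nonblank(compute_apps_csv)


def owner_is_free(owner_text: str) -> bool:
    # ASSUMPTION: the owner file holds exactly "FREE" when nobody has claimed the GPU.
    return owner_text.strip() == "FREE"


def preflight(compute_apps_csv: str, owner_text: str, docker_names: str) -> list[str]:
    """Return the reasons not to start; an empty list means all three gates pass."""
    blockers = []
    if not gpu_is_idle(compute_apps_csv):
        blockers.append("GPU busy: compute PIDs " + ", ".join(_nonblank(compute_apps_csv)))
    if not owner_is_free(owner_text):
        blockers.append("owner not FREE: " + owner_text.strip())
    containers = _nonblank(docker_names)
    if containers:
        blockers.append("containers running: " + ", ".join(containers))
    return blockers


def _run(cmd: list[str]) -> str:
    """Run a command and return its stdout, raising if it exits non-zero."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def collect_and_check(owner_path: str) -> list[str]:
    """Gather live state on the host and apply the gate (owner_path is caller-supplied)."""
    return preflight(
        _run(["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"]),
        Path(owner_path).read_text(),
        _run(["docker", "ps", "--format", "{{.Names}}"]),
    )
```

Keeping `preflight` a pure function of three strings makes the gate testable off the DGX host, and listing container names (rather than only counting them) surfaces a running `local-ai-worker` directly in the blocker message.
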

--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -33,7 +33,9 @@ Source key (every number below cites one of these):
 Two models: the MoE **Qwen3.6-35B-A3B-NVFP4** (decision model, 256 experts top-8,
 30 GDN + 10 full-attn layers + a dense shared expert per layer) and the dense
 **Qwen3.6-27B-NVFP4** (48 GDN + 16 full-attn). All numbers GB10 / CUDA 13 /
-sm_121, backend pin `9d5d882d`.
+sm_121. The current backend pin is `0ed235ea2c17a19fc8238668653946721ed136fd`;
+the CDEF benchmark artifact itself records the dev-tree commit that produced
+those binaries.

 ### 1a. Prefill (S_PP, prefill tokens/s)