docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7

The paged backend's llama.cpp pin was reverted from c299a92c back to 9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the reverted sync) is dead weight. The pin-sync PROCESS stays documented in the three live places: the Makefile comment, README section 7 (Pin + maintenance policy), and .agents/llama-cpp-localai-paged-backend.md.

Delete the doc and repoint every reference to it (Makefile, README, .agents, canary script + workflow) at README section 7. No functional paths change: the canary's patches-dir glob (patches/paged/0*.patch) is untouched.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 18:37:43 -04:00 · 2026-06-27 21:34:10 +00:00
parent 53f66a6f03
commit ed5eb705c7
6 changed files with 17 additions and 132 deletions
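
The message says every reference to the deleted record was repointed. One way to confirm that after checking out the commit is a fixed-string search of the working tree. The sketch below is illustrative only: the helper name `find_stale_pin_sync_refs` is hypothetical and is not part of this commit, and it assumes a `grep` that supports `--exclude-dir` (GNU and BSD grep both do).

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT part of this commit): list files under a directory
# that still mention the deleted pin-sync record by name.
find_stale_pin_sync_refs() {
  local root="${1:-.}"
  # -r: recurse, -l: print matching file names only, -F: fixed string (no regex).
  # .git is skipped so packed history does not produce false hits.
  # `|| true` keeps the "no matches" case (grep exit status 1) from failing
  # callers that run under `set -e`.
  grep -rlF --exclude-dir=.git 'PIN_SYNC_c299a92c' "$root" || true
}
```

Run from the repository root as `find_stale_pin_sync_refs .`; empty output means no tracked or untracked file in the working tree still names the removed doc. The search covers only the checked-out tree, not git history.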
--- a/.agents/llama-cpp-localai-paged-backend.md
+++ b/.agents/llama-cpp-localai-paged-backend.md
@@ -20,8 +20,8 @@ how-to.
 - `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch`
  series (0001-0030), nothing else.
 - `backend/cpp/llama-cpp-localai-paged/README.md` - the canonical doc. The
-  operational docs (`PIN_SYNC_*.md`, `PAGED_BITEXACT_NOTE.md`,
-  `UPSTREAM_LAYER2_SCOPE.md`) and dev artifacts live in
+  operational docs (`PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`) and
+  dev artifacts live in
  `backend/cpp/llama-cpp-localai-paged/docs/`.
 - `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh`
  - the CUDA build entry points.
@@ -55,7 +55,7 @@ and break `git apply` at build time.
 1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml`
   runs weekly: it applies + builds the series against the latest upstream tip and
   goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync.
-2. **The pin-sync** (recorded in `docs/PIN_SYNC_*.md`): rebase the series onto the new
+2. **The pin-sync** (recorded in the README section 7 and git history): rebase the series onto the new
   tip (resolve conflicts; re-export **source-only** with a pathspec like
   `-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box,
   pass the bit-exact gate on **every** path + `test-backend-ops`, **and confirm
--- a/.github/scripts/paged-canary-apply.sh
+++ b/.github/scripts/paged-canary-apply.sh
@@ -27,8 +27,8 @@
 # missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028
 # build on 0019's code, the rejection cascades to them too. This is a
 # PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
-# upstream break (see backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
-# and backend/cpp/llama-cpp-localai-paged/README.md). We exclude ONLY that dev-doc path and still
+# upstream break (see backend/cpp/llama-cpp-localai-paged/README.md section 7,
+# "Pin + maintenance policy"). We exclude ONLY that dev-doc path and still
 # apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
 # still fails the canary. prepare.sh tolerates the same hunk via
 # `patch ... || true`; this mirrors that tolerance precisely.
@@ -53,7 +53,7 @@ apply_one() {
  echo "paged-canary: applying $(basename "$p")"
  if ! git apply --verbose "$@" "$p"; then
    echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")"
-    echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md), do NOT bump the pin blindly"
+    echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (see backend/cpp/llama-cpp-localai-paged/README.md section 7, Pin + maintenance policy), do NOT bump the pin blindly"
    exit 1
  fi
 }
--- a/.github/workflows/llama-cpp-paged-canary.yml
+++ b/.github/workflows/llama-cpp-paged-canary.yml
@@ -16,8 +16,9 @@ name: 'llama.cpp paged patches: upstream canary'
 #
 # RED HERE means: time to run a PIN_SYNC (rebase the patches onto the new tip,
 # pass the bit-exact gate on the GPU, re-export the .patch files, THEN advance
-# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See
-# backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md.
+# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See the backend README
+# section 7 (Pin + maintenance policy):
+# backend/cpp/llama-cpp-localai-paged/README.md.
 #
 # SIGNAL-ONLY: this workflow moves no pinned version, ships nothing, and is fully
 # decoupled from bump_deps - so the main dep-bump PR stays green regardless. A
--- a/backend/cpp/llama-cpp-localai-paged/Makefile
+++ b/backend/cpp/llama-cpp-localai-paged/Makefile
@@ -9,7 +9,8 @@
 # Pin handling (mirrors the turboquant wrapper, the precedent this is modelled
 # on): the paged patch series is hand-verified bit-exact against ONE specific
 # llama.cpp tip and re-exported by the manual PIN_SYNC process
-# (docs/PIN_SYNC_*.md). A naive pin bump would move the tip out from
+# (README section 7 + .agents/llama-cpp-localai-paged-backend.md). A naive
+# pin bump would move the tip out from
 # under the patches and break `git apply` at build time, so this backend OWNS
 # its pin (LLAMA_VERSION below) instead of inheriting the auto-bumped stock pin
 # from backend/cpp/llama-cpp/Makefile. The override is forced into every copied
@@ -30,7 +31,7 @@
 # the nightly llama.cpp bump cannot silently break the vendored paged patches.
 # Advance ONLY via the PIN_SYNC process (rebase patches + bit-exact gate +
 # re-export), then update this value. See:
-#   backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_*.md
+#   README section 7 + .agents/llama-cpp-localai-paged-backend.md
 #
 # This pin = the manual, verified sync. The signal telling you WHEN to do the
 # next sync is the early-warning canary
@@ -47,7 +48,7 @@
 # grpc-server.cpp failed to link with undefined references to stream_* server
 # helpers that the refactor pulled into the headers grpc-server.cpp includes.
 # Therefore a PIN_SYNC must pass the FULL grpc-server build/link on CI, not only
-# the bit-exact gate. See docs/PIN_SYNC_c299a92c.md.
+# the bit-exact gate. See README section 7 + .agents/llama-cpp-localai-paged-backend.md.
 LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -7,7 +7,6 @@ here is a fork - it is a source-only `*.patch` stack plus this canonical doc.

 > One-file rule: this README is the canonical reference for the patch series. The
 > only other docs are operational, kept in `docs/`, and linked below:
-> - [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
 > - [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
 > - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.

@@ -31,9 +30,8 @@ vendored patch series over upstream llama.cpp that adds
  GEMM - dominates the decode step.

 It is **pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` backend's
-pin) and advanced only by a manual, bit-exact-gated
-[pin-sync process](docs/PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
-(see section 7). The pin must stay aligned with the stock pin because
+pin) and advanced only by a manual, bit-exact-gated pin-sync process (see
+section 7, "Pin + maintenance policy"), decoupled from the nightly auto-bumper. The pin must stay aligned with the stock pin because
 `grpc-server.cpp` is shared; an earlier bump to `c299a92c` was bit-exact but broke
 the grpc-server link and was reverted.

@@ -327,7 +325,7 @@ in a recommended/gallery config.
 ## 7. Pin + maintenance policy

 - **Pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` pin). The pin
-  is advanced **only** by the manual [`PIN_SYNC`](docs/PIN_SYNC_c299a92c.md) process:
+  is advanced **only** by the manual pin-sync process (this section):
  rebase the source-only patch series onto the new tip, rebuild on GPU, pass the
  bit-exact gate on every path (dense + MoE, paged + non-paged) plus
  `test-backend-ops`, **and confirm the full grpc-server build links on CI**.
@@ -345,8 +343,7 @@ in a recommended/gallery config.
  (via [`.github/scripts/paged-canary-apply.sh`](../../../.github/scripts/paged-canary-apply.sh))
  tries the patch series against the latest upstream tip with the build's own
  strict `git apply`. **Red = upstream drifted past the series -> run a
-  PIN_SYNC** (do not bump the pin blindly). The canary references
-  [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md).
+  PIN_SYNC** (do not bump the pin blindly), following the policy in this section.

 ---

--- a/backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
@@ -1,114 +0,0 @@
-# Pin-sync: paged patch-stack -> llama.cpp c299a92c
-
-> **Status: REVERTED. The active pin is back at `9d5d882d`.** This bump was
-> bit-exact but **broke the CI grpc-server build/link**. `grpc-server.cpp` is
-> shared with the stock `llama-cpp` backend and tracks the stock pin (`9d5d882d`);
-> `c299a92c`'s upstream server-API refactor pulled `stream_*` helpers into the
-> headers grpc-server.cpp includes, and their definitions are not compiled by the
-> stock-aligned build, so every paged variant failed to link
-> (`undefined reference to stream_aware_should_stop / stream_pipe_producer::cleanup
-> / stream_session_attach_pipe`). **Lesson: a paged pin-sync must pass the FULL CI
-> grpc-server build, not only the greedy-md5 bit-exact gate, and the paged pin must
-> stay == the stock pin (or the backend must vendor a pin-matched grpc-server.cpp,
-> which we deliberately avoid to keep stock pure).** The bit-exactness findings
-> below remain valid for `c299a92c`; only the build/link blocks shipping it.
-
-Status (original, patch-level only): COMPLETE. The shipped source-only paged patch
-series (`0001`-`0030`, 28 `.patch` files) was advanced from llama.cpp `9d5d882d` to
-`c299a92c` ("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
-GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
-path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
-upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
-
-## Upstream jump
-
- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
-  ("model : Add label for LFM2.5-230M (#25008)")
- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
-  ("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
-
-## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
-
-Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
-**zero patch changes**. The already-shipped source-only series (the result of the
-`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
-`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
-`git apply`** (the `apply-paged-patches` step in
-`backend/cpp/llama-cpp-localai-paged/Makefile`:
-`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
-28 patches reported "Applied patch ... cleanly", the sentinel
-`src/paged-kv-manager.cpp` was created, and there are **zero** stray
-`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
-intact). git apply tolerates `@@` line-number offsets, which absorbed the
-upstream drift; no hunk context broke.
-
-Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
-patch tarball used for the verification has
-`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
-
-## Clean build
-
-Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
-28 patches applied as working-tree changes, then:
-
-```
-cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
-  -DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
-  -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
-cmake --build build-cuda --target llama-completion test-backend-ops -j20
-```
-
-Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
-`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
-
-## GATE: ALL GREEN
-
-Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
-`9d5d882d` build too):
-```
-llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
-                 -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
-# paged dense: prefix  LLAMA_KV_PAGED=1
-# paged MoE:   prefix  LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
-```
-
-(a) greedy md5 - all four paths PASS:
-| path | model | md5 @ c299a92c | baseline | verdict |
-|------|-------|----------------|----------|---------|
-| non-paged | dense `q36-27b-nvfp4`   | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
-| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
-| paged     | dense `q36-27b-nvfp4`   | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
-| paged     | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
-
-(b) `test-backend-ops` (Backend CUDA0) - all PASS:
-| op | result |
-|----|--------|
-| SSM_CONV            | 45/45 OK |
-| SSM_CONV_UPDATE     | 16/16 OK |
-| SSM_CONV_UPDATE_IDS | 16/16 OK |
-| GATED_DELTA_NET     | 84/84 OK |
-| MUL_MAT             | 1146/1146 OK |
-| MUL_MAT_ID          | 806/806 OK |
-
-(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
-series now carries patches `0026`/`0028`'s added per-head/gather test cases; all
-pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
-
-Bit-exactness preserved across the 23-commit upstream jump.
-
-## Canary
-
-`.github/workflows/llama-cpp-paged-canary.yml` and
-`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
-series is source-only and applies strict-clean with no `--exclude`, the canary's
-`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
-the shipped series) and may be removed on a future canary touch; left in place
-here to keep the pin-bump diff minimal.
-
-## Source of truth
-
-The shipped `.patch` files under `backend/cpp/llama-cpp-localai-paged/patches/paged/` are the
-source of truth and are unchanged by this bump. The DGX dev tree
-(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
-the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.