docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7

The paged backend's llama.cpp pin was reverted from c299a92c back to
9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the
reverted sync) is dead weight. The pin-sync PROCESS stays documented in
the three live places: the Makefile comment, README section 7 (Pin +
maintenance policy), and .agents/llama-cpp-localai-paged-backend.md.

Delete the doc and repoint every reference to it (Makefile, README,
.agents, canary script + workflow) at README section 7. No functional
paths change: the canary's patches-dir glob (patches/paged/0*.patch)
is untouched.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-27 21:34:10 +00:00
parent 53f66a6f03
commit ed5eb705c7
6 changed files with 17 additions and 132 deletions
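
One way to confirm that no reference to the deleted record survives the repointing is a recursive fixed-string search. The sketch below is an illustration, not part of the commit: `stale_pin_sync_refs` is a hypothetical helper, and the needle `PIN_SYNC_` (with the trailing underscore) is an assumption chosen so that file-name references such as `PIN_SYNC_*.md` match while plain mentions of the PIN_SYNC process do not.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): list files under a root
# directory that still mention a PIN_SYNC_* record by file name.
stale_pin_sync_refs() {
    local root="${1:-.}"
    # -r: recurse, -l: print matching file names only, -F: fixed string.
    # .git is skipped so history objects do not count as live references.
    # `|| true` keeps a "no match" result from looking like an error.
    grep -rlF --exclude-dir=.git 'PIN_SYNC_' "$root" || true
}
```

Run from the repository root as `stale_pin_sync_refs .`; empty output would mean nothing still points at the removed file.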

.agents/llama-cpp-localai-paged-backend.md

@@ -20,8 +20,8 @@ how-to.
- `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch`
series (0001-0030), nothing else.
- `backend/cpp/llama-cpp-localai-paged/README.md` - the canonical doc. The
-operational docs (`PIN_SYNC_*.md`, `PAGED_BITEXACT_NOTE.md`,
-`UPSTREAM_LAYER2_SCOPE.md`) and dev artifacts live in
+operational docs (`PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`) and
+dev artifacts live in
`backend/cpp/llama-cpp-localai-paged/docs/`.
- `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh`
- the CUDA build entry points.
@@ -55,7 +55,7 @@ and break `git apply` at build time.
1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml`
runs weekly: it applies + builds the series against the latest upstream tip and
goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync.
-2. **The pin-sync** (recorded in `docs/PIN_SYNC_*.md`): rebase the series onto the new
+2. **The pin-sync** (recorded in the README section 7 and git history): rebase the series onto the new
tip (resolve conflicts; re-export **source-only** with a pathspec like
`-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box,
pass the bit-exact gate on **every** path + `test-backend-ops`, **and confirm

.github/scripts/paged-canary-apply.sh

@@ -27,8 +27,8 @@
# missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028
# build on 0019's code, the rejection cascades to them too. This is a
# PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
-# upstream break (see backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
-# and backend/cpp/llama-cpp-localai-paged/README.md). We exclude ONLY that dev-doc path and still
+# upstream break (see backend/cpp/llama-cpp-localai-paged/README.md section 7,
+# "Pin + maintenance policy"). We exclude ONLY that dev-doc path and still
# apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
# still fails the canary. prepare.sh tolerates the same hunk via
# `patch ... || true`; this mirrors that tolerance precisely.
@@ -53,7 +53,7 @@ apply_one() {
echo "paged-canary: applying $(basename "$p")"
if ! git apply --verbose "$@" "$p"; then
echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")"
-echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md), do NOT bump the pin blindly"
+echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (see backend/cpp/llama-cpp-localai-paged/README.md section 7, Pin + maintenance policy), do NOT bump the pin blindly"
exit 1
fi
}

.github/workflows/llama-cpp-paged-canary.yml

@@ -16,8 +16,9 @@ name: 'llama.cpp paged patches: upstream canary'
#
# RED HERE means: time to run a PIN_SYNC (rebase the patches onto the new tip,
# pass the bit-exact gate on the GPU, re-export the .patch files, THEN advance
-# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See
-# backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md.
+# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See the backend README
+# section 7 (Pin + maintenance policy):
+# backend/cpp/llama-cpp-localai-paged/README.md.
#
# SIGNAL-ONLY: this workflow moves no pinned version, ships nothing, and is fully
# decoupled from bump_deps - so the main dep-bump PR stays green regardless. A

backend/cpp/llama-cpp-localai-paged/Makefile

@@ -9,7 +9,8 @@
# Pin handling (mirrors the turboquant wrapper, the precedent this is modelled
# on): the paged patch series is hand-verified bit-exact against ONE specific
# llama.cpp tip and re-exported by the manual PIN_SYNC process
-# (docs/PIN_SYNC_*.md). A naive pin bump would move the tip out from
+# (README section 7 + .agents/llama-cpp-localai-paged-backend.md). A naive
+# pin bump would move the tip out from
# under the patches and break `git apply` at build time, so this backend OWNS
# its pin (LLAMA_VERSION below) instead of inheriting the auto-bumped stock pin
# from backend/cpp/llama-cpp/Makefile. The override is forced into every copied
@@ -30,7 +31,7 @@
# the nightly llama.cpp bump cannot silently break the vendored paged patches.
# Advance ONLY via the PIN_SYNC process (rebase patches + bit-exact gate +
# re-export), then update this value. See:
-# backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_*.md
+# README section 7 + .agents/llama-cpp-localai-paged-backend.md
#
# This pin = the manual, verified sync. The signal telling you WHEN to do the
# next sync is the early-warning canary
@@ -47,7 +48,7 @@
# grpc-server.cpp failed to link with undefined references to stream_* server
# helpers that the refactor pulled into the headers grpc-server.cpp includes.
# Therefore a PIN_SYNC must pass the FULL grpc-server build/link on CI, not only
-# the bit-exact gate. See docs/PIN_SYNC_c299a92c.md.
+# the bit-exact gate. See README section 7 + .agents/llama-cpp-localai-paged-backend.md.
LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
CMAKE_ARGS?=

backend/cpp/llama-cpp-localai-paged/README.md

@@ -7,7 +7,6 @@ here is a fork - it is a source-only `*.patch` stack plus this canonical doc.
> One-file rule: this README is the canonical reference for the patch series. The
> only other docs are operational, kept in `docs/`, and linked below:
-> - [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
> - [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.
@@ -31,9 +30,8 @@ vendored patch series over upstream llama.cpp that adds
GEMM - dominates the decode step.
It is **pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` backend's
pin) and advanced only by a manual, bit-exact-gated
-[pin-sync process](docs/PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
-(see section 7). The pin must stay aligned with the stock pin because
+pin) and advanced only by a manual, bit-exact-gated pin-sync process (see
+section 7, "Pin + maintenance policy"), decoupled from the nightly auto-bumper. The pin must stay aligned with the stock pin because
`grpc-server.cpp` is shared; an earlier bump to `c299a92c` was bit-exact but broke
the grpc-server link and was reverted.
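
The alignment requirement stated here (paged pin == stock pin) lends itself to a mechanical check. The following is a minimal sketch under stated assumptions: `read_pin` and `pins_aligned` are hypothetical helpers, not scripts in the repository, and the stock `backend/cpp/llama-cpp/Makefile` is assumed to assign `LLAMA_VERSION` on a single line in the same `LLAMA_VERSION?=<sha>` form shown for the paged Makefile.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value assigned to LLAMA_VERSION in a Makefile.
# Accepts `=`, `?=`, or `:=`; only the first assignment is reported.
read_pin() {
    sed -n 's/^LLAMA_VERSION[?:]\{0,1\}=[[:space:]]*//p' "$1" | head -n 1
}

# Hypothetical helper: succeed only if both Makefiles carry the same,
# non-empty pin.
pins_aligned() {
    local a b
    a=$(read_pin "$1")
    b=$(read_pin "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

A possible invocation from the repository root would be `pins_aligned backend/cpp/llama-cpp-localai-paged/Makefile backend/cpp/llama-cpp/Makefile || echo "pins diverged"`.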
@@ -327,7 +325,7 @@ in a recommended/gallery config.
## 7. Pin + maintenance policy
- **Pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` pin). The pin
-is advanced **only** by the manual [`PIN_SYNC`](docs/PIN_SYNC_c299a92c.md) process:
+is advanced **only** by the manual pin-sync process (this section):
rebase the source-only patch series onto the new tip, rebuild on GPU, pass the
bit-exact gate on every path (dense + MoE, paged + non-paged) plus
`test-backend-ops`, **and confirm the full grpc-server build links on CI**.
@@ -345,8 +343,7 @@ in a recommended/gallery config.
(via [`.github/scripts/paged-canary-apply.sh`](../../../.github/scripts/paged-canary-apply.sh))
tries the patch series against the latest upstream tip with the build's own
strict `git apply`. **Red = upstream drifted past the series -> run a
-PIN_SYNC** (do not bump the pin blindly). The canary references
-[`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md).
+PIN_SYNC** (do not bump the pin blindly), following the policy in this section.
---

backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md

@@ -1,114 +0,0 @@
# Pin-sync: paged patch-stack -> llama.cpp c299a92c
> **Status: REVERTED. The active pin is back at `9d5d882d`.** This bump was
> bit-exact but **broke the CI grpc-server build/link**. `grpc-server.cpp` is
> shared with the stock `llama-cpp` backend and tracks the stock pin (`9d5d882d`);
> `c299a92c`'s upstream server-API refactor pulled `stream_*` helpers into the
> headers grpc-server.cpp includes, and their definitions are not compiled by the
> stock-aligned build, so every paged variant failed to link
> (`undefined reference to stream_aware_should_stop / stream_pipe_producer::cleanup
> / stream_session_attach_pipe`). **Lesson: a paged pin-sync must pass the FULL CI
> grpc-server build, not only the greedy-md5 bit-exact gate, and the paged pin must
> stay == the stock pin (or the backend must vendor a pin-matched grpc-server.cpp,
> which we deliberately avoid to keep stock pure).** The bit-exactness findings
> below remain valid for `c299a92c`; only the build/link blocks shipping it.
Status (original, patch-level only): COMPLETE. The shipped source-only paged patch
series (`0001`-`0030`, 28 `.patch` files) was advanced from llama.cpp `9d5d882d` to
`c299a92c` ("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
## Upstream jump
- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
("model : Add label for LFM2.5-230M (#25008)")
- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
**zero patch changes**. The already-shipped source-only series (the result of the
`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
`git apply`** (the `apply-paged-patches` step in
`backend/cpp/llama-cpp-localai-paged/Makefile`:
`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
28 patches reported "Applied patch ... cleanly", the sentinel
`src/paged-kv-manager.cpp` was created, and there are **zero** stray
`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
intact). git apply tolerates `@@` line-number offsets, which absorbed the
upstream drift; no hunk context broke.
Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
patch tarball used for the verification has
`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
## Clean build
Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
28 patches applied as working-tree changes, then:
```
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
-DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
-DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build-cuda --target llama-completion test-backend-ops -j20
```
Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
## GATE: ALL GREEN
Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
`9d5d882d` build too):
```
llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
-n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
# paged dense: prefix LLAMA_KV_PAGED=1
# paged MoE: prefix LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
```
(a) greedy md5 - all four paths PASS:
| path | model | md5 @ c299a92c | baseline | verdict |
|------|-------|----------------|----------|---------|
| non-paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
| paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| paged | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
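
The per-path comparison in table (a) could be scripted rather than checked by eye. This is a minimal sketch, assuming `md5sum` and `cut` are on the PATH; `gate_check` is a hypothetical helper name and is not something the repository ships.

```shell
#!/usr/bin/env bash
# Hypothetical helper: hash stdin with md5 and compare it to a recorded
# baseline passed as the first argument. Prints PASS/FAIL and returns
# non-zero on a mismatch so it can gate a script.
gate_check() {
    local expected="$1" actual
    actual=$(md5sum | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "PASS $actual"
    else
        echo "FAIL got=$actual want=$expected"
        return 1
    fi
}

# Intended use (not executed here): pipe the locked gate command into it, e.g.
#   llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
#     -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null \
#     | gate_check 5951a5b4d624ce891e22ab5fca9bc439
```

The helper replaces the trailing `| md5sum` of the gate command, so the baseline from the table is the only per-path input.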
(b) `test-backend-ops` (Backend CUDA0) - all PASS:
| op | result |
|----|--------|
| SSM_CONV | 45/45 OK |
| SSM_CONV_UPDATE | 16/16 OK |
| SSM_CONV_UPDATE_IDS | 16/16 OK |
| GATED_DELTA_NET | 84/84 OK |
| MUL_MAT | 1146/1146 OK |
| MUL_MAT_ID | 806/806 OK |
(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
series now carries patches `0026`/`0028`'s added per-head/gather test cases; all
pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
Bit-exactness preserved across the 23-commit upstream jump.
## Canary
`.github/workflows/llama-cpp-paged-canary.yml` and
`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
series is source-only and applies strict-clean with no `--exclude`, the canary's
`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
the shipped series) and may be removed on a future canary touch; left in place
here to keep the pin-bump diff minimal.
## Source of truth
The shipped `.patch` files under `backend/cpp/llama-cpp-localai-paged/patches/paged/` are the
source of truth and are unchanged by this bump. The DGX dev tree
(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.