feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verified)

Advance the paged-attention backend's owned llama.cpp pin by 23 upstream commits. The shipped source-only patch series (0001-0030, 28 patches) applies strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121):

- md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all match the established baselines.
- test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146, MUL_MAT_ID 806/806; all OK.

The 23-commit upstream jump did not change our decode output. The .patch files are kept byte-identical (they already apply strict-clean at the new pin); only the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 18:06:58 -04:00 · 2026-06-27 08:57:33 +00:00
parent 7e1832b868
commit a5a5b2ad80
5 changed files with 105 additions and 5 deletions
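
The md5 gate described in the commit message comes down to hashing the greedy-decode output and comparing it with a recorded baseline. A minimal sketch of that comparison follows. The helper names `md5_of_stdin` and `gate_verdict` are hypothetical and do not appear in the repository. The commented-out invocation copies the gate command and the dense baseline from the PIN_SYNC doc in this commit, and assumes `llama-completion` and a `MODEL` path are available.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repository) sketching the md5 gate.

# Print the md5 of whatever arrives on stdin (first field of md5sum output).
md5_of_stdin() {
  md5sum | cut -d' ' -f1
}

# Print PASS if the actual md5 equals the baseline, FAIL otherwise.
# Returns non-zero on FAIL so callers can chain with && / ||.
gate_verdict() {
  local actual="$1" baseline="$2"
  if [ "$actual" = "$baseline" ]; then
    echo "PASS"
  else
    echo "FAIL"
    return 1
  fi
}

# Example use with the gate command from the PIN_SYNC doc (left commented out
# because it needs a built llama-completion binary and a model file):
#
#   actual=$(llama-completion -m MODEL -ngl 99 -fa on \
#              -p "The capital of France is" \
#              -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5_of_stdin)
#   gate_verdict "$actual" 5951a5b4d624ce891e22ab5fca9bc439
```

For the paged paths, the same call would be prefixed with `LLAMA_KV_PAGED=1` (plus `LLAMA_MOE_FORCE_GRAPHS=1` for MoE), as the gate command in the doc notes. Returning a non-zero status on a mismatch lets a wrapper script stop at the first failing path.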
--- a/.github/scripts/paged-canary-apply.sh
+++ b/.github/scripts/paged-canary-apply.sh
@@ -27,7 +27,7 @@
 # missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028
 # build on 0019's code, the rejection cascades to them too. This is a
 # PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
-# upstream break (see backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
+# upstream break (see backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
 # and PIN_BUMP_APPLY_CHECK.md). We exclude ONLY that dev-doc path and still
 # apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
 # still fails the canary. prepare.sh tolerates the same hunk via
@@ -53,7 +53,7 @@ apply_one() {
  echo "paged-canary: applying $(basename "$p")"
  if ! git apply --verbose "$@" "$p"; then
    echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")"
-    echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md), do NOT bump the pin blindly"
+    echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md), do NOT bump the pin blindly"
    exit 1
  fi
 }
--- a/.github/workflows/llama-cpp-paged-canary.yml
+++ b/.github/workflows/llama-cpp-paged-canary.yml
@@ -17,7 +17,7 @@ name: 'llama.cpp paged patches: upstream canary'
 # RED HERE means: time to run a PIN_SYNC (rebase the patches onto the new tip,
 # pass the bit-exact gate on the GPU, re-export the .patch files, THEN advance
 # the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See
-# backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.
+# backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md.
 #
 # SIGNAL-ONLY: this workflow moves no pinned version, ships nothing, and is fully
 # decoupled from bump_deps - so the main dep-bump PR stays green regardless. A
--- a/backend/cpp/llama-cpp-localai-paged/Makefile
+++ b/backend/cpp/llama-cpp-localai-paged/Makefile
@@ -35,7 +35,7 @@
 # this patch series against the latest upstream llama.cpp tip and goes red the
 # moment upstream drifts past the patches. Canary red -> run a PIN_SYNC, then
 # bump this value. The canary never touches this pin; it is signal-only.
-LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
+LLAMA_VERSION?=c299a92c38b6de6a1139617652b66081828648db

 CMAKE_ARGS?=
 BUILD_TYPE?=
--- a/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
+++ b/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
@@ -0,0 +1,100 @@
+# Pin-sync: paged patch-stack -> llama.cpp c299a92c
+
+Status: COMPLETE. The shipped source-only paged patch series (`0001`-`0030`,
+28 `.patch` files) was advanced from llama.cpp `9d5d882d` to `c299a92c`
+("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
+GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
+path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
+upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
+
+## Upstream jump
+
+- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
+  ("model : Add label for LFM2.5-230M (#25008)")
+- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
+  ("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
+- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
+
+## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
+
+Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
+**zero patch changes**. The already-shipped source-only series (the result of the
+`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
+`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
+`git apply`** (the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
+`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
+28 patches reported "Applied patch ... cleanly", the sentinel
+`src/paged-kv-manager.cpp` was created, and there are **zero** stray
+`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
+intact). git apply tolerates `@@` line-number offsets, which absorbed the
+upstream drift; no hunk context broke.
+
+Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
+patch tarball used for the verification has
+`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
+
+## Clean build
+
+Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
+28 patches applied as working-tree changes, then:
+
+```
+cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
+  -DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
+  -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
+cmake --build build-cuda --target llama-completion test-backend-ops -j20
+```
+
+Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
+`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
+
+## GATE: ALL GREEN
+
+Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
+`9d5d882d` build too):
+```
+llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
+                 -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
+# paged dense: prefix  LLAMA_KV_PAGED=1
+# paged MoE:   prefix  LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
+```
+
+(a) greedy md5 - all four paths PASS:
+| path | model | md5 @ c299a92c | baseline | verdict |
+|------|-------|----------------|----------|---------|
+| non-paged | dense `q36-27b-nvfp4`   | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
+| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
+| paged     | dense `q36-27b-nvfp4`   | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
+| paged     | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
+
+(b) `test-backend-ops` (Backend CUDA0) - all PASS:
+| op | result |
+|----|--------|
+| SSM_CONV            | 45/45 OK |
+| SSM_CONV_UPDATE     | 16/16 OK |
+| SSM_CONV_UPDATE_IDS | 16/16 OK |
+| GATED_DELTA_NET     | 84/84 OK |
+| MUL_MAT             | 1146/1146 OK |
+| MUL_MAT_ID          | 806/806 OK |
+
+(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
+series now carries patches `0026`/`0028`'s added per-head/gather test cases; all
+pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
+
+Bit-exactness preserved across the 23-commit upstream jump.
+
+## Canary
+
+`.github/workflows/llama-cpp-paged-canary.yml` and
+`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
+series is source-only and applies strict-clean with no `--exclude`, the canary's
+`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
+the shipped series) and may be removed on a future canary touch; left in place
+here to keep the pin-bump diff minimal.
+
+## Source of truth
+
+The shipped `.patch` files under `backend/cpp/llama-cpp/patches/paged/` are the
+source of truth and are unchanged by this bump. The DGX dev tree
+(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
+the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -9,7 +9,7 @@
 # the sha256 below were verified against the Hub LFS hash and the uris resolve (200).
 # Converted from the unsloth/nvidia NVFP4 sources via llama.cpp --outtype auto.
 #
-# NOTE(NVFP4 read): the paged backend (pinned llama.cpp 9d5d882d) reads NVFP4 GGUF
+# NOTE(NVFP4 read): the paged backend (pinned llama.cpp c299a92c) reads NVFP4 GGUF
 # (the GB10 benchmark + the pin-sync md5 gate both ran NVFP4 GGUFs). These gallery
 # GGUFs were re-quantized with a newer convert (origin/master) preserving the same
 # MOSTLY_NVFP4 format; a load check on the paged backend GPU build is the final gate.