feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verified)

Advance the paged-attention backend's owned llama.cpp pin by 23 upstream
commits. The shipped source-only patch series (0001-0030, 28 patches) applies
strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export
needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121):

- md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense
  non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all
  match the established baselines.
- test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16,
  SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146,
  MUL_MAT_ID 806/806; all OK.

The 23-commit upstream jump did not change our decode output. The .patch files
are kept byte-identical (they already apply strict-clean at the new pin); only
the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-27 08:57:33 +00:00
parent 7e1832b868
commit a5a5b2ad80
5 changed files with 105 additions and 5 deletions

View File

@@ -27,7 +27,7 @@
# missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028
# build on 0019's code, the rejection cascades to them too. This is a
# PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
# upstream break (see backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
# upstream break (see backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
# and PIN_BUMP_APPLY_CHECK.md). We exclude ONLY that dev-doc path and still
# apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
# still fails the canary. prepare.sh tolerates the same hunk via
@@ -53,7 +53,7 @@ apply_one() {
echo "paged-canary: applying $(basename "$p")"
if ! git apply --verbose "$@" "$p"; then
echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")"
echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md), do NOT bump the pin blindly"
echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md), do NOT bump the pin blindly"
exit 1
fi
}

View File

@@ -17,7 +17,7 @@ name: 'llama.cpp paged patches: upstream canary'
# RED HERE means: time to run a PIN_SYNC (rebase the patches onto the new tip,
# pass the bit-exact gate on the GPU, re-export the .patch files, THEN advance
# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See
# backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.
# backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md.
#
# SIGNAL-ONLY: this workflow moves no pinned version, ships nothing, and is fully
# decoupled from bump_deps - so the main dep-bump PR stays green regardless. A

View File

@@ -35,7 +35,7 @@
# this patch series against the latest upstream llama.cpp tip and goes red the
# moment upstream drifts past the patches. Canary red -> run a PIN_SYNC, then
# bump this value. The canary never touches this pin; it is signal-only.
LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
LLAMA_VERSION?=c299a92c38b6de6a1139617652b66081828648db
CMAKE_ARGS?=
BUILD_TYPE?=

View File

@@ -0,0 +1,100 @@
# Pin-sync: paged patch-stack -> llama.cpp c299a92c
Status: COMPLETE. The shipped source-only paged patch series (`0001`-`0030`,
28 `.patch` files) was advanced from llama.cpp `9d5d882d` to `c299a92c`
("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
## Upstream jump
- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
("model : Add label for LFM2.5-230M (#25008)")
- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
**zero patch changes**. The already-shipped source-only series (the result of the
`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
`git apply`** (the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
28 patches reported "Applied patch ... cleanly", the sentinel
`src/paged-kv-manager.cpp` was created, and there are **zero** stray
`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
intact). git apply tolerates `@@` line-number offsets, which absorbed the
upstream drift; no hunk context broke.
Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
patch tarball used for the verification has
`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
## Clean build
Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
28 patches applied as working-tree changes, then:
```
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
-DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
-DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build-cuda --target llama-completion test-backend-ops -j20
```
Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
## GATE: ALL GREEN
Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
`9d5d882d` build too):
```
llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
-n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
# paged dense: prefix LLAMA_KV_PAGED=1
# paged MoE: prefix LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
```
(a) greedy md5 - all four paths PASS:
| path | model | md5 @ c299a92c | baseline | verdict |
|------|-------|----------------|----------|---------|
| non-paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
| paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| paged | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
(b) `test-backend-ops` (Backend CUDA0) - all PASS:
| op | result |
|----|--------|
| SSM_CONV | 45/45 OK |
| SSM_CONV_UPDATE | 16/16 OK |
| SSM_CONV_UPDATE_IDS | 16/16 OK |
| GATED_DELTA_NET | 84/84 OK |
| MUL_MAT | 1146/1146 OK |
| MUL_MAT_ID | 806/806 OK |
(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
series now carries patches `0026`/`0028`'s added per-head/gather test cases; all
pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
Bit-exactness preserved across the 23-commit upstream jump.
## Canary
`.github/workflows/llama-cpp-paged-canary.yml` and
`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
series is source-only and applies strict-clean with no `--exclude`, the canary's
`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
the shipped series) and may be removed on a future canary touch; left in place
here to keep the pin-bump diff minimal.
## Source of truth
The shipped `.patch` files under `backend/cpp/llama-cpp/patches/paged/` are the
source of truth and are unchanged by this bump. The DGX dev tree
(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.

View File

@@ -9,7 +9,7 @@
# the sha256 below were verified against the Hub LFS hash and the uris resolve (200).
# Converted from the unsloth/nvidia NVFP4 sources via llama.cpp --outtype auto.
#
# NOTE(NVFP4 read): the paged backend (pinned llama.cpp 9d5d882d) reads NVFP4 GGUF
# NOTE(NVFP4 read): the paged backend (pinned llama.cpp c299a92c) reads NVFP4 GGUF
# (the GB10 benchmark + the pin-sync md5 gate both ran NVFP4 GGUFs). These gallery
# GGUFs were re-quantized with a newer convert (origin/master) preserving the same
# MOSTLY_NVFP4 format; a load check on the paged backend GPU build is the final gate.