mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 18:06:58 -04:00
feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verified)
Advance the paged-attention backend's owned llama.cpp pin by 23 upstream commits. The shipped source-only patch series (0001-0030, 28 patches) applies strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121):

- md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all match the established baselines.
- test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146, MUL_MAT_ID 806/806; all OK.

The 23-commit upstream jump did not change our decode output. The .patch files are kept byte-identical (they already apply strict-clean at the new pin); only the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
4
.github/scripts/paged-canary-apply.sh
vendored
@@ -27,7 +27,7 @@
 # missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028
 # build on 0019's code, the rejection cascades to them too. This is a
 # PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
-# upstream break (see backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
+# upstream break (see backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
 # and PIN_BUMP_APPLY_CHECK.md). We exclude ONLY that dev-doc path and still
 # apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
 # still fails the canary. prepare.sh tolerates the same hunk via
@@ -53,7 +53,7 @@ apply_one() {
 echo "paged-canary: applying $(basename "$p")"
 if ! git apply --verbose "$@" "$p"; then
 echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")"
-echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md), do NOT bump the pin blindly"
+echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md), do NOT bump the pin blindly"
 exit 1
 fi
 }
2
.github/workflows/llama-cpp-paged-canary.yml
vendored
@@ -17,7 +17,7 @@ name: 'llama.cpp paged patches: upstream canary'
 # RED HERE means: time to run a PIN_SYNC (rebase the patches onto the new tip,
 # pass the bit-exact gate on the GPU, re-export the .patch files, THEN advance
 # the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See
-# backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.
+# backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md.
 #
 # SIGNAL-ONLY: this workflow moves no pinned version, ships nothing, and is fully
 # decoupled from bump_deps - so the main dep-bump PR stays green regardless. A
@@ -35,7 +35,7 @@
 # this patch series against the latest upstream llama.cpp tip and goes red the
 # moment upstream drifts past the patches. Canary red -> run a PIN_SYNC, then
 # bump this value. The canary never touches this pin; it is signal-only.
-LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
+LLAMA_VERSION?=c299a92c38b6de6a1139617652b66081828648db
 
 CMAKE_ARGS?=
 BUILD_TYPE?=
100
backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
Normal file
@@ -0,0 +1,100 @@
# Pin-sync: paged patch-stack -> llama.cpp c299a92c

Status: COMPLETE. The shipped source-only paged patch series (`0001`-`0030`,
28 `.patch` files) was advanced from llama.cpp `9d5d882d` to `c299a92c`
("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
upstream jump `9d5d882d..c299a92c` did NOT change our decode output.

## Upstream jump

- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
  ("model : Add label for LFM2.5-230M (#25008)")
- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
  ("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
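
The commit count above can be re-derived from any local clone that contains both pins. A minimal sketch, assuming a hypothetical helper name `count_commits_between` (it is not part of LocalAI or llama.cpp); the only real command used is `git rev-list --count`:

```shell
#!/usr/bin/env bash
# Sketch: count the upstream commits between two pins in a local clone.
# count_commits_between is a hypothetical helper name used for illustration.
count_commits_between() {
    local repo="$1" old="$2" new="$3"
    # "old..new" selects commits reachable from new but not from old.
    git -C "$repo" rev-list --count "${old}..${new}"
}
```

Called as `count_commits_between /path/to/llama.cpp 9d5d882d c299a92c` (the clone path is illustrative), the result should agree with the count reported in this section.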

## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c

Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
**zero patch changes**. The already-shipped source-only series (the result of the
`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
`git apply`** (the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
28 patches reported "Applied patch ... cleanly", the sentinel
`src/paged-kv-manager.cpp` was created, and there are **zero** stray
`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
intact). git apply tolerates `@@` line-number offsets, which absorbed the
upstream drift; no hunk context broke.

Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
patch tarball used for the verification has
`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
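
The post-apply checks described in this section (sentinel file present, no stray dev-doc files) can be sketched as a small shell helper. This is an illustration under stated assumptions, not the script that was actually run: the function name `check_applied_tree` is hypothetical, and only the sentinel path and the two file-name patterns are taken from the text above.

```shell
#!/usr/bin/env bash
# Sketch: verify a patched llama.cpp tree after the series has been applied.
# check_applied_tree is a hypothetical name; the sentinel path and the
# *_RESULTS.md / *_PROGRESS.md patterns come from this document.
check_applied_tree() {
    local tree="$1" stray

    # The paged series creates this file, so its absence means the
    # series did not apply.
    if [ ! -f "$tree/src/paged-kv-manager.cpp" ]; then
        echo "missing sentinel: src/paged-kv-manager.cpp" >&2
        return 1
    fi

    # Source-only invariant: no dev-doc files may appear in the tree.
    stray="$(find "$tree" -type f ! -path '*/.git/*' \
        \( -name '*_RESULTS.md' -o -name '*_PROGRESS.md' \))"
    if [ -n "$stray" ]; then
        printf 'stray dev-doc files:\n%s\n' "$stray" >&2
        return 1
    fi
    return 0
}
```

A non-zero return from `check_applied_tree /path/to/llama.cpp` would indicate either a failed apply or a dev-doc hunk that slipped back into the series.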

## Clean build

Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
28 patches applied as working-tree changes, then:

```
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
  -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build-cuda --target llama-completion test-backend-ops -j20
```

Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.

## GATE: ALL GREEN

Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
`9d5d882d` build too):
```
llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
  -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
# paged dense: prefix LLAMA_KV_PAGED=1
# paged MoE: prefix LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
```

(a) greedy md5 - all four paths PASS:
| path | model | md5 @ c299a92c | baseline | verdict |
|------|-------|----------------|----------|---------|
| non-paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
| paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| paged | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
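
The per-path comparison in the table can be scripted by piping the decode output into a small checker. A minimal sketch, assuming a hypothetical helper `gate_md5` that reads the transcript on stdin and takes the baseline hash as its argument; it is not the harness that produced the results above:

```shell
#!/usr/bin/env bash
# Sketch: compare the md5 of stdin against a known baseline hash.
# gate_md5 is a hypothetical helper name used for illustration.
gate_md5() {
    local baseline="$1" actual
    # md5sum prints "<hash>  -" for stdin; keep only the hash field.
    actual="$(md5sum | awk '{print $1}')"
    if [ "$actual" = "$baseline" ]; then
        echo "PASS $actual"
    else
        echo "FAIL got=$actual want=$baseline"
        return 1
    fi
}

# Illustrative use with the gate command from this document:
#   llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
#     -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null \
#     | gate_md5 5951a5b4d624ce891e22ab5fca9bc439
```

Returning non-zero on a mismatch lets the helper act as a hard gate in a script running under `set -e`.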

(b) `test-backend-ops` (Backend CUDA0) - all PASS:
| op | result |
|----|--------|
| SSM_CONV | 45/45 OK |
| SSM_CONV_UPDATE | 16/16 OK |
| SSM_CONV_UPDATE_IDS | 16/16 OK |
| GATED_DELTA_NET | 84/84 OK |
| MUL_MAT | 1146/1146 OK |
| MUL_MAT_ID | 806/806 OK |

(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
series now carries patches `0026`/`0028`'s added per-head/gather test cases; all
pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)

Bit-exactness preserved across the 23-commit upstream jump.
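
The "passed/total" counts in the op table lend themselves to a mechanical check. The sketch below assumes a simplified input of one `OP N/M` pair per line, modeled on the table rather than on the raw `test-backend-ops` output, and a hypothetical helper name `all_ops_passed`:

```shell
#!/usr/bin/env bash
# Sketch: succeed only if every "OP passed/total" line is a full pass.
# The "OP N/M" input format is an assumption modeled on the table above,
# not the raw test-backend-ops output; all_ops_passed is a hypothetical name.
all_ops_passed() {
    awk '
        NF == 0 { next }                      # skip blank lines
        {
            split($2, c, "/")                 # c[1] = passed, c[2] = total
            if (c[1] != c[2] || c[2] == 0) {  # partial pass or empty run
                print "FAIL " $1 " " $2
                bad = 1
            }
        }
        END { exit bad }
    '
}
```

For example, `printf 'SSM_CONV 45/45\nMUL_MAT 1146/1146\n' | all_ops_passed` succeeds, while a line such as `GATED_DELTA_NET 83/84` makes it print a `FAIL` line and return non-zero.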

## Canary

`.github/workflows/llama-cpp-paged-canary.yml` and
`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
series is source-only and applies strict-clean with no `--exclude`, the canary's
`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
the shipped series) and may be removed on a future canary touch; left in place
here to keep the pin-bump diff minimal.

## Source of truth

The shipped `.patch` files under `backend/cpp/llama-cpp/patches/paged/` are the
source of truth and are unchanged by this bump. The DGX dev tree
(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.
@@ -9,7 +9,7 @@
 # the sha256 below were verified against the Hub LFS hash and the uris resolve (200).
 # Converted from the unsloth/nvidia NVFP4 sources via llama.cpp --outtype auto.
 #
-# NOTE(NVFP4 read): the paged backend (pinned llama.cpp 9d5d882d) reads NVFP4 GGUF
+# NOTE(NVFP4 read): the paged backend (pinned llama.cpp c299a92c) reads NVFP4 GGUF
 # (the GB10 benchmark + the pin-sync md5 gate both ran NVFP4 GGUFs). These gallery
 # GGUFs were re-quantized with a newer convert (origin/master) preserving the same
 # MOSTLY_NVFP4 format; a load check on the paged backend GPU build is the final gate.