diff --git a/.agents/llama-cpp-localai-paged-backend.md b/.agents/llama-cpp-localai-paged-backend.md index 9c6221d88..0997dcbd2 100644 --- a/.agents/llama-cpp-localai-paged-backend.md +++ b/.agents/llama-cpp-localai-paged-backend.md @@ -20,8 +20,8 @@ how-to. - `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch` series (0001-0030), nothing else. - `backend/cpp/llama-cpp-localai-paged/README.md` - the canonical doc. The - operational docs (`PIN_SYNC_*.md`, `PAGED_BITEXACT_NOTE.md`, - `UPSTREAM_LAYER2_SCOPE.md`) and dev artifacts live in + operational docs (`PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`) and + dev artifacts live in `backend/cpp/llama-cpp-localai-paged/docs/`. - `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh` - the CUDA build entry points. @@ -55,7 +55,7 @@ and break `git apply` at build time. 1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml` runs weekly: it applies + builds the series against the latest upstream tip and goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync. -2. **The pin-sync** (recorded in `docs/PIN_SYNC_*.md`): rebase the series onto the new +2. **The pin-sync** (recorded in the README section 7 and git history): rebase the series onto the new tip (resolve conflicts; re-export **source-only** with a pathspec like `-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box, pass the bit-exact gate on **every** path + `test-backend-ops`, **and confirm diff --git a/.github/scripts/paged-canary-apply.sh b/.github/scripts/paged-canary-apply.sh index dfcc88874..5311f0541 100755 --- a/.github/scripts/paged-canary-apply.sh +++ b/.github/scripts/paged-canary-apply.sh @@ -27,8 +27,8 @@ # missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028 # build on 0019's code, the rejection cascades to them too. 
This is a # PRE-EXISTING shipped-series defect, present identically on every pin, NOT an -# upstream break (see backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md -# and backend/cpp/llama-cpp-localai-paged/README.md). We exclude ONLY that dev-doc path and still +# upstream break (see backend/cpp/llama-cpp-localai-paged/README.md section 7, +# "Pin + maintenance policy"). We exclude ONLY that dev-doc path and still # apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019 # still fails the canary. prepare.sh tolerates the same hunk via # `patch ... || true`; this mirrors that tolerance precisely. @@ -53,7 +53,7 @@ apply_one() { echo "paged-canary: applying $(basename "$p")" if ! git apply --verbose "$@" "$p"; then echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")" - echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md), do NOT bump the pin blindly" + echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (see backend/cpp/llama-cpp-localai-paged/README.md section 7, Pin + maintenance policy), do NOT bump the pin blindly" exit 1 fi } diff --git a/.github/workflows/llama-cpp-paged-canary.yml b/.github/workflows/llama-cpp-paged-canary.yml index 8220acd30..b79db5441 100644 --- a/.github/workflows/llama-cpp-paged-canary.yml +++ b/.github/workflows/llama-cpp-paged-canary.yml @@ -16,8 +16,9 @@ name: 'llama.cpp paged patches: upstream canary' # # RED HERE means: time to run a PIN_SYNC (rebase the patches onto the new tip, # pass the bit-exact gate on the GPU, re-export the .patch files, THEN advance -# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See -# backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md. +# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). 
See the backend README +# section 7 (Pin + maintenance policy): +# backend/cpp/llama-cpp-localai-paged/README.md. # # SIGNAL-ONLY: this workflow moves no pinned version, ships nothing, and is fully # decoupled from bump_deps - so the main dep-bump PR stays green regardless. A diff --git a/backend/cpp/llama-cpp-localai-paged/Makefile b/backend/cpp/llama-cpp-localai-paged/Makefile index 79ebc48c5..f293b2f1f 100644 --- a/backend/cpp/llama-cpp-localai-paged/Makefile +++ b/backend/cpp/llama-cpp-localai-paged/Makefile @@ -9,7 +9,8 @@ # Pin handling (mirrors the turboquant wrapper, the precedent this is modelled # on): the paged patch series is hand-verified bit-exact against ONE specific # llama.cpp tip and re-exported by the manual PIN_SYNC process -# (docs/PIN_SYNC_*.md). A naive pin bump would move the tip out from +# (README section 7 + .agents/llama-cpp-localai-paged-backend.md). A naive +# pin bump would move the tip out from # under the patches and break `git apply` at build time, so this backend OWNS # its pin (LLAMA_VERSION below) instead of inheriting the auto-bumped stock pin # from backend/cpp/llama-cpp/Makefile. The override is forced into every copied @@ -30,7 +31,7 @@ # the nightly llama.cpp bump cannot silently break the vendored paged patches. # Advance ONLY via the PIN_SYNC process (rebase patches + bit-exact gate + # re-export), then update this value. See: -# backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_*.md +# README section 7 + .agents/llama-cpp-localai-paged-backend.md # # This pin = the manual, verified sync. The signal telling you WHEN to do the # next sync is the early-warning canary @@ -47,7 +48,7 @@ # grpc-server.cpp failed to link with undefined references to stream_* server # helpers that the refactor pulled into the headers grpc-server.cpp includes. # Therefore a PIN_SYNC must pass the FULL grpc-server build/link on CI, not only -# the bit-exact gate. See docs/PIN_SYNC_c299a92c.md. +# the bit-exact gate. 
See README section 7 + .agents/llama-cpp-localai-paged-backend.md. LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1 CMAKE_ARGS?= diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md index 6b46e21ba..9259c3c77 100644 --- a/backend/cpp/llama-cpp-localai-paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -7,7 +7,6 @@ here is a fork - it is a source-only `*.patch` stack plus this canonical doc. > One-file rule: this README is the canonical reference for the patch series. The > only other docs are operational, kept in `docs/`, and linked below: -> - [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts). > - [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference). > - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items. @@ -31,9 +30,8 @@ vendored patch series over upstream llama.cpp that adds GEMM - dominates the decode step. It is **pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` backend's -pin) and advanced only by a manual, bit-exact-gated -[pin-sync process](docs/PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper -(see section 7). The pin must stay aligned with the stock pin because +pin) and advanced only by a manual, bit-exact-gated pin-sync process (see +section 7, "Pin + maintenance policy"), decoupled from the nightly auto-bumper. The pin must stay aligned with the stock pin because `grpc-server.cpp` is shared; an earlier bump to `c299a92c` was bit-exact but broke the grpc-server link and was reverted. @@ -327,7 +325,7 @@ in a recommended/gallery config. ## 7. Pin + maintenance policy - **Pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` pin). 
The pin - is advanced **only** by the manual [`PIN_SYNC`](docs/PIN_SYNC_c299a92c.md) process: + is advanced **only** by the manual pin-sync process (this section): rebase the source-only patch series onto the new tip, rebuild on GPU, pass the bit-exact gate on every path (dense + MoE, paged + non-paged) plus `test-backend-ops`, **and confirm the full grpc-server build links on CI**. @@ -345,8 +343,7 @@ in a recommended/gallery config. (via [`.github/scripts/paged-canary-apply.sh`](../../../.github/scripts/paged-canary-apply.sh)) tries the patch series against the latest upstream tip with the build's own strict `git apply`. **Red = upstream drifted past the series -> run a - PIN_SYNC** (do not bump the pin blindly). The canary references - [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md). + PIN_SYNC** (do not bump the pin blindly), following the policy in this section. --- diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md b/backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md deleted file mode 100644 index a53ee0ffc..000000000 --- a/backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md +++ /dev/null @@ -1,114 +0,0 @@ -# Pin-sync: paged patch-stack -> llama.cpp c299a92c - -> **Status: REVERTED. The active pin is back at `9d5d882d`.** This bump was -> bit-exact but **broke the CI grpc-server build/link**. `grpc-server.cpp` is -> shared with the stock `llama-cpp` backend and tracks the stock pin (`9d5d882d`); -> `c299a92c`'s upstream server-API refactor pulled `stream_*` helpers into the -> headers grpc-server.cpp includes, and their definitions are not compiled by the -> stock-aligned build, so every paged variant failed to link -> (`undefined reference to stream_aware_should_stop / stream_pipe_producer::cleanup -> / stream_session_attach_pipe`). 
**Lesson: a paged pin-sync must pass the FULL CI
-> grpc-server build, not only the greedy-md5 bit-exact gate, and the paged pin must
-> stay == the stock pin (or the backend must vendor a pin-matched grpc-server.cpp,
-> which we deliberately avoid to keep stock pure).** The bit-exactness findings
-> below remain valid for `c299a92c`; only the build/link blocks shipping it.
-
-Status (original, patch-level only): COMPLETE. The shipped source-only paged patch
-series (`0001`-`0030`, 28 `.patch` files) was advanced from llama.cpp `9d5d882d` to
-`c299a92c` ("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
-GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
-path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
-upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
-
-## Upstream jump
-
-- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
-  ("model : Add label for LFM2.5-230M (#25008)")
-- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
-  ("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
-- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
-
-## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
-
-Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
-**zero patch changes**. The already-shipped source-only series (the result of the
-`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
-`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
-`git apply`** (the `apply-paged-patches` step in
-`backend/cpp/llama-cpp-localai-paged/Makefile`:
-`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
-28 patches reported "Applied patch ...
cleanly", the sentinel
-`src/paged-kv-manager.cpp` was created, and there are **zero** stray
-`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
-intact). git apply tolerates `@@` line-number offsets, which absorbed the
-upstream drift; no hunk context broke.
-
-Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
-patch tarball used for the verification has
-`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
-
-## Clean build
-
-Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
-28 patches applied as working-tree changes, then:
-
-```
-cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
-  -DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
-  -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
-cmake --build build-cuda --target llama-completion test-backend-ops -j20
-```
-
-Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
-`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
-
-## GATE: ALL GREEN
-
-Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
-`9d5d882d` build too):
-```
-llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
-  -n 48 --temp 0 --seed 1 /dev/null | md5sum
-# paged dense: prefix LLAMA_KV_PAGED=1
-# paged MoE: prefix LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
-```
-
-(a) greedy md5 - all four paths PASS:
-| path | model | md5 @ c299a92c | baseline | verdict |
-|------|-------|----------------|----------|---------|
-| non-paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
-| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
-| paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
-| paged | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
-
-(b) `test-backend-ops` (Backend CUDA0) - all PASS:
-| op | result |
-|----|--------|
-| SSM_CONV | 45/45 OK |
-| SSM_CONV_UPDATE | 16/16 OK |
-| SSM_CONV_UPDATE_IDS | 16/16 OK |
-| GATED_DELTA_NET | 84/84 OK |
-| MUL_MAT | 1146/1146 OK |
-| MUL_MAT_ID | 806/806 OK |
-
-(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
-series now carries patches `0026`/`0028`'s added per-head/gather test cases; all
-pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
-
-Bit-exactness preserved across the 23-commit upstream jump.
-
-## Canary
-
-`.github/workflows/llama-cpp-paged-canary.yml` and
-`.github/scripts/paged-canary-apply.sh` now reference this doc.
Because the
-series is source-only and applies strict-clean with no `--exclude`, the canary's
-`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
-the shipped series) and may be removed on a future canary touch; left in place
-here to keep the pin-bump diff minimal.
-
-## Source of truth
-
-The shipped `.patch` files under `backend/cpp/llama-cpp-localai-paged/patches/paged/` are the
-source of truth and are unchanged by this bump. The DGX dev tree
-(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
-the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.