mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-28 18:37:43 -04:00
docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7
The paged backend's llama.cpp pin was reverted from c299a92c back to 9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the reverted sync) is dead weight. The pin-sync PROCESS stays documented in the three live places: the Makefile comment, README section 7 (Pin + maintenance policy), and .agents/llama-cpp-localai-paged-backend.md. Delete the doc and repoint every reference to it (Makefile, README, .agents, canary script + workflow) at README section 7. No functional paths change: the canary's patches-dir glob (patches/paged/0*.patch) is untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -20,8 +20,8 @@ how-to.
|
||||
- `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch`
|
||||
series (0001-0030), nothing else.
|
||||
- `backend/cpp/llama-cpp-localai-paged/README.md` - the canonical doc. The
|
||||
operational docs (`PIN_SYNC_*.md`, `PAGED_BITEXACT_NOTE.md`,
|
||||
`UPSTREAM_LAYER2_SCOPE.md`) and dev artifacts live in
|
||||
operational docs (`PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`) and
|
||||
dev artifacts live in
|
||||
`backend/cpp/llama-cpp-localai-paged/docs/`.
|
||||
- `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh`
|
||||
- the CUDA build entry points.
|
||||
@@ -55,7 +55,7 @@ and break `git apply` at build time.
|
||||
1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml`
|
||||
runs weekly: it applies + builds the series against the latest upstream tip and
|
||||
goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync.
|
||||
2. **The pin-sync** (recorded in `docs/PIN_SYNC_*.md`): rebase the series onto the new
|
||||
2. **The pin-sync** (recorded in the README section 7 and git history): rebase the series onto the new
|
||||
tip (resolve conflicts; re-export **source-only** with a pathspec like
|
||||
`-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box,
|
||||
pass the bit-exact gate on **every** path + `test-backend-ops`, **and confirm
|
||||
|
||||
6
.github/scripts/paged-canary-apply.sh
vendored
6
.github/scripts/paged-canary-apply.sh
vendored
@@ -27,8 +27,8 @@
|
||||
# missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028
|
||||
# build on 0019's code, the rejection cascades to them too. This is a
|
||||
# PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
|
||||
# upstream break (see backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
|
||||
# and backend/cpp/llama-cpp-localai-paged/README.md). We exclude ONLY that dev-doc path and still
|
||||
# upstream break (see backend/cpp/llama-cpp-localai-paged/README.md section 7,
|
||||
# "Pin + maintenance policy"). We exclude ONLY that dev-doc path and still
|
||||
# apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
|
||||
# still fails the canary. prepare.sh tolerates the same hunk via
|
||||
# `patch ... || true`; this mirrors that tolerance precisely.
|
||||
@@ -53,7 +53,7 @@ apply_one() {
|
||||
echo "paged-canary: applying $(basename "$p")"
|
||||
if ! git apply --verbose "$@" "$p"; then
|
||||
echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")"
|
||||
echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md), do NOT bump the pin blindly"
|
||||
echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (see backend/cpp/llama-cpp-localai-paged/README.md section 7, Pin + maintenance policy), do NOT bump the pin blindly"
|
||||
exit 1
|
||||
fi
|
||||
}
|
||||
|
||||
5
.github/workflows/llama-cpp-paged-canary.yml
vendored
5
.github/workflows/llama-cpp-paged-canary.yml
vendored
@@ -16,8 +16,9 @@ name: 'llama.cpp paged patches: upstream canary'
|
||||
#
|
||||
# RED HERE means: time to run a PIN_SYNC (rebase the patches onto the new tip,
|
||||
# pass the bit-exact gate on the GPU, re-export the .patch files, THEN advance
|
||||
# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See
|
||||
# backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md.
|
||||
# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See the backend README
|
||||
# section 7 (Pin + maintenance policy):
|
||||
# backend/cpp/llama-cpp-localai-paged/README.md.
|
||||
#
|
||||
# SIGNAL-ONLY: this workflow moves no pinned version, ships nothing, and is fully
|
||||
# decoupled from bump_deps - so the main dep-bump PR stays green regardless. A
|
||||
|
||||
@@ -9,7 +9,8 @@
|
||||
# Pin handling (mirrors the turboquant wrapper, the precedent this is modelled
|
||||
# on): the paged patch series is hand-verified bit-exact against ONE specific
|
||||
# llama.cpp tip and re-exported by the manual PIN_SYNC process
|
||||
# (docs/PIN_SYNC_*.md). A naive pin bump would move the tip out from
|
||||
# (README section 7 + .agents/llama-cpp-localai-paged-backend.md). A naive
|
||||
# pin bump would move the tip out from
|
||||
# under the patches and break `git apply` at build time, so this backend OWNS
|
||||
# its pin (LLAMA_VERSION below) instead of inheriting the auto-bumped stock pin
|
||||
# from backend/cpp/llama-cpp/Makefile. The override is forced into every copied
|
||||
@@ -30,7 +31,7 @@
|
||||
# the nightly llama.cpp bump cannot silently break the vendored paged patches.
|
||||
# Advance ONLY via the PIN_SYNC process (rebase patches + bit-exact gate +
|
||||
# re-export), then update this value. See:
|
||||
# backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_*.md
|
||||
# README section 7 + .agents/llama-cpp-localai-paged-backend.md
|
||||
#
|
||||
# This pin = the manual, verified sync. The signal telling you WHEN to do the
|
||||
# next sync is the early-warning canary
|
||||
@@ -47,7 +48,7 @@
|
||||
# grpc-server.cpp failed to link with undefined references to stream_* server
|
||||
# helpers that the refactor pulled into the headers grpc-server.cpp includes.
|
||||
# Therefore a PIN_SYNC must pass the FULL grpc-server build/link on CI, not only
|
||||
# the bit-exact gate. See docs/PIN_SYNC_c299a92c.md.
|
||||
# the bit-exact gate. See README section 7 + .agents/llama-cpp-localai-paged-backend.md.
|
||||
LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
|
||||
|
||||
CMAKE_ARGS?=
|
||||
|
||||
@@ -7,7 +7,6 @@ here is a fork - it is a source-only `*.patch` stack plus this canonical doc.
|
||||
|
||||
> One-file rule: this README is the canonical reference for the patch series. The
|
||||
> only other docs are operational, kept in `docs/`, and linked below:
|
||||
> - [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
|
||||
> - [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
|
||||
> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.
|
||||
|
||||
@@ -31,9 +30,8 @@ vendored patch series over upstream llama.cpp that adds
|
||||
GEMM - dominates the decode step.
|
||||
|
||||
It is **pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` backend's
|
||||
pin) and advanced only by a manual, bit-exact-gated
|
||||
[pin-sync process](docs/PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
|
||||
(see section 7). The pin must stay aligned with the stock pin because
|
||||
pin) and advanced only by a manual, bit-exact-gated pin-sync process (see
|
||||
section 7, "Pin + maintenance policy"), decoupled from the nightly auto-bumper. The pin must stay aligned with the stock pin because
|
||||
`grpc-server.cpp` is shared; an earlier bump to `c299a92c` was bit-exact but broke
|
||||
the grpc-server link and was reverted.
|
||||
|
||||
@@ -327,7 +325,7 @@ in a recommended/gallery config.
|
||||
## 7. Pin + maintenance policy
|
||||
|
||||
- **Pinned to llama.cpp `9d5d882d`** (kept == the stock `llama-cpp` pin). The pin
|
||||
is advanced **only** by the manual [`PIN_SYNC`](docs/PIN_SYNC_c299a92c.md) process:
|
||||
is advanced **only** by the manual pin-sync process (this section):
|
||||
rebase the source-only patch series onto the new tip, rebuild on GPU, pass the
|
||||
bit-exact gate on every path (dense + MoE, paged + non-paged) plus
|
||||
`test-backend-ops`, **and confirm the full grpc-server build links on CI**.
|
||||
@@ -345,8 +343,7 @@ in a recommended/gallery config.
|
||||
(via [`.github/scripts/paged-canary-apply.sh`](../../../.github/scripts/paged-canary-apply.sh))
|
||||
tries the patch series against the latest upstream tip with the build's own
|
||||
strict `git apply`. **Red = upstream drifted past the series -> run a
|
||||
PIN_SYNC** (do not bump the pin blindly). The canary references
|
||||
[`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md).
|
||||
PIN_SYNC** (do not bump the pin blindly), following the policy in this section.
|
||||
|
||||
---
|
||||
|
||||
|
||||
@@ -1,114 +0,0 @@
|
||||
# Pin-sync: paged patch-stack -> llama.cpp c299a92c
|
||||
|
||||
> **Status: REVERTED. The active pin is back at `9d5d882d`.** This bump was
|
||||
> bit-exact but **broke the CI grpc-server build/link**. `grpc-server.cpp` is
|
||||
> shared with the stock `llama-cpp` backend and tracks the stock pin (`9d5d882d`);
|
||||
> `c299a92c`'s upstream server-API refactor pulled `stream_*` helpers into the
|
||||
> headers grpc-server.cpp includes, and their definitions are not compiled by the
|
||||
> stock-aligned build, so every paged variant failed to link
|
||||
> (`undefined reference to stream_aware_should_stop / stream_pipe_producer::cleanup
|
||||
> / stream_session_attach_pipe`). **Lesson: a paged pin-sync must pass the FULL CI
|
||||
> grpc-server build, not only the greedy-md5 bit-exact gate, and the paged pin must
|
||||
> stay == the stock pin (or the backend must vendor a pin-matched grpc-server.cpp,
|
||||
> which we deliberately avoid to keep stock pure).** The bit-exactness findings
|
||||
> below remain valid for `c299a92c`; only the build/link blocks shipping it.
|
||||
|
||||
Status (original, patch-level only): COMPLETE. The shipped source-only paged patch
|
||||
series (`0001`-`0030`, 28 `.patch` files) was advanced from llama.cpp `9d5d882d` to
|
||||
`c299a92c` ("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
|
||||
GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
|
||||
path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
|
||||
upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
|
||||
|
||||
## Upstream jump
|
||||
|
||||
- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
|
||||
("model : Add label for LFM2.5-230M (#25008)")
|
||||
- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
|
||||
("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
|
||||
- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
|
||||
|
||||
## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
|
||||
|
||||
Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
|
||||
**zero patch changes**. The already-shipped source-only series (the result of the
|
||||
`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
|
||||
`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
|
||||
`git apply`** (the `apply-paged-patches` step in
|
||||
`backend/cpp/llama-cpp-localai-paged/Makefile`:
|
||||
`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
|
||||
28 patches reported "Applied patch ... cleanly", the sentinel
|
||||
`src/paged-kv-manager.cpp` was created, and there are **zero** stray
|
||||
`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
|
||||
intact). git apply tolerates `@@` line-number offsets, which absorbed the
|
||||
upstream drift; no hunk context broke.
|
||||
|
||||
Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
|
||||
patch tarball used for the verification has
|
||||
`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
|
||||
|
||||
## Clean build
|
||||
|
||||
Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
|
||||
28 patches applied as working-tree changes, then:
|
||||
|
||||
```
|
||||
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
|
||||
-DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
|
||||
-DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
|
||||
cmake --build build-cuda --target llama-completion test-backend-ops -j20
|
||||
```
|
||||
|
||||
Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
|
||||
`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
|
||||
|
||||
## GATE: ALL GREEN
|
||||
|
||||
Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
|
||||
`9d5d882d` build too):
|
||||
```
|
||||
llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
|
||||
-n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
|
||||
# paged dense: prefix LLAMA_KV_PAGED=1
|
||||
# paged MoE: prefix LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
|
||||
```
|
||||
|
||||
(a) greedy md5 - all four paths PASS:
|
||||
| path | model | md5 @ c299a92c | baseline | verdict |
|
||||
|------|-------|----------------|----------|---------|
|
||||
| non-paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
|
||||
| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
|
||||
| paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
|
||||
| paged | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
|
||||
|
||||
(b) `test-backend-ops` (Backend CUDA0) - all PASS:
|
||||
| op | result |
|
||||
|----|--------|
|
||||
| SSM_CONV | 45/45 OK |
|
||||
| SSM_CONV_UPDATE | 16/16 OK |
|
||||
| SSM_CONV_UPDATE_IDS | 16/16 OK |
|
||||
| GATED_DELTA_NET | 84/84 OK |
|
||||
| MUL_MAT | 1146/1146 OK |
|
||||
| MUL_MAT_ID | 806/806 OK |
|
||||
|
||||
(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
|
||||
series now carries patches `0026`/`0028`'s added per-head/gather test cases; all
|
||||
pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
|
||||
|
||||
Bit-exactness preserved across the 23-commit upstream jump.
|
||||
|
||||
## Canary
|
||||
|
||||
`.github/workflows/llama-cpp-paged-canary.yml` and
|
||||
`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
|
||||
series is source-only and applies strict-clean with no `--exclude`, the canary's
|
||||
`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
|
||||
the shipped series) and may be removed on a future canary touch; left in place
|
||||
here to keep the pin-bump diff minimal.
|
||||
|
||||
## Source of truth
|
||||
|
||||
The shipped `.patch` files under `backend/cpp/llama-cpp-localai-paged/patches/paged/` are the
|
||||
source of truth and are unchanged by this bump. The DGX dev tree
|
||||
(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
|
||||
the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.
|
||||
Reference in New Issue
Block a user