Commit Graph

4 Commits

Author SHA1 Message Date
Ettore Di Giacinto
fb2dc33d52 docs(paged): consolidate the dev-trail docs into one canonical README
The paged-attention patch directory had accumulated ~55 scattered dev docs
(results, progress, scope, lever, and gap-analysis notes). Consolidate the
durable content of all of them into one canonical
backend/cpp/llama-cpp/patches/paged/README.md covering: what the patchset is,
the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM
decode path, NVFP4 FP4-MMA, the decode-first scheduler), the full 0001-0030
patch series table with bit-exact status, the GB10 benchmarks
(patched-vs-stock-vs-vLLM + the Apple M4 architectural note), the dev notes
(bit-exact methodology, the per-path gate, the MoE-parity conclusion, the
rejected/flat levers, the opt-in bf16-SSM mode), arch+quant generality, the
pin + canary maintenance policy, and the published NVFP4 gallery models.

Delete the consolidated-away dev trail. Keep the three operational docs the
README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md
(per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the
ship-as-own-backend design-of-record), plus the benchmark plots + csv. The
.patch files and the unit/bench .cpp are untouched.

Repoint every external reference to a deleted doc at the new README:
grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the
canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base
patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC
reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a
patch-internal path matcher, not a repo-doc link) is left intact.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:25:47 +00:00
Ettore Di Giacinto
a5a5b2ad80 feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verified)
Advance the paged-attention backend's owned llama.cpp pin by 23 upstream
commits. The shipped source-only patch series (0001-0030, 28 patches) applies
strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export
needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121):

- md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense
  non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all
  match the established baselines.
- test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16,
  SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146,
  MUL_MAT_ID 806/806; all OK.

The 23-commit upstream jump did not change our decode output. The .patch files
are kept byte-identical (they already apply strict-clean at the new pin); only
the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 08:57:33 +00:00
Ettore Di Giacinto
2bee7a5ab1 ci(paged): add early-warning canary for vendored llama.cpp paged patches
The paged backend (backend/cpp/llama-cpp-localai-paged) pins its own verified
llama.cpp tip and is excluded from the nightly auto-bumper so a naive bump can
never silently break the shipped build. That exclusion also removed the early
warning of upstream drift. This restores the signal without touching the pin.

Add .github/workflows/llama-cpp-paged-canary.yml (weekly + workflow_dispatch):

- apply-check job (ubuntu-latest, toolchain-free): resolve the latest
  ggml-org/llama.cpp master tip, shallow-checkout it, and apply the full paged
  series 0001-0030 in order with the build's own git-apply method via the new
  shared helper .github/scripts/paged-canary-apply.sh. Red on any apply break.
- compile job (needs apply-check): on the exact tip it validated, build the
  paged backend (cublas) inside the same base-grpc-cuda-12 toolchain and the
  same `make grpc-server` target the shipped build uses, so a red means upstream
  drift, not toolchain noise. nvcc compiles the kernels with no GPU present.

Red here = run a PIN_SYNC (rebase + bit-exact gate + re-export), then bump the
paged Makefile pin. The canary is signal-only: it opens no PR and never moves
the pin, so the shipped build and the dep-bump PRs stay green regardless. It is
fully separate from bump_deps.

The lone pre-existing quirk in the series (patch 0019 carries a stray modify
hunk against the dev-only doc SSM_DECODE_FIX_RESULTS.md, absent from any clean
upstream checkout; git apply is atomic so it rejects the whole patch and
cascades to 0021/0022/0026/0028) is handled path-scoped: the helper excludes
only that dev-doc and still applies 0019's real code hunks atomically, mirroring
prepare.sh's tolerance, so the quirk never false-positives the canary but a
genuine code break in 0019 still turns it red.

Point the existing pin comments in backend/cpp/llama-cpp-localai-paged/Makefile
and .github/workflows/bump_deps.yaml at this canary as the drift signal, and
document it in the PIN_SYNC doc: canary red -> do a pin-sync.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 08:29:09 +00:00
LocalAI [bot]
86a7f6c9fa ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 (#9781)
* ci: close the GC race + cascade-skip + darwin grpc gaps from v4.2.1

v4.2.1's backend.yml run (#25701862853) exposed three independent issues
on top of the singletons fix shipped in ea001995. Address all three plus
two related cleanups:

1. quay GC race in backend-merge-jobs-multiarch (12/37 merges failed with
   "manifest not found"). Even after PR #9746 split multi/single-arch
   merges, the multiarch matrix itself takes ~2h to drain at
   max-parallel: 8, and the earliest per-arch digests (push-by-digest,
   no tag) get reaped by quay's GC before the merge runs. The split
   bounded the race for multiarch; it doesn't eliminate it. Anchor each
   per-arch digest immediately to a tag in the internal ci-cache image
   (`keepalive-<run_id><tag-suffix>-<platform-tag>`). Quay won't GC
   tagged manifests. backend_merge.yml deletes the keepalive tags via
   quay REST API after publishing the user-facing manifest list.
   Cleanup is best-effort: if the quay token is not OAuth-scoped the
   merge does NOT fail, the orphan tags just persist.

2. cascade-skip on backend-merge-jobs-singlearch. v4.2.1 had 2 failed
   and 2 cancelled singlearch builds (out of 199); GHA's default
   `needs:` semantics cascade-skipped the entire singlearch merge
   matrix, so zero singleton tags were applied even though 197
   singletons built successfully. Wrap the merge `if:` in
   `!cancelled() && ...` for both multi and single arch in backend.yml
   and backend_pr.yml so partial build failures publish the successful
   tag-suffixes.

3. Darwin llama-cpp grpc-server build fails with `find_package(absl)`
   not found. Same shape as the ccache/blake3/fmt/hiredis/xxhash/zstd
   fix already in `Dependencies`: a brew cache hit restores
   `/opt/homebrew/Cellar/grpc` so `brew install grpc` no-ops, but
   abseil isn't in our Cellar cache list and never gets installed
   alongside, leaving grpc's CMake unable to resolve it. Mirror the
   `brew reinstall ccache` line with `brew reinstall grpc` to
   re-validate grpc's full transitive dep closure on every cache-hit
   run.

4. Move the four heaviest CUDA cpp builds back to bigger-runner. v4.2.1
   wall-clock: -gpu-nvidia-cuda-12-llama-cpp 5h36m,
   -gpu-nvidia-cuda-12-turboquant 6h05m,
   -gpu-nvidia-cuda-13-llama-cpp 5h37m,
   -gpu-nvidia-cuda-13-turboquant 6h05m. The cuda-12 turboquant and
   cuda-13 turboquant entries are over GHA's 6h job timeout. Phase 5.3
   of the free-tier migration (PR #9730) had explicitly flagged this
   batch as 'highest-risk' with a per-entry revert path. All other
   matrix entries (vulkan-llama-cpp ~47m, ROCm hipblas-llama-cpp ~2h,
   intel sycl-f32 ~1h49m) stay on free-tier ubuntu-latest.

Verified locally: all six edited workflow YAMLs parse cleanly. Real
verification has to come from the next tag release run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: extract keepalive anchor + cleanup into .github/scripts/

The two inline shell blocks from the previous commit are long enough to
hurt readability of the workflow YAML and benefit from their own files
with self-contained docs. Move them to .github/scripts/:

  anchor-digest-in-cache.sh    backend_build.yml's keepalive anchor
  cleanup-keepalive-tags.sh    backend_merge.yml's best-effort cleanup

Workflow steps reduce to a single `run:` invocation each, with all the
parameter plumbing handled by env vars on the step. backend_merge.yml
also gains a sparse `actions/checkout@v6` step (sparse to .github/scripts
only) so the cleanup script is available on the runner — backend_build
already checks out for the docker build.

Net workflow diff: -36 lines across the two files. Script logic and
behavior are byte-identical to the inline version.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:09 +02:00