ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 (#9781)

* ci: close the GC race + cascade-skip + darwin grpc gaps from v4.2.1

v4.2.1's backend.yml run (#25701862853) exposed three independent issues
on top of the singletons fix shipped in ea001995. Address all three plus
two related cleanups:

1. quay GC race in backend-merge-jobs-multiarch (12/37 merges failed with
   "manifest not found"). Even after PR #9746 split multi/single-arch
   merges, the multiarch matrix itself takes ~2h to drain at
   max-parallel: 8, and the earliest per-arch digests (push-by-digest,
   no tag) get reaped by quay's GC before the merge runs. The split
   bounded the race for multiarch; it doesn't eliminate it. Anchor each
   per-arch digest immediately to a tag in the internal ci-cache image
   (`keepalive-<run_id><tag-suffix>-<platform-tag>`). Quay won't GC
   tagged manifests. backend_merge.yml deletes the keepalive tags via
   quay REST API after publishing the user-facing manifest list.
   Cleanup is best-effort: if the quay token is not OAuth-scoped, the
   merge does NOT fail; the orphan tags simply persist until pruned
   manually.

2. cascade-skip on backend-merge-jobs-singlearch. v4.2.1 had 2 failed
   and 2 cancelled singlearch builds (out of 199); GHA's default
   `needs:` semantics cascade-skipped the entire singlearch merge
   matrix, so zero singleton tags were applied even though 197
   singletons built successfully. Wrap the merge `if:` in
   `!cancelled() && ...` for both multi and single arch in backend.yml
   and backend_pr.yml so partial build failures publish the successful
   tag-suffixes.

3. Darwin llama-cpp grpc-server build fails with `find_package(absl)`
   not found. Same shape as the ccache/blake3/fmt/hiredis/xxhash/zstd
   fix already in `Dependencies`: a brew cache hit restores
   `/opt/homebrew/Cellar/grpc` so `brew install grpc` no-ops, but
   abseil isn't in our Cellar cache list and never gets installed
   alongside, leaving grpc's CMake unable to resolve it. Mirror the
   `brew reinstall ccache` line with `brew reinstall grpc` to
   re-validate grpc's full transitive dep closure on every cache-hit
   run.

4. Move the four heaviest CUDA cpp builds back to bigger-runner. v4.2.1
   wall-clock: -gpu-nvidia-cuda-12-llama-cpp 5h36m,
   -gpu-nvidia-cuda-12-turboquant 6h05m,
   -gpu-nvidia-cuda-13-llama-cpp 5h37m,
   -gpu-nvidia-cuda-13-turboquant 6h05m. The cuda-12 turboquant and
   cuda-13 turboquant entries are over GHA's 6h job timeout. Phase 5.3
   of the free-tier migration (PR #9730) had explicitly flagged this
   batch as 'highest-risk' with a per-entry revert path. All other
   matrix entries (vulkan-llama-cpp ~47m, ROCm hipblas-llama-cpp ~2h,
   intel sycl-f32 ~1h49m) stay on free-tier ubuntu-latest.
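
The timeout comparison in item 4 can be checked with a few lines of shell. This is a minimal sketch, assuming wall-clock strings always have the `XhYYm` shape used above and a 360-minute (6h) limit; `to_minutes` and `over_limit` are hypothetical helper names and are not part of this change.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of the workflows.

# Convert a wall-clock string such as "5h36m" into minutes.
# Assumes the "XhYYm" shape; strings like "47m" or "~2h" are not handled.
to_minutes() {
  local t="$1" h m
  h="${t%%h*}"   # text before the first "h", e.g. "5"
  m="${t#*h}"    # text after the first "h", e.g. "36m"
  m="${m%m}"     # drop the trailing "m"
  # 10# forces base 10 so that "05" is not parsed as octal.
  echo $(( 10#$h * 60 + 10#$m ))
}

# Succeed (exit 0) when the wall-clock exceeds the limit in minutes.
over_limit() {
  local limit_min="${2:-360}"   # assumed default: 6h expressed in minutes
  (( $(to_minutes "$1") > limit_min ))
}

for t in 5h36m 6h05m 5h37m; do
  if over_limit "$t"; then
    echo "$t: over the limit"
  else
    echo "$t: under the limit"
  fi
done
```

The two llama-cpp entries sit under the limit by less than half an hour, which is consistent with moving all four entries rather than only the two turboquant ones.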

Verified locally: all six edited workflow YAMLs parse cleanly. Real
verification has to come from the next tag release run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: extract keepalive anchor + cleanup into .github/scripts/

The two inline shell blocks from the previous commit are long enough to
hurt readability of the workflow YAML and benefit from their own files
with self-contained docs. Move them to .github/scripts/:

  anchor-digest-in-cache.sh    backend_build.yml's keepalive anchor
  cleanup-keepalive-tags.sh    backend_merge.yml's best-effort cleanup

Workflow steps reduce to a single `run:` invocation each, with all the
parameter plumbing handled by env vars on the step. backend_merge.yml
also gains a sparse `actions/checkout@v6` step (sparse to .github/scripts
only) so the cleanup script is available on the runner — backend_build
already checks out for the docker build.

Net workflow diff: -36 lines across the two files. Script logic and
behavior are byte-identical to the inline version.
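
Because the steps now pass everything through env vars, each script has to validate its inputs up front. The scripts do this with `: "${VAR:?}"`; the sketch below shows an equivalent guard that returns an error instead of exiting, which may be easier to reuse when sourcing. `require_env` is a hypothetical helper name, not something this commit adds.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; the committed scripts use
# `: "${VAR:?}"` instead.

# require_env NAME...: return 1 with a message on stderr when any of the
# named variables is unset or empty.
require_env() {
  local name
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion with an empty default, so this
    # is safe under `set -u`.
    if [[ -z "${!name:-}" ]]; then
      echo "missing required env: ${name}" >&2
      return 1
    fi
  done
}
```

A script could call `require_env GITHUB_RUN_ID TAG_SUFFIX PLATFORM_TAG DIGEST || exit 1` as its first statement.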

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI [bot]
2026-05-12 17:22:09 +02:00
committed by GitHub
parent a57e73691d
commit 86a7f6c9fa
8 changed files with 157 additions and 8 deletions

.github/scripts/anchor-digest-in-cache.sh vendored Executable file

@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
# garbage collector won't reap the manifest before backend_merge.yml runs.
#
# Context: backend_build.yml pushes by canonical digest only
# (push-by-digest=true). Unreferenced manifests on quay can be reaped within
# ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
# matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
# anchoring tag, the earliest digests are gone by the time `imagetools create`
# tries to read them, producing "manifest not found" merge failures.
#
# We tag the digest under our internal ci-cache image; quay does not GC tagged
# manifests. The user-facing manifest list still references the original
# digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
# the user-facing manifest is published — see cleanup-keepalive-tags.sh.
#
# Required env:
#   GITHUB_RUN_ID        - current workflow run id (set automatically by GHA)
#   TAG_SUFFIX           - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
#   PLATFORM_TAG         - amd64 / arm64 / single (single = singleton matrix entry)
#   DIGEST               - canonical content digest from build step (sha256:...)
#
# Optional env:
#   ANCHOR_IMAGE         - target image (default: quay.io/go-skynet/ci-cache)
#   SOURCE_IMAGE         - source image (default: quay.io/go-skynet/local-ai-backends)
#   GITHUB_STEP_SUMMARY  - if set, an anchored-by line is appended to it
set -euo pipefail
: "${GITHUB_RUN_ID:?}"
: "${TAG_SUFFIX:?}"
: "${PLATFORM_TAG:?}"
: "${DIGEST:?}"
anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"
tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"
docker buildx imagetools create \
  -t "${anchor_image}:${tag}" \
  "${source_image}@${DIGEST}"
echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
  echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
fi
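
Not part of the committed script: since the keepalive tag is built by concatenating the run id, the tag-suffix, and the platform tag, it can be worth checking that the result still fits the registry tag grammar (first character `[A-Za-z0-9_]`, then up to 127 characters from `[A-Za-z0-9_.-]`). A minimal sketch, with `keepalive_tag` and `valid_tag` as hypothetical helper names:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of
# anchor-digest-in-cache.sh.

# Compose the keepalive tag with the same layout the script uses:
# keepalive-<run_id><tag-suffix>-<platform-tag>
keepalive_tag() {
  local run_id="$1" tag_suffix="$2" platform_tag="$3"
  printf 'keepalive-%s%s-%s\n' "$run_id" "$tag_suffix" "$platform_tag"
}

# Return 0 when the argument matches the registry tag grammar
# (at most 128 characters, must not start with "." or "-").
valid_tag() {
  local re='^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$'
  [[ "$1" =~ $re ]]
}
```

For example, `valid_tag "$(keepalive_tag "$GITHUB_RUN_ID" "$TAG_SUFFIX" "$PLATFORM_TAG")"` could run before the `imagetools create` call to fail early on an over-long tag-suffix.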

.github/scripts/cleanup-keepalive-tags.sh vendored Executable file

@@ -0,0 +1,49 @@
#!/usr/bin/env bash
# Best-effort cleanup of the keepalive anchor tags written by
# anchor-digest-in-cache.sh. Called from backend_merge.yml after the
# user-facing manifest list has been published.
#
# Quay's docker registry v2 doesn't allow tag deletes — only digest deletes.
# The proper delete is the quay REST API, which requires an OAuth-scoped
# token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
# token (typical for service accounts) the delete succeeds; otherwise this
# is a soft no-op and the tag persists until manually pruned.
#
# Cleanup failure MUST NOT fail the merge — the merge has already produced
# the user-facing manifest list at this point and the keepalive tags are
# pure overhead. We always exit 0.
#
# Required env:
#   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
#   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
#   QUAY_TOKEN     - bearer token for quay's REST API
#
# Optional env:
#   QUAY_REPO      - target repo (default: go-skynet/ci-cache)
#   PLATFORM_TAGS  - space-separated list of platform-tag values to try
#                    (default: "amd64 arm64 single")
#                    We don't know which platform-tag(s) exist for this
#                    tag-suffix without an extra API call, so we just try
#                    all three and ignore 404s for the ones that don't.
set -uo pipefail
: "${GITHUB_RUN_ID:?}"
: "${TAG_SUFFIX:?}"
: "${QUAY_TOKEN:?}"
quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"
for plat in $platform_tags; do
  tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
  url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
  # curl still prints %{http_code} (as 000) when the request itself fails,
  # so only swallow the exit status instead of echoing a second "000".
  http=$(curl -sS -o /dev/null -w '%{http_code}' \
    -X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || true)
  case "$http" in
    204|200) echo "deleted $tag" ;;
    404)     echo "not present: $tag" ;;
    401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
    *)       echo "unexpected http $http deleting $tag - skipping" ;;
  esac
done
exit 0
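
Not part of the committed script: a dry-run sketch that only prints the DELETE URLs the loop above would hit, which can help when checking for orphan tags by hand. `keepalive_delete_urls` is a hypothetical helper name; the defaults mirror the script's `QUAY_REPO` and `PLATFORM_TAGS` defaults.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; it performs no network calls.

# keepalive_delete_urls RUN_ID TAG_SUFFIX [QUAY_REPO] [PLATFORM_TAGS]
# Print one quay REST API tag URL per platform-tag candidate.
keepalive_delete_urls() {
  local run_id="$1" tag_suffix="$2"
  local quay_repo="${3:-go-skynet/ci-cache}"
  local platform_tags="${4:-amd64 arm64 single}"
  local plat
  # Unquoted on purpose: the list is split on whitespace, as in the script.
  for plat in $platform_tags; do
    printf 'https://quay.io/api/v1/repository/%s/tag/keepalive-%s%s-%s\n' \
      "$quay_repo" "$run_id" "$tag_suffix" "$plat"
  done
}
```

Calling `keepalive_delete_urls "$GITHUB_RUN_ID" "$TAG_SUFFIX"` lists all three candidates; passing a fourth argument such as `"single"` narrows the list to the platform tags known to exist.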