From 86a7f6c9faa8d8628c43b468d5fa1f12edaf008d Mon Sep 17 00:00:00 2001
From: "LocalAI [bot]" <139863280+localai-bot@users.noreply.github.com>
Date: Tue, 12 May 2026 17:22:09 +0200
Subject: [PATCH] ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 (#9781)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* ci: close the GC race + cascade-skip + darwin grpc gaps from v4.2.1

v4.2.1's backend.yml run (#25701862853) exposed three independent issues
on top of the singletons fix shipped in ea001995. Address all three plus
two related cleanups:

1. quay GC race in backend-merge-jobs-multiarch (12/37 merges failed with
   "manifest not found"). Even after PR #9746 split multi/single-arch
   merges, the multiarch matrix itself takes ~2h to drain at
   max-parallel: 8, and the earliest per-arch digests (push-by-digest, no
   tag) get reaped by quay's GC before the merge runs. The split bounded
   the race for multiarch; it doesn't eliminate it.

   Anchor each per-arch digest immediately to a tag in the internal
   ci-cache image (`keepalive--`). Quay won't GC tagged manifests.
   backend_merge.yml deletes the keepalive tags via quay REST API after
   publishing the user-facing manifest list. Cleanup is best-effort: if
   the quay token is not OAuth-scoped the merge does NOT fail, the orphan
   tags just persist.

2. cascade-skip on backend-merge-jobs-singlearch. v4.2.1 had 2 failed and
   2 cancelled singlearch builds (out of 199); GHA's default `needs:`
   semantics cascade-skipped the entire singlearch merge matrix, so zero
   singleton tags were applied even though 197 singletons built
   successfully.

   Wrap the merge `if:` in `!cancelled() && ...` for both multi and single
   arch in backend.yml and backend_pr.yml so partial build failures
   publish the successful tag-suffixes.

3. Darwin llama-cpp grpc-server build fails with `find_package(absl)` not
   found. Same shape as the ccache/blake3/fmt/hiredis/xxhash/zstd fix
   already in `Dependencies`: a brew cache hit restores
   `/opt/homebrew/Cellar/grpc` so `brew install grpc` no-ops, but abseil
   isn't in our Cellar cache list and never gets installed alongside,
   leaving grpc's CMake unable to resolve it.

   Mirror the `brew reinstall ccache` line with `brew reinstall grpc` to
   re-validate grpc's full transitive dep closure on every cache-hit run.

4. Move the four heaviest CUDA cpp builds back to bigger-runner. v4.2.1
   wall-clock:
     -gpu-nvidia-cuda-12-llama-cpp 5h36m,
     -gpu-nvidia-cuda-12-turboquant 6h05m,
     -gpu-nvidia-cuda-13-llama-cpp 5h37m,
     -gpu-nvidia-cuda-13-turboquant 6h05m.
   The cuda-12 turboquant and cuda-13 turboquant entries are over GHA's 6h
   job timeout. Phase 5.3 of the free-tier migration (PR #9730) had
   explicitly flagged this batch as 'highest-risk' with a per-entry revert
   path. All other matrix entries (vulkan-llama-cpp ~47m, ROCm
   hipblas-llama-cpp ~2h, intel sycl-f32 ~1h49m) stay on free-tier
   ubuntu-latest.

Verified locally: all six edited workflow YAMLs parse cleanly. Real
verification has to come from the next tag release run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto

* ci: extract keepalive anchor + cleanup into .github/scripts/

The two inline shell blocks from the previous commit are long enough to
hurt readability of the workflow YAML and benefit from their own files
with self-contained docs. Move them to .github/scripts/:

  anchor-digest-in-cache.sh   backend_build.yml's keepalive anchor
  cleanup-keepalive-tags.sh   backend_merge.yml's best-effort cleanup

Workflow steps reduce to a single `run:` invocation each, with all the
parameter plumbing handled by env vars on the step. backend_merge.yml
also gains a sparse `actions/checkout@v6` step (sparse to .github/scripts
only) so the cleanup script is available on the runner — backend_build
already checks out for the docker build.

Net workflow diff: -36 lines across the two files. Script logic and
behavior are byte-identical to the inline version.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto

---------

Signed-off-by: Ettore Di Giacinto
Co-authored-by: Ettore Di Giacinto
---
 .github/backend-matrix.yml                 | 18 ++++++--
 .github/scripts/anchor-digest-in-cache.sh  | 46 ++++++++++++++++++++
 .github/scripts/cleanup-keepalive-tags.sh  | 49 ++++++++++++++++++++++
 .github/workflows/backend.yml              | 11 ++++-
 .github/workflows/backend_build.yml        | 10 +++++
 .github/workflows/backend_build_darwin.yml |  7 ++++
 .github/workflows/backend_merge.yml        | 18 ++++++++
 .github/workflows/backend_pr.yml           |  6 ++-
 8 files changed, 157 insertions(+), 8 deletions(-)
 create mode 100755 .github/scripts/anchor-digest-in-cache.sh
 create mode 100755 .github/scripts/cleanup-keepalive-tags.sh

diff --git a/.github/backend-matrix.yml b/.github/backend-matrix.yml
index 903d415ab..4aca4185e 100644
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -389,7 +389,12 @@ include:
     tag-latest: 'auto'
     tag-suffix: '-gpu-nvidia-cuda-12-llama-cpp'
     builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: cold builds for this entry consistently take 5h+ on
+    # ubuntu-latest (observed 5h36m on v4.2.1). Move back to bigger-runner
+    # so the build finishes well within GHA's 6h job timeout. Phase 5.3 of
+    # the free-tier migration (PR #9730) flipped this to ubuntu-latest as
+    # a 'highest-risk batch' with explicit per-entry revert.
+    runs-on: 'bigger-runner'
     base-image: "ubuntu:24.04"
     skip-drivers: 'false'
     backend: "llama-cpp"
@@ -403,7 +408,9 @@ include:
     tag-latest: 'auto'
     tag-suffix: '-gpu-nvidia-cuda-12-turboquant'
     builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: same rationale as -gpu-nvidia-cuda-12-llama-cpp above
+    # (observed 6h5m wall-clock on v4.2.1, just past the 6h job timeout).
+    runs-on: 'bigger-runner'
     base-image: "ubuntu:24.04"
     skip-drivers: 'false'
     backend: "turboquant"
@@ -899,7 +906,9 @@ include:
     tag-latest: 'auto'
     tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp'
     builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: cold builds for this entry take 5h+ on ubuntu-latest
+    # (observed 5h37m on v4.2.1). Same rationale as the cuda-12 variant.
+    runs-on: 'bigger-runner'
     base-image: "ubuntu:24.04"
     skip-drivers: 'false'
     backend: "llama-cpp"
@@ -913,7 +922,8 @@ include:
     tag-latest: 'auto'
     tag-suffix: '-gpu-nvidia-cuda-13-turboquant'
     builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: observed 6h5m wall-clock on v4.2.1 — at the GHA timeout.
+    runs-on: 'bigger-runner'
     base-image: "ubuntu:24.04"
     skip-drivers: 'false'
     backend: "turboquant"
diff --git a/.github/scripts/anchor-digest-in-cache.sh b/.github/scripts/anchor-digest-in-cache.sh
new file mode 100755
index 000000000..409192788
--- /dev/null
+++ b/.github/scripts/anchor-digest-in-cache.sh
@@ -0,0 +1,46 @@
+#!/usr/bin/env bash
+# Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
+# garbage collector won't reap the manifest before backend_merge.yml runs.
+#
+# Context: backend_build.yml pushes by canonical digest only
+# (push-by-digest=true). Unreferenced manifests on quay can be reaped within
+# ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
+# matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
+# anchoring tag, the earliest digests are gone by the time `imagetools create`
+# tries to read them, producing "manifest not found" merge failures.
+#
+# We tag the digest under our internal ci-cache image; quay does not GC tagged
+# manifests. The user-facing manifest list still references the original
+# digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
+# the user-facing manifest is published — see cleanup-keepalive-tags.sh.
+#
+# Required env:
+#   GITHUB_RUN_ID - current workflow run id (set automatically by GHA)
+#   TAG_SUFFIX    - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
+#   PLATFORM_TAG  - amd64 / arm64 / single (single = singleton matrix entry)
+#   DIGEST        - canonical content digest from build step (sha256:...)
+#
+# Optional env:
+#   ANCHOR_IMAGE        - target image (default: quay.io/go-skynet/ci-cache)
+#   SOURCE_IMAGE        - source image (default: quay.io/go-skynet/local-ai-backends)
+#   GITHUB_STEP_SUMMARY - if set, an anchored-by line is appended to it
+set -euo pipefail
+
+: "${GITHUB_RUN_ID:?}"
+: "${TAG_SUFFIX:?}"
+: "${PLATFORM_TAG:?}"
+: "${DIGEST:?}"
+
+anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
+source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"
+
+tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"
+
+docker buildx imagetools create \
+  -t "${anchor_image}:${tag}" \
+  "${source_image}@${DIGEST}"
+
+echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
+if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
+  echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
+fi
diff --git a/.github/scripts/cleanup-keepalive-tags.sh b/.github/scripts/cleanup-keepalive-tags.sh
new file mode 100755
index 000000000..c536269d6
--- /dev/null
+++ b/.github/scripts/cleanup-keepalive-tags.sh
@@ -0,0 +1,49 @@
+#!/usr/bin/env bash
+# Best-effort cleanup of the keepalive anchor tags written by
+# anchor-digest-in-cache.sh. Called from backend_merge.yml after the
+# user-facing manifest list has been published.
+#
+# Quay's docker registry v2 doesn't allow tag deletes — only digest deletes.
+# The proper delete is the quay REST API, which requires an OAuth-scoped
+# token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
+# token (typical for service accounts) the delete succeeds; otherwise this
+# is a soft no-op and the tag persists until manually pruned.
+#
+# Cleanup failure MUST NOT fail the merge — the merge has already produced
+# the user-facing manifest list at this point and the keepalive tags are
+# pure overhead. We always exit 0.
+#
+# Required env:
+#   GITHUB_RUN_ID - current workflow run id (set automatically by GHA)
+#   TAG_SUFFIX    - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
+#   QUAY_TOKEN    - bearer token for quay's REST API
+#
+# Optional env:
+#   QUAY_REPO     - target repo (default: go-skynet/ci-cache)
+#   PLATFORM_TAGS - space-separated list of platform-tag values to try
+#                   (default: "amd64 arm64 single")
+#                   We don't know which platform-tag(s) exist for this
+#                   tag-suffix without an extra API call, so we just try
+#                   all three and ignore 404s for the ones that don't.
+set -uo pipefail
+
+: "${GITHUB_RUN_ID:?}"
+: "${TAG_SUFFIX:?}"
+: "${QUAY_TOKEN:?}"
+
+quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
+platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"
+
+for plat in $platform_tags; do
+  tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
+  url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
+  http=$(curl -sS -o /dev/null -w '%{http_code}' \
+    -X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || echo "000")
+  case "$http" in
+    204|200) echo "deleted $tag" ;;
+    404)     echo "not present: $tag" ;;
+    401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
+    *)       echo "unexpected http $http deleting $tag - skipping" ;;
+  esac
+done
+exit 0
diff --git a/.github/workflows/backend.yml b/.github/workflows/backend.yml
index 3afe0c681..b41c3d4dd 100644
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -154,7 +154,13 @@ jobs:
   # digest only — no tags are applied at build time.
   backend-merge-jobs-multiarch:
     needs: [generate-matrix, backend-jobs-multiarch]
-    if: needs.generate-matrix.outputs['has-merges-multiarch'] == 'true'
+    # !cancelled() lets the merge run even when a few build legs failed.
+    # Without it, GHA's default `needs:` cascade skips the entire merge
+    # matrix on a single failed/cancelled cell. We still want to publish
+    # the manifest lists for tag-suffixes whose legs all succeeded.
+    # Observed in v4.2.1: 2 singlearch build failures cascade-skipped all
+    # ~199 singlearch merge entries.
+    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
     uses: ./.github/workflows/backend_merge.yml
     with:
       tag-latest: ${{ matrix.tag-latest }}
@@ -170,7 +176,8 @@ jobs:
 
   backend-merge-jobs-singlearch:
     needs: [generate-matrix, backend-jobs-singlearch]
-    if: needs.generate-matrix.outputs['has-merges-singlearch'] == 'true'
+    # See note on backend-merge-jobs-multiarch above for !cancelled().
+    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
     uses: ./.github/workflows/backend_merge.yml
     with:
       tag-latest: ${{ matrix.tag-latest }}
diff --git a/.github/workflows/backend_build.yml b/.github/workflows/backend_build.yml
index 7327287ce..b3e177bd1 100644
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -228,6 +228,16 @@ jobs:
           digest="${{ steps.build.outputs.digest }}"
           touch "/tmp/digests/${digest#sha256:}"
 
+      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
+      # and how it interacts with backend_merge.yml's cleanup step.
+      - name: Anchor digest in ci-cache so quay GC won't reap before merge
+        if: github.event_name != 'pull_request'
+        env:
+          TAG_SUFFIX: ${{ inputs.tag-suffix }}
+          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
+          DIGEST: ${{ steps.build.outputs.digest }}
+        run: .github/scripts/anchor-digest-in-cache.sh
+
       # Artifact name uses a `--` separator between tag-suffix and platform-tag
       # to avoid prefix collisions during the merge job's pattern-based download.
       # Tag-suffixes are not prefix-disjoint (e.g. -gpu-nvidia-cuda-12-vllm is a
diff --git a/.github/workflows/backend_build_darwin.yml b/.github/workflows/backend_build_darwin.yml
index ac39389f3..4c87e8d66 100644
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -116,6 +116,13 @@ jobs:
           # already), we don't have to chase missing dylibs one at a time.
           # The downloads cache makes the reinstall fast (~5s on a hit).
           brew reinstall ccache
+          # Same pattern for grpc: its CMake config (used by the llama-cpp
+          # `grpc-server` target) does find_package(absl). The cache restores
+          # /opt/homebrew/Cellar/grpc so brew above no-ops the install, but
+          # abseil isn't in our Cellar cache list and never gets installed
+          # alongside, leaving grpc's CMake unable to resolve it. Reinstalling
+          # grpc re-validates and pulls abseil in, mirroring the ccache fix.
+          brew reinstall grpc
           # The brew cache restores the Cellar dirs but NOT the bin symlinks
           # at /opt/homebrew/bin/*. brew install above sees the Cellar present
           # and decides "already installed" without re-linking, so on a cache-
diff --git a/.github/workflows/backend_merge.yml b/.github/workflows/backend_merge.yml
index 466a5d843..0490cc6b3 100644
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -34,6 +34,15 @@ jobs:
     env:
       quay_username: ${{ secrets.quayUsername }}
     steps:
+      # Sparse checkout: the merge job needs `.github/scripts/` (for the
+      # keepalive cleanup script) but none of the source tree.
+      - name: Checkout (.github/scripts only)
+        uses: actions/checkout@v6
+        with:
+          sparse-checkout: |
+            .github/scripts
+          sparse-checkout-cone-mode: false
+
       # `--` separator anchors the glob so we don't over-match sibling
       # backends whose tag-suffix happens to be a prefix of ours
       # (e.g. -cpu-vllm vs -cpu-vllm-omni). Must stay in sync with the
@@ -126,6 +135,15 @@ jobs:
             docker buildx imagetools inspect "$first_tag"
           fi
 
+      # See .github/scripts/cleanup-keepalive-tags.sh for why this is
+      # best-effort and what the failure modes are.
+      - name: Cleanup keepalive tags in ci-cache
+        if: github.event_name != 'pull_request' && success()
+        env:
+          TAG_SUFFIX: ${{ inputs.tag-suffix }}
+          QUAY_TOKEN: ${{ secrets.quayPassword }}
+        run: .github/scripts/cleanup-keepalive-tags.sh
+
       - name: Job summary
         if: github.event_name != 'pull_request'
         run: |
diff --git a/.github/workflows/backend_pr.yml b/.github/workflows/backend_pr.yml
index 9b0aba310..e9520a548 100644
--- a/.github/workflows/backend_pr.yml
+++ b/.github/workflows/backend_pr.yml
@@ -104,7 +104,9 @@ jobs:
     # backend_merge.yml's push-side steps are all gated on
     # github.event_name != 'pull_request', so on a PR the merge job would
     # do nothing. Skip it entirely to avoid spinning up an empty runner.
-    if: github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true'
+    # !cancelled() lets the merge run even when a few build legs fail —
+    # see the matching note in backend.yml.
+    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
     uses: ./.github/workflows/backend_merge.yml
     with:
       tag-latest: ${{ matrix.tag-latest }}
@@ -118,7 +120,7 @@ jobs:
 
   backend-merge-jobs-singlearch:
     needs: [generate-matrix, backend-jobs-singlearch]
-    if: github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true'
+    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
     uses: ./.github/workflows/backend_merge.yml
     with:
       tag-latest: ${{ matrix.tag-latest }}