ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 (#9781)

* ci: close the GC race + cascade-skip + darwin grpc gaps from v4.2.1

v4.2.1's backend.yml run (#25701862853) exposed three independent issues
on top of the singletons fix shipped in ea001995. Address all three plus
two related cleanups:

1. quay GC race in backend-merge-jobs-multiarch (12/37 merges failed with
   "manifest not found"). Even after PR #9746 split multi/single-arch
   merges, the multiarch matrix itself takes ~2h to drain at
   max-parallel: 8, and the earliest per-arch digests (push-by-digest, no
   tag) get reaped by quay's GC before the merge runs. The split bounded
   the race for multiarch; it doesn't eliminate it.

   Anchor each per-arch digest immediately to a tag in the internal
   ci-cache image (`keepalive-<run_id><tag-suffix>-<platform-tag>`). Quay
   won't GC tagged manifests. backend_merge.yml deletes the keepalive
   tags via quay REST API after publishing the user-facing manifest list.
   Cleanup is best-effort: if the quay token is not OAuth-scoped the
   merge does NOT fail, the orphan tags just persist.

2. cascade-skip on backend-merge-jobs-singlearch. v4.2.1 had 2 failed and
   2 cancelled singlearch builds (out of 199); GHA's default `needs:`
   semantics cascade-skipped the entire singlearch merge matrix, so zero
   singleton tags were applied even though 197 singletons built
   successfully. Wrap the merge `if:` in `!cancelled() && ...` for both
   multi and single arch in backend.yml and backend_pr.yml so partial
   build failures publish the successful tag-suffixes.

3. Darwin llama-cpp grpc-server build fails with `find_package(absl)` not
   found. Same shape as the ccache/blake3/fmt/hiredis/xxhash/zstd fix
   already in `Dependencies`: a brew cache hit restores
   `/opt/homebrew/Cellar/grpc` so `brew install grpc` no-ops, but abseil
   isn't in our Cellar cache list and never gets installed alongside,
   leaving grpc's CMake unable to resolve it. Mirror the
   `brew reinstall ccache` line with `brew reinstall grpc` to re-validate
   grpc's full transitive dep closure on every cache-hit run.

4. Move the four heaviest CUDA cpp builds back to bigger-runner. v4.2.1
   wall-clock:

     -gpu-nvidia-cuda-12-llama-cpp   5h36m
     -gpu-nvidia-cuda-12-turboquant  6h05m
     -gpu-nvidia-cuda-13-llama-cpp   5h37m
     -gpu-nvidia-cuda-13-turboquant  6h05m

   The cuda-12 turboquant and cuda-13 turboquant entries are over GHA's
   6h job timeout. Phase 5.3 of the free-tier migration (PR #9730) had
   explicitly flagged this batch as 'highest-risk' with a per-entry
   revert path. All other matrix entries (vulkan-llama-cpp ~47m, ROCm
   hipblas-llama-cpp ~2h, intel sycl-f32 ~1h49m) stay on free-tier
   ubuntu-latest.

Verified locally: all six edited workflow YAMLs parse cleanly. Real
verification has to come from the next tag release run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: extract keepalive anchor + cleanup into .github/scripts/

The two inline shell blocks from the previous commit are long enough to
hurt readability of the workflow YAML and benefit from their own files
with self-contained docs. Move them to .github/scripts/:

  anchor-digest-in-cache.sh   backend_build.yml's keepalive anchor
  cleanup-keepalive-tags.sh   backend_merge.yml's best-effort cleanup

Workflow steps reduce to a single `run:` invocation each, with all the
parameter plumbing handled by env vars on the step. backend_merge.yml
also gains a sparse `actions/checkout@v6` step (sparse to .github/scripts
only) so the cleanup script is available on the runner; backend_build
already checks out for the docker build.

Net workflow diff: -36 lines across the two files. Script logic and
behavior are byte-identical to the inline version.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-01 03:46:41 -04:00 · 2026-05-12 17:22:09 +02:00
parent a57e73691d
commit 86a7f6c9fa
8 changed files with 157 additions and 8 deletions
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -389,7 +389,12 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-llama-cpp'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: cold builds for this entry consistently take 5h+ on
+    # ubuntu-latest (observed 5h36m on v4.2.1). Move back to bigger-runner
+    # so the build finishes well within GHA's 6h job timeout. Phase 5.3 of
+    # the free-tier migration (PR #9730) flipped this to ubuntu-latest as
+    # a 'highest-risk batch' with explicit per-entry revert.
+    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp"
@@ -403,7 +408,9 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-turboquant'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: same rationale as -gpu-nvidia-cuda-12-llama-cpp above
+    # (observed 6h5m wall-clock on v4.2.1, just past the 6h job timeout).
+    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "turboquant"
@@ -899,7 +906,9 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: cold builds for this entry take 5h+ on ubuntu-latest
+    # (observed 5h37m on v4.2.1). Same rationale as the cuda-12 variant.
+    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp"
@@ -913,7 +922,8 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-turboquant'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: observed 6h5m wall-clock on v4.2.1 — at the GHA timeout.
+    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "turboquant"
--- a/.github/scripts/anchor-digest-in-cache.sh
+++ b/.github/scripts/anchor-digest-in-cache.sh
@@ -0,0 +1,46 @@
+#!/usr/bin/env bash
+# Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
+# garbage collector won't reap the manifest before backend_merge.yml runs.
+#
+# Context: backend_build.yml pushes by canonical digest only
+# (push-by-digest=true). Unreferenced manifests on quay can be reaped within
+# ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
+# matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
+# anchoring tag, the earliest digests are gone by the time `imagetools create`
+# tries to read them, producing "manifest not found" merge failures.
+#
+# We tag the digest under our internal ci-cache image; quay does not GC tagged
+# manifests. The user-facing manifest list still references the original
+# digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
+# the user-facing manifest is published — see cleanup-keepalive-tags.sh.
+#
+# Required env:
+#   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
+#   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
+#   PLATFORM_TAG   - amd64 / arm64 / single (single = singleton matrix entry)
+#   DIGEST         - canonical content digest from build step (sha256:...)
+#
+# Optional env:
+#   ANCHOR_IMAGE   - target image (default: quay.io/go-skynet/ci-cache)
+#   SOURCE_IMAGE   - source image (default: quay.io/go-skynet/local-ai-backends)
+#   GITHUB_STEP_SUMMARY - if set, an anchored-by line is appended to it
+set -euo pipefail
+
+: "${GITHUB_RUN_ID:?}"
+: "${TAG_SUFFIX:?}"
+: "${PLATFORM_TAG:?}"
+: "${DIGEST:?}"
+
+anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
+source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"
+
+tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"
+
+docker buildx imagetools create \
+  -t "${anchor_image}:${tag}" \
+  "${source_image}@${DIGEST}"
+
+echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
+if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
+  echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
+fi
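
The keepalive tag above is built by plain string concatenation, so a long tag-suffix could in principle produce a tag the registry rejects. The sketch below is not part of this commit. It composes the tag the same way as anchor-digest-in-cache.sh and checks it against the usual container tag grammar (one word character, then up to 127 word characters, dots, or dashes). The helper names `keepalive_tag` and `is_valid_tag` are hypothetical, introduced only for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: compose the keepalive tag exactly as
# anchor-digest-in-cache.sh does (run id, tag-suffix, platform-tag).
keepalive_tag() {
  local run_id="$1" tag_suffix="$2" platform_tag="$3"
  printf 'keepalive-%s%s-%s\n' "$run_id" "$tag_suffix" "$platform_tag"
}

# Hypothetical helper: succeed only if the tag matches the container tag
# grammar [A-Za-z0-9_][A-Za-z0-9._-]{0,127} (at most 128 characters).
is_valid_tag() {
  [[ "$1" =~ ^[A-Za-z0-9_][A-Za-z0-9._-]{0,127}$ ]]
}

# Example values taken from this commit's message and matrix.
tag="$(keepalive_tag 25701862853 -gpu-nvidia-cuda-12-llama-cpp amd64)"
echo "$tag"
if is_valid_tag "$tag"; then
  echo "valid"
fi
```

Run as-is, this prints `keepalive-25701862853-gpu-nvidia-cuda-12-llama-cpp-amd64` followed by `valid`. A check like this could run before `docker buildx imagetools create` if tag-suffixes ever grow long enough to matter.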
--- a/.github/scripts/cleanup-keepalive-tags.sh
+++ b/.github/scripts/cleanup-keepalive-tags.sh
@@ -0,0 +1,49 @@
+#!/usr/bin/env bash
+# Best-effort cleanup of the keepalive anchor tags written by
+# anchor-digest-in-cache.sh. Called from backend_merge.yml after the
+# user-facing manifest list has been published.
+#
+# Quay's docker registry v2 doesn't allow tag deletes — only digest deletes.
+# The proper delete is the quay REST API, which requires an OAuth-scoped
+# token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
+# token (typical for service accounts) the delete succeeds; otherwise this
+# is a soft no-op and the tag persists until manually pruned.
+#
+# Cleanup failure MUST NOT fail the merge — the merge has already produced
+# the user-facing manifest list at this point and the keepalive tags are
+# pure overhead. We always exit 0.
+#
+# Required env:
+#   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
+#   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
+#   QUAY_TOKEN     - bearer token for quay's REST API
+#
+# Optional env:
+#   QUAY_REPO      - target repo (default: go-skynet/ci-cache)
+#   PLATFORM_TAGS  - space-separated list of platform-tag values to try
+#                    (default: "amd64 arm64 single")
+#                    We don't know which platform-tag(s) exist for this
+#                    tag-suffix without an extra API call, so we just try
+#                    all three and ignore 404s for the ones that don't.
+set -uo pipefail
+
+: "${GITHUB_RUN_ID:?}"
+: "${TAG_SUFFIX:?}"
+: "${QUAY_TOKEN:?}"
+
+quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
+platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"
+
+for plat in $platform_tags; do
+  tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
+  url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
+  http=$(curl -sS -o /dev/null -w '%{http_code}' \
+    -X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || echo "000")
+  case "$http" in
+    204|200) echo "deleted $tag" ;;
+    404)     echo "not present: $tag" ;;
+    401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
+    *)       echo "unexpected http $http deleting $tag - skipping" ;;
+  esac
+done
+exit 0
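
The status handling in the cleanup loop above can be exercised without calling quay by pulling the `case` statement into a function. This is a minimal sketch under that assumption, not the committed script. `classify_delete_status` is a hypothetical name, and the short labels it prints stand in for the longer log messages.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Hypothetical helper mirroring the case statement in
# cleanup-keepalive-tags.sh: map an HTTP status code to an outcome label.
classify_delete_status() {
  case "$1" in
    200|204) echo "deleted" ;;
    404)     echo "not-present" ;;
    401|403) echo "auth-not-oauth-scoped" ;;
    *)       echo "unexpected" ;;
  esac
}

# Walk through one code per branch; "000" is the fallback the script
# emits when curl itself fails.
for code in 204 404 403 000; do
  printf '%s -> %s\n' "$code" "$(classify_delete_status "$code")"
done
```

Every branch only reports an outcome and never changes the exit status, which matches the script's rule that cleanup must not fail the merge.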
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -154,7 +154,13 @@ jobs:
  # digest only — no tags are applied at build time.
  backend-merge-jobs-multiarch:
    needs: [generate-matrix, backend-jobs-multiarch]
-    if: needs.generate-matrix.outputs['has-merges-multiarch'] == 'true'
+    # !cancelled() lets the merge run even when a few build legs failed.
+    # Without it, GHA's default `needs:` cascade skips the entire merge
+    # matrix on a single failed/cancelled cell. We still want to publish
+    # the manifest lists for tag-suffixes whose legs all succeeded.
+    # Observed in v4.2.1: 2 failed and 2 cancelled singlearch builds (out of
+    # 199) cascade-skipped the entire singlearch merge matrix.
+    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
    uses: ./.github/workflows/backend_merge.yml
    with:
      tag-latest: ${{ matrix.tag-latest }}
@@ -170,7 +176,8 @@ jobs:

  backend-merge-jobs-singlearch:
    needs: [generate-matrix, backend-jobs-singlearch]
-    if: needs.generate-matrix.outputs['has-merges-singlearch'] == 'true'
+    # See note on backend-merge-jobs-multiarch above for !cancelled().
+    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
    uses: ./.github/workflows/backend_merge.yml
    with:
      tag-latest: ${{ matrix.tag-latest }}
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -228,6 +228,16 @@ jobs:
          digest="${{ steps.build.outputs.digest }}"
          touch "/tmp/digests/${digest#sha256:}"

+      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
+      # and how it interacts with backend_merge.yml's cleanup step.
+      - name: Anchor digest in ci-cache so quay GC won't reap before merge
+        if: github.event_name != 'pull_request'
+        env:
+          TAG_SUFFIX: ${{ inputs.tag-suffix }}
+          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
+          DIGEST: ${{ steps.build.outputs.digest }}
+        run: .github/scripts/anchor-digest-in-cache.sh
+
      # Artifact name uses a `--` separator between tag-suffix and platform-tag
      # to avoid prefix collisions during the merge job's pattern-based download.
      # Tag-suffixes are not prefix-disjoint (e.g. -gpu-nvidia-cuda-12-vllm is a
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -116,6 +116,13 @@ jobs:
          # already), we don't have to chase missing dylibs one at a time.
          # The downloads cache makes the reinstall fast (~5s on a hit).
          brew reinstall ccache
+          # Same pattern for grpc: its CMake config (used by the llama-cpp
+          # `grpc-server` target) does find_package(absl). The cache restores
+          # /opt/homebrew/Cellar/grpc so brew above no-ops the install, but
+          # abseil isn't in our Cellar cache list and never gets installed
+          # alongside, leaving grpc's CMake unable to resolve it. Reinstalling
+          # grpc re-validates and pulls abseil in, mirroring the ccache fix.
+          brew reinstall grpc
          # The brew cache restores the Cellar dirs but NOT the bin symlinks
          # at /opt/homebrew/bin/*. brew install above sees the Cellar present
          # and decides "already installed" without re-linking, so on a cache-
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -34,6 +34,15 @@ jobs:
    env:
      quay_username: ${{ secrets.quayUsername }}
    steps:
+      # Sparse checkout: the merge job needs `.github/scripts/` (for the
+      # keepalive cleanup script) but none of the source tree.
+      - name: Checkout (.github/scripts only)
+        uses: actions/checkout@v6
+        with:
+          sparse-checkout: |
+            .github/scripts
+          sparse-checkout-cone-mode: false
+
      # `--` separator anchors the glob so we don't over-match sibling
      # backends whose tag-suffix happens to be a prefix of ours
      # (e.g. -cpu-vllm vs -cpu-vllm-omni). Must stay in sync with the
@@ -126,6 +135,15 @@ jobs:
            docker buildx imagetools inspect "$first_tag"
          fi

+      # See .github/scripts/cleanup-keepalive-tags.sh for why this is
+      # best-effort and what the failure modes are.
+      - name: Cleanup keepalive tags in ci-cache
+        if: github.event_name != 'pull_request' && success()
+        env:
+          TAG_SUFFIX: ${{ inputs.tag-suffix }}
+          QUAY_TOKEN: ${{ secrets.quayPassword }}
+        run: .github/scripts/cleanup-keepalive-tags.sh
+
      - name: Job summary
        if: github.event_name != 'pull_request'
        run: |
--- a/.github/workflows/backend_pr.yml
+++ b/.github/workflows/backend_pr.yml
@@ -104,7 +104,9 @@ jobs:
    # backend_merge.yml's push-side steps are all gated on
    # github.event_name != 'pull_request', so on a PR the merge job would
    # do nothing. Skip it entirely to avoid spinning up an empty runner.
-    if: github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true'
+    # !cancelled() lets the merge run even when a few build legs fail —
+    # see the matching note in backend.yml.
+    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
    uses: ./.github/workflows/backend_merge.yml
    with:
      tag-latest: ${{ matrix.tag-latest }}
@@ -118,7 +120,7 @@ jobs:

  backend-merge-jobs-singlearch:
    needs: [generate-matrix, backend-jobs-singlearch]
-    if: github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true'
+    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
    uses: ./.github/workflows/backend_merge.yml
    with:
      tag-latest: ${{ matrix.tag-latest }}