ci(python-backends): add weekly DEPS_REFRESH cache-buster

The shared backend/Dockerfile.python ends in: RUN cd /${BACKEND} && PORTABLE_PYTHON=true make which `pip install`s each backend's requirements*.txt. A scan of all 34 Python backends shows every single one ships at least some unpinned deps (torch, transformers, vllm, diffusers, ...). With the registry cache now enabled, that `make` layer's BuildKit hash depends only on Dockerfile instructions + COPYed source — not on what pip resolves at runtime — so a warm cache would freeze upstream versions indefinitely. DEPS_REFRESH is an ARG declared right before that RUN. backend_build.yml computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W17`) and passes it as a build-arg, so the install layer invalidates at most once per week and re-resolves PyPI / nightly indexes. Within a week, builds stay warm. Only Dockerfile.python is affected: Go (go.sum) and Rust (Cargo.lock) already lock their deps, and the C++ backends pull gRPC at a pinned tag and llama.cpp at a pinned commit. Add .agents/ci-caching.md documenting the cache layout (quay.io/go-skynet/ci-cache:cache<tag-suffix>), read/write semantics (master writes, PRs read-only), DEPS_REFRESH semantics, and how to manually evict tags. Index it from AGENTS.md (CLAUDE.md is a symlink). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7-1m
2026-04-29 03:24:49 -04:00 · 2026-04-27 14:21:11 +00:00
parent 3810fe1a1e
commit f4036fa83f
4 changed files with 106 additions and 0 deletions
--- a/.agents/ci-caching.md
+++ b/.agents/ci-caching.md
@@ -0,0 +1,87 @@
+# CI Build Caching
+
+Container builds — both the root LocalAI image (`Dockerfile`) and the per-backend images (`backend/Dockerfile.*`) — share a registry-backed BuildKit cache. This file explains how that cache is laid out, what invalidates it, and how to bypass it.
+
+## Cache layout
+
+- **Cache registry**: `quay.io/go-skynet/ci-cache`
+- **One tag per matrix entry**, derived from the existing `tag-suffix`:
+  - Backend builds (`backend_build.yml`): `cache<tag-suffix>`
+    - e.g. `cache-gpu-nvidia-cuda-12-llama-cpp`, `cache-cpu-vllm`, `cache-nvidia-l4t-cuda-13-arm64-vllm`
+  - Root image builds (`image_build.yml`): `cache-localai<tag-suffix>`
+    - e.g. `cache-localai-gpu-nvidia-cuda-12`, `cache-localai-gpu-vulkan`
+- Each tag stores a multi-arch BuildKit cache manifest (`mode=max`), so every intermediate stage is re-usable, not just the final image.
+
+## Read/write semantics
+
+| Trigger | `cache-from` | `cache-to` |
+|---|---|---|
+| `push` to `master` / tag | yes | yes (`mode=max,ignore-error=true`) |
+| `pull_request` | yes | **no** |
+
+PR builds read master's warm cache but never write — this prevents PRs from polluting the shared cache with their experimental state. After merge, the master build for that matrix entry refreshes the cache.
+
+`ignore-error=true` on the write side means a transient quay push failure does not fail the build; the next master push retries.
+
+## Self-warming, no separate populator
+
+There is no cron job that pre-warms the cache. The production builds *are* the populator. The first master build of a given matrix entry pays the cold cost; subsequent same-entry master builds reuse everything that hasn't changed (apt installs, gRPC compile in `Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`, Python wheel installs, etc.).
+
+Historically there was a `generate_grpc_cache.yaml` cron that targeted a `grpc` stage in the root Dockerfile. That stage was removed in July 2025 and the cron silently failed every night for 9 months without writing anything. It was deleted along with the registry-cache rollout.
+
+## The `DEPS_REFRESH` cache-buster (Python backends)
+
+Every Python backend goes through the shared `backend/Dockerfile.python`, which ends with:
+
+```dockerfile
+ARG DEPS_REFRESH=initial
+RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
+```
+
+Most Python backends ship `requirements*.txt` files that **do not pin every transitive dep** (`torch`, `transformers`, `vllm`, `diffusers`, etc. are listed without a `==` pin, or with `>=` lower bounds only). With a warm BuildKit cache, the `make` layer hashes only on Dockerfile instructions + COPYed source — not on what `pip install` resolves at runtime. So a warm cache would ship the *first* version of `vllm` ever cached and never pick up upstream releases.
+
+`DEPS_REFRESH` defends against that:
+
+- `backend_build.yml` computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W17`) before each build and passes it as a build-arg.
+- The `RUN ... make` layer's BuildKit hash now includes that string, so the layer invalidates **at most once per week**, automatically picking up newer wheels.
+- Within a week, builds stay warm.
+
+This applies only to `Dockerfile.python` because:
+- Go (`Dockerfile.golang`) pins versions in `go.mod` / `go.sum`.
+- Rust (`Dockerfile.rust`) pins via `Cargo.lock`.
+- C++ backends (`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`) clone gRPC at a pinned tag (`v1.65.0`) and llama.cpp at a pinned commit; their inputs don't drift between rebuilds.
+
+### Adjusting the cadence
+
+If you need a faster refresh (e.g. while debugging an upstream flake), bump the format to daily (`+%Y-%m-%d`) or hourly (`+%Y-%m-%d-%H`). If you need a one-shot rebuild for a specific backend without changing the schedule, append a marker to the tag-suffix in the matrix or temporarily delete that backend's cache tag in quay.
+
+## Manually evicting cache
+
+To force a fully cold build for one backend or the whole image:
+
+```bash
+# Delete a single tag (requires quay credentials with admin on the repo)
+curl -X DELETE \
+  -H "Authorization: Bearer ${QUAY_TOKEN}" \
+  https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/cache-gpu-nvidia-cuda-12-vllm
+
+# List all tags
+curl -s -H "Authorization: Bearer ${QUAY_TOKEN}" \
+  "https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/?limit=100" | jq '.tags[].name'
+```
+
+Eviction is rarely needed in normal operation — `DEPS_REFRESH` handles weekly drift, source changes invalidate naturally, and `mode=max` keeps the cache scoped per matrix entry so a stale tag never bleeds into a different build.
+
+## What the cache **does not** cover
+
+- The "Free Disk Space" / "Release space from worker" steps run on every job — these reclaim ~6 GB on `ubuntu-latest` runners. They are runner-state cleanup, not Docker, and BuildKit caches don't apply.
+- Intermediate artifacts of `Build and push (PR)` are not pushed anywhere — PRs only build for verification.
+
+## Touching the cache pipeline
+
+When changing `image_build.yml`, `backend_build.yml`, or any of the `backend/Dockerfile.*` files:
+
+1. **Don't drop `DEPS_REFRESH=...` from the build-args** without a replacement strategy (lockfiles, pinned requirements). Otherwise master will silently freeze on whichever versions were cached at the time.
+2. **Keep `tag-suffix` unique per matrix entry** — it's the cache namespace. Two matrix entries sharing a tag-suffix would clobber each other's cache.
+3. **Keep `cache-to` gated on `github.event_name != 'pull_request'`** — PRs must not write.
+4. **Keep `ignore-error=true` on `cache-to`** — quay registry hiccups must not fail builds.
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -208,6 +208,15 @@ jobs:
          username: ${{ secrets.quayUsername }}
          password: ${{ secrets.quayPassword }}

+      # Weekly cache-buster for the per-backend `make` step. Most Python
+      # backends list unpinned deps (torch, transformers, vllm, ...), so a
+      # warm cache freezes upstream versions indefinitely. Rolling this
+      # weekly forces a re-resolve of the install layer at most once per
+      # week, picking up newer wheels without a full cold rebuild.
+      - name: Compute deps refresh key
+        id: deps_refresh
+        run: echo "key=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
+
      - name: Build and push
        uses: docker/build-push-action@v7
        if: github.event_name != 'pull_request'
@@ -222,6 +231,7 @@ jobs:
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
+            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}
@@ -245,6 +255,7 @@ jobs:
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
+            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -19,6 +19,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 |------|-------------|
 | [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md) | Policy for AI-assisted contributions — licensing, DCO, attribution |
 | [.agents/building-and-testing.md](.agents/building-and-testing.md) | Building the project, running tests, Docker builds for specific platforms |
+| [.agents/ci-caching.md](.agents/ci-caching.md) | CI build cache layout (registry-backed BuildKit cache on quay.io/go-skynet/ci-cache), `DEPS_REFRESH` weekly cache-buster for unpinned Python deps, manual eviction |
 | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -203,6 +203,13 @@ COPY scripts/build/package-gpu-libs.sh /package-gpu-libs.sh
 ARG FROM_SOURCE=""
 ENV FROM_SOURCE=${FROM_SOURCE}

+# Cache-buster for the per-backend `make` step. Most Python backends list
+# unpinned deps (torch, transformers, vllm, ...), so a warm registry cache
+# would otherwise freeze upstream versions indefinitely. CI passes a value
+# that rolls weekly so the install layer is rebuilt at most once per week
+# and picks up newer wheels from PyPI / nightly indexes.
+ARG DEPS_REFRESH=initial
+
 RUN cd /${BACKEND} && PORTABLE_PYTHON=true make

 # Package GPU libraries into the backend's lib directory