refactor(docker): drop celery image, restore base apt layer dedup (#2776)

* refactor(docker): drop celery image, restore base apt layer dedup

- Delete Dockerfile.celery.j2; compose now runs celery on the
  anthias-server image with a `command:` override.
- Make viewer extend Dockerfile.base.j2 (mirroring test); drop 17
  packages duplicated between viewer and base_apt_dependencies, plus 4
  within-list duplicates.
- Move `# syntax=docker/dockerfile:1.4` to line 1 of every rendered
  Dockerfile. It previously lived in uv-builder.j2 line 1 and got bumped
  mid-file for server by the bun-builder prelude, silently disabling the
  1.4 frontend and breaking cache-key parity with viewer (the actual
  blocker for layer dedup).
- Collapse CI matrix from (board × service) to (board) so all services
  for a board build on the same runner with the same buildkit cache,
  producing byte-identical apt layer digests at the registry.
- Add ENV DJANGO_SETTINGS_MODULE to the server image so the merged image
  runs both server and celery CMDs.
- Update all five compose templates (prod, balena prod, balena dev, dev,
  test) to redirect anthias-celery at the server image with a `command:`
  override. dev compose pins an explicit `image:` tag so both services
  share the locally-built SHA.
- Remove old anthias-celery / srly-ose-celery containers in
  upgrade_containers.sh so the recreated container can take the name.

Verified end-to-end on x86: server and viewer apt layers share a single
digest; SHARED SIZE jumps from 132 MB to 1.216 GB; merged image runs
both workloads in compose (celery task round-trips through Redis to
SUCCESS).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(docker): cache buildkit layers in GHCR registry across CI runs

Add a --cache-backend / $BUILDX_CACHE_BACKEND option to
tools.image_builder with two modes:

- `local` (default): writes to /tmp/.buildx-cache/<board>/. Unchanged
  from before; right for local dev.
- `registry`: pushes BuildKit cache to
  ghcr.io/screenly/anthias-<service>:buildcache-<board>. Reuses the GHCR
  login already done by docker-build.yaml, no extra tokens or
  third-party actions needed.

Wire CI to use registry mode on push events (master) so subsequent runs
of the same board pull cached layers: the ~825 MB extracted apt install
per service goes from ~3 min cold to a few seconds warm.
workflow_dispatch on a non-master branch falls back to local mode
(effectively no-cache) so manual runs can't pollute the master cache.

Drop the old actions/cache@v5 step that mirrored
/tmp/.buildx-cache/<board> through actions/cache: registry cache is
per-step rather than one big tarball, so it survives the GitHub Actions
cache 10 GB-per-repo eviction better.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(image-builder): move local cache out of /tmp to user XDG cache dir

SonarCloud python:S5443 flagged the previous /tmp/.buildx-cache/ default
as a security hotspot: `/tmp` is world-writable, so on a multi-user host
another account could in principle tamper with the buildkit cache.
Switch to $XDG_CACHE_HOME/anthias-buildx/<board>/ (default
~/.cache/anthias-buildx/), which is per-user by default and follows XDG
Base Directory convention.

CI is unaffected: docker-build.yaml uses --cache-backend=registry on
push events, which pushes cache to GHCR and never touches the local
path. Local dev users with stale state in /tmp/.buildx-cache/<board>/
can rm it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): correct cache-backend comments to match real behavior

Two doc fixes per Copilot review on #2776:

- tools/image_builder/__main__.py: the cache-backend rationale block
  still referenced /tmp/.buildx-cache/<board>; update to
  $XDG_CACHE_HOME/anthias-buildx/<board> so it matches the
  implementation moved in 529a50e0.
- .github/workflows/docker-build.yaml: the env comment claimed
  pull-request builds read from the registry cache, but this workflow
  has no pull_request trigger; non-push runs are workflow_dispatch,
  which both falls through to local cache and skips
  `docker login ghcr.io`, so it has no GHCR auth at all. Rewrite the
  comment around the push / workflow_dispatch split the code actually
  implements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): address Copilot review on registry cache + test compose

- tools/image_builder/__main__.py: comment in the registry-cache branch
  said the cache namespace was "picked from the build's tag list", but
  the implementation hardcodes ghcr.io/screenly/anthias-{service}.
  Rewrite the comment to describe what the code actually does and call
  out the hardcode so a future namespaces refactor doesn't silently
  break cache.
- docker-compose.test.yml: anthias-celery had its own `build:` block
  pointing at Dockerfile.test, claiming "reuses the test image", but
  compose builds two separate images per service even with identical
  context, defeating the dedup intent. Mirror the docker-compose.dev.yml
  pattern: pin anthias-test to an explicit `image: anthias-test:dev` tag
  and have anthias-celery reference the same tag with no `build:`. Also
  bind-mount the source into celery so it picks up code changes (matches
  anthias-test's existing volume).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(image-builder): read-only registry cache without --push

Per Copilot review: --cache-backend=registry previously tried to push
cache to ghcr.io/... regardless of --push, so a local invocation without
GHCR auth would fail mid-build with a confusing registry error.

Split the behavior:

- Reads (cache_from) are always set when registry mode is active: the
  anthias-* GHCR packages are public, so warm-starting off CI's cache
  without auth works and helps local dev.
- Writes (cache_to) only happen when --push is also set, since that's
  when the workflow has authenticated to GHCR. Without --push, log a
  yellow warning and skip cache_to.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): set DJANGO_SETTINGS_MODULE in test image for celery worker

Per Copilot review on #2776 (suppressed-due-to-low-confidence note, but
the bug is real): docker-compose.test.yml runs the celery worker from
anthias-test:dev. celery_tasks.py calls django.setup() at module import
time, which needs DJANGO_SETTINGS_MODULE in the environment. The
pre-refactor Dockerfile.celery.j2 set it explicitly; this PR moved that
ENV to Dockerfile.server.j2 only, so the production celery (running on
the server image) is fine but the test celery would have crashed with
ImproperlyConfigured.

Set the same ENV in Dockerfile.test.j2. Server and test images both ship
a usable Django environment for any process that imports anthias_django.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
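
The `# syntax=` fix in the first commit above depends on a Dockerfile rule: a parser directive is only honoured when it comes before any other comment, blank line, or instruction. A minimal sketch of a guard against that regression follows. It is not part of this change, and the helper names `has_leading_syntax_directive` and `find_offenders` are hypothetical; it only assumes rendered files are named `Dockerfile.<service>` in one directory.

```python
from pathlib import Path


def has_leading_syntax_directive(dockerfile_text: str) -> bool:
    """Return True if the very first line is a `# syntax=<value>` directive.

    Hypothetical helper: only line 1 is inspected, which is where this
    change places the directive in every rendered Dockerfile.
    """
    first_line = dockerfile_text.split('\n', 1)[0].strip()
    if not first_line.startswith('#'):
        return False
    # Directive keys are case-insensitive and may be padded with spaces.
    key, sep, value = first_line[1:].partition('=')
    return bool(sep) and key.strip().lower() == 'syntax' and bool(value.strip())


def find_offenders(docker_dir: Path) -> list[str]:
    """List rendered Dockerfile.* files that lack the leading directive.

    Assumes (hypothetically) that rendered output sits next to the .j2
    templates, so templates are skipped by suffix.
    """
    offenders = []
    for path in sorted(docker_dir.glob('Dockerfile.*')):
        if path.suffix == '.j2':
            continue  # Jinja templates, not rendered output
        if not has_leading_syntax_directive(path.read_text()):
            offenders.append(path.name)
    return offenders
```

Run after `--dockerfiles-only`, a call such as `find_offenders(Path('docker'))` would name any rendered file where a prelude has pushed the directive below line 1, the failure mode described for the server image.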
2026-07-30 09:15:51 -04:00 · 2026-04-29 15:21:43 +01:00
parent c76f5afa20
commit 5e00c8ba25
20 changed files with 241 additions and 145 deletions
--- a/.github/workflows/docker-build.yaml
+++ b/.github/workflows/docker-build.yaml
@@ -40,24 +40,35 @@ jobs:
    # Scoped per-job (not at workflow level) so `run-tests` and any
    # future read-only job don't inherit `packages: write`. `buildx`
    # needs it so `docker login ghcr.io` with GITHUB_TOKEN can push
-    # ghcr.io/screenly/anthias-*. `contents: read` is the implicit
-    # default but pinned explicitly so a future workflow edit can't
-    # silently lose checkout access.
+    # both ghcr.io/screenly/anthias-* image tags and the
+    # `buildcache-*` registry cache tags written by --cache-backend=
+    # registry. `contents: read` is the implicit default but pinned
+    # explicitly so a future workflow edit can't silently lose
+    # checkout access.
    permissions:
      contents: read
      packages: write
    strategy:
      # Don't cancel sibling jobs on the first failure: any platform that
-      # has already finished building its image will have pushed the
-      # immutable <short-hash>-<board> tag, which is harmless on its own.
-      # Only the publish-latest job below — gated on the entire matrix
-      # succeeding — moves the floating latest-* tag, so a partial failure
-      # leaves users on the previous coherent latest-* set instead of a
-      # half-pushed mix of old + new images.
+      # has already finished building its images will have pushed the
+      # immutable <short-hash>-<board> tags, which are harmless on their
+      # own. Only the publish-latest job below — gated on the entire
+      # matrix succeeding — moves the floating latest-* tag, so a partial
+      # failure leaves users on the previous coherent latest-* set
+      # instead of a half-pushed mix of old + new images.
+      #
+      # The matrix is intentionally only on `board`, not `(board, service)`:
+      # buildkit's per-runner cache hashes apt-get-update output (timestamps
+      # in /var/lib/apt/lists/*, mirror selection) into the layer digest,
+      # so the same package list installed on two different runners
+      # produces two different layer hashes. Building all services for
+      # one board on a single runner means the base apt layer is hashed
+      # once and shared across server / viewer / test / etc. — which is
+      # what makes Dockerfile.base.j2's include-shared layer actually
+      # dedup at the registry level. See refactor: drop celery image.
      fail-fast: false
      matrix:
        board: ['pi1', 'pi2', 'pi3', 'pi4', 'pi4-64', 'pi5', 'x86']
-        service: ['server', 'celery', 'redis', 'viewer']
        python-version: ["3.11"]
    runs-on: ubuntu-24.04

@@ -95,19 +106,6 @@ jobs:
          docker buildx create --use --name multiarch-builder
          docker buildx inspect --bootstrap

-      - name: Cache Docker layers
-        uses: actions/cache@27d5ce7f107fe9357f9df03efb73ab90386fccae # v5
-        id: cache
-        with:
-          path: /tmp/.buildx-cache/${{ matrix.board }}-${{ matrix.service }}
-          key: buildx-${{ matrix.board }}-${{ matrix.service }}-${{ hashFiles('docker/**/*') }}
-          restore-keys: |
-            buildx-${{ matrix.board }}-${{ matrix.service }}-
-
-      - name: Inspect cache before build
-        run: |
-          ls -la /tmp/.buildx-cache/${{ matrix.board }}-${{ matrix.service }} || true
-
      - name: Login to Docker Hub
        if: success() && github.event_name == 'push'
        uses: docker/login-action@4907a6ddec9925e35a0a9e82d7399ccc52663121 # v4
@@ -123,20 +121,32 @@ jobs:
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

-      - name: Build Container
+      - name: Build Containers
        env:
          DOCKER_BUILDKIT: 1
          BUILDKIT_PROGRESS: plain
+          # On push events (master): use BuildKit's registry cache
+          # backend — pushes cache to ghcr.io/screenly/anthias-
+          # <service>:buildcache-<board> so subsequent push runs of
+          # the same board can pull cached layers without
+          # re-installing ~825 MB of apt packages from scratch.
+          #
+          # On workflow_dispatch (manual runs on any branch): fall
+          # through to `local` mode — a per-runner ephemeral
+          # directory, so effectively no-cache. workflow_dispatch
+          # also skips the `docker login ghcr.io` step above (it's
+          # gated on `event_name == 'push'`), so registry cache
+          # would have no auth to write with anyway. This workflow
+          # has no `pull_request` trigger; PRs never run this job.
+          BUILDX_CACHE_BACKEND: ${{ github.event_name == 'push' && 'registry' || 'local' }}
        run: |
          uv run python -m tools.image_builder \
            --build-target=${{ matrix.board }} \
-            --service=${{ matrix.service }} \
+            --service=server \
+            --service=viewer \
+            --service=redis \
            ${{ github.event_name == 'push' && '--push --skip-latest-tag' || '' }}

-      - name: Inspect cache after build
-        run: |
-          ls -la /tmp/.buildx-cache/${{ matrix.board }}-${{ matrix.service }} || true
-
  # Mirror the immutable <short-hash>-<board> tags pushed by the buildx
  # matrix onto the floating latest-<board> tag. Runs only after every
  # buildx job has succeeded, so latest-* is never advanced from a
@@ -199,7 +209,7 @@ jobs:
          set -euo pipefail
          GIT_SHORT_HASH=$(git rev-parse --short=7 HEAD)
          BOARDS=(pi1 pi2 pi3 pi4 pi4-64 pi5 x86)
-          SERVICES=(server celery redis viewer)
+          SERVICES=(server redis viewer)
          # GHCR first so the canonical primary is current even if the
          # Docker Hub mirror later in the loop flakes.
          NAMESPACES=(ghcr.io/screenly/anthias screenly/anthias)
--- a/.github/workflows/test-runner.yml
+++ b/.github/workflows/test-runner.yml
@@ -43,7 +43,6 @@ jobs:
          uv run python -m tools.image_builder \
            --dockerfiles-only \
            --disable-cache-mounts \
-            --service celery \
            --service redis \
            --service test

--- a/.gitignore
+++ b/.gitignore
@@ -44,7 +44,6 @@ docker/Dockerfile.base
 docker/Dockerfile.nginx
 docker/Dockerfile.server
 docker/Dockerfile.websocket
-docker/Dockerfile.celery
 docker/Dockerfile.redis
 docker/Dockerfile.viewer
 docker/Dockerfile.test
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -11,7 +11,7 @@ Anthias is an open-source digital signage platform for Raspberry Pi and x86 PCs
 Anthias runs as a set of Docker containers:

 - **anthias-server** (port 80 in prod, 8000 in dev) — uvicorn (ASGI) serving the Django web app, REST API, the React frontend's static assets (via WhiteNoise), uploaded media at `/anthias_assets/`, and the WebSocket endpoint at `/ws` (Django Channels with a Redis-backed channel layer). Always plain HTTP — TLS is opt-in and handled by the **anthias-caddy** sidecar that `bin/enable_ssl.sh` installs as a compose override (Caddy local CA by default, or auto Let's Encrypt with `--domain`, or BYO cert with `--cert`/`--key`).
- **anthias-celery** — Async task queue (asset downloads, cleanup). Publishes asset-update events back to the WebSocket consumers via the Channels Redis layer.
+- **anthias-celery** — Async task queue (asset downloads, cleanup). Runs the same image as `anthias-server` with a CMD override that starts the Celery worker; the two services share the entire root filesystem to avoid duplicating ~825 MB of identical apt content per device. Publishes asset-update events back to the WebSocket consumers via the Channels Redis layer.
 - **anthias-viewer** — Drives the display, receives instructions over the Redis pub/sub `anthias.viewer` channel, talks to anthias-server over HTTP.
 - **redis** (port 6379) — Celery broker + result backend, Channels channel layer, and the viewer signalling bus (pub/sub channel + per-correlation-ID reply lists).
 - **webview** — Qt-based browser for rendering content on the display; fetches `/anthias_assets/` from anthias-server.
@@ -72,7 +72,7 @@ uv run ruff check /path/to/file.py     # Lint specific file

 ```bash
 # Build and start test containers
-uv run python -m tools.image_builder --dockerfiles-only --disable-cache-mounts --service celery --service redis --service test
+uv run python -m tools.image_builder --dockerfiles-only --disable-cache-mounts --service redis --service test
 docker compose -f docker-compose.test.yml up -d --build

 # Prepare and run tests (integration and non-integration must be run separately)
--- a/bin/upgrade_containers.sh
+++ b/bin/upgrade_containers.sh
@@ -69,7 +69,6 @@ if [[ -n $(docker ps | grep srly-ose) ]]; then
    set +e
    docker container rename srly-ose-server anthias-server
    docker container rename srly-ose-viewer anthias-viewer
-    docker container rename srly-ose-celery anthias-celery
    set -e
 fi

@@ -77,11 +76,18 @@ fi
 #   * nginx / websocket — folded into anthias-server (uvicorn).
 #   * wifi-connect      — service removed; nmcli/nmtui is the supported
 #                          path now.
+#   * anthias-celery / srly-ose-celery containers from the era when
+#     celery had its own image. The new compose file recreates the
+#     anthias-celery container against ghcr.io/screenly/anthias-server,
+#     so the old container (still pointing at the deleted celery image)
+#     must be removed first or the server-image-backed replacement
+#     can't take its name.
 # Volumes are shared across services, so removing the containers is safe.
 set +e
 docker rm -f \
    anthias-nginx anthias-websocket anthias-wifi-connect \
    srly-ose-nginx srly-ose-websocket srly-ose-wifi-connect \
+    anthias-celery srly-ose-celery \
    >/dev/null 2>&1
 set -e

--- a/docker-compose.balena.dev.yml.tmpl
+++ b/docker-compose.balena.dev.yml.tmpl
@@ -45,10 +45,12 @@ services:
      io.balena.features.supervisor-api: '1'

  anthias-celery:
-    image: ghcr.io/screenly/anthias-celery:${GIT_SHORT_HASH}-${BOARD}
-    build:
-      context: .
-      dockerfile: ./docker/Dockerfile.celery
+    # Runs on the same image as anthias-server with a CMD override.
+    # See docker-compose.yml.tmpl for context on the merge.
+    image: ghcr.io/screenly/anthias-server:${GIT_SHORT_HASH}-${BOARD}
+    command: >
+      celery -A celery_tasks.celery worker -B -n worker@anthias
+      --loglevel=info --schedule /tmp/celerybeat-schedule
    depends_on:
      - anthias-server
      - redis
--- a/docker-compose.balena.yml.tmpl
+++ b/docker-compose.balena.yml.tmpl
@@ -39,7 +39,12 @@ services:
      io.balena.features.supervisor-api: '1'

  anthias-celery:
-    image: ghcr.io/screenly/anthias-celery:${GIT_SHORT_HASH}-${BOARD}
+    # Runs on the same image as anthias-server with a CMD override.
+    # See docker-compose.yml.tmpl for context on the merge.
+    image: ghcr.io/screenly/anthias-server:${GIT_SHORT_HASH}-${BOARD}
+    command: >
+      celery -A celery_tasks.celery worker -B -n worker@anthias
+      --loglevel=info --schedule /tmp/celerybeat-schedule
    depends_on:
      - anthias-server
      - redis
--- a/docker-compose.dev.yml
+++ b/docker-compose.dev.yml
@@ -2,6 +2,10 @@

 services:
  anthias-server:
+    # Explicit image tag so anthias-celery below can reference the same
+    # built image without a duplicate `build:` block (which would
+    # produce a separate, byte-identical-but-distinct image tag).
+    image: anthias-server:dev
    build:
      context: .
      dockerfile: docker/Dockerfile.server
@@ -21,19 +25,27 @@ services:
      - ./:/usr/src/app/

  anthias-celery:
-    build:
-      context: .
-      dockerfile: docker/Dockerfile.celery
+    # Reuses anthias-server:dev via the explicit image tag above.
+    # Compose builds anthias-server first (it owns the build:) and
+    # this service inherits the same image, only overriding CMD.
+    image: anthias-server:dev
    depends_on:
-      - anthias-server
-      - redis
+      anthias-server:
+        condition: service_started
+      redis:
+        condition: service_started
+    command: >
+      celery -A celery_tasks.celery worker -B -n worker@anthias
+      --loglevel=info --schedule /tmp/celerybeat-schedule
    environment:
      - HOME=/data
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
+      - ENVIRONMENT=development
    restart: always
    volumes:
      - anthias-data:/data
+      - ./:/usr/src/app/

  redis:
    platform: "linux/amd64"
--- a/docker-compose.test.yml
+++ b/docker-compose.test.yml
@@ -2,6 +2,10 @@

 services:
  anthias-test:
+    # Explicit image tag so anthias-celery below can reference the same
+    # built image without a duplicate `build:` block (which would
+    # produce a separate, byte-identical-but-distinct image tag).
+    image: anthias-test:dev
    build:
      context: .
      dockerfile: docker/Dockerfile.test
@@ -17,18 +21,27 @@ services:
      - anthias-data:/data

  anthias-celery:
-    build:
-      context: .
-      dockerfile: docker/Dockerfile.celery
+    # Reuses anthias-test:dev via the explicit image tag above — the
+    # test image is a superset of server (same base apt + venv +
+    # bun + chrome for selenium). Compose builds anthias-test first
+    # (it owns the build:) and this service inherits the same image,
+    # only overriding CMD.
+    image: anthias-test:dev
+    command: >
+      celery -A celery_tasks.celery worker -B -n worker@anthias
+      --loglevel=info --schedule /tmp/celerybeat-schedule
    depends_on:
-      - anthias-test
-      - redis
+      anthias-test:
+        condition: service_started
+      redis:
+        condition: service_started
    environment:
      - HOME=/data
      - CELERY_BROKER_URL=redis://redis:6379/0
      - CELERY_RESULT_BACKEND=redis://redis:6379/0
    restart: always
    volumes:
+      - .:/usr/src/app
      - anthias-data:/data

  redis:
--- a/docker-compose.yml.tmpl
+++ b/docker-compose.yml.tmpl
@@ -58,10 +58,15 @@ services:
      io.balena.features.supervisor-api: '1'

  anthias-celery:
-    image: ghcr.io/screenly/anthias-celery:${DOCKER_TAG}-${DEVICE_TYPE}
-    build:
-      context: .
-      dockerfile: docker/Dockerfile.celery
+    # Runs on the same image as anthias-server with a CMD override.
+    # Shipping one image instead of two is the point — server and celery
+    # share their entire root filesystem (base apt + venv + app source),
+    # and a separate celery image was duplicating ~825 MB extracted of
+    # identical content per device. See refactor: drop celery image.
+    image: ghcr.io/screenly/anthias-server:${DOCKER_TAG}-${DEVICE_TYPE}
+    command: >
+      celery -A celery_tasks.celery worker -B -n worker@anthias
+      --loglevel=info --schedule /tmp/celerybeat-schedule
    depends_on:
      - anthias-server
      - redis
--- a/docker/Dockerfile.celery.j2
+++ b/docker/Dockerfile.celery.j2
@@ -1,22 +0,0 @@
-{% include 'uv-builder.j2' %}
-
-{% include 'Dockerfile.base.j2' %}
-
-COPY --from=uv-builder /venv /venv
-ENV PATH="/venv/bin:$PATH"
-ENV VIRTUAL_ENV="/venv"
-
-RUN mkdir -p /usr/src/app
-WORKDIR /usr/src/app
-COPY . /usr/src/app/
-
-ENV GIT_HASH={{ git_hash }}
-ENV GIT_SHORT_HASH={{ git_short_hash }}
-ENV GIT_BRANCH={{ git_branch }}
-ENV DJANGO_SETTINGS_MODULE="anthias_django.settings"
-
-CMD celery -A celery_tasks.celery worker \
-  -B -n worker@anthias \
-  --loglevel=info \
-  --schedule \
-  /tmp/celerybeat-schedule
--- a/docker/Dockerfile.server.j2
+++ b/docker/Dockerfile.server.j2
@@ -1,3 +1,5 @@
+# syntax=docker/dockerfile:1.4
+# vim: ft=dockerfile
 {% if environment == 'production' %}
 {# bun ships no 32-bit binaries at all — its release artifacts cover
   only {linux,darwin,windows}-{x64,aarch64}, so a target-platform
@@ -58,5 +60,6 @@ ENV GIT_HASH={{ git_hash }}
 ENV GIT_SHORT_HASH={{ git_short_hash }}
 ENV GIT_BRANCH={{ git_branch }}
 ENV DEVICE_TYPE={{ device_type }}
+ENV DJANGO_SETTINGS_MODULE="anthias_django.settings"

 CMD ["bash", "bin/start_server.sh"]
--- a/docker/Dockerfile.test.j2
+++ b/docker/Dockerfile.test.j2
@@ -1,9 +1,9 @@
+# syntax=docker/dockerfile:1.4
+# vim: ft=dockerfile
 {% include 'uv-builder.j2' %}

 {% include 'Dockerfile.base.j2' %}

-# vim: ft=dockerfile
-
 # @TODO: Uncomment this build stage when test_add_asset_streaming is fixed.
 # FROM debian:buster as builder

@@ -63,4 +63,5 @@ RUN cp ansible/roles/anthias/files/anthias.conf \
 ENV GIT_HASH={{ git_hash }}
 ENV GIT_SHORT_HASH={{ git_short_hash }}
 ENV GIT_BRANCH={{ git_branch }}
+ENV DJANGO_SETTINGS_MODULE="anthias_django.settings"
 ENV PATH="/opt/chrome-linux64:/opt/chromedriver-linux64:$PATH"
--- a/docker/Dockerfile.viewer.j2
+++ b/docker/Dockerfile.viewer.j2
@@ -1,8 +1,9 @@
+# syntax=docker/dockerfile:1.4
+# vim: ft=dockerfile
 {% include 'uv-builder.j2' %}

-FROM {{ base_image }}:{{ base_image_tag }}
+{% include 'Dockerfile.base.j2' %}

-# This list needs to be trimmed back later
 {% if disable_cache_mounts %}
 RUN \
 {% else %}
@@ -10,7 +11,7 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
 {% endif %}
    apt-get update && \
    apt-get -y install --no-install-recommends \
-    {% for dependency in apt_dependencies -%}
+    {% for dependency in viewer_extra_apt_dependencies -%}
        {% if not loop.last %}
        {{ dependency }} \
        {% else %}
@@ -22,10 +23,6 @@ COPY --from=uv-builder /venv /venv
 ENV PATH="/venv/bin:$PATH"
 ENV VIRTUAL_ENV="/venv"

-# Works around issue with `curl`
-# https://github.com/balena-io-library/base-images/issues/562
-RUN c_rehash
-
 # QT Base from packages does not support eglfs.
 # Qt 5 boards (Pi 1-4 32-bit) get a custom cross-built Qt runtime via the
 # qt5-* archive below; Qt 6 boards use Debian's apt qt6-* packages and so
@@ -74,6 +71,4 @@ WORKDIR /usr/src/app
 RUN mkdir -p /usr/src/app
 COPY . /usr/src/app/

-{% include 'labels.j2' %}
-
 CMD ["bash", "./bin/start_viewer.sh"]
--- a/docker/labels.j2
+++ b/docker/labels.j2
@@ -5,8 +5,7 @@
   push/delete access. Without it, packages stay private even when the
   repo is public, and the package page on GitHub has no link back. #}
 {% set service_descriptions = {
-    'server': 'Anthias web server (uvicorn + Django + Channels)',
-    'celery': 'Anthias background task worker (asset downloads, cleanup, display power)',
+    'server': 'Anthias web server (uvicorn + Django + Channels); also runs the Celery worker via a CMD override',
    'redis': 'Redis broker for Anthias Celery and Channels',
    'viewer': 'Anthias display/viewer service',
    'test': 'Anthias test runner',
--- a/docker/uv-builder.j2
+++ b/docker/uv-builder.j2
@@ -1,6 +1,3 @@
-# syntax=docker/dockerfile:1.4
-# vim: ft=dockerfile
-
 {# Single source of truth for the uv version — used both by the
   prebuilt-image COPY (amd64/arm64) and the PyPI fallback
   (32-bit ARM), so both paths stay byte-pinned and reproducible. #}
--- a/docs/developer-documentation.md
+++ b/docs/developer-documentation.md
@@ -90,7 +90,6 @@ Build and start the containers.
 $ uv run python -m tools.image_builder \
  --dockerfiles-only \
  --disable-cache-mounts \
-  --service celery \
  --service redis \
  --service test
 $ docker compose \
--- a/tools/image_builder/__main__.py
+++ b/tools/image_builder/__main__.py
@@ -34,24 +34,37 @@ def build_image(
    clean_build: bool,
    push: bool,
    dockerfiles_only: bool,
+    cache_backend: str,
 ) -> None:
    # Enable BuildKit
    os.environ['DOCKER_BUILDKIT'] = '1'

    context = {}

-    # Create board-specific cache directory
-    cache_dir = Path('/tmp/.buildx-cache') / (
+    # Local cache: per-board on-disk directory under the user's
+    # XDG-style cache home (override via $XDG_CACHE_HOME). Per-user
+    # rather than under /tmp so a multi-user host doesn't share
+    # buildkit cache state across accounts. Unused by the registry
+    # backend, which pushes to GHCR instead.
+    cache_scope = (
        f'{board}-64'
        if board == 'pi4' and target_platform == 'linux/arm64/v8'
        else board
    )
-    try:
-        cache_dir.mkdir(parents=True, exist_ok=True)
-    except Exception as e:
-        click.secho(
-            f'Warning: Failed to create cache directory: {e}', fg='yellow'
-        )
+    xdg_cache_home = (
+        Path(os.environ['XDG_CACHE_HOME'])
+        if os.environ.get('XDG_CACHE_HOME')
+        else Path.home() / '.cache'
+    )
+    cache_dir = xdg_cache_home / 'anthias-buildx' / cache_scope
+    if cache_backend == 'local':
+        try:
+            cache_dir.mkdir(parents=True, exist_ok=True)
+        except Exception as e:
+            click.secho(
+                f'Warning: Failed to create cache directory: {e}',
+                fg='yellow',
+            )

    base_apt_dependencies = [
        'build-essential',
@@ -141,22 +154,79 @@ def build_image(
    except:  # noqa: E722
        docker.buildx.create(name='multiarch-builder', use=True)

-    docker.buildx.build(
-        context_path='.',
-        cache=(not clean_build),
-        cache_from={
-            'type': 'local',
-            'src': str(cache_dir),
-        }
-        if not clean_build
-        else None,
-        cache_to={
+    # Resolve cache_from / cache_to. `--clean-build` short-circuits both
+    # to None for a true cold rebuild. Otherwise we pick a backend:
+    #
+    #   * local    — board-scoped on-disk directory at
+    #     $XDG_CACHE_HOME/anthias-buildx/<board> (typically
+    #     ~/.cache/anthias-buildx/<board>). Used for local dev so
+    #     cache state survives across `tools.image_builder`
+    #     invocations on the same machine.
+    #   * registry — BuildKit's registry cache backend
+    #     (https://docs.docker.com/build/cache/backends/registry/).
+    #     Pushes cache to a tagged image at
+    #     <namespace>-<service>:buildcache-<board>. Reuses the GHCR
+    #     login already done by CI — no extra tokens or third-party
+    #     actions needed — and inherits GHCR's free unlimited
+    #     storage for public packages. Cache lives next to the real
+    #     image tags but with a `buildcache-*` prefix so it can't
+    #     collide with the immutable <short-hash>-<board> or
+    #     floating latest-<board> tags.
+    if clean_build:
+        cache_from = None
+        cache_to = None
+    elif cache_backend == 'registry':
+        # Hardcode the GHCR-primary namespace so the cache lives next to
+        # the published images for this service. Doesn't read from
+        # `namespaces` below: cache only needs one canonical home, and
+        # GHCR's free unlimited storage for public packages makes it the
+        # right one. If the namespaces list changes in the future, this
+        # ref needs to move with it.
+        cache_ref = (
+            f'ghcr.io/screenly/anthias-{service}:buildcache-{cache_scope}'
+        )
+        # Reads are always safe — anthias-* GHCR packages are public,
+        # so cache_from works without auth (matters for someone
+        # invoking this locally with --cache-backend=registry to
+        # warm-start off CI's cache).
+        cache_from = {'type': 'registry', 'ref': cache_ref}
+        if push:
+            cache_to = {
+                'type': 'registry',
+                'ref': cache_ref,
+                'mode': 'max',
+                # `image-manifest=true` writes the cache as an OCI
+                # image manifest rather than the legacy index-only
+                # form, which is the only thing GHCR will accept
+                # under the ghcr.io/screenly/anthias-* repos (it
+                # rejects standalone cache manifests). Cheap, just
+                # affects how the cache blob is wrapped.
+                'image-manifest': 'true',
+            }
+        else:
+            # Without --push the build hasn't authenticated to GHCR,
+            # so trying to write cache there would fail mid-build.
+            # Read-only: pull layers from the published cache, don't
+            # update it.
+            cache_to = None
+            click.secho(
+                f'cache-backend=registry without --push: reading from '
+                f'{cache_ref} but not writing back.',
+                fg='yellow',
+            )
+    else:
+        cache_from = {'type': 'local', 'src': str(cache_dir)}
+        cache_to = {
            'type': 'local',
            'dest': str(cache_dir),
            'mode': 'max',
        }
-        if not clean_build
-        else None,
+
+    docker.buildx.build(
+        context_path='.',
+        cache=(not clean_build),
+        cache_from=cache_from,
+        cache_to=cache_to,
        builder='multiarch-builder',
        file=f'docker/Dockerfile.{service}',
        load=True,
@@ -225,6 +295,21 @@ def build_image(
    '--dockerfiles-only',
    is_flag=True,
 )
+@click.option(
+    '--cache-backend',
+    type=click.Choice(['local', 'registry']),
+    default='local',
+    envvar='BUILDX_CACHE_BACKEND',
+    help=(
+        'BuildKit cache backend. `local` (default) writes to '
+        '$XDG_CACHE_HOME/anthias-buildx/<board>/ (typically '
+        '~/.cache/anthias-buildx/) and is right for local dev. '
+        '`registry` pushes the cache to '
+        'ghcr.io/screenly/anthias-<service>:buildcache-<board> for '
+        'CI, reusing the GHCR login already done by the workflow (no '
+        'extra tokens needed). Also settable via $BUILDX_CACHE_BACKEND.'
+    ),
+)
 def main(
    clean_build: bool,
    build_target: str,
@@ -235,6 +320,7 @@ def main(
    push: bool,
    skip_latest_tag: bool,
    dockerfiles_only: bool,
+    cache_backend: str,
 ) -> None:
    git_branch = pygit2.Repository('.').head.shorthand
    git_hash = str(pygit2.Repository('.').head.target)
@@ -300,6 +386,7 @@ def main(
            clean_build,
            push,
            dockerfiles_only,
+            cache_backend,
        )


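Reviewer note: the cache-backend branching in build_image above can be read as a small pure function. The sketch below is illustrative only and is not part of this patch: `select_cache_options` is a hypothetical name, and the ref and directory passed in the usage lines are placeholders. The dict keys mirror the ones used in the diff.

```python
from pathlib import Path
from typing import Optional

CacheSpec = dict[str, str]


def select_cache_options(
    cache_backend: str,
    push: bool,
    cache_ref: str,
    cache_dir: Path,
) -> tuple[CacheSpec, Optional[CacheSpec]]:
    """Return (cache_from, cache_to) for a buildx build.

    Hypothetical helper mirroring the branching in the diff above.
    """
    if cache_backend == 'registry':
        cache_from = {'type': 'registry', 'ref': cache_ref}
        if not push:
            # No --push means no registry login, so stay read-only.
            return cache_from, None
        cache_to = {
            'type': 'registry',
            'ref': cache_ref,
            'mode': 'max',
            'image-manifest': 'true',
        }
        return cache_from, cache_to

    # Local backend: read from and write to the same directory.
    local_dir = str(cache_dir)
    return (
        {'type': 'local', 'src': local_dir},
        {'type': 'local', 'dest': local_dir, 'mode': 'max'},
    )


# Placeholder inputs, chosen only for illustration.
example_ref = 'registry.example.invalid/app:buildcache-x86'
example_dir = Path('/tmp/example-buildx-cache')
read_only = select_cache_options('registry', False, example_ref, example_dir)
read_write = select_cache_options('registry', True, example_ref, example_dir)
local = select_cache_options('local', False, example_ref, example_dir)
```

Keeping the selection separate from the docker.buildx.build call would let the three cases (registry read-write, registry read-only, local) be unit-tested without a Docker daemon.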
--- a/tools/image_builder/constants.py
+++ b/tools/image_builder/constants.py
@@ -2,7 +2,6 @@ SHORT_HASH_LENGTH = 7
 BUILD_TARGET_OPTIONS = ['pi1', 'pi2', 'pi3', 'pi4', 'pi4-64', 'pi5', 'x86']
 SERVICES = (
    'server',
-    'celery',
    'redis',
    'viewer',
    'test',
--- a/tools/image_builder/utils.py
+++ b/tools/image_builder/utils.py
@@ -80,7 +80,6 @@ def generate_dockerfile(service: str, context: dict[str, Any]) -> None:
 def get_uv_builder_context(service: str) -> dict[str, Any]:
    service_to_group = {
        'server': 'server',
-        'celery': 'server',
        'viewer': 'viewer',
        'test': 'test',
    }
@@ -158,25 +157,29 @@ def get_viewer_context(board: str, target_platform: str) -> dict[str, Any]:

    qt_major_version = qt_version.split('.')[0]

-    apt_dependencies = [
-        'build-essential',
+    # Viewer-only apt deps. The shared set (build-essential, curl, ffmpeg,
+    # git-core, libcec-dev, libffi-dev, libssl-dev, net-tools, procps,
+    # psmisc, python-is-python3, python3-dev, python3-gi, python3-pip,
+    # python3-setuptools, sqlite3, sudo, plus libraspberrypi0 on 32-bit
+    # Pi boards) is installed by Dockerfile.base.j2 in a layer that the
+    # server and test images also use, so it is deduplicated across
+    # images. Everything listed here is unique to the viewer image.
+    viewer_extra_apt_dependencies = [
        'ca-certificates',
-        'curl',
        'dbus-daemon',
        'fonts-arphic-uming',
-        'git-core',
        'libasound2-dev',
        'libavcodec-dev',
+        'libavdevice-dev',
+        'libavfilter-dev',
        'libavformat-dev',
        'libavutil-dev',
        'libbz2-dev',
-        'libcec-dev ',
        'libdbus-1-dev',
        'libdbus-glib-1-dev',
        'libdrm-dev',
        'libegl1-mesa-dev',
        'libevent-dev',
-        'libffi-dev',
        'libfontconfig1-dev',
        'libfreetype6-dev',
        'libgbm-dev',
@@ -204,7 +207,7 @@ def get_viewer_context(board: str, target_platform: str) -> dict[str, Any]:
        'libsnappy-dev',
        'libsqlite3-dev',
        'libsrtp2-dev',
-        'libssl-dev',
+        'libswresample-dev',
        'libswscale-dev',
        'libsystemd-dev',
        'libts-dev',
@@ -241,31 +244,13 @@ def get_viewer_context(board: str, target_platform: str) -> dict[str, Any]:
        'libxslt1-dev',
        'libxss-dev',
        'libxtst-dev',
-        'net-tools',
-        'procps',
-        'psmisc',
-        'python3-dev',
-        'python3-gi',
        'python3-netifaces',
-        'python3-pip',
-        'python3-setuptools',
-        'python-is-python3',
        'ttf-wqy-zenhei',
        'vlc',
-        'sudo',
-        'sqlite3',
-        'ffmpeg',
-        'libavcodec-dev',
-        'libavdevice-dev',
-        'libavfilter-dev',
-        'libavformat-dev',
-        'libavutil-dev',
-        'libswresample-dev',
-        'libswscale-dev',
    ]

    if is_qt6:
-        apt_dependencies.extend(
+        viewer_extra_apt_dependencies.extend(
            [
                'mpv',
                'qt6-base-dev',
@@ -274,9 +259,11 @@ def get_viewer_context(board: str, target_platform: str) -> dict[str, Any]:
            ]
        )
    else:
-        apt_dependencies.extend(
+        # libraspberrypi0 already comes in via base_apt_dependencies on
+        # 32-bit Pi boards (see __main__.py), so it's deliberately not
+        # repeated here.
+        viewer_extra_apt_dependencies.extend(
            [
-                'libraspberrypi0',
                'libgst-dev',
                'libsqlite0-dev',
                'libsrtp0-dev',
@@ -285,10 +272,10 @@ def get_viewer_context(board: str, target_platform: str) -> dict[str, Any]:
        )

        if board != 'pi1':
-            apt_dependencies.extend(['libssl1.1'])
+            viewer_extra_apt_dependencies.append('libssl1.1')

    return {
-        'apt_dependencies': apt_dependencies,
+        'viewer_extra_apt_dependencies': viewer_extra_apt_dependencies,
        'qt_version': qt_version,
        'qt_major_version': qt_major_version,
        'webview_version': webview_version,