Commit Graph

174 Commits

Author SHA1 Message Date
Viktor Petersson
10c68b26cc feat(viewer,build,balena): add arm64/Qt6 pi3-64 board and the Rock Pi 4 fleet; keep 32-bit pi3 as legacy (#2985)
* feat(viewer,build): add arm64/Qt6 pi3-64 board; keep 32-bit pi3 as legacy

Revises issue #2906 Phase 2. The original plan (delete the Qt 5 toolchain,
force Pi 2/Pi 3 onto Qt 6) is abandoned: Qt 5 was fixed up on master and
stays. Instead, add a NEW board target `pi3-64` — a 64-bit (arm64) Qt 6
viewer image for Raspberry Pi 3 hardware on a 64-bit OS — as its own image
stream, disk image, and balena fleet. The legacy 32-bit armhf/Qt5 `pi3`
board is left untouched and flagged as legacy/maintenance.

pi3-64 mirrors the existing `pi4-64` path (Qt 6, eglfs_kms; video played
in-process by AnthiasViewer's QtMultimedia pipeline — QMediaPlayer + the
ffmpeg/libavcodec backend with V4L2 HW decode, no external player).
VideoCore IV is H.264-only HW decode. Board selection is by `uname -m`: a
Pi 3 on a 64-bit OS gets `pi3-64`, a 32-bit OS keeps `pi3` (the model
string is identical on both arches).
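
A minimal Python sketch of the selection rule above. The real detection lives in install.sh / upgrade_containers.sh; the function name `select_pi3_board` and the exact architecture strings handled here are assumptions for illustration.

```python
import platform


def select_pi3_board(machine: str) -> str:
    """Map a `uname -m` style string to a Pi 3 board target (illustrative)."""
    # aarch64 indicates a 64-bit OS, so the arm64/Qt 6 image applies.
    if machine == "aarch64":
        return "pi3-64"
    # Anything else (e.g. armv7l) keeps the legacy 32-bit Qt 5 board.
    return "pi3"


if __name__ == "__main__":
    # platform.machine() reports the same value as `uname -m` on Linux.
    print(select_pi3_board(platform.machine()))
```

The model string is deliberately not consulted, since it is identical on both arches.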

- image_builder: pi3-64 build params (arm64) + is_qt6; constants.
- Dockerfile.viewer.j2 + start_viewer.sh: pi3-64 shares the pi4-64 eglfs
  KMS path; renamed board-agnostic eglfs-kms-pi4.json -> eglfs-kms.json.
- Detection: install.sh / upgrade_containers.sh (aarch64 Pi 3 -> pi3-64).
- Runtime: media_player force_mpv set (selects MPVMediaPlayer, the
  QtMultimedia D-Bus shim); processing codec grid {'h264'}.
- CI: docker-build matrix + mirror-latest-tags.
- Balena (fleet screenly_ose/anthias-pi3-64, device type raspberrypi3-64):
  disk-image + manual-deploy workflows, balena_ota_deploy.sh,
  balena_fleet_maintenance.py, balena_unpin_devices.py, deploy_to_balena.sh,
  balena-host-config.json.
- Pi Imager: SUPPORTED_BOARDS += pi3-64 (non-maintenance); pi3 stays legacy.
- Docs + tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(website): link the Pi 3 (64-bit) bullet like its siblings

Copilot review: the list is introduced as 'links to the images', so the
new pi3-64 entry should be navigable like the surrounding bullets. Link
the label to the release-images section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(balena): add the Rock Pi 4 fleet (screenly_ose/anthias-rockpi4)

Wires the anthias-rockpi4 balena fleet (device type rockpi-4b-rk3399)
into the OTA deploy + disk-image pipeline. The fleet has no
board-specific image build: it runs the generic arm64 containers, so
bin/balena_ota_deploy.sh / bin/deploy_to_balena.sh map the rockpi4
board to the <short-hash>-arm64 image tags (and strip the /dev/vchiq
mount — no VideoCore on RK3399), and the disk-image preflight verifies
the arm64 images exist.

Root-cause fix for the fleet's codec gate: balena ships no
anthias_host_agent service, so host:board_subtype was never published
and resolve_device_key() stayed 'arm64' — whose HW-decode set is empty,
rejecting every video upload. The model-string → subtype table moves to
the dependency-free anthias_common.device_helper.detect_board_subtype
(single source, imported by host_agent), and
anthias_common.board.get_board_subtype now falls back to reading
/proc/device-tree/model in-container when Redis has no value. The
device tree is kernel-global — the same mechanism get_device_type has
always used for Pi detection — so the rockpi4 fleet resolves its
{h264, hevc} envelope without a host-side daemon, and compose installs
whose host_agent died self-heal too.
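
The lookup order described above (Redis first, device tree second) can be sketched as follows. This is a hedged illustration, not the shipped code: the `redis_get` callable, the `MODEL_SUBTYPES` table contents, and the function signatures are assumptions; only the names `get_board_subtype`, `detect_board_subtype`, `host:board_subtype` and `/proc/device-tree/model` come from the commit message.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional

# Hypothetical excerpt of the model-string -> subtype table.
MODEL_SUBTYPES: Mapping[str, str] = {"rock pi 4": "rockpi4"}


def detect_board_subtype(model: str) -> Optional[str]:
    """Return the subtype whose key appears in the model string, if any."""
    lowered = model.lower()
    for needle, subtype in MODEL_SUBTYPES.items():
        if needle in lowered:
            return subtype
    return None


def get_board_subtype(
    redis_get: Callable[[str], Optional[str]],
    model_path: Path = Path("/proc/device-tree/model"),
) -> Optional[str]:
    """Prefer the value published to Redis, else read the device tree."""
    published = redis_get("host:board_subtype")
    if published:
        return published
    try:
        raw = model_path.read_bytes()
    except OSError:
        # No device tree (e.g. x86): nothing to fall back to.
        return None
    # The device-tree model string is NUL-terminated.
    return detect_board_subtype(raw.rstrip(b"\x00").decode("utf-8", "replace"))
```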

- build-balena-disk-image.yaml: rockpi4 in both matrices, fleet +
  rockpi-4b-rk3399 image cases, arm64 images in the preflight check.
- deploy-balena-manual.yaml: rockpi4 board option.
- balena-host-config.json: rockpi4 declared {} (config.txt is
  RPi-only; the reconcile hard-fails on a missing key).
- balena_fleet_maintenance.py / balena_unpin_devices.py: fleet added.
- tests: get_board_subtype Redis-first + device-tree-fallback order;
  detect_board_subtype patch targets follow the move.
- docs: board-enablement, balena-fleet-host-config,
  installation-options.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 07:49:12 +02:00
Viktor Petersson
1f438d2af0 perf(viewer): render video via QML VideoOutput in a QQuickWidget (#2975)
* perf(viewer): render video via QML VideoOutput in a QQuickWidget

- replace the QGraphicsVideoItem-on-raster-QGraphicsView substrate:
  QVideoFrame::toImage did an RHI offscreen render + GPU->CPU
  readback per frame, capping presentation at 8.3 fps (Pi 4) /
  10-12 fps (Pi 5) with a saturated GUI thread while HW decode ran
  fine (issue 2967). Validated on both testbeds: Pi 4 30.0 fps
  presented at 64% total CPU, Pi 5 26.6 fps at 13-35%
- VideoOutput keeps frames on the GPU: scene-graph textures with
  shader YUV->RGB, composited through the same QQuickRenderControl
  FBO machinery QWebEngineView already uses (eglfs-safe, inherits
  whole-screen rotation -- re-validated under QT_QPA_EGLFS_ROTATION)
- log frames-rendered (QQuickWindow::afterRendering) next to
  frames-delivered in playback-stats so presentation-side drops are
  visible -- the sink-only counter is how the 8 fps regression
  shipped unnoticed; connection is retried from play() so the
  counter can't silently stay dead
- fail hard (qFatal) when the QML scene is unavailable instead of
  decoding video to nowhere: crash-respawn is supervised and loud,
  a silent black-screen kiosk is not
- video-rotate maps to VideoOutput.orientation (still a defensive
  no-op; every platform rotates the whole screen)
- ship qt6-declarative-dev + qml6-module-qtquick/-qtmultimedia in
  the Qt6 viewer images; drop the now-unused multimediawidgets
- run the C++ tests with QT_QUICK_BACKEND=software so the QML scene
  loads under the offscreen platform

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(image-builder): align gstreamer-drop version comment to Qt 6.5

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 17:06:30 +02:00
Viktor Petersson
94814958a3 fix(viewer): play pi2/pi3 video via GStreamer HW pipeline (replace VLC) (#2972)
- Replace VLCMediaPlayer with GstFbdevMediaPlayer: spawn `gst-launch-1.0
  playbin` with a fully-hardware video sink — v4l2h264dec (bcm2835 codec)
  decodes, v4l2convert (bcm2835 ISP) HW-scales + converts YUV to the
  framebuffer format, fbdevsink paints /dev/fb0. No DRM master / X /
  Wayland needed, which a bare uid-1000 viewer with no compositor cannot
  acquire.
- Restores hardware video on the Qt5 linuxfb boards after #1980 dropped
  the Broadcom mmal_vout; drives the same VPU + ISP silicon mmal used.
  Measured on a Pi 3: 1080p30 -> rgb565 at ~40 fps, zero dropped frames.
  (An interim ffmpeg -> fbdev approach was abandoned: HW decode worked
  but CPU YUV->rgb565 convert via swscale managed only ~6 fps — no NEON
  path — and CPU scaling is unaccelerated. v4l2convert moves that to the
  ISP.)
- playbin gives container-agnostic demux, auto-plugged HW decoder, and
  graceful optional-audio. Loop the slot via re-launch-on-EOS; kill by
  process group so no gst-launch orphan keeps the framebuffer.
- Rotation rides `videoflip`, inserted only when the panel is rotated.
- Add gstreamer1.0-{tools,plugins-base,plugins-good,plugins-bad} to the
  Qt5 viewer image; drop the now-unused vlc apt package + python-vlc dep
  (+ its mypy override). Sync the upload-gate comment in processing.py.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 14:14:02 +02:00
Viktor Petersson
6b3f638c60 fix(viewer): retry AnthiasViewer spawn so armv7 WebEngine-init crash self-heals (#2969)
* fix(viewer): skip the Writeback connector in the eglfs headless guard

#2962 added wait_for_eglfs_display so a screenless eglfs (Pi 4) board waits
for a display instead of crash-looping on Qt's "no screens available".
eglfs_has_display() treated any connector status other than "disconnected"
as a present display (to tolerate bridges that report "unknown").

balenaOS 2026.x exposes a KMS `card0-Writeback-1` virtual connector that
ALWAYS reports "unknown". On a headless Pi 4 (both HDMI ports
"disconnected") the writeback connector's "unknown" satisfied the guard, so
it skipped the wait, launched eglfs, and the viewer crash-looped on
"no screens available" / "AnthiasViewer exited before emitting D-Bus
handshake" — exactly the failure #2962 was meant to prevent. Confirmed on
multiple live pi4 on 2026.1.0 (card0-HDMI-A-1/-2 = disconnected,
card0-Writeback-1 = unknown).

Skip `*Writeback*` connectors so only real display outputs (HDMI/DSI/DP/…)
count. A genuinely headless board now waits gracefully; the bridge-"unknown"
hedge is preserved for real connectors. Verified locally for headless,
connected, and bridge-unknown layouts.
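
A Python rendering of the guard logic described above, as a sketch only: the real check is `eglfs_has_display()` in the viewer start-up path, and the helper names `has_real_display` and `read_connector_statuses` are hypothetical.

```python
from pathlib import Path
from typing import Dict, Mapping


def has_real_display(statuses: Mapping[str, str]) -> bool:
    """True if any non-writeback connector is not 'disconnected'."""
    for connector, status in statuses.items():
        # Writeback connectors are virtual and always report "unknown".
        if "Writeback" in connector:
            continue
        # "unknown" still counts, to tolerate bridges that report it.
        if status != "disconnected":
            return True
    return False


def read_connector_statuses(drm_root: Path = Path("/sys/class/drm")) -> Dict[str, str]:
    """Collect connector name -> status from sysfs."""
    statuses: Dict[str, str] = {}
    for status_file in sorted(drm_root.glob("card*-*/status")):
        try:
            statuses[status_file.parent.name] = status_file.read_text().strip()
        except OSError:
            continue
    return statuses
```

The three layouts mentioned above map to: headless (both HDMI ports disconnected plus a writeback connector) is false, a connected HDMI port is true, and a real connector reporting "unknown" is true.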

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): retry AnthiasViewer spawn so armv7 WebEngine-init crash self-heals

- Wrap the AnthiasViewer launch in a capped exponential-backoff retry
  loop (BROWSER_SPAWN_MAX_ATTEMPTS) instead of raising on the first
  failed handshake
- Convert the tight container restart loop on Pi 2/Pi 3 into an
  in-process retry that self-heals on a later launch
- Publish viewer:webview_status to Redis (retrying/failed) so a stuck
  board is distinguishable from an empty playlist
- Add WebviewLaunchError + _spawn_webview_once helper; throttle repeat
  warnings to avoid flooding journald
- Cover retry-then-succeed and exhaust-then-raise paths in tests
- Document the armv7 WebEngine-init crash + retry stop-gap in
  docs/board-enablement.md

The 32-bit Qt5 viewer intermittently aborts during Chromium/WebEngine
init (malloc(): unaligned tcache chunk detected) ~75-90% of launches;
reproduced on a 64-bit Pi 3B+. No userspace mitigation fixes the
corruption, but a fresh launch clears it ~10-25% of the time, so
retrying catches a good launch within a few attempts (validated
on-device: handshake on attempt 6). Clean fix is arm64/Qt6 on 64-bit OS.
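
As a rough check on why a handful of retries is enough, the following sketch computes the chance of at least one good launch within n attempts. It assumes each launch succeeds independently with the same probability, which the commit does not claim; the function name is hypothetical.

```python
def p_handshake_within(per_launch_success: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent launches succeeds."""
    return 1.0 - (1.0 - per_launch_success) ** attempts


if __name__ == "__main__":
    for p in (0.10, 0.25):
        print(f"p={p:.2f}: {p_handshake_within(p, 6):.3f} within 6 attempts")
```

Under that assumption, six attempts give roughly 47% at a 10% per-launch success rate and roughly 82% at 25%.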

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): address review — bound status-beacon Redis, funnel CommandNotFound

- Give the webview health beacon a dedicated Redis client with short
  socket timeouts (connect_to_redis gains opt-in timeout params,
  defaulting to the historical blocking behaviour) so a Redis stall
  can't hang viewer startup inside the spawn-retry loop
- Wrap sh.CommandNotFound into WebviewLaunchError in _spawn_webview_once
  so a missing binary is reported + handled on the same path as every
  other launch failure instead of escaping the retry loop
- Reword the board-enablement note so it describes the WebEngine-init
  observation without referencing a --no-sandbox flag the viewer
  doesn't receive
- conftest: accept the new connect_to_redis kwargs in the fake factory

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): chain final launch error; harden second redis test patch

- Chain the exhausted-retries WebviewLaunchError from last_error so the
  traceback preserves the underlying failure (timeout / early-exit /
  wrapped CommandNotFound)
- conftest: the autouse _mock_redis fixture's connect_to_redis patch now
  accepts *args, **kwargs too (matches the import-time patch)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): scope spawn-retry by call site; drop write-only status beacon

Addresses self-review findings on the retry mechanism:

- Mid-playback respawn (view_image/view_webpage, on the asset_loop
  thread) now uses a small, short budget (BROWSER_SPAWN_INLINE_*) so a
  persistent crash can't freeze the loop (no rotations/skips/standby,
  watchdog starved) for minutes; startup keeps the generous budget. A
  persistent mid-run failure raises and the container restart re-rolls.
- Permanent failures (missing binary) raise WebviewBinaryMissingError
  and short-circuit the retry instead of burning the full backoff budget.
- _spawn_webview_once now reaps the terminated process (SIGTERM, wait,
  SIGKILL) on the handshake-timeout path so a retry can't overlap two
  AnthiasViewers contending for the framebuffer / D-Bus name.
- Reset the stale `browser` global before re-spawning.
- Poll spawned process every 0.25s (was 1s) so a fast init crash is
  noticed promptly in the retry loop.
- Drop the write-only viewer:webview_status Redis beacon (no reader
  existed) and revert the connect_to_redis timeout-param widening +
  conftest churn; operator-visible status is the throttled log output.
- Tests: cover early-exit, terminate-on-timeout, missing-binary
  short-circuit, backoff growth, and the inline budget cap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): clamp max_attempts to >=1; correct retry-logging comment

- Guard load_browser against a non-positive max_attempts (would skip the
  loop and raise a confusing "0 attempts; last error: None")
- Reword the comment: the first failure logs its reason AND a retry
  line, so it's not literally "one log line per attempt"

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): clamp backoff_cap and startup_timeout in load_browser

- A backoff_cap below 1s would devolve into a tight retry loop; a
  negative one would make sleep() raise ValueError mid-retry and mask
  the real launch error
- Clamp a negative startup_timeout to 0 (immediate-timeout attempt)
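
The retry behaviour built up across the commits above can be sketched as below. This is a simplified illustration under stated assumptions: the signature of `load_browser`, the `spawn_once` callable, the default values, and the injectable `sleep` are all hypothetical; only the exception names, the clamps, the short-circuit on a missing binary, and the chaining from `last_error` follow the commit text.

```python
import time
from typing import Callable, Optional


class WebviewLaunchError(Exception):
    """A launch attempt failed (timeout, early exit, ...)."""


class WebviewBinaryMissingError(WebviewLaunchError):
    """Permanent failure: retrying cannot help."""


def load_browser(
    spawn_once: Callable[[float], None],
    max_attempts: int = 8,
    startup_timeout: float = 30.0,
    backoff_base: float = 1.0,
    backoff_cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry spawn_once with capped exponential backoff; return the attempt used."""
    max_attempts = max(1, max_attempts)          # never skip the loop
    backoff_cap = max(1.0, backoff_cap)          # avoid a tight loop / negative sleep
    startup_timeout = max(0.0, startup_timeout)  # negative -> immediate timeout
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            spawn_once(startup_timeout)
            return attempt
        except WebviewBinaryMissingError:
            raise  # short-circuit: do not burn the backoff budget
        except WebviewLaunchError as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(min(backoff_cap, backoff_base * 2 ** (attempt - 1)))
    raise WebviewLaunchError(
        f"gave up after {max_attempts} attempts"
    ) from last_error
```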

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 07:35:45 +02:00
Viktor Petersson
cca69b594d fix(balena): repair Pi 4 graphics overlay + manage fleet host config as IAC (#2947) (#2949)
* fix(viewer): detect Pi 4 eglfs DRM card at runtime so boot doesn't hang (#2947)

- vc4-drm (display) and v3d (render-only) race during probe, so the
  display node is card0 on some boots/images and card1 on others
- #2905 hardcoded /dev/dri/card1; when vc4 loses the race eglfs opens
  the render-only node, finds no connectors, and the device hangs on
  the balena splash forever
- start_viewer.sh now picks the card that owns connectors at runtime
  and rewrites QT_QPA_EGLFS_KMS_CONFIG before launch
- prefers a connected connector, falls back to any card exposing
  connectors (excludes the connector-less v3d node)
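
The selection order in the bullets above, sketched in Python (the shipped logic is shell in start_viewer.sh). The input shape, a mapping of card name to its connector statuses, and the function name `pick_display_card` are assumptions for illustration.

```python
from typing import Mapping, Optional


def pick_display_card(cards: Mapping[str, Mapping[str, str]]) -> Optional[str]:
    """Pick the DRM card that owns display connectors."""
    # First choice: a card with at least one connected connector.
    for card in sorted(cards):
        if any(status == "connected" for status in cards[card].values()):
            return card
    # Fallback: any card that exposes connectors at all. The render-only
    # v3d node has none, so it is never selected.
    for card in sorted(cards):
        if cards[card]:
            return card
    return None
```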

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(balena): repair Pi 4 graphics overlay, manage fleet host config as IAC (#2947)

Root cause of the Pi 4 boot-splash hang: the anthias-pi4 fleet's
dtoverlay was stored as the malformed value `"vc4-kms-v3d"` — literal
double-quotes, on the legacy RESIN_ prefix — unlike every other board's
clean `vc4-kms-v3d`. The quotes stop the firmware loading the overlay,
so the Pi 4 fell back to firmware-KMS; since #2905 the viewer renders
through Qt eglfs_kms, which needs full KMS, so the display never came up
and the device hung on the splash. (linuxfb, used before #2905, didn't
care, which is why this surfaced now.) The malformed value was a manual
dashboard edit — the config was never codified.

- add balena-host-config.json: declarative per-board config.txt knobs,
  reconciled from the live fleets and corrected (clean pi4 dtoverlay;
  drop bogus `dtparam=...,vc4-kms-v3d`; standardize pi2/pi3 off the
  RESIN_ prefix; drop pi5 gpu_mem which a Pi 5 ignores; add cma-512 to
  pi5 per docs/board-enablement.md)
- build-balena-disk-image.yaml reconciles each fleet to the file:
  upsert under the canonical BALENA_ prefix, then prune anything not in
  the file (incl. legacy RESIN_HOST_CONFIG_* dupes). Supervisor vars
  untouched.
- docs/balena-fleet-host-config.md documents the mechanism + the full
  per-board audit; modernize the self-hosted doc's `env add`->`env set`
- drop `--pin-device-to-release` from `balena preload` (added in #2098)
  so flashed devices track the latest stable release instead of freezing;
  correct installation-options.md / faq.yaml accordingly

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ci): harden fleet host-config reconcile against fleet-wide wipe (#2947)

Review of the prune step surfaced three failure modes:

- An empty desired set (absent board key, or jq/file parse failure not
  caught by set -e through a process substitution) made the prune delete
  every config var on the fleet, incl. dtoverlay=vc4-kms-v3d. Now resolve
  the board config with `jq -ec .boards[$b]` and hard-fail if it's null
  or the file is invalid; a `{}` board (x86) is truthy so it still
  reconciles to "no config.txt".
- The prune selector `test("HOST_CONFIG")` was an unanchored substring
  match: a var merely containing HOST_CONFIG (e.g.
  BALENA_HOST_CONFIGURATION_BACKUP) would be pruned. Anchored to
  `^(BALENA|RESIN)_HOST_CONFIG_`.
- A transient `balena env list` / jq failure in the prune's process
  substitution was swallowed (pipefail doesn't propagate out of `<(...)`),
  silently skipping the prune and leaving stale RESIN_ duplicates. Capture
  the listing into a var first so the failure aborts the step.

Also folds the duplicate jq pass over balena-host-config.json into the
single `board_json` resolve.
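
Two of the hardening rules above, sketched in Python rather than the workflow's shell and jq. The function names and the example variable names in the usage note are hypothetical; the anchored pattern is the one quoted in the commit.

```python
import re
from typing import Any, Dict, Iterable, List, Set

# Anchored pattern from the commit: only true host-config vars match.
HOST_CONFIG_VAR = re.compile(r"^(BALENA|RESIN)_HOST_CONFIG_")


def resolve_board(config: Dict[str, Any], board: str) -> Dict[str, Any]:
    """Fail hard on a missing or null board key; an empty dict is valid."""
    boards = config.get("boards") or {}
    if boards.get(board) is None:
        raise KeyError(f"no host config declared for board {board!r}")
    return boards[board]


def vars_to_prune(desired: Set[str], live: Iterable[str]) -> List[str]:
    """Host-config vars present on the fleet but absent from the file."""
    return sorted(
        name for name in live
        if HOST_CONFIG_VAR.match(name) and name not in desired
    )
```

With a desired set of `{"BALENA_HOST_CONFIG_dtoverlay"}`, a legacy `RESIN_HOST_CONFIG_dtoverlay` is pruned, while `BALENA_HOST_CONFIGURATION_BACKUP` and supervisor variables are left alone.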

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(balena): describe current host config, not the one-time cleanup

Replace the point-in-time "Audit" table (old quoted/broken values, was→fixed
narrative) with a forward-looking per-key rationale. The cleanup history lives
in the PR / git history; the doc should describe what each setting is and why.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(ci): drop the inline rationale comment from the preload step

Move the explanation out of the workflow and into history (this is where
`git blame` on the preload step lands):

`--commit latest` seeds the image with the current release's container
images so a freshly flashed device boots fully offline. We deliberately
do NOT pass `--pin-device-to-release`: pinning (added in #2098) froze
flashed devices to the downloaded release, so they never received OTA
fixes. Anthias balena devices should track the fleet's latest stable
release, so the device joins the fleet unpinned and auto-updates from
here on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 19:46:13 +02:00
Viktor Petersson
b57c1a16a7 feat(viewer,server): 1 GB SBC enablement — low-RAM degradation gates (#2915)
Boards with < 1.5 GiB MemTotal (Pi 2/Pi 3 1GB, Pi 4 1GB, Rock Pi 4 1GB,
generic-arm64 1GB SKUs) OOM-cycled when QtMultimedia loaded a 4K HEVC asset
alongside the two QtWebEngine renderers introduced by #2905. On-device repro
on a 1 GB Rock Pi 4 confirmed `global_oom` on the container's bash process
followed by a restart loop, plus the kernel keeping sshd in banner-exchange
until power-cycle. This patch puts the device into a graceful degraded mode
before the OOM cascade fires.

  * bin/upgrade_containers.sh — exports ANTHIAS_LOW_RAM=1 when
    TOTAL_MEMORY_KB < 1572864 (1.5 GiB), 0 otherwise. Threshold cleanly
    splits the 1 GB SKUs from the 2 GB+ SKUs in the supported fleet.
  * docker-compose.yml.tmpl — forwards ANTHIAS_LOW_RAM to the viewer
    container.
  * anthias_host_agent — publishes host:total_mem_kb to Redis alongside the
    existing host:board_subtype so server-side gates can read MemTotal
    without re-opening /proc/meminfo themselves.
  * anthias_common.board — adds LOW_RAM_THRESHOLD_KB, get_total_mem_kb(),
    is_low_ram_device() helpers.
  * anthias_webview (view.cpp) — when ANTHIAS_LOW_RAM=1, aliases webView2
    onto webView1 so the rest of the dual-buffer logic still runs but never
    spawns a second Chromium renderer (~100 MB physical RAM saved per
    device). UX: page swap is in-place with a brief blank during load, no
    preloaded crossfade.
  * anthias_server.processing — extends the codec gate with a low-RAM 1080p
    resolution cap. Above-1080p uploads on low-RAM boards are rejected at
    upload with the existing recipe machinery, extended to inject
    `-vf scale=1920:1080:force_original_aspect_ratio=decrease` so the
    operator's re-encode lands within the envelope. If an upload fails BOTH
    codec and resolution gates, the codec message wins but the recipe folds
    in the downscale so a single re-encode satisfies both.
  * Diagnostics page — Memory card surfaces a "Low-RAM mode" badge with the
    threshold MiB so operators can see why the device degraded. /api/v2/info's
    `memory` field gains a `low_ram: bool` for API clients.
  * docs/board-enablement.md — rewrote the stale `--hwdec=drm-copy` /
    per-codec dispatch text (removed in #2905); documented the known rkvdec
    mainline limitation on Armbian 6.18 (HEVC stateless engages via
    `-hwaccel drm` but produces decode errors; H.264 has no v4l2_request
    binding in +rpt1 7.1.3) and the new low-RAM mode.
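
The threshold and the resolution cap from the list above can be sketched as follows. Only `LOW_RAM_THRESHOLD_KB`, `is_low_ram_device`, `cap_to_1080p` and the `-vf scale=...` filter string come from the commit; the `exceeds_1080p` rule, the treatment of unknown dimensions as within the cap, and the overall shape of `build_recipe` (file names, `libx264`) are assumptions for illustration.

```python
from typing import List, Optional

LOW_RAM_THRESHOLD_KB = 1572864  # 1.5 GiB expressed in KiB

DOWNSCALE_FILTER = "scale=1920:1080:force_original_aspect_ratio=decrease"


def is_low_ram_device(total_mem_kb: int) -> bool:
    """True for boards below the 1.5 GiB MemTotal threshold."""
    return total_mem_kb < LOW_RAM_THRESHOLD_KB


def exceeds_1080p(width: Optional[int], height: Optional[int]) -> bool:
    """Assumed rule: unknown or zero dimensions never trigger the cap."""
    if not width or not height:
        return False
    return width > 1920 or height > 1080


def build_recipe(cap_to_1080p: bool) -> List[str]:
    """Hypothetical re-encode recipe; folds in the downscale when capped."""
    args = ["ffmpeg", "-i", "input.mp4"]
    if cap_to_1080p:
        args += ["-vf", DOWNSCALE_FILTER]
    return args + ["-c:v", "libx264", "output.mp4"]
```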

Tests cover the four matrix cells (low/high-RAM × in/over-cap), the recipe
shape with and without `cap_to_1080p`, the cap defence against
unknown / zero dimensions, host_agent's MemTotal parser, and the API
endpoint's new low_ram field.

Out of scope (separate work): HEVC HW-decode on arm64 — depends on an
upstream rkvdec driver fix landing in Debian-shipped Armbian kernels;
Anthias does not maintain its own kernel/distro.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 07:48:46 +02:00
Viktor Petersson
57b4f25c77 feat(viewer,server): per-board HW decode dispatch + codec gate on upload (#2885)
* perf(viewer): pi4-64/pi5 use mpv --vo=gpu --gpu-context=drm

On Pi the connector's preferred mode is usually 4K (most modern
TVs report 3840x2160 in their EDID), and the previous --vo=drm
path ran a CPU zimg upscale from 1080p source to that 4K output.
On a 4-core A72 that's the bottleneck — mpv VO drops 59-75
frames per 30s on a stock 1080p H.264 signage clip. Pi5's A76
is faster but the same upscale path is still the limit.

Switching the VO to GL with the DRM context (mpv --vo=gpu
--gpu-context=drm) hands the upscale to the V3D and leaves
everything else identical — mpv still owns DRM master, still
reads --drm-mode=1920x1080@60 (kept), still runs in
--vd-lavc-threads=4 software decode (mpv 0.40 in Debian Trixie
has v4l2m2m-copy but not v4l2request, so --hwdec=auto-safe
falls back to software on this asset; that hasn't changed).

Measured on a 4K-connected Pi4-64 Rev 1.5, same clip, same 30 s
window:

  --vo=drm                                : 59-75 vo drops / 30 s
  --vo=gpu --gpu-context=drm (this patch) : 3-6 vo drops / 30 s

`decoder-frame-drop-count` is 0 in both — the regression was
purely on the VO side, and shifting scaling off the CPU is what
buys the headroom.

x86 (cage + --gpu-context=wayland) is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(viewer): drop --drm-mode pin on Pi4-64/Pi5 under --gpu-context=drm

The previous commit moved Pi4-64/Pi5 to `mpv --vo=gpu
--gpu-context=drm` but kept the `--drm-mode=1920x1080@60` pin
from the old --vo=drm path. On-device testing showed the pin
*hurts* throughput under GBM: 294 vo drops/30s with the pin,
3-6 without, on the same 4K-connected Pi4 and the same H.264
clip.

The pin existed in the first place to dodge CPU zimg upscale to
4K, which the A72 couldn't keep up with on the legacy --vo=drm
path. Under --gpu-context=drm the V3D does the scaling for free
at the connector's preferred mode, so the workaround is no
longer needed and is in fact harmful.

`--vd-lavc-threads=4` stays — software decode under
--hwdec=auto-safe (mpv 0.40 has v4l2m2m-copy but not
v4l2request) still benefits from explicit threading.

Verified on a 4K-connected Pi4-64 across H.264 (30/24 fps) and
HEVC clips: 2-6 vo drops/30s in every case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(viewer): consolidate Qt6 boards onto cage + Wayland, pin Pi 4 to 1080p

Folds in PR #2883: Pi 4-64 / Pi 5 now run under cage with mpv on
--vo=gpu --gpu-context=wayland, joining x86 and arm64 on a single
Wayland-based display stack. Drops the --vo=drm legacy path
entirely from MPVMediaPlayer. Qt 5 boards (pi2 / pi3) stay on
linuxfb via VLCMediaPlayer — out of scope here.

Replaces the perf branch's `--vo=gpu --gpu-context=drm` standalone
fix with the consolidated cage path. The previous standalone
finding (3-6 vo drops / 30 s on Pi 4 at 4K) was a Pi-without-cage
optimization; once Pi runs under cage like every other Qt6 board,
the same trick applies via wayland but cage's composite step adds
its own pass and the V3D on Pi 4 can't keep up at 4K (738 vo
drops / 30 s measured at native 4K under cage). Fix: move the
1080p mode pin one layer up from app code to host config — the
new ansible/.../cmdline.txt.j2 conditional appends
`video=HDMI-A-1:1920x1080@60 video=HDMI-A-2:1920x1080@60` when
`device_type == 'pi4-64'`. With output pinned to 1080p there's no
upscale anywhere in the pipeline, matching the bandwidth profile
of today's --vo=drm production setup.

Pi 5 / x86 / arm64 keep the connector's preferred mode (typically
4K). Pi 5's V3D 7.1 has roughly 2× Pi 4's throughput; x86 iGPUs
handle 4K via VAAPI; arm64 SBC perf varies by SoC.

Other notable changes folded in from #2883:

* tools/image_builder/utils.py — `cage` + `qt6-wayland` move out
  of the per-board branch into the shared is_qt6 block.
  `wlr-randr` (was x86-only) goes in the shared block too since
  rotation now happens via wlr-randr on every Qt6 board.
  `va-driver-all` stays x86-only (no VAAPI on Pi / ARM SoCs).
* docker/Dockerfile.viewer.j2 — QT_QPA_PLATFORM=wayland gated on
  is_qt6 instead of board in ('x86', 'arm64').
* bin/start_viewer.sh — case on DEVICE_TYPE: every Qt6 board
  takes the cage + sudo path. Pi2 / Pi3 stay on the legacy
  direct-sudo path.
* src/anthias_viewer/media_player.py — single --vo=gpu
  --gpu-context=wayland for all reachable device types. The
  per-board rotate_args block is gone: every Qt6 device inherits
  the transform from cage via wlr-randr, so mpv would
  double-rotate if it set --video-rotate.
* tests/test_media_player.py — parametrised tests for all four
  Qt6 boards (x86, arm64, pi4-64, pi5) hitting the same VO path;
  rotation tests assert mpv *never* sets --video-rotate under
  cage.
* website/data/faq.yaml — rotation entry points at Settings page
  / wlr-randr; resolution entry calls out the Pi 4 1080p pin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ansible): propagate tags into boot.yml include_tasks

The `Configure boot partition` task in system/tasks/main.yml was
tagged `touches-boot-partition` / `raspberry-pi` but those tags
weren't propagated to the tasks inside boot.yml — Ansible's
default include_tasks behaviour matches the include against
--tags but leaves the included tasks tag-less, so they get
filtered back out. Running `ansible-playbook ... --tags
touches-boot-partition` therefore did nothing.

Use the explicit `apply: tags:` form so the include's tags are
copied onto each task in boot.yml. With this, the standalone
"re-render boot config" workflow actually works, which matters
on Pi 4 now that the 1080p HDMI mode pin in cmdline.txt.j2
needs to land without re-running the whole playbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): keep Pi 4 on linuxfb; only Pi 5 / x86 / arm64 go cage

On-device testing on a Pi 4 Model B Rev 1.5 with a 4K HDMI display
showed cage+wayland is fundamentally too heavy for the V3D 6.0:

  --vo=drm (existing, no cage)                        : 59-75 drops/30s
  --vo=gpu --gpu-context=drm (no cage, GPU scale)     : 3-6 drops/30s
  --vo=gpu --gpu-context=wayland (cage, even with the
    1080p HDMI cmdline pin to avoid 4K scale)         : 730+ drops/30s, mpv at
                                                        99% CPU, running at
                                                        ~1/4× real time

The 1080p HDMI pin doesn't recover Pi 4 — cage's composite pass
costs more than the V3D 6.0 has spare bandwidth for, regardless
of output resolution, with the webview running in the background
or not. Pi 5's V3D 7.1 has roughly 2× the throughput and is
expected to keep up; x86 / arm64 already shipped on cage and
remain unchanged.

Net result:

  * Pi 4-64 stays on Qt linuxfb (no compositor) with mpv on
    --vo=gpu --gpu-context=drm. mpv writes straight to KMS via
    libgbm and lets the V3D do video scaling — keeping the
    standalone perf-branch finding that drops from 59-75 → 3-6
    on the same clip.
  * Pi 5 / x86 / arm64 stay (or move) onto cage + qt6-wayland +
    wlr-randr with mpv on --vo=gpu --gpu-context=wayland.
  * Pi 2 / Pi 3 stay on the Qt5 + VLC + linuxfb track they were
    already on.
  * The Pi 4 1080p HDMI cmdline pin added in the previous commit
    is reverted (no longer needed without cage).
  * Rotation handling: mpv emits --video-rotate=N on Pi 4 (no
    compositor to apply the transform) and skips it on the cage
    boards (wlr-randr handles it there).

Goal-wise this is the partial consolidation we agreed to as a last
resort: three of four Qt6 boards share one Wayland stack, and Pi 4
keeps the framebuffer path for as long as the V3D 6.0 + mpv 0.40
combo lacks the headroom. Pi 4 remains in scope for revisiting
once mpv ships the v4l2request hwdec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): mirror host render-GID for all Qt 6 boards, not just cage

mpv uses /dev/dri/renderD128 for --vo=gpu on every Qt 6 board
now — wayland (cage path on x86 / arm64 / pi5) and drm (linuxfb
path on Pi 4) both go through Mesa GL. The render-GID mirror was
inside the cage branch of start_viewer.sh, so Pi 4's mpv ran as
viewer user, hit the render node owned by GID 992, got
"Permission denied", and bailed with "Failed initializing any
suitable GPU context!".

Hoist the render-GID setup above the per-board case so it runs
for every Qt 6 board. cage / linuxfb branching stays as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): Pi 4 stays on --vo=drm (Qt linuxfb DRM master contention)

Earlier commits switched Pi 4 to mpv --vo=gpu --gpu-context=drm
based on a 3-6 vo-drop/30 s measurement. That test was run as
root in a fresh container — no Qt linuxfb in the picture. In
the production viewer where AnthiasWebview holds the framebuffer
via Qt linuxfb, --vo=gpu fails:

  failed to open /dev/dri/renderD128: Permission denied
  [vo/gpu/drm] Failed to acquire DRM master: Permission denied
  [vo/gpu] Failed initializing any suitable GPU context!
  Error opening/initializing the selected video_out (--vo) device.
  Video: no video

Mesa GBM holds DRM master persistently and contends with Qt
linuxfb's framebuffer use. mpv's classic --vo=drm has its own
master juggling (briefly grab → render → drop) that coexists
fine with linuxfb — that's why master's existing Pi 4 config
works.

Revert Pi 4 mpv flags to the production master config:
  --vo=drm --drm-mode=1920x1080@60 --vd-lavc-threads=4

The standalone perf-finding from this branch's earlier history
turns out not to apply in production; retracted from the
roll-up. Pi 5 / x86 / arm64 unchanged (they're on cage +
--vo=gpu --gpu-context=wayland, which has its own DRM master
flow via cage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): cage opens on the first connected connector, not HDMI-A-1

Without `-o`, cage uses whatever output the DRM backend enumerates
first — typically HDMI-A-1 on Pi 5 (closer to USB-C) and the
on-board panel / first HDMI on x86 / arm64. If the operator plugs
into the *other* port (Pi 5 HDMI-A-2, or any DP connector on
x86), cage renders to a disconnected connector and the screen
stays black.

start_viewer.sh now iterates /sys/class/drm/card*-*, picks the
first connector whose status reads "connected", strips the
cardN- prefix to get the bare name cage expects (HDMI-A-1,
HDMI-A-2, DP-1, eDP-1, …), and passes it via `-o`. Falls back to
letting cage pick if nothing is connected yet — the display may
come up via HPD after cage starts, or this is a build/CI host
with no display at all.
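
A minimal Python sketch of the connector scan described above. The real
logic lives in start_viewer.sh (shell); the function name and the
`drm_root` parameter are hypothetical and exist only so the scan can be
exercised against a fake sysfs tree.

```python
from pathlib import Path


def first_connected_connector(drm_root="/sys/class/drm"):
    """Return the bare name of the first connected DRM connector, or None."""
    for status_file in sorted(Path(drm_root).glob("card*-*/status")):
        try:
            status = status_file.read_text().strip()
        except OSError:
            # Connector vanished or is unreadable; keep scanning.
            continue
        if status == "connected":
            # Directory names look like "card1-HDMI-A-2"; drop "cardN-".
            return status_file.parent.name.split("-", 1)[1]
    # Nothing connected yet: the caller lets cage pick its own output.
    return None
```

Returning None rather than raising mirrors the fallback in the text: a
missing display is an expected state on build/CI hosts.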

Caught while end-to-end testing on the rig: Pi 5 cable on
HDMI-A-2 went to a black screen even though `cat
/sys/class/drm/card1-HDMI-A-2/status` reported "connected" and
cage / the viewer were running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(viewer): mpv from archive.raspberrypi.com on Pi 4 / Pi 5, hwdec auto-copy

Stock Debian Trixie's mpv 0.40 is compiled without `v4l2request`
hwdec, so Pi 5's Hantro stateless decoder is invisible to it and
mpv falls back to software decode for every H.264 / H.265 source.
Pi 4's V4L2 M2M decoder is reachable via `v4l2m2m-copy` but mpv's
`--hwdec=auto-safe` whitelist explicitly excludes that method, so
auto-detect picked software there too.

Two changes, applied together because they only make sense
together:

* Pi 4 / Pi 5 viewer images now pull mpv (and the FFmpeg library
  family it depends on) from `archive.raspberrypi.com/debian
  trixie main`. The Pi-tuned build ships `v4l2request` hwdec
  (Pi 5) and a maintained `v4l2m2m-copy` (Pi 4). An apt-pin
  restricts the Pi repo to the mpv + libav* packages only, so
  curl / ca-certificates / etc. continue to come from stock
  Debian and the rest of the image stays on the same baseline.
* `MPVMediaPlayer.play()` switches `--hwdec=auto-safe` →
  `--hwdec=auto-copy`. auto-copy is the same family but with a
  broader whitelist that *includes* the v4l2-family copy hwdecs.
  Net effect: x86 still picks vaapi-copy (unchanged), Pi 4 picks
  v4l2m2m-copy, Pi 5 picks v4l2request, arm64 falls through to
  software (no v4l2request in stock Debian mpv, no vendor-tuned
  Rockchip plugin in stock either — Tier-2 follow-up).

Plus an `ANTHIAS_DEBUG_DROPS=1` env knob: when set on the viewer
container, mpv's stdout/stderr go to `/data/.anthias/mpv.log`
(host-bound) instead of `/dev/null`, and `--no-terminal` is
dropped so the status line ("AV: ... Dropped: N") is emitted.
Lets us read per-asset frame-drop counts straight from the
production viewer pipeline (no custom harness, no rebuild)
during the test-grid runs. Default (unset) preserves the silent
behaviour.
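
A rough sketch of how the knob could translate into mpv launch
settings. The helper name and its return shape are assumptions for
illustration, not the code in media_player.py.

```python
def mpv_debug_config(environ, log_path="/data/.anthias/mpv.log"):
    """Return (extra_mpv_args, log_path_or_None) for the debug knob.

    A log path of None means "discard output" (the caller would pass
    subprocess.DEVNULL); a string means "open this file and hand it to
    Popen as stdout/stderr".
    """
    if environ.get("ANTHIAS_DEBUG_DROPS") == "1":
        # Keep the terminal status line so "Dropped: N" reaches the log.
        return [], log_path
    # Default: silent behaviour, no status line.
    return ["--no-terminal"], None
```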

Also: drops the `cage -o <connector>` autodetect attempt — cage
0.1.x in Trixie doesn't accept `-o`, just `-m last`. Use that
instead so cage opens on the most-recently-connected output
regardless of HDMI-A-N enumeration order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): use deb-packaged Pi keyring for archive.raspberrypi.com

apt update against http://archive.raspberrypi.com/debian trixie
was failing in the Pi 4 / Pi 5 viewer image builds:

  Sub-process /usr/bin/sqv returned an error code (1):
  Signing key on CF8A1AF502A2AA2D763BAE7E82B129927FA3303E is not
  bound: No binding signature at time …
  Policy rejected non-revocation signature (PositiveCertification)
  requiring second pre-image resistance
  SHA1 is not considered secure since 2026-02-01

Pi's bare `raspberrypi.gpg.key` URL still serves the original
2012-vintage RSA 2048 key with SHA1 binding signatures that
Trixie's sqv refuses to certify under the post-2026-02-01
crypto policy. The deb-packaged keyring inside
`raspberrypi-archive-keyring_2025.1+rpt1_all.deb` ships the
*same* key fingerprint but with rebuilt binding signatures
that sqv accepts — that's the keyring Pi OS Trixie itself
installs, which is why `apt update` against this exact repo
works on a real Pi 5 device today.

Fetch the deb directly with curl, extract its bundled
`.pgp` keyring, and point `signed-by=` at the installed copy.
The pin block restricts what packages the Pi repo can supply
(mpv + libav* + ffmpeg + libpostproc — the FFmpeg family),
so the rest of the image keeps its stock-Debian baseline.

Also extend the pin to cover libpostproc* and ffmpeg, since
mpv's apt deps drag those into the Pi-tagged version on
install; without the pin extension, apt rejected the resolve
with "broken packages".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(viewer): per-codec hwdec on Pi via Lua hook

mpv 0.40's `--hwdec` accepts a single value at startup, so we
can't ask it to try v4l2m2m-copy for H.264 *and* drm-copy for
HEVC out of the box. The Pi-tuned mpv from
archive.raspberrypi.com supports both hwdec methods but each
covers a different codec subset:

* v4l2m2m-copy — Pi 4's V3D V4L2 M2M decoder. H.264 works; Pi
  5's Hantro G2 is V4L2-stateless-only so this no-ops there.
* drm-copy — FFmpeg's `v4l2_request_hevc` hwaccel. HEVC only,
  works on both Pi 4 and Pi 5.

Add a small `on_load` Lua hook (inlined as `_PI_HWDEC_LUA`,
written to /tmp on first play(), loaded with `--script=`) that
checks `video-codec-name` and picks the right hwdec at file
open. Net effect:

  Pi 4 H.264 → v4l2m2m-copy   (HW)
  Pi 4 HEVC  → drm-copy       (HW)
  Pi 5 H.264 → v4l2m2m-copy   (no device, falls back to SW
                                — only path until mpv re-adds
                                v4l2_request_h264 hwdec)
  Pi 5 HEVC  → drm-copy       (HW)

The base `--hwdec=auto-copy` startup value still applies on
x86 / arm64 (vaapi-copy on Intel/AMD; software fall-back on
Rockchip), where the hook isn't loaded.

Verified on real hardware:
  $ mpv ... --script=/tmp/anthias-pi-hwdec.lua test_hevc.mp4
  [pi-hwdec] codec=hevc -> hwdec=drm-copy
  Using hardware decoding (drm-copy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer,server): HW-decode everywhere on Pi 4 / Pi 5 / x86

The previous per-codec Lua hook in media_player.py was a silent no-op:
mpv's video-codec-name property is empty at every script event before
hwdec init (on_load, on_preloaded), so --hwdec=auto-copy leaked through.
auto-copy's upstream whitelist excludes v4l2m2m-copy, so H.264 on Pi 4
fell back to software despite the V3D V4L2 M2M decoder being available.

Viewer (src/anthias_viewer/media_player.py)

- Replace the Lua hook with ffprobe-driven dispatch from Python at
  launch time. ffprobe is in the viewer image; the call is ~50 ms.
- Per-board mapping: Pi 4 → {h264: v4l2m2m-copy, hevc: drm-copy};
  Pi 5 → {hevc: drm-copy}. Pi 5 H.264 falls back to auto-copy
  because mpv has no v4l2-request H.264 hwdec for the Hantro G1,
  and passing v4l2m2m-copy there just logs "Could not find a valid
  device" before SW-falling-back.
- Live-verified on Pi 4: "Using hardware decoding (v4l2m2m-copy)"
  for 1080p H.264 and "Using hardware decoding (drm-copy)" for
  HEVC at 1080p and 4K.
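
The dispatch described above can be sketched roughly as follows. The
table contents come from the bullet list; the function names, the
board keys used as dictionary keys, and the 5-second timeout are
assumptions, and the real `_probe_video_codec` may differ.

```python
import subprocess

# Per-board codec -> hwdec mapping; anything missing uses auto-copy.
HWDEC_BY_BOARD = {
    "pi4-64": {"h264": "v4l2m2m-copy", "hevc": "drm-copy"},
    "pi5": {"hevc": "drm-copy"},
}


def probe_video_codec(path):
    """Return the first video stream's codec name, or '' on any failure."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=codec_name",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=5, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # ffprobe missing, timed out, or rejected the file.
        return ""
    return result.stdout.strip()


def pick_hwdec(board, codec):
    """Map (board, codec) to an mpv --hwdec value."""
    return HWDEC_BY_BOARD.get(board, {}).get(codec, "auto-copy")
```

An empty probe result falls through to `auto-copy`, which matches the
"auto-copy fallback when ffprobe fails" test listed below.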

Asset processor (src/anthias_server/processing.py)

- Pi 5 profile drops H.264 from passthrough_video_codecs — Pi 5
  has no mpv H.264 HW path, so H.264 uploads must transcode to HEVC
  at upload time to keep the HW-decode-everywhere contract.
- Pi 4 profile adds passthrough_video_max_pixels for H.264, capped
  at 1080p (1920*1080). 4K H.264 clears the codec gate but the V3D
  H.264 envelope tops out at 1080p60, so the cap forces it through a
  libx265 re-encode at upload time. HEVC keeps no cap (the
  dedicated HEVC block handles 4Kp60).
- _ffprobe_summary now returns video_pixels alongside codec /
  container / audio_codec; _video_can_passthrough enforces the
  per-codec pixel cap when the profile declares one.

Tests

- test_media_player.py: new per-board hwdec tests (Pi 4 H.264 →
  v4l2m2m-copy; Pi 5 H.264 → auto-copy; both → drm-copy for HEVC;
  auto-copy fallback when ffprobe fails; no probe on x86 / arm64).
- test_processing.py: matrix tests updated to include video_pixels;
  parametrised rows now exercise Pi 5 H.264-no-passthrough and the
  Pi 4 4K H.264 cap. New end-to-end tests prove
  _run_video_normalisation transcodes Pi 5 H.264 → HEVC and Pi 4
  4K H.264 → HEVC.

Docs (docs/board-enablement.md, new)

- Goal + per-board HW-decode capability table.
- Asset processor codec policy spelled out as a contract.
- BBB test bed recipe (source clips, libx265 transcode commands,
  ANTHIAS_DEBUG_DROPS=1, mpv.log slicing).

Follow-up: Pi 5 4K HEVC HW

The Hantro G2 decoder can't allocate 4K dst buffers from Pi 5's
default 64 MB CMA ("v4l2_request_hevc_start_frame: Failed to get
dst buffer") and SW-falls-back. Adding cma=512M to the kernel
cmdline does NOT work — the kernel takes the cmdline value over
the device-tree linux,cma node, orphaning rpi-hevc-dec ("Failed
to probe hardware -517") and unpopulating /dev/video*, which
kills HEVC HW at every resolution. The right fix is a
dtparam/dtoverlay in /boot/firmware/config.txt that resizes the
existing DT-declared region without orphaning the codec's
reserved-mem reference. Until that lands, the pi5 profile should
downscale 4K → 1080p HEVC. Documented in cmdline.txt.j2 and
docs/board-enablement.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(viewer,server): mock _probe_video_codec; fix mypy on Popen IO types

CI failures on the previous commit (bb27b186) came from:

* ``subprocess.run`` inside ``_probe_video_codec`` blowing up under
  the existing ``mpv`` fixture, which patches ``subprocess.Popen``
  to a MagicMock. ``subprocess.run`` internally instantiates Popen
  for the ffprobe shellout, gets a MagicMock back, then trips on
  unpacking communicate()'s result. Fixed by default-mocking
  ``_probe_video_codec`` in the fixture (returns '' so dispatch
  falls back to 'auto-copy', preserving legacy assertions) and
  layering the same mock onto the standalone rotation tests that
  build MPVMediaPlayer outside the fixture.

* ``ruff format``: the multi-line ffprobe arg list in
  ``_probe_video_codec`` needed splitting one-arg-per-line.

* ``mypy``: typing the popen_stdout / popen_stderr locals as
  ``object`` couldn't satisfy any Popen overload. Switched to
  ``int | IO[bytes]`` which covers both the DEVNULL / STDOUT
  sentinels and the bind-mounted mpv.log file handle.

* ``test_passthrough_containers_match_real_ffprobe_format_names``
  was pinned to the pi5 profile to exercise the H.264 + HEVC
  passthrough path; pi5 no longer passthroughs H.264, and the
  fake summary it constructs has no width/height (so pi4-64's
  cap fails it too). Switched the pin to x86, which has no
  per-codec caps — the test is about *container* recognition, not
  codec/resolution gating.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server): downscale 4K HEVC → 1080p on Pi 5 (CMA workaround)

Pi 5's Hantro G2 HEVC decoder is rated for 4Kp60 but the stock 64 MB
CMA on Pi OS can't fit a 4K HEVC dst-buffer pool — at 4K mpv hits
``v4l2_request_hevc_start_frame: Failed to get dst buffer`` and
silently SW-falls-back. Bumping cma= on the kernel cmdline orphans
``rpi-hevc-dec`` entirely (the kernel takes the cmdline value over
the device-tree linux,cma node, leaving the driver returning
``Failed to probe hardware -517``), so the kernel-side knob isn't
available without a dtoverlay change.

Until that follow-up lands, the asset processor caps Pi 5 HEVC at
1080p both ways:

* ``passthrough_video_max_pixels`` gates 4K HEVC uploads out of
  passthrough: anything with more pixels than 1920×1080 falls
  through to a re-encode.
* New ``transcode_video_max_pixels`` per-codec field tells
  ``_transcode_to_target`` to emit a
  ``-vf scale='if(gt(ih,1080),-2,iw)':'min(ih,1080)'`` filter that
  caps height at the 16:9 budget (cap_h = floor(sqrt(cap × 9/16))).
  Portrait 4K → 1080p height; landscape 4K → 1920×1080. Sub-1080p
  sources are untouched (the ``min()`` guard prevents upscale; ``-2``
  on width keeps libx265 happy with even dimensions).
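
The height-cap arithmetic can be checked with a short sketch. The
helper names are hypothetical; the formula and the filter string are
the ones quoted above.

```python
import math


def cap_height(max_pixels):
    """cap_h = floor(sqrt(cap * 9/16)) for a 16:9 pixel budget."""
    return math.isqrt(max_pixels * 9 // 16)


def scale_filter(max_pixels):
    """Build the ffmpeg -vf value that caps height without upscaling."""
    h = cap_height(max_pixels)
    # Width -2 keeps the aspect ratio and rounds to an even number.
    return f"scale='if(gt(ih,{h}),-2,iw)':'min(ih,{h})'"
```

For a 1920*1080 budget this gives cap_h = sqrt(2073600 * 9/16) =
sqrt(1166400) = 1080, which is the constant in the filter.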

Pi 4 / x86 don't carry the cap (their HW decoders handle 4Kp60
cleanly), so the filter stays absent from those profiles.

Tests cover (a) the new pi5+hevc+4K row in the parametrised
passthrough matrix (False at 4K, True at 1080p), (b) ffmpeg argv
shape: -vf scale=... emitted for pi5 HEVC, absent for pi4-64 HEVC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer,system): Pi 5 4K HEVC HW + display-resampled VO sync

Two tied changes that move every supported board to clean HW
decode at the source's actual framerate.

Pi 5 4K HEVC via cma-512
------------------------

Pi OS for Pi 5 reserves 64 MB of CMA by default. The Hantro G2
HEVC decoder needs a buffer pool large enough to hold several 4K
dst frames (each ~12 MB) plus reference frames, so the stock
allocation can fit 1080p HEVC but not 4K — at 4K mpv hits
``v4l2_request_hevc_start_frame: Failed to get dst buffer`` and
silently SW-falls-back.

Adding ``cma=512M`` to /boot/firmware/cmdline.txt does NOT work:
the kernel takes the cmdline value over the device-tree
``linux,cma`` node, which orphans ``rpi-hevc-dec`` entirely
(returns ``Failed to probe hardware -517`` and ``/dev/video*``
disappears, killing HEVC HW at every resolution).

The Pi-OS-blessed merge is ``dtoverlay=vc4-kms-v3d,cma-512`` in
/boot/firmware/config.txt — the v3d overlay carries its own
``cma-N`` parameter that resizes the DT linux,cma node in place
without orphaning the codec driver. A standalone
``dtoverlay=cma,cma-512`` silently no-ops on Pi 5 because the
v3d overlay initialises the CMA region first; reusing the v3d
overlay's parameter is the documented way to merge them.

ansible/roles/system/templates/config.txt.j2 now emits the
``,cma-512`` parameter on Pi 5 only — Pi 4 already gets 512 MB
CMA by default so the override is a no-op there. The earlier
attempt at a kernel-cmdline cma= override (in cmdline.txt.j2) is
removed; the file's comment now points readers at the correct
config.txt path.

Live-verified on Pi 5: CmaTotal=512MB after the overlay change,
/dev/video* present, rpi-hevc-dec probes cleanly. Asset processor
pi5 profile no longer carries a HEVC pixel cap — Pi 5 can decode
HEVC at its silicon's real capability.
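
One way to repeat the CmaTotal check from a script is to parse
/proc/meminfo. This is a sketch with a hypothetical helper name; it
takes the file contents as a string so it can be tested off-device.

```python
def cma_total_mb(meminfo_text):
    """Return CmaTotal in MB from /proc/meminfo contents, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("CmaTotal:"):
            # Format: "CmaTotal:         524288 kB"
            return int(line.split()[1]) // 1024
    # Kernels built without CMA do not report the field.
    return None
```

On a device it could be called as
`cma_total_mb(open("/proc/meminfo").read())`.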

mpv --video-sync=display-resample
---------------------------------

mpv 0.40 defaults to ``--video-sync=audio`` which syncs the video
clock to the audio clock and drops VO frames when the two drift.
On every board tested (Pi 4 --vo=drm, Pi 5 + x86 --vo=gpu
--gpu-context=wayland) this produced 60–90% VO drops on 60 fps
content even when the decoder reported healthy HW decode
(``Using hardware decoding (...)`` banner present, no decoder
errors). The drops were at the VO, not the decoder.

``--video-sync=display-resample`` flips the relationship: sync
video to the display refresh and resample audio to match. Audio
resampling is a <1% CPU 2-channel job and most signage clips
have no audible content anyway, so it's effectively free; the
benefit is clean playback at the source's frame rate.

Test bed touched
----------------

* test_play_invokes_popen_with_expected_args_on_pi4_64: argv
  now includes ``--video-sync=display-resample``.
* test_video_can_passthrough_respects_board_codec_set: pi5 +
  hevc + 4K is now ``True`` (passthrough) because the CMA fix
  lets the silicon do its rated job. Comment updated to point
  at config.txt.j2.
* Removed the transient downscale-on-Pi 5 codepath
  (``transcode_video_max_pixels`` field, the
  ``-vf scale='if(gt(ih,...))':...`` filter, and the two tests
  asserting it) — that was a workaround for the CMA issue and
  is no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server): introduce PlaybackEnvelope dataclass + matrix + cache

Foundation for the per-board playback envelope rollout (see
/home/ubuntu/.claude/plans/serene-munching-gem.md). No behaviour
change yet — wires up the canonical source of truth that
processing.py, celery_tasks.py's future re-render walker, and the
viewer's hwdec dispatch will all read from in the next commit.

src/anthias_server/playback_envelope.py (new)
---------------------------------------------

Frozen dataclass ``PlaybackEnvelope`` carrying codec / max_width /
max_height / max_fps plus a fixed ``container_ext = 'mp4'``.
``ENVELOPE_BY_DEVICE_TYPE`` maps every supported board:

* pi2 / pi3 / arm64 → H.264 1920x1080 30 (no HEVC silicon /
  no upstream mpv HW path)
* pi4-64 / pi5 / x86 → HEVC 3840x2160 60 (dedicated HEVC block
  or VAAPI; fleet uniformity so the same upload produces
  bit-identical variants on every board)

``compute_envelope()`` resolves the current process's envelope
from DEVICE_TYPE; unset / unknown / mixed-case / whitespace all
fall back to the conservative default (H.264 1080p30).

``load_cached()`` / ``save_cached()`` round-trip the envelope to
``~/.anthias/playback-envelope.json``. Cache corruption (missing
file, bad JSON, unsupported codec) returns ``None`` so the caller
recomputes and overwrites — a hand-edit that breaks the file
self-heals on next start. ``save_cached`` writes atomically via
temp-file + rename.
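
A condensed sketch of the module's shape, assuming lowercase codec
strings ("h264" / "hevc") and an explicit `path` argument for the
cache helpers; the real module resolves the cache path itself and may
differ in detail.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class PlaybackEnvelope:
    codec: str
    max_width: int
    max_height: int
    max_fps: int
    container_ext: str = "mp4"


_H264_1080P30 = PlaybackEnvelope("h264", 1920, 1080, 30)
_HEVC_4K60 = PlaybackEnvelope("hevc", 3840, 2160, 60)

ENVELOPE_BY_DEVICE_TYPE = {
    "pi2": _H264_1080P30, "pi3": _H264_1080P30, "arm64": _H264_1080P30,
    "pi4-64": _HEVC_4K60, "pi5": _HEVC_4K60, "x86": _HEVC_4K60,
}


def compute_envelope(device_type=None):
    """Exact-match lookup; anything else gets the conservative default."""
    if device_type is None:
        device_type = os.environ.get("DEVICE_TYPE", "")
    return ENVELOPE_BY_DEVICE_TYPE.get(device_type, _H264_1080P30)


def save_cached(envelope, path):
    """Write atomically: temp file in the same directory, then rename."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(asdict(envelope), fh)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_cached(path):
    """Return the cached envelope, or None if missing or corrupt."""
    try:
        envelope = PlaybackEnvelope(**json.loads(Path(path).read_text()))
    except (OSError, ValueError, TypeError):
        return None
    if envelope.codec not in {"h264", "hevc"}:
        return None
    return envelope
```

Returning None for every corruption case keeps the caller's logic to a
single branch: recompute and overwrite.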

src/anthias_server/processing.py
--------------------------------

``_ffprobe_summary`` now returns ``video_fps`` alongside the
existing keys. The next commit (Phase 2) uses this to decide
whether to emit ``-r envelope.max_fps`` — the cap is one-way, so
sub-cap source rates pass through unchanged. r_frame_rate is
parsed as a rational ``num/den``; unparseable / zero-denominator
collapses to ``None`` so the caller treats source fps as
"unknown" and skips the gate.

tests
-----

* tests/test_playback_envelope.py (new): matrix coverage; unset /
  unknown / cased / whitespace inputs; cache round-trip; missing
  / corrupt JSON / invalid-payload recovery; atomic write
  (no leaked .tmp); container_ext invariant.
* tests/test_processing.py: positive video_fps cases (integer
  rates, NTSC drop-frame 30000/1001 + 60000/1001, bogus / no-slash
  / zero-denominator inputs); the two ``assert summary == { ... }``
  ffprobe-recovery tests now include the new ``video_fps: None``
  key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server): envelope-driven asset processor with sibling-original

Refactor ``processing.py`` so every video upload produces a
variant matching the board's playback envelope while preserving
the source as a sibling ``.original.<ext>`` file. Rotation is now
gapless by construction — every variant on disk shares one codec /
max resolution / max fps per board, so the viewer's output mode
never has to switch mid-clip.

src/anthias_server/processing.py
--------------------------------

* Replace ``_BOARD_PROFILES`` + ``_resolve_board_profile`` +
  ``_PI4_H264_MAX_PIXELS`` + ``_BoardProfile`` typedef with
  ``compute_envelope()`` from the new ``playback_envelope`` module
  (landed in 0b6bea0c). One canonical source of truth for "what
  every variant on disk looks like".

* ``_ffprobe_summary`` now returns per-axis dimensions
  (``video_width``, ``video_height``) alongside the existing
  ``video_pixels`` total. The envelope check is per-axis so an
  ultrawide source (e.g. 5760×1080) gets caught by the width cap
  even though its total pixel count is below 4K's.

* ``_video_can_passthrough(summary, envelope)`` is the new
  contract: passthrough iff (a) container is mp4, (b) codec
  matches envelope.codec exactly, (c) both axes are within the
  envelope cap, (d) source fps is at-or-under envelope.max_fps,
  (e) audio is demuxer-compatible. Any None in source dims / fps
  bails to transcode (we don't gamble on unsized clips).

* ``_transcode_to_target(input, output, envelope=None,
  source_summary=None)`` emits the smallest set of flags that
  lands the output inside the envelope. ``-vf scale=...`` only
  when source > envelope on either axis; ``-r envelope.max_fps``
  only when source fps > cap. The fps cap is one-way — we never
  up-convert a sub-cap source. New helper
  ``_video_args_for_codec`` picks libx264 / libx265 from the
  envelope's codec.

* ``_run_video_normalisation`` reorganised around the sibling-
  original pattern:
  - Fresh upload / legacy asset: rename ``Asset.uri`` to
    ``<base>.original.<ext>`` (the source-preservation step).
  - Re-render: read from the existing ``.original.*`` sibling
    instead.
  - Re-probe from the (possibly new) source location.
  - Passthrough branch: copy source → variant slot bitwise
    (cross-device fleet sha256 stays equal).
  - Transcode branch: staging-file render with the existing
    atomic-replace contract.
  - Stamp ``metadata['original_uri']`` (path to sibling),
    ``metadata['envelope']`` (envelope dict the variant matches).
    ``metadata['transcode_target']`` kept as the
    ``envelope.codec`` duplicate for one release of back-compat
    with the serializer surface.
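
The five-gate passthrough contract listed above can be sketched as
follows. The summary keys `codec`, `container` and `audio_codec` follow
the names mentioned in earlier commits; the `Envelope` stand-in and the
`PASSTHROUGH_AUDIO` set are assumptions, since the text does not say
which audio codecs count as demuxer-compatible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    codec: str
    max_width: int
    max_height: int
    max_fps: int
    container_ext: str = "mp4"


# Assumption: silent clips and AAC audio are safe to copy as-is.
PASSTHROUGH_AUDIO = frozenset({None, "aac"})


def video_can_passthrough(summary, envelope):
    """True only when the source already sits inside the envelope."""
    width = summary.get("video_width")
    height = summary.get("video_height")
    fps = summary.get("video_fps")
    # Unknown dims or fps always transcode.
    if width is None or height is None or fps is None:
        return False
    return (
        summary.get("container") == envelope.container_ext
        and summary.get("codec") == envelope.codec
        and width <= envelope.max_width
        and height <= envelope.max_height
        and fps <= envelope.max_fps
        and summary.get("audio_codec") in PASSTHROUGH_AUDIO
    )
```

Checking each axis separately is what rejects the 5760×1080 ultrawide
case even under a 3840×2160 envelope.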

Tests
-----

* ``test_video_can_passthrough_decision_table`` recast against
  the H.264 1920×1080 30 default envelope. Each row tests one
  gate (codec / per-axis dim / fps / audio / unknowns / probe
  gaps) without overlap.
* ``test_video_can_passthrough_respects_envelope`` end-to-end:
  pin ``DEVICE_TYPE``, build a summary at the given
  (codec, w, h, fps), assert the verdict. Replaces the legacy
  ``..._respects_board_codec_set``.
* ``test_transcode_to_target_emits_scale_when_source_oversize``,
  ``..._emits_fps_clamp_when_source_fast``,
  ``..._omits_clamps_when_source_at_envelope``: pin the smallest
  ffmpeg flag set per source / envelope combination.
* ``_envelope_summary`` helper at the top of the file
  short-circuits the per-test summary construction.
* Mock signatures for ``_transcode_to_target`` updated to accept
  the new ``envelope`` / ``source_summary`` kwargs.
* ``test_resolve_board_profile_picks_target_codec_per_board``
  deleted — equivalent coverage is in tests/test_playback_envelope.py
  against ``compute_envelope`` directly.

Stale doc / comment references to ``_BOARD_PROFILES`` /
``_resolve_board_profile`` updated to point at
``playback_envelope.ENVELOPE_BY_DEVICE_TYPE`` /
``compute_envelope``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server): re-render walker + startup envelope reconciler

* New celery task `regenerate_for_envelope_change`: walks
  `Asset.objects.filter(mimetype='video')` and queues
  `normalize_video_asset` for any row whose
  `metadata['envelope']` no longer matches the current envelope.
  Malformed payloads, missing keys, and per-row exceptions are
  logged but don't stop the walker.
* New `AnthiasAppConfig.ready` hook -> `app/startup.py:
  run_envelope_check`: compares cached vs computed envelope,
  persists fresh, dispatches the walker on mismatch. Short-circuits
  under `ENVIRONMENT=test` / `PYTEST_CURRENT_TEST` so pytest runs
  don't enqueue stray walkers. Celery dispatch failure is logged
  but non-fatal -- the cache is already saved, so the next start
  sees the new envelope on disk and recovers.
* Tests cover: skip-in-envelope, queue-stale, legacy migration
  (no envelope key), image-asset skip, force-requeue, malformed
  payload recovery, continue-after-per-row-failure, every
  hook code path (test short-circuit, no-cache, match, mismatch,
  dispatch failure, corrupt cache).
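
The startup reconciler's control flow, sketched with injected
callables so it runs without Django or Celery. The signature and the
returned status strings are hypothetical, and treating a missing cache
(None) like a mismatch is an assumption.

```python
def run_envelope_check(environ, load_cached, compute, save_cached,
                       dispatch_walker):
    """Compare cached vs computed envelope; re-render on mismatch."""
    if environ.get("ENVIRONMENT") == "test":
        return "skipped"
    current = compute()
    if load_cached() == current:
        return "match"
    # Persist first, so a failed dispatch still leaves the new
    # envelope on disk for the next start.
    save_cached(current)
    try:
        dispatch_walker()
    except Exception:
        # Broker unavailable or similar: logged in real code, non-fatal.
        return "dispatch-failed"
    return "dispatched"
```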

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server): preserve `.original.<ext>` siblings during orphan sweep

The Celery ``cleanup`` task built its "referenced" set only from
``Asset.uri``. With sibling-original storage, the source bytes live
at ``metadata['original_uri']`` (e.g. ``<id>.original.mov``) while
``Asset.uri`` points at the playback variant (``<id>.mp4``). Without
this fix every video upload's ``.original.<ext>`` falls outside the
1h mtime guard once the variant lands and gets silently deleted on
the next hourly sweep — breaking the re-render walker as soon as
the envelope changes.

* ``cleanup``: union ``Asset.uri`` ∪ ``metadata['original_uri']``
  into the referenced set, tolerant of legacy rows with non-dict
  metadata.
* Tests cover the new claim path + the malformed-metadata
  fallback so a stray ``metadata=None`` row can't crash the sweep.
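
The referenced-set union can be sketched with plain dicts standing in
for Asset rows; the function name is hypothetical.

```python
def referenced_paths(assets):
    """Collect every on-disk path that some asset row still claims."""
    referenced = set()
    for asset in assets:
        uri = asset.get("uri")
        if uri:
            referenced.add(uri)
        metadata = asset.get("metadata")
        # Legacy rows may carry None or a non-dict payload.
        if isinstance(metadata, dict):
            original = metadata.get("original_uri")
            if original:
                referenced.add(original)
    return referenced
```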

The upload-path serializer itself stays untouched: the existing
``rename(tmp, <id><ext>)`` lands the upload at a single path, and
``processing._run_video_normalisation`` handles the
rename-to-``.original.<ext>`` atomically on first run. No double-
write, no extra disk traffic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(server): cover sibling-original storage across normalisation paths

Adds five tests pinning the ``.original.<ext>`` + variant contract
that the envelope walker depends on:

* fresh upload → ``<id>.original.<src_ext>`` created next to
  ``<id>.mp4``; ``metadata['original_uri']`` + ``metadata['envelope']``
  populated.
* re-render → ``.original.<ext>`` is byte-identical across passes
  (sha256 compared before/after); the walker reads from it and
  never rewrites it.
* passthrough → both files exist even when the source already
  matches the envelope (``shutil.copyfile`` semantics, not rename).
* legacy migration → pre-rollout assets with no ``original_uri``
  key get renamed to ``.original.<ext>`` on first walker pass.
* dangling ``original_uri`` → falls back to treating ``asset.uri``
  as the source-to-preserve; no silent error, no lost variant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(board-enablement): replace codec policy table with playback envelope

* board-enablement.md now documents the envelope matrix as the
  single source of truth shared by the asset processor, the
  re-render walker, and the viewer's hwdec dispatch. The legacy
  ``_BOARD_PROFILES`` / ``passthrough_video_codecs`` vocabulary has
  been removed -- it never matched what ``processing.py`` does
  post-envelope.
* Calls out the ``<id>.original.<src_ext>`` + ``<id>.mp4`` sibling
  layout, the metadata keys the walker reads, and the cross-board
  fleet sha256 expectation.
* Pi 5 CMA quote rewritten: the real fix is
  ``dtoverlay=vc4-kms-v3d,cma-512`` in config.txt, not a downscale
  workaround. Kernel cmdline ``cma=`` is documented as the broken
  path it actually is.
* Failure-mode list updated for envelope-driven dispatch (off-
  envelope variant, display refresh ceiling, walker storm on
  unwritable cache, sha256 fleet divergence).
* ``media_player.py`` comment block: updates the Pi 5 H.264 →
  auto-copy and HEVC → drm-copy comments to reference the playback
  envelope by name and point at the correct CMA fix (config.txt
  dtoverlay, not cmdline.txt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): mypy on `_make_video_asset` + boolean is_enabled

* `dict` annotations get explicit `dict[str, Any]` parameters
  (Anthias's mypy config sets `disallow_any_generics`).
* `is_enabled=1` → `is_enabled=True` so the Asset field's bool
  type matches mypy's view of django-stubs models.
* Adds the missing ``typing.Any`` import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server,tests): envelope-aware container gate + startup hook safety

Run 1 of CI surfaced several issues in the envelope refactor:

* **MP4 family container detection.** ffprobe reports an MP4 file's
  ``format_name`` as ``mov,mp4,m4a,3gp,3g2,mj2`` (``mov`` first
  because the QuickTime/MP4 demuxer is one codepath). The envelope
  gate compared the source container to ``envelope.container_ext``
  by exact equality, so every MP4 upload was rejected at the
  container gate even though the bytes are exactly what we'd
  write. Adds ``_MP4_FAMILY_CONTAINERS`` and special-cases ``mp4``
  envelope to accept any synonym.
* **Celery workers were running ``run_envelope_check``.**
  ``celery_tasks.py`` top-level-calls ``django.setup()``, which
  fires ``AppConfig.ready`` in every process that imports it,
  including the celery worker -- the previous comment in ``apps.py``
  was wrong. Two writers race on the cache file and could
  double-queue the walker for a single envelope change. New
  ``_is_celery_worker()`` short-circuit detects the
  ``celery -A ... worker`` invocation via ``sys.argv[0]``.
* **Settings singleton captures HOME at init.**
  ``AnthiasSettings.home`` is set once at module import time, so
  ``monkeypatch.setenv('HOME', tmpdir)`` in tests doesn't reach the
  envelope cache helpers. Updates ``cache_dir`` and ``fake_home``
  fixtures to also patch ``settings.home`` via ``monkeypatch.setattr``.
* **Stale tests.**
  - Drop ``test_cleanup_tolerates_non_dict_metadata`` -- the schema
    enforces ``metadata`` as a non-null JSON dict, so the failure
    mode it claimed to test can't occur. ``cleanup()`` keeps the
    defensive ``isinstance(metadata, dict)`` check as a no-cost
    belt-and-braces.
  - ``test_video_passthrough_for_h264_or_hevc_in_known_containers``
    rewritten as ``test_video_passthrough_when_source_matches_board_envelope``
    -- the old matrix included libx264 on pi4-64 (no longer
    passthrough because pi4-64 is HEVC) and non-mp4 containers
    (always re-encoded now because the variant slot is fixed at
    ``.mp4``).
  - ``test_video_passthrough_records_target_codec`` switches the
    source codec to libx265 so it actually hits the passthrough
    branch on pi4-64.
  - ``test_video_passthrough_uses_summary_duration_no_second_probe``
    rebuilt via ``_envelope_summary`` so the synthesised summary
    carries the new ``video_width / video_height / video_fps``
    fields.
  - The two ``test_ffprobe_summary_handles_*`` early-return shape
    assertions add ``video_width`` / ``video_height`` to match the
    real return shape.
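
For the MP4-family container gate in the first bullet above, a minimal
sketch might look like this. The constant name comes from the commit;
the function name and the set-intersection logic are assumptions. Note
that ffprobe reports the same `format_name` for QuickTime `.mov`
files, so this check alone does not tell the two apart.

```python
_MP4_FAMILY_CONTAINERS = frozenset(
    {"mov", "mp4", "m4a", "3gp", "3g2", "mj2"}
)


def container_matches(format_name, container_ext):
    """Compare ffprobe's comma-joined format_name to the envelope."""
    names = {n.strip() for n in (format_name or "").split(",") if n.strip()}
    if container_ext == "mp4":
        # Any synonym from the shared QuickTime/MP4 demuxer counts.
        return bool(names & _MP4_FAMILY_CONTAINERS)
    return container_ext in names
```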

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server,tests): drop PYTEST_CURRENT_TEST gate; align stale summaries

Run 2 of CI surfaced three more issues:

* **``PYTEST_CURRENT_TEST`` is not fixture-controllable.** pytest
  re-sets the env var at the start of every test's ``call`` phase,
  so ``monkeypatch.delenv`` in a ``setup`` fixture is overridden
  before the body runs. This made it impossible for any test to
  exercise the real startup hook path. The ``ENVIRONMENT=test``
  gate (set in ``conftest.py`` + the test compose file) is the
  durable, fixture-controllable signal — keep that, drop the
  pytest one. Test for the new ``_is_celery_worker`` short-circuit
  replaces the deleted ``test_short_circuits_when_pytest_current_test``.
* **Decision table parametrise had a wrong expectation.** Summary
  row "HEVC at envelope (codec, dims, fps all match)" was paired
  with ``expected=True``, but the test envelope is H.264 — codec
  mismatch must transcode, ``False``.
* **``test_video_passthrough_skips_duration_when_probe_unavailable``
  summary missed the new dim/fps fields.** Same root cause as
  before: ``_video_can_passthrough`` rejected the synthesised
  summary at the dims gate, the test fell through to a real
  ffmpeg call on a 64-byte stub, and ffmpeg failed with "Invalid
  data found".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(envelope): add generic-arm64 key for Rock Pi / Armbian SBCs

The Anthias install path for Rock Pi 4 / Armbian boards writes
``DEVICE_TYPE=generic-arm64`` (see ``feat(install): generic-arm64
best-effort support``). The matrix only listed ``arm64``, so a
real install fell through to ``_DEFAULT`` — same envelope by
coincidence, but the walker would have logged "no matrix entry"
warnings on every server start and the docs/board-enablement
matrix would be subtly wrong about which key applies.

Lists the key explicitly with the same conservative H.264 1080p30
envelope and extends the parametrise coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server): make celery_tasks.py top-level django.setup() reentrant-safe

``django.setup()`` calls ``apps.populate()``, which raises
``RuntimeError: populate() isn't reentrant`` if invoked while
already populating. The new ``AnthiasAppConfig.ready`` hook imports
``celery_tasks`` to dispatch the walker, which until this change
top-level-called ``django.setup()`` again -- so on every real
server start the import died, the dispatch failed, and the walker
never ran. Live-confirmed on the Pi 4 test bed.

Check ``django.apps.apps.apps_ready`` before calling ``setup()``:
the flag flips to True after the import phase but before per-app
``ready`` hooks run, so the standalone celery worker (where Django
isn't initialised yet) still calls setup() as before, while the
server process (mid-populate) correctly skips the reentrant call.
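
A minimal sketch of the guard described above. It is illustrative only: the ``setup_once`` helper and its injected ``registry`` / ``setup`` arguments are hypothetical. They stand in for ``django.apps.apps`` and ``django.setup`` so the snippet runs without Django installed.

```python
from types import SimpleNamespace


def setup_once(registry, setup):
    """Call setup() only if the app registry has not finished importing apps.

    `registry` stands in for django.apps.apps and `setup` for django.setup;
    both are injected (an assumption of this sketch) so it runs standalone.
    """
    if registry.apps_ready:
        # Server process: populate() is already past the import phase
        # (we were imported from a ready() hook), so calling setup()
        # again would be the reentrant call. Skip it.
        return False
    # Standalone celery worker: Django is not initialised yet.
    setup()
    return True


calls = []
worker_registry = SimpleNamespace(apps_ready=False)
server_registry = SimpleNamespace(apps_ready=True)

worker_ran = setup_once(worker_registry, lambda: calls.append("setup"))
server_ran = setup_once(server_registry, lambda: calls.append("setup"))
```

Only the worker-style registry triggers the setup call; the server-style one is left alone.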

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server): commit `original_uri` to DB before transcode (crash safety)

Live-confirmed on the Pi 4 test bed during the envelope rollout:
walker fired on a near-full SD card, ffmpeg ran out of space mid-
render, the on_failure hook cleared ``is_processing`` -- and the
hourly ``cleanup()`` sweep then silently deleted every
``.original.<ext>`` source it had just renamed, because
``Asset.uri`` still pointed at the (now-missing) variant path and
the orphan walker only knew about ``Asset.uri`` + a *committed*
``metadata['original_uri']``.

The metadata accumulator in ``_run_video_normalisation`` only wrote
to the DB at the end of the function, so any failure between
"rename source → .original.<ext>" and "render variant → atomic
replace" left the row's metadata stale.

Fix: persist ``metadata`` to the DB right after the rename, before
attempting any render. The contract becomes: if the file is on
disk under ``.original.<ext>``, the DB row knows it. ``cleanup()``
already reads ``metadata['original_uri']`` into the referenced set
(from ``fix(server): preserve `.original.<ext>` siblings during
orphan sweep``), so this commit closes the only window where that
guard could be bypassed.

Adds ``test_original_uri_persisted_before_render_for_crash_safety``
which mocks ``_transcode_to_target`` to raise and verifies the row
has ``metadata['original_uri']`` committed by the time the
exception propagates.
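
The ordering contract can be sketched as below. This is not the project's ``_run_video_normalisation``: the ``normalise`` function and its four injected callables are hypothetical stand-ins, and the row is a plain dict rather than a model instance.

```python
def normalise(row, rename, render, save):
    """Rename the source, persist that fact, and only then render.

    `rename`, `render` and `save` are injected stand-ins (hypothetical
    names) for the filesystem rename, the ffmpeg render and the DB write.
    """
    original = rename(row["uri"])
    row["metadata"]["original_uri"] = original
    # Commit before any step that can fail, so a crash mid-render still
    # leaves the DB row pointing at the renamed source.
    save(row)
    render(original, row["uri"])
```

If ``render`` raises, the exception still propagates, but ``save`` has already run with ``original_uri`` set.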

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(board-enablement): script-driven 1-minute sample pack

Previously the test pack was full-length BBB clips (~10 min) plus an
inline ffmpeg recipe in the docs that produced 4K HEVC re-encodes
taking ~30 min on a workstation. The on-device walker then had to
chew through the full-length variants, which on a Pi 4 / Rock Pi
turned a single rotation cycle into hours of wallclock for what was
really a hwdec-banner sanity check.

* New ``bin/generate_board_enablement_testbed.sh``: downloads the
  four BBB H.264 sources, trims each to 60 s with ``-c copy``
  (instant), then libx265-encodes each cut. Idempotent (skips
  files that already pass an ffprobe sanity check) and atomic
  (tmp-then-rename) so a power cycle mid-encode leaves a clean
  state.
* Pack drops from ~3.3 GB / 10 min per clip to ~350 MB / 60 s per
  clip. 60 s is enough to capture mpv's ``hwdec-current`` banner
  and read a stable ``Dropped:`` count, while keeping a full
  walker pass under a few minutes on every supported board.
* ``CUT_SECONDS`` / ``HEVC_CRF`` env knobs override defaults for
  iteration; the table in the doc lists what each clip exercises.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(envelope,viewer): runtime Rock Pi 4 detection unlocks v4l2m2m HW decode

``bin/install.sh`` writes ``DEVICE_TYPE=arm64`` for every aarch64
SBC it doesn't recognise as a Pi — Rock Pi 4, Orange Pi, Allwinner
H6 boards, Amlogic S905 boards all share that one catch-all
DEVICE_TYPE. The matrix can't promote ``arm64`` to HEVC + HW
because most of those boards have no upstream-mpv HW decode path
and would log "Could not find a valid device" on every play.

But the Rock Pi 4 (RK3399 / Radxa) DOES have a working v4l2m2m
driver exposed by the kernel:

  $ docker exec anthias-anthias-viewer-1 mpv --hwdec=help | grep v4l2m2m
    v4l2m2m-copy (h264_v4l2m2m-v4l2m2m-copy)
    v4l2m2m-copy (hevc_v4l2m2m-v4l2m2m-copy)
    v4l2m2m-copy (vp9_v4l2m2m-v4l2m2m-copy)
    ...

and ``/dev/video-dec2`` / ``/dev/video-dec4`` are present (the
v4l2_request decoder symlinks). Leaving Rock Pi on SW decode for
1080p HEVC measurably wastes the silicon.

Resolved at runtime via ``/proc/device-tree/model``:

* New matrix key ``rockpi4`` → HEVC 1920×1080 30. 1080p ceiling
  keeps disk use of the variant + ``.original.<ext>`` sibling
  comfortable on the typical SD card; HEVC codec exercises the
  Hantro path on the way through the viewer.
* ``compute_envelope`` and ``_pi_hwdec_for_uri`` both probe the
  device tree when DEVICE_TYPE is ``arm64`` (or legacy
  ``generic-arm64``). A Rock Pi 4B reports
  ``Radxa ROCK Pi 4B`` and gets upgraded; an Orange Pi or an
  Allwinner H6 board stays on the conservative SW envelope.
* Failure modes (no device tree, decode error, unknown SBC) all
  collapse to ``None`` so dev containers and the existing arm64
  catch-all keep working unchanged.

Four new tests pin:
- Rock Pi model → ``rockpi4`` envelope;
- legacy ``generic-arm64`` label also gets the upgrade;
- unknown SBC keeps the conservative envelope;
- missing ``/proc/device-tree/model`` doesn't raise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(envelope,viewer): publish board subtype via host_agent + Redis

Previous commit (``dde1b20e``) added a runtime ``/proc/device-tree``
read inside the server + viewer containers. Containers don't see
that path by default, and mounting it into every container is
heavier than it's worth for one edge case (worse, balena's
restricted /proc would still trip).

``anthias_host_agent`` already runs on the host and publishes
host-side state to Redis (IP addresses, etc.). It's the right
layer for board identification:

* New ``detect_board_subtype()`` reads
  ``/proc/device-tree/model`` directly (host_agent IS on the
  host) and maps known SBC strings to matrix keys
  (Rock Pi 4A/4B/4C → ``rockpi4``).
* New ``set_board_subtype()`` publishes the resolved key (or the
  empty string for unknown boards) to ``host:board_subtype``
  before ``subscriber_loop`` flips ``host_agent_ready`` — so
  consumers can rely on the key being there once the readiness
  flag is set.
* Server's ``playback_envelope.compute_envelope`` and viewer's
  ``_pi_hwdec_for_uri`` read the same Redis key when DEVICE_TYPE
  is ``arm64`` / legacy ``generic-arm64``. Failure modes (Redis
  down, key missing, decode error) all collapse to ``None`` so
  the caller falls back to the conservative arm64 envelope.
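
A rough sketch of the mapping and the "empty string for unknown boards" rule. The helper names ``subtype_from_model`` and ``redis_value`` are hypothetical, and the case-insensitive prefix match is an assumption; the real matching rule in ``detect_board_subtype()`` may differ.

```python
def subtype_from_model(raw_model):
    """Map raw /proc/device-tree/model bytes to a matrix key, or None.

    Hypothetical helper; the prefix-matching rule is an assumption.
    """
    if raw_model is None:
        return None
    try:
        # The device-tree model string is NUL-terminated.
        model = raw_model.decode("utf-8").rstrip("\x00").strip()
    except UnicodeDecodeError:
        return None
    if model.lower().startswith("radxa rock pi 4"):
        return "rockpi4"
    return None


def redis_value(subtype):
    # Unknown boards publish the empty string rather than no key at all.
    return subtype or ""
```

Every failure mode (missing file passed as ``None``, undecodable bytes, unknown model) collapses to ``None``, which keeps the caller on the conservative arm64 path.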

No compose template changes. The viewer + server containers
already have Redis reachable (they use it for the Channels
layer + walker dispatch already), so the data path is free.

Unit tests pin:
* device-tree → subtype mapping for canonical + variant + edge
  Rock Pi strings, plus unknown boards;
* Redis publish writes the resolved key OR empty string;
* server's compute_envelope reads back through Redis correctly
  for known / unknown / empty / unreachable cases;
* subscriber_loop calls set_board_subtype before flipping
  ``host_agent_ready`` — race-free ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(celery): cap walker to --concurrency=1 so transcodes can't choke playback

Default celery worker concurrency = num_cores. On the boards
Anthias actually ships to (Pi 4 / Pi 5 / Rock Pi 4 / arm64
SBCs), that means up to 4 parallel ``libx265`` encodes sharing
the same SoC as the viewer's mpv process. ``nice -n 19`` +
``ionice -c 3`` are already in place, but nice(1) only helps
when there's CONTENTION -- four ffmpegs at nice 19 still
saturate every core, and each 1080p libx265 encode needs ~500 MB
RAM. A 4 GB SBC pushes into swap well before the walker
finishes, which stalls *everything* on the host -- live-
confirmed on the Rock Pi 4 during this PR: sshd starved through
banner exchange whenever the walker hit a fresh burst.

Asset processing is upload-time, not throughput-bound. The
operator-facing latency that matters is "upload click → asset
visible in rotation", which is bound by ONE encode regardless of
queue parallelism. Serial encodes finish a few minutes later in
wallclock but the viewer never drops a frame.

Applied to every prod / dev compose template. ``docker-compose.test.yml``
is left at default because the test suite never runs live
normalize tasks (the celery service in tests just exercises the
task dispatch plumbing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): force MPV on legacy ``generic-arm64`` DEVICE_TYPE

Rock Pi 4 running an older arm64 image reports
``DEVICE_TYPE=generic-arm64`` (pre-``refactor: rename device_type
generic-arm64 → arm64`` rebuilds). The MediaPlayerProxy
override only force-routed MPV for ``arm64`` / ``pi4-64``, so the
legacy label fell through to VLC -- which then crashed with
``NameError: no function 'libvlc_new'`` because the libvlc lib
isn't installed on the arm64 image. Live-confirmed in the viewer
crash loop on the Rock Pi 4 during this PR.

Adds ``'generic-arm64'`` to the force_mpv set + a test pinning
the dispatch. Covers the in-the-wild rolling-upgrade window
where a Rock Pi 4 deployment is sitting on an old image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): route ``generic-arm64`` through cage + ALSA-default like ``arm64``

Two more places in ``media_player.py`` only checked the post-rename
``arm64`` DEVICE_TYPE and missed the legacy ``generic-arm64`` label
the Rock Pi 4 test bed still reports:

* **VO dispatch** (line ~419) — without this, a generic-arm64 host
  falls through to the ``--vo=drm`` else branch, which mpv aborts
  with "No primary DRM device could be picked" because cage already
  holds DRM master in the cage + Wayland viewer stack
  (live-confirmed on the Rock Pi 4 in this PR).
* **ALSA card selection** (``get_alsa_audio_device``) — the Pi-name
  dispatch below the env-var check picks ``vc4hdmi`` / "Headphones"
  cards that don't exist on Rockchip / Allwinner / Amlogic. Without
  the legacy label here, mpv tries to open the Pi-specific HDMI
  card and dies with ``Unknown PCM sysdefault:CARD=vc4hdmi``.

Both branches now use the shared ``_ARM64_DEVICE_TYPES`` frozenset
that already governs the hwdec subtype probe, so the three paths
(envelope, hwdec dispatch, VO + ALSA) agree on what DEVICE_TYPE
labels are aarch64-catch-all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(envelope): Rock Pi 4 stays on H.264 1080p30 -- stock ffmpeg has no v4l2_request

Live testing on the Rock Pi 4 surfaced that the arm64 viewer
image's stock ffmpeg (Debian 7.1.3-0+deb13u1) is built without
``--enable-v4l2-request``, and the underlying kernel exposes the
RK3399's decoders only via the stateless v4l2_request API
(``rkvdec`` for HEVC, the Hantro block as ``rockchip,rk3399-vpu-dec``
for H.264). ffmpeg's stateful ``hevc_v4l2m2m`` / ``h264_v4l2m2m``
decoders can't reach them -- mpv logs ``Could not find a valid
device`` even after ``/dev/video-dec*`` symlinks are present.
mpv ``--hwdec=help`` also doesn't list rkmpp or drm-copy, so
there's no other path through the stock build.

So:

* ``rockpi4`` envelope drops from HEVC 1920x1080 30 to H.264
  1920x1080 30 -- the same conservative tier as the generic
  ``arm64`` catch-all. The viewer SW-decodes 1080p30 in real
  time on the Cortex-A72; no frames dropped, just no HW gain
  over plain ``arm64``.
* Rock Pi entry drops from ``_PI_HWDEC_BY_CODEC`` -- mpv falls
  through to ``auto-copy`` which mpv's whitelist resolves to
  SW decode on this build.
* host_agent's subtype publish, the start_viewer.sh
  ``/dev/video-dec*`` symlink creation, and the dedicated
  ``rockpi4`` matrix key all stay in place -- they're
  forward-compatible scaffolding so a follow-up enabling
  v4l2_request (or linking rkmpp) in the viewer build only has
  to bump the matrix entry's codec to ``hevc`` and add the
  hwdec dispatch row. No further plumbing churn.
* Tests + docs reflect the routing-without-HW reality.

The legacy-label fixes from this PR (force_mpv +
``--vo=gpu --gpu-context=wayland`` + ALSA default for the
``generic-arm64`` DEVICE_TYPE) are unaffected -- those are real
bug fixes the Rock Pi 4 needs to play *anything* under cage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(viewer,envelope): extend +rpt1 ffmpeg to arm64; Rock Pi 4 = HEVC 4Kp60

The Raspberry Pi APT repo's ffmpeg build (``+rpt1``) ships with
``--enable-v4l2-request --enable-libudev --enable-vout-drm``,
which the stock Debian Trixie ffmpeg drops. Without those flags
the v4l2_request hardware decoder family is unreachable from
mpv — which is exactly what bit the Rock Pi 4 in this PR:
RK3399's ``rkvdec`` (HEVC) and Hantro VPU (H.264) are both
stateless v4l2_request decoders. Pi 4 / Pi 5 already pull from
the +rpt1 repo for the same reason; extending the conditional in
``Dockerfile.viewer.j2`` to also include ``arm64`` lights up
hardware decode on every arm64 SBC whose kernel exposes
v4l2_request decoders (Rock Pi, Orange Pi RK356x, Pine64,
Allwinner H6 with Cedrus, ...).

* ``Dockerfile.viewer.j2`` — board conditional ``('pi4-64',
  'pi5')`` → ``('pi4-64', 'pi5', 'arm64')``. The apt pin already
  restricts the +rpt1 repo to ``ffmpeg + libav* + mpv``, so other
  arm64 packages stay on stock Debian. Comment block updated to
  list which decoders each board reaches via this path.
* ``playback_envelope.py`` — ``rockpi4`` envelope flips from
  H.264 1080p30 to HEVC 3840×2160 60. RK3399's Hantro G2 is the
  same decoder family as Pi 5's and supports 4Kp60 per the
  Rockchip datasheet — matching Pi 5's envelope keeps the fleet
  uniform.
* ``media_player.py`` — ``_PI_HWDEC_BY_CODEC['rockpi4']`` maps
  both h264 and hevc to ``drm-copy`` (the v4l2_request hwdec
  path, same as Pi 5 for HEVC).
* Tests + docs updated accordingly.

The legacy-arm64 fixes (force_mpv + cage VO + ALSA default for
``generic-arm64``) and the host_agent subtype publish are
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(celery): cgroup CPU hard cap (`cpus: 1.0`) so encodes never starve the viewer

``nice -n 19 ionice -c 3`` + ``--concurrency=1`` lower priority and
limit parallelism, but they're soft hints — when libx265 is the
only heavy workload on the box the scheduler still hands it
everything available. Live-confirmed on the Rock Pi 4 in this PR:
sshd starved through banner exchange and mpv dropped mid-frame
during walker bursts, even with all three soft caps in place.

``cpus: 1.0`` is a cgroup CFS quota — one CPU's worth of compute
per period, kernel-enforced. On every supported SBC (4-core Pi 4 /
Pi 5, 6-core Rock Pi 4) it leaves 3+ cores for the viewer, the
host_agent, sshd, and everything else. x86 hosts have 8+ cores so
the cap is conservative there but harmless — asset processing is
upload-time, not throughput-bound.

Applied to every prod / dev compose template. test compose stays
uncapped because the test suite runs in CI environments with
deterministic resources where the cap would just slow CI down
without protecting anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(celery): scale CFS quota with host cores (half of $(nproc), min 1.0)

A flat ``cpus: 1.0`` is too aggressive: it forces a single-thread
ceiling even when the host has many idle cores. On an 8-core x86
deployment the asset processor would take 4x longer than it needs
to without protecting anything we don't already protect.

Compute the limit dynamically in ``bin/upgrade_containers.sh``:
``$(nproc) * 0.5`` (floored to 1.0 so single-core hosts still
make progress). On the supported boards this lands at:

  * 4-core Pi 4 / Pi 5 → cpus: 2.0 (2 cores headroom for the
    viewer + system)
  * 6-core Rock Pi 4 (RK3399) → cpus: 3.0 (3 cores headroom)
  * 8-core x86 → cpus: 4.0 (4 cores headroom)
  * 16-core x86 → cpus: 8.0 (still 50/50 with the system)
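
The arithmetic, sketched in Python for clarity (the commit computes it in shell inside ``bin/upgrade_containers.sh``; the function name here is hypothetical). "Floored to 1.0" is read as "never below 1.0".

```python
def celery_cpu_limit(nproc):
    """Half the host cores, never below 1.0 (hypothetical helper name)."""
    return max(1.0, nproc * 0.5)


# Limits for a few representative core counts.
limits = {cores: celery_cpu_limit(cores) for cores in (1, 4, 8, 16)}
```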

Soft priorities (``nice -n 19 ionice -c 3``) and the
``--concurrency=1`` walker still apply on top; the cgroup quota
is the hard backstop that guarantees "encoding never impacts
playback or UI access". Live test on the Rock Pi 4 (in this PR)
proved the soft caps alone aren't enough — libx265 saturated
every core and starved sshd through banner exchange.

The balena compose templates use a literal ``cpus: 2.0`` (balena
only targets 4-core Pi 2/3/4/5 today); the non-balena prod
compose substitutes the env var. Dev compose also uses a literal
``2.0`` since dev hosts vary too widely to autodetect cheaply.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(walker): hardware-decode the source in the transcode pipeline

The walker's encode pass stays libx265-software-bound on every
SBC (none of Pi 4 / Pi 5 / Rock Pi 4 have HEVC HW encode), but
the *decode* half of the pipeline can be offloaded to the same
silicon mpv uses for playback. That's typically 30-50% of the
ffmpeg wall-clock on H.264 sources and dominant on 4K — well
worth the small dispatch table.

* ``_decode_hwaccel_args(source_codec)`` returns the per-board
  ``-hwaccel`` flags to prepend to the ffmpeg invocation. Uses
  the same host_agent subtype probe (``host:board_subtype`` in
  Redis) that envelope resolution already uses, so the walker
  and viewer agree on what board they're targeting.
* Dispatch matrix:
  - Pi 4 (V3D V4L2 M2M + rpi-hevc-dec) → ``-hwaccel drm`` for
    both H.264 and HEVC (the +rpt1 ffmpeg's v4l2_request path).
  - Pi 5 (Hantro G2) → ``-hwaccel drm`` for HEVC only.
  - Rock Pi 4 (rkvdec + Hantro VPU) → ``-hwaccel drm`` for both,
    same v4l2_request path as Pi 5.
  - x86 (VAAPI) → ``-hwaccel vaapi -hwaccel_device
    /dev/dri/renderD128`` for both.
  - Pi 2 / Pi 3 / unknown arm64 → no HW path mpv can address;
    SW decode is the only choice.
* ``_transcode_to_target`` wraps the ffmpeg call: first attempt
  with hwaccel args, fall back to SW decode on
  ``sh.ErrorReturnCode`` (kernel driver weird, device busy,
  bitstream the v4l2_request decoder rejects). Logs the
  underlying ffmpeg stderr at WARNING so an operator chasing a
  slow walker sees the HW path failed.
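
The dispatch table, the argv ordering and the SW fallback can be sketched together. This is an illustration under assumptions: the board is passed in directly rather than resolved from Redis, ``run`` is an injected callable that raises ``RuntimeError`` (the commit catches ``sh.ErrorReturnCode``), and the encoder flags are reduced to a bare ``-c:v libx265``.

```python
VAAPI_DEVICE = "/dev/dri/renderD128"

# Board key -> source codecs whose decode goes through "-hwaccel drm".
_DRM_CODECS = {
    "pi4-64": {"h264", "hevc"},
    "pi5": {"hevc"},
    "rockpi4": {"h264", "hevc"},
}


def decode_hwaccel_args(board, source_codec):
    if board == "x86" and source_codec in {"h264", "hevc"}:
        return ["-hwaccel", "vaapi", "-hwaccel_device", VAAPI_DEVICE]
    if source_codec in _DRM_CODECS.get(board, set()):
        return ["-hwaccel", "drm"]
    return []  # Pi 2 / Pi 3 / unknown arm64: software decode


def ffmpeg_argv(board, source_codec, src, dst):
    # Input options such as -hwaccel must precede the -i they apply to.
    return ["ffmpeg", *decode_hwaccel_args(board, source_codec),
            "-i", src, "-c:v", "libx265", dst]


def transcode(run, board, source_codec, src, dst):
    """Try the HW-decode argv first, then retry once with SW decode."""
    hw = ffmpeg_argv(board, source_codec, src, dst)
    sw = ffmpeg_argv("unknown", source_codec, src, dst)
    try:
        run(hw)
    except RuntimeError:
        if hw == sw:
            raise  # there was no HW attempt to fall back from
        run(sw)
```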

Tests pin every cell of the dispatch matrix + assert ``-hwaccel``
lands BEFORE ``-i`` in the argv (placing it after silently
no-ops in ffmpeg) + the two-call SW-fallback path on simulated
HW init failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-image): extend +rpt1 ffmpeg pin to anthias-server too

The walker's HW-decode optimization (``processing._decode_hwaccel_args``
emits ``-hwaccel drm``) only works against the Raspberry Pi repo's
``+rpt1`` ffmpeg build, which has ``--enable-v4l2-request``. The
pin was previously only on the *viewer* image (Dockerfile.viewer.j2
in ``ba8d4709``), so the celery container — which runs the walker —
kept the stock Debian ffmpeg and the hwaccel call silently fell
back to SW on every board.

* New ``docker/_rpt1-ffmpeg-pin.j2`` extracts the pin block.
* Both ``Dockerfile.viewer.j2`` and ``Dockerfile.server.j2`` now
  include it via ``{% include '_rpt1-ffmpeg-pin.j2' %}``. Server
  also re-runs ``apt install --reinstall ffmpeg libav*`` so the
  pinned version replaces whatever the base layer installed.
* No effect on Pi 2 / Pi 3 / x86 boards — the include's
  ``{% if board in ('pi4-64', 'pi5', 'arm64') %}`` keeps it
  inert there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(celery,viewer): four hardening fixes so the player survives an upgrade

Live testing on Pi 4 / Pi 5 / Rock Pi 4 surfaced four scenarios
where a single ``docker compose pull && up -d`` (or any upgrade
that invalidates the playback envelope) wedges the device. These
aren't test-harness flakes; production operators on the same
hardware would hit them. All four belong in this PR alongside the
features that exposed them.

1. **Walker drip-feed** — ``regenerate_for_envelope_change``
   previously queued every stale ``normalize_video_asset`` in one
   beat tick. ``--concurrency=1`` serialises *execution* but the
   celery worker fetches the next task the instant the previous
   finishes, so a 100-asset catalog turns into hours of back-to-
   back libx265 with zero recovery windows between encodes.
   Switch to ``apply_async(args=..., countdown=N * 60)``, where
   ``N`` is the asset's zero-based position in the stale list, so
   each subsequent normalize starts at least 60 s after the
   previous was queued. Operator can flip ``is_processing=False``
   on a row mid-window to cancel its turn.
2. **``mem_limit`` on celery container** — cgroup CPU isolation
   alone doesn't stop libx265-4K from allocating ~1.5 GB resident
   memory, which on a 4 GB SBC pushes the system into swap and
   starves sshd + the viewer. Match the cpus cap with a memory
   cap (60% of host RAM, computed in ``bin/upgrade_containers.sh``).
3. **``stop_grace_period: 3s`` + ``stop_signal: SIGKILL`` on
   viewer** — cage doesn't reliably release DRM master on
   SIGTERM (its libinput shutdown path hangs on certain kernels)
   and the kernel's GPU driver leaves dangling references that
   prevent the next ``up`` from acquiring DRM master. Skipping the
   SIGTERM-then-wait dance on intentional restarts gets the
   device past cage's bug deterministically.
4. **libx265 / libx264 ``-preset superfast``** — was ``medium``.
   Asset processing is upload-time and only runs once per asset,
   so the 5-10× wallclock speedup is operator-facing throughput.
   The ~10-20% bitrate increase is invisible on typical signage
   content. Viewer decode is HW regardless of preset.

Tests:
* Walker test mocks switched from ``.delay`` to ``.apply_async``;
  signatures updated for ``args=(...,)`` + ``countdown=`` kwarg.
* New ``test_regenerate_walker_spaces_dispatches_via_countdown``
  asserts the countdowns are ``[0, 60, 120, ...]`` across a
  5-asset catalog so the drip-feed contract is pinned.
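
The countdown schedule the test pins can be sketched without Celery. The ``drip_feed`` helper is hypothetical; in the task code each pair would be handed to ``apply_async(args=(asset_id,), countdown=countdown)``.

```python
def drip_feed(asset_ids, spacing_s=60):
    """Return (asset_id, countdown) pairs, one dispatch per spacing window."""
    return [(asset_id, n * spacing_s) for n, asset_id in enumerate(asset_ids)]


# A 5-asset catalog, matching the size used in the test described above.
schedule = drip_feed(["a1", "a2", "a3", "a4", "a5"])
countdowns = [countdown for _, countdown in schedule]
```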

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): use sh.ErrorReturnCode_1 in hwaccel fallback test

sh.ErrorReturnCode is the abstract base; its __init__ does
`self.exit_code = self.exit_code`, which raises AttributeError unless the
concrete numeric subclass (ErrorReturnCode_1, _2, ...) is used. Every
other call site in this file already uses ErrorReturnCode_1 — this was
the lone outlier introduced with the SW-fallback test in 0340b4f4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(asset-processor): drop on-device video transcoding

On-device libx265 transcode wedged a Pi 4's celery worker for 99 min on a
single 4K60 H.264→HEVC pass during PR validation. Every supported board
already HW-decodes both H.264 and HEVC via the viewer's per-board mpv
hwdec dispatch (drm-copy / vaapi-copy / v4l2m2m-copy), so the re-encode
provided no playback benefit for the codecs operators actually upload.

- ``normalize_video_asset`` now runs ffprobe and writes codec / dims /
  fps / duration into ``metadata``; the asset file is never rewritten.
- Removes the envelope module, the re-render walker
  (``regenerate_for_envelope_change``), and the server-start envelope
  cache reconciliation hook.
- Drops 33 transcode / envelope / sibling-original tests.

Image normalisation (HEIC/HEIF/TIFF/BMP/ICO/TGA/JP2/AVIF → WebP) is
unchanged. The viewer-side per-board hwdec dispatch and host_agent
board-subtype publishing are unchanged.

For codecs the target board can't HW-decode (MPEG-2, MPEG-4 ASP, ...)
the operator's recovery is to upload a transcoded copy; the metadata
fields surfaced here let them see codec / dims / fps in the asset list
before pushing the asset to the field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(asset-processor): gate uploads to hardware-decoded codecs only

After ffprobe, ``normalize_video_asset`` now compares the source codec
against the board's HW-decode set (mirroring the viewer's
``_PI_HWDEC_BY_CODEC``). Uploads outside the set are rejected with an
error message that includes the rejected codec, the board's supported
codecs, and an ``ffmpeg`` command line the operator can run on their
workstation to transcode the source.

Per-board HW decode set:

- pi2 / pi3 → {h264}
- pi4-64 / rockpi4 / x86 → {h264, hevc}
- pi5 → {hevc} (no H.264 v4l2-request decoder mpv can reach)
- arm64 catch-all → ∅ (operator must install a board-specific image)
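
The table above, expressed as a lookup with a rejecting check. This is a sketch: ``HW_DECODE_CODECS``, ``UnsupportedCodec`` and ``check_codec`` are hypothetical names, the message wording is invented for illustration, and treating an unlisted board like the empty catch-all is an assumption.

```python
HW_DECODE_CODECS = {
    "pi2": {"h264"},
    "pi3": {"h264"},
    "pi4-64": {"h264", "hevc"},
    "rockpi4": {"h264", "hevc"},
    "x86": {"h264", "hevc"},
    "pi5": {"hevc"},
    "arm64": set(),
}


class UnsupportedCodec(ValueError):
    """Hypothetical stand-in for the commit's rejection error."""


def check_codec(board, codec):
    # Assumption: boards missing from the table behave like the catch-all.
    supported = HW_DECODE_CODECS.get(board, set())
    if codec not in supported:
        listed = ", ".join(sorted(supported)) or "none"
        raise UnsupportedCodec(
            f"codec {codec!r} is not hardware-decoded on {board}; "
            f"supported: {listed}"
        )
```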

Also extracts ``DEVICE_TYPE`` → board-key resolution into a new
``anthias_common.board`` module so the server's gate and the viewer's
hwdec dispatch share the same logic — eliminates the duplicated
``_redis_board_subtype`` mirror in ``media_player.py``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dashboard): surface unsupported-codec failures with copyable recipe

UI/UX review of the gate's failure path surfaced two P0s and a few
smaller nits:

- The error message was only reachable via a native browser ``title``
  tooltip on the Failed pill — invisible on touchscreens, can't be
  copied, leaks the ``UnsupportedVideoCodecError:`` class prefix into
  the aria-label.
- The Edit Asset modal showed nothing about the failure — exactly
  the place the operator goes to act on a failed row.

Changes:

- ``UnsupportedVideoCodecError`` now carries the ffmpeg recipe as a
  ``recipe`` attribute. ``_NormalizeAssetTask.on_failure`` writes the
  bare message into ``metadata.error_message`` (no class-name prefix)
  and persists the recipe to ``metadata.error_recipe``.
- ``_asset_row.html`` Failed pill becomes a button — click opens the
  Edit Asset modal.
- ``_asset_modal.html`` renders a warning banner at the top of the
  Edit form when ``metadata.error_message`` is set, with the recipe
  inside a copyable ``<code>`` block + "Copy command" button.
- ``_ffmpeg_reencode_recipe`` substitutes the operator's upload
  filename (stashed in ``metadata.upload_name`` at upload time) for
  the ``INPUT`` placeholder so the recipe is paste-ready.
- Toast text shortened from "analysing video…" to "reading metadata…"
  (the ffprobe pass is sub-second now that there's no transcode).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(processing): give recipe output a codec suffix so it doesn't overwrite input

E2E validation on a Pi 5 surfaced a recipe like:

  ffmpeg -i 'sample-h264.mp4' -c:v libx265 ... 'sample-h264.mp4'

— input and output point at the same file because both got the
upload's stem + ``.mp4`` suffix. Operator pasting the recipe would
overwrite their source. The fix gives the output filename a target-
codec marker (``sample-h264.hevc.mp4`` / ``sample-h264.h264.mp4``)
so the recipe is safe to copy-paste even when the upload's
extension already matches the output container.
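
A sketch of the naming rule with shell-safe quoting. ``reencode_recipe`` is a hypothetical name, and the recipe is deliberately reduced to ``-c:v``; the other flags elided in the example above are not reproduced here.

```python
import shlex
from pathlib import PurePosixPath

_ENCODERS = {"hevc": "libx265", "h264": "libx264"}


def reencode_recipe(upload_name, target_codec):
    src = PurePosixPath(upload_name).name
    # Put the target codec before the container suffix so the output can
    # never collide with the input, e.g. clip.mp4 -> clip.hevc.mp4.
    dst = f"{PurePosixPath(src).stem}.{target_codec}.mp4"
    return " ".join([
        "ffmpeg", "-i", shlex.quote(src),
        "-c:v", _ENCODERS[target_codec],
        shlex.quote(dst),
    ])
```

``shlex.quote`` keeps hostile filenames (quotes, backticks, ``$()``, ``;``) as a single literal argument when the recipe is pasted into a POSIX shell.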

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: drop transcode-era defensive hardening on celery + server image

These guards were load-bearing while the asset processor ran libx264 /
libx265 transcodes; with the on-device transcode pipeline gone they're
dead code defending against a workload that no longer exists.

Removed:
- ``cpus: ${CELERY_CPU_LIMIT}`` / ``cpus: 2.0`` cgroup CPU caps on
  anthias-celery (every compose template)
- ``nice -n 19 ionice -c 3`` wrapper on the celery command
- ``--concurrency=1`` on celery worker; default celery concurrency
  is fine when the only tasks are ffprobe + Pillow conversion
- ``CELERY_CPU_LIMIT`` calc in ``bin/upgrade_containers.sh``
- ``_rpt1-ffmpeg-pin.j2`` include + reinstall layer in
  ``Dockerfile.server.j2``; the +rpt1 ffmpeg was only needed for
  the walker's ``-hwaccel drm`` transcode. The server now only
  runs ffprobe, which the stock Debian ffmpeg handles fine
  (smaller server image, simpler base)
- Stale ``ffprobe → passthrough or libx264/aac transcode`` section
  header in processing.py

Kept:
- ``mem_limit: ${CELERY_MEMORY_LIMIT_KB}k`` on celery — still a
  useful safety net against a decompression-bomb fixture or
  runaway ffprobe
- ``+rpt1`` ffmpeg pin on the *viewer* image — still load-bearing
  for mpv's ``v4l2_request`` HW decode on Pi 4 / Pi 5 / Rock Pi 4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: keep nice -n 19 ionice -c 3 on celery

Cheap insurance against pathological inputs (decompression-bomb
HEIC, runaway ffprobe). Brought back across all four compose
templates after stripping the CPU cap + --concurrency=1 in the
prior cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dashboard): address review feedback on codec gate UX

* Plain-HTTP clipboard fallback. navigator.clipboard.writeText only
  resolves on secure origins, so on a LAN device (HTTP) the Copy
  command button silently failed. Add a window.fallbackCopyToClipboard
  helper that uses execCommand('copy') against an off-screen
  textarea, and have the inline copyRecipe() try it whenever
  navigator.clipboard isn't available or rejects. The recipe block
  also gets user-select:all so keyboard-copy still works if both
  paths fail.
* Friendlier message for the arm64 catch-all branch. "Supported:
  none." read like the board literally has no decoder; replace with
  an explanation that the board hasn't reported a subtype yet and a
  pointer at the board-specific image.
* Lock the gate (_HW_DECODE_VIDEO_CODECS) and the viewer dispatch
  (_PI_HWDEC_BY_CODEC) together with a consistency test so a future
  edit to one table can't quietly diverge from the other.
* Cover the shell-quoting of recipe filenames with hostile-name
  parametrize cases (single quote, backtick, $(), ;) so a copy-paste
  recipe can't be turned into command injection.
* Drop the stale "cgroup CPU cap" line from processing.py's module
  docstring — the cap was removed in f85f8035.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address post-review feedback on codec gate / hwdec dispatch

- processing: prefer the upload's extension token when ffprobe's
  format_name is a synonym list, so an .mp4 surfaces as
  container=mp4 (not mov, the first synonym).
- bin/start_viewer.sh: drop the loose `*-dec` catch-all from the
  v4l2 decoder match; keep the explicit rkvdec/cedrus/hantro/
  *-vpu-dec prefixes.
- media_player: cap the ANTHIAS_DEBUG_DROPS mpv.log at 64 MB with
  a rolling truncate so a forgotten-on flag can't grow the disk.
- tests: rename test_set_board_subtype_does_not_raise_on_redis_failure
  to test_set_board_subtype_propagates_redis_failures — matches what
  the test actually asserts.
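
The container-token rule from the first item can be sketched as follows. ``container_token`` is a hypothetical name, and falling back to the first synonym when the extension is not in the list is an assumption about the remaining behaviour.

```python
from pathlib import PurePosixPath


def container_token(format_name, upload_name):
    """Pick one container label from ffprobe's comma-separated format_name."""
    tokens = [t for t in format_name.split(",") if t]
    ext = PurePosixPath(upload_name).suffix.lstrip(".").lower()
    if ext in tokens:
        # The upload's own extension is one of the synonyms: prefer it.
        return ext
    # Assumption: otherwise keep the first synonym ffprobe reported.
    return tokens[0] if tokens else ""
```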

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:46:02 +02:00
Viktor Petersson
2d7b92c006 Hugo website: docs migration, API reference, FAQ, and SEO (#2807)
* Move website to Hugo

* Rewrite in progress

* Add Hugo-native API reference page and fix CSS build path

Two related changes for the Hugo site:

1. CSS build target: package.json's css:build/css:watch wrote to
   assets/styles/style.css, but baseof.html uses a plain <link href>
   that Hugo serves from static/. The merge left a stale 14K static
   copy alongside the freshly-built 23K asset copy, so pages rendered
   with most utility classes undefined. Build target is now
   static/assets/styles/style.css, matching the convention used by
   every other website asset.

2. Hugo-native API docs at /api/. The OpenAPI spec is loaded from
   data/openapi.yaml (generated via `manage.py spectacular`) and
   rendered in layouts/_default/api.html and a recursive schema
   partial. Endpoints are grouped by tag with anchor jumps, color-
   coded method badges, params/request/response tables, and inline
   $ref resolution. Renders all 18 v2 endpoints across 9 tags with
   the existing Tailwind theme. No third-party JS bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Move documentation under Hugo and redirect old GitHub paths

Migrates docs/ markdown into website/content/docs/ rendered with a new
docs/ layout (list + single) and Tailwind prose styling. Images and
the d2 diagram move to website/static/docs/. Internal links rewritten
from /docs/foo.md to /docs/foo/, and GitHub-style alerts pre-converted
to bold-labeled blockquotes since the goldmark alert extension is not
enabled on this Hugo version.

The original docs/*.md files are kept as redirect stubs that point at
https://anthias.screenly.io/docs/... so external links into the GitHub
docs tree still resolve to a useful page. Root README.md links updated
to point at the website URLs.

Hugo nav now exposes Docs alongside Features / Get Started / API / FAQ.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix factual inaccuracies in migrated docs against the codebase

Reviewed all docs against the current source. Concrete fixes:

* _index.md: container names use the post-rebrand `anthias-` compose
  project prefix (e.g. `anthias-anthias-server-1`, `anthias-redis-1`)
  rather than the legacy `screenly-` form. Replaced `docker-compose
  logs` with `docker compose logs` and added the optional `anthias-
  caddy` sidecar to the container table.
* developer-documentation.md: fixed leading-letter typo ("unning"),
  and replaced the old Django test-runner invocation with the pytest
  commands used by the suite today (`pytest -n auto -m "not
  integration"` and `pytest -m integration`).
* balena-fleet-deployment.md: corrected the supported board list
  ($BOARD_TYPE) to match `bin/deploy_to_balena.sh --help`
  (`pi2`, `pi3`, `pi4-64`, `pi5` — no `pi1` or plain `pi4`). Updated
  registry reference from Docker Hub to GHCR.
* migrating-assets-to-screenly.md: `cd ~/screenly` → `cd ~/anthias`
  (post-rebrand install path).
* raspberry-pi5-ssd-install-instructions.md: fixed "Opitions" and
  "uinsg" typos in the boot-order steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Polish docs styling: callouts, syntax highlighting, hierarchy

Reworks docs prose styling so the migrated pages don't read like
default-Hugo-render-output:

* Headings: in-body H1/H2 collapse to a section divider style with a
  top border so they don't compete with the dark page hero. H4-H6
  become small uppercase eyebrows. Markdown sources mix #/##/####
  inconsistently — the visual scale now compresses gracefully.
* Alerts: a render-blockquote hook detects the bold-label preamble
  produced by our preprocessor (`> **Note**` etc.) and emits a typed
  `<blockquote class="docs-alert docs-alert-note">` so each kind gets
  its own colored border + label (note/tip/important/warning/caution).
* Syntax highlighting: enable Hugo Chroma with the github style,
  noClasses=false. Generated chroma.css ships as a static asset and is
  loaded alongside style.css. `pre`/`<code>` get a light surface that
  the chroma token colors sit on top of.
* Inline code, lists, links, tables, and images all get a small
  rebalance — bullet color, link underline weight, image shadow,
  table border-radius — to match the brand-purple theme.
* Footer: the Resources / Docs link pointed at the legacy
  github.com/.../docs/README.md path; now points at /docs/. Added an
  API Reference link alongside.
* Stripped a stray `<br>` in the Pi5 SSD doc that was creating a
  random gap between a blockquote and its illustrative screenshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Make x86/PC docs consistent and more user-friendly

The migrated docs used four different forms — "x86", "x86 device",
"PC (x86) devices", and "PC (x86 Devices)" — depending on the page.
Standardize on **PC (x86)** as the user-facing label (PC is what
people search for; x86 stays as the architecture qualifier).

Also rewrites x86-installation.md from a flat bullet dump into a
clearer walkthrough (what you need, then five steps: download, flash,
install Debian, prep the system, run the installer) and crosslinks
the right anchor in installation-options.md so PC users can hand off
to the scripted install without scrolling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Expand FAQ with forum-driven questions and refactor to data file

The FAQ had six entries that didn't reflect what people actually ask
on forums.screenly.io. Reviewed the all-time top topics and added the
ones that show up over and over: portrait rotation, YouTube playback,
Wi-Fi setup, static IP assignment, audio output, resolution / 4K,
black-screen troubleshooting, transitions, asset storage / backup,
SSH, HTTPS pointer, commercial-use clarity, getting logs, and a link
to the API reference.

Refactored the layout so it reads from data/faq.yaml grouped by
section (About, Installation & updates, Display & playback,
Operations) and renders each answer through markdownify. This makes
adding new entries a one-paragraph YAML edit instead of duplicating
~15 lines of accordion markup. Answers reuse the .docs-prose styling
so code, links, lists, and inline pre snippets all match the docs
pages.

Also tightened the "Accessing the REST API" section in /docs/ to
point at the new /api/ page first, with the live ReDoc URL on the
device as a secondary callout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Correct rotation FAQ — Anthias renders via linuxfb + DRM, no Wayland

Verified in code: docker/Dockerfile.viewer.j2 sets
QT_QPA_PLATFORM=linuxfb, webview/build_qt{5,6}.sh both pass
-skip wayland to the Qt build, and viewer/media_player.py invokes
mpv with --vo=drm. There is no Wayland compositor in the runtime
stack on any board.

Replaced the previous "Pi 5 with Wayland uses a different stack"
hand-wave with the actual fallback: if /boot/firmware/config.txt's
display_rotate=N doesn't stick on a Pi 5 / KMS pipeline, append
video=HDMI-A-1:...,rotate=N to /boot/firmware/cmdline.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Tighten three FAQ answers after a code-driven validation pass

* SSH: previous answer claimed SSH was on by default for both the
  Anthias disk image and the scripted install. Anthias's installer
  doesn't touch sshd at all, so the answer now distinguishes between
  the prebuilt images (SSH on) and a self-flashed Raspberry Pi OS Lite
  (SSH must be pre-enabled).
* Audio output: static/src/components/settings/audio-output.tsx hides
  the 3.5mm option on Pi 5 because the hardware lacks the jack. Call
  that out so Pi 5 users don't go looking for a missing dropdown item.
* Black screen: replaced the `xset dpms force on` suggestion. Anthias
  has no X server on any board (Qt runs on linuxfb, mpv on --vo=drm),
  so xset can't toggle DPMS. Pointed users at re-seating HDMI or
  checking the TV's input as a more grounded recovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Correct features-page claim — Anthias detects display state, can't toggle it

The "Display power control" card promised programmatic on/off
toggling of the connected screen for energy savings. That isn't a
real feature. lib/diagnostics.py only calls libcec's tv.is_on() to
*query* the TV's power state — there's no power-on / standby command
path anywhere in the codebase. The result surfaces read-only as
display_power on the System Info page (static/src/components/
system-info.tsx).

Replaced the card with what's actually shipping: HDMI-CEC display
state *detection*, visible on the System Info page.

Verified the rest of the page against code while I was in there.
Accurate as written: image/video/webpage assets (Qt webview + mpv),
scheduling (start_date/end_date/duration on the asset model), drag-
drop playlists (@dnd-kit/sortable), shuffle (settings.shufflePlaylist),
1080p output (mpv pinned to 1920x1080@60 on pi4-64/pi5 in
viewer/media_player.py), real-time WebSocket sync (Django Channels +
Redis pub/sub), REST API (drf-spectacular), four-container compose
topology, backup/restore, optional basic auth (lib/auth.py BasicAuth),
System Info page fields (loadavg/free_space/uptime/anthias_version),
and the supported hardware list (matches ansible/site.yml's
device_type assertion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Punchier homepage tagline: "Free digital signage for everyone."

Replaces "Open source digital signage for any screen" with a shorter,
benefit-led headline. The new line breaks naturally across two lines
on desktop (free digital signage / for everyone.) and stays single-
line on mobile to avoid an awkward orphan.

Subtitle is unchanged — it still does the explanatory work (Pi or
PC, schedule images/videos/webpages, no subscriptions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* SEO sweep: per-page meta, FAQPage / TechArticle JSON-LD, robots.txt

Two functional gaps in the existing setup:

* og:description and twitter:description were hardcoded to a single
  marketing line on every page, while the per-page <meta name=
  description> already pulled from front matter. So the Slack/Twitter/
  Discord card preview always read the same blurb regardless of which
  page you shared. Now both the OG and Twitter description reflect the
  page's own .Params.description.

* Page titles drifted: most pages embedded "Anthias" in the title
  string, but the docs pages were just "Documentation" /
  "Installation Options" / etc. — fine for the H1, weak for SERPs.
  Title now appends " | Anthias" only when the page title doesn't
  already contain the brand, so existing branded titles stay clean
  and docs pages get a brand suffix automatically.
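
The title rule above is implemented in a Hugo template; the Python below is only a sketch of the same logic. `page_title` is a hypothetical name, and the case-sensitive substring check is an assumption about how "already contains the brand" is tested.

```python
BRAND = 'Anthias'


def page_title(title: str, brand: str = BRAND) -> str:
    """Append ' | <brand>' unless the title already mentions the brand."""
    if brand in title:
        return title
    return f'{title} | {brand}'


# Docs pages gain the suffix; already-branded titles stay as they are.
print(page_title('Installation Options'))
print(page_title('Anthias API Reference'))
```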

Other tightening:

* Added FAQPage structured data on /faq/ generated from data/faq.yaml
  so Google can surface FAQ rich results.
* Added TechArticle structured data on individual /docs/ pages.
* og:type now flips to "article" on docs pages.
* og:image:alt + twitter:image:alt populated.
* theme-color set to the brand purple for mobile browser chrome.
* JSON-LD home schema URL now uses site.BaseURL instead of a
  hardcoded production URL — important for staging / dev parity.
* <html lang> reads site.LanguageCode instead of a fixed "en".
* Added a real robots.txt that points crawlers at /sitemap.xml
  (Hugo already generates the sitemap, but a robots.txt makes the
  pointer explicit and unblocks tooling that looks for it).
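
As a rough picture of the FAQPage structured data mentioned in the list above: the site generates it in a Hugo template from data/faq.yaml, so the Python below is a stand-in, and the `question` / `answer` keys are assumed field names rather than the file's actual schema. The `@type` values follow the schema.org FAQPage vocabulary.

```python
import json
from typing import Any


def faq_jsonld(entries: list[dict[str, str]]) -> dict[str, Any]:
    """Build a schema.org FAQPage object from question/answer pairs."""
    return {
        '@context': 'https://schema.org',
        '@type': 'FAQPage',
        'mainEntity': [
            {
                '@type': 'Question',
                'name': entry['question'],  # assumed key name
                'acceptedAnswer': {
                    '@type': 'Answer',
                    'text': entry['answer'],  # assumed key name
                },
            }
            for entry in entries
        ],
    }


# Hypothetical entry, serialized the way it would sit in a JSON-LD script tag.
sample = [{'question': 'Is Anthias free?', 'answer': 'Yes.'}]
print(json.dumps(faq_jsonld(sample), indent=2))
```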

Replaced placeholder image alt text in the docs ("balena-ss-01",
"imager-01", "rpi-eeprom-update", etc.) with descriptive captions —
better for screen readers and image-search SEO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Move site assets into Hugo's expected layout and rename /docs URLs

Two related cleanups.

ASSETS — site assets were split across website/static/assets/ (the
shadowed copy hugo.toml's [[module.mounts]] directed traffic to) and
website/assets/ (an unused duplicate). Hugo's own build report showed
"Processed images: 0" because nothing actually flowed through Pipes.

  * Removed the [[module.mounts]] override so Hugo uses default
    layout: assets/ for Pipes-processable resources, static/ for
    served-as-is files.
  * Used `git mv` to record the docs/ image and stylesheet renames as
    history-preserving moves rather than delete+add diffs.
  * Removed the duplicate website/static/assets/images/ directory —
    files already lived in website/assets/images/.
  * Bun's css:build/css:watch now write to assets/styles/style.css so
    Tailwind output flows through Hugo Pipes.
  * baseof.html loads style.css and chroma.css via resources.Get +
    fingerprint, with SRI integrity attributes. Each deploy produces
    a fresh content-hashed URL (/styles/style.<hash>.css), so the
    browser cache invalidates correctly without manual cache-busting.
  * Logos, social icons, hero raster (overview*.png), favicon, and
    plus/minus accordion icons all flow through resources.Get for
    consistent asset handling.
  * Added layouts/_default/_markup/render-image.html so markdown
    image references in /docs are looked up via resources.Get and
    emitted with loading="lazy" decoding="async".

URL RENAMES — the docs URLs were verbatim copies of the original
GitHub filenames, which made for noisy URLs like
/docs/raspberry-pi5-ssd-install-instructions/. Slugged each page and
left aliases for the old paths so Hugo emits a meta-refresh redirect:

  /docs/installation-options/                       → /docs/install/
  /docs/balena-fleet-deployment/                    → /docs/balena/
  /docs/x86-installation/                           → /docs/pc/
  /docs/raspberry-pi5-ssd-install-instructions/     → /docs/pi5-ssd/
  /docs/migrating-assets-to-screenly/               → /docs/migrate-to-screenly/
  /docs/qa-checklist/                               → /docs/qa/
  /docs/developer-documentation/                    → /docs/development/

Cross-doc links inside /docs and the README + repo-root docs/ stub
files all point at the new URLs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix lint + mypy on raspberry_pi_imager test (carried over from rebase)

The test_build_pi_imager_json.py file landed in `88d3881b Move website
to Hugo` with two pre-existing CI failures:

* ruff format --check: a few helper definitions had a stale line
  break the formatter wanted to collapse.
* mypy: `make_image_metadata(board: str) -> dict` is missing the
  generic type parameters that the project's mypy config flags as
  type-arg. Annotated as `dict[str, Any]`.
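
A minimal sketch of the annotation change, assuming a helper of this shape; the dictionary fields are placeholders, not the real test fixture.

```python
from typing import Any


def make_image_metadata(board: str) -> dict[str, Any]:
    """Return canned image metadata for `board` (fields are hypothetical)."""
    return {
        'board': board,
        'size_bytes': 0,
        'tags': [board],
    }
```

A bare `-> dict` is what mypy reports as a missing type parameter when generics must be spelled out; `dict[str, Any]` satisfies that check while leaving the value types open.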

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Self-host Plus Jakarta Sans via @fontsource (drop Google Fonts CDN)

Removes the third-party Google Fonts <link>. SonarCloud's Web:S5725
hotspot was flagging the link as a resource-integrity (SRI) risk —
SRI is impossible against Google Fonts because the served stylesheet
rotates per User-Agent and the woff2 URLs change with the font CSS.

Self-hosting the same font from npm via @fontsource removes the
cross-origin resource entirely.

How it's wired:

* `bun add -D @fontsource/plus-jakarta-sans` for the font binaries.
* `scripts/install-fonts.ts` is a small bun script that, given the
  installed package, copies woff2 files for latin + latin-ext at
  weights 400/500/600/700/800 to `static/fonts/` (so Hugo serves
  them at `/fonts/...`) and emits a combined
  `assets/fonts/plus-jakarta-sans.css` with the urls rewritten to
  absolute /fonts/... paths and the woff fallback stripped.
* `package.json` adds `fonts:install`, and chains it through
  `css:build` / `css:watch` so Tailwind always sees the generated
  CSS up to date.
* `main.css` @imports the generated CSS — Tailwind/Lightning CSS
  inlines the @font-face rules into the final fingerprinted
  style.<hash>.css.
* `.gitignore` excludes `assets/fonts/` and `static/fonts/` since
  both are deterministically regenerated from node_modules.
* `baseof.html` no longer pulls from fonts.googleapis.com.

Total payload: 10 woff2 files (~136KB), but each is loaded
on-demand by unicode-range — typical English-only visitors fetch
~50KB of fonts, served from same-origin.

The second Web:S5725 hotspot (gtag.js from googletagmanager.com)
is unchanged in this commit — Google's tag manager script is
updated server-side without a stable hash, so SRI cannot apply.
That one needs a product call (keep with dismissal, drop GA, or
move to a privacy-first SRI-friendly alternative).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address SonarCloud code-smell findings on the website

Cleared the unrelated SonarCloud findings raised on this PR:

* `install-fonts.ts`: `fs` and `path` imports use the `node:` prefix
  (typescript:S7772). The new prefixed form is the bun-recommended
  one, no behavior change.
* `_markup/render-image.html`: rewrote the comment that referenced
  `<img>` literally — Web:ImgWithoutAltCheck was treating the word
  inside the Hugo comment block as an actual element with no alt.
* `_default/faq.html`: replaced the accordion's `<div role="region">`
  with a real `<section>` element (Web:S6819). The aria-labelledby
  binding stays, so the accessible name resolution is identical and
  the semantics are now native rather than ARIA-emulated.
* `assets/styles/chroma.css`: stripped the two stray-semicolon lines
  left over from the sed pass that emptied the github-style backdrop
  (css:S1116). The remaining `.chroma { -webkit-text-size-adjust:
  none }` rule is what's actually load-bearing.
* `_default/baseof.html`:
  - accordion JS now reads `this.dataset.accordion` instead of
    `this.getAttribute('data-accordion')` (javascript:S7761).
  - GA bootstrap uses `globalThis.dataLayer` instead of
    `window.dataLayer` (javascript:S7764). Same semantics in any
    browser context, no globalThis polyfill needed for our targets.
* `layouts/index.html`: dropped the deprecated `scrolling="0"`
  attribute from the GitHub stars iframe (Web:S1827); replaced
  with the equivalent `overflow-hidden` Tailwind class.

The Web:S5725 SRI hotspot on the gtag.js script (line 162 of
baseof.html) is the only remaining finding. Google Tag Manager is
versioned server-side without a stable hash, so SRI fundamentally
can't apply — that one is being kept and dismissed in the
SonarCloud UI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review + trigger marketing deploy on release publish

Copilot review:

* api.html: request-body renderer only looked at application/json,
  so endpoints whose only content type is multipart/form-data (file
  uploads) or application/x-www-form-urlencoded would render an
  empty Request body section. Pick application/json first if present,
  otherwise fall back to the first listed content type, and label
  the rendered schema with its actual content type.
* build_pi_imager_json.py: every requests.get() now sets a 30s
  timeout and calls raise_for_status() so a slow/rate-limited GitHub
  API doesn't hang the deploy job and a 4xx/5xx fails fast with a
  clear message rather than a confusing KeyError on response.json().
* docs/raspberry-pi5-ssd-install-instructions.md: "Other HAT's" →
  "Other HATs".
* docs/qa-checklist.md: dropped the spurious "a" in "Change a the
  start and end dates".
* deploy-website.yaml: jq's has() takes one key, so the validation
  step `has("name", "description", ...)` was actually a syntax error
  on every run — rewrote as `all($k; $entry | has($k))` over the
  required-key list.
* layouts/_default/get-started.html: the "Documentation" CTA pointed
  at the old GitHub markdown file; now links to /docs/ to match the
  navbar / footer.
* website/README.md: rewrote the project-structure tree to match
  what's actually in the repo (data/, scripts/, layouts/docs/,
  Goldmark _markup/ hooks etc.) and documented the bun pipeline —
  `hugo server` alone leaves /fonts/* as 404s because the woff2
  files are gitignored and materialized by `bun run fonts:install`.
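
The request-body fallback described in the api.html item above lives in a Hugo template; the following Python sketch only mirrors the selection order. `pick_request_content` is a hypothetical name, and `content` stands for the `requestBody.content` map of an OpenAPI operation.

```python
from typing import Any, Optional


def pick_request_content(
    content: Optional[dict[str, Any]],
) -> Optional[tuple[str, Any]]:
    """Prefer application/json, else fall back to the first listed type."""
    if not content:
        return None
    if 'application/json' in content:
        return 'application/json', content['application/json']
    # dicts preserve insertion order, so this is the first listed type.
    media_type = next(iter(content))
    return media_type, content[media_type]
```

Returning the media type alongside the schema is what lets the renderer label the section with the content type it actually picked.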

Marketing deploy on release publish:

`build-balena-disk-image.yaml` cuts the GitHub release with the
*.img.zst artefacts as its final step; until now the marketing site
only re-deployed on master push or manual dispatch, so rpi-imager.
json on the live site lagged the freshest disk images by however
long it took someone to push an unrelated website change. Hooking
deploy-website.yaml to `on: release: types: [published]` makes the
site rebuild as soon as the release exists, which is exactly when
the GitHub API starts surfacing the new assets the JSON generator
queries. `prerelease=true` releases are included because that's what
build-balena-disk-image.yaml currently flags every release as.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address second round of Copilot review

* installation-options.md: balenaEthcher → balenaEtcher.
* balena-fleet-deployment.md: includa → include.
* developer-documentation.md: spash screen → splash screen.
* qa-checklist.md: enabling **Show splash screen** is supposed to
  *display* the splash, not hide it — flipped "is not being
  displayed" → "is being displayed". Also clickin → clicking.
* raspberry-pi5-ssd-install-instructions.md: `sudo apt update -y`
  isn't valid (apt's -y is only for install / upgrade), so the
  copy-paste step would error. Dropped the `-y` from update; the
  full-upgrade line keeps it because that's where it actually does
  something.
* deploy-website.yaml: the jq required-keys check was missing
  `icon` and `website`, which build_pi_imager_json.py's
  REQUIRED_FIELDS already enforces in the Python tests. Added them
  so the runtime validation matches the generator's contract.
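
The workflow does this check in jq; a Python equivalent of the same per-entry required-keys test might look like the sketch below. The key tuple lists only the four names quoted in these review notes and is deliberately partial, and the function names are hypothetical.

```python
from typing import Any

# Partial list: only the keys named in the review notes above.
REQUIRED_KEYS = ('name', 'description', 'icon', 'website')


def missing_keys(
    entry: dict[str, Any], required: tuple[str, ...] = REQUIRED_KEYS
) -> list[str]:
    """Return the required keys that `entry` lacks, in declaration order."""
    return [key for key in required if key not in entry]


def validate_entries(entries: list[dict[str, Any]]) -> dict[int, list[str]]:
    """Map the index of each incomplete entry to its missing keys."""
    problems = {}
    for index, entry in enumerate(entries):
        missing = missing_keys(entry)
        if missing:
            problems[index] = missing
    return problems
```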

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address third round of Copilot review

* website/.gitignore: `_.log` was a typo from the original Hugo
  bootstrap — it doesn't match anything. Replaced with the intended
  `*.log` so log files are actually ignored.
* website/package.json: rewrote the `dev` script to capture both
  child PIDs and trap EXIT/INT/TERM so Ctrl-C (or hugo crashing)
  takes the Tailwind watcher down with it. Mirrors the pattern in
  the repo-root package.json's `dev`.
* docs/raspberry-pi5-ssd-install-instructions.md: "Early Pi 5's"
  → "Early Pi 5s" (no apostrophe on plurals).
* docs/qa-checklist.md: "make sure that the screen in standby mode"
  → "make sure that the screen is in standby mode" (missing verb).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Rewrite raspberry_pi_imager tests in pytest style

The file was unittest.TestCase classes — pytest discovers and runs
those, but the boilerplate doesn't earn its keep. Each test method
re-declared `@patch('...requests.get')` and rebuilt the same
MagicMock setup, and the per-board cases lived as 5+3+2+2 separate
methods that should have been one parametrize each.

Reworked as flat module-level functions backed by three fixtures:

* `mock_requests_get` — patches the module's `requests.get` and yields
  the mock so each test sets `return_value` / `side_effect` directly.
* `mock_release_assets` — preconfigured to return the canned release
  asset list, used by the `get_asset_list` cases.
* `mock_full_build` — wires up the three call shapes
  `build_imager_json()` makes (latest, asset list, per-asset json).

Per-board cases collapse into `@pytest.mark.parametrize`:
get_board_from_url's positive cases, the non-image-returns-None
cases, the maintenance-mode boards, and the modern boards.

Coverage is the same — 21 collected cases (pytest fans the parametrize
out from 12 test methods to 21 ids), all passing in 0.12s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix PR checks and Copilot review items

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:16:25 +01:00
Viktor Petersson
5e00c8ba25 refactor(docker): drop celery image, restore base apt layer dedup (#2776)
* refactor(docker): drop celery image, restore base apt layer dedup

- Delete Dockerfile.celery.j2; compose now runs celery on the
  anthias-server image with a `command:` override.
- Make viewer extend Dockerfile.base.j2 (mirroring test); drop 17
  packages duplicated between viewer and base_apt_dependencies, plus
  4 within-list duplicates.
- Move `# syntax=docker/dockerfile:1.4` to line 1 of every rendered
  Dockerfile. It previously lived in uv-builder.j2 line 1 and got
  bumped mid-file for server by the bun-builder prelude, silently
  disabling the 1.4 frontend and breaking cache-key parity with
  viewer — the actual blocker for layer dedup.
- Collapse CI matrix from (board × service) to (board) so all
  services for a board build on the same runner with the same
  buildkit cache, producing byte-identical apt layer digests at the
  registry.
- Add ENV DJANGO_SETTINGS_MODULE to the server image so the merged
  image runs both server and celery CMDs.
- Update all five compose templates (prod, balena prod, balena dev,
  dev, test) to redirect anthias-celery at the server image with a
  command: override. dev compose pins an explicit `image:` tag so
  both services share the locally-built SHA.
- Remove old anthias-celery / srly-ose-celery containers in
  upgrade_containers.sh so the recreated container can take the name.

Verified end-to-end on x86: server and viewer apt layers share a
single digest; SHARED SIZE jumps from 132 MB to 1.216 GB; merged
image runs both workloads in compose (celery task round-trips
through Redis to SUCCESS).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(docker): cache buildkit layers in GHCR registry across CI runs

Add a --cache-backend / $BUILDX_CACHE_BACKEND option to
tools.image_builder with two modes:

- `local` (default): writes to /tmp/.buildx-cache/<board>/.
  Unchanged from before; right for local dev.
- `registry`: pushes BuildKit cache to
  ghcr.io/screenly/anthias-<service>:buildcache-<board>. Reuses the
  GHCR login already done by docker-build.yaml, no extra tokens or
  third-party actions needed.
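
One way to wire a flag with an environment-variable default, shown with argparse purely as an assumption; the CLI library tools.image_builder actually uses is not stated here, and `build_parser` is a hypothetical helper.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Parser whose --cache-backend default comes from $BUILDX_CACHE_BACKEND."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog='image_builder')
    parser.add_argument(
        '--cache-backend',
        choices=('local', 'registry'),
        # An explicit flag wins; otherwise the env var, otherwise 'local'.
        default=env.get('BUILDX_CACHE_BACKEND', 'local'),
    )
    return parser
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.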

Wire CI to use registry mode on push events (master) so subsequent
runs of the same board pull cached layers — the ~825 MB extracted
apt install per service goes from ~3 min cold to a few seconds
warm. workflow_dispatch on a non-master branch falls back to local
mode (effectively no-cache) so manual runs can't pollute the master
cache.

Drop the old actions/cache@v5 step that mirrored
/tmp/.buildx-cache/<board> through actions/cache — registry cache
is per-step rather than one big tarball, so it survives the GitHub
Actions cache 10 GB-per-repo eviction better.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(image-builder): move local cache out of /tmp to user XDG cache dir

SonarCloud python:S5443 flagged the previous /tmp/.buildx-cache/
default as a security hotspot — `/tmp` is world-writable, so on a
multi-user host another account could in principle tamper with the
buildkit cache. Switch to $XDG_CACHE_HOME/anthias-buildx/<board>/
(default ~/.cache/anthias-buildx/), which is per-user by default
and follows XDG Base Directory convention.
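
A minimal sketch of resolving that path, assuming a helper named `local_cache_dir` (hypothetical). It follows the XDG rule that an unset or empty `XDG_CACHE_HOME` falls back to `~/.cache`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def local_cache_dir(board: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return $XDG_CACHE_HOME/anthias-buildx/<board>, defaulting to ~/.cache."""
    env = os.environ if env is None else env
    # Empty string counts as unset, per the XDG Base Directory convention.
    base = env.get('XDG_CACHE_HOME') or str(Path.home() / '.cache')
    return Path(base) / 'anthias-buildx' / board
```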

CI is unaffected: docker-build.yaml uses --cache-backend=registry
on push events, which pushes cache to GHCR and never touches the
local path. Local dev users with stale state in
/tmp/.buildx-cache/<board>/ can rm it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): correct cache-backend comments to match real behavior

Two doc fixes per Copilot review on #2776:

- tools/image_builder/__main__.py: the cache-backend rationale
  block still referenced /tmp/.buildx-cache/<board>; update to
  $XDG_CACHE_HOME/anthias-buildx/<board> so it matches the
  implementation moved in 529a50e0.
- .github/workflows/docker-build.yaml: the env comment claimed
  pull-request builds read from the registry cache, but this
  workflow has no pull_request trigger — non-push runs are
  workflow_dispatch, which both falls through to local cache and
  skips `docker login ghcr.io`, so it has no GHCR auth at all.
  Rewrite the comment around the push / workflow_dispatch split
  the code actually implements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): address Copilot review on registry cache + test compose

- tools/image_builder/__main__.py: comment in the registry-cache
  branch said the cache namespace was "picked from the build's tag
  list", but the implementation hardcodes
  ghcr.io/screenly/anthias-{service}. Rewrite the comment to
  describe what the code actually does and call out the hardcode
  so a future namespaces refactor doesn't silently break cache.
- docker-compose.test.yml: anthias-celery had its own `build:`
  block pointing at Dockerfile.test, claiming "reuses the test
  image", but compose builds a separate image for each service
  even with an identical context, defeating the dedup intent. Mirror
  the docker-compose.dev.yml pattern: pin anthias-test to an
  explicit `image: anthias-test:dev` tag and have anthias-celery
  reference the same tag with no `build:`. Also bind-mount the
  source into celery so it picks up code changes (matches
  anthias-test's existing volume).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(image-builder): read-only registry cache without --push

Per Copilot review: --cache-backend=registry previously tried to
push cache to ghcr.io/... regardless of --push, so a local invocation
without GHCR auth would fail mid-build with a confusing registry
error. Split the behavior:

- Reads (cache_from) are always set when registry mode is active —
  the anthias-* GHCR packages are public, so warm-starting off CI's
  cache without auth works and helps local dev.
- Writes (cache_to) only happen when --push is also set, since
  that's when the workflow has authenticated to GHCR. Without
  --push, log a yellow warning and skip cache_to.
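
The read/write split above can be sketched as follows. The function name, the returned dict shape, and `mode=max` are assumptions; the image reference pattern and the `cache_from` / `cache_to` names come from this PR's messages, and `type=registry,ref=...` is the standard buildx registry-cache syntax.

```python
import sys


def registry_cache_options(service: str, board: str, push: bool) -> dict[str, str]:
    """Always read from the registry cache; write only when pushing."""
    ref = f'ghcr.io/screenly/anthias-{service}:buildcache-{board}'
    options = {'cache_from': f'type=registry,ref={ref}'}
    if push:
        # Writes need GHCR auth, which only the --push path has.
        options['cache_to'] = f'type=registry,ref={ref},mode=max'
    else:
        print(
            'warning: registry cache is read-only without --push',
            file=sys.stderr,
        )
    return options
```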

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): set DJANGO_SETTINGS_MODULE in test image for celery worker

Per Copilot review on #2776 (suppressed-due-to-low-confidence note,
but the bug is real): docker-compose.test.yml runs the celery
worker from anthias-test:dev. celery_tasks.py calls django.setup()
at module import time, which needs DJANGO_SETTINGS_MODULE in the
environment. The pre-refactor Dockerfile.celery.j2 set it
explicitly; this PR moved that ENV to Dockerfile.server.j2 only,
so the production celery (running on the server image) is fine but
the test celery would have crashed with ImproperlyConfigured.

Set the same ENV in Dockerfile.test.j2. Server and test images
both ship a usable Django environment for any process that imports
anthias_django.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:21:43 +01:00
Viktor Petersson
4333fffafa refactor(messaging): replace ZMQ with Redis for all viewer signalling, drop pyzmq (#2760)
* refactor(messaging): replace ZMQ pub/sub with Redis for server→viewer commands

Server-to-viewer command bus moves off pyzmq onto Redis pub/sub on the
'anthias.viewer' channel, since Redis is already the broker for Celery
and the channel layer for Django Channels — no reason to run a second
message bus.

- settings.ZmqPublisher → settings.ViewerPublisher (redis.publish).
- viewer/zmq.py → viewer/messaging.py with ViewerSubscriber backed by
  redis.pubsub(); the two ZmqSubscriber threads in viewer.main collapse
  into one, since both former publishers (anthias-server and the
  host-side wifi-connect script) now fan into the same Redis channel.
- viewer-subscriber-ready gating preserved: set after subscribe()
  returns, same semantics as before.
- ZmqConsumer / ZmqCollector (viewer→server reply path) and pyzmq itself
  are intentionally left in place; PR2 migrates the reply bus and PR3
  removes pyzmq + libzmq from the dep tree and Dockerfiles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: publish host-side wifi-connect messages via Redis, not ZMQ

The captive-portal flow (`setup_wifi`, `show_splash`) used to publish on
ZMQ port 10001 from the host, with a second ZmqSubscriber inside the
viewer connected to host.docker.internal:10001 picking it up. The
previous commit collapsed the viewer down to a single Redis-backed
subscriber, so this script's ZMQ publishes were going nowhere.

Switch the script to redis.publish() against the same anthias.viewer
channel. The Redis client is already wired here for the
viewer-subscriber-ready gate, and the wifi-connect container runs in
network_mode: host, so loopback to redis on 127.0.0.1:6379 (already
exposed via the redis service's port mapping) keeps working unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(messaging): replace ZMQ reply bus with Redis BLPOP + correlation IDs

Drops the second ZMQ leg — the viewer→server reply path — in favor of
Redis BLPOP keyed by a UUID correlation ID. Same Redis instance that PR1
moved the command bus onto, so the entire viewer messaging path now
runs on Redis.

Wire format extends the existing 'command&parameter' encoding: the
'current_asset_id' command (currently the only request-reply command)
now carries the correlation ID in the parameter slot, and the viewer
LPUSHes its JSON reply onto 'anthias.reply.<corr-id>' (with a 30s
EXPIRE so unread replies don't accumulate). The server BLPOPs that key.
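
The wire format above can be sketched as follows. The helper names, the
choice to omit '&' when there is no parameter, and the JSON reply shape
are assumptions for illustration; the '&' separator, the
'anthias.reply.' key prefix, and the 30s expiry come from this commit
message.

```python
import json
import uuid

REPLY_KEY_PREFIX = 'anthias.reply.'
REPLY_TTL_SECONDS = 30  # EXPIRE applied to the reply list


def encode_command(command: str, parameter: str = '') -> str:
    """Join command and parameter with '&' (assumed: bare command if empty)."""
    return f'{command}&{parameter}' if parameter else command


def decode_command(message: str) -> tuple[str, str]:
    """Split on the first '&' only, so a parameter may itself contain '&'."""
    command, _, parameter = message.partition('&')
    return command, parameter


def reply_key(correlation_id: str) -> str:
    return f'{REPLY_KEY_PREFIX}{correlation_id}'


# Server side: mint a correlation ID and put it in the parameter slot.
corr_id = str(uuid.uuid4())
wire = encode_command('current_asset_id', corr_id)

# Viewer side: recover the ID and derive the list to LPUSH the reply onto.
command, received_id = decode_command(wire)
key = reply_key(received_id)
payload = json.dumps({'current_asset_id': None})  # assumed reply shape
```

Since each request gets its own key, two concurrent callers can never
pop each other's reply.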

This also fixes a latent correctness bug: ZmqCollector had no
correlation, so concurrent /v1 ViewerCurrentAsset callers could
mismatch replies. That hazard was masked today by uvicorn running
single-worker; with Redis + correlation IDs, the reply path is now safe
across concurrent callers.

- settings.ZmqConsumer / ZmqCollector → settings.ReplySender /
  ReplyCollector (BLPOP). 'import zmq' drops out — pyzmq itself is
  removed in the next commit.
- lib.errors.ZmqCollectorTimeoutError → ReplyTimeoutError (the only
  catch site is implicit — it bubbles to a 500 — so the rename is
  mechanical).
- viewer/__init__.py: send_current_asset_id_to_server takes a
  correlation ID and uses ReplySender. The 'current_asset_id' command
  handler in the dispatch table threads the parameter (now the corr ID)
  into the function call.
- api/views/v1.py ViewerCurrentAssetViewV1: generates a UUID, sends it
  with the command, BLPOPs on it.
- api/tests/test_v1_endpoints.py: ZmqCollector mock → ReplyCollector;
  side_effect signature relaxed to '*_' since recv_json now takes two
  positional args (corr, timeout_ms).
- stubs/redis-stubs/client.pyi: add rpush() and blpop() narrowed to
  decode_responses=True return shapes (the rest of the stub follows the
  same convention).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: drop pyzmq + libzmq, finalize ZMQ→Redis migration

With both legs of the viewer signalling path on Redis (PR1: command
bus, PR2: reply bus), the pyzmq runtime dependency and the libzmq*
build deps are no longer used.

- pyproject.toml: remove pyzmq==23.2.1 from server, viewer,
  wifi-connect, and mypy dep groups (4 places).
- uv.lock: regenerated; pyzmq and its transitive dependency `py` drop out.
- tools/image_builder/{__main__,utils}.py: remove libzmq3-dev /
  libzmq5-dev / libzmq5 from the base apt list and from the viewer
  context's apt list. docker/uv-builder.j2 likewise drops libzmq3-dev
  from both the prebuilt-uv branch and the pip-fallback branch (32-bit
  ARM). The rendered docker/Dockerfile.* artifacts are gitignored, so
  no committed Dockerfile churn here — they regenerate cleanly via
  `python -m tools.image_builder --dockerfiles-only`.
- send_zmq_message.py → send_viewer_message.py. The script already
  publishes via Redis (fixed in the PR1 follow-up); rename + update
  callers (bin/start_wifi_connect.sh, docker/Dockerfile.wifi-connect.j2)
  now that the ZMQ name is misleading.
- bin/start_server.sh: drop the stale "single-worker because
  ZmqPublisher binds 10001" comment. The publisher is now a Redis
  client — no port bind, multi-worker is safe whenever the operator
  wants to opt in (not changed in this PR).
- CLAUDE.md: update the architecture description (ZMQ ports 10001 /
  5558 are gone, Redis carries the viewer signalling traffic now).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: post-merge cleanup — re-flow ruff fmt + drop stale ZMQ refs

Three small clean-ups discovered while running CI locally after the
master merge (41d7a80a):

* `api/tests/test_v1_endpoints.py`: master added the ViewerPublisher
  mock decorator on a single >79-char line. Our branch tightened ruff
  via the v2 test sweep, so `ruff format --check` now flags it. Wrap
  it like every other long mock.patch call in this file.
* `docs/d2/anthias-diagram-overview.d2`: the server↔viewer edge label
  still said "ZMQ + asset fetches"; the migration finished in a9be1d3.
  Update to "Redis pub/sub + asset fetches" so the diagram matches
  CLAUDE.md's architecture description.
* `send_viewer_message.py`: stray "Specify the ZeroMQ message" help
  text on the `--action` flag. The script publishes via redis now;
  reword to be transport-neutral.

No production code touched. Verified locally: ruff check, ruff
format --check, mypy, eslint, prettier, bun test, the 107-test
Python unit suite, and the 12-test integration suite all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: address Copilot feedback on PR #2760

Three line-level review comments:

* `viewer/__init__.py` / `settings.py` — `send_current_asset_id_to_server`
  was creating a fresh `ReplySender()` (and a fresh `redis.Redis` client
  + connection pool) on every `current_asset_id` request. Reuse the
  process-wide `r` instead: `ReplySender.__init__` now takes the
  caller's redis connection, and the viewer constructs a single
  `reply_sender = ReplySender(r)` at module init.

* `viewer/messaging.py` — `ViewerSubscriber.run()` had no
  reconnect/retry: a transient redis blip during `subscribe()` or
  `listen()` killed the thread silently, leaving the viewer unable to
  receive any commands until the process restarted, and
  `viewer-subscriber-ready` could be left stuck at 1. Wrap the loop in
  exponential-backoff reconnect (1s → 30s cap) on
  `redis.ConnectionError`, and clear the readiness flag while
  disconnected so wifi-connect-style readiness-gated publishers wait
  instead of dropping messages on the floor. Set readiness only after
  `subscribe()` returns successfully.

* `settings.py` — `ReplyCollector.recv_json` rounded `timeout_ms <= 0`
  up to a 1-second BLPOP, breaking the old `ZmqCollector` contract
  where `timeout=0` was a non-blocking poll. Branch on `<= 0` and use
  `LPOP` (which the redis stub now declares); only round up for
  positive timeouts.
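
The `timeout_ms` branching from the last bullet can be sketched like
this. `FakeRedis` is an in-memory stand-in so the example runs without a
server, and returning `None` on timeout is a simplification (the real
collector raises `ReplyTimeoutError`). The sketch assumes whole-second
BLPOP timeouts, where 0 means "block indefinitely".

```python
import json
import math
from typing import Any, Optional


class FakeRedis:
    """In-memory stand-in exposing only lpop() and blpop()."""

    def __init__(self) -> None:
        self.lists: dict[str, list[str]] = {}
        self.calls: list[tuple[str, Any]] = []

    def lpop(self, key: str) -> Optional[str]:
        self.calls.append(('lpop', None))
        items = self.lists.get(key, [])
        return items.pop(0) if items else None

    def blpop(self, key: str, timeout: int) -> Optional[tuple[str, str]]:
        # Real BLPOP blocks for up to `timeout` seconds; the fake returns
        # immediately so the example stays fast.
        self.calls.append(('blpop', timeout))
        items = self.lists.get(key, [])
        return (key, items.pop(0)) if items else None


def recv_json(client: Any, key: str, timeout_ms: int) -> Optional[dict]:
    if timeout_ms <= 0:
        # Non-blocking poll, matching the old timeout=0 contract.
        raw = client.lpop(key)
    else:
        # Round up so 1..999 ms never becomes 0, which would block forever.
        seconds = math.ceil(timeout_ms / 1000)
        result = client.blpop(key, timeout=seconds)
        raw = result[1] if result else None
    return json.loads(raw) if raw is not None else None
```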

Also add the SonarQube `# NOSONAR` rationale on the two pre-existing
hotspots flagged in the PR diff (loopback HTTP for the captive-portal
page; the well-known wifi-connect AP gateway IP), and drop a redundant
`continue` at the end of the readiness wait loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: address Copilot follow-up feedback on PR #2760

Two new comments after the previous resolution round:

* `stubs/redis-stubs/client.pyi`: `Redis.lpop()`'s real return type
  depends on `count` — single value with no count, list with count.
  The previous stub always declared `str | None`, so a future
  `lpop(key, count=N)` call would silently typecheck against the wrong
  shape. Replace with two `@overload`s: no-count returns `str | None`
  (the form Anthias actually uses), explicit-int count returns
  `list[str] | None`. Also add `PubSub.close()` to the stub so the
  finally-block below typechecks.

* `viewer/messaging.py`: `ViewerSubscriber.run()` was creating a fresh
  PubSub on every reconnect attempt without closing the previous one.
  A flapping redis container would accumulate dead PubSub objects each
  holding a connection from the pool until GC reclaimed it. Wrap the
  per-iteration PubSub in a `finally: pubsub.close()` so the socket is
  released deterministically on every disconnect and on every clean
  exit from `_consume()`. Swallow `ConnectionError` from `close()`
  itself — the underlying socket is already gone in the case we care
  about.
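
A sketch of the reconnect loop described in this and the previous review
round: exponential backoff from 1s to a 30s cap, with the per-iteration
PubSub always closed. The function signatures and the `FlakyPubSub` fake
are hypothetical, and the built-in `ConnectionError` stands in for
`redis.ConnectionError` so the example runs without redis-py installed.

```python
import itertools
from typing import Any, Callable, Iterator

CHANNEL = 'anthias.viewer'


def backoff_delays(initial: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield 1, 2, 4, ... seconds, never exceeding the cap."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, cap)


def run_subscriber(
    make_pubsub: Callable[[], Any],
    consume: Callable[[Any], None],
    sleep: Callable[[float], None],
    should_stop: Callable[[], bool],
) -> None:
    delays = backoff_delays()
    while not should_stop():
        pubsub = make_pubsub()
        failed = False
        try:
            pubsub.subscribe(CHANNEL)
            delays = backoff_delays()  # reset backoff after a good subscribe
            consume(pubsub)
        except ConnectionError:
            failed = True
        finally:
            try:
                pubsub.close()  # release the socket on every exit path
            except ConnectionError:
                pass  # socket already gone
        if failed:
            sleep(next(delays))


class FlakyPubSub:
    """Fake PubSub: the first `failures` instances fail to subscribe."""

    created = 0
    closed = 0

    def __init__(self, failures: int) -> None:
        FlakyPubSub.created += 1
        self._fail = FlakyPubSub.created <= failures

    def subscribe(self, channel: str) -> None:
        if self._fail:
            raise ConnectionError('redis unavailable')

    def close(self) -> None:
        FlakyPubSub.closed += 1


first_delays = list(itertools.islice(backoff_delays(), 7))
slept: list[float] = []
done: list[bool] = []
run_subscriber(
    make_pubsub=lambda: FlakyPubSub(failures=3),
    consume=lambda pubsub: done.append(True),  # pretend listen() ended cleanly
    sleep=slept.append,
    should_stop=lambda: bool(done),
)
```

Closing in `finally` before sleeping means a flapping Redis never leaves
more than one PubSub connection open at a time.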

Drive-by: the docstring referenced `setup_wifi` and the wifi-connect
readiness handshake, both of which #2763 deleted. Update to mention
the actual surviving commands and note that no consumer reads
`viewer-subscriber-ready` today (kept as a generic readiness signal).

Verified: ruff, ruff format, mypy (strict, 97 files), the 103-test
unit suite — all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: address Copilot's third round of feedback on PR #2760

Three more comments after the previous resolution round:

* `viewer/__init__.py` — `send_current_asset_id_to_server()` derefs
  `scheduler.current_asset_id`, but `subscriber.start()` runs before
  `scheduler = Scheduler()` in `main()`. A `current_asset_id` request
  arriving during `wait_for_server()` would `AttributeError` and the
  caller would see a 2s timeout instead of a useful answer. Guard:
  if scheduler is None, reply with `current_asset_id: None` — the v1
  endpoint already treats a falsy id as "no current asset" and returns
  `[]`, which is the correct semantic answer pre-init. Silently
  dropping the reply is not an option: the caller would block for the
  full recv timeout.

  Other scheduler-touching handlers (`next`, `previous`, `asset`,
  `stop`) have the same pre-existing race, but it's identical to the
  ZMQ-era behavior and out of scope for this messaging migration.

* `api/tests/test_v1_endpoints.py` — `test_viewer_current_asset` only
  checked `send_to_viewer` call count, leaving the new corr-ID round
  trip untested. A future refactor that swapped sides of the UUID
  would deadlock the v1 endpoint until the recv timeout, which the
  test would fail to catch. Switch the `recv_json` mock from a
  side_effect lambda to a `MagicMock` so we can introspect its args,
  then assert the corr-ID extracted from the published command
  matches the corr-ID passed to `recv_json`.

* `stubs/redis-stubs/client.pyi` — the comment said "don't pretend to
  support `count`" but I'd added a `@overload` for the count form
  anyway in the previous round. Drop the count overload to match the
  comment's stated intent: Anthias only uses the no-count form, and a
  future caller adding `count=N` will get a clear "no overload
  matches" instead of a stub silently agreeing with the wrong shape.
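
The pre-init guard from the first bullet can be sketched as below. The
`Scheduler` stand-in and the `build_current_asset_reply` helper are
hypothetical; the point is only that a reply is always produced, even
before the scheduler exists.

```python
import json
from typing import Optional


class Scheduler:
    """Stand-in with just the attribute the handler reads."""

    def __init__(self, current_asset_id: Optional[str] = None) -> None:
        self.current_asset_id = current_asset_id


def build_current_asset_reply(scheduler: Optional[Scheduler]) -> str:
    # Before main() has constructed the scheduler, answer with None rather
    # than raising AttributeError and leaving the caller to time out.
    asset_id = scheduler.current_asset_id if scheduler is not None else None
    return json.dumps({'current_asset_id': asset_id})


pre_init = build_current_asset_reply(None)
running = build_current_asset_reply(Scheduler('abc123'))
```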

Verified: ruff, ruff format, strict mypy (97 files), 9-test v1 suite,
103-test full unit suite — all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 12:58:08 +01:00
Viktor Petersson
7476a43b27 chore: drop wifi-connect service end-to-end (#2763)
The anthias-wifi-connect captive-portal helper has been pinned to
balena-os/wifi-connect v4.11.1 (Feb 2023) for ~3 years; upstream
dropped the ARMv6 binary back in v4.4.6 so Pi 1 was silently
shipping a wifi-connect container with no binary inside, and the
host script `bin/start_wifi_connect.sh` had a `set -e`-vs-`$?` bug
that made the captive-portal branch unreachable. nmcli/nmtui covers
the supported install path.

Removing the whole service rather than bumping it: there are no
production users left and bumping would require rewriting both the
architecture-to-asset matcher (Rust target triples now) and the
unzip step (tar.gz now).

Removed
- Container build:  docker/Dockerfile.wifi-connect[.j2],
                    `wifi-connect` group in pyproject.toml + uv.lock,
                    `wifi-connect` entry in image_builder SERVICES,
                    `get_wifi_connect_context()`,
                    `wifi-connect` cell in CI matrix +
                    docker-build.yaml retag SERVICES list.
- Compose:          `anthias-wifi-connect` service from prod / balena
                    / balena-dev templates, plus the now-unused
                    `host.docker.internal:host-gateway` extra_hosts
                    on `anthias-viewer`.
- Helper scripts:   bin/start_wifi_connect.sh,
                    start_wifi_connect_service.sh,
                    send_zmq_message.py.
- Viewer plumbing:  the second ZmqSubscriber bound to
                    host.docker.internal:10001, the
                    `viewer-subscriber-ready` Redis flag, the
                    `setup_wifi` / `show_splash` / `show_hotspot_page`
                    handlers and their entries in the `commands`
                    dict, the `mq_data` / `load_screen_displayed`
                    globals, and the now-unused `redis_connection`
                    parameter on `ZmqSubscriber`.
- Server:           `/hotspot` URL route, `views_files.hotspot`,
                    `HOTSPOT_FILE` / `INITIALIZED_FLAG` constants,
                    `HotspotViewTest`, templates/hotspot.html,
                    static/img/wifi-off.svg, /data/hotspot dir
                    creation in bin/start_viewer.sh.
- Host:             sudoers entry for /usr/local/sbin/wifi-connect,
                    ansible/roles/network template + vars.
- Docs:             docs/wifi-setup.md, the Wi-Fi Setup section and
                    container row in docs/README.md, the
                    wifi-connect.service line and stale
                    `initialized` flag bullet in
                    docs/developer-documentation.md, the
                    "Reset Wi-Fi → hotspot page" step in
                    docs/qa-checklist.md.

Migration paths kept (intentional)
- bin/upgrade_containers.sh now runs `docker rm -f` on
  anthias-wifi-connect and srly-ose-wifi-connect alongside the
  existing nginx/websocket cleanup, so on next pull devices drop
  the stale container.
- ansible/roles/network/tasks/main.yml stops, disables, and
  removes /etc/systemd/system/wifi-connect.service, then notifies
  a new `Reload systemd daemon` handler. Idempotent on fresh
  installs.

Verified
- `ruff check` + `ruff format --check`: clean.
- Strict `mypy .` (django-stubs + drf-stubs plugins): 97 files,
  0 issues.
- `ansible-lint ansible/`: passes at the `production` profile.
- All three compose templates render and parse via
  `docker compose config`.
- `python -m tools.image_builder --dockerfiles-only` generates
  the remaining 5 services with no Dockerfile.wifi-connect
  produced.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 09:40:41 +01:00
Viktor Petersson
f421130b24 refactor(server): collapse nginx + websocket containers into uvicorn (#2757)
* refactor(server): collapse nginx + websocket containers into uvicorn

Replace the nginx + gunicorn + gevent-websocket trio with a single
uvicorn ASGI server inside `anthias-server`:

* HTTP, /static/, /anthias_assets/, /static_with_mime/, and /hotspot
  are now served from Django (WhiteNoise + small file-serving views in
  `anthias_app/views_files.py` that re-implement nginx's IP allowlists).
* WebSockets move from a separate gevent process talking ZMQ to Django
  Channels with a Redis-backed channel layer, fanned out by celery via
  `channel_layer.group_send`.
* TLS termination is handled by uvicorn directly when SSL_CERTFILE /
  SSL_KEYFILE are set; `bin/enable_ssl.sh` now writes a compose
  override (no longer ansible) and a companion `bin/disable_ssl.sh`
  removes it. Cert + key live under `~/.anthias/ssl/`.
* `bin/upgrade_containers.sh` removes the legacy `anthias-nginx` and
  `anthias-websocket` containers on upgrade so they don't linger.
* Drop `gunicorn`, `gevent`, `gevent-websocket`, and the `websocket`
  uv group from `pyproject.toml`; add `channels`, `channels-redis`,
  `daphne`, `uvicorn[standard]`, and `whitenoise`.

Notes on hardening: `--forwarded-allow-ips` defaults to off so the IP
allowlist can't be bypassed via a spoofed `X-Forwarded-For`; operators
behind a reverse proxy can opt in via the `FORWARDED_ALLOW_IPS` env
var. Backup uploads previously sized by nginx's `client_max_body_size
4G` are preserved by setting `DATA_UPLOAD_MAX_MEMORY_SIZE = None`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address review feedback on uvicorn migration

* Drop USE_X_FORWARDED_HOST (inconsistent with the deliberate
  --forwarded-allow-ips hardening; without a proxy, X-Forwarded-Host is
  client-controlled).
* Remove daphne — uvicorn runs production and the test environment now
  uses it too (bin/prepare_test_environment.sh).
* Replace _safe_join's parents-membership check with Path.is_relative_to.
* Drop AllowedHostsOriginValidator wrapper (no-op under ALLOWED_HOSTS=['*'])
  and document where to put it back if hosts are ever locked down.
* Rename DOCKER_CIDR → DOCKER_BRIDGE_CIDR with a comment that this is
  defense-in-depth, not a real perimeter (LAN clients via the published
  port also appear in 172.16/12).
* Add anthias_app/tests.py covering the IP allowlists, mime override,
  hotspot gating, and traversal/symlink rejection in _safe_join (17 tests).
* Note the single-worker ZmqPublisher bind constraint in start_server.sh
  so a future scale-up doesn't EADDRINUSE on tcp://0.0.0.0:10001.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(security): clear SonarCloud hotspots on uvicorn migration

* Restrict views_files.anthias_assets / static_with_mime / hotspot to
  GET via @require_GET (Sonar S3752, x3): they are read-only file
  servers and should reject other methods at the view boundary.
* Mark RFC1918 / Docker-bridge CIDR literals as NOSONAR S1313 (x4):
  they are intentional, well-known private network ranges.
* Mark `http://*` in CSRF_TRUSTED_ORIGINS as NOSONAR S5332 with a
  comment explaining devices ship over HTTP and operators opt into TLS
  via bin/enable_ssl.sh.

Existing 17 view tests continue to pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: clear remaining static-analysis findings

* ruff format: reformat the previous commit's tests.py so CI's
  `ruff format --check` now passes.
* CodeQL py/path-injection on _safe_join: rewrite using
  os.path.realpath + os.path.commonpath, which CodeQL recognises as a
  sanitiser for path-injection sinks. Behaviour is identical to the
  Path.is_relative_to version (both reject `..` and symlink escapes;
  the 17 tests in anthias_app/tests.py still pass).
* SonarCloud NOSONAR markers: switch to the codebase's bare `# NOSONAR`
  form (matches host_agent.py and tests/test_backup_helper.py); the
  earlier `# NOSONAR <rule>` form was not being honoured.
* Centralise the test-fixture IPs in module-level constants so S1313
  is suppressed in one place rather than at every callsite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(security): inline path-injection check in views

CodeQL only treats os.path.commonpath as a sanitiser when the check
sits in the same function as the file-system sink — calling
_safe_join() from a separate function still leaves the open()/isfile()
sinks tainted (4 alerts on PR #2757).

Repeat the realpath + commonpath check inline in anthias_assets and
static_with_mime so CodeQL can prove the post-check path stays under
the configured root. _safe_join is kept for the SafeJoinTest unit
tests and as a documented helper.

Existing 17 tests in anthias_app/tests.py continue to pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(security): use realpath+startswith path sanitiser for CodeQL

CodeQL's path-injection model recognises the canonical
`realpath(...).startswith(base + sep)` pattern but apparently not
`os.path.commonpath(...) == root` in this codepath. Switch the inline
check in anthias_assets and static_with_mime to startswith so the
analyser can prove the post-check path stays under the configured
root.

Behaviour is identical: traversal and symlink-escape still 404
(verified by SafeJoinTest + view tests).
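
A minimal sketch of the `realpath(...).startswith(base + sep)` pattern
named above, assuming a POSIX filesystem where symlinks can be created.
The function name and the temporary directory layout are illustrative;
the real views inline the check and return a 404 rather than `None`.

```python
import os
import tempfile
from typing import Optional


def resolve_under_root(root: str, requested: str) -> Optional[str]:
    """Return the resolved path if it stays under root, else None."""
    base = os.path.realpath(root)
    # realpath collapses '..' and follows symlinks before the prefix check.
    candidate = os.path.realpath(os.path.join(base, requested))
    if not candidate.startswith(base + os.sep):
        return None
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, 'assets')
    outside = os.path.join(tmp, 'outside')
    os.makedirs(root)
    os.makedirs(outside)
    open(os.path.join(root, 'a.txt'), 'w').close()
    open(os.path.join(outside, 'secret.txt'), 'w').close()
    os.symlink(outside, os.path.join(root, 'link'))  # escape attempt

    results = {
        'inside': resolve_under_root(root, 'a.txt') is not None,
        'dotdot': resolve_under_root(root, '../outside/secret.txt'),
        'symlink': resolve_under_root(root, 'link/secret.txt'),
    }
```

Appending `os.sep` to the base matters: without it, a sibling directory
such as `assets_backup` would pass a bare prefix check against `assets`.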

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Copilot review feedback

* lib/utils.py imported channels/asgiref at module level. The viewer
  container imports lib.utils via viewer/__init__.py but its uv
  dependency group does not ship channels, so the viewer would
  ImportError on startup. Move the channels imports into
  YoutubeDownloadThread.run() (server/celery-only path) so lib.utils
  remains importable from the viewer.
* Drop the unused _safe_join() helper and its three SafeJoinTest
  cases — the views inline a realpath+startswith sanitiser (CodeQL
  needs the check in the same function as the sink), and the helper
  was only being exercised in isolation. Add an equivalent
  symlink-escape test against anthias_assets so the actual code path
  used by the views is covered.
* Refresh the anthias_django/settings.py docstring + Django doc URLs
  from /3.2/ → /4.2/ to match the pinned Django version.

15 view tests pass (was 17 — lost 3 SafeJoinTest + gained 1 symlink
test against the real view).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: refresh architecture diagram for uvicorn migration

Drop the anthias-nginx and anthias-websocket nodes (and their edges)
from docs/d2/anthias-diagram-overview.d2 — the user now talks
directly to anthias-server (uvicorn handling HTTP + /ws), Celery
fans out asset-update events through the Redis-backed Channels
layer, and the viewer fetches media from anthias-server over HTTP.

Regenerate the SVG with d2 v0.7.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Copilot SSL + CSRF / WS-origin feedback

* Dual uvicorn listeners when SSL is enabled (Copilot #1, #2). HTTP on
  $HTTP_PORT (default 8080) for inter-container traffic — viewer +
  webview hit anthias-server over plain HTTP on the Docker network and
  cannot validate uvicorn's self-signed cert. HTTPS on $HTTPS_PORT
  (default 8443) for external clients. bin/enable_ssl.sh now appends
  443:8443 to the compose ports list (instead of using `!override` to
  swap 80:8080 for 443:8080), so port 80 stays available for backward
  compatibility and the Docker-network HTTP port keeps working.
* Drop CSRF_TRUSTED_ORIGINS = ['http://*', 'https://*'] (Copilot #3).
  Verified via Django shell: those leading wildcards are ignored by
  Django 4.2 (only subdomain wildcards like https://*.example.com are
  honoured), so the setting was a no-op. Same-origin POSTs still pass
  through Django's built-in Origin/Host check.
* Re-add channels.security.websocket.AllowedHostsOriginValidator to
  the WebSocket router (Copilot #5). Currently a no-op under
  ALLOWED_HOSTS=['*'], but tightening ALLOWED_HOSTS later will now
  also tighten /ws.

Smoke test (dev + SSL override):
- HTTP  http://localhost:8000/      -> 200
- HTTPS https://localhost:8443/     -> 200
- HTTP  http://localhost:8443/      -> 000 (TLS-only, expected)
- internal http://localhost:8080/   -> 200
- 15 view tests still pass.

Note: Copilot #4 (Docker-bridge CIDR is bypassable via the published
port) is documented in views_files.py as defense-in-depth and matches
the original nginx posture; switching to app-layer auth is out of
scope for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(ssl): switch from in-uvicorn TLS to a Caddy sidecar

The previous SSL implementation gave anthias-server two uvicorn
listeners (HTTP + HTTPS) so the viewer/webview could keep talking
plain HTTP over the Docker network while external clients got TLS.
That dual-listener dance is non-zero overhead and complicates signal
handling. Switch to the standard reverse-proxy pattern instead.

When SSL is enabled by bin/enable_ssl.sh:

* anthias-server stays a single uvicorn listener on plain HTTP 8080
  (no SSL_CERTFILE/SSL_KEYFILE knobs, no dual-port logic).
* A Caddy sidecar (caddy:2-alpine, only present when the override is
  installed) terminates TLS on host port 443, redirects 80→443, and
  reverse-proxies to anthias-server:8080 — so X-Forwarded-Proto /
  X-Forwarded-For are forwarded as-is by Caddy.
* The override removes anthias-server's external port mapping
  (`ports: !override []`), so all external traffic must enter through
  Caddy and the IP allowlists in views_files.py see the original LAN
  client IP rather than the docker-bridge gateway. Inter-container
  traffic is unchanged.
* `FORWARDED_ALLOW_IPS=*` is set on anthias-server in the override —
  safe because anthias-server is no longer reachable from outside the
  Docker network — and `SECURE_PROXY_SSL_HEADER` is added in Django
  settings so request.is_secure() returns True for HTTPS callers.
* When SSL is *not* enabled there is zero new container, zero new
  config — the base compose file is untouched and Caddy isn't pulled
  or run.

bin/disable_ssl.sh now also removes the anthias-caddy container
before deleting the override, so HTTPS-only state is fully reversed.

Smoke-tested with a temporary Caddy override:
- HTTPS via Caddy:        200
- HTTP via Caddy:         301 → https://...
- Direct anthias-server:  refused (port mapping dropped by override)
- WebSocket upgrade:      101 Switching Protocols
- request.is_secure() with X-Forwarded-Proto=https: True
- 15 anthias_app view tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(views_files): document IP-allowlist threat model

Spell out exactly when the docker-bridge CIDR check is and isn't a
real perimeter:

* No-SSL default: anthias-server is published as 80:8080, so requests
  arrive with REMOTE_ADDR set to the docker bridge gateway (172.x) and
  LAN clients aren't actually excluded. Trying to plug the gap with
  auth would be security theatre — credentials would travel in
  plaintext over the LAN anyway.
* SSL via the Caddy sidecar: Caddy terminates TLS, rewrites
  X-Forwarded-For, uvicorn honours it (FORWARDED_ALLOW_IPS=*), and the
  check sees the real client IP — so the bypass is closed for any
  deployment that actually cares about confidentiality.
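
The two cases above can be illustrated with the standard `ipaddress`
module. The `is_allowed` helper and the sample addresses are
hypothetical; only the 172.16/12 range and the `DOCKER_BRIDGE_CIDR` name
come from this PR.

```python
import ipaddress

DOCKER_BRIDGE_CIDR = ipaddress.ip_network('172.16.0.0/12')


def is_allowed(remote_addr: str) -> bool:
    """True when the peer address falls inside the Docker bridge range."""
    try:
        return ipaddress.ip_address(remote_addr) in DOCKER_BRIDGE_CIDR
    except ValueError:
        return False  # malformed address: reject


# No-SSL default: a LAN client entering through the published port shows up
# as the bridge gateway, so the check passes even though the client is remote.
via_published_port = is_allowed('172.18.0.1')

# Caddy sidecar: the forwarded client IP is the real LAN address, so the
# same check now rejects it.
via_caddy = is_allowed('192.168.1.20')
```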

This is documentation only; no behavioural change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ssl): add --domain (auto Let's Encrypt) + drop openssl shim

bin/enable_ssl.sh now has three modes instead of two:

* Default (no args) — Caddy issues per-SNI certs lazily from its
  built-in local CA via `tls internal { on_demand }`. Drops the
  openssl self-signed-cert generation step entirely; Caddy persists
  the CA in the anthias-caddy-data volume and rotates leaf certs
  itself. Browsers still warn (CA is local) but no openssl/cert
  hygiene is needed on the host.

* `--domain example.com [--email you@example.com] [--staging]` —
  Caddy auto-issues + renews from Let's Encrypt. Caddy auto-creates
  the HTTP→HTTPS redirect for hostname sites. Use `--staging` to point
  at the ACME staging endpoint while testing, so the production rate
  limits aren't burned.

* `--cert /path/to/cert.pem --key /path/to/key.pem [--domain ...]` —
  unchanged: bring your own cert, Caddy serves it as-is with
  `auto_https off`.

Verified:
- All three Caddyfiles pass `caddy validate`.
- Default mode end-to-end: HTTPS=200 with cert from "Caddy Local
  Authority - ECC Intermediate", per-SNI SANs (DNS:localhost,
  IP Address:192.168.99.99 etc.), HTTP→HTTPS=301, /ws upgrade=101,
  anthias-server's external port mapping is dropped so direct access
  is refused.

Docs (CLAUDE.md, docs/README.md, docs/developer-documentation.md)
updated to describe the Caddy sidecar instead of in-uvicorn TLS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address self-review findings on PR #2757

* Gate SECURE_PROXY_SSL_HEADER on FORWARDED_ALLOW_IPS
  (anthias_django/settings.py): without the gate, a client on a
  plain-HTTP deploy could send `X-Forwarded-Proto: https` and flip
  `request.is_secure()`. Django reads the header from META directly,
  independent of uvicorn's --proxy-headers flag, so the previous
  unconditional setting was actually exploitable in non-SSL mode
  (secure-cookied sessions would drop on the next plain-HTTP request,
  redirects would point at https:// URLs that don't exist).

  Verified live: non-SSL → SECURE_PROXY_SSL_HEADER is None and
  is_secure() with spoofed XFP=https returns False; SSL via Caddy
  override → header is set and is_secure() returns True.

* Replace the isfile() pre-check + open() in anthias_assets and
  static_with_mime with a try/except FileNotFoundError around open()
  (anthias_app/views_files.py). Eliminates a (tiny but real) TOCTOU
  window between the stat and the open. IsADirectoryError handled
  too, since `realpath('/dir/')` resolves to the directory and open()
  would otherwise 500.

* Comment FORWARDED_ALLOW_IPS=* assumption in bin/enable_ssl.sh: the
  wildcard is only safe because the override drops anthias-server's
  external port mapping, so any future edit that re-adds a host:port
  publication has to either tighten the wildcard to Caddy's IP/CIDR
  or unset it.

* Replace ANSI-C escape sequences in the Caddyfile generator with
  plain multi-line strings. `read -r -d ''` was the first attempt
  but it strips trailing newlines, which collapsed `auto_https off`
  onto the same line as `}` in cert mode. Multi-line literals with
  echo "$VAR" are unambiguous and Caddy validates all three modes
  cleanly again.

* Add a docker-volume cleanup hint to bin/disable_ssl.sh: Caddy's
  local CA persists in anthias_anthias-caddy-data so an enable →
  disable → enable cycle reuses the same CA (intentional — browsers
  that trusted it stay trusted), and operators who want a fresh CA
  now have the exact `docker volume rm` command in the script's
  output.
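
A sketch of the `SECURE_PROXY_SSL_HEADER` gate from the first bullet. The
exact condition in `anthias_django/settings.py` is not shown in this
message, so treating any non-empty `FORWARDED_ALLOW_IPS` value as "a
trusted proxy is in front" is an assumption; the tuple is Django's
documented format for the setting.

```python
from typing import Mapping, Optional, Tuple


def secure_proxy_ssl_header(
    env: Mapping[str, str],
) -> Optional[Tuple[str, str]]:
    """Trust X-Forwarded-Proto only when a proxy has been configured."""
    if env.get('FORWARDED_ALLOW_IPS', '').strip():
        return ('HTTP_X_FORWARDED_PROTO', 'https')
    # Plain-HTTP deploy: leave the setting unset so a client-supplied
    # X-Forwarded-Proto header cannot flip request.is_secure().
    return None


non_ssl = secure_proxy_ssl_header({})
behind_caddy = secure_proxy_ssl_header({'FORWARDED_ALLOW_IPS': '*'})
```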

15 view tests still pass; default + SSL Caddyfiles still validate;
default + SSL endpoints still return 200 / 301 / 101 in smoke tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Copilot's host/MIME hardening feedback

Two security tightenings on top of the prior SECURE_PROXY_SSL_HEADER
gate (which Copilot flagged on a stale snapshot — that one's already
fixed in 07b784b9):

* `ALLOWED_HOSTS` is now driven by the `ALLOWED_HOSTS` env var, with
  `*` kept as the default so flexible LAN-by-IP / mDNS access still
  works out of the box. Operators on hardened LANs can opt into a
  strict allowlist (`ALLOWED_HOSTS=192.168.1.50,anthias.local,...`)
  to defend against DNS-rebinding without us guessing the right set
  of hostnames at install time. Verified the env override parses to
  `['192.168.1.50', 'anthias.local', 'localhost']`.

* `static_with_mime` now allowlists the `?mime=` query param against
  a small set of download-only types
  (`application/{gzip,octet-stream,x-gzip,x-tar,x-tgz,zip}`) instead
  of accepting whatever the caller sends. Closes the XSS footgun
  where `?mime=text/html` would have served a stored file as HTML.
  The frontend's only legitimate caller (the backup download) sends
  `application/x-tgz`, which is in the allowlist; anything else
  falls back to mimetypes.guess_type. Added
  `test_mime_override_rejects_html` to lock that behaviour in.
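
Both tightenings can be sketched in a few lines. The helper names and
the `application/octet-stream` last-resort default are assumptions; the
env var name, the `*` default, and the six allowlisted MIME types come
from the bullets above.

```python
import mimetypes
from typing import Mapping, Optional

SAFE_MIME_OVERRIDES = frozenset({
    'application/gzip',
    'application/octet-stream',
    'application/x-gzip',
    'application/x-tar',
    'application/x-tgz',
    'application/zip',
})


def parse_allowed_hosts(env: Mapping[str, str]) -> list[str]:
    """Comma-separated ALLOWED_HOSTS env var; '*' when unset or empty."""
    raw = env.get('ALLOWED_HOSTS', '*')
    hosts = [host.strip() for host in raw.split(',') if host.strip()]
    return hosts or ['*']


def content_type_for(filename: str, requested_mime: Optional[str]) -> str:
    """Honour ?mime= only for download-only types; otherwise guess."""
    if requested_mime in SAFE_MIME_OVERRIDES:
        return requested_mime
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or 'application/octet-stream'


hosts = parse_allowed_hosts(
    {'ALLOWED_HOSTS': '192.168.1.50,anthias.local,localhost'}
)
backup_type = content_type_for('backup.tar.gz', 'application/x-tgz')
spoofed_type = content_type_for('backup.tar.gz', 'text/html')
```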

16 view tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 12:51:40 +01:00
Viktor Petersson
3c96b541a1 refactor: rename legacy 'screenly' dirs to 'anthias' with auto-migration (#2753)
* refactor: rename legacy 'screenly' dirs to 'anthias' with auto-migration

For legacy reasons the host directories storing the cloned repo, user
assets, and config + DB still carried the old 'screenly' name. Rename
all three to their 'anthias' equivalents, plus the in-container paths,
the screenly.db / screenly.conf filenames, /tmp/screenly.watchdog,
/etc/sudoers.d/screenly_overrides, the ansible role, and the nginx URL
location. Existing installations are migrated automatically:

  ~/screenly/         -> ~/anthias/
  ~/screenly_assets/  -> ~/anthias_assets/
  ~/.screenly/        -> ~/.anthias/
    screenly.db   -> anthias.db
    screenly.conf -> anthias.conf  (paths rewritten in the body)
  /etc/sudoers.d/screenly_overrides -> /etc/sudoers.d/anthias_overrides

Migration is driven by two new helpers:

  - bin/migrate_legacy_paths.sh: idempotent host-side rename. Self-relocates
    if invoked from inside the dir being renamed. Rewrites both relative
    and absolute path values inside screenly.conf. Leaves dir-level
    back-compat symlinks at the old paths and file-level symlinks
    (screenly.db, screenly.conf) inside the migrated config dir so
    user automation / one-version downgrade still find familiar names.
  - bin/migrate_in_container_paths.sh: defensive /data/.screenly and
    /data/screenly_assets symlinks invoked from the container start
    scripts, in case an older docker-compose.yml is still mounting the
    legacy paths during a partial upgrade.
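
The idempotent rename with a back-compat symlink can be illustrated in
Python, although the real helper is a shell script. The function name,
return values, and the choice to raise when both paths exist are
assumptions for this sketch, which also assumes a POSIX filesystem.

```python
import os
import tempfile


def migrate_dir(old: str, new: str) -> str:
    """Rename old -> new once and leave a symlink at the old path."""
    if os.path.islink(old):
        return 'already-migrated'  # rerun: the symlink is already in place
    if not os.path.isdir(old):
        return 'nothing-to-do'  # fresh install: no legacy dir
    if os.path.exists(new):
        raise FileExistsError(f'both {old} and {new} exist')
    os.rename(old, new)
    os.symlink(new, old)  # old automation keeps finding the familiar path
    return 'migrated'


with tempfile.TemporaryDirectory() as home:
    old = os.path.join(home, '.screenly')
    new = os.path.join(home, '.anthias')
    os.makedirs(old)
    open(os.path.join(old, 'screenly.db'), 'w').close()

    outcomes = [migrate_dir(old, new), migrate_dir(old, new)]
    legacy_path_still_works = os.path.exists(os.path.join(old, 'screenly.db'))
    outcomes.append(
        migrate_dir(os.path.join(home, 'screenly'),
                    os.path.join(home, 'anthias'))
    )
```

Checking `islink` before `isdir` is what makes the rerun a no-op, since
`isdir` follows the symlink and would otherwise report the old path as a
directory again.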

Wired into bin/install.sh (renames ~/screenly before clone_repo, then
runs the in-repo helper after) and bin/upgrade_containers.sh (runs the
helper near the top before regenerating docker-compose.yml).

Out of scope (intentional): the screenly/anthias-* Docker Hub namespace,
the Screenly/Anthias GitHub repo URLs, the screenly_ose Balena fleet,
api.screenlyapp.com / apt.screenlyapp.com legacy URLs, and brand URLs
in docs.

Tests: added tests/test_migrate_legacy_paths.py (4 cases: full migration,
absolute-path conf rewrite, idempotent rerun, fresh-install no-op) and
tests/test_backup_helper.py::RecoverLegacyTarballTest (recover() still
accepts pre-rename .tar.gz backups). Ruff clean. All 6 new tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: apply ruff format to new test files

CI's `ruff format --check` flagged tests/test_backup_helper.py and
tests/test_migrate_legacy_paths.py. Reformatted; behaviour unchanged,
6/6 migration-related tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: suppress SonarCloud S5042 on write-mode tarfile.open in fixtures

The two new fixture-building calls in tests/test_backup_helper.py use
`tarfile.open(..., 'w:gz')` (write mode), which Sonar's python:S5042
rule flags as "expanding this archive file" without distinguishing
read from write. arcnames are hardcoded test inputs with no
path-traversal surface, so the warning is a false positive here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Copilot review feedback

- lib/backup_helper.py: harden recover() against tar path traversal
  (Zip Slip / CVE-2007-4559). New _safe_tar_member() rejects absolute
  paths, '..' components, non-regular-non-directory members
  (symlinks/hardlinks/devices), members outside the allowed top-level
  dirs, and any post-normalisation path that escapes $HOME. Iterates
  members manually instead of bulk extractall(), and passes
  filter='data' on Python with PEP-706 extraction filters
  (3.11.4+/3.12+) for belt-and-suspenders defence.
- tests/test_backup_helper.py: BackupHelperTest now patches HOME to a
  per-test tmpdir so `tearDown` no longer rmtree's a real ~/anthias
  checkout when run on a developer workstation. Also added
  test_recover_skips_path_traversal_member, which proves a hostile
  tarball entry like `../evil.txt` is logged-and-skipped, not written
  outside $HOME.
- docs/raspberry-pi5-ssd-install-instructions.md: capitalise "This"
  after the period.
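
A minimal sketch of the member-by-member checks described in the first bullet, assuming a hypothetical `is_safe_member()` / `safe_extract()` pair rather than the project's actual `_safe_tar_member()`. The allowed-top-level-dirs check is left out for brevity.

```python
import os
import tarfile


def is_safe_member(member: tarfile.TarInfo, dest: str) -> bool:
    """Return True only for regular files/dirs that stay inside dest.

    Hypothetical stand-in for the checks described above.
    """
    if not (member.isfile() or member.isdir()):
        return False  # symlinks, hardlinks, devices, FIFOs
    name = member.name
    if os.path.isabs(name) or ".." in name.split("/"):
        return False
    dest_real = os.path.realpath(dest)
    target = os.path.realpath(os.path.join(dest_real, name))
    # The normalised target must still live under the destination.
    return os.path.commonpath([dest_real, target]) == dest_real


def safe_extract(archive: tarfile.TarFile, dest: str) -> list[str]:
    """Extract members one by one; return the names that were skipped."""
    skipped = []
    for member in archive.getmembers():
        if not is_safe_member(member, dest):
            skipped.append(member.name)
            continue
        if hasattr(tarfile, "data_filter"):
            # Extraction filters (PEP 706) are available on this interpreter.
            archive.extract(member, dest, filter="data")
        else:
            archive.extract(member, dest)
    return skipped
```

Iterating manually keeps the decision per member explicit, so a hostile entry can be logged and skipped without aborting the rest of the restore.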

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add missing leading slash to repo dir heading

The heading for the cloned repo dir was rendered as
`home/${USER}/anthias/`, while every other heading in the section uses
absolute paths like `/home/${USER}/.anthias/`. Same fix applied to the
legacy-path mention in the note below it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:34:53 +01:00
Viktor Petersson
c7ec6ea771 chore(build): replace webpack, npm, and jest with bun (#2746)
* chore(deps): manage Python deps via uv dependency-groups

Replaces the six service-scoped requirements*.txt files with
PEP 735 dependency-groups in pyproject.toml and rebuilds every
Docker image as a two-stage build: a uv-builder stage (using the
official ghcr.io/astral-sh/uv image, with a pip fallback for
armv6) produces /venv via `uv sync --group <svc>`, which the
runtime stage copies in. uv.lock becomes authoritative for all
services. requirements/requirements.host.txt is kept as a
committed, auto-generated artifact (`uv export --group host`) so
bin/install.sh and the Ansible role keep working; a python-lint
CI step enforces it stays in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump Django, cryptography, pyOpenSSL, and 5 others

- Django 4.2.29 → 4.2.30 (latest 4.2 LTS)
- cryptography 3.3.2 → 46.0.7 (capped by pyOpenSSL 26's `cryptography<47`;
  cryptography 47 is incompatible with the latest pyOpenSSL)
- pyOpenSSL 19.1.0 → 26.0.0 (required by newer cryptography ABI —
  pyOpenSSL 19 crashed at import against cryptography ≥ ~3.4)
- requests 2.32.5 → 2.33.1 (aligned across every group, including
  docker-image-builder and local)
- pyasn1 0.6.2 → 0.6.3
- redis 7.1.0 → 7.4.0
- Cython 3.2.3 → 3.2.4
- sh 1.8 → 2.2.2 (major bump; usages in celery_tasks.py, bin/wait.py,
  lib/utils.py stick to the stable `sh.<cmd>` + `sh.ErrorReturnCode_N`
  API — verified still works)
- python-vlc 3.0.20123 → 3.0.21203

`mako` and `flatted` were requested but skipped: `mako` was already
removed from the project (9535745e), and `flatted` is an npm dep in
`package-lock.json`, not a Python dep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump wheel from 0.38.1 to 0.46.2

Closes Dependabot PR #2651.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: adapt sh 2.x API changes in wait.py and viewer

Two real breakages uncovered by auditing every `sh.*` call site
against the sh 1.x → 2.x API:

- bin/wait.py: `sh.grep(sh.route(), 'default')` no longer pipes
  in sh 2.x — the inner command stringifies to its stdout and
  becomes a literal argument to grep, producing
  `grep '<route_output>' default` and an ErrorReturnCode_2. Use
  the idiomatic `sh.grep('default', _in=sh.route())` instead.

- viewer/__init__.py: `browser.process.alive` is gone in sh 2.x
  (`OProc` no longer exposes it). Use `browser.process.is_alive()[0]`
  instead; `is_alive()` returns an `(alive_bool, exit_code)` tuple.

Plus two review nits:
- Add trailing newline to docs/migrating-assets-to-screenly.md
- Use `diff -u` in the requirements.host.txt CI drift check so
  failures print a readable unified diff.

Verified against sh==2.2.2 inside the rebuilt server image:
- `sh.grep('default', _in=sh.echo('…'))` pipes correctly
- `cmd.process.is_alive()` → `(True, None)` while running,
  `(False, 0)` after wait()
- `cmd.process.stdout.decode('utf-8')` still works on `_bg=True`
  processes

83/83 unit tests + 12/12 integration tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): serialize apt cache access with sharing=locked

The multi-stage uv-builder + runtime layout means two RUN steps can
race on BuildKit's shared `/var/cache/apt` cache mount. apt requires
an exclusive lock on /var/cache/apt/archives, so a concurrent
apt-get in the sibling stage causes the build to fail with
`E: Could not get lock /var/cache/apt/archives/lock`.

BuildKit's default cache mount sharing mode is `shared` (unrestricted
concurrent access). Switching to `sharing=locked` makes BuildKit
serialize access across stages, matching apt's locking model.

Discovered while cross-compiling `pi4-64` under QEMU, where the
slower emulated apt-get in stage 1 overlapped with the host-speed
apt-get in stage 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: fix ansible-lint and sbom workflows

**ansible-lint** (broken since 2026-04-08, #2732):
- `ansible-community/ansible-lint-action@main` repo is gone (404),
  so every run failed with "Unable to resolve action".
- Rewrite the workflow to use setup-uv + `uv run ansible-lint` from
  a new `ansible-lint==26.4.0` entry in the `dev-host` dependency
  group — matches the uv-based pattern already used by
  `python-lint.yaml`.
- Add `.ansible-lint` config with a skip list covering 19
  pre-existing violations in `ansible/` roles
  (`var-naming[no-role-prefix]`, `risky-shell-pipe`, `no-free-form`)
  so the workflow can go green today; follow-up PRs should drive
  the skip list down.
- Extend the path triggers to fire on config, workflow, and lock
  changes — not just `ansible/**`.

**sbom** (broken since 2026-04-02):
- The `sbomify/github-action` renamed `SBOM_FILE` to `LOCK_FILE` for
  lockfile inputs. Every run has been failing with "`uv.lock` is a
  lock file, not an SBOM. Please use LOCK_FILE instead of SBOM_FILE."
- Rename both `SBOM_FILE` envs (`package-lock.json` and `uv.lock`)
  to `LOCK_FILE`.

Verified locally: `uv run ansible-lint ansible/` passes (0
failures, 0 warnings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(build): replace webpack, npm, and jest with bun

Collapses the JS toolchain to a single tool. Bun handles installs
(replacing npm), bundling via `bun build` + `sass` CLI (replacing
webpack + ts-loader + babel + mini-css-extract-plugin), and testing
via `bun test` (replacing jest + ts-jest + jest-fixed-jsdom). Dev/test
Dockerfiles pull the bun binary from the official `oven/bun` image via
`COPY --from=`; production uses `oven/bun` as a builder stage.

Removes 18 devDependencies and 5 config files; adds only `bunfig.toml`
and `@happy-dom/global-registrator`.

Drive-by fix: `FormData` was imported as a value from `@/types` in
two files but is a type-only interface shadowing the browser global.
Webpack+ts-loader silently erased it; Bun's bundler surfaced the bug.
Converted to `import type`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): symlink bunx to bun in dev and test images

`bunx` is a symlink to `bun` in the official `oven/bun` image, so the
single-file `COPY --from=oven/bun:...-slim /usr/local/bin/bun` missed it.
Result: `bun run dev:css` / `bun run build:css` failed with
`bunx: command not found` inside dev and test containers.

Recreate the symlink after the copy. Production is unaffected because
its builder stage uses `FROM oven/bun` (bunx already present).

Caught by full end-to-end build verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: SHA-pin all external GitHub Actions

Addresses SonarCloud rule githubactions:S7637 ("Use full commit SHA
hash for this dependency") and brings the repo in line with the
hardened CI guidance from OpenSSF, CISA, and GitHub itself: tag refs
like @v7 or @master are mutable and can be retargeted by the action
owner or via compromise. Pinning to a full commit SHA removes that
supply-chain risk.

Every `uses:` reference to an external action across all 13 workflow
files is now pinned by SHA, with the original tag preserved as an
inline comment so the intent remains readable:

    uses: actions/checkout@de0fac2e45 # v6

Dependabot's github-actions ecosystem (already configured in
.github/dependabot.yml) recognises this `<SHA> # <tag>` format and
will update both the SHA and the comment together on future version
bumps, so we don't lose automated update coverage.
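
One way to keep this from regressing is a small check over the workflow text. The sketch below is an assumption-laden illustration, not part of the repo: the function name `unpinned_actions` is hypothetical, and it treats only a full 40-character hex ref as pinned.

```python
import re

# Matches `uses: owner/repo[/path]@ref` and captures the action and the ref.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s@#]+)@([^\s#]+)")
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[tuple[int, str]]:
    """Return (line_number, action@ref) for actions not pinned to a full SHA."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if not match:
            # Also covers local `./.github/workflows/...` refs, which have no @ref.
            continue
        action, ref = match.groups()
        if not FULL_SHA_RE.fullmatch(ref):
            findings.append((lineno, f"{action}@{ref}"))
    return findings
```

A mutable ref such as `@v6` or `@main` is reported with its line number, while a SHA-pinned line with a trailing `# v6` comment passes.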

Scope: 21 distinct external actions × 73 total use sites across
ansible-lint, build-balena-disk-image, build-webview, codeql-analysis,
deploy-website, docker-build, generate-openapi-schema, javascript-lint,
lint-workflows, python-lint, sbom, and test-runner. Local workflow
references (./.github/workflows/...) left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs,chore: address review feedback on bun migration

- Update CLAUDE.md and docs/developer-documentation.md to replace
  npm/webpack/jest references with bun equivalents. The old webpack
  ProvidePlugin bullet was superseded by tsconfig's react-jsx runtime;
  restate that.
- Add comments in setupTests.ts explaining (1) why Bun's native fetch
  is stashed and restored around happy-dom's GlobalRegistrator (so MSW
  can intercept) and (2) why testing-library is imported dynamically
  after registration (so `screen` binds to a live document.body).
- Narrow the production builder SCSS COPY back to `*.scss` and drop
  the unused `bunfig.toml` copy (it's only consumed by `bun test`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dev): fail-fast when a watcher crashes in `bun run dev`

`wait` without arguments waits for every background job and then returns
0 regardless of how they exited, so a crashing JS or CSS watcher could
leave the script reporting success.
Track each watcher's PID, use `wait -n` to exit on the first failure,
and kill the survivor via a trap.
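
The fail-fast pattern can be sketched as below. The actual fix is a shell script using `wait -n` and a trap; this Python version, with the hypothetical name `run_fail_fast`, only mirrors the idea of stopping on the first exit and cleaning up the remaining process.

```python
import subprocess
import time


def run_fail_fast(commands: list[list[str]], poll_interval: float = 0.1) -> int:
    """Run commands concurrently; stop everything as soon as one exits.

    Returns the exit code of the first process seen to exit.
    """
    if not commands:
        raise ValueError("at least one command is required")
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            for proc in procs:
                code = proc.poll()
                if code is not None:
                    return code
            time.sleep(poll_interval)
    finally:
        # Plays the role of the shell trap: never leave a survivor behind.
        for proc in procs:
            if proc.poll() is None:
                proc.terminate()
                proc.wait()
```

Because watchers are expected to run until interrupted, any exit is treated as the signal to shut down, and the returned code lets the caller report failure.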

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 06:53:56 +01:00
Viktor Petersson
ee12387b06 chore(deps): manage Python deps via uv dependency-groups (#2744)
* chore(deps): manage Python deps via uv dependency-groups

Replaces the six service-scoped requirements*.txt files with
PEP 735 dependency-groups in pyproject.toml and rebuilds every
Docker image as a two-stage build: a uv-builder stage (using the
official ghcr.io/astral-sh/uv image, with a pip fallback for
armv6) produces /venv via `uv sync --group <svc>`, which the
runtime stage copies in. uv.lock becomes authoritative for all
services. requirements/requirements.host.txt is kept as a
committed, auto-generated artifact (`uv export --group host`) so
bin/install.sh and the Ansible role keep working; a python-lint
CI step enforces it stays in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump Django, cryptography, pyOpenSSL, and 5 others

- Django 4.2.29 → 4.2.30 (latest 4.2 LTS)
- cryptography 3.3.2 → 46.0.7 (capped by pyOpenSSL 26's `cryptography<47`;
  cryptography 47 is incompatible with the latest pyOpenSSL)
- pyOpenSSL 19.1.0 → 26.0.0 (required by newer cryptography ABI —
  pyOpenSSL 19 crashed at import against cryptography ≥ ~3.4)
- requests 2.32.5 → 2.33.1 (aligned across every group, including
  docker-image-builder and local)
- pyasn1 0.6.2 → 0.6.3
- redis 7.1.0 → 7.4.0
- Cython 3.2.3 → 3.2.4
- sh 1.8 → 2.2.2 (major bump; usages in celery_tasks.py, bin/wait.py,
  lib/utils.py stick to the stable `sh.<cmd>` + `sh.ErrorReturnCode_N`
  API — verified still works)
- python-vlc 3.0.20123 → 3.0.21203

`mako` and `flatted` were requested but skipped: `mako` was already
removed from the project (9535745e), and `flatted` is an npm dep in
`package-lock.json`, not a Python dep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump wheel from 0.38.1 to 0.46.2

Closes Dependabot PR #2651.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: adapt sh 2.x API changes in wait.py and viewer

Two real breakages uncovered by auditing every `sh.*` call site
against the sh 1.x → 2.x API:

- bin/wait.py: `sh.grep(sh.route(), 'default')` no longer pipes
  in sh 2.x — the inner command stringifies to its stdout and
  becomes a literal argument to grep, producing
  `grep '<route_output>' default` and an ErrorReturnCode_2. Use
  the idiomatic `sh.grep('default', _in=sh.route())` instead.

- viewer/__init__.py: `browser.process.alive` is gone in sh 2.x
  (`OProc` no longer exposes it). Use `browser.process.is_alive()[0]`
  instead; `is_alive()` returns an `(alive_bool, exit_code)` tuple.

Plus two review nits:
- Add trailing newline to docs/migrating-assets-to-screenly.md
- Use `diff -u` in the requirements.host.txt CI drift check so
  failures print a readable unified diff.

Verified against sh==2.2.2 inside the rebuilt server image:
- `sh.grep('default', _in=sh.echo('…'))` pipes correctly
- `cmd.process.is_alive()` → `(True, None)` while running,
  `(False, 0)` after wait()
- `cmd.process.stdout.decode('utf-8')` still works on `_bg=True`
  processes

83/83 unit tests + 12/12 integration tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker): serialize apt cache access with sharing=locked

The multi-stage uv-builder + runtime layout means two RUN steps can
race on BuildKit's shared `/var/cache/apt` cache mount. apt requires
an exclusive lock on /var/cache/apt/archives, so a concurrent
apt-get in the sibling stage causes the build to fail with
`E: Could not get lock /var/cache/apt/archives/lock`.

BuildKit's default cache mount sharing mode is `shared` (unrestricted
concurrent access). Switching to `sharing=locked` makes BuildKit
serialize access across stages, matching apt's locking model.

Discovered while cross-compiling `pi4-64` under QEMU, where the
slower emulated apt-get in stage 1 overlapped with the host-speed
apt-get in stage 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: fix ansible-lint and sbom workflows

**ansible-lint** (broken since 2026-04-08, #2732):
- `ansible-community/ansible-lint-action@main` repo is gone (404),
  so every run failed with "Unable to resolve action".
- Rewrite the workflow to use setup-uv + `uv run ansible-lint` from
  a new `ansible-lint==26.4.0` entry in the `dev-host` dependency
  group — matches the uv-based pattern already used by
  `python-lint.yaml`.
- Add `.ansible-lint` config with a skip list covering 19
  pre-existing violations in `ansible/` roles
  (`var-naming[no-role-prefix]`, `risky-shell-pipe`, `no-free-form`)
  so the workflow can go green today; follow-up PRs should drive
  the skip list down.
- Extend the path triggers to fire on config, workflow, and lock
  changes — not just `ansible/**`.

**sbom** (broken since 2026-04-02):
- The `sbomify/github-action` renamed `SBOM_FILE` to `LOCK_FILE` for
  lockfile inputs. Every run has been failing with "`uv.lock` is a
  lock file, not an SBOM. Please use LOCK_FILE instead of SBOM_FILE."
- Rename both `SBOM_FILE` envs (`package-lock.json` and `uv.lock`)
  to `LOCK_FILE`.

Verified locally: `uv run ansible-lint ansible/` passes (0
failures, 0 warnings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: SHA-pin all external GitHub Actions

Addresses SonarCloud rule githubactions:S7637 ("Use full commit SHA
hash for this dependency") and brings the repo in line with the
hardened CI guidance from OpenSSF, CISA, and GitHub itself: tag refs
like @v7 or @master are mutable and can be retargeted by the action
owner or via compromise. Pinning to a full commit SHA removes that
supply-chain risk.

Every `uses:` reference to an external action across all 13 workflow
files is now pinned by SHA, with the original tag preserved as an
inline comment so the intent remains readable:

    uses: actions/checkout@de0fac2e45 # v6

Dependabot's github-actions ecosystem (already configured in
.github/dependabot.yml) recognises this `<SHA> # <tag>` format and
will update both the SHA and the comment together on future version
bumps, so we don't lose automated update coverage.

Scope: 21 distinct external actions × 73 total use sites across
ansible-lint, build-balena-disk-image, build-webview, codeql-analysis,
deploy-website, docker-build, generate-openapi-schema, javascript-lint,
lint-workflows, python-lint, sbom, and test-runner. Local workflow
references (./.github/workflows/...) left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): use RunningCommand.is_alive() instead of OProc tuple

OProc.is_alive() returns (bool, exit_code); RunningCommand.is_alive()
wraps that and returns just the bool. The wrapper is clearer than
indexing into the tuple.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 06:48:36 +01:00
Nico Miguelino
5c66a38743 feat: add support for Raspberry Pi OS Trixie (#2732)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 12:19:22 -07:00
Nico Miguelino
29ae072514 chore: replace Poetry with uv for managing host dependencies (#2611) 2025-12-16 05:03:27 -08:00
Nico Miguelino
4591206466 docs: add note that Trixie is not yet supported (#2531) 2025-10-10 09:49:07 -07:00
Romain Reignier
07572c6b49 docs: fix the Debian iso download link (#2494)
Now that Debian 13 has been released, the previous link points
to the Debian 13 ISO, which is not compatible with the current version of Anthias.
2025-09-05 08:54:45 -07:00
Nico Miguelino
89e6182871 chore: migrate to TypeScript (#2359) 2025-06-24 12:26:50 -07:00
nicomiguelino
8d12dff1d4 docs: update docs for installing using the release images 2025-06-11 09:49:35 -07:00
nicomiguelino
9a55aa6cbc docs: add guide on how to get started with the admin site 2025-06-04 09:47:32 -07:00
Nico Miguelino
87fbddeacc chore(ci): replace NPM lint script with lint:check and lint:fix (#2301) 2025-05-28 09:52:19 -07:00
Nico Miguelino
51e4511bba feat: migrate to React (#2265) 2025-05-26 21:04:19 -07:00
Alexis Threlfall
ad26782aba Update documentation with further information. (#2240)
* Screenshot of rpi-eeprom-update output.

* Update raspberry-pi5-ssd-install-instructions.md

Updated information regarding boot order and bootloader versions based on testing.

* Update raspberry-pi5-ssd-install-instructions.md

Update with further information after testing with Pi5/Balena OS

* Update docs/raspberry-pi5-ssd-install-instructions.md

Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>

* Update docs/raspberry-pi5-ssd-install-instructions.md

Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>

* Update docs/raspberry-pi5-ssd-install-instructions.md

Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>

* Update docs/raspberry-pi5-ssd-install-instructions.md

Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>

* Update docs/raspberry-pi5-ssd-install-instructions.md

Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>

* Update docs/raspberry-pi5-ssd-install-instructions.md

Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>

* Update docs/raspberry-pi5-ssd-install-instructions.md

Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>

* Update docs/raspberry-pi5-ssd-install-instructions.md

Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>

---------

Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-26 08:33:54 -07:00
Alexis Threlfall
15cdfa030c Update installation-options.md 2025-03-12 15:30:12 +00:00
Alexis Threlfall
85f3dfd8e0 Update installation-options.md 2025-03-12 15:29:35 +00:00
Alexis Threlfall
1473080ab4 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:36:36 +00:00
Alexis Threlfall
a94c7bcbf6 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:35:04 +00:00
Alexis Threlfall
9cf8252a41 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:34:53 +00:00
Alexis Threlfall
c02cf28812 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:34:41 +00:00
Alexis Threlfall
0220c83f05 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:34:18 +00:00
Alexis Threlfall
3206b242cf Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:34:06 +00:00
Alexis Threlfall
97811ef6a0 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:33:57 +00:00
Alexis Threlfall
322b880627 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:33:37 +00:00
Alexis Threlfall
67ef900a93 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:33:24 +00:00
Alexis Threlfall
6a734dc89e Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:32:49 +00:00
Alexis Threlfall
7eb348808d Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:32:41 +00:00
Alexis Threlfall
1a36330d07 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:32:27 +00:00
Alexis Threlfall
a36a0ff7b2 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:32:09 +00:00
Alexis Threlfall
9fa9d3ff58 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:31:50 +00:00
Alexis Threlfall
b184a08bdf Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:31:40 +00:00
Alexis Threlfall
03e51df7d5 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:31:23 +00:00
Alexis Threlfall
1fef3c9483 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:31:14 +00:00
Alexis Threlfall
7f5c5c2fb0 Update docs/raspberry-pi5-ssd-install-instructions.md
Co-authored-by: Nico Miguelino <nicomiguelino2014@gmail.com>
2025-03-12 13:31:02 +00:00
Alexis Threlfall
ba35e771be Add files via upload 2025-03-11 11:06:27 +00:00
Nico Miguelino
490051585f Replace flake8 with ruff (#2092) 2025-01-14 06:47:52 -08:00
Nico Miguelino
65d5b83a36 docs: update info about Pi 5 compatibility (#2174) 2024-12-24 12:45:40 -08:00
Nico Miguelino
4bcb620a03 docs: update docs to include balenaOS support for Pi 5 (#2172) 2024-12-23 09:32:37 -08:00
Nico Miguelino
de804d4f06 chore: refactor the image builder script into multiple files (#2161) 2024-12-17 09:52:12 -08:00