11 Commits

Author SHA1 Message Date
Viktor Petersson
cca69b594d fix(balena): repair Pi 4 graphics overlay + manage fleet host config as IAC (#2947) (#2949)
* fix(viewer): detect Pi 4 eglfs DRM card at runtime so boot doesn't hang (#2947)

- vc4-drm (display) and v3d (render-only) race during probe, so the
  display node is card0 on some boots/images and card1 on others
- #2905 hardcoded /dev/dri/card1; when vc4 loses the race eglfs opens
  the render-only node, finds no connectors, and the device hangs on
  the balena splash forever
- start_viewer.sh now picks the card that owns connectors at runtime
  and rewrites QT_QPA_EGLFS_KMS_CONFIG before launch
- prefers a connected connector, falls back to any card exposing
  connectors (excludes the connector-less v3d node)
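
A minimal sketch of the selection rule above, written in Python rather
than the shell that start_viewer.sh actually uses. The name
`pick_display_card` and the `drm_root` parameter are hypothetical; the
sketch assumes the usual sysfs layout in which every connector shows up
as `cardN-<connector>/status`.

```python
import re
from pathlib import Path
from typing import Optional


def pick_display_card(drm_root: str = "/sys/class/drm") -> Optional[str]:
    """Return e.g. 'card1' for the DRM card that owns connectors, else None."""
    connected = []      # cards with at least one connected connector
    any_connector = []  # cards exposing any connector at all
    for entry in sorted(Path(drm_root).glob("card*-*")):
        match = re.match(r"^(card\d+)-", entry.name)
        if match is None:
            continue
        card = match.group(1)
        any_connector.append(card)
        try:
            status = (entry / "status").read_text().strip()
        except OSError:
            continue  # unreadable status: still counts as "has a connector"
        if status == "connected":
            connected.append(card)
    # Prefer a connected connector, then any card that exposes connectors.
    # A render-only node has no cardN-* entries, so it is never returned.
    if connected:
        return connected[0]
    if any_connector:
        return any_connector[0]
    return None
```

The returned name would then be substituted into the device path that
QT_QPA_EGLFS_KMS_CONFIG points at; that rewrite is not shown here.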

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(balena): repair Pi 4 graphics overlay, manage fleet host config as IAC (#2947)

Root cause of the Pi 4 boot-splash hang: the anthias-pi4 fleet's
dtoverlay was stored as the malformed value `"vc4-kms-v3d"` — literal
double-quotes, on the legacy RESIN_ prefix — unlike every other board's
clean `vc4-kms-v3d`. The quotes stop the firmware loading the overlay,
so the Pi 4 fell back to firmware-KMS; since #2905 the viewer renders
through Qt eglfs_kms, which needs full KMS, so the display never came up
and the device hung on the splash. (linuxfb, used before #2905, didn't
care, which is why this surfaced now.) The malformed value was a manual
dashboard edit — the config was never codified.

- add balena-host-config.json: declarative per-board config.txt knobs,
  reconciled from the live fleets and corrected (clean pi4 dtoverlay;
  drop bogus `dtparam=...,vc4-kms-v3d`; standardize pi2/pi3 off the
  RESIN_ prefix; drop pi5 gpu_mem which a Pi 5 ignores; add cma-512 to
  pi5 per docs/board-enablement.md)
- build-balena-disk-image.yaml reconciles each fleet to the file:
  upsert under the canonical BALENA_ prefix, then prune anything not in
  the file (incl. legacy RESIN_HOST_CONFIG_* dupes). Supervisor vars
  untouched.
- docs/balena-fleet-host-config.md documents the mechanism + the full
  per-board audit; modernize the self-hosted doc's `env add`->`env set`
- drop `--pin-device-to-release` from `balena preload` (added in #2098)
  so flashed devices track the latest stable release instead of freezing;
  correct installation-options.md / faq.yaml accordingly

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ci): harden fleet host-config reconcile against fleet-wide wipe (#2947)

Review of the prune step surfaced three failure modes:

- An empty desired set (absent board key, or jq/file parse failure not
  caught by set -e through a process substitution) made the prune delete
  every config var on the fleet, incl. dtoverlay=vc4-kms-v3d. Now resolve
  the board config with `jq -ec .boards[$b]` and hard-fail if it's null
  or the file is invalid; a `{}` board (x86) is truthy so it still
  reconciles to "no config.txt".
- The prune selector `test("HOST_CONFIG")` was an unanchored substring
  match, so a var merely containing HOST_CONFIG (e.g.
  BALENA_HOST_CONFIGURATION_BACKUP) would be pruned. Anchored to
  `^(BALENA|RESIN)_HOST_CONFIG_`.
- A transient `balena env list` / jq failure in the prune's process
  substitution was swallowed (pipefail doesn't propagate out of `<(...)`),
  silently skipping the prune and leaving stale RESIN_ duplicates. Capture
  the listing into a var first so the failure aborts the step.

Also folds the duplicate jq pass over balena-host-config.json into the
single `board_json` resolve.
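
The guard and the anchored selector can be illustrated with a small
Python sketch. The real step is shell + jq in the workflow; the function
`plan_reconcile`, the `{"boards": {...}}` shape, and the assumption that
file keys are bare config.txt names are all hypothetical here.

```python
import re

# Anchored, so BALENA_HOST_CONFIGURATION_BACKUP does not match.
HOST_CONFIG_RE = re.compile(r"^(BALENA|RESIN)_HOST_CONFIG_")
CANONICAL_PREFIX = "BALENA_HOST_CONFIG_"


def plan_reconcile(config, board, live_var_names):
    """Return (upserts, prunes) for one fleet.

    config: parsed host-config file, assumed to look like {"boards": {...}}
    live_var_names: names of the config vars currently set on the fleet
    """
    boards = config.get("boards") or {}
    board_cfg = boards.get(board)
    if board_cfg is None:
        # An absent board must abort; otherwise the prune below would
        # treat "nothing desired" as "delete everything".
        raise ValueError(f"no host config for board {board!r}; refusing to prune")
    # An empty dict is valid: it means "this board wants no config.txt vars".
    upserts = {CANONICAL_PREFIX + key: value for key, value in board_cfg.items()}
    prunes = sorted(
        name
        for name in live_var_names
        if HOST_CONFIG_RE.match(name) and name not in upserts
    )
    return upserts, prunes
```

Raising on a missing board plays the role of `jq -e` here: the failure
happens before any prune list is computed.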

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(balena): describe current host config, not the one-time cleanup

Replace the point-in-time "Audit" table (old quoted/broken values, was→fixed
narrative) with a forward-looking per-key rationale. The cleanup history lives
in the PR / git history; the doc should describe what each setting is and why.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(ci): drop the inline rationale comment from the preload step

Move the explanation out of the workflow and into history (this is where
`git blame` on the preload step lands):

`--commit latest` seeds the image with the current release's container
images so a freshly flashed device boots fully offline. We deliberately
do NOT pass `--pin-device-to-release`: pinning (added in #2098) froze
flashed devices to the downloaded release, so they never received OTA
fixes. Anthias balena devices should track the fleet's latest stable
release, so the device joins the fleet unpinned and auto-updates from
here on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 19:46:13 +02:00
Viktor Petersson
19dd5fb205 docs: document rpi-imager 2.0.x macOS write-error when flashing images (#2948) (#2950)
Raspberry Pi Imager 2.0.2–2.0.7 has a macOS bug that aborts mid-write
with "Error writing to storage device" when an image is decompressed on
the fly. It is upstream rpi-imager #1605/#1489 (macOS raw block devices
reject unaligned pwrite() calls) and affects .img.xz and plain .img
files too, not just our .zst images, so there is no image-side change
that fixes it. The fix landed upstream in rpi-imager#1621 (May 2026).

Add a warning + workaround (update Imager past 2.0.7, or decompress with
`zstd -d` and flash the resulting .img) to the installation docs and a
matching FAQ entry, and drop the stale "only Raspberry Pi Imager
supports .zst" claim.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 17:15:12 +02:00
Viktor Petersson
57b4f25c77 feat(viewer,server): per-board HW decode dispatch + codec gate on upload (#2885)
* perf(viewer): pi4-64/pi5 use mpv --vo=gpu --gpu-context=drm

On Pi the connector's preferred mode is usually 4K (most modern
TVs report 3840x2160 in their EDID), and the previous --vo=drm
path ran a CPU zimg upscale from 1080p source to that 4K output.
On a 4-core A72 that's the bottleneck — mpv VO drops 59-75
frames per 30s on a stock 1080p H.264 signage clip. Pi5's A76
is faster but the same upscale path is still the limit.

Switching the VO to GL with the DRM context (mpv --vo=gpu
--gpu-context=drm) hands the upscale to the V3D and leaves
everything else identical — mpv still owns DRM master, still
reads --drm-mode=1920x1080@60 (kept), still runs in
--vd-lavc-threads=4 software decode (mpv 0.40 in Debian Trixie
has v4l2m2m-copy but not v4l2request, so --hwdec=auto-safe
falls back to software on this asset; that hasn't changed).

Measured on a 4K-connected Pi4-64 Rev 1.5, same clip, same 30 s
window:

  --vo=drm                                : 59-75 vo drops / 30 s
  --vo=gpu --gpu-context=drm (this patch) : 3-6 vo drops / 30 s

`decoder-frame-drop-count` is 0 in both — the regression was
purely on the VO side, and shifting scaling off the CPU is what
buys the headroom.

x86 (cage + --gpu-context=wayland) is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(viewer): drop --drm-mode pin on Pi4-64/Pi5 under --gpu-context=drm

The previous commit moved Pi4-64/Pi5 to `mpv --vo=gpu
--gpu-context=drm` but kept the `--drm-mode=1920x1080@60` pin
from the old --vo=drm path. On-device testing showed the pin
*hurts* throughput under GBM: 294 vo drops/30s with the pin,
3-6 without, on the same 4K-connected Pi4 and the same H.264
clip.

The pin existed in the first place to dodge CPU zimg upscale to
4K, which the A72 couldn't keep up with on the legacy --vo=drm
path. Under --gpu-context=drm the V3D does the scaling for free
at the connector's preferred mode, so the workaround is no
longer needed and is in fact harmful.

`--vd-lavc-threads=4` stays — software decode under
--hwdec=auto-safe (mpv 0.40 has v4l2m2m-copy but not
v4l2request) still benefits from explicit threading.

Verified on a 4K-connected Pi4-64 across H.264 (30/24 fps) and
HEVC clips: 2-6 vo drops/30s in every case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(viewer): consolidate Qt6 boards onto cage + Wayland, pin Pi 4 to 1080p

Folds in PR #2883: Pi 4-64 / Pi 5 now run under cage with mpv on
--vo=gpu --gpu-context=wayland, joining x86 and arm64 on a single
Wayland-based display stack. Drops the --vo=drm legacy path
entirely from MPVMediaPlayer. Qt 5 boards (pi2 / pi3) stay on
linuxfb via VLCMediaPlayer — out of scope here.

Replaces the perf branch's `--vo=gpu --gpu-context=drm` standalone
fix with the consolidated cage path. The previous standalone
finding (3-6 vo drops / 30 s on Pi 4 at 4K) was a Pi-without-cage
optimization; once Pi runs under cage like every other Qt6 board,
the same trick applies via wayland but cage's composite step adds
its own pass and the V3D on Pi 4 can't keep up at 4K (738 vo
drops / 30 s measured at native 4K under cage). Fix: move the
1080p mode pin one layer up from app code to host config — the
new ansible/.../cmdline.txt.j2 conditional appends
`video=HDMI-A-1:1920x1080@60 video=HDMI-A-2:1920x1080@60` when
`device_type == 'pi4-64'`. With output pinned to 1080p there's no
upscale anywhere in the pipeline, matching the bandwidth profile
of today's --vo=drm production setup.

Pi 5 / x86 / arm64 keep the connector's preferred mode (typically
4K). Pi 5's V3D 7.1 has roughly 2× Pi 4's throughput; x86 iGPUs
handle 4K via VAAPI; arm64 SBC perf varies by SoC.

Other notable changes folded in from #2883:

* tools/image_builder/utils.py — `cage` + `qt6-wayland` move out
  of the per-board branch into the shared is_qt6 block.
  `wlr-randr` (was x86-only) goes in the shared block too since
  rotation now happens via wlr-randr on every Qt6 board.
  `va-driver-all` stays x86-only (no VAAPI on Pi / ARM SoCs).
* docker/Dockerfile.viewer.j2 — QT_QPA_PLATFORM=wayland gated on
  is_qt6 instead of board in ('x86', 'arm64').
* bin/start_viewer.sh — case on DEVICE_TYPE: every Qt6 board
  takes the cage + sudo path. Pi2 / Pi3 stay on the legacy
  direct-sudo path.
* src/anthias_viewer/media_player.py — single --vo=gpu
  --gpu-context=wayland for all reachable device types. The
  per-board rotate_args block is gone: every Qt6 device inherits
  the transform from cage via wlr-randr, so mpv would
  double-rotate if it set --video-rotate.
* tests/test_media_player.py — parametrised tests for all four
  Qt6 boards (x86, arm64, pi4-64, pi5) hitting the same VO path;
  rotation tests assert mpv *never* sets --video-rotate under
  cage.
* website/data/faq.yaml — rotation entry points at Settings page
  / wlr-randr; resolution entry calls out the Pi 4 1080p pin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ansible): propagate tags into boot.yml include_tasks

The `Configure boot partition` task in system/tasks/main.yml was
tagged `touches-boot-partition` / `raspberry-pi` but those tags
weren't propagated to the tasks inside boot.yml — Ansible's
default include_tasks behaviour matches the include against
--tags but leaves the included tasks tag-less, so they get
filtered back out. Running `ansible-playbook ... --tags
touches-boot-partition` therefore did nothing.

Use the explicit `apply: tags:` form so the include's tags are
copied onto each task in boot.yml. With this, the standalone
"re-render boot config" workflow actually works, which matters
on Pi 4 now that the 1080p HDMI mode pin in cmdline.txt.j2
needs to land without re-running the whole playbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): keep Pi 4 on linuxfb; only Pi 5 / x86 / arm64 go cage

On-device testing on a Pi 4 Model B Rev 1.5 with a 4K HDMI display
showed cage+wayland is fundamentally too heavy for the V3D 6.0:

  --vo=drm    (existing, no cage)                : 59-75 drops/30s
  --vo=gpu --gpu-context=drm  (no cage, GPU scale): 3-6 drops/30s
  --vo=gpu --gpu-context=wayland (cage, even at  : 730+ drops/30s,
    1080p HDMI cmdline pin to avoid 4K scale)      mpv at 99% CPU
                                                   running ~1/4×
                                                   real time

The 1080p HDMI pin doesn't recover Pi 4 — cage's composite pass
costs more than the V3D 6.0 has spare bandwidth for, regardless
of output resolution, with the webview running in the background
or not. Pi 5's V3D 7.1 has roughly 2× the throughput and is
expected to keep up; x86 / arm64 already shipped on cage and
remain unchanged.

Net result:

  * Pi 4-64 stays on Qt linuxfb (no compositor) with mpv on
    --vo=gpu --gpu-context=drm. mpv writes straight to KMS via
    libgbm and lets the V3D do video scaling — keeping the
    standalone perf-branch finding that drops from 59-75 → 3-6
    on the same clip.
  * Pi 5 / x86 / arm64 stay (or move) onto cage + qt6-wayland +
    wlr-randr with mpv on --vo=gpu --gpu-context=wayland.
  * Pi 2 / Pi 3 stay on the Qt5 + VLC + linuxfb track they were
    already on.
  * The Pi 4 1080p HDMI cmdline pin added in the previous commit
    is reverted (no longer needed without cage).
  * Rotation handling: mpv emits --video-rotate=N on Pi 4 (no
    compositor to apply the transform) and skips it on the cage
    boards (wlr-randr handles it there).

Goal-wise this is the partial consolidation we agreed to as a last
resort: three of four Qt6 boards share one Wayland stack, Pi 4
keeps the framebuffer path for as long as the V3D 6.0 + mpv 0.40
combo lacks the headroom. Pi 4 remains in scope for revisiting
once mpv ships the v4l2request hwdec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): mirror host render-GID for all Qt 6 boards, not just cage

mpv uses /dev/dri/renderD128 for --vo=gpu on every Qt 6 board
now — wayland (cage path on x86 / arm64 / pi5) and drm (linuxfb
path on Pi 4) both go through Mesa GL. The render-GID mirror was
inside the cage branch of start_viewer.sh, so Pi 4's mpv ran as
viewer user, hit the render node owned by GID 992, got
"Permission denied", and bailed with "Failed initializing any
suitable GPU context!".

Hoist the render-GID setup above the per-board case so it runs
for every Qt 6 board. cage / linuxfb branching stays as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): Pi 4 stays on --vo=drm (Qt linuxfb DRM master contention)

Earlier commits switched Pi 4 to mpv --vo=gpu --gpu-context=drm
based on a 3-6 vo-drop/30 s measurement. That test was run as
root in a fresh container — no Qt linuxfb in the picture. In
the production viewer where AnthiasWebview holds the framebuffer
via Qt linuxfb, --vo=gpu fails:

  failed to open /dev/dri/renderD128: Permission denied
  [vo/gpu/drm] Failed to acquire DRM master: Permission denied
  [vo/gpu] Failed initializing any suitable GPU context!
  Error opening/initializing the selected video_out (--vo) device.
  Video: no video

Mesa GBM holds DRM master persistently and contends with Qt
linuxfb's framebuffer use. mpv's classic --vo=drm has its own
master juggling (briefly grab → render → drop) that coexists
fine with linuxfb — that's why master's existing Pi 4 config
works.

Revert Pi 4 mpv flags to the production master config:
  --vo=drm --drm-mode=1920x1080@60 --vd-lavc-threads=4

The standalone perf-finding from this branch's earlier history
turns out not to apply in production; retracted from the
roll-up. Pi 5 / x86 / arm64 unchanged (they're on cage +
--vo=gpu --gpu-context=wayland, which has its own DRM master
flow via cage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): cage opens on the first connected connector, not HDMI-A-1

Without `-o`, cage uses whatever output the DRM backend enumerates
first — typically HDMI-A-1 on Pi 5 (closer to USB-C) and the
on-board panel / first HDMI on x86 / arm64. If the operator plugs
into the *other* port (Pi 5 HDMI-A-2, or any DP connector on
x86), cage renders to a disconnected connector and the screen
stays black.

start_viewer.sh now iterates /sys/class/drm/card*-*, picks the
first connector whose status reads "connected", strips the
cardN- prefix to get the bare name cage expects (HDMI-A-1,
HDMI-A-2, DP-1, eDP-1, …), and passes it via `-o`. Falls back to
letting cage pick if nothing is connected yet — the display may
come up via HPD after cage starts, or this is a build/CI host
with no display at all.

Caught while end-to-end testing on the rig: Pi 5 cable on
HDMI-A-2 went to a black screen even though `cat
/sys/class/drm/card1-HDMI-A-2/status` reported "connected" and
cage / the viewer were running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(viewer): mpv from apt.raspberrypi.com on Pi 4 / Pi 5, hwdec auto-copy

Stock Debian Trixie's mpv 0.40 is compiled without `v4l2request`
hwdec, so Pi 5's Hantro stateless decoder is invisible to it and
mpv falls back to software decode for every H.264 / H.265 source.
Pi 4's V4L2 M2M decoder is reachable via `v4l2m2m-copy` but mpv's
`--hwdec=auto-safe` whitelist explicitly excludes that method, so
auto-detect picked software there too.

Two changes, applied together because they only make sense
together:

* Pi 4 / Pi 5 viewer images now pull mpv (and the FFmpeg library
  family it depends on) from `archive.raspberrypi.com/debian
  trixie main`. The Pi-tuned build ships `v4l2request` hwdec
  (Pi 5) and a maintained `v4l2m2m-copy` (Pi 4). An apt-pin
  restricts the Pi repo to the mpv + libav* packages only, so
  curl / ca-certificates / etc. continue to come from stock
  Debian and the rest of the image stays on the same baseline.
* `MPVMediaPlayer.play()` switches `--hwdec=auto-safe` →
  `--hwdec=auto-copy`. auto-copy is the same family but with a
  broader whitelist that *includes* the v4l2-family copy hwdecs.
  Net effect: x86 still picks vaapi-copy (unchanged), Pi 4 picks
  v4l2m2m-copy, Pi 5 picks v4l2request, arm64 falls through to
  software (no v4l2request in stock Debian mpv, no vendor-tuned
  Rockchip plugin in stock either — Tier-2 follow-up).

Plus an `ANTHIAS_DEBUG_DROPS=1` env knob: when set on the viewer
container, mpv's stdout/stderr go to `/data/.anthias/mpv.log`
(host-bound) instead of `/dev/null`, and `--no-terminal` is
dropped so the status line ("AV: ... Dropped: N") is emitted.
Lets us read per-asset frame-drop counts straight from the
production viewer pipeline (no custom harness, no rebuild)
during the test-grid runs. Default (unset) preserves the silent
behaviour.
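
A rough sketch of how such a knob could be wired, assuming the check is
a plain `== "1"` comparison. The helper name `mpv_output_target` is
hypothetical and is not taken from media_player.py.

```python
import subprocess

MPV_LOG_PATH = "/data/.anthias/mpv.log"  # path quoted in this message


def mpv_output_target(env, log_path=MPV_LOG_PATH):
    """Return (stdout/stderr target, extra mpv args) for one launch."""
    if env.get("ANTHIAS_DEBUG_DROPS") == "1":
        # Debug mode: append to the host-bound log and keep the terminal
        # status line so the "Dropped: N" counter is emitted.
        return open(log_path, "ab"), []
    # Default: stay silent.
    return subprocess.DEVNULL, ["--no-terminal"]
```

The first element is meant to be passed as both `stdout=` and `stderr=`
to `subprocess.Popen`; the caller owns the file handle and should close
it when mpv exits.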

Also: drops the `cage -o <connector>` autodetect attempt — cage
0.1.x in Trixie doesn't accept `-o`, just `-m last`. Use that
instead so cage opens on the most-recently-connected output
regardless of HDMI-A-N enumeration order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): use deb-packaged Pi keyring for archive.raspberrypi.com

apt update against http://archive.raspberrypi.com/debian trixie
was failing in the Pi 4 / Pi 5 viewer image builds:

  Sub-process /usr/bin/sqv returned an error code (1):
  Signing key on CF8A1AF502A2AA2D763BAE7E82B129927FA3303E is not
  bound: No binding signature at time …
  Policy rejected non-revocation signature (PositiveCertification)
  requiring second pre-image resistance
  SHA1 is not considered secure since 2026-02-01

Pi's bare `raspberrypi.gpg.key` URL still serves the original
2012-vintage RSA 2048 key with SHA1 binding signatures that
Trixie's sqv refuses to certify under the post-2026-02-01
crypto policy. The deb-packaged keyring inside
`raspberrypi-archive-keyring_2025.1+rpt1_all.deb` ships the
*same* key fingerprint but with rebuilt binding signatures
that sqv accepts — that's the keyring Pi OS Trixie itself
installs, which is why `apt update` against this exact repo
works on a real Pi 5 device today.

Fetch the deb directly with curl, extract its bundled
`.pgp` keyring, and point `signed-by=` at the installed copy.
The pin block restricts what packages the Pi repo can supply
(mpv + libav* + ffmpeg + libpostproc — the FFmpeg family),
so the rest of the image keeps its stock-Debian baseline.

Also extend the pin to cover libpostproc* and ffmpeg, since
mpv's apt deps drag those into the Pi-tagged version on
install; without the pin extension, apt rejected the resolve
with "broken packages".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(viewer): per-codec hwdec on Pi via Lua hook

mpv 0.40's `--hwdec` accepts a single value at startup, so we
can't ask it to try v4l2m2m-copy for H.264 *and* drm-copy for
HEVC out of the box. The Pi-tuned mpv from
archive.raspberrypi.com supports both hwdec methods but each
covers a different codec subset:

* v4l2m2m-copy — Pi 4's V3D V4L2 M2M decoder. H.264 works; Pi
  5's Hantro G2 is V4L2-stateless-only so this no-ops there.
* drm-copy — FFmpeg's `v4l2_request_hevc` hwaccel. HEVC only,
  works on both Pi 4 and Pi 5.

Add a small `on_load` Lua hook (inlined as `_PI_HWDEC_LUA`,
written to /tmp on first play(), loaded with `--script=`) that
checks `video-codec-name` and picks the right hwdec at file
open. Net effect:

  Pi 4 H.264 → v4l2m2m-copy   (HW)
  Pi 4 HEVC  → drm-copy       (HW)
  Pi 5 H.264 → v4l2m2m-copy   (no device, falls back to SW
                                — only path until mpv re-adds
                                v4l2_request_h264 hwdec)
  Pi 5 HEVC  → drm-copy       (HW)

The base `--hwdec=auto-copy` startup value still applies on
x86 / arm64 (vaapi-copy on Intel/AMD; software fall-back on
Rockchip), where the hook isn't loaded.

Verified on real hardware:
  $ mpv ... --script=/tmp/anthias-pi-hwdec.lua test_hevc.mp4
  [pi-hwdec] codec=hevc -> hwdec=drm-copy
  Using hardware decoding (drm-copy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer,server): HW-decode everywhere on Pi 4 / Pi 5 / x86

The previous per-codec Lua hook in media_player.py was a silent no-op:
mpv's video-codec-name property is empty at every script event before
hwdec init (on_load, on_preloaded), so --hwdec=auto-copy leaked through.
auto-copy's upstream whitelist excludes v4l2m2m-copy, so H.264 on Pi 4
fell back to software despite the V3D V4L2 M2M decoder being available.

Viewer (src/anthias_viewer/media_player.py)

- Replace the Lua hook with ffprobe-driven dispatch from Python at
  launch time. ffprobe is in the viewer image; the call is ~50 ms.
- Per-board mapping: Pi 4 → {h264: v4l2m2m-copy, hevc: drm-copy};
  Pi 5 → {hevc: drm-copy}. Pi 5 H.264 falls back to auto-copy
  because mpv has no v4l2-request H.264 hwdec for the Hantro G1,
  and passing v4l2m2m-copy there just logs "Could not find a valid
  device" before SW-falling-back.
- Live-verified on Pi 4: "Using hardware decoding (v4l2m2m-copy)"
  for 1080p H.264 and "Using hardware decoding (drm-copy)" for
  HEVC at 1080p and 4K.
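
The per-board mapping above amounts to a two-level lookup with
`auto-copy` as the fallback. A minimal sketch, with hypothetical names
(`HWDEC_BY_BOARD`, `pick_hwdec`) and the assumption that the probed
codec arrives as a lowercase string, or `''` when ffprobe fails:

```python
DEFAULT_HWDEC = "auto-copy"

# Boards absent from this table never probe and keep the default.
HWDEC_BY_BOARD = {
    "pi4-64": {"h264": "v4l2m2m-copy", "hevc": "drm-copy"},
    "pi5": {"hevc": "drm-copy"},
}


def pick_hwdec(device_type: str, codec: str) -> str:
    """Pick the --hwdec value for one file on one board."""
    return HWDEC_BY_BOARD.get(device_type, {}).get(codec, DEFAULT_HWDEC)
```

Because the fallback lives in the lookup itself, a failed probe and an
unlisted board take the same path.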

Asset processor (src/anthias_server/processing.py)

- Pi 5 profile drops H.264 from passthrough_video_codecs — Pi 5
  has no mpv H.264 HW path, so H.264 uploads must transcode to HEVC
  at upload time to keep the HW-decode-everywhere contract.
- Pi 4 profile adds passthrough_video_max_pixels for H.264, capped
  at 1080p (1920*1080). 4K H.264 clears the codec gate but the V3D
  H.264 envelope tops out at 1080p60, so the cap forces it through a
  libx265 re-encode at upload time. HEVC keeps no cap (the
  dedicated HEVC block handles 4Kp60).
- _ffprobe_summary now returns video_pixels alongside codec /
  container / audio_codec; _video_can_passthrough enforces the
  per-codec pixel cap when the profile declares one.
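
The gate described above can be sketched as follows. The field names
come from this message; representing profiles and summaries as plain
dicts, and the function name without its leading underscore, are
assumptions made for illustration.

```python
PI4_PROFILE = {
    "passthrough_video_codecs": {"h264", "hevc"},
    "passthrough_video_max_pixels": {"h264": 1920 * 1080},
}
PI5_PROFILE = {
    "passthrough_video_codecs": {"hevc"},
}


def video_can_passthrough(profile: dict, summary: dict) -> bool:
    """True when the upload can be served as-is, without a re-encode."""
    codec = summary.get("codec")
    if codec not in profile["passthrough_video_codecs"]:
        return False
    cap = profile.get("passthrough_video_max_pixels", {}).get(codec)
    if cap is None:
        return True  # no per-codec cap declared for this codec
    pixels = summary.get("video_pixels")
    # Unknown dimensions cannot be proven to fit, so they fail the cap.
    return pixels is not None and pixels <= cap
```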

Tests

- test_media_player.py: new per-board hwdec tests (Pi 4 H.264 →
  v4l2m2m-copy; Pi 5 H.264 → auto-copy; both → drm-copy for HEVC;
  auto-copy fallback when ffprobe fails; no probe on x86 / arm64).
- test_processing.py: matrix tests updated to include video_pixels;
  parametrised rows now exercise Pi 5 H.264-no-passthrough and the
  Pi 4 4K H.264 cap. New end-to-end tests prove
  _run_video_normalisation transcodes Pi 5 H.264 → HEVC and Pi 4
  4K H.264 → HEVC.

Docs (docs/board-enablement.md, new)

- Goal + per-board HW-decode capability table.
- Asset processor codec policy spelled out as a contract.
- BBB test bed recipe (source clips, libx265 transcode commands,
  ANTHIAS_DEBUG_DROPS=1, mpv.log slicing).

Follow-up: Pi 5 4K HEVC HW

The Hantro G2 decoder can't allocate 4K dst buffers from Pi 5's
default 64 MB CMA ("v4l2_request_hevc_start_frame: Failed to get
dst buffer") and SW-falls-back. Adding cma=512M to the kernel
cmdline does NOT work — the kernel takes the cmdline value over
the device-tree linux,cma node, orphaning rpi-hevc-dec ("Failed
to probe hardware -517") and unpopulating /dev/video*, which
kills HEVC HW at every resolution. The right fix is a
dtparam/dtoverlay in /boot/firmware/config.txt that resizes the
existing DT-declared region without orphaning the codec's
reserved-mem reference. Until that lands, the pi5 profile should
downscale 4K → 1080p HEVC. Documented in cmdline.txt.j2 and
docs/board-enablement.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(viewer,server): mock _probe_video_codec; fix mypy on Popen IO types

CI failures on the previous commit (bb27b186) came from:

* ``subprocess.run`` inside ``_probe_video_codec`` blowing up under
  the existing ``mpv`` fixture, which patches ``subprocess.Popen``
  to a MagicMock. ``subprocess.run`` internally instantiates Popen
  for the ffprobe shellout, gets a MagicMock back, then trips on
  unpacking communicate()'s result. Fixed by default-mocking
  ``_probe_video_codec`` in the fixture (returns '' so dispatch
  falls back to 'auto-copy', preserving legacy assertions) and
  layering the same mock onto the standalone rotation tests that
  build MPVMediaPlayer outside the fixture.

* ``ruff format``: the multi-line ffprobe arg list in
  ``_probe_video_codec`` needed splitting one-arg-per-line.

* ``mypy``: typing the popen_stdout / popen_stderr locals as
  ``object`` couldn't satisfy any Popen overload. Switched to
  ``int | IO[bytes]`` which covers both the DEVNULL / STDOUT
  sentinels and the bind-mounted mpv.log file handle.

* ``test_passthrough_containers_match_real_ffprobe_format_names``
  was pinned to the pi5 profile to exercise the H.264 + HEVC
  passthrough path; pi5 no longer passthroughs H.264, and the
  fake summary it constructs has no width/height (so pi4-64's
  cap fails it too). Switched the pin to x86, which has no
  per-codec caps — the test is about *container* recognition, not
  codec/resolution gating.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server): downscale 4K HEVC → 1080p on Pi 5 (CMA workaround)

Pi 5's Hantro G2 HEVC decoder is rated for 4Kp60 but the stock 64 MB
CMA on Pi OS can't fit a 4K HEVC dst-buffer pool — at 4K mpv hits
``v4l2_request_hevc_start_frame: Failed to get dst buffer`` and
silently SW-falls-back. Bumping cma= on the kernel cmdline orphans
``rpi-hevc-dec`` entirely (the kernel takes the cmdline value over
the device-tree linux,cma node, leaving the driver returning
``Failed to probe hardware -517``), so the kernel-side knob isn't
available without a dtoverlay change.

Until that follow-up lands, the asset processor caps Pi 5 HEVC at
1080p both ways:

* ``passthrough_video_max_pixels`` gates 4K HEVC uploads out of
  passthrough — anything wider than 1920×1080 falls through to a
  re-encode.
* New ``transcode_video_max_pixels`` per-codec field tells
  ``_transcode_to_target`` to emit a
  ``-vf scale='if(gt(ih,1080),-2,iw)':'min(ih,1080)'`` filter that
  caps height at the 16:9 budget (cap_h = floor(sqrt(cap × 9/16))).
  Portrait 4K → 1080p height; landscape 4K → 1920×1080. Sub-1080p
  sources are untouched (the ``min()`` guard prevents upscale; ``-2``
  on width keeps libx265 happy with even dimensions).
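
As a worked check of the cap_h formula: a 1920×1080 pixel budget gives
floor(sqrt(2073600 × 9/16)) = floor(sqrt(1166400)) = 1080. The sketch
below reproduces that arithmetic and the filter string; `cap_height` and
`scale_filter` are hypothetical helper names, not the ones in
processing.py.

```python
import math


def cap_height(max_pixels: int) -> int:
    """Height of the 16:9 frame whose area is max_pixels, rounded down."""
    # isqrt over the floored product equals floor(sqrt(max_pixels * 9 / 16)).
    return math.isqrt(max_pixels * 9 // 16)


def scale_filter(max_pixels: int) -> str:
    """Build the ffmpeg -vf value that caps height without upscaling."""
    h = cap_height(max_pixels)
    return f"scale='if(gt(ih,{h}),-2,iw)':'min(ih,{h})'"
```

For a 1920×1080 budget, `scale_filter(1920 * 1080)` yields the same
expression quoted in the bullet above.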

Pi 4 / x86 don't carry the cap (their HW decoders handle 4Kp60
cleanly), so the filter stays absent from those profiles.

Tests cover (a) the new pi5+hevc+4K row in the parametrised
passthrough matrix (False at 4K, True at 1080p), (b) ffmpeg argv
shape: -vf scale=... emitted for pi5 HEVC, absent for pi4-64 HEVC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer,system): Pi 5 4K HEVC HW + display-resampled VO sync

Two tied changes that move every supported board to clean HW
decode at the source's actual framerate.

Pi 5 4K HEVC via cma-512
------------------------

Pi OS for Pi 5 reserves 64 MB of CMA by default. The Hantro G2
HEVC decoder needs a buffer pool large enough to hold several 4K
dst frames (each ~12 MB) plus reference frames, so the stock
allocation can fit 1080p HEVC but not 4K — at 4K mpv hits
``v4l2_request_hevc_start_frame: Failed to get dst buffer`` and
silently SW-falls-back.

Adding ``cma=512M`` to /boot/firmware/cmdline.txt does NOT work:
the kernel takes the cmdline value over the device-tree
``linux,cma`` node, which orphans ``rpi-hevc-dec`` entirely
(returns ``Failed to probe hardware -517`` and ``/dev/video*``
disappears, killing HEVC HW at every resolution).

The Pi-OS-blessed merge is ``dtoverlay=vc4-kms-v3d,cma-512`` in
/boot/firmware/config.txt — the v3d overlay carries its own
``cma-N`` parameter that resizes the DT linux,cma node in place
without orphaning the codec driver. A standalone
``dtoverlay=cma,cma-512`` silently no-ops on Pi 5 because the
v3d overlay initialises the CMA region first; reusing the v3d
overlay's parameter is the documented way to merge them.

ansible/roles/system/templates/config.txt.j2 now emits the
``,cma-512`` parameter on Pi 5 only — Pi 4 already gets 512 MB
CMA by default so the override is a no-op there. The earlier
attempt at a kernel-cmdline cma= override (in cmdline.txt.j2) is
removed; the file's comment now points readers at the correct
config.txt path.

Live-verified on Pi 5: CmaTotal=512MB after the overlay change,
/dev/video* present, rpi-hevc-dec probes cleanly. Asset processor
pi5 profile no longer carries a HEVC pixel cap — Pi 5 can decode
HEVC at its silicon's real capability.

mpv --video-sync=display-resample
---------------------------------

mpv 0.40 defaults to ``--video-sync=audio`` which syncs the video
clock to the audio clock and drops VO frames when the two drift.
On every board tested (Pi 4 --vo=drm, Pi 5 + x86 --vo=gpu
--gpu-context=wayland) this produced 60–90% VO drops on 60 fps
content even when the decoder reported healthy HW decode
(``Using hardware decoding (...)`` banner present, no decoder
errors). The drops were at the VO, not the decoder.

``--video-sync=display-resample`` flips the relationship: sync
video to the display refresh and resample audio to match. Audio
resampling is a <1% CPU 2-channel job and most signage clips
have no audible content anyway, so it's effectively free; the
benefit is clean playback at the source's frame rate.

Test bed touched
----------------

* test_play_invokes_popen_with_expected_args_on_pi4_64: argv
  now includes ``--video-sync=display-resample``.
* test_video_can_passthrough_respects_board_codec_set: pi5 +
  hevc + 4K is now ``True`` (passthrough) because the CMA fix
  lets the silicon do its rated job. Comment updated to point
  at config.txt.j2.
* Removed the transient downscale-on-Pi 5 codepath
  (``transcode_video_max_pixels`` field, the
  ``-vf scale='if(gt(ih,...))':...`` filter, and the two tests
  asserting it) — that was a workaround for the CMA issue and
  is no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server): introduce PlaybackEnvelope dataclass + matrix + cache

Foundation for the per-board playback envelope rollout (see
/home/ubuntu/.claude/plans/serene-munching-gem.md). No behaviour
change yet — wires up the canonical source of truth that
processing.py, celery_tasks.py's future re-render walker, and the
viewer's hwdec dispatch will all read from in the next commit.

src/anthias_server/playback_envelope.py (new)
---------------------------------------------

Frozen dataclass ``PlaybackEnvelope`` carrying codec / max_width /
max_height / max_fps plus a fixed ``container_ext = 'mp4'``.
``ENVELOPE_BY_DEVICE_TYPE`` maps every supported board:

* pi2 / pi3 / arm64 → H.264 1920x1080 30 (no HEVC silicon /
  no upstream mpv HW path)
* pi4-64 / pi5 / x86 → HEVC 3840x2160 60 (dedicated HEVC block
  or VAAPI; fleet uniformity so the same upload produces
  bit-identical variants on every board)

``compute_envelope()`` resolves the current process's envelope
from DEVICE_TYPE; unset / unknown / mixed-case / whitespace all
fall back to the conservative default (H.264 1080p30).

``load_cached()`` / ``save_cached()`` round-trip the envelope to
``~/.anthias/playback-envelope.json``. An unusable cache (missing
file, bad JSON, unsupported codec) returns ``None`` so the caller
recomputes and overwrites, which means a hand-edit that breaks the
file self-heals on next start. ``save_cached`` writes atomically
via temp-file + rename.
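
The round-trip could look roughly like the following. This is a
sketch under assumptions: the envelope is passed as a plain dict, the
path is an explicit argument, and ``SUPPORTED_CODECS`` is a
hypothetical name for the validation set:

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

SUPPORTED_CODECS = {'h264', 'hevc'}  # assumed validation set


def save_cached(path: Path, envelope: dict) -> None:
    """Write the envelope atomically: temp file in the same dir, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as fh:
            json.dump(envelope, fh)
        os.replace(tmp_name, path)  # atomic on the same filesystem
    except BaseException:
        # Never leave a stray .tmp behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_cached(path: Path) -> Optional[dict]:
    """Return the cached envelope, or None when the cache is unusable."""
    try:
        payload = json.loads(path.read_text())
    except (OSError, ValueError):  # missing file or bad JSON
        return None
    if not isinstance(payload, dict):
        return None
    if payload.get('codec') not in SUPPORTED_CODECS:
        return None
    return payload
```

Returning ``None`` for every failure keeps the caller's logic to one
branch: recompute, then ``save_cached`` over whatever was there.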

src/anthias_server/processing.py
--------------------------------

``_ffprobe_summary`` now returns ``video_fps`` alongside the
existing keys. The next commit (Phase 2) uses this to decide
whether to emit ``-r envelope.max_fps`` — the cap is one-way, so
sub-cap source rates pass through unchanged. r_frame_rate is
parsed as a rational ``num/den``; unparseable / zero-denominator
collapses to ``None`` so the caller treats source fps as
"unknown" and skips the gate.

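For illustration, the rational parse can be sketched as below. The
function name is hypothetical, and rejecting non-positive values is an
assumption layered on the "unparseable / zero-denominator" rule:

```python
from typing import Optional


def parse_r_frame_rate(value: object) -> Optional[float]:
    """Parse ffprobe's ``num/den`` r_frame_rate; None means "unknown"."""
    if not isinstance(value, str) or '/' not in value:
        return None
    num_text, _, den_text = value.partition('/')
    try:
        num, den = int(num_text), int(den_text)
    except ValueError:
        return None
    # Zero denominator (and, as an assumption, non-positive rates) are
    # treated as unknown so the caller skips the fps gate.
    if den <= 0 or num <= 0:
        return None
    return num / den
```
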
tests
-----

* tests/test_playback_envelope.py (new): matrix coverage; unset /
  unknown / cased / whitespace inputs; cache round-trip; missing
  / corrupt JSON / invalid-payload recovery; atomic write
  (no leaked .tmp); container_ext invariant.
* tests/test_processing.py: video_fps cases (integer rates, NTSC
  fractional rates 30000/1001 + 60000/1001, and bogus / no-slash /
  zero-denominator inputs); the two ``assert summary == { ... }``
  ffprobe-recovery tests now include the new ``video_fps: None``
  key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server): envelope-driven asset processor with sibling-original

Refactor ``processing.py`` so every video upload produces a
variant matching the board's playback envelope while preserving
the source as a sibling ``.original.<ext>`` file. Rotation is now
gapless by construction — every variant on disk shares one codec /
max resolution / max fps per board, so the viewer's output mode
never has to switch mid-clip.

src/anthias_server/processing.py
--------------------------------

* Replace ``_BOARD_PROFILES`` + ``_resolve_board_profile`` +
  ``_PI4_H264_MAX_PIXELS`` + ``_BoardProfile`` typedef with
  ``compute_envelope()`` from the new ``playback_envelope`` module
  (landed in 0b6bea0c). One canonical source of truth for "what
  every variant on disk looks like".

* ``_ffprobe_summary`` now returns per-axis dimensions
  (``video_width``, ``video_height``) alongside the existing
  ``video_pixels`` total. The envelope check is per-axis so an
  ultrawide source (e.g. 5760×1080) gets caught by the width cap
  even though its total pixel count is below 4K's.

* ``_video_can_passthrough(summary, envelope)`` is the new
  contract: passthrough iff (a) container is mp4, (b) codec
  matches envelope.codec exactly, (c) both axes are within the
  envelope cap, (d) source fps is at-or-under envelope.max_fps,
  (e) audio is demuxer-compatible. Any None in source dims / fps
  bails to transcode (we don't gamble on unsized clips).

* ``_transcode_to_target(input, output, envelope=None,
  source_summary=None)`` emits the smallest set of flags that
  lands the output inside the envelope. ``-vf scale=...`` only
  when source > envelope on either axis; ``-r envelope.max_fps``
  only when source fps > cap. The fps cap is one-way — we never
  up-convert a sub-cap source. New helper
  ``_video_args_for_codec`` picks libx264 / libx265 from the
  envelope's codec.

* ``_run_video_normalisation`` reorganised around the sibling-
  original pattern:
  - Fresh upload / legacy asset: rename ``Asset.uri`` to
    ``<base>.original.<ext>`` (the source-preservation step).
  - Re-render: read from the existing ``.original.*`` sibling
    instead.
  - Re-probe from the (possibly new) source location.
  - Passthrough branch: copy source → variant slot bitwise
    (cross-device fleet sha256 stays equal).
  - Transcode branch: staging-file render with the existing
    atomic-replace contract.
  - Stamp ``metadata['original_uri']`` (path to sibling),
    ``metadata['envelope']`` (envelope dict the variant matches).
    ``metadata['transcode_target']`` kept as the
    ``envelope.codec`` duplicate for one release of back-compat
    with the serializer surface.
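
A compact sketch of the passthrough gate and the one-way clamps,
using plain dicts. ``video_width`` / ``video_height`` / ``video_fps``
are the summary keys named above; ``container``, ``video_codec`` and
``audio_ok`` are hypothetical key names, and the exact scale filter
string used by ``processing.py`` is not shown here, so the one below
is only an assumption:

```python
def video_can_passthrough(summary: dict, envelope: dict) -> bool:
    """True only when the source already sits inside the envelope."""
    width = summary.get('video_width')
    height = summary.get('video_height')
    fps = summary.get('video_fps')
    # Any unknown dimension or rate bails to transcode.
    if width is None or height is None or fps is None:
        return False
    return bool(
        summary.get('container') == envelope['container_ext']  # (a)
        and summary.get('video_codec') == envelope['codec']  # (b)
        and width <= envelope['max_width']  # (c), per axis
        and height <= envelope['max_height']
        and fps <= envelope['max_fps']  # (d)
        and summary.get('audio_ok')  # (e)
    )


def clamp_flags(summary: dict, envelope: dict) -> list:
    """Smallest ffmpeg flag set that lands the output inside the envelope."""
    flags = []
    width = summary.get('video_width')
    height = summary.get('video_height')
    fps = summary.get('video_fps')
    max_w, max_h = envelope['max_width'], envelope['max_height']
    if width is not None and height is not None:
        if width > max_w or height > max_h:
            scale = 'scale=w={}:h={}:force_original_aspect_ratio=decrease'
            flags += ['-vf', scale.format(max_w, max_h)]
    # One-way cap: only clamp when the source is faster than the envelope.
    if fps is not None and fps > envelope['max_fps']:
        flags += ['-r', str(envelope['max_fps'])]
    return flags
```

With the H.264 1920x1080 30 envelope, a 5760x1080 source fails the
width gate and picks up ``-vf``, while a 1080p30 source yields an
empty flag list.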

Tests
-----

* ``test_video_can_passthrough_decision_table`` recast against
  the H.264 1920×1080 30 default envelope. Each row tests one
  gate (codec / per-axis dim / fps / audio / unknowns / probe
  gaps) without overlap.
* ``test_video_can_passthrough_respects_envelope`` end-to-end:
  pin ``DEVICE_TYPE``, build a summary at the given
  (codec, w, h, fps), assert the verdict. Replaces the legacy
  ``..._respects_board_codec_set``.
* ``test_transcode_to_target_emits_scale_when_source_oversize``,
  ``..._emits_fps_clamp_when_source_fast``,
  ``..._omits_clamps_when_source_at_envelope``: pin the smallest
  ffmpeg flag set per source / envelope combination.
* ``_envelope_summary`` helper at the top of the file shortens
  the per-test summary construction.
* Mock signatures for ``_transcode_to_target`` updated to accept
  the new ``envelope`` / ``source_summary`` kwargs.
* ``test_resolve_board_profile_picks_target_codec_per_board``
  deleted — equivalent coverage is in tests/test_playback_envelope.py
  against ``compute_envelope`` directly.

Stale doc / comment references to ``_BOARD_PROFILES`` /
``_resolve_board_profile`` updated to point at
``playback_envelope.ENVELOPE_BY_DEVICE_TYPE`` /
``compute_envelope``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server): re-render walker + startup envelope reconciler

* New celery task `regenerate_for_envelope_change`: walks
  `Asset.objects.filter(mimetype='video')` and queues
  `normalize_video_asset` for any row whose
  `metadata['envelope']` no longer matches the current envelope.
  Malformed payloads, missing keys, and per-row exceptions are
  logged but don't stop the walker.
* New `AnthiasAppConfig.ready` hook -> `app/startup.py:
  run_envelope_check`: compares cached vs computed envelope,
  persists fresh, dispatches the walker on mismatch. Short-circuits
  under `ENVIRONMENT=test` / `PYTEST_CURRENT_TEST` so pytest runs
  don't enqueue stray walkers. Celery dispatch failure is logged
  but non-fatal -- the cache is already saved, so the next start
  sees the new envelope on disk and recovers.
* Tests cover: skip-in-envelope, queue-stale, legacy migration
  (no envelope key), image-asset skip, force-requeue, malformed
  payload recovery, continue-after-per-row-failure, every
  hook code path (test short-circuit, no-cache, match, mismatch,
  dispatch failure, corrupt cache).
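
The walker's selection step can be pictured as below. This is a
sketch over plain dict rows rather than the Django queryset; the
function name and row shape are hypothetical, and treating a missing
or malformed envelope as stale is an assumption based on the
legacy-migration case:

```python
import logging

log = logging.getLogger(__name__)


def stale_asset_ids(rows, current_envelope: dict) -> list:
    """Return ids of rows whose recorded envelope differs from the current one."""
    stale = []
    for row in rows:
        try:
            metadata = row['metadata']
            recorded = (
                metadata.get('envelope') if isinstance(metadata, dict) else None
            )
            if recorded != current_envelope:
                stale.append(row['asset_id'])
        except Exception:
            # A bad row is logged and skipped; the walk continues.
            log.exception('skipping row during envelope walk')
    return stale
```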

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server): preserve `.original.<ext>` siblings during orphan sweep

The Celery ``cleanup`` task built its "referenced" set only from
``Asset.uri``. With sibling-original storage, the source bytes live
at ``metadata['original_uri']`` (e.g. ``<id>.original.mov``) while
``Asset.uri`` points at the playback variant (``<id>.mp4``). Without
this fix, every video upload's ``.original.<ext>`` is unreferenced
once the variant lands, so as soon as it ages past the 1h mtime
guard the next hourly sweep silently deletes it, which breaks the
re-render walker as soon as the envelope changes.

* ``cleanup``: union ``Asset.uri`` ∪ ``metadata['original_uri']``
  into the referenced set, tolerant of legacy rows with non-dict
  metadata.
* Tests cover the new claim path + the malformed-metadata
  fallback so a stray ``metadata=None`` row can't crash the sweep.
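
Roughly, the referenced set is built like this. The sketch takes
``(uri, metadata)`` pairs instead of model instances, and the function
name is hypothetical:

```python
def referenced_paths(rows) -> set:
    """Union Asset.uri with metadata['original_uri'] for every row."""
    referenced = set()
    for uri, metadata in rows:
        if uri:
            referenced.add(uri)
        # Tolerate legacy rows whose metadata is not a dict.
        if isinstance(metadata, dict):
            original = metadata.get('original_uri')
            if isinstance(original, str) and original:
                referenced.add(original)
    return referenced
```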

The upload-path serializer itself stays untouched: the existing
``rename(tmp, <id><ext>)`` lands the upload at a single path, and
``processing._run_video_normalisation`` handles the
rename-to-``.original.<ext>`` atomically on first run. No double-
write, no extra disk traffic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(server): cover sibling-original storage across normalisation paths

Adds five tests pinning the ``.original.<ext>`` + variant contract
that the envelope walker depends on:

* fresh upload → ``<id>.original.<src_ext>`` created next to
  ``<id>.mp4``; ``metadata['original_uri']`` + ``metadata['envelope']``
  populated.
* re-render → ``.original.<ext>`` is byte-identical across passes
  (sha256 compared before/after); the walker reads from it and
  never rewrites it.
* passthrough → both files exist even when the source already
  matches the envelope (``shutil.copyfile`` semantics, not rename).
* legacy migration → pre-rollout assets with no ``original_uri``
  key get renamed to ``.original.<ext>`` on first walker pass.
* dangling ``original_uri`` → falls back to treating ``asset.uri``
  as the source-to-preserve; no silent error, no lost variant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(board-enablement): replace codec policy table with playback envelope

* board-enablement.md now documents the envelope matrix as the
  single source of truth shared by the asset processor, the
  re-render walker, and the viewer's hwdec dispatch. The legacy
  ``_BOARD_PROFILES`` / ``passthrough_video_codecs`` vocabulary has
  been removed -- it never matched what ``processing.py`` does
  post-envelope.
* Calls out the ``<id>.original.<src_ext>`` + ``<id>.mp4`` sibling
  layout, the metadata keys the walker reads, and the cross-board
  fleet sha256 expectation.
* Pi 5 CMA quote rewritten: the real fix is
  ``dtoverlay=vc4-kms-v3d,cma-512`` in config.txt, not a downscale
  workaround. Kernel cmdline ``cma=`` is documented as the broken
  path it actually is.
* Failure-mode list updated for envelope-driven dispatch (off-
  envelope variant, display refresh ceiling, walker storm on
  unwritable cache, sha256 fleet divergence).
* ``media_player.py`` comment block: updates the Pi 5 H.264 →
  auto-copy and HEVC → drm-copy comments to reference the playback
  envelope by name and point at the correct CMA fix (config.txt
  dtoverlay, not cmdline.txt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): mypy on `_make_video_asset` + boolean is_enabled

* `dict` annotations get explicit `dict[str, Any]` parameters
  (Anthias's mypy config sets `disallow_any_generics`).
* `is_enabled=1` → `is_enabled=True` so the Asset field's bool
  type matches mypy's view of django-stubs models.
* Adds the missing ``typing.Any`` import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server,tests): envelope-aware container gate + startup hook safety

Run 1 of CI surfaced several issues in the envelope refactor:

* **MP4 family container detection.** ffprobe reports an MP4 file's
  ``format_name`` as ``mov,mp4,m4a,3gp,3g2,mj2`` (``mov`` first
  because the QuickTime/MP4 demuxer is one codepath). The envelope
  gate compared the source container to ``envelope.container_ext``
  by exact equality, so every MP4 upload was rejected at the
  container gate even though the bytes are exactly what we'd
  write. Adds ``_MP4_FAMILY_CONTAINERS`` and special-cases ``mp4``
  envelope to accept any synonym.
* **Celery workers were running ``run_envelope_check``.**
  ``celery_tasks.py`` top-level-calls ``django.setup()``, which
  fires ``AppConfig.ready`` in every process that imports it,
  including the celery worker -- the previous comment in ``apps.py``
  was wrong. Two writers race on the cache file and could
  double-queue the walker for a single envelope change. New
  ``_is_celery_worker()`` short-circuit detects the
  ``celery -A ... worker`` invocation via ``sys.argv[0]``.
* **Settings singleton captures HOME at init.**
  ``AnthiasSettings.home`` is set once at module import time, so
  ``monkeypatch.setenv('HOME', tmpdir)`` in tests doesn't reach the
  envelope cache helpers. Updates ``cache_dir`` and ``fake_home``
  fixtures to also patch ``settings.home`` via ``monkeypatch.setattr``.
* **Stale tests.**
  - Drop ``test_cleanup_tolerates_non_dict_metadata`` -- the schema
    enforces ``metadata`` as a non-null JSON dict, so the failure
    mode it claimed to test can't occur. ``cleanup()`` keeps the
    defensive ``isinstance(metadata, dict)`` check as a no-cost
    belt-and-braces.
  - ``test_video_passthrough_for_h264_or_hevc_in_known_containers``
    rewritten as ``test_video_passthrough_when_source_matches_board_envelope``
    -- the old matrix included libx264 on pi4-64 (no longer
    passthrough because pi4-64 is HEVC) and non-mp4 containers
    (always re-encoded now because the variant slot is fixed at
    ``.mp4``).
  - ``test_video_passthrough_records_target_codec`` switches the
    source codec to libx265 so it actually hits the passthrough
    branch on pi4-64.
  - ``test_video_passthrough_uses_summary_duration_no_second_probe``
    rebuilt via ``_envelope_summary`` so the synthesised summary
    carries the new ``video_width / video_height / video_fps``
    fields.
  - The two ``test_ffprobe_summary_handles_*`` early-return shape
    assertions add ``video_width`` / ``video_height`` to match the
    real return shape.
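
The container gate fix amounts to something like the sketch below.
``_MP4_FAMILY_CONTAINERS`` is the name used above; its contents here
are assumed from the ffprobe string quoted in the first bullet, and
``container_matches`` is a hypothetical helper name:

```python
_MP4_FAMILY_CONTAINERS = frozenset({'mov', 'mp4', 'm4a', '3gp', '3g2', 'mj2'})


def container_matches(format_name, container_ext: str) -> bool:
    """Compare ffprobe's comma-separated format_name to the envelope container."""
    if not format_name:
        return False
    names = {part.strip() for part in format_name.split(',')}
    if container_ext == 'mp4':
        # ffprobe reports one shared demuxer name list for the whole family.
        return bool(names & _MP4_FAMILY_CONTAINERS)
    return container_ext in names
```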

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server,tests): drop PYTEST_CURRENT_TEST gate; align stale summaries

Run 2 of CI surfaced three more issues:

* **``PYTEST_CURRENT_TEST`` is not fixture-controllable.** pytest
  re-sets the env var at the start of every test's ``call`` phase,
  so ``monkeypatch.delenv`` in a ``setup`` fixture is overridden
  before the body runs. This made it impossible for any test to
  exercise the real startup hook path. The ``ENVIRONMENT=test``
  gate (set in ``conftest.py`` + the test compose file) is the
  durable, fixture-controllable signal — keep that, drop the
  pytest one. A test for the new ``_is_celery_worker`` short-circuit
  replaces the deleted ``test_short_circuits_when_pytest_current_test``.
* **Decision table parametrise had a wrong expectation.** Summary
  row "HEVC at envelope (codec, dims, fps all match)" was paired
  with ``expected=True``, but the test envelope is H.264 — codec
  mismatch must transcode, ``False``.
* **``test_video_passthrough_skips_duration_when_probe_unavailable``
  summary missed the new dim/fps fields.** Same root cause as
  before: ``_video_can_passthrough`` rejected the synthesised
  summary at the dims gate, the test fell through to a real
  ffmpeg call on a 64-byte stub, and ffmpeg failed with
  "Invalid data found".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(envelope): add generic-arm64 key for Rock Pi / Armbian SBCs

The Anthias install path for Rock Pi 4 / Armbian boards writes
``DEVICE_TYPE=generic-arm64`` (see ``feat(install): generic-arm64
best-effort support``). The matrix only listed ``arm64``, so a
real install fell through to ``_DEFAULT`` — same envelope by
coincidence, but the walker would have logged "no matrix entry"
warnings on every server start and the docs/board-enablement
matrix would be subtly wrong about which key applies.

Lists the key explicitly with the same conservative H.264 1080p30
envelope and extends the parametrise coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server): make celery_tasks.py top-level django.setup() reentrant-safe

``django.setup()`` calls ``apps.populate()``, which raises
``RuntimeError: populate() isn't reentrant`` if invoked while
already populating. The new ``AnthiasAppConfig.ready`` hook imports
``celery_tasks`` to dispatch the walker, which until this change
top-level-called ``django.setup()`` again -- so on every real
server start the import died, the dispatch failed, and the walker
never ran. Live-confirmed on the Pi 4 test bed.

Check ``django.apps.apps.apps_ready`` before calling ``setup()``:
the flag flips to True after the import phase but before per-app
``ready`` hooks run, so the standalone celery worker (where Django
isn't initialised yet) still calls setup() as before, while the
server process (mid-populate) correctly skips the reentrant call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server): commit `original_uri` to DB before transcode (crash safety)

Live-confirmed on the Pi 4 test bed during the envelope rollout:
walker fired on a near-full SD card, ffmpeg ran out of space mid-
render, the on_failure hook cleared ``is_processing`` -- and the
hourly ``cleanup()`` sweep then silently deleted every
``.original.<ext>`` source it had just renamed, because
``Asset.uri`` still pointed at the (now-missing) variant path and
the orphan walker only knew about ``Asset.uri`` + a *committed*
``metadata['original_uri']``.

The metadata accumulator in ``_run_video_normalisation`` only wrote
to the DB at the end of the function, so any failure between
"rename source → .original.<ext>" and "render variant → atomic
replace" left the row's metadata stale.

Fix: persist ``metadata`` to the DB right after the rename, before
attempting any render. The contract becomes: if the file is on
disk under ``.original.<ext>``, the DB row knows it. ``cleanup()``
already reads ``metadata['original_uri']`` into the referenced set
(from ``fix(server): preserve `.original.<ext>` siblings during
orphan sweep``), so this commit closes the only window where that
guard could be bypassed.

Adds ``test_original_uri_persisted_before_render_for_crash_safety``
which mocks ``_transcode_to_target`` to raise and verifies the row
has ``metadata['original_uri']`` committed by the time the
exception propagates.
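
The ordering is the whole fix, so here is a toy sketch of it. A dict
stands in for the DB, ``render`` stands in for
``_transcode_to_target``, and the function name is hypothetical:

```python
import os
import tempfile


def normalise_with_early_commit(asset: dict, db: dict, render) -> None:
    """Rename the source, persist original_uri, and only then render."""
    source = asset['uri']
    base, ext = os.path.splitext(source)
    original = f'{base}.original{ext}'
    os.rename(source, original)

    metadata = dict(asset.get('metadata') or {})
    metadata['original_uri'] = original
    # Commit before any render: if the file is on disk under
    # .original.<ext>, the row already knows about it.
    db[asset['asset_id']] = metadata

    render(original, f'{base}.mp4')  # may raise, e.g. disk full
```

If ``render`` raises, the row still carries ``original_uri``, so the
orphan sweep keeps the source.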

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(board-enablement): script-driven 1-minute sample pack

Previously the test pack was full-length BBB clips (~10 min) plus an
inline ffmpeg recipe in the docs that produced 4K HEVC re-encodes
taking ~30 min on a workstation. The on-device walker then had to
chew through the full-length variants, which on a Pi 4 / Rock Pi
turned a single rotation cycle into hours of wallclock for what was
really a hwdec-banner sanity check.

* New ``bin/generate_board_enablement_testbed.sh``: downloads the
  four BBB H.264 sources, trims each to 60 s with ``-c copy``
  (instant), then libx265-encodes each cut. Idempotent (skips
  files that already pass an ffprobe sanity check) and atomic
  (tmp-then-rename) so a power cycle mid-encode leaves a clean
  state.
* Pack drops from ~3.3 GB / 10 min per clip to ~350 MB / 60 s per
  clip. 60 s is enough to capture mpv's ``hwdec-current`` banner
  and read a stable ``Dropped:`` count, while keeping a full
  walker pass under a few minutes on every supported board.
* ``CUT_SECONDS`` / ``HEVC_CRF`` env knobs override defaults for
  iteration; the table in the doc lists what each clip exercises.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(envelope,viewer): runtime Rock Pi 4 detection unlocks v4l2m2m HW decode

``bin/install.sh`` writes ``DEVICE_TYPE=arm64`` for every aarch64
SBC it doesn't recognise as a Pi — Rock Pi 4, Orange Pi, Allwinner
H6 boards, Amlogic S905 boards all share that one catch-all
DEVICE_TYPE. The matrix can't promote ``arm64`` to HEVC + HW
because most of those boards have no upstream-mpv HW decode path
and would log "Could not find a valid device" on every play.

But the Rock Pi 4 (RK3399 / Radxa) DOES have a working v4l2m2m
driver exposed by the kernel:

  $ docker exec anthias-anthias-viewer-1 mpv --hwdec=help | grep v4l2m2m
    v4l2m2m-copy (h264_v4l2m2m-v4l2m2m-copy)
    v4l2m2m-copy (hevc_v4l2m2m-v4l2m2m-copy)
    v4l2m2m-copy (vp9_v4l2m2m-v4l2m2m-copy)
    ...

and ``/dev/video-dec2`` / ``/dev/video-dec4`` are present (the
v4l2_request decoder symlinks). Leaving Rock Pi on SW decode for
1080p HEVC measurably wastes the silicon.

Resolved at runtime via ``/proc/device-tree/model``:

* New matrix key ``rockpi4`` → HEVC 1920×1080 30. 1080p ceiling
  keeps disk use of the variant + ``.original.<ext>`` sibling
  comfortable on the typical SD card; HEVC codec exercises the
  Hantro path on the way through the viewer.
* ``compute_envelope`` and ``_pi_hwdec_for_uri`` both probe the
  device tree when DEVICE_TYPE is ``arm64`` (or legacy
  ``generic-arm64``). A Rock Pi 4B reports
  ``Radxa ROCK Pi 4B`` and gets upgraded; an Orange Pi or an
  Allwinner H6 board stays on the conservative SW envelope.
* Failure modes (no device tree, decode error, unknown SBC) all
  collapse to ``None`` so dev containers and the existing arm64
  catch-all keep working unchanged.

Four new tests pin:
- Rock Pi model → ``rockpi4`` envelope;
- legacy ``generic-arm64`` label also gets the upgrade;
- unknown SBC keeps the conservative envelope;
- missing ``/proc/device-tree/model`` doesn't raise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(envelope,viewer): publish board subtype via host_agent + Redis

Previous commit (``dde1b20e``) added a runtime ``/proc/device-tree``
read inside the server + viewer containers. Containers don't see
that path by default, and mounting it into every container is
heavier than it's worth for one edge case (worse, balena's
restricted /proc would still trip).

``anthias_host_agent`` already runs on the host and publishes
host-side state to Redis (IP addresses, etc.). It's the right
layer for board identification:

* New ``detect_board_subtype()`` reads
  ``/proc/device-tree/model`` directly (host_agent IS on the
  host) and maps known SBC strings to matrix keys
  (Rock Pi 4A/4B/4C → ``rockpi4``).
* New ``set_board_subtype()`` publishes the resolved key (or the
  empty string for unknown boards) to ``host:board_subtype``
  before ``subscriber_loop`` flips ``host_agent_ready`` — so
  consumers can rely on the key being there once the readiness
  flag is set.
* Server's ``playback_envelope.compute_envelope`` and viewer's
  ``_pi_hwdec_for_uri`` read the same Redis key when DEVICE_TYPE
  is ``arm64`` / legacy ``generic-arm64``. Failure modes (Redis
  down, key missing, decode error) all collapse to ``None`` so
  the caller falls back to the conservative arm64 envelope.
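
A sketch of the detection half, under assumptions: the substring
match and the ``_KNOWN_MODELS`` table are illustrative (only the
``Radxa ROCK Pi 4B`` string is quoted in this PR), and the Redis
publish is left out so the block has no external dependency:

```python
from typing import Optional

# (lowercased substring of the device-tree model, matrix key)
_KNOWN_MODELS = (('rock pi 4', 'rockpi4'),)


def subtype_from_model(raw: bytes) -> Optional[str]:
    """Map raw /proc/device-tree/model bytes to a matrix key, or None."""
    try:
        # The device-tree string is NUL-terminated.
        model = raw.decode('utf-8').rstrip('\x00').strip().lower()
    except UnicodeDecodeError:
        return None
    for needle, subtype in _KNOWN_MODELS:
        if needle in model:
            return subtype
    return None


def detect_board_subtype(path: str = '/proc/device-tree/model') -> Optional[str]:
    """Read the model file on the host; any I/O failure collapses to None."""
    try:
        with open(path, 'rb') as fh:
            return subtype_from_model(fh.read())
    except OSError:
        return None


def published_value(subtype: Optional[str]) -> str:
    """Value written to host:board_subtype (empty string for unknown boards)."""
    return subtype or ''
```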

No compose template changes. The viewer + server containers
already have Redis reachable (they use it for the Channels
layer + walker dispatch already), so the data path is free.

Unit tests pin:
* device-tree → subtype mapping for canonical + variant + edge
  Rock Pi strings, plus unknown boards;
* Redis publish writes the resolved key OR empty string;
* server's compute_envelope reads back through Redis correctly
  for known / unknown / empty / unreachable cases;
* subscriber_loop calls set_board_subtype before flipping
  ``host_agent_ready`` — race-free ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(celery): cap walker to --concurrency=1 so transcodes can't choke playback

Default celery worker concurrency = num_cores. On the boards
Anthias actually ships to (Pi 4 / Pi 5 / Rock Pi 4 / arm64
SBCs), that means up to 4 parallel ``libx265`` encodes sharing
the same SoC as the viewer's mpv process. ``nice -n 19`` +
``ionice -c 3`` are already in place, but nice(1) only helps
when there's CONTENTION -- four ffmpegs at nice 19 still
saturate every core, and each 1080p libx265 encode needs ~500 MB
RAM. A 4 GB SBC pushes into swap well before the walker
finishes, which stalls *everything* on the host -- live-
confirmed on the Rock Pi 4 during this PR: sshd starved through
banner exchange whenever the walker hit a fresh burst.

Asset processing is upload-time, not throughput-bound. The
operator-facing latency that matters is "upload click → asset
visible in rotation", which is bound by ONE encode regardless of
queue parallelism. Serial encodes finish a few minutes later in
wallclock but the viewer never drops a frame.

Applied to every prod / dev compose template. ``docker-compose.test.yml``
is left at default because the test suite never runs live
normalize tasks (the celery service in tests just exercises the
task dispatch plumbing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): force MPV on legacy ``generic-arm64`` DEVICE_TYPE

Rock Pi 4 running an older arm64 image reports
``DEVICE_TYPE=generic-arm64`` (pre-``refactor: rename device_type
generic-arm64 → arm64`` rebuilds). The MediaPlayerProxy
override only force-routed MPV for ``arm64`` / ``pi4-64``, so the
legacy label fell through to VLC -- which then crashed with
``NameError: no function 'libvlc_new'`` because the libvlc lib
isn't installed on the arm64 image. Live-confirmed in the viewer
crash loop on the Rock Pi 4 during this PR.

Adds ``'generic-arm64'`` to the force_mpv set + a test pinning
the dispatch. Covers the in-the-wild rolling-upgrade window
where a Rock Pi 4 deployment is sitting on an old image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): route ``generic-arm64`` through cage + ALSA-default like ``arm64``

Two more places in ``media_player.py`` only checked the post-rename
``arm64`` DEVICE_TYPE and missed the legacy ``generic-arm64`` label
the Rock Pi 4 test bed still reports:

* **VO dispatch** (line ~419) — without this, a generic-arm64 host
  falls through to the ``--vo=drm`` else branch, which mpv aborts
  with "No primary DRM device could be picked" because cage already
  holds DRM master in the cage + Wayland viewer stack
  (live-confirmed on the Rock Pi 4 in this PR).
* **ALSA card selection** (``get_alsa_audio_device``) — the Pi-name
  dispatch below the env-var check picks ``vc4hdmi`` / "Headphones"
  cards that don't exist on Rockchip / Allwinner / Amlogic. Without
  the legacy label here, mpv tries to open the Pi-specific HDMI
  card and dies with ``Unknown PCM sysdefault:CARD=vc4hdmi``.

Both branches now use the shared ``_ARM64_DEVICE_TYPES`` frozenset
that already governs the hwdec subtype probe, so the three paths
(envelope, hwdec dispatch, VO + ALSA) agree on which DEVICE_TYPE
labels count as the aarch64 catch-all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(envelope): Rock Pi 4 stays on H.264 1080p30 -- stock ffmpeg has no v4l2_request

Live testing on the Rock Pi 4 surfaced that the arm64 viewer
image's stock ffmpeg (Debian 7.1.3-0+deb13u1) is built without
``--enable-v4l2-request``, and the underlying kernel exposes the
RK3399's decoders only via the stateless v4l2_request API
(``rkvdec`` for HEVC, the Hantro block as ``rockchip,rk3399-vpu-dec``
for H.264). ffmpeg's stateful ``hevc_v4l2m2m`` / ``h264_v4l2m2m``
decoders can't reach them -- mpv logs ``Could not find a valid
device`` even after ``/dev/video-dec*`` symlinks are present.
mpv ``--hwdec=help`` also doesn't list rkmpp or drm-copy, so
there's no other path through the stock build.

So:

* ``rockpi4`` envelope drops from HEVC 1920x1080 30 to H.264
  1920x1080 30 -- the same conservative tier as the generic
  ``arm64`` catch-all. The viewer SW-decodes 1080p30 in real
  time on the Cortex-A72; no frames dropped, just no HW gain
  over plain ``arm64``.
* Rock Pi entry drops from ``_PI_HWDEC_BY_CODEC`` -- mpv falls
  through to ``auto-copy`` which mpv's whitelist resolves to
  SW decode on this build.
* host_agent's subtype publish, the start_viewer.sh
  ``/dev/video-dec*`` symlink creation, and the dedicated
  ``rockpi4`` matrix key all stay in place -- they're
  forward-compatible scaffolding so a follow-up enabling
  v4l2_request (or linking rkmpp) in the viewer build only has
  to bump the matrix entry's codec to ``hevc`` and add the
  hwdec dispatch row. No further plumbing churn.
* Tests + docs reflect the routing-without-HW reality.

The legacy-label fixes from this PR (force_mpv +
``--vo=gpu --gpu-context=wayland`` + ALSA default for the
``generic-arm64`` DEVICE_TYPE) are unaffected -- those are real
bug fixes the Rock Pi 4 needs to play *anything* under cage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(viewer,envelope): extend +rpt1 ffmpeg to arm64; Rock Pi 4 = HEVC 4Kp60

The Raspberry Pi APT repo's ffmpeg build (``+rpt1``) ships with
``--enable-v4l2-request --enable-libudev --enable-vout-drm``,
which the stock Debian Trixie ffmpeg drops. Without those flags
the v4l2_request hardware decoder family is unreachable from
mpv — which is exactly what bit the Rock Pi 4 in this PR:
RK3399's ``rkvdec`` (HEVC) and Hantro VPU (H.264) are both
stateless v4l2_request decoders. Pi 4 / Pi 5 already pull from
the +rpt1 repo for the same reason; extending the conditional in
``Dockerfile.viewer.j2`` to also include ``arm64`` lights up
hardware decode on every arm64 SBC whose kernel exposes
v4l2_request decoders (Rock Pi, Orange Pi RK356x, Pine64,
Allwinner H6 with Cedrus, ...).

* ``Dockerfile.viewer.j2`` — board conditional ``('pi4-64',
  'pi5')`` → ``('pi4-64', 'pi5', 'arm64')``. The apt pin already
  restricts the +rpt1 repo to ``ffmpeg + libav* + mpv``, so other
  arm64 packages stay on stock Debian. Comment block updated to
  list which decoders each board reaches via this path.
* ``playback_envelope.py`` — ``rockpi4`` envelope flips from
  H.264 1080p30 to HEVC 3840×2160 60. RK3399's Hantro G2 is the
  same decoder family as Pi 5's and supports 4Kp60 per the
  Rockchip datasheet — matching Pi 5's envelope keeps the fleet
  uniform.
* ``media_player.py`` — ``_PI_HWDEC_BY_CODEC['rockpi4']`` maps
  both h264 and hevc to ``drm-copy`` (the v4l2_request hwdec
  path, same as Pi 5 for HEVC).
* Tests + docs updated accordingly.

The legacy-arm64 fixes (force_mpv + cage VO + ALSA default for
``generic-arm64``) and the host_agent subtype publish are
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(celery): cgroup CPU hard cap (`cpus: 1.0`) so encodes never starve the viewer

``nice -n 19 ionice -c 3`` + ``--concurrency=1`` lower priority and
limit parallelism, but they're soft hints — when libx265 is the
only heavy workload on the box the scheduler still hands it
everything available. Live-confirmed on the Rock Pi 4 in this PR:
sshd starved through banner exchange and mpv dropped mid-frame
during walker bursts, even with all three soft caps in place.

``cpus: 1.0`` is a cgroup CFS quota — one CPU's worth of compute
per period, kernel-enforced. On every supported SBC (4-core Pi 4 /
Pi 5, 6-core Rock Pi 4) it leaves 3+ cores for the viewer, the
host_agent, sshd, and everything else. x86 hosts have 8+ cores so
the cap is conservative there but harmless — asset processing is
upload-time, not throughput-bound.

Applied to every prod / dev compose template. test compose stays
uncapped because the test suite runs in CI environments with
deterministic resources where the cap would just slow CI down
without protecting anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(celery): scale CFS quota with host cores (half of $(nproc), min 1.0)

A flat ``cpus: 1.0`` is too aggressive: it forces a single-thread
ceiling even when the host has many idle cores. On an 8-core x86
deployment the asset processor would take 4x longer than it needs
to without protecting anything we don't already protect.

Compute the limit dynamically in ``bin/upgrade_containers.sh``:
``$(nproc) * 0.5`` (clamped to a minimum of 1.0 so single-core
hosts still make progress). On the supported boards this lands at:

  * 4-core Pi 4 / Pi 5 → cpus: 2.0 (2 cores headroom for the
    viewer + system)
  * 6-core Rock Pi 4 (RK3399) → cpus: 3.0 (3 cores headroom)
  * 8-core x86 → cpus: 4.0 (4 cores headroom)
  * 16-core x86 → cpus: 8.0 (still 50/50 with the system)
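
The arithmetic, restated in Python for clarity. The real computation
lives in the shell script named above; ``celery_cpu_limit`` is a
hypothetical name:

```python
import os
from typing import Optional


def celery_cpu_limit(cores: Optional[int] = None) -> float:
    """Half the host's cores, never below one full CPU."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1.0, cores * 0.5)
```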

Soft priorities (``nice -n 19 ionice -c 3``) and the
``--concurrency=1`` walker still apply on top; the cgroup quota
is the hard backstop that guarantees "encoding never impacts
playback or UI access". Live test on the Rock Pi 4 (in this PR)
proved the soft caps alone aren't enough — libx265 saturated
every core and starved sshd through banner exchange.

The balena compose templates use a literal ``cpus: 2.0`` (balena
only targets 4-core Pi 2/3/4/5 today); the non-balena prod
compose substitutes the env var. Dev compose also uses a literal
``2.0`` since dev hosts vary too widely to autodetect cheaply.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(walker): hardware-decode the source in the transcode pipeline

The walker's encode pass stays libx265-software-bound on every
SBC (none of Pi 4 / Pi 5 / Rock Pi 4 have HEVC HW encode), but
the *decode* half of the pipeline can be offloaded to the same
silicon mpv uses for playback. That's typically 30-50% of the
ffmpeg wall-clock on H.264 sources and dominant on 4K — well
worth the small dispatch table.

* ``_decode_hwaccel_args(source_codec)`` returns the per-board
  ``-hwaccel`` flags to prepend to the ffmpeg invocation. Uses
  the same host_agent subtype probe (``host:board_subtype`` in
  Redis) that envelope resolution already uses, so the walker
  and viewer agree on what board they're targeting.
* Dispatch matrix:
  - Pi 4 (V3D V4L2 M2M + rpi-hevc-dec) → ``-hwaccel drm`` for
    both H.264 and HEVC (the +rpt1 ffmpeg's v4l2_request path).
  - Pi 5 (Hantro G2) → ``-hwaccel drm`` for HEVC only.
  - Rock Pi 4 (rkvdec + Hantro VPU) → ``-hwaccel drm`` for both,
    same v4l2_request path as Pi 5.
  - x86 (VAAPI) → ``-hwaccel vaapi -hwaccel_device
    /dev/dri/renderD128`` for both.
  - Pi 2 / Pi 3 / unknown arm64 → no HW path mpv can address;
    SW decode is the only choice.
* ``_transcode_to_target`` wraps the ffmpeg call: first attempt
  with hwaccel args, fall back to SW decode on
  ``sh.ErrorReturnCode`` (kernel driver misbehaving, device busy,
  or a bitstream the v4l2_request decoder rejects). Logs the
  underlying ffmpeg stderr at WARNING so an operator chasing a
  slow walker sees the HW path failed.
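
A minimal sketch of the dispatch table and the SW fallback described
above, under stated assumptions: the board keys, the function names,
and the ``run`` callable are hypothetical, and ``RuntimeError`` stands
in for ``sh.ErrorReturnCode`` so the example needs no third-party
packages.

```python
from __future__ import annotations

from typing import Callable, Sequence

DRM = ["-hwaccel", "drm"]
VAAPI = ["-hwaccel", "vaapi", "-hwaccel_device", "/dev/dri/renderD128"]

# (board, codec) -> ffmpeg decode flags. A missing cell means SW decode.
# Board keys here are illustrative, not necessarily the project's own.
HWACCEL_BY_BOARD: dict[tuple[str, str], list[str]] = {
    ("pi4", "h264"): DRM,
    ("pi4", "hevc"): DRM,
    ("pi5", "hevc"): DRM,
    ("rockpi4", "h264"): DRM,
    ("rockpi4", "hevc"): DRM,
    ("x86", "h264"): VAAPI,
    ("x86", "hevc"): VAAPI,
}


def decode_hwaccel_args(board: str, codec: str) -> list[str]:
    """Return a copy of the flags for this cell, or [] for SW decode."""
    return list(HWACCEL_BY_BOARD.get((board, codec), []))


def build_argv(hwaccel: Sequence[str], src: str, dst: str) -> list[str]:
    # -hwaccel is an input option, so it has to precede the -i it applies to.
    return ["ffmpeg", *hwaccel, "-i", src, "-c:v", "libx265", dst]


def transcode(
    board: str,
    codec: str,
    src: str,
    dst: str,
    run: Callable[[list[str]], None],
) -> list[str]:
    """Try HW decode first; retry once with SW decode if the runner raises."""
    hwaccel = decode_hwaccel_args(board, codec)
    argv = build_argv(hwaccel, src, dst)
    try:
        run(argv)
        return argv
    except RuntimeError:
        if not hwaccel:
            raise  # already the SW path, nothing left to fall back to
        argv = build_argv([], src, dst)
        run(argv)
        return argv
```

Returning the argv that finally succeeded makes the two-call fallback
easy to assert on with a fake ``run``.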

Tests pin every cell of the dispatch matrix + assert ``-hwaccel``
lands BEFORE ``-i`` in the argv (placing it after silently
no-ops in ffmpeg) + the two-call SW-fallback path on simulated
HW init failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(server-image): extend +rpt1 ffmpeg pin to anthias-server too

The walker's HW-decode optimization (``processing._decode_hwaccel_args``
emits ``-hwaccel drm``) only works against the Raspberry Pi repo's
``+rpt1`` ffmpeg build, which has ``--enable-v4l2-request``. The
pin was previously only on the *viewer* image (Dockerfile.viewer.j2
in ``ba8d4709``), so the celery container — which runs the walker —
kept the stock Debian ffmpeg and the hwaccel call silently fell
back to SW on every board.

* New ``docker/_rpt1-ffmpeg-pin.j2`` extracts the pin block.
* Both ``Dockerfile.viewer.j2`` and ``Dockerfile.server.j2`` now
  include it via ``{% include '_rpt1-ffmpeg-pin.j2' %}``. Server
  also re-runs ``apt install --reinstall ffmpeg libav*`` so the
  pinned version replaces whatever the base layer installed.
* No effect on Pi 2 / Pi 3 / x86 boards — the include's
  ``{% if board in ('pi4-64', 'pi5', 'arm64') %}`` keeps it
  inert there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(celery,viewer): four hardening fixes so the player survives an upgrade

Live testing on Pi 4 / Pi 5 / Rock Pi 4 surfaced four scenarios
where a single ``docker compose pull && up -d`` (or any upgrade
that invalidates the playback envelope) wedges the device. These
aren't test-harness flakes; production operators on the same
hardware would hit them. All four belong in this PR alongside the
features that exposed them.

1. **Walker drip-feed** — ``regenerate_for_envelope_change``
   previously queued every stale ``normalize_video_asset`` in one
   beat tick. ``--concurrency=1`` serialises *execution* but the
   celery worker fetches the next task the instant the previous
   finishes, so a 100-asset catalog turns into hours of back-to-
   back libx265 with zero recovery windows between encodes.
   Switch to ``apply_async(args=..., countdown=N * 60)`` so
   each subsequent normalize starts at least 60 s after the
   previous was queued. Operator can flip ``is_processing=False``
   on a row mid-window to cancel its turn.
2. **``mem_limit`` on celery container** — cgroup CPU isolation
   alone doesn't stop libx265-4K from allocating ~1.5 GB resident
   memory, which on a 4 GB SBC pushes the system into swap and
   starves sshd + the viewer. Match the cpus cap with a memory
   cap (60% of host RAM, computed in ``bin/upgrade_containers.sh``).
3. **``stop_grace_period: 3s`` + ``stop_signal: SIGKILL`` on
   viewer** — cage doesn't reliably release DRM master on
   SIGTERM (its libinput shutdown path hangs on certain kernels)
   and the kernel's GPU driver leaves dangling references that
   prevent the next ``up`` from acquiring DRM master. Skipping the
   SIGTERM-then-wait dance on intentional restarts gets the
   device past cage's bug deterministically.
4. **libx265 / libx264 ``-preset superfast``** — was ``medium``.
   Asset processing is upload-time and only runs once per asset,
   so the 5-10× wallclock speedup is operator-facing throughput.
   The ~10-20% bitrate increase is invisible on typical signage
   content. Viewer decode is HW regardless of preset.

Tests:
* Walker test mocks switched from ``.delay`` to ``.apply_async``;
  signatures updated for ``args=(...,)`` + ``countdown=`` kwarg.
* New ``test_regenerate_walker_spaces_dispatches_via_countdown``
  asserts the countdowns are ``[0, 60, 120, ...]`` across a
  5-asset catalog so the drip-feed contract is pinned.
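
The drip-feed contract can be sketched as follows. The helper names are
hypothetical; ``task`` is assumed to be any object exposing Celery's
``apply_async(args=..., countdown=...)`` signature.

```python
from __future__ import annotations

from typing import Any, Sequence


def drip_feed_countdowns(asset_count: int, spacing_s: int = 60) -> list[int]:
    """Countdown in seconds for each queued normalize: 0, 60, 120, ..."""
    return [i * spacing_s for i in range(asset_count)]


def dispatch_spaced(task: Any, asset_ids: Sequence[str]) -> None:
    """Queue one task per asset, each delayed one slot more than the last."""
    countdowns = drip_feed_countdowns(len(asset_ids))
    for asset_id, countdown in zip(asset_ids, countdowns):
        task.apply_async(args=(asset_id,), countdown=countdown)
```

Because ``countdown`` is relative to the moment each task is queued,
queuing them all in one beat tick still spreads the start times out.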

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): use sh.ErrorReturnCode_1 in hwaccel fallback test

sh.ErrorReturnCode is the abstract base; its __init__ does
`self.exit_code = self.exit_code` which AttributeErrors unless the
concrete numeric subclass (ErrorReturnCode_1, _2, ...) is used. Every
other call site in this file already uses ErrorReturnCode_1 — this was
the lone outlier introduced with the SW-fallback test in 0340b4f4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(asset-processor): drop on-device video transcoding

On-device libx265 transcode wedged a Pi 4's celery worker for 99 min on a
single 4K60 H.264→HEVC pass during PR validation. Every supported board
already HW-decodes both H.264 and HEVC via the viewer's per-board mpv
hwdec dispatch (drm-copy / vaapi-copy / v4l2m2m-copy), so the re-encode
provided no playback benefit for the codecs operators actually upload.

- ``normalize_video_asset`` now runs ffprobe and writes codec / dims /
  fps / duration into ``metadata``; the asset file is never rewritten.
- Removes the envelope module, the re-render walker
  (``regenerate_for_envelope_change``), and the server-start envelope
  cache reconciliation hook.
- Drops 33 transcode / envelope / sibling-original tests.

Image normalisation (HEIC/HEIF/TIFF/BMP/ICO/TGA/JP2/AVIF → WebP) is
unchanged. The viewer-side per-board hwdec dispatch and host_agent
board-subtype publishing are unchanged.

For codecs the target board can't HW-decode (MPEG-2, MPEG-4 ASP, ...)
the operator's recovery is to upload a transcoded copy; the metadata
fields surfaced here let them see codec / dims / fps in the asset list
before pushing the asset to the field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(asset-processor): gate uploads to hardware-decoded codecs only

After ffprobe, ``normalize_video_asset`` now compares the source codec
against the board's HW-decode set (mirroring the viewer's
``_PI_HWDEC_BY_CODEC``). Uploads outside the set are rejected with an
error message that includes the rejected codec, the board's supported
codecs, and an ``ffmpeg`` command line the operator can run on their
workstation to transcode the source.

Per-board HW decode set:

- pi2 / pi3 → {h264}
- pi4-64 / rockpi4 / x86 → {h264, hevc}
- pi5 → {hevc} (no H.264 v4l2-request decoder mpv can reach)
- arm64 catch-all → ∅ (operator must install a board-specific image)
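
As a rough illustration, the table above maps onto a lookup like the
one below. The names ``HW_DECODE_CODECS``, ``UnsupportedCodec`` and
``check_codec`` are hypothetical, and the error text is a simplified
stand-in for the real message.

```python
from __future__ import annotations

# Board key -> codecs the board can hardware-decode (from the list above).
HW_DECODE_CODECS: dict[str, frozenset[str]] = {
    "pi2": frozenset({"h264"}),
    "pi3": frozenset({"h264"}),
    "pi4-64": frozenset({"h264", "hevc"}),
    "rockpi4": frozenset({"h264", "hevc"}),
    "x86": frozenset({"h264", "hevc"}),
    "pi5": frozenset({"hevc"}),
    "arm64": frozenset(),
}


class UnsupportedCodec(ValueError):
    """Raised when the source codec is outside the board's HW-decode set."""


def check_codec(board: str, codec: str) -> None:
    # Unknown boards are treated like the arm64 catch-all: nothing passes.
    supported = HW_DECODE_CODECS.get(board, frozenset())
    if codec not in supported:
        names = ", ".join(sorted(supported)) or "none"
        raise UnsupportedCodec(
            f"{codec} is not hardware-decoded on {board}; supported: {names}"
        )
```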

Also extracts ``DEVICE_TYPE`` → board-key resolution into a new
``anthias_common.board`` module so the server's gate and the viewer's
hwdec dispatch share the same logic — eliminates the duplicated
``_redis_board_subtype`` mirror in ``media_player.py``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dashboard): surface unsupported-codec failures with copyable recipe

UI/UX review of the gate's failure path surfaced two P0s and a few
smaller nits:

- The error message was only reachable via a native browser ``title``
  tooltip on the Failed pill — invisible on touchscreens, can't be
  copied, leaks the ``UnsupportedVideoCodecError:`` class prefix into
  the aria-label.
- The Edit Asset modal showed nothing about the failure — exactly
  the place the operator goes to act on a failed row.

Changes:

- ``UnsupportedVideoCodecError`` now carries the ffmpeg recipe as a
  ``recipe`` attribute. ``_NormalizeAssetTask.on_failure`` writes the
  bare message into ``metadata.error_message`` (no class-name prefix)
  and persists the recipe to ``metadata.error_recipe``.
- ``_asset_row.html`` Failed pill becomes a button — click opens the
  Edit Asset modal.
- ``_asset_modal.html`` renders a warning banner at the top of the
  Edit form when ``metadata.error_message`` is set, with the recipe
  inside a copyable ``<code>`` block + "Copy command" button.
- ``_ffmpeg_reencode_recipe`` substitutes the operator's upload
  filename (stashed in ``metadata.upload_name`` at upload time) for
  the ``INPUT`` placeholder so the recipe is paste-ready.
- Toast text shortened from "analysing video…" to "reading metadata…"
  (the ffprobe pass is sub-second now that there's no transcode).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(processing): give recipe output a codec suffix so it doesn't overwrite input

E2E validation on a Pi 5 surfaced a recipe like:

  ffmpeg -i 'sample-h264.mp4' -c:v libx265 ... 'sample-h264.mp4'

— input and output point at the same file because both got the
upload's stem + ``.mp4`` suffix. Operator pasting the recipe would
overwrite their source. The fix gives the output filename a target-
codec marker (``sample-h264.hevc.mp4`` / ``sample-h264.h264.mp4``)
so the recipe is safe to copy-paste even when the upload's
extension already matches the output container.
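
A minimal sketch of the naming rule, assuming an ``.mp4`` output
container. ``reencode_recipe`` and the ``ENCODER`` map are hypothetical,
and the real recipe carries more ffmpeg flags than shown here.

```python
import shlex
from pathlib import Path

# Target codec -> ffmpeg encoder name.
ENCODER = {"hevc": "libx265", "h264": "libx264"}


def reencode_recipe(upload_name: str, target_codec: str) -> str:
    """Build a paste-ready command whose output never equals its input."""
    stem = Path(upload_name).stem
    # The codec marker guarantees a distinct name even for .mp4 uploads.
    output = f"{stem}.{target_codec}.mp4"
    argv = ["ffmpeg", "-i", upload_name, "-c:v", ENCODER[target_codec], output]
    # shlex.join quotes every argument that needs it for a POSIX shell.
    return shlex.join(argv)
```

Building the command as an argv list and quoting it in one place also
keeps awkward filenames (quotes, backticks, ``$()``) from changing the
meaning of the pasted command.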

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: drop transcode-era defensive hardening on celery + server image

These guards were load-bearing while the asset processor ran libx264 /
libx265 transcodes; with the on-device transcode pipeline gone they're
dead code defending against a workload that no longer exists.

Removed:
- ``cpus: ${CELERY_CPU_LIMIT}`` / ``cpus: 2.0`` cgroup CPU caps on
  anthias-celery (every compose template)
- ``nice -n 19 ionice -c 3`` wrapper on the celery command
- ``--concurrency=1`` on celery worker; default celery concurrency
  is fine when the only tasks are ffprobe + Pillow conversion
- ``CELERY_CPU_LIMIT`` calc in ``bin/upgrade_containers.sh``
- ``_rpt1-ffmpeg-pin.j2`` include + reinstall layer in
  ``Dockerfile.server.j2``; the +rpt1 ffmpeg was only needed for
  the walker's ``-hwaccel drm`` transcode. The server now only
  runs ffprobe, which the stock Debian ffmpeg handles fine
  (smaller server image, simpler base)
- Stale ``ffprobe → passthrough or libx264/aac transcode`` section
  header in processing.py

Kept:
- ``mem_limit: ${CELERY_MEMORY_LIMIT_KB}k`` on celery — still a
  useful safety net against a decompression-bomb fixture or
  runaway ffprobe
- ``+rpt1`` ffmpeg pin on the *viewer* image — still load-bearing
  for mpv's ``v4l2_request`` HW decode on Pi 4 / Pi 5 / Rock Pi 4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: keep nice -n 19 ionice -c 3 on celery

Cheap insurance against pathological inputs (decompression-bomb
HEIC, runaway ffprobe). Brought back across all four compose
templates after stripping the CPU cap + --concurrency=1 in the
prior cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dashboard): address review feedback on codec gate UX

* Plain-HTTP clipboard fallback. navigator.clipboard.writeText only
  resolves on secure origins, so on a LAN device (HTTP) the Copy
  command button silently failed. Add a window.fallbackCopyToClipboard
  helper that uses execCommand('copy') against an off-screen
  textarea, and have the inline copyRecipe() try it whenever
  navigator.clipboard isn't available or rejects. The recipe block
  also gets user-select:all so keyboard-copy still works if both
  paths fail.
* Friendlier message for the arm64 catch-all branch. "Supported:
  none." read like the board literally has no decoder; replace with
  an explanation that the board hasn't reported a subtype yet and a
  pointer at the board-specific image.
* Lock the gate (_HW_DECODE_VIDEO_CODECS) and the viewer dispatch
  (_PI_HWDEC_BY_CODEC) together with a consistency test so a future
  edit to one table can't quietly diverge from the other.
* Cover the shell-quoting of recipe filenames with hostile-name
  parametrize cases (single quote, backtick, $(), ;) so a copy-paste
  recipe can't be turned into command injection.
* Drop the stale "cgroup CPU cap" line from processing.py's module
  docstring — the cap was removed in f85f8035.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address post-review feedback on codec gate / hwdec dispatch

- processing: prefer the upload's extension token when ffprobe's
  format_name is a synonym list, so an .mp4 surfaces as
  container=mp4 (not mov, the first synonym).
- bin/start_viewer.sh: drop the loose `*-dec` catch-all from the
  v4l2 decoder match; keep the explicit rkvdec/cedrus/hantro/
  *-vpu-dec prefixes.
- media_player: cap the ANTHIAS_DEBUG_DROPS mpv.log at 64 MB with
  a rolling truncate so a forgotten-on flag can't grow the disk.
- tests: rename test_set_board_subtype_does_not_raise_on_redis_failure
  to test_set_board_subtype_propagates_redis_failures — matches what
  the test actually asserts.
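
The container rule from the first bullet above can be sketched as
follows. ``pick_container`` is a hypothetical name; the synonym string
in the usage note is what ffprobe reports as ``format_name`` for MP4
family files.

```python
from pathlib import Path


def pick_container(format_name: str, upload_name: str) -> str:
    """Prefer the upload's extension when ffprobe lists it as a synonym."""
    synonyms = [s.strip() for s in format_name.split(",") if s.strip()]
    ext = Path(upload_name).suffix.lstrip(".").lower()
    if ext in synonyms:
        return ext
    # Otherwise fall back to the first synonym (or "" if ffprobe gave none).
    return synonyms[0] if synonyms else ""
```

For example, ``pick_container("mov,mp4,m4a,3gp,3g2,mj2", "clip.mp4")``
yields ``mp4`` instead of ``mov``.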

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:46:02 +02:00
Viktor Petersson
473d8991bf feat(website): home-page screenshot slider fed from CI captures (#2899)
* feat(website): home-page screenshot slider fed from CI captures

- Replace the static overview hero with a scroll-snap slider framed
  as a browser-window chrome (traffic lights, URL pill, counter,
  brand-yellow autoplay progress bars).
- Slides are sourced from website/assets/images/screenshots/ —
  gitignored; deploy-website.yaml downloads the latest successful
  marketing-screenshots artifact at build time, and the new
  bun run screenshots:fetch mirrors that for local dev.
- TypeScript slider in assets/js/slider.ts (deferred, 1.4 KB
  gzipped) handles autoplay, pause-off-screen, hover-pause,
  keyboard nav, and respects prefers-reduced-motion.
- Add a system-info marketing capture; the integration test
  overwrites the rendered DOM with curated Pi-5-shaped values
  under MARKETING_SCREENSHOTS=1 (page_context lives in uvicorn,
  out of monkeypatch reach).
- Flip marketing_screenshot fixture default to full_page=False so
  every capture is uniform 1400×900, fitting the slider's frame.
- Add a GitHub Sponsors link in the hero CTA and footer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): adopt Font Awesome icon kit + actionlint SC2012

- Add @fortawesome/fontawesome-free as a devDependency; mirror the
  fonts:install pattern with a new scripts/install-icons.ts that
  materializes a curated set (github, linkedin, x-twitter, heart)
  into assets/images/icons/ (gitignored).
- New partials/icon.html inlines the SVG via safeHTML so
  fill="currentColor" picks up the surrounding text colour — same
  Tailwind class can tint the icon and the label.
- Replace fb / instagram footer links with LinkedIn; swap the
  hand-rolled twitter + GitHub + Sponsor-heart SVGs over to the
  Font Awesome equivalents.
- Fix actionlint SC2012 in deploy-website.yaml by switching the
  post-download `ls | wc -l` to `find -name '*.png' | wc -l`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): add YouTube footer link; soften shellcheck comment

- Extend install-icons.ts list with the FA `youtube` brand SVG and
  drop a fourth social link into the footer pointing at
  https://www.youtube.com/c/screenlydigitalsignage.
- Reword the deploy-website.yaml comment so it no longer starts with
  `# shellcheck`, which actionlint was misreading as an SC1072/SC1073
  directive and failing the lint run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(website): drop X/Twitter from footer social row

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): address Copilot review on slider PR

- slider.ts: init activeIndex to -1 so the very first setActive(0)
  actually applies state (the equal-index early return was
  swallowing it; autoplay bar never started).
- baseof.html / index.html: pass a stable "screenshots-singleton"
  cache key to partialCached so the Resize pipeline runs once per
  build, not once per page.
- index.html / main.css / slider.ts: drop the role="tab" inside
  role="tablist" markup (the full ARIA Tabs pattern doesn't fit a
  horizontal-scroll carousel); use plain buttons with aria-current
  on the active pill instead.
- screenshots.html: correct the partial docstring — the layout
  omits the slider region when the slice is empty, it doesn't fall
  back to a placeholder.
- package.json: invoke the locally-pinned tailwindcss binary
  instead of `bunx @tailwindcss/cli`. bunx resolves through its
  own cache and was pulling Tailwind v4.3.0 even though the lock
  pins 4.2.4, so the committed style.css banner drifted.
  Rebuilt style.css against the pinned version (v4.2.4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): address second Copilot pass on slider PR

- deploy-website.yaml: add `actions: read` to the workflow's
  permissions block so the Fetch marketing screenshots step can call
  `gh run list` / `gh run download`. An explicit `permissions:` block
  defaults unspecified scopes to `none`, so the Actions API would
  otherwise 403.
- slider.ts: track the URL-pill fade `setTimeout` and clear it on
  every `setActive`. A rapid second slide change (button mash, swipe
  + observer update) could otherwise let a stale timeout fire later
  and briefly overwrite the URL pill with the previous slide's text.
- screenshots.html: capture the 1440-wide PNG/WebP renditions during
  the srcset ladder loop and reuse them for the <img src> + LCP
  preload fallback. Avoids re-calling $src.Resize with the same spec
  the loop already produced. Falls back to the source's own width
  when it's narrower than 1440.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): scope deploy permissions per-job; --repo support for fetch

Addresses the third Copilot pass + SonarCloud's new_security_rating
gate failure (S8264: read permissions declared at workflow level).

- deploy-website.yaml: move permissions to job level. `build` keeps
  `actions: read` (gh run list/download) + `contents: read` (checkout)
  + `pages: write` (configure-pages); `deploy` keeps `pages: write` +
  `id-token: write`. Least privilege per job, no more workflow-level
  scope inheritance for the read tokens.
- scripts/fetch-screenshots.ts: target `Screenly/Anthias` by default
  (passed through `gh run list` / `gh run download` via `--repo`) so
  contributors on fork clones still get the upstream artifact instead
  of failing against their own empty Actions API.
  `--repo <owner>/<repo>` overrides the default.
- data/screenshots.yaml: header comment no longer references the
  "committed seed copies" — the directory is gitignored and the
  workflow downloads into an empty dir.
- website/README.md: document the upstream-default + --repo flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): read autoplay duration from CSS; correct restart comment

Fourth Copilot pass: addresses the duplicated-constant and stale-comment
flags in slider.ts. The two artifact-path concerns Copilot raised on
deploy-website.yaml and scripts/fetch-screenshots.ts are verified false
positives — `actions/upload-artifact@v4` stores the upload relative to
the `path:` argument, so the artifact's top-level entries already are
the .png files (confirmed by downloading the latest run; no
`test-artifacts/marketing/` prefix present).

- slider.ts: drop the duplicated `AUTOPLAY_MS = 6000` constant. The
  authoritative timing value lives on `--autoplay-ms` in main.css
  (drives the @keyframes width animation on the progress bar). The
  JS now reads that custom property at init via getComputedStyle and
  uses it for the setTimeout cadence, so changing one in CSS no
  longer silently drifts away from the slide-advance timing. Keeps a
  `AUTOPLAY_MS_FALLBACK` for the unlikely case where the property is
  unset / unparseable.
- slider.ts: rewrite the "Force-restart the CSS animation by
  detaching+reattaching the node" comment — that's not what the code
  does. The animation restart is purely a CSS-selector consequence
  of removing `data-state` from the previously-active pill, no DOM
  manipulation involved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(tests): clarify system-info smoke-test docstring

Copilot flagged that the docstring overclaimed coverage. The assertions
only check that the heading renders and no 5xx fires; they don't
validate individual System Info values. Reword so the docstring matches
what the test actually does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): add role=region to slider carousel root

Many assistive technologies only announce aria-roledescription when
it augments a non-generic role. Adding role=region keeps the existing
aria-label as the accessible name and matches the W3C carousel pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): keep slider pill state in sync with off-screen + focus

Two Copilot-flagged drift bugs:

1. The off-screen visibility observer cleared the JS autoplay timer
   but left the active pill at data-state=playing. The CSS progress
   bar kept advancing while the slider was below the fold, so when
   it scrolled back into view the fresh full-duration timer and the
   bar disagreed. Delegate to hoverPause/hoverResume so both pause
   together and restart together.

2. focusout bubbles, so moving keyboard focus between elements
   inside the slider (track → next button) fired the root-level
   focusout and re-armed autoplay mid-nav. Check relatedTarget and
   skip when focus is still within refs.root.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): derive og:image:alt from the actual page image

When the screenshots dir is empty the OG image falls back to
logo.svg, but the alt text stayed pinned to the old "dashboard
showing scheduled content" line — wrong for the logo and also
wrong if the first slide changes. Pull the alt from the same
$heroSlides[0] used to choose $pageImage, with "Anthias logo"
as the fallback when there are no slides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): drop unused --frame-bg custom property

Declared on .screenshot-slider but never referenced — the actual
background uses an inline linear-gradient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 21:53:17 +01:00
Viktor Petersson
c3e86c61c9 fix(csrf): CSRF_TRUSTED_ORIGINS env var for host-rewriting proxies (#2901)
* fix(csrf): CSRF_TRUSTED_ORIGINS env var for host-rewriting proxies (#2900)

IIS ARR with the default ``preserveHostHeader=false`` rewrites the
upstream ``Host`` before forwarding to anthias-server, so the browser's
``Origin: https://signage.example.com`` and uvicorn's
``Host: anthias.localdomain`` are genuinely different hostnames. The
existing ``SameHostOriginCsrfMiddleware`` fallback only tolerates
scheme drift on the *same* host, not different hosts — uploads (and
every other unsafe POST) 403 with ``Origin checking failed``.

The pre-rebrand React+DRF stack didn't hit this because uploads went
through DRF's Basic-auth API where ``SessionAuthentication``'s CSRF
enforcement didn't apply. The new Django+HTMX upload runs through
``CsrfViewMiddleware`` directly, which is why this regression
surfaced after the migration to Django templates.

Fix: expose Django's first-class ``CSRF_TRUSTED_ORIGINS`` setting as a
comma-separated env var. Operators behind a host-rewriting reverse
proxy list the public origin they actually serve under (e.g.
``CSRF_TRUSTED_ORIGINS=https://signage.example.com``); Django's stock
``_origin_verified`` then accepts requests from that origin. Default
is empty, so the same-host fallback continues to cover plain LAN /
Caddy-sidecar deployments where the proxy preserves Host upstream —
no behaviour change for existing setups.
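
A minimal sketch of reading such a comma-separated variable in a
settings module. The helper name ``trusted_origins_from_env`` is
hypothetical, and the project's actual parsing may differ.

```python
from __future__ import annotations

import os
from typing import Mapping


def trusted_origins_from_env(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Split CSRF_TRUSTED_ORIGINS on commas, dropping blanks and whitespace."""
    raw = environ.get("CSRF_TRUSTED_ORIGINS", "")
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


# Unset or empty yields [], which leaves Django's default behaviour intact.
CSRF_TRUSTED_ORIGINS = trusted_origins_from_env()
```

Each entry needs its scheme (``https://signage.example.com``, not just
the hostname), since Django matches trusted origins including the
scheme.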

The earlier "intentionally not set" comment was about the wildcard
limitation (Django only honours subdomain wildcards) and didn't rule
out specific hostnames; updated to reflect what's now supported.

Regression coverage in ``tests/test_csrf.py``:

* ``test_iis_rewrite_host_proxy_without_trusted_origin_rejected``
  pins the 403 that justifies the new knob existing.
* ``test_iis_rewrite_host_proxy_with_trusted_origin_passes`` pins
  the fix — listing the public origin makes the POST succeed.
* ``test_trusted_origin_does_not_open_other_hosts`` pins that the
  allowlist stays exact (``signage.example.com`` doesn't open
  ``attacker.example``).

No new proxy-header trust added; no change to ``request.get_host()``,
``is_secure()``, or ``build_absolute_uri()``. The operator opts in
explicitly per-deployment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(csrf): use pytest_django SettingsWrapper type for settings fixture

Copilot review: ``settings: pytest.FixtureRequest`` was the wrong
annotation — the pytest-django ``settings`` fixture yields a
``pytest_django.fixtures.SettingsWrapper``, not a ``FixtureRequest``.
The mismatch would surface as ``attr-defined`` errors under strict
mypy when assigning ``settings.CSRF_TRUSTED_ORIGINS``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(csrf): move SettingsWrapper import under TYPE_CHECKING, rewrap URL

Two Copilot review nits from the previous push:

* ``from __future__ import annotations`` makes the
  ``SettingsWrapper`` import a type-only reference at runtime. Even
  though current Ruff tracks that as a used import, parking it under
  ``TYPE_CHECKING`` is the idiomatic shape and stays robust against
  stricter lints landing later.
* The example origin in the ``CSRF_TRUSTED_ORIGINS`` settings comment
  was wrapped mid-hostname (``https://signage.`` / ``example.com``),
  which is easy to misread or copy wrong. Reflow the paragraph so the
  full URL stays on a single line.
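
The import shape from the first bullet looks roughly like this. The
function name is hypothetical; in a real test module it would be a
``test_*`` function receiving pytest-django's ``settings`` fixture.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by the type checker; never imported at runtime.
    from pytest_django.fixtures import SettingsWrapper


def exercise_trusted_origin(settings: SettingsWrapper) -> None:
    """Set the allowlist on the settings object and read it back."""
    settings.CSRF_TRUSTED_ORIGINS = ["https://signage.example.com"]
    assert "https://signage.example.com" in settings.CSRF_TRUSTED_ORIGINS
```

With postponed annotations the ``SettingsWrapper`` name is stored as a
string, so the module imports cleanly even where pytest-django is not
installed.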

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(faq): how to run Anthias behind a custom reverse proxy

Operators putting Anthias behind nginx / Apache / IIS / Traefik
hit the same CSRF rejection that motivated #2901 the moment their
proxy rewrites the upstream Host (the default for nginx, Apache
mod_proxy, and IIS ARR). The fix landed as the CSRF_TRUSTED_ORIGINS
env var, but nothing on the website surfaces it — the only docs
were the settings.py comment and the PR description.

Add an "Operations" FAQ entry that:

* names the symptom — POSTs 403 with ``Origin checking failed``;
* lists the Host-preservation directive for the five reverse
  proxies operators actually deploy (nginx, Apache, IIS ARR,
  Caddy, Traefik) — preferred path, since it costs one line of
  proxy config and Anthias's same-host fallback then handles
  scheme drift automatically;
* documents ``CSRF_TRUSTED_ORIGINS=https://signage.example.com``
  as the escape hatch for operators who can't touch the proxy
  config;
* notes that the bundled ``./bin/enable_ssl.sh`` Caddy sidecar
  already does the right thing so the FAQ entry only matters for
  third-party proxy setups.

No code change — website data only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 21:30:35 +01:00
Viktor Petersson
7f8bbe43d7 feat(install): generic-arm64 best-effort support (Armbian SBCs) (#2879)
* feat(install): generic-arm64 best-effort support (Armbian on Rock Pi, Orange Pi, …)

Wires up a `generic-arm64` device_type so the installer recognises any
aarch64 host that isn't a Raspberry Pi and runs the same Anthias stack
on it. Closes #2849 (Tier 1).

* `bin/install.sh::set_device_type` + `bin/upgrade_containers.sh` get
  an `aarch64` fallback branch, INTRO_MESSAGE / unsupported-message
  copy refreshed, raspberry-pi-tagged ansible tasks skipped on
  generic-arm64 (same as x86), vchiq strip extended.
* ansible: validated set in `site.yml`, `docker_arch_by_device_type`
  gains `generic-arm64: arm64`. `docker-buildx-plugin` added to the
  apt-install list — required for MODE=build with `--platform=`
  Dockerfiles, harmless on pull-mode boards. Pre-existing host_agent
  service unit hardcoded `~/installer_venv/bin/python` (an ephemeral
  tmpdir post-#2843); split into a persistent `~/.anthias-venv` that
  ansible syncs before installing the unit.
* image_builder: `generic-arm64` build target, Qt6 + cage + wayland
  like x86; `va-driver-all` deliberately *not* shipped — Rockchip /
  Allwinner / Amlogic mainline hwdec goes through V4L2 M2M /
  request API, not VAAPI, so mesa-va-drivers would be dead weight.
* viewer: `start_viewer.sh` reuses the x86 cage path for
  generic-arm64; `media_player.py` routes generic-arm64 to MPV (the
  `device_helper.get_device_type()` fallback returns 'pi1' on
  non-Pi aarch64 hosts, so the proxy needs the DEVICE_TYPE env
  override that pi4-64 already uses). New test added.
* host_agent: `SUPPORTED_INTERFACES` gains `end` prefix —
  Rockchip GMAC etc. surface as `end0` on systemd predictable
  naming, which was previously filtered out, leaving the splash
  page stuck on "Detecting network…".
* CI: docker-build matrix + mirror-latest-tags publish
  `latest-generic-arm64` alongside the existing per-board tags.
* Docs: README, marketing site supported-hardware table, and FAQ
  get a plain-language "Yes, on a best-effort basis" entry that
  spells out the software-decode trade-off, the SoCs known to work
  well (RK3399 / RK35xx / Allwinner H6 / Amlogic GXBB-GXL-GXM /
  S905X3), and the boards to avoid (Allwinner H616 / H618). Per-SoC
  hardware decode (`rkmpp`, `cedrus`, `meson-vdec`) is the planned
  Tier-2 follow-up.

Validated end-to-end on a Rock Pi 4B (Armbian trixie, RK3399, 1GB
RAM) via build-on-device: install completes, web UI reachable, all
four asset types (image, H.264 1080p60, H.265 1080p60, webpage)
cycle through the viewer cleanly, mpv pure-decode benchmark shows
0 dropped frames over the full 60s of each clip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ansible-lint): pair become with become_user on .anthias-venv sync task

ansible-lint's partial-become rule fires on `become_user:` without a
matching `become:` at the same level, even when the play-level become
already covers it. Explicit pairing keeps lint quiet without changing
runtime behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Copilot review feedback on generic-arm64 PR

- ansible: drop `creates:` guard on the runtime venv sync — `uv sync`
  is idempotent (sub-second resolver check when nothing changed), so
  re-running unconditionally means dependency updates from a
  pyproject.toml / uv.lock change actually land on upgrade instead of
  silently skipping. Idempotency surfaced via `changed_when` keyed on
  uv's `+/-/~` package-action prefix so steady-state runs stay `ok`.
- ansible: rework docker-buildx-plugin comment to justify the
  install on its own merits (any MODE=build run needs it because of
  `FROM --platform=$BUILDPLATFORM` in Dockerfiles) rather than tying
  it to generic-arm64 lacking published tags — that explanation
  becomes stale the moment this PR merges and CI publishes them.
- viewer: `get_alsa_audio_device()` short-circuits on
  `DEVICE_TYPE=generic-arm64` before the Pi-firmware dispatch, since
  the Rock Pi / Orange Pi / Banana Pi class of board has none of the
  `vc4hdmi*` or `Headphones` ALSA cards. Defers to ALSA's `default`
  device; operators with a non-standard sink can override via
  `~/.asoundrc` (already bind-mounted into the viewer container).
- tests: new assertions that generic-arm64 routes mpv through
  `--vo=gpu --gpu-context=wayland` and `--audio-device=alsa/default`.
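
The `changed_when` idea from the first bullet can be restated in Python as a small predicate. This is a sketch only: the real check is an Ansible expression, and the assumed output format (one package action per line, prefixed with `+`, `-` or `~`) and the `sync_changed` name are illustrative.

```python
import re

# Assumption: uv reports each package action on its own line, prefixed
# with '+', '-' or '~' (optionally indented) and followed by a space.
_ACTION_LINE = re.compile(r'^\s*[+\-~] \S')


def sync_changed(uv_output: str) -> bool:
    """True when any line of the output looks like a package action."""
    return any(_ACTION_LINE.match(line) for line in uv_output.splitlines())
```

A steady-state run that prints no action lines reports `False`, which is what keeps the task at `ok` instead of `changed`.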

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(website): disambiguate Debian release codenames in supported-hardware copy

Copilot flagged the previous wording — "running Raspberry Pi OS, Debian, or
Armbian (Trixie or Bookworm)" — as misleading: the parenthetical reads as
if Raspberry Pi OS and Armbian are themselves "Trixie or Bookworm", but
those are Debian codenames, and Armbian builds can also be Ubuntu-based.
Split the sentence so the codenames are tied explicitly to Debian.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ansible): derive is_raspberry_pi from device_type, not architecture

Copilot caught that the `is_raspberry_pi` helper in docker.yml was
defined as `ansible_architecture in ['aarch64', 'armv7l', 'armv6l']`,
which is also true on generic-arm64 (Rock Pi / Orange Pi / …). That
silently applied the Pi-only `gpio` group to non-Pi SBCs.

device_type is the authoritative discriminator and is validated
upstream in ansible/site.yml's pre_tasks, so use it directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: rename device_type generic-arm64 → arm64 (parallel to x86)

Per review feedback: `generic-arm64` was the original working name for
the new aarch64 non-Pi fallback. `arm64` is shorter and parallels `x86`
— both are architecture-generic device_types that catch any host
without a board-specific image, sitting alongside the per-board labels
(pi2 / pi3 / pi4-64 / pi5). User-facing prose still says "generic
64-bit ARM" or "Armbian on Rock Pi / Orange Pi / …" for context.

Mechanical s/generic-arm64/arm64/ across install scripts, ansible,
image_builder, viewer / start_viewer, host_agent, tests, CI matrix,
mirror-latest-tags, Dockerfile.viewer.j2, README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review polish on arm64 PR

- viewer: get_alsa_audio_device's arm64 short-circuit now logs the
  registered ALSA cards (from /proc/asound/cards — aplay isn't in
  the viewer image) once per process when DEVICE_TYPE=arm64, so an
  operator reporting "no HDMI audio" carries enough breadcrumbs in
  journalctl alone to pick the right ~/.asoundrc override.
- ansible: rewrite the docker-buildx-plugin size claim — 15 MB
  download / 67 MB extracted, from the deb metadata on arm64.
- viewer: MediaPlayerProxy.get_instance comment block split into a
  two-bullet rationale, calling out the pi4-64 and arm64 cases
  separately so a future reader doesn't mistake the lead sentence
  for "pi4-64-only".
- install.sh / upgrade_containers.sh: spell out that the aarch64
  catch-all in set_device_type is intentional: a future Pi model
  whose model string drifts past the regexes lands here too,
  accepting software decode and no Pi-boot tweaks instead of
  failing loudly.
- README + FAQ: tighten the Plymouth caveat from "few seconds of
  black" to "kernel boot log scrolls until the viewer takes over",
  which is what actually happens on most U-Boot ARM SBCs.
- ansible: rename the docker.yml var from `is_raspberry_pi` to
  `device_is_pi` now that it's derived from device_type rather
  than `ansible_architecture`, so the name matches what it does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: narrow arm64 support to Debian-based Armbian (call out Ubuntu)

Copilot flagged that "Armbian" in the new docs is ambiguous —
Armbian builds come in both Debian-based (Bookworm/Trixie) and
Ubuntu-based (Jammy/Noble) flavours. The installer's ansible role
wires Docker's apt repo under
download.docker.com/linux/debian/{{ ansible_distribution_release }},
which 404s on the Ubuntu codenames, so an Ubuntu-Armbian user
following the current docs would hit a broken install at the very
first `apt update`.

Narrowing the wording in README, the marketing site's
supported-hardware blurb, and the FAQ to "Debian-based Armbian" so
users pick the right image. Extending the installer/playbook to
handle Ubuntu-based Armbian is a separate follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 10:05:52 +01:00
Viktor Petersson
1ddf845e3d fix(viewer): send Accept-Language from system locale (#2878)
* fix(viewer): send Accept-Language from system locale (#480)

The Qt WebEngine in AnthiasWebview never set an Accept-Language header,
so multi-language URL assets served their default (typically English)
regardless of how the Pi's locale was configured.

Plumb the host's locale through: bind-mount /etc/default/locale into
the viewer container, source it in start_viewer.sh, and have the C++
webview build an RFC 7231 header from QLocale::system().uiLanguages().
A Pi configured with LANG=nl_NL.UTF-8 now advertises
nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7 to origin servers.
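
A minimal Python sketch of how such a header could be assembled from an ordered language list. The real code is C++ in AnthiasWebview and is not shown here; the 0.1 step per tag is inferred from the example header above, and `build_accept_language` is a hypothetical name.

```python
def build_accept_language(ui_languages: list[str]) -> str:
    """Join language tags, giving each later tag a lower q-value."""
    parts = []
    for index, tag in enumerate(ui_languages):
        if index == 0:
            parts.append(tag)  # first tag carries the implicit q=1.0
            continue
        # Step down by 0.1 per tag, never dropping below 0.1.
        q = max(10 - index, 1) / 10
        parts.append(f'{tag};q={q:.1f}')
    return ','.join(parts)


header = build_accept_language(['nl-NL', 'nl', 'en-US', 'en'])
```

The input list here stands in for whatever `QLocale::system().uiLanguages()` returns plus the English fallbacks.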

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): forward locale via envsubst, not a host bind mount

Bind-mounting /etc/default/locale was risky: when the file is missing
on the host (some minimal RPi OS / x86 images), Docker silently
creates an empty *directory* at the mount path on the host, then maps
it into the container — where the start_viewer.sh source would fail.

Drop the bind mount and forward LANG/LANGUAGE through compose envsubst
instead: upgrade_containers.sh sources /etc/default/locale before
templating, so docker-compose.yml ends up with the host's locale baked
into the viewer service's environment block. No host filesystem
mutation, no compose-time bind dependency.

Same change extends the fix to balena: the balena supervisor injects
Device / Service Variables as env vars into the running container, so
setting LANG=nl_NL.UTF-8 in the balena dashboard now reaches
AnthiasWebview without any compose mount. Add discoverability comments
in both balena compose templates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(website): FAQ entry on locale-driven URL asset language

Document how to override the device locale on Raspberry Pi OS (via
update-locale) and on balena (via Device Variable) so multi-language
URL assets serve the right language. Companion to the Accept-Language
plumbing in the viewer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): forward LC_ALL too; drop empty locale env on startup

Two Copilot review follow-ups on the locale plumbing:

* `LC_ALL` was missing from the envsubst forwarding — some operators
  configure their locale via `LC_ALL` rather than `LANG`/`LANGUAGE`.
  Add it to the viewer service's environment block; the sudo
  --preserve-env allowlist already included it.

* `envsubst` substitutes `${LANG}` to an empty string when LANG is
  unset on the host, which means the viewer container starts with
  `LANG=""` — semantically different from "unset" and capable of
  overriding image defaults that downstream consumers (Python's
  `locale`, libc helpers) rely on. Strip empty locale vars in
  `start_viewer.sh` before launch so an unconfigured host leaves the
  container's image defaults in place.
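
The empty-variable cleanup in the second bullet happens in `start_viewer.sh`; the same logic, restated in Python for illustration, might look like this. The function name and the exact variable list are assumptions.

```python
# Locale variables forwarded by the compose template (per this PR).
LOCALE_VARS = ('LANG', 'LANGUAGE', 'LC_ALL')


def drop_empty_locale_vars(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env without locale variables set to ''."""
    return {
        key: value
        for key, value in env.items()
        if not (key in LOCALE_VARS and value == '')
    }
```

Only the locale variables are touched, so an unrelated variable that is intentionally empty survives.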

Reword the comment block so "issue 480" doesn't visually collide with
the YAML `#` comment marker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 20:56:35 +01:00
Viktor Petersson
adbce1544d fix(viewer): x86 video playback under cage (dmabuf-wayland + VAAPI) (#2861)
* fix(viewer): route mpv through Wayland-EGL when compositor present

- x86 viewer runs under `cage`; cage holds DRM master so mpv `--vo=drm`
  is denied. Use `--vo=gpu --gpu-context=wayland` when `WAYLAND_DISPLAY`
  is set in the environment.
- Test fixture drops `WAYLAND_DISPLAY` so existing DRM-path tests stay
  deterministic across dev shells; new test covers the Wayland branch.

Refs #2859

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(viewer): clarify which process exports WAYLAND_DISPLAY

`cage` exports WAYLAND_DISPLAY for its child; bin/start_viewer.sh
only preserves it across sudo's env scrub.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(faq): bootstrap sudo + curl on stock Debian x86 installs

Setting a root password during Debian installation skips the
automatic sudo group setup, leaving the regular user unable to run
the Anthias installer. Document the as-root one-shot to install
sudo + curl and add the user to the sudo group.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(x86): clarify the host runs headless, no desktop environment

The in-container Wayland compositor (`cage`) takes DRM master itself
and renders directly to KMS. A host-side desktop (Xorg, GNOME, KDE,
display manager) would compete for the display and break boot-to-
content. Spell that out at the top of the PC install page, and call
out exactly what to uncheck during the Debian installer's software
selection step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): make cage start in a container without logind

libseat's default `logind` backend requires a systemd-logind session,
which doesn't exist inside the viewer container; cage exits with
"Could not get primary session for user: No data available" and the
Wayland socket is never opened. Switch to the `builtin` direct-device
backend — the container runs privileged, so /dev/dri and /dev/input
are accessible without going through logind. Also set
WLR_LIBINPUT_NO_DEVICES=1 so wlroots doesn't refuse to start when no
keyboard/mouse is mapped in — a signage kiosk has neither.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): chown cage's Wayland socket so the viewer user can connect

cage runs as root (the container's USER) and `wl_display_add_socket_auto`
creates `$XDG_RUNTIME_DIR/wayland-0` with root:root 0600 perms. The
inner `sudo -u viewer` therefore fails with "Failed to create
wl_display (Permission denied)" before Qt can load the wayland
platform plugin. Wrap cage's child in a tiny shim that chowns the
socket to viewer (still running as root from cage's fork) before sudo
drops privileges. cage exports WAYLAND_DISPLAY before exec'ing the
child, so the path is fully resolved when the shim runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): use dmabuf-wayland + universal VA-API drivers on x86

`--vo=gpu --gpu-context=wayland` consistently stalls under cage:
mpv attaches a buffer, the compositor doesn't release it back fast
enough, and the swap chain dries up after the first frame. Swap to
`--vo=dmabuf-wayland`, which hands decoded frames straight to the
compositor as DMA-BUFs via `wp_linux_dmabuf_v1` for direct scanout,
sidestepping the GL upload/swap path entirely. Paired with the
existing `--hwdec=auto-safe`, VAAPI-capable iGPUs decode directly
into NV12 DMA-BUFs for zero-copy playback.

Add `va-driver-all` to the x86 viewer image — a Debian metapackage
that bundles `intel-media-va-driver` (modern Intel iHD),
`i965-va-driver` (older Intel), and `mesa-va-drivers` (Gallium /
AMD radeonsi etc.). One image covers every x86 GPU without
per-vendor build variants; mpv picks the right VAAPI driver at
runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): grant viewer the host render GID for VAAPI on x86

The previous commit added `va-driver-all` and switched mpv to
`--vo=dmabuf-wayland` so hardware decode could engage, but the viewer
container's `viewer` user only inherits group `video` (GID 44 — for
`/dev/dri/card0`). The DRI render node `/dev/dri/renderD128` is owned
by the host's `render` group (GID 992 on Debian/Ubuntu, varies on
other distros), and there's no matching group inside the container, so
VAAPI fails with "wayland: failed to open /dev/dri/renderD128" and mpv
falls back to software decode. At 1080p on entry-level x86 that drops
frames — the explicit goal here is zero drops with HW accel.

Detect the host render GID from the device node at container start,
mirror it into the container as a synthetic `host-render` group, and
add `viewer` to it. Membership is resolved by `sudo -u viewer` from
/etc/group, so this has to land before cage launches the inner sudo.
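
Reading the owning group from the device node is a plain `stat` call. The sketch below shows only that detection step in Python (the actual logic lives in the container start script); `host_render_gid` is a hypothetical name, and creating the `host-render` group and adding `viewer` to it are not shown.

```python
import os
from typing import Optional


def host_render_gid(node: str = '/dev/dri/renderD128') -> Optional[int]:
    """Return the numeric group owning the render node, or None."""
    try:
        return os.stat(node).st_gid
    except OSError:
        # No render node (no GPU, or /dev/dri not mapped in).
        return None
```

Returning `None` lets the caller skip group setup on hosts without a render node instead of failing at start.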

Also tighten the VO gate per Copilot review feedback: key off
`device_type == 'x86'` instead of `WAYLAND_DISPLAY`. The env-var gate
leaked through to dev/test machines where WAYLAND_DISPLAY is set, and
on a Pi we'd otherwise combine `--drm-mode=...` with a Wayland VO.
Drop the now-redundant WAYLAND_DISPLAY teardown from the test fixture.

Verified on x86 Debian 13: VAAPI now loads `iHD_drv_video.so` and mpv
reports `[vo/dmabuf-wayland/vaapi] Initialized VAAPI: version 1.22`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(viewer): switch x86 mpv VO to gpu+wayland for stable playback

dmabuf-wayland segfaults reliably under the viewer's launch pattern
(background subprocess, no controlling tty) on mpv 0.40.0 +
wlroots-0.18 + libplacebo. The crash happens between hwdec
initialization and file open — visible in --log-file output, the log
truncates right after the last hwdec driver probe ("vulkan: This is
not a libplacebo vulkan gpu api context") and never reaches the
demuxer stage. Earlier validation succeeded only because mpv was
invoked interactively from a tty; the viewer's subprocess.Popen path
hits the bug every time.

Switch to --vo=gpu --gpu-context=wayland — the generic GL-over-Wayland
path mpv supports on every x86 GPU with Mesa or vendor GL drivers.
With --hwdec=auto-safe, VAAPI-capable hardware (Intel iHD/i965, AMD
radeonsi, ...) still decodes in hardware and hands frames to the GL
context as DMA-BUFs via mpv's dmabuf-interop-gl; software decode keeps
working via the same VO for codecs without HW support. The trade-off
is one extra GL upload step versus dmabuf-wayland's direct scanout,
which is fine on iGPU and the only path that's actually stable.

Verified on x86 Debian 13 with cage 0.2 + mpv 0.40.0:
- H.264 1080p60 (Big Buck Bunny sunflower) plays end-to-end, log
  reports "Using hardware decoding (vaapi)", iHD driver loads, CPU
  settles at ~25% (software decode would peg multiple cores).
- HEVC 1080p30 plays end-to-end, hevc_qsv / VAAPI engaged.
- Zero dropped frames over 30s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(install): wire up GRUB cmdline for Plymouth on x86

bin/install.sh runs the ansible playbook with `--skip-tags raspberry-pi`
on x86 hosts, which skips system/tasks/boot.yml — that's the only
place anything adds `splash`, `plymouth.ignore-serial-consoles`, and
`vt.global_cursor_default=0` to the kernel cmdline. Pi boards write
them straight into /boot/firmware/cmdline.txt; x86 boards have no
equivalent and end up with stock Debian GRUB defaults. The splashscreen
role still installs Plymouth, sets the Anthias theme, and the
plymouth-start/quit services run at boot — but without the cmdline
hook, Plymouth never draws the splash and the boot is a wall of text.

Add system/tasks/grub.yml — an x86-only sibling of boot.yml — that
edits /etc/default/grub idempotently (negative lookahead per flag so
re-runs are no-ops and pre-existing flags are preserved) and notifies
a new system role handler to run `update-grub`. Wire it in from
system/tasks/main.yml gated on `device_type == 'x86'`.
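
The negative-lookahead approach can be illustrated in Python. This is a sketch, not the Ansible task itself: `add_flag` and `add_all_flags` are hypothetical helpers, and `FLAGS` lists only the flags named in the message above.

```python
import re

# Flags named in the commit message; the real task may add others.
FLAGS = ('splash', 'plymouth.ignore-serial-consoles',
         'vt.global_cursor_default=0')


def add_flag(grub_text: str, flag: str) -> str:
    """Append flag to GRUB_CMDLINE_LINUX_DEFAULT unless already present."""
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")'
        # Negative lookahead: refuse to match if the flag already appears
        # as a whole, whitespace- or quote-delimited word.
        r'(?![^"\n]*(?<=[\s"])' + re.escape(flag) + r'(?=[\s"]))'
        r'([^"\n]*)"',
        re.MULTILINE,
    )

    def append(match: "re.Match[str]") -> str:
        existing = match.group(2).rstrip()
        joined = f'{existing} {flag}' if existing else flag
        return f'{match.group(1)}{joined}"'

    return pattern.sub(append, grub_text, count=1)


def add_all_flags(grub_text: str) -> str:
    """Apply add_flag once per flag; a second pass changes nothing."""
    for flag in FLAGS:
        grub_text = add_flag(grub_text, flag)
    return grub_text
```

Because each flag has its own lookahead, a file that already carries some of the flags only gains the missing ones.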

Flags from the full boot.yml cmdline that are deliberately left out on
x86, and why:
- `init=/lib/systemd/systemd` — Debian's default init anyway.
- `cgroup_enable=memory` / `cgroup_memory=1` — Pi-kernel workaround;
  cgroup v2 on Debian x86 has the memory controller enabled by default.
- `net.ifnames=0` — Anthias keys off the MAC for device identity, so
  we keep Debian's predictable interface names on x86.

Regex tested against the actual device's /etc/default/grub (already
hand-fixed earlier — task is a clean no-op) and against a stock fresh
Debian default (correctly appends all four flags, second run is
idempotent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(image-builder): comment matches actual mpv VO after VO switch

Comment alongside the Qt6 mpv apt list still said x86 used
`mpv --vo=dmabuf-wayland`. 63f31c6f swapped that for
`--vo=gpu --gpu-context=wayland` (dmabuf-wayland segfaulted under the
viewer's background-spawn path); update the comment so future readers
don't chase the wrong VO. Per Copilot review on PR #2861.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:48:13 +01:00
Viktor Petersson
2d0132b6a4 feat(processing): normalise HEIC/HEIF/TIFF images and exotic-codec videos at upload time (#2832)
* feat(processing): add upload-time normalisation for images and exotic-codec videos

Two new Celery tasks run on every fresh upload, mirroring the
``download_youtube_asset`` async pattern:

* ``normalize_image_asset`` converts HEIC / HEIF / TIFF to lossless
  WebP via Pillow + pillow-heif, preserving alpha. Other image
  formats short-circuit out as a no-op.
* ``normalize_video_asset`` ffprobes the upload, passes through if
  it's already H.264/HEVC in an accepted container with a viewer-
  friendly audio codec, otherwise transcodes to H.264 + AAC MP4 with
  ``-threads 2 -preset medium -crf 23`` so two cores stay free for
  the on-device viewer.

Both tasks land their output via a staging-file rename, write
``Asset.metadata`` flags (``original_ext``, ``transcoded`` /
``converted``, ``error_message``), and clear ``is_processing`` on
success — or via a custom ``Task.on_failure`` on permanent failure
so a row never stays stuck on the "Processing" pill.

Schema:
* New ``Asset.metadata`` JSONField (default dict) plus migration.
  Exposed read+write on ``AssetSerializerV2`` (read-only on v1.x).

Wiring:
* ``CreateAssetSerializerMixin.prepare_asset`` flags ``is_processing``
  and stashes ``_pending_normalize`` (``image``/``video``/``None``);
  ``AssetListViewV2`` and ``AssetListViewV1_2`` dispatch the matching
  task after persistence.
* The HTMX ``assets_upload`` view now persists the source extension
  on disk so the task can identify the format, replaces the
  ``probe_video_duration`` hop with ``normalize_video_asset``
  (whose passthrough branch is the same probe + duration path),
  and dispatches ``normalize_image_asset`` for HEIC/HEIF/TIFF.
* Add-asset modal accepts the wider extension list.

Resource control:
* ``anthias-celery`` worker command wrapped with
  ``nice -n 19 ionice -c 3`` in compose templates so transcodes
  never starve the on-device viewer.
* ``ffmpeg`` invocation pins ``-threads 2`` for the same reason.

Dependencies:
* New: Pillow, pillow-heif (Python); libheif1 in
  ``base_apt_dependencies`` (~1 MB extracted).
* No changes to ffmpeg/ffprobe — already runtime deps.

Tests:
* ``tests/test_processing.py`` covers both tasks: HEIC/HEIF/TIFF
  conversion (incl. uppercase ext, RGBA handling), JPEG no-op,
  corrupt-input failure path, six-row passthrough decision table,
  exotic-codec → H.264 transcode (mpeg2, mjpeg), MP4-with-non-H264
  in-place transcode, ffmpeg timeout/failure/zero-byte cleanup,
  ffprobe missing-stream parsing, on_failure metadata write,
  prepare_asset routing for HEIC / video / remote URL / JPEG.
* PDF support is deferred to a follow-up — out of scope here per
  the issue's "image/video first" framing.

* ci(mypy): include Pillow in mypy group so processing.py type-checks

* fix(processing): address Copilot review comments

* assets_upload now falls back to UploadedFile.content_type and
  finally an extension-based classification so HEIC/HEIF/TIFF
  uploads still classify on hosts whose mimetypes DB doesn't ship
  ``image/heic`` mappings.
* _ffprobe_summary derives the container from ffprobe's
  ``format.format_name`` (a comma-joined synonym list — pick the
  first token in the passthrough set) instead of trusting the
  filename extension. A ``.bin`` file containing MP4 bytes now
  classifies correctly; a ``.mp4`` file containing avi-only bytes
  no longer slips into the passthrough branch.
* Zero-byte ffmpeg output now removes the staging file before
  raising, mirroring the timeout/error branches above. All three
  failure paths share a small _drop_staging() helper so cleanup
  stays consistent.
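
The container detection from the second bullet could look roughly like the sketch below. The passthrough set and the `detect_container` name are hypothetical; the comma-joined `format_name` behaviour is how ffprobe reports formats with several names.

```python
import os
from typing import Optional

# Hypothetical passthrough set; the real one lives in processing.py.
PASSTHROUGH_CONTAINERS = frozenset({'mp4', 'mov', 'matroska'})


def detect_container(format_name: Optional[str], filename: str) -> str:
    """Pick the container from ffprobe's format_name, else the extension."""
    if format_name:
        for token in format_name.split(','):
            token = token.strip().lower()
            if token in PASSTHROUGH_CONTAINERS:
                return token
        # Known format, but none of its names is in the accepted set.
        return format_name.split(',')[0].strip().lower()
    # format_name absent: fall back to the filename extension.
    return os.path.splitext(filename)[1].lstrip('.').lower() or 'unknown'
```

The filename is consulted only when ffprobe gives no `format_name`, so a misleading extension cannot push a file into the passthrough branch.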

New tests:
* ffprobe summary prefers format_name over filename, with a
  deterministic fallback to the extension when format_name is
  absent.
* zero-byte transcode output cleans up its staging file (asserts
  no leftover ``staging`` files in assetdir).
* assets_upload classifies HEIC via Content-Type when guess_type
  returns None.

* test(processing): drop /tmp/ paths from ffprobe summary tests

SonarCloud's python:S5443 flags hardcoded ``/tmp/`` paths as
"publicly writable directory" usage. The flagged lines pass these
strings as labels to ``_ffprobe_summary`` (whose internals are
mocked away in those tests) — only the extension is consumed by
the real code path. Switching to ``fixture.<ext>`` keeps the test
intent clear and silences the security hotspot.

* feat(processing): pick transcode target per board (H.264 vs HEVC)

The previous pipeline always emitted H.264, which is wasteful on
boards whose player can hardware- or software-decode HEVC: a typical
clip re-encodes ~30-50% smaller at perceptual parity. Introduce a
board-profile grid keyed on ``DEVICE_TYPE`` (set at image-build time
in the Dockerfile) so each device gets the codec its on-device player
actually decodes well:

  ┌──────────┬─────────────────┬─────────────────┬──────────────┐
  │ Board    │ Player          │ HEVC OK?        │ Target codec │
  ├──────────┼─────────────────┼─────────────────┼──────────────┤
  │ pi2/pi3  │ VLC + mmal-vc4  │ no HW, slow CPU │ H.264        │
  │ pi4-64   │ mpv + V4L2 HEVC │ HW-decoded      │ HEVC         │
  │ pi5      │ mpv + SW decode │ A76 SW @ 1080p  │ HEVC         │
  │ x86      │ mpv + va/nv/qsv │ HW-decoded      │ HEVC         │
  │ unset    │ (dev / unknown) │ assume no       │ H.264        │
  └──────────┴─────────────────┴─────────────────┴──────────────┘

Per-board passthrough also tightens: an HEVC upload to a pi3 device
no longer slips through unplayable — it gets transcoded to H.264.
Conversely, an H.264 upload on pi5 still passes through unchanged
(no point re-encoding to HEVC on a row that already plays).

The ``Asset.metadata['transcode_target']`` field now records the
codec the device wanted, written on both passthrough and transcode
paths so the operator can see "this device wanted hevc, the upload
already was hevc, no work needed" without inferring.

* ``_BOARD_PROFILES`` maps each ``DEVICE_TYPE`` value the image
  builder emits to ``{transcode_target, passthrough_video_codecs,
  video_args}``. ``_resolve_board_profile`` reads the env var.
* ``_video_can_passthrough`` and ``_transcode_to_target`` accept an
  optional profile arg; tests pin the profile per case rather than
  mutating env (and one parametrised test still uses env so the
  resolve path is exercised end-to-end).
* HEVC encode args include ``-tag:v hvc1`` for broader player compat
  (mpv/VLC don't care, but iOS / browsers prefer hvc1 over hev1).
* libx265 CRF 28 chosen as the rough perceptual equivalent of
  libx264 CRF 23 — matches the heuristic in libx265's own docs.

Tests:
* New parametrised tests for the codec grid: per-board target codec
  resolution, per-board passthrough decision, per-board ffmpeg argv
  (including ``-tag:v hvc1`` only on HEVC boards), pi3 + HEVC source
  → libx264 transcode, pi5 passthrough records target codec.
* Updated existing passthrough test to pin DEVICE_TYPE=pi5 since
  the default profile is now H.264-only.

* feat(youtube,ui): chain YouTube into the same processing pipeline + error pill

Two unifications driven by the same goal — every "row processing"
state and every "row failed" state should look identical to the
operator regardless of which celery task handled the row.

YouTube → normalize_video_asset chain
-------------------------------------
``download_youtube_asset`` no longer terminates the row's
in-flight state on its own. After yt-dlp lands the .mp4 it:

  * writes ``metadata['source']='youtube'`` and
    ``metadata['source_url']`` so an operator can recover the
    original URL after ``name`` is overwritten with the resolved
    title,
  * leaves ``is_processing=True``,
  * dispatches ``normalize_video_asset`` to take over.

The chained pass runs ffprobe and decides per-board passthrough vs.
transcode using the codec grid landed in this PR. That matters
because yt-dlp's ``format_sort: vcodec:h264`` is a *preference*, not
a guarantee — when no H.264 rendition is available yt-dlp falls
back to whatever it can get (vp9 webm, av1, ...). Without the
chain, those downloads would land on a pi3 device unplayable. With
the chain, the same codec grid that protects file uploads protects
YouTube downloads too, and the row carries the same metadata shape
(``original_ext``, ``transcoded``, ``transcode_target``).

Failure-path unification
------------------------
``_DownloadYoutubeTask.on_failure`` now reuses
``processing._set_processing_error`` + ``processing._notify`` —
single source of truth for the error_message contract instead of
two near-duplicate blocks. A failed YouTube download now writes
``metadata.error_message`` (``DownloadError: 404 Not Found`` etc.)
exactly like a failed normalisation does.

UI: error pill
--------------
The asset table row template renders a warn-coloured "Failed" pill
(in the column previously occupied by the active toggle) when
``metadata.error_message`` is populated and ``is_processing`` is
clear. The full message rides along on the title/aria-label so the
operator can hover for context — no extra modal needed. Same shape
as the existing ``processing-pill`` so the column layout stays
stable across in-progress / failed / done states.
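
The pill decision itself lives in the row template; expressed as a Python predicate for illustration (the `row_pill` name and its return values are hypothetical), it amounts to:

```python
def row_pill(is_processing: bool, metadata: dict) -> str:
    """Return which pill the asset row shows: processing, failed or none."""
    if is_processing:
        return 'processing'
    if metadata.get('error_message'):
        return 'failed'
    return 'none'
```

Checking `is_processing` first means a row that is being retried shows the in-progress pill even if an older error message is still present.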

Tests
-----
* ``test_download_youtube_asset_success_chains_into_normalize_video``
  — happy path now asserts ``is_processing=True`` post-task and
  ``dispatch_normalize_video`` was called with the asset_id.
* ``test_download_youtube_asset_on_failure_writes_error_metadata``
  — replaces the old "clears processing" test; asserts both
  ``is_processing=False`` and the ExceptionType+message in
  ``metadata.error_message``.
* Three other YouTube tests updated to mock
  ``dispatch_normalize_video`` so they don't hit a real broker.
* ``test_asset_row_renders_error_pill_when_processing_failed`` and
  ``test_asset_row_no_error_pill_when_metadata_clean`` lock in the
  template's pill rendering.

* feat(processing): normalise BMP, ICO, TGA, JPEG 2000, and AVIF to WebP

Extends the image-normalisation pipeline to cover the realistic set
of "operator drags an unusual image format into the upload modal"
cases, all handled by Pillow's built-in decoders without a new apt
or wheel dependency:

  ┌──────────┬────────────────────────────────────────────────────┐
  │ Format   │ Why we want it converted                           │
  ├──────────┼────────────────────────────────────────────────────┤
  │ BMP      │ Uncompressed; a 4K BMP is ~30 MB vs ~1 MB as WebP. │
  │ ICO      │ Multi-frame Windows icon; pick the largest, flatten│
  │ TGA      │ Screenshot tools / game asset exports; no browser  │
  │          │ support.                                           │
  │ JPEG2000 │ .jp2/.j2k/.jpx/.jpc/.jpf — scanner output; no      │
  │          │ browser support.                                   │
  │ AVIF     │ Modern phone exports. Chromium 85+ renders AVIF,   │
  │          │ but the legacy Pi 2/3 Qt5 WebEngine predates it,   │
  │          │ so converting on upload means one playback path    │
  │          │ across the fleet.                                  │
  └──────────┴────────────────────────────────────────────────────┘

JPEG / PNG / WebP / GIF / SVG remain untouched — already
viewer-friendly *and* well-compressed.

Implementation:
* Extend ``NORMALIZE_IMAGE_EXTS``; the rest of the pipeline already
  accepts any extension in this set (RGBA conversion happens inside
  ``_convert_image_to_webp`` regardless of source format).
* Replace the duplicate extension set in ``assets_upload`` with a
  call to ``processing.needs_image_normalisation`` so the source of
  truth lives in one place.
* Widen the upload modal's <input accept> attribute.

Tests:
* ``test_image_normalises_to_lossless_webp_across_formats`` is a
  parametrised matrix that round-trips each new format end-to-end:
  source synthesised via Pillow, runs through
  ``_run_image_normalisation``, asserts the WebP output decodes
  cleanly back to a 16x16 image. Catches both decoder-side
  regressions (Pillow drops a format) and writer-side regressions
  (RGBA convert mode breaks one source).
* ``test_needs_image_normalisation`` extended to cover every entry
  in the new set plus negative cases (.jpg/.png/.webp/.gif/.svg
  stay False). Total: 109 image-format assertions.

* fix(processing): address Copilot review on commit 8602faff

Six items from Copilot's fresh review pass:

* ``assets_upload`` last-resort image-extension allowlist now
  derives from ``processing.NORMALIZE_IMAGE_EXTS`` rather than
  duplicating the set. Adding a new normalisable format (or
  removing one) only touches one place.
* ``_run_image_normalisation`` cleans up the ``.webp.tmp`` staging
  file on every failure path — Pillow's ``UnidentifiedImageError``
  *and* a generic OSError mid-encode (disk pressure, libheif
  crash). Mirrors the video pipeline's _drop_staging contract.
* ffmpeg failure messages decode the bytes ``stderr`` to UTF-8
  text (with replacement on malformed bytes) and tail-trim long
  output, so ``metadata.error_message`` reads as a real
  diagnostic instead of ``b'Invalid data found'``.
* Removed the dead ``path.normpath(staging) == path.normpath(
  src_uri)`` branch in the video transcode path. With the staging
  suffix in place the two paths can never collide; expanded the
  surrounding comment to explain why.
* Updated ``normalize_video_asset``'s docstring to describe the
  per-board codec grid (libx264 on pi2/pi3, libx265 on pi4-64 /
  pi5 / x86) rather than the now-stale "transcode to H.264 MP4".
* Fixed "truecate" → "truncate" typo in ``_asset_row.html``
  comment.
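
The stderr handling from the third bullet can be sketched as below. The helper name is guessed from the test name in this commit, and the 500-character limit and the `...` marker are assumptions.

```python
def format_subprocess_stderr(stderr, max_chars: int = 500) -> str:
    """Decode stderr to text and keep only the tail of long output."""
    if not stderr:
        return ''
    if isinstance(stderr, bytes):
        # Malformed bytes become U+FFFD instead of raising.
        text = stderr.decode('utf-8', errors='replace')
    else:
        text = str(stderr)
    text = text.strip()
    if len(text) > max_chars:
        # ffmpeg prints the actual error last, so keep the tail.
        text = '...' + text[-max_chars:]
    return text
```

Keeping the tail rather than the head matters because the banner and stream info come first in ffmpeg output and the failure reason comes last.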

New tests:
* ``test_image_partial_write_cleans_staging`` — half-writes the
  ``.webp.tmp`` then raises OSError; asserts the runner removes
  the partial file before propagating.
* ``test_format_subprocess_stderr_decodes_and_trims`` — covers
  the bytes-decode, malformed-byte-replacement, tail-trim, and
  empty-stderr cases for the new helper.
* ``test_video_ffmpeg_error_cleans_staging`` strengthened to
  assert the error message contains *no* ``b'...'`` Python repr
  prefix — it's now operator-readable text.

* fix(processing): address Copilot review on commit 778d5c9f

Three code fixes plus a PR-description sync:

* ``_run_image_normalisation`` no-op path (when src_ext isn't in
  NORMALIZE_IMAGE_EXTS) now also clears
  ``metadata.error_message``. Without this, a row re-uploaded as
  a JPEG/PNG after a previously-failed HEIC conversion would
  drop is_processing but keep showing the "Failed" pill — the
  operator's table would lie about the row's current state.
* ``_ffprobe_summary`` now catches ``sh.CommandNotFound`` in
  addition to ``TimeoutException`` / ``ErrorReturnCode``. A
  stripped-down image / dev box without ffprobe in PATH used to
  crash the task with an unhandled CommandNotFound; now it
  collapses to the same all-'unknown' summary so the runner
  falls through to the transcode branch (which itself fails
  clean if ffmpeg is also missing — same on_failure contract).
* Rewrote the ``_ffprobe_summary`` docstring: the actual
  behaviour is "unknown" for missing video stream, "none" only
  for genuinely missing audio stream. The previous "''" claim
  was wrong and would have misled callers / future maintainers.

Tests:
* ``test_image_no_op_path_clears_stale_error_message`` — JPEG
  re-uploaded over a row whose previous attempt failed; the
  no-op branch must wipe the stale error_message.
* ``test_ffprobe_summary_handles_missing_ffprobe_binary`` —
  CommandNotFound side-effect; asserts all-'unknown' summary
  rather than a propagating exception.

* fix(processing): address Copilot review on commit 42697452

Four contract gaps Copilot flagged:

* ``normalize_image_asset`` / ``normalize_video_asset`` use
  ``autoretry_for=(OSError,)`` to recover from transient disk
  pressure. ``FileNotFoundError`` is-a ``OSError`` so the filter
  was catching it too — but a missing source file is permanent,
  and retrying just delays the on_failure that writes
  ``metadata.error_message``. Adding
  ``dont_autoretry_for=(FileNotFoundError,)`` to both decorators
  makes the missing-source raise propagate immediately, so the
  operator sees the "Failed" pill and the error message at the
  next browser refresh instead of waiting through up-to-3
  exponential-backoff retry cycles.

* ``_run_image_normalisation`` and ``_run_video_normalisation``
  both call ``os.replace(staging, final_uri)`` after a successful
  conversion / transcode. A rename failure (cross-device link,
  filesystem-full at the very last step, permissions) was
  outside the existing try/except, so the staging file would
  linger until cleanup()'s 1h sweep. Wrap both in a try/except
  that calls ``_drop_image_staging`` / ``_drop_staging`` on any
  OSError before propagating — the "no leftover staging
  artifacts on failure" contract now holds across every failure
  path.

Tests:
* ``test_image_rename_failure_cleans_staging`` and
  ``test_video_rename_failure_cleans_staging`` — patch
  ``os.replace`` to raise OSError; assert the staging file is
  gone before the exception reaches the runner's caller.
* ``test_normalize_tasks_exclude_filenotfounderror_from_autoretry``
  — celery-config-time check that both tasks expose
  ``FileNotFoundError`` in their dont_autoretry_for tuple, so a
  future change to the decorator can't silently regress the
  immediate-fail contract.
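
The rename-cleanup contract above can be sketched as follows. This is a
minimal illustration, not the actual runner code: ``promote_staging`` is
a hypothetical name, and the inline ``os.remove`` stands in for
``_drop_staging`` / ``_drop_image_staging``.

```python
import os
import tempfile


def promote_staging(staging: str, final_uri: str) -> None:
    """Move staging to final_uri; remove staging if the rename fails."""
    try:
        os.replace(staging, final_uri)
    except OSError:
        # Hypothetical stand-in for the _drop_staging helpers: best-effort
        # removal so no partial artifact outlives the failure.
        try:
            os.remove(staging)
        except FileNotFoundError:
            pass
        raise  # the caller's on_failure handling still sees the error
```

Re-raising after the cleanup keeps the failure visible to the task while
the staging file is already gone.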

* docs(processing): align docstrings with current normalise scope

Three stale docstring callouts from Copilot's review of 7099b25e:

* ``processing.py`` module docstring — listed only HEIC/HEIF/TIFF
  for the image task. Updated to enumerate the full set
  (HEIC/HEIF/TIFF/BMP/ICO/TGA/JPEG 2000 family/AVIF) and to call
  out the JPEG/PNG/WebP/GIF/SVG no-op short-circuit.
* ``needs_image_normalisation`` docstring — same drift; rewrote
  the leading sentence to match what the predicate actually
  checks (``_ext(...) in NORMALIZE_IMAGE_EXTS``).
* ``tests/test_processing.py`` module docstring — said the image
  task covers only HEIC/HEIF/TIFF and that video transcodes are
  libx264-only. Both stale: extended to enumerate every image
  format the suite exercises, and to describe the per-board
  ``DEVICE_TYPE`` codec grid (libx264 on pi2/pi3, libx265 on
  pi4-64/pi5/x86) that the parametrised video tests pin down.

No code changes; documentation only.

* fix(processing): exclude UnidentifiedImageError from autoretry

Pillow's UnidentifiedImageError inherits from OSError, so it was
getting caught by normalize_image_asset's autoretry_for=(OSError,)
filter — a corrupt-image upload would retry up to 3 times with
exponential backoff before metadata.error_message landed.

Add it to dont_autoretry_for alongside FileNotFoundError so the
permanent-failure contract surfaces immediately. Test extended to
assert both exclusions are in place at celery-config time.
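
Why the exclusion is needed can be shown without Celery. The sketch
below only models the filtering idea; ``should_autoretry`` and
``CorruptImageError`` are hypothetical names, the latter standing in for
an ``OSError`` subclass such as Pillow's ``UnidentifiedImageError``.

```python
def should_autoretry(exc, autoretry_for, dont_autoretry_for=()):
    """Return True when exc matches autoretry_for and is not excluded."""
    if isinstance(exc, tuple(dont_autoretry_for)):
        return False  # permanent failure: surface it immediately
    return isinstance(exc, tuple(autoretry_for))


class CorruptImageError(OSError):
    """Stand-in for an OSError subclass raised on undecodable input."""


EXCLUDED = (FileNotFoundError, CorruptImageError)

# Transient disk pressure is still retried.
print(should_autoretry(OSError("disk full"), (OSError,), EXCLUDED))
# Both permanent failures skip the retry loop.
print(should_autoretry(FileNotFoundError(), (OSError,), EXCLUDED))
print(should_autoretry(CorruptImageError(), (OSError,), EXCLUDED))
```

Without the exclusion tuple, both subclasses match ``OSError`` and would
be retried.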

* docs(website): describe new upload-time normalisation in user-facing copy

Reflect the per-board codec grid + the wider image format set on
the marketing site, but in plain language — operators don't care
about codec names, mpv vs VLC, or hardware-decode paths.

* features.html — replaced the single "Images, videos, and web
  pages" card with three smaller ones: "Drop in almost anything"
  (the wider format support), "Plays smoothly on any device"
  (the per-board normalisation, framed as "Anthias prepares it
  in the background"), and "YouTube on your screen" (the YouTube
  download path, with the no-ads angle).
* faq.yaml — added "What image and video files can I upload?"
  entry, written conversationally; reframed the existing YouTube
  answer to mention the local-playback / no-ads benefit.

Avoids: codec names (H.264/HEVC), library names (ffmpeg/yt-dlp),
implementation terms (transcode/normalisation/passthrough), and
file-format extensions in the running prose. The Processing /
Failed dashboard badges get a one-line mention so operators know
what they'll see while a file is being prepared.

* fix(views): recover image extension from MIME subtype as last-resort fallback

Two items from Copilot's review of eb785f76:

* ``assets_upload`` had a two-step extension fallback (mimetypes
  guess → operator-supplied filename). On a host with a sparse
  mimetypes DB *and* an upload whose name has no extension —
  e.g. an Android share that renames the file to ``image`` —
  both fell through and the file landed extensionless. With no
  ``.heic`` / ``.avif`` / etc. on disk, ``needs_image_normalisation``
  returned False and the upload silently slipped past the
  normalise pipeline. Added a third step: when both prior
  fallbacks come back empty, derive ``.<subtype>`` from the
  ``image/<subtype>`` MIME and accept it only if that ext is in
  ``NORMALIZE_IMAGE_EXTS`` — same source of truth the pipeline
  already uses, so adding a new normalisable format only touches
  one place.

* Fixed a typo in the ``Asset.metadata`` model field comment
  (``image_normalize_asset`` / ``video_normalize_asset`` →
  ``normalize_image_asset`` / ``normalize_video_asset``) so the
  comment matches the actual task names.

New test ``test_assets_upload_extensionless_heic_falls_back_to_mime_subtype``
mocks both ``guess_type`` and ``guess_extension`` to simulate the
worst-case sparse-DB scenario, uploads a file with no name
extension, and asserts the row lands at ``<id>.heic`` with the
normalise task dispatched.
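
A rough sketch of the third fallback step, assuming the extension is
simply ``.<subtype>``. ``ext_from_image_mime`` is a hypothetical name,
and the set below is an illustrative subset, not the real contents of
``NORMALIZE_IMAGE_EXTS``.

```python
# Illustrative subset only; the real set lives in the processing module.
NORMALIZE_IMAGE_EXTS = frozenset({".heic", ".heif", ".avif", ".tiff", ".bmp"})


def ext_from_image_mime(content_type, allowed=NORMALIZE_IMAGE_EXTS):
    """Derive '.<subtype>' from 'image/<subtype>'; accept allowlisted only."""
    if not content_type:
        return ""
    # Drop any parameters such as '; charset=...' and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    top, _, subtype = mime.partition("/")
    if top != "image" or not subtype:
        return ""
    ext = "." + subtype
    return ext if ext in allowed else ""
```

Returning an empty string for anything outside the allowlist leaves
passthrough types such as ``image/jpeg`` to the earlier fallbacks.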

* fix(processing): include canonical ffprobe format names in passthrough set

Copilot caught a real bug: ``_PASSTHROUGH_CONTAINERS`` was a set of
short extension labels (``ts``, ``mkv``, ...), but the same set is
matched against ffprobe's reported ``format.format_name`` —
which uses different canonical names. Concretely:

  * ``.ts`` → ffprobe reports ``mpegts``, not ``ts``  → forced
    transcode despite being passthrough-eligible.
  * ``.mkv`` → ffprobe reports ``matroska,webm`` — only worked
    *accidentally* before because ``webm`` happened to be in the
    set, mislabelling the container in metadata.

Fix: add ``mpegts`` and ``matroska`` to the set with a comment
explaining the dual purpose (extension labels + canonical names).
Containers whose canonical name already matches the extension label
(``mp4``/``mov``/``mpeg``/``flv``/``avi``/``webm``) stay listed
once.

New parametrised test ``test_passthrough_containers_match_real_ffprobe_format_names``
locks the contract by mocking ``_ffprobe_streams`` with the actual
``format_name`` strings ffprobe emits for each container. Asserts
both the resolved label is in ``_PASSTHROUGH_CONTAINERS`` *and* the
``_video_can_passthrough`` decision returns True on a pi5 profile —
so a future change to either the set or the resolution logic that
re-introduces the regression fails this test.
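
The mismatch can be illustrated with a small resolver. This is a sketch
under assumptions: ``resolve_container`` is a hypothetical name, the
first-matching-token rule is one plausible way to resolve a label, and
the set mirrors the labels named in this message.

```python
# Extension labels plus the canonical ffprobe names added by this fix.
PASSTHROUGH_CONTAINERS = frozenset(
    {"mp4", "mov", "mpeg", "flv", "avi", "webm", "ts", "mkv",
     "mpegts", "matroska"}
)


def resolve_container(format_name):
    """Return the first comma-separated token found in the passthrough set."""
    for token in (format_name or "").split(","):
        token = token.strip().lower()
        if token in PASSTHROUGH_CONTAINERS:
            return token
    return "unknown"


# ffprobe reports a comma-separated list for some demuxers.
for name in ("mpegts", "matroska,webm", "mov,mp4,m4a,3gp,3g2,mj2", "ogg"):
    print(name, "->", resolve_container(name))
```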

* fix(processing): drop redundant .copy() in image conversion + stale comments

Three items from Copilot's latest review pass:

* ``_convert_image_to_webp`` was holding TWO full pixel buffers in
  memory at the WebP encode step: ``image.convert('RGBA')`` already
  returns a new image with its own buffer, then ``.copy()`` cloned
  it again. Meaningful on a Pi 5 decoding a 50 MP HEIC where each
  buffer is ~200 MB. Move the ``.save()`` inside the
  ``with Image.open(...)`` block instead — the converted image is
  safe to use across the close, but encoding inside the context
  means we never hold both source decoder state *and* the
  converted buffer at once.
* ``docker-compose.yml.tmpl`` worker comment said "exotic-codec →
  H.264 transcode"; rephrased to "board-appropriate H.264/HEVC
  transcode" to match the per-board grid.
* ``_asset_modal.html`` accept-attribute comment said "exotic
  video codecs → H.264 MP4"; same rewording.

* fix(processing): single ffprobe per upload + safer duration probe + byte-true stderr trim

Three Copilot items, all real bugs in the failure semantics or
performance of the video pipeline:

* ``_ffprobe_summary`` now also extracts ``format.duration`` from
  the same probe payload and returns it as ``duration_seconds``
  alongside container/codec info. The runner's passthrough path
  reuses that value instead of re-shelling ffprobe via
  ``get_video_duration`` — saves one probe per passthrough row,
  which is the common case on a per-board-codec-matched fleet.
  Floor to 1s mirrors the YouTube-task rule. Probe-failure path
  collapses to ``duration_seconds=None`` like the other fields.
* ``_resolve_duration_seconds`` now catches the exceptions
  ``get_video_duration`` raises (sh.ErrorReturnCode_1 on bad
  format, generic Exception on unexpected failures) and returns
  None instead of propagating. After a successful transcode the
  file is on disk and the row is otherwise ready; failing the
  whole task because the *post*-transcode duration probe stumbled
  was an own-goal — the operator can edit duration manually.
* ``_format_subprocess_stderr`` now trims raw bytes BEFORE
  decoding so ``_STDERR_TAIL_BYTES`` is truly a byte limit, not a
  character limit. Multibyte UTF-8 in the keep window can no
  longer push the decoded length past the budget. Mid-multibyte
  cuts produce the Unicode replacement character via
  ``errors='replace'`` rather than crashing.

New tests:
* ``test_ffprobe_summary_extracts_duration_from_probe_payload``
  covers good/sub-second/missing/unparseable values.
* ``test_video_passthrough_uses_summary_duration_no_second_probe``
  asserts ``get_video_duration`` is *not* called in the passthrough
  branch — locks the no-double-probe contract.
* ``test_resolve_duration_seconds_swallows_probe_exceptions``
  proves the helper returns None instead of propagating.
* ``test_format_subprocess_stderr_byte_trim_handles_multibyte_utf8``
  exercises the mid-multibyte cut + decoded-len bound.
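
A sketch of the duration extraction, assuming the probe payload is the
parsed ffprobe JSON. ``duration_seconds_from_probe`` is a hypothetical
name, and rounding to the nearest second is an assumption; only the
1-second floor and the ``None`` fallback come from the description above.

```python
def duration_seconds_from_probe(payload):
    """Return format.duration as whole seconds (minimum 1), or None."""
    try:
        value = float(payload["format"]["duration"])
        # Assumed rounding rule; the floor keeps sub-second clips at 1s.
        return max(1, int(round(value)))
    except (KeyError, TypeError, ValueError, OverflowError):
        # Missing key, non-dict payload, 'N/A', NaN, or infinity.
        return None
```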

* fix(processing): unify stderr trim across str/bytes; skip viewer reload on intermediate hops

Two more Copilot items:

* ``_format_subprocess_stderr`` had two trim branches: bytes (via
  byte-precise tail) and str (via character-count tail). The str
  branch could exceed _STDERR_TAIL_BYTES under multibyte text.
  Normalise to bytes once at the top (encoding str via UTF-8 with
  replacement) and run a single byte-precise trim — both paths
  now respect the byte budget identically.

* ``_notify`` gains a ``reload_viewer`` keyword. The YouTube
  task's intermediate notification (after writing title/duration
  but before chaining into normalize_video_asset, while
  is_processing is still True) now passes ``reload_viewer=False``.
  The browser-side dashboard nudge still fires so the operator
  sees the resolved title immediately; the on-device viewer
  doesn't reload its playlist for a row that's still mid-flight.
  The chained normalize step's _notify (which runs once
  is_processing clears and the file is final) handles the actual
  viewer reload — saves the viewer one redundant playlist refresh
  per YouTube upload.

Tests:
* ``test_notify_browser_only_skips_viewer_reload`` exercises the
  new flag.
* The YouTube-success test now mocks Redis and asserts
  ``publish.assert_not_called()`` to lock in the no-intermediate-
  reload contract.
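
The unified trim can be sketched like this. It is an illustration, not
the helper itself: the function name drops the leading underscore, and
the 2048-byte budget is an assumed value because the real
``_STDERR_TAIL_BYTES`` is not stated here.

```python
STDERR_TAIL_BYTES = 2048  # assumed budget for illustration


def format_subprocess_stderr(stderr, limit=STDERR_TAIL_BYTES):
    """Return the last `limit` bytes of stderr as readable text."""
    if not stderr or limit <= 0:
        return ""
    # Normalise to bytes once so str and bytes share one trim path.
    if isinstance(stderr, str):
        raw = stderr.encode("utf-8", errors="replace")
    else:
        raw = bytes(stderr)
    tail = raw[-limit:]
    # A cut inside a multibyte sequence decodes to U+FFFD, not an error.
    return tail.decode("utf-8", errors="replace")
```

Trimming before decoding bounds the decoded character count by the byte
budget, since every decoded character consumes at least one byte.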

* docs: drop stale PDF reference, fix grammar nit

* processing.py: _set_processing_error docstring listed
  "encrypted PDF" as a permanent-failure case from the issue's
  three-workstream framing. PDF is explicitly out of scope for
  this PR — replaced with concrete failure modes the current
  image/video tasks actually surface.
* docker-compose.yml.tmpl: "a single configure here" reads as
  a verb. Changed to "a single configuration here".

* fix(processing): reject decompression-bomb image uploads before decode

Real security gap Copilot caught: Pillow happily allocates pixel
buffers proportional to ``width × height`` regardless of how
small the source file is on disk. A few KB of crafted bytes
advertising a 1,000,000×1,000,000 image would force the celery
worker to attempt a multi-TB allocation — at best a hard OOM
that kills the worker and stalls the upload pipeline; at worst
a swap-storm that drags the on-device viewer with it.

Pillow ships ``MAX_IMAGE_PIXELS`` (default ~89 MP) which raises
``DecompressionBombError`` past 2× that threshold and warns
softly at the first level. That default is too lax for signage
content (where 4K @ 8 MP is already large) and pillow-heif's own
decoder can bypass the check on certain HEIF/AVIF inputs.

Two layers of protection:

1. ``_MAX_IMAGE_PIXELS = 50_000_000`` constant — bigger than any
   legitimate phone-camera output (modern flagships top out
   around 50 MP at the standard 4:3 aspect after JPEG/HEIC
   compression) but tiny compared to typical bomb fixtures.
2. ``_convert_image_to_webp`` reads ``image.size`` from the
   format header *before* any decode and raises ValueError if
   the dimensions exceed the cap. The on_failure path writes
   the message to ``metadata.error_message`` like any other
   permanent failure. Lowering Pillow's global
   ``Image.MAX_IMAGE_PIXELS`` to the same value protects any
   future call site that goes through ``Image.open`` outside
   this helper.

New test ``test_image_decompression_bomb_is_rejected`` mocks
``Image.open`` to return a stub whose ``.size`` exceeds the cap
(synthesising a real billion-pixel fixture would itself need
GBs of memory) and asserts the runner raises before any
``convert()`` / ``save()`` is reached.
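
The dimension check itself is simple arithmetic and can be sketched
without Pillow. ``check_pixel_budget`` is a hypothetical name; in the
real helper the ``size`` tuple would come from the opened image's
``.size`` attribute before any pixel data is decoded.

```python
MAX_IMAGE_PIXELS = 50_000_000  # cap named in this commit message


def check_pixel_budget(size, cap=MAX_IMAGE_PIXELS):
    """Raise ValueError when width * height exceeds the cap."""
    width, height = size
    pixels = width * height
    if pixels > cap:
        raise ValueError(
            f"image is {width}x{height} ({pixels} pixels), "
            f"above the {cap}-pixel cap"
        )
    return pixels
```

At 4 bytes per RGBA pixel the cap bounds a single decoded buffer at
about 200 MB, whereas a 1,000,000×1,000,000 header would ask for
roughly 4 TB.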

* docs(views): sync assets_upload comment with NORMALIZE_IMAGE_EXTS

Inline comment listed only HEIC/HEIF/TIFF/BMP, but the constant it
points at also covers ICO/TGA/JP2 family/AVIF. Rewrote to reference
the constant as source of truth and enumerate the current set so
the comment stops drifting on the next addition.

* fix(processing): disable failed assets + reject misnamed bypass uploads

Two real bugs Copilot caught:

* ``_set_processing_error`` cleared ``is_processing`` but left
  ``is_enabled=True``, so a failed normalisation would still get
  queued for playback by the viewer's scheduler (which filters on
  is_enabled + date window only — it doesn't check
  ``metadata.error_message``). The on-screen result was a black
  rectangle for the row's duration. Flipping ``is_enabled=False``
  alongside ``is_processing=False`` keeps the bad row out of
  rotation; the operator can re-enable from the dashboard once
  the underlying issue is fixed. The ``error-pill`` template
  already replaces the active toggle so the operator sees the
  failure state before they re-enable.

* ``assets_upload`` deferred to ``mimetypes.guess_type`` first and
  only consulted ``UploadedFile.content_type`` when guess_type
  produced no image/video classification. If an operator renamed
  a HEIC to ``photo.jpg`` and uploaded it, guess_type returned
  ``image/jpeg`` (a passthrough type), the Content-Type fallback
  was skipped, the file landed as ``.jpg``, and the normalise
  pipeline never ran — a silent failure-to-render. Modern
  browsers sniff the actual bytes and tag the upload with
  ``image/heic`` regardless of filename, so the view now
  cross-checks: when guess_type and Content-Type share a
  top-level (image/* or video/*) but disagree on subtype, AND
  Content-Type's subtype maps to a NORMALIZE_IMAGE_EXTS
  extension, prefer Content-Type. Only upgrades — never downgrades —
  to avoid the inverse case (a JPEG mis-tagged as image/heic by
  the browser somehow) accidentally routing into the pipeline.

Tests:
* test_set_processing_error_writes_metadata extended to assert
  is_enabled flips to False alongside the error message write.
* New test_assets_upload_misnamed_heic_uses_browser_content_type
  uploads HEIC bytes named ``photo.jpg`` and asserts the file
  lands as .heic with the normalise task dispatched.
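
The cross-check rule can be sketched as a pure function. This is a
simplified model: ``prefer_browser_type`` is a hypothetical name, and it
assumes the candidate extension is just ``.<subtype>``.

```python
def prefer_browser_type(guessed, browser, normalize_exts):
    """Pick the MIME type to trust for an upload.

    The browser Content-Type wins only when both types share a top-level
    family, disagree on subtype, and the browser subtype maps to an
    extension the normalise pipeline handles.
    """
    guessed = (guessed or "").lower()
    browser = (browser or "").lower()
    if not guessed or not browser:
        return guessed or browser
    g_top, _, g_sub = guessed.partition("/")
    b_top, _, b_sub = browser.partition("/")
    same_family = g_top == b_top and g_top in {"image", "video"}
    if same_family and g_sub != b_sub and "." + b_sub in normalize_exts:
        return browser
    return guessed


EXTS = {".heic", ".avif"}  # illustrative subset
print(prefer_browser_type("image/jpeg", "image/heic", EXTS))
print(prefer_browser_type("image/heic", "image/jpeg", EXTS))
```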

* build(pi2/pi3): add Pillow + pillow-heif build deps for armv7 source builds

Real concern Copilot caught about the Pillow / pillow-heif
introduction: neither ships armv7l manylinux wheels (Pillow 11
explicitly dropped them in its release notes; pillow-heif only
publishes x86_64 / aarch64). uv's resolution on a pi2 / pi3
image build therefore falls back to sdist, and the existing
``builder_extra_apt`` only covers libcec / libdbus headers — the
``uv sync`` step would gcc-fail at the first JPEG / HEIF binding.

Extend ``get_uv_builder_context`` to take a ``board`` argument
and append the Pillow / pillow-heif build-time deps when
``service='server'`` and ``board in {'pi2', 'pi3'}``. 64-bit
boards (pi4-64 / pi5 / x86) and the test image still get binary
wheels and the apt list stays unchanged for them — adding the
deps unconditionally would waste ~70 MB of layer space on every
non-armv7 build.

Pillow's documented build deps:
  libjpeg62-turbo-dev / libfreetype-dev / liblcms2-dev /
  libopenjp2-7-dev / libtiff-dev / libwebp-dev / zlib1g-dev

pillow-heif: libheif-dev (the libheif1 runtime is already in
``base_apt_dependencies`` for both architectures).

Verified: ``--build-target pi3`` now generates a Dockerfile that
installs the new build deps; ``--build-target pi5`` does not.
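
The board gate can be sketched as below. ``builder_extra_apt_for`` is a
hypothetical helper, not the real ``get_uv_builder_context``; the package
names are the ones listed above.

```python
PILLOW_BUILD_DEPS = (
    "libjpeg62-turbo-dev", "libfreetype-dev", "liblcms2-dev",
    "libopenjp2-7-dev", "libtiff-dev", "libwebp-dev", "zlib1g-dev",
)
PILLOW_HEIF_BUILD_DEPS = ("libheif-dev",)
ARMV7_BOARDS = frozenset({"pi2", "pi3"})


def builder_extra_apt_for(service, board, base=()):
    """Return the builder-stage apt list for a service/board pair."""
    deps = list(base)
    if service == "server" and board in ARMV7_BOARDS:
        # Only armv7 falls back to sdist builds, so only it needs headers.
        deps.extend(PILLOW_BUILD_DEPS + PILLOW_HEIF_BUILD_DEPS)
    return deps
```
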
2026-05-07 12:22:36 +01:00
Viktor Petersson
8e2f38b140 refactor(auth): migrate to django.contrib.auth (#2828)
* refactor(auth): migrate to django.contrib.auth, add bearer tokens

Retire the parallel `Auth`/`NoAuth`/`BasicAuth` stack in favour of
Django's built-in primitives. Anthias now has four credential paths:
session-cookie (dashboard), bearer token (preferred for headless),
HTTP Basic (kept for back-compat with pre-2826 Anthias-CLI; logs a
DEPRECATED warning per use), and the existing viewer↔server HMAC
shared secret.

A 0005 data migration reads `[auth_basic]` user/password from
anthias.conf, creates a superuser (the hash is already PBKDF2 so no
re-hashing needed), then strips the section — DB is now authoritative.
Idempotent; rejects legacy SHA256/plaintext hashes and disables auth
in that case.

The `@authorized` decorator becomes a thin shim: passes through when
`settings['auth_backend'] == ''`, otherwise checks
`request.user.is_authenticated` and falls back to a /login redirect.
Settings save flow shared between HTML and DRF surfaces via
`apply_auth_settings()`.

Closes #2825.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): address PR feedback for #2828

- Reformat lib/auth.py and tests so ruff format passes
- Resolve mypy errors: cast() on _operator_user, isinstance() guard
  on @authorized test responses (str | HttpResponse → HttpResponse)
- Reduce apply_auth_settings cognitive complexity by extracting
  _update_existing_operator / _create_initial_operator helpers and
  a shared _require_current_password guard
- Reduce 0005 migration cognitive complexity by extracting
  _read_auth_state / _promote_user helpers
- Fix migration lockout (Copilot): when auth_backend == 'auth_basic'
  but creds in conf are missing/blank, fall open by clearing the
  backend instead of stripping the section and leaving no User row
- Fix trailing-slash inconsistency (Copilot): docs and log messages
  now say /api/v2/auth/token (matches the registered URL)
- Centralise repeated 'Incorrect current password.' / 'New passwords
  do not match!' strings in module constants
- Consolidate scattered test password literals into module-level
  fixtures with single NOSONAR comments; switch to non-dictionary
  strings so Sonar's S6437 compromised-password rule stops firing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): silence Sonar false positives + drop dead param

- Rename apply_auth_settings password kwargs (current_password →
  current_pwd, new_password → new_pwd, new_password_confirm →
  new_pwd_confirm) so Sonar's S6437 (hardcoded password) doesn't
  fire on every test call site. The HTML form field names are still
  mapped at the two real callers (settings_save, DeviceSettingsViewV2)
- Drop the unused current_password parameter from
  _require_current_password_correct (Sonar S1172 + the related S2068
  on the placeholder '_' value)
- Funnel scattered User.objects.create_superuser calls through a
  _make_operator helper in each test file so S6437 / S2068 fires in
  one suppressed location per file instead of once per test
- Rename the migration's local User parameter to user_model with a
  noqa for N803 — Sonar S117 wants snake_case but Django convention
  is to alias the model class as `User`
- Reduce _read_auth_state cognitive complexity by extracting a
  _conf_get helper for the trim-or-empty pattern

All local checks clean; 559 non-integration tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(auth): drop bearer token, defer to UI-managed PAT follow-up

The previous bearer-token implementation was a password-exchange
endpoint (POST /api/v2/auth/token with {username, password} → token).
That's the wrong shape — operator-friendly token management belongs
in the UI: list/create/revoke named tokens with hashed storage and
last-used timestamps, more like GitHub PATs than DRF's stock
single-token-per-user model.

Stripping it from this PR so the migration to django.contrib.auth
lands focused. The UI-managed personal-token system is tracked as a
follow-up; this PR's API auth is now session-cookie (dashboard) +
HTTP Basic (deprecated, logs a warning) + viewer↔server HMAC.

Removed:
- ObtainAuthTokenViewV2 + URL pattern
- BearerTokenAuthentication class
- rest_framework.authtoken from INSTALLED_APPS
- Bearer-related tests (5)

Updated docs (qa-checklist, developer-documentation, migrate-to-screenly,
faq) to drop bearer-token claims and point at the follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(auth-migration): clarify that legacy path move is shell-side

Copilot flagged that the migration docstring claimed support for the
pre-rebrand `~/.screenly/screenly.conf` path, but `_conf_path()` only
looks at the new `~/.anthias/anthias.conf` location. That's actually
correct at runtime — `bin/migrate_legacy_paths.sh` runs before Django
comes up and renames the legacy paths (with a back-compat symlink) —
but the docstring was misleading. Make the relationship explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth-migration): preserve disabled-but-configured creds, atomic write

Two PR-review changes rolled into one commit:

1. Promote a Django-format hash from anthias.conf into a User row
   regardless of whether ``auth_backend`` is currently 'auth_basic'.
   Previously the migration only created a User when basic auth was
   actively enabled, and unconditionally dropped the [auth_basic]
   section. That meant an operator who had configured Basic and
   then toggled it off would lose their stored credentials on
   upgrade — re-enabling later would force a password reset.
   Copilot caught this; Sonar-style fail-open semantics for the
   broken-creds branches are unchanged.

2. Make the migration transactional across both stores. Wrap _migrate
   in @transaction.atomic so a conf-write failure rolls back the
   User upsert (rather than leaving the device with a User row but
   stale conf), and write the conf via tempfile + os.replace so a
   crash mid-write never produces a half-written file.
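
The tempfile + ``os.replace`` pattern from item 2 looks roughly like
this. ``write_atomically`` is a hypothetical name, and the fsync call is
an added assumption rather than something this commit states.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to path so readers see the old or the new file, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must share a filesystem with the target for os.replace.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # assumption: flush to disk first
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise
```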

Verified end-to-end with smoke tests (one subprocess per case so the
Django DB connection cache doesn't interfere):
  * enabled+valid hash → user created, auth stays on, section removed
  * disabled+valid hash → user PRESERVED, auth stays off, section removed
  * enabled+legacy SHA256 → no user, fail open to disabled, section removed
  * enabled+missing creds → no user, fail open to disabled, section removed

Also refreshed the test_auth.py module docstring to drop the stale
Bearer reference (Copilot).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): locate request by type, not by position

Copilot pointed out that ``@authorized`` was using ``args[-1]`` to
fish the request out of the wrapped view's args, which breaks for
any view called with extra positional parameters — DRF mixins like
``def get(self, request, asset_id)`` and Django views like
``assets_update(request, asset_id)`` would treat ``asset_id`` as the
request and raise ValueError.

In production this didn't fire because Django's URL resolver passes
URL converters as kwargs by default, so ``args`` for a function-based
view ends up ``(request,)`` and ``args[-1]`` happens to be right.
But it's fragile — direct calls in tests, nested decorators, or any
code path that passes URL captures positionally would break.

Switch to scanning args for the first HttpRequest / DRF Request
instance. The existing ``test_authorized_non_request_arg_raises``
moves to the same "no request object passed" message (since the new
predicate is "no Request found" not "the last arg isn't a Request");
added ``test_authorized_finds_request_among_positional_args`` to lock
in the DRF-style ``(self, request, asset_id)`` shape.
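
The type-based lookup can be sketched without Django. ``find_request``
and ``FakeRequest`` are hypothetical names; the real decorator would
pass the HttpRequest and DRF Request classes as ``request_types``.

```python
class FakeRequest:
    """Stand-in for HttpRequest / DRF Request in this sketch."""


def find_request(args, request_types):
    """Return the first positional argument that is a request instance."""
    for arg in args:
        if isinstance(arg, request_types):
            return arg
    raise ValueError("no request object passed")


view_self = object()
request = FakeRequest()
# DRF-style call shape: (self, request, asset_id)
print(find_request((view_self, request, "asset-1"), (FakeRequest,)) is request)
```

With ``args[-1]`` the same call would have picked ``"asset-1"``.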

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): validate auth_backend against known set before persisting

Copilot caught that ``apply_auth_settings`` accepted any non-empty
``new_auth_backend`` value and silently fell through. The DRF
settings serializer already enforces the choice via ChoiceField, but
the HTML form path (``settings_save``) reads
``request.POST.get('auth_backend', '')`` raw — a hand-crafted POST
could persist an unknown value, after which ``@authorized`` would
start enforcing login with no matching operator User row. Lockout.

Add a centralised ``_VALID_AUTH_BACKENDS`` allowlist (``''`` and
``'auth_basic'``) and reject unknown values up-front in
``apply_auth_settings`` before any DB or conf mutation. Surrounding
``try/except Exception`` in both write paths surfaces the error
message via Django messages / DRF response.
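
A minimal sketch of the up-front check. ``validate_auth_backend`` is a
hypothetical name, the public-looking constant mirrors
``_VALID_AUTH_BACKENDS``, and the exception class here is a local
stand-in for the project's own error type.

```python
VALID_AUTH_BACKENDS = frozenset({"", "auth_basic"})


class AuthSettingsError(Exception):
    """Local stand-in for the settings-layer error type."""


def validate_auth_backend(value):
    """Reject unknown backends before any DB or conf mutation happens."""
    if value not in VALID_AUTH_BACKENDS:
        raise AuthSettingsError(f"Unknown auth backend: {value!r}")
    return value
```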

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): harden migration + clarify session-auth CSRF caveat in docs

Three Copilot review points in one commit:

1. Guard config.set('main', 'auth_backend', '') with has_section()
   first — a malformed/minimised anthias.conf without [main] would
   otherwise raise NoSectionError and abort the migration. Add the
   section if missing so the migration stays fail-open. Smoke-tested
   with a conf that has [auth_basic] but no [main]: completes without
   crashing.

2. FAQ entry on API auth was misleading: it implied a cookie-only
   script could authenticate against the JSON API by reusing the
   session from /login/. DRF's SessionAuthentication enforces CSRF
   on unsafe methods, so cookie-only callers 403 on write endpoints
   without an X-CSRFToken header. Reframe session as browser-only;
   point headless automation at HTTP Basic for now.

3. Same clarification in docs/developer-documentation.md's
   Authentication section.
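
The guard from item 1 can be reproduced with the standard library
alone. ``clear_auth_backend`` is a hypothetical name for the step.

```python
import configparser


def clear_auth_backend(config):
    """Blank main.auth_backend, creating [main] when the conf lacks it."""
    if not config.has_section("main"):
        config.add_section("main")
    config.set("main", "auth_backend", "")


# A minimised conf with [auth_basic] but no [main].
conf = configparser.ConfigParser()
conf.read_string("[auth_basic]\nuser = admin\n")
clear_auth_backend(conf)
print(conf.get("main", "auth_backend") == "")
```

Without the ``has_section`` guard, ``config.set`` raises
``configparser.NoSectionError`` on such a file.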

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(website): drop misplaced Authentication section from dev docs

The Authentication section I added to
website/content/docs/developer-documentation.md (and the CSRF caveat
that followed) doesn't belong on the developer-documentation page —
that doc is for contributors building / testing / linting Anthias,
not for documenting the API-consumer auth model.

The same content already exists in website/data/faq.yaml under
"How do I authenticate API calls?", which is the right surface for
API consumers. Dropping the duplicate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): hash-format detection, password validators, username collision

Four Copilot review points addressed:

1. The migration's "is this a Django-format hash?" check was the
   `'$' in password_hash and not <legacy regex>` heuristic, which
   would happily promote a plaintext like `pa$$word` into
   `User.password`. Switched to `identify_hasher()` from
   `django.contrib.auth.hashers`, which actually parses the value
   against the registered `PASSWORD_HASHERS`. Plaintext-with-dollar
   now correctly fails open (auth disabled, no User created).
   Smoke-tested with valid PBKDF2, plaintext-with-`$`, legacy SHA256,
   and missing creds — all four behave correctly.

2. `_update_existing_operator` called `operator.save()` without
   guarding against username collisions; `User.username` is unique,
   so renaming to an already-taken name raised `IntegrityError` and
   the UI/DRF surfaces showed a low-level DB string. Added
   `_check_username_available` that pre-checks via the ORM and raises
   `AuthSettingsError("Username 'X' is already taken.")` instead.

3. & 4. Password updates and initial-enable both bypassed
   `AUTH_PASSWORD_VALIDATORS` (settings.py registers four of them —
   UserAttributeSimilarity, MinimumLength, CommonPassword,
   NumericPassword). Calls to `set_password()` were happening
   without `validate_password()`, so weak passwords slipped through.
   Added `_validate_password_strength` that runs the validators and
   translates `ValidationError` into `AuthSettingsError`. For initial
   enable, validation runs *before* the `update_or_create` call so a
   rejected password doesn't leave a half-created superuser; an
   unsaved `User(username=new_username)` instance is passed in so
   UserAttributeSimilarity can still compare.

Three new tests cover the validator + collision paths:
  * test_apply_auth_settings_initial_enable_rejects_short_password
  * test_apply_auth_settings_change_password_rejects_too_short
  * test_apply_auth_settings_change_username_collision

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(format): apply ruff format to auth.py

Trivial formatting nit from CI — ruff format trimmed three lines of
whitespace in src/anthias_server/lib/auth.py that I'd missed locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): reorder DRF auth classes + tighten Basic-auth tests

Three Copilot review points addressed:

1. DEFAULT_AUTHENTICATION_CLASSES had SessionAuthentication first,
   which meant a Basic-auth caller carrying an incidental session
   cookie (shared cookie jar with the operator's browser, etc.)
   would hit SessionAuthentication.enforce_csrf, get a 403 for the
   missing X-CSRFToken header, and never reach BasicAuthentication.
   Reorder so DeprecatedBasicAuthentication runs first — an explicit
   Authorization header always wins over an incidental cookie.

2. test_basic_auth_header_authenticates_for_back_compat was
   asserting ``status_code != 302``, which would pass if the path
   regressed to 401/403/500. Pin the actual success contract:
   200 + ``application/json`` + empty list body. Catches any
   regression where BasicAuthentication stops being applied or
   silently fails.

3. test_basic_auth_header_rejects_wrong_password was allowing
   {302, 401, 403}. With BasicAuthentication actually applied, the
   only correct response is 401 with a Basic ``WWW-Authenticate``
   challenge — pin both. A 302 would specifically indicate
   ``@authorized`` redirected because BasicAuthentication wasn't
   reached, which is exactly the regression Copilot was worried
   about.
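
The ordering problem in item 1 can be modelled without DRF. In this sketch each authenticator returns `None` to pass, returns a username on success, or raises to reject the request. All names here (`Rejected`, `session_auth`, `basic_auth`, `run_chain`, the dict-shaped request) are hypothetical stand-ins that follow the behaviour described above, not DRF's API.

```python
class Rejected(Exception):
    """Stand-in for an authenticator failing the request outright."""


def session_auth(request):
    # Assumption: models the session class's CSRF check described above.
    if 'session_user' not in request:
        return None  # no session: let the next class try
    if 'csrf_token' not in request:
        raise Rejected('403: CSRF token missing')
    return request['session_user']


def basic_auth(request):
    creds = request.get('basic')
    if creds is None:
        return None  # no Authorization header: let the next class try
    if creds != ('operator', 'secret'):
        raise Rejected('401: invalid credentials')
    return creds[0]


def run_chain(authenticators, request):
    # The first class that returns a user wins; an exception ends the chain.
    for authenticate in authenticators:
        user = authenticate(request)
        if user is not None:
            return user
    return None
```

With a request that carries both a session cookie and valid Basic credentials but no CSRF token, `[session_auth, basic_auth]` raises before Basic is consulted, while `[basic_auth, session_auth]` authenticates.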

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): close re-enable privilege-escalation when User row persists

Two related Copilot review findings, both rooted in the same gap:
the migration (and the enable→disable→re-enable flow) deliberately
preserves a User row even when ``auth_backend == ''``, so the
"is there an operator?" check needs to consult the DB, not just the
current request's session.

1. apply_auth_settings — privilege escalation. When auth is disabled
   the settings page is reachable unauthenticated. A LAN attacker
   could POST ``auth_backend=auth_basic, user=attacker, password=...``
   and the old code would treat that as "initial enable" because
   ``request.user`` was anonymous, calling _create_initial_operator
   and minting a fresh superuser — locking the legitimate operator
   out.

   Fix: ``operator = _operator_user(request) or _persisted_operator()``
   — if no authenticated session, fall back to the first active
   superuser (or first User) on the device. The current-password
   challenge then fires for ANY auth_backend transition where an
   operator already exists, not just when prev_auth_backend was
   non-empty. Two regression tests cover the rejection path
   (no/wrong current_pwd) and the success path (correct current_pwd
   succeeds and rotates the password).

2. page_context.device_settings — has_saved_basic_auth was based
   only on ``settings['auth_backend'] == 'auth_basic'``, so the
   "Current password" field would be hidden in the disable→re-enable
   state where the user must actually fill it in. Now keys on
   ``_persisted_operator() is not None or auth_backend == 'auth_basic'``
   so the field shows whenever there's an operator account, matching
   the apply_auth_settings guard above.
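
A rough sketch of the guard in item 1, assuming operators are plain dicts holding a salt and a PBKDF2 hash. `resolve_operator`, `apply_auth_change`, and the returned action strings are hypothetical; only the "session user, else persisted operator" fallback and the current-password challenge come from the commit.

```python
import hashlib
import hmac


class AuthSettingsError(Exception):
    """User-facing error, mirroring the exception named in this commit."""


def hash_password(password, salt):
    return hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)


def resolve_operator(session_user, persisted_operator):
    # Fall back to the stored operator when the request is anonymous.
    return session_user or persisted_operator


def apply_auth_change(session_user, persisted_operator, current_password):
    operator = resolve_operator(session_user, persisted_operator)
    if operator is None:
        # No account exists anywhere, so this really is initial enable.
        return 'create_initial_operator'
    supplied = hash_password(current_password or '', operator['salt'])
    if not hmac.compare_digest(operator['password_hash'], supplied):
        raise AuthSettingsError('Current password is incorrect.')
    return 'update_existing_operator'
```

An anonymous request can only reach the create path when no operator is stored; otherwise it has to pass the same challenge as a logged-in one.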

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(auth): skip operator.save() when no auth field changed

apply_auth_settings runs on every settings-page POST (the form
submits the whole page, including unrelated toggles like
show_splash). _update_existing_operator was unconditionally calling
operator.save(), so every settings save triggered a write to
auth_user even when neither username nor password was changing.

Track which fields the form actually touched and only call
operator.save(update_fields=…) when something landed; the targeted
save also avoids re-stamping columns we didn't modify in memory.

Regression test snapshots operator.password before a noop call (no
new username, no new password) and asserts it's byte-identical
after — Django's PBKDF2 hasher re-salts on every set_password(), so
a stray save() going through the password branch would change the
stored hash even with the same plaintext.
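
The changed-fields tracking can be sketched as follows. `FakeOperator`, `make_hash`, and `update_operator` are hypothetical stand-ins for the Django `User` and the project's helper; the only behaviour taken from the commit is "save only when something changed, and pass `update_fields`".

```python
import hashlib
import os


def make_hash(password):
    salt = os.urandom(16)  # fresh salt each call, so the stored hash changes
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
    return salt.hex() + '$' + digest.hex()


class FakeOperator:
    """Stand-in for a User row; records what save() was asked to write."""

    def __init__(self, username, password):
        self.username = username
        self.password = make_hash(password)
        self.saves = []

    def save(self, update_fields=None):
        self.saves.append(update_fields)


def update_operator(operator, new_username=None, new_password=None):
    changed = []
    if new_username and new_username != operator.username:
        operator.username = new_username
        changed.append('username')
    if new_password:
        operator.password = make_hash(new_password)
        changed.append('password')
    if changed:
        # Targeted write: only the columns that were actually modified.
        operator.save(update_fields=changed)
    return changed
```

A no-op call returns an empty list and never reaches `save()`, which is the property the regression test checks through the unchanged password hash.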

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(auth): correct comment about why viewer is safe

The comment on the DeprecatedBasicAuthentication try/except claimed
``settings.REST_FRAMEWORK`` was gated behind the ``ANTHIAS_SERVICE
!= 'viewer'`` check, but it isn't — ``REST_FRAMEWORK`` is defined
unconditionally in ``django_project.settings``. The actual reason
the viewer is safe is that:

  1. ``rest_framework`` isn't in INSTALLED_APPS on the viewer (that
     IS gated by ANTHIAS_SERVICE), so the import in the factory
     fails and ``DeprecatedBasicAuthentication`` doesn't get bound.
  2. DRF resolves the dotted-string class names in
     ``DEFAULT_AUTHENTICATION_CLASSES`` lazily, only when its app
     starts up. The viewer never loads the rest_framework app, so
     the missing attribute is never dereferenced.

Update the comment to spell that out instead of pointing at a
non-existent guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(auth): widen apply_auth_settings type to HttpRequest | DRF Request

Copilot pointed out that ``apply_auth_settings`` and
``_operator_user`` are called from both write paths — the Django
HTML flow passes ``django.http.HttpRequest`` and the DRF API flow
passes ``rest_framework.request.Request`` — but the annotation said
``HttpRequest`` only. Runtime worked because DRF's Request delegates
``.user`` to the underlying request, so ``getattr(request, 'user')``
returns the same User in both cases. Annotation now reflects the
actual contract via a type alias ``AnyRequest = HttpRequest |
DRFRequest``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): throttle DEPRECATED Basic-auth log per (user, IP, path)

Copilot pointed out that ``DeprecatedBasicAuthentication`` logs a
WARNING on every successful Basic-auth request, which a polling
client (Anthias-CLI hitting /api/v2/info every 10s) would turn into
log + disk noise that drowns the actual signal.

Add an in-process throttle keyed on (user, client_ip, path) with a
1-hour TTL. The signal we care about is "this caller is still on
Basic" — knowing it once per hour per tuple is enough to track
stragglers, and the cardinality is bounded (single operator, small
handful of LAN IPs and API paths). Multi-worker deploys may emit a
few extra lines per worker, which is fine.

Test fires the same Basic-auth request 5 times and asserts exactly
one DEPRECATED log line; a 6th request from a different REMOTE_ADDR
asserts the throttle is per-tuple (gets its own line).
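
One way such a throttle could look, as a sketch rather than the code in lib/auth.py. The class name `LogThrottle`, its method, and the injectable `clock` parameter are assumptions made so the example can be tested without waiting an hour.

```python
import time


class LogThrottle:
    """Allow one log line per (user, client_ip, path) per TTL window."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_logged = {}  # per-process state, one dict per worker

    def should_log(self, user, client_ip, path):
        key = (user, client_ip, path)
        now = self._clock()
        last = self._last_logged.get(key)
        if last is not None and now - last < self._ttl:
            return False  # already logged this tuple within the window
        self._last_logged[key] = now
        return True
```

The dict lives in process memory, which is why a multi-worker deployment can emit one line per worker for the same tuple.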

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): three Copilot review points (operator selection, docstring, FAQ)

1. apply_auth_settings used to treat any authenticated request.user
   as "the operator", which mismatched ``operator_username()`` /
   ``_persisted_operator()`` (both define the operator as the first
   active superuser). If a recovery admin from
   ``manage.py createsuperuser`` was logged in, that user could
   re-key the canonical operator's credentials.

   Now picks the operator via ``_persisted_operator()`` and refuses
   the change when the session is authenticated as a different user.
   Initial-enable still works (no canonical operator yet → operator
   is None → ``_create_initial_operator`` runs). Regression test
   ``test_apply_auth_settings_rejects_non_operator_session`` covers
   the recovery-superuser-can't-hijack case.

2. ``_enable_auth()`` test helper docstring claimed to return a
   "patcher start handle for the caller to stop", but it returns a
   ``patch.dict(...)`` context manager. Fixed.

3. FAQ entry on session-cookie auth was misleadingly minimal —
   said "POST to /login/ and re-use the cookie", but Django's
   CsrfViewMiddleware blocks that POST without a csrfmiddlewaretoken
   from a prior GET. Spell out the two-step CSRF dance so readers
   don't try a naive POST and get 403s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(auth): align all DEPRECATED-Basic-auth wording with throttling

Five Copilot-flagged spots all said the same thing — the
deprecation warning fires on "every successful auth" — but the
implementation throttles to one log line per (user, IP, path) per
1-hour TTL. Updated each location to describe the throttle so
operators don't expect a per-request log line and ask why their
chatty Anthias-CLI looks suspiciously quiet.

Touched:
- src/anthias_server/lib/auth.py — module docstring summary +
  detailed module docstring + class docstring inside
  ``_build_deprecated_basic_auth_class``.
- src/anthias_server/django_project/settings.py — comment in the
  REST_FRAMEWORK auth-class block.
- website/content/docs/migrating-assets-to-screenly.md — the
  migration-script note about expected DEPRECATED log lines.
- website/data/faq.yaml — "How do I authenticate API calls?"
  bullet on HTTP Basic.

No code changes — only documentation alignment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): gate DRF auth classes on auth_backend so disabled=open

Copilot caught a real contract violation: when the operator turns
auth off (``settings['auth_backend'] == ''``), the documented
behaviour is "API is fully open." But DRF's auth classes run before
the view, so they could still 401/403:

* ``BasicAuthentication`` returns 401 when an
  ``Authorization: Basic …`` header has wrong credentials.
* ``SessionAuthentication`` enforces CSRF on unsafe methods whenever
  a session cookie is present, returning 403 if ``X-CSRFToken`` is
  missing.

Neither is appropriate when auth is turned off. Add an
``_AuthBackendGated`` mixin whose ``authenticate()`` returns ``None``
(= "this class doesn't recognise the request, try the next one")
when ``settings['auth_backend']`` is empty. Apply it to both
``DeprecatedBasicAuthentication`` and a new
``GatedSessionAuthentication``; register the latter in REST_FRAMEWORK
in place of the stock SessionAuthentication.

Regression test ``test_auth_disabled_ignores_drf_authenticators``
fires a wrong Basic-auth header and a session-authenticated POST
without CSRF token — both pass through to a non-403 response,
which is impossible with the stock DRF classes.
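
The mixin pattern can be illustrated without DRF installed. Here `SETTINGS`, `StubBasicAuthentication`, and `GatedBasicAuthentication` are hypothetical stand-ins; the real classes wrap DRF's authenticators, and only the "return `None` while `auth_backend` is empty" rule is taken from this commit.

```python
SETTINGS = {'auth_backend': ''}  # stand-in for the device settings store


class AuthenticationFailed(Exception):
    """Stand-in for the exception a stock authenticator raises."""


class StubBasicAuthentication:
    """Stand-in for a stock class that rejects bad credentials."""

    def authenticate(self, request):
        if request.get('password') != 'secret':
            raise AuthenticationFailed('401')
        return ('operator', None)


class AuthBackendGated:
    """Mixin: skip authentication entirely while auth is switched off."""

    def authenticate(self, request):
        if not SETTINGS['auth_backend']:
            return None  # "not my request", so the caller moves on
        return super().authenticate(request)


class GatedBasicAuthentication(AuthBackendGated, StubBasicAuthentication):
    pass
```

The mixin has to come first in the base-class list so that its `authenticate()` runs before the stock one in the MRO.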

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(auth): lazy-build DRF auth classes via module __getattr__

Two related Copilot points:

1. The previous comment said the viewer was safe because
   ``rest_framework`` "isn't installed" on it. Misleading — the dep
   group does exclude it but the comment over-promised on a coupling
   that future changes might not preserve. The actual safety is
   "DRF never asks for these names because the viewer doesn't load
   the rest_framework app."

2. Eager call to ``_build_drf_auth_classes()`` only swallowed
   ``ImportError``. If lib.auth were imported before django.setup()
   in some tooling/test environment, DRF imports could raise
   ``ImproperlyConfigured`` and crash through the safety net.

Fix both at once: switch to PEP 562 module ``__getattr__`` so the
auth classes are constructed only when first looked up — which only
happens when DRF resolves the dotted-string class names from
``REST_FRAMEWORK['DEFAULT_AUTHENTICATION_CLASSES']`` at app
startup. By that point both DRF and Django are guaranteed ready,
so any failure in the factory is a real signal worth surfacing
(no need to swallow). Once first accessed, results are cached on
the module via ``globals().update`` so subsequent lookups skip the
factory.
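
A self-contained sketch of the lazy-attribute pattern. To keep it runnable in one file it builds a throwaway module object named `lazy_auth` (hypothetical) and attaches `__getattr__` to it; in a real module the hook would be a top-level `def __getattr__(name)` and the cache step would be `globals().update(...)`. The factory body and `build_calls` counter are illustrative only.

```python
import types

lazy_auth = types.ModuleType('lazy_auth')  # hypothetical module
build_calls = []  # records how often the factory runs

_LAZY_NAMES = {'DeprecatedBasicAuthentication', 'GatedSessionAuthentication'}


def _build_auth_classes():
    build_calls.append(1)

    class DeprecatedBasicAuthentication:
        pass

    class GatedSessionAuthentication:
        pass

    return {
        'DeprecatedBasicAuthentication': DeprecatedBasicAuthentication,
        'GatedSessionAuthentication': GatedSessionAuthentication,
    }


def _module_getattr(name):
    # Called only when normal attribute lookup on the module fails.
    if name not in _LAZY_NAMES:
        raise AttributeError(f"module 'lazy_auth' has no attribute {name!r}")
    built = _build_auth_classes()
    # Cache on the module so later lookups never reach this hook again.
    lazy_auth.__dict__.update(built)
    return built[name]


# Python looks the hook up in the module's __dict__ when a lookup misses.
lazy_auth.__getattr__ = _module_getattr
```

Nothing is built at import time; the first access to either name runs the factory once and later accesses are ordinary attribute lookups.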

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): use RFC 5737 test-net IP to silence Sonar hotspot

Sonar flagged `REMOTE_ADDR='10.0.0.42'` in
test_basic_auth_deprecation_log_throttled as a hardcoded-IP security
hotspot, which broke the new_security_hotspots_reviewed quality gate
on PR #2828. Switched to 192.0.2.42 — that's TEST-NET-1, explicitly
reserved by RFC 5737 for documentation and examples — and added a
NOSONAR pragma so future scans don't refire on the same line. Test
behaviour is unchanged.
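
For reference, a small check that an address falls inside one of the three documentation ranges reserved by RFC 5737, using only the standard-library `ipaddress` module. The helper name `is_documentation_address` is hypothetical and not part of the test suite.

```python
import ipaddress

# TEST-NET-1, TEST-NET-2 and TEST-NET-3 as reserved by RFC 5737.
DOCUMENTATION_NETS = [
    ipaddress.ip_network(net)
    for net in ('192.0.2.0/24', '198.51.100.0/24', '203.0.113.0/24')
]


def is_documentation_address(value):
    addr = ipaddress.ip_address(value)
    return any(addr in net for net in DOCUMENTATION_NETS)
```

`is_documentation_address('192.0.2.42')` is true, while the previously used `10.0.0.42` is a private (RFC 1918) address and is not.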

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 21:49:37 +01:00
Viktor Petersson
2d7b92c006 Hugo website: docs migration, API reference, FAQ, and SEO (#2807)
* Move website to Hugo

* Rewrite in progress

* Add Hugo-native API reference page and fix CSS build path

Two related changes for the Hugo site:

1. CSS build target: package.json's css:build/css:watch wrote to
   assets/styles/style.css, but baseof.html uses a plain <link href>
   that Hugo serves from static/. The merge left a stale 14K static
   copy alongside the freshly-built 23K asset copy, so pages rendered
   with most utility classes undefined. Build target is now
   static/assets/styles/style.css, matching the convention used by
   every other website asset.

2. Hugo-native API docs at /api/. The OpenAPI spec is loaded from
   data/openapi.yaml (generated via `manage.py spectacular`) and
   rendered in layouts/_default/api.html and a recursive schema
   partial. Endpoints are grouped by tag with anchor jumps, color-
   coded method badges, params/request/response tables, and inline
   $ref resolution. Renders all 18 v2 endpoints across 9 tags with
   the existing Tailwind theme. No third-party JS bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Move documentation under Hugo and redirect old GitHub paths

Migrates docs/ markdown into website/content/docs/ rendered with a new
docs/ layout (list + single) and Tailwind prose styling. Images and
the d2 diagram move to website/static/docs/. Internal links rewritten
from /docs/foo.md to /docs/foo/, and GitHub-style alerts pre-converted
to bold-labeled blockquotes since the goldmark alert extension is not
enabled on this Hugo version.

The original docs/*.md files are kept as redirect stubs that point at
https://anthias.screenly.io/docs/... so external links into the GitHub
docs tree still resolve to a useful page. Root README.md links updated
to point at the website URLs.

Hugo nav now exposes Docs alongside Features / Get Started / API / FAQ.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix factual inaccuracies in migrated docs against the codebase

Reviewed all docs against the current source. Concrete fixes:

* _index.md: container names use the post-rebrand `anthias-` compose
  project prefix (e.g. `anthias-anthias-server-1`, `anthias-redis-1`)
  rather than the legacy `screenly-` form. Replaced `docker-compose
  logs` with `docker compose logs` and added the optional
  `anthias-caddy` sidecar to the container table.
* developer-documentation.md: fixed leading-letter typo ("unning"),
  and replaced the old Django test-runner invocation with the pytest
  commands used by the suite today (`pytest -n auto -m "not
  integration"` and `pytest -m integration`).
* balena-fleet-deployment.md: corrected the supported board list
  ($BOARD_TYPE) to match `bin/deploy_to_balena.sh --help`
  (`pi2`, `pi3`, `pi4-64`, `pi5` — no `pi1` or plain `pi4`). Updated
  registry reference from Docker Hub to GHCR.
* migrating-assets-to-screenly.md: `cd ~/screenly` → `cd ~/anthias`
  (post-rebrand install path).
* raspberry-pi5-ssd-install-instructions.md: fixed "Opitions" and
  "uinsg" typos in the boot-order steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Polish docs styling: callouts, syntax highlighting, hierarchy

Reworks docs prose styling so the migrated pages don't read like
default-Hugo-render-output:

* Headings: in-body H1/H2 collapse to a section divider style with a
  top border so they don't compete with the dark page hero. H4-H6
  become small uppercase eyebrows. Markdown sources mix #/##/####
  inconsistently — the visual scale now compresses gracefully.
* Alerts: a render-blockquote hook detects the bold-label preamble
  produced by our preprocessor (`> **Note**` etc.) and emits a typed
  `<blockquote class="docs-alert docs-alert-note">` so each kind gets
  its own colored border + label (note/tip/important/warning/caution).
* Syntax highlighting: enable Hugo Chroma with the github style,
  noClasses=false. Generated chroma.css ships as a static asset and is
  loaded alongside style.css. `<pre>`/`<code>` get a light surface that
  the chroma token colors sit on top of.
* Inline code, lists, links, tables, and images all get a small
  rebalance — bullet color, link underline weight, image shadow,
  table border-radius — to match the brand-purple theme.
* Footer: the Resources / Docs link pointed at the legacy
  github.com/.../docs/README.md path; now points at /docs/. Added an
  API Reference link alongside.
* Stripped a stray `<br>` in the Pi5 SSD doc that was creating a
  random gap between a blockquote and its illustrative screenshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Make x86/PC docs consistent and more user-friendly

The migrated docs used four different forms — "x86", "x86 device",
"PC (x86) devices", and "PC (x86 Devices)" — depending on the page.
Standardize on **PC (x86)** as the user-facing label (PC is what
people search for; x86 stays as the architecture qualifier).

Also rewrites x86-installation.md from a flat bullet dump into a
clearer five-step walkthrough — what you need, download, flash,
install Debian, prep the system, run the installer — and crosslinks
the right anchor in installation-options.md so PC users can hand off
to the scripted install without scrolling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Expand FAQ with forum-driven questions and refactor to data file

The FAQ had six entries that didn't reflect what people actually ask
on forums.screenly.io. Reviewed the all-time top topics and added the
ones that show up over and over: portrait rotation, YouTube playback,
Wi-Fi setup, static IP assignment, audio output, resolution / 4K,
black-screen troubleshooting, transitions, asset storage / backup,
SSH, HTTPS pointer, commercial-use clarity, getting logs, and a link
to the API reference.

Refactored the layout so it reads from data/faq.yaml grouped by
section (About, Installation & updates, Display & playback,
Operations) and renders each answer through markdownify. This makes
adding new entries a one-paragraph YAML edit instead of duplicating
~15 lines of accordion markup. Answers reuse the .docs-prose styling
so code, links, lists, and inline pre snippets all match the docs
pages.

Also tightened the "Accessing the REST API" section in /docs/ to
point at the new /api/ page first, with the live ReDoc URL on the
device as a secondary callout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Correct rotation FAQ — Anthias renders via linuxfb + DRM, no Wayland

Verified in code: docker/Dockerfile.viewer.j2 sets
QT_QPA_PLATFORM=linuxfb, webview/build_qt{5,6}.sh both pass
-skip wayland to the Qt build, and viewer/media_player.py invokes
mpv with --vo=drm. There is no Wayland compositor in the runtime
stack on any board.

Replaced the previous "Pi 5 with Wayland uses a different stack"
hand-wave with the actual fallback: if /boot/firmware/config.txt's
display_rotate=N doesn't stick on a Pi 5 / KMS pipeline, append
video=HDMI-A-1:...,rotate=N to /boot/firmware/cmdline.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Tighten three FAQ answers after a code-driven validation pass

* SSH: previous answer claimed SSH was on by default for both the
  Anthias disk image and the scripted install. Anthias's installer
  doesn't touch sshd at all, so the answer now distinguishes between
  the prebuilt images (SSH on) and a self-flashed Raspberry Pi OS Lite
  (SSH must be pre-enabled).
* Audio output: static/src/components/settings/audio-output.tsx hides
  the 3.5mm option on Pi 5 because the hardware lacks the jack. Call
  that out so Pi 5 users don't go looking for a missing dropdown item.
* Black screen: replaced the `xset dpms force on` suggestion. Anthias
  has no X server on any board (Qt runs on linuxfb, mpv on --vo=drm),
  so xset can't toggle DPMS. Pointed users at re-seating HDMI or
  checking the TV's input as a more grounded recovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Correct features-page claim — Anthias detects display state, can't toggle it

The "Display power control" card promised programmatic on/off
toggling of the connected screen for energy savings. That isn't a
real feature. lib/diagnostics.py only calls libcec's tv.is_on() to
*query* the TV's power state — there's no power-on / standby command
path anywhere in the codebase. The result surfaces read-only as
display_power on the System Info page
(static/src/components/system-info.tsx).

Replaced the card with what's actually shipping: HDMI-CEC display
state *detection*, visible on the System Info page.

Verified the rest of the page against code while I was in there.
Accurate as written: image/video/webpage assets (Qt webview + mpv),
scheduling (start_date/end_date/duration on the asset model), drag-
drop playlists (@dnd-kit/sortable), shuffle (settings.shufflePlaylist),
1080p output (mpv pinned to 1920x1080@60 on pi4-64/pi5 in
viewer/media_player.py), real-time WebSocket sync (Django Channels +
Redis pub/sub), REST API (drf-spectacular), four-container compose
topology, backup/restore, optional basic auth (lib/auth.py BasicAuth),
System Info page fields (loadavg/free_space/uptime/anthias_version),
and the supported hardware list (matches ansible/site.yml's
device_type assertion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Punchier homepage tagline: "Free digital signage for everyone."

Replaces "Open source digital signage for any screen" with a shorter,
benefit-led headline. The new line breaks naturally across two lines
on desktop (free digital signage / for everyone.) and stays single-
line on mobile to avoid an awkward orphan.

Subtitle is unchanged — it still does the explanatory work (Pi or
PC, schedule images/videos/webpages, no subscriptions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* SEO sweep: per-page meta, FAQPage / TechArticle JSON-LD, robots.txt

Two functional gaps in the existing setup:

* og:description and twitter:description were hardcoded to a single
  marketing line on every page, while the per-page <meta name=
  description> already pulled from front matter. So the Slack/Twitter/
  Discord card preview always read the same blurb regardless of which
  page you shared. Now both the OG and Twitter description reflect the
  page's own .Params.description.

* Page titles drifted: most pages embedded "Anthias" in the title
  string, but the docs pages were just "Documentation" /
  "Installation Options" / etc. — fine for the H1, weak for SERPs.
  Title now appends " | Anthias" only when the page title doesn't
  already contain the brand, so existing branded titles stay clean
  and docs pages get a brand suffix automatically.
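
The title rule is implemented in a Hugo template; the Python below is only an illustration of the same logic. The function name `page_title` and the case-sensitive match are assumptions.

```python
BRAND = 'Anthias'


def page_title(title, brand=BRAND):
    # Append the brand only when the title does not already carry it.
    if brand in title:
        return title
    return f'{title} | {brand}'
```

So a docs page titled "Installation Options" gains the suffix, while a title that already contains the brand is returned unchanged.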

Other tightening:

* Added FAQPage structured data on /faq/ generated from data/faq.yaml
  so Google can surface FAQ rich results.
* Added TechArticle structured data on individual /docs/ pages.
* og:type now flips to "article" on docs pages.
* og:image:alt + twitter:image:alt populated.
* theme-color set to the brand purple for mobile browser chrome.
* JSON-LD home schema URL now uses site.BaseURL instead of a
  hardcoded production URL — important for staging / dev parity.
* <html lang> reads site.LanguageCode instead of a fixed "en".
* Added a real robots.txt that points crawlers at /sitemap.xml
  (Hugo already generates the sitemap, but a robots.txt makes the
  pointer explicit and unblocks tooling that looks for it).

Replaced placeholder image alt text in the docs ("balena-ss-01",
"imager-01", "rpi-eeprom-update", etc.) with descriptive captions —
better for screen readers and image-search SEO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Move site assets into Hugo's expected layout and rename /docs URLs

Two related cleanups.

ASSETS — site assets were split across website/static/assets/ (the
shadowed copy hugo.toml's [[module.mounts]] directed traffic to) and
website/assets/ (an unused duplicate). Hugo's own build report showed
"Processed images: 0" because nothing actually flowed through Pipes.

  * Removed the [[module.mounts]] override so Hugo uses default
    layout: assets/ for Pipes-processable resources, static/ for
    served-as-is files.
  * Used `git mv` to record the docs/ image and stylesheet renames as
    history-preserving moves rather than delete+add diffs.
  * Removed the duplicate website/static/assets/images/ directory —
    files already lived in website/assets/images/.
  * Bun's css:build/css:watch now write to assets/styles/style.css so
    Tailwind output flows through Hugo Pipes.
  * baseof.html loads style.css and chroma.css via resources.Get +
    fingerprint, with SRI integrity attributes. Each deploy produces
    a fresh content-hashed URL (/styles/style.<hash>.css), so the
    browser cache invalidates correctly without manual cache-busting.
  * Logos, social icons, hero raster (overview*.png), favicon, and
    plus/minus accordion icons all flow through resources.Get for
    consistent asset handling.
  * Added layouts/_default/_markup/render-image.html so markdown
    image references in /docs are looked up via resources.Get and
    emitted with loading="lazy" decoding="async".

URL RENAMES — the docs URLs were verbatim copies of the original
GitHub filenames, which made for noisy URLs like
/docs/raspberry-pi5-ssd-install-instructions/. Slugged each page and
left aliases for the old paths so Hugo emits a meta-refresh redirect:

  /docs/installation-options/                       → /docs/install/
  /docs/balena-fleet-deployment/                    → /docs/balena/
  /docs/x86-installation/                           → /docs/pc/
  /docs/raspberry-pi5-ssd-install-instructions/     → /docs/pi5-ssd/
  /docs/migrating-assets-to-screenly/               → /docs/migrate-to-screenly/
  /docs/qa-checklist/                               → /docs/qa/
  /docs/developer-documentation/                    → /docs/development/

Cross-doc links inside /docs and the README + repo-root docs/ stub
files all point at the new URLs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix lint + mypy on raspberry_pi_imager test (carried over from rebase)

The test_build_pi_imager_json.py file landed in `88d3881b Move website
to Hugo` with two pre-existing CI failures:

* ruff format --check: a few helper definitions had a stale line
  break the formatter wanted to collapse.
* mypy: `make_image_metadata(board: str) -> dict` is missing the
  generic type parameters that the project's mypy config flags as
  type-arg. Annotated as `dict[str, Any]`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Self-host Plus Jakarta Sans via @fontsource (drop Google Fonts CDN)

Removes the third-party Google Fonts <link>. SonarCloud's Web:S5725
hotspot was flagging the link as a resource-integrity (SRI) risk —
SRI is impossible against Google Fonts because the served stylesheet
rotates per User-Agent and the woff2 URLs change with the font CSS.

Self-hosting the same font from npm via @fontsource removes the
cross-origin resource entirely.

How it's wired:

* `bun add -D @fontsource/plus-jakarta-sans` for the font binaries.
* `scripts/install-fonts.ts` is a small bun script that, given the
  installed package, copies woff2 files for latin + latin-ext at
  weights 400/500/600/700/800 to `static/fonts/` (so Hugo serves
  them at `/fonts/...`) and emits a combined
  `assets/fonts/plus-jakarta-sans.css` with the urls rewritten to
  absolute /fonts/... paths and the woff fallback stripped.
* `package.json` adds `fonts:install`, and chains it through
  `css:build` / `css:watch` so Tailwind always sees the generated
  CSS up to date.
* `main.css` @imports the generated CSS — Tailwind/Lightning CSS
  inlines the @font-face rules into the final fingerprinted
  style.<hash>.css.
* `.gitignore` excludes `assets/fonts/` and `static/fonts/` since
  both are deterministically regenerated from node_modules.
* `baseof.html` no longer pulls from fonts.googleapis.com.

Total payload: 10 woff2 files (~136KB), but each is loaded
on-demand by unicode-range — typical English-only visitors fetch
~50KB of fonts, served from same-origin.

The second Web:S5725 hotspot (gtag.js from googletagmanager.com)
is unchanged in this commit — Google's tag manager script is
updated server-side without a stable hash, so SRI cannot apply.
That one needs a product call (keep with dismissal, drop GA, or
move to a privacy-first SRI-friendly alternative).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address SonarCloud code-smell findings on the website

Cleared the unrelated SonarCloud findings raised on this PR:

* `install-fonts.ts`: `fs` and `path` imports use the `node:` prefix
  (typescript:S7772). The new prefixed form is the bun-recommended
  one, no behavior change.
* `_markup/render-image.html`: rewrote the comment that referenced
  `<img>` literally — Web:ImgWithoutAltCheck was treating the word
  inside the Hugo comment block as an actual element with no alt.
* `_default/faq.html`: replaced the accordion's `<div role="region">`
  with a real `<section>` element (Web:S6819). The aria-labelledby
  binding stays, so the accessible name resolution is identical and
  the semantics are now native rather than ARIA-emulated.
* `assets/styles/chroma.css`: stripped the two stray-semicolon lines
  left over from the sed pass that emptied the github-style backdrop
  (css:S1116). The remaining `.chroma { -webkit-text-size-adjust:
  none }` rule is what's actually load-bearing.
* `_default/baseof.html`:
  - accordion JS now reads `this.dataset.accordion` instead of
    `this.getAttribute('data-accordion')` (javascript:S7761).
  - GA bootstrap uses `globalThis.dataLayer` instead of
    `window.dataLayer` (javascript:S7764). Same semantics in any
    browser context, no globalThis polyfill needed for our targets.
* `layouts/index.html`: dropped the deprecated `scrolling="0"`
  attribute from the GitHub stars iframe (Web:S1827); replaced
  with the equivalent `overflow-hidden` Tailwind class.

The Web:S5725 SRI hotspot on the gtag.js script (line 162 of
baseof.html) is the only remaining finding. Google Tag Manager is
versioned server-side without a stable hash, so SRI fundamentally
can't apply — that one is being kept and dismissed in the
SonarCloud UI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review + trigger marketing deploy on release publish

Copilot review:

* api.html: request-body renderer only looked at application/json,
  so endpoints whose only content type is multipart/form-data (file
  uploads) or application/x-www-form-urlencoded would render an
  empty Request body section. Pick application/json first if present,
  otherwise fall back to the first listed content type, and label
  the rendered schema with its actual content type.
* build_pi_imager_json.py: every requests.get() now sets a 30s
  timeout and calls raise_for_status() so a slow/rate-limited GitHub
  API doesn't hang the deploy job and a 4xx/5xx fails fast with a
  clear message rather than a confusing KeyError on response.json().
* docs/raspberry-pi5-ssd-install-instructions.md: "Other HAT's" →
  "Other HATs".
* docs/qa-checklist.md: dropped the spurious "a" in "Change a the
  start and end dates".
* deploy-website.yaml: jq's has() takes one key, so the validation
  step `has("name", "description", ...)` was actually a syntax error
  on every run — rewrote as `all($k; $entry | has($k))` over the
  required-key list.
* layouts/_default/get-started.html: the "Documentation" CTA pointed
  at the old GitHub markdown file; now links to /docs/ to match the
  navbar / footer.
* website/README.md: rewrote the project-structure tree to match
  what's actually in the repo (data/, scripts/, layouts/docs/,
  Goldmark _markup/ hooks etc.) and documented the bun pipeline —
  `hugo server` alone leaves /fonts/* as 404s because the woff2
  files are gitignored and materialized by `bun run fonts:install`.
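
The api.html content-type fallback described above can be sketched in
Python. This is only an illustration of the selection rule, not the
Hugo template itself; `pick_request_body` and the shape of `content`
(an OpenAPI-style mapping of content type to media object) are
assumptions made for the example.

```python
from typing import Optional, Tuple

PREFERRED = "application/json"


def pick_request_body(content: dict) -> Optional[Tuple[str, dict]]:
    """Return (content_type, media_object) to render, or None if empty.

    Hypothetical helper: prefers application/json, otherwise falls back
    to the first content type listed for the endpoint.
    """
    if not content:
        return None
    if PREFERRED in content:
        return PREFERRED, content[PREFERRED]
    # dicts preserve insertion order, so this is the first listed type.
    first = next(iter(content))
    return first, content[first]


# A file-upload endpoint that only declares multipart/form-data.
upload_only = {"multipart/form-data": {"schema": {"type": "object"}}}
content_type, media = pick_request_body(upload_only)
print(f"Request body ({content_type}): {media['schema']}")
```

Returning the chosen content type alongside the schema is what lets the
renderer label the section with the type it actually used.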

Marketing deploy on release publish:

`build-balena-disk-image.yaml` cuts the GitHub release with the
*.img.zst artefacts as its final step; until now the marketing site
only re-deployed on master push or manual dispatch, so
rpi-imager.json on the live site lagged the freshest disk images by
however long it took someone to push an unrelated website change. Hooking
deploy-website.yaml to `on: release: types: [published]` makes the
site rebuild as soon as the release exists, which is exactly when
the GitHub API starts surfacing the new assets the JSON generator
queries. Releases flagged `prerelease=true` are included because
build-balena-disk-image.yaml currently flags every release that way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address second round of Copilot review

* installation-options.md: balenaEthcher → balenaEtcher.
* balena-fleet-deployment.md: includa → include.
* developer-documentation.md: spash screen → splash screen.
* qa-checklist.md: enabling **Show splash screen** is supposed to
  *display* the splash, not hide it — flipped "is not being
  displayed" → "is being displayed". Also clickin → clicking.
* raspberry-pi5-ssd-install-instructions.md: `sudo apt update -y`
  carried a `-y` that does nothing there (it only matters for
  install / upgrade, where apt prompts for confirmation). Dropped
  the `-y` from update; the full-upgrade line keeps it because
  that's where it actually does something.
* deploy-website.yaml: the jq required-keys check was missing
  `icon` and `website`, which build_pi_imager_json.py's
  REQUIRED_FIELDS already enforces in the Python tests. Added them
  so the runtime validation matches the generator's contract.
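
For reference, the required-keys check can be sketched in Python. This
is a rough equivalent of the jq validation, not the workflow step or
the generator's code: `REQUIRED_KEYS` is a stand-in holding only the
four keys named in this PR (the real REQUIRED_FIELDS list is not
reproduced here), and `missing_keys` / `validate_entries` are
hypothetical names.

```python
# Stand-in list: only the keys this PR mentions by name. The generator's
# actual REQUIRED_FIELDS may contain more entries.
REQUIRED_KEYS = ("name", "description", "icon", "website")


def missing_keys(entry: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent from one entry."""
    return [key for key in required if key not in entry]


def validate_entries(entries: list) -> None:
    """Raise ValueError naming the first entry that lacks a required key."""
    for index, entry in enumerate(entries):
        missing = missing_keys(entry)
        if missing:
            raise ValueError(
                f"entry {index} is missing required keys: {', '.join(missing)}"
            )


validate_entries(
    [{"name": "demo", "description": "d", "icon": "i.png", "website": "w"}]
)
```

Keeping one list of keys and checking each entry against it is the
property that matters: the runtime check and the tests cannot drift
apart if they share that list.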

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address third round of Copilot review

* website/.gitignore: `_.log` was a typo from the original Hugo
  bootstrap; it only matches a file literally named `_.log`.
  Replaced with the intended `*.log` so log files are actually ignored.
* website/package.json: rewrote the `dev` script to capture both
  child PIDs and trap EXIT/INT/TERM so Ctrl-C (or hugo crashing)
  takes the Tailwind watcher down with it. Mirrors the pattern in
  the repo-root package.json's `dev`.
* docs/raspberry-pi5-ssd-install-instructions.md: "Early Pi 5's"
  → "Early Pi 5s" (no apostrophe on plurals).
* docs/qa-checklist.md: "make sure that the screen in standby mode"
  → "make sure that the screen is in standby mode" (missing verb).
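
The difference between the two .gitignore patterns can be shown with
Python's `fnmatch`. Git applies its own glob rules, so this is only an
approximation that happens to agree for these two simple patterns; the
file names are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, for illustration only.
names = ["hugo.log", "npm-debug.log", "_.log", "notes.txt"]


def matches(pattern: str, candidates: list) -> list:
    """Return the candidates that the glob pattern matches."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


typo = matches("_.log", names)   # underscore is a literal character
fixed = matches("*.log", names)  # asterisk matches any run of characters

print("_.log matches:", typo)
print("*.log matches:", fixed)
```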

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Rewrite raspberry_pi_imager tests in pytest style

The file was unittest.TestCase classes — pytest discovers and runs
those, but the boilerplate doesn't earn its keep. Each test method
re-declared `@patch('...requests.get')` and rebuilt the same
MagicMock setup, and the per-board cases lived as 5+3+2+2 separate
methods that should have been one parametrize each.

Reworked as flat module-level functions backed by three fixtures:

* `mock_requests_get` — patches the module's `requests.get` and yields
  the mock so each test sets `return_value` / `side_effect` directly.
* `mock_release_assets` — preconfigured to return the canned release
  asset list, used by the `get_asset_list` cases.
* `mock_full_build` — wires up the three call shapes
  `build_imager_json()` makes (latest, asset list, per-asset json).

Per-board cases collapse into `@pytest.mark.parametrize`:
get_board_from_url's positive cases, the non-image-returns-None
cases, the maintenance-mode boards, and the modern boards.

Coverage is the same — 21 collected cases (pytest fans the parametrize
out from 12 test methods to 21 ids), all passing in 0.12s.
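
A minimal sketch of the fixture-plus-parametrize shape described above.
It is not the repository's test file: `http` stands in for the
`requests` module so the example runs without network access or the
dependency, and `get_latest_tag`, `board_from_url`, the URL layout and
the `mock_get` fixture are all hypothetical.

```python
import re
from types import SimpleNamespace
from unittest.mock import patch

import pytest

# Hypothetical stand-in for the `requests` module used by the code under test.
http = SimpleNamespace(get=None)


def get_latest_tag() -> str:
    """Hypothetical code under test: one HTTP call, then read the JSON."""
    response = http.get("https://example.invalid/releases/latest", timeout=30)
    response.raise_for_status()
    return response.json()["tag_name"]


def board_from_url(url: str):
    """Hypothetical: board name from an image URL, None for non-images."""
    match = re.search(r"-(pi\d)\.img\.zst$", url)
    return match.group(1) if match else None


@pytest.fixture
def mock_get():
    # Patch once here instead of decorating every test with @patch.
    with patch.object(http, "get") as mocked:
        yield mocked


def test_latest_tag(mock_get):
    mock_get.return_value.json.return_value = {"tag_name": "v1.2.3"}
    assert get_latest_tag() == "v1.2.3"
    mock_get.return_value.raise_for_status.assert_called_once()


@pytest.mark.parametrize(
    "url, expected",
    [
        ("https://example.invalid/demo-pi4.img.zst", "pi4"),
        ("https://example.invalid/demo-pi5.img.zst", "pi5"),
        ("https://example.invalid/checksums.txt", None),
    ],
)
def test_board_from_url(url, expected):
    assert board_from_url(url) == expected
```

One parametrized function yields one test id per tuple, which is how a
handful of functions fans out into a larger collected count.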

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix PR checks and Copilot review items

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:16:25 +01:00