4086 Commits

Author SHA1 Message Date
Viktor Petersson
eb8909a788 fix(server): stream the backup as an async iterator so it doesn't time out (#3074)
* fix(server): stream the backup as an async iterator so it doesn't time out

Under ASGI, StreamingHttpResponse streams incrementally only when it is
given an asynchronous iterator. Handed the sync stream_backup()
generator, Django's __aiter__ falls back
to `await sync_to_async(list)(...)`, which drains the whole generator —
building the entire archive (buffered in RAM) before the first response
byte. That silently reintroduced the 0-bytes-then-timeout failure the
streaming path was meant to fix and risks OOM on a 1 GB Pi (issue #3073).

- add astream_backup(): async wrapper pulling each chunk via
  sync_to_async(thread_sensitive=False) so bytes flow as the tar builds
- close the underlying sync generator on disconnect to stop the producer
- point the download view at astream_backup() (is_async path)
- regression test drives aiter(StreamingHttpResponse(...)) — the real
  ASGI consumption path the unit-level test never exercised

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): satisfy mypy strict in backup stream wrapper + test

- wrap next() in a typed helper with a None sentinel; sync_to_async(next)
  resolved next's single-arg overload, tripping mypy on the 2-arg call
- use a distinct file handle in the regression test (text- then
  binary-mode reuse of one variable is a mypy type conflict)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): serialize backup generator access to avoid disconnect race

Copilot review: with next() and close() on sync_to_async's shared pool,
a client disconnect mid-next() could run gen.close() on another thread
concurrently — raising "generator already executing" and leaking the
producer thread.

- drive next() and the cleanup close() through one dedicated
  single-worker executor so they can never overlap; the queued close()
  runs only after any in-flight next() returns
- add a regression test that aclose()s the async generator mid-stream
  and asserts the producer thread exits cleanly

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
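
A minimal sketch of the wrapper pattern described in the three commits
above, using only the standard library rather than Django or asgiref.
It is not the actual astream_backup() code: `astream_chunks` and
`_next_or_none` are hypothetical names, and the sketch assumes one
single-worker executor per stream so that next() and close() can never
overlap.

```python
from __future__ import annotations

import asyncio
from collections.abc import AsyncIterator, Generator
from concurrent.futures import ThreadPoolExecutor


def _next_or_none(gen: Generator[bytes, None, None]) -> bytes | None:
    """Typed helper: return the next chunk, or None once exhausted."""
    return next(gen, None)


async def astream_chunks(
    gen: Generator[bytes, None, None],
) -> AsyncIterator[bytes]:
    """Yield chunks from a sync generator without draining it up front."""
    loop = asyncio.get_running_loop()
    # One worker only: every call on `gen` runs on the same queue, so a
    # close() submitted during a disconnect waits for any in-flight next().
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        while True:
            chunk = await loop.run_in_executor(executor, _next_or_none, gen)
            if chunk is None:
                break
            yield chunk
    finally:
        # Runs on normal exhaustion and on aclose(); stops the producer.
        await loop.run_in_executor(executor, gen.close)
        executor.shutdown(wait=False)
```

Calling aclose() on the async generator mid-stream runs the `finally`
block, which is the path the disconnect regression test is described
as exercising.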

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-12 12:03:14 +01:00
Viktor Petersson
10f273b7c3 fix(server): classify RTSP/streaming URLs as streaming, not webpage (#3071)
* fix(server): classify RTSP/streaming URLs as streaming, not webpage

The React→Django migration (#2818) reimplemented the URL→mimetype
classifier in the assets_create form handler but dropped the
scheme check, so rtsp://, rtmp:// and HLS/DASH manifest URLs fell
through to mimetype='webpage'. The viewer then routed them to
QtWebEngine instead of the video player and the stream never played.

- add `is_streaming_uri()` to remote_video, reusing the existing
  stream-scheme / manifest-extension sets so it can't drift from
  `is_downloadable_remote_video`
- assets_create now stamps stream URLs as mimetype='streaming' and
  applies the default_streaming_duration window
- regression tests for the classifier and the form handler

Validated end-to-end on the x86 testbed: an ffmpeg-fed RTSP feed
played live in the viewer (decoded and progressing on screen).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): reject RTMP — Qt6's QMediaPlayer can't play it

On-device testing showed rtmp:// streams render black on every board.
Root cause is in Qt6's FFmpeg backend: it sets a `timeout` AVFormatContext
option the rtmp protocol misreads as TCP listen mode, so the open fails
(libavformat itself plays rtmp fine). Rather than let operators add a
stream that never shows:

- drop rtmp from validate_url's allowed schemes (gates the UI form and
  the API)
- drop rtmp from remote_video._STREAM_SCHEMES so is_streaming_uri never
  classifies a legacy rtmp row as a playable stream
- assets_create returns a clear "use RTSP/HLS/DASH" message instead of
  the generic "Invalid URL"
- tests updated

url_fails keeps probing rtmp defensively for any pre-existing DB rows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(review): restrict is_streaming_uri manifest match to http(s)

Addresses Copilot review on #3071: a manifest extension (.m3u8/.mpd/…)
was treated as streaming for any scheme, so file:///x/index.m3u8 would
classify as mimetype='streaming'. Manifests are HTTP-delivered — only
match them over http(s). Adds file:// manifest regression cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
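
A rough sketch of the classifier rule after this review fix: stream
schemes match outright, manifest extensions match only over http(s).
The set contents here are assumptions limited to the values named in
these messages (rtsp, .m3u8, .mpd); the real sets in remote_video may
hold more entries, and the real function may differ in detail.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed contents; rtmp is deliberately absent, per the commit above.
_STREAM_SCHEMES = frozenset({"rtsp"})
_MANIFEST_EXTENSIONS = frozenset({".m3u8", ".mpd"})


def is_streaming_uri(uri: str) -> bool:
    """Return True for stream schemes and for http(s) manifest URLs."""
    parsed = urlparse(uri)
    scheme = parsed.scheme.lower()
    if scheme in _STREAM_SCHEMES:
        return True
    if scheme in ("http", "https"):
        # Only the path is inspected, so query strings are ignored.
        suffix = PurePosixPath(parsed.path).suffix.lower()
        return suffix in _MANIFEST_EXTENSIONS
    return False
```

With this shape, file:///x/index.m3u8 and rtmp:// URLs both fall through
to False.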

* test(sonar): silence S5332 on http manifest test fixture

The DASH-over-http URL in the is_streaming_uri parametrize is a
deliberate test input — it exercises the http (not https) manifest
branch. It's a fixture, not real traffic, so annotate the SonarCloud
clear-text-protocol hotspot with # NOSONAR rather than changing the
test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 21:45:29 +01:00
Viktor Petersson
9107539a08 fix(build): replace retired Linaro armhf toolchain with Debian's (#3070)
* fix(build): replace retired Linaro armhf toolchain with Debian's

releases.linaro.org has been retired, so the pi2/pi3 webview-builder's
download of gcc-linaro-7.4.1 fails to connect (curl exit 7). Every
master Docker build has gone red since, and because publish-latest only
advances the floating latest-<board> tags on a fully green matrix, the
latest-* images froze — users tracking latest stopped getting any
merged fixes.

- Install Debian's supported crossbuild-essential-armhf (gcc 14, same
  arm-linux-gnueabihf- prefix) instead of fetching the dead tarball
- Symlink it under the legacy gcc-linaro path the frozen Qt 5
  qmake.conf bakes into CROSS_COMPILE, so the pinned WebView-v2026.04.1
  artifact needs no rebuild; the app still links against the Raspbian
  /sysroot, so the target glibc is unchanged
- Apply the same swap to the offline toolchain-rebuild path
  (build_qt5.sh + its Dockerfile) so the dead URL is gone everywhere

Validated by building the pi2 viewer image: qmake/make link the
AnthiasViewer binary against the pinned Qt 5 libs with the Debian
cross-gcc, producing a valid ARM EABI5 hardfloat executable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(build): harden toolchain shim and correct amd64-pin comment

Addresses Copilot review on #3070:

- build_qt5.sh: fail fast with a clear message if the armhf cross
  toolchain is absent (else the glob expands to a literal and errors
  confusingly), and use `ln -sf` so reruns are idempotent.
- Dockerfile.qt5-webview-builder.j2: the amd64-pin comment referenced
  the removed Linaro download; the real reason is the pinned Qt 5
  bundle's x86_64 host qmake (and the x86_64-hosted Debian cross-gcc).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 17:16:36 +01:00
Viktor Petersson
4402ea4fb0 fix(sentry): don't report settings-save input validation as errors (#3068)
ANTHIAS-3D ("AuthSettingsError: New passwords do not match!") is
operator input validation, not a bug — a mismatched/incorrect
password, a taken username, or a too-weak password typed into the
settings form. Both settings-save surfaces caught it under a broad
`except Exception` + `logger.exception(...)`, and Sentry's logging
integration turns that ERROR record into an event.

- catch AuthSettingsError ahead of the generic handler in both the
  HTML view and the DRF v2 view; log it at warning (no traceback) so
  it never reaches the logging integration, and surface the
  operator-friendly message (the v2 view previously buried it under a
  generic "An error occurred")
- add AuthSettingsError to the before_send drop filter as a backstop
  for any other path that logs it as an error
- spell auth.py's AnyRequest as an explicit TypeAlias: the implicit
  form flipped to mypy "not valid as a type" once settings.py began
  importing the module
- regression tests for the before_send drop and the warning-level,
  no-traceback v2 rejection
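
A minimal sketch of the before_send backstop, assuming Sentry's
documented `before_send(event, hint)` hook, where returning None drops
the event. The `AuthSettingsError` class below is a local stand-in for
the one in auth.py, and the real filter may check more exception types.

```python
class AuthSettingsError(Exception):
    """Stand-in for the operator input validation error."""


def before_send(event, hint):
    """Drop events caused by AuthSettingsError; pass everything else."""
    exc_info = hint.get("exc_info")
    if exc_info and isinstance(exc_info[1], AuthSettingsError):
        return None  # operator input, not a bug
    return event
```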

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 15:05:06 +01:00
Viktor Petersson
93f47428df fix(install): accept pi3-64 in Ansible playbook validation (#3067)
A Raspberry Pi 3 running a 64-bit OS reports DEVICE_TYPE=pi3-64 (both
bin/install.sh::set_device_type and bin/upgrade_containers.sh emit it,
and latest-pi3-64 Docker images are built), but the Ansible playbook
never accepted that value:

- ansible/site.yml's pre_task assertion only allowed pi2, pi3, pi4-64,
  pi5, x86, arm64 — so a 64-bit Pi 3 install aborted at "Gathering
  Facts" with "Required environment variables missing or invalid"
  (forums.screenly.io/t/6716).
- ansible/roles/system/tasks/docker.yml's docker_arch_by_device_type
  map is exhaustive-by-design and would have raised on the unmapped
  pi3-64 key; add the arm64 mapping.
- the device_is_pi membership test omitted pi3-64, which would have
  dropped the gpio group on a board that is a Pi.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 11:56:12 +01:00
Viktor Petersson
a3aa63fa1a feat(viewer): blank/unblank commands to turn the display off (#3065)
* feat(viewer): add blank/unblank commands to turn the display off

Adds a way to blank the screen on demand, parallel to the existing
next/previous/stop/play viewer commands on the anthias.viewer Redis
channel:

- Wayland boards (x86/pi5/arm64): wlr-randr powers the connector off
  (true DPMS power-off) and back on. _wlr_output_names() gained an
  include_disabled flag so unblank can re-enable a connector that's
  currently Enabled: no.
- eglfs/linuxfb boards (pi2/pi3/pi4): the Qt app owns the DRM master
  and can't be powered off externally, so the asset loop paints a new
  all-black BLACK_SCREEN image instead (same proven loadImage path as
  the standby screen). Backlight stays on; the screen goes black.

blank_display() flips display_blanked + loop_is_stopped from the
subscriber thread and runs the out-of-process wlr-randr call there; the
webview repaint is deferred to the main loop thread (start_loop), which
owns current_browser_url — mirroring the existing rotation-bounce
threading discipline.

Validated live on x86 (wayland): `blank` -> DP-1 Enabled: no, `unblank`
-> Enabled: yes. eglfs black-paint reuses the standby loadImage path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): address review on display blanking (mypy, log spam, state)

- Import get_skip_event from anthias_viewer.utils in the test instead
  of via the viewer module, which doesn't explicitly re-export it
  (mypy attr-defined under --no-implicit-reexport).
- start_loop: guard the black repaint on current_browser_url so it
  doesn't re-call view_image() — and re-log "Current url ..." at INFO —
  on every 0.1s tick while blanked (Copilot).
- Route stop/play through module-level helpers that set the
  loop_is_stopped global start_loop actually reads; the prior
  setattr(__main__, ...) wrote a dead namespace under
  `python -m anthias_viewer` and never paused the loop. play now
  implies unblank when the display is blanked (Copilot).
- Add tests for stop flag + play-implies-unblank.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(viewer): mark BLACK_SCREEN http URL safe for Sonar (S5332)

http is intentional — the viewer talks to the local anthias-server over
plain HTTP (TLS is the opt-in Caddy sidecar's job), identical to the
existing STANDBY_SCREEN / SPLASH_PAGE_URL. Annotate the line # NOSONAR,
the repo's documented convention for Sonar false positives under
Automatic Analysis (see sonar-project.properties; cf. test_csrf.py).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 11:50:54 +01:00
Viktor Petersson
f67aa1e7f5 fix(celery): soft-limit the display-power and telemetry pokes (#3063)
* fix(celery): soft-limit the display-power and telemetry pokes

- ANTHIAS-A/9/B group by the worker-SIGKILL signature, not the task,
  so they survived #3017 — which only soft-limited the asset probe
- get_display_power and send_telemetry_task still ran under a bare
  time_limit=30; a wedged CEC query or a getaddrinfo stall (requests'
  timeout doesn't cover DNS) tripped the hard limit and SIGKILLed the
  pool child
- give both the #3017 treatment: soft_time_limit raises inside the
  task so it logs and skips the tick; the hard limit stays as the
  C-code backstop
- add regression tests for the limits and the soft-timeout skip

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(celery): write display_power value and TTL atomically

- a soft-limit signal between SET and EXPIRE could leave display_power
  without a TTL (stale value that never expires)
- use a single SET with ex= so the write and TTL are atomic

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
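
A small sketch of the single-command write, assuming a redis-py style
client whose `set(name, value, ex=...)` stores the value and its TTL in
one command. `RecordingClient` and `store_display_power` are
hypothetical names, and the value and TTL passed in are up to the
caller; the recorder exists only so the sketch runs without a Redis
server.

```python
class RecordingClient:
    """Stand-in for a Redis client that records the commands it gets."""

    def __init__(self):
        self.calls = []

    def set(self, name, value, ex=None):
        self.calls.append(("SET", name, value, ex))
        return True


def store_display_power(client, value: str, ttl_seconds: int) -> None:
    """Write the value and its expiry together.

    A separate SET followed by EXPIRE leaves a window in which an
    interrupt can land after the value is stored but before the TTL is.
    """
    client.set("display_power", value, ex=ttl_seconds)
```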

* fix(telemetry): write cooldown key and TTL atomically

- send_telemetry_task now runs under a soft time limit, so a
  SoftTimeLimitExceeded landing between the cooldown SET and EXPIRE
  would leave telemetry-cooldown without a TTL — silencing telemetry
  permanently (same class as the display_power fix in f5e94664)
- collapse the SET + EXPIRE into one SET … ex= so the value and TTL
  are written atomically
- update test_telemetry to assert the atomic SET and accept ex= in
  the fake redis client

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 11:25:44 +01:00
Viktor Petersson
bc27c555df fix(server): polish home bulk-action & upload UI/UX (#3066)
* fix(server): polish home bulk-action & upload UI/UX

Follow-up UI/UX pass over the recently shipped bulk asset management
(#3048), multi-file upload (#3049), and ffmpeg/HandBrake rejection
hints (#3040).

- Reserve bottom space (.has-bulk-selection) while a selection is
  active so the fixed bulk-action bar never floats over the last rows
  or their action buttons — exactly the assets a bulk selection
  targets.
- Cap .modal-card to the viewport and scroll inside it, with sticky
  header/footer, so a tall bulk-edit form (or the Edit modal with the
  failure alert + Advanced open) keeps its title and Apply/Save buttons
  reachable instead of pushing them below the fold.
- Wire real drag-and-drop on the upload dropzone (dropFiles() feeds the
  same sequential uploadFiles() batch path); the dashed zone already
  read as a drop target but silently ignored drops.
- Lift the selection checkbox contrast on the dark Enabled surface so
  the select-all (checked/indeterminate) and per-row boxes are legible
  against the purple gradient; scoped to .asset-select so the activity
  switch keeps its track styling.
- Anchor the bulk bar's clear (x) to the top-right on phones so it no
  longer wraps beside the destructive Delete (mis-tap risk).
- Stack the ffmpeg recipe's copy button under the command at narrow
  widths; make the empty-state CTA a <button> (action, not nav); drop
  the bulk duration field's placeholder that collided with its
  floating label.

Presentational only — no API/model changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): stop dropzone highlight flicker on child drag-over

Copilot review on #3066: dragleave bubbles from the dropzone's child
icon/paragraphs, toggling dragActive off while the cursor is still over
the label and flickering the highlight. Add the .self modifier so the
handler only runs when leaving the label itself.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): clear dropzone drag highlight when the modal closes

Second Copilot pass on #3066: dragActive could stay true if the user
drags into the dropzone then closes the Add modal (Esc/backdrop/Cancel)
before dragleave fires, so the dropzone re-opened still highlighted.
Reset dragActive in closeModal() alongside the other modal state.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 11:19:18 +01:00
Viktor Petersson
8a4db73b84 fix(viewer): detect DB changes under WAL so new assets load (#3062)
* fix(viewer): detect DB changes under WAL so new assets load (#3061)

The viewer polls the database's mtime to decide when to reload its
playlist, stat'ing only the main `anthias.db` file. Since #3015 the
DB is opened with `journal_mode=WAL`, where commits land in the
`anthias.db-wal`/`-shm` sidecars and leave the main file's mtime
frozen until a (rare) checkpoint. So `get_db_mtime()` never advanced
and `refresh_playlist()` never reloaded — most visibly, the first
asset on a fresh install never displayed (its empty playlist also has
a `None` deadline, so the deadline fallback can't recover it either).

Take the newest mtime across `anthias.db`, `anthias.db-wal`, and
`anthias.db-shm`: WAL commits move the sidecars and a checkpoint moves
the main file, so the value advances on every write regardless of
journal mode. Add a regression test that a write touching only the
`-wal` sidecar is detected.

Fixes #3061

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
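
A minimal sketch of the newest-mtime idea, not the actual
get_db_mtime() body: `newest_db_mtime` is a hypothetical name, and
returning 0.0 when no file exists is an assumption.

```python
import os


def newest_db_mtime(db_path: str) -> float:
    """Return the newest mtime across the DB file and its WAL sidecars."""
    newest = 0.0
    for suffix in ("", "-wal", "-shm"):
        try:
            newest = max(newest, os.stat(db_path + suffix).st_mtime)
        except FileNotFoundError:
            # Sidecars are absent outside WAL mode; skip them.
            continue
    return newest
```

Because the maximum is taken over all three paths, a write that touches
only the `-wal` sidecar still advances the result.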

* test(viewer): isolate WAL mtime test in tmp_path to avoid xdist race

The new test created /tmp/fakedb-wal in the shared /tmp; with the now
WAL-aware get_db_mtime(), that sidecar could leak into the concurrent
test_check_get_db_mtime (asserting == 0 on /tmp/fakedb) under
`pytest -n auto`. Use a unique tmp_path dir so the sidecar can't escape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 11:04:26 +01:00
Viktor Petersson
65b122cbaf chore(deps): bump grouped Dependabot updates and refresh uv.lock (#3064)
Consolidates 6 Dependabot PRs into one grouped update and regenerates
uv.lock (which the per-PR Dependabot runs left untouched, editing only
pyproject.toml).

Python (dev/build tooling + type stubs + minor runtime libs):
- pytest-cov 6.0.0 -> 7.1.0 (dev)
- click 8.1.7 -> 8.4.1 (docker-image-builder, local)
- sh 2.2.2 -> 2.3.0 (server, viewer)
- django-stubs-ext 6.0.3 -> 6.0.5 (server, mypy)
- types-pytz 2026.2.0.20260506 -> 2026.2.0.20260518 (dev-host)

GitHub Actions (pinned-SHA bumps):
- github/codeql-action init/autobuild/analyze (v4)
- codecov/codecov-action (v6)

Supersedes #3058, #3057, #3056, #3055, #3054, #3053.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 10:53:56 +01:00
Viktor Petersson
3f05ead272 feat(server): add bulk asset management (enable/disable/edit/delete) (#3048)
* feat(server): add bulk asset management (enable/disable/edit/delete)

Operators with large libraries previously had to act on assets one
row at a time — a recurring forum request for years (#3046). The home
page now has per-row selection checkboxes, a per-section select-all in
each surface header, and a floating bulk-action bar (Enable, Disable,
Edit, Delete) that appears whenever a selection is active.

Backend: two new server-rendered endpoints reuse the shared
_asset_table_response so the whole batch swaps the table partial and
nudges the viewer once, instead of one round-trip per row.
  - assets_bulk_action — enable/disable/delete a selected set; delete
    goes through delete_asset_with_file so on-disk cleanup matches the
    per-asset path (#2908).
  - assets_bulk_update — applies common schedule fields (start/end
    dates, duration, play-from/until times, weekdays). Each group is
    opt-in via an apply_* flag so an operator only overwrites the
    fields they ticked, and everything is parsed before any row is
    mutated so a bad date/time toasts without a half-applied batch.
    Video duration stays owned by the probe task, mirroring
    assets_update.

Frontend: selection lives in the homeApp Alpine state (selectedIds),
so row :checked bindings re-evaluate after the table's 5s HTMX swap and
the selection survives. setVisible() is re-published from the table
partial on every swap to drive the select-all state and prune a
selection of rows that have since disappeared. The bulk-edit modal
reuses the existing flatpickr date/time inputs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): address Copilot review on bulk asset management

- asset_ids filter: stop mark_safe'ing the JSON. It lands in a
  double-quoted x-init="…" attribute, so its own double quotes were
  closing the attribute early and breaking the Alpine expression.
  Return a plain str and let Django autoescaping entity-encode the
  quotes; the browser decodes them back to valid JSON. Adds a
  regression test asserting the rendered attribute is escaped.
- bulk delete: pass delete_asset_with_file(nudge_viewer=False) per row
  and fire a single viewer reload after the batch, instead of one
  reload per asset. New keyword-only flag defaults True so the API /
  single-delete paths are unchanged.
- bulk enable/disable: collapse the per-row save() loop into one
  queryset update() (no model signals here); its row count drives the
  toast and the empty-selection guard.
- bulk duration: a blank field with apply_duration on no longer
  clobbers every asset's duration to 0 — it toasts and changes
  nothing (mirrors the per-asset edit form's preserve-on-blank intent).
  Negative values are rejected too, and the modal input is now
  required as client-side defense-in-depth.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
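
The asset_ids quoting issue can be illustrated with the standard
library alone. This sketch uses `html.escape` as a stand-in for Django
autoescaping, and `asset_ids_attr` is a hypothetical name rather than
the real template filter.

```python
import html
import json


def asset_ids_attr(ids: list[str]) -> str:
    """Return JSON that is safe inside a double-quoted HTML attribute."""
    # json.dumps emits double quotes; escaping turns each into &quot;
    # so the attribute is not closed early.
    return html.escape(json.dumps(ids))


def decode_attr(value: str) -> list[str]:
    """Mimic the browser: decode entities, then parse the JSON."""
    return json.loads(html.unescape(value))
```

Round-tripping through `decode_attr` shows the entity-encoded form
still carries valid JSON once the entities are decoded.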

* perf(server): collapse bulk_update writes into a single bulk_update()

Second Copilot pass on #3048: assets_bulk_update still did one
asset.save() per selected row, so a large selection (the point of bulk
editing) fired N UPDATEs. Mutate the in-memory objects, track exactly
which columns were touched (honouring the skip-videos-for-duration
rule), and write them all with one Asset.objects.bulk_update(). An
edit that ends up touching nothing (apply_dates ticked with both
fields blank, or a duration-only edit on an all-video selection) now
short-circuits with the "nothing to change" toast, since bulk_update()
rejects an empty field list. Adds a test asserting exactly one UPDATE
statement for a 5-asset batch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): keep bulk duration write off video rows; clear modal default

Third Copilot pass on #3048:
- The single bulk_update() folded `duration` into the all-rows field
  list, which writes every video row's stale in-memory duration back
  and can clobber a concurrent probe_video_duration UPDATE. Write
  duration on its own bulk_update() over only the non-video subset, so
  video rows are never touched by the duration column at all. Added a
  test spying on bulk_update to assert no video object is in a
  duration write.
- The bulk-edit duration input was prefilled with 10, so toggling
  "Duration" and submitting would silently set every non-video asset
  to 10s. Drop the default to an empty placeholder and rely on the
  existing `required` to force an intentional value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): prune selection once across both sections; accurate bulk count

Fourth Copilot pass on #3048:
- syncVisibleIds(active, inactive): the table partial now publishes both
  sections' ids in one call and selectedIds is pruned once against
  their union. The previous two sequential setVisible() calls pruned on
  the first call while the other section's list was still stale, so a
  row that moved between sections (e.g. an enabled asset just disabled)
  lost its selection across the swap.
- assets_bulk_update toast now reports only the rows actually written:
  a duration-only edit skips videos, so a mixed selection reports the
  non-video count instead of claiming every row was updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): bulk forms clear only on real success; O(n) section selection

Fifth Copilot pass on #3048:
- Bulk endpoints return 200 even when they refuse the input (bad date,
  partial time window, blank duration, "nothing to change") and just
  ride an error/info toast on HX-Trigger. The forms cleared the
  selection / closed the modal on any 2xx, wiping the operator's work
  when nothing was applied. New isSuccessResponse(event) helper gates
  the clear/close on a 2xx whose toast is absent or kind 'success';
  wired into all four bulk forms (enable/disable/delete/edit).
- sectionAllSelected/sectionSomeSelected did linear includes() against
  selectedIds, i.e. O(visible × selected) on every reactive
  re-evaluation. Build a Set once per call so they stay O(n) for the
  large selections bulk editing targets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): trim bulk ids; guard enable/disable on matched-rows count

Sixth Copilot pass on #3048:
- _bulk_ids() now strips each CSV segment, so a hand-built "a, b" (with
  spaces) still matches its rows instead of silently no-op'ing.
- enable/disable take the count from a separate matched-rows count()
  rather than update()'s return value, which reports rows *changed* on
  some backends — re-enabling already-enabled assets would otherwise
  count 0 and (returning no toast) make the client treat it as success
  and clear the selection. A genuinely empty match now returns an info
  toast so the client's success gate keeps the selection. The empty
  bulk-delete case gets the same info toast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
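
A minimal sketch of the CSV trimming, with `parse_bulk_ids` as a
hypothetical stand-in for _bulk_ids(). Dropping empty segments is an
assumption; the message above only states that each segment is
stripped.

```python
def parse_bulk_ids(raw: str) -> list[str]:
    """Split a comma-separated id list, trimming whitespace per segment."""
    segments = (segment.strip() for segment in raw.split(","))
    # Assumed behaviour: skip segments that are empty after trimming.
    return [segment for segment in segments if segment]
```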

* fix(server): info toast on bulk_update with no matching ids

Seventh Copilot pass on #3048: assets_bulk_update returned a silent
no-toast 2xx when the posted ids matched no rows (stale selection), so
the client's success gate cleared the selection / closed the modal
even though nothing applied. Return the same info toast the
enable/disable path uses so the selection is kept. Adds a test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(server): bulk_update via uniform queryset update()s; toast on no-op action

Eighth Copilot pass on #3048:
- assets_bulk_action no longer returns a silent no-toast 2xx for an
  unknown action or empty id set — an unknown action toasts an error,
  an empty selection an info toast, so the client's success gate keeps
  the selection instead of treating it as success.
- assets_bulk_update applies the (uniform) new values with plain
  queryset update()s instead of loading every selected row and building
  a bulk_update() CASE. Shared fields go in one update(**shared);
  duration goes through a separate exclude(mimetype='video').update(),
  which keeps the never-touch-video-duration guarantee at the SQL level
  (no in-memory staleness). Row counts come from matched-rows count()s,
  not update()'s changed-rows return. Replaced the bulk_update-spy test
  with a SQL-level video-duration-untouched test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): bridge bulk hx-on handlers into Alpine via window event

Ninth Copilot pass on #3048 — and a real runtime bug: the bulk forms'
hx-on::after-request handlers called Alpine component methods
(isSuccessResponse / clearSelection / closeBulkEdit / bulkDeleteOpen)
directly, but hx-on runs in GLOBAL scope, so those are undefined there
and the handler threw ReferenceError — the selection never cleared and
the modals never closed on success.

Fixed with the same dispatch-to-window bridge the Add modal already
uses for 'asset-saved': hx-on now calls the global window.bulkSucceeded()
gate and, on success, dispatches a 'bulk-done' window CustomEvent (with
detail flags for which modal to close). The root x-data handles it via
@bulk-done.window="onBulkDone($event)" in Alpine scope. Added a render
test asserting the forms use the bridge and never call Alpine methods
from hx-on.

Also: assets_bulk_update now returns an info toast on an empty id set
(was a silent no-toast 2xx the success gate would mis-read as success).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 10:31:15 +01:00
Viktor Petersson
9094a2031d fix(install): install polkitd on Trixie+ manage-network path; minimize apt updates (#3060)
* fix(ansible): install polkitd on Trixie+ manage-network path

- NetworkManager only Recommends polkitd, so minimal Armbian Trixie
  images ship NM without it and have no /etc/polkit-1/rules.d
- the rules copy task then failed: "Destination directory
  /etc/polkit-1/rules.d does not exist", aborting the install
- install polkitd explicitly so the rule has a home and a daemon to
  enforce it

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ansible): drop redundant apt update_cache on polkitd install

- the network role runs after the system role's apt update + dist
  upgrade, so the cache is already fresh
- matches the splashscreen role's mid-playbook package installs, which
  carry no update_cache

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(install): refresh apt lists once per source change

- guard install.sh's two apt-get updates behind apt_update_once so a
  minimal image no longer double-updates; reset only when the legacy
  Raspbian mirror is actually rewritten
- drop redundant update_cache on the libc6-dev reinstall; the cache is
  already fresh from install.sh
- net result: one base update + one per added repo (Docker)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 10:30:13 +01:00
Viktor Petersson
8cea9cb736 fix(balena): add missing device types to unpin fleet map (#3051)
- Add anthias-pi3-64 and anthias-rockpi4 to FLEET_DEVICE_TYPE
- Both fleets were in FLEETS but absent from the map, so they hit
  the "unknown device type" branch and the job exited 1 (errors=2)
- Re-syncs the map with bin/balena_fleet_maintenance.py per the comment

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 19:37:25 +01:00
Viktor Petersson
d323bf147d feat(viewer): add "Prefer dark mode" setting for web page assets (#3050)
* feat(viewer): add "Prefer dark mode" setting for web page assets

Adds a "Prefer dark mode" toggle in Settings that instructs the Qt
webview to render web page assets dark. The C++ webview owns the
realization: applyDarkModePreference() reads ANTHIAS_PREFER_DARK_MODE
and injects --blink-settings=forceDarkModeEnabled=true into Chromium
before QtWebEngine inits, working uniformly across Qt5 (Pi 1-4) and
Qt6 (Pi 5/x86) without a version macro.

Python owns the setting: the viewer exports the env var per board in
_build_webview_env() and respawns the webview when the operator toggles
the setting (reusing the rotation-bounce handoff). Wired through the
settings page, page_context, the form POST, and the v2 device-settings
API.

Validated on a Pi 5 testbed: with the setting on, AnthiasViewer's env
carries ANTHIAS_PREFER_DARK_MODE=1 and the QtWebEngine renderer
processes run with --blink-settings=forceDarkModeEnabled=true; with it
off, neither is present.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(test): annotate pixel list so mypy doesn't flag no-any-return

PIL's getdata() is untyped, so sum(list(...)) was Any and the
-> float return tripped mypy's no-any-return. Annotate the list as
list[int].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(viewer): address review on dark-mode flag and its test

- main.cpp: merge forceDarkModeEnabled into an existing
  --blink-settings switch instead of appending a duplicate (Chromium
  keeps only the last occurrence, so a second switch would silently
  drop any Blink settings already set); no-op if already requested.
- test_webview_dark_mode: launch via the suite's browser_type /
  browser_type_launch_args fixtures so the conftest --no-sandbox
  override (and any future launch config) is inherited rather than
  hardcoded.
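
The switch merge in the main.cpp bullet can be modelled in a few lines.
This is a Python sketch of the idea only, not the C++ in main.cpp:
`merge_blink_setting` is a hypothetical name, and the sketch assumes
Blink settings are comma-separated inside a single switch.

```python
def merge_blink_setting(args: list[str], setting: str) -> list[str]:
    """Fold `setting` into one --blink-settings switch (hypothetical helper)."""
    prefix = "--blink-settings="
    others: list[str] = []
    settings: list[str] = []
    for arg in args:
        if arg.startswith(prefix):
            # Collect every Blink setting already requested so none is dropped.
            settings.extend(s for s in arg[len(prefix):].split(",") if s)
        else:
            others.append(arg)
    if setting not in settings:
        # No-op when the setting was already requested.
        settings.append(setting)
    # Emit exactly one switch, so "last occurrence wins" cannot lose anything.
    return others + [prefix + ",".join(settings)]
```

Under these assumptions an argv that already carries a Blink setting
keeps it, and a repeated request leaves the argv unchanged.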

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 18:57:10 +01:00
Viktor Petersson
bb9da44d02 fix(server): restore multi-file (bulk) upload in the Add Asset modal (#3049)
* fix(server): restore multi-file (bulk) upload in the Add Asset modal

Multi-file upload (added in #2778 for the React UI) was lost in the
#2818 React→Alpine/HTMX rewrite: the file picker accepted a single
file only. The assets_upload endpoint already takes one file per
POST, so this is a client-only fix.

- Add `multiple` to the #add-file input.
- Drive the upload from uploadFiles() in home.ts: iterate the selected
  files and POST them sequentially (one XHR per file) against the
  existing single-file endpoint, with "X of N" progress. htmx's
  single-form submit would only ever send the first file, so the file
  tab is no longer htmx-managed; toasts are replayed from the server's
  HX-Trigger header by hand.
- Single-file uploads still flow through the same path unchanged.

Adds an integration test (test_add_multiple_uploads_at_once) that
selects two files in one go and asserts both persist.

Fixes #3045

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): don't clear upload state on closeModal mid-batch

Hiding the modal during an in-flight 'sending' upload cleared
uploadState, which disarmed uploadFiles()'s re-entry guard and let a
reopened modal start a second batch racing the first over the shared
progress/index fields. Only reset upload state when nothing is in
flight.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): surface upload rejections instead of swallowing them

assets_upload refused invalid/empty uploads with messages.error +
HTTP 200, which the HTMX/XHR path drops silently (the partial carries
no toast header) — so the operator saw nothing and the batch uploader
counted the rejection as a success. Pass the rejection through
_asset_table_response(toast=('error', …)) so it rides the HX-Trigger
header on every transport.

Client side, uploadOne() now distinguishes 'ok' / 'rejected' (2xx +
error toast) / 'error' (transport): a rejected file surfaces its
server toast and the batch skips it and keeps going, while a true
transport failure aborts. Addresses Copilot review feedback.
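
The tri-state batch behaviour above can be sketched as follows. The
real uploader lives in home.ts; this Python model only illustrates the
control flow, and `UploadResult`, `upload_batch` and the outcome keys
are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Callable, Iterable


class UploadResult(Enum):
    OK = "ok"              # 2xx, no error toast
    REJECTED = "rejected"  # 2xx plus an error toast from the server
    ERROR = "error"        # transport failure


def upload_batch(
    files: Iterable[str],
    upload_one: Callable[[str], UploadResult],
) -> dict[str, list[str]]:
    """Upload sequentially: skip rejected files, abort on a transport error."""
    outcome: dict[str, list[str]] = {"uploaded": [], "rejected": [], "not_sent": []}
    pending = list(files)
    while pending:
        name = pending.pop(0)
        result = upload_one(name)
        if result is UploadResult.OK:
            outcome["uploaded"].append(name)
        elif result is UploadResult.REJECTED:
            # The server refused this file; keep going with the rest.
            outcome["rejected"].append(name)
        else:
            # Transport failure: stop, and report what never completed.
            outcome["not_sent"].extend([name, *pending])
            break
    return outcome
```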

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(server): update uploadOne comment for the UploadResult return

The doc comment still described the old boolean return; it now
documents the ok / rejected / error tri-state. Comment-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(server): attribute single-file limit to the endpoint, not htmx

The comments said htmx "would only ever send the first file", but
htmx includes every selected file in the multipart body — the real
single-file constraint is assets_upload reading request.FILES.get.
Reword both comments. Comment-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 06:45:38 +02:00
Viktor Petersson
3b8cf7afa1 fix(viewer): recognize all cage/Wayland boards for screen rotation (#3047)
Screen rotation was a no-op on Raspberry Pi 5 (and likely generic
arm64 cage boards): `_is_wayland_board()` only treated x86 as a
Wayland board, so neither the linuxfb `:rotation=` option nor the
wlr-randr path ran.

- Key `_is_wayland_board()` off `QT_QPA_PLATFORM` starting with
  `wayland`, mirroring the docker/Dockerfile.viewer.j2 split
  (x86/arm64/pi5) instead of an x86-only DEVICE_TYPE check.
- Update tests to pair DEVICE_TYPE with QT_QPA_PLATFORM as in
  production; switch linuxfb-intent tests off the (now wayland) pi5.
- Add regression tests covering the board matrix and the Pi 5
  wlr-randr rotation path.
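
A minimal sketch of the check described above, assuming the environment
is passed in as a mapping so it can be tested without touching
`os.environ`. The function name mirrors `_is_wayland_board()` but the
signature is an assumption made for the example.

```python
from typing import Mapping


def is_wayland_board(env: Mapping[str, str]) -> bool:
    """True when Qt is configured for a Wayland platform plugin."""
    # Keyed off QT_QPA_PLATFORM rather than DEVICE_TYPE, so every cage
    # board qualifies regardless of its device type.
    return env.get("QT_QPA_PLATFORM", "").startswith("wayland")
```

With this shape, a test can pair DEVICE_TYPE with QT_QPA_PLATFORM the
way production does and still see the platform value decide the result.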

Validated end-to-end on a real Pi 5: both the boot-time apply and the
live Settings reload now push the wlroots transform (90/270) to the
connected display.

Fixes #3044

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 17:35:22 +02:00
Viktor Petersson
fe8f0541e9 fix(viewer): default x86 audio fallback to ALSA 'default' (#3038)
On x86, get_alsa_audio_device() previously returned
'sysdefault:CARD=HID' as the fallback, which matches no real ALSA
card on standard Intel/AMD/Nvidia HDA chipsets, leaving operators
with no working audio.

Defer to the `default` device instead. On x86 (a Qt 6 board)
VideoView::resolveAlsaDevice resolves `default` to the PulseAudio
default sink, since Debian's Qt 6 Multimedia only has a PulseAudio
backend. This mirrors the ARM64 branch above, which already returns
'default' for the same reason.

Update the corresponding test to assert the new fallback.

Co-authored-by: Xzzz <xzz@carnetderoot.net>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
v2026.06.3
2026-06-09 09:52:41 +02:00
Viktor Petersson
8c6cfaf26a feat(server): offer HandBrake GUI steps for rejected video uploads (#3040)
* feat(server): offer HandBrake GUI steps for rejected video uploads

- Add _handbrake_steps mirroring the ffmpeg recipe's codec/1080p choices
- Persist the steps to metadata.error_handbrake on rejection
- Render them as a numbered list with a handbrake.fr link in the modal

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): make HandBrake steps a real preset-based walkthrough

- Lead with the stock "Fast 1080p30" preset (H.264 MP4, 1080p cap)
- HEVC boards just flip Video Encoder to "H.265 (x265)"
- Spell out source pick, Save As/Browse, and Start Encode
- Drop the cap arg: the 1080p preset is the low-RAM fix too

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: assert on HANDBRAKE_URL constant, not the literal

Reference processing.HANDBRAKE_URL instead of a duplicated URL literal
so CodeQL stops flagging the assertion as incomplete-URL-substring
sanitization (false positive in a test containment check).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: address Copilot review on HandBrake steps

- Reword final step: upload as a new asset (the Edit modal has no
  upload control)
- Rename stale-clear test to mention error_handbrake too

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 08:55:57 +02:00
Viktor Petersson
5d21924080 fix(celery): don't report the by-design codec rejection to Sentry (#3041)
The upload codec/resolution gate raises ``UnsupportedVideoCodecError``
as a deliberate, operator-facing rejection — it's surfaced in the UI as
a "Failed" pill plus a copy-pasteable ffmpeg re-encode recipe via
``_NormalizeAssetTask.on_failure``. It's an expected outcome (e.g. Pi 5
has no H.264 HW decode block, an unknown arm64 board can't certify any
codec), not a fault, so it shouldn't reach Sentry — but every rejection
was landing there as an unhandled task error (Sentry ANTHIAS-1J,
ANTHIAS-20).

List it in the video task's ``throws``: Celery then logs it at INFO
without a traceback, and sentry-sdk's CeleryIntegration skips
``task.throws`` exceptions (``_capture_exception`` returns early on
``isinstance(exc, task.throws)``), so the gate stops flooding Sentry.
``on_failure`` still runs, so the operator-facing error pill and recipe
are unchanged.

Regression test asserts the video task declares it in ``throws`` and the
image task (which never raises it) does not.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 08:45:01 +02:00
Viktor Petersson
4e47d70d87 fix(build): ship libwebpdemux2 so pi3 Pillow can save WebP (#3042)
The upload image-normalisation pipeline converts HEIC/HEIF/TIFF/etc. to
lossless WebP via Pillow. On 32-bit Pi (armv7) there's no Pillow wheel,
so it's compiled from source against libwebp-dev and dynamically links
the whole webp family at runtime: libwebp7, libwebpmux3 *and*
libwebpdemux2 (Pillow's _webp uses WebPAnimDecoder).

The runtime image only pulled libwebp7 + libwebpmux3 — both transitive
deps of ffmpeg's libavcodec — but NOT libwebpdemux2, which nothing else
depends on. Without it, ``import PIL._webp`` fails to resolve
libwebpdemux.so.2, the WebP plugin registers no save handler, and the
conversion dies with ``KeyError: 'WEBP'`` (Sentry ANTHIAS-1Y, 11 events,
all pi3).

Add libwebp7 / libwebpdemux2 / libwebpmux3 to the runtime
base_apt_dependencies explicitly so the pipeline never depends on
ffmpeg's transitive graph — same rationale as the libheif1 entry beside
them. Inert on 64-bit boards (their Pillow wheel bundles its own webp
libs), so listed unconditionally for simplicity, matching libheif1.

Verified: a from-source Pillow 12.2.0 _webp.so on debian:trixie links
libwebp.so.7, libwebpmux.so.3, libwebpdemux.so.2 and libsharpyuv.so.0;
installing ffmpeg+libheif1 leaves libwebpdemux2 absent.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 08:44:45 +02:00
Viktor Petersson
2f1016008f fix(server): surface the exception detail in GitHub update-check warnings (#3023)
- A bare 'no data' hid the host/timeout info str(exc) carries —
  Sentry ANTHIAS-8 read 'ConnectionError fetching latest release
  from GitHub: no data', which said nothing actionable
- Ported from #3014 (the rest of that PR shipped via #3019)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 08:04:39 +02:00
Viktor Petersson
03ae4eb187 chore(release): bump to 2026.06.3 (#3037)
- CalVer (YYYY.0M.MICRO); still June 2026, micro 2 -> 3
- Gives Sentry a real release boundary: every build since 2026.6.2
  reported the same base version (only the +git-hash differed), so
  resolved-in-next-release never stuck and fixed issues kept
  reopening on the next event. A version bump lets the deployed
  fixes actually clear from the board.
- Ships the crash/noise fixes merged since 2026.6.2: SQLite WAL +
  busy timeout (#3015), celery migration-gate (#3016) and
  asset-probe soft limits (#3017), transient-redis/CancelledError
  Sentry filtering + redis healthcheck (#3018/#3028), GitHub
  update-check log level (#3019), webview respawn on D-Bus death at
  setup and mid-play (#3020/#3031), resilient static-file scan
  (#3026), Wayland-socket wait (#3030), and Sentry release/board
  triage tags (#3021/#3025)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 08:02:55 +02:00
Viktor Petersson
6bcffc5b63 chore(deps): sync uv.lock with pyproject.toml pins (#3039)
master's committed uv.lock had drifted from its own pyproject.toml:
uvicorn, pyyaml, django-stubs and types-python-dateutil were bumped
in earlier merged PRs but the lockfile was never regenerated, so
`uv lock --check` failed on master. Regenerate the lock so it matches
the already-declared `==` pins (httptools moves transitively with
uvicorn[standard]). No pyproject changes — no new dependency decisions.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 08:02:32 +02:00
Viktor Petersson
3b5ad782bc fix(celery): store display_power in redis as a string, not a bool (#3032)
- diagnostics.get_display_power() returns str | bool; redis-py
  rejects a bool with DataError: Invalid input of type: 'bool', so
  every clean CEC True/False crashed the get_display_power task and
  left the key unset (Sentry ANTHIAS-2C, 32 events)
- Coerce to str before r.set: the v2 System Info API exposes
  display_power as string | null and passes it through, so
  'True'/'False'/'CEC error' all fit — and the on/off state now
  actually populates instead of only the error fallbacks landing
- The existing test asserted the buggy bool write (a Mock redis
  accepted it); rewrite it to assert str coercion across True/False/
  error, with an isinstance guard against the DataError
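
The coercion can be illustrated with a stand-in client. `FakeRedis`
below is hypothetical and only imitates the relevant behaviour: it
raises `TypeError` where redis-py raises `DataError`, and the
`display_power` key name is assumed for the example.

```python
from typing import Union


class FakeRedis:
    """Stand-in client that refuses bool values, like redis-py does."""

    def __init__(self) -> None:
        self.data: dict[str, object] = {}

    def set(self, key: str, value: object) -> None:
        # bool is checked first because it is a subclass of int.
        if isinstance(value, bool) or not isinstance(value, (bytes, str, int, float)):
            raise TypeError(f"Invalid input of type: {type(value).__name__!r}")
        self.data[key] = value


def store_display_power(r: FakeRedis, value: Union[str, bool]) -> None:
    """Coerce to str before the write so True/False are storable."""
    r.set("display_power", str(value))
```

A plain Mock accepts any argument, which is why a test against one could
not catch the original bool write.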

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 07:46:50 +02:00
Viktor Petersson
0145e6ed81 fix(sentry): silence celery redis-backend reconnect-retry log (#3036)
- celery's redis result backend logs its own connection-loss retries
  at ERROR ("Connection to Redis lost: Retry (4/20) in 1.00 second.")
  while it retries on its own — the same expected-transient noise as
  the consumer/beat loggers already ignored, but from the
  celery.backends.redis logger (Sentry ANTHIAS-2E)
- ignore_logger('celery.backends.redis'); extend the regression test

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 07:45:15 +02:00
Viktor Petersson
871401c614 fix(viewer): wait for cage's Wayland socket before spawning the webview (#3030)
* fix(viewer): wait for cage's Wayland socket before spawning the webview

- On Wayland boards, a webview respawn that raced a moment when cage's
  socket was briefly absent had every attempt die with Qt's "Failed to
  create wl_display (Connection refused)", burning the whole inline
  retry budget and crash-looping the container (Sentry ANTHIAS-19)
- _wait_for_wayland_socket() blocks (bounded) for
  $XDG_RUNTIME_DIR/$WAYLAND_DISPLAY before each spawn — the Wayland
  analogue of start_viewer.sh's /dev/fb0 and eglfs-display waits
- No-op on linuxfb/eglfs boards, when the env names no socket, and
  (the common case) when the socket is already up; a dead compositor
  falls through after the timeout so the normal spawn-failure path
  still runs
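
The bounded wait above can be sketched as a polling loop against a
monotonic deadline. This is an illustration, not the viewer's code: the
signature, the injected `exists`/`sleep`/`now` callables and the 0.1 s
poll interval are assumptions made so the sketch is testable.

```python
import os
import time
from typing import Callable, Mapping


def wait_for_wayland_socket(
    env: Mapping[str, str],
    deadline: float,
    exists: Callable[[str], bool] = os.path.exists,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
    poll_seconds: float = 0.1,
) -> bool:
    """Return True once the socket exists, False if the deadline passes."""
    runtime_dir = env.get("XDG_RUNTIME_DIR")
    display = env.get("WAYLAND_DISPLAY")
    if not runtime_dir or not display:
        # The environment names no socket: nothing to wait for.
        return True
    path = os.path.join(runtime_dir, display)
    while not exists(path):
        if now() >= deadline:
            # Dead compositor: fall through so the normal spawn-failure
            # handling still runs.
            return False
        sleep(poll_seconds)
    return True
```

When the socket is already up, the loop body never runs and the call
returns immediately.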

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): address review — gate Wayland wait on env, share the deadline

- _is_wayland_board() is x86-only, so gating the socket wait on it
  skipped exactly the Pi 5 where ANTHIAS-19 fired. Gate on the
  WAYLAND_DISPLAY env cage exports instead — set on x86/arm64/pi5
  alike, unset on linuxfb/eglfs — so the wait engages on every cage
  board (Copilot)
- Fold the wait into _spawn_webview_once's existing monotonic
  deadline so the socket wait and the handshake share one
  startup_timeout budget rather than stacking up to +10s on the
  inline respawn path (Copilot)
- Drop the now-unused BROWSER_WAYLAND_SOCKET_WAIT_SECONDS constant
- Tests now drive the real env signals (incl. a DEVICE_TYPE=pi5 case
  that fails if the wait is board-gated) and pass a monotonic deadline

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 21:28:17 +02:00
Viktor Petersson
056f539d4d fix(viewer): respawn the webview on a video-play D-Bus death, not log+stop (#3031)
- MPVMediaPlayer.play/stop called playVideo/stopVideo directly and, on
  a webview-gone D-Bus error (NoReply — the webview crashed mid-call),
  logged ERROR and gave up: a Sentry event for a self-healing
  condition, and the video stayed dead until the next rotation
  respawned the webview (Sentry ANTHIAS-1A)
- Route both through the same _send_to_webview wrapper the image/page
  paths use (#3012): reap + respawn + retry once on webview death,
  injected via set_send_to_webview() in setup() alongside the bus
- Genuinely unexpected errors still propagate to play/stop's existing
  log+clear-state handling; with no wrapper injected (tests/standalone)
  calls run directly, preserving prior behaviour
- Add tests for the routing, respawn-recovery, re-raise, and the
  no-injection fallback

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 21:27:33 +02:00
Viktor Petersson
fe942a50f6 fix(sentry): also drop transient redis TimeoutError events (#3028)
- The before_send filter caught redis.exceptions.ConnectionError but
  not redis.exceptions.TimeoutError — in redis-py the two are
  siblings under RedisError, not parent/child, so a redis outage that
  hangs the socket (rather than refusing) slipped through to Sentry
- Surfaced post-deploy as ANTHIAS-1B (Timeout connecting to server,
  viewer resolution reporter) once the build-hash release tags made
  it identifiable
- Match on both types; add the redis-stubs TimeoutError entry and a
  regression test that also pins the sibling (not subclass) relation
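
A sketch of such a filter, using stand-in exception classes so it runs
without redis-py installed. The three classes only mirror the sibling
relation described above, and `iter_chain` is a hypothetical helper.
The `(event, hint)` signature and `hint["exc_info"]` follow sentry-sdk's
`before_send` hook.

```python
from typing import Any, Iterator, Optional


class RedisError(Exception):
    """Stand-in for redis.exceptions.RedisError."""


class RedisConnectionError(RedisError):
    """Stand-in for redis.exceptions.ConnectionError."""


class RedisTimeoutError(RedisError):
    """Stand-in for redis.exceptions.TimeoutError (a sibling, not a subclass)."""


TRANSIENT = (RedisConnectionError, RedisTimeoutError)


def iter_chain(exc: Optional[BaseException]) -> Iterator[BaseException]:
    """Walk __cause__/__context__, honouring __suppress_context__."""
    seen: set[int] = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        yield exc
        if exc.__cause__ is not None:
            exc = exc.__cause__
        elif not exc.__suppress_context__:
            exc = exc.__context__
        else:
            exc = None


def before_send(event: dict[str, Any], hint: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Drop the event when a transient redis error is in the chain."""
    exc_info = hint.get("exc_info")
    if exc_info and any(isinstance(e, TRANSIENT) for e in iter_chain(exc_info[1])):
        return None
    return event
```

Matching on a tuple of both types is what closes the gap: an
`isinstance` check against the connection error alone never matches a
timeout.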

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 21:07:30 +02:00
Viktor Petersson
757feaef2a fix(server): survive unreadable static files instead of crash-looping (#3026)
* fix(server): survive unreadable static files instead of crash-looping

- A balena OTA rewrote the staticfiles layer onto a device with
  corrupted ext4 metadata; WhiteNoise's one-shot startup scan raised
  OSError 117 (Structure needs cleaning) at ASGI import and uvicorn
  crash-looped, bricking the device over one unreadable Django-admin
  vendor file (Sentry ANTHIAS-Y, 400+ events from one device)
- Subclass the middleware with a per-entry fault-tolerant scan: skip
  what the filesystem refuses, serve the rest, and emit one ERROR per
  startup so the storage fault still reaches Sentry once per boot
- Add a minimal whitenoise stub (channels-stubs pattern) so the
  subclass type-checks under strict mypy

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): address review — robust static URL join + module logger

- Normalise root with a trailing separator before slicing so a root
  without one can't yield a '/static//css/app.css' double-slash URL
  that fails to match requests (this was also the CI test failure)
- Use a module-level logger instead of the root logger, per the
  codebase convention
- Tests now assert the full canonical /static/ URLs so a URL-join
  regression can't slip through
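
The double-slash hazard in the first bullet is easy to reproduce. The
sketch below is a hypothetical helper, not the middleware's code; it
uses `posixpath` so the behaviour is the same on any host, and assumes
`path` always lies under `root`.

```python
import posixpath


def static_url(root: str, path: str, prefix: str = "/static/") -> str:
    """Map a file under `root` to its URL under `prefix`."""
    # join(root, "") appends a separator only when one is missing, so
    # the slice below always removes the separator along with the root.
    root = posixpath.join(root, "")
    return prefix + path[len(root):]
```

Without the normalisation, slicing by `len("/srv/static")` would leave a
leading slash on the remainder and produce `/static//css/app.css`.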

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): address review — close scandir FDs, partial stub, real responses

- Iterate os.scandir under a with-block so directory FDs close
  promptly on large/deep trees
- Drive the middleware tests through update_files_dictionary directly
  (the test settings enable WHITENOISE_AUTOREFRESH, so __init__ never
  scans) and assert the full canonical /static/ URLs — this is also
  what the prior CI failure was telling us
- get_response returns a real HttpResponse per Django's contract;
  tighten the stub's get_response to HttpResponseBase (no None) and
  mark py.typed partial, matching channels-stubs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 21:07:00 +02:00
Viktor Petersson
eb1baa6fbf feat(sentry): stamp the release with the build hash and tag balena deploys (#3025)
* feat(sentry): stamp the release with the build hash and tag balena deploys

- The CalVer alone proved ambiguous: a balena OTA deploy from master
  ships new code while pyproject still carries the last tagged
  version, so pre- and post-deploy events were indistinguishable in
  the 2026.6.2 audit
- Postfix the release with GIT_SHORT_HASH (already baked into every
  image by tools/image_builder) as semver build metadata:
  2026.6.2+abc1234
- Tag events balena=true/false via the same is_balena_app() helper
  the reboot/shutdown tasks use, so triage knows which operational
  playbook applies
- Add tests for the release stamping and the tag wiring
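
The stamping itself is a one-liner. A minimal sketch, assuming the
version and hash arrive as plain strings; `build_release` is a
hypothetical name, and falling back to the bare version when no hash is
available is an assumption of the example.

```python
from typing import Optional


def build_release(version: str, git_short_hash: Optional[str]) -> str:
    """Append the build hash as semver build metadata when it is known."""
    if git_short_hash:
        return f"{version}+{git_short_hash}"
    # No hash available: report the bare version.
    return version
```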

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sentry): keep the balena check import-light, test it behaviorally

- Importing anthias_common.utils into Django settings dragged
  sh/requests/redis into every settings load and crashed django-stubs'
  mypy plugin in CI's slim environment
- Inline the BALENA env check as is_balena_deploy() and pin it
  against the canonical is_balena_app() helper in a test so the two
  can't drift (also replaces the brittle source-text assertion —
  Copilot)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 19:42:02 +02:00
Viktor Petersson
9863d8c9d3 fix(viewer): aspect-fit, gapless looping, and 30fps cap for pi1/2/3 video (#3004)
* fix(viewer): aspect-fit, gapless looping, and 30fps cap for pi1/2/3 video

Fixes the issue #2987 regressions on the Qt5 linuxfb boards by moving
playback from a bash gst-launch relaunch loop into a small in-process
GStreamer helper (anthias_viewer/gst_fbdev_player.py):

- Portrait/4:3 videos no longer stretch to the framebuffer: a CAPS-event
  pad probe reads the decoder's native dims + PAR and pins aspect-fit
  caps (pixel-aspect-ratio=1/1) on the capsfilter, so the bcm2835 ISP
  scales aspect-correct and fbdevsink centers the frame. The previous
  fb-sized forced caps parked the distortion in a PAR that fbdevsink
  ignores (reproduced on-device: 1080x1920 -> 3840x2160 par 81/256).
- Clips no longer freeze/cut at loop boundaries: playbin about-to-finish
  re-queues the same URI for a gapless loop instead of rebuilding the
  whole pipeline per iteration (0.4-1.7 s per loop measured on a Pi 4,
  several seconds on a Pi 3, all eaten out of the fixed slot duration).
  Flush-seek on EOS and NULL->PLAYING restart remain as fallbacks.
- 50/60 fps sources drop to an even 30 fps cadence up front (videorate
  drop-only) instead of juddering on irregular late-frame drops; the
  decode->ISP->memcpy chain sustains ~40 fps at 1080p on a Pi 3.
- The framebuffer is zeroed at startup so letterbox borders are black
  rather than remnants of the previous asset.
- The helper runs by path (not -m) so the package __init__ (Django
  settings, redis, D-Bus) never imports in the child; validated e2e on
  the armhf image: negotiation, rotation, looping, SIGTERM exit 0.
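
The aspect-fit arithmetic in the first bullet can be checked in
isolation. This is a sketch of the maths only, not the pad probe in
gst_fbdev_player.py; `aspect_fit` and `forced_par` are hypothetical
names, and truncating to whole pixels is a simplification.

```python
from fractions import Fraction


def aspect_fit(
    src_w: int, src_h: int, fb_w: int, fb_h: int, par: Fraction = Fraction(1, 1)
) -> tuple[int, int]:
    """Largest size with the source's display aspect that fits the framebuffer."""
    display_aspect = Fraction(src_w, src_h) * par
    if display_aspect >= Fraction(fb_w, fb_h):
        # Source is at least as wide as the screen: pin the width.
        return fb_w, int(fb_w / display_aspect)
    # Source is narrower than the screen: pin the height.
    return int(fb_h * display_aspect), fb_h


def forced_par(src_w: int, src_h: int, fb_w: int, fb_h: int) -> Fraction:
    """Pixel aspect ratio implied by forcing the frame to the full fb size."""
    return Fraction(src_w, src_h) / Fraction(fb_w, fb_h)
```

For the on-device case quoted above, `forced_par(1080, 1920, 3840, 2160)`
is 81/256, while `aspect_fit` on the same inputs yields 1215x2160 with
square pixels.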

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): address review feedback on the fbdev player helper

- Correct the module docstring: the helper is executed by file path,
  not -m (the package __init__ must not import in the child)
- Fail fast with a clear log line when the GStreamer python bindings
  are missing instead of crashing with a traceback
- Clear the framebuffer in scanline-sized chunks so a 4K console
  doesn't peak a ~33 MB allocation on a 512 MB board

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): degrade to silent video when the audio branch fails

An integrated testbed run surfaced a wholesale-failure mode the relaunch
loop also had: a broken audio branch killed the video with it. Two
real-world triggers: the ALSA card is absent (HDMI audio disabled in
config.txt -> no vc4hdmi), and an undecodable audio codec (AC3 -
a52dec lives in plugins-ugly, not shipped). Retry once with
GST_PLAY_FLAG_AUDIO cleared on both the synchronous start failure
(alsasink can't reach READY) and the first async pipeline error; a
genuine video error recurs on the retry and still exits non-zero.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): read the visible fb resolution via FBIOGET_VSCREENINFO

Integrated testbed run surfaced a divergence the sysfs read hides:
sysfs virtual_size reports xres_virtual/yres_virtual, which can be
larger than the scanned-out mode (panning / double-buffer configs —
observed live: visible 1920x1080, virtual 3840x2160). fbdevsink
centers/crops against varinfo.xres/yres, so scaling to the virtual
size paints mostly off-screen. Query the same ioctl fbdevsink uses;
keep the sysfs read (and the 1080p default) as fallbacks for hosts
without fb access.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): swap the audio sink for fakesink on the video-only retry

Clearing GST_PLAY_FLAG_AUDIO is not sufficient: an element set on the
audio-sink property remains a playsink child and is still state-synced
with the pipeline, so a failing alsasink failed the retry too
(observed live on the testbed). Replace it with a fakesink when
degrading to silent video.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): pre-flight the ALSA device and rebuild on audio failure

The integrated testbed run showed two things the previous in-place
retry missed:
- a playbin whose sink activation failed does not reliably restart
  after NULL: the video-only retry failed instantly on the reused
  element with no further GStreamer error;
- alsasink opens the PCM on NULL->READY, so a missing card is
  detectable synchronously before it can poison playbin's whole
  sink activation.

So: pre-flight the device with a standalone alsasink and only wire
the audio branch when it opens; on any pipeline error with audio
enabled (e.g. an undecodable AC3 track mid-preroll), tear down and
rebuild a fresh video-only playbin instead of restarting the errored
one. Genuine video errors recur on the rebuilt pipeline and still
exit non-zero.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): satisfy mypy on the gi-typed returns

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 16:00:33 +02:00
Viktor Petersson
e882417c0c perf(viewer): pace Qt6 video frame delivery to scene-render capacity (#3006)
* perf(viewer): pace video frame delivery to scene-render capacity

Part of issue #2987: 1080p60 content on Pi 4 presented at 22.6 fps
with the playback position falling to ~0.6x realtime (clips ended
early), because every sink delivery scheduled a QML scene render on
a GUI thread that sustains ~45 renders/s at 1080p — overload made
throughput collapse below even the 30 fps a 1080p30 clip achieves.

QMediaPlayer now renders into an intermediate QVideoSink; frames
forward to the VideoOutput's sink only once the scene graph has
composited the previous one (QQuickWindow::afterRendering). 30 fps
sources pass untouched; 60 fps sources settle into an even ~half
cadence instead of irregular drops. If the render signal is not
wired the gate falls back to unpaced forwarding.

Stats lines gain a frames-forwarded field between frames-delivered
and frames-rendered so the gate is observable in playback-stats.log.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(viewer): forward parked frames on render completion (mailbox)

The 1-deep gate was stop-and-wait: render -> re-arm -> idle until the
next sink delivery -> render, which measured only ~23 presented fps on a
GUI thread that renders faster back-to-back. Park the newest frame
that arrives mid-render in a single-slot mailbox and forward it the
moment afterRendering fires, so renders chain at capacity with at
most one frame of latency.
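
The mailbox can be modelled as a small state machine. The real gate is
C++ on top of QVideoSink and QQuickWindow::afterRendering; the Python
class below is a hypothetical model of the logic only, with `forward`
standing in for the push to the VideoOutput's sink.

```python
from typing import Callable, Optional


class FramePacer:
    """Single-slot mailbox: at most one frame waits while a render is in flight."""

    def __init__(self, forward: Callable[[object], None]) -> None:
        self._forward = forward
        self._rendering = False
        self._parked: Optional[object] = None

    def on_frame(self, frame: object) -> None:
        """Called for every frame the decoder delivers."""
        if self._rendering:
            # Newest frame wins; an older parked frame is dropped.
            self._parked = frame
        else:
            self._rendering = True
            self._forward(frame)

    def on_after_rendering(self) -> None:
        """Called when the scene graph has composited the previous frame."""
        if self._parked is not None:
            frame, self._parked = self._parked, None
            self._forward(frame)  # chain the next render immediately
        else:
            self._rendering = False  # re-arm the gate

    def reset(self) -> None:
        """Drop any parked frame and re-arm, e.g. when playback stops."""
        self._parked = None
        self._rendering = False
```

Frames that arrive mid-render overwrite each other in the slot, so only
the newest one is forwarded when the render completes.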

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): reset the pacing gate and clear the sink on stop()

A frame parked in the mailbox at stop() time could be forwarded by a
later afterRendering (stale-frame flash on the next reveal) and kept
its decoder buffer alive between assets. Clear the mailbox, re-arm
the gate, and push an empty frame to the VideoOutput so the last
displayed buffer is released too (review feedback).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 15:54:06 +02:00
Viktor Petersson
edcd6f381a fix(celery): gate worker startup on applied database migrations (#3016)
* fix(celery): gate worker startup on applied database migrations

- The celery container boots in parallel with anthias-server, whose
  start script is still running dbbackup -> migrate (or the dbrestore
  fallback that drops and re-creates every table)
- A task replayed off the Redis broker in that window died with
  OperationalError: no such table: assets — one burst per device on
  every upgrade/first boot (Sentry ANTHIAS-1)
- Block in worker_init until the unapplied-migration plan is empty;
  tasks stay queued in the broker while waiting, so nothing is lost
- Treat probe errors (database locked, missing django_migrations
  table) as not-ready instead of raising
- Works for every deployment topology (compose, balena, dev, test)
  without touching the five compose files that spell the worker CMD
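
The gate amounts to a polling loop around a readiness probe. A minimal
sketch, assuming a `pending` callable that returns the unapplied
migration plan; `DatabaseError` here is a stand-in for
django.db.utils.DatabaseError, and the function name and 5-second
default are assumptions made for the example.

```python
import time
from typing import Callable, Sequence


class DatabaseError(Exception):
    """Stand-in for django.db.utils.DatabaseError."""


def wait_for_migrations(
    pending: Callable[[], Sequence[object]],
    sleep: Callable[[float], None] = time.sleep,
    poll_seconds: float = 5.0,
) -> int:
    """Block until the unapplied-migration plan is empty; return polls waited."""
    polls = 0
    while True:
        try:
            if not pending():
                return polls
        except DatabaseError:
            # Locked database or missing django_migrations table:
            # treat as not ready and poll again.
            pass
        sleep(poll_seconds)
        polls += 1
```

Any exception other than `DatabaseError` propagates, so a programming
bug in the probe fails fast rather than parking the worker.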

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(celery): address review — log probe failures, rate-limit wait warning

- Log the migration-readiness probe's underlying error at DEBUG so a
  persistent non-transient cause is identifiable from device logs
- Repeat the waiting warning every 30s instead of every 5s poll
- Patch time.sleep by dotted path for strict mypy

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(celery): address review — only DatabaseError counts as not-ready

- Catch django.db.utils.DatabaseError in the readiness probe; a
  programming bug now fails fast instead of parking the worker in an
  infinite wait
- Reword the docstring to be accurate about DEBUG-level visibility
- Add a fail-fast regression test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(celery): address review — explicit next-log threshold for the wait warning

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(celery): address review — point the wait warning at the DEBUG probe logs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 15:52:42 +02:00
Viktor Petersson
43b937563b fix(sentry): stop reporting transient redis blips and client disconnects (#3018)
* fix(sentry): stop reporting transient redis blips and client disconnects

- Redis restarting (container recycle, compose startup before DNS
  resolves) produced an error event per process per blip even though
  every consumer self-heals: celery reconnects with backoff, the
  viewer's resolution reporter retries next tick, Channels
  re-establishes on the next frame (Sentry ANTHIAS-M, ANTHIAS-K,
  ANTHIAS-H, ANTHIAS-J)
- Add a before_send hook that drops events whose exception chain
  contains redis.exceptions.ConnectionError or asyncio.CancelledError
  (an HTTP client hanging up mid-request under ASGI — ANTHIAS-N)
- Silence celery's per-reconnect-attempt ERROR log at the logger
  (it arrives as a log message, not an exception)
- Downgrade the viewer reporter's redis-down log to a warning and
  extract the tick body into a testable helper
- Add regression tests for the filter and the reporter tick

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sentry): address review — typed before_send, cleaner test fixtures

- Annotate the hook with sentry_sdk.types Event/Hint for strict mypy
- Build exc_info triples directly in tests instead of catching
  BaseException (Sonar S5754) and compare events by equality
  (Sonar S5796)
- Use record.getMessage() in the caplog assertion (Copilot)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tests): address review — make the ignored-logger test order-independent

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tests): address review — lift the module-wide logging disable for caplog tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sentry): silence celery beat's reconnect-retry log too

- The embedded beat scheduler logs every broker reconnect attempt at
  ERROR ("beat: Connection error ... Trying again"), the same
  expected-transient noise as the consumer logger (Sentry ANTHIAS-P)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sentry): address review — respect __suppress_context__ in the chain walk

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(compose): healthcheck redis and gate services on it answering PING

- depends_on with bare service_started only orders container
  creation; uvicorn/celery/viewer could still race a redis that
  hadn't finished loading its RDB, producing the startup
  connection-refused noise (review feedback on this PR)
- Add a redis-cli ping healthcheck to the prod template, dev, and
  test composes, and gate anthias-server / anthias-viewer /
  anthias-celery on service_healthy
- compose-only: the balena supervisor doesn't support depends_on
  conditions, and a redis container recycling mid-life is gated by
  nothing — so the Sentry-side handling of transient redis errors
  stays

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 15:51:54 +02:00
Viktor Petersson
50cc80455a feat(sentry): tag events with device type, host kernel, and board model (#3021)
* feat(sentry): tag events with device type, host kernel, and board model

- Events are sent from inside containers, so Sentry's stock OS
  detection never sees the host — the armv7 webview-crash cohort
  (ANTHIAS-D / ANTHIAS-F) was impossible to segment because nothing
  on the event said which kernel the device boots
- Tag device_type (baked into the image), kernel_release /
  kernel_machine (containers share the host kernel), and the
  device-tree board model
- Add tests for the board-model reader
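
A rough sketch of how such tags could be collected, under stated assumptions: the `board_model` tag name, the function names, and the exact trim set are illustrative and not taken from the commit. The `device_type`, `kernel_release`, and `kernel_machine` names come from the message above. The device-tree model string is NUL-terminated, so it is decoded as UTF-8 and only trailing NUL/whitespace is stripped.

```python
import platform
from pathlib import Path
from typing import Dict, Optional


def read_board_model(path: str = "/proc/device-tree/model") -> Optional[str]:
    """Return the device-tree model string, or None when it is unavailable."""
    try:
        raw = Path(path).read_bytes()
    except OSError:
        return None  # no device tree (e.g. x86) or unreadable file
    # Assumption: replace undecodable bytes rather than fail the whole reader.
    model = raw.decode("utf-8", errors="replace").rstrip("\x00 \t\r\n")
    return model or None


def host_tags(device_type: str, model_path: str = "/proc/device-tree/model") -> Dict[str, str]:
    """Build the tag dict; containers share the host kernel, so uname is the host's."""
    uname = platform.uname()
    tags = {
        "device_type": device_type,
        "kernel_release": uname.release,
        "kernel_machine": uname.machine,
    }
    model = read_board_model(model_path)
    if model is not None:
        tags["board_model"] = model  # hypothetical tag name
    return tags
```

Each entry could then be applied with `sentry_sdk.set_tag(key, value)` after the SDK is initialised.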

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sentry): address review — decode the device-tree model as UTF-8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sentry): address review — trim only the trailing NUL/whitespace

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sentry): address review — match device_helper's device-tree trim idiom

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:50:29 +02:00
Viktor Petersson
daa1d4bbb6 fix(server): open SQLite with WAL journaling and a busy timeout (#3015)
* fix(server): open SQLite with WAL journaling and a busy timeout

- uvicorn, the celery worker, and the viewer share one SQLite file
  across containers; the stock rollback journal plus a 0s busy
  timeout raised OperationalError: database is locked fleet-wide
  (Sentry ANTHIAS-C, ANTHIAS-E, ANTHIAS-G)
- timeout=20 waits for the lock instead of failing on the spot
- journal_mode=WAL lets readers and the writer coexist;
  synchronous=NORMAL is the recommended WAL pairing
- transaction_mode=IMMEDIATE queues concurrent writers on the busy
  handler instead of deadlocking on the read-to-write lock upgrade
- Add regression tests covering the connection options
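
The effect of these options can be illustrated with the standard-library `sqlite3` module. This is a sketch under stated assumptions rather than the project's Django configuration: the helper names and the `name` column are hypothetical, and the explicit `BEGIN IMMEDIATE` plays the role that `transaction_mode=IMMEDIATE` plays in the commit.

```python
import sqlite3


def open_shared_db(path: str) -> sqlite3.Connection:
    """Open a connection configured for several processes sharing one file."""
    # timeout=20: wait up to 20 s for a lock instead of failing immediately.
    # isolation_level=None: autocommit, so transactions are started explicitly.
    conn = sqlite3.connect(path, timeout=20, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
    conn.execute("PRAGMA synchronous=NORMAL")  # the usual pairing with WAL
    return conn


def insert_asset(conn: sqlite3.Connection, name: str) -> None:
    """Write inside an IMMEDIATE transaction so the write lock is taken up front."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("INSERT INTO assets(name) VALUES (?)", (name,))
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

Taking the write lock at `BEGIN` means a second writer waits on the busy handler instead of failing later when it tries to upgrade a read lock.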

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tests): annotate tmp_path for strict mypy

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:43:19 +02:00
Viktor Petersson
5b6e0a4adc fix(test): give the test-stack celery worker the test environment (#3013)
* fix(test): give the test-stack celery worker the test environment

The anthias-celery service in docker-compose.test.yml lacked
ENVIRONMENT=test and ANTHIAS_TEST_DB_PATH, so the worker's settings
took the production branch.

- Worker read the empty /data/.anthias/anthias.db instead of the
  test DB, failing every dispatched task with "no such table: assets"
- Worker picked up the default production Sentry DSN, leaking
  test-run errors into the production project tagged as production
- Share the environment block between anthias-test and anthias-celery
  via a YAML anchor so the two cannot drift again

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(integration): assert video normalisation's terminal state

With the celery worker now on the correct test DB, the upload test
raced the worker: it asserted the transient placeholder duration and
is_processing=True, which the (newly working) normalisation pipeline
overwrites moments later.

- Wait for is_processing to clear instead of racing the worker
- Assert the probed duration (5 s) and the codec-gate rejection
  (no DEVICE_TYPE in the test container means an empty HW-decode set)
- The wait doubles as a regression guard for the worker reading the
  wrong DB: that failure mode leaves the row stuck on processing

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: fix comment typo (ffprobe'd) from Copilot review

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:39:30 +02:00
Viktor Petersson
633b41ed87 fix(celery): catch the asset-probe timeout instead of hard-killing the worker (#3017)
* fix(celery): catch the asset-probe timeout instead of hard-killing the worker

- revalidate_asset_url's 30s hard limit was reachable by a legitimate
  probe (DNS stall + HEAD 10s + GET 10s), and tripping it SIGKILLs the
  pool child — three Sentry issues per occurrence (ANTHIAS-A,
  ANTHIAS-9, ANTHIAS-B)
- Add soft_time_limit=60 / time_limit=90: the soft limit raises inside
  the task, so the task records the same verdict an HTTP timeout gets
  (unreachable) instead of dying
- Give the periodic sweep the same treatment: abort cleanly a minute
  before its hard limit, releasing the singleton lock
- Add regression tests for limits and soft-timeout behaviour

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(celery): address review — cover the DB update with the soft-limit catch

- The soft signal is delivered asynchronously, so it can land during
  the row UPDATE as well as the probe; cover the whole task body in
  both the on-demand recheck and the sweep
- Re-raise SoftTimeLimitExceeded past the sweep's blanket per-asset
  handler so the outer abort path sees it
- Satisfy strict mypy on the Optional time-limit comparisons; reword
  a misleading test comment
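
The control flow described in this PR can be sketched without Celery. Everything here is illustrative: `SoftTimeLimitExceeded` is a local stand-in for `celery.exceptions.SoftTimeLimitExceeded`, and `recheck`, `sweep`, `probe`, and `record` are hypothetical names. In a real task the limits would be set with the `soft_time_limit` / `time_limit` task options named above.

```python
import logging
from typing import Callable, Iterable


class SoftTimeLimitExceeded(Exception):
    """Stand-in for celery.exceptions.SoftTimeLimitExceeded (assumption)."""


def recheck(asset: str, probe: Callable[[str], str], record: Callable[[str, str], None]) -> None:
    """On-demand recheck: the try covers both the probe and the row update."""
    try:
        record(asset, probe(asset))
    except SoftTimeLimitExceeded:
        # Same verdict an HTTP timeout would produce.
        record(asset, "unreachable")


def sweep(assets: Iterable[str], probe: Callable[[str], str],
          record: Callable[[str, str], None]) -> bool:
    """Periodic sweep: return True when finished, False when aborted early."""
    try:
        for asset in assets:
            try:
                record(asset, probe(asset))
            except SoftTimeLimitExceeded:
                raise  # must not be swallowed by the blanket handler below
            except Exception:
                logging.warning("probe failed for %s; continuing", asset)
    except SoftTimeLimitExceeded:
        logging.warning("sweep hit its soft limit; aborting cleanly")
        return False
    return True
```

Without the explicit re-raise, the per-asset `except Exception` would treat the soft limit as an ordinary probe failure and the sweep would keep running into its hard limit.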

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:38:34 +02:00
Viktor Petersson
9ff20a82b6 fix(server): log GitHub release-check failures at warning, not error (#3019)
* fix(server): log GitHub release-check failures at warning, not error

- A device that can't reach api.github.com (offline installs,
  locked-down networks, GitHub outages, rate limits) is a routine
  condition the update check already degrades through gracefully:
  5-minute backoff plus the cached last verdict
- ERROR-level logs land in Sentry, so every offline device produced
  events on each splash-page render (Sentry ANTHIAS-8)
- Downgrade every failure path in lib/github.py to warning and pin
  the module as ERROR-free with a regression test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tests): address review — AST-walk the github module for ERROR logging

- Replace the brittle substring check with an AST walk over actual
  logging.* calls, so comments can't trigger false positives and
  ERROR-level calls can't slip through under another name (Copilot)
- Patch logging by dotted path for strict mypy
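
A minimal sketch of such an AST guard, assuming the module logs through a receiver named `logging` or `logger` (an assumption; the real test may match differently). The function name and the method set are illustrative; `fatal` is included because a later commit in this PR adds that alias.

```python
import ast
from typing import List, Tuple

# Method names that log at ERROR level or above.
ERROR_METHODS = {"error", "exception", "critical", "fatal"}
# Assumed receiver names; adjust to whatever the module actually uses.
RECEIVERS = {"logging", "logger"}


def find_error_logging(source: str) -> List[Tuple[int, str]]:
    """Return (line, method) for every ERROR-or-higher logging call in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in ERROR_METHODS
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id in RECEIVERS
        ):
            hits.append((node.lineno, node.func.attr))
    return sorted(hits)
```

Because comments and string literals never become `ast.Call` nodes, a mention of `logging.error` in a comment cannot trip the check the way a substring search would.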

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tests): address review — drop unused fixture, accurate AST-check docstring

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tests): address review — include the logging.fatal alias in the AST guard

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:37:47 +02:00
Viktor Petersson
d2960d56f4 fix(viewer): respawn the webview when it dies before setup's bus.get (#3020)
- The armv7 Qt5 init crash can strike in the gap between the D-Bus
  handshake (which made load_browser() return) and setup()'s
  bus.get — the anthias.viewer name is released again, pydbus raises
  ServiceUnknown, and the GError escaped main() into a container
  restart loop (Sentry ANTHIAS-3)
- Apply the same webview-gone detection and respawn-then-retry-once
  contract as _send_to_webview (#3012), with the generous startup
  budget since nothing is on screen yet
- Unrelated D-Bus errors (e.g. Disconnected) still propagate so the
  container restart handles what a respawn can't fix
- Add regression tests for both paths

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 13:39:49 +02:00
Viktor Petersson
50aa201e53 fix(viewer): respawn the webview when it dies mid-D-Bus call instead of crashing (#3012)
* fix(viewer): respawn the webview when it dies mid-D-Bus call instead of crashing

- The armv7 Qt5 heap-corruption crash that load_browser() already
  retries past can also strike after the D-Bus handshake; the death
  then surfaces as a GError (NoReply) out of the in-flight
  loadImage/loadPage call, escapes main(), and turns one process
  crash into a container restart loop (Sentry 58040ab3)
- Wrap loadImage/loadPage in _send_to_webview(): on a webview-gone
  D-Bus error, reap the dead process, respawn with the inline
  budget, and retry the call once; anything else still raises
- Reset current_browser_url in load_browser() so a respawned webview
  always gets its asset re-sent, even when the URL is unchanged
  (a crashed single-asset playlist previously respawned to a blank
  screen forever)
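
The respawn-then-retry-once contract can be sketched as a small wrapper. This is an illustration only: `WebviewGone` stands in for the D-Bus errors the real code inspects, and `send_to_webview`, `call`, and `respawn` are hypothetical names rather than the viewer's actual signatures.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class WebviewGone(Exception):
    """Stand-in for a D-Bus error that means the webview process died."""


def send_to_webview(call: Callable[[], T], respawn: Callable[[], None]) -> T:
    """Run call(); if the webview is gone, respawn it and retry exactly once."""
    try:
        return call()
    except WebviewGone:
        pass  # fall through to the respawn path; any other error propagates
    respawn()
    # A second failure is not caught, so a webview that dies again raises.
    return call()
```

Limiting the retry to one attempt keeps a persistently crashing webview from looping inside the wrapper; the failure still reaches the caller.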

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tests): use https URLs in new viewer tests

- Resolves SonarCloud S5332 security hotspots on the PR

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 11:09:42 +02:00
Viktor Petersson
8ff0331c32 fix(viewer): silence sh's monitor-thread crash report on webview exit (#3011)
- Pass _bg_exc=False when spawning AnthiasViewer: sh's default re-raises
  the exit error (e.g. SignalException_SIGABRT on a Qt init crash, or
  SIGTERM from our own teardown) inside its daemon monitor thread,
  where nothing can catch it
- Sentry reported these as unhandled errors even though the handshake
  watch already detects the death and load_browser() already retries
- The handle is only used via is_alive()/process.stdout/terminate(),
  never .wait(), so no exception is silently deferred
- Lock the kwargs in with a regression assertion in test_load_browser

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 10:36:36 +02:00
Viktor Petersson
f26f3e7e0c fix(server): keep legacy host timezones from crash-looping Django (#3010)
- Hosts with a legacy alias in /etc/timezone (e.g. US/Central) passed
  the old pytz validation but failed Django's own check against
  /usr/share/zoneinfo, raising ValueError at startup in every Django
  process (server, celery, migrate)
- Validate the way Django does: a zoneinfo lookup plus the on-disk
  file check, falling back to UTC when either fails
- Ship tzdata-legacy in the base image so legacy aliases resolve and
  devices keep their actual local time instead of dropping to UTC
- Add regression tests for the validation fallback
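
A rough sketch of that two-step validation using the standard-library `zoneinfo` module. The function name and the `zoneinfo_root` parameter are assumptions for illustration; the commit does not show the real helper.

```python
from pathlib import Path
from zoneinfo import ZoneInfo


def resolve_timezone(name: str, zoneinfo_root: str = "/usr/share/zoneinfo") -> str:
    """Return name if it is usable as a time zone, otherwise 'UTC'."""
    try:
        # Unknown keys raise ZoneInfoNotFoundError (a KeyError subclass);
        # malformed keys such as absolute paths raise ValueError.
        ZoneInfo(name)
    except (KeyError, ValueError, OSError):
        return "UTC"
    # Second check: the zone file must also exist on disk.
    if not (Path(zoneinfo_root) / name).is_file():
        return "UTC"
    return name
```

Both checks are needed because `ZoneInfo` can resolve a key from the `tzdata` Python package even when the system zoneinfo directory lacks the file.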

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 10:31:41 +02:00
Viktor Petersson
040094a35e chore(release): bump to 2026.06.2 (#3009)
- CalVer (YYYY.0M.MICRO); still June 2026, micro 1 -> 2
- Ships the Qt 6 video audio fix (#3001) — PulseAudio in the viewer
  container; videos were silent on pi4-64/pi5/x86/arm64 since the
  QtMultimedia migration
- Adds the arm64/Qt6 pi3-64 board and the Rock Pi 4 fleet (#2985)
- Page-load watchdog so a stalled fetch can't freeze the display
  (#3003), Sentry error tracking for the Django services (#3007)
- Redis data persisted to the mounted volume so device identity
  survives recreation (#2983); unpinner also rolls OS + supervisor
  updates (#2984)
- Streamed backup downloads (#3005), 12-hour AM/PM asset times
  (#3002), BuildKit frontend via mirror.gcr.io (#3008)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 08:36:23 +02:00
Viktor Petersson
da53e045a6 fix(server): stream the backup download so large libraries don't time out (#3005)
* fix(server): stream the backup download so large libraries don't time out

Part of issue #2987 ("Get backup" reportedly never completes on a
Pi 3B+): the settings download built the whole tar.gz on disk before
sending the first byte. tar+gzip at the default level 9 measures
98 s for 355 MB on a Pi 4 (~3.6 MB/s) — a multi-GB library on a Pi 3
sends nothing for longer than a browser keeps a byte-less request
alive, so the download always aborted.

- settings_backup now streams the archive while it is built
  (StreamingHttpResponse over a pipe fed by a tarfile producer
  thread): first bytes hit the wire immediately, nothing is staged
  on the SD card, and a client disconnect stops the producer.
- gzip level drops to 1 for both the streamed and the API
  (create_backup) paths — backups are mostly already-compressed
  media, so level 9 burned minutes of CPU for ~no size win.
- The Content-Disposition filename is RFC-8187-escaped via Django's
  content_disposition_header (player_name is operator-controlled).
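
The pipe-fed streaming idea can be sketched with the standard library alone. This is an assumption-laden illustration, not the project's `stream_backup()`: the signature, chunk size, and flat `arcname` handling are invented for the example. The generator could be handed to a streaming response; it also re-raises a producer failure once the pipe drains, as the follow-up commit in this PR describes.

```python
import gzip
import os
import tarfile
import threading
from typing import Generator, Iterable, List

CHUNK = 64 * 1024  # illustrative read size


def stream_backup(paths: Iterable[str]) -> Generator[bytes, None, None]:
    """Yield a gzip-compressed tar of paths while a thread is still building it."""
    read_fd, write_fd = os.pipe()
    errors: List[BaseException] = []

    def produce() -> None:
        try:
            with os.fdopen(write_fd, "wb") as sink:
                # compresslevel=1: media is already compressed, so favour speed.
                with gzip.GzipFile(fileobj=sink, mode="wb", compresslevel=1) as gz:
                    with tarfile.open(fileobj=gz, mode="w|") as tar:
                        for path in paths:
                            tar.add(path, arcname=os.path.basename(path))
        except BrokenPipeError:
            pass  # the reader closed its end (client disconnect): just stop
        except BaseException as exc:
            errors.append(exc)

    producer = threading.Thread(target=produce, daemon=True)
    producer.start()
    with os.fdopen(read_fd, "rb") as source:
        # read1() returns as soon as some bytes are available.
        while chunk := source.read1(CHUNK):
            yield chunk
    producer.join()
    if errors:
        raise errors[0]  # abort instead of ending with a truncated archive
```

Closing the generator early closes the read end of the pipe, so the producer's next write fails with `BrokenPipeError` and the thread exits.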

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): surface backup-stream producer failures, fix mypy type

- stream_backup() re-raises a producer failure once the pipe drains,
  so the response aborts mid-transfer instead of completing 200 with
  a silently truncated archive (review feedback). Client disconnects
  still log-and-stop without morphing into spurious errors.
- Return type is Generator (not Iterator) — the disconnect test
  calls .close(), which mypy rejects on Iterator.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 08:00:28 +02:00
Viktor Petersson
10c68b26cc feat(viewer,build,balena): add arm64/Qt6 pi3-64 board and the Rock Pi 4 fleet; keep 32-bit pi3 as legacy (#2985)
* feat(viewer,build): add arm64/Qt6 pi3-64 board; keep 32-bit pi3 as legacy

Revises issue #2906 Phase 2. The original plan (delete the Qt 5 toolchain,
force Pi 2/Pi 3 onto Qt 6) is abandoned: Qt 5 was fixed up on master and
stays. Instead, add a NEW board target `pi3-64` — a 64-bit (arm64) Qt 6
viewer image for Raspberry Pi 3 hardware on a 64-bit OS — as its own image
stream, disk image, and balena fleet. The legacy 32-bit armhf/Qt5 `pi3`
board is left untouched and flagged as legacy/maintenance.

pi3-64 mirrors the existing `pi4-64` path (Qt 6, eglfs_kms; video played
in-process by AnthiasViewer's QtMultimedia pipeline — QMediaPlayer + the
ffmpeg/libavcodec backend with V4L2 HW decode, no external player).
VideoCore IV is H.264-only HW decode. Board selection is by `uname -m`: a
Pi 3 on a 64-bit OS gets `pi3-64`, a 32-bit OS keeps `pi3` (the model
string is identical on both arches).

- image_builder: pi3-64 build params (arm64) + is_qt6; constants.
- Dockerfile.viewer.j2 + start_viewer.sh: pi3-64 shares the pi4-64 eglfs
  KMS path; renamed board-agnostic eglfs-kms-pi4.json -> eglfs-kms.json.
- Detection: install.sh / upgrade_containers.sh (aarch64 Pi 3 -> pi3-64).
- Runtime: media_player force_mpv set (selects MPVMediaPlayer, the
  QtMultimedia D-Bus shim); processing codec grid {'h264'}.
- CI: docker-build matrix + mirror-latest-tags.
- Balena (fleet screenly_ose/anthias-pi3-64, device type raspberrypi3-64):
  disk-image + manual-deploy workflows, balena_ota_deploy.sh,
  balena_fleet_maintenance.py, balena_unpin_devices.py, deploy_to_balena.sh,
  balena-host-config.json.
- Pi Imager: SUPPORTED_BOARDS += pi3-64 (non-maintenance); pi3 stays legacy.
- Docs + tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(website): link the Pi 3 (64-bit) bullet like its siblings

Copilot review: the list is introduced as 'links to the images', so the
new pi3-64 entry should be navigable like the surrounding bullets. Link
the label to the release-images section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(balena): add the Rock Pi 4 fleet (screenly_ose/anthias-rockpi4)

Wires the anthias-rockpi4 balena fleet (device type rockpi-4b-rk3399)
into the OTA deploy + disk-image pipeline. The fleet has no
board-specific image build: it runs the generic arm64 containers, so
bin/balena_ota_deploy.sh / bin/deploy_to_balena.sh map the rockpi4
board to the <short-hash>-arm64 image tags (and strip the /dev/vchiq
mount — no VideoCore on RK3399), and the disk-image preflight verifies
the arm64 images exist.

Root-cause fix for the fleet's codec gate: balena ships no
anthias_host_agent service, so host:board_subtype was never published
and resolve_device_key() stayed 'arm64' — whose HW-decode set is empty,
rejecting every video upload. The model-string → subtype table moves to
the dependency-free anthias_common.device_helper.detect_board_subtype
(single source, imported by host_agent), and
anthias_common.board.get_board_subtype now falls back to reading
/proc/device-tree/model in-container when Redis has no value. The
device tree is kernel-global — the same mechanism get_device_type has
always used for Pi detection — so the rockpi4 fleet resolves its
{h264, hevc} envelope without a host-side daemon, and compose installs
whose host_agent died self-heal too.

- build-balena-disk-image.yaml: rockpi4 in both matrices, fleet +
  rockpi-4b-rk3399 image cases, arm64 images in the preflight check.
- deploy-balena-manual.yaml: rockpi4 board option.
- balena-host-config.json: rockpi4 declared {} (config.txt is
  RPi-only; the reconcile hard-fails on a missing key).
- balena_fleet_maintenance.py / balena_unpin_devices.py: fleet added.
- tests: get_board_subtype Redis-first + device-tree-fallback order;
  detect_board_subtype patch targets follow the move.
- docs: board-enablement, balena-fleet-host-config,
  installation-options.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 07:49:12 +02:00
Viktor Petersson
6f01e0aa33 fix(webview): add page-load watchdog so a stalled fetch can't freeze the display (#3003)
- Chromium has no overall navigation timeout: a fetch whose packets
  stop arriving mid-load (WiFi dropout, no FIN/RST) keeps loadFinished
  from ever firing, so the dual-view swap never happens and the screen
  freezes on the previous asset until the container is restarted
- Arm a single-shot QTimer per load attempt (default 30s, tunable via
  ANTHIAS_WEBPAGE_TIMEOUT_S, clamped to 5-3600s)
- On timeout, stop() the wedged navigation (cancelling its pending
  network I/O so dead sockets don't pile up in the connection pool)
  and retry the same URI on a fresh request
- Failed-fast loads (DNS / connection refused) reuse the watchdog as
  a paced retry tick - the viewer only re-sends loadPage when the URL
  changes, so a single-webpage playlist gets no other retry
- Disarm the watchdog on every path that supersedes the pending load
  (success, loadImage, playVideo)
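
The timeout parsing can be illustrated in a few lines. The webview itself is not written in Python, so this is only a sketch of the clamping rule; the function name and the fall-back-to-default behaviour for unparsable values are assumptions.

```python
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT_S = 30
MIN_TIMEOUT_S = 5
MAX_TIMEOUT_S = 3600


def webpage_timeout_s(environ: Optional[Mapping[str, str]] = None) -> int:
    """Read ANTHIAS_WEBPAGE_TIMEOUT_S and clamp it to the allowed range."""
    env = os.environ if environ is None else environ
    raw = env.get("ANTHIAS_WEBPAGE_TIMEOUT_S", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_S  # assumption: unset or unparsable means default
    return max(MIN_TIMEOUT_S, min(MAX_TIMEOUT_S, value))
```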

Fixes #2999

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 07:48:23 +02:00
Viktor Petersson
e211ec848e feat(sentry): add Sentry error tracking to the Django services (#3007)
* feat(sentry): add Sentry error tracking to the Django services

Adds sentry-sdk to the server and viewer dependency groups (plus the
mypy group, since that CI job imports settings) and initialises it in
the shared Django settings module, so anthias-server, anthias-celery,
and the viewer all report crashes.

- The DSN can be overridden via the SENTRY_DSN env var (pointing at
  an operator's own Sentry project), or set to an empty string to
  disable crash reporting entirely.
- environment= mirrors the ENVIRONMENT env var (production default)
  so dev/CI events are filterable in Sentry.
- release= comes from pyproject.toml's [project].version via the
  existing get_anthias_release() helper.
- send_default_pii is gated on the existing analytics_opt_out knob in
  anthias.conf — opted-out devices still report crashes, but without
  request headers / IPs / user data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sentry): don't send events from test runs by default

The unit suite is built to run with no external network dependencies
(conftest.py force-mocks Redis for the same reason), and exceptions
raised on purpose by failing tests must not land in the production
Sentry project. Default the DSN to empty under ENVIRONMENT=test or
pytest (reusing the existing argv detector, moved up from the
DATABASES section); an explicit SENTRY_DSN still wins so the
integration stack can opt in deliberately.
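
The precedence described above might look roughly like this. The function name, the placeholder DSN in the usage note, and the way pytest is detected from `argv` are all assumptions; the commit only says an existing argv detector is reused.

```python
import os
from typing import Mapping, Sequence


def resolve_sentry_dsn(environ: Mapping[str, str], default_dsn: str,
                       argv: Sequence[str]) -> str:
    """Pick the DSN: explicit env var wins, tests default to empty (disabled)."""
    if "SENTRY_DSN" in environ:
        # Includes the empty string, which disables reporting entirely.
        return environ["SENTRY_DSN"]
    # Hypothetical pytest detection based on the program name.
    under_pytest = bool(argv) and "pytest" in os.path.basename(argv[0])
    if environ.get("ENVIRONMENT") == "test" or under_pytest:
        return ""
    return default_dsn
```

For example, `resolve_sentry_dsn({"ENVIRONMENT": "test"}, "https://key@example.invalid/1", ["manage.py"])` yields an empty DSN, while adding `SENTRY_DSN` to the mapping opts the run back in.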

Addresses the Copilot review comment on the hardcoded default DSN.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(sentry): pin the no-events-under-pytest guarantee

Asserts the client has no DSN, builds no transport, and drops
capture calls when running under pytest, so a future settings.py
refactor can't silently re-enable sending from test runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 07:41:46 +02:00
Viktor Petersson
7fc57fecf0 fix(docker): pull the BuildKit frontend via mirror.gcr.io (#3008)
* fix(docker): pull the BuildKit frontend via mirror.gcr.io

The `# syntax=docker/dockerfile:1.4` directive made every image build
fetch the frontend from registry-1.docker.io — the last remaining
Docker Hub dependency (base images already come from mirror.gcr.io,
bun/uv from ghcr.io). Docker Hub pulls from shared GitHub runner IPs
intermittently time out, failing CI before the build even starts.

Re-point the directive at Google's pull-through cache, which serves
the same multi-arch manifest list. The version pin stays for frontend
reproducibility.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(docker): bump the BuildKit frontend pin from 1.4 to 1.24

1.4 dates to May 2022; 1.24 is the current release. Nothing in the
templates needs newer syntax (--mount=type=cache predates 1.4), so
this is purely picking up four years of frontend bugfixes. Keeps the
minor-pin convention — the tag floats only over patch releases.

Validated by building the rendered redis image against the mirrored
1.24 frontend.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(docker): use ENV key=value form flagged by 1.24 build checks

`docker build --check` with the 1.24 frontend flags the legacy
`ENV DEBIAN_FRONTEND noninteractive` form (LegacyKeyValueFormat) in
the test template — the only hit across all four templates. All
rendered Dockerfiles now lint clean against the new frontend.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 07:41:21 +02:00
Viktor Petersson
fb4770bfe3 feat(balena): unpinner also rolls OS + supervisor updates (#2984)
* fix(viewer): fix two startup warnings surfaced by the log cleanup

With debug logging removed (#2977), two pre-existing startup warnings
became visible in the viewer logs. Both are fixed here:

- migrate_legacy_paths.sh: `set -euo pipefail` + the bare `${USER}` in
  USER_HOME aborted the script when $USER is unset — which is the case
  in-container (DATA mode, via migrate_in_container_paths.sh). On a
  legacy (screenly→anthias) device that would skip the in-container
  migration entirely. Fall back to `id -un` so it resolves without
  tripping `set -u`.
- AnthiasViewer: the MainWindow ctor called showFullScreen() AND main()
  called window->show(), showing the window twice. Under cage/wayland
  the double surface-commit tripped wlroots' "A configure is scheduled
  for an uninitialized xdg_surface" warning at startup. Show once, in
  main(), after construction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(balena): unpinner also rolls OS + supervisor updates

bin/balena_unpin_devices.py cleared the app-release pin but left
unpinned devices on whatever old balenaOS / supervisor they booted.
Add two optional, cloud-API-only phases so the one hourly job brings
the fleet fully current:

- --os-update: start a resinhup toward the latest balenaOS for the
  device type (POST actions/<base>/v2/<uuid>/resinhup, base resolved
  from /config). Online-only, skips OS < 2.14.0 (single-hop HUP floor
  per balena-hup-action-utils), bounded to --os-percent (default 5%)
  of the eligible population per run so the backlog ramps instead of
  stampeding.
- --supervisor: point devices at the newest supervisor release for
  their CPU architecture (PATCH should_be_managed_by__release).
  Restricted to devices already on the target OS so the supervisor/OS
  pairing stays compatible, and tranche-bounded like the OS phase.
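
A sketch of the tranche selection for the OS phase, under loudly stated assumptions: the device dict keys (`online`, `os_version`), the function names, and rounding the tranche up are all hypothetical. Only the 2.14.0 floor, the online-only rule, and the 5% default come from the message above.

```python
import math
import re
from typing import Dict, List, Optional, Tuple

HUP_FLOOR = (2, 14, 0)  # single-hop HUP floor named in the commit


def parse_version(text: str) -> Optional[Tuple[int, int, int]]:
    """Extract the first X.Y.Z triple from a version string, if any."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def pick_tranche(devices: List[Dict], percent: float = 5.0) -> List[Dict]:
    """Return the slice of eligible devices to update in this run."""
    eligible = []
    for device in devices:
        version = parse_version(device.get("os_version", ""))
        if device.get("online") and version is not None and version >= HUP_FLOOR:
            eligible.append(device)
    # Assumption: round up so a small fleet still makes progress each run.
    size = math.ceil(len(eligible) * percent / 100)
    return eligible[:size]
```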

Both phases honour the anthias_keep_pinned opt-out, stay dry-run by
default, and keep the aggregate-only public-log output. The hourly
workflow now passes --os-update --supervisor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(balena): skip supervisor bump on legacy-OS device types

getSupervisorReleasesForCpuArchitecture returns the newest supervisor
for an arch regardless of OS. On a device type frozen on a legacy
balenaOS line that's unsafe: raspberry-pi2 tops out at balenaOS 5.1.x
(devices run supervisor 15.x), and pointing them at 17.x would likely
break them. Only run the supervisor phase when the fleet's target OS is
on the calendar-versioned line (major >= 2025); legacy-OS fleets keep
their OS-matched supervisor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(balena): set a User-Agent so Cloudflare allows the resinhup call

The device-actions host (actions.balena-devices.com) is behind
Cloudflare, which 403s the default Python-urllib User-Agent as a banned
client signature (error 1010) — so every OS-update POST failed. Send a
descriptive User-Agent on all requests; the resinhup action then
triggers normally (HTTP 202).
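
With `urllib`, setting the header is a one-line change on the request object. In this sketch the User-Agent string, the helper name, and the URL in the usage note are placeholders, not the values the script uses.

```python
import urllib.request
from typing import Optional

# Placeholder value; the real script's User-Agent string is not shown in the commit.
USER_AGENT = "example-fleet-maintenance/1.0"


def build_request(url: str, method: str = "GET",
                  data: Optional[bytes] = None) -> urllib.request.Request:
    """Build a request that carries a descriptive User-Agent header."""
    return urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"User-Agent": USER_AGENT},
    )
```

A call such as `urllib.request.urlopen(build_request("https://example.invalid/action", method="POST", data=b"{}"))` would then send the custom header instead of the default `Python-urllib/x.y`.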

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(balena): report actual per-fleet OS-update started/failed counts

The per-fleet line printed the tranche size regardless of how many
resinhup calls actually succeeded. Track started/failed per fleet and
print both, so a few transient busy/offline failures are visible
instead of hidden behind the planned count.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 07:24:43 +02:00