mirror of https://github.com/Screenly/Anthias.git synced 2026-06-10 17:18:43 -04:00

Viktor Petersson 6b3f638c60 fix(viewer): retry AnthiasViewer spawn so armv7 WebEngine-init crash self-heals (#2969)

* fix(viewer): skip the Writeback connector in the eglfs headless guard

#2962 added wait_for_eglfs_display so a screenless eglfs (Pi 4) board waits
for a display instead of crash-looping on Qt's "no screens available".
eglfs_has_display() treated any connector status other than "disconnected"
as a present display (to tolerate bridges that report "unknown").

balenaOS 2026.x exposes a KMS `card0-Writeback-1` virtual connector that
ALWAYS reports "unknown". On a headless Pi 4 (both HDMI ports
"disconnected") the writeback connector's "unknown" satisfied the guard, so
it skipped the wait, launched eglfs, and the viewer crash-looped on
"no screens available" / "AnthiasViewer exited before emitting D-Bus
handshake" — exactly the failure #2962 was meant to prevent. Confirmed on
multiple live Pi 4 boards on 2026.1.0 (card0-HDMI-A-1/-2 = disconnected,
card0-Writeback-1 = unknown).

Skip `*Writeback*` connectors so only real display outputs (HDMI/DSI/DP/…)
count. A genuinely headless board now waits gracefully; the bridge-"unknown"
hedge is preserved for real connectors. Verified locally for headless,
connected, and bridge-unknown layouts.
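
The guard described above can be sketched roughly as follows. This is an
illustration only, not the actual eglfs_has_display() code: the function
name `has_real_display` and the dict-of-statuses input are hypothetical, and
in practice the statuses would presumably be read from the DRM connector
status files.

```python
from fnmatch import fnmatchcase
from typing import Mapping


def has_real_display(connector_status: Mapping[str, str]) -> bool:
    """Return True if any non-writeback connector may have a display.

    `connector_status` maps a connector name (e.g. "card0-HDMI-A-1") to
    its reported status. Hypothetical helper for illustration.
    """
    for name, status in connector_status.items():
        if fnmatchcase(name, "*Writeback*"):
            # Virtual connector that always reports "unknown": ignore it.
            continue
        if status.strip() != "disconnected":
            # "connected" counts, and so does a bridge's "unknown".
            return True
    return False


headless_pi4 = {
    "card0-HDMI-A-1": "disconnected",
    "card0-HDMI-A-2": "disconnected",
    "card0-Writeback-1": "unknown",
}
print(has_real_display(headless_pi4))
```

With the writeback connector skipped, the headless layout above no longer
satisfies the guard, while a real connector reporting "unknown" still does.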

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): retry AnthiasViewer spawn so armv7 WebEngine-init crash self-heals

- Wrap the AnthiasViewer launch in a capped exponential-backoff retry
  loop (BROWSER_SPAWN_MAX_ATTEMPTS) instead of raising on the first
  failed handshake
- Convert the tight container restart loop on Pi 2/Pi 3 into an
  in-process retry that self-heals on a later launch
- Publish viewer:webview_status to Redis (retrying/failed) so a stuck
  board is distinguishable from an empty playlist
- Add WebviewLaunchError + _spawn_webview_once helper; throttle repeat
  warnings to avoid flooding journald
- Cover retry-then-succeed and exhaust-then-raise paths in tests
- Document the armv7 WebEngine-init crash + retry stop-gap in
  docs/board-enablement.md
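
A minimal sketch of a capped exponential-backoff retry of the kind listed
above. Only the names WebviewLaunchError and BROWSER_SPAWN_MAX_ATTEMPTS
come from this change; `spawn_with_retry`, its parameters, and every
numeric value below are assumptions for illustration, not the viewer's
actual code.

```python
import time
from typing import Callable, Optional


class WebviewLaunchError(Exception):
    """Stand-in for the error type named in this commit."""


BROWSER_SPAWN_MAX_ATTEMPTS = 10  # assumed value; the real setting is not shown


def spawn_with_retry(
    spawn_once: Callable[[], None],
    max_attempts: int = BROWSER_SPAWN_MAX_ATTEMPTS,
    backoff_base: float = 1.0,   # assumed first delay, in seconds
    backoff_cap: float = 30.0,   # assumed upper bound on any single delay
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call spawn_once until it succeeds; return the attempt that worked."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            spawn_once()
            return attempt
        except WebviewLaunchError as exc:
            last_error = exc
            if attempt < max_attempts:
                # Delay doubles each time (1, 2, 4, ...) up to the cap.
                sleep(min(backoff_cap, backoff_base * 2 ** (attempt - 1)))
    # Chain from the last failure so the traceback keeps the root cause.
    raise WebviewLaunchError(
        f"viewer failed to start after {max_attempts} attempts"
    ) from last_error
```

Passing `sleep` as a parameter is a design choice of this sketch so that
tests can record delays instead of waiting for them.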

The 32-bit Qt5 viewer intermittently aborts during Chromium/WebEngine
init (malloc(): unaligned tcache chunk detected) in ~75-90% of launches;
reproduced on a 64-bit Pi 3B+. No userspace mitigation fixes the
corruption, but a fresh launch clears it ~10-25% of the time, so
retrying catches a good launch within a few attempts (validated
on-device: handshake on attempt 6). Clean fix is arm64/Qt6 on 64-bit OS.
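
As a rough check on those figures, the arithmetic can be sketched as below.
It assumes each launch succeeds independently with a fixed probability p,
which is a simplification not claimed by the measurements above; the helper
names are hypothetical.

```python
import math


def p_success_within(n: int, p: float) -> float:
    """Probability of at least one good launch in n independent attempts."""
    return 1.0 - (1.0 - p) ** n


def attempts_for(target: float, p: float) -> int:
    """Smallest n for which p_success_within(n, p) reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


for p in (0.10, 0.25):
    print(
        f"p={p:.2f}: within 6 attempts = {p_success_within(6, p):.3f}, "
        f"attempts for 95% = {attempts_for(0.95, p)}"
    )
```

Under that independence assumption, a handshake by attempt 6 is an
unremarkable outcome across the observed 10-25% range, and the retry budget
needs to be noticeably larger at the pessimistic end than at the optimistic
end.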

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): address review — bound status-beacon Redis, funnel CommandNotFound

- Give the webview health beacon a dedicated Redis client with short
  socket timeouts (connect_to_redis gains opt-in timeout params,
  defaulting to the historical blocking behaviour) so a Redis stall
  can't hang viewer startup inside the spawn-retry loop
- Wrap sh.CommandNotFound into WebviewLaunchError in _spawn_webview_once
  so a missing binary is reported + handled on the same path as every
  other launch failure instead of escaping the retry loop
- Reword the board-enablement note so it describes the WebEngine-init
  observation without referencing a --no-sandbox flag the viewer
  doesn't receive
- conftest: accept the new connect_to_redis kwargs in the fake factory

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): chain final launch error; harden second redis test patch

- Chain the exhausted-retries WebviewLaunchError from last_error so the
  traceback preserves the underlying failure (timeout / early-exit /
  wrapped CommandNotFound)
- conftest: the autouse _mock_redis fixture's connect_to_redis patch now
  accepts *args, **kwargs too (matches the import-time patch)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): scope spawn-retry by call site; drop write-only status beacon

Addresses self-review findings on the retry mechanism:

- Mid-playback respawn (view_image/view_webpage, on the asset_loop
  thread) now uses a small, short budget (BROWSER_SPAWN_INLINE_*) so a
  persistent crash can't freeze the loop (no rotations/skips/standby,
  watchdog starved) for minutes; startup keeps the generous budget. A
  persistent mid-run failure raises and the container restart re-rolls.
- Permanent failures (missing binary) raise WebviewBinaryMissingError
  and short-circuit the retry instead of burning the full backoff budget.
- _spawn_webview_once now reaps the terminated process (SIGTERM, wait,
  SIGKILL) on the handshake-timeout path so a retry can't overlap two
  AnthiasViewers contending for the framebuffer / D-Bus name.
- Reset the stale `browser` global before re-spawning.
- Poll spawned process every 0.25s (was 1s) so a fast init crash is
  noticed promptly in the retry loop.
- Drop the write-only viewer:webview_status Redis beacon (no reader
  existed) and revert the connect_to_redis timeout-param widening +
  conftest churn; operator-visible status is the throttled log output.
- Tests: cover early-exit, terminate-on-timeout, missing-binary
  short-circuit, backoff growth, and the inline budget cap.
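
The SIGTERM, wait, SIGKILL reaping step from the list above might look
roughly like this. The sketch uses the standard-library subprocess module
as a stand-in for however the viewer actually manages the process; `reap`
and the `grace` parameter are hypothetical names.

```python
import subprocess
import sys


def reap(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """Make sure proc is gone before the next spawn attempt.

    Sends SIGTERM, waits up to `grace` seconds, then falls back to
    SIGKILL. Returns the process's exit code.
    """
    if proc.poll() is None:          # still running
        proc.terminate()             # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            proc.kill()              # SIGKILL on POSIX
            proc.wait()              # collect the exit status
    return proc.returncode


# Usage: a child that would otherwise outlive the handshake timeout.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = reap(child)
```

Waiting after the kill is what prevents a leftover process from overlapping
with the next launch.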

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): clamp max_attempts to >=1; correct retry-logging comment

- Guard load_browser against a non-positive max_attempts (would skip the
  loop and raise a confusing "0 attempts; last error: None")
- Reword the comment: the first failure logs its reason AND a retry
  line, so it's not literally "one log line per attempt"

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(viewer): clamp backoff_cap and startup_timeout in load_browser

- A backoff_cap below 1s would devolve into a tight retry loop; a
  negative one would make sleep() raise ValueError mid-retry and mask
  the real launch error
- Clamp a negative startup_timeout to 0 (immediate-timeout attempt)
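
The clamps from this and the previous commit can be sketched as a single
helper. The function name and the 1-second floor for backoff_cap are
assumptions made for illustration; load_browser's real signature is not
shown here.

```python
import time
from typing import Tuple


def clamp_retry_params(
    max_attempts: int, backoff_cap: float, startup_timeout: float
) -> Tuple[int, float, float]:
    """Return retry settings forced into a safe range."""
    return (
        max(1, max_attempts),        # always try at least once
        max(1.0, backoff_cap),       # assumed floor: avoids a tight loop
        max(0.0, startup_timeout),   # negative means "time out immediately"
    )


print(clamp_retry_params(0, -5.0, -3.0))

# Why the backoff clamp matters: a negative delay makes sleep() raise.
try:
    time.sleep(-1)
except ValueError as exc:
    print(f"time.sleep(-1) raised ValueError: {exc}")
```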

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>