Files
Anthias/stubs
Viktor Petersson 7ef4fbbca3 refactor(splash): poll IPs from client instead of rendering server-side (#2806)
* refactor(splash): poll IPs from client instead of rendering server-side

The splash page used to call get_node_ip() synchronously in the view
and render a static IP list. Two problems with that:

1. The view passed get_node_ip()'s return value to
   ipaddress.ip_address() unconditionally. On a fresh Balena boot the
   supervisor API can be slow to respond, in which case get_node_ip()
   returns the literal string 'Unknown' — and
   ipaddress.ip_address('Unknown') raises ValueError, which 500s the
   whole splash render. The viewer's webview then displays whatever
   Qt does with a 5xx page (blank/error), not "Unknown" as the prior
   structure suggested.

2. The render is a static snapshot. The splash is on screen for ~60s
   thanks to viewer/__init__.py's SPLASH_DELAY, but the rendered IPs
   reflect only what the host bus answered at second 0. If the
   supervisor recovers at second 8 — or if IPs change due to a DHCP
   renewal mid-splash — the page can't update.

This change makes the splash render immediately with no IPs and
populates them client-side by polling a new lightweight endpoint:

  - GET /api/v2/network/ip-addresses returns
    {"ip_addresses": ["http://192.168.1.42", ...]}.
  - Unauth'd because the splash itself is unauth'd and the data is
    already disclosed by the splash render — no new exposure. Test
    pinned to make a future flip to @authorized fail fast.
  - Narrow on purpose: just IPs, no diagnostics. /api/v2/info covers
    the heavy "everything about the device" case but is auth'd and
    pulls psutil/statvfs/version-checks that would compound on a
    2-second poll.

The polling JS lives in templates/splash-page.html (~50 lines of
vanilla JS, no framework, no build step). It polls every 2s for the
first ~30s and then backs off to 5s, so a long-tail recovery during
the splash's display window still gets caught without the page
thrashing the host bus when nothing's changing.

A new module-level helper _safe_ip_addresses() in api/views/v2.py
owns the get_node_ip() -> formatted-list conversion, with the
ValueError bug fixed at source: 'Unknown' and 'Unable to retrieve
IP.' both map to []. InfoViewV2.get_ip_addresses() is refactored to
delegate, so /api/v2/info gets the same robustness.

Behavior change for /api/v2/info (worth flagging for reviewers): if
get_node_ip() returned something malformed before, the endpoint would
have 500'd; it now returns ip_addresses=[] and 200.

The latent fragility of get_node_ip()'s Balena single-shot lookup
itself is left for a separate change — anything that calls it
directly (CLI tools, future endpoints) would still see 'Unknown' on a
slow first boot. The right fix there is symmetrizing the Balena
branch to retry the way the non-Balena branch already does. Out of
scope here: the splash specifically needed structural change, not
just a longer retry, because static-snapshot rendering doesn't make
use of the 60-second display window even if get_node_ip() always
returned in time.

Refs #2803 (Part B follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: format tests/test_splash_page.py

CI's `ruff format --check` (separate from `ruff check`) wanted shorter
lines refolded onto single lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: quiet SonarCloud quality gate (NOSONAR on intentional http://)

Anthias serves the admin UI on plain HTTP per CLAUDE.md (TLS is opt-
in via the Caddy sidecar). The splash page surfaces device URLs so
the operator can click into that UI, so _safe_ip_addresses() must
emit http:// URLs to match how the device actually listens —
emitting https:// would point at a port that isn't bound on a
default install. Add NOSONAR markers on the two emit lines (S5332)
with rationale, mirroring the existing NOSONAR pattern in
host_agent.py.

In tests/test_splash_page.py, hoist the IP literals to module-level
fixture constants with NOSONAR markers (S1313, "hardcoded IP").
Downstream f-strings reference the constants, which centralizes the
suppression and avoids scattering NOSONAR comments through every
assertion.

No production code paths affected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: centralize 'http://' literal under one NOSONAR

The previous fixup put NOSONAR markers on the IP-fixture constants
but left the http:// scheme inline in the f-strings, which Sonar
still flagged as S5332 at each call site. Hoist the prefix to a
single _HTTP constant with the NOSONAR + rationale, and reference
it from the f-strings. Same output, no production impact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: address Copilot review on PR #2806

Two issues flagged on the original PR.

1. _safe_ip_addresses() called lib.utils.get_node_ip(), which is
   anything but lightweight on non-Balena: it publishes
   'set_ip_addresses' to host_agent and waits up to 60s for
   host_agent_ready plus 20s on ip_addresses_ready before returning.
   Inside a 2-second polling endpoint, a single slow first call would
   tail-back every subsequent poll, and the splash would never
   populate. Defeats the entire point of the redesign.

   Split the resolver into two pieces:

   - _format_ip_urls(node_ip): pure formatter, shared by the polling
     endpoint and /api/v2/info. No I/O, tolerates 'Unknown' /
     'Unable to retrieve IP.' sentinels by returning [].

   - _resolve_node_ip(): fast-path resolver for the polling endpoint
     only. On bare metal, reads the cached 'ip_addresses' Redis key
     directly (host_agent populates it) and fires a fire-and-forget
     'hostcmd' publish on cache miss so the next poll finds something.
     On Balena, calls the supervisor with a 1.5s HTTP timeout —
     supervisor is single-source-of-truth there and there's no cache
     to fall back on, so a tight bound is the best we can do. Caps
     well under the JS poll cadence (2s) so request workers stay free.

   InfoViewV2.get_ip_addresses() now calls _format_ip_urls(get_node_ip())
   directly, preserving the existing blocking behavior since /api/v2/info
   is auth'd and not polled.

2. The JS polling loop never stopped — just backed off to 5s forever.
   In the actual viewer, the splash is destroyed after ~60s so polling
   ends naturally. But if the page is opened by anything else (curl,
   dev tools, an idle browser tab), it would generate forever-requests
   in the background.

   Add a wall-clock cap of 120s — double the in-viewer display window,
   generous enough for slow first-boot recoveries, bounded enough that
   an idle tab doesn't accumulate load. Whatever was last rendered
   stays on screen after the cap.

Tests rewritten to mock at the right boundary: _format_ip_urls is
tested directly with strings (no I/O), _resolve_node_ip is tested
with mocked Redis / requests.get, and the endpoint test exercises the
full chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: address Copilot follow-up review on PR #2806

Two more issues flagged after the first round of fixups:

1. The bare ``except Exception`` around ``r.publish('hostcmd', ...)``
   in ``_resolve_node_ip()`` would swallow any error, including
   programming mistakes (e.g. ``r`` accidentally None after a future
   refactor) — turning "splash IPs never populate" into a silent
   failure with no breadcrumb. Narrow to ``redis.RedisError`` so we
   only swallow what we actually want to swallow (transient broker
   issues), and let real bugs surface as 500s where they can be
   diagnosed. Added ``RedisError`` to the project's redis stub.

2. The previous test patched ``api.views.v2.get_node_ip`` to assert
   that the splash view doesn't call it — but the splash view lives
   in ``anthias_app.views``, and that import path doesn't reach
   anything in api.views.v2. The patch wasn't asserting anything
   useful: the test would pass regardless of what splash_page did.

   Replace with two more honest tests:

   - test_renders_200_without_mocking_anything: render must succeed
     with no fixtures or mocks. The original "Unknown" → ValueError
     → 500 regression now requires zero scaffolding to catch.

   - test_splash_view_does_not_import_get_node_ip: assert at the
     module-attribute level that ``anthias_app.views`` doesn't carry
     ``get_node_ip`` in its namespace. Fails fast if a future refactor
     re-adds the import (which is what would re-introduce the
     synchronous IP work we just removed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: address Copilot third-round review on PR #2806

Two issues flagged after the second round of fixups:

1. _resolve_node_ip()'s r.get('ip_addresses') had no error handling.
   If Redis is flaking on the device (early boot, transient broker
   hiccup), the splash polling endpoint would 500 — which is exactly
   the kind of fragility this whole redesign was supposed to remove.
   Wrap r.get in a redis.RedisError catch and treat it as a cache
   miss: return ''. The JS keeps polling at 2s intervals and recovers
   as soon as Redis comes back.

2. The cache-miss path published 'set_ip_addresses' on every poll.
   host_agent's set_ip_addresses runs an internet probe with a 10x1s
   tenacity retry, so worst-case it's ~10s of work. At a 2s poll
   cadence, we'd queue ~5 redundant refresh requests before the
   first one finishes — keeping host_agent busy for far longer than
   necessary on a slow first boot.

   Add a SETNX-with-TTL debounce key (_IP_REFRESH_PENDING_KEY,
   12s TTL — covers worst-case host_agent latency with margin). Only
   the first cache-miss poll in the window publishes; later polls
   within the TTL no-op. A future poll past the TTL retries naturally
   if the cache is still empty.

   On publish failure (Redis flake mid-call), the debounce key is
   actively cleared so the next poll can retry — otherwise we'd be
   pinned out of refreshing for the whole TTL after a transient
   error that didn't actually queue a refresh.

Tests cover all four behaviors:
- r.get RedisError returns ''
- Three back-to-back cache-miss polls publish only once (debounce
  holds across calls within the window)
- A failed publish leaves the debounce key cleared so the next poll
  retries
- Existing setUp/tearDown also clear the debounce key between tests
  so they don't interfere with each other

Also adds `delete` to the redis stub (already present on PR #2805's
branch; needed here for the test's debounce-key cleanup).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: address Copilot fourth-round review on PR #2806

Two issues flagged after the third round of fixups:

1. _resolve_node_ip()'s Balena branch built the supervisor URL inline
   with its own auth + headers, duplicating
   lib.utils.get_balena_supervisor_api_response() and
   get_balena_device_info(). The two would drift over time —
   different timeout / error handling on each side already.

   Extend the shared helper to accept a timeout (and, via **kwargs,
   any other ``requests`` parameter). Existing callers that don't
   pass extras are unaffected — the request still goes out with the
   long-standing unbounded behavior. Then have v2.py call
   ``get_balena_device_info(timeout=_BALENA_SUPERVISOR_TIMEOUT_S)``
   so URL construction lives in one place.

2. _resolve_node_ip()'s SETNX-debounce write wasn't wrapped in a
   redis.RedisError handler. The earlier fixup wrapped both r.get()
   and r.publish(), but a Redis flake between those two calls (right
   on the SETNX) would still 500 the splash poll. Wrap the SETNX
   too; on failure, fall through to '' just like the other Redis
   failure paths.

Tests updated:
- Existing Balena-supervisor tests now patch the shared helper
  (api.views.v2.get_balena_device_info) instead of mocking
  api.views.v2.requests.get directly. They still assert the timeout
  is bounded, which is the load-bearing property.
- New test_setnx_failure_returns_empty pins the SETNX-flake path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: address Copilot fifth-round review on PR #2806

Two issues flagged after the previous round:

1. Hardcoded URL in templates/splash-page.html. Other templates
   (e.g. login.html) use Django's ``{% url %}`` tag so a future API
   mount-path or version-prefix change doesn't silently break the
   page. Resolve via ``{% url 'api:network_ip_addresses_v2' %}`` and
   inject the resolved path into the JS as a string.

   The existing test_renders_with_polling_script assertion still
   holds — it checks for the literal ``/api/v2/network/ip-addresses``
   in the response body, which is what the URL tag resolves to.

2. ``_resolve_node_ip()`` treated a cached value of ``'[]'`` (no IPs
   found) as a cache hit. ``json.loads('[]')`` returns ``[]``,
   ``' '.join([])`` returns ``''``, and the resolver short-circuited
   without publishing a refresh. host_agent's first run on a
   still-coming-up network produces exactly that ``'[]'`` write —
   so the splash would never recover when networking came online
   during the display window. Every poll past that first write
   would short-circuit on the empty cached value.

   Fix: only treat a non-empty decoded list as a hit. Empty list
   (and malformed JSON) fall through to the debounced refresh
   publish path so host_agent gets nudged to retry.

Tests:
- New test_empty_list_in_cache_triggers_refresh pins the fix:
  cached ``'[]'`` must publish a refresh, not silently no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: avoid bare {% url %} in JS comment that Django parses

The previous fixup added a JS comment that referenced ``{% url %}``
literally to explain what the tag does. Django's template engine
doesn't know JS comments aren't templated — it scanned the comment,
saw a ``{% url %}`` block with no arguments, and raised
``'url' takes at least one argument, a URL pattern name`` at parse
time, which broke /splash-page rendering and the
test_renders_with_polling_script test.

Rephrase the comment to refer to "the reverser" instead of the tag
syntax. The actual ``{% url 'api:network_ip_addresses_v2' %}`` call
is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: kick off debounced refresh on cache hits too

Both Copilot comments flagged the same gap. The splash docstring
claims the page "updates if IPs change during the splash's display
window" (e.g. DHCP renewal mid-splash), but on bare-metal that's
only true at the moment Redis is empty. After the first successful
host_agent run populates the cache, ``_resolve_node_ip()`` returns
the cached value immediately and never publishes another
``set_ip_addresses`` — so the cached IPs would freeze for the rest
of the splash window even if the underlying network state changed.

Make the behavior match the docstring: extract the SETNX-debounced
publish into a ``_publish_refresh`` helper and call it from both the
cache-hit and cache-miss paths. The cache-hit path returns the
cached value immediately (no blocking on the publish) and the
refresh runs in the background so subsequent polls see the updated
data once host_agent re-populates Redis. The same SETNX TTL caps the
publish frequency at once per ``_IP_REFRESH_DEBOUNCE_S`` (12s) so a
tight poll loop can't queue redundant refresh requests.

Tests:
- The previous ``test_reads_from_redis_cache`` asserted the cache
  hit did NOT publish — drop that assertion (it pinned the wrong
  behavior). Renamed the surviving form to just verify the cached
  value is returned.
- New ``test_cache_hit_also_kicks_off_refresh`` pins the new
  behavior: cache hit returns immediately AND fires the debounced
  publish.
- Existing debounce tests keep working because the SETNX gate moved
  into the helper but the semantics didn't change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: port splash tests to pytest, enhance Redis fake for SETNX

The merge with origin/master pulled in #2801's pytest migration. The
splash tests in tests/test_splash_page.py were still unittest-style
TestCase classes (added on this branch after the migration target);
port them to pytest functions to match the rest of the suite.

Also enhance conftest.py's Redis fake (same enhancement as #2805):

- Replace the 2-arg ``r.set`` lambda with a ``_set`` that honors
  ``nx``/``ex`` kwargs and returns ``True`` / ``None`` matching real
  Redis semantics. ``_resolve_node_ip``'s SETNX-based debounce gate
  relies on this — without it, every "is the publish window open?"
  check would spuriously believe the gate was free.
- Fix ``r.delete`` to return an int (count of keys deleted) instead
  of a list of popped values.
- Add ``r.eval`` (no-op for unknown scripts; nothing on this branch
  uses it, but #2805 does and the conftest is shared).

Tests rewritten as pytest functions with ``@pytest.mark.django_db``
where needed and a ``bare_metal_no_pending`` fixture that clears the
per-test debounce key. Coverage unchanged: format_ip_urls (5
formatting + sentinel cases), bare-metal resolver (cache hit, cache
hit + refresh, cache miss + publish, malformed cache, empty-list
cache, Redis errors on get/set/publish, debounce), Balena resolver
(supervisor success / timeout / error status), endpoint full chain
(200 with IPs, 200 + [] on cache miss, unauthenticated), splash view
(renders without mocking, doesn't import get_node_ip, polling
script present).

All 45 tests pass under ``uv run pytest -m "not integration"``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: address Copilot seventh-round review on PR #2806

Three issues:

1. ``json.loads(raw)`` returning a non-list (e.g. corrupted cache, a
   different producer writing the same key) would fall through to
   ``' '.join(ips)`` — which crashes on int/dict and silently joins
   characters on a string. Validate ``ips`` is a non-empty list of
   strings before joining; otherwise treat as a malformed cache and
   fall through to refresh.

2. ``test_resolve_publish_failure_releases_debounce`` patched
   ``r.get`` to always return None and then asserted
   ``r.get(_IP_REFRESH_PENDING_KEY) is None`` — that post-condition
   assertion was made after the patch had been reverted (so it did
   call into the real fake), but the structure was confusing enough
   that Copilot read it as vacuous. Rewrite to assert directly on
   ``r.delete``: wrap the delete with ``mock.patch.object(...,
   wraps=...)`` and verify it was called once with the debounce key.
   The publish call is also asserted as called-once for completeness.

3. ``NetworkIpAddressesViewV2`` is unauth'd and triggers a side
   effect (``_publish_refresh`` → ``hostcmd:set_ip_addresses`` →
   host_agent's internet probe). Flag this as a known limitation in
   the docstring, mirroring the framing on AssetRecheckViewV2 in
   #2805 — the real fix is a shared internal-auth mechanism that
   cuts across both endpoints, designed alongside the broader auth
   rework. Existing mitigations (SETNX debounce, no new data
   disclosure, host_agent's own retry/throttle) bound blast radius
   in the meantime.

All 24 splash tests pass under ``uv run pytest -m 'not integration'``;
ruff + mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: empty commit to retrigger workflows

* fixup: correct stale docstring on debounce-release test

The docstring on test_resolve_publish_failure_releases_debounce
described an earlier (vacuous) form of the test — asserting via
``in_dict_after`` against the underlying fake store. The current
test asserts on ``r.delete`` calls via ``wraps=`` instead, which
is what the docstring should describe.

Updates the docstring to match the assertions actually performed
and to explain why ``wraps=`` is needed here (``r.get`` is patched
inside the block to force the cache-miss path, so reading the key
back after the block would pass for the wrong reason).

No production-code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup: fall back to MY_IP env var on bare-metal cache miss

``lib.utils.get_node_ip()`` falls back to ``getenv('MY_IP')`` when
the Redis ``ip_addresses`` key is empty/unset. ``MY_IP`` is set by
``bin/upgrade_containers.sh`` (the host's outbound IP) and exported
into the server container by ``docker-compose.yml.tmpl``.

The polling resolver added in this PR didn't carry that fallback —
so any setup where ``host_agent`` isn't populating Redis (custom
deploys, late-starting host_agent, crashed host_agent) would freeze
the splash on "Detecting network…" forever. Mirror the fallback
here so the splash can show *something* useful in those cases.

The Redis-error early-return at the top of the function is folded
into the same fallback path: rather than 500-ing out before we can
even consult ``MY_IP``, treat a Redis read failure as a cache miss
so the env-var fallback still applies.

Tests: clear ``MY_IP`` in the ``bare_metal_no_pending`` fixture to
keep the existing assertions deterministic on developer shells /
runners that may export it; add three new tests covering the
cache-miss, Redis-down, and unset-env paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 14:00:44 +01:00
..