DELETE /meta_instance/<id> crashes nodes with anyio.ClosedResourceError
when runner.shutdown() sends to an already-closed _cancel_sender channel.
The unhandled error tears down the TaskGroup and triggers a Rust/PyO3
destructor panic.
Fixes:
- runner_supervisor.py: Remove duplicate send/close on _cancel_sender
(lines 191-192 were a copy of 188-189), wrap in try/except
- worker/main.py: Guard both runner.shutdown() call sites (plan_step
Shutdown handler and Worker.run finally block) with
contextlib.suppress(ClosedResourceError)
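A minimal sketch of the guard (FakeRunner is a hypothetical stand-in, and a local ClosedResourceError substitutes for anyio's so the sketch runs without anyio installed):

```python
import asyncio
import contextlib

class ClosedResourceError(Exception):
    """Local stand-in for anyio.ClosedResourceError."""

class FakeRunner:
    """Hypothetical runner whose cancel channel is already closed."""
    async def shutdown(self) -> None:
        raise ClosedResourceError

async def safe_shutdown(runner) -> bool:
    # Suppress the error a redundant shutdown raises after the
    # _cancel_sender channel has been closed, instead of letting it
    # tear down the enclosing TaskGroup.
    with contextlib.suppress(ClosedResourceError):
        await runner.shutdown()
    return True

ok = asyncio.run(safe_shutdown(FakeRunner()))
```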
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, DeleteMetaInstance cascade-deleted backing instances without
cancelling their active tasks, leaving orphaned task references. Now emits
TaskStatusUpdated(Cancelled) for Pending/Running tasks before InstanceDeleted.
Also adds lifecycle logging for meta-instance operations, a GET /meta_instances
endpoint, and 2 regression tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. DERIVED REACTIVITY BUG: `unifiedDisplayItems` used `$derived(fn)` which
made the derived value the function itself instead of its result. Svelte
never tracked reactive dependencies in the function body, so the instance
list didn't update when metaInstances or instances changed. Fixed by using
`$derived.by(fn)` and removing the `()` call-sites in the template.
2. TAUTOLOGICAL CHECK: In `getMetaInstancePlacingStatus`, the `lastError`
   condition in the `lastError ? ... : null` ternary inside the
   `failures > 0` branch was always true, because `lastFailureError` and
   `consecutiveFailures` are always set together in
   `apply_instance_retrying` and `apply_instance_deleted`. Removed the
   dead `: null` branch.
Also fixes pyright errors in test file by using proper pytest.MonkeyPatch type.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ModelCard.load() does async I/O inside the 1-second reconcile loop. A slow
or failing load blocked all reconciliation (health checks, node timeouts,
other meta-instances). Adds a 10-second timeout, per-meta-instance error
handling with MetaInstancePlacementFailed events, and documents the
intentional early return in apply_instance_retrying.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The place_instance API endpoint used fire-and-forget: it sent the command
and returned HTTP 200 immediately. On a fresh cluster start, the master's
state often lacks topology/memory data, so placement raises ValueError
which was silently caught and logged. The caller never learned it failed.
Two fixes:
- API: validate placement locally before sending, return HTTP 400 on
failure instead of silently accepting an unprocessable command
- Master: emit MetaInstancePlacementFailed on immediate placement error
in CreateMetaInstance handler so the error surfaces in state right away
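The API-side fix can be sketched as follows (plan_placement and the dict-based state are simplified stand-ins for the real placement check):

```python
def plan_placement(state: dict, request: dict) -> dict:
    """Hypothetical local check mirroring the master's placement logic."""
    if not state.get("topology"):
        # Fresh cluster: no topology/memory data yet.
        raise ValueError("no topology data available for placement")
    return {"nodes": list(state["topology"])}

def handle_place_instance(state: dict, request: dict) -> tuple[int, str]:
    # Validate locally before sending the command, so the caller gets
    # HTTP 400 instead of a fire-and-forget 200 for a command the
    # master cannot process.
    try:
        plan_placement(state, request)
    except ValueError as exc:
        return 400, str(exc)
    # send_command(request) would go here
    return 200, "accepted"
```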
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The placement algorithm previously selected the smallest viable cycle,
causing large models to be distributed across too few nodes and running
out of memory. Changed get_smallest_cycles to get_largest_cycles so that
all healthy nodes are utilized, spreading layers more evenly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add TaskCancelled command and Cancelled task status
- Detect API client disconnects in master/api.py
- Handle TaskCancelled in master state machine
- Add _cancel_tasks to worker for graceful task cleanup
- Add cancel_receiver to runner for inference abort
- Add mx_any helper in MLX utils for cancellable operations
- Guard instance lookup in worker to prevent KeyError
- Update tests for cancellation flow
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On startup, _emit_existing_download_progress() used
downloaded_bytes_this_session to decide between DownloadPending and
DownloadOngoing. Since downloaded_bytes_this_session is always 0 on
startup (it tracks the current session only), fully-downloaded models
were incorrectly reported as DownloadPending.
Now checks actual disk state: if downloaded_bytes >= total_bytes, emit
DownloadCompleted regardless of session bytes. This fixes the UI showing
models as pending when they're already available.
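The corrected decision, as a sketch (event names are from this commit; the function itself is hypothetical):

```python
def startup_download_status(downloaded_bytes: int, total_bytes: int,
                            session_bytes: int) -> str:
    # Check actual disk state first: a fully-downloaded model must
    # report completed even though session_bytes is 0 on startup.
    if total_bytes > 0 and downloaded_bytes >= total_bytes:
        return "DownloadCompleted"
    if session_bytes > 0:
        return "DownloadOngoing"
    return "DownloadPending"
```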
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents RuntimeError when the context has already been set,
e.g. when Terminal.app reuses a tab or the process restarts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two race conditions existed in the meta-instance lifecycle:
1. CreateMetaInstance buffered MetaInstanceCreated but didn't apply it
before awaiting ModelCard.load(). The reconciler could interleave
during the await, leading to duplicate placements.
Fix: apply MetaInstanceCreated eagerly via _apply_and_broadcast,
then re-check satisfaction after the await so placement uses fresh
state and skips if the reconciler already handled it.
2. delete_meta_instance (API handler) sent DeleteMetaInstance then
read self.state.instances for cascade deletion. State was stale,
so backing instances created between the send and the read were
missed — permanently orphaning them.
Fix: move cascade delete into the command processor's
DeleteMetaInstance handler, where InstanceDeleted events are
generated atomically with MetaInstanceDeleted.
Reproduced on 4-node Mac Mini cluster: 28K anomalies in stress test
including 21 permanently orphaned instances. After fix, the cascade
delete and placement are race-free.
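The first fix's apply-then-recheck pattern, sketched with simplified stand-ins (State, the load callback, and the reconciler simulation are not the real classes):

```python
import asyncio

class State:
    def __init__(self) -> None:
        self.meta_instances: set[str] = set()
        self.placements: list[str] = []  # backing instances, by meta id

async def create_meta_instance(state: State, mid: str, load) -> None:
    # Apply MetaInstanceCreated eagerly so concurrent code sees it...
    state.meta_instances.add(mid)
    await load()  # the reconciler may interleave during this await
    # ...then re-check satisfaction with fresh state before placing,
    # skipping if the reconciler already created a backing instance.
    if mid in state.placements:
        return
    state.placements.append(mid)

state = State()

async def load_and_reconcile() -> None:
    # Simulate the reconciler placing the instance during the await.
    state.placements.append("m1")

asyncio.run(create_meta_instance(state, "m1", load_and_reconcile))
```

Without the re-check, the handler would append a second placement for "m1" — the duplicate-placement race described above.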
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TaggedModel's wrap validator re-enters validation in the Python context
even when the input came from JSON, which breaks strict-mode bytes
deserialization from JSON strings. Use the Base64Bytes type to
encode/decode bytes as base64 strings in JSON.
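The roundtrip Base64Bytes performs, sketched with the stdlib (these helper names are illustrative, not Pydantic API):

```python
import base64

def bytes_to_json(raw: bytes) -> str:
    # What Base64Bytes does on serialization: bytes -> base64 string,
    # which is a plain JSON string.
    return base64.b64encode(raw).decode("ascii")

def json_to_bytes(s: str) -> bytes:
    # ...and on validation: base64 string -> bytes, acceptable even in
    # strict mode because the JSON value is just a string.
    return base64.b64decode(s.encode("ascii"))

roundtrip = json_to_bytes(bytes_to_json(b"\x00\xffside-channel"))
```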
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Anonymous pipes from os.pipe() don't survive a multiprocessing.Process
spawn on macOS (spawn has been the default start method there since
Python 3.8). The FD numbers are passed, but the actual file descriptors
don't exist in the child process, causing EBADF errors.
Switch to named pipes (FIFOs) which the child opens by path in the
spawned process, getting valid FDs for the C++ SideChannel.
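The key property — a FIFO is opened by *path*, so each opener gets its own valid FD — in a runnable POSIX-only sketch (a thread stands in for the spawned child):

```python
import os
import tempfile
import threading

def fifo_roundtrip() -> bytes:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "side_channel")
        os.mkfifo(path)

        def writer() -> None:
            # Stands in for the parent; open() blocks until a reader opens.
            with open(path, "wb") as w:
                w.write(b"hello")

        t = threading.Thread(target=writer)
        t.start()
        # Stands in for the spawned child opening by path, which yields
        # a valid FD instead of an inherited number pointing at nothing.
        with open(path, "rb") as r:
            data = r.read()
        t.join()
    return data

data = fifo_roundtrip()
```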
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the fragile TCP SideChannel with anonymous pipes relayed through
exo's event-sourced control plane. RunnerSupervisor creates pipe pairs
for MlxJaccl instances and relays all_gather rounds via
JacclSideChannelData/JacclSideChannelGathered events through the master,
eliminating errno=57 crashes caused by Thunderbolt RDMA driver
instability.
Also includes dashboard RDMA warning improvements and instance retry fixes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- apply_instance_created no longer clears last_failure_error so the
error context persists while the new instance starts up
- Dashboard retryError shows the error without (N/3) prefix when
consecutiveFailures is 0 (instance was recreated)
- Jaccl warning tooltip now says "experimental RDMA driver in macOS"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detect errors containing "[jaccl]" in MetaInstance failure errors and
display a red dismissible alert banner. The tooltip explains this is a
macOS RDMA driver issue and that the affected machine needs to be
restarted. Alert re-appears if a new error arrives after dismissal.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When runners fail for a MetaInstance-backed Instance, retry up to 3
times by restarting runners within the same Instance rather than
deleting and recreating it each time. After 3 failures, delete the
Instance so MetaInstanceReconciler can create a fresh one.
- Add InstanceRetrying event that removes runners from state (signaling
workers to restart) and increments consecutive_failures on MetaInstance
- InstanceHealthReconciler emits InstanceRetrying when under retry limit,
InstanceDeleted when exhausted or no MetaInstance
- Worker _kill_runner detects retry signal (runner deleted from state +
terminal supervisor) and cleans up for _create_runner to recreate
- Worker _create_runner guards against oscillation by blocking creation
while any peer runner has explicit terminal status
- InstanceCreated resets consecutive_failures for fresh starts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move placement_error, consecutive_failures, last_failure_error, and
last_failure_at directly onto the MetaInstance model instead of keeping
them as separate State mappings (meta_instance_errors, InstanceFailureInfo,
meta_instance_failure_info). Adds a 5-second cooldown between retry attempts
to prevent rapid instance churn when runners fail instantly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each error in the combined message is now prefixed with the node's friendly
name (e.g. "MacBook Pro: OOM; Mac Studio: connection reset") so the root
cause node is easily identifiable.
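The prefixing boils down to something like this sketch (function and parameter names are hypothetical):

```python
def combine_errors(errors_by_node: dict[str, str],
                   friendly_names: dict[str, str]) -> str:
    # Prefix each error with the node's friendly name so the root-cause
    # node is identifiable in the combined message; fall back to the
    # raw node id when no friendly name is known.
    return "; ".join(
        f"{friendly_names.get(node, node)}: {msg}"
        for node, msg in errors_by_node.items()
    )
```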
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dashboard's `% 3` logic already handles displaying retry progress in
batches (RETRYING 1/3, 2/3, 3/3, then PLACING with error, repeat). No
need to permanently block placement after 3 failures.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When multiple runners fail, concatenate all error messages with "; " so the
real error isn't hidden by generic side-effect failures from other runners.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MetaInstanceReconciler now checks failure count before placement — after 3
consecutive failures it emits MetaInstancePlacementFailed instead of retrying
forever. Dashboard shows "Retrying after error: <msg>" in orange throughout
the retry cycle, not just during the brief window with no backing instance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extend InstanceDeleted with failure_error field for runner crash info
- Add InstanceFailureInfo model tracking consecutive failures per MetaInstance
- InstanceHealthReconciler now detects runner failures (all terminal with
at least one RunnerFailed) in addition to connection failures
- apply_instance_deleted increments failure counter for meta-bound instances
- Dashboard shows RETRYING (N/3) status with error messages, and
"Instance re-created due to failure" after 3 consecutive failures
- Extract and display RunnerFailed error messages in instance status
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
frozenset serializes to a JSON array but cannot be deserialized back
in strict mode through the TaggedModel wrap validator (list → frozenset
coercion is rejected). Changed to list[NodeId] since the model is
already frozen/immutable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dashboard now extracts node IDs from the selected preview's
memory_delta_by_node, ensuring the backend places on exactly the
nodes the user was shown. Also reverts incorrect RDMA min_nodes >= 2
enforcement since single-node RDMA is valid.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RDMA requires at least 2 nodes — a single-node RDMA instance is
nonsensical. Enforce this in both the dashboard (when building the
launch request) and the backend placement (when filtering cycles).
Previously, selecting RDMA would still place on 1 node because
min_nodes defaulted to 1 and the placement silently switched to Ring.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When user selects specific nodes via the filter, min_nodes should be at
least the number of filtered nodes to prevent placement from picking a
smaller cycle.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dashboard was not including the user's node filter in the POST to
/meta_instance, so placement ignored which nodes the user selected.
Also, placement silently fell back to Ring when RDMA was requested but
no RDMA-connected cycles were available — now raises an error that
surfaces via MetaInstancePlacementFailed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The mode="plain" validator bypassed Pydantic's string-to-enum coercion,
so JSON strings like "Tensor" and "MlxJaccl" from the dashboard failed
the isinstance check and silently fell back to Pipeline/MlxRing defaults.
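The coercion a plain-mode validator has to perform by hand, sketched with a stdlib enum (the enum here is illustrative, mirroring the dashboard's instance-type values):

```python
from enum import Enum

class InstanceType(Enum):
    MlxRing = "MlxRing"
    MlxJaccl = "MlxJaccl"

def coerce_instance_type(value: object) -> InstanceType:
    # Accept the enum itself or its string value, and fail loudly on
    # anything else rather than silently falling back to a default.
    if isinstance(value, InstanceType):
        return value
    if isinstance(value, str):
        return InstanceType(value)  # raises ValueError on unknown strings
    raise TypeError(f"cannot coerce {value!r} to InstanceType")

coerced = coerce_instance_type("MlxJaccl")
```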
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Show why MetaInstance placement fails instead of stuck "PLACING", and
show per-node runner status during loading for multi-node instances.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a MetaInstance has no backing instance yet, derive the strategy
display from the MetaInstance's own sharding and instanceMeta fields
rather than showing "Unknown (Unknown)".
Also clean up all stale MlxIbv references across the dashboard —
the backend enum is MlxJaccl.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace inline _plan() with ProcessManager loop (_reconcile), tick
every 1s instead of 10s — safe because all PMs are idempotent
- Fix dashboard sending "MlxIbv" instead of "MlxJaccl" for RDMA
instance type, which silently fell back to MlxRing default
- Remove all stale MlxIbv references from dashboard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace inline _plan() steps with a list of ProcessManagers, each
implementing async reconcile(State) -> Sequence[Event]. Tick every
1s instead of 10s — safe because all PMs are idempotent against state.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The separate MetaInstanceBound event + meta_instance_backing map
introduced two bugs: stale exclusion sets in the reconciler loop and
a delete ordering race. Embedding meta_instance_id directly on
BaseInstance eliminates the binding mechanism entirely — when an
instance is created for a MetaInstance it carries the ID, when
deleted the binding is gone. No separate map, no cleanup, no races.
Also fixes delete_meta_instance to cascade-delete backing instances.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add MetaInstanceBound event and meta_instance_backing State field
for explicit MetaInstance → Instance binding (prevents ambiguous
linking when two MetaInstances have identical constraints)
- Replace model_card: ModelCard with model_id: ModelId on MetaInstance
(load ModelCard on-demand at placement time)
- Add MetaInstance API endpoints (POST /meta_instance, DELETE)
- Update dashboard to use MetaInstances as primary primitive with
unified display items merging MetaInstances and orphan instances
- Dashboard launches via MetaInstance instead of direct Instance creation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces MetaInstance as a declarative constraint ensuring an instance
matching given parameters (model, sharding, min_nodes) always exists.
The master's reconciliation loop continuously checks for unsatisfied
meta-instances and attempts placement. Connection health checking
verifies that specific IPs (MlxRing) and RDMA interfaces (MlxJaccl)
stored on instances still exist as topology edges, enabling automatic
recovery when cables are swapped or interfaces change.
Also eliminates the master's loopback event path, unifying all event
emission through _apply_and_broadcast for simpler control flow.
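The core satisfaction check from the reconciliation loop, greatly simplified (real matching also verifies model, sharding, min_nodes, and connection health; the dict-based state is a stand-in):

```python
def unsatisfied_meta_instances(meta_instances: dict[str, dict],
                               instances: dict[str, dict]) -> list[str]:
    # A meta-instance is satisfied when some instance carries its id;
    # the loop attempts placement for every unsatisfied one each tick.
    backed = {
        inst.get("meta_instance_id")
        for inst in instances.values()
        if inst.get("meta_instance_id")
    }
    return [mid for mid in meta_instances if mid not in backed]

todo = unsatisfied_meta_instances(
    {"m1": {}, "m2": {}},
    {"i1": {"meta_instance_id": "m1"}},
)
```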
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>