Commit Graph

2136 Commits

Author SHA1 Message Date
Alex Cheema
08c94bc283 fix: apply nix fmt to node_timeout.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:24:23 -08:00
Alex Cheema
7913e1a03f fix: only show RDMA cable warning for directly-connected Thunderbolt pairs
The RDMA DEVICE UNHEALTHY warning was iterating over ALL topology edges
(Socket + RDMA) to find node pairs, showing "disconnect and reconnect
Thunderbolt 5 cable" for every pair including socket-only connections.
Nodes are not all-to-all connected via Thunderbolt.

Filter to only RDMA-tagged edges (sourceRdmaIface set) which represent
actual Thunderbolt 5 links. Only pairs with a direct Thunderbolt
connection AND an RDMA health error now get the cable warning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:24:23 -08:00
Alex Cheema
539be54edb fix: evict stale peer when node restarts with new ID (merge blocker)
When a node restarts quickly (before the 30s timeout), it gets a new
random peer ID. The master sees two nodes with the same friendly_name:
the dead old peer (with stale RunnerReady entries) and the live new
peer. The old instance appears healthy so no replacement is created,
and inference routes to dead runners.

Fix: NodeTimeoutReconciler now groups nodes by friendly_name and
force-evicts stale duplicates (older last_seen) immediately via
NodeTimedOut. This triggers the existing cleanup cascade: topology
removal → instance health failure → instance deleted → MetaInstance
re-placed on the live peer.
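
The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the actual NodeTimeoutReconciler: the `NodeInfo` record, its fields, and `find_stale_duplicates` are hypothetical names chosen for the example, and the real reconciler emits NodeTimedOut events rather than returning IDs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    # Hypothetical record; the real state model is not shown in this log.
    node_id: str
    friendly_name: str
    last_seen: float  # timestamp of the most recent heartbeat


def find_stale_duplicates(nodes: list[NodeInfo]) -> list[str]:
    """Return IDs of nodes that share a friendly_name with a fresher peer."""
    by_name: dict[str, list[NodeInfo]] = defaultdict(list)
    for node in nodes:
        by_name[node.friendly_name].append(node)

    stale: list[str] = []
    for group in by_name.values():
        if len(group) < 2:
            continue  # no duplicate name, nothing to evict
        newest = max(group, key=lambda n: n.last_seen)
        stale.extend(n.node_id for n in group if n is not newest)
    return stale


nodes = [
    NodeInfo("peer-old", "Mac Studio", last_seen=100.0),
    NodeInfo("peer-new", "Mac Studio", last_seen=130.0),
    NodeInfo("peer-x", "MacBook Pro", last_seen=125.0),
]
print(find_stale_duplicates(nodes))  # ['peer-old']
```

Each returned ID would then be evicted immediately instead of waiting for the 30s timeout.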

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:24:23 -08:00
Alex Cheema
483c52f3b1 fix: clean up orphaned runners when node times out
When a node restarts it gets a new peer ID (keypair is regenerated each
launch). The old peer ID lingers in state until the 30s timeout fires,
but its runners were never cleaned up — they accumulated as permanent
RunnerShutdown zombies in state.runners. Now apply_node_timed_out also
removes runners mapped to the timed-out node across all instances.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:24:23 -08:00
Alex Cheema
fbf9007591 fix: clean up stale runners from state when instance is deleted
apply_instance_deleted() previously only removed the instance from
state.instances, leaving its runner entries orphaned in state.runners
with their last known status (e.g. RunnerReady). After a node kill and
rejoin, readiness checks would see these stale entries and attempt
inference against dead runner processes, causing post-recovery failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:23:11 -08:00
Alex Cheema
1b68eb5be5 fix: shard redistribution blocked when pinned node dies (BUG-B)
When a MetaInstance was created with node_ids (from the placement
preview), try_place_for_meta_instance passed them as required_nodes
to place_instance, which demands ALL listed nodes exist in a topology
cycle. If a pinned node died and was removed from the topology, no
cycle could satisfy the constraint and placement failed permanently.

Fix: intersect node_ids with live topology nodes before passing as
required_nodes. Dead nodes are dropped from the constraint so
placement succeeds on the remaining N-1 nodes.
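
A minimal sketch of that intersection, assuming node IDs are plain strings and the live topology is available as a set. `live_required_nodes` is a hypothetical helper name; the commit only names try_place_for_meta_instance and place_instance.

```python
from typing import Optional


def live_required_nodes(
    node_ids: Optional[list[str]], topology_nodes: set[str]
) -> Optional[list[str]]:
    """Drop pinned nodes that are no longer part of the topology."""
    if node_ids is None:
        return None  # nothing was pinned, so there is no constraint to narrow
    # Keep the original order of the pinned nodes that are still alive.
    return [node_id for node_id in node_ids if node_id in topology_nodes]


pinned = ["node-a", "node-b", "node-c"]
alive = {"node-a", "node-c", "node-d"}
print(live_required_nodes(pinned, alive))  # ['node-a', 'node-c']
```

What should happen when every pinned node is dead is not stated in the commit; this sketch simply returns an empty list in that case.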

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:23:11 -08:00
Alex Cheema
afaf778ba6 fix: RDMA warning uses all edges to find peers, not just RDMA-tagged ones
Root cause: when RDMA is broken (ibv_alloc_pd failure), the
RDMAConnection edges are often absent from the topology — only
SocketConnection edges remain. The old code filtered to RDMA-tagged
edges only, so unhealthy nodes with no RDMA edges fell through to the
"on device X" solo fallback.

Fix: use ALL topology edges to find connected peers. Any direct
connection between nodes represents a Thunderbolt cable to re-seat.
Remove the solo fallback and the {#if pair.nodeB} template branch
entirely — warnings always say "between A and B".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:23:11 -08:00
Alex Cheema
59e44df71b fix: matchesSelectedRuntime tautology and unifiedDisplayItems derived pattern
matchesSelectedRuntime checked (MlxJaccl || MlxJaccl) in the else
branch — always true for MlxJaccl, making instance type filtering
a no-op. Simplified to direct equality: runtime === selectedInstanceType.

unifiedDisplayItems used $derived(() => ...) which makes the derived
value a function rather than the computed array. Changed to
$derived.by(() => ...) so the value is the array itself, and removed
the () call syntax from template references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:23:11 -08:00
Alex Cheema
cd7799cd9a fix: RDMA dashboard warning shows 'between A and B' for all unhealthy pairs
When both nodes in an RDMA pair were unhealthy, the second node's pair
was already in seenPairs causing foundPeer to stay false, which made
it fall through to the solo "on device X" warning instead of the
correct "between A and B" message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:23:11 -08:00
Alex Cheema
d9f82034b5 Add proactive RDMA device health detection with dashboard warning
Periodically probe ibv_alloc_pd() via ctypes on macOS to detect the
Thunderbolt XDomainLink boot-time initialization bug before it crashes
JACCL instance creation. The dashboard shows a red "RDMA DEVICE
UNHEALTHY" warning with the affected cable endpoints by friendly name
and a recommendation to replug the Thunderbolt 5 cable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:23:11 -08:00
Alex Cheema
5ea3778674 fix: use force=True for multiprocessing set_start_method
Prevents RuntimeError when the context has already been set,
e.g. when Terminal.app reuses a tab or the process restarts.
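
The behaviour can be reproduced in a few lines. The `"spawn"` method used here is an assumption for illustration; the commit does not say which start method exo selects.

```python
import multiprocessing

# The first call fixes the start method for this interpreter.
multiprocessing.set_start_method("spawn", force=True)

# A second call without force=True raises RuntimeError because the
# context has already been set.
try:
    multiprocessing.set_start_method("spawn")
    second_call_raised = False
except RuntimeError:
    second_call_raised = True

# With force=True the same call succeeds.
multiprocessing.set_start_method("spawn", force=True)
print(second_call_raised, multiprocessing.get_start_method())
```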

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:51 -08:00
Alex Cheema
a9b835d481 fix: eliminate command/reconciler interleaving race in meta-instance
Two race conditions existed in the meta-instance lifecycle:

1. CreateMetaInstance buffered MetaInstanceCreated but didn't apply it
   before awaiting ModelCard.load(). The reconciler could interleave
   during the await, leading to duplicate placements.

   Fix: apply MetaInstanceCreated eagerly via _apply_and_broadcast,
   then re-check satisfaction after the await so placement uses fresh
   state and skips if the reconciler already handled it.

2. delete_meta_instance (API handler) sent DeleteMetaInstance then
   read self.state.instances for cascade deletion. State was stale,
   so backing instances created between the send and the read were
   missed — permanently orphaning them.

   Fix: move cascade delete into the command processor's
   DeleteMetaInstance handler, where InstanceDeleted events are
   generated atomically with MetaInstanceDeleted.

Reproduced on 4-node Mac Mini cluster: 28K anomalies in stress test
including 21 permanently orphaned instances. After fix, the cascade
delete and placement are race-free.
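
The first fix (apply eagerly, then re-check after the await) can be illustrated with a toy asyncio model. Everything here is a simplified stand-in: the `Master` class, its methods, and the sleep that plays the role of ModelCard.load() are assumptions for the sketch, not exo's real code.

```python
import asyncio


class Master:
    def __init__(self) -> None:
        self.meta_instances: set[str] = set()
        self.placed: list[str] = []  # meta-instance IDs that have a backing instance

    def is_satisfied(self, meta_id: str) -> bool:
        return meta_id in self.placed

    async def load_model_card(self) -> None:
        await asyncio.sleep(0.01)  # stand-in for the awaited model card load

    async def create_meta_instance(self, meta_id: str) -> None:
        self.meta_instances.add(meta_id)  # apply eagerly, before any await
        await self.load_model_card()
        if not self.is_satisfied(meta_id):  # re-check against fresh state
            self.placed.append(meta_id)

    async def reconcile_once(self) -> None:
        for meta_id in sorted(self.meta_instances):
            if not self.is_satisfied(meta_id):
                self.placed.append(meta_id)


async def demo() -> list[str]:
    master = Master()
    command = asyncio.create_task(master.create_meta_instance("meta-1"))
    await asyncio.sleep(0)  # let the command run up to its await
    await master.reconcile_once()  # the reconciler interleaves here
    await command
    return master.placed


placed = asyncio.run(demo())
print(placed)  # ['meta-1']
```

Without the re-check after the await, the command would place a second instance for `meta-1` even though the reconciler had already handled it.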

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:51 -08:00
Alex Cheema
7feb5045e2 test: add 25 edge-case tests for MetaInstance lifecycle
Cover retry logic, error handling, backward compatibility,
concurrent scenarios, placement error tracking, and serialization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:51 -08:00
Alex Cheema
64beeb58d8 Fix JACCL SideChannel bytes serialization for JSON round-trip
TaggedModel's wrap validator converts JSON→Python validation context,
which breaks strict-mode bytes deserialization from JSON strings.
Use Base64Bytes type to encode/decode bytes as base64 strings in JSON.
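
The underlying idea, bytes carried through JSON as a base64 string, can be shown with the standard library alone. This sketch deliberately does not use the Base64Bytes type or TaggedModel; the payload and field name are made up for the example.

```python
import base64
import json

payload = b"\x00\xffside-channel"

# Raw bytes are not JSON serializable with the standard library encoder.
try:
    json.dumps({"data": payload})
    raw_bytes_ok = True
except TypeError:
    raw_bytes_ok = False

# Encoding to base64 text first gives a value that survives the round trip.
wire = json.dumps({"data": base64.b64encode(payload).decode("ascii")})
restored = base64.b64decode(json.loads(wire)["data"])

print(raw_bytes_ok, restored == payload)  # False True
```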

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:51 -08:00
Alex Cheema
3238157834 Use named pipes (FIFOs) for JACCL SideChannel relay
Anonymous pipes from os.pipe() don't survive multiprocessing.Process
spawn on macOS (default since Python 3.8). The FD numbers are passed
but the actual file descriptors don't exist in the child process,
causing EBADF errors.

Switch to named pipes (FIFOs) which the child opens by path in the
spawned process, getting valid FDs for the C++ SideChannel.
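
A small POSIX-only sketch of the open-by-path pattern. It assumes a fresh child interpreter started with `subprocess` as a stand-in for a spawned multiprocessing.Process, and the FIFO name and message are illustrative.

```python
import os
import subprocess
import sys
import tempfile

# Code run by the child: it receives only a path, never an inherited FD.
CHILD_CODE = """
import sys
with open(sys.argv[1], "wb") as fifo:
    fifo.write(b"hello from child")
"""


def relay_once() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "side_channel.fifo")
        os.mkfifo(path)  # named pipe that is visible on the filesystem
        child = subprocess.Popen([sys.executable, "-c", CHILD_CODE, path])
        # Opening for read blocks until the child opens the FIFO for writing.
        with open(path, "rb") as fifo:
            data = fifo.read()  # returns once the child closes its end
        child.wait()
        return data


received = relay_once()
print(received)
```

Because the child opens the FIFO itself, the descriptor it gets is valid in its own process, which is the property the commit relies on.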

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:51 -08:00
Alex Cheema
9859ebd67a Add pipe-based JACCL SideChannel relay via exo control plane
Replace fragile TCP SideChannel with anonymous pipes relayed through
exo's event-sourced control plane. RunnerSupervisor creates pipe pairs
for MlxJaccl instances, relays all_gather rounds via JacclSideChannelData/
JacclSideChannelGathered events through the master, eliminating errno=57
crashes from Thunderbolt RDMA driver instability.

Also includes dashboard RDMA warning improvements and instance retry fixes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:51 -08:00
Alex Cheema
9298bab09e Preserve last_failure_error across instance recreation, fix RDMA banner wording
- apply_instance_created no longer clears last_failure_error so the
  error context persists while the new instance starts up
- Dashboard retryError shows the error without (N/3) prefix when
  consecutiveFailures is 0 (instance was recreated)
- Jaccl warning tooltip now says "experimental RDMA driver in macOS"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:27 -08:00
Alex Cheema
f4b7f376b0 chore: remove temporary screenshot files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:27 -08:00
Alex Cheema
6c7c1079e5 temp: add jaccl warning screenshots for PR comment
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:27 -08:00
Alex Cheema
c2a356014c dashboard: show warning banner for [jaccl] RDMA driver errors
Detect errors containing "[jaccl]" in MetaInstance failure errors and
display a red dismissible alert banner. The tooltip explains this is a
macOS RDMA driver issue and that the affected machine needs to be
restarted. Alert re-appears if a new error arrives after dismissal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:27 -08:00
Alex Cheema
4f538e553a Retry runners within the same Instance instead of recreating
When runners fail for a MetaInstance-backed Instance, retry up to 3
times by restarting runners within the same Instance rather than
deleting and recreating it each time. After 3 failures, delete the
Instance so MetaInstanceReconciler can create a fresh one.

- Add InstanceRetrying event that removes runners from state (signaling
  workers to restart) and increments consecutive_failures on MetaInstance
- InstanceHealthReconciler emits InstanceRetrying when under retry limit,
  InstanceDeleted when exhausted or no MetaInstance
- Worker _kill_runner detects retry signal (runner deleted from state +
  terminal supervisor) and cleans up for _create_runner to recreate
- Worker _create_runner guards against oscillation by blocking creation
  while any peer runner has explicit terminal status
- InstanceCreated resets consecutive_failures for fresh starts
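
The retry-or-delete decision in the list above might look roughly like this. The function name, the string return values, and the exact comparison against the limit are assumptions for illustration; the commit only states that retries happen while under the limit and deletion happens when retries are exhausted or there is no MetaInstance.

```python
from typing import Optional

MAX_RETRIES = 3  # limit taken from the commit message


def next_event(consecutive_failures: int, meta_instance_id: Optional[str]) -> str:
    """Pick the event name to emit after a runner failure."""
    if meta_instance_id is not None and consecutive_failures < MAX_RETRIES:
        return "InstanceRetrying"  # restart runners inside the same Instance
    return "InstanceDeleted"  # exhausted, or not backed by a MetaInstance


for failures in range(4):
    print(failures, next_event(failures, "meta-1"))
```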

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:27 -08:00
Alex Cheema
1e5e626725 Remove timestamp-based retry cooldown
Remove last_failure_at field and RETRY_COOLDOWN_SECONDS logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
a171c96c41 Consolidate failure state onto MetaInstance, add 5s retry cooldown
Move placement_error, consecutive_failures, last_failure_error, and
last_failure_at directly onto the MetaInstance model instead of keeping
them as separate State mappings (meta_instance_errors, InstanceFailureInfo,
meta_instance_failure_info). Adds a 5-second cooldown between retry attempts
to prevent rapid instance churn when runners fail instantly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
229fb29514 Show retry attempt count with error message, e.g. (2/3)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
89f0a4a69d Include node friendly names in runner error messages
Each error in the combined message is now prefixed with the node's friendly
name (e.g. "MacBook Pro: OOM; Mac Studio: connection reset") so the root
cause node is easily identifiable.
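
A minimal sketch of the message format, assuming the errors arrive as (friendly name, message) pairs. `combine_runner_errors` is a hypothetical helper name.

```python
def combine_runner_errors(errors: list[tuple[str, str]]) -> str:
    """Join (friendly_name, message) pairs into one readable string."""
    return "; ".join(f"{name}: {message}" for name, message in errors)


combined = combine_runner_errors(
    [("MacBook Pro", "OOM"), ("Mac Studio", "connection reset")]
)
print(combined)  # MacBook Pro: OOM; Mac Studio: connection reset
```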

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
a896ecca84 Remove permanent retry blocking, allow continuous retry batches
The dashboard % 3 logic already handles displaying retry progress in batches
(RETRYING 1/3, 2/3, 3/3, then PLACING with error, repeat). No need to
permanently block placement after 3 failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
49ada3821f Show retry count in exceeded retry limit message (3/3)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
a6f7131fe0 Collect all runner error messages instead of just the last one
When multiple runners fail, concatenate all error messages with "; " so the
real error isn't hidden by generic side-effect failures from other runners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
617d94ebc0 Stop infinite retries after 3 failures, show errors persistently in dashboard
MetaInstanceReconciler now checks failure count before placement — after 3
consecutive failures it emits MetaInstancePlacementFailed instead of retrying
forever. Dashboard shows "Retrying after error: <msg>" in orange throughout
the retry cycle, not just during the brief window with no backing instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
e20af70176 Add instance retry logic with max 3 retries and failure tracking
- Extend InstanceDeleted with failure_error field for runner crash info
- Add InstanceFailureInfo model tracking consecutive failures per MetaInstance
- InstanceHealthReconciler now detects runner failures (all terminal with
  at least one RunnerFailed) in addition to connection failures
- apply_instance_deleted increments failure counter for meta-bound instances
- Dashboard shows RETRYING (N/3) status with error messages, and
  "Instance re-created due to failure" after 3 consecutive failures
- Extract and display RunnerFailed error messages in instance status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
46219fcd07 Fix MetaInstance.node_ids frozenset failing JSON deserialization
frozenset serializes to a JSON array but cannot be deserialized back
in strict mode through the TaggedModel wrap validator (list → frozenset
coercion is rejected). Changed to list[NodeId] since the model is
already frozen/immutable.
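
The asymmetry can be seen with the standard library alone. The node IDs are made up, and `sorted()` stands in for the array serialization, since the stdlib encoder does not accept a frozenset directly.

```python
import json

node_ids = frozenset({"node-a", "node-b"})

wire = json.dumps(sorted(node_ids))  # the set goes out as a JSON array
restored = json.loads(wire)  # and comes back as a plain list

print(wire)  # ["node-a", "node-b"]
print(type(restored).__name__)  # list
print(restored == node_ids)  # False
```

A validator that refuses to coerce list to frozenset will therefore reject its own serialized output, which is why storing a list avoids the problem.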

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
37c5e878b9 Send node_ids from placement preview when launching instances
The dashboard now extracts node IDs from the selected preview's
memory_delta_by_node, ensuring the backend places on exactly the
nodes the user was shown. Also reverts incorrect RDMA min_nodes >= 2
enforcement since single-node RDMA is valid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
4d8ef4f462 Enforce min_nodes >= 2 for RDMA (MlxJaccl) instances
RDMA requires at least 2 nodes — a single-node RDMA instance is
nonsensical. Enforce this in both the dashboard (when building the
launch request) and the backend placement (when filtering cycles).
Previously, selecting RDMA would still place on 1 node because
min_nodes defaulted to 1 and the placement silently switched to Ring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
77afead0ef Ensure min_nodes >= node filter size when launching
When user selects specific nodes via the filter, min_nodes should be at
least the number of filtered nodes to prevent placement from picking a
smaller cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
959865f713 Send node_ids from dashboard, error on RDMA when unavailable
Dashboard was not including the user's node filter in the POST to
/meta_instance, so placement ignored which nodes the user selected.
Also, placement silently fell back to Ring when RDMA was requested but
no RDMA-connected cycles were available — now raises an error that
surfaces via MetaInstancePlacementFailed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
4317cf2bfb Fix use_default validator silently ignoring sharding/instance_meta
The mode="plain" validator bypassed Pydantic's string-to-enum coercion,
so JSON strings like "Tensor" and "MlxJaccl" from the dashboard failed
the isinstance check and silently fell back to Pipeline/MlxRing defaults.
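
The failure mode can be reproduced with a plain Enum. The `Sharding` enum here is a reduced stand-in with only the two members named in this log, and both functions are hypothetical simplifications of the validator, not its actual code.

```python
from enum import Enum
from typing import Optional, Union


class Sharding(Enum):
    Pipeline = "Pipeline"
    Tensor = "Tensor"


def buggy_use_default(value: object) -> Sharding:
    # A JSON string is never an Enum instance, so it always hits the default.
    if isinstance(value, Sharding):
        return value
    return Sharding.Pipeline


def coercing_use_default(value: Union[Sharding, str, None]) -> Sharding:
    # Only a missing value falls back; strings are coerced to the enum,
    # and an unknown string raises ValueError instead of being hidden.
    if value is None:
        return Sharding.Pipeline
    return Sharding(value)


print(buggy_use_default("Tensor"))  # Sharding.Pipeline
print(coercing_use_default("Tensor"))  # Sharding.Tensor
```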

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
848b5139f5 Add placement error feedback and per-node loading status
Show why MetaInstance placement fails instead of stuck "PLACING", and
show per-node runner status during loading for multi-node instances.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
72d9574fe8 Show MetaInstance sharding/type while PLACING, fix MlxIbv references
When a MetaInstance has no backing instance yet, derive the strategy
display from the MetaInstance's own sharding and instanceMeta fields
rather than showing "Unknown (Unknown)".

Also clean up all stale MlxIbv references across the dashboard —
the backend enum is MlxJaccl.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
14ff11eec2 Extract reconciler into ProcessManager protocol, fix RDMA instance type
- Replace inline _plan() with ProcessManager loop (_reconcile), tick
  every 1s instead of 10s — safe because all PMs are idempotent
- Fix dashboard sending "MlxIbv" instead of "MlxJaccl" for RDMA
  instance type, which silently fell back to MlxRing default
- Remove all stale MlxIbv references from dashboard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
4e9cbd09b0 Extract reconciler into ProcessManager protocol
Replace inline _plan() steps with a list of ProcessManagers, each
implementing async reconcile(State) -> Sequence[Event]. Tick every
1s instead of 10s — safe because all PMs are idempotent against state.
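
A rough sketch of what such a protocol could look like. Only the `reconcile(State) -> Sequence[Event]` signature comes from the commit; the `State` and `Event` shapes, the two example managers, and `reconcile_once` are assumptions for illustration.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol, Sequence


@dataclass
class State:
    unsatisfied: list[str] = field(default_factory=list)  # hypothetical field


@dataclass(frozen=True)
class Event:
    name: str


class ProcessManager(Protocol):
    async def reconcile(self, state: State) -> Sequence[Event]: ...


class PlacementManager:
    async def reconcile(self, state: State) -> Sequence[Event]:
        # One event per unsatisfied item; the same state yields the same events.
        return [Event(f"place:{meta_id}") for meta_id in state.unsatisfied]


class NoopManager:
    async def reconcile(self, state: State) -> Sequence[Event]:
        return []


async def reconcile_once(
    managers: Sequence[ProcessManager], state: State
) -> list[Event]:
    events: list[Event] = []
    for manager in managers:
        events.extend(await manager.reconcile(state))
    return events


events = asyncio.run(
    reconcile_once([PlacementManager(), NoopManager()], State(["meta-1"]))
)
print(events)
```

A master loop would call something like `reconcile_once` on every tick and apply the returned events, so a shorter tick interval is safe as long as each manager stays idempotent against state.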

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
46bc69d65d Simplify MetaInstance binding: put meta_instance_id on Instance
The separate MetaInstanceBound event + meta_instance_backing map
introduced two bugs: stale exclusion sets in the reconciler loop and
a delete ordering race. Embedding meta_instance_id directly on
BaseInstance eliminates the binding mechanism entirely — when an
instance is created for a MetaInstance it carries the ID, when
deleted the binding is gone. No separate map, no cleanup, no races.

Also fixes delete_meta_instance to cascade-delete backing instances.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
9c49ea3a89 Add explicit MetaInstance binding, slim MetaInstance to use ModelId
- Add MetaInstanceBound event and meta_instance_backing State field
  for explicit MetaInstance → Instance binding (prevents ambiguous
  linking when two MetaInstances have identical constraints)
- Replace model_card: ModelCard with model_id: ModelId on MetaInstance
  (load ModelCard on-demand at placement time)
- Add MetaInstance API endpoints (POST /meta_instance, DELETE)
- Update dashboard to use MetaInstances as primary primitive with
  unified display items merging MetaInstances and orphan instances
- Dashboard launches via MetaInstance instead of direct Instance creation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:22:07 -08:00
Alex Cheema
26368f9837 Add MetaInstance declarative layer with connection health checking
Introduces MetaInstance as a declarative constraint ensuring an instance
matching given parameters (model, sharding, min_nodes) always exists.
The master's reconciliation loop continuously checks for unsatisfied
meta-instances and attempts placement. Connection health checking
verifies that specific IPs (MlxRing) and RDMA interfaces (MlxJaccl)
stored on instances still exist as topology edges, enabling automatic
recovery when cables are swapped or interfaces change.

Also eliminates the master's loopback event path, unifying all event
emission through _apply_and_broadcast for simpler control flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 10:19:09 -08:00
Evan Quiney
eccc6298d1 Revert "Add MetaInstance declarative layer (#1447)"
This reverts commit a962a28afc.
2026-02-17 18:11:47 +00:00
Evan Quiney
c8997217cf Revert "feat: better onboarding UX for new users (#1479)"
This reverts commit 490d2e46ba.
2026-02-17 18:02:32 +00:00
Alex Cheema
490d2e46ba feat: better onboarding UX for new users (#1479)
## Summary

- **Auto-open dashboard** in browser on first launch (uses
`~/.exo/.dashboard_opened` marker)
- **Welcome overlay** with "Choose a Model" CTA button when no model
instance is running
- **Tutorial progress messages** during model download → loading → ready
lifecycle stages
- **Fix conversation sidebar** text contrast — bumped to white text,
added active state background
- **Simplify technical jargon** — sharding/instance type/min nodes
hidden behind collapsible "Advanced Options" toggle; strategy display
hidden behind debug mode
- **Polished DMG installer** with drag-to-Applications layout, custom
branded background, and AppleScript-configured window positioning

## Test plan

- [ ] Launch exo for the first time (delete `~/.exo/.dashboard_opened`
to simulate) — browser should auto-open
- [ ] Verify welcome overlay appears on topology when no model is loaded
- [ ] Launch a model and verify download/loading/ready messages appear
in instance cards
- [ ] Check conversation sidebar text is readable (white on dark, yellow
when active)
- [ ] Verify "Advanced Options" toggle hides/shows sharding controls
- [ ] Build DMG with `packaging/dmg/create-dmg.sh` and verify
drag-to-Applications layout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 17:52:49 +00:00
rltakashige
facf2d4d03 Use custom fork that resolves GPU locks (#1489)
## Motivation

On Macs, an explicit synchronization is needed before memory updates
become visible from L1 cache. Without it, a spin wait may never see the
updated timestamp, which causes a GPU lock.

## Changes

Updated in my own personal fork.

## Why It Works

https://github.com/ARM-software/acle/releases

## Test Plan

### Manual Testing
Tested manually that no GPU locks occur (even with multiple simultaneous
instances running) and that the performance differential is negligible
(267 vs 269 tps on Llama 3.2 1B at an approx 10k context.)


------------------------------------------------------
I have seen a GPU lock, specifically when sending a particularly large
chat completion while the model was loading. However, I have since been
unable to reproduce and this may be something I did wrong. Please do
create an issue and tag me if any GPU locks do occur.

---------

Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 17:48:43 +00:00
Alex Cheema
a962a28afc Add MetaInstance declarative layer (#1447)
## Motivation

Users currently manage instances directly, which means if a node
disconnects or connections break, the instance dies and nothing
recreates it. MetaInstance is a declarative primitive: "ensure an
instance matching these parameters always exists." The reconciler
watches for unhealthy or missing backing instances and re-places them
automatically.

## Changes

- **MetaInstance type** (`meta_instance.py`): declarative constraint
with `model_id`, `min_nodes`, optional `node_ids`, and `sharding`
- **Reconciler** (`reconcile.py`): `find_unsatisfied_meta_instances`
checks which MetaInstances lack a healthy backing instance,
`try_place_for_meta_instance` creates one
- **Master loop** (`main.py`): periodically reconciles unsatisfied
MetaInstances; immediate placement on `CreateMetaInstance` command
- **API** (`api.py`): `create_meta_instance` / `delete_meta_instance` /
`GET /meta_instances` endpoints; delete cascades to backing instances
with task cancellation
- **Binding via `meta_instance_id` on Instance** (`instances.py`): no
separate binding event or backing map — the instance carries its parent
MetaInstance ID directly, eliminating race conditions in the reconciler
- **Dashboard**: sidebar shows MetaInstances with their backing instance
status; orphan instances (created directly) still shown separately
- **Tests**: constraint matching, connection health, unsatisfied
detection, exclusive binding, cascade delete with task cancellation

### Recent improvements

- **fix: cancel active tasks on cascade delete** — `DeleteMetaInstance`
now emits `TaskStatusUpdated(Cancelled)` for any Pending/Running tasks
on backing instances before emitting `InstanceDeleted`. Previously,
cascade-deleting backing instances left orphaned task references in
state.
- **Lifecycle logging** — added `logger.info`/`logger.warning` for:
`CreateMetaInstance` (model, min_nodes, sharding), `DeleteMetaInstance`
(with cascade count), reconciler placement success/failure, and retry
decisions with attempt counts in `InstanceHealthReconciler`.
- **GET `/meta_instances` endpoint** — lists all meta-instances without
needing to fetch full state.
- **2 regression tests** — `test_cascade_delete_cancels_active_tasks`
and `test_cascade_delete_skips_completed_tasks` verify the
cascade-delete event sequence.

## Why It Works

Putting `meta_instance_id` on `BaseInstance` makes binding inherent to
instance creation. When the reconciler creates an instance for a
MetaInstance, it tags it via `model_copy`. When the instance is deleted,
the binding disappears with it. This avoids the two bugs that a separate
binding mechanism would introduce:
1. Stale exclusion sets — the reconciler loop can't accidentally bind
two MetaInstances to the same instance
2. Delete ordering race — no window between deleting an instance and its
binding where the reconciler could re-place

## Test Plan

### Manual Testing
- Created MetaInstance via dashboard, verified instance placed
- Verified delete cascades (deleting MetaInstance removes backing
instance)
- Verified orphan instances still work independently

### Automated Testing
- 30 tests in `test_meta_instance_edge_cases.py`: lifecycle, retry
logic, error handling, concurrent operations, cascade delete with task
cancellation
- 24 tests in `test_reconcile.py`: constraint matching, connection
health (single/multi-node, edge removal, IP changes), unsatisfied
detection, exclusive binding, idempotency
- All 261 tests pass
- basedpyright 0 errors, ruff clean, dashboard builds

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 09:48:19 -08:00
Alex Cheema
db79c350c1 Fix graceful process shutdown in macOS app (#1372)
## Motivation

Fixes #1370

When the macOS app stops exo, GPU/system memory isn't released. This
happens because:

1. The macOS app calls `process.terminate()` (SIGTERM) but the Python
process only registers a graceful shutdown handler for SIGINT, not
SIGTERM. With no SIGTERM handler installed, the signal's default action
ends the process immediately, which bypasses the cleanup cascade (runner
subprocess MLX cleanup via `mx.clear_cache()`, channel closing, etc.).
2. The app doesn't wait for the process to actually finish cleanup — it
immediately nils out the process reference.

## Changes

**`src/exo/main.py`**: Register SIGTERM handler alongside SIGINT so the
graceful shutdown cascade (`Node.shutdown()` → cancel task group →
worker/runner cleanup → `mx.clear_cache()` + `gc.collect()`) runs
regardless of which signal is received.

**`app/EXO/EXO/ExoProcessController.swift`**: Replace immediate
`process.terminate()` with escalating shutdown per @Evanev7's
suggestion:
1. Send SIGINT via `process.interrupt()` — triggers the registered
Python handler for graceful cleanup
2. Wait up to 5 seconds for the process to exit
3. If still running, escalate to SIGTERM via `process.terminate()`
4. Wait up to 3 seconds
5. If still running, force kill via SIGKILL

The escalation runs in a detached `Task` so the UI updates immediately
(status → stopped) without blocking.
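
The escalation order above, expressed as a Python analogue for illustration. The real change lives in Swift inside `ExoProcessController.swift`; `stop_escalating` and its parameters are hypothetical, and only the SIGINT, SIGTERM, SIGKILL order and the 5s and 3s waits come from the description.

```python
import signal
import subprocess
import sys


def stop_escalating(
    proc: subprocess.Popen, int_timeout: float = 5.0, term_timeout: float = 3.0
) -> str:
    """Stop proc gently first; return the name of the signal that ended the wait."""
    steps = [(signal.SIGINT, int_timeout), (signal.SIGTERM, term_timeout)]
    for sig, timeout in steps:
        proc.send_signal(sig)
        try:
            proc.wait(timeout=timeout)
            return sig.name
        except subprocess.TimeoutExpired:
            continue  # still running, so escalate to the next signal
    proc.kill()  # SIGKILL cannot be caught or ignored
    proc.wait()
    return "SIGKILL"


child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(stop_escalating(child, int_timeout=2.0, term_timeout=2.0))
```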

## Why It Works

The root cause is that SIGTERM wasn't triggering the graceful shutdown
path. By registering a SIGTERM handler in Python and sending SIGINT
first from the macOS app, the process gets a chance to run the full
cleanup cascade: cancelling the task group, shutting down runners (which
call `del model; mx.clear_cache(); gc.collect()`), closing channels, and
flushing logs. The escalation to SIGTERM and SIGKILL ensures the process
always terminates even if graceful shutdown hangs.
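
On the Python side, registering one handler for both signals can be sketched as below. This is a simplified stand-in: the handler only records the signal name, whereas the real handler in `src/exo/main.py` is described as starting the shutdown cascade, and its actual registration mechanism is not shown in this log.

```python
import signal

received: list[str] = []


def request_shutdown(signum: int, frame: object) -> None:
    # Record the signal; a real handler would start the cleanup cascade.
    received.append(signal.Signals(signum).name)


# Install the same handler for both signals and remember the old ones.
previous = {
    sig: signal.signal(sig, request_shutdown)
    for sig in (signal.SIGINT, signal.SIGTERM)
}

# Deliver SIGTERM to this process; the handler runs instead of the default action.
signal.raise_signal(signal.SIGTERM)

# Restore the earlier handlers so the sketch has no lasting side effects.
for sig, handler in previous.items():
    signal.signal(sig, handler)

print(received)
```

Note that `signal.signal` can only be called from the main thread of the main interpreter.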

## Test Plan

### Manual Testing
<!-- Hardware: Mac Studio M4 Max 128GB -->
- Start exo via macOS app, load a model, run inference
- Stop via the toggle switch, verify memory is released without
requiring a system restart
- Test rapid stop/start (restart) to ensure no race conditions

### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — no changes

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan Quiney <evanev7@gmail.com>
2026-02-17 09:03:54 -08:00
Alex Cheema
d6301ed593 dashboard: redesign downloads page as model×node table (#1465)
## Motivation

The current downloads page uses a node-centric card grid layout that is
messy and hard to read — the same model across different nodes appears
in separate cards, and deep nesting wastes space. This makes it
difficult to quickly see which models are on which nodes.

## Changes

Rewrote the downloads page
(`dashboard/src/routes/downloads/+page.svelte`) from a card grid to a
clean table layout:

- **Rows** = models (unique across all nodes)
- **Columns** = nodes (with disk free shown in header)
- **Cells** show status at a glance:
  - Green checkmark + size for completed downloads
  - 🟡 Yellow percentage + mini progress bar + speed for active downloads
  - `...` for pending downloads
  - Red X for failed downloads
  - `--` for models not present on a node
- Delete/download action buttons appear on row hover
- Model name column is sticky on horizontal scroll (for many-node
clusters)
- Models sorted by number of nodes with completed downloads
- Imported shared utilities from `$lib/utils/downloads` instead of
inline re-implementations

### Backend: model directory in download events

- Added `model_directory` field to `BaseDownloadProgress` so all
download status events include the on-disk path
- Added `_model_dir()` helper to `DownloadCoordinator` to compute the
path from `EXO_MODELS_DIR`
- Dashboard uses this to show file location and enable "open in Finder"
for completed downloads

### Info modal

- Clicking a model name opens an info modal showing card details
(family, quantization, capabilities, storage size, layer count, tensor
parallelism support)

### Other fixes

- Fixed model name truncation in the table
- Excluded `tests/start_distributed_test.py` from pytest collection (CLI
script that calls `sys.exit()` at import time)

## Test Plan

- [x] `uv run basedpyright` — 0 errors
- [x] `uv run ruff check` — all passed
- [x] `nix fmt` — clean
- [x] `uv run pytest` — 188 passed, 1 skipped

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 14:31:47 +00:00