The RDMA DEVICE UNHEALTHY warning was iterating over ALL topology edges
(Socket + RDMA) to find node pairs, showing "disconnect and reconnect
Thunderbolt 5 cable" for every pair, including socket-only connections.
Nodes are not all-to-all connected via Thunderbolt.
Now filter to RDMA-tagged edges only (sourceRdmaIface set), which represent
actual Thunderbolt 5 links. Only pairs with a direct Thunderbolt
connection AND an RDMA health error now get the cable warning.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a node restarts quickly (before the 30s timeout), it gets a new
random peer ID. The master sees two nodes with the same friendly_name:
the dead old peer (with stale RunnerReady entries) and the live new
peer. The instance backed by the dead peer still appears healthy, so no
replacement is created, and inference routes to dead runners.
Fix: NodeTimeoutReconciler now groups nodes by friendly_name and
force-evicts stale duplicates (older last_seen) immediately via
NodeTimedOut. This triggers the existing cleanup cascade: topology
removal → instance health failure → instance deleted → MetaInstance
re-placed on the live peer.
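A minimal sketch of the grouping pass, with stand-in NodeInfo/NodeTimedOut
types (the real reconciler operates on exo's State and event models, which
differ):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeInfo:        # stand-in for the real node record
    friendly_name: str
    last_seen: float   # timestamp of the last heartbeat

@dataclass(frozen=True)
class NodeTimedOut:    # stand-in for the real eviction event
    node_id: str

def evict_stale_duplicates(nodes: dict[str, NodeInfo]) -> list[NodeTimedOut]:
    """Group nodes by friendly_name and force-evict all but the newest."""
    by_name: defaultdict[str, list[str]] = defaultdict(list)
    for node_id, info in nodes.items():
        by_name[info.friendly_name].append(node_id)

    events: list[NodeTimedOut] = []
    for node_ids in by_name.values():
        if len(node_ids) < 2:
            continue
        # Keep the most recently seen peer ID; the rest are stale duplicates.
        node_ids.sort(key=lambda nid: nodes[nid].last_seen, reverse=True)
        events.extend(NodeTimedOut(node_id=nid) for nid in node_ids[1:])
    return events
```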
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a node restarts, it gets a new peer ID (the keypair is regenerated on
each launch). The old peer ID lingers in state until the 30s timeout fires,
but its runners were never cleaned up — they accumulated as permanent
RunnerShutdown zombies in state.runners. Now apply_node_timed_out also
removes runners mapped to the timed-out node across all instances.
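A small sketch of the extra pruning step, assuming runner records carry the
node they run on (the real state shape may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunnerRecord:    # stand-in for the real runner state entry
    node_id: str
    status: str

def prune_runners_for_node(
    runners: dict[str, RunnerRecord], dead_node_id: str
) -> dict[str, RunnerRecord]:
    """Drop every runner mapped to the timed-out node, across all instances."""
    return {rid: r for rid, r in runners.items() if r.node_id != dead_node_id}
```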
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
apply_instance_deleted() previously only removed the instance from
state.instances, leaving its runner entries orphaned in state.runners
with their last known status (e.g. RunnerReady). After a node kill and
rejoin, readiness checks would see these stale entries and attempt
inference against dead runner processes, causing post-recovery failures.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a MetaInstance was created with node_ids (from the placement
preview), try_place_for_meta_instance passed them as required_nodes
to place_instance, which demands ALL listed nodes exist in a topology
cycle. If a pinned node died and was removed from the topology, no
cycle could satisfy the constraint and placement failed permanently.
Fix: intersect node_ids with live topology nodes before passing as
required_nodes. Dead nodes are dropped from the constraint so
placement succeeds on the remaining N-1 nodes.
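A sketch of the intersection step (names are illustrative, not exo's exact
helpers):

```python
def constrain_to_live_nodes(
    pinned_node_ids: list[str], topology_node_ids: set[str]
) -> list[str]:
    """Drop pinned nodes that no longer exist in the topology so the
    required_nodes constraint stays satisfiable on the survivors."""
    return [node_id for node_id in pinned_node_ids if node_id in topology_node_ids]
```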
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: when RDMA is broken (ibv_alloc_pd failure), the
RDMAConnection edges are often absent from the topology — only
SocketConnection edges remain. The old code filtered to RDMA-tagged
edges only, so unhealthy nodes with no RDMA edges fell through to the
"on device X" solo fallback.
Fix: use ALL topology edges to find connected peers. Any direct
connection between nodes represents a Thunderbolt cable to re-seat.
Remove the solo fallback and the {#if pair.nodeB} template branch
entirely — warnings always say "between A and B".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
matchesSelectedRuntime checked (MlxJaccl || MlxJaccl) in the else
branch — always true for MlxJaccl, making instance type filtering
a no-op. Simplified to direct equality: runtime === selectedInstanceType.
unifiedDisplayItems used $derived(() => ...), which made the derived
value a function rather than the computed array. Changed to
$derived.by(() => ...) so the value is the array itself, and removed
the () call syntax from template references.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When both nodes in an RDMA pair were unhealthy, the second node's pair
was already in seenPairs, so foundPeer stayed false and the node fell
through to the solo "on device X" warning instead of the correct
"between A and B" message.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Periodically probe ibv_alloc_pd() via ctypes on macOS to detect the
Thunderbolt XDomainLink boot-time initialization bug before it crashes
JACCL instance creation. The dashboard shows a red "RDMA DEVICE
UNHEALTHY" warning with the affected cable endpoints by friendly name
and a recommendation to replug the Thunderbolt 5 cable.
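Roughly what such a probe can look like through ctypes; the dylib name,
device selection, and error handling here are assumptions rather than exo's
actual implementation:

```python
import ctypes

def rdma_pd_probe(lib_name: str = "libibverbs.dylib") -> bool:
    """Return True if a protection domain can be allocated on the first RDMA
    device; a failed ibv_alloc_pd is the unhealthy case described above."""
    verbs = ctypes.CDLL(lib_name)  # library name is an assumption
    verbs.ibv_get_device_list.restype = ctypes.POINTER(ctypes.c_void_p)
    verbs.ibv_get_device_list.argtypes = [ctypes.POINTER(ctypes.c_int)]
    verbs.ibv_free_device_list.argtypes = [ctypes.POINTER(ctypes.c_void_p)]
    verbs.ibv_open_device.restype = ctypes.c_void_p
    verbs.ibv_open_device.argtypes = [ctypes.c_void_p]
    verbs.ibv_close_device.argtypes = [ctypes.c_void_p]
    verbs.ibv_alloc_pd.restype = ctypes.c_void_p
    verbs.ibv_alloc_pd.argtypes = [ctypes.c_void_p]
    verbs.ibv_dealloc_pd.argtypes = [ctypes.c_void_p]

    num = ctypes.c_int(0)
    devices = verbs.ibv_get_device_list(ctypes.byref(num))
    if not devices or num.value == 0:
        return False
    try:
        context = verbs.ibv_open_device(devices[0])
        if not context:
            return False
        pd = verbs.ibv_alloc_pd(context)
        healthy = bool(pd)  # NULL here is the XDomainLink boot-time bug
        if pd:
            verbs.ibv_dealloc_pd(pd)
        verbs.ibv_close_device(context)
        return healthy
    finally:
        verbs.ibv_free_device_list(devices)
```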
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents RuntimeError when the context has already been set,
e.g. when Terminal.app reuses a tab or the process restarts.
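Assuming the RuntimeError comes from multiprocessing.set_start_method being
called a second time, the guard is a one-liner:

```python
import multiprocessing

# Only set the start method if no context has been established yet;
# calling set_start_method twice raises RuntimeError("context has
# already been set").
if multiprocessing.get_start_method(allow_none=True) is None:
    multiprocessing.set_start_method("spawn")
```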
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two race conditions existed in the meta-instance lifecycle:
1. CreateMetaInstance buffered MetaInstanceCreated but didn't apply it
before awaiting ModelCard.load(). The reconciler could interleave
during the await, leading to duplicate placements.
Fix: apply MetaInstanceCreated eagerly via _apply_and_broadcast,
then re-check satisfaction after the await so placement uses fresh
state and skips if the reconciler already handled it.
2. delete_meta_instance (API handler) sent DeleteMetaInstance then
read self.state.instances for cascade deletion. State was stale,
so backing instances created between the send and the read were
missed — permanently orphaning them.
Fix: move cascade delete into the command processor's
DeleteMetaInstance handler, where InstanceDeleted events are
generated atomically with MetaInstanceDeleted.
Reproduced on a 4-node Mac Mini cluster: 28K anomalies in a stress test,
including 21 permanently orphaned instances. After the fix, the cascade
delete and placement are race-free.
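A pseudocode-level sketch of the eager-apply pattern for fix 1; the event and
helper names come from the description above, but their signatures are
assumed:

```python
async def handle_create_meta_instance(self, cmd) -> None:
    # Apply (and broadcast) immediately, so a reconciler tick that
    # interleaves with the await below already sees the new MetaInstance.
    self._apply_and_broadcast(MetaInstanceCreated(meta_instance=cmd.meta_instance))

    model_card = await ModelCard.load(cmd.meta_instance.model_id)

    # Re-check against fresh state: skip placement if the reconciler
    # already created a backing instance during the await.
    still_unsatisfied = {m.id for m in find_unsatisfied_meta_instances(self.state)}
    if cmd.meta_instance.id not in still_unsatisfied:
        return
    for event in try_place_for_meta_instance(self.state, cmd.meta_instance, model_card):
        self._apply_and_broadcast(event)
```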
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TaggedModel's wrap validator converts the JSON validation context into a
Python one, which breaks strict-mode bytes deserialization from JSON strings.
Use Base64Bytes type to encode/decode bytes as base64 strings in JSON.
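A minimal round-trip outside the TaggedModel wrapper, showing the
Base64Bytes behavior being relied on:

```python
from pydantic import Base64Bytes, BaseModel

class Payload(BaseModel):
    # Validation base64-decodes the JSON string into bytes; serialization
    # re-encodes it, so bytes survive a strict JSON round-trip.
    data: Base64Bytes

p = Payload(data=b"aGVsbG8=")  # decoded to b"hello"
restored = Payload.model_validate_json(p.model_dump_json())
assert restored.data == p.data
```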
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Anonymous pipes from os.pipe() don't survive multiprocessing.Process
spawn on macOS (the default start method since Python 3.8). The FD numbers are passed
but the actual file descriptors don't exist in the child process,
causing EBADF errors.
Switch to named pipes (FIFOs), which the child opens by path in the
spawned process, yielding valid FDs for the C++ SideChannel.
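A self-contained sketch of the FIFO hand-off; the path and framing are
illustrative (the real code passes the resulting FDs on to the C++
SideChannel):

```python
import multiprocessing as mp
import os
import tempfile

def child(fifo_path: str) -> None:
    # The child opens the FIFO by path in the spawned process, so it gets a
    # valid FD of its own instead of inheriting a number that maps to nothing.
    fd = os.open(fifo_path, os.O_RDONLY)
    try:
        print(os.read(fd, 1024))
    finally:
        os.close(fd)

if __name__ == "__main__":
    fifo_path = os.path.join(tempfile.mkdtemp(), "side_channel")
    os.mkfifo(fifo_path)
    proc = mp.get_context("spawn").Process(target=child, args=(fifo_path,))
    proc.start()
    # Opening for write blocks until the reader opens the other end.
    fd = os.open(fifo_path, os.O_WRONLY)
    os.write(fd, b"hello from parent")
    os.close(fd)
    proc.join()
    os.unlink(fifo_path)
```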
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace fragile TCP SideChannel with anonymous pipes relayed through
exo's event-sourced control plane. RunnerSupervisor creates pipe pairs
for MlxJaccl instances and relays all_gather rounds via
JacclSideChannelData/JacclSideChannelGathered events through the master,
eliminating errno=57 crashes from Thunderbolt RDMA driver instability.
Also includes dashboard RDMA warning improvements and instance retry fixes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- apply_instance_created no longer clears last_failure_error so the
error context persists while the new instance starts up
- Dashboard retryError shows the error without (N/3) prefix when
consecutiveFailures is 0 (instance was recreated)
- Jaccl warning tooltip now says "experimental RDMA driver in macOS"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detect errors containing "[jaccl]" in MetaInstance failure errors and
display a red dismissible alert banner. The tooltip explains this is a
macOS RDMA driver issue and that the affected machine needs to be
restarted. Alert re-appears if a new error arrives after dismissal.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When runners fail for a MetaInstance-backed Instance, retry up to 3
times by restarting runners within the same Instance rather than
deleting and recreating it each time. After 3 failures, delete the
Instance so MetaInstanceReconciler can create a fresh one.
- Add InstanceRetrying event that removes runners from state (signaling
workers to restart) and increments consecutive_failures on MetaInstance
- InstanceHealthReconciler emits InstanceRetrying when under retry limit,
InstanceDeleted when exhausted or no MetaInstance
- Worker _kill_runner detects retry signal (runner deleted from state +
terminal supervisor) and cleans up for _create_runner to recreate
- Worker _create_runner guards against oscillation by blocking creation
while any peer runner has explicit terminal status
- InstanceCreated resets consecutive_failures for fresh starts
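A condensed sketch of the retry decision described above, with stand-in
event types (the real reconciler evaluates runner statuses from State):

```python
from dataclasses import dataclass

MAX_RETRIES = 3

@dataclass(frozen=True)
class InstanceRetrying:   # stand-in for the real event
    instance_id: str

@dataclass(frozen=True)
class InstanceDeleted:    # stand-in for the real event
    instance_id: str

def decide_on_failure(instance_id: str, consecutive_failures: int, has_meta_instance: bool):
    """Retry in place while under the limit; otherwise delete so the
    MetaInstanceReconciler can place a fresh instance."""
    if not has_meta_instance or consecutive_failures >= MAX_RETRIES:
        return InstanceDeleted(instance_id=instance_id)
    return InstanceRetrying(instance_id=instance_id)
```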
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move placement_error, consecutive_failures, last_failure_error, and
last_failure_at directly onto the MetaInstance model instead of keeping
them as separate State mappings (meta_instance_errors, InstanceFailureInfo,
meta_instance_failure_info). Adds a 5-second cooldown between retry attempts
to prevent rapid instance churn when runners fail instantly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each error in the combined message is now prefixed with the node's friendly
name (e.g. "MacBook Pro: OOM; Mac Studio: connection reset") so the root
cause node is easily identifiable.
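The combining step amounts to something like the following (illustrative
helper, not the exact function in exo):

```python
def combine_runner_errors(errors: dict[str, str], friendly_names: dict[str, str]) -> str:
    """Prefix each runner error with its node's friendly name, then join."""
    return "; ".join(
        f"{friendly_names.get(node_id, node_id)}: {message}"
        for node_id, message in errors.items()
    )

# combine_runner_errors({"n1": "OOM", "n2": "connection reset"},
#                       {"n1": "MacBook Pro", "n2": "Mac Studio"})
# -> "MacBook Pro: OOM; Mac Studio: connection reset"
```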
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dashboard % 3 logic already handles displaying retry progress in batches
(RETRYING 1/3, 2/3, 3/3, then PLACING with error, repeat). No need to
permanently block placement after 3 failures.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When multiple runners fail, concatenate all error messages with "; " so the
real error isn't hidden by generic side-effect failures from other runners.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MetaInstanceReconciler now checks failure count before placement — after 3
consecutive failures it emits MetaInstancePlacementFailed instead of retrying
forever. Dashboard shows "Retrying after error: <msg>" in orange throughout
the retry cycle, not just during the brief window with no backing instance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extend InstanceDeleted with failure_error field for runner crash info
- Add InstanceFailureInfo model tracking consecutive failures per MetaInstance
- InstanceHealthReconciler now detects runner failures (all terminal with
at least one RunnerFailed) in addition to connection failures
- apply_instance_deleted increments failure counter for meta-bound instances
- Dashboard shows RETRYING (N/3) status with error messages, and
"Instance re-created due to failure" after 3 consecutive failures
- Extract and display RunnerFailed error messages in instance status
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
frozenset serializes to a JSON array but cannot be deserialized back
in strict mode through the TaggedModel wrap validator (list → frozenset
coercion is rejected). Changed to list[NodeId] since the model is
already frozen/immutable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dashboard now extracts node IDs from the selected preview's
memory_delta_by_node, ensuring the backend places on exactly the
nodes the user was shown. Also reverts incorrect RDMA min_nodes >= 2
enforcement since single-node RDMA is valid.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RDMA requires at least 2 nodes — a single-node RDMA instance is
nonsensical. Enforce this in both the dashboard (when building the
launch request) and the backend placement (when filtering cycles).
Previously, selecting RDMA would still place on 1 node because
min_nodes defaulted to 1 and the placement silently switched to Ring.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When user selects specific nodes via the filter, min_nodes should be at
least the number of filtered nodes to prevent placement from picking a
smaller cycle.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dashboard was not including the user's node filter in the POST to
/meta_instance, so placement ignored which nodes the user selected.
Also, placement silently fell back to Ring when RDMA was requested but
no RDMA-connected cycles were available — now raises an error that
surfaces via MetaInstancePlacementFailed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The mode="plain" validator bypassed Pydantic's string-to-enum coercion,
so JSON strings like "Tensor" and "MlxJaccl" from the dashboard failed
the isinstance check and silently fell back to Pipeline/MlxRing defaults.
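An illustrative reproduction of the pattern with a toy enum and request
model; exo's actual types differ, but the point is that a mode="plain"
validator has to do the string-to-enum coercion itself (or be switched to
mode="before"/"after"):

```python
from enum import Enum
from pydantic import BaseModel, field_validator

class Sharding(str, Enum):
    Pipeline = "Pipeline"
    Tensor = "Tensor"

class LaunchRequest(BaseModel):
    sharding: Sharding = Sharding.Pipeline

    # With mode="plain", pydantic's own str -> enum coercion never runs,
    # so the validator must coerce explicitly instead of isinstance-checking.
    @field_validator("sharding", mode="plain")
    @classmethod
    def coerce_sharding(cls, value: object) -> Sharding:
        if isinstance(value, Sharding):
            return value
        return Sharding(value)  # "Tensor" -> Sharding.Tensor; bad input raises

print(LaunchRequest.model_validate_json('{"sharding": "Tensor"}').sharding)
```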
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Show why MetaInstance placement fails instead of a stuck "PLACING" status, and
show per-node runner status during loading for multi-node instances.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a MetaInstance has no backing instance yet, derive the strategy
display from the MetaInstance's own sharding and instanceMeta fields
rather than showing "Unknown (Unknown)".
Also clean up all stale MlxIbv references across the dashboard —
the backend enum is MlxJaccl.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace inline _plan() with ProcessManager loop (_reconcile), tick
every 1s instead of 10s — safe because all PMs are idempotent
- Fix dashboard sending "MlxIbv" instead of "MlxJaccl" for RDMA
instance type, which silently fell back to MlxRing default
- Remove all stale MlxIbv references from dashboard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace inline _plan() steps with a list of ProcessManagers, each
implementing async reconcile(State) -> Sequence[Event]. Tick every
1s instead of 10s — safe because all PMs are idempotent against state.
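The shape of the protocol and loop, as a sketch (State/Event are exo's own
types; the wiring of get_state and apply_and_broadcast is assumed):

```python
import asyncio
from typing import Callable, Protocol, Sequence

class ProcessManager(Protocol):
    """Each PM inspects current state and returns whatever events are needed
    to converge toward its desired state; reconcile must be idempotent."""
    async def reconcile(self, state: "State") -> Sequence["Event"]: ...

async def reconcile_loop(
    pms: Sequence[ProcessManager],
    get_state: Callable[[], "State"],
    apply_and_broadcast: Callable[["Event"], None],
    tick_s: float = 1.0,
) -> None:
    while True:
        for pm in pms:
            for event in await pm.reconcile(get_state()):
                apply_and_broadcast(event)
        await asyncio.sleep(tick_s)
```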
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The separate MetaInstanceBound event + meta_instance_backing map
introduced two bugs: stale exclusion sets in the reconciler loop and
a delete ordering race. Embedding meta_instance_id directly on
BaseInstance eliminates the binding mechanism entirely — when an
instance is created for a MetaInstance it carries the ID, when
deleted the binding is gone. No separate map, no cleanup, no races.
Also fixes delete_meta_instance to cascade-delete backing instances.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add MetaInstanceBound event and meta_instance_backing State field
for explicit MetaInstance → Instance binding (prevents ambiguous
linking when two MetaInstances have identical constraints)
- Replace model_card: ModelCard with model_id: ModelId on MetaInstance
(load ModelCard on-demand at placement time)
- Add MetaInstance API endpoints (POST /meta_instance, DELETE)
- Update dashboard to use MetaInstances as primary primitive with
unified display items merging MetaInstances and orphan instances
- Dashboard launches via MetaInstance instead of direct Instance creation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces MetaInstance as a declarative constraint ensuring an instance
matching given parameters (model, sharding, min_nodes) always exists.
The master's reconciliation loop continuously checks for unsatisfied
meta-instances and attempts placement. Connection health checking
verifies that specific IPs (MlxRing) and RDMA interfaces (MlxJaccl)
stored on instances still exist as topology edges, enabling automatic
recovery when cables are swapped or interfaces change.
Also eliminates the master's loopback event path, unifying all event
emission through _apply_and_broadcast for simpler control flow.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- **Auto-open dashboard** in browser on first launch (uses
`~/.exo/.dashboard_opened` marker)
- **Welcome overlay** with "Choose a Model" CTA button when no model
instance is running
- **Tutorial progress messages** during model download → loading → ready
lifecycle stages
- **Fix conversation sidebar** text contrast — bumped to white text,
added active state background
- **Simplify technical jargon** — sharding/instance type/min nodes
hidden behind collapsible "Advanced Options" toggle; strategy display
hidden behind debug mode
- **Polished DMG installer** with drag-to-Applications layout, custom
branded background, and AppleScript-configured window positioning
## Test plan
- [ ] Launch exo for the first time (delete `~/.exo/.dashboard_opened`
to simulate) — browser should auto-open
- [ ] Verify welcome overlay appears on topology when no model is loaded
- [ ] Launch a model and verify download/loading/ready messages appear
in instance cards
- [ ] Check conversation sidebar text is readable (white on dark, yellow
when active)
- [ ] Verify "Advanced Options" toggle hides/shows sharding controls
- [ ] Build DMG with `packaging/dmg/create-dmg.sh` and verify
drag-to-Applications layout
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
There is an issue on Macs where an explicit synchronization is necessary
for memory updates to become visible beyond the L1 cache. As a result, GPU
locks can occur when a spin wait never sees the updated timestamp.
## Changes
Updated in my own personal fork.
## Why It Works
https://github.com/ARM-software/acle/releases
## Test Plan
### Manual Testing
Tested manually that no GPU locks occur (even with multiple simultaneous
instances running) and that the performance differential is negligible
(267 vs. 269 tps on Llama 3.2 1B at roughly 10k context).
------------------------------------------------------
I have seen a GPU lock, specifically when sending a particularly large
chat completion while the model was loading. However, I have since been
unable to reproduce it, and it may have been something I did wrong. Please do
create an issue and tag me if any GPU locks do occur.
---------
Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Users currently manage instances directly, which means if a node
disconnects or connections break, the instance dies and nothing
recreates it. MetaInstance is a declarative primitive: "ensure an
instance matching these parameters always exists." The reconciler
watches for unhealthy or missing backing instances and re-places them
automatically.
## Changes
- **MetaInstance type** (`meta_instance.py`): declarative constraint
with `model_id`, `min_nodes`, optional `node_ids`, and `sharding`
- **Reconciler** (`reconcile.py`): `find_unsatisfied_meta_instances`
checks which MetaInstances lack a healthy backing instance,
`try_place_for_meta_instance` creates one
- **Master loop** (`main.py`): periodically reconciles unsatisfied
MetaInstances; immediate placement on `CreateMetaInstance` command
- **API** (`api.py`): `create_meta_instance` / `delete_meta_instance` /
`GET /meta_instances` endpoints; delete cascades to backing instances
with task cancellation
- **Binding via `meta_instance_id` on Instance** (`instances.py`): no
separate binding event or backing map — the instance carries its parent
MetaInstance ID directly, eliminating race conditions in the reconciler
- **Dashboard**: sidebar shows MetaInstances with their backing instance
status; orphan instances (created directly) still shown separately
- **Tests**: constraint matching, connection health, unsatisfied
detection, exclusive binding, cascade delete with task cancellation
### Recent improvements
- **fix: cancel active tasks on cascade delete** — `DeleteMetaInstance`
now emits `TaskStatusUpdated(Cancelled)` for any Pending/Running tasks
on backing instances before emitting `InstanceDeleted`. Previously,
cascade-deleting backing instances left orphaned task references in
state.
- **Lifecycle logging** — added `logger.info`/`logger.warning` for:
`CreateMetaInstance` (model, min_nodes, sharding), `DeleteMetaInstance`
(with cascade count), reconciler placement success/failure, and retry
decisions with attempt counts in `InstanceHealthReconciler`.
- **GET `/meta_instances` endpoint** — lists all meta-instances without
needing to fetch full state.
- **2 regression tests** — `test_cascade_delete_cancels_active_tasks`
and `test_cascade_delete_skips_completed_tasks` verify the
cascade-delete event sequence.
## Why It Works
Putting `meta_instance_id` on `BaseInstance` makes binding inherent to
instance creation. When the reconciler creates an instance for a
MetaInstance, it tags it via `model_copy` (sketched below). When the instance is deleted,
the binding disappears with it. This avoids the two bugs that a separate
binding mechanism would introduce:
1. Stale exclusion sets — the reconciler loop can't accidentally bind
two MetaInstances to the same instance
2. Delete ordering race — no window between deleting an instance and its
binding where the reconciler could re-place
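A small sketch of that tagging step, assuming `Instance` is a frozen
pydantic model with an optional `meta_instance_id` field:

```python
def bind_to_meta_instance(instance: "Instance", meta_instance_id: "MetaInstanceId") -> "Instance":
    # model_copy(update=...) returns a new frozen Instance that carries its
    # parent MetaInstance ID; there is no separate binding map to clean up.
    return instance.model_copy(update={"meta_instance_id": meta_instance_id})
```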
## Test Plan
### Manual Testing
- Created MetaInstance via dashboard, verified instance placed
- Verified delete cascades (deleting MetaInstance removes backing
instance)
- Verified orphan instances still work independently
### Automated Testing
- 30 tests in `test_meta_instance_edge_cases.py`: lifecycle, retry
logic, error handling, concurrent operations, cascade delete with task
cancellation
- 24 tests in `test_reconcile.py`: constraint matching, connection
health (single/multi-node, edge removal, IP changes), unsatisfied
detection, exclusive binding, idempotency
- All 261 tests pass
- basedpyright 0 errors, ruff clean, dashboard builds
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Fixes #1370
When the macOS app stops exo, GPU/system memory isn't released. This
happens because:
1. The macOS app calls `process.terminate()` (SIGTERM) but the Python
process only registers a graceful shutdown handler for SIGINT, not
SIGTERM. SIGTERM's default Python behavior raises `SystemExit` which
bypasses the cleanup cascade (runner subprocess MLX cleanup via
`mx.clear_cache()`, channel closing, etc.).
2. The app doesn't wait for the process to actually finish cleanup — it
immediately nils out the process reference.
## Changes
**`src/exo/main.py`**: Register SIGTERM handler alongside SIGINT so the
graceful shutdown cascade (`Node.shutdown()` → cancel task group →
worker/runner cleanup → `mx.clear_cache()` + `gc.collect()`) runs
regardless of which signal is received.
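A sketch of the registration, assuming an asyncio event loop and a
`shutdown()` coroutine that kicks off the cascade (the real handler wiring
in `main.py` may differ):

```python
import asyncio
import signal
from typing import Awaitable, Callable

def install_shutdown_handlers(
    loop: asyncio.AbstractEventLoop,
    shutdown: Callable[[], Awaitable[None]],
) -> None:
    # Same graceful path for both signals, so the macOS app's SIGTERM no
    # longer bypasses runner cleanup (mx.clear_cache(), channel close, ...).
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, lambda: asyncio.ensure_future(shutdown()))
```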
**`app/EXO/EXO/ExoProcessController.swift`**: Replace immediate
`process.terminate()` with escalating shutdown per @Evanev7's
suggestion:
1. Send SIGINT via `process.interrupt()` — triggers the registered
Python handler for graceful cleanup
2. Wait up to 5 seconds for the process to exit
3. If still running, escalate to SIGTERM via `process.terminate()`
4. Wait up to 3 seconds
5. If still running, force kill via SIGKILL
The escalation runs in a detached `Task` so the UI updates immediately
(status → stopped) without blocking.
## Why It Works
The root cause is that SIGTERM wasn't triggering the graceful shutdown
path. By registering a SIGTERM handler in Python and sending SIGINT
first from the macOS app, the process gets a chance to run the full
cleanup cascade: cancelling the task group, shutting down runners (which
call `del model; mx.clear_cache(); gc.collect()`), closing channels, and
flushing logs. The escalation to SIGTERM and SIGKILL ensures the process
always terminates even if graceful shutdown hangs.
## Test Plan
### Manual Testing
Hardware: Mac Studio M4 Max 128GB
- Start exo via macOS app, load a model, run inference
- Stop via the toggle switch, verify memory is released without
requiring a system restart
- Test rapid stop/start (restart) to ensure no race conditions
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — no changes
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan Quiney <evanev7@gmail.com>
## Motivation
The current downloads page uses a node-centric card grid layout that is
messy and hard to read — the same model across different nodes appears
in separate cards, and deep nesting wastes space. This makes it
difficult to quickly see which models are on which nodes.
## Changes
Rewrote the downloads page
(`dashboard/src/routes/downloads/+page.svelte`) from a card grid to a
clean table layout:
- **Rows** = models (unique across all nodes)
- **Columns** = nodes (with disk free shown in header)
- **Cells** show status at a glance:
- ✅ Green checkmark + size for completed downloads
- 🟡 Yellow percentage + mini progress bar + speed for active downloads
- `...` for pending downloads
- ❌ Red X for failed downloads
- `--` for models not present on a node
- Delete/download action buttons appear on row hover
- Model name column is sticky on horizontal scroll (for many-node
clusters)
- Models sorted by number of nodes with completed downloads
- Imported shared utilities from `$lib/utils/downloads` instead of
inline re-implementations
### Backend: model directory in download events
- Added `model_directory` field to `BaseDownloadProgress` so all
download status events include the on-disk path
- Added `_model_dir()` helper to `DownloadCoordinator` to compute the
path from `EXO_MODELS_DIR` (see the sketch after this list)
- Dashboard uses this to show file location and enable "open in Finder"
for completed downloads
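A hypothetical sketch of the helper; how exo maps a model ID to an on-disk
directory name is an assumption here, not the coordinator's exact rule:

```python
import os

def _model_dir(model_id: str) -> str:
    # EXO_MODELS_DIR comes from the environment; the default location and the
    # model-ID-to-directory mapping below are assumptions for illustration.
    models_dir = os.environ.get("EXO_MODELS_DIR", os.path.expanduser("~/.exo/models"))
    return os.path.join(models_dir, model_id.replace("/", "--"))
```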
### Info modal
- Clicking a model name opens an info modal showing card details
(family, quantization, capabilities, storage size, layer count, tensor
parallelism support)
### Other fixes
- Fixed model name truncation in the table
- Excluded `tests/start_distributed_test.py` from pytest collection (CLI
script that calls `sys.exit()` at import time)
## Test Plan
- [x] `uv run basedpyright` — 0 errors
- [x] `uv run ruff check` — all passed
- [x] `nix fmt` — clean
- [x] `uv run pytest` — 188 passed, 1 skipped
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>