The RDMA DEVICE UNHEALTHY warning was iterating over ALL topology edges
(Socket + RDMA) to find node pairs, showing "disconnect and reconnect
Thunderbolt 5 cable" for every pair, including socket-only connections.
Nodes are not all-to-all connected via Thunderbolt.
Now filter to RDMA-tagged edges only (sourceRdmaIface set), which represent
actual Thunderbolt 5 links. Only pairs with a direct Thunderbolt
connection AND an RDMA health error now get the cable warning.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a node restarts quickly (before the 30s timeout), it gets a new
random peer ID. The master sees two nodes with the same friendly_name:
the dead old peer (with stale RunnerReady entries) and the live new
peer. The instance backed by the dead peer still appears healthy, so no
replacement is created, and inference routes to dead runners.
Fix: NodeTimeoutReconciler now groups nodes by friendly_name and
force-evicts stale duplicates (older last_seen) immediately via
NodeTimedOut. This triggers the existing cleanup cascade: topology
removal → instance health failure → instance deleted → MetaInstance
re-placed on the live peer.
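A minimal sketch of the grouping pass, with stand-in NodeInfo/NodeTimedOut
types (the real reconciler operates on exo's State and event models, which
differ):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeInfo:        # stand-in for the real node record
    friendly_name: str
    last_seen: float   # timestamp of the last heartbeat

@dataclass(frozen=True)
class NodeTimedOut:    # stand-in for the real eviction event
    node_id: str

def evict_stale_duplicates(nodes: dict[str, NodeInfo]) -> list[NodeTimedOut]:
    """Group nodes by friendly_name and force-evict all but the newest."""
    by_name: defaultdict[str, list[str]] = defaultdict(list)
    for node_id, info in nodes.items():
        by_name[info.friendly_name].append(node_id)

    events: list[NodeTimedOut] = []
    for node_ids in by_name.values():
        if len(node_ids) < 2:
            continue
        # Keep the most recently seen peer ID; the rest are stale duplicates.
        node_ids.sort(key=lambda nid: nodes[nid].last_seen, reverse=True)
        events.extend(NodeTimedOut(node_id=nid) for nid in node_ids[1:])
    return events
```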
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a node restarts, it gets a new peer ID (the keypair is regenerated on
each launch). The old peer ID lingers in state until the 30s timeout fires,
but its runners were never cleaned up — they accumulated as permanent
RunnerShutdown zombies in state.runners. Now apply_node_timed_out also
removes runners mapped to the timed-out node across all instances.
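A small sketch of the extra pruning step, assuming runner records carry the
node they run on (the real state shape may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunnerRecord:    # stand-in for the real runner state entry
    node_id: str
    status: str

def prune_runners_for_node(
    runners: dict[str, RunnerRecord], dead_node_id: str
) -> dict[str, RunnerRecord]:
    """Drop every runner mapped to the timed-out node, across all instances."""
    return {rid: r for rid, r in runners.items() if r.node_id != dead_node_id}
```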
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
apply_instance_deleted() previously only removed the instance from
state.instances, leaving its runner entries orphaned in state.runners
with their last known status (e.g. RunnerReady). After a node kill and
rejoin, readiness checks would see these stale entries and attempt
inference against dead runner processes, causing post-recovery failures.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a MetaInstance was created with node_ids (from the placement
preview), try_place_for_meta_instance passed them as required_nodes
to place_instance, which demands ALL listed nodes exist in a topology
cycle. If a pinned node died and was removed from the topology, no
cycle could satisfy the constraint and placement failed permanently.
Fix: intersect node_ids with live topology nodes before passing as
required_nodes. Dead nodes are dropped from the constraint so
placement succeeds on the remaining N-1 nodes.
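A sketch of the intersection step (names are illustrative, not exo's exact
helpers):

```python
def constrain_to_live_nodes(
    pinned_node_ids: list[str], topology_node_ids: set[str]
) -> list[str]:
    """Drop pinned nodes that no longer exist in the topology so the
    required_nodes constraint stays satisfiable on the survivors."""
    return [node_id for node_id in pinned_node_ids if node_id in topology_node_ids]
```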
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: when RDMA is broken (ibv_alloc_pd failure), the
RDMAConnection edges are often absent from the topology — only
SocketConnection edges remain. The old code filtered to RDMA-tagged
edges only, so unhealthy nodes with no RDMA edges fell through to the
"on device X" solo fallback.
Fix: use ALL topology edges to find connected peers. Any direct
connection between nodes represents a Thunderbolt cable to re-seat.
Remove the solo fallback and the {#if pair.nodeB} template branch
entirely — warnings always say "between A and B".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
matchesSelectedRuntime checked (MlxJaccl || MlxJaccl) in the else
branch — always true for MlxJaccl, making instance type filtering
a no-op. Simplified to direct equality: runtime === selectedInstanceType.
unifiedDisplayItems used $derived(() => ...), which made the derived
value a function rather than the computed array. Changed to
$derived.by(() => ...) so the value is the array itself, and removed
the () call syntax from template references.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When both nodes in an RDMA pair were unhealthy, the second node's pair
was already in seenPairs, so foundPeer stayed false and the node fell
through to the solo "on device X" warning instead of the correct
"between A and B" message.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Periodically probe ibv_alloc_pd() via ctypes on macOS to detect the
Thunderbolt XDomainLink boot-time initialization bug before it crashes
JACCL instance creation. The dashboard shows a red "RDMA DEVICE
UNHEALTHY" warning with the affected cable endpoints by friendly name
and a recommendation to replug the Thunderbolt 5 cable.
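Roughly what such a probe can look like through ctypes; the dylib name,
device selection, and error handling here are assumptions rather than exo's
actual implementation:

```python
import ctypes

def rdma_pd_probe(lib_name: str = "libibverbs.dylib") -> bool:
    """Return True if a protection domain can be allocated on the first RDMA
    device; a failed ibv_alloc_pd is the unhealthy case described above."""
    verbs = ctypes.CDLL(lib_name)  # library name is an assumption
    verbs.ibv_get_device_list.restype = ctypes.POINTER(ctypes.c_void_p)
    verbs.ibv_get_device_list.argtypes = [ctypes.POINTER(ctypes.c_int)]
    verbs.ibv_free_device_list.argtypes = [ctypes.POINTER(ctypes.c_void_p)]
    verbs.ibv_open_device.restype = ctypes.c_void_p
    verbs.ibv_open_device.argtypes = [ctypes.c_void_p]
    verbs.ibv_close_device.argtypes = [ctypes.c_void_p]
    verbs.ibv_alloc_pd.restype = ctypes.c_void_p
    verbs.ibv_alloc_pd.argtypes = [ctypes.c_void_p]
    verbs.ibv_dealloc_pd.argtypes = [ctypes.c_void_p]

    num = ctypes.c_int(0)
    devices = verbs.ibv_get_device_list(ctypes.byref(num))
    if not devices or num.value == 0:
        return False
    try:
        context = verbs.ibv_open_device(devices[0])
        if not context:
            return False
        pd = verbs.ibv_alloc_pd(context)
        healthy = bool(pd)  # NULL here is the XDomainLink boot-time bug
        if pd:
            verbs.ibv_dealloc_pd(pd)
        verbs.ibv_close_device(context)
        return healthy
    finally:
        verbs.ibv_free_device_list(devices)
```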
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents RuntimeError when the context has already been set,
e.g. when Terminal.app reuses a tab or the process restarts.
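Assuming the RuntimeError comes from multiprocessing.set_start_method being
called a second time, the guard is a one-liner:

```python
import multiprocessing

# Only set the start method if no context has been established yet;
# calling set_start_method twice raises RuntimeError("context has
# already been set").
if multiprocessing.get_start_method(allow_none=True) is None:
    multiprocessing.set_start_method("spawn")
```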
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two race conditions existed in the meta-instance lifecycle:
1. CreateMetaInstance buffered MetaInstanceCreated but didn't apply it
before awaiting ModelCard.load(). The reconciler could interleave
during the await, leading to duplicate placements.
Fix: apply MetaInstanceCreated eagerly via _apply_and_broadcast,
then re-check satisfaction after the await so placement uses fresh
state and skips if the reconciler already handled it.
2. delete_meta_instance (API handler) sent DeleteMetaInstance then
read self.state.instances for cascade deletion. State was stale,
so backing instances created between the send and the read were
missed — permanently orphaning them.
Fix: move cascade delete into the command processor's
DeleteMetaInstance handler, where InstanceDeleted events are
generated atomically with MetaInstanceDeleted.
Reproduced on a 4-node Mac Mini cluster: 28K anomalies in a stress test,
including 21 permanently orphaned instances. After the fix, the cascade
delete and placement are race-free.
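A pseudocode-level sketch of the eager-apply pattern for fix 1; the event and
helper names come from the description above, but their signatures are
assumed:

```python
async def handle_create_meta_instance(self, cmd) -> None:
    # Apply (and broadcast) immediately, so a reconciler tick that
    # interleaves with the await below already sees the new MetaInstance.
    self._apply_and_broadcast(MetaInstanceCreated(meta_instance=cmd.meta_instance))

    model_card = await ModelCard.load(cmd.meta_instance.model_id)

    # Re-check against fresh state: skip placement if the reconciler
    # already created a backing instance during the await.
    still_unsatisfied = {m.id for m in find_unsatisfied_meta_instances(self.state)}
    if cmd.meta_instance.id not in still_unsatisfied:
        return
    for event in try_place_for_meta_instance(self.state, cmd.meta_instance, model_card):
        self._apply_and_broadcast(event)
```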
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TaggedModel's wrap validator converts the JSON validation context into a
Python one, which breaks strict-mode bytes deserialization from JSON strings.
Use Base64Bytes type to encode/decode bytes as base64 strings in JSON.
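A minimal round-trip outside the TaggedModel wrapper, showing the
Base64Bytes behavior being relied on:

```python
from pydantic import Base64Bytes, BaseModel

class Payload(BaseModel):
    # Validation base64-decodes the JSON string into bytes; serialization
    # re-encodes it, so bytes survive a strict JSON round-trip.
    data: Base64Bytes

p = Payload(data=b"aGVsbG8=")  # decoded to b"hello"
restored = Payload.model_validate_json(p.model_dump_json())
assert restored.data == p.data
```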
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Anonymous pipes from os.pipe() don't survive multiprocessing.Process
spawn on macOS (the default start method since Python 3.8). The FD numbers are passed
but the actual file descriptors don't exist in the child process,
causing EBADF errors.
Switch to named pipes (FIFOs), which the child opens by path in the
spawned process, yielding valid FDs for the C++ SideChannel.
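A self-contained sketch of the FIFO hand-off; the path and framing are
illustrative (the real code passes the resulting FDs on to the C++
SideChannel):

```python
import multiprocessing as mp
import os
import tempfile

def child(fifo_path: str) -> None:
    # The child opens the FIFO by path in the spawned process, so it gets a
    # valid FD of its own instead of inheriting a number that maps to nothing.
    fd = os.open(fifo_path, os.O_RDONLY)
    try:
        print(os.read(fd, 1024))
    finally:
        os.close(fd)

if __name__ == "__main__":
    fifo_path = os.path.join(tempfile.mkdtemp(), "side_channel")
    os.mkfifo(fifo_path)
    proc = mp.get_context("spawn").Process(target=child, args=(fifo_path,))
    proc.start()
    # Opening for write blocks until the reader opens the other end.
    fd = os.open(fifo_path, os.O_WRONLY)
    os.write(fd, b"hello from parent")
    os.close(fd)
    proc.join()
    os.unlink(fifo_path)
```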
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace fragile TCP SideChannel with anonymous pipes relayed through
exo's event-sourced control plane. RunnerSupervisor creates pipe pairs
for MlxJaccl instances and relays all_gather rounds via
JacclSideChannelData/JacclSideChannelGathered events through the master,
eliminating errno=57 crashes from Thunderbolt RDMA driver instability.
Also includes dashboard RDMA warning improvements and instance retry fixes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- apply_instance_created no longer clears last_failure_error so the
error context persists while the new instance starts up
- Dashboard retryError shows the error without (N/3) prefix when
consecutiveFailures is 0 (instance was recreated)
- Jaccl warning tooltip now says "experimental RDMA driver in macOS"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detect errors containing "[jaccl]" in MetaInstance failure errors and
display a red dismissible alert banner. The tooltip explains this is a
macOS RDMA driver issue and that the affected machine needs to be
restarted. Alert re-appears if a new error arrives after dismissal.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When runners fail for a MetaInstance-backed Instance, retry up to 3
times by restarting runners within the same Instance rather than
deleting and recreating it each time. After 3 failures, delete the
Instance so MetaInstanceReconciler can create a fresh one.
- Add InstanceRetrying event that removes runners from state (signaling
workers to restart) and increments consecutive_failures on MetaInstance
- InstanceHealthReconciler emits InstanceRetrying when under retry limit,
InstanceDeleted when exhausted or no MetaInstance
- Worker _kill_runner detects retry signal (runner deleted from state +
terminal supervisor) and cleans up for _create_runner to recreate
- Worker _create_runner guards against oscillation by blocking creation
while any peer runner has explicit terminal status
- InstanceCreated resets consecutive_failures for fresh starts
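A condensed sketch of the retry decision described above, with stand-in
event types (the real reconciler evaluates runner statuses from State):

```python
from dataclasses import dataclass

MAX_RETRIES = 3

@dataclass(frozen=True)
class InstanceRetrying:   # stand-in for the real event
    instance_id: str

@dataclass(frozen=True)
class InstanceDeleted:    # stand-in for the real event
    instance_id: str

def decide_on_failure(instance_id: str, consecutive_failures: int, has_meta_instance: bool):
    """Retry in place while under the limit; otherwise delete so the
    MetaInstanceReconciler can place a fresh instance."""
    if not has_meta_instance or consecutive_failures >= MAX_RETRIES:
        return InstanceDeleted(instance_id=instance_id)
    return InstanceRetrying(instance_id=instance_id)
```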
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move placement_error, consecutive_failures, last_failure_error, and
last_failure_at directly onto the MetaInstance model instead of keeping
them as separate State mappings (meta_instance_errors, InstanceFailureInfo,
meta_instance_failure_info). Adds a 5-second cooldown between retry attempts
to prevent rapid instance churn when runners fail instantly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each error in the combined message is now prefixed with the node's friendly
name (e.g. "MacBook Pro: OOM; Mac Studio: connection reset") so the root
cause node is easily identifiable.
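The combining step amounts to something like the following (illustrative
helper, not the exact function in exo):

```python
def combine_runner_errors(errors: dict[str, str], friendly_names: dict[str, str]) -> str:
    """Prefix each runner error with its node's friendly name, then join."""
    return "; ".join(
        f"{friendly_names.get(node_id, node_id)}: {message}"
        for node_id, message in errors.items()
    )

# combine_runner_errors({"n1": "OOM", "n2": "connection reset"},
#                       {"n1": "MacBook Pro", "n2": "Mac Studio"})
# -> "MacBook Pro: OOM; Mac Studio: connection reset"
```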
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dashboard % 3 logic already handles displaying retry progress in batches
(RETRYING 1/3, 2/3, 3/3, then PLACING with error, repeat). No need to
permanently block placement after 3 failures.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When multiple runners fail, concatenate all error messages with "; " so the
real error isn't hidden by generic side-effect failures from other runners.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MetaInstanceReconciler now checks failure count before placement — after 3
consecutive failures it emits MetaInstancePlacementFailed instead of retrying
forever. Dashboard shows "Retrying after error: <msg>" in orange throughout
the retry cycle, not just during the brief window with no backing instance.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extend InstanceDeleted with failure_error field for runner crash info
- Add InstanceFailureInfo model tracking consecutive failures per MetaInstance
- InstanceHealthReconciler now detects runner failures (all terminal with
at least one RunnerFailed) in addition to connection failures
- apply_instance_deleted increments failure counter for meta-bound instances
- Dashboard shows RETRYING (N/3) status with error messages, and
"Instance re-created due to failure" after 3 consecutive failures
- Extract and display RunnerFailed error messages in instance status
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
frozenset serializes to a JSON array but cannot be deserialized back
in strict mode through the TaggedModel wrap validator (list → frozenset
coercion is rejected). Changed to list[NodeId] since the model is
already frozen/immutable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dashboard now extracts node IDs from the selected preview's
memory_delta_by_node, ensuring the backend places on exactly the
nodes the user was shown. Also reverts incorrect RDMA min_nodes >= 2
enforcement since single-node RDMA is valid.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RDMA requires at least 2 nodes — a single-node RDMA instance is
nonsensical. Enforce this in both the dashboard (when building the
launch request) and the backend placement (when filtering cycles).
Previously, selecting RDMA would still place on 1 node because
min_nodes defaulted to 1 and the placement silently switched to Ring.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When user selects specific nodes via the filter, min_nodes should be at
least the number of filtered nodes to prevent placement from picking a
smaller cycle.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dashboard was not including the user's node filter in the POST to
/meta_instance, so placement ignored which nodes the user selected.
Also, placement silently fell back to Ring when RDMA was requested but
no RDMA-connected cycles were available — now raises an error that
surfaces via MetaInstancePlacementFailed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The mode="plain" validator bypassed Pydantic's string-to-enum coercion,
so JSON strings like "Tensor" and "MlxJaccl" from the dashboard failed
the isinstance check and silently fell back to Pipeline/MlxRing defaults.
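An illustrative reproduction of the pattern with a toy enum and request
model; exo's actual types differ, but the point is that a mode="plain"
validator has to do the string-to-enum coercion itself (or be switched to
mode="before"/"after"):

```python
from enum import Enum
from pydantic import BaseModel, field_validator

class Sharding(str, Enum):
    Pipeline = "Pipeline"
    Tensor = "Tensor"

class LaunchRequest(BaseModel):
    sharding: Sharding = Sharding.Pipeline

    # With mode="plain", pydantic's own str -> enum coercion never runs,
    # so the validator must coerce explicitly instead of isinstance-checking.
    @field_validator("sharding", mode="plain")
    @classmethod
    def coerce_sharding(cls, value: object) -> Sharding:
        if isinstance(value, Sharding):
            return value
        return Sharding(value)  # "Tensor" -> Sharding.Tensor; bad input raises

print(LaunchRequest.model_validate_json('{"sharding": "Tensor"}').sharding)
```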
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Show why MetaInstance placement fails instead of a stuck "PLACING" status, and
show per-node runner status during loading for multi-node instances.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a MetaInstance has no backing instance yet, derive the strategy
display from the MetaInstance's own sharding and instanceMeta fields
rather than showing "Unknown (Unknown)".
Also clean up all stale MlxIbv references across the dashboard —
the backend enum is MlxJaccl.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace inline _plan() with ProcessManager loop (_reconcile), tick
every 1s instead of 10s — safe because all PMs are idempotent
- Fix dashboard sending "MlxIbv" instead of "MlxJaccl" for RDMA
instance type, which silently fell back to MlxRing default
- Remove all stale MlxIbv references from dashboard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace inline _plan() steps with a list of ProcessManagers, each
implementing async reconcile(State) -> Sequence[Event]. Tick every
1s instead of 10s — safe because all PMs are idempotent against state.
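The shape of the protocol and loop, as a sketch (State/Event are exo's own
types; the wiring of get_state and apply_and_broadcast is assumed):

```python
import asyncio
from typing import Callable, Protocol, Sequence

class ProcessManager(Protocol):
    """Each PM inspects current state and returns whatever events are needed
    to converge toward its desired state; reconcile must be idempotent."""
    async def reconcile(self, state: "State") -> Sequence["Event"]: ...

async def reconcile_loop(
    pms: Sequence[ProcessManager],
    get_state: Callable[[], "State"],
    apply_and_broadcast: Callable[["Event"], None],
    tick_s: float = 1.0,
) -> None:
    while True:
        for pm in pms:
            for event in await pm.reconcile(get_state()):
                apply_and_broadcast(event)
        await asyncio.sleep(tick_s)
```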
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The separate MetaInstanceBound event + meta_instance_backing map
introduced two bugs: stale exclusion sets in the reconciler loop and
a delete ordering race. Embedding meta_instance_id directly on
BaseInstance eliminates the binding mechanism entirely — when an
instance is created for a MetaInstance it carries the ID, when
deleted the binding is gone. No separate map, no cleanup, no races.
Also fixes delete_meta_instance to cascade-delete backing instances.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add MetaInstanceBound event and meta_instance_backing State field
for explicit MetaInstance → Instance binding (prevents ambiguous
linking when two MetaInstances have identical constraints)
- Replace model_card: ModelCard with model_id: ModelId on MetaInstance
(load ModelCard on-demand at placement time)
- Add MetaInstance API endpoints (POST /meta_instance, DELETE)
- Update dashboard to use MetaInstances as primary primitive with
unified display items merging MetaInstances and orphan instances
- Dashboard launches via MetaInstance instead of direct Instance creation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces MetaInstance as a declarative constraint ensuring an instance
matching given parameters (model, sharding, min_nodes) always exists.
The master's reconciliation loop continuously checks for unsatisfied
meta-instances and attempts placement. Connection health checking
verifies that specific IPs (MlxRing) and RDMA interfaces (MlxJaccl)
stored on instances still exist as topology edges, enabling automatic
recovery when cables are swapped or interfaces change.
Also eliminates the master's loopback event path, unifying all event
emission through _apply_and_broadcast for simpler control flow.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- **Auto-open dashboard** in browser on first launch (uses
`~/.exo/.dashboard_opened` marker)
- **Welcome overlay** with "Choose a Model" CTA button when no model
instance is running
- **Tutorial progress messages** during model download → loading → ready
lifecycle stages
- **Fix conversation sidebar** text contrast — bumped to white text,
added active state background
- **Simplify technical jargon** — sharding/instance type/min nodes
hidden behind collapsible "Advanced Options" toggle; strategy display
hidden behind debug mode
- **Polished DMG installer** with drag-to-Applications layout, custom
branded background, and AppleScript-configured window positioning
## Test plan
- [ ] Launch exo for the first time (delete `~/.exo/.dashboard_opened`
to simulate) — browser should auto-open
- [ ] Verify welcome overlay appears on topology when no model is loaded
- [ ] Launch a model and verify download/loading/ready messages appear
in instance cards
- [ ] Check conversation sidebar text is readable (white on dark, yellow
when active)
- [ ] Verify "Advanced Options" toggle hides/shows sharding controls
- [ ] Build DMG with `packaging/dmg/create-dmg.sh` and verify
drag-to-Applications layout
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
There is an issue on Macs where an explicit synchronization is necessary
for memory updates to become visible beyond the L1 cache. As a result, GPU
locks can occur when a spin wait never sees the updated timestamp.
## Changes
Updated in my own personal fork.
## Why It Works
https://github.com/ARM-software/acle/releases
## Test Plan
### Manual Testing
Tested manually that no GPU locks occur (even with multiple simultaneous
instances running) and that the performance differential is negligible
(267 vs. 269 tps on Llama 3.2 1B at roughly 10k context).
------------------------------------------------------
I have seen a GPU lock, specifically when sending a particularly large
chat completion while the model was loading. However, I have since been
unable to reproduce it, and it may have been something I did wrong. Please do
create an issue and tag me if any GPU locks do occur.
---------
Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Users currently manage instances directly, which means if a node
disconnects or connections break, the instance dies and nothing
recreates it. MetaInstance is a declarative primitive: "ensure an
instance matching these parameters always exists." The reconciler
watches for unhealthy or missing backing instances and re-places them
automatically.
## Changes
- **MetaInstance type** (`meta_instance.py`): declarative constraint
with `model_id`, `min_nodes`, optional `node_ids`, and `sharding`
- **Reconciler** (`reconcile.py`): `find_unsatisfied_meta_instances`
checks which MetaInstances lack a healthy backing instance,
`try_place_for_meta_instance` creates one
- **Master loop** (`main.py`): periodically reconciles unsatisfied
MetaInstances; immediate placement on `CreateMetaInstance` command
- **API** (`api.py`): `create_meta_instance` / `delete_meta_instance` /
`GET /meta_instances` endpoints; delete cascades to backing instances
with task cancellation
- **Binding via `meta_instance_id` on Instance** (`instances.py`): no
separate binding event or backing map — the instance carries its parent
MetaInstance ID directly, eliminating race conditions in the reconciler
- **Dashboard**: sidebar shows MetaInstances with their backing instance
status; orphan instances (created directly) still shown separately
- **Tests**: constraint matching, connection health, unsatisfied
detection, exclusive binding, cascade delete with task cancellation
### Recent improvements
- **fix: cancel active tasks on cascade delete** — `DeleteMetaInstance`
now emits `TaskStatusUpdated(Cancelled)` for any Pending/Running tasks
on backing instances before emitting `InstanceDeleted`. Previously,
cascade-deleting backing instances left orphaned task references in
state.
- **Lifecycle logging** — added `logger.info`/`logger.warning` for:
`CreateMetaInstance` (model, min_nodes, sharding), `DeleteMetaInstance`
(with cascade count), reconciler placement success/failure, and retry
decisions with attempt counts in `InstanceHealthReconciler`.
- **GET `/meta_instances` endpoint** — lists all meta-instances without
needing to fetch full state.
- **2 regression tests** — `test_cascade_delete_cancels_active_tasks`
and `test_cascade_delete_skips_completed_tasks` verify the
cascade-delete event sequence.
## Why It Works
Putting `meta_instance_id` on `BaseInstance` makes binding inherent to
instance creation. When the reconciler creates an instance for a
MetaInstance, it tags it via `model_copy` (sketched below). When the instance is deleted,
the binding disappears with it. This avoids the two bugs that a separate
binding mechanism would introduce:
1. Stale exclusion sets — the reconciler loop can't accidentally bind
two MetaInstances to the same instance
2. Delete ordering race — no window between deleting an instance and its
binding where the reconciler could re-place
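A small sketch of that tagging step, assuming `Instance` is a frozen
pydantic model with an optional `meta_instance_id` field:

```python
def bind_to_meta_instance(instance: "Instance", meta_instance_id: "MetaInstanceId") -> "Instance":
    # model_copy(update=...) returns a new frozen Instance that carries its
    # parent MetaInstance ID; there is no separate binding map to clean up.
    return instance.model_copy(update={"meta_instance_id": meta_instance_id})
```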
## Test Plan
### Manual Testing
- Created MetaInstance via dashboard, verified instance placed
- Verified delete cascades (deleting MetaInstance removes backing
instance)
- Verified orphan instances still work independently
### Automated Testing
- 30 tests in `test_meta_instance_edge_cases.py`: lifecycle, retry
logic, error handling, concurrent operations, cascade delete with task
cancellation
- 24 tests in `test_reconcile.py`: constraint matching, connection
health (single/multi-node, edge removal, IP changes), unsatisfied
detection, exclusive binding, idempotency
- All 261 tests pass
- basedpyright 0 errors, ruff clean, dashboard builds
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Fixes #1370
When the macOS app stops exo, GPU/system memory isn't released. This
happens because:
1. The macOS app calls `process.terminate()` (SIGTERM) but the Python
process only registers a graceful shutdown handler for SIGINT, not
SIGTERM. SIGTERM's default Python behavior raises `SystemExit` which
bypasses the cleanup cascade (runner subprocess MLX cleanup via
`mx.clear_cache()`, channel closing, etc.).
2. The app doesn't wait for the process to actually finish cleanup — it
immediately nils out the process reference.
## Changes
**`src/exo/main.py`**: Register SIGTERM handler alongside SIGINT so the
graceful shutdown cascade (`Node.shutdown()` → cancel task group →
worker/runner cleanup → `mx.clear_cache()` + `gc.collect()`) runs
regardless of which signal is received.
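A sketch of the registration, assuming an asyncio event loop and a
`shutdown()` coroutine that kicks off the cascade (the real handler wiring
in `main.py` may differ):

```python
import asyncio
import signal
from typing import Awaitable, Callable

def install_shutdown_handlers(
    loop: asyncio.AbstractEventLoop,
    shutdown: Callable[[], Awaitable[None]],
) -> None:
    # Same graceful path for both signals, so the macOS app's SIGTERM no
    # longer bypasses runner cleanup (mx.clear_cache(), channel close, ...).
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, lambda: asyncio.ensure_future(shutdown()))
```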
**`app/EXO/EXO/ExoProcessController.swift`**: Replace immediate
`process.terminate()` with escalating shutdown per @Evanev7's
suggestion:
1. Send SIGINT via `process.interrupt()` — triggers the registered
Python handler for graceful cleanup
2. Wait up to 5 seconds for the process to exit
3. If still running, escalate to SIGTERM via `process.terminate()`
4. Wait up to 3 seconds
5. If still running, force kill via SIGKILL
The escalation runs in a detached `Task` so the UI updates immediately
(status → stopped) without blocking.
## Why It Works
The root cause is that SIGTERM wasn't triggering the graceful shutdown
path. By registering a SIGTERM handler in Python and sending SIGINT
first from the macOS app, the process gets a chance to run the full
cleanup cascade: cancelling the task group, shutting down runners (which
call `del model; mx.clear_cache(); gc.collect()`), closing channels, and
flushing logs. The escalation to SIGTERM and SIGKILL ensures the process
always terminates even if graceful shutdown hangs.
## Test Plan
### Manual Testing
Hardware: Mac Studio M4 Max 128GB
- Start exo via macOS app, load a model, run inference
- Stop via the toggle switch, verify memory is released without
requiring a system restart
- Test rapid stop/start (restart) to ensure no race conditions
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — no changes
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan Quiney <evanev7@gmail.com>
## Motivation
The current downloads page uses a node-centric card grid layout that is
messy and hard to read — the same model across different nodes appears
in separate cards, and deep nesting wastes space. This makes it
difficult to quickly see which models are on which nodes.
## Changes
Rewrote the downloads page
(`dashboard/src/routes/downloads/+page.svelte`) from a card grid to a
clean table layout:
- **Rows** = models (unique across all nodes)
- **Columns** = nodes (with disk free shown in header)
- **Cells** show status at a glance:
- ✅ Green checkmark + size for completed downloads
- 🟡 Yellow percentage + mini progress bar + speed for active downloads
- `...` for pending downloads
- ❌ Red X for failed downloads
- `--` for models not present on a node
- Delete/download action buttons appear on row hover
- Model name column is sticky on horizontal scroll (for many-node
clusters)
- Models sorted by number of nodes with completed downloads
- Imported shared utilities from `$lib/utils/downloads` instead of
inline re-implementations
### Backend: model directory in download events
- Added `model_directory` field to `BaseDownloadProgress` so all
download status events include the on-disk path
- Added `_model_dir()` helper to `DownloadCoordinator` to compute the
path from `EXO_MODELS_DIR` (see the sketch after this list)
- Dashboard uses this to show file location and enable "open in Finder"
for completed downloads
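A hypothetical sketch of the helper; how exo maps a model ID to an on-disk
directory name is an assumption here, not the coordinator's exact rule:

```python
import os

def _model_dir(model_id: str) -> str:
    # EXO_MODELS_DIR comes from the environment; the default location and the
    # model-ID-to-directory mapping below are assumptions for illustration.
    models_dir = os.environ.get("EXO_MODELS_DIR", os.path.expanduser("~/.exo/models"))
    return os.path.join(models_dir, model_id.replace("/", "--"))
```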
### Info modal
- Clicking a model name opens an info modal showing card details
(family, quantization, capabilities, storage size, layer count, tensor
parallelism support)
### Other fixes
- Fixed model name truncation in the table
- Excluded `tests/start_distributed_test.py` from pytest collection (CLI
script that calls `sys.exit()` at import time)
## Test Plan
- [x] `uv run basedpyright` — 0 errors
- [x] `uv run ruff check` — all passed
- [x] `nix fmt` — clean
- [x] `uv run pytest` — 188 passed, 1 skipped
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>