## Motivation
When a model download fails repeatedly (e.g. `ContentLengthError` on a
large model like `zai-org/GLM-5`), the download coordinator accumulates
duplicate progress callbacks — one per retry cycle. Each callback
independently throttles at 1 event/sec, so after N retries, every
download progress tick generates N events instead of 1. After an hour of
failures (~60 retry cycles), this produces ~60 `NodeDownloadProgress`
events/sec, overwhelming the master, delaying heartbeats, and causing
the node to time itself out.
### The callback accumulation cycle
1. `_start_download_task()` calls
`shard_downloader.on_progress(callback)` which **appends** to a list
2. Download fails → `DownloadFailed` status set, but old callback stays
in the list
3. 60s later: `_emit_existing_download_progress()` scans disk → resets
status to `DownloadPending`
4. Worker sends new `StartDownload` → coordinator accepts (guard didn't
check `DownloadFailed`)
5. `_start_download_task()` appends **another** callback
6. Each callback has its own throttle → N callbacks = N events per
progress tick
## Changes
### Commit 1: `src/exo/worker/main.py`
Move the `DownloadModel` backoff check **before** `TaskCreated` emission
in `plan_step()`. Previously `TaskCreated` was emitted unconditionally
every 0.1s even when backoff blocked the download command.
### Commit 2: `src/exo/download/coordinator.py`
1. **Register progress callback once** in `__post_init__` instead of
per-download in `_start_download_task()`. Uses a per-model throttle dict
instead of per-callback closure variables.
2. **Add `DownloadFailed` to the `_start_download()` guard** so
redundant `_start_download_task()` calls don't happen. Retries still
work because `_emit_existing_download_progress` resets `DownloadFailed`
→ `DownloadPending` by scanning disk every 60s.
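A minimal sketch of the register-once pattern with a per-model throttle, assuming hypothetical names (`shard_downloader`, `emit_event`); the real coordinator differs in detail:

```python
import time
from collections import defaultdict


class DownloadCoordinator:
    THROTTLE_SECONDS = 1.0

    def __init__(self, shard_downloader, emit_event) -> None:
        self.shard_downloader = shard_downloader  # assumed to expose on_progress()
        self.emit_event = emit_event              # assumed event sink
        self._last_emit: dict[str, float] = defaultdict(float)
        # Registered once per coordinator lifetime, not once per download task,
        # so retry cycles can never stack additional callbacks.
        self.shard_downloader.on_progress(self._on_progress)

    def _on_progress(self, model_id: str, progress) -> None:
        now = time.monotonic()
        # Per-model throttle: at most one progress event per second per model,
        # no matter how many retry cycles have happened.
        if now - self._last_emit[model_id] < self.THROTTLE_SECONDS:
            return
        self._last_emit[model_id] = now
        self.emit_event(model_id, progress)
```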
## Why It Works
The root cause was callbacks accumulating in
`ResumableShardDownloader.on_progress_callbacks` (a list that only
appends, never clears). By registering one callback per coordinator
lifetime and guarding against re-entry on `DownloadFailed`, we ensure
exactly one progress event per model per progress tick regardless of how
many retry cycles have occurred.
## Test Plan
### Manual Testing
- Verified the download retry flow: failed download → 60s scan resets
status → new `StartDownload` accepted → download retries with single
callback
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `uv run pytest` — 188 passed
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Info gatherer monitors could silently stop posting events, causing stale
node state after rejoins. The macmon monitor was especially fragile — it
had no retry loop, so a crash or parse error would kill it permanently.
Worse, the unhandled exception would propagate to the TaskGroup and take
down *all* sibling monitors. Additionally, none of the monitors had
timeouts on their subprocess calls, so a hung `system_profiler` or
`networksetup` could stall a monitor indefinitely.
## Changes
- Wrap `_monitor_macmon` in a `while True` retry loop with `except
Exception`, matching the pattern used by all other monitors
- Add `fail_after` timeouts to all monitor loop bodies:
- 10s for lightweight commands (`_monitor_misc`, `_watch_system_info`,
`_gather_iface_map` init)
- 30s for heavier commands (`_monitor_system_profiler_thunderbolt_data`,
`_monitor_thunderbolt_bridge_status`)
- Remove unused `CalledProcessError` and `cast` imports
## Why It Works
All monitors now follow the same resilient pattern: `while True` → `try`
with `fail_after` → `except Exception` (logs warning) → `sleep`. If a
subprocess hangs, the timeout fires and `TimeoutError` is caught by the
existing `except Exception` handler. If macmon crashes, it restarts
after the interval instead of dying permanently. No single monitor
failure can cascade to kill the others.
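A sketch of the shared pattern, assuming anyio (which provides `fail_after`) and illustrative interval/timeout values rather than exo's exact numbers:

```python
import anyio
from loguru import logger


async def monitor_loop(gather_once, interval: float = 5.0, timeout: float = 10.0) -> None:
    # gather_once stands in for one monitor body (e.g. a subprocess call + event post).
    while True:
        try:
            with anyio.fail_after(timeout):   # a hung subprocess raises TimeoutError here
                await gather_once()
        except Exception as e:                # TimeoutError included; nothing escapes the loop
            logger.warning(f"monitor iteration failed, retrying after sleep: {e}")
        await anyio.sleep(interval)           # natural backoff before the next attempt
```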
## Test Plan
### Manual Testing
Hardware: macOS with macmon installed
- Run exo, kill macmon process (`kill $(pgrep macmon)`), verify it
restarts and metrics resume
- Verify all monitors continue posting events after simulated hangs
### Automated Testing
- All 188 existing tests pass
- basedpyright: 0 errors
- ruff: all checks passed
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The downloads page previously only showed the approximate space used by
downloaded models (summed from completed download sizes), but did not
show how much disk space was actually available. This made it difficult
to know if a download would succeed before pressing the button.
Added disk space tracking to the InfoGatherer that polls the models
directory partition every 30 seconds. The DiskUsage type captures total
and available space, which flows through the event system to State and
is exposed via the /state API. The dashboard now displays "X on disk /
Y available" for each node in the downloads view.
Test plan:
- CI
## Motivation
When nodes in an exo cluster run different macOS versions, inference can
produce incompatible results or fail silently. Users currently have no
way to know this from the dashboard.
## Changes
- Added `get_os_version()` to `system_info.py` that returns the macOS
version (e.g. `"15.3"`) or platform name for non-Mac nodes
- Added `os_version` field to `NodeIdentity` and
`StaticNodeInformation`, gathered once at startup
- Propagated `os_version` through the event sourcing pipeline
(`apply.py`)
- Exposed `nodeIdentities` from the dashboard store with `osVersion`
- Added a derived `macosVersionMismatch` check in `+page.svelte` that
triggers when 2+ macOS nodes report different versions
- Rendered a yellow "INCOMPATIBLE macOS VERSIONS" warning badge
(matching the existing Thunderbolt Bridge cycle warning style) with a
hover tooltip listing each node's name and version, in all three
topology view sizes (large, medium, compact)
## Why It Works
The OS version is a static property gathered once at node startup via
`platform.mac_ver()`. It flows through the existing
`StaticNodeInformation` → `NodeGatheredInfo` event → `NodeIdentity`
state pipeline, so no new event types or state fields beyond
`os_version` on `NodeIdentity` are needed. The dashboard derives the
mismatch by comparing `osVersion` across all nodes whose version looks
like a macOS version string (starts with a digit).
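A plausible shape for `get_os_version()`, sketched from the description above (the real implementation may differ):

```python
import platform


def get_os_version() -> str:
    # On macOS, mac_ver() returns e.g. ("15.3", ("", "", ""), "arm64");
    # it returns an empty release string elsewhere, so fall back to the
    # platform name. The dashboard treats versions starting with a digit
    # as macOS versions when checking for mismatches.
    mac_release, _, _ = platform.mac_ver()
    return mac_release if mac_release else platform.system()
```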
## Test Plan
### Manual Testing
Hardware: 4x Mac Studio M2 Ultra 512GB (s18, s17 (2), james, mike),
connected via Thunderbolt
- s18 and s17 (2) on macOS 26.2, james and mike on macOS 26.3
- Verified the "INCOMPATIBLE macOS VERSIONS" warning badge appears in
the topology view
- Verified the hover tooltip lists all four nodes with their respective
versions
- Screenshots attached in comment below
### Automated Testing
- basedpyright: 0 errors
- ruff check: all checks passed
- nix fmt: no formatting changes needed
- Dashboard builds successfully
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Several RDMA/Thunderbolt UX issues in the dashboard and macOS app:
1. **Debug mode showed "? ?" for RDMA connections** — the topology view
only extracted IPs from socket connections, not RDMA interface names
2. **No way to detect if RDMA is actually enabled** — the system only
knew about TB5 hardware and RDMA topology edges, not whether `rdma_ctl`
was enabled on each node
3. **False "RDMA AVAILABLE" info box** — showed on Mac Minis with idle
TB5 ports even when RDMA was already enabled, and on single nodes with
TB5
4. **macOS app only showed local RDMA status** — ran `rdma_ctl` locally
with no visibility into other nodes in the cluster
## Changes
### Dashboard: Fix RDMA debug labels (`0abc90c4`)
- Added `sourceRdmaIface` and `sinkRdmaIface` to `TopologyEdge`
interface
- Updated `TopologyGraph.svelte` and `ModelCard.svelte` to show `RDMA
en2 → en3` instead of `? ?`
### Dashboard: TB5 RDMA info box (`a3795552`, `8ce8e173`)
- Added dismissible info box when 2+ nodes have TB5 hardware but RDMA is
disabled
- Includes setup instructions (Recovery mode → `rdma_ctl enable` →
reboot, TB5 cables, macOS version match)
- Requires 2+ exo nodes with TB5 to avoid false positives from
single-node setups
### Backend: `rdma_ctl status` detection (`ae07239b`)
- Added `RdmaCtlStatus` event to `info_gatherer.py` — runs `rdma_ctl status` with 5s timeout, `shutil.which` guard, and `OSError` handling (polls every 10s on macOS); a sketch follows this list
- Added `NodeRdmaCtlStatus` model to `profiling.py` and `node_rdma_ctl`
field to `State`
- Handle in `apply.py` (event apply + node timeout cleanup)
- Exposed `nodeRdmaCtl` in dashboard store (`app.svelte.ts`)
- Info box detection now uses actual RDMA status instead of TB5 link
speeds
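A hedged sketch of that `rdma_ctl` monitor loop; `post_event` and the parsing of the command output are assumptions, and the real code in `info_gatherer.py` differs:

```python
import shutil

import anyio


async def monitor_rdma_ctl(post_event) -> None:
    if shutil.which("rdma_ctl") is None:
        return                                   # tool missing: not macOS or not installed
    while True:
        try:
            with anyio.fail_after(5):            # 5s timeout on the subprocess call
                result = await anyio.run_process(["rdma_ctl", "status"], check=False)
            post_event(b"enabled" in result.stdout.lower())   # output parsing is assumed
        except (OSError, TimeoutError):
            pass                                 # transient failure; try again next poll
        await anyio.sleep(10)                    # poll every 10s
```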
### Dashboard: Per-node RDMA debug labels (`ae07239b`)
- Debug mode shows `RDMA:ON` (green) or `RDMA:OFF` (dim) per node in
topology view, below the TB bridge label
### macOS app: Cluster-wide RDMA status from `/state` (`a1455b61`,
`d0d77b63`)
- Added `NodeRdmaCtlStatus` to `ClusterState.swift` — decoded from
`/state` endpoint
- Replaced local-only `rdma_ctl status` check with cluster-wide
`nodeRdmaCtl` from state
- Debug section shows per-node RDMA enabled/disabled for all nodes in
the cluster
- Still shows local `ibv_devices` and `ibv_devinfo` details (device
names, active ports) for richer local debugging
## Files changed
| Area | File | Change |
|------|------|--------|
| Backend | `src/exo/utils/info_gatherer/info_gatherer.py` | `RdmaCtlStatus` event, monitor task |
| Backend | `src/exo/shared/types/profiling.py` | `NodeRdmaCtlStatus` model |
| Backend | `src/exo/shared/types/state.py` | `node_rdma_ctl` field |
| Backend | `src/exo/shared/apply.py` | Event handler + timeout cleanup |
| Dashboard | `dashboard/src/lib/stores/app.svelte.ts` | `nodeRdmaCtl` + `nodeThunderbolt` in store |
| Dashboard | `dashboard/src/routes/+page.svelte` | Info box with RDMA detection + instructions |
| Dashboard | `dashboard/src/lib/components/TopologyGraph.svelte` | RDMA debug labels per node + fix "? ?" |
| Dashboard | `dashboard/src/lib/components/ModelCard.svelte` | RDMA interface display fix |
| App | `app/EXO/EXO/Models/ClusterState.swift` | `NodeRdmaCtlStatus` struct + decode |
| App | `app/EXO/EXO/ContentView.swift` | Cluster-wide RDMA view + local device details |
| App | `app/EXO/EXO/Services/NetworkStatusService.swift` | Remove local `rdma_ctl`, keep `ibv_*` |
## Test Plan
- [x] `uv run basedpyright` — 0 errors
- [x] `uv run ruff check` — pass
- [x] `nix fmt` — clean
- [x] `cd dashboard && npm run build` — success
- [x] `uv run pytest` — 188 passed
- [x] Xcode build — compiles (only pre-existing `dist/exo` resource
error)
- [x] Deployed to Mac Minis — `nodeRdmaCtl` shows `enabled: true`, no
false info box
- [x] Deployed to James cluster — RDMA debug labels show correctly
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
When frequently switching between models, it's tedious to search through
the full model list to find ones you've used before. A "Recent" tab
provides quick access to previously launched models.
## Changes
- **New store** (`dashboard/src/lib/stores/recents.svelte.ts`):
`RecentsStore` class persisting recently launched model IDs with
timestamps to localStorage (key: `exo-recent-models`). Caps at 20
entries, deduplicates on re-launch (moves to top).
- **FamilySidebar**: Added "Recent" tab between Favorites and Hub,
conditionally shown when there are recent models.
- **FamilyLogos**: Added clock/history icon for the recents tab.
- **ModelPickerModal**: Added `recentModelIds`/`hasRecents` props.
Derives single-variant `ModelGroup[]` from recent IDs and renders them
using the same `ModelPickerGroup` component as all other tabs —
consistent styling, memory grey-out, favorites, info button, download
indicators.
- **+page.svelte**: Calls `recordRecentLaunch(modelId)` after successful
instance launch. Passes reactive recent state to the modal.
## Why It Works
Follows the exact same pattern as the existing Favorites feature
(localStorage persistence, conditional tab display, reactive Svelte 5
`$state`/`$derived`). Recent models are wrapped as single-variant
`ModelGroup` objects so they reuse `ModelPickerGroup` for identical row
rendering across all tabs.
## Test Plan
### Manual Testing
Hardware: MacBook Pro
- Launch a model instance → reopen model picker → "Recent" tab appears
with the launched model
- Launch a second model → it appears at top of the Recent list
- Re-launch the first model → it moves back to top
- Search within the Recent tab filters the list
- Models that don't fit in memory are greyed out (same as All tab)
- Close/reopen browser → recents persist from localStorage
### Automated Testing
- Dashboard builds successfully (`npm run build`)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
In the model picker, non-runnable models were not clearly separated
between two different cases:
- model exceeds currently available RAM but can fit in total cluster
capacity
- model exceeds total cluster capacity
That made it harder to distinguish "not runnable right now" from "too
large for this cluster."
## Changes
- Added a tri-state fit status in dashboard model-picker flow:
- `fits_now`
- `fits_cluster_capacity`
- `too_large`
- Updated dashboard logic to compute both available cluster RAM and
total cluster RAM.
- Passed fit status through picker components.
- Updated model size color mapping:
- `fits_now` -> white/gray
- `fits_cluster_capacity` -> orange
- `too_large` -> red
- Updated group ordering for non-runnable models:
- orange groups (`fits_cluster_capacity`) are listed above red groups
(`too_large`).
## Why It Works
Launch safety is unchanged: selection is still gated by existing
placement feasibility (`canModelFit`), so models that cannot run now
remain disabled.
The new fit status is used for visual distinction and ordering only:
- runnable now
- fits cluster capacity but not free RAM now
- too large for cluster capacity
## Before
Non-runnable models were not clearly distinguished by temporary capacity
vs hard capacity limit.
## After
Model picker clearly separates states by both color and order:
- Runnable now (white/gray)
- Fits cluster capacity but not free RAM now (orange, disabled)
- Exceeds cluster capacity (red, disabled)
## Test Plan
### Manual testing
- Open model picker with live cluster memory telemetry.
- Verify white/gray models are selectable.
- Verify orange models are disabled and appear above red models.
- Verify red models are disabled and appear below orange models.
### Automated checks
- `npm --prefix dashboard run build` passes.
- `uv run basedpyright` passes.
- `uv run ruff check` passes.
- `npm --prefix dashboard run check` reports existing pre-change Svelte
diagnostics (same known SVG title/a11y items).
- `uv run pytest` in this local environment exits during collection due to the existing `tests/start_distributed_test.py` `SystemExit` usage.
- No Python code was changed.
---------
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
## Summary
`exo` crashes on startup when the system's hard file descriptor limit is
below 65535, which occurs in macOS LaunchDaemon environments, Docker
containers, and other restricted setups.
**Root cause:** `resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft,
65535), hard))` raises `ValueError` when `hard < 65535` because the soft
limit cannot exceed the hard limit.
**Fix:**
- **`main.py`**: Clamp the target soft limit to `min(max(soft, 65535),
hard)` so it never exceeds the hard limit
- **`utils_mlx.py`**: Query current limits instead of hardcoding `(2048,
4096)`, which both crashed on restricted systems and incorrectly lowered
the hard limit when it was set higher
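A sketch of the clamping logic in `main.py`, with an extra guard for the unlimited case (the function name is illustrative):

```python
import resource


def raise_fd_limit(target: int = 65535) -> None:
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    desired = max(soft, target)
    if hard != resource.RLIM_INFINITY:
        # Never request a soft limit above the hard limit: that is what raised
        # ValueError under LaunchDaemon / Docker hard limits below 65535.
        desired = min(desired, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))
```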
## Test plan
- [x] `basedpyright` passes with 0 errors
- [x] `ruff check` passes
- [x] Verify startup works on system with hard limit < 65535 (tested in
macOS LaunchDaemon with hard limit 10240)
- [x] Verify startup still works on default macOS (hard limit typically
unlimited)
VecExt added a .map() convenience method on Vec<T> that simply called
.into_iter().map(f).collect(). This thin wrapper provided no
optimisation benefit and obscured a standard iterator pattern behind a
nightly feature gate and an extra dependency.
Replaced the single call site in exo_pyo3_bindings with the equivalent
iterator chain and removed the ext module, the extend dependency, and
the trait_alias feature gate from the util crate.
Test plan:
- CI
The util crate contained several unused items: NonemptyArray,
BoxedSliceExt, a blanket Sealed trait, an empty alias module, six unused
nightly feature gates, and six unused Cargo dependencies (thiserror,
once_cell, internment, derive_more, bon, recursion).
Removed all items that had no references outside their own definitions,
keeping only WakerDeque, VecExt, and the trait_alias feature gate which
are actively used by the networking and exo_pyo3_bindings crates.
Test plan:
- CI
The pinned nixpkgs provides apple-sdk 26.0, but building MLX requires
SDK 26.2. The upstream package reads versions.json via a relative path
at eval time, so it can't be overridden through callPackage args.
Added a thin overlay that copies the upstream apple-sdk source and
patches only metadata/versions.json to point at SDK 26.2. Also enabled
MLX_BUILD_CPU in the MLX nix build.
This avoids vendoring the entire apple-sdk package (~2200 lines) while
still getting the SDK version we need.
Test plan:
- CI
- Built and ran on two machines connected with Thunderbolt 5 - Kimi K2.5
starts in Tensor+RDMA and seems sensible.
## Motivation
Log rotation adds a bunch of .zst files. Let's send them all in the bug
reports.
This PR also standardises the logs so that they all include the
timestamp.
## Motivation
Standardises exo.log and event_log
## Changes
- exo.log and exo.log.zst are now in an exo_log directory.
- event_log files are now timestamped rather than numbered. The timestamps sort correctly as they are formatted YYYY-MM-DD-HH-MM.
## Test Plan
### Manual Testing
Nothing crashes.
## Motivation
.exo.log currently contains all past history. This just makes it hard to
read and is unnecessarily expensive even on disk.
## Why It Works
Just uses loguru's rotation.
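For reference, a minimal loguru setup along these lines; the path, size, retention count, and zstd helper are assumptions (loguru has no built-in zstd, so a compression callable is passed):

```python
from pathlib import Path

import zstandard
from loguru import logger


def _compress_zst(path: str) -> None:
    # Compress the rotated file to .zst and remove the original.
    with open(path, "rb") as fin, Path(path + ".zst").open("wb") as fout:
        zstandard.ZstdCompressor().copy_stream(fin, fout)
    Path(path).unlink()


logger.add("exo_log/exo.log", rotation="50 MB", retention=5, compression=_compress_zst)
```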
## Test Plan
### Manual Testing
exo.log is new.
<img width="1992" height="706" alt="image"
src="https://github.com/user-attachments/assets/9b293993-1141-43e7-b58e-0ddd2d4eda2e"
/>
The 15-second publish_queue_duration caused messages in peer queues to
be silently dropped. When events are dropped, workers detect gaps in the
event index sequence and request missing events via the NACK path
(RequestEventLog), but this recovery is inefficient.
Removed the timeout configuration - gossipsub now uses its default
behavior without time-based eviction. If queue buildup is a concern,
queue size should be limited explicitly rather than dropping by timeout.
Split error handling to log AllQueuesFullError as a warning (indicates
peers are unresponsive) while keeping NoPeersSubscribedToTopicError
silent (expected during startup and network partitions).
Test plan:
- CI
## Motivation
App keeps losing Local Network permissions.
## Changes
Don't save stuff to the app directory anymore. Instead, save to .exo.
## Test Plan
### Manual Testing
Before:
<img width="1512" height="106" alt="image"
src="https://github.com/user-attachments/assets/544ef57e-b626-484d-941f-2472969aa208"
/>
After:
<img width="433" height="53" alt="Screenshot 2026-02-10 at 17 43 06"
src="https://github.com/user-attachments/assets/3de2856b-cdf6-4b35-aa8f-50440686344f"
/>
The master and API event logs (list[Event]) grew unbounded in RAM for
the lifetime of the process. Events are rarely read back (only for
RequestEventLog when a new node catches up, or the dashboard /events
endpoint).
Introduced a DiskEventLog class that writes length-prefixed msgpack
records to an append-only file, using a bounded LRU cache of byte
offsets for indexed access. On close, the active file is compressed
with ZSTD and rotated into a numbered archive slot, keeping the last 5
archives (events.1.bin.zst through events.5.bin.zst). On construction,
any stale active file from a crash is rotated before opening a fresh
log. The /events API endpoint now streams the JSON array one event at a
time rather than materializing the full list in memory. Deserialization
routes msgpack through json.dumps into Pydantic's validate_json() to
get correct JSON-mode coercion (e.g. string to enum) under strict mode.
This bounds memory usage to the LRU cache (128 entries) regardless of
event volume, while still supporting efficient sequential reads from
disk when needed.
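A compact sketch of the record format (append and indexed read); the real class adds a bounded LRU offset cache, zstd rotation on close, and stale-file recovery:

```python
import struct

import msgpack


class DiskEventLog:
    def __init__(self, path: str) -> None:
        self._file = open(path, "a+b")
        self._offsets: list[int] = []        # real impl: bounded LRU cache, not a full list

    def append(self, event: dict) -> int:
        payload = msgpack.packb(event)
        self._file.seek(0, 2)                # records are only ever appended at the end
        offset = self._file.tell()
        self._file.write(struct.pack(">I", len(payload)) + payload)  # 4-byte length prefix
        self._file.flush()
        self._offsets.append(offset)
        return len(self._offsets) - 1

    def read(self, index: int) -> dict:
        self._file.seek(self._offsets[index])
        (length,) = struct.unpack(">I", self._file.read(4))
        return msgpack.unpackb(self._file.read(length))
```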
Test plan:
- CI
- New unit tests for DiskEventLog: append/read, range queries, rotation
on close, stale file recovery, idempotent close, successive sessions,
archive retention limit (5 max)
- Tested on a cluster with 9000 events. /events continues working.
- On-disk size is 3.9 MiB with ~8000 events, and the compression is very effective.
- Disconnected and rejoined a machine, it rejoined fine.
---------
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
## Motivation
Follows up on #1408. Image models (FLUX, Qwen-Image, etc.) don't have a
`config.json` on HuggingFace. Previously, image model TOML cards were
only loaded into `_card_cache` when `EXO_ENABLE_IMAGE_MODELS=true`. When
the flag was off but an image model was requested (e.g., via
`get_placement_previews`), `ModelCard.load()` fell through to
`fetch_from_hf()` which tried to download `config.json` — causing
`FileNotFoundError` spam. #1408 added defensive error handling; this PR
fixes the root cause.
## Changes
**`model_cards.py`**: Always include `image_model_cards/` in
`CARD_SEARCH_PATH` so image model TOML cards are always loaded into
`_card_cache`. `ModelCard.load()` then finds them directly and never
falls through to `fetch_from_hf()`. The `EXO_ENABLE_IMAGE_MODELS` flag
now controls whether image models appear in `get_model_cards()` (the
listing) rather than whether they're loaded at all.
## Why It Works
`fetch_from_hf()` is designed for text models only (it hardcodes
`tasks=[ModelTask.TextGeneration]` and requires `config.json`). Image
models should never reach that path. By always having them in the cache,
the lookup succeeds immediately and `fetch_from_hf()` is never called.
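A rough sketch of the split, with names (`Card`, `is_image_model`, the paths) assumed for illustration:

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Card:                    # stand-in for exo's ModelCard
    model_id: str
    is_image_model: bool


# Image card TOMLs are always on the search path, so they are always cached.
CARD_SEARCH_PATH = [Path("model_cards"), Path("image_model_cards")]


def list_model_cards(card_cache: dict[str, Card]) -> list[Card]:
    cards = list(card_cache.values())
    if os.environ.get("EXO_ENABLE_IMAGE_MODELS", "").lower() != "true":
        # The flag only hides image models from the listing; they stay cached,
        # so a direct load never falls through to fetch_from_hf().
        cards = [c for c in cards if not c.is_image_model]
    return cards
```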
## Test Plan
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
The chat textarea was fully disabled during response generation,
preventing users from drafting their next message while waiting.
Removed the `disabled={loading}` attribute from the textarea element.
Submission is still blocked during generation by the early return in
`handleSubmit()` and the submit button's own disabled state.
Test plan:
- Ran on one machine. While a model was writing a really long poem, I
typed my next response. I couldn't submit it with Enter, and the button stayed greyed out showing "Processing". I could send the message after generation finished.
Non-tagged builds (test-app branch, manual dispatch) only uploaded the
DMG as a GitHub artifact, which requires authentication to download.
Added an early exit path that uploads the DMG with a commit hash suffix
(EXO-<sha>.dmg) for non-tagged builds, making it publicly accessible
via S3.
Test plan:
- CI
- https://github.com/exo-explore/exo/actions/runs/21837274032/job/63011907978 worked as intended
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
If any of the `InfoGatherer` monitor loops throw an unexpected
exception, the entire monitoring task crashes and never recovers. This
can silently stop memory, network, or Thunderbolt data collection for
the lifetime of the process.
## Changes
Wrap the body of each `while True` monitor loop in a try/except that
logs the exception as a warning and continues to the next iteration. The
sleep at the end of each loop runs regardless, providing natural backoff
before retry.
Affected methods: `_monitor_misc`,
`_monitor_system_profiler_thunderbolt_data`, `_monitor_memory_usage`,
`_watch_system_info`, `_monitor_thunderbolt_bridge_status`.
`_monitor_macmon` already had its own error handling so was left as-is.
## Why It Works
A transient error (e.g., a subprocess failing, a permission issue) in
one iteration no longer kills the loop. The warning log provides
visibility while the monitor continues collecting data on subsequent
iterations.
## Test Plan
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
check_reachable waits for all connection profile checks to be completed.
Since there are retries on failures, this can take around 20s to
resolve, preventing any instances from showing up. This feels very slow
for UX, and it slows down distributed testing.
## Changes
Made check_reachable an async generator.
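A sketch of the async-generator shape, with `check_one` and the peer type assumed:

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def check_reachable(
    peers: list[str],
    check_one: Callable[[str], Awaitable[bool]],
) -> AsyncIterator[str]:
    # Yield each reachable peer as soon as its connection-profile check
    # resolves, instead of waiting ~20s for every check and retry to finish.
    async def tagged(peer: str) -> tuple[str, bool]:
        return peer, await check_one(peer)

    pending = {asyncio.create_task(tagged(p)) for p in peers}
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            peer, ok = task.result()
            if ok:
                yield peer
```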
## Test Plan
### Manual Testing
Works for me at least.
## Motivation
For large prompts and/or slow machines, users are running into GPU
timeout errors very often.
## Changes
Only during prefill, we eval distributed operations. We don't do this
during decode to maintain decode performance.
Raise the prefill step size to 8192 because now we can (we see a speedup
here).
We also now see a 2x speedup in pipeline parallel prefill by disabling
an unnecessary all_gather during prefill.
## Why It Works
GPU timeout errors happen in the Metal backend when GPU operations take
too long without making progress.
By isolating distributed operations, we can allow them to run without
any timeouts.
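A hypothetical sketch of the prefill-only eval (the model and cache interfaces are assumptions, not exo's actual runner code):

```python
import mlx.core as mx

PREFILL_STEP = 8192   # raised step size per this PR


def prefill(model, tokens: list[int], cache) -> None:
    # Evaluate after each chunk so the lazily-built (distributed) graph runs in
    # bounded pieces and no single Metal command buffer is long enough to hit
    # the GPU watchdog. Decode skips the extra eval to keep per-token latency low.
    for start in range(0, len(tokens), PREFILL_STEP):
        chunk = mx.array(tokens[start : start + PREFILL_STEP])[None]
        out = model(chunk, cache=cache)
        mx.eval(out)
```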
## Test Plan
### Manual Testing
No longer hits GPU timeouts on 100k tokens on MiniMax. Also tested on Kimi.
### Automated Testing
Needs more exo bench, but I think this is a good step in the right
direction.
## Motivation
When downloading image models, a missing config.json file triggers a
FileNotFoundError inside download_file_with_retry. This error was being
caught by the generic except Exception handler and retried 3 times
before failing. Then, the whole thing would be retried from the start.
## Changes
- src/exo/download/download_utils.py: Added FileNotFoundError to the
list of immediately-raised exceptions in download_file_with_retry,
alongside HuggingFaceAuthenticationError. This prevents useless retries
when a file genuinely doesn't exist on the remote.
- src/exo/master/api.py: Wrapped ModelCard.load(model_id) in a
try/except that converts failures into an HTTPException(400) with a
descriptive error message, giving API consumers a clear error response.
## Why It Works
- FileNotFoundError is a deterministic error — the file won't appear on
retry, so re-raising immediately avoids 3 wasted download attempts with
exponential backoff.
- Catching ModelCard.load() failures and returning a 400 HTTP response
prevents unhandled exceptions from surfacing as opaque 500 errors in the
API.
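A condensed sketch of the retry shape after this change (names simplified; `HuggingFaceAuthenticationError` is stubbed here for self-containment):

```python
import asyncio


class HuggingFaceAuthenticationError(Exception):
    """Stand-in for exo's auth error type."""


# Deterministic failures that retrying cannot fix are re-raised immediately.
NON_RETRYABLE = (FileNotFoundError, HuggingFaceAuthenticationError)


async def download_with_retry(fetch, n_attempts: int = 3):
    # fetch stands in for one download attempt (a coroutine function).
    for attempt in range(n_attempts):
        try:
            return await fetch()
        except NON_RETRYABLE:
            raise                                # the file will never appear; fail fast
        except Exception:
            if attempt == n_attempts - 1:
                raise
            await asyncio.sleep(2.0 ** attempt)  # transient: back off and retry
```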
## Test Plan
### Manual Testing
Verified an image model not in model cards does not cause an infinite
error loop
## Motivation
No way to view generated or attached images at full resolution in the
dashboard
## Changes
- New ImageLightbox.svelte — fullscreen overlay with download, close
(click-outside/Escape), and transitions
- ChatMessages.svelte — all images (input attachments + generated) are
now clickable to open in lightbox; added expand button to generated
image hover overlay
## Why It Works
Single expandedImageSrc state variable drives the lightbox — set it to
show, null to hide.
## Test Plan
### Manual Testing
- Click any image (attachment thumbnail or generated) → lightbox opens
- Close via Escape, click-outside, or close button
- Download button saves with correct extension
## Motivation
Maybe addresses #1303
## Changes
Add an mx barrier before warmup
## Why It Works
It might, it might not. Shouldn't break anything that's not already
broken though.
## Test Plan
### Manual Testing
The two machines I tested on were fine on GLM 4.7 Flash 8bit (the one in
exo.log in the issue). Obviously not definitive for anything, however.
<img width="594" height="878" alt="image"
src="https://github.com/user-attachments/assets/534d3ad6-16ef-4cb5-b823-43c8d4e1d3c6"
/>
## Motivation
Image models (FLUX, Qwen Image) had no family grouping or quantization
metadata in the dashboard
## Changes
- Added family, quantization, base_model, and capabilities fields to all
18 image model TOML cards (FLUX.1 variants + Qwen Image variants)
- Added FLUX and Qwen Image SVG logos to FamilyLogos.svelte
- Added "flux" and "qwen-image" families to the sidebar and family sort
order
- Added "Image Gen" and "Image Edit" capability filters in
ModelFilterPopover.svelte
- Added image edit icon/badge to ModelPickerGroup.svelte
- Made the model category sidebar scrollable to accommodate the new
entries
- Hid scrollbars on model list panels
## Why It Works
Reuses the existing family/quantization grouping infrastructure that
LLMs already use, extending it to image models with appropriate metadata
and icons
## Test Plan
### Manual Testing
Verified image models behave like text models in the model list dialog
---------
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
exo-bench was gated behind isDarwin in python/parts.nix because it used
exoVenv, which pulls in MLX (Darwin-only). However, exo_bench.py is an
HTTP client that only needs loguru, transformers, huggingface-hub, and
tiktoken.
Made bench a uv workspace member with its own pyproject.toml declaring
only the minimal dependencies. Added a separate benchVenv in parts.nix
built from that workspace member, and moved exo-bench out of the
isDarwin block so it is available on all platforms.
Test plan:
- `nix run .#exo-bench -- --help` prints argparse help
---------
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Our event log request blasted the whole event log over libp2p; now it just sends the next 1000 messages, hopefully allowing nodes to catch up a bit more consistently for long-lived clusters.
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
Kimi produces its own tool id. It gets confused when we generate our own
id.
## Changes
Add id to tool call item and parse Kimi id properly.
## Test Plan
### Manual Testing
<img width="3198" height="522" alt="image"
src="https://github.com/user-attachments/assets/d71ec2be-7f57-49dc-a569-d304cc430f4d"
/>
Long running Kimi K2.5 cluster querying itself through OpenCode running
on the same Kimi K2.5 instance.
## Motivation
A recent commit broke GLM (non-lite) sharding
## Why It Works
The assert is no longer hit, as the isinstance check now includes GLM4MoeDecoderLayer.
Added type stubs to keep the type checker happy.
## Test Plan
### Manual Testing
Runs as expected without gibberish.
## Motivation
Add support for FLUX.1-Kontext-dev, an image editing variant of
FLUX.1-dev
## Changes
- New FluxKontextModelAdapter: Handles Kontext's image-to-image workflow
- encodes input image as conditioning latents with special position IDs,
generates from pure noise
- Model config: 57 transformer blocks (19 joint + 38 single), guidance
scale 4.0, ImageToImage task
- Pipeline updates: Added kontext_image_ids property to PromptData
interface, passed through diffusion runner
- Model cards: Added TOML configs for base, 4-bit, and 8-bit variants
- Dependency: mflux 0.15.4 → 0.15.5
- Utility: tmp/quantize_and_upload.py for quantizing and uploading
models to HuggingFace
## Test Plan
### Manual Testing
Works better than Qwen-Image-Edit
## Motivation
More dimensions for image generation
## Changes
- dashboard/src/lib/components/ImageParamsPanel.svelte: Added
"1024x1365" and "1365x1024" to the sizeOptions array
- dashboard/src/lib/stores/app.svelte.ts: Extended the size type in
ImageGenerationParams interface to include the two new dimension options
## Motivation
Allow users to directly configure num_sync_steps for distributed image
generation instead of deriving it from a factor of total steps.
## Changes
- Added num_sync_steps field to AdvancedImageParams API (range 1-50)
- Changed model configs from num_sync_steps_factor: float to
num_sync_steps: int
- Updated Flux/Qwen configs with direct values (1, 4, 7 respectively)
- Added slider control in dashboard advanced params panel
- Falls back to model default when not specified
## Why It Works
Decouples sync steps from inference steps, giving users direct control
over distributed inference synchronization while preserving sensible
defaults.
## Test Plan
### Manual Testing
- Generate images with various sync step values via dashboard slider
- Verify default behavior when parameter is unset
Runs exo_bench remotely with some nice git QoL.
## usage
Run `tests/auto_bench.sh host1 [host2]`.
exo bench will be run on those hosts for all models currently downloaded, and its output will be saved to `bench/commit_hash/*.json`.
The uv.lock is churning constantly as different UV versions bounce it
between revisions. This is made worse by GitHub automatically hiding the
uv.lock changes, meaning it's hard to notice when this went wrong.
Set a minimum version for `uv` in pyproject.toml to fix this. I tried
quite a few versions (not all) and found 0.8.6 sets the revision to 3,
which I believe is the latest. This is from August 2025 so has been
around for a while.
Test plan:
```
jake@maverick:/data/users/jake/repos/exo/ > git checkout main uv.lock
jake@maverick:/data/users/jake/repos/exo/ > nix shell github:nixos/nixpkgs/3dce7f4a77812afd69efcbfe15e5223f98c5c69e#uv --command sh -c 'uv add pip --frozen && uv lock && uv remove pip --frozen && uv lock && uv --version'
Resolved 140 packages in 147ms
Added pip v26.0.1
Resolved 139 packages in 48ms
Removed pip v26.0.1
uv 0.8.6
```
## Motivation
MiniMax tensor sharding does not provide equivalent outputs to running
it as a single node because RMSNorm weights cannot be split without
affecting the output.
Qwen3Next sharding was broken, and something with Qwen3MoE was likely
changed upstream, as several variables no longer exist.
This also ballooned into fixing prefix caching for non-standard models
as Qwen3Next was behaving weirdly.
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
Worked for an 8-hour-long eval at the same performance and a more similar completion/reasoning token distribution.
---------
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
`download_file_with_retry()` has a `break` in the generic exception
handler that exits the retry loop after the first transient failure.
This means network timeouts, connection resets, and server errors all
cause an immediate download failure — the two remaining retry attempts
never run.
## Changes
**download_utils.py**: Replaced `break` with logging and exponential
backoff in the generic exception handler, matching the existing
rate-limit handler behavior.
Before:
```python
except Exception as e:
    on_connection_lost()
    if attempt == n_attempts - 1:
        raise e
    break  # exits loop immediately
```
After:
```python
except Exception as e:
    on_connection_lost()
    if attempt == n_attempts - 1:
        raise e
    logger.error(f"Download error on attempt {attempt + 1}/{n_attempts} ...")
    logger.error(traceback.format_exc())
    await asyncio.sleep(2.0**attempt)
```
## Why It Works
The `break` statement was bypassing the retry mechanism entirely.
Replacing it with the same log-and-backoff pattern used by the
`HuggingFaceRateLimitError` handler means all 3 attempts are actually
used before giving up. The exponential backoff (1s, 2s) gives transient
issues time to resolve between attempts.
## Test Plan
### Manual Testing
- Downloads that hit transient network errors now retry instead of
failing immediately
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `uv run pytest src/exo/download/tests/ -v` — 11 tests pass
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
A lot of our cleanup logic wasn't running, leading to bad shutdown states
## changes
- added `try: except` blocks around most task groups
- made the runner shutdown code synchronous
- abandon the MpReceiver's recv_async thread on cancellation
  - this only occurs during runner shutdown; the queue closing from the other end should terminate the mp.Queue, cleaning up the thread in its own time. I could try other methods if this is not sufficient.
## outcome
ctrl-c just works now! minus the tokio panic of course :) no more
hypercorn lifespan errors though!
## Motivation
Users browsing models in the picker need to know which models are
already downloaded and ready to run on their cluster, without having to
check the downloads page separately.
## Changes
- **ModelPickerModal.svelte**: Computes per-model download availability
by checking which nodes have `DownloadCompleted` entries and summing
their total RAM against the model's storage size. Passes availability
data to `ModelPickerGroup`. Enhances the info modal with a "Downloaded
on:" section showing node friendly names with green badges.
- **ModelPickerGroup.svelte**: Accepts new `downloadStatus` prop. Shows
a green checkmark-in-circle icon next to models that are downloaded on
sufficient nodes. Tooltip shows which nodes have the model.
- **+page.svelte**: Passes `downloadsData` and `topologyNodes` to
`ModelPickerModal`.
## Why It Works
The download state from `/state` already tracks per-node completed
downloads. The shared `getNodesWithModelDownloaded()` utility (from PR
#1375) finds nodes with `DownloadCompleted` entries for each model.
Total RAM is summed from the topology node data (using `ram_total`, not
`ram_available`) and compared to the model's `storage_size_megabytes` to
determine if there's enough aggregate memory. This is intentionally a
simple heuristic — not a full placement preview.
## Test Plan
### Manual Testing
- Open the model picker modal
- Verify downloaded models show a green checkmark icon
- Verify the checkmark appears dimmer for models downloaded on nodes
with insufficient total RAM
- Click the (i) info button on a downloaded model
- Verify "Downloaded on:" section appears with correct node names
- Verify models with no downloads show no indicator
### Automated Testing
- Dashboard builds successfully (`npm run build`)
- No new Python changes requiring type checking
> **Note:** This is a chained PR. Base branch is
`alexcheema/topology-download-indicators` (#1375).
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>