## Motivation
When a model download fails repeatedly (e.g. `ContentLengthError` on a
large model like `zai-org/GLM-5`), the download coordinator accumulates
duplicate progress callbacks — one per retry cycle. Each callback
independently throttles at 1 event/sec, so after N retries, every
download progress tick generates N events instead of 1. After an hour of
failures (~60 retry cycles), this produces ~60 `NodeDownloadProgress`
events/sec, overwhelming the master, delaying heartbeats, and causing
the node to time itself out.
### The callback accumulation cycle
1. `_start_download_task()` calls
`shard_downloader.on_progress(callback)` which **appends** to a list
2. Download fails → `DownloadFailed` status set, but old callback stays
in the list
3. 60s later: `_emit_existing_download_progress()` scans disk → resets
status to `DownloadPending`
4. Worker sends new `StartDownload` → coordinator accepts (guard didn't
check `DownloadFailed`)
5. `_start_download_task()` appends **another** callback
6. Each callback has its own throttle → N callbacks = N events per
progress tick
## Changes
### Commit 1: `src/exo/worker/main.py`
Move the `DownloadModel` backoff check **before** `TaskCreated` emission
in `plan_step()`. Previously `TaskCreated` was emitted unconditionally
every 0.1s even when backoff blocked the download command.
### Commit 2: `src/exo/download/coordinator.py`
1. **Register progress callback once** in `__post_init__` instead of
per-download in `_start_download_task()`. Uses a per-model throttle dict
instead of per-callback closure variables.
2. **Add `DownloadFailed` to the `_start_download()` guard** so
redundant `_start_download_task()` calls don't happen. Retries still
work because `_emit_existing_download_progress` resets `DownloadFailed`
→ `DownloadPending` by scanning disk every 60s.
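A minimal sketch of the register-once pattern with a per-model throttle, assuming hypothetical names (`shard_downloader`, `emit_event`); the real coordinator differs in detail:

```python
import time
from collections import defaultdict


class DownloadCoordinator:
    THROTTLE_SECONDS = 1.0

    def __init__(self, shard_downloader, emit_event) -> None:
        self.shard_downloader = shard_downloader  # assumed to expose on_progress()
        self.emit_event = emit_event              # assumed event sink
        self._last_emit: dict[str, float] = defaultdict(float)
        # Registered once per coordinator lifetime, not once per download task,
        # so retry cycles can never stack additional callbacks.
        self.shard_downloader.on_progress(self._on_progress)

    def _on_progress(self, model_id: str, progress) -> None:
        now = time.monotonic()
        # Per-model throttle: at most one progress event per second per model,
        # no matter how many retry cycles have happened.
        if now - self._last_emit[model_id] < self.THROTTLE_SECONDS:
            return
        self._last_emit[model_id] = now
        self.emit_event(model_id, progress)
```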
## Why It Works
The root cause was callbacks accumulating in
`ResumableShardDownloader.on_progress_callbacks` (a list that only
appends, never clears). By registering one callback per coordinator
lifetime and guarding against re-entry on `DownloadFailed`, we ensure
exactly one progress event per model per progress tick regardless of how
many retry cycles have occurred.
## Test Plan
### Manual Testing
- Verified the download retry flow: failed download → 60s scan resets
status → new `StartDownload` accepted → download retries with single
callback
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `uv run pytest` — 188 passed
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Info gatherer monitors could silently stop posting events, causing stale
node state after rejoins. The macmon monitor was especially fragile — it
had no retry loop, so a crash or parse error would kill it permanently.
Worse, the unhandled exception would propagate to the TaskGroup and take
down *all* sibling monitors. Additionally, none of the monitors had
timeouts on their subprocess calls, so a hung `system_profiler` or
`networksetup` could stall a monitor indefinitely.
## Changes
- Wrap `_monitor_macmon` in a `while True` retry loop with `except
Exception`, matching the pattern used by all other monitors
- Add `fail_after` timeouts to all monitor loop bodies:
- 10s for lightweight commands (`_monitor_misc`, `_watch_system_info`,
`_gather_iface_map` init)
- 30s for heavier commands (`_monitor_system_profiler_thunderbolt_data`,
`_monitor_thunderbolt_bridge_status`)
- Remove unused `CalledProcessError` and `cast` imports
## Why It Works
All monitors now follow the same resilient pattern: `while True` → `try`
with `fail_after` → `except Exception` (logs warning) → `sleep`. If a
subprocess hangs, the timeout fires and `TimeoutError` is caught by the
existing `except Exception` handler. If macmon crashes, it restarts
after the interval instead of dying permanently. No single monitor
failure can cascade to kill the others.
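A sketch of the shared pattern, assuming anyio (which provides `fail_after`) and illustrative interval/timeout values rather than exo's exact numbers:

```python
import anyio
from loguru import logger


async def monitor_loop(gather_once, interval: float = 5.0, timeout: float = 10.0) -> None:
    # gather_once stands in for one monitor body (e.g. a subprocess call + event post).
    while True:
        try:
            with anyio.fail_after(timeout):   # a hung subprocess raises TimeoutError here
                await gather_once()
        except Exception as e:                # TimeoutError included; nothing escapes the loop
            logger.warning(f"monitor iteration failed, retrying after sleep: {e}")
        await anyio.sleep(interval)           # natural backoff before the next attempt
```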
## Test Plan
### Manual Testing
Hardware: macOS with macmon installed
- Run exo, kill macmon process (`kill $(pgrep macmon)`), verify it
restarts and metrics resume
- Verify all monitors continue posting events after simulated hangs
### Automated Testing
- All 188 existing tests pass
- basedpyright: 0 errors
- ruff: all checks passed
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The downloads page previously only showed the approximate space used by
downloaded models (summed from completed download sizes), but did not
show how much disk space was actually available. This made it difficult
to know if a download would succeed before pressing the button.
Added disk space tracking to the InfoGatherer that polls the models
directory partition every 30 seconds. The DiskUsage type captures total
and available space, which flows through the event system to State and
is exposed via the /state API. The dashboard now displays "X on disk /
Y available" for each node in the downloads view.
Test plan:
- CI
## Motivation
When nodes in an exo cluster run different macOS versions, inference can
produce incompatible results or fail silently. Users currently have no
way to know this from the dashboard.
## Changes
- Added `get_os_version()` to `system_info.py` that returns the macOS
version (e.g. `"15.3"`) or platform name for non-Mac nodes
- Added `os_version` field to `NodeIdentity` and
`StaticNodeInformation`, gathered once at startup
- Propagated `os_version` through the event sourcing pipeline
(`apply.py`)
- Exposed `nodeIdentities` from the dashboard store with `osVersion`
- Added a derived `macosVersionMismatch` check in `+page.svelte` that
triggers when 2+ macOS nodes report different versions
- Rendered a yellow "INCOMPATIBLE macOS VERSIONS" warning badge
(matching the existing Thunderbolt Bridge cycle warning style) with a
hover tooltip listing each node's name and version, in all three
topology view sizes (large, medium, compact)
## Why It Works
The OS version is a static property gathered once at node startup via
`platform.mac_ver()`. It flows through the existing
`StaticNodeInformation` → `NodeGatheredInfo` event → `NodeIdentity`
state pipeline, so no new event types or state fields beyond
`os_version` on `NodeIdentity` are needed. The dashboard derives the
mismatch by comparing `osVersion` across all nodes whose version looks
like a macOS version string (starts with a digit).
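A plausible shape for `get_os_version()`, sketched from the description above (the real implementation may differ):

```python
import platform


def get_os_version() -> str:
    # On macOS, mac_ver() returns e.g. ("15.3", ("", "", ""), "arm64");
    # it returns an empty release string elsewhere, so fall back to the
    # platform name. The dashboard treats versions starting with a digit
    # as macOS versions when checking for mismatches.
    mac_release, _, _ = platform.mac_ver()
    return mac_release if mac_release else platform.system()
```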
## Test Plan
### Manual Testing
Hardware: 4x Mac Studio M2 Ultra 512GB (s18, s17 (2), james, mike),
connected via Thunderbolt
- s18 and s17 (2) on macOS 26.2, james and mike on macOS 26.3
- Verified the "INCOMPATIBLE macOS VERSIONS" warning badge appears in
the topology view
- Verified the hover tooltip lists all four nodes with their respective
versions
- Screenshots attached in comment below
### Automated Testing
- basedpyright: 0 errors
- ruff check: all checks passed
- nix fmt: no formatting changes needed
- Dashboard builds successfully
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Several RDMA/Thunderbolt UX issues in the dashboard and macOS app:
1. **Debug mode showed "? ?" for RDMA connections** — the topology view
only extracted IPs from socket connections, not RDMA interface names
2. **No way to detect if RDMA is actually enabled** — the system only
knew about TB5 hardware and RDMA topology edges, not whether `rdma_ctl`
was enabled on each node
3. **False "RDMA AVAILABLE" info box** — showed on Mac Minis with idle
TB5 ports even when RDMA was already enabled, and on single nodes with
TB5
4. **macOS app only showed local RDMA status** — ran `rdma_ctl` locally
with no visibility into other nodes in the cluster
## Changes
### Dashboard: Fix RDMA debug labels (`0abc90c4`)
- Added `sourceRdmaIface` and `sinkRdmaIface` to `TopologyEdge`
interface
- Updated `TopologyGraph.svelte` and `ModelCard.svelte` to show `RDMA
en2 → en3` instead of `? ?`
### Dashboard: TB5 RDMA info box (`a3795552`, `8ce8e173`)
- Added dismissible info box when 2+ nodes have TB5 hardware but RDMA is
disabled
- Includes setup instructions (Recovery mode → `rdma_ctl enable` →
reboot, TB5 cables, macOS version match)
- Requires 2+ exo nodes with TB5 to avoid false positives from
single-node setups
### Backend: `rdma_ctl status` detection (`ae07239b`)
- Added `RdmaCtlStatus` event to `info_gatherer.py` — runs `rdma_ctl status` with 5s timeout, `shutil.which` guard, and `OSError` handling (polls every 10s on macOS); a sketch follows this list
- Added `NodeRdmaCtlStatus` model to `profiling.py` and `node_rdma_ctl`
field to `State`
- Handle in `apply.py` (event apply + node timeout cleanup)
- Exposed `nodeRdmaCtl` in dashboard store (`app.svelte.ts`)
- Info box detection now uses actual RDMA status instead of TB5 link
speeds
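A hedged sketch of that `rdma_ctl` monitor loop; `post_event` and the parsing of the command output are assumptions, and the real code in `info_gatherer.py` differs:

```python
import shutil

import anyio


async def monitor_rdma_ctl(post_event) -> None:
    if shutil.which("rdma_ctl") is None:
        return                                   # tool missing: not macOS or not installed
    while True:
        try:
            with anyio.fail_after(5):            # 5s timeout on the subprocess call
                result = await anyio.run_process(["rdma_ctl", "status"], check=False)
            post_event(b"enabled" in result.stdout.lower())   # output parsing is assumed
        except (OSError, TimeoutError):
            pass                                 # transient failure; try again next poll
        await anyio.sleep(10)                    # poll every 10s
```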
### Dashboard: Per-node RDMA debug labels (`ae07239b`)
- Debug mode shows `RDMA:ON` (green) or `RDMA:OFF` (dim) per node in
topology view, below the TB bridge label
### macOS app: Cluster-wide RDMA status from `/state` (`a1455b61`,
`d0d77b63`)
- Added `NodeRdmaCtlStatus` to `ClusterState.swift` — decoded from
`/state` endpoint
- Replaced local-only `rdma_ctl status` check with cluster-wide
`nodeRdmaCtl` from state
- Debug section shows per-node RDMA enabled/disabled for all nodes in
the cluster
- Still shows local `ibv_devices` and `ibv_devinfo` details (device
names, active ports) for richer local debugging
## Files changed
| Area | File | Change |
|------|------|--------|
| Backend | `src/exo/utils/info_gatherer/info_gatherer.py` | `RdmaCtlStatus` event, monitor task |
| Backend | `src/exo/shared/types/profiling.py` | `NodeRdmaCtlStatus` model |
| Backend | `src/exo/shared/types/state.py` | `node_rdma_ctl` field |
| Backend | `src/exo/shared/apply.py` | Event handler + timeout cleanup |
| Dashboard | `dashboard/src/lib/stores/app.svelte.ts` | `nodeRdmaCtl` + `nodeThunderbolt` in store |
| Dashboard | `dashboard/src/routes/+page.svelte` | Info box with RDMA detection + instructions |
| Dashboard | `dashboard/src/lib/components/TopologyGraph.svelte` | RDMA debug labels per node + fix "? ?" |
| Dashboard | `dashboard/src/lib/components/ModelCard.svelte` | RDMA interface display fix |
| App | `app/EXO/EXO/Models/ClusterState.swift` | `NodeRdmaCtlStatus` struct + decode |
| App | `app/EXO/EXO/ContentView.swift` | Cluster-wide RDMA view + local device details |
| App | `app/EXO/EXO/Services/NetworkStatusService.swift` | Remove local `rdma_ctl`, keep `ibv_*` |
## Test Plan
- [x] `uv run basedpyright` — 0 errors
- [x] `uv run ruff check` — pass
- [x] `nix fmt` — clean
- [x] `cd dashboard && npm run build` — success
- [x] `uv run pytest` — 188 passed
- [x] Xcode build — compiles (only pre-existing `dist/exo` resource
error)
- [x] Deployed to Mac Minis — `nodeRdmaCtl` shows `enabled: true`, no
false info box
- [x] Deployed to James cluster — RDMA debug labels show correctly
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
When frequently switching between models, it's tedious to search through
the full model list to find ones you've used before. A "Recent" tab
provides quick access to previously launched models.
## Changes
- **New store** (`dashboard/src/lib/stores/recents.svelte.ts`):
`RecentsStore` class persisting recently launched model IDs with
timestamps to localStorage (key: `exo-recent-models`). Caps at 20
entries, deduplicates on re-launch (moves to top).
- **FamilySidebar**: Added "Recent" tab between Favorites and Hub,
conditionally shown when there are recent models.
- **FamilyLogos**: Added clock/history icon for the recents tab.
- **ModelPickerModal**: Added `recentModelIds`/`hasRecents` props.
Derives single-variant `ModelGroup[]` from recent IDs and renders them
using the same `ModelPickerGroup` component as all other tabs —
consistent styling, memory grey-out, favorites, info button, download
indicators.
- **+page.svelte**: Calls `recordRecentLaunch(modelId)` after successful
instance launch. Passes reactive recent state to the modal.
## Why It Works
Follows the exact same pattern as the existing Favorites feature
(localStorage persistence, conditional tab display, reactive Svelte 5
`$state`/`$derived`). Recent models are wrapped as single-variant
`ModelGroup` objects so they reuse `ModelPickerGroup` for identical row
rendering across all tabs.
## Test Plan
### Manual Testing
Hardware: MacBook Pro
- Launch a model instance → reopen model picker → "Recent" tab appears
with the launched model
- Launch a second model → it appears at top of the Recent list
- Re-launch the first model → it moves back to top
- Search within the Recent tab filters the list
- Models that don't fit in memory are greyed out (same as All tab)
- Close/reopen browser → recents persist from localStorage
### Automated Testing
- Dashboard builds successfully (`npm run build`)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
In the model picker, non-runnable models were not clearly separated
between two different cases:
- model exceeds currently available RAM but can fit in total cluster
capacity
- model exceeds total cluster capacity
That made it harder to distinguish "not runnable right now" from "too
large for this cluster."
## Changes
- Added a tri-state fit status in dashboard model-picker flow:
- `fits_now`
- `fits_cluster_capacity`
- `too_large`
- Updated dashboard logic to compute both available cluster RAM and
total cluster RAM.
- Passed fit status through picker components.
- Updated model size color mapping:
- `fits_now` -> white/gray
- `fits_cluster_capacity` -> orange
- `too_large` -> red
- Updated group ordering for non-runnable models:
- orange groups (`fits_cluster_capacity`) are listed above red groups
(`too_large`).
## Why It Works
Launch safety is unchanged: selection is still gated by existing
placement feasibility (`canModelFit`), so models that cannot run now
remain disabled.
The new fit status is used for visual distinction and ordering only:
- runnable now
- fits cluster capacity but not free RAM now
- too large for cluster capacity
## Before
Non-runnable models were not clearly distinguished by temporary capacity
vs hard capacity limit.
## After
Model picker clearly separates states by both color and order:
- Runnable now (white/gray)
- Fits cluster capacity but not free RAM now (orange, disabled)
- Exceeds cluster capacity (red, disabled)
## Test Plan
### Manual testing
- Open model picker with live cluster memory telemetry.
- Verify white/gray models are selectable.
- Verify orange models are disabled and appear above red models.
- Verify red models are disabled and appear below orange models.
### Automated checks
- `npm --prefix dashboard run build` passes.
- `uv run basedpyright` passes.
- `uv run ruff check` passes.
- `npm --prefix dashboard run check` reports existing pre-change Svelte
diagnostics (same known SVG title/a11y items).
- `uv run pytest` in this local environment exits during collection due to the existing `tests/start_distributed_test.py` `SystemExit` usage.
- No Python code was changed.
---------
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
## Summary
`exo` crashes on startup when the system's hard file descriptor limit is
below 65535, which occurs in macOS LaunchDaemon environments, Docker
containers, and other restricted setups.
**Root cause:** `resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft,
65535), hard))` raises `ValueError` when `hard < 65535` because the soft
limit cannot exceed the hard limit.
**Fix:**
- **`main.py`**: Clamp the target soft limit to `min(max(soft, 65535),
hard)` so it never exceeds the hard limit
- **`utils_mlx.py`**: Query current limits instead of hardcoding `(2048,
4096)`, which both crashed on restricted systems and incorrectly lowered
the hard limit when it was set higher
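A sketch of the clamping logic in `main.py`, with an extra guard for the unlimited case (the function name is illustrative):

```python
import resource


def raise_fd_limit(target: int = 65535) -> None:
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    desired = max(soft, target)
    if hard != resource.RLIM_INFINITY:
        # Never request a soft limit above the hard limit: that is what raised
        # ValueError under LaunchDaemon / Docker hard limits below 65535.
        desired = min(desired, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))
```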
## Test plan
- [x] `basedpyright` passes with 0 errors
- [x] `ruff check` passes
- [x] Verify startup works on system with hard limit < 65535 (tested in
macOS LaunchDaemon with hard limit 10240)
- [x] Verify startup still works on default macOS (hard limit typically
unlimited)
VecExt added a .map() convenience method on Vec<T> that simply called
.into_iter().map(f).collect(). This thin wrapper provided no
optimisation benefit and obscured a standard iterator pattern behind a
nightly feature gate and an extra dependency.
Replaced the single call site in exo_pyo3_bindings with the equivalent
iterator chain and removed the ext module, the extend dependency, and
the trait_alias feature gate from the util crate.
Test plan:
- CI
The util crate contained several unused items: NonemptyArray,
BoxedSliceExt, a blanket Sealed trait, an empty alias module, six unused
nightly feature gates, and six unused Cargo dependencies (thiserror,
once_cell, internment, derive_more, bon, recursion).
Removed all items that had no references outside their own definitions,
keeping only WakerDeque, VecExt, and the trait_alias feature gate which
are actively used by the networking and exo_pyo3_bindings crates.
Test plan:
- CI
The pinned nixpkgs provides apple-sdk 26.0, but building MLX requires
SDK 26.2. The upstream package reads versions.json via a relative path
at eval time, so it can't be overridden through callPackage args.
Added a thin overlay that copies the upstream apple-sdk source and
patches only metadata/versions.json to point at SDK 26.2. Also enabled
MLX_BUILD_CPU in the MLX nix build.
This avoids vendoring the entire apple-sdk package (~2200 lines) while
still getting the SDK version we need.
Test plan:
- CI
- Built and ran on two machines connected with Thunderbolt 5 - Kimi K2.5
starts in Tensor+RDMA and seems sensible.
## Motivation
Log rotation adds a bunch of .zst files. Let's send them all in the bug
reports.
This PR also standardises the logs so that they all include the
timestamp.
## Motivation
Standardises exo.log and event_log
## Changes
- exo.log and exo.log.zst are now in an exo_log directory.
- event_log files are now timestamped rather than numbered. The timestamps sort correctly as they are formatted YYYY-MM-DD-HH-MM.
## Test Plan
### Manual Testing
Nothing crashes.
## Motivation
.exo.log currently contains all past history. This just makes it hard to
read and is unnecessarily expensive even on disk.
## Why It Works
Just uses loguru's rotation.
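For reference, a minimal loguru setup along these lines; the path, size, retention count, and zstd helper are assumptions (loguru has no built-in zstd, so a compression callable is passed):

```python
from pathlib import Path

import zstandard
from loguru import logger


def _compress_zst(path: str) -> None:
    # Compress the rotated file to .zst and remove the original.
    with open(path, "rb") as fin, Path(path + ".zst").open("wb") as fout:
        zstandard.ZstdCompressor().copy_stream(fin, fout)
    Path(path).unlink()


logger.add("exo_log/exo.log", rotation="50 MB", retention=5, compression=_compress_zst)
```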
## Test Plan
### Manual Testing
exo.log is new.
<img width="1992" height="706" alt="image"
src="https://github.com/user-attachments/assets/9b293993-1141-43e7-b58e-0ddd2d4eda2e"
/>
The 15-second publish_queue_duration caused messages in peer queues to
be silently dropped. When events are dropped, workers detect gaps in the
event index sequence and request missing events via the NACK path
(RequestEventLog), but this recovery is inefficient.
Removed the timeout configuration - gossipsub now uses its default
behavior without time-based eviction. If queue buildup is a concern,
queue size should be limited explicitly rather than dropping by timeout.
Split error handling to log AllQueuesFullError as a warning (indicates
peers are unresponsive) while keeping NoPeersSubscribedToTopicError
silent (expected during startup and network partitions).
Test plan:
- CI
## Motivation
App keeps losing Local Network permissions.
## Changes
Don't save stuff to the app directory anymore. Instead, save to .exo.
## Test Plan
### Manual Testing
Before:
<img width="1512" height="106" alt="image"
src="https://github.com/user-attachments/assets/544ef57e-b626-484d-941f-2472969aa208"
/>
After:
<img width="433" height="53" alt="Screenshot 2026-02-10 at 17 43 06"
src="https://github.com/user-attachments/assets/3de2856b-cdf6-4b35-aa8f-50440686344f"
/>
The master and API event logs (list[Event]) grew unbounded in RAM for
the lifetime of the process. Events are rarely read back (only for
RequestEventLog when a new node catches up, or the dashboard /events
endpoint).
Introduced a DiskEventLog class that writes length-prefixed msgpack
records to an append-only file, using a bounded LRU cache of byte
offsets for indexed access. On close, the active file is compressed
with ZSTD and rotated into a numbered archive slot, keeping the last 5
archives (events.1.bin.zst through events.5.bin.zst). On construction,
any stale active file from a crash is rotated before opening a fresh
log. The /events API endpoint now streams the JSON array one event at a
time rather than materializing the full list in memory. Deserialization
routes msgpack through json.dumps into Pydantic's validate_json() to
get correct JSON-mode coercion (e.g. string to enum) under strict mode.
This bounds memory usage to the LRU cache (128 entries) regardless of
event volume, while still supporting efficient sequential reads from
disk when needed.
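A compact sketch of the record format (append and indexed read); the real class adds a bounded LRU offset cache, zstd rotation on close, and stale-file recovery:

```python
import struct

import msgpack


class DiskEventLog:
    def __init__(self, path: str) -> None:
        self._file = open(path, "a+b")
        self._offsets: list[int] = []        # real impl: bounded LRU cache, not a full list

    def append(self, event: dict) -> int:
        payload = msgpack.packb(event)
        self._file.seek(0, 2)                # records are only ever appended at the end
        offset = self._file.tell()
        self._file.write(struct.pack(">I", len(payload)) + payload)  # 4-byte length prefix
        self._file.flush()
        self._offsets.append(offset)
        return len(self._offsets) - 1

    def read(self, index: int) -> dict:
        self._file.seek(self._offsets[index])
        (length,) = struct.unpack(">I", self._file.read(4))
        return msgpack.unpackb(self._file.read(length))
```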
Test plan:
- CI
- New unit tests for DiskEventLog: append/read, range queries, rotation
on close, stale file recovery, idempotent close, successive sessions,
archive retention limit (5 max)
- Tested on a cluster with 9000 events. /events continues working.
- On-disk size is 3.9 MiB with ~8000 events, and the compression is very effective.
- Disconnected and rejoined a machine, it rejoined fine.
---------
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
## Motivation
Follows up on #1408. Image models (FLUX, Qwen-Image, etc.) don't have a
`config.json` on HuggingFace. Previously, image model TOML cards were
only loaded into `_card_cache` when `EXO_ENABLE_IMAGE_MODELS=true`. When
the flag was off but an image model was requested (e.g., via
`get_placement_previews`), `ModelCard.load()` fell through to
`fetch_from_hf()` which tried to download `config.json` — causing
`FileNotFoundError` spam. #1408 added defensive error handling; this PR
fixes the root cause.
## Changes
**`model_cards.py`**: Always include `image_model_cards/` in
`CARD_SEARCH_PATH` so image model TOML cards are always loaded into
`_card_cache`. `ModelCard.load()` then finds them directly and never
falls through to `fetch_from_hf()`. The `EXO_ENABLE_IMAGE_MODELS` flag
now controls whether image models appear in `get_model_cards()` (the
listing) rather than whether they're loaded at all.
## Why It Works
`fetch_from_hf()` is designed for text models only (it hardcodes
`tasks=[ModelTask.TextGeneration]` and requires `config.json`). Image
models should never reach that path. By always having them in the cache,
the lookup succeeds immediately and `fetch_from_hf()` is never called.
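A rough sketch of the split, with names (`Card`, `is_image_model`, the paths) assumed for illustration:

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Card:                    # stand-in for exo's ModelCard
    model_id: str
    is_image_model: bool


# Image card TOMLs are always on the search path, so they are always cached.
CARD_SEARCH_PATH = [Path("model_cards"), Path("image_model_cards")]


def list_model_cards(card_cache: dict[str, Card]) -> list[Card]:
    cards = list(card_cache.values())
    if os.environ.get("EXO_ENABLE_IMAGE_MODELS", "").lower() != "true":
        # The flag only hides image models from the listing; they stay cached,
        # so a direct load never falls through to fetch_from_hf().
        cards = [c for c in cards if not c.is_image_model]
    return cards
```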
## Test Plan
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
The chat textarea was fully disabled during response generation,
preventing users from drafting their next message while waiting.
Removed the `disabled={loading}` attribute from the textarea element.
Submission is still blocked during generation by the early return in
`handleSubmit()` and the submit button's own disabled state.
Test plan:
- Ran on one machine. While a model was writing a really long poem, I
typed my next response. I couldn't submit it with Enter, and the button stayed greyed out showing "Processing". I could send the message after generation finished.
Non-tagged builds (test-app branch, manual dispatch) only uploaded the
DMG as a GitHub artifact, which requires authentication to download.
Added an early exit path that uploads the DMG with a commit hash suffix
(EXO-<sha>.dmg) for non-tagged builds, making it publicly accessible
via S3.
Test plan:
- CI
- https://github.com/exo-explore/exo/actions/runs/21837274032/job/63011907978 worked as intended
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
If any of the `InfoGatherer` monitor loops throw an unexpected
exception, the entire monitoring task crashes and never recovers. This
can silently stop memory, network, or Thunderbolt data collection for
the lifetime of the process.
## Changes
Wrap the body of each `while True` monitor loop in a try/except that
logs the exception as a warning and continues to the next iteration. The
sleep at the end of each loop runs regardless, providing natural backoff
before retry.
Affected methods: `_monitor_misc`,
`_monitor_system_profiler_thunderbolt_data`, `_monitor_memory_usage`,
`_watch_system_info`, `_monitor_thunderbolt_bridge_status`.
`_monitor_macmon` already had its own error handling so was left as-is.
## Why It Works
A transient error (e.g., a subprocess failing, a permission issue) in
one iteration no longer kills the loop. The warning log provides
visibility while the monitor continues collecting data on subsequent
iterations.
## Test Plan
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
check_reachable waits for all connection profile checks to be completed.
Since there are retries on failures, this can take around 20s to
resolve, preventing any instances from showing up. This feels very slow
for UX, and it slows down distributed testing.
## Changes
Made check_reachable an async generator.
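A sketch of the async-generator shape, with `check_one` and the peer type assumed:

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def check_reachable(
    peers: list[str],
    check_one: Callable[[str], Awaitable[bool]],
) -> AsyncIterator[str]:
    # Yield each reachable peer as soon as its connection-profile check
    # resolves, instead of waiting ~20s for every check and retry to finish.
    async def tagged(peer: str) -> tuple[str, bool]:
        return peer, await check_one(peer)

    pending = {asyncio.create_task(tagged(p)) for p in peers}
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            peer, ok = task.result()
            if ok:
                yield peer
```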
## Test Plan
### Manual Testing
Works for me at least.
## Motivation
For large prompts and/or slow machines, users are running into GPU
timeout errors very often.
## Changes
Only during prefill, we eval distributed operations. We don't do this
during decode to maintain decode performance.
Raise the prefill step size to 8192 because now we can (we see a speedup
here).
We also now see a 2x speedup in pipeline parallel prefill by disabling
an unnecessary all_gather during prefill.
## Why It Works
GPU timeout errors happen in the Metal backend when GPU operations take
too long without making progress.
By isolating distributed operations, we can allow them to run without
any timeouts.
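A hypothetical sketch of the prefill-only eval (the model and cache interfaces are assumptions, not exo's actual runner code):

```python
import mlx.core as mx

PREFILL_STEP = 8192   # raised step size per this PR


def prefill(model, tokens: list[int], cache) -> None:
    # Evaluate after each chunk so the lazily-built (distributed) graph runs in
    # bounded pieces and no single Metal command buffer is long enough to hit
    # the GPU watchdog. Decode skips the extra eval to keep per-token latency low.
    for start in range(0, len(tokens), PREFILL_STEP):
        chunk = mx.array(tokens[start : start + PREFILL_STEP])[None]
        out = model(chunk, cache=cache)
        mx.eval(out)
```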
## Test Plan
### Manual Testing
No longer hits GPU timeouts on 100k tokens on MiniMax. Also tested on Kimi.
### Automated Testing
Needs more exo bench, but I think this is a good step in the right
direction.
## Motivation
When downloading image models, a missing config.json file triggers a
FileNotFoundError inside download_file_with_retry. This error was being
caught by the generic except Exception handler and retried 3 times
before failing. Then, the whole thing would be retried from the start.
## Changes
- src/exo/download/download_utils.py: Added FileNotFoundError to the
list of immediately-raised exceptions in download_file_with_retry,
alongside HuggingFaceAuthenticationError. This prevents useless retries
when a file genuinely doesn't exist on the remote.
- src/exo/master/api.py: Wrapped ModelCard.load(model_id) in a
try/except that converts failures into an HTTPException(400) with a
descriptive error message, giving API consumers a clear error response.
## Why It Works
- FileNotFoundError is a deterministic error — the file won't appear on
retry, so re-raising immediately avoids 3 wasted download attempts with
exponential backoff.
- Catching ModelCard.load() failures and returning a 400 HTTP response
prevents unhandled exceptions from surfacing as opaque 500 errors in the
API.
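A condensed sketch of the retry shape after this change (names simplified; `HuggingFaceAuthenticationError` is stubbed here for self-containment):

```python
import asyncio


class HuggingFaceAuthenticationError(Exception):
    """Stand-in for exo's auth error type."""


# Deterministic failures that retrying cannot fix are re-raised immediately.
NON_RETRYABLE = (FileNotFoundError, HuggingFaceAuthenticationError)


async def download_with_retry(fetch, n_attempts: int = 3):
    # fetch stands in for one download attempt (a coroutine function).
    for attempt in range(n_attempts):
        try:
            return await fetch()
        except NON_RETRYABLE:
            raise                                # the file will never appear; fail fast
        except Exception:
            if attempt == n_attempts - 1:
                raise
            await asyncio.sleep(2.0 ** attempt)  # transient: back off and retry
```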
## Test Plan
### Manual Testing
Verified an image model not in model cards does not cause an infinite
error loop
## Motivation
No way to view generated or attached images at full resolution in the
dashboard
## Changes
- New ImageLightbox.svelte — fullscreen overlay with download, close
(click-outside/Escape), and transitions
- ChatMessages.svelte — all images (input attachments + generated) are
now clickable to open in lightbox; added expand button to generated
image hover overlay
## Why It Works
Single expandedImageSrc state variable drives the lightbox — set it to
show, null to hide.
## Test Plan
### Manual Testing
- Click any image (attachment thumbnail or generated) → lightbox opens
- Close via Escape, click-outside, or close button
- Download button saves with correct extension
## Motivation
Maybe addresses #1303
## Changes
Add an mx barrier before warmup
## Why It Works
It might, it might not. Shouldn't break anything that's not already
broken though.
## Test Plan
### Manual Testing
The two machines I tested on were fine on GLM 4.7 Flash 8bit (the one in
exo.log in the issue). Obviously not definitive for anything, however.
<img width="594" height="878" alt="image"
src="https://github.com/user-attachments/assets/534d3ad6-16ef-4cb5-b823-43c8d4e1d3c6"
/>
## Motivation
Image models (FLUX, Qwen Image) had no family grouping or quantization
metadata in the dashboard
## Changes
- Added family, quantization, base_model, and capabilities fields to all
18 image model TOML cards (FLUX.1 variants + Qwen Image variants)
- Added FLUX and Qwen Image SVG logos to FamilyLogos.svelte
- Added "flux" and "qwen-image" families to the sidebar and family sort
order
- Added "Image Gen" and "Image Edit" capability filters in
ModelFilterPopover.svelte
- Added image edit icon/badge to ModelPickerGroup.svelte
- Made the model category sidebar scrollable to accommodate the new
entries
- Hid scrollbars on model list panels
## Why It Works
Reuses the existing family/quantization grouping infrastructure that
LLMs already use, extending it to image models with appropriate metadata
and icons
## Test Plan
### Manual Testing
Verified image models behave like text models in the model list dialog
---------
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
exo-bench was gated behind isDarwin in python/parts.nix because it used
exoVenv, which pulls in MLX (Darwin-only). However, exo_bench.py is an
HTTP client that only needs loguru, transformers, huggingface-hub, and
tiktoken.
Made bench a uv workspace member with its own pyproject.toml declaring
only the minimal dependencies. Added a separate benchVenv in parts.nix
built from that workspace member, and moved exo-bench out of the
isDarwin block so it is available on all platforms.
Test plan:
- `nix run .#exo-bench -- --help` prints argparse help
---------
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Our event log request blasted the whole event log over libp2p; now it just sends the next 1000 messages, hopefully allowing nodes to catch up a bit more consistently for long-lived clusters.
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
Kimi produces its own tool id. It gets confused when we generate our own
id.
## Changes
Add id to tool call item and parse Kimi id properly.
## Test Plan
### Manual Testing
<img width="3198" height="522" alt="image"
src="https://github.com/user-attachments/assets/d71ec2be-7f57-49dc-a569-d304cc430f4d"
/>
Long running Kimi K2.5 cluster querying itself through OpenCode running
on the same Kimi K2.5 instance.
## Motivation
A recent commit broke GLM (non-lite) sharding
## Why It Works
The assert is no longer hit, as the isinstance check now includes GLM4MoeDecoderLayer.
Added type stubs to keep the type checker happy.
## Test Plan
### Manual Testing
Runs as expected without gibberish.
## Motivation
Add support for FLUX.1-Kontext-dev, an image editing variant of
FLUX.1-dev
## Changes
- New FluxKontextModelAdapter: Handles Kontext's image-to-image workflow
- encodes input image as conditioning latents with special position IDs,
generates from pure noise
- Model config: 57 transformer blocks (19 joint + 38 single), guidance
scale 4.0, ImageToImage task
- Pipeline updates: Added kontext_image_ids property to PromptData
interface, passed through diffusion runner
- Model cards: Added TOML configs for base, 4-bit, and 8-bit variants
- Dependency: mflux 0.15.4 → 0.15.5
- Utility: tmp/quantize_and_upload.py for quantizing and uploading
models to HuggingFace
## Test Plan
### Manual Testing
Works better than Qwen-Image-Edit
## Motivation
More dimensions for image generation
## Changes
- dashboard/src/lib/components/ImageParamsPanel.svelte: Added
"1024x1365" and "1365x1024" to the sizeOptions array
- dashboard/src/lib/stores/app.svelte.ts: Extended the size type in
ImageGenerationParams interface to include the two new dimension options
## Motivation
Allow users to directly configure num_sync_steps for distributed image
generation instead of deriving it from a factor of total steps.
## Changes
- Added num_sync_steps field to AdvancedImageParams API (range 1-50)
- Changed model configs from num_sync_steps_factor: float to
num_sync_steps: int
- Updated Flux/Qwen configs with direct values (1, 4, 7 respectively)
- Added slider control in dashboard advanced params panel
- Falls back to model default when not specified
## Why It Works
Decouples sync steps from inference steps, giving users direct control
over distributed inference synchronization while preserving sensible
defaults.
## Test Plan
### Manual Testing
- Generate images with various sync step values via dashboard slider
- Verify default behavior when parameter is unset
Runs exo_bench remotely with some nice git QoL.
## usage
Run `tests/auto_bench.sh host1 [host2]`.
exo bench will be run on those hosts for all models currently downloaded, and its output will be saved to `bench/commit_hash/*.json`.
The uv.lock is churning constantly as different UV versions bounce it
between revisions. This is made worse by GitHub automatically hiding the
uv.lock changes, meaning it's hard to notice when this went wrong.
Set a minimum version for `uv` in pyproject.toml to fix this. I tried
quite a few versions (not all) and found 0.8.6 sets the revision to 3,
which I believe is the latest. This is from August 2025 so has been
around for a while.
Test plan:
```
jake@maverick:/data/users/jake/repos/exo/ > git checkout main uv.lock
jake@maverick:/data/users/jake/repos/exo/ > nix shell github:nixos/nixpkgs/3dce7f4a77812afd69efcbfe15e5223f98c5c69e#uv --command sh -c 'uv add pip --frozen && uv lock && uv remove pip --frozen && uv lock && uv --version'
Resolved 140 packages in 147ms
Added pip v26.0.1
Resolved 139 packages in 48ms
Removed pip v26.0.1
uv 0.8.6
```
## Motivation
MiniMax tensor sharding does not provide equivalent outputs to running
it as a single node because RMSNorm weights cannot be split without
affecting the output.
Qwen3Next sharding was broken, and something with Qwen3MoE was likely
changed upstream, as several variables no longer exist.
This also ballooned into fixing prefix caching for non-standard models
as Qwen3Next was behaving weirdly.
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
Worked for an 8-hour-long eval at the same performance and a more similar completion/reasoning token distribution.
---------
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
`download_file_with_retry()` has a `break` in the generic exception
handler that exits the retry loop after the first transient failure.
This means network timeouts, connection resets, and server errors all
cause an immediate download failure — the two remaining retry attempts
never run.
## Changes
**download_utils.py**: Replaced `break` with logging and exponential
backoff in the generic exception handler, matching the existing
rate-limit handler behavior.
Before:
```python
except Exception as e:
    on_connection_lost()
    if attempt == n_attempts - 1:
        raise e
    break  # exits loop immediately
```
After:
```python
except Exception as e:
    on_connection_lost()
    if attempt == n_attempts - 1:
        raise e
    logger.error(f"Download error on attempt {attempt + 1}/{n_attempts} ...")
    logger.error(traceback.format_exc())
    await asyncio.sleep(2.0**attempt)
```
## Why It Works
The `break` statement was bypassing the retry mechanism entirely.
Replacing it with the same log-and-backoff pattern used by the
`HuggingFaceRateLimitError` handler means all 3 attempts are actually
used before giving up. The exponential backoff (1s, 2s) gives transient
issues time to resolve between attempts.
## Test Plan
### Manual Testing
- Downloads that hit transient network errors now retry instead of
failing immediately
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `uv run pytest src/exo/download/tests/ -v` — 11 tests pass
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
A lot of our cleanup logic wasn't running, leading to bad shutdown states
## changes
- added `try: except` blocks around most task groups
- made the runner shutdown code synchronous
- abandon the MpReceiver's recv_async thread on cancellation
  - this only occurs during runner shutdown; the queue closing from the other end should terminate the mp.Queue, cleaning up the thread in its own time. I could try other methods if this is not sufficient.
## outcome
ctrl-c just works now! minus the tokio panic of course :) no more
hypercorn lifespan errors though!
## Motivation
Users browsing models in the picker need to know which models are
already downloaded and ready to run on their cluster, without having to
check the downloads page separately.
## Changes
- **ModelPickerModal.svelte**: Computes per-model download availability
by checking which nodes have `DownloadCompleted` entries and summing
their total RAM against the model's storage size. Passes availability
data to `ModelPickerGroup`. Enhances the info modal with a "Downloaded
on:" section showing node friendly names with green badges.
- **ModelPickerGroup.svelte**: Accepts new `downloadStatus` prop. Shows
a green checkmark-in-circle icon next to models that are downloaded on
sufficient nodes. Tooltip shows which nodes have the model.
- **+page.svelte**: Passes `downloadsData` and `topologyNodes` to
`ModelPickerModal`.
## Why It Works
The download state from `/state` already tracks per-node completed
downloads. The shared `getNodesWithModelDownloaded()` utility (from PR
#1375) finds nodes with `DownloadCompleted` entries for each model.
Total RAM is summed from the topology node data (using `ram_total`, not
`ram_available`) and compared to the model's `storage_size_megabytes` to
determine if there's enough aggregate memory. This is intentionally a
simple heuristic — not a full placement preview.
## Test Plan
### Manual Testing
- Open the model picker modal
- Verify downloaded models show a green checkmark icon
- Verify the checkmark appears dimmer for models downloaded on nodes
with insufficient total RAM
- Click the (i) info button on a downloaded model
- Verify "Downloaded on:" section appears with correct node names
- Verify models with no downloads show no indicator
### Automated Testing
- Dashboard builds successfully (`npm run build`)
- No new Python changes requiring type checking
> **Note:** This is a chained PR. Base branch is
`alexcheema/topology-download-indicators` (#1375).
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>