Commit Graph

2282 Commits

Author SHA1 Message Date
Evan
65b9c9df81 remove layer loading callback 2026-04-15 11:49:39 +01:00
rltakashige
2cd66ae4cf Fix out of order event idx causing fatal crashes (#1894)
## Motivation

<img width="828" height="373" alt="Screenshot 2026-04-14 at 22 56 52"
src="https://github.com/user-attachments/assets/f8f48c1d-68c5-4acc-a6de-9d180672da9d"
/>

If `is_new_master=True`, `_elect_loop` creates a new EventRouter before the
worker has registered its receivers. The event router then runs `_run_ext_in`,
and `buf.drain_indexed()` picks off events even though
`self.internal_outbound` is not yet fully populated.

When the worker finally does request events, the next event it receives is
not the first event, so the worker crashes.

## Changes

Start the event router after all the receivers are registered

## Why It Works

self.internal_outbound is populated before the loop begins.
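
The ordering fix can be sketched as follows (the class and method names here are illustrative stand-ins, not exo's actual API): the router's drain loop must not start until every receiver is registered, otherwise early events are delivered to nobody.

```python
import asyncio


class EventRouter:
    def __init__(self):
        # receiver_id -> queue; analogous to self.internal_outbound
        self.internal_outbound = {}

    def register(self, receiver_id):
        self.internal_outbound[receiver_id] = asyncio.Queue()

    async def run(self, events):
        # Any event drained here before a receiver registers is lost to it.
        for event in events:
            for queue in self.internal_outbound.values():
                queue.put_nowait(event)


async def main():
    router = EventRouter()
    # The fix: register all receivers *before* starting the router loop.
    router.register("worker-1")
    await router.run(["event-0", "event-1"])
    q = router.internal_outbound["worker-1"]
    return [q.get_nowait(), q.get_nowait()]


print(asyncio.run(main()))  # ['event-0', 'event-1']
```
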

## Test Plan

### Manual Testing
No more crashes observed in testing (it's actually quite easy to
reproduce the issue if you have one node with this fix but the other
node on main).

I'm convinced this is a fix, at least.
2026-04-15 08:46:02 +00:00
rltakashige
2ecefa0cfe Fix Qwen3-VL and autodetect vision config (#1893)
## Motivation

Qwen3-VL tensor parallelism doesn't work at the moment, and vision inference misbehaves.

## Test Plan

### Manual Testing
Works now.
2026-04-14 23:05:55 +01:00
rltakashige
b8eaf707a8 Add gemma 4 tensor parallelism (#1891) 2026-04-14 20:31:59 +01:00
rltakashige
8d81811b89 Try harder to clean up processes nicely (#1889)
## Motivation

Model loading is actually quite reliable now. There is no need to kill the
process just because a slow SSD or a massive model makes loading take a
while; the user can shut the instance down if necessary.

This was a major cause of signal=9 issues, although not the only one (it can
happen during inference too?).
The reason signal=9 is so bad is that RDMA will no longer work until
restart if this ever happens.

## Changes

- no more model load timeout
- no more crazy sigkills
- try harder to clean up processes on model shutdown

## Test Plan

### Manual Testing
Tested with some RDMA instances
2026-04-14 16:37:49 +01:00
rltakashige
f2709dcde6 Add prefix cache flag to exo bench (#1888)
## Motivation
When using exo bench extensively, there are many cases where prefix caching
could speed up the benchmarks, especially when the focus is on token
generation.

At the same time, it's very clear that prefix-caching decoded tokens is not
very useful in most current scenarios. Surprisingly, even for non-thinking
models, the chat template formats a continued conversation such that the
existing cache is not effective.

We already (slightly accidentally) do this for the batch generator; we
should do it for the sequential generator too.

## Changes

exo bench can now be sped up with a prefix-caching flag. Of course, for the
most accurate pp results it is better to leave it off, but it speeds up tg
and large benchmark runs significantly.
The methodology docs were updated to match.

## Test Plan

### Manual Testing
Tested on many configurations; the difference in results is negligible,
even with multiple --pp options.
2026-04-14 11:12:58 +01:00
ciaranbor
77ffe039b3 Complete responses api usage response field (#1885)
## Motivation

The Responses API usage response was missing `input_tokens_details` and
`output_tokens_details`. The chat completions API already reports these.

## Changes

- Added `InputTokensDetails` (`cached_tokens`) and `OutputTokensDetails`
(`reasoning_tokens`) to `ResponseUsage`
- Extracted shared `_build_response_usage()` helper for both streaming
and non-streaming paths
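
A minimal sketch of what the shared helper might look like (the field names follow the OpenAI Responses API; the dataclass shapes and parameters are assumptions, not exo's actual code):

```python
from dataclasses import dataclass


@dataclass
class InputTokensDetails:
    cached_tokens: int


@dataclass
class OutputTokensDetails:
    reasoning_tokens: int


@dataclass
class ResponseUsage:
    input_tokens: int
    output_tokens: int
    total_tokens: int
    input_tokens_details: InputTokensDetails
    output_tokens_details: OutputTokensDetails


def _build_response_usage(prompt_tokens, completion_tokens, cached, reasoning):
    # One construction point shared by the streaming and non-streaming paths.
    return ResponseUsage(
        input_tokens=prompt_tokens,
        output_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        input_tokens_details=InputTokensDetails(cached_tokens=cached),
        output_tokens_details=OutputTokensDetails(reasoning_tokens=reasoning),
    )


usage = _build_response_usage(120, 40, cached=100, reasoning=25)
print(usage.total_tokens)  # 160
```
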

## Test Plan

### Manual Testing

4-node cluster, `Qwen3-30B-A3B-4bit` — verified both detail objects
present with correct values in streaming and non-streaming responses.

### Automated Testing

13 tests in `test_openai_responses_api.py`.
2026-04-13 17:38:33 +00:00
rltakashige
3f0df404a5 Reduce memory consumption by adding Flash Attention to Qwen3.5 and Gemma 4, and fix RotatingKVCache prefix cache memory leak (#1886)
## Motivation

Part 1 of many memory improvements.

## Changes
As written in the title

## Test Plan

### Manual Testing
Gemma 4 26B cache reduced from 54 GB to 10 GB per 100k tokens; Qwen3.5 35B
A3B cache reduced from 21 GB to 7 GB per 100k tokens.
2026-04-13 18:32:17 +01:00
Evan Quiney
9b381f7bfe bump and simplify flake (#1866)
seems like stablepkgs swiftfmt works now! also bump macmon to 0.7
2026-04-13 15:45:17 +00:00
Alex Cheema
d2f67b5d10 dashboard: group Gemma under Google with proper logo (#1883)
## Motivation

In the dashboard model picker sidebar, the Gemma 4 models were showing
up under a "Gemma" family with the generic fallback tick/checkmark icon
(the default case in `FamilyLogos.svelte`), since no dedicated logo
branch existed for `family === "gemma"`. Every other vendor (Meta,
NVIDIA, OpenAI, DeepSeek, Qwen, …) has its own brand mark.

Gemma is Google's model family, so it should live under a **Google**
bucket that future Google-authored models can join, and it should render
with a proper Google logo in the same style as its neighbors.

## Changes

- `dashboard/src/lib/components/FamilyLogos.svelte`: added a `family ===
"google"` branch rendering a monochrome Google "G" as a single `<path>`
inside the shared `24×24` viewBox with `fill="currentColor"`, matching
the other vendor logos.
- `dashboard/src/lib/components/FamilySidebar.svelte`: added `google:
"Google"` to the `familyNames` display map.
- `dashboard/src/lib/components/ModelPickerModal.svelte`: inserted
`"google"` into the `familyOrder` array (next to `"llama"`) so the
vendor has a deterministic sort position.
- `resources/inference_model_cards/mlx-community--gemma-4-*.toml` (16
files): changed `family = "gemma"` → `family = "google"`. `base_model =
"Gemma 4 …"` is unchanged, so the model titles still read "Gemma".

## Why It Works

The sidebar builds its family list from whatever values appear in
`model.family` across the loaded model cards (`ModelPickerModal.svelte`
`uniqueFamilies`). Renaming the family string on the 16 Gemma cards from
`"gemma"` to `"google"` collapses them into a single "Google" bucket,
and the new logo branch + display-name map entry gives that bucket a
real brand mark and label. All other logos share the same `w-6 h-6 /
viewBox="0 0 24 24" / fill="currentColor"` shape, so inheriting
`text-exo-yellow` / `text-white/50` just works.

## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max
- `cd dashboard && npm install && npm run build` — dashboard builds
cleanly.
- `uv run exo`, opened `http://localhost:52415`, clicked **SELECT
MODEL**:
- sidebar shows a **Google** entry with a monochrome Google "G" logo in
the same style as Meta / NVIDIA / etc.
  - old "Gemma" entry with the generic tick is gone.
- clicking **Google** filters to the Gemma 4 variants (e2b / e4b / 26B
A4B / 31B).
- hover/selected color states switch between `text-white/50` and
`text-exo-yellow` correctly.

### Automated Testing
- No new tests — this is a cosmetic grouping/logo change. Existing
dashboard build verifies the Svelte + TS compiles.
2026-04-13 14:08:15 +00:00
chaoliang yan
8973503322 fix: use configured api_port for IP connectivity probes (#1877)
## Motivation

Fixes #1861

When `--api-port` is set to a non-default value (e.g., `--api-port
55555`), the IP connectivity discovery system still probes peers on the
hardcoded default port 52415. Since the API is not listening on 52415,
all reachability checks fail, the topology reports zero reachable nodes,
and the dashboard shows "No valid configurations for current settings."

## Changes

Thread the configured `api_port` from `Args` through `Worker` into the
reachability probe functions:

- `net_profile.py`: `check_reachability()` and `check_reachable()`
accept an `api_port` parameter (default 52415 for backward
compatibility)
- `worker/main.py`: `Worker` stores `api_port` and passes it to
`check_reachable()`, uses it in `Multiaddr` construction and the mDNS
connection filter
- `main.py`: passes `args.api_port` to the `Worker` constructor

## Why It Works

The `/node_id` endpoint used by reachability probes is served by the
FastAPI app, which binds to `args.api_port`. The probes must use the
same port the API is actually listening on. Before this fix, the port
was hardcoded in three places in `net_profile.py` and `worker/main.py`;
now it uses the value from the CLI flag.
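
The backward-compatible signature change can be sketched like this (the function body is an assumption; only the parameter threading and the `/node_id` probe target come from the PR):

```python
DEFAULT_API_PORT = 52415


def check_reachability(host: str, api_port: int = DEFAULT_API_PORT) -> str:
    # The probe must target the port the FastAPI app actually binds to
    # (args.api_port), not a hardcoded constant. Defaulting to 52415
    # preserves behavior for callers that don't pass the flag.
    return f"http://{host}:{api_port}/node_id"


print(check_reachability("10.0.0.2"))                  # http://10.0.0.2:52415/node_id
print(check_reachability("10.0.0.2", api_port=55555))  # http://10.0.0.2:55555/node_id
```
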

## Test Plan

### Manual Testing
Hardware: not available for multi-node testing
- Verified ruff passes on all changed files
- Code inspection: traced `api_port` flow from `Args.parse()` →
`Node.create()` → `Worker.__init__()` → `_poll_connection_updates()` →
`check_reachable()` → `check_reachability()` → HTTP probe URL

### Automated Testing
- No existing automated tests cover the reachability probe code path
- The new `api_port` parameter defaults to `52415`, so all existing
behavior is preserved when `--api-port` is not specified

---------

Co-authored-by: lawrence3699 <lawrence3699@users.noreply.github.com>
Co-authored-by: Evan <evanev7@gmail.com>
2026-04-13 14:01:42 +00:00
Alex Cheema
eb9228615f models: add MiniMax M2.7 cards (#1884)
## Motivation

The mlx-community [MiniMax-M2.7
collection](https://huggingface.co/collections/mlx-community/minimax-m27)
landed but exo didn't have model cards for any of the variants yet, so
they weren't selectable from the dashboard model picker. Adding cards
also makes them discoverable under the existing MiniMax family entry.

## Changes

Added 6 new model cards in `resources/inference_model_cards/`, one per
quant of MiniMax M2.7:

- `mlx-community--MiniMax-M2.7.toml` (bf16, full precision — 457 GB)
- `mlx-community--MiniMax-M2.7-4bit.toml` (128 GB)
- `mlx-community--MiniMax-M2.7-4bit-mxfp4.toml` (121 GB)
- `mlx-community--MiniMax-M2.7-5bit.toml` (157 GB)
- `mlx-community--MiniMax-M2.7-6bit.toml` (185 GB)
- `mlx-community--MiniMax-M2.7-8bit.toml` (243 GB)

All six use `family = "minimax"` and share `base_model = "MiniMax M2.7"`
so they collapse into a single group in the picker with the existing
MiniMax logo. Architecture fields (`n_layers = 62`, `hidden_size =
3072`, `num_key_value_heads = 8`, `context_length = 196608`) were read
from each repo's `config.json`; `storage_size.in_bytes` was summed from
the HF tree API per repo.

`capabilities = ["text", "thinking"]` follows the existing MiniMax M2.5
cards — the chat template always emits `<think>` tags (no toggle),
matching M2.5 behavior.

## Why It Works

Model cards in `resources/inference_model_cards/` are auto-loaded by
`src/exo/shared/models/model_cards.py::get_model_cards`. The dashboard
picker groups by `base_model` and filters by `family`, so sharing both
across all six variants gives a single "MiniMax M2.7" group under the
MiniMax sidebar entry, with the quant variants exposed as selectable
sub-options.

## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max
- Ran `uv run python -c "…await get_model_cards()…"` and confirmed all 6
new cards load with `family=minimax`, `base_model="MiniMax M2.7"`, and
correct quant + byte sizes.
- `cd dashboard && npm run build` then `uv run exo`, opened the model
picker → **MiniMax** family → **MiniMax M2.7** group shows all six quant
variants.

### Automated Testing
- No new automated tests — these are data files validated by the
existing Pydantic `ModelCard` schema at load time.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:47:09 +01:00
MikkoParkkola
4b13735ea3 build: remove pyinstaller temp artifacts (#1868)
removes PyInstaller `build/` leftovers after `just package`
2026-04-11 11:46:03 +00:00
rltakashige
196543ce69 Add Gemma 4 + VLM fixes + thinking parsing updates (#1851)
## Motivation
Add support for Gemma 4, including VLM!

## Changes

- Add auto parallel strategies and model cards for Gemma 4
- Normalise Gemma 4's special Vision Transformer handling to be in line
with the rest of our vision processors.
- Also adds reprs to messages and b64 hashes to prevent log spam.

## Test Plan

### Manual Testing
Tested manually on 4bit E2B and 8bit 26B

### Automated Testing
Model onboarding shows small logit diffs.

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-11 12:29:33 +01:00
ciaranbor
6172617b00 add env override to macos app (#1869)
## Motivation

- Let users pass/override arbitrary exo env vars from the macOS app
without a code change.

## Changes

- `ExoProcessController.swift`: `CustomEnvironmentVariable` struct +
`@Published` list persisted to `UserDefaults`, injected into the child
process after built-ins.
- `SettingsView.swift`: new **Environment** tab with add/remove rows,
trim + dedup on save, and a POSIX name validator with a warning badge.

## Why It Works

- Custom vars applied last in `makeEnvironment`, so overriding a
built-in works with no special-casing.

## Test Plan

### Manual Testing

- Set `EXO_LIBP2P_NAMESPACE` via the new UI; confirmed override in
`~/.exo/exo_log/exo.log`.
2026-04-10 17:39:55 +01:00
ciaranbor
93a980a61e just package first builds dashboard (#1867)
2026-04-10 16:25:30 +00:00
ciaranbor
2962ebee60 Fix pdf inputs on Safari (#1865)
## Motivation

PDF attachments weren't working on Safari

## Changes

Create async readable stream if none exists

## Why It Works
pdfjs-dist requires an async readable stream internally

## Test Plan

### Manual Testing
pdf attachments now work on Safari, still work on Firefox
2026-04-10 14:59:34 +00:00
rltakashige
abd75ae06c Truncate long logs with repr (#1854)

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-10 15:53:48 +01:00
kaiisfree
ee2e505b3c fix: handle BrokenResourceError in download progress callback (#1846)
## Summary
- Wraps progress callback `send()` in try/except to gracefully handle
`BrokenResourceError` when the memory stream is closed
- Prevents unhandled `ExceptionGroup` from crashing the process when the
download consumer disconnects during transfer

## Root Cause
The download progress callback sends updates through an anyio memory
object stream. When the receiving end closes (e.g., client disconnect,
timeout, or task cancellation), `send()` raises `BrokenResourceError`.
Inside an anyio `TaskGroup`, this unhandled exception becomes an
`ExceptionGroup` that propagates up and crashes the coordinator.

## Fix
Catch `BrokenResourceError` (and `ClosedResourceError` for completeness)
in the progress callback and handle gracefully — the download continues
but progress updates are silently dropped for disconnected consumers.
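
The shape of the fix, modeled with stand-in classes (the real code catches anyio's `BrokenResourceError` and `ClosedResourceError`; the stream class here only reproduces the failure mode so the try/except pattern is visible):

```python
class BrokenResourceError(Exception):
    """Stand-in: peer end of the stream has been closed."""


class ClosedResourceError(Exception):
    """Stand-in: this end of the stream has been closed."""


class ProgressStream:
    def __init__(self):
        self.closed = False
        self.sent = []

    def send(self, update):
        if self.closed:
            raise BrokenResourceError
        self.sent.append(update)


def report_progress(stream, update):
    try:
        stream.send(update)
    except (BrokenResourceError, ClosedResourceError):
        # Consumer disconnected: drop the update, keep the download alive.
        pass


stream = ProgressStream()
report_progress(stream, {"pct": 10})
stream.closed = True
report_progress(stream, {"pct": 20})  # no crash, update silently dropped
print(stream.sent)  # [{'pct': 10}]
```
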

Fixes #1844

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-10 11:43:42 +00:00
Evan Quiney
f2e6b1ef76 prevent some crash loops (#1827)
extension to #1763 that prevents crash looping in some common scenarios.
2026-04-09 11:34:35 +00:00
ciaranbor
e2e17eafb7 Fix reasoning_tokens counting for multi-token thinking tag models (#1848)
## Motivation

`reasoning_tokens` is always 0 in usage stats, even when thinking
content streams correctly via `reasoning_content` SSE deltas. The MLX
generators had their own thinking detection comparing individual
detokenized tokens against think tags — this never fires for models
where tags span multiple tokens (e.g. gpt-oss-120b) or are already in
the prompt.

## Changes

- Removed broken per-token thinking detection from `batch_generate.py`
and `generate.py`
- Added `_count_reasoning_tokens` wrapper in `model_output_parsers.py`
that counts `is_thinking=True` responses and patches the total into
Usage on the final response
- Wired it as the outermost stage of `apply_all_parsers`, so it works
regardless of which parser sets `is_thinking`
- Added 3 tests covering `parse_thinking_models` and `parse_gpt_oss`
paths

## Why It Works

The parser pipeline already correctly sets `is_thinking` on each
response. Counting at the output of `apply_all_parsers` means one
counting point that works for all model types, replacing the duplicate
broken logic in two generators.

## Test Plan

### Manual Testing

- 4-node cluster, `mlx-community/gpt-oss-120b-MXFP4-Q8`
- Main branch: `reasoning_tokens: 0` — fix branch: `reasoning_tokens:
25`

### Automated Testing

- 3 new tests: explicit think tags, `starts_in_thinking=True`, and
gpt-oss Harmony analysis channel
2026-04-09 12:29:35 +01:00
ciaranbor
b12cd1b186 Cancel SSE keep-alive when instance is deleted (#1828)
## Motivation

When a model instance is deleted (e.g. node disconnect, manual
teardown), any in-flight SSE streaming connections for that instance
hang indefinitely. The API never closes the response stream, so clients
block forever waiting for more chunks.

## Changes

- Listen for `InstanceDeleted` events in the API event loop
- Add `_close_streams_for_instance()` to find and close any active
text/image generation queues tied to tasks on the deleted instance
- Add unit tests covering text gen, image gen, and
unrelated-instance-not-closed scenarios

## Why It Works

When an instance is deleted, we iterate `state.tasks` to find commands
running on that instance, then close and remove their send-side queue
handles. This causes the SSE generator to terminate, unblocking the
client.
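
A sketch of the cleanup logic (the state layout and queue API here are assumptions based on the PR text): find tasks bound to the deleted instance, then close and remove their send-side queues so the SSE generators terminate.

```python
def close_streams_for_instance(state_tasks, queues, instance_id):
    for task_id, task in list(state_tasks.items()):
        if task["instance_id"] != instance_id:
            continue
        queue = queues.pop(task_id, None)
        if queue is not None:
            queue.close()  # SSE generator sees end-of-stream and unblocks


class FakeQueue:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


queues = {"t1": FakeQueue(), "t2": FakeQueue()}
tasks = {"t1": {"instance_id": "A"}, "t2": {"instance_id": "B"}}
close_streams_for_instance(tasks, queues, "A")
print(sorted(queues))  # ['t2'] -- only the unrelated stream remains
```
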

## Test Plan

### Manual Testing
- This was causing issues for me on another branch (integration tests).
Including this fix solved the issue

### Automated Testing
- `test_instance_deleted_stream_cleanup.py`: 3 tests covering text gen
cleanup, image gen cleanup, and ensuring unrelated streams are not
affected
2026-04-08 16:14:28 +01:00
mlpy0
62570227ff Catch ClosedResourceError when forwarding chunks to client queues (#1856)
While stress-testing inference with rapid client cancels mid-stream, I
hit a reproducible crash where the entire exo process exits.

When a client cancels a streaming chat completion partway through, its
receive stream gets closed cleanly via its context manager. The producer
in `API._apply_state` then calls `queue.send(event.chunk)`, which raises
`anyio.ClosedResourceError` rather than `BrokenResourceError`. The
existing handler only catches `BrokenResourceError`, so the exception
propagates through the API task group, kills the Node task group, and
the process exits with `EXO Shutdown complete`.

Trace from one of the crashes:

```
File "exo/api/main.py", line 1818, in _apply_state
    await queue.send(event.chunk)
File "anyio/streams/memory.py", line 212, in send_nowait
    raise ClosedResourceError
anyio.ClosedResourceError
```

The fix is to catch `ClosedResourceError` alongside
`BrokenResourceError` in both queue handlers (text and image), so the
dead queue gets dropped and `_apply_state` keeps running for other
in-flight requests.
2026-04-08 08:50:04 +00:00
Alex Cheema
645bc20950 Add Fast Synch Enabled toggle to macOS app settings (#1852)
## Motivation

The exo backend already supports `--fast-synch` / `--no-fast-synch` CLI
flags and the `EXO_FAST_SYNCH` environment variable, but there was no
way to toggle this from the macOS app UI. Users who want fast CPU-to-GPU
synchronization for RDMA with Tensor Parallelism had to use CLI flags.

## Changes

- **ExoProcessController.swift**: Added `fastSynchEnabled`
UserDefaults-backed property and pass `EXO_FAST_SYNCH=on` to the exo
process environment when enabled.
- **SettingsView.swift**: Added a "Performance" section to the Advanced
tab with a "Fast Synch Enabled" toggle, an info icon (ⓘ) tooltip
explaining the feature and trade-offs, and a "Save & Restart" button.

## Why It Works

Follows the exact same pattern as the existing `offlineMode` and
`enableImageModels` settings — UserDefaults persistence, `@Published`
property with `didSet`, environment variable passthrough in
`makeEnvironment()`, and pending state with Save & Restart in the
settings UI. The `EXO_FAST_SYNCH=on` value matches what the Python
backend already reads in `main.py`.

## Test Plan

### Manual Testing
Hardware: macOS app
- Open Settings → Advanced tab → verify "Performance" section with "Fast
Synch Enabled" toggle appears
- Hover the ⓘ icon → verify tooltip explains the feature and GPU lock
trade-off
- Toggle on → click "Save & Restart" → verify process restarts with
`EXO_FAST_SYNCH=on` in env
- Close and reopen Settings → verify the toggle state persists
- Verify "Save & Restart" button is disabled when no changes are pending

### Automated Testing
- Existing settings patterns are well-established; no new automated
tests needed for this UI toggle

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 01:04:42 +00:00
rltakashige
5757c27dd5 Add download utility script (#1855)
2026-04-08 00:58:39 +00:00
Andrei Cravtov
fd5b23281c Workspace tweaks (#1849)
## Changes

Mostly chore changes around vscode and jetbrains workspace settings, and
some basedpyright settings tweaks, to allow direnv to work and nixd
autocomplete with flake parts to work
2026-04-07 17:26:29 +00:00
rltakashige
43b3df45fb Fix BatchGenerator in line with upstream refactor (and prevent Qwen3.5 memory leak) (#1835)
## Motivation

MLX LM recently had a massive refactor of its BatchGenerator. Since we'd
like new features from MLX LM such as Gemma 4, we need to update our code
to handle this.

Additionally, this fixes a significant memory leak in GatedDeltaNet. The
difference is quite substantial (up to 1GB every 1000 tokens), which
explains several memory issues users were facing with Qwen3.5 models.

## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>


After (no memory leak, as one of the changes upstream)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
2026-04-07 11:50:12 +00:00
mlpy0
24420eb10a Fix reasoning_tokens always reported as 0 for thinking models (#1836)
When `enable_thinking` is set, chat templates (Qwen3, DeepSeek, etc.)
append `<think>` to the prompt. The model starts generating thinking
content directly without emitting a `<think>` token in the output
stream.

Both generators initialized `in_thinking = False` and only set it to
`True` on seeing a `<think>` token in output. Since that token was part
of the prompt, the flag never flipped and `reasoning_tokens` stayed at 0
in the usage response.

Fix: initialize `in_thinking` from `detect_thinking_prompt_suffix()`,
which already exists and is used by `model_output_parsers` for routing
thinking content correctly.
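
A sketch of the fix (the function name comes from the PR; the prompt handling is simplified to a suffix check):

```python
def detect_thinking_prompt_suffix(prompt: str) -> bool:
    # If the chat template already ended the prompt with <think>, the
    # model starts generating thinking content without ever emitting a
    # <think> token of its own.
    return prompt.rstrip().endswith("<think>")


def init_generator_state(prompt: str) -> dict:
    # Before the fix this was hardcoded to False, so reasoning_tokens
    # stayed at 0 whenever the tag lived in the prompt.
    return {"in_thinking": detect_thinking_prompt_suffix(prompt)}


print(init_generator_state("user question<think>"))  # {'in_thinking': True}
print(init_generator_state("user question"))         # {'in_thinking': False}
```
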
2026-04-05 00:05:18 +00:00
rltakashige
59669c1168 Tighten EXO bench concurrency numbers and explain methodology (#1811)
## Motivation

The timings in the batch generator are a little optimistic; a minor
change is needed to make them more correct.

## Changes

Include the time spent in the API in the generation tps and make sure to
send all requests simultaneously
2026-04-05 00:57:52 +01:00
ciaranbor
1d2ce464dc Allow pausing and deleting active downloads (#1829)
## Motivation

No way to pause active downloads or delete partial/failed downloads from
the dashboard.

## Changes

- **Backend:** Added `POST /download/cancel` endpoint with
`CancelDownloadParams`/`CancelDownloadResponse` types. Wires into the
existing `CancelDownload` command + coordinator handler.
- **Dashboard store:** Added `cancelDownload(nodeId, modelId)` function.
- **Dashboard UI:**
  - Pause + delete buttons on active (downloading) cells
  - Delete button on paused/pending and failed cells
- Extracted duplicated SVG icons into `{#snippet}` blocks (`trashIcon`,
`downloadIcon`, `pauseIcon`, `deleteButton`)
- **Tests:** 3 coordinator-level tests for cancel: active download →
pending, nonexistent → no-op, cancel then resume.

## Why It Works

`CancelDownload` command and coordinator handler already existed — just
needed an HTTP endpoint and dashboard wiring. Delete endpoint already
supported all download states.

## Test Plan

### Manual Testing

Started a model download, paused it. Deleted some paused downloads.
Deleted some ongoing downloads.

### Automated Testing

- `test_cancel_active_download_transitions_to_pending` — cancels
in-progress download, asserts `DownloadPending` event and cleanup
- `test_cancel_nonexistent_download_is_noop` — no events emitted
- `test_cancel_then_resume_download` — restart after cancel works
2026-04-02 15:56:33 +01:00
ciaranbor
eb6ae9fd3c Prevent failed instance retries (#1763)
## Motivation

Currently, when a runner fails, the master retries the instance. Most of
the time, this causes a loop over failure. Retries need backoff and a
cap.

## Changes

- src/exo/worker/main.py: Before creating a runner, check an exponential
backoff timer per instance. After EXO_MAX_INSTANCE_RETRIES failures,
send DeleteInstance to permanently remove the instance. Record attempts
on Shutdown; reset on InstanceDeleted.
- src/exo/utils/keyed_backoff.py: Add attempts() method to query retry
count
- src/exo/shared/constants.py: Add EXO_MAX_INSTANCE_RETRIES = 3.

## Why It Works

The worker gates CreateRunner tasks behind a KeyedBackoff, adding
exponential delay (2s base, 30s cap) between retries. After 3 failures
the worker sends DeleteInstance, stopping retries entirely. The backoff
resets when the instance is deleted, so a fresh placement starts clean.
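
The gating logic can be sketched as follows (the constants come from the PR; the `KeyedBackoff` internals are assumptions):

```python
EXO_MAX_INSTANCE_RETRIES = 3
BASE_DELAY_S, CAP_S = 2.0, 30.0


class KeyedBackoff:
    def __init__(self):
        self._attempts: dict[str, int] = {}

    def record_failure(self, key: str) -> None:
        self._attempts[key] = self._attempts.get(key, 0) + 1

    def attempts(self, key: str) -> int:
        return self._attempts.get(key, 0)

    def delay(self, key: str) -> float:
        # Exponential: 2s, 4s, 8s, ... capped at 30s.
        n = self.attempts(key)
        return min(BASE_DELAY_S * (2 ** (n - 1)), CAP_S) if n else 0.0

    def reset(self, key: str) -> None:
        self._attempts.pop(key, None)


b = KeyedBackoff()
for _ in range(3):
    b.record_failure("instance-1")
print(b.delay("instance-1"))  # 8.0 -- still under the 30s cap
# After EXO_MAX_INSTANCE_RETRIES failures, the worker would send
# DeleteInstance instead of retrying again:
print(b.attempts("instance-1") >= EXO_MAX_INSTANCE_RETRIES)  # True
```
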

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-01 21:03:34 +01:00
rltakashige
4688adb5d2 Support PDFs in dashboard (#1822)
Like ChatGPT does, we now send both the extracted text and the image of
each PDF page.
2026-03-31 18:25:40 +01:00
rltakashige
d9ed943034 Fix Nemotron cache leak upstream (#1819)
## Motivation
Nemotron Cascade and Nano failing at long decodes.

## Changes

Fixed upstream, just change pyproject and uv lock here.


## Test Plan
### Automated Testing
Tested with a reproduce script upstream
2026-03-30 16:53:21 +00:00
rltakashige
c6815bfdce Only update KV prefix cache on a good cache hit (#1817)
## Motivation

Addresses #1816 

## Changes

- Only update when the prefix hit length exceeds `min_prefix_hit_length`
**and** the hit ratio exceeds `_MIN_PREFIX_HIT_RATIO_TO_UPDATE`.
- `min_prefix_hit_length = max(1000, system prompt length)`, so system
prompts must match exactly.
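
A sketch of the gating condition (the constant name comes from the PR; its value here is an assumption):

```python
_MIN_PREFIX_HIT_RATIO_TO_UPDATE = 0.8  # assumed value for illustration


def should_update_prefix_cache(hit_length, prompt_length, system_prompt_length):
    # The minimum hit must cover at least the system prompt (so system
    # prompts match exactly) and never be shorter than 1000 tokens.
    min_prefix_hit_length = max(1000, system_prompt_length)
    hit_ratio = hit_length / prompt_length if prompt_length else 0.0
    return (hit_length > min_prefix_hit_length
            and hit_ratio > _MIN_PREFIX_HIT_RATIO_TO_UPDATE)


# A short, low-quality hit no longer overwrites the cache:
print(should_update_prefix_cache(1200, 5000, 800))  # False (ratio 0.24)
print(should_update_prefix_cache(4800, 5000, 800))  # True  (ratio 0.96)
```
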

## Test Plan

### Manual Testing
Test on OpenCode and Claude Code
2026-03-30 15:04:38 +01:00
rltakashige
39c39e8199 Integrations helpers (#1810)
2026-03-30 14:28:41 +01:00
rltakashige
e5cb7b80d0 Add SSE-keepalive to not time out on long prefill on clients (#1803)
2026-03-30 12:18:38 +01:00
rltakashige
635801d515 Add multimodality! (#1802)
## Motivation

Images!

TODO (in a future PR): Add audio and video support.

## Test Plan

### Manual Testing
<img width="2652" height="1900" alt="image"
src="https://github.com/user-attachments/assets/7d3a7137-542f-4f94-9193-2c73b7c4a5ec"
/>

<img width="2770" height="1956" alt="image"
src="https://github.com/user-attachments/assets/e3c3a096-8029-4409-97a6-aca31a9a3f24"
/>
<img width="2738" height="1768" alt="image"
src="https://github.com/user-attachments/assets/d70ea37f-cd1d-4a4c-ad08-3beb9fafa380"
/>

(And batching also works)

---------

Co-authored-by: David Hind <davehind@yahoo.co.uk>
2026-03-30 11:52:19 +01:00
rltakashige
2efbb8ab4f Improve exo harness with path state (#1815)
<img width="3224" height="1476" alt="image"
src="https://github.com/user-attachments/assets/d90a7d8a-9fe5-43a1-a715-1ef7ecc15422"
/>
2026-03-30 11:20:46 +01:00
Evan Quiney
c6c5a3e73c feat: /state/paths (#1796)
adds a path option to the /state endpoint, allowing you to query
subfields of state without grabbing the whole blob

## Test Plan
poking around in the api
2026-03-30 10:10:00 +00:00
ArvidSU
10ef7ec9e8 feat: add Firefox AI sidebar (?q=) support to dashboard (#1814)
This PR builds on https://github.com/exo-explore/exo/pull/1677 to enable
custom prompts sent from Firefox's `browser.ml.chat` sidebar to the exo
dashboard via URL parameters, for page summaries and other browser
interactions. See the "Summarize page" example below.

## Summary
- Parse `?q=<encoded prompt>` URL parameter on page load and auto-submit
it as a chat message
- Clean up the URL with `history.replaceState` to prevent re-submission
on refresh
- Defer auto-send until both cluster state and model list are loaded so
model auto-selection works correctly

## Context
Firefox's built-in AI sidebar (`about:config: browser.ml.chat.enabled`)
integrates with chat providers by appending the user's prompt as
`?q=<URL-encoded prompt>`. Previously the exo dashboard ignored this
parameter. Users can now configure `http://localhost:52415` as a Firefox
AI chatbot provider.

See: https://support.mozilla.org/en-US/kb/ai-chatbot

## Technical notes
- Frontend-only change in `dashboard/src/routes/+page.svelte`
- Uses a Svelte `$effect` that reacts to `pendingFirefoxQuery`, `data`
(cluster state), and `models.length` — fires exactly once when all three
are ready
- If no model is selected, `handleAutoSend` auto-picks the best
available model; if no model fits memory, a toast is shown
- If a model is selected but not running, the message is queued until
the model loads

## Testing
```
http://localhost:52415/?q=Hello+world
http://localhost:52415/?q=Summarize+this+page%3A+%5Bpage+title%5D+%5Bpage+url%5D
```

<img width="2056" height="1329" alt="image"
src="https://github.com/user-attachments/assets/74463eb4-ca1a-400d-806a-c19ba93147b9"
/>
2026-03-30 11:02:35 +01:00
Evan Quiney
1e51dc89b0 chore: bump exo-version with release version (#1807)
our pyproject.toml version was 0.3.68 - update to .69 in line with
release!!
2026-03-27 11:47:13 +00:00
Alex Cheema
5327bdde84 Fix custom model add requiring two attempts + enlarge sidebar buttons (#1805)
## Motivation

Adding a custom model from the Hub tab shows "Added" toast but the model
doesn't appear in the All tab. You have to add it a second time for it
to work. Also, the "All" button in the model picker sidebar is too small
to read comfortably.

## Changes

**Race condition fix (`src/exo/api/main.py`):**
- Call `add_to_card_cache(card)` directly in `add_custom_model()` after
sending the `ForwarderCommand`, before the API response returns

**Sidebar sizing
(`dashboard/src/lib/components/FamilySidebar.svelte`):**
- Increased sidebar min-width from 72/64px to 80/72px
- Increased "All" icon from `w-5 h-5` to `w-6 h-6`
- Increased all sidebar labels from 9px to 11px

## Why It Works

`POST /models/add` sends a `ForwarderCommand(AddCustomModelCard)` and
returns immediately. The frontend then calls `GET /models` which reads
from `_card_cache`. But the cache was only updated by the worker event
handler after the event round-trips through the master — a race the
frontend almost always loses. By updating the cache directly in the API
handler, `GET /models` immediately reflects the new model. The worker's
later `add_to_card_cache` call is idempotent (dict key assignment).

## Test Plan

### Manual Testing
Hardware: any Mac
- Open model picker → Hub tab → add a custom model → verify it appears
in All tab on the first attempt
- Verify sidebar "All" button and other labels are visually larger and
readable

### Automated Testing
- `uv run basedpyright` passes with 0 errors
- `uv run ruff check` passes

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v1.0.69
2026-03-26 17:57:00 -07:00
ciaranbor
15f1b61f4c Rework model storage directory management (for external storage) (#1765)
## Motivation

Replace confusing EXO_MODELS_DIR/EXO_MODELS_PATH with clearer
multi-directory support, enabling automatic download spillover across
volumes.

## Changes

- EXO_MODELS_DIRS: colon-separated writable dirs (default always
prepended, first with enough space wins)
- EXO_MODELS_READ_ONLY_DIRS: colon-separated read-only dirs (protected
from deletion)
- select_download_dir(): picks writable dir by free space
- resolve_existing_model(): unified lookup across all dirs
- is_read_only_model_dir(): path-based read-only detection instead of
hardcoded flag
- Updated coordinator, worker, model cards, tests

## Why It Works

Default dir always included so zero-config behavior is unchanged. Disk
space checked at download time for automatic spillover. Read-only status
derived from path, not hardcoded.

## Test Plan

### Manual Testing

- No env vars set → identical behavior
- EXO_MODELS_DIRS=/Volumes/SSD/models → downloads to external storage
- EXO_MODELS_READ_ONLY_DIRS=/mnt/nfs → models found, deletion blocked

### Automated Testing

- 4 new tests in test_xdg_paths.py (prepend, default-only, overlap,
empty read-only)
- Existing tests updated to patch new constants
2026-03-26 17:46:46 +00:00
Michael Harrigan
9034300163 [Fix] Node hang on reelection (#1801)
## Motivation

During master reelection, `_elect_loop` called `worker.shutdown()` (fire
& forget) then immediately created and started a new Worker.

This caused the old runner subprocess's Metal/GPU teardown to race with
the new worker's startup, resulting in `IOConnectUnmapMemory failed:
kr=0xe00002bc` errors and a full node hang requiring `^C`. Same issue
existed for `DownloadCoordinator`.

## Changes

- Added `anyio.Event`-based `_stopped` signal to `Worker` and
`DownloadCoordinator`, set at the end of their `run()` finally blocks
- Added `wait_stopped()` async method to both classes
- Updated `_elect_loop` to `await wait_stopped()` after calling
`shutdown()` on the old Worker and DownloadCoordinator before creating
replacements

## Why It Works

The old Worker's task group contains the RunnerSupervisor tasks, whose
finally blocks join the runner subprocess (with 5s timeout + SIGTERM +
SIGKILL escalation). By awaiting `wait_stopped()`, we guarantee the old
runner process has fully exited — including GPU memory cleanup — before
a new Worker can start and potentially access the GPU. This eliminates
the race without changing the shutdown mechanics themselves.
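The handshake can be sketched as follows, using stdlib `asyncio.Event` as a stand-in for the `anyio.Event` the PR actually adds (class shape simplified):

```python
import asyncio

class Worker:
    """Sketch of the shutdown handshake: _stopped is only set once run()'s
    finally block (runner join, GPU teardown in the real code) completes."""

    def __init__(self) -> None:
        self._stopped = asyncio.Event()
        self._shutdown = asyncio.Event()

    async def run(self) -> None:
        try:
            await self._shutdown.wait()  # stand-in for the real work loop
        finally:
            self._stopped.set()  # set only after all teardown has finished

    def shutdown(self) -> None:
        self._shutdown.set()

    async def wait_stopped(self) -> None:
        await self._stopped.wait()

async def reelect() -> str:
    old = Worker()
    task = asyncio.create_task(old.run())
    old.shutdown()
    await old.wait_stopped()  # old runner fully exited before a new Worker starts
    await task
    return "new worker may start"

print(asyncio.run(reelect()))
```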

## Test Plan

### Manual Testing
Hardware: M4 Pro Mac Mini 24GB + M3 Ultra Mac Studio 96GB, connected via
Thunderbolt

**Repro steps:**
1. Start exo on two nodes with a model sharded across both (e.g.
`Josiefied-Qwen3-14B-abliterated-v3-4bit`)
2. Wait for "runner ready" on both
3. `kill -9` the master node
4. Observe the surviving node's re-election behavior

**Before fix (original crash):**
```
[ 11:02:39.0896AM ] Runner supervisor shutting down
[ 11:02:39.0905AM ] bye from the runner
[ 11:02:39.1052AM ] Stopping Worker
IOConnectUnmapMemory failed: kr=0xe00002bc
IOConnectUnmapMemory failed: kr=0xe00002bc
IOConnectUnmapMemory failed: kr=0xe00002bc
IOConnectUnmapMemory failed: kr=0xe00002bc
^C[ 11:03:45 ] ← hung for over a minute, required manual kill
```

**After fix (clean re-election):**
```
[ 12:15:22.4703PM ] runner loaded
[ 12:15:24.1672PM ] runner ready
[ 12:15:33.5393PM ] Waiting for other campaign to finish
[ 12:15:36.5409PM ] Node elected Master
[ 12:15:36.5413PM ] Unpausing API
```
No `IOConnectUnmapMemory` errors, no hang, no `^C` needed.

### Automated Testing
- No existing tests cover the `_elect_loop` re-election path; this is an
integration-level flow requiring a live router/election/worker stack
- All existing tests pass (307/308, 1 pre-existing Rust binding failure)
- basedpyright: 0 errors, ruff: all checks passed

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-03-26 17:28:47 +00:00
rltakashige
1d1dfaa1f3 Don't download original/ and metal/ folders from HF (#1800)
## Motivation

<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->

## Changes

<!-- Describe what you changed in detail -->

## Why It Works

<!-- Explain why your approach solves the problem -->

## Test Plan

### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->

### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
2026-03-26 13:36:07 +00:00
Evan Quiney
7625213df0 fix: enable macmon if preflight fails (#1799)
missed in #1747, issue #1798.

### the issue

we didn't set the memory poll rate after failing the macmon preflight,
only after failing the follow-ups - and since we never ran macmon when
the preflight failed, we never hit the follow-up errors at all.

### testing

requires testing on an m5 pro, but the core issue is solved.
2026-03-26 11:35:22 +00:00
Alex Cheema
f318f9ea14 Fix macOS build bundling wrong macmon binary (#1797)
## Motivation

PR #1747 fixed macmon support for M5 Pro/Max by pinning the
`swiftraccoon/macmon` fork in `flake.nix`. This works when running from
source (via Nix) but the distributed macOS `.app` build was still broken
on M5 Pro/Max because it was bundling the wrong macmon.

The error on M5 Pro/Max:
```
macmon preflight failed with return code -6: thread 'main' panicked at src/sources.rs:394:41
```

## Changes

- Removed `macmon` from `brew install` in `build-app.yml` — this was
installing the upstream `vladkens/macmon` which doesn't support M5
Pro/Max
- Added a new step that resolves the pinned macmon fork from the Nix dev
shell (same `swiftraccoon/macmon` at rev `9154d23` already defined in
`flake.nix`) and adds it to `$GITHUB_PATH`
- Added a safety `brew uninstall macmon` to ensure no Homebrew macmon
can shadow the pinned version

## Why It Works

PyInstaller bundles macmon via `shutil.which("macmon")`. Previously this
found the Homebrew (upstream) binary. Now it finds the Nix-overlayed
fork that has M5 Pro/Max support, because `$GITHUB_PATH` prepends the
Nix store path before the PyInstaller step runs.
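The PATH-precedence behavior `shutil.which` relies on can be demonstrated with two fake `macmon` binaries (temp paths are illustrative):

```python
import os
import shutil
import stat
import tempfile

def make_fake_binary(directory: str, name: str) -> str:
    """Create an executable stub so shutil.which() can find it."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IEXEC)
    return path

nix_dir = tempfile.mkdtemp(prefix="nix-")
brew_dir = tempfile.mkdtemp(prefix="brew-")
pinned = make_fake_binary(nix_dir, "macmon")
make_fake_binary(brew_dir, "macmon")

# shutil.which walks PATH left to right, so prepending the Nix store path
# (as $GITHUB_PATH does) makes the pinned fork shadow any Homebrew copy.
os.environ["PATH"] = os.pathsep.join([nix_dir, brew_dir])
print(shutil.which("macmon") == pinned)  # True
```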

## Test Plan

### Manual Testing
Hardware: M5 Pro
- Trigger a macOS build and verify the bundled macmon is the pinned fork
- Run the built `.app` on M5 Pro/Max and confirm macmon preflight
succeeds

### Automated Testing
- Existing CI build workflow will validate that the macmon binary is
found and bundled correctly

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 09:47:48 +00:00
ciaranbor
30fd5aa1cc Prefer higher % downloaded nodes for API placement previews (#1795)
Follow up to https://github.com/exo-explore/exo/pull/1767

Same thing for placement previews through API
2026-03-25 17:26:31 +00:00
ciaranbor
6de14cfedb Support image generation cancellation (#1774)
## Motivation

Support cancelling image generation, similar to existing support for
cancelling text generation

## Changes

- Dashboard (app.svelte.ts): Wire up AbortController for both
generateImage and editImage API calls. On abort, show "Cancelled"
instead of an error. Clean up the controller in finally.
- Pipeline runner (pipeline/runner.py): Introduce a cancel_checker
callback and NaN-sentinel cancellation protocol for distributed
diffusion:
  - _check_cancellation() - only rank 0 polls the cancel callback
- _send() - replaces data with NaN sentinels when cancelling, so
downstream ranks detect cancellation via _recv_and_check()
  - _recv() / _recv_like() wrappers that eval and check for NaN sentinel
  - After cancellation, drains any pending ring recv to prevent deadlock
  - Skips partial image yields and final decode when cancelled
- Image runner (runner/image_models/runner.py): Deduplicate the
ImageGeneration and ImageEdits match arms into a shared
_run_image_task() method. Thread a cancel_checker closure (backed by the
existing cancel_receiver + cancelled_tasks set) into generate_image().
- Plumbing (distributed_model.py, generate.py): Pass cancel_checker
through the call chain.

## Why It Works

- Rank 0 is the only node that knows about task-level cancellation. When
it detects cancellation, it sends NaN tensors instead of real data.
Downstream ranks detect the NaN sentinel on recv, set their own
_cancelling flag, and propagate the NaNs forward.
- A drain step after the loop prevents the deadlock case where the last
rank already sent patches that the first would never consume.
- For single-node mode, the loop simply breaks immediately on
cancellation.
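The sentinel protocol can be sketched with plain floats (function names simplified from the `_send`/`_recv_and_check` helpers in the PR):

```python
import math

def send(data: list[float], cancelling: bool) -> list[float]:
    # Rank 0 swaps real activations for NaN sentinels once cancellation is seen.
    return [float("nan")] * len(data) if cancelling else data

def recv_and_check(data: list[float]) -> tuple[list[float], bool]:
    # Downstream ranks detect the sentinel on recv, set their own flag,
    # and keep forwarding NaNs so every rank exits the pipeline loop.
    return data, any(math.isnan(x) for x in data)

_, cancelled = recv_and_check(send([0.1, 0.2], cancelling=True))
print(cancelled)  # True
_, cancelled = recv_and_check(send([0.1, 0.2], cancelling=False))
print(cancelled)  # False
```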

## Test Plan

### Automated Testing

New tests in src/exo/worker/tests/unittests/test_image
2026-03-25 16:56:04 +00:00
vskiwi
fc1ae90111 fix: DeepSeek V3.2 warmup crash and tool calling + add catalog cards (#1769)
## Summary

DeepSeek V3.2 (`DeepseekV32ForCausalLM`) is already supported by exo's
inference engine (architecture whitelisted in `model_cards.py`, DSML
encoding added in #1548), but **doesn't work out of the box** due to two
bugs:

### Bug 1: `warmup_inference` passes empty model ID

`warmup_inference()` in `generate.py` accepts `model_id: ModelId` as a
parameter but creates `TextGenerationTaskParams(model=ModelId(""), ...)`
instead of using it. Since `_needs_dsml_encoding()` checks
`"deepseek-v3.2" in task_params.model.lower()`, the empty string never
matches → falls back to `tokenizer.apply_chat_template()` →
**ValueError** because V3.2 has no Jinja chat template.

**Fix:** `model=ModelId("")` → `model=model_id` (one line).

### Bug 2: `_needs_dsml_encoding` limited to tool calling

`_needs_dsml_encoding()` returns `True` only when `task_params.tools` is
present or tool messages exist in `chat_template_messages`. For warmup
and regular chat requests without tools → `return False` → Jinja
fallback → **ValueError**.

Unlike V3.1 (which has a `.jinja` chat template file that transformers
picks up automatically), V3.2 **has no Jinja template at all** — it uses
Python-based DSML encoding for all message types.

**Fix:** For V3.2, always return `True` — DSML encoding handles all
message types.
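Both bugs fit in a few lines; this hedged sketch condenses the check (names simplified from the real `_needs_dsml_encoding` helper):

```python
def needs_dsml_encoding(model: str, has_tools: bool) -> bool:
    """Sketch of the fixed check."""
    if "deepseek-v3.2" in model.lower():
        return True  # V3.2 has no Jinja template: DSML handles every message type
    return has_tools

# Bug 1 in one line: an empty model id can never match the substring check,
# which is why warmup fell through to the (missing) Jinja template.
print(needs_dsml_encoding("", has_tools=False))                                  # False
print(needs_dsml_encoding("mlx-community/DeepSeek-V3.2-4bit", has_tools=False))  # True
```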

### Catalog cards

Added inference model cards for:
- `mlx-community/DeepSeek-V3.2-8bit`
- `mlx-community/DeepSeek-V3.2-4bit`

Parameters taken from model `config.json` on HuggingFace, storage sizes
from HF API. Capabilities include `thinking_toggle` (related: #1456).

## Notes

- The model ID string matching approach (`"deepseek-v3.2" in
model.lower()`) is acknowledged tech debt — see #1371 for the planned
architecture-based approach.

## Test plan

- [x] Start exo with DeepSeek V3.2 model → warmup should complete
without crash
- [x] Send a regular chat message (no tools) → should get a response
- [x] Send a chat message with tools → should work as before
- [x] V3.2 cards should appear in the dashboard model catalog

---------

Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
2026-03-25 16:20:35 +00:00