Commit Graph

2291 Commits

Author SHA1 Message Date
Evan
3eb1fbe394 wuff 2026-04-17 18:13:43 +01:00
Evan
e0c9e82755 fix test 2026-04-17 18:13:22 +01:00
Evan
b2fe0b8904 remove layer loading callback 2026-04-17 18:13:09 +01:00
mlpy0
01598960bd Add model card for Qwen3.6-35B-A3B-8bit (#1917)
Adds the 8bit variant missing from #1907 — the safetensors index is now
live on HF.

- `mlx-community/Qwen3.6-35B-A3B-8bit` (~35 GB)

Architectural fields match the existing 4bit/5bit/bf16 cards.
`storage_size.in_bytes` is taken from `metadata.total_size` of the
upstream `model.safetensors.index.json`.
2026-04-17 10:06:23 +00:00
Alex Cheema
63b8e64715 Add model cards for Qwen3.6-35B-A3B variants (#1907)
## Motivation

`mlx-community` has just published the new **Qwen3.6-35B-A3B**
multimodal MoE family on HuggingFace. Without static model cards exo
doesn't surface these models in the dashboard picker or match its
placement / prefill logic, so users can't one-click launch them. This PR
adds cards for the three quants whose safetensors indexes are already
live on HF (4bit / 5bit / bf16).

## Changes

Three new TOML files in `resources/inference_model_cards/`:

- `mlx-community--Qwen3.6-35B-A3B-4bit.toml` (~19 GB)
- `mlx-community--Qwen3.6-35B-A3B-5bit.toml` (~23 GB)
- `mlx-community--Qwen3.6-35B-A3B-bf16.toml` (~65 GB)

All three share the same architectural fields (`n_layers = 40`,
`hidden_size = 2048`, `num_key_value_heads = 2`, `context_length =
262144`, capabilities `text, thinking, thinking_toggle, vision`,
`base_model = "Qwen3.6 35B A3B"`) — only `model_id`, `quantization`, and
`storage_size.in_bytes` differ between variants.

## Why It Works

- Qwen3.6-35B-A3B reuses the `qwen3_5_moe` architecture
(`Qwen3_5MoeForConditionalGeneration`) — the same one already wired into
exo's MLX runner at `src/exo/worker/engines/mlx/auto_parallel.py:47` via
`Qwen3_5MoeModel`. The architectural fields are taken verbatim from the
HF `config.json.text_config` and match the existing `Qwen3.5-35B-A3B-*`
cards.
- Storage sizes are the exact `metadata.total_size` read from each
variant's `model.safetensors.index.json` on HF, so download progress and
cluster-memory-fit checks are accurate.
- Vision support is flagged in `capabilities`; the `[vision]` block is
auto-detected by `ModelCard._autodetect_vision` from the upstream
`config.json`, so no hand-written vision config is required.
- The card loader (`_refresh_card_cache` in
`src/exo/shared/models/model_cards.py`) globs every `.toml` in
`resources/inference_model_cards/` on startup, so nothing else needs to
change — the `/models` endpoint and the dashboard picker pick them up
automatically.

The `mxfp4` / `mxfp8` / `nvfp4` variants are still uploading upstream
(index JSONs currently 404) and can be added in a follow-up PR once HF
completes.

## Test Plan

### Manual Testing

Hardware: MacBook Pro M4 Max, 48 GB unified memory.

- Built the dashboard, ran `uv run exo`, waited for the API to come up
on `http://localhost:52415`.
- `curl -s http://localhost:52415/models` returns the three new model
ids (`mlx-community/Qwen3.6-35B-A3B-{4bit,5bit,bf16}`) alongside
existing models.
- Opened the dashboard, clicked SELECT MODEL, typed "Qwen3.6" into the
search box. A single **"Qwen3.6 35B A3B"** group appears showing `3
variants (19GB-65GB)`. Expanding it lists the `4bit` / `5bit` / `bf16`
quants with sizes `19GB` / `23GB` / `65GB`, exactly as expected:

![Qwen3.6 35B A3B in model
picker](127119f703/qwen36-picker.png)

- Programmatically loaded each TOML via `ModelCard.load_from_path(...)`
and confirmed the parsed fields (layers / hidden / KV heads / context /
quant / base_model / caps / bytes) match what's written in the files.

### Automated Testing

No code paths were touched — these are pure TOML data files that plug
into the existing model-card loader. The existing pytest suite covers
TOML parsing and card serving; adding new TOMLs doesn't require new test
scaffolding. `uv run ruff check` and `nix fmt` are clean.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ryuichi Leo Takashige <rl.takashige@gmail.com>
2026-04-16 23:25:26 +01:00
rltakashige
28c797846a Update mlx and mlx lm to latest (#1906)
Just bumping to the very latest upstream versions.
2026-04-16 10:59:33 +00:00
rltakashige
058bb08261 Allow copying on dashboard even on HTTP (#1902)
## Motivation

<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->

## Changes

<!-- Describe what you changed in detail -->

## Why It Works

<!-- Explain why your approach solves the problem -->

## Test Plan

### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->

### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
2026-04-15 23:30:12 +01:00
rltakashige
3eead80238 Better environment variables in MacOS app (#1901)
## Motivation

Closes #1858 

2026-04-15 20:14:52 +00:00
rltakashige
87329c80ef Add usage stats to tool calls and handle multiple tool calls correctly (#1899)
## Motivation

Tool calls are usually not end tokens, so they didn't have usage stats.
2026-04-15 19:40:00 +01:00
rltakashige
8cdc833892 Drain tokens silently skipped in thinking parsing (#1898)
## Motivation
Closes #1882
2026-04-15 14:23:07 +00:00
rltakashige
2cd66ae4cf Fix out of order event idx causing fatal crashes (#1894)
## Motivation

<img width="828" height="373" alt="Screenshot 2026-04-14 at 22 56 52"
src="https://github.com/user-attachments/assets/f8f48c1d-68c5-4acc-a6de-9d180672da9d"
/>

If `is_new_master=True`, `_elect_loop` creates a new `EventRouter` before the
worker has receivers. The event router then runs `_run_ext_in`, and
`buf.drain_indexed()` picks off events even though `self.internal_outbound`
is not yet fully populated.

When the worker finally does request events, the next event it receives is
not the first one, and the worker crashes.

## Changes

Start the event router after all the receivers are registered

## Why It Works

`self.internal_outbound` is populated before the drain loop begins.
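The ordering fix can be sketched in a few lines (illustrative only; exo's real `EventRouter` API differs):

```python
class EventRouter:
    """Illustrative sketch: draining only starts once every receiver
    is registered, so the first indexed event always has somewhere to go."""

    def __init__(self, buffered_events):
        self.buffer = list(buffered_events)  # stand-in for buf
        self.receivers = []                  # stand-in for internal_outbound
        self.started = False

    def register(self, receiver):
        if self.started:
            raise RuntimeError("register receivers before start()")
        self.receivers.append(receiver)

    def start(self):
        # The fix: this drain runs only after registration is complete.
        self.started = True
        for event in self.buffer:  # stand-in for drain_indexed()
            for receiver in self.receivers:
                receiver.append(event)
        self.buffer.clear()
```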

## Test Plan

### Manual Testing
No more crashes observed in testing (it's actually quite easy to
reproduce the issue if you have one node with this fix but the other
node on main).

I'm convinced this is a fix, at least.
2026-04-15 08:46:02 +00:00
rltakashige
2ecefa0cfe Fix Qwen3-VL and autodetect vision config (#1893)
## Motivation

Qwen3-VL tensor parallelism doesn't work at the moment, and vision is misbehaving.

## Test Plan

### Manual Testing
Works now.
2026-04-14 23:05:55 +01:00
rltakashige
b8eaf707a8 Add gemma 4 tensor parallelism (#1891) 2026-04-14 20:31:59 +01:00
rltakashige
8d81811b89 Try harder to clean up processes nicely (#1889)
## Motivation

Model loading is actually quite reliable now. There is no need to kill the
runner just because a slow SSD or a massive model makes loading take a
while; the user can shut the instance down if necessary.

The load timeout was a major cause of signal=9 issues, although not the
only one (it may happen during inference too). signal=9 is especially bad
because RDMA stops working until restart whenever it happens.

## Changes

- no more model load timeout
- no more crazy sigkills
- try harder to clean up processes on model shutdown

## Test Plan

### Manual Testing
Tested with some RDMA instances
2026-04-14 16:37:49 +01:00
rltakashige
f2709dcde6 Add prefix cache flag to exo bench (#1888)
## Motivation
When using exo bench extensively, there are many cases where prefix caching
could speed up the benchmarks, especially when the focus is on token
generation.

At the same time, prefix-caching decode tokens is clearly not very useful
in most current scenarios. Surprisingly, even for non-thinking models, the
chat template formats a continued conversation such that the existing cache
is not effective.

We already (slightly accidentally) do this for the batch generator - we
should do it for the sequential generator too.

## Changes

exo bench now accepts a prefix-caching flag. For the most accurate pp
results it is better to leave it off, but it speeds up tg and large
benchmark runs significantly. The methodology doc is updated to match.

## Test Plan

### Manual Testing
Tested on many configurations; the difference in results is negligible,
even with multiple --pp options.
2026-04-14 11:12:58 +01:00
ciaranbor
77ffe039b3 Complete responses api usage response field (#1885)
## Motivation

The Responses API usage response was missing `input_tokens_details` and
`output_tokens_details`. The chat completions API already reports these.

## Changes

- Added `InputTokensDetails` (`cached_tokens`) and `OutputTokensDetails`
(`reasoning_tokens`) to `ResponseUsage`
- Extracted shared `_build_response_usage()` helper for both streaming
and non-streaming paths

## Test Plan

### Manual Testing

4-node cluster, `Qwen3-30B-A3B-4bit` — verified both detail objects
present with correct values in streaming and non-streaming responses.

### Automated Testing

13 tests in `test_openai_responses_api.py`.
2026-04-13 17:38:33 +00:00
rltakashige
3f0df404a5 Reduce memory consumption by adding Flash Attention to Qwen3.5 and Gemma 4, and fix RotatingKVCache prefix cache memory leak (#1886)
## Motivation

Part 1 of many memory improvements.

## Changes
As written in the title

## Test Plan

### Manual Testing
Gemma 4 26B cache reduced from 54 GB to 10 GB per 100k tokens; Qwen3.5 35B
A3B cache reduced from 21 GB to 7 GB per 100k tokens.
2026-04-13 18:32:17 +01:00
Evan Quiney
9b381f7bfe bump and simplify flake (#1866)
seems like stablepkgs swiftfmt works now! also bump macmon to 0.7
2026-04-13 15:45:17 +00:00
Alex Cheema
d2f67b5d10 dashboard: group Gemma under Google with proper logo (#1883)
## Motivation

In the dashboard model picker sidebar, the Gemma 4 models were showing
up under a "Gemma" family with the generic fallback tick/checkmark icon
(the default case in `FamilyLogos.svelte`), since no dedicated logo
branch existed for `family === "gemma"`. Every other vendor (Meta,
NVIDIA, OpenAI, DeepSeek, Qwen, …) has its own brand mark.

Gemma is Google's model family, so it should live under a **Google**
bucket that future Google-authored models can join, and it should render
with a proper Google logo in the same style as its neighbors.

## Changes

- `dashboard/src/lib/components/FamilyLogos.svelte`: added a `family ===
"google"` branch rendering a monochrome Google "G" as a single `<path>`
inside the shared `24×24` viewBox with `fill="currentColor"`, matching
the other vendor logos.
- `dashboard/src/lib/components/FamilySidebar.svelte`: added `google:
"Google"` to the `familyNames` display map.
- `dashboard/src/lib/components/ModelPickerModal.svelte`: inserted
`"google"` into the `familyOrder` array (next to `"llama"`) so the
vendor has a deterministic sort position.
- `resources/inference_model_cards/mlx-community--gemma-4-*.toml` (16
files): changed `family = "gemma"` → `family = "google"`. `base_model =
"Gemma 4 …"` is unchanged, so the model titles still read "Gemma".

## Why It Works

The sidebar builds its family list from whatever values appear in
`model.family` across the loaded model cards (`ModelPickerModal.svelte`
`uniqueFamilies`). Renaming the family string on the 16 Gemma cards from
`"gemma"` to `"google"` collapses them into a single "Google" bucket,
and the new logo branch + display-name map entry gives that bucket a
real brand mark and label. All other logos share the same `w-6 h-6 /
viewBox="0 0 24 24" / fill="currentColor"` shape, so inheriting
`text-exo-yellow` / `text-white/50` just works.

## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max
- `cd dashboard && npm install && npm run build` — dashboard builds
cleanly.
- `uv run exo`, opened `http://localhost:52415`, clicked **SELECT
MODEL**:
- sidebar shows a **Google** entry with a monochrome Google "G" logo in
the same style as Meta / NVIDIA / etc.
  - old "Gemma" entry with the generic tick is gone.
- clicking **Google** filters to the Gemma 4 variants (e2b / e4b / 26B
A4B / 31B).
- hover/selected color states switch between `text-white/50` and
`text-exo-yellow` correctly.

### Automated Testing
- No new tests — this is a cosmetic grouping/logo change. Existing
dashboard build verifies the Svelte + TS compiles.
2026-04-13 14:08:15 +00:00
chaoliang yan
8973503322 fix: use configured api_port for IP connectivity probes (#1877)
## Motivation

Fixes #1861

When `--api-port` is set to a non-default value (e.g., `--api-port
55555`), the IP connectivity discovery system still probes peers on the
hardcoded default port 52415. Since the API is not listening on 52415,
all reachability checks fail, the topology reports zero reachable nodes,
and the dashboard shows "No valid configurations for current settings."

## Changes

Thread the configured `api_port` from `Args` through `Worker` into the
reachability probe functions:

- `net_profile.py`: `check_reachability()` and `check_reachable()`
accept an `api_port` parameter (default 52415 for backward
compatibility)
- `worker/main.py`: `Worker` stores `api_port` and passes it to
`check_reachable()`, uses it in `Multiaddr` construction and the mDNS
connection filter
- `main.py`: passes `args.api_port` to the `Worker` constructor

## Why It Works

The `/node_id` endpoint used by reachability probes is served by the
FastAPI app, which binds to `args.api_port`. The probes must use the
same port the API is actually listening on. Before this fix, the port
was hardcoded in three places in `net_profile.py` and `worker/main.py`;
now it uses the value from the CLI flag.
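As a sketch, the defaulted parameter keeps existing call sites working while letting `--api-port` flow through (`probe_url` here is illustrative, not exo's actual function name):

```python
DEFAULT_API_PORT = 52415

def probe_url(host: str, api_port: int = DEFAULT_API_PORT) -> str:
    """Build the reachability probe target; /node_id is served by the
    FastAPI app bound to args.api_port, so probes must use the same port."""
    return f"http://{host}:{api_port}/node_id"
```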

## Test Plan

### Manual Testing
Hardware: not available for multi-node testing
- Verified ruff passes on all changed files
- Code inspection: traced `api_port` flow from `Args.parse()` →
`Node.create()` → `Worker.__init__()` → `_poll_connection_updates()` →
`check_reachable()` → `check_reachability()` → HTTP probe URL

### Automated Testing
- No existing automated tests cover the reachability probe code path
- The new `api_port` parameter defaults to `52415`, so all existing
behavior is preserved when `--api-port` is not specified

---------

Co-authored-by: lawrence3699 <lawrence3699@users.noreply.github.com>
Co-authored-by: Evan <evanev7@gmail.com>
2026-04-13 14:01:42 +00:00
Alex Cheema
eb9228615f models: add MiniMax M2.7 cards (#1884)
## Motivation

The mlx-community [MiniMax-M2.7
collection](https://huggingface.co/collections/mlx-community/minimax-m27)
landed but exo didn't have model cards for any of the variants yet, so
they weren't selectable from the dashboard model picker. Adding cards
also makes them discoverable under the existing MiniMax family entry.

## Changes

Added 6 new model cards in `resources/inference_model_cards/`, one per
quant of MiniMax M2.7:

- `mlx-community--MiniMax-M2.7.toml` (bf16, full precision — 457 GB)
- `mlx-community--MiniMax-M2.7-4bit.toml` (128 GB)
- `mlx-community--MiniMax-M2.7-4bit-mxfp4.toml` (121 GB)
- `mlx-community--MiniMax-M2.7-5bit.toml` (157 GB)
- `mlx-community--MiniMax-M2.7-6bit.toml` (185 GB)
- `mlx-community--MiniMax-M2.7-8bit.toml` (243 GB)

All six use `family = "minimax"` and share `base_model = "MiniMax M2.7"`
so they collapse into a single group in the picker with the existing
MiniMax logo. Architecture fields (`n_layers = 62`, `hidden_size =
3072`, `num_key_value_heads = 8`, `context_length = 196608`) were read
from each repo's `config.json`; `storage_size.in_bytes` was summed from
the HF tree API per repo.

`capabilities = ["text", "thinking"]` follows the existing MiniMax M2.5
cards — the chat template always emits `<think>` tags (no toggle),
matching M2.5 behavior.

## Why It Works

Model cards in `resources/inference_model_cards/` are auto-loaded by
`src/exo/shared/models/model_cards.py::get_model_cards`. The dashboard
picker groups by `base_model` and filters by `family`, so sharing both
across all six variants gives a single "MiniMax M2.7" group under the
MiniMax sidebar entry, with the quant variants exposed as selectable
sub-options.

## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max
- Ran `uv run python -c "…await get_model_cards()…"` and confirmed all 6
new cards load with `family=minimax`, `base_model="MiniMax M2.7"`, and
correct quant + byte sizes.
- `cd dashboard && npm run build` then `uv run exo`, opened the model
picker → **MiniMax** family → **MiniMax M2.7** group shows all six quant
variants.

### Automated Testing
- No new automated tests — these are data files validated by the
existing Pydantic `ModelCard` schema at load time.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:47:09 +01:00
MikkoParkkola
4b13735ea3 build: remove pyinstaller temp artifacts (#1868)
removes PyInstaller `build/` leftovers after `just package`
2026-04-11 11:46:03 +00:00
rltakashige
196543ce69 Add Gemma 4 + VLM fixes + thinking parsing updates (#1851)
## Motivation
Add support for Gemma 4, including VLM!

## Changes

- Add auto parallel strategies and model cards for Gemma 4
- Normalise Gemma 4's special Vision Transformer handling to be in line
with the rest of our vision processors.
- Also adds reprs to messages and b64 hashes to prevent log spam.

## Test Plan

### Manual Testing
Tested manually on 4bit E2B and 8bit 26B

### Automated Testing
Model onboarding shows small logit diffs.

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-11 12:29:33 +01:00
ciaranbor
6172617b00 add env override to macos app (#1869)
## Motivation

- Let users pass/override arbitrary exo env vars from the macOS app
without a code change.

## Changes

- `ExoProcessController.swift`: `CustomEnvironmentVariable` struct +
`@Published` list persisted to `UserDefaults`, injected into the child
process after built-ins.
- `SettingsView.swift`: new **Environment** tab with add/remove rows,
trim + dedup on save, and a POSIX name validator with a warning badge.

## Why It Works

- Custom vars applied last in `makeEnvironment`, so overriding a
built-in works with no special-casing.

## Test Plan

### Manual Testing

- Set `EXO_LIBP2P_NAMESPACE` via the new UI; confirmed override in
`~/.exo/exo_log/exo.log`.
2026-04-10 17:39:55 +01:00
ciaranbor
93a980a61e just package first builds dashboard (#1867)
2026-04-10 16:25:30 +00:00
ciaranbor
2962ebee60 Fix pdf inputs on Safari (#1865)
## Motivation

PDF attachments weren't working on Safari

## Changes

Create async readable stream if none exists

## Why It Works
pdfjs-dist requires an async readable stream internally

## Test Plan

### Manual Testing
PDF attachments now work on Safari and still work on Firefox.
2026-04-10 14:59:34 +00:00
rltakashige
abd75ae06c Truncate long logs with repr (#1854)

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-10 15:53:48 +01:00
kaiisfree
ee2e505b3c fix: handle BrokenResourceError in download progress callback (#1846)
## Summary
- Wraps progress callback `send()` in try/except to gracefully handle
`BrokenResourceError` when the memory stream is closed
- Prevents unhandled `ExceptionGroup` from crashing the process when the
download consumer disconnects during transfer

## Root Cause
The download progress callback sends updates through an anyio memory
object stream. When the receiving end closes (e.g., client disconnect,
timeout, or task cancellation), `send()` raises `BrokenResourceError`.
Inside an anyio `TaskGroup`, this unhandled exception becomes an
`ExceptionGroup` that propagates up and crashes the coordinator.

## Fix
Catch `BrokenResourceError` (and `ClosedResourceError` for completeness)
in the progress callback and handle gracefully — the download continues
but progress updates are silently dropped for disconnected consumers.

Fixes #1844

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-10 11:43:42 +00:00
Evan Quiney
f2e6b1ef76 prevent some crash loops (#1827)
extension to #1763 that prevents crash looping in some common scenarios.
2026-04-09 11:34:35 +00:00
ciaranbor
e2e17eafb7 Fix reasoning_tokens counting for multi-token thinking tag models (#1848)
## Motivation

`reasoning_tokens` is always 0 in usage stats, even when thinking
content streams correctly via `reasoning_content` SSE deltas. The MLX
generators had their own thinking detection comparing individual
detokenized tokens against think tags — this never fires for models
where tags span multiple tokens (e.g. gpt-oss-120b) or are already in
the prompt.

## Changes

- Removed broken per-token thinking detection from `batch_generate.py`
and `generate.py`
- Added `_count_reasoning_tokens` wrapper in `model_output_parsers.py`
that counts `is_thinking=True` responses and patches the total into
Usage on the final response
- Wired it as the outermost stage of `apply_all_parsers`, so it works
regardless of which parser sets `is_thinking`
- Added 3 tests covering `parse_thinking_models` and `parse_gpt_oss`
paths

## Why It Works

The parser pipeline already correctly sets `is_thinking` on each
response. Counting at the output of `apply_all_parsers` means one
counting point that works for all model types, replacing the duplicate
broken logic in two generators.
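A minimal sketch of that outermost counting stage (the dict shapes below are assumed stand-ins for exo's real response types):

```python
def count_reasoning(responses):
    """Count is_thinking responses and patch the total into the final
    response's usage dict, regardless of which parser set the flag."""
    responses = list(responses)
    total = sum(1 for resp in responses if resp.get("is_thinking"))
    if responses and "usage" in responses[-1]:
        responses[-1]["usage"]["reasoning_tokens"] = total
    return responses
```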

## Test Plan

### Manual Testing

- 4-node cluster, `mlx-community/gpt-oss-120b-MXFP4-Q8`
- Main branch: `reasoning_tokens: 0` — fix branch: `reasoning_tokens:
25`

### Automated Testing

- 3 new tests: explicit think tags, `starts_in_thinking=True`, and
gpt-oss Harmony analysis channel
2026-04-09 12:29:35 +01:00
ciaranbor
b12cd1b186 Cancel SSE keep-alive when instance is deleted (#1828)
## Motivation

When a model instance is deleted (e.g. node disconnect, manual
teardown), any in-flight SSE streaming connections for that instance
hang indefinitely. The API never closes the response stream, so clients
block forever waiting for more chunks.

## Changes

- Listen for `InstanceDeleted` events in the API event loop
- Add `_close_streams_for_instance()` to find and close any active
text/image generation queues tied to tasks on the deleted instance
- Add unit tests covering text gen, image gen, and
unrelated-instance-not-closed scenarios

## Why It Works

When an instance is deleted, we iterate `state.tasks` to find commands
running on that instance, then close and remove their send-side queue
handles. This causes the SSE generator to terminate, unblocking the
client.
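The cleanup walk can be sketched like this (task and queue shapes are assumed, not exo's actual ones):

```python
def close_streams_for_instance(tasks, queues, instance_id):
    """Close and remove send-side queues for tasks on a deleted instance,
    so their SSE generators terminate and unblock clients."""
    for task_id, task in list(tasks.items()):
        if task["instance_id"] == instance_id:
            queue = queues.pop(task_id, None)
            if queue is not None:
                queue.close()
```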

## Test Plan

### Manual Testing
- This was causing issues for me on another branch (integration tests).
Including this fix solved the issue

### Automated Testing
- `test_instance_deleted_stream_cleanup.py`: 3 tests covering text gen
cleanup, image gen cleanup, and ensuring unrelated streams are not
affected
2026-04-08 16:14:28 +01:00
mlpy0
62570227ff Catch ClosedResourceError when forwarding chunks to client queues (#1856)
While stress-testing inference with rapid client cancels mid-stream, I
hit a reproducible crash where the entire exo process exits.

When a client cancels a streaming chat completion partway through, its
receive stream gets closed cleanly via its context manager. The producer
in `API._apply_state` then calls `queue.send(event.chunk)`, which raises
`anyio.ClosedResourceError` rather than `BrokenResourceError`. The
existing handler only catches `BrokenResourceError`, so the exception
propagates through the API task group, kills the Node task group, and
the process exits with `EXO Shutdown complete`.

Trace from one of the crashes:

```
File "exo/api/main.py", line 1818, in _apply_state
    await queue.send(event.chunk)
File "anyio/streams/memory.py", line 212, in send_nowait
    raise ClosedResourceError
anyio.ClosedResourceError
```

The fix is to catch `ClosedResourceError` alongside
`BrokenResourceError` in both queue handlers (text and image), so the
dead queue gets dropped and `_apply_state` keeps running for other
in-flight requests.
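The guarded forwarding loop looks roughly like this (stand-in exception and queue classes keep the sketch self-contained; the real code catches `anyio.BrokenResourceError` / `anyio.ClosedResourceError`):

```python
class BrokenResourceError(Exception):
    """Stand-in for anyio.BrokenResourceError."""

class ClosedResourceError(Exception):
    """Stand-in for anyio.ClosedResourceError."""

class ClientQueue:
    """Minimal stand-in for the send side of an anyio memory stream."""
    def __init__(self):
        self.closed = False
        self.items = []
    def send(self, chunk):
        if self.closed:
            raise ClosedResourceError
        self.items.append(chunk)

def forward_chunk(queues, chunk):
    """Send a chunk to every live queue; drop dead queues instead of
    letting the exception propagate and kill the task group."""
    for queue in list(queues):
        try:
            queue.send(chunk)
        except (BrokenResourceError, ClosedResourceError):
            queues.remove(queue)  # client gone: drop it, keep serving others
    return queues
```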
2026-04-08 08:50:04 +00:00
Alex Cheema
645bc20950 Add Fast Synch Enabled toggle to macOS app settings (#1852)
## Motivation

The exo backend already supports `--fast-synch` / `--no-fast-synch` CLI
flags and the `EXO_FAST_SYNCH` environment variable, but there was no
way to toggle this from the macOS app UI. Users who want fast CPU-to-GPU
synchronization for RDMA with Tensor Parallelism had to use CLI flags.

## Changes

- **ExoProcessController.swift**: Added `fastSynchEnabled`
UserDefaults-backed property and pass `EXO_FAST_SYNCH=on` to the exo
process environment when enabled.
- **SettingsView.swift**: Added a "Performance" section to the Advanced
tab with a "Fast Synch Enabled" toggle, an info icon (ⓘ) tooltip
explaining the feature and trade-offs, and a "Save & Restart" button.

## Why It Works

Follows the exact same pattern as the existing `offlineMode` and
`enableImageModels` settings — UserDefaults persistence, `@Published`
property with `didSet`, environment variable passthrough in
`makeEnvironment()`, and pending state with Save & Restart in the
settings UI. The `EXO_FAST_SYNCH=on` value matches what the Python
backend already reads in `main.py`.

## Test Plan

### Manual Testing
Hardware: macOS app
- Open Settings → Advanced tab → verify "Performance" section with "Fast
Synch Enabled" toggle appears
- Hover the ⓘ icon → verify tooltip explains the feature and GPU lock
trade-off
- Toggle on → click "Save & Restart" → verify process restarts with
`EXO_FAST_SYNCH=on` in env
- Close and reopen Settings → verify the toggle state persists
- Verify "Save & Restart" button is disabled when no changes are pending

### Automated Testing
- Existing settings patterns are well-established; no new automated
tests needed for this UI toggle

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 01:04:42 +00:00
rltakashige
5757c27dd5 Add download utility script (#1855)
## Motivation

<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->

## Changes

<!-- Describe what you changed in detail -->

## Why It Works

<!-- Explain why your approach solves the problem -->

## Test Plan

### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->

### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
2026-04-08 00:58:39 +00:00
Andrei Cravtov
fd5b23281c Workspace tweaks (#1849)
## Changes

Mostly chore changes around vscode and jetbrains workspace settings, and
some basedpyright settings tweaks, to allow direnv to work and nixd
autocomplete with flake parts to work
2026-04-07 17:26:29 +00:00
rltakashige
43b3df45fb Fix BatchGenerator in line with upstream refactor (and prevent Qwen3.5 memory leak) (#1835)
## Motivation

MLX LM has had a massive refactor to their BatchGenerator recently.
Since we'd like new features from MLX LM such as Gemma 4, we need to
update the code to handle this.

Additionally, this fixes a significant memory leak in GatedDeltaNet (the
difference is quite substantial, up to 1GB every 1000 tokens, which
explains several memory issues users were facing with Qwen3.5 models).

## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>


After (no memory leak, thanks to one of the upstream changes)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
2026-04-07 11:50:12 +00:00
mlpy0
24420eb10a Fix reasoning_tokens always reported as 0 for thinking models (#1836)
When `enable_thinking` is set, chat templates (Qwen3, DeepSeek, etc.)
append `<think>` to the prompt. The model starts generating thinking
content directly without emitting a `<think>` token in the output
stream.

Both generators initialized `in_thinking = False` and only set it to
`True` on seeing a `<think>` token in output. Since that token was part
of the prompt, the flag never flipped and `reasoning_tokens` stayed at 0
in the usage response.

Fix: initialize `in_thinking` from `detect_thinking_prompt_suffix()`,
which already exists and is used by `model_output_parsers` for routing
thinking content correctly.
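A condensed sketch of the fix (the real `detect_thinking_prompt_suffix` is more involved; this stand-in only checks the templated prompt's tail):

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def detect_thinking_prompt_suffix(prompt: str) -> bool:
    """Simplified stand-in: the templated prompt ends inside a think block."""
    return prompt.rstrip().endswith(THINK_OPEN)

def count_reasoning_tokens(prompt: str, output_tokens: list[str]) -> int:
    # The fix: start in thinking mode when the chat template already
    # appended <think> to the prompt, instead of always starting False.
    in_thinking = detect_thinking_prompt_suffix(prompt)
    reasoning = 0
    for token in output_tokens:
        if token == THINK_OPEN:
            in_thinking = True
        elif token == THINK_CLOSE:
            in_thinking = False
        elif in_thinking:
            reasoning += 1
    return reasoning
```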
2026-04-05 00:05:18 +00:00
rltakashige
59669c1168 Tighten EXO bench concurrency numbers and explain methodology (#1811)
## Motivation

The timings in the batch generator are a little optimistic; a minor
change is needed to make them more correct.

## Changes

Include the time spent in the API in the generation tps and make sure to
send all requests simultaneously
2026-04-05 00:57:52 +01:00
ciaranbor
1d2ce464dc Allow pausing and deleting active downloads (#1829)
## Motivation

No way to pause active downloads or delete partial/failed downloads from
the dashboard.

## Changes

- **Backend:** Added `POST /download/cancel` endpoint with
`CancelDownloadParams`/`CancelDownloadResponse` types. Wires into the
existing `CancelDownload` command + coordinator handler.
- **Dashboard store:** Added `cancelDownload(nodeId, modelId)` function.
- **Dashboard UI:**
  - Pause + delete buttons on active (downloading) cells
  - Delete button on paused/pending and failed cells
- Extracted duplicated SVG icons into `{#snippet}` blocks (`trashIcon`,
`downloadIcon`, `pauseIcon`, `deleteButton`)
- **Tests:** 3 coordinator-level tests for cancel: active download →
pending, nonexistent → no-op, cancel then resume.

## Why It Works

`CancelDownload` command and coordinator handler already existed — just
needed an HTTP endpoint and dashboard wiring. Delete endpoint already
supported all download states.

## Test Plan

### Manual Testing

Started a model download, paused it. Deleted some paused downloads.
Deleted some ongoing downloads.

### Automated Testing

- `test_cancel_active_download_transitions_to_pending` — cancels
in-progress download, asserts `DownloadPending` event and cleanup
- `test_cancel_nonexistent_download_is_noop` — no events emitted
- `test_cancel_then_resume_download` — restart after cancel works
2026-04-02 15:56:33 +01:00
ciaranbor
eb6ae9fd3c Prevent failed instance retries (#1763)
## Motivation

Currently, when a runner fails, the master retries the instance. Most of
the time, this causes a loop over failure. Retries need backoff and a
cap.

## Changes

- `src/exo/worker/main.py`: Before creating a runner, check a
per-instance exponential backoff timer. After `EXO_MAX_INSTANCE_RETRIES`
failures, send `DeleteInstance` to permanently remove the instance.
Record attempts on `Shutdown`; reset on `InstanceDeleted`.
- `src/exo/utils/keyed_backoff.py`: Add an `attempts()` method to query
the retry count.
- `src/exo/shared/constants.py`: Add `EXO_MAX_INSTANCE_RETRIES = 3`.

## Why It Works

The worker gates CreateRunner tasks behind a KeyedBackoff, adding
exponential delay (2s base, 30s cap) between retries. After 3 failures
the worker sends DeleteInstance, stopping retries entirely. The backoff
resets when the instance is deleted, so a fresh placement starts clean.
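
The gating logic can be sketched as a toy version of the backoff class
(the real one lives in `src/exo/utils/keyed_backoff.py`; anything beyond
the 2s base, 30s cap, `attempts()`, and retry limit stated above is
illustrative):

```python
EXO_MAX_INSTANCE_RETRIES = 3  # mirrors src/exo/shared/constants.py

class KeyedBackoff:
    """Toy per-key exponential backoff (2s base, 30s cap)."""

    def __init__(self, base: float = 2.0, cap: float = 30.0) -> None:
        self.base = base
        self.cap = cap
        self._attempts: dict[str, int] = {}
        self._last_failure: dict[str, float] = {}

    def attempts(self, key: str) -> int:
        return self._attempts.get(key, 0)

    def record_failure(self, key: str, now: float) -> None:
        self._attempts[key] = self.attempts(key) + 1
        self._last_failure[key] = now

    def delay(self, key: str) -> float:
        # 2s, 4s, 8s, ... capped at 30s; no delay before the first failure.
        n = self.attempts(key)
        return 0.0 if n == 0 else min(self.cap, self.base * 2 ** (n - 1))

    def ready(self, key: str, now: float) -> bool:
        return now - self._last_failure.get(key, 0.0) >= self.delay(key)

    def reset(self, key: str) -> None:
        # InstanceDeleted clears history so a fresh placement starts clean.
        self._attempts.pop(key, None)
        self._last_failure.pop(key, None)

backoff = KeyedBackoff()
for _ in range(EXO_MAX_INSTANCE_RETRIES):
    backoff.record_failure("instance-a", now=0.0)
# At this point the worker would send DeleteInstance instead of retrying.
print(backoff.attempts("instance-a"))  # 3
```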

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-01 21:03:34 +01:00
rltakashige
4688adb5d2 Support PDFs in dashboard (#1822)
Like ChatGPT does, we now send both the extracted text and the image of
each PDF page.
2026-03-31 18:25:40 +01:00
rltakashige
d9ed943034 Fix Nemotron cache leak upstream (#1819)
## Motivation
Nemotron Cascade and Nano were failing at long decodes.

## Changes

Fixed upstream; this PR just bumps `pyproject.toml` and the `uv` lockfile.


## Test Plan
### Automated Testing
Tested with a reproduction script upstream.
2026-03-30 16:53:21 +00:00
rltakashige
c6815bfdce Only update KV prefix cache on a good cache hit (#1817)
## Motivation

Addresses #1816 

## Changes

- Update the prefix cache only when the prefix hit length exceeds
`min_prefix_hit_length` **and** the hit ratio exceeds
`_MIN_PREFIX_HIT_RATIO_TO_UPDATE`.
- `min_prefix_hit_length = max(1000, system prompt length)`, so system
prompts must match exactly.
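
The gate can be sketched as a small predicate (the ratio constant's value
and the function name are illustrative, not exo's actual identifiers):

```python
_MIN_PREFIX_HIT_RATIO_TO_UPDATE = 0.8  # assumed value for illustration

def should_update_prefix_cache(hit_length: int, prompt_length: int,
                               system_prompt_length: int) -> bool:
    # System prompts must match exactly: the threshold is at least the
    # full system prompt, and never below 1000 tokens.
    min_prefix_hit_length = max(1000, system_prompt_length)
    hit_ratio = hit_length / prompt_length if prompt_length else 0.0
    return (hit_length > min_prefix_hit_length
            and hit_ratio > _MIN_PREFIX_HIT_RATIO_TO_UPDATE)

print(should_update_prefix_cache(5000, 5500, 1200))  # True: good hit
print(should_update_prefix_cache(1100, 9000, 1200))  # False: low ratio
```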

## Test Plan

### Manual Testing
Tested on OpenCode and Claude Code.
2026-03-30 15:04:38 +01:00
rltakashige
39c39e8199 Integrations helpers (#1810)
2026-03-30 14:28:41 +01:00
rltakashige
e5cb7b80d0 Add SSE-keepalive to not time out on long prefill on clients (#1803)
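The title describes a standard SSE technique: while a long prefill keeps
the stream silent, emit comment lines (lines starting with `:`, which
clients ignore per the SSE format) so proxies and clients don't drop the
idle connection. A minimal sketch under that assumption (not exo's actual
implementation):

```python
import asyncio
from typing import AsyncIterator

async def with_keepalive(tokens: AsyncIterator[str],
                         interval: float = 5.0) -> AsyncIterator[str]:
    """Wrap a token stream in SSE frames, emitting comment lines while
    the upstream (e.g. a long prefill) is silent so clients don't time out."""
    it = tokens.__aiter__()
    while True:
        task = asyncio.ensure_future(it.__anext__())
        while not task.done():
            done, _ = await asyncio.wait({task}, timeout=interval)
            if not done:
                # ':' lines are SSE comments: ignored by clients, but
                # they keep the idle connection alive.
                yield ": keepalive\n\n"
        try:
            token = task.result()
        except StopAsyncIteration:
            return
        yield f"data: {token}\n\n"
```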
2026-03-30 12:18:38 +01:00
rltakashige
635801d515 Add multimodality! (#1802)
## Motivation

Images!

TODO (in a future PR): Add audio and video support.

## Test Plan

### Manual Testing
<img width="2652" height="1900" alt="image"
src="https://github.com/user-attachments/assets/7d3a7137-542f-4f94-9193-2c73b7c4a5ec"
/>

<img width="2770" height="1956" alt="image"
src="https://github.com/user-attachments/assets/e3c3a096-8029-4409-97a6-aca31a9a3f24"
/>
<img width="2738" height="1768" alt="image"
src="https://github.com/user-attachments/assets/d70ea37f-cd1d-4a4c-ad08-3beb9fafa380"
/>

(And batching also works)

---------

Co-authored-by: David Hind <davehind@yahoo.co.uk>
2026-03-30 11:52:19 +01:00
rltakashige
2efbb8ab4f Improve exo harness with path state (#1815)
<img width="3224" height="1476" alt="image"
src="https://github.com/user-attachments/assets/d90a7d8a-9fe5-43a1-a715-1ef7ecc15422"
/>
2026-03-30 11:20:46 +01:00
Evan Quiney
c6c5a3e73c feat: /state/paths (#1796)
adds a path option to the /state endpoint, allowing you to query
subfields of state without grabbing the whole blob
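
the idea is easy to picture with a toy resolver (illustrative only; the
actual path syntax may differ):

```python
from typing import Any

def resolve_path(state: dict[str, Any], path: str) -> Any:
    """Walk a '/'-separated path into a nested state dict, so clients
    can fetch a subfield instead of the whole blob."""
    node: Any = state
    for part in path.strip("/").split("/"):
        if not part:
            continue
        node = node[part]
    return node

state = {"downloads": {"qwen": {"progress": 0.42}}, "nodes": []}
print(resolve_path(state, "/downloads/qwen/progress"))  # 0.42
```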

## test plan
poking around in the api
2026-03-30 10:10:00 +00:00
ArvidSU
10ef7ec9e8 feat: add Firefox AI sidebar (?q=) support to dashboard (#1814)
This PR builds on https://github.com/exo-explore/exo/pull/1677 to let
custom prompts from Firefox's `browser.ml.chat` sidebar reach the EXO
dashboard via URL parameters, enabling page summaries and other browser
interactions. See the "Summarize page" example below.

## Summary
- Parse `?q=<encoded prompt>` URL parameter on page load and auto-submit
it as a chat message
- Clean up the URL with `history.replaceState` to prevent re-submission
on refresh
- Defer auto-send until both cluster state and model list are loaded so
model auto-selection works correctly

## Context
Firefox's built-in AI sidebar (`about:config: browser.ml.chat.enabled`)
integrates with chat providers by appending the user's prompt as
`?q=<URL-encoded prompt>`. Previously the exo dashboard ignored this
parameter. Users can now configure `http://localhost:52415` as a Firefox
AI chatbot provider.

See: https://support.mozilla.org/en-US/kb/ai-chatbot

## Technical notes
- Frontend-only change in `dashboard/src/routes/+page.svelte`
- Uses a Svelte `$effect` that reacts to `pendingFirefoxQuery`, `data`
(cluster state), and `models.length` — fires exactly once when all three
are ready
- If no model is selected, `handleAutoSend` auto-picks the best
available model; if no model fits memory, a toast is shown
- If a model is selected but not running, the message is queued until
the model loads

## Testing
```
http://localhost:52415/?q=Hello+world
http://localhost:52415/?q=Summarize+this+page%3A+%5Bpage+title%5D+%5Bpage+url%5D
```

<img width="2056" height="1329" alt="image"
src="https://github.com/user-attachments/assets/74463eb4-ca1a-400d-806a-c19ba93147b9"
/>
2026-03-30 11:02:35 +01:00
Evan Quiney
1e51dc89b0 chore: bump exo-version with release version (#1807)
our pyproject.toml version was 0.3.68 - update to 0.3.69 in line with
the release!!
2026-03-27 11:47:13 +00:00