Commit Graph

2291 Commits

Author SHA1 Message Date
Evan
3eb1fbe394 wuff 2026-04-17 18:13:43 +01:00
Evan
e0c9e82755 fix test 2026-04-17 18:13:22 +01:00
Evan
b2fe0b8904 remove layer loading callback 2026-04-17 18:13:09 +01:00
mlpy0
01598960bd Add model card for Qwen3.6-35B-A3B-8bit (#1917)
Adds the 8bit variant missing from #1907 — the safetensors index is now
live on HF.

- `mlx-community/Qwen3.6-35B-A3B-8bit` (~35 GB)

Architectural fields match the existing 4bit/5bit/bf16 cards.
`storage_size.in_bytes` is taken from `metadata.total_size` of the
upstream `model.safetensors.index.json`.
2026-04-17 10:06:23 +00:00
Alex Cheema
63b8e64715 Add model cards for Qwen3.6-35B-A3B variants (#1907)
## Motivation

`mlx-community` has just published the new **Qwen3.6-35B-A3B**
multimodal MoE family on HuggingFace. Without static model cards exo
doesn't surface these models in the dashboard picker or match its
placement / prefill logic, so users can't one-click launch them. This PR
adds cards for the three quants whose safetensors indexes are already
live on HF (4bit / 5bit / bf16).

## Changes

Three new TOML files in `resources/inference_model_cards/`:

- `mlx-community--Qwen3.6-35B-A3B-4bit.toml` (~19 GB)
- `mlx-community--Qwen3.6-35B-A3B-5bit.toml` (~23 GB)
- `mlx-community--Qwen3.6-35B-A3B-bf16.toml` (~65 GB)

All three share the same architectural fields (`n_layers = 40`,
`hidden_size = 2048`, `num_key_value_heads = 2`, `context_length =
262144`, capabilities `text, thinking, thinking_toggle, vision`,
`base_model = "Qwen3.6 35B A3B"`) — only `model_id`, `quantization`, and
`storage_size.in_bytes` differ between variants.

## Why It Works

- Qwen3.6-35B-A3B reuses the `qwen3_5_moe` architecture
(`Qwen3_5MoeForConditionalGeneration`) — the same one already wired into
exo's MLX runner at `src/exo/worker/engines/mlx/auto_parallel.py:47` via
`Qwen3_5MoeModel`. The architectural fields are taken verbatim from the
HF `config.json.text_config` and match the existing `Qwen3.5-35B-A3B-*`
cards.
- Storage sizes are the exact `metadata.total_size` read from each
variant's `model.safetensors.index.json` on HF, so download progress and
cluster-memory-fit checks are accurate.
- Vision support is flagged in `capabilities`; the `[vision]` block is
auto-detected by `ModelCard._autodetect_vision` from the upstream
`config.json`, so no hand-written vision config is required.
- The card loader (`_refresh_card_cache` in
`src/exo/shared/models/model_cards.py`) globs every `.toml` in
`resources/inference_model_cards/` on startup, so nothing else needs to
change — the `/models` endpoint and the dashboard picker pick them up
automatically.

The `mxfp4` / `mxfp8` / `nvfp4` variants are still uploading upstream
(index JSONs currently 404) and can be added in a follow-up PR once HF
completes.

## Test Plan

### Manual Testing

Hardware: MacBook Pro M4 Max, 48 GB unified memory.

- Built the dashboard, ran `uv run exo`, waited for the API to come up
on `http://localhost:52415`.
- `curl -s http://localhost:52415/models` returns the three new model
ids (`mlx-community/Qwen3.6-35B-A3B-{4bit,5bit,bf16}`) alongside
existing models.
- Opened the dashboard, clicked SELECT MODEL, typed "Qwen3.6" into the
search box. A single **"Qwen3.6 35B A3B"** group appears showing `3
variants (19GB-65GB)`. Expanding it lists the `4bit` / `5bit` / `bf16`
quants with sizes `19GB` / `23GB` / `65GB`, exactly as expected:

![Qwen3.6 35B A3B in model
picker](127119f703/qwen36-picker.png)

- Programmatically loaded each TOML via `ModelCard.load_from_path(...)`
and confirmed the parsed fields (layers / hidden / KV heads / context /
quant / base_model / caps / bytes) match what's written in the files.

### Automated Testing

No code paths were touched — these are pure TOML data files that plug
into the existing model-card loader. The existing pytest suite covers
TOML parsing and card serving; adding new TOMLs doesn't require new test
scaffolding. `uv run ruff check` and `nix fmt` are clean.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ryuichi Leo Takashige <rl.takashige@gmail.com>
2026-04-16 23:25:26 +01:00
rltakashige
28c797846a Update mlx and mlx lm to latest (#1906)
Just bumping to the very latest upstream versions.
2026-04-16 10:59:33 +00:00
rltakashige
058bb08261 Allow copying on dashboard even on HTTP (#1902)
## Motivation

<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->

## Changes

<!-- Describe what you changed in detail -->

## Why It Works

<!-- Explain why your approach solves the problem -->

## Test Plan

### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->

### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
2026-04-15 23:30:12 +01:00
rltakashige
3eead80238 Better environment variables in MacOS app (#1901)
## Motivation

Closes #1858 

2026-04-15 20:14:52 +00:00
rltakashige
87329c80ef Add usage stats to tool calls and handle multiple tool calls correctly (#1899)
## Motivation

Tool calls are usually not end tokens, so they didn't have usage stats.
2026-04-15 19:40:00 +01:00
rltakashige
8cdc833892 Drain tokens silently skipped in thinking parsing (#1898)
## Motivation
Closes #1882
2026-04-15 14:23:07 +00:00
rltakashige
2cd66ae4cf Fix out of order event idx causing fatal crashes (#1894)
## Motivation

<img width="828" height="373" alt="Screenshot 2026-04-14 at 22 56 52"
src="https://github.com/user-attachments/assets/f8f48c1d-68c5-4acc-a6de-9d180672da9d"
/>

If `is_new_master=True`, `_elect_loop` creates a new `EventRouter` before the
worker has receivers. The event router then runs `_run_ext_in`, and
`buf.drain_indexed()` picks off events even though `self.internal_outbound`
is not yet fully populated.

When the worker finally does request events, the next event it receives is
not the first one, and the worker crashes.

## Changes

Start the event router after all the receivers are registered

## Why It Works

`self.internal_outbound` is populated before the drain loop begins.
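The ordering fix can be sketched in a few lines (illustrative only; exo's real `EventRouter` API differs):

```python
class EventRouter:
    """Illustrative sketch: draining only starts once every receiver
    is registered, so the first indexed event always has somewhere to go."""

    def __init__(self, buffered_events):
        self.buffer = list(buffered_events)  # stand-in for buf
        self.receivers = []                  # stand-in for internal_outbound
        self.started = False

    def register(self, receiver):
        if self.started:
            raise RuntimeError("register receivers before start()")
        self.receivers.append(receiver)

    def start(self):
        # The fix: this drain runs only after registration is complete.
        self.started = True
        for event in self.buffer:  # stand-in for drain_indexed()
            for receiver in self.receivers:
                receiver.append(event)
        self.buffer.clear()
```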

## Test Plan

### Manual Testing
No more crashes observed in testing (it's actually quite easy to
reproduce the issue if you have one node with this fix but the other
node on main).

I'm convinced this is a fix, at least.
2026-04-15 08:46:02 +00:00
rltakashige
2ecefa0cfe Fix Qwen3-VL and autodetect vision config (#1893)
## Motivation

Qwen3-VL tensor parallelism doesn't work at the moment, and vision is misbehaving.

## Test Plan

### Manual Testing
Works now.
2026-04-14 23:05:55 +01:00
rltakashige
b8eaf707a8 Add gemma 4 tensor parallelism (#1891) 2026-04-14 20:31:59 +01:00
rltakashige
8d81811b89 Try harder to clean up processes nicely (#1889)
## Motivation

Model loading is actually quite reliable now. There is no need to kill the
runner just because a slow SSD or a massive model makes loading take a
while; the user can shut the instance down if necessary.

The load timeout was a major cause of signal=9 issues, although not the
only one (it may happen during inference too). signal=9 is especially bad
because RDMA stops working until restart whenever it happens.

## Changes

- no more model load timeout
- no more crazy sigkills
- try harder to clean up processes on model shutdown

## Test Plan

### Manual Testing
Tested with some RDMA instances
2026-04-14 16:37:49 +01:00
rltakashige
f2709dcde6 Add prefix cache flag to exo bench (#1888)
## Motivation
When using exo bench extensively, there are many cases where prefix caching
could speed up the benchmarks, especially when the focus is on token
generation.

At the same time, prefix-caching decode tokens is clearly not very useful
in most current scenarios. Surprisingly, even for non-thinking models, the
chat template formats a continued conversation such that the existing cache
is not effective.

We already (slightly accidentally) do this for the batch generator - we
should do it for the sequential generator too.

## Changes

exo bench now accepts a prefix-caching flag. For the most accurate pp
results it is better to leave it off, but it speeds up tg and large
benchmark runs significantly. The methodology doc is updated to match.

## Test Plan

### Manual Testing
Tested on many configurations; the difference in results is negligible,
even with multiple --pp options.
2026-04-14 11:12:58 +01:00
ciaranbor
77ffe039b3 Complete responses api usage response field (#1885)
## Motivation

The Responses API usage response was missing `input_tokens_details` and
`output_tokens_details`. The chat completions API already reports these.

## Changes

- Added `InputTokensDetails` (`cached_tokens`) and `OutputTokensDetails`
(`reasoning_tokens`) to `ResponseUsage`
- Extracted shared `_build_response_usage()` helper for both streaming
and non-streaming paths

## Test Plan

### Manual Testing

4-node cluster, `Qwen3-30B-A3B-4bit` — verified both detail objects
present with correct values in streaming and non-streaming responses.

### Automated Testing

13 tests in `test_openai_responses_api.py`.
2026-04-13 17:38:33 +00:00
rltakashige
3f0df404a5 Reduce memory consumption by adding Flash Attention to Qwen3.5 and Gemma 4, and fix RotatingKVCache prefix cache memory leak (#1886)
## Motivation

Part 1 of many memory improvements.

## Changes
As written in the title

## Test Plan

### Manual Testing
Gemma 4 26B cache reduced from 54 GB to 10 GB per 100k tokens; Qwen3.5 35B
A3B cache reduced from 21 GB to 7 GB per 100k tokens.
2026-04-13 18:32:17 +01:00
Evan Quiney
9b381f7bfe bump and simplify flake (#1866)
seems like stablepkgs swiftfmt works now! also bump macmon to 0.7
2026-04-13 15:45:17 +00:00
Alex Cheema
d2f67b5d10 dashboard: group Gemma under Google with proper logo (#1883)
## Motivation

In the dashboard model picker sidebar, the Gemma 4 models were showing
up under a "Gemma" family with the generic fallback tick/checkmark icon
(the default case in `FamilyLogos.svelte`), since no dedicated logo
branch existed for `family === "gemma"`. Every other vendor (Meta,
NVIDIA, OpenAI, DeepSeek, Qwen, …) has its own brand mark.

Gemma is Google's model family, so it should live under a **Google**
bucket that future Google-authored models can join, and it should render
with a proper Google logo in the same style as its neighbors.

## Changes

- `dashboard/src/lib/components/FamilyLogos.svelte`: added a `family ===
"google"` branch rendering a monochrome Google "G" as a single `<path>`
inside the shared `24×24` viewBox with `fill="currentColor"`, matching
the other vendor logos.
- `dashboard/src/lib/components/FamilySidebar.svelte`: added `google:
"Google"` to the `familyNames` display map.
- `dashboard/src/lib/components/ModelPickerModal.svelte`: inserted
`"google"` into the `familyOrder` array (next to `"llama"`) so the
vendor has a deterministic sort position.
- `resources/inference_model_cards/mlx-community--gemma-4-*.toml` (16
files): changed `family = "gemma"` → `family = "google"`. `base_model =
"Gemma 4 …"` is unchanged, so the model titles still read "Gemma".

## Why It Works

The sidebar builds its family list from whatever values appear in
`model.family` across the loaded model cards (`ModelPickerModal.svelte`
`uniqueFamilies`). Renaming the family string on the 16 Gemma cards from
`"gemma"` to `"google"` collapses them into a single "Google" bucket,
and the new logo branch + display-name map entry gives that bucket a
real brand mark and label. All other logos share the same `w-6 h-6 /
viewBox="0 0 24 24" / fill="currentColor"` shape, so inheriting
`text-exo-yellow` / `text-white/50` just works.

## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max
- `cd dashboard && npm install && npm run build` — dashboard builds
cleanly.
- `uv run exo`, opened `http://localhost:52415`, clicked **SELECT
MODEL**:
- sidebar shows a **Google** entry with a monochrome Google "G" logo in
the same style as Meta / NVIDIA / etc.
  - old "Gemma" entry with the generic tick is gone.
- clicking **Google** filters to the Gemma 4 variants (e2b / e4b / 26B
A4B / 31B).
- hover/selected color states switch between `text-white/50` and
`text-exo-yellow` correctly.

### Automated Testing
- No new tests — this is a cosmetic grouping/logo change. Existing
dashboard build verifies the Svelte + TS compiles.
2026-04-13 14:08:15 +00:00
chaoliang yan
8973503322 fix: use configured api_port for IP connectivity probes (#1877)
## Motivation

Fixes #1861

When `--api-port` is set to a non-default value (e.g., `--api-port
55555`), the IP connectivity discovery system still probes peers on the
hardcoded default port 52415. Since the API is not listening on 52415,
all reachability checks fail, the topology reports zero reachable nodes,
and the dashboard shows "No valid configurations for current settings."

## Changes

Thread the configured `api_port` from `Args` through `Worker` into the
reachability probe functions:

- `net_profile.py`: `check_reachability()` and `check_reachable()`
accept an `api_port` parameter (default 52415 for backward
compatibility)
- `worker/main.py`: `Worker` stores `api_port` and passes it to
`check_reachable()`, uses it in `Multiaddr` construction and the mDNS
connection filter
- `main.py`: passes `args.api_port` to the `Worker` constructor

## Why It Works

The `/node_id` endpoint used by reachability probes is served by the
FastAPI app, which binds to `args.api_port`. The probes must use the
same port the API is actually listening on. Before this fix, the port
was hardcoded in three places in `net_profile.py` and `worker/main.py`;
now it uses the value from the CLI flag.
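As a sketch, the defaulted parameter keeps existing call sites working while letting `--api-port` flow through (`probe_url` here is illustrative, not exo's actual function name):

```python
DEFAULT_API_PORT = 52415

def probe_url(host: str, api_port: int = DEFAULT_API_PORT) -> str:
    """Build the reachability probe target; /node_id is served by the
    FastAPI app bound to args.api_port, so probes must use the same port."""
    return f"http://{host}:{api_port}/node_id"
```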

## Test Plan

### Manual Testing
Hardware: not available for multi-node testing
- Verified ruff passes on all changed files
- Code inspection: traced `api_port` flow from `Args.parse()` →
`Node.create()` → `Worker.__init__()` → `_poll_connection_updates()` →
`check_reachable()` → `check_reachability()` → HTTP probe URL

### Automated Testing
- No existing automated tests cover the reachability probe code path
- The new `api_port` parameter defaults to `52415`, so all existing
behavior is preserved when `--api-port` is not specified

---------

Co-authored-by: lawrence3699 <lawrence3699@users.noreply.github.com>
Co-authored-by: Evan <evanev7@gmail.com>
2026-04-13 14:01:42 +00:00
Alex Cheema
eb9228615f models: add MiniMax M2.7 cards (#1884)
## Motivation

The mlx-community [MiniMax-M2.7
collection](https://huggingface.co/collections/mlx-community/minimax-m27)
landed but exo didn't have model cards for any of the variants yet, so
they weren't selectable from the dashboard model picker. Adding cards
also makes them discoverable under the existing MiniMax family entry.

## Changes

Added 6 new model cards in `resources/inference_model_cards/`, one per
quant of MiniMax M2.7:

- `mlx-community--MiniMax-M2.7.toml` (bf16, full precision — 457 GB)
- `mlx-community--MiniMax-M2.7-4bit.toml` (128 GB)
- `mlx-community--MiniMax-M2.7-4bit-mxfp4.toml` (121 GB)
- `mlx-community--MiniMax-M2.7-5bit.toml` (157 GB)
- `mlx-community--MiniMax-M2.7-6bit.toml` (185 GB)
- `mlx-community--MiniMax-M2.7-8bit.toml` (243 GB)

All six use `family = "minimax"` and share `base_model = "MiniMax M2.7"`
so they collapse into a single group in the picker with the existing
MiniMax logo. Architecture fields (`n_layers = 62`, `hidden_size =
3072`, `num_key_value_heads = 8`, `context_length = 196608`) were read
from each repo's `config.json`; `storage_size.in_bytes` was summed from
the HF tree API per repo.

`capabilities = ["text", "thinking"]` follows the existing MiniMax M2.5
cards — the chat template always emits `<think>` tags (no toggle),
matching M2.5 behavior.

## Why It Works

Model cards in `resources/inference_model_cards/` are auto-loaded by
`src/exo/shared/models/model_cards.py::get_model_cards`. The dashboard
picker groups by `base_model` and filters by `family`, so sharing both
across all six variants gives a single "MiniMax M2.7" group under the
MiniMax sidebar entry, with the quant variants exposed as selectable
sub-options.

## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max
- Ran `uv run python -c "…await get_model_cards()…"` and confirmed all 6
new cards load with `family=minimax`, `base_model="MiniMax M2.7"`, and
correct quant + byte sizes.
- `cd dashboard && npm run build` then `uv run exo`, opened the model
picker → **MiniMax** family → **MiniMax M2.7** group shows all six quant
variants.

### Automated Testing
- No new automated tests — these are data files validated by the
existing Pydantic `ModelCard` schema at load time.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:47:09 +01:00
MikkoParkkola
4b13735ea3 build: remove pyinstaller temp artifacts (#1868)
removes PyInstaller `build/` leftovers after `just package`
2026-04-11 11:46:03 +00:00
rltakashige
196543ce69 Add Gemma 4 + VLM fixes + thinking parsing updates (#1851)
## Motivation
Add support for Gemma 4, including VLM!

## Changes

- Add auto parallel strategies and model cards for Gemma 4
- Normalise Gemma 4's special Vision Transformer handling to be in line
with the rest of our vision processors.
- Also adds reprs to messages and b64 hashes to prevent log spam.

## Test Plan

### Manual Testing
Tested manually on 4bit E2B and 8bit 26B

### Automated Testing
Model onboarding shows small logit diffs.

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-11 12:29:33 +01:00
ciaranbor
6172617b00 add env override to macos app (#1869)
## Motivation

- Let users pass/override arbitrary exo env vars from the macOS app
without a code change.

## Changes

- `ExoProcessController.swift`: `CustomEnvironmentVariable` struct +
`@Published` list persisted to `UserDefaults`, injected into the child
process after built-ins.
- `SettingsView.swift`: new **Environment** tab with add/remove rows,
trim + dedup on save, and a POSIX name validator with a warning badge.

## Why It Works

- Custom vars applied last in `makeEnvironment`, so overriding a
built-in works with no special-casing.

## Test Plan

### Manual Testing

- Set `EXO_LIBP2P_NAMESPACE` via the new UI; confirmed override in
`~/.exo/exo_log/exo.log`.
2026-04-10 17:39:55 +01:00
ciaranbor
93a980a61e just package first builds dashboard (#1867)
2026-04-10 16:25:30 +00:00
ciaranbor
2962ebee60 Fix pdf inputs on Safari (#1865)
## Motivation

PDF attachments weren't working on Safari

## Changes

Create async readable stream if none exists

## Why It Works
pdfjs-dist requires an async readable stream internally

## Test Plan

### Manual Testing
PDF attachments now work on Safari and still work on Firefox.
2026-04-10 14:59:34 +00:00
rltakashige
abd75ae06c Truncate long logs with repr (#1854)

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-10 15:53:48 +01:00
kaiisfree
ee2e505b3c fix: handle BrokenResourceError in download progress callback (#1846)
## Summary
- Wraps progress callback `send()` in try/except to gracefully handle
`BrokenResourceError` when the memory stream is closed
- Prevents unhandled `ExceptionGroup` from crashing the process when the
download consumer disconnects during transfer

## Root Cause
The download progress callback sends updates through an anyio memory
object stream. When the receiving end closes (e.g., client disconnect,
timeout, or task cancellation), `send()` raises `BrokenResourceError`.
Inside an anyio `TaskGroup`, this unhandled exception becomes an
`ExceptionGroup` that propagates up and crashes the coordinator.

## Fix
Catch `BrokenResourceError` (and `ClosedResourceError` for completeness)
in the progress callback and handle gracefully — the download continues
but progress updates are silently dropped for disconnected consumers.

Fixes #1844

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-10 11:43:42 +00:00
Evan Quiney
f2e6b1ef76 prevent some crash loops (#1827)
extension to #1763 that prevents crash looping in some common scenarios.
2026-04-09 11:34:35 +00:00
ciaranbor
e2e17eafb7 Fix reasoning_tokens counting for multi-token thinking tag models (#1848)
## Motivation

`reasoning_tokens` is always 0 in usage stats, even when thinking
content streams correctly via `reasoning_content` SSE deltas. The MLX
generators had their own thinking detection comparing individual
detokenized tokens against think tags — this never fires for models
where tags span multiple tokens (e.g. gpt-oss-120b) or are already in
the prompt.

## Changes

- Removed broken per-token thinking detection from `batch_generate.py`
and `generate.py`
- Added `_count_reasoning_tokens` wrapper in `model_output_parsers.py`
that counts `is_thinking=True` responses and patches the total into
Usage on the final response
- Wired it as the outermost stage of `apply_all_parsers`, so it works
regardless of which parser sets `is_thinking`
- Added 3 tests covering `parse_thinking_models` and `parse_gpt_oss`
paths

## Why It Works

The parser pipeline already correctly sets `is_thinking` on each
response. Counting at the output of `apply_all_parsers` means one
counting point that works for all model types, replacing the duplicate
broken logic in two generators.
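A minimal sketch of that outermost counting stage (the dict shapes below are assumed stand-ins for exo's real response types):

```python
def count_reasoning(responses):
    """Count is_thinking responses and patch the total into the final
    response's usage dict, regardless of which parser set the flag."""
    responses = list(responses)
    total = sum(1 for resp in responses if resp.get("is_thinking"))
    if responses and "usage" in responses[-1]:
        responses[-1]["usage"]["reasoning_tokens"] = total
    return responses
```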

## Test Plan

### Manual Testing

- 4-node cluster, `mlx-community/gpt-oss-120b-MXFP4-Q8`
- Main branch: `reasoning_tokens: 0` — fix branch: `reasoning_tokens:
25`

### Automated Testing

- 3 new tests: explicit think tags, `starts_in_thinking=True`, and
gpt-oss Harmony analysis channel
2026-04-09 12:29:35 +01:00
ciaranbor
b12cd1b186 Cancel SSE keep-alive when instance is deleted (#1828)
## Motivation

When a model instance is deleted (e.g. node disconnect, manual
teardown), any in-flight SSE streaming connections for that instance
hang indefinitely. The API never closes the response stream, so clients
block forever waiting for more chunks.

## Changes

- Listen for `InstanceDeleted` events in the API event loop
- Add `_close_streams_for_instance()` to find and close any active
text/image generation queues tied to tasks on the deleted instance
- Add unit tests covering text gen, image gen, and
unrelated-instance-not-closed scenarios

## Why It Works

When an instance is deleted, we iterate `state.tasks` to find commands
running on that instance, then close and remove their send-side queue
handles. This causes the SSE generator to terminate, unblocking the
client.
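The cleanup walk can be sketched like this (task and queue shapes are assumed, not exo's actual ones):

```python
def close_streams_for_instance(tasks, queues, instance_id):
    """Close and remove send-side queues for tasks on a deleted instance,
    so their SSE generators terminate and unblock clients."""
    for task_id, task in list(tasks.items()):
        if task["instance_id"] == instance_id:
            queue = queues.pop(task_id, None)
            if queue is not None:
                queue.close()
```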

## Test Plan

### Manual Testing
- This was causing issues for me on another branch (integration tests).
Including this fix solved the issue

### Automated Testing
- `test_instance_deleted_stream_cleanup.py`: 3 tests covering text gen
cleanup, image gen cleanup, and ensuring unrelated streams are not
affected
2026-04-08 16:14:28 +01:00
mlpy0
62570227ff Catch ClosedResourceError when forwarding chunks to client queues (#1856)
While stress-testing inference with rapid client cancels mid-stream, I
hit a reproducible crash where the entire exo process exits.

When a client cancels a streaming chat completion partway through, its
receive stream gets closed cleanly via its context manager. The producer
in `API._apply_state` then calls `queue.send(event.chunk)`, which raises
`anyio.ClosedResourceError` rather than `BrokenResourceError`. The
existing handler only catches `BrokenResourceError`, so the exception
propagates through the API task group, kills the Node task group, and
the process exits with `EXO Shutdown complete`.

Trace from one of the crashes:

```
File "exo/api/main.py", line 1818, in _apply_state
    await queue.send(event.chunk)
File "anyio/streams/memory.py", line 212, in send_nowait
    raise ClosedResourceError
anyio.ClosedResourceError
```

The fix is to catch `ClosedResourceError` alongside
`BrokenResourceError` in both queue handlers (text and image), so the
dead queue gets dropped and `_apply_state` keeps running for other
in-flight requests.
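The guarded forwarding loop looks roughly like this (stand-in exception and queue classes keep the sketch self-contained; the real code catches `anyio.BrokenResourceError` / `anyio.ClosedResourceError`):

```python
class BrokenResourceError(Exception):
    """Stand-in for anyio.BrokenResourceError."""

class ClosedResourceError(Exception):
    """Stand-in for anyio.ClosedResourceError."""

class ClientQueue:
    """Minimal stand-in for the send side of an anyio memory stream."""
    def __init__(self):
        self.closed = False
        self.items = []
    def send(self, chunk):
        if self.closed:
            raise ClosedResourceError
        self.items.append(chunk)

def forward_chunk(queues, chunk):
    """Send a chunk to every live queue; drop dead queues instead of
    letting the exception propagate and kill the task group."""
    for queue in list(queues):
        try:
            queue.send(chunk)
        except (BrokenResourceError, ClosedResourceError):
            queues.remove(queue)  # client gone: drop it, keep serving others
    return queues
```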
2026-04-08 08:50:04 +00:00
Alex Cheema
645bc20950 Add Fast Synch Enabled toggle to macOS app settings (#1852)
## Motivation

The exo backend already supports `--fast-synch` / `--no-fast-synch` CLI
flags and the `EXO_FAST_SYNCH` environment variable, but there was no
way to toggle this from the macOS app UI. Users who want fast CPU-to-GPU
synchronization for RDMA with Tensor Parallelism had to use CLI flags.

## Changes

- **ExoProcessController.swift**: Added `fastSynchEnabled`
UserDefaults-backed property and pass `EXO_FAST_SYNCH=on` to the exo
process environment when enabled.
- **SettingsView.swift**: Added a "Performance" section to the Advanced
tab with a "Fast Synch Enabled" toggle, an info icon (ⓘ) tooltip
explaining the feature and trade-offs, and a "Save & Restart" button.

## Why It Works

Follows the exact same pattern as the existing `offlineMode` and
`enableImageModels` settings — UserDefaults persistence, `@Published`
property with `didSet`, environment variable passthrough in
`makeEnvironment()`, and pending state with Save & Restart in the
settings UI. The `EXO_FAST_SYNCH=on` value matches what the Python
backend already reads in `main.py`.

## Test Plan

### Manual Testing
Hardware: macOS app
- Open Settings → Advanced tab → verify "Performance" section with "Fast
Synch Enabled" toggle appears
- Hover the ⓘ icon → verify tooltip explains the feature and GPU lock
trade-off
- Toggle on → click "Save & Restart" → verify process restarts with
`EXO_FAST_SYNCH=on` in env
- Close and reopen Settings → verify the toggle state persists
- Verify "Save & Restart" button is disabled when no changes are pending

### Automated Testing
- Existing settings patterns are well-established; no new automated
tests needed for this UI toggle

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 01:04:42 +00:00
rltakashige
5757c27dd5 Add download utility script (#1855)
## Motivation

<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->

## Changes

<!-- Describe what you changed in detail -->

## Why It Works

<!-- Explain why your approach solves the problem -->

## Test Plan

### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->

### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
2026-04-08 00:58:39 +00:00
Andrei Cravtov
fd5b23281c Workspace tweaks (#1849)
## Changes

Mostly chore changes around vscode and jetbrains workspace settings, and
some basedpyright settings tweaks, to allow direnv to work and nixd
autocomplete with flake parts to work
2026-04-07 17:26:29 +00:00
rltakashige
43b3df45fb Fix BatchGenerator in line with upstream refactor (and prevent Qwen3.5 memory leak) (#1835)
## Motivation

MLX LM has had a massive refactor to their BatchGenerator recently.
Since we'd like new features from MLX LM such as Gemma 4, we need to
update the code to handle this.

Additionally, this fixes a significant memory leak in GatedDeltaNet (the
difference is quite substantial, up to 1GB every 1000 tokens, which
explains several memory issues users were facing with Qwen3.5 models).

## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>


After (no memory leak, thanks to one of the upstream changes)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
2026-04-07 11:50:12 +00:00
mlpy0
24420eb10a Fix reasoning_tokens always reported as 0 for thinking models (#1836)
When `enable_thinking` is set, chat templates (Qwen3, DeepSeek, etc.)
append `<think>` to the prompt. The model starts generating thinking
content directly without emitting a `<think>` token in the output
stream.

Both generators initialized `in_thinking = False` and only set it to
`True` on seeing a `<think>` token in output. Since that token was part
of the prompt, the flag never flipped and `reasoning_tokens` stayed at 0
in the usage response.

Fix: initialize `in_thinking` from `detect_thinking_prompt_suffix()`,
which already exists and is used by `model_output_parsers` for routing
thinking content correctly.
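A condensed sketch of the fix (the real `detect_thinking_prompt_suffix` is more involved; this stand-in only checks the templated prompt's tail):

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def detect_thinking_prompt_suffix(prompt: str) -> bool:
    """Simplified stand-in: the templated prompt ends inside a think block."""
    return prompt.rstrip().endswith(THINK_OPEN)

def count_reasoning_tokens(prompt: str, output_tokens: list[str]) -> int:
    # The fix: start in thinking mode when the chat template already
    # appended <think> to the prompt, instead of always starting False.
    in_thinking = detect_thinking_prompt_suffix(prompt)
    reasoning = 0
    for token in output_tokens:
        if token == THINK_OPEN:
            in_thinking = True
        elif token == THINK_CLOSE:
            in_thinking = False
        elif in_thinking:
            reasoning += 1
    return reasoning
```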
2026-04-05 00:05:18 +00:00
rltakashige
59669c1168 Tighten EXO bench concurrency numbers and explain methodology (#1811)
## Motivation

The timings in the batch generator are a little optimistic; a minor
change is needed to make them more correct.

## Changes

Include the time spent in the API in the generation tps and make sure to
send all requests simultaneously
2026-04-05 00:57:52 +01:00
ciaranbor
1d2ce464dc Allow pausing and deleting active downloads (#1829)
## Motivation

No way to pause active downloads or delete partial/failed downloads from
the dashboard.

## Changes

- **Backend:** Added `POST /download/cancel` endpoint with
`CancelDownloadParams`/`CancelDownloadResponse` types. Wires into the
existing `CancelDownload` command + coordinator handler.
- **Dashboard store:** Added `cancelDownload(nodeId, modelId)` function.
- **Dashboard UI:**
  - Pause + delete buttons on active (downloading) cells
  - Delete button on paused/pending and failed cells
- Extracted duplicated SVG icons into `{#snippet}` blocks (`trashIcon`,
`downloadIcon`, `pauseIcon`, `deleteButton`)
- **Tests:** 3 coordinator-level tests for cancel: active download →
pending, nonexistent → no-op, cancel then resume.

## Why It Works

`CancelDownload` command and coordinator handler already existed — just
needed an HTTP endpoint and dashboard wiring. Delete endpoint already
supported all download states.

## Test Plan

### Manual Testing

Started a model download, paused it. Deleted some paused downloads.
Deleted some ongoing downloads.

### Automated Testing

- `test_cancel_active_download_transitions_to_pending` — cancels
in-progress download, asserts `DownloadPending` event and cleanup
- `test_cancel_nonexistent_download_is_noop` — no events emitted
- `test_cancel_then_resume_download` — restart after cancel works
2026-04-02 15:56:33 +01:00
ciaranbor
eb6ae9fd3c Prevent failed instance retries (#1763)
## Motivation

Currently, when a runner fails, the master retries the instance. Most of
the time, this causes a loop over failure. Retries need backoff and a
cap.

## Changes

- `src/exo/worker/main.py`: Before creating a runner, check a
per-instance exponential backoff timer. After `EXO_MAX_INSTANCE_RETRIES`
failures, send `DeleteInstance` to permanently remove the instance.
Record attempts on `Shutdown`; reset on `InstanceDeleted`.
- `src/exo/utils/keyed_backoff.py`: Add an `attempts()` method to query
the retry count.
- `src/exo/shared/constants.py`: Add `EXO_MAX_INSTANCE_RETRIES = 3`.

## Why It Works

The worker gates CreateRunner tasks behind a KeyedBackoff, adding
exponential delay (2s base, 30s cap) between retries. After 3 failures
the worker sends DeleteInstance, stopping retries entirely. The backoff
resets when the instance is deleted, so a fresh placement starts clean.
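
The gating logic can be sketched as a toy version of the backoff class
(the real one lives in `src/exo/utils/keyed_backoff.py`; anything beyond
the 2s base, 30s cap, `attempts()`, and retry limit stated above is
illustrative):

```python
EXO_MAX_INSTANCE_RETRIES = 3  # mirrors src/exo/shared/constants.py

class KeyedBackoff:
    """Toy per-key exponential backoff (2s base, 30s cap)."""

    def __init__(self, base: float = 2.0, cap: float = 30.0) -> None:
        self.base = base
        self.cap = cap
        self._attempts: dict[str, int] = {}
        self._last_failure: dict[str, float] = {}

    def attempts(self, key: str) -> int:
        return self._attempts.get(key, 0)

    def record_failure(self, key: str, now: float) -> None:
        self._attempts[key] = self.attempts(key) + 1
        self._last_failure[key] = now

    def delay(self, key: str) -> float:
        # 2s, 4s, 8s, ... capped at 30s; no delay before the first failure.
        n = self.attempts(key)
        return 0.0 if n == 0 else min(self.cap, self.base * 2 ** (n - 1))

    def ready(self, key: str, now: float) -> bool:
        return now - self._last_failure.get(key, 0.0) >= self.delay(key)

    def reset(self, key: str) -> None:
        # InstanceDeleted clears history so a fresh placement starts clean.
        self._attempts.pop(key, None)
        self._last_failure.pop(key, None)

backoff = KeyedBackoff()
for _ in range(EXO_MAX_INSTANCE_RETRIES):
    backoff.record_failure("instance-a", now=0.0)
# At this point the worker would send DeleteInstance instead of retrying.
print(backoff.attempts("instance-a"))  # 3
```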

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-01 21:03:34 +01:00
rltakashige
4688adb5d2 Support PDFs in dashboard (#1822)
Like ChatGPT does, we now send both the extracted text and the image of
each PDF page.
2026-03-31 18:25:40 +01:00
rltakashige
d9ed943034 Fix Nemotron cache leak upstream (#1819)
## Motivation
Nemotron Cascade and Nano were failing at long decodes.

## Changes

Fixed upstream; this PR just bumps `pyproject.toml` and the `uv` lockfile.


## Test Plan
### Automated Testing
Tested with a reproduction script upstream.
2026-03-30 16:53:21 +00:00
rltakashige
c6815bfdce Only update KV prefix cache on a good cache hit (#1817)
## Motivation

Addresses #1816 

## Changes

- Update the prefix cache only when the prefix hit length exceeds
`min_prefix_hit_length` **and** the hit ratio exceeds
`_MIN_PREFIX_HIT_RATIO_TO_UPDATE`.
- `min_prefix_hit_length = max(1000, system prompt length)`, so system
prompts must match exactly.
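
The gate can be sketched as a small predicate (the ratio constant's value
and the function name are illustrative, not exo's actual identifiers):

```python
_MIN_PREFIX_HIT_RATIO_TO_UPDATE = 0.8  # assumed value for illustration

def should_update_prefix_cache(hit_length: int, prompt_length: int,
                               system_prompt_length: int) -> bool:
    # System prompts must match exactly: the threshold is at least the
    # full system prompt, and never below 1000 tokens.
    min_prefix_hit_length = max(1000, system_prompt_length)
    hit_ratio = hit_length / prompt_length if prompt_length else 0.0
    return (hit_length > min_prefix_hit_length
            and hit_ratio > _MIN_PREFIX_HIT_RATIO_TO_UPDATE)

print(should_update_prefix_cache(5000, 5500, 1200))  # True: good hit
print(should_update_prefix_cache(1100, 9000, 1200))  # False: low ratio
```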

## Test Plan

### Manual Testing
Tested on OpenCode and Claude Code.
2026-03-30 15:04:38 +01:00
rltakashige
39c39e8199 Integrations helpers (#1810)
2026-03-30 14:28:41 +01:00
rltakashige
e5cb7b80d0 Add SSE-keepalive to not time out on long prefill on clients (#1803)
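The title describes a standard SSE technique: while a long prefill keeps
the stream silent, emit comment lines (lines starting with `:`, which
clients ignore per the SSE format) so proxies and clients don't drop the
idle connection. A minimal sketch under that assumption (not exo's actual
implementation):

```python
import asyncio
from typing import AsyncIterator

async def with_keepalive(tokens: AsyncIterator[str],
                         interval: float = 5.0) -> AsyncIterator[str]:
    """Wrap a token stream in SSE frames, emitting comment lines while
    the upstream (e.g. a long prefill) is silent so clients don't time out."""
    it = tokens.__aiter__()
    while True:
        task = asyncio.ensure_future(it.__anext__())
        while not task.done():
            done, _ = await asyncio.wait({task}, timeout=interval)
            if not done:
                # ':' lines are SSE comments: ignored by clients, but
                # they keep the idle connection alive.
                yield ": keepalive\n\n"
        try:
            token = task.result()
        except StopAsyncIteration:
            return
        yield f"data: {token}\n\n"
```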
2026-03-30 12:18:38 +01:00
rltakashige
635801d515 Add multimodality! (#1802)
## Motivation

Images!

TODO (in a future PR): Add audio and video support.

## Test Plan

### Manual Testing
<img width="2652" height="1900" alt="image"
src="https://github.com/user-attachments/assets/7d3a7137-542f-4f94-9193-2c73b7c4a5ec"
/>

<img width="2770" height="1956" alt="image"
src="https://github.com/user-attachments/assets/e3c3a096-8029-4409-97a6-aca31a9a3f24"
/>
<img width="2738" height="1768" alt="image"
src="https://github.com/user-attachments/assets/d70ea37f-cd1d-4a4c-ad08-3beb9fafa380"
/>

(And batching also works)

---------

Co-authored-by: David Hind <davehind@yahoo.co.uk>
2026-03-30 11:52:19 +01:00
rltakashige
2efbb8ab4f Improve exo harness with path state (#1815)
<img width="3224" height="1476" alt="image"
src="https://github.com/user-attachments/assets/d90a7d8a-9fe5-43a1-a715-1ef7ecc15422"
/>
2026-03-30 11:20:46 +01:00
Evan Quiney
c6c5a3e73c feat: /state/paths (#1796)
adds a path option to the /state endpoint, allowing you to query
subfields of state without grabbing the whole blob
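
the idea is easy to picture with a toy resolver (illustrative only; the
actual path syntax may differ):

```python
from typing import Any

def resolve_path(state: dict[str, Any], path: str) -> Any:
    """Walk a '/'-separated path into a nested state dict, so clients
    can fetch a subfield instead of the whole blob."""
    node: Any = state
    for part in path.strip("/").split("/"):
        if not part:
            continue
        node = node[part]
    return node

state = {"downloads": {"qwen": {"progress": 0.42}}, "nodes": []}
print(resolve_path(state, "/downloads/qwen/progress"))  # 0.42
```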

## test plan
poking around in the api
2026-03-30 10:10:00 +00:00
ArvidSU
10ef7ec9e8 feat: add Firefox AI sidebar (?q=) support to dashboard (#1814)
This PR builds on https://github.com/exo-explore/exo/pull/1677 to let
custom prompts from Firefox's `browser.ml.chat` sidebar reach the EXO
dashboard via URL parameters, enabling page summaries and other browser
interactions. See the "Summarize page" example below.

## Summary
- Parse `?q=<encoded prompt>` URL parameter on page load and auto-submit
it as a chat message
- Clean up the URL with `history.replaceState` to prevent re-submission
on refresh
- Defer auto-send until both cluster state and model list are loaded so
model auto-selection works correctly

## Context
Firefox's built-in AI sidebar (`about:config: browser.ml.chat.enabled`)
integrates with chat providers by appending the user's prompt as
`?q=<URL-encoded prompt>`. Previously the exo dashboard ignored this
parameter. Users can now configure `http://localhost:52415` as a Firefox
AI chatbot provider.

See: https://support.mozilla.org/en-US/kb/ai-chatbot

## Technical notes
- Frontend-only change in `dashboard/src/routes/+page.svelte`
- Uses a Svelte `$effect` that reacts to `pendingFirefoxQuery`, `data`
(cluster state), and `models.length` — fires exactly once when all three
are ready
- If no model is selected, `handleAutoSend` auto-picks the best
available model; if no model fits memory, a toast is shown
- If a model is selected but not running, the message is queued until
the model loads

## Testing
```
http://localhost:52415/?q=Hello+world
http://localhost:52415/?q=Summarize+this+page%3A+%5Bpage+title%5D+%5Bpage+url%5D
```

<img width="2056" height="1329" alt="image"
src="https://github.com/user-attachments/assets/74463eb4-ca1a-400d-806a-c19ba93147b9"
/>
2026-03-30 11:02:35 +01:00
Evan Quiney
1e51dc89b0 chore: bump exo-version with release version (#1807)
our pyproject.toml version was 0.3.68 - update to 0.3.69 in line with
the release!!
2026-03-27 11:47:13 +00:00