Commit Graph

2282 Commits

Author SHA1 Message Date
Evan
65b9c9df81 remove layer loading callback 2026-04-15 11:49:39 +01:00
rltakashige
2cd66ae4cf Fix out of order event idx causing fatal crashes (#1894)
## Motivation

<img width="828" height="373" alt="Screenshot 2026-04-14 at 22 56 52"
src="https://github.com/user-attachments/assets/f8f48c1d-68c5-4acc-a6de-9d180672da9d"
/>

If `is_new_master=True`, `_elect_loop` creates a new EventRouter before the
worker has registered its receivers. The event router then runs `_run_ext_in`,
and `buf.drain_indexed()` picks off events even though
`self.internal_outbound` is not yet fully populated.

When the worker finally does request events, the next event it receives is
not the first event, so the worker crashes.

## Changes

Start the event router after all the receivers are registered

## Why It Works

self.internal_outbound is populated before the loop begins.
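
The ordering fix can be sketched as follows (the class and method names here are illustrative stand-ins, not exo's actual API): the router's drain loop must not start until every receiver is registered, otherwise early events are delivered to nobody.

```python
import asyncio


class EventRouter:
    def __init__(self):
        # receiver_id -> queue; analogous to self.internal_outbound
        self.internal_outbound = {}

    def register(self, receiver_id):
        self.internal_outbound[receiver_id] = asyncio.Queue()

    async def run(self, events):
        # Any event drained here before a receiver registers is lost to it.
        for event in events:
            for queue in self.internal_outbound.values():
                queue.put_nowait(event)


async def main():
    router = EventRouter()
    # The fix: register all receivers *before* starting the router loop.
    router.register("worker-1")
    await router.run(["event-0", "event-1"])
    q = router.internal_outbound["worker-1"]
    return [q.get_nowait(), q.get_nowait()]


print(asyncio.run(main()))  # ['event-0', 'event-1']
```
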

## Test Plan

### Manual Testing
No more crashes observed in testing (it's actually quite easy to
reproduce the issue if you have one node with this fix but the other
node on main).

I'm convinced this is a fix, at least.
2026-04-15 08:46:02 +00:00
rltakashige
2ecefa0cfe Fix Qwen3-VL and autodetect vision config (#1893)
## Motivation

Qwen3-VL tensor parallelism doesn't work at the moment, and vision inference misbehaves.

## Test Plan

### Manual Testing
Works now.
2026-04-14 23:05:55 +01:00
rltakashige
b8eaf707a8 Add gemma 4 tensor parallelism (#1891) 2026-04-14 20:31:59 +01:00
rltakashige
8d81811b89 Try harder to clean up processes nicely (#1889)
## Motivation

Model loading is actually quite reliable now. There is no need to kill the
process just because a slow SSD or a massive model makes loading take a
while; the user can shut the instance down if necessary.

This was a major cause of signal=9 issues, although not the only one (it can
happen during inference too?).
The reason signal=9 is so bad is that RDMA will no longer work until
restart if this ever happens.

## Changes

- no more model load timeout
- no more crazy sigkills
- try harder to clean up processes on model shutdown

## Test Plan

### Manual Testing
Tested with some RDMA instances
2026-04-14 16:37:49 +01:00
rltakashige
f2709dcde6 Add prefix cache flag to exo bench (#1888)
## Motivation
When using exo bench extensively, there are many cases where prefix caching
could speed up the benchmarks, especially when the focus is on token
generation.

At the same time, it's very clear that prefix-caching decoded tokens is not
very useful in most current scenarios. Surprisingly, even for non-thinking
models, the chat template formats a continued conversation such that the
existing cache is not effective.

We already (slightly accidentally) do this for the batch generator; we
should do it for the sequential generator too.

## Changes

exo bench can now be sped up with a prefix-caching flag. Of course, for the
most accurate pp results it is better to leave it off, but it speeds up tg
and large benchmark runs significantly.
The methodology docs were updated to match.

## Test Plan

### Manual Testing
Tested on many configurations; the difference in results is negligible,
even with multiple --pp options.
2026-04-14 11:12:58 +01:00
ciaranbor
77ffe039b3 Complete responses api usage response field (#1885)
## Motivation

The Responses API usage response was missing `input_tokens_details` and
`output_tokens_details`. The chat completions API already reports these.

## Changes

- Added `InputTokensDetails` (`cached_tokens`) and `OutputTokensDetails`
(`reasoning_tokens`) to `ResponseUsage`
- Extracted shared `_build_response_usage()` helper for both streaming
and non-streaming paths
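
A minimal sketch of what the shared helper might look like (the field names follow the OpenAI Responses API; the dataclass shapes and parameters are assumptions, not exo's actual code):

```python
from dataclasses import dataclass


@dataclass
class InputTokensDetails:
    cached_tokens: int


@dataclass
class OutputTokensDetails:
    reasoning_tokens: int


@dataclass
class ResponseUsage:
    input_tokens: int
    output_tokens: int
    total_tokens: int
    input_tokens_details: InputTokensDetails
    output_tokens_details: OutputTokensDetails


def _build_response_usage(prompt_tokens, completion_tokens, cached, reasoning):
    # One construction point shared by the streaming and non-streaming paths.
    return ResponseUsage(
        input_tokens=prompt_tokens,
        output_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
        input_tokens_details=InputTokensDetails(cached_tokens=cached),
        output_tokens_details=OutputTokensDetails(reasoning_tokens=reasoning),
    )


usage = _build_response_usage(120, 40, cached=100, reasoning=25)
print(usage.total_tokens)  # 160
```
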

## Test Plan

### Manual Testing

4-node cluster, `Qwen3-30B-A3B-4bit` — verified both detail objects
present with correct values in streaming and non-streaming responses.

### Automated Testing

13 tests in `test_openai_responses_api.py`.
2026-04-13 17:38:33 +00:00
rltakashige
3f0df404a5 Reduce memory consumption by adding Flash Attention to Qwen3.5 and Gemma 4, and fix RotatingKVCache prefix cache memory leak (#1886)
## Motivation

Part 1 of many memory improvements.

## Changes
As written in the title

## Test Plan

### Manual Testing
Gemma 4 26B cache reduced from 54 GB to 10 GB per 100k tokens; Qwen3.5 35B
A3B cache reduced from 21 GB to 7 GB per 100k tokens.
2026-04-13 18:32:17 +01:00
Evan Quiney
9b381f7bfe bump and simplify flake (#1866)
seems like stablepkgs swiftfmt works now! also bump macmon to 0.7
2026-04-13 15:45:17 +00:00
Alex Cheema
d2f67b5d10 dashboard: group Gemma under Google with proper logo (#1883)
## Motivation

In the dashboard model picker sidebar, the Gemma 4 models were showing
up under a "Gemma" family with the generic fallback tick/checkmark icon
(the default case in `FamilyLogos.svelte`), since no dedicated logo
branch existed for `family === "gemma"`. Every other vendor (Meta,
NVIDIA, OpenAI, DeepSeek, Qwen, …) has its own brand mark.

Gemma is Google's model family, so it should live under a **Google**
bucket that future Google-authored models can join, and it should render
with a proper Google logo in the same style as its neighbors.

## Changes

- `dashboard/src/lib/components/FamilyLogos.svelte`: added a `family ===
"google"` branch rendering a monochrome Google "G" as a single `<path>`
inside the shared `24×24` viewBox with `fill="currentColor"`, matching
the other vendor logos.
- `dashboard/src/lib/components/FamilySidebar.svelte`: added `google:
"Google"` to the `familyNames` display map.
- `dashboard/src/lib/components/ModelPickerModal.svelte`: inserted
`"google"` into the `familyOrder` array (next to `"llama"`) so the
vendor has a deterministic sort position.
- `resources/inference_model_cards/mlx-community--gemma-4-*.toml` (16
files): changed `family = "gemma"` → `family = "google"`. `base_model =
"Gemma 4 …"` is unchanged, so the model titles still read "Gemma".

## Why It Works

The sidebar builds its family list from whatever values appear in
`model.family` across the loaded model cards (`ModelPickerModal.svelte`
`uniqueFamilies`). Renaming the family string on the 16 Gemma cards from
`"gemma"` to `"google"` collapses them into a single "Google" bucket,
and the new logo branch + display-name map entry gives that bucket a
real brand mark and label. All other logos share the same `w-6 h-6 /
viewBox="0 0 24 24" / fill="currentColor"` shape, so inheriting
`text-exo-yellow` / `text-white/50` just works.

## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max
- `cd dashboard && npm install && npm run build` — dashboard builds
cleanly.
- `uv run exo`, opened `http://localhost:52415`, clicked **SELECT
MODEL**:
- sidebar shows a **Google** entry with a monochrome Google "G" logo in
the same style as Meta / NVIDIA / etc.
  - old "Gemma" entry with the generic tick is gone.
- clicking **Google** filters to the Gemma 4 variants (e2b / e4b / 26B
A4B / 31B).
- hover/selected color states switch between `text-white/50` and
`text-exo-yellow` correctly.

### Automated Testing
- No new tests — this is a cosmetic grouping/logo change. Existing
dashboard build verifies the Svelte + TS compiles.
2026-04-13 14:08:15 +00:00
chaoliang yan
8973503322 fix: use configured api_port for IP connectivity probes (#1877)
## Motivation

Fixes #1861

When `--api-port` is set to a non-default value (e.g., `--api-port
55555`), the IP connectivity discovery system still probes peers on the
hardcoded default port 52415. Since the API is not listening on 52415,
all reachability checks fail, the topology reports zero reachable nodes,
and the dashboard shows "No valid configurations for current settings."

## Changes

Thread the configured `api_port` from `Args` through `Worker` into the
reachability probe functions:

- `net_profile.py`: `check_reachability()` and `check_reachable()`
accept an `api_port` parameter (default 52415 for backward
compatibility)
- `worker/main.py`: `Worker` stores `api_port` and passes it to
`check_reachable()`, uses it in `Multiaddr` construction and the mDNS
connection filter
- `main.py`: passes `args.api_port` to the `Worker` constructor

## Why It Works

The `/node_id` endpoint used by reachability probes is served by the
FastAPI app, which binds to `args.api_port`. The probes must use the
same port the API is actually listening on. Before this fix, the port
was hardcoded in three places in `net_profile.py` and `worker/main.py`;
now it uses the value from the CLI flag.
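
The backward-compatible signature change can be sketched like this (the function body is an assumption; only the parameter threading and the `/node_id` probe target come from the PR):

```python
DEFAULT_API_PORT = 52415


def check_reachability(host: str, api_port: int = DEFAULT_API_PORT) -> str:
    # The probe must target the port the FastAPI app actually binds to
    # (args.api_port), not a hardcoded constant. Defaulting to 52415
    # preserves behavior for callers that don't pass the flag.
    return f"http://{host}:{api_port}/node_id"


print(check_reachability("10.0.0.2"))                  # http://10.0.0.2:52415/node_id
print(check_reachability("10.0.0.2", api_port=55555))  # http://10.0.0.2:55555/node_id
```
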

## Test Plan

### Manual Testing
Hardware: not available for multi-node testing
- Verified ruff passes on all changed files
- Code inspection: traced `api_port` flow from `Args.parse()` →
`Node.create()` → `Worker.__init__()` → `_poll_connection_updates()` →
`check_reachable()` → `check_reachability()` → HTTP probe URL

### Automated Testing
- No existing automated tests cover the reachability probe code path
- The new `api_port` parameter defaults to `52415`, so all existing
behavior is preserved when `--api-port` is not specified

---------

Co-authored-by: lawrence3699 <lawrence3699@users.noreply.github.com>
Co-authored-by: Evan <evanev7@gmail.com>
2026-04-13 14:01:42 +00:00
Alex Cheema
eb9228615f models: add MiniMax M2.7 cards (#1884)
## Motivation

The mlx-community [MiniMax-M2.7
collection](https://huggingface.co/collections/mlx-community/minimax-m27)
landed but exo didn't have model cards for any of the variants yet, so
they weren't selectable from the dashboard model picker. Adding cards
also makes them discoverable under the existing MiniMax family entry.

## Changes

Added 6 new model cards in `resources/inference_model_cards/`, one per
quant of MiniMax M2.7:

- `mlx-community--MiniMax-M2.7.toml` (bf16, full precision — 457 GB)
- `mlx-community--MiniMax-M2.7-4bit.toml` (128 GB)
- `mlx-community--MiniMax-M2.7-4bit-mxfp4.toml` (121 GB)
- `mlx-community--MiniMax-M2.7-5bit.toml` (157 GB)
- `mlx-community--MiniMax-M2.7-6bit.toml` (185 GB)
- `mlx-community--MiniMax-M2.7-8bit.toml` (243 GB)

All six use `family = "minimax"` and share `base_model = "MiniMax M2.7"`
so they collapse into a single group in the picker with the existing
MiniMax logo. Architecture fields (`n_layers = 62`, `hidden_size =
3072`, `num_key_value_heads = 8`, `context_length = 196608`) were read
from each repo's `config.json`; `storage_size.in_bytes` was summed from
the HF tree API per repo.

`capabilities = ["text", "thinking"]` follows the existing MiniMax M2.5
cards — the chat template always emits `<think>` tags (no toggle),
matching M2.5 behavior.

## Why It Works

Model cards in `resources/inference_model_cards/` are auto-loaded by
`src/exo/shared/models/model_cards.py::get_model_cards`. The dashboard
picker groups by `base_model` and filters by `family`, so sharing both
across all six variants gives a single "MiniMax M2.7" group under the
MiniMax sidebar entry, with the quant variants exposed as selectable
sub-options.

## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max
- Ran `uv run python -c "…await get_model_cards()…"` and confirmed all 6
new cards load with `family=minimax`, `base_model="MiniMax M2.7"`, and
correct quant + byte sizes.
- `cd dashboard && npm run build` then `uv run exo`, opened the model
picker → **MiniMax** family → **MiniMax M2.7** group shows all six quant
variants.

### Automated Testing
- No new automated tests — these are data files validated by the
existing Pydantic `ModelCard` schema at load time.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:47:09 +01:00
MikkoParkkola
4b13735ea3 build: remove pyinstaller temp artifacts (#1868)
removes PyInstaller `build/` leftovers after `just package`
2026-04-11 11:46:03 +00:00
rltakashige
196543ce69 Add Gemma 4 + VLM fixes + thinking parsing updates (#1851)
## Motivation
Add support for Gemma 4, including VLM!

## Changes

- Add auto parallel strategies and model cards for Gemma 4
- Normalise Gemma 4's special Vision Transformer handling to be in line
with the rest of our vision processors.
- Also adds reprs to messages and b64 hashes to prevent log spam.

## Test Plan

### Manual Testing
Tested manually on 4bit E2B and 8bit 26B

### Automated Testing
Model onboarding shows small logit diffs.

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-11 12:29:33 +01:00
ciaranbor
6172617b00 add env override to macos app (#1869)
## Motivation

- Let users pass/override arbitrary exo env vars from the macOS app
without a code change.

## Changes

- `ExoProcessController.swift`: `CustomEnvironmentVariable` struct +
`@Published` list persisted to `UserDefaults`, injected into the child
process after built-ins.
- `SettingsView.swift`: new **Environment** tab with add/remove rows,
trim + dedup on save, and a POSIX name validator with a warning badge.

## Why It Works

- Custom vars applied last in `makeEnvironment`, so overriding a
built-in works with no special-casing.

## Test Plan

### Manual Testing

- Set `EXO_LIBP2P_NAMESPACE` via the new UI; confirmed override in
`~/.exo/exo_log/exo.log`.
2026-04-10 17:39:55 +01:00
ciaranbor
93a980a61e just package first builds dashboard (#1867)
2026-04-10 16:25:30 +00:00
ciaranbor
2962ebee60 Fix pdf inputs on Safari (#1865)
## Motivation

PDF attachments weren't working on Safari

## Changes

Create async readable stream if none exists

## Why It Works
pdfjs-dist requires an async readable stream internally

## Test Plan

### Manual Testing
pdf attachments now work on Safari, still work on Firefox
2026-04-10 14:59:34 +00:00
rltakashige
abd75ae06c Truncate long logs with repr (#1854)

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-10 15:53:48 +01:00
kaiisfree
ee2e505b3c fix: handle BrokenResourceError in download progress callback (#1846)
## Summary
- Wraps progress callback `send()` in try/except to gracefully handle
`BrokenResourceError` when the memory stream is closed
- Prevents unhandled `ExceptionGroup` from crashing the process when the
download consumer disconnects during transfer

## Root Cause
The download progress callback sends updates through an anyio memory
object stream. When the receiving end closes (e.g., client disconnect,
timeout, or task cancellation), `send()` raises `BrokenResourceError`.
Inside an anyio `TaskGroup`, this unhandled exception becomes an
`ExceptionGroup` that propagates up and crashes the coordinator.

## Fix
Catch `BrokenResourceError` (and `ClosedResourceError` for completeness)
in the progress callback and handle gracefully — the download continues
but progress updates are silently dropped for disconnected consumers.
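
The shape of the fix, modeled with stand-in classes (the real code catches anyio's `BrokenResourceError` and `ClosedResourceError`; the stream class here only reproduces the failure mode so the try/except pattern is visible):

```python
class BrokenResourceError(Exception):
    """Stand-in: peer end of the stream has been closed."""


class ClosedResourceError(Exception):
    """Stand-in: this end of the stream has been closed."""


class ProgressStream:
    def __init__(self):
        self.closed = False
        self.sent = []

    def send(self, update):
        if self.closed:
            raise BrokenResourceError
        self.sent.append(update)


def report_progress(stream, update):
    try:
        stream.send(update)
    except (BrokenResourceError, ClosedResourceError):
        # Consumer disconnected: drop the update, keep the download alive.
        pass


stream = ProgressStream()
report_progress(stream, {"pct": 10})
stream.closed = True
report_progress(stream, {"pct": 20})  # no crash, update silently dropped
print(stream.sent)  # [{'pct': 10}]
```
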

Fixes #1844

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-10 11:43:42 +00:00
Evan Quiney
f2e6b1ef76 prevent some crash loops (#1827)
extension to #1763 that prevents crash looping in some common scenarios.
2026-04-09 11:34:35 +00:00
ciaranbor
e2e17eafb7 Fix reasoning_tokens counting for multi-token thinking tag models (#1848)
## Motivation

`reasoning_tokens` is always 0 in usage stats, even when thinking
content streams correctly via `reasoning_content` SSE deltas. The MLX
generators had their own thinking detection comparing individual
detokenized tokens against think tags — this never fires for models
where tags span multiple tokens (e.g. gpt-oss-120b) or are already in
the prompt.

## Changes

- Removed broken per-token thinking detection from `batch_generate.py`
and `generate.py`
- Added `_count_reasoning_tokens` wrapper in `model_output_parsers.py`
that counts `is_thinking=True` responses and patches the total into
Usage on the final response
- Wired it as the outermost stage of `apply_all_parsers`, so it works
regardless of which parser sets `is_thinking`
- Added 3 tests covering `parse_thinking_models` and `parse_gpt_oss`
paths

## Why It Works

The parser pipeline already correctly sets `is_thinking` on each
response. Counting at the output of `apply_all_parsers` means one
counting point that works for all model types, replacing the duplicate
broken logic in two generators.

## Test Plan

### Manual Testing

- 4-node cluster, `mlx-community/gpt-oss-120b-MXFP4-Q8`
- Main branch: `reasoning_tokens: 0` — fix branch: `reasoning_tokens:
25`

### Automated Testing

- 3 new tests: explicit think tags, `starts_in_thinking=True`, and
gpt-oss Harmony analysis channel
2026-04-09 12:29:35 +01:00
ciaranbor
b12cd1b186 Cancel SSE keep-alive when instance is deleted (#1828)
## Motivation

When a model instance is deleted (e.g. node disconnect, manual
teardown), any in-flight SSE streaming connections for that instance
hang indefinitely. The API never closes the response stream, so clients
block forever waiting for more chunks.

## Changes

- Listen for `InstanceDeleted` events in the API event loop
- Add `_close_streams_for_instance()` to find and close any active
text/image generation queues tied to tasks on the deleted instance
- Add unit tests covering text gen, image gen, and
unrelated-instance-not-closed scenarios

## Why It Works

When an instance is deleted, we iterate `state.tasks` to find commands
running on that instance, then close and remove their send-side queue
handles. This causes the SSE generator to terminate, unblocking the
client.
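
A sketch of the cleanup logic (the state layout and queue API here are assumptions based on the PR text): find tasks bound to the deleted instance, then close and remove their send-side queues so the SSE generators terminate.

```python
def close_streams_for_instance(state_tasks, queues, instance_id):
    for task_id, task in list(state_tasks.items()):
        if task["instance_id"] != instance_id:
            continue
        queue = queues.pop(task_id, None)
        if queue is not None:
            queue.close()  # SSE generator sees end-of-stream and unblocks


class FakeQueue:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


queues = {"t1": FakeQueue(), "t2": FakeQueue()}
tasks = {"t1": {"instance_id": "A"}, "t2": {"instance_id": "B"}}
close_streams_for_instance(tasks, queues, "A")
print(sorted(queues))  # ['t2'] -- only the unrelated stream remains
```
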

## Test Plan

### Manual Testing
- This was causing issues for me on another branch (integration tests).
Including this fix solved the issue

### Automated Testing
- `test_instance_deleted_stream_cleanup.py`: 3 tests covering text gen
cleanup, image gen cleanup, and ensuring unrelated streams are not
affected
2026-04-08 16:14:28 +01:00
mlpy0
62570227ff Catch ClosedResourceError when forwarding chunks to client queues (#1856)
While stress-testing inference with rapid client cancels mid-stream, I
hit a reproducible crash where the entire exo process exits.

When a client cancels a streaming chat completion partway through, its
receive stream gets closed cleanly via its context manager. The producer
in `API._apply_state` then calls `queue.send(event.chunk)`, which raises
`anyio.ClosedResourceError` rather than `BrokenResourceError`. The
existing handler only catches `BrokenResourceError`, so the exception
propagates through the API task group, kills the Node task group, and
the process exits with `EXO Shutdown complete`.

Trace from one of the crashes:

```
File "exo/api/main.py", line 1818, in _apply_state
    await queue.send(event.chunk)
File "anyio/streams/memory.py", line 212, in send_nowait
    raise ClosedResourceError
anyio.ClosedResourceError
```

The fix is to catch `ClosedResourceError` alongside
`BrokenResourceError` in both queue handlers (text and image), so the
dead queue gets dropped and `_apply_state` keeps running for other
in-flight requests.
2026-04-08 08:50:04 +00:00
Alex Cheema
645bc20950 Add Fast Synch Enabled toggle to macOS app settings (#1852)
## Motivation

The exo backend already supports `--fast-synch` / `--no-fast-synch` CLI
flags and the `EXO_FAST_SYNCH` environment variable, but there was no
way to toggle this from the macOS app UI. Users who want fast CPU-to-GPU
synchronization for RDMA with Tensor Parallelism had to use CLI flags.

## Changes

- **ExoProcessController.swift**: Added `fastSynchEnabled`
UserDefaults-backed property and pass `EXO_FAST_SYNCH=on` to the exo
process environment when enabled.
- **SettingsView.swift**: Added a "Performance" section to the Advanced
tab with a "Fast Synch Enabled" toggle, an info icon (ⓘ) tooltip
explaining the feature and trade-offs, and a "Save & Restart" button.

## Why It Works

Follows the exact same pattern as the existing `offlineMode` and
`enableImageModels` settings — UserDefaults persistence, `@Published`
property with `didSet`, environment variable passthrough in
`makeEnvironment()`, and pending state with Save & Restart in the
settings UI. The `EXO_FAST_SYNCH=on` value matches what the Python
backend already reads in `main.py`.

## Test Plan

### Manual Testing
Hardware: macOS app
- Open Settings → Advanced tab → verify "Performance" section with "Fast
Synch Enabled" toggle appears
- Hover the ⓘ icon → verify tooltip explains the feature and GPU lock
trade-off
- Toggle on → click "Save & Restart" → verify process restarts with
`EXO_FAST_SYNCH=on` in env
- Close and reopen Settings → verify the toggle state persists
- Verify "Save & Restart" button is disabled when no changes are pending

### Automated Testing
- Existing settings patterns are well-established; no new automated
tests needed for this UI toggle

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 01:04:42 +00:00
rltakashige
5757c27dd5 Add download utility script (#1855)
2026-04-08 00:58:39 +00:00
Andrei Cravtov
fd5b23281c Workspace tweaks (#1849)
## Changes

Mostly chore changes around vscode and jetbrains workspace settings, and
some basedpyright settings tweaks, to allow direnv to work and nixd
autocomplete with flake parts to work
2026-04-07 17:26:29 +00:00
rltakashige
43b3df45fb Fix BatchGenerator in line with upstream refactor (and prevent Qwen3.5 memory leak) (#1835)
## Motivation

MLX LM recently had a massive refactor of its BatchGenerator. Since we'd
like new features from MLX LM such as Gemma 4, we need to update our code
to handle this.

Additionally, this fixes a significant memory leak in GatedDeltaNet. The
difference is quite substantial (up to 1GB every 1000 tokens), which
explains several memory issues users were facing with Qwen3.5 models.

## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>


After (no memory leak, as one of the changes upstream)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
2026-04-07 11:50:12 +00:00
mlpy0
24420eb10a Fix reasoning_tokens always reported as 0 for thinking models (#1836)
When `enable_thinking` is set, chat templates (Qwen3, DeepSeek, etc.)
append `<think>` to the prompt. The model starts generating thinking
content directly without emitting a `<think>` token in the output
stream.

Both generators initialized `in_thinking = False` and only set it to
`True` on seeing a `<think>` token in output. Since that token was part
of the prompt, the flag never flipped and `reasoning_tokens` stayed at 0
in the usage response.

Fix: initialize `in_thinking` from `detect_thinking_prompt_suffix()`,
which already exists and is used by `model_output_parsers` for routing
thinking content correctly.
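
A sketch of the fix (the function name comes from the PR; the prompt handling is simplified to a suffix check):

```python
def detect_thinking_prompt_suffix(prompt: str) -> bool:
    # If the chat template already ended the prompt with <think>, the
    # model starts generating thinking content without ever emitting a
    # <think> token of its own.
    return prompt.rstrip().endswith("<think>")


def init_generator_state(prompt: str) -> dict:
    # Before the fix this was hardcoded to False, so reasoning_tokens
    # stayed at 0 whenever the tag lived in the prompt.
    return {"in_thinking": detect_thinking_prompt_suffix(prompt)}


print(init_generator_state("user question<think>"))  # {'in_thinking': True}
print(init_generator_state("user question"))         # {'in_thinking': False}
```
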
2026-04-05 00:05:18 +00:00
rltakashige
59669c1168 Tighten EXO bench concurrency numbers and explain methodology (#1811)
## Motivation

The timings in the batch generator are a little optimistic; a minor
change is needed to make them more correct.

## Changes

Include the time spent in the API in the generation tps and make sure to
send all requests simultaneously
2026-04-05 00:57:52 +01:00
ciaranbor
1d2ce464dc Allow pausing and deleting active downloads (#1829)
## Motivation

No way to pause active downloads or delete partial/failed downloads from
the dashboard.

## Changes

- **Backend:** Added `POST /download/cancel` endpoint with
`CancelDownloadParams`/`CancelDownloadResponse` types. Wires into the
existing `CancelDownload` command + coordinator handler.
- **Dashboard store:** Added `cancelDownload(nodeId, modelId)` function.
- **Dashboard UI:**
  - Pause + delete buttons on active (downloading) cells
  - Delete button on paused/pending and failed cells
- Extracted duplicated SVG icons into `{#snippet}` blocks (`trashIcon`,
`downloadIcon`, `pauseIcon`, `deleteButton`)
- **Tests:** 3 coordinator-level tests for cancel: active download →
pending, nonexistent → no-op, cancel then resume.

## Why It Works

`CancelDownload` command and coordinator handler already existed — just
needed an HTTP endpoint and dashboard wiring. Delete endpoint already
supported all download states.

## Test Plan

### Manual Testing

Started a model download, paused it. Deleted some paused downloads.
Deleted some ongoing downloads.

### Automated Testing

- `test_cancel_active_download_transitions_to_pending` — cancels
in-progress download, asserts `DownloadPending` event and cleanup
- `test_cancel_nonexistent_download_is_noop` — no events emitted
- `test_cancel_then_resume_download` — restart after cancel works
2026-04-02 15:56:33 +01:00
ciaranbor
eb6ae9fd3c Prevent failed instance retries (#1763)
## Motivation

Currently, when a runner fails, the master retries the instance. Most of
the time, this causes a loop over failure. Retries need backoff and a
cap.

## Changes

- src/exo/worker/main.py: Before creating a runner, check an exponential
backoff timer per instance. After EXO_MAX_INSTANCE_RETRIES failures,
send DeleteInstance to permanently remove the instance. Record attempts
on Shutdown; reset on InstanceDeleted.
- src/exo/utils/keyed_backoff.py: Add attempts() method to query retry
count
- src/exo/shared/constants.py: Add EXO_MAX_INSTANCE_RETRIES = 3.

## Why It Works

The worker gates CreateRunner tasks behind a KeyedBackoff, adding
exponential delay (2s base, 30s cap) between retries. After 3 failures
the worker sends DeleteInstance, stopping retries entirely. The backoff
resets when the instance is deleted, so a fresh placement starts clean.
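
The gating logic can be sketched as follows (the constants come from the PR; the `KeyedBackoff` internals are assumptions):

```python
EXO_MAX_INSTANCE_RETRIES = 3
BASE_DELAY_S, CAP_S = 2.0, 30.0


class KeyedBackoff:
    def __init__(self):
        self._attempts: dict[str, int] = {}

    def record_failure(self, key: str) -> None:
        self._attempts[key] = self._attempts.get(key, 0) + 1

    def attempts(self, key: str) -> int:
        return self._attempts.get(key, 0)

    def delay(self, key: str) -> float:
        # Exponential: 2s, 4s, 8s, ... capped at 30s.
        n = self.attempts(key)
        return min(BASE_DELAY_S * (2 ** (n - 1)), CAP_S) if n else 0.0

    def reset(self, key: str) -> None:
        self._attempts.pop(key, None)


b = KeyedBackoff()
for _ in range(3):
    b.record_failure("instance-1")
print(b.delay("instance-1"))  # 8.0 -- still under the 30s cap
# After EXO_MAX_INSTANCE_RETRIES failures, the worker would send
# DeleteInstance instead of retrying again:
print(b.attempts("instance-1") >= EXO_MAX_INSTANCE_RETRIES)  # True
```
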

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-01 21:03:34 +01:00
rltakashige
4688adb5d2 Support PDFs in dashboard (#1822)
Like ChatGPT does, we now send both the extracted text and the image of
each PDF page.
2026-03-31 18:25:40 +01:00
rltakashige
d9ed943034 Fix Nemotron cache leak upstream (#1819)
## Motivation
Nemotron Cascade and Nano failing at long decodes.

## Changes

Fixed upstream, just change pyproject and uv lock here.


## Test Plan
### Automated Testing
Tested with a reproduce script upstream
2026-03-30 16:53:21 +00:00
rltakashige
c6815bfdce Only update KV prefix cache on a good cache hit (#1817)
## Motivation

Addresses #1816 

## Changes

- Only update when the prefix hit length exceeds `min_prefix_hit_length`
**and** the hit ratio exceeds `_MIN_PREFIX_HIT_RATIO_TO_UPDATE`.
- `min_prefix_hit_length = max(1000, system prompt length)`, so system
prompts must match exactly.
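
A sketch of the gating condition (the constant name comes from the PR; its value here is an assumption):

```python
_MIN_PREFIX_HIT_RATIO_TO_UPDATE = 0.8  # assumed value for illustration


def should_update_prefix_cache(hit_length, prompt_length, system_prompt_length):
    # The minimum hit must cover at least the system prompt (so system
    # prompts match exactly) and never be shorter than 1000 tokens.
    min_prefix_hit_length = max(1000, system_prompt_length)
    hit_ratio = hit_length / prompt_length if prompt_length else 0.0
    return (hit_length > min_prefix_hit_length
            and hit_ratio > _MIN_PREFIX_HIT_RATIO_TO_UPDATE)


# A short, low-quality hit no longer overwrites the cache:
print(should_update_prefix_cache(1200, 5000, 800))  # False (ratio 0.24)
print(should_update_prefix_cache(4800, 5000, 800))  # True  (ratio 0.96)
```
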

## Test Plan

### Manual Testing
Test on OpenCode and Claude Code
2026-03-30 15:04:38 +01:00
rltakashige
39c39e8199 Integrations helpers (#1810)
2026-03-30 14:28:41 +01:00
rltakashige
e5cb7b80d0 Add SSE-keepalive to not time out on long prefill on clients (#1803)
2026-03-30 12:18:38 +01:00
rltakashige
635801d515 Add multimodality! (#1802)
## Motivation

Images!

TODO (in a future PR): Add audio and video support.

## Test Plan

### Manual Testing
<img width="2652" height="1900" alt="image"
src="https://github.com/user-attachments/assets/7d3a7137-542f-4f94-9193-2c73b7c4a5ec"
/>

<img width="2770" height="1956" alt="image"
src="https://github.com/user-attachments/assets/e3c3a096-8029-4409-97a6-aca31a9a3f24"
/>
<img width="2738" height="1768" alt="image"
src="https://github.com/user-attachments/assets/d70ea37f-cd1d-4a4c-ad08-3beb9fafa380"
/>

(And batching also works)

---------

Co-authored-by: David Hind <davehind@yahoo.co.uk>
2026-03-30 11:52:19 +01:00
rltakashige
2efbb8ab4f Improve exo harness with path state (#1815)
<img width="3224" height="1476" alt="image"
src="https://github.com/user-attachments/assets/d90a7d8a-9fe5-43a1-a715-1ef7ecc15422"
/>
2026-03-30 11:20:46 +01:00
Evan Quiney
c6c5a3e73c feat: /state/paths (#1796)
adds a path option to the /state endpoint, allowing you to query
subfields of state without grabbing the whole blob

## Test Plan
poking around in the api
2026-03-30 10:10:00 +00:00
ArvidSU
10ef7ec9e8 feat: add Firefox AI sidebar (?q=) support to dashboard (#1814)
This PR builds on https://github.com/exo-explore/exo/pull/1677 to enable
custom prompts sent from Firefox's `browser.ml.chat` sidebar to the exo
dashboard via URL parameters, for page summaries and other browser
interactions. See the "Summarize page" example below.

## Summary
- Parse `?q=<encoded prompt>` URL parameter on page load and auto-submit
it as a chat message
- Clean up the URL with `history.replaceState` to prevent re-submission
on refresh
- Defer auto-send until both cluster state and model list are loaded so
model auto-selection works correctly

## Context
Firefox's built-in AI sidebar (`about:config: browser.ml.chat.enabled`)
integrates with chat providers by appending the user's prompt as
`?q=<URL-encoded prompt>`. Previously the exo dashboard ignored this
parameter. Users can now configure `http://localhost:52415` as a Firefox
AI chatbot provider.

See: https://support.mozilla.org/en-US/kb/ai-chatbot

## Technical notes
- Frontend-only change in `dashboard/src/routes/+page.svelte`
- Uses a Svelte `$effect` that reacts to `pendingFirefoxQuery`, `data`
(cluster state), and `models.length` — fires exactly once when all three
are ready
- If no model is selected, `handleAutoSend` auto-picks the best
available model; if no model fits memory, a toast is shown
- If a model is selected but not running, the message is queued until
the model loads

## Testing
```
http://localhost:52415/?q=Hello+world
http://localhost:52415/?q=Summarize+this+page%3A+%5Bpage+title%5D+%5Bpage+url%5D
```

<img width="2056" height="1329" alt="image"
src="https://github.com/user-attachments/assets/74463eb4-ca1a-400d-806a-c19ba93147b9"
/>
2026-03-30 11:02:35 +01:00
Evan Quiney
1e51dc89b0 chore: bump exo-version with release version (#1807)
our pyproject.toml version was 0.3.68 - update to .69 in line with
release!!
2026-03-27 11:47:13 +00:00
Alex Cheema
5327bdde84 Fix custom model add requiring two attempts + enlarge sidebar buttons (#1805)
## Motivation

Adding a custom model from the Hub tab shows "Added" toast but the model
doesn't appear in the All tab. You have to add it a second time for it
to work. Also, the "All" button in the model picker sidebar is too small
to read comfortably.

## Changes

**Race condition fix (`src/exo/api/main.py`):**
- Call `add_to_card_cache(card)` directly in `add_custom_model()` after
sending the `ForwarderCommand`, before the API response returns

**Sidebar sizing
(`dashboard/src/lib/components/FamilySidebar.svelte`):**
- Increased sidebar min-width from 72/64px to 80/72px
- Increased "All" icon from `w-5 h-5` to `w-6 h-6`
- Increased all sidebar labels from 9px to 11px

## Why It Works

`POST /models/add` sends a `ForwarderCommand(AddCustomModelCard)` and
returns immediately. The frontend then calls `GET /models` which reads
from `_card_cache`. But the cache was only updated by the worker event
handler after the event round-trips through the master — a race the
frontend almost always loses. By updating the cache directly in the API
handler, `GET /models` immediately reflects the new model. The worker's
later `add_to_card_cache` call is idempotent (dict key assignment).

## Test Plan

### Manual Testing
Hardware: any Mac
- Open model picker → Hub tab → add a custom model → verify it appears
in All tab on the first attempt
- Verify sidebar "All" button and other labels are visually larger and
readable

### Automated Testing
- `uv run basedpyright` passes with 0 errors
- `uv run ruff check` passes

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v1.0.69
2026-03-26 17:57:00 -07:00
ciaranbor
15f1b61f4c Rework model storage directory management (for external storage) (#1765)
## Motivation

Replace confusing EXO_MODELS_DIR/EXO_MODELS_PATH with clearer
multi-directory support, enabling automatic download spillover across
volumes.

## Changes

- EXO_MODELS_DIRS: colon-separated writable dirs (default always
prepended, first with enough space wins)
- EXO_MODELS_READ_ONLY_DIRS: colon-separated read-only dirs (protected
from deletion)
- select_download_dir(): picks writable dir by free space
- resolve_existing_model(): unified lookup across all dirs
- is_read_only_model_dir(): path-based read-only detection instead of
hardcoded flag
- Updated coordinator, worker, model cards, tests

## Why It Works

Default dir always included so zero-config behavior is unchanged. Disk
space checked at download time for automatic spillover. Read-only status
derived from path, not hardcoded.

## Test Plan

### Manual Testing

- No env vars set → identical behavior
- EXO_MODELS_DIRS=/Volumes/SSD/models → downloads to external storage
- EXO_MODELS_READ_ONLY_DIRS=/mnt/nfs → models found, deletion blocked

### Automated Testing

- 4 new tests in test_xdg_paths.py (prepend, default-only, overlap,
empty read-only)
- Existing tests updated to patch new constants
2026-03-26 17:46:46 +00:00
Michael Harrigan
9034300163 [Fix] Node hang on reelection (#1801)
## Motivation

During master reelection, `_elect_loop` called `worker.shutdown()` (fire
& forget) then immediately created and started a new Worker.

This caused the old runner subprocess's Metal/GPU teardown to race with
the new worker's startup, resulting in `IOConnectUnmapMemory failed:
kr=0xe00002bc` errors and a full node hang requiring `^C`. Same issue
existed for `DownloadCoordinator`.

## Changes

- Added `anyio.Event`-based `_stopped` signal to `Worker` and
`DownloadCoordinator`, set at the end of their `run()` finally blocks
- Added `wait_stopped()` async method to both classes
- Updated `_elect_loop` to `await wait_stopped()` after calling
`shutdown()` on the old Worker and DownloadCoordinator before creating
replacements

## Why It Works

The old Worker's task group contains the RunnerSupervisor tasks, whose
finally blocks join the runner subprocess (with 5s timeout + SIGTERM +
SIGKILL escalation). By awaiting `wait_stopped()`, we guarantee the old
runner process has fully exited — including GPU memory cleanup — before
a new Worker can start and potentially access the GPU. This eliminates
the race without changing the shutdown mechanics themselves.
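The handshake can be sketched as follows, using stdlib `asyncio.Event` as a stand-in for the `anyio.Event` the PR actually adds (class shape simplified):

```python
import asyncio

class Worker:
    """Sketch of the shutdown handshake: _stopped is only set once run()'s
    finally block (runner join, GPU teardown in the real code) completes."""

    def __init__(self) -> None:
        self._stopped = asyncio.Event()
        self._shutdown = asyncio.Event()

    async def run(self) -> None:
        try:
            await self._shutdown.wait()  # stand-in for the real work loop
        finally:
            self._stopped.set()  # set only after all teardown has finished

    def shutdown(self) -> None:
        self._shutdown.set()

    async def wait_stopped(self) -> None:
        await self._stopped.wait()

async def reelect() -> str:
    old = Worker()
    task = asyncio.create_task(old.run())
    old.shutdown()
    await old.wait_stopped()  # old runner fully exited before a new Worker starts
    await task
    return "new worker may start"

print(asyncio.run(reelect()))
```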

## Test Plan

### Manual Testing
Hardware: M4 Pro Mac Mini 24GB + M3 Ultra Mac Studio 96GB, connected via
Thunderbolt

**Repro steps:**
1. Start exo on two nodes with a model sharded across both (e.g.
`Josiefied-Qwen3-14B-abliterated-v3-4bit`)
2. Wait for "runner ready" on both
3. `kill -9` the master node
4. Observe the surviving node's re-election behavior

**Before fix (original crash):**
```
[ 11:02:39.0896AM ] Runner supervisor shutting down
[ 11:02:39.0905AM ] bye from the runner
[ 11:02:39.1052AM ] Stopping Worker
IOConnectUnmapMemory failed: kr=0xe00002bc
IOConnectUnmapMemory failed: kr=0xe00002bc
IOConnectUnmapMemory failed: kr=0xe00002bc
IOConnectUnmapMemory failed: kr=0xe00002bc
^C[ 11:03:45 ] ← hung for over a minute, required manual kill
```

**After fix (clean re-election):**
```
[ 12:15:22.4703PM ] runner loaded
[ 12:15:24.1672PM ] runner ready
[ 12:15:33.5393PM ] Waiting for other campaign to finish
[ 12:15:36.5409PM ] Node elected Master
[ 12:15:36.5413PM ] Unpausing API
```
No `IOConnectUnmapMemory` errors, no hang, no `^C` needed.

### Automated Testing
- No existing tests cover the `_elect_loop` re-election path; this is an
integration-level flow requiring a live router/election/worker stack
- All existing tests pass (307/308, 1 pre-existing Rust binding failure)
- basedpyright: 0 errors, ruff: all checks passed

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-03-26 17:28:47 +00:00
rltakashige
1d1dfaa1f3 Don't download original/ and metal/ folders from HF (#1800)
## Motivation

<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->

## Changes

<!-- Describe what you changed in detail -->

## Why It Works

<!-- Explain why your approach solves the problem -->

## Test Plan

### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->

### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
2026-03-26 13:36:07 +00:00
Evan Quiney
7625213df0 fix: enable macmon if preflight fails (#1799)
missed in #1747, issue #1798.

### the issue

we didn't set the memory poll rate after failing the macmon preflight,
only after failing the follow-ups - and since we never ran macmon when
the preflight failed, we never hit the follow-up errors at all.

### testing

requires testing on an m5 pro, but the core issue is solved.
2026-03-26 11:35:22 +00:00
Alex Cheema
f318f9ea14 Fix macOS build bundling wrong macmon binary (#1797)
## Motivation

PR #1747 fixed macmon support for M5 Pro/Max by pinning the
`swiftraccoon/macmon` fork in `flake.nix`. This works when running from
source (via Nix) but the distributed macOS `.app` build was still broken
on M5 Pro/Max because it was bundling the wrong macmon.

The error on M5 Pro/Max:
```
macmon preflight failed with return code -6: thread 'main' panicked at src/sources.rs:394:41
```

## Changes

- Removed `macmon` from `brew install` in `build-app.yml` — this was
installing the upstream `vladkens/macmon` which doesn't support M5
Pro/Max
- Added a new step that resolves the pinned macmon fork from the Nix dev
shell (same `swiftraccoon/macmon` at rev `9154d23` already defined in
`flake.nix`) and adds it to `$GITHUB_PATH`
- Added a safety `brew uninstall macmon` to ensure no Homebrew macmon
can shadow the pinned version

## Why It Works

PyInstaller bundles macmon via `shutil.which("macmon")`. Previously this
found the Homebrew (upstream) binary. Now it finds the Nix-overlayed
fork that has M5 Pro/Max support, because `$GITHUB_PATH` prepends the
Nix store path before the PyInstaller step runs.
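The PATH-precedence behavior `shutil.which` relies on can be demonstrated with two fake `macmon` binaries (temp paths are illustrative):

```python
import os
import shutil
import stat
import tempfile

def make_fake_binary(directory: str, name: str) -> str:
    """Create an executable stub so shutil.which() can find it."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IEXEC)
    return path

nix_dir = tempfile.mkdtemp(prefix="nix-")
brew_dir = tempfile.mkdtemp(prefix="brew-")
pinned = make_fake_binary(nix_dir, "macmon")
make_fake_binary(brew_dir, "macmon")

# shutil.which walks PATH left to right, so prepending the Nix store path
# (as $GITHUB_PATH does) makes the pinned fork shadow any Homebrew copy.
os.environ["PATH"] = os.pathsep.join([nix_dir, brew_dir])
print(shutil.which("macmon") == pinned)  # True
```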

## Test Plan

### Manual Testing
Hardware: M5 Pro
- Trigger a macOS build and verify the bundled macmon is the pinned fork
- Run the built `.app` on M5 Pro/Max and confirm macmon preflight
succeeds

### Automated Testing
- Existing CI build workflow will validate that the macmon binary is
found and bundled correctly

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 09:47:48 +00:00
ciaranbor
30fd5aa1cc Prefer higher % downloaded nodes for API placement previews (#1795)
Follow up to https://github.com/exo-explore/exo/pull/1767

Same thing for placement previews through API
2026-03-25 17:26:31 +00:00
ciaranbor
6de14cfedb Support image generation cancellation (#1774)
## Motivation

Support cancelling image generation, similar to existing support for
cancelling text generation

## Changes

- Dashboard (app.svelte.ts): Wire up AbortController for both
generateImage and editImage API calls. On abort, show "Cancelled"
instead of an error. Clean up the controller in finally.
- Pipeline runner (pipeline/runner.py): Introduce a cancel_checker
callback and NaN-sentinel cancellation protocol for distributed
diffusion:
  - _check_cancellation() - only rank 0 polls the cancel callback
- _send() - replaces data with NaN sentinels when cancelling, so
downstream ranks detect cancellation via _recv_and_check()
  - _recv() / _recv_like() wrappers that eval and check for NaN sentinel
  - After cancellation, drains any pending ring recv to prevent deadlock
  - Skips partial image yields and final decode when cancelled
- Image runner (runner/image_models/runner.py): Deduplicate the
ImageGeneration and ImageEdits match arms into a shared
_run_image_task() method. Thread a cancel_checker closure (backed by the
existing cancel_receiver + cancelled_tasks set) into generate_image().
- Plumbing (distributed_model.py, generate.py): Pass cancel_checker
through the call chain.

## Why It Works

- Rank 0 is the only node that knows about task-level cancellation. When
it detects cancellation, it sends NaN tensors instead of real data.
Downstream ranks detect the NaN sentinel on recv, set their own
_cancelling flag, and propagate the NaNs forward.
- A drain step after the loop prevents the deadlock case where the last
rank already sent patches that the first would never consume.
- For single-node mode, the loop simply breaks immediately on
cancellation.
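The sentinel protocol can be sketched with plain floats (function names simplified from the `_send`/`_recv_and_check` helpers in the PR):

```python
import math

def send(data: list[float], cancelling: bool) -> list[float]:
    # Rank 0 swaps real activations for NaN sentinels once cancellation is seen.
    return [float("nan")] * len(data) if cancelling else data

def recv_and_check(data: list[float]) -> tuple[list[float], bool]:
    # Downstream ranks detect the sentinel on recv, set their own flag,
    # and keep forwarding NaNs so every rank exits the pipeline loop.
    return data, any(math.isnan(x) for x in data)

_, cancelled = recv_and_check(send([0.1, 0.2], cancelling=True))
print(cancelled)  # True
_, cancelled = recv_and_check(send([0.1, 0.2], cancelling=False))
print(cancelled)  # False
```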

## Test Plan

### Automated Testing

New tests in src/exo/worker/tests/unittests/test_image
2026-03-25 16:56:04 +00:00
vskiwi
fc1ae90111 fix: DeepSeek V3.2 warmup crash and tool calling + add catalog cards (#1769)
## Summary

DeepSeek V3.2 (`DeepseekV32ForCausalLM`) is already supported by exo's
inference engine (architecture whitelisted in `model_cards.py`, DSML
encoding added in #1548), but **doesn't work out of the box** due to two
bugs:

### Bug 1: `warmup_inference` passes empty model ID

`warmup_inference()` in `generate.py` accepts `model_id: ModelId` as a
parameter but creates `TextGenerationTaskParams(model=ModelId(""), ...)`
instead of using it. Since `_needs_dsml_encoding()` checks
`"deepseek-v3.2" in task_params.model.lower()`, the empty string never
matches → falls back to `tokenizer.apply_chat_template()` →
**ValueError** because V3.2 has no Jinja chat template.

**Fix:** `model=ModelId("")` → `model=model_id` (one line).

### Bug 2: `_needs_dsml_encoding` limited to tool calling

`_needs_dsml_encoding()` returns `True` only when `task_params.tools` is
present or tool messages exist in `chat_template_messages`. For warmup
and regular chat requests without tools → `return False` → Jinja
fallback → **ValueError**.

Unlike V3.1 (which has a `.jinja` chat template file that transformers
picks up automatically), V3.2 **has no Jinja template at all** — it uses
Python-based DSML encoding for all message types.

**Fix:** For V3.2, always return `True` — DSML encoding handles all
message types.
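Both bugs fit in a few lines; this hedged sketch condenses the check (names simplified from the real `_needs_dsml_encoding` helper):

```python
def needs_dsml_encoding(model: str, has_tools: bool) -> bool:
    """Sketch of the fixed check."""
    if "deepseek-v3.2" in model.lower():
        return True  # V3.2 has no Jinja template: DSML handles every message type
    return has_tools

# Bug 1 in one line: an empty model id can never match the substring check,
# which is why warmup fell through to the (missing) Jinja template.
print(needs_dsml_encoding("", has_tools=False))                                  # False
print(needs_dsml_encoding("mlx-community/DeepSeek-V3.2-4bit", has_tools=False))  # True
```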

### Catalog cards

Added inference model cards for:
- `mlx-community/DeepSeek-V3.2-8bit`
- `mlx-community/DeepSeek-V3.2-4bit`

Parameters taken from model `config.json` on HuggingFace, storage sizes
from HF API. Capabilities include `thinking_toggle` (related: #1456).

## Notes

- The model ID string matching approach (`"deepseek-v3.2" in
model.lower()`) is acknowledged tech debt — see #1371 for the planned
architecture-based approach.

## Test plan

- [x] Start exo with DeepSeek V3.2 model → warmup should complete
without crash
- [x] Send a regular chat message (no tools) → should get a response
- [x] Send a chat message with tools → should work as before
- [x] V3.2 cards should appear in the dashboard model catalog

---------

Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
2026-03-25 16:20:35 +00:00