## Summary
- After `KVPrefixCache` evicts LRU entries, the MLX Metal buffers stay
allocated until Python's GC runs
- This leaks ~3-4 GB between long-context requests, reducing the
effective context ceiling for back-to-back requests
- Adding `gc.collect()` + `mx.clear_cache()` after eviction frees Metal
buffers promptly
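A minimal sketch of the eviction hook; `clear_cache` is injected as a callable so the snippet runs without MLX installed (in exo this would be `mx.clear_cache`), and the function name is illustrative:

```python
import gc

def free_metal_buffers(clear_cache) -> int:
    """Run after KVPrefixCache evicts LRU entries.

    gc.collect() drops Python-side references to the evicted arrays;
    clear_cache() then returns the now-unreferenced Metal buffers
    instead of waiting for the next GC cycle.
    """
    collected = gc.collect()  # break reference cycles holding MLX arrays
    clear_cache()             # release cached Metal buffers promptly
    return collected
```

Since this runs once per eviction cycle rather than per token, the ~2-3 ms cost of `gc.collect()` is negligible.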
## Test plan
- [x] Measured on 2-node PP cluster with Qwen3.5-397B-A17B-4bit at 63K
context
- [x] Before: 108.88 GB retained after eviction (3.78 GB above baseline)
- [x] After: 105.48 GB retained after eviction (0.38 GB above baseline —
draft model KV + minor overhead)
- [x] `gc.collect()` adds ~2-3ms latency, runs once per eviction cycle
(not per token)
- [ ] Verify with `uv run pytest`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Adam Durham <adam@example.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Adds the 8bit variant missing from #1907 — the safetensors index is now
live on HF.
- `mlx-community/Qwen3.6-35B-A3B-8bit` (~35 GB)
Architectural fields match the existing 4bit/5bit/bf16 cards.
`storage_size.in_bytes` is taken from `metadata.total_size` of the
upstream `model.safetensors.index.json`.
## Motivation
`mlx-community` has just published the new **Qwen3.6-35B-A3B**
multimodal MoE family on HuggingFace. Without static model cards, exo
doesn't surface these models in the dashboard picker or match them in
its placement / prefill logic, so users can't one-click launch them.
This PR adds cards for the three quants whose safetensors indexes are
already live on HF (4bit / 5bit / bf16).
## Changes
Three new TOML files in `resources/inference_model_cards/`:
- `mlx-community--Qwen3.6-35B-A3B-4bit.toml` (~19 GB)
- `mlx-community--Qwen3.6-35B-A3B-5bit.toml` (~23 GB)
- `mlx-community--Qwen3.6-35B-A3B-bf16.toml` (~65 GB)
All three share the same architectural fields (`n_layers = 40`,
`hidden_size = 2048`, `num_key_value_heads = 2`, `context_length =
262144`, capabilities `text, thinking, thinking_toggle, vision`,
`base_model = "Qwen3.6 35B A3B"`) — only `model_id`, `quantization`, and
`storage_size.in_bytes` differ between variants.
## Why It Works
- Qwen3.6-35B-A3B reuses the `qwen3_5_moe` architecture
(`Qwen3_5MoeForConditionalGeneration`) — the same one already wired into
exo's MLX runner at `src/exo/worker/engines/mlx/auto_parallel.py:47` via
`Qwen3_5MoeModel`. The architectural fields are taken verbatim from the
HF `config.json.text_config` and match the existing `Qwen3.5-35B-A3B-*`
cards.
- Storage sizes are the exact `metadata.total_size` read from each
variant's `model.safetensors.index.json` on HF, so download progress and
cluster-memory-fit checks are accurate.
- Vision support is flagged in `capabilities`; the `[vision]` block is
auto-detected by `ModelCard._autodetect_vision` from the upstream
`config.json`, so no hand-written vision config is required.
- The card loader (`_refresh_card_cache` in
`src/exo/shared/models/model_cards.py`) globs every `.toml` in
`resources/inference_model_cards/` on startup, so nothing else needs to
change — the `/models` endpoint and the dashboard picker pick them up
automatically.
The `mxfp4` / `mxfp8` / `nvfp4` variants are still uploading upstream
(their index JSONs currently 404) and can be added in a follow-up PR
once the HF upload completes.
## Test Plan
### Manual Testing
Hardware: MacBook Pro M4 Max, 48 GB unified memory.
- Built the dashboard, ran `uv run exo`, waited for the API to come up
on `http://localhost:52415`.
- `curl -s http://localhost:52415/models` returns the three new model
ids (`mlx-community/Qwen3.6-35B-A3B-{4bit,5bit,bf16}`) alongside
existing models.
- Opened the dashboard, clicked SELECT MODEL, typed "Qwen3.6" into the
search box. A single **"Qwen3.6 35B A3B"** group appears showing `3
variants (19GB-65GB)`. Expanding it lists the `4bit` / `5bit` / `bf16`
quants with sizes `19GB` / `23GB` / `65GB`, exactly as expected:

- Programmatically loaded each TOML via `ModelCard.load_from_path(...)`
and confirmed the parsed fields (layers / hidden / KV heads / context /
quant / base_model / caps / bytes) match what's written in the files.
### Automated Testing
No code paths were touched — these are pure TOML data files that plug
into the existing model-card loader. The existing pytest suite covers
TOML parsing and card serving; adding new TOMLs doesn't require new test
scaffolding. `uv run ruff check` and `nix fmt` are clean.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ryuichi Leo Takashige <rl.takashige@gmail.com>
## Motivation
<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
Closes #1858
## Motivation
<img width="828" height="373" alt="Screenshot 2026-04-14 at 22 56 52"
src="https://github.com/user-attachments/assets/f8f48c1d-68c5-4acc-a6de-9d180672da9d"
/>
If `is_new_master=True`, `_elect_loop` creates a new `EventRouter`
before the worker has registered its receivers. The event router then
runs `_run_ext_in`, and `buf.drain_indexed()` starts picking off events
even though `self.internal_outbound` is not fully populated.
When the worker finally does request events, the next event it receives
is not the first one, and the worker crashes.
## Changes
Start the event router after all the receivers are registered
## Why It Works
`self.internal_outbound` is populated before the router loop begins.
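The ordering fix can be sketched like this; class and method names mirror the description above but are illustrative, not exo's actual API:

```python
import asyncio

class EventRouter:
    """Sketch of the fix: receivers are registered *before* the drain
    loop starts, so no event is dispatched while internal_outbound is
    only partially populated."""

    def __init__(self):
        self.internal_outbound = {}  # worker_id -> receiver queue
        self._buf = []               # pending events to drain

    def register(self, worker_id, queue):
        self.internal_outbound[worker_id] = queue

    async def run(self):
        # Only started after every receiver is registered.
        for event in self._buf:
            for queue in self.internal_outbound.values():
                queue.append(event)

async def main():
    router = EventRouter()
    router._buf = ["event-0", "event-1"]
    inbox = []
    router.register("worker-1", inbox)  # register first...
    await router.run()                  # ...then start the router
    return inbox
```

With the old ordering (`run()` before `register()`), the worker's inbox would miss `event-0` and the first event it sees is not the first event emitted.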
## Test Plan
### Manual Testing
No more crashes observed in testing (the issue is actually quite easy
to reproduce if you have one node with this fix but the other node on
main). I'm convinced this is a fix, at least.
## Motivation
Model loading is actually quite reliable now. There is no need to kill
the process for a slow SSD or a massive model; the user can shut the
instance down if necessary.
This was a major cause of signal=9 issues, although not the only one
(it can also happen during inference).
signal=9 is so bad because RDMA no longer works until restart once it
happens.
## Changes
- remove the model-load timeout
- stop force-killing (SIGKILL) model processes
- try harder to clean up processes on model shutdown
## Test Plan
### Manual Testing
Tested with some RDMA instances
## Motivation
When using exo-bench extensively, there are many cases where prefix
caching could speed up the benchmarks, especially when the focus is on
token generation.
At the same time, caching decoded tokens is not very useful in most
current scenarios: surprisingly, even for non-thinking models, the chat
template formats a continued conversation such that the existing cache
is not effective.
We already do this (slightly accidentally) for the batch generator; we
should do it for the sequential generator too.
## Changes
exo-bench can now be sped up with a prefix-caching flag. For the most
accurate pp results it is better to leave it off, but it speeds up tg
and large benchmark runs significantly.
Updated the methodology documentation to match.
## Test Plan
### Manual Testing
Verified on many configurations that the difference in results is
negligible, even with multiple `--pp` options.
## Motivation
The Responses API usage response was missing `input_tokens_details` and
`output_tokens_details`. The chat completions API already reports these.
## Changes
- Added `InputTokensDetails` (`cached_tokens`) and `OutputTokensDetails`
(`reasoning_tokens`) to `ResponseUsage`
- Extracted shared `_build_response_usage()` helper for both streaming
and non-streaming paths
## Test Plan
### Manual Testing
4-node cluster, `Qwen3-30B-A3B-4bit` — verified both detail objects
present with correct values in streaming and non-streaming responses.
### Automated Testing
13 tests in `test_openai_responses_api.py`.
## Motivation
Part 1 of many memory improvements.
## Changes
As written in the title
## Test Plan
### Manual Testing
Gemma 4 26B cache reduced from 54 GB to 10 GB per 100k tokens; Qwen3.5
35B A3B cache reduced from 21 GB to 7 GB per 100k tokens.
## Motivation
In the dashboard model picker sidebar, the Gemma 4 models were showing
up under a "Gemma" family with the generic fallback tick/checkmark icon
(the default case in `FamilyLogos.svelte`), since no dedicated logo
branch existed for `family === "gemma"`. Every other vendor (Meta,
NVIDIA, OpenAI, DeepSeek, Qwen, …) has its own brand mark.
Gemma is Google's model family, so it should live under a **Google**
bucket that future Google-authored models can join, and it should render
with a proper Google logo in the same style as its neighbors.
## Changes
- `dashboard/src/lib/components/FamilyLogos.svelte`: added a `family ===
"google"` branch rendering a monochrome Google "G" as a single `<path>`
inside the shared `24×24` viewBox with `fill="currentColor"`, matching
the other vendor logos.
- `dashboard/src/lib/components/FamilySidebar.svelte`: added `google:
"Google"` to the `familyNames` display map.
- `dashboard/src/lib/components/ModelPickerModal.svelte`: inserted
`"google"` into the `familyOrder` array (next to `"llama"`) so the
vendor has a deterministic sort position.
- `resources/inference_model_cards/mlx-community--gemma-4-*.toml` (16
files): changed `family = "gemma"` → `family = "google"`. `base_model =
"Gemma 4 …"` is unchanged, so the model titles still read "Gemma".
## Why It Works
The sidebar builds its family list from whatever values appear in
`model.family` across the loaded model cards (`ModelPickerModal.svelte`
`uniqueFamilies`). Renaming the family string on the 16 Gemma cards from
`"gemma"` to `"google"` collapses them into a single "Google" bucket,
and the new logo branch + display-name map entry gives that bucket a
real brand mark and label. All other logos share the same `w-6 h-6 /
viewBox="0 0 24 24" / fill="currentColor"` shape, so inheriting
`text-exo-yellow` / `text-white/50` just works.
## Test Plan
### Manual Testing
Hardware: MacBook Pro M3 Max
- `cd dashboard && npm install && npm run build` — dashboard builds
cleanly.
- `uv run exo`, opened `http://localhost:52415`, clicked **SELECT
MODEL**:
- sidebar shows a **Google** entry with a monochrome Google "G" logo in
the same style as Meta / NVIDIA / etc.
- old "Gemma" entry with the generic tick is gone.
- clicking **Google** filters to the Gemma 4 variants (e2b / e4b / 26B
A4B / 31B).
- hover/selected color states switch between `text-white/50` and
`text-exo-yellow` correctly.
### Automated Testing
- No new tests — this is a cosmetic grouping/logo change. Existing
dashboard build verifies the Svelte + TS compiles.
## Motivation
Fixes #1861
When `--api-port` is set to a non-default value (e.g., `--api-port
55555`), the IP connectivity discovery system still probes peers on the
hardcoded default port 52415. Since the API is not listening on 52415,
all reachability checks fail, the topology reports zero reachable nodes,
and the dashboard shows "No valid configurations for current settings."
## Changes
Thread the configured `api_port` from `Args` through `Worker` into the
reachability probe functions:
- `net_profile.py`: `check_reachability()` and `check_reachable()`
accept an `api_port` parameter (default 52415 for backward
compatibility)
- `worker/main.py`: `Worker` stores `api_port` and passes it to
`check_reachable()`, uses it in `Multiaddr` construction and the mDNS
connection filter
- `main.py`: passes `args.api_port` to the `Worker` constructor
## Why It Works
The `/node_id` endpoint used by reachability probes is served by the
FastAPI app, which binds to `args.api_port`. The probes must use the
same port the API is actually listening on. Before this fix, the port
was hardcoded in three places in `net_profile.py` and `worker/main.py`;
now it uses the value from the CLI flag.
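The probe-URL construction after the change, as a sketch; the function name and URL shape are illustrative, but the endpoint and default port come from the description above:

```python
DEFAULT_API_PORT = 52415  # kept as the default for backward compatibility

def node_id_probe_url(host: str, api_port: int = DEFAULT_API_PORT) -> str:
    """Build the reachability-probe URL for a peer.

    The /node_id endpoint is served by the FastAPI app bound to
    args.api_port, so probes must target that same port instead of a
    hardcoded 52415.
    """
    return f"http://{host}:{api_port}/node_id"
```

With `--api-port 55555`, the probe now hits `:55555` and reachability checks succeed instead of all failing against the default port.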
## Test Plan
### Manual Testing
Hardware: not available for multi-node testing
- Verified ruff passes on all changed files
- Code inspection: traced `api_port` flow from `Args.parse()` →
`Node.create()` → `Worker.__init__()` → `_poll_connection_updates()` →
`check_reachable()` → `check_reachability()` → HTTP probe URL
### Automated Testing
- No existing automated tests cover the reachability probe code path
- The new `api_port` parameter defaults to `52415`, so all existing
behavior is preserved when `--api-port` is not specified
---------
Co-authored-by: lawrence3699 <lawrence3699@users.noreply.github.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
The mlx-community [MiniMax-M2.7
collection](https://huggingface.co/collections/mlx-community/minimax-m27)
landed but exo didn't have model cards for any of the variants yet, so
they weren't selectable from the dashboard model picker. Adding cards
also makes them discoverable under the existing MiniMax family entry.
## Changes
Added 6 new model cards in `resources/inference_model_cards/`, one per
quant of MiniMax M2.7:
- `mlx-community--MiniMax-M2.7.toml` (bf16, full precision — 457 GB)
- `mlx-community--MiniMax-M2.7-4bit.toml` (128 GB)
- `mlx-community--MiniMax-M2.7-4bit-mxfp4.toml` (121 GB)
- `mlx-community--MiniMax-M2.7-5bit.toml` (157 GB)
- `mlx-community--MiniMax-M2.7-6bit.toml` (185 GB)
- `mlx-community--MiniMax-M2.7-8bit.toml` (243 GB)
All six use `family = "minimax"` and share `base_model = "MiniMax M2.7"`
so they collapse into a single group in the picker with the existing
MiniMax logo. Architecture fields (`n_layers = 62`, `hidden_size =
3072`, `num_key_value_heads = 8`, `context_length = 196608`) were read
from each repo's `config.json`; `storage_size.in_bytes` was summed from
the HF tree API per repo.
`capabilities = ["text", "thinking"]` follows the existing MiniMax M2.5
cards — the chat template always emits `<think>` tags (no toggle),
matching M2.5 behavior.
## Why It Works
Model cards in `resources/inference_model_cards/` are auto-loaded by
`src/exo/shared/models/model_cards.py::get_model_cards`. The dashboard
picker groups by `base_model` and filters by `family`, so sharing both
across all six variants gives a single "MiniMax M2.7" group under the
MiniMax sidebar entry, with the quant variants exposed as selectable
sub-options.
## Test Plan
### Manual Testing
Hardware: MacBook Pro M3 Max
- Ran `uv run python -c "…await get_model_cards()…"` and confirmed all 6
new cards load with `family=minimax`, `base_model="MiniMax M2.7"`, and
correct quant + byte sizes.
- `cd dashboard && npm run build` then `uv run exo`, opened the model
picker → **MiniMax** family → **MiniMax M2.7** group shows all six quant
variants.
### Automated Testing
- No new automated tests — these are data files validated by the
existing Pydantic `ModelCard` schema at load time.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Motivation
Add support for Gemma 4, including VLM!
## Changes
- Add auto parallel strategies and model cards for Gemma 4
- Normalise Gemma 4's special Vision Transformer handling to be in line
with the rest of our vision processors.
- Also adds reprs to messages and b64 hashes to prevent log spam.
## Test Plan
### Manual Testing
Tested manually on 4bit E2B and 8bit 26B
### Automated Testing
Model onboarding shows small logit diffs.
---------
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
- Let users pass/override arbitrary exo env vars from the macOS app
without a code change.
## Changes
- `ExoProcessController.swift`: `CustomEnvironmentVariable` struct +
`@Published` list persisted to `UserDefaults`, injected into the child
process after built-ins.
- `SettingsView.swift`: new **Environment** tab with add/remove rows,
trim + dedup on save, and a POSIX name validator with a warning badge.
## Why It Works
- Custom vars applied last in `makeEnvironment`, so overriding a
built-in works with no special-casing.
## Test Plan
### Manual Testing
- Set `EXO_LIBP2P_NAMESPACE` via the new UI; confirmed override in
`~/.exo/exo_log/exo.log`.
## Motivation
PDF attachments weren't working on Safari
## Changes
Create async readable stream if none exists
## Why It Works
pdfjs-dist requires an async readable stream internally
## Test Plan
### Manual Testing
pdf attachments now work on Safari, still work on Firefox
## Motivation
<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
---------
Co-authored-by: Evan <evanev7@gmail.com>
## Summary
- Wraps progress callback `send()` in try/except to gracefully handle
`BrokenResourceError` when the memory stream is closed
- Prevents unhandled `ExceptionGroup` from crashing the process when the
download consumer disconnects during transfer
## Root Cause
The download progress callback sends updates through an anyio memory
object stream. When the receiving end closes (e.g., client disconnect,
timeout, or task cancellation), `send()` raises `BrokenResourceError`.
Inside an anyio `TaskGroup`, this unhandled exception becomes an
`ExceptionGroup` that propagates up and crashes the coordinator.
## Fix
Catch `BrokenResourceError` (and `ClosedResourceError` for completeness)
in the progress callback and handle gracefully — the download continues
but progress updates are silently dropped for disconnected consumers.
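The guarded callback might look like the following sketch; the anyio exception types are stubbed locally so the snippet is self-contained, and the factory name is illustrative:

```python
# Stubs for anyio.BrokenResourceError / anyio.ClosedResourceError so
# the sketch runs without anyio installed.
class BrokenResourceError(Exception): ...
class ClosedResourceError(Exception): ...

def make_progress_callback(send):
    """Wrap a memory-stream send() so a disconnected consumer cannot
    crash the download task group."""
    closed = False

    def on_progress(update):
        nonlocal closed
        if closed:
            return  # consumer already gone; drop silently
        try:
            send(update)
        except (BrokenResourceError, ClosedResourceError):
            closed = True  # download continues, updates are dropped

    return on_progress
```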
Fixes #1844
## Motivation
`reasoning_tokens` is always 0 in usage stats, even when thinking
content streams correctly via `reasoning_content` SSE deltas. The MLX
generators had their own thinking detection comparing individual
detokenized tokens against think tags — this never fires for models
where tags span multiple tokens (e.g. gpt-oss-120b) or are already in
the prompt.
## Changes
- Removed broken per-token thinking detection from `batch_generate.py`
and `generate.py`
- Added `_count_reasoning_tokens` wrapper in `model_output_parsers.py`
that counts `is_thinking=True` responses and patches the total into
Usage on the final response
- Wired it as the outermost stage of `apply_all_parsers`, so it works
regardless of which parser sets `is_thinking`
- Added 3 tests covering `parse_thinking_models` and `parse_gpt_oss`
paths
## Why It Works
The parser pipeline already correctly sets `is_thinking` on each
response. Counting at the output of `apply_all_parsers` means one
counting point that works for all model types, replacing the duplicate
broken logic in two generators.
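A sketch of the counting wrapper; response shapes are illustrative dicts rather than exo's actual response objects:

```python
def count_reasoning_tokens(responses):
    """Outermost stage of the parser pipeline (sketch): count responses
    flagged is_thinking and patch the total into the final usage
    payload, regardless of which parser set the flag."""
    reasoning = 0
    for resp in responses:
        if resp.get("is_thinking"):
            reasoning += 1
        usage = resp.get("usage")
        if usage is not None:  # only the final response carries usage
            usage["reasoning_tokens"] = reasoning
        yield resp
```

Because it wraps the pipeline's output rather than inspecting raw tokens, it works whether the think tags span multiple tokens or were already in the prompt.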
## Test Plan
### Manual Testing
- 4-node cluster, `mlx-community/gpt-oss-120b-MXFP4-Q8`
- Main branch: `reasoning_tokens: 0` — fix branch: `reasoning_tokens:
25`
### Automated Testing
- 3 new tests: explicit think tags, `starts_in_thinking=True`, and
gpt-oss Harmony analysis channel
## Motivation
When a model instance is deleted (e.g. node disconnect, manual
teardown), any in-flight SSE streaming connections for that instance
hang indefinitely. The API never closes the response stream, so clients
block forever waiting for more chunks.
## Changes
- Listen for `InstanceDeleted` events in the API event loop
- Add `_close_streams_for_instance()` to find and close any active
text/image generation queues tied to tasks on the deleted instance
- Add unit tests covering text gen, image gen, and
unrelated-instance-not-closed scenarios
## Why It Works
When an instance is deleted, we iterate `state.tasks` to find commands
running on that instance, then close and remove their send-side queue
handles. This causes the SSE generator to terminate, unblocking the
client.
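A sketch of the cleanup helper under assumed data shapes (plain dicts stand in for exo's task state, and `close()` for the send-side queue handle):

```python
def close_streams_for_instance(state_tasks, queues, instance_id):
    """On InstanceDeleted: close the send-side queues of every task
    running on that instance so their SSE generators terminate and
    clients unblock instead of hanging forever."""
    for task_id, task in list(state_tasks.items()):
        if task["instance_id"] != instance_id:
            continue
        queue = queues.pop(task_id, None)  # remove the handle...
        if queue is not None:
            queue.close()                  # ...and close the stream
```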
## Test Plan
### Manual Testing
- This was causing issues for me on another branch (integration tests).
Including this fix solved the issue
### Automated Testing
- `test_instance_deleted_stream_cleanup.py`: 3 tests covering text gen
cleanup, image gen cleanup, and ensuring unrelated streams are not
affected
While stress-testing inference with rapid client cancels mid-stream, I
hit a reproducible crash where the entire exo process exits.
When a client cancels a streaming chat completion partway through, its
receive stream gets closed cleanly via its context manager. The producer
in `API._apply_state` then calls `queue.send(event.chunk)`, which raises
`anyio.ClosedResourceError` rather than `BrokenResourceError`. The
existing handler only catches `BrokenResourceError`, so the exception
propagates through the API task group, kills the Node task group, and
the process exits with `EXO Shutdown complete`.
Trace from one of the crashes:
```
  File "exo/api/main.py", line 1818, in _apply_state
    await queue.send(event.chunk)
  File "anyio/streams/memory.py", line 212, in send_nowait
    raise ClosedResourceError
anyio.ClosedResourceError
```
The fix is to catch `ClosedResourceError` alongside
`BrokenResourceError` in both queue handlers (text and image), so the
dead queue gets dropped and `_apply_state` keeps running for other
in-flight requests.
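The widened handler can be sketched as follows; the exception types are stubbed locally and a plain callable stands in for the async memory stream:

```python
# Stubs for the two anyio exception types, so the sketch is self-contained.
class BrokenResourceError(Exception): ...
class ClosedResourceError(Exception): ...

def apply_chunk(queues, task_id, chunk):
    """Sketch of the fix in _apply_state: either exception means the
    consumer is gone, so drop the dead queue and keep serving the
    remaining in-flight requests instead of crashing the task group."""
    queue = queues.get(task_id)
    if queue is None:
        return
    try:
        queue(chunk)  # stands in for `await queue.send(event.chunk)`
    except (BrokenResourceError, ClosedResourceError):
        del queues[task_id]
```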
## Motivation
The exo backend already supports `--fast-synch` / `--no-fast-synch` CLI
flags and the `EXO_FAST_SYNCH` environment variable, but there was no
way to toggle this from the macOS app UI. Users who want fast CPU-to-GPU
synchronization for RDMA with Tensor Parallelism had to use CLI flags.
## Changes
- **ExoProcessController.swift**: Added `fastSynchEnabled`
UserDefaults-backed property and pass `EXO_FAST_SYNCH=on` to the exo
process environment when enabled.
- **SettingsView.swift**: Added a "Performance" section to the Advanced
tab with a "Fast Synch Enabled" toggle, an info icon (ⓘ) tooltip
explaining the feature and trade-offs, and a "Save & Restart" button.
## Why It Works
Follows the exact same pattern as the existing `offlineMode` and
`enableImageModels` settings — UserDefaults persistence, `@Published`
property with `didSet`, environment variable passthrough in
`makeEnvironment()`, and pending state with Save & Restart in the
settings UI. The `EXO_FAST_SYNCH=on` value matches what the Python
backend already reads in `main.py`.
## Test Plan
### Manual Testing
Hardware: macOS app
- Open Settings → Advanced tab → verify "Performance" section with "Fast
Synch Enabled" toggle appears
- Hover the ⓘ icon → verify tooltip explains the feature and GPU lock
trade-off
- Toggle on → click "Save & Restart" → verify process restarts with
`EXO_FAST_SYNCH=on` in env
- Close and reopen Settings → verify the toggle state persists
- Verify "Save & Restart" button is disabled when no changes are pending
### Automated Testing
- Existing settings patterns are well-established; no new automated
tests needed for this UI toggle
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Changes
Mostly chore changes around VS Code and JetBrains workspace settings,
plus some basedpyright settings tweaks, so that direnv and nixd
autocomplete with flake-parts work.
## Motivation
MLX LM has had a massive refactor to their BatchGenerator recently.
Since we'd like new features from MLX LM such as Gemma 4, we need to
update the code to handle this.
Additionally, this fixes a significant memory leak in GatedDeltaNet.
The difference is quite substantial, up to 1 GB every 1000 tokens,
which explains several memory issues users were facing with Qwen3.5
models.
## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>
After (no memory leak, as one of the changes upstream)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
When `enable_thinking` is set, chat templates (Qwen3, DeepSeek, etc.)
append `<think>` to the prompt. The model starts generating thinking
content directly without emitting a `<think>` token in the output
stream.
Both generators initialized `in_thinking = False` and only set it to
`True` on seeing a `<think>` token in output. Since that token was part
of the prompt, the flag never flipped and `reasoning_tokens` stayed at 0
in the usage response.
Fix: initialize `in_thinking` from `detect_thinking_prompt_suffix()`,
which already exists and is used by `model_output_parsers` for routing
thinking content correctly.
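A sketch of the fix; `detect_thinking_prompt_suffix` is reimplemented here in simplified form, and plain token strings stand in for detokenized output:

```python
def detect_thinking_prompt_suffix(prompt: str,
                                  think_tag: str = "<think>") -> bool:
    """Simplified stand-in for the existing helper: when the chat
    template appends <think> to the prompt, generation starts inside
    thinking content without ever emitting the tag in the output."""
    return prompt.rstrip().endswith(think_tag)

def count_thinking(prompt, tokens,
                   think_tag="<think>", end_tag="</think>"):
    # The fix: seed in_thinking from the prompt suffix instead of
    # initializing it to False and waiting for a <think> token that
    # may never appear in the output stream.
    in_thinking = detect_thinking_prompt_suffix(prompt, think_tag)
    reasoning = 0
    for tok in tokens:
        if tok == think_tag:
            in_thinking = True
        elif tok == end_tag:
            in_thinking = False
        elif in_thinking:
            reasoning += 1
    return reasoning
```

With the old initialization, a prompt ending in `<think>` yields `reasoning_tokens = 0` even though every early token is thinking content.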
## Motivation
The timings in the batch generator are a little optimistic; a minor
change is needed to make them more correct.
## Changes
Include the time spent in the API in the generation tps, and make sure
all requests are sent simultaneously.