## Motivation
Kimi produces its own tool call IDs and gets confused when we generate our own.
## Changes
Add an `id` field to the tool call item and parse the Kimi-provided id properly.
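A minimal sketch of the intended behaviour (illustrative names, not the exo types): prefer the id the model emitted, and only synthesise one when the model didn't provide any.
```python
import uuid

def resolve_tool_call_id(model_provided_id: str | None) -> str:
    # Prefer the id Kimi emitted in its tool call; only synthesise one as a fallback.
    if model_provided_id:
        return model_provided_id
    return f"call_{uuid.uuid4().hex[:24]}"
```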
## Test Plan
### Manual Testing
<img width="3198" height="522" alt="image"
src="https://github.com/user-attachments/assets/d71ec2be-7f57-49dc-a569-d304cc430f4d"
/>
Long-running Kimi K2.5 cluster querying itself through OpenCode running
on the same Kimi K2.5 instance.
## Motivation
A recent commit broke GLM (non-lite) sharding.
## Why It Works
The assert is no longer hit, as the isinstance check now includes
GLM4MoeDecoderLayer.
Added type stubs to keep the type checker happy.
## Test Plan
### Manual Testing
Runs as expected without gibberish.
## Motivation
Add support for FLUX.1-Kontext-dev, an image-editing variant of
FLUX.1-dev.
## Changes
- New FluxKontextModelAdapter: Handles Kontext's image-to-image
workflow, encoding the input image as conditioning latents with special
position IDs and generating from pure noise
- Model config: 57 transformer blocks (19 joint + 38 single), guidance
scale 4.0, ImageToImage task
- Pipeline updates: Added kontext_image_ids property to PromptData
interface, passed through diffusion runner
- Model cards: Added TOML configs for base, 4-bit, and 8-bit variants
- Dependency: mflux 0.15.4 → 0.15.5
- Utility: tmp/quantize_and_upload.py for quantizing and uploading
models to HuggingFace
## Test Plan
### Manual Testing
Works better than Qwen-Image-Edit
## Motivation
More dimensions for image generation
## Changes
- dashboard/src/lib/components/ImageParamsPanel.svelte: Added
"1024x1365" and "1365x1024" to the sizeOptions array
- dashboard/src/lib/stores/app.svelte.ts: Extended the size type in
ImageGenerationParams interface to include the two new dimension options
## Motivation
Allow users to directly configure num_sync_steps for distributed image
generation instead of deriving it from a factor of total steps.
## Changes
- Added num_sync_steps field to AdvancedImageParams API (range 1-50)
- Changed model configs from num_sync_steps_factor: float to
num_sync_steps: int
- Updated Flux/Qwen configs with direct values (1, 4, 7 respectively)
- Added slider control in dashboard advanced params panel
- Falls back to model default when not specified
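A sketch of the new field and its fallback (pydantic-style; names from the list above, exact shape assumed):
```python
from pydantic import BaseModel, Field

class AdvancedImageParams(BaseModel):
    # Direct control over distributed sync steps; None falls back to the model default.
    num_sync_steps: int | None = Field(default=None, ge=1, le=50)

def effective_sync_steps(params: AdvancedImageParams, model_default: int) -> int:
    return params.num_sync_steps if params.num_sync_steps is not None else model_default
```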
## Why It Works
Decouples sync steps from inference steps, giving users direct control
over distributed inference synchronization while preserving sensible
defaults.
## Test Plan
### Manual Testing
- Generate images with various sync step values via dashboard slider
- Verify default behavior when parameter is unset
Runs exo_bench remotely with some nice git QoL.
## usage
Run `tests/auto_bench.sh host1 [host2]`.
exo bench will be run on those hosts against all currently downloaded
models, with its output saved to `bench/<commit_hash>/*.json`.
The uv.lock is churning constantly as different uv versions bounce it
between revisions. This is made worse by GitHub automatically hiding the
uv.lock changes, meaning it's hard to notice when this goes wrong.
Set a minimum version for `uv` in pyproject.toml to fix this. I tried
quite a few versions (not all) and found that 0.8.6 sets the revision to 3,
which I believe is the latest. That release is from August 2025, so it has
been around for a while.
Test plan:
```
jake@maverick:/data/users/jake/repos/exo/ > git checkout main uv.lock
jake@maverick:/data/users/jake/repos/exo/ > nix shell github:nixos/nixpkgs/3dce7f4a77812afd69efcbfe15e5223f98c5c69e#uv --command sh -c 'uv add pip --frozen && uv lock && uv remove pip --frozen && uv lock && uv --version'
Resolved 140 packages in 147ms
Added pip v26.0.1
Resolved 139 packages in 48ms
Removed pip v26.0.1
uv 0.8.6
```
## Motivation
MiniMax tensor sharding does not provide equivalent outputs to running
it as a single node because RMSNorm weights cannot be split without
affecting the output.
Qwen3Next sharding was broken, and something with Qwen3MoE was likely
changed upstream, as several variables no longer exist.
This also ballooned into fixing prefix caching for non-standard models
as Qwen3Next was behaving weirdly.
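To illustrate the RMSNorm point above with plain NumPy (not the exo code): the norm is computed over the full hidden dimension, so normalizing two shards independently does not reproduce the unsharded result.
```python
import numpy as np

def rms_norm(x: np.ndarray, w: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return x / np.sqrt(np.mean(x * x) + eps) * w

rng = np.random.default_rng(0)
x, w = rng.normal(size=8), rng.normal(size=8)

full = rms_norm(x, w)
sharded = np.concatenate([rms_norm(x[:4], w[:4]), rms_norm(x[4:], w[4:])])
print(np.max(np.abs(full - sharded)))  # non-zero: each shard sees different RMS statistics
```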
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
Worked for an 8-hour eval at the same performance and with a more similar
completion/reasoning token distribution.
---------
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
`download_file_with_retry()` has a `break` in the generic exception
handler that exits the retry loop after the first transient failure.
This means network timeouts, connection resets, and server errors all
cause an immediate download failure — the two remaining retry attempts
never run.
## Changes
**download_utils.py**: Replaced `break` with logging and exponential
backoff in the generic exception handler, matching the existing
rate-limit handler behavior.
Before:
```python
except Exception as e:
    on_connection_lost()
    if attempt == n_attempts - 1:
        raise e
    break  # exits loop immediately
```
After:
```python
except Exception as e:
    on_connection_lost()
    if attempt == n_attempts - 1:
        raise e
    logger.error(f"Download error on attempt {attempt + 1}/{n_attempts} ...")
    logger.error(traceback.format_exc())
    await asyncio.sleep(2.0**attempt)
```
## Why It Works
The `break` statement was bypassing the retry mechanism entirely.
Replacing it with the same log-and-backoff pattern used by the
`HuggingFaceRateLimitError` handler means all 3 attempts are actually
used before giving up. The exponential backoff (1s, 2s) gives transient
issues time to resolve between attempts.
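For context, the surrounding retry loop now has roughly this shape (simplified sketch, not the exact download_utils.py code):
```python
import asyncio
import logging
import traceback

logger = logging.getLogger(__name__)

async def download_with_retry(download_once, on_connection_lost, n_attempts: int = 3):
    for attempt in range(n_attempts):
        try:
            return await download_once()
        except Exception:
            on_connection_lost()
            if attempt == n_attempts - 1:
                raise
            logger.error(f"Download error on attempt {attempt + 1}/{n_attempts}")
            logger.error(traceback.format_exc())
            await asyncio.sleep(2.0**attempt)  # 1s after the first failure, 2s after the second
```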
## Test Plan
### Manual Testing
- Downloads that hit transient network errors now retry instead of
failing immediately
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `uv run pytest src/exo/download/tests/ -v` — 11 tests pass
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
A lot of our cleanup logic wasn't running, leading to bad shutdown states.
## changes
- added `try`/`except` blocks around most task groups
- made the runner shutdown code synchronous
- abandon the MpReceiver's recv_async thread on cancellation
- this only occurs during runner shutdown; the queue closing from the
other end should terminate the mp.Queue, cleaning up the thread in its
own time. I could try other methods if this is not sufficient.
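Roughly the pattern now used around task groups (illustrative asyncio sketch; the real code wraps exo's own task groups and shutdown hooks):
```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def run_with_cleanup(main, cleanup) -> None:
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(main())
    except* Exception as eg:
        # Log rather than let a task failure skip the cleanup path.
        logger.exception("task group failed: %r", eg.exceptions)
    finally:
        cleanup()  # runner shutdown is now synchronous, so it always runs
```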
## outcome
ctrl-c just works now! minus the tokio panic of course :) no more
hypercorn lifespan errors though!
## Motivation
Users browsing models in the picker need to know which models are
already downloaded and ready to run on their cluster, without having to
check the downloads page separately.
## Changes
- **ModelPickerModal.svelte**: Computes per-model download availability
by checking which nodes have `DownloadCompleted` entries and summing
their total RAM against the model's storage size. Passes availability
data to `ModelPickerGroup`. Enhances the info modal with a "Downloaded
on:" section showing node friendly names with green badges.
- **ModelPickerGroup.svelte**: Accepts new `downloadStatus` prop. Shows
a green checkmark-in-circle icon next to models that are downloaded on
sufficient nodes. Tooltip shows which nodes have the model.
- **+page.svelte**: Passes `downloadsData` and `topologyNodes` to
`ModelPickerModal`.
## Why It Works
The download state from `/state` already tracks per-node completed
downloads. The shared `getNodesWithModelDownloaded()` utility (from PR
#1375) finds nodes with `DownloadCompleted` entries for each model.
Total RAM is summed from the topology node data (using `ram_total`, not
`ram_available`) and compared to the model's `storage_size_megabytes` to
determine if there's enough aggregate memory. This is intentionally a
simple heuristic — not a full placement preview.
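The RAM heuristic, written out for clarity (the real implementation lives in the Svelte component; names here are illustrative):
```python
def model_ready(nodes_with_download: list[str], ram_total_mb: dict[str, int], storage_size_mb: int) -> bool:
    # Downloaded somewhere, and the downloading nodes' combined RAM covers the model size.
    total_ram = sum(ram_total_mb[node] for node in nodes_with_download)
    return bool(nodes_with_download) and total_ram >= storage_size_mb
```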
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
- Open the model picker modal
- Verify downloaded models show a green checkmark icon
- Verify the checkmark appears dimmer for models downloaded on nodes
with insufficient total RAM
- Click the (i) info button on a downloaded model
- Verify "Downloaded on:" section appears with correct node names
- Verify models with no downloads show no indicator
### Automated Testing
- Dashboard builds successfully (`npm run build`)
- No new Python changes requiring type checking
> **Note:** This is a chained PR. Base branch is
`alexcheema/topology-download-indicators` (#1375).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Qwen3-Coder-Next just dropped on mlx-community in several quantizations.
It's an 80B MoE model (Qwen3NextForCausalLM) which we already have
tensor parallelism support for via QwenShardingStrategy — just needs
model cards.
## Changes
Added model cards for all 5 available quantizations:
- `mlx-community/Qwen3-Coder-Next-4bit` (~46GB)
- `mlx-community/Qwen3-Coder-Next-5bit` (~58GB)
- `mlx-community/Qwen3-Coder-Next-6bit` (~69GB)
- `mlx-community/Qwen3-Coder-Next-8bit` (~89GB)
- `mlx-community/Qwen3-Coder-Next-bf16` (~158GB)
All with `supports_tensor = true` since the architecture is already
supported.
## Why It Works
`Qwen3NextForCausalLM` is already handled by QwenShardingStrategy in
auto_parallel.py and is in the supports_tensor allowlist in
model_cards.py. No code changes needed — just the TOML card files.
## Test Plan
### Manual Testing
n/a - model card addition only
### Automated Testing
- `basedpyright` — 0 errors
- `ruff check` — passes
- `nix fmt` — no changes
- `pytest` — 173 passed, 1 skipped
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
The macOS app required user interaction via AppleScript prompts to
install or uninstall network configuration components, making automated
deployments difficult.
Added --install and --uninstall command line flags that execute the
network setup scripts directly when running as root, bypassing GUI
prompts. Created a new main.swift entry point that parses CLI arguments
and delegates to NetworkSetupHelper's new direct execution methods.
This enables headless installation via `sudo EXO --install` for
automated deployment scenarios while preserving the existing GUI
behavior when launched normally.
Test plan:
- Deployed to a machine that didn't have the content installed. Got
blocked on the popup and EXO never launched.
- Relaunched EXO and confirmed it still didn't start because of the popup.
- Ran `sudo /Applications/EXO.app/Contents/MacOS/EXO --install`
- Launched EXO - the API started as expected.
- Ran `sudo /Applications/EXO.app/Contents/MacOS/EXO --uninstall`
- Launched EXO - got the popup.
## Motivation
Adds uncertainty visualization to the chat interface, allowing users to
see token-level confidence scores and regenerate responses from any
point in the generation. This enables users to:
- Understand model confidence at each token
- Explore alternative completions by regenerating from uncertain tokens
- Debug and analyze model behavior
## Changes
### Uncertainty Visualization
- Add `TokenHeatmap` component showing token-level probability coloring
- Toggle uncertainty view per message with bar chart icon
- Display tooltip with probability, logprob, and top alternative tokens
on hover
### Regenerate from Token
- Add "Regenerate from here" button in token tooltip
- Use `continue_final_message` in chat template to continue within same
turn (no EOS tokens)
- Add `continue_from_prefix` flag to `ChatCompletionTaskParams`
### Request Cancellation
- Add `AbortController` to cancel in-flight requests when regenerating
mid-generation
- Handle `BrokenResourceError` server-side when client disconnects
gracefully
### Additional APIs
- Add Claude Messages API support (`/v1/messages`)
- Add OpenAI Responses API support (`/v1/responses`)
## Why It Works
- **Proper continuation**: Using `continue_final_message=True` instead
of `add_generation_prompt=True` keeps the assistant turn open, allowing
the model to continue naturally from the prefix without end-of-turn
markers
- **Clean cancellation**: AbortController aborts the HTTP request, and
server catches `BrokenResourceError` to avoid crashes
- **Stable hover during generation**: TokenHeatmap tracks hover by index
(stable across re-renders) with longer hide delay during generation
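A sketch of the continuation call described above (HuggingFace-style chat template API; the exo integration may differ in detail):
```python
from transformers import AutoTokenizer

# Any tokenizer with a chat template works; this model id is just an example.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "user", "content": "Explain RMSNorm."},
    {"role": "assistant", "content": "RMSNorm normalizes activations by"},  # prefix to continue from
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    continue_final_message=True,   # keep the assistant turn open: no end-of-turn marker
    add_generation_prompt=False,   # do not start a fresh assistant turn
)
```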
## Test Plan
### Manual Testing
Hardware: MacBook Pro M1
- Send a message and verify logprobs are collected
- Enable uncertainty view and verify token coloring based on probability
- Hover over tokens to see tooltip with alternatives
- Click "Regenerate from here" on a token mid-response
- Verify the response continues naturally from that point
- Verify aborting mid-generation and regenerating works without server
crash
### Automated Testing
- Added tests for Claude Messages API adapter
- Added tests for OpenAI Responses API adapter
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
Duplicate tasks are still observed.
## Changes
Moved task acknowledgement to after the runner has changed its status.
## Why It Works
Tasks now remain pending until the runner has updated its status.
## Test Plan
### Manual Testing
Seems to work fine from manual testing. Hard to test a race condition
though.
### Automated Testing
Updated the event ordering test.
## Motivation
Enable parallel classifier-free guidance (CFG) for Qwen image models.
CFG requires two forward passes (positive/negative prompts) - this
allows them to run on separate nodes simultaneously, reducing latency.
## Changes
- Added uses_cfg flag to ModelCard to identify CFG-based models
- Extended PipelineShardMetadata with CFG topology fields (cfg_rank,
cfg_world_size, peer device info)
- Updated placement to create two CFG groups with reversed ordering
(places CFG peers as ring neighbors)
- Refactored DiffusionRunner to process CFG branches separately with
exchange at last pipeline stage
- Added get_cfg_branch_data() to PromptData for single-branch embeddings
- Fixed seed handling in API for distributed consistency
- Fixed image yield to only emit from CFG rank 0 at last stage
- Increased num_sync_steps_factor from 0.125 to 0.25 for Qwen
## Why It Works
- 2 nodes + CFG: Both run all layers, process different CFG branches in
parallel
- 4+ even nodes + CFG: Hybrid - 2 CFG groups × N/2 pipeline stages
- Odd nodes or non-CFG: Falls back to pure pipeline parallelism
Ring topology places CFG peers as neighbors to enable direct exchange.
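A toy sketch of the reversed grouping (not the exo placement code): splitting an even ring into two CFG groups and reversing the second makes each stage's CFG peer a ring neighbour.
```python
def cfg_groups(nodes: list[str]) -> tuple[list[str], list[str]]:
    assert len(nodes) % 2 == 0, "CFG parallelism needs an even node count"
    half = len(nodes) // 2
    return nodes[:half], list(reversed(nodes[half:]))

# Ring n0-n1-n2-n3-n0: stage 0 pairs n0 with n3, stage 1 pairs n1 with n2;
# both pairs are adjacent on the ring, so the CFG exchange is a neighbour send/recv.
print(cfg_groups(["n0", "n1", "n2", "n3"]))  # (['n0', 'n1'], ['n3', 'n2'])
```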
## Test Plan
### Manual Testing
Verified the performance gain for Qwen-Image on 2-node and 4-node
clusters. Non-CFG models still work.
### Automated Testing
Added tests in test_placement_utils.py covering 2-node CFG parallel,
4-node hybrid, odd-node fallback, and non-CFG pipeline modes.
## Motivation
Reimplements the model picker modal from #1191 on top of the custom
model support branch. Replaces the inline model dropdown with a
full-featured modal that groups models by base model, supports
filtering, favorites, and HuggingFace Hub search.
## Changes
**Backend:**
- Add `family`, `quantization`, `base_model`, `capabilities` metadata
fields to `ModelCard` and all 40 TOML model cards
- Pass new fields through `ModelListModel` and `get_models()` API
response
- Add `GET /models/search` endpoint using
`huggingface_hub.list_models()`
**Dashboard (7 new files):**
- `ModelPickerModal.svelte` — Main modal with search, family filtering,
HuggingFace Hub tab
- `ModelPickerGroup.svelte` — Expandable model group row with
quantization variants
- `FamilySidebar.svelte` — Vertical sidebar with family icons (All,
Favorites, Hub, model families)
- `FamilyLogos.svelte` — SVG icons for each model family
- `ModelFilterPopover.svelte` — Capability and size range filters
- `HuggingFaceResultItem.svelte` — HF search result item with
download/like counts
- `favorites.svelte.ts` — localStorage-backed favorites store
**Integration:**
- Replace inline dropdown in `+page.svelte` with button that opens
`ModelPickerModal`
- Custom models shown in Hub tab with delete support
**Polish:**
- Real brand logos (Meta, Qwen, DeepSeek, OpenAI, GLM, MiniMax, Kimi,
HuggingFace) from Simple Icons / LobeHub
- Clean SVG stroke icons for capabilities (thinking, code, vision, image
gen)
- Consistent `border-exo-yellow/10` borders, descriptive tooltips
throughout
- Cluster memory (used/total) shown in modal header
- Selected model highlight with checkmark for both single and
multi-variant groups
- Cursor pointer on all interactive elements, fix filter popover
click-outside bug
- Custom models now appear in All tab alongside built-in models
## Bug Fix: Gemma 3 EOS tokens
Also included in this branch: fix for Gemma 3 models generating infinite
`<end_of_turn>` tokens. The tokenizer's `eos_token_ids` was missing
token ID 106 (`<end_of_turn>`), so generation never stopped. The fix
appends this token to the EOS list after loading the tokenizer. Also
handles `eos_token_ids` being a `set` (not just a `list`).
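Roughly what the fix does (sketch; attribute names assumed rather than copied from the exo tokenizer wrapper):
```python
END_OF_TURN_ID = 106  # Gemma 3's <end_of_turn> token

def ensure_end_of_turn_eos(tokenizer) -> None:
    eos = tokenizer.eos_token_ids
    if isinstance(eos, set):
        tokenizer.eos_token_ids = eos | {END_OF_TURN_ID}
    elif END_OF_TURN_ID not in eos:
        tokenizer.eos_token_ids = [*eos, END_OF_TURN_ID]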
## Why It Works
Model metadata (family, capabilities, etc.) is stored directly in TOML
cards rather than derived from heuristics, ensuring accuracy. The modal
groups models by `base_model` field so quantization variants appear
together. Custom models are separated into the Hub tab since they lack
grouping metadata.
## Test Plan
### Manual Testing
- Open dashboard, click model selector to open modal
- Browse models by family sidebar, search, and filters
- Expand model groups to see quantization variants
- Star favorites and verify persistence across page reloads
- Navigate to Hub tab, search and add models
- Verify error messages shown for invalid model IDs
- Run a Gemma 3 model and verify generation stops at `<end_of_turn>`
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — clean
- `uv run pytest src/` — 173 passed
- `cd dashboard && npm run build` — builds successfully
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Users should be able to run any HuggingFace model, not just the ones we
ship TOML cards for. Continues the aim of #1191 with a minimal
implementation on top of the current TOML model card system.
Custom cards are saved to `~/.exo/custom_model_cards/` rather than the
bundled `resources/inference_model_cards/` because `RESOURCES_DIR` is
read-only in PyInstaller bundles (`sys._MEIPASS`). This also fixes
`fetch_from_hf` which was saving cards to the wrong path (`resources/`
root instead of `resources/inference_model_cards/`).
## Changes
- Add `EXO_CUSTOM_MODEL_CARDS_DIR` constant
(`~/.exo/custom_model_cards/`)
- Update `model_cards.py`: add custom dir to search path, fix
`save_to_custom_dir`, add `delete_custom_card`/`is_custom_card`
- Add `POST /models/add` and `DELETE /models/custom/{model_id}` API
endpoints
- Add `is_custom` field to `ModelListModel` API response
- Dashboard: add custom model input form in dropdown, delete button for
custom models, show actual API errors, auto-select newly added model
## Why It Works
Two separate directories for model cards: the bundled read-only
`resources/inference_model_cards/` for built-in cards, and user-writable
`~/.exo/custom_model_cards/` for custom cards. Both are scanned when
listing models. This works in all environments including PyInstaller
bundles where `RESOURCES_DIR` points to `sys._MEIPASS`.
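A sketch of the two-directory lookup (paths from the description above; function names are illustrative):
```python
from pathlib import Path

EXO_CUSTOM_MODEL_CARDS_DIR = Path.home() / ".exo" / "custom_model_cards"

def list_model_card_paths(resources_dir: Path) -> list[Path]:
    # Built-in cards ship read-only with the bundle; custom cards live in a
    # user-writable directory. Both are scanned when listing models.
    EXO_CUSTOM_MODEL_CARDS_DIR.mkdir(parents=True, exist_ok=True)
    bundled = sorted(resources_dir.glob("*.toml"))
    custom = sorted(EXO_CUSTOM_MODEL_CARDS_DIR.glob("*.toml"))
    return bundled + custom

def is_custom_card(card_path: Path) -> bool:
    return EXO_CUSTOM_MODEL_CARDS_DIR in card_path.parents
```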
## Test Plan
### Manual Testing
- Add a custom model via the dropdown (e.g.
`mlx-community/Llama-3.2-1B-Instruct-4bit`)
- Verify it appears in the model list with the delete (x) button
- Delete it and verify it disappears
- Try adding an invalid model ID and verify the actual error is shown
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `uv run pytest src/` — passes
- `cd dashboard && npm run build` — builds
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Offline users currently have to wait for every retry to fail before
being able to launch a model.
For users that restart clusters often or share API keys between devices,
we also spam HuggingFace with downloads every 5 minutes.
These issues are caused by `_emit_existing_download_progress` being
inefficient.
## Changes
- Only query HuggingFace once while EXO is running (assumption being
that a change should only be reflected on a new EXO session)
- Only query HuggingFace when there is an internet connection (polling
connectivity every 10 seconds)
- Request download progress if we switch from no connectivity ->
connected to reduce the wait.
- Reduce download progress sleep as it's no longer expensive (queries
cache most of the time).
- Reduce retries as 30 is way too many.
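A rough sketch of the caching and polling behaviour described above (illustrative names, asyncio; not the exo implementation):
```python
import asyncio

class DownloadProgressPoller:
    def __init__(self, check_connectivity, query_huggingface):
        self._check_connectivity = check_connectivity
        self._query_huggingface = query_huggingface
        self._hf_result = None  # queried at most once per EXO session

    async def hf_listing(self):
        if self._hf_result is None and await self._check_connectivity():
            self._hf_result = await self._query_huggingface()
        return self._hf_result

    async def run(self, on_reconnect, interval: float = 10.0) -> None:
        online = False
        while True:
            now_online = await self._check_connectivity()
            if now_online and not online:
                await on_reconnect()  # refresh download progress as soon as we come back online
            online = now_online
            await asyncio.sleep(interval)
```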
## Test Plan
### Manual Testing
Manually tested the behaviour.
### Automated Testing
None, should I add any? We do have some tests for this folder, but they
are probably not too helpful.
## Motivation
With the addition of the Responses API, we introduced `str |
list[InputMessage]` as the type for `TextGenerationTaskParams.input`
since the Responses API supports sending input as a plain string. But
there was no reason to leak that flexibility past the API adapter
boundary — it just meant every downstream consumer had to do `if
isinstance(messages, str):` checks, adding complexity for no benefit.
## Changes
- Changed `TextGenerationTaskParams.input` from `str |
list[InputMessage]` to `list[InputMessage]`
- Each API adapter (Chat Completions, Claude Messages, Responses) now
normalizes to `list[InputMessage]` at the boundary
- Removed `isinstance(task_params.input, str)` branches in
`utils_mlx.py` and `runner.py`
- Wrapped string inputs in `[InputMessage(role="user", content=...)]` in
the warmup path and all test files
## Why It Works
The API adapters are the only place where we deal with raw user input
formats. By normalizing there, all downstream code (worker, runner, MLX
engine) can just assume `list[InputMessage]` and skip the type-checking
branches. The type system (`basedpyright`) catches any missed call sites
at compile time.
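The boundary normalization in miniature (InputMessage shape assumed for illustration):
```python
from dataclasses import dataclass

@dataclass
class InputMessage:
    role: str
    content: str

def normalize_input(raw: str | list[InputMessage]) -> list[InputMessage]:
    # Adapters accept a plain string (Responses API) or a message list, but
    # everything downstream only ever sees list[InputMessage].
    if isinstance(raw, str):
        return [InputMessage(role="user", content=raw)]
    return raw
```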
## Test Plan
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — applied
- `uv run pytest` — 174 passed, 1 skipped
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Our distributed test now does a full query cycle for every model loaded
onto the relevant machine. This will help find bugs early, as it already
has found one with Qwen3 Next! I didn't write down what the error was
though. Gooooooood luck with that!
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
Tool-call requests can hang indefinitely when `max_tokens` truncates
generation mid-tool-call.
## Reproduction
1. Send a chat completion with `tools` and a low `max_tokens` (e.g. 65)
to Qwen3-0.6B
2. Model generates `<think>...</think>` then starts `<tool_call>` but
`max_tokens` cuts it off before `</tool_call>`
3. **Before this fix:** `parse_tool_calls` buffers tokens after
`<tool_call>`, generator exhausts, buffered tokens (including
`finish_reason`) are silently dropped → stream hangs forever
4. **After this fix:** buffered tokens are flushed as regular text with
`finish_reason` propagated → response returns normally with
`finish_reason: "length"`
Confirmed with fresh local testing: 4 unclosed tool call flushes
triggered in a single session. Also confirmed via production logs from
Jan 29 (2 occurrences).
## Changes
1. **`parse_tool_calls` unclosed tool call flush** — when the generator
exhausts inside an open `<tool_call>` block, flush buffered tokens as
regular text and propagate `finish_reason`
2. **GLM regex fix** — match literal `\n` (not escaped `\\n`) between
arg tags; handle missing `</arg_value>` via lookahead
3. **7 new unit tests** for `parse_tool_calls` covering unclosed,
closed, passthrough, and failed-parse scenarios
## Why It Works
- `parse_tool_calls` now has a post-loop check: if `in_tool_call` is
still true, it yields the buffered text with the tracked `finish_reason`
instead of silently dropping it
- The GLM regex now matches real-world output where newlines appear
between tags and `</arg_value>` may be absent
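A reduced sketch of the flush behaviour (the real parser also tracks `finish_reason` and parses the tool call body):
```python
from collections.abc import Iterable, Iterator

def parse_tool_calls(chunks: Iterable[str]) -> Iterator[str]:
    in_tool_call = False
    buffer: list[str] = []
    for chunk in chunks:
        if not in_tool_call and "<tool_call>" in chunk:
            in_tool_call, buffer = True, [chunk]
        elif in_tool_call:
            buffer.append(chunk)
            if "</tool_call>" in chunk:
                yield "".join(buffer)  # complete tool call; the real code parses it here
                in_tool_call, buffer = False, []
        else:
            yield chunk
    if in_tool_call:
        # Generator exhausted inside an unclosed tool call: flush the buffer as
        # plain text instead of silently dropping it (and its finish_reason).
        yield "".join(buffer)
```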
## Test Plan
### Manual Testing
- Qwen3-0.6B-4bit with `tools` + various `max_tokens` values (61-75)
- Confirmed responses return with `finish_reason: "length"` instead of
hanging
- Log output shows `"generator exhausted inside unclosed tool call,
flushing buffered text"`
### Automated Testing
- 7 new tests in `test_parse_tool_calls.py`
- Full test suite passes (`uv run pytest`)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
We didn't check that ~/.config/exo existed before, which raised a critical exception.
Now we create ~/.config/exo on Linux systems before touching config.toml.
This wasn't caught before since everything lives in ~/.exo on macOS, and we no longer write the keypair to CONFIG_HOME, so config.toml has to do init work it avoided before.
## Motivation
Know what the hell is going on
## Changes
- Tracing library (src/exo/shared/tracing.py): trace() context manager,
Chrome Trace Format export, statistics computation
- Runner instrumentation
(src/exo/worker/engines/image/pipeline/runner.py): Wrapped sync/async
steps, compute blocks, and send/recv operations
- Trace collection: Workers send traces to master after task completion;
merged into ~/.exo/traces/trace_{task_id}.json
- API endpoints: List, fetch, stats, and raw download at /v1/traces/*
- Dashboard: Trace list and detail pages with Perfetto integration
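The core idea of the trace() helper, sketched (the real module also handles async steps, statistics, and per-worker merging):
```python
import json
import threading
import time
from contextlib import contextmanager

_events: list[dict] = []

@contextmanager
def trace(name: str, **args):
    start_us = time.perf_counter_ns() // 1_000
    try:
        yield
    finally:
        _events.append({
            "name": name, "ph": "X",  # Chrome Trace Format "complete" event
            "ts": start_us, "dur": time.perf_counter_ns() // 1_000 - start_us,
            "pid": 0, "tid": threading.get_ident(), "args": args,
        })

def export_trace(path: str) -> None:
    with open(path, "w") as f:
        json.dump({"traceEvents": _events}, f)  # loadable in Perfetto / chrome://tracing
```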
## Why It Works
<img width="1236" height="767" alt="Screenshot 2026-01-30 at 19 00 09"
src="https://github.com/user-attachments/assets/73e6e46d-ba10-4e83-ba99-ff1c3f62ab05"
/>
<img width="1659" height="89" alt="Screenshot 2026-01-30 at 19 00 58"
src="https://github.com/user-attachments/assets/c0fd0e65-e4fc-4fd5-920d-b43b2887d591"
/>
## motivation
Peers eagerly discovered through gossipsub were added to state. This left things looking broken due to one-sided connections.
## changes
The worker no longer writes topology edges from these gossipsub messages.
We now rely strictly on HTTP-discovered topology, which tends to better reflect the actual state of the system's connectivity.
## Motivation
`hosts_*.json` files are local host configuration snapshots that
shouldn't be tracked in version control.
## Changes
Added `hosts_*.json` pattern to `.gitignore`.
## Why It Works
The glob pattern `hosts_*.json` matches any file starting with `hosts_`
and ending with `.json` anywhere in the repository.
## Test Plan
### Manual Testing
- Verified that `hosts_*.json` files are ignored by git after this
change.
### Automated Testing
- No automated tests needed for a `.gitignore` change.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Add support for Claude Messages API and OpenAI Responses API to allow
users to interact with exo using these popular API formats. This enables
broader compatibility with existing tooling and SDKs that expect these
API formats.
## Architecture
Adapter logic lives exclusively in the API layer
(`src/exo/master/adapters/`). On the way in, each adapter converts its
API-specific request type (`ChatCompletionRequest`,
`ClaudeMessagesRequest`, `ResponsesRequest`) into
`TextGenerationTaskParams`. On the way out, each adapter converts the
`TokenChunk` stream back into its API-specific response format.
Everything inside the application — commands, worker, runner, event
sourcing — only sees `TextGenerationTaskParams` and `TokenChunk`. No
API-specific types cross the boundary.
```
API layer │ Application internals
│
Chat Completions → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ChatCompletionResponse
Claude Messages → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ClaudeMessagesResponse
Responses API → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ResponsesResponse
```
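In code, each inbound adapter is just a pure conversion function; sketched below with trimmed-down placeholder types (the real request and params classes are pydantic models with more fields):
```python
from dataclasses import dataclass, field

@dataclass
class InputMessage:
    role: str
    content: str

@dataclass
class TextGenerationTaskParams:
    model: str
    input: list[InputMessage]
    max_output_tokens: int | None = None

@dataclass
class ClaudeMessagesRequest:  # stand-in for the real request type
    model: str
    max_tokens: int
    messages: list[InputMessage] = field(default_factory=list)

def claude_to_task_params(req: ClaudeMessagesRequest) -> TextGenerationTaskParams:
    # Inbound direction: API-specific request -> internal params. The outbound
    # adapter does the reverse, turning the TokenChunk stream into Claude events.
    return TextGenerationTaskParams(
        model=req.model,
        input=list(req.messages),
        max_output_tokens=req.max_tokens,
    )
```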
## Changes
### New Files
- `src/exo/shared/types/claude_api.py` - Pydantic types for Claude
Messages API
- `src/exo/shared/types/openai_responses.py` - Pydantic types for OpenAI
Responses API
- `src/exo/shared/types/text_generation.py` - Shared
`TextGenerationTaskParams` internal type
- `src/exo/master/adapters/chat_completions.py` - Chat Completions
adapter (streaming/non-streaming)
- `src/exo/master/adapters/claude.py` - Claude Messages adapter
(streaming/non-streaming)
- `src/exo/master/adapters/responses.py` - OpenAI Responses adapter
(streaming/non-streaming)
### Modified Files
- `src/exo/master/api.py` - Refactored to use adapters uniformly for all
endpoints; extracted `_resolve_and_validate_text_model` helper to
deduplicate model validation across all text endpoints; removed ad-hoc
`try/except ValueError` blocks from non-streaming paths
### New Endpoints
- `POST /v1/messages` - Claude Messages API (streaming and
non-streaming)
- `POST /v1/responses` - OpenAI Responses API (streaming and
non-streaming)
## Why It Works
All APIs are implemented as pure conversion adapters at the edge of the
application:
1. Adapter functions in `src/exo/master/adapters/` convert incoming
requests to `TextGenerationTaskParams`
2. `api.py` wraps the params in a `TextGeneration` command and sends it
through the existing command/event flow
3. The worker, runner, and event sourcing layers only handle
`TextGenerationTaskParams` and `TokenChunk` — they have no awareness of
Chat Completions, Claude, or Responses API formats
4. On response, adapter functions convert the `TokenChunk` stream back
to the caller's expected format
5. Model validation is handled by a single shared helper
(`_resolve_and_validate_text_model`), mirroring the existing
`_validate_image_model` pattern for image endpoints
No changes to core inference logic were needed.
### Streaming Formats
- **Chat Completions**: Uses `data: {...}\n\n` with `[DONE]` terminator
- **Claude**: Uses event types `message_start`, `content_block_start`,
`content_block_delta`, `content_block_stop`, `message_delta`,
`message_stop`
- **OpenAI Responses**: Uses event types `response.created`,
`response.in_progress`, `response.output_item.added`,
`response.content_part.added`, `response.output_text.delta`,
`response.output_text.done`, `response.content_part.done`,
`response.output_item.done`, `response.completed`
## Test Plan
### Manual Testing
Hardware: MacBook Pro M3 Max
**Non-streaming tests:**
```bash
# Chat Completions API
curl -X POST http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'
# Claude Messages API
curl -X POST http://localhost:52415/v1/messages \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}]}'
# OpenAI Responses API
curl -X POST http://localhost:52415/v1/responses \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "input": "Hello", "max_output_tokens": 20}'
```
**Streaming tests:**
```bash
# Chat Completions API (streaming)
curl -N -X POST http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "stream": true, "max_tokens": 20}'
# Claude Messages API (streaming)
curl -N -X POST http://localhost:52415/v1/messages \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
# OpenAI Responses API (streaming)
curl -N -X POST http://localhost:52415/v1/responses \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "input": "Hello", "stream": true, "max_output_tokens": 20}'
```
All endpoints tested successfully with proper response formats and
streaming events.
### Automated Testing
- Tests in `src/exo/master/tests/` all pass (85 tests)
- Type checker (basedpyright) passes with 0 errors
- Linter (ruff) passes
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
Make exo bench faster for longer prompts, lengthen default timeouts, and
use (pp, tg) pairs rather than all combinations.
## Changes
- Uses binary search to find the correct prompt (see the sketch below)
- Adds a flag to force all combinations if that is desired
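Presumably the search is along these lines (hedged sketch: binary search over a repeated filler text until the tokenised length reaches the pp target; the actual bench code may differ):
```python
def prompt_with_token_count(count_tokens, target: int, filler: str = "hello ") -> str:
    # count_tokens(text) -> number of tokens; assumed monotone in text length.
    lo, hi = 1, max(target * 8, 8)
    while lo < hi:
        mid = (lo + hi) // 2
        if count_tokens(filler * mid) < target:
            lo = mid + 1
        else:
            hi = mid
    return filler * lo  # shortest repetition reaching at least `target` tokens
```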
`nix run .#exo` couldn't find `macmon` because the Nix wrapper scripts
didn't include it in PATH, causing `shutil.which("macmon")` to fail.
Added `--prefix PATH : ${pkgs.macmon}/bin` to the `makeWrapper` call,
conditional on Darwin via `lib.optionalString`, so macmon's binary is
available at runtime without modifying the user's system PATH.
Test plan:
- Verified `nix build .#exo` succeeds
- Checked wrapper script contains macmon store path in PATH prefix
## Motivation
If nodes have uneven memory, one node may evict cache that remains on
another node. This will break prefill on some setups.
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
We currently have no strict requirement that node ids persist across
sessions, so we can generate fresh node ids each time.
This avoids issues like #1332, but prevents further features such as
caching downloads or node-id dialling.
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Some attempts to load model cards (i.e. build_base_shard) always went
through networking rather than using downloaded model cards. We should
always default to ModelCard.load in these scenarios.
## Motivation
(Probably) the final missing piece of the Chat Completions API
## Changes
Add UsageStats
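For reference, the Chat Completions usage object carries three counters; a minimal sketch of such a type (the exo class may differ in detail):
```python
from pydantic import BaseModel

class UsageStats(BaseModel):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

    @classmethod
    def from_counts(cls, prompt_tokens: int, completion_tokens: int) -> "UsageStats":
        return cls(
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
        )
```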
## Why It Works
OpenCode reviewed my PR and gave me stats:
<img width="1150" height="802" alt="image"
src="https://github.com/user-attachments/assets/ebc06bae-797f-4087-87d5-2f26cf60fc48"
/>
## Test Plan
### Automated Testing
No tests were broken.
## Motivation
A lot of changes happened without much attention to the state of exo
bench.
## Changes
Use TaggedModel for BenchChatCompletion so it serialises properly.
Don't break after a GPT-OSS tool call, to preserve parity with the rest of
the codebase.
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<img width="2856" height="678" alt="image"
src="https://github.com/user-attachments/assets/2e18cf0d-c0f8-467c-9763-1a6a59c8a327"
/>
Also tested GPT OSS tool calling in OpenCode
## Motivation
The dashboard UI attempted to parse all image generation responses as
SSE streams, even when streaming was disabled. This broke non-streaming
image generation.
## Changes
- Parse JSON responses directly when not streaming; use the SSE parser only
when stream=true AND partialImages > 0
- Explicitly disable partial images when not streaming
## Why It Works
Both API and dashboard now use the same condition (stream &&
partialImages > 0) to determine response format, ensuring correct
parsing.
## Test Plan
### Manual Testing
Non-streamed image generation results appear in the UI. Streamed image
generation still works.
## Motivation
Warnings that go unchecked tend to accumulate and hide real issues.
Treating them as errors ensures they are addressed immediately, both
locally during development and in CI.
## Changes
Added `SWIFT_TREAT_WARNINGS_AS_ERRORS = YES` and
`GCC_TREAT_WARNINGS_AS_ERRORS = YES` to the **project-level** Debug and
Release build configurations in `project.pbxproj`. This applies to all
targets (EXO, EXOTests, EXOUITests).
## Why It Works
Xcode's `SWIFT_TREAT_WARNINGS_AS_ERRORS` and
`GCC_TREAT_WARNINGS_AS_ERRORS` build settings promote Swift and C/ObjC
warnings to errors at compile time. Setting them at the project level
means all targets inherit the policy without needing per-target or
CI-level overrides.
## Test Plan
### Manual Testing
- Built the EXO scheme in Release configuration with `xcodebuild` — no
warning-as-error failures from Swift or C/ObjC sources.
### Automated Testing
- CI already builds with `-configuration Release`, so it will
automatically enforce warnings-as-errors via the inherited project
settings — no CI changes needed.
Add uv2nix to build Python packages from uv.lock. This creates a fully
Nix-managed Python environment with the Rust bindings injected via overlay.
Changes:
- Add pyproject-nix, uv2nix, and pyproject-build-systems flake inputs
- Create python/parts.nix with overlays to inject Nix-built Rust wheel
- Export packages.exo on macOS (wraps exo/exo-master/exo-worker with dashboard)
- Add checks.lint (ruff, all platforms) and checks.pytest (macOS only)
- Simplify CI typecheck job using nicknovitski/nix-develop action
- Delete .github/actions/typecheck composite action (no longer needed)
- Add no-build-package for MLX packages in pyproject.toml (use wheels)
The Python build is currently macOS-only since MLX requires Metal. Linux
support will be added once the pyproject dependencies are simplified.
Test plan:
- Run `nix flake check` on macOS to verify pytest and lint pass
- Build exo package on macOS: `nix build .#exo`
- Verify CI pipeline passes with simplified typecheck job