## Motivation
Kimi produces its own tool call IDs and gets confused when we generate our own.
## Changes
Add an `id` field to the tool call item and parse the Kimi-provided id properly.
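A minimal sketch of the intended behaviour (illustrative names, not the exo types): prefer the id the model emitted, and only synthesise one when the model didn't provide any.
```python
import uuid

def resolve_tool_call_id(model_provided_id: str | None) -> str:
    # Prefer the id Kimi emitted in its tool call; only synthesise one as a fallback.
    if model_provided_id:
        return model_provided_id
    return f"call_{uuid.uuid4().hex[:24]}"
```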
## Test Plan
### Manual Testing
<img width="3198" height="522" alt="image"
src="https://github.com/user-attachments/assets/d71ec2be-7f57-49dc-a569-d304cc430f4d"
/>
Long-running Kimi K2.5 cluster querying itself through OpenCode running
on the same Kimi K2.5 instance.
## Motivation
A recent commit broke GLM (non-lite) sharding.
## Why It Works
The assert is no longer hit, as the isinstance check now includes
GLM4MoeDecoderLayer.
Added type stubs to keep the type checker happy.
## Test Plan
### Manual Testing
Runs as expected without gibberish.
## Motivation
Add support for FLUX.1-Kontext-dev, an image-editing variant of
FLUX.1-dev.
## Changes
- New FluxKontextModelAdapter: Handles Kontext's image-to-image
workflow, encoding the input image as conditioning latents with special
position IDs and generating from pure noise
- Model config: 57 transformer blocks (19 joint + 38 single), guidance
scale 4.0, ImageToImage task
- Pipeline updates: Added kontext_image_ids property to PromptData
interface, passed through diffusion runner
- Model cards: Added TOML configs for base, 4-bit, and 8-bit variants
- Dependency: mflux 0.15.4 → 0.15.5
- Utility: tmp/quantize_and_upload.py for quantizing and uploading
models to HuggingFace
## Test Plan
### Manual Testing
Works better than Qwen-Image-Edit
## Motivation
More dimensions for image generation
## Changes
- dashboard/src/lib/components/ImageParamsPanel.svelte: Added
"1024x1365" and "1365x1024" to the sizeOptions array
- dashboard/src/lib/stores/app.svelte.ts: Extended the size type in
ImageGenerationParams interface to include the two new dimension options
## Motivation
Allow users to directly configure num_sync_steps for distributed image
generation instead of deriving it from a factor of total steps.
## Changes
- Added num_sync_steps field to AdvancedImageParams API (range 1-50)
- Changed model configs from num_sync_steps_factor: float to
num_sync_steps: int
- Updated Flux/Qwen configs with direct values (1, 4, 7 respectively)
- Added slider control in dashboard advanced params panel
- Falls back to model default when not specified
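A sketch of the new field and its fallback (pydantic-style; names from the list above, exact shape assumed):
```python
from pydantic import BaseModel, Field

class AdvancedImageParams(BaseModel):
    # Direct control over distributed sync steps; None falls back to the model default.
    num_sync_steps: int | None = Field(default=None, ge=1, le=50)

def effective_sync_steps(params: AdvancedImageParams, model_default: int) -> int:
    return params.num_sync_steps if params.num_sync_steps is not None else model_default
```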
## Why It Works
Decouples sync steps from inference steps, giving users direct control
over distributed inference synchronization while preserving sensible
defaults.
## Test Plan
### Manual Testing
- Generate images with various sync step values via dashboard slider
- Verify default behavior when parameter is unset
Runs exo_bench remotely with some nice git QoL.
## usage
Run `tests/auto_bench.sh host1 [host2]`.
exo bench will be run on those hosts against all currently downloaded
models, with its output saved to `bench/<commit_hash>/*.json`.
The uv.lock is churning constantly as different uv versions bounce it
between revisions. This is made worse by GitHub automatically hiding the
uv.lock changes, meaning it's hard to notice when this goes wrong.
Set a minimum version for `uv` in pyproject.toml to fix this. I tried
quite a few versions (not all) and found that 0.8.6 sets the revision to 3,
which I believe is the latest. That release is from August 2025, so it has
been around for a while.
Test plan:
```
jake@maverick:/data/users/jake/repos/exo/ > git checkout main uv.lock
jake@maverick:/data/users/jake/repos/exo/ > nix shell github:nixos/nixpkgs/3dce7f4a77812afd69efcbfe15e5223f98c5c69e#uv --command sh -c 'uv add pip --frozen && uv lock && uv remove pip --frozen && uv lock && uv --version'
Resolved 140 packages in 147ms
Added pip v26.0.1
Resolved 139 packages in 48ms
Removed pip v26.0.1
uv 0.8.6
```
## Motivation
MiniMax tensor sharding does not provide equivalent outputs to running
it as a single node because RMSNorm weights cannot be split without
affecting the output.
Qwen3Next sharding was broken, and something with Qwen3MoE was likely
changed upstream, as several variables no longer exist.
This also ballooned into fixing prefix caching for non-standard models
as Qwen3Next was behaving weirdly.
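To illustrate the RMSNorm point above with plain NumPy (not the exo code): the norm is computed over the full hidden dimension, so normalizing two shards independently does not reproduce the unsharded result.
```python
import numpy as np

def rms_norm(x: np.ndarray, w: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return x / np.sqrt(np.mean(x * x) + eps) * w

rng = np.random.default_rng(0)
x, w = rng.normal(size=8), rng.normal(size=8)

full = rms_norm(x, w)
sharded = np.concatenate([rms_norm(x[:4], w[:4]), rms_norm(x[4:], w[4:])])
print(np.max(np.abs(full - sharded)))  # non-zero: each shard sees different RMS statistics
```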
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
Worked for an 8-hour eval at the same performance and with a more similar
completion/reasoning token distribution.
---------
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
`download_file_with_retry()` has a `break` in the generic exception
handler that exits the retry loop after the first transient failure.
This means network timeouts, connection resets, and server errors all
cause an immediate download failure — the two remaining retry attempts
never run.
## Changes
**download_utils.py**: Replaced `break` with logging and exponential
backoff in the generic exception handler, matching the existing
rate-limit handler behavior.
Before:
```python
except Exception as e:
    on_connection_lost()
    if attempt == n_attempts - 1:
        raise e
    break  # exits loop immediately
```
After:
```python
except Exception as e:
    on_connection_lost()
    if attempt == n_attempts - 1:
        raise e
    logger.error(f"Download error on attempt {attempt + 1}/{n_attempts} ...")
    logger.error(traceback.format_exc())
    await asyncio.sleep(2.0**attempt)
```
## Why It Works
The `break` statement was bypassing the retry mechanism entirely.
Replacing it with the same log-and-backoff pattern used by the
`HuggingFaceRateLimitError` handler means all 3 attempts are actually
used before giving up. The exponential backoff (1s, 2s) gives transient
issues time to resolve between attempts.
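For context, the surrounding retry loop now has roughly this shape (simplified sketch, not the exact download_utils.py code):
```python
import asyncio
import logging
import traceback

logger = logging.getLogger(__name__)

async def download_with_retry(download_once, on_connection_lost, n_attempts: int = 3):
    for attempt in range(n_attempts):
        try:
            return await download_once()
        except Exception:
            on_connection_lost()
            if attempt == n_attempts - 1:
                raise
            logger.error(f"Download error on attempt {attempt + 1}/{n_attempts}")
            logger.error(traceback.format_exc())
            await asyncio.sleep(2.0**attempt)  # 1s after the first failure, 2s after the second
```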
## Test Plan
### Manual Testing
- Downloads that hit transient network errors now retry instead of
failing immediately
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `uv run pytest src/exo/download/tests/ -v` — 11 tests pass
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
A lot of our cleanup logic wasn't running, leading to bad shutdown states.
## changes
- added `try`/`except` blocks around most task groups
- made the runner shutdown code synchronous
- abandon the MpReceiver's recv_async thread on cancellation
- this only occurs during runner shutdown; the queue closing from the
other end should terminate the mp.Queue, cleaning up the thread in its
own time. I could try other methods if this is not sufficient.
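Roughly the pattern now used around task groups (illustrative asyncio sketch; the real code wraps exo's own task groups and shutdown hooks):
```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def run_with_cleanup(main, cleanup) -> None:
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(main())
    except* Exception as eg:
        # Log rather than let a task failure skip the cleanup path.
        logger.exception("task group failed: %r", eg.exceptions)
    finally:
        cleanup()  # runner shutdown is now synchronous, so it always runs
```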
## outcome
ctrl-c just works now! minus the tokio panic of course :) no more
hypercorn lifespan errors though!
## Motivation
Users browsing models in the picker need to know which models are
already downloaded and ready to run on their cluster, without having to
check the downloads page separately.
## Changes
- **ModelPickerModal.svelte**: Computes per-model download availability
by checking which nodes have `DownloadCompleted` entries and summing
their total RAM against the model's storage size. Passes availability
data to `ModelPickerGroup`. Enhances the info modal with a "Downloaded
on:" section showing node friendly names with green badges.
- **ModelPickerGroup.svelte**: Accepts new `downloadStatus` prop. Shows
a green checkmark-in-circle icon next to models that are downloaded on
sufficient nodes. Tooltip shows which nodes have the model.
- **+page.svelte**: Passes `downloadsData` and `topologyNodes` to
`ModelPickerModal`.
## Why It Works
The download state from `/state` already tracks per-node completed
downloads. The shared `getNodesWithModelDownloaded()` utility (from PR
#1375) finds nodes with `DownloadCompleted` entries for each model.
Total RAM is summed from the topology node data (using `ram_total`, not
`ram_available`) and compared to the model's `storage_size_megabytes` to
determine if there's enough aggregate memory. This is intentionally a
simple heuristic — not a full placement preview.
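The RAM heuristic, written out for clarity (the real implementation lives in the Svelte component; names here are illustrative):
```python
def model_ready(nodes_with_download: list[str], ram_total_mb: dict[str, int], storage_size_mb: int) -> bool:
    # Downloaded somewhere, and the downloading nodes' combined RAM covers the model size.
    total_ram = sum(ram_total_mb[node] for node in nodes_with_download)
    return bool(nodes_with_download) and total_ram >= storage_size_mb
```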
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
- Open the model picker modal
- Verify downloaded models show a green checkmark icon
- Verify the checkmark appears dimmer for models downloaded on nodes
with insufficient total RAM
- Click the (i) info button on a downloaded model
- Verify "Downloaded on:" section appears with correct node names
- Verify models with no downloads show no indicator
### Automated Testing
- Dashboard builds successfully (`npm run build`)
- No new Python changes requiring type checking
> **Note:** This is a chained PR. Base branch is
`alexcheema/topology-download-indicators` (#1375).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Qwen3-Coder-Next just dropped on mlx-community in several quantizations.
It's an 80B MoE model (Qwen3NextForCausalLM) which we already have
tensor parallelism support for via QwenShardingStrategy — just needs
model cards.
## Changes
Added model cards for all 5 available quantizations:
- `mlx-community/Qwen3-Coder-Next-4bit` (~46GB)
- `mlx-community/Qwen3-Coder-Next-5bit` (~58GB)
- `mlx-community/Qwen3-Coder-Next-6bit` (~69GB)
- `mlx-community/Qwen3-Coder-Next-8bit` (~89GB)
- `mlx-community/Qwen3-Coder-Next-bf16` (~158GB)
All with `supports_tensor = true` since the architecture is already
supported.
## Why It Works
`Qwen3NextForCausalLM` is already handled by QwenShardingStrategy in
auto_parallel.py and is in the supports_tensor allowlist in
model_cards.py. No code changes needed — just the TOML card files.
## Test Plan
### Manual Testing
n/a - model card addition only
### Automated Testing
- `basedpyright` — 0 errors
- `ruff check` — passes
- `nix fmt` — no changes
- `pytest` — 173 passed, 1 skipped
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
The macOS app required user interaction via AppleScript prompts to
install or uninstall network configuration components, making automated
deployments difficult.
Added --install and --uninstall command line flags that execute the
network setup scripts directly when running as root, bypassing GUI
prompts. Created a new main.swift entry point that parses CLI arguments
and delegates to NetworkSetupHelper's new direct execution methods.
This enables headless installation via `sudo EXO --install` for
automated deployment scenarios while preserving the existing GUI
behavior when launched normally.
Test plan:
- Deployed to a machine that didn't have the content installed. Got
blocked on the popup and EXO never launched.
- Relaunched EXO and confirmed it still didn't start because of the popup.
- Ran `sudo /Applications/EXO.app/Contents/MacOS/EXO --install`
- Launched EXO - the API started as expected.
- Ran `sudo /Applications/EXO.app/Contents/MacOS/EXO --uninstall`
- Launched EXO - got the popup.
## Motivation
Adds uncertainty visualization to the chat interface, allowing users to
see token-level confidence scores and regenerate responses from any
point in the generation. This enables users to:
- Understand model confidence at each token
- Explore alternative completions by regenerating from uncertain tokens
- Debug and analyze model behavior
## Changes
### Uncertainty Visualization
- Add `TokenHeatmap` component showing token-level probability coloring
- Toggle uncertainty view per message with bar chart icon
- Display tooltip with probability, logprob, and top alternative tokens
on hover
### Regenerate from Token
- Add "Regenerate from here" button in token tooltip
- Use `continue_final_message` in chat template to continue within same
turn (no EOS tokens)
- Add `continue_from_prefix` flag to `ChatCompletionTaskParams`
### Request Cancellation
- Add `AbortController` to cancel in-flight requests when regenerating
mid-generation
- Handle `BrokenResourceError` server-side when client disconnects
gracefully
### Additional APIs
- Add Claude Messages API support (`/v1/messages`)
- Add OpenAI Responses API support (`/v1/responses`)
## Why It Works
- **Proper continuation**: Using `continue_final_message=True` instead
of `add_generation_prompt=True` keeps the assistant turn open, allowing
the model to continue naturally from the prefix without end-of-turn
markers
- **Clean cancellation**: AbortController aborts the HTTP request, and
server catches `BrokenResourceError` to avoid crashes
- **Stable hover during generation**: TokenHeatmap tracks hover by index
(stable across re-renders) with longer hide delay during generation
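A sketch of the continuation call described above (HuggingFace-style chat template API; the exo integration may differ in detail):
```python
from transformers import AutoTokenizer

# Any tokenizer with a chat template works; this model id is just an example.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "user", "content": "Explain RMSNorm."},
    {"role": "assistant", "content": "RMSNorm normalizes activations by"},  # prefix to continue from
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    continue_final_message=True,   # keep the assistant turn open: no end-of-turn marker
    add_generation_prompt=False,   # do not start a fresh assistant turn
)
```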
## Test Plan
### Manual Testing
Hardware: MacBook Pro M1
- Send a message and verify logprobs are collected
- Enable uncertainty view and verify token coloring based on probability
- Hover over tokens to see tooltip with alternatives
- Click "Regenerate from here" on a token mid-response
- Verify the response continues naturally from that point
- Verify aborting mid-generation and regenerating works without server
crash
### Automated Testing
- Added tests for Claude Messages API adapter
- Added tests for OpenAI Responses API adapter
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
Duplicate tasks are still observed.
## Changes
Moved task acknowledgement to after the runner has changed its status.
## Why It Works
Tasks now remain pending until the runner has updated its status.
## Test Plan
### Manual Testing
Seems to work fine from manual testing. Hard to test a race condition
though.
### Automated Testing
Updated the event ordering test.
## Motivation
Enable parallel classifier-free guidance (CFG) for Qwen image models.
CFG requires two forward passes (positive/negative prompts) - this
allows them to run on separate nodes simultaneously, reducing latency.
## Changes
- Added uses_cfg flag to ModelCard to identify CFG-based models
- Extended PipelineShardMetadata with CFG topology fields (cfg_rank,
cfg_world_size, peer device info)
- Updated placement to create two CFG groups with reversed ordering
(places CFG peers as ring neighbors)
- Refactored DiffusionRunner to process CFG branches separately with
exchange at last pipeline stage
- Added get_cfg_branch_data() to PromptData for single-branch embeddings
- Fixed seed handling in API for distributed consistency
- Fixed image yield to only emit from CFG rank 0 at last stage
- Increased num_sync_steps_factor from 0.125 to 0.25 for Qwen
## Why It Works
- 2 nodes + CFG: Both run all layers, process different CFG branches in
parallel
- 4+ even nodes + CFG: Hybrid - 2 CFG groups × N/2 pipeline stages
- Odd nodes or non-CFG: Falls back to pure pipeline parallelism
Ring topology places CFG peers as neighbors to enable direct exchange.
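A toy sketch of the reversed grouping (not the exo placement code): splitting an even ring into two CFG groups and reversing the second makes each stage's CFG peer a ring neighbour.
```python
def cfg_groups(nodes: list[str]) -> tuple[list[str], list[str]]:
    assert len(nodes) % 2 == 0, "CFG parallelism needs an even node count"
    half = len(nodes) // 2
    return nodes[:half], list(reversed(nodes[half:]))

# Ring n0-n1-n2-n3-n0: stage 0 pairs n0 with n3, stage 1 pairs n1 with n2;
# both pairs are adjacent on the ring, so the CFG exchange is a neighbour send/recv.
print(cfg_groups(["n0", "n1", "n2", "n3"]))  # (['n0', 'n1'], ['n3', 'n2'])
```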
## Test Plan
### Manual Testing
Verified the performance gain for Qwen-Image on 2-node and 4-node
clusters. Non-CFG models still work.
### Automated Testing
Added tests in test_placement_utils.py covering 2-node CFG parallel,
4-node hybrid, odd-node fallback, and non-CFG pipeline modes.
## Motivation
Reimplements the model picker modal from #1191 on top of the custom
model support branch. Replaces the inline model dropdown with a
full-featured modal that groups models by base model, supports
filtering, favorites, and HuggingFace Hub search.
## Changes
**Backend:**
- Add `family`, `quantization`, `base_model`, `capabilities` metadata
fields to `ModelCard` and all 40 TOML model cards
- Pass new fields through `ModelListModel` and `get_models()` API
response
- Add `GET /models/search` endpoint using
`huggingface_hub.list_models()`
**Dashboard (7 new files):**
- `ModelPickerModal.svelte` — Main modal with search, family filtering,
HuggingFace Hub tab
- `ModelPickerGroup.svelte` — Expandable model group row with
quantization variants
- `FamilySidebar.svelte` — Vertical sidebar with family icons (All,
Favorites, Hub, model families)
- `FamilyLogos.svelte` — SVG icons for each model family
- `ModelFilterPopover.svelte` — Capability and size range filters
- `HuggingFaceResultItem.svelte` — HF search result item with
download/like counts
- `favorites.svelte.ts` — localStorage-backed favorites store
**Integration:**
- Replace inline dropdown in `+page.svelte` with button that opens
`ModelPickerModal`
- Custom models shown in Hub tab with delete support
**Polish:**
- Real brand logos (Meta, Qwen, DeepSeek, OpenAI, GLM, MiniMax, Kimi,
HuggingFace) from Simple Icons / LobeHub
- Clean SVG stroke icons for capabilities (thinking, code, vision, image
gen)
- Consistent `border-exo-yellow/10` borders, descriptive tooltips
throughout
- Cluster memory (used/total) shown in modal header
- Selected model highlight with checkmark for both single and
multi-variant groups
- Cursor pointer on all interactive elements, fix filter popover
click-outside bug
- Custom models now appear in All tab alongside built-in models
## Bug Fix: Gemma 3 EOS tokens
Also included in this branch: fix for Gemma 3 models generating infinite
`<end_of_turn>` tokens. The tokenizer's `eos_token_ids` was missing
token ID 106 (`<end_of_turn>`), so generation never stopped. The fix
appends this token to the EOS list after loading the tokenizer. Also
handles `eos_token_ids` being a `set` (not just a `list`).
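Roughly what the fix does (sketch; attribute names assumed rather than copied from the exo tokenizer wrapper):
```python
END_OF_TURN_ID = 106  # Gemma 3's <end_of_turn> token

def ensure_end_of_turn_eos(tokenizer) -> None:
    eos = tokenizer.eos_token_ids
    if isinstance(eos, set):
        tokenizer.eos_token_ids = eos | {END_OF_TURN_ID}
    elif END_OF_TURN_ID not in eos:
        tokenizer.eos_token_ids = [*eos, END_OF_TURN_ID]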
## Why It Works
Model metadata (family, capabilities, etc.) is stored directly in TOML
cards rather than derived from heuristics, ensuring accuracy. The modal
groups models by `base_model` field so quantization variants appear
together. Custom models are separated into the Hub tab since they lack
grouping metadata.
## Test Plan
### Manual Testing
- Open dashboard, click model selector to open modal
- Browse models by family sidebar, search, and filters
- Expand model groups to see quantization variants
- Star favorites and verify persistence across page reloads
- Navigate to Hub tab, search and add models
- Verify error messages shown for invalid model IDs
- Run a Gemma 3 model and verify generation stops at `<end_of_turn>`
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — clean
- `uv run pytest src/` — 173 passed
- `cd dashboard && npm run build` — builds successfully
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Users should be able to run any HuggingFace model, not just the ones we
ship TOML cards for. Continues the aim of #1191 with a minimal
implementation on top of the current TOML model card system.
Custom cards are saved to `~/.exo/custom_model_cards/` rather than the
bundled `resources/inference_model_cards/` because `RESOURCES_DIR` is
read-only in PyInstaller bundles (`sys._MEIPASS`). This also fixes
`fetch_from_hf` which was saving cards to the wrong path (`resources/`
root instead of `resources/inference_model_cards/`).
## Changes
- Add `EXO_CUSTOM_MODEL_CARDS_DIR` constant
(`~/.exo/custom_model_cards/`)
- Update `model_cards.py`: add custom dir to search path, fix
`save_to_custom_dir`, add `delete_custom_card`/`is_custom_card`
- Add `POST /models/add` and `DELETE /models/custom/{model_id}` API
endpoints
- Add `is_custom` field to `ModelListModel` API response
- Dashboard: add custom model input form in dropdown, delete button for
custom models, show actual API errors, auto-select newly added model
## Why It Works
Two separate directories for model cards: the bundled read-only
`resources/inference_model_cards/` for built-in cards, and user-writable
`~/.exo/custom_model_cards/` for custom cards. Both are scanned when
listing models. This works in all environments including PyInstaller
bundles where `RESOURCES_DIR` points to `sys._MEIPASS`.
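A sketch of the two-directory lookup (paths from the description above; function names are illustrative):
```python
from pathlib import Path

EXO_CUSTOM_MODEL_CARDS_DIR = Path.home() / ".exo" / "custom_model_cards"

def list_model_card_paths(resources_dir: Path) -> list[Path]:
    # Built-in cards ship read-only with the bundle; custom cards live in a
    # user-writable directory. Both are scanned when listing models.
    EXO_CUSTOM_MODEL_CARDS_DIR.mkdir(parents=True, exist_ok=True)
    bundled = sorted(resources_dir.glob("*.toml"))
    custom = sorted(EXO_CUSTOM_MODEL_CARDS_DIR.glob("*.toml"))
    return bundled + custom

def is_custom_card(card_path: Path) -> bool:
    return EXO_CUSTOM_MODEL_CARDS_DIR in card_path.parents
```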
## Test Plan
### Manual Testing
- Add a custom model via the dropdown (e.g.
`mlx-community/Llama-3.2-1B-Instruct-4bit`)
- Verify it appears in the model list with the delete (x) button
- Delete it and verify it disappears
- Try adding an invalid model ID and verify the actual error is shown
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `uv run pytest src/` — passes
- `cd dashboard && npm run build` — builds
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Offline users currently have to wait for every retry to fail before
being able to launch a model.
For users that restart clusters often or share API keys between devices,
we also spam HuggingFace with downloads every 5 minutes.
These issues are caused by `_emit_existing_download_progress` being
inefficient.
## Changes
- Only query HuggingFace once while EXO is running (assumption being
that a change should only be reflected on a new EXO session)
- Only query HuggingFace when there is an internet connection (polling
connectivity every 10 seconds)
- Request download progress if we switch from no connectivity ->
connected to reduce the wait.
- Reduce download progress sleep as it's no longer expensive (queries
cache most of the time).
- Reduce retries as 30 is way too many.
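A rough sketch of the caching and polling behaviour described above (illustrative names, asyncio; not the exo implementation):
```python
import asyncio

class DownloadProgressPoller:
    def __init__(self, check_connectivity, query_huggingface):
        self._check_connectivity = check_connectivity
        self._query_huggingface = query_huggingface
        self._hf_result = None  # queried at most once per EXO session

    async def hf_listing(self):
        if self._hf_result is None and await self._check_connectivity():
            self._hf_result = await self._query_huggingface()
        return self._hf_result

    async def run(self, on_reconnect, interval: float = 10.0) -> None:
        online = False
        while True:
            now_online = await self._check_connectivity()
            if now_online and not online:
                await on_reconnect()  # refresh download progress as soon as we come back online
            online = now_online
            await asyncio.sleep(interval)
```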
## Test Plan
### Manual Testing
Manually tested the behaviour.
### Automated Testing
None, should I add any? We do have some tests for this folder, but they
are probably not too helpful.
## Motivation
With the addition of the Responses API, we introduced `str |
list[InputMessage]` as the type for `TextGenerationTaskParams.input`
since the Responses API supports sending input as a plain string. But
there was no reason to leak that flexibility past the API adapter
boundary — it just meant every downstream consumer had to do `if
isinstance(messages, str):` checks, adding complexity for no benefit.
## Changes
- Changed `TextGenerationTaskParams.input` from `str |
list[InputMessage]` to `list[InputMessage]`
- Each API adapter (Chat Completions, Claude Messages, Responses) now
normalizes to `list[InputMessage]` at the boundary
- Removed `isinstance(task_params.input, str)` branches in
`utils_mlx.py` and `runner.py`
- Wrapped string inputs in `[InputMessage(role="user", content=...)]` in
the warmup path and all test files
## Why It Works
The API adapters are the only place where we deal with raw user input
formats. By normalizing there, all downstream code (worker, runner, MLX
engine) can just assume `list[InputMessage]` and skip the type-checking
branches. The type system (`basedpyright`) catches any missed call sites
at compile time.
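The boundary normalization in miniature (InputMessage shape assumed for illustration):
```python
from dataclasses import dataclass

@dataclass
class InputMessage:
    role: str
    content: str

def normalize_input(raw: str | list[InputMessage]) -> list[InputMessage]:
    # Adapters accept a plain string (Responses API) or a message list, but
    # everything downstream only ever sees list[InputMessage].
    if isinstance(raw, str):
        return [InputMessage(role="user", content=raw)]
    return raw
```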
## Test Plan
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — applied
- `uv run pytest` — 174 passed, 1 skipped
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Our distributed test now does a full query cycle for every model loaded
onto the relevant machine. This will help find bugs early, as it already
has found one with Qwen3 Next! I didn't write down what the error was
though. Gooooooood luck with that!
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
Tool-call requests can hang indefinitely when `max_tokens` truncates
generation mid-tool-call.
## Reproduction
1. Send a chat completion with `tools` and a low `max_tokens` (e.g. 65)
to Qwen3-0.6B
2. Model generates `<think>...</think>` then starts `<tool_call>` but
`max_tokens` cuts it off before `</tool_call>`
3. **Before this fix:** `parse_tool_calls` buffers tokens after
`<tool_call>`, generator exhausts, buffered tokens (including
`finish_reason`) are silently dropped → stream hangs forever
4. **After this fix:** buffered tokens are flushed as regular text with
`finish_reason` propagated → response returns normally with
`finish_reason: "length"`
Confirmed with fresh local testing: 4 unclosed tool call flushes
triggered in a single session. Also confirmed via production logs from
Jan 29 (2 occurrences).
## Changes
1. **`parse_tool_calls` unclosed tool call flush** — when the generator
exhausts inside an open `<tool_call>` block, flush buffered tokens as
regular text and propagate `finish_reason`
2. **GLM regex fix** — match literal `\n` (not escaped `\\n`) between
arg tags; handle missing `</arg_value>` via lookahead
3. **7 new unit tests** for `parse_tool_calls` covering unclosed,
closed, passthrough, and failed-parse scenarios
## Why It Works
- `parse_tool_calls` now has a post-loop check: if `in_tool_call` is
still true, it yields the buffered text with the tracked `finish_reason`
instead of silently dropping it
- The GLM regex now matches real-world output where newlines appear
between tags and `</arg_value>` may be absent
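A reduced sketch of the flush behaviour (the real parser also tracks `finish_reason` and parses the tool call body):
```python
from collections.abc import Iterable, Iterator

def parse_tool_calls(chunks: Iterable[str]) -> Iterator[str]:
    in_tool_call = False
    buffer: list[str] = []
    for chunk in chunks:
        if not in_tool_call and "<tool_call>" in chunk:
            in_tool_call, buffer = True, [chunk]
        elif in_tool_call:
            buffer.append(chunk)
            if "</tool_call>" in chunk:
                yield "".join(buffer)  # complete tool call; the real code parses it here
                in_tool_call, buffer = False, []
        else:
            yield chunk
    if in_tool_call:
        # Generator exhausted inside an unclosed tool call: flush the buffer as
        # plain text instead of silently dropping it (and its finish_reason).
        yield "".join(buffer)
```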
## Test Plan
### Manual Testing
- Qwen3-0.6B-4bit with `tools` + various `max_tokens` values (61-75)
- Confirmed responses return with `finish_reason: "length"` instead of
hanging
- Log output shows `"generator exhausted inside unclosed tool call,
flushing buffered text"`
### Automated Testing
- 7 new tests in `test_parse_tool_calls.py`
- Full test suite passes (`uv run pytest`)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
We didn't check that ~/.config/exo existed before, which raised a critical exception.
Now we create ~/.config/exo on Linux systems before touching config.toml.
This wasn't caught before since everything lives in ~/.exo on macOS, and we no longer write the keypair to CONFIG_HOME, so config.toml has to do init work it avoided before.
## Motivation
Know what the hell is going on
## Changes
- Tracing library (src/exo/shared/tracing.py): trace() context manager,
Chrome Trace Format export, statistics computation
- Runner instrumentation
(src/exo/worker/engines/image/pipeline/runner.py): Wrapped sync/async
steps, compute blocks, and send/recv operations
- Trace collection: Workers send traces to master after task completion;
merged into ~/.exo/traces/trace_{task_id}.json
- API endpoints: List, fetch, stats, and raw download at /v1/traces/*
- Dashboard: Trace list and detail pages with Perfetto integration
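The core idea of the trace() helper, sketched (the real module also handles async steps, statistics, and per-worker merging):
```python
import json
import threading
import time
from contextlib import contextmanager

_events: list[dict] = []

@contextmanager
def trace(name: str, **args):
    start_us = time.perf_counter_ns() // 1_000
    try:
        yield
    finally:
        _events.append({
            "name": name, "ph": "X",  # Chrome Trace Format "complete" event
            "ts": start_us, "dur": time.perf_counter_ns() // 1_000 - start_us,
            "pid": 0, "tid": threading.get_ident(), "args": args,
        })

def export_trace(path: str) -> None:
    with open(path, "w") as f:
        json.dump({"traceEvents": _events}, f)  # loadable in Perfetto / chrome://tracing
```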
## Why It Works
<img width="1236" height="767" alt="Screenshot 2026-01-30 at 19 00 09"
src="https://github.com/user-attachments/assets/73e6e46d-ba10-4e83-ba99-ff1c3f62ab05"
/>
<img width="1659" height="89" alt="Screenshot 2026-01-30 at 19 00 58"
src="https://github.com/user-attachments/assets/c0fd0e65-e4fc-4fd5-920d-b43b2887d591"
/>
## motivation
Peers eagerly discovered through gossipsub were added to state. This left things looking broken due to one-sided connections.
## changes
The worker no longer writes topology edges from these gossipsub messages.
We now rely strictly on HTTP-discovered topology, which tends to better reflect the actual state of the system's connectivity.
## Motivation
`hosts_*.json` files are local host configuration snapshots that
shouldn't be tracked in version control.
## Changes
Added `hosts_*.json` pattern to `.gitignore`.
## Why It Works
The glob pattern `hosts_*.json` matches any file starting with `hosts_`
and ending with `.json` anywhere in the repository.
## Test Plan
### Manual Testing
- Verified that `hosts_*.json` files are ignored by git after this
change.
### Automated Testing
- No automated tests needed for a `.gitignore` change.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Add support for Claude Messages API and OpenAI Responses API to allow
users to interact with exo using these popular API formats. This enables
broader compatibility with existing tooling and SDKs that expect these
API formats.
## Architecture
Adapter logic lives exclusively in the API layer
(`src/exo/master/adapters/`). On the way in, each adapter converts its
API-specific request type (`ChatCompletionRequest`,
`ClaudeMessagesRequest`, `ResponsesRequest`) into
`TextGenerationTaskParams`. On the way out, each adapter converts the
`TokenChunk` stream back into its API-specific response format.
Everything inside the application — commands, worker, runner, event
sourcing — only sees `TextGenerationTaskParams` and `TokenChunk`. No
API-specific types cross the boundary.
```
API layer │ Application internals
│
Chat Completions → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ChatCompletionResponse
Claude Messages → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ClaudeMessagesResponse
Responses API → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ResponsesResponse
```
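In code, each inbound adapter is just a pure conversion function; sketched below with trimmed-down placeholder types (the real request and params classes are pydantic models with more fields):
```python
from dataclasses import dataclass, field

@dataclass
class InputMessage:
    role: str
    content: str

@dataclass
class TextGenerationTaskParams:
    model: str
    input: list[InputMessage]
    max_output_tokens: int | None = None

@dataclass
class ClaudeMessagesRequest:  # stand-in for the real request type
    model: str
    max_tokens: int
    messages: list[InputMessage] = field(default_factory=list)

def claude_to_task_params(req: ClaudeMessagesRequest) -> TextGenerationTaskParams:
    # Inbound direction: API-specific request -> internal params. The outbound
    # adapter does the reverse, turning the TokenChunk stream into Claude events.
    return TextGenerationTaskParams(
        model=req.model,
        input=list(req.messages),
        max_output_tokens=req.max_tokens,
    )
```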
## Changes
### New Files
- `src/exo/shared/types/claude_api.py` - Pydantic types for Claude
Messages API
- `src/exo/shared/types/openai_responses.py` - Pydantic types for OpenAI
Responses API
- `src/exo/shared/types/text_generation.py` - Shared
`TextGenerationTaskParams` internal type
- `src/exo/master/adapters/chat_completions.py` - Chat Completions
adapter (streaming/non-streaming)
- `src/exo/master/adapters/claude.py` - Claude Messages adapter
(streaming/non-streaming)
- `src/exo/master/adapters/responses.py` - OpenAI Responses adapter
(streaming/non-streaming)
### Modified Files
- `src/exo/master/api.py` - Refactored to use adapters uniformly for all
endpoints; extracted `_resolve_and_validate_text_model` helper to
deduplicate model validation across all text endpoints; removed ad-hoc
`try/except ValueError` blocks from non-streaming paths
### New Endpoints
- `POST /v1/messages` - Claude Messages API (streaming and
non-streaming)
- `POST /v1/responses` - OpenAI Responses API (streaming and
non-streaming)
## Why It Works
All APIs are implemented as pure conversion adapters at the edge of the
application:
1. Adapter functions in `src/exo/master/adapters/` convert incoming
requests to `TextGenerationTaskParams`
2. `api.py` wraps the params in a `TextGeneration` command and sends it
through the existing command/event flow
3. The worker, runner, and event sourcing layers only handle
`TextGenerationTaskParams` and `TokenChunk` — they have no awareness of
Chat Completions, Claude, or Responses API formats
4. On response, adapter functions convert the `TokenChunk` stream back
to the caller's expected format
5. Model validation is handled by a single shared helper
(`_resolve_and_validate_text_model`), mirroring the existing
`_validate_image_model` pattern for image endpoints
No changes to core inference logic were needed.
### Streaming Formats
- **Chat Completions**: Uses `data: {...}\n\n` with `[DONE]` terminator
- **Claude**: Uses event types `message_start`, `content_block_start`,
`content_block_delta`, `content_block_stop`, `message_delta`,
`message_stop`
- **OpenAI Responses**: Uses event types `response.created`,
`response.in_progress`, `response.output_item.added`,
`response.content_part.added`, `response.output_text.delta`,
`response.output_text.done`, `response.content_part.done`,
`response.output_item.done`, `response.completed`
## Test Plan
### Manual Testing
Hardware: MacBook Pro M3 Max
**Non-streaming tests:**
```bash
# Chat Completions API
curl -X POST http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'
# Claude Messages API
curl -X POST http://localhost:52415/v1/messages \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}]}'
# OpenAI Responses API
curl -X POST http://localhost:52415/v1/responses \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "input": "Hello", "max_output_tokens": 20}'
```
**Streaming tests:**
```bash
# Chat Completions API (streaming)
curl -N -X POST http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "stream": true, "max_tokens": 20}'
# Claude Messages API (streaming)
curl -N -X POST http://localhost:52415/v1/messages \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
# OpenAI Responses API (streaming)
curl -N -X POST http://localhost:52415/v1/responses \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "input": "Hello", "stream": true, "max_output_tokens": 20}'
```
All endpoints tested successfully with proper response formats and
streaming events.
### Automated Testing
- Tests in `src/exo/master/tests/` all pass (85 tests)
- Type checker (basedpyright) passes with 0 errors
- Linter (ruff) passes
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
Make exo bench faster for longer prompts, lengthen default timeouts, and
use (pp, tg) pairs rather than all combinations.
## Changes
- Uses binary search to find the correct prompt (see the sketch below)
- Adds a flag to force all combinations if that is desired
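Presumably the search is along these lines (hedged sketch: binary search over a repeated filler text until the tokenised length reaches the pp target; the actual bench code may differ):
```python
def prompt_with_token_count(count_tokens, target: int, filler: str = "hello ") -> str:
    # count_tokens(text) -> number of tokens; assumed monotone in text length.
    lo, hi = 1, max(target * 8, 8)
    while lo < hi:
        mid = (lo + hi) // 2
        if count_tokens(filler * mid) < target:
            lo = mid + 1
        else:
            hi = mid
    return filler * lo  # shortest repetition reaching at least `target` tokens
```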
`nix run .#exo` couldn't find `macmon` because the Nix wrapper scripts
didn't include it in PATH, causing `shutil.which("macmon")` to fail.
Added `--prefix PATH : ${pkgs.macmon}/bin` to the `makeWrapper` call,
conditional on Darwin via `lib.optionalString`, so macmon's binary is
available at runtime without modifying the user's system PATH.
Test plan:
- Verified `nix build .#exo` succeeds
- Checked wrapper script contains macmon store path in PATH prefix
## Motivation
If nodes have uneven memory, one node may evict cache that remains on
another node. This will break prefill on some setups.
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
We currently have no strict requirement that node ids persist across
sessions, so we can generate fresh node ids each time.
This avoids issues like #1332, but prevents further features such as
caching downloads or node-id dialling.
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Some attempts to load model cards (i.e. build_base_shard) always went
through networking rather than using downloaded model cards. We should
always default to ModelCard.load in these scenarios.
## Motivation
(Probably) the final missing piece of the Chat Completions API
## Changes
Add UsageStats
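For reference, the Chat Completions usage object carries three counters; a minimal sketch of such a type (the exo class may differ in detail):
```python
from pydantic import BaseModel

class UsageStats(BaseModel):
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

    @classmethod
    def from_counts(cls, prompt_tokens: int, completion_tokens: int) -> "UsageStats":
        return cls(
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
        )
```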
## Why It Works
OpenCode reviewed my PR and gave me stats:
<img width="1150" height="802" alt="image"
src="https://github.com/user-attachments/assets/ebc06bae-797f-4087-87d5-2f26cf60fc48"
/>
## Test Plan
### Automated Testing
No tests were broken.
## Motivation
A lot of changes happened without much attention to the state of exo
bench.
## Changes
Use TaggedModel for BenchChatCompletion so it serialises properly.
Don't break after a GPT-OSS tool call, to preserve parity with the rest of
the codebase.
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<img width="2856" height="678" alt="image"
src="https://github.com/user-attachments/assets/2e18cf0d-c0f8-467c-9763-1a6a59c8a327"
/>
Also tested GPT OSS tool calling in OpenCode
## Motivation
The dashboard UI attempted to parse all image generation responses as
SSE streams, even when streaming was disabled. This broke non-streaming
image generation.
## Changes
- Parse JSON responses directly when not streaming; use the SSE parser only
when stream=true AND partialImages > 0
- Explicitly disable partial images when not streaming
## Why It Works
Both API and dashboard now use the same condition (stream &&
partialImages > 0) to determine response format, ensuring correct
parsing.
## Test Plan
### Manual Testing
Non-streamed image generation results appear in the UI. Streamed image
generation still works.
## Motivation
Warnings that go unchecked tend to accumulate and hide real issues.
Treating them as errors ensures they are addressed immediately, both
locally during development and in CI.
## Changes
Added `SWIFT_TREAT_WARNINGS_AS_ERRORS = YES` and
`GCC_TREAT_WARNINGS_AS_ERRORS = YES` to the **project-level** Debug and
Release build configurations in `project.pbxproj`. This applies to all
targets (EXO, EXOTests, EXOUITests).
## Why It Works
Xcode's `SWIFT_TREAT_WARNINGS_AS_ERRORS` and
`GCC_TREAT_WARNINGS_AS_ERRORS` build settings promote Swift and C/ObjC
warnings to errors at compile time. Setting them at the project level
means all targets inherit the policy without needing per-target or
CI-level overrides.
## Test Plan
### Manual Testing
- Built the EXO scheme in Release configuration with `xcodebuild` — no
warning-as-error failures from Swift or C/ObjC sources.
### Automated Testing
- CI already builds with `-configuration Release`, so it will
automatically enforce warnings-as-errors via the inherited project
settings — no CI changes needed.
Add uv2nix to build Python packages from uv.lock. This creates a fully
Nix-managed Python environment with the Rust bindings injected via overlay.
Changes:
- Add pyproject-nix, uv2nix, and pyproject-build-systems flake inputs
- Create python/parts.nix with overlays to inject Nix-built Rust wheel
- Export packages.exo on macOS (wraps exo/exo-master/exo-worker with dashboard)
- Add checks.lint (ruff, all platforms) and checks.pytest (macOS only)
- Simplify CI typecheck job using nicknovitski/nix-develop action
- Delete .github/actions/typecheck composite action (no longer needed)
- Add no-build-package for MLX packages in pyproject.toml (use wheels)
The Python build is currently macOS-only since MLX requires Metal. Linux
support will be added once the pyproject dependencies are simplified.
Test plan:
- Run `nix flake check` on macOS to verify pytest and lint pass
- Build exo package on macOS: `nix build .#exo`
- Verify CI pipeline passes with simplified typecheck job