Runs exo_bench remotely, with some nice git QoL.
## usage
Run `tests/auto_bench.sh host1 [host2]`.
exo_bench will be run on those hosts against all models currently
downloaded, and its output will be saved to `bench/commit_hash/*.json`.
A lot of our cleanup logic wasn't running, leading to bad shutdown states.
## changes
- added `try`/`except` blocks around most task groups
- made the runner shutdown code synchronous
- abandon the MpReceiver's recv_async thread on cancellation
  - this only occurs during runner shutdown; the queue closing from the
    other end should terminate the mp.Queue, cleaning up the thread in its
    own time. I could try other methods if this is not sufficient.
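As a rough illustration of the last point, here is a minimal sketch (not the actual `MpReceiver` code, and assuming an asyncio-based runner) of what abandoning the blocking receive thread on cancellation looks like:

```python
# Minimal sketch, not the actual MpReceiver: the blocking mp.Queue read runs
# in a helper thread; on cancellation we simply stop waiting for it. The
# thread is expected to unblock and exit once the sending side closes the
# queue.
import asyncio


class MpReceiver:
    def __init__(self, q) -> None:  # q: a multiprocessing.Queue
        self._q = q

    async def recv_async(self):
        try:
            # The blocking get() happens off the event loop.
            return await asyncio.to_thread(self._q.get)
        except asyncio.CancelledError:
            # Runner shutdown: abandon the helper thread instead of joining
            # it here; it cleans itself up in its own time.
            raise
```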
## outcome
Ctrl-C just works now! Minus the tokio panic, of course :) No more
hypercorn lifespan errors, though!
## Motivation
With the addition of the Responses API, we introduced `str |
list[InputMessage]` as the type for `TextGenerationTaskParams.input`
since the Responses API supports sending input as a plain string. But
there was no reason to leak that flexibility past the API adapter
boundary — it just meant every downstream consumer had to do `if
isinstance(messages, str):` checks, adding complexity for no benefit.
## Changes
- Changed `TextGenerationTaskParams.input` from `str |
list[InputMessage]` to `list[InputMessage]`
- Each API adapter (Chat Completions, Claude Messages, Responses) now
normalizes to `list[InputMessage]` at the boundary (see the sketch after
this list)
- Removed `isinstance(task_params.input, str)` branches in
`utils_mlx.py` and `runner.py`
- Wrapped string inputs in `[InputMessage(role="user", content=...)]` in
the warmup path and all test files
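For illustration, a minimal sketch of the normalization the adapters now perform; the field and class names follow this PR's description, but the concrete model definitions in the codebase may differ:

```python
# Sketch only; assumes pydantic models shaped like those named in this PR.
from pydantic import BaseModel


class InputMessage(BaseModel):
    role: str
    content: str


class TextGenerationTaskParams(BaseModel):
    input: list[InputMessage]  # previously `str | list[InputMessage]`


def normalize_input(raw: str | list[InputMessage]) -> list[InputMessage]:
    """Runs once, inside each API adapter, so downstream code never sees str."""
    if isinstance(raw, str):
        return [InputMessage(role="user", content=raw)]
    return raw
```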
## Why It Works
The API adapters are the only place where we deal with raw user input
formats. By normalizing there, all downstream code (worker, runner, MLX
engine) can just assume `list[InputMessage]` and skip the type-checking
branches. The type system (`basedpyright`) catches any missed call sites
at compile time.
## Test Plan
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — applied
- `uv run pytest` — 174 passed, 1 skipped
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Our distributed test now does a full query cycle for every model loaded
onto the relevant machine. This will help find bugs early, as it has
already found one with Qwen3 Next! I didn't write down what the error
was, though. Gooooooood luck with that!
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
Add support for Claude Messages API and OpenAI Responses API to allow
users to interact with exo using these popular API formats. This enables
broader compatibility with existing tooling and SDKs that expect these
API formats.
## Architecture
Adapter logic lives exclusively in the API layer
(`src/exo/master/adapters/`). On the way in, each adapter converts its
API-specific request type (`ChatCompletionRequest`,
`ClaudeMessagesRequest`, `ResponsesRequest`) into
`TextGenerationTaskParams`. On the way out, each adapter converts the
`TokenChunk` stream back into its API-specific response format.
Everything inside the application — commands, worker, runner, event
sourcing — only sees `TextGenerationTaskParams` and `TokenChunk`. No
API-specific types cross the boundary.
```
API layer │ Application internals
│
Chat Completions → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ChatCompletionResponse
Claude Messages → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ClaudeMessagesResponse
Responses API → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ResponsesResponse
```
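Concretely, each adapter is a pair of pure conversion functions. The sketch below uses stand-in types and hypothetical function names; only the type names (`TextGenerationTaskParams`, `TokenChunk`) and the `content_block_delta` event type come from this PR, while the delta payload shape follows the public Claude API.

```python
# Stand-in types for illustration; the real ones are the pydantic models
# listed under "New Files" below.
from collections.abc import AsyncIterator
from dataclasses import dataclass


@dataclass
class TextGenerationTaskParams:
    model: str
    input: list[dict[str, str]]


@dataclass
class TokenChunk:
    text: str


def claude_request_to_params(body: dict) -> TextGenerationTaskParams:
    """Inbound adapter: Claude Messages request -> internal task params."""
    return TextGenerationTaskParams(model=body["model"], input=body["messages"])


async def chunks_to_claude_deltas(chunks: AsyncIterator[TokenChunk]) -> AsyncIterator[dict]:
    """Outbound adapter: internal TokenChunk stream -> Claude-style deltas."""
    async for chunk in chunks:
        yield {
            "type": "content_block_delta",
            "index": 0,
            "delta": {"type": "text_delta", "text": chunk.text},
        }
```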
## Changes
### New Files
- `src/exo/shared/types/claude_api.py` - Pydantic types for Claude
Messages API
- `src/exo/shared/types/openai_responses.py` - Pydantic types for OpenAI
Responses API
- `src/exo/shared/types/text_generation.py` - Shared
`TextGenerationTaskParams` internal type
- `src/exo/master/adapters/chat_completions.py` - Chat Completions
adapter (streaming/non-streaming)
- `src/exo/master/adapters/claude.py` - Claude Messages adapter
(streaming/non-streaming)
- `src/exo/master/adapters/responses.py` - OpenAI Responses adapter
(streaming/non-streaming)
### Modified Files
- `src/exo/master/api.py` - Refactored to use adapters uniformly for all
endpoints; extracted `_resolve_and_validate_text_model` helper to
deduplicate model validation across all text endpoints; removed ad-hoc
`try/except ValueError` blocks from non-streaming paths
### New Endpoints
- `POST /v1/messages` - Claude Messages API (streaming and
non-streaming)
- `POST /v1/responses` - OpenAI Responses API (streaming and
non-streaming)
## Why It Works
All APIs are implemented as pure conversion adapters at the edge of the
application:
1. Adapter functions in `src/exo/master/adapters/` convert incoming
requests to `TextGenerationTaskParams`
2. `api.py` wraps the params in a `TextGeneration` command and sends it
through the existing command/event flow
3. The worker, runner, and event sourcing layers only handle
`TextGenerationTaskParams` and `TokenChunk` — they have no awareness of
Chat Completions, Claude, or Responses API formats
4. On response, adapter functions convert the `TokenChunk` stream back
to the caller's expected format
5. Model validation is handled by a single shared helper
(`_resolve_and_validate_text_model`), mirroring the existing
`_validate_image_model` pattern for image endpoints
No changes to core inference logic were needed.
### Streaming Formats
- **Chat Completions**: Uses `data: {...}\n\n` with `[DONE]` terminator
- **Claude**: Uses event types `message_start`, `content_block_start`,
`content_block_delta`, `content_block_stop`, `message_delta`,
`message_stop`
- **OpenAI Responses**: Uses event types `response.created`,
`response.in_progress`, `response.output_item.added`,
`response.content_part.added`, `response.output_text.delta`,
`response.output_text.done`, `response.content_part.done`,
`response.output_item.done`, `response.completed`
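For reference, the two wire framings reduce to roughly the following; the helper names are illustrative and payload contents are abbreviated:

```python
# Illustrative SSE framing helpers, not the adapters' actual code.
import json


def chat_completions_frames(payloads: list[dict]) -> str:
    """Chat Completions style: bare data frames, terminated by [DONE]."""
    frames = [f"data: {json.dumps(p)}\n\n" for p in payloads]
    frames.append("data: [DONE]\n\n")
    return "".join(frames)


def named_event_frame(event_type: str, payload: dict) -> str:
    """Claude style: a named event (e.g. content_block_delta) plus its data."""
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"
```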
## Test Plan
### Manual Testing
Hardware: MacBook Pro M3 Max
**Non-streaming tests:**
```bash
# Chat Completions API
curl -X POST http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'
# Claude Messages API
curl -X POST http://localhost:52415/v1/messages \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}]}'
# OpenAI Responses API
curl -X POST http://localhost:52415/v1/responses \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "input": "Hello", "max_output_tokens": 20}'
```
**Streaming tests:**
```bash
# Chat Completions API (streaming)
curl -N -X POST http://localhost:52415/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "stream": true, "max_tokens": 20}'
# Claude Messages API (streaming)
curl -N -X POST http://localhost:52415/v1/messages \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
# OpenAI Responses API (streaming)
curl -N -X POST http://localhost:52415/v1/responses \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.2-1b", "input": "Hello", "stream": true, "max_output_tokens": 20}'
```
All endpoints tested successfully with proper response formats and
streaming events.
### Automated Testing
- Tests in `src/exo/master/tests/` all pass (85 tests)
- Type checker (basedpyright) passes with 0 errors
- Linter (ruff) passes
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
The Worker previously owned the ShardDownloader directly via dependency
injection, which prevented --no-worker nodes from downloading and made
it impossible for multiple Workers to share a single downloader instance.
Moved download functionality to a new DownloadCoordinator component at
the Node level that communicates via the DOWNLOAD_COMMANDS pub/sub topic.
Workers now send StartDownload commands instead of calling the downloader
directly, and receive progress updates through the event-sourced state.
This decouples downloads from the Worker lifecycle and enables future
features like UI-triggered downloads to specific nodes and multi-worker
download sharing.
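A rough sketch of the new shape (the publish/subscribe helpers and method names below are illustrative; only DownloadCoordinator, StartDownload, and the DOWNLOAD_COMMANDS topic come from this change):

```python
# Rough sketch of the decoupling; fields and helpers are illustrative.
from dataclasses import dataclass


@dataclass
class StartDownload:
    model_id: str


class Worker:
    def __init__(self, publish) -> None:
        self._publish = publish  # publish(topic, message)

    def ensure_model(self, model_id: str) -> None:
        # Instead of calling the ShardDownloader directly, ask the node-level
        # coordinator over the DOWNLOAD_COMMANDS topic; progress comes back
        # through event-sourced state.
        self._publish("DOWNLOAD_COMMANDS", StartDownload(model_id=model_id))


class DownloadCoordinator:
    """Node-level owner of the single ShardDownloader instance."""

    def __init__(self, downloader, subscribe) -> None:
        self._downloader = downloader
        subscribe("DOWNLOAD_COMMANDS", self._on_command)

    def _on_command(self, cmd: StartDownload) -> None:
        self._downloader.download(cmd.model_id)
```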
Test plan:
- Mostly tested in the next PR that adds explicit downloads/deletions to
the dashboard.
- Started a model that isn't downloaded - it works.
## Motivation
We have a lot of unneeded data in the model card - let's just keep the
necessary stuff and add back more data when we need it.
## Test Plan
EXO still runs! (pipeline on 2)
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Motivation
Information gathering is tightly coupled to MacMon - we should start
generalizing our information sources so we can add more in the future.
## Changes
Added a new system to gather any information. Currently, it is attached
to the Worker - though this is mostly to keep the data processing logic
simple. It could be made independent quite easily.
I also refactored topology to include different kinds of connections,
since we can gather RDMA connections without having a pre-existing socket
connection, and made the relevant placement updates. We should no longer
need the network locations script in the app.
Other sources of information now include:
- static node information like "model" and "chip" (macOS, "Unknown"
fallback)
- device friendly name (macOS, falls back to device hostname)
- network interfaces + IPs (cross-platform)
- Thunderbolt interfaces (macOS)
- Thunderbolt connections (macOS)
- RAM usage (cross-platform)
- per-device configuration written to `EXO_HOME/config.toml`
## Limitations
Model and Chip are not cross-platform concepts.
We do not differentiate between unified and non-unified memory systems.
A lot of this data collection is based on simple timers. Watching the SC
store on macOS is the correct way to gather some of this information, but
it requires a detour into Rust on macOS.
## Why It Works
The InfoGatherer is a generic subsystem which returns a union of metric
datatypes. It writes them to an event, which is applied to state. It is
currently re-spawned with the worker so each cluster receives the
correct information.
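Illustratively, the "union of metric datatypes" can be pictured like this; the concrete class and field names here are assumptions, only the categories of gathered data come from this PR:

```python
# Illustrative only: metric classes and fields are assumptions.
from dataclasses import dataclass


@dataclass
class StaticNodeInfo:
    model: str          # "Unknown" fallback off macOS
    chip: str
    friendly_name: str  # falls back to the device hostname


@dataclass
class MemoryUsage:
    used_bytes: int
    total_bytes: int


@dataclass
class NetworkInterfaces:
    ips_by_interface: dict[str, list[str]]


# The InfoGatherer yields values of this union; each one is written to an
# event and applied to state.
GatheredInfo = StaticNodeInfo | MemoryUsage | NetworkInterfaces
```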
As for topology, macOS identifies TB ports with a UUID in
SPThunderboltDataType, and also stores remote UUIDs if it can find them.
These changes read that data with `system_profiler`, hopefully not so
often as to cause a notable performance impact (though this should be
tuned) but frequently enough for moderate responsiveness.
As we can identify TB connections between devices without needing IPs
attached to each interface, we can remove the network setup script
(almost) completely.
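For reference, the `system_profiler` read mentioned above boils down to something like the following sketch; the invocation is real, but the structure of the returned entries varies by macOS version, so no field names are assumed here:

```python
# Sketch of the system_profiler read; no SPThunderboltDataType field names
# are assumed, only the top-level key of the JSON output.
import json
import subprocess


def thunderbolt_entries() -> list[dict]:
    out = subprocess.run(
        ["system_profiler", "SPThunderboltDataType", "-json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out).get("SPThunderboltDataType", [])
```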
## Test Plan
### Manual Testing
Spawn RDMA instances without enabling DHCP on the RDMA interfaces.
### Automated Testing
Updated the current master and shared tests to cover the topology
refactor and new events.
---------
Co-authored-by: Sami Khan <smsak99@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Jake Hillion <jake@hillion.co.uk>
## Motivation
GPT-OSS did not previously support tensor sharding.
## Changes
Add GPT-OSS sharding support in `tensor_auto_parallel`.
Code is mostly @rltakashige's.
## Test Plan
### Manual Testing
Tested GPT-OSS. MLX Fast Sync causes issues with Tensor RDMA; this is a general problem at the moment.
## Motivation
Upgrade mlx-lm to version 0.30.2 which requires transformers 5.0.0rc2 as
a prerelease dependency. This enables support for newer models like Kimi
K2 Thinking while maintaining compatibility with existing models.
The transformers 5.x release includes breaking changes that affect
custom tokenizers like Kimi's TikTokenTokenizer, requiring compatibility
fixes.
## Changes
### Core Changes
- **mlx-lm upgrade**: Bump to 0.30.2 with locked exact versions for
mlx/mlx-lm to prevent breaking changes
- **transformers 5.x compatibility**: Enable prerelease transformers
dependency
### Kimi K2 Tokenizer Fixes
- Add `bytes_to_unicode` monkey-patch to restore function moved in
transformers 5.0.0rc2
- Load `TikTokenTokenizer` directly instead of via `AutoTokenizer` to
bypass transformers 5.x bug with `auto_map` fallback
- Patch `encode()` to use tiktoken directly with `allowed_special="all"`
to handle special tokens from chat templates
### Other Changes
- Dashboard: Show disk usage for completed model downloads
- CI: Add `workflow_dispatch` trigger to build-app workflow
- Docs: Add basic API documentation
### Testing
- Add comprehensive tokenizer unit tests for all supported models
- Tests verify encode/decode, special token handling, and chat template
encoding
## Why It Works
**bytes_to_unicode issue**: transformers 5.0.0rc2 moved
`bytes_to_unicode` from `transformers.models.gpt2.tokenization_gpt2` to
`transformers.convert_slow_tokenizer`. Kimi's `tokenization_kimi.py`
imports from the old location. The monkey-patch restores it at module
load time.
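A minimal sketch of that shim, assuming the module paths named above (the actual patch in this PR may be wired up differently):

```python
# Sketch of the compatibility shim; module paths follow the description
# above, and the real patch may differ.
import transformers.models.gpt2.tokenization_gpt2 as gpt2_tokenization
from transformers.convert_slow_tokenizer import bytes_to_unicode

# tokenization_kimi.py imports bytes_to_unicode from the old gpt2 module,
# so put it back there before that module gets imported.
if not hasattr(gpt2_tokenization, "bytes_to_unicode"):
    gpt2_tokenization.bytes_to_unicode = bytes_to_unicode
```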
**AutoTokenizer issue**: transformers 5.x has a bug where
`tokenizer_class_from_name('TikTokenTokenizer')` returns `None` for
custom tokenizers with `auto_map`. Loading the tokenizer directly
bypasses this.
**encode() issue**: transformers 5.x's `pad()` method fails for slow
tokenizers. Using tiktoken's encode directly with
`allowed_special="all"` avoids this path and properly handles special
tokens like `<|im_user|>` from chat templates.
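Sketched, the `encode()` patch amounts to delegating straight to tiktoken; the attribute holding the tiktoken `Encoding` on Kimi's tokenizer is an assumption here:

```python
# Illustrative patch; `self._tiktoken` stands in for wherever Kimi's
# tokenizer keeps its tiktoken Encoding instance.
def patched_encode(self, text: str, **kwargs) -> list[int]:
    # Skip transformers 5.x's pad() path for slow tokenizers and let
    # tiktoken handle chat-template special tokens like <|im_user|>.
    return self._tiktoken.encode(text, allowed_special="all")
```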
## Test Plan
### Manual Testing
- Hardware: 2x Mac Studios connected via Thunderbolt 5 (mike22 and
james21)
- Tested Kimi K2 Thinking, GPT-OSS-120B, GPT-OSS-20B, Llama-3.1-8B-bf16, and Qwen3-30B-A3B-8bit models with pipeline parallelism across both
nodes
- Verified warmup inference completes successfully
- Verified chat completions work with special tokens
### Automated Testing
- Added `test_tokenizers.py` with 31 tests covering:
- Basic encode/decode for all model families (deepseek, kimi, llama,
qwen, gpt-oss, glm)
- Special token encoding (critical for chat templates)
- Chat template application and encoding
- Kimi-specific and GLM-specific edge cases
- All tests pass: `uv run pytest
src/exo/worker/tests/unittests/test_mlx/test_tokenizers.py`
### Failing Tests
RDMA with all models.
---------
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
Testing multiple devices simultaneously requires coordination, and we
don't necessarily want to run a full EXO to test single components. We
need a mid-scale integration testing framework for distributed tests.
## Changes
Add a simple Python server + bash query script that runs Jaccl and Ring
tests without constructing a worker/master/networking stack. Currently,
the query relies on all devices being accessible over Tailscale.
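Purely as an illustration of the idea (none of these names or routes come from the actual harness), the coordination can be pictured as a tiny per-device server that the bash query hits over Tailscale:

```python
# Hypothetical sketch of a per-device test endpoint; the real server and its
# routes in this PR may look nothing like this.
from http.server import BaseHTTPRequestHandler, HTTPServer


class TestHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path == "/run":
            # Here the real server would kick off the Jaccl or Ring test
            # locally and report its result back to the querying script.
            body = b"ok\n"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), TestHandler).serve_forever()
```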
## Test Plan
Manually tested RDMA + Ring inference on 2 nodes.