## Motivation
No automated integration tests exist for exo. Manual testing against
real hardware clusters is slow and error-prone. We need a pytest
framework that deploys clusters via `eco`, runs inference scenarios, and
tears down cleanly.
## Changes
- **`tools/src/exo_tools/`** — New workspace member shared by bench,
eval, and tests:
- `client.py` — `ExoClient` HTTP client (extracted from
`bench/harness.py`)
- `harness.py` — instance lifecycle helpers (placement, wait-for-ready,
etc.)
- `cluster.py` — `EcoSession` for eco cluster lifecycle
(deploy/stop/start/release/logs/exec) with unique `USER=<prefix>-<uuid>`
per session and atexit/signal cleanup
- **`tests/integration/`** — 17 pytest tests across 5 files:
- `test_1node.py` — place, chat, multi-turn, delete, state/models
endpoints, cluster snapshot, download-from-scratch
- `test_2node.py` — parametrized tensor/jaccl + pipeline/ring inference
and multi-turn
- `test_4node.py` — parametrized 4-node pipeline/ring inference, cluster
state
- `test_resilience.py` — full disconnect/reconnect cycle (2-node →
disconnect → 1-node → reconnect → 2-node)
- `test_dashboard.py` — Playwright: dashboard loads, shows node info,
chat flow
- `helpers.py` — placement/inference helpers, re-exports from
`exo_tools`
- `conftest.py` — session-scoped cluster fixtures with constraint-based
eco reservations; `--hosts` override; `EXO_REF` env var for CI
deployments from a GitHub branch
- **`bench/`** — Updated imports from `exo_tools.client` /
`exo_tools.harness`
- **`pyproject.toml`** — Added `tools` workspace member, `playwright`
dev dep, `--ignore=tests/integration`
## Why It Works
Tests use `eco` for cluster lifecycle and `ExoClient` for API
interactions — same tools humans use. Session-scoped fixtures deploy
once per file. Unique eco users prevent test runs from interfering with
each other or manual usage.
## Test Plan
### Automated Testing
- `uv run pytest tests/integration/ -v -s` — full suite (~4-5 min, 17/17
passing)
- `uv run pytest tests/integration/ -v -s --hosts s4,s9,s10,s22` — pin
specific hosts
- `EXO_REF=main uv run pytest tests/integration/ -v` — deploy from a
GitHub branch (CI)
- `uv run pytest` — confirms integration tests are excluded from default
runs
## Motivation
Qwen3.5 MoE models (e.g., `Qwen3.5-397B-A17B-6bit`) are now supported by
`mlx-lm` via `qwen3_5_moe` model type, but exo lacks tensor parallel
sharding support for this architecture. This prevents running large
Qwen3.5 models across multiple nodes.
Qwen3.5 uses a GatedDeltaNet hybrid attention mechanism similar to
Qwen3-Next, but with a different projection layout — separate
`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a` instead of
Qwen3-Next's combined `in_proj_qkvz` and `in_proj_ba`. This requires
architecture-aware sharding logic.
## Changes (evan summary)
- enable qwen3_5 dense + moe tensor parallelism from config
- defensively skip evalling _cache.keys if it doesn't exist
- ignore kwargs in qwen35 pipeline masking and ensure pipeline segments match global model parameters for mask creation
- add sharding for qwen3_5 moe linear attention
- added another 6 million model cards
## Why It Works
Qwen3.5's GatedDeltaNet has an `in_proj_qkv` linear layer with three
concatenated sections: `[q(key_dim), k(key_dim), v(value_dim)]`. A naive
contiguous split (`segments=1`) would slice across section boundaries,
corrupting q/k/v values and producing garbled output.
By passing `segments=[key_dim, key_dim + key_dim]` to `shard_linear()`,
each section is split independently before distributing across devices.
This ensures every rank receives correctly aligned q, k, and v
components.
The remaining separate projections (`in_proj_z`, `in_proj_b`,
`in_proj_a`) and the MoE layers follow the same `all_to_sharded` /
`sharded_to_all` pattern already used for Qwen3-Next.
Some pipeline splits didn't include an ssm layer or a linear layer resulting in a subset of the model acting like it shouldn't create the appropriate masks for the next layer - we patch the model to manually create such masks.
## Test Plan
tensor sharded 2,3,4 models & pipeline sharded 2,3,4 with simple eval.
---------
Co-authored-by: hw <hw@hwStudio1.local>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
Our hits are really bad at the moment (0.2%). This PR makes it 98.5% on
average.
## Changes
Also adds an example for how to run Claude using Exo.
## Why It Works
Claude sends some billing and session headers that change with each
message.
## Test Plan
### Manual Testing
Works in manual testing.
## Motivation
Add support for FLUX.1-Kontext-dev, an image editing variant of
FLUX.1-dev
## Changes
- New FluxKontextModelAdapter: Handles Kontext's image-to-image workflow
- encodes input image as conditioning latents with special position IDs,
generates from pure noise
- Model config: 57 transformer blocks (19 joint + 38 single), guidance
scale 4.0, ImageToImage task
- Pipeline updates: Added kontext_image_ids property to PromptData
interface, passed through diffusion runner
- Model cards: Added TOML configs for base, 4-bit, and 8-bit variants
- Dependency: mflux 0.15.4 → 0.15.5
- Utility: tmp/quantize_and_upload.py for quantizing and uploading
models to HuggingFace
## Test Plan
### Manual Testing
Works better than Qwen-Image-Edit
## Motivation
A lot of changes happened without much attention to the state of exo
bench.
## Changes
Use TaggedModel for BenchChatCompletion so it serialises properly.
Don't break after gpt oss tool call to preserve parity with the rest of
the codebase.
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<img width="2856" height="678" alt="image"
src="https://github.com/user-attachments/assets/2e18cf0d-c0f8-467c-9763-1a6a59c8a327"
/>
Also tested GPT OSS tool calling in OpenCode
## Motivation
Running RDMA from source is not well documented as is. Several
surprising things that took time to debug internally too.
App should be updated to detect MacOS versions in future.
## Motivation
This PR implements benchmarking in the style of llama-bench. The main
difficulty here is the fact that exo is not a library - it exposes an
endpoint. This means that benchmarking numbers will be inaccurate if the
API is measured.
The solution assumes nodes are set up with uv run exo (or via the app),
and then hits the new endpoint /bench/chat/completions to retrieve
generation statistics directly from mlx_lm.
<!-- Why is this change needed? What problem does it solve? -->
This will allow us to release benchmarks for models and perform
regression tests.
TODO: Performance benchmarking.
<!-- If it fixes an open issue, please link to the issue here -->
## Changes
<!-- Describe what you changed in detail -->
- Adds /bench/chat/completions endpoint
- Adds BenchChatCompletion/Response
- Adds a logits processor to prevent response from ending early
- Adds a "Prompt Sizer" which downloads the tokenizer and dynamically
adjusts the prompt of "a" to fit the desired prompt size.
- Reduce prefill step size to 2048 for now (in future, dynamically
adjust this value)
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
Benchmarked Llama, Qwen, DeepSeek and Kimi models. Will require several
fixes to run consistently on all configurations (to be done in the
future).
Manually tested the normal API to verify chat requests complete as
expected.
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
Not really possible. Type checker passes.