## Motivation
Qwen3.5 MoE models (e.g., `Qwen3.5-397B-A17B-6bit`) are now supported by
`mlx-lm` via the `qwen3_5_moe` model type, but exo lacks tensor-parallel
sharding support for this architecture, which prevents running large
Qwen3.5 models across multiple nodes.
Qwen3.5 uses a GatedDeltaNet hybrid attention mechanism similar to
Qwen3-Next, but with a different projection layout — separate
`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a` instead of
Qwen3-Next's combined `in_proj_qkvz` and `in_proj_ba`. This requires
architecture-aware sharding logic.
## Changes
- Enable Qwen3.5 dense and MoE tensor parallelism from the model config
- Defensively skip evaluating `_cache.keys` if it doesn't exist
- Ignore extra kwargs in Qwen3.5 pipeline masking and ensure pipeline segments match the global model parameters for mask creation
- Add sharding for Qwen3.5 MoE linear attention
- Add another 6 million model cards
## Why It Works
Qwen3.5's GatedDeltaNet has an `in_proj_qkv` linear layer with three
concatenated sections: `[q(key_dim), k(key_dim), v(value_dim)]`. A naive
contiguous split (`segments=1`) would slice across section boundaries,
corrupting q/k/v values and producing garbled output.
By passing `segments=[key_dim, key_dim + key_dim]` to `shard_linear()`,
each section is split independently before distributing across devices.
This ensures every rank receives correctly aligned q, k, and v
components.
The remaining separate projections (`in_proj_z`, `in_proj_b`,
`in_proj_a`) and the MoE layers follow the same `all_to_sharded` /
`sharded_to_all` pattern already used for Qwen3-Next.
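The split logic can be illustrated with a toy sketch. This is not exo's `shard_linear` (which operates on real weight matrices); it only imitates the column-slicing behaviour, and the small dimensions are made up:

```python
# Toy sketch of why segment-aware splitting matters. Columns are labelled
# strings instead of weight values so the alignment is visible.

def split_columns(cols, split_points, num_ranks):
    """Split each concatenated section independently, then give rank r
    the r-th slice of every section."""
    bounds = [0, *split_points, len(cols)]
    sections = [cols[a:b] for a, b in zip(bounds, bounds[1:])]
    shards = []
    for r in range(num_ranks):
        shard = []
        for sec in sections:
            step = len(sec) // num_ranks
            shard.extend(sec[r * step:(r + 1) * step])
        shards.append(shard)
    return shards

key_dim, value_dim = 4, 4
cols = ([f"q{i}" for i in range(key_dim)]
        + [f"k{i}" for i in range(key_dim)]
        + [f"v{i}" for i in range(value_dim)])

# Naive contiguous split: rank 0 gets q0..q3 plus k0,k1, mixing sections.
naive = [cols[:6], cols[6:]]

# segments=[key_dim, key_dim + key_dim] marks the q/k and k/v boundaries:
aligned = split_columns(cols, [key_dim, key_dim + key_dim], num_ranks=2)
# aligned[0] is ["q0", "q1", "k0", "k1", "v0", "v1"]: each rank holds
# matching halves of q, k and v.
```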
Some pipeline splits didn't include an SSM layer or a linear-attention layer, so that segment of the model behaved as if it shouldn't create the masks the next layer needs. We patch the model to create such masks manually.
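A minimal sketch of the idea behind the patch (the layer-type names here are hypothetical; exo's actual model classes differ):

```python
# Hypothetical sketch: the decision to create an SSM mask must come from
# the full model's layer types, not just the layers in this pipeline segment.
def needs_ssm_mask(layer_types: list[str]) -> bool:
    return "ssm" in layer_types

segment_layers = ["attention", "attention"]               # this node's slice
global_layers = ["attention", "ssm", "attention", "ssm"]  # whole model

needs_ssm_mask(segment_layers)  # False: the bug, the mask is never created
needs_ssm_mask(global_layers)   # True: the fix, consult global parameters
```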
## Test Plan
Tensor-sharded models across 2, 3, and 4 nodes and pipeline-sharded across 2, 3, and 4 nodes, verified with a simple eval.
---------
Co-authored-by: hw <hw@hwStudio1.local>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
Our hit rate is really bad at the moment (0.2%). This PR raises it to
98.5% on average.
## Changes
Also adds an example for how to run Claude using Exo.
## Why It Works
Claude sends some billing and session headers that change with each
message; once those volatile headers no longer affect request matching,
repeated requests hit.
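One plausible reading of the fix, as a hedged sketch: derive the match key from a request while ignoring per-message headers. The header names below are placeholders, not Claude's actual headers.

```python
import hashlib
import json

# Placeholder names; the real volatile headers are Claude's billing and
# session headers, which differ from these.
VOLATILE_HEADERS = {"x-billing-token", "x-session-id"}

def cache_key(headers: dict, body: str) -> str:
    """Hash a request into a match key, ignoring volatile headers."""
    stable = {k.lower(): v for k, v in headers.items()
              if k.lower() not in VOLATILE_HEADERS}
    blob = json.dumps([sorted(stable.items()), body], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

a = cache_key({"x-session-id": "s1", "Accept": "json"}, "hello")
b = cache_key({"x-session-id": "s2", "Accept": "json"}, "hello")
# a == b: the changing session header no longer breaks matching.
```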
## Test Plan
### Manual Testing
Works in manual testing.
## Motivation
Add support for FLUX.1-Kontext-dev, an image-editing variant of
FLUX.1-dev.
## Changes
- New `FluxKontextModelAdapter`: handles Kontext's image-to-image workflow by encoding the input image as conditioning latents with special position IDs while generating from pure noise
- Model config: 57 transformer blocks (19 joint + 38 single), guidance scale 4.0, ImageToImage task
- Pipeline updates: added a `kontext_image_ids` property to the `PromptData` interface, passed through the diffusion runner
- Model cards: added TOML configs for the base, 4-bit, and 8-bit variants
- Dependency: mflux 0.15.4 → 0.15.5
- Utility: `tmp/quantize_and_upload.py` for quantizing models and uploading them to HuggingFace
## Test Plan
### Manual Testing
Works better than Qwen-Image-Edit
## Motivation
A lot of changes landed without much attention paid to the state of exo
bench.
## Changes
Use `TaggedModel` for `BenchChatCompletion` so it serialises properly.
Don't break after a GPT-OSS tool call, to preserve parity with the rest
of the codebase.
## Test Plan
### Manual Testing
<img width="2856" height="678" alt="image"
src="https://github.com/user-attachments/assets/2e18cf0d-c0f8-467c-9763-1a6a59c8a327"
/>
Also tested GPT OSS tool calling in OpenCode
## Motivation
Running RDMA from source is not well documented as things stand, and
several surprising issues took time to debug internally.
The app should also be updated to detect macOS versions in the future.
## Motivation
This PR implements benchmarking in the style of llama-bench. The main
difficulty is that exo is not a library: it exposes an endpoint, which
means benchmarking numbers would be inaccurate if the API itself were
measured.
The solution assumes nodes are set up with `uv run exo` (or via the
app), and then hits the new `/bench/chat/completions` endpoint to
retrieve generation statistics directly from `mlx_lm`.
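Assuming a node is already running, the endpoint can be exercised like a regular chat completion. Everything beyond the path `/bench/chat/completions` here (port, model id, request fields) is an assumption, not part of this PR:

```python
import json
import urllib.request

# Hypothetical request body; only the endpoint path comes from this PR.
body = json.dumps({
    "model": "llama-3.2-1b",  # assumed model id
    "messages": [{"role": "user", "content": "a"}],
}).encode()

req = urllib.request.Request(
    "http://localhost:52415/bench/chat/completions",  # port is an assumption
    data=body,
    headers={"Content-Type": "application/json"},
)
# stats = json.load(urllib.request.urlopen(req))  # generation stats from mlx_lm
```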
This will allow us to release benchmarks for models and perform
regression tests.
TODO: Performance benchmarking.
## Changes
- Adds a `/bench/chat/completions` endpoint
- Adds `BenchChatCompletion`/`Response`
- Adds a logits processor to prevent the response from ending early
- Adds a "Prompt Sizer", which downloads the tokenizer and dynamically adjusts a prompt of repeated "a" tokens to fit the desired prompt size
- Reduces the prefill step size to 2048 for now (in the future, this value will be adjusted dynamically)
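The prompt-sizer idea can be sketched as follows. This is a toy version: the real implementation uses the model's downloaded tokenizer, not `str.split`.

```python
# Toy sketch of the prompt sizer: grow a prompt of repeated "a" until its
# tokenized length reaches the target prompt size.
def size_prompt(tokenize, target_tokens: int) -> str:
    prompt = "a"
    while len(tokenize(prompt)) < target_tokens:
        prompt += " a"
    return prompt

size_prompt(str.split, 5)  # -> "a a a a a"
```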
## Test Plan
### Manual Testing
Benchmarked Llama, Qwen, DeepSeek and Kimi models. Will require several
fixes to run consistently on all configurations (to be done in the
future).
Manually tested the normal API to verify chat requests complete as
expected.
### Automated Testing
End-to-end automated testing isn't really possible here, since benchmarking requires live nodes. The type checker passes.