Mirror of https://github.com/exo-explore/exo.git, synced 2026-03-06 15:17:36 -05:00
## Motivation

Qwen3.5 MoE models (e.g., `Qwen3.5-397B-A17B-6bit`) are now supported by `mlx-lm` via the `qwen3_5_moe` model type, but exo lacks tensor parallel sharding support for this architecture, which prevents running large Qwen3.5 models across multiple nodes. Qwen3.5 uses a GatedDeltaNet hybrid attention mechanism similar to Qwen3-Next, but with a different projection layout: separate `in_proj_qkv`, `in_proj_z`, `in_proj_b`, and `in_proj_a` layers instead of Qwen3-Next's combined `in_proj_qkvz` and `in_proj_ba`. This requires architecture-aware sharding logic.

## Changes (evan summary)

- Enable qwen3_5 dense + MoE tensor parallelism from config.
- Defensively skip evaluating `_cache.keys` if it doesn't exist.
- Ignore kwargs in qwen3_5 pipeline masking and ensure pipeline segments match global model parameters for mask creation.
- Add sharding for qwen3_5 MoE linear attention.
- Added another 6 million model cards.

## Why It Works

Qwen3.5's GatedDeltaNet has an `in_proj_qkv` linear layer with three concatenated sections: `[q(key_dim), k(key_dim), v(value_dim)]`. A naive contiguous split (`segments=1`) would slice across section boundaries, corrupting q/k/v values and producing garbled output. By passing `segments=[key_dim, key_dim + key_dim]` to `shard_linear()`, each section is split independently before being distributed across devices, so every rank receives correctly aligned q, k, and v components. The remaining separate projections (`in_proj_z`, `in_proj_b`, `in_proj_a`) and the MoE layers follow the same `all_to_sharded` / `sharded_to_all` pattern already used for Qwen3-Next.

Some pipeline splits didn't include an SSM layer or a linear layer, so that subset of the model behaved as if it shouldn't create the masks needed by the next layer; we patch the model to create those masks manually.

## Test Plan

Tensor sharded across 2, 3, and 4 nodes, and pipeline sharded across 2, 3, and 4 nodes, each verified with a simple eval.
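To illustrate the segmented split described above, here is a minimal NumPy sketch of segment-aware row sharding. It is not exo's actual `shard_linear()` implementation; the function name `shard_rows` and the toy dimensions are hypothetical, and the `segments` list marks the q/k and k/v boundaries just as in the PR:

```python
import numpy as np

def shard_rows(weight, segments, world_size, rank):
    """Split each concatenated section independently along the row axis,
    then concatenate this rank's slice of every section."""
    bounds = [0, *segments, weight.shape[0]]
    parts = []
    for lo, hi in zip(bounds, bounds[1:]):
        section = weight[lo:hi]
        parts.append(np.array_split(section, world_size)[rank])
    return np.concatenate(parts)

# Toy layout: rows 0..3 are q, 4..7 are k, 8..15 are v.
key_dim, value_dim = 4, 8
w = np.arange(2 * key_dim + value_dim, dtype=float)[:, None] * np.ones((1, 3))

# segments = [key_dim, key_dim + key_dim] marks the section boundaries.
shard0 = shard_rows(w, [key_dim, key_dim + key_dim], world_size=2, rank=0)

# Rank 0 receives rows [0, 1] of q, [4, 5] of k, and [8..11] of v: each
# section stays aligned. A naive contiguous split would instead hand rank 0
# rows [0..7], i.e. all of q plus all of k and none of v.
print(shard0[:, 0])  # → [ 0.  1.  4.  5.  8.  9. 10. 11.]
```

The same idea generalizes to any projection whose output dimension is a concatenation of logically distinct sections.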
---------

Co-authored-by: hw <hw@hwStudio1.local>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>