## Problem
Models whose KV head count is not evenly divisible by the node count crash during tensor parallelism. For example, Qwen3.5 MoE models have only 2 KV heads; attempting to shard them across 4 nodes produces empty tensors and a reshape error at runtime.
The placement system already validates `hidden_size % num_nodes == 0`
but doesn't check KV heads, so it creates configurations that look valid
but blow up when the worker tries to split the attention heads.
Affected models include Qwen3.5-35B-A3B, Qwen3.5-122B-A10B,
Qwen3.5-397B-A17B, Qwen3-Next-80B-A3B, and Qwen3-Coder-Next (all have 2
KV heads).
## Changes
**Placement validation** (`src/exo/master/placement.py`):
- Combined the KV-head divisibility check with the existing `hidden_size`
  filter in a single pass
- Cycles where `num_key_value_heads % len(cycle) != 0` are now excluded
  from tensor sharding
- Error message includes both constraints when no valid cycle is found
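The combined filter can be sketched roughly as follows. This is an illustrative reconstruction, not the actual `placement.py` code: the function name, cycle representation, and error wording are assumptions based on the constraints described above.

```python
def viable_tensor_cycles(cycles, hidden_size, num_key_value_heads):
    """Keep only cycles that can shard both the hidden dimension and the
    KV heads evenly (sketch; real placement code may differ in structure)."""
    viable = []
    for cycle in cycles:
        n = len(cycle)
        if hidden_size % n != 0:
            continue  # existing hidden_size constraint
        # New constraint: a cycle is only viable if KV heads split evenly.
        # Validation is skipped when the card does not declare KV heads.
        if num_key_value_heads is not None and num_key_value_heads % n != 0:
            continue
        viable.append(cycle)
    if not viable:
        # Surface both constraints so the failure is diagnosable.
        raise ValueError(
            f"no cycle satisfies hidden_size % n == 0 and "
            f"num_key_value_heads % n == 0 "
            f"(hidden_size={hidden_size}, kv_heads={num_key_value_heads})"
        )
    return viable
```

With 2 KV heads, a 2-node cycle passes (`2 % 2 == 0`) while a 4-node cycle is filtered out, matching the behavior described below.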
**Model card schema** (`src/exo/shared/models/model_cards.py`):
- Added optional `num_key_value_heads` field to `ModelCard` and
`ConfigData`
- Extracted from HuggingFace `config.json` (handles both top-level and
`text_config` nesting)
- Passed through in `fetch_from_hf()` for dynamically fetched cards
**All 68 inference model cards**
(`resources/inference_model_cards/*.toml`):
- Populated `num_key_value_heads` from each model's HuggingFace config
**Utility script** (`scripts/fetch_kv_heads.py`):
- Fetches `num_key_value_heads` from HuggingFace and updates TOML cards
- `--missing`: only fills in cards that don't have the field yet
- `--all`: re-fetches and overwrites everything
- Uses tomlkit for safe TOML editing and ThreadPoolExecutor for parallel
fetches
## Behavior
- Instance previews no longer show tensor options for models that can't
split their KV heads across the cluster size
- `place_instance()` rejects with a clear error instead of crash-looping
- Pipeline parallelism is unaffected
- 2-node tensor parallelism still works for 2-KV-head models (2 ÷ 2 = 1
  head per node)
- Field is optional — existing custom cards without it continue to work
(validation is skipped when `None`)