Adds the 8bit variant missing from #1907 — the safetensors index is now
live on HF.
- `mlx-community/Qwen3.6-35B-A3B-8bit` (~35 GB)
Architectural fields match the existing 4bit/5bit/bf16 cards.
`storage_size.in_bytes` is taken from `metadata.total_size` of the
upstream `model.safetensors.index.json`.
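The extraction step can be sketched as follows; the index layout (`metadata.total_size` plus a `weight_map`) is the standard HF sharded-checkpoint format. The HTTP fetch is elided and the byte count below is a placeholder, not the real 8bit size:

```python
import json

# Sketch: storage_size.in_bytes comes from metadata.total_size in the
# upstream model.safetensors.index.json. Fetching the file is elided;
# the sample total below is illustrative only.
def total_size_from_index(index_body: str) -> int:
    return json.loads(index_body)["metadata"]["total_size"]

sample_index = '{"metadata": {"total_size": 35000000000}, "weight_map": {}}'
```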
## Motivation
`mlx-community` has just published the new **Qwen3.6-35B-A3B**
multimodal MoE family on HuggingFace. Without static model cards exo
doesn't surface these models in the dashboard picker or match its
placement / prefill logic, so users can't one-click launch them. This PR
adds cards for the three quants whose safetensors indexes are already
live on HF (4bit / 5bit / bf16).
## Changes
Three new TOML files in `resources/inference_model_cards/`:
- `mlx-community--Qwen3.6-35B-A3B-4bit.toml` (~19 GB)
- `mlx-community--Qwen3.6-35B-A3B-5bit.toml` (~23 GB)
- `mlx-community--Qwen3.6-35B-A3B-bf16.toml` (~65 GB)
All three share the same architectural fields (`n_layers = 40`,
`hidden_size = 2048`, `num_key_value_heads = 2`, `context_length =
262144`, capabilities `text, thinking, thinking_toggle, vision`,
`base_model = "Qwen3.6 35B A3B"`) — only `model_id`, `quantization`, and
`storage_size.in_bytes` differ between variants.
## Why It Works
- Qwen3.6-35B-A3B reuses the `qwen3_5_moe` architecture
(`Qwen3_5MoeForConditionalGeneration`) — the same one already wired into
exo's MLX runner at `src/exo/worker/engines/mlx/auto_parallel.py:47` via
`Qwen3_5MoeModel`. The architectural fields are taken verbatim from the
HF `config.json.text_config` and match the existing `Qwen3.5-35B-A3B-*`
cards.
- Storage sizes are the exact `metadata.total_size` read from each
variant's `model.safetensors.index.json` on HF, so download progress and
cluster-memory-fit checks are accurate.
- Vision support is flagged in `capabilities`; the `[vision]` block is
auto-detected by `ModelCard._autodetect_vision` from the upstream
`config.json`, so no hand-written vision config is required.
- The card loader (`_refresh_card_cache` in
`src/exo/shared/models/model_cards.py`) globs every `.toml` in
`resources/inference_model_cards/` on startup, so nothing else needs to
change — the `/models` endpoint and the dashboard picker pick them up
automatically.
The `mxfp4` / `mxfp8` / `nvfp4` variants are still uploading upstream
(their index JSONs currently 404) and can be added in a follow-up PR
once the uploads complete.
## Test Plan
### Manual Testing
Hardware: MacBook Pro M4 Max, 48 GB unified memory.
- Built the dashboard, ran `uv run exo`, waited for the API to come up
on `http://localhost:52415`.
- `curl -s http://localhost:52415/models` returns the three new model
ids (`mlx-community/Qwen3.6-35B-A3B-{4bit,5bit,bf16}`) alongside
existing models.
- Opened the dashboard, clicked SELECT MODEL, typed "Qwen3.6" into the
search box. A single **"Qwen3.6 35B A3B"** group appears showing `3
variants (19GB-65GB)`. Expanding it lists the `4bit` / `5bit` / `bf16`
quants with sizes `19GB` / `23GB` / `65GB`, exactly as expected.

- Programmatically loaded each TOML via `ModelCard.load_from_path(...)`
and confirmed the parsed fields (layers / hidden / KV heads / context /
quant / base_model / caps / bytes) match what's written in the files.
### Automated Testing
No code paths were touched — these are pure TOML data files that plug
into the existing model-card loader. The existing pytest suite covers
TOML parsing and card serving; adding new TOMLs doesn't require new test
scaffolding. `uv run ruff check` and `nix fmt` are clean.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ryuichi Leo Takashige <rl.takashige@gmail.com>
## Motivation
The mlx-community [MiniMax-M2.7
collection](https://huggingface.co/collections/mlx-community/minimax-m27)
landed but exo didn't have model cards for any of the variants yet, so
they weren't selectable from the dashboard model picker. Adding cards
also makes them discoverable under the existing MiniMax family entry.
## Changes
Added 6 new model cards in `resources/inference_model_cards/`, one per
quant of MiniMax M2.7:
- `mlx-community--MiniMax-M2.7.toml` (bf16, full precision — 457 GB)
- `mlx-community--MiniMax-M2.7-4bit.toml` (128 GB)
- `mlx-community--MiniMax-M2.7-4bit-mxfp4.toml` (121 GB)
- `mlx-community--MiniMax-M2.7-5bit.toml` (157 GB)
- `mlx-community--MiniMax-M2.7-6bit.toml` (185 GB)
- `mlx-community--MiniMax-M2.7-8bit.toml` (243 GB)
All six use `family = "minimax"` and share `base_model = "MiniMax M2.7"`
so they collapse into a single group in the picker with the existing
MiniMax logo. Architecture fields (`n_layers = 62`, `hidden_size =
3072`, `num_key_value_heads = 8`, `context_length = 196608`) were read
from each repo's `config.json`; `storage_size.in_bytes` was summed from
the HF tree API per repo.
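The summing step amounts to something like the sketch below. The entry shape mirrors the public `/api/models/{repo}/tree` response (file entries carry a `size` field, directories don't); the HTTP call is elided and the sizes are toy values:

```python
# Sketch of summing storage_size.in_bytes from an HF tree listing.
# Only file entries contribute; the actual fetch is elided.
def repo_total_bytes(tree_entries: list[dict]) -> int:
    return sum(
        entry["size"]
        for entry in tree_entries
        if entry.get("type") == "file"
    )

sample_tree = [
    {"type": "file", "path": "model-00001-of-00002.safetensors", "size": 90},
    {"type": "file", "path": "model-00002-of-00002.safetensors", "size": 60},
    {"type": "directory", "path": "figures"},
]
```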
`capabilities = ["text", "thinking"]` follows the existing MiniMax M2.5
cards — the chat template always emits `<think>` tags (no toggle),
matching M2.5 behavior.
## Why It Works
Model cards in `resources/inference_model_cards/` are auto-loaded by
`src/exo/shared/models/model_cards.py::get_model_cards`. The dashboard
picker groups by `base_model` and filters by `family`, so sharing both
across all six variants gives a single "MiniMax M2.7" group under the
MiniMax sidebar entry, with the quant variants exposed as selectable
sub-options.
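The grouping described above reduces to roughly this (card shape simplified; the real picker code lives in the dashboard):

```python
from collections import defaultdict

# Sketch of the picker-side grouping: filter by family, then collapse by
# base_model so quant variants appear together as one group.
def group_cards(cards: list[dict], family: str) -> dict[str, list[dict]]:
    groups: dict[str, list[dict]] = defaultdict(list)
    for card in cards:
        if card["family"] == family:
            groups[card["base_model"]].append(card)
    return dict(groups)

cards = [
    {"family": "minimax", "base_model": "MiniMax M2.7", "quantization": q}
    for q in ("bf16", "4bit", "4bit-mxfp4", "5bit", "6bit", "8bit")
]
groups = group_cards(cards, family="minimax")
```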
## Test Plan
### Manual Testing
Hardware: MacBook Pro M3 Max.
- Ran `uv run python -c "…await get_model_cards()…"` and confirmed all 6
new cards load with `family=minimax`, `base_model="MiniMax M2.7"`, and
correct quant + byte sizes.
- `cd dashboard && npm run build` then `uv run exo`, opened the model
picker → **MiniMax** family → **MiniMax M2.7** group shows all six quant
variants.
### Automated Testing
- No new automated tests — these are data files validated by the
existing Pydantic `ModelCard` schema at load time.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Motivation
Add support for Gemma 4, including VLM!
## Changes
- Add auto parallel strategies and model cards for Gemma 4
- Normalise Gemma 4's special Vision Transformer handling to be in line
with the rest of our vision processors.
- Also adds reprs to messages and base64 hashes to prevent log spam.
## Test Plan
### Manual Testing
Tested manually on 4bit E2B and 8bit 26B
### Automated Testing
Model onboarding shows small logit diffs.
---------
Co-authored-by: Evan <evanev7@gmail.com>
## Summary
DeepSeek V3.2 (`DeepseekV32ForCausalLM`) is already supported by exo's
inference engine (architecture whitelisted in `model_cards.py`, DSML
encoding added in #1548), but **doesn't work out of the box** due to two
bugs:
### Bug 1: `warmup_inference` passes empty model ID
`warmup_inference()` in `generate.py` accepts `model_id: ModelId` as a
parameter but creates `TextGenerationTaskParams(model=ModelId(""), ...)`
instead of using it. Since `_needs_dsml_encoding()` checks
`"deepseek-v3.2" in task_params.model.lower()`, the empty string never
matches → falls back to `tokenizer.apply_chat_template()` →
**ValueError** because V3.2 has no Jinja chat template.
**Fix:** `model=ModelId("")` → `model=model_id` (one line).
### Bug 2: `_needs_dsml_encoding` limited to tool calling
`_needs_dsml_encoding()` returns `True` only when `task_params.tools` is
present or tool messages exist in `chat_template_messages`. For warmup
and regular chat requests without tools → `return False` → Jinja
fallback → **ValueError**.
Unlike V3.1 (which has a `.jinja` chat template file that transformers
picks up automatically), V3.2 **has no Jinja template at all** — it uses
Python-based DSML encoding for all message types.
**Fix:** For V3.2, always return `True` — DSML encoding handles all
message types.
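The fixed predicate reduces to roughly the following sketch. The real `_needs_dsml_encoding` also scans `chat_template_messages` for tool messages; the signature here is simplified:

```python
# Reduced sketch of the fix: V3.2 always takes the DSML path, everything
# else keeps the tools-only behavior.
def needs_dsml_encoding(model: str, has_tools: bool) -> bool:
    if "deepseek-v3.2" in model.lower():
        # No Jinja chat template exists; DSML handles all message types.
        return True
    return has_tools
```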
### Catalog cards
Added inference model cards for:
- `mlx-community/DeepSeek-V3.2-8bit`
- `mlx-community/DeepSeek-V3.2-4bit`
Parameters taken from model `config.json` on HuggingFace, storage sizes
from HF API. Capabilities include `thinking_toggle` (related: #1456).
## Notes
- The model ID string matching approach (`"deepseek-v3.2" in
model.lower()`) is acknowledged tech debt — see #1371 for the planned
architecture-based approach.
## Test plan
- [x] Start exo with DeepSeek V3.2 model → warmup should complete
without crash
- [x] Send a regular chat message (no tools) → should get a response
- [x] Send a chat message with tools → should work as before
- [x] V3.2 cards should appear in the dashboard model catalog
---------
Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
## Problem
Models with fewer KV heads than nodes crash during tensor parallelism.
For example, Qwen3.5 MoE models have only 2 KV heads — trying to shard
across 4 nodes produces empty tensors and a reshape error at runtime.
The placement system already validates `hidden_size % num_nodes == 0`
but doesn't check KV heads, so it creates configurations that look valid
but blow up when the worker tries to split the attention heads.
Affected models include Qwen3.5-35B-A3B, Qwen3.5-122B-A10B,
Qwen3.5-397B-A17B, Qwen3-Next-80B-A3B, and Qwen3-Coder-Next (all have 2
KV heads).
## Changes
**Placement validation** (`src/exo/master/placement.py`):
- Combined KV heads divisibility check with the existing hidden_size
filter in a single pass
- Cycles where `num_key_value_heads % len(cycle) != 0` are now excluded
for tensor sharding
- Error message includes both constraints when no valid cycle is found
**Model card schema** (`src/exo/shared/models/model_cards.py`):
- Added optional `num_key_value_heads` field to `ModelCard` and
`ConfigData`
- Extracted from HuggingFace `config.json` (handles both top-level and
`text_config` nesting)
- Passed through in `fetch_from_hf()` for dynamically fetched cards
**All 68 inference model cards**
(`resources/inference_model_cards/*.toml`):
- Populated `num_key_value_heads` from each model's HuggingFace config
**Utility script** (`scripts/fetch_kv_heads.py`):
- Fetches `num_key_value_heads` from HuggingFace and updates TOML cards
- `--missing`: only fills in cards that don't have the field yet
- `--all`: re-fetches and overwrites everything
- Uses tomlkit for safe TOML editing and ThreadPoolExecutor for parallel
fetches
## Behavior
- Instance previews no longer show tensor options for models that can't
split their KV heads across the cluster size
- `place_instance()` rejects with a clear error instead of crash-looping
- Pipeline parallelism is unaffected
- 2-node tensor still works for 2-KV-head models (2 ÷ 2 = 1)
- Field is optional — existing custom cards without it continue to work
(validation is skipped when `None`)
## Motivation
Qwen3.5 MoE models (e.g., `Qwen3.5-397B-A17B-6bit`) are now supported by
`mlx-lm` via `qwen3_5_moe` model type, but exo lacks tensor parallel
sharding support for this architecture. This prevents running large
Qwen3.5 models across multiple nodes.
Qwen3.5 uses a GatedDeltaNet hybrid attention mechanism similar to
Qwen3-Next, but with a different projection layout — separate
`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a` instead of
Qwen3-Next's combined `in_proj_qkvz` and `in_proj_ba`. This requires
architecture-aware sharding logic.
## Changes (evan summary)
- enable qwen3_5 dense + moe tensor parallelism from config
- defensively skip evalling _cache.keys if it doesn't exist
- ignore kwargs in qwen35 pipeline masking and ensure pipeline segments match global model parameters for mask creation
- add sharding for qwen3_5 moe linear attention
- added another 6 million model cards
## Why It Works
Qwen3.5's GatedDeltaNet has an `in_proj_qkv` linear layer with three
concatenated sections: `[q(key_dim), k(key_dim), v(value_dim)]`. A naive
contiguous split (`segments=1`) would slice across section boundaries,
corrupting q/k/v values and producing garbled output.
By passing `segments=[key_dim, key_dim + key_dim]` to `shard_linear()`,
each section is split independently before distributing across devices.
This ensures every rank receives correctly aligned q, k, and v
components.
The remaining separate projections (`in_proj_z`, `in_proj_b`,
`in_proj_a`) and the MoE layers follow the same `all_to_sharded` /
`sharded_to_all` pattern already used for Qwen3-Next.
Some pipeline splits contained neither an SSM layer nor a linear layer, so that segment of the model behaved as if it shouldn't create the masks the next layer needs. We patch the model to create those masks manually.
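The segmented split can be illustrated with plain row indices. Dimensions here are toy values, not Qwen3.5's real ones, and `shard_linear` in exo operates on weight matrices, but the slicing logic is the same:

```python
# Toy illustration of segments=[key_dim, 2 * key_dim]: each of the q, k,
# and v sections is split across ranks independently, so no rank's slice
# straddles a section boundary. Ints stand in for weight matrix rows.
def shard_rows(rows: list[int], segments: list[int],
               rank: int, world_size: int) -> list[int]:
    bounds = [0, *segments, len(rows)]
    picked: list[int] = []
    for start, end in zip(bounds, bounds[1:]):
        section = rows[start:end]
        step = len(section) // world_size
        picked += section[rank * step:(rank + 1) * step]
    return picked

key_dim, value_dim = 4, 8  # toy sizes
rows = list(range(2 * key_dim + value_dim))  # [q | k | v] concatenated
rank0 = shard_rows(rows, [key_dim, 2 * key_dim], rank=0, world_size=2)
```

Each rank ends up with half of q, half of k, and half of v rather than one contiguous slab, which is exactly what a naive `segments=1` split would get wrong.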
## Test Plan
Tensor sharded models across 2, 3, and 4 nodes, and pipeline sharded across 2, 3, and 4 nodes, with a simple eval.
---------
Co-authored-by: hw <hw@hwStudio1.local>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
The Qwen3-Coder-Next model card TOML files were missing family,
quantization, base_model, and capabilities fields. This caused them not
to appear under the Qwen family filter in the dashboard's model picker.
Added the missing metadata to all five variants (4bit, 5bit, 6bit, 8bit,
bf16), matching the format used by the existing Qwen3-Coder-480B model
cards.
Test plan:
- Eyeballs
## Summary
- Adds model cards for MiniMax M2.5 in three quantizations: 4bit (~129
GB), 6bit (~186 GB), 8bit (~243 GB)
- No code changes needed — `MiniMaxM2ForCausalLM` is already in the
tensor parallel whitelist and `MiniMaxShardingStrategy` is already
implemented in `auto_parallel.py`
- Credit to @vskiwi for confirming MiniMax M2.5 works out of the box
with existing code
Closes #1480
## Test plan
- [x] `basedpyright` passes with 0 errors
- [x] `ruff check` passes
- [x] `pytest` passes (260 passed, 1 skipped)
- [ ] Verify MiniMax M2.5 models appear in model selector on dashboard
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
## Summary
- Adds `thinking_toggle` capability to 26 model cards that support
toggling thinking mode on/off
- GPT-OSS models (20b, 120b) excluded — they always think and don't
support toggling
- Dashboard UI updated to check for `thinking_toggle` capability before
showing the toggle button
## Test plan
- [x] `uv run basedpyright` — 0 errors
- [x] `uv run ruff check` — all checks passed
- [x] `nix fmt` — 0 files changed
- [x] `uv run pytest` — 188 passed, 0 failed
- [x] Security review passed (no secrets, eval/exec, innerHTML, or dep
changes)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Add GLM 5 support, superseding #1513
## Motivation
Working version of #1366
## Changes
Add Step 3.5 Flash
## Test Plan
### Manual Testing
Works!
### Automated Testing
Running two processes tensor/pipeline sharded gives same logits as
single process.
## Motivation
Image models (FLUX, Qwen Image) had no family grouping or quantization
metadata in the dashboard
## Changes
- Added family, quantization, base_model, and capabilities fields to all
18 image model TOML cards (FLUX.1 variants + Qwen Image variants)
- Added FLUX and Qwen Image SVG logos to FamilyLogos.svelte
- Added "flux" and "qwen-image" families to the sidebar and family sort
order
- Added "Image Gen" and "Image Edit" capability filters in
ModelFilterPopover.svelte
- Added image edit icon/badge to ModelPickerGroup.svelte
- Made the model category sidebar scrollable to accommodate the new
entries
- Hidden scrollbars on model list panels
## Why It Works
Reuses the existing family/quantization grouping infrastructure that
LLMs already use, extending it to image models with appropriate metadata
and icons
## Test Plan
### Manual Testing
Verified image models behave like text models in the model list dialog
---------
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
## Motivation
Add support for FLUX.1-Kontext-dev, an image editing variant of
FLUX.1-dev
## Changes
- New FluxKontextModelAdapter: Handles Kontext's image-to-image workflow
- encodes input image as conditioning latents with special position IDs,
generates from pure noise
- Model config: 57 transformer blocks (19 joint + 38 single), guidance
scale 4.0, ImageToImage task
- Pipeline updates: Added kontext_image_ids property to PromptData
interface, passed through diffusion runner
- Model cards: Added TOML configs for base, 4-bit, and 8-bit variants
- Dependency: mflux 0.15.4 → 0.15.5
- Utility: tmp/quantize_and_upload.py for quantizing and uploading
models to HuggingFace
## Test Plan
### Manual Testing
Works better than Qwen-Image-Edit
## Motivation
Qwen3-Coder-Next just dropped on mlx-community in several quantizations.
It's an 80B MoE model (Qwen3NextForCausalLM) which we already have
tensor parallelism support for via QwenShardingStrategy — just needs
model cards.
## Changes
Added model cards for all 5 available quantizations:
- `mlx-community/Qwen3-Coder-Next-4bit` (~46GB)
- `mlx-community/Qwen3-Coder-Next-5bit` (~58GB)
- `mlx-community/Qwen3-Coder-Next-6bit` (~69GB)
- `mlx-community/Qwen3-Coder-Next-8bit` (~89GB)
- `mlx-community/Qwen3-Coder-Next-bf16` (~158GB)
All with `supports_tensor = true` since the architecture is already
supported.
## Why It Works
`Qwen3NextForCausalLM` is already handled by QwenShardingStrategy in
auto_parallel.py and is in the supports_tensor allowlist in
model_cards.py. No code changes needed — just the TOML card files.
## Test Plan
### Manual Testing
n/a - model card addition only
### Automated Testing
- `basedpyright` — 0 errors
- `ruff check` — passes
- `nix fmt` — no changes
- `pytest` — 173 passed, 1 skipped
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
Enable parallel classifier-free guidance (CFG) for Qwen image models.
CFG requires two forward passes (positive/negative prompts) - this
allows them to run on separate nodes simultaneously, reducing latency.
## Changes
- Added uses_cfg flag to ModelCard to identify CFG-based models
- Extended PipelineShardMetadata with CFG topology fields (cfg_rank,
cfg_world_size, peer device info)
- Updated placement to create two CFG groups with reversed ordering
(places CFG peers as ring neighbors)
- Refactored DiffusionRunner to process CFG branches separately with
exchange at last pipeline stage
- Added get_cfg_branch_data() to PromptData for single-branch embeddings
- Fixed seed handling in API for distributed consistency
- Fixed image yield to only emit from CFG rank 0 at last stage
- Increased num_sync_steps_factor from 0.125 to 0.25 for Qwen
## Why It Works
- 2 nodes + CFG: Both run all layers, process different CFG branches in
parallel
- 4+ even nodes + CFG: Hybrid - 2 CFG groups × N/2 pipeline stages
- Odd nodes or non-CFG: Falls back to pure pipeline parallelism
Ring topology places CFG peers as neighbors to enable direct exchange.
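The reversed-group layout can be sketched like this (node names hypothetical; the real placement code carries far more metadata per shard):

```python
# Sketch of the two-group CFG placement on a ring n0-n1-n2-n3-n0. The
# second group's stage order is reversed so the two last-stage peers,
# which must exchange CFG branches, end up as ring neighbors.
def cfg_pipeline_groups(ring: list[str]) -> tuple[list[str], list[str]]:
    half = len(ring) // 2
    positive = ring[:half]                  # stages 0..half-1
    negative = list(reversed(ring[half:]))  # stages 0..half-1, reversed
    return positive, negative

pos, neg = cfg_pipeline_groups(["n0", "n1", "n2", "n3"])
# Last-stage peers are n1 and n2: adjacent on the ring, so the exchange
# at the final pipeline stage is a direct neighbor transfer.
```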
## Test Plan
### Manual Testing
Verified performance gains for Qwen-Image on 2-node and 4-node clusters.
Non-CFG models still work.
### Automated Testing
Added tests in test_placement_utils.py covering 2-node CFG parallel,
4-node hybrid, odd-node fallback, and non-CFG pipeline modes.
## Motivation
Reimplements the model picker modal from #1191 on top of the custom
model support branch. Replaces the inline model dropdown with a
full-featured modal that groups models by base model, supports
filtering, favorites, and HuggingFace Hub search.
## Changes
**Backend:**
- Add `family`, `quantization`, `base_model`, `capabilities` metadata
fields to `ModelCard` and all 40 TOML model cards
- Pass new fields through `ModelListModel` and `get_models()` API
response
- Add `GET /models/search` endpoint using
`huggingface_hub.list_models()`
**Dashboard (7 new files):**
- `ModelPickerModal.svelte` — Main modal with search, family filtering,
HuggingFace Hub tab
- `ModelPickerGroup.svelte` — Expandable model group row with
quantization variants
- `FamilySidebar.svelte` — Vertical sidebar with family icons (All,
Favorites, Hub, model families)
- `FamilyLogos.svelte` — SVG icons for each model family
- `ModelFilterPopover.svelte` — Capability and size range filters
- `HuggingFaceResultItem.svelte` — HF search result item with
download/like counts
- `favorites.svelte.ts` — localStorage-backed favorites store
**Integration:**
- Replace inline dropdown in `+page.svelte` with button that opens
`ModelPickerModal`
- Custom models shown in Hub tab with delete support
**Polish:**
- Real brand logos (Meta, Qwen, DeepSeek, OpenAI, GLM, MiniMax, Kimi,
HuggingFace) from Simple Icons / LobeHub
- Clean SVG stroke icons for capabilities (thinking, code, vision, image
gen)
- Consistent `border-exo-yellow/10` borders, descriptive tooltips
throughout
- Cluster memory (used/total) shown in modal header
- Selected model highlight with checkmark for both single and
multi-variant groups
- Cursor pointer on all interactive elements, fix filter popover
click-outside bug
- Custom models now appear in All tab alongside built-in models
## Bug Fix: Gemma 3 EOS tokens
Also included in this branch: fix for Gemma 3 models generating infinite
`<end_of_turn>` tokens. The tokenizer's `eos_token_ids` was missing
token ID 106 (`<end_of_turn>`), so generation never stopped. The fix
appends this token to the EOS list after loading the tokenizer. Also
handles `eos_token_ids` being a `set` (not just a `list`).
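The fix amounts to roughly the following (token ID 106 is `<end_of_turn>` in Gemma 3's vocabulary; the function name is simplified from the real tokenizer-loading code):

```python
# Sketch of the Gemma 3 EOS patch: append <end_of_turn> (ID 106) to the
# tokenizer's EOS collection after loading, handling both list and set.
END_OF_TURN_ID = 106

def patch_eos_token_ids(eos_token_ids):
    if isinstance(eos_token_ids, set):
        return eos_token_ids | {END_OF_TURN_ID}
    if END_OF_TURN_ID not in eos_token_ids:
        return [*eos_token_ids, END_OF_TURN_ID]
    return eos_token_ids
```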
## Why It Works
Model metadata (family, capabilities, etc.) is stored directly in TOML
cards rather than derived from heuristics, ensuring accuracy. The modal
groups models by `base_model` field so quantization variants appear
together. Custom models are separated into the Hub tab since they lack
grouping metadata.
## Test Plan
### Manual Testing
- Open dashboard, click model selector to open modal
- Browse models by family sidebar, search, and filters
- Expand model groups to see quantization variants
- Star favorites and verify persistence across page reloads
- Navigate to Hub tab, search and add models
- Verify error messages shown for invalid model IDs
- Run a Gemma 3 model and verify generation stops at `<end_of_turn>`
### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — clean
- `uv run pytest src/` — 173 passed
- `cd dashboard && npm run build` — builds successfully
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>