Commit Graph

15 Commits

Author SHA1 Message Date
Daiz
28817d3ee3 Add support for Qwen3.5 (#1644)
## Motivation

Qwen3.5 MoE models (e.g., `Qwen3.5-397B-A17B-6bit`) are now supported by
`mlx-lm` via `qwen3_5_moe` model type, but exo lacks tensor parallel
sharding support for this architecture. This prevents running large
Qwen3.5 models across multiple nodes.

Qwen3.5 uses a GatedDeltaNet hybrid attention mechanism similar to
Qwen3-Next, but with a different projection layout — separate
`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a` instead of
Qwen3-Next's combined `in_proj_qkvz` and `in_proj_ba`. This requires
architecture-aware sharding logic.

## Changes (evan summary)

- enable qwen3_5 dense + MoE tensor parallelism from config
- defensively skip evaluating `_cache.keys` if it doesn't exist
- ignore kwargs in qwen3_5 pipeline masking and ensure pipeline segments match global model parameters for mask creation
- add sharding for qwen3_5 MoE linear attention
- add another 6 million model cards

## Why It Works

Qwen3.5's GatedDeltaNet has an `in_proj_qkv` linear layer with three
concatenated sections: `[q(key_dim), k(key_dim), v(value_dim)]`. A naive
contiguous split (`segments=1`) would slice across section boundaries,
corrupting q/k/v values and producing garbled output.

By passing `segments=[key_dim, key_dim + key_dim]` to `shard_linear()`,
each section is split independently before distributing across devices.
This ensures every rank receives correctly aligned q, k, and v
components.
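
Conceptually, the segment-aware split looks like the minimal sketch below (plain numpy, with illustrative names and arguments; exo's actual `shard_linear()` implementation differs):

```python
import numpy as np

def shard_fused_qkv(weight: np.ndarray, key_dim: int, n_ranks: int) -> list[np.ndarray]:
    """Split each q/k/v section across ranks independently, instead of
    slicing the fused output dimension contiguously."""
    # Output rows are laid out as [q (key_dim), k (key_dim), v (value_dim)].
    q, k, v = np.split(weight, [key_dim, 2 * key_dim], axis=0)
    shards = []
    for rank in range(n_ranks):
        parts = []
        for section in (q, k, v):
            step = section.shape[0] // n_ranks
            parts.append(section[rank * step:(rank + 1) * step])
        # Each rank gets an aligned slice of q, k and v, so no slice crosses
        # a section boundary.
        shards.append(np.concatenate(parts, axis=0))
    return shards
```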

The remaining separate projections (`in_proj_z`, `in_proj_b`,
`in_proj_a`) and the MoE layers follow the same `all_to_sharded` /
`sharded_to_all` pattern already used for Qwen3-Next.

Some pipeline splits did not include an SSM layer or a linear-attention layer, so that sub-model behaved as if it should not create the masks needed by the next layer. We patch the model to create those masks manually.
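
A rough illustration of that patch, with entirely hypothetical names for the segment and layer-type bookkeeping (not exo's actual API):

```python
def patch_segment_masks(segment, global_layer_types, make_masks):
    """If this pipeline segment has no linear-attention/SSM layers locally but
    the global model does, build the masks from the global layout anyway."""
    local_has = any(t == "linear_attention" for t in segment.layer_types)
    global_has = any(t == "linear_attention" for t in global_layer_types)
    if global_has and not local_has:
        # The next stage still expects these masks, so create them here.
        segment.masks = make_masks(global_layer_types)
```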

## Test Plan

Tensor sharded and pipeline sharded models across 2, 3, and 4 nodes with a simple eval.

---------

Co-authored-by: hw <hw@hwStudio1.local>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
2026-03-03 14:31:57 +00:00
rltakashige
365dd68d9a Final fixes for release (#1603)
2026-02-23 21:10:15 +00:00
rltakashige
423ed0f07f Strip Claude headers to improve prefix cache hit rates (#1552)
## Motivation
Our prefix cache hit rate is currently very low (0.2%). This PR raises it to 98.5% on average.

## Changes

Strips the per-request Claude headers before they can invalidate the cached prefix, and adds an example of how to run Claude using Exo.

## Why It Works
Claude sends billing and session headers that change with each message, so the prompt prefix differs on every request and the cache rarely hits; stripping those headers restores reuse.
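
Conceptually the fix looks like the sketch below; the header names are purely illustrative, not the exact set exo strips:

```python
VOLATILE_PREFIXES = ("anthropic-", "x-stainless-")  # illustrative only

def stable_headers(headers: dict[str, str]) -> dict[str, str]:
    """Drop per-request metadata so identical prompts map to the same prefix."""
    return {
        k: v for k, v in headers.items()
        if not k.lower().startswith(VOLATILE_PREFIXES)
    }
```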

## Test Plan

### Manual Testing
Works in manual testing.
2026-02-19 18:29:34 +00:00
ciaranbor
6f0cb99919 Ciaran/flux1 kontext (#1394)
## Motivation

Add support for FLUX.1-Kontext-dev, an image editing variant of
FLUX.1-dev

## Changes

- New `FluxKontextModelAdapter`: handles Kontext's image-to-image workflow by
encoding the input image as conditioning latents with special position IDs
and generating from pure noise (see the sketch after this list)
- Model config: 57 transformer blocks (19 joint + 38 single), guidance scale
4.0, ImageToImage task
- Pipeline updates: added `kontext_image_ids` property to the PromptData
interface, passed through the diffusion runner
- Model cards: added TOML configs for base, 4-bit, and 8-bit variants
- Dependency: mflux 0.15.4 → 0.15.5
- Utility: `tmp/quantize_and_upload.py` for quantizing and uploading models
to HuggingFace
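
A rough sketch of the conditioning idea described in the adapter bullet above, with illustrative names and shapes (mflux's real adapter differs):

```python
import mlx.core as mx

def build_latent_sequence(noise_latents: mx.array, image_latents: mx.array):
    """Append the encoded input image to the noise latents and give the
    conditioning tokens a distinct position-ID range."""
    n_gen, n_cond = noise_latents.shape[0], image_latents.shape[0]
    latents = mx.concatenate([noise_latents, image_latents], axis=0)
    # Generation tokens take ids 0..n_gen-1; conditioning tokens are offset
    # so the transformer can tell the two roles apart.
    position_ids = mx.concatenate([mx.arange(n_gen), mx.arange(n_cond) + n_gen], axis=0)
    return latents, position_ids
```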

## Test Plan

### Manual Testing

Works better than Qwen-Image-Edit
2026-02-06 16:20:31 +00:00
rltakashige
9dabde7e57 Fix bench after recent updates (#1331)
## Motivation

A lot of changes happened without much attention to the state of exo
bench.

## Changes

Use `TaggedModel` for `BenchChatCompletion` so it serialises properly.
Don't break after a GPT-OSS tool call, to preserve parity with the rest of
the codebase.
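
For illustration only, a tagged (discriminated) model along these lines serialises unambiguously; `BenchChatCompletion`'s real definition in exo may look different:

```python
from typing import Literal, Union
from pydantic import BaseModel, Field

class BenchChatCompletion(BaseModel):
    type: Literal["bench"] = "bench"
    prompt_tps: float
    generation_tps: float

class ChatCompletion(BaseModel):
    type: Literal["chat"] = "chat"
    content: str

class Envelope(BaseModel):
    # The explicit "type" field tells the deserialiser which variant to build.
    completion: Union[BenchChatCompletion, ChatCompletion] = Field(discriminator="type")
```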


## Test Plan

### Manual Testing
<img width="2856" height="678" alt="image"
src="https://github.com/user-attachments/assets/2e18cf0d-c0f8-467c-9763-1a6a59c8a327"
/>

Also tested GPT OSS tool calling in OpenCode
2026-01-29 19:14:40 +00:00
Evan
b1e88a3d06 shfmt
Adds shfmt, a shell formatter, and formats the bash files.
2026-01-29 15:24:36 +00:00
rltakashige
9e58a57599 Add RDMA caveats to README.md (#1316)
## Motivation

Running RDMA from source is not well documented as-is; several surprising
issues took time to debug internally as well.

The app should be updated to detect macOS versions in the future.
2026-01-28 18:44:00 +00:00
rltakashige
077b1bc732 exo-bench (Benchmark model pp & tg speed) (#1099)
## Motivation

This PR implements benchmarking in the style of llama-bench. The main
difficulty is that exo is not a library but a service that exposes an
endpoint, so benchmark numbers would be inaccurate if measured through the
API.

The solution assumes nodes are set up with `uv run exo` (or via the app),
and then hits the new `/bench/chat/completions` endpoint to retrieve
generation statistics directly from mlx_lm.
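
For example, a script could query the endpoint roughly like this; the port and payload fields are assumptions modelled on an OpenAI-style chat-completions request, not exo's exact schema:

```python
import requests

resp = requests.post(
    "http://localhost:52415/bench/chat/completions",
    json={
        "model": "llama-3.1-8b",  # hypothetical model id
        "messages": [{"role": "user", "content": "a a a a"}],
        "max_tokens": 128,
    },
    timeout=600,
)
print(resp.json())  # expected to include pp/tg statistics reported by mlx_lm
```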

This will allow us to release benchmarks for models and perform
regression tests.

TODO: Performance benchmarking.

## Changes

- Adds /bench/chat/completions endpoint
- Adds BenchChatCompletion/Response
- Adds a logits processor to prevent the response from ending early
- Adds a "Prompt Sizer" which downloads the tokenizer and dynamically
adjusts a prompt of repeated "a" tokens to fit the desired prompt size (see
the sketches after this list)
- Reduce prefill step size to 2048 for now (in future, dynamically
adjust this value)
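
Minimal sketches of two of the pieces above (the prompt sizer and the early-stop guard); names and signatures are illustrative, not exo's actual implementation:

```python
from transformers import AutoTokenizer

def sized_prompt(model_id: str, target_tokens: int) -> str:
    """Build a prompt of repeated "a" words whose encoded length converges
    on target_tokens."""
    tok = AutoTokenizer.from_pretrained(model_id)
    words = target_tokens
    for _ in range(10):  # a few rounds are enough to converge
        prompt = " ".join(["a"] * max(words, 1))
        n = len(tok.encode(prompt))
        if n == target_tokens:
            break
        words += target_tokens - n  # nudge the word count toward the target
    return prompt

def block_eos(eos_token_id: int):
    """Logits processor that prevents the response from ending early by
    masking the EOS token."""
    def processor(tokens, logits):
        logits[..., eos_token_id] = float("-inf")
        return logits
    return processor
```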

## Test Plan

### Manual Testing
Benchmarked Llama, Qwen, DeepSeek and Kimi models. Will require several
fixes to run consistently on all configurations (to be done in the
future).
Manually tested the normal API to verify chat requests complete as
expected.

### Automated Testing
Not really possible. Type checker passes.
2026-01-06 17:39:09 +00:00
Evan Quiney
9815283a82 8000 -> 52415 (#915)
* 8000 -> 52415

* dont grab the api port for placement

---------

Co-authored-by: rltakashige <rl.takashige@gmail.com>
2025-12-18 18:39:44 +00:00
Jake Hillion
5629983809 fmt: format all python/rust/nix files 2025-12-05 16:58:55 +00:00
Evan Quiney
e702313b32 pingers
Co-authored-by: Jake Hillion <jake@hillion.co.uk>
2025-12-05 16:41:19 +00:00
rltakashige
2b243bd80e Consolidate!!! Fixes 2025-12-03 12:19:25 +00:00
rltakashige
28a91787e8 Demo
Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
2025-11-20 20:03:51 +00:00
Evan Quiney
aa519b8c03 Worker refactor
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
2025-11-10 23:31:53 +00:00
rltakashige
16f724e24c Update staging 14
Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: David Munha Canas Correia <dmunha@MacBook-David.local>
Co-authored-by: github-actions bot <github-actions@users.noreply.github.com>
2025-11-05 01:44:24 +00:00