Commit Graph

89 Commits

Author SHA1 Message Date
rltakashige
3f0df404a5 Reduce memory consumption by adding Flash Attention to Qwen3.5 and Gemma 4, and fix RotatingKVCache prefix cache memory leak (#1886)
## Motivation

Part 1 of many memory improvements.

## Changes
As written in the title

## Test Plan

### Manual Testing
Gemma 4 26B cache reduced from 54GB to 10GB per 100k tokens; Qwen3.5 35B
A3B cache reduced from 21GB to 7GB per 100k tokens.
2026-04-13 18:32:17 +01:00
rltakashige
196543ce69 Add Gemma 4 + VLM fixes + thinking parsing updates (#1851)
## Motivation
Add support for Gemma 4, including VLM!

## Changes

- Add auto parallel strategies and model cards for Gemma 4
- Normalise Gemma 4's special Vision Transformer handling to be in line
with the rest of our vision processors.
- Also adds reprs to messages and b64 hashes to prevent log spam.

## Test Plan

### Manual Testing
Tested manually on 4bit E2B and 8bit 26B

### Automated Testing
Model onboarding shows small logit diffs.

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-04-11 12:29:33 +01:00
rltakashige
43b3df45fb Fix BatchGenerator in line with upstream refactor (and prevent Qwen3.5 memory leak) (#1835)
## Motivation

MLX LM has had a massive refactor to their BatchGenerator recently.
Since we'd like new features from MLX LM such as Gemma 4, we need to
update the code to handle this.

Additionally, this fixes a significant memory leak in GatedDeltaNet (the
difference is substantial, up to 1GB every 1000 tokens, explaining
several memory issues users were facing with Qwen3.5 models).

## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>


After (no memory leak, thanks to one of the upstream changes)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
2026-04-07 11:50:12 +00:00
rltakashige
d9ed943034 Fix Nemotron cache leak upstream (#1819)
## Motivation
Nemotron Cascade and Nano were failing at long decodes.

## Changes

Fixed upstream; this PR just updates pyproject and uv.lock.


## Test Plan
### Automated Testing
Tested with a reproduce script upstream
2026-03-30 16:53:21 +00:00
rltakashige
635801d515 Add multimodality! (#1802)
## Motivation

Images!

TODO (in a future PR): Add audio and video support.

## Test Plan

### Manual Testing
<img width="2652" height="1900" alt="image"
src="https://github.com/user-attachments/assets/7d3a7137-542f-4f94-9193-2c73b7c4a5ec"
/>

<img width="2770" height="1956" alt="image"
src="https://github.com/user-attachments/assets/e3c3a096-8029-4409-97a6-aca31a9a3f24"
/>
<img width="2738" height="1768" alt="image"
src="https://github.com/user-attachments/assets/d70ea37f-cd1d-4a4c-ad08-3beb9fafa380"
/>

(And batching also works)

---------

Co-authored-by: David Hind <davehind@yahoo.co.uk>
2026-03-30 11:52:19 +01:00
Evan Quiney
1e51dc89b0 chore: bump exo-version with release version (#1807)
our pyproject.toml version was 0.3.68 - update to .69 in line with
release!!
2026-03-27 11:47:13 +00:00
vskiwi
fc1ae90111 fix: DeepSeek V3.2 warmup crash and tool calling + add catalog cards (#1769)
## Summary

DeepSeek V3.2 (`DeepseekV32ForCausalLM`) is already supported by exo's
inference engine (architecture whitelisted in `model_cards.py`, DSML
encoding added in #1548), but **doesn't work out of the box** due to two
bugs:

### Bug 1: `warmup_inference` passes empty model ID

`warmup_inference()` in `generate.py` accepts `model_id: ModelId` as a
parameter but creates `TextGenerationTaskParams(model=ModelId(""), ...)`
instead of using it. Since `_needs_dsml_encoding()` checks
`"deepseek-v3.2" in task_params.model.lower()`, the empty string never
matches → falls back to `tokenizer.apply_chat_template()` →
**ValueError** because V3.2 has no Jinja chat template.

**Fix:** `model=ModelId("")` → `model=model_id` (one line).

### Bug 2: `_needs_dsml_encoding` limited to tool calling

`_needs_dsml_encoding()` returns `True` only when `task_params.tools` is
present or tool messages exist in `chat_template_messages`. For warmup
and regular chat requests without tools → `return False` → Jinja
fallback → **ValueError**.

Unlike V3.1 (which has a `.jinja` chat template file that transformers
picks up automatically), V3.2 **has no Jinja template at all** — it uses
Python-based DSML encoding for all message types.

**Fix:** For V3.2, always return `True` — DSML encoding handles all
message types.
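A hypothetical sketch of the two fixes, condensed into one place; `TaskParams` is a stand-in for exo's `TextGenerationTaskParams`, and the real signatures differ:

```python
from dataclasses import dataclass, field

@dataclass
class TaskParams:
    # Fix 1: warmup_inference must pass the real model id here
    # instead of constructing the params with model="".
    model: str = ""
    tools: list = field(default_factory=list)

def needs_dsml_encoding(task_params: TaskParams) -> bool:
    # Fix 2: DeepSeek V3.2 ships no Jinja chat template, so DSML
    # encoding must cover every request, not just tool calls.
    if "deepseek-v3.2" in task_params.model.lower():
        return True
    return bool(task_params.tools)
```

With the old warmup path's empty model string, the check above can never match, which reproduces the ValueError chain described in Bug 1.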

### Catalog cards

Added inference model cards for:
- `mlx-community/DeepSeek-V3.2-8bit`
- `mlx-community/DeepSeek-V3.2-4bit`

Parameters taken from model `config.json` on HuggingFace, storage sizes
from HF API. Capabilities include `thinking_toggle` (related: #1456).

## Notes

- The model ID string matching approach (`"deepseek-v3.2" in
model.lower()`) is acknowledged tech debt — see #1371 for the planned
architecture-based approach.

## Test plan

- [x] Start exo with DeepSeek V3.2 model → warmup should complete
without crash
- [x] Send a regular chat message (no tools) → should get a response
- [x] Send a chat message with tools → should work as before
- [x] V3.2 cards should appear in the dashboard model catalog

---------

Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
2026-03-25 16:20:35 +00:00
rltakashige
7117d748ec Update dependencies including mlx 0.31.2 (#1789)
Update mlx fork to 0.31.2 and mflux to 0.17.2
2026-03-25 06:03:19 +00:00
rltakashige
b6240a97e8 Prevent Qwen3.5 looping by using mlx lm fork (#1784)
## Motivation

Move back to an MLX LM fork to improve the Qwen 3.5 experience.

## Test Plan

### Manual Testing
Seems to loop less from testing, no speed regressions.
2026-03-24 17:16:41 +00:00
rltakashige
7df3774ca2 Improve batch performance and stats reporting (#1777)
## Motivation

Batch generation reports incorrect statistics, as mlx lm never clears
the original stats, meaning they get polluted over time.
The dashboard also seems considerably slower than bench statistics.
We also have a large discrepancy between B=1 batch generating and
mlx_generate.
Extracting logprobs is massively expensive, causing up to a 25% slowdown
compared to pure batching.
```
[ 12:02:01.1240AM | INFO    ] step overhead: 3.49ms (next=12.49ms total=15.99ms)
[ 12:02:02.1600AM | INFO    ] step overhead: 3.23ms (next=13.01ms total=16.24ms)
[ 12:02:03.2228AM | INFO    ] step overhead: 3.28ms (next=13.38ms total=16.66ms)
[ 12:02:04.2798AM | INFO    ] step overhead: 3.25ms (next=12.84ms total=16.10ms)
[ 12:02:05.3152AM | INFO    ] step overhead: 3.18ms (next=12.61ms total=15.79ms)
[ 12:02:06.3522AM | INFO    ] step overhead: 3.41ms (next=12.83ms total=16.25ms)
[ 12:02:07.3987AM | INFO    ] step overhead: 3.38ms (next=13.14ms total=16.52ms)
[ 12:02:08.4537AM | INFO    ] step overhead: 1.84ms (next=19.44ms total=21.28ms)
```

## Changes

1. Report stats ourselves instead of using mlx lm's stats for batch
generation (they use perf_counter anyway).
2. Adjust exo bench to match
3. Improve logprobs extraction speed by 10x, improving tps for dashboard
& any requests for logprobs
4. Use an SSE comment to align the speed to the real numbers at the end
of generation
5. Patch mlx for several optimizations given our assumptions and use
cases (e.g. use vllm style RoPE).
6. Switch MLX LM version to latest main, including support for Nemotron
Super and some Qwen3.5 fixes.
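Point 3 can be illustrated with a small sketch (whether this matches the PR's exact optimization is an assumption): the sampled token's logprob needs only one log-sum-exp over the logits, rather than materializing a full-vocab log-softmax on every step.

```python
import math

def token_logprob(logits: list[float], token: int) -> float:
    # Numerically stable log-sum-exp; only the chosen token's
    # logprob is computed, not the whole distribution.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[token] - lse
```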

## Why It Works
1. Exo bench no longer reports polluted stats
2. Exo bench now handles the reported per-request stats rather than the
aggregate stats
3. The decode speed now jumps back to a real number at the end of the
generation
4. Large batch speedup for rotating KV cache models + 1:1 matching cache
with vllm

## Test Plan

### Manual Testing
Needs testing on OpenCode and CC
Needs eval testing

### Automated Testing
Only going to show the performance optimization difference after the
accurate reporting:

**GPT OSS 20B MXFP4 Q8 (large change)**
Before:
<img width="2466" height="1534" alt="image"
src="https://github.com/user-attachments/assets/88b50637-fca2-4db4-9413-b9eee6e2057e"
/>
<img width="2410" height="1240" alt="image"
src="https://github.com/user-attachments/assets/21e5c76a-2f5f-44d2-8953-121b3ebdbd68"
/>


After:
<img width="2476" height="1472" alt="image"
src="https://github.com/user-attachments/assets/fec5cfbd-fff8-430a-b12e-a329410107a2"
/>
<img width="2454" height="1236" alt="image"
src="https://github.com/user-attachments/assets/0400344b-a4a6-42c0-a9dd-4ee91ade714a"
/>



**Qwen 3.5 35B A3B 8bit (No change)**
Before:
<img width="2414" height="1396" alt="image"
src="https://github.com/user-attachments/assets/e75f0b38-df5d-49fd-ab90-bc1667d981b3"
/>


After:
<img width="2346" height="1234" alt="image"
src="https://github.com/user-attachments/assets/eabfb59c-851f-4d88-b927-e1e699a75cc6"
/>


**Llama 3.2 1B Instruct 4bit (small change)**
Before:
<img width="2516" height="1220" alt="image"
src="https://github.com/user-attachments/assets/c2873655-acff-4536-8263-fb8aea33db80"
/>

After:
<img width="2566" height="1370" alt="image"
src="https://github.com/user-attachments/assets/15f95c75-1c2f-4474-85a2-88c4d0a32543"
/>
2026-03-24 14:03:03 +00:00
ciaranbor
a6519ba006 Update mflux to 0.16.9 (#1751)
Prevents malformed output from Qwen-Image
2026-03-17 16:58:23 +00:00
rltakashige
131ad0ff36 Implement continuous batching (#1642)
## Motivation

Following the changes made in #1632 !
Closes #1020


---------

Co-authored-by: Evan Quiney <evanev7@gmail.com>
2026-03-09 15:04:45 +00:00
Daiz
28817d3ee3 Add support for Qwen3.5 (#1644)
## Motivation

Qwen3.5 MoE models (e.g., `Qwen3.5-397B-A17B-6bit`) are now supported by
`mlx-lm` via `qwen3_5_moe` model type, but exo lacks tensor parallel
sharding support for this architecture. This prevents running large
Qwen3.5 models across multiple nodes.

Qwen3.5 uses a GatedDeltaNet hybrid attention mechanism similar to
Qwen3-Next, but with a different projection layout — separate
`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a` instead of
Qwen3-Next's combined `in_proj_qkvz` and `in_proj_ba`. This requires
architecture-aware sharding logic.

## Changes (evan summary)

- enable qwen3_5 dense + moe tensor parallelism from config
- defensively skip evalling _cache.keys if it doesn't exist
- ignore kwargs in qwen35 pipeline masking and ensure pipeline segments match global model parameters for mask creation
- add sharding for qwen3_5 moe linear attention
- added another 6 million model cards

## Why It Works

Qwen3.5's GatedDeltaNet has an `in_proj_qkv` linear layer with three
concatenated sections: `[q(key_dim), k(key_dim), v(value_dim)]`. A naive
contiguous split (`segments=1`) would slice across section boundaries,
corrupting q/k/v values and producing garbled output.

By passing `segments=[key_dim, key_dim + key_dim]` to `shard_linear()`,
each section is split independently before distributing across devices.
This ensures every rank receives correctly aligned q, k, and v
components.

The remaining separate projections (`in_proj_z`, `in_proj_b`,
`in_proj_a`) and the MoE layers follow the same `all_to_sharded` /
`sharded_to_all` pattern already used for Qwen3-Next.
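The section-aware split can be sketched with plain row indices (illustrative only, not exo's `shard_linear` implementation; dimensions are invented):

```python
key_dim, value_dim, ranks = 4, 8, 2
rows = list(range(2 * key_dim + value_dim))  # row indices of the fused in_proj_qkv

def split_rows(seq, n):
    step = len(seq) // n
    return [seq[i * step:(i + 1) * step] for i in range(n)]

# Naive contiguous split ignores the q/k/v boundaries: here rank 0
# would get all of q and k while rank 1 gets only v.
naive = split_rows(rows, ranks)

# Section-aware split: cut at [key_dim, 2 * key_dim] first, then split
# q, k, v independently so every rank receives aligned slices of each.
q, k, v = rows[:key_dim], rows[key_dim:2 * key_dim], rows[2 * key_dim:]
per_rank = [sum((split_rows(s, ranks)[r] for s in (q, k, v)), [])
            for r in range(ranks)]
```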

Some pipeline splits contained neither an SSM layer nor a linear-attention layer, so that segment of the model would not create the masks required by the next layer; we patch the model to create those masks manually.

## Test Plan

tensor sharded 2,3,4 models & pipeline sharded 2,3,4 with simple eval.

---------

Co-authored-by: hw <hw@hwStudio1.local>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
2026-03-03 14:31:57 +00:00
Evan Quiney
db73c4fd5d move messaging into rust (#1549)
the main body of the rust refactor. fixes the tokio panic on shutdown.
simplifies the networking module significantly. doesn't touch lp2p
behaviour
2026-02-26 13:58:22 +00:00
rltakashige
e23c3a3026 Address Mac Mini pipeline GPU timeouts (#1620)
## Motivation
Users were reporting GPU timeout errors on Mac Minis, which we never saw
on testing with Mac Studios. It also seems to only happen with large
models.

## Changes
Eval specific distributed operations.

## Why It Works

As I wrote in a Slack message:
Basically, prefill is too slow for pipeline communications. If there are
both communications and GPU operations as part of an mlx graph, the
communications become subject to the GPU's 5 second command buffer
timeout.

For normal generation, I added evals to the communications (only during
prefill, as it slows down decode) to do this, fixing GPU timeouts.

But we don't do this during warmup, as the prompt is absolutely tiny.
Even so, some models are slow enough on an M4 Pro that a GPU timeout
can still occur during warmup...


----------------------
This was one of the issues. However, there is another issue:

mx.all_gather sometimes reads stale data with FAST_SYNCH enabled. I'm
still investigating the root cause, but the code as it is now works on
Mac Minis.



## Test Plan

### Manual Testing
<img width="2762" height="1808" alt="image"
src="https://github.com/user-attachments/assets/27c88542-606c-4551-8f7c-bd2c0471f54e"
/>

<img width="2820" height="1898" alt="image"
src="https://github.com/user-attachments/assets/0ba3478c-ee39-438d-902c-92893db23d05"
/>


### Automated Testing
Needs a bunch on mac minis
2026-02-25 17:37:32 +00:00
Evan Quiney
9a2d2a4a7c bump (#1608)
should be release for 1.0.68 - let's synch our py version with our app
version - 0.3.68.
next minor release should be 1.4.0 and 0.4.0 respectively.

Co-authored-by: rltakashige <rl.takashige@gmail.com>
2026-02-24 19:14:15 +00:00
rltakashige
14526d281a update mlx 2 (#1611)
## Motivation

GPU locks because of prompt progress callbacks taking time. Current
solution: Don't fix it, make the symptom better

## Changes
Shortened timeout by 2x
Get event leak fixes from latest upstream
2026-02-24 18:30:48 +00:00
rltakashige
05986f77aa add exo bench protobuf dependency (#1596)
kimi k2.5 requires protobuf
2026-02-23 16:45:22 +00:00
rltakashige
e01f50a5cd Update mlx fork (#1565)
## Motivation

Some fixes upstream. This sort of commit will probably be quite common
until GPU locks are resolved.
2026-02-20 17:23:52 +00:00
rltakashige
c2f2111b88 Fix tool calling (#1529)
## Motivation

GPT OSS tool calling issues.

## Changes

Fixes those and adds a bunch of evals for tool calling.
Fixes GLM5 prefix caching, where CacheList wasn't getting handled
properly.
Extracts a bunch of the setup functionality of exo bench to a harness
that can be reused elsewhere, such as in the tool calling eval.

## Test Plan
### Automated Testing
Let's run the evals for all models
2026-02-18 20:29:18 +00:00
rltakashige
48b8f86395 Add support for GLM 5 (#1526)
## Motivation

Add GLM 5 support in favor of #1513 

2026-02-18 14:04:06 +00:00
Alex Cheema
3addeadea8 Update mlx-lm to 0.30.7 (#1520)
## Summary
- Bumps `mlx-lm` from 0.30.6 to 0.30.7 in `pyproject.toml` and `uv.lock`

## Test plan
- [x] `uv lock` resolves successfully
- [x] `basedpyright` — no new errors (63 pre-existing in unrelated
`test_tool_call_tracker.py`)
- [x] `ruff check` — all checks passed
- [x] `nix fmt` — no formatting changes
- [x] `pytest` — 188 passed, 1 skipped

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 11:14:23 +00:00
rltakashige
f2be929211 Leo/address rdma gpu locks 2 (#1515)
Same as #1489. Had to revert and redo thanks to Claude.

---------

Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 14:00:52 -08:00
rltakashige
83af8c63fa Revert "Use custom fork that resolves GPU locks" (#1502)
Reverts exo-explore/exo#1489

Goddammit Claude...
2026-02-17 18:18:54 +00:00
rltakashige
facf2d4d03 Use custom fork that resolves GPU locks (#1489)
## Motivation

There is an issue on Macs that means that an explicit synchronization is
necessary for memory to be updated from L1 cache. This means that GPU
locks can occur when a spin wait does not see the updated timestamp.

## Changes

Updated in my own personal fork.

## Why It Works

https://github.com/ARM-software/acle/releases

## Test Plan

### Manual Testing
Tested manually that no GPU locks occur (even with multiple simultaneous
instances running) and that the performance differential is negligible
(267 vs 269 tps on Llama 3.2 1B at an approx 10k context.)


------------------------------------------------------
I have seen a GPU lock, specifically when sending a particularly large
chat completion while the model was loading. However, I have since been
unable to reproduce and this may be something I did wrong. Please do
create an issue and tag me if any GPU locks do occur.

---------

Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 17:48:43 +00:00
Ryuichi Leo Takashige
dc781497c5 update mlx to 0.30.6 2026-02-10 20:16:17 +00:00
Jake Hillion
305a3c8b70 event_log: move event log from unbounded in-memory list to disk (#1432)
The master and API event logs (list[Event]) grew unbounded in RAM for
the lifetime of the process. Events are rarely read back (only for
RequestEventLog when a new node catches up, or the dashboard /events
endpoint).

Introduced a DiskEventLog class that writes length-prefixed msgpack
records to an append-only file, using a bounded LRU cache of byte
offsets for indexed access. On close, the active file is compressed
with ZSTD and rotated into a numbered archive slot, keeping the last 5
archives (events.1.bin.zst through events.5.bin.zst). On construction,
any stale active file from a crash is rotated before opening a fresh
log. The /events API endpoint now streams the JSON array one event at a
time rather than materializing the full list in memory. Deserialization
routes msgpack through json.dumps into Pydantic's validate_json() to
get correct JSON-mode coercion (e.g. string to enum) under strict mode.

This bounds memory usage to the LRU cache (128 entries) regardless of
event volume, while still supporting efficient sequential reads from
disk when needed.
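The record format can be sketched as follows (illustrative, not the actual DiskEventLog code; JSON stands in for msgpack to keep the example dependency-free):

```python
import io
import json
import struct

def append_record(buf: io.BytesIO, event: dict) -> int:
    """Append one length-prefixed record; return its byte offset
    (the value a bounded LRU offset cache would store)."""
    offset = buf.seek(0, io.SEEK_END)
    payload = json.dumps(event).encode()
    buf.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
    buf.write(payload)
    return offset

def read_record(buf: io.BytesIO, offset: int) -> dict:
    buf.seek(offset)
    (length,) = struct.unpack(">I", buf.read(4))
    return json.loads(buf.read(length))
```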

Test plan:
- CI
- New unit tests for DiskEventLog: append/read, range queries, rotation
  on close, stale file recovery, idempotent close, successive sessions,
  archive retention limit (5 max)
- Tested on a cluster with 9000 events. /events continues working.
- On-disk size is 3.9MiB with ~8000 events, and the compression is very
  effective.
- Disconnected and rejoined a machine, it rejoined fine.

---------

Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
2026-02-10 17:27:32 +00:00
Jake Hillion
d79b3a0e75 bench: make exo-bench available via nix run on all platforms (#1415)
exo-bench was gated behind isDarwin in python/parts.nix because it used
exoVenv, which pulls in MLX (Darwin-only). However, exo_bench.py is an
HTTP client that only needs loguru, transformers, huggingface-hub, and
tiktoken.

Made bench a uv workspace member with its own pyproject.toml declaring
only the minimal dependencies. Added a separate benchVenv in parts.nix
built from that workspace member, and moved exo-bench out of the
isDarwin block so it is available on all platforms.

Test plan:
- `nix run .#exo-bench -- --help` prints argparse help

---------

Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
2026-02-06 21:07:17 +00:00
ciaranbor
6f0cb99919 Ciaran/flux1 kontext (#1394)
## Motivation

Add support for FLUX.1-Kontext-dev, an image editing variant of
FLUX.1-dev

## Changes

- New FluxKontextModelAdapter: handles Kontext's image-to-image workflow -
encodes the input image as conditioning latents with special position IDs,
generates from pure noise
- Model config: 57 transformer blocks (19 joint + 38 single), guidance
scale 4.0, ImageToImage task
- Pipeline updates: added kontext_image_ids property to the PromptData
interface, passed through the diffusion runner
- Model cards: added TOML configs for base, 4-bit, and 8-bit variants
- Dependency: mflux 0.15.4 → 0.15.5
- Utility: tmp/quantize_and_upload.py for quantizing and uploading
models to HuggingFace

## Test Plan

### Manual Testing

Works better than Qwen-Image-Edit
2026-02-06 16:20:31 +00:00
Jake Hillion
cf7201f91e pyproject: set minimum uv version
The uv.lock is churning constantly as different UV versions bounce it
between revisions. This is made worse by GitHub automatically hiding the
uv.lock changes, meaning it's hard to notice when this went wrong.

Set a minimum version for `uv` in pyproject.toml to fix this. I tried
quite a few versions (not all) and found 0.8.6 sets the revision to 3,
which I believe is the latest. This is from August 2025 so has been
around for a while.

Test plan:

```
jake@maverick:/data/users/jake/repos/exo/ > git checkout main uv.lock
jake@maverick:/data/users/jake/repos/exo/ > nix shell github:nixos/nixpkgs/3dce7f4a77812afd69efcbfe15e5223f98c5c69e#uv --command sh -c 'uv add pip --frozen && uv lock && uv remove pip --frozen && uv lock && uv --version'

Resolved 140 packages in 147ms
Added pip v26.0.1
Resolved 139 packages in 48ms
Removed pip v26.0.1
uv 0.8.6
```
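For reference, the pin presumably looks something like this in `pyproject.toml` (assuming uv's `required-version` setting; the exact stanza isn't shown in the commit message):

```
[tool.uv]
required-version = ">=0.8.6"
```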
2026-02-06 15:28:10 +00:00
rltakashige
b315035ae0 Add minimax and fix qwen sharding strategies (#1318)
## Motivation

MiniMax tensor sharding does not provide equivalent outputs to running
it as a single node because RMSNorm weights cannot be split without
affecting the output.

Qwen3Next sharding was broken, and something with Qwen3MoE was likely
changed upstream, as several variables no longer exist.

This also ballooned into fixing prefix caching for non-standard models
as Qwen3Next was behaving weirdly.


## Test Plan

### Manual Testing
Worked for a 8 hour long eval at the same performance and a more similar
completion/reasoning token distribution.

---------

Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
2026-02-06 13:26:59 +00:00
Alex Cheema
b3c8f85fc8 Update MLX to 0.30.4 (#1311)
## Summary
- Bump mlx from 0.30.3 to 0.30.4

## Test plan
- [x] `uv lock` succeeds
- [x] Type checking passes (`uv run basedpyright`)
- [x] Run inference tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 04:30:21 -08:00
rltakashige
a562114ba5 Add Kimi K2.5 support (#1302)

---------

Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
2026-01-28 05:44:19 +00:00
rltakashige
9968abe816 Leo/fix basic model shard (#1291)
## Motivation

Some models, on some configurations, would have several issues that
caused the model to be stuck on loading.

## Changes

Several loading issues were with upstream mlx lm shard loading for
tensor parallel.
GLM 4.7 Flash now uses GLM 4.7 Lite.
A final portion of the issues were from mlx memory not being properly
released before calling mx.eval(model), causing the system to run out of
memory.

## Test Plan

### Manual Testing
Done a bunch (thanks @AlexCheema), hopefully exhaustive. 

### Automated Testing
A bunch of automated testing is imminent but not landed yet.

---------

Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
2026-01-26 17:49:09 +00:00
ciaranbor
23fd37fe4d Add FLUX.1-Krea-dev model (#1269)
## Why It Works

Same implementation as FLUX.1-dev, just different weights
2026-01-23 19:48:24 +00:00
ciaranbor
6a9251b920 Add mflux type stubs (#1234)
## Motivation

Simplify image generation review
2026-01-21 15:07:42 +00:00
Evan Quiney
22b5d836ef swap all instances of model_id: str for model_id: ModelId (#1221)
This change uses the stronger typed ModelId, and introduces some
convenience methods. It also cleans up some code left over from #1204.

## Changes

`model_id: str -> model_id: ModelId`
`repo_id: str -> model_id: ModelId`

Introduces methods on ModelId, in particular ModelId.normalize() to
replace `/` with `--`.

This PR did introduce some circular imports, so has moved some code
around to try and limit them.

## Test Plan

Tests still pass, types still check. As this is about metadata, I
haven't tested inference.
2026-01-20 17:38:06 +00:00
Alex Cheema
176ab5ba40 Add GLM-4.7-Flash model cards (4bit, 5bit, 6bit, 8bit) (#1214)
## Motivation

Add support for GLM-4.7-Flash, a lighter variant of GLM-4.7 with the
`glm4_moe_lite` architecture. These models are smaller and faster while
maintaining good performance.

## Changes

1. **Added 4 new model cards** for GLM-4.7-Flash variants:
   - `glm-4.7-flash-4bit` (~18 GB)
   - `glm-4.7-flash-5bit` (~21 GB)
   - `glm-4.7-flash-6bit` (~25 GB)
   - `glm-4.7-flash-8bit` (~32 GB)

   All variants have:
   - `n_layers`: 47 (vs 91 in GLM-4.7)
   - `hidden_size`: 2048 (vs 5120 in GLM-4.7)
   - `supports_tensor`: True (native `shard()` method)

2. **Bumped mlx from 0.30.1 to 0.30.3** - required by mlx-lm 0.30.4

3. **Updated mlx-lm from 0.30.2 to 0.30.4** - adds `glm4_moe_lite`
architecture support

4. **Added type ignores** in `auto_parallel.py` for stricter type
annotations in new mlx-lm

5. **Fixed EOS token IDs** for GLM-4.7-Flash - uses different tokenizer
with IDs `[154820, 154827, 154829]` vs other GLM models' `[151336,
151329, 151338]`

6. **Renamed `MLX_IBV_DEVICES` to `MLX_JACCL_DEVICES`** - env var name
changed in new mlx

## Why It Works

The model cards follow the same pattern as existing GLM-4.7 models.
Tensor parallel support is enabled because GLM-4.7-Flash implements the
native `shard()` method in mlx-lm 0.30.4, which is automatically
detected in `auto_parallel.py`.

GLM-4.7-Flash uses a new tokenizer with different special token IDs.
Without the correct EOS tokens, generation wouldn't stop properly.

## Test Plan

### Manual Testing
Tested generation with GLM-4.7-Flash-4bit - now correctly stops at EOS
tokens.

### Automated Testing
- `basedpyright`: 0 errors
- `ruff check`: All checks passed
- `pytest`: 162/162 tests pass (excluding pre-existing
`test_distributed_fix.py` timeout failures)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 03:58:09 +00:00
Evan Quiney
39ee2bf7bd switch from synchronous threaded pinging to an async implementation (#1170)
still seeing churn in our networking - let's properly rate limit it

## changes

added an httpx client with max connections with a persistent AsyncClient

## testing

deployed on cluster, discovery VASTLY more stable (the only deleted
edges were those discovered by mdns)
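The rate-limiting idea can be sketched without httpx (a semaphore plays the role of the connection cap; in the PR itself a persistent `httpx.AsyncClient` with bounded connections does this):

```python
import asyncio

async def ping_all(targets, ping, max_inflight=8):
    # Cap in-flight pings so discovery can't stampede the network,
    # unlike one-thread-per-ping which had no global limit.
    sem = asyncio.Semaphore(max_inflight)

    async def one(target):
        async with sem:
            return await ping(target)

    return await asyncio.gather(*(one(t) for t in targets))
```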
2026-01-16 13:20:03 +00:00
Alex Cheema
e5e74e1eef Upgrade mlx-lm to 0.30.2 with transformers 5.x compatibility (#1125)
## Motivation

Upgrade mlx-lm to version 0.30.2 which requires transformers 5.0.0rc2 as
a prerelease dependency. This enables support for newer models like Kimi
K2 Thinking while maintaining compatibility with existing models.

The transformers 5.x release includes breaking changes that affect
custom tokenizers like Kimi's TikTokenTokenizer, requiring compatibility
fixes.

## Changes

### Core Changes
- **mlx-lm upgrade**: Bump to 0.30.2 with locked exact versions for
mlx/mlx-lm to prevent breaking changes
- **transformers 5.x compatibility**: Enable prerelease transformers
dependency

### Kimi K2 Tokenizer Fixes
- Add `bytes_to_unicode` monkey-patch to restore function moved in
transformers 5.0.0rc2
- Load `TikTokenTokenizer` directly instead of via `AutoTokenizer` to
bypass transformers 5.x bug with `auto_map` fallback
- Patch `encode()` to use tiktoken directly with `allowed_special="all"`
to handle special tokens from chat templates

### Other Changes
- Dashboard: Show disk usage for completed model downloads
- CI: Add `workflow_dispatch` trigger to build-app workflow
- Docs: Add basic API documentation

### Testing
- Add comprehensive tokenizer unit tests for all supported models
- Tests verify encode/decode, special token handling, and chat template
encoding

## Why It Works

**bytes_to_unicode issue**: transformers 5.0.0rc2 moved
`bytes_to_unicode` from `transformers.models.gpt2.tokenization_gpt2` to
`transformers.convert_slow_tokenizer`. Kimi's `tokenization_kimi.py`
imports from the old location. The monkey-patch restores it at module
load time.
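The mechanism of such a monkey-patch can be shown without transformers installed (module names here are stand-ins, not the real transformers paths):

```python
import sys
import types

# Old code does `from old_location import bytes_to_unicode`; the real
# function now lives elsewhere. Re-registering it at the old module
# path at load time keeps the old import working.
def bytes_to_unicode():
    return {i: chr(i) for i in range(256)}

old_mod = types.ModuleType("old_location")
old_mod.bytes_to_unicode = bytes_to_unicode
sys.modules["old_location"] = old_mod
```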

**AutoTokenizer issue**: transformers 5.x has a bug where
`tokenizer_class_from_name('TikTokenTokenizer')` returns `None` for
custom tokenizers with `auto_map`. Loading the tokenizer directly
bypasses this.

**encode() issue**: transformers 5.x's `pad()` method fails for slow
tokenizers. Using tiktoken's encode directly with
`allowed_special="all"` avoids this path and properly handles special
tokens like `<|im_user|>` from chat templates.

## Test Plan

### Manual Testing
- Hardware: 2x Mac Studios connected via Thunderbolt 5 (mike22 and
james21)
- Tested Kimi K2 Thinking, GPT-OSS-120B, GPT-OSS-20B, LLama-3.1-8B-bf16, qwen3-30B-A3B-8bit model with pipeline parallelism across both
nodes
- Verified warmup inference completes successfully
- Verified chat completions work with special tokens

### Automated Testing
- Added `test_tokenizers.py` with 31 tests covering:
- Basic encode/decode for all model families (deepseek, kimi, llama,
qwen, gpt-oss, glm)
  - Special token encoding (critical for chat templates)
  - Chat template application and encoding
  - Kimi-specific and GLM-specific edge cases
- All tests pass: `uv run pytest
src/exo/worker/tests/unittests/test_mlx/test_tokenizers.py`

### Failing Tests
RDMA with all models.

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-01-13 12:06:04 +00:00
Evan
cca8c9984a cleanup unused dependencies
we have a lot of dependencies we have no intent of using. kill them with
fire!

## testing
exo still launches and does the worst inference known to man on my Qwen3
instance. tests pass too!!
2026-01-09 13:11:58 +00:00
Evan Quiney
9d9e24f969 some dashboard updates (#1017)
Mostly @samiamjidkhan and @AlexCheema's work in progress.

---------

Co-authored-by: Sami Khan <smsak99@gmail.com>
Co-authored-by: Alex Cheema
2025-12-28 20:50:23 +00:00
Jake Hillion
b5d424b658 placement: generate per-node host lists for MLX ring backend
Pipeline + MLX Ring worked with 2 nodes but failed to initialize with
3 or more nodes. The MLX ring backend requires each node to know its
specific left and right neighbors in the ring, but the previous
implementation provided a single flat host list shared by all nodes.

With 2 nodes, a flat list [host0, host1] accidentally worked because
each node could find its only neighbor. With 3+ nodes, each node needs
a customized view:
- Rank 0: [self, right_neighbor, placeholder]
- Rank 1: [left_neighbor, self, right_neighbor]
- Rank 2: [placeholder, left_neighbor, self]

Changed MlxRingInstance from `hosts: list[Host]` to
`hosts_by_node: dict[NodeId, list[Host]]` with `ephemeral_port: int`.

Added `get_mlx_ring_hosts_by_node()` which generates per-node host
lists where:
- Self position uses 0.0.0.0 for local binding
- Left/right neighbors use actual connection IPs
- Non-neighbors use 198.51.100.1 (RFC 5737 TEST-NET-2 placeholder)

Also added IP prioritization (en0 > en1 > non-Thunderbolt > any) to
prefer stable network interfaces.
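The per-node list construction above can be sketched as follows; a minimal sketch under stated assumptions (the function name and the `connect_ip` mapping are illustrative, not exo's actual API):

```python
SELF_BIND = "0.0.0.0"          # a node binds locally at its own slot
PLACEHOLDER = "198.51.100.1"   # RFC 5737 TEST-NET-2: never a real peer

def ring_hosts_by_node(node_ids, connect_ip):
    """node_ids is the pipeline rank order; connect_ip[(a, b)] is the
    IP that node a should use to reach node b."""
    n = len(node_ids)
    hosts = {}
    for rank, node in enumerate(node_ids):
        row = [PLACEHOLDER] * n
        row[rank] = SELF_BIND
        if rank > 0:                     # left neighbor exists
            row[rank - 1] = connect_ip[(node, node_ids[rank - 1])]
        if rank < n - 1:                 # right neighbor exists
            row[rank + 1] = connect_ip[(node, node_ids[rank + 1])]
        hosts[node] = row
    return hosts
```

For three nodes this reproduces the rank layouts listed above: only a node's own slot and its immediate neighbors carry real addresses, so the 2-node case no longer relies on the flat list accidentally lining up.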

Fixed topology discovery recording loopback addresses (127.0.0.1) as
valid connections to remote nodes. The reachability check now verifies
node identity via HTTP GET /node_id rather than just checking if the
port is open.
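The identity check can be sketched like this (the `/node_id` endpoint is from the description above; the function shape and names are assumptions):

```python
import urllib.request

def verify_peer(ip: str, port: int, expected_node_id: str,
                timeout: float = 2.0) -> bool:
    """Treat a peer address as reachable only if the node behind it
    identifies as the expected peer. A loopback address that answers
    for the *local* node returns a different id and is rejected."""
    try:
        url = f"http://{ip}:{port}/node_id"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip() == expected_node_id
    except OSError:
        return False
```

A plain open-port probe cannot distinguish "remote node reachable" from "my own server answering on 127.0.0.1", which is exactly the bug described above.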

Test plan:

- Built a DMG [0]
- Installed on all Macs and started cluster.
- Requested a 3 node Pipeline + MLX Ring Llama 3.3 70B (FP16).
- It started and I was able to send a few chat messages.

Eventually my instance seemed to get into a broken state and chat
stopped working, but this commit is a clear step forward.

[0] https://github.com/exo-explore/exo/actions/runs/20473983471/job/58834969418
2025-12-28 20:38:20 +00:00
Evan Quiney
8e9332d6a7 Separate out the Runner's behaviour into a "connect" phase and a "load" phase (#1006)
## Motivation

We should ensure all runners are connected before loading the model -
this gives us finer grained control in the future for the workers
planning mechanism over the runners state.

## Changes

- Introduced task ConnectToGroup, preceding LoadModel
- Introduced runner statuses Idle, Connecting, Connected
- Separated out initialize_mlx from shard_and_load
- Single instances never go through the connecting phase
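The phases above can be pictured as a small state machine; a hedged sketch assuming this transition shape (the enum and table are illustrative, not exo's actual types):

```python
from enum import Enum, auto

class RunnerStatus(Enum):
    IDLE = auto()
    CONNECTING = auto()
    CONNECTED = auto()
    LOADED = auto()

# Hypothetical transition table: ConnectToGroup drives
# IDLE -> CONNECTING -> CONNECTED, then LoadModel drives
# CONNECTED -> LOADED. Single instances skip the connecting
# phase, hence the direct IDLE -> LOADED edge.
TRANSITIONS = {
    RunnerStatus.IDLE: {RunnerStatus.CONNECTING, RunnerStatus.LOADED},
    RunnerStatus.CONNECTING: {RunnerStatus.CONNECTED},
    RunnerStatus.CONNECTED: {RunnerStatus.LOADED},
}

def advance(state: RunnerStatus, target: RunnerStatus) -> RunnerStatus:
    """Refuse any transition not in the table, e.g. loading a model
    on a runner that has not finished connecting."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state.name} -> {target.name}")
    return target
```

Making the illegal transitions unrepresentable is what gives the planning mechanism the finer-grained control over runner state mentioned above.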

## Test Plan

### Automated Testing
Added a test for checking event ordering in a standard workflow.

### Manual Testing
Verified that Llama 3.2 1B and Kimi K2 Thinking load and shut down
repeatedly on multiple configurations.
Not exhaustive, however.

---------

Co-authored-by: rltakashige <rl.takashige@gmail.com>
2025-12-27 16:28:42 +00:00
Jake Hillion
1c1792f5e8 mlx: update to 0.30.1 and align coordinator naming with MLX conventions
The Jaccl distributed backend requires MLX 0.30.1+, which includes the
RDMA over Thunderbolt support. The previous minimum version (0.29.3)
would fail at runtime with "The only valid values for backend are
'any', 'mpi' and 'ring' but 'jaccl' was provided."

Bump MLX dependency to >=0.30.1 and rename ibv_coordinators to
jaccl_coordinators to match MLX's naming conventions. This includes
the environment variable change from MLX_IBV_COORDINATOR to
MLX_JACCL_COORDINATOR.
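For anyone with the old variable in a launch script, the rename looks like this (the coordinator address is a hypothetical example value):

```shell
# before (MLX < 0.30.1 naming)
export MLX_IBV_COORDINATOR=10.0.0.1:5000

# after (MLX >= 0.30.1 naming)
export MLX_JACCL_COORDINATOR=10.0.0.1:5000
```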

Test plan:

Hardware setup: 3x Mac Studio M3 Ultra connected all-to-all with TB5

- Built a DMG [0]
- Installed on all Macs and started cluster.
- Requested a 2 node Tensor + MLX RDMA instance of Llama 3.3 70B (FP16).
- It started successfully.
- Queried the chat a few times. All was good. This didn't work
  previously.
- Killed the instance and spawned Pipeline + MLX Ring Llama 3.3 70B (FP16).
  It also started successfully on two nodes and could be queried.

Still not working:
- Pipeline + MLX Ring on 3 nodes is failing. Haven't debugged that yet.

[0] https://github.com/exo-explore/exo/actions/runs/20467656904/job/58815275013
2025-12-24 16:47:01 +00:00
Jake Hillion
02c915a88d pyproject: drop pathlib dependency 2025-12-22 17:52:44 +00:00
Jake Hillion
dd0638b74d pyproject: add pyinstaller to dev-dependencies 2025-12-22 15:49:27 +00:00
Evan Quiney
c9e2062f6e switch from uvicorn to hypercorn 2025-12-05 17:29:06 +00:00
rltakashige
2b243bd80e Consolidate!!! Fixes 2025-12-03 12:19:25 +00:00
rltakashige
b45cbdeecd Consolidate cleanup 2025-11-21 14:54:02 +00:00