Commit Graph

17 Commits

Author SHA1 Message Date
rltakashige
28c797846a Update mlx and mlx lm to latest (#1906)
Just bumping to the very latest upstream versions.
2026-04-16 10:59:33 +00:00
rltakashige
3f0df404a5 Reduce memory consumption by adding Flash Attention to Qwen3.5 and Gemma 4, and fix RotatingKVCache prefix cache memory leak (#1886)
## Motivation

Part 1 of many memory improvements.

## Changes
Added Flash Attention to Qwen3.5 and Gemma 4, and fixed the
RotatingKVCache prefix cache memory leak.

## Test Plan

### Manual Testing
Gemma 4 26B cache reduced from 54 GB to 10 GB per 100k tokens; Qwen3.5 35B
A3B cache reduced from 21 GB to 7 GB per 100k tokens.
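For scale, KV cache growth per token follows directly from the model shape. A hedged back-of-envelope helper (the example dimensions below are illustrative, not the real Gemma or Qwen configs):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: 2 tensors (K and V) per layer, per KV head,
    per head dim, per token. bytes_per_elem=2 assumes fp16/bf16 entries."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens / 2**30

# Illustrative shape (NOT a real model config): 48 layers, 8 KV heads,
# head_dim 128, 100k tokens in fp16 -> roughly 18.3 GiB.
size = kv_cache_gib(48, 8, 128, 100_000)
```

A rotating cache bounds `tokens` at its window size, which is why a prefix cache that leaks past that window dominates memory at long contexts.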
2026-04-13 18:32:17 +01:00
rltakashige
43b3df45fb Fix BatchGenerator in line with upstream refactor (and prevent Qwen3.5 memory leak) (#1835)
## Motivation

MLX LM recently had a massive refactor of its BatchGenerator.
Since we'd like new features from MLX LM such as Gemma 4, we need to
update our code to handle this.

Additionally, this fixes a significant memory leak in GatedDeltaNet. The
difference is quite substantial (up to 1 GB every 1,000 tokens), which
explains several memory issues users were facing with Qwen3.5 models.

## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>


After (no memory leak, thanks to one of the upstream changes)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
2026-04-07 11:50:12 +00:00
ciaranbor
f28b2fd037 Extract mlx revision from uv lock (#1715)
## Motivation

The MLX version and git revision in nix/mlx.nix were hardcoded and had
to be manually kept in sync with uv.lock

## Changes

- flake.nix: Extract MLX git rev from uv.lock's source.git URL and pass
as uvLockMlxRev
- nix/mlx.nix: Use uvLockMlxVersion and uvLockMlxRev instead of
hardcoded values; remove version mismatch assertion

## Why It Works

uv.lock is already the source of truth — now Nix reads both version and
rev from it directly. The pinned fetchFromGitHub hash still guards
against unexpected changes.
2026-03-13 12:34:54 +00:00
rltakashige
e23c3a3026 Address Mac Mini pipeline GPU timeouts (#1620)
## Motivation
Users were reporting GPU timeout errors on Mac Minis, which we never saw
in testing on Mac Studios. It also seems to happen only with large
models.

## Changes
Explicitly eval specific distributed operations.

## Why It Works

As I wrote in a Slack message:
Basically, prefill is too slow for pipeline communications. If there are
both communications and GPU operations as part of an mlx graph, the
communications become subject to the GPU's 5 second command buffer
timeout.

For normal generation, I added evals to the communications (only during
prefill, since evaluating during decode slows it down), which fixed the
GPU timeouts.

But we don't do this during warmup, as the warmup prompt is absolutely
tiny. Even so, some models are still slow enough on an M4 Pro that warmup
itself causes a GPU timeout...


----------------------
This was one of the issues. However, there is another issue:

mx.all_gather sometimes reads stale data with FAST_SYNCH enabled. I'm
still investigating the root cause, but the code as it is now works on
Mac Minis.
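The timeout mechanism can be illustrated with a toy model. This is plain Python, not MLX: the 5-second figure comes from the explanation above, and everything else (op names, durations) is illustrative.

```python
TIMEOUT_S = 5.0  # Metal command buffer timeout, per the explanation above


def command_buffers(ops, eval_after):
    """Split a list of (name, seconds) ops into command buffers.

    An eval after op index i flushes the current buffer, so later ops
    (e.g. a collective) run in a fresh buffer of their own.
    """
    buffers, current = [], []
    for i, op in enumerate(ops):
        current.append(op)
        if i in eval_after:
            buffers.append(current)
            current = []
    if current:
        buffers.append(current)
    return buffers


def any_timeout(bufs):
    """A buffer times out if its ops together exceed the GPU timeout."""
    return any(sum(t for _, t in buf) > TIMEOUT_S for buf in bufs)


ops = [("prefill_matmuls", 4.0), ("all_gather", 2.0)]
# Fused into one graph: a single 6s buffer, which exceeds the timeout.
fused_times_out = any_timeout(command_buffers(ops, eval_after=set()))
# Eval between prefill and the collective: 4s + 2s buffers, both fine.
split_times_out = any_timeout(command_buffers(ops, eval_after={0}))
```

The sketch only models buffer durations; the actual fix is inserting `mx.eval` calls around the distributed ops during prefill.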



## Test Plan

### Manual Testing
<img width="2762" height="1808" alt="image"
src="https://github.com/user-attachments/assets/27c88542-606c-4551-8f7c-bd2c0471f54e"
/>

<img width="2820" height="1898" alt="image"
src="https://github.com/user-attachments/assets/0ba3478c-ee39-438d-902c-92893db23d05"
/>


### Automated Testing
A batch of automated tests on Mac Minis is still needed.
2026-02-25 17:37:32 +00:00
rltakashige
14526d281a update mlx 2 (#1611)
## Motivation

GPU locks occur because prompt progress callbacks take too long. Current
solution: don't fix the root cause, mitigate the symptom.

## Changes
- Shortened the timeout by 2x
- Pulled event leak fixes from the latest upstream
2026-02-24 18:30:48 +00:00
rltakashige
e01f50a5cd Update mlx fork (#1565)
## Motivation

Some fixes upstream. This sort of commit will probably be quite common
until GPU locks are resolved.
2026-02-20 17:23:52 +00:00
rltakashige
48b8f86395 Add support for GLM 5 (#1526)
## Motivation

Add GLM 5 support, superseding #1513

2026-02-18 14:04:06 +00:00
rltakashige
f2be929211 Leo/address rdma gpu locks 2 (#1515)
Same as #1489 . Had to revert and redo thanks to Claude.

---------

Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 14:00:52 -08:00
rltakashige
83af8c63fa Revert "Use custom fork that resolves GPU locks" (#1502)
Reverts exo-explore/exo#1489

Goddammit Claude...
2026-02-17 18:18:54 +00:00
rltakashige
facf2d4d03 Use custom fork that resolves GPU locks (#1489)
## Motivation

There is an issue on Macs where an explicit synchronization is necessary
for memory writes to become visible beyond the L1 cache. As a result, GPU
locks can occur when a spin wait never sees the updated timestamp.

## Changes

Updated in my own personal fork.

## Why It Works

https://github.com/ARM-software/acle/releases

## Test Plan

### Manual Testing
Tested manually that no GPU locks occur (even with multiple simultaneous
instances running) and that the performance differential is negligible
(267 vs 269 tps on Llama 3.2 1B at approximately 10k context).


------------------------------------------------------
I have seen one GPU lock, specifically when sending a particularly large
chat completion while the model was loading. However, I have since been
unable to reproduce it, and it may have been something I did wrong. Please
do create an issue and tag me if any GPU locks occur.

---------

Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 17:48:43 +00:00
Ryuichi Leo Takashige
dc7ade8052 set the mlx hash 2026-02-10 20:16:17 +00:00
Ryuichi Leo Takashige
dc781497c5 update mlx to 0.30.6 2026-02-10 20:16:17 +00:00
Jake Hillion
8af2af6328 nix: override apple-sdk to 26.2 and enable MLX_BUILD_CPU (#1443)
The pinned nixpkgs provides apple-sdk 26.0, but building MLX requires
SDK 26.2. The upstream package reads versions.json via a relative path
at eval time, so it can't be overridden through callPackage args.

Added a thin overlay that copies the upstream apple-sdk source and
patches only metadata/versions.json to point at SDK 26.2. Also enabled
MLX_BUILD_CPU in the MLX nix build.

This avoids vendoring the entire apple-sdk package (~2200 lines) while
still getting the SDK version we need.

Test plan:
- CI
- Built and ran on two machines connected with Thunderbolt 5 - Kimi K2.5
starts in Tensor+RDMA and seems sensible.
2026-02-10 19:53:53 +00:00
rltakashige
dcb4cabc15 Update the nix hash for mlx 0.30.5 (#1416)
2026-02-06 21:27:10 +00:00
rltakashige
b315035ae0 Add minimax and fix qwen sharding strategies (#1318)
## Motivation

MiniMax tensor sharding does not provide equivalent outputs to running
it as a single node because RMSNorm weights cannot be split without
affecting the output.
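The RMSNorm claim is easy to verify numerically. A minimal stdlib sketch (toy 4-element vector split into two shards, not MiniMax's actual dimensions):

```python
import math


def rmsnorm(x, w, eps=1e-6):
    """RMSNorm: scale each element by its learned weight over the vector's RMS."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [wi * xi / rms for wi, xi in zip(w, x)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 1.0, 1.5, 2.0]
full = rmsnorm(x, w)

# Naively splitting the hidden dim across two shards: each shard computes
# its RMS over only its own slice, so the normalizer itself changes and
# the concatenated result differs from the unsharded output.
sharded = rmsnorm(x[:2], w[:2]) + rmsnorm(x[2:], w[2:])
```

Each shard would need the full sum of squares (an all-reduce) to reproduce the single-node normalizer, which is why the weights can't simply be split.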

Qwen3Next sharding was broken, and something in Qwen3MoE was likely
changed upstream, as several variables no longer exist.

This also ballooned into fixing prefix caching for non-standard models
as Qwen3Next was behaving weirdly.

## Test Plan

### Manual Testing
Worked through an 8-hour eval at the same performance and with a more
similar completion/reasoning token distribution.

---------

Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
2026-02-06 13:26:59 +00:00
Jake Hillion
ebeddfb308 mlx: build with Nix (#1285)
In order to make testing and deployment simpler and more reproducible,
we want to provide a Nix derivation for our macOS .app build. We already
build the Rust and dashboard with Nix, but so far the Python build has
been blocked because we haven't had an MLX build.

This change adds a Metal compiler derivation that uses `requireFile` to
be provided a NAR of the unfree macOS Metal compiler. It is documented
how to get this file, but effectively you have to trigger the download,
mount the DMG, and NAR the result. Once this is added to the store by
hash, we can build MLX using it. The MLX build itself is quite
self-explanatory.

Test plan:
- CI. We follow the instructions to grab the Metal compiler. Once this
is in Cachix we should never need to do it again, and I can also pin the
path to ensure it doesn't leave the cache.
- MLX tests run as part of the MLX derivation's build. They pass.
- `NIXPKGS_ALLOW_UNFREE=1 nix build .#mlx.passthru.tests.mlxTest
--impure --option sandbox false`

---------

Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
2026-01-29 14:07:00 +00:00