Commit Graph

17 Commits

Author SHA1 Message Date
rltakashige
28c797846a Update mlx and mlx lm to latest (#1906)
Just bumping to the very latest upstream versions.
2026-04-16 10:59:33 +00:00
rltakashige
3f0df404a5 Reduce memory consumption by adding Flash Attention to Qwen3.5 and Gemma 4, and fix RotatingKVCache prefix cache memory leak (#1886)
## Motivation

Part 1 of many memory improvements.

## Changes
Added Flash Attention to Qwen3.5 and Gemma 4, and fixed the
RotatingKVCache prefix cache memory leak.

## Test Plan

### Manual Testing
Gemma 4 26B cache reduced from 54 GB to 10 GB per 100k tokens; Qwen3.5 35B
A3B cache reduced from 21 GB to 7 GB per 100k tokens.
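For scale, KV cache growth per token follows directly from the model shape. A hedged back-of-envelope helper (the example dimensions below are illustrative, not the real Gemma or Qwen configs):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: 2 tensors (K and V) per layer, per KV head,
    per head dim, per token. bytes_per_elem=2 assumes fp16/bf16 entries."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens / 2**30

# Illustrative shape (NOT a real model config): 48 layers, 8 KV heads,
# head_dim 128, 100k tokens in fp16 -> roughly 18.3 GiB.
size = kv_cache_gib(48, 8, 128, 100_000)
```

A rotating cache bounds `tokens` at its window size, which is why a prefix cache that leaks past that window dominates memory at long contexts.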
2026-04-13 18:32:17 +01:00
rltakashige
43b3df45fb Fix BatchGenerator in line with upstream refactor (and prevent Qwen3.5 memory leak) (#1835)
## Motivation

MLX LM recently had a massive refactor of its BatchGenerator.
Since we'd like new features from MLX LM such as Gemma 4, we need to
update our code to handle this.

Additionally, this fixes a significant memory leak in GatedDeltaNet. The
difference is quite substantial (up to 1 GB every 1,000 tokens), which
explains several memory issues users were facing with Qwen3.5 models.

## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>


After (no memory leak, thanks to one of the upstream changes)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
2026-04-07 11:50:12 +00:00
ciaranbor
f28b2fd037 Extract mlx revision from uv lock (#1715)
## Motivation

The MLX version and git revision in nix/mlx.nix were hardcoded and had
to be manually kept in sync with uv.lock

## Changes

- flake.nix: Extract MLX git rev from uv.lock's source.git URL and pass
as uvLockMlxRev
- nix/mlx.nix: Use uvLockMlxVersion and uvLockMlxRev instead of
hardcoded values; remove version mismatch assertion

## Why It Works

uv.lock is already the source of truth — now Nix reads both version and
rev from it directly. The pinned fetchFromGitHub hash still guards
against unexpected changes.
2026-03-13 12:34:54 +00:00
rltakashige
e23c3a3026 Address Mac Mini pipeline GPU timeouts (#1620)
## Motivation
Users were reporting GPU timeout errors on Mac Minis, which we never saw
in testing on Mac Studios. It also seems to happen only with large
models.

## Changes
Explicitly eval specific distributed operations.

## Why It Works

As I wrote in a Slack message:
Basically, prefill is too slow for pipeline communications. If there are
both communications and GPU operations as part of an mlx graph, the
communications become subject to the GPU's 5 second command buffer
timeout.

For normal generation, I added evals to the communications (only during
prefill, since evaluating during decode slows it down), which fixed the
GPU timeouts.

But we don't do this during warmup, as the warmup prompt is absolutely
tiny. Even so, some models are still slow enough on an M4 Pro that warmup
itself causes a GPU timeout...


----------------------
This was one of the issues. However, there is another issue:

mx.all_gather sometimes reads stale data with FAST_SYNCH enabled. I'm
still investigating the root cause, but the code as it is now works on
Mac Minis.
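The timeout mechanism can be illustrated with a toy model. This is plain Python, not MLX: the 5-second figure comes from the explanation above, and everything else (op names, durations) is illustrative.

```python
TIMEOUT_S = 5.0  # Metal command buffer timeout, per the explanation above


def command_buffers(ops, eval_after):
    """Split a list of (name, seconds) ops into command buffers.

    An eval after op index i flushes the current buffer, so later ops
    (e.g. a collective) run in a fresh buffer of their own.
    """
    buffers, current = [], []
    for i, op in enumerate(ops):
        current.append(op)
        if i in eval_after:
            buffers.append(current)
            current = []
    if current:
        buffers.append(current)
    return buffers


def any_timeout(bufs):
    """A buffer times out if its ops together exceed the GPU timeout."""
    return any(sum(t for _, t in buf) > TIMEOUT_S for buf in bufs)


ops = [("prefill_matmuls", 4.0), ("all_gather", 2.0)]
# Fused into one graph: a single 6s buffer, which exceeds the timeout.
fused_times_out = any_timeout(command_buffers(ops, eval_after=set()))
# Eval between prefill and the collective: 4s + 2s buffers, both fine.
split_times_out = any_timeout(command_buffers(ops, eval_after={0}))
```

The sketch only models buffer durations; the actual fix is inserting `mx.eval` calls around the distributed ops during prefill.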



## Test Plan

### Manual Testing
<img width="2762" height="1808" alt="image"
src="https://github.com/user-attachments/assets/27c88542-606c-4551-8f7c-bd2c0471f54e"
/>

<img width="2820" height="1898" alt="image"
src="https://github.com/user-attachments/assets/0ba3478c-ee39-438d-902c-92893db23d05"
/>


### Automated Testing
A batch of automated tests on Mac Minis is still needed.
2026-02-25 17:37:32 +00:00
rltakashige
14526d281a update mlx 2 (#1611)
## Motivation

GPU locks occur because prompt progress callbacks take too long. Current
solution: don't fix the root cause, mitigate the symptom.

## Changes
- Shortened the timeout by 2x
- Pulled event leak fixes from the latest upstream
2026-02-24 18:30:48 +00:00
rltakashige
e01f50a5cd Update mlx fork (#1565)
## Motivation

Some fixes upstream. This sort of commit will probably be quite common
until GPU locks are resolved.
2026-02-20 17:23:52 +00:00
rltakashige
48b8f86395 Add support for GLM 5 (#1526)
## Motivation

Add GLM 5 support, superseding #1513

2026-02-18 14:04:06 +00:00
rltakashige
f2be929211 Leo/address rdma gpu locks 2 (#1515)
Same as #1489 . Had to revert and redo thanks to Claude.

---------

Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 14:00:52 -08:00
rltakashige
83af8c63fa Revert "Use custom fork that resolves GPU locks" (#1502)
Reverts exo-explore/exo#1489

Goddammit Claude...
2026-02-17 18:18:54 +00:00
rltakashige
facf2d4d03 Use custom fork that resolves GPU locks (#1489)
## Motivation

There is an issue on Macs where an explicit synchronization is necessary
for memory writes to become visible beyond the L1 cache. As a result, GPU
locks can occur when a spin wait never sees the updated timestamp.

## Changes

Updated in my own personal fork.

## Why It Works

https://github.com/ARM-software/acle/releases

## Test Plan

### Manual Testing
Tested manually that no GPU locks occur (even with multiple simultaneous
instances running) and that the performance differential is negligible
(267 vs 269 tps on Llama 3.2 1B at approximately 10k context).


------------------------------------------------------
I have seen one GPU lock, specifically when sending a particularly large
chat completion while the model was loading. However, I have since been
unable to reproduce it, and it may have been something I did wrong. Please
do create an issue and tag me if any GPU locks occur.

---------

Co-authored-by: Jake Hillion <jake@hillion.co.uk>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 17:48:43 +00:00
Ryuichi Leo Takashige
dc7ade8052 set the mlx hash 2026-02-10 20:16:17 +00:00
Ryuichi Leo Takashige
dc781497c5 update mlx to 0.30.6 2026-02-10 20:16:17 +00:00
Jake Hillion
8af2af6328 nix: override apple-sdk to 26.2 and enable MLX_BUILD_CPU (#1443)
The pinned nixpkgs provides apple-sdk 26.0, but building MLX requires
SDK 26.2. The upstream package reads versions.json via a relative path
at eval time, so it can't be overridden through callPackage args.

Added a thin overlay that copies the upstream apple-sdk source and
patches only metadata/versions.json to point at SDK 26.2. Also enabled
MLX_BUILD_CPU in the MLX nix build.

This avoids vendoring the entire apple-sdk package (~2200 lines) while
still getting the SDK version we need.

Test plan:
- CI
- Built and ran on two machines connected with Thunderbolt 5 - Kimi K2.5
starts in Tensor+RDMA and seems sensible.
2026-02-10 19:53:53 +00:00
rltakashige
dcb4cabc15 Update the nix hash for mlx 0.30.5 (#1416)
2026-02-06 21:27:10 +00:00
rltakashige
b315035ae0 Add minimax and fix qwen sharding strategies (#1318)
## Motivation

MiniMax tensor sharding does not provide equivalent outputs to running
it as a single node because RMSNorm weights cannot be split without
affecting the output.
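The RMSNorm claim is easy to verify numerically. A minimal stdlib sketch (toy 4-element vector split into two shards, not MiniMax's actual dimensions):

```python
import math


def rmsnorm(x, w, eps=1e-6):
    """RMSNorm: scale each element by its learned weight over the vector's RMS."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [wi * xi / rms for wi, xi in zip(w, x)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 1.0, 1.5, 2.0]
full = rmsnorm(x, w)

# Naively splitting the hidden dim across two shards: each shard computes
# its RMS over only its own slice, so the normalizer itself changes and
# the concatenated result differs from the unsharded output.
sharded = rmsnorm(x[:2], w[:2]) + rmsnorm(x[2:], w[2:])
```

Each shard would need the full sum of squares (an all-reduce) to reproduce the single-node normalizer, which is why the weights can't simply be split.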

Qwen3Next sharding was broken, and something in Qwen3MoE was likely
changed upstream, as several variables no longer exist.

This also ballooned into fixing prefix caching for non-standard models
as Qwen3Next was behaving weirdly.

## Test Plan

### Manual Testing
Worked through an 8-hour eval at the same performance and with a more
similar completion/reasoning token distribution.

---------

Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
2026-02-06 13:26:59 +00:00
Jake Hillion
ebeddfb308 mlx: build with Nix (#1285)
In order to make testing and deployment simpler and more reproducible,
we want to provide a Nix derivation for our macOS .app build. We already
build the Rust and dashboard with Nix, but so far the Python build has
been blocked because we haven't had an MLX build.

This change adds a Metal compiler derivation that uses `requireFile` to
be provided a NAR of the unfree macOS Metal compiler. It is documented
how to get this file, but effectively you have to trigger the download,
mount the DMG, and NAR the result. Once this is added to the store by
hash, we can build MLX using it. The MLX build itself is quite
self-explanatory.

Test plan:
- CI. We follow the instructions to grab the Metal compiler. Once this
is in Cachix we should never need to do it again, and I can also pin the
path to ensure it doesn't leave the cache.
- MLX tests run as part of the MLX derivation's build. They pass.
- `NIXPKGS_ALLOW_UNFREE=1 nix build .#mlx.passthru.tests.mlxTest
--impure --option sandbox false`

---------

Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
2026-01-29 14:07:00 +00:00