Commit Graph

56 Commits

Author SHA1 Message Date
ciaranbor
a4a965c80c Update mflux version 2026-01-20 18:47:49 +00:00
ciaranbor
970f62e645 Add mflux dependency 2026-01-20 18:47:49 +00:00
ciaranbor
b824f7a60d Add pillow dependency 2026-01-20 18:47:49 +00:00
Evan Quiney
22b5d836ef swap all instances of model_id: str for model_id: ModelId (#1221)
This change uses the more strongly typed ModelId and introduces some
convenience methods. It also cleans up some code left over from #1204.

## Changes

`model_id: str -> model_id: ModelId`
`repo_id: str -> model_id: ModelId`

Introduces methods on ModelId, in particular ModelId.normalize() to
replace `/` with `--`.
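
A minimal sketch of the idea, assuming `ModelId` wraps a plain string (the subclassing below is an assumption; only `ModelId.normalize()` and the `/` → `--` replacement come from this PR):

```python
class ModelId(str):
    """Hypothetical sketch: a thin wrapper over the repo-style model id string."""

    def normalize(self) -> str:
        # e.g. "meta-llama/Llama-3.1-8B" -> "meta-llama--Llama-3.1-8B",
        # handy for filesystem-safe directory names
        return self.replace("/", "--")
```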

This PR did introduce some circular imports, so some code has been moved
around to limit them.

## Test Plan

Tests still pass, types still check. As this is about metadata, I
haven't tested inference.
2026-01-20 17:38:06 +00:00
Alex Cheema
176ab5ba40 Add GLM-4.7-Flash model cards (4bit, 5bit, 6bit, 8bit) (#1214)
## Motivation

Add support for GLM-4.7-Flash, a lighter variant of GLM-4.7 with the
`glm4_moe_lite` architecture. These models are smaller and faster while
maintaining good performance.

## Changes

1. **Added 4 new model cards** for GLM-4.7-Flash variants:
   - `glm-4.7-flash-4bit` (~18 GB)
   - `glm-4.7-flash-5bit` (~21 GB)
   - `glm-4.7-flash-6bit` (~25 GB)
   - `glm-4.7-flash-8bit` (~32 GB)

   All variants have:
   - `n_layers`: 47 (vs 91 in GLM-4.7)
   - `hidden_size`: 2048 (vs 5120 in GLM-4.7)
   - `supports_tensor`: True (native `shard()` method)

2. **Bumped mlx from 0.30.1 to 0.30.3** - required by mlx-lm 0.30.4

3. **Updated mlx-lm from 0.30.2 to 0.30.4** - adds `glm4_moe_lite`
architecture support

4. **Added type ignores** in `auto_parallel.py` for stricter type
annotations in new mlx-lm

5. **Fixed EOS token IDs** for GLM-4.7-Flash - uses different tokenizer
with IDs `[154820, 154827, 154829]` vs other GLM models' `[151336,
151329, 151338]`

6. **Renamed `MLX_IBV_DEVICES` to `MLX_JACCL_DEVICES`** - env var name
changed in new mlx

## Why It Works

The model cards follow the same pattern as existing GLM-4.7 models.
Tensor parallel support is enabled because GLM-4.7-Flash implements the
native `shard()` method in mlx-lm 0.30.4, which is automatically
detected in `auto_parallel.py`.
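
A hedged sketch of that detection idea only; the helper name below is hypothetical and not the actual `auto_parallel.py` code:

```python
def supports_native_tensor_parallel(model) -> bool:
    # Models whose mlx-lm implementation provides its own shard() method
    # can be split across nodes without a custom sharding plan.
    return callable(getattr(model, "shard", None))
```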

GLM-4.7-Flash uses a new tokenizer with different special token IDs.
Without the correct EOS tokens, generation wouldn't stop properly.

## Test Plan

### Manual Testing
Tested generation with GLM-4.7-Flash-4bit - now correctly stops at EOS
tokens.

### Automated Testing
- `basedpyright`: 0 errors
- `ruff check`: All checks passed
- `pytest`: 162/162 tests pass (excluding pre-existing
`test_distributed_fix.py` timeout failures)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 03:58:09 +00:00
Evan Quiney
39ee2bf7bd switch from synchronous threaded pinging to an async implementation (#1170)
still seeing churn in our networking - let's properly rate limit it

## changes

added a persistent httpx AsyncClient with a cap on max connections
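
A minimal sketch of that setup, assuming illustrative limit and timeout values (the real client configuration in the PR may differ):

```python
import httpx

# One persistent AsyncClient shared by all pings; the connection limit is
# what rate-limits discovery traffic. Values here are placeholders.
limits = httpx.Limits(max_connections=10, max_keepalive_connections=5)
client = httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(5.0))

async def ping(url: str) -> bool:
    try:
        resp = await client.get(url)
        return resp.status_code == 200
    except httpx.HTTPError:
        return False
```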

## testing

deployed on cluster, discovery VASTLY more stable (the only deleted
edges were those discovered by mdns)
2026-01-16 13:20:03 +00:00
Alex Cheema
e5e74e1eef Upgrade mlx-lm to 0.30.2 with transformers 5.x compatibility (#1125)
## Motivation

Upgrade mlx-lm to version 0.30.2, which requires transformers 5.0.0rc2 as
a prerelease dependency. This enables support for newer models like Kimi
K2 Thinking while maintaining compatibility with existing models.

The transformers 5.x release includes breaking changes that affect
custom tokenizers like Kimi's TikTokenTokenizer, requiring compatibility
fixes.

## Changes

### Core Changes
- **mlx-lm upgrade**: Bump to 0.30.2 with locked exact versions for
mlx/mlx-lm to prevent breaking changes
- **transformers 5.x compatibility**: Enable prerelease transformers
dependency

### Kimi K2 Tokenizer Fixes
- Add `bytes_to_unicode` monkey-patch to restore function moved in
transformers 5.0.0rc2
- Load `TikTokenTokenizer` directly instead of via `AutoTokenizer` to
bypass transformers 5.x bug with `auto_map` fallback
- Patch `encode()` to use tiktoken directly with `allowed_special="all"`
to handle special tokens from chat templates

### Other Changes
- Dashboard: Show disk usage for completed model downloads
- CI: Add `workflow_dispatch` trigger to build-app workflow
- Docs: Add basic API documentation

### Testing
- Add comprehensive tokenizer unit tests for all supported models
- Tests verify encode/decode, special token handling, and chat template
encoding

## Why It Works

**bytes_to_unicode issue**: transformers 5.0.0rc2 moved
`bytes_to_unicode` from `transformers.models.gpt2.tokenization_gpt2` to
`transformers.convert_slow_tokenizer`. Kimi's `tokenization_kimi.py`
imports from the old location. The monkey-patch restores it at module
load time.
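
A hedged sketch of that patch (not the verbatim exo code):

```python
# Re-expose bytes_to_unicode at its pre-5.x location before Kimi's
# tokenization_kimi.py imports it from there.
import transformers.models.gpt2.tokenization_gpt2 as gpt2_tokenization
from transformers.convert_slow_tokenizer import bytes_to_unicode

if not hasattr(gpt2_tokenization, "bytes_to_unicode"):
    gpt2_tokenization.bytes_to_unicode = bytes_to_unicode
```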

**AutoTokenizer issue**: transformers 5.x has a bug where
`tokenizer_class_from_name('TikTokenTokenizer')` returns `None` for
custom tokenizers with `auto_map`. Loading the tokenizer directly
bypasses this.

**encode() issue**: transformers 5.x's `pad()` method fails for slow
tokenizers. Using tiktoken's encode directly with
`allowed_special="all"` avoids this path and properly handles special
tokens like `<|im_user|>` from chat templates.
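
A hedged sketch of the encode() patch; the `self.model` attribute holding the tiktoken encoding is an assumption about the Kimi tokenizer's internals:

```python
def patched_encode(self, text: str) -> list[int]:
    # Bypass transformers' slow-tokenizer pad() path; tiktoken handles
    # special tokens such as <|im_user|> when allowed_special="all".
    return self.model.encode(text, allowed_special="all")
```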

## Test Plan

### Manual Testing
- Hardware: 2x Mac Studios connected via Thunderbolt 5 (mike22 and
james21)
- Tested Kimi K2 Thinking, GPT-OSS-120B, GPT-OSS-20B, Llama-3.1-8B-bf16, and
Qwen3-30B-A3B-8bit models with pipeline parallelism across both nodes
- Verified warmup inference completes successfully
- Verified chat completions work with special tokens

### Automated Testing
- Added `test_tokenizers.py` with 31 tests covering:
  - Basic encode/decode for all model families (deepseek, kimi, llama,
    qwen, gpt-oss, glm)
  - Special token encoding (critical for chat templates)
  - Chat template application and encoding
  - Kimi-specific and GLM-specific edge cases
- All tests pass: `uv run pytest
src/exo/worker/tests/unittests/test_mlx/test_tokenizers.py`

### Failing Tests
RDMA with all models.

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-01-13 12:06:04 +00:00
Evan
cca8c9984a cleanup unused dependencies
we have a lot of dependencies we have no intent of using. kill them with
fire!

## testing
exo still launches and does the worst inference known to man on my Qwen3
instance. tests pass too!!
2026-01-09 13:11:58 +00:00
Evan Quiney
9d9e24f969 some dashboard updates (#1017)
Mostly @samiamjidkhan and @AlexCheema's work in progress.

---------

Co-authored-by: Sami Khan <smsak99@gmail.com>
Co-authored-by: Alex Cheema
2025-12-28 20:50:23 +00:00
Jake Hillion
b5d424b658 placement: generate per-node host lists for MLX ring backend
Pipeline + MLX Ring worked with 2 nodes but failed to initialize with
3 or more nodes. The MLX ring backend requires each node to know its
specific left and right neighbors in the ring, but the previous
implementation provided a single flat host list shared by all nodes.

With 2 nodes, a flat list [host0, host1] accidentally worked because
each node could find its only neighbor. With 3+ nodes, each node needs
a customized view:
- Rank 0: [self, right_neighbor, placeholder]
- Rank 1: [left_neighbor, self, right_neighbor]
- Rank 2: [placeholder, left_neighbor, self]

Changed MlxRingInstance from `hosts: list[Host]` to
`hosts_by_node: dict[NodeId, list[Host]]` with `ephemeral_port: int`.

Added `get_mlx_ring_hosts_by_node()` which generates per-node host
lists where:
- Self position uses 0.0.0.0 for local binding
- Left/right neighbors use actual connection IPs
- Non-neighbors use 198.51.100.1 (RFC 5737 TEST-NET-2 placeholder)
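
A hedged sketch of the per-node layout described above, with plain strings standing in for exo's Host/NodeId types and a hypothetical helper name:

```python
PLACEHOLDER = "198.51.100.1"  # RFC 5737 TEST-NET-2, never connected to

def ring_hosts_for(rank: int, ips: list[str]) -> list[str]:
    hosts = [PLACEHOLDER] * len(ips)
    hosts[rank] = "0.0.0.0"              # self: bind locally
    if rank > 0:
        hosts[rank - 1] = ips[rank - 1]  # left neighbor's real IP
    if rank < len(ips) - 1:
        hosts[rank + 1] = ips[rank + 1]  # right neighbor's real IP
    return hosts
```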

Also added IP prioritization (en0 > en1 > non-Thunderbolt > any) to
prefer stable network interfaces.

Fixed topology discovery recording loopback addresses (127.0.0.1) as
valid connections to remote nodes. The reachability check now verifies
node identity via HTTP GET /node_id rather than just checking if the
port is open.
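
A hedged sketch of that check; the GET /node_id endpoint is from this commit, while the timeout and response handling are assumed for illustration:

```python
import httpx

async def is_same_node(ip: str, port: int, expected_node_id: str) -> bool:
    # A reachable port is not enough: 127.0.0.1 answers too, so require the
    # remote end to report the node id we expect.
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            resp = await client.get(f"http://{ip}:{port}/node_id")
            return resp.text.strip().strip('"') == expected_node_id
    except httpx.HTTPError:
        return False
```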

Test plan:

- Built a DMG [0]
- Installed on all Macs and started cluster.
- Requested a 3 node Pipeline + MLX Ring Llama 3.3 70B (FP16).
- It started and I was able to send a few chat messages.

Eventually my instance seemed to get into a broken state and chat
stopped working, but this commit is a clear step forward.

[0] https://github.com/exo-explore/exo/actions/runs/20473983471/job/58834969418
2025-12-28 20:38:20 +00:00
Evan Quiney
8e9332d6a7 Separate out the Runner's behaviour into a "connect" phase and a "load" phase (#1006)
## Motivation

We should ensure all runners are connected before loading the model - this
gives the workers' planning mechanism finer-grained control over runner state
in the future.

## Changes

- Introduced task ConnectToGroup, preceding LoadModel
- Introduced runner statuses Idle, Connecting, Connected (sketched below)
- Separated out initialize_mlx from shard_and_load
- Single instances never go through the connecting phase
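
A minimal sketch of the lifecycle; the enum name and member spellings are assumptions, only the Idle → Connecting → Connected progression and the ConnectToGroup-before-LoadModel ordering come from this PR:

```python
from enum import Enum

class RunnerStatus(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"  # ConnectToGroup task in flight
    CONNECTED = "connected"    # group formed; LoadModel may now be issued
```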

## Test Plan

### Automated Testing
Added a test for checking event ordering in a standard workflow.

### Manual Testing
Tested that Llama 3.2 1B and Kimi K2 Thinking load and shut down repeatedly
on multiple configurations.
Not exhaustive, however.

---------

Co-authored-by: rltakashige <rl.takashige@gmail.com>
2025-12-27 16:28:42 +00:00
Jake Hillion
1c1792f5e8 mlx: update to 0.30.1 and align coordinator naming with MLX conventions
The Jaccl distributed backend requires MLX 0.30.1+, which includes the
RDMA over Thunderbolt support. The previous minimum version (0.29.3)
would fail at runtime with "The only valid values for backend are
'any', 'mpi' and 'ring' but 'jaccl' was provided."

Bump MLX dependency to >=0.30.1 and rename ibv_coordinators to
jaccl_coordinators to match MLX's naming conventions. This includes
the environment variable change from MLX_IBV_COORDINATOR to
MLX_JACCL_COORDINATOR.

Test plan:

Hardware setup: 3x Mac Studio M3 Ultra connected all-to-all with TB5

- Built a DMG [0]
- Installed on all Macs and started cluster.
- Requested a 2 node Tensor + MLX RDMA instance of Llama 3.3 70B (FP16).
- It started successfully.
- Queried the chat a few times. All was good. This didn't work
  previously.
- Killed the instance and spawned Pipeline + MLX Ring Llama 3.3 70B (FP16).
  Also started successfully on two nodes and could be queried.

Still not working:
- Pipeline + MLX Ring on 3 nodes is failing. Haven't debugged that yet.

[0] https://github.com/exo-explore/exo/actions/runs/20467656904/job/58815275013
2025-12-24 16:47:01 +00:00
Jake Hillion
02c915a88d pyproject: drop pathlib dependency 2025-12-22 17:52:44 +00:00
Jake Hillion
dd0638b74d pyproject: add pyinstaller to dev-dependencies 2025-12-22 15:49:27 +00:00
Evan Quiney
c9e2062f6e switch from uvicorn to hypercorn 2025-12-05 17:29:06 +00:00
rltakashige
2b243bd80e Consolidate!!! Fixes 2025-12-03 12:19:25 +00:00
rltakashige
b45cbdeecd Consolidate cleanup 2025-11-21 14:54:02 +00:00
Alex Cheema
631cb81009 kimi k2 thinking 2025-11-11 18:03:39 +00:00
Evan Quiney
aa519b8c03 Worker refactor
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
2025-11-10 23:31:53 +00:00
rltakashige
6bbb6344b6 mlx.distributed.Group type stubs 2025-11-06 05:26:04 +00:00
rltakashige
16f724e24c Update staging 14
Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: David Munha Canas Correia <dmunha@MacBook-David.local>
Co-authored-by: github-actions bot <github-actions@users.noreply.github.com>
2025-11-05 01:44:24 +00:00
rltakashige
91c635ca7a Update mlx and mlx-lm packages
Co-authored-by: Evan <evanev7@gmail.com>
2025-10-31 01:34:43 +00:00
Alex Cheema
a346af3477 download fixes 2025-10-22 11:56:52 +01:00
Alex Cheema
84dfc8a738 Fast memory profiling
Co-authored-by: Evan <evanev7@gmail.com>
2025-10-07 16:23:51 +01:00
Evan Quiney
38ff949bf4 big refactor
Fix. Everything.

Co-authored-by: Andrei Cravtov <the.andrei.cravtov@gmail.com>
Co-authored-by: Matt Beton <matthew.beton@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Seth Howes <sethshowes@gmail.com>
2025-09-30 11:03:04 +01:00
Matt Beton
1b8b456ced full mlx caching implementation 2025-08-26 17:15:08 +01:00
Evan Quiney
5efe5562d7 feat: single entrypoint and logging rework 2025-08-26 11:08:09 +01:00
Andrei Cravtov
ef5c5b9654 changes include: ipc, general utilities, flakes stuff w/ just, autopull script 2025-08-25 17:33:40 +01:00
Evan Quiney
be6f5ae7f1 feat: build system and homebrew compatibility 2025-08-21 16:07:37 +01:00
Gelu Vrabie
57073f35c3 collection of fixes for Shanghai demo
Co-authored-by: Matt Beton <matthew.beton@gmail.com>
Co-authored-by: Gelu Vrabie <gelu@exolabs.net>
2025-08-15 15:21:51 +01:00
Matt Beton
1fe4ed3442 Worker Exception & Timeout Refactor
Co-authored-by: Gelu Vrabie <gelu@exolabs.net>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Seth Howes <sethshowes@gmail.com>
2025-08-02 08:28:37 -07:00
Alex Cheema
92c9688bf0 Remove rust 2025-08-02 08:16:39 -07:00
Andrei Cravtov
3730160477 Fix the node-ID test
Co-authored-by: Matt Beton <matthew.beton@gmail.com>
2025-07-24 17:09:12 +01:00
Andrei Cravtov
8d2536d926 Implemented basic discovery library in Rust + python bindings
Co-authored-by: Gelu Vrabie <gelu@exolabs.net>
Co-authored-by: Seth Howes <sethshowes@gmail.com>
Co-authored-by: Matt Beton <matthew.beton@gmail.com>
2025-07-23 13:11:29 +01:00
Matt Beton
14b3c4a6be New API! 2025-07-22 21:21:12 +01:00
Gelu Vrabie
596d9fc9d0 add forwarder service
Co-authored-by: Gelu Vrabie <gelu@exolabs.net>
2025-07-22 20:53:26 +01:00
Alex Cheema
449fdac27a Downloads 2025-07-21 22:42:37 +01:00
Gelu Vrabie
2f64e30dd1 Add sqlite connector
Co-authored-by: Gelu Vrabie <gelu@exolabs.net>
2025-07-21 14:10:29 +01:00
Arbion Halili
520b1122a3 fix: Many Fixes 2025-07-16 13:35:31 +01:00
Arbion Halili
d9b9aa7ad2 Merge branch 'master-node' into staging 2025-07-15 16:32:08 +01:00
Arbion Halili
8799c288b0 BROKEN: work thus far 2025-07-14 21:09:08 +01:00
Matt Beton
21acd3794a New Runner! 2025-07-10 16:34:35 +01:00
Arbion Halili
b0bd951005 Merge Basic Interfaces
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Seth Howes <sethshowes@gmail.com>
Co-authored-by: Matt Beton <matthew.beton@gmail.com>
Co-authored-by: Andrei Cravtov <the.andrei.cravtov@gmail.com>
2025-07-09 19:04:21 +01:00
Arbion Halili
7dd8a979d2 feature: Simplest utilities for logging 2025-07-02 22:13:42 +01:00
Arbion Halili
40793f1d86 refactor: Refactor most things 2025-07-02 21:11:49 +01:00
Alex Cheema
0c46adc298 refactor: Use official OpenAI types 2025-06-29 22:30:18 +01:00
Arbion Halili
5abf03e31b Scaffold Event Sourcing 2025-06-29 19:44:58 +01:00
Arbion Halili
c977ce9419 Ensure exo-shared is a Dependency of exo-master and exo-worker 2025-06-28 14:34:49 +01:00
Arbion Halili
f7f779da19 Fix Type Checker; Improve Protobuf Generation 2025-06-28 12:28:26 +01:00
Arbion Halili
61b8b1cb18 Add Protobuf Support 2025-06-28 01:26:49 +01:00