Commit Graph

57 Commits

Author SHA1 Message Date
ciaranbor
a4a965c80c Update mflux version 2026-01-20 18:47:49 +00:00
ciaranbor
970f62e645 Add mflux dependency 2026-01-20 18:47:49 +00:00
ciaranbor
b824f7a60d Add pillow dependency 2026-01-20 18:47:49 +00:00
Evan Quiney
22b5d836ef swap all instances of model_id: str for model_id: ModelId (#1221)
This change uses the more strongly typed `ModelId` and introduces some
convenience methods. It also cleans up some code left over from #1204.

## Changes

`model_id: str -> model_id: ModelId`
`repo_id: str -> model_id: ModelId`

Introduces methods on `ModelId`, in particular `ModelId.normalize()`, which
replaces `/` with `--` (a minimal sketch follows).
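
A minimal sketch of the shape described above, assuming a simple frozen wrapper; the real class in exo may differ:

```python
from dataclasses import dataclass

# Illustrative only: a frozen wrapper around the raw id string.
@dataclass(frozen=True)
class ModelId:
    value: str

    def normalize(self) -> str:
        # "org/model" -> "org--model", safe to use as a directory name
        return self.value.replace("/", "--")
```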

This PR introduced some circular imports, so some code has been moved
around to limit them.

## Test Plan

Tests still pass, types still check. As this is about metadata, I
haven't tested inference.
2026-01-20 17:38:06 +00:00
Alex Cheema
176ab5ba40 Add GLM-4.7-Flash model cards (4bit, 5bit, 6bit, 8bit) (#1214)
## Motivation

Add support for GLM-4.7-Flash, a lighter variant of GLM-4.7 with the
`glm4_moe_lite` architecture. These models are smaller and faster while
maintaining good performance.

## Changes

1. **Added 4 new model cards** for GLM-4.7-Flash variants (a hedged sketch of one card follows this list):
   - `glm-4.7-flash-4bit` (~18 GB)
   - `glm-4.7-flash-5bit` (~21 GB)
   - `glm-4.7-flash-6bit` (~25 GB)
   - `glm-4.7-flash-8bit` (~32 GB)

   All variants have:
   - `n_layers`: 47 (vs 91 in GLM-4.7)
   - `hidden_size`: 2048 (vs 5120 in GLM-4.7)
   - `supports_tensor`: True (native `shard()` method)

2. **Bumped mlx from 0.30.1 to 0.30.3** - required by mlx-lm 0.30.4

3. **Updated mlx-lm from 0.30.2 to 0.30.4** - adds `glm4_moe_lite`
architecture support

4. **Added type ignores** in `auto_parallel.py` for stricter type
annotations in new mlx-lm

5. **Fixed EOS token IDs** for GLM-4.7-Flash - it uses a different tokenizer
with IDs `[154820, 154827, 154829]` vs other GLM models' `[151336,
151329, 151338]`

6. **Renamed `MLX_IBV_DEVICES` to `MLX_JACCL_DEVICES`** - env var name
changed in new mlx
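
As a hedged illustration of items 1 and 5, a card for the 4-bit variant might carry fields like these (field names are hypothetical; the values are the ones listed above):

```python
# Hypothetical card shape; exo's real model card schema may differ.
GLM_4_7_FLASH_4BIT = {
    "model_id": "glm-4.7-flash-4bit",            # ~18 GB variant
    "n_layers": 47,                              # vs 91 in GLM-4.7
    "hidden_size": 2048,                         # vs 5120 in GLM-4.7
    "supports_tensor": True,                     # native shard() method
    "eos_token_ids": [154820, 154827, 154829],   # GLM-4.7-Flash tokenizer
}
```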

## Why It Works

The model cards follow the same pattern as existing GLM-4.7 models.
Tensor parallel support is enabled because GLM-4.7-Flash implements the
native `shard()` method in mlx-lm 0.30.4, which is automatically
detected in `auto_parallel.py`.
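
A minimal sketch of that detection, assuming it boils down to a simple attribute probe (the helper name is made up; the real logic in `auto_parallel.py` may differ):

```python
# Hedged sketch: tensor parallelism is preferred when the mlx-lm model
# exposes a native shard() method.
def has_native_shard(model: object) -> bool:
    return callable(getattr(model, "shard", None))
```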

GLM-4.7-Flash uses a new tokenizer with different special token IDs.
Without the correct EOS tokens, generation wouldn't stop properly.

## Test Plan

### Manual Testing
Tested generation with GLM-4.7-Flash-4bit - now correctly stops at EOS
tokens.

### Automated Testing
- `basedpyright`: 0 errors
- `ruff check`: All checks passed
- `pytest`: 162/162 tests pass (excluding pre-existing
`test_distributed_fix.py` timeout failures)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 03:58:09 +00:00
rltakashige
618cee5223 Resolve test event ordering flakiness (#1194)
## Motivation

The mp sender occasionally does not have time to flush its events before
`collect()` is called, making the event ordering test fail.

## Changes

- Replace `mp_channel` with a simple collector for the event ordering test
(sketched below)
- Also suppress the `<frozen importlib._bootstrap>:488` warning:
`DeprecationWarning: builtin type SwigPyObject has no __module__ attribute`
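
A minimal sketch of the collector idea, with hypothetical names; events are recorded synchronously in-process, so there is no sender to flush:

```python
# Events are appended synchronously, so there is nothing to flush
# before the test inspects the ordering.
class EventCollector:
    def __init__(self) -> None:
        self.events: list[object] = []

    def send(self, event: object) -> None:
        self.events.append(event)

    def collect(self) -> list[object]:
        return list(self.events)
```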


## Test Plan

### Automated Testing
Ran the test 100 times without it failing.
2026-01-18 20:33:20 +00:00
Evan Quiney
39ee2bf7bd switch from synchronous threaded pinging to an async implementation (#1170)
still seeing churn in our networking - let's properly rate limit it

## changes

added a persistent httpx `AsyncClient` with capped max connections (a minimal sketch follows)
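
A minimal sketch of the idea, with illustrative names and limits:

```python
import httpx

# One persistent client with capped connections, reused for every ping.
limits = httpx.Limits(max_connections=10)
client = httpx.AsyncClient(limits=limits, timeout=5.0)

async def ping(url: str) -> bool:
    try:
        response = await client.get(url)
        return response.status_code == 200
    except httpx.HTTPError:
        return False
```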

## testing

deployed on a cluster; discovery is VASTLY more stable (the only deleted
edges were those discovered by mDNS)
2026-01-16 13:20:03 +00:00
Alex Cheema
e5e74e1eef Upgrade mlx-lm to 0.30.2 with transformers 5.x compatibility (#1125)
## Motivation

Upgrade mlx-lm to version 0.30.2 which requires transformers 5.0.0rc2 as
a prerelease dependency. This enables support for newer models like Kimi
K2 Thinking while maintaining compatibility with existing models.

The transformers 5.x release includes breaking changes that affect
custom tokenizers like Kimi's TikTokenTokenizer, requiring compatibility
fixes.

## Changes

### Core Changes
- **mlx-lm upgrade**: Bump to 0.30.2 with locked exact versions for
mlx/mlx-lm to prevent breaking changes
- **transformers 5.x compatibility**: Enable prerelease transformers
dependency

### Kimi K2 Tokenizer Fixes
- Add `bytes_to_unicode` monkey-patch to restore function moved in
transformers 5.0.0rc2
- Load `TikTokenTokenizer` directly instead of via `AutoTokenizer` to
bypass transformers 5.x bug with `auto_map` fallback
- Patch `encode()` to use tiktoken directly with `allowed_special="all"`
to handle special tokens from chat templates

### Other Changes
- Dashboard: Show disk usage for completed model downloads
- CI: Add `workflow_dispatch` trigger to build-app workflow
- Docs: Add basic API documentation

### Testing
- Add comprehensive tokenizer unit tests for all supported models
- Tests verify encode/decode, special token handling, and chat template
encoding

## Why It Works

**bytes_to_unicode issue**: transformers 5.0.0rc2 moved
`bytes_to_unicode` from `transformers.models.gpt2.tokenization_gpt2` to
`transformers.convert_slow_tokenizer`. Kimi's `tokenization_kimi.py`
imports from the old location. The monkey-patch restores it at module
load time.
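
A minimal sketch of such a shim, assuming the import paths described above:

```python
import transformers.models.gpt2.tokenization_gpt2 as gpt2_tokenization
from transformers.convert_slow_tokenizer import bytes_to_unicode

# Restore the function at its pre-5.x location so Kimi's
# tokenization_kimi.py can keep importing it from there.
if not hasattr(gpt2_tokenization, "bytes_to_unicode"):
    gpt2_tokenization.bytes_to_unicode = bytes_to_unicode
```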

**AutoTokenizer issue**: transformers 5.x has a bug where
`tokenizer_class_from_name('TikTokenTokenizer')` returns `None` for
custom tokenizers with `auto_map`. Loading the tokenizer directly
bypasses this.

**encode() issue**: transformers 5.x's `pad()` method fails for slow
tokenizers. Using tiktoken's encode directly with
`allowed_special="all"` avoids this path and properly handles special
tokens like `<|im_user|>` from chat templates.
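
A hedged sketch of the patched method, assuming the tokenizer wraps a tiktoken `Encoding` on an attribute like `self.model` (that name is an assumption):

```python
def encode(self, text: str) -> list[int]:
    # Bypass transformers' slow-tokenizer pad() path; tiktoken handles
    # special tokens such as <|im_user|> from chat templates itself.
    return self.model.encode(text, allowed_special="all")
```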

## Test Plan

### Manual Testing
- Hardware: 2x Mac Studios connected via Thunderbolt 5 (mike22 and
james21)
- Tested Kimi K2 Thinking, GPT-OSS-120B, GPT-OSS-20B, Llama-3.1-8B-bf16, and
Qwen3-30B-A3B-8bit models with pipeline parallelism across both nodes
- Verified warmup inference completes successfully
- Verified chat completions work with special tokens

### Automated Testing
- Added `test_tokenizers.py` with 31 tests covering:
  - Basic encode/decode for all model families (deepseek, kimi, llama, qwen, gpt-oss, glm)
  - Special token encoding (critical for chat templates)
  - Chat template application and encoding
  - Kimi-specific and GLM-specific edge cases
- All tests pass: `uv run pytest
src/exo/worker/tests/unittests/test_mlx/test_tokenizers.py`

### Failing Tests
RDMA fails with all models.

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-01-13 12:06:04 +00:00
Evan
cca8c9984a cleanup unused dependencies
we have a lot of dependencies we have no intent of using. kill them with
fire!

## testing
exo still launches and does the worst inference known to man on my Qwen3
instance. tests pass too!!
2026-01-09 13:11:58 +00:00
rltakashige
077b1bc732 exo-bench (Benchmark model pp & tg speed) (#1099)
## Motivation

This PR implements benchmarking in the style of llama-bench. The main
difficulty is that exo is not a library - it exposes an endpoint - so
benchmark numbers would be inaccurate if measured through the public API.

The solution assumes nodes are set up with `uv run exo` (or via the app),
and then hits the new `/bench/chat/completions` endpoint to retrieve
generation statistics directly from mlx_lm.
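
For illustration, a request against the new endpoint might look like this; the payload fields and the port (exo's usual API port is assumed here) are assumptions, and the real `BenchChatCompletion` schema may differ:

```python
import httpx

# Hypothetical request; the real BenchChatCompletion schema may differ.
response = httpx.post(
    "http://localhost:52415/bench/chat/completions",
    json={
        "model": "llama-3.1-8b",  # illustrative model id
        "messages": [{"role": "user", "content": "a"}],
    },
    timeout=None,
)
print(response.json())  # generation statistics straight from mlx_lm
```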

This will allow us to release benchmarks for models and perform
regression tests.

TODO: Performance benchmarking.

## Changes

- Adds a `/bench/chat/completions` endpoint
- Adds `BenchChatCompletion`/`Response` types
- Adds a logits processor to prevent the response from ending early
- Adds a "Prompt Sizer" which downloads the tokenizer and dynamically sizes a
prompt of repeated "a" tokens to fit the desired prompt length (sketched below)
- Reduces the prefill step size to 2048 for now (in future, this value will be
adjusted dynamically)
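
A hedged sketch of the prompt-sizing idea; the names are hypothetical and the real implementation may adjust differently:

```python
# Grow a filler prompt of "a" tokens, then trim back any overshoot.
def size_prompt(tokenizer, target_tokens: int) -> str:
    words = ["a"] * target_tokens
    while len(tokenizer.encode(" ".join(words))) < target_tokens:
        words.append("a")
    while len(tokenizer.encode(" ".join(words))) > target_tokens and words:
        words.pop()
    return " ".join(words)
```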

## Test Plan

### Manual Testing
Benchmarked Llama, Qwen, DeepSeek and Kimi models. Will require several
fixes to run consistently on all configurations (to be done in the
future).
Manually tested the normal API to verify chat requests complete as
expected.

### Automated Testing
Not really possible to automate benchmarking. The type checker passes.
2026-01-06 17:39:09 +00:00
Evan Quiney
9d9e24f969 some dashboard updates (#1017)
Mostly @samiamjidkhan and @AlexCheema's work in progress.

---------

Co-authored-by: Sami Khan <smsak99@gmail.com>
Co-authored-by: Alex Cheema
2025-12-28 20:50:23 +00:00
Evan Quiney
8e9332d6a7 Separate out the Runner's behaviour into a "connect" phase and a "load" phase (#1006)
## Motivation

We should ensure all runners are connected before loading the model - this
gives the worker's planning mechanism finer-grained control over runner
state in the future.

## Changes

- Introduced the task `ConnectToGroup`, preceding `LoadModel`
- Introduced runner statuses Idle, Connecting, and Connected (see the sketch
after this list)
- Separated `initialize_mlx` out from `shard_and_load`
- Single instances never go through the connecting phase
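
An illustrative sketch of the lifecycle; the enum name and values are assumed from the status names above:

```python
from enum import Enum

# The runner moves Idle -> Connecting -> Connected before any model
# load; single instances skip the connecting phase entirely.
class RunnerStatus(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    CONNECTED = "connected"
```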

## Test Plan

### Automated Testing
Added a test for checking event ordering in a standard workflow.

### Manual Testing
Tested that Llama 3.2 1B and Kimi K2 Thinking load and shut down repeatedly
on multiple configurations.
Not exhaustive, however.

---------

Co-authored-by: rltakashige <rl.takashige@gmail.com>
2025-12-27 16:28:42 +00:00
Jake Hillion
1c1792f5e8 mlx: update to 0.30.1 and align coordinator naming with MLX conventions
The Jaccl distributed backend requires MLX 0.30.1+, which includes the
RDMA over Thunderbolt support. The previous minimum version (0.29.3)
would fail at runtime with "The only valid values for backend are
'any', 'mpi' and 'ring' but 'jaccl' was provided."

Bump MLX dependency to >=0.30.1 and rename ibv_coordinators to
jaccl_coordinators to match MLX's naming conventions. This includes
the environment variable change from MLX_IBV_COORDINATOR to
MLX_JACCL_COORDINATOR.

Test plan:

Hardware setup: 3x Mac Studio M3 Ultra connected all-to-all with TB5

- Built a DMG [0]
- Installed on all Macs and started cluster.
- Requested a 2 node Tensor + MLX RDMA instance of Llama 3.3 70B (FP16).
- It started successfully.
- Queried the chat a few times. All was good. This didn't work
  previously.
- Killed the instance and spawned Pipeline + MLX Ring Llama 3.3 70B (FP16).
  Also started successfully on two nodes and could be queried.

Still not working:
- Pipeline + MLX Ring on 3 nodes is failing. Haven't debugged that yet.

[0] https://github.com/exo-explore/exo/actions/runs/20467656904/job/58815275013
2025-12-24 16:47:01 +00:00
Jake Hillion
02c915a88d pyproject: drop pathlib dependency 2025-12-22 17:52:44 +00:00
Jake Hillion
dd0638b74d pyproject: add pyinstaller to dev-dependencies 2025-12-22 15:49:27 +00:00
Jake Hillion
ac3a0a6b47 ci: enable ruff check in CI through nix 2025-12-09 12:26:56 +00:00
Evan Quiney
c9e2062f6e switch from uvicorn to hypercorn 2025-12-05 17:29:06 +00:00
rltakashige
2b243bd80e Consolidate!!! Fixes 2025-12-03 12:19:25 +00:00
rltakashige
b45cbdeecd Consolidate cleanup 2025-11-21 14:54:02 +00:00
Alex Cheema
631cb81009 kimi k2 thinking 2025-11-11 18:03:39 +00:00
Evan Quiney
aa519b8c03 Worker refactor
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
2025-11-10 23:31:53 +00:00
rltakashige
ff00b165c5 MLX LM type stubs 2025-11-06 21:59:29 +00:00
Alex Cheema
699fd9591e fix exo scripts 2025-11-05 21:47:08 -08:00
rltakashige
6bbb6344b6 mlx.distributed.Group type stubs 2025-11-06 05:26:04 +00:00
rltakashige
16f724e24c Update staging 14
Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: David Munha Canas Correia <dmunha@MacBook-David.local>
Co-authored-by: github-actions bot <github-actions@users.noreply.github.com>
2025-11-05 01:44:24 +00:00
rltakashige
91c635ca7a Update mlx and mlx-lm packages
Co-authored-by: Evan <evanev7@gmail.com>
2025-10-31 01:34:43 +00:00
Alex Cheema
a346af3477 download fixes 2025-10-22 11:56:52 +01:00
Evan Quiney
962e5ef40d version bump for brew consistency 2025-10-07 15:18:54 +01:00
Evan Quiney
38ff949bf4 big refactor
Fix. Everything.

Co-authored-by: Andrei Cravtov <the.andrei.cravtov@gmail.com>
Co-authored-by: Matt Beton <matthew.beton@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Seth Howes <sethshowes@gmail.com>
2025-09-30 11:03:04 +01:00
Matt Beton
a33787f5fd Prompt length 2025-08-29 16:07:36 +01:00
Matt Beton
1b8b456ced full mlx caching implementation 2025-08-26 17:15:08 +01:00
Evan Quiney
5efe5562d7 feat: single entrypoint and logging rework 2025-08-26 11:08:09 +01:00
Andrei Cravtov
ef5c5b9654 changes include: ipc, general utilities, flakes stuff w/ just, autopull script 2025-08-25 17:33:40 +01:00
Evan Quiney
be6f5ae7f1 feat: build system and homebrew compatibility 2025-08-21 16:07:37 +01:00
Matt Beton
1fe4ed3442 Worker Exception & Timeout Refactor
Co-authored-by: Gelu Vrabie <gelu@exolabs.net>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Seth Howes <sethshowes@gmail.com>
2025-08-02 08:28:37 -07:00
Alex Cheema
92c9688bf0 Remove rust 2025-08-02 08:16:39 -07:00
Gelu Vrabie
0e32599e71 fix libp2p + other prs that were wrongly overwritten before (111,112,117,118,1119 + misc commits from Alex)
Co-authored-by: Gelu Vrabie <gelu@exolabs.net>
Co-authored-by: Alex Cheema <41707476+AlexCheema@users.noreply.github.com>
Co-authored-by: Seth Howes <71157822+sethhowes@users.noreply.github.com>
Co-authored-by: Matt Beton <matthew.beton@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
2025-07-31 20:36:47 +01:00
Andrei Cravtov
8d2536d926 Implemented basic discovery library in Rust + python bindings
Co-authored-by: Gelu Vrabie <gelu@exolabs.net>
Co-authored-by: Seth Howes <sethshowes@gmail.com>
Co-authored-by: Matt Beton <matthew.beton@gmail.com>
2025-07-23 13:11:29 +01:00
Gelu Vrabie
596d9fc9d0 add forwarder service
Co-authored-by: Gelu Vrabie <gelu@exolabs.net>
2025-07-22 20:53:26 +01:00
Alex Cheema
449fdac27a Downloads 2025-07-21 22:42:37 +01:00
Arbion Halili
d9b9aa7ad2 Merge branch 'master-node' into staging 2025-07-15 16:32:08 +01:00
Arbion Halili
4e4dbf52ec fix: Use Nix-compatible LSP set-up 2025-07-14 21:08:43 +01:00
Matt Beton
21acd3794a New Runner! 2025-07-10 16:34:35 +01:00
Matt Beton
0425422f55 Simple fix 2025-07-07 17:18:43 +01:00
Matt Beton
03a1cf59a6 Matt's interfaces
Added interfaces for chunks, worker, runner, supervisor, resourcemonitor, etc.
2025-07-07 16:42:52 +01:00
Arbion Halili
5abf03e31b Scaffold Event Sourcing 2025-06-29 19:44:58 +01:00
Arbion Halili
74adbc4280 Remove PoeThePoet 2025-06-28 14:33:01 +01:00
Arbion Halili
f7f779da19 Fix Type Checker; Improve Protobuf Generation 2025-06-28 12:28:26 +01:00
Arbion Halili
61b8b1cb18 Add Protobuf Support 2025-06-28 01:26:49 +01:00
Arbion Halili
3564d77e58 Add Sync to Runner 2025-06-27 11:56:02 +01:00