mirror/exo

Mirror of https://github.com/exo-explore/exo.git (synced 2026-01-20 03:51:14 -05:00)

Author	SHA1	Message	Date
Evan	b8842e8081	rebase lint fmt	2026-01-18 21:14:03 +00:00
Evan	9c5467aa35	add to the test server	2026-01-18 21:14:03 +00:00
Evan Quiney	1200a7db64	Add tensor sharding for GPT-OSS (#1144)	2026-01-13 17:25:52 +00:00

## Motivation

GPT OSS did not previously support tensor sharding

## Changes

Add GPT sharding support in tensor_auto_parallel. Code is mostly @rltakashige's

## Test Plan

### Manual Testing

Tested GPT-OSS

- MLX Fast Sync causes issues in Tensor RDMA - this is a general problem at the moment.
Alex Cheema	e5e74e1eef	Upgrade mlx-lm to 0.30.2 with transformers 5.x compatibility (#1125)	2026-01-13 12:06:04 +00:00

## Motivation

Upgrade mlx-lm to version 0.30.2 which requires transformers 5.0.0rc2 as a prerelease dependency. This enables support for newer models like Kimi K2 Thinking while maintaining compatibility with existing models.

The transformers 5.x release includes breaking changes that affect custom tokenizers like Kimi's TikTokenTokenizer, requiring compatibility fixes.

## Changes

### Core Changes

- mlx-lm upgrade: Bump to 0.30.2 with locked exact versions for mlx/mlx-lm to prevent breaking changes
- transformers 5.x compatibility: Enable prerelease transformers dependency

### Kimi K2 Tokenizer Fixes

- Add `bytes_to_unicode` monkey-patch to restore function moved in transformers 5.0.0rc2
- Load `TikTokenTokenizer` directly instead of via `AutoTokenizer` to bypass transformers 5.x bug with `auto_map` fallback
- Patch `encode()` to use tiktoken directly with `allowed_special="all"` to handle special tokens from chat templates

### Other Changes

- Dashboard: Show disk usage for completed model downloads
- CI: Add `workflow_dispatch` trigger to build-app workflow
- Docs: Add basic API documentation

### Testing

- Add comprehensive tokenizer unit tests for all supported models
- Tests verify encode/decode, special token handling, and chat template encoding

## Why It Works

bytes_to_unicode issue: transformers 5.0.0rc2 moved `bytes_to_unicode` from `transformers.models.gpt2.tokenization_gpt2` to `transformers.convert_slow_tokenizer`. Kimi's `tokenization_kimi.py` imports from the old location. The monkey-patch restores it at module load time.

AutoTokenizer issue: transformers 5.x has a bug where `tokenizer_class_from_name('TikTokenTokenizer')` returns `None` for custom tokenizers with `auto_map`. Loading the tokenizer directly bypasses this.

encode() issue: transformers 5.x's `pad()` method fails for slow tokenizers. Using tiktoken's encode directly with `allowed_special="all"` avoids this path and properly handles special tokens like `<|im_user|>` from chat templates.

## Test Plan

### Manual Testing

- Hardware: 2x Mac Studios connected via Thunderbolt 5 (mike22 and james21)
- Tested Kimi K2 Thinking, GPT-OSS-120B, GPT-OSS-20B, LLama-3.1-8B-bf16, qwen3-30B-A3B-8bit model with pipeline parallelism across both nodes
- Verified warmup inference completes successfully
- Verified chat completions work with special tokens

### Automated Testing

- Added `test_tokenizers.py` with 31 tests covering:
  - Basic encode/decode for all model families (deepseek, kimi, llama, qwen, gpt-oss, glm)
  - Special token encoding (critical for chat templates)
  - Chat template application and encoding
  - Kimi-specific and GLM-specific edge cases
- All tests pass: `uv run pytest src/exo/worker/tests/unittests/test_mlx/test_tokenizers.py`

### Failing Tests

RDMA with all models.

---------

Co-authored-by: Evan <evanev7@gmail.com>
Evan Quiney	56af61fac9	add a server for distributed testing in /tests until we work out a stable solution. (#1098)	2026-01-08 12:50:04 +00:00

## Motivation

Testing multiple devices simultaneously requires coordination, and we don't necessarily want to run a full EXO to test single components. We need a mid-scale integration testing framework for distributed tests.

## Changes

Add a simple python server + bash query that runs Jaccl and Ring tests without constructing a worker/master/networking. The query relies on all devices being accessible over tailscale, currently.

## Test Plan

Manually tested RDMA + Ring inference on 2 nodes.

5 Commits
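
Commit e5e74e1eef describes restoring `bytes_to_unicode` at its old import location with a monkey-patch applied at module load time. The following is a minimal sketch of that general pattern only, not the code from the commit. It uses two stand-in modules (`legacy_tokenizer_home` and `new_tokenizer_home`, both hypothetical names) in place of the real transformers modules, and the helper `restore_moved_attribute` and the placeholder function body are assumptions made for illustration.

```python
import sys
import types


def restore_moved_attribute(old_module_name: str, new_module_name: str, attr: str) -> bool:
    """Re-expose `attr` on the old module if it now lives only on the new one.

    Returns True if a patch was applied, False if the old module already had it.
    Both modules must already be present in sys.modules.
    """
    old_module = sys.modules[old_module_name]
    new_module = sys.modules[new_module_name]
    if hasattr(old_module, attr):
        return False  # nothing moved, leave the module untouched
    setattr(old_module, attr, getattr(new_module, attr))
    return True


# Stand-in modules (hypothetical): a library that moved a function between releases.
legacy_home = types.ModuleType("legacy_tokenizer_home")
new_home = types.ModuleType("new_tokenizer_home")


def _bytes_to_unicode_placeholder() -> dict:
    # Placeholder body; the real function builds a byte-to-character table.
    return {}


new_home.bytes_to_unicode = _bytes_to_unicode_placeholder
sys.modules["legacy_tokenizer_home"] = legacy_home
sys.modules["new_tokenizer_home"] = new_home

# Apply the patch before any third-party code imports from the old location.
patched = restore_moved_attribute(
    "legacy_tokenizer_home", "new_tokenizer_home", "bytes_to_unicode"
)

# Code written against the old layout can now import the name again.
from legacy_tokenizer_home import bytes_to_unicode
```

The patch is idempotent: calling the helper a second time finds the attribute already present and does nothing, which matters if the patching module can be imported more than once.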