Commit Graph

14 Commits

Author SHA1 Message Date
Evan Quiney
9b5cae3db6 auto bench (#1405)
Runs exo_bench remotely, with some nice git quality-of-life helpers.

## usage
Run `tests/auto_bench.sh host1 [host2]`

exo_bench will be run on those hosts for all models currently downloaded,
and its output saved to bench/commit_hash/*.json
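The bench/commit_hash/*.json layout above can be sketched as a small path helper. This is a hypothetical illustration; the helper name and the real auto_bench.sh logic are assumptions, not exo's code.

```python
from pathlib import Path

# Illustrative sketch of the output layout described above; the helper
# name and the actual auto_bench.sh logic are assumptions, not exo's code.

def bench_output_path(commit_hash: str, model_id: str) -> Path:
    """Where one model's exo_bench results would land for a given commit."""
    return Path("bench") / commit_hash / f"{model_id}.json"
```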
2026-02-06 15:35:46 +00:00
Evan Quiney
6b907398a4 cancel downloads for deleted instances (#1393)
After deleting an instance, if a given (node_id, model_id) pair doesn't exist in the leftover instances, cancel the download of model_id on node_id.
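The orphan check described above can be sketched as a pure function. The names here (`downloads`, `instances`, `orphaned_downloads`) are illustrative stand-ins, not exo's actual API.

```python
# Hedged sketch of the cleanup rule above: any in-flight download whose
# (node_id, model_id) pair no longer backs a surviving instance should be
# cancelled. All names here are illustrative, not exo's actual API.

def orphaned_downloads(
    downloads: set[tuple[str, str]],       # (node_id, model_id) pairs downloading
    instances: list[tuple[str, str]],      # (node_id, model_id) pairs still required
) -> set[tuple[str, str]]:
    """Return the (node_id, model_id) downloads that should be cancelled."""
    still_needed = set(instances)
    return {pair for pair in downloads if pair not in still_needed}
```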
2026-02-05 18:16:43 +00:00
Evan Quiney
572e647908 better cancellation (#1388)
A lot of our cleanup logic wasn't running, leading to bad shutdown states.

## changes
- added `try`/`except` blocks around most task groups
- made the runner shutdown code synchronous
- abandoned the MpReceiver's recv_async thread on cancellation
  - this only occurs during runner shutdown; the queue closing from the
other end should terminate the mp.Queue, cleaning up the thread in its
own time. I could try other methods if this is not sufficient.

## outcome
Ctrl-C just works now! Minus the tokio panic, of course :) No more
hypercorn lifespan errors though!
2026-02-05 15:22:33 +00:00
Alex Cheema
acb97127bf Normalize TextGenerationTaskParams.input to list[InputMessage] (#1360)
## Motivation

With the addition of the Responses API, we introduced `str |
list[InputMessage]` as the type for `TextGenerationTaskParams.input`
since the Responses API supports sending input as a plain string. But
there was no reason to leak that flexibility past the API adapter
boundary — it just meant every downstream consumer had to do `if
isinstance(messages, str):` checks, adding complexity for no benefit.

## Changes

- Changed `TextGenerationTaskParams.input` from `str |
list[InputMessage]` to `list[InputMessage]`
- Each API adapter (Chat Completions, Claude Messages, Responses) now
normalizes to `list[InputMessage]` at the boundary
- Removed `isinstance(task_params.input, str)` branches in
`utils_mlx.py` and `runner.py`
- Wrapped string inputs in `[InputMessage(role="user", content=...)]` in
the warmup path and all test files

## Why It Works

The API adapters are the only place where we deal with raw user input
formats. By normalizing there, all downstream code (worker, runner, MLX
engine) can just assume `list[InputMessage]` and skip the type-checking
branches. The type system (`basedpyright`) catches any missed call sites
at compile time.

## Test Plan

### Automated Testing
- `uv run basedpyright` — 0 errors
- `uv run ruff check` — passes
- `nix fmt` — applied
- `uv run pytest` — 174 passed, 1 skipped

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 06:01:56 -08:00
Evan Quiney
d90605f198 migrate model cards to .toml files (#1354) 2026-02-03 12:32:06 +00:00
Evan Quiney
d97bca88e6 improve distributed testing (#1300)
Our distributed test now does a full query cycle for every model loaded
onto the relevant machine. This will help find bugs early; it has already
found one with Qwen3 Next! I didn't write down what the error was,
though. Good luck with that!

Co-authored-by: rltakashige <rl.takashige@gmail.com>
2026-02-02 18:25:39 +00:00
Alex Cheema
c3537980bd feat: add Claude Messages API and OpenAI Responses API support (#1167)
## Motivation

Add support for Claude Messages API and OpenAI Responses API to allow
users to interact with exo using these popular API formats. This enables
broader compatibility with existing tooling and SDKs that expect these
API formats.

## Architecture

Adapter logic lives exclusively in the API layer
(`src/exo/master/adapters/`). On the way in, each adapter converts its
API-specific request type (`ChatCompletionRequest`,
`ClaudeMessagesRequest`, `ResponsesRequest`) into
`TextGenerationTaskParams`. On the way out, each adapter converts the
`TokenChunk` stream back into its API-specific response format.
Everything inside the application — commands, worker, runner, event
sourcing — only sees `TextGenerationTaskParams` and `TokenChunk`. No
API-specific types cross the boundary.

```
                          API layer                         │  Application internals
                                                            │
Chat Completions → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ChatCompletionResponse
Claude Messages  → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ClaudeMessagesResponse
Responses API    → [adapter] → TextGenerationTaskParams ──→ │ ──→ TextGeneration command → Runner → TokenChunk ──→ │ ──→ [adapter] → ResponsesResponse
```

## Changes

### New Files
- `src/exo/shared/types/claude_api.py` - Pydantic types for Claude
Messages API
- `src/exo/shared/types/openai_responses.py` - Pydantic types for OpenAI
Responses API
- `src/exo/shared/types/text_generation.py` - Shared
`TextGenerationTaskParams` internal type
- `src/exo/master/adapters/chat_completions.py` - Chat Completions
adapter (streaming/non-streaming)
- `src/exo/master/adapters/claude.py` - Claude Messages adapter
(streaming/non-streaming)
- `src/exo/master/adapters/responses.py` - OpenAI Responses adapter
(streaming/non-streaming)

### Modified Files
- `src/exo/master/api.py` - Refactored to use adapters uniformly for all
endpoints; extracted `_resolve_and_validate_text_model` helper to
deduplicate model validation across all text endpoints; removed ad-hoc
`try/except ValueError` blocks from non-streaming paths

### New Endpoints
- `POST /v1/messages` - Claude Messages API (streaming and
non-streaming)
- `POST /v1/responses` - OpenAI Responses API (streaming and
non-streaming)

## Why It Works

All APIs are implemented as pure conversion adapters at the edge of the
application:
1. Adapter functions in `src/exo/master/adapters/` convert incoming
requests to `TextGenerationTaskParams`
2. `api.py` wraps the params in a `TextGeneration` command and sends it
through the existing command/event flow
3. The worker, runner, and event sourcing layers only handle
`TextGenerationTaskParams` and `TokenChunk` — they have no awareness of
Chat Completions, Claude, or Responses API formats
4. On response, adapter functions convert the `TokenChunk` stream back
to the caller's expected format
5. Model validation is handled by a single shared helper
(`_resolve_and_validate_text_model`), mirroring the existing
`_validate_image_model` pattern for image endpoints

No changes to core inference logic were needed.
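The adapter pattern can be sketched with toy types. These stand in for exo's `ChatCompletionRequest`, `ResponsesRequest`, and `TextGenerationTaskParams`; they are not the real definitions, and the field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TextGenerationTaskParams:
    # Toy stand-in for the internal type; real fields will differ.
    model: str
    input: list[dict]
    max_tokens: int = 512

def chat_completions_to_params(req: dict) -> TextGenerationTaskParams:
    """Inbound adapter: Chat Completions request -> internal params."""
    return TextGenerationTaskParams(
        model=req["model"],
        input=req["messages"],
        max_tokens=req.get("max_tokens", 512),
    )

def responses_to_params(req: dict) -> TextGenerationTaskParams:
    """Inbound adapter: Responses request -> internal params.

    The Responses API allows a plain-string input, so it is wrapped
    into a single user message here, at the boundary.
    """
    raw = req["input"]
    messages = [{"role": "user", "content": raw}] if isinstance(raw, str) else raw
    return TextGenerationTaskParams(
        model=req["model"],
        input=messages,
        max_tokens=req.get("max_output_tokens", 512),
    )
```

Everything past these functions sees only `TextGenerationTaskParams`, which is the whole point of the design.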

### Streaming Formats
- **Chat Completions**: Uses `data: {...}\n\n` with `[DONE]` terminator
- **Claude**: Uses event types `message_start`, `content_block_start`,
`content_block_delta`, `content_block_stop`, `message_delta`,
`message_stop`
- **OpenAI Responses**: Uses event types `response.created`,
`response.in_progress`, `response.output_item.added`,
`response.content_part.added`, `response.output_text.delta`,
`response.output_text.done`, `response.content_part.done`,
`response.output_item.done`, `response.completed`
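The Chat Completions framing above can be sketched as follows. The payload fields are simplified, not the full OpenAI schema:

```python
import json

# Minimal sketch of the `data: {...}\n\n` SSE framing with a final
# `[DONE]` sentinel, as described above. Payload fields are simplified,
# not the full Chat Completions chunk schema.

def sse_chat_frames(deltas: list[str]) -> list[str]:
    frames = [
        "data: " + json.dumps({"choices": [{"delta": {"content": d}}]}) + "\n\n"
        for d in deltas
    ]
    frames.append("data: [DONE]\n\n")  # terminator frame
    return frames
```

The Claude and Responses adapters differ mainly in emitting named event types instead of a single anonymous `data:` stream.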

## Test Plan

### Manual Testing
Hardware: MacBook Pro M3 Max

**Non-streaming tests:**
```bash
# Chat Completions API
curl -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}'

# Claude Messages API
curl -X POST http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}]}'

# OpenAI Responses API
curl -X POST http://localhost:52415/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "input": "Hello", "max_output_tokens": 20}'
```

**Streaming tests:**
```bash
# Chat Completions API (streaming)
curl -N -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "messages": [{"role": "user", "content": "Hello"}], "stream": true, "max_tokens": 20}'

# Claude Messages API (streaming)
curl -N -X POST http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "max_tokens": 50, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

# OpenAI Responses API (streaming)
curl -N -X POST http://localhost:52415/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b", "input": "Hello", "stream": true, "max_output_tokens": 20}'
```

All endpoints tested successfully with proper response formats and
streaming events.

### Automated Testing
- Tests in `src/exo/master/tests/` all pass (85 tests)
- Type checker (basedpyright) passes with 0 errors
- Linter (ruff) passes

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
2026-02-02 15:58:37 +00:00
Evan
b1e88a3d06 shfmt
Adds shfmt, a shell formatter, and formats the bash files.
2026-01-29 15:24:36 +00:00
Jake Hillion
9357503c6f downloads: refactor to run at node level
The Worker previously owned the ShardDownloader directly via dependency
injection, which prevented --no-worker nodes from downloading and made
it impossible for multiple Workers to share a single downloader instance.

Moved download functionality to a new DownloadCoordinator component at
the Node level that communicates via the DOWNLOAD_COMMANDS pub/sub topic.
Workers now send StartDownload commands instead of calling the downloader
directly, and receive progress updates through the event-sourced state.

This decouples downloads from the Worker lifecycle and enables future
features like UI-triggered downloads to specific nodes and multi-worker
download sharing.
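A toy sketch of the decoupling: the topic and command names follow the description above (`DOWNLOAD_COMMANDS`, `StartDownload`), but the bus itself is illustrative, not exo's pub/sub implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Toy pub/sub sketch of the decoupling described above; not exo's code.

DOWNLOAD_COMMANDS = "download_commands"

@dataclass
class StartDownload:
    node_id: str
    model_id: str

class Bus:
    def __init__(self):
        self.subscribers = defaultdict(list)
    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)
    def publish(self, topic, msg):
        for handler in self.subscribers[topic]:
            handler(msg)

class DownloadCoordinator:
    """Node-level owner of downloads; Workers only publish commands."""
    def __init__(self, bus: Bus):
        self.started = []
        bus.subscribe(DOWNLOAD_COMMANDS, self.handle)
    def handle(self, cmd: StartDownload):
        self.started.append((cmd.node_id, cmd.model_id))
```

Because Workers only publish `StartDownload`, a node with `--no-worker` can still run the coordinator, and several Workers can share one.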

Test plan:
- Mostly tested in the next PR that adds explicit downloads/deletions to
  the dashboard.
- Started a model that isn't downloaded - it works.
2026-01-23 18:04:09 +00:00
Evan Quiney
d4f551c602 Simplify model cards (#1204)
## Motivation

We have a lot of unneeded data in the model card; let's just keep the
necessary stuff and add more back when we need it.

## Test Plan

EXO still runs! (pipeline on 2)

Co-authored-by: rltakashige <rl.takashige@gmail.com>
2026-01-20 11:01:19 +00:00
Evan Quiney
2202685c3e refactor all information sources (including ipless rdma discovery) (#928)
## Motivation

Information gathering is tightly coupled to MacMon; we should start
generalizing our information sources so we can add more in the future.

## Changes

Added a new system to gather arbitrary information. Currently it is
attached to the Worker, though this is mostly to keep the data processing
logic simple. It could be made independent quite easily.

I also refactored topology to include different kinds of connections as
we can gather RDMA connections without having a pre-existing socket
connection, and made the relevant placement updates. We should no longer
need the network locations script in the app.

Other sources of information now include:
- static node information like "model" and "chip" (macOS, "Unknown"
fallback)
- device friendly name (macOS, falls back to the device hostname)
- network interfaces + IPs (cross-platform)
- Thunderbolt interfaces (macOS)
- Thunderbolt connections (macOS)
- RAM usage (cross-platform)
- per-device configuration written to EXO_HOME/config.toml

## Limitations

Model and Chip are not cross-platform concepts.

We do not differentiate between unified and non-unified memory systems.

A lot of this data collection is based on simple timers. Watching the SC
store on macOS is the correct way to gather some of this information,
but requires a detour into Rust.

## Why It Works

The InfoGatherer is a generic subsystem that returns a union of metric
datatypes. It writes them to an event, which is applied to state. It is
currently re-spawned with the worker so each cluster receives the
correct information.

As for topology, macOS identifies TB ports with a UUID in
SPThunderboltDataType, and also stores remote UUIDs if it can find them.
These changes read that data via system_profiler, hopefully not so
often as to cause notable performance impact (though this should be
tuned) but frequently enough for moderate responsiveness.
Since we can identify TB connections between devices without needing IPs
attached to each interface, we can remove the network setup script
(almost) completely.
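The UUID-matching idea can be sketched as follows. The dict keys here are simplified assumptions about the `system_profiler SPThunderboltDataType -json` output, not a verified schema:

```python
# Hypothetical sketch: pair local Thunderbolt port UUIDs with the remote
# UUIDs they report seeing. The key names ("uuid", "remote_uuid") are
# assumptions, not the actual SPThunderboltDataType field names.

def match_tb_connections(ports: list[dict]) -> list[tuple[str, str]]:
    """Return (local_uuid, remote_uuid) pairs for ports that see a peer."""
    return [(p["uuid"], p["remote_uuid"]) for p in ports if p.get("remote_uuid")]
```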

## Test Plan

### Manual Testing
Spawn RDMA instances without enabling DHCP on the RDMA interfaces.

### Automated Testing
Updated the current master and shared tests to cover the topology
refactor and new events.

---------

Co-authored-by: Sami Khan <smsak99@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Jake Hillion <jake@hillion.co.uk>
2026-01-19 16:58:09 +00:00
Evan Quiney
1200a7db64 Add tensor sharding for GPT-OSS (#1144)
## Motivation

GPT-OSS did not previously support tensor sharding.

## Changes

Add GPT-OSS sharding support in tensor_auto_parallel.
The code is mostly @rltakashige's.

## Test Plan

### Manual Testing
Tested GPT-OSS. MLX Fast Sync causes issues with tensor RDMA; this is a general problem at the moment.
2026-01-13 17:25:52 +00:00
Alex Cheema
e5e74e1eef Upgrade mlx-lm to 0.30.2 with transformers 5.x compatibility (#1125)
## Motivation

Upgrade mlx-lm to version 0.30.2 which requires transformers 5.0.0rc2 as
a prerelease dependency. This enables support for newer models like Kimi
K2 Thinking while maintaining compatibility with existing models.

The transformers 5.x release includes breaking changes that affect
custom tokenizers like Kimi's TikTokenTokenizer, requiring compatibility
fixes.

## Changes

### Core Changes
- **mlx-lm upgrade**: Bump to 0.30.2 with locked exact versions for
mlx/mlx-lm to prevent breaking changes
- **transformers 5.x compatibility**: Enable prerelease transformers
dependency

### Kimi K2 Tokenizer Fixes
- Add `bytes_to_unicode` monkey-patch to restore function moved in
transformers 5.0.0rc2
- Load `TikTokenTokenizer` directly instead of via `AutoTokenizer` to
bypass transformers 5.x bug with `auto_map` fallback
- Patch `encode()` to use tiktoken directly with `allowed_special="all"`
to handle special tokens from chat templates

### Other Changes
- Dashboard: Show disk usage for completed model downloads
- CI: Add `workflow_dispatch` trigger to build-app workflow
- Docs: Add basic API documentation

### Testing
- Add comprehensive tokenizer unit tests for all supported models
- Tests verify encode/decode, special token handling, and chat template
encoding

## Why It Works

**bytes_to_unicode issue**: transformers 5.0.0rc2 moved
`bytes_to_unicode` from `transformers.models.gpt2.tokenization_gpt2` to
`transformers.convert_slow_tokenizer`. Kimi's `tokenization_kimi.py`
imports from the old location. The monkey-patch restores it at module
load time.

**AutoTokenizer issue**: transformers 5.x has a bug where
`tokenizer_class_from_name('TikTokenTokenizer')` returns `None` for
custom tokenizers with `auto_map`. Loading the tokenizer directly
bypasses this.

**encode() issue**: transformers 5.x's `pad()` method fails for slow
tokenizers. Using tiktoken's encode directly with
`allowed_special="all"` avoids this path and properly handles special
tokens like `<|im_user|>` from chat templates.
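The move-restoring monkey-patch pattern can be sketched with toy modules so the snippet runs anywhere; in the real fix, the old and new homes are `transformers.models.gpt2.tokenization_gpt2` and `transformers.convert_slow_tokenizer`, and `bytes_to_unicode` builds a byte-to-character lookup table.

```python
import types

# Toy modules standing in for the transformers modules named above.
new_home = types.ModuleType("new_home")

def bytes_to_unicode():
    # Stand-in for the real lookup-table builder.
    return {i: chr(i) for i in range(33, 127)}

new_home.bytes_to_unicode = bytes_to_unicode
old_home = types.ModuleType("old_home")  # the function was moved away from here

def restore_moved_function(old_mod, new_mod, name: str) -> None:
    """Re-expose a function at its pre-move import location, so legacy
    code importing from the old module keeps working."""
    if not hasattr(old_mod, name):
        setattr(old_mod, name, getattr(new_mod, name))

restore_moved_function(old_home, new_home, "bytes_to_unicode")
```

The patch must run at module load time, before any legacy tokenizer code performs its import.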

## Test Plan

### Manual Testing
- Hardware: 2x Mac Studios connected via Thunderbolt 5 (mike22 and
james21)
- Tested Kimi K2 Thinking, GPT-OSS-120B, GPT-OSS-20B, Llama-3.1-8B-bf16,
and Qwen3-30B-A3B-8bit with pipeline parallelism across both nodes
- Verified warmup inference completes successfully
- Verified chat completions work with special tokens

### Automated Testing
- Added `test_tokenizers.py` with 31 tests covering:
  - Basic encode/decode for all model families (deepseek, kimi, llama,
qwen, gpt-oss, glm)
  - Special token encoding (critical for chat templates)
  - Chat template application and encoding
  - Kimi-specific and GLM-specific edge cases
- All tests pass: `uv run pytest
src/exo/worker/tests/unittests/test_mlx/test_tokenizers.py`

### Failing Tests
RDMA with all models.

---------

Co-authored-by: Evan <evanev7@gmail.com>
2026-01-13 12:06:04 +00:00
Evan Quiney
56af61fac9 add a server for distributed testing in /tests until we work out a stable solution. (#1098)
## Motivation

Testing multiple devices simultaneously requires coordination, and we
don't necessarily want to run a full EXO to test single components. We
need a mid-scale integration testing framework for distributed tests.

## Changes

Add a simple Python server plus a bash query script that runs Jaccl and
Ring tests without constructing a worker/master/networking stack. The
query currently relies on all devices being accessible over Tailscale.

## Test Plan

Manually tested RDMA + Ring inference on 2 nodes.
2026-01-08 12:50:04 +00:00