## Motivation
When running Exo-Bench extensively, there are many cases where prefix
caching can speed up the benchmarks, especially when the focus is on
token generation.
At the same time, prefix caching decode tokens is clearly not very
useful in most current scenarios: surprisingly, even for non-thinking
models, the chat template formats a continued conversation such that
the existing cache is not effective.
We already (slightly accidentally) do this for the batch generator - we
should do it for the sequential generator too.
## Changes
exo bench can now be sped up with a prefix-caching flag. For the most
accurate pp results it is better to leave it off, but it speeds up tg
and large benchmark runs significantly.
Updated the methodology to match.
## Test Plan
### Manual Testing
Tested on many configurations that the difference in results is
negligible, even with multiple --pp options.
## Motivation
MLX LM recently had a massive refactor of its BatchGenerator. Since
we'd like new features from MLX LM such as Gemma 4, we need to update
our code to handle this.
Additionally, this fixes a significant memory leak in GatedDeltaNet
(the difference is substantial, up to 1GB every 1000 tokens, which
explains several memory issues users were facing with Qwen3.5 models).
## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>
After (no memory leak, thanks to one of the upstream changes)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
## Motivation
The timings in the batch generator are a little optimistic; a minor
change is needed to make them more correct.
## Changes
Include the time spent in the API in the generation tps, and make sure
to send all requests simultaneously.
## Motivation
Replace confusing EXO_MODELS_DIR/EXO_MODELS_PATH with clearer
multi-directory support, enabling automatic download spillover across
volumes.
## Changes
- EXO_MODELS_DIRS: colon-separated writable dirs (default always
prepended, first with enough space wins)
- EXO_MODELS_READ_ONLY_DIRS: colon-separated read-only dirs (protected
from deletion)
- select_download_dir(): picks writable dir by free space
- resolve_existing_model(): unified lookup across all dirs
- is_read_only_model_dir(): path-based read-only detection instead of
hardcoded flag
- Updated coordinator, worker, model cards, tests
## Why It Works
Default dir always included so zero-config behavior is unchanged. Disk
space checked at download time for automatic spillover. Read-only status
derived from path, not hardcoded.
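The first-with-enough-space selection can be sketched as follows. This is a minimal illustration, assuming a directory list and a required size in bytes; it is not exo's actual implementation.

```python
# Hypothetical sketch of first-writable-dir-with-enough-space selection.
# The helper name and signature are illustrative, not exo's API.
import shutil

def select_download_dir(dirs: list[str], needed_bytes: int) -> str:
    for d in dirs:  # the default dir is always prepended by the caller
        if shutil.disk_usage(d).free >= needed_bytes:
            return d  # first directory with enough free space wins
    raise RuntimeError("no writable directory has enough free space")
```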
## Test Plan
### Manual Testing
- No env vars set → identical behavior
- EXO_MODELS_DIRS=/Volumes/SSD/models → downloads to external storage
- EXO_MODELS_READ_ONLY_DIRS=/mnt/nfs → models found, deletion blocked
### Automated Testing
- 4 new tests in test_xdg_paths.py (prepend, default-only, overlap,
empty read-only)
- Existing tests updated to patch new constants
## Motivation
Batch generation reports incorrect statistics, as mlx lm never clears
the original stats, meaning they get polluted over time.
The dashboard also seems considerably slower than the bench statistics.
We also have a large discrepancy between B=1 batch generation and
mlx_generate.
Extracting logprobs is massively expensive, causing up to a 25% slowdown
compared to pure batching.
```
[ 12:02:01.1240AM | INFO ] step overhead: 3.49ms (next=12.49ms total=15.99ms)
[ 12:02:02.1600AM | INFO ] step overhead: 3.23ms (next=13.01ms total=16.24ms)
[ 12:02:03.2228AM | INFO ] step overhead: 3.28ms (next=13.38ms total=16.66ms)
[ 12:02:04.2798AM | INFO ] step overhead: 3.25ms (next=12.84ms total=16.10ms)
[ 12:02:05.3152AM | INFO ] step overhead: 3.18ms (next=12.61ms total=15.79ms)
[ 12:02:06.3522AM | INFO ] step overhead: 3.41ms (next=12.83ms total=16.25ms)
[ 12:02:07.3987AM | INFO ] step overhead: 3.38ms (next=13.14ms total=16.52ms)
[ 12:02:08.4537AM | INFO ] step overhead: 1.84ms (next=19.44ms total=21.28ms)
```
## Changes
1. Report stats ourselves instead of using mlx lm's stats for batch
generation (they use perf_counter anyway).
2. Adjust exo bench to match
3. Improve logprobs extraction speed by 10x, improving tps for dashboard
& any requests for logprobs
4. Use an SSE comment to align the speed to the real numbers at the end
of generation
5. Patch mlx for several optimizations given our assumptions and use
cases (e.g. use vllm style RoPE).
6. Switch MLX LM version to latest main, including support for Nemotron
Super and some Qwen3.5 fixes.
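Change 4 relies on a property of the SSE wire format: lines beginning with a colon are comments, which spec-compliant SSE clients ignore. A hedged sketch of such a trailer (the payload shape and helper name are assumptions, not exo's actual code):

```python
# Hedged sketch: an SSE comment line (leading ":") is ignored by standard
# SSE/OpenAI clients, so it can carry the corrected final speed for the
# dashboard without breaking other parsers. Payload shape is assumed.
import json

def sse_speed_comment(generation_tps: float) -> str:
    return f": {json.dumps({'generation_tps': generation_tps})}\n\n"

print(sse_speed_comment(85.5))
```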
## Why It Works
1. Exo bench no longer reports polluted stats
2. Exo bench now handles the reported per-request stats rather than the
aggregate stats
3. The decode speed now jumps back to a real number at the end of the
generation
4. Large batch speedup for rotating KV cache models + 1:1 matching cache
with vllm
## Test Plan
### Manual Testing
Needs testing on OpenCode and CC
Needs eval testing
### Automated Testing
We only show the performance-optimization difference after switching to
accurate reporting:
**GPT OSS 20B MXFP4 Q8 (large change)**
Before:
<img width="2466" height="1534" alt="image"
src="https://github.com/user-attachments/assets/88b50637-fca2-4db4-9413-b9eee6e2057e"
/>
<img width="2410" height="1240" alt="image"
src="https://github.com/user-attachments/assets/21e5c76a-2f5f-44d2-8953-121b3ebdbd68"
/>
After:
<img width="2476" height="1472" alt="image"
src="https://github.com/user-attachments/assets/fec5cfbd-fff8-430a-b12e-a329410107a2"
/>
<img width="2454" height="1236" alt="image"
src="https://github.com/user-attachments/assets/0400344b-a4a6-42c0-a9dd-4ee91ade714a"
/>
**Qwen 3.5 35B A3B 8bit (No change)**
Before:
<img width="2414" height="1396" alt="image"
src="https://github.com/user-attachments/assets/e75f0b38-df5d-49fd-ab90-bc1667d981b3"
/>
After:
<img width="2346" height="1234" alt="image"
src="https://github.com/user-attachments/assets/eabfb59c-851f-4d88-b927-e1e699a75cc6"
/>
**Llama 3.2 1B Instruct 4bit (small change)**
Before:
<img width="2516" height="1220" alt="image"
src="https://github.com/user-attachments/assets/c2873655-acff-4536-8263-fb8aea33db80"
/>
After:
<img width="2566" height="1370" alt="image"
src="https://github.com/user-attachments/assets/15f95c75-1c2f-4474-85a2-88c4d0a32543"
/>
## Motivation
Following the changes made in #1632!
Closes #1020
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
---------
Co-authored-by: Evan Quiney <evanev7@gmail.com>
The bench harness accessed the serialized DownloadCompleted field as
"totalBytes", but the Python field is `total: Memory`, which serializes
to "total" (not snake_case, so the camelCase alias generator leaves it
unchanged). This caused a KeyError when --danger-delete-downloads
needed to free disk space by deleting existing models.
Changed the key from "totalBytes" to "total" to match the actual
serialized JSON structure.
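For illustration, a minimal camelCase alias generator (a sketch, not exo's serializer) shows why a single-word field name passes through untouched:

```python
# Minimal sketch of a snake_case -> camelCase alias generator; a
# single-word field like `total` has no underscores, so it is unchanged.
def to_camel(name: str) -> str:
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)

print(to_camel("total"))        # total
print(to_camel("total_bytes"))  # totalBytes
```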
Test plan:
- CI
- Reproduced KeyError on unfixed code by filling disk on a test node
to 14GB free and running exo-bench with a 16GB model
(Meta-Llama-3.1-8B-Instruct-bf16) with --danger-delete-downloads
- Verified fixed code successfully deletes smaller models to free
space and completes the benchmark run
Models in EXO_MODELS_PATH are pre-downloaded into read-only directories
and must not be deleted. The DownloadCoordinator had no awareness of
these paths, so they never appeared as completed downloads in cluster
state, and the bench harness could attempt to delete them when freeing
disk space.
Added a `read_only: bool` field to `DownloadCompleted` (default False).
The DownloadCoordinator now checks `resolve_model_in_path` in
`_start_download`, proactively scans EXO_MODELS_PATH in
`_emit_existing_download_progress` to emit DownloadCompleted events for
all pre-downloaded models (overriding DownloadPending from the regular
scan), and refuses deletion of read-only models. The bench harness
filters out read-only models from deletion candidates.
Test plan:
- Ran with EXO_MODELS_PATH. Available models now show as downloaded in
the UI. There isn't good UI for the fact they can't be deleted, but it
should work with exo_bench.
The bench script downloads models during the planning phase but doesn't
record how long the download took, making it difficult to track download
performance for a given model over time.
Modified `run_planning_phase` to return download metadata: whether a
fresh download occurred, the wall-clock duration, and the model size in
bytes. These fields are included in every JSON output row alongside the
existing per-run metrics, and a summary line is logged to the console.
This allows filtering bench results by `download_occurred` and grouping
by `model_id` to compute average download times across runs.
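As a sketch of the intended analysis (field names as described above; the script itself is illustrative, not part of the change):

```python
# Illustrative post-processing: average download duration per model from
# bench results rows, filtering on the new download_occurred field.
from collections import defaultdict

def average_download_times(rows: list[dict]) -> dict[str, float]:
    durations: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        if row.get("download_occurred"):  # only fresh downloads
            durations[row["model_id"]].append(row["download_duration_s"])
    return {m: sum(t) / len(t) for m, t in durations.items()}
```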
Test plan:
```
# existing model
jake@maverick:/data/users/jake/repos/exo/ > nix run .#exo-bench -- --host s1 --model mlx-community/gpt-oss-120b-MXFP4-Q8 --pp 128 --tg 128
...
2026-02-20 15:23:49.081 | INFO | __main__:main:340 - Planning phase: checking downloads...
2026-02-20 15:23:49.152 | INFO | harness:run_planning_phase:402 - Started download on 12D3KooWKx41iikn188ozrxSdoG26g88jFCfie9wEA1eQR8csbPm
2026-02-20 15:23:49.184 | INFO | __main__:main:352 - Download: model already cached
...
Wrote results JSON: bench/results.json
jake@maverick:/data/users/jake/repos/exo/ > cat bench/results.json
[
{
"elapsed_s": 2.9446684420108795,
"output_text_preview": "The user just typed a long series of \"a\". Possibly they are testing. There's no explicit question. Could be they want a response? Might be a test of handling long input. We can respond politely, ask i",
"stats": {
"prompt_tps": 117.7872141515621,
"generation_tps": 85.49598231498028,
"prompt_tokens": 129,
"generation_tokens": 128,
"peak_memory_usage": {
"inBytes": 68215145744
}
},
"model_short_id": "gpt-oss-120b-MXFP4-Q8",
"model_id": "mlx-community/gpt-oss-120b-MXFP4-Q8",
"placement_sharding": "Pipeline",
"placement_instance_meta": "MlxRing",
"placement_nodes": 1,
"instance_id": "68babc2a-6e94-4c70-aa07-7ec681f7c856",
"pp_tokens": 128,
"tg": 128,
"repeat_index": 0
}
]%
# no change to output
```
```
# missing model
jake@maverick:/data/users/jake/repos/exo/ > nix run .#exo-bench -- --host s1 --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --pp 128 --tg 128
...
2026-02-20 15:24:42.553 | INFO | __main__:main:340 - Planning phase: checking downloads...
2026-02-20 15:24:42.625 | INFO | harness:run_planning_phase:402 - Started download on 12D3KooWKx41iikn188ozrxSdoG26g88jFCfie9wEA1eQR8csbPm
2026-02-20 15:25:37.494 | INFO | __main__:main:350 - Download: 54.9s (freshly downloaded)
...
Wrote results JSON: bench/results.json
jake@maverick:/data/users/jake/repos/exo/ > cat bench/results.json
[
{
"elapsed_s": 1.500349276990164,
"output_text_preview": "It seems like you've entered a large number of 'a's. If you'd like to discuss something or ask a question, I'm here to help. If not, is there anything else I can assist you with? \n\nIf you're intereste",
"stats": {
"prompt_tps": 395.43264952543666,
"generation_tps": 128.03520443181478,
"prompt_tokens": 129,
"generation_tokens": 128,
"peak_memory_usage": {
"inBytes": 5116952079
}
},
"model_short_id": "Meta-Llama-3.1-8B-Instruct-4bit",
"model_id": "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
"placement_sharding": "Pipeline",
"placement_instance_meta": "MlxRing",
"placement_nodes": 1,
"instance_id": "ccd9bd71-d4cc-4b75-a37f-98090544626a",
"pp_tokens": 128,
"tg": 128,
"repeat_index": 0,
"download_duration_s": 54.88322358299047
}
]%
# one new field
```
c2f2111b extracted shared utilities from exo_bench.py into harness.py
but accidentally dropped the run_planning_phase function and
--danger-delete-downloads CLI argument in the process.
Restored run_planning_phase in harness.py (where its dependencies now
live) and re-added the --danger-delete-downloads argument to
add_common_instance_args. Re-wired the planning phase call in
exo_bench.py's main() before the benchmark loop.
## Motivation
Users processing long prompts have no visibility into when token
generation will start. This feature adds a progress bar showing prefill
progress, giving users real-time feedback during prompt processing.
## Changes
### Backend
- Added `PrefillProgress` event type with `command_id`,
`processed_tokens`, `total_tokens`
- Added `PrefillProgressResponse` type (though now using direct callback
approach)
- Wired `prompt_progress_callback` through MLX's `stream_generate()`
- Progress events sent directly from callback for real-time updates (not
batched)
- API generates SSE named events: `event: prefill_progress\ndata: {...}`
- Added `PrefillProgressData` dataclass and `StreamEvent` union type in
API
### Dashboard
- Added `PrefillProgress` interface to store
- Updated SSE parsing to handle `event:` lines (named events)
- Created `PrefillProgressBar.svelte` with animated progress bar
- Shows "Processing prompt: X/Y tokens" with percentage
- Progress bar disappears when first token arrives
## Why It Works
MLX's `stream_generate()` accepts a `prompt_progress_callback(processed,
total)` that's called after each prefill chunk. By sending events
directly from this callback (rather than yielding from the generator),
progress updates are sent in real-time during prefill.
Using SSE named events (`event: prefill_progress`) maintains full
OpenAI/Claude API compatibility - standard clients ignore named events
they don't recognize, while the exo dashboard explicitly listens for
them.
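A hedged sketch of the event framing (the payload field names follow the event type described above; the helper itself is an assumption):

```python
# Sketch of a named SSE event for prefill progress. Per the SSE spec, an
# `event:` line names the following `data:` payload; clients that only
# read unnamed data events (OpenAI SDKs) skip it.
import json

def sse_prefill_progress(processed_tokens: int, total_tokens: int) -> str:
    payload = {"processed_tokens": processed_tokens,
               "total_tokens": total_tokens}
    return f"event: prefill_progress\ndata: {json.dumps(payload)}\n\n"

print(sse_prefill_progress(512, 2048))
```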
## Test Plan
### Manual Testing
- Hardware: MacBook Pro M3 Max
- Set `prefill_step_size=256` for more frequent updates
- Tested with long prompts (pasted large documents)
- Verified progress bar updates incrementally during prefill
- Confirmed progress bar disappears when generation starts
- Tested with curl - standard `data:` events still work normally
Here it is working:
https://github.com/user-attachments/assets/5cc6f075-c5b2-4a44-bb4d-9efb246bc5fe
### Automated Testing
- Type checker passes (0 errors)
- All 192 tests pass
- Dashboard builds successfully
### API Compatibility
- Named SSE events are ignored by OpenAI SDK clients
- Regular token data uses standard `data: {...}` format
- `[DONE]` sentinel works as expected
---
**Note:** `prefill_step_size` is temporarily set to 256 for testing.
Should be changed back to 2048 before merging for production
performance.
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
## Motivation
GPT OSS tool calling issues.
## Changes
Fixes those and adds a bunch of evals for tool calling.
Fixes GLM5 prefix caching, where CacheList wasn't getting handled
properly.
Extracts a bunch of the setup functionality of exo bench into a harness
that can be reused elsewhere, such as in the tool calling eval.
## Test Plan
### Automated Testing
Let's run the evals for all models
Adds all the models that can fit onto a single M3 Ultra for single
machine benchmarks. Fixes the macOS version, GPU spec, and chip type for
maximum reproducibility. Specifies the minimum memory accordingly for
each type of model, using the smallest machine available (the smallest
M3 Ultra is 96GiB).
Test plan:
- Running this with some code that makes machines of this spec available
and stores the results. It works.
This will become part of a larger testing/stability strategy once we've
collected more of the data.
exo bench previously relied on the worker's plan loop to download
models, which could fail silently or run into disk space issues during
benchmarking. This made it difficult to diagnose download failures.
Added a planning phase that runs before benchmarking to explicitly
handle downloads. It checks available disk space on each node via the
/state endpoint and starts downloads via POST /download/start. When
the --danger-delete-downloads flag is set and there's insufficient
space, it deletes existing models from smallest to largest until
there's room for the benchmark model.
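The smallest-to-largest deletion strategy can be sketched like this (names and signature are illustrative, not the harness's actual code):

```python
# Hypothetical sketch: pick models to delete, smallest first, until the
# benchmark model fits. Raises if deleting everything is still not enough.
def plan_deletions(needed: int, free: int,
                   models: list[tuple[str, int]]) -> list[str]:
    to_delete = []
    for model_id, size in sorted(models, key=lambda m: m[1]):
        if free >= needed:
            break
        free += size  # this model's bytes would be reclaimed
        to_delete.append(model_id)
    if free < needed:
        raise RuntimeError(f"Insufficient disk: need {needed}B, have {free}B")
    return to_delete
```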
Test plan:
- CI
```
jake@maverick:/data/users/jake/repos/exo/ > nix run .#exo-bench -- --pp 128,2048,4096 --tg 128 --stdout --settle-timeout 10 --host s1 --model mlx-community/gpt-oss-120b-MXFP4-Q8
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2026-02-16 12:12:11.807 | INFO | __main__:main:710 - pp/tg mode: combinations (product) - 3 pairs
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
2026-02-16 12:12:13.455 | DEBUG | __main__:main:725 - [exo-bench] loaded tokenizer: mlx-community/gpt-oss-120b-MXFP4-Q8 for prompt sizer
2026-02-16 12:12:13.473 | DEBUG | __main__:main:761 - exo-bench model: short_id=gpt-oss-120b-MXFP4-Q8 full_id=mlx-community/gpt-oss-120b-MXFP4-Q8
2026-02-16 12:12:13.473 | INFO | __main__:main:762 - placements: 1
2026-02-16 12:12:13.474 | INFO | __main__:main:764 - - Pipeline / MlxRing / nodes=1
2026-02-16 12:12:13.474 | INFO | __main__:main:771 - Planning phase: checking downloads...
Traceback (most recent call last):
File "/nix/store/q31kmbcfr5bf97290bvbnhrvpc3fv824-source/bench/exo_bench.py", line 885, in <module>
raise SystemExit(main())
~~~~^^
File "/nix/store/q31kmbcfr5bf97290bvbnhrvpc3fv824-source/bench/exo_bench.py", line 772, in main
run_planning_phase(
~~~~~~~~~~~~~~~~~~^
client,
^^^^^^^
...<4 lines>...
settle_deadline,
^^^^^^^^^^^^^^^^
)
^
File "/nix/store/q31kmbcfr5bf97290bvbnhrvpc3fv824-source/bench/exo_bench.py", line 367, in run_planning_phase
raise RuntimeError(
...<2 lines>...
)
RuntimeError: Insufficient disk on 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj: need 65GB, have 55GB. Use --danger-delete-downloads to free space.
jake@maverick:/data/users/jake/repos/exo/ > nix run .#exo-bench -- --pp 128,2048,4096 --tg 128 --stdout --settle-timeout 10 --host s1 --model mlx-community/gpt-oss-120b-MXFP4-Q8 --danger-delete-downloads
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2026-02-16 12:12:19.626 | INFO | __main__:main:710 - pp/tg mode: combinations (product) - 3 pairs
2026-02-16 12:12:21.262 | DEBUG | __main__:main:725 - [exo-bench] loaded tokenizer: mlx-community/gpt-oss-120b-MXFP4-Q8 for prompt sizer
2026-02-16 12:12:21.280 | DEBUG | __main__:main:761 - exo-bench model: short_id=gpt-oss-120b-MXFP4-Q8 full_id=mlx-community/gpt-oss-120b-MXFP4-Q8
2026-02-16 12:12:21.280 | INFO | __main__:main:762 - placements: 1
2026-02-16 12:12:21.280 | INFO | __main__:main:764 - - Pipeline / MlxRing / nodes=1
2026-02-16 12:12:21.280 | INFO | __main__:main:771 - Planning phase: checking downloads...
2026-02-16 12:12:21.336 | INFO | __main__:run_planning_phase:386 - Deleting mlx-community/Qwen3-0.6B-4bit from 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj (335MB)
2026-02-16 12:12:21.350 | INFO | __main__:run_planning_phase:386 - Deleting mlx-community/Llama-3.2-1B-Instruct-4bit from 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj (679MB)
2026-02-16 12:12:21.363 | INFO | __main__:run_planning_phase:386 - Deleting mlx-community/Llama-3.2-3B-Instruct-4bit from 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj (1740MB)
2026-02-16 12:12:21.373 | INFO | __main__:run_planning_phase:386 - Deleting mlx-community/Llama-3.2-3B-Instruct-8bit from 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj (3264MB)
2026-02-16 12:12:21.384 | INFO | __main__:run_planning_phase:386 - Deleting mlx-community/GLM-4.7-Flash-8bit from 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj (30366MB)
2026-02-16 12:12:21.413 | INFO | __main__:run_planning_phase:407 - Started download on 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj
```
It's not pretty but it works!
exo_bench.py fails if started too soon after a cluster starts because
the topology hasn't populated yet, resulting in no valid placements.
Extracted the preview-fetch-and-filter logic into a
`fetch_and_filter_placements` helper and added a retry loop with
exponential backoff (1s initial, 2x multiplier, 60s cap). The new
`--settle-timeout` flag controls how long to retry (default 0 = try
once, preserving existing behaviour). Each retry logs a warning
explaining the cluster may still be settling.
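The retry loop can be sketched as follows (helper names are illustrative; the parameters match the description above):

```python
# Illustrative retry-with-backoff sketch: 1s initial delay, 2x multiplier,
# 60s cap; settle_timeout=0 keeps the old single-attempt behaviour.
import time

def fetch_with_settle(fetch, settle_timeout: float):
    deadline = time.monotonic() + settle_timeout
    delay = 1.0
    while True:
        placements = fetch()
        if placements or time.monotonic() >= deadline:
            return placements
        print("no valid placements yet; the cluster may still be settling")
        time.sleep(min(delay, 60.0))
        delay *= 2.0
```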
Test plan:
- Tested on several freshly started clusters. This used to fail a lot,
now it succeeds.
exo-bench was gated behind isDarwin in python/parts.nix because it used
exoVenv, which pulls in MLX (Darwin-only). However, exo_bench.py is an
HTTP client that only needs loguru, transformers, huggingface-hub, and
tiktoken.
Made bench a uv workspace member with its own pyproject.toml declaring
only the minimal dependencies. Added a separate benchVenv in parts.nix
built from that workspace member, and moved exo-bench out of the
isDarwin block so it is available on all platforms.
Test plan:
- `nix run .#exo-bench -- --help` prints argparse help
---------
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
runs exo_bench remotely with some nice git QoL
## usage
run tests/auto_bench.sh host1 [host2]
exo bench will be run on those hosts and its output saved to
bench/commit_hash/*.json for all models currently downloaded
## Motivation
<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
Make exo bench faster for longer prompts, lengthen default timeouts and
use pairs for pp and tg.
## Changes
- Uses binary search to find the correct prompt length
- Flag to force all combinations if that is desired
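The binary search can be sketched as follows (the token-counting callback and the single repeated word are simplifying assumptions):

```python
# Hedged sketch: binary-search the number of repeated words whose token
# count first reaches the target prompt size. `count_tokens` stands in
# for a real tokenizer call.
def size_prompt(count_tokens, target: int, hi: int = 1 << 20) -> str:
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if count_tokens("a " * mid) < target:
            lo = mid + 1  # too short: search the upper half
        else:
            hi = mid      # long enough: tighten the upper bound
    return "a " * lo
```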
## Motivation
When `get_shard_download_status()` runs, it iterates over all models in
`MODEL_CARDS` and calls `build_full_shard()` → `build_base_shard()` →
`ModelCard.from_hf()`. This unconditionally tried to download
`config.json` from HuggingFace, but image models (FLUX, Qwen-Image)
don't have a root-level config.json file, causing errors:
```
Error downloading shard: File not found: https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/Qwen/Qwen-Image/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/Qwen/Qwen-Image-Edit-2509/resolve/main/config.json
```
## Changes
### ModelCard.load() fix
- `build_base_shard()` now uses `ModelCard.load()` instead of
`ModelCard.from_hf()`
- `ModelCard.load()` iterates through `MODEL_CARDS.values()` to find a
match by `model_id`
### exo-bench fixes
- Use `name` field instead of `id` for model resolution
- Pass `full_model_id` to `/instance/previews` endpoint
- Make model name matching case-insensitive
- Update README example model name
## Why It Works
`MODEL_CARDS` uses short names as keys (e.g., `"flux1-schnell"`) but the
`model_id` values are HuggingFace paths (e.g.,
`"black-forest-labs/FLUX.1-schnell"`). When `ModelCard.load()` was
called with the HF path, it didn't match any key and fell back to
`from_hf()` which tried to download config.json.
The fix iterates through `MODEL_CARDS.values()` to find a match by
`model_id`, ensuring predefined models (including image models) use
their registry entries directly without network calls. A key lookup is
unnecessary since `load()` is always called with HF paths which don't
match the short-name keys.
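A minimal sketch of the values-based lookup (the card structure is simplified for illustration):

```python
# Sketch: registry keys are short names, so look up by each card's
# model_id instead of by key. Card structure simplified for illustration.
MODEL_CARDS = {
    "flux1-schnell": {"model_id": "black-forest-labs/FLUX.1-schnell"},
    "qwen-image": {"model_id": "Qwen/Qwen-Image"},
}

def load(model_id: str):
    for card in MODEL_CARDS.values():
        if card["model_id"] == model_id:
            return card  # registry hit: no config.json download needed
    return None  # only here would a from_hf()-style fallback be needed
```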
## Test Plan
### Manual Testing
- Run exo and verify no more "Error downloading shard: File not found:
.../config.json" errors for image models
- Run exo-bench and verify model resolution works correctly
### Automated Testing
- `uv run basedpyright` - passes with 0 errors
- `uv run pytest` - all tests pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Motivation
GPU timeouts occur often when prompt size > prefill_step_size. They
also happen for seemingly random models.
## Changes
Add mx.depends so that the cache depends on the logits.
All-gather at the model level rather than the layer level, reducing the
amount of data sent.
## Why It Works
mlx_lm's prefill loop only evaluates the cache state, not the logits.
When the prompt is longer than prefill_step_size, the all_gather is
never evaluated, causing a GPU timeout.
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
Added failing test cases and then resolved them.
- Error chunks
- Use error handling in exo_bench.py
## Motivation
Return when an error occurs so that generation stops. Adding timeouts is
a separate TODO for model loading and chat completions.
## Changes
- Return HTTP exceptions as JSON responses in an OpenAI compatible
format.
- Context manager for generation to catch and return error messages.
- Use error handling in exo_bench.py.
## Test Plan
### Manual Testing
Manually tested that exo_bench returns on failures within and outside
generation
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
The Prompt Sizer was broken because transformers 5.x tokenizers return
BatchEncodings, which are essentially a dictionary of {input_ids: []},
instead of the plain list of input ids.
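A hedged sketch of the normalization (not exo's exact code; the helper name is illustrative):

```python
# Sketch: accept either a raw id list (older transformers) or a
# BatchEncoding-style mapping ({"input_ids": [...]}, transformers 5.x).
def extract_input_ids(encoded):
    if isinstance(encoded, list):
        return encoded  # older behaviour: already a list of ids
    return encoded["input_ids"]  # BatchEncoding supports dict-style access

print(extract_input_ids({"input_ids": [1, 2, 3]}))  # [1, 2, 3]
```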
## Test Plan
### Manual Testing
Tested that exo bench runs as expected.
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
This PR implements benchmarking in the style of llama-bench. The main
difficulty here is the fact that exo is not a library - it exposes an
endpoint. This means that benchmarking numbers will be inaccurate if the
API is measured.
The solution assumes nodes are set up with uv run exo (or via the app),
and then hits the new endpoint /bench/chat/completions to retrieve
generation statistics directly from mlx_lm.
<!-- Why is this change needed? What problem does it solve? -->
This will allow us to release benchmarks for models and perform
regression tests.
TODO: Performance benchmarking.
<!-- If it fixes an open issue, please link to the issue here -->
## Changes
<!-- Describe what you changed in detail -->
- Adds /bench/chat/completions endpoint
- Adds BenchChatCompletion/Response
- Adds a logits processor to prevent response from ending early
- Adds a "Prompt Sizer" which downloads the tokenizer and dynamically
adjusts the prompt of "a" to fit the desired prompt size.
- Reduce prefill step size to 2048 for now (in future, dynamically
adjust this value)
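The anti-early-stop idea from the third bullet can be sketched as a logits processor that masks the EOS token. This is a simplified illustration operating on a plain list; the real processor works on mlx arrays.

```python
# Hedged sketch: a logits processor that sets the EOS logit to -inf so a
# benchmark generation can never end early. A plain list of floats stands
# in for an mlx logits array to keep the sketch self-contained.
import math

def make_no_eos_processor(eos_token_id: int):
    def processor(tokens, logits):
        logits[eos_token_id] = -math.inf  # EOS can never be sampled
        return logits
    return processor
```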
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
Benchmarked Llama, Qwen, DeepSeek and Kimi models. Will require several
fixes to run consistently on all configurations (to be done in the
future).
Manually tested the normal API to verify chat requests complete as
expected.
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
Not really possible. Type checker passes.