Commit Graph

31 Commits

Author SHA1 Message Date
rltakashige
f2709dcde6 Add prefix cache flag to exo bench (#1888)
## Motivation
When using exo bench extensively, there are many cases where prefix
caching could speed up the benchmarks, especially when the focus is
on token generation.

At the same time, it is clear that prefix caching of decode tokens is
not very useful in most current scenarios. Surprisingly, even for
non-thinking models, the chat template means that a continued
conversation is formatted such that the existing cache is not
effective.

We already (slightly accidentally) do this for the batch generator - we
should do it for the sequential generator too.

## Changes

exo bench can now be sped up with a prefix-caching flag. For the most
accurate pp results it is better to leave it off, but it speeds up tg
and large benchmarking runs significantly.
The methodology has been updated to match.

## Test Plan

### Manual Testing
Tested many configurations and confirmed that the difference in results
is negligible, even with multiple --pp options.
2026-04-14 11:12:58 +01:00
rltakashige
43b3df45fb Fix BatchGenerator in line with upstream refactor (and prevent Qwen3.5 memory leak) (#1835)
## Motivation

MLX LM has had a massive refactor to their BatchGenerator recently.
Since we'd like new features from MLX LM such as Gemma 4, we need to
update the code to handle this.

Additionally, this fixes a significant memory leak in GatedDeltaNet. The
difference is quite substantial, up to 1GB every 1000 tokens, which
explains several memory issues users were facing with Qwen3.5 models.

## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>


After (no memory leak, thanks to one of the upstream changes)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
2026-04-07 11:50:12 +00:00
rltakashige
59669c1168 Tighten EXO bench concurrency numbers and explain methodology (#1811)
## Motivation

The timings in the batch generator are a little optimistic; a minor
change is needed to make them more accurate.

## Changes

Include the time spent in the API in the generation tps, and make sure
to send all requests simultaneously.
2026-04-05 00:57:52 +01:00
rltakashige
2efbb8ab4f Improve exo harness with path state (#1815)
<img width="3224" height="1476" alt="image"
src="https://github.com/user-attachments/assets/d90a7d8a-9fe5-43a1-a715-1ef7ecc15422"
/>
2026-03-30 11:20:46 +01:00
ciaranbor
15f1b61f4c Rework model storage directory management (for external storage) (#1765)
## Motivation

Replace confusing EXO_MODELS_DIR/EXO_MODELS_PATH with clearer
multi-directory support, enabling automatic download spillover across
volumes.

## Changes

- EXO_MODELS_DIRS: colon-separated writable dirs (default always
prepended, first with enough space wins)
- EXO_MODELS_READ_ONLY_DIRS: colon-separated read-only dirs (protected
from deletion)
- select_download_dir(): picks writable dir by free space
- resolve_existing_model(): unified lookup across all dirs
- is_read_only_model_dir(): path-based read-only detection instead of
hardcoded flag
- Updated coordinator, worker, model cards, tests

## Why It Works

Default dir always included so zero-config behavior is unchanged. Disk
space checked at download time for automatic spillover. Read-only status
derived from path, not hardcoded.

## Test Plan

### Manual Testing

- No env vars set → identical behavior
- EXO_MODELS_DIRS=/Volumes/SSD/models → downloads to external storage
- EXO_MODELS_READ_ONLY_DIRS=/mnt/nfs → models found, deletion blocked

### Automated Testing

- 4 new tests in test_xdg_paths.py (prepend, default-only, overlap,
empty read-only)
- Existing tests updated to patch new constants
2026-03-26 17:46:46 +00:00
rltakashige
7df3774ca2 Improve batch performance and stats reporting (#1777)
## Motivation

Batch generation reports incorrect statistics: mlx lm never clears the
original stats, so they get polluted over time.
The dashboard also reports speeds considerably slower than the bench
statistics.
We also have a large discrepancy between B=1 batch generation and
mlx_generate.
Extracting logprobs is massively expensive, causing up to a 25% slowdown
compared to pure batching.
```
[ 12:02:01.1240AM | INFO    ] step overhead: 3.49ms (next=12.49ms total=15.99ms)
[ 12:02:02.1600AM | INFO    ] step overhead: 3.23ms (next=13.01ms total=16.24ms)
[ 12:02:03.2228AM | INFO    ] step overhead: 3.28ms (next=13.38ms total=16.66ms)
[ 12:02:04.2798AM | INFO    ] step overhead: 3.25ms (next=12.84ms total=16.10ms)
[ 12:02:05.3152AM | INFO    ] step overhead: 3.18ms (next=12.61ms total=15.79ms)
[ 12:02:06.3522AM | INFO    ] step overhead: 3.41ms (next=12.83ms total=16.25ms)
[ 12:02:07.3987AM | INFO    ] step overhead: 3.38ms (next=13.14ms total=16.52ms)
[ 12:02:08.4537AM | INFO    ] step overhead: 1.84ms (next=19.44ms total=21.28ms)
```

## Changes

1. Report stats ourselves instead of using mlx lm's stats for batch
generation (they use perf_counter anyway).
2. Adjust exo bench to match
3. Improve logprobs extraction speed by 10x, improving tps for dashboard
& any requests for logprobs
4. Use an SSE comment to align the speed to the real numbers at the end
of generation
5. Patch mlx for several optimizations given our assumptions and use
cases (e.g. use vllm style RoPE).
6. Switch MLX LM version to latest main, including support for Nemotron
Super and some Qwen3.5 fixes.

## Why It Works
1. Exo bench no longer reports polluted stats
2. Exo bench now handles the reported per-request stats rather than the
aggregate stats
3. The decode speed now jumps back to a real number at the end of the
generation
4. Large batch speedup for rotating KV cache models + 1:1 matching cache
with vllm

## Test Plan

### Manual Testing
Needs testing on OpenCode and CC
Needs eval testing

### Automated Testing
Only going to show the performance optimization difference after the
accurate reporting:

**GPT OSS 20B MXFP4 Q8 (large change)**
Before:
<img width="2466" height="1534" alt="image"
src="https://github.com/user-attachments/assets/88b50637-fca2-4db4-9413-b9eee6e2057e"
/>
<img width="2410" height="1240" alt="image"
src="https://github.com/user-attachments/assets/21e5c76a-2f5f-44d2-8953-121b3ebdbd68"
/>


After:
<img width="2476" height="1472" alt="image"
src="https://github.com/user-attachments/assets/fec5cfbd-fff8-430a-b12e-a329410107a2"
/>
<img width="2454" height="1236" alt="image"
src="https://github.com/user-attachments/assets/0400344b-a4a6-42c0-a9dd-4ee91ade714a"
/>



**Qwen 3.5 35B A3B 8bit (No change)**
Before:
<img width="2414" height="1396" alt="image"
src="https://github.com/user-attachments/assets/e75f0b38-df5d-49fd-ab90-bc1667d981b3"
/>


After:
<img width="2346" height="1234" alt="image"
src="https://github.com/user-attachments/assets/eabfb59c-851f-4d88-b927-e1e699a75cc6"
/>


**Llama 3.2 1B Instruct 4bit (small change)**
Before:
<img width="2516" height="1220" alt="image"
src="https://github.com/user-attachments/assets/c2873655-acff-4536-8263-fb8aea33db80"
/>

After:
<img width="2566" height="1370" alt="image"
src="https://github.com/user-attachments/assets/15f95c75-1c2f-4474-85a2-88c4d0a32543"
/>
2026-03-24 14:03:03 +00:00
rltakashige
b713889f73 Fix exo bench again again (#1750)
My bad, premature auto-merge.
2026-03-17 13:47:24 +00:00
rltakashige
131ad0ff36 Implement continuous batching (#1642)
## Motivation

Following the changes made in #1632 !
Closes #1020


---------

Co-authored-by: Evan Quiney <evanev7@gmail.com>
2026-03-09 15:04:45 +00:00
Jake Hillion
dc0bb5e13b fmt: add taplo TOML formatter to treefmt configuration 2026-02-27 17:19:48 +00:00
rltakashige
2fe689315b download .model files in exo bench (#1607)
## Motivation

Exo bench failed again for Kimi on a machine that had never downloaded it.

## Test Plan

### Manual Testing
It worked this time.
2026-02-24 11:13:04 +00:00
rltakashige
05986f77aa add exo bench protobuf dependency (#1596)
Kimi K2.5 requires protobuf.
2026-02-23 16:45:22 +00:00
Jake Hillion
8d94eab6c6 bench: fix KeyError on DownloadCompleted total field
The bench harness accessed the serialized DownloadCompleted field as
"totalBytes", but the Python field is `total: Memory` which serializes
to "total" (not snake_case, so the camelCase alias generator leaves it
unchanged). This caused a KeyError when --danger-delete-downloads
needed to free disk space by deleting existing models.

Changed the key from "totalBytes" to "total" to match the actual
serialized JSON structure.
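The alias behaviour described above can be shown with a generic camelCase generator (a sketch, not exo's actual serializer): a single-word snake_case name has nothing to capitalize, so it passes through unchanged.

```python
def to_camel(name: str) -> str:
    """Generic snake_case -> camelCase alias generator."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


# "total" has no underscores, so the alias generator leaves it as "total";
# only multi-word names like "total_bytes" become "totalBytes".
```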

Test plan:
- CI
- Reproduced KeyError on unfixed code by filling disk on a test node
  to 14GB free and running exo-bench with a 16GB model
  (Meta-Llama-3.1-8B-Instruct-bf16) with --danger-delete-downloads
- Verified fixed code successfully deletes smaller models to free
  space and completes the benchmark run
2026-02-23 14:17:41 +00:00
Jake Hillion
ab9273e723 downloads: add read_only flag to DownloadCompleted for EXO_MODELS_PATH
Models in EXO_MODELS_PATH are pre-downloaded into read-only directories
and must not be deleted. The DownloadCoordinator had no awareness of
these paths, so they never appeared as completed downloads in cluster
state, and the bench harness could attempt to delete them when freeing
disk space.

Added a `read_only: bool` field to `DownloadCompleted` (default False).
The DownloadCoordinator now checks `resolve_model_in_path` in
`_start_download`, proactively scans EXO_MODELS_PATH in
`_emit_existing_download_progress` to emit DownloadCompleted events for
all pre-downloaded models (overriding DownloadPending from the regular
scan), and refuses deletion of read-only models. The bench harness
filters out read-only models from deletion candidates.
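The deletion-candidate filtering can be sketched as follows; `DownloadCompleted` here is a stand-in dataclass, not exo's real event type:

```python
from dataclasses import dataclass


@dataclass
class DownloadCompleted:
    model_id: str
    size_bytes: int
    read_only: bool = False  # default False, matching the new field


def deletion_candidates(completed: list[DownloadCompleted]) -> list[DownloadCompleted]:
    # Read-only models (e.g. pre-downloaded via EXO_MODELS_PATH) must never be deleted.
    return [d for d in completed if not d.read_only]
```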

Test plan:
- Ran with EXO_MODELS_PATH. Available models now show as downloaded in
  the UI. There isn't good UI for the fact they can't be deleted, but it
  should work with exo_bench.
2026-02-20 20:27:45 +00:00
Jake Hillion
d484b062e8 bench: add download timing to bench output (#1566)
The bench script downloads models during the planning phase but doesn't
record how long the download took, making it difficult to track download
performance for a given model over time.

Modified `run_planning_phase` to return download metadata: whether a
fresh download occurred, the wall-clock duration, and the model size in
bytes. These fields are included in every JSON output row alongside the
existing per-run metrics, and a summary line is logged to the console.

This allows filtering bench results by `download_occurred` and grouping
by `model_id` to compute average download times across runs.
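For example, the averages could be computed from the results JSON like this (a sketch; it keys off `download_duration_s`, which in the sample output only appears when a fresh download occurred):

```python
from collections import defaultdict


def average_download_times(rows: list[dict]) -> dict[str, float]:
    """Average download duration per model, ignoring cache-hit rows."""
    durations: defaultdict[str, list[float]] = defaultdict(list)
    for row in rows:
        if "download_duration_s" in row:  # only present for fresh downloads
            durations[row["model_id"]].append(row["download_duration_s"])
    return {m: sum(d) / len(d) for m, d in durations.items()}
```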

Test plan:

```
# existing model
jake@maverick:/data/users/jake/repos/exo/ > nix run .#exo-bench -- --host s1 --model mlx-community/gpt-oss-120b-MXFP4-Q8 --pp 128 --tg 128
...
2026-02-20 15:23:49.081 | INFO     | __main__:main:340 - Planning phase: checking downloads...
2026-02-20 15:23:49.152 | INFO     | harness:run_planning_phase:402 - Started download on 12D3KooWKx41iikn188ozrxSdoG26g88jFCfie9wEA1eQR8csbPm
2026-02-20 15:23:49.184 | INFO     | __main__:main:352 - Download: model already cached
...
Wrote results JSON: bench/results.json
jake@maverick:/data/users/jake/repos/exo/ > cat bench/results.json
[
  {
    "elapsed_s": 2.9446684420108795,
    "output_text_preview": "The user just typed a long series of \"a\". Possibly they are testing. There's no explicit question. Could be they want a response? Might be a test of handling long input. We can respond politely, ask i",
    "stats": {
      "prompt_tps": 117.7872141515621,
      "generation_tps": 85.49598231498028,
      "prompt_tokens": 129,
      "generation_tokens": 128,
      "peak_memory_usage": {
        "inBytes": 68215145744
      }
    },
    "model_short_id": "gpt-oss-120b-MXFP4-Q8",
    "model_id": "mlx-community/gpt-oss-120b-MXFP4-Q8",
    "placement_sharding": "Pipeline",
    "placement_instance_meta": "MlxRing",
    "placement_nodes": 1,
    "instance_id": "68babc2a-6e94-4c70-aa07-7ec681f7c856",
    "pp_tokens": 128,
    "tg": 128,
    "repeat_index": 0
  }
]%
# no change to output
```

```
# missing model
jake@maverick:/data/users/jake/repos/exo/ > nix run .#exo-bench -- --host s1 --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --pp 128 --tg 128
...
2026-02-20 15:24:42.553 | INFO     | __main__:main:340 - Planning phase: checking downloads...
2026-02-20 15:24:42.625 | INFO     | harness:run_planning_phase:402 - Started download on 12D3KooWKx41iikn188ozrxSdoG26g88jFCfie9wEA1eQR8csbPm
2026-02-20 15:25:37.494 | INFO     | __main__:main:350 - Download: 54.9s (freshly downloaded)
...
Wrote results JSON: bench/results.json
jake@maverick:/data/users/jake/repos/exo/ > cat bench/results.json
[
  {
    "elapsed_s": 1.500349276990164,
    "output_text_preview": "It seems like you've entered a large number of 'a's. If you'd like to discuss something or ask a question, I'm here to help. If not, is there anything else I can assist you with? \n\nIf you're intereste",
    "stats": {
      "prompt_tps": 395.43264952543666,
      "generation_tps": 128.03520443181478,
      "prompt_tokens": 129,
      "generation_tokens": 128,
      "peak_memory_usage": {
        "inBytes": 5116952079
      }
    },
    "model_short_id": "Meta-Llama-3.1-8B-Instruct-4bit",
    "model_id": "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    "placement_sharding": "Pipeline",
    "placement_instance_meta": "MlxRing",
    "placement_nodes": 1,
    "instance_id": "ccd9bd71-d4cc-4b75-a37f-98090544626a",
    "pp_tokens": 128,
    "tg": 128,
    "repeat_index": 0,
    "download_duration_s": 54.88322358299047
  }
]%
# one new field
```
2026-02-20 15:33:08 +00:00
Jake Hillion
42e1e7322b bench: restore --danger-delete-downloads planning phase (#1542)
c2f2111b extracted shared utilities from exo_bench.py into harness.py
but accidentally dropped the run_planning_phase function and
--danger-delete-downloads CLI argument in the process.

Restored run_planning_phase in harness.py (where its dependencies now
live) and re-added the --danger-delete-downloads argument to
add_common_instance_args. Re-wired the planning phase call in
exo_bench.py's main() before the benchmark loop.
2026-02-19 15:42:02 +00:00
Alex Cheema
025ed9fd82 feat: add prefill progress bar for long prompts (#1181)
## Motivation

Users processing long prompts have no visibility into when token
generation will start. This feature adds a progress bar showing prefill
progress, giving users real-time feedback during prompt processing.

## Changes

### Backend
- Added `PrefillProgress` event type with `command_id`,
`processed_tokens`, `total_tokens`
- Added `PrefillProgressResponse` type (though now using direct callback
approach)
- Wired `prompt_progress_callback` through MLX's `stream_generate()`
- Progress events sent directly from callback for real-time updates (not
batched)
- API generates SSE named events: `event: prefill_progress\ndata: {...}`
- Added `PrefillProgressData` dataclass and `StreamEvent` union type in
API

### Dashboard
- Added `PrefillProgress` interface to store
- Updated SSE parsing to handle `event:` lines (named events)
- Created `PrefillProgressBar.svelte` with animated progress bar
- Shows "Processing prompt: X/Y tokens" with percentage
- Progress bar disappears when first token arrives

## Why It Works

MLX's `stream_generate()` accepts a `prompt_progress_callback(processed,
total)` that's called after each prefill chunk. By sending events
directly from this callback (rather than yielding from the generator),
progress updates are sent in real-time during prefill.

Using SSE named events (`event: prefill_progress`) maintains full
OpenAI/Claude API compatibility - standard clients ignore named events
they don't recognize, while the exo dashboard explicitly listens for
them.
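The compatibility argument can be sketched with a tiny SSE parser (not the dashboard's actual parser): an `event:` line names the following `data:` payload, and a client that only handles the default event name simply never dispatches the named ones.

```python
def parse_sse(stream: str) -> list[tuple[str, str]]:
    """Return (event_name, data) pairs; unnamed events get 'message'."""
    events: list[tuple[str, str]] = []
    name = "message"
    for line in stream.splitlines():
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            events.append((name, line[len("data:"):].strip()))
        elif line == "":
            name = "message"  # blank line terminates the event
    return events
```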

## Test Plan

### Manual Testing
- Hardware: MacBook Pro M3 Max
- Set `prefill_step_size=256` for more frequent updates
- Tested with long prompts (pasted large documents)
- Verified progress bar updates incrementally during prefill
- Confirmed progress bar disappears when generation starts
- Tested with curl - standard `data:` events still work normally

Here it is working:


https://github.com/user-attachments/assets/5cc6f075-c5b2-4a44-bb4d-9efb246bc5fe


### Automated Testing
- Type checker passes (0 errors)
- All 192 tests pass
- Dashboard builds successfully

### API Compatibility
- Named SSE events are ignored by OpenAI SDK clients
- Regular token data uses standard `data: {...}` format
- `[DONE]` sentinel works as expected

---

**Note:** `prefill_step_size` is temporarily set to 256 for testing.
Should be changed back to 2048 before merging for production
performance.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
2026-02-19 03:18:25 +00:00
rltakashige
24e99ce197 Cleanup mistakes (#1537)
Oops
2026-02-18 22:05:26 +00:00
rltakashige
c2f2111b88 Fix tool calling (#1529)
## Motivation

GPT OSS tool calling issues.

## Changes

Fixes those and adds a bunch of evals for tool calling.
Fixes GLM5 prefix caching, where CacheList wasn't getting handled
properly.
Extracts a bunch of the setup functionality of exo bench to a harness
that can be reused elsewhere, such as in the tool calling eval.

## Test Plan
### Automated Testing
Let's run the evals for all models
2026-02-18 20:29:18 +00:00
Jake Hillion
8392e78afe bench: add spec for automatic canary benchmarks (#1483)
Adds all the models that can fit onto a single M3 Ultra for single
machine benchmarks. Fixes the macOS version, GPU spec, and chip type for
maximum reproducibility. Specifies the minimum memory accordingly for
each type of model, using the smallest machine available (the smallest
M3 Ultra is 96GiB).

Test plan:
- Running this with some code that makes machines of this spec available
and stores the results. It works.

This will become part of a larger testing/stability strategy once we've
collected more of the data.
2026-02-17 10:52:05 +00:00
Jake Hillion
131fb141a6 bench: add --danger-delete-downloads flag with planning phase
exo bench previously relied on the worker's plan loop to download
models, which could fail silently or run into disk space issues during
benchmarking. This made it difficult to diagnose download failures.

Added a planning phase that runs before benchmarking to explicitly
handle downloads. It checks available disk space on each node via the
/state endpoint and starts downloads via POST /download/start. When
the --danger-delete-downloads flag is set and there's insufficient
space, it deletes existing models from smallest to largest until
there's room for the benchmark model.
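The freeing strategy could be sketched like this (names and signature are illustrative, not the harness's actual code):

```python
def plan_deletions(models: dict[str, int], free: int, needed: int) -> list[str]:
    """models maps model_id -> size in bytes; returns ids to delete, smallest first."""
    to_delete: list[str] = []
    for model_id, size in sorted(models.items(), key=lambda kv: kv[1]):
        if free >= needed:
            break
        free += size  # space reclaimed by deleting this model
        to_delete.append(model_id)
    if free < needed:
        raise RuntimeError("insufficient disk even after deleting all models")
    return to_delete
```

Deleting smallest-first keeps the largest (and most expensive to re-download) models on disk whenever possible.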

Test plan:
- CI

```
jake@maverick:/data/users/jake/repos/exo/ > nix run .#exo-bench -- --pp 128,2048,4096 --tg 128 --stdout --settle-timeout 10 --host s1 --model mlx-community/gpt-oss-120b-MXFP4-Q8
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2026-02-16 12:12:11.807 | INFO     | __main__:main:710 - pp/tg mode: combinations (product) - 3 pairs
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
2026-02-16 12:12:13.455 | DEBUG    | __main__:main:725 - [exo-bench] loaded tokenizer: mlx-community/gpt-oss-120b-MXFP4-Q8 for prompt sizer
2026-02-16 12:12:13.473 | DEBUG    | __main__:main:761 - exo-bench model: short_id=gpt-oss-120b-MXFP4-Q8 full_id=mlx-community/gpt-oss-120b-MXFP4-Q8
2026-02-16 12:12:13.473 | INFO     | __main__:main:762 - placements: 1
2026-02-16 12:12:13.474 | INFO     | __main__:main:764 -   - Pipeline / MlxRing / nodes=1
2026-02-16 12:12:13.474 | INFO     | __main__:main:771 - Planning phase: checking downloads...
Traceback (most recent call last):
  File "/nix/store/q31kmbcfr5bf97290bvbnhrvpc3fv824-source/bench/exo_bench.py", line 885, in <module>
    raise SystemExit(main())
                     ~~~~^^
  File "/nix/store/q31kmbcfr5bf97290bvbnhrvpc3fv824-source/bench/exo_bench.py", line 772, in main
    run_planning_phase(
    ~~~~~~~~~~~~~~~~~~^
        client,
        ^^^^^^^
    ...<4 lines>...
        settle_deadline,
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/nix/store/q31kmbcfr5bf97290bvbnhrvpc3fv824-source/bench/exo_bench.py", line 367, in run_planning_phase
    raise RuntimeError(
    ...<2 lines>...
    )
RuntimeError: Insufficient disk on 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj: need 65GB, have 55GB. Use --danger-delete-downloads to free space.
jake@maverick:/data/users/jake/repos/exo/ > nix run .#exo-bench -- --pp 128,2048,4096 --tg 128 --stdout --settle-timeout 10 --host s1 --model mlx-community/gpt-oss-120b-MXFP4-Q8 --danger-delete-downloads
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2026-02-16 12:12:19.626 | INFO     | __main__:main:710 - pp/tg mode: combinations (product) - 3 pairs
2026-02-16 12:12:21.262 | DEBUG    | __main__:main:725 - [exo-bench] loaded tokenizer: mlx-community/gpt-oss-120b-MXFP4-Q8 for prompt sizer
2026-02-16 12:12:21.280 | DEBUG    | __main__:main:761 - exo-bench model: short_id=gpt-oss-120b-MXFP4-Q8 full_id=mlx-community/gpt-oss-120b-MXFP4-Q8
2026-02-16 12:12:21.280 | INFO     | __main__:main:762 - placements: 1
2026-02-16 12:12:21.280 | INFO     | __main__:main:764 -   - Pipeline / MlxRing / nodes=1
2026-02-16 12:12:21.280 | INFO     | __main__:main:771 - Planning phase: checking downloads...
2026-02-16 12:12:21.336 | INFO     | __main__:run_planning_phase:386 - Deleting mlx-community/Qwen3-0.6B-4bit from 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj (335MB)
2026-02-16 12:12:21.350 | INFO     | __main__:run_planning_phase:386 - Deleting mlx-community/Llama-3.2-1B-Instruct-4bit from 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj (679MB)
2026-02-16 12:12:21.363 | INFO     | __main__:run_planning_phase:386 - Deleting mlx-community/Llama-3.2-3B-Instruct-4bit from 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj (1740MB)
2026-02-16 12:12:21.373 | INFO     | __main__:run_planning_phase:386 - Deleting mlx-community/Llama-3.2-3B-Instruct-8bit from 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj (3264MB)
2026-02-16 12:12:21.384 | INFO     | __main__:run_planning_phase:386 - Deleting mlx-community/GLM-4.7-Flash-8bit from 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj (30366MB)
2026-02-16 12:12:21.413 | INFO     | __main__:run_planning_phase:407 - Started download on 12D3KooWE2C7dzC9d9YJMEfWK3g8og7JdZj3HHXZ8VmGrXYAEnEj
```

It's not pretty but it works!
2026-02-16 13:06:38 +00:00
Jake Hillion
cc33213842 bench: add --settle-timeout for cluster startup retry (#1449)
exo_bench.py fails if started too soon after a cluster starts because
the topology hasn't populated yet, resulting in no valid placements.

Extracted the preview-fetch-and-filter logic into a
`fetch_and_filter_placements` helper and added a retry loop with
exponential backoff (1s initial, 2x multiplier, 60s cap). The new
`--settle-timeout` flag controls how long to retry (default 0 = try
once, preserving existing behaviour). Each retry logs a warning
explaining the cluster may still be settling.
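The retry loop described above can be sketched as follows; `fetch` stands in for `fetch_and_filter_placements`, and the constants match the description (1s initial, 2x multiplier, 60s cap, default 0 = try once):

```python
import time


def retry_with_backoff(fetch, settle_timeout: float):
    """Retry fetch() until it returns a non-empty result or the deadline passes."""
    deadline = time.monotonic() + settle_timeout
    delay = 1.0
    while True:
        result = fetch()
        if result:  # valid placements found
            return result
        if time.monotonic() >= deadline:
            raise RuntimeError("no valid placements; cluster may still be settling")
        time.sleep(min(delay, max(0.0, deadline - time.monotonic())))
        delay = min(delay * 2, 60.0)  # 2x multiplier, capped at 60s
```

With `settle_timeout=0` the deadline is already past after the first attempt, so the old try-once behaviour is preserved.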

Test plan:
- Tested on several freshly started clusters. This used to fail a lot,
  now it succeeds.
2026-02-12 16:38:09 +00:00
Jake Hillion
d79b3a0e75 bench: make exo-bench available via nix run on all platforms (#1415)
exo-bench was gated behind isDarwin in python/parts.nix because it used
exoVenv, which pulls in MLX (Darwin-only). However, exo_bench.py is an
HTTP client that only needs loguru, transformers, huggingface-hub, and
tiktoken.

Made bench a uv workspace member with its own pyproject.toml declaring
only the minimal dependencies. Added a separate benchVenv in parts.nix
built from that workspace member, and moved exo-bench out of the
isDarwin block so it is available on all platforms.

Test plan:
- `nix run .#exo-bench -- --help` prints argparse help

---------

Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
2026-02-06 21:07:17 +00:00
Evan Quiney
9b5cae3db6 auto bench (#1405)
runs exo_bench remotely with some nice git QoL

## usage
run tests/auto_bench.sh host1 [host2]

exo bench will be run on those hosts and its output saved to
bench/commit_hash/*.json on all models currently downloaded
2026-02-06 15:35:46 +00:00
rltakashige
c8dbbee27b skip tensor ring on bench (#1403)
2026-02-06 13:06:59 +00:00
rltakashige
21d477f1cb Update exo bench (#1357)
## Motivation

Make exo bench faster for longer prompts, lengthen default timeouts and
use pairs for pp and tg.

## Changes

- Uses binary search to find the correct prompt length
- Adds a flag to force all combinations if that is desired
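The binary search could look roughly like this (a sketch; `tokenize` stands in for the real tokenizer, and repeating "a " mirrors the Prompt Sizer's approach):

```python
def find_prompt(target_tokens: int, tokenize, lo: int = 1, hi: int = 1 << 20) -> str:
    """Binary-search the smallest word count whose tokenization reaches target_tokens."""
    while lo < hi:
        mid = (lo + hi) // 2
        if len(tokenize("a " * mid)) < target_tokens:
            lo = mid + 1
        else:
            hi = mid
    return "a " * lo
```

This replaces a linear grow-and-retokenize loop with O(log n) tokenizer calls, which is what makes long prompts fast to size.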
2026-02-02 15:46:15 +00:00
Alex Cheema
8f6726d6be Fix config.json download errors for image models (#1245)
## Motivation

When `get_shard_download_status()` runs, it iterates over all models in
`MODEL_CARDS` and calls `build_full_shard()` → `build_base_shard()` →
`ModelCard.from_hf()`. This unconditionally tried to download
`config.json` from HuggingFace, but image models (FLUX, Qwen-Image)
don't have a root-level config.json file, causing errors:

```
Error downloading shard: File not found: https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/black-forest-labs/FLUX.1-schnell/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/Qwen/Qwen-Image/resolve/main/config.json
Error downloading shard: File not found: https://huggingface.co/Qwen/Qwen-Image-Edit-2509/resolve/main/config.json
```

## Changes

### ModelCard.load() fix
- `build_base_shard()` now uses `ModelCard.load()` instead of
`ModelCard.from_hf()`
- `ModelCard.load()` iterates through `MODEL_CARDS.values()` to find a
match by `model_id`

### exo-bench fixes
- Use `name` field instead of `id` for model resolution
- Pass `full_model_id` to `/instance/previews` endpoint
- Make model name matching case-insensitive
- Update README example model name

## Why It Works

`MODEL_CARDS` uses short names as keys (e.g., `"flux1-schnell"`) but the
`model_id` values are HuggingFace paths (e.g.,
`"black-forest-labs/FLUX.1-schnell"`). When `ModelCard.load()` was
called with the HF path, it didn't match any key and fell back to
`from_hf()` which tried to download config.json.

The fix iterates through `MODEL_CARDS.values()` to find a match by
`model_id`, ensuring predefined models (including image models) use
their registry entries directly without network calls. A key lookup is
unnecessary since `load()` is always called with HF paths which don't
match the short-name keys.

## Test Plan

### Manual Testing
- Run exo and verify no more "Error downloading shard: File not found:
.../config.json" errors for image models
- Run exo-bench and verify model resolution works correctly

### Automated Testing
- `uv run basedpyright` - passes with 0 errors
- `uv run pytest` - all tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 21:30:48 +00:00
rltakashige
5fd55594c9 Wrap pipeline models for explicit mx.depends between cache and logits (#1206)
## Motivation

GPU timeouts occur often when prompt size > prefill_step_size. They also
happen for seemingly random models.

## Changes

Add mx.depends for cache on the logits.
All gather at the model level rather than the layer level, reducing the
amount of data sent.

## Why It Works

mlx_lm's prefill loop only evaluates cache state, not logits.
When prompt > prefill_step_size, the all_gather is never evaluated,
causing GPU timeout.

## Test Plan


### Automated Testing
Added failing test cases and then resolved them.
2026-01-19 17:49:42 +00:00
rltakashige
5c8a237940 Handle model timeouts (#1177)
- Add eval with a timeout.
- Add fast synch flag

## Motivation

Because of the experimental FAST SYNCH flag, some models may not work.
This PR catches when this occurs and allows users to specify a run
without fast synch

## Changes

- Adds a flag to enable or disable fast synch (--fast-synch and
--no-fast-synch)
- Adds a heuristic timeout
- Reduces exo_bench default timeout to 10 minutes.

## Why It Works

The heuristic timeout assumes normal loading times on Mac devices (60s
plus model size in GB divided by 5): e.g. DeepSeek takes up to 120
seconds to load on tensor parallel, so the timeout is set to
60 + 120 = 180s.

We could raise this value if necessary.
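As code, the heuristic reads (a sketch of the stated formula, not exo's actual constants or units handling):

```python
def load_timeout_s(model_size_gb: float) -> float:
    # 60s base plus 1s per 5GB of model weights.
    return 60.0 + model_size_gb / 5.0
```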

## Test Plan

### Manual Testing
Catches that GPT OSS fails to load in Tensor RDMA
Can launch with --no-fast-synch flag to launch GPT OSS.

**GPT OSS 20B**
TP with fast synch
<img width="3064" height="456" alt="image"
src="https://github.com/user-attachments/assets/f6e25cd8-8621-4e99-99fe-292ee05c4035"
/>

TP without fast synch
<img width="3098" height="496" alt="image"
src="https://github.com/user-attachments/assets/d36453d9-6686-4cfe-aa7c-a7d458369d4d"
/>
[Note: the performance is really not great as fast synch is off]

(As a sanity check)
PP with fast synch
<img width="3124" height="496" alt="image"
src="https://github.com/user-attachments/assets/e97d4547-c6fa-483d-badb-4b371b900b4c"
/>

PP without fast synch
<img width="3078" height="508" alt="image"
src="https://github.com/user-attachments/assets/b2e20dfd-4b0e-4295-8a92-417dfe745c28"
/>

PP without RDMA
<img width="3070" height="498" alt="image"
src="https://github.com/user-attachments/assets/a8509d68-0aef-4cda-bca5-a67d39a0801e"
/>

TP without RDMA
<img width="3068" height="496" alt="image"
src="https://github.com/user-attachments/assets/b5691429-89f4-4369-bcf2-8fde2ad7154a"
/>
2026-01-16 20:25:12 +00:00
rltakashige
745343c705 Return error responses for Chat Completions (#1173)
- Error chunks
- Use error handling in exo_bench.py

## Motivation

Return an error when one occurs so that generation stops. Adding
timeouts for model loading and chat completions is a separate TODO.

## Changes

- Return HTTP exceptions as JSON responses in an OpenAI compatible
format.
- Context manager for generation to catch and return error messages.
- Use error handling in exo_bench.py.

## Test Plan

### Manual Testing
Manually tested that exo_bench returns on failures within and outside
generation

2026-01-16 19:24:37 +00:00
rltakashige
4b3de6b984 Fix exo bench for transformers 5.x (#1168)
## Motivation
The Prompt Sizer was broken: transformers 5.x tokenizers return
BatchEncoding objects, essentially a dictionary of {"input_ids": [...]},
instead of a plain list of input ids.
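A compatibility shim for this could look like the following (a sketch; the real fix lives in the Prompt Sizer):

```python
def extract_input_ids(encoded) -> list[int]:
    """Accept both the old plain-list shape and the 5.x BatchEncoding mapping."""
    if isinstance(encoded, dict) or hasattr(encoded, "keys"):
        return list(encoded["input_ids"])
    return list(encoded)
```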

## Test Plan

### Manual Testing
Tested that exo bench runs as expected.

2026-01-16 12:39:22 +00:00
rltakashige
077b1bc732 exo-bench (Benchmark model pp & tg speed) (#1099)
## Motivation

This PR implements benchmarking in the style of llama-bench. The main
difficulty here is the fact that exo is not a library - it exposes an
endpoint. This means that benchmarking numbers will be inaccurate if the
API is measured.

The solution assumes nodes are set up with uv run exo (or via the app),
and then hits the new endpoint /bench/chat/completions to retrieve
generation statistics directly from mlx_lm.

This will allow us to release benchmarks for models and perform
regression tests.

TODO: Performance benchmarking.

## Changes

- Adds /bench/chat/completions endpoint
- Adds BenchChatCompletion/Response
- Adds a logits processor to prevent response from ending early
- Adds a "Prompt Sizer" which downloads the tokenizer and dynamically
adjusts the prompt of "a" to fit the desired prompt size.
- Reduce prefill step size to 2048 for now (in future, dynamically
adjust this value)
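The anti-early-termination processor can be sketched against a generic `(tokens, logits) -> logits` signature (illustrative, not the actual mlx_lm hook): masking EOS guarantees the benchmark generates exactly the requested token count.

```python
import math


def make_suppress_eos(eos_token_id: int):
    """Logits processor that makes the EOS token unsampleable."""
    def processor(tokens, logits):
        logits = list(logits)
        logits[eos_token_id] = -math.inf  # EOS can never win sampling
        return logits
    return processor
```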


## Test Plan

### Manual Testing
Benchmarked Llama, Qwen, DeepSeek and Kimi models. Will require several
fixes to run consistently on all configurations (to be done in the
future).
Manually tested the normal API to verify chat requests complete as
expected.

### Automated Testing
Not really possible. Type checker passes.
2026-01-06 17:39:09 +00:00