exo/tests at dc0bb5e13bf1975e0fe71dcda97e657353403301 - exo

mirror of https://github.com/exo-explore/exo.git synced 2026-02-28 04:06:50 -05:00

rltakashige e23c3a3026 Address Mac Mini pipeline GPU timeouts (#1620)

## Motivation
Users were reporting GPU timeout errors on Mac Minis, which we never saw
when testing on Mac Studios. The errors also seem to occur only with
large models.

## Changes
Explicitly eval specific distributed (pipeline communication) operations
instead of leaving them in the lazy graph.

## Why It Works

As I wrote in a Slack message:
Basically, prefill is too slow for pipeline communications. If an mlx
graph contains both communications and GPU operations, the
communications become subject to the GPU's 5-second command buffer
timeout.

For normal generation, I added evals to the communications (only during
prefill, since they slow down decode) so that they are evaluated
separately from the GPU operations, which fixes the GPU timeouts.

But we don't do this during warmup, as the prompt is absolutely tiny.
On an M4 Pro, warmup is still slow enough with some models to cause a
GPU timeout...
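
The reasoning above can be illustrated with a toy model. This is a sketch
in plain Python, not MLX code and not the change made in this PR: `Op`,
`segments`, `hits_timeout`, and all durations are hypothetical names and
made-up numbers. It assumes that ops evaluated together share one
timeout budget whenever GPU work is involved, and that an eval on a
communication op lets it wait on its own.

```python
from dataclasses import dataclass

GPU_TIMEOUT_S = 5.0  # command buffer timeout quoted in the text above


@dataclass
class Op:
    name: str
    seconds: float          # assumed duration, invented for illustration
    is_comm: bool = False   # True for pipeline send/recv style ops


def segments(ops, eval_comms):
    """Split ops into groups that are evaluated together.

    With eval_comms=True every communication op is evaluated by itself,
    so it never shares a group with GPU work.
    """
    segs, current = [], []
    for op in ops:
        if eval_comms and op.is_comm:
            if current:
                segs.append(current)
                current = []
            segs.append([op])
        else:
            current.append(op)
    if current:
        segs.append(current)
    return segs


def hits_timeout(ops, eval_comms):
    """Return True if any group containing GPU work exceeds the timeout."""
    for seg in segments(ops, eval_comms):
        touches_gpu = any(not op.is_comm for op in seg)
        if touches_gpu and sum(op.seconds for op in seg) > GPU_TIMEOUT_S:
            return True
    return False


# Prefill: the receive waits a long time for the previous pipeline stage.
prefill = [
    Op("recv from previous stage", 6.0, is_comm=True),
    Op("transformer layers", 1.5),
    Op("send to next stage", 0.1, is_comm=True),
]

# Decode: every op is short, so splitting the graph buys nothing here.
decode = [
    Op("recv from previous stage", 0.02, is_comm=True),
    Op("transformer layers", 0.03),
    Op("send to next stage", 0.01, is_comm=True),
]
```

In this model `hits_timeout(prefill, eval_comms=False)` is true because
the long receive is counted against the same budget as the GPU work,
while `hits_timeout(prefill, eval_comms=True)` is false. The decode case
stays under the limit either way, which matches the choice to add evals
only during prefill.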


----------------------
This was one of the issues. However, there is another issue:

mx.all_gather sometimes reads stale data with FAST_SYNCH enabled. I'm
still investigating the root cause, but the code as it is now works on
Mac Minis.



## Test Plan

### Manual Testing
<img width="2762" height="1808" alt="image"
src="https://github.com/user-attachments/assets/27c88542-606c-4551-8f7c-bd2c0471f54e"
/>

<img width="2820" height="1898" alt="image"
src="https://github.com/user-attachments/assets/0ba3478c-ee39-438d-902c-92893db23d05"
/>


### Automated Testing
Needs a bunch of testing on Mac Minis.

2026-02-25 17:37:32 +00:00

| File | Last commit | Date |
| --- | --- | --- |
| auto_bench.sh | Address Mac Mini pipeline GPU timeouts (#1620) | 2026-02-25 17:37:32 +00:00 |
| eval_tool_calls.sh | Fix tool calling (#1529) | 2026-02-18 20:29:18 +00:00 |
| get_all_models_on_cluster.py | auto bench (#1405) | 2026-02-06 15:35:46 +00:00 |
| headless_runner.py | Normalize TextGenerationTaskParams.input to list[InputMessage] (#1360) | 2026-02-03 06:01:56 -08:00 |
| run_exo_on.sh | cancel downloads for deleted instances (#1393) | 2026-02-05 18:16:43 +00:00 |
| start_distributed_test.py | improve distributed testing (#1300) | 2026-02-02 18:25:39 +00:00 |