## Motivation

Batch generation reports incorrect statistics: mlx-lm never clears its original stats, so they get polluted over time. The dashboard also appears considerably slower than the bench statistics, and there is a large discrepancy between B=1 batch generation and mlx_generate. Extracting logprobs is massively expensive, causing up to a 25% slowdown compared to pure batching.

```
[ 12:02:01.1240AM | INFO ] step overhead: 3.49ms (next=12.49ms total=15.99ms)
[ 12:02:02.1600AM | INFO ] step overhead: 3.23ms (next=13.01ms total=16.24ms)
[ 12:02:03.2228AM | INFO ] step overhead: 3.28ms (next=13.38ms total=16.66ms)
[ 12:02:04.2798AM | INFO ] step overhead: 3.25ms (next=12.84ms total=16.10ms)
[ 12:02:05.3152AM | INFO ] step overhead: 3.18ms (next=12.61ms total=15.79ms)
[ 12:02:06.3522AM | INFO ] step overhead: 3.41ms (next=12.83ms total=16.25ms)
[ 12:02:07.3987AM | INFO ] step overhead: 3.38ms (next=13.14ms total=16.52ms)
[ 12:02:08.4537AM | INFO ] step overhead: 1.84ms (next=19.44ms total=21.28ms)
```

## Changes

1. Report stats ourselves instead of using mlx-lm's stats for batch generation (they use perf_counter anyway).
2. Adjust exo bench to match.
3. Speed up logprobs extraction by 10x, improving tps for the dashboard and any request that asks for logprobs.
4. Use an SSE comment to align the streamed speed with the real numbers at the end of generation.
5. Patch mlx with several optimizations enabled by our assumptions and use cases (e.g. vLLM-style RoPE).
6. Bump MLX LM to latest main, picking up support for Nemotron Super and some Qwen3.5 fixes.

Illustrative sketches of items 1, 3, 4, and 5 follow the test plan.

## Why It Works

1. Exo bench no longer reports polluted stats.
2. Exo bench now consumes the reported per-request stats rather than the aggregate stats.
3. The decode speed now snaps back to the real number at the end of generation.
4. Large batch speedup for rotating-KV-cache models, plus a 1:1 cache match with vLLM.

## Test Plan

### Manual Testing

- Needs testing on OpenCode and CC
- Needs eval testing

### Automated Testing

Only showing the performance-optimization difference after the accurate reporting:

**GPT OSS 20B MXFP4 Q8 (large change)**

Before:

<img width="2466" height="1534" alt="image" src="https://github.com/user-attachments/assets/88b50637-fca2-4db4-9413-b9eee6e2057e" />
<img width="2410" height="1240" alt="image" src="https://github.com/user-attachments/assets/21e5c76a-2f5f-44d2-8953-121b3ebdbd68" />

After:

<img width="2476" height="1472" alt="image" src="https://github.com/user-attachments/assets/fec5cfbd-fff8-430a-b12e-a329410107a2" />
<img width="2454" height="1236" alt="image" src="https://github.com/user-attachments/assets/0400344b-a4a6-42c0-a9dd-4ee91ade714a" />

**Qwen 3.5 35B A3B 8bit (no change)**

Before:

<img width="2414" height="1396" alt="image" src="https://github.com/user-attachments/assets/e75f0b38-df5d-49fd-ab90-bc1667d981b3" />

After:

<img width="2346" height="1234" alt="image" src="https://github.com/user-attachments/assets/eabfb59c-851f-4d88-b927-e1e699a75cc6" />

**Llama 3.2 1B Instruct 4bit (small change)**

Before:

<img width="2516" height="1220" alt="image" src="https://github.com/user-attachments/assets/c2873655-acff-4536-8263-fb8aea33db80" />

After:

<img width="2566" height="1370" alt="image" src="https://github.com/user-attachments/assets/15f95c75-1c2f-4474-85a2-88c4d0a32543" />
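## Sketches

**Per-request stats (item 1).** A minimal sketch of owning the accounting per request with `perf_counter`; `RequestStats`, its fields, and `on_token` are illustrative names, not exo's actual API:

```python
import time
from dataclasses import dataclass, field

@dataclass
class RequestStats:
    # Hypothetical per-request counters; each request owns its own
    # instance, so nothing accumulates across batches.
    prompt_tokens: int = 0
    generated_tokens: int = 0
    start: float = field(default_factory=time.perf_counter)
    first_token_at: float | None = None

    def on_token(self) -> None:
        # Record time-to-first-token once, then just count tokens.
        if self.first_token_at is None:
            self.first_token_at = time.perf_counter()
        self.generated_tokens += 1

    @property
    def decode_tps(self) -> float:
        # Decode speed measured from the first generated token, so
        # prefill time never pollutes the number.
        if self.first_token_at is None or self.generated_tokens < 2:
            return 0.0
        return (self.generated_tokens - 1) / (time.perf_counter() - self.first_token_at)
```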
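**Logprobs extraction (item 3).** The usual way to make logprob extraction cheap is to normalize once over the vocab and gather only the sampled tokens in a single vectorized op, instead of slicing per token in Python. A hedged MLX sketch of that pattern; the actual patch may differ:

```python
import mlx.core as mx

def sampled_logprobs(logits: mx.array, tokens: mx.array) -> mx.array:
    # logits: (batch, vocab) for the current step; tokens: (batch,)
    # sampled token ids. Normalize once over the vocab axis...
    logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    # ...then gather one logprob per row in a single op, rather than a
    # per-request Python loop.
    return mx.take_along_axis(logprobs, tokens[:, None], axis=-1).squeeze(-1)
```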
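**SSE stats comment (item 4).** SSE lines starting with `:` are comments that standard EventSource clients ignore, so the accurate end-of-generation numbers can ride along in the stream without confusing OpenAI-style consumers, while the dashboard picks them up. The payload shape below is an assumption:

```python
import json

def final_stats_comment(decode_tps: float, generated_tokens: int) -> str:
    # Hypothetical payload; only the ':' comment framing is SSE-standard.
    payload = json.dumps({"decode_tps": decode_tps, "tokens": generated_tokens})
    # A comment line carrying the corrected stats, then the usual terminator.
    return f": stats {payload}\n\ndata: [DONE]\n\n"
```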
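**vLLM-style RoPE (item 5).** vLLM precomputes a cos/sin cache once and applies the rotation against it, rather than recomputing frequencies every forward pass. A generic sketch of that pattern, not the actual mlx patch:

```python
import mlx.core as mx

def build_rope_cache(head_dim: int, max_len: int, base: float = 10000.0):
    # Inverse frequencies for each even dimension, computed once up front.
    inv_freq = mx.power(base, -(mx.arange(0, head_dim, 2) / head_dim))
    t = mx.arange(max_len)[:, None] * inv_freq[None, :]  # (max_len, head_dim // 2)
    return mx.cos(t), mx.sin(t)

def apply_rope(x: mx.array, cos: mx.array, sin: mx.array, offset: int = 0) -> mx.array:
    # x: (..., seq, head_dim); rotate interleaved pairs using the cached tables,
    # offset by the current KV-cache position.
    seq = x.shape[-2]
    c, s = cos[offset : offset + seq], sin[offset : offset + seq]
    x1, x2 = x[..., ::2], x[..., 1::2]
    rotated = mx.concatenate(
        [(x1 * c - x2 * s)[..., None], (x2 * c + x1 * s)[..., None]], axis=-1
    )
    return rotated.reshape(x.shape)
```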