## Motivation

Batch generation reports incorrect statistics: mlx-lm never clears its original stats, so they get polluted over time. The dashboard also appears considerably slower than the bench statistics, and there is a large discrepancy between B=1 batch generation and mlx_generate. Extracting logprobs is massively expensive, causing up to a 25% slowdown compared to pure batching.

```
[ 12:02:01.1240AM | INFO ] step overhead: 3.49ms (next=12.49ms total=15.99ms)
[ 12:02:02.1600AM | INFO ] step overhead: 3.23ms (next=13.01ms total=16.24ms)
[ 12:02:03.2228AM | INFO ] step overhead: 3.28ms (next=13.38ms total=16.66ms)
[ 12:02:04.2798AM | INFO ] step overhead: 3.25ms (next=12.84ms total=16.10ms)
[ 12:02:05.3152AM | INFO ] step overhead: 3.18ms (next=12.61ms total=15.79ms)
[ 12:02:06.3522AM | INFO ] step overhead: 3.41ms (next=12.83ms total=16.25ms)
[ 12:02:07.3987AM | INFO ] step overhead: 3.38ms (next=13.14ms total=16.52ms)
[ 12:02:08.4537AM | INFO ] step overhead: 1.84ms (next=19.44ms total=21.28ms)
```

## Changes

1. Report stats ourselves instead of using mlx-lm's stats for batch generation (they use perf_counter anyway).
2. Adjust exo bench to match.
3. Speed up logprobs extraction by 10x, improving tps for the dashboard and any request that asks for logprobs.
4. Use an SSE comment to align the streamed speed with the real numbers at the end of generation.
5. Patch mlx with several optimizations enabled by our assumptions and use cases (e.g. vLLM-style RoPE).
6. Bump MLX LM to latest main, picking up support for Nemotron Super and some Qwen3.5 fixes.

Illustrative sketches of items 1, 3, 4, and 5 follow the test plan.

## Why It Works

1. Exo bench no longer reports polluted stats.
2. Exo bench now consumes the reported per-request stats rather than the aggregate stats.
3. The decode speed now snaps back to the real number at the end of generation.
4. Large batch speedup for rotating-KV-cache models, plus a 1:1 cache match with vLLM.

## Test Plan

### Manual Testing

- Needs testing on OpenCode and CC
- Needs eval testing

### Automated Testing

Only showing the performance-optimization difference after the accurate reporting:

**GPT OSS 20B MXFP4 Q8 (large change)**

Before:

<img width="2466" height="1534" alt="image" src="https://github.com/user-attachments/assets/88b50637-fca2-4db4-9413-b9eee6e2057e" />
<img width="2410" height="1240" alt="image" src="https://github.com/user-attachments/assets/21e5c76a-2f5f-44d2-8953-121b3ebdbd68" />

After:

<img width="2476" height="1472" alt="image" src="https://github.com/user-attachments/assets/fec5cfbd-fff8-430a-b12e-a329410107a2" />
<img width="2454" height="1236" alt="image" src="https://github.com/user-attachments/assets/0400344b-a4a6-42c0-a9dd-4ee91ade714a" />

**Qwen 3.5 35B A3B 8bit (no change)**

Before:

<img width="2414" height="1396" alt="image" src="https://github.com/user-attachments/assets/e75f0b38-df5d-49fd-ab90-bc1667d981b3" />

After:

<img width="2346" height="1234" alt="image" src="https://github.com/user-attachments/assets/eabfb59c-851f-4d88-b927-e1e699a75cc6" />

**Llama 3.2 1B Instruct 4bit (small change)**

Before:

<img width="2516" height="1220" alt="image" src="https://github.com/user-attachments/assets/c2873655-acff-4536-8263-fb8aea33db80" />

After:

<img width="2566" height="1370" alt="image" src="https://github.com/user-attachments/assets/15f95c75-1c2f-4474-85a2-88c4d0a32543" />
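## Sketches

**Per-request stats (item 1).** A minimal sketch of owning the accounting per request with `perf_counter`; `RequestStats`, its fields, and `on_token` are illustrative names, not exo's actual API:

```python
import time
from dataclasses import dataclass, field

@dataclass
class RequestStats:
    # Hypothetical per-request counters; each request owns its own
    # instance, so nothing accumulates across batches.
    prompt_tokens: int = 0
    generated_tokens: int = 0
    start: float = field(default_factory=time.perf_counter)
    first_token_at: float | None = None

    def on_token(self) -> None:
        # Record time-to-first-token once, then just count tokens.
        if self.first_token_at is None:
            self.first_token_at = time.perf_counter()
        self.generated_tokens += 1

    @property
    def decode_tps(self) -> float:
        # Decode speed measured from the first generated token, so
        # prefill time never pollutes the number.
        if self.first_token_at is None or self.generated_tokens < 2:
            return 0.0
        return (self.generated_tokens - 1) / (time.perf_counter() - self.first_token_at)
```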
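**Logprobs extraction (item 3).** The usual way to make logprob extraction cheap is to normalize once over the vocab and gather only the sampled tokens in a single vectorized op, instead of slicing per token in Python. A hedged MLX sketch of that pattern; the actual patch may differ:

```python
import mlx.core as mx

def sampled_logprobs(logits: mx.array, tokens: mx.array) -> mx.array:
    # logits: (batch, vocab) for the current step; tokens: (batch,)
    # sampled token ids. Normalize once over the vocab axis...
    logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    # ...then gather one logprob per row in a single op, rather than a
    # per-request Python loop.
    return mx.take_along_axis(logprobs, tokens[:, None], axis=-1).squeeze(-1)
```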
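**SSE stats comment (item 4).** SSE lines starting with `:` are comments that standard EventSource clients ignore, so the accurate end-of-generation numbers can ride along in the stream without confusing OpenAI-style consumers, while the dashboard picks them up. The payload shape below is an assumption:

```python
import json

def final_stats_comment(decode_tps: float, generated_tokens: int) -> str:
    # Hypothetical payload; only the ':' comment framing is SSE-standard.
    payload = json.dumps({"decode_tps": decode_tps, "tokens": generated_tokens})
    # A comment line carrying the corrected stats, then the usual terminator.
    return f": stats {payload}\n\ndata: [DONE]\n\n"
```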
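**vLLM-style RoPE (item 5).** vLLM precomputes a cos/sin cache once and applies the rotation against it, rather than recomputing frequencies every forward pass. A generic sketch of that pattern, not the actual mlx patch:

```python
import mlx.core as mx

def build_rope_cache(head_dim: int, max_len: int, base: float = 10000.0):
    # Inverse frequencies for each even dimension, computed once up front.
    inv_freq = mx.power(base, -(mx.arange(0, head_dim, 2) / head_dim))
    t = mx.arange(max_len)[:, None] * inv_freq[None, :]  # (max_len, head_dim // 2)
    return mx.cos(t), mx.sin(t)

def apply_rope(x: mx.array, cos: mx.array, sin: mx.array, offset: int = 0) -> mx.array:
    # x: (..., seq, head_dim); rotate interleaved pairs using the cached tables,
    # offset by the current KV-cache position.
    seq = x.shape[-2]
    c, s = cos[offset : offset + seq], sin[offset : offset + seq]
    x1, x2 = x[..., ::2], x[..., 1::2]
    rotated = mx.concatenate(
        [(x1 * c - x2 * s)[..., None], (x2 * c + x1 * s)[..., None]], axis=-1
    )
    return rotated.reshape(x.shape)
```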