exo/bench at 3bcdd46bb167b06adf3ed20249f4466369e7c2c1 - exo


mirror of https://github.com/exo-explore/exo.git synced 2026-01-20 20:10:10 -05:00


rltakashige 5fd55594c9 Wrap pipeline models for explicit mx.depends between cache and logits (#1206)

## Motivation

GPU timeouts often occur when the prompt size exceeds prefill_step_size.
They also occur for seemingly random models.

## Changes

Add an mx.depends that makes the cache depend on the logits.
Perform the all-gather at the model level rather than at the layer level,
reducing the amount of data sent.

## Why It Works

mlx_lm's prefill loop evaluates only the cache state, not the logits.
When the prompt is longer than prefill_step_size, the all_gather is
therefore never evaluated, which causes the GPU timeout.
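
The reasoning above can be illustrated with a toy model of lazy evaluation. This is only a sketch: `Lazy`, `depends`, `forward_chunk`, and `prefill` are hypothetical names written for this example, not the MLX or mlx_lm API, and the graph shape (hidden state feeding both the cache and an all-gather that feeds the logits) is an assumption made for illustration.

```python
class Lazy:
    """Toy lazily evaluated node (hypothetical; not an MLX array)."""

    def __init__(self, name, inputs=(), log=None):
        self.name = name
        self.inputs = list(inputs)
        self.log = log if log is not None else []
        self.done = False

    def eval(self):
        # Evaluate inputs first, then record this node exactly once.
        if self.done:
            return
        for node in self.inputs:
            node.eval()
        self.done = True
        self.log.append(self.name)


def depends(node, dependencies):
    """Toy stand-in for an explicit dependency edge (assumption):
    evaluating the returned node also forces `dependencies`."""
    return Lazy(node.name + "+deps", [node, *dependencies], node.log)


def forward_chunk(i, log, tie_cache_to_logits):
    """Build the (assumed) graph for one prompt chunk."""
    hidden = Lazy(f"hidden[{i}]", log=log)
    cache = Lazy(f"cache[{i}]", [hidden], log)
    gather = Lazy(f"all_gather[{i}]", [hidden], log)
    logits = Lazy(f"logits[{i}]", [gather], log)
    if tie_cache_to_logits:
        cache = depends(cache, [logits])
    return cache, logits


def prefill(num_tokens, step, tie_cache_to_logits):
    """Process chunks while more than `step` tokens remain,
    evaluating only the cache, as the description above states."""
    log = []
    remaining, i = num_tokens, 0
    while remaining > step:
        cache, _logits = forward_chunk(i, log, tie_cache_to_logits)
        cache.eval()  # the logits are never evaluated directly
        remaining -= step
        i += 1
    return log


if __name__ == "__main__":
    print(prefill(10, 4, tie_cache_to_logits=False))
    print(prefill(10, 4, tie_cache_to_logits=True))
```

In this toy, the untied run never evaluates any `all_gather` node, while the tied run evaluates one per prefill chunk, because evaluating the cache now forces the logits. A prompt no longer than the step size produces no prefill chunks at all, which matches the observation that the problem appears only for longer prompts.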

## Test Plan

### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->

### Automated Testing
Added failing test cases and then resolved them.

2026-01-19 17:49:42 +00:00

exo_bench.py

Wrap pipeline models for explicit mx.depends between cache and logits (#1206)

2026-01-19 17:49:42 +00:00