mirror of https://github.com/exo-explore/exo.git synced 2026-04-17 20:40:35 -04:00

rltakashige f2709dcde6 Add prefix cache flag to exo bench (#1888)

## Motivation
When using Exo-Bench extensively, there are many cases where prefix
caching could speed up the benchmarks, especially when the focus is on
token generation.

At the same time, it's clear that adding decode tokens to the prefix
cache is not very useful in most current scenarios. Surprisingly, even
for non-thinking models, the chat template formats a continued
conversation in such a way that the existing cache is not effective.

We already (slightly accidentally) do this for the batch generator - we
should do it for the sequential generator too.
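
To illustrate the point about decode tokens, here is a minimal sketch of a prefix-cache lookup. It is not exo's implementation: `common_prefix_len` and every token id are hypothetical, and the way the re-rendered conversation diverges (an extra token `99` inserted by the template) is an assumption chosen only to mirror the behaviour described above.

```python
from typing import Sequence


def common_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Count the leading tokens shared by the cached sequence and a new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token ids, for illustration only.
prompt_tokens = [1, 10, 11, 12, 2]   # first turn as rendered by the chat template
decode_tokens = [20, 21, 22]         # tokens generated in response
cached = prompt_tokens + decode_tokens

# Benchmark rerun with the same prompt: the whole prompt can be reused.
rerun = list(prompt_tokens)

# Continued conversation: assume the template re-renders the assistant turn
# with an extra token (99), so the sequences diverge right after the prompt.
continued = prompt_tokens + [99, 20, 21, 22, 3, 1, 30, 31, 2]

reusable_on_rerun = common_prefix_len(cached, rerun)
reusable_on_continue = common_prefix_len(cached, continued)
```

Under these assumptions both lookups reuse only the prompt tokens, so keeping the decode tokens in the cache buys nothing.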

## Changes

exo bench now has a flag to enable prefix caching. For the most
accurate pp results it is better to leave it off, but enabling it
speeds up tg runs and large benchmarks significantly.
Updated the methodology to match.

## Test Plan

### Manual Testing
Tested on many configurations; the difference in results is
negligible, even with multiple --pp options.

2026-04-14 11:12:58 +01:00