exo/bench/eval_configs at prefix-cache-oom - exo


mirror of https://github.com/exo-explore/exo.git synced 2026-06-02 19:27:55 -04:00


ciaranbor f2a0db4e23 Extend bench/eval tooling (#1905)

## Motivation

Extend bench/eval tooling with robustness features and streaming support,
and align model configs with vllm eval for reproducible comparisons.

## Changes

- **exo_eval**: Checkpoint/resume (JSONL), instance health monitoring +
early abort, `top_k`/`min_p`/`enable_thinking` params, LCB
`--release-version`/`--offset`
- **exo_bench**: Streaming SSE (`--stream`), Kimi tokenizer fix for
transformers 5.x
- **Both tools**: Auto-detect running instances instead of requiring
`--skip-instance-setup`; `--fresh-instance` to override
- **harness**: SSE streaming client, `find_existing_instance()` shared
helper, removed download timeout, settle-timeout default 0→7200s
- **models.toml**: Added `enable_thinking`, aligned `max_tokens`/temps
with vllm, added new models
- **API**: Streaming SSE for `/bench/chat/completions`
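
To illustrate the checkpoint/resume item in the list above, here is a minimal sketch of an append-only JSONL checkpoint that skips already-completed questions on load. It is not the code from this PR: the function names (`load_completed`, `append_result`, `run_eval`), the `question_id` and `answer` fields, and the handling of a truncated final line are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_completed(path):
    """Return {question_id: record} for every well-formed checkpoint line."""
    done = {}
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A crash mid-write can leave a truncated line; skip it so
                # that question is simply re-run.
                continue
            done[record["question_id"]] = record
    return done


def append_result(path, record):
    """Append one record as a single JSON line, never rewriting old lines."""
    needs_newline = False
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            # If the previous write was cut off, start on a fresh line.
            needs_newline = f.read(1) != b"\n"
    with open(path, "a", encoding="utf-8") as f:
        if needs_newline:
            f.write("\n")
        f.write(json.dumps(record) + "\n")


def run_eval(questions, path, answer_fn):
    """Answer only the questions that are not yet in the checkpoint."""
    done = load_completed(path)
    for q in questions:
        if q["question_id"] in done:
            continue  # skip-on-load: already answered in an earlier run
        record = {"question_id": q["question_id"], "answer": answer_fn(q)}
        append_result(path, record)
        done[record["question_id"]] = record
    return done
```

Because each result is written as soon as it is produced, an interrupted run loses at most the question that was in flight.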

## Why It Works

- Checkpoint/resume uses append-only JSONL + skip-on-load so interrupted
evals resume without re-running completed questions
- Health monitoring races an `asyncio.Event` against API calls for fast
abort when the instance dies
- Auto-detection queries `/state` for existing instances matching the
model ID before attempting placement
- Streaming reuses the existing `generate_chat_stream` infrastructure
from the regular chat endpoint
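
The health-monitoring race described above can be sketched as follows. This is an illustrative pattern only, not the PR's implementation: `call_or_abort`, `InstanceDied`, and the `demo` coroutine are hypothetical names, and the sketch assumes that some separate monitor sets the `asyncio.Event` when the instance disappears.

```python
import asyncio


class InstanceDied(RuntimeError):
    """Raised when the instance is reported dead before the call returns."""


async def call_or_abort(api_call, instance_dead):
    """Await api_call, but abort as soon as instance_dead is set."""
    call_task = asyncio.ensure_future(api_call)
    dead_task = asyncio.ensure_future(instance_dead.wait())
    try:
        done, _ = await asyncio.wait(
            {call_task, dead_task}, return_when=asyncio.FIRST_COMPLETED
        )
        if call_task in done:
            return call_task.result()
        raise InstanceDied("instance died while a request was in flight")
    finally:
        # Cancel whichever side lost the race so nothing is left pending.
        for task in (call_task, dead_task):
            if not task.done():
                task.cancel()


async def demo():
    dead = asyncio.Event()

    async def fast_request():
        return "completed"

    async def slow_request():
        await asyncio.sleep(10)
        return "completed"

    # Healthy instance: the call wins the race.
    ok = await call_or_abort(fast_request(), dead)

    # Simulate a monitor noticing the instance has died shortly afterwards.
    asyncio.get_running_loop().call_later(0.01, dead.set)
    try:
        await call_or_abort(slow_request(), dead)
        aborted = False
    except InstanceDied:
        aborted = True
    return ok, aborted
```

Racing the event against the request means the caller does not have to wait for a request timeout to learn that the instance is gone.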

2026-04-27 16:53:43 +01:00

models.toml

…