## Motivation
GPT OSS tool calling issues.
## Changes
Fixes those and adds a bunch of evals for tool calling.
Fixes GLM5 prefix caching, where CacheList wasn't getting handled
properly.
Extracts a bunch of the setup functionality of exo bench to a harness
that can be reused elsewhere, such as in the tool calling eval.
## Test Plan
### Automated Testing
Let's run the evals for all models