## Motivation
Many recent changes landed without much attention to the state of exo
bench.
## Changes
- Use TaggedModel for BenchChatCompletion so it serialises properly.
- Don't break after a GPT OSS tool call, to preserve parity with the rest of
  the codebase.
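
For reviewers unfamiliar with tagged serialisation, here is a minimal sketch of the general idea. It assumes that a tagged model serialises its fields under a key naming its type, which is a guess about the behaviour and not a description of exo's actual `TaggedModel`. Every class and function name below (`TaggedSketch`, `ChatCompletionSketch`, `BenchChatCompletionSketch`, `from_tagged`) is hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TaggedSketch:
    """Hypothetical stand-in for a tagged base class; not exo's TaggedModel."""

    def to_tagged(self) -> dict:
        # Wrap the fields under the class name so a receiver can tell types apart.
        return {type(self).__name__: asdict(self)}


@dataclass
class ChatCompletionSketch(TaggedSketch):
    model: str
    prompt: str


@dataclass
class BenchChatCompletionSketch(TaggedSketch):
    # Same fields as ChatCompletionSketch: without a tag the two are
    # indistinguishable once serialised.
    model: str
    prompt: str


# Maps each type tag back to the class it names.
REGISTRY = {
    cls.__name__: cls
    for cls in (ChatCompletionSketch, BenchChatCompletionSketch)
}


def from_tagged(payload: dict) -> TaggedSketch:
    # Exactly one key is expected: the type tag.
    [(tag, fields)] = payload.items()
    return REGISTRY[tag](**fields)


wire = json.dumps(BenchChatCompletionSketch(model="m", prompt="hi").to_tagged())
restored = from_tagged(json.loads(wire))
```

In this sketch the tag is what lets the receiving side rebuild a `BenchChatCompletionSketch` instead of the structurally identical `ChatCompletionSketch`.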
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<img width="2856" height="678" alt="image"
src="https://github.com/user-attachments/assets/2e18cf0d-c0f8-467c-9763-1a6a59c8a327"
/>
Also tested GPT OSS tool calling in OpenCode.