## Summary
- After `KVPrefixCache` evicts LRU entries, the MLX Metal buffers stay
allocated until Python's GC runs
- This leaks ~3-4 GB between long-context requests, reducing the
effective context ceiling for back-to-back requests
- Adding `gc.collect()` + `mx.clear_cache()` after eviction frees Metal
buffers promptly
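A minimal sketch of the eviction hook; `clear_cache` is injected as a callable so the snippet runs without MLX installed (in exo this would be `mx.clear_cache`), and the function name is illustrative:

```python
import gc

def free_metal_buffers(clear_cache) -> int:
    """Run after KVPrefixCache evicts LRU entries.

    gc.collect() drops Python-side references to the evicted arrays;
    clear_cache() then returns the now-unreferenced Metal buffers
    instead of waiting for the next GC cycle.
    """
    collected = gc.collect()  # break reference cycles holding MLX arrays
    clear_cache()             # release cached Metal buffers promptly
    return collected
```

Since this runs once per eviction cycle rather than per token, the ~2-3 ms cost of `gc.collect()` is negligible.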
## Test plan
- [x] Measured on 2-node PP cluster with Qwen3.5-397B-A17B-4bit at 63K
context
- [x] Before: 108.88 GB retained after eviction (3.78 GB above baseline)
- [x] After: 105.48 GB retained after eviction (0.38 GB above baseline —
draft model KV + minor overhead)
- [x] `gc.collect()` adds ~2-3ms latency, runs once per eviction cycle
(not per token)
- [ ] Verify with `uv run pytest`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Adam Durham <adam@example.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
Adds the 8bit variant missing from #1907 — the safetensors index is now
live on HF.
- `mlx-community/Qwen3.6-35B-A3B-8bit` (~35 GB)
Architectural fields match the existing 4bit/5bit/bf16 cards.
`storage_size.in_bytes` is taken from `metadata.total_size` of the
upstream `model.safetensors.index.json`.
## Motivation
`mlx-community` has just published the new **Qwen3.6-35B-A3B**
multimodal MoE family on HuggingFace. Without static model cards, exo
doesn't surface these models in the dashboard picker or match them in
its placement / prefill logic, so users can't one-click launch them.
This PR adds cards for the three quants whose safetensors indexes are
already live on HF (4bit / 5bit / bf16).
## Changes
Three new TOML files in `resources/inference_model_cards/`:
- `mlx-community--Qwen3.6-35B-A3B-4bit.toml` (~19 GB)
- `mlx-community--Qwen3.6-35B-A3B-5bit.toml` (~23 GB)
- `mlx-community--Qwen3.6-35B-A3B-bf16.toml` (~65 GB)
All three share the same architectural fields (`n_layers = 40`,
`hidden_size = 2048`, `num_key_value_heads = 2`, `context_length =
262144`, capabilities `text, thinking, thinking_toggle, vision`,
`base_model = "Qwen3.6 35B A3B"`) — only `model_id`, `quantization`, and
`storage_size.in_bytes` differ between variants.
## Why It Works
- Qwen3.6-35B-A3B reuses the `qwen3_5_moe` architecture
(`Qwen3_5MoeForConditionalGeneration`) — the same one already wired into
exo's MLX runner at `src/exo/worker/engines/mlx/auto_parallel.py:47` via
`Qwen3_5MoeModel`. The architectural fields are taken verbatim from the
HF `config.json.text_config` and match the existing `Qwen3.5-35B-A3B-*`
cards.
- Storage sizes are the exact `metadata.total_size` read from each
variant's `model.safetensors.index.json` on HF, so download progress and
cluster-memory-fit checks are accurate.
- Vision support is flagged in `capabilities`; the `[vision]` block is
auto-detected by `ModelCard._autodetect_vision` from the upstream
`config.json`, so no hand-written vision config is required.
- The card loader (`_refresh_card_cache` in
`src/exo/shared/models/model_cards.py`) globs every `.toml` in
`resources/inference_model_cards/` on startup, so nothing else needs to
change — the `/models` endpoint and the dashboard picker pick them up
automatically.
The `mxfp4` / `mxfp8` / `nvfp4` variants are still uploading upstream
(their index JSONs currently 404) and can be added in a follow-up PR
once the HF upload completes.
## Test Plan
### Manual Testing
Hardware: MacBook Pro M4 Max, 48 GB unified memory.
- Built the dashboard, ran `uv run exo`, waited for the API to come up
on `http://localhost:52415`.
- `curl -s http://localhost:52415/models` returns the three new model
ids (`mlx-community/Qwen3.6-35B-A3B-{4bit,5bit,bf16}`) alongside
existing models.
- Opened the dashboard, clicked SELECT MODEL, typed "Qwen3.6" into the
search box. A single **"Qwen3.6 35B A3B"** group appears showing `3
variants (19GB-65GB)`. Expanding it lists the `4bit` / `5bit` / `bf16`
quants with sizes `19GB` / `23GB` / `65GB`, exactly as expected:

- Programmatically loaded each TOML via `ModelCard.load_from_path(...)`
and confirmed the parsed fields (layers / hidden / KV heads / context /
quant / base_model / caps / bytes) match what's written in the files.
### Automated Testing
No code paths were touched — these are pure TOML data files that plug
into the existing model-card loader. The existing pytest suite covers
TOML parsing and card serving; adding new TOMLs doesn't require new test
scaffolding. `uv run ruff check` and `nix fmt` are clean.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ryuichi Leo Takashige <rl.takashige@gmail.com>
## Motivation
<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
## Motivation
Closes #1858
## Motivation
<img width="828" height="373" alt="Screenshot 2026-04-14 at 22 56 52"
src="https://github.com/user-attachments/assets/f8f48c1d-68c5-4acc-a6de-9d180672da9d"
/>
If `is_new_master=True`, `_elect_loop` creates a new `EventRouter`
before the worker has registered its receivers. The event router then
runs `_run_ext_in`, and `buf.drain_indexed()` starts picking off events
even though `self.internal_outbound` is not fully populated.
When the worker finally does request events, the next event it receives
is not the first one, and the worker crashes.
## Changes
Start the event router after all the receivers are registered
## Why It Works
`self.internal_outbound` is populated before the router loop begins.
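The ordering fix can be sketched like this; class and method names mirror the description above but are illustrative, not exo's actual API:

```python
import asyncio

class EventRouter:
    """Sketch of the fix: receivers are registered *before* the drain
    loop starts, so no event is dispatched while internal_outbound is
    only partially populated."""

    def __init__(self):
        self.internal_outbound = {}  # worker_id -> receiver queue
        self._buf = []               # pending events to drain

    def register(self, worker_id, queue):
        self.internal_outbound[worker_id] = queue

    async def run(self):
        # Only started after every receiver is registered.
        for event in self._buf:
            for queue in self.internal_outbound.values():
                queue.append(event)

async def main():
    router = EventRouter()
    router._buf = ["event-0", "event-1"]
    inbox = []
    router.register("worker-1", inbox)  # register first...
    await router.run()                  # ...then start the router
    return inbox
```

With the old ordering (`run()` before `register()`), the worker's inbox would miss `event-0` and the first event it sees is not the first event emitted.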
## Test Plan
### Manual Testing
No more crashes observed in testing (the issue is actually quite easy
to reproduce if you have one node with this fix but the other node on
main). I'm convinced this is a fix, at least.
## Motivation
Model loading is actually quite reliable now. There is no need to kill
the process for a slow SSD or a massive model; the user can shut the
instance down if necessary.
This was a major cause of signal=9 issues, although not the only one
(it can also happen during inference).
signal=9 is so bad because RDMA no longer works until restart once it
happens.
## Changes
- remove the model-load timeout
- stop force-killing (SIGKILL) model processes
- try harder to clean up processes on model shutdown
## Test Plan
### Manual Testing
Tested with some RDMA instances
## Motivation
When using exo-bench extensively, there are many cases where prefix
caching could speed up the benchmarks, especially when the focus is on
token generation.
At the same time, caching decoded tokens is not very useful in most
current scenarios: surprisingly, even for non-thinking models, the chat
template formats a continued conversation such that the existing cache
is not effective.
We already do this (slightly accidentally) for the batch generator; we
should do it for the sequential generator too.
## Changes
exo-bench can now be sped up with a prefix-caching flag. For the most
accurate pp results it is better to leave it off, but it speeds up tg
and large benchmark runs significantly.
Updated the methodology documentation to match.
## Test Plan
### Manual Testing
Verified on many configurations that the difference in results is
negligible, even with multiple `--pp` options.
## Motivation
The Responses API usage response was missing `input_tokens_details` and
`output_tokens_details`. The chat completions API already reports these.
## Changes
- Added `InputTokensDetails` (`cached_tokens`) and `OutputTokensDetails`
(`reasoning_tokens`) to `ResponseUsage`
- Extracted shared `_build_response_usage()` helper for both streaming
and non-streaming paths
## Test Plan
### Manual Testing
4-node cluster, `Qwen3-30B-A3B-4bit` — verified both detail objects
present with correct values in streaming and non-streaming responses.
### Automated Testing
13 tests in `test_openai_responses_api.py`.
## Motivation
Part 1 of many memory improvements.
## Changes
As written in the title
## Test Plan
### Manual Testing
Gemma 4 26B cache reduced from 54 GB to 10 GB per 100k tokens; Qwen3.5
35B A3B cache reduced from 21 GB to 7 GB per 100k tokens.
## Motivation
In the dashboard model picker sidebar, the Gemma 4 models were showing
up under a "Gemma" family with the generic fallback tick/checkmark icon
(the default case in `FamilyLogos.svelte`), since no dedicated logo
branch existed for `family === "gemma"`. Every other vendor (Meta,
NVIDIA, OpenAI, DeepSeek, Qwen, …) has its own brand mark.
Gemma is Google's model family, so it should live under a **Google**
bucket that future Google-authored models can join, and it should render
with a proper Google logo in the same style as its neighbors.
## Changes
- `dashboard/src/lib/components/FamilyLogos.svelte`: added a `family ===
"google"` branch rendering a monochrome Google "G" as a single `<path>`
inside the shared `24×24` viewBox with `fill="currentColor"`, matching
the other vendor logos.
- `dashboard/src/lib/components/FamilySidebar.svelte`: added `google:
"Google"` to the `familyNames` display map.
- `dashboard/src/lib/components/ModelPickerModal.svelte`: inserted
`"google"` into the `familyOrder` array (next to `"llama"`) so the
vendor has a deterministic sort position.
- `resources/inference_model_cards/mlx-community--gemma-4-*.toml` (16
files): changed `family = "gemma"` → `family = "google"`. `base_model =
"Gemma 4 …"` is unchanged, so the model titles still read "Gemma".
## Why It Works
The sidebar builds its family list from whatever values appear in
`model.family` across the loaded model cards (`ModelPickerModal.svelte`
`uniqueFamilies`). Renaming the family string on the 16 Gemma cards from
`"gemma"` to `"google"` collapses them into a single "Google" bucket,
and the new logo branch + display-name map entry gives that bucket a
real brand mark and label. All other logos share the same `w-6 h-6 /
viewBox="0 0 24 24" / fill="currentColor"` shape, so inheriting
`text-exo-yellow` / `text-white/50` just works.
## Test Plan
### Manual Testing
Hardware: MacBook Pro M3 Max
- `cd dashboard && npm install && npm run build` — dashboard builds
cleanly.
- `uv run exo`, opened `http://localhost:52415`, clicked **SELECT
MODEL**:
- sidebar shows a **Google** entry with a monochrome Google "G" logo in
the same style as Meta / NVIDIA / etc.
- old "Gemma" entry with the generic tick is gone.
- clicking **Google** filters to the Gemma 4 variants (e2b / e4b / 26B
A4B / 31B).
- hover/selected color states switch between `text-white/50` and
`text-exo-yellow` correctly.
### Automated Testing
- No new tests — this is a cosmetic grouping/logo change. Existing
dashboard build verifies the Svelte + TS compiles.
## Motivation
Fixes #1861
When `--api-port` is set to a non-default value (e.g., `--api-port
55555`), the IP connectivity discovery system still probes peers on the
hardcoded default port 52415. Since the API is not listening on 52415,
all reachability checks fail, the topology reports zero reachable nodes,
and the dashboard shows "No valid configurations for current settings."
## Changes
Thread the configured `api_port` from `Args` through `Worker` into the
reachability probe functions:
- `net_profile.py`: `check_reachability()` and `check_reachable()`
accept an `api_port` parameter (default 52415 for backward
compatibility)
- `worker/main.py`: `Worker` stores `api_port` and passes it to
`check_reachable()`, uses it in `Multiaddr` construction and the mDNS
connection filter
- `main.py`: passes `args.api_port` to the `Worker` constructor
## Why It Works
The `/node_id` endpoint used by reachability probes is served by the
FastAPI app, which binds to `args.api_port`. The probes must use the
same port the API is actually listening on. Before this fix, the port
was hardcoded in three places in `net_profile.py` and `worker/main.py`;
now it uses the value from the CLI flag.
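The probe-URL construction after the change, as a sketch; the function name and URL shape are illustrative, but the endpoint and default port come from the description above:

```python
DEFAULT_API_PORT = 52415  # kept as the default for backward compatibility

def node_id_probe_url(host: str, api_port: int = DEFAULT_API_PORT) -> str:
    """Build the reachability-probe URL for a peer.

    The /node_id endpoint is served by the FastAPI app bound to
    args.api_port, so probes must target that same port instead of a
    hardcoded 52415.
    """
    return f"http://{host}:{api_port}/node_id"
```

With `--api-port 55555`, the probe now hits `:55555` and reachability checks succeed instead of all failing against the default port.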
## Test Plan
### Manual Testing
Hardware: not available for multi-node testing
- Verified ruff passes on all changed files
- Code inspection: traced `api_port` flow from `Args.parse()` →
`Node.create()` → `Worker.__init__()` → `_poll_connection_updates()` →
`check_reachable()` → `check_reachability()` → HTTP probe URL
### Automated Testing
- No existing automated tests cover the reachability probe code path
- The new `api_port` parameter defaults to `52415`, so all existing
behavior is preserved when `--api-port` is not specified
---------
Co-authored-by: lawrence3699 <lawrence3699@users.noreply.github.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
The mlx-community [MiniMax-M2.7
collection](https://huggingface.co/collections/mlx-community/minimax-m27)
landed but exo didn't have model cards for any of the variants yet, so
they weren't selectable from the dashboard model picker. Adding cards
also makes them discoverable under the existing MiniMax family entry.
## Changes
Added 6 new model cards in `resources/inference_model_cards/`, one per
quant of MiniMax M2.7:
- `mlx-community--MiniMax-M2.7.toml` (bf16, full precision — 457 GB)
- `mlx-community--MiniMax-M2.7-4bit.toml` (128 GB)
- `mlx-community--MiniMax-M2.7-4bit-mxfp4.toml` (121 GB)
- `mlx-community--MiniMax-M2.7-5bit.toml` (157 GB)
- `mlx-community--MiniMax-M2.7-6bit.toml` (185 GB)
- `mlx-community--MiniMax-M2.7-8bit.toml` (243 GB)
All six use `family = "minimax"` and share `base_model = "MiniMax M2.7"`
so they collapse into a single group in the picker with the existing
MiniMax logo. Architecture fields (`n_layers = 62`, `hidden_size =
3072`, `num_key_value_heads = 8`, `context_length = 196608`) were read
from each repo's `config.json`; `storage_size.in_bytes` was summed from
the HF tree API per repo.
`capabilities = ["text", "thinking"]` follows the existing MiniMax M2.5
cards — the chat template always emits `<think>` tags (no toggle),
matching M2.5 behavior.
## Why It Works
Model cards in `resources/inference_model_cards/` are auto-loaded by
`src/exo/shared/models/model_cards.py::get_model_cards`. The dashboard
picker groups by `base_model` and filters by `family`, so sharing both
across all six variants gives a single "MiniMax M2.7" group under the
MiniMax sidebar entry, with the quant variants exposed as selectable
sub-options.
## Test Plan
### Manual Testing
Hardware: MacBook Pro M3 Max
- Ran `uv run python -c "…await get_model_cards()…"` and confirmed all 6
new cards load with `family=minimax`, `base_model="MiniMax M2.7"`, and
correct quant + byte sizes.
- `cd dashboard && npm run build` then `uv run exo`, opened the model
picker → **MiniMax** family → **MiniMax M2.7** group shows all six quant
variants.
### Automated Testing
- No new automated tests — these are data files validated by the
existing Pydantic `ModelCard` schema at load time.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Motivation
Add support for Gemma 4, including VLM!
## Changes
- Add auto parallel strategies and model cards for Gemma 4
- Normalise Gemma 4's special Vision Transformer handling to be in line
with the rest of our vision processors.
- Also adds reprs to messages and b64 hashes to prevent log spam.
## Test Plan
### Manual Testing
Tested manually on 4bit E2B and 8bit 26B
### Automated Testing
Model onboarding shows small logit diffs.
---------
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
- Let users pass/override arbitrary exo env vars from the macOS app
without a code change.
## Changes
- `ExoProcessController.swift`: `CustomEnvironmentVariable` struct +
`@Published` list persisted to `UserDefaults`, injected into the child
process after built-ins.
- `SettingsView.swift`: new **Environment** tab with add/remove rows,
trim + dedup on save, and a POSIX name validator with a warning badge.
## Why It Works
- Custom vars applied last in `makeEnvironment`, so overriding a
built-in works with no special-casing.
## Test Plan
### Manual Testing
- Set `EXO_LIBP2P_NAMESPACE` via the new UI; confirmed override in
`~/.exo/exo_log/exo.log`.
## Motivation
PDF attachments weren't working on Safari
## Changes
Create async readable stream if none exists
## Why It Works
pdfjs-dist requires an async readable stream internally
## Test Plan
### Manual Testing
pdf attachments now work on Safari, still work on Firefox
## Motivation
<!-- Why is this change needed? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here -->
## Changes
<!-- Describe what you changed in detail -->
## Why It Works
<!-- Explain why your approach solves the problem -->
## Test Plan
### Manual Testing
<!-- Hardware: (e.g., MacBook Pro M1 Max 32GB, Mac Mini M2 16GB,
connected via Thunderbolt 4) -->
<!-- What you did: -->
<!-- - -->
### Automated Testing
<!-- Describe changes to automated tests, or how existing tests cover
this change -->
<!-- - -->
---------
Co-authored-by: Evan <evanev7@gmail.com>
## Summary
- Wraps progress callback `send()` in try/except to gracefully handle
`BrokenResourceError` when the memory stream is closed
- Prevents unhandled `ExceptionGroup` from crashing the process when the
download consumer disconnects during transfer
## Root Cause
The download progress callback sends updates through an anyio memory
object stream. When the receiving end closes (e.g., client disconnect,
timeout, or task cancellation), `send()` raises `BrokenResourceError`.
Inside an anyio `TaskGroup`, this unhandled exception becomes an
`ExceptionGroup` that propagates up and crashes the coordinator.
## Fix
Catch `BrokenResourceError` (and `ClosedResourceError` for completeness)
in the progress callback and handle gracefully — the download continues
but progress updates are silently dropped for disconnected consumers.
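The guarded callback might look like the following sketch; the anyio exception types are stubbed locally so the snippet is self-contained, and the factory name is illustrative:

```python
# Stubs for anyio.BrokenResourceError / anyio.ClosedResourceError so
# the sketch runs without anyio installed.
class BrokenResourceError(Exception): ...
class ClosedResourceError(Exception): ...

def make_progress_callback(send):
    """Wrap a memory-stream send() so a disconnected consumer cannot
    crash the download task group."""
    closed = False

    def on_progress(update):
        nonlocal closed
        if closed:
            return  # consumer already gone; drop silently
        try:
            send(update)
        except (BrokenResourceError, ClosedResourceError):
            closed = True  # download continues, updates are dropped

    return on_progress
```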
Fixes #1844
## Motivation
`reasoning_tokens` is always 0 in usage stats, even when thinking
content streams correctly via `reasoning_content` SSE deltas. The MLX
generators had their own thinking detection comparing individual
detokenized tokens against think tags — this never fires for models
where tags span multiple tokens (e.g. gpt-oss-120b) or are already in
the prompt.
## Changes
- Removed broken per-token thinking detection from `batch_generate.py`
and `generate.py`
- Added `_count_reasoning_tokens` wrapper in `model_output_parsers.py`
that counts `is_thinking=True` responses and patches the total into
Usage on the final response
- Wired it as the outermost stage of `apply_all_parsers`, so it works
regardless of which parser sets `is_thinking`
- Added 3 tests covering `parse_thinking_models` and `parse_gpt_oss`
paths
## Why It Works
The parser pipeline already correctly sets `is_thinking` on each
response. Counting at the output of `apply_all_parsers` means one
counting point that works for all model types, replacing the duplicate
broken logic in two generators.
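A sketch of the counting wrapper; response shapes are illustrative dicts rather than exo's actual response objects:

```python
def count_reasoning_tokens(responses):
    """Outermost stage of the parser pipeline (sketch): count responses
    flagged is_thinking and patch the total into the final usage
    payload, regardless of which parser set the flag."""
    reasoning = 0
    for resp in responses:
        if resp.get("is_thinking"):
            reasoning += 1
        usage = resp.get("usage")
        if usage is not None:  # only the final response carries usage
            usage["reasoning_tokens"] = reasoning
        yield resp
```

Because it wraps the pipeline's output rather than inspecting raw tokens, it works whether the think tags span multiple tokens or were already in the prompt.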
## Test Plan
### Manual Testing
- 4-node cluster, `mlx-community/gpt-oss-120b-MXFP4-Q8`
- Main branch: `reasoning_tokens: 0` — fix branch: `reasoning_tokens:
25`
### Automated Testing
- 3 new tests: explicit think tags, `starts_in_thinking=True`, and
gpt-oss Harmony analysis channel
## Motivation
When a model instance is deleted (e.g. node disconnect, manual
teardown), any in-flight SSE streaming connections for that instance
hang indefinitely. The API never closes the response stream, so clients
block forever waiting for more chunks.
## Changes
- Listen for `InstanceDeleted` events in the API event loop
- Add `_close_streams_for_instance()` to find and close any active
text/image generation queues tied to tasks on the deleted instance
- Add unit tests covering text gen, image gen, and
unrelated-instance-not-closed scenarios
## Why It Works
When an instance is deleted, we iterate `state.tasks` to find commands
running on that instance, then close and remove their send-side queue
handles. This causes the SSE generator to terminate, unblocking the
client.
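A sketch of the cleanup helper under assumed data shapes (plain dicts stand in for exo's task state, and `close()` for the send-side queue handle):

```python
def close_streams_for_instance(state_tasks, queues, instance_id):
    """On InstanceDeleted: close the send-side queues of every task
    running on that instance so their SSE generators terminate and
    clients unblock instead of hanging forever."""
    for task_id, task in list(state_tasks.items()):
        if task["instance_id"] != instance_id:
            continue
        queue = queues.pop(task_id, None)  # remove the handle...
        if queue is not None:
            queue.close()                  # ...and close the stream
```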
## Test Plan
### Manual Testing
- This was causing issues for me on another branch (integration tests).
Including this fix solved the issue
### Automated Testing
- `test_instance_deleted_stream_cleanup.py`: 3 tests covering text gen
cleanup, image gen cleanup, and ensuring unrelated streams are not
affected
While stress-testing inference with rapid client cancels mid-stream, I
hit a reproducible crash where the entire exo process exits.
When a client cancels a streaming chat completion partway through, its
receive stream gets closed cleanly via its context manager. The producer
in `API._apply_state` then calls `queue.send(event.chunk)`, which raises
`anyio.ClosedResourceError` rather than `BrokenResourceError`. The
existing handler only catches `BrokenResourceError`, so the exception
propagates through the API task group, kills the Node task group, and
the process exits with `EXO Shutdown complete`.
Trace from one of the crashes:
```
  File "exo/api/main.py", line 1818, in _apply_state
    await queue.send(event.chunk)
  File "anyio/streams/memory.py", line 212, in send_nowait
    raise ClosedResourceError
anyio.ClosedResourceError
```
The fix is to catch `ClosedResourceError` alongside
`BrokenResourceError` in both queue handlers (text and image), so the
dead queue gets dropped and `_apply_state` keeps running for other
in-flight requests.
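The widened handler can be sketched as follows; the exception types are stubbed locally and a plain callable stands in for the async memory stream:

```python
# Stubs for the two anyio exception types, so the sketch is self-contained.
class BrokenResourceError(Exception): ...
class ClosedResourceError(Exception): ...

def apply_chunk(queues, task_id, chunk):
    """Sketch of the fix in _apply_state: either exception means the
    consumer is gone, so drop the dead queue and keep serving the
    remaining in-flight requests instead of crashing the task group."""
    queue = queues.get(task_id)
    if queue is None:
        return
    try:
        queue(chunk)  # stands in for `await queue.send(event.chunk)`
    except (BrokenResourceError, ClosedResourceError):
        del queues[task_id]
```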
## Motivation
The exo backend already supports `--fast-synch` / `--no-fast-synch` CLI
flags and the `EXO_FAST_SYNCH` environment variable, but there was no
way to toggle this from the macOS app UI. Users who want fast CPU-to-GPU
synchronization for RDMA with Tensor Parallelism had to use CLI flags.
## Changes
- **ExoProcessController.swift**: Added `fastSynchEnabled`
UserDefaults-backed property and pass `EXO_FAST_SYNCH=on` to the exo
process environment when enabled.
- **SettingsView.swift**: Added a "Performance" section to the Advanced
tab with a "Fast Synch Enabled" toggle, an info icon (ⓘ) tooltip
explaining the feature and trade-offs, and a "Save & Restart" button.
## Why It Works
Follows the exact same pattern as the existing `offlineMode` and
`enableImageModels` settings — UserDefaults persistence, `@Published`
property with `didSet`, environment variable passthrough in
`makeEnvironment()`, and pending state with Save & Restart in the
settings UI. The `EXO_FAST_SYNCH=on` value matches what the Python
backend already reads in `main.py`.
## Test Plan
### Manual Testing
Hardware: macOS app
- Open Settings → Advanced tab → verify "Performance" section with "Fast
Synch Enabled" toggle appears
- Hover the ⓘ icon → verify tooltip explains the feature and GPU lock
trade-off
- Toggle on → click "Save & Restart" → verify process restarts with
`EXO_FAST_SYNCH=on` in env
- Close and reopen Settings → verify the toggle state persists
- Verify "Save & Restart" button is disabled when no changes are pending
### Automated Testing
- Existing settings patterns are well-established; no new automated
tests needed for this UI toggle
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Changes
Mostly chore changes around VS Code and JetBrains workspace settings,
plus some basedpyright settings tweaks, so that direnv and nixd
autocomplete with flake-parts work.
## Motivation
MLX LM has had a massive refactor to their BatchGenerator recently.
Since we'd like new features from MLX LM such as Gemma 4, we need to
update the code to handle this.
Additionally, this fixes a significant memory leak in GatedDeltaNet.
The difference is quite substantial, up to 1 GB every 1000 tokens,
which explains several memory issues users were facing with Qwen3.5
models.
## Testing
Before
<img width="3146" height="884" alt="image"
src="https://github.com/user-attachments/assets/5af0f55a-393c-4a32-9eed-ae43f1611af4"
/>
After (no memory leak, as one of the changes upstream)
<img width="3190" height="892" alt="image"
src="https://github.com/user-attachments/assets/f0bd128d-fd48-40d4-9bbd-50a564beab14"
/>
When `enable_thinking` is set, chat templates (Qwen3, DeepSeek, etc.)
append `<think>` to the prompt. The model starts generating thinking
content directly without emitting a `<think>` token in the output
stream.
Both generators initialized `in_thinking = False` and only set it to
`True` on seeing a `<think>` token in output. Since that token was part
of the prompt, the flag never flipped and `reasoning_tokens` stayed at 0
in the usage response.
Fix: initialize `in_thinking` from `detect_thinking_prompt_suffix()`,
which already exists and is used by `model_output_parsers` for routing
thinking content correctly.
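A sketch of the fix; `detect_thinking_prompt_suffix` is reimplemented here in simplified form, and plain token strings stand in for detokenized output:

```python
def detect_thinking_prompt_suffix(prompt: str,
                                  think_tag: str = "<think>") -> bool:
    """Simplified stand-in for the existing helper: when the chat
    template appends <think> to the prompt, generation starts inside
    thinking content without ever emitting the tag in the output."""
    return prompt.rstrip().endswith(think_tag)

def count_thinking(prompt, tokens,
                   think_tag="<think>", end_tag="</think>"):
    # The fix: seed in_thinking from the prompt suffix instead of
    # initializing it to False and waiting for a <think> token that
    # may never appear in the output stream.
    in_thinking = detect_thinking_prompt_suffix(prompt, think_tag)
    reasoning = 0
    for tok in tokens:
        if tok == think_tag:
            in_thinking = True
        elif tok == end_tag:
            in_thinking = False
        elif in_thinking:
            reasoning += 1
    return reasoning
```

With the old initialization, a prompt ending in `<think>` yields `reasoning_tokens = 0` even though every early token is thinking content.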
## Motivation
The timings in the batch generator are a little optimistic; a minor
change is needed to make them more correct.
## Changes
Include the time spent in the API in the generation tps, and make sure
all requests are sent simultaneously.