## Motivation
The timings reported by the batch generator are slightly optimistic; a minor
change makes them more accurate.
## Changes
Include the time spent in the API in the generation TPS, and ensure all
requests are sent simultaneously.
## Summary
DeepSeek V3.2 (`DeepseekV32ForCausalLM`) is already supported by exo's
inference engine (architecture whitelisted in `model_cards.py`, DSML
encoding added in #1548), but **doesn't work out of the box** due to two
bugs:
### Bug 1: `warmup_inference` passes empty model ID
`warmup_inference()` in `generate.py` accepts `model_id: ModelId` as a
parameter but creates `TextGenerationTaskParams(model=ModelId(""), ...)`
instead of using it. Since `_needs_dsml_encoding()` checks
`"deepseek-v3.2" in task_params.model.lower()`, the empty string never
matches → the code falls back to `tokenizer.apply_chat_template()` →
**ValueError**, because V3.2 has no Jinja chat template.
**Fix:** `model=ModelId("")` → `model=model_id` (one line).
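A minimal sketch of why the empty model ID defeats the check (types are simplified here; the real `ModelId` and `TextGenerationTaskParams` live in exo's codebase):

```python
# Simplified stand-ins for exo's types; illustrative only.
from dataclasses import dataclass

@dataclass
class TextGenerationTaskParams:
    model: str  # stands in for ModelId

def needs_dsml(params: TextGenerationTaskParams) -> bool:
    # The substring check from _needs_dsml_encoding().
    return "deepseek-v3.2" in params.model.lower()

model_id = "mlx-community/DeepSeek-V3.2-4bit"

buggy = TextGenerationTaskParams(model="")        # before: ModelId("")
fixed = TextGenerationTaskParams(model=model_id)  # after: model=model_id
```

With the empty string, `needs_dsml(buggy)` is `False` and the Jinja fallback (and its ValueError) is hit; with the real ID it is `True`.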
### Bug 2: `_needs_dsml_encoding` limited to tool calling
`_needs_dsml_encoding()` returns `True` only when `task_params.tools` is
present or tool messages exist in `chat_template_messages`. For warmup
and regular chat requests without tools → `return False` → Jinja
fallback → **ValueError**.
Unlike V3.1 (which has a `.jinja` chat template file that transformers
picks up automatically), V3.2 **has no Jinja template at all** — it uses
Python-based DSML encoding for all message types.
**Fix:** For V3.2, always return `True` — DSML encoding handles all
message types.
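A hedged sketch of the fixed decision logic (the real `_needs_dsml_encoding()` also inspects `chat_template_messages`; this collapses that to a boolean for illustration):

```python
def needs_dsml_encoding(model: str, has_tools: bool) -> bool:
    """Return True when DSML encoding must replace the Jinja template."""
    # V3.2 ships no Jinja chat template at all, so DSML is required for
    # every message type, not just tool calls (this is the fix).
    if "deepseek-v3.2" in model.lower():
        return True
    # Other models keep the previous behavior: DSML only for tool calling.
    return has_tools
```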
### Catalog cards
Added inference model cards for:
- `mlx-community/DeepSeek-V3.2-8bit`
- `mlx-community/DeepSeek-V3.2-4bit`
Parameters taken from model `config.json` on HuggingFace, storage sizes
from HF API. Capabilities include `thinking_toggle` (related: #1456).
## Notes
- The model ID string matching approach (`"deepseek-v3.2" in
model.lower()`) is acknowledged tech debt — see #1371 for the planned
architecture-based approach.
## Test plan
- [x] Start exo with DeepSeek V3.2 model → warmup should complete
without crash
- [x] Send a regular chat message (no tools) → should get a response
- [x] Send a chat message with tools → should work as before
- [x] V3.2 cards should appear in the dashboard model catalog
---------
Co-authored-by: user <user@m1.note>
Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: Evan <evanev7@gmail.com>
**Enabling peers to be discovered in environments where mDNS is
unavailable (SSH sessions, headless servers, Docker).**
## Motivation
Exo discovers peers exclusively via mDNS, which works great on a local
network but breaks once you move beyond a single L2 broadcast domain:
- SSH sessions on macOS — TCC blocks mDNS multicast from non-GUI
sessions (#1488)
- Headless servers/rack machines — #1682 ("DGX Spark does not find other
nodes")
- Docker Compose — mDNS is often unavailable across container networks;
e.g. #1462 (E2E test framework) needs an alternative
Related work:
- #1488 (a working implementation by @AlexCheema, closed because SSH had a GUI workaround)
- #1023 (Headscale WAN, closed due to merge conflicts)
- #1656 (discovery cleanup, still open)
This PR introduces an optional bootstrap mechanism for peer discovery
while leaving the existing mDNS behavior unchanged.
## Changes
Adds two new CLI flags:
- `--bootstrap-peers` (env: `EXO_BOOTSTRAP_PEERS`) — comma-separated
libp2p multiaddrs to dial on startup and retry periodically
- `--libp2p-port` — fixed TCP port for libp2p to listen on (default:
OS-assigned). Required when using bootstrap peers, so other nodes know
which port to dial.
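A minimal sketch of how the comma-separated multiaddr list could be parsed on the Python side; the helper name and validation are assumptions, not the actual `main.py` code:

```python
import os

def parse_bootstrap_peers(raw: "str | None") -> list:
    """Split a comma-separated multiaddr list, dropping empty entries."""
    if not raw:
        return []
    return [addr.strip() for addr in raw.split(",") if addr.strip()]

# The env var mirrors the --bootstrap-peers CLI flag.
peers = parse_bootstrap_peers(os.environ.get("EXO_BOOTSTRAP_PEERS"))
```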
8 files changed:
- `rust/networking/src/discovery.rs`: Store bootstrap addrs, dial in
existing retry loop
- `rust/networking/src/swarm.rs`: Thread `bootstrap_peers` parameter to
`Behaviour`
- `rust/networking/examples/chatroom.rs`: Updated call site for new
create_swarm signature
- `rust/networking/tests/bootstrap_peers.rs`: Integration tests
- `rust/exo_pyo3_bindings/src/networking.rs`: Accept optional
`bootstrap_peers` in PyO3 constructor
- `rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi`: Update type stub
- `src/exo/routing/router.py`: Pass peers to `NetworkingHandle`
- `src/exo/main.py`: `--bootstrap-peers` CLI arg +
`EXO_BOOTSTRAP_PEERS` env var
## Why It Works
Bootstrap peers are dialed in the existing retry loop — the same path
taken by peers when mDNS-discovered. The swarm handles connection, Noise
handshake, and gossipsub mesh joining from there.
The PeerId is intentionally not required in the multiaddr; the Noise
handshake discovers it.
Docker Compose example:
```yaml
services:
  exo-1:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-2/tcp/30000"
  exo-2:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-1/tcp/30000"
```
## Test Plan
### Manual Testing
<details>
<summary>Docker Compose config</summary>
```yaml
services:
  exo-node1:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node1
    hostname: exo-node1
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.3/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52415:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.2
    deploy:
      resources:
        limits:
          memory: 4g
  exo-node2:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node2
    hostname: exo-node2
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.2/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52416:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.3
    deploy:
      resources:
        limits:
          memory: 4g
networks:
  bootstrap-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.20.0/24
```
</details>
Two containers on a bridge network (`172.30.20.0/24`), fixed IPs,
`--libp2p-port 30000`, cross-referencing `--bootstrap-peers`.
Both nodes found each other, established a connection, and then ran the
election protocol.
### Automated Testing
4 Rust integration tests in `rust/networking/tests/bootstrap_peers.rs`
(`cargo test -p networking`):
| Test | What it verifies | Result |
|------|-----------------|--------|
| `two_nodes_connect_via_bootstrap_peers` | Node B discovers Node A via bootstrap addr (real TCP connection) | PASS |
| `create_swarm_with_empty_bootstrap_peers` | Backward compatibility — no bootstrap peers works | PASS |
| `create_swarm_ignores_invalid_bootstrap_addrs` | Invalid multiaddrs silently filtered | PASS |
| `create_swarm_with_fixed_port` | `listen_port` parameter works | PASS |

All 4 pass. The connection test takes ~6s.
---------
Signed-off-by: DeepZima <deepzima@outlook.com>
Co-authored-by: Evan <evanev7@gmail.com>
## Motivation
Batching will require us to send tasks concurrently and queue them up.
Our current infrastructure cannot handle that at all. This PR gets us
closer by allowing multiple tasks to be sent in parallel and then queued
up.
## Changes
- Change the Plan logic
- Make the runner main loop into a class
- Add a `BatchGenerator` to which tasks can be submitted (tasks are
currently handled sequentially) and results sent back through an
`MpSender`
- Refactor the runner to accept tasks during generation
- Keep the generator threading
- Separate the runner into several files for better readability
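A minimal sketch of the `BatchGenerator` idea: tasks can be submitted concurrently but are processed sequentially, with results sent back on a per-task channel. The queue-based shape and method names here are illustrative assumptions, not the real API:

```python
import queue

class BatchGenerator:
    """Accepts tasks from many submitters; processes them one at a time."""

    def __init__(self) -> None:
        self._tasks: queue.Queue = queue.Queue()

    def submit(self, task, reply: queue.Queue) -> None:
        """Queue a task; safe to call from several threads at once."""
        self._tasks.put((task, reply))

    def run_once(self) -> bool:
        """Handle one queued task sequentially; True if one was handled."""
        try:
            task, reply = self._tasks.get_nowait()
        except queue.Empty:
            return False
        reply.put(f"done:{task}")  # stand-in for streaming generated tokens
        return True
```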
## Test Plan
### Manual Testing
Tested manually; needs a lot more automated testing. Cancellation still
works on a single device; multi-device behavior still needs checking.
### Automated Testing
---------
Co-authored-by: Evan Quiney <evanev7@gmail.com>
## Summary
Large prompts (70K+ tokens / ~500KB+ JSON) cause exo to silently crash.
The root cause is an unhandled `PublishError::MessageTooLarge` from
gossipsub when serialized `TextGeneration` commands exceed the 1MB
`max_transmit_size` limit.
The error propagates as a generic Python exception through the PyO3
bindings. Since `_networking_publish` in `router.py` only catches
`NoPeersSubscribedToTopicError` and `AllQueuesFullError`, the unhandled
exception crashes the networking async task, causing exo to shut down
silently — no error message, no API response.
## Changes
- **Rust (PyO3 bindings):** Add `MessageTooLargeError` exception class
and handle `PublishError::MessageTooLarge` explicitly in the gossipsub
publish path, matching the existing pattern for
`NoPeersSubscribedToTopicError` and `AllQueuesFullError`
- **Python (router):** Catch `MessageTooLargeError` in
`_networking_publish` and log a warning with the message size,
preventing the networking task from crashing
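A hedged sketch of the new Python-side handling; `MessageTooLargeError` mirrors the PyO3 exception added on the Rust side, while the function shape and log message here are assumptions rather than the actual `router.py` code:

```python
import logging

logger = logging.getLogger("exo.router")

class MessageTooLargeError(Exception):
    """Stand-in for the PyO3 exception raised when a gossipsub message
    exceeds max_transmit_size (1MB)."""

def networking_publish(publish, payload: bytes) -> bool:
    """Publish a payload; drop oversized messages instead of crashing."""
    try:
        publish(payload)
        return True
    except MessageTooLargeError:
        # Log a warning with the message size and keep the networking
        # task alive, rather than letting the exception propagate.
        logger.warning("dropping oversized message: %d bytes", len(payload))
        return False
```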
## Reproduction
On a multi-node cluster with a large model (e.g., GLM-5 754B tensor
parallel over JACCL RDMA):
1. Send a chat completion request with ~70K+ tokens
2. exo silently shuts down — no error logged, curl gets no response
3. With shorter prompts (< ~50K tokens): works fine
## Test plan
- Verified `cargo check` passes for `networking` and `exo_pyo3_bindings`
crates
- Verified `ruff check` passes for modified Python files
- Manual testing on 4× Mac Studio M3 Ultra cluster: 50K token requests
pass, 70K+ previously caused silent shutdown, now logs a warning and
drops the oversized message gracefully
Co-authored-by: vsm <vsm@nomail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
`VecExt` added a `.map()` convenience method on `Vec<T>` that simply called
`.into_iter().map(f).collect()`. This thin wrapper provided no
optimisation benefit and obscured a standard iterator pattern behind a
nightly feature gate and an extra dependency.
Replaced the single call site in `exo_pyo3_bindings` with the equivalent
iterator chain and removed the ext module, the `extend` dependency, and
the `trait_alias` feature gate from the util crate.
Test plan:
- CI
The util crate contained several unused items: `NonemptyArray`,
`BoxedSliceExt`, a blanket `Sealed` trait, an empty alias module, six unused
nightly feature gates, and six unused Cargo dependencies (`thiserror`,
`once_cell`, `internment`, `derive_more`, `bon`, `recursion`).
Removed all items that had no references outside their own definitions,
keeping only `WakerDeque`, `VecExt`, and the `trait_alias` feature gate,
which are actively used by the networking and exo_pyo3_bindings crates.
Test plan:
- CI
The 15-second `publish_queue_duration` caused messages in peer queues to
be silently dropped. When events are dropped, workers detect gaps in the
event index sequence and request missing events via the NACK path
(`RequestEventLog`), but this recovery is inefficient.
Removed the timeout configuration - gossipsub now uses its default
behavior without time-based eviction. If queue buildup is a concern,
queue size should be limited explicitly rather than dropping by timeout.
Split error handling to log `AllQueuesFullError` as a warning (indicates
peers are unresponsive) while keeping `NoPeersSubscribedToTopicError`
silent (expected during startup and network partitions).
Test plan:
- CI
The Rust workspace lacked Nix build support, making it difficult to
build packages reproducibly or run checks in CI.
Added a flake-parts module at `rust/parts.nix` that uses crane for Rust
builds and fenix for the nightly toolchain. The source filter isolates
`rust/` and root Cargo files to prevent Python/docs changes from
triggering Rust rebuilds. Exports packages (`system_custodian`,
`exo_pyo3_bindings` wheel, `exo-rust-workspace`) and checks (`cargo-nextest`,
`cargo-doc`) for all three target platforms.
The devShell now uses `inputsFrom` to inherit build dependencies from the
workspace package, removing the need for manual pkg-config/openssl setup.
Test plan:
- Ran `nix flake check` successfully
- Built `nix build ".#checks.x86_64-linux.cargo-nextest"` and tests pass
- Built `nix build ".#exo_pyo3_bindings"` and wheel is produced