Files
exo/rust
vskiwi dab7ed4821 fix: handle gossipsub MessageTooLarge error to prevent silent crash (#1583)
## Summary

Large prompts (70K+ tokens / ~500KB+ JSON) cause exo to silently crash.
The root cause is an unhandled `PublishError::MessageTooLarge` from
gossipsub when serialized `TextGeneration` commands exceed the 1MB
`max_transmit_size` limit.

The error propagates as a generic Python exception through the PyO3
bindings. Since `_networking_publish` in `router.py` only catches
`NoPeersSubscribedToTopicError` and `AllQueuesFullError`, the unhandled
exception crashes the networking async task, causing exo to shut down
silently — no error message, no API response.

## Changes

- **Rust (PyO3 bindings):** Add `MessageTooLargeError` exception class
and handle `PublishError::MessageTooLarge` explicitly in the gossipsub
publish path, matching the existing pattern for
`NoPeersSubscribedToTopicError` and `AllQueuesFullError`
- **Python (router):** Catch `MessageTooLargeError` in
`_networking_publish` and log a warning with the message size,
preventing the networking task from crashing

## Reproduction

On a multi-node cluster with a large model (e.g., GLM-5 754B tensor
parallel over JACCL RDMA):
1. Send a chat completion request with ~70K+ tokens
2. exo silently shuts down — no error logged, curl gets no response
3. With shorter prompts (< ~50K tokens): works fine

## Test plan

- Verified `cargo check` passes for `networking` and `exo_pyo3_bindings`
crates
- Verified `ruff check` passes for modified Python files
- Manual testing on 4× Mac Studio M3 Ultra cluster: 50K token requests
pass, 70K+ previously caused silent shutdown, now logs a warning and
drops the oversized message gracefully

Co-authored-by: vsm <vsm@nomail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
2026-02-23 16:21:21 +00:00
..
2026-02-19 12:55:31 +00:00
2026-02-19 12:55:31 +00:00