## Summary

Large prompts (70K+ tokens / ~500KB+ JSON) cause exo to crash silently. The root cause is an unhandled `PublishError::MessageTooLarge` from gossipsub, raised when a serialized `TextGeneration` command exceeds the 1MB `max_transmit_size` limit. The error propagates as a generic Python exception through the PyO3 bindings. Since `_networking_publish` in `router.py` only catches `NoPeersSubscribedToTopicError` and `AllQueuesFullError`, the unhandled exception kills the networking async task and exo shuts down silently: no error message, no API response.

## Changes

- **Rust (PyO3 bindings):** Add a `MessageTooLargeError` exception class and handle `PublishError::MessageTooLarge` explicitly in the gossipsub publish path, matching the existing pattern for `NoPeersSubscribedToTopicError` and `AllQueuesFullError`.
- **Python (router):** Catch `MessageTooLargeError` in `_networking_publish` and log a warning with the message size, preventing the networking task from crashing.

## Reproduction

On a multi-node cluster with a large model (e.g., GLM-5 754B, tensor parallel over JACCL RDMA):

1. Send a chat completion request with ~70K+ tokens.
2. exo silently shuts down: no error logged, curl gets no response.
3. With shorter prompts (< ~50K tokens), everything works fine.

## Test plan

- Verified `cargo check` passes for the `networking` and `exo_pyo3_bindings` crates.
- Verified `ruff check` passes for the modified Python files.
- Manual testing on a 4× Mac Studio M3 Ultra cluster: 50K-token requests pass; 70K+ previously caused a silent shutdown and now logs a warning and drops the oversized message gracefully.

Co-authored-by: vsm <vsm@nomail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: rltakashige <rl.takashige@gmail.com>
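The silent-shutdown mechanism described in the Summary can be reproduced in miniature: when an asyncio background task raises an exception that nothing catches, the exception is stored on the task object rather than printed, and the task simply stops. This sketch is illustrative only; the `MessageTooLargeError` class and the simulated publish failure are stand-ins, not exo's actual code.

```python
import asyncio


class MessageTooLargeError(Exception):
    """Stand-in for the gossipsub error surfaced through the PyO3 bindings."""


async def networking_task():
    # Simulate publishing a serialized command larger than gossipsub's
    # 1 MB max_transmit_size: the publish call raises and nothing catches it.
    raise MessageTooLargeError("payload exceeds max_transmit_size (1_048_576 bytes)")


async def main():
    task = asyncio.create_task(networking_task())
    await asyncio.sleep(0)  # yield so the task runs (and dies) immediately
    # The exception is parked on the task; unless someone awaits the task or
    # reads task.exception(), no traceback is ever printed -- the networking
    # loop just disappears, which is exactly the "silent crash" symptom.
    return task.done(), type(task.exception()).__name__


done, exc_name = asyncio.run(main())
print(done, exc_name)  # True MessageTooLargeError
```

This is why the bug produced no error output: the networking task died with the exception attached, and the rest of the process had nothing left to deliver API responses with.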
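The router-side change can be sketched as follows. This is a minimal illustration, not the actual `router.py` code: `publish`, `networking_publish`, and the local `MessageTooLargeError` class here are hypothetical stand-ins for the bindings' publish call, `_networking_publish`, and the new PyO3-exposed exception.

```python
import logging

logger = logging.getLogger("router")

MAX_TRANSMIT_SIZE = 1024 * 1024  # gossipsub's 1MB limit, per the Summary


class MessageTooLargeError(Exception):
    """Stand-in for the exception class the PyO3 bindings now expose."""


def publish(data: bytes) -> None:
    # Stand-in for the bindings' publish call, which now raises a typed
    # error instead of a generic exception when the payload is oversized.
    if len(data) > MAX_TRANSMIT_SIZE:
        raise MessageTooLargeError(len(data))


def networking_publish(data: bytes) -> bool:
    """Drop oversized messages with a warning instead of crashing the task."""
    try:
        publish(data)
        return True
    except MessageTooLargeError:
        logger.warning(
            "dropping message of %d bytes: exceeds gossipsub max_transmit_size (%d)",
            len(data), MAX_TRANSMIT_SIZE,
        )
        return False


print(networking_publish(b"x" * 100))                # True
print(networking_publish(b"x" * (2 * 1024 * 1024)))  # False
```

Logging and returning instead of re-raising is the key design choice: the networking task keeps running, so subsequent (normal-sized) requests still work after an oversized one is dropped.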