**Enabling peers to be discovered in environments where mDNS is
unavailable (SSH sessions, headless servers, Docker).**
## Motivation
Exo discovers peers exclusively via mDNS, which works great on a local
network but breaks when multicast is blocked or peers sit outside a
single L2 broadcast domain:
- SSH sessions on macOS — TCC blocks mDNS multicast from non-GUI
sessions (#1488)
- Headless servers/rack machines — #1682 ("DGX Spark does not find other
nodes")
- Docker Compose — mDNS is often unavailable across container networks;
e.g. #1462 (E2E test framework) needs an alternative
Related work:
#1488 (working implementation by @AlexCheema, closed because SSH
had a GUI workaround),
#1023 (Headscale WAN, closed due to merge conflicts),
#1656 (discovery cleanup, open).
This PR introduces an optional bootstrap mechanism for peer discovery
while leaving the existing mDNS behavior unchanged.
## Changes
Adds two new CLI flags:
- `--bootstrap-peers` (env: `EXO_BOOTSTRAP_PEERS`) — comma-separated
libp2p multiaddrs to dial on startup and retry periodically
- `--libp2p-port` — fixed TCP port for libp2p to listen on (default:
OS-assigned). Required when using bootstrap peers, so other nodes know
which port to dial.
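A minimal sketch of how the two flags could be parsed, assuming `argparse` and assuming the flag takes precedence over the environment variable. The helper names `parse_bootstrap_peers` and `build_parser` are hypothetical and are not taken from `src/exo/main.py`:

```python
import argparse
import os


def parse_bootstrap_peers(raw: str) -> list[str]:
    """Split a comma-separated list of multiaddrs, dropping blank entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="exo")
    parser.add_argument(
        "--bootstrap-peers",
        type=parse_bootstrap_peers,
        # Assumed precedence: the env var only supplies the default,
        # so an explicit flag overrides it.
        default=parse_bootstrap_peers(os.environ.get("EXO_BOOTSTRAP_PEERS", "")),
    )
    # None stands for "let the OS assign a port".
    parser.add_argument("--libp2p-port", type=int, default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.bootstrap_peers, args.libp2p_port)
```

With no flag and no environment variable set, `bootstrap_peers` is an empty list, which corresponds to the unchanged mDNS-only behavior.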
8 files:
- `rust/networking/src/discovery.rs`: Store bootstrap addrs, dial in
existing retry loop
- `rust/networking/src/swarm.rs`: Thread `bootstrap_peers` parameter to
`Behaviour`
- `rust/networking/examples/chatroom.rs`: Updated call site for new
`create_swarm` signature
- `rust/networking/tests/bootstrap_peers.rs`: Integration tests
- `rust/exo_pyo3_bindings/src/networking.rs`: Accept optional
`bootstrap_peers` in PyO3 constructor
- `rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi`: Update type stub
- `src/exo/routing/router.py`: Pass peers to `NetworkingHandle`
- `src/exo/main.py`: `--bootstrap-peers` CLI arg +
`EXO_BOOTSTRAP_PEERS` env var
## Why It Works
Bootstrap peers are dialed in the existing retry loop, the same path
used for mDNS-discovered peers. From there the swarm handles the
connection, the Noise handshake, and gossipsub mesh joining.
A PeerId is intentionally not required in the multiaddr; the remote
peer's identity is learned during the Noise handshake.
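The retry behavior can be pictured with a small Python sketch. This is a simplified model, not the Rust code in `discovery.rs`: the function name `addrs_to_dial` is hypothetical, and it assumes peers are tracked by address only, since bootstrap entries carry no PeerId.

```python
def addrs_to_dial(
    bootstrap: list[str], discovered: set[str], connected: set[str]
) -> list[str]:
    """Addresses to dial on one retry tick: every known address without a live connection."""
    # Bootstrap addresses first, then discovered ones; dict.fromkeys de-duplicates
    # while preserving order.
    candidates = list(dict.fromkeys([*bootstrap, *sorted(discovered)]))
    return [addr for addr in candidates if addr not in connected]


if __name__ == "__main__":
    bootstrap = ["/ip4/172.30.20.3/tcp/30000"]
    print(addrs_to_dial(bootstrap, discovered=set(), connected=set()))
```

Because the bootstrap list is consulted on every tick, a peer that is down at startup or drops later is dialed again on the next pass.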
Docker Compose example:
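```yaml
services:
  exo-1:
    environment:
      # /ip4 takes a literal IPv4 address, not a service name;
      # the addresses here assume static container IPs.
      EXO_BOOTSTRAP_PEERS: "/ip4/172.30.20.3/tcp/30000"
  exo-2:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/172.30.20.2/tcp/30000"
```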
## Test Plan
### Manual Testing
<details>
<summary>Docker Compose config</summary>
```yaml
services:
  exo-node1:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node1
    hostname: exo-node1
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.3/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52415:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.2
    deploy:
      resources:
        limits:
          memory: 4g
  exo-node2:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node2
    hostname: exo-node2
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.2/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52416:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.3
    deploy:
      resources:
        limits:
          memory: 4g
networks:
  bootstrap-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.20.0/24
```
</details>
Two containers on a bridge network (`172.30.20.0/24`) with fixed IPs,
`--libp2p-port 30000`, and `--bootstrap-peers` pointing at each other.
Both nodes found each other, established a connection, and then ran the
election protocol.
### Automated Testing
4 Rust integration tests in `rust/networking/tests/bootstrap_peers.rs`
(`cargo test -p networking`):
| Test | What it verifies | Result |
|------|-----------------|--------|
| `two_nodes_connect_via_bootstrap_peers` | Node B discovers Node A via bootstrap addr (real TCP connection) | PASS |
| `create_swarm_with_empty_bootstrap_peers` | Backward compatibility: no bootstrap peers works | PASS |
| `create_swarm_ignores_invalid_bootstrap_addrs` | Invalid multiaddrs silently filtered | PASS |
| `create_swarm_with_fixed_port` | `listen_port` parameter works | PASS |
All 4 pass. The connection test takes ~6s.
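The "invalid multiaddrs silently filtered" behavior can be illustrated with a rough Python sketch. It is an assumption-laden stand-in for the real multiaddr parser: it only recognizes the `/ip4/<address>/tcp/<port>` shape used in this PR's examples, and the function names are hypothetical.

```python
import ipaddress


def is_valid_ip4_tcp(addr: str) -> bool:
    """True if addr looks like /ip4/<IPv4 literal>/tcp/<port>."""
    parts = addr.split("/")
    # Expected shape after splitting: ["", "ip4", "<address>", "tcp", "<port>"]
    if len(parts) != 5 or parts[0] != "" or parts[1] != "ip4" or parts[3] != "tcp":
        return False
    try:
        ipaddress.IPv4Address(parts[2])  # rejects hostnames such as "exo-2"
    except ValueError:
        return False
    port = parts[4]
    return port.isascii() and port.isdigit() and int(port) <= 65535


def filter_bootstrap_addrs(raw: list[str]) -> list[str]:
    """Keep only well-formed addresses; malformed ones are dropped without raising."""
    return [addr for addr in raw if is_valid_ip4_tcp(addr)]


if __name__ == "__main__":
    print(filter_bootstrap_addrs(["/ip4/172.30.20.2/tcp/30000", "/ip4/exo-2/tcp/30000"]))
```

One consequence of silent filtering is that a typo in `EXO_BOOTSTRAP_PEERS` yields a node that starts normally but never dials the intended peer.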
---------
Signed-off-by: DeepZima <deepzima@outlook.com>
Co-authored-by: Evan <evanev7@gmail.com>
The 15-second publish_queue_duration caused messages in peer queues to
be silently dropped. When events are dropped, workers detect gaps in the
event index sequence and request missing events via the NACK path
(RequestEventLog), but this recovery is inefficient.
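As an illustration of the gap detection described above, here is a small Python sketch. `GapTracker` is a hypothetical name, and the model assumes events carry consecutive integer indices starting at 0; the actual worker logic may differ.

```python
from dataclasses import dataclass


@dataclass
class GapTracker:
    """Tracks the next expected event index and reports gaps."""

    next_expected: int = 0

    def observe(self, idx: int) -> range:
        """Return the missing indices to re-request (empty if in order or duplicate)."""
        if idx == self.next_expected:
            self.next_expected += 1
            return range(0)
        # idx > next_expected: events were dropped, so report the gap and wait
        # for the replay. idx < next_expected: duplicate, and the range is empty.
        return range(self.next_expected, idx)


if __name__ == "__main__":
    tracker = GapTracker()
    for idx in (0, 1, 5):
        print(idx, list(tracker.observe(idx)))
```

Every non-empty range corresponds to an extra round trip to fetch events that the sender had already published, which is why dropping queued messages by timeout is costly.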
Removed the timeout configuration; gossipsub now uses its default
behavior without time-based eviction. If queue buildup is a concern,
queue size should be limited explicitly rather than dropping by timeout.
Split error handling to log AllQueuesFullError as a warning (indicates
peers are unresponsive) while keeping NoPeersSubscribedToTopicError
silent (expected during startup and network partitions).
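The split can be sketched in Python as follows. The exception classes below are local stand-ins with the same names as the errors mentioned above, and `publish_with_logging`, the logger name, and the synchronous call style are assumptions for illustration rather than the router's actual code.

```python
import logging

logger = logging.getLogger("exo.routing")


class AllQueuesFullError(Exception):
    """Stand-in: every peer queue for the topic is full."""


class NoPeersSubscribedToTopicError(Exception):
    """Stand-in: nobody is subscribed to the topic yet."""


def publish_with_logging(publish, topic: str, data: bytes) -> bool:
    """Call publish(topic, data); return True on success, False if the message was not sent."""
    try:
        publish(topic, data)
    except NoPeersSubscribedToTopicError:
        # Expected during startup and network partitions: stay silent.
        return False
    except AllQueuesFullError:
        # Peers are not draining their queues: worth surfacing.
        logger.warning("all peer queues full for topic %r; message dropped", topic)
        return False
    return True
```

Keeping the two handlers separate means the warning only appears when peers are reachable but unresponsive, so it is not drowned out by startup noise.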
Test plan:
- CI