Commit Graph

11 Commits

Author SHA1 Message Date
DeepZima
2da740c387 Feat/static peer discovery (#1690)
**Enabling peers to be discovered in environments where mDNS is
unavailable (SSH sessions, headless servers, Docker).**

## Motivation
Exo discovers peers exclusively via mDNS, which works great on a local
network but breaks once you move beyond a single L2 broadcast domain:

- SSH sessions on macOS — TCC blocks mDNS multicast from non-GUI
sessions (#1488)
- Headless servers/rack machines — #1682 ("DGX Spark does not find other
nodes")
- Docker Compose — mDNS is often unavailable across container networks;
e.g. #1462 (E2E test framework) needs an alternative

Related works: 
#1488 (working implementation made by @AlexCheema and closed because SSH
had a GUI workaround),
#1023 (Headscale WAN then closed due to merge conflicts), 
#1656 (discovery cleanup, open). 

This PR introduces an optional bootstrap mechanism for peer discovery
while leaving the existing mDNS behavior unchanged.

## Changes
Adds two new CLI flags:

- `--bootstrap-peers` (env: `EXO_BOOTSTRAP_PEERS`) — comma-separated
libp2p multiaddrs to dial on startup and retry periodically
- `--libp2p-port` — fixed TCP port for libp2p to listen on (default:
OS-assigned). Required when bootstrap peers, so other nodes know which
port to dial.

8 files: 
- `rust/networking/src/discovery.rs`: Store bootstrap addrs, dial in
existing retry loop
- `rust/networking/src/swarm.rs`: Thread `bootstrap_peers` parameter to
`Behaviour`
- `rust/networking/examples/chatroom.rs`: Updated call site for new
create_swarm signature
- `rust/networking/tests/bootstrap_peers.rs`: Integration tests
- `rust/exo_pyo3_bindings/src/networking.rs`: Accept optional
`bootstrap_peers` in PyO3 constructor
- `rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi` : Update type stub 
- `src/exo/routing/router.py`: Pass peers to `NetworkingHandle` 
- `src/exo/main.py` : `--bootstrap-peers` CLI arg +
`EXO_BOOTSTRAP_PEERS` env var

## Why It Works

Bootstrap peers are dialed in the existing retry loop — the same path
taken by peers when mDNS-discovered. The swarm handles connection, Noise
handshake, and gossipsub mesh joining from there.

PeerId is intentionally not required in the multiaddr, the Noise
handshake discovers it.

Docker Compose example:

```yaml
services:
  exo-1:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-2/tcp/30000"
  exo-2:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-1/tcp/30000"
```

## Test Plan

### Manual Testing
<details>
<summary>Docker Compose config</summary>

```
services:
  exo-node1:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node1
    hostname: exo-node1
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.3/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52415:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.2
    deploy:
      resources:
        limits:
          memory: 4g

  exo-node2:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node2
    hostname: exo-node2
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.2/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52416:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.3
    deploy:
      resources:
        limits:
          memory: 4g

networks:
  bootstrap-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.20.0/24
```
</details> 

Two containers on a bridge network (`172.30.20.0/24`), fixed IPs,
`--libp2p-port 30000`, cross-referencing `--bootstrap-peers`.

Both nodes found each other and established a connection then ran the
election protocol.

### Automated Testing

4 Rust integration tests in `rust/networking/tests/bootstrap_peers.rs`
(`cargo test -p networking`):

| Test | What it verifies | Result |
|------|-----------------|--------|
| `two_nodes_connect_via_bootstrap_peers` | Node B discovers Node A via
bootstrap addr (real TCP connection) | PASS |
| `create_swarm_with_empty_bootstrap_peers` | Backward compatibility —
no bootstrap peers works | PASS |
| `create_swarm_ignores_invalid_bootstrap_addrs` | Invalid multiaddrs
silently filtered | PASS |
| `create_swarm_with_fixed_port` | `listen_port` parameter works | PASS
|

All 4 pass. The connection test takes ~6s

---------

Signed-off-by: DeepZima <deepzima@outlook.com>
Co-authored-by: Evan <evanev7@gmail.com>
2026-03-25 10:55:12 +00:00
Evan Quiney
0096159728 up gossipsub limit (#1671)
bump gossipsub limit to 8MB + add warning
this is a bandaid solution for large messages while we figure out the
point to point protocol
2026-03-06 14:20:51 +00:00
Evan Quiney
db73c4fd5d move messaging into rust (#1549)
the main body of the rust refactor. fixes the tokio panic on shutdown.
simplifies the networking module significantly. doesn't touch lp2p
behaviour
2026-02-26 13:58:22 +00:00
Evan Quiney
cacb456cb2 remove nightly (#1538)
we have no good need for rust nightly (nor futures, for that matter)
2026-02-19 12:55:31 +00:00
Evan Quiney
8f01523ddb remove dead code (#1496) 2026-02-18 11:43:27 +00:00
Jake Hillion
1f242e8eee gossipsub: stop silent message dropping and warn (#1434)
The 15-second publish_queue_duration caused messages in peer queues to
be silently dropped. When events are dropped, workers detect gaps in the
event index sequence and request missing events via the NACK path
(RequestEventLog), but this recovery is inefficient.

Removed the timeout configuration - gossipsub now uses its default
behavior without time-based eviction. If queue buildup is a concern,
queue size should be limited explicitly rather than dropping by timeout.

Split error handling to log AllQueuesFullError as a warning (indicates
peers are unresponsive) while keeping NoPeersSubscribedToTopicError
silent (expected during startup and network partitions).

Test plan:
- CI
2026-02-10 18:47:47 +00:00
Jake Hillion
5629983809 fmt: format all python/rust/nix files 2025-12-05 16:58:55 +00:00
Alex Cheema
19e90572e6 set max_transmit_size on gossipsub to 1MB. Fixes large message erorr 2025-11-06 19:18:48 +00:00
rltakashige
16f724e24c Update staging 14
Co-authored-by: Evan <evanev7@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: David Munha Canas Correia <dmunha@MacBook-David.local>
Co-authored-by: github-actions bot <github-actions@users.noreply.github.com>
2025-11-05 01:44:24 +00:00
Evan Quiney
3b409647ba Squash merge merging_clusters into tensor_parallel94 2025-10-31 17:41:57 +00:00
Evan Quiney
38ff949bf4 big refactor
Fix. Everything.

Co-authored-by: Andrei Cravtov <the.andrei.cravtov@gmail.com>
Co-authored-by: Matt Beton <matthew.beton@gmail.com>
Co-authored-by: Alex Cheema <alexcheema123@gmail.com>
Co-authored-by: Seth Howes <sethshowes@gmail.com>
2025-09-30 11:03:04 +01:00