mirror/exo

mirror of https://github.com/exo-explore/exo.git synced 2026-04-18 04:52:40 -04:00

Author	SHA1	Message	Date
DeepZima	2da740c387	Feat/static peer discovery (#1690)

Enables peers to be discovered in environments where mDNS is unavailable (SSH sessions, headless servers, Docker).

## Motivation

Exo discovers peers exclusively via mDNS, which works great on a local network but breaks once you move beyond a single L2 broadcast domain:

- SSH sessions on macOS: TCC blocks mDNS multicast from non-GUI sessions (#1488)
- Headless servers/rack machines: #1682 ("DGX Spark does not find other nodes")
- Docker Compose: mDNS is often unavailable across container networks; e.g. #1462 (E2E test framework) needs an alternative

Related work: #1488 (working implementation made by @AlexCheema and closed because SSH had a GUI workaround), #1023 (Headscale WAN, then closed due to merge conflicts), #1656 (discovery cleanup, open).

This PR introduces an optional bootstrap mechanism for peer discovery while leaving the existing mDNS behavior unchanged.

## Changes

Adds two new CLI flags:

- `--bootstrap-peers` (env: `EXO_BOOTSTRAP_PEERS`): comma-separated libp2p multiaddrs to dial on startup and retry periodically
- `--libp2p-port`: fixed TCP port for libp2p to listen on (default: OS-assigned). Required when using bootstrap peers, so other nodes know which port to dial.

8 files:

- `rust/networking/src/discovery.rs`: Store bootstrap addrs, dial in existing retry loop
- `rust/networking/src/swarm.rs`: Thread `bootstrap_peers` parameter to `Behaviour`
- `rust/networking/examples/chatroom.rs`: Updated call site for new `create_swarm` signature
- `rust/networking/tests/bootstrap_peers.rs`: Integration tests
- `rust/exo_pyo3_bindings/src/networking.rs`: Accept optional `bootstrap_peers` in PyO3 constructor
- `rust/exo_pyo3_bindings/exo_pyo3_bindings.pyi`: Update type stub
- `src/exo/routing/router.py`: Pass peers to `NetworkingHandle`
- `src/exo/main.py`: `--bootstrap-peers` CLI arg + `EXO_BOOTSTRAP_PEERS` env var

## Why It Works

Bootstrap peers are dialed in the existing retry loop, the same path taken by mDNS-discovered peers. The swarm handles connection, Noise handshake, and gossipsub mesh joining from there. PeerId is intentionally not required in the multiaddr; the Noise handshake discovers it.

Docker Compose example:

```yaml
services:
  exo-1:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-2/tcp/30000"
  exo-2:
    environment:
      EXO_BOOTSTRAP_PEERS: "/ip4/exo-1/tcp/30000"
```

## Test Plan

### Manual Testing

<details>
<summary>Docker Compose config</summary>

```
services:
  exo-node1:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node1
    hostname: exo-node1
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.3/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52415:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.2
    deploy:
      resources:
        limits:
          memory: 4g
  exo-node2:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap-test
    container_name: exo-bootstrap-node2
    hostname: exo-node2
    command: ["-q", "--libp2p-port", "30000", "--bootstrap-peers", "/ip4/172.30.20.2/tcp/30000"]
    environment:
      - EXO_LIBP2P_NAMESPACE=bootstrap-test
    ports:
      - "52416:52415"
    networks:
      bootstrap-net:
        ipv4_address: 172.30.20.3
    deploy:
      resources:
        limits:
          memory: 4g
networks:
  bootstrap-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.20.0/24
```

</details>

Two containers on a bridge network (`172.30.20.0/24`), fixed IPs, `--libp2p-port 30000`, cross-referencing `--bootstrap-peers`. Both nodes found each other, established a connection, and then ran the election protocol.

### Automated Testing

4 Rust integration tests in `rust/networking/tests/bootstrap_peers.rs` (`cargo test -p networking`):

| Test | What it verifies | Result |
|------|------------------|--------|
| `two_nodes_connect_via_bootstrap_peers` | Node B discovers Node A via bootstrap addr (real TCP connection) | PASS |
| `create_swarm_with_empty_bootstrap_peers` | Backward compatibility: works with no bootstrap peers | PASS |
| `create_swarm_ignores_invalid_bootstrap_addrs` | Invalid multiaddrs silently filtered | PASS |
| `create_swarm_with_fixed_port` | `listen_port` parameter works | PASS |

All 4 pass. The connection test takes ~6s.

---------

Signed-off-by: DeepZima <deepzima@outlook.com>
Co-authored-by: Evan <evanev7@gmail.com>	2026-03-25 10:55:12 +00:00
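The flag handling described in this commit can be sketched in Python as follows. This is a minimal illustration, not the code in `src/exo/main.py`: the helper names `parse_bootstrap_peers`, `is_valid_ip4_tcp`, and `build_parser` are hypothetical, and the regex check is a deliberate simplification that accepts only numeric `/ip4/<addr>/tcp/<port>` addresses. Per the commit message, the real validation happens on the Rust side, where invalid multiaddrs are silently filtered.

```python
import argparse
import os
import re

# Hypothetical, simplified check: numeric "/ip4/<a.b.c.d>/tcp/<port>" only.
# The real multiaddr parsing is done by libp2p on the Rust side.
_IP4_TCP = re.compile(r"^/ip4/(\d{1,3}(?:\.\d{1,3}){3})/tcp/(\d{1,5})$")


def is_valid_ip4_tcp(addr: str) -> bool:
    """Return True if addr looks like /ip4/<dotted quad>/tcp/<port>."""
    match = _IP4_TCP.match(addr)
    if match is None:
        return False
    octets_ok = all(0 <= int(o) <= 255 for o in match.group(1).split("."))
    port_ok = 0 < int(match.group(2)) <= 65535
    return octets_ok and port_ok


def parse_bootstrap_peers(raw: str) -> list[str]:
    """Split a comma-separated list and silently drop blank or invalid entries."""
    candidates = (part.strip() for part in raw.split(","))
    return [addr for addr in candidates if is_valid_ip4_tcp(addr)]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two flags named in the commit message."""
    parser = argparse.ArgumentParser()
    # The flag falls back to the environment variable when omitted.
    parser.add_argument(
        "--bootstrap-peers",
        default=os.environ.get("EXO_BOOTSTRAP_PEERS", ""),
    )
    # None stands for "let the OS assign a port" in this sketch.
    parser.add_argument("--libp2p-port", type=int, default=None)
    return parser
```

With this sketch, `parse_bootstrap_peers(args.bootstrap_peers)` yields the list of addresses that would be handed to the networking layer; an empty list corresponds to the mDNS-only behavior.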
Evan Quiney	0096159728	up gossipsub limit (#1671) bump gossipsub limit to 8MB + add warning. this is a band-aid solution for large messages while we figure out the point-to-point protocol	2026-03-06 14:20:51 +00:00
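A size guard of the kind this commit mentions might look like the sketch below. The commit message states only the 8MB limit and that a warning was added; the function name `fits_gossipsub_limit`, the half-limit early-warning threshold, and the reading of "MB" as 1024 * 1024 bytes are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("gossipsub_size_sketch")

# 8 MB as stated in the commit message; treating "MB" as 1024 * 1024 bytes
# is an assumption.
MAX_TRANSMIT_SIZE = 8 * 1024 * 1024
# Hypothetical early-warning threshold, not taken from the commit.
WARN_SIZE = MAX_TRANSMIT_SIZE // 2


def fits_gossipsub_limit(payload: bytes) -> bool:
    """Return False for payloads over the limit, warning on large ones."""
    size = len(payload)
    if size > MAX_TRANSMIT_SIZE:
        logger.warning(
            "message of %d bytes exceeds the %d byte limit", size, MAX_TRANSMIT_SIZE
        )
        return False
    if size > WARN_SIZE:
        logger.warning(
            "message of %d bytes is approaching the %d byte limit",
            size,
            MAX_TRANSMIT_SIZE,
        )
    return True
```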
Evan Quiney	db73c4fd5d	move messaging into rust (#1549) the main body of the rust refactor. fixes the tokio panic on shutdown. simplifies the networking module significantly. doesn't touch libp2p behaviour	2026-02-26 13:58:22 +00:00
Evan Quiney	cacb456cb2	remove nightly (#1538) we have no good need for rust nightly (nor futures, for that matter)	2026-02-19 12:55:31 +00:00
Evan Quiney	8f01523ddb	remove dead code (#1496)	2026-02-18 11:43:27 +00:00
Jake Hillion	1f242e8eee	gossipsub: stop silent message dropping and warn (#1434) The 15-second publish_queue_duration caused messages in peer queues to be silently dropped. When events are dropped, workers detect gaps in the event index sequence and request missing events via the NACK path (RequestEventLog), but this recovery is inefficient. Removed the timeout configuration - gossipsub now uses its default behavior without time-based eviction. If queue buildup is a concern, queue size should be limited explicitly rather than dropping by timeout. Split error handling to log AllQueuesFullError as a warning (indicates peers are unresponsive) while keeping NoPeersSubscribedToTopicError silent (expected during startup and network partitions). Test plan: - CI	2026-02-10 18:47:47 +00:00
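The split error handling described in this commit can be sketched as follows. The two exception classes are local stand-ins that borrow the names from the commit message, and `publish_with_split_handling` and its `publish` callable are hypothetical; the actual call site and binding types in exo may differ.

```python
import logging

logger = logging.getLogger("publish_sketch")


class AllQueuesFullError(Exception):
    """Local stand-in for the error named in the commit message."""


class NoPeersSubscribedToTopicError(Exception):
    """Local stand-in for the error named in the commit message."""


def publish_with_split_handling(publish, topic: str, data: bytes) -> bool:
    """Call publish(topic, data) and return True only if it succeeded."""
    try:
        publish(topic, data)
    except AllQueuesFullError:
        # Full queues suggest unresponsive peers, so this is surfaced.
        logger.warning(
            "all peer queues full on topic %s; peers may be unresponsive", topic
        )
        return False
    except NoPeersSubscribedToTopicError:
        # Expected during startup and network partitions, so stay silent.
        return False
    return True
```

Keeping the second branch silent avoids log noise while a node is still waiting for its first peers, and any other exception still propagates to the caller.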
Jake Hillion	5629983809	fmt: format all python/rust/nix files	2025-12-05 16:58:55 +00:00
Alex Cheema	19e90572e6	set max_transmit_size on gossipsub to 1MB. Fixes large message error	2025-11-06 19:18:48 +00:00
rltakashige	16f724e24c	Update staging 14 Co-authored-by: Evan <evanev7@gmail.com> Co-authored-by: Alex Cheema <alexcheema123@gmail.com> Co-authored-by: David Munha Canas Correia <dmunha@MacBook-David.local> Co-authored-by: github-actions bot <github-actions@users.noreply.github.com>	2025-11-05 01:44:24 +00:00
Evan Quiney	3b409647ba	Squash merge merging_clusters into tensor_parallel94	2025-10-31 17:41:57 +00:00
Evan Quiney	38ff949bf4	big refactor Fix. Everything. Co-authored-by: Andrei Cravtov <the.andrei.cravtov@gmail.com> Co-authored-by: Matt Beton <matthew.beton@gmail.com> Co-authored-by: Alex Cheema <alexcheema123@gmail.com> Co-authored-by: Seth Howes <sethshowes@gmail.com>	2025-09-30 11:03:04 +01:00

11 Commits